Scaling an Axon application: command handling fails under load test



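I have created an Axon application consisting of two Spring Boot services, hotel-booking-command and hotel-booking-query, for the command side and the query side respectively. The services are partly and loosely based on the sample applications provided by AxonIQ. I use Axon Server as the event store and message router. The services sit behind Spring Cloud Gateway, and I use Consul as the discovery service. As long as I run only one instance of the command-side application, everything seems to work fine. When I run 2 or more instances and the load gets higher, the connection to Axon Server is lost on all instances: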

2022-06-26 17:00:37.675  INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor    : [AddRoomCommand] executed successfully with a [Integer] return value
2022-06-26 17:00:37.675  INFO 86356 --- [ctor-http-nio-4] o.a.m.interceptors.LoggingInterceptor    : Dispatched messages: [RoomAddedEvent]
2022-06-26 17:01:10.258  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Unable to recover current connection to AxonServer. Attempting to reconnect...
2022-06-26 17:01:10.264  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from localhost:8124
2022-06-26 17:01:15.272  WARN 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [localhost:8124] failed.
io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 4.997735389s. [closed=[], open=[[buffered_nanos=4998488350, waiting_for_connection]]]
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156) ~[grpc-stub-1.43.0.jar:1.43.0]
at io.axoniq.axonserver.grpc.control.PlatformServiceGrpc$PlatformServiceBlockingStub.getPlatformServer(PlatformServiceGrpc.java:250) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.connectChannel(AxonServerManagedChannel.java:115) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.createConnection(AxonServerManagedChannel.java:335) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.ensureConnected(AxonServerManagedChannel.java:308) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at io.axoniq.axonserver.connector.impl.AxonServerManagedChannel.lambda$scheduleConnectionCheck$4(AxonServerManagedChannel.java:378) ~[axonserver-connector-java-4.5.4.jar:4.5.4]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[na:na]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[na:na]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[na:na]
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[na:na]
at java.base/java.lang.Thread.run(Thread.java:833) ~[na:na]
2022-06-26 17:01:15.272  INFO 86356 --- [SQ-P024.local-0] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms
2022-06-26 17:01:15.272  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Connection to AxonServer lost. Attempting to reconnect...
2022-06-26 17:01:15.273  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Requesting connection details from localhost:8124
2022-06-26 17:01:20.275  WARN 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Connecting to AxonServer node [localhost:8124] failed: DEADLINE_EXCEEDED: deadline exceeded after 4.988964197s. [closed=[], open=[[buffered_nanos=4989266439, waiting_for_connection]]]
2022-06-26 17:01:20.275  INFO 86356 --- [SQ-P024.local-1] i.a.a.c.impl.AxonServerManagedChannel    : Failed to get connection to AxonServer. Scheduling a reconnect in 2000ms

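Gatling's log quickly starts to look like this (250 requests per second are executed; the Gatling simulation code is in the gatling module of the repository linked below):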

for HTTP POST "/api/hotel-booking/command/rooms"
io.netty.channel.AbstractChannel$AnnotatedConnectException: Operation timed out: /192.168.0.12:8082
Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException: 
Error has been observed at the following site(s):
*__checkpoint ⇢ org.springframework.cloud.gateway.filter.WeightCalculatorWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ org.springframework.boot.actuate.metrics.web.reactive.server.MetricsWebFilter [DefaultWebFilterChain]
*__checkpoint ⇢ HTTP POST "/api/hotel-booking/command/rooms" [ExceptionHandlingWebHandler]
Original Stack Trace:
Caused by: java.net.ConnectException: Operation timed out
at java.base/sun.nio.ch.Net.pollConnect(Native Method) ~[na:na]
at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672) ~[na:na]
at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946) ~[na:na]

Typically, only the first 500 or 1000 requests are handled correctly.

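The current version of the application (including the load tests, located in the gatling module) is available here: https://github.com/a-glapinski/event-sourcing-and-cqrs-jvm/tree/api-testing. I am happy to provide more details about the application if needed; at the moment I don't know what is going wrong or where to look.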

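Is something missing from the configuration of my Axon command-side application? Should I change some configuration in Axon Server? Or is the concept of the whole system I created wrong in the context of Axon Framework, so that my application simply cannot scale?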

Thanks for sharing your project. It's a great initiative!

At a strategic level, I have a couple of concerns about this design:

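    1. implementation("org.axonframework:axon-spring-boot-starter") and implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter") are used together to distribute commands. There is no need for this. Axon Server already acts as the service registry and discovery mechanism for commands, and it routes each command to the appropriate command handler (much better than Consul does).
    2. If you really need Consul/Eureka for service discovery, my advice is to limit Consul to discovering only the web components/controllers, not the Axon command handlers.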

This means:

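  • As a first step, remove springcloud:axon-springcloud-spring-boot-starter. You don't need two mechanisms (Consul and Axon Server) for service discovery and message/command routing.
  • Possibly extract the web components (controllers) so that they can be load-balanced and discovered separately from the deeper decision-making components such as the aggregates (the command handling components). Note that this falls back to the regular Spring Boot configuration for Consul (springcloud:axon-springcloud-spring-boot-starter is not needed at this level).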
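
As a rough illustration of the first step, the dependency block of the command-side module might end up looking something like the sketch below. It uses the Gradle Kotlin DSL, as in the coordinates quoted above. The Consul discovery starter is an assumption about what the project uses to register its web endpoints, and versions are assumed to be managed elsewhere (for example by a BOM), since the quoted coordinates carry none.

```kotlin
// build.gradle.kts of the command-side service (sketch, not the project's actual file)
dependencies {
    // Axon Framework with Axon Server as event store and command/event/query router.
    implementation("org.axonframework:axon-spring-boot-starter")

    // Removed: Axon Server already routes commands, so the Spring Cloud
    // extension for distributing commands is redundant.
    // implementation("org.axonframework.extensions.springcloud:axon-springcloud-spring-boot-starter")

    // Assumption: plain Spring Cloud Consul discovery, kept only so that the
    // gateway can find the web layer (controllers), not the command handlers.
    implementation("org.springframework.cloud:spring-cloud-starter-consul-discovery")
}
```

With this setup Consul is only involved in HTTP routing from the gateway to the controllers, while command routing between instances is left entirely to Axon Server.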

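It would be good to first see the results of the same test without the Gateway and Consul. In that case, you should be able to identify the bottleneck more easily.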
