Akka remoting fails on Amazon EC2



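I am building a library in Scala with Akka actors to do some large-scale data processing.

I run my code on Amazon EC2 instances using StarCluster. The program is unstable because actor remoting sometimes drops out:

While the code is running, the nodes disconnect one by one within a few minutes. The nodes log the following: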

[ERROR] [07/16/2014 17:40:06.837] [slave-akka.actor.default-dispatcher-4] [akka://slave/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fslave%40master%3A2552-0/endpointWriter] AssociationError [akka.tcp://slave@node005:2552] -> [akka.tcp://slave@master:2552]: Error [Association failed with [akka.tcp://slave@master:2552]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://slave@master:2552]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: master

[WARN] [07/16/2014 17:30:05.548] [slave-akka.actor.default-dispatcher-12] [Remoting] Tried to associate with unreachable remote address [akka.tcp://slave@master:2552]. Address is now quarantined, all messages to this address will be delivered to dead letters.

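This happens even though I can ping between the nodes.

I have been struggling with this for a while; I assume it is a configuration setting. The Akka remoting documentation even says:

However in cloud environments, such as Amazon EC2, the value could be increased to 12 in order to account for network issues that sometimes occur on such platforms.

However, I have set the threshold to this value and well beyond it, and still have had no luck fixing the problem. Here is my current remoting configuration: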

akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 2552
      # for modelling
      #send-buffer-size = 50000000b
      #receive-buffer-size = 50000000b
      #maximum-frame-size = 25000000b
      send-buffer-size = 5000000b
      receive-buffer-size = 5000000b
      maximum-frame-size = 2500000b
    }
    watch-failure-detector.threshold = 100
    acceptable-heartbeat-pause = 20s
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
  log-dead-letters = off
}

I deploy my actors from the master node like this:

val o2m = system.actorOf(Props(classOf[IntOneToMany], p), name = "o2m")
val remote = Deploy(scope = RemoteScope(Address("akka.tcp", "slave", args(i), 2552)))
val b = system.actorOf(Props(classOf[IntBoss], o2m).withDeploy(remote), name = "boss_" + i)
etc.

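Can anyone point out the mistake I am making, or tell me how to fix this and stop the nodes from disconnecting? Alternatively, a solution that simply restarts the actors when they disconnect would also work; I do not care much about lost messages. In fact, I would expect this to be easily configurable behaviour, but I am finding it hard to know where to look.

Thanks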

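At the very least, the property syntax is wrong: acceptable-heartbeat-pause should be nested under watch-failure-detector (in your configuration it sits at the same level as watch-failure-detector, directly under remote). The two failure detector sections should look like this: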

watch-failure-detector {
  threshold = 100
  acceptable-heartbeat-pause = 20 s
}
transport-failure-detector {
  heartbeat-interval = 4 s
  acceptable-heartbeat-pause = 20 s
}
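
For context, here is a sketch of how these two sections might sit inside the remote block from the question. It only merges the question's own values with the nesting shown above. The values are the asker's, not recommendations, and the buffer-size settings are left out for brevity.

```
akka {
  actor {
    provider = "akka.remote.RemoteActorRefProvider"
  }
  remote {
    enabled-transports = ["akka.remote.netty.tcp"]
    netty.tcp {
      port = 2552
    }
    # Failure detector used for remote death watch
    watch-failure-detector {
      threshold = 100
      acceptable-heartbeat-pause = 20 s
    }
    # Failure detector used to monitor the transport connection
    transport-failure-detector {
      heartbeat-interval = 4 s
      acceptable-heartbeat-pause = 20 s
    }
  }
}
```

Because the config format treats dotted keys and nested objects as equivalent, writing watch-failure-detector.acceptable-heartbeat-pause = 20 s directly under remote should have the same effect as the nested form.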
