Nodes won't reconnect to the seed

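We have three nodes in different AWS data centers. One of them is the sole seed node and the sole owner of a singleton, which we set up by using .withDataCenter on the singleton proxy settings. We can get the cluster working as designed by starting the seed node first and then the other nodes, but if any node goes down, the only way to get them talking again seems to be restarting the whole cluster in the same order. We would like the nodes to try reconnecting to the seed node and resume normal operation when they can.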

When I shut down a non-seed node, the seed node marks it as unreachable and starts periodically logging the following:

Association with remote system [akka.tcp://application@xxx.xx.x.xxx:xxxx] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://application@xxx.xx.x.xxx:xxxx]] Caused by: [connection timed out: /xxx.xx.x.xxx:xxxx]

Fair enough. But when I bring the node back up, the newly started node begins repeating:

2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.ClusterCoreDaemon in application-akka.actor.default-dispatcher-18 - now supervising Actor[akka://application/system/cluster/core/daemon/joinSeedNodeProcess-16#-1572745962]

2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-3 - started (akka.cluster.JoinSeedNodeProcess@2ae57537)

2018-01-29 22:59:09,755 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-2 - stopped

The seed node logs:

2018-01-29 22:56:25,442 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-4 - Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

2018-01-29 22:56:25,443 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Marking unreachable node [akka.tcp://application@172.xx.x.xxx:xxxx] as [Down]

And then, repeatedly:

2018-01-29 22:57:41,659 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Sending InitJoinAck message from node [akka.tcp://application@52.xx.xxx.xx:xxxx] to [Actor[akka.tcp://application@172.xx.x.xxx:xxxx/system/cluster/core/daemon/joinSeedNodeProcess-8#-1322646338]]

2018-01-29 22:57:41,827 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 - Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

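It seems odd to me that what the log says "will" happen never does: the existing member is not removed and the new member is not allowed to join. I have been googling this message but can't find an explanation of what I might need to do to make it actually happen.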

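Assuming you're on Akka.NET, it looks like you may be running into an open issue in which the leader keeps trying to remove the old incarnation so that the new incarnation can join. The issue ticket contains a few troubleshooting suggestions about relaxing heartbeat-interval, which may offer some insight into the probable cause.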

Given that latency between geographically dispersed data centers is usually high, one area I would look at closely is failure detection.
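As a rough sketch of what relaxing failure detection could look like, the fragment below loosens both the regular and the cross-data-center failure detectors. The setting names are the ones I know from the JVM Akka Cluster reference configuration; the Akka.NET equivalents may differ. The values are illustrative assumptions, not recommendations from the issue ticket, and would need tuning against the latency you actually observe between your data centers.

```hocon
akka.cluster {
  # Failure detector used between nodes in the same data center.
  failure-detector {
    heartbeat-interval = 2 s           # assumed value: send heartbeats less often
    acceptable-heartbeat-pause = 10 s  # assumed value: tolerate longer gaps between heartbeats
  }

  # Failure detector used between nodes in different data centers.
  multi-data-center.failure-detector {
    heartbeat-interval = 5 s           # assumed value
    acceptable-heartbeat-pause = 20 s  # assumed value
  }
}
```

Larger values make the cluster slower to notice a node that really has failed, so this trades detection speed for fewer false positives on high-latency links.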

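This is probably unrelated to the reported problem, but judging from the logs shown, there appears to be a clock difference between the two nodes in different data centers.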
