etcd v2: etcd-server is running fine, but etcd-events does not join ("cluster ID mismatch" and "unmatched member while checking PeerURLs")



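I have a legacy Kubernetes cluster running etcd v2 with 3 masters (etcd-a, etcd-b, etcd-c). We attempted an upgrade to etcd v3, but this broke the first master (etcd-a), which could no longer join the cluster. After some time I was able to restore it: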

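  1. Removed etcd-a from the etcd cluster with etcdctl member rm
  2. Created a new etcd-a1 with a clean state and added it to the cluster with etcdctl member add
  3. Started kubelet with ETCD_INITIAL_CLUSTER_STATE set to existing, then started protokube. At this point the master was able to join the cluster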

At first I thought the cluster was healthy:

/ # etcdctl member list
a4***b2: name=etcd-c peerURLs=http://etcd-c.internal.mydomain.com:2380 clientURLs=http://etcd-c.internal.mydomain.com:4001
cf***97: name=etcd-a1 peerURLs=http://etcd-a1.internal.mydomain.com:2380 clientURLs=http://etcd-a1.internal.mydomain.com:4001
d3***59: name=etcd-b peerURLs=http://etcd-b.internal.mydomain.com:2380 clientURLs=http://etcd-b.internal.mydomain.com:4001
/ # etcdctl cluster-health
member a4***b2 is healthy: got healthy result from http://etcd-c.internal.mydomain.com:4001
member cf***97 is healthy: got healthy result from http://etcd-a1.internal.mydomain.com:4001
member d3***59 is healthy: got healthy result from http://etcd-b.internal.mydomain.com:4001
cluster is healthy

However, etcd-events is not in good shape. The etcd-events instance for a1 is not running:

etcd-server-events-ip-a1       0/1     CrashLoopBackOff   430
etcd-server-events-ip-b        1/1     Running            3
etcd-server-events-ip-c        1/1     Running            0

Logs from etcd-events-a1:

flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=http://etcd-events-a1.internal.mydomain.com:4002
flags: recognized and used environment variable ETCD_DATA_DIR=/var/etcd/data-events
flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd-events-a1.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-events-a1=http://etcd-events-a1.internal.mydomain.com:2381,etcd-events-b=http://etcd-events-b.internal.mydomain.com:2381,etcd-events-c=http://etcd-events-c.internal.mydomain.com:2381
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token-etcd-events
flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:4002
flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2381
flags: recognized and used environment variable ETCD_NAME=etcd-events-a1
etcdmain: etcd Version: 2.2.1
etcdmain: Git SHA: 75f8282
etcdmain: Go Version: go1.5.1
etcdmain: Go OS/Arch: linux/amd64
etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
etcdmain: the server is already initialized as member before, starting as etcd member...
etcdmain: listening for peers on http://0.0.0.0:2381
etcdmain: listening for client requests on http://0.0.0.0:4002
netutil: resolving etcd-events-b.internal.mydomain.com:2381 to 10.15.***:2381
netutil: resolving etcd-events-a1.internal.mydomain.com:2381 to 10.15.***:2381
etcdmain: stopping listening for client requests on http://0.0.0.0:4002
etcdmain: stopping listening for peers on http://0.0.0.0:2381
etcdmain: error validating peerURLs {ClusterID:5a***b3 Members:[&{ID:a7***32 RaftAttributes:{PeerURLs:[http://etcd-events-b.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-b ClientURLs:[http://etcd-events-b.internal.mydomain.com:4002]}} &{ID:cc***b3 RaftAttributes:{PeerURLs:[https://etcd-events-a.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-a ClientURLs:[https://etcd-events-a.internal.mydomain.com:4002]}} &{ID:7f***2ca RaftAttributes:{PeerURLs:[http://etcd-events-c.internal.mydomain.com:2381]} Attributes:{Name:etcd-events-c ClientURLs:[http://etcd-events-c.internal.mydomain.com:4002]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs
# restart
...
etcdserver: restarting member eb***3a in cluster 96***07 at commit index 3
raft: eb***a3a became follower at term 12407
raft: newRaft eb***3a [peers: [], term: 12407, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
etcdserver: starting server... [version: 2.2.1, cluster version: to_be_decided]
etcdserver: added local member eb***3a [http://etcd-events-a1.internal.mydomain.com:2381] to cluster 96***07
etcdserver: added member 7f***ca [http://etcd-events-c.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***3, local=96***07)
rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)
rafthttp: failed to dial 7f***ca on stream MsgApp v2 (cluster ID mismatch)
etcdserver: added member a7***32 [http://etcd-events-b.internal.mydomain.com:2381] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
rafthttp: failed to dial a7***32 on stream MsgApp v2 (cluster ID mismatch)
...
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
osutil: received terminated signal, shutting down...
etcdserver: aborting publish because server is stopped

Logs from etcd-events-b:

rafthttp: streaming request ignored (cluster ID mismatch got 96***07 want 5a***b3)
rafthttp: the connection to peer cc***b3 is unhealthy

Logs from etcd-events-c:

etcdserver: failed to reach the peerURL(https://etcd-events-a.internal.mydomain.com:2381) of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)
etcdserver: cannot get the version of member cc***b3 (Get https://etcd-events-a.internal.mydomain.com:2381/version: dial tcp 10.15.131.7:2381: i/o timeout)

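From the logs I see two problems:

  • etcd-events on a1 seems to ignore the existing cluster (hence the cluster ID mismatch)
  • The other nodes (b and c) still somehow remember the old node a, which was removed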
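
The second observation can be checked mechanically. Below is a minimal Python sketch that compares the peer URLs in ETCD_INITIAL_CLUSTER with the membership printed in the "error validating peerURLs" line. The values are copied from the a1 log above; the function names are hypothetical helpers, and the set comparison only approximates the check etcd performs (it is not etcd's actual code).

```python
# Value of ETCD_INITIAL_CLUSTER as logged by etcd-events-a1.
INITIAL_CLUSTER = (
    "etcd-events-a1=http://etcd-events-a1.internal.mydomain.com:2381,"
    "etcd-events-b=http://etcd-events-b.internal.mydomain.com:2381,"
    "etcd-events-c=http://etcd-events-c.internal.mydomain.com:2381"
)

# Peer URLs of the existing members, copied from the
# "error validating peerURLs" log line.
EXISTING_PEER_URLS = [
    "http://etcd-events-b.internal.mydomain.com:2381",
    "https://etcd-events-a.internal.mydomain.com:2381",
    "http://etcd-events-c.internal.mydomain.com:2381",
]


def parse_initial_cluster(value):
    """Split 'name=url,name=url' into a dict of member name -> peer URL."""
    members = {}
    for entry in value.split(","):
        name, _, url = entry.partition("=")
        members[name.strip()] = url.strip()
    return members


def diff_peer_urls(configured, existing):
    """Return (only_in_config, only_in_cluster) as sorted lists of URLs."""
    configured_set = set(configured)
    existing_set = set(existing)
    return (
        sorted(configured_set - existing_set),
        sorted(existing_set - configured_set),
    )


if __name__ == "__main__":
    configured = parse_initial_cluster(INITIAL_CLUSTER)
    only_cfg, only_cluster = diff_peer_urls(configured.values(), EXISTING_PEER_URLS)
    print("only in ETCD_INITIAL_CLUSTER:", only_cfg)
    print("only in existing cluster:    ", only_cluster)
```

With the logged values, the a1 URL appears only in the configuration and the old https://etcd-events-a URL appears only in the existing membership. That matches the "unmatched member" error and suggests the events cluster still lists the old member under a different name and scheme.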

I'm out of ideas on how to fix this. Any suggestions?

Thanks!

If you try to upgrade from etcd2 without restarting all masters at the same time, the upgrade will fail.

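Be sure to read through https://kops.sigs.k8s.io/etcd3-migration/. I also strongly recommend using the latest version of kOps, since quite a few migration bugs have been fixed along the way.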

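There can be many reasons for the cluster ID changing, but if I remember correctly, replacing members like this was never really supported, and with etcd2 your options are limited. Trying to get to etcd-manager and etcd v3 is probably the best way to get the cluster back into a working state.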
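
Before picking a recovery path, it can help to confirm which cluster ID each side reports. The following is a minimal Python sketch, assuming the etcd 2.2.1 rafthttp log format shown in the question; the regular expression and the cluster_ids helper are illustrative and not part of any etcd tooling.

```python
import re

# Matches lines such as:
#   rafthttp: request sent was ignored
#   (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
MISMATCH = re.compile(
    r"cluster ID mismatch: remote\[(?P<peer>[^\]]+)\]=(?P<remote>[^,\s]+), "
    r"local=(?P<local>[^)\s]+)"
)


def cluster_ids(lines):
    """Collect the distinct local and remote cluster IDs from mismatch lines."""
    local_ids, remote_ids = set(), set()
    for line in lines:
        match = MISMATCH.search(line)
        if match:
            local_ids.add(match.group("local"))
            remote_ids.add(match.group("remote"))
    return local_ids, remote_ids


# Sample lines copied from the etcd-events-a1 log in the question.
SAMPLE = [
    "rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)",
    "rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)",
    "rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)",
]

if __name__ == "__main__":
    local_ids, remote_ids = cluster_ids(SAMPLE)
    print("local cluster IDs: ", sorted(local_ids))
    print("remote cluster IDs:", sorted(remote_ids))
```

On the sample lines this yields one local ID and one remote ID. That is consistent with a1 having started in a cluster of its own (96***07) while b and c remain in the original one (5a***b3).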
