Leader election fails and the lease is not renewed automatically



I have a production cluster currently running K8s 1.19.9 in which kube-scheduler and kube-controller-manager cannot complete leader election. A leader acquires the initial lease, but it then fails to renew/re-acquire it. This leaves the other pods stuck in a constant election loop: none of them holds the lease long enough to do any meaningful work, they time out, another pod acquires a new lease, and the cycle repeats across the nodes. Here are the logs:

E1201 22:15:54.818902       1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079       1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176       1 controllermanager.go:293] leaderelection lost

Detailed Docker logs:

Flag --port has been deprecated, see --secure-port instead.
I1201 22:14:10.374271       1 serving.go:331] Generated self-signed cert in-memory
I1201 22:14:10.735495       1 controllermanager.go:175] Version: v1.19.9+vmware.1
I1201 22:14:10.736289       1 dynamic_cafile_content.go:167] Starting request-header::/etc/kubernetes/pki/front-proxy-ca.crt
I1201 22:14:10.736302       1 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
I1201 22:14:10.736684       1 secure_serving.go:197] Serving securely on 0.0.0.0:10257
I1201 22:14:10.736747       1 leaderelection.go:243] attempting to acquire leader lease  kube-system/kube-controller-manager...
I1201 22:14:10.736868       1 tlsconfig.go:240] Starting DynamicServingCertificateController
E1201 22:14:20.737137       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:32.803658       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:14:44.842075       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: Get "https://[IP address]:[Port]/api/v1/namespaces/kube-system/endpoints/kube-controller-manager?timeout=10s": context deadline exceeded
E1201 22:15:13.386932       1 leaderelection.go:325] error retrieving resource lock kube-system/kube-controller-manager: context deadline exceeded
I1201 22:15:44.818571       1 leaderelection.go:253] successfully acquired lease kube-system/kube-controller-manager
I1201 22:15:44.818755       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Endpoints" apiVersion="v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
I1201 22:15:44.818790       1 event.go:291] "Event occurred" object="kube-system/kube-controller-manager" kind="Lease" apiVersion="coordination.k8s.io/v1" type="Normal" reason="LeaderElection" message="master001_1d360610-1111-xxxx-aaaa-9999 became leader"
E1201 22:15:54.818902       1 request.go:1001] Unexpected error when reading response body: context deadline exceeded
E1201 22:15:54.819079       1 leaderelection.go:361] Failed to update lock: resource name may not be empty
I1201 22:15:54.819137       1 leaderelection.go:278] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
F1201 22:15:54.819176       1 controllermanager.go:293] leaderelection lost
goroutine 1 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc00000e001, 0xc000fb20d0, 0x4c, 0xc6)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:996 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x6a57fa0, 0xc000000003, 0x0, 0x0, 0xc000472070, 0x68d5705, 0x14, 0x125, 0x0)
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:945 +0x191

My duct-tape recovery method is to shut down the other candidates and disable leader election with --leader-elect=false. We manually designate a single leader, let it run for a while, and then turn leader election back on. After that everything seems to work as expected again and the lease is renewed normally.

Could it be that the api-server is unable to serve any resources, so the requests time out and the election fails (?)? I'd like to know whether anyone else has run into a problem like this.

@janeosaka This issue occurs when you have 1) a resource crunch or 2) a network issue.

It looks like the leader-election API calls are timing out because the kube-apiserver is resource constrained, which increases the latency of API calls.
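One way to confirm that is to measure apiserver latency for the same kind of request leader election makes. Below is a minimal Go sketch (an illustration, not part of the original report) that uses client-go to time a few reads of the kube-system/kube-controller-manager Lease, assuming a kubeconfig at the default location; if these reads regularly approach or exceed the 10s renew deadline visible in the logs above, renewals cannot succeed:

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Time a few reads of the lock object that leader election renews.
	// If these regularly take close to (or more than) 10s, renewals cannot
	// complete within the renew deadline and the election keeps failing.
	for i := 1; i <= 5; i++ {
		start := time.Now()
		_, err := client.CoordinationV1().Leases("kube-system").Get(
			context.TODO(), "kube-controller-manager", metav1.GetOptions{})
		fmt.Printf("attempt %d: %v (err=%v)\n", i, time.Since(start), err)
		time.Sleep(2 * time.Second)
	}
}

(The logs above show events against both an Endpoints and a Lease object, so either one works for this check.)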

1) Resource crunch: (increase node CPU and memory)

This appears to be the expected behavior. When leader election fails, the controller cannot renew its lease and, by design, the controller restarts to ensure that only one controller is active at a time.

LeaseDuration and RenewDeadline (RenewDeadline is how long the acting master keeps retrying to refresh its leadership before giving up) can be configured when running the controller.
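On kube-controller-manager itself these correspond to the --leader-elect-lease-duration, --leader-elect-renew-deadline, and --leader-elect-retry-period flags. For illustration only, here is a minimal client-go sketch showing how the same knobs fit together (the lock name "my-controller" is hypothetical; the durations are the kube-controller-manager defaults of 15s/10s/2s):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
	"k8s.io/klog/v2"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	id, _ := os.Hostname()

	// Lease lock in kube-system, analogous to the kube-controller-manager lock in the logs.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-controller", Namespace: "kube-system"}, // hypothetical lock name
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a lease is valid before other candidates may take over
		RenewDeadline: 10 * time.Second, // how long the acting leader keeps retrying renewal before giving up
		RetryPeriod:   2 * time.Second,  // wait between individual acquire/renew attempts
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				klog.Info("became leader; controller loops would run here")
			},
			OnStoppedLeading: func() {
				// Mirrors the "leaderelection lost" fatal exit seen in the logs.
				klog.Fatal("leaderelection lost")
			},
		},
	})
}

Raising RenewDeadline (while keeping it below LeaseDuration) gives the leader more time to get a renewal through a slow apiserver before it gives up and exits.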

Another approach you could consider is leveraging API Priority & Fairness, which can increase the chances of your controller's API calls succeeding if your controller is not the source of the API overload.

2) Network issue: (the leader-election failure is a symptom of the host's network problem, not its cause)

Check whether the issue resolves after restarting the SDN pod.

"sdn-controller""sdn"是非常不同的东西。如果重新启动sdnpod可以解决问题,那么您注意到的sdn-controller错误并不是实际问题。
