Troubleshooting a NotReady node



I have a node that is currently giving me some trouble. I haven't found a solution yet; it may be a skill-level issue on my part, Google may simply have come up empty, or I may have hit something that can't be fixed. The latter is highly unlikely.

kubectl version v1.8.5
docker version 1.12.6

While doing some routine maintenance on my nodes, I noticed the following:

NAME                            STATUS   ROLES     AGE       VERSION
ip-192-168-4-14.ourdomain.pro   Ready    master    213d      v1.8.5
ip-192-168-4-143.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-174.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-182.ourdomain.pro  Ready    <none>    46d       v1.8.5
ip-192-168-4-221.ourdomain.pro  Ready    <none>    213d      v1.8.5
ip-192-168-4-249.ourdomain.pro  Ready    master    213d      v1.8.5
ip-192-168-4-251.ourdomain.pro  NotReady <none>    206d      v1.8.5

On the NotReady node I am unable to attach or exec into anything, which seems normal given its status unless I'm misreading it. For the same reason I can't pull any specific logs from that node either.
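
For what it's worth, the node's reported conditions can still be inspected from one of the masters; a minimal sketch (the node name is the NotReady one from the listing above):

    kubectl describe node ip-192-168-4-251.ourdomain.pro
    # raw view of the same conditions
    kubectl get node ip-192-168-4-251.ourdomain.pro -o yaml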

At this point I restarted the kubelet and tailed its logs at the same time to see if anything interesting would show up.
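
For reference, this is roughly what that amounts to, assuming the kubelet is managed by systemd as in the status output further down:

    # on the NotReady node
    sudo systemctl restart kubelet
    # follow the unit log while it comes back up
    sudo journalctl -u kubelet -f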

Below I've attached everything a day of googling turned up, but I can't confirm that any of it is actually related to the problem.

Error 1

unable to connect to Rkt api service

We don't use this, so I put it on the ignore list.

Error 2

unable to connect to CRI-O api service

We don't use this either, so it also went on the ignore list.

Error 3

Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /

I can't rule this one out as a potential culprit, but what I've found so far doesn't seem to apply to the version I'm running.
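
In case it matters, here is roughly how one can sanity-check the image filesystem the error complains about; treating /var/lib/docker as the Docker root directory is an assumption:

    # storage driver and root directory as Docker reports them
    sudo docker info | grep -iE 'storage driver|root dir|space'
    # free space on the filesystem backing images and containers
    df -h /var/lib/docker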

Error 4

skipping pod synchronization - [container runtime is down PLEG is not healthy

I don't have an answer for this one, other than the fact that the garbage collection error above shows up a second time right after this message.
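
Since PLEG health essentially reflects whether the kubelet's relist calls against the container runtime return in time, one quick check is whether Docker itself is responsive on the node; a sketch:

    sudo systemctl status docker
    # these tend to hang or error out if the runtime is wedged
    sudo docker ps
    time sudo docker info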

Error 5

Registration of the rkt container factory failed

Not using it, so it shouldn't matter, unless I'm mistaken.

Error 6

Registration of the crio container factory failed

Not using it either, so again it shouldn't matter, unless I'm mistaken.

Error 7

28087 docker_sandbox.go:343] failed to read pod IP from plugin/docker: NetworkPlugin cni failed on the status hook for pod "kube-dns-545bc4bfd4-rt7qp_kube-system": CNI failed to retrieve network namespace path: Cannot find network namespace for the terminated container

I found a GitHub issue for this one, but it appears to have been fixed already, so I'm not sure how it relates.
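
A check that seems worth doing based on that issue (the pod name is taken from the log line above; the cleanup step is only a guess):

    # leftover/exited containers belonging to the pod from the error
    sudo docker ps -a | grep kube-dns-545bc4bfd4-rt7qp
    # removing the dead pause/sandbox container is the cleanup usually suggested there
    # sudo docker rm <exited-sandbox-container-id>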

Error 8

28087 kubelet_node_status.go:791] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2019-05-06 05:00:40.664331773 +0000 UTC LastTransitionTime:2019-05-06 05:00:40.664331773 +0000 UTC Reason:KubeletNotReady Message:container runtime is down}

And this is where the node goes NotReady.
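
The transition time in that condition at least gives a window to pull from the journal; a sketch (timestamps taken from the message above):

    # kubelet log around the time the Ready condition flipped
    sudo journalctl -u kubelet --since "2019-05-06 04:55:00" --until "2019-05-06 05:05:00" \
        | grep -iE 'runtime|PLEG|not ready'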

Last log messages and status

    systemctl status kubelet
  kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: active (running) since Mon 2019-05-06 05:00:39 UTC; 1h 58min ago
     Docs: http://kubernetes.io/docs/
 Main PID: 28087 (kubelet)
    Tasks: 21
   Memory: 42.3M
   CGroup: /system.slice/kubelet.service
           └─28087 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --pod-manifest-path=/etc/kubernetes/manife...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310305   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310330   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310359   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "varl...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310385   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "cali...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310408   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "kube...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310435   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310456   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for vo...4a414b9c")
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310480   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "ca-c...
May 06 05:00:45 kube-master-1 kubelet[28087]: I0506 05:00:45.310504   28087 reconciler.go:212] operationExecutor.VerifyControllerAttachedVolume started for volume "k8s-...
May 06 05:14:29 kube-master-1 kubelet[28087]: E0506 05:14:29.848530   28087 helpers.go:468] PercpuUsage had 0 cpus, but the actual number is 2; ignoring extra CPUs

Here is the output of kubectl get po -o wide.

NAME                                              READY     STATUS     RESTARTS   AGE       IP               NODE
docker-image-prune-fhjkl                          1/1       Running    4          213d      100.96.67.87     ip-192-168-4-249
docker-image-prune-ltfpf                          1/1       Running    4          213d      100.96.152.74    ip-192-168-4-143
docker-image-prune-nmg29                          1/1       Running    3          213d      100.96.22.236    ip-192-168-4-221
docker-image-prune-pdw5h                          1/1       Running    7          213d      100.96.90.116    ip-192-168-4-174
docker-image-prune-swbhc                          1/1       Running    0          46d       100.96.191.129   ip-192-168-4-182
docker-image-prune-vtsr4                          1/1       NodeLost   1          206d      100.96.182.197   ip-192-168-4-251
fluentd-es-4bgdz                                  1/1       Running    6          213d      192.168.4.249    ip-192-168-4-249
fluentd-es-fb4gw                                  1/1       Running    7          213d      192.168.4.14     ip-192-168-4-14
fluentd-es-fs8gp                                  1/1       Running    6          213d      192.168.4.143    ip-192-168-4-143
fluentd-es-k572w                                  1/1       Running    0          46d       192.168.4.182    ip-192-168-4-182
fluentd-es-lpxhn                                  1/1       Running    5          213d      192.168.4.174    ip-192-168-4-174
fluentd-es-pjp9w                                  1/1       Unknown    2          206d      192.168.4.251    ip-192-168-4-251
fluentd-es-wbwkp                                  1/1       Running    4          213d      192.168.4.221    ip-192-168-4-221
grafana-76c7dbb678-p8hzb                          1/1       Running    3          213d      100.96.90.115    ip-192-168-4-174
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-g8xmp   2/2       Running    2          101d      100.96.22.234    ip-192-168-4-221
model-5bbe4862e4b0aa4f77d0d499-7cb4f74648-tvp4m   2/2       Running    2          101d      100.96.22.235    ip-192-168-4-221
prometheus-65b4b68d97-82vr7                       1/1       Running    3          213d      100.96.90.87     ip-192-168-4-174
pushgateway-79f575d754-75l6r                      1/1       Running    3          213d      100.96.90.83     ip-192-168-4-174
rabbitmq-cluster-58db9b6978-g6ssb                 2/2       Running    4          181d      100.96.90.117    ip-192-168-4-174
replicator-56x7v                                  1/1       Running    3          213d      100.96.90.84     ip-192-168-4-174
traefik-ingress-6dc9779596-6ghwv                  1/1       Running    3          213d      100.96.90.85     ip-192-168-4-174
traefik-ingress-6dc9779596-ckzbk                  1/1       Running    4          213d      100.96.152.73    ip-192-168-4-143
traefik-ingress-6dc9779596-sbt4n                  1/1       Running    3          213d      100.96.22.232    ip-192-168-4-221

Here is the output of kubectl get po -n kube-system -o wide.

NAME                                       READY     STATUS     RESTARTS   AGE       IP          
calico-kube-controllers-78f554c7bb-s7tmj   1/1       Running    4          213d      192.168.4.14
calico-node-5cgc6                          2/2       Running    9          213d      192.168.4.249
calico-node-bbwtm                          2/2       Running    8          213d      192.168.4.14
calico-node-clwqk                          2/2       NodeLost   4          206d      192.168.4.251
calico-node-d2zqz                          2/2       Running    0          46d       192.168.4.182
calico-node-m4x2t                          2/2       Running    6          213d      192.168.4.221
calico-node-m8xwk                          2/2       Running    9          213d      192.168.4.143
calico-node-q7r7g                          2/2       Running    8          213d      192.168.4.174
cluster-autoscaler-65d6d7f544-dpbfk        1/1       Running    10         207d      100.96.67.88
kube-apiserver-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-apiserver-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-apiserver-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kube-controller-manager-ip-192-168-4-14    1/1       Running    5          213d      192.168.4.14
kube-controller-manager-ip-192-168-4-143   1/1       Running    6          213d      192.168.4.143
kube-controller-manager-ip-192-168-4-249   1/1       Running    6          213d      192.168.4.249
kube-dns-545bc4bfd4-rt7qp                  3/3       Running    13         213d      100.96.19.197
kube-proxy-2bn42                           1/1       Running    0          46d       192.168.4.182
kube-proxy-95cvh                           1/1       Running    4          213d      192.168.4.174
kube-proxy-bqrhw                           1/1       NodeLost   2          206d      192.168.4.251
kube-proxy-cqh67                           1/1       Running    6          213d      192.168.4.14
kube-proxy-fbdvx                           1/1       Running    4          213d      192.168.4.221
kube-proxy-gcjxg                           1/1       Running    5          213d      192.168.4.249
kube-proxy-mt62x                           1/1       Running    4          213d      192.168.4.143
kube-scheduler-ip-192-168-4-14             1/1       Running    6          213d      192.168.4.14
kube-scheduler-ip-192-168-4-143            1/1       Running    6          213d      192.168.4.143
kube-scheduler-ip-192-168-4-249            1/1       Running    6          213d      192.168.4.249
kubernetes-dashboard-7c5d596d8c-q6sf2      1/1       Running    5          213d      100.96.22.230
tiller-deploy-6d9f596465-svpql             1/1       Running    3          213d      100.96.22.231

I'm somewhat lost as to where to go from here. Any suggestions are welcome.

Most likely the kubelet is down.

Share the output of the command below:

journalctl -u kubelet

Also share the output of the command below:

kubectl get po -n kube-system -owide

It looks like the node is unable to communicate with the control plane. You can follow the steps below (a sketch of the commands follows the list):

  1. Remove the node from the cluster (cordon the node, drain it, and finally delete it)
  2. Reset the node
  3. Rejoin the node to the cluster as a fresh node
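
A rough sketch of those steps, assuming a kubeadm-managed node (the 10-kubeadm.conf drop-in in the status output suggests that); <master-ip>, <token> and <hash> are placeholders to fill in:

    # on a master: take the node out of the cluster
    kubectl cordon ip-192-168-4-251.ourdomain.pro
    kubectl drain ip-192-168-4-251.ourdomain.pro --ignore-daemonsets --delete-local-data
    kubectl delete node ip-192-168-4-251.ourdomain.pro

    # on the node itself: wipe the kubeadm/kubelet state
    sudo kubeadm reset

    # rejoin using a fresh bootstrap token generated on a master
    sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>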

Latest update