Pods are not moved when the host fails

Following the book "Kubernetes: Up & Running" together with the official documentation, I set up a simple cluster on Ubuntu with 1 master node and 3 worker nodes.

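It basically works, until I shut down one of the worker nodes. After a few seconds, the node's status switches to unknown. The pods that were on the offline node, however, are still reported as running.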

Shouldn't k8s move these pods to a different, healthy host? Am I missing something?

Thanks for any advice!

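In Kubernetes 1.13 and later, pod eviction under node-failure/not-ready conditions is actually controlled by taints and tolerations. The --pod-eviction-timeout parameter is no longer used.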

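When a node fails or becomes not ready, the node controller/kubelet adds the following taints to the node: node.kubernetes.io/unreachable and node.kubernetes.io/not-ready. By default, every pod tolerates these taints for 300 seconds. This toleration time can be controlled cluster-wide, for all pods, with kube-apiserver flags, and per pod with the tolerations object in the pod spec.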
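For illustration, this is roughly what the default tolerations look like on a pod that declares none of its own, for example in the output of kubectl get pod <name> -o yaml. The sketch assumes the default 300-second settings; the entries are added by the API server's DefaultTolerationSeconds admission plugin.

```yaml
# Tolerations added automatically to a pod that defines none itself
# (assuming the default 300-second settings).
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```

With these in place, the pod stays bound to the failed node for up to 5 minutes after the taint appears, which matches the behaviour described in the question.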

Cluster-wide configuration:

You can change the toleration time cluster-wide with the --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds flags of kube-apiserver.

From the documentation:

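--default-not-ready-toleration-seconds int     Default: 300
    Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int     Default: 300
    Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.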
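As a rough sketch of where these flags go: on a kubeadm-style cluster (an assumption, the question does not say how the cluster was installed) the API server runs as a static pod, and its flags live in a manifest such as /etc/kubernetes/manifests/kube-apiserver.yaml. The value 60 below is purely illustrative.

```yaml
# Fragment of a kube-apiserver static pod manifest (kubeadm layout assumed).
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # ...keep the flags that are already present...
    - --default-not-ready-toleration-seconds=60     # illustrative value
    - --default-unreachable-toleration-seconds=60   # illustrative value
```

The new defaults are applied when a pod is created, so pods that already exist keep the tolerations they were admitted with until they are recreated.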

Per-pod configuration:

You can also change the toleration time for an individual pod with the following configuration.

tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 120
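To show where that fragment sits in practice, here is a minimal sketch of a Deployment that carries it. The name web, the app label, and the nginx image are hypothetical placeholders; the point is only that tolerations belong in the pod template's spec, not at the top level of the Deployment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx         # placeholder image
      # Evict these pods 120 seconds after the node is tainted,
      # instead of the default 300 seconds.
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 120
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 120
```

Because the pods are owned by a Deployment, the replacements created after eviction are scheduled onto the remaining healthy nodes.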

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions

By default, pods are not moved for 5 minutes. This is configurable through the --pod-eviction-timeout duration flag on the controller manager.

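If nothing has happened after 5 minutes (as can be the case with StatefulSets), you need to delete the node with kubectl delete node, which triggers rescheduling of the pods that were on it.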

Starting with Kubernetes 1.13, pod eviction under node-failure/not-ready conditions is controlled by taints and tolerations. The --pod-eviction-timeout parameter is ignored.

The cluster-wide settings are configured through kube-apiserver parameters (not kubelet parameters).

--default-not-ready-toleration-seconds int     Default: 300
    Indicates the tolerationSeconds of the toleration for notReady:NoExecute that is added by default to every pod that does not already have such a toleration.
--default-unreachable-toleration-seconds int     Default: 300
    Indicates the tolerationSeconds of the toleration for unreachable:NoExecute that is added by default to every pod that does not already have such a toleration.

If you want to manage this at the pod level, you can add tolerations.

spec:
  tolerations:
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30
  - key: "node.kubernetes.io/not-ready"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 30

See also:

https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/#taint-based-evictions

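I was able to work around this with a script that force-drains any node that has been in the "NotReady" state for more than 5 minutes (adjustable), and then uncordons the node once it comes back.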
