How to automatically remove a failed node from a Kubernetes cluster that uses Ceph

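In an environment with multiple nodes that uses Ceph block volumes in RWO mode, if a node fails (it is unreachable and will not come back soon) and a pod is rescheduled to another node, the pod cannot start if it has a Ceph block PVC. The reason is that the volume is "still being used by another pod" (because the node failed, its resources could not be cleaned up properly).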

If I remove the node from the cluster with kubectl delete node dead-node, the pod can start, because the resources are removed.

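How can I do this automatically? Some possibilities I have considered:

Can I set a force-detach timeout for the volume?
Can I set a timeout after which a node is deleted?
Can nodes with a given taint be deleted automatically?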
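
The "delete node after a timeout" idea could be sketched as a small watchdog that wraps the same kubectl delete node command used manually above. This is only a minimal sketch under stated assumptions, not a built-in Kubernetes feature: the 10-minute threshold, the function names dead_nodes and reap, and the dry-run default are all hypothetical choices made for illustration. It relies on kubectl get nodes -o json and on the Ready condition reported for each node.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone

# Assumed threshold: how long a node may be NotReady before it is removed.
NOT_READY_TIMEOUT = timedelta(minutes=10)


def parse_time(ts):
    """Parse a Kubernetes timestamp such as "2024-01-01T12:00:00Z"."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def dead_nodes(node_list, now, timeout=NOT_READY_TIMEOUT):
    """Return the names of nodes whose Ready condition has been anything
    other than "True" (that is, "False" or "Unknown") for longer than timeout."""
    dead = []
    for node in node_list.get("items", []):
        for cond in node.get("status", {}).get("conditions", []):
            if cond.get("type") != "Ready":
                continue
            not_ready = cond.get("status") != "True"
            if not_ready and now - parse_time(cond["lastTransitionTime"]) > timeout:
                dead.append(node["metadata"]["name"])
    return dead


def reap(dry_run=True):
    """List nodes via kubectl and delete the ones considered dead.
    With dry_run=True (the default) nothing is deleted, only reported."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    names = dead_nodes(json.loads(result.stdout), datetime.now(timezone.utc))
    for name in names:
        if dry_run:
            print("would delete node", name)
        else:
            subprocess.run(["kubectl", "delete", "node", name], check=True)
    return names
```

Calling reap() periodically (for example from a cron job with cluster credentials) would report candidates; reap(dry_run=False) would actually delete them. Deleting a node that is merely partitioned rather than dead risks the RWO volume being attached in two places at once, so a conservative timeout and a dry run first seem prudent.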

I could use ReadWriteMany mode with other volume types so that multiple pods can use the PV, but that is not ideal.

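You could probably add a sidecar container and tweak the readiness and liveness probes in your pod, so that the pod is not restarted if the Ceph block volume is unreachable for some time by the container that uses it.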

Something like this:

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: ceph
  name: ceph-exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
  - name: cephclient
    image: ceph
    volumeMounts:
    - name: ceph
      mountPath: /cephmountpoint
    livenessProbe:
      # ... 👈 something
      initialDelaySeconds: 5
      periodSeconds: 3600  # 👈 make this real long

