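I can see this message in the pod description: "Node is shutting, evicting pods". It only happens to pods that have a specific toleration and node selector and run on a preemptible node pool.
We added tolerations to the pods and created separate node pools with different taints (preemptible and non-preemptible) to isolate preemptible and non-preemptible pods on the cluster.
The cluster without taints works fine.
The cluster with taints has a problem: pods get stuck in the Shutdown state (only pods deployed on the preemptible node pool).
Here is the pod description: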
Namespace: XXXXXX
Priority: 0
Node: gke-cluster-reliable-preemptible-node-XXXXXX
Start Time: Tue, 10 Aug 2021 16:44:30 +0530
Labels: app=XXXX
pod-template-hash=XXXX
release=XXXX
repo=XXX
Annotations: randVersion: a200a
Status: Failed
Reason: Shutdown
Message: Node is shutting, evicting pods
IP:
IPs: <none>
Controlled By: ReplicaSet/career-assessor-be-8467d6c885
Containers:
career-assessor-be:
Image: XXXXXX
Port: 8001/TCP
key: CLOUD_SQL_CONNECTION_NAME
Host Port: 0/TCP
Command:
/bin/sh
-c
Args:
XXXXX
Limits:
cpu: 3200m
memory: 2400Mi
Requests:
cpu: 1600m
memory: 1800Mi
Environment Variables from:
careerassessor-config ConfigMap Optional: false
Environment:
LOG_TO_CONSOLE: 1
INACTIVITY_PERIOD:
USER_EMAIL: jyostna@springboard.com
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xpdwd (ro)
cloudsql-proxy:
Image: gcr.io/cloudsql-docker/gce-proxy:1.17
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=$(CLOUD_SQL_CONNECTION_NAME)=tcp:0.0.0.0:3306
-credential_file=/secrets/cloudsql/cloudsql-instance-credentials.json
-term_timeout=$(CLOUD_SQL_CONNECTION_TIMEOUT)s
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 20m
memory: 20Mi
Environment:
CLOUD_SQL_CONNECTION_NAME: <set to the key 'CLOUD_SQL_CONNECTION_NAME' of config map 'careerassessor-config'> Optional: false
CLOUD_SQL_CONNECTION_TIMEOUT: <set to the key 'CLOUD_SQL_CONNECTION_TIMEOUT' of config map 'careerassessor-config'> Optional: false
Mounts:
/secrets/cloudsql from careerassessor-cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xpdwd (ro)
Volumes:
careerassessor-cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: XXXXX
Optional: false
default-token-xpdwd:
Type: Secret (a volume populated by a Secret)
SecretName: XXX
Optional: false
QoS Class: Burstable
Node-Selectors: non-preemptible=false
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
non-preemptible=false:NoSchedule
Events: <none>
Here is the pod YAML:
apiVersion: v1
kind: Pod
metadata:
annotations:
randVersion: a200a
creationTimestamp: "2021-08-10T10:59:29Z"
generateName: xxx
labels:
app: xxx
pod-template-hash: 8467d6c885
release: xxxx
repo: xxx
managedFields:
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.: {}
f:randVersion: {}
f:generateName: {}
f:labels:
.: {}
f:app: {}
f:pod-template-hash: {}
f:release: {}
f:repo: {}
f:ownerReferences:
.: {}
k:{"uid":"674b9e8e-420e-44e7-9601-871be01a9fcb"}:
.: {}
f:apiVersion: {}
f:blockOwnerDeletion: {}
f:controller: {}
f:kind: {}
f:name: {}
f:uid: {}
f:spec:
f:containers:
k:{"name":"career-assessor-be"}:
.: {}
f:args: {}
f:command: {}
f:env:
.: {}
k:{"name":"INACTIVITY_PERIOD"}:
.: {}
f:name: {}
k:{"name":"LOG_TO_CONSOLE"}:
.: {}
f:name: {}
f:value: {}
k:{"name":"USER_EMAIL"}:
.: {}
f:name: {}
f:value: {}
f:envFrom: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:ports:
.: {}
k:{"containerPort":8001,"protocol":"TCP"}:
.: {}
f:containerPort: {}
f:name: {}
f:protocol: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
k:{"name":"cloudsql-proxy"}:
.: {}
f:command: {}
f:env:
.: {}
k:{"name":"CLOUD_SQL_CONNECTION_NAME"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:configMapKeyRef:
.: {}
f:key: {}
f:name: {}
k:{"name":"CLOUD_SQL_CONNECTION_TIMEOUT"}:
.: {}
f:name: {}
f:valueFrom:
.: {}
f:configMapKeyRef:
.: {}
f:key: {}
f:name: {}
f:image: {}
f:imagePullPolicy: {}
f:name: {}
f:resources:
.: {}
f:limits:
.: {}
f:cpu: {}
f:memory: {}
f:requests:
.: {}
f:cpu: {}
f:memory: {}
f:terminationMessagePath: {}
f:terminationMessagePolicy: {}
f:volumeMounts:
.: {}
k:{"mountPath":"/secrets/cloudsql"}:
.: {}
f:mountPath: {}
f:name: {}
f:readOnly: {}
f:dnsPolicy: {}
f:enableServiceLinks: {}
f:nodeSelector:
.: {}
f:non-preemptible: {}
f:restartPolicy: {}
f:schedulerName: {}
f:securityContext: {}
f:terminationGracePeriodSeconds: {}
f:tolerations: {}
f:volumes:
.: {}
k:{"name":"careerassessor-cloudsql-instance-credentials"}:
.: {}
f:name: {}
f:secret:
.: {}
f:defaultMode: {}
f:secretName: {}
manager: kube-controller-manager
operation: Update
time: "2021-08-10T10:59:29Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:conditions:
k:{"type":"PodScheduled"}:
f:message: {}
f:reason: {}
manager: kube-scheduler
operation: Update
time: "2021-08-10T10:59:29Z"
- apiVersion: v1
fieldsType: FieldsV1
fieldsV1:
f:status:
f:message: {}
f:phase: {}
f:reason: {}
f:startTime: {}
manager: kubelet
operation: Update
time: "2021-08-10T11:51:28Z"
name: career-assessor-be-8467d6c885-h27sh
namespace: jyostna1
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: ReplicaSet
name: career-assessor-be-8467d6c885
uid: 674b9e8e-420e-44e7-9601-871be01a9fcb
resourceVersion: "48899168"
uid: 8837f88d-7e3e-444f-a804-32a7a6e98c71
spec:
  containers:
  - args:
    - |
      xxxx
    command:
    - /bin/sh
    - -c
    env:
    - name: LOG_TO_CONSOLE
      value: "1"
    - name: INACTIVITY_PERIOD
    - name: USER_EMAIL
      value: jyostna@springboard.com
    envFrom:
    - configMapRef:
        name: careerassessor-config
    image: us.gcr.io/springboard-production/career_assessor:IP-405-implement-explored-strategy-for-r
    imagePullPolicy: Always
    name: career-assessor-be
    ports:
    - containerPort: 8001
      name: be-port
      protocol: TCP
    resources:
      limits:
        cpu: 3200m
        memory: 2400Mi
      requests:
        cpu: 1600m
        memory: 1800Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xpdwd
      readOnly: true
  - command:
    - /cloud_sql_proxy
    - -instances=$(CLOUD_SQL_CONNECTION_NAME)=tcp:0.0.0.0:3306
    - -credential_file=/secrets/cloudsql/cloudsql-instance-credentials.json
    - -term_timeout=$(CLOUD_SQL_CONNECTION_TIMEOUT)s
    env:
    - name: CLOUD_SQL_CONNECTION_NAME
      valueFrom:
        configMapKeyRef:
          key: CLOUD_SQL_CONNECTION_NAME
          name: careerassessor-config
    - name: CLOUD_SQL_CONNECTION_TIMEOUT
      valueFrom:
        configMapKeyRef:
          key: CLOUD_SQL_CONNECTION_TIMEOUT
          name: careerassessor-config
    image: gcr.io/cloudsql-docker/gce-proxy:1.17
    imagePullPolicy: IfNotPresent
    name: cloudsql-proxy
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 20m
        memory: 20Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /secrets/cloudsql
      name: careerassessor-cloudsql-instance-credentials
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-xpdwd
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: gke-cluster-reliable-preemptible-node-4b42c9be-x9qs
  nodeSelector:
    non-preemptible: "false"
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoSchedule
    key: non-preemptible
    operator: Equal
    value: "false"
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: careerassessor-cloudsql-instance-credentials
    secret:
      defaultMode: 420
      secretName: careerassessor-cloudsql-instance-credentials
  - name: default-token-xpdwd
    secret:
      defaultMode: 420
      secretName: default-token-xpdwd
status:
  message: Node is shutting, evicting pods
  phase: Failed
  reason: Shutdown
  startTime: "2021-08-10T11:14:30Z"
Here is the node description:
Name: gke-cluster-reliable-preemptible-node-xxxxx
Roles: <none>
Labels: beta.kubernetes.io/arch=xxx
beta.kubernetes.io/instance-type=xxx
beta.kubernetes.io/os=linux
cloud.google.com/gke-boot-disk=pd-standard
cloud.google.com/gke-container-runtime=containerd
cloud.google.com/gke-nodepool=preemptible-nodepool
cloud.google.com/gke-os-distribution=cos
cloud.google.com/gke-preemptible=true
cloud.google.com/machine-family=n1
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-a
kubernetes.io/arch=amd64
kubernetes.io/hostname=gke-cluster-reliable-preemptible-node-xxxx
kubernetes.io/os=linux
node.kubernetes.io/instance-type=n1-standard-4
non-preemptible=false
topology.gke.io/zone=us-central1-a
topology.kubernetes.io/region=us-central1
topology.kubernetes.io/zone=us-central1-a
Annotations: container.googleapis.com/instance_id: 7488269578212988511
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/playground-206205/zones/us-central1-a/instances/gke-cluster-reliable-preemptible-node-4b42c9be-x9qs"}
node.alpha.kubernetes.io/ttl: 0
node.gke.io/last-applied-node-labels:
cloud.google.com/gke-boot-disk=pd-standard,cloud.google.com/gke-container-runtime=containerd,cloud.google.com/gke-nodepool=preemptible-nod...
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Tue, 10 Aug 2021 17:24:03 +0530
Taints: non-preemptible=false:NoSchedule
Unschedulable: false
Lease:
HolderIdentity: gke-cluster-reliable-preemptible-node-4b42c9be-x9qs
AcquireTime: <unset>
RenewTime: Tue, 10 Aug 2021 20:27:03 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
FrequentDockerRestart False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 NoFrequentContainerdRestart containerd is functioning properly
FrequentUnregisterNetDevice False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 NoFrequentUnregisterNetDevice node is functioning properly
CorruptDockerOverlay2 False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
KernelDeadlock False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentKubeletRestart False Tue, 10 Aug 2021 20:24:28 +0530 Tue, 10 Aug 2021 17:24:08 +0530 NoFrequentKubeletRestart kubelet is functioning properly
NetworkUnavailable False Tue, 10 Aug 2021 17:24:03 +0530 Tue, 10 Aug 2021 17:24:03 +0530 RouteCreated NodeController create implicit route
MemoryPressure False Tue, 10 Aug 2021 20:26:16 +0530 Tue, 10 Aug 2021 17:24:00 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Tue, 10 Aug 2021 20:26:16 +0530 Tue, 10 Aug 2021 17:24:00 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Tue, 10 Aug 2021 20:26:16 +0530 Tue, 10 Aug 2021 17:24:00 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 10 Aug 2021 20:26:16 +0530 Tue, 10 Aug 2021 17:24:03 +0530 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.100
ExternalIP: 34.133.49.148
InternalDNS: gke-cluster-reliable-preemptible-node-4b42c9be-x9qs.c.playground-206205.internal
Hostname: gke-cluster-reliable-preemptible-node-4b42c9be-x9qs.c.playground-206205.internal
Capacity:
attachable-volumes-gce-pd: 127
cpu: 4
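Thanks for all the information. According to the documentation, and assuming your GKE cluster is on version 1.20:
On preemptible GKE nodes running version 1.20 or later, the kubelet graceful node shutdown feature is enabled by default. As a result, the kubelet detects the preemption and gracefully terminates pods.
For pods on preemptible nodes, do not specify more than 25 seconds for terminationGracePeriodSeconds, because those pods will receive only 25 seconds.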
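The pod spec in the question shows terminationGracePeriodSeconds: 30, which is above that 25-second limit. The sketch below is one possible way to lower it with a merge patch. The Deployment name career-assessor-be is an assumption inferred from the ReplicaSet name, and the helper function names are made up for illustration.

```shell
# Build a merge patch that sets the pod template's grace period.
# Usage: grace_patch <seconds>
grace_patch() {
  printf '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":%d}}}}' "$1"
}

# Apply the patch to a Deployment.
# Usage: set_grace_period <namespace> <deployment> <seconds>
# Example (assumed names): set_grace_period jyostna1 career-assessor-be 25
set_grace_period() {
  kubectl -n "$1" patch deployment "$2" --type merge -p "$(grace_patch "$3")"
}
```

Patching the Deployment rather than the pod means the ReplicaSet recreates pods with the new value; existing pods are replaced through a normal rollout.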
The best way to use taints and tolerations here is to rely on the default label that GKE sets on preemptible VMs. Taint the preemptible nodes:
kubectl taint nodes node-name cloud.google.com/gke-preemptible="true":NoSchedule
Add a toleration to the pod:
tolerations:
- key: cloud.google.com/gke-preemptible
  operator: Equal
  value: "true"
  effect: NoSchedule
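Rather than tainting nodes one by one by name, the taint above can be applied to every node that carries the preemptible label by using a label selector. This is a sketch, not a definitive setup: the function and variable names are illustrative, and a taint applied with kubectl may not survive when GKE recreates a node, so configuring the taint on the node pool itself is likely the more durable option.

```shell
# Label that GKE sets on preemptible nodes, and the matching taint to apply.
SELECTOR='cloud.google.com/gke-preemptible=true'
TAINT='cloud.google.com/gke-preemptible=true:NoSchedule'

# Taint every node matched by the label selector.
# --overwrite makes the call repeatable if the taint already exists.
taint_preemptible_nodes() {
  kubectl taint nodes -l "$SELECTOR" "$TAINT" --overwrite
}

# List each node with the keys of its taints, to verify the result.
show_node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}
```

Run taint_preemptible_nodes first, then show_node_taints to confirm that only the preemptible nodes list the cloud.google.com/gke-preemptible key.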
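Also:
When the kubelet terminates pods during a preemptible node shutdown, it assigns the pods a Failed status and a Shutdown reason. These pods are cleaned up during the next garbage collection. You can also delete the shutdown pods manually with the following command: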
kubectl get pods --all-namespaces | grep -i shutdown | awk '{print $1, $2}' | xargs -n2 kubectl delete pod -n
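The grep in the command above matches any line containing "shutdown", including pods that merely have it in their name. A slightly stricter variant, sketched below, compares only the STATUS column. It assumes the default column order of kubectl get pods --all-namespaces (NAMESPACE, NAME, READY, STATUS, ...) and an xargs that supports -r; the function names are illustrative.

```shell
# Read `kubectl get pods --all-namespaces --no-headers` output on stdin and
# print "<namespace> <name>" for pods whose STATUS column is exactly "Shutdown".
list_shutdown_pods() {
  awk '$4 == "Shutdown" { print $1, $2 }'
}

# Delete the listed pods, one namespace/name pair per kubectl call.
# -r skips the delete entirely when no pod matches.
delete_shutdown_pods() {
  kubectl get pods --all-namespaces --no-headers \
    | list_shutdown_pods \
    | xargs -r -n2 kubectl delete pod -n
}
```

Piping the kubectl output through list_shutdown_pods alone gives a dry run of what delete_shutdown_pods would remove.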
Please see the full documentation, which explains all the details: https://cloud.google.com/kubernetes-engine/docs/how-to/preemptible-vms