NoExecuteTaintManager is falsely deleting my Pods

I am getting NoExecuteTaintManager events that are deleting my pods, and I can't figure out why. The node is healthy and the Pods have the appropriate tolerations.

This is actually causing runaway scaling, because my Pods are sized to use 3 of each node's 4 CPUs and have a toleration grace period > 0. Every time a Pod goes into Terminating, its replacement forces a new node to be created, since the Cluster Autoscaler works to keep replicas == 2 schedulable.

How do I figure out which taint is causing this? And why does the taint manager think the node has that taint? Right now the pods are being killed at exactly 600 seconds (I set tolerationSeconds on node.kubernetes.io/unreachable and node.kubernetes.io/not-ready), but the node doesn't appear to be in either of those conditions.
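
To pin down the taint, one approach is to capture both the eviction events and the node's taint list at the moment a Pod flips to Terminating. A minimal sketch (namespace and node name are taken from the output below; adjust for your cluster):

# List the taint-manager eviction events in the namespace
kubectl -n prod get events --field-selector reason=TaintManagerEviction

# Poll the node's taints around the time a Pod starts terminating
while true; do
  date
  kubectl get node ip-10-142-54-235.ec2.internal -o jsonpath='{.spec.taints}{"\n"}'
  sleep 5
done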

NAME                                     READY   STATUS        RESTARTS   AGE
my-api-67df7bd54c-dthbn                  1/1     Running       0          8d
my-api-67df7bd54c-mh564                  1/1     Running       0          8d
my-pod-6d7b698b5f-28rgw                  1/1     Terminating   0          15m
my-pod-6d7b698b5f-2wmmg                  1/1     Terminating   0          13m
my-pod-6d7b698b5f-4lmmg                  1/1     Running       0          4m32s
my-pod-6d7b698b5f-7m4gh                  1/1     Terminating   0          71m
my-pod-6d7b698b5f-8b47r                  1/1     Terminating   0          27m
my-pod-6d7b698b5f-bb58b                  1/1     Running       0          2m29s
my-pod-6d7b698b5f-dn26n                  1/1     Terminating   0          25m
my-pod-6d7b698b5f-jrnkg                  1/1     Terminating   0          38m
my-pod-6d7b698b5f-sswps                  1/1     Terminating   0          36m
my-pod-6d7b698b5f-vhqnf                  1/1     Terminating   0          59m
my-pod-6d7b698b5f-wkrtg                  1/1     Terminating   0          50m
my-pod-6d7b698b5f-z6p2c                  1/1     Terminating   0          47m
my-pod-6d7b698b5f-zplp6                  1/1     Terminating   0          62m
14:22:43.678937 8 taint_manager.go:102] NoExecuteTaintManager is deleting Pod: my-pod-6d7b698b5f-dn26n
14:22:43.679073 8 event.go:221] Event(v1.ObjectReference{Kind:"Pod", Namespace:"prod", Name:"my-pod-6d7b698b5f-dn26n", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'TaintManagerEviction' Marking for deletion Pod prod/my-pod-6d7b698b5f-dn26n
# kubectl -n prod get pod my-pod-6d7b698b5f-8b47r -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2019-10-25T14:09:17Z"
  deletionGracePeriodSeconds: 172800
  deletionTimestamp: "2019-10-27T14:20:40Z"
  generateName: my-pod-6d7b698b5f-
  labels:
    app.kubernetes.io/instance: my-pod
    app.kubernetes.io/name: my-pod
    pod-template-hash: 6d7b698b5f
  name: my-pod-6d7b698b5f-8b47r
  namespace: prod
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: my-pod-6d7b698b5f
    uid: c6360643-f6a6-11e9-9459-12ff96456b32
  resourceVersion: "2408256"
  selfLink: /api/v1/namespaces/prod/pods/my-pod-6d7b698b5f-8b47r
  uid: 08197175-f731-11e9-9459-12ff96456b32
spec:
  containers:
  - args:
    - -c
    - from time import sleep; sleep(10000)
    command:
    - python
    envFrom:
    - secretRef:
        name: pix4d
    - secretRef:
        name: rabbitmq
    image: python:3.7-buster
    imagePullPolicy: Always
    name: my-pod
    ports:
    - containerPort: 5000
      name: http
      protocol: TCP
    resources:
      requests:
        cpu: "3"
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-gv6q5
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: ip-10-142-54-235.ec2.internal
  nodeSelector:
    nodepool: zeroscaling-gpu-accelerated-p2-xlarge
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 172800
  tolerations:
  - key: specialized
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 600
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 600
  volumes:
  - name: default-token-gv6q5
    secret:
      defaultMode: 420
      secretName: default-token-gv6q5
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-10-25T14:10:40Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-10-25T14:11:09Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-10-25T14:11:09Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-10-25T14:10:40Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
    image: python:3.7-buster
    imageID: docker-pullable://python@sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
    lastState: {}
    name: my-pod
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-10-25T14:11:09Z"
  hostIP: 10.142.54.235
  phase: Running
  podIP: 10.142.63.233
  qosClass: Burstable
  startTime: "2019-10-25T14:10:40Z"
# kubectl -n prod describe pod my-pod-6d7b698b5f-8b47r   
Name:                      my-pod-6d7b698b5f-8b47r
Namespace:                 prod
Priority:                  0
PriorityClassName:         <none>
Node:                      ip-10-142-54-235.ec2.internal/10.142.54.235
Start Time:                Fri, 25 Oct 2019 10:10:40 -0400
Labels:                    app.kubernetes.io/instance=my-pod
                           app.kubernetes.io/name=my-pod
                           pod-template-hash=6d7b698b5f
Annotations:               checksum/config: bcdc41c616f736849a6bef9c726eec9bf704ce7d2c61736005a6fedda0ee14d0
                           kubernetes.io/psp: eks.privileged
Status:                    Terminating (lasts 47h)
Termination Grace Period:  172800s
IP:                        10.142.63.233
Controlled By:             ReplicaSet/my-pod-6d7b698b5f
Containers:
  my-pod:
    Container ID:  docker://15e2e658c459a91a86573c1096931fa4ac345e06f26652da2a58dc3e3b3d5aa2
    Image:         python:3.7-buster
    Image ID:      docker-pullable://python@sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
    Port:          5000/TCP
    Host Port:     0/TCP
    Command:
      python
    Args:
      -c
      from time import sleep; sleep(10000)
    State:          Running
      Started:      Fri, 25 Oct 2019 10:11:09 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:  3
    Environment Variables from:
      pix4d       Secret  Optional: false
      rabbitmq    Secret  Optional: false
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gv6q5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-gv6q5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gv6q5
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 600s
                 node.kubernetes.io/unreachable:NoExecute for 600s
                 specialized
Events:
  Type     Reason            Age                From                                    Message
  ----     ------            ----               ----                                    -------
  Warning  FailedScheduling  12m (x2 over 12m)  default-scheduler                       0/13 nodes are available: 1 Insufficient pods, 13 Insufficient cpu, 6 node(s) didn't match node selector.
  Normal   TriggeredScaleUp  12m                cluster-autoscaler                      pod triggered scale-up: [{prod-worker-gpu-accelerated-p2-xlarge 7->8 (max: 13)}]
  Warning  FailedScheduling  11m (x5 over 11m)  default-scheduler                       0/14 nodes are available: 1 Insufficient pods, 1 node(s) had taints that the pod didn't tolerate, 13 Insufficient cpu, 6 node(s) didn't match node selector.
  Normal   Scheduled         11m                default-scheduler                       Successfully assigned prod/my-pod-6d7b698b5f-8b47r to ip-10-142-54-235.ec2.internal
  Normal   Pulling           11m                kubelet, ip-10-142-54-235.ec2.internal  pulling image "python:3.7-buster"
  Normal   Pulled            10m                kubelet, ip-10-142-54-235.ec2.internal  Successfully pulled image "python:3.7-buster"
  Normal   Created           10m                kubelet, ip-10-142-54-235.ec2.internal  Created container
  Normal   Started           10m                kubelet, ip-10-142-54-235.ec2.internal  Started container
# kubectl -n prod describe node ip-10-142-54-235.ec2.internal
Name:               ip-10-142-54-235.ec2.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=p2.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-1
                    failure-domain.beta.kubernetes.io/zone=us-east-1b
                    kubernetes.io/hostname=ip-10-142-54-235.ec2.internal
                    nodepool=zeroscaling-gpu-accelerated-p2-xlarge
Annotations:        node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 25 Oct 2019 10:10:20 -0400
Taints:             specialized=true:NoExecute
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 25 Oct 2019 10:23:11 -0400   Fri, 25 Oct 2019 10:10:19 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 25 Oct 2019 10:23:11 -0400   Fri, 25 Oct 2019 10:10:19 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 25 Oct 2019 10:23:11 -0400   Fri, 25 Oct 2019 10:10:19 -0400   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 25 Oct 2019 10:23:11 -0400   Fri, 25 Oct 2019 10:10:40 -0400   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.142.54.235
  ExternalIP:   3.86.112.24
  Hostname:     ip-10-142-54-235.ec2.internal
  InternalDNS:  ip-10-142-54-235.ec2.internal
  ExternalDNS:  ec2-3-86-112-24.compute-1.amazonaws.com
Capacity:
 attachable-volumes-aws-ebs:  39
 cpu:                         4
 ephemeral-storage:           209702892Ki
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      62872868Ki
 pods:                        58
Allocatable:
 attachable-volumes-aws-ebs:  39
 cpu:                         4
 ephemeral-storage:           200777747706
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      61209892Ki
 pods:                        58
System Info:
 Machine ID:                 0e76fec3e06d41a6bf2c49a18fbe1795
 System UUID:                EC29973A-D616-F673-6899-A96C97D5AE2D
 Boot ID:                    4bc510b6-f615-48a7-9e1e-47261ddf26a4
 Kernel Version:             4.14.146-119.123.amzn2.x86_64
 OS Image:                   Amazon Linux 2
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://18.6.1
 Kubelet Version:            v1.13.11-eks-5876d6
 Kube-Proxy Version:         v1.13.11-eks-5876d6
ProviderID:                  aws:///us-east-1b/i-0f5b519aa6e38e04a
Non-terminated Pods:         (5 in total)
  Namespace                  Name                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                          ------------  ----------  ---------------  -------------  ---
  amazon-cloudwatch          cloudwatch-agent-4d24j        50m (1%)      250m (6%)   50Mi (0%)        250Mi (0%)     12m
  amazon-cloudwatch          fluentd-cloudwatch-wkslq      50m (1%)      0 (0%)      150Mi (0%)       300Mi (0%)     12m
  prod                       my-pod-6d7b698b5f-8b47r       3 (75%)       0 (0%)      0 (0%)           0 (0%)         14m
  kube-system                aws-node-6nr6g                10m (0%)      0 (0%)      0 (0%)           0 (0%)         13m
  kube-system                kube-proxy-wf8k4              100m (2%)     0 (0%)      0 (0%)           0 (0%)         13m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         3210m (80%)  250m (6%)
  memory                      200Mi (0%)   550Mi (0%)
  ephemeral-storage           0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Events:
  Type    Reason                   Age                From                                       Message
  ----    ------                   ----               ----                                       -------
  Normal  Starting                 13m                kubelet, ip-10-142-54-235.ec2.internal     Starting kubelet.
  Normal  NodeHasSufficientMemory  13m (x2 over 13m)  kubelet, ip-10-142-54-235.ec2.internal     Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    13m (x2 over 13m)  kubelet, ip-10-142-54-235.ec2.internal     Node ip-10-142-54-235.ec2.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     13m (x2 over 13m)  kubelet, ip-10-142-54-235.ec2.internal     Node ip-10-142-54-235.ec2.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  13m                kubelet, ip-10-142-54-235.ec2.internal     Updated Node Allocatable limit across pods
  Normal  Starting                 12m                kube-proxy, ip-10-142-54-235.ec2.internal  Starting kube-proxy.
  Normal  NodeReady                12m                kubelet, ip-10-142-54-235.ec2.internal     Node ip-10-142-54-235.ec2.internal status is now: NodeReady
# kubectl get node ip-10-142-54-235.ec2.internal -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2019-10-25T14:10:20Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: p2.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: us-east-1
    failure-domain.beta.kubernetes.io/zone: us-east-1b
    kubernetes.io/hostname: ip-10-142-54-235.ec2.internal
    nodepool: zeroscaling-gpu-accelerated-p2-xlarge
  name: ip-10-142-54-235.ec2.internal
  resourceVersion: "2409195"
  selfLink: /api/v1/nodes/ip-10-142-54-235.ec2.internal
  uid: 2d934979-f731-11e9-89b8-0234143df588
spec:
  providerID: aws:///us-east-1b/i-0f5b519aa6e38e04a
  taints:
  - effect: NoExecute
    key: specialized
    value: "true"
status:
  addresses:
  - address: 10.142.54.235
    type: InternalIP
  - address: 3.86.112.24
    type: ExternalIP
  - address: ip-10-142-54-235.ec2.internal
    type: Hostname
  - address: ip-10-142-54-235.ec2.internal
    type: InternalDNS
  - address: ec2-3-86-112-24.compute-1.amazonaws.com
    type: ExternalDNS
  allocatable:
    attachable-volumes-aws-ebs: "39"
    cpu: "4"
    ephemeral-storage: "200777747706"
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 61209892Ki
    pods: "58"
  capacity:
    attachable-volumes-aws-ebs: "39"
    cpu: "4"
    ephemeral-storage: 209702892Ki
    hugepages-1Gi: "0"
    hugepages-2Mi: "0"
    memory: 62872868Ki
    pods: "58"
  conditions:
  - lastHeartbeatTime: "2019-10-25T14:23:51Z"
    lastTransitionTime: "2019-10-25T14:10:19Z"
    message: kubelet has sufficient memory available
    reason: KubeletHasSufficientMemory
    status: "False"
    type: MemoryPressure
  - lastHeartbeatTime: "2019-10-25T14:23:51Z"
    lastTransitionTime: "2019-10-25T14:10:19Z"
    message: kubelet has no disk pressure
    reason: KubeletHasNoDiskPressure
    status: "False"
    type: DiskPressure
  - lastHeartbeatTime: "2019-10-25T14:23:51Z"
    lastTransitionTime: "2019-10-25T14:10:19Z"
    message: kubelet has sufficient PID available
    reason: KubeletHasSufficientPID
    status: "False"
    type: PIDPressure
  - lastHeartbeatTime: "2019-10-25T14:23:51Z"
    lastTransitionTime: "2019-10-25T14:10:40Z"
    message: kubelet is posting ready status
    reason: KubeletReady
    status: "True"
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  images:
  - names:
    - python@sha256:f0db6711abee8d406121c9e057bc0f7605336e8148006164fea2c43809fe7977
    - python:3.7-buster
    sizeBytes: 917672801
  - names:
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:5b7e7435f88a86bbbdb2a5ecd61e893dc14dd13c9511dc8ace362d299259700a
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.5.4
    sizeBytes: 290739356
  - names:
    - fluent/fluentd-kubernetes-daemonset@sha256:582770d951f81e0971e852089239ced0186e0bdc3226daf16b99ca4cc22de4f7
    - fluent/fluentd-kubernetes-daemonset:v1.3.3-debian-cloudwatch-1.4
    sizeBytes: 261867521
  - names:
    - amazon/cloudwatch-agent@sha256:877106acbc56e747ebe373548c88cd37274f666ca11b5c782211db4c5c7fb64b
    - amazon/cloudwatch-agent:latest
    sizeBytes: 131360039
  - names:
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy@sha256:4767b441ddc424b0ea63c305b79be154f65fb15ebefe8a3b2832ce55aa6de2f0
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/kube-proxy:v1.13.8
    sizeBytes: 80183964
  - names:
    - busybox@sha256:fe301db49df08c384001ed752dff6d52b4305a73a7f608f21528048e8a08b51e
    - busybox:latest
    sizeBytes: 1219782
  - names:
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64@sha256:bea77c323c47f7b573355516acf927691182d1333333d1f41b7544012fab7adf
    - 602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause-amd64:3.1
    sizeBytes: 742472
  nodeInfo:
    architecture: amd64
    bootID: 4bc510b6-f615-48a7-9e1e-47261ddf26a4
    containerRuntimeVersion: docker://18.6.1
    kernelVersion: 4.14.146-119.123.amzn2.x86_64
    kubeProxyVersion: v1.13.11-eks-5876d6
    kubeletVersion: v1.13.11-eks-5876d6
    machineID: 0e76fec3e06d41a6bf2c49a18fbe1795
    operatingSystem: linux
    osImage: Amazon Linux 2
    systemUUID: EC29973A-D616-F673-6899-A96C97D5AE2D
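
For completeness, a quick side-by-side check of the node's taints against the pod's tolerations (same objects as above) shows the only node taint, specialized, is tolerated with operator: Exists, which is what makes the evictions so puzzling:

# Compare the node's taints with the pod's tolerations
kubectl get node ip-10-142-54-235.ec2.internal -o jsonpath='{.spec.taints}{"\n"}'
kubectl -n prod get pod my-pod-6d7b698b5f-8b47r -o jsonpath='{.spec.tolerations}{"\n"}'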

Unfortunately, I don't have an exact answer to your question, but I may have a workaround.

I think I ran into the same issue on an Amazon EKS cluster running version 1.13.11: my pods were triggering a node scale-up, and each pod was scheduled, worked for 300 seconds, and was then evicted:

74m         Normal    TaintManagerEviction   pod/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace   Marking for deletion Pod project-beta/master-3bb760a7-b782-4138-b09f-0ca385db9ad7-workspace

Interestingly, the same pod ran without any problems if it was scheduled on an existing node rather than on a freshly created one.

From my investigation, this particular Kubernetes version does seem to have some issues here, perhaps some edge case of the TaintBasedEvictions feature (which I believe is enabled by default in Kubernetes 1.13).
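
If you ran a self-managed control plane, one way to test that theory would be to turn the feature gate off on the kube-controller-manager; that isn't an option on EKS, where the control plane is managed. A sketch of the flag, assuming you can edit the controller-manager manifest:

# kube-controller-manager flag (self-managed clusters only; other flags omitted)
kube-controller-manager --feature-gates=TaintBasedEvictions=false ...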

为了"修复"这个问题,我将集群版本更新到了1.14。在那之后,神秘的吊舱驱逐事件就再也没有发生过。

So, if it's possible for you, I would suggest updating your cluster to 1.14 (together with the matching Cluster Autoscaler version).
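
For example, on EKS the control-plane upgrade can be started with the AWS CLI (fill in your own cluster name; worker nodes and the Cluster Autoscaler image have to be updated separately):

# Upgrade the EKS control plane to 1.14
aws eks update-cluster-version --name <cluster-name> --kubernetes-version 1.14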
