Why is my Kubernetes CronJob pod being killed mid-execution?



Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-13T02:40:46Z", GoVersion:"go1.16.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"e1d093448d0ed9b9b1a48f49833ff1ee64c05ba5", GitTreeState:"clean", BuildDate:"2021-06-03T00:20:57Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

I have a Kubernetes CronJob whose purpose is to run some Azure CLI commands on a time-based schedule.

Running the container locally works fine; however, manually triggering the CronJob via Lens, or letting it run on its schedule, results in strange behaviour (running as a Job in the cloud produces unexpected results).

Here is the CronJob definition:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
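The manual trigger was done through Lens; for reference, the equivalent with plain kubectl would be something along these lines (the job name below simply mirrors the one created in this run):

$ kubectl create job --from=cronjob/development-scale-down development-scale-down-manual-xwp1k -n development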

I triggered the CronJob manually, and it created the job development-scale-down-manual-xwp1k. After it ran, we can see the following:

$ kubectl describe job development-scale-down-manual-xwp1k
Name:                     development-scale-down-manual-xwp1k
Namespace:                development
Selector:                 controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
Labels:                   controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
                          job-name=development-scale-down-manual-xwp1k
Annotations:              <none>
Parallelism:              1
Completions:              1
Start Time:               Wed, 04 Aug 2021 09:40:28 +1200
Active Deadline Seconds:  360s
Pods Statuses:            0 Running / 0 Succeeded / 1 Failed
Pod Template:
  Labels:  controller-uid=ecf8fb47-cd50-42eb-9a6f-888f7e2c9257
           job-name=development-scale-down-manual-xwp1k
  Containers:
   scaler:
    Image:      myimage:latest
    Port:       <none>
    Host Port:  <none>
    Environment:
      CLUSTER_NAME:    ...
      NODEPOOL_NAME:   ...
      NODEPOOL_SIZE:   ...
      RESOURCE_GROUP:  ...
      SP_APP_ID:       <set to the key 'application_id' in secret 'scaler-secrets'>      Optional: false
      SP_PASSWORD:     <set to the key 'application_pass' in secret 'scaler-secrets'>    Optional: false
      SP_TENANT:       <set to the key 'application_tenant' in secret 'scaler-secrets'>  Optional: false
    Mounts:            <none>
  Volumes:             <none>
Events:
Type     Reason                Age   From            Message
----     ------                ----  ----            -------
Normal   SuccessfulCreate      24m   job-controller  Created pod: development-scale-down-manual-xwp1k-b858c
Normal   SuccessfulCreate      23m   job-controller  Created pod: development-scale-down-manual-xwp1k-xkkw9
Warning  BackoffLimitExceeded  23m   job-controller  Job has reached the specified backoff limit

Unlike other questions I have read, this one makes no mention of a "SuccessfulDelete" event.

The events returned by kubectl get events tell an interesting story:

$ ktl get events | grep xwp1k
3m19s       Normal    Scheduled                  pod/development-scale-down-manual-xwp1k-b858c   Successfully assigned development/development-scale-down-manual-xwp1k-b858c to aks-burst-37275452-vmss00000d
3m18s       Normal    Pulling                    pod/development-scale-down-manual-xwp1k-b858c   Pulling image "myimage:latest"
2m38s       Normal    Pulled                     pod/development-scale-down-manual-xwp1k-b858c   Successfully pulled image "myimage:latest" in 40.365655229s
2m23s       Normal    Created                    pod/development-scale-down-manual-xwp1k-b858c   Created container myimage
2m23s       Normal    Started                    pod/development-scale-down-manual-xwp1k-b858c   Started container myimage
2m12s       Normal    Killing                    pod/development-scale-down-manual-xwp1k-b858c   Stopping container myimage
2m12s       Normal    Scheduled                  pod/development-scale-down-manual-xwp1k-xkkw9   Successfully assigned development/development-scale-down-manual-xwp1k-xkkw9 to aks-default-37275452-vmss000002
2m12s       Normal    Pulling                    pod/development-scale-down-manual-xwp1k-xkkw9   Pulling image "myimage:latest"
2m11s       Normal    Pulled                     pod/development-scale-down-manual-xwp1k-xkkw9   Successfully pulled image "myimage:latest" in 751.93652ms
2m10s       Normal    Created                    pod/development-scale-down-manual-xwp1k-xkkw9   Created container myimage
2m10s       Normal    Started                    pod/development-scale-down-manual-xwp1k-xkkw9   Started container myimage
3m19s       Normal    SuccessfulCreate           job/development-scale-down-manual-xwp1k         Created pod: development-scale-down-manual-xwp1k-b858c
2m12s       Normal    SuccessfulCreate           job/development-scale-down-manual-xwp1k         Created pod: development-scale-down-manual-xwp1k-xkkw9
2m1s        Warning   BackoffLimitExceeded       job/development-scale-down-manual-xwp1k         Job has reached the specified backoff limit

I don't know why the container was killed. The logs all look fine and there are no resource constraints. The container is removed very quickly, which means I get almost no time to debug. The more verbose event line reads:

3m54s       Normal    Killing                    pod/development-scale-down-manual-xwp1k-b858c   spec.containers{myimage}   kubelet, aks-burst-37275452-vmss00000d                                 Stopping container myimage                                                                                                                                                       3m54s        1       development-scale-down-manual-xwp1k-b858c.1697e9d5e5b846ef
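Because the pod disappears so quickly, one workaround (a suggestion on my part, not something covered above) is to start streaming the job's logs immediately after triggering it, and to request the wide event output so the reporting component and host are visible:

$ kubectl logs -n development -f job/development-scale-down-manual-xwp1k
$ kubectl get events -n development -o wide | grep xwp1k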

I noticed that the image pull initially takes a few seconds (40); could that contribute to exceeding the startingDeadlineSeconds or some other part of the cron spec?

Any ideas or help are appreciated, thank you.

Read the logs! They always help.

Context

For context: the job itself scales an AKS node pool. We have two pools, the default system one and a new user-controlled one. The cronjob is meant to scale the new user pool, not the system pool.
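As a rough illustration only (the actual contents of myimage are not shown here), the kind of Azure CLI call involved would look something like this, using the environment variables listed in the Job description; treat it purely as a sketch:

$ az login --service-principal -u "$SP_APP_ID" -p "$SP_PASSWORD" --tenant "$SP_TENANT"
$ az aks nodepool scale \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --name "$NODEPOOL_NAME" \
    --node-count "$NODEPOOL_SIZE"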

Investigation

I noticed that the scale-down job always takes longer than the scale-up job, because an image pull always happens when the scale-down job runs.

I also noticed that the Killing event mentioned above comes from the kubelet (seen with kubectl get events -o wide).

I went to check the kubelet logs on the host and realised the hostname was a bit atypical (aks-burst-XXXXXXXX-vmss00000d): most hosts in our small development cluster end in a number, not a d.

I then realised the name was different because this node was not part of the default nodepool, and I could not check the kubelet logs because the host had already been deleted.

The job scales down compute. The scale-down was failing because it is always preceded by a scale-up, at which point there is a brand-new node in the cluster. Nothing is running on that node yet, so the next job gets scheduled onto it. The job starts on the new node, tells Azure to scale the new node down to 0, and the kubelet then kills the job while it is running.
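A quick way to confirm a situation like this is to check which node the job's pod was scheduled onto (the NODE column below) before that node disappears, for example:

$ kubectl get pods -n development -o wide | grep development-scale-down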

Always being scheduled onto the new node also explains why the image pull happened every time.

Fix

I changed the spec and added a nodeSelector so that the job always runs on the system pool, which is more stable than the user pool:

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: development-scale-down
  namespace: development
spec:
  schedule: "0 22 * * 0-4"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 60
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0 # Do not retry
      activeDeadlineSeconds: 360 # 6 minutes
      template:
        spec:
          containers:
            - name: scaler
              image: myimage:latest
              imagePullPolicy: Always
              env: ...
          restartPolicy: "Never"
          nodeSelector:
            agentpool: default
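The agentpool value used in the nodeSelector can be checked against the labels AKS puts on each node (the label key is standard on AKS, but the pool names are specific to your cluster):

$ kubectl get nodes -L agentpool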

Our infrastructure hit the same problem.

In our case, the root cause was the cluster autoscaler killing and removing the job in order to scale the cluster down and free up one (or more) nodes.

We solved it with the "safe-to-evict" annotation, which prevents k8s from killing the job because of the autoscaler.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md

In our case, we put the blame on the autoscaler after looking at the k8s 'events' for that namespace.
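For reference, a minimal sketch of where such an annotation would sit in a CronJob spec like the one above, using the annotation key documented in the cluster-autoscaler FAQ linked earlier:

jobTemplate:
  spec:
    template:
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: "false"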
