Kubernetes Pod fails with CrashLoopBackOff even though the exit code is 0 in Airflow 2.0



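I am upgrading Airflow from version 1.10 to 2.1.0. My project uses KubernetesPodOperator to run tasks on KubernetesExecutor. Everything worked fine in Airflow 1.10. After upgrading to Airflow 2.1.0, however, the pod runs the task and, once the task completes successfully, it is restarted and ends up in CrashLoopBackOff status. I have checked the livenessProbe and it works as expected. I also checked the other logs but could not find a problem in any of the containers or pods involved.

deployment.yaml file: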

# Airflows
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  selector:
    matchLabels:
      app: airflow
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow
    spec:
      hostAliases:
      - ip: "xx.xx.xx.xx"
        hostnames:
        - "xxx.xxx.xxx"
      initContainers:
      - name: init-db
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        command:
          - "/bin/sh"
        args:
          - "-c"
          - "/usr/local/bin/bootstrap.sh"
        env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              secretKeyRef:
                key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                name: airflow-secrets
          - name: AFPW
            valueFrom:
              secretKeyRef:
                key: AFPW
                name: airflow-secrets
      containers:
      - name: web
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        ports:
          - name: web
            containerPort: 8080
        command:
          - "airflow"
        args:
          - "webserver"
        livenessProbe:
          httpGet:
            path: /
            port: 8080
          initialDelaySeconds: 240
          periodSeconds: 60
        env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              secretKeyRef:
                key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                name: airflow-secrets
      ## The following values have been created as part of production setup
      - name: scheduler
        image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
        imagePullPolicy: Always
        command:
          - "airflow"
        args:
          - "scheduler"
        env:
          - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
            valueFrom:
              secretKeyRef:
                key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                name: airflow-secrets

Pod description:

Name:         airflow-66776dc57c-z98vd
Namespace:    default
Priority:     0
Node:         gke-gke-xxxxx-de-nodes-xxxxx--ccb62dc3-24us/xxx.xx.xx.xx
Start Time:   Sat, 19 Jun 2021 17:49:16 +0000
Labels:       app=airflow
              pod-template-hash=66776dc57c
Annotations:  <none>
Status:       Running
IP:           xxx.xx.xx.xx
IPs:
  IP:           xxx.xx.xx.xx
Controlled By:  ReplicaSet/airflow-66776dc57c
Init Containers:
  init-db:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      /usr/local/bin/bootstrap.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jun 2021 17:50:04 +0000
      Finished:     Sat, 19 Jun 2021 17:50:23 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Containers:
  web:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      airflow
    Args:
      webserver
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:24 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8080/ delay=240s timeout=1s period=60s #success=1 #failure=3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
  scheduler:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      airflow
    Args:
      scheduler
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:25 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-kw529:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw529
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s

Worker pod list and logs

restartPolicy: Always

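Always means that the container will be restarted even if it exits with a zero exit code (that is, exits successfully). You can explicitly specify restartPolicy: Never. The default is Always.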
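As a minimal sketch of where that setting could go, assuming the worker pods are created from a pod template file (the `pod_template_file` option that Airflow 2.x supports for KubernetesExecutor): the template name and image below are hypothetical placeholders, not values from the question.

```yaml
# Hypothetical pod template for KubernetesExecutor worker pods.
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template          # placeholder name (assumption)
spec:
  restartPolicy: Never                   # do not restart the container after it exits, even with exit code 0
  containers:
    - name: base                         # Airflow 2.x expects the worker container to be named "base"
      image: "example/airflow-dags:latest"   # hypothetical image, replace with your own
```

Note that the pod template of a Deployment only accepts restartPolicy: Always, so this setting applies to the worker pods and not to the airflow Deployment shown in the question.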

See "Why does starting daskdev/dask in a Pod fail?" for an almost identical case.
