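I am upgrading Airflow from version 1.10 to 2.1.0. My project uses KubernetesPodOperator to run tasks on KubernetesExecutor. Everything ran fine in Airflow 1.10. After upgrading to Airflow 2.1.0, the pods are still able to run their tasks, but after completing successfully they restart and end up in CrashLoopBackOff status. I have checked the livenessProbe, and it works as expected. I have also checked the other logs, but could not find any problem in any of the containers or pods listed.
deployment.yaml: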
# Airflows
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow
spec:
  selector:
    matchLabels:
      app: airflow
  replicas: 1
  template:
    metadata:
      labels:
        app: airflow
    spec:
      hostAliases:
        - ip: "xx.xx.xx.xx"
          hostnames:
            - "xxx.xxx.xxx"
      initContainers:
        - name: init-db
          image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
          imagePullPolicy: Always
          command:
            - "/bin/sh"
          args:
            - "-c"
            - "/usr/local/bin/bootstrap.sh"
          env:
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                  name: airflow-secrets
            - name: AFPW
              valueFrom:
                secretKeyRef:
                  key: AFPW
                  name: airflow-secrets
      containers:
        - name: web
          image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
          imagePullPolicy: Always
          ports:
            - name: web
              containerPort: 8080
          command:
            - "airflow"
          args:
            - "webserver"
          livenessProbe:
            httpGet:
              path: /
              port: 8080
            initialDelaySeconds: 240
            periodSeconds: 60
          env:
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                  name: airflow-secrets
        ## The following values have been created as part of production setup
        - name: scheduler
          image: "{{ .Values.dags_image.repository }}:{{ .Values.dags_image.tag }}"
          imagePullPolicy: Always
          command:
            - "airflow"
          args:
            - "scheduler"
          env:
            - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              valueFrom:
                secretKeyRef:
                  key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
                  name: airflow-secrets
Pod description (kubectl describe pod):
Name:           airflow-66776dc57c-z98vd
Namespace:      default
Priority:       0
Node:           gke-gke-xxxxx-de-nodes-xxxxx--ccb62dc3-24us/xxx.xx.xx.xx
Start Time:     Sat, 19 Jun 2021 17:49:16 +0000
Labels:         app=airflow
                pod-template-hash=66776dc57c
Annotations:    <none>
Status:         Running
IP:             xxx.xx.xx.xx
IPs:
  IP:           xxx.xx.xx.xx
Controlled By:  ReplicaSet/airflow-66776dc57c
Init Containers:
  init-db:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
    Args:
      -c
      /usr/local/bin/bootstrap.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jun 2021 17:50:04 +0000
      Finished:     Sat, 19 Jun 2021 17:50:23 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Containers:
  web:
    Container ID:   xxxxxxxxx
    Image:          xxxxxxxxx
    Image ID:       xxxxxxxxx
    Port:           8080/TCP
    Host Port:      0/TCP
    Command:
      airflow
    Args:
      webserver
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:24 +0000
    Ready:          True
    Restart Count:  0
    Liveness:       http-get http://:8080/ delay=240s timeout=1s period=60s #success=1 #failure=3
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
  scheduler:
    Container ID:  xxxxxxxxx
    Image:         xxxxxxxxx
    Image ID:      xxxxxxxxx
    Port:          <none>
    Host Port:     <none>
    Command:
      airflow
    Args:
      scheduler
    State:          Running
      Started:      Sat, 19 Jun 2021 17:50:25 +0000
    Ready:          True
    Restart Count:  0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-kw529 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  default-token-kw529:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-kw529
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Worker pod list and logs:
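restartPolicy: Always means that a container is restarted even when it exits with exit code zero (i.e., successfully). You can specify restartPolicy: Never explicitly; the default is Always.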
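A minimal sketch of where that setting could go, assuming the pods that keep restarting are the KubernetesExecutor worker pods built from a pod template file (in Airflow 2.1 this file is referenced by the pod_template_file option in the [kubernetes] config section). restartPolicy is a pod-level field, so it sits under spec rather than under a container, and it applies to every container in the pod. It cannot be set to Never in the Deployment above, because a Deployment's pod template only accepts restartPolicy: Always. The pod name and image below are hypothetical placeholders; the secret reuses the one from the Deployment.

```yaml
# Hypothetical pod template for KubernetesExecutor worker pods (Airflow 2.1),
# assumed to be referenced via [kubernetes] pod_template_file
# (environment variable: AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE).
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template  # placeholder; Airflow generates the actual worker pod names
spec:
  # Pod-level field, shared by all containers in the pod.
  # With Never, a task container that exits with code 0 is not restarted,
  # so the pod finishes as Succeeded instead of looping into CrashLoopBackOff.
  restartPolicy: Never
  containers:
    - name: base  # KubernetesExecutor runs the task in the container named "base"
      image: "your-registry/airflow-dags:tag"  # hypothetical placeholder; use the same image as the Deployment
      imagePullPolicy: Always
      env:
        - name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
          valueFrom:
            secretKeyRef:
              key: AIRFLOW__CORE__SQL_ALCHEMY_CONN
              name: airflow-secrets
```

After the change, the field on a running worker pod can be checked with kubectl get pod <pod-name> -o yaml and looking for restartPolicy under spec.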
See "Why does starting daskdev/dask into a Pod fail?" for an almost identical issue.