I have an Ansible-based operator running in an OpenShift 4.2 cluster.
Most of the time, the operator works fine when I apply the relevant CR.
Occasionally, though, the operator hangs without reporting any further logs.
The step at which this happens is always the same, but the hang occurs inconsistently and without any other apparent contributing factor, so I am not sure how to diagnose it.
Restarting the operator always fixes the problem, but I would like to know whether there is anything I can do to diagnose it and prevent the hang from happening in the first place.
- name: allow Pods to reference images in myproject project
  k8s:
    definition:
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: "system:image-puller-{{ meta.name }}"
        namespace: myproject
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:image-puller
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: "system:serviceaccounts:{{ meta.name }}"
The operator's log simply hangs after the step above and before the next step:
- name: fetch some-secret
  set_fact:
    some_secret: "{{ lookup('k8s', kind='Secret', namespace='myproject', resource_name='some-secret') }}"
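One thing that might bound a step like this is a task-level timeout, so that a hung API call fails fast instead of blocking the reconcile loop indefinitely. This is only a hypothetical sketch: the `timeout` keyword requires ansible-core 2.10 or later (older Ansible would need `async`/`poll` on a regular task instead), and the value of 60 seconds is an arbitrary assumption:

```yaml
# Hypothetical sketch, not from the original playbook: fail the task if
# it has not completed within 60 seconds rather than hanging forever.
# `timeout` is a task keyword available in ansible-core >= 2.10.
- name: fetch some-secret
  set_fact:
    some_secret: "{{ lookup('k8s', kind='Secret', namespace='myproject', resource_name='some-secret') }}"
  timeout: 60
```

A timed-out task fails rather than hangs, which at least surfaces the problem in the operator logs and lets retries or `rescue` blocks take over.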
The output of `oc describe` for the operator pod is below:
oc describe -n openshift-operators pod my-ansible-operator-849b44d6cc-nr5st
Name:               my-ansible-operator-849b44d6cc-nr5st
Namespace:          openshift-operators
Priority:           0
PriorityClassName:  <none>
Node:               worker1.openshift.mycompany.com/10.0.8.21
Start Time:         Wed, 10 Jun 2020 22:35:45 +0100
Labels:             name=my-ansible-operator
                    pod-template-hash=849b44d6cc
Annotations:        k8s.v1.cni.cncf.io/networks-status:
                      [{
                          "name": "openshift-sdn",
                          "interface": "eth0",
                          "ips": [
                              "10.254.20.128"
                          ],
                          "default": true,
                          "dns": {}
                      }]
Status:             Running
IP:                 10.254.20.128
Controlled By:      ReplicaSet/my-ansible-operator-849b44d6cc
Containers:
  ansible:
    Container ID:  cri-o://63b86ddef4055be4bcd661a3fcd70d525f9788cb96b7af8dd383ac08ea670047
    Image:         image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator:v0.0.1
    Image ID:      image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator@sha256:fda68898e6fe0c61760fe8c50fd0a55de392e63635c5c8da47fdb081cd126b5a
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/ao-logs
      /tmp/ansible-operator/runner
      stdout
    State:          Running
      Started:      Wed, 10 Jun 2020 22:35:56 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /tmp/ansible-operator/runner from runner (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from my-ansible-operator-token-vbwlr (ro)
  operator:
    Container ID:  cri-o://365077a3c1d83b97428d27eebf2f0735c9d670d364b16fad83fff5bb02b479fe
    Image:         image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator:v0.0.1
    Image ID:      image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator@sha256:fda68898e6fe0c61760fe8c50fd0a55de392e63635c5c8da47fdb081cd126b5a
    Port:          <none>
    Host Port:     <none>
    State:          Running
      Started:      Wed, 10 Jun 2020 22:35:57 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      WATCH_NAMESPACE:    openshift-operators (v1:metadata.namespace)
      POD_NAME:           my-ansible-operator-849b44d6cc-nr5st (v1:metadata.name)
      OPERATOR_NAME:      my-ansible-operator
      ANSIBLE_GATHERING:  explicit
    Mounts:
      /tmp/ansible-operator/runner from runner (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from my-ansible-operator-token-vbwlr (ro)
Conditions:
  Type             Status
  Initialized      True
  Ready            True
  ContainersReady  True
  PodScheduled     True
Volumes:
  runner:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  my-ansible-operator-token-vbwlr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-ansible-operator-token-vbwlr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>
Is there anything else I can do to diagnose the problem further, or to prevent the operator from hanging occasionally?
I found a very similar issue in the operator-sdk repository, which links to the root cause in the Ansible k8s module: Ansible 2.7 stuck on Python 3.7 in docker-ce.
From the discussion in that issue, the problem appears to be related to tasks that never time out, and the current workaround seems to be:
For now we just override the Ansible local connection and normal action plugins, so that:
- all communicate() calls have a 60 second timeout
- all raised TimeoutExpired exceptions are retried a few times
Could you check whether this resolves your problem? Since the issue is still open, you may also want to follow up on the issue itself.
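The mechanism behind that workaround can be sketched in plain Python: wrap the subprocess `communicate()` call in a timeout and retry when it expires. This is an illustrative sketch of the idea, not the actual plugin-override code from the issue; the function name, timeout, and retry count are assumptions:

```python
# Illustrative sketch of the workaround: give communicate() a timeout
# and retry the command a few times when the timeout expires, instead
# of blocking forever on a hung child process.
import subprocess


def run_with_timeout(cmd, timeout=60, retries=3):
    """Run cmd, retrying whenever communicate() times out."""
    for attempt in range(retries):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        try:
            out, err = proc.communicate(timeout=timeout)
            return proc.returncode, out, err
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.communicate()  # reap the killed process before retrying
    raise RuntimeError(f"command timed out after {retries} attempts")


rc, out, _ = run_with_timeout(["echo", "hello"])
print(rc, out.decode().strip())  # → 0 hello
```

The key point is that a hang is converted into a bounded failure that can be retried, which is exactly what the overridden connection and action plugins do for the operator's Ansible tasks.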