Ansible-based operator on OpenShift hangs inconsistently



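I have an Ansible-based operator running in an OpenShift 4.2 cluster.

Most of the time, when I apply the relevant CR, the operator runs fine.

Occasionally, the operator hangs without reporting any further logs.

The step at which this happens is always the same, but the hang itself occurs inconsistently, with no other apparent contributing factor, and I am not sure how to diagnose it.

Restarting the operator always resolves the problem, but I would like to know whether there is anything I can do to diagnose it and prevent it from happening altogether.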

- name: allow Pods to reference images in myproject project
  k8s:
    definition:
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: "system:image-puller-{{ meta.name }}"
        namespace: myproject
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: system:image-puller
      subjects:
        - apiGroup: rbac.authorization.k8s.io
          kind: Group
          name: "system:serviceaccounts:{{ meta.name }}"

The operator's logs simply stop after the step above and before the next step:

- name: fetch some-secret
  set_fact:
    some_secret: "{{ lookup('k8s', kind='Secret', namespace='myproject', resource_name='some-secret') }}"

The oc describe output is as follows:

oc describe -n openshift-operators pod my-ansible-operator-849b44d6cc-nr5st
Name:               my-ansible-operator-849b44d6cc-nr5st
Namespace:          openshift-operators
Priority:           0
PriorityClassName:  <none>
Node:               worker1.openshift.mycompany.com/10.0.8.21
Start Time:         Wed, 10 Jun 2020 22:35:45 +0100
Labels:             name=my-ansible-operator
                    pod-template-hash=849b44d6cc
Annotations:        k8s.v1.cni.cncf.io/networks-status:
                      [{
                          "name": "openshift-sdn",
                          "interface": "eth0",
                          "ips": [
                              "10.254.20.128"
                          ],
                          "default": true,
                          "dns": {}
                      }]
Status:             Running
IP:                 10.254.20.128
Controlled By:      ReplicaSet/my-ansible-operator-849b44d6cc
Containers:
  ansible:
    Container ID:  cri-o://63b86ddef4055be4bcd661a3fcd70d525f9788cb96b7af8dd383ac08ea670047
    Image:         image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator:v0.0.1
    Image ID:      image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator@sha256:fda68898e6fe0c61760fe8c50fd0a55de392e63635c5c8da47fdb081cd126b5a
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/local/bin/ao-logs
      /tmp/ansible-operator/runner
      stdout
    State:          Running
      Started:      Wed, 10 Jun 2020 22:35:56 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /tmp/ansible-operator/runner from runner (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from my-ansible-operator-token-vbwlr (ro)
  operator:
    Container ID:   cri-o://365077a3c1d83b97428d27eebf2f0735c9d670d364b16fad83fff5bb02b479fe
    Image:          image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator:v0.0.1
    Image ID:       image-registry.openshift-image-registry.svc:5000/openshift-operators/my-ansible-operator@sha256:fda68898e6fe0c61760fe8c50fd0a55de392e63635c5c8da47fdb081cd126b5a
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 10 Jun 2020 22:35:57 +0100
    Ready:          True
    Restart Count:  0
    Environment:
      WATCH_NAMESPACE:    openshift-operators (v1:metadata.namespace)
      POD_NAME:           my-ansible-operator-849b44d6cc-nr5st (v1:metadata.name)
      OPERATOR_NAME:      my-ansible-operator
      ANSIBLE_GATHERING:  explicit
    Mounts:
      /tmp/ansible-operator/runner from runner (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from my-ansible-operator-token-vbwlr (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  runner:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  my-ansible-operator-token-vbwlr:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-ansible-operator-token-vbwlr
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>

What else can I do to diagnose the problem further, or to prevent the operator from hanging occasionally?

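I found a very similar issue in the operator-sdk repository, which links to a root cause in the Ansible k8s module: Ansible 2.7 stalling on Python 3.7 in docker-ce

Judging from the discussion in that issue, the problem appears to be related to tasks that never time out, and the current workaround seems to be:

For now we just override the ansible local connection and normal action plugins so that:

  • all communicate() calls have a 60-second timeout
  • all raised TimeoutExpired exceptions are retried a few times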

Can you check whether this resolves your problem? Since the issue is still open, you may also want to follow up on it there.
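
To make the quoted workaround more concrete, here is a minimal sketch of the general pattern it describes: a bounded wait on a child process plus a few retries when the wait times out. This is not the code from the linked issue and not the actual Ansible plugin override; the helper name run_with_timeout and its retries parameter are hypothetical and chosen only for illustration. Only the 60-second default comes from the quote.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout=60, retries=3):
    """Run cmd, retrying when communicate() exceeds the timeout.

    Hypothetical helper for illustration only; the name and the
    retries default are assumptions, not part of Ansible.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(1, retries + 1):
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE
        )
        try:
            # Without a timeout this call can block forever if the
            # child process never exits.
            stdout, stderr = proc.communicate(timeout=timeout)
            return proc.returncode, stdout, stderr
        except subprocess.TimeoutExpired:
            # The child is not killed automatically on timeout:
            # kill it and reap it before retrying.
            proc.kill()
            proc.communicate()
            if attempt == retries:
                # Out of attempts: surface the timeout to the caller.
                raise


if __name__ == "__main__":
    rc, out, err = run_with_timeout([sys.executable, "-c", "print('ok')"])
    print(rc, out.strip())
```

The point of the pattern is that a stuck task fails loudly, or succeeds on a retry, instead of blocking the operator's reconcile loop indefinitely.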
