[HTCONDOR][kubernetes / k8s]: Unable to start the minicondor image inside k8s - condor_



POST EDIT

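The problem was caused by the following:

By default, the PSP (Pod Security Policy) does not allow privilege escalation for my condor user. That is why it did not work: supervisord runs as root and tries to write the logs and start the condor collector as root, instead of as another user (i.e. condor).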


The mini-condor base image does not start as expected in a Kubernetes pod on Rancher.

I am using:

  • this image: https://hub.docker.com/r/htcondor/mini in a custom namespace on Rancher (k8s)

PS: the image runs fine in:

  • a local environment
  • a default minikube installation

I am running it as a simple Deployment:

When the pod starts, the default Kubernetes log shows:

2021-09-15 09:26:36,908 INFO supervisord started with pid 1
2021-09-15 09:26:37,911 INFO spawned: 'condor_master' with pid 20
2021-09-15 09:26:37,912 INFO spawned: 'condor_restd' with pid 21
2021-09-15 09:26:37,917 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:37,924 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:38,926 INFO spawned: 'condor_master' with pid 22
2021-09-15 09:26:38,928 INFO spawned: 'condor_restd' with pid 23
2021-09-15 09:26:38,932 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:38,936 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:40,939 INFO spawned: 'condor_master' with pid 24
2021-09-15 09:26:40,943 INFO spawned: 'condor_restd' with pid 25
2021-09-15 09:26:40,947 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:40,948 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:43,953 INFO spawned: 'condor_master' with pid 26
2021-09-15 09:26:43,955 INFO spawned: 'condor_restd' with pid 27
2021-09-15 09:26:43,959 INFO exited: condor_restd (exit status 127; not expected)
2021-09-15 09:26:43,968 INFO gave up: condor_restd entered FATAL state, too many start retries too quickly
2021-09-15 09:26:43,969 INFO exited: condor_master (exit status 4; not expected)
2021-09-15 09:26:44,970 INFO gave up: condor_master entered FATAL state, too many start retries too quickly
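
To see at a glance which program fails with which status, the supervisord lines above can be tallied. This is only a minimal sketch and not part of the original setup: the function name `summarize_exits` is hypothetical, and it assumes the log format shown above. As a general convention, 127 is the status a shell returns for "command not found", so condor_restd may be failing for a different reason than condor_master; that reading is an assumption, not something the log confirms.

```shell
#!/bin/sh
# summarize_exits (hypothetical helper): reads supervisord log lines on stdin
# and prints "<program> <exit status> <count>" for every unexpected exit.
summarize_exits() {
    sed -n 's/.*INFO exited: \([^ ]*\) (exit status \([0-9]*\); not expected).*/\1 \2/p' |
        sort | uniq -c | awk '{ print $2, $3, $1 }'
}

# Possible usage against a running pod (the pod name is a placeholder):
#   kubectl logs <pod-name> | summarize_exits
```

Grouping by program and status makes it easier to spot that the two daemons exit with different codes and may need to be debugged separately.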

Here is a short list of commands and their output:

CMD            Output
condor_status  CEDAR:6001:Failed to connect to <127.0.0.1:9618>
condor_master  ERROR "Cannot open log file '/var/log/condor/MasterLog'" at line 174 in file /var/lib/condor/execute/slot1/dir_17406/userdir/.tmpruBd6F/BUILD/condor-9.0.5/src/condor_utils/dprintf_setup.cpp
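
The "Cannot open log file" error suggests that the process cannot write to /var/log/condor as the user it ends up running as. A quick way to check this from a shell inside the container might look like the sketch below. The helper name `check_writable` is hypothetical, and the check only reflects the user that runs it, so it should be run as the same user as the daemon.

```shell
#!/bin/sh
# check_writable (hypothetical helper): for each directory given, report
# whether it exists and is writable by the current user.
check_writable() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            echo "OK $dir"
        else
            echo "NOT WRITABLE $dir"
        fi
    done
}

# Possible usage inside the pod:
#   id                                 # shows the effective user and groups
#   check_writable /var/log/condor /tmp
```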

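Cause of the problem

The problem was caused by the following:

By default, the PSP (Pod Security Policy) does not allow privilege escalation for my condor user.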

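Solution

The best solution I have found so far is to run everything as the condor user and to give that user the permissions it needs. To do this, you need to: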

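  • In supervisord.conf: run supervisord as the condor user
  • In supervisord.conf: put the log, pid and socket files in /tmp
  • In the Dockerfile: change the owner of most folders to condor
  • In deployment.yaml: set the user ID to 64 (the condor user)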

Dockerfile

FROM htcondor/mini:9.2-el7
# SET WORKDIR
WORKDIR /home/condor/
RUN chown condor:condor /home/condor
# COPY SUPERVISOR
COPY supervisord.conf /etc/supervisord.conf
# Need to run the cmd to create all dir
RUN condor_master
# FIX PERMISSION ISSUES FOR RANCHER
RUN chown -R condor:condor /var/log/ /tmp && \
    chown -R restd:restd /home/restd && \
    chmod 755 -R /home/restd

supervisord.conf:

[supervisord]
user=condor
nodaemon=true
logfile = /tmp/supervisord.log
directory = /tmp
pidfile = /tmp/supervisord.pid
childlogdir = /tmp
# next 3 sections contain using supervisorctl to manage daemons
[unix_http_server]
file=/tmp/supervisord.sock
chown=condor:condor
chmod=0777
user=condor
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisord.sock
[program:condor_master]
user=condor
command=/usr/sbin/condor_master -f
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile = /var/log/condor_master.log
stderr_logfile = /var/log/condor_master.error.log
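
For supervisorctl to reach supervisord, the `serverurl` in `[supervisorctl]` has to point at the same socket as `file` in `[unix_http_server]`. The following is a rough sketch of a consistency check. The helper name `check_supervisor_socket` is hypothetical, and it assumes a simple config in which each of the two keys appears once at the start of a line.

```shell
#!/bin/sh
# check_supervisor_socket (hypothetical helper): compares the socket path in
# [unix_http_server] (file=) with the URL in [supervisorctl] (serverurl=).
check_supervisor_socket() {
    conf=$1
    # "^file" does not match logfile= or pidfile=, only the socket entry.
    server_file=$(sed -n 's/^file *= *//p' "$conf" | head -n 1)
    client_url=$(sed -n 's/^serverurl *= *//p' "$conf" | head -n 1)
    if [ "unix://$server_file" = "$client_url" ]; then
        echo "match"
    else
        echo "mismatch: $server_file vs $client_url"
    fi
}

# Possible usage:
#   check_supervisor_socket /etc/supervisord.conf
```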

deployment.yaml

apiVersion: apps/v1
kind: Deployment
# Abbreviated: in a complete Deployment, containers sits under spec.template.spec
spec:
  containers:
  - image: <condor-image>
    imagePullPolicy: Always
    name: htcondor-exporter
    ports:
    - containerPort: 8080
      name: myport
      protocol: TCP
    resources: {}
    securityContext:
      capabilities: {}
      runAsNonRoot: false
      runAsUser: 64
    stdin: true
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    tty: true
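
Once the pod is up, it can be worth confirming that the container really runs with the UID set in `runAsUser`. The sketch below is one way to do that from a shell inside the container. The helper name `running_as` is hypothetical, and 64 is the condor UID used in this post.

```shell
#!/bin/sh
# running_as (hypothetical helper): succeeds if the current process runs with
# the expected numeric UID, otherwise prints both values and returns 1.
running_as() {
    expected_uid=$1
    actual_uid=$(id -u)
    if [ "$actual_uid" = "$expected_uid" ]; then
        echo "running as UID $actual_uid"
    else
        echo "expected UID $expected_uid, got $actual_uid"
        return 1
    fi
}

# Possible usage inside the pod (for example after
# "kubectl exec -it <pod-name> -- sh", with a placeholder pod name):
#   running_as 64
```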
