I have a Gearman worker in a shell script, which is started with perp as follows:
runuid -s gds
/usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel
-- xargs /home/gds/gds-rel-worker.sh < /dev/null 2>/dev/null
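The worker only does some input validation and then calls another shell script, run.sh, which invokes bash, curl, Terragrunt, Terraform, Ansible, and gcloud to provision and update resources in GCP, like this: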
./run.sh --release 1.2.3 2>&1 >> /var/log/gds-release
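One detail of that command worth spelling out is the order of its two redirections, since the shell processes them left to right. The following is only a minimal sketch of that behaviour, not a diagnosis: `emit` is a hypothetical stand-in for run.sh, and the log is a temporary file rather than /var/log/gds-release.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for run.sh: one line on stdout, one on stderr.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

log=$(mktemp)

# Same order as in the worker: stderr is first duplicated onto the
# *current* stdout (here, the command-substitution pipe), and only
# then is stdout alone redirected to the log.
captured_a=$(emit 2>&1 >> "$log")

# Reversed order: stdout is sent to the log first, then stderr follows it.
captured_b=$(emit >> "$log" 2>&1)

echo "order A reached the caller: '$captured_a'"
echo "order B reached the caller: '$captured_b'"
echo "log contents:"
cat "$log"
rm -f "$log"
```

With the first order, the stderr line reaches the caller's stdout instead of the log, so in this sketch `captured_a` holds "to stderr" while `captured_b` is empty. Whether that matters for the worker depends on what is reading the worker's stdout.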
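The script is intended to run unattended. The problem I am having is that after the job completes successfully (that is, both shell scripts, run.sh and gds-rel-worker.sh, finish), the Gearman job is still executing because the child process has become a zombie (see the last line of the listing below).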
root 144748 1 0 Apr29 ? 00:00:00 perpboot -d /etc/perp
root 144749 144748 0 Apr29 ? 00:00:00 _ tinylog -k 8 -s 100000 -t -z /var/log/perp/perpd-root
root 144750 144748 0 Apr29 ? 00:00:00 _ perpd /etc/perp
root 2492482 144750 0 May14 ? 00:00:00 _ tinylog (gearmand) -k 10 -s 100000000 -t -z /var/log/perp/gearmand
gearmand 2492483 144750 0 May14 ? 00:00:08 _ /usr/sbin/gearmand -L 127.0.0.1 -p 4730 --verbose INFO --log-file stderr --keepalive --keepalive-idle 120 --keepalive-interval 120 --keepalive-count 3 --round-robin --threads 36 --worker-wakeup 3 --job-retries 1
root 2531800 144750 0 May14 ? 00:00:00 _ tinylog (gds-rel-worker) -k 10 -s 100000000 -t -z /var/log/perp/gds-rel-worker
gds 2531801 144750 0 May14 ? 00:00:00 _ /usr/bin/gearman -h 127.0.0.1 -t 1000 -w -f gds-rel -- xargs /home/gds/gds-rel-worker.sh
gds 2531880 2531801 0 May14 ? 00:00:00 _ [xargs] <defunct>
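So far I have traced the problem to run.sh, because if I replace its invocation with something simpler (for example, echo "Hello"; sleep 5), the worker does not hang. Unfortunately, I have no idea what is causing the problem. The script run.sh is quite long and complex, but it has worked fine until now. Tracing the worker process, this is what I see: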
getpid() = 2531801
write(2, "gearman: ", 9) = 9
write(2, "gearman_worker_work", 19) = 19
write(2, " : ", 3) = 3
write(2, "gearman_wait(GEARMAN_TIMEOUT) ti"..., 151) = 151
write(2, "\n", 1) = 1
sendto(5, "