Supervisor fails to restart half of the time

I am trying to deploy a Django application with uWSGI and supervisor on a machine running Debian 8.1.

When I restart it via sudo systemctl restart supervisor, it fails about half of the time.

$ root@host:/# systemctl start supervisor
    Job for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
$ root@host:/# systemctl status supervisor.service
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: failed (Result: exit-code) since Wed 2015-09-23 11:12:01 UTC; 16s ago
      Process: 21505 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 21511 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)
    Sep 23 11:12:01 host supervisor[21511]: Starting supervisor:
    Sep 23 11:12:01 host systemd[1]: supervisor.service: control process exited, code=exited status=1
    Sep 23 11:12:01 host systemd[1]: Failed to start LSB: Start/stop supervisor.
    Sep 23 11:12:01 host systemd[1]: Unit supervisor.service entered failed state.

However, there is nothing in the supervisor or uwsgi logs. Supervisor 3.0 runs uwsgi with the following configuration:

[program:uwsgi]
stopsignal=QUIT
command = uwsgi --ini uwsgi.ini
directory = /dir/
environment=ENVIRONMENT=STAGING
logfile-maxbytes = 300MB

I added stopsignal=QUIT because uWSGI ignores the default signal (SIGTERM) on stop and ends up being brutally killed with SIGKILL, leaving orphaned workers.

Is there a way for me to investigate what is happening?

EDIT:

As suggested, I tried /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start but it still fails half of the time.

 root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
    [ ok ] Stopping supervisor (via systemctl): supervisor.service.
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: inactive (dead) since Tue 2015-11-24 13:04:32 UTC; 89ms ago
      Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 23349 ExecStart=/etc/init.d/supervisor start (code=exited, status=0/SUCCESS)
    Nov 24 13:04:30 xxx supervisor[23349]: Starting supervisor: supervisord.
    Nov 24 13:04:30 xxx systemd[1]: Started LSB: Start/stop supervisor.
    Nov 24 13:04:32 xxx systemd[1]: Stopping LSB: Start/stop supervisor...
    Nov 24 13:04:32 xxx supervisor[23490]: Stopping supervisor: supervisord.
    Nov 24 13:04:32 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
    [....] Starting supervisor (via systemctl): supervisor.serviceJob for supervisor.service failed. See 'systemctl status supervisor.service' and 'journalctl -xn' for details.
     failed!
    root@host:~# /etc/init.d/supervisor stop && while /etc/init.d/supervisor status ; do sleep 1; done && /etc/init.d/supervisor start
    [ ok ] Stopping supervisor (via systemctl): supervisor.service.
    ● supervisor.service - LSB: Start/stop supervisor
       Loaded: loaded (/etc/init.d/supervisor)
       Active: failed (Result: exit-code) since Tue 2015-11-24 13:04:32 UTC; 1s ago
      Process: 23490 ExecStop=/etc/init.d/supervisor stop (code=exited, status=0/SUCCESS)
      Process: 23526 ExecStart=/etc/init.d/supervisor start (code=exited, status=1/FAILURE)
    Nov 24 13:04:32 xxx systemd[1]: supervisor.service: control process exited, code=exited status=1
    Nov 24 13:04:32 xxx systemd[1]: Failed to start LSB: Start/stop supervisor.
    Nov 24 13:04:32 xxx systemd[1]: Unit supervisor.service entered failed state.
    Nov 24 13:04:32 xxx supervisor[23526]: Starting supervisor:
    Nov 24 13:04:33 xxx systemd[1]: Stopped LSB: Start/stop supervisor.
    [ ok ] Starting supervisor (via systemctl): supervisor.service.

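This is not necessarily a supervisor bug. From your systemctl status output I see that supervisor is started through the sysv-init compatibility layer, so the failure could be in the /etc/init.d/supervisor script. That would explain why there are no errors in the supervisor logs.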

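To debug the init script, the easiest way is to add set -x as the first non-comment instruction in that file and look at the trace of the script execution in the journalctl output.

EDIT: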

I have reproduced and debugged it on a test system running Debian Sid.

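The problem is that the stop target of the supervisor init script does not check whether the daemon has really terminated; it only sends a signal if the process exists. If the daemon takes a while to shut down, a subsequent start action fails because the dying daemon is still counted as already running.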
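
To illustrate the missing check, a stop routine that waits for the process to disappear could look roughly like the sketch below. This is not the code of the Debian init script; the function name stop_and_wait, its arguments and the 10-second default are assumptions made for illustration only.

```shell
#!/bin/sh
# Sketch: send a signal, then poll until the process is really gone.
# stop_and_wait is a hypothetical helper, not part of the supervisor package.
stop_and_wait() {
    pid=$1
    timeout=${2:-10}          # seconds to wait before giving up (assumed default)

    # Nothing to do if the process is already gone.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" 2>/dev/null

    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1          # still alive: a start issued now would fail
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A start issued only after stop_and_wait returns 0 would no longer race with the dying daemon, which is the behaviour the current stop target lacks.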

I have opened a bug on the Debian Bug Tracker: http://bugs.debian.org/805920

Workaround:

You can work around the issue with the following commands:

/etc/init.d/supervisor force-stop && 
/etc/init.d/supervisor stop && 
/etc/init.d/supervisor start

force-stop makes sure supervisord has been terminated (outside of systemd).
stop makes sure systemd knows it has been terminated.
start starts it again.

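The stop after force-stop is required, otherwise systemd ignores any subsequent start request. stop and start can be combined into restart, but I have listed them separately here to show how it works.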

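I ran into this problem on Ubuntu 14.04 and tried both the latest initd script from Debian and @mnencia's solution, but neither worked for me. The force-stop solution does not terminate the program processes; they simply keep running after the supervisord command has been killed.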

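My solution was to patch the start and restart parts of the supervisor initd script. I did not want to guess a good DODTIME; I wanted supervisor to start as soon as possible after the old supervisor main process had died, so I added retry logic. Note that it is a little verbose, but you can remove the echo calls if you do not like that behaviour, and you can change the maximum time (set to 20 here).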

start)
    echo -n "Starting $DESC: "
    # Retry the start up to 20 times, one second apart,
    # while the old process is still shutting down.
    i=1
    until [ $i -ge 21 ]; do
        start-stop-daemon --start --quiet --pidfile $PIDFILE --startas $DAEMON -- $DAEMON_OPTS  && break
        echo -n -e "\nAlready running, old process still finishing? retrying ($i/20)..."
        let "i += 1"
        sleep 1
    done
    sleep 1
    if running ; then
        echo "$NAME."
    else
        echo " ERROR."
    fi
    ;;
restart)
    echo -n "Restarting $DESC: "
    # Signal the old process, then retry the start until it has gone away.
    start-stop-daemon --stop --quiet --oknodo --pidfile $PIDFILE
    i=1
    until [ $i -ge 21 ]; do
        start-stop-daemon --start --quiet --pidfile $PIDFILE --startas $DAEMON -- $DAEMON_OPTS  && break
        echo -n -e "\nAlready running, old process still finishing? retrying ($i/20)..."
        let "i += 1"
        sleep 1
    done
    echo "$NAME."
    ;;

I also changed the hashbang (first line) so that bash is used instead of sh, because I wanted to use let:

#! /bin/bash
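
If changing the interpreter is undesirable, the same retry loop can probably be written with POSIX arithmetic so that it runs under plain sh. The sketch below is one possible shape, not a tested patch for the init script; retry_until_ok and its arguments are hypothetical names, and the usage comment simply reuses the variables from the script above.

```shell
#!/bin/sh
# Sketch: retry a command once per second until it succeeds or the
# attempt limit is reached. retry_until_ok is a hypothetical helper.
retry_until_ok() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        "$@" && return 0
        printf '\nAlready running, old process still finishing? retrying (%d/%d)...' "$i" "$max"
        i=$((i + 1))          # POSIX arithmetic, no bash-only "let" needed
        sleep 1
    done
    return 1
}

# Possible use inside the start) branch, with the init script's own variables:
#   retry_until_ok 20 start-stop-daemon --start --quiet --pidfile $PIDFILE \
#       --startas $DAEMON -- $DAEMON_OPTS
```

Returning a non-zero status when every attempt fails lets the caller print an error instead of reporting success unconditionally.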
