Systemd "OnFailure=" does not start when the binary or bash script exits with an error code



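I have a systemd unit that needs to be monitored and restarted if it crashes, and some action also needs to be taken when the unit fails. I am working on an embedded system, so this needs to be robust.

In my example, we have a systemd service: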

[Unit]
Description=Demo unit
Wants=multi-user.target
OnFailure=FailHandler@%N.service

[Service]
ExecStart=/bin/bash /home/root/demo.sh
Restart=on-failure
RestartSec=1
Type=simple

The bash script I start:


#!/bin/bash
echo "Started demo.sh"
current_date=$(date)
sleep 10s
echo "${current_date} Demo was here" >> /home/root/demo.txt
# Always exit non-zero to simulate an application failure.
exit 1

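So far so good. The bash script always exits with status 1 after 10 seconds and logs the time. The problem is that FailHandler is never called in this case. This is only a demo; all the real applications are written in C++, but the behavior is the same. Now, if I manually set a wrong path to the bash file, the unit fails as well, but this time it does start the "OnFailure" part. Here is the syslog output with the correct path: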

2021-09-03T13:06:31.575094+00:00 hostname bash[1125]: Started demo.sh
2021-09-03T13:06:41.629450+00:00 hostname systemd[1]: demo.service: Main process exited, code=exited, status=1/FAILURE
2021-09-03T13:06:41.644681+00:00 hostname systemd[1]: demo.service: Failed with result 'exit-code'.
2021-09-03T13:06:41.818089+00:00 hostname systemd[1]: demo.service: Service RestartSec=100ms expired, scheduling restart.
2021-09-03T13:06:41.824005+00:00 hostname systemd[1]: demo.service: Scheduled restart job, restart counter is at 1.
2021-09-03T13:06:41.850933+00:00 hostname bash[1179]: Started demo.sh
2021-09-03T13:06:51.870376+00:00 hostname systemd[1]: demo.service: Main process exited, code=exited, status=1/FAILURE
2021-09-03T13:06:51.872611+00:00 hostname systemd[1]: demo.service: Failed with result 'exit-code'.
2021-09-03T13:06:52.117479+00:00 hostname systemd[1]: demo.service: Service RestartSec=100ms expired, scheduling restart.
2021-09-03T13:06:52.136102+00:00 hostname systemd[1]: demo.service: Scheduled restart job, restart counter is at 2.
2021-09-03T13:06:52.163865+00:00 hostname bash[1221]: Started demo.sh

Here is the output when the path is wrong:

2021-09-03T13:07:46.582269+00:00 hostnaem bash[1446]: /bin/bash: /ahome/root/daemo.sh: No such file or directory
2021-09-03T13:07:46.588715+00:00 hostnaem systemd[1]: daemo.service: Main process exited, code=exited, status=127/n/a
2021-09-03T13:07:46.590356+00:00 hostnaem systemd[1]: daemo.service: Failed with result 'exit-code'.
2021-09-03T13:07:46.694616+00:00 hostnaem systemd[1]: daemo.service: Service RestartSec=100ms expired, scheduling restart.
2021-09-03T13:07:46.701519+00:00 hostnaem systemd[1]: daemo.service: Scheduled restart job, restart counter is at 1.
2021-09-03T13:07:46.720879+00:00 hostnaem systemd[1]: daemo.service: Start request repeated too quickly.
2021-09-03T13:07:46.721405+00:00 hostnaem systemd[1]: daemo.service: Failed with result 'exit-code'.
2021-09-03T13:07:46.722723+00:00 hostnaem systemd[1]: daemo.service: Triggering OnFailure= dependencies.
2021-09-03T13:07:46.804815+00:00 hostnaem FailHandler.sh[1457]: Failed application: daemo
2021-09-03T13:07:46.822342+00:00 hostnaem bash[1457]: error: cannot stat /etc/logrotate.d/daemo: No such file or directory
2021-09-03T13:07:46.841577+00:00 hostnaem FailHandler.sh[1457]: ERROR: Failed logrotate for daemo crash
2021-09-03T13:07:46.977003+00:00 hostnaem systemd[1]: FailHandler@daemo.service: Succeeded.

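From the syslog I understand that FailHandler is started whenever the restart count reaches StartLimitBurst=1 within 100 ms, but is there a way to start it whenever the application exits with an error code?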
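
The difference between the two logs can be checked on the target. According to systemd.unit(5), OnFailure= units are activated when the unit enters the "failed" state, and a service using Restart= only enters that state after its start limit has been reached. Below is a minimal sketch for inspecting this; the helper name active_state is hypothetical, and it only parses the KEY=VALUE lines that `systemctl show` prints.

```shell
#!/bin/sh
# Hypothetical helper: reads `systemctl show` output (KEY=VALUE lines) on stdin
# and prints only the value of the ActiveState property.
active_state() {
    sed -n 's/^ActiveState=//p'
}

# Usage on the target (requires systemd; demo.service is the unit from above):
#   systemctl show demo.service -p ActiveState -p SubState -p Result -p NRestarts | active_state
```

While the service is waiting out RestartSec= between restarts, the reported state should be something other than "failed", which would match the observation that OnFailure= is not triggered on each exit.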

Thank you. I had a look at the link you sent and got it working. In my case, the solution was:

ExecStopPost=/bin/bash -c 'if [ "$$EXIT_STATUS" != 0 ]; then systemctl start FailHandler@%N.service; fi'
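
For readability, the same check could be moved out of the unit file into a small script. The following is only a sketch, not the solution actually used above: the script path /home/root/on-stop.sh and the function name service_failed are hypothetical. It relies on the $SERVICE_RESULT environment variable, which systemd sets for ExecStopPost= commands alongside $EXIT_CODE and $EXIT_STATUS (see systemd.exec(5)).

```shell
#!/bin/sh
# Hypothetical helper (e.g. /home/root/on-stop.sh), meant to be called as:
#   ExecStopPost=/home/root/on-stop.sh %N
# systemd passes SERVICE_RESULT, EXIT_CODE and EXIT_STATUS in the environment
# of ExecStopPost= commands.

# Succeeds (returns 0) when the given service result is anything but "success".
# Assumption: an empty or unset result is treated as a failure.
service_failed() {
    [ "$1" != "success" ]
}

# Act only when a unit name was passed as the first argument.
if [ -n "${1:-}" ] && service_failed "${SERVICE_RESULT:-}"; then
    # --no-block queues the handler without waiting for it to finish.
    systemctl start --no-block "FailHandler@$1.service"
fi
```

Checking $SERVICE_RESULT instead of $EXIT_STATUS is a design choice. As far as I understand the table in systemd.exec(5), a service stopped with `systemctl stop` whose process dies from SIGTERM reports SERVICE_RESULT=success with EXIT_STATUS=TERM, so a test on $EXIT_STATUS alone would also start the handler on a manual stop.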