I have a Debian box that has been running tasks with Celery and RabbitMQ for about a year. Recently I noticed tasks were not being processed, so I logged into the system and saw that Celery could not connect to RabbitMQ. I restarted rabbitmq-server, and although Celery no longer complained, it was not executing new tasks. Oddly, RabbitMQ was devouring CPU and memory. Restarting the server did not solve the problem. After spending a couple of hours looking online for a solution, to no avail, I decided to rebuild the server.
I rebuilt a new server with Debian 7.5, RabbitMQ 2.8.4, and Celery 3.1.13 (Cipater). For about an hour everything worked beautifully, until Celery started complaining again that it could not connect to RabbitMQ!
[2014-08-06 05:17:21,036: ERROR/MainProcess] consumer: Cannot connect to amqp://guest:**@127.0.0.1:5672//: [Errno 111] Connection refused.
Trying again in 6.00 seconds...
I restarted RabbitMQ with service rabbitmq-server start, and the same problem returned:
RabbitMQ started bloating again, hammering the CPU and slowly taking over all memory and swap:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21823 rabbitmq 20 0 908m 488m 3900 S 731.2 49.4 9:44.74 beam.smp
Here is the output of rabbitmqctl status:
Status of node 'rabbit@li370-61' ...
[{pid,21823},
{running_applications,[{rabbit,"RabbitMQ","2.8.4"},
{os_mon,"CPO CXC 138 46","2.2.9"},
{sasl,"SASL CXC 138 11","2.2.1"},
{mnesia,"MNESIA CXC 138 12","4.7"},
{stdlib,"ERTS CXC 138 10","1.18.1"},
{kernel,"ERTS CXC 138 10","2.15.1"}]},
{os,{unix,linux}},
{erlang_version,"Erlang R15B01 (erts-5.9.1) [source] [64-bit] [smp:8:8] [async-threads:30] [kernel-poll:true]\n"},
{memory,[{total,489341272},
{processes,462841967},
{processes_used,462685207},
{system,26499305},
{atom,504409},
{atom_used,473810},
{binary,98752},
{code,11874771},
{ets,6695040}]},
{vm_memory_high_watermark,0.3999999992280962},
{vm_memory_limit,414559436},
{disk_free_limit,1000000000},
{disk_free,48346546176},
{file_descriptors,[{total_limit,924},
{total_used,924},
{sockets_limit,829},
{sockets_used,3}]},
{processes,[{limit,1048576},{used,1354}]},
{run_queue,0},
Some entries from /var/log/rabbitmq:
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:35 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
Mnesia('rabbit@li370-61'): ** WARNING ** Mnesia is overloaded: {dump_log,
write_threshold}
=INFO REPORT==== 8-Aug-2014::00:11:36 ===
vm_memory_high_watermark set. Memory used:422283840 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:36 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:11:43 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 8-Aug-2014::00:11:44 ===
vm_memory_high_watermark clear. Memory used:290424384 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:44 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'
=INFO REPORT==== 8-Aug-2014::00:11:59 ===
vm_memory_high_watermark set. Memory used:414584504 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:11:59 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:12:00 ===
vm_memory_high_watermark clear. Memory used:411143496 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:12:00 ===
memory resource limit alarm cleared on node 'rabbit@li370-61'
=INFO REPORT==== 8-Aug-2014::00:12:01 ===
vm_memory_high_watermark set. Memory used:415563120 allowed:414559436
=WARNING REPORT==== 8-Aug-2014::00:12:01 ===
memory resource limit alarm set on node 'rabbit@li370-61'.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 8-Aug-2014::00:12:07 ===
Server startup complete; 0 plugins started.
=ERROR REPORT==== 8-Aug-2014::00:15:32 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
50000000,46946492416,100,10000,
#Ref<0.0.1.79456>,false}
** Reason for termination ==
** {unparseable,[]}
=INFO REPORT==== 8-Aug-2014::00:15:37 ===
Disk free limit set to 50MB
=ERROR REPORT==== 8-Aug-2014::00:16:03 ===
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"/var/lib/rabbitmq/mnesia/rabbit@li370-61",
50000000,46946426880,100,10000,
#Ref<0.0.1.80930>,false}
** Reason for termination ==
** {unparseable,[]}
=INFO REPORT==== 8-Aug-2014::00:16:05 ===
Disk free limit set to 50MB
UPDATE: The problem seemed to be resolved after installing the latest version of RabbitMQ (3.3.4-1) from the rabbitmq.com repository. Originally I had installed the one from the Debian repositories (2.8.4). So far rabbitmq-server has been working smoothly. I will update this post if the problem comes back.
UPDATE: Unfortunately, after about 24 hours or so the problem reappeared: RabbitMQ shut down, and restarting the process made it consume resources until it shut down again within minutes.
Finally I found the solution. These posts helped me figure it out: RabbitMQ on EC2 Consuming Tons of CPU and https://serverfault.com/questions/337982/how-do-i-restart-rabbitmq-after-switching-machines
What was happening was that RabbitMQ was holding on to all the task results, which were never freed, to the point that it became overloaded. I cleared all the stale data in /var/lib/rabbitmq/mnesia/rabbit/, restarted RabbitMQ, and it now works fine.
My solution, to make sure this does not happen again, was to disable result storage with CELERY_IGNORE_RESULT = True in the Celery configuration file.
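A minimal sketch of what that looks like in a Celery 3.x config module (the broker URL is an assumption based on the logs above; the commented-out alternative is for readers who do need results):

```python
# celeryconfig.py -- minimal sketch of a Celery 3.x configuration that
# avoids piling task results up in RabbitMQ.

# Broker location assumed from the connection string in the error logs.
BROKER_URL = "amqp://guest:guest@127.0.0.1:5672//"

# Do not store task results at all. With the amqp result backend, every
# result otherwise creates a queue that lingers until it expires.
CELERY_IGNORE_RESULT = True

# If you do need results, consider setting an expiry instead so stale
# result queues get cleaned up (one hour here, chosen arbitrarily):
# CELERY_TASK_RESULT_EXPIRES = 3600
```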
You can also reset the queues:
Warning: This clears all data and configuration!
sudo service rabbitmq-server start
sudo rabbitmqctl stop_app
sudo rabbitmqctl reset
sudo rabbitmqctl start_app
If your system is unresponsive, you may need to run these commands right after a reboot.
You are running out of memory resources because of Celery. I ran into a similar problem, and it was an issue with the queues used by the Celery result backend.
You can check how many queues there are with the rabbitmqctl list_queues command, and pay attention if that number keeps growing forever. In that case, take a look at how you are using Celery.
With Celery, if you are not retrieving the results of your tasks asynchronously, do not configure a result backend to store those unused results.
I experienced a similar issue that turned out to be caused by a rogue RabbitMQ client application. The problem seemed to be that, due to some unhandled error, the rogue app kept trying to open connections to the RabbitMQ broker. Once the client application was restarted, everything went back to normal (since the app stopped malfunctioning and therefore stopped trying to connect to RabbitMQ in an endless loop).
Another possible cause: the management plugin.
I was running RabbitMQ 3.8.1 with the management plugin enabled. On a 10-core server I saw up to 1000% CPU usage with three idle consumers, no messages being sent, and a single queue.
When I disabled the management plugin by executing rabbitmq-plugins disable rabbitmq_management, usage dropped to 0% with occasional spikes of 200%.