With roughly 0.9 million messages in RabbitMQ, the Celery workers stopped processing tasks. After killing Celery and starting it again, processing resumed. RabbitMQ never ran out of memory. Nothing suspicious in any logs or statuses except:
** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
from /var/log/rabbitmq/rabbit.log. Similar symptoms had appeared before with around 1.6m messages enqueued.
More info:
Celery concurrency: 4
RAM installed: 4GB
Swap space: 8GB
disk_free_limit (Rabbit): 8GB
vm_memory_high_watermark: 2
vm_memory_high_watermark_paging_ratio: 0.75
How can the actual cause of the workers stopping be diagnosed, and how can it be prevented from recurring?
Thanks.
Probably you are submitting/consuming the messages from the queue too fast?
If you don't need the messages to be durable and can keep them in memory only, it will significantly improve RabbitMQ performance.
http://docs.celeryproject.org/en/latest/userguide/optimizing.html#using-transient-queues
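In a Celery 3.x config that looks roughly like this (queue and exchange names here are just placeholders):
# Sketch: a non-durable queue with transient (delivery_mode=1) messages,
# as described in the Celery optimizing guide linked above.
from kombu import Exchange, Queue

CELERY_QUEUES = (
    Queue('transient', Exchange('transient', delivery_mode=1),
          routing_key='transient', durable=False),
)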
Related
"rabbitmqctl list_connections" shows as running but on the UI in the connections tab, under client properties, i see "connection.blocked: true".
I can see that messages are in queued in RabbitMq and the connection is in idle state.
I am running Airflow with Celery. My jobs are not executing at all.
Is this the reason why jobs are not executing?
How to resolve the issue so that my jobs start running
I'm experiencing the same kind of issue just using Celery.
It seems that when you have a lot of messages in the queue, and these are fairly chunky, node memory climbs until the RabbitMQ memory watermark is crossed, and this triggers blocking on the consumers' connections, so no worker can access that node (and the related queues).
At the same time publishers are happily sending stuff via the exchange, so you get into a lose-lose situation.
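One way to confirm this is to ask the broker which connections it has blocked, for example via the management HTTP API (this assumes the rabbitmq_management plugin is enabled with default guest credentials on localhost; adjust to your setup):
# Sketch: list connections the broker has blocked or is blocking due to alarms.
import requests

resp = requests.get('http://localhost:15672/api/connections', auth=('guest', 'guest'))
resp.raise_for_status()
for conn in resp.json():
    if conn.get('state') in ('blocked', 'blocking'):
        print(conn.get('name'), conn.get('state'))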
The only solution we found was to avoid hitting that memory watermark and to scale up the number of consumers.
Keep messages/tasks lean so that a task signature is kilobytes rather than megabytes.
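For example, pass a database ID in the task signature and load the heavy object inside the task, instead of serializing the object itself into the message (a sketch; the task, model and helper names are made up):
# Sketch: the message payload is just an integer ID (bytes), not the full object (MB).
from celery import shared_task

@shared_task
def process_report(report_id):
    report = Report.objects.get(pk=report_id)  # hypothetical Django model lookup
    return analyze(report)                     # hypothetical heavy processing

# Enqueue with only the ID:
# process_report.delay(report.pk)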
This is what I'm getting from the RabbitMQ message broker:
=INFO REPORT==== 13-Jan-2015::12:40:24 ===
vm_memory_high_watermark set. Memory used:478063864 allowed:415868518
=WARNING REPORT==== 13-Jan-2015::12:40:24 ===
memory resource limit alarm set on node 'rabbit@matchpointgps-141110'.
* Publishers will be blocked until this alarm clears *
This has happened twice on our server.
I'm still not able to find the correct solution for this.
We had a similar issue when the queue lengths got very high and RabbitMQ tried to write the messages to disk but couldn't do it fast enough. In our testing, we did not have this problem when we used SSD drives.
The easiest solution for us was to have the messages written to disk immediately, by declaring the queues durable and publishing the messages as persistent (delivery_mode=2). This was also a good idea because if RabbitMQ restarted, the data in the queues wouldn't be lost.
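At the raw AMQP level that means roughly the following (a pika sketch with assumed localhost connection details, not tied to any particular setup above):
# Sketch with pika: a durable queue plus persistent messages, so queued data
# survives a broker restart.
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = conn.channel()
channel.queue_declare(queue='task_queue', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='task_queue',
    body=b'payload',
    properties=pika.BasicProperties(delivery_mode=2),  # mark message persistent
)
conn.close()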
Take a look at this blog post on how RabbitMQ queues use memory: http://www.rabbitmq.com/blog/2011/10/27/performance-of-queues-when-less-is-more/
TL;DR try to keep your queues as empty as possible
Finally found a better configuration for the RabbitMQ queues.
I've added the following line to the Celery config, since the result backend was creating one additional queue for each task.
CELERY_IGNORE_RESULT = True
I also created a separate queue for my task.
This keeps memory free and ready to take on heavier and longer tasks.
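Put together, the relevant part of the config looked roughly like this (Celery 3.x-style setting names; queue and task names are illustrative):
# Sketch: drop result queues entirely and route the heavy task to its own queue.
from kombu import Exchange, Queue

CELERY_IGNORE_RESULT = True

CELERY_QUEUES = (
    Queue('default', Exchange('default'), routing_key='default'),
    Queue('heavy', Exchange('heavy'), routing_key='heavy'),
)
CELERY_DEFAULT_QUEUE = 'default'
CELERY_ROUTES = {
    'myapp.tasks.long_running_task': {'queue': 'heavy', 'routing_key': 'heavy'},
}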
More information:
https://denibertovic.com/posts/celery-best-practices/
I had a similar issue with a RabbitMQ server running in Docker. Everything was blocked and RabbitMQ wouldn't accept any messages.
All I did was reconfigure the disk free space limit:
rabbitmqctl set_disk_free_limit 1GB
This changed the "xx GiB low watermark" and solved the problem.
If you are using a "bitnami/rabbitmq" Docker image, you can set this variable with:
RABBITMQ_DISK_FREE_ABSOLUTE_LIMIT: "1GB"
I am testing my Celery implementation across 3 EC2 machines right now. I am pretty confident in the implementation itself, but I am having problems with the actual worker execution. My test structure is as follows:
1 EC2 machine is designated as the broker and also runs a Celery worker
1 EC2 machine is designated as the client (it runs the client Celery script that enqueues all the tasks using .delay()) and also runs a Celery worker
1 EC2 machine is purely a worker.
All the machines have 1 Celery worker running. At first, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery@[other ec2 ip]".
The machine would be doing very little work at this point, so my Auto Scaling configuration in EC2 would automatically shut down the instance once CPU utilization dropped very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought Celery handled this) with these commands, which were run on startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my EC2 instances stalled and shut down. The weird thing is that the drift sometimes increased and then decreased.
Any idea on why this is happening?
I am using the newest version of Celery (3.1) and RabbitMQ.
EDIT: It should be noted that I am using the us-west-1a and us-west-1c availability zones on EC2.
EDIT2: I am starting to think memory might be an issue. I am using a t2.micro instance, and running 3 Celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.
My understanding of RabbitMQ durable queues with persistent messages (i.e. delivery_mode = 2) is that they are held in RAM, but that the messages are also flushed to disk so that they can be recovered in the event that the process is restarted or the machine is rebooted.
It's unclear to me though what the expected behavior is when the machine runs out of memory. If the queue gets overloaded, dies, and needs to be restored, then simply loading the messages from the disk-backed store would consume all available RAM.
Do durable queues only load a subset of the messages into RAM in this scenario?
RabbitMQ will page the messages out to disk as memory fills up. See https://www.rabbitmq.com/memory.html, section "Configuring the Paging Threshold".
I have a Celery + RabbitMQ setup with a busy Django site; in the Celery settings I have this config:
CELERY_RESULT_BACKEND = "amqp"
CELERY_AMQP_TASK_RESULT_EXPIRES = 5
I am monitoring the queues with the "watch" command. What I have observed is that whilst most of the temporary queues get deleted after a few seconds, some queues (with the same GUID) do not get deleted, and the list grows slowly, regardless of how many workers are used.
The Django site generates about 60 tasks per second; it accepts various information and uses the tasks to digest it. The whole setup runs on a 16-core CPU server with plenty of RAM. Would this still be caused by a performance issue, or is it a Celery bug?
Cheers
James