When does a celery worker acknowledge to RabbitMQ that it has a task? - rabbitmq

I might be misunderstanding how this works (which is why I'm asking), but I think that when a Celery worker consumes a task from RabbitMQ it puts a lock on it, so to speak, and then must acknowledge that it completed the task once it's done. So say I have 4 workers, all with the prefetch setting at 1, and a queue of 6 tasks that take a long time. Once I start those workers and I run:
rabbitmqctl -q list_queues name messages messages_ready messages_unacknowledged
I'd expect to see something like:
celery 6 2 4
indicating that 4 tasks are running (but not yet acknowledged) and 2 are ready to be consumed.
I think my understanding is wrong because what I actually see is:
celery 2 0 2
So it's as if the acknowledging happens when a message is received by a worker, but before that worker finishes processing that task.
So to sum up, my question is, when does a celery worker acknowledge it has a task? It seems like it's once it receives that task and starts working on it, not when it completes working on it. Can someone confirm?

This is mentioned in the FAQ, but I can't blame you for not finding it:
http://docs.celeryproject.org/en/latest/faq.html#should-i-use-retry-or-acks-late
The default behavior of early ack is there because we don't want to force users
to write idempotent tasks.
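If you want the acknowledgement to be sent only after the task finishes (which, with the prefetch setting of 1 from the question, should produce the expected 6 2 4 picture), you can opt in per task with acks_late. A minimal sketch, assuming a placeholder app name, broker URL, and task body:
from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')

@app.task(acks_late=True)   # the ack is sent after the task returns, not when it is delivered
def long_task():
    ...                     # placeholder for the long-running work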

Related

Celery - automatic retrying of long running tasks running on crashed worker

I'm using Celery with Django over Redis.
Some of my tasks are quite long, taking about 1 hour. I'm aware that this is suboptimal, and preferably I should use shorter tasks, but this is what I got...
Sometimes the task/worker crashes. This can happen for various reasons: the worker itself crashed, there was a network problem, the spot instance was preempted, the process was killed by the OOM killer, or some other unexpected cause that I can't "catch" and handle.
I want to make sure the task will be tried again as fast as possible.
I can use acks_late, but the problem is that this task has a very long timeout (about 90 minutes), which means that if the task started and the worker crashed after 2 minutes, I would now wait another 88 minutes until the task goes back to the queue and starts executing again on another worker.
I'm wondering whether there is another solution that will notice the worker has "disappeared" and put the task back in the queue?
You could give task_reject_on_worker_lost a try... It is a tricky one, but have a look...
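A minimal sketch of the two relevant Celery 4.x settings (the option names are real Celery settings; whether they cover every crash scenario in your setup is something to test, and keep in mind the task may then run more than once, so it needs to be idempotent):
task_acks_late = True               # acknowledge only after the task has finished
task_reject_on_worker_lost = True   # re-queue the task if the worker process executing it dies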

Processing of tasks in celery workers are getting delayed

Using Celery, I have created listeners on Redis to capture all write events. Based on those events, I trigger Celery tasks to migrate data from Redis to the DB.
I'm using the eventlet pool with a concurrency of 1000. I also have 5 Celery queues for processing my data.
celery -A proj worker -l info -P eventlet -c 1000 -Q event_queue,vap_queue,client_queue,group_queue,ap_queue
The problem I'm facing is that the listener receives all the write events from Redis and the workers receive tasks from the listener, but the Celery workers fall behind when processing a large volume of data (for example, I receive about 800 tasks per 10 seconds for each queue).
I have tried increasing the concurrency to higher values, changing the pool from eventlet to gevent, and setting the prefetch multiplier to 1. Still, my workers are slow to complete tasks.
Can anyone help to solve this? I'm new to celery actually :)
Sometimes concurrency is not the main factor in speeding up how these tasks are consumed and processed.
In fact, too much concurrency can lead to many context switches and slow things down. Monitor your server's CPU and memory to check that they are not getting overwhelmed by the tasks, and find an optimum number.
For CPU-bound tasks I would prefer more worker processes over concurrent threads; for I/O-bound tasks concurrent threads (eventlet/gevent) are a good fit.
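One way to apply that advice is to split the queues by workload type and give each group its own worker pool. Using the queue names from the question (which queues are I/O-bound versus CPU-bound, and the concurrency numbers, are only illustrative):
celery -A proj worker -Q event_queue,vap_queue -P eventlet -c 200        # I/O-bound work: many green threads
celery -A proj worker -Q client_queue,group_queue,ap_queue -c 8          # CPU-bound work: prefork, roughly one process per core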

celery multiple workers but one queue

I am new to Celery and Redis.
I started my Redis server using redis-server.
Celery was run with this command:
celery -A proj worker
There are no other configurations. However, I realised that when I have a long-running job in Celery, it does not process another task that is in the queue until the long-running task is completed. My understanding is that since I have 8 cores on my CPU, I should be able to process 8 tasks concurrently, since the default value for -c is the number of cores?
Am I missing something here?
Your problem is a classic one; everybody with long-running tasks runs into it.
The root cause is that Celery tries to optimize your execution flow by reserving some tasks for each worker. But if one of those tasks is long-running, the others get stuck behind it. This is known as the 'prefetch count', and it exists because by default Celery is set up for short tasks.
Another related setting is 'late ack'. By default the worker takes a task from the queue and immediately sends an acknowledgement, and the broker then removes the task from the queue. This also means more messages will be prefetched for that worker. With 'late ack' enabled, the worker sends the acknowledgement only after the task is completed.
That's the short version; you can read more about prefetch and late ack.
As for the solution - just use these settings (celery 4.x):
task_acks_late = True
worker_prefetch_multiplier = 1
or for previous versions (2.x - 3.x):
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
Also, starting the worker with the -Ofair option has a similar effect.
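Putting it together, a minimal sketch of a Celery 4.x app with both settings applied (the app name and broker URL are placeholders):
from celery import Celery

app = Celery('proj', broker='amqp://guest@localhost//')
app.conf.task_acks_late = True            # acknowledge only after the task completes
app.conf.worker_prefetch_multiplier = 1   # prefetch at most one message per pool process

With this in place, a long-running task no longer causes other queued tasks to be reserved behind it; idle worker processes can pick them up instead.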

How to prevent ironworker from enqueuing tasks of workers that are still running?

I have a worker whose runtime varies greatly, from 10 seconds up to an hour. I want to run this worker every five minutes. This is fine as long as the job finishes within five minutes. However, if the job takes longer, Iron.io keeps enqueuing the same task over and over, and a bunch of tasks of the same type accumulate while the worker is running.
Furthermore, it is crucial that the task does not run concurrently, so max concurrency for this worker is set to one.
So my question is: Is there a way to prevent Iron.io from enqueuing tasks of workers that are still running?
Answering my own question.
According to Iron.io support it is not possible to prevent IronWorker from enqueuing tasks of workers that are still running. For cases like mine it is better to have master workers that do the scheduling, i.e. create/enqueue tasks from a script via one of the client libraries.
The best option would be to enqueue the new task from the worker's own code. For example, your task runs for 10 seconds to 1 hour and enqueues itself at the end (the last line of code). This prevents tasks from accumulating while the worker is running.
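A rough sketch of that self-enqueueing pattern; note that do_the_work and enqueue_self are hypothetical placeholders, and the actual enqueue call depends on which Iron.io client library or API you use:
import time

def do_the_work():
    # Placeholder for the actual job (10 seconds to 1 hour in the question).
    time.sleep(10)

def enqueue_self():
    # Hypothetical helper: queue this same worker again via your Iron.io client library.
    pass

if __name__ == '__main__':
    do_the_work()
    enqueue_self()   # last line: the next run is only scheduled after this one has finished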

celeryev Queue in RabbitMQ Becomes Very Large

I am using Celery on RabbitMQ. I have been sending thousands of messages to the queue and they are being processed successfully; everything is working just fine. However, the number of messages in several RabbitMQ queues is growing quite large (hundreds of thousands of items in the queue). The queues are named celeryev.[...]. Is this appropriate behavior? What is the purpose of these queues, and shouldn't they be regularly purged? Is there a way to purge them more regularly? I think they are taking up quite a bit of disk space.
You can use the CELERY_EVENT_QUEUE_TTL Celery option (it only works with the amqp transport), which sets the message expiry time, after which the message is deleted from the queue.
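For example (the value is in seconds; pick whatever retention suits you):
CELERY_EVENT_QUEUE_TTL = 60  # event messages in the celeryev.* queues expire after 60 seconds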
For anyone else who is running into problems with a celeryev queue becoming very large and threatening the disk space on your rabbitmq server, beware the accepted answer! Here's my suggestion. Just issue this command on your rabbitmq instance:
rabbitmqctl set_policy limit_celeryev_queues "^celeryev\." '{"max-length":1000000}' --apply-to queues
This will limit any queue beginning with "celeryev" to 1 Million entries. I did some experimenting with a stuck flower instance causing a runaway celeryev queue, and setting CELERY_EVENT_QUEUE_TTL / CELERY_EVENT_QUEUE_EXPIRES did not help control the queue size.
In my testing, I started a flower process, then SIGSTOP'ed it, and watched its celeryev queue start running away. Neither of these two settings helped at all. I confirmed SIGCONT'ing the flower process would bring the queue back to 0 rapidly. I am not certain why these two knobs didn't help, but it may have something to do with how RabbitMQ implements these two settings.
First, the Per-Message TTL corresponding to CELERY_EVENT_QUEUE_TTL only establishes an expiration time on each queue entry -- AIUI it will not automatically delete the message out of the queue to save space upon expiration. Second, the Queue TTL corresponding to CELERY_EVENT_QUEUE_EXPIRES says that it "... guarantees that the queue will be deleted, if unused for at least the expiration period". However, I believe that their definition of "unused" may be too strict to kick in for e.g. an overburdened, stuck, or killed flower process.
EDIT: Unfortunately, one problem with this suggestion is that the set_policy ... apply-to queues will only impact existing queues, and flower can and will create new queues which may overflow.
Celery uses celeryev-prefixed queues (and an exchange) for monitoring. You can configure them as you want, or disable them entirely (celery control disable_events).
You just have to set a config option in your Celery app.
If you want to prevent Celery from creating celeryev.* queues:
CELERY_SEND_EVENTS = False # Will not create celeryev.* queues
If you need these queues for monitoring purposes (Celery Flower, for instance), you can have them cleaned up regularly:
CELERY_EVENT_QUEUE_EXPIRES = 60 # Will delete all celeryev. queues without consumers after 1 minute.
The solution came from here: https://www.cloudamqp.com/docs/celery.html
You can limit the queue size in RabbitMQ with the x-max-length queue declaration argument:
http://www.rabbitmq.com/maxlength.html
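For a queue you declare yourself (the celeryev.* queues are declared by Celery, so the policy shown earlier is usually the more practical way to cap those), a kombu-level sketch with an illustrative queue name, limit, and broker URL:
from kombu import Connection, Queue

queue = Queue('my_queue', queue_arguments={'x-max-length': 100000})  # RabbitMQ drops the oldest messages beyond this limit
with Connection('amqp://guest@localhost//') as conn:
    queue(conn.channel()).declare()   # the argument only takes effect when the queue is first declared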