Celery - How to solve PreconditionFailed related to delivery acknowledgement timeout - RabbitMQ

In my Celery cluster I see that some tasks are killed and restarted automatically. This set of Celery workers handles long-running tasks. This is the error the worker logs before it dies:
"Unrecoverable error: PreconditionFailed(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more', (0, 0), '')"
How can I solve this issue?
I am using Amazon MQ to run a RabbitMQ cluster.

Related

How does Flower determine if a Celery worker is online/offline?

In Celery, I sometimes see that the worker is offline. I run Flower in one Docker container and the Celery worker in another one. I use a RabbitMQ broker.
I see that the worker flips between offline and online quite often.
What does it mean that a worker is offline? How does Flower figure that out?
A worker is considered "offline" if it has not broadcast a heartbeat signal for some (short) period of time.

Celery task with a long ETA and RabbitMQ

RabbitMQ may enforce ack timeouts for consumers: https://www.rabbitmq.com/consumers.html#acknowledgement-modes
By default, if a task has not been acked within 15 min, the entire node will go down with a PreconditionFailed error.
I need to schedule a Celery task (using RabbitMQ as a broker) with an ETA quite far in the future (1-3 h), and as of now (with Celery 4 and RabbitMQ 3.8), when I try that I get PreconditionFailed after the consumer ack timeout configured for my RabbitMQ instance.
I expected that the task would be acknowledged before its ETA ...
Is there a way to configure an ETA celery task to be acknowledged within the consumer ack timeout?
Right now I am increasing the consumer_timeout to be longer than my ETA delta, but there must be a better solution ...
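The kind of call I mean looks roughly like this (the task name, arguments, and delay are illustrative):

from datetime import datetime, timedelta, timezone
from celery import Celery

app = Celery("proj", broker="amqp://guest@localhost//")

@app.task
def process_report(report_id):  # hypothetical task
    return report_id

# Schedule the task ~3 hours from now. The worker holds the message
# unacknowledged until the ETA, which is what eventually trips the
# broker's delivery acknowledgement timeout.
process_report.apply_async(args=[42], eta=datetime.now(timezone.utc) + timedelta(hours=3))
# Equivalent: process_report.apply_async(args=[42], countdown=3 * 60 * 60)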
I think adjusting the consumer_timeout is your only option in Celery 5. Note that this is only applicable for RabbitMQ 3.8.15 and newer.
Another possible solution is to have the workers ack the message immediately upon receipt. Do this only if you don't need to guarantee task completion. For example, if the worker crashes before doing the task, Celery will not know that it wasn't completed.
In RabbitMQ, the best options for delayed tasks are the delayed-message-exchange or dead lettering. Celery cannot use either option. In Celery, messages are published to the message broker where they are sent to consumers as soon as possible. The delay is enforced in the worker, not at the broker.
There's a way to change this consumer_timeout for a running instance by running the following command on the RabbitMQ server:
rabbitmqctl eval 'application:set_env(rabbit, consumer_timeout, 36000000).'
This will set the new timeout to 10 hrs (36000000ms). For this to take effect, you need to restart your workers though. Existing worker connections will continue to use the old timeout.
You can check the current configured timeout value as well:
rabbitmqctl eval 'application:get_env(rabbit, consumer_timeout).'
If you are running RabbitMQ via the Docker image, set the environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS to "-rabbit consumer_timeout 36000000", for example by adding
-e RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-rabbit consumer_timeout 36000000"
to your docker run command.
Hope this helps!
I faced this problem too. I think you would be better off using a PeriodicTask; if you only want it to run once, set one_off=True.
https://docs.celeryq.dev/en/stable/userguide/periodic-tasks.html?highlight=periodic
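For clarity: one_off here refers to the PeriodicTask model from django-celery-beat (it is not a core Celery beat option). A minimal sketch, assuming django-celery-beat is installed and configured; the task path and arguments are hypothetical:

import json
from datetime import datetime, timedelta, timezone
from django_celery_beat.models import ClockedSchedule, PeriodicTask

# Run once, ~3 hours from now, instead of publishing a long-ETA message.
clocked = ClockedSchedule.objects.create(
    clocked_time=datetime.now(timezone.utc) + timedelta(hours=3)
)
PeriodicTask.objects.create(
    name="process-report-42",           # must be unique
    task="myapp.tasks.process_report",  # hypothetical task path
    args=json.dumps([42]),
    clocked=clocked,
    one_off=True,                       # disable the schedule after the first run
)

Because the schedule lives in the beat database rather than in an unacknowledged broker message, the waiting period does not run into the consumer_timeout.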
I encountered the same problem and I resolved it.
With RabbitMQ version 3.8.14 (3.8.14-management) I am able to send long-ETA tasks; note that 3.8.14 predates the delivery acknowledgement timeout introduced in 3.8.15.
I personally use Celery to send tasks with a long ETA.
In my case, I set up Celery to add a timeout (similar to consumer_timeout), which I can configure with time_limit or soft_time_limit.
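As a rough illustration of that setup (the limit values and task body are just examples; limits are in seconds):

from celery import Celery
from celery.exceptions import SoftTimeLimitExceeded

app = Celery("proj", broker="amqp://guest@localhost//")

# soft_time_limit raises SoftTimeLimitExceeded inside the task;
# time_limit hard-terminates the process executing the task.
@app.task(soft_time_limit=1500, time_limit=1800)
def long_running_job(n):
    try:
        total = 0
        for i in range(n):
            total += i  # stand-in for the real work
        return total
    except SoftTimeLimitExceeded:
        # Last chance to clean up before the hard limit kicks in.
        return None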
I also wanted to do something similar and tried both the "rabbitmq-delayed-message-exchange" plugin and a "dead-letter-queue". I wrote an article about each and linked them below; I hope they will be helpful to someone. In a nutshell, both approaches can be used for scheduling Celery tasks (handling long ETAs).
Using DLX: Dead Letter Exchanges (DLX)
Using the RabbitMQ Delayed Message Plugin: RabbitMQ Delayed Message Plugin
P.S.: I know Stack Overflow answers should be self-contained, but the full write-ups are too long to post as answers, so I linked them instead. Sorry.
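For what it's worth, here is a minimal sketch of the dead-lettering ("waiting room") pattern those articles describe, using plain pika rather than Celery; the queue names and the TTL are illustrative:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Final destination queue, consumed by the actual worker.
channel.queue_declare(queue="work", durable=True)

# "Waiting room" queue: messages sit here unconsumed until the TTL expires,
# then get dead-lettered into the work queue.
channel.queue_declare(
    queue="work.delay",
    durable=True,
    arguments={
        "x-message-ttl": 3600000,             # 1 hour delay, in milliseconds
        "x-dead-letter-exchange": "",         # default exchange
        "x-dead-letter-routing-key": "work",  # deliver expired messages to "work"
    },
)

# Publishing to the delay queue effectively schedules the message.
channel.basic_publish(exchange="", routing_key="work.delay", body=b"payload")
connection.close()

Because nothing consumes the delay queue, the message is never held unacknowledged by a worker, so the delivery acknowledgement timeout does not apply to the waiting period.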

Acks_late: celery + redis broker / backend

I was going through the Celery code. The late acknowledgement for acks_late happens once the task function has run (via task_trace).
However, with Redis, once a task is received (i.e. popped from the Redis queue), RedisWorkerController creates a task request for it. How is it enqueued again in the event the worker node dies?
The messages aren't actively re-enqueued when they go unacknowledged (that would be impossible if the worker dies), but they do remain in Redis as unacknowledged.
According to the Celery docs, the Redis broker has a visibility timeout mechanism.
So we should expect the message to be delivered again to a worker if it was not acknowledged within the visibility timeout, and that's what happens: if the power goes out during the processing of an acks_late task, the task is received again by an online worker after the visibility timeout has passed.
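The visibility timeout is configurable through the broker transport options; a minimal sketch (the 12-hour value is only an example and should exceed your longest expected task or ETA):

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Redis broker only: unacknowledged messages are redelivered after this many
# seconds (the default visibility timeout is 1 hour).
app.conf.broker_transport_options = {"visibility_timeout": 43200}  # 12 hours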

re-start celery queue after re-starting broker

I started my celery worker queue (in background):
celery worker -Q my_queue -l info
After this, its broker (Redis) was stopped, and meanwhile the background Celery worker keeps trying to reconnect to Redis at increasingly long intervals.
Now my goal is to restart a worker on my_queue, without duplicates, after restarting Redis. I realize that the following Celery API will not return my_queue until the reconnection is made:
celery.task.control.inspect().active_queues()
Now if I start a new worker on my_queue, I will end up with duplicate workers on my_queue if the previous background Celery worker reconnects afterwards.
A solution might be to have the Celery worker actively quit if its broker is found to be stopped, but I can't find the right way to do this. I also don't want to kill it using a previously saved PID. Any suggestions or alternatives would be appreciated.
Well, I know it contradicts my requirement, but it seems that I do need help from a PID file:
celery worker -Q my_queue -l info --pidfile=pid.log
which will raise an exception if the process with the PID saved in pid.log is already running.
This is still not the ideal solution, and any suggestion on how to make the Celery worker actively quit when its broker is found to be stopped would still be appreciated.
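One thing that may be worth experimenting with (I have not verified that it covers a broker that stops after the worker has already connected, and behaviour differs across Celery versions) is capping the connection retries so the worker gives up instead of retrying forever; a sketch using the documented settings:

from celery import Celery

app = Celery("proj", broker="redis://localhost:6379/0")

# Per the Celery docs: maximum number of retries before giving up
# re-establishing the broker connection (None/0 means retry forever).
app.conf.broker_connection_retry = True
app.conf.broker_connection_max_retries = 5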

Celery workers missing heartbeats and getting substantial drift over Ec2

I am testing my Celery implementation over 3 EC2 machines right now. I am pretty confident in my implementation, but I am having problems with the actual worker execution. My test structure is as follows:
1 EC2 machine is designated as the broker; it also runs a Celery worker
1 EC2 machine is designated as the client (it runs the client Celery script that enqueues all the tasks using .delay()); it also runs a Celery worker
1 EC2 machine is purely a worker.
All the machines have 1 Celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery#[other ec2 ip]".
The machine would be doing very little work at this point, so my Auto Scaling configuration in EC2 would shut down the instance automatically once CPU utilization dropped very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought Celery handled this) with these commands, which were run on startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my EC2 instances stalled and shut down. The weird thing is that the drift sometimes increased and then decreased.
Any idea on why this is happening?
I am using the newest version of Celery (3.1) and RabbitMQ.
EDIT: It should be noted that I am utilizing the us-west-1a and us-west-1c availability zones on EC2.
EDIT 2: I am starting to think memory might be an issue. I am using a t2.micro instance, and running 3 Celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.