acks_late: Celery + Redis broker / Redis result backend

I was going through the Celery code. With acks_late, the acknowledgement is sent once the task function has run, via task_trace.
However, with Redis, once a task is received (i.e. popped from the Redis queue), RedisWorkerController creates a task request for it. How is it enqueued again in the event the worker node dies?

The messages aren't re-enqueued merely because they were not acknowledged (that would be impossible if the worker dies; they do, however, still exist in Redis as unacknowledged).
According to the Celery docs, the Redis broker has a visibility timeout mechanism.
So we can expect a message that was not acknowledged within the visibility timeout to be delivered to a worker again, and that is what happens: if the power goes out while an acks_late task is being processed, the task is received again by an online worker after the visibility timeout has passed.
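The behaviour above hinges on two settings. As a minimal sketch, assuming a local Redis broker (the broker URL and the task itself are placeholders):

# Minimal sketch: acks_late with a Redis broker and an explicit visibility timeout.
# The broker URL and the task are placeholders for illustration.
from celery import Celery

app = Celery("myapp", broker="redis://localhost:6379/0")

# Acknowledge only after the task body has finished, not when the message is received.
app.conf.task_acks_late = True

# Unacknowledged messages become visible to other workers again after this many
# seconds (Redis transport only; the default is 3600, i.e. one hour).
app.conf.broker_transport_options = {"visibility_timeout": 3600}

@app.task
def add(x, y):
    return x + y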

Related

RabbitMQ auto-delete queues with timeouts

I have a k8s service that uses RabbitMQ as its message broker.
I want to be able to delete a specific queue if the service deployment (which may have multiple pods) is stopped.
Reading the documentation (RabbitMQ Queues Docs), I found that the best fit for my case is the auto-delete property of the queue.
Is there an option so that the auto-delete queue is not deleted immediately after the clients disconnect, but instead waits a few seconds for a reconnection?
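For reference, a rough pika sketch of the queue property in question; the queue names and TTL value are made up, and the x-expires queue-TTL argument is mentioned only as a possible alternative (it is not discussed above), since it keeps an unused queue around for a grace period before RabbitMQ deletes it:

# Sketch with pika; queue names and the TTL value are illustrative only.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# auto-delete: the queue is removed as soon as its last consumer disconnects.
channel.queue_declare(queue="service-events", auto_delete=True)

# Possible alternative: a queue with a queue TTL (x-expires); RabbitMQ deletes it
# only after it has been unused (no consumers, no redeclare) for 30 seconds.
channel.queue_declare(queue="service-events-ttl", arguments={"x-expires": 30000})

connection.close()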

Celery creates 3 queues in RabbitMQ message queue

I am using Celery as the task queue and RabbitMQ as the message broker. When pushing my tasks to the queue using the delay function, I see that three queues were created in RabbitMQ. I don't understand what these two extra queues are and why we need them. Also, how do I identify which queue my tasks are actually being pushed to?
Started celery :
celery -A myproject worker -l info
[tasks]
. app1.tasks.add
[2022-06-10 06:16:14,132: INFO/MainProcess] Connected to amqp://himanshu:**@IPADDRESS/vhostcheck
[2022-06-10 06:16:14,142: INFO/MainProcess] mingle: searching for neighbors
[2022-06-10 06:16:15,165: INFO/MainProcess] mingle: all alone
[2022-06-10 06:16:15,182: WARNING/MainProcess] /etc/myprojectenv/lib/python3.8/site-packages/celery/fixups/django.py:203: UserWarning: Using settings.DEBUG leads to a memory
leak, never use this setting in production environments!
warnings.warn('''Using settings.DEBUG leads to a memory
[2022-06-10 06:16:15,182: INFO/MainProcess] celery@ubuntu-s-1vcpu-1gb-blr1-01 ready.
[2022-06-10 06:17:38,485: INFO/MainProcess] Task app1.tasks.add[be566921-b320-466c-b406-7a6ed7ab06e7] received
[2022-06-10 06:19:18,544: INFO/ForkPoolWorker-1] Task app1.tasks.add[be566921-b320-466c-b406-7a6ed7ab06e7] succeeded in 100.05838803993538s: 13
So whenever I run my Celery worker I see these three queues being generated.
(screenshot of the RabbitMQ Management UI)
What are those three queues and what is Celery using them for?
Also, since queues are basically persistent storage and therefore persistent, why do they get deleted when I stop my workers? I see there is only one queue left after I stop Celery.
The celery queue is there so that you can send tasks to that particular queue. Every Celery worker subscribed to this queue will be able to reserve and run tasks sent to it.
The .pidbox queue is created by every Celery worker to support execution of remote commands.
The celeryev queue is also created by every Celery worker and is used for monitoring. For example, every Celery worker broadcasts a heartbeat message every few seconds; these messages go to the celeryev queue.
The Celery documentation does not give any details about these queues, so people had to look for answers in the Celery/Kombu source code. Here is one example: https://github.com/celery/celery/issues/6371#issuecomment-716839203
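To see where tasks go, you can also route them explicitly when sending; by default they land on the celery queue. A rough sketch, reusing the add task from the log above (the "math" queue name is made up):

# Sketch: default queue vs. an explicitly named queue; "math" is a placeholder.
from app1.tasks import add

add.delay(6, 7)                        # goes to the default "celery" queue
add.apply_async((6, 7), queue="math")  # goes to a queue named "math"

On the broker side, rabbitmqctl list_queues name messages consumers lists the queues together with their message and consumer counts.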

If celery worker dies hard, does job get retried?

Is there a way for a Celery job to be retried if the server where the worker is running dies? I don't just mean the sub-process that executes the job, but the entire server becoming unavailable.
I tried with RabbitMQ and Redis as brokers. In both cases, if a job is currently being processed, it is entirely forgotten. When a worker restarts, it doesn't even try to reprocess the job, and looking at Rabbit or Redis, their queues are empty. The result backend is also empty.
It looks like the worker grabs the message and assumes it will put it back if the subprocess fails, but if the worker dies as well, it can't put it back.
(yes, I work in an environment where this happens more than once a year, and I don't want to lose tasks)
In theory, setting task_acks_late=True should do the trick. (doc)
With a Redis broker, the task will be redelivered after the visibility_timeout, which defaults to one hour. (doc)
With RabbitMQ, the task is redelivered as soon as RabbitMQ notices that the worker died.
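As a rough sketch of the per-task form of that setting (the app name, broker URL, and task body are placeholders):

# Sketch: enabling late acknowledgement for a single task.
# App name, broker URL, and the task body are placeholders.
from celery import Celery

app = Celery("jobs", broker="amqp://guest:guest@localhost:5672//")

@app.task(acks_late=True)
def long_running_job(payload):
    # With acks_late, the message is acknowledged only after this function returns,
    # so if the worker host dies mid-run the message stays unacknowledged and the
    # broker can redeliver it (immediately on RabbitMQ, after the visibility
    # timeout on Redis).
    ...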

Logstash with rabbitmq cluster

I have a 3-node RabbitMQ cluster behind an HAProxy load balancer. When I shut down a node, RabbitMQ successfully switches the queue to the other nodes. However, I notice that Logstash stops pulling messages from the queue unless I restart it. Is this a problem with the way RabbitMQ operates, i.e. does it deactivate all active consumers? I am not sure if Logstash has any retry capability. Has anyone run into this issue?
Quoting the RabbitMQ documentation, first the page on clustering:
What is Replicated? All data/state required for the operation of a
RabbitMQ broker is replicated across all nodes. An exception to this
are message queues, which by default reside on one node, though they
are visible and reachable from all nodes.
and then the page on high availability:
Clients that are consuming from a mirrored queue may wish to know that
the queue from which they have been consuming has failed over. When a
mirrored queue fails over, knowledge of which messages have been sent
to which consumer is lost, and therefore all unacknowledged messages
are redelivered with the redelivered flag set. Consumers may wish to
know this is going to happen.
If so, they can consume with the argument x-cancel-on-ha-failover set
to true. Their consuming will then be cancelled on failover and a
consumer cancellation notification sent. It is then the consumer's
responsibility to reissue basic.consume to start consuming again.
So, what does all this mean:
You have to mirror queues
The consumers should use manual ACK
The consumers should reconnect on their own
So the answer to your question is no, it's not a problem with RabbitMQ; that's simply how it works. It's up to the clients to reconnect.
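In client code, the three points above look roughly like this with pika; the host, queue name, and back-off interval are assumptions, and Logstash's own rabbitmq input has its own reconnect settings, so this only illustrates the mechanism:

# Sketch: manual acks, x-cancel-on-ha-failover, and reconnect on failure.
# Host, queue name, and back-off interval are placeholders.
import time
import pika

def handle(channel, method, properties, body):
    print("got", body)
    channel.basic_ack(delivery_tag=method.delivery_tag)  # manual ACK

while True:
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters("haproxy-host"))
        channel = connection.channel()
        channel.basic_consume(
            queue="logs",
            on_message_callback=handle,
            auto_ack=False,
            # Ask the broker to cancel this consumer on mirrored-queue failover,
            # so the client notices and re-issues basic.consume.
            arguments={"x-cancel-on-ha-failover": True},
        )
        channel.start_consuming()
    except pika.exceptions.AMQPError:
        time.sleep(5)  # back off, then reconnect and consume again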

Celery tasks retry (Celery, Django and RabbitMQ)

Can you tell me what is happening when, in Celery, you tell a task to retry? Will it retry in the same worker thread, or will it be returned to the broker, which may send it elsewhere?
What will happen to tasks awaiting retry if the worker or dispatcher suddenly stops? If tasks can be lost, is there some approach to avoid this? Maybe save each task in a database and retry them if no result is received for some time?
Or maybe the dispatcher has its own persistent storage? What about when the worker thread crashes while receiving the task or while executing it?
Can you tell me what is happening when, in Celery, you tell a task to retry? Will it retry in the same worker thread, or will it be returned to the broker, which may send it elsewhere?

Yes, the task is returned to the broker (e.g. RabbitMQ) with a new estimated execution time.
What will happen to tasks awaiting retry if the worker or dispatcher suddenly stops? If tasks can be lost, is there some approach to avoid this? Maybe save each task in a database and retry them if no result is received for some time? Or maybe the dispatcher has its own persistent storage? What about when the worker thread crashes while receiving the task or while executing it?

Here is a complete answer: Retry Lost or Failed Tasks (Celery, Django and RabbitMQ)
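For completeness, a retry inside task code looks roughly like this (the task, its arguments, and the countdown are illustrative); self.retry() re-publishes the message to the broker with an ETA rather than looping inside the same worker thread:

# Sketch: a task that retries itself; names and timings are illustrative.
from celery import Celery

app = Celery("myapp", broker="amqp://guest:guest@localhost:5672//")

@app.task(bind=True, max_retries=3)
def fetch_remote(self, url):
    try:
        return do_request(url)  # do_request is a placeholder for the real work
    except ConnectionError as exc:
        # Re-publish the task to the broker with an ETA about 60 seconds in the
        # future; any worker subscribed to the queue may pick it up.
        raise self.retry(exc=exc, countdown=60)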