If celery worker dies hard, does job get retried? - redis

Is there a way for a celery job to be retried if the server where the worker is running dies? I don't just mean the sub-process that execute the job, but the entire server becomes unavailable.
I tried with RabbitMQ and Redis as brokers. In both cases, if a job is currently being processed, it is entirely forgotten. When a worker restarts, it doesn't even try to reprocess the job, and looking at Rabbit or Redis, their queues are empty. The result backend is also empty.
It looks like the worker grabs the message and assume it will put it back if the subprocess fails, but if the worker dies also, it can't put it back.
(yes, I work in an environment where this happens more than once a year, and I don't want to lose tasks)

In theory, set task_acks_late=True should do the trick. (doc)
With a Redis broker, the task will be redelivered after visibility_timeout, which defaults to one hour. (doc)
With RabbitMQ, the task is redelivered as soon as Rabbit noticed that the worker died.

Related

Celery 4.3.0 - Send Signal To a Task Without Termination

On a celery service on CENTOS which runs a single task at a time, the termination of a task is simple:
revoke(id, terminate=True, signal='SIGINT')
However while the interrupt signal is being processed, the running task gets revoked. Then a new task - from the queue - starts on the node. This is troublesome. Two task are running at the same time on the node. The signal handling could take up to a minute.
The question is how a signal could be sent to a running task, without actually terminating the task in celery?
Or let's say is there any way to send a signal to a running task?
The assumption is user should be able to send a signal from a remote node. In other words user does not have access to list the running processes of the node.
Any other solution is welcome.
I don't understand your goal.
Are you trying to kill the worker? if so, I guess you are talking about t "Warm shutdown", so you can send the SIGTEERM to the worker's process. The running task will get a chance to finish but no new task will be added.
If you're just interested in revoking a specific task and keep using the same worker, can you share your celery configuration and the worker command? are you sure you're running with concurrency 1 ?

How do I get rid of a zombie Celery worker?

I am running Celery with RabbitMQ backend.
Somehow I have ended up with what appears to be a zombie Celery worker. I see the worker in Flower, and in commands like celery inspect scheduled. But it references a PID that doesn't exist. There is no worker process. It is a big problem because Celery will delegate tasks to this worker, and they never get executed.
I believe what happened is the docker container within which this is running got shut down uncleanly. But now, even if I restart the docker container, this zombie worker always comes back. Always has the same name: celery#0357c65d991b.
The Celery docs say that to kill a worker you must send its process TERM. But I can't do that because there is no process. It's a zombie.
RabbitMQ must have a dangling reference to this worker. They only thing I could find in the RabbitMQ management interface is a queue named celery#0357c65d991b.celery.pidbox. I deleted this queue, but it simply reappeared a few seconds later.
Can anyone give me a pointer on where to look to get rid of this thing?

Acks_late: celery + redis broker / backend

I was going through celery code. Acks_late is called once the task function runs via (task_trace).
However, in Redis, the once a task is received (i.e pop from Redis Queue) RedisWorkerController creates a task request for it. How is it enqueued again in the event the worker node dies?
The messages aren't enqueued again in case of them not being acknowledged (It would be impossible if the worker dies. They do exist in Redis as unacknowledged).
According to celery docs, Redis broker has a visibility timeout mechanism.
So we should be able to expect the message to be delivered again to a worker if it was not acknowledged within the visibility timeout. And that's what happens. If the power goes out during the processing of an acks_late task, the task is received again by an online worker after the visibility timeout is passed.

how to resove "connection.blocked: true" in capabilities on the RabbitMQ UI

"rabbitmqctl list_connections" shows as running but on the UI in the connections tab, under client properties, i see "connection.blocked: true".
I can see that messages are in queued in RabbitMq and the connection is in idle state.
I am running Airflow with Celery. My jobs are not executing at all.
Is this the reason why jobs are not executing?
How to resolve the issue so that my jobs start running
I'm experiencing the same kind of issue by just using celery.
It seems that when you have a lot of messages in the queue, and these are fairly chunky, and your node memory goes high, the rabbitMQ memory watermark gets trespassed and this triggers a blocking into consumer connections, so no worker can access that node (and related queues).
At the same time publishers are happily sending stuff via the exchange so you get in a lose-lose situation.
The only solution we had is to avoid hitting that memory watermark and scale up the number of consumers.
Keep messages/tasks lean so that the signature is not MB but KB

Celery Flower not monitoring jobs

I am running Celery and Flower, with RabbitMQ as a message broker. When I have no running workers and start a task, it sits on the queue until a worker starts. Then, when I start my workers, the task is consumed and executed as expected. However, when I try to use the Flower API to get task info, args and kwargs are null. This never happens when my workers are already running when I call a task. Why is this, and how can I fix it? Thanks.