How does Flower determine if a Celery worker is online/offline? - rabbitmq

With Celery, I sometimes see that the worker is offline. I run Flower in one Docker container and the Celery worker in another one. I use a RabbitMQ broker.
I see that the worker jumps between offline <-> online quite often.
What does it mean that a worker is offline? How does Flower figure that out?

A worker is considered "offline" if it does not broadcast a heartbeat signal for some (short) period of time.
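To illustrate the idea, here is a minimal sketch of heartbeat-based liveness tracking. It is not Flower's actual code: the class name HeartbeatTracker, the 10-second timeout, and the worker name are assumptions made up for the example. It only shows why a delayed heartbeat can flip a worker to "offline" until the next heartbeat arrives.

```python
import time


class HeartbeatTracker:
    """Remembers when each worker last sent a heartbeat.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """

    def __init__(self, timeout=10.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}

    def on_heartbeat(self, worker):
        # Record the arrival time of the latest heartbeat.
        self.last_seen[worker] = self.clock()

    def is_online(self, worker):
        # Online only if a heartbeat arrived within the timeout window.
        seen = self.last_seen.get(worker)
        return seen is not None and self.clock() - seen <= self.timeout


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
tracker = HeartbeatTracker(timeout=10.0, clock=lambda: now[0])
tracker.on_heartbeat("celery@worker1")

now[0] = 5.0
online_at_5 = tracker.is_online("celery@worker1")

now[0] = 20.0  # no heartbeat for 20 seconds
online_at_20 = tracker.is_online("celery@worker1")
```

Under this model, a worker that is healthy but whose heartbeats are delayed (for example by network hiccups between containers) would appear to jump between offline and online.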

Related

How do I get rid of a zombie Celery worker?

I am running Celery with RabbitMQ backend.
Somehow I have ended up with what appears to be a zombie Celery worker. I see the worker in Flower, and in commands like celery inspect scheduled. But it references a PID that doesn't exist. There is no worker process. It is a big problem because Celery will delegate tasks to this worker, and they never get executed.
I believe what happened is the docker container within which this is running got shut down uncleanly. But now, even if I restart the docker container, this zombie worker always comes back. Always has the same name: celery#0357c65d991b.
The Celery docs say that to kill a worker you must send its process TERM. But I can't do that because there is no process. It's a zombie.
RabbitMQ must have a dangling reference to this worker. The only thing I could find in the RabbitMQ management interface is a queue named celery#0357c65d991b.celery.pidbox. I deleted this queue, but it simply reappeared a few seconds later.
Can anyone give me a pointer on where to look to get rid of this thing?

If celery worker dies hard, does job get retried?

Is there a way for a Celery job to be retried if the server where the worker is running dies? I don't just mean the sub-process that executes the job, but the entire server becoming unavailable.
I tried with RabbitMQ and Redis as brokers. In both cases, if a job is currently being processed, it is entirely forgotten. When a worker restarts, it doesn't even try to reprocess the job, and looking at Rabbit or Redis, their queues are empty. The result backend is also empty.
It looks like the worker grabs the message and assumes it will put it back if the subprocess fails, but if the worker also dies, it can't put it back.
(yes, I work in an environment where this happens more than once a year, and I don't want to lose tasks)
In theory, setting task_acks_late=True should do the trick. (doc)
With a Redis broker, the task will be redelivered after visibility_timeout, which defaults to one hour. (doc)
With RabbitMQ, the task is redelivered as soon as RabbitMQ notices that the worker has died.
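As a rough sketch, the settings mentioned above could be placed in a plain configuration module. The module name celeryconfig.py is an assumption, and the 3600-second value simply restates the one-hour default explicitly; tune it to your longest-running task.

```python
# celeryconfig.py (assumed file name), loaded with
# app.config_from_object("celeryconfig")

# Acknowledge a message only after the task has finished, so an
# unacknowledged message can be redelivered if the worker disappears.
task_acks_late = True

# Only relevant for the Redis broker: seconds to wait before an
# unacknowledged message is handed to another worker.
broker_transport_options = {"visibility_timeout": 3600}
```

With late acknowledgement a task may run more than once, so the tasks themselves should be safe to repeat.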

re-start celery queue after re-starting broker

I started my celery worker queue (in background):
celery worker -Q my_queue -l info
After this, its broker (Redis) was stopped, and the background Celery worker now keeps trying to reconnect to Redis at growing intervals.
My goal is to restart my_queue after restarting Redis without ending up with a duplicate. I realize that the following Celery API will not return my_queue until the reconnection is made:
celery.task.control.inspect().active_queues()
Now if I start a new my_queue, I will end up with a duplicate my_queue if the previous Celery worker in the background reconnects afterward.
A solution might be to have the Celery worker quit on its own when it finds its broker stopped, but I can't find the right way to do this. I also don't want to kill it by a previously saved PID. Any suggestions or alternatives will be appreciated.
Well, I know it's contradictory to my requirement, but it seems that I do need the help from a PID file:
celery worker -Q my_queue -l info --pidfile=pid.log
which will raise an exception if the process whose PID is saved in pid.log is still running.
This is still not the ideal solution, and any suggestion on how to make the Celery worker quit on its own when it finds its broker stopped will still be appreciated.
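For reference, the kind of check a PID file makes possible can be sketched as follows. This is not Celery's implementation; the helper name pid_from_file_is_running is hypothetical, and the os.kill(pid, 0) probe is POSIX-only.

```python
import os
import tempfile


def pid_from_file_is_running(pidfile):
    """Return True if pidfile names a live process (hypothetical helper)."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (FileNotFoundError, ValueError):
        return False  # no pidfile, or contents are not a number
    if pid <= 0:
        return False  # not a valid single-process PID
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return False  # stale pidfile: the process is gone
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


# Usage: our own PID is running, a missing pidfile is not.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pid.log")
    with open(path, "w") as fh:
        fh.write(str(os.getpid()))
    own_pid_running = pid_from_file_is_running(path)
    missing_running = pid_from_file_is_running(os.path.join(tmp, "missing.log"))
```

A start-up script could run such a check before launching a second worker for my_queue.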

Celery Flower not monitoring jobs

I am running Celery and Flower, with RabbitMQ as a message broker. When I have no running workers and start a task, it sits on the queue until a worker starts. Then, when I start my workers, the task is consumed and executed as expected. However, when I try to use the Flower API to get task info, args and kwargs are null. This never happens when my workers are already running when I call a task. Why is this, and how can I fix it? Thanks.

Celery multiple instances and Redis

I have two servers running Celery and one Redis database. They both listen to the same queue as they are meant to divide the "workload". Tasks are queued onto Redis, but it looks like both my Celery servers pick up the task at the same time, hence executing it twice (once on each server.) Is there a way to prevent this with the Redis/Celery setup?
Thank you,
Each of my servers was using the same name for the Celery workers. Since then I've added %h at the end of the worker name (-n my_worker_%h) to show the hostname. This way Celery Flower displays each worker on its own line, and no more confusion is possible.