celery django + rabbitmq = dst bug? - rabbitmq

I'm using celery (3.1) with amqp and kombu 3.0, together with RabbitMQ (3.4) to run asynchronous tasks for a Django backend.
Tonight I just encountered a strange situation due to dst (Daylight Saving Time), while in Central European Time, our servers went from UTC+2 to UTC+1. Meaning, it was 2 AM and then at 2:59 AM, instead of being 3AM, it was 2AM. I am wondering if this is a bug or if there might be something wrong in the configuration.
The strange thing is that I could see the queue growing up in the "first 2-3AM hours range", like if it wasn't getting consumed and the same queue being slowly acknowledge in the "second 2-3 AM hours range". But not like if at the second 2AM occurrence, it would try to ack all messages, but just regularly, at the same rythm there where stacked the first hour.
To make it simple, it looked like the messages needed one hour to be acknowledge, if there was some shift there. Any clue if this is a problem in the config or a bug ?
Thank in advance, Matt

I have the same issue. I think this is a bug in the celery.
Use UTC time for eta.

Related

RabbitMQ slowing down after some times

On our RabbitMQ installed in production, we have a performance issue.
To explain the context, we have an initialization batch that creates around ~60k messages. For business reasons, those messages must be treated in strict order and we can't lose any. As such, we have only one queue which is durable and lazy and one consumer (SpringBoot AMQP) with a prefetch of 10. Both are on the same virtual machine.
At first, the processing is fast enough, around 5 to 10 messages per second. But it progressively slows down until it reaches a cap of fewer than 20 messages per hour. It takes approximately 1 hour to reach this point.
After some investigations, we found out that the problem comes from RabbitMQ. When we simply stop and restart it, the performance goes back to normal and then drops slowly again. Doing the same on just the consumer doesn't change anything.
I'm thinking about some resources bottleneck but I can't manage to find which one as RAM, CPU, and disk looks fine. I am not really familiar with ERL virtual machine and managing RabbitMQ itself so I may have missed something.
Does someone as an idea of the source of the problem or where I could look for more information on what is happening?
RabbitMQ characteristics :
ERL 23.3.2
RabbitMQ 3.8.14

Celery - automatic retrying of long running tasks running on crashed worker

I'm using Celery with Django over Redis.
Some of my tasks are quite long, taking about 1 hour. I'm aware that this is suboptimal, and preferably I should use shorter tasks, but this is what I got...
Sometimes the task/worker crash. This can happen for various unimportant reasons. Maybe this worker crashed, network problem, spot-instance when preempted, killed by OOM, or any other unexpected reason that I can't "catch" and handle.
I want to make sure the task will be tried again as fast as possible.
I can use ack_late, but the problem is that this task has a very long timeout (about 90 minutes), which means that if the task started and the worker crashed after 2 minutes, I will now wait for another 88 minutes until the task will get back to the queue and will start executing again on another worker.
I'm wondering if there exists another solution, that will see the worker "disappeared" and will put the task back in the queue?
You could give task_reject_on_worker_lost a try... It is a tricky one, but have a look...

RabbitMQ delayed exchange plugin loads and resources

We are using rabbitmq (3.6.6) to send analysis (millions) to different analyzers. These are very quick and we were planning on use the rabbit-message-plugin to schedule monitorizations over the analyzed elements.
We were thinking about rabbitmq-delayed-exchange-plugin, already made some tests and we need some clarification.
Currently:
We are scheduling millions of messages
Delays range from a few minutes to 24 hours
As previously said, these are tests, so we are using a machine with one core and 4G of RAM which has also other apps running on it.
What happened with a high memory watermark set up at 2.0G:
RabbitMQ eventually (a day or so) starts consuming 100% (only one core) and does not respond to the management interface nor rabbitmqctl. This goes on for at least 18 hours (always end up killing, deleting mnesia delayed file on disk - about 100 / 200 MB - and restarting).
What happened with a high memory watermark set up at 3.6G:
RabbitMQ was killed by kernel, because of high memory usage (4 GB hardware) about a week after working like this.
Mnesia file for delayed exchange is about 1.5G
RabbitMQ cannot start anymore giving to the below trace (we are assuming that because of being terminated by a KILL messages in the delay somehow ended up corrupted or something
{could_not_start,rabbit,
rabbitmq-server[12889]: {{case_clause,{timeout,['rabbit_delayed_messagerabbit#rabbitNode']}},
rabbitmq-server[12889]: [{rabbit_boot_steps,'-run_step/2-lc$^1/1-1-',1,
And right now we are asking ourselves: Are we a little over our heads using rabbit delayed exchange plugin for this volumes of information? If we are, then end of the problem, rethink and restart, but if not, what could be an appropiate hardware and/or configuration setup?
RabbitMQ delayed exchange plugin is not properly designed to store millions of messages.
It is also documented to the plugin page
Current design of this plugin doesn't really fit scenarios with a high
number of delayed messages (e.g. 100s of thousands or millions). See
72 for details.
Read also here: https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72
This plugin is often used as if RabbitMQ was a database. It is not.

celery takes too long time to write result to rabbitmq

Recently, I started celery beat to run a task periodically. The task will take about 2 minutes. The beat interval is 3 minutes. The back end use rabbitmq.
However, the totally elapsed time of a task become nearly 20 minutes. It looks so strange! After some work, I found that the extra time consumed by sending task result to rabbitmq. It is awesome! Why?
And the celery worker will use another 5 or 7 minutes to receive the next task. I do not know what the worker are doing in this period.
Anyone could help to explain them?

Fault tolerant / high availability producer

I'm working on a distributed producer/consumer system using a messaging queue. The part that I'm interested in parallelising is the consumer side of it, and I'm happy with what I have for that.
However, I'm not sure what to do about the producer. I only need one producer running at a time since the load of the producing part of my system is not too high, but I want a reliable way of managing it, as in starting, stopping, restarting, and mainly, monitor it so that if the producer host fails another one can pick up.
If it helps, I'm happy with my consumer algorithm, the one that queues jobs, since it's fault tolerant to be down for a period of time and pick up the stuff that happened during the time it was down.
I'm sure there are tools or at least known patterns to do this and not reinvent the wheel.
I'm using rabbitmq but can use activemq, or even refactor into storm or something like that if needed, my code is not complex so far.
After a couple of weeks thinking about it, the simplest solution came to mind, and I'm actually very pleased with it, so I'll share it in case you find it useful, or point out if you think of any downsides, it seems to be working fine so far.
I have created the simplest table in my DB, called heartbeat, with a single timestamp field called ts, and is meant to have a single row all the time.
I start all of my potential producers every 5 minutes (quartz), and they do an update of the table if the ts fields is older than now() - 5 minutes. Because the update call is blocking, I'll have no db threading issues. Now, if the update returns > 0 it means that it actually modified the value of ts, and then I execute the actual producing code (queue jobs). If the update returns 0, it did not modify the table, because someone else did less than 5 minutes ago, and therefore this producer won't do anything until it checks again in 5 minutes.
Obviously the 5 minutes value is configurable, and this allows for a very neat upgrade with small changes to be able to execute several producers at the same time, if I ever had that need.