Is it possible to define priorities for Celery workers consuming from the same queue? - rabbitmq

I have two machines on my network running Celery workers that process tasks from a common queue (the messaging back-end is RabbitMQ).
One machine is much more powerful and processes the tasks faster (which is important). If there is only one task in the queue, I always want it to run on this machine. If the queue is full, I want the less powerful machine to start accepting tasks as well.
Is there a recommended, elegant way to do this? Or do I have to set up two queues ("fast" and "slow") and implement some kind of router that sends tasks to the "slow" queue only when the "fast" queue is full?

Related

Coordinate scheduled jobs between multiple producers

I have a distributed system of producers and consumers across several servers, with redundant nodes—both for failover and load-balancing. The nodes communicate via RabbitMQ messages.
Each producer runs its own scheduler to invoke jobs, which one of the consumers should run. This works by publishing the appropriate RabbitMQ message, which one of the consumers will process.
Now, the tricky part is, each job should be run only once. In short, my requirements are:
Only one invoke message per scheduled job should be processed (by any of the consumer instances)
If any of the producers goes down, the job should still be invoked by the other instances
I can't figure out how to implement this without relying on anything else but RabbitMQ. I could make it work if there was such a thing as an "exclusive exchange", which only one producer can connect to at a time. I thought about making the consumers ignore any duplicate invokes for the same job, but this will not work, because due to the load-balancing, subsequent messages may be received by any of the other instances. Another idea was implementing a mechanism to declare one of the producers the "principal" node, so only this one is allowed to send invokes, but this basically presented the same problem of coordinating between instances.
Any ideas? Thanks in advance.

Flow control limiting message rate on single queue

I have an exchange with only one queue bound to it. When the message publishing rate goes over a certain cap, RabbitMQ automatically throttles the incoming message rate.
On further investigation I found this happens due to the "flow control" throttling mechanism built into RabbitMQ. https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
According to this document, my connection and channels are in flow control but the queue is not, which means there is a CPU-bound or disk-bound limit.
My messages are not persistent, so I don't have a disk limitation. While searching, I found posts stating that a queue is limited to a single CPU. https://groups.google.com/forum/#!msg/rabbitmq-users/wzHMV7F0ugU/zhW_9b8ACQAJ
What does this mean? Does a RabbitMQ queue process use only one CPU even when multiple cores are available on the machine? How does the CPU limit relate to queue flow control?
A queue is handled by one and only one CPU core, which means you have to design your message flow through RabbitMQ with multiple queues in order to remain scalable.
If you use only one queue, you will be limited to a maximum message rate no matter whether you have one core or more.
https://www.rabbitmq.com/queues.html#runtime-characteristics
If you have a specific need to build an architecture with only one logical queue, which is explicitly not recommended, or if you have a queue with really high traffic, you can look at sharded queues: see the sharded queues plugin on GitHub.
It's a plugin (use it with caution and test everything before going to production, especially failure and replication) that splits a logical queue name into multiple queues.
If you are running a benchmark on RabbitMQ, remember to produce and consume on a number of queues greater than the number of CPU cores on the server.
Another benchmarking tip: try produce-only, consume-only, and both at the same time, with different settings (persistence, message size, lazy queues, ...) and different ack settings.
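To illustrate the idea of spreading one logical queue over several physical queues, here is a minimal publisher-side sketch. It is not the sharding plugin's algorithm; the function names, the ".shard-N" naming scheme, and the choice of SHA-256 are all assumptions made for the example. It only computes which queue name a message would go to, and it does not talk to RabbitMQ.

```python
import hashlib


def physical_queues(logical_queue: str, num_shards: int) -> list[str]:
    """List the physical queue names backing one logical queue (hypothetical naming)."""
    return [f"{logical_queue}.shard-{i}" for i in range(num_shards)]


def shard_queue_name(logical_queue: str, routing_key: str, num_shards: int) -> str:
    """Pick one physical queue for a message by hashing its routing key."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # A stable hash keeps the same key on the same shard across processes.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_shards
    return f"{logical_queue}.shard-{shard}"
```

With more shards than CPU cores, each physical queue can be handled by a different core, at the cost of losing global ordering across the logical queue.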

Why does celery need a message broker?

Celery is a job queue/task queue, and the name suggests that it can maintain its tasks and process them. Then why does it need a message broker like RabbitMQ or Redis?
Celery is a distributed task queue, which means the system can be spread across multiple computers (or containers) in multiple locations, connected by a single centralized bus.
The basic architecture is as follows:
workers - processes that take jobs (data) from the bus (task queue) and process them
*a worker can put its result back onto the bus for further processing by a different worker (creating a processing flow)
bus - the task queue; this is basically a database that stores the jobs as messages so the workers can retrieve them.
It's important that this database is concurrent and non-blocking, so that when one process takes a job from the bus or puts one on it, it doesn't block other workers from getting or putting their jobs.
RabbitMQ, Redis, ActiveMQ, Kafka and the like are good candidates for this sort of behaviour.
The bus has an API that lets you submit jobs for workers and retrieve them (among more complex features).
Most buses implement an ack/fail feature: a worker acks a job once it is done, and if it does not ack (or reports a failure) the message can be served again to another worker and might be processed successfully this time, so no data is lost (this depends heavily on the failover logic and on the context of the data used as task input).
Celery includes a scheduler (beat) that periodically puts specific jobs on the bus, thereby creating periodic tasks.
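The worker/bus/ack cycle described above can be sketched with a toy in-memory bus. This is only an illustration using the standard library, not how Celery or any real broker is implemented; the Bus class, its method names, and run_worker are hypothetical.

```python
import queue


class Bus:
    """Toy in-memory task queue with explicit ack/nack (not a real broker)."""

    def __init__(self):
        self._ready = queue.Queue()   # jobs waiting for a worker
        self._unacked = {}            # delivery tag -> job handed out but not yet acked
        self._next_tag = 0

    def publish(self, job):
        self._ready.put(job)

    def get(self):
        """Hand out one job with a delivery tag, or None if the bus is empty."""
        try:
            job = self._ready.get_nowait()
        except queue.Empty:
            return None
        self._next_tag += 1
        self._unacked[self._next_tag] = job
        return self._next_tag, job

    def ack(self, tag):
        """The job is done; forget it."""
        self._unacked.pop(tag)

    def nack(self, tag):
        """The job failed; put it back so it can be served again."""
        self._ready.put(self._unacked.pop(tag))


def run_worker(bus, handler, max_attempts=10):
    """Process jobs until the bus is empty or max_attempts deliveries were made."""
    results = []
    for _ in range(max_attempts):
        delivery = bus.get()
        if delivery is None:
            break
        tag, job = delivery
        try:
            results.append(handler(job))
        except Exception:
            bus.nack(tag)   # failure: the job is requeued
        else:
            bus.ack(tag)    # success: the job is removed for good
    return results
```

A handler that fails on its first call and succeeds on the second will see the same job delivered twice, which is the redelivery behaviour the ack/fail feature provides.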
Let's work with a scraping example: you want to scrape the world, but China only allows traffic from its own region, and so do Europe and the USA,
so you can build workers and place them all over the world.
You can use just one bus, let's say it's located in the USA. All the workers know this bus and can connect to it, so by placing a specific job (scrape China) on the bus located in the US, a process in China can work on it, hence distributed.
Of course, more workers will also increase the throughput of the system simply through parallelism, regardless of their geographic location, and this is the common case for using an event-driven architecture (i.e. a central bus, consumers and producers).
I suggest reading the official docs; they're pretty straightforward.

Multiple clients load distribution with redis

We are using Redis as a queue for asynchronous processing of jobs. One application pushes jobs to Redis (lpush), and another application reads the Redis queue (blpop) and processes them. We wanted to scale the processing application, so we ran two instances on two different machines to process the jobs from the queue, but we observed that one instance takes 70% of the load from the queue while the other processes only a meagre amount. Is there any well-defined strategy or configuration for using multiple clients with Redis and proper load sharing? Or do we have to maintain separate queues for the two instances and push the requests in a round-robin manner?

Simple queue with Celery and RabbitMQ

I'm trying to implement a simple queue that performs one task at a time. Offloading tasks off the main thread using Celery and setting concurrency=1 in the Celery config works fine, but I might want to use more concurrent workers for other tasks.
Is there a way to tell Celery or RabbitMQ to not use multiple concurrent workers for a task (except by forcing concurrency=1)? I can't find anything in the documentation but maybe these tools are not designed for a linear queue?
Thanks!
I think what you need is a separate queue for each type of task. Create separate workers that consume from each queue, with concurrency set to 1.
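A minimal sketch of that setup as a Celery configuration module follows. The module name celeryconfig.py, the app name myapp, the task name, and the queue names are all hypothetical; task_routes and task_default_queue are standard Celery settings, and the worker commands in the comments use the -Q and --concurrency options.

```python
# celeryconfig.py (hypothetical module, loaded with app.config_from_object("celeryconfig"))

# Tasks without an explicit route go to this queue.
task_default_queue = "default"

# Send the task that must run one at a time to its own queue.
# "myapp.tasks.process_in_order" is a placeholder task name.
task_routes = {
    "myapp.tasks.process_in_order": {"queue": "serial"},
}

# Start one worker per queue, for example:
#   celery -A myapp worker -Q serial --concurrency=1 -n serial@%h
#   celery -A myapp worker -Q default --concurrency=4 -n default@%h
```

Only the worker bound to the "serial" queue is limited to one process, so other tasks can still run concurrently on the second worker.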