RQ Worker Processing Jobs in a Batch - python-rq

Say you have an RQ queue with lots of jobs, which gets filled from various sources.
Those jobs would be processed more efficiently in batches, e.g. pulling and processing 100 jobs at a time.
How would you achieve this in RQ?
Would you need to write a custom Worker class to pull multiple jobs at once, a custom Queue class to batch jobs as they are handed out, or some other approach?
Thanks

I think that TaskTiger, which is Redis-based like RQ, can better fulfill your needs. From the README:
Batch queues
Batch queues can be used to combine multiple queued tasks into one. That way, your task function can process multiple sets of arguments at the same time, which can improve performance. The batch size is configurable.
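For illustration, here is a minimal sketch along the lines of the TaskTiger README, assuming a local Redis instance; the queue name and batch size are placeholders for your own setup:

```python
import redis
from tasktiger import TaskTiger

# Minimal sketch based on the TaskTiger README; connection details, queue
# name, and batch size are placeholders for your own setup.
tiger = TaskTiger(connection=redis.Redis(), config={
    'BATCH_QUEUES': {
        # Hand up to 50 queued tasks to a single invocation of the task.
        'my_batch_queue': 50,
    },
})

@tiger.task(queue='my_batch_queue', batch=True)
def send(params):
    # With batch=True, params is a list of dicts like
    # {'args': [...], 'kwargs': {...}}, one entry per queued task.
    for task in params:
        print(task['args'], task['kwargs'])
```

Tasks are still queued one at a time (e.g. via tiger.delay(send, args=[...])), and the worker then delivers them to send in batches of up to 50.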

Related

DynamoDB large transaction writes are timing out

I have a service that receives events that vary in size from roughly 5 to 10k items. We split these events up into chunks, and these chunks need to be written in transactions because we do some post-processing that depends on a successful write of all the items in the chunk. Ordering of the events is important, so we can't dead-letter them to process at a later time. We're running into an issue where we receive very large (10k-item) events and they clog up the event processor, causing a timeout (currently set to 15s). I'm trying to find a way to increase the processing speed of these large events to eliminate the timeouts.
I'm open to ideas, but I'm curious whether there are any pitfalls to running transactional writes concurrently, e.g. splitting the event into chunks of 100 and having X threads run through them to write to DynamoDB concurrently.
There is no concern with multi-threading writes to DynamoDB, so long as you have the capacity to handle the extra throughput.
I would also advise trying smaller batches: with 100 items in a transaction, if one item fails for any reason then they all fail. Typically I suggest aiming for batch sizes of approximately 10, but of course this depends on your use case.
Also ensure that no two threads target the same item at the same time, as this would result in conflicting writes and cause large numbers of failed batches.
In summary: keep batches as small as possible, ensure your table has adequate capacity, and ensure you don't hit the same items concurrently.
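To illustrate that approach, here is a minimal sketch that splits an event into small chunks and writes each chunk as its own transaction from a thread pool. The table name, chunk size, and thread count are assumptions for the example, and items are expected in DynamoDB's attribute-value format:

```python
from concurrent.futures import ThreadPoolExecutor
import boto3

dynamodb = boto3.client("dynamodb")

TABLE = "events"   # placeholder table name
CHUNK_SIZE = 10    # small batches: one bad item only fails its own transaction
MAX_THREADS = 8    # tune to your table's capacity


def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]


def write_chunk(chunk):
    # Each chunk is written atomically; if any item in it fails, the whole
    # transaction fails, which is why small chunks are preferable.
    dynamodb.transact_write_items(
        TransactItems=[
            # Items must be in DynamoDB attribute-value format,
            # e.g. {"pk": {"S": "abc"}, "payload": {"S": "..."}}.
            {"Put": {"TableName": TABLE, "Item": item}} for item in chunk
        ]
    )


def write_event(items):
    # Assumes no two chunks touch the same item, otherwise concurrent
    # transactions will conflict and fail.
    with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
        futures = [pool.submit(write_chunk, c) for c in chunks(items, CHUNK_SIZE)]
        for f in futures:
            f.result()  # surface any failed transaction
```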

Prevent Celery From Grabbing One More Task Than Concurrency

In Celery using RabbitMQ, I have distributed workers running very long tasks on individual EC2 instances.
My concurrency is set to 2, -Ofair is enabled, and task_acks_late = True and worker_prefetch_multiplier = 1 are set, yet a worker runs 2 tasks in parallel and then grabs a third task that it doesn't run. This leaves other workers with no tasks to run.
What I would like is for workers to only grab tasks when they can actually work on them, allowing other workers that are free to grab the tasks and perform them.
Does anyone know how to achieve this? Attached below is an example with my concurrency at 2 and three jobs on the worker, one of which is not yet acknowledged. I would like there to be only two tasks on that worker, with the third remaining on the server until another worker can start it.
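For reference, these are the settings the question describes, shown as a minimal Celery configuration sketch (broker URL and app name are placeholders):

```python
# celeryconfig.py -- the settings described above; broker URL is a placeholder.
broker_url = "amqp://guest@rabbitmq-host//"

task_acks_late = True            # acknowledge only after the task finishes
worker_prefetch_multiplier = 1   # prefetch limit = concurrency * 1 messages

# Worker started with, for example:
#   celery -A myapp worker --concurrency=2 -Ofair
```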

How to achieve dynamic fair processing between batch tasks?

Our use case is that our system supports scheduling multiple multi-channel send jobs at any time, where multi-channel send means sending emails, notifications, SMS, etc.
How it currently works is that we have one SQS queue per channel. Whenever a job is scheduled, it pushes all its send records to the appropriate channel queue; any job scheduled later then pushes its own send records to the same queue, and so on. This leads to starvation of later scheduled jobs if the first scheduled job is high-volume, as its records will all be processed from the queue before the second job's records are reached.
On the consumer side, our processing rate is much lower than the incoming rate, as we can only do a fixed number of sends per hour. So a high-volume job can keep going for a long time after being scheduled.
To solve the starvation problem, our first idea was to create 3 queues per channel (low, medium, and high volume), with jobs submitted to a queue according to their volume. The problem is that if 2 or more jobs of the same volume come in, we still face the same issue.
The only guaranteed way to ensure no starvation and fair processing seems to be a queue per job, created dynamically. Consumers would process each queue at an equal rate, so the processing bandwidth gets divided between jobs. A high-volume job might take a long time to complete, but it won't choke processing for the other jobs.
We could create the SQS queues dynamically for every scheduled job, but that would mean monitoring maybe 50+ queues at some point. A better choice seemed to be a Kinesis stream with multiple shards, where we would need to ensure every shard only contains a single partition key identifying a single job; I am not sure whether that's possible, though.
Are there any better ways to achieve this, so we can do fair processing and not starve any job?
If this is not the right community for such questions, please let me know.
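As an illustration of the queue-per-job idea, a consumer could round-robin over the active job queues, taking a small fixed slice from each on every pass so no single job monopolizes the send capacity. The queue-name prefix, batch size, and send_record helper below are assumptions for the sketch:

```python
import boto3

sqs = boto3.client("sqs")

QUEUE_PREFIX = "send-job-"   # hypothetical naming convention: one queue per job
BATCH_PER_QUEUE = 10         # equal slice of work per job on every pass


def active_job_queues():
    # Discover the per-job queues by name prefix (first page only, for brevity).
    resp = sqs.list_queues(QueueNamePrefix=QUEUE_PREFIX)
    return resp.get("QueueUrls", [])


def send_record(body):
    # Placeholder for the actual email/SMS/notification send, subject to
    # your per-hour rate limit.
    print("sending", body)


def drain_one_pass():
    # Take at most BATCH_PER_QUEUE messages from each job queue, so a huge
    # job cannot starve the smaller ones.
    for queue_url in active_job_queues():
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(BATCH_PER_QUEUE, 10),  # SQS caps this at 10
            WaitTimeSeconds=1,
        )
        for msg in resp.get("Messages", []):
            send_record(msg["Body"])
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])


if __name__ == "__main__":
    while True:
        drain_one_pass()
```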

How to prevent ironworker from enqueuing tasks of workers that are still running?

I have a worker whose runtime varies greatly, from 10 seconds up to an hour, and I want to run it every five minutes. This is fine as long as the job finishes within five minutes. However, if the job takes longer, Iron.io keeps enqueuing the same task over and over, and a bunch of tasks of the same type accumulate while the worker is running.
Furthermore, it is crucial that the task never runs concurrently, so max concurrency for this worker is set to one.
So my question is: is there a way to prevent Iron.io from enqueuing tasks for workers that are still running?
Answering my own question.
According to Iron.io support, it is not possible to prevent IronWorker from enqueuing tasks for workers that are still running. For cases like mine it is better to have master workers that do the scheduling, i.e. creating/enqueuing tasks from a script via one of the client libraries.
The best option is to enqueue the new task from the worker's own code. For example, your task runs for 10 seconds to 1 hour and enqueues itself at the end (as its last line of code). This prevents tasks from accumulating while the worker is running.
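A rough sketch of that self-enqueuing pattern, with the actual Iron.io call hidden behind a hypothetical enqueue_self() helper, since the concrete call depends on which client library and credentials you use:

```python
def do_the_actual_work():
    # 10 seconds to 1 hour of work goes here.
    ...


def enqueue_self():
    # Hypothetical helper: create/queue a new task for this same worker code
    # via your Iron.io client library of choice. The concrete call is
    # intentionally left out here, as it depends on the client library.
    raise NotImplementedError


if __name__ == "__main__":
    do_the_actual_work()
    # Re-enqueue only after the work has finished, so there is never more
    # than one queued or running instance of this worker.
    enqueue_self()
```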

Huge Number of consumers per ActiveMQ session

I am creating one session per connection per thread to an ActiveMQ cluster, but I want to consume from hundreds of destinations. I understand that if I only have one thread (one session), I can't consume messages from these destinations concurrently; I don't want to do that either. But I do want to have hundreds of consumers per session, which would in turn be associated with hundreds of different destinations. Is this a viable approach? Please also explain why it is or isn't viable.
PS: I don't want to do any heavy processing on the messages, which is why there is only 1 thread.
A session is not bound to a single thread; threading is a separate chapter. You can use a session from multiple threads (not recommended) and multiple sessions in a single thread. The session construct exists mainly to control transactions, i.e. committing and rolling back messages in a transaction.
Anyway, you can use a single consumer to read multiple destinations. Simply put the destinations in a comma-separated list, like "my.first.queue,my.other.queue,my.last.queue". You can also read queues using wildcards: "my.>" would cover all of the queues above.
This way, you can use a single thread and a single session to read from a large number of queues.
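As a non-JMS illustration of the same idea, here is a sketch using the stomp.py client against ActiveMQ's STOMP connector (assumed to be enabled on port 61613), where a single connection and thread subscribe to the "my.>" wildcard. The host, credentials, and listener signature (stomp.py 8.x) are assumptions:

```python
import time
import stomp  # stomp.py client; assumes ActiveMQ's STOMP connector on 61613


class MultiQueueListener(stomp.ConnectionListener):
    # stomp.py 8.x signature; older versions pass (headers, message) instead.
    def on_message(self, frame):
        print(frame.headers.get("destination"), frame.body)


conn = stomp.Connection([("localhost", 61613)])
conn.set_listener("", MultiQueueListener())
conn.connect("admin", "admin", wait=True)  # placeholder credentials

# One subscription covering every queue under the "my." prefix, so a single
# connection and thread receive from all of them. The comma-separated
# composite form ("my.first.queue,my.other.queue,my.last.queue") is the
# explicit equivalent.
conn.subscribe(destination="/queue/my.>", id=1, ack="auto")

time.sleep(60)  # keep the script alive long enough to receive messages
conn.disconnect()
```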