How to achieve dynamic fair processing between batch tasks? - batch-processing

Our use case is that our system supports scheduling multiple multi-channel send jobs at any time, where "multi-channel send" means sending emails, notifications, SMS, etc.
How it currently works is that we have one SQS queue per channel. Whenever a job is scheduled, it pushes all of its send records to the appropriate channel's SQS queue; any job scheduled later pushes its own send records to the same queues, and so on. This leads to starvation of later-scheduled jobs if the first scheduled job is high volume, because all of its records are processed from the queue before the second job's records are reached.
On the consumer side, our processing rate is much lower than the incoming rate, since we can only do a fixed number of sends per hour. So a high-volume job could keep running for a long time after being scheduled.
To solve the starvation problem, our first idea was to create three queues per channel (low, medium, and high volume) and submit each job to a queue according to its volume. The problem is that if two or more jobs of the same volume come in, we still face starvation.
The only guaranteed way to ensure no starvation and fair processing seems to be having a queue per job, created dynamically. Consumers would process each queue at an equal rate, so processing bandwidth gets divided between jobs. A high-volume job might take a long time to complete, but it won't choke processing for other jobs.
We could create the SQS queues dynamically for every scheduled job, but that would mean monitoring maybe 50+ queues at some point. A better choice seemed to be a Kinesis stream with multiple shards, where we would need to ensure every shard contains only a single partition key identifying a single job. I am not sure that's possible, though.
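For illustration, a minimal sketch of that round-robin idea over per-job SQS queues, assuming boto3 and some job registry that yields the queue URLs (both are assumptions for the example, not part of the question):

import boto3

sqs = boto3.client("sqs")

# Hypothetical registry of per-job queue URLs; in practice this list
# would be refreshed as jobs are scheduled and completed.
job_queue_urls = []

def process_one_round(send_fn):
    # One round-robin pass: take at most one message per job queue,
    # so the fixed send budget is split evenly across active jobs.
    for url in job_queue_urls:
        resp = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=1)
        for msg in resp.get("Messages", []):
            send_fn(msg["Body"])  # do the actual channel send
            sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])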
Are there any better ways to achieve this, so we can do fair processing and not starve any job?
If this is not the right community for such questions, please let me know.

Related

RabbitMQ - allow only one process per user

To keep it short, here is a simplified situation:
I need to implement a queue for background processing of imported data files. I want to dedicate a number of consumers to this specific task (let's say 10) so that multiple users can be processed in parallel. At the same time, to avoid problems with concurrent data writes, I need to make sure that no user is processed by multiple consumers at the same time; basically, all files of a single user should be processed sequentially.
Current solution (but it does not feel right):
Have 1 queue where all import tasks are published (file_queue_main)
Have 10 queues for file processing (file_processing_n)
Have 1 result queue (file_results_queue)
Have a manager process (in this case in node.js) which consumes messages from file_queue_main one by one and decides which file_processing queue to distribute each message to. Basically, it keeps track of which file_processing queues the current users are being processed in.
Is RabbitMQ even the tool for the job? For some reason, it feels like some sort of an anti-pattern. Appreciate any help!
The part about this that doesn't "feel right" to me is the manager process. It has to know the current state of each consumer, and it also has to stop and wait if all processors are working on other users. Ideally, you'd prefer to keep each process ignorant of the others. You're also getting very little benefit out of your processing queues, which are only used when a processor is already working on a message from the same user.
Ultimately, the best solution here is going to depend on exactly what your expected usage is and how likely it is that the next message is from a user that is already being processed. If you're expecting most of your messages coming in at any one time to be from 10 users or fewer, what you have might be fine. If you're expecting to be processing messages from many different users with only the occasional duplicate, your processing queues are going to be empty much of the time and you've created a lot of unnecessary complexity.
Other things you could do here:
Have all consumers pull from the same queue and use some sort of distributed locking to prevent collisions. If a consumer gets a message from a user that's already being worked on, requeue it and move on (see the sketch after this list).
Set up your queue routing so that messages from the same user will always go to the same consumer. The downside is that if you don't spread the traffic out evenly, you could have some consumers backed up while others sit idle.
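A minimal sketch of the locking option, assuming Redis (redis-py) for the lock; process_fn and requeue_fn are hypothetical hooks into your consumer:

import uuid
import redis

r = redis.Redis()

def handle(user_id, message, process_fn, requeue_fn):
    # Per-user lock: nx=True sets the key only if it is absent;
    # ex=300 expires it so a crashed consumer can't hold it forever.
    key = f"user-lock:{user_id}"
    token = str(uuid.uuid4())
    if r.set(key, token, nx=True, ex=300):
        try:
            process_fn(message)
        finally:
            # Release only if we still own the lock (a sketch; a Lua
            # script would make this check-and-delete atomic).
            if r.get(key) == token.encode():
                r.delete(key)
    else:
        requeue_fn(message)  # user already being processed elsewhere; retry later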
Also, if you're getting a lot of messages from the same user at once that must be processed sequentially, I would question whether they should be separate messages at all. Why not send a single message with a list of things to be processed? Much of the benefit of event queues comes from being able to treat each event as a discrete item that can be processed individually.
If the user has a unique ID, or the file being worked on has a unique ID, then hash the ID to pick the processing queue to use. That way you will always have the same user's / file's tasks queued on the same processing queue.
I am not sure how this will affect queue length for the processing queues.
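A minimal sketch of that routing rule (the queue naming follows the file_processing_n convention from the question):

import hashlib

NUM_QUEUES = 10  # one per dedicated consumer

def queue_for(user_id):
    # Stable hash of the user ID; avoid Python's built-in hash(),
    # which is salted per process.
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return f"file_processing_{int(digest, 16) % NUM_QUEUES}"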

Idle Queue utilization in Capacity Scheduler - EMR

I configured the Capacity Scheduler and schedule jobs in specific queues. However, I see there are times when jobs in some queues complete quickly while other queues have jobs waiting on previous ones to complete. This creates a scenario where half of my capacity is idle and the other half is busy with jobs waiting to get resources.
Is there any config that I can tweak to maximize my utilization? I want to route waiting jobs to other queues where resources are available.
This seems like an issue with the Capacity Scheduler. I switched to the Fair Scheduler and definitely see huge improvements in cluster utilization: ~75%, way better than the ~40% with the Capacity Scheduler.
The reason behind this is that when multiple users submit jobs to the same queue, the queue can consume up to its maximum resources, but a single user can't consume more than the queue's configured capacity, even if the maximum capacity is greater than that.
So if you specify yarn.scheduler.capacity.root.QUEUE-1.capacity: 20 in capacity-scheduler.xml, one user can't take more than 20% of the resources of the QUEUE-1 queue, even if your cluster has free resources.
By default, the user-limit-factor is set to 1. So if you set it to 2, your job can use 40% of resources, provided the maximum allocated capacity is greater than or equal to 40%.
yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor: 2
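For reference, the two settings together in capacity-scheduler.xml would look roughly like this (QUEUE-1 being the example queue name from above):

<property>
  <name>yarn.scheduler.capacity.root.QUEUE-1.capacity</name>
  <value>20</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor</name>
  <value>2</value>
</property>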
Please follow this blog

How to interpret number of messages in Distributor's Storage and Control queues?

This article describes the control queues the NSB master node uses to control message load, though it's still not clear to me how to interpret disproportions in the number of messages in these queues: https://docs.particular.net/nservicebus/msmq/distributor/
I'm observing slowness in my NSB service, which has never experienced slowness before. For some reason fewer parallel threads are created per master node compared to before, and there has been no change in the workers' or master nodes' configuration, such as the maximum number of threads to allocate. I'm trying to figure out whether it's the master node that does not want to feed the workers, or the workers that do not want to take on more work.
I see that the number of messages in the control queue jumps from 15 to 40, while the storage queue has only 5-8. Should I interpret that as the workers being ready to work while the distributor can't send them more messages? Thanks
The numbers in the control and storage queues will jump up and down as long as the distributor is handing out messages. A message coming into the control queue will immediately be popped off that queue and onto the storage queue. A message coming into the primary queue of the distributor will immediately result in the first message of the storage queue being popped off.
It's hard to interpret the numbers of messages in the queues of a running distributor, because, by the time you look at the numbers with Computer Management or Queue Explorer, they will have changed.
The extreme cases are this:
1. No messages in the primary input queue of the distributor and no work happening on any of the workers.
Input queue: 0
Control queue: 0
Storage queue: number of workers × configured threads per worker (e.g., 3 workers with 4 threads each would leave 12 ready-messages waiting in the storage queue)
2. All workers are working at full capacity. None able to take on more work.
Input queue: 0+ (grows as new messages come in)
Control queue: 0
Storage queue: 0
In a running system, it can be anything between these two extremes, so, unfortunately, it's hard to say much from just a snapshot of the control and storage queue.
Some troubleshooting tips:
If the storage queue is empty, the distributor cannot hand out more work; it does not know where to send it. This happens when all the workers are fully occupied, as they will not send any ready-messages back to the control queue until they finish handling a message.
If the storage queue is consistently small compared to the total number of worker threads across all the workers, you are approaching the total maximum capacity of your workers.
I suggest you start looking at the logs of the workers and see if the work they are doing is taking longer than usual. Slower database/third party integration?
Another thing to check is if there has been anything IO-heavy added to the machine hosting the distributor. If the distributor was already running at close to max capacity, adding extra IO might slow down MSMQ on the box, giving you worse throughput.

How can I measure the frequency which is good enough take out the data from RabbitMQ?

I have RabbitMQ running on a server, and there's a script which inserts data into it. I know the approximate frequency at which the data is inserted, but it's not only approximate, it can also vary quite a lot.
How can I know how often another script has to take the data out of RabbitMQ?
What will happen if the second script takes the data out of RabbitMQ more slowly than needed?
How can I measure whether or not the frequency is good enough?
How can I know how often another script has to take the data out of RabbitMQ?
You should consume messages from the queue at a rate greater than or equal to the rate they are published. RabbitMQ reports publish rates; however, you will want to get a reasonable estimate from load testing your application.
What will happen if the second script takes the data out of RabbitMQ more slowly than needed?
In the short term, the number of messages in the queue will increase, as will processing time (think about what happens when more people get in line for Space Mountain at Disney). In the long term, the system will be unstable because the queue will increase without bound, eventually resulting in a failure of the queue, as well as other practical consequences (think of this as the case where Space Mountain is broken down, but people are still allowed to enter the queue line).
How can I measure whether or not the frequency is good enough?
From an information-only perspective, you can monitor the queue yourself using the RabbitMQ management plugin. If you need automated processes to spawn additional workers, you'll have to integrate those processes with the RabbitMQ management API. How to do this is the subject of a number of how-to articles.
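As a rough illustration, the management plugin's HTTP API exposes both the backlog and the publish/deliver rates; a minimal polling sketch, where the queue name, vhost (%2f is "/"), and credentials are placeholders for the example:

import requests

q = requests.get(
    "http://localhost:15672/api/queues/%2f/my_queue",
    auth=("guest", "guest"),
).json()

backlog = q.get("messages", 0)
publish_rate = q.get("message_stats", {}).get("publish_details", {}).get("rate", 0.0)
consume_rate = q.get("message_stats", {}).get("deliver_get_details", {}).get("rate", 0.0)

# Healthy: the consume rate keeps up with the publish rate and the backlog stays flat.
print(f"backlog={backlog}, publish/s={publish_rate:.1f}, consume/s={consume_rate:.1f}")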

What's the recommended way to queue "delayed execution" messages via ServiceStack/Redis MQ?

I would like to queue up messages to be processed only after a given duration of time elapses (i.e., a minimum date/time for execution is met), and/or, at processing time of a message, defer its execution to a later point in time (say, if some prerequisite checks are not met).
For example, an event happens which defines a process that needs to run no sooner than 1 hour from the time of the initial event.
Is there any built in/suggested model to orchestrate this using https://github.com/ServiceStack/ServiceStack/wiki/Messaging-and-Redis?
I would probably build this with a two-step approach:
Queue the task into your queueing system, which will process it into a persistence store: SQL Server, MongoDB, or RavenDB.
Have a service polling your "queued" tasks for when they should be reinserted back into the queue.
This is probably the safest way, since presumably you don't want to lose these jobs.
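A minimal sketch of the polling half, using a Redis sorted set as the persistence store (an assumption for the example; the same idea applies to the SQL Server/MongoDB/RavenDB options above):

import time
import redis

r = redis.Redis()

def defer(payload, delay_seconds):
    # Persist the task with its due time as the sorted-set score.
    r.zadd("deferred_tasks", {payload: time.time() + delay_seconds})

def requeue_due_tasks(enqueue_fn):
    # Run periodically: move every task whose due time has passed
    # back into the real processing queue.
    for payload in r.zrangebyscore("deferred_tasks", 0, time.time()):
        if r.zrem("deferred_tasks", payload):  # guards against double pickup
            enqueue_fn(payload)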
If you use RabbitMQ instead of Redis, you could use dead letter queues to get the same behavior. Dead letter queues are essentially catchers for expired messages.
So you push your messages into a queue with no intention of processing them there, and they have a specific expiration in minutes. When they expire, they pop over into the queue that you will actually process out of. A pretty slick way to queue things for later.
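A minimal sketch of that pattern with pika (the queue names and the one-hour TTL are just for this example):

import pika

ch = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

# The queue your consumers actually read from.
ch.queue_declare(queue="work_queue")

# A holding queue with no consumers: messages sit here until the TTL
# expires, then get dead-lettered into work_queue.
ch.queue_declare(
    queue="delay_1h",
    arguments={
        "x-message-ttl": 3600 * 1000,        # milliseconds
        "x-dead-letter-exchange": "",        # default exchange
        "x-dead-letter-routing-key": "work_queue",
    },
)

ch.basic_publish(exchange="", routing_key="delay_1h", body=b"run no sooner than 1 hour from now")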
You could always use https://github.com/dominionenterprises/mongo-queue-csharp or https://github.com/dominionenterprises/mongo-queue-php or https://github.com/gaillard/mongo-queue-java, which provide delayed messages and other uncommon features.