Load Balancing job queue among disproportionate workers - load-balancing

I'm working on a tool to automatically manage a job queue (in this case, Beanstalkd). Currently, you must manually set the number of workers available to pull jobs off the queue, which either fails to absorb spikes in jobs or wastes resources during quiet periods.
I have a client/server setup that runs on the job queue server and the workers. The client connects to the server and reports its available resources (CPU/memory) as well as what types of jobs it can run. The server then monitors the queues and, once a second, tells each connected client how many workers to run against each queue. There are currently a hundred or so different worker types, they use very different amounts of CPU/memory, and the worker servers themselves have different levels of performance.
I'm looking for techniques to balance the workload most effectively based on job queue length and the resource requirement of each worker - for example, some workers use 100% of a core for 5s, while others take microseconds to complete. Also, some jobs are higher priority than others.
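To make the setup concrete, here is a rough sketch of what the once-a-second allocation step on the server could look like; every name, and the simple linear cost model, is an illustrative assumption rather than part of the existing tool:

    # rough sketch of a server-side allocation pass (illustrative only)
    from dataclasses import dataclass

    @dataclass
    class WorkerType:
        name: str
        cpu_per_worker: float   # fraction of a core a running worker uses
        mem_per_worker: float   # MB a running worker uses
        priority: int           # higher value = more important jobs

    def allocate(queue_lengths, worker_types, free_cpu, free_mem):
        """Return {worker type name: worker count} for one client, recomputed each second."""
        allocation = {}
        # serve higher-priority types first, then the longest backlogs
        order = sorted(worker_types,
                       key=lambda w: (w.priority, queue_lengths.get(w.name, 0)),
                       reverse=True)
        for w in order:
            backlog = queue_lengths.get(w.name, 0)
            if backlog == 0:
                continue
            # how many workers of this type still fit in the client's reported headroom
            fit = int(min(free_cpu / max(w.cpu_per_worker, 1e-6),
                          free_mem / max(w.mem_per_worker, 1e-6)))
            count = min(backlog, fit)
            if count <= 0:
                continue
            allocation[w.name] = count
            free_cpu -= count * w.cpu_per_worker
            free_mem -= count * w.mem_per_worker
        return allocation

Priorities are honoured by the ordering, and the cheap, microsecond-scale worker types naturally get large counts because many of them fit in the same headroom.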

Related

How to achieve dynamic fair processing between batch tasks?

Our use case is that our system supports scheduling multiple multi-channel send jobs at any time. Multi-channel send means sending emails, notifications, SMS, etc.
How it currently works is that we have an SQS queue per channel. Whenever a job is scheduled, it pushes all of its send records to the appropriate channel's SQS queue; any job scheduled later then pushes its own send records to the same queue, and so on. This leads to starvation of later jobs when the first scheduled job is high volume, since its records are processed from the queue before the second job's records are ever reached.
On the consumer side, the processing rate is much lower than the incoming rate because we can only do a fixed number of sends per hour, so a high-volume job can keep running for a long time after being scheduled.
To solve the starvation problem, our first idea was to create three queues per channel (low, medium, and high volume) and submit jobs to the queue matching their volume. The problem is that if two or more jobs of the same volume arrive, we still face the same issue.
The only guaranteed way to ensure fair processing with no starvation seems to be a dynamically created queue per job. Consumers process each queue at an equal rate, so the processing bandwidth gets divided between the jobs. A high-volume job might take a long time to complete, but it won't choke processing for the other jobs.
We could create the SQS queues dynamically for every scheduled job, but that would mean monitoring perhaps 50+ queues at some point. A better choice seemed to be a Kinesis stream with multiple shards, where we would need to ensure that every shard contains only a single partition key identifying a single job; I'm not sure that's possible, though.
Are there any better ways to achieve this, so we can do fair processing and not starve any job?
If this is not the right community for such questions, please let me know.
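For illustration only, here is a rough sketch of the per-job-queue, round-robin consumption idea described above, written with boto3; the queue-name prefix and the send_record() helper are assumptions, not part of our system:

    # round-robin draining of dynamically created per-job SQS queues (sketch)
    import boto3

    sqs = boto3.client('sqs')

    def send_record(body):
        pass  # hand off to the real channel sender, which enforces the hourly rate limit

    def job_queue_urls():
        # assumes one queue per scheduled job, named with a common prefix
        return sqs.list_queues(QueueNamePrefix='sendjob-').get('QueueUrls', [])

    def drain_round_robin(batch_per_queue=10):
        # take a small, equal batch from every job queue on each pass,
        # so a huge job cannot starve jobs scheduled after it
        for url in job_queue_urls():
            resp = sqs.receive_message(QueueUrl=url,
                                       MaxNumberOfMessages=batch_per_queue,
                                       WaitTimeSeconds=1)
            for msg in resp.get('Messages', []):
                send_record(msg['Body'])
                sqs.delete_message(QueueUrl=url, ReceiptHandle=msg['ReceiptHandle'])

    while True:
        drain_round_robin()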

Idle Queue utilization in Capacity Scheduler - EMR

I configured the Capacity Scheduler and schedule jobs in specific queues. However, there are times when jobs in some queues complete quickly while other queues have jobs waiting on previous ones to complete. This creates a scenario where half of my capacity is idle and the other half is busy, with jobs waiting to get resources.
Is there any config I can tweak to maximize utilization? I want to route waiting jobs to other queues where resources are available. Attached is a screenshot.
This seems to be an issue with the Capacity Scheduler. I switched to the Fair Scheduler and see a huge improvement in cluster utilization: roughly 75%, far better than the ~40% I was getting with the Capacity Scheduler.
The reason behind this is that when multiple users submit jobs to the same queue, the queue can consume up to its maximum resources, but a single user can't consume more than the queue's configured capacity even when the maximum capacity is greater than that.
So if you specify yarn.scheduler.capacity.root.QUEUE-1.capacity: 20 in capacity-scheduler.xml, one user can't take more than 20% of the resources for the QUEUE-1 queue even if your cluster has free resources.
By default this user-limit-factor is set to 1, so if you set it to 2 your job can use 40% of the resources, provided the queue's maximum capacity is greater than or equal to 40.
yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor: 2
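For reference, a capacity-scheduler.xml snippet with illustrative values for the QUEUE-1 queue used above (20% capacity, a maximum capacity of at least 40%, and a user-limit-factor of 2) would look something like this:

    <!-- illustrative values only -->
    <property>
      <name>yarn.scheduler.capacity.root.QUEUE-1.capacity</name>
      <value>20</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.QUEUE-1.maximum-capacity</name>
      <value>40</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor</name>
      <value>2</value>
    </property>

Remember to refresh the queues (yarn rmadmin -refreshQueues) after changing capacity-scheduler.xml.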
Please follow this blog

Why does celery need a message broker?

Since Celery is a job queue/task queue, the name suggests that it can maintain its tasks and process them on its own. Why, then, does it need a message broker like RabbitMQ or Redis?
Celery is a distributed task queue, which means the system can reside across multiple computers (or containers) in multiple locations, coordinated through a single centralised bus.
The basic architecture is as follows:
workers - processes that take jobs (data) from the bus (the task queue) and process them. A worker can put its result back onto the bus for further processing by a different worker, creating a processing flow.
bus - the task queue; this is basically a database that stores the jobs as messages so the workers can retrieve them. It's important that it be a concurrent, non-blocking store, so that when one process takes a job from or puts a job on the bus, it doesn't block other workers from getting or putting their own jobs. RabbitMQ, Redis, ActiveMQ, Kafka and the like are good candidates for this sort of behaviour. The bus exposes an API that lets you submit jobs for workers and retrieve them (among more complex features).
Most buses implement an ack/fail feature, so workers can acknowledge that a job is done; if a job is not acked (or a failure is reported), the message can be served again to another worker and may be processed successfully that time, so no data is lost (this depends heavily on the failover logic and on the context of the data as input to a task).
Celery includes a scheduler (beat) that periodically puts specific jobs on the bus, which is how periodic tasks are created.
Let's work with a scraping example: you want to scrape the whole world, but China only allows traffic from within its region, and the same goes for Europe and the USA,
so you can build workers and place them all over the world.
You can use only one bus; let's say it's located in the USA. All the other workers know this bus and can connect to it, so by placing a specific job ("scrape China") on the bus located in the US, a process in China can work on it; hence, distributed.
Of course, workers increase the throughput of the system only through parallelism, regardless of their geographic location, and this is the common case for using an event-driven architecture (i.e. a central bus with consumers and producers).
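To make this concrete, here is a minimal sketch of such a setup with Celery and Redis as the broker; the broker URL, task name and schedule are illustrative assumptions, not something from the question:

    # tasks.py (illustrative)
    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')

    @app.task
    def scrape(region):
        # placeholder for the real scraping work; any connected worker,
        # wherever it runs, can pick this job off the bus
        return 'scraped ' + region

    # beat: periodically put a "scrape china" job on the bus
    app.conf.beat_schedule = {
        'scrape-china-hourly': {
            'task': 'tasks.scrape',
            'schedule': 3600.0,
            'args': ('china',),
        },
    }

Running celery -A tasks worker on each regional machine and celery -A tasks beat in one place roughly reproduces the distributed setup sketched above.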
I suggest reading the official docs; they're pretty straightforward.

NServiceBus Pub/Sub Distributor/Worker Scenario Too Slow

I am working on a proof of concept implementation of NServiceBus v4.x for work.
Right now I have two subscribers and a single publisher.
The publisher can publish over 500 messages per second. It runs great.
Subscriber A runs without distributors/workers. It is a single process.
Subscriber B runs with a single distributor powering N number of workers.
In my test I hit an endpoint that creates and publishes 100,000 messages. I do this publish with the subscribers off line.
Subscriber A processes a steady 100 messages per second.
Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.
It seems in my scenario that the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
Am I possibly missing something that is causing the distributor to be throttled? All buses are running an unlimited Dev license.
System Information:
Intel Core i5 M520 @ 2.40 GHz
8 GBs of RAM
SSD Hard Drive
UPDATE 08/06/2013: I finished deploying the system to a set of servers and am seeing the same results. Every worker server that I add decreases the performance of the subscriber.
Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker I am experiencing ~80 messages/events per second. Adding in another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages". No logic actually happens in the handlers other than a log of the message through log4net. Turning off the logging doesn't increase performance.
Suggestions?
If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.
If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.
Give it a try with multiple servers and see what happens.
Rather than having a dummy handler that does nothing, can you simulate actual processing by adding in some sleep time, say 5 seconds, and then compare the results of a single subscriber against going through the distributor?
Scaling out (with or without a distributor) is only useful where the work being done by a single machine takes time, so that more computing resources actually help.
To help with this, monitor the CriticalTime performance counter on the endpoint, and when the need arises, add in the distributor.
Scaling out with the distributor when needed is made easy by not having to change code; you just start the same endpoint under the distributor and worker profiles.
The whole chain is transactional, and you are paying heavily for that. Spreading the workload across machines will not really increase performance unless you have very fast disk storage with write-through caching to speed up the transactional writes.
Once you have your PoC scaled out to several servers, try marking the messages as 'Express' (which skips transactional writes in the queue) and disabling MSDTC on the bus instances to see what kind of performance is possible without transactions. This is not really usable for production unless you know where transactions are not mandatory, or what is achievable with an architecture that does not require DTC.

Multiple clients load distribution with redis

We are using redis as a queue for asynchronous processing of jobs. One application pushes jobs to redis (lpush), other application reads the redis queue (blpop) and processes the same. We wanted to scale the processing application so we ran two different instances on 2 different machines to process the jobs from queue, but we observed that one instance is taking 70% of the load from queue while other instances processes only a meagre amount. Is there any well defined strategy or configuration in using multiple clients with redis and proper load sharing? Or we have to maintain separate queues for the two instances and push the requests in a round robin manner?