Idle Queue utilization in Capacity Scheduler - EMR - amazon-emr

I configured the Capacity Scheduler and schedule jobs in specific queues. However, there are times when jobs in some queues complete faster while other queues have jobs waiting on previous ones to complete. This creates a scenario where half of my capacity is idle and the other half is busy, with jobs waiting to get resources.
Is there any config that I can tweak to maximize my utilization? I want to route waiting jobs to other queues where resources are available. Attached is a screenshot -

This seems to be an issue with the Capacity Scheduler. I switched to the Fair Scheduler and definitely see a huge improvement in cluster utilization: about 75%, much better than the 40s I was getting with the Capacity Scheduler.

The reason is that when multiple users submit jobs to the same queue, together they can consume up to the queue's maximum capacity, but a single user can't consume more than the queue's configured capacity, even if the maximum capacity is greater than that.
So if you specify yarn.scheduler.capacity.root.QUEUE-1.capacity: 20 in capacity-scheduler.xml, one user can't take more than 20% of resources in the QUEUE-1 queue, even though your cluster has free resources.
This per-user cap is controlled by user-limit-factor, which is set to 1 by default. If you set it to 2, your job can use 40% of resources, provided the queue's maximum capacity is greater than or equal to 40.
yarn.scheduler.capacity.root.QUEUE-1.user-limit-factor: 2
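The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the simplified rule described in this answer (capacity times user-limit-factor, capped by maximum capacity); the function and parameter names are made up for the example, and the real scheduler also takes other settings and the number of active users into account.

```python
def single_user_cap(capacity_pct, user_limit_factor=1.0, maximum_capacity_pct=100.0):
    """Approximate percentage of cluster resources one user can hold in a queue.

    Simplified rule from the answer above: capacity * user-limit-factor,
    never exceeding the queue's maximum capacity. Names are hypothetical.
    """
    return min(float(capacity_pct) * user_limit_factor, float(maximum_capacity_pct))


# Default factor of 1: a single user is stuck at the queue's capacity.
default_cap = single_user_cap(20, maximum_capacity_pct=50)

# Factor of 2 with maximum capacity >= 40: the user can reach 40%.
doubled_cap = single_user_cap(20, user_limit_factor=2, maximum_capacity_pct=50)

# Factor of 2 but maximum capacity of 30: the maximum capacity wins.
clamped_cap = single_user_cap(20, user_limit_factor=2, maximum_capacity_pct=30)

print(default_cap, doubled_cap, clamped_cap)
```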
Please follow this blog

Related

How to achieve dynamic fair processing between batch tasks?

Our use case is that our system supports scheduling multiple multi-channel send jobs at any time. Multi-channel send means sending emails, notifications, SMS, etc.
How it currently works is that we have one SQS queue per channel. Whenever a job is scheduled, it pushes all its send records to the appropriate channel's SQS queue. Any job scheduled later then pushes its own send records to the appropriate channel's queue, and so on. This leads to starvation of later scheduled jobs if the first scheduled job is high volume, as its records will be processed from the queue first, before the second job's records are reached.
On the consumer side, our processing rate is much lower than the incoming rate, as we can only do a fixed number of sends per hour. So a high volume job could go on for a long time after being scheduled.
To solve the starvation problem, our first idea was to create 3 queues per channel (low, medium, and high volume), with jobs submitted to a queue according to their volume. The problem is that if 2 or more jobs of the same volume arrive, we still face this problem.
The only guaranteed way to ensure no starvation and fair processing seems to be a queue per job, created dynamically. Consumers process from each queue at an equal rate, and processing bandwidth gets divided between jobs. A high volume job might take a long time to complete, but it won't choke processing for other jobs.
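The queue-per-job idea can be sketched as a round-robin consumer. This is a minimal in-memory illustration, not SQS or Kinesis code: `fair_drain`, the job names, and the `budget` parameter (standing in for the fixed number of sends allowed per hour) are all hypothetical.

```python
from collections import deque


def fair_drain(jobs, budget):
    """Process up to `budget` records, taking one record per job per pass.

    `jobs` maps a job id to its pending records. Jobs whose queue runs
    empty are dropped, so the remaining bandwidth goes to the others.
    """
    queues = {job_id: deque(records) for job_id, records in jobs.items() if records}
    processed = []
    while budget > 0 and queues:
        for job_id in list(queues):  # copy keys so we can delete while looping
            if budget == 0:
                break
            processed.append((job_id, queues[job_id].popleft()))
            budget -= 1
            if not queues[job_id]:
                del queues[job_id]
    return processed


# A large job scheduled first no longer starves a small job scheduled later.
order = fair_drain({"big": [0, 1, 2, 3, 4, 5], "small": [0, 1]}, budget=6)
print(order)
```

With one queue per job, the small job finishes after four sends instead of waiting behind all of the big job's records.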
We could create the SQS queues dynamically for every job scheduled, but that would mean monitoring maybe 50+ queues at some point. A better choice seemed to be a Kinesis stream with multiple shards, where we would need to ensure every shard only contains a single partition key that identifies a single job. I am not sure if that's possible, though.
Are there any better ways to achieve this, so we can do fair processing and not starve any job?
If this is not the right community for such questions, please let me know.

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this as succinctly as I can:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some multiple in parallel). This works out to around 40 workflows running at any given time. However when cluster is under load or some underlying service is slow (e.g. impala or hbase), workflows will run longer than usual and back up so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool that oozie can start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate for these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will run behind sometimes. BTW all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into just one coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. That way you can distribute your actions across K queues in your workflow, which ensures that you will not have more than K actions running at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course this is only for actions that can run in parallel, i.e. there are no data dependencies etc.).
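As a rough idea of what such a generator does (this is not the script mentioned above, just a minimal sketch with hypothetical names), the core step is splitting the independent actions into at most K sequential chains. Each chain would then become one path of the fork, with its actions linked one after another.

```python
def distribute(actions, k):
    """Assign actions round-robin to at most k sequential chains.

    Each returned chain is meant to become one fork path, so at most
    k actions run concurrently. Assumes the actions are independent.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    chains = [[] for _ in range(min(k, len(actions)))]
    for i, action in enumerate(actions):
        chains[i % len(chains)].append(action)
    return chains


# Seven independent actions limited to three running at a time.
chains = distribute(["a", "b", "c", "d", "e", "f", "g"], 3)
print(chains)
```

Round-robin assignment is the simplest choice; if action durations differ a lot, balancing chains by expected runtime may work better.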
Hope this helps

NServiceBus Pub/Sub Distributor/Worker Scenario Too Slow

I am working on a proof of concept implementation of NServiceBus v4.x for work.
Right now I have two subscribers and a single publisher.
The publisher can publish over 500 messages per second. It runs great.
Subscriber A runs without distributors/workers. It is a single process.
Subscriber B runs with a single distributor powering N number of workers.
In my test I hit an endpoint that creates and publishes 100,000 messages. I do this publish with the subscribers offline.
Subscriber A processes a steady 100 messages per second.
Subscriber B with 2+ workers (same result with 2, 3, or 4) struggles to top 50 messages per second gross across all workers.
It seems in my scenario that the workers (which I ramped up to 40 threads per worker) are waiting around for the distributor to give them work.
Am I missing something possibly that is causing the distributor to be throttled? All Buses are running an unlimited Dev license.
System Information:
Intel Core i5 M520 @ 2.40 GHz
8 GB of RAM
SSD Hard Drive
UPDATE 08/06/2013: I finished deploying the system to a set of servers. I am experiencing the same results. Every server with a worker that I add decreases the performance of the subscriber.
Subscriber B has a distributor on one server and two additional servers for workers. With Subscriber B and one server with an active worker I am experiencing ~80 messages/events per second. Adding in another worker on an additional physical machine decreases that to ~50 messages per second. Also, these are "dummy messages". No logic actually happens in the handlers other than a log of the message through log4net. Turning off the logging doesn't increase performance.
Suggestions?
If you're scaling out with NServiceBus master/worker nodes on one server, then trying to measure performance is meaningless. One process with multiple threads will always do better than a distributor and multiple worker nodes on the same machine because the distributor will become a bottleneck while everything is competing for the same compute resources.
If the workers are moved to separate servers, it becomes a completely different story. The distributor is very efficient at doling out messages if that's the only thing happening on the server.
Give it a try with multiple servers and see what happens.
Rather than using a dummy handler that does nothing, can you simulate actual processing by adding some sleep time, say 5 seconds, and then compare the results of the single-process subscriber against the one going through the distributor?
Scaling out (with or without a distributor) is only useful where the work being done by a single machine takes time, so that more computing resources help.
To help with this, monitor the CriticalTime performance counter on the endpoint and when you have the need, add in the distributor.
Scaling out using the distributor when needed is made easy by not having to change code, just starting the same endpoint in distributor and worker profiles.
The whole chain is transactional. You are paying heavily for this. Spreading the workload across machines will not really increase performance unless you have very fast disk storage with write-through caching to speed up transactional writes.
Once your POC is scaled out to several servers, try marking a message as 'Express', which skips transactional writes to the queue, and disable MSDTC on the bus instance to see what kind of performance is possible without transactions. This is not really usable in production unless you know where transactions are not mandatory, or you have an architecture that does not require DTC.

Load Balancing job queue among disproportionate workers

I'm working on a tool to automatically manage a job queue (in this case, Beanstalkd). Currently, you must manually set the number of available workers to pull jobs off the queue, but this does not allow for spikes in jobs, or it wastes resources during low job times.
I have a client/server set up that runs on the job queue server and the workers. The client connects to the server and reports available resources (CPU/Memory) as well as what types of jobs it can run. The server then monitors queues and dictates to the connected clients how many workers to run to process that queue once a second. There are currently a hundred or so different worker types and they all use very different amounts of CPU/memory, and the worker servers themselves have different levels of performance.
I'm looking for techniques to balance the workload most effectively based on job queue length and the resource requirement of each worker - for example, some workers use 100% of a core for 5s, while others take microseconds to complete. Also, some jobs are higher priority than others.

Django-celery status RECEIVED?

I'm running some tasks via django-celery (with RabbitMQ as the backend). The tasks are time consuming and CPU intensive.
I have 2 worker EC2 instances (one is a small and the other is a high-CPU medium).
I've set the small instance to run 1 concurrent task, and the medium to run 4. This works well for me. But occasionally, in the celery monitor, I see that the small instance is working on a task and 2 or 3 more tasks are in the "RECEIVED" state (assigned to the small instance), while the medium instance is not doing anything. Ideally I'd like the medium instance to have preference over the small one, but in this case, if the small instance is at its concurrency limit, the task should go to the medium. It seems the small instance is biting off more than it can chew, i.e. allocating tasks to itself which it can't start at the moment.
Is there a way to make workers accept only the tasks they can start at that moment?
Screenshot : http://dl.dropbox.com/u/361747/task-state.png . The worker starting with domU is the small, the one starting with ip is medium.
You can use the CELERYD_PREFETCH_MULTIPLIER option to control how many tasks each worker prefetches. In your case, CELERYD_PREFETCH_MULTIPLIER=1 will help distribute tasks evenly.
http://ask.github.com/celery/configuration.html#celeryd-prefetch-multiplier
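To see why the small instance ends up holding extra tasks, here is a small sketch of the prefetch arithmetic. Celery's prefetch window is the worker's concurrency multiplied by the prefetch multiplier, whose default is 4. The helper function name is hypothetical; only the CELERYD_PREFETCH_MULTIPLIER setting comes from Celery itself.

```python
# In the Django settings module: prefetch one message per worker process.
CELERYD_PREFETCH_MULTIPLIER = 1


def prefetch_window(concurrency, prefetch_multiplier=4):
    """Messages the broker will hand to one worker before it acknowledges any.

    Hypothetical helper: concurrency * multiplier, with Celery's default
    multiplier of 4.
    """
    return concurrency * prefetch_multiplier


# Default multiplier: the small instance (concurrency 1) can be handed 4 tasks,
# which is roughly the "1 running plus 2 or 3 RECEIVED" picture in the question.
small_default = prefetch_window(1)
medium_default = prefetch_window(4)

# With the setting above, each worker is handed only as many as it has processes.
small_tuned = prefetch_window(1, CELERYD_PREFETCH_MULTIPLIER)
medium_tuned = prefetch_window(4, CELERYD_PREFETCH_MULTIPLIER)

print(small_default, medium_default, small_tuned, medium_tuned)
```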