Dask-Yarn fails to allocate the requested number of workers

We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8 and skein 0.8.
Currently we are having problems allocating the maximum number of workers.
We are not able to run a job with more than 18 workers! (We can see the actual number of workers in the dask dashboard.)
The definition of the cluster is as follows:
from dask_yarn import YarnCluster

cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
Even when increasing the number of workers to 50, nothing changes, although when changing worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
Update
Following @jcrist's answer I realized that I didn't fully understand the terminology connecting the YARN web UI application dashboard to the YarnCluster parameters.
From my understanding:
A YARN container is equal to a dask worker.
Whenever a YarnCluster is created, 2 additional containers are running as well (one for the scheduler and one for a logger, each with 1 vCore).
The limit implied by n_workers * worker_vcores vs. n_workers * worker_memory is something I still need to fully grok (a back-of-the-envelope check is sketched below).
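For example, a rough back-of-the-envelope check (the per-node figure in the last comment below is an assumption, not a measured setting) suggests memory, not vcores, is the tighter constraint:

workers, vcores_per_worker, mem_per_worker_gb = 24, 4, 64

print(workers * vcores_per_worker)   # 96 vcores requested, well under the 384 available
print(workers * mem_per_worker_gb)   # 1536 GB requested, close to the 1.9 TB total

# Whether a 64 GB container fits also depends on per-node limits: if a
# NodeManager advertises, say, 192 GB (an assumed value), only three such
# containers fit on that node no matter how many vcores are free.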
There is another issue - while optimizing I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a limit of 100GB, but the YARN web UI showed only 2 containers running (my cluster has 384 vCores and 1.9TB of memory, so there is still plenty of room to expand). This is probably worth a separate question.
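For reference, the adaptive call was roughly of the shape below (the bounds are placeholders, not the exact values used):

# dask-yarn scales the worker count between these bounds based on load;
# the YARN web UI, however, only ever showed 2 containers at the time.
cluster.adapt(minimum=0, maximum=10)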

There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on state REQUESTED to see just the pending containers:
>>> cluster.application_client.get_containers(states=['REQUESTED'])
This should give you some insight into what's been requested but not yet allocated.
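For a quick overview you can also tally containers by state, to see how many of the requested workers YARN has actually granted; a small sketch using the same client:

from collections import Counter

# Count containers by state (e.g. RUNNING vs. REQUESTED).
containers = cluster.application_client.get_containers(
    states=['REQUESTED', 'RUNNING'])
print(Counter(c.state for c in containers))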
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.

Related

Is it possible to limit resources of a running app in yarn?

Sometimes at work I need to use our cluster to run something, but it is used up to 100%, because certain jobs scale up when there are available resources, and my job won't execute for a long time. Is it possible to limit the resources of a running app? Or should we choose a different scheduling policy, and if so, which one?
We use Capacity Scheduler.
It depends on what your apps are. Is the 100% usage coming from large queries (Hive apps) or from, let's say, Spark apps?
Spark can easily eat up the whole cluster, even while doing almost nothing; that is why you need to define how many CPUs, how much memory, driver memory, etc. to give to those apps.
You accomplish that when you do the spark-submit, e.g.:
spark-submit --master yarn --deploy-mode cluster --queue {your yarn queue} --driver-cores 1 --driver-memory 1G --num-executors 2 --executor-cores 1 --executor-memory 2G {program name}
That will limit that application to using only those resources (plus a little overhead).
If you have a more complicated environment then you will need to limit by queue: for example, queue1 gets 20% of the cluster and is capped at 20%; by default a queue can expand up to 100% of the cluster if nobody else is using it.
Ideally, you should have several queues with the right limits in place and be really careful with preemption.
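If you want to check how a queue is configured and how much of it is currently used, the ResourceManager exposes this through its REST API; a minimal sketch (the ResourceManager host below is a placeholder):

import requests

# YARN ResourceManager scheduler endpoint (hypothetical host/port).
RM = "http://resourcemanager.example.com:8088"

resp = requests.get(RM + "/ws/v1/cluster/scheduler")
resp.raise_for_status()
info = resp.json()["scheduler"]["schedulerInfo"]

# With the Capacity Scheduler, each queue reports its configured capacity,
# maximum capacity and current usage as percentages of the cluster.
for queue in info.get("queues", {}).get("queue", []):
    print(queue["queueName"],
          "capacity:", queue.get("capacity"),
          "maxCapacity:", queue.get("maxCapacity"),
          "used:", queue.get("usedCapacity"))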

Sometimes get the error "err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory"

I'm very lost on how to solve my particular problem, which is why I followed the getting help guideline in the Object Detection API and made a post here on Stack Overflow.
To start off, my goal was to run distributed training jobs on Azure. I've previously used gcloud ai-platform jobs submit training with great ease to run distributed jobs, but it's a bit difficult on Azure.
I built the tf1 docker image for the Object Detection API from the dockerfile here.
I had a cluster (Azure Kubernetes Service/AKS Cluster) with the following nodes:
4x Standard_DS2_V2 nodes
8x Standard_NC6 nodes
In Azure, NC6 nodes are GPU nodes backed by a single K80 GPU each, while DS2_V2 are typical CPU nodes.
I used TFJob to configure my job with the following replica settings:
Master (limit: 1 GPU) 1 replica
Worker (limit: 1 GPU) 7 replicas
Parameter Server (limit: 1 CPU) 3 replicas
Here's my conundrum: the job fails as one of the workers throws the following error:
tensorflow/stream_executor/cuda/cuda_driver.cc:175] Check failed: err == cudaSuccess || err == cudaErrorInvalidValue Unexpected CUDA error: out of memory
I randomly tried reducing the number of workers, and surprisingly, the job worked, but only if I had 3 or fewer Worker replicas. Although it took a lot of time (a bit more than a day), the model could finish training successfully with 1 Master and 3 Workers.
This was a bit vexing as I could only use up to 4 GPUs even though the cluster had 8 GPUs allocated. I ran another test: When my cluster had 3 GPU nodes, I could only successfully run the job with 1 Master and 1 Worker! Seems like I can't fully utilize the GPUs for some reason.
Finally, I ran into another problem. The above runs were done with a very small amount of data (about 150 MB) since they were tests. Later I ran a proper job with a lot more data (about 12 GiB). Even though the cluster had 8 GPU nodes, it could only complete the job successfully when there was 1 Master and 1 Worker.
Increasing the Worker replica count to more than 1 immediately caused the same cuda error as above.
I'm not sure if this is an Object Detection API issue, or if it is caused by Kubeflow/TFJob, or even if it's something Azure specific. I've opened a similar issue on the Kubeflow page, but I'm also now seeing if I can get some guidance from the Object Detection API community. If you need any further details (like the tfjob yaml, or pipeline.config for the training) or have any questions, please let me know in the comments.
It might be related to the batch size used by the API.
Try to control the batch size, maybe as described in this answer:
https://stackoverflow.com/a/55529875/2109287
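For the Object Detection API specifically, the batch size lives in the train_config block of pipeline.config, so it can be lowered either by hand or programmatically; a minimal sketch (the file path is a placeholder):

from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Load the existing training pipeline definition (hypothetical path).
config = pipeline_pb2.TrainEvalPipelineConfig()
with open("pipeline.config", "r") as f:
    text_format.Merge(f.read(), config)

# A smaller batch size lowers per-GPU memory usage, at the cost of slower
# (and possibly noisier) training.
config.train_config.batch_size = 8

with open("pipeline.config", "w") as f:
    f.write(text_format.MessageToString(config))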
This is because of insufficient GPU memory.
Try the commands below to find and kill the processes holding GPU memory, then reset the GPU; hope it'll help:
$ sudo fuser -v /dev/nvidia*
$ sudo kill -9 pid_no (Ex: 12345)
$ nvidia-smi --gpu-reset
:)
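If the out-of-memory error comes from TensorFlow grabbing the whole GPU up front rather than from a stale process, another common TF1-era mitigation is to allow incremental GPU memory growth; a minimal sketch, not specific to the Object Detection API:

import tensorflow as tf

# Allocate GPU memory incrementally instead of reserving (nearly) all of it
# at session creation time.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual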

RabbitMQ poor performance

We are facing bad performance in our RabbitMQ clusters, even when they are idle.
After installing the rabbitmq-top plugin, we see many processes with very high reductions/sec: 100k and more!
Questions:
What does it mean?
How to control it?
What might be causing such slowness without any errors?
Notes:
Our clusters are running on Kubernetes 1.15.11
We allocated 3 nodes, each with 8 CPU and 8 GB limits, and set the vm_watermark to 7G. Actual usage is ~1.5 CPU and 1 GB RAM.
RabbitMQ 3.8.2, Erlang 22.1.
We don't have many consumers or producers; the slowness occurs even in a fairly idle environment.
rabbitmqctl status is very slow to return details (sometimes 2 minutes) but does not show any errors.
After some more investigation, we found that the actual cause was a combination of two issues.
The RabbitMQ (Erlang) runtime configuration, by default when using the Bitnami Helm chart, assigns only a single scheduler. This is fine for a simple app with a few concurrent connections, but a production-grade deployment with thousands of connections has to use many more schedulers. Bumping it up from 1 to 8 schedulers improved throughput dramatically.
Our monitoring was hammering RabbitMQ with a lot of requests per second (about 100/sec). The monitoring hits the aliveness-test endpoint, which creates a connection, declares a (non-mirrored) queue, publishes a message and then consumes that message. Disabling the monitoring reduced the load dramatically: an 80%-90% drop in CPU usage, and reductions/sec also dropped by about 90%.
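For context, the aliveness-test is just an HTTP endpoint on the management API; a monitoring probe polling it looks roughly like the sketch below (host and credentials are placeholders), and every call triggers the connect/declare/publish/consume cycle described above:

import requests

# Hypothetical management API endpoint; %2F is the URL-encoded default vhost "/".
URL = "http://rabbitmq.example.com:15672/api/aliveness-test/%2F"

resp = requests.get(URL, auth=("monitoring_user", "monitoring_password"))
resp.raise_for_status()
print(resp.json())  # {"status": "ok"} when the broker can round-trip a message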
References
Performance:
https://www.rabbitmq.com/runtime.html#scheduling
https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
https://www.rabbitmq.com/monitoring.html

Setting up a Hadoop Cluster on Amazon Web services with EBS

I was wondering how I could set up a Hadoop cluster (say 5 nodes) on AWS. I know how to create the cluster on EC2, but I don't know how to face the following challenges:
What happens if I lose my spot instance? How do I keep the cluster going?
I am working with datasets of size 1 TB. Would it be possible to set up EBS accordingly? How can I access HDFS in this scenario?
Any help will be great!
These suggestions would change depending on your requirements. However, assuming a 2-master and 3-worker setup, you can probably use r3 instances for the master nodes, as they are optimized for memory-intensive applications, and go for d2 instances for the worker nodes. d2 instances have multiple local disks and can thus withstand some disk failures while still keeping your data safe.
To answer your specific questions,
Treat Hadoop machines like any Linux servers: what would happen if your general CentOS spot instances were lost? Hence, it is generally advised to use reserved instances.
Hadoop typically stores data by maintaining 3 copies and distributing them across the worker nodes in the form of 128 or 256 MB blocks. So, for your 1 TB dataset, you will have 3 TB of data to store across the three worker nodes. Obviously, you have to allow for some overhead when calculating space requirements.
You can use AWS's EMR service - it is designed especially for running Hadoop clusters on top of EC2 instances.
It is fully managed and comes pre-packaged with all the services you need in Hadoop.
Regarding your questions:
There are three main types of nodes in an EMR cluster:
Master - a single node; don't use a spot instance for it.
Core - a node that handles tasks and holds part of the HDFS data.
Task - a node that handles tasks but does not hold any part of the HDFS data.
If Task nodes are lost (if they are spot instances) the cluster will continue to work with no problems.
Regarding storage, the default replication factor in EMR is as follows:
1 for clusters < four nodes
2 for clusters < ten nodes
3 for all other clusters
But you can change it - http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hdfs-config.html
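If you prefer to script the cluster creation rather than use the console, a minimal boto3 sketch could look like this (instance types, counts and names are illustrative, not recommendations):

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# A small cluster: 1 master plus 4 core nodes that hold the HDFS data.
response = emr.run_job_flow(
    Name="hadoop-cluster-example",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "d2.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])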

Redis queue length keeps growing

Our Pipeline:
VMware-Netflow -> Logstash -> Redis -> Logstash-indexer -> 3xElastic
Data I have gathered:
I noticed in Kibana that the flows coming in were 1 hour old, then 2, then 3, and so on.
Running 'redis-cli llen netflow' shows a very large number that is slowly increasing.
Running 'redis-cli INFO' shows pretty constant input at 80 kbps and output at 1 kbps. I would think these should be near equal (a way to watch this over time is sketched after this list).
The cpu load on all nodes is pretty negligible.
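To make the backlog growth obvious, the list length and the instantaneous I/O rates can be polled programmatically as well; a minimal sketch with redis-py (host and key taken from the pipeline above):

import time
import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical host/port

# Sample every 10 seconds; a steadily increasing llen means the indexer
# (consumer) side is falling behind the netflow (producer) side.
for _ in range(6):
    info = r.info()
    print("llen(netflow):", r.llen("netflow"),
          "input_kbps:", info.get("instantaneous_input_kbps"),
          "output_kbps:", info.get("instantaneous_output_kbps"))
    time.sleep(10)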
What I've tried:
I ensured that the logstash-indexer was sending to all 3 elastic nodes.
I launched many additional Logstash instances on the indexers; Redis now shows 40 clients.
I am not sure what else to try.
TL;DR: I rebooted all three Elasticsearch nodes, and life is good again.
I inadvertently disabled Elasticsearch as an output and sent my netflows into the ether. The queue size in Redis dropped to 0 within minutes. Although sad, this did prove that the problem was Elasticsearch, not Logstash or Redis.
I watched the Elasticsearch instances, and it seemed like something was wrong with the communication between them. All three showed logs indicating that 2 of the 3 were dropping out of the cluster and taking forever to respond to cluster pings. What I think was happening is that writes were accepted by Elasticsearch and just bounced around for a while before being written successfully.
Upon rebooting them all, they negotiated correctly, and writes are happening as they should.
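For anyone hitting the same symptom, a quick way to confirm whether the Elasticsearch nodes actually see each other is the cluster health API; a minimal sketch (the host is a placeholder):

import requests

# Hypothetical Elasticsearch HTTP endpoint.
resp = requests.get("http://elasticsearch.example.com:9200/_cluster/health")
resp.raise_for_status()
health = resp.json()

# "status" should be green (or yellow), and "number_of_nodes" should match the
# expected cluster size (3 here); fewer means nodes are dropping out.
print(health["status"], health["number_of_nodes"], health["unassigned_shards"])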