How to configure nthreads in dask-yarn - hadoop-yarn

Following the nthreads vs ncores issue, we wanted to define the number of threads per worker in the YarnCluster API.
In the dask yarn CLI docs there is an option to define the number of threads.
But in the API docs we could not find the reference (only the option to define worker_vcores).
Any assistance would be appreciated.

I believe dask-yarn defaults nthreads to the number of vcores per worker -- vcores is a term more familiar to the YARN community.
In the help for the CLI:
Number of threads. Defaults to number of vcores in the "container"
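If so, setting worker_vcores on the cluster object should cover it. A minimal sketch of what I mean (the environment path and sizes below are illustrative assumptions, not values from your setup):

from dask.distributed import Client
from dask_yarn import YarnCluster

# worker_vcores sets the YARN vcore request per worker; per the CLI help above,
# each worker's nthreads should then default to this same number.
cluster = YarnCluster(
    environment='environment.tar.gz',  # packaged conda env (assumed path)
    n_workers=4,
    worker_vcores=2,
    worker_memory='4GiB',
)
client = Client(cluster)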

Related

Flink job on EMR runs only on one TaskManager

I am running an EMR cluster with 3 m5.xlarge nodes (1 master, 2 core) and Flink 1.8 installed (emr-5.24.1).
On the master node I start a Flink session within the YARN cluster using the following command:
flink-yarn-session -s 4 -jm 12288m -tm 12288m
That is the maximum memory and number of slots per TaskManager that YARN lets me set up based on the selected instance types.
During startup there is a log:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=12288, taskManagerMemoryMB=12288, numberTaskManagers=1, slotsPerTaskManager=4}
This shows that there is only one task manager. Also, when looking at the YARN Node Manager I see that there is only one container running on one of the core nodes. The YARN Resource Manager shows that the application is using only 50% of the cluster.
With the current setup I would assume that I can run a Flink job with parallelism set to 8 (2 TaskManagers * 4 slots), but if the submitted job has its parallelism set to more than 4, it fails after a while because it cannot get the desired resources.
If the job parallelism is set to 4 (or less), the job runs as it should. Looking at CPU and memory utilisation with Ganglia shows that only one node is utilised, while the other stays flat.
Why does the application run on only one node, and how can I utilise the other node as well? Do I need to set something up in YARN so that it would place Flink on the other node too?
In previous versions of Flink there was a startup option -n which was used to specify the number of task managers. That option is now obsolete.
When you start a 'Session Cluster', you should see only one container, which is used for the Flink Job Manager. This is probably what you see in the YARN Resource Manager. Additional containers will automatically be allocated for Task Managers once you submit a job.
How many cores do you see available in the Resource Manager UI?
Don't forget that the Job Manager also uses cores out of the available 8.
You need to do a little math here.
For example, if you had set the number of slots to 2 per TM (and less memory per TM) and then submitted a job with a parallelism of 6, it should have worked with 3 TMs.
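To make that arithmetic concrete, here is a rough sketch (plain back-of-the-envelope math using the numbers from this thread, nothing Flink-specific):

import math

def task_managers_needed(parallelism, slots_per_tm):
    # Each TaskManager container contributes slots_per_tm slots.
    return math.ceil(parallelism / slots_per_tm)

# 2 core nodes, each with room for roughly one 12288m container, plus the
# JobManager container occupying one of those spots -> only 1 TM fits.
print(task_managers_needed(8, 4))  # 2 TMs needed, but only 1 fits alongside the JM
print(task_managers_needed(6, 2))  # 3 smaller TMs, which can fit if sized down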

Does Ignite on YARN support Node Labels?

I know Ignite still does not support setting up custom YARN queues, per this JIRA ticket - https://issues.apache.org/jira/browse/IGNITE-2738 . I cannot find any information on whether Ignite supports running its containers within specified YARN node labels.
Currently in our cluster we have labelled all of our nodes, and when attempting to start an Ignite application, the app is stuck in the Pending stage because it is waiting for resources to be assigned to the AM, with the AM container Node Label expression defaulting to <DEFAULT_PARTITION> .
Is there a way to supply node labels for Ignite on YARN?
ignite-yarn doesn't seem to set node labels.
Have you tried specifying them externally?

Synchronous distributed tensorflow training runs asynchronously

System Information:
Debian 4.5.5
TF installed from binary (pip3 install tensorflow-gpu==1.0.1 --user)
TF version: v1.0.0-65-g4763edf-dirty 1.0.1
Bazel version: N.A.
CUDA 8.0 cuDNN v5.1
Steps to reproduce
Make a directory and download the following files into it:
training.py run.sh
Run the command ./run.sh to simply reproduce this issue.
Detailed description of the bug
Recently, I tried to deploy synchronous distributed TensorFlow training on the cluster. I followed the tutorial and the inception example to write my own program. The training.py is from another user's implementation, which follows the same API usage as the official example. I modified it so it can run on a single machine with multiple GPUs by making the processes communicate through localhost and mapping each worker to see only one GPU.
The run.sh launches three processes. One of them is the parameter server and the other two are workers implemented with between-graph replication. I created the training supervisor with tf.train.Supervisor() to manage the multiple sessions used in distributed training for initialization and synchronization.
I expected these two workers to synchronize on each batch and work in the same epoch. However, worker 0, which is launched before worker 1, completed the whole training set without waiting for worker 1. After that, the worker 0 process finished training and exited normally, while worker 1 behaved as if it had fallen into a deadlock, keeping CPU and GPU utilization near 0% for several hours.
Based on my observations, I suspect these two workers didn't communicate or synchronize at all over the data they passed. I report this problem as a bug because I create the optimizer with tf.train.SyncReplicasOptimizer as suggested by the official website and the inception example. However, the synchronization behaviour, if any, is very strange, and the program cannot exit normally.
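For reference, the sync-replicas wiring in training.py follows roughly the pattern below (a simplified sketch of the TF 1.x API usage, not the actual file; the addresses, toy model and step count are placeholders):

import tensorflow as tf

# Simplified between-graph replication skeleton (TF 1.x); run.sh starts one
# process per task_index and a separate parameter-server process.
cluster = tf.train.ClusterSpec({'ps': ['localhost:2222'],
                                'worker': ['localhost:2223', 'localhost:2224']})
task_index = 0
num_workers = 2
is_chief = (task_index == 0)
server = tf.train.Server(cluster, job_name='worker', task_index=task_index)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    global_step = tf.Variable(0, trainable=False, name='global_step')
    weight = tf.Variable(1.0)        # stand-in for the real model
    loss = tf.square(weight)
    opt = tf.train.SyncReplicasOptimizer(
        tf.train.GradientDescentOptimizer(0.01),
        replicas_to_aggregate=num_workers,
        total_num_replicas=num_workers)
    train_op = opt.minimize(loss, global_step=global_step)

# The chief must start the sync queue runner and run the init tokens op; if it
# does not, the other workers block waiting for gradient tokens, which looks
# like the deadlock described above.
chief_queue_runner = opt.get_chief_queue_runner()
init_tokens_op = opt.get_init_tokens_op()

sv = tf.train.Supervisor(is_chief=is_chief, global_step=global_step,
                         init_op=tf.global_variables_initializer())
with sv.managed_session(server.target) as sess:
    if is_chief:
        sv.start_queue_runners(sess, [chief_queue_runner])
        sess.run(init_tokens_op)
    for _ in range(100):
        sess.run(train_op)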
Source code / logs
Two files:
training.py: This file contains the source code for the parameter server and workers created to use synchronous distributed optimizers (tf.train.SyncReplicasOptimizer).
run.sh: This file launches the parameter server and the workers.
Log:
Please reproduce according to the steps above and look at worker_0_log and worker_1_log.

Is mapred-site.xml totally useless in Hadoop 2 with YARN

I have found on many Hadoop-related pages that in Hadoop 2 many mapred-site configurations no longer work and yarn-site.xml is used instead. But if that is so, why is mapred-site.xml still maintained?
For Hadoop v2, mapred-site.xml is still in use. There is a new property named mapreduce.framework.name; its value is the runtime framework for executing MapReduce jobs, and it can be one of local, classic or yarn. If you use local, your mapper and reducer processes are executed in the same JVM, which you normally do for debugging purposes (only one reducer can be launched with the local runtime). If you use classic, the traditional MR v1 runtime is used, and if you specify yarn, MR v2 is used.
When you use the yarn runtime for MapReduce jobs, your mapper and reducer processes are spawned in containers, and the mapred-site.xml file can be used to specify system resources (e.g. memory and vcores) for the MapReduce containers. Remember that the upper limits for container memory and vcores are set in yarn-site.xml. Have a look at the following properties in mapred-site.xml:
mapreduce.map.memory.mb
mapreduce.map.cpu.vcores
mapreduce.reduce.memory.mb
mapreduce.reduce.cpu.vcores
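To illustrate the relationship between the two files, here is a small sanity-check sketch (the /etc/hadoop/conf paths and the fallback defaults are assumptions; adjust them for your distribution):

import xml.etree.ElementTree as ET

def load_props(path):
    # Hadoop *-site.xml files are a <configuration> element containing
    # <property> children, each with a <name> and a <value>.
    props = {}
    for prop in ET.parse(path).getroot().findall('property'):
        props[prop.findtext('name')] = prop.findtext('value')
    return props

mapred = load_props('/etc/hadoop/conf/mapred-site.xml')
yarn = load_props('/etc/hadoop/conf/yarn-site.xml')

# Per-task requests from mapred-site.xml must fit under the container ceiling
# defined in yarn-site.xml (8192 MB / 1024 MB are only fallbacks for this sketch).
max_alloc_mb = int(yarn.get('yarn.scheduler.maximum-allocation-mb', 8192))
for key in ('mapreduce.map.memory.mb', 'mapreduce.reduce.memory.mb'):
    requested = int(mapred.get(key, 1024))
    print('%s = %d MB (cluster max %d MB)' % (key, requested, max_alloc_mb))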
mapred-site.xml is still in use. See mapred-default.xml for all the default values, and the Hadoop 'Deprecated Properties' documentation page for the new Hadoop v2 properties that replace the deprecated Hadoop v1 ones.
A few examples (deprecated property -> replacement):
mapred.map.child.env -> mapreduce.map.env
mapred.map.child.java.opts -> mapreduce.map.java.opts
mapred.map.child.log.level -> mapreduce.map.log.level
mapred.map.max.attempts -> mapreduce.map.maxattempts
mapred.map.output.compression.codec -> mapreduce.map.output.compress.codec
mapred.mapoutput.key.class -> mapreduce.map.output.key.class
mapred.mapoutput.value.class -> mapreduce.map.output.value.class
mapred.mapper.regex.group -> mapreduce.mapper.regexmapper..group
mapred.mapper.regex -> mapreduce.mapper.regex
mapred.map.task.debug.script -> mapreduce.map.debug.script
mapred.map.tasks -> mapreduce.job.maps

What is the relation between 'mapreduce.map.memory.mb' and 'mapred.map.child.java.opts' in Apache Hadoop YARN?

I would like to know the relation between the mapreduce.map.memory.mb and mapred.map.child.java.opts parameters.
Is mapreduce.map.memory.mb > mapred.map.child.java.opts?
mapreduce.map.memory.mb is the upper memory limit that Hadoop allows to be allocated to a mapper, in megabytes. The default is 512.
If this limit is exceeded, Hadoop will kill the mapper with an error like this:
Container[pid=container_1406552545451_0009_01_000002,containerID=container_234132_0001_01_000001]
is running beyond physical memory limits. Current usage: 569.1 MB of
512 MB physical memory used; 970.1 MB of 1.0 GB virtual memory used.
Killing container.
A Hadoop mapper is a Java process, and each Java process has its own maximum heap allocation, configured via mapred.map.child.java.opts (or mapreduce.map.java.opts in Hadoop 2+).
If the mapper process runs out of heap memory, it throws a Java out-of-memory exception:
Error: java.lang.RuntimeException: java.lang.OutOfMemoryError
Thus, the Hadoop and Java settings are related. The Hadoop setting is more of a resource enforcement/control one, and the Java setting is more of a resource configuration one.
The Java heap setting should be smaller than the Hadoop container memory limit because we need to reserve memory for Java code. Usually, it is recommended to reserve 20% of the memory for code. So if the settings are correct, Java-based Hadoop tasks should never get killed by Hadoop, and you should never see the "Killing container" error above.
If you experience Java out of memory errors, you have to increase both memory settings.
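A quick sketch of that sizing rule (just the percentages from this answer turned into arithmetic; the 2048/4096 inputs are made-up examples):

def suggested_xmx(container_mb, heap_fraction=0.8):
    # Leave roughly 20% of the container for non-heap overhead, per the rule above.
    return '-Xmx%dm' % int(container_mb * heap_fraction)

print(suggested_xmx(2048))  # -Xmx1638m, e.g. for mapreduce.map.memory.mb = 2048
print(suggested_xmx(4096))  # -Xmx3276m, e.g. for mapreduce.reduce.memory.mb = 4096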
The following properties let you specify options to be passed to the JVMs running your tasks. These can include -Xmx to control the heap available.
Hadoop 0.x, 1.x (deprecated) Hadoop 2.x
------------------------------- --------------------------
mapred.child.java.opts
mapred.map.child.java.opts mapreduce.map.java.opts
mapred.reduce.child.java.opts mapreduce.reduce.java.opts
Note there is no direct Hadoop 2 equivalent for the first of these; the advice in the source code is to use the other two. mapred.child.java.opts is still supported (but is overridden by the other two more-specific settings if present).
Complementary to these, the following let you limit total memory (possibly virtual) available for your tasks - including heap, stack and class definitions:
Hadoop 0.x, 1.x (deprecated) Hadoop 2.x
------------------------------- --------------------------
mapred.job.map.memory.mb mapreduce.map.memory.mb
mapred.job.reduce.memory.mb mapreduce.reduce.memory.mb
I suggest setting -Xmx to 75% of the memory.mb values.
In a YARN cluster, jobs must not use more memory than the server-side config yarn.scheduler.maximum-allocation-mb or they will be killed.
To check the defaults and precedence of these, see JobConf and MRJobConfig in the Hadoop source code.
Troubleshooting
Remember that your mapred-site.xml may provide defaults for these settings. This can be confusing - e.g. if your job sets mapred.child.java.opts programmatically, this would have no effect if mapred-site.xml sets mapreduce.map.java.opts or mapreduce.reduce.java.opts. You would need to set those properties in your job instead, to override the mapred-site.xml. Check your job's configuration page (search for 'xmx') to see what values have been applied and where they have come from.
ApplicationMaster memory
In a YARN cluster, you can use the following two properties to control the amount of memory available to your ApplicationMaster (to hold details of input splits, status of tasks, etc):
Hadoop 0.x, 1.x                      Hadoop 2.x
-------------------------------      --------------------------
(no 1.x equivalent)                  yarn.app.mapreduce.am.command-opts
(no 1.x equivalent)                  yarn.app.mapreduce.am.resource.mb
Again, you could set -Xmx (in the former) to 75% of the resource.mb value.
Other configurations
There are many other configurations relating to memory limits, some of them deprecated - see the JobConf class. One useful one:
Hadoop 0.x, 1.x (deprecated) Hadoop 2.x
------------------------------- --------------------------
mapred.job.reduce.total.mem.bytes mapreduce.reduce.memory.totalbytes
Set this to a low value (10) to force shuffle to happen on disk in the event that you hit an OutOfMemoryError at MapOutputCopier.shuffleInMemory.