How to start n tasks with one GPU each? - gpu

I have a large cluster of computing nodes, each node having 6 GPUs. I want to start, let's say, 100 workers on it, each one having access to exactly one GPU.
What I do now is like this:
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
And inside the main.sh:
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
This way, I get 100 workers started (fully using about 17 nodes). But I have a problem: CUDA_VISIBLE_DEVICES is not set properly.
sbatch --gres=gpu:6 --gpus-per-task=1 --ntasks='100' main.sh
# CUDA_VISIBLE_DEVICES in main.sh: 0,1,2,3,4,5 (that's fine)
srun --gpus-per-task=1 --gres=gpu:1 -n 100 worker.sh
# CUDA_VISIBLE_DEVICES in worker.sh: 0,1,2,3,4,5 (this is my problem: how to assign exactly 1 GPU to each worker and to that worker alone?)
It might be a misunderstanding on my part of how Slurm actually works, since I'm quite new to programming on such HPC systems. But any clue how to achieve what I want? (Each worker should have exactly 1 GPU assigned to it and only to it.)
We use SLURM 20.02.2.

I think what you are missing here is that you should also define the number of nodes explicitly.
For example: 17 nodes x 6 GPUs per node = 102 tasks (if you only need 100 tasks, this means the last node will be underutilized, running only 4 of them).
#SBATCH --ntasks=100
#SBATCH --gres=gpu:6
#SBATCH --nodes=17
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=1   # depends on the available cores per node
srun -l main.py
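To check that each task really ends up with its own GPU, a minimal worker script along these lines can help (just a sketch: SLURM_PROCID is set by Slurm for every task, and with --gpus-per-task=1 each task should see CUDA_VISIBLE_DEVICES restricted to a single device):
#!/bin/bash
# worker.sh (sketch): report which GPU this task was bound to
echo "task ${SLURM_PROCID} on $(hostname): CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
# ...launch the actual per-GPU workload here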

Related

How to get the current consumed CPU % of vmhost in vCenter using PowerShell

How to get the current consumed CPU % of a vmhost in vCenter using a PowerShell script?
The command below doesn't give output similar to what we checked manually.
Get-Stat -Entity $command1 -Stat cpu.usagemhz.average -Realtime -MaxSamples 1
Get-Stat -Entity $myHost -Stat cpu.usage.average -Realtime -MaxSamples 1 -Instance ""
From VMware's doc on this cpu usage perf counter:
Actively used CPU, as a percentage of the total available CPU, for each physical CPU on the host. Active CPU is approximately equal to the ratio of the used CPU to the available CPU. Available CPU = # of physical CPUs × clock rate. 100% represents all CPUs on the host. For example, if a four-CPU host is running a virtual machine with two CPUs, and the usage is 50%, the host is using two CPUs completely.
Explanations from Luc Dekens around the -Instance filter...
If the ESX/ESXi server is equipped with a quadcore CPU, there will be four instances: 0, 1, 2 and 3. In this case the instance corresponds with the numeric position within the CPU core. And there will be a so-called aggregate, which is the metric averaged over all the instances. These instances each get their own identifier which will be part of the returned statistical data. The aggregate instance is always represented by a blank identifier.
...and -MaxSamples
Although I asked for 1 sample (-MaxSamples 1) the cmdlet returned 9 values. The -MaxSamples parameter apparently only looks at the Timestamp. It doesn’t count the number of returned values.

How to get multi GPUs same type on slurm?

How can I create a job with multiple GPUs of the same type, without specifying that type directly? My experiment requires that all GPUs be of the same type, but that type can be whatever is available.
Currently I am only able to create an experiment with multiple GPUs by stating exactly what type I want:
--gres=gpu:gres_type:amount
If I don't specify gres_type, then I sometimes get mixed GPU packs (say, 2x Titan V and 2x Titan X).
If you are fortunate enough that the cluster is consistent in the types of nodes that host the GPUs, and that the features of the nodes are properly specified and allow distinguishing between the nodes that host the different GPU types, you can use the --constraint parameter.
For the sake of argument, let's assume that the nodes that host the titanV GPUs have haswell CPUs, that the nodes that host the titanX GPUs have skylake CPUs, and that those are defined as features. Then you can request
--gres=gpu:2
--constraint=[haswell|skylake]
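For example, a full submission with both options could look like this (just a sketch; job.sh stands in for your own batch script):
sbatch --gres=gpu:2 --constraint="[haswell|skylake]" job.sh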
If the above does not apply to your use case, you can submit two jobs and keep only the one that starts the earliest. For that, give your jobs an identical name, and use the singleton dependency.
Write a submission script like this one
#!/bin/bash
#SBATCH --dependency=singleton
#SBATCH --job-name=gpujob
# Other options
scancel --state=PENDING --jobname=gpujob
# etc.
and submit it twice with
$ sbatch --gres=gpu:titanX:2 submit.sh
$ sbatch --gres=gpu:titanV:2 submit.sh
Each job will be assigned only one type of GPU, and the first one that starts will cancel the other one. This approach can scale up with more than two GPU types.
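If you want the surviving job to record which GPU type it actually received, something along these lines inside the script can help (a sketch; the exact field names in the scontrol output vary between Slurm versions):
# log the generic resources actually allocated to this job
scontrol show job "$SLURM_JOB_ID" | grep -iE 'gres|tres'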

How to run TensorFlow on multiple nodes with several CPUs each

I want to run linear regression with TensorFlow on very large datasets. I have a cluster with 9 nodes, each with 36 CPUs. What is the best way to distribute the computations across all the available resources?
According to this course https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed for the parallelisation. I then tried to run my script my_code.py (on a "small" dataset with 120 million data points and 2 feature columns, to test the code) on nodes 2 and 3 as follows:
python my_code.py \
--ps_hosts=node1:2222 \
--worker_hosts=node2:2222,node3:2222 \
--job_name=worker \
--task_index="i-2"
where i is the number of the node (either 2 or 3); on node 1 I do the same, but with --job_name=ps and --task_index=0. However, this way it seems that only one CPU per node is used. Do I need to specify each CPU individually?
Thank you in advance.
As far as I understand, the best thing to do is to use all the CPUs on the same node together as a single worker, in order to make the most of the shared memory. So, for example, in the case above one would manually specify only 9 workers and make sure that each of them corresponds to one node where all 36 CPUs are used. The commands to do this depend on the specific cluster used.
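For instance, reusing the flag names from the question, a one-process-per-node launch could look roughly like this (a sketch: hostnames and ports are assumptions, and node1 here doubles as the parameter server, leaving 8 worker nodes; the ps can also share a node with a worker if you want the full 9):
# on node1 (parameter server):
python my_code.py --ps_hosts=node1:2222 \
  --worker_hosts=node2:2222,node3:2222,node4:2222,node5:2222,node6:2222,node7:2222,node8:2222,node9:2222 \
  --job_name=ps --task_index=0
# on node2 .. node9 (one worker process per node, task_index 0 .. 7):
python my_code.py --ps_hosts=node1:2222 \
  --worker_hosts=node2:2222,node3:2222,node4:2222,node5:2222,node6:2222,node7:2222,node8:2222,node9:2222 \
  --job_name=worker --task_index=0   # 0 on node2, 1 on node3, ... 7 on node9
With only one process per node, TensorFlow's default threading will typically spread the work over all 36 cores of that node.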

What is gradient repacking in Tensorflow?

When running the TensorFlow benchmarks from the terminal, there are a couple of parameters we can specify. One of them is gradient_repacking. What does it represent, and how should one think about setting it?
python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
--model=resnet50 --optimizer=momentum --variable_update=replicated \
--nodistortions --gradient_repacking=8 --num_gpus=8 \
--num_epochs=90 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
--train_dir=${CKPT_DIR}
For those searching in the future, gradient_repacking affects all-reduce in replicated mode. From the flags definition:
flags.DEFINE_integer('gradient_repacking', 0, 'Use gradient repacking. It'
                     'currently only works with replicated mode. At the end of'
                     'of each step, it repacks the gradients for more efficient'
                     'cross-device transportation. A non-zero value specifies'
                     'the number of split packs that will be formed.',
                     lower_bound=0)
As for the optimal value, I've seen gradient_repacking=8 (as you have) and gradient_repacking=2.
My best guess is that the parameter refers to the number of shards the gradients get broken into for sharing among the other workers. Eight in this case would seem to mean each GPU shares with every other GPU (i.e. all-to-all, for your num_gpus=8), while 2 would mean sharing only with neighbors in a ring fashion.
Given that Horovod uses its own all reduce algorithm, it makes sense that setting gradient_repacking has no effect when --variable_update=horovod.
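Since the best value seems to be empirical, one practical approach is to sweep a few values with short runs and compare the reported images/sec (a sketch reusing the command from the question; the pack counts are arbitrary choices):
for gr in 1 2 4 8; do
  python tf_cnn_benchmarks.py --data_format=NCHW --batch_size=256 \
    --model=resnet50 --optimizer=momentum --variable_update=replicated \
    --nodistortions --gradient_repacking=$gr --num_gpus=8 \
    --num_epochs=1 --weight_decay=1e-4 --data_dir=${DATA_DIR} --use_fp16 \
    --train_dir=${CKPT_DIR}_gr${gr}
done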

How to set threads based on number of cores available?

I have a workflow that runs on different machines with different numbers of CPUs, and I'd like to be able to set up a rule that uses "all but N" cores. I.e. I'd like to be able to do:
threads: lambda cores: max(2, cores-4)
But I cannot find any way to access cores (i.e. the value passed to, or inferred by, -j/--cores on the command line) in my rules. Is there a way to do the above?
As #bli said in the comments, this should work:
from multiprocessing import cpu_count

rule test:
    output:
        "example.txt"
    threads:
        lambda cores: max(2, cpu_count() - 4)
    shell:
        "cmd -{threads}"
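As a quick sanity check with hypothetical numbers: on a 16-core machine, cpu_count() returns 16, so the rule requests max(2, 16 - 4) = 12 threads when run with, e.g.:
snakemake --cores 16 example.txt
Keep in mind that Snakemake still caps a rule's threads at whatever --cores allows, so the lambda only sets the number of threads the rule asks for.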