Distribution of MPI-threaded program on available GPUs, using slurm - gpu

My program consists out of two parts, A and B, both written in C++. B is loaded from a separate DLL, and is capable of running both on the CPU or on the GPU, depending on how it is linked. When the main program is launched, it creates one instance of A, which in turn creates one instance of B (which then either works on the locally available CPUs or on the first GPU).
When launching the program using mpirun (or via slurm, which in turn launches mpirun), for each MPI rank one version of A is created, which in turn creates one version of B for itself. When only one GPU is in the system, this GPU will be used, but what happens if there are multiple GPUs in the system? Are versions of B all placed on the same GPU, regardless if there are several GPUs available, or are they distributed evenly?
Is there any way to influence that behavior? Unfortunately my development machine does not have multiple GPUs, thus I can not test it, except on production.

Slurm supports and understands binding MPI ranks to GPUs through for-example the --gpu-bind option: https://slurm.schedmd.com/gres.html. Assuming that the cluster is correctly configured to enforce GPU affinities, this will then allow you assign one GPU per rank even if there are multiple ranks on a single node.
If you want to be able to test this, you could use for example the cudaGetDevice and cudaGetDeviceProperties calls to get the device luid (local unique id) for each rank and then check that there is no duplication of luids within a node.

Related

How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
However, it seems the slurm deamon doesn't accept this, the service returning:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of smiliar to this one, but this one seems specific to some CUDA code with compilation enabled. Something which seems way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
Besides nVidia MPS mentioned by #Marcus Boden, which is relevant for V100 types of cards, there also is Multi-Instance GPU which is relevant for A100 types of cards.

Can I create multiple virtual devices from multiple GPUs in Tensorflow?

I'm using Local device configuration in Tensorflow 2.3.0 currently, to simulate multiple GPU training, and it is working. If I buy another GPU, will I be able to use the same functionality to each GPU?
Right now I have 4 virtual GPUs and one physical GPU. I want to buy another GPU and want to have 2x4 virtual GPUs. I haven't found any information about it, and because I don't have another GPU right now, I can't test it. Is it supported? I'm afraid, it's not.
Yes, you can have additional GPU, since there is no restriction in the number of GPU's you can make use of all the GPU devices you have.
As you can see in the document also which says,
A visible tf.config.PhysicalDevice will by default have a single
tf.config.LogicalDevice associated with it once the runtime is
initialized. Specifying a list of tf.config.LogicalDeviceConfiguration
objects allows multiple devices to be created on the same
tf.config.PhysicalDevice
You can follow this documentation for more details on usage of multiple GPU's.

The mechanics behind the mapping of redis instances to separate CPU cores

It's documented that separate redis instances map to separate CPU cores. If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
1) What happens if I scale this machine down to 4 cores?
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
3) Is there any way to control the behavior? If so, to what extent?
Would love to understand the technicals behind this, and an illustrative example is most welcome. I run an app hosted in the cloud which uses redis as a back-end. Scaling up (and down) the machine's CPU cores is one of the things I have to do, but I'd like to know what I'm first getting into.
Thanks in advance!
There is no magic. Since redis is single-threaded, a single instance of redis will only occupy a single core at once. Running multiple instances creates the possibility that more than one of them will be executing at once, on different cores (if you have them). How this is done is left entirely up to the operating system. redis itself doesn't do anything to "map" instances to specific cores.
In practice, it's possible that running 8 instances on 8 cores might give you something that looks like a direct mapping of instances to cores, since a smart OS will spread processes across cores (to maximize available resources), and should show some preference for running a process on the same core that it recently vacated (to make best use of cache). But at best, this is only true for the simple case of a 1:1 mapping, with no other processes on the system, all processes equally loaded, no influence from network drivers, etc.
In the general case, all you can say is that the OS will decide how to give CPU time to all of the instances that you run, and it will probably do a pretty good job, because the scheduling parts of the OS were written by people who know what they're doing.
Redis is a (mostly) single-threaded process, which means that an instance of the server will use a single CPU core.
The server process is mapped to a core by the operating system - that's one of the main tasks that an OS is in charge of. To reiterate, assigning resources, including CPU, is an OS decision and a very complex one at that (i.e. try reading the code of the kernel's scheduler ;)).
If I have 8 redis instances running on a Debian/Ubuntu machine with 8 cores, all of them would map to a core each.
Perhaps, that's up to the OS' discretion. There is no guarantee that every instance will get a unique core, and it is possible that one core may be used by several instances.
1) What happens if I scale this machine down to 4 cores?
Scaling down like this means a restart. Once the Redis servers are restarted, the OS will assign them with the available cores.
2) Do the changes happen automatically (by default), or is some explicit configuration involved?
There are no changes involved - every process, Redis or not, gets a core. Cores are shared between processes, with the OS orchestrating the entire thing.
3) Is there any way to control the behavior? If so, to what extent?
Yes, most operating systems provide interfaces for controlling the allocation of resources. Specifically, the taskset Linux command can be used to set or get a process's CPU affinity.
Note: you should leave CPU affinity setting to the OS - it is supposed to be quite good at that. Instead, make sure that you provision your server correctly for the load.

How does TensorFlow cluster distribute load across machines if not specified explicitly?

I took "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (there are 8 cores in the PC) and ran the training script with replacing this line:
sess = tf.InteractiveSession()
with the following:
sess = tf.InteractiveSession("grpc://localhost:12345")
where 12346 is a port where node 0 is listening (e.g. master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
Looking at htop's output, I can see that the job is indeed performed by the cluster - it consumes some CPU. However, the only consumer is node 0, remaining nodes do not perform any work. If I select node 1 as a place to create master session, picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), but the remaining 1/3 of the work is performed on node 1. If I select node 2 as master, then that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and another using node 2 as master, both nodes 1 and 2 get some load, but node 0 is loaded much more (like, 200% vs 60% vs 60% of CPU).
So far it looks like "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the behavior is and whether distributed TensorFlow is intended for data parallelization at all (as opposed to manual model parallelization)?
TF is great for data parallelization, e.g. when you need to sift through tons of data, which is then distributed to multiple GPUs.
It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know what's the best way to distribute your computation of the same model to multiple devices. It would depend on too many factors, e.g. how fast is the connection between your devices.

multiple workers on a single machine in distributed TensorFlow

is it possible to launch distributed TensorFlow on a local machine, in a way that each worker has a replica of the model?
if yes, is it possible to assign each agent to utilize only a single CPU core?
Yes it is possible to launch a distributed Tensorflow locally:
Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
and in such a way that each worker has the same graph:
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use different sessions for each graph, but each graph can be used in multiple sessions.
As mentioned by in your comments, there is a suggestion of how to try and achieve execution of distributed TF to a single core which involves distributing to different CPUs and then limiting the thread pool to a single thread.
Currently there is no feature that allows the distributed execution of TF graphs to particular cores.
To your first question, the answer is yes. More details here: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
For the second question, I'm not sure if Tensorflow has this level of fine-grained control at core-level; but in general the OS will load balance threads on multiple cores.
Note that Tensorflow does have the ability to specify a device at processor level, if you have multiple CPUs/GPUs.