How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
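For example, I was hoping duplicate entries along these lines would work (the device path here is just an illustration):
# gres.conf: two gres entries pointing at the same physical device
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia0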
However, the Slurm daemon does not seem to accept this; the service returns:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is somewhat similar to this question, but that one seems specific to CUDA code with a particular compilation option enabled, which is far more specific than my general case (or at least as far as I understand it).
How to run multiple jobs on a GPU grid with CUDA using SLURM

I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service (MPS), or
pack multiple calculations into a single job that has one GPU and run them in parallel (a minimal sketch follows below).
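For the second option, here is a minimal sketch of such a packed job script, where the two Python scripts stand in for your own calculations:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
# Both processes inherit the single allocated GPU and share its memory.
python calculation_a.py &
python calculation_b.py &
# Wait for both background processes before the job ends.
wait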

Besides NVIDIA MPS mentioned by Marcus Boden, which is relevant for V100-type cards, there is also Multi-Instance GPU (MIG), which is relevant for A100-type cards.
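As a rough sketch of the MIG route (assuming an A100 in MIG mode and a recent Slurm with NVML auto-detection; the profile ID below is only an example, check nvidia-smi mig -lgip for your card):
# Enable MIG mode on GPU 0 (may require a GPU reset or reboot to take effect).
nvidia-smi -i 0 -mig 1
# Create three 1g.5gb GPU instances plus their compute instances.
nvidia-smi mig -cgi 19,19,19 -C
# In gres.conf, the simplest way to expose the MIG devices to Slurm is NVML
# auto-detection (available in recent Slurm releases):
AutoDetect=nvml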

Related

Does tensorflow-quantum support GPU, and if so how do I make it use mine?

I am getting started with tensorflow-quantum for some QML circuit simulations. I have everything configured correctly for TensorFlow with GPU, and when I run print(tf.config.list_physical_devices('GPU')), it reports the presence of my GPU.
However, some Googling turned up a few things suggesting that tensorflow-quantum doesn't actually support GPU acceleration for simulations (e.g. MichaelBroughton's first reply here, and this issue, which is still open). It's unclear to me how up to date that state of affairs is, and I can't find anything about GPU support being added in the release notes.
Does tensorflow-quantum currently support GPU? If so, how do I (a) make it use my GPU for simulations and (b) verify that it is doing so?
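As a generic check for (b), TensorFlow's device-placement logging shows where each op actually runs; this is plain TensorFlow rather than anything TFQ-specific, so if the circuit simulator is CPU-only the log will simply show CPU placements:
import tensorflow as tf
# Log the device every op is assigned to.
tf.debugging.set_log_device_placement(True)
# Any computation now reports its placement; GPU-backed ops appear as
# .../device:GPU:0 in the log.
a = tf.random.uniform((1000, 1000))
b = tf.matmul(a, a)
print(b.device)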

How to choose GPU when running phoronix-test-suite benchmark?

I am new to Phoronix Test Suite and ran my first test with phoronix-test-suite benchmark testname. This ran the test for one of my GPUs but not the other. How can I choose which GPU to use for the benchmark?
I've searched Google and skimmed the documentation for an answer but found nothing.
EDIT
The test I am trying to run is here, using
phoronix-test-suite benchmark 2102179-HA-NVIDIAGEF76
I've also tried using the method described here but to no avail.
I am using Phoronix Test Suite v10.2.2 (Harstad) on Ubuntu 20.04.2 LTS.
UPDATE
According to this issue, phoronix-test-suite always chooses the default GPU on a given system.
PTS currently sticks to using the default GPU configured by your system whether it be configured via PRIME handling or other multi-GPU setup configurations. Basically, it doesn't override your default GPU choice(s) or interfere beyond simply reporting the enumerated GPUs.
So the official way to change the GPU used by a Phoronix benchmark is to change the 'default GPU' on the broader system. I don't understand what determines which GPU is the default or how to change it. The above quote indicates that the default GPU might be changed using PRIME.
When running nvidia-settings the following message is printed.
** (nvidia-settings:9809): WARNING **: 15:46:41.950: PRIME: Failed to execute child process “/usr/bin/prime-supported” (No such file or directory)
** Message: 15:46:41.950: PRIME: is it supported? no
So it seems that whatever PRIME is, it's not part of my system.
As you are looking to configure an NVIDIA GPU, the logic is slightly different:
looking at the source, PTS appears to always use the first GPU it finds in the output of nvidia-settings --query PCIID.
This has since been confirmed by the PTS lead developer on GitHub, so unfortunately there is no switch in PTS that would let you pick the GPU.
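If you want to see which GPU that logic would pick, you can run the same query yourself; and on systems where PRIME is available, the standard NVIDIA render-offload variables can be used to push a single run onto the NVIDIA GPU (whether a given test honours this depends on how it launches its workload):
# The first GPU listed here is the one PTS is expected to use.
nvidia-settings --query PCIID
# Standard PRIME render offload variables, applied to one command only.
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia \
    phoronix-test-suite benchmark 2102179-HA-NVIDIAGEF76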
On Windows with an NVIDIA GPU, this can be done from the NVIDIA Control Panel:
Go to Manage 3D settings
Go to "Program Settings"
Select your app (in this case the Phoronix Test Suite benchmark) and choose the high-performance NVIDIA GPU.
Now run the benchmark test.
For more help see: https://www.phoronix-test-suite.com/documentation/phoronix-test-suite.pdf

Can I create multiple virtual devices from multiple GPUs in Tensorflow?

I'm using Local device configuration in Tensorflow 2.3.0 currently, to simulate multiple GPU training, and it is working. If I buy another GPU, will I be able to use the same functionality to each GPU?
Right now I have 4 virtual GPUs and one physical GPU. I want to buy another GPU and want to have 2x4 virtual GPUs. I haven't found any information about it, and because I don't have another GPU right now, I can't test it. Is it supported? I'm afraid, it's not.
Yes, you can add another GPU; there is no restriction on the number of GPUs, and you can make use of all the GPU devices you have.
As you can see in the documentation, which says:
A visible tf.config.PhysicalDevice will by default have a single tf.config.LogicalDevice associated with it once the runtime is initialized. Specifying a list of tf.config.LogicalDeviceConfiguration objects allows multiple devices to be created on the same tf.config.PhysicalDevice.
You can follow this documentation for more details on using multiple GPUs.
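As a minimal sketch with two physical GPUs (the memory limit is an arbitrary placeholder, and in older TensorFlow versions the same calls live under tf.config.experimental), you simply apply a logical device configuration to each physical device before the runtime is initialized:
import tensorflow as tf
# One entry per physical GPU; each is split into 4 logical GPUs.
physical_gpus = tf.config.list_physical_devices('GPU')
for gpu in physical_gpus:
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)] * 4)
# With 2 physical GPUs this yields 2 x 4 = 8 logical GPUs.
logical_gpus = tf.config.list_logical_devices('GPU')
print(len(physical_gpus), "physical GPUs,", len(logical_gpus), "logical GPUs")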

One instance with multple GPUs or multiple instances with one GPU

I am running multiple models using GPUs and all jobs combined can be run on 4 GPUs, for example. Multiple jobs can be run on the same GPU since the GPU memory can handle it.
Is it a better idea to spin up a powerful instance with all 4 GPUs as part of it and run all the jobs on one instance? Or should I go the route of having multiple instances with 1 GPU on each?
There are a few factors I'm thinking of:
Latency of reading files. Having a local disk on one machine should be faster latency-wise, but it would be quite a few reads from one source. Would this cause any issues?
I would need quite a few vCPUs and a lot of memory to scale the IOPS, since GCP apparently scales IOPS that way. What is the best way to approach this? If anyone has more insight on this, I would appreciate pointers.
If in the future I need to downgrade to save costs or reduce performance, I could simply stop the instance and change my specs.
Having everything on one machine would be easier to work with. I know in production I would want a more distributed approach, but this is strictly experimentation.
Those are my main thoughts. Am I missing something? Thanks for all of the help.
Ended up going with one machine with multiple GPUs. Just assigned the jobs to the different GPUs to make the memory work.
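One simple way to do that assignment (the script names below are placeholders) is to restrict each process to its own GPU with CUDA_VISIBLE_DEVICES:
# Each job sees only its assigned GPU, which appears to it as device 0.
CUDA_VISIBLE_DEVICES=0 python job_a.py &
CUDA_VISIBLE_DEVICES=1 python job_b.py &
wait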
I suggest you take a look here if you want to run multiple tasks on the same GPU.
Basically, when running several tasks (different processes or containers) on the same GPU, it won't be efficient due to a kind of context switching.
You'll need the latest NVIDIA hardware to test it.

Slurm oversubscribe GPUs

Is there a way to oversubscribe GPUs on Slurm, i.e. run multiple jobs/job steps that share one GPU? We've only found ways to oversubscribe CPUs and memory, but not GPUs.
We want to run multiple job steps on the same GPU in parallel and optionally specify the GPU memory used for each step.
The easiest way of doing that is to define the GPU as a feature rather than as a gres, so Slurm will not manage the GPUs; just make sure that jobs that need one land on nodes that offer one.
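A minimal sketch of that setup (node name and values are placeholders): advertise the GPU as a node feature in slurm.conf and have jobs request it with --constraint, leaving GPU selection and sharing entirely to the jobs themselves.
# In slurm.conf, tag the GPU nodes with a feature instead of a gres entry:
NodeName=node[01-04] CPUs=32 RealMemory=128000 Feature=gpu
# In the job script, land on such a node via the feature; Slurm no longer
# tracks GPU usage, so several of these jobs (or steps) can share one device:
#SBATCH --constraint=gpu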