Using openpai, whether the task can work on more than one worker? - openpai

For example, I have built a pai cluster with 2 workers, and each worker has 2 GPUs. If I want to use four GPUs to run a task, can this cluster meet the demand and use both worker to run the task?

yes. please refer to distributed tensorflow example for details.
https://github.com/Microsoft/pai/tree/master/examples/tensorflow#distributed-tensorflow-cifar-10-image-classification

Related

How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
However, it seems the slurm deamon doesn't accept this, the service returning:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of smiliar to this one, but this one seems specific to some CUDA code with compilation enabled. Something which seems way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
Besides nVidia MPS mentioned by #Marcus Boden, which is relevant for V100 types of cards, there also is Multi-Instance GPU which is relevant for A100 types of cards.

Slurm oversubscribe GPUs

Is there a way to oversubscribe GPUs on Slurm, i.e. run multiple jobs/job steps that share one GPU? We've only found ways to oversubscribe CPUs and memory, but not GPUs.
We want to run multiple job steps on the same GPU in parallel and optionally specify the GPU memory used for each step.
The easiest way of doing that is to have the GPU defined as a feature rather than as a gres so Slurm will not manage the GPUs, just make sure that job that need one land on nodes that offer one.

How does TensorFlow cluster distribute load across machines if not specified explicitly?

I took "Distributed TensorFlow" how-to and tried to apply it to the "MNIST For ML Beginners" tutorial. I started three TensorFlow worker nodes locally (there are 8 cores in the PC) and ran the training script with replacing this line:
sess = tf.InteractiveSession()
with the following:
sess = tf.InteractiveSession("grpc://localhost:12345")
where 12346 is a port where node 0 is listening (e.g. master session is created on node 0). Note that I did not specify explicitly where computations should be performed.
Looking at htop's output, I can see that the job is indeed performed by the cluster - it consumes some CPU. However, the only consumer is node 0, remaining nodes do not perform any work. If I select node 1 as a place to create master session, picture changes: only ~2/3 of the work is performed on node 0 (judging by CPU load), but the remaining 1/3 of the work is performed on node 1. If I select node 2 as master, then that 1/3 of the work is performed on node 2. If I run two processes in parallel, one using node 1 as master and another using node 2 as master, both nodes 1 and 2 get some load, but node 0 is loaded much more (like, 200% vs 60% vs 60% of CPU).
So far it looks like "default" behavior of distributed TensorFlow is not great for parallelizing work automatically right now. I'm wondering what the behavior is and whether distributed TensorFlow is intended for data parallelization at all (as opposed to manual model parallelization)?
TF is great for data parallelization, e.g. when you need to sift through tons of data, which is then distributed to multiple GPUs.
It's also great for weights parallelization. Using tf.train.replica_device_setter, weights are distributed among multiple devices for better IO.
Now, it seems you are asking for parallelization within a single model. That's difficult to do automatically, since TF does not know what's the best way to distribute your computation of the same model to multiple devices. It would depend on too many factors, e.g. how fast is the connection between your devices.

How to run Tensorflow on SLURM cluster with properly configured parameter server?

I am in the fortunate position of having access to my university's SLURM powered GPU cluster. I have been trying to get Tensorflow to run a in a cluster node, but so far I have failed to find any documentation. (Everyone I have spoken to at the university has run it using CPU nodes before or using a single GPU node.
I found an excellent bit of documentation from this previous question here. Unfortunately, it's rather incomplete. All of the other distributed examples I have found such as such as this one rely on explicitly specifying the parameter server.
When I try to run it using the code from the SO question, I appears to work perfectly until it either fails to connect to a nonexistent parameter server or hangs when server.join is called and no print outs are provided to the sbatch outfile (which I understand should happen).
So in short, my question is how would one go about starting Tensorflow on a SLURM cluster? From the sbatch stage onwards. This is my first time dealing with a distributed computing framework besides SPARK on AWS and I would love to learn more about how to properly configure Tensorflow. How do I specify which one of the items in the tf_hostlist for example server as the parameter server? Alternatively can I use sbatch to send slightly different commands to each worker as I have seen done in other examples?

multiple workers on a single machine in distributed TensorFlow

is it possible to launch distributed TensorFlow on a local machine, in a way that each worker has a replica of the model?
if yes, is it possible to assign each agent to utilize only a single CPU core?
Yes it is possible to launch a distributed Tensorflow locally:
Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
and in such a way that each worker has the same graph:
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use different sessions for each graph, but each graph can be used in multiple sessions.
As mentioned by in your comments, there is a suggestion of how to try and achieve execution of distributed TF to a single core which involves distributing to different CPUs and then limiting the thread pool to a single thread.
Currently there is no feature that allows the distributed execution of TF graphs to particular cores.
To your first question, the answer is yes. More details here: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
For the second question, I'm not sure if Tensorflow has this level of fine-grained control at core-level; but in general the OS will load balance threads on multiple cores.
Note that Tensorflow does have the ability to specify a device at processor level, if you have multiple CPUs/GPUs.