As far as I know,
a GPU can simultaneously run thousands of threads belonging to a single kernel, which is why its execution model is called SIMT (single instruction, multiple threads).
However, can a GPGPU simultaneously run threads that belong to different kernels? If so, does that mean each kernel's threads run on their own GPU cores, with multiple cores active at once? Or can even a single GPU core run threads belonging to different kernels?
Or, put simply, can the GPU's cores only run threads from one kernel at a time?
Related
Running TensorFlow, the moment I try to run two GPU-related jobs in parallel, I get an error saying the cuDNN module is not available. Is there any way to run two jobs in parallel (perhaps at the cost of increased run time)?
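One likely cause (an assumption, since the exact error message is not shown) is that by default the first TensorFlow process grabs nearly all GPU memory, so a second process cannot initialize cuDNN. A minimal sketch of a common workaround, using the TF 1.x-era API referenced elsewhere on this page (the 0.45 fraction is an arbitrary example value):

```python
import tensorflow as tf

# Cap this process's share of GPU memory so a second process can
# also initialize cuDNN on the same device (TF 1.x options).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.45,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("this job now shares the GPU")))
```

With each process capped this way, two jobs can coexist on one GPU, typically at the cost of longer run times.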
My program consists of two parts, A and B, both written in C++. B is loaded from a separate DLL and can run either on the CPU or on the GPU, depending on how it is linked. When the main program is launched, it creates one instance of A, which in turn creates one instance of B (which then works either on the locally available CPUs or on the first GPU).
When launching the program using mpirun (or via Slurm, which in turn launches mpirun), one instance of A is created for each MPI rank, and each of these creates one instance of B for itself. When there is only one GPU in the system, that GPU will be used, but what happens if there are multiple GPUs in the system? Are all the instances of B placed on the same GPU, regardless of how many GPUs are available, or are they distributed evenly?
Is there any way to influence that behavior? Unfortunately, my development machine does not have multiple GPUs, so I cannot test this except in production.
Slurm supports binding MPI ranks to GPUs, for example through the --gpu-bind option: https://slurm.schedmd.com/gres.html. Assuming the cluster is configured to enforce GPU affinities, this lets you assign one GPU per rank even when multiple ranks run on a single node.
If you want to test this, you could, for example, use the cudaGetDevice and cudaGetDeviceProperties calls to get the device LUID (locally unique identifier) for each rank and then check that no LUID is duplicated within a node.
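Those calls are from the CUDA runtime API in C++. As an alternative quick check, here is a minimal Python sketch (assuming mpi4py is installed, and that the binding is surfaced through CUDA_VISIBLE_DEVICES, which Slurm's --gpu-bind typically sets per rank):

```python
import os
import socket
from mpi4py import MPI  # assumes mpi4py is available

comm = MPI.COMM_WORLD
host = socket.gethostname()
# Slurm's per-rank GPU binding is usually reflected here.
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>")

reports = comm.gather((host, comm.Get_rank(), visible), root=0)
if comm.Get_rank() == 0:
    seen = {}
    for h, rank, dev in reports:
        print("rank %d on %s was given GPU(s): %s" % (rank, h, dev))
        if (h, dev) in seen and dev != "<unset>":
            print("  WARNING: same binding as rank %d" % seen[(h, dev)])
        seen[(h, dev)] = rank
```

Run it with the same mpirun/srun invocation as the real program; duplicated (host, device) pairs indicate ranks sharing a GPU.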
Is there a way to oversubscribe GPUs on Slurm, i.e. run multiple jobs/job steps that share one GPU? We've only found ways to oversubscribe CPUs and memory, but not GPUs.
We want to run multiple job steps on the same GPU in parallel and optionally specify the GPU memory used for each step.
The easiest way of doing that is to define the GPU as a feature rather than as a GRES. Slurm will then not manage the GPUs at all; it will just make sure that jobs that need one land on nodes that offer one.
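A rough sketch of what that looks like (the node name and the CPU/memory values are hypothetical; adapt to your site's slurm.conf):

```
# slurm.conf: advertise the GPU as a plain feature instead of a GRES
NodeName=node01 CPUs=32 RealMemory=128000 Feature=gpu

# job script: request the feature; since no gres is consumed,
# any number of jobs can land on (and share) the GPU node
#SBATCH --constraint=gpu
```

Because Slurm no longer tracks the device, dividing GPU memory between the steps is then up to the applications themselves (for example, via the per-process memory cap shown earlier on this page).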
According to my understanding, one operation is executed by one CPU, and multiprocessing systems have multiple CPUs. That is, multiprocessing systems can work on multiple tasks at the same time.
But, in multiprocessing systems, only one process is in the running state at any point in time,
and the processes take turns through process scheduling.
So a multiprocessing system only appears to work on multiple processes at the same time.
Why do multiprocessing systems that have multiple CPUs use process scheduling and run only one process at a time on a single CPU?
Why don't the multiple CPUs run multiple processes at the same time?
Why do multiprocessing systems that have multiple CPUs use process scheduling and run only one process at a time on a single CPU?
Most systems these days use thread scheduling, not process scheduling. There are still some Unix variants that schedule processes, but most have switched to threads.
Why don't the multiple CPUs run multiple processes at the same time?
They do. They also execute multiple threads in the same process at the same time.
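To see this concretely, here is a small Python sketch (illustrative, not from the original answer) that runs CPU-bound work in four processes at once; on a multi-core machine, the elapsed time stays close to that of a single task, showing that the CPUs really do execute multiple processes simultaneously:

```python
import multiprocessing as mp
import os
import time

def burn_cpu(n):
    # Pure CPU-bound work, so parallelism shows up in wall-clock time.
    total = 0
    for i in range(n):
        total += i * i
    return os.getpid()

if __name__ == "__main__":
    start = time.time()
    with mp.Pool(processes=4) as pool:
        pids = pool.map(burn_cpu, [10_000_000] * 4)
    print("distinct worker PIDs:", sorted(set(pids)))
    print("elapsed: %.2fs" % (time.time() - start))
```

(In CPython specifically, threads within one process do not run Python bytecode in parallel because of the GIL, which is why this sketch uses processes; at the OS level, though, both threads and processes are scheduled across CPUs exactly as described above.)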
Is it possible to launch distributed TensorFlow on a local machine, in such a way that each worker has a replica of the model?
If yes, is it possible to assign each worker to utilize only a single CPU core?
Yes, it is possible to launch distributed TensorFlow locally:
Each task typically runs on a different machine, but you can run multiple tasks on the same machine (e.g. to control different GPU devices).
and in such a way that each worker has the same graph:
If you are using more than one graph (created with tf.Graph()) in the same process, you will have to use different sessions for each graph, but each graph can be used in multiple sessions.
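Putting those two quotes together, a minimal sketch using the TF 1.x-era API (consistent with the r0.9 documentation cited in the next answer; hostnames and ports are arbitrary example values) starts two worker tasks on localhost. You would run the script twice, once per task_index:

```python
import tensorflow as tf

# Two worker tasks on the same machine; ports are arbitrary examples.
cluster = tf.train.ClusterSpec({"worker": ["localhost:2222",
                                           "localhost:2223"]})

# Launch this script twice: once with task_index=0, once with task_index=1.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Each worker process builds its own replica of the (here trivial) model.
with tf.device("/job:worker/task:0"):
    replica_out = tf.constant("replica running on worker 0")

with tf.Session(server.target) as sess:
    print(sess.run(replica_out))
```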
As mentioned in your comments, there is a suggestion for how to try to pin distributed TF execution to a single core, which involves distributing work across different CPU devices and then limiting each thread pool to a single thread; see the sketch below.
Currently, there is no feature that allows distributed execution of TF graphs to be pinned to particular cores.
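A sketch of that workaround (TF 1.x API; the number of CPU devices is an assumed example value):

```python
import tensorflow as tf

# Expose two CPU devices and restrict each thread pool to one thread,
# approximating "one core per worker" without true core pinning.
config = tf.ConfigProto(
    device_count={"CPU": 2},         # assumed number of CPU devices
    intra_op_parallelism_threads=1,  # one thread inside each op
    inter_op_parallelism_threads=1)  # one op executed at a time

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant(42)))
```

This only limits TensorFlow's own threading; which physical core each thread actually lands on remains the OS scheduler's decision.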
To your first question, the answer is yes. More details here: https://www.tensorflow.org/versions/r0.9/how_tos/distributed/index.html
For the second question, I'm not sure whether TensorFlow has this level of fine-grained, core-level control; in general, the OS will load-balance threads across multiple cores.
Note that TensorFlow does have the ability to place operations on a specific device at the processor level, if you have multiple CPUs/GPUs; see the example below.
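For example (a minimal TF 1.x-style sketch; the device strings follow TensorFlow's standard naming):

```python
import tensorflow as tf

# Pin these ops to the first CPU device; "/gpu:0" would target the first GPU.
with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    total = a + b

with tf.Session() as sess:
    print(sess.run(total))
```

Note that the granularity here is a whole device such as /cpu:0, not an individual core, consistent with the point above.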