How to enable Keras with Theano to utilize multiple GPUs

Setup:
Using an Amazon Linux system with an NVIDIA GPU
I'm using Keras 1.0.1
Running the Theano v0.8.2 backend
Using CUDA and cuDNN
THEANO_FLAGS="device=gpu,floatX=float32,lib.cnmem=1"
Everything works fine, but I run out of video memory on large models when I increase the batch size to speed up training. I figure that moving to a 4-GPU system would in theory either increase the total memory available or allow smaller batches to train faster, but observing the nvidia-smi stats, I can see that only one GPU is used by default:
+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 44C P0 45W / 125W | 3954MiB / 4095MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
| 1 GRID K520 Off | 0000:00:04.0 Off | N/A |
| N/A 28C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GRID K520 Off | 0000:00:05.0 Off | N/A |
| N/A 32C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GRID K520 Off | 0000:00:06.0 Off | N/A |
| N/A 29C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9862 C python34 3941MiB |
I know that with raw Theano you can explicitly use multiple GPUs manually. Does Keras support the use of multiple GPUs? If so, does it abstract this away, or do you need to map the GPUs to devices as in Theano and explicitly marshal computations to specific GPUs?

Multi-GPU training is experimental ("The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to double-check your results before publishing a paper or anything of the sort.") and hasn't been integrated into Keras yet. However, you can use multiple GPUs with Keras through the TensorFlow backend: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html#multi-gpu-and-distributed-training.
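The linked tutorial's approach is device-level placement with tf.device around Keras layer calls. Below is a minimal sketch, assuming the TensorFlow backend (rather than Theano) and TF 1.x-style graph placement; the layer sizes and split point are illustrative:
import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(784,))

with tf.device('/gpu:0'):
    x = Dense(512, activation='relu')(inputs)      # first part of the graph runs on GPU 0

with tf.device('/gpu:1'):
    outputs = Dense(10, activation='softmax')(x)   # second part runs on GPU 1

model = Model(inputs, outputs)
model.compile(optimizer='sgd', loss='categorical_crossentropy')
This splits a single model across devices (model parallelism); for data parallelism you would instead run one replica of the model per GPU on different slices of each batch.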

Related

CUDA_ERROR_NOT_INITIALIZED on A100 after server reset

I'm running on a server with an A100 GPU. When trying to run TensorFlow code after a server reset, TensorFlow does not recognize the GPU. Running tf.config.list_physical_devices('GPU') yields CUDA_ERROR_NOT_INITIALIZED:
2021-09-09 07:41:42.956917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-09 07:41:43.899014: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2021-09-09 07:41:43.899148: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: f42a3aa12bd1
2021-09-09 07:41:43.899169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: f42a3aa12bd1
2021-09-09 07:41:43.899890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.32.3
2021-09-09 07:41:43.899955: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2021-09-09 07:41:43.899969: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.32.3
Running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB Off | 00000000:00:06.0 Off | On |
| N/A 46C P0 40W / 250W | 0MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| No MIG devices found |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Why do I get CUDA_ERROR_NOT_INITIALIZED? The server ran perfectly well before the reset, and nvidia-smi is clearly working.
It seems NVIDIA Multi-Instance GPU (MIG) is enabled on your GPU, but you haven't defined any GPU instances. This can be seen from the fact that nvidia-smi shows a MIG devices table, but it's empty (No MIG devices found).
The MIG documentation states:
Without creating GPU instances (and corresponding compute instances), CUDA workloads cannot be run on the GPU. In other words, simply enabling MIG mode on the GPU is not sufficient. Also note that, the created MIG devices are not persistent across system reboots. Thus, the user or system administrator needs to recreate the desired MIG configurations if the GPU or system is reset.
You probably had a MIG configuration defined before the reset, but the server reset removed that configuration. You need to re-configure the GPU instances to get the GPU working again. If you just want a basic configuration, in which you have only one GPU instance that uses all the resources, you can run:
sudo nvidia-smi mig -cgi 0 -C
If you need a fancier configuration than that, you should consult the documentation.
After configuring the GPU instances, the nvidia-smi command should show a populated MIG devices table. In this case, it should have one entry:
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 0 0 0 | 0MiB / 40536MiB | 98 0 | 7 0 5 1 1 |
| | 1MiB / 65536MiB | | |
+------------------+----------------------+-----------+-----------------------+
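As a quick sanity check, the call from the question should now return a non-empty list instead of logging CUDA_ERROR_NOT_INITIALIZED (a minimal check, assuming the MIG instance was created successfully):
import tensorflow as tf

# With the MIG instance recreated, TensorFlow should list one GPU device again.
print(tf.config.list_physical_devices('GPU'))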

Using GPU for reinforcement learning with Keras

I am using this code (please excuse its messiness) to run on my CPU. I have a custom RL environment that I created myself, and I am using a DQN agent.
But when I run this code on the GPU, it doesn't utilize much of it, and it is in fact slower than on my CPU.
This is the output of nvidia-smi. As you can see, my processes are running on the GPU, but the speed is much slower than I would expect.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82 Driver Version: 440.82 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:00:05.0 Off | N/A |
| 23% 37C P2 60W / 250W | 11619MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:00:06.0 Off | N/A |
| 23% 29C P8 9W / 250W | 157MiB / 12196MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 25540 C python3 11609MiB |
| 1 25540 C python3 147MiB |
+-----------------------------------------------------------------------------+
Can anyone point out what I can do to change my code to make better use of the GPU?
PS: Note that I have two GPUs and my process is running on both of them. Even if I use only one of the two GPUs, the issue remains that the GPU is barely utilized and training is slower than expected, so having two GPUs is not the issue.
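One way to see whether the ops are actually being placed on the GPU is to enable device-placement logging; a hedged diagnostic sketch, assuming TensorFlow 2.x (in TF 1.x the equivalent is the log_device_placement session config option):
import tensorflow as tf

# Print the device each op runs on, to confirm the model is not silently
# falling back to the CPU.
tf.debugging.set_log_device_placement(True)
print(tf.config.list_physical_devices('GPU'))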

IBM XL C/C++ Compiler: "No valid target devices available"

I'm compiling some OpenMP 4.5 code with the IBM XL C/C++ compiler with the intention of offloading some of its work to a GPU, like so:
xlc++ mycode.cpp -qsmp=omp -qreport -qoffload -std=c++11 -Wall
Compilation seems to be successful, giving me only the following messages:
mycode.cpp:
"mycode.cpp", line 284: 1586-358 (I) Loop was parallelized.
"mycode.cpp", line 293: 1586-358 (I) Loop was parallelized.
"mycode.cpp", line 309: 1586-358 (I) Loop was parallelized.
"mycode.cpp", line 324: 1586-358 (I) Loop was parallelized.
"mycode.cpp", line 126: 1586-674 (I) Remark: Simd or nested parallel directive requires OpenMP runtime
"" 1586-671 (I) GPU OpenMP Runtime is required for offloaded kernel '__xl__Z9MyCodeiii_l123_h44039046689_OL_1'
However, when I run the code, I get the following unpleasant message:
1587-169 No valid target devices available.
Using nvidia-smi, I have verified that target devices are, in fact, available:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.59 Driver Version: 384.59 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000002:01:00.0 Off | 0 |
| N/A 33C P0 29W / 300W | 10MiB / 16276MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000003:01:00.0 Off | 0 |
| N/A 29C P0 30W / 300W | 10MiB / 16276MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-SXM2... Off | 00000006:01:00.0 Off | 0 |
| N/A 31C P0 28W / 300W | 10MiB / 16276MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-SXM2... Off | 00000007:01:00.0 Off | 0 |
| N/A 27C P0 29W / 300W | 10MiB / 16276MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
My thought is that XL is somehow targeting the wrong accelerator, but I can't find an option to set this.
How can I get my code to recognize and utilize the available GPUs?
The -qtgtarch option specifies the GPU architectures on which the code may run. Try -qtgtarch=auto if you would like the compiler to automatically detect the architecture of device 0 on the system where the compiler is being executed. Alternatively, you can set it manually, for example -qtgtarch=sm_60.
More information at Knowledge Center.

How to access GPUs on different nodes in a cluster with Slurm?

I have access to a cluster that is run by Slurm, in which each node has 4 GPUs.
I have code that needs 8 GPUs.
So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?
So this is the job that I tried to submit via sbatch:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
But then I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
Then I changed the settings to this and submitted again:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi
and the result shows only 4 GPUs, not 8.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 0000:03:00.0 Off | 0 |
| N/A 32C P0 31W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 0000:04:00.0 Off | 0 |
| N/A 37C P0 29W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 0000:82:00.0 Off | 0 |
| N/A 35C P0 28W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 0000:83:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 12193MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Thanks.
Slurm does not support what you need. It can only assign GPUs per node to your job, not GPUs across the cluster.
So, unlike CPUs and other consumable resources, GPUs are not consumable and are bound to the node where they are hosted.
If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper.
There you'll find how to do it using GPU virtualization technologies.
Job script: You are requesting 2 nodes, each with 4 GPUs, so a total of 8 GPUs are assigned to you. You are running nvidia-smi, which is not aware of Slurm or MPI; it runs only on the first node assigned to you, so it shows only 4 GPUs. The result is normal.
If you run a GPU-based engineering application such as Ansys HFSS or CST, it can use all 8 GPUs.

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems like it is, but I am observing weird behaviour.
My system has two Tesla P100s, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 ate all the memory on both GPUs, but the computational load is placed only on the first one.
Exporting CUDA_VISIBLE_DEVICES lets me limit which GPU is used, but it does not affect the computation time, so there is no gain from adding a second GPU. Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is: how do I load both GPUs?
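For context, TensorFlow 1.x reserves memory on every visible GPU by default but only runs ops on the device they are placed on, which matches the nvidia-smi output above. A minimal sketch of explicit placement with tf.device (illustrative shapes, not the MNIST code itself; a real speedup would need per-GPU towers with averaged gradients):
import tensorflow as tf

# Place independent work on each GPU explicitly; TensorFlow only executes ops
# on the device they are assigned to, even though it reserves memory on both.
with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)            # computed on GPU 0

with tf.device('/gpu:1'):
    c = tf.random_normal([1000, 1000])
    d = tf.matmul(c, c)            # computed on GPU 1

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run([b, d]))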