Using GPU for reinforcement learning with Keras - tensorflow

I am using this code (please excuse its messiness), which currently runs on my CPU. I have a custom RL environment that I created myself, and I am using a DQN agent.
But when I run this code on GPU, it doesn't utilize much of it and in fact it is slower than my CPU.
This is the output of nvidia-smi. As you can see my processes are running on GPU but the speed is much slower than I would expect.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.82       Driver Version: 440.82       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:00:05.0 Off |                  N/A |
| 23%   37C    P2    60W / 250W |  11619MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:00:06.0 Off |                  N/A |
| 23%   29C    P8     9W / 250W |    157MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     25540      C   python3                                    11609MiB |
|    1     25540      C   python3                                      147MiB |
+-----------------------------------------------------------------------------+
Can anyone point out what I can change in my code so that it takes advantage of the GPU?
PS: Notice that I have two GPUs and my process is running on both of them. Even if I use only one of the two GPUs, the GPU is still not utilized and the speed is still slower than on the CPU, so having two GPUs is not the issue.
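A single nvidia-smi snapshot can miss short bursts of GPU work, so it may help to sample utilization repeatedly while the agent trains. The following is a minimal sketch, assuming nvidia-smi is on the PATH; the helper names parse_gpu_stats and read_gpu_stats are hypothetical, and the sample string simply mirrors the figures in the snapshot above.

```python
import subprocess

# Query fields supported by nvidia-smi; "nounits" drops the "%" and "MiB" suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(text):
    """Turn 'index, util, mem' CSV lines into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"gpu": int(index), "util_pct": int(util), "mem_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


# Sample text mirroring the snapshot above, so the parser can be tried without a GPU.
sample = "0, 0, 11619\n1, 0, 157\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

Calling read_gpu_stats() in a loop during training would show whether utilization ever rises above 0%. Note that TensorFlow by default reserves most of the GPU memory up front, so a nearly full Memory-Usage column says little about how much compute is happening.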

Related

Google Colab not detecting GPU with 'sudo'

I've been using Google Colab to produce Blender renders for a few months, but today my scripts stopped working without any changes. I run my scripts with sudo and for whatever reason Google Colab is not giving GPU access to commands run with sudo.
This is the output for nvidia-smi:
/content# nvidia-smi
Fri Jun 3 13:46:21 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
But the same command throws an error if run with sudo:
/content# sudo nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
sudo is important for me because my Blender commands, for whatever reason, don't work without sudo.
/content# ./blender-3.1.0-linux-x64/blender -b --python-console -noaudio
src/tcmalloc.cc:283] Attempt to free invalid pointer 0x7fe46122b000
Aborted (core dumped)

Is it possible that the actual memory size of the RTX 2080 Ti is smaller than that of the GTX 1080 Ti?

This is my first post, so if anything is inappropriate, please bear with me. :)
Recently, I ran my program on two machines with different GPU types (one with a GTX 1080 Ti and another with an RTX 2080 Ti). For convenience, I will call the machine with the GTX 1080 Ti M1 and the other M2.
Even though I run exactly the same program on both machines, M2 cannot run with as large a batch size as M1. (By the way, M2 is indeed faster than M1 under the same parameter settings.)
I then checked nvidia-smi, and the output is shown below:
Machine with GTX 1080 Ti (M1)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:08:00.0 Off |                  N/A |
| 21%   19C    P8     8W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:09:00.0 Off |                  N/A |
| 21%   21C    P8     9W / 250W |     10MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Machine with RTX 2080Ti (M2)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64       Driver Version: 440.64       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:1D:00.0 Off |                  N/A |
| 27%   32C    P2    48W / 250W |    914MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:20:00.0 Off |                  N/A |
| 31%   31C    P8     1W / 250W |     11MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
As shown, the total memory on M2 (11019 MiB) is slightly smaller than on M1 (11178 MiB).
Since M2 is a higher-tier product, shouldn't it be better than the previous generation in every aspect of performance?
So I would like to ask: why does M2 have a lower memory capacity than M1?
Also, is there any way to increase the memory available on M2, such as updating the driver?
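For scale, the gap can be worked out directly from the nvidia-smi figures above. This is only a small arithmetic sketch with illustrative variable names; it also compares the memory left on GPU 0 of each machine at the time of the snapshot, since M2 already shows 914 MiB in use while idle.

```python
# Totals and idle usage for GPU 0, copied from the nvidia-smi output above.
m1_total_mib, m1_used_mib = 11178, 10   # GTX 1080 Ti (M1)
m2_total_mib, m2_used_mib = 11019, 914  # RTX 2080 Ti (M2)

# Difference in reported capacity.
total_gap_mib = m1_total_mib - m2_total_mib          # 159
total_gap_pct = 100 * total_gap_mib / m1_total_mib   # about 1.42

# Difference in memory still free on GPU 0 when the snapshot was taken.
m1_free_mib = m1_total_mib - m1_used_mib              # 11168
m2_free_mib = m2_total_mib - m2_used_mib              # 10105
free_gap_mib = m1_free_mib - m2_free_mib              # 1063

print(f"capacity gap: {total_gap_mib} MiB ({total_gap_pct:.2f}%)")
print(f"free-memory gap on GPU 0: {free_gap_mib} MiB")
```

In this snapshot the memory already in use on M2's GPU 0 accounts for a larger share of the difference than the reported capacity does, so it may be worth checking what else is running on that GPU.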

How to access GPUs on different nodes in a cluster with Slurm?

I have access to a cluster managed by Slurm, in which each node has 4 GPUs.
I have code that needs 8 GPUs.
So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?
This is the job that I tried to submit via sbatch:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
But then I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
Then I changed the settings to this and submitted again:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi
The result shows only 4 GPUs, not 8:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 0000:03:00.0     Off |                    0 |
| N/A   32C    P0    31W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE...  Off  | 0000:04:00.0     Off |                    0 |
| N/A   37C    P0    29W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-PCIE...  Off  | 0000:82:00.0     Off |                    0 |
| N/A   35C    P0    28W / 250W |      0MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-PCIE...  Off  | 0000:83:00.0     Off |                    0 |
| N/A   33C    P0    26W / 250W |      0MiB / 12193MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Thanks.
Slurm does not support what you need. It can only assign GPUs to your job per node, not per cluster.
Unlike CPUs or other consumable resources, GPUs are not consumable and are bound to the node that hosts them.
If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper.
There you'll find how to do it using GPU virtualization technologies.
Job script: You are requesting 2 nodes with 4 GPUs each, so a total of 8 GPUs is assigned to you. You are then running "nvidia-smi", which is not aware of Slurm or MPI. It runs only on the first node assigned to you, so it shows only 4 GPUs. This result is normal.
If you run a GPU-based engineering application such as Ansys HFSS or CST, it can use all 8 GPUs.
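To confirm that GPUs were allocated on both nodes, the check has to run once per node rather than once in the batch script. Below is a minimal sketch, assuming Slurm exports SLURM_NODEID and SLURM_JOB_NUM_NODES and sets CUDA_VISIBLE_DEVICES for the GPUs granted on each node; describe_allocation is a hypothetical helper, and the total it reports is only valid if every node received the same number of GPUs.

```python
import os


def describe_allocation(env):
    """Summarize the GPUs visible to this task from Slurm-style environment variables."""
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    gpus_here = [gpu for gpu in visible.split(",") if gpu]
    nodes = int(env.get("SLURM_JOB_NUM_NODES", "1"))
    return {
        "node_id": int(env.get("SLURM_NODEID", "0")),
        "gpus_on_this_node": len(gpus_here),
        # Assumes a uniform allocation: same GPU count on every node.
        "gpus_in_job_if_uniform": len(gpus_here) * nodes,
    }


# Each task prints its own view; outside Slurm the defaults above are used.
print(describe_allocation(os.environ))
```

Saved as a script (the file name is up to you) and launched from the job script with something like `srun --ntasks-per-node=1 python3 <script>`, this should print one line per node, whereas a bare command in the batch script runs only on the first node.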

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems to be, but I am observing weird behaviour.
My system has two Tesla P100, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107                Driver Version: 361.107                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  Off  | 0002:01:00.0     Off |                    0 |
| N/A   34C    P0   114W / 300W |  15063MiB / 16280MiB |     51%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  Off  | 0006:01:00.0     Off |                    0 |
| N/A   27C    P0    35W / 300W |  14941MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     67288      C   python3                                    15061MiB |
|    1     67288      C   python3                                    14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 took all the memory on both GPUs, but the computational load is placed only on the first one.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPUs are used, but it does not affect the computation time, so there is no gain from adding the second GPU.
Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is, how to load both GPUs?
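The two timings above can be compared directly. This is a short sketch that converts the `real` values printed by time into seconds and computes the ratio; parse_real is a hypothetical helper written for this format only.

```python
import re


def parse_real(value):
    """Convert a shell `time` value such as '2m23.496s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unexpected time format: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


single_gpu = parse_real("2m23.496s")  # wall-clock time with one GPU
two_gpus = parse_real("2m18.165s")    # wall-clock time with two GPUs
speedup = single_gpu / two_gpus

print(f"single: {single_gpu:.3f}s, two GPUs: {two_gpus:.3f}s, speedup: {speedup:.3f}x")
```

The ratio stays below 1.04, which fits the nvidia-smi snapshot: the second GPU holds memory but shows 0% utilization, so it contributes almost nothing to the run time.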

How to enable Keras with Theano to utilize multiple GPUs

Setup:
Using an Amazon Linux system with an Nvidia GPU
I'm using Keras 1.0.1
Running Theano v0.8.2 backend
Using CUDA and CuDNN
THEANO_FLAGS="device=gpu,floatX=float32,lib.cnmem=1"
Everything works fine, but I run out of video memory on large models when I increase the batch size to speed up training. I figure that moving to a 4-GPU system would, in theory, either increase the total memory available or allow smaller batches to build faster, but looking at the nvidia-smi stats, I can see that only one GPU is used by default:
+------------------------------------------------------+
| NVIDIA-SMI 361.42     Driver Version: 361.42         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   44C    P0    45W / 125W |   3954MiB /  4095MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K520           Off  | 0000:00:04.0     Off |                  N/A |
| N/A   28C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GRID K520           Off  | 0000:00:05.0     Off |                  N/A |
| N/A   32C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GRID K520           Off  | 0000:00:06.0     Off |                  N/A |
| N/A   29C    P8    17W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9862      C   python34                                    3941MiB |
I know that with raw Theano you can explicitly use multiple GPUs by hand. Does Keras support the use of multiple GPUs? If so, does it abstract this away, or do you need to map the GPUs to devices as in Theano and explicitly marshal computations to specific GPUs?
Multi-GPU training is experimental ("The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to double-check your results before publishing a paper or anything of the sort.") and hasn't been integrated into Keras yet. However, you can use multiple GPUs with Keras with the TensorFlow backend: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html#multi-gpu-and-distributed-training.