CUDA only recognizes 1 of 5 GPUs - tensorflow

I would like to run TensorFlow on a Windows 10 server with 5 NVIDIA Quadro P6000 GPUs. After installing CUDA 8.0 and running deviceQuery.exe, only one of the GPUs is reported, and TensorFlow therefore uses only one GPU as well.
(tensorflow-gpu) C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\demo_suite>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro P6000"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 24576 MBytes (25769803776 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1645 MHz (1.64 GHz)
Memory Clock rate: 4513 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro P6000
Result = PASS
However, nvidia-smi.exe and GPU-Z do recognize all 5 of them.
(tensorflow-gpu) C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Wed Sep 27 11:07:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51 Driver Version: 376.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 WDDM | 0000:04:00.0 Off | Off |
| 26% 45C P5 19W / 250W | 353MiB / 24576MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P6000 WDDM | 0000:05:00.0 Off | Off |
| 26% 24C P8 7W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P6000 WDDM | 0000:08:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro P6000 WDDM | 0000:09:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Quadro P6000 WDDM | 0000:89:00.0 Off | Off |
| 26% 22C P8 8W / 250W | 262MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2640 C+G Insufficient Permissions N/A |
| 0 3908 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 4 2640 C+G Insufficient Permissions N/A |
| 4 3552 C+G ...\ImmersiveControlPanel\SystemSettings.exe N/A |
| 4 5260 C+G Insufficient Permissions N/A |
| 4 6552 C+G Insufficient Permissions N/A |
| 4 7208 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 4 7444 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
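For reference, a quick way to double-check which devices TensorFlow itself can see (a minimal TF 1.x sketch matching the CUDA 8.0 setup; the commented CUDA_VISIBLE_DEVICES lines are only illustrative):
import os

# Optional: pin device ordering/visibility before TensorFlow initializes CUDA.
# os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4"

from tensorflow.python.client import device_lib

# Prints one entry per device TensorFlow can use (the CPU plus every visible GPU).
print(device_lib.list_local_devices())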
Does anyone have an idea what I could try to make all of them available to CUDA and TensorFlow?

Related

Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

I am using TensorFlow distributed training with the following code:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
model = Basic_Model()
model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I am getting the following error -
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072, 65536] of type float would require 131072 * 65536 * 4 bytes, i.e. about 34.36 GB. There are four 32 GB GPUs, so why can't it be allocated?
MirroredStrategy creates a copy of all variables within its scope on every GPU, so a ~34.36 GB tensor is too large for any single replica. You might want something like tf.distribute.experimental.CentralStorageStrategy instead. With MirroredStrategy, the usable GPU memory is not vram * num_of_gpus; in practice it is the smallest single GPU's VRAM, so in your case Keras is working with 32 GB per replica, not 32*4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires every available device to be able to hold all of the data; with this strategy you are therefore limited by your smallest device.
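To make the arithmetic concrete, here is a standalone sketch of the memory math above (plain Python, no GPU required; the 32480 MiB figure is taken from the nvidia-smi output in the question):
# Size of the offending float32 tensor.
rows, cols, bytes_per_float32 = 131072, 65536, 4
tensor_bytes = rows * cols * bytes_per_float32       # 34,359,738,368 bytes

tensor_gib = tensor_bytes / 2.0**30                   # exactly 32 GiB (~34.36 GB)
per_replica_gib = 32480 / 1024.0                      # ~31.7 GiB per V100 (from nvidia-smi)

# MirroredStrategy needs every replica to hold a full copy, so the tensor must
# fit on a single GPU -- and it does not.
print(tensor_gib, per_replica_gib, tensor_gib > per_replica_gib)  # 32.0 31.7... True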

Is it possible that the actual memory size of the RTX 2080 Ti is smaller than that of the GTX 1080 Ti?

This is my first post, so if anything is inappropriate, please bear with me. :)
Recently, I was running my program on two machines with different GPU types (one with GTX 1080 Ti cards and the other with RTX 2080 Ti cards). For convenience, I will call the machine with the GTX 1080 Ti M1 and the other M2.
Even though I run exactly the same program on both machines, M2 cannot handle as large a batch size as M1 can. (By the way, M2 is indeed faster than M1 with the same parameter settings.)
Then I checked nvidia-smi, and the information is shown below:
Machine with RTX 1080Ti (M1)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 21% 19C P8 8W / 250W | 10MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:09:00.0 Off | N/A |
| 21% 21C P8 9W / 250W | 10MiB / 11178MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Machine with RTX 2080Ti (M2)
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:1D:00.0 Off | N/A |
| 27% 32C P2 48W / 250W | 914MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... Off | 00000000:20:00.0 Off | N/A |
| 31% 31C P8 1W / 250W | 11MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
As it shows, the memory on M2 (11019 MiB) is a bit smaller than on M1 (11178 MiB).
Since M2 is the higher-end product, shouldn't it be better than the previous generation in every aspect of performance?
So... could someone tell me why M2 has a lower memory capacity than M1?
And is there any way I can increase the usable memory of M2, such as updating the driver?

How to query NVIDIA GPU parameters for a certain PID?

I know that nvidia-smi generates an overview like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 0000:01:00.0 Off | N/A |
| N/A 43C P0 26W / N/A | 227MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1724 G /usr/bin/X 219MiB |
| 0 8074 G qtcreator 6MiB |
+-----------------------------------------------------------------------------+
However, I'd like to break those parameters down per process (e.g. GPU usage, used memory). I can't find a corresponding query option, but then again I can't imagine that such a basic function is not implemented. Hence:
Is there an easy way to display the GPU parameters for each process?
I don't think you can get any closer than nvidia-smi pmon:
# gpu pid type sm mem enc dec fb command
# Idx # C/G % % % % MB name
0 1750 G 1 0 0 0 179 X
0 3734 G 0 0 0 0 7 qtcreator
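If you need the same breakdown programmatically, here is a small sketch using nvidia-smi's CSV query interface (this only covers compute (C) processes and per-process memory; per-process utilization is only exposed via pmon or the NVML accounting counters):
import subprocess

# One CSV row per compute process currently using a GPU.
out = subprocess.check_output([
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]).decode()

for line in out.strip().splitlines():
    pid, name, mem_mib = [field.strip() for field in line.split(",")]
    print("pid={} mem={} MiB name={}".format(pid, mem_mib, name))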

How to access to GPUs on different nodes in a cluster with Slurm?

I have access to a cluster managed by Slurm, in which each node has 4 GPUs.
I have a code that needs 8 GPUs.
So the question is: how can I request 8 GPUs on a cluster where each node has only 4 GPUs?
This is the job that I tried to submit via sbatch:
#!/bin/bash
#SBATCH --gres=gpu:8
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
But then I get the following error:
sbatch: error: Batch job submission failed: Requested node configuration is not available
Then I changed the settings to this and submitted it again:
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --mem=16000M
#SBATCH --time=0-01:00
nvidia-smi
and the result shows only 4 GPUs, not 8.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 0000:03:00.0 Off | 0 |
| N/A 32C P0 31W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-PCIE... Off | 0000:04:00.0 Off | 0 |
| N/A 37C P0 29W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla P100-PCIE... Off | 0000:82:00.0 Off | 0 |
| N/A 35C P0 28W / 250W | 0MiB / 12193MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla P100-PCIE... Off | 0000:83:00.0 Off | 0 |
| N/A 33C P0 26W / 250W | 0MiB / 12193MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Thanks.
Slurm does not support what you need. It can only assign GPUs per node to your job, not GPUs across the cluster.
So, unlike CPUs and other consumable resources, GPUs are not consumable in that sense and are bound to the node that hosts them.
If you are interested in this topic, there is a research effort to turn GPUs into consumable resources; check this paper.
There you'll find how to do it using GPU virtualization technologies.
Job script: you are requesting 2 nodes with 4 GPUs each, so a total of 8 GPUs is assigned to you. You then run nvidia-smi, which is not aware of Slurm or MPI and runs only on the first node assigned to you, so it shows only 4 GPUs; that result is normal.
If you run a GPU-based engineering application like Ansys HFSS or CST, it can use all 8 GPUs.
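As a quick way to verify this, here is a sketch of a per-task check you could launch from the job script with something like srun --ntasks-per-node=1 python check_gpus.py (the script name and srun options are illustrative), so every allocated node reports its own GPUs:
import os
import socket
import subprocess

host = socket.gethostname()
visible = os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>")

# List the GPUs this node actually exposes.
gpus = subprocess.check_output(
    ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"]
).decode().strip()

print("host={} CUDA_VISIBLE_DEVICES={}".format(host, visible))
print(gpus)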

How to enable Keras with Theano to utilize multiple GPUs

Setup:
Using an Amazon Linux system with an NVIDIA GPU
I'm using Keras 1.0.1
Running Theano v0.8.2 backend
Using CUDA and CuDNN
THEANO_FLAGS="device=gpu,floatX=float32,lib.cnmem=1"
Everything works fine, but I run out of video memory on large models when I increase the batch size to speed up training. I figure moving to a 4-GPU system would in theory either increase the total memory available or let smaller batches train faster, but observing the nvidia stats, I can see that only one GPU is used by default:
+------------------------------------------------------+
| NVIDIA-SMI 361.42 Driver Version: 361.42 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 44C P0 45W / 125W | 3954MiB / 4095MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
| 1 GRID K520 Off | 0000:00:04.0 Off | N/A |
| N/A 28C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 GRID K520 Off | 0000:00:05.0 Off | N/A |
| N/A 32C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 GRID K520 Off | 0000:00:06.0 Off | N/A |
| N/A 29C P8 17W / 125W | 11MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9862 C python34 3941MiB |
I know that with raw Theano you can explicitly use multiple GPUs. Does Keras support the use of multiple GPUs? If so, does it abstract this away, or do you need to map the GPUs to devices as in Theano and explicitly marshal computations to specific GPUs?
Multi-GPU training is experimental ("The code is rather new and is still considered experimental at this point. It has been tested and seems to perform correctly in all cases observed, but make sure to double-check your results before publishing a paper or anything of the sort.") and hasn't been integrated into Keras yet. However, you can use multiple GPUs with Keras on the TensorFlow backend: https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html#multi-gpu-and-distributed-training.
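For illustration, here is a minimal sketch of the manual device mapping that tutorial describes (assumes Keras with the TensorFlow backend; the layer choices are placeholders):
import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model

# With the TensorFlow backend, ops created inside a tf.device() scope are
# pinned to that device, so you can place different models (or the towers of
# a data-parallel setup) on different GPUs by hand.
with tf.device('/gpu:0'):
    inp_a = Input(shape=(128,))
    model_a = Model(inp_a, Dense(10)(inp_a))   # lives on GPU 0

with tf.device('/gpu:1'):
    inp_b = Input(shape=(128,))
    model_b = Model(inp_b, Dense(10)(inp_b))   # lives on GPU 1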