TensorFlow seems to be using two GPUs, but one GPU seems not to be doing anything

I just built a system with two GTX 680 GPUs. To test my system, I'm running cifar10_multi_gpu_train.py, training CIFAR-10 using TensorFlow.
Tensorflow creates two Tensorflow devices based on the GPUs (last two lines):
$ python tutorials/image/cifar10/cifar10_multi_gpu_train.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
>> Downloading cifar-10-binary.tar.gz 100.0%
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 680
major: 3 minor: 0 memoryClockRate (GHz) 1.15
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.15GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x28eb270
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 680
major: 3 minor: 0 memoryClockRate (GHz) 1.15
pciBusID 0000:03:00.0
Total memory: 3.94GiB
Free memory: 3.90GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 680, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 680, pci bus id: 0000:03:00.0)
However, when monitoring the GPUs during training (using watch -n 1 nvidia-smi), I noticed that the second GPU isn't getting hot at all (71 degrees for GPU0 vs 30 degrees for GPU1):
Every 1,0s: nvidia-smi Mon Apr 24 01:30:40 2017
Mon Apr 24 01:30:40 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 680 Off | 0000:01:00.0 N/A | N/A |
| 43% 71C P0 N/A / N/A | 3947MiB / 4036MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 680 Off | 0000:03:00.0 N/A | N/A |
| 30% 30C P8 N/A / N/A | 3737MiB / 4036MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
Also note here that the memory of both GPUs is completely allocated.
Why is my second GPU not used?

OK, I should have taken more time reading the script:
tf.app.flags.DEFINE_integer('num_gpus', 1,
"""How many GPUs to use.""")
I just set this to two, and everything works just fine:
Every 1,0s: nvidia-smi Mon Apr 24 02:44:30 2017
Mon Apr 24 02:44:30 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 680 Off | 0000:01:00.0 N/A | N/A |
| 37% 63C P0 N/A / N/A | 3807MiB / 4036MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 680 Off | 0000:03:00.0 N/A | N/A |
| 36% 61C P0 N/A / N/A | 3807MiB / 4036MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
| 1 Not Supported |
+-----------------------------------------------------------------------------+
I would have expected the script to automatically use all available GPUs.
I'm getting around 2450 examples/sec, 0.051 sec/batch with cifar10_multi_gpu_train.py.

Related

Stuck with enabling GPUs for Tensorflow in WSL2 under Windows 10

I can't get Tensorflow 2 to use my GPUs under WSL2. I am aware of this question, but GPU support is now (supposedly) no longer experimental.
Windows is on the required 21H2 version, which should support the WSL2 GPU connection.
Windows 10 Pro, 21H2, build 19044.1706
The PC has two GPUs:
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-19c8549a-4b8d-5d70-456b-776ceece4b0f)
GPU 1: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-2a946756-0472-fb90-f1a4-b40cce1bba4f)
I had installed Ubuntu under WSL2 some time ago:
PS C:\Users\jem-m> wsl --status
Default Distribution: Ubuntu-20.04
Default Version: 2
...
Kernel version: 5.10.16
In the Windows PowerShell, I can run nvidia-smi.exe, which gives me
PS C:\Users\jem-m> nvidia-smi.exe
Mon May 16 18:13:27 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 512.77 Driver Version: 512.77 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:08:00.0 On | N/A |
| 23% 31C P8 10W / 250W | 753MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... WDDM | 00000000:41:00.0 Off | N/A |
| 23% 31C P8 12W / 250W | 753MiB / 11264MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
while the nvidia-smi in the WSL2 Ubuntu shell gives
(testenv) jem-mosig:~/ $ nvidia-smi [17:48:30]
Mon May 16 17:49:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02 Driver Version: 512.77 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:08:00.0 On | N/A |
| 23% 34C P8 10W / 250W | 784MiB / 11264MiB | 8% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:41:00.0 Off | N/A |
| 23% 34C P8 13W / 250W | 784MiB / 11264MiB | 12% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Note the same driver and CUDA version, but different NVIDIA-SMI version.
This seems to indicate that CUDA works under WSL2 as it is supposed to. But when I run
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))
# 2022-05-17 12:13:05.016328: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
# []
in Python inside WSL2, I get [], so no GPU is recognized by TensorFlow. This is Python 3.8.0 and TensorFlow 2.4.1, freshly installed in a new Miniconda environment inside Ubuntu WSL2. I don't know what is going wrong. Any suggestions?
Addendum
I don't get any error messages when importing Tensorflow. But some warnings are produced when working with it. E.g., when I run
import tensorflow as tf
print(tf.__version__)
model = tf.keras.Sequential([tf.keras.layers.Dense(3)])
model.compile(loss="mse")
print(model.predict([[0.]]))
I get
2.4.1
2022-05-17 10:38:28.792209: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2022-05-17 10:38:28.792411: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-17 10:38:28.794356: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2022-05-17 10:38:28.853557: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2022-05-17 10:38:28.860126: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3792975000 Hz
[[0. 0. 0.]]
These don't seem to be GPU related, though.
Dr. Snoopy got me onto the right track: even though the TF website says that
"The TensorFlow pip package includes GPU support for CUDA®-enabled cards",
I still needed to run conda install tensorflow-gpu, and it worked! Now
import tensorflow as tf
from tensorflow.python.client import device_lib
print("devices: ", [d.name for d in device_lib.list_local_devices()])
print("GPUs: ", tf.config.list_physical_devices('GPU'))
print("TF v.: ", tf.__version__)
gives lots of debug messages and
devices: ['/device:CPU:0', '/device:GPU:0', '/device:GPU:1']
GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
TF v.: 2.4.1
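To double-check that ops actually land on the GPUs, device placement logging can be turned on (a minimal sketch of my own, using the TF 2.x API, not part of the original answer):
import tensorflow as tf
tf.debugging.set_log_device_placement(True)  # print the device each op runs on
a = tf.random.normal([1000, 1000])
b = tf.random.normal([1000, 1000])
c = tf.matmul(a, b)  # should report /device:GPU:0 when a GPU is visible
print(c.device)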

Tensorflow is not detecting my GPUs. What shall I do (May 2021)?

TF Version : 2.4.1
CUDA Version : 11.1
tf.test.is_gpu_available() -- returns --> FALSE
tf.test.is_built_with_cuda() -- returns --> TRUE
I tried reverting TF to 2.4.0, but that didn't work.
I have also tried:
$ pip uninstall tensorflow
$ pip install tensorflow-gpu
But nothing seems to work; TF is just not detecting my GPUs.
EDIT 1:
Output of nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 30% 35C P8 23W / 300W | 23MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:43:00.0 Off | N/A |
| 30% 40C P8 27W / 300W | 5MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 GeForce RTX 3090 Off | 00000000:81:00.0 Off | N/A |
| 64% 63C P2 179W / 300W | 24043MiB / 24268MiB | 59% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2362 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 2564 G /usr/bin/gnome-shell 12MiB |
| 1 N/A N/A 2362 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 2362 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 14304 C python3 24035MiB |
+-----------------------------------------------------------------------------+
While running tf.test.is_gpu_available(), I get the following warning:
WARNING:tensorflow:From Spell_correction.py:35: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2021-05-07 21:46:21.855460: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-07 21:46:21.856690: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:43:00.0 name: GeForce RTX 3090 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 82 deviceMemorySize: 23.70GiB deviceMemoryBandwidth: 871.81GiB/s
2021-05-07 21:46:21.856716: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-05-07 21:46:21.856735: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-05-07 21:46:21.856747: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-05-07 21:46:21.856759: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-07 21:46:21.856771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-07 21:46:21.856829: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.1/lib64
2021-05-07 21:46:21.856846: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-05-07 21:46:21.856856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-05-07 21:46:21.856863: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-05-07 21:46:21.942589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-07 21:46:21.942626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267] 0
2021-05-07 21:46:21.942633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0: N
Another observation:
PyTorch is detecting the GPU, while TF is not.
torch.cuda.is_available() --> TRUE
tf.test.is_gpu_available() --> FALSE
If you use Ubuntu 20.04, I suggest following the steps from here. I recently had the same problem.
You have
NVIDIA-SMI 460.73.01 Driver Version: 460.73.01 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 30% 35C P8 23W / 300W | 23MiB / 24268MiB | 0% Default |
| | | N/A
Try to get the latest NVIDIA driver (465) and CUDA 11.3. In my case, nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
What I did:
(1) I uninstalled NVIDIA and CUDA completely (see here, and be careful):
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get install ubuntu-desktop
sudo rm /etc/X11/xorg.conf
echo 'nouveau' | sudo tee -a /etc/modules
(2) I downloaded the NVIDIA driver .run file and simply ran sudo bash NVIDIA*.run.
(3) I downloaded cuDNN and performed the following, as described here:
tar -xzvf cudnn-11.3-*.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
Also check the .bashrc file, as described here:
cd ~
gedit .bashrc or nano .bashrc
# add this at the end:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export PATH=/usr/local/cuda-11.3/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.3/targets/x86_64-linux${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Then, pip install tensorflow-gpu==2.4.1
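After that, a quick check (my own snippet, not from the original answer) confirms whether the GPU build can see the cards:
import tensorflow as tf
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPUs:", tf.config.list_physical_devices('GPU'))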

TensorFlow not loading cuDNN

I have finally managed to get CUDA to work on a Microsoft Azure server with a Tesla K80. Now I need to get cuDNN to work, but TensorFlow won't load it.
This is the message from TensorFlow:
>>> import tensorflow as tf
>>> tf.Session()
2017-04-27 13:05:51.476251: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476306: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476366: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476394: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:58.164781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID ad52:00:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-04-27 13:05:58.164822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-04-27 13:05:58.164835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-04-27 13:05:58.164853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: ad52:00:00.0)
<tensorflow.python.client.session.Session object at 0x7fc3c76c0050>
So I see that no cuDNN libraries are being loaded.
I have the proper files in /usr/local/cuda-8.0/include/ and /usr/local/cuda-8.0/lib64/:
$ ls /usr/local/cuda-8.0/include/ | grep "cudnn"
cudnn.h
$ ls /usr/local/cuda-8.0/lib64/ | grep "cudnn"
libcudnn.so
libcudnn.so.5
libcudnn.so.5.1.10
libcudnn_static.a
My ~/.bashrc file has the proper paths
export CUDA_HOME=/usr/local/cuda8.0
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
EDIT
Changed the .bashrc to:
export CUDA_HOME=/usr/local/cuda-8.0
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
export PATH=${CUDA_HOME}/include:${PATH}
Still no luck.
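As a side check (my own addition, not from the original post), it is worth confirming that the dynamic loader can actually resolve cuDNN with the current LD_LIBRARY_PATH:
import ctypes
# dlopen the exact soname TensorFlow 1.1 looks for; raises OSError if the
# loader cannot find it via LD_LIBRARY_PATH or the ldconfig cache
ctypes.CDLL("libcudnn.so.5")
print("libcudnn.so.5 resolved successfully")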
Output from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | AD52:00:00.0 Off | 0 |
| N/A 71C P0 61W / 149W | 0MiB / 11439MiB | 24% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I'm using tensorflow version 1.1.0, Ubuntu 16.04 and CUDA 8.0.
EDIT
So I just tried deleting the cuDNN files and loading TensorFlow, which gave me an error, something along the lines of "couldn't find libcudnn.so.5". So I think it does load it, but I was under the impression that TensorFlow would print something like "libcudnn.so loaded successfully" if it used cuDNN.
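One way to confirm cuDNN is usable end to end (my own sketch with the TF 1.x API, not from the original post) is to run a small convolution on the GPU; if libcudnn could not be loaded, this fails with a loader or cuDNN error:
import tensorflow as tf
with tf.device('/gpu:0'):
    x = tf.random_normal([1, 8, 8, 1])   # NHWC input
    w = tf.random_normal([3, 3, 1, 1])   # 3x3 filter
    y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')  # cuDNN-backed op
with tf.Session() as sess:
    print(sess.run(y).shape)  # (1, 8, 8, 1) if the convolution ran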

Tensorflow not using GPU (according to TensorBoard)

Edit: GTX 1070, Ubuntu 16.04, git hash:
3b75eb34ea2c4982fb80843be089f02d430faade
I am retraining the Inception model on my own data. Everything is fine until the final command:
bazel-bin/inception/flowers_train \
--config=cuda \
--train_dir="${TRAIN_DIR}" \
--data_dir="${OUTPUT_DIRECTORY}" \
--pretrained_model_checkpoint_path="${MODEL_PATH}" \
--fine_tune=True \
--initial_learning_rate=0.001 \
--input_queue_memory_factor=1
According to the logs, TensorFlow seems to be using the GPU:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:03:00.0
Total memory: 7.92GiB
Free memory: 7.77GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:03:00.0)
But when I check the training in TensorBoard, the net is mainly using the CPU (blue: /device:CPU:0, green: /device:GPU:0):
(TensorBoard graph screenshot)
I have tried these two TensorFlow setups:
1. Installed from source with nvidia-367 drivers, CUDA 8.0, cuDNN v5, source from master (16/10/06 - r11?), compiled for GPU use:
bazel build -c opt --config=cuda //tensorflow/cc:tutorials_example_trainer
bazel-bin/tensorflow/cc/tutorials_example_trainer --use_gpu
bazel build -c opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
2. Docker GPU image of TensorFlow on a PC with a GTX 1070 8GB:
nvidia-docker run -it -p 8888:8888 -p 6006:6006 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
Any help?
According to this issue, the Inception 'tower' is where the bulk of the work is being performed, so it seems mostly fine.
Except there is still something weird.
Running watch nvidia-smi gives:
Mon Oct 10 10:31:04 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 Off | 0000:03:00.0 On | N/A |
| 29% 57C P2 41W / 230W | 7806MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1082 G /usr/lib/xorg/Xorg 69MiB |
| 0 3082 C /usr/bin/python 7729MiB |
+-----------------------------------------------------------------------------+
While top gives:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3082 root 20 0 26.739g 3.469g 1.657g S 101.3 59.7 7254:50 python
GPU seems to be ignored...
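One quick way to see where ops are actually placed (a minimal TF 1.x sketch, my own addition rather than something from the post) is to enable device placement logging in the session config:
import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0], name='a')
b = tf.constant([4.0, 5.0, 6.0], name='b')
c = a * b
# log_device_placement prints the device chosen for every op
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))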

CUDA_ERROR_OUT_OF_MEMORY in tensorflow

When I started training a neural network, it hit CUDA_ERROR_OUT_OF_MEMORY, but the training could go on without error. Because I wanted the GPU memory to be used only as it is actually needed, I set gpu_options.allow_growth = True. The logs are as follows:
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device:0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 4.00G (4294967296 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Iter 20, Minibatch Loss= 40491.636719
...
And after running the nvidia-smi command, it shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27 Driver Version: 367.27
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 0000:01:00.0 Off | N/A |
| 40% 61C P2 46W / 180W | 8107MiB / 8111MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 0% 40C P0 40W / 180W | 0MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 22932 C python 8105MiB |
+-----------------------------------------------------------------------------+
After I commented out gpu_options.allow_growth = True, I trained the net again and everything was normal. There was no CUDA_ERROR_OUT_OF_MEMORY problem. Finally, running the nvidia-smi command shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.27 Driver Version: 367.27
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M.
|===============================+======================+======================|
| 0 GeForce GTX 1080 Off | 0000:01:00.0 Off | N/A |
| 40% 61C P2 46W / 180W | 7793MiB / 8111MiB | 99% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 0% 40C P0 40W / 180W | 0MiB / 8113MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 22932 C python 7791MiB |
+-----------------------------------------------------------------------------+
I have two questions about this: why did CUDA_ERROR_OUT_OF_MEMORY appear while the procedure still went on normally, and why did the memory usage become smaller after commenting out allow_growth = True?
In case it's still relevant for someone, I encountered this issue when trying to run Keras/Tensorflow for the second time, after a first run was aborted. It seems the GPU memory is still allocated, and therefore cannot be allocated again. It was solved by manually ending all python processes that use the GPU, or alternatively, closing the existing terminal and running again in a new terminal window.
By default, TensorFlow tries to allocate a fraction (per_process_gpu_memory_fraction) of the GPU memory for its process, to avoid costly memory management (see the GPUOptions comments).
This can fail and raise the CUDA_ERROR_OUT_OF_MEMORY warnings.
I do not know what the fallback is in this case (either using CPU ops or allow_growth=True).
This can happen if another process is using the GPU at the moment (for instance, if you launch two processes running TensorFlow).
The default behavior takes ~95% of the memory (see this answer).
When you use allow_growth = True, the GPU memory is not preallocated and can grow as you need it. This leads to smaller memory usage (since the default is to use the whole memory), but it can decrease performance if not used properly, because it requires more complex memory handling (which is not the most efficient part of CPU/GPU interactions).
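For reference, this is how the setting discussed above is typically applied with the TF 1.x Session API (a minimal sketch of the same gpu_options used in the question):
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # start small and grow allocations on demand
# or cap the fraction of VRAM TensorFlow may claim up front:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5
with tf.Session(config=config) as sess:
    pass  # build and run the graph here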
I faced this issue when trying to train models back to back. I figured the GPU memory wasn't available due to the previous training run, so the easiest fix was to manually flush the GPU memory before each new training run.
Use nvidia-smi to check the GPU memory usage, then try resetting the GPU:
nvidia-smi
nvidia-smi --gpu-reset
The above command may not work if other processes are actively using the GPU.
Alternatively, you can use the following command to list all processes that are using the GPU:
sudo fuser -v /dev/nvidia*
And the output should look like this:
USER PID ACCESS COMMAND
/dev/nvidia0: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidiactl: root 2216 F...m Xorg
sid 6114 F...m krunner
sid 6116 F...m plasmashell
sid 7227 F...m akonadi_archive
sid 7239 F...m akonadi_mailfil
sid 7249 F...m akonadi_sendlat
sid 18120 F...m chrome
sid 18163 F...m chrome
sid 24154 F...m code
/dev/nvidia-modeset: root 2216 F.... Xorg
sid 6114 F.... krunner
sid 6116 F.... plasmashell
sid 7227 F.... akonadi_archive
sid 7239 F.... akonadi_mailfil
sid 7249 F.... akonadi_sendlat
sid 18120 F.... chrome
sid 18163 F.... chrome
sid 24154 F.... code
From here, I got the PID for the process which was holding the GPU memory, which in my case is 24154.
Use the following command to kill the process by its PID:
sudo kill -9 MY_PID
Replace MY_PID with the relevant PID.
Tensorflow 2.0 alpha
The problem is that TensorFlow is greedy and allocates all available VRAM, which causes issues for some people.
For TensorFlow 2.0 alpha / nightly, use this:
import tensorflow as tf
tf.config.gpu.set_per_process_memory_fraction(0.4)
Source: https://www.tensorflow.org/alpha/guide/using_gpu
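In later stable TF 2.x releases the alpha API above was replaced; the equivalent on-demand growth setting (my own note, not part of the original answer) is:
import tensorflow as tf
for gpu in tf.config.list_physical_devices('GPU'):
    # must be called before the GPUs are initialized
    tf.config.experimental.set_memory_growth(gpu, True)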
I experienced a memory error on Ubuntu 18.10.
When I changed my monitor resolution from 4K to Full HD (1920x1080), the available memory became 438 MB and the neural network training started.
I was really surprised by this behavior.
By the way, I have an Nvidia 1080 with 8 GB of memory; I still don't know why only ~400 MB was available.
Environment:
1. CUDA 10.0
2. cuDNN 10.0
3. tensorflow 1.14.0
4. pip install opencv-contrib-python
5. git clone https://github.com/thtrieu/darkflow
6. Allowing GPU memory growth (Reference)
fuser -k /dev/nvidia[0]
Worked for me.
Thanks to https://forums.developer.nvidia.com/t/11-gb-of-gpu-ram-used-and-no-process-listed-by-nvidia-smi/44459/16
Check the correctness of the input dataset.
If you have a null (empty) input list, this error can occur too.
This is the situation I faced in Colab with tf.keras.
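A quick sanity check along those lines (a hypothetical helper, not from the original answer; adapt the names to whatever arrays you pass to model.fit):
import numpy as np
def check_dataset(x, y):
    # fail fast if the arrays fed to model.fit are empty or mismatched
    x, y = np.asarray(x), np.asarray(y)
    if len(x) == 0 or len(x) != len(y):
        raise ValueError(f"bad training data: {len(x)} inputs vs {len(y)} labels")
    return x, y
# example: check_dataset([], []) raises before any GPU memory is touched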