Why does ML training fail on one GPU but run on another? - tensorflow

My machine has two GPUs: a GTX 1070 and a GTX 3080.
I have a conda environment with TensorFlow 1.15 and all its relevant dependencies (CUDA 10, cuDNN 7.6, etc.).
When I call my TensorFlow-based training script to train, I get:
#Training on GTX 1070
$ CUDA_VISIBLE_DEVICES=1, python train_script.py
#Output
2021-06-24 21:36:24.253225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
total_loss: 0.010825163 #Trains as usual
However, when I try to train on my GTX 3080:
#Training on GTX 3080
$ CUDA_VISIBLE_DEVICES=0, python train_script.py
#Output
2021-06-24 21:43:25.828707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-24 21:44:15.331037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
...
File "/home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Sum_7/_421]]
(1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Graphics card info:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27 Driver Version: 465.27 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:04:00.0 Off | N/A |
| 0% 42C P8 5W / 151W | 11MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:2B:00.0 On | N/A |
| 0% 49C P8 36W / 370W | 624MiB / 10001MiB | 27% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Can anyone explain why training fails on the GTX 3080?
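One thing worth checking: the index passed to CUDA_VISIBLE_DEVICES does not have to match the nvidia-smi numbering, since nvidia-smi orders by PCI bus while the CUDA runtime defaults to fastest-first. A small diagnostic sketch (check_devices.py is a hypothetical name) to confirm which card each index actually selects:
# check_devices.py (hypothetical name) -- run as: CUDA_VISIBLE_DEVICES=0 python check_devices.py
from tensorflow.python.client import device_lib

# physical_device_desc names the actual card, so you can confirm which
# nvidia-smi GPU each CUDA index really selects.
for device in device_lib.list_local_devices():
    print(device.name, "->", device.physical_device_desc)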

Related

Allocating Large Tensor on multiple GPUs using Distributed Learning in Keras

I am using TensorFlow distributed learning with the following commands:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0,1,2,3"
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
The system being used has four 32 GB GPUs. The following is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000004:04:00.0 Off | 0 |
| N/A 37C P0 65W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000004:05:00.0 Off | 0 |
| N/A 38C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000035:03:00.0 Off | 0 |
| N/A 33C P0 40W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000035:04:00.0 Off | 0 |
| N/A 39C P0 41W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
But after running the script to create the model, I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape [131072,65536] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:RandomUniform]
A tensor of shape [131072, 65536] of type float would allocate 131072 * 65536 * 4 bytes, i.e. about 34.36 GB. There are four 32 GB GPUs, so why is it not allocated?
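For reference, the arithmetic behind that estimate (a quick sketch):
rows, cols, bytes_per_float32 = 131072, 65536, 4
size_bytes = rows * cols * bytes_per_float32
print(size_bytes)               # 34359738368 bytes
print(size_bytes / 1e9, "GB")   # ~34.36 GB -- more than a single 32 GB card can hold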
MirroredStrategy creates a copy of all variables within its scope on every GPU, so each replica has to hold the full ~34.36 GB tensor, which is too large for a single card. You might want something like tf.distribute.experimental.CentralStorageStrategy instead. In terms of GPU memory, MirroredStrategy does not give you vram * num_of_gpus; in practice you are limited to the smallest single GPU, so in your case Keras is working with 32 GB per replica, not 32 * 4 = 128 GB.
strategy = tf.distribute.experimental.CentralStorageStrategy()
dataset = ...  # some dataset
dataset = strategy.experimental_distribute_dataset(dataset)
with strategy.scope():
    model = Basic_Model()
    model.compile(loss='mean_squared_error', optimizer=rms, metrics=['mean_squared_error'])
Example:
Tensor A is [0, 1, 2, 3] and you have four GPUs. MirroredStrategy will load:
GPU0: [0, 1, 2, 3]
GPU1: [0, 1, 2, 3]
GPU2: [0, 1, 2, 3]
GPU3: [0, 1, 2, 3]
NOT
GPU0: [0]
GPU1: [1]
GPU2: [2]
GPU3: [3]
As you can see, MirroredStrategy requires each of your available devices to be able to hold all of the data, so you are limited by your smallest device when using this strategy.
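A tiny way to see that replication directly (a sketch, assuming TF 2.x with more than one GPU visible):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses every visible GPU
with strategy.scope():
    v = tf.Variable([0., 1., 2., 3.])

# MirroredStrategy keeps a full copy of the variable on each replica,
# which is why every GPU must be able to hold the whole model.
for replica_value in strategy.experimental_local_results(v):
    print(replica_value.device, replica_value.numpy())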

CUDA_ERROR_UNKNOWN: unknown error in Colab instance

I'm connected to Google Colab through SSH (using this method). I get the following error when trying to use the GPU.
python lstm_example.py
Num GPUs Available: 1
(25000,)
(25000,)
2022-03-21 12:43:53.301917: W tensorflow/stream_executor/cuda/cuda_driver.cc:374] A non-primary context 0x559ed434d210 for device 0 exists before initializing the StreamExecutor. The primary context is now 0. We haven't verified StreamExecutor works with that.
2022-03-21 12:43:53.302331: F tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
Aborted (core dumped)
GPU info
nvidia-smi
Mon Mar 21 13:00:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.54 Driver Version: 460.32.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 50C P0 59W / 149W | Function Not Found | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
I've added the following lines:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
The same code works when run in a notebook cell. I also notice that Memory-Usage is reported when running nvidia-smi from a notebook, and the CUDA version shown is different (11.2).
Tue Mar 22 10:52:20 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 43C P8 31W / 149W | 3MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
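A minimal repro that separates CUDA context initialization from the rest of lstm_example.py may help narrow this down (a sketch, assuming the TF 2.x runtime Colab ships):
import tensorflow as tf

# If this much already aborts over SSH, the failure is in GPU/driver initialization,
# not in anything specific to the LSTM example.
print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
with tf.device('/GPU:0'):
    print(tf.constant([1.0, 2.0]) + tf.constant([3.0, 4.0]))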

How to debug tensorflow in nvidia-docker segfaulting?

I am on Ubuntu 18.04, running in an interactive environment like this:
docker run --runtime=nvidia -it --rm -v $PWD:/root/stuff -w /root tensorflow/tensorflow:latest-gpu-py3 bash
Curiously, I don't get segfaults when I run non-interactively, i.e. docker run ... python stuff/mnist.py
NVIDIA details:
$ nvidia-smi
Thu Nov 29 22:09:25 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18 Driver Version: 415.18 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 30% 32C P8 11W / 175W | 358MiB / 7949MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1471 G /usr/lib/xorg/Xorg 18MiB |
| 0 1523 G /usr/bin/gnome-shell 50MiB |
| 0 1919 G /usr/lib/xorg/Xorg 129MiB |
| 0 2063 G /usr/bin/gnome-shell 114MiB |
| 0 3762 G ...quest-channel-token=2440404091774701506 43MiB |
+-----------------------------------------------------------------------------+
root@4a46cc9acb73:~# python -X faulthandler -vv stuff/mnist.py
Train on 60000 samples, validate on 10000 samples
Epoch 1/15
2018-11-29 22:06:26.371579: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-29 22:06:26.500120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-29 22:06:26.500670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.29GiB
2018-11-29 22:06:26.500686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-29 22:06:26.723360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-29 22:06:26.723400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-29 22:06:26.723407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-29 22:06:26.723859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7015 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Fatal Python error: Segmentation fault
Thread 0x00007f82a1277700 (most recent call first):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1439 in __call__
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/backend.py", line 2986 in __call__
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215 in fit_loop
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py", line 1639 in fit
File "stuff/mnist.py", line 36 in <module>
Segmentation fault (core dumped)
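One small addition (not from the original post) is to enable faulthandler inside the script itself, so every thread's Python stack is dumped on the fault even when the interpreter is not started with -X faulthandler; the native C/C++ frames still need a debugger such as gdb inside the container.
# At the top of stuff/mnist.py
import faulthandler
faulthandler.enable(all_threads=True)  # dump every thread's Python stack on SIGSEGV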

GPU error when using TensorFlow to train an image model

When I run a TensorFlow image-training job in the container tensorflow/tensorflow:latest-gpu, it doesn't work.
Error message:
Cannot assign a device for operation InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057) = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/device:GPU:0"](fifo_queue_Dequeue, InceptionV3/Conv2d_1a_3x3/weights/read)]]
GPU info:
nvidia-smi
Mon Nov 26 07:48:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 630 Off | 00000000:01:00.0 N/A | N/A |
| 25% 47C P0 N/A / N/A | 0MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
It seems that your TensorFlow is not detecting any GPU as available, yet the operations are mapped to GPU:0. First try this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
That will print the available devices. Is /device:GPU:0 among them?
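If /device:GPU:0 is missing from that list, a related sanity check (a sketch using the TF 1.x session API this question targets) is to allow soft placement and log where each op actually lands:
import tensorflow as tf

config = tf.ConfigProto(
    allow_soft_placement=True,   # fall back to CPU when /device:GPU:0 is unavailable
    log_device_placement=True,   # print the device each op is assigned to
)
with tf.Session(config=config) as sess:
    ...  # build and run the training graph here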

GPU load in tensorflow

I just built TensorFlow v1.0 and I am trying to run the MNIST test just to see if it's working. It seems like it is, but I am observing weird behaviour.
My system has two Tesla P100s, and nvidia-smi shows the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.107 Driver Version: 361.107 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 0002:01:00.0 Off | 0 |
| N/A 34C P0 114W / 300W | 15063MiB / 16280MiB | 51% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 0006:01:00.0 Off | 0 |
| N/A 27C P0 35W / 300W | 14941MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 67288 C python3 15061MiB |
| 1 67288 C python3 14939MiB |
+-----------------------------------------------------------------------------+
As shown, python3 ate all the memory on both GPUs, but the computational load is placed only on the first.
By exporting CUDA_VISIBLE_DEVICES I can limit which GPU is used, but it doesn't affect the computation time, so there is no gain from adding the second GPU.
Single GPU:
real 2m23.496s
user 4m26.597s
sys 0m12.587s
Two GPUs:
real 2m18.165s
user 4m18.625s
sys 0m12.958s
So the question is: how do I load both GPUs?
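By default TensorFlow reserves memory on every visible GPU but only runs ops where the graph places them, so using the second card needs explicit placement (or a tower/multi-GPU setup). A minimal placement sketch with the TF 1.x API, not taken from the MNIST test itself:
import tensorflow as tf

a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])

# Pin one matmul to each card; without explicit placement everything lands on /gpu:0.
with tf.device('/gpu:0'):
    c0 = tf.matmul(a, b)
with tf.device('/gpu:1'):
    c1 = tf.matmul(a, b)

# allow_growth keeps TensorFlow from grabbing all memory on both cards up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run([c0, c1])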