How to debug TensorFlow segfaulting in nvidia-docker?

I'm on Ubuntu 18.04, running in an interactive environment like this:
docker run --runtime=nvidia -it --rm -v $PWD:/root/stuff -w /root tensorflow/tensorflow:latest-gpu-py3 bash
Curiously, I don't get segfaults when I run non-interactively i.e. docker run ... python stuff/mnist.py
nvidia details:
$ nvidia-smi
Thu Nov 29 22:09:25 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.18 Driver Version: 415.18 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2070 Off | 00000000:01:00.0 On | N/A |
| 30% 32C P8 11W / 175W | 358MiB / 7949MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1471 G /usr/lib/xorg/Xorg 18MiB |
| 0 1523 G /usr/bin/gnome-shell 50MiB |
| 0 1919 G /usr/lib/xorg/Xorg 129MiB |
| 0 2063 G /usr/bin/gnome-shell 114MiB |
| 0 3762 G ...quest-channel-token=2440404091774701506 43MiB |
+-----------------------------------------------------------------------------+
root@4a46cc9acb73:~# python -X faulthandler -vv stuff/mnist.py
Train on 60000 samples, validate on 10000 samples
Epoch 1/15
2018-11-29 22:06:26.371579: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-29 22:06:26.500120: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-29 22:06:26.500670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 7.76GiB freeMemory: 7.29GiB
2018-11-29 22:06:26.500686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-29 22:06:26.723360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-29 22:06:26.723400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-11-29 22:06:26.723407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-11-29 22:06:26.723859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7015 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Fatal Python error: Segmentation fault
Thread 0x00007f82a1277700 (most recent call first):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1439 in __call__
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/backend.py", line 2986 in __call__
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training_arrays.py", line 215 in fit_loop
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/engine/training.py", line 1639 in fit
File "stuff/mnist.py", line 36 in <module>
Segmentation fault (core dumped)
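Since the crash happens inside native TF code, the Python traceback above only shows the call site. As an alternative to the -X faulthandler flag, the handler can also be enabled from inside the script so the dump goes somewhere persistent; a minimal sketch (the log path is illustrative, not from the post):

import faulthandler

# Equivalent to `python -X faulthandler`, but the dump is written to a file
# that survives the interactive container session. For the native C++ stack
# you would still need a core dump plus gdb, since the fault occurs inside
# TensorFlow's compiled code.
faulthandler.enable(file=open("/root/stuff/fault.log", "w"), all_threads=True)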

Related

Why does ML training fail on one GPU but run on another?

My machine has 2 GPUs, a GTX 1070 and a GTX 3080.
I have a conda environment with TensorFlow 1.15 and all its relevant dependencies (CUDA 10, cuDNN 7.6, etc.).
When I run my TensorFlow-based training script, I get:
#Training on GTX 1070
$ CUDA_VISIBLE_DEVICES=1, python train_script.py
#Output
2021-06-24 21:36:24.253225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
total_loss: 0.010825163 #Trains as usual
However, when I try to train on my GTX 3080:
#Training on GTX 3080
$ CUDA_VISIBLE_DEVICES=0, python train_script.py
#Output
2021-06-24 21:43:25.828707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-24 21:44:15.331037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
...
File "/home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Sum_7/_421]]
(1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Graphics cards info:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27 Driver Version: 465.27 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:04:00.0 Off | N/A |
| 0% 42C P8 5W / 151W | 11MiB / 8119MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... On | 00000000:2B:00.0 On | N/A |
| 0% 49C P8 36W / 370W | 624MiB / 10001MiB | 27% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
Can anyone explain why training fails on the GTX 3080?
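One thing worth checking (an observation added here, not from the original post): the stock TensorFlow 1.15 wheels are built against CUDA 10.0, which has no support for Ampere cards (the 30xx series reports compute capability 8.6), and an unsupported compute capability often surfaces as exactly this kind of failed cuBLAS GEMM launch. A quick way to see what TensorFlow itself reports for each card, assuming the same conda environment as above:

from tensorflow.python.client import device_lib

# physical_device_desc is only populated for GPU devices and includes the
# reported compute capability, e.g. "... compute capability: 8.6".
for dev in device_lib.list_local_devices():
    print(dev.name, dev.device_type, dev.physical_device_desc)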

"Adding visible gpu devices: 0.." has been constantly outputted in nohup.out

I'm running a PSO program and I use TensorFlow to calculate the mean squared error as the fitness, but every minute there's a new block in nohup.out beginning with "Adding visible gpu devices: 0". I have run my code on both the GPU and the CPU, and 5 days later they are progressing at almost the same rate. Why does the GPU run so slowly? And how can I stop the constant output?
I use device No. 2, and the GPU seems to be working.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.130 Driver Version: 384.130 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 TITAN Xp Off | 00000000:02:00.0 Off | N/A |
| 23% 39C P0 61W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN Xp Off | 00000000:04:00.0 Off | N/A |
| 23% 40C P0 60W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 TITAN Xp Off | 00000000:83:00.0 Off | N/A |
| 23% 39C P8 17W / 250W | 7261MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 TITAN Xp Off | 00000000:84:00.0 Off | N/A |
| 23% 37C P0 58W / 250W | 0MiB / 12189MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 2 26737 C python 7251MiB
nohup.out prints this information every minute:
2019-11-08 11:49:18.032239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:49:18.032316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:49:18.032326: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:49:18.032332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:49:18.032511: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
2019-11-08 11:50:42.409142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2019-11-08 11:50:42.409214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-08 11:50:42.409223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2019-11-08 11:50:42.409229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
2019-11-08 11:50:42.409407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6991 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:83:00.0, compute capability: 6.1)
The objective function is like this:
def function(self, M, w_h, w_o):
    def model(X, w_h, w_o):
        h = tf.matmul(X, w_h)
        return tf.matmul(h, w_o)

    X = tf.placeholder(tf.float64, [None, 4])
    Y = tf.placeholder(tf.float64, [None, 2])
    w_h = tf.Variable(w_h)
    w_o = tf.Variable(w_o)
    py_x = model(X, w_h, w_o)
    loss = tf.reduce_mean((py_x-Y)**2)
    with tf.Session() as sess:
        tf.initializers.global_variables().run()
        sum = 0
        length = len(trY)
        for start, end in zip(range(0, length, batchsize), range(batchsize, length + 1, batchsize)):
            sum += sess.run(loss, feed_dict={X: trX[start:end], Y: trY[start:end]})
        if not length%batchsize:
            E = sum/(length/batchsize)
        else:
            E = sum/(1+floor(length/batchsize))
        return E
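The repeated "Adding visible gpu devices" lines appear because this function builds a brand-new graph and opens a brand-new tf.Session on every single fitness evaluation, and that per-call session setup is also a likely reason the GPU run is barely faster than the CPU run. Below is a minimal sketch, not the original code, of one way to restructure it: build the graph and the session once and feed each particle's weights through placeholders. It assumes trX, trY and batchsize are plain NumPy data, and the Fitness class name is made up for the example.

import tensorflow as tf

class Fitness:
    def __init__(self, trX, trY, batchsize):
        self.trX, self.trY, self.batchsize = trX, trY, batchsize
        # Graph is built exactly once.
        self.X = tf.placeholder(tf.float64, [None, 4])
        self.Y = tf.placeholder(tf.float64, [None, 2])
        self.w_h = tf.placeholder(tf.float64, [4, None])
        self.w_o = tf.placeholder(tf.float64, [None, 2])
        py_x = tf.matmul(tf.matmul(self.X, self.w_h), self.w_o)
        self.loss = tf.reduce_mean((py_x - self.Y) ** 2)
        # One session for the whole PSO run, so device setup happens once.
        self.sess = tf.Session()

    def __call__(self, w_h, w_o):
        total, batches = 0.0, 0
        for start in range(0, len(self.trX), self.batchsize):
            # Unlike the original loop, the final partial batch is included here.
            end = start + self.batchsize
            total += self.sess.run(self.loss, feed_dict={
                self.X: self.trX[start:end], self.Y: self.trY[start:end],
                self.w_h: w_h, self.w_o: w_o})
            batches += 1
        return total / batches

Setting the environment variable TF_CPP_MIN_LOG_LEVEL=1 also hides those INFO lines, but that only silences the symptom; reusing one graph and session is what removes the per-call overhead.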

GPU error when using TensorFlow to train an image model

When I run a TensorFlow image-training job in the container tensorflow/tensorflow:latest-gpu, it doesn't work.
Error message:
Cannot assign a device for operation InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D: Operation was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
[[node InceptionV3/InceptionV3/Conv2d_1a_3x3/Conv2D (defined at /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/layers.py:1057) = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 2, 2, 1], use_cudnn_on_gpu=true, _device="/device:GPU:0"](fifo_queue_Dequeue, InceptionV3/Conv2d_1a_3x3/weights/read)]]
GPU info:
nvidia-smi
Mon Nov 26 07:48:59 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GT 630 Off | 00000000:01:00.0 N/A | N/A |
| 25% 47C P0 N/A / N/A | 0MiB / 1998MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
It seems that your TensorFlow is not detecting any GPU as available, but still maps the operations to GPU:0. First try this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
And you'll get the available devices. Is /device:GPU:0 among them?
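If the graph really does hard-code /device:GPU:0, as the error message suggests, enabling soft placement lets the session fall back to whatever devices exist instead of aborting. A sketch of the TF1 session config; note this only works around the placement error, it does not make the GT 630 usable as a TensorFlow GPU:

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True,   # fall back when a pinned device is unavailable
                        log_device_placement=True)   # log where each op actually runs
with tf.Session(config=config) as sess:
    pass  # build and run the training graph here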

CUDA only recognizes 1 of 5 GPUs

I would like to run TensorFlow on a Windows 10 server with 5 NVIDIA Quadro P6000 GPUs. After installing CUDA 8.0 and running deviceQuery.exe, it only shows one of the GPUs, and therefore TensorFlow only uses one GPU as well.
(tensorflow-gpu) C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\extras\demo_suite>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Quadro P6000"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 24576 MBytes (25769803776 bytes)
(30) Multiprocessors, (128) CUDA Cores/MP: 3840 CUDA Cores
GPU Max Clock rate: 1645 MHz (1.64 GHz)
Memory Clock rate: 4513 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Quadro P6000
Result = PASS
However, nvidia-smi.exe and GPU-Z do recognize all 5 of them.
(tensorflow-gpu) C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi.exe
Wed Sep 27 11:07:44 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51 Driver Version: 376.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P6000 WDDM | 0000:04:00.0 Off | Off |
| 26% 45C P5 19W / 250W | 353MiB / 24576MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P6000 WDDM | 0000:05:00.0 Off | Off |
| 26% 24C P8 7W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro P6000 WDDM | 0000:08:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro P6000 WDDM | 0000:09:00.0 Off | Off |
| 26% 26C P8 8W / 250W | 69MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Quadro P6000 WDDM | 0000:89:00.0 Off | Off |
| 26% 22C P8 8W / 250W | 262MiB / 24576MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2640 C+G Insufficient Permissions N/A |
| 0 3908 C+G ...x86)\Google\Chrome\Application\chrome.exe N/A |
| 4 2640 C+G Insufficient Permissions N/A |
| 4 3552 C+G ...\ImmersiveControlPanel\SystemSettings.exe N/A |
| 4 5260 C+G Insufficient Permissions N/A |
| 4 6552 C+G Insufficient Permissions N/A |
| 4 7208 C+G ...ost_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 4 7444 C+G ...indows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
Does anyone have an idea what I could try to make all of them work with CUDA and TensorFlow?
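One quick check before digging deeper (my suggestion, not something from the post): make sure nothing in the environment is masking devices before CUDA enumerates them, since both deviceQuery and TensorFlow honour CUDA_VISIBLE_DEVICES. A sketch:

import os

# Both are standard CUDA environment variables; None means "unset".
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("CUDA_DEVICE_ORDER:", os.environ.get("CUDA_DEVICE_ORDER"))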

TF KMeansClustering doesn't run on GPU

Running on Ubuntu 16.04 with the latest (1.1.0) TensorFlow (installed via pip3 install tensorflow-gpu), CUDA 8 + cuDNN 5.
The code looks more or less like this:
import tensorflow as tf
from tensorflow.contrib.learn import KMeansClustering

trainencflt = ...  # pandas frame with ~30k rows and ~300 columns

def train_input_fn():
    return (tf.constant(trainencflt, shape=[trainencflt.shape[0], trainencflt.shape[1]]), None)

configuration = tf.contrib.learn.RunConfig(log_device_placement=True)
model = KMeansClustering(num_clusters=k,
                         initial_clusters=KMeansClustering.RANDOM_INIT,
                         relative_tolerance=1e-8,
                         config=configuration)
model.fit(input_fn=train_input_fn, steps=100)
When it runs I see:
2017-06-15 10:24:41.564890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:81:00.0
Total memory: 7.92GiB
Free memory: 7.81GiB
2017-06-15 10:24:41.564934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-06-15 10:24:41.564942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-06-15 10:24:41.564956: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:81:00.0)
Memory gets allocated:
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 548 C python 7745MiB |
+-----------------------------------------------------------------------------+
But then none of the operations are performed on the GPU (its utilization stays at 0% the whole time, while CPU utilization skyrockets on all cores):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 1 GeForce GTX 1080 Off | 0000:02:00.0 Off | N/A |
| 29% 43C P8 13W / 180W | 7747MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I'm not seeing any placement logs (even though I set log_device_placement to True).
I did try the simple GPU examples and they were working just fine (at least the placement logs were looking fine).
Am I missing something?
Went through the codebase: TF 1.1.0 simply didn't have a GPU kernel for KMeans clustering.
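A way to see this kind of gap directly under the same TF1 setup: pin an op to the GPU and disable soft placement. Ops without a registered GPU kernel then fail with the "Cannot assign a device for operation ..." error quoted in the container question above, while ops that do have one run normally. The float MatMul below is only an illustration, not one of the clustering ops.

import tensorflow as tf

with tf.device("/device:GPU:0"):
    a = tf.random_normal([256, 256])
    b = tf.matmul(a, a)  # MatMul has a GPU kernel, so this placement succeeds

# allow_soft_placement=False turns a missing GPU kernel into a hard error
# instead of a silent fallback to the CPU.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=False)) as sess:
    sess.run(b)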