tensorflow-gpu running failure on Linux

I've installed CUDA and cuDNN on Ubuntu 16.04.
CUDA version: 9.0 (with driver version 390.87)
cuDNN version: 7.2 for CUDA 9.0
import tensorflow as tf
works fine, but
tf.Session()
renders the following error.
2018-09-15 16:43:23.281375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 16:43:23.281431: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/imhgchoi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1494, in __init__
super(Session, self).__init__(target, graph, config=config)
File "/home/imhgchoi/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 626, in __init__
self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.
The error message implies that I've installed the wrong CUDA driver version, but I'm lost; I'm not sure what steps to take to remedy this situation.
AFTER ADDING ENVIRONMENT VARIABLES
That only added new errors:
2018-09-15 17:13:39.684390: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-09-15 17:13:39.767963: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-09-15 17:13:39.768481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.506
pciBusID: 0000:09:00.0
totalMemory: 3.94GiB freeMemory: 3.41GiB
2018-09-15 17:13:39.768502: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 17:13:39.768635: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version

Maybe it is your environment variables causing this problem.
Try this: open your ~/.bashrc file,
sudo vim ~/.bashrc
add the lines below at the end of the file, and restart your terminal. Then start a Python session, import tensorflow (you should have tensorflow-gpu installed, e.g. via pip), and see if it works:
export CUDA_HOME="/usr/local/cuda-9.0"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64"
export PATH="${CUDA_HOME}/bin:${PATH}"
export DYLD_LIBRARY_PATH="${CUDA_HOME}/lib"
Edit 1:
Please make sure that "/usr/local/cuda-9.0" is the directory where you installed CUDA.
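Since the error says the driver version is insufficient for the runtime version, it can also help to query both versions directly and compare them. Here is a minimal sketch using ctypes; the library names assume a standard Linux CUDA 9.0 install, so adjust them to your setup:
import ctypes

# The driver API ships with the NVIDIA driver; the runtime API ships with the CUDA toolkit.
cuda = ctypes.CDLL("libcuda.so.1")
cudart = ctypes.CDLL("libcudart.so.9.0")

driver_ver = ctypes.c_int()
runtime_ver = ctypes.c_int()
cuda.cuDriverGetVersion(ctypes.byref(driver_ver))
cudart.cudaRuntimeGetVersion(ctypes.byref(runtime_ver))

# Versions are encoded as 1000*major + 10*minor, e.g. 9000 for CUDA 9.0.
print("driver supports CUDA:", driver_ver.value)
print("runtime version:", runtime_ver.value)
print("driver >= runtime:", driver_ver.value >= runtime_ver.value)
A driver value lower than the runtime value reproduces exactly the error above; upgrading the NVIDIA driver fixes it.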

Related

Tensorflow 2.0 can't use GPU, something wrong in cuDNN?: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

I am trying to understand and debug my code. I try to predict with a CNN model developed under tf2.0/tf.keras on a GPU, but I get the error messages below. Could someone help me fix it?
Here is my environment configuration:
python 3.6.8
tensorflow-gpu 2.0.0-rc0
nvidia 418.x
CUDA 10.0
cuDNN 7.6+
And the log file:
2019-09-28 13:10:59.833892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-28 13:11:00.228025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-28 13:11:00.957534: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963310: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963416: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node mobilenetv2_1.00_192/Conv1/Conv2D}}]]
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
=====>GPU Available: True
=====> 4 Physical GPUs, 1 Logical GPUs
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "NSFW_Server.py", line 162, in <module>
model.predict(initial_tensor)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 915, in predict
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 722, in predict
callbacks=callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 393, in model_iteration
batch_outs = f(ins_batch)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3625, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node mobilenetv2_1.00_192/Conv1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_10727]
Function call stack:
keras_scratch_graph
The code
import numpy as np
import tensorflow as tf

# INPUT_SHAPE is defined elsewhere in the original script
if __name__ == "__main__":
    print("=====>GPU Available: ", tf.test.is_gpu_available())
    tf.debugging.set_log_device_placement(True)

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs
            tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
            tf.config.experimental.set_memory_growth(gpus[0], True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print("=====>", len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)

    paras_path = "./paras/{}".format(int(2011))
    model = tf.keras.experimental.load_from_saved_model(paras_path)
    initial_tensor = np.zeros((1, INPUT_SHAPE, INPUT_SHAPE, 3))
    model.predict(initial_tensor)
You have to check that you have the right versions of CUDA, cuDNN, and TensorFlow (and also ensure that all three are actually installed).
A couple of examples of working configurations are presented below (updated for the latest versions of TensorFlow):
CUDA 11.3.1 + cuDNN 8.2.1.32 + TensorFlow 2.7.0
CUDA 11.0 + cuDNN 8.0.4 + TensorFlow 2.4.0
CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.2.0/2.3.0 (TF >= 2.1 requires CUDA >= 10.1)
CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.1.0 (TF >= 2.1 requires CUDA >= 10.1)
CUDA 10.0 + cuDNN 7.6.3 + TensorFlow 1.13/1.14/2.0
CUDA 9.0 + cuDNN 7.0.5 + TensorFlow 1.10
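A quick way to see which CUDA and cuDNN versions your TensorFlow binary was built against is to ask TensorFlow itself. A minimal sketch, assuming TF >= 2.3 (older versions do not expose build info this way):
import tensorflow as tf

print("TensorFlow:", tf.__version__)
build = tf.sysconfig.get_build_info()  # available in TF >= 2.3
print("built for CUDA:", build.get("cuda_version"))
print("built for cuDNN:", build.get("cudnn_version"))
If the printed versions don't match what nvcc --version and your cuDNN install report, you have found the mismatch.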
Usually this error appears when you have an incompatible version of TensorFlow/cuDNN installed. In my case, it appeared when I tried using an older TensorFlow with a newer version of cuDNN.
If for some reason you get an error message like the following (and nothing happens afterwards):
Relying on the driver to perform ptx compilation
Solution: install the latest NVIDIA driver.
[SEEMS TO BE SOLVED IN TF >= 2.5.0] (see below):
Only for Windows users: some late combinations of CUDA, cuDNN, and TF may not work, due to a bug (a .dll named improperly). To handle that specific case, please consult this link: Tensorflow GPU Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
For those facing the above error on Windows: I sorted it out just by installing the cuDNN version compatible with the CUDA version already installed on the system.
The suitable version can be downloaded from the cuDNN page on NVIDIA's developer portal. You might need an NVIDIA account for it, which is easily created by providing an email address and filling out a questionnaire.
To check the CUDA version, run nvcc --version.
Once the suitable version is downloaded, extract the folder from the zip file.
Go to the bin folder of the extracted archive, copy cudnn64_7.dll, and paste it into CUDA's bin folder. In my case, the location where CUDA is installed is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin.
This would most probably solve the problem.
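As a quick sanity check that the DLL landed where TensorFlow will look for it, here is a minimal sketch; the CUDA path is the example from this answer, so adjust it for your version:
from pathlib import Path

# Example install location from this answer; change v10.0 to your CUDA version.
cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin")
print("cudnn64_7.dll present:", (cuda_bin / "cudnn64_7.dll").exists())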
My system details:
Windows 10
CUDA 10.0
TensorFlow 2.0
GPU- Nvidia GTX 1060
I also found the blog post Installing TensorFlow with CUDA and GPU support on Windows 10 very useful.
Check the instructions on the TensorFlow GPU instruction page for your OS. It resolved the issue for me on Ubuntu 16.04.6 LTS with TensorFlow 2.0.

No module named tensorflow after installation?

I installed tensorflow-gpu, but I got an error in PyCharm:
ModuleNotFoundError: No module named 'tensorflow'
I checked in terminal:
$ pip3 list|grep tensorflow
tensorflow-gpu 1.4.0
tensorflow-tensorboard 0.4.0
Edit (after installation using venv):
Successfully installed tensorflow-gpu-1.12.0
(venv) wojtek@wojtek-GF63-8RC:~$ python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
2018-12-17 21:49:14.893016: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-12-17 21:49:14.961123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-17 21:49:14.961466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1050 major: 6 minor: 1 memoryClockRate(GHz): 1.493
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.58GiB
2018-12-17 21:49:14.961479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-17 21:49:15.148507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-17 21:49:15.148538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-17 21:49:15.148544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-17 21:49:15.148687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3306 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050, pci bus id: 0000:01:00.0, compute capability: 6.1)
tf.Tensor(918.94904, shape=(), dtype=float32)
You'll want to configure the interpreter (source):
1) In the Project Interpreters page, select one of the configured interpreters or virtual environments.
2) Click Edit.
3) In the Edit Python Interpreter dialog box that opens, type the desired interpreter name.
Changing interpreter's name
The Python interpreter name specified in the Name field becomes visible in the list of available interpreters.
If necessary, change the path to the Python executable.
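If you're not sure whether PyCharm ended up on the interpreter that actually has tensorflow-gpu, here is a minimal sketch to run in PyCharm's Python console:
import sys
print(sys.executable)  # the interpreter PyCharm is actually running

import tensorflow as tf
print(tf.__version__, tf.__file__)  # confirms which installation gets imported
If sys.executable is not the environment where pip3 installed tensorflow-gpu, point the interpreter path there.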

Your kernel may have been built without NUMA support

I have a Jetson TX2, Python 2.7, TensorFlow 1.5, CUDA 9.0.
TensorFlow seems to be working, but every time I run the program, I get this warning:
with tf.Session() as sess:
print (sess.run(y,feed_dict))
...
2018-08-07 18:07:53.200320: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:881] could not open file to read NUMA node: /sys/bus/pci/devices/0000:00:00.0/numa_node Your kernel may have been built without NUMA support.
2018-08-07 18:07:53.200427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: NVIDIA Tegra X2
major: 6
minor: 2
memoryClockRate(GHz): 1.3005
pciBusID: 0000:00:00.0
totalMemory: 7.66GiB
freeMemory: 1.79GiB
2018-08-07 18:07:53.200474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: NVIDIA Tegra X2, pci bus id: 0000:00:00.0, compute capability: 6.2)
2018-08-07 18:07:53.878574: I tensorflow/core/common_runtime/gpu/gpu_device.cc:859] Could not identify NUMA node of /job:localhost/replica:0/task:0/device:GPU:0, defaulting to 0. Your kernel may not have been built with NUMA support.
Should I be worried? Or is it something negligible?
It shouldn't be a problem for you, since you don't need NUMA support for this board (it has only one memory controller, so memory accesses are uniform).
Also, I found a post on the NVIDIA forum that seems to confirm this.
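If you want to see exactly what TensorFlow is reading, here is a minimal sketch that inspects the same sysfs entry as the warning (the PCI address is taken from the log above; adjust it for your device; it runs on Python 2 and 3):
import os

pci_addr = "0000:00:00.0"  # GPU PCI address from the log above
node_file = "/sys/bus/pci/devices/{}/numa_node".format(pci_addr)
if os.path.exists(node_file):
    with open(node_file) as f:
        # -1 means the kernel reports no NUMA affinity; TensorFlow then defaults to node 0.
        print("numa_node = " + f.read().strip())
else:
    print("no numa_node entry; kernel was likely built without NUMA support")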

I'm trying to use tensorflow on my ubuntu 16.04 machine, but even after installing tensorflow gpu; I cant use gpu for my tensorflow

I'm trying to use TensorFlow GPU on my Ubuntu 16.04 machine. I've successfully installed the CUDA toolkit (8.0.61) and cuDNN (6.0.21). The problem is, I can't use TensorFlow GPU even after this installation process.
Importing tensorflow does not print any lines about the GPU. On my other Ubuntu machine, some lines are printed while importing tensorflow, but on this machine nothing is printed.
[Screenshot: Nvidia driver version and usage]
There are no lines when importing TensorFlow. Only starting a session gives you some output:
>>> import tensorflow as tf
>>> tf.Session()
2017-09-17 20:06:20.174697: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX
2017-09-17 20:06:20.343531: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-09-17 20:06:20.344062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Found device 0 with properties:
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.329
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.61GiB
2017-09-17 20:06:20.344084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1055] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:01:00.0, compute capability: 5.2)
<tensorflow.python.client.session.Session object at 0x7f387d89f210>
After consulting my crystal ball, I suggest you check the environment variable TF_CPP_MIN_LOG_LEVEL.
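For context, TF_CPP_MIN_LOG_LEVEL controls how much of TensorFlow's C++ logging is printed: 0 shows everything, 1 hides INFO, 2 also hides WARNING, and 3 also hides ERROR. A minimal sketch; the variable must be set before tensorflow is imported:
import os

# 0 = all messages, 1 = filter INFO, 2 = also filter WARNING, 3 = also filter ERROR
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "0"

import tensorflow as tf  # the import/session log lines are controlled from here on
If it was set to 1 or higher in your shell, that would explain why the informational lines never appear.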

NotFoundError running TensorFlow XLA example (libdevice.compute_35.10.bc)

I'm running the tutorial example for XLA using a TensorFlow compiled from source. Running python mnist_softmax_xla.py results in the following error:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
W tensorflow/core/framework/op_kernel.cc:993] Not found: ./libdevice.compute_35.10.bc not found
Traceback (most recent call last):
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: ./libdevice.compute_35.10.bc not found
[[Node: cluster_0/_0/_1 = _XlaLaunch[Targs=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], Tconstants=[DT_INT32], Tresults=[DT_FLOAT, DT_FLOAT], function=cluster_0[_XlaCompiledKernel=true, _XlaNumConstantArgs=1], _device="/job:localhost/replica:0/task:0/gpu:0"](Shape_2, _recv_Placeholder_0/_3, _recv_Placeholder_1_0/_1, Variable_1, Variable)]]
I have CUDA 8 installed with cuDNN 5.1. The file libdevice.compute_35.10.bc does exist on the machine:
$ find /usr/local/cuda/ -type f | grep libdevice.compute_35.10.bc
/usr/local/cuda/nvvm/libdevice/libdevice.compute_35.10.bc
My hunch is that this has something to do with the message TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root., but I'm not sure what to do about it.
The key is this log message:
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
(I only noticed it in the logs later; I actually found that file by digging around in the sources and only then noticed the message in the logs.)
For reasons that I do not currently understand, XLA does not look in /usr/local/cuda (or whatever directory you gave when you ran ./configure) for libdevice. Per cuda_libdevice_path.cc [1], it's looking for a symlink that was created specifically to point it to libdevice.
I'm going to loop in the person who wrote this code to figure out what it's supposed to be doing. In the meantime, I was able to work around it myself as follows:
$ mkdir local_config_cuda
$ ln -s /usr/local/cuda local_config_cuda/cuda
$ TEST_SRCDIR=$(pwd) python my_program.py
The important thing is to set TEST_SRCDIR to the parent of the local_config_cuda directory.
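If you prefer to do this from inside the script instead of the shell, here is a minimal sketch of the same workaround, assuming the local_config_cuda/cuda symlink created above already exists in the current directory:
import os

# Must point at the parent of local_config_cuda, and must be set before
# TensorFlow/XLA compiles anything.
os.environ["TEST_SRCDIR"] = os.getcwd()

import tensorflow as tf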
Sorry for the trouble, and sorry I don't have a less-hacky answer for you right now.
[1] https://github.com/tensorflow/tensorflow/blob/e1f44d8/tensorflow/core/platform/cuda_libdevice_path.cc#L23
https://github.com/tensorflow/tensorflow/blob/1084748efa3234c7daa824718aeb7df7b9252def/tensorflow/core/platform/default/cuda_libdevice_path.cc#L27
https://github.com/tensorflow/tensorflow/pull/7079 should be able to fix this. Thanks for the bug report!