When I run a CNN in TensorFlow 2.0, I get CUDNN_STATUS_INTERNAL_ERROR.
It seems that libcublas.so.10.0 and libcudnn.so.7 are loaded fine.
The versions should be fine:
TensorFlow 2.0
Ubuntu 18.04
GeForce GTX 1650
NVIDIA driver 430
cudnn: 7.4.2.24 (also tried with 7.3.0.29 and 7.6.4.38)
(ref)
I tried the following, but they didn't fix the problem:
I removed ~/.nv (ref)
Modified /usr/include/cudnn.h, changing #include "driver_types.h" to #include <driver_types.h>, and passed the mnistCUDNN test (ref)
Questions:
Does passing the mnistCUDNN test mean that the required packages are installed correctly?
How can I fix the problem below?
Here is the full error message:
Using TensorFlow backend.
2019-10-16 14:48:16.226892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-16 14:48:16.255123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
...
2019-10-16 14:48:16.370703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3253 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650, pci bus id: 0000:01:00.0, compute capability: 7.5)
Train on 48000 samples, validate on 12000 samples
Epoch 1/12
2019-10-16 14:48:17.357747: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-10-16 14:48:17.525865: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
--error here--
2019-10-16 14:48:17.873127: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-10-16 14:48:17.879412: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
--error here--
2019-10-16 14:48:17.879516: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
Traceback (most recent call last):
File "lenet.py", line 96, in <module> x_train, y_train, batch_size=128, epochs=12, validation_split=0.2
File "lenet.py", line 83, in train verbose=self.verbose
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit validation_freq=validation_freq)
File "/home/yuyu/venv/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop outs = fit_function(ins_batch)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/backend.py", line 3740, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
return self._call_impl(args, kwargs)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node conv2d_1/convolution (defined at /home/yuyu/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_1220]
Function call stack:
keras_scratch_graph
I encountered this error on my Ubuntu 20.04 / RTX 2070 system. I found this:
https://gist.github.com/mikaelhg/cae5b7938aa3dfdf3d06a40739f2f3f4#file-cuda-install-md
where it suggests exporting an environment variable like this:
export TF_FORCE_GPU_ALLOW_GROWTH=true
That fixed it for me. Happy days.
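If you prefer to set this from Python rather than the shell, here is a minimal sketch of the same idea (assuming TF 2.x; the environment variable must be set before TensorFlow initializes the GPU):

import os

# Must be set before TensorFlow touches the GPU.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import tensorflow as tf

# Equivalent programmatic route in TF 2.x: enable memory growth per GPU,
# so cuDNN can initialize without TF pre-allocating all GPU memory.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)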
I am trying to convert yolov3 weights to tflite using DW2TF.
Here is the tutorial I am following.
When I execute the following command, I get an error.
!python to_frozen_graph.py --model_dir data --output_node_names yolov3/convolutional59/
Here is the error.
WARNING:tensorflow:From to_frozen_graph.py:18: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.
WARNING:tensorflow:From to_frozen_graph.py:39: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
2022-06-12 15:38:28.599693: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2022-06-12 15:38:28.608002: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2022-06-12 15:38:28.608087: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (ac9eb598934a): /proc/driver/nvidia/version does not exist
2022-06-12 15:38:28.615211: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200220000 Hz
2022-06-12 15:38:28.615511: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x2ed12c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-06-12 15:38:28.615558: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:From to_frozen_graph.py:41: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.
WARNING:tensorflow:From to_frozen_graph.py:49: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
WARNING:tensorflow:From to_frozen_graph.py:50: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.convert_variables_to_constants`
WARNING:tensorflow:From /tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/graph_util_impl.py:277: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
Traceback (most recent call last):
File "to_frozen_graph.py", line 66, in <module>
freeze_graph(args.model_dir, args.output_node_names)
File "to_frozen_graph.py", line 50, in freeze_graph
output_node_names.split(",") # The output node names are used to select the usefull nodes
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/graph_util_impl.py", line 277, in convert_variables_to_constants
inference_graph = extract_sub_graph(input_graph_def, output_node_names)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/graph_util_impl.py", line 197, in extract_sub_graph
_assert_nodes_are_present(name_to_node, dest_nodes)
File "/tensorflow-1.15.2/python3.7/tensorflow_core/python/framework/graph_util_impl.py", line 152, in _assert_nodes_are_present
assert d in name_to_node, "%s is not in graph" % d
AssertionError: yolov3/convolutional59/ is not in graph
What can I try to solve this?
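For reference, the assertion says the node name passed via --output_node_names (note the trailing "/") is not in the graph. A minimal diagnostic sketch for listing candidate node names (assumes TF 1.15, as in the log above; the checkpoint path is hypothetical):

import tensorflow as tf

with tf.Session() as sess:
    # Import the graph from the checkpoint's .meta file and print the
    # node names around the layer of interest, so the exact name can be
    # passed to --output_node_names (without a trailing slash).
    tf.train.import_meta_graph("data/yolov3.ckpt.meta")  # hypothetical path
    for node in tf.get_default_graph().as_graph_def().node:
        if "convolutional59" in node.name:
            print(node.name)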
I am trying to understand and debug my code. I try to predict with a CNN model developed under TF 2.0/tf.keras on GPU, but I get the error messages below.
Could someone help me fix it?
Here is my environment configuration:
python 3.6.8
tensorflow-gpu 2.0.0-rc0
nvidia 418.x
CUDA 10.0
cuDNN 7.6+
And here is the log file:
2019-09-28 13:10:59.833892: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2019-09-28 13:11:00.228025: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-09-28 13:11:00.957534: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963310: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2019-09-28 13:11:00.963416: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node mobilenetv2_1.00_192/Conv1/Conv2D}}]]
=====>GPU Available: True
=====> 4 Physical GPUs, 1 Logical GPUs
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_expand_BN/cond/then/_630/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_depthwise_BN/cond/then/_644/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_15_project_BN/cond/then/_658/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_expand_BN/cond/then/_672/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_depthwise_BN/cond/then/_686/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/block_16_project_BN/cond/then/_700/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const: (Const): /job:localhost/replica:0/task:0/device:GPU:0
mobilenetv2_1.00_192/Conv_1_bn/cond/then/_714/Const_1: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Traceback (most recent call last):
File "NSFW_Server.py", line 162, in <module>
model.predict(initial_tensor)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 915, in predict
use_multiprocessing=use_multiprocessing)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 722, in predict
callbacks=callbacks)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_arrays.py", line 393, in model_iteration
batch_outs = f(ins_batch)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/backend.py", line 3625, in __call__
outputs = self._graph_fn(*converted_inputs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1081, in __call__
return self._call_impl(args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1121, in _call_impl
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
ctx, args, cancellation_manager=cancellation_manager)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/function.py", line 511, in call
ctx=ctx)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[node mobilenetv2_1.00_192/Conv1/Conv2D (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_keras_scratch_graph_10727]
Function call stack:
keras_scratch_graph
The code:
import numpy as np
import tensorflow as tf

INPUT_SHAPE = 192  # not shown in the original snippet; assumed from the mobilenetv2_1.00_192 model name

if __name__ == "__main__":
    print("=====>GPU Available: ", tf.test.is_gpu_available())
    tf.debugging.set_log_device_placement(True)

    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        try:
            # Currently, memory growth needs to be the same across GPUs
            tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
            tf.config.experimental.set_memory_growth(gpus[0], True)
            logical_gpus = tf.config.experimental.list_logical_devices('GPU')
            print("=====>", len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
        except RuntimeError as e:
            # Memory growth must be set before GPUs have been initialized
            print(e)

    paras_path = "./paras/{}".format(int(2011))
    model = tf.keras.experimental.load_from_saved_model(paras_path)
    initial_tensor = np.zeros((1, INPUT_SHAPE, INPUT_SHAPE, 3))
    model.predict(initial_tensor)
You have to check that you have compatible versions of CUDA, cuDNN, and TensorFlow installed.
A couple of working configurations are listed below (updated for recent versions of TensorFlow); a quick way to check your own versions is sketched after the list.
CUDA 11.3.1 + cuDNN 8.2.1.32 + TensorFlow 2.7.0
CUDA 11.0 + cuDNN 8.0.4 + TensorFlow 2.4.0
CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.2.0/2.3.0 (TF >= 2.1 requires CUDA >= 10.1)
CUDA 10.1 + cuDNN 7.6.5 (normally > 7.6) + TensorFlow 2.1.0 (TF >= 2.1 requires CUDA >= 10.1)
CUDA 10.0 + cuDNN 7.6.3 + TensorFlow 1.13/1.14/2.0
CUDA 9.0 + cuDNN 7.0.5 + TensorFlow 1.10
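To check which CUDA and cuDNN versions your installed TensorFlow was built against, a quick sketch (tf.sysconfig.get_build_info is available in recent TF 2.x releases):

import tensorflow as tf

print("TF version:", tf.__version__)
# On GPU builds, the build-info dict includes the CUDA/cuDNN versions
# the binary was compiled against.
build = tf.sysconfig.get_build_info()
print("built against CUDA:", build.get("cuda_version"))
print("built against cuDNN:", build.get("cudnn_version"))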
Usually this error appears when you have an incompatible version of TensorFlow/cuDNN installed. In my case, it appeared when I tried using an older TensorFlow with a newer version of cuDNN.
If for some reason you get an error message like the following (and nothing happens afterwards):
Relying on the driver to perform ptx compilation
Solution: install the latest NVIDIA driver.
[SEEMS TO BE SOLVED IN TF >= 2.5.0] (see below):
Only for Windows users: some late combinations of CUDA, cuDNN and TF may not work, due to a bug (a .dll named improperly). To handle that specific case, please consult this link: Tensorflow GPU Could not load dynamic library 'cusolver64_10.dll'; dlerror: cusolver64_10.dll not found
For those who are facing the above error on Windows, I sorted it out just by installing the cuDNN version compatible with the CUDA version already installed on the system.
This suitable version can be downloaded from the cuDNN page on the NVIDIA developer portal. You might need an NVIDIA account for it; one is easily created by providing an email address and filling in a questionnaire.
To check the CUDA version, run nvcc --version.
Once the suitable version is downloaded, extract the folder from the zip file.
Go to the bin folder of the extracted folder, copy cudnn64_7.dll, and paste it into CUDA's bin folder. In my case, the location where CUDA is installed is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin.
This will most probably solve the problem.
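As a quick sanity check after copying the DLL, a sketch using the TF 2.0 API (as in my setup below); TensorFlow should now list the GPU:

import tensorflow as tf

# An empty list here means TF still cannot see the GPU/cuDNN stack.
print(tf.config.experimental.list_physical_devices('GPU'))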
My system details:
Windows 10
CUDA 10.0
TensorFlow 2.0
GPU- Nvidia GTX 1060
I also found the blog Installing TensorFlow with CUDA and GPU support on Windows 10 very useful.
Check the instructions on the TensorFlow GPU instruction page for your OS. It resolved the issue for me on Ubuntu 16.04.6 LTS and TensorFlow 2.0.
I ran a simple Keras script that trains a conv net on the MNIST database. This script works on my laptop, yet not on my PC with the GeForce RTX 2070 graphics card.
The error is this:
File "/home/squall/spencer/kaggle/understanding_cloud_organization/mnist_model.py", line 67, in <module>
validation_data=(x_test, y_test))
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/keras/engine/training.py", line 1239, in fit
validation_freq=validation_freq)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
outs = fit_function(ins_batch)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/tensorflow/python/keras/backend.py", line 3292, in __call__
run_metadata=self.run_metadata)
File "/home/squall/anaconda3/envs/thunder/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1458, in __call__
run_metadata_ptr)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
[[metrics/accuracy/Identity/_91]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node conv2d_1/convolution}}]]
0 successful operations.
0 derived errors ignored.
CUDA is 10.1. The driver is 418.56. cuDNN is 7.4.2. TensorFlow is 1.14. According to the official NVIDIA chart, these are all compatible versions.
Any ideas?
Try this
PS: CUDA is 10.0 and cuDNN is 7.6.3 for CUDA 10.0.
For TensorFlow 1.14 to work, you need CUDA 10.0. Uninstall CUDA 10.1 completely and install the supported CUDA 10.0. You can read about the requirements for TF here.
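If reinstalling CUDA is not immediately an option, a workaround often reported to help on RTX cards is enabling allow_growth (a sketch, not part of the original answer; assumes TF 1.x with standalone Keras, as in the traceback above):

import tensorflow as tf
from keras import backend as K

# Grow GPU memory on demand instead of pre-allocating it all, which is a
# common trigger of cuDNN initialization failures on RTX cards.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))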
I trained a model using the image retraining guide of TensorFlow (https://www.tensorflow.org/hub/tutorials/image_retraining). Then I tried to convert the .pb model with tensorflowjs_converter, but I got an error about the metagraph.
My environment is Ubuntu 18.04; I'm using tensorflow-gpu (https://www.tensorflow.org/install/gpu) and the latest version of tensorflowjs_converter (1.0.1).
Command executed for training the model:
python retrain.py --image_dir ./flower_photos --saved_model_dir=/tmp/saved_models/$(date +%s)/
Command executed for converting the model:
tensorflowjs_converter --input_format=tf_saved_model --output_format=tfjs_graph_model /tmp/saved_models/1555066703 /tmp/web_models
2019-04-12 15:45:06.797479: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-12 15:45:06.818525: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-04-12 15:45:06.819292: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55637fb624e0 executing computations on platform Host. Devices:
2019-04-12 15:45:06.819327: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
2019-04-12 15:45:10.845000: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1364] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
WARNING: Logging before flag parsing goes to stderr.
W0412 15:45:11.737798 139737592477504 meta_graph.py:447] Issue encountered when serializing variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
to_proto not supported in EAGER mode.
W0412 15:45:11.738872 139737592477504 meta_graph.py:447] Issue encountered when serializing model_variables.
Type is unsupported, or the types of the items don't match field type in CollectionDef. Note this is a warning and probably safe to ignore.
to_proto not supported in EAGER mode.
2019-04-12 15:45:11.743861: I tensorflow/core/grappler/devices.cc:61] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA support)
2019-04-12 15:45:11.743944: I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
2019-04-12 15:45:11.762060: E tensorflow/core/grappler/grappler_item_builder.cc:636] Init node final_retrain_ops/weights/final_weights/Assign doesn't exist in graph
Traceback (most recent call last):
File "/home/davide/.local/bin/tensorflowjs_converter", line 11, in <module>
sys.exit(main())
File "/home/davide/.local/lib/python2.7/site-packages/tensorflowjs/converters/converter.py", line 358, in main
strip_debug_ops=FLAGS.strip_debug_ops)
File "/home/davide/.local/lib/python2.7/site-packages/tensorflowjs/converters/tf_saved_model_conversion_v2.py", line 271, in convert_tf_saved_model
concrete_func)
File "/home/davide/.local/lib/python2.7/site-packages/tensorflow/python/framework/convert_to_constants.py", line 99, in convert_variables_to_constants_v2
graph_def = _run_inline_graph_optimization(func)
File "/home/davide/.local/lib/python2.7/site-packages/tensorflow/python/framework/convert_to_constants.py", line 57, in _run_inline_graph_optimization
return tf_optimizer.OptimizeGraph(config, meta_graph)
File "/home/davide/.local/lib/python2.7/site-packages/tensorflow/python/grappler/tf_optimizer.py", line 43, in OptimizeGraph
verbose, graph_id, status)
File "/home/davide/.local/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 548, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Failed to import metagraph, check error log for more info.
I expect a tfjs model; instead, I get the above result.
I'm running the tutorial example for XLA using a TensorFlow compiled from source. Running python mnist_softmax_xla.py results in the following error:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform Host. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): <undefined>, <undefined>
I tensorflow/compiler/xla/service/platform_util.cc:58] platform CUDA present with 1 visible devices
I tensorflow/compiler/xla/service/platform_util.cc:58] platform Host present with 4 visible devices
I tensorflow/compiler/xla/service/service.cc:180] XLA service executing computations on platform CUDA. Devices:
I tensorflow/compiler/xla/service/service.cc:187] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
W tensorflow/core/framework/op_kernel.cc:993] Not found: ./libdevice.compute_35.10.bc not found
Traceback (most recent call last):
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
return fn(*args)
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1004, in _run_fn
status, run_metadata)
File "/usr/lib/python3.4/contextlib.py", line 66, in __exit__
next(self.gen)
File "/mnt/software/envs/xla/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: ./libdevice.compute_35.10.bc not found
[[Node: cluster_0/_0/_1 = _XlaLaunch[Targs=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], Tconstants=[DT_INT32], Tresults=[DT_FLOAT, DT_FLOAT], function=cluster_0[_XlaCompiledKernel=true, _XlaNumConstantArgs=1], _device="/job:localhost/replica:0/task:0/gpu:0"](Shape_2, _recv_Placeholder_0/_3, _recv_Placeholder_1_0/_1, Variable_1, Variable)]]
I have CUDA 8 installed with cuDNN 5.1. The file libdevice.compute_35.10.bc does exist on the machine:
$ find /usr/local/cuda/ -type f | grep libdevice.compute_35.10.bc
/usr/local/cuda/nvvm/libdevice/libdevice.compute_35.10.bc
My hunch is that this has something to do with the message TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root., but I'm not sure what to do about it.
The key is this log message:
I tensorflow/core/platform/default/cuda_libdevice_path.cc:35] TEST_SRCDIR environment variable not set: using local_config_cuda/cuda under this executable's runfiles directory as the CUDA root.
(I only noticed it in the logs later; I actually found that file by digging around in the sources and only then noticed the message in the logs.)
For reasons that I do not currently understand, XLA does not look in /usr/local/cuda (or whatever directory you gave when you ran ./configure) for libdevice. Per cuda_libdevice_path.cc [1], it's looking for a symlink that was created specifically to point it to libdevice.
I'm going to loop in the person who wrote this code to figure out what it's supposed to be doing. In the meantime, I was able to work around it myself as follows:
$ mkdir local_config_cuda
$ ln -s /usr/local/cuda local_config_cuda/cuda
$ TEST_SRCDIR=$(pwd) python my_program.py
The important thing is to set TEST_SRCDIR to the parent of the local_config_cuda directory.
Sorry for the trouble, and sorry I don't have a less-hacky answer for you right now.
[1] https://github.com/tensorflow/tensorflow/blob/e1f44d8/tensorflow/core/platform/cuda_libdevice_path.cc#L23
https://github.com/tensorflow/tensorflow/blob/1084748efa3234c7daa824718aeb7df7b9252def/tensorflow/core/platform/default/cuda_libdevice_path.cc#L27
https://github.com/tensorflow/tensorflow/pull/7079 should be able to fix this. Thanks for the bug report!