Tensorflow2 Unknown device in Tensorboard - tensorflow2.0

I cannot display the device placement in the TensorBoard graph. The only device shown is "unknown device" (see screenshot below).
In the Python script, I enabled the following option:
tf.debugging.set_log_device_placement(True)
And indeed, the log device placement information is logged correctly in the python output:
Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarIsInitializedOp in device /job:localhost/replica:0/task:0/device:GPU:0
...
The tracing is activated as follows:
tf.summary.trace_on(graph=True, profiler=True)
The graph is displayed correctly in TensorBoard, but the device name is simply wrong.
OS: Windows 10 x64
Python: 3.8
Tensorflow: 2.2.0
GPU
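Because the placement log itself is correct, one workaround while TensorBoard shows "unknown device" is to summarize that log directly. This is a minimal sketch, assuming the log lines have exactly the "Executing op <name> in device <device>" shape quoted above; ops_by_device is a hypothetical helper, not a TensorFlow API.

```python
import re
from collections import defaultdict

# Matches lines such as:
# Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0
_PLACEMENT_RE = re.compile(r"Executing op (\S+) in device (\S+)")


def ops_by_device(log_text):
    """Group op names by the device they were placed on (hypothetical helper)."""
    placement = defaultdict(list)
    for line in log_text.splitlines():
        match = _PLACEMENT_RE.search(line)
        if match:
            op_name, device = match.groups()
            placement[device].append(op_name)
    return dict(placement)


# Sample input copied from the log lines in the question.
sample_log = """\
Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0
Executing op VarIsInitializedOp in device /job:localhost/replica:0/task:0/device:GPU:0
"""

for device, ops in ops_by_device(sample_log).items():
    print(device, ops)
```

In practice the captured stdout/stderr of the script would be passed in place of sample_log.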

Related

CUDA Error when training YOLOv4-tiny on Colab: no kernel image is available for execution on the device

I was following this tutorial to train a YOLOv4-tiny model to detect custom objects: https://www.youtube.com/watch?v=NTnZgLsk_DA
However, when I attempt to train the model, I get this error message:
CUDA status Error: file: ./src/blas_kernels.cu : () : line: 841 : build time: Jan 7 2022 - 12:01:41
CUDA Error: no kernel image is available for execution on the device
CUDA Error: no kernel image is available for execution on the device: File exists
I was running the code on Colab, not locally. The GPU used for training is a Tesla K80.
A common answer is to set the ARCH values to compute_37, code_37, but I've already set them this way and keep getting the same error! So what should I do to get this code running?
Link to my Colab notebook: https://colab.research.google.com/drive/16EQ6I67OOs1I7rF6PHgBHp1eVHMXXvyO#scrollTo=QyMBDkaL-Aep
Any help would be appreciated!
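For reference, the Tesla K80 has compute capability 3.7, and nvcc spells the real-architecture target as sm_37 rather than code_37. The sketch below only builds the flag string so it can be compared with the ARCH line in the Makefile; gencode_flag is a hypothetical helper, and whether the CUDA toolkit on the Colab image still supports this architecture is an assumption that would need to be checked separately.

```python
def gencode_flag(major, minor, include_ptx=True):
    """Build an nvcc -gencode flag for one compute capability (hypothetical helper)."""
    cc = f"{major}{minor}"
    if include_ptx:
        # Embed both the native binary (sm_XX) and PTX (compute_XX).
        code = f"[sm_{cc},compute_{cc}]"
    else:
        code = f"sm_{cc}"
    return f"-gencode arch=compute_{cc},code={code}"


# Tesla K80: compute capability 3.7
print("ARCH= " + gencode_flag(3, 7))
```

After changing ARCH, the project has to be rebuilt from a clean state for the new flag to take effect.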

Tensorflow: device CUDA:0 not supported by XLA service while setting up XLA_GPU_JIT device number 0

I got this when using keras with Tensorflow backend:
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
while setting up XLA_GPU_JIT device number 0
Relevant code:
import tensorflow as tf
from keras import backend as K

tfconfig = tf.ConfigProto()
tfconfig.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
tfconfig.gpu_options.allow_growth = True
K.tensorflow_backend.set_session(tf.Session(config=tfconfig))
tensorflow version: 1.14.0
Chairman Guo's code:
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
solved my problem of jupyter notebook kernel crashing at:
tf.keras.models.load_model("path/to/my/model")
The fatal message was:
2020-01-26 11:31:58.727326: F
tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch
value instead of handling error Internal: failed initializing
StreamExecutor for CUDA device ordinal 0: Internal: failed call to
cuDevicePrimaryCtxRetain: CUDA_ERROR_UNKNOWN: unknown error
My TF's version is: 2.2.0-dev20200123. There are 2 GPUs on this system.
This could be because your TF-default (i.e. first) GPU is running out of memory. If you have multiple GPUs, run your Python program on one of the others. In TF (assuming TF-2.0-rc1), set the following:
import os

# Specify which GPU(s) to use; set this before TensorFlow initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"] = "1" # Or 2, 3, etc. other than 0

import tensorflow as tf

# On CPU/GPU placement
config = tf.compat.v1.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
tf.compat.v1.Session(config=config)
# Note that in TF-2.0 ConfigProto is only available as tf.compat.v1.ConfigProto
If, however, your environment has only one GPU, then perhaps you have no choice but to ask your buddy to stop his program, then treat him to a cup of coffee.
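The same restriction can also be expressed through the tf.config API rather than an environment variable and a v1 session. This is a minimal sketch, assuming TensorFlow 2.1 or later; use_only_gpu is a hypothetical helper, and it returns False instead of failing when TensorFlow or the requested GPU is missing.

```python
def use_only_gpu(index):
    """Restrict TensorFlow to one physical GPU; return True on success."""
    try:
        import tensorflow as tf
    except ImportError:
        return False  # TensorFlow is not installed in this environment

    gpus = tf.config.list_physical_devices("GPU")
    if not 0 <= index < len(gpus):
        return False  # the requested GPU does not exist on this machine

    # Both calls must run before TensorFlow initializes its GPUs;
    # afterwards they raise RuntimeError.
    tf.config.set_visible_devices(gpus[index], "GPU")
    tf.config.experimental.set_memory_growth(gpus[index], True)
    return True


print(use_only_gpu(1))
```

Call it once at the top of the program, before any model is built or loaded.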

Can't use tensorflow.keras.layers.CuDNNLSTM or keras.layers.CuDNNLSTM in my Colab hosted runtime

When I tried to use either tensorflow.keras.layers.CuDNNLSTM or keras.layers.CuDNNLSTM, I got the following error:
InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNN' used by {{node cu_dnnlstm/CudnnRNN}}with these attrs: [dropout=0, seed=0, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", is_training=true, seed2=0]
Registered devices: [CPU, XLA_CPU]
I am using the hosted runtime, which I presumed supports GPU as well, but the error message above shows no GPU among the registered devices. I am not sure what the problem is, so any clue would be appreciated.
You need to explicitly request a GPU-enabled runtime.
From the Runtime menu, select "Change runtime type", then select GPU under "Hardware accelerator".
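After switching the runtime type, it is worth confirming that TensorFlow actually sees a GPU before building the model. This is a minimal sketch, assuming TensorFlow 2.1 or later; visible_gpu_names is a hypothetical helper that returns an empty list when TensorFlow is not installed.

```python
def visible_gpu_names():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow is not installed in this environment
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpu_names()
if gpus:
    print("GPU available:", gpus)
else:
    print("No GPU visible; check the runtime type")
```

An empty result corresponds to the "Registered devices: [CPU, XLA_CPU]" situation in the error above.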

Operation was explicitly assigned to /GPU:1 but available devices are

My system has 2 Nvidia GPUs: a GTX 750 Ti, which is assigned to the OS for graphics, and a GTX 1080Ti, which is free to be used for Tensorflow. I use the call tensorflow::graph::SetDefaultDevice("/GPU:1", &graph); to select GPU 1.
I am running TF 1.10 hand compiled and configured for C++/CMake and Cuda compilation tools, release 9.1, V9.1.85. When I assign GPU 1 to execute my graph I get the following error:
"Invalid argument: Cannot assign a device for operation 'h1_w/read':
Operation was explicitly assigned to /GPU:1 but available devices are
[ /job:localhost/replica:0/task:0/device:CPU:0,
/job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device
specification refers to a valid device. [[{{node h1_w/read}} =
IdentityT=DT_FLOAT, _class=["loc:#h1_w"], _device="/GPU:1"]]"
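The error lists only CPU:0 and GPU:0 as available, so from TensorFlow's point of view there is a single GPU and "/GPU:1" does not exist. Which physical card GPU:0 maps to is not stated in the message. The check can be sketched in Python (the question itself uses the C++ API); short_name and is_available are hypothetical helpers that compare only the trailing TYPE:INDEX part of a device string.

```python
import re

# Captures the trailing "GPU:0" / "CPU:0" part, with or without "device:".
_DEVICE_RE = re.compile(r"(?:device:)?(CPU|GPU):(\d+)$", re.IGNORECASE)


def short_name(device):
    """Reduce a device string to 'TYPE:INDEX' (hypothetical helper)."""
    match = _DEVICE_RE.search(device.strip())
    if match is None:
        raise ValueError(f"unrecognized device string: {device!r}")
    return f"{match.group(1).upper()}:{int(match.group(2))}"


def is_available(requested, available):
    """True if the requested device is among the available ones."""
    return short_name(requested) in {short_name(d) for d in available}


# Device list copied from the error message in the question.
available = [
    "/job:localhost/replica:0/task:0/device:CPU:0",
    "/job:localhost/replica:0/task:0/device:GPU:0",
]

for requested in ("/GPU:1", "/GPU:0"):
    print(requested, is_available(requested, available))
```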

An error occurred while starting the kernel

I have installed all the necessary software (opencv, tensorflow-gpu, matplotlib, scikit-learn, pandas, keras 2) to run my code and validated each of them. I am using Spyder as an IDE and want to train a CNN in Keras with the Tensorflow backend. My code snippets run until I reach the training stage:
hist = model.fit(X_train, y_train, batch_size=32, nb_epoch=num_epoch, verbose=1, validation_data=(X_test, y_test))
When I run this line, training appears to start, but instead of displaying the epochs and other metrics (val_acc, training_acc, etc.), the kernel suddenly dies, reconnects, dies again, and so on.
At the end I get this error:
2018-03-09 16:25:49.961500: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-03-09 16:25:50.664501: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1212] Found device 0 with properties:
name: GeForce GT 740 major: 3 minor: 0 memoryClockRate(GHz): 1.0715
pciBusID: 0000:01:00.0
totalMemory: 1.00GiB freeMemory: 756.79MiB
2018-03-09 16:25:50.664501: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-09 16:25:51.148102: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/device:GPU:0 with 501 MB memory) -> physical GPU (device: 0, name: GeForce GT 740, pci bus id: 0000:01:00.0, compute capability: 3.0)
2018-03-09 16:27:22.549779: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-03-09 16:27:22.549779: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 224 MB memory) -> physical GPU (device: 0, name: GeForce GT 740, pci bus id: 0000:01:00.0, compute capability: 3.0)
2018-03-09 16:27:43.118021: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_dnn.cc:378] Loaded runtime CuDNN library: 7101 (compatibility version 7100) but source was compiled with 7003 (compatibility version 7000). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
2018-03-09 16:27:43.164821: F C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\kernels\conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo(), &algorithms)
I thought it was a Spyder problem and opened an issue on GitHub, but received a reply that it is not Spyder-related but a compatibility problem.
I searched the web hoping to find a solution, but found no report of exactly the same issue (at least among those I came across).
If anyone has had the same problem, please help me.
What am I supposed to do?
I had the same issue when I was using Jupyter notebook; the fix for me was changing the browser I was running the code with.
If using a different IDE (e.g. Jupyter notebook, PyCharm) does not work, I would recommend running the script from the terminal/command prompt.
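Separately from the IDE, the log in the question reports a cuDNN mismatch just before the fatal check: the runtime library is 7101 while the binary was compiled against 7003. A minimal sketch for decoding those numbers, assuming the cuDNN 7 encoding of major * 1000 + minor * 100 + patch; decode_cudnn_version is a hypothetical helper.

```python
import re


def decode_cudnn_version(encoded):
    """Split an encoded cuDNN version such as 7101 into (major, minor, patch)."""
    major, rest = divmod(encoded, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


# Message text copied from the log in the question.
message = (
    "Loaded runtime CuDNN library: 7101 (compatibility version 7100) "
    "but source was compiled with 7003 (compatibility version 7000)."
)

match = re.search(r"CuDNN library: (\d+) .* compiled with (\d+)", message)
loaded, compiled = (decode_cudnn_version(int(value)) for value in match.groups())

print("loaded at runtime:", ".".join(map(str, loaded)))
print("compiled against: ", ".".join(map(str, compiled)))
```

Here the installed cuDNN is 7.1.1 and the TensorFlow binary expects 7.0.x, which is the compatibility problem the message itself asks to be resolved by matching the two.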