Upgrading Cudnn version in Vertex AI Notebook [Kernel Restarting Problem] - tensorflow

Problem: The cuDNN version is incompatible with TensorFlow and CUDA; the kernel dies and I am unable to start training in Vertex AI.
Current versions:
import tensorflow as tf
from tensorflow.python.platform import build_info as build
print(f"tensorflow version: {tf.__version__}")
print(f"Cuda Version: {build.build_info['cuda_version']}")
print(f"Cudnn version: {build.build_info['cudnn_version']}")
tensorflow version: 2.10.0
Cuda Version: 11.2
Cudnn version: 8
As per the information (shown in attached screenshot) available here, Cudnn version must be 8.1.
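One thing worth noting about the output above: build_info describes the CUDA and cuDNN versions the TensorFlow binary was built against, and for cuDNN it shows only the major version here, so "8" does not by itself contradict the 8.1 requirement. The sketch below is one hypothetical way to compare such output with the tested configurations quoted on this page. The TESTED table and the function names are illustrative assumptions, not part of TensorFlow, and the table holds only the pairs mentioned here.

```python
# Tested configurations quoted on this page only (not an exhaustive table).
# Keys are "major.minor" TensorFlow versions; values are (CUDA, cuDNN).
TESTED = {
    "2.10": ("11.2", "8.1"),
    "1.14": ("10.0", "7.4"),
    "1.9": ("9.0", "7"),
}


def _parts(version):
    """Turn '11.2' into [11, 2], ignoring non-numeric components."""
    return [int(p) for p in version.split(".") if p.isdigit()]


def matches(reported, required):
    """True if both versions agree on every component that both provide."""
    r, q = _parts(reported), _parts(required)
    n = min(len(r), len(q))
    return n > 0 and r[:n] == q[:n]


def check_build(tf_version, cuda, cudnn):
    """Return 'ok', 'mismatch', or 'unknown' for a reported combination."""
    key = ".".join(tf_version.split(".")[:2])
    if key not in TESTED:
        return "unknown"
    want_cuda, want_cudnn = TESTED[key]
    ok = matches(cuda, want_cuda) and matches(cudnn, want_cudnn)
    return "ok" if ok else "mismatch"


print(check_build("2.10.0", "11.2", "8"))
```

Because only the shared components are compared, a reported cuDNN of "8" is accepted against a required "8.1"; the minor version would have to be checked on the installed library itself.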
A similar question has been asked here about upgrading cuDNN in Google Colab, but it does not solve my issue. All the other online sources I found apply only to Anaconda environments.
How can I upgrade the Cudnn in my case?
Thank you.

I tried several combinations of tensorflow, Cuda, and Cudnn versions in Google Colab and the following version worked [OS: Ubuntu 20.04]:
tensorflow version: 2.9.2
Cuda Version: 11.2
Cudnn version: 8
Therefore, I downgraded TensorFlow in Vertex AI from 2.10.0 to 2.9.2, and it worked (this solved only the incompatibility issue). I'm still searching for a solution to the kernel restarting.
UPDATE:
The kernel restarting problem was fixed after I changed the kernel from Tensorflow 2 (Local) to Python (Local) in the Vertex AI notebook, as shown in the attached image [the kernel selection option is at the top right, near the bug symbol].

Related

Using TensorFlow with GPU taking a long time for loading library related to CUDA

Machine Setting:
GPU: GeForce RTX 3060
Driver Version: 460.73.01
CUDA Driver Version: 11.2
Tensorflow: tensorflow-gpu 1.14.0
CUDA Runtime Version: 10.0
cudnn: 7.4.1
Note:
The CUDA runtime and cuDNN versions match the guide in the official TensorFlow documentation.
I've also tried tensorflow-gpu 2.0, with the same problem.
Problem:
I am using TensorFlow for an object detection task. The program gets stuck at
2021-06-05 12:16:54.099778: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
for several minutes.
It then gets stuck at the next loading step
2021-06-05 12:21:22.212818: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
for an even longer time. You may check log.txt for log details.
After waiting for around 30 minutes, the program starts running and works well.
However, whenever the program invokes self.session.run(...), it loads the same two CUDA libraries (libcublas and libcudnn) again, which wastes time.
I am confused about where the problem comes from and how to resolve it. Could anyone help?
Discussion Issue on Github
===================================
Update
With @talonmies's help, the problem was resolved by rebuilding the environment with matching versions of the GPU driver, CUDA, cuDNN and TensorFlow. Now it works smoothly.
Generally, you can observe this behavior whenever the TF, CUDA and cuDNN versions are incompatible.
For the GeForce RTX 3060, support starts from CUDA 11.x. Once you upgrade to TF 2.4 or TF 2.5, your issue will be resolved.
For the benefit of the community, here are the tested build configurations:
CUDA Support Matrix

how to check which cuda is being used by tensorflow gpu

On my laptop there are three versions of CUDA installed (8.0, 9.0 and 10.0), all of which are on the environment path. When I use tensorflow-gpu 2.0.0, how can I tell which version of CUDA is actually being used, leaving aside the fact that this version of TensorFlow is only compatible with CUDA 10.0? Is there a way to print this information in the Python console?
I found an answer here (get the CUDA and cuDNN version on Windows with Anaconda installed):
from tensorflow.python.platform import build_info as tf_build_info
print(tf_build_info.cuda_version_number)
#10.0
print(tf_build_info.cudnn_version_number)
#7
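The attribute names above are the ones this answer used with TensorFlow 2.0, while the TensorFlow 2.10 snippet earlier on this page reads the same values from a build_info dictionary. A minimal sketch that tolerates both layouts is shown below; read_build_versions is a hypothetical helper name, and in either layout the values describe what the binary was built against, not necessarily which of several installed CUDA toolkits gets loaded at runtime.

```python
def read_build_versions(build):
    """Return (cuda, cudnn) from a tensorflow.python.platform.build_info module.

    Values that are not present (for example in a CPU-only build) come back
    as None.
    """
    info = getattr(build, "build_info", None)
    if isinstance(info, dict):
        # Dictionary layout, as in the TF 2.10 snippet on this page.
        return info.get("cuda_version"), info.get("cudnn_version")
    # Attribute layout, as in the TF 2.0 snippet on this page.
    return (
        getattr(build, "cuda_version_number", None),
        getattr(build, "cudnn_version_number", None),
    )


if __name__ == "__main__":
    try:
        from tensorflow.python.platform import build_info
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        print(read_build_versions(build_info))
```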

Unable to configure tensorflow to use GPU acceleration in Ubuntu 16.04

I am trying to install TensorFlow on Ubuntu 16.04 (on Google Cloud). So far I have created a compute instance and added an NVIDIA Tesla K80 to it.
I also made sure that the proper version of TensorFlow (1.14.0) is installed, that CUDA 8.0 is installed, and that cuDNN 6.0 is installed, as per the TensorFlow GPU to CUDA mapping.
When I run a simple tensorflow program, I get
Cannot assign a device for operation MatMul: {{node MatMul}}was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
Can anyone please tell me what I am doing wrong? Is the instance selection correct?
Thanks for your help.
The CUDA and cuDNN versions that have been tested with TensorFlow 1.14 are 10.0 and 7.4, respectively.
More information about version compatibility can be found here.
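The error message in the question already lists the devices TensorFlow could register, and none of them is a plain GPU device. As a rough illustration, the sketch below extracts that list from the message text; the helper names are hypothetical, and the interpretation that an XLA_GPU entry without a matching GPU entry typically points to CUDA libraries that failed to load is an assumption rather than something the message states.

```python
import re


def available_devices(message):
    """Extract the device names listed after 'available devices are [...]'."""
    m = re.search(r"available devices are \[(.*?)\]", message)
    if m is None:
        return []
    return [d.strip() for d in m.group(1).split(",") if d.strip()]


def has_plain_gpu(devices):
    """True if any entry is a regular GPU device such as /device:GPU:0."""
    # "/device:XLA_GPU:0" does not match, because "XLA_" follows "/device:".
    return any(re.search(r"/device:GPU:\d+$", d) for d in devices)


error = (
    "Cannot assign a device for operation MatMul: {{node MatMul}}was explicitly "
    "assigned to /device:GPU:0 but available devices are [ "
    "/job:localhost/replica:0/task:0/device:CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_CPU:0, "
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. "
    "Make sure the device specification refers to a valid device."
)

devices = available_devices(error)
print(devices)
print(has_plain_gpu(devices))
```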

How to run tensorflow-gpu on Nvidia Quadro GV100?

I am currently working as a working student and now I have trouble installing Tensorflow-gpu on a machine using a Nvidia Quadro GV100 GPU.
On the Tensorflow homepage I found out that I need to install CUDA 9.0 and Cudnn 7.x in order to run Tensorflow-gpu 1.9. The problem is that I can't find a suitable CUDA version supporting the GV100. Could it be that there is no CUDA version yet? Is it possible that one can't use the GV100 for tensorflow-gpu?
Sorry for the stupid question, I am new to installing DL frameworks :-)
Thank you very much for your help!
On the Tensorflow homepage I found out that I need to install CUDA 9.0 and Cudnn 7.x in order to run Tensorflow-gpu 1.9.
That is if you want to install a pre-built Tensorflow binary distribution. In that case you need to use the version of CUDA which the Tensorflow binaries were built against, which in this case is CUDA 9.0.
The problem is that I can't find a suitable CUDA version supporting the GV100
The CUDA 9.0 and later toolkits fully support Volta cards and that should include the Quadro GV100. The driver which ships with CUDA 9.0 is a 384 series which won't support your GPU. If you are referring to a driver support issue, then the solution would be to install the recommended driver for your GPU, and only install the CUDA toolkit from the CUDA 9.0 bundle, not the toolkit and driver, which is the default.
Otherwise you can use CUDA 9.1 or 9.2, which should have support for your GPU with their supplied drivers, but you will then need to build Tensorflow yourself from source.

Is there a tensorflow version that is compatible with Cuda 9.0 and cudnn 7.1

I have a machine with cuda 9.0 and cudnn 7.1.
I've tried using tensorflow 1.7.0 on this machine, but it does not work because this version of tensorflow was built for cudnn 7.0.
I'm getting this error when launching a training on my gpu:
Loaded runtime CuDNN library: 7102 (compatibility version 7100) but source was compiled with 7005 (compatibility version 7000).
Is there a tensorflow version that is compatible with my cuda and cudnn versions? I also need this working tensorflow version to be >=1.7.0.
I have googled this, searched every question but I never got answers for these particular versions of cuda and cudnn.
This should be possible with tensorflow_gpu-1.9.0. Linked below is a table which displays compatibilities of CUDA and cuDNN with varying versions of tensorflow.
https://www.tensorflow.org/install/install_sources#tested_source_configurations
Ok, seems I missed some installation steps.
By installing the last version of tensorflow, which at the time of writing is 1.9.0, it did work on my machine.
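The numbers in the error message from the question can be read as cuDNN versions. A minimal sketch, assuming the integer encoding used by cuDNN 7.x (major * 1000 + minor * 100 + patch), decodes 7102 and 7005 as follows; the function names are hypothetical.

```python
import re


def decode_cudnn(number):
    """Split an integer such as 7102 into (major, minor, patch)."""
    major, rest = divmod(number, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def parse_mismatch(message):
    """Return (runtime_version, compiled_version) tuples, or None if absent."""
    m = re.search(
        r"Loaded runtime CuDNN library: (\d+).*?compiled with (\d+)", message
    )
    if m is None:
        return None
    return decode_cudnn(int(m.group(1))), decode_cudnn(int(m.group(2)))


error = (
    "Loaded runtime CuDNN library: 7102 (compatibility version 7100) but "
    "source was compiled with 7005 (compatibility version 7000)."
)

runtime, compiled = parse_mismatch(error)
print("runtime cuDNN:", ".".join(map(str, runtime)))
print("compiled against:", ".".join(map(str, compiled)))
```

Read this way, the installed library is cuDNN 7.1.2 while the TensorFlow 1.7.0 binary was compiled against 7.0.5, which agrees with the question's description of the mismatch.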