tensorflow on GPU: no known devices, despite cuda's deviceQuery returning a "PASS" result

Note: this question was initially asked on github, but I was asked to post it here instead.
I'm having trouble running tensorflow on gpu, and it does not seem to be the usual cuda configuration problem, because everything indicates that cuda is properly set up.
The main symptom: when running tensorflow, my gpu is not detected (the code being run, and its output).
What differs from the usual issues is that cuda seems properly installed, and running ./deviceQuery from the cuda samples is successful (output).
I have two graphics cards:
an old GTX 650 used for my monitors (I don't want to use that one with tensorflow)
a GTX 1060 that I want to dedicate to tensorflow
I use:
tensorflow-1.0.0
cuda-8.0 (ls -l /usr/local/cuda/lib64/libcud*)
cudnn-5.1.10
python-2.7.12
nvidia-drivers-375.26 (this was installed by cuda and replaced my distro driver package)
I've tried:
adding /usr/local/cuda/bin/ to $PATH
forcing gpu placement in tensorflow script using with tf.device('/gpu:1'): (and with tf.device('/gpu:0'): when it failed, for good measure)
whitelisting the gpu I wanted to use with CUDA_VISIBLE_DEVICES, in case the presence of my old unsupported card did cause problems
running the script with sudo (because why not)
Here are the outputs of nvidia-smi and nvidia-debugdump -l, in case it's useful.
At this point, I feel like I have followed all the breadcrumbs and have no idea what else I could try. I'm not even sure whether I'm looking at a bug or a configuration problem. Any advice about how to debug this would be greatly appreciated. Thanks!
Update: with the help of Yaroslav on github, I gathered more debugging info by raising log level, but it doesn't seem to say much about the device selection : https://gist.github.com/oelmekki/760a37ca50bf58d4f03f46d104b798bb
Update 2: theano detects the gpu correctly, but interestingly it complains about cuDNN being too recent, then falls back to cpu (code ran, output). Maybe that could be the problem with tensorflow as well?

From the log output, it looks like you are running the CPU version of TensorFlow (PyPI: tensorflow), and not the GPU version (PyPI: tensorflow-gpu). Running the GPU version would either log information about the CUDA libraries, or an error if it failed to load them or open the driver.
If you run the following commands, you should be able to use the GPU in subsequent runs:
$ pip uninstall tensorflow
$ pip install tensorflow-gpu

None of the other answers here worked for me. After a bit of tinkering I found that this fixed my issues when dealing with Tensorflow built from binary:
Step 0: Uninstall protobuf
pip uninstall protobuf
Step 1: Uninstall tensorflow
pip uninstall tensorflow
pip uninstall tensorflow-gpu
Step 2: Force reinstall Tensorflow with GPU support
pip install --upgrade --force-reinstall tensorflow-gpu
Step 3: If you haven't already, set CUDA_VISIBLE_DEVICES
So for me with 2 GPUs it would be
export CUDA_VISIBLE_DEVICES=0,1
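The same setting can also be applied from inside a script instead of the shell. Here is a minimal sketch, assuming the variable is set before TensorFlow is imported, since the CUDA runtime reads it when it initializes. The helper name select_gpus is made up for illustration and is not part of TensorFlow:

```python
import os


def select_gpus(indices, env=None):
    """Write CUDA_VISIBLE_DEVICES for the given GPU indices and return its value.

    `select_gpus` is a hypothetical helper, not a TensorFlow API.
    `env` defaults to the real process environment; pass a dict to experiment.
    """
    env = os.environ if env is None else env
    value = ",".join(str(i) for i in indices)
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    # Mirrors `export CUDA_VISIBLE_DEVICES=0,1` for this process only.
    select_gpus([0, 1])
    # Import TensorFlow only after this point so the setting is picked up.
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Unlike the export in .bashrc, this only affects the current Python process and its children.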

In my case:
pip3 uninstall tensorflow
is not enough, because reinstalling with:
pip3 install tensorflow-gpu
still gives me the cpu build of tensorflow, not the gpu one.
So, before installing tensorflow-gpu, I removed all the related tensor folders in site-packages and uninstalled protobuf, and it worked!
In summary:
pip3 uninstall tensorflow
Remove all tensor folders in ~\Python35\Lib\site-packages
pip3 uninstall protobuf
pip3 install tensorflow-gpu
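To see which folders are left behind after the uninstall, something like the following could help. It is only a sketch: the function name tensorflow_leftovers is made up, and I am assuming that "tensor folders" means every entry in site-packages whose name starts with "tensor" (so tensorboard is matched as well):

```python
import os
import sysconfig


def tensorflow_leftovers(names):
    """Return the sorted names that start with 'tensor' (case-insensitive).

    Hypothetical helper: the 'tensor' prefix is an assumption about what
    counts as a leftover, taken from the steps above.
    """
    return sorted(n for n in names if n.lower().startswith("tensor"))


if __name__ == "__main__":
    # "purelib" is the site-packages directory of the running interpreter.
    site_packages = sysconfig.get_paths()["purelib"]
    for name in tensorflow_leftovers(os.listdir(site_packages)):
        print(os.path.join(site_packages, name))
```

It only lists the candidates; deleting them is left as a manual step so nothing is removed by accident.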

Might seem dumb but a sudo reboot has fixed the exact same problem for me and a couple others.

The answer that saved my day came from Mark Sonn. If you are on Linux, simply add this to .bashrc and then run source ~/.bashrc:
export CUDA_VISIBLE_DEVICES=0,1
Previously I had to use this workaround to get tensorflow to recognize my GPU:
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices(device_type="GPU")
tf.config.experimental.set_visible_devices(devices=gpus[0], device_type="GPU")
tf.config.experimental.set_memory_growth(device=gpus[0], enable=True)
Even though the code still worked, adding these lines every time is clearly not something I want to do.
My version of tensorflow (v2.3) was built from source, following the documentation, so that it supports CUDA 10.2 and cudnn 7.6.5.
If anyone is having trouble with that, I suggest a quick skim over the docs. The build took 1.5 hours with bazel. Make sure you have gcc7 and bazel installed.

This error may be caused by your GPU's compute capability. The prebuilt binaries officially support GPUs with a compute capability within 3.5 ~ 5.0; you can look up your card here: https://en.wikipedia.org/wiki/CUDA
In my case, the error was like this:
Ignoring visible gpu device (device: 0, name: GeForce GT 640M, pci bus id: 0000:01:00.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
For now, the only way to get around the '3.5~5.0' limit is to compile from source on Linux (or macOS).
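If you want to check this from a log file rather than by eye, a small sketch like this pulls the compute capability out of a line shaped like the one quoted above. The function names are made up, the regular expression is only fitted to that one message, and the 3.5 minimum is taken from the same message:

```python
import re

# Minimum taken from the log message quoted above; other builds may differ.
MIN_CAPABILITY = (3, 5)


def parse_compute_capability(log_line):
    """Return (major, minor) from the first 'compute capability: X.Y' in the line.

    Hypothetical helper; returns None when the pattern is not present.
    """
    match = re.search(r"compute capability:?\s*(\d+)\.(\d+)", log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_minimum(capability, minimum=MIN_CAPABILITY):
    """Compare (major, minor) tuples, so (3, 0) < (3, 5) < (6, 1)."""
    return capability >= minimum


if __name__ == "__main__":
    line = ("Ignoring visible gpu device (device: 0, name: GeForce GT 640M, "
            "pci bus id: 0000:01:00.0, compute capability: 3.0)")
    capability = parse_compute_capability(line)
    print(capability, meets_minimum(capability))
```

Comparing tuples instead of floats avoids surprises with values such as 3.10 versus 3.5.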

There are various kinds of system incompatibility problems.
The required library versions vary with the version of TensorFlow.
When you run Python in interactive mode, a lot of useful information is printed to stderr. For TensorFlow 2.0 or later, I suggest running:
python3.8 -c "import tensorflow as tf; print('tf version:', tf.__version__); print(tf.config.list_physical_devices())"
After running this command, you will see which libraries (or which versions of them) are missing for GPU support, in addition to the requirements listed at:
https://www.tensorflow.org/install/gpu#software_requirements
https://www.tensorflow.org/install/gpu#hardware_requirements
P.S. CUDA_VISIBLE_DEVICES is not specific to TensorFlow. More generally, it is a way to restrict which GPUs are available to any launched process.
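As a rough complement to reading stderr, the standard library can tell you whether the loader is able to find a given shared library at all. This is only a sketch: locate_libraries is a made-up name, the short names "cudart", "cudnn" and "cublas" are my assumption about which libraries matter, and what ctypes finds may differ from what TensorFlow actually tries to load (it looks for specific versions):

```python
from ctypes.util import find_library


def locate_libraries(names):
    """Map each short library name to what find_library reports, or None.

    Hypothetical helper built on ctypes.util.find_library from the
    standard library; it does not check library versions.
    """
    return {name: find_library(name) for name in names}


if __name__ == "__main__":
    # Assumed list of CUDA-related libraries; adjust to your setup.
    for name, found in locate_libraries(["cudart", "cudnn", "cublas"]).items():
        print(name, "->", found if found else "not found")
```

A "not found" here is a hint to check the install paths, not proof that TensorFlow cannot use the GPU.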

For anaconda users: I installed tensorflow-gpu via the GUI using Anaconda Navigator and configured the NVIDIA GPU as in the tensorflow guide, but tensorflow couldn't find the GPU anyway. Then I uninstalled tensorflow, again via the GUI (see here), and reinstalled it from the command line in an anaconda prompt by issuing:
conda install -c anaconda tensorflow-gpu
and then tensorflow could find the GPU correctly.

Related

How to use system GPU in Jupyter notebook?

I tried a lot of things before I finally figured out this approach. There are a lot of videos and blogs telling you to install the Cuda toolkit and cuDNN from the website and to check for compatible versions. But this is not required anymore; all you have to do is the following:
pip install tensorflow-gpu
pip install cuda
pip install cudnn
then use the following code to check if your GPU is active in the current notebook
import tensorflow as tf
from tensorflow.python.client import device_lib
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.list_physical_devices('GPU')
device_lib.list_local_devices()
tf.test.is_built_with_cuda()
tf.debugging.set_log_device_placement(True)
I just want to confirm: are these steps enough to enable the GPU in a Jupyter notebook, or am I missing something here?
If you installed the compatible versions of CUDA and cuDNN (relative to your GPU), Tensorflow should use that since you installed tensorflow-gpu. If you want to be sure, run a simple demo and check out the usage on the task manager.

Using Object Detection API on local GPU but not the latest version (v2.5.0)

I am trying to use my local GPU to train an EfficientDetD0 model. I already have a good pipeline (that works on Google Colab for example), I modified it a bit to use it locally, but one problem happens every time I launch the training.
I use conda to install tensorflow-gpu with cuda and cudnn, which creates a TensorFlow v2.4.1 environment, but when I launch the training, the Object Detection API automatically installs TensorFlow v2.5.0. As a result, my env does not use the gpu for training, because the installed cuda and cudnn expect TensorFlow v2.4.1 and not v2.5.0.
Is there a way to get the Object Detection API to use v2.4.1 and not v2.5.0?
I tried many things, but none of them worked (training either fails or falls back to CPU training).
Here is the code that installs the dependencies and overwrites the TensorFlow version with v2.5.0:
os.system("cp object_detection/packages/tf2/setup.py .")
os.system("python -m pip install .")
SYSTEM:
gpu : Nvidia RTX 3070
os : Ubuntu 20.04 LTS
tensorflow: 2.4.1
P.S.: I go with conda install -c conda-forge tensorflow-gpu for installing TensorFlow, cuda and cudnn in my training env because manually there was a dependency problem, so I took the easy way.
EDIT: solution found; it is explained in the comments.
Follow these steps to install a specific version of tensorflow gpu
1. Set Up Anaconda Environments
conda create -n tf_gpu cudatoolkit=11.0
2. Activate the new Environment
source activate tf_gpu
3. Install tensorflow-gpu 2.4.1
pip install tensorflow==2.4.1
Try to run object_detection without "installing" it. Don't run setup.py. Just set up the necessary paths and packages manually.
Or edit the setup.py to skip installing the specific version of TF. I guess that this version is a requirement of some of the packages installed in setup.py.
I use object_detection without running the setup.py or doing any "installation", and without any problems.
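Whichever route you take, it may be worth checking before training that the pinned version is still the one installed. Here is a minimal sketch, assuming Python 3.8+ (for importlib.metadata) and assuming the distribution is registered under the name "tensorflow", which may not hold for every conda package. The function names are made up:

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, expected):
    """True only when a version is installed and equals the expected pin."""
    return installed is not None and installed == expected


if __name__ == "__main__":
    # "tensorflow" and "2.4.1" come from the question; adjust as needed.
    found = installed_version("tensorflow")
    if not matches_pin(found, "2.4.1"):
        print("TensorFlow is %s, expected 2.4.1" % found)
```

Running this right after the setup step shows immediately whether something overwrote the version.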

What is the proper configuration for Quadro RTX3000 to run tensorflow with GPU?

My laptop System is Win10, with GPU NVIDIA Quadro RTX3000.
While trying to set up TensorFlow with GPU support, it never recognizes my GPU.
What is the proper configuration for CUDA/CUDNN/Tensorflow etc.?
I struggled for a while before getting it to work.
Here is my configuration:
Win10
RTX 3000
Nvidia driver version 456.71
cuda_11.0.3_451.82_win10 (it doesn't work with version 11.1, not sure why)
cudnn -v8.0.4.30
Python 3.8.7
Tensorflow 2.5.0-dev20210106 (2.4 didn't work with cuda 11.x for me)
For future reference, you could have simply installed Anaconda on Windows and run the command conda install -c anaconda tensorflow-gpu, which would install the correct versions of CUDA, cuDNN and Tensorflow in a separate environment, making the Tensorflow installation effortless.
It's the easiest solution, one that works out of the box and automates all the tasks.

Tensorflow will not run on GPU

I'm a newbie when it comes to AWS and Tensorflow and I've been learning about CNNs over the last week via Udacity's Machine Learning course.
Now I need to use an AWS GPU instance. I launched a p2.xlarge instance of Deep Learning AMI with Source Code (CUDA 8, Ubuntu) (that's what they recommended).
But now, it seems that tensorflow is not using the GPU at all. It's still training using the CPU. I did some searching and I found some answers to this problem and none of them seemed to work.
When I run the Jupyter notebook, it still uses the CPU
What do I do to get it to run on the GPU and not the CPU?
The problem of tensorflow not detecting GPU can possibly be due to one of the following reasons.
Only the tensorflow CPU version is installed in the system.
Both tensorflow CPU and GPU versions are installed in the system, but the Python environment is preferring CPU version over GPU version.
Before proceeding to solve the issue, we assume that the installed environment is an AWS Deep Learning AMI having CUDA 8.0 and tensorflow version 1.4.1 installed. This assumption is derived from the discussion in comments.
To solve the problem, we proceed as follows:
Check the installed version of tensorflow by executing the following command from the OS terminal.
pip freeze | grep tensorflow
If only the CPU version is installed, then remove it and install the GPU version by executing the following commands.
pip uninstall tensorflow
pip install tensorflow-gpu==1.4.1
If both CPU and GPU versions are installed, then remove both of them, and install the GPU version only.
pip uninstall tensorflow
pip uninstall tensorflow-gpu
pip install tensorflow-gpu==1.4.1
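The three cases above can also be told apart programmatically. This is a minimal sketch that takes the text printed by pip freeze and reports which case applies. The function name classify_tensorflow_install is made up, and the parsing assumes the usual name==version lines:

```python
def classify_tensorflow_install(freeze_output):
    """Return 'cpu', 'gpu', 'both' or 'none' from `pip freeze` text.

    Hypothetical helper; only lines of the form name==version are understood.
    """
    names = set()
    for line in freeze_output.splitlines():
        # Keep the part before '==' and normalise the case.
        names.add(line.split("==")[0].strip().lower())
    has_cpu = "tensorflow" in names
    has_gpu = "tensorflow-gpu" in names
    if has_cpu and has_gpu:
        return "both"
    if has_gpu:
        return "gpu"
    if has_cpu:
        return "cpu"
    return "none"


if __name__ == "__main__":
    sample = "numpy==1.13.3\ntensorflow==1.4.1\ntensorflow-gpu==1.4.1\n"
    print(classify_tensorflow_install(sample))
```

Related packages such as tensorflow-tensorboard are deliberately not counted, since only exact names are matched.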
At this point, if all the dependencies of tensorflow are installed correctly, the tensorflow GPU version should work fine. A common error at this stage (as encountered by OP) is a missing cuDNN library, which can result in the following error when importing tensorflow into a python module:
ImportError: libcudnn.so.6: cannot open shared object file: No such file or directory
It can be fixed by installing the correct version of NVIDIA's cuDNN library. Tensorflow version 1.4.1 depends upon cuDNN version 6.0 and CUDA 8, so we download the corresponding version from the cuDNN archive page (Download Link). We have to log in to an NVIDIA developer account to be able to download the file, so it is not possible to download it using command line tools such as wget or curl. A possible solution is to download the file on the host system and use scp to copy it onto AWS.
Once copied to AWS, extract the file using the following command:
tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
The extracted directory should have structure similar to the CUDA toolkit installation directory. Assuming that CUDA toolkit is installed in the directory /usr/local/cuda, we can install cuDNN by copying the files from the downloaded archive into corresponding folders of CUDA Toolkit installation directory followed by linker update command ldconfig as follows:
cp cuda/include/* /usr/local/cuda/include
cp cuda/lib64/* /usr/local/cuda/lib64
ldconfig
After this, we should be able to import tensorflow GPU version into our python modules.
A few considerations:
If we are using Python3, pip should be replaced with pip3.
Depending upon user privileges, the commands pip, cp and ldconfig may need to be run with sudo.
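After copying the files, one way to confirm that the library named in the ImportError is now in place is to look for it in the CUDA lib directory. This is a minimal sketch under the same assumption as above (CUDA toolkit in /usr/local/cuda); both function names are made up:

```python
import os


def find_cudnn_libraries(lib_dir):
    """Return the sorted libcudnn.so* file names in `lib_dir` (empty if missing)."""
    if not os.path.isdir(lib_dir):
        return []
    return sorted(n for n in os.listdir(lib_dir) if n.startswith("libcudnn.so"))


def has_cudnn_major(lib_names, major):
    """True if libcudnn.so.<major> or a more specific version of it is listed."""
    wanted = "libcudnn.so.%d" % major
    return any(n == wanted or n.startswith(wanted + ".") for n in lib_names)


if __name__ == "__main__":
    # Path assumed from the steps above; major version 6 is the one
    # named in the ImportError for tensorflow 1.4.1.
    found = find_cudnn_libraries("/usr/local/cuda/lib64")
    print(found)
    print("cuDNN 6 present:", has_cudnn_major(found, 6))
```

If the file is listed but the import still fails, re-running ldconfig is the next thing to check.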

Keras with Tensorflow backend on GPU. MKL ERROR: Parameter 4 was incorrect on entry to DLASCL

I installed Tensorflow with GPU support and Keras into an environment in Anaconda (v1.6.5) by using the following commands:
conda install -n EnvName tensorflow-gpu
conda install -n EnvName -c conda-forge keras-gpu
I have NVIDIA Quadro 2200K on my machine with driver v384.66, cuda-8.0, cudnn 7.0
When I try to run Python code with Keras, I get the following at the training stage:
Intel MKL ERROR: Parameter 4 was incorrect on entry to DLASCL.
and later
File "/home/User/anaconda3/envs/keras_gpu/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 99, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge
Other relevant sources suggest checking the data for NaNs and Infs, but my data is clean for sure. By the way, the CPU version of the installation works fine; the issue occurs only when trying to run on GPU.
I tried to reinstall Anaconda, to reinstall CUDA and numpy, but it didn't work out.
The problem was in the mkl package (2018.0.0): it seems it was released recently and conflicts with the versions of some packages supplied with Tensorflow (1.3.0) and Keras (2.0.5) via conda.
So I manually downgraded mkl to v11.3.3 using Anaconda Navigator, which automatically downgraded other packages as well, and everything is working well now.