I already have a CUDA toolkit installed, why is conda installing CUDA again? - tensorflow

I have installed CUDA 11.2 and cuDNN 8.1 on Ubuntu:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0
When I installed tensorflow-gpu in a conda environment, it installed CUDA and cuDNN again.
Why is this happening?
How can I stop conda from installing CUDA and cuDNN again?
Can I just use the CUDA and cuDNN that I have already installed? If yes, how?

Why is it happening?
Conda expects to manage any packages you install and all their dependencies. The intention is that you literally never have to install anything else by hand for any packages they distribute in their own channel. If a GPU accelerated package requires a CUDA runtime, conda will try to select and install a correctly versioned CUDA runtime for the version of the Python package it has selected for installation.
How to stop conda from installing cuda and cudnn again?
You probably can't, or at least not without winding up with a non-functional TensorFlow installation. But note that what conda installs is only the correctly versioned CUDA runtime components needed to make its GPU-accelerated packages work. The one thing it doesn't and can't install is a GPU driver for the hardware.
Can I just use cuda and cudnn that I have already installed?
You say you installed CUDA 11.2. If you look at the conda output, you can see that it wants to install a CUDA 10.2 runtime. As you are now fully aware, versioning is critical to Tensorflow and a Tensorflow build requiring CUDA 10.2 won't work with CUDA 11.2. So even if you were to stop conda from performing the dependency installation, there is a version mismatch so it wouldn't work.
If yes, how?
See above.
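To make the version comparison concrete, here is a minimal sketch that reads the release number out of `nvcc --version` text like the output in the question and compares it with the version a package expects. The function names are hypothetical, and the strict major.minor equality is an assumption that follows the reasoning in this answer.

```python
import re


def parse_nvcc_release(nvcc_output: str):
    """Return (major, minor) from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def same_cuda_release(installed, required):
    """True only when both the major and minor numbers agree (assumed rule)."""
    return installed == required


# Release line copied from the nvcc output in the question.
sample = "Cuda compilation tools, release 11.2, V11.2.67"
installed = parse_nvcc_release(sample)
print(installed, same_cuda_release(installed, (10, 2)))
```

With the toolkit from the question and a package that wants CUDA 10.2, the comparison fails, which is the mismatch described above.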

Related

Could not load dynamic library 'libcudart.so.11.0';

I am trying to use Tensorflow 2.7.0 with GPU, but I am constantly running into the same issue:
2022-02-03 08:32:31.822484: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/.cache/pypoetry/virtualenvs/poetry_env/lib/python3.7/site-packages/cv2/../../lib64:/home/username/miniconda3/envs/project/lib/
2022-02-03 08:32:31.822528: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
This issue has already appeared multiple times here and on GitHub. However, the solutions usually proposed are to a) download the missing CUDA files, b) downgrade/upgrade to the correct CUDA version, or c) set the correct LD_LIBRARY_PATH.
I have already been using my PC with CUDA-enabled PyTorch without a single issue. My nvidia-smi reports version 11.0, which is exactly the only one I want to have. Also, if I try to run:
import os
LD_LIBRARY_PATH = '/home/username/miniconda3/envs/project/lib/'
print(os.path.exists(os.path.join(LD_LIBRARY_PATH, "libcudart.so.11.0")))
it returns True. This is exactly the part of LD_LIBRARY_PATH from the error message, where Tensorflow, apparently, cannot see the libcudart.so.11.0 (which IS there).
Is there something really obvious that I am missing?
nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.156.00 Driver Version: 450.156.00 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
nvcc:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
First, find out where libcudart.so.11.0 is located.
If your error names a different library, substitute that name in the command below:
sudo find / -name 'libcudart.so.11.0'
On my system this prints the location of libcudart.so.11.0:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudart.so.11.0
If the command prints nothing, make sure you have installed CUDA and anything else it requires on your system.
Second, add the path to the environment file.
# edit /etc/profile
sudo vim /etc/profile
# append path to "LD_LIBRARY_PATH" in profile file
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.1/targets/x86_64-linux/lib
# make environment file work
source /etc/profile
You may also refer to this link
A third thing you may try is:
conda install cudatoolkit
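A file existing on disk does not guarantee that the dynamic loader can open it. As a rough check, the sketch below asks the loader directly through Python's standard `ctypes` module. The helper name `try_load` is hypothetical, and this only approximates what TensorFlow does internally.

```python
import ctypes


def try_load(library_name: str):
    """Ask the dynamic loader to open a shared library by name.

    Returns (True, "") on success, or (False, error text) on failure.
    """
    try:
        ctypes.CDLL(library_name)
    except OSError as exc:
        return False, str(exc)
    return True, ""


ok, error = try_load("libcudart.so.11.0")
print("loadable:", ok)
if not ok:
    print("loader said:", error)
```

Run it from the same shell and environment you use for TensorFlow, since the result depends on the library search path of that process.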
I installed the correct versions for TF 2.8 (CUDA 11.3 and cuDNN 8.2.1), based on https://www.tensorflow.org/install/source#gpu, using the following commands:
conda uninstall cudatoolkit
conda install cudnn
Then I found the library location with
sudo find / -name 'libcudnn'
and exported the dynamic link loader path:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/usr/miniconda3/envs/tf2/lib/
After that, the system was able to find the required libraries and use the GPU for training.
Hope it helped.
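To see which directories on the loader path actually contain a given library, a small sketch like the following may help. The function name and the `libcudnn*` pattern are illustrative assumptions, not part of any tool mentioned above.

```python
import glob
import os


def find_in_library_path(pattern: str, library_path: str):
    """List files matching `pattern` in each directory of a path list.

    `library_path` is separated by os.pathsep (':' on Linux).
    """
    hits = []
    for directory in library_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a leading separator
        hits.extend(sorted(glob.glob(os.path.join(directory, pattern))))
    return hits


for path in find_in_library_path("libcudnn*", os.environ.get("LD_LIBRARY_PATH", "")):
    print(path)
```

If nothing is printed, none of the directories currently listed in LD_LIBRARY_PATH holds a matching file.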
I faced the same issue with TensorFlow 2.9 and CUDA 11.7 on Arch Linux x86_64 with two NVIDIA GPUs (1080 Ti / Titan RTX) and solved it.
It is not absolutely necessary to follow the compatibility matrix exactly (I used CUDA 11.7 instead of 11.2, i.e. a newer minor version). However, I did downgrade Python 3 according to the TensorFlow compatibility matrix (from 3.10 to 3.7).
Note that you can have multiple CUDA versions installed and switch between them with a symlink on Linux (Windows will differ a bit).
Setup with conda and Python 3.7:
sudo pacman -S base-devel cudnn
conda activate tf-2.9
conda uninstall cudatoolkit && conda install cudnn
I also had to update gcc for another library (off topic):
conda install -c conda-forge gcc=12.1.0
I added this debug snippet from the tf-gpu docs:
import tensorflow as tf
tf.config.list_physical_devices('GPU')
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
I now see 2 GPUs detected instead of 0, and training time is divided by 10.
nvidia-smi reports GPU memory usage maxed out and power draw raised from 9 W to 150 W, confirming that the GPU is in use (the other was left idle).
Root cause: cuDNN was not installed system-wide.
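Regarding the note above about managing multiple CUDA versions with a symlink: a quick way to see which installation a link currently selects is to resolve it. This is a minimal sketch; the default path `/usr/local/cuda` is an assumption about a common Linux layout and may differ on your system, and the function name is hypothetical.

```python
import os


def resolve_cuda_symlink(link_path: str = "/usr/local/cuda"):
    """Return the real directory a CUDA symlink points to, or None if missing."""
    if not os.path.lexists(link_path):
        return None
    return os.path.realpath(link_path)


print(resolve_cuda_symlink())
```

The printed path shows which versioned directory the link points at, or None if the link does not exist.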

Tensorflow 1.15 cannot detect gpu with Cuda10.1

I have installed both tensorflow 2.2.0 and tensorflow 1.15.0(by pip install tensorflow-gpu==1.15.0). The tensorflow 2 is installed in the base environment of Anaconda 3, while the tensorflow 1 is installed in a separate environment.
The tensorflow 2.2.0 can recognize gpu based on a simple test:
import tensorflow as tf
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
# output: Default GPU Device: /device:GPU:0
But the tensorflow 1.15.0 can not detect gpu.
For your information, my system environment is python + cuda 10.1 + vs 2015.
TensorFlow versions 1.15.0 to 1.15.3 (the latest version) are all compiled against CUDA 10.0. Downgrading from CUDA 10.1 to CUDA 10.0 solved the problem.
Also be aware of the python version. It is recommended to install the tensorflow .whl file (as listed at https://nero-mirror.stanford.edu/pypi/simple/tensorflow-gpu/) for the specific python version. As for installation, see How do I install a Python package with a .whl file?
Tensorflow 1.15 expects cuda 10.0 , but I managed to make it work with cuda 10.1 by installing the following packages with Anaconda: cudatoolkit (10.0) and cudnn (7.6.5). So, after running
conda install cudatoolkit=10.0
conda install cudnn=7.6.5
tensorflow 1.15 was able to find and use GPU (which is using cuda 10.1).
PS: I understand your environment is Windows based, but this question pops on Google for the same problem happening on Linux (where I tested this solution). Might be useful also on Windows.
It might have to do with version compatibility between TF, CUDA and cuDNN. This post discusses it thoroughly.
Have you tried installing Anaconda? It downloads all the requirements and makes it easy for you with just a few clicks.

Is Tensorflow 1.12 compatible with CUDA 10.1?

I've been able to successfully set up an Ubuntu 18.04 server with nvidia-smi 418.39, Driver version 418.39, and CUDA 10.1
I now have a user who wants to run TensorFlow but insists that it is not compatible with CUDA 10.1, only CUDA 10. There is no statement confirming this online anywhere that I can find, nor is it in any release patch notes from TF. Because setting this system up was kind of a pain to do, I'm a little hesitant to try downgrading just one version.
Does anyone have verification whether TensorFlow 1.12 does or does not work with CUDA 10.1?
I can confirm that even tf 1.13.1 only works with CUDA 10.0 for me, not 10.1.
I don't know whether a symlink would work, though.
If you try to run tf 1.13.1 on CUDA 10.1, it will give you "ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory"
TensorFlow 1.12 (and even later versions 1.13.1 and 2.0.0-alpha0) could not be built against CUDA 10.1, thus can be considered incompatible.
I have tried building TensorFlow from source with GPU support. The TensorFlow versions I considered were 1.13.1 and 2.0.0-alpha0. The machine I used runs CentOS 7.6 with GCC 4.8.5. I have the NVIDIA Driver version 418.67 installed (which has the release date 2019.5.7 and supports CUDA Toolkit 10.1).
I succeeded in building both TensorFlow versions with CUDA 10.0 and cuDNN 7.6.0 + NCCL 2.4.7 (for CUDA 10.0). Note that you don't need to have the GPU attached to the machine (especially if you're using a VM in the cloud) while you're building TensorFlow with GPU support.
However, when I switched to CUDA 10.1 and cuDNN 7.6.0 + NCCL 2.4.7 (for CUDA 10.1), neither of these TensorFlow versions could be built. Besides the changed location of libcublas, another source of the error is that no libcudart.so* files are found in cuda-10.1/lib64/ (while they do exist in cuda-10.0/lib64/).
I can also confirm that tf 1.13.1 does not work with CUDA 10.1. While importing tensorflow you will get the following error
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory
running ldconfig -v shows the difference
libcublas.so.10.0 vs libcublas.so.10.1.0.105
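When comparing what TensorFlow asks for with what ldconfig lists, it can help to pull the library name out of the error text and split off its version suffix. The sketch below does that with the standard library only; both helper names are hypothetical.

```python
import re


def missing_library(error_text: str):
    """Extract the library name from a 'cannot open shared object file' message."""
    match = re.search(r"(\S+): cannot open shared object file", error_text)
    return match.group(1) if match else None


def split_soname(soname: str):
    """Split a name such as 'libcublas.so.10.0' into ('libcublas', '10.0')."""
    name, _, version = soname.partition(".so.")
    return name, version


# Error text quoted in the answer above.
message = ("ImportError: libcublas.so.10.0: cannot open shared object file: "
           "No such file or directory")
wanted = missing_library(message)
print(wanted, split_soname(wanted))
print(split_soname("libcublas.so.10.1.0.105"))
```

Here the requested version is 10.0 while the installed file carries 10.1.0.105, which matches the difference shown by ldconfig.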

Installing tensorflow-gpu 1.3.0 on windows 10

I have been trying to install tensorflow-gpu on windows 10, via
pip3 install --upgrade tensorflow-gpu
When I do this, it breaks the current installation of ordinary tensorflow, and I get this error: On Windows, running "import tensorflow" generates No module named "_pywrap_tensorflow" error.
I managed to fix this by re-installing ordinary tensorflow, but then when I import tensorflow in Python 3.5.2 and try to identify my GPU, no device is found!
I have a Cuda 9.0 installed alongside cudnn64_6 defined as a DLL in CUDA/v9.0/bin, and I can run the nbody test program without problems and I can see the GPU being used for that demo application.
Is there any known issue with tensorflow-gpu 1.3.0?
Really struggling on this. Why does it have to be so problematic installing this library!
Please help
mg
TensorFlow 1.3 (and 1.4) require CUDA 8.0 and do not support later versions. You will either need to downgrade CUDA to 8.0 or make a custom build from source.
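The version requirements quoted in the answers on this page could be collected into a small lookup, as in the sketch below. The table contains only what those answers state; it is not an official compatibility matrix, and the names `CUDA_FOR_TF` and `expected_cuda` are hypothetical.

```python
# CUDA releases as stated by answers on this page, not an official matrix.
CUDA_FOR_TF = {
    "1.3": "8.0",
    "1.4": "8.0",
    "1.13": "10.0",
    "1.15": "10.0",
}


def expected_cuda(tf_version: str):
    """Look up the CUDA release for a TensorFlow version by its major.minor prefix."""
    major_minor = ".".join(tf_version.split(".")[:2])
    return CUDA_FOR_TF.get(major_minor)


print(expected_cuda("1.3.0"))
print(expected_cuda("1.15.3"))
print(expected_cuda("2.7.0"))  # not covered by the table
```

For any version that is missing from the table, the lookup returns None and the official tested-configuration list should be consulted instead.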

How to debug a segmentation fault 11 on TensorFlow?

I installed cuda 8 and the new tensorflow 1.0.
When I run "import tensorflow as tf" I get the following:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.8.0.dylib locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.8.0.dylib locally
Segmentation fault: 11
Knowing that nvcc -V gives the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Oct_30_22:18:43_CDT_2016
Cuda compilation tools, release 8.0, V8.0.54
Any idea how to fix this segmentation fault?
You might be missing a library in your local cuda installation. E.g., /usr/local/cuda/lib/libcuda.dylib was missing for me after trying to install CUDA Toolkit 8.0 locally (possibly because I installed the drivers first before the toolkit, as this ancient thread suggests: https://render.otoy.com/forum/viewtopic.php?f=25&t=1859). Re-running the installer for just the driver installed it properly, and also symlinked it to another name (https://github.com/tensorflow/tensorflow/issues/3263#issuecomment-232184358).
Lastly, double check your environment variable paths, see if echo $DYLD_LIBRARY_PATH looks right.
As an aside, I still saw some warnings when testing the install, e.g. "The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations." These are just suggesting that you build from source (https://github.com/tensorflow/tensorflow/issues/8037) rather than using pip install --upgrade tensorflow-gpu. 🍻
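As a general aid when a Python process dies with a segmentation fault, the standard library's `faulthandler` module can record the Python stack at the moment of the crash. This is a minimal sketch, not something the answers above used; the helper name and the log file name are hypothetical, and it shows which Python line triggered the crash rather than the native cause.

```python
import faulthandler


def enable_crash_traceback(log_path: str):
    """Write the Python stack to `log_path` if the process gets a fatal signal."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # Keep the returned object alive: faulthandler writes to its file descriptor.
    return log_file


# Call this before the import that crashes, for example:
# log = enable_crash_traceback("tf_crash.log")
# import tensorflow as tf
```

After a crash, the log file should contain the stack of each Python thread, which can confirm whether the fault happens during the import itself.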