System76 Ubuntu 20.04 TensorFlow GPU CUDA version conflicts

After upgrading from Ubuntu 18.04 to 20.04, TensorFlow can no longer use my GPU because it tries to load a mix of library versions (some 10 and some 11). It is a System76 machine, and I have CUDA 10.1 installed from System76 (so that it works with the System76 NVIDIA driver). Running TensorFlow produces the following errors:
2021-01-07 18:12:22.584886: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:22.584906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-07 18:12:23.640665: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 18:12:23.641412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 18:12:23.669966: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-07 18:12:23.670257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.733GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2021-01-07 18:12:23.670328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670379: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.671387: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-07 18:12:23.671667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-07 18:12:23.673022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-07 18:12:23.673100: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.673245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-07 18:12:23.673259: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Notice that all the warnings are for attempts to load version 11 CUDA libraries, but only for some of the libraries. The version 10 ones load fine.
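To make the loaded/missing split easier to see, here is a minimal sketch that sorts the dso_loader messages from a log like the one above into two sets. The function names are my own, and the regular expressions only assume the two message formats that appear in this log.

```python
import re

# The two dso_loader message formats that appear in the log above.
_LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
_FAILED = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_dso_log(log_text):
    """Return (loaded, failed) sets of library names found in a TensorFlow log."""
    loaded = set(_LOADED.findall(log_text))
    failed = set(_FAILED.findall(log_text))
    return loaded, failed


def soname_major(name):
    """Return the first number after '.so.', e.g. 'libcublas.so.11' -> 11."""
    match = re.search(r"\.so\.(\d+)", name)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Message parts copied from the log above (timestamps and file names dropped).
    sample = "\n".join([
        "Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory",
        "Successfully opened dynamic library libcufft.so.10",
        "Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory",
    ])
    loaded, failed = summarize_dso_log(sample)
    print("loaded:", sorted(loaded))
    print("failed:", sorted(failed))
```

Note that the number in a library file name is that library's own version, so a split like this does not by itself prove that two different toolkits are being mixed.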
This is the output of nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
This is the output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 26W / N/A | 585MiB / 6069MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2999 G /usr/lib/xorg/Xorg 101MiB |
| 0 N/A N/A 3479 G /usr/lib/xorg/Xorg 255MiB |
| 0 N/A N/A 3720 G /usr/bin/gnome-shell 88MiB |
| 0 N/A N/A 6487 G ...AAAAAAAA== --shared-files 45MiB |
| 0 N/A N/A 6959 G ...AAAAAAAA== --shared-files 40MiB |
| 0 N/A N/A 11642 G ...AAAAAAAA== --shared-files 21MiB |
| 0 N/A N/A 25206 G WickrMe 17MiB |
+-----------------------------------------------------------------------------+
I see that the CUDA version in the output of nvidia-smi is 11.1, but as I understand it, that has nothing to do with the installed CUDA runtime. It is simply the highest CUDA version the driver supports. Correct me if I'm wrong.
I have to use version 10 because that is what System76 supports, and it worked fine prior to the upgrade. I have also tried uninstalling and re-installing TensorFlow via pip3, with no luck.
Does anyone know how to get all the libraries in sync at version 10.1? I also tried to manually put the version 11 libraries in place and let TensorFlow use the mixed versions (which of course is a bad idea), but it won't recognize them (or I didn't place them properly).
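As a rough illustration of that distinction, the sketch below pulls the "CUDA Version" field out of nvidia-smi output and the "release" field out of nvcc --version output and compares them. The function names are hypothetical, and the comparison is only the simple rule described above (toolkit version not newer than what the driver reports), not an official compatibility check.

```python
import re


def driver_cuda_version(nvidia_smi_text):
    """Return the 'CUDA Version' shown by nvidia-smi as (major, minor), or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_text):
    """Return the 'release X.Y' shown by nvcc --version as (major, minor), or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_text)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_supports_toolkit(driver, toolkit):
    """Simple rule: the toolkit must not be newer than the driver's reported version."""
    return toolkit <= driver


if __name__ == "__main__":
    smi = "| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |"
    nvcc = "Cuda compilation tools, release 10.1, V10.1.105"
    driver = driver_cuda_version(smi)
    toolkit = toolkit_cuda_version(nvcc)
    print("driver supports up to:", driver)
    print("installed toolkit:", toolkit)
    print("toolkit usable with this driver:", driver_supports_toolkit(driver, toolkit))
```

With the values from this question, the driver side is fine for both CUDA 10.1 and 11.0; what matters is which runtime libraries the TensorFlow build asks for.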

As @talonmies pointed out, I was misunderstanding the versioning system. However, the fact that it's a System76 machine also complicated things, because System76 ships its own NVIDIA driver and it's not straightforward to install CUDA 11 and cuDNN alongside it. I'm posting the answer in case anyone else runs into problems with System76.
First, DO NOT use the System76 packages for CUDA and cuDNN. System76 provides its own versions (on its website) that are compatible with its NVIDIA driver, but they will not work here (they are version 10, and TF 2.4+ requires 11). Also, most general CUDA guides tell you to uninstall and reinstall the NVIDIA driver first to get a clean install, but DO NOT do this on a System76 system. Just leave the System76 driver alone. Finally, if you have any previous CUDA/cuDNN installation, remove all of it.
Go to Nvidia and get their latest Cuda and Cudnn. I used
wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
Run that with
sudo sh cuda_11.0.2_450.51.05_linux.run
When it runs it will tell you that you have a conflict with the driver package. Ignore that and proceed. When you get to the install menu, UNCHECK "install driver" and continue with the install. When it's done, add to your path
/usr/local/cuda-11.0:/usr/local/cuda-11.0/bin:
You need to add both the cuda root and bin, not just bin (which is different than most general instructions). Source your .bashrc or .profile or wherever you put the path addition (or open a new terminal).
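If you want to generate the line for your .bashrc without ending up with duplicate entries, something like the following sketch works. The helper name is made up for illustration; the two directories are the ones from the step above.

```python
def prepend_to_path(current_path, new_dirs):
    """Return a PATH string with new_dirs first, dropping duplicates and empty entries."""
    ordered = []
    for entry in list(new_dirs) + current_path.split(":"):
        if entry and entry not in ordered:
            ordered.append(entry)
    return ":".join(ordered)


if __name__ == "__main__":
    import os

    cuda_dirs = ["/usr/local/cuda-11.0", "/usr/local/cuda-11.0/bin"]
    new_path = prepend_to_path(os.environ.get("PATH", ""), cuda_dirs)
    # Print a line that can be pasted into .bashrc or .profile.
    print('export PATH="{}"'.format(new_path))
```

Putting the CUDA directories first means they take precedence over any older nvcc that may still be on the PATH.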
Now install Cudnn.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
Install it with dpkg. For example (in my case)...
sudo dpkg -i libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
That's it. Once I completed all that, everything worked fine. Hope that helps some System76 people get through Ubuntu 20.04 and CUDA 11 a little easier.
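A quick way to confirm the result without starting TensorFlow is to try opening the same libraries with ctypes, which goes through the system's normal dynamic loader. This is only a sketch: the list of names is copied from the log in the question, and a different TensorFlow build may ask for different versions.

```python
import ctypes

# Library names taken from the TensorFlow log in the question.
EXPECTED = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def can_load(name):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def check_libraries(names=EXPECTED):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in check_libraries().items():
        print("{:<22} {}".format(name, "ok" if ok else "MISSING"))
```

Anything reported as MISSING here should also show up as a "Could not load dynamic library" warning when TensorFlow starts.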

Thank you very much.
One of the reasons I have used Pop!_OS is that the NVIDIA drivers plus CUDA/cuDNN just worked with TensorFlow, until this issue with the missing version 11.0 libraries.
One thing I needed in order to install CUDA 11.0 with the recipe above was to install gcc version 8:
sudo apt -y install gcc-8 g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8
I really wish Pop!_OS would provide CUDA 11.0 packages directly.

Related

Tensorflow GPU CUDA Could not load dynamic library 'libcufft.so.10'; dlerror

I fear this will be marked as a duplicate, but I can only find examples involving libcudart or libcublas, not libcufft (which is my issue).
I installed TensorFlow and I want to use the GPU, so I ran the script at this link.
When running TensorFlow to train a network I get the following message:
2021-09-23 11:19:22.158959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-23 11:19:22.162563: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162651: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162730: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162806: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162989: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-23 11:19:22.163345: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using tf.config.list_physical_devices() I get:
2021-09-23 11:30:18.327648: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-23 11:30:18.329447: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329510: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329573: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329687: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329814: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
I have a folder called /usr/local/cuda-11.0 but no plain /usr/local/cuda, and there is no extras folder inside it either.
Admittedly, the script says it is for Ubuntu 18.04 and I have Ubuntu 20.04.
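Since the LD_LIBRARY_PATH in the log points at /usr/local/cuda/... while only /usr/local/cuda-11.0 exists, it helps to list what is actually there. The sketch below (hypothetical helper names, default location assumed to be /usr/local) lists every cuda* directory and reports whether the unversioned cuda entry exists.

```python
import glob
import os


def find_cuda_installs(root="/usr/local"):
    """Map each cuda* directory under root to the real path it resolves to."""
    installs = {}
    for path in sorted(glob.glob(os.path.join(root, "cuda*"))):
        if os.path.isdir(path):
            installs[os.path.basename(path)] = os.path.realpath(path)
    return installs


def has_default_cuda(root="/usr/local"):
    """Return True if root/cuda exists (usually a symlink to a versioned directory)."""
    return os.path.isdir(os.path.join(root, "cuda"))


if __name__ == "__main__":
    for name, target in find_cuda_installs().items():
        print("{} -> {}".format(name, target))
    if not has_default_cuda():
        print("no /usr/local/cuda entry; paths that rely on it will not resolve")
```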
If I try to run sudo apt install nvidia-cuda-toolkit as suggested here I get:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-cuda-toolkit : Depends: nvidia-cuda-dev (= 10.1.243-3) but it is not going to be installed
Recommends: nsight-compute (= 10.1.243-3)
Recommends: nsight-systems (= 10.1.243-3)
E: Unable to correct problems, you have held broken packages.
Output of whereis cuda is cuda: (empty).
The output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 0% 40C P8 31W / 300W | 626MiB / 11016MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1141 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 1749 G /usr/lib/xorg/Xorg 315MiB |
| 0 N/A N/A 1886 G /usr/bin/gnome-shell 59MiB |
| 0 N/A N/A 1907 G ...mviewer/tv_bin/TeamViewer 2MiB |
| 0 N/A N/A 2463 G ...ble-features=SpareRendere 4MiB |
| 0 N/A N/A 3825 G ...AAAAAAAAA= --shared-files 105MiB |
| 0 N/A N/A 4682 G .../debug.log --shared-files 36MiB |
| 0 N/A N/A 20600 G ...AAAAAAAAA= --shared-files 24MiB |
+-----------------------------------------------------------------------------+
I am afraid of installing more things to solve this and ending up with the typical situation of 20 versions of CUDA colliding with each other.

So I did as suggested in the comments and uninstalled everything in a very aggressive manner:
sudo apt clean
sudo apt update
sudo apt purge cuda
sudo apt purge 'nvidia-*'
sudo apt autoremove
I then followed the instructions to install:
CUDA
CUDA Toolkit (although I think it's the same thing; I just added the command sudo apt-get install nvidia-gds, and I don't even know whether it was necessary)
CUDNN
Now it seems to be working.

Tensorflow Could not load dynamic library 'libcudart.so.10.0' on ubuntu 18.04

I have
$ python3 -c "import tensorflow as tf;print(tf.__version__)"
1.15.0
and
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
with
python --version
Python 3.6.9
pip --version
pip 19.3.1 from /usr/local/lib/python3.6/dist-packages/pip (python 3.6)
but I see CUDA 10.2 from nvidia-smi
$ nvidia-smi
Tue Nov 17 18:40:54 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 2080 On | 00000000:01:00.0 Off | N/A |
| 32% 42C P2 56W / 215W | 265MiB / 7979MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1840 G /usr/lib/xorg/Xorg 57MiB |
| 0 1895 G /usr/bin/gnome-shell 85MiB |
| 0 29999 C /usr/bin/python 109MiB |
+-----------------------------------------------------------------------------+
I can see
$ ls /usr/local/
bin cuda cuda-10.1 cuda-10.2 etc games include lib man sbin share src
and in the .profile I can see
# set PATH for cuda 10.2 installation
if [ -d "/usr/local/cuda-10.2/bin/" ]; then
export PATH=/usr/local/cuda-10.2/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.2/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
so I overrode PATH and LD_LIBRARY_PATH with
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
but that does not seem to fix it.
2020-11-17 18:38:39.470074: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-17 18:38:39.487544: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2020-11-17 18:38:39.489215: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47007e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-17 18:38:39.489273: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-17 18:38:39.494309: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-17 18:38:39.542010: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-17 18:38:39.542387: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4b1bf40 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-17 18:38:39.542399: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2020-11-17 18:38:39.542519: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-17 18:38:39.542788: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:01:00.0
2020-11-17 18:38:39.542872: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.542919: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.543012: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.543059: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.543093: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.543125: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.1/lib64
2020-11-17 18:38:39.545590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-17 18:38:39.545617: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-11-17 18:38:39.545653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-17 18:38:39.545658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-11-17 18:38:39.545662: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
['/device:CPU:0', '/device:XLA_CPU:0', '/device:XLA_GPU:0']
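The log above shows TensorFlow 1.15 asking for the .so.10.0 libraries while LD_LIBRARY_PATH points at /usr/local/cuda-10.1/lib64. A small sketch like the one below can confirm whether a requested file name exists in any of those directories and show which versions are present instead. The helper names are my own, and it only looks at the directories you pass in, not at the system-wide loader cache.

```python
import os


def locate_in_ld_path(soname, ld_library_path):
    """Return the directories in ld_library_path that contain a file named soname."""
    hits = []
    for directory in ld_library_path.split(":"):
        if directory and os.path.exists(os.path.join(directory, soname)):
            hits.append(directory)
    return hits


def similar_names(prefix, ld_library_path):
    """List file names starting with prefix, to see which versions are installed."""
    names = set()
    for directory in ld_library_path.split(":"):
        if directory and os.path.isdir(directory):
            names.update(n for n in os.listdir(directory) if n.startswith(prefix))
    return sorted(names)


if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    wanted = "libcudart.so.10.0"  # the name from the warning above
    print("found in:", locate_in_ld_path(wanted, ld_path))
    print("available:", similar_names("libcudart.so", ld_path))
```

If the first list is empty and the second shows only libcudart.so.10.1, the installed toolkit simply does not match what this TensorFlow build was compiled against, and changing the search path to another 10.1 directory cannot help.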

I assume the library exists at /usr/local/lib/libcudart.so.11.0.
First activate your Python virtual environment, with something like: source ./venv/bin/activate
Once you're in the virtual environment, set LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/local/lib
Finally, re-run your script.
In my case, TensorFlow was looking for libcudart.so.11.0, and the steps above worked for me:
devbox1@devbox1:~/onibex/algo$ source ./venv/bin/activate
(venv) devbox1@devbox1:~/onibex/algo$
(venv) devbox1@devbox1:~/onibex/algo$ cd /home/devbox1/docs/onibex/wa/data/sprint0/code/algo ; /usr/bin/env /home/devbox1/docs/onibex/wa/data/sprint0/code/algo/venv/bin/python3 /home/devbox1/.vscode/extensions/ms-python.python-2021.2.636928669/pythonFiles/lib/python/debugpy/launcher 34287 -- /home/devbox1/docs/onibex/wa/data/sprint0/code/algo/quickly_tensor_flow.py
2021-03-14 00:12:18.588232: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory;
(venv) devbox1@devbox1:~/onibex/algo$ export LD_LIBRARY_PATH=/usr/local/cuda-11.2/targets/x86_64-linux/lib
(venv) devbox1@devbox1:~/onibex/algo$ echo $LD_LIBRARY_PATH
/usr/local/cuda-11.2/targets/x86_64-linux/lib
(venv) devbox1@devbox1:~/onibex/algo$ cd /home/devbox1/docs/onibex/wa/data/sprint0/code/algo ; /usr/bin/env /home/devbox1/docs/onibex/wa/data/sprint0/code/algo/venv/bin/python3 /home/devbox1/.vscode/extensions/ms-python.python-2021.2.636928669/pythonFiles/lib/python/debugpy/launcher 34089 -- /home/devbox1/docs/onibex/wa/data/sprint0/code/algo/quickly_tensor_flow.py
2021-03-14 21:36:49.207430: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
... hello world!
(venv) devbox1@devbox1:~/onibex/algo$
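The same idea can be scripted. As far as I know, the dynamic loader reads LD_LIBRARY_PATH when a process starts, so the variable has to be set before Python is launched rather than from inside the running script. The sketch below starts a child process with the directory prepended; the helper name is hypothetical, and unlike the export above it keeps any existing entries instead of overwriting them.

```python
import os
import subprocess
import sys


def run_with_library_path(command, lib_dir):
    """Run command in a child process with lib_dir prepended to LD_LIBRARY_PATH."""
    env = os.environ.copy()
    existing = env.get("LD_LIBRARY_PATH", "")
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Directory taken from the transcript above; adjust it to your install.
    lib_dir = "/usr/local/cuda-11.2/targets/x86_64-linux/lib"
    child = [sys.executable, "-c", "import os; print(os.environ['LD_LIBRARY_PATH'])"]
    result = run_with_library_path(child, lib_dir)
    print(result.stdout.strip())
```

Replace the child command with the interpreter and script you actually want to run.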

Debug broken Tensorflow-gpu installation with Conda (1.14 vs 2.3), Ubuntu 18.04

I recently made the mistake of fiddling with my TF install and broke everything. I used to have two Conda envs, with TF 1.14 and TF 2.1 respectively, both on CUDA 10.1 and both working fine. After much plumbing, my main Conda env now has TF 2.3 with CUDA 10.1. But after installing the libs and TensorRT and creating a new env for TF 1.14 (I still have some older code I haven't ported), the command that used to work like a charm, conda install -c (conda-forge|anaconda) tensorflow-gpu, now produces an install that fails to see my GPU.
Sun Nov 1 09:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... On | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 6W / N/A | 11MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1469 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2719 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.1:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.2:
doc lib64 LICENSE README targets version.txt
/usr/local/cuda-11.1:
include lib64 src targets
And lastly the error:
In [2]: tf.test.is_gpu_available()
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
Out[2]: False
(Whereas in my other env with TF 2.3 everything is fine:)
In [2]: tf.config.list_physical_devices()
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Out[2]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I know that the Conda-distributed version of TF worked with CUDA 10.1: it was working on my machine until yesterday. Now that I redo what seem to me to be the same steps, nothing works. What could be the issue?
Has anyone encountered this? I also need to solve this on another machine, exact same problem, and no cuda-11.1 in /usr/local this ... Thanks in advance!

So, after much wrangling (and wanting to set up not one but two versions of TF on one machine in this day and age is certainly a symptom of madness), the solution I found to work was:
In the main TF 2.3 environment, follow the steps described here, with two tweaks:
DO NOT INSTALL TENSORFLOW YET.
Currently (October 2020), sudo apt-get install --no-install-recommends cuda-10-1 no longer works, but conda install cudatoolkit=10.1.243 does; see this.
OTHER CAVEAT: I also noticed that TF 2.3 could not find the whole array of libraries (libcublas.so.10, libcufft.so.10, libcurand.so.10, etc.) until I installed CUDA 10.2 with conda install cudatoolkit=10.2.89, which I've seen people talk about here, so it is unclear whether this is the perfect solution (other people symlink the files or copy them manually from one directory to another; those hellish days will be remembered).
(Another option, without TensorRT but very useful for purging CUDA and NVIDIA things, and fail-safe, can be found here.)
After all the libraries, CUDA, etc. are installed (you need a reboot at this point, and you can check that your GPU(s) are visible using nvidia-smi), create a fresh environment and install TF 1.14 using the anaconda channel (conda-forge failed for me): conda install tensorflow-gpu=1.14.
Finally, at the very end, go back to the main env and install tensorflow with pip.
In there, you should have this:
$ conda list | grep tensor
tensorboard 1.14.0 py37hf484d3e_0 anaconda
tensorflow 1.14.0 gpu_py37h74c33d7_0 anaconda
tensorflow-base 1.14.0 gpu_py37he45bfe2_0 anaconda
tensorflow-estimator 1.14.0 py_0 anaconda
tensorflow-gpu 1.14.0 h0d30ee6_0 anaconda
And, importantly:
$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
This does not work if you installed TF with pip beforehand.
After that, activate your other base env, and complete your installation with pip
$ pip install tensorflow
Which should give you:
$ conda list | grep tensor
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.1 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
And:
$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0
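If you want to check this automatically in each environment, here is a minimal sketch that parses pip freeze output and verifies that the core TensorFlow packages are from the same major.minor series. The function names and the choice of which packages to compare are my own assumptions.

```python
def tensor_packages(freeze_text):
    """Parse 'name==version' lines and keep packages whose name starts with 'tensor'."""
    packages = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        if name.lower().startswith("tensor"):
            packages[name] = version
    return packages


def same_minor_series(packages, names=("tensorflow", "tensorboard", "tensorflow-estimator")):
    """Return True if the listed packages that are present share one major.minor version."""
    series = {tuple(packages[n].split(".")[:2]) for n in names if n in packages}
    return len(series) <= 1


if __name__ == "__main__":
    # Output copied from the pip freeze listing above.
    freeze = "\n".join([
        "tensorboard==2.3.0",
        "tensorboard-plugin-wit==1.7.0",
        "tensorflow==2.3.1",
        "tensorflow-estimator==2.3.0",
    ])
    found = tensor_packages(freeze)
    print(found)
    print("consistent:", same_minor_series(found))
```

To run it on a live environment, pass it the text produced by pip freeze in that environment.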

Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory;

When I try to run a Python script that uses TensorFlow, it shows the following error:
2020-10-04 16:01:44.994797: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.780656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-04 16:01:46.795642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-10-04 16:01:46.795699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.795808: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64
2020-10-04 16:01:46.797391: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-04 16:01:46.797707: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-04 16:01:46.799529: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-04 16:01:46.800524: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-04 16:01:46.804150: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-04 16:01:46.804169: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 00000000:03:00.0 Off | N/A |
| 23% 28C P8 9W / 250W | 18MiB / 12194MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1825 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1957 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
Tensorflow version 2.3.1,
Ubuntu - 18.04
I tried to completely remove cuda toolkit and install from scratch but the error remains.
Could anybody help me identify the source of the problem?

On Ubuntu 20.04, you can simply install NVIDIA's CUDA toolkit:
sudo apt-get update
sudo apt install nvidia-cuda-toolkit
There is also installation advice for Windows.
The package is around 1GB and took a while to install. Once it is installed, you need to export the path variables so that the library can be found:
Find Shared Object
sudo find / -name 'libcudart.so*'
/usr/lib/x86_64-linux-gnu/libcudart.so.10.1
/usr/lib/x86_64-linux-gnu/libcudart.so
Add the folder to path, so that python finds it
export PATH=/usr/lib/x86_64-linux-gnu${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Permissions
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcuda*
This helped me.
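
As a rough way to check the result of the steps above, the sketch below mirrors the DIR${VAR:+:${VAR}} shell idiom in Python and tries to load a library by its soname. The helper names prepend_dir and can_load are hypothetical. ctypes.CDLL is the standard-library call that asks the dynamic loader for a shared object and raises OSError when it cannot be found.

```python
import ctypes

def prepend_dir(directory, existing):
    """Mirror the shell idiom DIR${VAR:+:${VAR}}.

    The separator is only added when the existing value is non-empty,
    so no empty path entry is created.
    """
    return directory + (":" + existing if existing else "")

def can_load(soname):
    """Return True if the dynamic loader can open the given shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False

if __name__ == "__main__":
    print(prepend_dir("/usr/lib/x86_64-linux-gnu", ""))
    print(prepend_dir("/usr/lib/x86_64-linux-gnu", "/usr/local/cuda/lib64"))
    # Soname taken from the `find` output above; the result depends on the machine.
    print("libcudart.so.10.1 loadable:", can_load("libcudart.so.10.1"))
```

Note that can_load reflects the environment the Python process was started with, so export the variables in the shell first and then start Python.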

This usually happens when you run TensorFlow with an incompatible version of CUDA. It looks like this has been asked before (I could not comment); refer to this question.
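
To see which library versions TensorFlow is actually asking for, the log itself can be scanned. The following is a minimal sketch, assuming the log format shown in the question. The helper names missing_libraries and searched_dirs are hypothetical, and the sample line is copied from the log in the question.

```python
import re

# Sample copied from the dso_loader warning in the question's log.
SAMPLE_LOG = (
    "2020-10-04 16:01:46.795808: W tensorflow/stream_executor/platform/default/"
    "dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; "
    "dlerror: libcublas.so.10: cannot open shared object file: "
    "No such file or directory; LD_LIBRARY_PATH: "
    "/usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64\n"
)

MISSING_RE = re.compile(r"Could not load dynamic library '([^']+)'")
PATH_RE = re.compile(r"LD_LIBRARY_PATH: (\S+)")

def missing_libraries(log_text):
    """Return the sonames TensorFlow failed to load, in order, without repeats."""
    seen = []
    for name in MISSING_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen

def searched_dirs(log_text):
    """Return the LD_LIBRARY_PATH entries reported in the first matching log line."""
    match = PATH_RE.search(log_text)
    if match is None:
        return []
    return [d for d in match.group(1).split(":") if d]

if __name__ == "__main__":
    print("missing:", missing_libraries(SAMPLE_LOG))
    print("searched:", searched_dirs(SAMPLE_LOG))
```

In the sample, the loader wants libcublas.so.10 while the search path points at a cuda-10.0 directory, which is the kind of mismatch this answer describes.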

Today I was facing this problem. I went to the CUDA toolkit website, selected the options, and that showed some instructions like this:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda # I have broken packages, so could not invoke this command
The instructions will change depending on your specifications, so DO NOT copy them from here or from another Stack Overflow answer.
I could not run the last command, but after some trial and error, I ran:
sudo apt install libcudart.so.11.0 # this worked for me!
This worked for me!

You have to download/update CUDA.
If you are looking for the CUDA Toolkit 10.2 download, use this link:
https://developer.nvidia.com/cuda-10.2-download-archive
Then activate the virtual environment and set LD_LIBRARY_PATH. For an example, see:
Tensorflow Could not load dynamic library 'libcudart.so.10.0 on ubuntu 18.04

Please run these commands if you have Ubuntu 18.04 installed, or follow the instructions here:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda

This worked for me:
sudo apt-get install libcudart10.1

What is the correct version of CUDNN for CUDA 11.0

I want to start using tensorflow-gpu, and I looked some stuff up, and found out that I need to ensure that I have both CUDA and CUDNN. So, I opened up the command prompt and ran the command nvidia-smi to check my CUDA version:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi
Tue Jun 02 14:13:03 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 445.87 Driver Version: 445.87 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 WDDM | 00000000:01:00.0 Off | N/A |
| N/A 40C P8 N/A / N/A | 77MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU PID Type Process name GPU Memory |
| Usage |
|=============================================================================|
| 0 10488 C+G ...n64\EpicGamesLauncher.exe N/A |
| 0 12636 C+G ...4\UnrealCEFSubProcess.exe N/A |
+-----------------------------------------------------------------------------+
Now that I see my CUDA version is 11.0, I went to NVIDIA's website to select a version of CUDNN that works with CUDA 11.0, but the latest ones currently support only up to CUDA 10.2. What should I do? Can I use the one for CUDA 10.2?

What nvidia-smi shows is not the CUDA version that you have installed, but the maximum CUDA version that your driver supports.
CUDA 11.0 has been announced but not released yet (as of June 2nd 2020), so you should use CUDA 10.2 as it's the latest available version.
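
To make the distinction concrete, here is a small sketch that reads the nvidia-smi header and labels the number for what it is: the highest CUDA version the driver supports. The helper name parse_smi_header is hypothetical and the sample header is copied from the question. The installed toolkit version has to come from elsewhere, for example from nvcc --version if the toolkit is on the PATH.

```python
import re

# Header line copied from the nvidia-smi output in the question.
SAMPLE_HEADER = (
    "| NVIDIA-SMI 445.87       Driver Version: 445.87       CUDA Version: 11.0     |"
)

HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")

def parse_smi_header(text):
    """Extract the driver version and the maximum CUDA version the driver supports.

    Returns None if the header line is not present in the text.
    """
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return {"driver": match.group(1), "max_cuda_supported": match.group(2)}

if __name__ == "__main__":
    info = parse_smi_header(SAMPLE_HEADER)
    print(info)
    # This says nothing about which toolkit, if any, is installed.
```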

A couple of weeks ago, I upgraded three of them to the new cuda_11.0.2, driver 450.51.06 and cuDNN 8.0.
My environment:
x86-64
CentOS 7 with gcc 4.8.5 (sudo didn't work for me in CentOS; log in as root)
I downloaded cuda_11.0.2-450.51.05_linux.run
I took a risk, but it went fine. The NVIDIA cuDNN support matrix said:
compute capability > 3.5, toolkit = 11.0, and driver r450
So the driver and toolkit minor versions don't matter.
I installed it and went through the pre-installation, post-installation and recommended steps.
Everything went fine.
*This is very important:
cuDNN installed, but I couldn't run the examples.
If you are an engineer, you have been through such a dilemma because you skipped some small details.
gcc 4.8.5 is for installing the toolkit and driver.
cuDNN 8.0 needs gcc 5 or above, for C++11 or C++14 in the toolchain.
So here is what I did (I have a lot of devtoolset versions in my environment):
I chose version 6.0 instead of 5 so as not to be on the borderline.
Re-install it and you will be fine.
***Regarding TensorFlow: it has nothing to do with cuDNN other than Keras for Python, if I get this right.
