COS install GPU failed to download driver signature - gpu

I used Compute Engine VM with T4 GPU for quite some time on COS and it has been working fine until recently that cos-extensions install gpu does not work like before.
I0830 07:32:58.419130 987 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0830 07:32:58.427417 987 install.go:74] Running on COS build id 16108.470.16
I0830 07:32:58.427566 987 installer.go:187] Getting the default GPU driver version
I0830 07:32:58.427911 987 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548403 987 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/16108.470.16/gpu_default_version
I0830 07:32:58.548594 987 install.go:85] Installing GPU driver version 450.119.04
I0830 07:32:58.549646 987 cache.go:72] map[BUILD_ID:16108.470.11 DRIVER_VERSION:450.119.04]
I0830 07:32:58.549674 987 install.go:120] Did not find cached version, installing the drivers...
I0830 07:32:58.549681 987 installer.go:82] Configuring driver installation directories
I0830 07:32:58.563327 987 installer.go:196] Updating container's ld cache
I0830 07:32:58.793692 987 signature.go:30] Downloading driver signature for version 450.119.04
I0830 07:32:58.793721 987 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/16108.470.16/extensions/gpu/450.119.04.signature.tar.gz
E0830 07:32:58.828902 987 artifacts.go:106] Failed to download extensions/gpu/450.119.04.signature.tar.gz from public GCS: failed to download 450.119.04.signature.tar.gz, status: 404 Not Found
E0830 07:32:58.829401 987 install.go:175] failed to download driver signature: failed to download driver signature for version 450.119.04: failed to download extensions/gpu/450.119.04.signature.tar.gz
It seems like the installer could not find the driver signature. I have looked into this and followed the workaround by doing
/usr/bin/docker run --rm \
--privileged \
--net=host \
--pid=host \
--volume /dev:/dev \
--volume /:/root \
--volume /var/lib/toolbox/nvidia:/usr/local/nvidia \
--env NVIDIA_DRIVER_VERSION=450.119.04 \
gcr.io/cos-cloud/cos-gpu-installer:latest
but got this instead
+ COS_KERNEL_INFO_FILENAME=kernel_info
+ COS_KERNEL_SRC_HEADER=kernel-headers.tgz
+ TOOLCHAIN_URL_FILENAME=toolchain_url
+ TOOLCHAIN_ENV_FILENAME=toolchain_env
+ TOOLCHAIN_PKG_DIR=/build/cos-tools
+ CHROMIUMOS_SDK_GCS=https://storage.googleapis.com/chromiumos-sdk
+ ROOT_OS_RELEASE=/root/etc/os-release
+ KERNEL_SRC_HEADER=/build/usr/src/linux
+ NVIDIA_DRIVER_VERSION=450.119.04
+ NVIDIA_DRIVER_MD5SUM=
+ NVIDIA_INSTALL_DIR_HOST=/var/lib/nvidia
+ NVIDIA_INSTALL_DIR_CONTAINER=/usr/local/nvidia
+ ROOT_MOUNT_DIR=/root
+ CACHE_FILE=/usr/local/nvidia/.cache
+ LOCK_FILE=/root/tmp/cos_gpu_installer_lock
+ LOCK_FILE_FD=20
+ set +x
[INFO 2021-08-30 07:36:38 UTC] PRELOAD: false
[INFO 2021-08-30 07:36:38 UTC] Running on COS build id 16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Data dependencies (e.g. kernel source) will be fetched from https://storage.googleapis.com/cos-tools/16108.470.16
[INFO 2021-08-30 07:36:38 UTC] Checking if this is the only cos-gpu-installer that is running.
[INFO 2021-08-30 07:36:38 UTC] Checking if third party kernel modules can be installed
/tmp/esp /
/
[INFO 2021-08-30 07:36:38 UTC] Checking cached version
/entrypoint.sh: line 172: CACHE_BUILD_ID: unbound variable
It seems like there are some changes going on with COS and COS GPU driver (maybe?), but just want to know whether there is a workaround on this problem apart from waiting GCP to solve things out.

This is the same case as the one Jan Vansteenlandt linked to.
This happens in some versions of COS;
For example latest stable COS version available now - 89-16108:
vm-16108 ~ # cos-extensions list Available extensions for COS version
89-16108.470.16:
[gpu]
There's no driver listed under [gpu] and running cos-extensions install gpu ends in the same way as in your case. When trying to run the docker container you mentioned also yielded the same results.
This is a known issue and has already been raised on IssueTracker. You can fallow the link and click on +1 button, also you can comment and post your own findings in the thread.
There's also a workaround in the thread so you may give it a go.
If you can use some older version of COS (85-13310 for example) - the driver is listed:
vm-13310 ~ # cos-extensions list
Available extensions for COS version 85-13310.1308.10:
[gpu]
450.119.04 [default]
And when you run cos-extensions install gpu it will result in succesfull installation of NVIDIA drivers:
vm-13310 ~ # cos-extensions install gpu
I0831 14:25:11.405591 1168 main.go:21] Checking if this is the only cos_gpu_installer that is running.
I0831 14:25:11.407510 1168 install.go:74] Running on COS build id 13310.1308.10
I0831 14:25:11.407519 1168 installer.go:187] Getting the default GPU driver version
I0831 14:25:11.407581 1168 utils.go:72] Downloading gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448046 1168 utils.go:120] Successfully downloaded gpu_default_version from https://storage.googleapis.com/cos-tools/13310.1308.10/gpu_default_version
I0831 14:25:11.448539 1168 install.go:85] Installing GPU driver version 450.119.04
I0831 14:25:11.448751 1168 cache.go:69] error: failed to read file /root/var/lib/nvidia/.cache: open /root/var/lib/nvidia/.cache: no such file or directory
I0831 14:25:11.448942 1168 install.go:120] Did not find cached version, installing the drivers...
I0831 14:25:11.449084 1168 installer.go:82] Configuring driver installation directories
I0831 14:25:11.469718 1168 installer.go:196] Updating container's ld cache
I0831 14:25:11.480682 1168 signature.go:30] Downloading driver signature for version 450.119.04
I0831 14:25:11.481007 1168 utils.go:72] Downloading 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506186 1168 utils.go:120] Successfully downloaded 450.119.04.signature.tar.gz from https://storage.googleapis.com/cos-tools/13310.1308.10/extensions/gpu/450.119.04.signature.tar.gz
I0831 14:25:11.506541 1168 signature.go:37] Decompressing signature /build/sign-gpu-driver/450.119.04.signature.tar.gz
I0831 14:25:11.510104 1168 installer.go:68] Downloading GPU driver installer version 450.119.04
I0831 14:25:11.511637 1168 utils.go:72] Downloading GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
I0831 14:25:12.885856 1168 utils.go:120] Successfully downloaded GPU driver installer from https://storage.googleapis.com/nvidia-drivers-us-public/nvidia-cos-project/85/tesla/450_00/450.119.04/NVIDIA-Linux-x86_64-450.119.04_85-13310-1308-10.cos
----- removed some lines for better readibility -----
I0831 14:28:49.433597 1168 cache.go:58] Updated cached version as
I0831 14:28:49.498379 1168 cache.go:60] BUILD_ID=13310.1308.10
I0831 14:28:49.498560 1168 cache.go:60] DRIVER_VERSION=450.119.04
I0831 14:28:49.498694 1168 installer.go:32] Verifying GPU driver installation
I0831 14:28:50.309502 1168 utils.go:334] Tue Aug 31 14:28:50 2021
I0831 14:28:50.309879 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.311093 1168 utils.go:334] | NVIDIA-SMI 450.119.04 Driver Version: 450.119.04 CUDA Version: 11.0 |
I0831 14:28:50.311300 1168 utils.go:334] |-------------------------------+----------------------+----------------------+
I0831 14:28:50.311497 1168 utils.go:334] | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
I0831 14:28:50.311640 1168 utils.go:334] | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
I0831 14:28:50.311784 1168 utils.go:334] | | | MIG M. |
I0831 14:28:50.311949 1168 utils.go:334] |===============================+======================+======================|
I0831 14:28:50.322257 1168 utils.go:334] | 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
I0831 14:28:50.322566 1168 utils.go:334] | N/A 76C P0 27W / 70W | 0MiB / 15109MiB | 0% Default |
I0831 14:28:50.322708 1168 utils.go:334] | | | N/A |
I0831 14:28:50.322878 1168 utils.go:334] +-------------------------------+----------------------+----------------------+
I0831 14:28:50.323119 1168 utils.go:334]
I0831 14:28:50.323293 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.323431 1168 utils.go:334] | Processes: |
I0831 14:28:50.323597 1168 utils.go:334] | GPU GI CI PID Type Process name GPU Memory |
I0831 14:28:50.323715 1168 utils.go:334] | ID ID Usage |
I0831 14:28:50.323863 1168 utils.go:334] |=============================================================================|
I0831 14:28:50.324222 1168 utils.go:334] | No running processes found |
I0831 14:28:50.324439 1168 utils.go:334] +-----------------------------------------------------------------------------+
I0831 14:28:50.465730 1168 modules.go:48] Updating host's ld cache
I0831 14:28:52.305122 1168 install.go:167] Finished installing the drivers.

Related

Tensorflow GPU CUDA Could not load dynamic library 'libcufft.so.10'; dlerror

I fear this to be marked as duplicate but I find examples with libcudart or libcublas but not libcufft (which is my issue).
I installed TensorFlow and I want to use the GPU. I, therefore, run the script on this link.
When running TensorFlow to train a network I get the following message:
2021-09-23 11:19:22.158959: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-23 11:19:22.162563: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162651: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162730: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162806: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-09-23 11:19:22.162989: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-09-23 11:19:22.163345: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using tf.config.list_physical_devices() I get:
2021-09-23 11:30:18.327648: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-23 11:30:18.329447: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329510: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329573: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329687: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda/extras/CUPTI/lib64
2021-09-23 11:30:18.329814: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]
I have a folder called /usr/local/cuda-11.0 but not cuda alone, neither I have an extras folder in it.
It is true that it says for Ubuntu 18.04 and I have Ubuntu 20.04.
If I try to run sudo apt install nvidia-cuda-toolkit as suggested here I get:
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
nvidia-cuda-toolkit : Depends: nvidia-cuda-dev (= 10.1.243-3) but it is not going to be installed
Recommends: nsight-compute (= 10.1.243-3)
Recommends: nsight-systems (= 10.1.243-3)
E: Unable to correct problems, you have held broken packages.
Output of whereis cuda is cuda: (empty).
The output of nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 On | N/A |
| 0% 40C P8 31W / 300W | 626MiB / 11016MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1141 G /usr/lib/xorg/Xorg 59MiB |
| 0 N/A N/A 1749 G /usr/lib/xorg/Xorg 315MiB |
| 0 N/A N/A 1886 G /usr/bin/gnome-shell 59MiB |
| 0 N/A N/A 1907 G ...mviewer/tv_bin/TeamViewer 2MiB |
| 0 N/A N/A 2463 G ...ble-features=SpareRendere 4MiB |
| 0 N/A N/A 3825 G ...AAAAAAAAA= --shared-files 105MiB |
| 0 N/A N/A 4682 G .../debug.log --shared-files 36MiB |
| 0 N/A N/A 20600 G ...AAAAAAAAA= --shared-files 24MiB |
+-----------------------------------------------------------------------------+
I fear installing stuff to solve it and finish with the typical of 20 versions of CUDA colliding with each other.
So I did as suggested in the comments and uninstall everything in a very aggressive manner:
sudo apt clean
sudo apt update
sudo apt purge cuda
sudo apt purge nvidia-*
sudo apt autoremove
I then followed the instructions to install:
CUDA
CUDA Toolkit (Although I think it's the same, I just added a command sudo apt-get install nvidia-gds which I don't even know if it was necessary)
CUDNN
Now it seems to be working.

system76 ubuntu 20.04 tensorflow gpu cuda version conflicts

After an upgrade to Ubuntu 20.04 from 18.04 Tensorflow is no longer able to use my gpu because it is attempting to mix and load different versions (some 10 and some 11). It is a System76 machine, and I have cuda 10.1 installed from System76 (so it works with the System76 nvidia driver). When running tensorflow the following errors occur:
2021-01-07 18:12:22.584886: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:22.584906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-07 18:12:23.640665: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 18:12:23.641412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 18:12:23.669966: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-07 18:12:23.670257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.733GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2021-01-07 18:12:23.670328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670379: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.671387: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-07 18:12:23.671667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-07 18:12:23.673022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-07 18:12:23.673100: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.673245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-07 18:12:23.673259: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Notice all the warnings are for attempting to load version 11 of Cuda but it's only for some of the libraries. The version 10 ones load fine.
This is the output of nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
This is the output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 26W / N/A | 585MiB / 6069MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2999 G /usr/lib/xorg/Xorg 101MiB |
| 0 N/A N/A 3479 G /usr/lib/xorg/Xorg 255MiB |
| 0 N/A N/A 3720 G /usr/bin/gnome-shell 88MiB |
| 0 N/A N/A 6487 G ...AAAAAAAA== --shared-files 45MiB |
| 0 N/A N/A 6959 G ...AAAAAAAA== --shared-files 40MiB |
| 0 N/A N/A 11642 G ...AAAAAAAA== --shared-files 21MiB |
| 0 N/A N/A 25206 G WickrMe 17MiB |
+-----------------------------------------------------------------------------+
I see that the driver version in the output of nvidia-smi is version 11, but as I understand it, that has nothing to do with cuda runtime. That is simply the version up to which the driver supports. Correct me if I'm wrong.
I have to use version 10 because that is what is supported by System76 and it worked fine prior to the upgrade. I have also tried uninstalling and re-installing Tensorflow via pip3 and no luck.
Does anyone know how get all the libraries in sync to version 10.1? I also tried to manually place the version 11 libraries in place and let Tensorflow use the mixed version (which of course is a bad idea) but it won't recognize them (or I didn't place them properly).
As #talonmies pointed out, I was misunderstanding the versioning system. However, because it's a System76 machine, it was also confounding because System76 uses their own Nvidia driver, and it's not straightforward to install Cuda 11 and Cudnn. I'm posting the answer in case anyone else runs into problems with System76.
First, DO NOT use the System76 install for Cuda and Cudnn. They have their own versions (on their website) so as to be compatible with their Nvidia driver, but they will not work (they are version 10, and TF 2.2+ requires 11). Also, most general Cuda guides will tell you to uninstall/install the Nvida driver first so as to have a clean install, but DO NOT do this if you have a System76 system. Just leave the System76 driver alone. Also, if you have any previous Cuda/Cudnn remove/uninstall all of it.
Go to Nvidia and get their latest Cuda and Cudnn. I used
wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
Run that with
sudo sh cuda_11.0.2_450.51.05_linux.run
When it runs it will tell you that you have a conflict with the driver package. Ignore that and proceed. When you get to the install menu, UNCHECK "install driver" and continue with the install. When it's done, add to your path
/usr/local/cuda-11.0:/usr/local/cuda-11.0/bin:
You need to add both the cuda root and bin, not just bin (which is different than most general instructions). Source your .bashrc or .profile or wherever you put the path addition (or open a new terminal).
Now install Cudnn.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
Install it with dpkg. For example (in my case)...
sudo dpkg -i libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
That's it. Once I completed all that, everything worked fine. Hope that helps some System76 people get through Ununtu 20.04 and Cuda 11 a little easier.
Thank you very much.
One of the reasons I have used POP OS is that the Nvidia drivers+cuda/cudnn just worked with tensorflow, until this issue with version 11.0 missing.
One thing I needed to be able in install cuda 11.0 using the recipe above was to install gcc versions 8 :
sudo apt -y install gcc-8 g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8
I really wish POP!_os would provide CUDA 11.0 packages directly.....

Debug broken Tensorflow-gpu installation with Conda (1.14 vs 2.3), Ubuntu 18.04

I just recently made the mistake of fiddling with my TF install, and broke everything. I used to have two Conda envs with respectively TF 1.14 and 2.1, Cuda 10.1, both working fine. After much plumbing, I now have my main Conda env with TF 2.3, Cuda 10.1, but after doing everything to install the libs & tensorrt, and creating the new env for TF 1.14 (still some older code I haven't ported), what used to work like a charm, the conda install -c (conda-forge|anaconda) tensorflow-gpu now fails to see my gpu.
Sun Nov 1 09:15:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06 Driver Version: 450.36.06 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... On | 00000000:01:00.0 Off | N/A |
| N/A 38C P8 6W / N/A | 11MiB / 5944MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1469 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 2719 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
/usr/local/cuda:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.1:
bin doc extras include lib64 libnsight libnvvp LICENSE nsightee_plugins nvml nvvm README samples share src targets tools version.txt
/usr/local/cuda-10.2:
doc lib64 LICENSE README targets version.txt
/usr/local/cuda-11.1:
include lib64 src targets
And lastly the error:
In [2]: tf.test.is_gpu_available()
2020-11-01 00:42:23.536860: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
AVX2 FMA
2020-11-01 00:42:23.570537: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2295750000 Hz
2020-11-01 00:42:23.571572: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557fe1bd9660 executing computations on platform Host. Devices:
2020-11-01 00:42:23.571626: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): <undefined>, <undefined>
Out[2]: False
(Whereas in my other env with TF 2.3 everything is fine:)
In [2]: tf.config.list_physical_devices()
2020-11-01 09:11:18.858155: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-01 09:11:18.901461: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.901901: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1660 Ti with Max-Q Design computeCapability: 7.5
coreClock: 1.335GHz coreCount: 24 deviceMemorySize: 5.80GiB deviceMemoryBandwidth: 268.26GiB/s
2020-11-01 09:11:18.901934: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-11-01 09:11:18.903297: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-11-01 09:11:18.904777: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-11-01 09:11:18.905133: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-11-01 09:11:18.906631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-11-01 09:11:18.907411: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-11-01 09:11:18.910462: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-11-01 09:11:18.910683: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911185: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NU
MA node, so returning NUMA node zero
2020-11-01 09:11:18.911554: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
Out[2]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
I also know that the Conda-distributed version of TF worked with Cuda 10.1, it was working on my machine until yesterday, and now that I redo what seems to me the same steps, nothing works, so what could be the issue...?
Has anyone encountered this? I also need to solve this on another machine, exact same problem, and no cuda-11.1 in /usr/local this ... Thanks in advance!
So, after much wrangling (and it is certainly a symptom of madness of wanting to setup not one but two versions of TF on one machine in this day and age), the solution I found to work was:
in the main, TF 2.3 environment, follow the steps described here, except for two tweaks:
DO NOT INSTALL TENSORFLOW YET.
currently (October 2020) sudo apt-get install --no-install-recommends cuda-10-1 does not work any longer, but conda install cudatoolkit=10.1.243 does, see this;
OTHER CAVEAT I also notice that TF 2.3 could not find the whole array of libraries (libcublas.so.10, libcufft.so.10, libcurand.so.10, etc.) until I installed cuda 10.2... conda install cudatoolkit=10.2.89, which I've seen people talk about here, so unclear that this is the perfect solution (other people symlink the files, or copy them manually from one dir to another, those hellish days will be remembered;
(another option, without TensorRT, but very useful for purging cuda and nvidia things, and fail-safe, can be found here)
after all the libraries, cuda, etc., are installed (you need a reboot at this point, and you can check that your gpu(s) are visible using nvidia-smi, create a fresh environment, and install TF 1.4 using the anaconda channel (conda-forge failed for me): conda install tensorflow-gpu=1.14.
finally, at the very end, go back to the main env and install tensorflow with pip.
In there, you should have this:
$ conda list | grep tensop tensor
tensorboard 1.14.0 py37hf484d3e_0 anaconda
tensorflow 1.14.0 gpu_py37h74c33d7_0 anaconda
tensorflow-base 1.14.0 gpu_py37he45bfe2_0 anaconda
tensorflow-estimator 1.14.0 py_0 anaconda
tensorflow-gpu 1.14.0 h0d30ee6_0 anaconda
And, importantly:
$ pip freeze | grep tensor
tensorboard==1.14.0
tensorflow==1.14.0
tensorflow-estimator==1.14.0
This does not work if you installed TF with pip beforehand.
After that, activate your other base env, and complete your installation with pip
$ pip install tensorflow
Which should give you:
$ conda list | grep tenso tensor
tensorboard 2.3.0 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.3.1 pypi_0 pypi
tensorflow-estimator 2.3.0 pypi_0 pypi
And:
$ pip freeze | grep tensor
tensorboard==2.3.0
tensorboard-plugin-wit==1.7.0
tensorflow==2.3.1
tensorflow-estimator==2.3.0

Nvidia GeForce 210 compute issue on Ubuntu 18.04

I am using ubuntu 18.04 (I have dual booted windows with ubuntu 18.04).
nvidia-smi
This is the output I got when I ran the above command on my ubuntu(18.04) terminal:
Fri Oct 9 09:33:56 2020
+------------------------------------------------------+
| NVIDIA-SMI 340.108 Driver Version: 340.108 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce 210 Off | 0000:01:00.0 N/A | N/A |
| 35% 52C P8 N/A / N/A | 368MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
Before that, I followed these steps to install required driver on my system:
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
sudo apt-get purge nvidia*
sudo apt autoremove
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo shutdown -r now
When I tried to run Geekbench5 compute benchmark test, the output stopped when it was running Histogram Equalization. This is the output when I ran this ./geekbench5 --compute OpenCL in the folder where I extracted geekbench5:
[1009/092949:FATAL:src/halogen/cuda/cuda_library.cpp(1481)] Failed to load
cuDevicePrimaryCtxRetain: /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuDevicePrimaryCtxRetain
[1009/092949:FATAL:src/halogen/cuda/cuda_library.cpp(1481)] Failed to load cuDevicePrimaryCtxRetain: /usr/lib/x86_64-linux-gnu/libcuda.so.1: undefined symbol: cuDevicePrimaryCtxRetain
Geekbench 5.2.4 Tryout : https://www.geekbench.com/
Geekbench 5 is in tryout mode.
Geekbench 5 requires an active Internet connection when in tryout mode, and
automatically uploads test results to the Geekbench Browser. Other features
are unavailable in tryout mode.
Buy a Geekbench 5 license to enable offline use and remove the limitations of
tryout mode.
If you would like to purchase Geekbench you can do so online:
https://store.primatelabs.com/v5
If you have already purchased Geekbench, enter your email address and license
key from your email receipt with the following command line:
./geekbench5 -r <email address> <license key>
Running Gathering system information
System Information
Operating System Ubuntu 18.04.5 LTS 4.15.0-118-generic x86_64
Model To be filled by O.E.M. To be filled by O.E.M.
Motherboard O.E.M Intel H81
BIOS American Megatrends Inc. 4.6.5
Processor Information
Name Intel Core i5-4460
Topology 1 Processor, 4 Cores
Identifier GenuineIntel Family 6 Model 60 Stepping 3
Base Frequency 3.20 GHz
L1 Instruction Cache 32.0 KB x 2
L1 Data Cache 32.0 KB x 2
L2 Cache 256 KB x 2
L3 Cache 6.00 MB
Memory Information
Size 7.75 GB
OpenCL Information
Platform Vendor NVIDIA Corporation
Platform Name NVIDIA CUDA
Device Vendor NVIDIA Corporation
Device Name GeForce 210
Device Driver Version 340.108
Maximum Frequency 1.23 GHz
Compute Units 2
Device Memory 1024 MB
OpenCL
Running Sobel
Running Canny
Running Stereo Matching
Running Histogram Equalization
[1009/093329:ERROR:src/interface/console/consolemain.cpp(808)] Geekbench encountered an internal error and cannot continue. Please contact support#primatelabs.com for assistance.
Internal error message: clCreateImage returned -40.
Also, when I tried running the geekbench5 compute benchmark test on windows 10(same machine, on GUI), it paused running at Histogram equalization.
I am not getting any idea why this is happening.Is anything really wrong with my GPU or driver or anything else? I tried to search online, installed the driver again,rebooted the system, but the results are same. Can someone please help?
Your driver installation is fine, but your GPU is 11 years old and does not support some of the more recent features of the OpenCL standard. The geekbench error message -40 means that the image size geekbench uses for one of its benchmarks is not supported by your GPU. This causes the benchmark to crash. Maybe an older version of geekbench still works.

Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory;

When I try to run a python script , which uses tensorflow, it shows following error ...
2020-10-04 16:01:44.994797: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.780656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-04 16:01:46.795642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-10-04 16:01:46.795699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.795808: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64
2020-10-04 16:01:46.797391: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-04 16:01:46.797707: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-04 16:01:46.799529: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-04 16:01:46.800524: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-04 16:01:46.804150: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-04 16:01:46.804169: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 00000000:03:00.0 Off | N/A |
| 23% 28C P8 9W / 250W | 18MiB / 12194MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1825 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1957 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
Tensorflow version 2.3.1,
Ubuntu - 18.04
I tried to completely remove cuda toolkit and install from scratch but the error remains.
Anybody could help me to identify the source of problem??
On Ubuntu 20.04, you can simply install NVIDIAs cuda toolkit cuda:
sudo apt-get update
sudo apt install nvidia-cuda-toolkit
There are also install advices for Windows.
The packge is around 1GB and it took a while to install... Some minutes later you need to export PATH variables so that it can be found:
Find Shared Object
sudo find / -name 'libcudart.so*'
/usr/lib/x86_64-linux-gnu/libcudart.so.10.1
/usr/lib/x86_64-linux-gnu/libcudart.so
Add the folder to path, so that python finds it
export PATH=/usr/lib/x86_64-linux-gnu${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Permissions
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcuda*
Helped me
This usually happens when you run tensorflow with a non compatible version of CUDA. Looks like this has been asked before (could not comment). Refer this question.
Today I was facing this problem. I went to the CUDA toolkit website, selected the options, and that showed some instructions like this:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda # I have broken packages, so could not invoke this command
So the instructions will change depending on your specifications, DO NOT copy from here/other stackoverflow answer.
I could not invoke the last command, but after some trials and errors, I invoked:
sudo apt install libcudart.so.11.0 # this worked for me!
This worked for me!
You have to download/update Cuda
If you are looking CUDA Toolkit 10.2 Download use this link:
https://developer.nvidia.com/cuda-10.2-download-archive
Then active the virtual environment and set the LD_LIBRARY_PATH, example:
Tensorflow Could not load dynamic library 'libcudart.so.10.0 on ubuntu 18.04
Please run these commands, if you are having ubuntu 18.04 installed. or follow the instructions here
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
This worked for me:
sudo apt-get install libcudart10.1