Trying to get xgboost compiled for GPU. Seems my Cuda install is broken.
~$ cmake .. -DUSE_CUDA=ON
CMake Error at /usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:148 (message):
Could NOT find CUDA: Found unsuitable version "7.5", but required is at
least "8.0" (found /usr)
Call Stack (most recent call first):
/usr/share/cmake-3.5/Modules/FindPackageHandleStandardArgs.cmake:386 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.5/Modules/FindCUDA.cmake:949 (find_package_handle_standard_args)
CMakeLists.txt:113 (find_package)
I originally had CUDA 7.5 installed, but afterwards installed CUDA 9.1. I tried to uninstall 7.5, but probably missed something. I ran the following commands to check my Cuda version.
~$ which nvcc
/usr/bin/nvcc
~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2015 NVIDIA Corporation
Built on Tue_Aug_11_14:27:32_CDT_2015
Cuda compilation tools, release 7.5, V7.5.17
~$ cat /usr/local/cuda/version.txt
CUDA Version 9.1.85
~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 390.30 Wed Jan 31 22:08:49 PST 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.6)
~$ nvidia-smi
Wed Feb 21 00:35:35 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:01:00.0 On | N/A |
| 25% 46C P2 56W / 250W | 487MiB / 11175MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
This question suggests clearing cuda files in /usr/bin, and I have cleared the following files.
~$ ls /usr/local/cuda-9.1/bin
bin2c cuda-gdbserver nsight nvprof
computeprof cuda-install-samples-9.1.sh nsight_ee_plugins_manage.sh nvprune
crt cuda-memcheck nvcc nvvp
cudafe cuobjdump nvcc.profile ptxas
cudafe++ fatbinary nvdisasm uninstall_cuda_9.1.pl
cuda-gdb gpu-library-advisor nvlink
~$ cd /usr/bin
~$ ls /usr/local/cuda-9.1/bin | sudo xargs rm
rm: cannot remove 'computeprof': No such file or directory
rm: cannot remove 'crt': No such file or directory
rm: cannot remove 'gpu-library-advisor': No such file or directory
rm: cannot remove 'nsight': No such file or directory
rm: cannot remove 'nsight_ee_plugins_manage.sh': No such file or directory
rm: cannot remove 'nvcc.profile': No such file or directory
rm: cannot remove 'uninstall_cuda_9.1.pl': No such file or directory
Following the question, I added new paths in ~/.bashrc
export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
After these changes, the system correctly references Cuda 9.1. The other diagnostic calls remain unchanged.
~$ which nvcc
/usr/local/cuda-9.1/bin/nvcc
~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
However, running cmake .. -DUSE_CUDA=ON still fails, returning the same error. I tried restarting my computer, but it didn't help.
How can I get this to work?
Got it working.
I removed the xgboost directory, re-cloned it from GitHub, and then ran make. Presumably some residual files from the earlier build configuration were getting in the way.
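One plausible explanation, consistent with the fix above, is that CMake had cached the CUDA 7.5 location from the first run: CMake stores the results of find_package in CMakeCache.txt inside the build directory and reuses them on later runs. The following is only a minimal sketch for inspecting that cache before deleting anything. The function name and the build/CMakeCache.txt location are assumptions for illustration, not part of xgboost.

```python
import re
from pathlib import Path

# Matches CMakeCache.txt entries of the form KEY:TYPE=VALUE.
# Lines starting with '#' or '//' are comments and are skipped.
_CACHE_ENTRY = re.compile(r"^([^#/:=\s][^:=]*):([A-Z]+)=(.*)$")


def cuda_cache_entries(cache_text):
    """Return {key: value} for cache entries whose key mentions CUDA."""
    entries = {}
    for line in cache_text.splitlines():
        match = _CACHE_ENTRY.match(line.strip())
        if match and "CUDA" in match.group(1).upper():
            entries[match.group(1)] = match.group(3)
    return entries


if __name__ == "__main__":
    # Assumed location of the cache; adjust to your own build directory.
    cache = Path("build/CMakeCache.txt")
    if cache.exists():
        for key, value in sorted(cuda_cache_entries(cache.read_text()).items()):
            print(f"{key} = {value}")
    else:
        print("No CMakeCache.txt found; nothing is cached.")
```

If an entry such as CUDA_TOOLKIT_ROOT_DIR still points at the old install, removing the build directory (or just the cache file) and re-running cmake should have the same effect as re-cloning.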
After an upgrade to Ubuntu 20.04 from 18.04 Tensorflow is no longer able to use my gpu because it is attempting to mix and load different versions (some 10 and some 11). It is a System76 machine, and I have cuda 10.1 installed from System76 (so it works with the System76 nvidia driver). When running tensorflow the following errors occur:
2021-01-07 18:12:22.584886: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:22.584906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-07 18:12:23.640665: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-01-07 18:12:23.641412: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-01-07 18:12:23.669966: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-01-07 18:12:23.670257: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1060 computeCapability: 6.1
coreClock: 1.733GHz coreCount: 10 deviceMemorySize: 5.93GiB deviceMemoryBandwidth: 178.99GiB/s
2021-01-07 18:12:23.670328: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670379: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.670425: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.671387: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-07 18:12:23.671667: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-07 18:12:23.673022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-07 18:12:23.673100: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory
2021-01-07 18:12:23.673245: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-07 18:12:23.673259: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU.
Notice all the warnings are for attempting to load version 11 of Cuda but it's only for some of the libraries. The version 10 ones load fine.
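To make that pattern easier to see in a long log, the loader messages can be grouped by outcome. This is a rough sketch that assumes the two message formats shown above ("Successfully opened dynamic library ..." and "Could not load dynamic library '...'"); the function name is made up for illustration.

```python
import re
from collections import defaultdict

_LOADED = re.compile(r"Successfully opened dynamic library (\S+)")
_MISSING = re.compile(r"Could not load dynamic library '([^']+)'")


def summarize_dso_log(log_text):
    """Group library names from a TensorFlow log into loaded and missing lists."""
    summary = defaultdict(set)
    for line in log_text.splitlines():
        for status, pattern in (("loaded", _LOADED), ("missing", _MISSING)):
            match = pattern.search(line)
            if match:
                # A set removes repeats such as the second libcudart warning.
                summary[status].add(match.group(1))
    return {status: sorted(names) for status, names in summary.items()}
```

Feeding it the captured stderr of the Python process gives one sorted list of libraries that loaded and one of libraries that did not, which shows at a glance which major versions are involved.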
This is the output of nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105
This is the output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.38 Driver Version: 455.38 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 53C P0 26W / N/A | 585MiB / 6069MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2999 G /usr/lib/xorg/Xorg 101MiB |
| 0 N/A N/A 3479 G /usr/lib/xorg/Xorg 255MiB |
| 0 N/A N/A 3720 G /usr/bin/gnome-shell 88MiB |
| 0 N/A N/A 6487 G ...AAAAAAAA== --shared-files 45MiB |
| 0 N/A N/A 6959 G ...AAAAAAAA== --shared-files 40MiB |
| 0 N/A N/A 11642 G ...AAAAAAAA== --shared-files 21MiB |
| 0 N/A N/A 25206 G WickrMe 17MiB |
+-----------------------------------------------------------------------------+
I see that the CUDA version in the output of nvidia-smi is 11.1, but as I understand it, that has nothing to do with the installed CUDA runtime. It is simply the highest CUDA version that the driver supports. Correct me if I'm wrong.
I have to use version 10 because that is what is supported by System76 and it worked fine prior to the upgrade. I have also tried uninstalling and re-installing Tensorflow via pip3 and no luck.
Does anyone know how to get all the libraries in sync at version 10.1? I also tried to manually put the version 11 libraries in place and let Tensorflow use the mixed versions (which of course is a bad idea), but it won't recognize them (or I didn't place them properly).
As @talonmies pointed out, I was misunderstanding the versioning system. It was also confusing because this is a System76 machine: System76 ships its own Nvidia driver, and installing Cuda 11 and Cudnn alongside it is not straightforward. I'm posting the answer in case anyone else runs into problems with System76.
First, DO NOT use the System76 install for Cuda and Cudnn. They have their own versions (on their website) that are compatible with their Nvidia driver, but they will not work here (they are version 10, and TF 2.4+ requires 11). Also, most general Cuda guides will tell you to uninstall and reinstall the Nvidia driver first so as to have a clean install, but DO NOT do this if you have a System76 system. Just leave the System76 driver alone. Finally, if you have any previous Cuda/Cudnn install, remove all of it.
Go to Nvidia and get their latest Cuda and Cudnn. I used
wget http://developer.download.nvidia.com/compute/cuda/11.0.2/local_installers/cuda_11.0.2_450.51.05_linux.run
Run that with
sudo sh cuda_11.0.2_450.51.05_linux.run
When it runs it will tell you that you have a conflict with the driver package. Ignore that and proceed. When you get to the install menu, UNCHECK "install driver" and continue with the install. When it's done, add to your path
/usr/local/cuda-11.0:/usr/local/cuda-11.0/bin:
You need to add both the cuda root and bin, not just bin (which is different than most general instructions). Source your .bashrc or .profile or wherever you put the path addition (or open a new terminal).
Now install Cudnn.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
Install it with dpkg. For example (in my case)...
sudo dpkg -i libcudnn8_8.0.5.39-1+cuda11.0_amd64.deb
That's it. Once I completed all that, everything worked fine. Hope that helps some System76 people get through Ubuntu 20.04 and Cuda 11 a little more easily.
Thank you very much.
One of the reasons I have used POP OS is that the Nvidia drivers+cuda/cudnn just worked with tensorflow, until this issue with version 11.0 missing.
One thing I needed in order to install cuda 11.0 using the recipe above was to install gcc version 8:
sudo apt -y install gcc-8 g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 8
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-8 8
I really wish POP!_os would provide CUDA 11.0 packages directly.....
When I try to run a Python script that uses tensorflow, it shows the following error:
2020-10-04 16:01:44.994797: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.780656: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-10-04 16:01:46.795642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:03:00.0 name: TITAN X (Pascal) computeCapability: 6.1
coreClock: 1.531GHz coreCount: 28 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 447.48GiB/s
2020-10-04 16:01:46.795699: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-10-04 16:01:46.795808: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcublas.so.10'; dlerror: libcublas.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64/:/usr/local/cuda-10.0/lib64
2020-10-04 16:01:46.797391: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2020-10-04 16:01:46.797707: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2020-10-04 16:01:46.799529: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2020-10-04 16:01:46.800524: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2020-10-04 16:01:46.804150: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-10-04 16:01:46.804169: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Output of nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 TITAN X (Pascal) On | 00000000:03:00.0 Off | N/A |
| 23% 28C P8 9W / 250W | 18MiB / 12194MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1825 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 1957 G /usr/bin/gnome-shell 6MiB |
+-----------------------------------------------------------------------------+
Tensorflow version 2.3.1,
Ubuntu - 18.04
I tried to completely remove cuda toolkit and install from scratch but the error remains.
Could anybody help me identify the source of the problem?
On Ubuntu 20.04, you can simply install NVIDIA's CUDA toolkit:
sudo apt-get update
sudo apt install nvidia-cuda-toolkit
There is also installation advice for Windows.
The package is around 1GB and took a while to install. Once it has finished, you need to export the path variables so that the libraries can be found:
Find Shared Object
sudo find / -name 'libcudart.so*'
/usr/lib/x86_64-linux-gnu/libcudart.so.10.1
/usr/lib/x86_64-linux-gnu/libcudart.so
Add the folder to path, so that python finds it
export PATH=/usr/lib/x86_64-linux-gnu${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Permissions
sudo chmod a+r /usr/lib/x86_64-linux-gnu/libcuda*
Helped me
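To confirm that the loader can now find the libraries from the same environment Python runs in, one option is to try opening them directly with ctypes. This is only a sketch: the function name is made up, and the two library names are taken from the TensorFlow log in the question, so adjust them to whatever your own log reports as missing.

```python
import ctypes


def check_libraries(names):
    """Try to dlopen each library name; return {name: True or False}."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            # Raised when the dynamic loader cannot find or open the file.
            results[name] = False
    return results


if __name__ == "__main__":
    # Names copied from the log in the question; change as needed.
    wanted = ["libcudart.so.10.1", "libcublas.so.10"]
    for name, ok in check_libraries(wanted).items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

Because ctypes uses the same dynamic loader as TensorFlow, a library reported as NOT found here will most likely fail inside TensorFlow too.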
This usually happens when you run tensorflow with an incompatible version of CUDA. It looks like this has been asked before (I could not comment). Refer to this question.
Today I was facing this problem. I went to the CUDA toolkit website, selected the options, and that showed some instructions like this:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.6.2/local_installers/cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-6-local_11.6.2-510.47.03-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-6-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda # I have broken packages, so could not invoke this command
The instructions will change depending on your system, so DO NOT copy them from here or from another Stack Overflow answer.
I could not run the last command, but after some trial and error, I ran:
sudo apt install libcudart.so.11.0 # this worked for me!
This worked for me!
You have to download/update Cuda
If you are looking for the CUDA Toolkit 10.2 download, use this link:
https://developer.nvidia.com/cuda-10.2-download-archive
Then activate the virtual environment and set the LD_LIBRARY_PATH, example:
Tensorflow Could not load dynamic library 'libcudart.so.10.0 on ubuntu 18.04
Please run these commands if you have Ubuntu 18.04 installed, or follow the instructions here.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda
This worked for me:
sudo apt-get install libcudart10.1
I am trying to address the issue in the title:
Loaded runtime CuDNN library: 7.1.2 but source was compiled with: 7.6.0. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version
I have read several other posts (example: Loaded runtime CuDNN library: 5005 (compatibility version 5000) but source was compiled with 5103 (compatibility version 5100)) which basically tell me that my machine has CuDNN 7.1.2 but I need 7.6.0, and that the answer is to download and install 7.6.*.
The only issue is that I thought I had already done that by following the instructions on the Nvidia archive (https://developer.nvidia.com/rdp/cudnn-archive), and if I go to /usr/local/cuda/include and read cudnn.h, it shows:
#if !defined(CUDNN_H_)
#define CUDNN_H_
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
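Note that the header only describes the copy installed under /usr/local/cuda/include, while the error is about the library actually loaded at runtime, which may come from a different directory. As a rough sketch, the rule quoted in the error message can be written down and checked against both versions. The function names are made up for illustration, and the rule is taken only from the wording of the message (in newer cuDNN releases the version macros live in cudnn_version.h rather than cudnn.h).

```python
import re


def parse_cudnn_header(header_text):
    """Extract (major, minor, patch) from the #define lines of the header."""
    version = []
    for field in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            rf"^\s*#define\s+CUDNN_{field}\s+(\d+)", header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError(f"CUDNN_{field} not found")
        version.append(int(match.group(1)))
    return tuple(version)


def runtime_is_compatible(loaded, compiled):
    """Apply the rule from the error message to (major, minor, patch) tuples."""
    if loaded[0] != compiled[0]:
        return False
    if loaded[0] >= 7:
        # cuDNN 7.0 or later: same major, equal or higher minor.
        return loaded[1] >= compiled[1]
    # Earlier releases: major and minor must both match.
    return loaded[1] == compiled[1]
```

With the numbers from this question, runtime_is_compatible((7, 1, 2), (7, 6, 0)) is False, while the header version (7, 6, 4) would pass. That suggests the 7.6.4 install is fine and that an older libcudnn is being found first at runtime.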
Currently I have CUDA-10.0, 10.1, and 10.2 installed with my .bashrc set to 10.0 (although nvcc --version states I have cuda 9.1, another issue I can't seem to fix).
Any suggestions? I have been trying to tackle this for days but no luck.
UPDATE:
Here are the paths I have
export PATH=$PATH:/usr/local/cuda-10.0/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-10.0/lib64
export CUDA_HOME=/usr/local/cuda
Before this is closed, could you please help by either suggesting the proper paths to set or explaining how to find the old cudnn?
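One thing worth checking with exports like the ones above: the CUDA directory is appended to PATH rather than prepended, so an nvcc in any earlier PATH entry is found first, which may explain why nvcc --version reports a different release. The sketch below lists every nvcc on the PATH in lookup order; the function name is made up for illustration.

```python
import os


def find_all_on_path(command, path=None):
    """Return every executable named `command` on PATH, in lookup order."""
    path = os.environ.get("PATH", "") if path is None else path
    hits = []
    for directory in path.split(os.pathsep):
        # An empty PATH entry means the current directory.
        candidate = os.path.join(directory or ".", command)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for position, hit in enumerate(find_all_on_path("nvcc"), start=1):
        print(position, hit)
```

The first entry printed is the one the shell runs. If it is not the CUDA version you want, putting the desired bin directory at the front of PATH changes the order. The same ordering applies to LD_LIBRARY_PATH and the libraries found through it.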
I hit a very similar error:
Loaded runtime CuDNN library: 7.1.4 but source was compiled with: 7.6.5. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
and tracked it down to accidentally having an older CuDNN in my ldconfig:
$ sudo ldconfig -p | grep libcudnn
libcudnn.so.7 (libc6,x86-64) => /usr/local/cuda-9.0/lib64/libcudnn.so.7
libcudnn.so.7 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so.7
libcudnn.so (libc6,x86-64) => /usr/local/cuda-9.0/lib64/libcudnn.so
libcudnn.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcudnn.so
The libcudnn.so.7 file in the cuda-9.0 directory was pointing to the older version:
ls -alh /usr/local/cuda-9.0/lib64/libcudnn.so.7
lrwxrwxrwx 1 root root 17 Dec 16 2018 /usr/local/cuda-9.0/lib64/libcudnn.so.7 -> libcudnn.so.7.1.4
But I had compiled tensorflow against the newer version:
ls -alh /usr/lib/x86_64-linux-gnu/libcudnn.so.7
lrwxrwxrwx 1 root root 17 Oct 27 2019 /usr/lib/x86_64-linux-gnu/libcudnn.so.7 -> libcudnn.so.7.6.5
Since ldconfig uses /etc/ld.so.conf to determine where to look for libraries (directories in LD_LIBRARY_PATH are searched separately by the dynamic loader, before the ldconfig cache), I checked it and it showed:
include /etc/ld.so.conf.d/*.conf
When I listed the files in that directory, I spotted the problem file and removed it:
$ cat /etc/ld.so.conf.d/cuda9.conf
/usr/local/cuda-9.0/lib64
$ sudo rm /etc/ld.so.conf.d/cuda9.conf
After that I re-ran ldconfig to reload the config, and then everything worked as expected and the error disappeared.
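The check I did by eye on the ldconfig -p output can also be scripted. This is a minimal sketch that assumes the line format shown above ("name (tags) => path"); the function name and the default prefix are just illustrative.

```python
import re
from collections import defaultdict

# Lines look like: "libcudnn.so.7 (libc6,x86-64) => /usr/lib/.../libcudnn.so.7"
_ENTRY = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")


def duplicate_libraries(ldconfig_output, prefix="libcudnn"):
    """Map (soname, ABI tags) to its paths when the cache lists more than one."""
    paths = defaultdict(list)
    for line in ldconfig_output.splitlines():
        match = _ENTRY.match(line)
        if match and match.group(1).startswith(prefix):
            paths[(match.group(1), match.group(2))].append(match.group(3))
    # Keep only names that resolve to several files, in cache order.
    return {key: found for key, found in paths.items() if len(found) > 1}
```

Pass it the text printed by ldconfig -p. Any soname that comes back with more than one path is a candidate for the kind of shadowing described here, and the first path in each list is the one that appeared first in the cache listing.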
I'm trying to run some Tensorflow code, and I get what seems to be a common problem:
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 20:36:15.903204: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 20:36:15.908809: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-02-06 20:36:15.908858: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: tigris
2019-02-06 20:36:15.908868: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: tigris
2019-02-06 20:36:15.908942: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 390.77.0
2019-02-06 20:36:15.908985: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 390.30.0
2019-02-06 20:36:15.909006: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:308] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
$
The key pieces of that error message seem to be:
[...] libcuda reported version is: 390.77.0
[...] kernel reported version is: 390.30.0
[...] kernel version 390.30.0 does not match DSO version 390.77.0 -- cannot find working devices in this configuration
How can I install compatible versions? Where is that libcuda version coming from?
Background
A few months ago, I tried installing Tensorflow with GPU support, but the versions either broke my display or wouldn't work with Tensorflow. Finally, I got it working by following a tutorial on how to install multiple versions of the CUDA libraries on the same machine. That worked at the time, but when I came back to the project after a few months, it has stopped working. I assume that some driver got upgraded during that time.
Investigation
The first thing I tried was to see what versions I have of the nvidia drivers and libcuda package.
$ dpkg --list|grep libcuda
ii libcuda1-390 390.30-0ubuntu1 amd64 NVIDIA CUDA runtime library
Looks like it's 390.30. Why does the error message say that libcuda reported 390.77?
$ dpkg --list|grep nvidia
ii libnvidia-container-tools 1.0.1-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.0.1-1 amd64 NVIDIA container runtime library
rc nvidia-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA binary driver - version 384.130
ii nvidia-390 390.30-0ubuntu1 amd64 NVIDIA binary driver - version 390.30
ii nvidia-390-dev 390.30-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
rc nvidia-396 396.44-0ubuntu1 amd64 NVIDIA binary driver - version 396.44
ii nvidia-container-runtime 2.0.0+docker18.09.1-1 amd64 NVIDIA container runtime
ii nvidia-container-runtime-hook 1.4.0-1 amd64 NVIDIA container runtime hook
ii nvidia-docker2 2.0.3+docker18.09.1-1 all nvidia-docker CLI wrapper
ii nvidia-modprobe 390.30-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
rc nvidia-opencl-icd-384 384.130-0ubuntu0.16.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-opencl-icd-390 390.30-0ubuntu1 amd64 NVIDIA OpenCL ICD
rc nvidia-opencl-icd-396 396.44-0ubuntu1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.8.8.2 all Tools to enable NVIDIA's Prime
ii nvidia-settings 396.44-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
Again, everything looks like it's 390.30. There were some packages that had version 390.77, but they were in the rc status. I guess I installed that version and later removed it, so the configuration files were left behind. I purged the configuration files with commands like this:
sudo apt-get remove --purge nvidia-kernel-common-390
Now, there are no packages at all with version 390.77.
$ dpkg --list|grep 390.77
$
I tried reinstalling CUDA, to see if it had been compiled with the wrong version.
$ sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-9.0 --override
That didn't make any difference.
Finally, I tried running nvidia-smi.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
$
All of this is running on Ubuntu 18.04 with Python 3.6.7, and my graphics card is NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2).
I finally had the idea to look for any files with 390.77 in the name.
$ locate 390.77
/usr/lib/i386-linux-gnu/libcuda.so.390.77
/usr/lib/i386-linux-gnu/libnvcuvid.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/i386-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
/usr/lib/x86_64-linux-gnu/libcuda.so.390.77
/usr/lib/x86_64-linux-gnu/libnvcuvid.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.390.77
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.390.77
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.390.77
So there they are! A closer look shows that I must have installed the newer version at some point.
$ ls /usr/lib/i386-linux-gnu/libcuda* -l
lrwxrwxrwx 1 root root 12 Nov 8 13:58 /usr/lib/i386-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 17 Nov 12 14:04 /usr/lib/i386-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9179124 Jan 31 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.30
-rw-r--r-- 1 root root 9179796 Jul 10 2018 /usr/lib/i386-linux-gnu/libcuda.so.390.77
Where did they come from?
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.30
libcuda1-390: /usr/lib/i386-linux-gnu/libcuda.so.390.30
$ dpkg -S /usr/lib/i386-linux-gnu/libcuda.so.390.77
dpkg-query: no path found matching pattern /usr/lib/i386-linux-gnu/libcuda.so.390.77
So the 390.77 no longer belongs to any package. Perhaps I installed the old version and had to force it to overwrite the links.
My plan is to delete the files, then reinstall the packages to set up the links to the correct version. So which packages will I need to reinstall?
$ locate 390.77|sed -e 's/390.77/390.30/'|xargs dpkg -S
Some of the files don't match anything, but the ones that do match are from these packages:
libcuda1-390
nvidia-opencl-icd-390
Crossing my fingers, I delete the version 390.77 files.
locate 390.77|sudo xargs rm
Then I reinstall the packages.
sudo apt-get install --reinstall libcuda1-390 nvidia-opencl-icd-390
Finally, it works!
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 python -c "import tensorflow; tensorflow.Session()"
2019-02-06 22:13:59.460822: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-02-06 22:13:59.665756: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-02-06 22:13:59.666205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.176
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.81GiB
2019-02-06 22:13:59.666226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-06 22:17:21.254445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-06 22:17:21.254489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-02-06 22:17:21.254496: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-02-06 22:17:21.290992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3539 MB memory) -> physical GPU (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
nvidia-smi also works now.
$ LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64 nvidia-smi
Wed Feb 6 22:19:24 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 N/A / N/A | 113MiB / 4046MiB | 6% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 3212 G /usr/lib/xorg/Xorg 113MiB |
+-----------------------------------------------------------------------------+
I rebooted, and the video drivers continued to work. Hurrah!
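The mismatch tracked down above can also be checked directly by comparing the kernel module version with the version that the libcuda.so.1 symlink resolves to. This is only a sketch: it assumes the /proc/driver/nvidia/version format and the library path seen in the listings above, and the function names are made up for illustration.

```python
import os
import re


def kernel_module_version(proc_text):
    """Pull the driver version out of /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def library_version(resolved_path):
    """Read the version suffix of a file name such as libcuda.so.390.77."""
    name = os.path.basename(resolved_path)
    match = re.search(r"\.so\.(\d+(?:\.\d+)+)$", name)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Both paths are assumptions based on the listings above.
    proc_file = "/proc/driver/nvidia/version"
    lib_link = "/usr/lib/x86_64-linux-gnu/libcuda.so.1"
    if os.path.exists(proc_file) and os.path.exists(lib_link):
        with open(proc_file) as handle:
            kernel = kernel_module_version(handle.read())
        # realpath follows the symlink chain to the versioned file.
        library = library_version(os.path.realpath(lib_link))
        print("kernel module:", kernel, "| libcuda:", library)
        print("match" if kernel == library else "MISMATCH")
```

Before the fix this would have reported 390.30 for the kernel module and 390.77 for libcuda, the same pair that TensorFlow complained about.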
I am new with TensorFlow.
I just installed TensorFlow and to test the installation, I tried the following code and as soon as I initiate the TF Session, I am getting the Segmentation fault (core dumped) error.
bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)
My nvidia-smi is:
Tue May 15 12:12:26 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30 Driver Version: 390.30 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:04:00.0 Off | 0 |
| N/A 38C P8 26W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 On | 00000000:05:00.0 Off | 2 |
| N/A 31C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
And nvcc --version is:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
Also gcc --version is:
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Following is my PATH:
/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
and the LD_LIBRARY_PATH:
/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib
I am running this on a server and I don't have root privileges. Still I managed to install everything as per the instructions on the official website.
Edit: New observations:
It seems the GPU allocates memory for the process for a second, and then the "Segmentation fault (core dumped)" error is thrown.
Edit2: Changed tensorflow version
I downgraded my tensorflow version from v1.8 to v1.5. The issue still remains.
Is there any way to address or debug this issue?
This could be happening because you are using multiple GPUs here. Try setting CUDA_VISIBLE_DEVICES to just one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
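As a minimal sketch of that suggestion, the variable can also be set from Python itself. The helper name restrict_visible_gpus is hypothetical, and the sketch assumes it runs before TensorFlow is imported, since the CUDA driver only reads CUDA_VISIBLE_DEVICES when it initializes.

```python
import os


def restrict_visible_gpus(device_ids):
    """Expose only the given GPU indices to CUDA in this process.

    Hypothetical helper: it only sets the CUDA_VISIBLE_DEVICES
    environment variable and returns the value it wrote.
    """
    value = ",".join(str(i) for i in device_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Safest to call this before the first `import tensorflow`.
restrict_visible_gpus([0])

# import tensorflow as tf   # TensorFlow would now see a single GPU
# tf.Session()
```

The same effect can be had from the shell by exporting CUDA_VISIBLE_DEVICES=0 before starting python.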
If you look at the nvidia-smi output, the second GPU has a "Volatile Uncorr. ECC" count of 2. This error shows up regardless of the CUDA or TF version, usually as a segfault and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.
I got to this conclusion from this post:
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is
Error Correcting Code, a means to detect and correct errors in bits
stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM
every once in a great while, but "uncorrectable ECC error" indicates
that several bits are coming out of RAM storage "wrong" - too many for
the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU
device memory.
Marginal circuits of any kind may not fail 100%, but are more likely
to fail under the stress of heavy use - and associated rise in
temperature.
A reboot is usually supposed to clear the ECC error. If it does not, the only option seems to be replacing the hardware.
So what did I do, and how did I finally fix the issue?
1. I tested my code on a separate machine with an NVIDIA 1050 Ti, and it ran perfectly fine.
2. I made the code run only on the first card, whose ECC value was normal, just to narrow down the issue. I did this by following this post and setting the CUDA_VISIBLE_DEVICES environment variable.
3. I then requested a restart of the Tesla K80 server to check whether a restart would fix the issue. It took a while, but the server was eventually restarted.
Now the issue is gone and I can use both cards for my TensorFlow implementations.
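The narrowing-down step (running only on cards whose ECC count is normal) can be sketched in Python as follows. This is an illustration under assumptions, not the procedure used above: the function names are hypothetical, and it assumes nvidia-smi supports the ecc.errors.uncorrected.volatile.total query field and prints one "index, count" line per GPU in csv,noheader format. Cards that report a non-numeric value (for example, cards without ECC) are kept.

```python
import subprocess

# Assumed query; field names are listed by `nvidia-smi --help-query-gpu`.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,ecc.errors.uncorrected.volatile.total",
    "--format=csv,noheader",
]


def query_ecc_counts():
    """Run nvidia-smi and return its raw CSV output (requires a GPU host)."""
    return subprocess.check_output(QUERY, universal_newlines=True)


def healthy_gpu_indices(csv_text):
    """Return indices of GPUs with no volatile uncorrectable ECC errors."""
    healthy = []
    for line in csv_text.strip().splitlines():
        index, count = [field.strip() for field in line.split(",")]
        if count.isdigit() and int(count) > 0:
            continue  # skip cards that report uncorrectable errors
        healthy.append(int(index))
    return healthy


# Sample text mirroring the nvidia-smi table in the question:
# GPU 0 reports 0 errors, GPU 1 reports 2.
sample = "0, 0\n1, 2\n"
print(healthy_gpu_indices(sample))  # [0]
```

On a real machine, healthy_gpu_indices(query_ecc_counts()) would give the list to join into CUDA_VISIBLE_DEVICES.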
In case anyone is still interested, I happened to have the same issue, with the same "Volatile Uncorr. ECC" output. My problem was incompatible versions, as shown below:
Loaded runtime CuDNN library: 7.1.1 but source was compiled with:
7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a
binary install, upgrade your CuDNN library. If building from sources,
make sure the library loaded at runtime is compatible with the version
specified during compile configuration. Segmentation fault
After I upgraded the CuDNN library to 7.3.1 (which is greater than 7.2.1), the segmentation fault disappeared. To upgrade, I did the following (as also documented here):
Download CuDNN library from NVIDIA website
sudo tar -xzvf [TAR_FILE]
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
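The compatibility rule quoted in the error message (same major version, and a loaded minor version at least as high as the compiled one) can be checked before launching TensorFlow. The sketch below is a hedged illustration: both function names are hypothetical, and it assumes the installed cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL, as cuDNN 7 headers do, and that the header matches the library copied next to it.

```python
import re


def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        match = re.search(r"#define\s+%s\s+(\d+)" % name, header_text)
        if match is None:
            raise ValueError("%s not found in header" % name)
        parts.append(int(match.group(1)))
    return tuple(parts)


def runtime_is_compatible(loaded, compiled):
    """Rule from the error message: majors match, loaded minor >= compiled minor."""
    return loaded[0] == compiled[0] and loaded[1] >= compiled[1]


# Sample header text standing in for /usr/local/cuda/include/cudnn.h.
sample_header = (
    "#define CUDNN_MAJOR 7\n"
    "#define CUDNN_MINOR 1\n"
    "#define CUDNN_PATCHLEVEL 1\n"
)
loaded = parse_cudnn_version(sample_header)
print(loaded)                                        # (7, 1, 1)
print(runtime_is_compatible(loaded, (7, 2, 1)))      # False
print(runtime_is_compatible((7, 3, 1), (7, 2, 1)))   # True
```

In practice the header text would be read with open("/usr/local/cuda/include/cudnn.h").read(), and the compiled version taken from the error message or the TensorFlow release notes.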
I was also facing the same issue. I have a workaround that you can try.
I followed these steps:
1. Reinstall Python 3.5 or above.
2. Reinstall CUDA and add the cuDNN libraries to it.
3. Reinstall the TensorFlow 1.8.0 GPU version.
Check that you are using the exact versions of CUDA and CuDNN required by TensorFlow, and also that your graphics driver is the version that ships with that CUDA version.
I once had a similar issue caused by a driver that was too recent. Downgrading it to the version that ships with the CUDA version required by TensorFlow solved the issue for me.
I encountered this problem recently.
The reason was multiple GPUs in a Docker container.
The solution is pretty simple. Either:
set CUDA_VISIBLE_DEVICES on the host
(refer to https://stackoverflow.com/a/50464695/2091555)
or
use --ipc=host to launch the container if you need multiple GPUs
e.g.
docker run --runtime nvidia --ipc host \
--rm -it \
nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04
This problem is actually pretty nasty: the segfault happens during cuInit() calls inside the Docker container, while everything works fine on the host. I will leave the logs here so that search engines can find this answer more easily for other people.
(base) root#e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)
(base) root#e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]
warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Unexpected size of section `.reg-xstate/572' in core file.
#0 0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0 0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1 0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2 0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3 0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4 0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5 0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6 0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>,
flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7 _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8 0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9 0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit
Another attempt, installing with pip:
(base) root#e121c445c1eb:~# pip install torch torchvision
(base) root#e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)
(base) root#e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
warning: core file may not match specified executable file.
[New LWP 28]
warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Unexpected size of section `.reg-xstate/28' in core file.
#0 0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0 0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5 0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6 0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>,
... (stack omitted)
I am using TensorFlow in a cloud environment from Paperspace.
Updating to cuDNN 7.3.1 did not work for me.
One way is to build TensorFlow with proper GPU and CPU support.
This is not a proper solution, but it solved my issue temporarily (downgrading TensorFlow to 1.5.0):
pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12
Hope this helps!