NVRM: API mismatch despite having the same version on client and kernel - gpu

I have installed nvidia driver 440.64.
After reboot, I get a black screen instead of the login screen. I pressed CTRL+ALT+F3 to get a console login and ran sudo prime-select intel. After that, the login screen appears and I can log in.
After logging in, I run nvidia-smi:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
I do the following:
prime-select query
Output:
intel
Then
sudo prime-select nvidia
Output:
Info: selecting the nvidia profile
Then
nvidia-smi
Output:
Failed to initialize NVML: Driver/library version mismatch
Then
dmesg
Output:
...
[ 68.122795] NVRM: API mismatch: the client has the version 440.64.00, but
NVRM: this kernel module has the version 440.64. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
...
If I reboot, I run into the same problem.

Installing nvidia-driver-440 from the NVIDIA deb repo will fix it.
The NVIDIA kernel module and the client library should come from the same repository; otherwise the versions will most likely mismatch.
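As a rough sketch (assuming Ubuntu with a single NVIDIA repository configured, e.g. the graphics-drivers PPA or NVIDIA's CUDA deb repo), you can purge the mixed packages and reinstall everything from one source:
# check what version the loaded kernel module reports
cat /proc/driver/nvidia/version
# remove the mixed-version driver packages, then reinstall from a single repo
sudo apt purge 'nvidia-*' 'libnvidia-*'
sudo apt update
sudo apt install nvidia-driver-440
sudo reboot
# after the reboot, nvidia-smi should report one consistent version
nvidia-smi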

Related

Install Tensorflow-GPU on WSL2

Has anyone successfully installed Tensorflow-GPU on WSL2 with NVIDIA GPUs? I have Ubuntu 18.04 on WSL2, but am struggling to get NVIDIA drivers installed. Any help would be appreciated as I'm lost.
So I have just got this running.
The steps you need to follow are here. To summarise them:
Sign up for the Windows Insider Program and get the development builds of Windows so that you have the latest version
Install WSL 2
Install Ubuntu from the Windows Store
Install the WSL 2 CUDA driver on Windows
Install the CUDA toolkit
Install cuDNN (you can download the Linux version from Windows and then copy the file to Linux)
If you are getting memory errors like 'cannot allocate memory' then you might need to increase the amount of memory WSL can get
Then install tensorflow-gpu (a minimal install-and-check sketch follows this list)
Pray it works
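A minimal sketch of that last install step, assuming pip points at a Python 3 environment inside WSL and a TensorFlow release that still ships the tensorflow-gpu package:
# install the GPU build of TensorFlow
pip install tensorflow-gpu
# quick check that TensorFlow can see the GPU
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"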
Bugs I hit along the way:
If you get an error when you open Ubuntu for the first time, you need to enable virtualisation in the BIOS
If you cannot run the ./BlackScholes example in the installation instructions you might not have the right build of Windows! You must have the right version
If you are getting 'cannot allocate memory' errors when running TF you need to give WSL more RAM. It only accesses half your RAM by default
Create a .wslconfig file under your user directory in Windows with the amount of memory you want. Mine looks like:
[wsl2]
memory=16GB
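The .wslconfig change only takes effect after WSL restarts; one way to force that (run from a Windows command prompt or PowerShell, not from inside Ubuntu) is:
wsl --shutdown
Then reopen the Ubuntu terminal and the new memory limit should apply.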
Edit after running some code
This is much slower than when I was running on Windows directly. I went from 1 minute per epoch to 5 minutes. I'm just going to dual-boot.
These are the steps I had to follow for Ubuntu 20.04. I am no longer on the dev channel; the beta channel works fine for this use case and is much more stable.
Install WSL2
Install Ubuntu 20.04 from Windows Store
Install Nvidia Drivers for Windows from: https://developer.nvidia.com/cuda/wsl/download
Install nvcc inside of WSL with:
sudo apt install nvidia-cuda-toolkit
Check that it is there with:
nvcc --version
For my use case, I do data science and already had Anaconda installed. I created an environment and installed TensorFlow into it with:
conda create --name tensorflow
conda activate tensorflow
conda install tensorflow-gpu
Then just test it with this little Python program with the environment activated:
import tensorflow as tf
# should list the GPU, e.g. [PhysicalDevice(name='/physical_device:GPU:0', ...)]
print(tf.config.list_physical_devices('GPU'))
# print the CUDA and cuDNN versions TensorFlow was built against
sys_details = tf.sysconfig.get_build_info()
cuda = sys_details["cuda_version"]
cudnn = sys_details["cudnn_version"]
print(cuda, cudnn)
For reasons I do not understand, my machine was unable to find the GPU without installing nvcc, and it actually gave an error message saying it could not find nvcc.
The online tutorials I had found had you downloading CUDA and cuDNN separately, but I think nvcc includes cuDNN since it is . . . there somehow.
I can confirm I am able to get this working without the need for Docker on WSL2 thanks to the following article:
https://qiita.com/Navier/items/cf551908bae707db4258
Be sure to update to driver version 460.15, not 455.41 as listed in the CUDA documentation.
Note, this does not work with the card in TCC mode (only WDDM). Also, be sure to place your files on the Linux file system (i.e. not on a mounted Windows drive like /mnt/c/). Performance is significantly faster on the Linux file system (this has to do with the difference in implementation of WSL 1 vs. WSL 2; see 1, 2, and 3).
NOTE: See also Is the class generator (inheriting Sequence) thread safe in Keras/Tensorflow?
I just want to point out that using Anaconda to install cudatoolkit and cudnn does not seem to work in WSL.
Maybe there is some problem with paths that makes TF look for the needed files only in the system paths instead of the conda environments.
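If you do want to try the conda-installed libraries anyway, one workaround that is sometimes suggested (a sketch I have not verified on WSL) is to point the loader at the active environment's lib directory before starting Python:
# $CONDA_PREFIX is set by `conda activate`; prepend its lib dir to the library search path
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$LD_LIBRARY_PATH"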

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running. Why?

I'm trying to run stylegan2 on Google Colab, but with all the files on my Drive and without using !git clone from the stylegan2 GitHub repo.
Here is my code on the specific cell:
%tensorflow_version 1.x
import tensorflow as tf
%cd /content/drive/My Drive/stylegan2-master/
!nvcc test_nvcc.cu -o test_nvcc -run
print('Tensorflow version: {}'.format(tf.__version__))
!nvidia-smi -L
print('GPU Identified at: {}'.format(tf.test.gpu_device_name()))
And the result:
/content/drive/My Drive/stylegan2-master
CPU says hello.
cudaErrorNoDevice: no CUDA-capable device is detected
Tensorflow version: 1.15.2
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
GPU Identified at:
Why can't I get the GPU?
I am new to the field so I may be missing something very simple, but I still can't find the answer on the internet.
You have to enable the GPU first in the Notebook settings.
You can easily do it by clicking on Edit > Notebook settings and selecting GPU as hardware accelerator.
That should be it.
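Once the runtime has restarted with the GPU attached, re-running the same checks from the question should confirm it (the exact GPU model will vary):
!nvidia-smi -L   # should now list a GPU, e.g. a Tesla K80 or T4
print('GPU Identified at: {}'.format(tf.test.gpu_device_name()))   # e.g. /device:GPU:0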

Cuda Installation Error

I installed CUDA on my Ubuntu 18.04 (dual boot with Windows 10) using the following commands:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo ubuntu-drivers autoinstall
Then I rebooted my computer and ran:
sudo apt install nvidia-cuda-toolkit gcc-6
Then I verified the installation using:
nvcc --version
which nvcc
Both worked well without any errors. A few days later I wanted to verify it completely, so I entered these two commands:
sudo modprobe nvidia
nvidia-smi
which gave me these errors, respectively:
modprobe: ERROR: could not insert 'nvidia': Required key not available
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Now I am unable to tell whether CUDA is properly installed or not. I am also unable to find a cuda-9.0 directory under /usr in Ubuntu. I need this so that I can work with tensorflow-gpu (Python 3).
Thank you in advance.
Apparently, the "required key not available" message is a typical (side-)effect of the "secure boot" feature of newer Linux kernels (EFI_SECURE_BOOT_SIG_ENFORCE); and you may be able to get around it by Disabling Secure Boot in your UEFI BIOS.
See this AskUbuntu question for details:
Why do I get “Required key not available” when install 3rd party kernel modules or after a kernel upgrade?
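To check whether Secure Boot is really what is blocking the module, a quick test (assuming the mokutil package is installed) is:
# reports "SecureBoot enabled" or "SecureBoot disabled"
mokutil --sb-state
If it is enabled, either disable it in the UEFI settings or sign the nvidia module with a Machine Owner Key; the linked AskUbuntu question discusses both routes.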

Tensorflow failed to use gpu: libnvidia-fatbinaryloader.so.396.26 not found

I need help setting up TensorFlow with GPU support; however, I got an error while triggering a TensorFlow job on the GPU:
ImportError: libnvidia-fatbinaryloader.so.396.26: cannot open shared object file: No such file or directory
I already have NVIDIA driver version 396, CUDA toolkit 9 and cuDNN 7 installed, and my GPU is a Tesla K80. I checked the files under /usr/lib/nvidia-396; only libnvidia-fatbinaryloader.so.396.24 was found.
Can anyone help me out?
Best,
Juhua
It seems that two NVIDIA driver versions are conflicting: 396.24 and 396.26.
When you update the drivers, the corresponding CUDA library is not always updated.
You can reinstall libcuda:
sudo apt-get purge libcuda1-*
sudo apt-get install libcuda1-396
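After the reinstall it is worth confirming that the loader can now see a library matching the driver version the error message asks for (paths here assume Ubuntu's nvidia-396 packaging):
# the version suffix should now match the one TensorFlow is looking for
ldconfig -p | grep libnvidia-fatbinaryloader
ls /usr/lib/nvidia-396/libnvidia-fatbinaryloader.so*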

How to debug a segmentation fault 11 on TensorFlow?

I installed CUDA 8 and the new TensorFlow 1.0.
When I run "import tensorflow as tf" I get the following:
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.8.0.dylib locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.5.dylib locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.8.0.dylib locally
Segmentation fault: 11
Knowing that nvcc -V gives the following:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Oct_30_22:18:43_CDT_2016
Cuda compilation tools, release 8.0, V8.0.54
Any idea how to fix this segmentation fault?
You might be missing a library in your local cuda installation. E.g., /usr/local/cuda/lib/libcuda.dylib was missing for me after trying to install CUDA Toolkit 8.0 locally (possibly because I installed the drivers first before the toolkit, as this ancient thread suggests: https://render.otoy.com/forum/viewtopic.php?f=25&t=1859). Re-running the installer for just the driver installed it properly, and also symlinked it to another name (https://github.com/tensorflow/tensorflow/issues/3263#issuecomment-232184358).
Lastly, double-check your environment variable paths; see if echo $DYLD_LIBRARY_PATH looks right.
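A quick sanity check (assuming the default CUDA 8 install location on macOS) would be something like:
# confirm the driver library is where the loader will look for it
ls /usr/local/cuda/lib/libcuda*.dylib
echo $DYLD_LIBRARY_PATH
# if the CUDA lib directory is missing from the path, add it for the current shell
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH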
As an aside, I still saw some warnings when testing the install, e.g. "The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations." These are just suggestions to build from source (https://github.com/tensorflow/tensorflow/issues/8037), rather than using pip install --upgrade tensorflow-gpu. 🍻