GKE - NVIDIA GPU - CUDA drivers don't work inside the pod

I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA and CUDA drivers working.
I have managed to install the NVIDIA DaemonSets, and I can now see the following pods:
nvidia-driver-installer-gmvgt
nvidia-gpu-device-plugin-lmj84
The problem is that even while using the recommended image nvidia/cuda:10.0-runtime-ubuntu18.04, I still can't find the NVIDIA drivers inside my pod:
root@pod-name-5f6f776c77-87qgq:/app# ls /usr/local/
bin cuda cuda-10.0 etc games include lib man sbin share src
But the tutorial mentions:
CUDA libraries and debug utilities are made available inside the container at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.
I have also tried to test whether CUDA was working through torch.cuda.is_available(), but I get False as a return value.
Many thanks in advance for your help.

OK, so I finally made the NVIDIA drivers work.
It is mandatory to set a resource limit to access the NVIDIA driver, which is weird considering that either way my pod was on the right node with the NVIDIA drivers installed.
This made the nvidia folder accessible, but I'm still unable to make the CUDA install work with PyTorch 1.3.0 [ issue here ]
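For reference, a minimal sketch of the kind of pod spec that made the driver mount appear; the pod name and sleep command are placeholders, but the nvidia.com/gpu limit is the part that matters:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                        # hypothetical name
spec:
  containers:
  - name: app
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    command: ["sleep", "infinity"]      # keep the pod alive for inspection
    resources:
      limits:
        nvidia.com/gpu: 1               # without this limit, GKE never mounts /usr/local/nvidia
EOF

If torch.cuda.is_available() still returns False after the mount appears, it may help to check that LD_LIBRARY_PATH includes /usr/local/nvidia/lib64, which the GKE documentation suggests for CUDA applications.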

Related

How to set up different versions of CUDA in one OS?
Here is my problem: the latest TensorFlow with GPU support requires CUDA 11.2, whereas PyTorch works with 11.3. So what is the solution for installing both libraries on Windows and Ubuntu?
One solution is to use Docker containers, which only require the host NVIDIA driver to be of version XYZ.AB; that way, you can use both the PyTorch and TensorFlow versions you need.
A very good starting point for your problem would be this one (ML-WORKSPACE): https://github.com/ml-tooling/ml-workspace
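To illustrate the container approach: each framework image ships its own CUDA userspace, and only the host driver is shared. A sketch, assuming Docker with the NVIDIA Container Toolkit installed; the image tags are examples, not prescriptions:

# TensorFlow image bundles its own CUDA 11.2 userspace
docker run --rm --gpus all tensorflow/tensorflow:2.8.0-gpu \
  python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# PyTorch image bundles its own CUDA 11.3 userspace
docker run --rm --gpus all pytorch/pytorch:1.11.0-cuda11.3-cudnn8-runtime \
  python -c "import torch; print(torch.cuda.is_available())"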

How to install GPU driver on Google Deep Learning VM?

I just created a Google Deep Learning VM with this image:
c1-deeplearning-tf-1-15-cu110-v20210619-debian-10
The TensorFlow version is 1.15.5, but when I run
nvidia-smi
it says: -bash: nvidia-smi: command not found.
When I run
nvcc --version
I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
Does anyone know how to install the GPU driver? Thank you in advance!
Update: I've noticed that if you select a GPU instance, the GPU driver comes pre-installed.
This is the guide: Installing GPU drivers.
Required NVIDIA driver versions
NVIDIA GPUs running on Compute Engine must use the following NVIDIA driver versions:
For A100 GPUs:
Linux: 450.80.02 or later
Windows: 452.77 or later
For all other GPU types:
Linux: NVIDIA 410.79 driver or later
Windows: 426.00 driver or later
I would suggest deleting the instance and creating another one. Keep in mind the version compatibility here and here. If you are installing the drivers yourself, then what's the point of using a pre-built instance?
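As a sketch of the recreate route: Deep Learning VM images accept an install-nvidia-driver metadata flag that installs the matching driver on first boot. The zone, instance name, GPU type, and image family below are assumptions (the family is inferred from the image name in the question):

gcloud compute instances create my-dl-vm \
  --zone=us-central1-a \
  --image-family=tf-1-15-cu110 \
  --image-project=deeplearning-platform-release \
  --accelerator="type=nvidia-tesla-k80,count=1" \
  --maintenance-policy=TERMINATE \
  --metadata="install-nvidia-driver=True"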

Why does tf.test.is_gpu_available() not return True or False, but get stuck?

After installing tensorflow-gpu 2.0.0, it got stuck after detecting the GPU.
The environment settings for this project are:
Ubuntu 18.04
CUDA 10.0
cuDNN 7.4.1
created a virtual environment
installed tensorflow-gpu 2.0.0
While trying to check the GPU with tf.test.is_gpu_available(), the call got stuck.
I changed the cuDNN version to 7.6.2. Then it works well.
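A quick sketch for verifying the installed cuDNN version before and after the swap; the header path is the usual default for a 7.x install but may differ on your system:

# print the cuDNN version macros (path may vary with your install)
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
# with cuDNN 7.6.x in place, the check should return instead of hanging
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"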

Unable to configure tensorflow to use GPU acceleration in Ubuntu 16.04

I am trying to install TensorFlow on Ubuntu 16.04 (in Google Cloud). What I have done so far: created a Compute Engine instance and added an NVIDIA Tesla K80 to it.
I also made sure that the proper version of TensorFlow (1.14.0) is installed, that CUDA 8.0 is installed, and that cuDNN 6.0 is installed, as per the TensorFlow GPU-CUDA mapping.
When I run a simple TensorFlow program, I get:
Cannot assign a device for operation MatMul: {{node MatMul}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
Can anyone please let me know where I am going wrong? Is the instance selection correct?
Please do let me know, and thanks for your help.
The CUDA and cuDNN versions that have been tested with TensorFlow 1.14 are 10.0 and 7.4, respectively.
More information about version compatibility can be found here.
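The XLA_GPU-but-no-GPU:0 symptom usually means the CUDA runtime libraries could not be loaded. Once the matching CUDA 10.0 and cuDNN 7.4 are in place, a quick sanity check (a sketch using TF 1.x's device listing API) should show a real GPU device:

# TF 1.14 should now list '/device:GPU:0', not just the XLA devices
python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"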

TensorFlow isn't using the NVIDIA card

TensorFlow fails to use the NVIDIA card even though the NVIDIA driver, CUDA toolkit, and cuDNN are installed and configured.
One thing that I suspect is the reason: the NVIDIA card on my laptop is connected to PCI as a 3D controller instead of a VGA controller:
00:02.0 VGA compatible controller: Intel Corporation Sky Lake Integrated Graphics (rev 07)
Subsystem: ASUSTeK Computer Inc. Skylake Integrated Graphics
Kernel driver in use: i915_bpo
01:00.0 3D controller: NVIDIA Corporation GK208M [GeForce 920M] (rev a1)
Subsystem: ASUSTeK Computer Inc. GK208M [GeForce 920M]
Kernel modules: nvidiafb, nouveau, nvidia_304
Even the NVIDIA X Server Settings don't see the GPU.
Is it true that TensorFlow can only use the graphics card when it appears as VGA?
After three months, I finally figured out what the issue was and resolved it. It turned out to be an NVIDIA driver issue with Secure Boot.
I feel obliged to thank jorgemf and Yao Zhang for your help at a time when I couldn't even articulate the problem well.
Meanwhile, I hope my case can help other people having the same problem.
It all started with my attempt to install the NVIDIA driver again today. The installation seemed successful, but at the end it said:
Unable to load the “nvidia-drm” kernel module.
So I thought maybe I could manually load the kernel module with
modprobe nvidia-drm
but got an error saying something like "Required key not available". I wondered what that meant, so I googled a bit. It turned out the module's signing key was not registered, so the module was being blocked by Secure Boot!
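For anyone hitting the same error, you can confirm that Secure Boot is the culprit before touching firmware settings; a quick sketch using mokutil (which may need to be installed first):

# report whether Secure Boot is currently enabled
mokutil --sb-state
# after disabling Secure Boot in the firmware, the module should load cleanly
sudo modprobe nvidia-drm
lsmod | grep nvidia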
I went back to the boot settings and disabled Secure Boot, then installed the NVIDIA driver again: success! Now the GPU device shows up in the NVIDIA settings.
I headed on to install CUDA and cuDNN. I found this GitHub gist super useful: https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07
For the last step, I just followed the installation guide on the TensorFlow home page and tested that it did run on the GPU!
The take-home message is that if you fail to install the NVIDIA driver on a Linux system, you probably need to disable Secure Boot. Personal opinion: Windows turned this good idea into a nightmare for Linux users!