How to install GPU driver on Google Deep Learning VM? - tensorflow

I just created a Google Deep Learning VM with this image:
c1-deeplearning-tf-1-15-cu110-v20210619-debian-10
The TensorFlow version is 1.15.5. But when I run
nvidia-smi
it says -bash: nvidia-smi: command not found.
When I run
nvcc --version
I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Thu_Jun_11_22:26:38_PDT_2020
Cuda compilation tools, release 11.0, V11.0.194
Build cuda_11.0_bu.TC445_37.28540450_0
Does anyone know how to install the GPU driver? Thank you in advance!
Update: I've noticed that if you select a GPU instance, then the GPU driver is pre-installed.

This is the guide: Installing GPU drivers.
Required NVIDIA driver versions
NVIDIA GPUs running on Compute Engine must use the following NVIDIA driver versions:
For A100 GPUs:
Linux: 450.80.02 or later
Windows: 452.77 or later
For all other GPU types:
Linux: NVIDIA 410.79 driver or later
Windows: 426.00 driver or later
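Once a driver is present, you can confirm that the instance meets these minimums by querying the version directly with nvidia-smi's documented query flags:
nvidia-smi --query-gpu=driver_version --format=csv,noheader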

I would suggest deleting the instance and creating another one. Keep in mind the version compatibility here and here. If you are installing the drivers yourself, then what's the point of using a pre-built instance?
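For example, here is a minimal sketch of recreating the VM so the driver is installed automatically on first boot (the instance name, zone, image family, and accelerator type are placeholders to adapt; the install-nvidia-driver metadata key is what triggers the automatic install on Deep Learning VM images):
gcloud compute instances create my-dlvm \
    --zone=us-central1-a \
    --image-family=tf-latest-gpu \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --accelerator="type=nvidia-tesla-t4,count=1" \
    --metadata="install-nvidia-driver=True"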

Related

How to deal with CUDA version?

How to set up different versions of CUDA in one OS?
Here is my problem: the latest TensorFlow with GPU support requires CUDA 11.2, whereas PyTorch works with 11.3. So what is the solution to install both libraries on Windows and Ubuntu?
One solution is to use a Docker container environment, which would only need the host's NVIDIA driver to be of version XYZ.AB; in this way, you can use both PyTorch and TensorFlow versions side by side (see the sketch below).
A very good starting point for your problem would be this one (ml-workspace): https://github.com/ml-tooling/ml-workspace
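To illustrate the container approach: with the NVIDIA Container Toolkit installed on the host, each container brings its own CUDA, so the two frameworks no longer have to agree (the image tags here are illustrative, not prescriptive):
# TensorFlow container with the CUDA it was built against
docker run --gpus all --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# PyTorch container with its own, different CUDA
docker run --gpus all --rm pytorch/pytorch:latest \
    python -c "import torch; print(torch.cuda.is_available())"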

Unable to configure tensorflow to use GPU acceleration in Ubuntu 16.04

I am trying to install TensorFlow on Ubuntu 16.04 (in Google Cloud). What I have done so far is create a compute instance and add an NVIDIA Tesla K80 to it.
I also made sure that the proper version of TensorFlow (1.14.0) is installed, that CUDA 8.0 is installed, and that cuDNN 6.0 is installed, as per the TensorFlow GPU-CUDA mapping.
When I run a simple TensorFlow program, I get
Cannot assign a device for operation MatMul: {{node MatMul}} was explicitly assigned to /device:GPU:0 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:XLA_CPU:0, /job:localhost/replica:0/task:0/device:XLA_GPU:0 ]. Make sure the device specification refers to a valid device.
Can anyone please let me know where I am going wrong? Is the instance selection correct?
Please do let me know and thanks for your help.
The CUDA and cuDNN versions that have been tested with TensorFlow 1.14 are 10.0 and 7.4, respectively.
More information about version compatibility can be found here.
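After installing the matching versions, a quick sanity check is to ask TF 1.x which devices it can actually see; a correctly configured setup should list a /device:GPU:0 entry, not only the XLA placeholders from the error above:
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"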

How to run tensorflow-gpu on Nvidia Quadro GV100?

I am currently working as a working student, and now I am having trouble installing tensorflow-gpu on a machine with an NVIDIA Quadro GV100 GPU.
On the TensorFlow homepage I found out that I need to install CUDA 9.0 and cuDNN 7.x in order to run tensorflow-gpu 1.9. The problem is that I can't find a suitable CUDA version supporting the GV100. Could it be that there is no such CUDA version yet? Is it possible that one can't use the GV100 for tensorflow-gpu?
Sorry for the stupid question, I am new to installing DL frameworks :-)
Thank you very much for your help!
On the TensorFlow homepage I found out that I need to install CUDA 9.0 and cuDNN 7.x in order to run tensorflow-gpu 1.9.
That is if you want to install a pre-built TensorFlow binary distribution. In that case you need to use the version of CUDA which the TensorFlow binaries were built against, which in this case is CUDA 9.0.
The problem is that I can't find a suitable CUDA version supporting the GV100
The CUDA 9.0 and later toolkits fully support Volta cards, and that should include the Quadro GV100. However, the driver which ships with CUDA 9.0 is from the 384 series, which won't support your GPU. If you are referring to a driver support issue, then the solution would be to install the recommended driver for your GPU and to install only the CUDA toolkit from the CUDA 9.0 bundle, not the toolkit and driver, which is the default (see the sketch after this answer).
Otherwise you can use CUDA 9.1 or 9.2, which should support your GPU with their bundled drivers, but you will then need to build TensorFlow yourself from source.
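As a sketch of that toolkit-only install (the exact runfile name is taken from NVIDIA's CUDA 9.0 download archive and may differ for your platform; --toolkit skips the bundled 384-series driver):
sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit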

I'd like to control how TensorFlow Lite uses the GPU; what can I study for that?

First, let me explain what I have to do.
My development environment is Tizen OS. Maybe you are unfamiliar with it; anyway, this OS uses the Linux kernel, is Red Hat-like (it uses RPM packages), and targets mobile, TV, etc. My target device consists of an Exynos 5422 and an ARM Mali-T628.
My main task is to implement a GPU library so that TensorFlow Lite's operations can use it.
I have already built and installed TensorFlow Lite as an RPM package file.
I have googled many times about TensorFlow and GPUs, and only found information about CUDA, which is useless in my case (Tizen and a Mali GPU).
I think Linux must have GPU instructions or a library, like it does for the CPU, but I can't find them.
Can you suggest search keywords or documents?
You can go to NVIDIA's CUDA Toolkit page, where you can find the documentation and training options.
There's also the CUDA programming guide, which I myself find very useful and helpful for CUDA.
I believe that one or two of those may help you.
CUDA is for NVIDIA GPUs. Mali is not NVIDIA's but ARM's, so you CANNOT use CUDA on your given hardware. Besides, if you want CUDA, you'd be better off dropping TensorFlow Lite and using TensorFlow.
If you want to use CUDA, get hardware with a supported NVIDIA GPU (e.g., an x64 machine with an NVIDIA GPU). Note that you can use tensorflow-gpu and CUDA/cuDNN on Tizen with x64 + NVIDIA GPU; you just need to be careful about the NVIDIA GPU kernel driver version and the userspace driver version. Because NVIDIA's userspace GPU driver and CUDA/cuDNN are statically built, their Linux builds are compatible with Tizen. (I've tested tensorflow-gpu and CUDA/cuDNN on Tizen with NVIDIA driver version 111... probably in winter 2017.)
If you want to use Tizen/TensorFlow Lite on the given hardware, forget CUDA.

GKE - GPU nvidia - cuda drivers don't work

I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with the NVIDIA drivers and CUDA working.
I have managed to install the NVIDIA daemonsets, and I can now see the following pods:
nvidia-driver-installer-gmvgt
nvidia-gpu-device-plugin-lmj84
The problem is that even while using the recommended image nvidia/cuda:10.0-runtime-ubuntu18.04, I still can't find the NVIDIA drivers inside my pod:
root@pod-name-5f6f776c77-87qgq:/app# ls /usr/local/
bin cuda cuda-10.0 etc games include lib man sbin share src
But the tutorial mentions:
CUDA libraries and debug utilities are made available inside the container at /usr/local/nvidia/lib64 and /usr/local/nvidia/bin, respectively.
I have also tried to test whether CUDA was working through torch.cuda.is_available(), but I get False as a return value.
Many thanks in advance for your help.
OK, so I finally made the NVIDIA drivers work.
It is mandatory to set a resource limit to access the NVIDIA driver, which is weird considering my pod was already on the right node with the NVIDIA drivers installed (see the sketch below).
This made the nvidia folder accessible, but I'm still unable to make the CUDA install work with PyTorch 1.3.0 (issue here).
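For reference, a minimal sketch of the kind of resource limit that did the trick (the pod and container names are placeholders; on GKE, requesting nvidia.com/gpu is what mounts the driver into the pod):
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:10.0-runtime-ubuntu18.04
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF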