Tesla GPU Usage - windows-server-2008

In my machine, three GPUs are connected, i.e., Tesla M2090 cards. I want to get the usage of those GPUs. There is a tool called NVIDIA SMI which shows the GPU usage, but when I tried to run it with the option nvidia-smi.exe -d, I could not get what I wanted (I want to know memory and GPU utilization). Please help.
Driver version : 275.65
OS: Windows Server 2008 R2

Yes, I got it by following
https://developer.nvidia.com/sites/default/files/akamai/cuda/files/CUDADownloads/NVML_cuda5/nvidia-smi.4.304.pdf
I gave the command as follows:
nvidia-smi.exe -q -d MEMORY,UTILIZATION
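If you want continuous monitoring rather than a one-off snapshot, nvidia-smi can repeat a query at a fixed interval with -l; which flags are available depends on your driver/nvidia-smi version, so treat this as a sketch:
nvidia-smi.exe -q -d MEMORY,UTILIZATION -l 5
This re-runs the same query every 5 seconds.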

Related

Qemu 5.2 - nothing shows up after VNC running

I'm trying to use QEMU 5.x for research.
I got the QEMU 5.2 source code from qemu.org and installed it following the instructions.
However, when I tried to run a VM with this command:
qemu-system-x86_64 \
-monitor stdio \
--enable-kvm \
-m 4096 \
-cdrom ubuntu-20.04.iso \
-drive file=img.qcow,if=virtio \
-boot c \
-rtc base=localtime \
-device virtio-keyboard-pci \
-vga virtio
then the following text is printed:
QEMU 5.2.0 monitor - type 'help' for more information
(qemu) VNC server running on 127.0.0.1:5900
then nothing shows up, whereas QEMU 4.x (which I used before) pops up a window showing the guest Ubuntu's GUI.
I'm using Ubuntu 20.04. I hope someone has a breakthrough for this.
The message says that this QEMU is using the VNC protocol for graphics output. You can connect a VNC client to the 127.0.0.1:5900 port that it tells you about to see the graphics output.
If what you wanted was a native X11 window (GTK), then the problem is probably that you didn't have the necessary libraries installed to build the GTK support. QEMU's configure script's default behaviour is "build all the optional features that this host has the libraries installed for, and omit the features where the libraries aren't present". So if you don't have any of the GTK/SDL etc libraries when you build QEMU, the only thing you will get in the resulting QEMU binary is the lowest-common-denominator VNC support. If you want configure to report an error for a missing feature then you need to pass it the appropriate --enable-whatever option to force the feature to be enabled (in this case, --enable-gtk).
If you're running on Ubuntu and your apt sources.list file has deb-src lines in it, the easiest way to install all the dependencies that would get you the same feature list as the real Ubuntu QEMU package is to run "apt build-dep qemu". I recommend that you do that and then re-build QEMU, passing --enable-gtk to configure so you can confirm that the necessary dependencies were installed.
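A minimal sketch of that rebuild on Ubuntu, assuming deb-src lines are enabled and you are in the QEMU 5.2 source tree:
sudo apt build-dep qemu        # install the build dependencies of Ubuntu's own QEMU package
./configure --enable-gtk       # fail loudly if GTK support cannot be built
make -j"$(nproc)"
If configure completes without complaining about GTK, the resulting binary should open a native window instead of falling back to VNC.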

How to enable IOMMU on Guest

I am planning to perform nested virtualization with a GPU device. I have a guest Ubuntu OS running, and I have mapped the GPU to it by enabling intel_iommu on the host and configuring the NVIDIA PCI device as a vfio-pci device. I am also able to install the NVIDIA driver on the guest and use it for deep learning.
However, now I want to run another VM inside the guest. Let's call the guest that runs on the host L1 and the guest that runs inside that guest L2. I want the GPU to be accessible from the L2 guest. I came across vIOMMU support on the QEMU Q35 chipset; how do I enable an IOMMU on the L1 guest so that I can pass the GPU directly to the L2 guest?
Hardware :
Intel i7 8th Gen
NVIDIA GeForce 1070
Linux - Ubuntu 18.04,
Hypervisor - KVM
There are a couple of things to be done on KVM/QEMU to allow a nested IOMMU (a sketch follows below):
Use the OVMF.fd firmware, because the default BIOS might not support it.
Enable the q35 chipset with accel=kvm,kernel_irqchip=split.
Then check with dmesg | grep -e DMAR -e IOMMU and find /sys/kernel/iommu_groups/ -type l for the IOMMU groups inside the VM.
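A minimal sketch of an L1 guest launch that puts those pieces together; the firmware path, disk image, and PCI address are placeholders for your setup:
qemu-system-x86_64 \
  -machine q35,accel=kvm,kernel_irqchip=split \
  -m 8G \
  -bios /usr/share/OVMF/OVMF.fd \
  -device intel-iommu,intremap=on,caching-mode=on \
  -device vfio-pci,host=01:00.0 \
  -drive file=l1.qcow2,if=virtio
Inside the L1 guest, also boot the kernel with intel_iommu=on before running the dmesg and iommu_groups checks above.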

Is it possible to run VM with ppc64le architecture on a host machine with x86_64 architecture?

I want to test some use-cases which need to run on 'ppc64le' architecture but I don't have a host machine with ppc64le architecture.
My host system is of x86_64 architecture. Is it possible to run VM with 'ppc64le' architecture on my host machine with x86_64 architecture?
Absolutely! The only caveat is that since you're not running natively, the virtual machine needs to emulate the target (ppc64le) instruction set. This can be much slower than running native instructions.
The way to do this will depend on which tools you're using to manage your virtual machine instances. For example, virt-manager allows you to select the architecture type when you're creating a new virtual machine. If you set this to ppc64el, you'll get a ppc64el machine. Other options (like disk and network devices) can be set just like native VMs.
If you're not using any specific VM management tools, the following invocation of qemu will get a ppc64el machine going easily:
# -M pseries: use the pseries machine model
# -m 4G: give the guest 4G of RAM
# -hda: the Ubuntu installer attached as a virtual disk
qemu-system-ppc64le \
  -M pseries \
  -m 4G \
  -hda ubuntu-18.04-server-ppc64el.iso
Depending on your usage, you may want to use the following options too:
-nographic -serial pty to use a text console instead of an emulated graphics device. qemu will print the console pty on startup - something like /dev/pts/X. Run screen /dev/pts/X to access it.
-M powernv -bios skiboot.lid to use the non-virtualised ppc64el machine model, which is closer to current OpenPOWER hardware. The skiboot.lid firmware may be included in your distro's install of qemu.
-drive, -device and -netdev to configure virtual disks and networking. These work in the same manner as for x86 VMs on qemu. A combined example follows this list.
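Putting a few of those options together, a hedged example of a headless install (the disk and ISO names are placeholders):
qemu-system-ppc64le \
  -M pseries \
  -m 4G \
  -nographic -serial pty \
  -drive file=disk.qcow2,if=virtio \
  -cdrom ubuntu-18.04-server-ppc64el.iso
qemu will print the pty to attach to (something like /dev/pts/X); connect with screen as described above.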
I hosted centos7-ppc64le on my x86_64 machine (OS: RHEL 7). I used qemu + virt-install for that. First, install qemu:
wget https://download.qemu.org/qemu-3.1.0-rc1.tar.xz
tar xvJf qemu-3.1.0-rc1.tar.xz
cd qemu-3.1.0-rc1
./configure
make
make install
After installation, check that qemu-system-ppc64le is available from the command line. Then install virt-manager, virt-install, virt-viewer and libvirt for managing the VMs. Then I started the VM as:
virt-install --name centos7-ppc64le \
--disk centos7-ppc64le.qcow2 \
--machine pseries \
--arch ppc64 \
--vcpus 2 \
--cdrom CentOS-7-ppc64le-Minimal-1804.iso \
--memory 2048 \
--network=bridge:virbr0 \
--graphics vnc

How to get graphical GUI output and user touch / keyboard / mouse input in a full system gem5 simulation?

Hopefully with fs.py, but not necessarily.
For example, I have some x86 BIOS example that draw a line on the screen on QEMU, and I'd like to see that work on gem5 too.
Interested in all archs.
https://www.mail-archive.com/gem5-users@gem5.org/msg15455.html
arm
I have managed to get an image on the screen for ARM.
Here is a highly automated setup which does the following steps:
grab the ARM gem5 Linux kernel v4.15 fork from: https://gem5.googlesource.com/arm/linux/ and use the config file arch/arm/configs/gem5_defconfig from there.
The fork is required, I believe, for the commit "drm: Add component-aware simple encoder" (https://gem5.googlesource.com/arm/linux/), which adds the required option CONFIG_DRM_VIRT_ENCODER=y.
The other required option is CONFIG_DRM_HDLCD=y, which enables the HDLCD ARM IP that manages the display: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0541c/CHDBAIDI.html
run gem5 at 49f96e7b77925837aa5bc84d4c3453ab5f07408e
with a command of type:
M5_PATH='/data/git/linux-kernel-module-cheat/out/common/gem5/system' \
'/data/git/linux-kernel-module-cheat/out/common/gem5/build/ARM/gem5.opt' \
--debug-file=trace.txt \
-d '/data/git/linux-kernel-module-cheat/out/arm/gem5/m5out' \
'/data/git/linux-kernel-module-cheat/gem5/gem5/configs/example/fs.py' \
--disk-image='/data/git/linux-kernel-module-cheat/out/arm/buildroot/images/rootfs.ext2' \
--kernel='/data/git/linux-kernel-module-cheat/out/arm/buildroot/build/linux-custom/vmlinux' \
--mem-size='256MB' \
--num-cpus='1' \
--script='/data/git/linux-kernel-module-cheat/data/readfile' \
--command-line='earlyprintk=pl011,0x1c090000 console=ttyAMA0 lpj=19988480 rw loglevel=8 mem=256MB root=/dev/sda console_msg_format=syslog nokaslr norandmaps printk.devkmsg=on printk.time=y' \
--dtb-file='/data/git/linux-kernel-module-cheat/out/common/gem5/system/arm/dt/armv7_gem5_v1_1cpu.dtb' \
--machine-type=VExpress_GEM5_V1
connect to the VNC server gem5 provides with your favorite client.
On Ubuntu 18.04, I like:
sudo apt-get install vinagre
vinagre localhost:5900
The port shows up on a gem5 message of type:
system.vncserver: Listening for connections on port 5900
and it takes up the first free port starting from 5900.
Only raw connections are supported currently.
Outcome:
after a few seconds, the VNC client shows a little penguin on the screen! This is because our kernel was compiled with CONFIG_LOGO=y.
the latest frame gets dumped to system.framebuffer.png, and it also contains the little penguin.
the Linux kernel dmesg shows, on the telnet 3456 terminal, messages like:
[ 0.152755] [drm] found ARM HDLCD version r0p0
[ 0.152790] hdlcd 2b000000.hdlcd: bound virt-encoder (ops 0x80935f94)
[ 0.152795] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 0.152799] [drm] No driver support for vblank timestamp query.
[ 0.215179] Console: switching to colour frame buffer device 240x67
[ 0.230389] hdlcd 2b000000.hdlcd: fb0: frame buffer device
[ 0.230509] [drm] Initialized hdlcd 1.0.0 20151021 for 2b000000.hdlcd on minor 0
which shows that the HDLCD was enabled.
when we connect, gem5 shows on stdout:
info: VNC client attached
TODO: also get a shell working. Currently I only have the little penguin, and my keystrokes do nothing. Likely I have to tweak the console= kernel parameter or set up a tty console there on init? CONFIG_FRAMEBUFFER_CONSOLE=y is set. Maybe the answer is contained in: https://www.kernel.org/doc/Documentation/fb/fbcon.txt
aarch64
The aarch64 gem5 defconfig does not come with all required options, e.g. CONFIG_DRM_HDLCD=y.
Adding the following options, either by hacking or with a config fragment made it work:
CONFIG_DRM=y
CONFIG_DRM_HDLCD=y
CONFIG_DRM_VIRT_ENCODER=y
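If you prefer a config fragment over hand-editing, here is a sketch of applying it with the kernel's merge_config.sh (run from the kernel tree, assuming the gem5 defconfig has already been applied to .config):
cat > display.cfg << 'EOF'
CONFIG_DRM=y
CONFIG_DRM_HDLCD=y
CONFIG_DRM_VIRT_ENCODER=y
EOF
./scripts/kconfig/merge_config.sh .config display.cfg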

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I'm running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS.
I'd like to observe the GPU utilization while training my TensorFlow models.
I get an error trying to run 'nvidia-smi'.
ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh nvidia-debugdump nvidia-xconfig
nvidia-cuda-mps-control nvidia-persistenced
nvidia-cuda-mps-server nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia
ii nvidia-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-346
ii nvidia-346-dev 346.46-0ubuntu1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-346-uvm 346.96-0ubuntu0.0.1 amd64 Transitional package for nvidia-346
ii nvidia-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-375
ii nvidia-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary driver - version 375.39
ii nvidia-375-dev 375.39-0ubuntu0.14.04.1 amd64 NVIDIA binary Xorg driver development files
ii nvidia-modprobe 375.26-0ubuntu1 amd64 Load the NVIDIA kernel driver and create device files
ii nvidia-opencl-icd-346 352.63-0ubuntu0.14.04.1 amd64 Transitional package for nvidia-opencl-icd-352
ii nvidia-opencl-icd-352 375.26-0ubuntu1 amd64 Transitional package for nvidia-opencl-icd-375
ii nvidia-opencl-icd-375 375.39-0ubuntu0.14.04.1 amd64 NVIDIA OpenCL ICD
ii nvidia-prime 0.6.2.1 amd64 Tools to enable NVIDIA's Prime
ii nvidia-settings 375.26-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$
$ inxi -G
Graphics: Card-1: Cirrus Logic GD 5446
Card-2: NVIDIA GK104GL [GRID K520]
X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X
$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
Subsystem: XenSource, Inc. Device 0001
Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
I followed these instructions to install CUDA 7 and cuDNN:
$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot
=======================================================================
Post reboot, update the initramfs by running '$sudo update-initramfs -u'
Now, please edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in an editor and insert the following lines at the end of the file.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
Save and exit from the file.
Now install the build essential tools and update the initramfs and reboot again as below:
$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot
========================================================================
Post reboot, run the following commands to install Nvidia.
$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot
========================================================================
Now that the system has come up, verify the installation by running the following.
$sudo modprobe nvidia
$sudo nvidia-smi -q | head
You should see the output like 'nvidia.png'.
Now run the following commands.
$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
However, 'nvidia-smi' still doesn't show GPU activity while Tensorflow is training models:
ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally
ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017
+------------------------------------------------------+
| NVIDIA-SMI 346.46 Driver Version: 346.46 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 35C P0 38W / 125W | 10MiB / 4095MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control in the BIOS.
Run the following to get the right NVIDIA driver:
sudo ubuntu-drivers devices
Then pick the right one and run:
sudo apt install <version>
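For example (a hypothetical version; use whichever driver ubuntu-drivers marks as recommended for your GPU):
sudo apt install nvidia-driver-470   # hypothetical; substitute the recommended package name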
In my case none of the above solutions helped:
Root cause: incompatible version of gcc
Solution:
1. sudo apt install --reinstall gcc
2. sudo apt-get --purge -y remove 'nvidia*'
3. sudo apt install nvidia-driver-450
4. sudo reboot
System: AWS EC2 Ubuntu 18.04 instance
Solution source: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-in-ubuntu-18-04/68288/4
I am working with an AWS DeepAMI P2 instance, and I suddenly found that the NVIDIA driver command was not working and the GPU was not found by the torch or tensorflow libraries. I resolved the problem in the following way:
Run nvcc --version. If it doesn't work,
then run the following:
apt install nvidia-cuda-toolkit
Hopefully that will solve the problem.
I was getting the same error on my Ubuntu 16.04 (Linux 4.14 kernel) instance in Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:
Step 1:
Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the versions of kernel that have
been released. At the time of this writing, the latest stable release
of Ubuntu kernel is 4.15. If you go to this
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will
see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64
bit, I would download the following deb files:
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a
You should see that your kernel has been upgraded and hopefully nvidia-smi should work.
Double-check that you have the right permissions on the /dev/nvidiactl device, or whether it exists at all.
$ strace nvidia-smi
...
openat(AT_FDCWD, "/dev/nvidiactl", O_RDONLY) = -1 ENOENT (No such file or directory)
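A quick way to check whether the device nodes exist and what permissions they carry:
ls -l /dev/nvidia*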
Make sure nvidia-persistenced service is installed, up and running:
nvidia-persistenced --version
sudo systemctl start nvidia-persistenced
sudo systemctl status nvidia-persistenced
tail /var/log/syslog # When failed.
journalctl -xeu nvidia-persistenced.service
See: Who creates /dev/nvidia0 and /dev/nvidiactl?
You may try to create the device manually by:
sudo modprobe -abq nvidia
sudo nvidia-modprobe -c 0 -u
nvidia-smi -L
In my case, I had the following error in syslog after restarting nvidia-persistenced service:
NVRM: The NVIDIA probe routine was not called for X device(s).
This can occur when a driver such as: nouveau, rivafb, nvidiafb or rivatv was loaded and obtained ownership of the NVIDIA device(s).
Try unloading the conflicting kernel module (and/or reconfigure your kernel without the conflicting driver(s).
The solution was to blacklist the nouveau driver, by adding the following lines into /etc/modprobe.d/blacklist.conf file:
# Blacklist nouveau.
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
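On Ubuntu you may also need to regenerate the initramfs so nouveau is not pulled in at early boot (the CUDA install steps earlier in this thread do the same):
sudo update-initramfs -u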
Then reboot the system.
See: How to remove Nouveau kernel driver (fix Nvidia install error).
What I found to fix the issue, regardless of kernel version, was to let apt install the kernel headers instead of fetching them with wget:
sudo apt-get install --reinstall linux-headers-$(uname -r)
Driver Version: 390.138 on Ubuntu server 18.04.4
I just want to thank @Heapify for providing a practical answer and to update his answer, because the attached links are not up-to-date.
Step 1:
Check the existing kernel of your Ubuntu Linux:
uname -a
Step 2:
Ubuntu maintains a website for all the versions of kernel that have
been released. At the time of this writing, the latest stable release
of Ubuntu kernel is 4.15. If you go to this
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will
see several links for download.
Step 3:
Download the appropriate files based on the type of OS you have. For 64
bit, I would download the following deb files:
// UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
Step 4:
Install all the downloaded deb files:
sudo dpkg -i *.deb
Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a
My system version: Ubuntu 20.04 LTS.
I solved this by generating a new MOK and enrolling it into shim,
without disabling Secure Boot (although disabling it also works).
Simply execute this command and follow what it suggests:
sudo update-secureboot-policy --enroll-key
According to ubuntu's wiki:
How can I do non-automated signing of drivers
Just use these commands:
$ sudo apt update
$ sudo apt upgrade
Try reinstalling the NVIDIA drivers correctly if you use Ubuntu.
First remove everything related to NVIDIA and CUDA:
sudo apt-get remove --purge '^nvidia-.*'
sudo apt-get remove --purge '^libnvidia-.*'
sudo apt-get remove --purge '^cuda-.*'
Then run the next line
sudo apt-get install linux-headers-$(uname -r)
After that, download the latest installer from the NVIDIA site according to your target platform, architecture, etc. (the layout of the website may change).
The site will then give you the commands to run to install the NVIDIA drivers.
In my case they were:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda
Run those commands to install the NVIDIA drivers, accepting if asked whether to upgrade the current ones or install from scratch; then the error should be fixed.
I hope this helps you. Good luck!
References: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141
I tried the above solutions, but only the one below worked for me.
sudo apt-get update
sudo apt-get install --no-install-recommends nvidia-384 libcuda1-384 nvidia-opencl-icd-384
sudo reboot
credit --> https://deeptalk.lambdalabs.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver/148
For all the others who have the same problem and for whom none of these solutions work: the solution for me was to disable Secure Boot and reinstall the driver.
I was facing the same issue on a GeForce 920M NVIDIA chip. Initially the installed NVIDIA driver version was 510, which is not compatible with Ubuntu 18, so I removed it and installed version 470, and now it's working perfectly.
sudo apt-get purge nvidia-driver-510
sudo apt-get install nvidia-driver-470
For Ubuntu 20.04 or later, try installing the NVIDIA driver:
sudo ubuntu-drivers autoinstall
Then
sudo reboot
As per these instructions:
https://linuxconfig.org/how-to-install-the-nvidia-drivers-on-ubuntu-20-04-focal-fossa-linux
I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance.
e.g.
nvidia-graphics-drivers-367_367.57.orig.tar
Now the GRID K520 GPU is working while I train tensorflow models:
ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr 1 18:03:32 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 Off | 0000:00:03.0 Off | N/A |
| N/A 39C P8 43W / 125W | 3800MiB / 4036MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2254 C python 3798MiB |
+-----------------------------------------------------------------------------+
ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GRID K520"
CUDA Driver Version / Runtime Version 8.0 / 7.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 4036 MBytes (4232052736 bytes)
( 8) Multiprocessors, (192) CUDA Cores/MP: 1536 CUDA Cores
GPU Max Clock rate: 797 MHz (0.80 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 3
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS
None of the above helped me.
I am using Kubernetes on Google Cloud with a Tesla K80 GPU.
Follow this guide to ensure you installed everything correctly:
https://cloud.google.com/kubernetes-engine/docs/how-to/gpus
I was missing a few important things:
Install the NVIDIA GPU device drivers on your NODES. To do this, use:
For COS node:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
For UBUNTU node:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
Make sure an update was rolled out to your nodes. Restart them if upgrades are off.
I use the image nvidia/cuda:10.1-base-ubuntu16.04 in my Docker container.
You have to set a GPU limit! This is the only way the node driver can communicate with the pod. In your YAML configuration, add this under your container:
resources:
  limits:
    nvidia.com/gpu: 1
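A quick, hedged sanity check that the driver DaemonSet actually exposed the GPU to the scheduler is to look at the nodes' allocatable resources:
kubectl describe nodes | grep -i 'nvidia.com/gpu'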
One important fact about NVIDIA drivers that is not very well known is that they are built by DKMS. This allows an automatic rebuild in case of a kernel upgrade; it happens on system startup. Because of that, it's quite easy to miss error messages, especially if you're working on a cloud VM or a server without an additional IPMI/management interface. However, it's possible to trigger the DKMS build just by executing dkms autoinstall right after package installation. If this fails, you'll get a meaningful error message about a missing dependency or whatever else is wrong. If dkms autoinstall builds the modules correctly, you can simply load them with modprobe - there is no need to reboot the system (which is often used as a way to trigger a DKMS rebuild).
You can check an example here
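A minimal sketch of that rebuild-and-load flow, with no reboot:
sudo dkms autoinstall   # rebuild any registered DKMS modules (including nvidia) for the running kernel
sudo modprobe nvidia    # load the freshly built module
nvidia-smi              # should now be able to talk to the driver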
This may happen after a Linux kernel update. If you hit this error, you can rebuild your NVIDIA driver using the following commands to fix it.
Firstly, you need dkms, which can automatically regenerate new modules after kernel version changes:
sudo apt-get install dkms
Secondly, rebuild your NVIDIA driver. Here my NVIDIA driver version is 440.82; if you installed it before, you can check your installed version under /usr/src.
sudo dkms build -m nvidia -v 440.82
Lastly, reinstall the NVIDIA driver and then reboot your computer:
sudo dkms install -m nvidia -v 440.82
Now you can check whether you can use it with sudo nvidia-smi.
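If you are unsure which version string to pass to dkms, a quick way to list the registered modules and versions is:
dkms status | grep -i nvidia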
Solved the problem by re-installing CUDA:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
echo "md5sum: $(md5sum cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb)"
echo "correct: 056de5e03444cce506202f50967b0016"
dpkg -i cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-1-local/7fa2af80.pub
apt-get -qq update
apt-get -qq -y install cuda
rm cuda-repo-ubuntu1804-11-1-local_11.1.0-455.23.05-1_amd64.deb
I have been struggling with this issue for two days, so I'm sharing my solution here in case anyone needs it.
The VMs that I'm using are Standard N-series GPU server with 2 K80 cards on Azure platform. With Ubuntu 18.04 OS installed.
Apparently there was a Linux kernel update several days before I came across this issue, and after the update the driver stopped working.
At first, I purged and re-installed as the replies above suggested. Nothing worked. Then, out of the blue (I don't remember why I wanted to do it), I updated the default gcc and g++ versions on one of my VMs as follows.
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-9 90
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-9 90
Then I purged the NVIDIA software and reinstalled it again as instructed in the official document (please choose the correct one for your system: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=deblocal).
sudo apt-get purge nvidia-*
Then the nvidia-smi command finally worked again.
PS:
If you are using an Azure Linux VM like me, the recommended way to install CUDA is actually by enabling the "NVIDIA GPU Driver Extension" in the Azure portal (of course, after you have configured the correct gcc version).
I have tried this on another of my VMs and it works as well.
I'm no Linux expert, but I did the following things and it worked out well for me:
Go to https://www.nvidia.com/download/index.aspx and download the appropriate driver.
Transfer the .run file to the EC2 system (I used FileZilla for transferring the file).
cd into the .run file path
Execute chmod +x NVIDIA-Linux-x86_64-XXX.XXX.XX.run
Execute ./NVIDIA-Linux-x86_64-XXX.XXX.XX.run
Execute sudo reboot
What fixed it for me was similar to what @chaotux said: updating the kernel headers and development packages:
$ sudo apt-get install linux-headers-$(uname -r)
However, I would like to offer a less try-random-commands-from-stackoverflow approach, which is to read and follow CUDA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Try pulling out the NVIDIA graphics card and reinserting it.