100% GPU utilization on a GCE instance without any processes

I've just started an instance on Google Compute Engine with 2 GPUs (NVIDIA Tesla K80), and straight away after the start I can see via nvidia-smi that one of them is already fully utilized.
I've checked the list of running processes and there is nothing running at all. Does that mean Google has rented out that same GPU to someone else?
The instance is running the following OS:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial

Enabling "persistence mode" with nvidia-smi -pm 1 might solve the problem.
ECC in combination with persistence mode disabled can lead to 100% GPU utilization.
Alternatively, you can disable ECC with nvidia-smi -e 0.
Note: I'm not sure whether performance actually suffers. I remember being able to train an ML model despite the 100% GPU utilization, but I don't know if it was slower.
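If you want to confirm the current state before changing anything, you can query it directly. The snippet below is only a sketch; it assumes nvidia-smi is on the PATH and that your driver supports these query fields:
import subprocess

# Query per-GPU utilization, persistence mode and current ECC mode.
fields = "index,name,utilization.gpu,persistence_mode,ecc.mode.current"
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=" + fields, "--format=csv,noheader"],
    capture_output=True, text=True, check=True)
print(result.stdout)  # e.g. "0, Tesla K80, 100 %, Disabled, Enabled"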

I would suggest reporting this as an issue on the Google Issue Tracker so that it can be investigated. Please provide your project number and instance name there. You can follow this URL to file the issue privately on the Google Issue Tracker.

Related

Stopping and starting a deep learning google cloud VM instance causes tensorflow to stop recognizing GPU

I am using the pre-built deep learning VM instances offered by Google Cloud, with an NVIDIA Tesla K80 GPU attached. I chose to have TensorFlow 2.5 and CUDA 11.0 automatically installed. When I start the instance, everything works great - I can run:
import tensorflow as tf
tf.config.list_physical_devices()
and it returns the CPU, the accelerated CPU, and the GPU. Similarly, if I run tf.test.is_gpu_available(), the function returns True.
However, if I log out, stop the instance, and then restart it, the same exact code only sees the CPU and tf.test.is_gpu_available() returns False. I get an error that looks like the driver initialization is failing:
E tensorflow/stream_executor/cuda/cuda_driver.cc:355] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
Running nvidia-smi shows that the machine still sees the GPU, but TensorFlow can't see it.
Does anyone know what could be causing this? I don’t want to have to reinstall everything when I’m restarting the instance.
Some people (sadly not me) are able to resolve this by setting the following at the beginning of their script/main:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
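For completeness, a minimal self-contained version of that workaround might look like the sketch below; the important detail is usually setting the variable before TensorFlow initializes CUDA:
import os

# Must be set before TensorFlow creates its CUDA context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

# Check whether the GPU is visible again after the restart.
print(tf.config.list_physical_devices("GPU"))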
I had to reinstall the CUDA drivers, and from then on it worked even after restarting the instance. You can configure your system settings on NVIDIA's website and it will give you the commands to run to install CUDA. It also asks whether you want to uninstall the previous CUDA version (yes!). Luckily this is also very fast.
I fixed the same issue with the commands below, taken from https://issuetracker.google.com/issues/191612865?pli=1
gsutil cp gs://dl-platform-public-nvidia/b191551132/restart_patch.sh /tmp/restart_patch.sh
chmod +x /tmp/restart_patch.sh
sudo /tmp/restart_patch.sh
sudo service jupyter restart
Option-1:
Upgrade the Notebooks instance's environment. Refer to the link to upgrade.
Notebooks instances that can be upgraded are dual-disk, with one boot disk and one data disk. The upgrade process upgrades the boot disk to a new image while preserving your data on the data disk.
Option-2:
Connect to the notebook VM via SSH and run the commands in the link.
After the commands have run, the CUDA version will be updated to 11.3 and the NVIDIA driver version to 465.19.01.
Restart the notebook VM.
Note: The issue has been fixed in the GPU images. New notebooks will be created with image version M74. The new image version is not yet mentioned in the Google public issue tracker, but you can find image version M74 in the console.
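With either option, you can verify the result once the VM is back up. This is just a quick sketch, assuming nvidia-smi and TensorFlow are installed in the environment:
import subprocess
import tensorflow as tf

# driver_version is a standard nvidia-smi query field; the CUDA version
# is shown in the header of plain `nvidia-smi` output.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True).stdout.strip())
print(tf.config.list_physical_devices("GPU"))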

How to find out which GPU the machine uses?

I have an unknown Linux machine and I need to check which GPU it uses (more specifically, whether it uses an AMD GPU). To learn about the CPU I have used cat /proc/cpuinfo. Is there something similar for GPUs?
If clinfo is available, it'll give you a list of OpenCL-capable compute devices, including GPUs. You're out of luck if the GPUs don't support OpenCL or the drivers are not installed. There is no generic way of getting a list of all kinds of GPU devices. On some platforms you can at least get a list of discrete GPUs from lspci output, but you'll miss integrated and non-PCI GPUs that way.
If you already have an X11 server running on that box, you can always run glxinfo on it. It cannot be done in a headless way, though.
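If you want to script the lspci check rather than read the output by hand, filtering for display controllers usually works for discrete GPUs; a rough sketch, with the same caveat that integrated and non-PCI GPUs may be missed:
import subprocess

# Keep only the display-related PCI devices from lspci output.
out = subprocess.run(["lspci"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "VGA compatible controller" in line or "3D controller" in line:
        marker = " <- AMD" if ("AMD" in line or "ATI" in line) else ""
        print(line + marker)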

Google cloud GPU machines reboot abruptly

When training a model on a GPU machine, the training gets interrupted by some system patch process. Since Google Cloud GPU machines do not support live migration, it is a painful task to restart the training every time this happens. Google has clearly stated in this doc that there is no way around this but to restart the machines.
Is there a clever way to detect that the machine has been rebooted and resume the training automatically?
Sometimes it also happens that, due to some kernel update, the CUDA drivers stop working, the GPU is not visible, and the CUDA drivers need to be reinstalled. So writing a startup script to resume the training is also not a bulletproof solution.
Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
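As a rough sketch of that idea (the model and checkpoint directory here are placeholders, not taken from the linked example), tf.train.CheckpointManager makes it easy to resume from the latest checkpoint after a reboot:
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])   # placeholder model
optimizer = tf.keras.optimizers.Adam()

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "checkpoints", max_to_keep=3)

# On startup (including after an abrupt reboot), restore the latest checkpoint.
if manager.latest_checkpoint:
    ckpt.restore(manager.latest_checkpoint)

# Inside the training loop, call manager.save() periodically, e.g. once per epoch.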

system auto reboot when tensorflow model is too large

I'm using an NVIDIA GTX 1080 GPU (8 GB) to run the Inception model on TensorFlow. When I set batch_size = 16 and image_size = 400, then after I start the program my Ubuntu 14.04 machine auto-reboots.
Make sure it is not a power supply unit problem. I was observing strange occasional reboots on my development machine. As I increased the size of the input (batch size, larger NN), the rate of reboots increased as well. It turned out to be a PSU problem. A quick check is to limit GPU power consumption and see if the behavior goes away. For instance, you can limit power to about 150 watts with this command (you'll need sudo rights):
sudo nvidia-smi -pl 150
I tracked the issue down to a faulty power supply. It had enough capacity according to spec, and limiting GPU power consumption by running "nvidia-smi -pl 150" didn't help at all. Probably it couldn't handle bursts in power consumption.
Anyway, after I changed the power supply from "Corsair CX750 Builder Series ATX 80 PLUS" to "Cooler Master V1000", the issue is gone.
See details of my investigation in the TensorFlow GitHub issue.
Changing the GPU power settings will work if you have a PSU with enough power (watts).
I limited my GPU's (TITAN X) power to a maximum of 200 watts using:
sudo nvidia-smi -pl 200
NOTE: Each GPU has power limits; e.g. the TITAN X's power limit is between 125 W and 300 W, so make sure to give a value between those limits.
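To see the allowed range for your own card before picking a value, you can ask nvidia-smi for the limits; a small sketch, assuming a recent driver that supports these query fields:
import subprocess

# power.min_limit, power.max_limit and power.limit are standard nvidia-smi query fields.
fields = "name,power.min_limit,power.max_limit,power.limit"
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=" + fields, "--format=csv"],
    capture_output=True, text=True).stdout)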
I was facing similar problems. Even with small batch sizes in both TensorFlow and PyTorch, the PC was restarting by itself. I removed a video card, but still no solution. nvidia-smi -pl 150 alone didn't work.
In addition:
sudo nvidia-smi -pm 1
sudo nvidia-smi -lgc 1400
sudo nvidia-smi -lmc 6500
sudo nvidia-smi -gtt 65
sudo nvidia-smi -cc 1
sudo nvidia-smi -pl 165
I added these and now it works with 2 GPUs without any problem. These settings are for the RTX 2080 Ti; adjust them for your own video card.
My system:
HP Z800 Workstation
Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
PSU 850W
Ubuntu 20.04
2x RTX 2080TI
I got exactly the same problem after a GTX 2070 was installed in a DELL T3610. The answer provided by Sergey above solved my problem. Just to add a note for Windows users:
Run your command prompt as administrator
Go to the nvidia-smi directory: typically it is under C:\Program Files\NVIDIA Corporation\NVSMI
Run nvidia-smi -pl 150
Then your problem should be solved, and you will see output showing that the power limit of your GPU has been reduced to 150 W (in my case, from 185 W to 150 W).
I had a very similar problem but tracked it down to a PATH problem: CUDA 11 got inserted and somehow was overriding my CUDA 10.1 libraries. I am not sure when or how, but it might be related to an upgrade of the NVIDIA drivers I had done recently. At the very least, check and make sure your PATH and versions are correct. CUDA 11 will not work with TensorFlow 2.3.1 or earlier, at least as of 11/2020 on Windows 10. Please let me know if there is a workaround that I am unaware of, but this was definitely the problem. When I fixed the PATH to point only to the CUDA 10.1 paths, everything worked fine and I was able to max out the GPU for over 20 minutes with no restart.
I had the same issue, and limiting power usage resolved it. I had to reduce the power limit to 150 W, though, as 200 W did not work.

Using VMware Fusion to access GPUs

I am running VMware Fusion 8 Pro with Ubuntu 14.04 on a Mac Pro. The Mac Pro comes with dual AMD FirePro D500 GPUs. I installed the AMD APP SDK within Ubuntu, but it only sees the CPU as a device, not the GPUs. Can someone please guide me so that I can run OpenCL kernels on the GPU(s)?
Googling has revealed things like GPU passthrough, but there isn't enough detail on exactly how to access a GPU from within VMware Fusion.
Sincerely,
Vishal
Last time I checked, it was necessary to have motherboard support to allow virtual machines to access the GPUs.
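If you do get the GPUs exposed to the guest (or just want to confirm what the guest currently sees), enumerating the OpenCL platforms and devices is a quick check. The sketch below uses pyopencl, which is an assumption on my part rather than part of the AMD APP SDK itself:
import pyopencl as cl

# Print every OpenCL platform and the devices it exposes; inside a VM you
# will typically only see a CPU device unless GPU passthrough is working.
for platform in cl.get_platforms():
    print("Platform:", platform.name)
    for device in platform.get_devices():
        print("  Device:", device.name, "| type:", cl.device_type.to_string(device.type))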