Nvidia GPU memory allocated but by no process?

I am frequently rerunning the same mxnet script while I try to iron out some bugs in a new script (and I am new to mxnet). Pretty often when I try to run my script I get an error that the GPU is out of memory, and when I use nvidia-smi to check, this is what I see:
Wed Dec 5 15:41:29 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.24.02 Driver Version: 396.24.02 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:65:00.0 On | N/A |
| 0% 54C P2 68W / 300W | 10891MiB / 11144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1446 G /usr/lib/xorg/Xorg 40MiB |
| 0 1481 G /usr/bin/gnome-shell 114MiB |
| 0 10216 G ...-token=8422C9FC67F51AEC1893FEEBE9DB68C6 31MiB |
| 0 18221 G /usr/lib/xorg/Xorg 458MiB |
| 0 18347 G /usr/bin/gnome-shell 282MiB |
+-----------------------------------------------------------------------------+
So it seems like most of the memory is in use (10891/11144 MiB), BUT I don't see any process in the list taking up a large portion of the GPU, so there doesn't seem to be anything to kill. And my mxnet script has already exited, so I assume it shouldn't be the culprit. I would understand a lag of a few seconds or even tens of seconds if the GPU does not know right away that the script no longer needs the memory, but I have been waiting many minutes and still see the same display. What gives, and is there some memory cleanup I should do? If so, how? Thank you for any tips to a newbie.

GPU memory usage is bound to the lifetime of the process that allocated it. If you see GPU memory in use, there must be a process that is still alive and holding on to that memory. If you run ps aux | grep python you should see all running python processes, and that will tell you which one is still alive.
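If ps doesn't turn anything up, the same per-process information that nvidia-smi prints can be queried programmatically through NVML. A minimal sketch, assuming the pynvml bindings (the nvidia-ml-py package, which is not mentioned in the thread) are installed:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Compute (CUDA) and graphics clients currently holding memory on GPU 0.
procs = (pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
         + pynvml.nvmlDeviceGetGraphicsRunningProcesses(handle))
for proc in procs:
    # usedGpuMemory is reported in bytes, or None if the driver cannot tell.
    mem_mib = proc.usedGpuMemory // (1024 * 1024) if proc.usedGpuMemory else 0
    print("PID %d is holding about %d MiB" % (proc.pid, mem_mib))

pynvml.nvmlShutdown()

If one of the listed PIDs turns out to be a leftover python process, killing it releases the memory, which matches the point above about memory being tied to process lifetime.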

Related

Is TensorFlow (GPU version) compatible with laptops using Nvidia Quadro P1000, P2000 and 1050Ti?

I am going to buy a laptop to do some TF work. Is the GPU version of TF able to take advantage of the Nvidia Quadro P1000 and P2000? Will it run faster on these two GPUs than on the mobile version of the 1050Ti? Thanks
If I am correct, TensorFlow can run on all Nvidia devices that support CUDA.
Check this website for their compute capabilities:
https://developer.nvidia.com/cuda-gpus
There you can see the compute capability of every Nvidia GPU card.
As for your question about those three cards (P1000, P2000, GeForce 1050Ti): they all have the same compute capability, 6.1, which means they support the same set of CUDA features and won't differ in that respect.
But from their datasheets (P2000, P1000, 1050Ti):
------------------------------------------------------------
|        | Memory    | Memory Interface | Memory Bandwidth |
------------------------------------------------------------
| P1000  | 4GB GDDR5 | 128-bit          | 82 GB/s          |
| P2000  | 5GB GDDR5 | 160-bit          | 140 GB/s         |
| 1050Ti | 4GB GDDR5 | 128-bit          | 112 GB/s         |
------------------------------------------------------------
I would say, P2000 > 1050Ti > P1000
BTW, what does that 6.1 number mean? Basically, it indicates which operations and features the hardware supports; the details are listed on the CUDA GPUs page linked above and in NVIDIA's CUDA documentation.
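Once TF-GPU is installed, a quick way to confirm that TensorFlow actually sees and can use the card is to list the local devices. A small sketch against the TF 1.x API that was current for this question (not from the original answer):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Prints every device TensorFlow can use; a /device:GPU:0 entry confirms
# the CUDA build found the card (P1000, P2000 and 1050Ti all report
# compute capability 6.1).
print(device_lib.list_local_devices())
print("GPU available:", tf.test.is_gpu_available())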

Google compute engine cannot select 1 NVIDIA Tesla K80

I am trying to create a preemptible VM on Google Compute Engine. For some reason I cannot select 1 GPU NVIDIA Tesla K80; it is simply grayed out. I can select 1 GPU NVIDIA Tesla P100.
I can select 2 GPU NVIDIA Tesla K80, but then I get error: "Quota 'PREEMPTIBLE_NVIDIA_K80_GPUS' exceeded. Limit: 1.0 in region us-central1."
I don't want to increase quota to 2 GPU, since I will have to deposit more money.
Previously, I was able to select 2 NVIDIA Tesla K80 GPUs and launch the instance successfully, but something changed in the last 2 months or so and now it is not working.

NVIDIA-SMI, NVML, Power usage: [NOT SUPPORTED]

I tried to get the current power usage with the following command in Windows 10 x64:
nvidia-smi.exe --format=csv,noheader --query-gpu=power.draw
And got the following result:
[Not Supported]
I checked this on a GTX 1050 (notebook) video card.
Please also see the nvidia-smi output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 382.05 Driver Version: 382.05 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1050 WDDM | 0000:01:00.0 Off | N/A |
| N/A 38C P8 N/A / N/A | 319MiB / 2048MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I also tried to get this info via the NVML library:
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlReturn_t result;
    nvmlDevice_t device;

    result = nvmlInit();
    if (NVML_SUCCESS != result)
    {
        printf("Failed to initialize NVML: %s\n", nvmlErrorString(result));
        return 1;
    }

    result = nvmlDeviceGetHandleByIndex(0, &device);
    if (NVML_SUCCESS != result)
    {
        printf("Failed to get handle for device %i: %s\n", 0, nvmlErrorString(result));
        nvmlShutdown();
        return 1;
    }

    unsigned int power_usage = 0;  /* reported in milliwatts when supported */
    result = nvmlDeviceGetPowerUsage(device, &power_usage);
    printf("%s\n", nvmlErrorString(result));

    nvmlShutdown();
    return 0;
}
The output is the same:
Not Supported
First question: is there a way to get the power usage, or another parameter, from an NVIDIA card when it is reported as not supported?
Please also see the Feature Matrix part of the old manual; it contains information about which features the various NVIDIA cards support. Second question: do such docs exist for the newer video cards?
I had the same problem with an NVIDIA GT 1030. It seems that some features, including the one you mention, are no longer supported by NVIDIA in newer drivers. I solved the problem by installing an older version; try finding the first driver version that included support for your GPU.
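For scripting around this, the same query can be made through the Python NVML bindings and the NOT_SUPPORTED case handled gracefully. A sketch assuming the nvidia-ml-py (pynvml) package, which is not part of the original question:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    # nvmlDeviceGetPowerUsage returns milliwatts on boards with power sensing.
    print("Power draw: %.1f W" % (pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0))
except pynvml.NVMLError as err:
    # Boards without a power sensor (many notebook and low-end GeForce parts)
    # raise NVML_ERROR_NOT_SUPPORTED here, matching the [Not Supported] output.
    print("Power draw not available: %s" % err)

pynvml.nvmlShutdown()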

TensorFlow Norm (LRN) doesn't support GPU

I am running the following code on Google Cloud ML using the BASIC GPU tier (Tesla K80):
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most time, and it is running on the CPU. I am wondering whether the following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on the CPU, because that is not what I am seeing:
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run on the GPU, it throws the following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRN[T=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]]]
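The error says that no GPU kernel is registered for the LRN op with T=DT_HALF, so pinning it to /device:GPU:0 cannot be satisfied. One way to keep the rest of the graph on the GPU without the hard failure is soft placement, which lets TensorFlow put unsupported ops back on the CPU. A TF 1.x sketch, not from the original thread (and note it does not make LRN itself run on the GPU):

import tensorflow as tf

with tf.device('/device:GPU:0'):
    # Same op and dtype that trigger the placement error in the question.
    x = tf.random_normal([1, 32, 32, 16], dtype=tf.float16)
    norm1 = tf.nn.lrn(x, depth_radius=4, bias=1.0, alpha=0.001 / 9.0, beta=0.75)

# allow_soft_placement moves ops without a GPU kernel back to the CPU instead
# of raising; log_device_placement shows where each op actually ended up.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(norm1)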

nvidia-smi shows GPU utilization when it's unused

I'm running TensorFlow on GPU id 1 using export CUDA_VISIBLE_DEVICES=1. Everything in nvidia-smi looks good: my python process is running on GPU 1, and the memory and power consumption show that GPU 1 is in use.
But oddly GPU 0, which is unused (based on the process list, memory, power usage, and common sense) shows 96% Volatile GPU-Utilization.
Anyone know why?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.48 Driver Version: 367.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:03:00.0 Off | 0 |
| 30% 41C P0 53W / 225W | 0MiB / 4742MiB | 96% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20c Off | 0000:43:00.0 Off | 0 |
| 36% 49C P0 95W / 225W | 4516MiB / 4742MiB | 63% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 1 5193 C python 4514MiB |
+-----------------------------------------------------------------------------+
Run ps aux | grep 5193 to see which program is using the GPU.
Your GPUs have ECC enabled, which is why you see high GPU and memory utilization readings.
During driver initialization, when ECC is enabled, one can see high GPU and memory utilization readings. This is caused by the ECC memory scrubbing mechanism that is performed during driver initialization.
When Persistence Mode is disabled, the driver deinitializes when there are no clients running (CUDA apps, nvidia-smi, or the X server) and needs to initialize again before any GPU application (like nvidia-smi) can query its state, thus triggering ECC scrubbing again.
As a rule of thumb, always run with Persistence Mode enabled. Just run nvidia-smi -pm 1 as root. This will speed up application launching by keeping the driver loaded at all times.
Reference: https://devtalk.nvidia.com/default/topic/539632/k20-with-high-utilization-but-no-compute-processes-/
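Persistence mode can also be enabled programmatically through NVML, which is handy in provisioning scripts. A sketch assuming the nvidia-ml-py (pynvml) package and root privileges; it toggles the same setting as nvidia-smi -pm 1:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Keep the driver loaded even when no client is attached (Linux only, needs
# root), avoiding the re-initialization and ECC scrubbing described above.
pynvml.nvmlDeviceSetPersistenceMode(handle, pynvml.NVML_FEATURE_ENABLED)

pynvml.nvmlShutdown()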