nvidia-smi gpu-util meaning - tensorflow

I am new to using GPUs. I've been searching for the meaning of GPU-Util in nvidia-smi, but I haven't found a clear answer.
I attached a PNG of my nvidia-smi output, and my question is below:
As you can see, my current GPU-Util is 100%, my GPU memory usage is 5838 MiB, and the total GPU memory is 5941 MiB. If I start another process 'A' that uses 50 MiB of GPU memory, will process 'A' be left pending because GPU-Util is already at 100%, or will it run because there is still enough GPU memory for it?
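For illustration, GPU-Util and memory usage are two independent metrics. A minimal sketch of reading both, assuming the pynvml package (Python bindings for the same NVML library that nvidia-smi uses):

    import pynvml  # assumption: nvidia-ml-py / pynvml is installed

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)        # first GPU

    util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # percent of time kernels were executing
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes used / free / total

    print(f"GPU-Util: {util.gpu}%")
    print(f"Memory  : {mem.used / 1024**2:.0f} MiB / {mem.total / 1024**2:.0f} MiB")

    pynvml.nvmlShutdown()

GPU-Util of 100% does not by itself queue or reject a new process; it only means kernels were running during the sampling window. What a new process needs is enough free memory for its allocations plus its own CUDA context, and its kernels are then time-shared with the existing ones (in the default compute mode).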

Related

Memory allocation strategies CPU vs GPU on deeplearning (cuda, tensorflow, pytorch,…)

I'm trying to start multiple training processes (10, for example) with TensorFlow 2. I'm still using Session and other tf.compat.v1 APIs throughout my codebase.
When running on the CPU, each process takes around 500 MB of CPU memory. htop output:
When running on the GPU, each process takes much more CPU memory (around 3 GB each) and almost as much (more, in reality) GPU memory. nvtop output (GPU memory on the left, CPU/host memory on the right):
I can reduce the per-process GPU memory footprint with the environment variable TF_CUDNN_USE_AUTOTUNE=0 (1.5 GB GPU, no more than 3 GB CPU), but it's still far more memory than running the process on CPU only. I tried many things, such as TF_GPU_ALLOCATOR=cuda_malloc_async with a TF nightly release, but the result is the same. This causes OOM errors if I try to keep 10 processes on the GPU, as I can on the CPU.
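For comparison, a minimal sketch of capping each process's share of GPU memory, assuming the tf.compat.v1 Session/GPUOptions API the codebase already uses; this does not shrink the footprint itself, but it is the usual way to keep several processes from OOM-ing one another:

    import tensorflow as tf

    # Hypothetical cap: with ~10 concurrent processes, let each claim at most ~9%
    # of the GPU, and allocate lazily instead of reserving everything up front.
    gpu_options = tf.compat.v1.GPUOptions(
        per_process_gpu_memory_fraction=0.09,  # assumption: 10 processes sharing one GPU
        allow_growth=True,                     # grow allocations on demand
    )
    config = tf.compat.v1.ConfigProto(gpu_options=gpu_options)

    with tf.compat.v1.Session(config=config) as sess:
        # ... build and run the existing graph here ...
        pass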
By profiling a single process, I found hints that memory fragmentation may be part of the problem. You can find screenshots here.
TL;DR
When running a TF process on the CPU only, it uses a modest amount of memory (comparable to the data size). When running the same TF process on the GPU only, it uses much more memory (roughly 16x, without any TensorFlow optimization).
I would like to know what can cause such a huge difference in memory usage, how to prevent it, and ideally how to fix it.
FYI, current setup: TF 2.6, CUDA 11.4 (or 11.2/11.1/11.0), Ubuntu 20.04, NVIDIA driver 370
EDIT: I tried converting my TensorFlow/TFLearn code to PyTorch. I see the same behaviour (low memory on CPU, and everything explodes when running on GPU).
EDIT2: Some of the memory allocated on the GPU is presumably for the CUDA runtime. With PyTorch I have about 300 MB allocated for a CPU run, but about 2 GB of GPU memory and almost 5 GB of CPU memory used when running on the GPU. Maybe the main problem is the CPU/system memory allocated for the process when running on GPU, since the CUDA runtime alone seems to take almost 2 GB of GPU memory (which is huge). It looks related to CUDA initialization.
EDIT3: This is definitely a CUDA issue. Even if I only create a 1x1 tensor with PyTorch, it takes about 2 GB of GPU and almost 5 GB of CPU memory. This can be explained by PyTorch loading a huge number of kernels into memory, even though the main program never uses them.
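The EDIT3 observation can be reproduced with a minimal sketch, assuming PyTorch with CUDA available; the gap between what PyTorch itself reports and what nvidia-smi shows for the process is the CUDA context plus the kernel images loaded at initialization:

    import torch

    x = torch.ones(1, 1, device="cuda")        # a single 1x1 tensor forces CUDA initialization
    torch.cuda.synchronize()

    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors (tiny here)
    reserved = torch.cuda.memory_reserved()    # bytes held by PyTorch's caching allocator

    print(f"allocated by tensors : {allocated} B")
    print(f"reserved by allocator: {reserved / 1024**2:.1f} MiB")
    # nvidia-smi will report far more than either number for this process;
    # the difference is the CUDA context and preloaded kernels.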

Does low GPU utilization indicate a bad fit for GPU acceleration?

I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to try and get the CPU to perform as much (meaningful) work as possible in a unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable or not) is a task that is not being performed efficiently because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall clock time, whether it's inefficient locking, heavy context switching, pipeline flushes, locking IO in the main loop, etc. which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?
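One way to test the copy-bound hypothesis is to time the wait for data separately from the transfer-plus-compute step. A rough sketch, assuming a standard PyTorch DataLoader setup (the dataset and model below are placeholders, not the asker's actual code):

    import time
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder data standing in for the custom dataset.
    dataset = TensorDataset(torch.randn(2000, 3, 64, 64), torch.randint(0, 10, (2000,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

    model = torch.nn.Conv2d(3, 16, 3).cuda()   # stand-in model

    wait_time = gpu_time = 0.0
    t0 = time.perf_counter()
    for images, _ in loader:
        t1 = time.perf_counter()
        wait_time += t1 - t0                   # time spent waiting for the next batch
        images = images.cuda(non_blocking=True)
        out = model(images)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        gpu_time += t0 - t1                    # host-to-device copy plus GPU compute

    print(f"waiting for data: {wait_time:.1f}s, transfer+compute: {gpu_time:.1f}s")

If the waiting time dominates, the GPU is being starved by the input pipeline (disk reads, decoding, augmentation on the CPU) rather than being a poor fit for the workload.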

Screen slows down when GPU-Util is about 50%

The screen runs smoothly when GPU-Util is about 25%, but pretty slowly at 55%. In the first case GPU memory usage was around 5.7 GB/8 GB, and in the second around 5.2 GB/8 GB.
On a second GPU (which I'm pretty sure the OS is not using) I have GPU-Util 99%, which makes me think GPUs have the capability to reach very high GPU-Util if needed.
My hypothesis is there is nothing wrong with my computer, but that I'm missing something of how things work.
Why does the screen slow down at 55% and not in the 90s?
In case it helps, I'm on Linux 14.04 with two GTX 1080s, and I read GPU-Util from nvidia-smi.
It ended up being something dumb: even though the processes I was running were GPU-intensive, system RAM was strained even more.
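For anyone hitting the same thing, a small sketch of watching GPU utilization and system RAM side by side (assuming the psutil and pynvml packages), which makes this kind of bottleneck easier to spot:

    import time
    import psutil
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    for _ in range(10):                                   # sample roughly once per second
        gpu_util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        ram_pct = psutil.virtual_memory().percent
        print(f"GPU-Util: {gpu_util:3d}%   system RAM: {ram_pct:.0f}%")
        time.sleep(1)

    pynvml.nvmlShutdown()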

Tensorflow: big difference in GPU util when setting CUDA_VISIBLE_DEVICES to different values

Linux: Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-38-generic x86_64)
Tensorflow: compiled from source, 1.4
GPU: 4xP100
I am trying the newly released object detection tutorial training program.
I noticed that there is a big difference when I set CUDA_VISIBLE_DEVICES to different values. Specifically, when it is set to "gpu:0", the GPU util is quite high, around 80%-90%, but when I set it to other GPU devices, such as gpu:1, gpu:2, etc., the GPU util is very low, between 10% and 30%.
As for the training speed, it seems to be roughly the same, and much faster than using the CPU only.
I'm just curious how this happens.
As this answer mentions, GPU-Util is a measure of how busy each GPU's compute is.
I'm not an expert, but in my experience GPU 0 is generally where most of your processes run by default. CUDA_VISIBLE_DEVICES sets the GPUs visible to the processes you run in that shell. Therefore, by setting CUDA_VISIBLE_DEVICES to gpu:1/2 you are making them run on less busy GPUs.
Moreover, you only reported one value; in principle you should have one per GPU. It's possible you were only looking at GPU-Util for GPU 0, which would of course drop if you are no longer using it.
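For reference, a minimal sketch of pinning a process to one physical GPU; it assumes the variable is set before TensorFlow initializes CUDA, and uses the TF 2-style tf.config call only as an example check (on TF 1.x, device_lib.list_local_devices() gives the same information). Note that CUDA_VISIBLE_DEVICES itself takes bare indices; names like "gpu:0" are TensorFlow device strings for whatever is left visible:

    import os

    # Equivalent shell form (hypothetical script name):
    #   CUDA_VISIBLE_DEVICES=1 python train.py
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # expose only physical GPU 1 to this process

    import tensorflow as tf

    # The single visible GPU is enumerated by the process as device 0,
    # so TensorFlow names it "gpu:0" even though it is physical GPU 1.
    print(tf.config.list_physical_devices("GPU"))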

Non-graphics benchmarks for GPU

Most benchmarks for GPU performance and load testing are graphics-related. Is there any benchmark that is computationally intensive but not graphics-related? I am using:
DELL XPS 15 laptop,
nvidia GT 525M graphics card,
Ubuntu 11.04 with bumblebee installed.
I want to load test my system to find the maximum load the graphics card can handle. Are there any non-graphics benchmarks for the GPU?
What exactly do you want to measure?
To measure GFLOPS on the card, just write a simple kernel in CUDA (or OpenCL).
If you have never written anything in CUDA, let me know and I can post something for you.
If your application is not compute-intensive (take a look at the roofline model paper), then I/O will be the bottleneck: getting data from global (card) memory to the processors takes hundreds of cycles.
On the other hand, if your application IS compute-intensive, then just time it and calculate how many bytes you process per second. To hit the maximum GFLOPS (your card can do about 230), you need many FLOPs per memory access, so that the processors stay busy instead of stalling on memory and switching threads.
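If writing a raw CUDA or OpenCL kernel is too big a first step, a rough sketch of the same measurement through a framework is to time a large matrix multiply; this assumes PyTorch with CUDA support for the card, which may not exist for a GPU as old as the GT 525M:

    import time
    import torch

    n = 4096
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")

    torch.matmul(a, b)
    torch.cuda.synchronize()                    # warm-up; exclude one-time initialization

    iters = 20
    t0 = time.perf_counter()
    for _ in range(iters):
        torch.matmul(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - t0

    flops = 2 * n**3 * iters                    # roughly 2*n^3 floating-point ops per matmul
    print(f"{flops / elapsed / 1e9:.1f} GFLOP/s sustained")

A dense matmul has a high ratio of FLOPs to memory accesses, so it is exactly the kind of workload the answer describes for approaching the card's peak GFLOPS.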