Tensorflow: GPU util big difference when setting CUDA_VISIBLE_DEVICES to different values - object-detection

Linux: Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-38-generic x86_64)
Tensorflow: compile from source, 1.4
GPU: 4xP100
I am trying the newly released object detection tutorial training program.
I noticed that there is a big difference when I set CUDA_VISIBLE_DEVICES to different values. Specifically, when it is set to "gpu:0", the GPU util is
quite high, around 80%-90%, but when I set it to other GPU devices, such as
gpu:1, gpu:2, etc., the GPU util is very low, between 10% and 30%.
As for the training speed, it seems to be roughly the same, and much faster than when using the CPU only.
I am just curious how this happens.

As this answer mentions, GPU-Util is a measure of the usage/busyness of the computation on each GPU.
I'm not an expert, but from my experience GPU 0 is generally where most of your processes run by default. CUDA_VISIBLE_DEVICES sets the GPUs seen by the processes you run in that bash session. Therefore, by setting CUDA_VISIBLE_DEVICES to gpu:1/2 you are making it run on less busy GPUs.
Moreover, you only reported one value, while in theory you should have one per GPU; there is the possibility you were only looking at GPU-Util for GPU 0, which would of course decrease if you are not using it.
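For reference, a minimal sketch (not from the original post) of how CUDA_VISIBLE_DEVICES interacts with TensorFlow device names: the variable takes numeric physical device indices, and whatever is visible is renumbered from 0 inside the process, so nvidia-smi reports the load under the physical index while TensorFlow still addresses the device as "/gpu:0".

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # expose only physical GPU 1

import tensorflow as tf

# Inside this process the single visible GPU is "/gpu:0", even though
# nvidia-smi shows the resulting utilization under physical GPU 1.
with tf.device("/gpu:0"):
    a = tf.random_normal([2048, 2048])
    b = tf.matmul(a, a)

with tf.Session() as sess:
    sess.run(b)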

Related

Memory allocation strategies CPU vs GPU in deep learning (cuda, tensorflow, pytorch, …)

I'm trying to start multiple training processes (10, for example) with tensorflow 2. I'm still using Session and other tf.compat.v1 APIs throughout my codebase.
When running on CPU, each process takes around 500 MB of CPU memory (as shown by htop).
When running on GPU, each process takes much more CPU memory (around 3 GB each) and almost as much (more, in reality) GPU memory (as shown by nvtop, with GPU memory on the left and CPU/host memory on the right).
I can reduce the per-process GPU memory footprint by setting the environment variable TF_CUDNN_USE_AUTOTUNE=0 (1.5 GB of GPU memory, no more than 3 GB of CPU memory), but that is still much more memory than the same process consumes on CPU only. I tried a lot of things, like TF_GPU_ALLOCATOR=cuda_malloc_async with a tf nightly release, but it's still the same. This causes OOM errors if I want to keep 10 processes on the GPU, as I can on the CPU.
By profiling a single process, I found that memory fragmentation may be a hint. You can find screenshots here.
TL;DR
When running a tf process on CPU only, it uses a modest amount of memory (comparable to the data size). When running the same tf process on GPU only, it uses much more memory (~16x, without any tensorflow optimizations).
I would like to know what can cause such a huge difference in memory usage, and how to prevent or fix it.
FYI, current setup: tf 2.6, cuda 11.4 (or 11.2 or 11.1 or 11.0), ubuntu 20.04, nvidia driver 370.
EDIT: I tried converting my tensorflow / tflearn code to pytorch. I get the same behaviour (low memory on CPU, and everything explodes when running on GPU).
EDIT 2: Some of the memory allocated on the GPU must be for the CUDA runtime. With pytorch, I have 300 MB of memory allocated for a CPU-only run, but 2 GB of GPU memory and almost 5 GB of CPU memory used when running on GPU. Maybe the main problem is the CPU/system memory allocated for the process when running on GPU, since it seems the CUDA runtime alone can take almost 2 GB of GPU memory (which is huge). It looks related to CUDA initialization.
EDIT 3: This is definitely an issue with CUDA. Even if I just create a 1x1 tensor with pytorch, it takes 2 GB of GPU and almost 5 GB of CPU memory. It can be explained by pytorch loading a huge number of kernels into memory, even though the main program isn't using them.
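As a side note, here is a small sketch (assuming the TF 2.x config API, not taken from the question) of the standard knobs for keeping a process's GPU allocation down. They won't remove the CUDA runtime and kernel-loading overhead described above, but they stop TensorFlow from pre-allocating nearly the whole card, which helps when packing many processes onto one GPU.

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Allocate GPU memory on demand instead of reserving almost all of it up front.
    tf.config.experimental.set_memory_growth(gpu, True)

# Or cap each process at a fixed slice (1024 MB here is an arbitrary example value):
# tf.config.set_logical_device_configuration(
#     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])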

Colab Pro and GPU availability

I need a GPU for my project. Until now I had limited use and worked with the free tier of Colab. Now I think I may need as much as 3 hours a day, and Colab says a GPU is not available because they are already taken. My question is: what effect does upgrading to Colab Pro have on GPU availability? How many hours should I expect to have a GPU, and are those hours chosen arbitrarily by me or not?
I referred to Here and There, but no good detail about GPU availability is given.
On their website they say that these limitations vary and depend on previous usage, and that a precise answer might not even be available, so even an approximate answer is welcome.
Thanks.
Yeah, I had the same experience of GPUs not being available in Colab.
Why not try gpushare.com to run a 3090 or 2080 Ti with free credit?
The platform supports the most popular machine learning frameworks, like TensorFlow and PyTorch, and users can quickly instantiate a VM image.
I think it's a good option for accelerating your model training.

How to speed up Tensorflow-gpu when using CUDA code simultaneously

I only have one GPU (GTX 1070, 8 GB VRAM) and I would like to use tensorflow-gpu and other CUDA code simultaneously on the same GPU.
However, using CUDA code and tensorflow-gpu at the same time slows tensorflow-gpu down by roughly a factor of two.
Are there any solutions to speed things up when tensorflow-gpu and CUDA code are used together?
A slightly longer version of @talonmies' comment:
GPUs are awesome, but they still have finite resources. Any competently-built application that uses the GPU will do its best to saturate the device, leaving few resources for other applications. In fact, one of the goals and challenges of optimizing GPU code - whether it be a shader, CUDA or CL kernel - is making sure that all CUs are used as efficiently as possible.
Assuming that TF is already doing that: when you run another GPU-heavy application, you're sharing a resource that's already running at full tilt, so things slow down.
Some options are:
Get a second, or faster, GPU.
Optimize your CUDA kernels to reduce requirements and simplify your TF stuff. While this is always important to keep in mind when developing for GPGPU, it's unlikely to help with your current problem.
Don't run these things at the same time. This may turn out to be slightly faster than this quasi time-slicing situation that you currently have.
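One more option worth noting, though it won't fix compute contention: if part of the slowdown comes from the two programs fighting over VRAM, TF 1.x can be told to claim only a fraction of GPU memory instead of nearly all of it. A sketch, assuming the TF 1.x Session API; the 0.5 fraction is an arbitrary example value.

import tensorflow as tf

# Let TensorFlow claim only about half of the GPU's memory,
# leaving headroom for the separate CUDA application.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.5)
config = tf.ConfigProto(gpu_options=gpu_options)
# Alternatively, grow the allocation on demand:
# config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the graph here as usual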

Where do Workers and Parameter Servers reside in Distributed TensorFlow?

In this post, it was mentioned that:
Also, there's no built-in distinction between worker and ps devices --
it's just a convention that variables get assigned to ps devices, and
ops are assigned to worker devices.
In this post, it was mentioned that:
TL;DR: TensorFlow doesn't know anything about "parameter servers", but
instead it supports running graphs across multiple devices in
different processes. Some of these processes have devices whose names
start with "/job:ps", and these hold the variables. The workers drive
the training process, and when they run the train_op they will cause
work to happen on the "/job:ps" devices, which will update the shared
variables.
Questions:
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
Do the lower level libraries decide where to place a variable or operation?
Do variables in ps reside on the CPU or GPU? Also, are there any performance gains if "/job:ps" resides on CPU or GPU?
You can pin the ps job to either one of those (with exceptions, see below), but pinning it to a GPU is not practical. The ps is really a store of parameters, plus the ops to update them. A CPU device can have a lot more memory (i.e., main RAM) than a GPU and is fast enough to update the parameters as the gradients come in. In most cases, matrix multiplications, convolutions and other expensive ops are done by the workers, hence placing a worker on a GPU makes sense. Placing a ps on a GPU is a waste of resources, unless the ps job is doing something very specific and expensive.
But: Tensorflow does not currently have a GPU kernel for integer variables, so the following code will fail when Tensorflow tries to place the variable i on GPU #0:
import tensorflow as tf

with tf.device("/gpu:0"):
    i = tf.Variable(3)

with tf.Session() as sess:
    sess.run(i.initializer)  # Fails!
with the following message:
Could not satisfy explicit device specification '/device:GPU:0'
because no supported kernel for GPU devices is available.
This is the case when there's no choice of device for a parameter, and thus for a parameter server: only CPU.
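To illustrate the convention described above (variables on "/job:ps", compute on the workers), here is a minimal sketch, not from the answer, using tf.train.replica_device_setter from TF 1.x; the host names are hypothetical.

import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],            # hypothetical hosts
    "worker": ["worker0.example.com:2222"],
})

with tf.device(tf.train.replica_device_setter(
        cluster=cluster,
        worker_device="/job:worker/task:0/gpu:0")):
    # Variables are placed on "/job:ps" (a CPU device by default);
    # the matmul runs on the worker's GPU.
    w = tf.Variable(tf.zeros([784, 10]), name="weights")
    x = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(x, w)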
Do the lower level libraries decide where to place a variable or operation?
If I get this question right, node placement rules are pretty simple:
If a node was already placed on a device in a previous run of the graph, it is left on that device.
Else, if the user pinned a node to a device via tf.device, the placer places it on that device.
Else, it defaults to GPU #0, or the CPU if there is no GPU.
The Tensorflow whitepaper also describes a dynamic placer, which is more sophisticated, but it's not part of the open-source version of Tensorflow right now.
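To see which of these rules fired for each op, here is a small sketch (not from the answer) using log_device_placement, which makes TensorFlow print the chosen device for every node.

import tensorflow as tf

a = tf.constant([[1.0, 2.0]], name="a")        # no explicit device: default placement rule applies
with tf.device("/cpu:0"):
    b = tf.constant([[3.0], [4.0]], name="b")  # user-pinned to the CPU
c = tf.matmul(a, b, name="c")

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(c))  # the log shows where "a", "b" and "c" ended up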

How much faster (approx) does Tensorflow run with a GPU?

I have a Mac, and consequently have been running Tensorflow without GPU support (because it's not officially supported yet). However, there are some hacked-together implementations that I'm thinking of installing... that is, if the performance gains are worth the trouble. How much faster (approximately) would Tensorflow run on a MacBook Pro with GPU support?
Thanks
As a rule of thumb, somewhere between 10 and 20 times; that's what I've found just running the standard examples.
To give you an idea of the speed difference, I ran some language modelling code (similar to the PTB example), with a fairly large data set, on 3 different machines with the following results:
Intel Xeon X5690 (CPU only): 1 day, 19 hours
Nvidia Grid K520 (on Amazon AWS): 17 hours
Nvidia Tesla K80: 4 hours
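If you want a rough number for your own machine rather than these anecdotes, here is a minimal sketch (not from the answer, using the TF 1.x API) that times a large matmul on the CPU and on the GPU.

import time
import tensorflow as tf

def time_matmul(device, n=4096, iters=10):
    tf.reset_default_graph()
    with tf.device(device):
        a = tf.random_normal([n, n])
        b = tf.random_normal([n, n])
        c = tf.matmul(a, b)
    with tf.Session() as sess:
        sess.run(c)  # warm-up run (kernel launches, memory allocation)
        start = time.time()
        for _ in range(iters):
            sess.run(c)
        return (time.time() - start) / iters

print("CPU:", time_matmul("/cpu:0"))
print("GPU:", time_matmul("/gpu:0"))  # requires a CUDA-capable GPU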