System memory usage increases after using GPU? - tensorflow

When I use pytorch/tensorflow to build my neural network, I find that system memory usage increases when I use the GPU.
The example code, using pytorch, is pretty simple: a small Conv1d neural network.
When I set use_gpu=False, the system memory usage is just 131M. But when I set use_gpu=True, system memory usage increases to more than 2G.
As far as I know, when using the GPU, many parameters are pinned into the GPU's memory, which means the system memory usage should decrease.
Could anyone help to explain this situation?
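For reference, here is a hypothetical reconstruction of the kind of script described above (the asker's actual code is not shown): a tiny Conv1d network behind a use_gpu flag, so the process's resident (system) memory can be compared in both modes with a tool such as top or ps.

```python
# Hypothetical sketch, not the asker's actual code: a small Conv1d network
# with a use_gpu switch; compare the process's system memory in both modes.
import torch
import torch.nn as nn

use_gpu = True  # flip to False and compare resident memory (e.g. via top/ps)

device = torch.device("cuda" if use_gpu and torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv1d(8, 16, kernel_size=3), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=3),
).to(device)

x = torch.randn(32, 8, 128, device=device)  # (batch, channels, length)
y = model(x)
print(y.shape)
```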

Related

How to estimate how much GPU memory required for deep learning?

We are trying to train our model for object recognition using tensorflow. Since there are so many images (100GB), I guess our current GPU server (1x 2080Ti) may not be able to handle it. We may need to purchase a more powerful one, but I am not sure how to estimate how much GPU memory we need. Is there some approach to estimate the requirements? Thanks!
Your 2080Ti would do just fine for your task. The GPU memory needed for DL tasks depends on many factors, such as the number of trainable parameters in the network, the size of the images you are feeding, the batch size, the floating-point type (FP16 or FP32), the number of activations, and so on. I think you may be confusing this with loading all of the images into GPU memory at once. We do not do that; instead, we use minibatches so that the batch and the parameters fit into memory. Throw any kind of network at your 2080Ti, adjust the batch size, and your training will run smoothly. You could go with your 2080Ti, or get another one or two to increase training speed. This blogpost provides beautiful insights about creating optimal DL environments.
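As a rough, hedged sketch of how to put a number on the parameter part of that estimate (activations scale with batch size and come on top of this; the ResNet50 model below is just an illustrative stand-in for your own network):

```python
# Back-of-envelope estimate of GPU memory for a model's parameters.
# This ignores activations, workspace buffers, and framework overhead,
# so treat it as a lower bound rather than an exact figure.
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None)  # any Keras model works here

params = model.count_params()
bytes_per_value = 4          # FP32; use 2 for FP16
copies = 1 + 1 + 2           # weights + gradients + Adam's two moment buffers
param_mem_gb = params * bytes_per_value * copies / 1024**3
print(f"{params:,} parameters -> ~{param_mem_gb:.2f} GB before activations")
```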

Can tensorflow consume GPU memory exactly equal to what is required?

I use the tensorflow C++ version to do CNN inference. I already call set_allow_growth(true), but it still consumes more GPU memory than it exactly needs.
set_per_process_gpu_memory_fraction can only set an upper bound on the GPU memory, but different CNN models have different upper bounds. Is there a good way to solve this problem?
Unfortunately, there's no such flag to use out-of-the-box, but this could be done (manually):
By default, TF allocates all the available GPU memory. Setting set_allow_growth to true causes TF to allocate the needed memory in chunks instead of allocating all GPU memory at once. Every time TF requires more GPU memory than is already allocated, it allocates another chunk.
In addition, as you mentioned, TF supports set_per_process_gpu_memory_fraction, which specifies the maximum GPU memory the process can use, as a fraction of the total GPU memory. This results in out-of-memory (OOM) exceptions if TF requires more GPU memory than allowed.
Unfortunately, I think the chunk size cannot be set by the user and is hard-coded in TF (for some reason I think the chunk size is 4GB but I'm not sure).
This means you can specify the maximum amount of GPU memory that you allow TF to use, as a fraction of the total. If you know how much GPU memory you have in total (it can be retrieved with nvidia-smi) and you know how much memory you want to allow, you can calculate the fraction and set it in TF.
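A minimal sketch of setting that fraction, shown with the Python TF 1.x API (the C++ GPUOptions fields set_allow_growth and set_per_process_gpu_memory_fraction are the same options); the 0.4 value is just an example computed from nvidia-smi as described above:

```python
# Sketch: cap this process at ~40% of total GPU memory and grow allocations
# in chunks instead of grabbing everything up front.
import tensorflow as tf  # TF 1.x-style API

gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.4,  # example value
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    pass  # build the graph and run inference here
```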
If you run a small number of neural networks, you can find the required GPU memory for each of them by running it with different allowed GPU memory fractions, like a binary search, and seeing what the minimum fraction is that lets the NN run. Then, setting the values you found as the values of set_per_process_gpu_memory_fraction for each NN will achieve what you wanted.
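The binary search described above could look something like this sketch; `can_run` is a hypothetical helper that launches the model with the given fraction and returns False if it hits an OOM error:

```python
# Sketch of the binary search over memory fractions described above.
# `can_run(fraction)` is a hypothetical helper: it creates a session limited
# to that fraction, runs one inference, and returns False on OOM.
def find_min_fraction(can_run, lo=0.05, hi=1.0, tol=0.02):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if can_run(mid):
            hi = mid   # the model fits -> try a smaller fraction
        else:
            lo = mid   # OOM -> it needs more memory
    return hi

# fraction = find_min_fraction(can_run)
# then pass `fraction` to set_per_process_gpu_memory_fraction for that model
```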

TensorFlow GPU memory

I have a very large deep neural network. When I try to run it on the GPU I get "OOM when allocating", but when I mask the GPU and run on the CPU it works (about 100x slower when comparing a small model).
My question is whether there is any mechanism in tensorflow that would enable me to run the model on the GPU. I assume the CPU uses virtual memory, so it can allocate as much as it likes and move data between cache/RAM/disk (thrashing).
Is there something similar in Tensorflow with the GPU? That would help me even if it were 10x slower than a regular GPU run.
Thanks
GPU memory is currently not extensible (until something like Pascal is available).
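As an aside, the "masking the GPU" mentioned in the question is usually done by hiding the device before TensorFlow initializes; a small sketch (TF 2.x API shown for the check):

```python
# Sketch of "masking the GPU": hide it from TensorFlow via an environment
# variable set before TF is imported/initialized, forcing CPU execution.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # no GPUs visible to this process

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))   # -> [] when the GPU is masked
```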

Is there a workaround for running out of memory on GPU with tensorflow?

I am currently building a 3D convolutional network for video classification. The main problem is that I run out of memory too easily. Even if I set my batch_size to 1, there is still not enough memory to train my CNN the way I want.
I am using a GTX 970 with 4GB of VRAM (3.2GB free for tensorflow to use). I was expecting it to still train my network, maybe using my RAM as a backup, or doing the calculations in parts. But so far I have only been able to run it by making the CNN simpler, which directly affects performance.
I think I can run it on the CPU, but it is significantly slower, so it is not a good solution either.
Is there a better solution than to buy a better GPU?
Thanks in advance.
Using gradient checkpointing will help with memory limits.
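A minimal sketch of what gradient checkpointing looks like, assuming TF 2.x's tf.recompute_grad (the layer sizes are made up); the wrapped block's activations are recomputed during backprop instead of being kept in GPU memory:

```python
# Sketch: trade compute for memory. Activations inside `block` are not
# stored during the forward pass; they are recomputed in the backward pass.
import tensorflow as tf

w1 = tf.Variable(tf.random.normal([1024, 4096]))
w2 = tf.Variable(tf.random.normal([4096, 1024]))

@tf.recompute_grad
def block(x):
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.matmul(h, w2)

x = tf.random.normal([8, 1024])
with tf.GradientTape() as tape:
    loss = tf.reduce_sum(block(x))
grads = tape.gradient(loss, [w1, w2])
```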

Low GPU usage by Keras / Tensorflow?

I'm using keras with the tensorflow backend on a computer with an nvidia Tesla K20c GPU (CUDA 8).
I'm training a relatively simple Convolutional Neural Network, and during training I run the terminal program nvidia-smi to check GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by keras/tensorflow?
[nvidia-smi output]
This could be due to several reasons, but most likely you have a bottleneck when reading the training data. As soon as your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and also longer training time.
Try loading all the data into memory if it fits, or use a QueueRunner, which sets up an input pipeline that reads data in the background. This will reduce the time your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
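A sketch of such a background pipeline; the answer mentions QueueRunner (TF 1.x), and tf.data with prefetch is the modern equivalent of the same idea (the file name and parsing logic are placeholders):

```python
# Sketch: the CPU decodes and prepares the next batches in the background
# while the GPU trains on the current one, so the GPU is not starved.
import tensorflow as tf

def parse_example(serialized):
    # hypothetical TFRecord layout: a JPEG-encoded image and an int64 label
    features = tf.io.parse_single_example(
        serialized,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(tf.image.convert_image_dtype(image, tf.float32), [224, 224])
    return image, features["label"]

dataset = (tf.data.TFRecordDataset(["train.tfrecord"])          # placeholder file
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(1000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))                         # overlap CPU and GPU work

# model.fit(dataset, epochs=10)   # a Keras model consumes the dataset directly
```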
You should find the bottleneck:
On Windows, use Task Manager > Performance to monitor how you are using your resources.
On Linux, use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses.
Use the memory to pre-load everything as much as possible.
If you are using a RESTful API or any similar service, make sure that you are not waiting too long to receive what you need. For RESTful services, the number of requests per second might be limited (check your network usage via nmon/Task Manager).
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the batch_size (however, it is said that higher values (>512) for the batch size might have negative effects on accuracy).
The reason may be that your network is "relatively simple". I had an MNIST network with 60k training examples.
With 100 neurons in 1 hidden layer, CPU training was faster, and GPU utilization during GPU training was around 10%.
With 2 hidden layers of 2000 neurons each, the GPU was significantly faster (24s vs 452s on CPU) and its utilization was around 39%.
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-to-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement there is here. I'd have to train a bigger network and compare it with a better CPU/memory configuration + the same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
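A sketch of the two configurations being compared (Keras on MNIST); only the hidden-layer sizes differ, which is enough to reproduce the utilization gap described above:

```python
# Sketch: same data, same training loop, only the hidden layers change.
# The tiny net leaves the GPU mostly idle; the larger one keeps it busier.
import tensorflow as tf

def make_model(hidden_layers):
    layers = [tf.keras.layers.Flatten(input_shape=(28, 28))]
    layers += [tf.keras.layers.Dense(n, activation="relu") for n in hidden_layers]
    layers.append(tf.keras.layers.Dense(10, activation="softmax"))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

small = make_model([100])          # ~10% GPU utilization in the comparison above
large = make_model([2000, 2000])   # ~39% GPU utilization in the comparison above
small.fit(x_train, y_train, epochs=1, batch_size=128)
large.fit(x_train, y_train, epochs=1, batch_size=128)
```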
Measuring GPU performance and utilization is not as straightforward as for the CPU or memory. The GPU is an extremely parallel processing unit, and there are many factors. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one GPU multiprocessing group was active. If this number is 0, it is a sign that your GPU is not being utilized at all; but even if this number is 100, it does not mean that the GPU is being used to its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to a small batch size. Keras has a habit of occupying the whole GPU memory whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes anything.