Why tensorflow GPU memory usage decreasing when I increasing the batch size?

Why tensorflow GPU memory usage decreasing when I increasing the batch size? - tensorflow

Recently I implemented a VGG-16 network using both Tensorflow and PyTorch, data set is CIFAR-10. Each picture is 32 * 32 RGB.
I use a 64 batch size in beginning, while I found PyTorch using much less GPU memory than tensorflow. Then I did some experiments and got a figure, which is posted below.
After some researching, I known the tensorflow using BFC algorithm to manage memory. So it's can explain why tensorflow's memory using decreasing or increasing by 2048, 1024, ... MB and sometimes the memory use not increasing when batch size is bigger.
But I am still confused, why the memory use is lower when batch size is 512 than batch size is 384, 448 etc. which has a smaller batch size. The same as when batch size is from 1024 to 1408, and batch size is 2048 to 2688.
Here is my source code:
PyTorch:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16-pytorch.py
Tensorflow:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16.py
edit:
I have two Titan XP on my computer, OS: Linux Mint 18.2 64-bit.
I determine GPU memory usage with command nvidia-smi.
My code runs on GPU1, which is defined in my code:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
And I am sure there only one application using GPU1.
GPU memory usage can be determined by the application list below.
For example, like the posted screen shot below, process name is /usr/bin/python3 and its GPU memory usage is 1563 MiB.

As noted in the comments, by default TensorFlow always takes up all memory on a GPU. I assume you have disabled that function for this test, but it does show that the algorithms do not generally attempt to minimize the memory that is reserved, even if it's not all utilized in the calculations.
To find the optimal configuration for your device and code, TensorFlow often runs (parts of) the first calculation multiple times. I suspect that this included settings for pre-loading data onto the GPU. This would mean that the numbers you see happen to be the optimal values for your device and configuration.
Since TensorFlow doesn't mind using more memory, 'optimal' here is measured by speed, not memory usage.

Related

How to run tensorflow inference for multiple models on GPU in parallel?

Do you know any elegant way to do inference on 2 python processes with 1 GPU tensorflow?
Suppose I have 2 processes, first one is classifying cats/dogs, 2nd one is classifying birds/planes, each process is running different tensorflow model and run on GPU. These 2 models will be given images from different cameras continuously.
Usually, tensorflow will occupy all memory of the entire GPU. So when you start another process, it will crash saying OUT OF MEMORY or failed convolution CUDA or something along that line.
Is there a tutorial/article/sample code that shows how to load 2 models in different processes and both run in parallel?
This is very useful also in case you are running a model inference while you are doing some heavy graphics e.g. playing games. I also want to know how running the model affects the game.
I've tried using python Thread and it works but each model predicts 2 times slower (and you know that python thread is not utilizing multiple CPU cores). I want to use python Process but it's not working. If you have sample few lines of code that work I would appreciate that very much.
I've attached current Thread code also:

As summarized here, you can specify the proportion of GPU memory allocated per process.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Using Keras, it may be simpler to allow 'memory growth' which will expand the allocated memory on demand as described here.
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
try:
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
print(e)
The following should work for Tensorflow 2.0:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Apart from setting gpu memory fraction, you need to enable MPS in CUDA to get better speed if you are running more than one model on GPU simultaneoulsy. Otherwise, inference speed will be slower as compared to single model running on GPU.
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d
Here 0 is your GPU number
After finishing stop the MPS daemon
echo quit | sudo nvidia-cuda-mps-control

OK. I think I've found the solution now.
I use tensorflow 2 and there are essentially 2 methods to manage the memory usage of GPU.
set memory growth to true
set memory limit to some number
You can use both methods, ignore all the warning messages about out of memory stuff. I still don't know what it exactly means but the model is still running and that's what I care about.
I measured the exact time the model uses to run and it's a lot better than running on CPU. If I run both processes at the same time, the speed drop a bit, but it's still lot better than running on CPU.
For memory growth approach, my GPU is 3GB so first process try to allocate everything and then 2nd process said out of memory. But it still works.
For memory limit approach, I set the limit to some number e.g. 1024 MB. Both processes work.
So What is the right minimum number that you can set?
I tried reducing the memory limit until I found that my model works with 64 MB limit fine. The prediction speed is still the same as when I set the memory limit to 1024 MB. When I set the memory limit to 32MB, I noticed 50% speed drop. When I set to 16 MB, the model refuses to run because it does not have enough memory to store the image tensor.
This means that my model requires minimum of 64 MB which is very little considering that I have 3GB to spare. This also allows me to run the model while playing some video games.
Conclusion: I chose to use the memory limit approach with 64 MB limit. You can check how to use memory limit here: https://www.tensorflow.org/guide/gpu
I suggest you to try changing the memory limit to see the minimum you need for your model. You will see speed drop or model refusing to run when the memory is not enough.

Can tensorflow consume GPU memory exactly equal to required

I use tensorflow c++ version to do CNN inference. I already set_allow_growth(true), but it still consume more GPU memory than exactly need .
set_per_process_gpu_memory_fraction can only set an upper bound of the GPU memory, but different CNN model have different upper bound. Is there a good way to solve the problem

Unfortunately, there's no such flag to use out-of-the-box, but this could be done (manually):
By default, TF allocates all the available GPU memory. Setting set_allow_growth to true, causing TF to allocate the needed memory in chunks instead of allocating all GPU memory at once. Every time TF will require more GPU memory than already allocated, it will allocate another chunk.
In addition, as you mentioned, TF supports set_per_process_gpu_memory_fraction which specifies the maximum GPU memory the process can require, in terms of percent of the total GPU memory. This results in out-of-memory (OOM) exceptions in case TF requires more GPU memory than allowed.
Unfortunately, I think the chunk size cannot be set by the user and is hard-coded in TF (for some reason I think the chunk size is 4GB but I'm not sure).
This results in being able to specify the maximum amount of GPU memory that you allow TF to use (in terms of percents). If you know how much GPU memory you have in total (can be retrieved by nvidia-smi, and you know how much memory you want to allow, you can calculate it in terms of percents and set it to TF.
If you run a small number of neural networks, you can find the required GPU memory for each of them by running it with different allowed GPU memory, like a binary search and see what's the minimum fraction that enables the NN to run. Then, setting the values you found as the values for set_per_process_gpu_memory_fraction for each NN will achieve what you wanted.

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB memory. My batch size is only 2GB, and so can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?

With regards to memory: I assume that you mean that one data batch is 2GB. However, Tensorflow also requires memory to store variables as well as hidden layer results etc. (to compute gradients). For this reason it also depends on your specific model whether or not the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).

Low GPU usage by Keras / Tensorflow?

I'm using keras with tensorflow backend on a computer with a nvidia Tesla K20c GPU. (CUDA 8)
I'm tranining a relatively simple Convolutional Neural Network, during training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%
My question is: during the CNN training shouldn't the GPU usage be higher? is this a sign of a bad GPU configuration or usage by keras/tensorflow?
nvidia-smi output

Could be due to several reasons but most likely you're having a bottleneck when reading the training data. As your GPU has processed a batch it requires more data. Depending on your implementation this can cause the GPU to wait for the CPU to load more data resulting in a lower GPU usage and also a longer training time.
Try loading all data into memory if it fits or use a QueueRunner which will make an input pipeline reading data in the background. This will reduce the time that your GPU is waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.

You should find the bottleneck:
On windows use Task-Manager> Performance to monitor how you are using your resources
On Linux use nmon, nvidia-smi, and htop to monitor your resources.
The most possible scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard-disk frequently, most probably you need to change they way you are dealing with the dataset to reduce number of disk access
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the bach_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)

The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization on GPU training was around 10%
with 2 hidden layers, 2000 neurons each, GPU was significantly faster(24s vs 452s on CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphic card(RTX 2070 + SSDs if that matters) so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.

Measuring GPU performance and utilization is not as straightforward as CPU or Memory. GPU is an extreme parallel processing unit and there are many factors. The GPU utilization number shown by nvidia-smi means what percentage of the time at least one gpu multiprocessing group was active. If this number is 0, it is a sign that none of your GPU is being utilized but if this number is 100 does not mean that the GPU is being used at its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/

Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.

How to tell if my neural network is crashing due to memory errors?

I am trying to calibrate my expectations around a single laptop's ability to train a neural network. I am using tensorflow and keras and after about say 10 minutes, it crashes. I've seen killsignal 9 exit code 137, and I'm wondering if this is due to insufficient memory? Other times, when one-hot encoding using np_utils.to_categorical() I've seen the words memoryerror in the console, and that's it and my script crashes. This is just trying to transform the outputs into what a neural net expects before it even runs.
I have 6400 inputs and 1500 outputs and a small hidden layer of 100 nodes. Batch size 128.
That's it. It's not even deep. It crashes whether using an nvidia gpu or a 4 core cpu. For you pros, is my network too big to train on my system (i7 4 cores, 16gb ram, nvidia GT 750m , compute capability 3.0). Is my neural network considered a large one? I have 3 million samples, btw.
1) How do I estimate the amount of memory required for my network? Is it 6400 (# inputs) * 1500 (#outputs) * 4 bytes (per parameter) = 38.4 gb? Can I see how much memory is being used in real time on a mac somewhere? I've used activity monitor and the memory pressure gauge is normal.
2) GPUs typically are maxing out at 8gb-12gb of RAM, whereas CPUs on desktops could easily have 64 gb. So if the memory requirements of my network exceed 8gb of RAM, would it be impossible to train on a single GPU?
3) what is the difference, especially memory wise, between batch_size and batch_training?
Thank you!

Your calculation was correct with the multiplication, with the exception, that you are dealing with mega bytes and not giga bytes. The actual requirement is 6400*100*4 + 100*1500*4, which should ~4 MB if you use the default float32. You multiply the layer sizes of two subsequent layers together, because every neuron is connected to every neuron in the subsequent layer. Then the whole memory requirement is multiplied by the batch size. This is why convolutional layers are used to train deep networks.
For gpu I am using nvidia-smi to monitor the memory requirements on linux. A google search gave me this for mac: https://phvu.net/2015/03/30/nvidia-smi-on-macos/. If the memory requirements exceed the GPU memory you can not train it on the gpu. You could train it on a cpu, but that will take ages.
There are multiple ways to train with a large training set. Normally generator are used to train on batches. This means only loading the parts of the training set you actually need (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).

Finding the memory requirements for your neural network not only depends on the size of the network or the number of parameters itself. For calculating the memory foot print of the neural network, one document that I always go to is the Stanford CS231n Convolutional Neural Networks for Visual Recognition course notes. Please take a look at the portion where they find the memory requirements for each and every layer of the network.
To add to that, batch size (the number of inputs per one batch) is a crucial factor in deciding the 'memory usage'. For example, in a newer NVIDIA P100 GPU, I can go as much as 2048 images per batch if I train a CIFAR10 model and less than 512 or 256 images if I train AlexNet on ImageNet dataset. The input size matters a lot, so does the batch size since the GPU memory need to account for the batch of inputs.
One way to test the batch size which works is to do nvidia-smi and see how much memory is used. Since doing it every now and then is boring, I usually do watch nvidia-smi in my Linux machine. In my MAC, I do not have a NVIDIA GPU installed so I seldom use these tricks. When I want to, I will write quick bash scripts like these:
while true; do nvidia-smi; sleep 0.5; clear; done
You can port install watch in Mac as well.
Also, two of my most favorite tools of all time are htop and dstat.
htop gives you a much better graphical interface to the famous top command in Linux. It gives you real-time information regarding your memory and processor usage, along with the different processes. If you give sudo access to htop, you can change the niceness and other parameters directly from the interface.
dstat gives you real time information about your I/O. In most cases, I will add two flags -d and -n to specify disk and network usage only.
Fortunately, htop can be brew installed on Mac by running:
brew install htop
dstat on the other hand is not directly available. Please look into ifstat or iostat for similar functionalities.
A screenshot of htop command in Mac.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas