The README in Google's BERT repo says that even a single sentence of length 512 cannot fit on a 12 GB Titan X for the BERT-Large model.
But the BERT paper says that 64 TPU chips were used to train BERT-Large
with a maximum length of 512 and a batch size of 256. How could they fit a >256x larger batch into only 171x more memory?
From another point of view, we can compare these two configurations on a memory-usage-per-sample basis:
TPU: Assuming TPUv3 was used in pre-training, the total TPU memory is 32 GB/chip * 64 chips = 2048 GB. According to the paper, a batch size of 256 with maximum length 512 works well in this configuration, which means 8 GB of memory is enough to hold a single sample. Furthermore, the memory available per sample drops to only 4 GB if TPUv2 is used.
GPU: A 12 GB Titan X cannot hold even a single sample of length 512.
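The per-sample figures above can be checked with a few lines of arithmetic. This is only a sketch: the 32 GB/chip figure for TPUv3 comes from the question, the 16 GB/chip figure for TPUv2 is an assumption chosen to match the 4 GB number, and the function name is made up for illustration.

```python
def memory_per_sample_gb(gb_per_chip, num_chips, batch_size):
    """Total accelerator memory divided evenly across the samples in one batch."""
    return gb_per_chip * num_chips / batch_size


tpu_v3 = memory_per_sample_gb(32, 64, 256)  # TPUv3 figure from the question
tpu_v2 = memory_per_sample_gb(16, 64, 256)  # assumed 16 GB per TPUv2 chip
memory_ratio = (32 * 64) / 12               # 64 TPUv3 chips vs. one 12 GB Titan X

print(tpu_v3, tpu_v2, round(memory_ratio))
```

Note that this treats the pod memory as one pool shared evenly by the batch, which ignores the copy of the model weights that each chip has to hold.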
Why is memory consumption on GPUs much larger? Does this mean memory consumption on TPUs is optimized way better than that on GPUs?
This is probably due to the advanced compiler that comes with the TPU and is optimized for TensorFlow ops. As the "Out-of-memory issues" section of the BERT README says,
The major use of GPU/TPU memory during DNN training is caching the intermediate activations in the forward pass that are necessary for efficient computation in the backward pass.
However, during TPU compilation, a special XLA (a domain-specific compiler for linear algebra that optimizes TensorFlow computations) instruction called fusion
can merge multiple instructions from different TensorFlow operations into a single computation. The TensorFlow operation corresponding to the root instruction in the fusion is used as the namespace of the fusion operation.
On the other hand, running on the GPU with vanilla TF involves basically no (or very limited) optimizations of this kind.
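As a toy analogy only (this is plain Python, not XLA or TensorFlow, and the function names are made up), the sketch below shows what fusing several elementwise operations means for memory: the unfused version materializes a full-size temporary after every step, while the fused version computes the same result in one pass without those temporaries.

```python
def unfused(a, b, c):
    """Three separate 'ops'; each one writes a full-size intermediate list."""
    product = [x * y for x, y in zip(a, b)]          # temporary 1
    shifted = [p + z for p, z in zip(product, c)]    # temporary 2
    return [max(s, 0.0) for s in shifted]            # final output


def fused(a, b, c):
    """The same multiply, add and ReLU done per element in a single pass."""
    return [max(x * y + z, 0.0) for x, y, z in zip(a, b, c)]


a, b, c = [1.0, -2.0], [3.0, 4.0], [0.5, 1.0]
print(unfused(a, b, c), fused(a, b, c))
```

Both functions return the same values; only the number of intermediate buffers that exist at the same time differs.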
I am running a model which allocates a [32768, 32768] float weight (around 4.29 GB) in its first layer. But it gives an OOM error while adding the layer to the Sequential model.
This is the output of nvidia-smi before adding layer -
This is the error -
And this is the output of nvidia-smi after the error -
When the Colab GPU has 13 GB of memory, why can't it allocate a weight of 4.29 GB?
The other answers on this, e.g. allowing GPU growth, don't work.
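For reference, the 4.29 GB figure can be reproduced as follows. This is a minimal sketch assuming 4 bytes per element (float32); the helper name is made up. It counts a single copy of the weight only, so anything else the framework allocates for that layer comes on top of it.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size of one dense tensor, assuming float32 (4 bytes) by default."""
    return prod(shape) * bytes_per_element


weight_bytes = tensor_bytes((32768, 32768))
print(weight_bytes / 1e9)    # size in GB (10**9 bytes)
print(weight_bytes / 2**30)  # size in GiB (2**30 bytes)
```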
(Note - the GPU and CPU code division in the model creation was originally meant to be on gpu1 and gpu2, but since Colab provides only one GPU, I divided it between CPU and GPU to use RAM from both)
IMHO you don't need to specify GPU explicitly in TF/keras - current versions on Colab will use it when it is available. GPU loading usually takes place at fit and predict times, not at model building - and then you can fine tune memory consumption using batch size.
Please try your code without the with blocks.
And please use proper code copy & paste instead of pictures in future.
I am running TensorFlow on a machine which has two GPUs, each with 3 GB memory. My batch size is only 2GB, and so can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume that you mean that one data batch is 2GB. However, Tensorflow also requires memory to store variables as well as hidden layer results etc. (to compute gradients). For this reason it also depends on your specific model whether or not the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
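The batch-splitting part of that idea can be sketched in plain Python. This is not TensorFlow code: split_batch and the pairing with device strings are hypothetical helpers for illustration, and the model replica you would build inside each with tf.device block is left out.

```python
def split_batch(batch, num_devices):
    """Split a batch into num_devices contiguous chunks of near-equal size."""
    base, extra = divmod(len(batch), num_devices)
    chunks, start = [], 0
    for i in range(num_devices):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(batch[start:start + size])
        start += size
    return chunks


batch = list(range(10))                # stand-in for 10 training examples
devices = ["/gpu:0", "/gpu:1"]         # assumed device names for two GPUs
placement = list(zip(devices, split_batch(batch, len(devices))))
print(placement)
```

Each (device, chunk) pair would then feed one copy of the model placed on that device, with the variables shared between the copies.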
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).
Recently I implemented a VGG-16 network in both TensorFlow and PyTorch; the data set is CIFAR-10. Each picture is 32 * 32 RGB.
I used a batch size of 64 at the beginning, and found that PyTorch used much less GPU memory than TensorFlow. Then I did some experiments and got the figure posted below.
After some research, I learned that TensorFlow uses the BFC algorithm to manage memory. That can explain why TensorFlow's memory usage decreases or increases in steps of 2048, 1024, ... MB, and why the memory usage sometimes does not increase when the batch size gets bigger.
But I am still confused about why the memory usage is lower at batch size 512 than at smaller batch sizes such as 384, 448, etc. The same happens for batch sizes from 1024 to 1408, and from 2048 to 2688.
Here is my source code:
PyTorch:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16-pytorch.py
Tensorflow:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16.py
edit:
I have two Titan XP on my computer, OS: Linux Mint 18.2 64-bit.
I determine GPU memory usage with command nvidia-smi.
My code runs on GPU1, which is defined in my code:
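import os  # needed for os.environ
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # number GPUs in PCI bus order, as nvidia-smi does
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only GPU 1 to this process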
And I am sure there is only one application using GPU1.
GPU memory usage can be determined from the application list below.
For example, in the posted screenshot below, the process name is /usr/bin/python3 and its GPU memory usage is 1563 MiB.
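Instead of reading the numbers off a screenshot, the same information can be collected from a script. The sketch below uses nvidia-smi's CSV query mode; the function names are made up, and the sample text is hypothetical apart from the 1563 MiB value mentioned above. Note that this query reports memory per GPU, not per process.

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, values in MiB without units.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse lines like '1, 1563, 12196' into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        usage[index] = (used, total)
    return usage


def read_gpu_memory():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


sample = "0, 11, 12196\n1, 1563, 12196\n"  # hypothetical output for two GPUs
print(parse_gpu_memory(sample))
```

Calling read_gpu_memory() once per batch size and recording the value for GPU 1 would give the same kind of curve as the figure, without manual reading.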
As noted in the comments, by default TensorFlow always takes up all memory on a GPU. I assume you have disabled that function for this test, but it does show that the algorithms do not generally attempt to minimize the memory that is reserved, even if it's not all utilized in the calculations.
To find the optimal configuration for your device and code, TensorFlow often runs (parts of) the first calculation multiple times. I suspect that this included settings for pre-loading data onto the GPU. This would mean that the numbers you see happen to be the optimal values for your device and configuration.
Since TensorFlow doesn't mind using more memory, 'optimal' here is measured by speed, not memory usage.
I'm using Keras with the TensorFlow backend on a computer with an NVIDIA Tesla K20c GPU (CUDA 8).
I'm training a relatively simple convolutional neural network. During training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by Keras/TensorFlow?
nvidia-smi output
This could be due to several reasons, but most likely you have a bottleneck when reading the training data. Once your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and a longer training time.
Try loading all data into memory if it fits, or use a QueueRunner, which builds an input pipeline that reads data in the background. This will reduce the time that your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
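The idea of reading data in the background can be illustrated without TensorFlow. The sketch below is not the QueueRunner API; it is a plain-Python stand-in (the prefetch name and buffer_size parameter are made up) in which a background thread fills a bounded queue while the consumer, i.e. the training loop, takes batches from it.

```python
import queue
import threading

_DONE = object()  # marker telling the consumer that no more batches will come


def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, loading up to buffer_size of them ahead of time."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)   # blocks while the buffer is full
        finally:
            buffer.put(_DONE)       # always unblock the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item


for batch in prefetch(iter(range(5))):
    print(batch)  # a training step would go here
```

While the training step for one batch runs, the producer thread is already loading the next ones, so the consumer waits less often.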
You should find the bottleneck:
On Windows, use Task Manager > Performance to monitor how you are using your resources
On Linux, use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses
Use the memory to pre-load everything as much as possible.
If you are using a restful API or any similar services, make sure that you do not wait much for receiving what you need. For restful services, the number of requests per second might be limited (check your network usage via nmon/Task manager)
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the batch_size (however, it is said that higher values (>512) for batch size might have negative effects on accuracy)
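Besides the system monitors, a rough way to locate the bottleneck from inside the training script is to time the data loading and the training step separately. The sketch below is only an illustration: data_wait_fraction, load_batch and train_step are hypothetical names, and on a GPU the training call may return before the device has finished, so treat the result as an estimate.

```python
import time


def data_wait_fraction(load_batch, train_step, num_steps, clock=time.perf_counter):
    """Fraction of the total loop time that was spent inside load_batch()."""
    load_time = 0.0
    total_time = 0.0
    for _ in range(num_steps):
        t0 = clock()
        batch = load_batch()
        t1 = clock()
        train_step(batch)
        t2 = clock()
        load_time += t1 - t0
        total_time += t2 - t0
    return load_time / total_time if total_time > 0 else 0.0


# Deterministic demo with a fake clock: loading takes 3 units, training 1 unit.
ticks = iter([0, 3, 4, 4, 7, 8])
fraction = data_wait_fraction(lambda: None, lambda batch: None, 2,
                              clock=lambda: next(ticks))
print(fraction)
```

A fraction close to 1 points at the input side (disk, network, preprocessing); a fraction close to 0 means the time goes into the training step itself.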
The reason may be that your network is "relatively simple". I had a MNIST network with 60k training examples.
with 100 neurons in 1 hidden layer, CPU training was faster and GPU utilization during GPU training was around 10%
with 2 hidden layers of 2000 neurons each, the GPU was significantly faster (24s vs 452s on the CPU) and its utilization was around 39%
I have a pretty old PC (24GB DDR3-1333, i7 3770k) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement is here. I'd have to train a bigger network and compare it with better CPU/memory configuration + same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
Measuring GPU performance and utilization is not as straightforward as measuring CPU or memory usage. A GPU is a massively parallel processing unit, and many factors are involved. The GPU utilization number shown by nvidia-smi is the percentage of time over the sample period during which at least one kernel was executing on the GPU. If this number is 0, the GPU is not being used at all, but a value of 100 does not mean that the GPU is being used to its full potential.
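To make that definition concrete, here is a small sketch that computes such a "time with at least one kernel running" percentage from a list of kernel execution intervals. The function name and the interval data are made up for illustration; this is not how nvidia-smi is implemented, only the same kind of metric.

```python
def utilization_percent(kernel_intervals, window_start, window_end):
    """Percent of the window during which at least one kernel was running."""
    busy = 0.0
    covered_until = window_start
    for start, end in sorted(kernel_intervals):
        start = max(start, covered_until)  # skip time that was already counted
        end = min(end, window_end)         # ignore time after the window
        if end > start:
            busy += end - start
            covered_until = end
    return 100.0 * busy / (window_end - window_start)


# One tiny kernel running back to back still reports 100%,
# no matter how few of the GPU's multiprocessors it occupies.
print(utilization_percent([(0.0, 0.5), (0.25, 1.0)], 0.0, 1.0))
print(utilization_percent([(0.0, 0.25)], 0.0, 1.0))
```

The metric says nothing about how many multiprocessors each kernel kept busy, which is why 100% is not the same as full use of the hardware.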
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to the small batch size. Keras has a habit of occupying the whole memory size whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if it changes.
I am trying to calibrate my expectations around a single laptop's ability to train a neural network. I am using tensorflow and keras and after about say 10 minutes, it crashes. I've seen killsignal 9 exit code 137, and I'm wondering if this is due to insufficient memory? Other times, when one-hot encoding using np_utils.to_categorical() I've seen the words memoryerror in the console, and that's it and my script crashes. This is just trying to transform the outputs into what a neural net expects before it even runs.
I have 6400 inputs and 1500 outputs and a small hidden layer of 100 nodes. Batch size 128.
That's it. It's not even deep. It crashes whether using an NVIDIA GPU or a 4-core CPU. For you pros, is my network too big to train on my system (i7 4 cores, 16gb ram, nvidia GT 750m, compute capability 3.0)? Is my neural network considered a large one? I have 3 million samples, btw.
1) How do I estimate the amount of memory required for my network? Is it 6400 (# inputs) * 1500 (#outputs) * 4 bytes (per parameter) = 38.4 gb? Can I see how much memory is being used in real time on a mac somewhere? I've used activity monitor and the memory pressure gauge is normal.
2) GPUs typically are maxing out at 8gb-12gb of RAM, whereas CPUs on desktops could easily have 64 gb. So if the memory requirements of my network exceed 8gb of RAM, would it be impossible to train on a single GPU?
3) What is the difference, especially memory-wise, between batch_size and batch_training?
Thank you!
Your multiplication was correct, except that the result is in megabytes, not gigabytes. With the hidden layer, the actual requirement for the weights is 6400*100*4 + 100*1500*4 bytes, which is about 3.2 MB if you use the default float32. You multiply the sizes of two subsequent layers together because every neuron is connected to every neuron in the subsequent layer. The weights are stored once regardless of the batch size; it is the memory for the layer activations that is multiplied by the batch size. This is why convolutional layers are used to train deep networks.
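The weight calculation can be written out as a short script. This is a sketch assuming fully connected layers and float32 parameters; the function name is made up, and biases are left out by default to match the numbers above.

```python
def dense_weight_bytes(layer_sizes, bytes_per_param=4, include_bias=False):
    """Memory for the weights of a fully connected network with the given layer sizes."""
    total_params = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total_params += fan_in * fan_out   # every input connects to every output
        if include_bias:
            total_params += fan_out        # one bias per output neuron
    return total_params * bytes_per_param


with_hidden = dense_weight_bytes([6400, 100, 1500])  # 6400 -> 100 -> 1500
direct = dense_weight_bytes([6400, 1500])            # the 6400 * 1500 * 4 product
print(with_hidden / 1e6, direct / 1e6)               # both in MB
```

Either way the weights take a few megabytes, or a few tens of megabytes at most, so the model itself is not what exhausts a 16 GB machine.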
For the GPU, I use nvidia-smi to monitor memory usage on Linux. A Google search gave me this for Mac: https://phvu.net/2015/03/30/nvidia-smi-on-macos/. If the memory requirements exceed the GPU memory, you cannot train on the GPU. You could train on a CPU, but that will take ages.
There are multiple ways to train with a large training set. Normally generators are used to train on batches. This means only loading the parts of the training set you actually need (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).
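To see why the whole training set may not fit in memory, here is a rough estimate, assuming the 3 million samples from the question are stored as dense float32 arrays (4 bytes per value) with 6400 inputs and 1500 one-hot outputs each. The function names are made up, and the generator only yields index ranges; loading the actual slice from disk is left to the caller.

```python
def dataset_bytes(num_samples, values_per_sample, bytes_per_value=4):
    """Memory needed to hold a dense array of shape (num_samples, values_per_sample)."""
    return num_samples * values_per_sample * bytes_per_value


def batch_ranges(num_samples, batch_size):
    """Yield (start, end) index pairs so only one batch is loaded at a time."""
    for start in range(0, num_samples, batch_size):
        yield start, min(start + batch_size, num_samples)


inputs_gb = dataset_bytes(3_000_000, 6400) / 1e9   # all inputs at once
one_hot_gb = dataset_bytes(3_000_000, 1500) / 1e9  # all one-hot targets at once
per_batch_mb = dataset_bytes(128, 6400 + 1500) / 1e6
print(inputs_gb, one_hot_gb, per_batch_mb)
print(list(batch_ranges(10, 4)))
```

Under these assumptions the full one-hot target matrix alone is larger than 16 GB of RAM, while a single batch of 128 samples needs only a few megabytes.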
Finding the memory requirements for your neural network depends not only on the size of the network or the number of parameters. For calculating the memory footprint of a neural network, one document that I always go to is the Stanford CS231n Convolutional Neural Networks for Visual Recognition course notes. Please take a look at the portion where they work out the memory requirements for each and every layer of the network.
To add to that, batch size (the number of inputs per batch) is a crucial factor in determining memory usage. For example, on a newer NVIDIA P100 GPU, I can go as high as 2048 images per batch if I train a CIFAR10 model, and fewer than 512 or 256 images if I train AlexNet on the ImageNet dataset. The input size matters a lot, and so does the batch size, since the GPU memory needs to hold the whole batch of inputs.
One way to test which batch size works is to run nvidia-smi and see how much memory is used. Since doing that every now and then is boring, I usually run watch nvidia-smi on my Linux machine. My Mac does not have an NVIDIA GPU installed, so I seldom use these tricks there. When I want to, I write quick bash scripts like this:
while true; do nvidia-smi; sleep 0.5; clear; done
You can install watch on Mac with MacPorts (port install watch) as well.
Also, two of my most favorite tools of all time are htop and dstat.
htop gives you a much better graphical interface to the famous top command in Linux. It gives you real-time information regarding your memory and processor usage, along with the different processes. If you give sudo access to htop, you can change the niceness and other parameters directly from the interface.
dstat gives you real time information about your I/O. In most cases, I will add two flags -d and -n to specify disk and network usage only.
Fortunately, htop can be brew installed on Mac by running:
brew install htop
dstat on the other hand is not directly available. Please look into ifstat or iostat for similar functionalities.
A screenshot of htop command in Mac.