How to tell if my neural network is crashing due to memory errors? - tensorflow

I am trying to calibrate my expectations around a single laptop's ability to train a neural network. I am using tensorflow and keras, and after about 10 minutes it crashes. I've seen kill signal 9 / exit code 137, and I'm wondering if this is due to insufficient memory. Other times, when one-hot encoding with np_utils.to_categorical(), I've seen MemoryError in the console and then my script crashes. That's just transforming the outputs into what a neural net expects, before training even runs.
I have 6400 inputs and 1500 outputs and a small hidden layer of 100 nodes. Batch size 128.
That's it. It's not even deep. It crashes whether I use an NVIDIA GPU or a 4-core CPU. For you pros, is my network too big to train on my system (i7, 4 cores, 16 GB RAM, NVIDIA GT 750M, compute capability 3.0)? Is my neural network considered a large one? I have 3 million samples, btw.
1) How do I estimate the amount of memory required for my network? Is it 6400 (# inputs) * 1500 (# outputs) * 4 bytes (per parameter) = 38.4 GB? Can I see how much memory is being used in real time on a Mac somewhere? I've used Activity Monitor and the memory pressure gauge is normal.
2) GPUs typically max out at 8-12 GB of RAM, whereas CPUs on desktops could easily have 64 GB. So if the memory requirements of my network exceed 8 GB of RAM, would it be impossible to train on a single GPU?
3) What is the difference, especially memory-wise, between batch_size and batch training?
Thank you!

Your multiplication approach is correct, with the exception that you are dealing with megabytes and not gigabytes, and that you multiply the sizes of two subsequent layers (because every neuron is connected to every neuron in the following layer) rather than inputs by outputs. The actual requirement for the weights is 6400*100*4 + 100*1500*4 bytes, which is roughly 3 MB if you use the default float32. The memory for the activations, on the other hand, grows with the batch size. This parameter blow-up in fully connected layers is one reason convolutional layers are used to train deep networks.
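As a quick sanity check of those numbers, a minimal sketch (assuming float32 and ignoring biases and optimizer state):

import numpy as np  # only used to show the dtype size

inputs, hidden, outputs = 6400, 100, 1500
bytes_per_value = np.dtype("float32").itemsize              # 4 bytes

weight_bytes = (inputs * hidden + hidden * outputs) * bytes_per_value
print("weights: %.1f MB" % (weight_bytes / 1e6))             # ~3.2 MB

batch = 128
activation_bytes = batch * (inputs + hidden + outputs) * bytes_per_value
print("activations per batch: %.1f MB" % (activation_bytes / 1e6))  # ~4.1 MB

So the network itself is tiny; with 3 million samples of 6400 features, it is the raw dataset (let alone its one-hot encoded 1500-way labels) that does not fit in 16 GB of RAM if stored as float32, which is what the generator approach below addresses.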
For the GPU I use nvidia-smi to monitor memory usage on Linux. A Google search gave me this for Mac: https://phvu.net/2015/03/30/nvidia-smi-on-macos/. If the memory requirements exceed the GPU memory, you cannot train on the GPU. You could train on the CPU, but that will take ages.
There are multiple ways to train with a large training set. Normally generators are used to train on batches, which means loading only the part of the training set you actually need (https://keras.io/getting-started/faq/#how-can-i-use-keras-with-datasets-that-dont-fit-in-memory).
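A minimal sketch of such a generator using keras.utils.Sequence, assuming the samples are stored in .npy files (the file names and shapes here are hypothetical); note that to_categorical is applied per batch, which avoids the MemoryError from one-hot encoding all 3 million labels at once:

import numpy as np
from keras.utils import Sequence, to_categorical

class DiskBatches(Sequence):
    def __init__(self, x_path, y_path, num_classes, batch_size=128):
        # mmap_mode="r" keeps the arrays on disk; only slices are read into RAM
        self.x = np.load(x_path, mmap_mode="r")
        self.y = np.load(y_path, mmap_mode="r")
        self.num_classes = num_classes
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # one-hot encode only this batch, never the full label array
        return np.asarray(self.x[sl]), to_categorical(self.y[sl], self.num_classes)

# model.fit_generator(DiskBatches("x.npy", "y.npy", num_classes=1500), epochs=10)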

The memory requirement of a neural network depends on more than the size of the network or the number of parameters alone. For calculating the memory footprint of a neural network, one document I always go to is the Stanford CS231n Convolutional Neural Networks for Visual Recognition course notes. Take a look at the portion where they work out the memory requirements for each and every layer of the network.
To add to that, batch size (the number of inputs per batch) is a crucial factor in the memory usage. For example, on a newer NVIDIA P100 GPU, I can go as high as 2048 images per batch when training a CIFAR-10 model, but only 256 to 512 images when training AlexNet on the ImageNet dataset. The input size matters a lot, and so does the batch size, since the GPU memory has to hold the whole batch of inputs along with its activations.
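As a rough illustration of how the inputs alone scale with batch size (a sketch assuming float32 and a 224x224 crop for ImageNet; the activations on top of this are usually much larger):

cifar_input_bytes = 2048 * 32 * 32 * 3 * 4        # ~25 MB of raw input per CIFAR-10 batch
imagenet_input_bytes = 256 * 224 * 224 * 3 * 4    # ~154 MB per ImageNet-sized batch
print(cifar_input_bytes / 1e6, imagenet_input_bytes / 1e6)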
One way to test which batch size works is to run nvidia-smi and see how much memory is used. Since running it every now and then is tedious, I usually run watch nvidia-smi on my Linux machine. My Mac does not have an NVIDIA GPU installed, so I seldom use these tricks there. When I want to, I write a quick bash one-liner like this:
while true; do nvidia-smi; sleep 0.5; clear; done
You can install watch on a Mac as well (port install watch via MacPorts).
Also, two of my favorite tools of all time are htop and dstat.
htop gives you a much better, more graphical interface to the famous top command on Linux. It gives you real-time information about your memory and processor usage, along with the different processes. If you run htop with sudo access, you can change the niceness and other parameters directly from the interface.
dstat gives you real-time information about your I/O. In most cases, I add the -d and -n flags to show disk and network usage only.
Fortunately, htop can be brew installed on Mac by running:
brew install htop
dstat, on the other hand, is not directly available. Look into ifstat or iostat for similar functionality.
(Screenshot of htop running on a Mac.)

Related

How to estimate how much GPU memory required for deep learning?

We are trying to train our model for object recognition using tensorflow. Since there are so many images (100 GB), I guess our current GPU server (1x 2080 Ti) will not be able to handle it. We may need to purchase a more powerful one, but I am not sure how to estimate how much GPU memory we need. Is there some way to estimate the requirements? Thanks!
Your 2080 Ti will do just fine for your task. The GPU memory needed for DL tasks depends on many factors, such as the number of trainable parameters in the network, the size of the images you are feeding, the batch size, the floating-point type (FP16 or FP32), the number of activations, and so on. I think you are confused about loading all of the images into GPU memory at once. We do not do that; instead we feed minibatches, so only the parameters plus one batch of images and its activations need to fit in memory. Throw any kind of network at your 2080 Ti, adjust the batch size, and your training will run smoothly. You could stay with your 2080 Ti or add another one or two to increase training speed. This blog post provides beautiful insights into creating optimal DL environments.
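If you want a concrete number for the parameter side before buying hardware, a quick hedged sketch (the backbone and input size here are just examples, not a recommendation for your detector):

import tensorflow as tf

model = tf.keras.applications.ResNet50(weights=None, input_shape=(224, 224, 3), classes=10)
params = model.count_params()
print("parameters:", params, "-> ~%.0f MB as float32" % (params * 4 / 1e6))
# Activations and gradients then scale with batch size and image resolution,
# so in practice you simply lower the batch size until training fits in the 11 GB.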

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB of memory. My batch size is only 2 GB, and so it can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regard to memory: I assume you mean that one data batch is 2 GB. However, Tensorflow also requires memory to store variables as well as hidden layer results etc. (needed to compute gradients). For this reason it also depends on your specific model whether or not the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regard to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
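A rough TF 1.x-style sketch of that pattern (build_logits, the layer sizes, and the optimizer are placeholders, and the gradient averaging is simplified; in practice variables are often pinned to the CPU):

import tensorflow as tf  # assumes TF 1.x graph mode

NUM_GPUS = 2

def build_logits(x):
    # toy model; AUTO_REUSE makes every tower share the same variables
    with tf.variable_scope("model", reuse=tf.AUTO_REUSE):
        hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
        return tf.layers.dense(hidden, 10)

features = tf.placeholder(tf.float32, [None, 784])
labels = tf.placeholder(tf.int64, [None])
feature_chunks = tf.split(features, NUM_GPUS)   # one chunk of the batch per GPU
label_chunks = tf.split(labels, NUM_GPUS)

optimizer = tf.train.AdamOptimizer(1e-3)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device("/gpu:%d" % i):
        logits = build_logits(feature_chunks[i])
        loss = tf.losses.sparse_softmax_cross_entropy(label_chunks[i], logits)
        tower_grads.append(optimizer.compute_gradients(loss))

with tf.device("/cpu:0"):
    # average each variable's gradient across the towers and apply once
    averaged = [(tf.reduce_mean(tf.stack([g for g, _ in gv]), axis=0), gv[0][1])
                for gv in zip(*tower_grads)]
    train_op = optimizer.apply_gradients(averaged)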
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).

Object detection training becomes slower over time and uses more CPU than GPU as the training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just a VGG-16 implementation for Faster R-CNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA 1060, 6 GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
Training params are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
The training runs fine, but as time progresses it becomes slower. It keeps shifting between CPU and GPU.
The steps that take around 1 sec run on the GPU, and the ones with high numbers like ~20 secs run on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but otherwise I don't understand why.
PS: I am running only the train script; I am not running the eval script.
Update:
This becomes worse over time.
The secs/step is increasing, which also slows the rate at which the checkpoints and logs get stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started training. And my dataset is pretty small (300 images for training).
In my experience, it is likely that your input images are simply too large. If you take a look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So you could write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a bit higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
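A hypothetical helper for that suggestion, shrinking images offline while keeping the aspect ratio (the directory names and the 600-pixel cap are just examples):

import os
from PIL import Image

def resize_dir(src_dir, dst_dir, max_side=600):
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        img = Image.open(os.path.join(src_dir, name))
        img.thumbnail((max_side, max_side))   # resizes in place, keeps aspect ratio
        img.save(os.path.join(dst_dir, name))

# resize_dir("images/train", "images/train_small", max_side=600)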

Why does TensorFlow GPU memory usage decrease when I increase the batch size?

Recently I implemented a VGG-16 network using both TensorFlow and PyTorch; the data set is CIFAR-10. Each picture is 32 x 32 RGB.
I used a batch size of 64 in the beginning and found that PyTorch uses much less GPU memory than TensorFlow. Then I did some experiments and got a figure, which is posted below.
After some research, I learned that TensorFlow uses the BFC algorithm to manage memory. That can explain why TensorFlow's memory usage decreases or increases in steps of 2048, 1024, ... MB, and why the memory use sometimes does not increase when the batch size gets bigger.
But I am still confused why the memory use is lower at batch size 512 than at smaller batch sizes like 384 or 448. The same happens going from batch size 1024 to 1408, and from 2048 to 2688.
Here is my source code:
PyTorch:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16-pytorch.py
Tensorflow:https://github.com/liupeng3425/tesorflow-vgg/blob/master/vgg-16.py
edit:
I have two Titan Xp cards in my computer; OS: Linux Mint 18.2 64-bit.
I determine GPU memory usage with the command nvidia-smi.
My code runs on GPU1, which is defined in my code:
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
And I am sure there is only one application using GPU1.
GPU memory usage can be determined from the application list below.
For example, as in the screenshot posted below, the process name is /usr/bin/python3 and its GPU memory usage is 1563 MiB.
As noted in the comments, by default TensorFlow always takes up all the memory on a GPU. I assume you have disabled that behaviour for this test, but it does show that the allocator does not generally attempt to minimize the memory that is reserved, even if it's not all utilized in the calculations.
To find the optimal configuration for your device and code, TensorFlow often runs (parts of) the first calculation multiple times. I suspect that this includes settings for pre-loading data onto the GPU. This would mean that the numbers you see happen to be the optimal values for your device and configuration.
Since TensorFlow doesn't mind using more memory, 'optimal' here is measured by speed, not memory usage.
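For reference, one way to disable that default in TF 1.x (a minimal sketch, assuming a plain tf.Session, which is presumably what was done for the measurements above):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory only as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.5   # or cap it explicitly
sess = tf.Session(config=config)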

Low GPU usage by Keras / Tensorflow?

I'm using Keras with the TensorFlow backend on a computer with an NVIDIA Tesla K20c GPU (CUDA 8).
I'm training a relatively simple convolutional neural network; during training I run the terminal program nvidia-smi to check the GPU use. As you can see in the following output, the GPU utilization commonly shows around 7%-13%.
My question is: during CNN training, shouldn't the GPU usage be higher? Is this a sign of a bad GPU configuration or usage by Keras/TensorFlow?
nvidia-smi output
It could be due to several reasons, but most likely you have a bottleneck when reading the training data. As soon as your GPU has processed a batch, it requires more data. Depending on your implementation, this can cause the GPU to wait for the CPU to load more data, resulting in lower GPU usage and also a longer training time.
Try loading all the data into memory if it fits, or use a QueueRunner, which builds an input pipeline that reads data in the background. This will reduce the time your GPU spends waiting for more data.
The Reading Data Guide on the TensorFlow website contains more information.
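QueueRunners have since been superseded by tf.data; a minimal sketch of such a background input pipeline (the arrays here are random placeholders for your own data):

import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 64, 64, 3).astype("float32")   # dummy images
y = np.random.randint(0, 10, size=1000)                 # dummy labels

dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(1000)
           .batch(128)
           .prefetch(1))   # prepare the next batch while the GPU trains on the current one

# with tf.keras the dataset can be passed straight to model.fit(dataset, epochs=5)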
You should find the bottleneck:
On Windows, use Task Manager > Performance to monitor how you are using your resources.
On Linux, use nmon, nvidia-smi, and htop to monitor your resources.
The most likely scenarios are:
If you have a huge dataset, take a look at the disk read/write rates; if you are accessing your hard disk frequently, you most probably need to change the way you are dealing with the dataset to reduce the number of disk accesses.
Use memory to pre-load everything as much as possible.
If you are using a RESTful API or any similar service, make sure that you do not spend too long waiting to receive what you need. For RESTful services, the number of requests per second might be limited (check your network usage via nmon/Task Manager).
Make sure you do not use swap space in any case!
Reduce the overhead of preprocessing by any means (e.g. using cache, faster libraries, etc.)
Play with the batch_size (however, it is said that higher batch-size values (>512) might have a negative effect on accuracy); a rough sketch for probing which batch size still fits on the GPU follows this list.
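A hedged sketch for that last point (build_model and the arrays are placeholders for your own setup; note that after an out-of-memory error TensorFlow does not always release GPU memory cleanly, so restarting the process between attempts is safer):

import tensorflow as tf

def largest_fitting_batch(build_model, x, y, candidates=(64, 128, 256, 512, 1024)):
    best = None
    for bs in candidates:
        try:
            model = build_model()
            # train on a couple of batches only, just enough to trigger the allocation
            model.fit(x[:bs * 2], y[:bs * 2], batch_size=bs, epochs=1, verbose=0)
            best = bs
        except tf.errors.ResourceExhaustedError:
            break
    return best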
The reason may be that your network is "relatively simple". I had an MNIST network with 60k training examples.
With 100 neurons in 1 hidden layer, CPU training was faster, and GPU utilization during GPU training was around 10%.
With 2 hidden layers of 2000 neurons each, the GPU was significantly faster (24 s vs 452 s on the CPU) and its utilization was around 39%.
I have a pretty old PC (24 GB DDR3-1333, i7 3770K) but a modern graphics card (RTX 2070, plus SSDs if that matters), so there is a memory-to-GPU data transfer bottleneck.
I'm not yet sure how much room for improvement there is. I'd have to train a bigger network and compare it against a better CPU/memory configuration with the same GPU.
I guess that for smaller networks it doesn't matter that much anyway because they are relatively easy for the CPU.
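For what it's worth, a minimal Keras sketch of the two configurations compared above (the optimizer, activation, and batch size are assumptions, not necessarily what was originally used):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def make_model(hidden_layers):
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(hidden_layers[0], activation="relu", input_shape=(784,)))
    for units in hidden_layers[1:]:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    model.add(tf.keras.layers.Dense(10, activation="softmax"))
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

small = make_model([100])        # the ~10% GPU utilization case above
big = make_model([2000, 2000])   # the ~39% GPU utilization case above
# small.fit(x_train, y_train, batch_size=128, epochs=5)
# big.fit(x_train, y_train, batch_size=128, epochs=5)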
Measuring GPU performance and utilization is not as straightforward as it is for the CPU or memory. The GPU is an extremely parallel processing unit and there are many factors involved. The GPU utilization number shown by nvidia-smi is the percentage of time during which at least one GPU multiprocessor was active. If this number is 0, it is a sign that your GPU is not being used at all, but if it is 100, that does not mean the GPU is being used to its full potential.
These two articles have lots of interesting information on this topic:
https://www.imgtec.com/blog/a-quick-guide-to-writing-opencl-kernels-for-rogue/
https://www.imgtec.com/blog/measuring-gpu-compute-performance/
Low GPU utilization might be due to a small batch size. Keras has a habit of occupying all of the GPU memory whether, for example, you use batch size x or batch size 2x. Try using a bigger batch size if possible and see if anything changes.