Tensorflow out of memory - tensorflow

I am using tensorflow to build CNN based text classification. Some of the datasets are large and some are small.
I use feed_dict to feed the network by sampling data from system memory (not GPU memory). The network is trained batch by batch. The batch size is 1024 fixed for every dataset.
My question is:
The network is trained by batches, and each batch the code retrieve data from system memory. Therefore, no matter how large the dataset is the code should handle it like the same, right?
But I got out of memory problem with large dataset, and for small dataset it works fine. I am pretty sure the system memory is enough for holding all the data. So the OOM problem is about tensorflow, right?
Is it that I write my code wrong, or is it about tensorflow's memory management?
Thanks a lot!

I think your batch size is way too big with 1024. There is a lot of matrices overhead created, especially if you use AgaGrad Adam and the like, dropout, attention and/or more. Try smaller values, like 100, as batchsize. Should solve and train just fine.

Related

Prediction with GPU is much slower than with CPU?

curiously I just found out that my CPU is much faster for predictions.
Doing inference with GPU is much slower then with CPU.
I have tf.keras (tf2) NN model with a simple dense layer:
input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input,X)
#also initiialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True )
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
For 100 samples doing predictions I get:
2 s for GPU
0.07 s for CPU (!)
I am using a simple geforce mx150 with 2gb
I also tried the predict_on_batch(x) as someone suggested this as it is more faster than just predict. But here it is of same time.
Refer: Why does keras model predict slower after compile?
Has anyone an idea, what is going on there? What could be an issue possibly?
Using the GPU puts a lot of overhead to load data on the GPU memory (through the relatively slow PCI bus) and to get the results back.
In order for the GPU to be more efficient than the CPU, the model must to be very big, have plenty of data and use algorithms that can run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the quantity of memory and of cores inside your GPU, so you must do some tests, but the following rules apply:
Your NN must have at least >10k parameters, training data set must have at least 10k records. Otherwise your overhead will probably kill the performances of GPU
When you model.fit, use a large batch_size (pay attention, the default is only 32), possibly to contain your whole dataset, or at least a multiple of 1024. Do some test to find the optimum for you.
For some GPUs, it might help performing computations in float16 instead of float32. Follow this tutorial to see how to activate it.
If your GPU has specific Tensor Cores, in order to use efficiently its hardware, several data must be multiples of 8. In the preceding tutorial, see at the paragraph "Ensuring GPU Tensor Cores are used" what parameters must be changed and how. In general, it's a bad idea to use layers which contain a number of neurons not multiple of 8.
Some type of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth to CPU and the speed is lost. If a RNN is really needed, Tensorflow v2 has an implementation of the LSTM layer which is optimized for GPU, but some limitations on the parameters are present: see this thread and the documetation.
If you are training a Reinforcement Learning, activate an Experience Replay and use a memory buffer for the experience which is at least >10x your batch_size. This way, you will activate the NN training only when a big bunch of data is ready.
Deactivate as much verbosity as possible
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
GPU is good if you have compute-intensive tasks (large models) due to the overhead of copying your data and results between the host and GPU. In your case, the model is very small. It means it will take you longer to copy data than to predict. Even if the CPU is slower than the GPU, you don't have to copy the data, so it's ultimately faster.

How to estimate how much GPU memory required for deep learning?

We are trying to train our model for object recognition using tensorflow. Since there are too many images (100GB), I guess our current GPU server (1*2080Ti) could not work. We may need to purchase a more powerful one, but I do not sure how to estimate how much GPU memory we need. Is there some approach to estimate the requirements? thanks!
Your 2080Ti would do just fine for your task. The GPU memory for DL tasks are dependent on many factors such as number of trainable parameters in the network, size of the images you are feeding, batch size, floating point type (FP16 or FP32) and number of activations and etc. I think you get confused about loading all of the images to GPU memory at once. We do not do that, instead we use minibatches of different sizes to fit all of the images and params into memory. Throw any kind of network to your 2080Ti and adjust batch size then your training will run smoothly. You could go with your 2080Ti or can get another or two increase training speed. This blogpost provides beautiful insights about creating optimal DL environments.

How to select batch size automatically to fit GPU?

I am training deep neural networks with a GPU. If I make samples too large, batches too large, or networks too deep, I get an out of memory error. In this case, it is sometimes possible to make smaller batches and still train.
Is it possible to calculate GPU size required for training and determine what batch size to choose beforehand?
UPDATE
If I print network summary, it displays number of "trainable parameters". Can't I estimate from this value? For example, take this, multiply by batch size, double for gradients etc?
PyTorch Lightning recently added a feature called "auto batch size", especially for this! It computes the max batch size that can fit into the memory of your GPU :)
More info can be found here.
Original PR: https://github.com/PyTorchLightning/pytorch-lightning/pull/1638
No, it is not possible to do this automatically. So you need to go through a lot of trial and error to find appropriate size if you want your batch to be as much as possible.
Stanford's CNN class provides some guidance how to estimate the memory size, but all suggestions are related to CNN (not sure what do you train).
I think Salvador here means that it is not possible to analytically compute the best suited batch size, however, as all things are in ML, it is just another hyperparameter, that can be added to your grid search to be computed automatically. Simply evaluate your model's loss or accuracy (however you measure performance) for the best and most stable (least variable) measure given several batch sizes, say some powers of 2, such as 64, 256, 1024, etc. Then keep use the best found batch size. Note that batch size can depend on your model's architecture, machine hardware, etc. For example, if you move your modeling from a local PC to some cloud compute engine (GCP, AWS, Azure,...), then the batch size which was too large for your PC's RAM becomes easily suitable for practically limitless RAM/CPU/GPU (mind the costs).

Specifically, how to train neural network when it is larger than ram?

I have specific questions about how to train a neural network that is larger than ram. I want to use the de facto standard which appears to be Keras and tensorflow.
What are the key classes and methods that I need to use
From Numpy, to scipy, to pandas, h5py, to keras in order to not exceed my meager 8 gb of ram? I have time to train the model; I don't have cash. My dataset requires 200 GB of ram.
In keras there is a model_fit() method. It requires X and Y numpy arrays. How do I get it to accept hdf5 numpy arrays on disk? And when specifying the model architecture itself How do I save ram because wouldn't the working memory require > 8 gb at times?
Regarding fit_generator, does that accept hdf5 files? If the model_fit() method can accept hdf5, do I even need fit generator? It seems that you still need to be able to fit the entire model in ram even with these methods?
In keras does the model include the training data when calculating its memory requirements? If so I am in trouble I think.
In essence I am under the assumption that at no time can I exceed my 8 Gb of ram, whether from one hot encoding to loading the model to training on even a small batch of samples. I am just not sure how to accomplish this concretely.
I cannot answer everything, and I'm also very interested in those answers, because I'm facing that 8GB problem too.
I can only suggest how to pass little batches at a time.
Question 2:
I don't think Keras will support passing the h5py file (but I really don't know), but you can create a loop to load the file partially (if the file is properly saved for that).
You can create an outer loop to:
create a little array with only one or two samples from the file
use the method train_on_batch passing only that little array.
release the memory disposing of the array or filling this same array with the next sample(s).
Question 3:
Also don't know about the h5py file, is the object that opens the file a python generator?
If not, you can create the generator yourself.
The idea is to make the generator load only part of the file and yield little batch arrays with one or two data samples. (Pretty much the same as done in question 2, but the loop goes inside a generator.
Usually for very large sample sets an "online" training method is used. This means that instead of training your neural network in one go with a large batch, it allows the neural network to be updated incrementally as more samples are obtained. See: Stochastic Gradient Descent

TensorFlow: How to measure how much GPU memory each tensor takes?

I'm currently implementing YOLO in TensorFlow and I'm a little surprised on how much memory that is taking. On my GPU I can train YOLO using their Darknet framework with batch size 64. On TensorFlow I can only do it with batch size 6, with 8 I already run out of memory. For the test phase I can run with batch size 64 without running out of memory.
I am wondering how I can calculate how much memory is being consumed by each tensor? Are all tensors by default saved in the GPU? Can I simply calculate the total memory consumption as the shape * 32 bits?
I noticed that since I'm using momentum, all my tensors also have a /Momentum tensor. Could that also be using a lot of memory?
I am augmenting my dataset with a method distorted_inputs, very similar to the one defined in the CIFAR-10 tutorial. Could it be that this part is occupying a huge chunk of memory? I believe Darknet does the modifications in the CPU.
Now that 1258 has been closed, you can enable memory logging in Python by setting an environment variable before importing TensorFlow:
import os
os.environ['TF_CPP_MIN_VLOG_LEVEL']='3'
import tensorflow as tf
There will be a lot of logging as a result of this. You'll want to grep the results to find the appropriate lines. For example:
grep MemoryLogTensorAllocation train.log
Sorry for the slow reply. Unfortunately right now the only way to set the log level is to edit tensorflow/core/platform/logging.h and recompile with e.g.
#define VLOG_IS_ON(lvl) ((lvl) <= 1)
There is a bug open 1258 to control logging more elegantly.
MemoryLogTensorOutput entries are logged at the end of each Op execution, and indicate the tensors that hold the outputs of the Op. It's useful to know these tensors since the memory is not released until the downstream Op consumes the tensors, which may be much later on in a large graph.
See the description in this (commit).
The memory allocation is raw info is there although it needs a script to collect the information in an easy to read form.