Trying to load data into RAM and not into GPU memory - TensorFlow

I'm trying to fit a model using TensorFlow. I have a GPU with 6 GB of memory, and when I try to train with a batch size of 128, where each element in the batch is about 1 MB, I get an error that the GPU is out of memory, even though the model is relatively small (~300k parameters).
I have also seen that my GPU memory fills up (5.6 GB) as soon as I call the fit method, so I'm assuming the whole dataset somehow ends up in GPU memory and gets processed there, not just the current training batch.
So I'm wondering whether there is a way to load all the data into RAM and only send each batch to the GPU once training begins, which should certainly let me train with a batch size of 128 given the sizes above.
Is that possible, or am I missing a lot of information?
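For reference, this is roughly the kind of setup I have in mind (a rough sketch, assuming TF 2.x Keras with tf.data.AUTOTUNE; the array sizes, layer choices and names are made up). The idea is that the arrays live in host RAM, the tf.data pipeline is pinned to the CPU, and only the prefetched batches get copied to the GPU:

    import numpy as np
    import tensorflow as tf

    # Toy stand-in for the real data: each 512x512 float32 sample is ~1 MB
    # and the whole array stays in host RAM.
    x_train = np.random.rand(1000, 512, 512).astype(np.float32)
    y_train = np.random.randint(0, 10, size=1000)

    # Pin the pipeline to the CPU so the full arrays are never placed on the GPU;
    # prefetch overlaps the per-batch host-to-device copies with training.
    with tf.device('/CPU:0'):
        ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
              .shuffle(512)
              .batch(128)
              .prefetch(tf.data.AUTOTUNE))

    # Small stand-in model; the real ~300k-parameter network would go here.
    model = tf.keras.Sequential([
        tf.keras.layers.Reshape((512, 512, 1), input_shape=(512, 512)),
        tf.keras.layers.Conv2D(8, 3, strides=4, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(ds, epochs=2)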

Related

Is it possible to do the whole training procedure in GPU with Tensorflow/Keras?

If the dataset is small enough to fit in the GPU memory, is it possible with Tensorflow to allocate it all initially and then do the training without having data transfers between CPU and GPU?
It seems to me that with tf.data this is not possible and the data transfer is not controlled by the programmer.
Analyzing the GPU workload during training, it reaches 75% with CIFAR10, but I would expect it to reach 100% given that the dataset fits in GPU memory. Also, analyzing with TensorBoard, I see that there are a lot of Send operations.
(I saw that there is a similar question here, but it is quite old; at that time tf.data did not exist yet.)
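To make it concrete, this is roughly what I would like to do (a rough sketch for TF 2.x eager mode; the shapes, sizes and names are made up): bypass tf.data entirely, put the full dataset on the GPU once, and slice each training batch out of the GPU-resident tensors so no per-step host-to-device copy is needed.

    import numpy as np
    import tensorflow as tf

    # CIFAR-10-sized stand-in data (~600 MB as float32).
    x_np = np.random.rand(50000, 32, 32, 3).astype(np.float32)
    y_np = np.random.randint(0, 10, size=50000).astype(np.int32)

    # Copy the whole dataset to the GPU once, up front.
    with tf.device('/GPU:0'):
        x_gpu = tf.constant(x_np)
        y_gpu = tf.constant(y_np)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    opt = tf.keras.optimizers.SGD(0.01)

    @tf.function
    def train_step(indices):
        # tf.gather slices the batch directly out of the GPU-resident tensors.
        x = tf.gather(x_gpu, indices)
        y = tf.gather(y_gpu, indices)
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    batch_size = 256
    for step in range(1000):
        idx = tf.random.uniform([batch_size], 0, x_np.shape[0], dtype=tf.int32)
        train_step(idx)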

Why does my Inception and LSTM model with 2M parameters take 1 GB of GPU memory?

The model is mainly built on Inception and LSTM, and it is implemented with Keras on TensorFlow 2.x. The saved model parameters take only 2M of space. The model is trained on the fly with a batch size of 32 and a data volume of 0.25M per batch. model.fit_generator is run with 20 workers and use_multiprocessing=True.
However, I have observed that it takes 1 GB of GPU memory. I cannot figure out the reason, and I also do not know which tools can be used to monitor the GPU memory cost of different parts of the model during training.
The allocated GPU memory is not only for the parameters but also for the activations and the gradients needed in the backward pass.
In addition, you have to consider that the following things affect the amount of memory used:
Batch size: more images per batch means more activations.
Number of workers: every worker needs some amount of memory to operate.
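As for tools to monitor the memory cost: TensorFlow itself can report what its allocator is handing out. A rough sketch (the memory-info calls are an assumption about your version; they exist in newer TF 2.x releases):

    import tensorflow as tf

    # By default TF grabs most of the GPU memory up front, so nvidia-smi mostly
    # shows the allocator's pool, not what the model really needs. Memory growth
    # makes the numbers reflect actual usage; run this before the GPU is first used.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

    # ... build and train the model here ...

    # Bytes currently allocated by TF on the first GPU, and the peak so far.
    info = tf.config.experimental.get_memory_info('GPU:0')
    print('current: %.1f MiB, peak: %.1f MiB'
          % (info['current'] / 2**20, info['peak'] / 2**20))

    # Reset the peak counter between phases (e.g. before and after a forward
    # pass) to attribute memory to different parts of the model.
    tf.config.experimental.reset_memory_stats('GPU:0')

The TensorFlow Profiler in TensorBoard also has a memory profile view if you need a per-op breakdown.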

Can Tensorflow saturate a GPU with a batch size of 1?

I'm running a modest 5-layer convolutional network in TensorFlow.
If I have a large enough batch size I can pretty much saturate the GPU. I've implemented asynchronous loading with the tf.StagingArea, and I print a warning message if it's ever starved because it's waiting on data loading. I've confirmed that it is not waiting on the data loading at any point.
As I reduce my batch size I notice that the GPU utilization drops. When I have a batch size of 1 I get 35% utilization on a 1080 TI GPU. At a batch size of 20 I get 50% utilization, and so on. It's also less notable on slower GPUs.
I've tested it with a loop over nothing but the main sess.run call to ensure I don't have any other code slowing things down.
I've reviewed the TF high-performance models documentation, but I don't see any mention of small batch sizes.
I wonder if anyone has any insight on improving the performance, other than "duh, just increase the batch size". I include batch size as a hyperparameter because it affects model regularization, and I'd like to test with small batch sizes while still utilizing the GPU fully, if possible. Also, this is generally the case whenever a single training step takes much less than ~0.1 seconds to process.
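For reference, the timing loop I mentioned is essentially this (a sketch with a tiny stand-in network and constant inputs; the real 5-layer model and the StagingArea pipeline would go in its place):

    import time
    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API, as in the question

    batch_size = 1  # the small batch size under test
    x = tf.constant(np.random.rand(batch_size, 64, 64, 3).astype(np.float32))
    y = tf.constant(np.random.randint(0, 10, size=batch_size))

    # Stand-in network; inputs are graph constants so data loading cannot be
    # the bottleneck, and the loop measures pure compute plus launch overhead.
    h = tf.layers.conv2d(x, 32, 3, activation=tf.nn.relu)
    h = tf.reduce_mean(h, axis=[1, 2])
    logits = tf.layers.dense(h, 10)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(10):              # warm-up steps
            sess.run(train_op)
        start = time.time()
        for _ in range(200):             # nothing but sess.run in the loop
            sess.run(train_op)
        print('avg step time: %.4f s' % ((time.time() - start) / 200))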

Minimizing GPU's idle time when using TensorFlow

How can we minimize the idle time of a GPU when training a network using TensorFlow?
To do this:
I used multiple Python threads to preprocess data and feed it to a tf.RandomShuffleQueue, from which TensorFlow took the data.
I thought that this would be more efficient than the feed_dict method.
However, running nvidia-smi, I still find that my GPU often goes from 100% utilization down to 0% and back to 100%.
Since my network is large and the dataset is also large (12 million), any fruitful advice on speeding things up would be very helpful.
Is my thinking that reading data directly from a tf.Queue is better than feed_dict correct ?
NOTE: I am using a 12 GB Titan X GPU (Maxwell architecture)
You are correct in assuming that feeding through a queue is better than feed_dict, for multiple reasons (mainly that loading and preprocessing are done on the CPU, not on the main thread). But one thing that can undermine this is if the GPU consumes the data faster than it is loaded. You should therefore monitor the size of your queue to check whether there are times when the queue size is 0.
If that is the case, I would recommend moving your threading process into the graph; TensorFlow has some nice mechanisms that allow batch loading in CPU threads very efficiently (your loading batches should be larger than your training batches to maximize loading efficiency; I personally use training batches of 128 and loading batches of 1024). Moreover, you should place your queue on the CPU and give it a large maximum capacity so you can take advantage of your large amount of RAM (I always have more than 16,000 images loaded in RAM, waiting for training). A rough sketch of this setup is shown below.
If you still have trouble, you should check TensorFlow's performance guide:
https://www.tensorflow.org/guide/data_performance
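Something like this (TF 1.x queue API to match your setup; the shapes, sizes and the random "loader" are stand-ins for your real pipeline):

    import threading
    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API

    # Large CPU-resident queue, so RAM (not GPU memory) holds the buffered data.
    with tf.device('/cpu:0'):
        image_in = tf.placeholder(tf.float32, [None, 64, 64, 3])
        label_in = tf.placeholder(tf.int32, [None])
        queue = tf.RandomShuffleQueue(
            capacity=16000, min_after_dequeue=4000,
            dtypes=[tf.float32, tf.int32], shapes=[[64, 64, 3], []])
        enqueue_op = queue.enqueue_many([image_in, label_in])
        queue_size = queue.size()

    images, labels = queue.dequeue_many(128)  # training batch

    def loader(sess):
        # Loading batches bigger than the training batch amortizes enqueue overhead.
        while True:
            imgs = np.random.rand(1024, 64, 64, 3).astype(np.float32)  # stand-in for real I/O
            lbls = np.random.randint(0, 10, size=1024).astype(np.int32)
            sess.run(enqueue_op, feed_dict={image_in: imgs, label_in: lbls})

    sess = tf.Session()
    threading.Thread(target=loader, args=(sess,), daemon=True).start()
    for step in range(1000):
        if step % 50 == 0:
            n = sess.run(queue_size)
            if n < 1024:
                print('step %d: queue is low (%d), the GPU may be starving' % (step, n))
        sess.run([images, labels])  # replace with the actual training op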

Tensorflow out of memory

I am using TensorFlow to build CNN-based text classification. Some of the datasets are large and some are small.
I use feed_dict to feed the network by sampling data from system memory (not GPU memory). The network is trained batch by batch, with a fixed batch size of 1024 for every dataset.
My question is:
The network is trained in batches, and for each batch the code retrieves data from system memory. Therefore, no matter how large the dataset is, the code should handle it the same way, right?
But I get an out-of-memory problem with the large datasets, while the small ones work fine. I am pretty sure the system memory is enough to hold all the data, so the OOM problem is about TensorFlow, right?
Is it that I wrote my code wrong, or is it about TensorFlow's memory management?
Thanks a lot!
I think your batch size of 1024 is way too big. There is a lot of matrix overhead created, especially if you use AdaGrad, Adam and the like, dropout, attention and/or more. Try a smaller value, such as 100, as the batch size; that should fix the OOM and train just fine.
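Something along these lines (a toy sketch with made-up shapes and a stand-in text CNN, TF 1.x style to match your feed_dict description):

    import numpy as np
    import tensorflow as tf  # TF 1.x graph-mode API

    # Stand-in data: 50k padded token sequences kept in system memory.
    vocab_size, seq_len, num_classes = 20000, 100, 5
    x_data = np.random.randint(0, vocab_size, size=(50000, seq_len)).astype(np.int32)
    y_data = np.random.randint(0, num_classes, size=50000).astype(np.int32)

    x_ph = tf.placeholder(tf.int32, [None, seq_len])
    y_ph = tf.placeholder(tf.int32, [None])
    emb = tf.get_variable('emb', [vocab_size, 128])
    h = tf.nn.embedding_lookup(emb, x_ph)
    h = tf.layers.conv1d(h, 64, 5, activation=tf.nn.relu)
    h = tf.reduce_max(h, axis=1)            # max-over-time pooling
    logits = tf.layers.dense(h, num_classes)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_ph, logits=logits))
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)

    batch_size = 100  # the smaller value suggested above
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        order = np.random.permutation(len(x_data))
        for start in range(0, len(x_data), batch_size):
            idx = order[start:start + batch_size]
            sess.run(train_op, feed_dict={x_ph: x_data[idx], y_ph: y_data[idx]})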