Why does my Inception and LSTM model with 2M parameters take 1 GB of GPU memory? - tensorflow

The model is built mainly on Inception and LSTM layers and is implemented with Keras on TensorFlow 2.x. The saved model parameters take only 2 MB of space. The model is trained on the fly with a batch size of 32 and a data volume of 0.25M for each batch. model.fit_generator is called with workers=20 and use_multiprocessing=True.
However, I have observed that it takes about 1 GB of GPU memory. I cannot figure out why, and I also do not know which tools can be used to monitor the GPU memory consumed by different parts of the model during training.
The details of the model are shown below:

The allocated GPU memory is not only for parameters but also for activations and gradients for the backward pass.
In addition, you have to consider that the following things affect the amount of memory used:
Batch size: more images in the batch means more activations to store.
Number of workers: every worker needs some amount of memory to operate.
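As for tools to monitor the memory cost: TensorFlow 2.5+ can report the bytes it has actually allocated on the device. A minimal sketch (the device string assumes a single GPU):

import tensorflow as tf

# Optional: without memory growth, TensorFlow may reserve a large chunk of GPU
# memory up front, so tools like nvidia-smi report more than the model uses.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# ... build and fit the model here ...

# Available in TF >= 2.5: bytes currently allocated and the peak so far.
info = tf.config.experimental.get_memory_info('GPU:0')
print(f"current: {info['current'] / 2**20:.1f} MiB, peak: {info['peak'] / 2**20:.1f} MiB")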

Related

Trying to load data into RAM and not into GPU memory

I'm trying to fit a model using TensorFlow. I have a GPU with 6 GB of memory. When I try to train a model with a batch size of 128, where each element in the batch is 1 MB in size, I get an error that the GPU is out of memory, even though the model size is relatively small (~300k params).
However, I have seen that my GPU memory gets full (5.6 GB) as soon as I call the fit method. I'm assuming that the whole dataset somehow gets loaded into GPU memory and then processed, not just the training batch.
So I'm wondering if there is a way to load all the data into RAM and then only send the batches to the GPU once I begin to train the model, which would certainly let me train a model with a batch size of 128 given its size above.
Is that possible, or am I missing out on a lot of information?
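One common pattern for this is to keep the arrays in host RAM and hand model.fit a Python generator that yields one batch at a time, so only the current batch is copied to the GPU. A minimal sketch with made-up array names and shapes:

import numpy as np
import tensorflow as tf

# Placeholder data that fits in host RAM; names and shapes are made up
# (each sample here is roughly 1 MB of float32, as in the question).
x_train = np.random.rand(256, 512, 512).astype('float32')
y_train = np.random.randint(0, 2, size=(256,))
BATCH = 128

def batch_generator():
    # Yields one batch at a time; only this batch ends up in GPU memory.
    while True:
        idx = np.random.permutation(len(x_train))
        for start in range(0, len(idx), BATCH):
            sel = idx[start:start + BATCH]
            yield x_train[sel], y_train[sel]

# model.fit(batch_generator(), steps_per_epoch=len(x_train) // BATCH, epochs=10)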

Why does training time not decrease when training a Keras model after increasing the batch size beyond a certain amount?

I am currently training an NLP model in Keras with TF 2.8, where I am experimenting by adding GRU and LSTM layers. When I train the model, I use different batch sizes to see the impact they have on accuracy and overall training time.
What I noticed was that after increasing the batch size beyond a certain amount, the training time does not reduce any further.
I started with a batch size of 2 and then slowly increased it up to 4096, trying multiples of two, yet after 512 the training time remained the same.
It's often wrongly claimed that batch learning is as fast as or faster than on-line training. In fact, batch learning changes the weights once, after the complete set of data (the batch) has been presented to the network. Therefore, the weight update frequency is rather slow. This explains why the processing speed in your measurements behaves the way you observed.
Even if it is a matrix operation, each row-column multiplication might happen on one GPU core. So a full matrix multiplication is divided over as many cores as possible. For one matrix multiplication, each GPU core takes some time, and when you add more images that time increases, since there are more rows. If, at a batch size of 4, your GPU is already at full capacity, i.e. all cores are running, then increasing the batch size is not going to give any advantage. Your added data just sits in GPU memory and is processed when a GPU core becomes free of its previous operation.
To get a further understanding of these training techniques, have a look at the 2003 paper The general inefficiency of batch training for gradient descent learning. It deals with the comparison of batch and on-line learning.
Also, in general, RNN kernels can have O(timesteps) complexity, with batch size having a smaller effect than you might anticipate.
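One way to see this plateau on your own machine is to time one epoch per batch size. A rough sketch with toy sequence data (shapes and sizes are made up); the epoch time typically stops shrinking once the GPU is saturated:

import time
import numpy as np
import tensorflow as tf

# Toy sequence data, purely for illustration: (samples, timesteps, features).
x = np.random.rand(20_000, 50, 32).astype('float32')
y = np.random.randint(0, 2, size=(20_000,))

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(50, 32)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# Time one full epoch at each batch size.
for batch_size in [32, 128, 512, 2048]:
    model = build_model()
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    print(f"batch_size={batch_size:5d}  epoch time: {time.perf_counter() - start:.2f} s")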

What are the maximum and minimum batch sizes we can use in the .fit() method?

I want to see the effect of batch size on generalization, for which I want to run my .fit() method with all the possible batch sizes.
But I was wondering: what could the constraints on choosing a batch size be?
What does it depend on: the machine? the dataset?
Any help is highly appreciated.
It depends on the size of each sample and your GPU memory (if you're using one), otherwise your RAM. Keep in mind that various other things are loaded into memory as well, like the model's parameters, the graph, etc. But strictly for the size of a batch: NUM_SAMPLES * SIZE_OF_SAMPLE.
The batch size you choose is affected by several parameters:
Resources - You need to choose a batch size small enough to fit inside your CPU / GPU RAM.
Normalization - If you use BatchNorm, you should probably use a large batch size, as the BatchNorm layers learn the mean and variance of your batch. The smaller the batches are, the larger the deviation between them will be.
Personally, I usually use the largest batch size possible given my resources. In case the possible batch size is small (<16), I swap BatchNorm for other normalization methods such as LayerNorm / InstanceNorm.
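As an illustration of that BatchNorm / LayerNorm swap, a small sketch (the <16 threshold is just the rule of thumb from this answer, and the layer sizes are arbitrary):

import tensorflow as tf

def norm_layer(batch_size):
    # BatchNorm statistics get noisy with very small batches, so fall back
    # to LayerNorm below the (arbitrary) threshold used in this answer.
    if batch_size < 16:
        return tf.keras.layers.LayerNormalization()
    return tf.keras.layers.BatchNormalization()

batch_size = 8
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(64,)),
    norm_layer(batch_size),
    tf.keras.layers.Dense(10, activation='softmax'),
])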
The machine's memory.
The training batch size has a huge impact on the required GPU memory for training a neural network.
The GPU memory usage includes parameters, the optimizer's variables, intermediate calculations, and workspace variables. So the larger the batch size, the more samples are propagated through the neural network in the forward pass. This results in larger intermediate calculations (e.g. layer activation outputs) that need to be stored in GPU memory. Technically speaking, the size of the activations is linearly dependent on the batch size.
You can use some workarounds to push past this limitation:
Data parallelism: use multiple GPUs to train all mini-batches in parallel, each on a single GPU.
Gradient accumulation: run the mini-batches sequentially while accumulating the gradients. The accumulated results are used to update the model variables at the end of the last mini-batch (see the sketch below).
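A minimal gradient-accumulation sketch, where model, optimizer, loss_fn and dataset stand in for your own objects: gradients from several small mini-batches are summed and applied once, emulating a larger effective batch without keeping all of its activations in GPU memory at the same time.

import tensorflow as tf

ACCUM_STEPS = 4  # effective batch size = ACCUM_STEPS * mini-batch size

def train_epoch(model, optimizer, loss_fn, dataset):
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for step, (x, y) in enumerate(dataset, start=1):
        with tf.GradientTape() as tape:
            # Scale so the accumulated gradient matches a mean over the big batch.
            loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
        if step % ACCUM_STEPS == 0:
            optimizer.apply_gradients(zip(accum, model.trainable_variables))
            accum = [tf.zeros_like(v) for v in model.trainable_variables]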

Prediction with GPU is much slower than with CPU?

Curiously, I just found out that my CPU is much faster for predictions.
Doing inference with the GPU is much slower than with the CPU.
I have a tf.keras (TF2) NN model with a simple dense layer:
import numpy as np
import tensorflow as tf

input = tf.keras.layers.Input(shape=(100,), dtype='float32')
X = tf.keras.layers.Dense(2)(input)
model = tf.keras.Model(input, X)
# also initialized with weights from a file
weights = np.load("weights.npy", allow_pickle=True)
model.layers[-1].set_weights(weights)
scores = model.predict_on_batch(data)
For predictions on 100 samples I get:
2 s on the GPU
0.07 s on the CPU (!)
I am using a simple GeForce MX150 with 2 GB.
I also tried predict_on_batch(x), as someone suggested it is faster than plain predict. But here it takes the same amount of time.
Refer: Why does keras model predict slower after compile?
Does anyone have an idea what is going on there? What could the issue possibly be?
Using the GPU adds a lot of overhead for loading the data into GPU memory (through the relatively slow PCI bus) and for getting the results back.
In order for the GPU to be more efficient than the CPU, the model must be very big, there must be plenty of data, and the algorithms must be able to run fully inside the GPU, without requiring partial results to be moved back to the CPU.
The optimal configuration depends on the amount of memory and the number of cores inside your GPU, so you must do some tests, but the following rules apply:
Your NN should have at least 10k parameters and the training data set at least 10k records. Otherwise the overhead will probably kill the GPU's performance.
When you call model.fit, use a large batch_size (pay attention, the default is only 32), possibly one that contains your whole dataset, or at least a multiple of 1024. Do some tests to find the optimum for you.
For some GPUs, it might help to perform computations in float16 instead of float32. Follow this tutorial to see how to activate it (a short sketch follows at the end of this answer).
If your GPU has dedicated Tensor Cores, several dimensions must be multiples of 8 in order to use the hardware efficiently. In the preceding tutorial, see the paragraph "Ensuring GPU Tensor Cores are used" for which parameters must be changed and how. In general, it's a bad idea to use layers whose number of neurons is not a multiple of 8.
Some types of layers, namely RNNs, have an architecture which cannot be solved directly by the GPU. In this case, data must be moved constantly back and forth between the CPU and the GPU, and speed is lost. If an RNN is really needed, TensorFlow 2 has an implementation of the LSTM layer which is optimized for GPU, but some limitations on the parameters are present: see this thread and the documentation.
If you are training a Reinforcement Learning agent, activate Experience Replay and use a memory buffer for the experiences which is at least 10x your batch_size. That way, you will trigger NN training only when a big chunk of data is ready.
Deactivate as much verbosity as possible.
If everything is set up correctly, you should be able to train your model faster with GPU than with CPU.
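For the float16 point above, a short sketch of turning on mixed precision in tf.keras (available in TF 2.4+; the layer sizes are arbitrary):

import tensorflow as tf

# Compute in float16 where possible; variables are kept in float32 internally.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

inputs = tf.keras.layers.Input(shape=(100,))
# Widths that are multiples of 8 help Tensor Cores, as noted in the list above.
x = tf.keras.layers.Dense(256, activation='relu')(inputs)
# Keep the final layer in float32 for numerical stability.
outputs = tf.keras.layers.Dense(2, dtype='float32')(x)
model = tf.keras.Model(inputs, outputs)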
A GPU is good if you have compute-intensive tasks (large models), because there is an overhead for copying your data and results between the host and the GPU. In your case, the model is very small. That means it takes longer to copy the data than to run the prediction. Even if the CPU is slower than the GPU at the computation itself, you don't have to copy the data, so it's ultimately faster.
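A rough way to check this on your own hardware is to build the same tiny model under both device placements and time a warm prediction. A sketch (timings are only indicative; with TensorFlow's default soft placement, '/GPU:0' falls back to the CPU if no GPU is present):

import time
import numpy as np
import tensorflow as tf

def build_model():
    inputs = tf.keras.layers.Input(shape=(100,), dtype='float32')
    outputs = tf.keras.layers.Dense(2)(inputs)
    return tf.keras.Model(inputs, outputs)

data = np.random.rand(100, 100).astype('float32')

def time_predict(device):
    with tf.device(device):
        model = build_model()
        model.predict_on_batch(data)          # warm-up: tracing and first copies
        start = time.perf_counter()
        model.predict_on_batch(data)
        return time.perf_counter() - start

print('GPU placement:', time_predict('/GPU:0'))
print('CPU placement:', time_predict('/CPU:0'))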

How to make TensorFlow load the GPU more heavily?

I have the TensorFlow 1.4 GPU version installed. It successfully detects my GPU and uses it while training and evaluating. I have a GeForce 1050 Ti with 4 GB of memory.
But I cannot get the GPU load higher than 12-15% (more usually 5-6%). Meanwhile I get a high CPU load and a pretty slow training process.
I have tested many different examples of different NNs (RNN, LSTM, CNN, GAN, etc.) with plain TensorFlow and with Keras using TF as the backend, but the result is the same.
I found that increasing the batch size helps load the GPU more, but batch size also affects training itself, so I can't increase it beyond certain limits.
So how can I use the GPU at maximum load and speed up NN training?
If you are using Keras on Ubuntu, you can use multiprocessing and increase the number of workers. If you use a batch generator, you can also increase the queue size limit, depending on how much system RAM you have:
model.fit_generator(..., max_queue_size=24, ..., workers=2, use_multiprocessing=True, ...)
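In newer TensorFlow versions (2.4+), an alternative to fit_generator workers is a tf.data pipeline that builds and shuffles batches on the CPU in parallel and prefetches them, so the GPU is not left waiting for input. A sketch with placeholder in-memory data:

import numpy as np
import tensorflow as tf

# Placeholder in-memory data standing in for your real dataset.
x = np.random.rand(50_000, 64).astype('float32')
y = np.random.randint(0, 10, size=(50_000,))

# Batches are prepared on the CPU and prefetched ahead of the training step,
# which helps keep the GPU busy.
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(10_000)
           .batch(256)
           .prefetch(tf.data.AUTOTUNE))

# model.fit(dataset, epochs=10)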