Why does training time not decrease when increasing the batch size beyond a certain amount while training a Keras model? - tensorflow2.0

I am currently training an NLP model in Keras with TF 2.8, experimenting by adding GRU and LSTM layers. When I train the model, I use different batch sizes to see the impact they have on accuracy and on overall training time.
What I noticed is that after increasing the batch size beyond a certain amount, the training time does not decrease any further; it stays the same.
I started with a batch size of 2 and then gradually doubled it up to 4096, yet after 512 the training time remained the same.

It's often wrongly claimed that batch learning is as fast as or faster than on-line training. In fact, batch learning changes the weights only once the complete set of data (the batch) has been presented to the network. Therefore, the weight-update frequency is rather low. This explains why the processing speed in your measurements behaves the way you observed.
Even if it's a matrix operation, each row-column multiplication might happen on one GPU core. So the full matrix multiplication is divided across as many cores as possible. For one matrix multiplication, each GPU core takes some time, and when you add more images (more rows), that time increases. If, at a batch size of 4, your GPU is already at full capacity, i.e. all cores are busy, then increasing the batch size will not give any advantage. The extra data just sits in GPU memory and is processed once a core becomes free from the previous operation.
To get a deeper understanding of these training regimes, have a look at the 2003 paper "The general inefficiency of batch training for gradient descent learning", which compares batch and on-line learning.
Also, RNN kernels generally have O(timesteps) complexity, so batch size has a smaller effect on their runtime than you might anticipate.
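For reference, here is a minimal sketch of the kind of timing sweep described in the question (the toy model and data are hypothetical stand-ins): once the GPU is saturated, the per-epoch time flattens out even as the batch size keeps doubling.

```python
import time
import tensorflow as tf

# Hypothetical toy data and model; the point is only the timing loop.
x = tf.random.normal((50000, 64))
y = tf.random.uniform((50000,), maxval=2, dtype=tf.int32)

for batch_size in [32, 64, 128, 256, 512, 1024, 2048, 4096]:
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size:5d}  epoch_time={elapsed:.2f}s")
```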

Related

What are the maximum and minimum batch sizes we can use in the .fit() method?

I want to see the effect of batch size on generalization, for which I want to run my .fit() method with all the possible batch sizes.
But I was wondering what the constraints on choosing a batch size might be.
What does it depend on: the machine? The dataset?
Any help is highly appreciated.
It depends on the size of each sample and on your GPU memory (if you're using one), otherwise your RAM. Keep in mind that various other things are loaded into memory as well, like the model's parameters, the graph, etc. But strictly for the size of a batch: NUM_SAMPLES * SIZE_OF_SAMPLE.
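As a rough illustration of that NUM_SAMPLES * SIZE_OF_SAMPLE estimate (the shapes below are hypothetical), this is how you might size a single batch before accounting for parameters, activations, and optimizer state:

```python
import numpy as np

# Hypothetical example: 512 sequences of 200 timesteps x 300 float32 features.
batch_size = 512
sample_shape = (200, 300)
bytes_per_value = np.dtype("float32").itemsize  # 4 bytes

batch_bytes = batch_size * int(np.prod(sample_shape)) * bytes_per_value
print(f"raw input batch: {batch_bytes / 1024**2:.1f} MiB")  # ~117 MiB
```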
The batch size you choose is affected by several factors:
Resources - You need to choose a batch size small enough to fit inside your CPU / GPU RAM.
Normalization - If you use BatchNorm you should probably use a large batch size, as the BatchNorm layers learn the mean and variance of your batch. The smaller the batches are, the larger the deviation between them will be.
Personally, I usually use the largest batch size my resources allow. If the feasible batch size is small (<16), I swap BatchNorm for other normalization methods such as LayerNorm / InstanceNorm.
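A minimal sketch of that swap, assuming a TF 2.x / tf.keras setup (the layer sizes are arbitrary): BatchNorm when the batch is large enough, LayerNorm otherwise.

```python
import tensorflow as tf

def dense_block(units, small_batch=False):
    # LayerNorm does not depend on batch statistics, so it behaves well
    # even when the feasible batch size is very small (< 16).
    norm = (tf.keras.layers.LayerNormalization()
            if small_batch else
            tf.keras.layers.BatchNormalization())
    return tf.keras.Sequential([
        tf.keras.layers.Dense(units),
        norm,
        tf.keras.layers.Activation("relu"),
    ])

model = tf.keras.Sequential([
    dense_block(128, small_batch=True),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```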
The machine's memory.
The training batch size has a huge impact on the required GPU memory for training a neural network.
The GPU memory holds the parameters, the optimizer's variables, intermediate calculations, and workspace variables. So, the larger the batch size, the more samples are propagated through the neural network in the forward pass. This results in larger intermediate calculations (e.g. layer activation outputs) that need to be stored in GPU memory. Technically speaking, the size of the activations is linearly dependent on the batch size.
You can use some workarounds to get past this limitation:
Data-parallelism — use multiple GPUs to train all mini-batches in parallel, each on a single GPU
Gradient accumulation — run the mini-batches sequentially while accumulating the gradients. The accumulated results are used to update the model variables at the end of the last mini-batch, as sketched below.
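A minimal sketch of the gradient-accumulation idea, assuming TF 2.x with a custom training loop (the model, data, and ACCUM_STEPS value are hypothetical): several small micro-batches are run sequentially and their gradients summed before a single weight update.

```python
import tensorflow as tf

# Hypothetical tiny setup so the sketch runs end to end.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal((256, 32))
y = tf.random.uniform((256,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)  # micro-batches of 8

ACCUM_STEPS = 4  # effective batch size = 8 * 4 = 32
accum_grads = None

for step, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        # Divide so the accumulated sum matches the gradient of the mean loss.
        loss = loss_fn(yb, model(xb, training=True)) / ACCUM_STEPS
    grads = tape.gradient(loss, model.trainable_variables)
    if accum_grads is None:
        accum_grads = [tf.zeros_like(g) for g in grads]
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]

    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(g) for g in grads]
```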

Improving cross-validation throughput in TensorFlow, Keras

I am working on CNN models which are intended to predict a protein's structure from its amino acid sequence. I am implementing my CNNs in Keras. The Keras API is the one that comes bundled with TensorFlow 1.4.0, so obviously TensorFlow is my backend. I have installed the GPU version of TensorFlow, and I have verified that the GPU is being used. My GPU is somewhat older, an NVIDIA GTX 760.
When I perform 3X cross-validation to help select architectures and hyperparameters, I have 50K examples in my training folds and 25K samples in my validation folds. These are decently large data sets, however they're small in comparison to the RAM available in my computer (16 GB) or on my GPU (2 GB). Fully unpacked and expressed as float32 values, with redundancy introduced because of sliding windows, all the folds taken together, input plus target values, occupy 316 MB. I have pre-calculated my folds and saved each fold to disk. When I experiment with architectures and hyperparameters, the same folds are used in every trial.
I started with networks containing a single hidden layer to see what I could achieve, and then switched to two hidden layers. I used a fixed batch size of 64 for all of my early experiments. Training proceeded quickly enough that I didn't concern myself with speed. Performing a 3X cross-validation for a given architecture typically took about 12 minutes.
But in the last experiment that I did with two-layer networks, I decided to start investigating the effect of batch size. I learned that smaller batch sizes gave me better results, up to a point. Batch sizes of 8 were the smallest ones that I could count on not to crash. My loss values will occasionally flip to NaN with batch sizes of 4, and they will frequently flip to NaN with batch sizes of 1 or 2. After that occurs, the network becomes untrainable. I am aware of the possibility of gradient instability. I think I was getting some.
So why not just use batch sizes of 8 and keep going? The problem is speed. Using two hidden layers, batches of eight took me approximately 35 minutes to cross-validate. Batches of 64, as I mentioned above, took one third that much time. My first experiments with three hidden layers have taken 45 to 65 minutes per trial. And I want to investigate potentially hundreds of architectures and hyperparameters, using still deeper networks. With small batches, I can see that the batch-by-batch progress bar in Keras progresses more slowly. I can see much longer pauses when an epoch ends.
Yes, I can upgrade my GPU to a 10 series. I think that will only double my throughput at most? Yes, I can rent GPU time in the cloud. Eventually I might do that. But if my software is inefficient, I definitely don't want to set it loose in the cloud to burn my money.
It is my understanding (please correct me if I am wrong) that when the GPU is used in a normal TF / Keras workflow, each individual batch is sent separately from the CPU to the GPU. If I am training 50 networks in a 3X cross-validation scheme, this would mean that I'm sending the same data to my GPU 150 times. As I mentioned earlier, all my data occupies at most 316 MB, about 15% of the RAM available on the GPU. Can I devise a workflow which sends this 316 MB to the GPU once, and if so, will that have a useful impact on my throughput? Intuitively, it feels like it should.
Are there other bottlenecks I should be thinking about? Is there a way to profile TF or Keras operations?
Thanks for any advice you may have!
Okay. I know that you're more concerned about throughput from Keras and your hardware, but there are a few things I'd like to mention here:
smaller batch sizes gave me better results
Given your case, where the data is not huge, and assuming you're running the training for a fixed number of epochs (say 5), training with a smaller batch size is naturally expected to give you a slightly better result, as it means a higher number of back-prop steps overall compared to a larger batch size. If you're training for a fixed number of training steps instead, I don't know why this is happening.
loss values will occasionally flip to NaN with batch sizes of 4
Again, I'm assuming you're using batch normalization here with your CNNs. When using BN, it's never really recommended to use a small batch size like 2 or 4 (or even 8). One reason you may be seeing NaNs with small batch sizes is that if the current batch has low variance and you take the epsilon value too small, you can end up with very small values that lead to numerical instability further on. But more generally, this might be a case of gradient instability, as you mentioned. Consider using gradient clipping to see if it helps.
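A minimal sketch of what gradient clipping looks like in Keras (the model here is a hypothetical stand-in for the CNN described in the question; clipnorm is a standard Keras optimizer argument):

```python
from tensorflow import keras

# Hypothetical small CNN standing in for the protein-structure model.
model = keras.Sequential([
    keras.layers.Conv1D(32, 3, activation="relu", input_shape=(17, 21)),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(3, activation="softmax"),
])

# Clip the global gradient norm to 1.0 before each update.
optimizer = keras.optimizers.SGD(learning_rate=1e-3, momentum=0.9, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```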
GPU Workflow
Here, I assume that you have only one GPU, and unfortunately you can't parallelise across trials with a single GPU. To clarify, you shouldn't be concerned about the size of your data relative to the GPU RAM. In most single-GPU cases, the current batch stays on the CPU and the GPU only takes up the operations. Rather, you should be concerned about the size of the parameters the GPU has to compute. Since your 1-layer and 3-layer experiments involve very different operations, it isn't possible to run them together, as you can't place multiple ops on the same device simultaneously. The best option for you here would be to use a larger batch size (not too large, as this would reduce the number of back-prop steps when training for a fixed number of epochs), so that you cover more data in a single pass.
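On the throughput side, here is a sketch of keeping the per-fold input pipeline out of the way (this assumes a reasonably recent tf.data API rather than the TF 1.4 setup in the question, and the fold shapes are hypothetical): cache the pre-computed fold in host memory and prefetch batches so input preparation overlaps with GPU compute.

```python
import numpy as np
import tensorflow as tf

# Hypothetical windowed fold: 50k examples, 17-residue window, 21 features.
x_fold = np.random.rand(50000, 17, 21).astype("float32")
y_fold = np.random.randint(0, 3, size=(50000,)).astype("int32")

dataset = (tf.data.Dataset.from_tensor_slices((x_fold, y_fold))
           .cache()                       # materialize the fold once in host RAM
           .shuffle(10000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))   # overlap input prep with GPU compute

# The same `dataset` object can then be reused across every trial on this fold:
# model.fit(dataset, epochs=...)
```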
Just a tip for hyper-parameter tuning: consider using Highway CNNs. These are inspired by the gating mechanism of LSTMs; you specify a large number of hidden layers and the network figures out by itself how to control the information flow among the layers. In short, this practically eliminates the effort of tuning the network depth and lets you tune other hyper-parameters such as the learning rate or filter sizes.
I hope at least some of this is relevant and helpful to you ;)

How can I batch images with arbitrary sizes in TensorFlow?

In Caffe there actually seems to be a way to maintain the aspect ratio when resizing images so that the smaller dimension equals 500 or some other value.
But I cannot find any way to do this in TensorFlow.
In this paper, we can see that,
We implemented our model using the Caffe library [15] and optimized it using SGD with momentum. Training on the AVA dataset’s approximately 250k training images took 2 weeks on a single Nvidia M40 GPU. Although our network can train and evaluate with images of arbitrary dimensions, very large images drastically decrease training and evaluation speed and pose memory issues due to GPU memory constraints. Therefore, in practice we resize each image such that the smaller image dimension equaled 500, while maintaining the original aspect ratio. This resulted in significant loss of resolution in some cases, but is a significantly higher resolution than is typically used for convolutional networks. We used a batch size of 128, a learning rate of 10^-3, momentum of 0.9 and weight decay of 5 · 10^-4. We reduced the learning rate after every 20k iterations. The convolutional layers were pre-trained on ImageNet [6].
You can simply resize the images as they say, with your favorite library; for instance, you could use scipy.misc.imresize.
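If you want to stay inside TensorFlow instead, a minimal sketch of the same idea (resize so the smaller dimension equals 500 while keeping the aspect ratio; the input image here is hypothetical) might look like this:

```python
import tensorflow as tf

def resize_keep_aspect(image, target_min=500):
    shape = tf.cast(tf.shape(image)[:2], tf.float32)      # (height, width)
    scale = target_min / tf.reduce_min(shape)
    new_size = tf.cast(tf.round(shape * scale), tf.int32)
    return tf.image.resize(image, new_size)

image = tf.random.uniform((375, 1242, 3))                  # hypothetical input
resized = resize_keep_aspect(image)
print(resized.shape)                                       # (500, 1656, 3)
```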

Can Tensorflow saturate a GPU with a batch size of 1?

I'm running a modest 5-layer convolutional network in TensorFlow here.
If I have a large enough batch size I can pretty much saturate the GPU. I've implemented asynchronous loading with the tf.StagingArea, and I print a warning message if it's ever starved because it's waiting on data loading. I've confirmed that it is not waiting on the data loading at any point.
As I reduce my batch size I notice that the GPU utilization drops. When I have a batch size of 1 I get 35% utilization on a 1080 TI GPU. At a batch size of 20 I get 50% utilization, and so on. It's also less notable on slower GPUs.
I've tested it with a loop over nothing but the main sess.run call to ensure I don't have any other code slowing things down.
I've reviewed the TF high-performance models documentation, but don't note any reference to small batch sizes.
I wonder if anyone has any insight on improving the performance, other than "duh, just increase the batch size". I include batch size as a hyperparameter because it affects model regularization. I'd like to test with small batch sizes but still utilize the GPU fully if possible. Also, this generally happens when a single training step takes well under ~0.1 sec to process.

What is the advantage of doing a Multi-GPU training in TensorFlow?

In this TensorFlow tutorial, you can use N GPUs to distribute N mini-batches (each containing M training samples), one to each GPU, and calculate the gradients concurrently.
Then you average the gradients collected from the N GPUs and update the model parameters.
But this has the same effect as using a single GPU to calculate the gradients of N*M training samples and then updating the parameters.
So the only advantage seems to me is that you can use a larger-sized mini-batch in the same amount of time.
But is the larger-sized mini-batch necessarily better?
I thought you shouldn't use a large-sized mini-batch, in order to make the optimization more robust to saddle points.
If the larger-sized mini-batch is indeed not better, why would you care about Multi-GPU learning, or even Multi-server learning?
(The tutorial above is a synchronous training. If it was asynchronous training, then I can see the merit, since the parameters will be updated without averaging the gradients calculated by each GPU)
The main purpose of multi-GPU learning is to let you train on a large data set in a shorter time. It is not necessarily better with a larger mini-batch, but at least you can finish training in a more feasible time.
More precisely, those N mini-batches are not trained in a synchronized way if you use an asynchronous SGD algorithm. As the algorithm changes when using multiple GPUs, it is not equivalent to using an MxN-sized mini-batch on a single GPU with plain SGD.
If you use synchronous multi-GPU training, the benefit is mainly time reduction. You could use M/N-sized mini-batches to keep the effective mini-batch size the same, though scalability is of course limited, since a smaller per-GPU mini-batch size leads to more overhead. Data exchange and synchronization across a large number of computing nodes are also problematic.
Finally, to solve the scalability issue, people move to A-SGD when using a large number of GPUs concurrently. So you probably won't see anyone using synchronous multi-GPU training on hundreds (or even tens) of GPUs.
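For completeness, here is a sketch of what synchronous data-parallel training looks like with the modern tf.distribute API (a stand-in for the manual gradient-averaging in the tutorial, with a hypothetical toy model and data): each replica gets its own slice of the global batch and the gradients are combined before a single weight update.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()           # one replica per visible GPU
global_batch_size = 64 * strategy.num_replicas_in_sync  # M samples per GPU

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

x = tf.random.normal((10000, 32))
y = tf.random.uniform((10000,), maxval=10, dtype=tf.int32)
model.fit(x, y, batch_size=global_batch_size, epochs=1)
```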
More GPUs means more data in a batch, and the gradients of a batch are averaged for back-propagation.
If the learning rate per batch is fixed, then the effective learning rate per data point is smaller.
If the learning rate per data point is fixed, then the learning rate per batch is larger.
https://github.com/guotong1988/BERT-GPU
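A small arithmetic illustration of what this implies (the numbers are hypothetical); this is the commonly used linear scaling rule: if gradients are averaged over an N-times-larger batch, keeping the per-sample step size constant means scaling the batch learning rate by N.

```python
base_lr = 1e-4     # learning rate tuned for a single-GPU batch
num_gpus = 4       # batch becomes 4x larger, gradients still averaged
scaled_lr = base_lr * num_gpus
print(scaled_lr)   # 0.0004
```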