What is the advantage of doing multi-GPU training in TensorFlow?

In this TensorFlow tutorial, you can use N GPUs to distribute N mini-batches (each containing M training samples) across the GPUs and compute the gradients concurrently.
Then you average the gradients collected from the N GPUs and update the model parameters.
But this has the same effect as using a single GPU to compute the gradients of N*M training samples and then updating the parameters.
So the only advantage, it seems to me, is that you can process a larger mini-batch in the same amount of time.
But is the larger mini-batch necessarily better?
I thought you shouldn't use a large mini-batch, in order to make the optimization more robust to saddle points.
If the larger mini-batch is indeed not better, why would you care about multi-GPU learning, or even multi-server learning?
(The tutorial above uses synchronous training. If it were asynchronous training, I could see the merit, since the parameters would be updated without averaging the gradients calculated by each GPU.)
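(To make what I mean concrete, here is a minimal sketch of that pattern, per-GPU gradients followed by averaging, written against the TF 2.x eager API rather than the tutorial's graph code; the tiny model and the assumption of one visible GPU per mini-batch are just placeholders.)

```python
import tensorflow as tf

# Placeholder model, optimizer and loss; these stand in for whatever the tutorial builds.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def tower_gradients(x, y, device):
    # Each GPU computes gradients on its own mini-batch of M samples.
    with tf.device(device):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        return tape.gradient(loss, model.trainable_variables)

def sync_step(batches):
    # batches: list of (x, y) pairs, one per GPU (N mini-batches in total).
    # Note: in plain eager mode these calls run one after another; the tutorial
    # (and tf.distribute) is what actually makes them run concurrently.
    devices = [f"/GPU:{i}" for i in range(len(batches))]
    per_gpu = [tower_gradients(x, y, d) for (x, y), d in zip(batches, devices)]
    # Average the N gradient sets -- mathematically the same update as one
    # N*M-sample mini-batch on a single GPU.
    averaged = [tf.reduce_mean(tf.stack(grads), axis=0) for grads in zip(*per_gpu)]
    optimizer.apply_gradients(zip(averaged, model.trainable_variables))
```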

The main purpose of multi-GPU learning is to let you train on a large data set in a shorter time. It is not necessarily better to use a larger mini-batch, but at least you can finish training in a more feasible amount of time.
More precisely, those N mini-batches are not processed in a synchronized way if you use an asynchronous SGD algorithm. Since the algorithm changes when using multiple GPUs, it is not equivalent to using an N*M-sized mini-batch on a single GPU with plain SGD.
If you use synchronous multi-GPU training, the benefit is mainly time reduction. You could use an M/N-sized mini-batch per GPU to keep the effective mini-batch size unchanged, and of course scalability is limited, since a smaller per-GPU mini-batch leads to relatively more overhead. Data exchange and synchronization across a large number of computing nodes are also serious problems.
Finally, to address the scalability issue, people move to asynchronous SGD (A-SGD) when using a large number of GPUs concurrently. So you will probably not see synchronous multi-GPU training on hundreds (or even tens) of GPUs.
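For the synchronous case, the high-level route in current TensorFlow is tf.distribute.MirroredStrategy, which splits each global batch across the visible GPUs and all-reduces the gradients for you. A minimal sketch under those assumptions (the model, batch size and synthetic data below are placeholders):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Keep the effective (global) batch size fixed; each replica then sees
# global_batch / num_replicas samples per step.
global_batch = 256

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic stand-in data; substitute your real dataset.
x = tf.random.normal((10_000, 32))
y = tf.random.uniform((10_000,), maxval=10, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

model.fit(dataset, epochs=2)  # gradients are averaged across replicas each step
```

Keeping the global batch fixed while adding replicas is exactly the M/N arrangement described above: the math of the update stays the same, only the wall-clock time per step changes.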

More GPUs mean more data in a batch, and the gradients of a batch are averaged for back-propagation.
If the learning rate per batch is fixed, then the effective learning rate per individual sample becomes smaller.
If the learning rate per sample is fixed, then the effective learning rate per batch becomes larger.
https://github.com/guotong1988/BERT-GPU
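A toy back-of-the-envelope version of that trade-off, using the common linear-scaling heuristic (all numbers here are invented for illustration):

```python
# Suppose a learning rate tuned for a per-GPU batch of 32 on a single GPU.
base_lr = 0.01
per_gpu_batch = 32
num_gpus = 4

global_batch = per_gpu_batch * num_gpus   # 128 samples averaged per update
# If the loss (and hence the gradient) is averaged over the global batch,
# each individual sample's contribution shrinks by a factor of num_gpus.
# A common heuristic is therefore to scale the learning rate linearly
# to keep the per-sample step size roughly constant.
scaled_lr = base_lr * num_gpus            # 0.04

print(global_batch, scaled_lr)
```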

Related

Why does training time not reduce when training a Keras model after increasing the batch size beyond a certain amount?

I am currently training an NLP model in Keras with TF 2.8, experimenting with added GRU and LSTM layers. When I train the model, I use different batch sizes to see the impact they have on accuracy and overall training time.
What I noticed was that after increasing the batch size beyond a certain amount, the training time no longer decreases; it simply stays the same.
I started with a batch size of 2 and then slowly increased it up to 4096, trying multiples of two, yet after 512 the training time remained the same.
It is often wrongly claimed that batch learning is as fast as, or faster than, on-line training. In fact, batch learning changes the weights only once the complete set of data (the batch) has been presented to the network, so the weight-update frequency is rather low. This explains why the processing speed in your measurements behaves the way you observed.
Even though it is a matrix operation, each row-column multiplication may happen on one GPU core, so the full matrix multiplication is divided across as many cores as possible. Each core takes some time per multiplication, and when you add more images that time increases because there are more rows. If, at a batch size of 4, your GPU is already at full capacity, i.e. all cores are busy, then increasing the batch size will not give any advantage. The extra data just sits in GPU memory and is processed once a core becomes free from the previous operation.
To get a further understanding of these training techniques, have a look at the 2003 paper "The general inefficiency of batch training for gradient descent learning", which compares batch and on-line learning.
Also, more generally, RNN kernels can have O(timesteps) complexity, with batch size having a smaller effect than you might anticipate.
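One way to see the plateau directly is to time a single epoch at a few batch sizes; a rough sketch (the tiny GRU model and synthetic data below are stand-ins for the actual NLP model and corpus):

```python
import time
import tensorflow as tf

# Synthetic stand-in for the NLP data in the question.
x = tf.random.normal((20_000, 50, 64))
y = tf.random.uniform((20_000,), maxval=2, dtype=tf.int32)

model = tf.keras.Sequential([
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# We only care about wall-clock time per epoch here, not accuracy.
for batch_size in [32, 128, 512, 2048, 4096]:
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=1, verbose=0)
    print(f"batch_size={batch_size:5d}  epoch time={time.perf_counter() - start:.1f}s")
```

Typically the epoch time drops while the GPU still has idle capacity, then flattens once every step saturates it; beyond that point, bigger batches just make each step proportionally longer.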

Memory requirements for back propagation - why not use the mean activation?

I need help understanding the memory requirements of a neural network and how they differ between the training and evaluation processes. More specifically, the memory requirements of the training process (I am using the Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, a constant amount of memory is required for the weights themselves and the input data. During the forward pass the GPU needs an additional x*N units of memory (the exact amount is not crucial to the question) to pass all the samples through simultaneously and compute the activation of each neuron.
My question is about the back-propagation process: it seems to require an additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, this means the algorithm calculates the specific gradients of each sample and then sums them up for back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional memory required for training would only be (x+1)*N and not 2*x*N.
(*) This is based on my own little experiment on the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously this is a very simplified way of looking at the memory requirements.
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins, and at the difference between standard SGD and mini-batch SGD, it becomes clearer why.
The standard stochastic gradient descent algorithm passes a single sample through the model, then back-propagates its gradients and updates the model weights before repeating the process with the next sample. The main downside is that this is a serial process (you cannot run samples simultaneously, because each sample needs to run on a model that was already updated by the previous sample), so it is computationally very expensive. In addition, using just a single sample per update results in a very noisy gradient.
Mini-batch SGD uses the same principle, with one difference: the gradients are accumulated from multiple samples and an update is performed only once every x samples. This helps to obtain a smoother gradient during training and enables passing multiple samples through the model in parallel. This is the algorithm used when training with Keras/TensorFlow in mini-batches (commonly called batches, although that term strictly refers to batch gradient descent, which is a slightly different algorithm).
I have not found any work on computing the update from the mean activation in each layer, as you suggest. It would be interesting to check the results of such an algorithm: it would be more memory efficient, but it is likely to be less capable of reaching good minima.
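If you want to reproduce the kind of measurement described in the footnote, recent TensorFlow versions expose peak-GPU-memory counters. A rough sketch along those lines (assumes TF 2.5+ and a visible device named "GPU:0"; the small CNN and batch size are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="sgd",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

def peak_mb(fn):
    # Reset the peak-memory counter, run the step, and report the new peak in MB.
    tf.config.experimental.reset_memory_stats("GPU:0")
    fn()
    return tf.config.experimental.get_memory_info("GPU:0")["peak"] / 2**20

batch = 512
x = tf.random.normal((batch, 64, 64, 3))
y = tf.random.uniform((batch,), maxval=10, dtype=tf.int32)

# Forward pass only vs. forward + backward: the training step has to keep the
# per-sample activations alive for back-propagation, hence the gap.
print("inference peak MB:", peak_mb(lambda: model.predict(x, batch_size=batch)))
print("training  peak MB:", peak_mb(lambda: model.train_on_batch(x, y)))
```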

Selecting TensorFlow Object Detection API training hyperparameters

I am setting up an object detection pipeline based on the recently released TensorFlow Object Detection API, using the accompanying arXiv paper as guidance. I am looking to understand the points below for training on my own dataset.
It is not clear how they selected the learning rate schedules, or how that would change based on the number of GPUs available for training. How does the learning rate schedule change with the number of GPUs available for training? The paper mentions that 9 GPUs are used. How should I change the learning rate if I only want to use 1 GPU?
The released sample training config file for Pascal VOC using Faster R-CNN has an initial learning rate of 0.0001. This is 10x lower than what was published in the original Faster R-CNN paper. Is this due to an assumption about the number of GPUs available for training, or for a different reason?
When I start training from the COCO detection checkpoint, how should the training loss decrease? Looking at TensorBoard, the training loss on my dataset is low, between 0.8 and 1.2 per iteration (with a batch size of 1), across the various loss components. Is this expected behavior?
For questions 1 and 2: our implementation differs in a few small details compared to the original paper and internally we train all of our detectors with asynchronous SGD with ~10 GPUs. Our learning rates are calibrated for this setting (which you will also have if you decide to train via Cloud ML Engine as in the Pets walkthrough). If you use another setting, you will have to do a bit of hyperparameter exploration. For a single GPU, leaving the learning rate alone probably won't hurt performance, but you may be able to get faster convergence by increasing it.
For question 3: Training losses decrease erratically and you can only see the decrease if you smooth the plots quite a bit over time. Moreover, it's hard to explicitly say how well you are doing with respect to eval metrics just by looking at the training losses. I recommend looking at the mAP plots over time as well as the image visualizations to really get an idea of whether your model has "lifted off".
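If the raw curves are too noisy to judge, smoothing them the way TensorBoard's smoothing slider does (an exponential moving average) makes the trend much easier to see. A small sketch with made-up loss values:

```python
import random

def smooth(values, weight=0.9):
    # Exponential moving average, similar in spirit to TensorBoard's smoothing slider.
    smoothed, last = [], values[0]
    for v in values:
        last = weight * last + (1 - weight) * v
        smoothed.append(last)
    return smoothed

# Made-up noisy losses hovering around 0.8-1.2 with a slow downward drift.
raw = [1.0 + random.uniform(-0.2, 0.2) - 0.0003 * step for step in range(1000)]
print([round(v, 3) for v in smooth(raw)[::200]])
```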
Hope this helps.

Speeding up the Haar cascade training process

Training a Haar cascade takes a lot of time, and the overall training period depends on the machine.
What are the factors that contribute to speeding up the process?
Does having more RAM and a GPU help?
Does Haar cascade training have GPU support like TensorFlow does?
The OpenCV documentation states that:
LBP features yield integer precision in contrast to HAAR features, yielding floating point precision, so both training and detection with LBP are several times faster than with HAAR features. Regarding the LBP and HAAR detection quality, it mainly depends on the training data used and the training parameters selected. It's possible to train an LBP-based classifier that will provide almost the same quality as a HAAR-based one, within a percentage of the training time.
So while training, instead of using the old opencv_haartraining tool, use the opencv_traincascade tool with -featureType LBP as a parameter (the default is HAAR).
Also, you can use the -precalcValBufSize and -precalcIdxBufSize parameters to assign a specific amount of memory for training. The more memory you assign, the faster the training process, but keep in mind that -precalcValBufSize and -precalcIdxBufSize combined should not exceed your available system memory.
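Put together, a training invocation along those lines might look like the sketch below; the dataset paths, sample counts, window size and stage count are placeholders, and only -featureType, -precalcValBufSize and -precalcIdxBufSize come from the advice above:

```python
import subprocess

# Hypothetical paths and counts -- replace these with your own dataset values.
cmd = [
    "opencv_traincascade",
    "-data", "cascade_out/",
    "-vec", "positives.vec",
    "-bg", "negatives.txt",
    "-numPos", "1800", "-numNeg", "900", "-numStages", "15",
    "-w", "24", "-h", "24",
    "-featureType", "LBP",          # integer-precision features, much faster than HAAR
    "-precalcValBufSize", "2048",   # MB; raise these if you have spare RAM,
    "-precalcIdxBufSize", "2048",   #     but keep their sum below system memory
]
subprocess.run(cmd, check=True)
```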

TensorFlow RNN training 100% CPU while only using 60% GPU

I'm working on code that trains a relatively large RNN (a 128-cell LSTM and some added layers). The main process is maxing out one CPU core, and I'm wondering if this is normal or whether I can optimize it. During the training loop (session.run calls) it uses about 60-70% GPU load while keeping one CPU core at 100%. Note that the data sampling work is already being done concurrently on other cores, so this is just the updating of the model parameters. Is this normal for such applications in TensorFlow, or should the CPU load be much lower while the GPU runs at full capacity?
We don't have full documentation on it yet, but you can take a look at the profiling information to see if it gives you more of an idea of where the time is going:
https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659
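Concretely, the approach from that issue boils down to requesting a full trace for one session.run call and writing out a Chrome timeline. A self-contained TF 1.x sketch (the tiny stand-in graph below replaces your actual LSTM train_op):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Minimal stand-in graph; in your case this would be the LSTM model's train_op.
x = tf.placeholder(tf.float32, [None, 128])
w = tf.Variable(tf.zeros([128, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={x: [[0.0] * 128]},
             options=run_options, run_metadata=run_metadata)

# Write a Chrome trace; open it at chrome://tracing to see which ops run on
# the GPU and where the CPU-side time is spent.
trace = timeline.Timeline(run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
```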
I think an RNN cell has two inputs (the current time step and the previous hidden state), so it must wait for both of them during training; in other words, it is not as easy to parallelize as a CNN. You can use a bigger batch size to improve the GPU utilization rate, but that may cause other problems, as discussed in the paper On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.