Training GPU on multiple minibatches in parallel with TensorFlow - tensorflow

I am using TensorFlow 1.9, on an NVIDIA GPU with 3 GB of memory. The size of my minibatch is 100 MB. Therefore, I could potentially fit multiple minibatches on my GPU at the same time. So my question is about whether this is possible and whether it is standard practice.
For example, when I train my TensorFlow model, I run something like this on every epoch:
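loss_sum = 0
for batch_num in range(num_batches):
    batch_inputs = get_batch_inputs()
    batch_labels = get_batch_labels()
    # One forward/backward pass and one weight update per minibatch
    batch_loss, _ = sess.run([loss_op, train_op], feed_dict={inputs: batch_inputs, labels: batch_labels})
    loss_sum += batch_loss
# Mean loss over the epoch (uses the accumulated sum, not the last batch's loss)
loss = loss_sum / num_batches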
This iterates over my minibatches and performs one weight update per minibatch. But the size of batch_inputs and batch_labels is only 100 MB, so most of the GPU memory is not being used.
One option would be to just increase the minibatch size so that the minibatch is closer to the 3 GB GPU capacity. However, I want to keep the same small minibatch size to help with optimisation.
So the other option might be to send multiple minibatches through the GPU in parallel, and perform one weight update per minibatch. Being able to send the minibatches in parallel would significantly reduce the training time.
Is this possible and recommended?

The goal of the mini-batch approach is to update the weights of your network after each batch is processed and to use the updated weights for the next mini-batch. If you use some clever trick to process several mini-batches at once, they would all effectively use the same old weights.
The only potential benefit I can see is if the model works better with bigger mini-batches, e.g. big_batches * more_epochs is better than mini_batches * less_epochs. I don't remember the theory behind mini-batch gradient descent, but I remember there is a reason to use mini-batches instead of the whole training set for each iteration. On the other hand, the mini-batch size is a hyperparameter that has to be tuned anyway, so it's probably worth fiddling with it a bit.
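To make the "same old weights" point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: a hypothetical one-parameter least-squares model, made-up data, and plain SGD rather than the asker's network or optimizer. It suggests that applying gradients which were all computed from the same starting weights gives the same result as one step on the merged batch with a proportionally larger learning rate, and a different result from truly sequential mini-batch updates.

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def sequential_sgd(w, batches, lr):
    # Standard mini-batch SGD: each batch sees the freshly updated weight.
    for batch in batches:
        w = w - lr * grad(w, batch)
    return w

def stale_parallel_sgd(w, batches, lr):
    # "Parallel" variant: every gradient is computed from the same old weight.
    grads = [grad(w, batch) for batch in batches]
    for g in grads:
        w = w - lr * g
    return w

def big_batch_sgd(w, batches, lr):
    # One step on the merged batch, learning rate scaled by the batch count.
    merged = [example for batch in batches for example in batch]
    return w - lr * len(batches) * grad(w, merged)

# Hypothetical data: two equally sized mini-batches of (x, y) pairs.
batches = [[(1.0, 2.0), (2.0, 4.1)], [(3.0, 5.9), (4.0, 8.2)]]
w_seq = sequential_sgd(0.0, batches, lr=0.01)
w_stale = stale_parallel_sgd(0.0, batches, lr=0.01)
w_big = big_batch_sgd(0.0, batches, lr=0.01)
print(w_seq, w_stale, w_big)
```

With equally sized batches, the second and third values agree up to floating-point rounding, while the first one differs. For plain SGD, running mini-batches in parallel on stale weights therefore behaves like a larger batch, which is the option the question wanted to avoid.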

I thought I might point out that arbitrarily making the batch size large (when you have large amounts of memory) can sometimes hurt the generalization of your model.
Reference:
Train longer, generalize better
On Large-Batch Training for Deep Learning.

Related

tf.distribute.MirroredStrategy - suggestion for improving test mean_iou for segmentation network using distributed training

I am using TensorFlow 2.5.0 and have implemented a semantic segmentation network. I used the DeepLab_v3_plus network with a ResNet101 backbone, the Adam optimizer and categorical cross-entropy loss to train the network. I first built the code for a single GPU and achieved a test accuracy (mean_iou) of 54% after training for 96 epochs. Then I added tf MirroredStrategy (one machine) to the code to support multi-GPU training. Surprisingly, with 2 GPUs and training for 48 epochs, the test mean_iou is just 27%, and with 4 GPUs and training for 24 epochs, the test mean_iou is around 12% on the same dataset.
Changes I made to the code to go from single-GPU to multi-GPU training:
Following the TensorFlow blog on distributed training, I created a mirrored strategy and placed model creation, model compilation and the dataset_generator inside the strategy scope. As I understand it, model.fit() will then take care of synchronizing gradients and distributing data to each GPU for training. Although the code ran without any error and the training time was lower than on a single GPU for the same number of training images, the test mean_iou kept getting worse with more GPUs.
I replaced BatchNormalization with SyncBatchNormalization, but saw no improvement.
I used a warmup learning rate with linear scaling of the learning rate by the number of GPUs, but saw no improvement.
In the cross-entropy loss, I tried both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE.
loss = ce(y_true, y_pred)
# reshape loss for each sample (BxHxWxC -> BxN)
# Normalize loss by number of non zero elements and sum for each sample and mean across all samples.
With the .AUTO/.NONE options, I am not scaling the loss by global_batch_size, on the understanding that tf will take care of it and that I am already normalizing on each GPU. Neither option helped.
I changed data_generator to a tf.data.Dataset object. Although this helped with training time, the test mean_iou became even worse.
I would appreciate any lead or suggestion for improving the test mean_iou in distributed training.
Let me know if you need any additional details.
Thank you

Time taken to train Resnet on CIFAR-10

I was writing a neural net to train Resnet on CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering - what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50000 image training set)?
Also, how long is this expected to take to train (assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
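The conversion from iterations to passes over the dataset is simple arithmetic. A quick check, assuming the minibatch size of 128 and the 50000-image training set mentioned in the question:

```python
iterations = 64000      # training iterations on CIFAR-10, as stated above
batch_size = 128        # minibatch size from the question
train_images = 50000    # training set size from the question

images_seen = iterations * batch_size
epochs = images_seen / train_images
print(epochs)  # 163.84
```

So under these assumptions, 64000 iterations correspond to roughly 164 passes over the training set.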
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.

Training TensorFlow model with summary operations is much slower than without summary operations

I am training an Inception-like model using TensorFlow r1.0 with GPU Nvidia Titan X.
I added some summary operations to visualize the training procedure, using code as follows:
def variable_summaries(var):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
When I run these operations, the time cost of training one epoch is about 400 seconds. But when I turn off these operations, the time cost of training one epoch is just 90 seconds.
How to optimize the graph to minimize the summary operations time cost?
Summaries inevitably slow down training, because you run more operations and have to write their results to disk. Histogram summaries slow training even more, because histograms require more data to be copied from the GPU to the CPU than scalar values do.
So I would try logging histograms less often than the other summaries; that could make some difference.
The usual solution is to compute summaries only every X batches. Since you compute summaries only once per epoch and not every batch, it might be worth trying to log even fewer summaries.
It depends on how many batches you have in your dataset, but you usually don't lose much information by gathering slightly fewer logs.
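The "every X batches" idea, with histograms logged less often than scalars, can be sketched as a small scheduling helper. This is only an illustration in plain Python: the function name summaries_due and the intervals 100 and 1000 are assumptions, not values from the question or from TensorFlow.

```python
def summaries_due(step, scalar_every=100, histogram_every=1000):
    """Decide which (hypothetical) summary groups to run at this step."""
    return {
        "scalars": step % scalar_every == 0,
        "histograms": step % histogram_every == 0,
    }

# Count how often each group would run over 5000 training steps.
counts = {"scalars": 0, "histograms": 0}
for step in range(5000):
    due = summaries_due(step)
    # In a TF 1.x loop, this is where a summary op would be added to the
    # sess.run() fetches only when it is due, instead of on every step.
    for name, is_due in due.items():
        counts[name] += is_due
print(counts)  # {'scalars': 50, 'histograms': 5}
```

Keeping scalar and histogram summaries in separate groups lets the expensive histograms run on their own, sparser schedule.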

GPU + CPU Tensorflow Training

Setup
I have a network, one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (max 15000 vectors), which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store this embed_mat on the GPU due to its size, I define it under a CPU device (say /cpu:0), but the rest of the parameters of the model, the optimizer etc. are defined under a GPU device (say /gpu:0).
Questions
I see that my GPU usage is minimal (200 MB), which suggests that all my training is happening on the CPU. I expected the result of the embedding_lookup to be copied to the GPU and all my training to happen there. Am I doing something wrong?
The training time is strongly affected by the size (num_vectors) of the embedding matrix, which doesn't seem right to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should grow sub-linearly, if at all, with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat across multiple GPUs for faster training?
I suspect the Adam optimizer is responsible. It looks like all training is happening on the CPU because embed_mat is on the CPU. Is this correct?
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.

Regarding the use of tf.train.shuffle_batch() to create batches

The TensorFlow tutorial gives the following example of tf.train.shuffle_batch():
# Creates batches of 32 images and 32 labels.
image_batch, label_batch = tf.train.shuffle_batch(
    [single_image, single_label],
    batch_size=32,
    num_threads=4,
    capacity=50000,
    min_after_dequeue=10000)
I am not very clear about the meaning of capacity and min_after_dequeue. In this example, they are set to 50000 and 10000 respectively. What is the logic behind this kind of setup, and what do the values mean? If the input has 200 images and 200 labels, what will happen?
The tf.train.shuffle_batch() function uses a tf.RandomShuffleQueue internally to accumulate batches of batch_size elements, which are sampled uniformly at random from the elements currently in the queue.
Many training algorithms, such as the stochastic gradient descent–based algorithms that TensorFlow uses to optimize neural networks, rely on sampling records uniformly at random from the entire training set. However, it is not always practical to load the entire training set in memory (in order to sample from it), so tf.train.shuffle_batch() offers a compromise: it fills an internal buffer with between min_after_dequeue and capacity elements, and samples uniformly at random from that buffer. For many training processes, this improves the accuracy of the model and provides adequate randomization.
The min_after_dequeue and capacity arguments have an indirect effect on training performance. Setting a large min_after_dequeue value will delay the start of training, because TensorFlow has to process at least that many elements before training can start. The capacity is an upper bound on the amount of memory that the input pipeline will consume: setting this too large may cause the training process to run out of memory (and possibly start swapping, which will impair the training throughput).
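The roles of the two arguments can be illustrated with a toy buffer in plain Python. This is a simplified, single-threaded sketch and not TensorFlow's implementation; the class ToyShuffleBuffer and its methods are hypothetical names. It only models the two constraints described above: capacity limits how many elements are held at once, and a dequeue is allowed only if at least min_after_dequeue elements remain afterwards.

```python
import random

class ToyShuffleBuffer:
    """Toy model of a shuffle queue; not TensorFlow's actual implementation."""

    def __init__(self, capacity, min_after_dequeue, seed=0):
        if not 0 <= min_after_dequeue < capacity:
            raise ValueError("need 0 <= min_after_dequeue < capacity")
        self.capacity = capacity
        self.min_after_dequeue = min_after_dequeue
        self._items = []
        self._rng = random.Random(seed)

    def can_enqueue(self):
        # capacity bounds how many elements are held in memory at once.
        return len(self._items) < self.capacity

    def can_dequeue(self):
        # A dequeue must leave at least min_after_dequeue elements behind,
        # so that each sample is drawn from a reasonably large pool.
        return len(self._items) > self.min_after_dequeue

    def enqueue(self, item):
        if not self.can_enqueue():
            raise RuntimeError("buffer is full")
        self._items.append(item)

    def dequeue(self):
        if not self.can_dequeue():
            raise RuntimeError("too few elements to dequeue")
        # Pick a uniformly random element and remove it in O(1).
        i = self._rng.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()


buf = ToyShuffleBuffer(capacity=10, min_after_dequeue=5)
for x in range(5):
    buf.enqueue(x)
print(buf.can_dequeue())   # False: only 5 elements, none can be spared yet
buf.enqueue(5)
print(buf.can_dequeue())   # True: 6 elements, one can leave and 5 remain
for x in range(6, 10):
    buf.enqueue(x)
print(buf.can_enqueue())   # False: the buffer has reached its capacity
```

In the real queue, a blocked producer or consumer waits instead of raising an error, which is why a large min_after_dequeue delays the first batch.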
If the dataset has only 200 images, it would easily be possible to load the entire dataset in memory. tf.train.shuffle_batch() would be quite inefficient, because it would enqueue each image and label multiple times in the tf.RandomShuffleQueue. In this case, you may find it more efficient to do the following instead, using tf.train.slice_input_producer() and tf.train.batch():
random_image, random_label = tf.train.slice_input_producer([all_images, all_labels],
                                                            shuffle=True)
image_batch, label_batch = tf.train.batch([random_image, random_label],
                                          batch_size=32)