Training TensorFlow model with summary operations is much slower than without summary operations - tensorflow

I am training an Inception-like model using TensorFlow r1.0 with GPU Nvidia Titan X.
I added some summary operations to visualize the training procedure, using code as follows:
def variable_summaries(var):
"""Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
with tf.name_scope('summaries'):
mean = tf.reduce_mean(var)
tf.summary.scalar('mean', mean)
with tf.name_scope('stddev'):
stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
tf.summary.scalar('stddev', stddev)
tf.summary.scalar('max', tf.reduce_max(var))
tf.summary.scalar('min', tf.reduce_min(var))
tf.summary.histogram('histogram', var)
When I run these operations, the time cost of training one epoch is about 400 seconds. But when I turn off these operations, the time cost of training one epoch is just 90 seconds.
How to optimize the graph to minimize the summary operations time cost?

Summaries of course slow down the training process, because you do more operations and you need to write them to disc. Also, histogram summaries slow the training even more, because for histograms you need more data to be copied from GPU to CPU than for scalar values.
So I would try to use histogram logging less often than the rest, that could make some difference.
The usual solution is to compute summaries only every X batches. Since you compute summaries only one per epoch and not every batch, it might be worth trying even less summaries logging.
Depends on how many batches you have in your dataset, but usually you don't lose much information by gathering a bit less logs.

Related

Multi-GPU training does not reduce training time

I have tried training three UNet models using keras for image segmentation to assess the effect of multi-GPU training.
First model was trained using 1 batch size on 1 GPU (P100). Each training step took ~254ms. (Note it is step, not epoch).
Second model was trained using 2 batch size using 1 GPU (P100). Each training step took ~399ms.
Third model was trained using 2 batch size using 2 GPUs (P100). Each training step took ~370ms. Logically it should have taken the same time as the first case, since both GPUs process 1 batch in parallel but it took more time.
Anyone who can tell whether multi-GPU training results in reduced training time or not? For reference, I tried all the models using keras.
I presume that this is due to the fact that you use a very small batch_size; in this case, the cost of distributing the gradients/computations over two GPUs and fetching them back (as well as CPU to GPU(2) data distribution) outweigh the parallel time advantage that you might gain versus the sequential training(on 1 GPU).
Expect to see a bigger difference for a batch size of 8/16 for instance.

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard to me to get my head wraped around the TPU terminology like train loops, iterations per loop and so on. I read this but Im still confused.
Also how can I benchmark the time for iterations per loop for example.
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
# The TPU will run `iterations_per_loop` training iterations before returning to the host
for i in range(0, iterations_per_loop):
# on every iteration, the TPU will compute `train_batch_size` examples,
# calculating the gradient from every example in the given batch
compute(examples[i * train_batch_size : (i + 1) * train_batch_size])
# assume each entry in `example` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop)
process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
train_batch_size,
iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking from 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

Is there a way to run each segment of a batch on different GPU?

So I am very new to Tensorflow and GPUs and I was wondering if I can feed different segments of my batch to different GPUs and aggregate the result at the end. What I mean is that, let's say that batch size in each epoch of my training is 1600 and I have 4 GPUs. Can I feed batches of size 400 to each GPU during each epoch of training and then aggregate the result?
You can do that. You will have to perform multi-gpu training.
Though in TensorFlow you can do a tower-based design where you collect and aggregate the gradients from each tower before backpropagation, it is not so simple and efficient.
You should use horovod which is easy and efficient.

Time taken to train Resnet on CIFAR-10

I was writing a neural net to train Resnet on CIFAR-10 dataset.
The paper Deep Residual Learning For Image Recognition mentions training for around 60,000 epochs.
I was wondering - what exactly does an epoch refer to in this case? Is it a single pass through a minibatch of size 128 (which would mean around 150 passes through the entire 50000 image training set?
Also how long is this expected to take to train(assume CPU only, 20-layer or 32-layer ResNet)? With the above definition of an epoch, it seems it would take a very long time...
I was expecting something around 2-3 hours only, which is equivalent to about 10 passes through the 50000 image training set.
The paper never mentions 60000 epochs. An epoch is generally taken to mean one pass over the full dataset. 60000 epochs would be insane. They use 64000 iterations on CIFAR-10. An iteration involves processing one minibatch, computing and then applying gradients.
You are correct in that this means >150 passes over the dataset (these are the epochs). Modern neural network models often take days or weeks to train. ResNets in particular are troublesome due to their massive size/depth. Note that in the paper they mention training the model on two GPUs which will be much faster than on the CPU.
If you are just training some models "for fun" I would recommend scaling them down significantly. Try 8 layers or so; even this might be too much. If you are doing this for research/production use, get some GPUs.

Training GPU on multiple minibatches in parallel with TensorFlow

I am using TensorFlow 1.9, on an NVIDIA GPU with 3 GB of memory. The size of my minibatch is 100 MB. Therefore, I could potentially fit multiple minibatches on my GPU at the same time. So my question is about whether this is possible and whether it is standard practice.
For example, when I train my TensorFlow model, I run something like this on every epoch:
loss_sum = 0
for batch_num in range(num_batches):
batch_inputs = get_batch_inputs()
batch_labels = get_batch_labels()
batch_loss, _ = sess.run([loss_op, train_op], feed_dict={inputs: batch_inputs, labels: batch_labels})
loss_sum += batch_loss
loss = batch_loss / num_batches
This iterates over my minibatches and performs one weight update per minibatch. But the size of image_data and label_data is only 100 MB, so the majority of the GPU is not being used.
One option would be to just increase the minibatch size so that the minibatch is closer to the 3 GB GPU capacity. However, I want to keep the same small minibatch size to help with optimisation.
So the other option might be to send multiple minibatches through the GPU in parallel, and perform one weight update per minibatch. Being able to send the minibatches in parallel would significantly reduce the training time.
Is this possible and recommended?
The goal of the Mini Batch approach is to update the weights of your network after each batch is processed and use the updated weights in the next mini-batch. If you do some clever tricks and batch several mini-batches they would effectively use the same old weights.
The only potential benefit I can see is if the model works better with bigger mini-batches, e.g. big_batches * more_epochs is better than mini_batches * less_epochs. I don't remember the theory behind Mini Batch Gradient Descent but I remember there is a reason you should use mini batches instead of the whole training set for each iteration. On the other hand, the mini batch size is a hyperparameter that has to be tuned anyway, so it's probably worth fiddling it a bit.
Thought I might point out that, arbitrarily making the batch size large (when you have large amounts of memory) can be bad sometimes in terms of the generalization of your model.
Reference:
Train longer, generalize better
On Large-Batch Training for Deep Learning.