Google Cloud ML Loss in Recall: Distributed Learning - tensorflow

I have two versions of a model trained on Google Cloud ML, one using 2 workers and one with just the master node. However, there is a significant drop in recall after training in distributed mode. I followed the provided sample code for around 2000 steps (the workers and the master both contribute to the step count).
Only Master
RECALL metrics: 0.352357320099
Accuracy over the validation set: 0.737576772753
Master and 2 Workers
RECALL metrics: 0.0223325062035
Accuracy over the validation set: 0.770519262982

The general idea to keep in mind is that as you increase the number of workers, you are also increasing your effective batch size (each worker is processing N examples per step).
To account for that, you'll need to adjust other hyperparameters. Try picking a smaller learning rate to reduce the amount of change per step. You may also need to increase the number of steps by some factor, depending on your model and data, to reach the same convergence.
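As a rough illustration (just a sketch; the base values and the divide-by-worker-count heuristic are assumptions, not something prescribed by Cloud ML):

# Hypothetical single-worker starting points.
base_learning_rate = 0.01
base_train_steps = 2000
num_workers = 3  # master + 2 workers, all contributing steps

# With more workers the effective batch per update grows, so one simple
# heuristic is to shrink the learning rate and stretch the step budget.
learning_rate = base_learning_rate / num_workers
train_steps = base_train_steps * num_workers

print(learning_rate, train_steps)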

Related

TPU terminology confusion

So I know how epochs, train steps, batch sizes and that kind of thing are defined, but it is really hard for me to get my head wrapped around the TPU terminology, like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before
    # returning to the host.
    for i in range(0, iterations_per_loop):
        # On every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch.
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# Assume each entry in `examples` is a single training example.
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b : b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
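A tiny numerical sketch of that last point (plain NumPy, not TPU code; the quadratic loss, learning rate, and data are made up for illustration):

import numpy as np

# Toy problem: fit w to targets t by minimizing the mean of (w - t)^2.
targets = np.random.RandomState(0).normal(3.0, 1.0, size=50000)
lr = 0.001

def run(batch_size):
    w = 0.0
    for start in range(0, len(targets), batch_size):
        batch = targets[start:start + batch_size]
        grad = np.mean(2.0 * (w - batch))  # gradient averaged over the batch
        w -= lr * grad
    return w

# 50 steps of 1000 examples vs. 1000 steps of 50 examples, over the same data:
print(run(1000))  # fewer, larger-batch updates: w ends up around 0.3
print(run(50))    # many more updates: w ends up much closer to 3.0

With the same learning rate and the same data, the run that takes more (smaller) steps ends up in a different place, which is the sense in which the batch size changes the result while iterations_per_loop does not.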
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
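As a rough sketch of that idea in TF 1.x graph code (this is not the Cloud TPU implementation; a counter variable stands in for a real training op):

import tensorflow as tf

iterations_per_loop = 100

# A counter standing in for "one training step"; a real model would run its
# optimizer op here instead.
step = tf.Variable(0, trainable=False, use_resource=True)

def cond(i):
    return i < iterations_per_loop

def body(i):
    with tf.control_dependencies([tf.assign_add(step, 1)]):
        return i + 1

loop = tf.while_loop(cond, body, [tf.constant(0)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(loop)         # a single round trip from the host...
    print(sess.run(step))  # ...but 100 "training steps" were executed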
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that, though; I would expect asynchronous background data transfers to be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using Tensorflow's object detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Part 1 to 6 of his series.
Now in his video, he mentions stopping the training when the loss value averages around ~1 or below, and says that it should take about 10,000-ish steps.
In my case, it is 7500 steps right now and the loss values keep fluctuating from 0.6 to 1.3.
A lot of people complained in the comment section about false positives in this series, but I think this happened because of unnecessarily prolonged training (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the most optimal weights, just fairly good ones, while avoiding false detections and overfitting. I am also watching the 'Total Loss' section of TensorBoard. It fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know, in general, which factors 'when to stop training' depends on. Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using the concept of transfer learning, I chose the ssd_mobilenet_v1 model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, compute the loss on both the training set and the validation set.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set. For example, the training set might be 70% of the data, with the validation and test sets 15% each.
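A minimal sketch of that early-stopping rule (plain Python; train_one_epoch, evaluate, and save_checkpoint are hypothetical stand-ins for your own training, evaluation, and checkpointing code):

max_epochs = 100
patience = 3            # stop after 3 epochs without improvement
best_val_loss = float("inf")
bad_epochs = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_set)           # hypothetical helper
    val_loss = evaluate(model, validation_set)  # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        bad_epochs = 0
        save_checkpoint(model)                  # keep the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving

# Finally, evaluate the saved best checkpoint once on the held-out test set.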
Also, please note that 300 images in your dataset does not seem like enough. You should increase it.
For your other question:
The loss is the sum of your errors and thus depends on the problem and your data. A loss of 1 does not mean much in this regard. Never rely on it to stop your training.

Multi GPU architecture, gradient averaging - less accurate model?

When I execute the cifar10 model as described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approximately 4 hours using a single GPU. When I utilize 2 GPUs the accuracy drops to 84%, but reaching 84% accuracy is faster on 2 GPUs than on 1.
My intuition is
that the average_gradients function as defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value, as an average of gradients will be less accurate than the actual gradient value.
If the gradients are less accurate, then the parameters that control the function learned during training are less accurate. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why is averaging the gradients over multiple GPUs less accurate than computing the gradient on a single GPU?
Is my intuition that averaging the gradients produces a less accurate value correct?
Randomness in the model is described as:
The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:
Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.
src: https://www.tensorflow.org/tutorials/deep_cnn
Does this have an effect on training accuracy?
Update:
Attempting to investigate this further, I compared the loss value when training with different numbers of GPUs.
Training with 1 GPU: loss value: 0.7, accuracy: 86%
Training with 2 GPUs: loss value: 0.5, accuracy: 84%
Shouldn't the loss value be lower for the higher accuracy, not vice versa?
In the code you linked, using the function average_gradient with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.
You can see it in the definition:
# Stack the per-GPU (per-tower) gradients along a new leading dimension,
# then average across that dimension to get a single gradient.
grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
Using a larger batch size (given the same number of epochs) can have any kind of effect on your results.
Therefore, if you want to do exactly equivalent (1) calculations in 1-GPU or 2-GPU cases, you may want to halve the batch size in the latter case. (People sometimes avoid doing it, because smaller batch sizes may also make the computation on each GPU slower, in some cases)
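A quick numerical check of that equivalence (plain NumPy with a simple squared-error loss averaged over each tower's examples; none of this comes from the CIFAR-10 code itself):

import numpy as np

rng = np.random.RandomState(0)
w = 1.5
x = rng.normal(size=256)                       # one "global" batch
y = 2.0 * x + rng.normal(scale=0.1, size=256)

def grad(w, xs, ys):
    # d/dw of mean((w * x - y)^2) over the given examples
    return np.mean(2.0 * (w * xs - ys) * xs)

# Two towers, each computing the gradient on half of the batch,
# then averaged as in average_gradients:
print(np.mean([grad(w, x[:128], y[:128]), grad(w, x[128:], y[128:])]))
# One GPU processing the whole batch of 256 at once:
print(grad(w, x, y))
# The two printed values are identical (up to floating point).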
Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like
print(sess.run(lr))
should work here.
(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
There is a decent discussion of this here (not my content). Basically, when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade-off.
[Zhang et al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and assuming accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate to a higher accuracy. The emphasis is important.
If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and then tests against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is lower".
Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.
Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.
I've also encountered this issue.
See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour from Facebook, which addresses the same issue. The suggested solution is simply to scale up the learning rate by k (after some reasonable warm-up epochs) for k GPUs.
In practice I've found that simply summing the gradients from the GPUs (rather than averaging them) and using the original learning rate sometimes does the job as well.
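A sketch of that linear-scaling-plus-warm-up schedule (the base learning rate, GPU count, and warm-up length are made-up values, not the paper's exact recipe):

# Scale the single-GPU learning rate by k, ramping up gradually at the start.
base_lr = 0.1        # learning rate tuned for 1 GPU
k = 2                # number of GPUs
warmup_epochs = 5

def learning_rate(epoch):
    target_lr = base_lr * k
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to k * base_lr during warm-up.
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

for epoch in range(8):
    print(epoch, round(learning_rate(epoch), 4))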

Outputs of the distributed version of MNIST model

I am playing with the distributed version of MNIST on CloudML and I am not sure I understand the logs displayed during the training phase:
INFO:root:Train [master/0], step 1693: Loss: 1.176, Accuracy: 0.464 (760.724 sec) 4.2 global steps/s, 4.2 local steps/s
INFO:root:Train [master/0], step 1696: Loss: 1.175, Accuracy: 0.464 (761.420 sec) 4.3 global steps/s, 4.3 local steps/s
INFO:root:Eval, step 1696: Loss: 0.990, Accuracy: 0.537
INFO:root:Train [master/0], step 1701: Loss: 1.175, Accuracy: 0.465 (766.337 sec) 1.0 global steps/s, 1.0 local steps/s
I am batching over 200 examples at a time, randomly.
Why is there such a gap between the train acc/loss and the eval acc/loss, with the metrics for the eval set being significantly higher than for the train set, when it is usually the opposite?
Also, what is the difference between the global step and the local step?
The code I am talking about is here. task.py is calling model.py, the file where the graph is created.
When you're doing distributed training you can have more than 1 worker. Each of these workers can compute an update to the parameters. So each time a worker computes an update that counts as 1 local step. Depending on the type of training, synchronous vs. asynchronous training, the updates can be combined in different ways before actually applying the update to the parameters.
For example, every worker might update the parameters directly, or you might average the updates from each worker and apply them to the parameters only once.
The global step tells you how many times you actually updated the parameters. So if you have N workers and you apply each worker's update then N local steps should correspond to N global steps. On the other hand, if you have N workers and you take 1 update from each worker, average them, and then update the parameters, then you'd have 1 global step for every N local steps.
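As a toy illustration of that bookkeeping (just arithmetic, not the CloudML sample's actual logic):

num_workers = 3
local_steps_per_worker = 100

# Asynchronous training: every worker's update is applied individually,
# so each local step is also a global step.
async_global_steps = num_workers * local_steps_per_worker  # 300

# Synchronous training: updates from all workers are averaged into a single
# parameter update, so N local steps yield 1 global step.
sync_global_steps = local_steps_per_worker  # 100

print(async_global_steps, sync_global_steps)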
RE: gap between Train acc/loss and Eval acc/loss
This is the result of using an exponential moving average to smooth out the loss and accuracy between batches (otherwise the plot is very noisy). The metrics are calculated differently on the eval set and the train set:
the eval set loss and accuracy are calculated using an exponential moving average over all batches of examples in the eval set; after each run of the evaluation the moving average is reset, so each point represents a single run, and all eval steps are calculated from a consistent checkpoint (a consistent value of the weights).
the train set loss and accuracy are calculated using an exponential moving average over batches of examples seen during the training process; the moving average is never reset, so it can carry over past information for a long time, and it is not based on a consistent checkpoint. This is a cheap-to-calculate approximation of the metrics (see the sketch below).
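A minimal sketch of that kind of smoothing (generic Python, not the sample's actual implementation; the decay value is arbitrary):

decay = 0.99
smoothed = None

def update_moving_average(raw_value):
    # Exponential moving average: mostly the previous value, a little of the new one.
    global smoothed
    if smoothed is None:
        smoothed = raw_value
    else:
        smoothed = decay * smoothed + (1.0 - decay) * raw_value
    return smoothed

# Per-batch training losses are noisy; the reported (smoothed) curve lags behind.
for batch_loss in [2.0, 1.5, 1.2, 1.0, 0.9]:
    print(update_moving_average(batch_loss))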
We will be providing updated samples that calculate the loss on the training and eval sets in a consistent way. Surprisingly, even with the update, the eval set initially gets higher accuracy than the training set - probably because the data wasn't properly shuffled and randomly split into training and eval sets, and the eval set contains a slightly 'easier' subset of the data. After more training steps the classifier starts to overfit the training data and accuracy on the training set exceeds accuracy on the eval set.
RE: what is the difference between the global step and the local step?
Local steps is the number of batches processed by a single worker (each worker logs this information); global steps is the number of batches processed by all workers. When a single worker is used the two numbers are equal. When using more workers in a distributed setting, global steps > local steps.
The training measures that are being reported are computed on just the "mini-batch" of examples used in that step (200 in your case). The eval measures are reported on the entire evaluation set.
So, the mini-batch training statistics will be quite noisy.
After further investigating the code in the example, the problem is that the accuracy reported over the training data is a moving average. Thus, the average reported at the current step is actually influenced by the average from many steps ago, which is typically lower. We will update the samples so that they do not average with previous steps.

Difference of training steps or complete run through

On tensorflow.org, in the beginner MNIST tutorial, they train with 1000 steps of 100 examples each, which is more examples than the training set, which only includes 55,000 points. In the expert MNIST tutorial they train with 20,000 steps of 50 examples.
I think training is split into steps so that at every training step one can print the loss and/or accuracy obtained so far, without waiting until the end of processing.
But could one also simply pipe all examples through the train_operation in 1 step and then look at the outcome, or is that not possible?
Training on the whole dataset at each iteration is called batch gradient descent. Training on minibatches (e.g. 100 samples at a time) is called stochastic gradient descent. You can read more about the two and the reasons for choosing larger or smaller batch sizes in this question on Cross Validated.
Batch gradient descent typically isn't feasible because it requires too much RAM. Each iteration will also take significantly longer and the tradeoff often isn't worth it even if you have the computational resources.
That said, the batch size is a hyperparameter that you can play around with to find a value that works well.
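A compact sketch contrasting the two on a toy problem (plain NumPy; the data, learning rate, and epoch count are made up):

import numpy as np

rng = np.random.RandomState(0)
x = rng.normal(size=1000)
y = 4.0 * x + rng.normal(scale=0.1, size=1000)
lr = 0.05

def fit(batch_size, epochs=5):
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(x), batch_size):
            xb, yb = x[start:start + batch_size], y[start:start + batch_size]
            w -= lr * np.mean(2.0 * (w * xb - yb) * xb)  # one gradient step
        # (no reshuffling between epochs, to keep the sketch short)
    return w

print(fit(batch_size=len(x)))  # batch gradient descent: 1 update per epoch
print(fit(batch_size=100))     # minibatch SGD: 10 updates per epoch

With the same number of epochs, the minibatch run takes many more (noisier) steps and here ends up much closer to the true slope of 4.0; whether that trade-off helps in practice depends on the problem, which is why batch size is worth tuning.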