Outputs of the distributed version of MNIST model - tensorflow

I am playing with the distributed version of the MNIST model on CloudML and I am not sure I understand the logs displayed during the training phase:
INFO:root:Train [master/0], step 1693: Loss: 1.176, Accuracy: 0.464 (760.724 sec) 4.2 global steps/s, 4.2 local steps/s
INFO:root:Train [master/0], step 1696: Loss: 1.175, Accuracy: 0.464 (761.420 sec) 4.3 global steps/s, 4.3 local steps/s
INFO:root:Eval, step 1696: Loss: 0.990, Accuracy: 0.537
INFO:root:Train [master/0], step 1701: Loss: 1.175, Accuracy: 0.465 (766.337 sec) 1.0 global steps/s, 1.0 local steps/s
I am batching over 200 examples at a time, randomly.
Why is there such a gap between the Train acc/loss and the Eval acc/loss? The metrics for the eval set are significantly better than for the train set, when it is usually the opposite.
Also, what is the difference between the global step and the local step?
The code I am talking about is here. task.py calls model.py, the file where the graph is created.

When you're doing distributed training you can have more than 1 worker. Each of these workers can compute an update to the parameters, and each time a worker computes an update that counts as 1 local step. Depending on the type of training, synchronous vs. asynchronous, the updates can be combined in different ways before actually being applied to the parameters.
For example, every worker's update might be applied to the parameters individually, or you might average the updates from all workers and update the parameters only once.
The global step tells you how many times you actually updated the parameters. So if you have N workers and you apply each worker's update then N local steps should correspond to N global steps. On the other hand, if you have N workers and you take 1 update from each worker, average them, and then update the parameters, then you'd have 1 global step for every N local steps.
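To make the bookkeeping concrete, here is a minimal, purely illustrative Python sketch (not the Cloud ML sample code) that counts local and global steps under asynchronous and synchronous averaged updates:

# Illustrative only: counts steps, does not train a real model.
num_workers = 3
local_steps = [0] * num_workers  # one counter per worker
global_step = 0

# Asynchronous updates: every worker's update is applied directly,
# so each local step also advances the global step.
for _ in range(5):
    for w in range(num_workers):
        local_steps[w] += 1
        global_step += 1              # N local steps -> N global steps
print(sum(local_steps), global_step)  # 15 15

# Synchronous averaged updates: each worker computes one update, the
# updates are averaged, and the parameters are updated only once.
local_steps = [0] * num_workers
global_step = 0
for _ in range(5):
    for w in range(num_workers):
        local_steps[w] += 1           # every worker takes a local step
    global_step += 1                  # ...but only 1 global step results
print(sum(local_steps), global_step)  # 15 5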

RE: gap between Train acc/loss and Eval acc/loss
This is the result of using an exponential moving average to smooth out the loss and accuracy between batches (otherwise the plot is very noisy). The metrics are calculated differently on the eval set and the train set:
The eval set loss and accuracy are calculated using an exponential moving average over all batches of examples in the eval set. The moving average is reset after each run of the evaluation, so each point represents a single run, and all eval steps are calculated from a consistent checkpoint (a consistent value of the weights).
The train set loss and accuracy are calculated using an exponential moving average over the batches of examples seen during training. The moving average is never reset, so it can carry over past information for a long time, and it is not based on a consistent checkpoint. It is a cheap-to-calculate approximation of the metrics.
We will be providing updated samples that calculate the loss on the training and eval sets in a consistent way. Surprisingly, even with that update the eval set initially gets higher accuracy than the training set, probably because the data wasn't properly shuffled and randomly split into training and eval sets, so the eval set contains a slightly 'easier' subset of the data. After more training steps the classifier starts to over-fit the training data and the accuracy on the training set exceeds the accuracy on the eval set.
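To illustrate why the smoothed training metric lags, here is a minimal generic Python sketch of an exponential moving average of a per-batch loss (not the sample's actual implementation; the decay and loss values are made up):

# Sketch of exponentially smoothed loss; decay and losses are placeholders.
decay = 0.9
ema_loss = None
batch_losses = [2.0, 1.5, 1.2, 1.0, 0.9, 0.8]  # made-up per-batch losses

for loss in batch_losses:
    ema_loss = loss if ema_loss is None else decay * ema_loss + (1 - decay) * loss
    print("batch loss=%.2f  smoothed loss=%.2f" % (loss, ema_loss))
# The smoothed value stays well above the latest batch loss because it
# still carries weight from the earlier, higher losses.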
RE: what is the difference between the global step and the local step?
The local step is the number of batches processed by a single worker (each worker logs this information); the global step is the number of batches processed by all workers. When a single worker is used the two numbers are equal. When more workers are used in a distributed setting, the global step is larger than the local step.

The training measures that are being reported are computed on just the "mini-batch" of examples used in that step (200 in your case). The eval measures are reported on the entire evaluation set.
So, the mini-batch training statistics will be quite noisy.

After further investigating the code in the example, the problem is that the accuracy reported over the training data is a moving average. Thus, the average reported at the current step is actually influenced by the average from many steps ago, which is typically lower. We will update the samples by not averaging with previous steps.

Related

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's object detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Parts 1 to 6 of his series.
Now in his video, he mentions stopping the training when the loss value reaches ~1 or below on average, and says that this takes about 10,000-ish steps.
In my case, it is 7500 steps right now and the loss values keep fluctuating from 0.6 to 1.3.
A lot of people complained in the comment section about false positives in this series, but I think this happened because of an unnecessarily prolonged training process (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I don't need the most optimal weights, just reasonably good ones, while avoiding false detections and overfitting. I am also watching the 'Total Loss' section of TensorBoard; it fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know, in general, which factors the decision to stop training depends on. Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using transfer learning, I chose the ssd_mobilenet_v1 model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, compute the loss on both the training set and the validation set.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set. For example, the training set might be 70% of the data, with the validation and test sets at 15% each.
Also, please note that 300 images does not seem like enough training data. You should increase it.
For your other question :
The loss is a sum of your errors, and thus depends on the problem and on your data. A loss of 1 does not mean much in this regard. Never rely on it alone to stop your training.
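To make the early-stopping procedure above concrete, here is a minimal tf.keras sketch (this is not the Object Detection API training loop; the model, the random placeholder data, and the patience value are all assumptions you would replace with your own):

import numpy as np
import tensorflow as tf

# Placeholder data; substitute your own training/validation split.
train_x, train_y = np.random.rand(700, 32), np.random.randint(10, size=700)
val_x, val_y = np.random.rand(150, 32), np.random.randint(10, size=150)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Stop once the validation loss has not improved for 5 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

model.fit(train_x, train_y,
          validation_data=(val_x, val_y),
          epochs=200,
          callbacks=[early_stop])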

val_loss: 1.1921e-07 - val_acc: 0.0715 How is that possible?

I am currently training this model: https://pastebin.com/F7dQvmZP. When I trained it with only 1 feature (the raw data) per timestep I got a loss of ~1.3 and an accuracy of ~57%. After adding the direction of change (1 if increased, 0 if the same, -1 if decreased) as a second feature for each timestep, my loss went down to ~0.8 and my accuracy increased to ~70%. Then I added a differently scaled version of the raw data as a third feature; this data is scaled so that the maximum reading within each timeseries is 1.0. Training this quickly results in a loss of ~1e-7, but the accuracy stays at ~7%. The input is composed like this:
np.dstack((measurements, change, scaled))
I don't really know how that is possible, since my outputs are one-hot encoded and I only have 22 classes. The training data includes 291,300 training and 97,100 validation samples. It trains normally until I add the third feature (even if I only use the third feature). Any help would be appreciated.

[MXNet] Periodic Loss Value when training with "step" learning rate policy

When training a deep CNN, a common approach is to use SGD with momentum and a "step" learning rate policy (e.g. the learning rate is set to 0.1, 0.01, 0.001, ... at different stages of training). But I encountered an unexpected phenomenon when training with this strategy under MXNet:
the training loss value is periodic.
https://user-images.githubusercontent.com/26757001/31327825-356401b6-ad04-11e7-9aeb-3f690bc50df2.png
The above is the training loss at a fixed learning rate of 0.01, where the loss decreases normally.
https://user-images.githubusercontent.com/26757001/31327872-8093c3c4-ad04-11e7-8fbd-327b3916b278.png
However, at the second stage of training (with lr 0.001), the loss goes up and down periodically, and the period is exactly one epoch.
So I thought it might be a data shuffling problem, but that cannot explain why it doesn't happen in the first stage. I use ImageRecordIter as the DataIter and reset it after every epoch. Is there anything I missed or set mistakenly?
train_iter = mx.io.ImageRecordIter(
    path_imgrec=recPath,
    data_shape=dataShape,
    batch_size=batchSize,
    last_batch_handle='discard',
    shuffle=True,
    rand_crop=True,
    rand_mirror=True)
The codes for training and loss evaluation:
while True:
    train_iter.reset()
    for i, databatch in enumerate(train_iter):
        globalIter += 1
        mod.forward(databatch, is_train=True)
        mod.update_metric(metric, databatch.label)
        if globalIter % 100 == 0:
            loss = metric.get()[1]
            metric.reset()
            mod.backward()
            mod.update()
Actually the loss can converge, but it takes too long.
I've suffered from this problem for a long time, on different networks and different datasets.
I didn't have this problem when using Caffe. Is this due to an implementation difference?
Your loss/learning curves look suspiciously smooth, and I believe you could observe the same oscillation in the loss even when the learning rate is set to 0.01, just at a smaller relative scale (i.e. if you 'zoomed in' on the chart you'd see the same pattern). You may have an issue with your data iterator, for example passing the same batch repeatedly. And your training loop looks wrong, but this could be due to formatting (e.g. calling mod.update() only every 100 batches isn't correct).
You can observe periodicity in your loss when you're traveling across a valley in the loss surface, up and down the sides rather than down the valley. Choosing a lower learning rate can help fix this, and make sure you are using momentum too.
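For reference, here is a minimal sketch of how a lower learning rate, momentum, and a "step" schedule could be wired up with MXNet's MultiFactorScheduler (the step boundaries, factor, and hyper-parameter values below are placeholders, not taken from your setup):

import mxnet as mx

# Step-wise learning rate policy: multiply the LR by `factor` at each
# listed iteration. The boundaries here are placeholders.
lr_sched = mx.lr_scheduler.MultiFactorScheduler(step=[60000, 120000], factor=0.1)

optimizer_params = {
    'learning_rate': 0.01,   # try lowering this if the loss oscillates
    'momentum': 0.9,
    'wd': 0.0001,
    'lr_scheduler': lr_sched,
}

# `mod` is assumed to be an mx.mod.Module that is already bound:
# mod.init_optimizer(optimizer='sgd', optimizer_params=optimizer_params)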

Training TensorFlow model with summary operations is much slower than without summary operations

I am training an Inception-like model using TensorFlow r1.0 with GPU Nvidia Titan X.
I added some summary operations to visualize the training procedure, using code as follows:
def variable_summaries(var):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
When I run these operations, the time cost of training one epoch is about 400 seconds. But when I turn off these operations, the time cost of training one epoch is just 90 seconds.
How can I optimize the graph to minimize the time cost of the summary operations?
Summaries of course slow down the training process, because you do more operations and you need to write them to disk. Histogram summaries slow training down even more, because histograms require more data to be copied from GPU to CPU than scalar values do.
So I would try to log histograms less often than the rest; that could make some difference.
The usual solution is to compute summaries only every X batches. Since you already compute summaries only once per epoch and not on every batch, it might be worth logging summaries even less often.
It depends on how many batches you have in your dataset, but you usually don't lose much information by gathering logs a bit less frequently.
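As an illustration of the every-X-batches approach, here is a minimal TF 1.x sketch with a trivial stand-in graph (the graph, log directory, and interval are placeholders); the merged summary op is only evaluated, and therefore only costs time, every summary_every steps:

import tensorflow as tf

# Trivial stand-in graph; replace with your own model and train op.
x = tf.Variable(0.0)
train_op = tf.assign_add(x, 1.0)
tf.summary.scalar('x', x)
merged = tf.summary.merge_all()

summary_every = 100
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    writer = tf.summary.FileWriter('/tmp/logdir', sess.graph)
    for step in range(1000):
        if step % summary_every == 0:
            _, summ = sess.run([train_op, merged])
            writer.add_summary(summ, step)   # write summaries occasionally
        else:
            sess.run(train_op)               # no summary ops on most steps
    writer.close()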

Google Cloud ML Loss in Recall: Distributed Learning

I have two model versions trained on Google Cloud ML, one using 2 workers and one with just the master node. However, there is a significant drop in recall after training in distributed mode. I followed the provided sample examples for around 2000 steps (workers and master both contribute to the steps).
Only Master
RECALL metrics: 0.352357320099
Accuracy over the validation set: 0.737576772753
Master and 2 Workers
RECALL metrics: 0.0223325062035
Accuracy over the validation set: 0.770519262982
The general idea to keep in mind is that as you increase the number of workers, you are also increasing your effective batch size (each worker is processing N examples per step).
To account for that, you'll need to look at adjusting other hyper-parameters. Try picking a smaller learning rate to reduce the amount of change per step. Consequently you may also need to increase the number of steps by some factor, depending on your model and data, to get to the same convergence.
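As a back-of-the-envelope illustration of that adjustment (the numbers and the divide-by-the-number-of-workers heuristic below are assumptions, not Cloud ML defaults; you will still need to tune for your model and data):

# Placeholder values; tune for your own model and data.
per_worker_batch_size = 200
num_workers = 3                      # master + 2 workers all applying updates
effective_batch_size = per_worker_batch_size * num_workers   # 600 examples/step

base_learning_rate = 0.01            # the value tuned for the single-worker run
# One simple heuristic: shrink the learning rate as workers are added, and
# compensate by training for proportionally more steps.
adjusted_learning_rate = base_learning_rate / num_workers
base_steps = 2000
adjusted_steps = base_steps * num_workers

print(effective_batch_size, adjusted_learning_rate, adjusted_steps)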