Tensorflow: When to stop training due to the best (minimum) cost?

I am using TensorFlow to classify images with the LeNet network. I use AdamOptimizer to minimize the cost function. When I start training the model, I can observe that the training accuracy, the validation accuracy, and the cost all fluctuate, sometimes decreasing and sometimes increasing.
My questions: When should we stop the training? How can we know that the optimizer will find the minimum cost? For how many iterations should we train? Can we set a variable or condition to stop at the minimum cost?
My solution is to define a global variable (min_cost), check in each iteration whether the cost has decreased, and if so save the session and replace min_cost with the new cost. At the end, I will have the saved session for the minimum cost.
Is this a correct approach?
Thanks in advance,

While training neural networks, a target error is usually defined alongside a maximum number of iterations to train. For example, a target error could be 0.001 MSE. Once this error has been reached, the training will stop; if this error has not been reached after the maximum number of iterations, the training will also stop.
But it seems like you want to train until you know the network can't do any better. Saving the 'best' parameters like you're doing is a fine approach, but do realise that once some kind of minimum cost has been reached, the error won't fluctuate that much anymore. It won't be like the error suddenly goes up significantly, so it is not completely necessary to save the network.
There is no such thing as a 'minimal cost' - the network is always trying to descend to some local minimum, and it will always be doing so. There is no real way you (or an algorithm) can figure out that there is no better error to be reached anymore.
tl;dr - just set a reasonable target error alongside a maximum number of iterations.
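A minimal, self-contained sketch of that recipe in TF 2.x / Keras terms is below: cap the number of epochs, keep the best weights seen so far (the min_cost idea from the question), and stop once the monitored cost stops improving. MNIST and a small LeNet-style network stand in for the asker's actual data and model; the patience, epoch count, and file name are placeholder values.

import tensorflow as tf

# MNIST's test split is used as a validation set here purely for illustration.
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0
x_val = x_val[..., None] / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, 5, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(16, 5, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),
    tf.keras.layers.Dense(84, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    # Keep only the weights of the epoch with the lowest validation loss
    # (the "min_cost" bookkeeping from the question).
    tf.keras.callbacks.ModelCheckpoint("best_lenet.h5", monitor="val_loss",
                                       save_best_only=True),
    # Stop once the validation loss has not improved for 10 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,          # the maximum number of epochs
          callbacks=callbacks)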

Related

TensorFlow: Fit gives great val_acc, but evaluate gives low acc

I have some data (reuters).
I combine the training and test data into one large data set.
I shuffle the entire data set.
I then split the data set 50/50: 50% for training and 50% for testing.
I then set up a sequential model (Embedding, GRU, BatchNormalization, Dense).
I compile with the adam optimizer and loss = sparse_categorical_crossentropy.
I run fit on it, for 5 epochs, with validation_split = 0.2
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
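For reference, a hedged reconstruction of this pipeline is sketched below so the setup is concrete; the vocabulary size, sequence length, and layer widths are assumptions, not the asker's actual values.

import numpy as np
import tensorflow as tf

# Load Reuters, combine train and test, shuffle, and re-split 50/50.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(num_words=10000)
x_all = np.concatenate([x_train, x_test])
y_all = np.concatenate([y_train, y_test])
idx = np.random.permutation(len(x_all))
x_all, y_all = x_all[idx], y_all[idx]

split = len(x_all) // 2
x_tr, y_tr = x_all[:split], y_all[:split]
x_te, y_te = x_all[split:], y_all[split:]

# Pad the word-index sequences to a fixed length for the Embedding layer.
x_tr = tf.keras.preprocessing.sequence.pad_sequences(x_tr, maxlen=200)
x_te = tf.keras.preprocessing.sequence.pad_sequences(x_te, maxlen=200)

# Sequential model: Embedding -> GRU -> BatchNormalization -> Dense.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 64),
    tf.keras.layers.GRU(64),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(46, activation="softmax"),  # Reuters has 46 topics
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Fit for 5 epochs with a 20% validation split, then evaluate on the held-out half.
model.fit(x_tr, y_tr, epochs=5, validation_split=0.2)
print(model.evaluate(x_te, y_te))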
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it can not be related to the data.
I tried the checkpoint save/restore trick, but that did not seem to help.
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is there is some sort of overfitting going on that is carrying over to test, but is NOT impacting val_acc.
(Side note: the 20% data saved for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1:
Okay, I might be onto something. I noticed something odd about the test data that is included with the Reuters dataset, compared with the training data.
In previous experiments, results were always poor when evaluating against the test set, versus an unused portion of the training data.
This time, I combined the training and test data sets into one set, and shuffled.
Then shuffled some more, and then generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone came up with anything but found no results. So I'm going to presently chalk it up to test data that is significantly different from the training set.
Edit 2:
Never mind edit 1. I think I was corrupting my test data with a pseudo-generated version of the training data that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training AND using validation data during training is misleading, as it is also being overfit.
(Why validation data is being used to help overfit, I do not know).
Overfitting occurs for many reasons. To overcome this problem you can try different methods, such as the following (a short sketch of a couple of them is given after this list):
L1/L2 regularization.
Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
Early stopping: Monitors the performance of the model for every epoch on a held-out validation set during the training, and terminates the training conditional on the validation performance.
Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features.
For more details refer to this link.
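As an illustration only, here is a minimal Keras-style sketch of the first two ideas; the layer sizes, regularization strength, and dropout rate are assumptions, not tuned values.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        128, activation="relu",
        # L2 regularization penalizes large weights.
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        input_shape=(100,)),
    # Dropout randomly zeroes 30% of the units during training,
    # which discourages interdependent (co-adapted) units.
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(46, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Early stopping is added at fit time as a callback, e.g.:
# tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)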

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (i.e. large for a single machine), with 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the data first, such that some of the data from the validation / testing set ends up in the training set and vice versa.
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per piece (i.e. I may potentially miss out on some crucial data that exists within the validation or testing set that the previous training set may not have accounted for).
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The test accuracy is more or less equivalent to the validation accuracy (plus or minus 0.5%).
With each retrain, the results usually end up something like this: the training accuracy keeps improving (until it runs out of epochs), but the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again. The data gets shuffled. The training accuracy drops, but the validation accuracy jumps up. The training accuracy improves until the final epoch. The validation accuracy converges downward (though still greater than in the previous run).
See Example:
I plan on doing this until the training accuracy reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem)
I can't help but think that I am doing something wrong by doing this. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling with each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with training data, you then can not evaluate your model on that data, since that data has been seen by your model. The model evaluation is done on the basis of how well it is able to make predictions/classification on data which your model has not seen (assuming that the data you are using to evaluate your model is coming from the same distribution as your training data). If you also mix your test set data with training set data, you will eventually end up with really good test set accuracy since that data has been seen by your model, but it might not perform well on new unseen data coming from the same distribution.
If you are worried about the size of your test/validation data, I suggest you further reduce it: use 99.9% of the data for training instead of 99%. With 1,000,000 examples, the remainder is still plenty for evaluation. Also, the random shuffling will take care of covering almost every feature of your data during training.
After all, my point is: never evaluate your model on data it has seen before. That will always give you better results (assuming you have trained your model well, until it memorizes the training data). The validation data is used when you have multiple algorithms/models and you need to select one from all those available. Here, the validation data is used to select the model: the algorithm/model which gives good results on the validation data is selected (again, you do not evaluate your model based on validation set accuracy; it is just used for the selection of the model). Once you have selected your model based on validation set accuracy, you then evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on the test data as your model accuracy.
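To make that concrete, here is a small sketch of a fixed 80/10/10 split that is made once and then never reshuffled between the three sets; the toy arrays stand in for the 1,000,000-example dataset.

import numpy as np

x = np.random.rand(1000, 10)            # placeholder features
y = np.random.randint(0, 2, size=1000)  # placeholder labels

rng = np.random.default_rng(seed=42)    # fixed seed, so the split is reproducible
idx = rng.permutation(len(x))

train_end = int(0.8 * len(x))
val_end = int(0.9 * len(x))

x_train, y_train = x[idx[:train_end]], y[idx[:train_end]]
x_val, y_val = x[idx[train_end:val_end]], y[idx[train_end:val_end]]
x_test, y_test = x[idx[val_end:]], y[idx[val_end:]]

# Validation set: used only to compare candidate models/hyperparameters.
# Test set: touched exactly once, for the final accuracy that gets reported.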

How important is the loss difference between training and validation data at the beginning when training a neural network?

Short question:
Is the difference between validation and training loss at the beginning of the training (first epochs) a good indicator for the amount of data that should be used?
E.g., would it be a good method to increase the amount of data until the difference at the beginning is as small as possible? It would save me time and computation.
Background:
I am working on a neural network that overfits very fast. The best result after applying many different techniques like dropout, batch normalization, reducing the learning rate, reducing the batch size, increasing the variety of the data, reducing layers, increasing filter sizes, etc. is still very bad.
While the training loss decreases very well, the validation loss starts overfitting too early (by too early I mean the desired loss is not reached; it should be many times lower).
Since training on my dataset of ~200 samples took 24 hours for 50 epochs, I was hoping to find a way to fight the overfitting with all the methods I described above before increasing the amount of data. Because nothing helped, I am at the point of increasing the amount of data.
I am thinking about how much data could be enough for my network to eliminate overfitting. I know that this is not easy to answer because it depends on the complexity of the data and the task I am trying to solve, therefore I try to generalize my question to:
Short answer to the short question: No
explanation: There's a correlation between (train_loss - val_loss) and the amount of data you need to train your model, but there's a bunch of other factors that could be the source of a big (train_loss - val_loss). For example, your network architecture is too small, and therefore your model quickly overfits. Or your validation set doesn't reflect the training data. Or your learning rate is too big. Or...
So my recommendation: formulate your problem in another SO question, and ask "what might I be doing wrong?"

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard for me to get my head wrapped around the TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
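For orientation, here is a sketch of where iterations_per_loop is set, assuming the TF 1.x TPUEstimator API that this terminology comes from; the path and values below are placeholders.

import tensorflow as tf  # TF 1.x, where the TPUEstimator API lives under tf.contrib

# iterations_per_loop belongs to the run configuration, not to the model.
tpu_config = tf.contrib.tpu.TPUConfig(
    iterations_per_loop=100)   # the TPU runs 100 steps per session.run / host check-in

run_config = tf.contrib.tpu.RunConfig(
    model_dir="/tmp/model",    # placeholder path
    tpu_config=tpu_config)

# estimator = tf.contrib.tpu.TPUEstimator(
#     model_fn=my_model_fn,        # hypothetical model function
#     config=run_config,
#     train_batch_size=1024)       # unlike iterations_per_loop, this changes the result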
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Part 1 to 6 of his series.
Now in his video, he has mentioned to stop the training when the loss value reaches ~1 or below on average, and that it would take about 10,000-ish steps.
In my case, it is 7500 steps right now and the loss values keep fluctuating from 0.6 to 1.3.
A lot of people complained in the comment section about false positives on this series, but I think this happened because of the unnecessarily prolonged training process (maybe because they didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I would like to have not necessarily the most optimal weights, but fairly good weights, while avoiding false detections and overfitting. I am also observing the 'Total Loss' section of TensorBoard. It fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know, in general, which factors the 'stopping of training' depends on. Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using the concept of transfer learning, I chose ssd_mobilenet_v1.model.
Tensorflow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, you compute the loss on both the training and validation sets.
If the validation loss begins to increase, stop your training (see the sketch below). You can then test your model on your test set.
The validation set size is usually the same as the test set's. For example, the training set is 70% of the data and the validation and test sets are 15% each.
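As an illustration of that rule (a generic sketch, not specific to the Object Detection API), with train_one_epoch and validation_loss standing in as placeholders for your own training and evaluation code:

def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Train epoch by epoch and stop once the validation loss stops improving."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()            # placeholder: one pass over the training set
        loss = validation_loss()     # placeholder: loss on the held-out validation set
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                # validation loss is no longer improving
    return best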
Also, please note that 300 images in your dataset do not seem to be enough. You should increase that number.
For your other question :
The loss is the sum of your errors and thus depends on the problem and on your data. A loss of 1 does not mean much in this regard. Never rely on it to stop your training.