TensorFlow: Fit gives great val_acc, but evaluate gives low acc - tensorflow

I have some data (reuters).
I combine the training and test data into one large data set.
I shuffle the entire data set.
I then split the data set into 50%. 50% for training and 50% for testing.
I then setup a sequential model (Embedding, GRU, BatchNormalization, Dense).
I compile with adam optimizer, and loss = sparse_categorical_crossentropy
I run fit on it, for 5 epochs, with validation_split = 0.2
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it can not be related to the data.
I tried the checkpoint save/restore trick, that did not seem to help.
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is there is some sort of overfitting going on that is carrying over to test, but is NOT impacting val_acc.
(Side note: the 20% data saved for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1:
Okey, I might be onto something. I noticed something odd about the test data that is included with the reuters dataset, vs the training data.
In previous experiments, results were always poor evaluating against test set, vs an unused portion of the training data.
This time, I combined the training and test data sets into one set, and shuffled.
Then shuffled some more, and then generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone came up with anything but found no results. So I'm going to presently chalk it up to test data that is significantly different from the training set.
Edit 2:
Nevermind edit 1. I think I was corrupting my test data with pseudo-generated version of training data, that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training AND using validation data during training is misleading, as it is also being overfit.
(Why validation data is being used to help overfit, I do not know).

Overfitting occurs due to many reasons, to overcome this problem you can try different methods like:
L1/L2 regularization.
Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
Early stopping: Monitors the performance of the model for every epoch on a held-out validation set during the training, and terminates the training conditional on the validation performance.
Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features.
For more details refer to this link.

Related

Validation loss and accuracy has a lot of 'jumps'

Hello everyone so I made this cnn model.
My data:
Train folder->30 classes->800 images each->24000 all together
Validation folder->30 classes->100 images each->3000 all together
Test folder->30 classes -> 100 images each -> 3000 all together
-I've applied data augmentation. ( on the train data)
-I got 5 conv layers with filters 32->64->128->128->128
each with maxpooling and batch normalization
-Added dropout 0.5 after flattening layers
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
Is there any way to fix this and make validation part more stable?
Note: I plann to increase epochs on my final model I'm just experimenting to see what works best since the model takes a lot of time in order to train. So for now I train with 20 epochs.
I've applied data augmentation (on the train data).
What does this mean? What kind of data did you add and how much? You might think I'm nitpicking, but if the distribution of the augmented data is different enough from the original data, then this will indeed cause your model to generalize poorly to the validation set.
Increasing your epochs isn't going to help here, your training loss is already decreasing reasonably. Training your model for longer is a good step if the validation loss is also decreasing nicely, but that's obviously not the case.
Some things I would personally try:
Try decreasing the learning rate.
Try training the model without the augmented data and see how the validation loss behaves.
Try splitting the augmented data so that it's also contained in the validation set and see how the model behaves.
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
the answer is yes. The so-called 'jumps' in the validation part may indicate that the model is not generalizing well to the validation data and therefore your model might be overfitting.
Is there any way to fix this and make validation part more stable?
To fix this you can use the following:
Increasing the size of your training set
Regularization techniques
Early stopping
Reduce the complexity of your model
Use different hyperparameters like learning rate

Is it a good idea to mix the validation / testing data with the training data?

I am working with a large dataset (e.g. large for a single machine) - with 1,000,000 examples.
I split my dataset to as follows: (80% Training Data, 10% Validation Data, 10% Testing Data). Every time I retrain the model, I shuffle the data first - such that some of the data from the validation / testing set ends up into the training set and vice versa.)
My thinking is this:
Ideally I would want all possible available data for the model to learn. The more the better - for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples per piece - (i.e. I may potentially miss out on some crucial data that exists within the validation or testing set that the previous training set may not have accounted for.)
Shuffling prevents the training set from learning order where it is not important (at least in my particular dataset).
Here is my workflow process:
The Test Accuracy is more or less the equivalent to the Validation Accuracy (plus or minus 0.5%)
Per each retrain, the results usually ends up something like this: where the accuracy keeps improving (until it runs out of total epoch), but the validation accuracy ends up stuck at a particular percentage. I then save that model. Start the retraining process again. Shuffles data occurs. The training accuracy drops, but validation accuracy jumps up. The training accuracy improves until total epoch. The validation accuracy, converges downward (still greater than the previous run).
See Example:
I plan on doing this until the training accuracy data reaches 99%. (Note: I used Keras-Tuner to find the best architecture/model for my particular problem)
I can't help but think, that I am doing something wrong by doing this. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" because of the shuffling per each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?
If you mix your test/validation data with training data, you then can not evaluate your model on that data, since that data has been seen by your model. The model evaluation is done on the basis of how well it is able to make predictions/classification on data which your model has not seen (assuming that the data you are using to evaluate your model is coming from the same distribution as your training data). If you also mix your test set data with training set data, you will eventually end up with really good test set accuracy since that data has been seen by your model, but it might not perform well on new unseen data coming from the same distribution.
If you are worried size of test/validation data, I suggest you further reduce the size of your test/validation data. Use 99.9% instead of 99%. Also, the random shuffling will take care of learning almost every feature of your data.
After all, my point is, never ever evaluate your model on the data it has seen before. It will always give you better results (assuming you have trained your model well untill it memorizes the training data). The validation data is used when you have multiple algorithms/models and you need to select one algorithm/model from all those available models. Here, the validation data is used to select the model. The algo/model which gives good results on validation data is selected (again you do not evaluate your model based on validation set accuracy, it is just used for the selection of the model.) Once you have selected your model based on validation set accuracy, you then evaluate it on new unseen data (called test data) and report the prediction/classification accuracy on test data as your model accuracy.

Small gap between train and test error, does that mean overfitting?

i am working on a dataset of 368609 samples and 34 features, i wanted to use a neural network to predict latency (real value) using keras, the model has 3 hidden layers, each layer has 1024 neurons, i have used drop_out (50 %) and l2 regularization (0.001) for each hidden layer. The problem is i am getting a test mean absolute error of 3.5505 ms and train mean absolute error of 3.4528. Here, the train error is smaller than test error by a small gap, does this mean that we have an overfitting problem here ?
Not really, yet it is still always a good idea to see how your model is generalizing to new data.
Keep something between 10%-20% of your original dataset as a test set and try to predict the output for each record in the test set.
Sometimes when we deal with the same validation set for many attempts of improving our model, we tend to overfit the evaluation dataset as well.
Having 3 different datasets for training, evaluation and testing usually provides a whole solution to overfitting.
If you get a high accuracy on your training set and a low accuracy on you test set it often means that your are overfitting. So in you case - no you are probably not overfitting.
Normally you would also have a validation set, so you don't fit your data to the testset.

how important is the loss difference between training and validation data at the beginning when training a neuronal network?

Short question:
Is the difference between validation and training loss at the beginning of the training (first epochs) a good indicator for the amount of data that should be used?
E.g would it be a good method to increase the amount of data until the difference at the beginning is as small as possible? It would save me time and computation.
backround:
I am working on a neuronal network that overfits very fast. The best result after applying many different techniques like dropouts, batch normalization, reducing learning rate, reducing batch size, increasing variety of data, reducing layers, increasing filter sizes ..... is still very bad.
While the training loss decreases very well, validation loss overfits too early(with too early I mean, the desired loss is not reached, it should be many times less)
Since the training with my dataset ~200 samples took 24 hours for 50 epochs, I was hoping to find a way to fight against overfitting with all the methods I described above, before increasing the amount of data. Because nothing helped I am at the point of increasing the amount of data.
I am thinking about how much data could be enough for my network to eliminate overfitting. I know that this it is not easy to answer because it depends on the complexity of the data and the task I am trying to solve.. therefore I try to generalize my question to:
Short answer to the short question: No
explanation: There's a correlation between (train_loss - val_loss) and the amount of data you need to train your model, but there's a bunch of other factors that could be the source of the big (train_loss - val_loss). For example, your network architecture is too small, and therefor your model quickly overfits. Or, your validation set doesn't reflect the training data. Or your learning rate is too big. Or...
So my recommendation: formulate your problem in another SO question, and ask "what might I be doing wrong?"

Does batch normalization work on balanced dataset?

I trained a classification network using tensorFlow with batch normalization in every convolutional layer. When I predict on a balanced test set where every category included in it, the accuracy is normal. However, if I chose any one specific category from test set, the accuracy is low, even zero.
But when 3 categories included in test set, the accuracy became higher. As we all know, the weights was fixed when the model finished training. But I find the balance in test set have greatly influence on prediction accuracy.
I think if batch normalization has influence on this, so I remove all batch normalization and retrained the model again. This time, when I predict only one category picture, it became normal.
Could anyone know why? THANKS!
You're right. If your training set is unbalanced you compute and accumulate mean values (for every layer) that are skewed in favor of the majority class.
In fact, you're not "normalizing" but instead, you're making the unbalancing problem worse.
Use batch normalization when you have a balanced training set and you can be sure that your batches will contain a balanced number of samples. This gives you optimal results.
However, since you added in the comments that you're using tf.contrib.layers.conv2d(x, num_output, kernel_size, stride, padding, activation_fn, normal_fn=tf.contrib.layers.batch_norm)
I spotted the problem: normalizer_fn calls the function you pass (batch_norm). But it uses the defaults parameters. By default, is_training equals to True thus you're computing even during the test phase the mean and the variance over the batch. Just read carefully the documentation of tf.contrib.layers.conv2d and use normalizer_params to pass is_training=True when training and is_training=False when testing/validating.