Is it a good idea to mix the validation / testing data with the training data? - tensorflow

I am working with a large dataset (large for a single machine, anyway) of 1,000,000 examples.
I split my dataset as follows: 80% training data, 10% validation data, 10% testing data. Every time I retrain the model, I shuffle the full dataset first, so that some of the data from the validation/testing sets ends up in the training set and vice versa.
My thinking is this:
Ideally I would want the model to learn from all available data; the more the better, for improved accuracy.
Even though 20% of the data is dedicated to validation and testing, that is still 100,000 examples apiece, so I may potentially miss out on crucial examples that sit in the validation or testing set and that the previous training set never saw.
Shuffling prevents the model from learning an ordering where order does not matter (at least in my particular dataset).
Here is my workflow:
The test accuracy is more or less equivalent to the validation accuracy (plus or minus 0.5%).
On each retrain, the results usually look like this: the training accuracy keeps improving until the epochs run out, while the validation accuracy gets stuck at a particular percentage. I then save that model and start the retraining process again, which reshuffles the data. The training accuracy drops, but the validation accuracy jumps up. The training accuracy then improves until the final epoch, and the validation accuracy converges downward (still ending higher than in the previous run).
See example: [plot omitted]
I plan on repeating this until the training accuracy reaches 99%. (Note: I used Keras Tuner to find the best architecture/model for my particular problem.)
I can't help but think that I am doing something wrong here. From my perspective, this is just the model eventually learning all 1,000,000 examples. It feels like "mild overfitting" caused by the reshuffle before each retrain.
Is it a good idea to mix the validation / testing data with the training data?
Am I wrong by doing it this way? If so, why should I not do this method? Is there a better way to approach this?

If you mix your test/validation data with your training data, you can no longer evaluate your model on that data, since the model has already seen it. Model evaluation is based on how well the model predicts on data it has not seen (assuming the evaluation data comes from the same distribution as the training data). If you mix test data into the training set, you will eventually end up with a very good test accuracy, precisely because the model has seen that data, but the model may still perform poorly on genuinely unseen data from the same distribution.
If you are worried about the size of the test/validation sets, reduce them further and train on a larger share of the data, say 99% or even 99.9%, instead of 80%. Also, random shuffling within the training set will take care of exposing the model to almost every feature of your data.
After all, my point is: never evaluate your model on data it has seen before. Doing so will always give you inflated results (assuming you have trained your model well, to the point of memorizing the training data). The validation data comes into play when you have multiple algorithms/models and need to select one: the model that gives good results on the validation data gets picked. Note that you do not evaluate your model based on validation-set accuracy; that accuracy is used only for model selection. Once you have selected your model based on validation-set accuracy, you evaluate it on new, unseen data (the test data) and report the prediction/classification accuracy on the test data as your model's accuracy.
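As a concrete illustration, here is a minimal sketch of a fixed split, assuming NumPy arrays X and y (hypothetical names): the validation and test sets are carved out once and never reshuffled back into training.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Shuffle ONCE, then freeze the 80/10/10 split so the
# validation and test sets stay unseen by the model.
n = len(X)
perm = rng.permutation(n)
train_idx = perm[: int(0.8 * n)]
val_idx = perm[int(0.8 * n) : int(0.9 * n)]
test_idx = perm[int(0.9 * n) :]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Reshuffling *within* X_train on every retrain is fine;
# the membership of the three sets must not change.
```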


Validation loss and accuracy have a lot of 'jumps'

Hello everyone, I made this CNN model.
My data:
Train folder -> 30 classes -> 800 images each -> 24,000 altogether
Validation folder -> 30 classes -> 100 images each -> 3,000 altogether
Test folder -> 30 classes -> 100 images each -> 3,000 altogether
- I've applied data augmentation (on the train data).
- I have 5 conv layers with filters 32 -> 64 -> 128 -> 128 -> 128, each followed by max pooling and batch normalization.
- I added dropout of 0.5 after the flatten layer.
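In code, the architecture as described would look roughly like this; it is only a sketch, and the 150x150x3 input shape and the loss are assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([layers.Input(shape=(150, 150, 3))])  # assumed input size
# Five conv blocks: 32 -> 64 -> 128 -> 128 -> 128,
# each with max pooling and batch normalization.
for filters in (32, 64, 128, 128, 128):
    model.add(layers.Conv2D(filters, 3, activation="relu", padding="same"))
    model.add(layers.MaxPooling2D())
    model.add(layers.BatchNormalization())
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))                      # dropout after flattening
model.add(layers.Dense(30, activation="softmax"))   # 30 classes

model.compile(optimizer="adam",
              loss="categorical_crossentropy",      # assumed one-hot labels
              metrics=["accuracy"])
```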
The training part looks good. The validation part has a lot of 'jumps', though. Is it overfitting?
Is there any way to fix this and make the validation part more stable?
Note: I plan to increase the number of epochs on my final model; I'm just experimenting to see what works best, since the model takes a lot of time to train. So for now I train with 20 epochs.
I've applied data augmentation (on the train data).
What does this mean exactly? What kind of data did you add, and how much? You might think I'm nitpicking, but if the distribution of the augmented data is different enough from the original data, this will indeed cause your model to generalize poorly to the validation set.
Increasing your epochs isn't going to help here; your training loss is already decreasing reasonably. Training your model for longer is a good step when the validation loss is also decreasing nicely, but that's obviously not the case.
Some things I would personally try:
Try decreasing the learning rate (see the sketch after this list).
Try training the model without the augmented data and see how the validation loss behaves.
Try splitting the augmented data so that it's also contained in the validation set and see how the model behaves.
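For the first suggestion, lowering the learning rate in Keras just means passing an explicit optimizer when compiling; 1e-4 below is an illustrative value, not a recommendation:

```python
from tensorflow import keras

# Adam defaults to learning_rate=1e-3; try an order of magnitude lower.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```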
The training part looks good. The validation part has a lot of 'jumps', though. Is it overfitting?
The answer is yes: the 'jumps' in the validation part indicate that the model is not generalizing well to the validation data, which suggests your model is overfitting.
Is there any way to fix this and make the validation part more stable?
To fix this, you can try the following:
Increase the size of your training set.
Apply regularization techniques.
Use early stopping (see the callback sketch after this list).
Reduce the complexity of your model.
Tune hyperparameters such as the learning rate.
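For the early-stopping option, Keras ships a ready-made callback; patience=3 and the dataset names train_ds/val_ds below are illustrative:

```python
from tensorflow import keras

# Stop when the validation loss hasn't improved for 3 epochs,
# and roll back to the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss",
                                           patience=3,
                                           restore_best_weights=True)

model.fit(train_ds, validation_data=val_ds,
          epochs=100, callbacks=[early_stop])
```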

TensorFlow: Fit gives great val_acc, but evaluate gives low acc

I have some data (the Reuters dataset).
I combine the training and test data into one large data set.
I shuffle the entire data set.
I then split the data set 50/50: 50% for training and 50% for testing.
I then set up a sequential model (Embedding, GRU, BatchNormalization, Dense).
I compile with the adam optimizer and loss = sparse_categorical_crossentropy.
I run fit on it for 5 epochs, with validation_split = 0.2.
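A minimal sketch of that workflow, using the tf.keras Reuters loader; the vocabulary size, sequence length, and layer widths are illustrative guesses, not the values actually used:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, MAXLEN = 10000, 200  # assumptions

# Load, combine, and shuffle the entire data set.
(x1, y1), (x2, y2) = keras.datasets.reuters.load_data(num_words=VOCAB)
x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])
perm = np.random.permutation(len(x))
x, y = x[perm], y[perm]

# Pad for the Embedding layer, then split 50/50.
x = keras.preprocessing.sequence.pad_sequences(x, maxlen=MAXLEN)
half = len(x) // 2
x_train, y_train, x_test, y_test = x[:half], y[:half], x[half:], y[half:]

model = keras.Sequential([
    layers.Embedding(VOCAB, 64),
    layers.GRU(64),
    layers.BatchNormalization(),
    layers.Dense(46, activation="softmax"),  # 46 Reuters topics
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.2)
print(model.evaluate(x_test, y_test))
```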
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it cannot be related to the data itself.
I tried the checkpoint save/restore trick, but that did not seem to help.
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is that there is some sort of overfitting going on that carries over to the test set but does NOT impact val_acc.
(Side note: the 20% data saved for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1:
Okay, I might be onto something. I noticed something odd about the test data that is included with the Reuters dataset, versus the training data.
In previous experiments, results were always poor when evaluating against the test set, versus an unused portion of the training data.
This time, I combined the training and test data sets into one set, and shuffled.
Then shuffled some more, and then generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone had come up with anything, but found no results. So for now I'm going to chalk it up to test data that is significantly different from the training set.
Edit 2:
Never mind Edit 1. I think I was corrupting my test data with a pseudo-generated version of the training data that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training AND that using validation data during training is misleading, as it is also being overfit.
(Why the validation data also ends up overfit, I do not know.)
Overfitting occurs for many reasons. To overcome this problem you can try different methods like the following (a sketch combining the first three appears after the list):
L1/L2 regularization.
Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
Early stopping: Monitors the performance of the model for every epoch on a held-out validation set during the training, and terminates the training conditional on the validation performance.
Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features.
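As an example, here is a sketch combining L2 regularization, dropout, and early stopping in Keras; the layer sizes, coefficients, and the names X_train, y_train, num_classes are all hypothetical:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical shapes: X_train is (n_samples, n_features),
# y_train holds integer class labels in [0, num_classes).
model = keras.Sequential([
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),  # L2 penalty
    layers.Dropout(0.5),                                     # dropout layer
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Early stopping: halt training once the validation loss stops improving.
early = keras.callbacks.EarlyStopping(monitor="val_loss", patience=2,
                                      restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2,
          epochs=50, callbacks=[early])
```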

Weak accuracy with testing data

I'm dealing with a data science problem and ran into the following issue.
I have labelled data (training data) and unlabelled data (test data), and both have a lot of missing values.
I worked with my data and split it into training data and validation data.
I got a very good accuracy and a very small RMSE between Y_validation and the predicted values (model.predict(X_validate)). But when I submit my solution, the RMSE gets much bigger on the testing data!
What can I do?
Firstly, you need labels for your test data. If your test data is not labelled, you will not be able to gauge accuracy, and the evaluation will not give you a meaningful error estimate.
You need to understand that the training set contains known outputs that the model learns from. The test data has to be labelled as well, so that when the model returns its predictions on the test data, we can gauge whether the model predicted the test labels correctly.
On top of doing a train/test split, you can also do cross-validation to get a more reliable estimate of model performance. You can read more here: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
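A minimal cross-validation sketch with scikit-learn; the estimator choice and the names X_train, y_train are hypothetical:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

model = RandomForestRegressor(random_state=0)
# 5-fold CV: each fold takes a turn as the held-out validation set,
# giving a more stable error estimate than a single train/validation split.
scores = cross_val_score(model, X_train, y_train,
                         cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
```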
This will happen sometimes when a model doesn't generalize well, which in turn can happen when the model overfits the training data.
Resampling, or better sampling of the test and train data (which, as mentioned, need to be labelled), can help you get a model that generalizes better.

How important is the loss difference between training and validation data at the beginning of training a neural network?

Short question:
Is the difference between validation and training loss at the beginning of training (the first epochs) a good indicator of the amount of data that should be used?
E.g., would it be a good method to increase the amount of data until the difference at the beginning is as small as possible? It would save me time and computation.
Background:
I am working on a neural network that overfits very fast. The best result after applying many different techniques (dropout, batch normalization, reducing the learning rate, reducing the batch size, increasing the variety of the data, reducing the number of layers, increasing filter sizes, ...) is still very bad.
While the training loss decreases nicely, the validation loss plateaus too early (by "too early" I mean that the desired loss is not reached; it should be many times lower).
Since training on my dataset of ~200 samples took 24 hours for 50 epochs, I was hoping to fight the overfitting with the methods described above before increasing the amount of data. Because nothing helped, I am now at the point of adding data.
I am wondering how much data would be enough for my network to stop overfitting. I know this is not easy to answer, because it depends on the complexity of the data and of the task I am trying to solve, which is why I generalized my question to the one above.
Short answer to the short question: No
Explanation: there is a correlation between (train_loss - val_loss) and the amount of data you need to train your model, but there are plenty of other factors that could be the source of a big (train_loss - val_loss). For example, your network may be too large for the amount of data, and therefore quickly overfits. Or your validation set may not reflect the training data. Or your learning rate may be too big. Or...
So my recommendation: formulate your problem in another SO question, and ask "what might I be doing wrong?"

time-series prediction for price forecasting (problems with predictions)

I am working on a project for price-movement forecasting and I am stuck with poor-quality predictions.
At every time-step I am using an LSTM to predict the next 10 time-steps. The input is the sequence of the last 45-60 observations. I tested several different ideas, but they all seem to give similar results. The model is trained to minimize MSE.
For each idea I tried both a model predicting 1 step at a time, where each prediction is fed back as input for the next prediction, and a model directly predicting the next 10 steps (multiple outputs). For each idea I also tried using just the moving average of the previous prices as input, and then extending the input with the order book at those time-steps.
Each time-step corresponds to a second.
These are the results so far:
1) My first attempt used the moving average of the last N steps as input to predict the moving average of the next 10.
At time t, I use the ground-truth price values and the model to predict t+1, ..., t+10.
This is the result:
[Plot: predicting the moving average]
On closer inspection we can see what's going wrong: the prediction is basically a flat line that does not care much about the input data.
2) My second attempt was to predict differences instead of the price movement itself. This time, instead of simply X[t] (where X is my input matrix), the input would be X[t] - X[t-1].
This did not really help.
The plot this time looks like this:
[Plot: predicting differences]
But on closer inspection, when plotting the differences, the predictions are always basically 0.
[Plot of the differences]
At this point I am stuck and running out of ideas to try. I was hoping someone with more experience with this type of data could point me in the right direction.
Am I using the right objective to train the model? Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent the model from always predicting values similar to what it last saw? (Such predictions incur low error, but they become meaningless at that point.)
At least just a hint on where to dig for further info would be highly appreciated.
Thanks!
Am I using the right objective to train the model?
Yes, but LSTMs are always very tricky for forecasting time series, and they are very prone to overfitting compared to other time-series models.
Are there any details when dealing with this type of data that I am missing?
Are there any "tricks" to prevent your model from always predicting similar values to what it last saw?
I haven't seen your code or the details of the LSTM you are using, but make sure you use a very small network and that you avoid overfitting. Also make sure that after differencing the data, you reintegrate it before evaluating the final forecast.
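On the reintegration point, here is a tiny NumPy sketch of the transform and its inverse; prices and the model's predicted differences pred_diffs are hypothetical arrays:

```python
import numpy as np

diffs = np.diff(prices)                    # X[t] - X[t-1], the model's targets
# ... the model predicts the next differences, pred_diffs ...
# Reintegrate before evaluating: cumulative-sum the predicted
# differences back onto the last observed price.
forecast = prices[-1] + np.cumsum(pred_diffs)
```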
One trick to try is building a model that forecasts 10 steps ahead directly, instead of building a one-step-ahead model and forecasting recursively.
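A minimal Keras sketch of that direct multi-step setup, assuming windows of 60 past observations of a single feature (the window length and layer width are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

LOOKBACK, HORIZON = 60, 10  # assumptions from the question

# One small LSTM with ten output units: the whole 10-step forecast
# comes out in a single forward pass, so there is no recursive
# feedback of predictions (and no compounding of their errors).
model = keras.Sequential([
    layers.Input(shape=(LOOKBACK, 1)),
    layers.LSTM(32),
    layers.Dense(HORIZON),
])
model.compile(optimizer="adam", loss="mse")
```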