I am working on a dataset of 368,609 samples and 34 features, and I want to use a neural network in Keras to predict latency (a real value). The model has 3 hidden layers, each with 1024 neurons, and I have used dropout (50%) and L2 regularization (0.001) on each hidden layer. The problem is that I am getting a test mean absolute error of 3.5505 ms and a train mean absolute error of 3.4528 ms. The train error is smaller than the test error by a small gap; does this mean that we have an overfitting problem here?
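For reference, here is a minimal sketch in Keras of the setup described above (the 34-feature input, 1024-unit layers, 50% dropout and L2 of 0.001 come from the question; the ReLU activations and Adam optimizer are assumptions):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

# 3 hidden layers of 1024 units, each with L2(0.001) and 50% dropout,
# ending in a single linear unit for the latency regression target.
model = keras.Sequential([
    layers.Input(shape=(34,)),
    layers.Dense(1024, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1024, activation="relu", kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])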
Not really, but it is still always a good idea to see how your model generalizes to new data.
Keep something between 10%-20% of your original dataset as a test set and try to predict the output for each record in the test set.
Sometimes when we deal with the same validation set for many attempts of improving our model, we tend to overfit the evaluation dataset as well.
Keeping 3 different datasets for training, evaluation and testing usually goes a long way toward preventing this kind of overfitting.
If you get a high accuracy on your training set and a low accuracy on your test set, it often means that you are overfitting. So in your case: no, you are probably not overfitting.
Normally you would also have a validation set, so you don't fit your data to the test set.
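As a sketch of the train/validation/test split suggested above (the roughly 80/10/10 proportions are an illustrative assumption; X and y stand for your feature matrix and latency targets):

from sklearn.model_selection import train_test_split

# Hold out 10% as a final test set, then carve a further 10% off the remainder for validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, random_state=42)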
Hello everyone, so I made this CNN model.
My data:
Train folder -> 30 classes -> 800 images each -> 24,000 in total
Validation folder -> 30 classes -> 100 images each -> 3,000 in total
Test folder -> 30 classes -> 100 images each -> 3,000 in total
- I've applied data augmentation (on the train data).
- I have 5 conv layers with filters 32 -> 64 -> 128 -> 128 -> 128, each with max pooling and batch normalization.
- Added dropout of 0.5 after the flatten layer (a rough sketch of this architecture is included below).
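A minimal sketch of the architecture as described (the 150x150x3 input size, 3x3 kernels and the 512-unit dense layer are assumptions, since they are not stated in the question):

from tensorflow import keras
from tensorflow.keras import layers

# 5 conv blocks (32 -> 64 -> 128 -> 128 -> 128), each with max pooling and
# batch normalization, then dropout 0.5 after the flatten layer and a 30-way softmax.
model = keras.Sequential([layers.Input(shape=(150, 150, 3))])
for filters in [32, 64, 128, 128, 128]:
    model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.BatchNormalization())
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation="relu"))
model.add(layers.Dense(30, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])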
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
Is there any way to fix this and make validation part more stable?
Note: I plan to increase the epochs on my final model. I'm just experimenting to see what works best, since the model takes a lot of time to train. So for now I train with 20 epochs.
I've applied data augmentation (on the train data).
What does this mean? What kind of data did you add and how much? You might think I'm nitpicking, but if the distribution of the augmented data is different enough from the original data, then this will indeed cause your model to generalize poorly to the validation set.
Increasing your epochs isn't going to help here; your training loss is already decreasing reasonably. Training your model for longer is a good step if the validation loss is also decreasing nicely, but that's obviously not the case.
Some things I would personally try:
Try decreasing the learning rate (see the sketch after this list).
Try training the model without the augmented data and see how the validation loss behaves.
Try splitting the augmented data so that it's also contained in the validation set and see how the model behaves.
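For the first suggestion, a minimal sketch of recompiling with a smaller learning rate (the 1e-4 value is only an illustrative starting point; model refers to your existing Keras model):

from tensorflow.keras.optimizers import Adam

# Recompile with a lower learning rate than the Adam default of 1e-3.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])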
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
The answer is yes. The so-called 'jumps' in the validation curve may indicate that the model is not generalizing well to the validation data, and therefore your model might be overfitting.
Is there any way to fix this and make validation part more stable?
To fix this you can try the following:
Increase the size of your training set.
Apply regularization techniques.
Use early stopping (a small sketch is shown after this list).
Reduce the complexity of your model.
Tune hyperparameters such as the learning rate.
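A minimal sketch of early stopping in Keras (the patience value and the train_data/val_data placeholders are assumptions):

from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 3 epochs and keep the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(train_data, epochs=50, validation_data=val_data, callbacks=[early_stop])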
I have some data (Reuters).
I combine the training and test data into one large data set.
I shuffle the entire data set.
I then split the data set 50/50: 50% for training and 50% for testing.
I then set up a sequential model (Embedding, GRU, BatchNormalization, Dense).
I compile with the adam optimizer and loss = sparse_categorical_crossentropy.
I run fit on it for 5 epochs, with validation_split = 0.2 (a sketch of this pipeline is included below).
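A rough sketch of that pipeline (the vocabulary size, sequence length, embedding/GRU widths and the 46 Reuters topic classes are assumptions, not stated above):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load Reuters, combine train and test, shuffle, then re-split 50/50.
(x1, y1), (x2, y2) = keras.datasets.reuters.load_data(num_words=10000)
x, y = np.concatenate([x1, x2]), np.concatenate([y1, y2])
idx = np.random.permutation(len(x))
x, y = x[idx], y[idx]
x = pad_sequences(x, maxlen=200)  # assumed maximum sequence length
half = len(x) // 2
x_train, y_train, x_test, y_test = x[:half], y[:half], x[half:], y[half:]

model = keras.Sequential([
    layers.Embedding(10000, 64),
    layers.GRU(64),
    layers.BatchNormalization(),
    layers.Dense(46, activation="softmax"),  # 46 Reuters topics
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_split=0.2)
print(model.evaluate(x_test, y_test))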
I get a val_acc of 95%. I am very happy with this.
I then call evaluate with the test data, and get acc = 71%
When I repeat evaluate with the train data (instead of test data), just to see what happens, I get 95% acc (as I should).
I am trying to understand what is wrong with my acc score with the test data.
The data is shuffled between train and test each time, so it cannot be related to the data.
I tried the checkpoint save/restore trick, that did not seem to help.
I have reduced epochs in case of overfitting, but this has not helped.
I am curious what is going on. The only thing I can think of is there is some sort of overfitting going on that is carrying over to test, but is NOT impacting val_acc.
(Side note: the 20% data saved for validation during fit should not cause an overfitting problem, correct?)
Any ideas what could be wrong with my approach?
Thank you.
Edit 1:
Okay, I might be onto something. I noticed something odd about the test data that is included with the Reuters dataset, vs. the training data.
In previous experiments, results were always poor when evaluating against the test set, vs. an unused portion of the training data.
This time, I combined the training and test data sets into one set, and shuffled.
Then shuffled some more, and then generated pseudo data around it, and shuffled some more.
Then I peeled off 20% to use as test data.
This time it worked. I got 95% for training, validation and evaluation accuracy.
I tried searching for information on the test set to see if anyone came up with anything but found no results. So I'm going to presently chalk it up to test data that is significantly different from the training set.
Edit 2:
Never mind edit 1. I think I was corrupting my test data with a pseudo-generated version of the training data that was close enough to work.
The only conclusion I can draw is that there is a lot of overfitting going on during my training AND using validation data during training is misleading, as it is also being overfit.
(Why validation data is being used to help overfit, I do not know).
Overfitting occurs for many reasons. To overcome this problem you can try different methods, such as:
L1/L2 regularization.
Dropout layers: By using dropout layers in the network, we ignore a subset of units of our network with a set probability. Using dropout, we can reduce interdependent learning among units, which may have led to overfitting.
Early stopping: Monitors the performance of the model for every epoch on a held-out validation set during the training, and terminates the training conditional on the validation performance.
Feature selection: Improves the machine learning process and increases the predictive power of machine learning algorithms by selecting the most important variables and eliminating unnecessary and irrelevant features (a small sketch follows below).
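For the last point, a generic scikit-learn sketch of univariate feature selection (not specific to this problem; X, y and k=20 are placeholders):

from sklearn.feature_selection import SelectKBest, f_classif

# Keep the k features that score highest on a univariate test against the labels.
selector = SelectKBest(score_func=f_classif, k=20)
X_selected = selector.fit_transform(X, y)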
I am training an autoencoder DNN for a regression problem and need suggestions on how to improve the training process.
The total number of training samples is about 100,000. I use Keras to fit the model, setting validation_split = 0.1. After training, I plotted the loss curves and got the following picture. As can be seen there, the validation loss is unstable and its mean is very close to the training loss.
My question is: based on this, what is the next step I should try to improve the training process?
[Edit on 1/26/2019]
The details of network architecture are as follows:
It has 1 latent layer of 50 nodes. The input and output layers have 1000 nodes each. The activation of the hidden layer is ReLU. The loss function is MSE. For the optimizer, I use Adadelta with default parameter settings. I also tried setting lr = 0.5, but got very similar results. The features of the data have been scaled to between -10 and 10, with a mean of 0.
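A minimal sketch of the autoencoder as described (a 1000-50-1000 layout with ReLU in the latent layer, MSE loss and Adadelta; the linear output activation and X, the scaled feature matrix, are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

# 1000 -> 50 -> 1000 autoencoder trained to reconstruct its input.
autoencoder = keras.Sequential([
    layers.Input(shape=(1000,)),
    layers.Dense(50, activation="relu"),
    layers.Dense(1000, activation="linear"),
])
autoencoder.compile(optimizer=keras.optimizers.Adadelta(), loss="mse")
history = autoencoder.fit(X, X, epochs=100, validation_split=0.1)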
From the graph provided, it appears the network could not approximate the function that relates the input to the output.
If your features are too diverse in scale, i.e. one of them is large while others have very small values, then you should normalize the feature vector.
For better training and testing results, you can follow these tips:
Use a small network. A network with one hidden layer is enough.
Use activations in the input and hidden layers; the output layer must have a linear function. Use the ReLU activation function.
Prefer a small learning rate like 0.001. Use the RMSProp optimizer; it works fine on most regression problems.
If you are not using the mean squared error function, use it.
Try slow and steady learning rather than fast learning (a short sketch of these settings follows below).
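As a one-line sketch of the optimizer settings suggested above (RMSProp with a learning rate of 0.001 and MSE loss), applied to whatever Keras model you build:

from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001), loss="mse")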
I have an FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on the number of hidden units). (ReLU, Adam, MSE, same number of hidden units per layer, tf.keras)
32 neurons:
128 neurons:
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
As far as I know, it is better to have a network that is too large and try to regularize it via L2 regularization or dropout than to lower the network's capacity, because a larger network will have more local minima but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(n_neurons, activation='relu', input_shape=(n_features,)))  # n_features: number of input features
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')
Hyperparameter tuning is generally the hardest step in ML. In general, we try different values randomly, evaluate the model, and choose the set of values that gives the best performance.
Getting back to your question, you have a high variance problem (good in training, bad in testing).
There are eight things you can do, in order:
Make sure your test and training distributions are the same.
Make sure you shuffle and then split the data into two sets (test and train).
A good train:test split will be 105:15K.
Use a deeper network with dropout/L2 regularization.
Increase your training set size.
Try early stopping.
Change your loss function.
Change the network architecture (switch to ConvNets, LSTMs, etc.).
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
because a larger network will have more local minima.
Nope, this is not quite true. In reality, as the number of input dimensions increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima; it is very rare. The derivatives across all the dimensions of the working space must be zero for a local/global minimum, which is highly unlikely in a typical model.
One more thing: I noticed you are using a linear unit for the last layer. I suggest you go for ReLU instead. In general we do not need negative values in regression, and it will reduce the test/train error.
Take the MSE: 1/2 * (y_true - y_prediction)^2.
Because y_prediction can be a negative value, the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.
Using a ReLU for the last layer makes sure that y_prediction is non-negative, so a lower error can be expected.
Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et al.'s Deep Learning book, which is available for free online:
Chapter 7 (Regularization): The most important point is data; one can and should avoid regularization if they have large amounts of data that best approximate the distribution. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4 (Data augmentation): With regard to data, Goodfellow talks about data augmentation and inducing regularization by injecting noise (most likely Gaussian), which mathematically has the same effect. This noise works well with regression tasks, as you limit the model from latching onto a single feature to overfit.
Section 7.8 (Early stopping): This is useful if you just want a model with the best test error. But again, this only works if your data allows the training to infer the test data. If there is an immediate increase in test error, the training would stop immediately.
Section 7.12 (Dropout): Just applying dropout to a regression model doesn't necessarily help. In fact, "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model to not rely on single features, but in regression all inputs might be required to compute a value rather than classify.
Chapter 11 (Practical Methodology): This chapter emphasises the use of baseline models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
The bottom line is that you can't just play with the model and hope for the best. Check the data, understand what is required, and then apply the corresponding techniques. For more details, read the book; it's very good. Your starting point should be a simple regression model with 1 layer and very few neurons; see what happens, then experiment incrementally.
My team is training a CNN in Tensorflow for binary classification of damaged/acceptable parts. We created our code by modifying the cifar10 example code. In my prior experience with neural networks, I always trained until the loss was very close to 0 (well below 1).
However, we are now evaluating our model with a validation set during training (on a separate GPU), and it seems like the precision stopped increasing after about 6.7k steps, while the loss is still dropping steadily after over 40k steps. Is this due to overfitting? Should we expect to see another spike in accuracy once the loss is very close to zero?
The current max accuracy is not acceptable. Should we kill it and keep tuning? What do you recommend? Here is our modified code and graphs of the training process.
https://gist.github.com/justineyster/6226535a8ee3f567e759c2ff2ae3776b
Precision and Loss Images
A decrease in binary cross-entropy loss does not imply an increase in accuracy. Consider label 1, predictions 0.2, 0.4 and 0.6 at timesteps 1, 2, 3 and a classification threshold of 0.5. Timesteps 1 and 2 will produce a decrease in loss but no increase in accuracy.
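To make that concrete, a quick numeric sketch (binary cross-entropy reduces to -log(p) when the true label is 1):

import numpy as np

# True label is 1; predictions 0.2, 0.4, 0.6 at timesteps 1, 2, 3; threshold 0.5.
for step, p in enumerate([0.2, 0.4, 0.6], start=1):
    loss = -np.log(p)
    predicted_class = int(p > 0.5)
    print(f"timestep {step}: loss = {loss:.3f}, predicted class = {predicted_class}")
# The loss drops from 1.609 to 0.916 between timesteps 1 and 2, yet the predicted class
# (and hence the accuracy) only changes at timestep 3, when the prediction crosses 0.5.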
Ensure that your model has enough capacity by first overfitting the training data. Once it does overfit, reduce overfitting by using regularization techniques such as dropout, L1 and L2 regularization, and data augmentation.
Last, confirm your validation data and training data come from the same distribution.
Here are my suggestions. One of the possible problems is that your network starts to memorize the data, so yes, you should increase regularization.
Update:
Here I want to mention one more problem that may cause this:
The class balance ratio in the validation set may be very different from what you have in the training set. As a first step, I would recommend trying to understand what your test data (the real-world data your model will face at inference time) looks like: what its balance ratio is, and other similar characteristics. Then try to build a train/validation split with nearly the same characteristics as the real data.
Well, I faced a similar situation when I used a Softmax function in the last layer instead of a Sigmoid for binary classification.
My validation loss and training loss were decreasing, but the accuracy of both remained constant. This taught me why sigmoid is used for binary classification.
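As a sketch of what that change looks like in Keras (only the final layer and loss are shown; model stands for the rest of the network):

from tensorflow.keras import layers

# Binary classification: a single sigmoid unit with binary cross-entropy,
# instead of a 2-unit softmax output.
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])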