I used 100,000 samples to train a general model in Keras and achieved good performance. Then, for one particular sample, I want to use the trained weights as the initialization and continue optimizing the weights to further reduce the loss on that particular sample.
However, a problem occurred. First, I load the trained weights easily with the Keras API; then I evaluate the loss on that one particular sample, and the loss is close to the validation loss seen during training of the model. I think that is normal. However, when I use the trained weights as the initialization and further optimize them on that one sample with model.fit(), the loss is really strange: it is much higher than the evaluate() result and only gradually becomes normal after several epochs.
I find it strange that, for the same single sample and the same loaded model weights, model.fit() and model.evaluate() return different results. I use batch normalization layers in my model and wonder whether that may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate, and how can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been extensively discussed here, here, here and here.
The fit() function loss includes contributions from:
Regularizers: L1/L2 regularization losses are added during training, increasing the loss value.
Batch norm variations: during training, batch norm uses the mean and variance of the current batch (while also accumulating running statistics), whereas evaluation uses the accumulated running statistics, irrespective of whether batch norm is set to trainable or not. See here for more discussion on that.
Multiple batches: of course, the training loss is averaged over multiple batches. So if you take the average of the first 100 batches but evaluate on the 100th batch only, the results will be different.
For evaluate(), on the other hand, the model just does a forward pass in inference mode and you get the loss value; there is nothing random here.
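As a concrete illustration, here is a minimal sketch (a toy, untrained model with assumed shapes) showing how the two calls run batch norm in different modes and can therefore report different losses:

import numpy as np
import tensorflow as tf

# Toy model with a batch norm layer; sizes and shapes are arbitrary assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

x = np.random.randn(1, 10).astype("float32")
y = np.random.randn(1, 1).astype("float32")

# evaluate() runs in inference mode: batch norm uses its moving statistics.
print("evaluate loss:", model.evaluate(x, y, verbose=0))

# fit() runs in training mode: batch norm normalizes with the statistics of this
# single-sample batch (and any regularizer terms are added to the loss), so the
# reported value can differ noticeably from the evaluate() result.
history = model.fit(x, y, epochs=1, verbose=0)
print("fit loss:", history.history["loss"][0])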
The bottom line is that you should not compare training and validation loss (or fit and evaluate loss); those functions do different things. Look at other metrics to check whether your model is training fine.
Related
Hello everyone, so I made this CNN model.
My data:
Train folder -> 30 classes -> 800 images each -> 24,000 altogether
Validation folder -> 30 classes -> 100 images each -> 3,000 altogether
Test folder -> 30 classes -> 100 images each -> 3,000 altogether
- I've applied data augmentation (on the train data).
- I have 5 conv layers with filters 32->64->128->128->128, each followed by max pooling and batch normalization.
- I added dropout 0.5 after the flatten layer (a rough sketch of the model follows).
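A rough sketch of this architecture in Keras; the input image size and the 3x3 kernels are assumptions, and the 30 output classes come from the dataset description:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(tf.keras.Input(shape=(150, 150, 3)))  # assumed image size
for filters in [32, 64, 128, 128, 128]:
    model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.BatchNormalization())
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(30, activation="softmax"))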
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
Is there any way to fix this and make validation part more stable?
Note: I plan to increase the number of epochs on my final model; I'm just experimenting to see what works best, since the model takes a lot of time to train. So for now I train with 20 epochs.
I've applied data augmentation (on the train data).
What does this mean? What kind of data did you add and how much? You might think I'm nitpicking, but if the distribution of the augmented data is different enough from the original data, then this will indeed cause your model to generalize poorly to the validation set.
Increasing your epochs isn't going to help here; your training loss is already decreasing reasonably. Training your model for longer is a good step if the validation loss is also decreasing nicely, but that's obviously not the case.
Some things I would personally try:
Try decreasing the learning rate (a minimal example follows this list).
Try training the model without the augmented data and see how the validation loss behaves.
Try splitting the augmented data so that it's also contained in the validation set and see how the model behaves.
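A minimal sketch of the first suggestion, assuming the 30-class CNN from the question is compiled with Adam and categorical cross-entropy; the 1e-4 value is just an illustrative starting point, one order of magnitude below the Keras default:

from tensorflow.keras.optimizers import Adam

# Lowering the learning rate often smooths out noisy validation curves.
model.compile(optimizer=Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])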
Train part looks good. Validation part has a lot of 'jumps' though. Does it overfit?
The answer is yes: the so-called 'jumps' in the validation part may indicate that the model is not generalizing well to the validation data, and therefore your model might be overfitting.
Is there any way to fix this and make validation part more stable?
To fix this you can use the following:
Increase the size of your training set
Apply regularization techniques
Use early stopping (a minimal Keras example follows this list)
Reduce the complexity of your model
Try different hyperparameters, such as the learning rate
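A minimal early-stopping sketch in Keras; the patience value and the train_data/val_data names are assumptions:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 5 epochs, keeping the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=5,
                           restore_best_weights=True)
model.fit(train_data, validation_data=val_data,
          epochs=100, callbacks=[early_stop])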
I am training an autoencoder DNN for a regression problem and need suggestions on how to improve the training process.
The total number of training samples is about 100,000. I use Keras to fit the model, setting validation_split = 0.1. After training, I plotted the change in the loss function and got the following picture. As can be seen there, the validation loss is unstable and its mean is very close to the training loss.
My question is: based on this, what is the next step I should try to improve the training process?
[Edit on 1/26/2019]
The details of network architecture are as follows:
It has 1 latent layer of 50 nodes. The input and output layers have 1000 nodes each. The activation of the hidden layer is ReLU and the loss function is MSE. For the optimizer, I use Adadelta with default parameter settings; I also tried setting lr=0.5, but got very similar results. The different features of the data have been scaled to between -10 and 10, with a mean of 0.
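For reference, here is a sketch of that architecture as described; the linear output activation is an assumption, inferred from the MSE loss:

import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(1000,))                        # 1000-d input
latent = layers.Dense(50, activation="relu")(inputs)        # 50-node latent layer
outputs = layers.Dense(1000, activation="linear")(latent)   # 1000-d reconstruction

autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer=tf.keras.optimizers.Adadelta(), loss="mse")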
From the graph you provided, it looks like the network could not approximate the function that relates the input to the output.
If your features are on very different scales, i.e. one of them is large while the others have very small values, then you should normalize the feature vector. You can read more here.
For better training and testing results, you can follow these tips (a minimal sketch of this setup follows the list):
Use a small network. A network with one hidden layer is enough.
Apply activations in the input as well as the hidden layers; the output layer must have a linear function. Use the ReLU activation function.
Prefer a small learning rate like 0.001, and use the RMSProp optimizer; it works fine on most regression problems.
If you are not using the mean squared error function, use it.
Prefer slow and steady learning over fast learning.
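A minimal sketch of this suggested setup, reusing the 1000-d input/output from the question; the 50-node hidden layer size is an assumption:

from tensorflow.keras import layers, models, optimizers

model = models.Sequential([
    layers.Dense(50, activation="relu", input_shape=(1000,)),  # single hidden layer
    layers.Dense(1000, activation="linear"),                   # linear output
])
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001), loss="mse")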
I have an FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on the number of hidden units). (ReLU, Adam, MSE, same number of hidden units per layer, tf.keras.)
[Loss curves for the 32-neuron and 128-neuron runs not shown.]
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
As far as I know, it is better to have a network that is too large and regularize it via L2 regularization or dropout than to lower the network's capacity, because a larger network will have more local minima but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(n_neurons, activation='relu'))  # n_neurons is the hyperparameter being tuned
model.add(Dense(n_neurons, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(optimizer='adam', loss='mse')
Hyperparameter tuning is generally the hardest step in ML. In general, we try different values randomly, evaluate the model, and choose the set of values that gives the best performance.
Getting back to your question: you have a high variance problem (good in training, bad in testing).
There are eight things you can do, in order:
Make sure your test and training distribution are same.
Make sure you shuffle and then split the data into two sets (test and train)
A good train:test split will be 105:15K
Use a deeper network with Dropout/L2 regularization (a sketch follows this list).
Increase your training set size.
Try Early Stopping
Change your loss function
Change the network architecture (switch to ConvNets, LSTMs, etc.).
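A hedged sketch of the Dropout/L2 suggestion applied to the model from the question; the 1e-4 L2 factor and 0.2 dropout rate are illustrative assumptions:

from tensorflow.keras import Sequential, regularizers
from tensorflow.keras.layers import Dense, Dropout

model = Sequential([
    Dense(n_neurons, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    Dropout(0.2),
    Dense(n_neurons, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    Dropout(0.2),
    Dense(1, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')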
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
because a larger network will have more local minima.
Nope, this is not quite true. In reality, as the number of input dimensions increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima; it is very rare. For a local/global minimum, the derivatives across all dimensions of the parameter space must be zero, which is highly unlikely in a typical model.
One more thing: I noticed you are using a linear unit for the last layer. I suggest you go for ReLU instead. In general we do not need negative values in regression, and it will reduce the train/test error.
Take this: in MSE, the per-sample loss is 1/2 * (y_true - y_prediction)^2. Because y_prediction can be a negative value, the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.
Using a ReLU for the last layer makes sure that y_prediction is non-negative, so a lower error can be expected.
Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et al.'s Deep Learning book, which is available for free online:
Chapter 7: Regularization. The most important point is data: one can and should avoid regularization if they have large amounts of data that approximate the distribution well. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Data augmentation. With regard to data, Goodfellow talks about data augmentation and inducing regularization by injecting noise (most likely Gaussian), which mathematically has the same effect. This noise works well with regression tasks, as it limits the model from latching onto a single feature to overfit.
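A hedged sketch of noise injection in Keras; the 0.1 standard deviation, the layer sizes, and the n_features placeholder are assumptions, and the GaussianNoise layer is only active during training:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, GaussianNoise

model = Sequential([
    GaussianNoise(0.1, input_shape=(n_features,)),  # noise added to the inputs during training only
    Dense(64, activation='relu'),
    Dense(1, activation='linear'),
])
model.compile(optimizer='adam', loss='mse')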
Section 7.8: Early stopping. This is useful if you just want a model with the best test error. But again, this only works if your data allows the training to infer the test data. If there is an immediate increase in test error, the training would stop immediately.
Section 7.12: Dropout. Just applying dropout to a regression model doesn't necessarily help. In fact, "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model not to rely on single features, but in regression all inputs might be required to compute a value rather than classify.
Chapter 11: Practical Methodology emphasises the use of baseline models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
The bottom line is that you can't just play with the model and hope for the best. Check the data, understand what is required, and then apply the corresponding techniques. For more details read the book; it's very good. Your starting point should be a simple regression model with one layer and very few neurons; see what happens, then incrementally experiment.
So I was reading the TensorFlow getting-started tutorial and found it very hard to follow. A lot of explanations were missing about each function and why it is necessary (or not).
In the tf.estimator section, what is the meaning of the "x_eval" and "y_eval" arrays, or what are they supposed to be? The x_train and y_train arrays give the desired output (the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necessary? How did they choose those values?
The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is that one has
num_epochs=None, shuffle=True
and the other has
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, what the difference between the two is, or whether both are necessary.
In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is: how do I get W and b? In the previous way of doing it (not using tf.estimator) they explicitly found that W=-1 and b=1. If I were building a more complex neural network, involving biases and weights, I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using TensorFlow: to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getting started" tutorial. I personally think it leaves a lot to be desired, since it's very unclear what each thing does and you can at best guess.
I agree with you that tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimize the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = -x_eval + 1.
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple passes over the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need.
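As a rough sketch of that split using the TF 1.x numpy_input_fn helper (the toy arrays are illustrative values in the spirit of the tutorial, with x=5 paired with -4.1 as in the question):

import numpy as np
import tensorflow as tf

# Training targets follow y = -x + 1 exactly; eval targets only approximately.
x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7., 0.])

# Training input: repeat the data indefinitely (num_epochs=None) and shuffle it.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
# Evaluation input: a single, unshuffled pass over the validation data.
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1, shuffle=False)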
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
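Putting the two together (a hedged note, reusing the numbers above):

# With 1,000 images and a batch size of 100, one epoch = 10 steps (batches),
# so steps=1000 corresponds to 100 epochs' worth of batches. Training stops
# when the step limit is reached or the input function runs out of data
# (e.g. when its num_epochs is finite), whichever comes first.
estimator.train(input_fn=train_input_fn, steps=1000)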
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
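For completeness, a sketch of how you could recover W and b from a trained estimator; the exact variable names depend on the estimator, so print the list first (the names in the comments are only guesses):

# List all variable names first, since the exact names depend on the estimator.
print(estimator.get_variable_names())
# Then read the ones corresponding to W and b, for example (names are guesses):
# W = estimator.get_variable_value("linear/linear_model/x/weights")
# b = estimator.get_variable_value("linear/linear_model/bias_weights")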
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.
I trained a classification network using TensorFlow with batch normalization in every convolutional layer. When I predict on a balanced test set in which every category is included, the accuracy is normal. However, if I choose any one specific category from the test set, the accuracy is low, even zero.
But when 3 categories are included in the test set, the accuracy becomes higher. As we all know, the weights are fixed once the model has finished training, yet I find that the balance of the test set has a great influence on the prediction accuracy.
I suspected that batch normalization had an influence on this, so I removed all batch normalization and retrained the model. This time, when I predicted pictures of only one category, the accuracy was normal.
Does anyone know why? Thanks!
You're right. If your training set is unbalanced, you compute and accumulate mean values (for every layer) that are skewed in favor of the majority class.
In fact, you're not "normalizing" but instead, you're making the unbalancing problem worse.
Use batch normalization when you have a balanced training set and you can be sure that your batches will contain a balanced number of samples. This gives you optimal results.
However, since you added in the comments that you're using tf.contrib.layers.conv2d(x, num_output, kernel_size, stride, padding, activation_fn, normalizer_fn=tf.contrib.layers.batch_norm),
I spotted the problem: normalizer_fn calls the function you pass (batch_norm), but it uses the default parameters. By default, is_training equals True, so even during the test phase you're computing the mean and variance over the batch. Read the documentation of tf.contrib.layers.conv2d carefully and use normalizer_params to pass is_training=True when training and is_training=False when testing/validating.
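A hedged sketch of that fix with the TF 1.x contrib API, where x stands for the input tensor and the layer parameters are assumptions:

import tensorflow as tf  # TF 1.x, where tf.contrib is available

# A boolean placeholder lets you switch batch norm between batch statistics
# (training) and accumulated statistics (inference) at feed time.
is_training = tf.placeholder(tf.bool, name="is_training")

net = tf.contrib.layers.conv2d(
    x, num_outputs=64, kernel_size=3, stride=1, padding="SAME",
    activation_fn=tf.nn.relu,
    normalizer_fn=tf.contrib.layers.batch_norm,
    normalizer_params={"is_training": is_training})
# Feed is_training=True while training and is_training=False when testing/validating.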