I am training a neural network for regression with TensorFlow, and getting strange behaviour on my loss curves. The task is to predict the motion of an object in an image, when an action is applied to the object. So the network takes in an image, and an action, and outputs the motion.
The image input is followed by three CNN layers, and in parallel, the action input is followed by a dense layer. These are then concatenated, and followed by two dense layers, before the output. All layers have ReLUs. Data is normalised to have zero mean and standard deviation of one.
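For concreteness, the architecture is roughly like the following Keras sketch (the image size, action dimension, filter counts and output dimension here are placeholders, not my exact values):

```python
# Rough sketch of the two-branch network: CNN on the image, dense on the action,
# concatenated and followed by two dense layers before the regression output.
import tensorflow as tf
from tensorflow.keras import layers, Model

image_in = layers.Input(shape=(64, 64, 3), name="image")   # image size is a placeholder
action_in = layers.Input(shape=(4,), name="action")        # action dimension is a placeholder

x = layers.Conv2D(32, 3, activation="relu")(image_in)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.Conv2D(64, 3, activation="relu")(x)
x = layers.Flatten()(x)

a = layers.Dense(32, activation="relu")(action_in)

h = layers.Concatenate()([x, a])
h = layers.Dense(128, activation="relu")(h)
h = layers.Dropout(0.5)(h)
h = layers.Dense(64, activation="relu")(h)
h = layers.Dropout(0.5)(h)
motion_out = layers.Dense(2, name="motion")(h)             # output dimension is a placeholder

model = Model([image_in, action_in], motion_out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
```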
Below is the training curve:
The strange behaviour is that, whilst the training loss decreases over time, the validation loss increases right from the start. Usually, overfitting is diagnosed when the validation curve first drops and then rises again while the training curve keeps decreasing far below it. In my case, however, the validation curve never decreases at all.
Instead, it is as if the network is overfitting right from the very first epoch. In fact, the validation curve seems to follow the opposite trajectory to the training curve. Every improvement in the training prediction results in an opposite effect on the validation prediction.
I have also tried varying the step size (I am using Adam, and in this graph, the step size is 0.0001, then reduces to 0.00001 at epoch 100). My network uses dropout on all the dense layers. I have also tried reducing the number of parameters in the network to prevent overfitting, but the same behaviour occurs. I have a batch size of 50.
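Concretely, the training loop is roughly this (reusing the model sketch above; the dataset variable names are placeholders):

```python
# Step schedule: Adam at 1e-4 for the first 100 epochs, then 1e-5; batch size 50.
import tensorflow as tf

def step_schedule(epoch, lr):
    # epoch is 0-indexed; drop the learning rate at epoch 100
    return 1e-4 if epoch < 100 else 1e-5

model.fit(
    [train_images, train_actions], train_motions,          # placeholder arrays
    validation_data=([val_images, val_actions], val_motions),
    epochs=200,
    batch_size=50,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(step_schedule)],
)
```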
What could be the diagnosis of this behaviour? Is the network overfitting, or something else? If it is overfitting, then why do my attempts to reduce the number of parameters and add dropout still result in this same effect? And why does the overfitting occur immediately, without the validation loss decreasing at all?
Thank you!
I am working on a CNN model which has 4 conv layers and 3 dense layers. The dataset has around 28,000 training images and 7,000 test images. The model has saved checkpoints, and I have trained it several times, achieving 60% accuracy so far; during training, the learning rate was reduced to 2.6214403e-07 (as I used ReduceLROnPlateau with factor 0.4). My question: if I increased the learning rate, say to 1e-4, and resumed the training, how would it affect my model? Is it a good idea?
(plot: accuracy vs. epoch)
If your learning curve plateaus immediately and doesn't change much beyond the initial few epochs (as in your case), then your learning rate is too low. While you can resume training with higher learning rates, it would likely render any progress of the initial epochs meaningless. Since you typically only decrease the learning rate between epochs, and given the slow initial progress of your network, you should simply retrain with an increased initial learning rate until you see larger changes in the first few epochs. You can then identify the point of convergence as the point where overfitting sets in (test accuracy goes down while train accuracy goes up) and stop there. If this point is still "unnecessarily late", you can additionally reduce the amount by which the learning rate decays to make faster progress between epochs.
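As a rough sketch of what I mean (assuming a Keras workflow; build_model and the dataset objects are placeholders): retrain from scratch with a larger initial learning rate, a gentler plateau decay, and early stopping at the convergence point.

```python
import tensorflow as tf

model = build_model()  # placeholder for your model definition
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # larger initial LR
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

plateau = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_accuracy",
    factor=0.5,        # decay more gently than 0.4
    patience=5,
    min_lr=1e-6,       # keep the LR from collapsing to ~1e-7
)
early = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",
    patience=10,
    restore_best_weights=True,  # stop once validation accuracy stalls
)

model.fit(train_ds, validation_data=val_ds, epochs=100,
          callbacks=[plateau, early])
```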
While training my CNNs I usually aim to maximize the validation accuracy to 1.0 (i.e. 100%). I know that on the other hand it would not make much sense to aim for a training accuracy of 1.0, because we don't want our model to memorize the training data itself.
However, what about a "mixed" approach --
wouldn't it make sense to maximize both training and validation accuracy?
Let's first address what the purpose of validation is:
When we're training a neural net, we are trying to teach the neural net to perform well at a given task for the entire population of input/output pairs in the task. However, it is unrealistic to have the entire dataset, especially for high dimensional inputs such as images. Therefore, we create a training dataset that contains a (hopefully) large amount of that data. We hope when we're training a neural net that by maximizing performance on the training dataset, we maximize performance on the entire dataset. This is called generalization.
How do we know that the neural net is generalizing well? As you mentioned, we don't want to simply memorize the training data. That is where validation accuracy comes in. We feed data that the neural net did not train on through the network to evaluate its performance. Therefore, the purpose of the validation set is to measure generalization.
You should watch both the training and validation accuracy. The difference between the validation and training accuracy is called the generalization gap, which will tell you how well your neural net is generalizing to new inputs. You want both the training and validation accuracy to be high, and the difference between them to be minimal.
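As a small illustration (assuming a Keras model compiled with metrics=["accuracy"]; older Keras versions use the keys "acc" / "val_acc" instead), you can read both curves and their gap straight off the History object:

```python
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=50)

train_acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]

# The generalization gap per epoch: training accuracy minus validation accuracy.
for epoch, (t, v) in enumerate(zip(train_acc, val_acc), start=1):
    print(f"epoch {epoch:3d}  train {t:.3f}  val {v:.3f}  gap {t - v:+.3f}")
```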
Technically, if you could do so, that would be awesome. You wouldn't say a model is overfitting unless there is a gap between validation accuracy and training accuracy; if their values are close, both high or both low, then the model is not overfitting. Ideally you want high accuracy on all samples: training, validation and testing. But as I said, "ideally": you just don't care as much about the training samples.
I'm getting really high variability in both the accuracy and loss between epochs, as high as 10%. It happens to my accuracy all the time, and to my loss when I start adding in dropout. However, I really need the dropout; any ideas on how to smooth it out?
It is hard to say anything concrete without knowing exactly what you are doing. But since you mentioned that your dataset is very small (500 samples), I'd say that your 10% performance jumps are not surprising. Still, a few ideas:
definitely use a bigger dataset if you can. If it is not possible to collect a bigger dataset, try to augment whatever you have (see the sketch after this list).
try a smaller dropout and see how it goes, try different regularizers (dropout is not the only option)
your data is small, so you can afford to run more than 200 iterations
see how your model performs on the test set, it is possible that it just severely overfitted the data
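For the augmentation point above, a minimal sketch (assuming TF 2.6+ Keras preprocessing layers and a tf.data pipeline named train_ds; the parameter values are guesses to tune for your data):

```python
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# Apply on the fly during training; leave the validation data untouched.
train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))
```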
Besides the fact that the dataset is very small: during training with dropout regularization, the loss function is no longer well defined, and I presume the accuracy is also biased. Therefore, any tracked metric should be assessed without dropout. It seems that Keras does not switch dropout off while calculating the accuracy during training.
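A minimal sketch of what I mean (the variable names are placeholders): the metrics printed by fit() for the training set are computed with dropout active, while evaluate() and calling the model with training=False disable dropout and give the "clean" numbers.

```python
clean_train_loss, clean_train_acc = model.evaluate(x_train, y_train, verbose=0)  # dropout off
val_loss, val_acc = model.evaluate(x_val, y_val, verbose=0)                      # dropout off

# Equivalently, for a single batch:
preds = model(x_batch, training=False)   # dropout is inactive here
```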
I am working on a deep learning (CNN + AEs) approach on facial images.
I have:
an input layer of 112*112*3 for the facial images
3 convolution + max pooling + ReLU layers
2 fully connected layers of 512 neurons each, with 50% dropout to avoid overfitting
a last output layer with 10 neurons, since I have 10 classes
I also use the reduced mean of the softmax cross entropy as the loss, plus L2 regularization.
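In Keras terms, my setup corresponds roughly to the following sketch (filter counts and the L2 coefficient are placeholders; my actual implementation uses the low-level TensorFlow API):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2,
                  input_shape=(112, 112, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu", kernel_regularizer=l2),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(512, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    layers.Dense(512, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    layers.Dense(10),   # logits for the 10 classes
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
```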
For training, I divided my dataset into 3 groups:
60% for training
20% for validation
20% for evaluation
The problem is that after a few epochs the validation error rate stays at a fixed value and never changes. I have used TensorFlow to implement my project.
I haven't had such a problem with CNNs before, so I think this is the first time. I have checked the code; it's based on the TensorFlow documentation, so I don't think the problem is with the code. Maybe I need to change some parameters, but I am not sure.
Any idea about common solutions for such problem?
Update:
I changed the optimizer from momentum to Adam with the default learning rate. Now the validation error does change, but it is lower than the mini-batch (training) error most of the time, even though both use the same batch size.
I have tested the model with and without biases (initialized to 0.1), but there is no good fit yet.
Update
I fixed the issue; I will update with more details soon.
One common solution that I found helpful for this type of problem is using TensorBoard. You can visualize training performance after each epoch, for different points in the computational graph. Adding key metrics is worth it, since you can see how training progresses after applying changes to the adaptive learning rate, batch size, neural network architecture, dropout / regularization, number of GPUs, etc.
Here is the link that I found helpful to add these details:
https://www.tensorflow.org/how_tos/graph_viz/#runtime_statistics
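For example, with the Keras TensorBoard callback (the log directory name is arbitrary; model and data are placeholders):

```python
import tensorflow as tf

tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir="logs/run1",
    histogram_freq=1,    # log weight histograms every epoch
    write_graph=True,    # export the computational graph for the Graphs tab
)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[tensorboard_cb])

# Then inspect the curves with:  tensorboard --logdir logs
```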
I am currently running some tests with simple autoencoders. I wrote an autoencoder myself entirely in TensorFlow and, in addition, copied and pasted the code from this Keras blog entry: https://blog.keras.io/building-autoencoders-in-keras.html (just to have a different autoencoder implementation).
When I was testing different architectures, I started with a single layer and a couple of hidden units in this layer. I noticed that when I reduce the number of hidden units to only a single (!) hidden unit, I still get the same training and test losses I get with bigger architectures (up to a couple of thousand hidden units). In my data, the worst loss is 0.5. Any architecture I've tried achieves ~ 0.15.
Just out of curiosity, I reduced the number of hidden units in the only existing hidden layer to zero (which I know doesn't make any sense). However, I still get a training and test loss of 0.15. I assumed that this strange behavior might be due to the bias in the decoding layer (when I reconstruct the input). Initially, I've set the bias variable (in TF) to trainable=True. So now I guess even without any hidden units, the model still learns the bias in the decoding layer which might lead to the reduction of my loss from 0.5 to 0.15.
In the next step, I set the bias in the decoding layer to trainable=False. Now the model (with no hidden units) doesn't learn anything, just as I would have expected (loss = 0.5). With one hidden unit, however, I again get test and training losses of around 0.15.
Following this line of thought, I set the bias in the encoding layer to trainable=False as well, since I wanted to avoid my architecture learning only the biases. So now, only the weights of my autoencoder are trainable. This still works for a single hidden unit (and of course just a single hidden layer). Surprisingly, though, it only works for a single-layer network: as soon as I increase the number of layers (independent of the number of hidden units), the network again doesn't learn anything (when only the weights get updated).
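In Keras terms, the last setup is roughly equivalent to the following sketch (input_dim, the activations and the loss are placeholders; dropping the bias entirely stands in for freezing a zero-initialized bias):

```python
# A single-hidden-unit autoencoder where only the weights are trainable.
import tensorflow as tf
from tensorflow.keras import layers

input_dim = 784  # placeholder

inputs = tf.keras.Input(shape=(input_dim,))
code = layers.Dense(1, activation="relu", use_bias=False)(inputs)              # encoder
outputs = layers.Dense(input_dim, activation="sigmoid", use_bias=False)(code)  # decoder

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```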
Everything I reported holds for the training loss as well as for the test loss (on a completely independent dataset the network never sees), which makes it even more curious to me.
My question is: how can it happen that I learn as much from a one-unit "network" as from a bigger one (both in training and testing)? And second, how can it be that even the larger nets never seem to overfit (training and test error change slightly, but are always comparable)? Any suggestions would be very helpful!
Thanks a lot!
Nils