dropout and data.split in model.fit - tensorflow

As we know, dropout is a mechanism that helps control overfitting. During training in Keras, we can also monitor overfitting online by watching the validation loss, which we get by setting up a validation split in model.fit.
Generally, do I need to use both of these mechanisms? Or, if I set up a validation split in model.fit, do I no longer need dropout?

Dropout is a regularization technique, i.e. it helps prevent the network from quickly overfitting your data. The validation loss just gives you an indication of when your network is overfitting. These are two completely different things: having a validation loss does not help you when your model is overfitting, it only shows you that it is.
I would say that a validation loss is valuable information to have during training and you should never go without it. Whether you need regularization techniques such as noise, dropout or batch normalization depends on how your network learns. If you see that it overfits, then you should try to employ regularization techniques.
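For completeness, here is a minimal sketch (made-up layer sizes and placeholder data, not from the question) of how the two mechanisms are typically combined: Dropout does the regularizing, while validation_split in model.fit only holds out data so you can watch the validation loss.

    import numpy as np
    import tensorflow as tf

    # Placeholder data, purely for illustration
    x_train = np.random.rand(1000, 20).astype("float32")
    y_train = np.random.randint(0, 2, size=(1000, 1))

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dropout(0.5),                    # regularization: fights overfitting
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # validation_split only holds out data to *monitor* overfitting; it does not prevent it
    history = model.fit(x_train, y_train, epochs=10, validation_split=0.2, verbose=0)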

Related

keras prioritizes metrics or loss?

I'm struggling to understand how a Keras model works.
When we train a model, we pass metrics (like ['accuracy']) and a loss function (like cross-entropy) as arguments.
What I want to know is which one is the goal the model optimizes.
After fitting, does the learned model maximize accuracy, or minimize loss?
The model optimizes the loss; metrics are only there for your information and for reporting results.
https://en.wikipedia.org/wiki/Loss_function
Note that metrics are optional, but you must provide a loss function to do training.
You can also evaluate a model on metrics that were not added during training.
Keras models work by minimizing the loss, adjusting the trainable model parameters via backpropagation. Metrics such as training accuracy, validation accuracy, etc. are provided as information, but they can also be used to improve your model's performance through the use of Keras callbacks. Documentation for that is located here. For example, the callback ReduceLROnPlateau (documentation is here) can be used to monitor a metric like validation loss and reduce the model's learning rate if the loss fails to decrease for a certain number (the patience parameter) of consecutive epochs.
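A short sketch of both points (hypothetical layer sizes; the fit call is commented out because no data is defined here): the loss passed to compile is what backpropagation minimizes, the metrics are only reported, and ReduceLROnPlateau watches a monitored quantity and lowers the learning rate when it plateaus.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])

    # The loss ("mse") is what gets minimized; "mae" is only computed and reported
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])

    # Halve the learning rate if val_loss has not improved for 3 consecutive epochs
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6)

    # model.fit(x, y, validation_split=0.2, epochs=50, callbacks=[reduce_lr])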

Why setting training=True in tf.keras.layers.Dropout during testing mode is leading to lower training loss values and higher prediction accuracy?

I'm using dropout layers in my model implemented in TensorFlow (tf.keras.layers.Dropout). I set training=True during training and training=False while testing, and the performance is poor. I accidentally left training=True during testing too, and the results got much better. I'm wondering what's happening, and why it affects the training loss values. I'm not making any changes to the training, and the whole testing process happens after training. However, setting training=True in testing seems to affect the training process, causing the training loss to get closer to zero, and then the testing results are better. Any possible explanation?
Thanks,
Sorry for the late response, but the answer from Celius is not quite correct.
The training parameter of the Dropout Layer (and for the BatchNormalization layer as well) defines whether this layer should behave in training or inference mode. You can read this in the official documentation.
However, the documentation is a bit unclear about how this affects the execution of your network. Setting training=False does not mean that the Dropout layer is not part of your network. Contrary to what Celius explained, it is by no means ignored; it just behaves in inference mode. For Dropout, this means that no dropout will be applied. For BN, it means that BN will use the statistics estimated during training instead of computing new statistics for every mini-batch. The other way around, if you set training=True, the layer will behave in training mode and dropout will be applied.
Now to your question: The behavior of your network does not make sense. If dropout was applied on unseen data, there is nothing to learn from that. You only throw away information, hence your results should be worse. But I think your problem is not related to the Dropout layer anyway. Does your network also make use of BatchNormalization layers? If BN is applied in a poor way, it can mess up your final results. But I haven't seen any code, so it is hard to fully answer your question as is.
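To make the inference-mode behavior concrete, here is a tiny sketch (made-up input, not from the question) showing that with training=False the Dropout layer is a pass-through, while with training=True it zeroes roughly rate of the units and scales the survivors by 1/(1 - rate):

    import tensorflow as tf

    drop = tf.keras.layers.Dropout(rate=0.5)
    x = tf.ones((1, 8))

    print(drop(x, training=False).numpy())  # inference mode: identity, all ones
    print(drop(x, training=True).numpy())   # training mode: ~half the units zeroed,
                                            # the rest scaled by 1/(1 - 0.5) = 2.0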

Preventing overfitting in transfer learning using TensorFlow and Keras

I've got a TensorFlow 2 model with a pre-trained Keras layer coming from TensorFlow Hub. I want to fine-tune the weights in this sub-model to suit my dataset, but if I do that naively by setting trainable=True and training=True, my model will grossly overfit.
If I had the actual layers of the underlying model under my control, I would insert dropout layers or set an L2 coefficient on those individual layers. But the layers are imported into my network via the TensorFlow Hub KerasLayer. Also, I suspect that the underlying model is quite complicated.
I wonder what the standard practice is for solving this kind of issue.
Maybe there is a way to force regularization onto the whole network somehow? I know that in TensorFlow 1 there were optimizers like ProximalAdagradOptimizer that took L2 coefficients. In TensorFlow 2 the only optimizer like this is FTRL, but it's hard for me to make it work for my dataset.
I "solved" it by
pretraining non-transferred parts of the model,
then turning on learning for the shared layers,
introducing early stopping,
and configuring the optimizer to go really slow.
This way, I managed to not damage the transferred layers too much. Anyway, I still wonder whether this is the best one can do.
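As a rough sketch of that recipe (the hub handle, class count and datasets are placeholders, not from the question): train the new head with the hub layer frozen, then unfreeze it, recompile with a much smaller learning rate, and let early stopping cut training off before the transferred weights get damaged.

    import tensorflow as tf
    import tensorflow_hub as hub

    # Placeholder feature-vector handle; substitute the module you actually use
    handle = "https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/4"

    hub_layer = hub.KerasLayer(handle, trainable=False)   # step 1: frozen while pretraining the head
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(224, 224, 3)),
        hub_layer,
        tf.keras.layers.Dropout(0.3),                      # regularize the new head
        tf.keras.layers.Dense(10, activation="softmax"),   # placeholder class count
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(train_ds, validation_data=val_ds, epochs=5)

    # step 2: unfreeze, fine-tune very slowly, and stop as soon as val_loss degrades
    hub_layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),   # "go really slow"
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=2, restore_best_weights=True)
    # model.fit(train_ds, validation_data=val_ds, epochs=20, callbacks=[early_stop])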

Behavior of Dropout layers in test / training phase

According to the Keras documentation, dropout layers behave differently in the training and test phases:
Note that if your model has a different behavior in training and
testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you
will need to pass the learning phase flag to your function:
Unfortunately, nobody talks about the actual differences. Why should dropout behave differently in the test phase? I expect the layer to set a certain fraction of neurons to 0. Why should this behavior depend on the training/test phase?
Dropout is used in the training phase to reduce the chance of overfitting. As you mention, this layer deactivates certain neurons, which makes the model less sensitive to the weights of particular nodes. Basically, with the dropout layer the trained model becomes the average of many thinned models. Check a more detailed explanation here.
However, when you apply your trained model you want to use the full power of the model. You want to use all neurons in the trained (averaged) network to get the highest accuracy.
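A small sketch (toy tensor, not from the answer) showing why no extra rescaling is needed at test time: Keras uses inverted dropout, so during training the surviving activations are already scaled up by 1/(1 - rate), and at inference the layer simply passes everything through.

    import tensorflow as tf

    tf.random.set_seed(0)
    drop = tf.keras.layers.Dropout(rate=0.5)
    x = tf.ones((10000,))

    train_out = drop(x, training=True)
    print(float(tf.reduce_mean(train_out)))                 # ~1.0: survivors scaled by 1/(1 - 0.5)
    print(float(tf.reduce_mean(drop(x, training=False))))   # exactly 1.0: identity at test time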

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't get much better than about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve it substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for CNNs; a sketch applying this to convolutional layers follows this list)
DropConnect
Regularization (see my master's thesis, page 85, for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
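To make the list concrete for convolutional layers in TensorFlow (hypothetical sizes, 27 classes as in the question): L2 can be attached to conv kernels via kernel_regularizer, and dropout can be used inside conv blocks, e.g. SpatialDropout2D, which drops whole feature maps. Stochastic pooling is not built in, but these standard tools usually go a long way.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    l2 = regularizers.l2(1e-4)   # hypothetical weight-decay strength

    model = tf.keras.Sequential([
        layers.Conv2D(32, 3, activation="relu", padding="same",
                      kernel_regularizer=l2, input_shape=(64, 64, 1)),
        layers.MaxPooling2D(),
        layers.SpatialDropout2D(0.2),     # drops whole feature maps in the conv block
        layers.Conv2D(64, 3, activation="relu", padding="same",
                      kernel_regularizer=l2),
        layers.GlobalAveragePooling2D(),  # "global average pooling" from the list above
        layers.Dropout(0.5),              # classic dropout before the classifier
        layers.Dense(27, activation="softmax"),   # 27 labels, as in the question
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])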
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2; a short sketch follows this list)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
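As an example of (I5), here is a minimal augmentation sketch (hypothetical ranges, assuming TF 2.6+ where the preprocessing layers live directly under tf.keras.layers); these layers are active only when called with training=True and are pass-through at inference:

    import tensorflow as tf
    from tensorflow.keras import layers

    # Hypothetical augmentation pipeline for image input; tune the ranges to your data
    augment = tf.keras.Sequential([
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.05),
        layers.RandomZoom(0.1),
        layers.RandomTranslation(0.1, 0.1),
    ])

    images = tf.random.uniform((8, 64, 64, 3))       # dummy batch of images
    augmented = augment(images, training=True)       # augmentation applied
    unchanged = augment(images, training=False)      # identity at inference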
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how this is connected. You can have hundreds of labels without overfitting being a problem.