Behavior of Dropout layers in test / training phase - tensorflow

According to the Keras documentation, dropout layers behave differently in the training and test phases:
Note that if your model has a different behavior in training and
testing phase (e.g. if it uses Dropout, BatchNormalization, etc.), you
will need to pass the learning phase flag to your function:
Unfortunately, nobody talks about the actual differences. Why should dropout behave differently in the test phase? I expect the layer to set a certain fraction of neurons to 0. Why should this behavior depend on the training/test phase?

Dropout is used in the training phase to reduce the chance of overfitting. As you mention, this layer deactivates certain neurons, which makes the model less dependent on the weights of any individual node. Basically, with the dropout layer the trained model becomes the average of many thinned models. Check a more detailed explanation here
However, when you apply your trained model you want to use its full power. You want to use all neurons in the trained (average) network to get the highest accuracy.
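To see this concretely, here is a minimal sketch using the tf.keras API (the input values and dropout rate are just illustrative): the same Dropout layer zeroes inputs during training but passes everything through at test time.

import numpy as np
import tensorflow as tf

# A Dropout layer behaves differently depending on the `training` flag.
layer = tf.keras.layers.Dropout(rate=0.5)
x = np.ones((1, 10), dtype="float32")

# Training phase: roughly half the inputs are zeroed, and the survivors
# are scaled by 1 / (1 - rate) so the expected sum stays the same.
print(layer(x, training=True))

# Test phase: dropout is a no-op and all inputs pass through unchanged.
print(layer(x, training=False))

Note that the surviving values come out as 2.0 during training because Keras uses inverted dropout: it scales by 1 / (1 - rate) at training time, so no rescaling is needed at test time.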

Related

Does Keras prioritize metrics or loss?

I'm struggling to understand how a Keras model works.
When we train a model, we give metrics (like ['accuracy']) and a loss function (like cross-entropy) as arguments.
What I want to know is which one the model optimizes.
After fitting, does the learnt model maximize accuracy, or minimize loss?
The model optimizes the loss; metrics are only there for your information and for reporting results.
https://en.wikipedia.org/wiki/Loss_function
Note that metrics are optional, but you must provide a loss function to do training.
You can also evaluate a model on metrics that were not added during training.
Keras models work by minimizing the loss, adjusting the trainable model parameters via backpropagation. Metrics such as training accuracy and validation accuracy are provided as information only, but they can also be used to improve your model's performance through Keras callbacks. Documentation for that is located here. For example, the callback ReduceLROnPlateau (documentation is here) can be used to monitor a metric like validation loss and reduce the model's learning rate if the loss fails to decrease for a certain number (the patience parameter) of consecutive epochs.
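As a minimal sketch (the architecture and hyper-parameter values are illustrative, not from the question), this is how the two roles look in code: the loss drives the optimizer, while a metric only reports, unless a callback uses it to steer training.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # minimized via backpropagation
    metrics=["accuracy"],             # reported, not directly optimized
)

# A metric can still influence training indirectly through a callback,
# e.g. lowering the learning rate when the validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=5)
# model.fit(x_train, y_train, validation_split=0.2, callbacks=[reduce_lr])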

What is freezing/unfreezing a layer in neural networks?

I have been playing around with neural networks for quite a while now, and while reading about transfer learning I recently came across the terms "freezing" & "unfreezing" the layers before training a neural network, and am struggling to understand their usage.
When is one supposed to use freezing/unfreezing?
Which layers are to be frozen/unfrozen? For instance, when I import a pre-trained model & train it on my data, is my entire neural net except the output layer frozen?
How do I determine if I need to unfreeze?
If so how do I determine which layers to unfreeze & train to improve model performance?
I would just add to the other answer that this is most commonly used with CNNs, and the number of layers that you want to freeze (not train) is "given" by how similar the task you are solving is to the original one (the one the original network was solving).
If the tasks are very similar, let's say that you are using CNN pretrained on imagenet and you just want to add some other "general" objects that the network should recognize then you might get away with training just the dense top of the network.
The more dissimilar the tasks are, the more layers of the original network you will need to unfreeze during the training.
Freezing a layer means that it will not be trained, so its weights will not be changed.
Why do we need to freeze such layers?
Sometimes we want a deep enough NN, but we don't have enough time to train it from scratch. That's why we use pretrained models that already have useful weights. A good practice is to freeze layers starting from the input side; for example, you can freeze the first 10 layers.
For instance, when I import a pre-trained model & train it on my data, is my entire neural-net except the output layer freezed?
- Yes, that may be the case. But you can also leave a few layers above the last one unfrozen.
How do I freeze and unfreeze layers?
- In Keras, if you want to freeze a layer, use: layer.trainable = False
And to unfreeze: layer.trainable = True
If so how do I determine which layers to unfreeze & train to improve model performance?
- As I said, a good practice is to start from the input side. You should tune the number of frozen layers yourself, but take into account that the more unfrozen layers you have, the slower training will be.
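Putting this answer together, here is a minimal sketch of freezing and selectively unfreezing with tf.keras; MobileNetV2 and the 5-class head are illustrative choices, not part of the original question.

import tensorflow as tf

# MobileNetV2 is an illustrative choice of pre-trained base.
base = tf.keras.applications.MobileNetV2(include_top=False,
                                         weights="imagenet")
base.trainable = False  # freeze every layer in the base model

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # new trainable head
])

# Later, to fine-tune, unfreeze the last few layers of the base model.
for layer in base.layers[-10:]:
    layer.trainable = True

# Re-compile after changing `trainable` so the change takes effect.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy")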
When training a model with transfer learning, we freeze certain layers for multiple reasons, such as that they might already have converged, or that we want to train only the newly added layers on top of an already pre-trained model. This is a really basic concept of transfer learning, and I suggest you go through this article if you are unfamiliar with it.

TensorFlow: discarding convolution gradients/parameters at test time

Lately I've been reading up on the memory consumed by convolutional neural networks (ConvNets). During training, each convolutional layer has several parameters required for back-propagation of the gradients. These lecture notes suggest that these parameters can in principle be discarded at test time. A quote from the linked notes:
Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
Is there any way (using TensorFlow) to make use of this "clever implementation" for inference of large batches? Is there some flag that specifies whether or not the model is in its training phase? Or is this already handled automatically based on whether the optimiser function is called?
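As a side note, here is a minimal sketch assuming TensorFlow 2.x eager execution (a newer API than the one the question was likely written against): activations are only retained when a forward pass is recorded under a GradientTape, so a plain inference call already avoids storing them for backpropagation.

import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights=None)  # illustrative model
images = tf.random.normal((8, 224, 224, 3))

# Training-style forward pass: the tape keeps intermediate activations
# alive because they are needed to compute gradients afterwards.
with tf.GradientTape() as tape:
    predictions = model(images, training=True)
grads = tape.gradient(predictions, model.trainable_variables)

# Inference-style forward pass: with no tape, TensorFlow is free to
# discard each layer's activations as soon as the next layer has
# consumed them, and training=False also switches layers such as
# Dropout and BatchNormalization to their test-time behavior.
predictions = model(images, training=False)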

dropout and data.split in model.fit

As we know, dropout is a mechanism to help control overfitting. In the Keras training process, we can perform online validation by monitoring the validation loss, and set up the data split in model.fit.
Generally, do I need to use both of these mechanisms? Or, if I set up the data split in model.fit, do I not need to use dropout?
Dropout is a regularization technique, i.e. it prevents the network from overfitting on your data quickly. The validation loss just gives you an indication of when your network is overfitting. These are two completely different things, and having a validation loss does not help you when your model is overfitting; it just shows you that it is.
I would say that having a validation loss is valuable information to have during training and you should never go without it. Whether you need regularization techniques such as noise, dropout or batch normalization depends on how your network learns. If you see that it overfits then you should attempt to employ regularization techniques.
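A minimal sketch showing the two mechanisms side by side (the architecture is illustrative): Dropout is part of the model and regularizes it, while validation_split in model.fit only holds out data for monitoring.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),  # regularization: fights overfitting
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# validation_split holds out 20% of the data and reports val_loss each
# epoch; it does not prevent overfitting, it only makes it visible.
# model.fit(x, y, epochs=20, validation_split=0.2)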

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for CNNs; a sketch combining several of these items follows this list)
DropConnect
Regularization (see my master's thesis, page 85, for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
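To make the convolutional-layer options concrete, here is a minimal sketch in tf.keras (layer sizes and rates are illustrative) combining L2 weight decay on the conv kernels, spatial dropout on the feature maps, and global average pooling.

import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same",
                           kernel_regularizer=l2, input_shape=(64, 64, 1)),
    tf.keras.layers.SpatialDropout2D(0.2),  # drops whole feature maps
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same",
                           kernel_regularizer=l2),
    tf.keras.layers.SpatialDropout2D(0.2),
    # Global average pooling replaces a large dense layer ("make network
    # smaller") and itself acts as a regularizer.
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(27, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

SpatialDropout2D is used here instead of plain Dropout because zeroing entire feature maps, rather than individual pixels, is usually more effective on convolutional activations, where neighboring values are strongly correlated.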
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how this is connected. You can have hundreds of labels without overfitting being a problem.