Why TensorFlow's implementation of AdamOptimizer does not support L2 regularization

TensorFlow's implementation of AdamOptimizer does not have regularization parameters like those in ProximalAdamOptimizer, for example l2_regularization_strength. Is it necessary to add an L2 norm when using AdamOptimizer?

TensorFlow's Adam implementation is just that: an implementation of Adam, exactly as it is defined and tested in the paper.
If you want to use Adam with L2 regularization for your problem, you simply have to add an L2 regularization term to your loss, with a regularization strength you can choose yourself.
I can't tell you if that is necessary or helpful or what regularization and regularization strength to use, because that highly depends on the problem and is rather subjective.
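For example, a minimal sketch in the TF 1.x style the question uses (the toy classifier, placeholder shapes, and the 1e-4 strength are purely illustrative, not a recommendation):
import tensorflow as tf

# Hypothetical toy classifier; the only point is adding an L2 term by hand.
x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 10)

data_loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=logits)

l2_strength = 1e-4  # chosen by you, not by the optimizer
l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()
                    if 'bias' not in v.name])
total_loss = data_loss + l2_strength * l2_loss

# Adam then minimizes the regularized loss like any other loss.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(total_loss)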

Usually you add the regularization to your loss yourself, as described here. However, tf.train.ProximalAdagradOptimizer includes a special non-standard regularization that is part of the algorithm itself, and is therefore also part of tf.train.ProximalAdagradOptimizer.

Related

How to use tensorflow probability to implement a Masked Autoregressive Flow with batch normalization and L2 penalty

I can find some examples of Masked Autoregressive Flow implementations in TensorFlow Probability. However, none of them shows how to add batch normalization and an L2 penalty, both of which are used in the original paper (Masked Autoregressive Flow for Density Estimation). I would like to know how to use TensorFlow Probability to implement the default MAF as described in the paper.
I tried something like bijectors.append(tfb.BatchNormalization()), but I could not add batch normalization that way.
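One rough sketch of the usual pattern, assuming a reasonably recent tensorflow_probability API (the event size, hidden units, number of layers, and the 1e-4 penalty are placeholders): interleave tfb.BatchNormalization bijectors between the MAF layers in the chain, and attach the L2 penalty through the AutoregressiveNetwork's kernel_regularizer.
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors

event_size = 2   # data dimensionality (placeholder)
num_layers = 5   # number of MAF layers (placeholder)

bijectors = []
for _ in range(num_layers):
    made = tfb.AutoregressiveNetwork(
        params=2, hidden_units=[128, 128], activation='relu',
        # L2 penalty on the MADE weights (check your TFP version supports this arg);
        # the resulting terms are collected in made.losses once the network is built.
        kernel_regularizer=tf.keras.regularizers.l2(1e-4))
    bijectors.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made))
    bijectors.append(tfb.BatchNormalization())  # batch norm between flow layers
    bijectors.append(tfb.Permute(permutation=list(reversed(range(event_size)))))

maf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(0., 1.), sample_shape=[event_size]),
    bijector=tfb.Chain(list(reversed(bijectors))))

# Training objective (sketch): the negative log-likelihood from maf.log_prob(batch),
# plus the L2 terms collected in each AutoregressiveNetwork's .losses list.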

Is it meaningless to use ReduceLROnPlateau with Adam optimizer?

This question is basically about the inner workings of Keras or tf.keras, aimed at people with very deep knowledge of the framework.
To my knowledge, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning rate scheme. So if we use keras.callbacks.ReduceLROnPlateau with Adam or a similar optimizer, isn't it meaningless to do so? I don't know the inner workings of the Keras optimizers, but it seems natural to ask: if we are using an adaptive optimizer, why use this callback at all, and if we do use it, what effect does it have on training?
Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is to take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the variance of the gradient across batches is measured - the noisier it is, the less RMSProp "trusts" the gradient and so the gradient is reduced (divided by the stdev of the gradient for that weight). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
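In rough NumPy pseudocode (the textbook form of Adam with its usual defaults), the learning rate lr remains a separate knob that scales the statistically adjusted gradient:
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for weights w given gradient g (rough sketch)."""
    m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # lr scales the adjusted step
    return w, m, v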
So although one colloquial description of Adam is that it automatically tunes a learning rate, a more informative description is that Adam statistically adjusts gradients to be more reliable; you still need to decide on a learning rate and how it changes during training (i.e. an LR policy). ReduceLROnPlateau, cosine decay, warmup, etc. are examples of an LR policy.
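A minimal Keras sketch of combining the two (the model and the monitor/factor/patience values are just placeholders):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss='sparse_categorical_crossentropy')

# Halve the LR whenever the validation loss stops improving for 3 epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.5, patience=3, min_lr=1e-6)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=50, callbacks=[reduce_lr])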
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer docs is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html

Why is L2 regularization not added back into original loss function?

I'm aware that when using a kernel regularizer, particularly an L2 loss, I should add it back into the loss function, and this is what is being done in other posts. However, in Keras, they do not seem to follow this process. Why is this so?
For instance, consider this and this notebook. They are using L2 loss as a kernel regularizer in some layers but do not add it back into the original loss. Is this because of the particular loss, is this a behavior specific to Keras, or am I completely misunderstanding everything?
Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: the model contains all the required information about the architecture and the training procedure.
When you invoke compile (and later fit or train_on_batch) you specify the loss function, but under the hood this is what happens:
Instantiate the loss function specified (e.g. categorical cross entropy)
Fetch from the model the regularizations applied and add all of them to the loss term previously instantiated
You can see the operations that are going to be added to the loss term by accessing the .losses property of the model instance (that's a list of TensorFlow operations, usually multiplication operations, since the regularizations have the form regularization_strength * norm_p(variable)).
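For example, a small sketch (layer sizes and the 0.01 strength are placeholders) showing where those terms live:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(16,),
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)),
    tf.keras.layers.Dense(1),
])

# One tensor per regularized layer, each computing 0.01 * sum(kernel ** 2).
print(model.losses)

model.compile(optimizer='adam', loss='mse')
# During fit(), Keras adds the terms in model.losses to the 'mse' loss automatically.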
The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the way you would expect. It just happens behind the scenes, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.

Why does TensorFlow Object Detection disable regularization for Faster R-CNN?

In the TensorFlow Object Detection sample configuration files, all Faster R-CNN configurations disable the regularization term:
regularizer {
  l2_regularizer {
    weight: 0.0
  }
}
I feel this is not reasonable and very likely to lead to overfitting. Are there any explanations for such settings? Thank you.
"Strong regularization such as maxout or dropout is applied to obtain the best results on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future." [He et. al, Deep Residual Learning for Image Recognition]
I think the regularization the authors refer to, which is applied directly within the ResNet architecture, comes from the batch norm layers sandwiched between every conv layer and its activation. While the authors don't say anything about the use of L2 regularization, their statement about maxout and dropout ought to apply as well. BN layers have the effect of regularizing the network without imposing an explicit penalty, so L2 regularization isn't necessary.
That said, the option is there in case you want to try out stronger regularization.

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my master's thesis, page 85, for examples; a minimal tf.keras sketch follows this list)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
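As referenced in the regularization item above, a rough sketch assuming tf.keras is available (filter counts, dropout rates, and the 1e-4 strength are illustrative, not tuned) of applying L2 and dropout to the convolutional part of a 27-class model:
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same',
                           kernel_regularizer=l2, input_shape=(64, 64, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.SpatialDropout2D(0.2),      # drops whole feature maps
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same',
                           kernel_regularizer=l2),
    tf.keras.layers.GlobalAveragePooling2D(),   # small head instead of large dense layers
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(27, activation='softmax'),
])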
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how this is connected. You can have hundreds of labels without overfitting being a problem.