Is there a thorough exploration of the effect of momentum on Stochastic Gradient Descent? - optimization

Many CNN papers use momentum=0.9 with Stochastic Gradient Descent in the weight update. There is good logic for using it, but what I am looking for is a thorough exploration of the effects of that parameter. I've searched across many papers, and there are some insights here and there, but I have not been able to find a comprehensive exploration. Also, does its usefulness vary across different computer vision tasks like classification, segmentation, and detection?

Here is a good review paper on this topic: "A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay" by Leslie N. Smith.
https://arxiv.org/pdf/1803.09820.pdf
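In case it helps frame the question, the update those papers use is the plain heavy-ball form below. This is a minimal NumPy sketch (the function and variable names are mine, not from any particular paper); it also shows why momentum=0.9 roughly amplifies the effective step size by 1/(1-0.9) = 10 when successive gradients point in a consistent direction.

    import numpy as np

    def sgd_momentum_step(w, v, grad, lr=0.01, momentum=0.9):
        # Heavy-ball update: v <- momentum*v + grad;  w <- w - lr*v
        v = momentum * v + grad
        w = w - lr * v
        return w, v

    # With a constant gradient g, the velocity converges to g / (1 - momentum),
    # so momentum=0.9 gives roughly a 10x larger effective step.
    w, v = np.zeros(3), np.zeros(3)
    g = np.array([1.0, -0.5, 0.2])
    for _ in range(100):
        w, v = sgd_momentum_step(w, v, g)
    print(v)  # approaches 10 * g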

Related

What is the difference between optimization algorithms and Ensembling methods?

I was going through ensembling methods and was wondering: what is the difference between optimization techniques like gradient descent, etc., and ensembling techniques like bagging, boosting, etc.?
Optimization like gradient descent is a single-model approach. An ensemble, per Wikipedia, is multiple models; the constituents of the ensemble are weighted for the overall prediction. Boosting, per Wikipedia (https://en.wikipedia.org/wiki/Ensemble_learning), is essentially retraining with a focus on the examples a model missed (its errors).
To me this is like single-image recognition in a monocular fashion vs. binocular image recognition, the two images being an ensemble. Further scrutiny that requires extra attention to classification errors is boosting, that is, retraining on some of the errors. Perhaps the error cases were represented too infrequently to allow good classifications (thinking black swan here). In vehicles, this could be like combining infrared, thermal, radar, and lidar sensor results for an overall classification. The link above has really good explanations of each of your areas of concern.
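If it helps to see the contrast in code, here is a rough scikit-learn sketch (the dataset and parameter values are arbitrary; note that BaggingClassifier takes estimator= in scikit-learn >= 1.2 and base_estimator= in older releases): a single model whose weights are fitted by stochastic gradient descent, versus a bagging ensemble of many such models whose predictions are combined.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Optimization: ONE model, its weights fitted by stochastic gradient descent.
    single = SGDClassifier(random_state=0).fit(X_tr, y_tr)

    # Ensembling: MANY such models trained on bootstrap samples, then combined.
    ensemble = BaggingClassifier(estimator=SGDClassifier(),
                                 n_estimators=25, random_state=0).fit(X_tr, y_tr)

    print("single  :", single.score(X_te, y_te))
    print("ensemble:", ensemble.score(X_te, y_te))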

Is it meaningless to use ReduceLROnPlateau with Adam optimizer?

This question is mainly about the inner workings of Keras / tf.keras, for people who have very deep knowledge of the framework.
According to my knowledge, tf.keras.optimizers.Adam is an optimizer which already has an adaptive learning rate scheme. So if we use keras.callbacks.ReduceLROnPlateau with the Adam optimizer (or any other adaptive one), isn't it meaningless to do so? I don't know the inner workings of Keras optimizers, but it seems natural to ask: if we are using an adaptive optimizer, why use this callback at all, and if we do use it, what effect would it have on training?
Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is take the gradient and apply some statistical adjustment to produce a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the recent magnitude of the gradient across batches is tracked: the noisier it is, the less RMSProp "trusts" the gradient, so the gradient is scaled down (divided by the root of a running average of its square for that weight). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
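As a minimal sketch of those three adjustments (my own simplified NumPy version, with Adam's bias correction omitted):

    import numpy as np

    def momentum_adjust(g, state, beta=0.9):
        # Average the gradient with previous gradients.
        state["m"] = beta * state.get("m", 0.0) + (1 - beta) * g
        return state["m"]

    def rmsprop_adjust(g, state, beta=0.9, eps=1e-8):
        # Scale the gradient down where recent gradients have been large/noisy.
        state["v"] = beta * state.get("v", 0.0) + (1 - beta) * g**2
        return g / (np.sqrt(state["v"]) + eps)

    def adam_adjust(g, state, b1=0.9, b2=0.999, eps=1e-8):
        # Both: the averaged gradient divided by the root of its averaged square.
        state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
        state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g**2
        return state["m"] / (np.sqrt(state["v"]) + eps)

    # In every case the actual step is still: w -= learning_rate * adjusted_gradient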
So although one colloquial description of Adam is that it automatically tunes a learning rate, a more informative description is that Adam statistically adjusts gradients to be more reliable; you still need to decide on a learning rate and how it changes during training, i.e. an LR policy. ReduceLROnPlateau, cosine decay, warmup, etc. are all examples of LR policies.
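Concretely, combining Adam with ReduceLROnPlateau in tf.keras looks like this (a toy sketch with made-up data, just to show where the callback hooks in; the hyperparameter values are arbitrary):

    import numpy as np
    import tensorflow as tf

    x_train = np.random.rand(1000, 20).astype("float32")
    y_train = np.random.randint(0, 4, size=1000)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # The LR policy sits on top of Adam's per-weight gradient adjustment:
    # when val_loss stops improving, shrink the global learning rate.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6)

    model.fit(x_train, y_train, validation_split=0.2,
              epochs=30, callbacks=[reduce_lr], verbose=0)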
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer documentation is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html

Fairness metrics for multi-class classification

Are there any metrics implemented in Fairlearn or any published papers that I can refer to for use-cases around fairness measurement of multi-class classification where the metrics are AP and not accuracy? Thanks!
Update: the Fairlearn documentation now has an FAQ section on this topic: https://fairlearn.org/main/faq.html (search for "Does Fairlearn support multi-class classification?").
Previous answer:
Fairlearn's metrics are designed for binary classification or regression. You could evaluate the various labels individually, of course. If you have a specific idea of what you'd like to see please open a new feature request.
Fairlearn does support a variety of metrics, not just accuracy. The user guide has a full list: https://fairlearn.org/v0.6.0/user_guide/assessment.html#scalar-results-from-metricframe
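For the per-label route with AP, a rough one-vs-rest sketch with MetricFrame might look like this (placeholder data; I'm assuming a recent Fairlearn where MetricFrame takes keyword arguments, which differs slightly from the v0.6.0 signature):

    import numpy as np
    from fairlearn.metrics import MetricFrame
    from sklearn.metrics import average_precision_score

    classes = [0, 1, 2]
    y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])           # placeholder labels
    y_score = np.random.rand(len(y_true), len(classes))    # placeholder per-class scores
    sensitive = np.array(["A", "B", "A", "B", "A", "B", "A", "B"])

    for c in classes:
        mf = MetricFrame(metrics=average_precision_score,
                         y_true=(y_true == c).astype(int),
                         y_pred=y_score[:, c],
                         sensitive_features=sensitive)
        print(f"class {c}: AP by group = {mf.by_group.to_dict()}, "
              f"difference = {mf.difference():.3f}")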
One example that comes to mind for a paper doing multi-class classification while thinking about fairness is CheXclusion by Seyyed-Kalantari et al. They mostly look into TPR differences when classifying chest x-rays.
The Fairlearn community would definitely be interested in hearing about your use case. Perhaps there's some way we can help. Feel free to reach out via Gitter or by creating your feature request (as mentioned above).

Tensorflow: how to find good neural network architectures/hyperparameters?

I've been using tensorflow on and off for various things that I guess are considered rather easy these days. Captcha cracking, basic OCR, things I remember from my AI education at university. They are problems that are reasonably large and therefore don't really lend themselves to experimenting efficiently with different NN architectures.
As you probably know, Joel Grus came out with FizzBuzz in TensorFlow. TL;DR: learning a mapping from a binary representation of a number (i.e. 12 bits encoding the number) to 4 outputs (none_of_the_others, divisible by 3, divisible by 5, divisible by 15). For this toy problem, you can quickly compare different networks.
So I've been trying a simple feedforward network and wrote a program to compare various architectures: a 2-hidden-layer feedforward network, then 3 layers, different activation functions, and so on. Most architectures, well, suck. They get somewhere near a 50-60% success rate and stay there, independent of how much training you do.
A few perform really well. For instance, a sigmoid-activated network with two hidden layers of 23 neurons each works really well (89-90% correct after 2000 training epochs). Unfortunately, anything close to it is disastrously bad: take one neuron out of the first or second hidden layer and accuracy drops to 30%. A single hidden layer of 20 tanh-activated neurons also does pretty well, but most variants reach little more than half that performance.
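For concreteness, the best-performing setup looks roughly like this in tf.keras (a simplified sketch of the configuration described above, not my exact program; the optimizer and batch size here are arbitrary choices):

    import numpy as np
    import tensorflow as tf

    def encode(n, bits=12):
        return [(n >> i) & 1 for i in range(bits)]

    def label(n):
        # 0: none_of_the_others, 1: divisible by 3, 2: divisible by 5, 3: divisible by 15
        if n % 15 == 0: return 3
        if n % 5 == 0:  return 2
        if n % 3 == 0:  return 1
        return 0

    numbers = np.arange(1, 2 ** 12)
    X = np.array([encode(n) for n in numbers], dtype="float32")
    y = np.array([label(n) for n in numbers])

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(23, activation="sigmoid", input_shape=(12,)),
        tf.keras.layers.Dense(23, activation="sigmoid"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=2000, batch_size=128, verbose=0)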
Now, given that for real problems I can't realistically do these sorts of architecture studies, are there ways to get good architectures that are guaranteed to work?
You might find the paper by Yoshua Bengio on Practical Recommendations for Gradient-Based Training of Deep Architectures helpful to learn more about hyperparameters and their settings.
If you're asking specifically for settings that have more guaranteed success, I advise you to read up on Batch Normalization. I find that it decreases the failure rate for bad picks of the learning rate and weight initialization.
Some people also discourage the use of non-linearities like sigmoid() and tanh(), as they suffer from the vanishing gradient problem.
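A minimal sketch of that pattern in tf.keras, assuming the same toy 12-bit input and 4-class output (the layer widths are arbitrary): put batch normalization between each linear layer and a ReLU instead of using sigmoid/tanh.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(23, use_bias=False, input_shape=(12,)),
        tf.keras.layers.BatchNormalization(),   # bias is redundant before BN
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(23, use_bias=False),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("relu"),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])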

Prevention of overfitting in convolutional layers of a CNN

I'm using TensorFlow to train a Convolutional Neural Network (CNN) for a sign language application. The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting. I've taken several steps to accomplish this:
I've collected a large amount of high-quality training data (over 5000 samples per label).
I've built a reasonably sophisticated pre-processing stage to help maximize invariance to things like lighting conditions.
I'm using dropout on the fully-connected layers.
I'm applying L2 regularization to the fully-connected parameters.
I've done extensive hyper-parameter optimization (to the extent possible given HW and time limitations) to identify the simplest model that can achieve close to 0% loss on training data.
Unfortunately, even after all these steps, I'm finding that I can't achieve much better than about 3% test error. (It's not terrible, but for the application to be viable, I'll need to improve that substantially.)
I suspect that the source of the overfitting lies in the convolutional layers since I'm not taking any explicit steps there to regularize (besides keeping the layers as small as possible). But based on examples provided with TensorFlow, it doesn't appear that regularization or dropout is typically applied to convolutional layers.
The only approach I've found online that explicitly deals with prevention of overfitting in convolutional layers is a fairly new approach called Stochastic Pooling. Unfortunately, it appears that there is no implementation for this in TensorFlow, at least not yet.
So in short, is there a recommended approach to prevent overfitting in convolutional layers that can be achieved in TensorFlow? Or will it be necessary to create a custom pooling operator to support the Stochastic Pooling approach?
Thanks for any guidance!
How can I fight overfitting?
Get more data (or data augmentation)
Dropout (see paper, explanation, dropout for cnns)
DropConnect
Regularization (see my masters thesis, page 85 for examples)
Feature scale clipping
Global average pooling
Make network smaller
Early stopping
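Several of these can be applied directly to the convolutional layers in current TensorFlow/Keras. Below is a rough sketch (the input size and layer widths are placeholders): L2 via kernel_regularizer, spatial dropout on the feature maps, global average pooling instead of large dense layers, and early stopping.

    import tensorflow as tf

    l2 = tf.keras.regularizers.l2(1e-4)

    # Hypothetical CNN for 64x64 grayscale inputs and 27 classes; the sizes are
    # placeholders, the point is where the regularization hooks in.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2,
                               input_shape=(64, 64, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.SpatialDropout2D(0.2),     # dropout for conv feature maps
        tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.SpatialDropout2D(0.2),
        tf.keras.layers.GlobalAveragePooling2D(),  # instead of big dense layers
        tf.keras.layers.Dense(27, activation="softmax"),
    ])

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                  restore_best_weights=True)
    # model.fit(..., callbacks=[early_stop])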
How can I improve my CNN?
Thoma, Martin. "Analysis and Optimization of Convolutional Neural Network Architectures." arXiv preprint arXiv:1707.09725 (2017).
See chapter 2.5 for analysis techniques. As written in the beginning of that chapter, you can usually do the following:
(I1) Change the problem definition (e.g., the classes which are to be distinguished)
(I2) Get more training data
(I3) Clean the training data
(I4) Change the preprocessing (see Appendix B.1)
(I5) Augment the training data set (see Appendix B.2)
(I6) Change the training setup (see Appendices B.3 to B.5)
(I7) Change the model (see Appendices B.6 and B.7)
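As a concrete illustration of (I5) for an image task like yours, current TensorFlow ships augmentation layers that can be applied on the fly (tf.keras.layers.Random* in TF >= 2.6, under layers.experimental.preprocessing in older releases). The choice of augmentations below is an assumption on my part; in particular, flips are left out because mirroring a hand sign can change its meaning.

    import tensorflow as tf

    augment = tf.keras.Sequential([
        tf.keras.layers.RandomRotation(0.05),
        tf.keras.layers.RandomTranslation(0.1, 0.1),
        tf.keras.layers.RandomZoom(0.1),
        tf.keras.layers.RandomContrast(0.2),
    ])

    # Typically applied on the fly in the input pipeline:
    # train_ds = train_ds.map(lambda x, y: (augment(x, training=True), y))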
Misc
The CNN has to classify 27 different labels, so unsurprisingly, a major problem has been addressing overfitting.
I don't understand how these are connected. You can have hundreds of labels without overfitting being a problem.