Is it meaningless to use ReduceLROnPlateau with Adam optimizer? - tensorflow

This question is aimed at people with a very deep knowledge of the inner workings of Keras / tf.keras.
As far as I know, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning-rate scheme. So isn't it meaningless to use keras.callbacks.ReduceLROnPlateau with Adam (or any other adaptive optimizer)? I don't know the inner workings of Keras optimizers, but it seems natural to ask: if we are using an adaptive optimizer, why use this callback at all? And if we do use it, what effect does it have on training?

Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is to take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the variance of the gradient across batches is measured - the noisier it is, the less RMSProp "trusts" the gradient and so the gradient is reduced (divided by the stdev of the gradient for that weight). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
So although one colloquial description of Adam is that it automatically tunes a learning rate... a more informative description is that Adam statistically adjusts gradients to be more reliable, but you still need to decide on a learning rate and how it changes during training (e.g. a LR policy). ReduceLROnPlateau, cosine decay, warmup, etc are examples of an LR policy.
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer documentation is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html
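To see why the two mechanisms are complementary, here is a minimal pure-Python sketch of the plateau-detection logic that ReduceLROnPlateau implements (a hypothetical stand-alone version, not the Keras source): it watches a monitored metric and multiplies the base learning rate by a factor when the metric stops improving, while Adam's per-parameter moment estimates keep working underneath that base rate.

```python
def reduce_lr_on_plateau(losses, lr=1e-3, factor=0.5, patience=3, min_lr=1e-6):
    """Replay a loss history and return the lr after applying the policy."""
    best = float("inf")
    wait = 0
    for loss in losses:
        if loss < best:          # metric improved: reset the patience counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:  # metric plateaued: cut the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
    return lr

# A loss that stops improving after a few epochs triggers one reduction:
history = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7]
print(reduce_lr_on_plateau(history))  # → 0.0005
```

In real Keras code you would instead pass something like tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3) in the callbacks list of model.fit while compiling with Adam; the callback scales Adam's global learning rate, and Adam still rescales each weight's gradient individually.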

Related

Is Gradient Descent always used during backpropagation for updating weights?

Gradient descent, RMSProp, and Adam are optimizers. Assume I have chosen Adam or RMSProp when compiling the model, i.e. model.compile(optimizer = "adam").
My doubt is: during backpropagation, is gradient descent used for updating the weights, or is Adam used?
We are using gradient descent to calculate the gradient and then update the weights by backpropagation. There are plenty optimizers, like the ones you mention and many more.
These optimizers use an adaptive learning rate. With an adaptive scheme we have more degrees of freedom: the effective learning rate can increase along one direction and decrease along another. They don't get stuck in one direction and are able to traverse further along one direction than another.
RMSProp applies a momentum-like exponential decay to the gradient history, so gradients from the distant past have less influence. It modifies the AdaGrad optimizer to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
Adam (adaptive moments) computes the 1st and 2nd moments of the gradient and applies a momentum-like decay to both. In addition, it uses bias correction to avoid initial instabilities of the moments.
How to choose one?
It depends on the problem we are trying to solve. The best algorithm is the one that can traverse the loss surface for that problem well.
It's more empirical than mathematical.
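The moment estimates and bias correction described above can be sketched in a few lines for a single scalar weight (a toy sketch of the Adam update rule, not the Keras implementation; the hyperparameter defaults are the usual published ones):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight."""
    m = b1 * m + (1 - b1) * grad        # 1st moment: momentum-like average
    v = b2 * v + (1 - b2) * grad ** 2   # 2nd moment: gradient magnitude
    m_hat = m / (1 - b1 ** t)           # bias correction: the moments start
    v_hat = v / (1 - b2 ** t)           # at 0 and would otherwise be too small
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) starting from w = 1.0:
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 301):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # close to the minimum at 0
```

Note how the step size is roughly lr * m_hat / sqrt(v_hat): when gradients are consistent, the ratio is near 1 and the weight moves about lr per step regardless of the raw gradient scale, which is exactly the "statistically adjusted gradient" behavior discussed earlier.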

Projected gradient descent

I was wondering if any of the current deep learning frameworks can perform projected gradient descent.
There are implementations available for projected gradient descent in PyTorch, TensorFlow, and Python. You may need to slightly change them based on your model, loss, etc.
PyTorch: https://gist.github.com/oscarknagg/45b187c236c6262b1c4bbe2d0920ded6
TensorFlow: https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalGradientDescentOptimizer (this one may be easier to use: you can treat it as a regular TensorFlow optimizer without handling the gradients yourself)
Python: https://github.com/amkatrutsa/liboptpy (constrained optimization section)
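Whatever framework you pick, the core idea is small enough to show directly. Below is a toy sketch (all names are illustrative, not from any of the libraries above): take an ordinary gradient step, then project the iterate back onto the feasible set. Here the set is the interval [0, 1] and the unconstrained minimum lies outside it, so the iterates settle on the boundary.

```python
# Toy projected gradient descent: minimize f(x) = (x - 3)^2
# subject to x in [0, 1]. The unconstrained minimum (x = 3) is
# infeasible, so the iterates should converge to the boundary x = 1.

def project(x, lo=0.0, hi=1.0):
    return min(max(x, lo), hi)   # Euclidean projection onto the box [lo, hi]

x = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (x - 3)           # gradient of (x - 3)^2
    x = project(x - lr * grad)   # gradient step, then project back

print(x)  # → 1.0
```

For a more complex constraint set you only need to swap out project() for the projection onto that set; the gradient step itself is unchanged, which is why the PyTorch gist above can reuse a standard optimizer and just clamp the result.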

Does the Keras Adam optimizer and other momentum-based optimizers retain past update information over different fit calls?

The Adam optimizer uses a momentum-like approach to train a neural network in fewer iterations than vanilla gradient descent. I'm trying to figure out whether the Adam optimizer works in a Q-learning setting, where you have a non-stationary dataset and numerous model.fit() calls. Do Adam and other optimizers retain their momentum over the different calls, or is the optimizer state reset at each fit call?
I've tried searching the code on https://github.com/keras-team/keras/blob/master/keras/optimizers.py#L436 but can't find where this information is stored and whether it's retained.
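As far as I know, the state lives on the optimizer object itself (its slot variables, i.e. the moment estimates), so it persists across repeated model.fit() calls and is only reset when a fresh optimizer is created, e.g. by calling compile() again. The behavior is easiest to see with a toy momentum optimizer (hypothetical, not the Keras code) whose velocity lives on the object the same way:

```python
class ToyMomentum:
    """Toy momentum optimizer; the velocity lives on the object,
    mimicking how Keras optimizers keep their slot variables."""
    def __init__(self, lr=0.1, beta=0.9):
        self.lr, self.beta, self.v = lr, beta, 0.0

    def step(self, w, grad):
        self.v = self.beta * self.v + grad   # velocity accumulates history
        return w - self.lr * self.v

def fit(opt, w, grads):
    """Stand-in for one model.fit() call: a short training loop."""
    for g in grads:
        w = opt.step(w, g)
    return w

opt = ToyMomentum()
w = fit(opt, 1.0, [1.0, 1.0])   # first "fit call"
v_after_first = opt.v           # velocity is nonzero now
w = fit(opt, w, [1.0])          # second call: velocity carries over
print(v_after_first, opt.v)     # the second call built on the first
```

By the same logic, reusing the compiled model across fit calls keeps the momentum, while recompiling (or constructing a new optimizer instance) starts the moments from zero.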

Unusual behavior of ADAM optimizer with AMSGrad

I am trying some 1-, 2-, and 3-layer LSTM networks to classify the land cover of selected pixels from Landsat time-series spectral data. I tried different optimizers (as implemented in Keras) to see which of them is better, and generally found the AMSGrad variant of Adam doing a relatively better job in my case. However, one thing strange to me is that for the AMSGrad variant, the training and test accuracies start at a relatively high value from the first epoch (instead of increasing gradually) and change only slightly after that, as you see in the graphs below.
Performance of ADAM optimizer with AMSGrad on
Performance of ADAM optimizer with AMSGrad off
I have not seen this behavior in any other optimizer. Does it show a problem in my experiment? What can be the explanation for this phenomenon?
Pay attention to the number of LSTM layers; they are notorious for easily overfitting the data. Try a smaller model initially (fewer layers), and gradually increase the number of units in a layer. If you notice poor results, then try adding another LSTM layer, but only after the previous step has been done.
As for the optimizers, I have to admit I have never used AMSGrad. However, the accuracy plot does seem much better with AMSGrad off. You can see that when you use AMSGrad, the accuracy on the training set is much better than that on the test set, which is a strong sign of overfitting.
Remember to keep things simple, experiment with simple models and generic optimizers.

MNIST for ML Beginners tutorial mistake

In the MNIST for ML Beginners tutorial I believe there is a mistake. I think this part is not accurate:
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Stochastic gradient descent is for updating the parameters for each training example (http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants), and in the tutorial batches of size of 100 are used, which I believe would be mini-batch gradient descent instead.
I could be wrong but shouldn't this be changed?
It is true that stochastic gradient descent (SGD) is defined as gradient descent with a single data sample on Wikipedia (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) and in Sebastian Ruder's survey. However, it has become quite popular among machine learning practitioners to also use the term for mini-batch gradient descent.
When using stochastic gradient descent, you assume that the gradient can be reasonably approximated by the gradient using a single data sample, which may be quite a heavy assumption, depending on the fluctuations in the data. If you use mini-batch gradient descent with a small batch size (100 may be a small batch size for some problems), you are still depending on the individual batch, although this dependence is usually smaller than for a single sample (since you have at least a bit of averaging here).
Thus, the gradient itself (or the update rule, if you prefer this point of view) is a stochastic variable, since it fluctuates around the average value over the complete data set. Therefore, many people use mini-batch gradient descent and stochastic gradient descent as synonyms.
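The averaging effect is easy to demonstrate numerically. The toy sketch below (a made-up regression problem, not from the tutorial) compares the spread of single-sample gradient estimates against mini-batch estimates of size 100: both fluctuate around the same full-dataset gradient, but the batch estimates fluctuate far less.

```python
import random
random.seed(0)

# Data from y = 2x plus noise; per-sample loss (w*x - y)^2
# has gradient 2*(w*x - y)*x with respect to w.
xs = [random.uniform(-1, 1) for _ in range(1000)]
data = [(x, 2 * x + random.gauss(0, 1)) for x in xs]

def grad(w, batch):
    """Gradient estimate averaged over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
# Single-sample ("pure" SGD) gradient estimates at w = 0:
singles = [grad(w, [data[i]]) for i in range(200)]
# Mini-batch (size 100) gradient estimates at w = 0:
batches = [grad(w, data[i * 100:(i + 1) * 100]) for i in range(10)]

def spread(vals):
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

print(spread(singles), spread(batches))  # single-sample spread is much larger
```

Both estimators are unbiased, so either one qualifies as "stochastic" gradient descent in the looser sense; the batch size only controls how much the gradient fluctuates around the full-dataset value.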