Is Gradient Descent always used during backpropagation for updating weights? - tensorflow

Gradient Descent, RMSprop, and Adam are optimizers. Assume I have chosen the Adam or RMSprop optimizer while compiling the model, i.e. model.compile(optimizer = "adam").
My doubt is: during backpropagation, is gradient descent used for updating the weights, or is Adam used?

We use backpropagation to calculate the gradients and then update the weights with an optimizer. There are plenty of optimizers, like the ones you mention and many more.
Adaptive optimizers use an adaptive learning rate. With an adaptive scheme we have more degrees of freedom: we can increase the learning rate along one direction and decrease it along another. These optimizers don't get stuck in one direction and are able to traverse further along one direction than the other.
RMSprop applies a momentum-like exponential decay to the gradient history, so gradients from the distant past have less influence. It modifies the AdaGrad optimizer to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
Adam (adaptive moments) computes the 1st and 2nd moments of the gradient (its mean and uncentered variance) and applies a momentum-like decay to both. In addition, it uses bias correction to avoid initial instabilities of the moment estimates.
How to choose one?
It depends on the problem we are trying to solve. The best algorithm is the one that can traverse the loss surface of that problem well.
It's more empirical than mathematical
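To make the "adaptive" part concrete, here is a minimal NumPy sketch of a single Adam update step, run on a toy quadratic objective. The learning rate, decay constants, and objective are illustrative defaults, not tied to any framework; the point is how the two moment estimates and the bias correction combine.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: decayed averages of the gradient (m) and the
    squared gradient (v), with bias correction for the cold start."""
    m = b1 * m + (1 - b1) * g           # 1st moment (mean of gradients)
    v = b2 * v + (1 - b2) * g**2        # 2nd moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2, starting from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    g = 2 * w                           # gradient of w^2
    w, m, v = adam_step(w, g, m, v, t, lr=0.05)
print(round(w, 4))
```

Note that the per-parameter division by sqrt(v_hat) is what makes the effective step size adaptive: noisy or large-magnitude gradients get scaled down automatically.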

Related

Is it meaningless to use ReduceLROnPlateau with Adam optimizer?

This question is basically about the inner workings of Keras or tf.keras, for people who have very deep knowledge of the framework.
To my knowledge, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning rate scheme. So if we use keras.callbacks.ReduceLROnPlateau with the Adam optimizer (or any other adaptive one), isn't it meaningless to do so? I don't know the inner workings of Keras optimizers, but it seems natural to me to ask: if we are using an adaptive optimizer, why use this callback at all? And if we do use it, what would be the effect on training?
Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is to take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the variance of the gradient across batches is measured - the noisier it is, the less RMSProp "trusts" the gradient and so the gradient is reduced (divided by the stdev of the gradient for that weight). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
So although one colloquial description of Adam is that it automatically tunes a learning rate... a more informative description is that Adam statistically adjusts gradients to be more reliable, but you still need to decide on a learning rate and how it changes during training (e.g. a LR policy). ReduceLROnPlateau, cosine decay, warmup, etc are examples of an LR policy.
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer documentation is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html
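To see why the callback is not redundant, here is a toy plateau detector in plain Python. This is not Keras' actual implementation, just the idea: it adjusts the global learning rate that Adam's statistically adjusted gradient is ultimately multiplied by, which Adam itself never changes.

```python
class ReduceOnPlateau:
    """Toy version of the ReduceLROnPlateau idea: halve the base
    learning rate whenever the monitored loss fails to improve for
    `patience` consecutive epochs."""
    def __init__(self, lr=1e-3, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.wait = float("inf"), 0

    def on_epoch_end(self, loss):
        if loss < self.best:
            self.best, self.wait = loss, 0   # improvement: reset counter
        else:
            self.wait += 1
            if self.wait >= self.patience:   # plateau: cut the base LR
                self.lr *= self.factor
                self.wait = 0
        return self.lr

sched = ReduceOnPlateau(lr=0.001)
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7]      # plateau after epoch 3
for L in losses:
    lr = sched.on_epoch_end(L)
print(lr)
```

Adam's moment estimates decide the *direction and relative scale* of each parameter's update; a schedule like this decides the *global magnitude*, so the two compose rather than conflict.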

What Is Regularisation Loss in TensorFlow API? It Doesn't Align With Any Other Loss Function

I'm training an EfficientDet V7 model using the V2 model zoo and have the following output in TensorBoard:
This is great: you can see that my classification and localisation losses are dropping to low levels (I'll worry about overfitting later if that turns out to be a separate issue). But the regularisation loss is still high, and this is keeping my total loss at quite high levels. I can't seem to a) find a clear explanation (for a newbie) of what I'm looking at with the regularisation loss (what does it represent in this context), or b) find suggestions as to why it might be so high.
Usually, regularization loss is something like an L2 loss computed on the weights of your neural net. Minimizing this loss tends to shrink the values of the weights.
It is a regularization (hence the name) technique, which can help with such problems as over-fitting (maybe this article can help if you want to know more).
Bottom line: You don't have to do anything about it.
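A small NumPy sketch of what such a term looks like (the weight shapes and the 0.01 strength are made up for illustration; in Keras this corresponds to something like kernel_regularizer=l2(0.01) on each layer):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for two layers' weight matrices
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 1))]
l2 = 0.01  # regularization strength (illustrative)

# Regularization loss = l2 * sum of squared weights, over all layers.
reg_loss = sum(l2 * np.sum(w**2) for w in weights)

# It is simply added to the data loss; minimizing the total
# therefore also shrinks the weights.
data_loss = 0.35  # e.g. classification + localisation loss (made up)
total_loss = data_loss + reg_loss
print(reg_loss > 0, total_loss > data_loss)
```

This also explains why the term can dominate early in training: freshly initialized or large weights contribute a sum of squares that is big compared to an already-low data loss.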

Does the training lost diagram showing over-fitting? Deep Q-learning

The diagram below shows the training loss values against epochs. Based on the diagram, does it mean I have made the model over-fit? If not, what is causing the spikes in loss values across the epochs? Overall, it can be observed that the loss value is on a decreasing trend. How should I tune my settings in deep Q-learning?
Such a messy loss trajectory would usually mean that the learning rate is too high for the given smoothness of the loss function.
An alternative interpretation is that the loss function is not at all predictive of the success at the given task.

use resilient propagation on tensorflow

Is there a way to use resilient propagation (RProp) in TensorFlow? I know there are a number of backpropagation strategies. Is there one that is close to RProp? Can I insert my own implementation of resilient propagation?
RProp and RMSProp are quite similar. Both compare the sign of the current batch's (or single sample's) gradient to the previous one in order to update a per-parameter step value (usually multiplying it by 1.2 when the signs agree and by 0.5 when they don't). But while RProp compares successive batch gradients directly, RMSProp uses a discount factor to keep a running average for comparing signs. RProp uses this per-parameter value to take an absolute step in the direction of the gradient's sign, while RMSProp multiplies the value by the gradient.
RProp works great for larger batches but doesn't work well for stochastic updates, since the sign of the gradient will flicker, causing the step sizes to shrink towards their minimum, which stops learning. The running average in RMSProp solves this issue. But because RMSProp multiplies the value by the gradient, it is more susceptible to saturation than RProp (at least for sigmoid and tanh - though you can of course use ReLU or leaky ReLU to get around that).
There is no rprop implementation in tensorflow, although it would be fairly trivial to create one. You could create one by writing an op, or directly in python by combining ops.
There is an RMSProp which is a different thing.
Note that RProp doesn't work well with stochastic updates. The batch sizes would have to be very large for it to work.
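If you do roll your own, the algorithm described above is short. Here is a minimal NumPy sketch of the RProp idea (the 1.2/0.5 factors and step bounds are the conventional defaults; this is not a TensorFlow op, just the update rule on a toy full-batch problem):

```python
import numpy as np

def rprop_update(w, g, g_prev, step, eta_plus=1.2, eta_minus=0.5,
                 step_min=1e-6, step_max=50.0):
    """One RProp update per parameter: grow the step size when the
    gradient sign agrees with the previous batch, shrink it when it
    flips, then move by the step's absolute value (the gradient's
    magnitude is ignored, only its sign is used)."""
    same = g * g_prev
    step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)
    w = w - np.sign(g) * step
    return w, step

# Minimize f(w) = sum(w^2) with full-batch gradients
w = np.array([3.0, -2.0])
step = np.full_like(w, 0.1)
g_prev = np.zeros_like(w)
for _ in range(60):
    g = 2 * w
    w, step = rprop_update(w, g, g_prev, step)
    g_prev = g
```

With stochastic mini-batches the sign of g flickers, the `same < 0` branch fires constantly, and `step` collapses towards `step_min` - which is exactly the failure mode described above.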

MNIST for ML Beginners tutorial mistake

In the MNIST for ML Beginners tutorial I believe there is a mistake. I think this part is not accurate:
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Stochastic gradient descent is for updating the parameters for each training example (http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants), and in the tutorial batches of size of 100 are used, which I believe would be mini-batch gradient descent instead.
I could be wrong but shouldn't this be changed?
It is true that stochastic gradient descent (SGD) is referred to as gradient descent with a single data sample on Wikipedia (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) and in Sebastian Ruder's survey. However, it has become quite popular among machine learners to also use the term for mini-batch gradient descent.
When using stochastic gradient descent, you assume that the gradient can be reasonably approximated by the gradient using a single data sample, which may be quite a heavy assumption, depending on the fluctuations in the data. If you use mini-batch gradient descent with a small batch size (100 may be a small batch size for some problems), you are still depending on the individual batch, although this dependence is usually smaller than for a single sample (since you have at least a bit of averaging here).
Thus, the gradient itself (or the update rule, if you prefer that point of view) is a stochastic variable, since it fluctuates around the average value over the complete data set. Therefore, many people use mini-batch gradient descent and stochastic gradient descent as synonyms.
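A quick NumPy experiment makes the averaging argument concrete. Here the sample mean stands in for a mini-batch gradient estimate (the dataset and batch sizes are made up): the estimate from a batch of 100 fluctuates far less than the single-sample estimate, but both remain stochastic.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, size=10_000)   # "dataset"; the full-data mean is ~2.0

def estimate_noise(batch_size, n_trials=200):
    """Std-dev of the mini-batch estimate across random batches."""
    est = [rng.choice(x, size=batch_size).mean() for _ in range(n_trials)]
    return np.std(est)

noise_1 = estimate_noise(1)      # "pure" SGD: one sample per update
noise_100 = estimate_noise(100)  # mini-batch of 100, as in the tutorial
print(noise_1 > noise_100)
```

The batch-of-100 estimate's spread shrinks roughly by a factor of sqrt(100) = 10 relative to the single-sample one, yet it never reaches zero, which is why the mini-batch version is still reasonably called "stochastic".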