In the MNIST for ML Beginners tutorial I believe there is a mistake. I think this part is not accurate:
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Stochastic gradient descent updates the parameters after each training example (http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants), whereas the tutorial uses batches of size 100, which I believe would be mini-batch gradient descent instead.
I could be wrong but shouldn't this be changed?
It is true that stochastic gradient descent (SGD) is defined as gradient descent with a single data sample on Wikipedia (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) and in Sebastian Ruder's survey. However, it has become quite popular among machine learners to also use the term for mini-batch gradient descent.
When using stochastic gradient descent, you assume that the gradient can be reasonably approximated by the gradient using a single data sample, which may be quite a heavy assumption, depending on the fluctuations in the data. If you use mini-batch gradient descent with a small batch size (100 may be a small batch size for some problems), you are still depending on the individual batch, although this dependence is usually smaller than for a single sample (since you have at least a bit of averaging here).
Thus, the gradient itself (or the update rule, if you prefer this point of view) is a stochastic variable, since it fluctuates around the value computed on the complete data set. Therefore, many people use mini-batch gradient descent and stochastic gradient descent as synonyms.
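To make the "fluctuates around the full-data value" point concrete, here is a minimal sketch in plain Python. The toy regression problem, the function names and the batch sizes are all assumptions chosen for illustration; nothing here comes from the tutorial itself.

```python
import random
import statistics

random.seed(0)

# Hypothetical toy problem: fit y = w * x with squared error; the true slope is 3.
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
ys = [3.0 * x + random.gauss(0.0, 0.5) for x in xs]
data = list(zip(xs, ys))

def gradient(w, samples):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over the samples."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

w = 0.0
full_gradient = gradient(w, data)  # gradient on the complete data set

def gradient_spread(batch_size, trials=500):
    """Standard deviation of the gradient estimate across random batches."""
    estimates = [gradient(w, random.sample(data, batch_size)) for _ in range(trials)]
    return statistics.stdev(estimates)

spread = {b: gradient_spread(b) for b in (1, 100, 1000)}
for b, s in spread.items():
    print(f"batch size {b:>4}: std of gradient estimate = {s:.4f}")
print(f"full-batch gradient = {full_gradient:.4f}")
```

A batch of 100 is much less noisy than a single sample, but the estimate is still a random quantity, which is why both variants end up being called "stochastic".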
This question is about the inner workings of Keras or tf.keras, for people who have very deep knowledge of the framework.
As far as I know, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning rate scheme. So if we use keras.callbacks.ReduceLROnPlateau together with Adam or any other adaptive optimizer, isn't that meaningless? I don't know the inner workings of the Keras optimizers, but it seems natural to ask: if the optimizer is already adaptive, why use this callback, and if we do use it, what effect does it have on training?
Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is to take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, a running average of the squared gradient is kept for each weight, which measures how large and noisy the gradient has been across batches - the noisier it is, the less RMSProp "trusts" the gradient, and so the gradient is reduced (divided by the square root of that running average). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
So although one colloquial description of Adam is that it automatically tunes a learning rate, a more informative description is that Adam statistically adjusts gradients to be more reliable, but you still need to decide on a learning rate and how it changes during training (i.e. an LR policy). ReduceLROnPlateau, cosine decay, warmup, etc. are examples of an LR policy.
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer docs is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html
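Here is a rough sketch of the split described in this answer: Adam produces an adjusted gradient, and a separate LR policy decides how far to step. It is plain Python for a single weight. PlateauPolicy is a hypothetical, simplified stand-in written for this example and is not the Keras ReduceLROnPlateau implementation; the toy loss and all hyperparameter values are assumptions too.

```python
import math

def adam_direction(grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Statistically adjusted gradient for one weight; no learning rate yet."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # running mean
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # running mean of squares
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return m_hat / (math.sqrt(v_hat) + eps)

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

class PlateauPolicy:
    """Hypothetical, simplified reduce-on-plateau policy (not the Keras class)."""
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = math.inf
        self.wait = 0

    def update(self, loss):
        if loss < self.best:
            self.best, self.wait = loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor   # loss stopped improving: shrink the step
                self.wait = 0
        return self.lr

# Toy problem (an assumption for illustration): minimise (w - 2)**2.
w, state, policy = 0.0, new_state(), PlateauPolicy(lr=0.5)
for _ in range(200):
    loss = (w - 2.0) ** 2
    lr = policy.update(loss)               # the LR policy decides the step scale
    grad = 2.0 * (w - 2.0)
    w -= lr * adam_direction(grad, state)  # Adam only shapes the direction
print(f"w = {w:.4f}, final lr = {policy.lr:.6f}")
```

On the very first step the adjusted gradient comes out at roughly 1 whether the raw gradient is 1000 or 0.001, so the size of the step is set entirely by the learning rate. That is the part an LR policy controls.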
Gradient descent, RMSprop and Adam are optimizers. Assume I have chosen the Adam or RMSprop optimizer while compiling the model, i.e. model.compile(optimizer = "adam").
My question is: during backpropagation, is gradient descent used for updating the weights, or is Adam used?
Backpropagation computes the gradient of the loss with respect to the weights; the optimizer then decides how to turn that gradient into a weight update. With model.compile(optimizer = "adam"), the update rule is Adam's, which is itself a variant of gradient descent. There are plenty of optimizers, like the ones you mention and many more.
These optimizers use an adaptive learning rate. With an adaptive learning rate there are more degrees of freedom: the step can be larger along the y direction and smaller along the x direction. They do not get stuck in one direction, and they can travel further along one direction than along another.
RMSprop applies a momentum-like exponential decay to the history of squared gradients, so gradients in the distant past have less influence. It modifies the AdaGrad optimizer to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
Adam (adaptive moments) treats running averages of the gradient and of the squared gradient as estimates of its 1st and 2nd moments, and uses a momentum-like decay on both. In addition, it uses bias correction to avoid initial instabilities of the moments.
How to choose one?
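The two ideas above (a decaying average instead of AdaGrad's ever-growing sum, and bias correction) can be seen on a made-up gradient history. This is only a sketch for a single weight in plain Python; the function names, the decay values and the gradient sequence are assumptions for illustration.

```python
def adagrad_accumulator(grads):
    """AdaGrad-style accumulation: plain sum of squared gradients."""
    total = 0.0
    for g in grads:
        total += g * g
    return total

def rmsprop_accumulator(grads, decay=0.9):
    """RMSprop-style accumulation: exponentially weighted moving average."""
    avg = 0.0
    for g in grads:
        avg = decay * avg + (1.0 - decay) * g * g
    return avg

def ema_first_moment(grads, beta=0.9, bias_correction=True):
    """Adam-style running mean of the gradient, optionally bias-corrected."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    if bias_correction and grads:
        m /= 1.0 - beta ** len(grads)
    return m

# Made-up history: a few large early gradients, then many small ones.
history = [10.0] * 5 + [0.1] * 200

print("AdaGrad sum of squares :", adagrad_accumulator(history))
print("RMSprop moving average :", rmsprop_accumulator(history))
print("first moment, raw      :", ema_first_moment([2.0], bias_correction=False))
print("first moment, corrected:", ema_first_moment([2.0]))
```

The AdaGrad sum never forgets the early large gradients, while the moving average settles near the recent squared gradient. The raw first moment after one step is only a tenth of the actual gradient, which is the start-up bias that Adam's correction removes.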
Depends on the problem we are trying to solve. The best algorithm is the one that can traverse the loss for that problem pretty well.
It's more empirical than mathematical
I was wondering if any of the current deep learning frameworks can perform projected gradient descent.
There are implementations available for projected gradient descent in PyTorch, TensorFlow, and Python. You may need to slightly change them based on your model, loss, etc.
PyTorch: https://gist.github.com/oscarknagg/45b187c236c6262b1c4bbe2d0920ded6
TensorFlow: https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalGradientDescentOptimizer (this may be easier to use, since it works like any other TensorFlow optimizer and you do not have to handle the gradients yourself)
Python: https://github.com/amkatrutsa/liboptpy (constrained optimization section)
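If you want to see the idea before adapting one of those, projected gradient descent is an ordinary gradient step followed by a projection back onto the feasible set. Below is a minimal framework-free sketch; the unit L2 ball constraint, the quadratic objective and the function names are assumptions for illustration and are not taken from the linked implementations.

```python
import math

def project_onto_l2_ball(x, radius=1.0):
    """Closest point to x inside the L2 ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]

def projected_gradient_descent(grad_fn, x0, project, lr=0.1, steps=200):
    """Gradient step, then projection back onto the feasible set."""
    x = project(x0)
    for _ in range(steps):
        g = grad_fn(x)
        x = project([xi - lr * gi for xi, gi in zip(x, g)])
    return x

# Toy objective: squared distance to a target that lies outside the ball.
target = [2.0, 2.0]

def grad_fn(x):
    return [2.0 * (xi - ti) for xi, ti in zip(x, target)]

solution = projected_gradient_descent(grad_fn, [0.0, 0.0], project_onto_l2_ball)
print(solution)
```

The unconstrained minimum is the target itself; with the constraint, the iterate ends up on the boundary of the ball at the point closest to the target. In a framework you would do the same thing by applying the projection to the parameters after each optimizer step.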
I need help understanding the memory requirements of a neural network and their differences between training and evaluation processes. More specifically, the memory requirements of the training process (I'm using a Keras API running on top of TensorFlow).
For a CNN that contains N weights, when using a batch of size x, there is a constant amount of memory required for the weights themselves and the input data. During the forward pass the GPU needs additional x*N units of memory (the specific required amount is not crucial to the question) for passing all the samples simultaneously and calculating the activation of each neuron.
My question is about the back-propagation process. It seems that it requires an additional x*N units of memory(*) for the specific gradient of every weight for every sample. According to my understanding, it means that the algorithm calculates the specific gradients of each sample and then sums them up for the back-propagation to the previous layer.
Q. Since there is only a single update step per batch, why isn't the gradient calculation performed on the mean activation of each neuron? That way the additional required memory for training will only be (x+1)*N and not 2*x*N.
(*) This is according to my own little experiment on the maximal allowed batch size during evaluation (~4200) and training (~1200). Obviously it is a very simplified way of looking at the memory requirements.
The short answer is: that is just the way the mini-batch SGD back-propagation algorithm works.
Looking back at its origins, and at the difference between standard SGD and mini-batch SGD, makes it clearer why.
The standard stochastic gradient descent algorithm passes a single sample through the model, then back-propagates its gradients and updates the model weights before repeating the process with the next sample. The main downside is that it is a serial process (samples can't run simultaneously, because each sample needs to run on a model that was already updated by the previous sample), so it is very computationally expensive. In addition, using just a single sample for each update results in a very noisy gradient.
Mini-batch SGD uses the same principle, with one difference - the gradients are accumulated from multiple samples and an update is only performed once every x samples. This helps to get a smoother gradient during training and enables passing multiple samples through the model in parallel. This is the algorithm that is used when training with keras/tensorflow in mini-batches (commonly called batches, but that term actually refers to batch gradient descent, which is a slightly different algorithm).
I haven't found any work on using the mean of the gradients in each layer for the update. It would be interesting to check the results of such an algorithm. It would be more memory efficient; however, it is likely that it would also be less capable of reaching good minimum points.
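One way to see why working from mean activations is not equivalent to accumulating per-sample gradients: the weight gradient is a product of the error and the incoming activation, and the mean of a product is not the product of the means. This is a tiny sketch for a single linear neuron with squared error; the numbers and names are made up for illustration.

```python
# One linear neuron: prediction = w * a, loss = 0.5 * (w * a - y)**2.
activations = [1.0, -1.0]   # incoming activation for each sample in the batch
targets = [1.0, -1.0]
w = 0.0

def mean(values):
    return sum(values) / len(values)

def per_sample_gradients(w, activations, targets):
    """dLoss/dw for each sample: error times incoming activation."""
    return [(w * a - y) * a for a, y in zip(activations, targets)]

# What mini-batch SGD uses: the mean of the per-sample gradients.
true_gradient = mean(per_sample_gradients(w, activations, targets))

# The proposed shortcut: one gradient computed from the mean activation and mean error.
mean_activation = mean(activations)
mean_error = mean([w * a - y for a, y in zip(activations, targets)])
shortcut_gradient = mean_error * mean_activation

print("mean of per-sample gradients:", true_gradient)
print("gradient from mean values   :", shortcut_gradient)
```

Both samples agree that w should increase, yet the activations and errors cancel in the averages, so the shortcut sees no gradient at all. This does not say anything about how a particular framework lays out its memory; it only illustrates why the averaging cannot be moved in front of the product.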
Is there a way to use resilient propagation in TensorFlow? I know there are a number of back-propagation strategies. Is there one that is close to RProp? Can I insert my own implementation of resilient propagation?
RProp and RMSProp are quite similar. They both compare the sign of the gradient of the current batch (or single sample) to the previous, to update a unique value for each parameter (usually multiplying by 1.2 when the signs agree and by 0.5 when they don't). But while RProp compares each batch gradient, RMSProp uses a discount-factor to keep a running average for comparing signs. RProp uses this unique value to take an absolute step in the direction of the gradient while RMSProp multiplies the value and the gradient.
RProp works great for larger batches, but doesn't work well for stochastic updates, since the sign of the gradient will flicker, causing the step sizes to shrink toward their minimum, which stops learning. The running average of RMSProp solves this issue. But because RMSProp multiplies the value and gradient, it's more susceptible to saturation than RProp (at least for Sigmoid and Tanh - but you can of course use ReLU or Leaky ReLU to get around that).
There is no RProp implementation in TensorFlow, although it would be fairly trivial to create one. You could create one by writing an op, or directly in Python by combining ops.
There is an RMSProp optimizer, which is a different thing.
Note that RProp doesn't work well with stochastic updates. The batch sizes would have to be very large for it to work.
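For reference, the sign-based update described above fits in a few lines. The sketch below is plain Python rather than TensorFlow ops, and it follows the variant that skips the update right after a sign flip (often called iRprop-), which is an assumption on my part. The 1.2 and 0.5 factors come from the answer above; the step bounds and the toy objective are made up.

```python
def rprop_minimize(grad_fn, w, steps=200, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Full-batch RProp: per-weight step sizes driven only by gradient signs."""
    w = list(w)
    step_sizes = [step_init] * len(w)
    prev_grad = [0.0] * len(w)
    for _ in range(steps):
        grad = grad_fn(w)
        for i, g in enumerate(grad):
            agreement = g * prev_grad[i]
            if agreement > 0:      # same sign as last time: grow the step
                step_sizes[i] = min(step_sizes[i] * eta_plus, step_max)
            elif agreement < 0:    # sign flipped: shrink the step, skip this update
                step_sizes[i] = max(step_sizes[i] * eta_minus, step_min)
                g = 0.0
            if g > 0:
                w[i] -= step_sizes[i]   # move by the step size, ignoring |g|
            elif g < 0:
                w[i] += step_sizes[i]
            prev_grad[i] = g
    return w

# Toy objective with very different curvature per weight:
# f(w) = (w0 - 3)**2 + 10 * (w1 + 1)**2
def grad_fn(w):
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

result = rprop_minimize(grad_fn, [0.0, 0.0])
print(result)
```

Because only the sign of each gradient component is used, the difference in curvature between the two weights does not matter here. It also shows why noisy mini-batch gradients are a problem: every spurious sign flip halves the step size.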