Projected gradient descent - tensorflow

I was wondering if any of the current deep learning frameworks can perform projected gradient descent.

There are implementations available for projected gradient descent in PyTorch, TensorFlow, and Python. You may need to slightly change them based on your model, loss, etc.
PyTorch: https://gist.github.com/oscarknagg/45b187c236c6262b1c4bbe2d0920ded6
TensorFlow: https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/ProximalGradientDescentOptimizer (this may be the easiest option, since you can use it as a regular TensorFlow optimizer without manipulating the gradients yourself)
Python: https://github.com/amkatrutsa/liboptpy (constrained optimization section)
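If you prefer to write the projection step yourself, a minimal sketch of the idea (assuming TensorFlow 2 eager mode and a simple box constraint; the toy loss and bounds here are just placeholders) is: take an ordinary gradient step, then project the variable back onto the feasible set.

    import tensorflow as tf

    # Projection onto the feasible set; here assumed to be the box [lower, upper].
    def project(x, lower=-1.0, upper=1.0):
        return tf.clip_by_value(x, lower, upper)

    x = tf.Variable([2.0, -3.0])                      # iterate (may start infeasible)
    loss_fn = lambda: tf.reduce_sum((x - 5.0) ** 2)   # toy quadratic loss
    lr = 0.1

    for _ in range(100):
        with tf.GradientTape() as tape:
            loss = loss_fn()
        grad = tape.gradient(loss, x)
        x.assign(x - lr * grad)    # ordinary gradient step
        x.assign(project(x))       # projection step

    print(x.numpy())  # converges to the boundary of the box, [1. 1.]

Note that Keras weight constraints (e.g. the kernel_constraint argument on layers) work in the same spirit: the constraint function is applied to the weights after each optimizer update, which is effectively a projection step.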

Related

Is it meaningless to use ReduceLROnPlateau with Adam optimizer?

This question is basically about the inner workings of Keras / tf.keras, for people who know the framework in depth.
As far as I know, tf.keras.optimizers.Adam is an optimizer that already has an adaptive learning-rate scheme. So if we use keras.callbacks.ReduceLROnPlateau with the Adam optimizer (or any other adaptive optimizer), isn't that meaningless? I don't know the inner workings of Keras optimizers, but it seems natural to ask: if we are already using an adaptive optimizer, why use this callback at all, and if we do use it, what effect will it have on training?
Conceptually, consider the gradient a fixed, mathematical value from automatic differentiation.
What every optimizer other than pure SGD does is to take the gradient and apply some statistical analysis to create a better gradient. In the simplest case, momentum, the gradient is averaged with previous gradients. In RMSProp, the variance of the gradient across batches is measured - the noisier it is, the less RMSProp "trusts" the gradient and so the gradient is reduced (divided by the stdev of the gradient for that weight). Adam does both.
Then, all optimizers multiply the statistically adjusted gradient by a learning rate.
So although one colloquial description of Adam is that it automatically tunes a learning rate... a more informative description is that Adam statistically adjusts gradients to be more reliable, but you still need to decide on a learning rate and how it changes during training (e.g. a LR policy). ReduceLROnPlateau, cosine decay, warmup, etc are examples of an LR policy.
Whether you program in TF or PyTorch, the pseudocode in PyTorch's optimizer documentation is my go-to for understanding the optimizer algorithms. It looks like a wall of Greek letters at first, but you'll grok it if you stare at it for a few minutes.
https://pytorch.org/docs/stable/optim.html
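As a concrete illustration (a minimal sketch; the model, data names and hyperparameters are placeholders, not from the question), Adam still takes an explicit learning rate, and ReduceLROnPlateau simply shrinks that rate when the monitored metric stops improving:

    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy")

    # Halve the learning rate whenever val_loss has not improved for 3 epochs.
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                     factor=0.5, patience=3)

    # model.fit(x_train, y_train, validation_data=(x_val, y_val),
    #           epochs=50, callbacks=[reduce_lr])

So the callback and the optimizer act on different things: Adam rescales the gradient per weight, while the callback changes the global learning rate that multiplies it.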

Is Gradient Descent always used during backpropagation for updating weights?

Gradient Descent, rmsprop, adam are optimizers. Assume I have taken adam or rmsprop optimizer while compiling model i.e model.compile(optimizer = "adam").
My doubt is: during backpropagation, is gradient descent used for updating the weights, or is Adam used for updating the weights?
Backpropagation computes the gradients, and the optimizer (gradient descent or one of its variants) then uses them to update the weights. There are plenty of optimizers, like the ones you mention and many more.
These optimizers use an adaptive learning rate, which gives more degrees of freedom: the effective step size can grow along one direction of the loss surface and shrink along another. They don't get stuck making tiny steps in one direction, and they can make more progress along the directions that matter.
RMSProp applies a momentum-like exponential decay to the gradient history, so gradients from the distant past have less influence. It modifies the AdaGrad optimizer to perform better in the non-convex setting by changing the gradient accumulation into an exponentially weighted moving average.
Adam (adaptive moments) computes the first and second moments of the gradient and applies a momentum-like decay to both. In addition, it uses bias correction to avoid instabilities of the moments early in training.
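To make that description concrete, here is a rough sketch of the Adam update for a single weight vector (names and defaults are the usual ones; g is the gradient computed by backpropagation, t is the step count starting at 1):

    import numpy as np

    def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g        # 1st moment: momentum-like average of gradients
        v = beta2 * v + (1 - beta2) * g**2     # 2nd moment: running average of squared gradients
        m_hat = m / (1 - beta1**t)             # bias correction for the early steps
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive, per-weight step
        return w, m, v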
How to choose one?
It depends on the problem we are trying to solve. The best algorithm is the one that traverses the loss surface of that problem well.
It's more empirical than mathematical

Tensorflow: How to perform binary classification as pre-processing and perform linear regression training

In TensorFlow, you can perform either classification or linear regression to train your inputs against the labels. Is it possible to perform some classification on your inputs (as pre-processing, not necessarily using TensorFlow) to determine whether to run the linear regression in TensorFlow?
For example, in an image-denoising task, you have found that your linear regression algorithm provides a good smoothing effect on edges but at the same time removes the details of texture objects. Therefore you would like to perform a binary classification to determine whether an input is a texture object, and run the linear regression algorithm in TensorFlow accordingly; otherwise do nothing for texture objects.
I understand TensorFlow supports transfer learning, so I guess one possible solution is to perform the binary classification in TensorFlow and transfer the "texture classification" knowledge to instruct TensorFlow to apply the linear regression algorithm only when the input is a texture object? Please correct me if I am wrong, as I am not sure whether the above is doable in TensorFlow (it would be great if you could describe how to do this in detail if it is :-) ).
I guess an alternative solution is to use some binary classification without TensorFlow, and filter out (remove) the texture inputs before passing them to TensorFlow.
Please kindly tell me which of the above solutions (or any other solution) is better (if doable) for the above scenario. Any suggestions are welcome.

Is it possible to train pytorch and tensorflow model together on one GPU?

I have a PyTorch model and a TensorFlow model, and I want to train them together on one GPU, following the process below: input --> PyTorch model --> output_pytorch --> TensorFlow model --> output_tensorflow --> PyTorch model.
Is it possible to do this? If the answer is yes, are there any problems I will encounter?
Thanks in advance.
I haven't done this, but it is possible; implementing it can be a little tricky.
You can consider each network as a function. You want to, in some sense, compose these functions to form your combined network. To do this, you can compute the final function by feeding the result of one network into the other, and then use the chain rule to compute the derivatives (using the automatic differentiation machinery of both packages).
I think a good way of implementing this might be to wrap the TF model as a PyTorch Function and use tf.gradients to compute the backward pass.
Doing the gradient updates can get really tricky (because some variables live in TF's computation graph). You could turn the TF variables into PyTorch Variables, make them placeholders in the TF computation graph, feed them via feed_dict, and update them using PyTorch's mechanisms, but I think that would be really hard to do. Instead, if you do your updates inside the backward method of the Function, you might be able to get the job done (it is ugly, but it might work).
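As a rough sketch of the wrapping idea (using TF2's GradientTape rather than the older tf.gradients/feed_dict API mentioned above, and assuming tf_model is a callable Keras model whose own weights are not updated from PyTorch), it might look something like this:

    import tensorflow as tf
    import torch

    class TFWrapper(torch.autograd.Function):
        """Runs a TF model inside a PyTorch graph; gradients flow back to the
        PyTorch input, but the TF model's own variables are left untouched."""

        @staticmethod
        def forward(ctx, torch_input, tf_model):
            tf_input = tf.convert_to_tensor(torch_input.detach().cpu().numpy())
            with tf.GradientTape() as tape:
                tape.watch(tf_input)
                tf_output = tf_model(tf_input)
            ctx.tape, ctx.tf_input, ctx.tf_output = tape, tf_input, tf_output
            return torch.as_tensor(tf_output.numpy()).to(torch_input.device)

        @staticmethod
        def backward(ctx, grad_output):
            grad_out_tf = tf.convert_to_tensor(grad_output.detach().cpu().numpy())
            # Gradient w.r.t. the TF input, weighted by the upstream PyTorch gradient.
            grad_in_tf = ctx.tape.gradient(ctx.tf_output, ctx.tf_input,
                                           output_gradients=grad_out_tf)
            return torch.as_tensor(grad_in_tf.numpy()).to(grad_output.device), None

    # usage: out = TFWrapper.apply(pytorch_tensor, tf_model)

The repeated CPU/NumPy round-trips will cost performance even when both frameworks run on the same GPU, which is one of the practical problems you would encounter.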

MNIST for ML Beginners tutorial mistake

In the MNIST for ML Beginners tutorial I believe there is a mistake. I think this part is not accurate:
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Stochastic gradient descent is for updating the parameters for each training example (http://sebastianruder.com/optimizing-gradient-descent/index.html#gradientdescentvariants), and in the tutorial batches of size of 100 are used, which I believe would be mini-batch gradient descent instead.
I could be wrong but shouldn't this be changed?
It is true that stochastic gradient descent (SGD) refers to gradient descent with a single data sample on Wikipedia (https://en.wikipedia.org/wiki/Stochastic_gradient_descent) and in Sebastian Ruder's survey. However, it has become quite popular among machine learners to also use the term for mini-batch gradient descent.
When using stochastic gradient descent, you assume that the gradient can be reasonably approximated by the gradient using a single data sample, which may be quite a heavy assumption, depending on the fluctuations in the data. If you use mini-batch gradient descent with a small batch size (100 may be a small batch size for some problems), you are still depending on the individual batch, although this dependence is usually smaller than for a single sample (since you have at least a bit of averaging here).
Thus, the gradient itself (or the update rule, if you prefer this point of view) is a stochastic variable, since it fluctuates around the gradient of the complete data set. Therefore, many people use mini-batch gradient descent and stochastic gradient descent as synonyms.
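To illustrate the distinction (a minimal sketch; grad is a placeholder for whatever per-batch gradient function your model defines), the only thing that separates the two variants is the batch size:

    import numpy as np

    def train(w, X, y, grad, lr=0.01, batch_size=100, epochs=10):
        n = len(X)
        for _ in range(epochs):
            idx = np.random.permutation(n)
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]
                # batch_size=1 is "textbook" SGD; batch_size=100 is mini-batch
                # gradient descent, but both give a noisy estimate of the full gradient.
                w = w - lr * grad(w, X[batch], y[batch])
        return w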