why doesn't gradient descent step back to avoid oscillation - optimization

I am performing an optimization using gradient descent, but sometimes it jumps over the minimum and the cost function increases. I added a condition: if the cost function value increased, step back and use a smaller learning rate this time. It is working very well. Why am I not seeing this in the literature anywhere? I have read a lot of optimization literature on adapting the learning rate, but the methods never step back and retake the step. Is there something wrong with this approach?
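For concreteness, here is a minimal sketch of the kind of check described in the question; the quadratic objective, starting point, and learning rate are illustrative assumptions, and the rule itself is essentially a crude backtracking step-size control:

```python
# Toy objective and gradient (illustrative assumptions, not the asker's function).
def cost(x):
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

x, lr = 0.0, 1.2                      # deliberately large lr so the step overshoots
for _ in range(50):
    proposal = x - lr * grad(x)       # ordinary gradient-descent step
    if cost(proposal) > cost(x):      # cost went up: we jumped over the minimum
        lr *= 0.5                     # step back (keep the old x) and shrink the step
        continue
    x = proposal

print(x, lr)                          # x approaches 3.0
```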

Related

what is a "convolution warmup"?

I encountered this phrase a few times before, mostly in the context of neural networks and TensorFlow, but I get the impression it's something more general and not restricted to these environments.
Here, for example, they say that this "convolution warmup" process takes about 10k iterations.
Why do convolutions need to warm up? What prevents them from reaching their top speed right away?
One thing I can think of is memory allocation. If so, I would expect it to be resolved after 1 (or at most a few, certainly <10) iterations. Why 10k?
Edit for clarification: I understand that the warmup is a time period, or number of iterations, that has to pass until the convolution operator reaches its top speed (time per operator).
What I am asking is: why is it needed, and what happens during this time that makes the convolution faster?
Training neural networks works by offering training data, calculating the output error, and backpropagating the error back to the individual connections. For symmetry breaking, the training doesn't start with all zeros, but with random connection strengths.
It turns out that with the random initialization, the first training iterations aren't really effective. The network isn't anywhere near to the desired behavior, so the errors calculated are large. Backpropagating these large errors would lead to overshoot.
A warmup phase is intended to get the initial network away from a random network, and towards a first approximation of the desired network. Once the approximation has been achieved, the learning rate can be accelerated.
This is an empirical result. The number of iterations will depend on the complexity of your problem domain, and therefore also on the complexity of the necessary network. Convolutional neural networks are fairly complex, so warmup is more important for them.
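To make this concrete, here is a minimal sketch of a linear learning-rate warmup schedule in the spirit of this answer; the base rate is an illustrative assumption, and the 10k warmup steps echo the number mentioned in the question:

```python
# Linear learning-rate warmup: ramp from ~0 up to the base rate, then hold it.
def learning_rate(step, base_lr=0.1, warmup_steps=10_000):
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # small steps while the net is still random
    return base_lr                                    # full rate once warmed up

print(learning_rate(0), learning_rate(5_000), learning_rate(20_000))
```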
You are not alone in observing that the time per iteration varies.
I ran the same example and had the same question. I can say the main reason is the different input image shapes and the number of objects to detect.
I offer my test results for discussion.
I enabled tracing and looked at the timeline first; I found that the number of Conv2D occurrences varies between steps in the GPU "stream all compute" section. Then I used export TF_CUDNN_USE_AUTOTUNE=0 to disable autotuning.
After that, there is the same number of Conv2D ops in the timeline, and the time is about 0.4 s.
The time costs are still different, but much closer!
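For reference, the same switch can also be set from Python rather than with export; this is a small sketch, and only the TF_CUDNN_USE_AUTOTUNE variable itself comes from the answer above:

```python
import os

# Disable cuDNN autotuning. Set the variable before TensorFlow is imported,
# otherwise the setting may not take effect.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf  # imported after setting the variable on purpose
```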

Online Learning with SGD using Occasional Updates

I am working on an online machine learning scheme where SGD is used. However, in my case, calculating the gradient is rather costly and adjacent input samples are very similar. Therefore, I do not want to compute the gradient and make an update for every new sample, but only occasionally, when a significant change in the input is present. I want this update scheme to be mathematically justifiable, since it will go into my master's thesis.
My questions are:
1) Does this make sense or are there better strategies?
2) What might be a good measure of 'sufficient change in inputs'? (I use time series)
Thanks a lot!
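To make the scheme concrete, here is a minimal sketch of such a thresholded-update loop; the linear model, squared loss, synthetic sample stream, and change threshold are all illustrative assumptions, not part of the question:

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])

def sample_stream(n=5000):
    """Synthetic time series where adjacent inputs are very similar."""
    x = rng.normal(size=3)
    for _ in range(n):
        x = 0.98 * x + 0.05 * rng.normal(size=3)
        yield x, true_w @ x

w = np.zeros(3)
lr, threshold = 0.5, 0.3
last_used_x = None
updates = 0

for x, y in sample_stream():
    if last_used_x is not None and np.linalg.norm(x - last_used_x) < threshold:
        continue                              # input barely changed: skip the costly gradient
    w = w - lr * 2.0 * (w @ x - y) * x        # SGD step on a squared loss, taken only occasionally
    last_used_x = x.copy()
    updates += 1

print(updates, "updates for 5000 samples")
print(w, "vs true", true_w)
```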

Neural Network optimization

I am trying to understand the purpose of the ReduceLROnPlateau() function in Keras.
I understand that this function reduces the learning rate when there is no improvement in the validation loss. But won't this prevent the network from getting out of a local minimum? What if the network stays at a local minimum for about 5 epochs and this function reduces the learning rate further, when increasing the learning rate would actually help the network escape such a local minimum?
In other words, how will it understand if it has reached a local minimum or a plateau?
First up, here is a good explanation from CS231n class why learning rate decay is reasonable in general:
In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you’ll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can.
Concerning your question: unfortunately, you can't know it. If the optimizer hits a deep valley and can't get out of it, it simply hopes that this valley is good and worth exploring with a smaller learning rate. Currently, there's no technique to tell whether there are better valleys, i.e., whether it's a local or a global minimum. So the optimizer makes a bet to explore the current one rather than jump far away and start over. As it turns out, in practice no local minimum is much worse than the others, which is why this strategy often works.
Also note that the loss surface may appear like a plateau for some learning rate, but not for 10 times smaller learning rate. So "escape the plateau" and "escape local minimum" are different challenges, and ReduceLROnPlateau aims for the first one.
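For reference, here is a minimal usage sketch of the callback under discussion; the parameter values and the commented-out fit call are illustrative assumptions:

```python
from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(
    monitor="val_loss",   # watch the validation loss for a plateau
    factor=0.1,           # multiply the learning rate by this when triggered
    patience=5,           # epochs with no improvement before reducing
    min_lr=1e-6,          # never go below this learning rate
)

# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[reduce_lr])   # hypothetical model and data
```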

Calculating gradients in optical flow optimization (example: for incremental Horn and Schunck method)

I am having a problem understanding how gradients are calculated for the incremental Horn and Schunck method (and actually not only for that method, but more generally in iterative optimization methods for optical flow/deformable image registration). I use the Horn and Schunck example here because I can refer to a paper that shows what is unclear to me.
I was reading the following paper http://campar.in.tum.de/pub/zikic2010revisiting/zikic2010revisiting.pdf that states that incremental Horn and Schunck method is actually the same as using Gauss-Newton optimization scheme on the original problem. In incremental Horn and Schunck, the solution of one Horn and Schunck iteration is used as the initial estimate for the following one: the displacement field is used to warp the source image, then the next Horn and Schunck iteration is used to calculate an incremental step. Afterwards, the initial estimate and the step are added and used as initialization for the next iteration. So far, so good, one can do this, even if I wouldn't say it is intuitive that this procedure of splitting things up and putting them back together should be correct.
Now the paper states that this (at first sight heuristic) approach can be derived as a Gauss-Newton optimization, which means it should have a more mathematical foundation. Now I find a motif that I came across more than once but cannot reason about:
In the term
$$I_T(x) - I_S(x + U(x)),$$
$I_T$ is the target image, $I_S$ is the source image, and $U(x)$ is the deformation field that is optimized.
When linearizing this energy term around a certain current deformation field value $U(x)$ (equation 19 in the paper, which I changed slightly), we get for a step $h(x)$
$$\mathrm{linearized}(h(x)) \equiv I_T(x) - I_S(x + U(x)) - \nabla I_S(x + U(x)) \cdot h(x).$$
Now the question is: what is $\nabla I_S(x + U(x))$? The authors argue that this is the same gradient as in the incremental Horn and Schunck method, but I would say the gradient needs to be taken with respect to $U(x)$. That means I would take the numerical gradient of $I_S$ at position $x + U(x)$. However, I have often seen that instead, the image is warped according to $U(x)$, and then the numerical gradient of the warped image is taken at position $x$. This also seems to be what is done in incremental Horn and Schunck, but it doesn't seem correct to me.
Is it an approximation that nobody talks about? Am I missing something? Or were all the implementations I saw, which use numerical gradients of warped images when iteratively optimizing for optical flow, simply doing the wrong thing?
Many thanks to anybody who could help me to get a bit enlightened.
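For concreteness, here is a small NumPy/SciPy sketch contrasting the two gradient conventions the question describes; the image, the constant displacement field, and the interpolation settings are illustrative assumptions:

```python
import numpy as np
from scipy import ndimage

# Smooth toy source image and a small constant displacement field U.
y, x = np.mgrid[0:64, 0:64].astype(float)
I_S = np.sin(x / 5.0) * np.cos(y / 7.0)
U = 0.5 * np.ones((2, 64, 64))

coords = np.stack([y + U[0], x + U[1]])   # the displaced positions x + U(x)

# Convention A: take the gradient of I_S, then sample it at the displaced positions.
gy, gx = np.gradient(I_S)
grad_at_displaced = np.stack([
    ndimage.map_coordinates(gy, coords, order=1),
    ndimage.map_coordinates(gx, coords, order=1),
])

# Convention B: warp I_S by U first, then take the gradient of the warped image at x.
I_S_warped = ndimage.map_coordinates(I_S, coords, order=1)
grad_of_warped = np.stack(np.gradient(I_S_warped))

# By the chain rule the two differ by the Jacobian of the warp; for a constant U they
# agree up to interpolation and boundary error, but for a general U they do not.
print(np.abs(grad_at_displaced - grad_of_warped).max())
```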

How to run gradient descent algorithm when parameter space is constrained?

I would like to maximize a function with one parameter.
So I run gradient descent (or, ascent actually): I start with an initial parameter and keep adding the gradient (times some learning rate factor that gets smaller and smaller), re-evaluate the gradient given the new parameter, and so on until convergence.
But there is one problem: My parameter must stay positive, so it is not supposed to become <= 0 because my function will be undefined. My gradient search will sometimes go into such regions though (when it was positive, the gradient told it to go a bit lower, and it overshoots).
And to make things worse, the gradient at such a point might be negative, driving the search toward even more negative parameter values. (The reason is that the objective function contains logs, but the gradient doesn't.)
What are some good (simple) algorithms that deal with this constrained optimization problem? I'm hoping for just a simple fix to my algorithm. Or maybe ignore the gradient and do some kind of line search for the optimal parameter?
Each time you update your parameter, check to see if it's negative, and if it is, clamp it to zero.
If clamping to zero is not acceptable, try adding a "log barrier" (Google it). Basically, it adds a smooth "soft" wall to your objective function (and modifies your gradient accordingly) to keep the search away from regions you don't want it to go to. You then repeatedly run the optimization, hardening the wall each time to make it closer to a vertical barrier, but starting from the previously found solution. In the limit (in practice only a few iterations are needed), the problem you are solving is identical to the original problem with a hard constraint.
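Here is a minimal sketch of the log-barrier idea, with a toy objective standing in for the real one; all constants are illustrative assumptions:

```python
# Toy objective, maximized at x = 1 and undefined for x <= 0 (illustrative assumption).
def grad_f(x):
    return 1.0 / x - 1.0            # derivative of f(x) = log(x) - x

x, lr = 4.0, 0.05
for mu in (1.0, 0.1, 0.01, 0.001):  # harden the "soft wall" on each outer round
    for _ in range(500):
        # The barrier term mu * log(x) adds mu / x to the gradient, pushing x away from 0.
        x = x + lr * (grad_f(x) + mu / x)

print(x)                             # approaches the maximizer of f (~1.0) as mu -> 0
```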
Without knowing more about your problem, it's hard to give specific advice. Your gradient ascent algorithm may not be particularly suitable for your function space. However, given that's what you've got, here's one tweak that would help.
You're following what you believe is an ascending gradient. But when you move forwards in the direction of the gradient, you discover you have fallen into a pit of negative value. This implies that there was a nearby local maximum, but also a very sharp negative gradient cliff. The obvious fix is to backtrack to your previous position and take a smaller step (e.g. half the size). If you still fall in, repeat with a still smaller step. Iterate until you find the local maximum at the edge of the cliff.
The problem is, there is no guarantee that your local maximum is actually global (unless you know more about your function than you are sharing). This is the main limitation of naive gradient ascent: it latches onto the first local maximum it finds and converges to it. If you don't want to switch to a more robust algorithm, one simple approach that could help is to run your code n times, starting each time from a random position in the function space, and keep the best maximum you find overall. This Monte Carlo approach increases the odds that you will stumble on the global maximum, at the cost of increasing your run time by a factor of n. How effective this is will depend on the 'bumpiness' of your objective function.
A simple trick to restrict a parameter to be positive is to re-parametrize the problem in terms of its logarithm (make sure to change the gradient appropriately). Of course it is possible that the optimum moves to -infty with this transformation, and the search does not converge.
At each step, constrain the parameter to be positive. This is, in short, the projected gradient method, which you may want to look up.
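A minimal sketch of such a projected-gradient step, with a toy objective standing in for the real one; eps and the learning rate are illustrative assumptions:

```python
def grad_f(x):
    return 1.0 / x - 1.0            # derivative of log(x) - x, maximized at x = 1

x, lr, eps = 4.0, 0.1, 1e-8
for _ in range(1000):
    x = x + lr * grad_f(x)          # ordinary gradient-ascent step
    x = max(x, eps)                 # projection: clip back into the feasible set {x >= eps}

print(x)                            # ~1.0, never allowed to leave the positive region
```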
I have three suggestions, in order of how much thinking and work you want to do.
First, in gradient descent/ascent, you move each time by the gradient times some factor, which you refer to as a "learning rate factor." If, as you describe, this move causes x to become negative, there are two natural interpretations: Either the gradient was too big, or the learning rate factor was too big. Since you can't control the gradient, take the second interpretation. Check whether your move will cause x to become negative, and if so, cut the learning rate factor in half and try again.
Second, to elaborate on Aniko's answer, let x be your parameter, and f(x) be your function. Then define a new function g(x) = f(e^x), and note that although the domain of f is (0, infinity), the domain of g is (-infinity, infinity). So g cannot suffer the problems that f suffers. Use gradient descent to find the value x_0 that maximizes g. Then e^(x_0), which is positive, maximizes f. To apply gradient descent on g, you need its derivative, which is f'(e^x)*e^x, by the chain rule. (A short sketch of this follows after the third suggestion.)
Third, it sounds like you're trying to maximize just one function, not write a general maximization routine. You could consider shelving gradient descent and tailoring the method of optimization to the idiosyncrasies of your specific function. We would have to know a lot more about the expected behavior of f to help you do that.
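Here is a minimal sketch of the second suggestion (the exponential reparametrization); the toy objective stands in for the asker's f and is an illustrative assumption:

```python
import numpy as np

def f_prime(x):
    return 1.0 / x - 1.0                 # derivative of the toy objective f(x) = log(x) - x

t, lr = 2.0, 0.1
for _ in range(500):
    x = np.exp(t)                        # x = e^t is positive by construction
    t = t + lr * f_prime(x) * x          # g'(t) = f'(e^t) * e^t by the chain rule

print(np.exp(t))                         # ~1.0, and positive throughout the search
```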
Just use Brent's method for minimization. It is stable and fast and the right thing to do if you have only one parameter. It's what the R function optimize uses. The link also contains a simple C++ implementation. And yes, you can give it MIN and MAX parameter value limits.
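If you are working in Python rather than R or C++, SciPy exposes a bounded Brent-style scalar minimizer; a minimal sketch, where the objective and bounds are illustrative assumptions and the negation turns maximization into minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Maximize log(x) - x on (0, 10] by minimizing its negative with a bounded 1-D method.
res = minimize_scalar(lambda x: -(np.log(x) - x), bounds=(1e-8, 10.0), method="bounded")
print(res.x)   # ~1.0
```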
You're getting good answers here. Reparameterizing is what I would recommend. Also, have you considered another search method, like Metropolis-Hastings? It's actually quite simple once you bull through the scary-looking math, and it gives you standard errors as well as an optimum.