Update equation for gradient descent - optimization

If we have a approximation function y = f(w,x), where x is input, y is output, and w is the weight. According to gradient descent rule, we should update the weight according to w = w - df/dw. But is that possible that we update the weight according to w = w - w * df/dw instead? Has anyone seen this before? The reason I want to do this is because it is easier for me to do it this way in my algorithm.

Recall, gradient descent is based on the Taylor expansion of f(w, x) in the close vicinity of w, and has its purpose---in your context---in repeatedly modifying the weight in small steps. The reverse gradient direction is just a search direction, based upon very local knowledge of the function f(w, x).
Usually the iterative of the weight includes a step length, yielding the expression
w_(i+1) = w_(i) - nu_j df/dw,
where the value of the step length nu_j is found by using line search, see e.g. https://en.wikipedia.org/wiki/Line_search.
Hence, based on the discussion above, to answer your question: no, it is not a good idea to update according to
w_(i+1) = w_(i) - w_(i) df/dw.
Why? If w_(i) is large (in context), we'll take a huge step based on very local information, and we would be using something very different than the fine-stepped gradient descent method.
Also, as lejlot points out in the comments below, a negative value of w(i) would mean you traverse in the (positive) direction of the gradient, i.e., in the direction in which the function grows most rapidly, which is, locally, the worst possible search direction (for minimization problems).

Related

What is the most reliable method to optimize linear objective function with nonlinear constraints and Why?

I currently solve the following problem:
Basically, this problem is equivalent to find the confidence interval for logistic regression. The objective function is linear (no second derivative), meanwhile, the constraint is non-linear. Specifically, I used n = 1, alpha = 0.05, theta = logit of p where p = [0,1] (for detail, please see binomial distribution). Thus, I have a closed-form solution for the gradient and jacobian for objective and constraints respectively.
In R, I first tried the alabama::auglag function which used augmented Lagrangian method with BFGS (as a default) and nloptr::auglag function which used augmented Lagrangian method with SLSQP (i.e. SLSQP as a local minimizer). Although they were able to find the (global) minimizer most time, sometimes they failed and produced a far-off solution.
After all, I could obtain the best (most stable) result using SLSQP method (nloptr::nloptr with algorithm=NLOPT_LD_SLSQP).
Now, I have a question of why SLSQP produced better result in this setting than the first two methods and why the first two methods (augmented Lagrangian with BFGS and SLSQP as a local optimizer) did not perform well.
Another question is, considering my problem setting, what would be the best method to find the optimizer?
Any comments and suggestions would be much appreciated.
Thanks.

Non-Convex Loss Function

I am trying to understand gradient descent algorithm by plotting the error vs value of parameters in the function. What would be an example of a simple function of the form y = f(x) with just just one input variable x and two parameters w1 and w2 such that it has a non-convex loss function ? Is y = w1.tanh(w2.x) an example ? What i am trying to achieve is this :
How does one know if the function has a non-convex loss function without plotting the graph ?
In iterative optimization algorithms such as gradient descent or Gauss-Newton, what matters is whether the function is locally convex. This is correct (on a convex set) if and only if the Hessian matrix (Jacobian of gradient) is positive semi-definite. As for a non-convex function of one variable (see my Edit below), a perfect example is the function you provide. This is because its second derivative, i.e Hessian (which is of size 1*1 here) can be computed as follows:
first_deriv=d(w1*tanh(w2*x))/dx= w1*w2 * sech^2(w2*x)
second_deriv=d(first_deriv)/dx=some_const*sech^2(w2*x)*tanh(w2*x)
The sech^2 part is always positive, so the sign of second_deriv depends on the sign of tanh, which can vary depending on the values you supply as x and w2. Therefore, we can say that it is not convex everywhere.
Edit: It wasn't clear to me what you meant by one input variable and two parameters, so I assumed that w1 and w2 were fixed beforehand, and computed the derivative w.r.t x. But I think that if you want to optimize w1 and w2 (as I suppose it makes more sense if your function is from a toy neural net), then you can compute the 2*2 Hessian in a similar way.
The same way as in high-school algebra: the second derivative tells you the direction of flex. If that's negative in all orientations, then the function is convex.

how tensorflow handles complex gradient?

Let z is a complex variable, C(z) is its conjugation.
In complex analysis theory, the derivative of C(z) w.r.t z don't exist. But in tesnsorflow, we can calculate dC(z)/dz and the result is just 1.
Here is an example:
x = tf.placeholder('complex64',(2,2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y,x)
sess = tf.Session()
X = np.random.rand(2,2)+1.j*np.random.rand(2,2)
X = X.astype('complex64')
Z = sess.run(z,{x:X})[0]
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient is set to be 1?
And I want to know how tensorflow handles the complex gradients in general.
How?
The equation used by Tensorflow for the gradient is:
Where the '*' means conjugate.
When using the definition of the partial derivatives wrt z and z* it uses Wirtinger Calculus. Wirtinger calculus enables to calculate the derivative wrt a complex variable for non-holomorphic functions. The Wirtinger definition is:
Why this definition?
When using for example Complex-Valued Neural Networks (CVNN) the gradients will be used over non-holomorphic, real-valued scalar function of one or several complex variables, tensorflow definition of a gradient can then be written as:
This definition corresponds with the literature of CVNN like for example chapter 4 section 4.3 of this book or Amin et al. (between countless examples).
Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + idF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921

Tensorflow: opt.compute_gradients() returns values different from the weight difference of opt.apply_gradients()

Question: What is the most efficient way to get the delta of my weights in the most efficient way in a TensorFlow network?
Background: I've got the operators hooked up as follows (thanks to this SO question):
self.cost = `the rest of the network`
self.rmsprop = tf.train.RMSPropOptimizer(lr,rms_decay,0.0,rms_eps)
self.comp_grads = self.rmsprop.compute_gradients(self.cost)
self.grad_placeholder = [(tf.placeholder("float", shape=grad[1].get_shape(), name="grad_placeholder"), grad[1]) for grad in self.comp_grads]
self.apply_grads = self.rmsprop.apply_gradients(self.grad_placeholder)
Now, to feed in information, I run the following:
feed_dict = `training variables`
grad_vals = self.sess.run([grad[0] for grad in self.comp_grads], feed_dict=feed_dict)
feed_dict2 = `feed_dict plus gradient values added to self.grad_placeholder`
self.sess.run(self.apply_grads, feed_dict=feed_dict2)
The command of run(self.apply_grads) will update the network weights, but when I compute the differences in the starting and ending weights (run(self.w1)), those numbers are different than what is stored in grad_vals[0]. I figure this is because the RMSPropOptimizer does more to the raw gradients, but I'm not sure what, or where to find out what it does.
So back to the question: How do I get the delta on my weights in the most efficient way? Am I stuck running self.w1.eval(sess) multiple times to get the weights and calc the difference? Is there something that I'm missing with the tf.RMSPropOptimizer function.
Thanks!
RMSprop does not subtract the gradient from the parameters but use more complicated formula involving a combination of:
a momentum, if the corresponding parameter is not 0
a gradient step, rescaled non uniformly (on each coordinate) by the square root of the squared average of the gradient.
For more information you can refer to these slides or this recent paper.
The delta is first computed in memory by tensorflow in the slot variable 'momentum' and then the variable is updated (see the C++ operator).
Thus, you should be able to access it and construct a delta node with delta_w1 = self.rmsprop.get_slot(self.w1, 'momentum'). (I have not tried it yet.)
You can add the weights to the list of things to fetch each run call. Then you can compute the deltas outside of TensorFlow since you will have the iterates. This should be reasonably efficient, although it might incur an extra elementwise difference, but to avoid that you might have to hack around in the guts of the optimizer and find where it puts the update before it applies it and fetch that each step. Fetching the weights each call shouldn't do wasteful extra evaluations of part of the graph at least.
RMSProp does complicated scaling of the learning rate for each weight. Basically it divides the learning rate for a weight by a running average of the magnitudes of recent gradients of that weight.

Overflows and spikes while doing gradient descent

I am trying to find a function h(r) that minimises a functional H(h) by a very simple gradient descent algorithm. The result of H(h) is a single number. (Basically, I have a field configuration in space and I am trying to minimise the energy due to this field). The field is discretized in space, and the gradient used is the derivative of H with respect to the field's value at each discrete point in space. I am doing this on Mathematica, but I think the code is easy to follow for non-users of Mathematica.
The function Hamiltonian takes a vector of field values, the spacing of points d, and the number of points imax, and gives the value of energy. EderhSym is the function that gives a table of values for the derivatives at each point. I wrote the derivative function manually to save computation time. (The details of these two functions are probably irrelevant for the question).
Hamiltonian[hvect_, d_, imax_] :=
Sum[(i^2/2)*d*(hvect[[i + 1]] - hvect[[i]])^2, {i, 1, imax - 1}] +
Sum[i^2*d^3*Potential[hvect[[i]]], {i, 1, imax}]
EderhSym[hvect_,d_,imax_]:=Join[{0},Table[2 i^2 d hvect[[i]]-i(i+1)d hvect[[i+1]]
-i(i-1)d hvect[[i-1]]+i^2 d^3 Vderh[hvect[[i]]], {i, 2, imax - 1}], {0}]
The code below shows a single iteration of gradient descent. hvect1 is some starting configuration that I have guessed using physical principles.
Ederh = EderhSym[hvect1, d, imax];
hvect1 = hvect1 - StepSize*Ederh;
The problem is that I am getting random spikes in the derivative table. The spikes keep growing until there is an overflow. I have tried changing the step size, I have tried using moving averages, low pass filters, Gaussian filters etc. I still get spikes that cause an overflow. Has anyone encountered this? Is it a problem with the way I am setting up gradient descent?
Aside - I am testing my gradient descent code as I will have to adapt it to a different multivariable Hamiltonian where I do this to find a saddle point instead: (n is an appropriately chosen small number, 10 in my case)
For[c = 1, c <= n, c++,
Ederh = EderhSym[hvect1, \[CapitalDelta], imax];
hvect = hvect - \[CapitalDelta]\[Tau]*Ederh;
];
Ederh = EderhSym[hvect1, \[CapitalDelta], imax];
hvect1 = hvect1 + n \[CapitalDelta]\[Tau]*Ederh;
Edit: It seems to be working with a much smaller step size than I had previously tried (although convergence is slow). I believe that was the problem, but I do not see why the divergences are localised at particular points.