How does TensorFlow handle complex gradients?

Let z be a complex variable and C(z) its conjugate.
In complex analysis, the derivative of C(z) w.r.t. z does not exist. But in TensorFlow, we can compute dC(z)/dz, and the result is just 1.
Here is an example:
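import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph-mode API

x = tf.placeholder('complex64', (2, 2))
y = tf.reduce_sum(tf.conj(x))
z = tf.gradients(y, x)  # list with one tensor: the gradient of y w.r.t. x
sess = tf.Session()
X = np.random.rand(2, 2) + 1.j * np.random.rand(2, 2)
X = X.astype('complex64')
Z = sess.run(z, {x: X})[0]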
The input X is
[[0.17014372+0.71475762j 0.57455420+0.00144318j]
[0.57871044+0.61303568j 0.48074263+0.7623235j ]]
and the result Z is
[[1.-0.j 1.-0.j]
[1.-0.j 1.-0.j]]
I don't understand why the gradient is set to be 1?
And I want to know how tensorflow handles the complex gradients in general.

How?
The equation used by TensorFlow for the gradient is:
grad(f) = (df/dz + d(f*)/dz)*
where '*' means complex conjugate.
The partial derivatives w.r.t. z and z* are defined using Wirtinger calculus, which makes it possible to differentiate non-holomorphic functions with respect to a complex variable. With z = x + iy, the Wirtinger definition is:
d/dz = (1/2) (d/dx - i d/dy),   d/dz* = (1/2) (d/dx + i d/dy)
Why this definition?
In Complex-Valued Neural Networks (CVNN), for example, gradients are taken of a non-holomorphic, real-valued scalar function of one or several complex variables. For a real-valued f we have f* = f, so TensorFlow's definition of the gradient can then be written as:
grad(f) = 2 df/dz* = df/dx + i df/dy
This definition agrees with the CVNN literature, for example chapter 4, section 4.3 of this book, or Amin et al. (among countless examples).
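To make the Wirtinger definition concrete, here is a minimal numerical sketch in plain Python (it is not TensorFlow code). The helper `wirtinger` and the toy loss |z|^2 are assumptions chosen for illustration, and central finite differences stand in for automatic differentiation. It checks that, for a real-valued loss, 2 * df/dz* points in the direction of steepest increase, so stepping against it shrinks the loss.

```python
def wirtinger(f, z, h=1e-6):
    """Approximate (df/dz, df/dz*) at z with central finite differences."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)            # slope along the real axis
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)  # slope along the imaginary axis
    d_dz = 0.5 * (df_dx - 1j * df_dy)
    d_dzconj = 0.5 * (df_dx + 1j * df_dy)
    return d_dz, d_dzconj


def loss(z):
    # Hypothetical real-valued, non-holomorphic loss: |z|^2
    return abs(z) ** 2


z0 = 0.3 + 0.4j
grad = 2 * wirtinger(loss, z0)[1]  # for |z|^2 this should be close to 2 * z0

# Plain gradient descent with that gradient (step size 0.1 is arbitrary)
z = z0
for _ in range(50):
    z = z - 0.1 * 2 * wirtinger(loss, z)[1]

print(grad)    # approximately 2 * z0
print(abs(z))  # far smaller than abs(z0)
```

For this particular loss each step multiplies z by roughly 0.8, which is why the iterate contracts toward the minimum at 0.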

Bit late, but I came across this issue recently too.
The key point is that TensorFlow defines the "gradient" of a complex-valued function f(z) of a complex variable as "the gradient of the real map F: (x,y) -> Re(f(x+iy)), expressed as a complex number" (the gradient of that real map is a vector in R^2, so we can express it as a complex number in the obvious way).
Presumably the reason for that definition is that in TF one is usually concerned with gradients for the purpose of running gradient descent on a loss function, and in particular for identifying the direction of maximum increase/decrease of that loss function. Using the above definition of gradient means that a complex-valued function of complex variables can be used as a loss function in a standard gradient descent algorithm, and the result will be that the real part of the function gets minimised (which seems to me a somewhat reasonable interpretation of "optimise this complex-valued function").
Now, to your question, an equivalent way to write that definition of gradient is
gradient(f) := dF/dx + i dF/dy = conj(df/dz + dconj(f)/dz)
(you can easily verify that using the definition of d/dz). That's how TensorFlow handles complex gradients. As for the case of f(z):=conj(z), we have df/dz=0 (as you mention) and dconj(f)/dz=1, giving gradient(f)=1.
I wrote up a longer explanation here, if you're interested: https://github.com/tensorflow/tensorflow/issues/3348#issuecomment-512101921
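The two forms of this definition can be compared numerically. The following is a rough sketch in plain Python, with finite differences standing in for TensorFlow's automatic differentiation; the function names `re_map_gradient`, `d_dz` and `equivalent_form` are made up for this illustration.

```python
def re_map_gradient(f, z, h=1e-6):
    """Gradient of F(x, y) = Re(f(x + iy)), packed into a complex number."""
    dF_dx = (f(z + h).real - f(z - h).real) / (2 * h)
    dF_dy = (f(z + 1j * h).real - f(z - 1j * h).real) / (2 * h)
    return complex(dF_dx, dF_dy)


def d_dz(f, z, h=1e-6):
    """Wirtinger derivative d/dz = (1/2)(d/dx - i d/dy), by finite differences."""
    df_dx = (f(z + h) - f(z - h)) / (2 * h)
    df_dy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return 0.5 * (df_dx - 1j * df_dy)


def equivalent_form(f, z):
    """conj(df/dz + dconj(f)/dz), the second form of the definition."""
    conj_f = lambda w: f(w).conjugate()
    return (d_dz(f, z) + d_dz(conj_f, z)).conjugate()


z0 = 0.3 + 0.4j
conj = lambda w: w.conjugate()  # the function from the question
square = lambda w: w * w        # a holomorphic function, for comparison

g_conj = re_map_gradient(conj, z0)      # approximately 1
g_square = re_map_gradient(square, z0)  # approximately conj(2 * z0)
print(g_conj, equivalent_form(conj, z0))
print(g_square, equivalent_form(square, z0))
```

For the holomorphic case the result is the conjugate of the usual derivative 2z, which follows from the outer conj in the formula.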

Related

What is the most reliable method to optimize linear objective function with nonlinear constraints and Why?

I currently solve the following problem:
Basically, this problem is equivalent to finding the confidence interval for logistic regression. The objective function is linear (no second derivative), while the constraint is non-linear. Specifically, I used n = 1, alpha = 0.05, and theta = logit of p, where p is in [0,1] (for details, please see the binomial distribution). Thus, I have closed-form expressions for the gradient of the objective and the Jacobian of the constraints.
In R, I first tried the alabama::auglag function, which uses the augmented Lagrangian method with BFGS (the default), and the nloptr::auglag function, which uses the augmented Lagrangian method with SLSQP (i.e. SLSQP as the local minimizer). Although they found the (global) minimizer most of the time, they sometimes failed and produced a far-off solution.
In the end, I obtained the best (most stable) results using the SLSQP method directly (nloptr::nloptr with algorithm=NLOPT_LD_SLSQP).
Now, my question is why SLSQP produced better results in this setting than the first two methods, and why the first two methods (augmented Lagrangian with BFGS and with SLSQP as the local optimizer) did not perform well.
Another question is: considering my problem setting, what would be the best method to find the optimizer?
Any comments and suggestions would be much appreciated.
Thanks.

Does TensorFlow gradient compute derivative of functions with unknown dependency on decision variable

I would appreciate it if you could answer my questions or point me to useful resources.
Currently, I am working on a problem that requires alternating optimization. Consider two decision variables x and y. In the first step I take the derivative of the loss function w.r.t. x (for fixed y) and update x. In the second step, I need to take the derivative w.r.t. y. The issue is that x depends on y implicitly, and writing the cost function in a closed form that exposes the dependency of x on y is not feasible, so the gradients of the cost function w.r.t. y are unknown.
1) My first question is whether the reverse-mode "autodiff" method used in TensorFlow works for problems where we need the derivatives but do not have an explicit form of the cost function w.r.t. one variable. The value of the cost function is known, but its dependency on the decision variable is not known analytically.
2) More generally, if I define a node as a "tf.Variable" and have an arbitrary function of that variable (intractable to work out by hand) that evolves through code execution, is it possible to calculate the gradients via "tf.gradients"? If yes, how can I make sure that it is implemented correctly? Can I check it using TensorBoard?
My model is too complicated, but a simplified form is the following: suppose the loss function for my model is L(x). I can code L(x) as a function of "x" during the construction phase in TensorFlow. However, I also have another variable "k" that is initialized to zero. The dependency of L(x) on "k" takes shape as the code runs, so my loss function is actually L(x,k). More importantly, "x" is implicitly a function of "k". (All the optimization is done using GradientDescent.) The problem is that I do not have L(x,k) as a closed-form function, but I do have the value of L(x,k) at each step. I can use "numerical" methods like FDSA/SPSA, but they are not exact. I just need to make sure, as you said, that there is a path between "k" and L(x,k), but I do not know how!
TensorFlow gradients only work when, for the dy/dx you are computing, the graph connecting x and y has at least one path that contains only differentiable operations. In general, if TF gives you a gradient, it is correct (otherwise file a bug, but gradient bugs are rare, since the gradients of all differentiable ops are well tested and the chain rule is fairly easy to apply).
Can you be a little more specific about what your model looks like? You might also want to use eager execution if your forward computation is too weird to express as a fixed dataflow graph.
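The "path" requirement can be illustrated with a toy example in plain Python. Everything here is hypothetical (the dependency x(k) = 2k, the loss x^2 + k, and the helper names), and a finite difference stands in for autodiff. The point is only that the derivative w.r.t. k picks up the contribution through x when x is computed from k inside the differentiated function, and misses it when x is a precomputed constant.

```python
def x_of_k(k):
    # Hypothetical dependency of x on k
    return 2.0 * k


def loss(x, k):
    # Hypothetical loss L(x, k)
    return x ** 2 + k


def derivative(fn, k, h=1e-6):
    # Central finite difference, used here only as a stand-in for autodiff
    return (fn(k + h) - fn(k - h)) / (2 * h)


k0 = 1.5
x_frozen = x_of_k(k0)  # computed once, then treated as a plain number

# x is recomputed from k: the path k -> x -> L is part of the function
connected = derivative(lambda k: loss(x_of_k(k), k), k0)

# x is a constant: only the direct path k -> L remains
disconnected = derivative(lambda k: loss(x_frozen, k), k0)

print(connected)     # dL/dk through both paths: 8*k0 + 1
print(disconnected)  # only the direct term: 1
```

In graph terms, the second case presumably corresponds to feeding x in as a value instead of building it from k with TensorFlow ops.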

Complex convolution in tensorflow

I'm trying to run a simple convolution but with complex numbers:
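import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

r = np.random.random([1, 10, 10, 10])
i = np.random.random([1, 10, 10, 10])
x = tf.complex(r, i)  # complex128, because NumPy defaults to float64
conv_layer = tf.layers.conv2d(
    inputs=x,
    filters=10,
    kernel_size=[3, 3],
    kernel_initializer=utils.truncated_normal_complex(),  # my own helper
    activation=tf.nn.sigmoid)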
However I get this error:
TypeError: Value passed to parameter 'input' has DataType complex128 not in list of allowed values: float16, float32
Does anyone know how to implement such a convolution in Tensorflow?
Will I need to implement a custom op, or is there some better option here?
Frustratingly, complex matrix multiplication is possible, e.g. the following runs fine:
def r():
    return np.random.random([10, 10])

A = tf.complex(r(), r())
B = tf.complex(r(), r())
C = tf.multiply(A, B)  # note: element-wise product; tf.matmul also accepts complex tensors
sess = tf.Session()
sess.run(C)
So there's no real reason convolution shouldn't work, I would think (as convolution is essentially just matrix multiplication).
Thanks
Probably too late, but for anyone who is still interested: applying convolutions to complex-valued data is not as straightforward as with the usual data types, like float32. There are studies that investigate different network structures for this purpose (for example, see this link for "Deep Complex U-Net"). There are implementations of these structures in PyTorch and TensorFlow.
All complex-valued features are split into either Cartesian (real, imaginary) or polar (modulus, angle) representations. Nobody is really trying to use a single feature that is purely complex; I would love to be proven wrong!
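The Cartesian split can be sketched in plain Python: since (a + ib)(c + id) = (ac - bd) + i(ad + bc), one complex convolution can be assembled from four real ones. The code below is a 1-D toy under that assumption, not a TensorFlow layer; `conv1d` and `complex_conv1d` are names invented for this example, and a 2-D version would presumably call a real-valued convolution op four times in the same pattern.

```python
def conv1d(signal, kernel):
    """'Valid' sliding dot product (what conv layers compute, i.e. cross-correlation)."""
    n = len(signal) - len(kernel) + 1
    return [sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
            for i in range(n)]


def complex_conv1d(x_re, x_im, k_re, k_im):
    """Complex convolution built from four real-valued convolutions."""
    out_re = [a - b for a, b in zip(conv1d(x_re, k_re), conv1d(x_im, k_im))]
    out_im = [a + b for a, b in zip(conv1d(x_re, k_im), conv1d(x_im, k_re))]
    return out_re, out_im


# Arbitrary test data
x = [1 + 2j, 3 - 1j, 0.5 + 0.5j, -2 + 1j]
k = [1 - 1j, 2 + 0.5j]

out_re, out_im = complex_conv1d([v.real for v in x], [v.imag for v in x],
                                [v.real for v in k], [v.imag for v in k])

# Reference: the same sliding product done directly on Python complex numbers
reference = conv1d(x, k)
print(list(zip(out_re, out_im)))
print(reference)
```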

taking the gradient in Tensorflow, tf.gradient

I am using this TensorFlow function to get the Jacobian of my function. I came across a few problems:
1. The TensorFlow documentation contradicts itself in the following two paragraphs, if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Returns:
A list of sum(dy/dx) for each x in xs.
According to my test, it in fact returns a vector of length len(ys), which is the sum(dy/dx) for each x in xs.
2. I do not understand why they designed it so that the return value is the sum of the columns (or rows, depending on how you define your Jacobian).
3. How can I really get the Jacobian?
4. In the loss, I need the partial derivative of my function with respect to the input (x). But when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, and the weights are variables. In this case, can I still define the symbolic derivative of the function with respect to the input (x) and put it in the loss? (Later, when we optimize with respect to the weights, this will bring in a second-order derivative of the function.)
1. I think you are right and there is a typo there; it was probably meant to be "of length len(ys)".
2. For efficiency. I can't explain the reasoning exactly, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
3. There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
4. Yes, of course. You can compute whatever gradients you want; there is no real difference between a placeholder and any other operation. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.
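The relation between the summed result and the full Jacobian can be sketched in plain Python. This is only an illustration: the function `f` and the helper `grad` are hypothetical, and finite differences stand in for `tf.gradients`. It builds the Jacobian one output at a time (one gradient call per row) and compares the sum of the rows with the gradient of the summed outputs, which is the "sum(dy/dx) for y in ys" that the documentation describes.

```python
def f(x):
    # Hypothetical vector-valued function R^2 -> R^2
    return [x[0] * x[1], x[0] + x[1] ** 2]


def grad(scalar_fn, x, h=1e-6):
    """Gradient of a scalar function by central finite differences."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((scalar_fn(xp) - scalar_fn(xm)) / (2 * h))
    return g


x0 = [2.0, 3.0]

# One gradient call per output gives one Jacobian row each
jacobian = [grad(lambda v, i=i: f(v)[i], x0) for i in range(2)]

# A single call on the summed outputs gives the sum of those rows
summed = grad(lambda v: sum(f(v)), x0)
row_sum = [sum(col) for col in zip(*jacobian)]

print(jacobian)  # approximately [[3, 2], [1, 6]]
print(summed)    # approximately [4, 8]
print(row_sum)   # matches summed
```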

Non-Convex Loss Function

I am trying to understand the gradient descent algorithm by plotting the error against the values of the parameters in the function. What would be an example of a simple function of the form y = f(x), with just one input variable x and two parameters w1 and w2, such that it has a non-convex loss function? Is y = w1.tanh(w2.x) an example? What I am trying to achieve is this:
How does one know whether the function has a non-convex loss function without plotting the graph?
In iterative optimization algorithms such as gradient descent or Gauss-Newton, what matters is whether the function is locally convex. This is correct (on a convex set) if and only if the Hessian matrix (Jacobian of gradient) is positive semi-definite. As for a non-convex function of one variable (see my Edit below), a perfect example is the function you provide. This is because its second derivative, i.e Hessian (which is of size 1*1 here) can be computed as follows:
first_deriv=d(w1*tanh(w2*x))/dx= w1*w2 * sech^2(w2*x)
second_deriv=d(first_deriv)/dx=-2*w1*w2^2 * sech^2(w2*x)*tanh(w2*x)
The sech^2 part is always positive, so for fixed w1 and w2 the sign of second_deriv depends on the sign of tanh(w2*x), which changes with the values you supply as x and w2. Therefore, we can say that it is not convex everywhere.
Edit: It wasn't clear to me what you meant by one input variable and two parameters, so I assumed that w1 and w2 were fixed beforehand, and computed the derivative w.r.t x. But I think that if you want to optimize w1 and w2 (as I suppose it makes more sense if your function is from a toy neural net), then you can compute the 2*2 Hessian in a similar way.
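The sign change of the second derivative can be checked without plotting. Here is a small sketch in plain Python that assumes w1 = w2 = 1 and uses a central finite difference for the second derivative; the helper name `second_derivative` is made up for this example.

```python
import math


def f(x, w1=1.0, w2=1.0):
    return w1 * math.tanh(w2 * x)


def second_derivative(fn, x, h=1e-4):
    """Central finite-difference estimate of fn''(x)."""
    return (fn(x + h) - 2.0 * fn(x) + fn(x - h)) / (h * h)


curv_left = second_derivative(f, -1.0)   # positive: locally convex here
curv_right = second_derivative(f, 1.0)   # negative: locally concave here

print(curv_left, curv_right)
```

Because the curvature takes both signs, the function is neither convex nor concave over the whole real line. The same kind of check applied to a 2*2 Hessian w.r.t. w1 and w2 would look at the signs of its eigenvalues.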
The same way as in high-school algebra: the second derivative tells you the direction of flex. If it is non-negative in all orientations, then the function is convex (if it is non-positive in all orientations, the function is concave).