tf.gradients, how can I understand `grad_ys` and use it? - tensorflow

In tf.gradients, there is a keyword argument grad_ys
grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of ‘1’s of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
Why is grad_ys needed here? The docs are vague on this. Could you please explain its purpose and give some example code?
My example code for tf.gradients is:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
X = tf.placeholder("float", shape=[2, 1])
Y = tf.placeholder("float", shape=[2, 1])
W = tf.Variable(np.random.randn(), name='weight')
b = tf.Variable(np.random.randn(), name='bias')
pred = tf.add(tf.multiply(X, W), b)
cost = 0.5 * tf.reduce_sum(tf.pow(pred-Y, 2))
grads = tf.gradients(cost, [W, b])
sess.run(tf.global_variables_initializer())
W_, b_, pred_, cost_, grads_ = sess.run([W, b, pred, cost, grads],
                                        feed_dict={X: [[2.0], [3.]], Y: [[3.0], [2.]]})

grad_ys is only needed for advanced use cases. Here is how you can think about it.
tf.gradients allows you to compute tf.gradients(y, x, grad_ys) = grad_ys * dy/dx. In other words, grad_ys is a multiplier for each y. In this notation, providing the argument seems silly, because one should be able to do the multiplication oneself, i.e. tf.gradients(y, x, grad_ys) = grad_ys * tf.gradients(y, x). Unfortunately, this equality does not hold: when computing gradients backwards, we perform a reduction (typically a summation) after each step to get an "intermediate loss", so by the time tf.gradients(y, x) returns, the contributions of the individual elements of y have already been summed together.
This functionality can be useful in many cases. One is mentioned in the docstring. Here is another. Remember the chain rule: dz/dx = dz/dy * dy/dx. Let's say that we want to compute dz/dx, but z is not differentiable with respect to y and we can only approximate dz/dy. Let's say we compute the approximation somehow and call it approx. Then dz/dx = tf.gradients(y, x, grad_ys=approx).
Another use case is a model with a "huge fan-in". Let's say you have 100 input sources that go through a few layers (call these "100 branches"), get combined at y, and go through 10 more layers until you get to a loss. It might be that computing all the gradients (which requires remembering many activations) for the whole model at once does not fit in memory. One way around this is to compute d(loss)/dy first. Then compute the gradients of the loss with respect to the variables in branch_i using tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy). Using this (and a few more details I am skipping) you can reduce the peak memory requirement.
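To make the role of grad_ys concrete, here is a minimal plain-Python sketch (no TensorFlow required) built on the question's pred = X * W + b example. The values W = 1.0 and b = 1.0 are assumptions chosen for round numbers; the question draws them at random. It mimics by hand what tf.gradients(pred, [W, b], grad_ys=d(cost)/d(pred)) is described as computing above, and compares the result with finite differences of the cost.

```python
# Plain-Python sketch (no TensorFlow) for pred = X * W + b and
# cost = 0.5 * sum((pred - Y)**2), using the feed values from the question.
X = [2.0, 3.0]
Y = [3.0, 2.0]
W, b = 1.0, 1.0  # assumed values; the question initialises them randomly


def cost(W, b):
    return 0.5 * sum((x * W + b - y) ** 2 for x, y in zip(X, Y))


pred = [x * W + b for x in X]

# Stage 1: d(cost)/d(pred), one entry per element of pred.
dcost_dpred = [p - y for p, y in zip(pred, Y)]

# Stage 2: weight each d(pred_i)/dW and d(pred_i)/db by dcost_dpred[i], then sum.
# This weighting is the role grad_ys plays in tf.gradients(pred, [W, b], grad_ys=...).
dW = sum(g * x for g, x in zip(dcost_dpred, X))
db = sum(g * 1.0 for g in dcost_dpred)

# Without grad_ys every element of pred gets weight 1 before summing,
# so the per-element information is gone in the result.
dW_unweighted = sum(X)

# Reference values: central finite differences of the cost itself.
eps = 1e-6
dW_ref = (cost(W + eps, b) - cost(W - eps, b)) / (2 * eps)
db_ref = (cost(W, b + eps) - cost(W, b - eps)) / (2 * eps)

print(dW, db, dW_unweighted)
print(dW_ref, db_ref)
```

Here dW and db agree with the finite-difference reference. Multiplying dcost_dpred by dW_unweighted after the fact cannot reproduce dW, because the per-element terms have already been summed.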

Related

How can I differentiate a neural network with multi-outputs?

Automatic differentiation is done with tf.GradientTape in Python, like this:
with tf.GradientTape(persistent=True) as tape1:
    func_1 = u(x, y)
d_fun1_dx, d_fun1_dy = tape1.gradient(func_1, [x, y])
del tape1
This gets the derivatives of a single-output neural network.
I have a neural network with two inputs x, y and two outputs f1, f2. I want to get df1/dx, df1/dy, df2/dx, df2/dy. How can I achieve this?
What you are looking for is a Jacobian, not a gradient. It is implemented in tf under tape1.jacobian and will return a jacobian matrix of partial derivatives.
Example from the documentation:
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x
jacobian = g.jacobian(y, x)
# jacobian value is [[2., 0.], [0., 4.]]
That being said, using Jacobians usually requires more advanced methods; what you are planning to do with it will determine whether you really need a Jacobian. For example, if you simply want to use "gradient descent", you now need to decide what to do with two gradients per parameter. Are you going to analyse them? Are you just going to add them? If you add them, then note that
d f_y(x)/dx + d f_z(x)/dx = d/dx [ f_y(x) + f_z(x) ]
so it is equivalent to just adding the outputs and computing a normal gradient. There are of course uses for Jacobians, but they go well beyond typical gradient descent algorithms.
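The identity above can be checked numerically. The following is a small plain-Python sketch, not TensorFlow code: f1 and f2 are hypothetical stand-ins for the two network outputs, and derivatives are estimated with finite differences rather than a gradient tape.

```python
# Hypothetical two-input, two-output function standing in for the network.
def f1(x, y):
    return x * y


def f2(x, y):
    return x ** 2 + y


def grad(f, x, y, eps=1e-6):
    """Central finite-difference estimate of [df/dx, df/dy]."""
    return [
        (f(x + eps, y) - f(x - eps, y)) / (2 * eps),
        (f(x, y + eps) - f(x, y - eps)) / (2 * eps),
    ]


x0, y0 = 3.0, 2.0

# The Jacobian has one row per output: [[df1/dx, df1/dy], [df2/dx, df2/dy]].
jacobian = [grad(f1, x0, y0), grad(f2, x0, y0)]

# Adding the rows gives the same result as differentiating f1 + f2 directly.
sum_of_rows = [jacobian[0][k] + jacobian[1][k] for k in range(2)]
grad_of_sum = grad(lambda x, y: f1(x, y) + f2(x, y), x0, y0)

print(jacobian)
print(sum_of_rows, grad_of_sum)
```

If the four partial derivatives are needed separately, the rows of the Jacobian are what you want; if they would only be added together anyway, a single gradient of the summed outputs is enough.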

Compute gradient of the outputs wrt the weights

Starting from a TensorFlow model, I would like to be able to retrieve the gradient of the outputs with respect to the weights. Backpropagation aims to compute the gradient of the loss wrt the weights; in order to do that, the computation of the gradient of the outputs wrt the weights has to happen somewhere in the code.
But I am wondering how to get this Jacobian at the API level. Any ideas?
I know that we can have access to the tape, but I am not sure what to do with it. Actually, I do not need the whole Jacobian; I just need to be able to compute the matrix-vector product J^T v, where J^T is the transpose of the Jacobian and v is a given vector.
Thank you,
Regards.
If you only need to compute the vector-Jacobian product, doing only that will be much more efficient than computing the full Jacobian. With reverse-mode differentiation, computing the full Jacobian of a function with N outputs takes on the order of N backward passes, whereas a vector-Jacobian product takes a single backward pass.
So how do you compute a vector-Jacobian product in TensorFlow? The trick is to use the output_gradients keyword arg in the gradient function. You set the value of output_gradients to the vector in the vector-Jacobian product. Let's look at an example.
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x*x # y is a length 2 vector
vec = tf.constant([2.0,3.0]) # vector in vector jacobian product
grad = g.gradient(y,x,output_gradients = vec)
print(grad) # prints the vector-jacobian product, [4.,12.]
Note: If you try to compute the gradient of a vector-valued (rather than scalar) function in tensorflow without setting output_gradients, it computes a vector-jacobian product where the vector is set to be all ones. For example,
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x*x # y is a length 2 vector
grad = g.gradient(y,x)
print(grad) # prints the vector-jacobian product with a vector of ones, [2.0,4.0]
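As a cross-check of the two results above, here is a plain-Python sketch that builds the Jacobian of y = x * x by hand and multiplies it by the vector explicitly. The helper vjp is a name introduced only for this illustration; it is not a TensorFlow function.

```python
# Plain-Python check of the two TensorFlow results above for y = x * x.
x = [1.0, 2.0]

# Jacobian J[i][j] = d y_i / d x_j; for an elementwise square it is diag(2 * x).
J = [[2.0 * x[i] if i == j else 0.0 for j in range(len(x))] for i in range(len(x))]


def vjp(v, J):
    """Return v^T J as a list: entry j is sum_i v[i] * J[i][j]."""
    return [sum(v[i] * J[i][j] for i in range(len(v))) for j in range(len(J[0]))]


print(vjp([2.0, 3.0], J))  # [4.0, 12.0], matches output_gradients=vec
print(vjp([1.0, 1.0], J))  # [2.0, 4.0], matches the default (vector of ones)
```

The point of output_gradients is that the tape produces these products without ever materialising J, which is what makes it cheaper than a full Jacobian.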

How to force Tensorflow to show a simple linear regression prediction result?

I have a simple linear regression question.
My code is as follows:
import tensorflow as tf
import numpy as np
batch_xs=np.array([[0,0,1],[1,1,1],[1,0,1],[0,1,1]])
batch_ys=np.array([[0],[1],[1],[0]])
x = tf.placeholder(tf.float32, [None, 3])
W = tf.Variable(tf.zeros([3, 1]))
b = tf.Variable(tf.zeros([1]))
y = tf.nn.sigmoid(tf.matmul(x, W) + b)
y_ = tf.placeholder(tf.float32, [None, 1])
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
learning_rate = 0.05
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
sess = tf.Session()
tf.global_variables_initializer().run(session=sess)
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Prediction:
x0=np.array([[1.,0.,0.]])
x0=np.float32(x0)
y0=tf.nn.softmax(tf.matmul(x0,W) + b)
print(y0)
However, print(y0) shows Tensor("Softmax_2:0", shape=(1, 1), dtype=float32) instead of a number. I expect y0 to be around 0.99.
I tried y0.eval(), but I got ValueError: Cannot evaluate tensor using 'eval()': No default session is registered.
How can I make a change to obtain the result? Thanks!
There are a couple of ways to get things to print out while writing TensorFlow code. Of course, there's the classic Python built-in, print (or the function print(), if we're being Python 3 about it). And then there's TensorFlow's print function, tf.Print (notice the capital P).
When working with TensorFlow, it's important to remember that everything is ultimately a graph computation. This means that if you print a TensorFlow operation using Python's print, it will simply show a description of what that operation is, since no values have been passed through it yet. It will also often show the dimensions that are expected to be in that node, if they're known.
If you want to print the values that are 'flowing' through a particular part of the graph as it's being executed, then you need to turn to tf.Print. To simply get the value of a tensor back in Python, as in the question, evaluate it in the session that already exists, e.g. print(sess.run(y0)) or y0.eval(session=sess).

Breaking TensorFlow gradient calculation into two (or more) parts

Is it possible to use TensorFlow's tf.gradients() function in parts, that is, calculate the gradient of the loss w.r.t. some tensor, and of that tensor w.r.t. the weight, and then multiply them to get the original gradient of the loss w.r.t. the weight?
For example, let W,b be some weights, let x be an input of a network, and let y0 denote labels.
Assume a forward graph such as
h=Wx+b
y=tanh(h)
loss=mse(y-y0)
We can calculate tf.gradients(loss,W) and then apply (skipping some details) optimizer.apply_gradients() to update W.
I then try to extract an intermediate tensor, by using var=tf.get_default_graph().get_tensor_by_name(...), and then calculate two gradients: g1=tf.gradients(loss,var) and g2=tf.gradients(var,W).
I would then, by the chain rule, expect the dimensions of g1 and g2 to work out so that I can write g=g1*g2 in some sense, and get back tf.gradients(loss,W).
Unfortunately, this is not the case. The dimensions are incorrect. Each gradient's dimensions will be that of the "w.r.t variable", so there won't be a correspondence between the first gradient and the second one. What am I missing, and how can I do this?
Thanks.
tf.gradients sums over the elements of the tensor being differentiated (ys). To avoid this, you have to split the tensor into scalars and apply tf.gradients to each of them:
import tensorflow as tf
x = tf.ones([1, 10])
w = tf.get_variable("w", initializer=tf.constant(0.5, shape=[10, 5]))
out = tf.matmul(x, w)
out_target = tf.constant(0., shape=[5])
loss = tf.reduce_mean(tf.square(out - out_target))
grad = tf.gradients(loss, x)
part_grad_1 = tf.gradients(loss, out)
part_grad_2 = tf.concat([tf.gradients(i, x) for i in tf.split(out, 5, axis=1)], axis=1)
grad_by_parts = tf.matmul(part_grad_1, part_grad_2)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([grad]))
    print(sess.run([grad_by_parts]))
From the docs, tf.gradients (emphasis mine)
constructs symbolic derivatives of sum of ys w.r.t. x in xs.
If any tensor in ys is multidimensional, it is reduce_summed before the resulting list of scalars is itself summed, before being differentiated. This is why the output gradient has the same size as the xs.
This also explains why losses can be multidimensional in TensorFlow: they are implicitly summed over before differentiation.
For future readers:
TensorFlow has made some advancements, and as of TF 2.7 (and maybe even earlier versions) you can use tf.GradientTape.jacobian to avoid the sum over the target's dimensions.
https://www.tensorflow.org/guide/advanced_autodiff#jacobians
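To see why the shapes in the question do not line up, and what the split-into-scalars answer recovers, here is a plain-Python sketch of the same numbers (x is 1x10 of ones, w is 10x5 of 0.5, loss is the mean of squares). It is a hand-written illustration of the chain rule, not TensorFlow code; all derivatives are written out analytically.

```python
# Plain-Python version of the example above: x is 1x10 of ones, w is 10x5 of 0.5.
n_in, n_out = 10, 5
x = [1.0] * n_in
w = [[0.5] * n_out for _ in range(n_in)]

out = [sum(x[i] * w[i][j] for i in range(n_in)) for j in range(n_out)]

# loss = mean(out**2), so d(loss)/d(out_j) = 2 * out_j / n_out.
dloss_dout = [2.0 * o / n_out for o in out]

# Full Jacobian of out w.r.t. x: jac[j][i] = d(out_j)/d(x_i) = w[i][j].
jac = [[w[i][j] for i in range(n_in)] for j in range(n_out)]

# Chain rule: d(loss)/d(x_i) = sum_j d(loss)/d(out_j) * d(out_j)/d(x_i).
dloss_dx = [sum(dloss_dout[j] * jac[j][i] for j in range(n_out)) for i in range(n_in)]

# What a single gradient of out w.r.t. x gives instead: the Jacobian summed over j.
summed = [sum(jac[j][i] for j in range(n_out)) for i in range(n_in)]

print(dloss_dx)
print(summed)
```

The second stage needs the full 5x10 Jacobian. The summed version has only 10 entries, which is why it cannot be combined with the 5-entry first-stage gradient to recover the gradient of the loss.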

Tensorflow: How to Manually Edit Gradient Values

I am reading gradient values from an outside source (i.e. the computation is done elsewhere, but I want to accumulate the different sources in a "master" network), and I would like to just use the apply_gradients() op in TensorFlow. The problem is that the gradients get sent in as floats. Is there any way I can use the float array to apply the gradients with the built-in Optimizer functions?
In a very minimal example / test case, this is what I would essentially like to do.
W = tf.Variable(1.0)
b = tf.Variable(2.0)
trainable_variables = [W, b]
gradients = [0.05, 0.01] # Example gradients for W, b
# ... Somehow make this gradient vector into a tensor
optimizer.apply_gradients(zip(gradients_tensor, trainable_variables))
There are many ways of doing so; in particular, you can just create placeholders for your external gradients and combine them with the computed ones by simple arithmetic before apply_gradients.
x = tf.Variable( ... )
f = x ** 2
# compute_gradients returns a list of (gradient, variable) pairs
g = optimizer.compute_gradients(f, var_list=[x])
my_gradient = {x.name: tf.placeholder( ... )} # same size and type as x
g = [(grad + my_gradient[var.name], var) for grad, var in g]
optimizer.apply_gradients(g)
Now, during the optimisation step, just feed the external value for my_gradient[x.name] through feed_dict so that it is added to the computed gradient.
If the external gradients do not change over time, you could use tf.constant() instead, but I can't see any mathematical situation in ML that calls for a constant (and non-zero) gradient.
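For intuition about what happens to the float gradients once they are applied, here is a plain-Python sketch of a plain gradient-descent update, which moves each variable by minus the learning rate times its gradient. The learning rate, the second gradient source and the choice to sum the sources are assumptions made for illustration; other optimizers apply different update rules.

```python
# Sketch of a plain gradient-descent update on externally supplied gradients:
# each variable moves by -learning_rate * gradient.
learning_rate = 0.1  # assumed value, not from the question

variables = {"W": 1.0, "b": 2.0}

# Gradients received as plain floats from several external sources.
# The first entry uses the question's example values; the second is made up.
external_gradients = [
    {"W": 0.05, "b": 0.01},
    {"W": 0.03, "b": -0.02},
]

# Accumulate the sources, here by summing them per variable.
total = {name: sum(g[name] for g in external_gradients) for name in variables}

for name in variables:
    variables[name] -= learning_rate * total[name]

print(variables)
```

In TensorFlow the same floats would be fed through the placeholders described above, so the optimizer's own update rule is used rather than this hand-written one.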