How can I differentiate a neural network with multi-outputs? - tensorflow

We know that automatic differentiation in Python is done with tf.GradientTape, for example:
with tf.GradientTape(persistent=True) as tape1:
    func_1 = u(x, y)
d_fun1_dx, d_fun1_dy = tape1.gradient(func_1, [x, y])
del tape1
This gives the derivatives of a single-output neural network.
I have a neural network with two inputs x, y and two outputs f1, f2. I want to get df1/dx, df1/dy, df2/dx, df2/dy. How can I achieve this?

What you are looking for is a Jacobian, not a gradient. It is implemented in TensorFlow as tf.GradientTape.jacobian (tape1.jacobian in your code) and returns the Jacobian matrix of partial derivatives.
Example from the documentation:
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x
jacobian = g.jacobian(y, x)
# jacobian value is [[2., 0.], [0., 4.]]
That said, Jacobians usually call for more advanced methods, and what you plan to do with the result determines whether you really need one. For example, if you simply want to run gradient descent, you now have to decide what to do with two gradients per parameter. Are you going to analyse them? Are you just going to add them? If you add them, note that
(d/dx) f_y(x) + (d/dx) f_z(x) = (d/dx) [ f_y(x) + f_z(x) ]
where f_y and f_z are the two outputs, so it is equivalent to adding the outputs and computing an ordinary gradient. There are of course uses for Jacobians, but they go well beyond typical gradient descent algorithms.
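For the two-output case in the question, one possible sketch is to take one gradient per output from a persistent tape. This assumes TensorFlow 2.x with eager execution, and f1 and f2 below are hypothetical closed-form stand-ins for the network's two outputs, not a real model. The second tape illustrates the identity above: differentiating the sum of the outputs gives the sum of the per-output gradients.

```python
import tensorflow as tf

# Hypothetical scalar inputs; in the question these are the network inputs.
x = tf.constant(1.0)
y = tf.constant(2.0)

with tf.GradientTape(persistent=True) as tape:
    # Constants are not watched automatically, so watch them explicitly.
    tape.watch([x, y])
    # Stand-ins for the two network outputs (assumption, not a real model).
    f1 = x * x * y
    f2 = x + y * y * y

# One gradient call per output; a persistent tape allows several calls.
df1_dx, df1_dy = tape.gradient(f1, [x, y])
df2_dx, df2_dy = tape.gradient(f2, [x, y])
del tape

# Differentiating f1 + f2 gives the sum of the per-output gradients.
with tf.GradientTape() as tape_sum:
    tape_sum.watch([x, y])
    total = x * x * y + x + y * y * y
dsum_dx, dsum_dy = tape_sum.gradient(total, [x, y])
```

If the two outputs come back as one tensor of shape [2], tape.jacobian on that tensor should give the same four numbers in a single call.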

Related

Having trouble understanding how the TensorFlow Probability bijector 'RealNVP' 'log_prob' works

Here's the code
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
# A common choice for a normalizing flow is to use a Gaussian for the base
# distribution. (However, any continuous distribution would work.) E.g.,
nvp = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=[0., 0., 0.]),
    bijector=tfb.RealNVP(
        num_masked=2,
        shift_and_log_scale_fn=tfb.real_nvp_default_template(
            hidden_layers=[512, 512])))
x = nvp.sample((32,32))
x = nvp.sample((32,32)) gives me a tensor of shape 32x32x3. But when I pass x to nvp.log_prob(x), I get a tensor of shape 32x32. I was expecting a tensor of shape (1,), since I want the log_prob of this whole 32x32x3 tensor.
So the question is: how can I change the code above to calculate the log_prob of a 32x32x3 tensor?
RealNVP transforms vector-valued distributions (i.e. MultivariateNormalDiag in your case above). Each length-3 vector is a single event, so x holds 32x32 events and log_prob returns one value per event. You can try nvp.distribution.log_prob(x) (which applies the underlying distribution's log_prob) and note that it has the same 32x32 shape. The log_prob function "consumes" the event shape of x.
The log_prob of a transformed distribution is
nvp.distribution.log_prob(nvp.bijector.inverse(x)) + nvp.bijector.inverse_log_det_jacobian(x)
Namely, it is the sum of the underlying distribution's log_prob, applied to the samples pulled back through the bijective transformation, and a correction term that accounts for the (local, at x) change in volume induced by the bijective transformation.
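To make the shape behaviour and the volume correction concrete, here is a small NumPy sketch that does not use TensorFlow Probability at all. It swaps RealNVP for a hypothetical elementwise affine bijector (the scale and shift values are made up) on a standard normal base, so the result can be checked against the closed-form Gaussian density. The helper names mirror the TFP method names only for readability.

```python
import numpy as np

# Hypothetical affine bijector y = scale * z + shift on 3-vectors.
scale = np.array([2.0, 0.5, 3.0])
shift = np.array([1.0, -1.0, 0.0])

def base_log_prob(z):
    # Standard normal log density; summing over the last axis
    # "consumes" the event dimension of size 3.
    return -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=-1)

def inverse(y):
    return (y - shift) / scale

def inverse_log_det_jacobian(y):
    # The inverse map divides by scale, so log|det| = -sum(log|scale|).
    return -np.sum(np.log(np.abs(scale)))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 32, 3)) * scale + shift  # shape (32, 32, 3)

# One log-probability per 3-vector event: shape (32, 32).
log_prob = base_log_prob(inverse(x)) + inverse_log_det_jacobian(x)

# A single number for the whole array, valid only if the 32x32 events
# are treated as independent draws.
total = log_prob.sum()
```

Under the same independence assumption, summing the 32x32 output of nvp.log_prob(x) (for example with tf.reduce_sum) would give a single scalar for the whole tensor.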

Compute gradient of the outputs wrt the weights

Starting from a TensorFlow model, I would like to retrieve the gradient of the outputs with respect to the weights. Backpropagation computes the gradient of the loss w.r.t. the weights, and to do that the gradient of the outputs w.r.t. the weights has to be computed somewhere in the code.
But I am wondering how to get this Jacobian at the API level. Any ideas?
I know that we can access the tape, but I am not sure what to do with it. Actually, I do not need the whole Jacobian; I just need to compute the matrix-vector product J^{*}v, where J^{*} is the transpose of the Jacobian and v is a given vector.
Thank you,
Regards.
If you only need the vector-Jacobian product, computing just that is much more efficient than computing the full Jacobian. With reverse-mode differentiation, the full Jacobian of a function with N outputs costs on the order of N backward passes, as opposed to a single backward pass for a vector-Jacobian product.
So how do you compute a vector-Jacobian product in TensorFlow? The trick is to use the output_gradients keyword arg in the gradient function. You set the value of output_gradients to the vector in the vector-Jacobian product. Let's look at an example.
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x*x # y is a length 2 vector
    vec = tf.constant([2.0,3.0]) # vector in vector jacobian product
grad = g.gradient(y,x,output_gradients = vec)
print(grad) # prints the vector-jacobian product, [4.,12.]
Note: If you try to compute the gradient of a vector-valued (rather than scalar) function in tensorflow without setting output_gradients, it computes a vector-jacobian product where the vector is set to be all ones. For example,
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x*x # y is a length 2 vector
grad = g.gradient(y,x)
print(grad) # prints the vector-jacobian product with a vector of ones, [2.0,4.0]
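Since the question asks about weights rather than inputs, the same idea can be sketched with a tf.Variable. This assumes TensorFlow 2.x; the single linear layer and the values of w, x and v are hypothetical. The explicit Jacobian is built here only to check the result, and in practice one would skip it.

```python
import tensorflow as tf

# Hypothetical 2x2 weight matrix, one input row, and the vector v.
w = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
x = tf.constant([[1.0, -1.0]])
v = tf.constant([[2.0, 3.0]])  # same shape as the output

with tf.GradientTape(persistent=True) as tape:
    # Variables are watched automatically.
    out = tf.matmul(x, w)  # shape [1, 2]

# Vector-Jacobian product J^T v in a single backward pass; shape of w.
vjp = tape.gradient(out, w, output_gradients=v)

# Check only: full Jacobian of shape [1, 2, 2, 2], contracted with v
# over the two output axes.
jac = tape.jacobian(out, w)
vjp_check = tf.tensordot(v, jac, axes=[[0, 1], [0, 1]])
del tape
```

For this linear layer the product works out to x_i * v_k, that is [[2, 3], [-2, -3]], which both computations reproduce.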

tf.gradients, how can I understand `grad_ys` and use it?

In tf.gradients, there is a keyword argument grad_ys
grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of ‘1’s of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
Why is grad_ys needed here? The docs are not explicit about this. Could you give a specific use case and some code?
And my example code for tf.gradients is
In [1]: import numpy as np
In [2]: import tensorflow as tf
In [3]: sess = tf.InteractiveSession()
In [4]: X = tf.placeholder("float", shape=[2, 1])
In [5]: Y = tf.placeholder("float", shape=[2, 1])
In [6]: W = tf.Variable(np.random.randn(), name='weight')
In [7]: b = tf.Variable(np.random.randn(), name='bias')
In [8]: pred = tf.add(tf.multiply(X, W), b)
In [9]: cost = 0.5 * tf.reduce_sum(tf.pow(pred-Y, 2))
In [10]: grads = tf.gradients(cost, [W, b])
In [11]: sess.run(tf.global_variables_initializer())
In [15]: W_, b_, pred_, cost_, grads_ = sess.run([W, b, pred, cost, grads],
             feed_dict={X: [[2.0], [3.]], Y: [[3.0], [2.]]})
grad_ys is only needed for advanced use cases. Here is how you can think about it.
tf.gradients allows you to compute tf.gradients(y, x, grad_ys) = grad_ys * dy/dx. In other words, grad_ys is the multiplier of each y. In this notation it seems silly to provide this argument, because one should be able to do the multiplication oneself, i.e. tf.gradients(y, x, grad_ys) = grad_ys * tf.gradients(y, x). Unfortunately, this equality does not hold, because when computing gradients backwards we perform a reduction (typically a summation) after each step to get an "intermediate loss".
This functionality can be useful in many cases. One is mentioned in the docstring. Here is another. Remember the chain rule: dz/dx = dz/dy * dy/dx. Say we want to compute dz/dx, but z is not differentiable with respect to y, so dz/dy can only be approximated. Say we compute the approximation somehow and call it approx. Then dz/dx = tf.gradients(y, x, grad_ys=approx).
Another use case can be when you have a model with a "huge fan-in". Let's say you have 100 input sources that go through a few layers (call these "100 branches"), get combined at y, and go through 10 more layers until you get to a loss. It might be that computing all the gradients (which requires remembering many activations) for the whole model at once does not fit in memory. One way to do this would be to compute d(loss)/dy first. Then, compute the gradients for variables in branch_i with respect to loss using tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy). Using this (and a few more details I am skipping) you can reduce the peak memory requirement.
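The two-stage idea can be sketched in TensorFlow 2.x, where the counterpart of grad_ys is the output_gradients argument of tf.GradientTape.gradient. The tiny one-layer "branch" and its values below are hypothetical, and this toy version only shows that the two-stage result matches the direct gradient; it does not by itself save memory.

```python
import tensorflow as tf

# Hypothetical branch: one input row and one weight column.
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [-1.0]])

with tf.GradientTape(persistent=True) as tape:
    y = tf.matmul(x, w)                 # intermediate tensor, shape [1, 1]
    loss = tf.reduce_sum(tf.square(y))  # stand-in for the remaining layers

# Stage 1: gradient of the loss with respect to the intermediate y.
dloss_dy = tape.gradient(loss, y)

# Stage 2: push that gradient back through the branch only.
dloss_dw_two_stage = tape.gradient(y, w, output_gradients=dloss_dy)

# Reference: the usual single-call gradient.
dloss_dw_direct = tape.gradient(loss, w)
del tape
```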

Breaking TensorFlow gradient calculation into two (or more) parts

Is it possible to use TensorFlow's tf.gradients() function in parts, that is, to calculate the gradient of the loss w.r.t. some tensor and the gradient of that tensor w.r.t. the weight, and then multiply them to get the original gradient of the loss w.r.t. the weight?
For example, let W,b be some weights, let x be an input of a network, and let y0 denote labels.
Assume a forward graph such as
h=Wx+b
y=tanh(h)
loss=mse(y-y0)
We can calculate tf.gradients(loss,W) and then apply (skipping some details) optimizer.apply_gradients() to update W.
I then try to extract an intermediate tensor, by using var=tf.get_default_graph().get_tensor_by_name(...), and then calculate two gradients: g1=tf.gradients(loss,var) and g2=tf.gradients(var,W).
I would then, by the chain rule, expect the dimensions of g1 and g2 to work out so that I can write g=g1*g2 in some sense, and get back tf.gradients(loss,W).
Unfortunately, this is not the case. The dimensions are incorrect. Each gradient's dimensions will be that of the "w.r.t variable", so there won't be a correspondence between the first gradient and the second one. What am I missing, and how can I do this?
Thanks.
tf.gradients sums over the elements of ys before differentiating. To avoid this, you have to split the tensor into scalars and apply tf.gradients to each of them:
import tensorflow as tf
x = tf.ones([1, 10])
w = tf.get_variable("w", initializer=tf.constant(0.5, shape=[10, 5]))
out = tf.matmul(x, w)
out_target = tf.constant(0., shape=[5])
loss = tf.reduce_mean(tf.square(out - out_target))
grad = tf.gradients(loss, x)
part_grad_1 = tf.gradients(loss, out)
part_grad_2 = tf.concat([tf.gradients(i, x) for i in tf.split(out, 5, axis=1)], axis=1)
grad_by_parts = tf.matmul(part_grad_1, part_grad_2)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([grad]))
    print(sess.run([grad_by_parts]))
From the docs, tf.gradients (emphasis mine)
constructs symbolic derivatives of sum of ys w.r.t. x in xs.
If any tensor in ys is multidimensional, it is reduce_summed before the resulting list of scalars is itself summed and then differentiated. This is why the output gradient has the same size as the xs.
This also explains why losses can be multidimensional in TensorFlow: they are implicitly summed over before differentiation.
For future readers:
TensorFlow has made some advancements, and as of TF 2.7 (and maybe even earlier versions) you can use tf.GradientTape.jacobian to avoid the sum over the target's dimensions.
https://www.tensorflow.org/guide/advanced_autodiff#jacobians
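A rough TensorFlow 2.x counterpart of the split-by-parts example, using tf.GradientTape.jacobian instead of splitting the output into scalars, might look like the sketch below. The shapes follow the earlier answer; the target is taken to be zero and w is a plain constant here, both simplifications of mine.

```python
import tensorflow as tf

x = tf.ones([1, 10])
w = tf.fill([10, 5], 0.5)  # constant weights; only x is differentiated

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    out = tf.matmul(x, w)                  # shape [1, 5]
    loss = tf.reduce_mean(tf.square(out))  # scalar, target assumed zero

# Direct gradient of the loss with respect to x, shape [1, 10].
grad = tape.gradient(loss, x)

# The two parts of the chain rule.
dloss_dout = tape.gradient(loss, out)  # shape [1, 5]
dout_dx = tape.jacobian(out, x)        # shape [1, 5, 1, 10], not summed
del tape

# Contract over the two axes of out to recover d(loss)/dx.
grad_by_parts = tf.tensordot(dloss_dout, dout_dx, axes=[[0, 1], [0, 1]])
```

The contraction over the output axes is the "multiplication" the question was looking for: the Jacobian keeps the output dimensions that tf.gradients would have summed away.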

Weighted random tensor select in tensorflow

I have a list of tensors and a list representing their probability mass function. On each session run, how can I tell TensorFlow to randomly pick one tensor according to the probability mass function?
I see a few possible ways to do that:
One is to pack the list of tensors into a tensor of rank one higher, and select one with slice and squeeze based on a TensorFlow variable to which I assign the chosen index. What would be the performance penalty of this approach? Would TensorFlow evaluate the other, unneeded tensors?
Another is to use tf.case in a similar fashion, picking one tensor out of many. Same question: what's the performance penalty, since I plan on having quite a few (~100s) conditional statements per graph run?
Is there a better way of doing this?
I think you should use tf.multinomial(logits, num_samples).
Say you have:
a batch of tensors of shape [batch_size, num_features]
a probability distribution of shape [batch_size]
You want to output:
1 example from the batch of tensors, of shape [1, num_features]
batch_tensors = tf.constant([[0., 1., 2.], [3., 4., 5.]]) # shape [batch_size, num_features]
probabilities = tf.constant([0.7, 0.3]) # shape [batch_size]
# we need to convert probabilities to log_probabilities and reshape it to [1, batch_size]
rescaled_probas = tf.expand_dims(tf.log(probabilities), 0) # shape [1, batch_size]
# We can now draw one example from the distribution (we could draw more)
indice = tf.multinomial(rescaled_probas, num_samples=1)
output = tf.gather(batch_tensors, tf.squeeze(indice, [0]))
What's the performance penalty since I plan on having quite a few(~100s) conditional statements per one graph run?
If you want to do multiple draws, you should do it in one run by increasing the parameter num_samples. You can then gather these num_samples examples in one run with tf.gather.
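In TensorFlow 2.x the same sampling op is available as tf.random.categorical. A minimal sketch of drawing several rows in one call, reusing the values from the answer above (the choice of 4 draws is arbitrary), could be:

```python
import tensorflow as tf

batch_tensors = tf.constant([[0., 1., 2.], [3., 4., 5.]])  # [batch_size, num_features]
probabilities = tf.constant([0.7, 0.3])                    # [batch_size]

# tf.random.categorical expects unnormalized log-probabilities
# of shape [1, batch_size].
logits = tf.math.log(probabilities)[tf.newaxis, :]

# Draw 4 row indices in a single call; result has shape [1, 4].
indices = tf.random.categorical(logits, num_samples=4)

# Gather the selected rows; result has shape [4, num_features].
draws = tf.gather(batch_tensors, indices[0])
```

Each row of draws is one of the two original rows, picked with probability 0.7 or 0.3.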