Breaking TensorFlow gradient calculation into two (or more) parts

Is it possible to use TensorFlow's tf.gradients() function in parts, that is: calculate the gradient of the loss w.r.t. some intermediate tensor, then the gradient of that tensor w.r.t. the weights, and multiply them to recover the original gradient of the loss w.r.t. the weights?
For example, let W,b be some weights, let x be an input of a network, and let y0 denote labels.
Assume a forward graph such as
h=Wx+b
y=tanh(h)
loss=mse(y-y0)
We can calculate tf.gradients(loss,W) and then apply (skipping some details) optimizer.apply_gradients() to update W.
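For concreteness, a minimal sketch of that setup (the shapes, initializers, and optimizer here are arbitrary illustrations, not part of the question):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3])   # network input
y0 = tf.placeholder(tf.float32, [None, 2])  # labels
W = tf.Variable(tf.random_normal([3, 2]))
b = tf.Variable(tf.zeros([2]))

h = tf.matmul(x, W) + b
y = tf.tanh(h)
loss = tf.reduce_mean(tf.square(y - y0))  # mse

grads = tf.gradients(loss, [W, b])
optimizer = tf.train.GradientDescentOptimizer(0.1)
train_op = optimizer.apply_gradients(zip(grads, [W, b]))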
I then try to extract an intermediate tensor, by using var=tf.get_default_graph().get_tensor_by_name(...), and then calculate two gradients: g1=tf.gradients(loss,var) and g2=tf.gradients(var,W).
I would then, by the chain rule, expect the dimensions of g1 and g2 to work out so that I can write g=g1*g2 in some sense, and get back tf.gradients(loss,W).
Unfortunately, this is not the case: the dimensions are incorrect. Each gradient's dimensions are those of its "w.r.t." variable, so there is no correspondence between the first gradient and the second one. What am I missing, and how can I do this?
Thanks.

tf.gradients will sum over the gradients of the input tensor. To avoid this, you have to split the tensor into scalars and apply tf.gradients to each of them:
import tensorflow as tf

x = tf.ones([1, 10])
w = tf.get_variable("w", initializer=tf.constant(0.5, shape=[10, 5]))
out = tf.matmul(x, w)
out_target = tf.constant(0., shape=[5])
loss = tf.reduce_mean(tf.square(out - out_target))
grad = tf.gradients(loss, x)[0]  # reference gradient, shape [1, 10]

part_grad_1 = tf.gradients(loss, out)[0]  # d(loss)/d(out), shape [1, 5]
# One tf.gradients call per scalar output column yields the rows of the Jacobian d(out)/dx.
part_grad_2 = tf.concat([tf.gradients(col, x)[0] for col in tf.split(out, 5, axis=1)], axis=0)  # shape [5, 10]
grad_by_parts = tf.matmul(part_grad_1, part_grad_2)  # chain rule: [1, 5] x [5, 10] = [1, 10]

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(grad))
    print(sess.run(grad_by_parts))

From the docs, tf.gradients (emphasis mine)
constructs symbolic derivatives of sum of ys w.r.t. x in xs.
If any tensor in ys is multidimensional, it is reduce_summed before the resulting list of scalars is itself summed and then differentiated. This is why the output gradient has the same size as the xs.
This also explains why losses can be multidimensional in TensorFlow: they are implicitly summed over before differentiation.
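A quick way to see this implicit summation (a toy sketch, unrelated to the question's graph):
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
y = x * x  # a vector-valued "loss"
g_implicit = tf.gradients(y, x)[0]                 # TF sums y before differentiating
g_explicit = tf.gradients(tf.reduce_sum(y), x)[0]  # the same thing, spelled out
with tf.Session() as sess:
    print(sess.run([g_implicit, g_explicit]))      # both print [2. 4. 6.]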

For future readers:
TensorFlow has made some advancements, and as of TF 2.7 (and possibly even earlier versions) you can use tf.GradientTape.jacobian to avoid the sum over the target's dimensions.
https://www.tensorflow.org/guide/advanced_autodiff#jacobians
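For example, a minimal TF2 sketch of the same two-part chain-rule computation as above (the reshape assumes a single example in the batch):
import tensorflow as tf

x = tf.ones([1, 10])
w = tf.Variable(tf.fill([10, 5], 0.5))

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    out = tf.matmul(x, w)                  # shape [1, 5]
    loss = tf.reduce_mean(tf.square(out))

g1 = tape.gradient(loss, out)   # d(loss)/d(out), shape [1, 5]
j2 = tape.jacobian(out, x)      # full Jacobian, shape [1, 5, 1, 10], nothing summed
j2 = tf.reshape(j2, [5, 10])    # d(out)/dx as a matrix
print(tf.matmul(g1, j2))        # chain rule: [1, 5] x [5, 10] = [1, 10]
print(tape.gradient(loss, x))   # matches the direct gradient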

Related

tf.GradientTape returns None for gradient

I am using tf.GradientTape().gradient() to compute a representer point, which can be used to compute the "influence" of a given training example on a given test example. A representer point for a given test example x_t and training example x_i is computed as the dot product of their feature representations, f_t and f_i, multiplied by a weight alpha_i.
Note: The details of this approach are not necessary for understanding the question, since the main issue is getting gradient tape to work.
Computing alpha_i requires differentiation, since it is expressed (up to a scaling constant) as the gradient of the loss with respect to the pre-softmax output: alpha_ij = dL/dphi_j.
In the equation above L is the standard loss function (categorical cross-entropy for multiclass classification) and phi is the pre-softmax activation output (so its length is the number of classes). Furthermore alpha_i can be further broken up into alpha_ij, which is computed with respect to a specific class j. Therefore, we just obtain the pre-softmax output phi_j corresponding to the predicted class of the test example (class with highest final prediction).
I have created a simple setup with MNIST and have implemented the following:
import tensorflow as tf
from tensorflow.keras import layers, Input

num_classes = 10  # MNIST digits

def simple_mnist_cnn(input_shape=(28, 28, 1)):
    input = Input(shape=input_shape)
    x = layers.Conv2D(32, kernel_size=(3, 3), activation="relu")(input)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Conv2D(64, kernel_size=(3, 3), activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Flatten()(x)  # feature representation
    output = layers.Dense(num_classes, activation=None)(x)  # pre-softmax activation output
    activation = layers.Activation(activation='softmax')(output)  # final output with activation
    model = tf.keras.Model(input, [x, output, activation], name="mnist_model")
    return model
Now assume the model is trained, and I want to compute the influence of a given train example on a given test example's prediction, perhaps for model understanding/debugging purposes.
with tf.GradientTape() as t1:
    f_t, _, pred_t = model(x_t)  # get features for misclassified example
    f_i, presoftmax_i, pred_i = model(x_i)
    # compute dot product of feature representations for x_t and x_i
    dotps = tf.reduce_sum(tf.multiply(f_t, f_i))
    # get presoftmax output corresponding to highest predicted class of x_t
    phi_ij = presoftmax_i[:, np.argmax(pred_t)]
    # y_i is actual label for x_i
    cl_loss_i = tf.keras.losses.categorical_crossentropy(pred_i, y_i)
alpha_ij = t1.gradient(cl_loss_i, phi_ij)  # note: alpha_ij is currently None
k_ij = tf.reduce_sum(tf.multiply(alpha_ij, dotps))
The code above gives the following error, since alpha_ij is None: ValueError: Attempt to convert a value (None) with an unsupported type (<class 'NoneType'>) to a Tensor. However, if I change t1.gradient(cl_loss_i, phi_ij) -> t1.gradient(cl_loss_i, presoftmax_i), it no longer returns None. I'm not sure why this is the case. Is there an issue with computing gradients on sliced tensors? Is there an issue with "watching" too many variables? I haven't worked much with gradient tape, so I'm not sure what the fix is, but I would appreciate help.
I never see you watch any tensors. Note that the tape only traces tf.Variable objects by default. Is the watching missing from your code? Otherwise I don't see how t1.gradient(cl_loss_i, presoftmax_i) is working.
Either way, I think the easiest way to fix it is to do
all_gradients = t1.gradient(cl_loss_i, presoftmax_i)
desired_gradients = all_gradients[:, np.argmax(pred_t)]
so simply do the indexing after the gradient. Note that this can be wasteful (if there are many classes) as you are computing more gradients than you need.
The explanation for why (I believe) your version doesn't work would be easiest to show in a drawing, but let me try to explain: Imagine the computations in a directed graph. We have
presoftmax_i -> pred_i -> cl_loss_i
Backpropagating the loss to the presoftmax is easy. But then you set up another branch,
presoftmax_i -> presoftmax_ij
Now, when you try to compute the gradient of the loss with respect to presoftmax_ij, there is actually no backpropagation path (we can only follow arrows backwards). Another way to think about it: You compute presoftmax_ij after computing the loss. How could the loss depend on it then?
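Putting this together, a sketch of the fix (model, x_i, y_i, and pred_t are assumed from the question; note the explicit watch on the intermediate tensor):
import numpy as np
import tensorflow as tf

with tf.GradientTape() as t1:
    f_i, presoftmax_i, pred_i = model(x_i)
    t1.watch(presoftmax_i)  # a plain tensor, not a tf.Variable, so watch it
    cl_loss_i = tf.keras.losses.categorical_crossentropy(y_i, pred_i)  # y_true comes first
all_gradients = t1.gradient(cl_loss_i, presoftmax_i)  # shape [1, num_classes]
alpha_ij = all_gradients[:, np.argmax(pred_t)]        # index after the gradient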

Compute gradient of the outputs wrt the weights

Starting from a TensorFlow model, I would like to be able to retrieve the gradient of the outputs with respect to the weights. Backpropagation computes the gradient of the loss wrt the weights, and somewhere in that computation the gradient of the outputs wrt the weights has to appear.
But I am wondering how to get this Jacobian at the API level. Any ideas?
I know that we can get access to the tape, but I am not sure what to do with it. Actually, I do not need the whole Jacobian; I just need to be able to compute the matrix-vector product J^T v, where J^T is the transpose of the Jacobian and v is a given vector.
Thank you,
Regards.
If you only need to compute the vector-Jacobian product, doing only that will be much more efficient than computing the full Jacobian. Computing the Jacobian of a function of N variables will cost O(N) time, as opposed to O(1) time for a vector-Jacobian product.
So how do you compute a vector-Jacobian product in TensorFlow? The trick is to use the output_gradients keyword arg in the gradient function. You set the value of output_gradients to the vector in the vector-Jacobian product. Let's look at an example.
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x  # y is a length-2 vector
vec = tf.constant([2.0, 3.0])  # the vector in the vector-Jacobian product
grad = g.gradient(y, x, output_gradients=vec)
print(grad)  # prints the vector-Jacobian product, [4., 12.]
Note: If you try to compute the gradient of a vector-valued (rather than scalar) function in tensorflow without setting output_gradients, it computes a vector-jacobian product where the vector is set to be all ones. For example,
import tensorflow as tf
with tf.GradientTape() as g:
    x = tf.constant([1.0, 2.0])
    g.watch(x)
    y = x * x  # y is a length-2 vector
grad = g.gradient(y, x)
print(grad)  # prints the vector-Jacobian product with a vector of ones, [2., 4.]

Tensorflow weighted vs sigmoid cross-entropy loss

I am trying to implement multi-label classification using TensorFlow (i.e., each output pattern can have many active units). The problem has imbalanced classes (i.e., many more zeros than ones in the label distribution, which makes label patterns very sparse).
The best way to tackle the problem should be to use the tf.nn.weighted_cross_entropy_with_logits function. However, I get this runtime error:
ValueError: Tensor conversion requested dtype uint8 for Tensor with dtype float32
I can't understand what is wrong here. As input to the loss function, I pass the labels tensor, the logits tensor, and the positive class weight, which is a constant:
positive_class_weight = 10
loss = tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=logits, pos_weight=positive_class_weight)
Any hints about how to solve this? If I just pass the same labels and logits tensors to the tf.losses.sigmoid_cross_entropy loss function, everything works well (in the sense that TensorFlow runs properly, though the resulting training predictions are always zero).
The error is likely to be thrown after the loss function, because the only significant difference between tf.losses.sigmoid_cross_entropy and tf.nn.weighted_cross_entropy_with_logits is the shape of the returned tensor.
Take a look at this example:
import tensorflow as tf

logits = tf.linspace(-3., 5., 10)
labels = tf.fill([10,], 1.)
positive_class_weight = 10
weighted_loss = tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=logits, pos_weight=positive_class_weight)
print(weighted_loss.shape)
sigmoid_loss = tf.losses.sigmoid_cross_entropy(multi_class_labels=labels, logits=logits)
print(sigmoid_loss.shape)
Tensors logits and labels are kind of artificial and both have shape (10,). But what matters is that weighted_loss and sigmoid_loss have different shapes. Here's the output:
(10,)
()
This is because tf.losses.sigmoid_cross_entropy performs reduction (the sum by default). So in order to replicate it, you have to wrap the weighted loss with tf.reduce_sum(...).
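Applied to the snippet above, the fix might look like this:
weighted_loss = tf.nn.weighted_cross_entropy_with_logits(
    targets=labels, logits=logits, pos_weight=positive_class_weight)
loss = tf.reduce_sum(weighted_loss)  # scalar, like tf.losses.sigmoid_cross_entropy
print(loss.shape)  # ()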
If this doesn't help, make sure that the labels tensor has type float32. This bug is easy to introduce; e.g., the following declaration won't work:
labels = tf.fill([10,], 1) # the type is not float!

Tensorflow: How to Manually Edit Gradient Values

I am reading gradient values from an outside source (i.e., the computation is done elsewhere, but I want to accumulate the different sources in a "master" network), and I would like to just use the apply_gradients() op in TensorFlow. The problem is, the gradients get sent in as floats. Is there any way I can use the float array to apply the gradients with the built-in Optimizer functions?
In a very minimal example / test case, this is what I would essentially like to do.
W = tf.Variable(1.0)
b = tf.Variable(2.0)
trainable_variables = [W, b]
gradients = [0.05, 0.01] # Example gradients for W, b
# ... Somehow make this gradient vector into a tensor
optimizer.apply_gradients(zip(gradients_tensor, trainable_variables))
There are many ways of doing so. In particular, you can create placeholders for your external gradients and combine them with the computed ones by simple arithmetic before calling apply_gradients:
x = tf.Variable( ... )
f = x ** 2
grads_and_vars = list(zip(tf.gradients(f, x), [x]))  # [(gradient, variable)] pairs
my_gradient = {x.name: tf.placeholder( ... )}  # same size and type as x
grads_and_vars = [(grad + my_gradient[var.name], var) for grad, var in grads_and_vars]
optimizer.apply_gradients(grads_and_vars)
and now, during the optimization step, just feed the externally computed value for the my_gradient placeholder through feed_dict.
If the external gradients do not change over time, you could use tf.constant() instead, but I can't see any mathematical situation in ML that calls for a constant (and non-zero) gradient.
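For the minimal example in the question, a complete TF1-style sketch might look like this (the GradientDescentOptimizer and learning rate are arbitrary choices):
import tensorflow as tf

W = tf.Variable(1.0)
b = tf.Variable(2.0)
trainable_variables = [W, b]

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# One placeholder per variable receives the externally computed gradients.
grad_placeholders = [tf.placeholder(v.dtype.base_dtype, shape=v.get_shape())
                     for v in trainable_variables]
apply_op = optimizer.apply_gradients(zip(grad_placeholders, trainable_variables))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    external_grads = [0.05, 0.01]  # example gradients for W, b
    sess.run(apply_op, feed_dict=dict(zip(grad_placeholders, external_grads)))
    print(sess.run([W, b]))  # each variable moved by -learning_rate * gradient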

How do I get the gradient of the loss at a TensorFlow variable?

The feature I'm after is to be able to tell what the gradient of my error function is with respect to a given variable, given some data.
One way to do this would be to see how much the variable has changed after a call to train, but obviously that can vary massively based on the learning algorithm (for example it would be almost impossible to tell with something like RProp) and just isn't very clean.
Thanks in advance.
The tf.gradients() function allows you to compute the symbolic gradient of one tensor with respect to one or more other tensors—including variables. Consider the following simple example:
data = tf.placeholder(tf.float32)
var = tf.Variable(...) # Must be a tf.float32 or tf.float64 variable.
loss = some_function_of(var, data) # some_function_of() returns a `Tensor`.
var_grad = tf.gradients(loss, [var])[0]
You can then use this symbolic gradient to evaluate the gradient at some specific point (data):
sess = tf.Session()
var_grad_val = sess.run(var_grad, feed_dict={data: ...})
In TensorFlow 2.0 you can use tf.GradientTape to achieve this. A GradientTape records the operations executed within its context so that gradients can be computed afterwards. Below is an example of how you might do that.
import tensorflow as tf

# Here go the neural network weights, as tf.Variable
x = tf.Variable(3.0)

# TensorFlow operations executed within the context of
# a GradientTape are recorded for differentiation
with tf.GradientTape() as tape:
    # Do the computation in the context of the gradient tape,
    # for example computing a loss
    y = x ** 2

# Get the gradient of the loss w.r.t. the network weights
dy_dx = tape.gradient(y, x)
print(dy_dx)  # prints tf.Tensor(6.0, shape=(), dtype=float32)