Unaggregated gradients / gradients per example in tensorflow - tensorflow

Given a simple mini-batch gradient descent problem on MNIST in TensorFlow (like in this tutorial), how can I retrieve the gradients for each example in the batch individually?
tf.gradients() seems to return gradients averaged over all examples in the batch. Is there a way to retrieve gradients before aggregation?
Edit: A first step towards this answer is figuring out at which point tensorflow averages the gradients over the examples in the batch. I thought this happened in _AggregatedGrads, but that doesn't appear to be the case. Any ideas?

tf.gradients returns the gradient of the loss with respect to the variables. This means that if your loss is a sum of per-example losses, then the gradient is also the sum of the per-example loss gradients.
The summing up is implicit. For instance if you want to minimize the sum of squared norms of Wx-y errors, the gradient with respect to W is 2(WX-Y)X' where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass; for this you would need to specify the gradients explicitly rather than rely on the tf.gradients method.
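The relationship described above can be checked numerically. The following is a minimal NumPy sketch (not TensorFlow code), assuming the loss is the sum of squared norms of Wx-y with one example per column of X. The shapes, the random data, and the variable names are made up for illustration. It compares the aggregated gradient 2(WX-Y)X' with per-example gradients obtained by k size-1 passes and by one vectorized pass, in the spirit of the single-pass approach for a linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 3, 4, 5                     # output dim, input dim, batch size (arbitrary)
W = rng.normal(size=(m, n))
X = rng.normal(size=(n, k))           # one example per column
Y = rng.normal(size=(m, k))

# Aggregated gradient of sum_i ||W x_i - y_i||^2 with respect to W.
batch_grad = 2.0 * (W @ X - Y) @ X.T

# k passes, each with a "batch" of size 1.
loop_grads = np.stack([
    2.0 * np.outer(W @ X[:, i] - Y[:, i], X[:, i]) for i in range(k)
])

# All k gradients at once: outer product of each residual column
# with the matching input column, stacked along a leading axis.
R = W @ X - Y
per_example_grads = 2.0 * np.einsum("ik,jk->kij", R, X)   # shape (k, m, n)
```

Summing per_example_grads over its first axis reproduces batch_grad, which is why the aggregated gradient alone does not expose the individual terms.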

To partly answer my own question after tinkering with this for a while: it appears that it is possible to manipulate gradients per example while still working in batches by doing the following:
Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
Call your custom tf.gradients function and give your loss as a list of slices:
custagg_gradients(
    ys=[cross_entropy[i] for i in range(batch_size)],
    xs=variables.trainable_variables(),
    aggregation_method=CUSTOM,
    gradient_factors=gradient_factors
)
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).

One way of retrieving gradients before aggregation is to use the grad_ys parameter of tf.gradients. A good discussion is found here:
Use of grads_ys parameter in tf.gradients - TensorFlow
EDIT:
I haven't been working with Tensorflow a lot lately, but here is an open issue tracking the best way to compute unaggregated gradients:
https://github.com/tensorflow/tensorflow/issues/675
Users (including myself) have posted many sample code solutions there that you can try based on your needs.

Related

How does Keras (Tensorflow) compute gradients over batches in custom loss functions?

Under the hood, is a single gradient computed with respect to the whole batch, or is it the mean of gradients for each training pair? I'm writing a custom loss function and would like to include a loss component that is a function of the aggregate statistics over the batch. I'm wondering if this is consistent with the framework. My actual use case is complicated, but as an example, consider that I want my loss function to be whether the categories are correct (dog or cat) plus a term pushing for a 50/50 split between dogs and cats in the batch. It's easy enough to program this into the loss function, but will the gradients do the right thing?

What is the purpose of the Tensorflow Gradient Tape?

I watched the Tensorflow Developer's summit video on Eager Execution in Tensorflow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.
I was trying to understand why I would use Gradient Tape. Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape rather than just the TensorBoard visualization of weights?
So I get that the automatic differentiation that occurs with a model computes the gradients at each node, meaning the adjustment of the weights and biases at each node, given some batch of data. So that is the learning process. But I was under the impression that I can actually use a tf.keras.callbacks.TensorBoard() callback to see the TensorBoard visualization of training, so I can watch the weights on each node and determine if there are any dead or oversaturated nodes.
Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc? Or is there some other use of the Gradient Tape?
With eager execution enabled, Tensorflow will calculate the values of tensors as they occur in your code. This means that it won't precompute a static graph for which inputs are fed in through placeholders. This means to back propagate errors, you have to keep track of the gradients of your computation and then apply these gradients to an optimiser.
This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and then pass this into an optimiser directly.
Fundamentally, because tensors are evaluated immediately, you don't have a graph to calculate gradients and so you need a gradient tape. It is not so much that it is just used for visualisation, but more that you cannot implement a gradient descent in eager mode without it.
Obviously, Tensorflow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. They expose a gradient tape so that you can control what areas of your code need the gradient information. Note that in non-eager mode, this will be statically determined based on the computational branches that are descendants of your loss but in eager mode there is no static graph and so no way of knowing.
Having worked on this for a while after posting the initial question, I have a better sense of where Gradient Tape is useful. It seems the most useful application of Gradient Tape is when you design a custom layer in your Keras model, or equivalently when you design a custom training loop for your model.
If you have a custom layer, you can define exactly how the operations occur within that layer, including the gradients that are computed and also calculating the amount of loss that is accumulated.
So Gradient tape will just give you direct access to the individual gradients that are in the layer.
Here is an example from Aurelien Geron's 2nd edition book on Tensorflow.
Say you have a function you want as your activation.
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2
Now if you want to take derivatives of this function with respect to w1 and w2:
import tensorflow as tf
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    # Operations on the variables are recorded while the tape is active.
    z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])
So the tape will calculate the gradients and give you access to those values. Then you can double them, square them, triple them, etc., whatever you like. Whatever you choose to do, you can then use those adjusted gradients in the update step, for example by passing them to the optimizer.
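For reference, the values that tape.gradient should return for this example can be worked out by hand: dz/dw1 = 6*w1 + 2*w2 = 36 and dz/dw2 = 2*w1 = 10 at w1 = 5, w2 = 3. The plain-Python sketch below uses no TensorFlow. The helper name numerical_gradient and the step size are illustrative choices, not part of any library. It cross-checks the hand-derived values with central differences.

```python
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

def numerical_gradient(func, point, eps=1e-6):
    """Central-difference estimate of the partial derivatives of func at point."""
    grads = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += eps
        minus[i] -= eps
        grads.append((func(*plus) - func(*minus)) / (2 * eps))
    return grads

w1, w2 = 5.0, 3.0
analytic = [6 * w1 + 2 * w2, 2 * w1]        # hand-derived partial derivatives
numeric = numerical_gradient(f, [w1, w2])   # should agree closely with analytic
```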
I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.
GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff, it is a key part of performing the autodiff.
As the other answers describe, it is used to record ("tape") a sequence of operations performed upon some input and producing some output, so that the output can be differentiated with respect to the input via backpropagation (reverse-mode autodiff), typically in order to then perform gradient descent optimisation.
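To make the "recording" idea concrete, here is a toy reverse-mode tape in plain Python. It only illustrates the principle and is not how TensorFlow implements GradientTape. The Tape and Var classes and the gradient function are invented for this sketch and support only addition and multiplication.

```python
class Tape:
    """Toy tape: stores each operation so it can be replayed backwards."""
    def __init__(self):
        self.ops = []  # entries: (output, [(input, local_derivative), ...])

class Var:
    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def _record(self, value, parents):
        out = Var(value, self.tape)
        self.tape.ops.append((out, parents))
        return out

    def __add__(self, other):
        return self._record(self.value + other.value,
                            [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return self._record(self.value * other.value,
                            [(self, other.value), (other, self.value)])

def gradient(tape, output, inputs):
    """Walk the recorded operations in reverse, applying the chain rule."""
    grads = {id(output): 1.0}
    for out, parents in reversed(tape.ops):
        upstream = grads.get(id(out), 0.0)
        for parent, local in parents:
            grads[id(parent)] = grads.get(id(parent), 0.0) + upstream * local
    return [grads.get(id(x), 0.0) for x in inputs]

tape = Tape()
w1, w2 = Var(5.0, tape), Var(3.0, tape)
three, two = Var(3.0, tape), Var(2.0, tape)
z = three * w1 * w1 + two * w1 * w2      # same function as 3*w1**2 + 2*w1*w2
grads = gradient(tape, z, [w1, w2])
```

The forward pass only stores local derivatives. The differentiation itself happens in the backward sweep over the recorded list, which is the sense in which a tape performs the autodiff rather than merely observing it.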

Can I get unaggregated gradient from tensorflow?

I'm trying to implement reinforcement learning using TensorFlow, following this paper: http://www.kyb.mpg.de/fileadmin/user_upload/files/publications/attachments/Neural-Netw-2008-21-682_4867%5b0%5d.pdf
On page 687, table 4, they give a formula to calculate the optimal baseline. But that requires getting the unaggregated gradients first, doing some calculation on them, and then taking the mean over the batch.
But tf.gradients returns already aggregated gradients. Is there a way to do that? There is also a similar question: Unaggregated gradients / gradients per example in tensorflow. Of course, we could run a tf.while_loop over the batch at runtime and get the single gradients one by one, but that would kill the performance.

Tensorflow - Total Variation Loss - reduce_sum vs reduce_mean?

Why does the Total Variation Loss in TensorFlow suggest using reduce_sum instead of reduce_mean as a loss function?
"This can be used as a loss-function during optimization so as to suppress noise in images. If you have a batch of images, then you should calculate the scalar loss-value as the sum:"
loss = tf.reduce_sum(tf.image.total_variation(images))
I contacted the author and it seems there wasn't any important reason behind it at all. He mentioned that maybe reduce_sum worked better for his test case than reduce_mean but encouraged me to test both cases and choose the one which gives me the best results.
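One way to see why the choice may matter little: for a fixed batch size, the sum and the mean of the per-image values differ only by the constant factor batch_size, so their gradients differ by that same factor. The NumPy sketch below is a stand-in, not TensorFlow code. The function total_variation is written here for illustration, assuming the usual definition (sum of absolute differences between neighbouring pixels) and a (batch, height, width, channels) layout.

```python
import numpy as np

def total_variation(images):
    """Per-image sum of absolute differences between neighbouring pixels.

    images: array of shape (batch, height, width, channels).
    """
    dh = np.abs(images[:, 1:, :, :] - images[:, :-1, :, :])   # vertical neighbours
    dw = np.abs(images[:, :, 1:, :] - images[:, :, :-1, :])   # horizontal neighbours
    return dh.sum(axis=(1, 2, 3)) + dw.sum(axis=(1, 2, 3))

rng = np.random.default_rng(0)
images = rng.random((8, 16, 16, 3))    # made-up batch of 8 random images

tv = total_variation(images)           # one value per image, shape (8,)
loss_sum = tv.sum()                    # analogue of reduce_sum
loss_mean = tv.mean()                  # analogue of reduce_mean
```

Here loss_sum equals 8 * loss_mean, so with plain gradient descent switching between the two behaves like rescaling the learning rate (or the weight of this term when it is combined with other losses).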

When using tensorboard, how to summarize a loss that is computed over several minibatches?

I would like to use Tensorboard to visualize the evolution of the loss over a validation sample. But the validation set is too large to compute in one minibatch. Therefore, to compute my validation loss, I have to call session.run several times over several minibatches covering the validation set. Then I sum the losses (in Python) of the minibatches to obtain the full validation loss.
My problem is that tf.scalar_summary seems to have to be attached to a TensorFlow node. But I would need to somehow "attach" it to the sum of the values of a node over several calls to session.run.
Is there a way to do that? Maybe by directly summarizing the python float that contains the sum of the minibatch losses? But I have not seen in the docs a way to "summarize" for tensorboard a python value that is outside of a computation. The example in the "How-To" section of the doc is only concerned with losses that can be computed in a single call to session.run.
You could add a Variable that is updated on each sess.run call and have the summary track the value of the Variable.
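The accumulation itself is independent of TensorFlow and can be sketched in plain Python. In the sketch below, accumulate_validation_loss and run_minibatch_loss are hypothetical names: run_minibatch_loss stands in for a session.run call that returns the summed loss of one minibatch, and the toy "loss" is made up. The resulting total is the value the Variable suggested above would end up holding.

```python
def accumulate_validation_loss(batches, run_minibatch_loss):
    """Sum per-minibatch losses; also return the mean loss per example."""
    total_loss = 0.0
    total_examples = 0
    for batch in batches:
        # Stand-in for one session.run call on this minibatch.
        total_loss += run_minibatch_loss(batch)
        total_examples += len(batch)
    return total_loss, total_loss / total_examples

# Toy data: minibatches of unequal size, "loss" = sum of squared values.
batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
total, mean = accumulate_validation_loss(
    batches, lambda batch: sum(x * x for x in batch)
)
```

Dividing by the number of examples rather than the number of minibatches keeps the mean correct when the last minibatch is smaller than the others.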