Optimizers' apply_gradients functionality - tensorflow

I am in the process of implementing a Quasi-Newton optimizer for tensorflow, and my question is when Optimizer apply_gradients function is called inside of the minimize function, are the gradients applied at whatever values the tensors happen to have at that moment in time?
Cheers,
Sergey

The apply gradients function can be run at any time and will update the current weights in the network. You can change all of the gradients yourself to all ones and watch the weights increase by one.
You can see from the docs or from git itself here

Related

What is the purpose of the Tensorflow Gradient Tape?

I watched the Tensorflow Developer's summit video on Eager Execution in Tensorflow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.
I was trying to understand why I would use Gradient Tape? Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape versus just Tensorboard visualization of weights.
So I get that the automatic differentiation that occurs with a model is to compute the gradients of each node--meaning the adjustment of the weights and biases at each node, given some batch of data. So that is the learning process. But I was under the impression that I can actually use a tf.keras.callback.TensorBoard() call to see the tensorboard visualization of training--so I can watch the weights on each node and determine if there are any dead or oversaturated nodes.
Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc? Or is there some other use of the Gradient Tape?
With eager execution enabled, Tensorflow will calculate the values of tensors as they occur in your code. This means that it won't precompute a static graph for which inputs are fed in through placeholders. This means to back propagate errors, you have to keep track of the gradients of your computation and then apply these gradients to an optimiser.
This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and then pass this into an optimiser directly.
Fundamentally, because tensors are evaluated immediately, you don't have a graph to calculate gradients and so you need a gradient tape. It is not so much that it is just used for visualisation, but more that you cannot implement a gradient descent in eager mode without it.
Obviously, Tensorflow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. They expose a gradient tape so that you can control what areas of your code need the gradient information. Note that in non-eager mode, this will be statically determined based on the computational branches that are descendants of your loss but in eager mode there is no static graph and so no way of knowing.
Having worked on this for a while, after posting the initial question, I have a better sense of where Gradient Tape is useful. Seems like the most useful application of Gradient Tap is when you design a custom layer in your keras model for example--or equivalently designing a custom training loop for your model.
If you have a custom layer, you can define exactly how the operations occur within that layer, including the gradients that are computed and also calculating the amount of loss that is accumulated.
So Gradient tape will just give you direct access to the individual gradients that are in the layer.
Here is an example from Aurelien Geron's 2nd edition book on Tensorflow.
Say you have a function you want as your activation.
def f(w1, w2):
return 3 * w1 ** 2 + 2 * w1 * w2
Now if you want to take derivatives of this function with respec to w1 and w2:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])
So the optimizer will calculate the gradient and give you access to those values. Then you can double them, square them, triple them, etc., whatever you like. Whatever you choose to do, then you can add those adjusted gradients to the loss calculation for the backpropagation step, etc.
I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.
GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff, it is a key part of performing the autodiff.
As the other answers describe, it is used to record ("tape") a sequence of operations performed upon some input and producing some output, so that the output can be differentiated with respect to the input (via backpropagation / reverse-mode autodiff) (in order to then perform gradient descent optimisation).

Tensorflow batch normalization: difference momentum and renorm_momentum

I want to replicate a network build with the lasagne-library in tensor flow. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have a alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect is you use batch renormalization by setting the renorm argument to True. You can ignore this if using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": Training and inference. During training, mean and variance of the current minibatch is used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces Tensorflow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly one per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.

Modifying intermediate gradients Tensorflow

You can get intermediate gradients with tf.gradients()and you can create a new tensor by applying an op on this results (like clipping) but how to modify the backpropagation accordingly ?
For instance to implement the Huber loss (with delta=1).
The first method is to create a boolean mask on the batch dimension doing something like.
cond=tf.less(input_tensor,1)
cond=tf.cast(cond,"tf.float32")
loss=cond*tf.square(input_tensor)+(1.-cond)*(tf.abs(input_tensor)-0.5)
A simpler way to implement it would be to use the l2 loss and to clip its gradients wrt inputs to 1.
l2_loss=tf.square(input_tensor)
modified_grad_wrt_input=tf.clip_by_value(tf.gradients(l2_loss,input_tensor),0.,1.)
But when you train your network you have to use compute_gradients and apply_gradients, which only give you gradients wrt variables. How to make your optimizer use the tensor modified_grad_wrt_input when doing the chain rule ?
Do you have to use gradient_override_map as in this github issue ?
Is there a simpler way without registering a new op/gradients ?

apply_gradients for tensorflow optimizers

The tensorflow documentation states that:
Calling minimize() takes care of both computing the gradients and
applying them to the variables. If you want to process the gradients
before applying them you can instead use the optimizer in three steps:
Compute the gradients with compute_gradients(). Process the gradients
as you wish. Apply the processed gradients with apply_gradients().
However the example given is for vanilla SGD.
Does this two step process work for other types of optimizers (like momentum, adam etc), which don't use the gradients directly but instead use other derived descent directions ?
If so, where do the various intermediate variables and the final descent direction get computed - in compute_gradients or apply_gradients ?
Thanks.

Unaggregated gradients / gradients per example in tensorflow

Given a simple mini-batch gradient descent problem on mnist in tensorflow (like in this tutorial), how can I retrieve the gradients for each example in the batch individually.
tf.gradients() seems to return gradients averaged over all examples in the batch. Is there a way to retrieve gradients before aggregation?
Edit: A first step towards this answer is figuring out at which point tensorflow averages the gradients over the examples in the batch. I thought this happened in _AggregatedGrads, but that doesn't appear to be the case. Any ideas?
tf.gradients returns the gradient with respect to the loss. This means that if your loss is a sum of per-example losses, then the gradient is also the sum of per-example loss gradients.
The summing up is implicit. For instance if you want to minimize the sum of squared norms of Wx-y errors, the gradient with respect to W is 2(WX-Y)X' where X is the batch of observations and Y is the batch of labels. You never explicitly form "per-example" gradients that you later sum up, so it's not a simple matter of removing some stage in the gradient pipeline.
A simple way to get k per-example loss gradients is to use batches of size 1 and do k passes. Ian Goodfellow wrote up how to get all k gradients in a single pass, for this you would need to specify gradients explicitly and not rely on tf.gradients method
To partly answer my own question after tinkering with this for a while. It appears that it is possible to manipulate gradients per example while still working in batch by doing the following:
Create a copy of tf.gradients() that accepts an extra tensor/placeholder with example-specific factors
Create a copy of _AggregatedGrads() and add a custom aggregation method that uses the example-specific factors
Call your custom tf.gradients function and give your loss as a list of slices:
custagg_gradients(
ys=[cross_entropy[i] for i in xrange(batch_size)],
xs=variables.trainable_variables(),
aggregation_method=CUSTOM,
gradient_factors=gradient_factors
)
But this will probably have the same complexity as doing individual passes per example, and I need to check if the gradients are correct :-).
One way of retrieving gradients before aggregation is to use the grads_ys parameter. A good discussion is found here:
Use of grads_ys parameter in tf.gradients - TensorFlow
EDIT:
I haven't been working with Tensorflow a lot lately, but here is an open issue tracking the best way to compute unaggregated gradients:
https://github.com/tensorflow/tensorflow/issues/675
There is a lot of sample code solutions provided by users (including myself) that you can try based on your needs.