What is the purpose of the Tensorflow Gradient Tape? - tensorflow

I watched the Tensorflow Developer's summit video on Eager Execution in Tensorflow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.
I was trying to understand why I would use Gradient Tape. Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape versus just TensorBoard visualization of weights?
As I understand it, the automatic differentiation that occurs with a model computes the gradients at each node, meaning the adjustment of the weights and biases at each node given some batch of data. That is the learning process. But I was under the impression that I can use a tf.keras.callbacks.TensorBoard() callback to see the TensorBoard visualization of training, so I can watch the weights on each node and determine if there are any dead or oversaturated nodes.
Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc? Or is there some other use of the Gradient Tape?

With eager execution enabled, Tensorflow calculates the values of tensors as they occur in your code, so it won't precompute a static graph for which inputs are fed in through placeholders. To backpropagate errors, you therefore have to keep track of the gradients of your computation and then apply these gradients with an optimiser.
This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and then pass this into an optimiser directly.
Fundamentally, because tensors are evaluated immediately, you don't have a graph to calculate gradients and so you need a gradient tape. It is not so much that it is just used for visualisation, but more that you cannot implement a gradient descent in eager mode without it.
Obviously, Tensorflow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. They expose a gradient tape so that you can control what areas of your code need the gradient information. Note that in non-eager mode, this will be statically determined based on the computational branches that are descendants of your loss but in eager mode there is no static graph and so no way of knowing.
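To make the point about eager-mode gradient descent concrete, here is a minimal sketch of a hand-written training loop, assuming TensorFlow 2.x. The toy data, the learning rate and the manual assign_sub update (standing in for what an SGD optimiser would do) are my own illustrative choices, not something from the summit talk.

```python
import tensorflow as tf

# Toy data for y = 3x; the single weight w should move toward 3.
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = 3.0 * x

w = tf.Variable(0.0)
learning_rate = 0.05  # illustrative value

for step in range(50):
    # Only operations executed inside the "with" block are recorded.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(w * x - y))
    # Reverse-mode autodiff over the recorded operations.
    grad = tape.gradient(loss, w)
    # Plain gradient descent step; an optimiser would do this for you.
    w.assign_sub(learning_rate * grad)
```

Without the tape there is nothing to differentiate through, because each operation has already been executed and discarded by the time the loss is available.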

Having worked on this for a while, after posting the initial question, I have a better sense of where Gradient Tape is useful. It seems like the most useful application of Gradient Tape is when you design a custom layer in your Keras model, for example, or equivalently when you design a custom training loop for your model.
If you have a custom layer, you can define exactly how the operations occur within that layer, including the gradients that are computed and also calculating the amount of loss that is accumulated.
So Gradient tape will just give you direct access to the individual gradients that are in the layer.
Here is an example from Aurelien Geron's 2nd edition book on Tensorflow.
Say you have a function you want as your activation.
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2
Now if you want to take derivatives of this function with respect to w1 and w2:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])
So the tape calculates the gradients and gives you access to those values. You can then double them, square them, triple them, or transform them however you like, and pass the adjusted gradients on to the optimizer for the weight update step.
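As a rough sketch of what "adjusting the gradients" can look like in practice, the snippet below reuses f from above, clips the gradients and applies them by hand. It assumes TensorFlow 2.x; the clipping range and learning rate are arbitrary values I picked for illustration.

```python
import tensorflow as tf

def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2

w1, w2 = tf.Variable(5.0), tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = f(w1, w2)

# Analytically: dz/dw1 = 6*w1 + 2*w2 = 36 and dz/dw2 = 2*w1 = 10.
gradients = tape.gradient(z, [w1, w2])

# Example adjustment: clip every gradient to the range [-20, 20].
clipped = [tf.clip_by_value(g, -20.0, 20.0) for g in gradients]

# Apply the adjusted gradients with a plain gradient descent step.
learning_rate = 0.1
for var, grad in zip([w1, w2], clipped):
    var.assign_sub(learning_rate * grad)
```

With these numbers the first gradient is clipped from 36 to 20, so w1 moves from 5 to 3 and w2 moves from 3 to 2.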

I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.
GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff; it is a key part of performing the autodiff.
As the other answers describe, it is used to record ("tape") a sequence of operations performed upon some input and producing some output, so that the output can be differentiated with respect to the input (via backpropagation / reverse-mode autodiff) (in order to then perform gradient descent optimisation).
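A small sketch of "differentiating the output with respect to the input", assuming TensorFlow 2.x. Plain tensors are not recorded automatically, so the tape is told to watch the input explicitly; the function x ** 3 is just an illustrative choice.

```python
import tensorflow as tf

x = tf.constant(2.0)

with tf.GradientTape() as tape:
    tape.watch(x)   # constants are not watched by default
    y = x ** 3      # the operation is recorded on the tape

# Reverse-mode autodiff: dy/dx = 3 * x**2 = 12 at x = 2.
dy_dx = tape.gradient(y, x)
```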

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient descent-like solver for any nonlinear optimization problem (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in case of neural nets it can only be applied for fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
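To illustrate the Jacobian-based update described above, here is a minimal NumPy sketch of a textbook Levenberg-Marquardt loop. It is not the tfg implementation: the damped normal equations, the accept/reject rule for the damping factor, and the toy model a * exp(b * x) with its hand-written Jacobian are all assumptions made for the example.

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, params, n_iter=100, damping=1e-2):
    """Textbook LM: solve (J^T J + damping * I) delta = -J^T r each iteration."""
    params = np.asarray(params, dtype=float)
    cost = 0.5 * np.sum(residual_fn(params) ** 2)
    for _ in range(n_iter):
        r = residual_fn(params)
        J = jacobian_fn(params)
        gram = J.T @ J  # Gram matrix, size n_params x n_params
        delta = np.linalg.solve(gram + damping * np.eye(len(params)), -J.T @ r)
        new_cost = 0.5 * np.sum(residual_fn(params + delta) ** 2)
        if new_cost < cost:
            # Step helped: accept it and trust the linearization more.
            params, cost = params + delta, new_cost
            damping *= 0.5
        else:
            # Step hurt: reject it and move toward gradient descent.
            damping *= 2.0
    return params

# Toy nonlinear least-squares problem: fit y = a * exp(b * x).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * x)

def residuals(p):
    a, b = p
    return a * np.exp(b * x) - y

def jacobian(p):
    a, b = p
    # Columns are d(residual)/da and d(residual)/db.
    return np.stack([np.exp(b * x), a * x * np.exp(b * x)], axis=1)

fitted = levenberg_marquardt(residuals, jacobian, [1.5, -1.0])
```

The linear solve on the Gram matrix is the step with cubic cost in the number of parameters, which is why this only scales to small models.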
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable params, it will still be at its initial value, since we only do one iteration of Levenberg Marquardt each time.
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method will not work where we assign model parameters before computing the residuals:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize
dataset_iterator = ...
def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch
final_objective, final_params = lm_minimize(residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters since you can't directly compute model(input, parameters) - target_value without a tf.assign.
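For what it's worth, a hand-rolled LM step that works directly on a tf.Variable could look roughly like the sketch below, assuming TensorFlow 2.x in eager mode. It handles a single variable of a tiny linear model with a fixed damping value; a real Keras model would need the Jacobians of all trainable variables flattened and concatenated, which is left out here.

```python
import tensorflow as tf

# Tiny linear least-squares problem: residuals = x @ w - y.
x = tf.constant([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
true_w = tf.constant([2.0, -1.0])
y = tf.linalg.matvec(x, true_w)

w = tf.Variable([0.0, 0.0])
damping = 1e-3  # fixed damping for simplicity (an assumption)

def lm_step():
    with tf.GradientTape() as tape:
        residuals = tf.linalg.matvec(x, w) - y          # shape [4]
    jac = tape.jacobian(residuals, w)                   # shape [4, 2]
    gram = tf.matmul(jac, jac, transpose_a=True)        # J^T J
    rhs = -tf.linalg.matvec(jac, residuals, transpose_a=True)  # -J^T r
    delta = tf.linalg.solve(gram + damping * tf.eye(2), rhs[:, None])[:, 0]
    # Update the variable in place, like a Keras optimizer would.
    w.assign_add(delta)

for _ in range(5):
    lm_step()
```

Because the residuals are computed from the variable itself inside the tape, no tf.assign sits between the parameters and the residuals, which is the problem with the naive approach above.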

How do I calculate subgradients in TensorFlow?

Does the automatic differentiation procedure in TensorFlow compute subgradient whenever needed? If there are many subgradients then which one will be chosen as output?
I am trying to implement the paper in the link https://www.aclweb.org/anthology/P13-1045 which uses recursive neural networks to perform efficient language parsing. The objective function uses hinge loss function to pick the optimal output vectors, which makes the function not differentiable. I used TensorFlow (v1.12) in eager mode to program the model and used the automatic differentiation to compute the gradients. After every batch, I could see the gradient values changing and the accuracy is slightly improved. After a while, it decreases and this process continues. The model does not converge at all for all the hyper-parameter configurations.
Mini batch size : 256, 512, 1024; Regularization parameters - 0.1, 0.01, 0.001; Learning rate - 0.1, 0.01, 0.001; Optimization function - gradient descent, adagrad, adam;
In the paper, they have described how to find subgradient for the optimum function in a very abstract manner, which I have not understood yet. I was of the opinion at the beginning that automatic gradient computation calculates the subgradient. But at this moment, I am starting to doubt so because that seems to be the only variable missing.
Unfortunately, Tensorflow does not compute subgradients, only gradients.
This is explained in "How does tensorflow handle non differentiable nodes during gradient calculation?".
To summarize, when computing a partial derivative, if there is a problem of differentiability, Tensorflow simply sets this derivative to zero.
As for the trouble training your model, there are no general rules for tuning the hyperparameters, so I would suggest doing a grid search over the learning rates (for a few epochs) to find a good initial learning rate that provides good results for one of the optimization algorithms. Usually, Adam or SGD with momentum provides satisfying results when the right initial learning rate is chosen.
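If you want to check what autodiff returns at a non-differentiable point, a quick experiment like the one below can help. It is a sketch assuming TensorFlow 2.x (the question used 1.12 in eager mode), and it probes relu and abs at their kink, which are my own choice of test functions rather than the hinge loss from the paper.

```python
import tensorflow as tf

x = tf.Variable(0.0)  # both relu and abs are non-differentiable at 0

# persistent=True allows calling tape.gradient more than once.
with tf.GradientTape(persistent=True) as tape:
    relu_out = tf.nn.relu(x)
    abs_out = tf.abs(x)

relu_grad = tape.gradient(relu_out, x)
abs_grad = tape.gradient(abs_out, x)
del tape  # release the resources held by the persistent tape
```

For these two ops the reported derivative at the kink is zero, consistent with the summary above.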

How is get_updates() of optimizers.SGD used in Keras during training?

I am not familiar with the inner workings of Keras and have difficulty understanding how Keras uses the get_updates() function of optimizers.SGD during training.
I searched the internet for quite a while, but found only a few details. Specifically, my understanding is that the parameter/weight update rule of SGD is defined in the get_updates() function. But it appears that get_updates() isn't literally called in every iteration during training; otherwise 'moments' wouldn't carry over from one iteration to the next to implement momentum correctly, as it's reset in every call, cf. optimizers.py:
shapes = [K.get_variable_shape(p) for p in params]
moments = [K.zeros(shape) for shape in shapes]
self.weights = [self.iterations] + moments
for p, g, m in zip(params, grads, moments):
    v = self.momentum * m - lr * g  # velocity
    self.updates.append(K.update(m, v))
As pointed out in https://github.com/keras-team/keras/issues/7502, get_updates() only defines 'a symbolic computation graph'. I'm not sure what that means. Can someone give a more detailed explanation of how it works?
For example, how is the 'v' computed in one iteration got passed to 'moments' in the next iteration to implement momentum? I'd also appreciate it if someone can point me to some tutorial about how this works.
Thanks a lot! (BTW, I'm using tensorflow, if it matters.)
get_updates() defines graph operations that apply the gradients to update the weights.
When the graph is evaluated for training it will look somewhat like this:
forward passes compute a prediction value
loss computes a cost
backward passes compute gradients
weights are updated using the gradients
Applying the updates is a graph computation itself; i.e. the snippet of code that you quote defines how to perform the operation by specifying which tensors are involved and what math operations occur. The math operations themselves are not occurring at that point.
moments is a list of tensors defined in the code above. The code creates a graph operation that updates each moments element.
Every iteration of the graph will run this update operation.
The following link tries to explain the concept of the computational graph in TensorFlow:
https://www.tensorflow.org/guide/graphs
Keras uses the same underlying ideas but shields the user from having to deal with the low-level details. Defining a model in the traditional TensorFlow 1.0 API requires a much higher level of detail.
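As a loose analogy in plain Python (not actual Keras code), the "define once, run many times" idea can be sketched like this. The function names and the constant gradient are hypothetical; the point is only that the moments are created once when the update is defined and then persist across every run of the update.

```python
def get_updates(params, grads_fn, lr=0.1, momentum=0.9):
    """Define the update step once; return a callable that runs it."""
    moments = [0.0 for _ in params]  # state is created once, here

    def run_updates():
        grads = grads_fn(params)
        for i, g in enumerate(grads):
            v = momentum * moments[i] - lr * g  # velocity
            moments[i] = v                      # plays the role of K.update(m, v)
            params[i] = params[i] + v           # parameter update
        return list(moments)

    return run_updates

params = [1.0]
# Hypothetical gradient function that always returns 1.0 per parameter.
train_step = get_updates(params, grads_fn=lambda p: [1.0 for _ in p])

first = train_step()   # velocity after iteration 1
second = train_step()  # iteration 2 reuses the moment stored by iteration 1
```

Calling get_updates() again would reset the moments, which is why Keras calls it once to build the training function and then only executes that function in each iteration.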

How to cancel BP in some layers in tensorflow?

When I fine-tune a VGG network, I only want to update the weights after the 5th convolution layer. In Caffe, we can cancel BP in the configuration file. What should I do in Tensorflow? Thanks!
Just use tf.stop_gradient() on the input of your 5th layer. Tensorflow will not backpropagate the error below. tf.stop_gradient() is an operation that acts as the identity function in the forward direction, but stops the gradient in the backward direction.
From documentation:
tf.stop_gradient
Stops gradient computation.
When executed in a graph, this op outputs its input tensor as-is.
When building ops to compute gradients, this op prevents the
contribution of its inputs to be taken into account. Normally, the
gradient generator adds ops to a graph to compute the derivatives of a
specified 'loss' by recursively finding out inputs that contributed to
its computation. If you insert this op in the graph it inputs are
masked from the gradient generator. They are not taken into account
for computing gradients.
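Otherwise you can use optimizer.minimize(loss, var_list=variables_of_fifth_layer). Here you are running backpropagation and updating only the variables of your 5th layer. Note that var_list should be passed by keyword, because the second positional argument of minimize is global_step.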
For a fast selection of the variables of interest you could:
Define as trainable=False all the variables that you don't want to update, and use variables_of_fifth_layer=tf.trainable_variables().
Divide layers by defining specific scopes and then variables_of_fifth_layer = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,"scope/of/fifth/layer")
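A minimal sketch of the tf.stop_gradient() behaviour, written for TensorFlow 2.x with a gradient tape rather than the TF1 graph API used above. The two scalar variables are stand-ins I made up for the "early" and "late" layers of a network.

```python
import tensorflow as tf

x = tf.constant(2.0)
w_frozen = tf.Variable(3.0)   # stands in for the layers below the cut
w_trained = tf.Variable(4.0)  # stands in for the layers above the cut

with tf.GradientTape() as tape:
    hidden = w_frozen * x
    # Identity in the forward pass, blocks the gradient in the backward pass.
    hidden = tf.stop_gradient(hidden)
    output = w_trained * hidden

grad_frozen, grad_trained = tape.gradient(output, [w_frozen, w_trained])
```

Here grad_frozen comes back as None because no gradient path reaches w_frozen, while grad_trained equals the value of hidden (6.0).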

Any way to backprop derivatives when the derivatives of the custom loss function are calculated by myself

I have been using Tensorflow to train deep NN acoustic models for speech recognition for a while. The loss function I use is cross entropy and the NN models perform very well. Now I want to change the loss function to a more complex one named MMI (Maximum Mutual Information), which is also a classical criterion used in the speech recognition domain. I put one paper here which describes this loss function in case you are interested.
When using this special loss function, the derivatives of the loss function w.r.t. the activations of output layer can be computed by some special algorithms defined in Hidden Markov Model scenario. It means that I can compute the derivatives of the loss function w.r.t. the activations of output layer by myself rather than just write out the loss function and leave Tensorflow to calculate the derivatives automatically.
But with my limited experience, I don't know how to backprop the derivatives which I calculate by myself. Is there any way to do this without touching the Tensorflow C++ source code?
Probably yes, if all the computation involved uses existing Tensorflow functions.
You just have to set up the chain of operations that compute the gradients from the current variables.
Then you apply tf.assign_add() to the variables, adding your gradients multiplied by minus the learning rate.
You are thus mimicking what usually happens in the background in TF.
EDIT: If the gradients are calculated in numpy, for instance, you can use:
#perform numpy calculations
a=f(output_npy,variables_npy)
grad_from_user=tf.placeholder(tf.float32, a.shape)
grad_update=tf.assign_add(variables_tf,-lr*grad_from_user)
#and then
sess.run(grad_update,feed_dict={grad_from_user:a,...})
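Another option worth mentioning is tf.custom_gradient, which lets you supply the derivative with respect to the output activations yourself and leaves the rest of backpropagation to Tensorflow. The sketch below assumes TensorFlow 2.x; the squared-sum loss and its derivative are placeholders for the MMI loss and the HMM-based derivative computation, which I have not attempted to reproduce.

```python
import tensorflow as tf

@tf.custom_gradient
def custom_loss(activations):
    # Placeholder loss; a real MMI loss would be computed here.
    loss = tf.reduce_sum(tf.square(activations))

    def grad(upstream):
        # Hand-computed d(loss)/d(activations); replace with your own algorithm.
        hand_computed = 2.0 * activations
        return upstream * hand_computed

    return loss, grad

w = tf.Variable([1.0, -2.0])
x = tf.constant([3.0, 0.5])

with tf.GradientTape() as tape:
    activations = w * x            # stands in for the network's output layer
    loss = custom_loss(activations)

# The chain rule below the output layer is still handled automatically.
grad_w = tape.gradient(loss, w)
```

With these numbers the activations are [3, -1], the hand-supplied derivative is [6, -2], and the resulting gradient for w is [18, -1].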