How to debug exploding gradient (covariance matrix) in Tensorflow 2.0 (TFP) - tensorflow

This question comes from the fact that I have never had to debug my models in TF this deeply before.
I'm running variational inference with a full-rank Gaussian approximation using Tensorflow Probability, and I noticed that the optimization often explodes, as shown by a sudden blow-up in my loss curve.
I suspect numerical issues rather than a modelling problem: right up until the explosion, the losses and the optimization process look reasonable, and I don't observe any NaNs.
I use tfp.distributions.MultivariateNormalTriL with a covariance parameter transformed by tfp.bijectors.FillScaleTriL with the default diagonal shift. The condition number of the covariance matrix is reasonable. The variational inference is performed with the fit_surrogate_posterior function.
I optimize with SGD with momentum, using 10 samples per iteration.
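For reference, my setup is roughly equivalent to the following sketch (the dimensionality, learning rate and number of steps are placeholders, and target_log_prob_fn stands in for my model's joint log-density):

import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

dim = 13  # placeholder dimensionality
loc = tf.Variable(tf.zeros(dim), name='loc')
scale_tril = tfp.util.TransformedVariable(
    tf.eye(dim), bijector=tfb.FillScaleTriL())  # default diagonal shift
surrogate_posterior = tfd.MultivariateNormalTriL(loc=loc, scale_tril=scale_tril)

losses = tfp.vi.fit_surrogate_posterior(
    target_log_prob_fn=target_log_prob_fn,   # my model's joint log-prob (not shown)
    surrogate_posterior=surrogate_posterior,
    optimizer=tf.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    num_steps=5000,
    sample_size=10)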
Internally, the TensorFlow Probability source code builds the minimization objective under a gradient tape:
with tf.GradientTape(watch_accessed_variables=trainable_variables is None) as tape:
    for v in trainable_variables or []:
        tape.watch(v)
    loss = loss_fn()
In order to solve my issue, I would like to see the gradient flowing through every operation.
My questions are: how can I get more insight into which operation's gradient is exploding during the gradient computation, and how can I get the value of the gradient at every tensor?
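For example, I can already recompute the loss under my own tape and print per-variable gradient norms (a rough sketch, not my exact code), but that only narrows the problem down to a variable, not to an individual operation:

import tensorflow as tf

with tf.GradientTape() as tape:
    loss = loss_fn()                                      # the same variational loss
grads = tape.gradient(loss, trainable_variables)
for var, grad in zip(trainable_variables, grads):
    tf.print(var.name, tf.norm(grad))                     # which variable blows up?
    tf.debugging.check_numerics(grad, message=var.name)   # fail fast on NaN/Inf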
And if any of you has faced a similar issue:
Is there a better way to prevent instabilities in the covariance matrix optimization?
Detailed explanations:
I observed that this explosion is caused by a single parameter (though it is not always the same parameter that explodes). This can be checked simply by comparing the covariance matrix two iterations before the explosion with the matrix one iteration before the loss explodes: the last parameter is the one that blows up between these two iterations. When I run the same optimization multiple times, it sometimes happens that one of the "small" parameters (rows 9 to the last) explodes instead.
Thanks,
Mateusz

Related

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
The tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that the tf.keras.Sequential.predict function returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate Jacobians with respect to the variables.
The same problem arises with variables. There doesn't seem to be a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for arbitrary nonlinear optimization problems (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses a Cholesky decomposition of the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in the case of neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm rather than a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration to a different minibatch at each step.
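To make the linearized step concrete, here is a rough single-step sketch for a flat parameter vector (this undamped version is essentially a Gauss-Newton step, not the tfg implementation; LM additionally adds a damping term before solving):

import tensorflow as tf

def gauss_newton_step(residual_fn, params):
    # Linearize the residuals r(params) and solve J @ delta ≈ r in the
    # least-squares sense, then step against delta.
    with tf.GradientTape() as tape:
        tape.watch(params)
        r = residual_fn(params)                     # shape [num_residuals]
    jac = tape.jacobian(r, params)                  # shape [num_residuals, num_params]
    delta = tf.linalg.lstsq(jac, r[:, tf.newaxis])  # cubic cost in num_params
    return params - tf.squeeze(delta, axis=-1)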
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # trainable_params is deliberately ignored: it would still be at its
        # initial value anyway, since we only run one LM iteration per call.
        return model(input_batch) - target_batch
    new_objective_value, new_params = lm_minimize(
        residual_fn, model.trainable_variables, max_iterations=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive approach, where we assign the model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(
    residual_fn, model.trainable_variables, max_iterations=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall, I believe it's best to write your own LM optimizer that works directly on tf.Variables: tfg.math.optimizer.levenberg_marquardt.minimize has a very different API, and it is not well suited to optimizing Keras model parameters because you can't compute model(input, parameters) - target_value without going through a tf.assign.

How do I calculate subgradients in TensorFlow?

Does the automatic differentiation procedure in TensorFlow compute subgradient whenever needed? If there are many subgradients then which one will be chosen as output?
I am trying to implement the paper at https://www.aclweb.org/anthology/P13-1045, which uses recursive neural networks to perform efficient language parsing. The objective function uses a hinge loss to pick the optimal output vectors, which makes the function non-differentiable. I used TensorFlow (v1.12) in eager mode to program the model and used automatic differentiation to compute the gradients. After every batch, I can see the gradient values changing and the accuracy improving slightly. After a while it decreases, and this process continues. The model does not converge at all for any of the hyper-parameter configurations I tried:
Mini-batch size: 256, 512, 1024; regularization parameter: 0.1, 0.01, 0.001; learning rate: 0.1, 0.01, 0.001; optimizer: gradient descent, Adagrad, Adam.
In the paper, they describe how to find a subgradient for the objective only in a very abstract manner, which I have not yet understood. At first I assumed that automatic differentiation computes the subgradient, but now I am starting to doubt it, because that seems to be the only missing piece.
Unfortunately, Tensorflow does not compute subgradients, only gradients.
As explained in the linked question How does tensorflow handle non differentiable nodes during gradient calculation?, when computing a partial derivative at a point where the function is not differentiable, Tensorflow simply sets that derivative to zero.
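For instance, a quick check with tf.abs, which is not differentiable at 0, shows this convention (a small sketch):

import tensorflow as tf

x = tf.Variable(0.0)
with tf.GradientTape() as tape:
    y = tf.abs(x)                     # |x| has no derivative at x = 0
print(tape.gradient(y, x).numpy())    # 0.0 -- the subgradient TensorFlow picks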
As for your trouble training the model, there are no general rules for tuning hyperparameters, so I would suggest running a grid search over learning rates (for a few epochs each) to find a good initial learning rate for one of the optimization algorithms. Usually, Adam or SGD with momentum give satisfying results once a suitable initial learning rate is chosen.

What is the purpose of the Tensorflow Gradient Tape?

I watched the Tensorflow Developer's summit video on Eager Execution in Tensorflow, and the presenter gave an introduction to "Gradient Tape." Now I understand that Gradient Tape tracks the automatic differentiation that occurs in a TF model.
I am trying to understand why I would use Gradient Tape. Can anyone explain how Gradient Tape is used as a diagnostic tool? Why would someone use Gradient Tape rather than just TensorBoard visualization of weights?
I get that the automatic differentiation that occurs with a model computes the gradients at each node, meaning the adjustment of the weights and biases at each node given some batch of data; that is the learning process. But I was under the impression that I can use a tf.keras.callbacks.TensorBoard() callback to see the TensorBoard visualization of training, so I can watch the weights at each node and determine whether there are any dead or oversaturated nodes.
Is the use of Gradient Tape only to see if some gradients go to zero or get really big, etc? Or is there some other use of the Gradient Tape?
With eager execution enabled, Tensorflow calculates the values of tensors as they occur in your code. This means that it won't precompute a static graph whose inputs are fed in through placeholders. To backpropagate errors, you therefore have to keep track of the gradients of your computation yourself and then apply these gradients to an optimiser.
This is very different from running without eager execution, where you would build a graph and then simply use sess.run to evaluate your loss and then pass this into an optimiser directly.
Fundamentally, because tensors are evaluated immediately, you don't have a graph from which to calculate gradients, so you need a gradient tape. It is not so much that it is just used for visualisation; rather, you cannot implement gradient descent in eager mode without it.
Obviously, Tensorflow could just keep track of every gradient for every computation on every tf.Variable. However, that could be a huge performance bottleneck. Instead, Tensorflow exposes a gradient tape so that you can control which areas of your code need gradient information. Note that in non-eager mode this is statically determined from the computational branches your loss depends on, but in eager mode there is no static graph and therefore no way of knowing.
Having worked on this for a while after posting the initial question, I have a better sense of where Gradient Tape is useful. It seems the most useful application of Gradient Tape is when you design a custom layer in your keras model, for example, or equivalently when you design a custom training loop for your model.
If you have a custom layer, you can define exactly how the operations occur within that layer, including the gradients that are computed and also the amount of loss that is accumulated.
So Gradient Tape gives you direct access to the individual gradients that are in the layer.
Here is an example from Aurelien Geron's 2nd edition book on Tensorflow.
Say you have a function you want as your activation.
def f(w1, w2):
    return 3 * w1 ** 2 + 2 * w1 * w2
Now if you want to take derivatives of this function with respect to w1 and w2:
w1, w2 = tf.Variable(5.), tf.Variable(3.)
with tf.GradientTape() as tape:
    z = f(w1, w2)
gradients = tape.gradient(z, [w1, w2])
So the tape records the computation, calculates the gradients, and gives you access to those values. You can then double them, square them, triple them, etc., whatever you like, before using the adjusted gradients in the parameter-update step of backpropagation.
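For instance, a typical custom training step along those lines might look like this sketch (model, optimizer and loss_fn are assumed to already exist; the clipping is just one example of adjusting the taped gradients before applying them):

import tensorflow as tf

@tf.function
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        loss = loss_fn(y_batch, model(x_batch, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    grads = [tf.clip_by_norm(g, 1.0) for g in grads]   # adjust the gradients here
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss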
I think the most important thing to say in answer to this question is simply that GradientTape is not a diagnostic tool. That's the misconception here.
GradientTape is a mathematical tool for automatic differentiation (autodiff), which is the core functionality of TensorFlow. It does not "track" the autodiff, it is a key part of performing the autodiff.
As the other answers describe, it is used to record ("tape") a sequence of operations performed upon some input and producing some output, so that the output can be differentiated with respect to the input (via backpropagation / reverse-mode autodiff) (in order to then perform gradient descent optimisation).

Is the L1 regularization in Keras/Tensorflow *really* L1-regularization?

I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
Looking at the source code for the regularization, it appears that Keras simply adds the L1 norm of the parameters to the loss function.
This would be incorrect because the parameters would almost certainly never go to zero (within floating point error) as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods need to be used where the parameters are set to zero if close enough to zero in the optimization routine. See the soft threshold operator max(0, ..) here.
Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?
EDIT: Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.
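For concreteness, the soft-thresholding operator I mean is, as a rough sketch:

import tensorflow as tf

def soft_threshold(w, lam):
    # prox of lam * |w|: shrink w towards zero and set it to exactly zero
    # once its magnitude falls below lam
    return tf.sign(w) * tf.maximum(tf.abs(w) - lam, 0.0)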
So, despite Joshua's answer, there are three other things worth mentioning:
There is no problem connected with the gradient at 0: Keras simply uses a fixed subgradient there, similarly to the relu case.
Remember that values smaller than roughly 1e-6 (on the order of float32 machine precision) can in practice be treated as 0.
The problem of most values not being set exactly to 0 can arise for computational reasons, due to the nature of a gradient-descent-based algorithm (combined with a high l1 value): oscillations can occur because of the gradient discontinuity at 0. To understand this, imagine that for a given weight w = 0.005 your learning rate is equal to 0.01 and the gradient of the main loss is equal to 0 w.r.t. w. Your weight would then be updated in the following manner:
w = 0.005 - 0.01 * 1 = -0.005 (because the regularizer's gradient is equal to 1 as w > 0),
and after the second update:
w = -0.005 - 0.01 * (-1) = 0.005 (because the regularizer's gradient is equal to -1 as w < 0).
As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this happens due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can experience such oscillating behaviour really often when using an l1 norm regularizer.
Keras correctly implements L1 regularization. In the context of neural networks, L1 regularization simply adds the L1 norm of the parameters to the loss function (see CS231).
While L1 regularization does encourage sparsity, it does not guarantee that the output will be sparse. The parameter updates from stochastic gradient descent are inherently noisy, so the probability that any given parameter is exactly 0 is vanishingly small.
However, many of the parameters of an L1-regularized network are often close to 0. A rudimentary approach would be to threshold small values to 0. There has been research exploring more advanced methods of generating sparse neural networks. In this paper, the authors simultaneously prune and train a neural network to achieve 90-95% sparsity on a number of well-known network architectures.
TL;DR:
The formulation in deep learning frameworks is correct, but we currently don't have a powerful solver/optimizer that solves it EXACTLY with SGD or its variants. If you use a proximal optimizer, however, you can obtain a sparse solution.
Your observation is right.
Almost all deep learning frameworks (including TF) implement L1 regularization by adding the absolute values of the parameters to the loss function. This is the Lagrangian form of L1 regularization and IS CORRECT.
However, the SOLVER/OPTIMIZER is to blame. Even for the well-studied LASSO problem, where the solution should be sparse and the soft-threshold operator DOES give us the sparse solution, the subgradient descent solver CANNOT obtain the EXACT SPARSE solution. This answer from Quora gives some insight into the convergence properties of subgradient descent:
"Subgradient descent has very poor convergence properties for non-smooth functions, such as the Lasso objective, since it ignores problem structure completely (it doesn't distinguish between the least squares fit and the regularization term) by just looking at subgradients of the entire objective. Intuitively, taking small steps in the direction of the (sub)gradient usually won't lead to coordinates equal to zero exactly."
If you use proximal operators, you can get a sparse solution. For example, you can have a look at the paper "Data-driven sparse structure selection for deep neural networks" (this one comes with MXNET code and is easy to reproduce!) or "Stochastic Proximal Gradient Descent with Acceleration Techniques" (this one gives more theoretical insight). I'm not entirely sure whether the built-in proximal optimizers in TF (e.g. tf.train.ProximalAdagradOptimizer) can lead to sparse solutions, but you may give them a try.
Another simple workaround is to zero out small weights (e.g. absolute value < 1e-4) after training, or after each gradient descent step, to force sparsity. This is just a handy heuristic and not theoretically rigorous.
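For example, such a zeroing pass could look roughly like this (a sketch, assuming a Keras model named model):

import tensorflow as tf

for var in model.trainable_variables:
    # zero out weights whose magnitude is below the tolerance
    var.assign(tf.where(tf.abs(var) < 1e-4, tf.zeros_like(var), var))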
Keras implements L1 regularization properly, but this is not a LASSO. For the LASSO one would need a soft-thresholding function, as correctly pointed out in the original post. It would be very useful to have a function similar to keras.layers.ThresholdedReLU(theta=1.0), but with f(x) = x - theta for x > theta, f(x) = x + theta for x < -theta, and f(x) = 0 otherwise. For the LASSO, theta would be equal to the learning rate times the regularization factor of the L1 function.

Tensorflow: intercept gradients of arbitrary node in the computational graph (not necessarily loss)

I would like to intercept gradients that are backpropagated in my Tensorflow graph, which are not based on the loss (∂L/∂w), but based on some other node in the graph, for example the class scores (∂s/∂w) in a classification problem or some activation (∂a/∂w) to see how it changes when certain weights w change.
How can one implement this efficiently in Tensorflow? Intuitively, the gradients should already all be there for backprop of the loss as intermediate results, so there should be a solution without a big overhead.
I am already aware of the following suggestions, which don't exactly solve the problem:
The Tensorflow method tf.gradients(ys, xs), which computes the gradient of every y in ys w.r.t. every x in xs but then, for every x in xs, sums over all y (see the small sketch after this list). Applying this function separately for every y in ys, however, induces a large computational overhead.
This stackoverflow post, which asks this question for the derivative of the loss w.r.t. some parameters, i.e. ∂L/∂w.
The part of the documentation that proposes calling optimizer.compute_gradients() as an easy-to-use 'wrapper' around tf.gradients(). However, calling this function for every variable of interest again introduces a large computational overhead.
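For illustration, a tiny sketch of that summing behaviour (TF 1.x-style graph mode assumed):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=())
y1, y2 = x ** 2, 3.0 * x
summed = tf.gradients([y1, y2], x)              # [d(y1 + y2)/dx] = [2*x + 3]
per_y = [tf.gradients(y, x) for y in (y1, y2)]  # one call per y: the costly route
with tf.Session() as sess:
    print(sess.run([summed, per_y], feed_dict={x: 1.0}))  # [[5.0], [[2.0], [3.0]]]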
Update: Phrased differently, what I want is the Jacobian of any component of the computational graph w.r.t. any other. This topic has been touched on in this recent Tensorflow issue, but it is described there as not currently being efficiently/conveniently implemented.