Where does Tensorflow store the *actual* Jacobians? - tensorflow

As discussed here, Tensorflow's gradients are not true Jacobians -- the "gradient" of Y against X is actually just the gradient of sum(Y) against X. Similarly, Pytorch does not show Jacobians at all when the output is not a scalar.
But surely the full Jacobian is needed for computation (in the application of the chain rule).
Where, internally, is this stored? i.e. is there some object hidden within GradientTape that I can access?
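For concreteness, here is a small sketch (TF 2.x assumed) of what I mean: tape.gradient of a vector-valued output returns the gradient of its sum, while tape.jacobian materializes the full Jacobian explicitly.

import tensorflow as tf

x = tf.Variable([1.0, 2.0, 3.0])
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2  # vector-valued output

# gradient() returns d(sum(y))/dx, i.e. the Jacobian rows summed into one vector
print(tape.gradient(y, x))   # [2. 4. 6.]
# jacobian() materializes the full dy_i/dx_j matrix
print(tape.jacobian(y, x))   # diag(2., 4., 6.)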

Related

Why do we call .detach() before calling .numpy() on a Pytorch Tensor?

It has been firmly established that my_tensor.detach().numpy() is the correct way to get a numpy array from a torch tensor.
I'm trying to get a better understanding of why.
In the accepted answer to the question just linked, Blupon states that:
You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.
In the first discussion he links to, albanD states:
This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In the second discussion he links to, apaszke writes:
Variables can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().
I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
What is a Variable? How does it relate to a tensor?
I feel that a thorough, high-quality Stack Overflow answer that explains the reason for this to new users of PyTorch who don't yet understand autodifferentiation is called for here.
In particular, I think it would be helpful to illustrate the graph through a figure and show how the disconnection occurs in this example:
import torch
tensor1 = torch.tensor([1.0,2.0],requires_grad=True)
print(tensor1)
print(type(tensor1))
tensor1 = tensor1.numpy()
print(tensor1)
print(type(tensor1))
I think the most crucial point to understand here is the difference between a torch.tensor and np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), a torch.tensor has an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.
However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t. each of the independent variables used to compute the loss.
As mentioned before, an np.ndarray object does not have this extra "computational graph" layer, and therefore, when converting a torch.tensor to an np.ndarray you must explicitly remove the computational graph of the tensor using the detach() command.
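As a quick sketch of this point (recent PyTorch versions assumed), calling .numpy() directly on a tensor that requires grad raises an error, while detaching first works:

import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)
# t.numpy()  # RuntimeError: can't call numpy() on a Tensor that requires grad
arr = t.detach().numpy()  # detach() returns a tensor with the same storage but no graph
print(arr)  # [1. 2.]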
Computational Graph
From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:
x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w # inner-product of x and w
z = y ** 2 # square the inner product
If we are only interested in the value of z, we need not worry about any graphs; we simply move forward from the inputs, x and w, to compute y and then z.
However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z back to w computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z and it tells us how to compute the derivative of z w.r.t the inputs leading to z:
z.backward() # ask pytorch to trace back the computation of z
We can now inspect the gradient of z w.r.t w:
w.grad # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])
Note that this exactly equals
2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)
since dz/dy = 2*y and dy/dw = x.
Each tensor along the path stores its "contribution" to the computation:
z
tensor(1.4061, grad_fn=<PowBackward0>)
And
y
tensor(1.1858, grad_fn=<DotBackward>)
As you can see, y and z store not only the "forward" value of <x, w> or y**2 but also the computational graph -- the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (output) to w (inputs).
These grad_fn attributes are essential components of torch.tensors; without them one cannot compute derivatives of complicated functions. However, np.ndarrays have no such capability and do not carry this information.
Please see this answer for more information on tracing back the derivative using the backward() function.
Since both np.ndarray and torch.tensor share a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:
numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.
The other direction works in the same way as well:
torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.
Thus, when creating an np.array from a torch.tensor or vice versa, both objects reference the same underlying storage in memory. Since np.ndarray does not store/represent the computational graph associated with the array, this graph should be explicitly removed using detach() when numpy and torch are to share the same underlying tensor storage.
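For illustration, a small sketch (my own example) of the reverse direction, torch.from_numpy, sharing memory:

import numpy as np
import torch

a = np.zeros(3, dtype=np.float32)
t = torch.from_numpy(a)  # t and a share the same underlying buffer
a[0] = 7.0
print(t)                 # tensor([7., 0., 0.]) -- the change is visible from torch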
Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.
with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # dot product in torch
    np.dot(x_t.numpy(), y_np)     # the same dot product in numpy
I asked, Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
Yes, the new tensor will not be connected to the old tensor through a grad_fn, and so any operations on the new tensor will not carry gradients back to the old tensor.
Writing my_tensor.detach().numpy() is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't talk about why a detach makes sense before converting to a numpy array.
Thanks to jodag for helping to answer this question. As he said, Variables are obsolete, so we can ignore that comment.
I think the best answer I can find so far is in jodag's doc link:
To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.
and in albanD's remarks that I quoted in the question:
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In other words, the detach method means "I don't want gradients," and it is impossible to track gradients through numpy operations (after all, that is what PyTorch tensors are for!)
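To make this concrete, here is a small sketch (my own example) showing that operations on a detached tensor contribute nothing back to the original:

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2          # tracked: dy/dx = 2x
z = y.detach() * 3  # detached: this branch is invisible to autograd
(y + z).sum().backward()
print(x.grad)       # tensor([4.]) -- only the tracked y = x**2 branch contributed
print(z.grad_fn)    # None: z has no connection to the graph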
This is a little showcase of a tensor -> numpy array connection:
import torch
tensor = torch.rand(2)
numpy_array = tensor.numpy()
print('Before edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
tensor[0] = 10
print()
print('After edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
Output:
Before edit:
Tensor: tensor([0.1286, 0.4899])
Numpy array: [0.1285522 0.48987144]
After edit:
Tensor: tensor([10.0000, 0.4899])
Numpy array: [10. 0.48987144]
The value of the first element is shared by the tensor and the numpy array. Changing it to 10 in the tensor changed it in the numpy array as well.

TensorFlow / PyTorch: Gradient for loss which is measured externally

I am relatively new to Machine Learning and Python.
I have a system, which consists of a NN whose output is fed into an unknown nonlinear function F, e.g. some hardware. The idea is to train the NN to be an inverse F^(-1) of that unknown nonlinear function F. This means that a loss L is calculated at the output of F. However, backpropagation cannot be used in a straightforward manner for calculating the gradients and updating the NN weights because the gradient of F is not known either.
Is there any way to use a loss function L that is not directly connected to the NN for the calculation of the gradients in TensorFlow or PyTorch? Or to take a loss obtained with any other software (Matlab, C, etc.) and use it for backpropagation?
As far as I know, Keras keras.backend.gradients only allows calculating gradients with respect to connected weights; otherwise the gradient is either zero or NoneType.
I read about the stop_gradient() function in TensorFlow. But I am not sure whether this is what I am looking for. It allows you to skip computing the gradient with respect to some variables during backpropagation. But I think the operation F is not interpreted as a variable anyway.
Can I define any arbitrary loss function (including a hardware measurement) and use it for backpropagation in TensorFlow or is it required to be connected to the graph as well?
Please, let me know if my question is not specific enough.
AFAIK, all modern deep learning packages (pytorch, tensorflow, keras etc.) rely on gradient descent (and its many variants) to train networks.
As the name suggests, you cannot do gradient descent without gradients.
However, you might circumvent the "non differentiability" of your "given" function F by looking at the problem from a slightly different perspective:
You are trying to learn a model M that "counters" the effect of F. So you have access to F (but not its gradients) and a set of representative inputs X={x_0, x_1, ... x_n}.
For each example x_i you can compute y_i = F(x_i) and your end goal is to have a model M that given y_i will output x_i.
Therefore, you can treat y_i as your model's input and compute a loss between M(y_i) and x_i that produced it. This way you do not need to compute gradients through the "black box" F.
A pseudo code would look something like:
for x in examples:
    y = F(x)                     # applying F on x - getting only the output WITHOUT any gradients
    pred = M(y)                  # apply the trainable model M to the output of F
    loss = torch.norm(x - pred)  # loss will propagate gradients through M and stop at F
    loss.backward()
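A more complete, runnable sketch of the same idea (my own toy example: F is a numpy "black box" standing in for the hardware, M is a small PyTorch model; all names are illustrative):

import numpy as np
import torch
import torch.nn as nn

def F(x):                    # black-box "hardware": no gradients available
    return np.tanh(3.0 * x)  # toy stand-in for the external measurement

M = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
opt = torch.optim.Adam(M.parameters(), lr=1e-2)

for step in range(2000):
    x = torch.rand(32, 1) * 1.8 - 0.9           # representative inputs
    y = torch.from_numpy(F(x.numpy())).float()  # F's output, no graph attached
    pred = M(y)                                 # train M to recover x from y
    loss = ((pred - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()                             # gradients flow through M only
    opt.step()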

Is it possible to integrate Levenberg-Marquardt optimizer from Tensorflow Graphics with a Tensorflow 2.0 model?

I have a Tensorflow 2.0 tf.keras.Sequential model. Now, my technical specification prescribes using the Levenberg-Marquardt optimizer to fit the model. Tensorflow 2.0 doesn't provide it as an optimizer out of the box, but it is available in the Tensorflow Graphics module.
tfg.math.optimizer.levenberg_marquardt.minimize function accepts residuals (a residual is a Python callable returning a tensor) and variables (a list of tensors corresponding to my model weights) as parameters.
What would be the best way to convert my model into residuals and variables?
If I understand correctly how the minimize function works, I have to provide two residuals. The first residual must call my model for every learning case and aggregate all the results into a tensor. The second residual must return all labels as a single constant tensor. The problem is that tf.keras.Sequential.predict returns a numpy array instead of a tensor. I believe that if I convert it to a tensor, the minimizer won't be able to calculate jacobians with respect to variables.
The same problem is with variables. It doesn't seem like there's a way to extract all weights from a model into a list of tensors.
There's a major difference between tfg.math.optimizer.levenberg_marquardt.minimize and Keras optimizers from the implementation/API perspective.
Keras optimizers, such as tf.keras.optimizers.Adam, consume gradients as input and update tf.Variables.
In contrast, tfg.math.optimizer.levenberg_marquardt.minimize essentially unrolls the optimization loop in graph mode (using a tf.while_loop construct). It takes initial parameter values and produces updated parameter values, unlike Adam & co, which only apply one iteration and actually change the values of tf.Variables via assign_add.
Stepping back a bit to the theoretical big picture, Levenberg-Marquardt is not a general gradient-descent-like solver for any nonlinear optimization problem (as Adam is). It specifically addresses nonlinear least-squares optimization, so it's not a drop-in replacement for optimizers like Adam. In gradient descent, we compute the gradient of the loss with respect to the parameters. In Levenberg-Marquardt, we compute the Jacobian of the residuals with respect to the parameters. Concretely, it repeatedly solves the linearized problem Jacobian @ delta_params = residuals for delta_params using tf.linalg.lstsq (which internally uses Cholesky decomposition on the Gram matrix computed from the Jacobian) and applies delta_params as the update.
Note that this lstsq operation has cubic complexity in the number of parameters, so in the case of neural nets it can only be applied to fairly small ones.
Also note that Levenberg-Marquardt is usually applied as a batch algorithm, not a minibatch algorithm like SGD, though there's nothing stopping you from applying the LM iteration on different minibatches in each iteration.
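To make the lstsq-based step concrete, here is a minimal hand-rolled sketch of one such iteration on a toy residual function (my own illustration, not tfg's actual implementation):

import tensorflow as tf

params = tf.Variable([1.0, 2.0])
targets = tf.constant([3.0, 5.0])

def residuals(p):
    return p - targets  # toy residual: zero exactly when p == targets

with tf.GradientTape() as tape:
    r = residuals(params)
jac = tape.jacobian(r, params)                  # shape [n_residuals, n_params]
delta = tf.linalg.lstsq(jac, r[:, tf.newaxis])  # solve jac @ delta ~= r
params.assign_sub(tf.squeeze(delta, axis=-1))   # apply the update
print(params.numpy())                           # [3. 5.] after one step, since this problem is linear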
I think you may only be able to get one iteration out of tfg's LM algorithm, through something like
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

for input_batch, target_batch in dataset:
    def residual_fn(trainable_params):
        # do not use trainable_params; it will still be at its initial value,
        # since we only do one iteration of Levenberg-Marquardt each time
        return model(input_batch) - target_batch

    new_objective_value, new_params = lm_minimize(
        residual_fn, model.trainable_variables, max_iter=1)
    for var, new_param in zip(model.trainable_variables, new_params):
        var.assign(new_param)
In contrast, I believe the following naive method, where we assign model parameters before computing the residuals, will not work:
from tensorflow_graphics.math.optimizer.levenberg_marquardt import minimize as lm_minimize

dataset_iterator = ...

def residual_fn(params):
    input_batch, target_batch = next(dataset_iterator)
    for var, param in zip(model.trainable_variables, params):
        var.assign(param)
    return model(input_batch) - target_batch

final_objective, final_params = lm_minimize(
    residual_fn, model.trainable_variables, max_iter=10000)
for var, final_param in zip(model.trainable_variables, final_params):
    var.assign(final_param)
The main conceptual problem is that residual_fn's output has no gradients wrt its input params, since this dependency goes through a tf.assign. But it might even fail before that due to using constructs that are disallowed in graph mode.
Overall I believe it's best to write your own LM optimizer that works on tf.Variables, since tfg.math.optimizer.levenberg_marquardt.minimize has a very different API that is not really suited for optimizing Keras model parameters: you can't directly compute model(input, parameters) - target_value without a tf.assign.
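If you do go the hand-rolled route, one damped least-squares step acting directly on a Keras model's tf.Variables might look roughly like this (my own sketch, assuming a small model and a single batch of inputs/targets; lambda_ is an illustrative damping-style regularizer, not tfg's API):

import tensorflow as tf

def lm_step(model, x_batch, y_batch, lambda_=1e-3):
    variables = model.trainable_variables
    with tf.GradientTape() as tape:
        residuals = tf.reshape(model(x_batch) - y_batch, [-1])
    jacobians = tape.jacobian(residuals, variables)
    # flatten the per-variable Jacobians into one [n_residuals, n_params] matrix
    n_res = tf.shape(residuals)[0]
    jac = tf.concat([tf.reshape(j, [n_res, -1]) for j in jacobians], axis=1)
    # damped least-squares solve of jac @ delta ~= residuals (cubic in n_params)
    delta = tf.linalg.lstsq(jac, residuals[:, tf.newaxis], l2_regularizer=lambda_)
    # scatter the flat update back into the individual variables
    offset = 0
    for var in variables:
        size = var.shape.num_elements()
        var.assign_sub(tf.reshape(delta[offset:offset + size], var.shape))
        offset += size
    return tf.reduce_sum(residuals ** 2)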

What is "gate_gradients" atribute in tensorflow minimize() function in the optimzer class?

This is the link to TF optimizer class https://www.tensorflow.org/versions/r0.12/api_docs/python/train/optimizers
GATE_NONE: Take the simple case of a matmul op on two vectors x and y, and let the output be L. The gradient of L w.r.t. x is y, and the gradient of L w.r.t. y is xT (x transpose). With GATE_NONE it could happen that the gradient w.r.t. x is applied to modify x before the gradient w.r.t. y is even calculated; the gradient w.r.t. y would then be computed from the already-modified x, which is an error. Of course it won't happen in such a simple case, but you can imagine it happening in more complex/extreme cases.
GATE_OP: For each Op, make sure all gradients are computed before they are used. This prevents race conditions for Ops that generate gradients for multiple inputs where the gradients depend on the inputs. (You could see how this prevents the problem of GATE_NONE, though at the price of some parallelism).
GATE_GRAPH: Make sure all gradients for all variables are computed before any one of them is used. This provides the least parallelism but can be useful if you want to process all gradients before applying any of them (an example use case is clipping gradients according to their global norm before applying them).
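For reference, a minimal sketch of how the argument is passed (TF 1.x-style API, matching the linked r0.12 docs):

import tensorflow as tf  # TF 1.x-style API

x = tf.Variable([1.0, 2.0])
y = tf.Variable([3.0, 4.0])
loss = tf.reduce_sum(x * y)  # dL/dx is y and dL/dy is x -- the situation described above

opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# GATE_OP is the default; GATE_GRAPH waits for every gradient before applying any
train_op = opt.minimize(loss, gate_gradients=tf.train.Optimizer.GATE_GRAPH)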
In the same page that you have linked, if you scroll down a little bit, it says:
gate_gradients argument that controls the degree of parallelism during the application of the gradients

Tensorflow: intercept gradients of arbitrary node in the computational graph (not necessarily loss)

I would like to intercept gradients that are backpropagated in my Tensorflow graph, which are not based on the loss (∂L/∂w), but based on some other node in the graph, for example the class scores (∂s/∂w) in a classification problem or some activation (∂a/∂w) to see how it changes when certain weights w change.
How can one implement this efficiently in Tensorflow? Intuitively, the gradients should already all be there for backprop of the loss as intermediate results, so there should be a solution without a big overhead.
I am already aware of the following suggestions, which don't exactly solve the problem:
The Tensorflow method tf.gradients(ys, xs), which computes the gradient of every y in ys w.r.t. every x in xs, but then, for every x in xs, sums over all y. Applying this function for every y in ys separately, however, induces a large computational overhead.
This stackoverflow post, which asks this question for the derivative of the loss w.r.t. some parameters, i.e. ∂L/∂w.
The part of the documentation which proposes to call optimizer.compute_gradients() as an easy-to-use 'wrapper' around tf.gradients(). However, calling this function for every variable of interest again introduces a large computational overhead.
Update: Phrased differently, what I want is the Jacobian of any component of the computational graph w.r.t. any other. This topic has been touched on in this recent Tensorflow issue, but it is described there as not currently being efficiently/conveniently implemented.