Does tf.trace() only evaluate diagonal elements? - tensorflow

I have a TensorFlow tensor t with shape (d,d), a square matrix. I define the trace tensor tr = tf.trace(t). Now tr is evaluated, using session.run(tr): Is TensorFlow smart enough to only evaluate the diagonal elements of t, or are all elements of t evaluated first, and only then the trace is computed?

TensorFlow will compute the matrix first, then run the trace op to extract/sum the diagonal. Potentially this is something that XLA could optimize away if no other ops consume the full matrix (not sure if it does or not currently), but TensorFlow itself sees these ops as more or less black boxes.
If there are no consumers of the full matrix, maybe just do computations on a vector representing that diagonal? You could also use sparse tensors to avoid unnecessary computation while keeping track of indices.
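To make the "work on the diagonal only" suggestion concrete, here is a minimal NumPy sketch. It assumes, purely for illustration, that t is the product of two matrices A and B (the question does not say how t is built). In that case each diagonal entry is a single row-by-column dot product, so the full matrix never has to be formed.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))  # hypothetical factors of t; not from the question
B = rng.standard_normal((d, d))

# Naive route: build the full d x d product (O(d^3) work), then sum its diagonal.
trace_full = np.trace(A @ B)

# Diagonal-only route: (A @ B)[i, i] = sum_j A[i, j] * B[j, i], which is O(d^2) work.
diag_only = np.einsum('ij,ji->i', A, B)  # vector holding just the diagonal
trace_diag = diag_only.sum()

print(trace_full, trace_diag)
```

The same einsum expression is also available as tf.einsum, so the idea carries over to TensorFlow if t has this structure.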

Related

How to debug exploding gradient (covariance matrix) in Tensorflow 2.0 (TFP)

A question that comes from the fact that I never had to debug my models in TF so deeply.
I'm running a variational inference with a full-rank Gaussian approximation using Tensorflow Probability. I noticed my optimization often explodes. Here is my loss curve.
I suspect numerical issues, as all the losses and the optimization process look reasonable and I don't observe any NaNs.
I use tfp.distributions.MultivariateNormalTriL with a covariance parameter transformed by tfp.bijectors.FillScaleTriL with the default diagonal shift. The condition number of the covariance matrix is reasonable. The variational inference is performed with fit_surrogate_posterior function.
I optimize with an SGD with momentum, using 10 samples per iteration.
Internally in Tensorflow Probability source code, the minimization objective uses a gradient tape:
with tf.GradientTape(watch_accessed_variables=trainable_variables is None) as tape:
    for v in trainable_variables or []:
        tape.watch(v)
    loss = loss_fn()
In order to solve my issue I would like to see the gradient through every operation.
My question is: how can I get more insight into which operation is exploding during the gradient computation? How can I get the value of the gradient at every tensor?
And if any of you faced a similar issue:
Is there a better way to prevent instabilities in the covariance matrix optimization?
Detailed explanations:
I observed that this explosion is caused by one parameter (though it is not always the same parameter that explodes). This can be simply checked by comparing the covariance matrix two iterations before the explosion
and one iteration before the point where the loss explodes
Note the last parameter. When I run the same optimization multiple times, it might happen that one of the "small" parameters (rows from 9 to the last) explodes at some point.
Thanks,
Mateusz

How to optimise contraction of three inner tensor indices

In my code for my convolutional neural network there is a step which involves a tensor contraction along 3 dimensions; in NumPy it looks like this (however I am planning on using raw BLAS):
y = np.einsum('abcijk,ijkd->abcd', x, f)
which is symbolic for $y_{abcd} = \sum_{i,j,k} x_{abcijk} \, f_{ijkd}$.
Generally, CNN routines tackle large tensor contractions by converting higher-dimensional tensors into 2-D "images", then using regular matrix multiplication - the cost and redundancy of the initial dimension reduction is offset by leveraging the speed of GEMM and other well-optimised matrix multiplication routines.
My method involves skipping the dimension reduction stage, and the only real computational effort required is in the tensor multiplication of X and F. Do there exist similarly fast routines for contracting the inner indices of tensors? How can I whittle this down to something that is well-established in BLAS etc.? The fact that only inner indices are being contracted suggests to me that there should be a way to take advantage of the locality of reference via the natural traversal order, rather than using generic routines such as "GETT".
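One observation that may help, offered as a sketch rather than an answer from the thread: because the contracted indices i, j, k are the trailing axes of x and the leading axes of f, the einsum above is already a single matrix product once each index group is flattened. The sizes below are made up, and the sketch assumes x is stored contiguously in row-major order, so the reshape is a view and no data is moved.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, i, j, k, d = 2, 3, 4, 3, 3, 5, 6  # illustrative sizes only
x = rng.standard_normal((a, b, c, i, j, k))
f = rng.standard_normal((i, j, k, d))

# Reference result from the original einsum.
y_einsum = np.einsum('abcijk,ijkd->abcd', x, f)

# Flatten (a, b, c) into rows and (i, j, k) into columns; for a C-contiguous
# array this reshape is a view, not a copy.
x2d = x.reshape(a * b * c, i * j * k)
f2d = f.reshape(i * j * k, d)

# One GEMM call, then restore the outer index structure.
y_gemm = (x2d @ f2d).reshape(a, b, c, d)

print(np.allclose(y_einsum, y_gemm))
```

If x is instead a strided window view over an image, the reshape has to copy, which is essentially the dimension-reduction step the question wants to avoid.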

Why do we call .detach() before calling .numpy() on a Pytorch Tensor?

It has been firmly established that my_tensor.detach().numpy() is the correct way to get a numpy array from a torch tensor.
I'm trying to get a better understanding of why.
In the accepted answer to the question just linked, Blupon states that:
You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.
In the first discussion he links to, albanD states:
This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In the second discussion he links to, apaszke writes:
Variable's can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().
I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
What is a Variable? How does it relate to a tensor?
I feel that a thorough high-quality Stack-Overflow answer that explains the reason for this to new users of PyTorch who don't yet understand autodifferentiation is called for here.
In particular, I think it would be helpful to illustrate the graph through a figure and show how the disconnection occurs in this example:
import torch
tensor1 = torch.tensor([1.0,2.0],requires_grad=True)
print(tensor1)
print(type(tensor1))
tensor1 = tensor1.numpy()
print(tensor1)
print(type(tensor1))
I think the most crucial point to understand here is the difference between a torch.tensor and np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), torch.tensors have an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray or torch.tensor can be used interchangeably.
However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (using the chain rule of derivatives) to compute the derivative of the loss function w.r.t each of the independent variables used to compute the loss.
As mentioned before, an np.ndarray object does not have this extra "computational graph" layer. Therefore, when converting a torch.tensor that tracks gradients to an np.ndarray, you must explicitly remove the computational graph of the tensor using the detach() command.
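A minimal sketch of what this looks like in practice, assuming a reasonably recent PyTorch release: calling numpy() directly on a tensor that requires grad raises a RuntimeError, while detaching first succeeds. The variable names are illustrative.

```python
import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)

# Direct conversion is refused because the tensor is part of the autograd graph.
try:
    t.numpy()
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Detaching first yields a tensor without graph information, which converts fine.
arr = t.detach().numpy()
print(type(arr), arr)
```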
Computational Graph
From your comments it seems like this concept is a bit vague. I'll try and illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:
x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)
y = x @ w  # inner product of x and w
z = y ** 2 # square the inner product
If we are only interested in the value of z, we need not worry about any graphs; we simply move forward from the inputs, x and w, to compute y and then z.
However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z back to w computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z and it tells us how to compute the derivative of z w.r.t the inputs leading to z:
z.backward() # ask pytorch to trace back the computation of z
We can now inspect the gradient of z w.r.t w:
w.grad # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])
Note that this is exactly equal to
2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)
since dz/dy = 2*y and dy/dw = x.
Each tensor along the path stores its "contribution" to the computation:
z
tensor(1.4061, grad_fn=<PowBackward0>)
And
y
tensor(1.1858, grad_fn=<DotBackward>)
As you can see, y and z store not only the "forward" value of <x, w> or y**2 but also the computational graph -- the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (output) to w (inputs).
These grad_fn are essential components to torch.tensors and without them one cannot compute derivatives of complicated functions. However, np.ndarrays do not have this capability at all and they do not have this information.
Please see this answer for more information on tracing back the derivative using the backward() function.
Since both np.ndarray and torch.tensor have a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:
numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.
The other direction works in the same way as well:
torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.
Thus, when creating an np.ndarray from a torch.tensor or vice versa, both objects reference the same underlying storage in memory. Since np.ndarray does not store or represent the computational graph associated with the array, this graph must be explicitly removed using detach() when numpy and torch are to reference the same tensor.
Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created and torch.tensors and np.ndarrays can be used interchangeably.
import numpy as np
import torch

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # matrix product in torch
    np.dot(x_t.numpy(), y_np)  # the same product in numpy
I asked: why does it break the graph to move to numpy? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
Yes, the new tensor will not be connected to the old tensor through a grad_fn, and so any operations on the new tensor will not carry gradients back to the old tensor.
Writing my_tensor.detach().numpy() is simply saying, "I'm going to do some non-tracked computations based on the value of this tensor in a numpy array."
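The following sketch illustrates the point with a small made-up example: the detached tensor has the same values and shares storage with the original, but it has no grad_fn, so anything computed from it is treated as a constant during backpropagation.

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x * 3           # tracked: y carries a grad_fn
y_det = y.detach()  # same values, but cut off from the graph

print(y.grad_fn is not None)               # y is connected to x
print(y_det.grad_fn, y_det.requires_grad)  # no graph information on y_det

# detach() does not copy: both tensors point at the same memory.
shares_storage = y_det.data_ptr() == y.data_ptr()

# y_det acts as a constant here, so only the y factor contributes a gradient:
# d(loss)/dx = 3 * y_det = 9 * x, rather than 18 * x for (y * y).sum().
loss = (y * y_det).sum()
loss.backward()
print(x.grad)
```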
The Dive into Deep Learning (d2l) textbook has a nice section describing the detach() method, although it doesn't talk about why a detach makes sense before converting to a numpy array.
Thanks to jodag for helping to answer this question. As he said, Variables are obsolete, so we can ignore that comment.
I think the best answer I can find so far is in jodag's doc link:
To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.
and in albanD's remarks that I quoted in the question:
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In other words, the detach method means "I don't want gradients," and it is impossible to track gradients through numpy operations (after all, that is what PyTorch tensors are for!)
This is a little showcase of a tensor -> numpy array connection:
import torch
tensor = torch.rand(2)
numpy_array = tensor.numpy()
print('Before edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
tensor[0] = 10
print()
print('After edit:')
print('Tensor:', tensor)
print('Numpy array:', numpy_array)
Output:
Before edit:
Tensor: tensor([0.1286, 0.4899])
Numpy array: [0.1285522 0.48987144]
After edit:
Tensor: tensor([10.0000, 0.4899])
Numpy array: [10. 0.48987144]
The value of the first element is shared by the tensor and the numpy array. Changing it to 10 in the tensor changed it in the numpy array as well.

How to use conv1d_transpose in TensorFlow for single-channel images?

New to TensorFlow. I have a single-channel image of size W x H. I would like to do a 1D deconvolution on this image with a kernel that only calculates the deconvoluted output row-wise, and 3 by 3 pixels. Meaning that it uses each group of 3 pixels within a row only once in the deconvolution process. I guess this could be achieved by the stride parameter?
I am aware that there is a conv1d_transpose in the contrib branch of TensorFlow, but with the current limited documentation on it, I am rather confused how to achieve the above. Any recommendations are appreciated.
I would do this with stride and using the standard 2D convolution/transpose. I'm not familiar with conv1d_transpose, but I'm all but certain you wouldn't be able to use a 3x3 kernel with a conv1D operation.
A conv1D operation would operate on a vector, such as an optical spectrum (an example here in case it doesn't make sense: https://dr12.sdss.org/spectrumDetail?plateid=5008&mjd=55744&fiber=278)
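To show what a row-wise kernel with stride 3 would compute, here is a plain NumPy sketch rather than a call to any TensorFlow op. It assumes the intended kernel is 1 x 3 (three pixels within a row), that the image width is divisible by 3, and it uses a made-up kernel. The first step is the strided convolution that uses each group of 3 pixels exactly once; the second is the matching transposed convolution, which spreads each value back over 3 pixels.

```python
import numpy as np

H, W = 2, 6  # W is assumed to be divisible by 3
img = np.arange(H * W, dtype=np.float32).reshape(H, W)
kernel = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # hypothetical 1 x 3 kernel

# Strided convolution along each row (kernel width 3, stride 3):
# every non-overlapping group of 3 pixels yields one output value.
down = img.reshape(H, W // 3, 3) @ kernel  # shape (H, W // 3)

# Transposed convolution with the same kernel and stride:
# every input value is multiplied by the kernel and written to its own 3 pixels.
up = (down[:, :, None] * kernel).reshape(H, W)  # shape (H, W)

print(down)
print(up)
```

In TensorFlow the same arithmetic would correspond to a 2D convolution or transposed convolution with a kernel of height 1 and width 3 and a stride of 3 along the width axis.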

Optimizing a subset of a tensor in Tensor Flow

I have a free variable (tf.Variable) x, and I wish to minimize an error term with respect to a subset of the tensor x (for example, minimizing the error only with respect to the first row of a 2D tensor).
One way is to compute the gradients, set the gradient to zero for the irrelevant parts of the tensor, and apply the gradients. Is there another way?
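You can use a mask and tf.stop_gradient to selectively make parts of the variable non-trainable. Note that tf.stop_gradient(mask*x) blocks the gradient for whatever is passed to it, so one working form is mask*x + tf.stop_gradient((1-mask)*x). In the matrix mask, a value of 1 should denote the parts that receive gradients and 0 the parts that are held fixed.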
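A minimal TensorFlow 2 sketch of the mask idea, using one common form of the trick in which the frozen part of x is routed through tf.stop_gradient and the trainable part is not. The variable values, the mask, and the loss are made up for illustration.

```python
import tensorflow as tf

x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
mask = tf.constant([[1.0, 1.0], [0.0, 0.0]])  # 1 = trainable entry, 0 = frozen entry

with tf.GradientTape() as tape:
    # Same values as x, but gradients only flow through the masked-in entries.
    x_partial = mask * x + tf.stop_gradient((1.0 - mask) * x)
    loss = tf.reduce_sum(x_partial ** 2)  # hypothetical error term

grad = tape.gradient(loss, x)
print(grad.numpy())  # second row is zero, so an optimizer step leaves it unchanged
```

With plain gradient descent the frozen entries stay fixed. Optimizers that apply weight decay or similar updates independent of the gradient could still change them.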