Faster alternative to TensorFlow's tf.einsum() - tensorflow

I need to compute the following tensor from three-dimensional tensors x and y:
tf.einsum("ijk,ljk->ilj",x,y)
Unfortunately, this is pretty slow. Is there any way to rewrite this using only matmul operations that I didn't think of?

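You can (implicitly) broadcast x to the shape iljk and y to the shape ilkj.
Then tf.matmul() gives a tensor of shape iljj', whose entry [i, l, j, j'] is the sum over k of x[i, j, k] * y[l, j', k]. The einsum result is the diagonal j == j' of the last two axes, which tf.linalg.diag_part() extracts. Summing over one j with tf.reduce_sum() would mix different values of j and give a different result.
The resulting shape then is ilj.
x = tf.expand_dims(x, axis=1)  # shape [i, 1, j, k]
y = tf.transpose(y, [0, 2, 1])  # shape [l, k, j]
y = tf.expand_dims(y, axis=0)  # shape [1, l, k, j]
tf.linalg.diag_part(tf.matmul(x, y))  # shape [i, l, j]
However, I don't think that this would be faster, since you build a 4D tensor here and then discard most of it.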
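Another way to stay within matmul is to treat the shared index j as a batch dimension, so that no 4D intermediate is built. The following is a minimal sketch, assuming TensorFlow 2.x with eager execution; the function name einsum_ijk_ljk_to_ilj is hypothetical, and whether it is actually faster than tf.einsum depends on the shapes and the TensorFlow version, so it should be benchmarked.

```python
import tensorflow as tf

def einsum_ijk_ljk_to_ilj(x, y):
    """Compute tf.einsum("ijk,ljk->ilj", x, y) with one batched matmul."""
    xt = tf.transpose(x, [1, 0, 2])  # [j, i, k]
    yt = tf.transpose(y, [1, 0, 2])  # [j, l, k]
    # For each j: [i, k] @ [k, l] -> [i, l]
    out = tf.matmul(xt, yt, transpose_b=True)  # [j, i, l]
    return tf.transpose(out, [1, 2, 0])  # [i, l, j]

x = tf.random.normal([3, 4, 5])  # i=3, j=4, k=5
y = tf.random.normal([6, 4, 5])  # l=6, j=4, k=5
result = einsum_ijk_ljk_to_ilj(x, y)
reference = tf.einsum("ijk,ljk->ilj", x, y)
max_abs_diff = float(tf.reduce_max(tf.abs(result - reference)))
```

The two transposes of the inputs and the one of the output are the price for making j the leading (batch) axis; if the surrounding code can keep the tensors in [j, i, k] layout, they can be dropped.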

Related

How to implement a 3D sparse_tensor_dense_matmul operation in pytorch (or tf)?

I have two tensors, a sparse tensor A and a dense tensor B. A.shape is [batch_size, m, n] and B.shape is [batch_size, n, k]. How can I implement a function f that efficiently computes C = f(A, B), where C.shape is [batch_size, m, k] and, for any batch < batch_size, C[batch] = matmul(A[batch], B[batch])? The function should also support the backward pass.
I tried a for loop with torch.sparse.mm, but this does not make full use of the GPU. How can I parallelize these operations? (I don't mean torch.nn.DataParallel or something like that.)
I have searched online for a long time without success. Please help, or suggest some ideas for how to achieve this.
Thanks in advance.

TensorFlow: slice Tensor and keep original shape

I have a Tensor tensor of shape (?, 1082) and I want to slice this Tensor into n subparts in a for-loop but I want to keep the original shape, including the unknown dimension ?.
Example:
lst = []
for n in range(15):
    sub_tensor = tensor[n]  # this will reduce the first dimension
    print(sub_tensor.get_shape())
Print output I'm looking for:
(?, 1082)
(?, 1082)
etc.
How can this be achieved in TensorFlow?
Considering that your problem can have many constraints, I can think of at least 3 solutions.
You can use tf.split. I'll use tf.placeholder, but it's applicable to tensors and variables as well.
p = tf.placeholder(shape=[None,10], dtype=tf.int32)
s1, s2 = tf.split(value=p, num_or_size_splits=2, axis=1)
However, this approach can become infeasible if the number of splits required is large. Note that it can split the None axis as well.
for n in range(15):
    sub_tensor = tensor[n, :]
You can also use tf.slice:
s = tf.slice(p, [0,2], [-1, 2])
tf.slice can be used for multidimensional tensors, but it's pretty tricky to use. Finally, you can use the tf.Tensor.__getitem__ method, almost as you described in your question. It acts similarly to NumPy indexing. So this should do the job:
for n in range(10):
    print(p[n, :])
Note that an integer index such as p[n, :] drops the indexed axis, while a range index such as p[n:n+1, :] keeps the rank. Which of these methods fits best depends heavily on your particular application. Hope this helps.
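The three options can be compared side by side. This is a small sketch, assuming TensorFlow 2.x with eager execution, where a concrete tensor of shape (4, 10) stands in for the placeholder of shape (?, 10); the variable names are illustrative only.

```python
import tensorflow as tf

# Stand-in for a tensor whose first dimension is unknown at graph-build time.
p = tf.reshape(tf.range(40), [4, 10])

# 1. tf.split: two equal parts along axis 1; the first axis is untouched.
s1, s2 = tf.split(p, num_or_size_splits=2, axis=1)

# 2. tf.slice: start at row 0, column 2; take all rows (-1) and 2 columns.
s = tf.slice(p, begin=[0, 2], size=[-1, 2])

# 3. Indexing: an integer index drops the axis, a range index keeps it.
row = p[1, :]
row_kept = p[1:2, :]

shapes = {
    "s1": tuple(s1.shape),
    "s2": tuple(s2.shape),
    "s": tuple(s.shape),
    "row": tuple(row.shape),
    "row_kept": tuple(row_kept.shape),
}
```

Only the split and slice along axis 1 leave the first dimension as it was, which is what the question asks for.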

tf.gradients, how can I understand `grad_ys` and use it?

In tf.gradients, there is a keyword argument grad_ys
grad_ys is a list of tensors of the same length as ys that holds the initial gradients for each y in ys. When grad_ys is None, we fill in a tensor of ‘1’s of the shape of y for each y in ys. A user can provide their own initial grad_ys to compute the derivatives using a different initial gradient for each y (e.g., if one wanted to weight the gradient differently for each value in each y).
Why is grad_ys needed here? The docs are vague about this. Could you please give a specific use case and some code?
And my example code for tf.gradients is
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
X = tf.placeholder("float", shape=[2, 1])
Y = tf.placeholder("float", shape=[2, 1])
W = tf.Variable(np.random.randn(), name='weight')
b = tf.Variable(np.random.randn(), name='bias')
pred = tf.add(tf.multiply(X, W), b)
cost = 0.5 * tf.reduce_sum(tf.pow(pred-Y, 2))
grads = tf.gradients(cost, [W, b])
sess.run(tf.global_variables_initializer())
W_, b_, pred_, cost_, grads_ = sess.run([W, b, pred, cost, grads],
                                        feed_dict={X: [[2.0], [3.]], Y: [[3.0], [2.]]})
grad_ys is only needed for advanced use cases. Here is how you can think about it.
tf.gradients allows you to compute tf.gradients(y, x, grad_ys) = grad_ys * dy/dx. In other words, grad_ys is the multiplier of each y. In this notation, it seems silly to provide this argument because one should be able to do the multiplication oneself, i.e. tf.gradients(y, x, grad_ys) = grad_ys * tf.gradients(y, x). Unfortunately, this equality does not hold, because when computing gradients backwards we perform a reduction (typically a summation) after each step to get an "intermediate loss".
This functionality can be useful in many cases. One is mentioned in the doc string. Here is another. Remember the chain rule: dz/dx = dz/dy * dy/dx. Let's say that we want to compute dz/dx, but dz/dy cannot be obtained by differentiation and we can only approximate it. Let's say we compute the approximation somehow and call it approx. Then, dz/dx = tf.gradients(y, x, grad_ys=approx).
Another use case can be when you have a model with a "huge fan-in". Let's say you have 100 input sources that go through a few layers (call these "100 branches"), get combined at y, and go through 10 more layers until you get to a loss. It might be that computing all the gradients (which requires remembering many activations) for the whole model at once does not fit in memory. One way to do this would be to compute d(loss)/dy first. Then, compute the gradients for variables in branch_i with respect to loss using tf.gradients(y, branch_i_variables, grad_ys=d(loss)/dy). Using this (and a few more details I am skipping) you can reduce the peak memory requirement.
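The idea of computing d(loss)/dy first and then feeding it back in as the initial gradient can be sketched as follows. This assumes TensorFlow 2.x with eager execution, where the output_gradients argument of tf.GradientTape.gradient plays the role that grad_ys plays in tf.gradients; the tiny model and the variable names are made up for illustration.

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0, 3.0]])                        # [1, 3]
w = tf.constant([[0.5, -1.0], [2.0, 0.0], [1.0, 1.5]])    # [3, 2]

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    y = tf.matmul(x, w)                  # intermediate tensor, [1, 2]
    loss = tf.reduce_sum(tf.square(y))   # scalar

# Gradient of the loss w.r.t. x in one step.
direct = tape.gradient(loss, x)

# The same gradient in two steps: first d(loss)/dy, then use it as the
# initial gradient when differentiating y w.r.t. x.
dloss_dy = tape.gradient(loss, y)
by_parts = tape.gradient(y, x, output_gradients=dloss_dy)
del tape  # release the resources held by the persistent tape

max_abs_diff = float(tf.reduce_max(tf.abs(direct - by_parts)))
```

Without output_gradients, tape.gradient(y, x) would differentiate the sum of the entries of y, which is the reduction that makes a plain multiplication after the fact impossible.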

Breaking TensorFlow gradient calculation into two (or more) parts

Is it possible to use TensorFlow's tf.gradients() function in parts, that is, to calculate the gradient of the loss w.r.t. some tensor and the gradient of that tensor w.r.t. the weight, and then multiply them to get the original gradient of the loss w.r.t. the weight?
For example, let W,b be some weights, let x be an input of a network, and let y0 denote labels.
Assume a forward graph such as
h=Wx+b
y=tanh(h)
loss=mse(y-y0)
We can calculate tf.gradients(loss,W) and then apply (skipping some details) optimizer.apply_gradients() to update W.
I then try to extract an intermediate tensor, by using var=tf.get_default_graph().get_tensor_by_name(...), and then calculate two gradients: g1=tf.gradients(loss,var) and g2=tf.gradients(var,W).
I would then, by the chain rule, expect the dimensions of g1 and g2 to work out so that I can write g=g1*g2 in some sense, and get back tf.gradients(loss,W).
Unfortunately, this is not the case. The dimensions are incorrect. Each gradient's dimensions will be that of the "w.r.t variable", so there won't be a correspondence between the first gradient and the second one. What am I missing, and how can I do this?
Thanks.
tf.gradients sums the gradients over all elements of ys, so tf.gradients(var, W) returns the gradient of sum(var) w.r.t. W rather than a Jacobian. To avoid this you have to split the tensor into scalars and apply tf.gradients to each of them:
import tensorflow as tf
x = tf.ones([1, 10])
w = tf.get_variable("w", initializer=tf.constant(0.5, shape=[10, 5]))
out = tf.matmul(x, w)
out_target = tf.constant(0., shape=[5])
loss = tf.reduce_mean(tf.square(out - out_target))
grad = tf.gradients(loss, x)
part_grad_1 = tf.gradients(loss, out)
part_grad_2 = tf.concat([tf.gradients(i, x) for i in tf.split(out, 5, axis=1)], axis=1)
grad_by_parts = tf.matmul(part_grad_1, part_grad_2)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run([grad]))
    print(sess.run([grad_by_parts]))
From the docs, tf.gradients (emphasis mine)
constructs symbolic derivatives of sum of ys w.r.t. x in xs.
If any tensor in ys is multidimensional, it is reduce_summed before the resulting list of scalars is itself summed, before being differentiated. This is why the output gradient has the same size as the xs.
This also explains why losses can be multidimensional in TensorFlow: they are implicitly summed over before differentiation.
For future readers:
TensorFlow has made some advancements, and as of TF 2.7 (and maybe even earlier versions) you can use tf.GradientTape.jacobian to avoid the sum over the target's dimensions.
https://www.tensorflow.org/guide/advanced_autodiff#jacobians
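As a rough illustration of the Jacobian route, here is the same toy problem as in the tf.gradients answer, rewritten as a sketch for TensorFlow 2.x eager execution. The einsum index letters are arbitrary labels, and w is a plain constant here rather than a variable.

```python
import tensorflow as tf

x = tf.ones([1, 10])
w = tf.constant(0.5, shape=[10, 5])
out_target = tf.zeros([5])

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    out = tf.matmul(x, w)                                # [1, 5]
    loss = tf.reduce_mean(tf.square(out - out_target))   # scalar

grad = tape.gradient(loss, x)           # [1, 10], direct gradient
dloss_dout = tape.gradient(loss, out)   # [1, 5]
dout_dx = tape.jacobian(out, x)         # [1, 5, 1, 10], no implicit sum
del tape

# Chain rule: contract over the axes of `out` (labelled a, b).
grad_by_parts = tf.einsum("ab,abcd->cd", dloss_dout, dout_dx)  # [1, 10]
max_abs_diff = float(tf.reduce_max(tf.abs(grad - grad_by_parts)))
```

The Jacobian has shape target.shape + source.shape, so it can become large quickly; for big tensors the vector-Jacobian product is usually the cheaper choice.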

Slicing by tensor with indices

I have a tensor X with tf.shape(X) == [M, N, N] and a set of indices IDX with tf.shape(IDX) == [N, N]. How can I form a tensor Y with tf.shape(Y) == [N, N] that is the slice of X selected by the indices IDX in the first dimension? I.e.
Y[i, j] = X[IDX[i, j], i, j] for all i,j = 1..N.
I have tried to play with tf.gather_nd but with no result :(
Update 10-12-2016:
As of tensorflow version 0.11 and up one can index into tensors in the same way as numpy.
a = tf.Variable([9,10,11])
b = tf.constant([[1,2,3,4],[5,6,7,8]])
a = b[0,1:]
Gradients are also supported on the indexing.
What did you try already?
It seems like there's a bug with tf.gather_nd that I reported.
Here's the response
Support for partial indices in gather_nd (fewer indices than dimensions) was added quite recently. You are using a version of TensorFlow where each index tensor must have exactly the number of tensor dimensions. The code should work at HEAD.
So with version 0.10 or above, gather_nd should work the way you want.
However, the code below works:
import tensorflow as tf
x = tf.constant([[1,1,1,1],[1,2,3,4]],shape=(2,4))
indices = [[0,0],[0,1]]
y = tf.gather_nd(x,indices)
So it seems like you need the full index description at the moment, not just slice 0. You can also try tf.pack.
You can also track the progress of indexing tensors in tensorflow here:
https://github.com/tensorflow/tensorflow/issues/206
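For the original question, Y[i, j] = X[IDX[i, j], i, j], the "full index description" can be built by stacking IDX with the row and column positions. A minimal sketch, assuming TensorFlow 2.x with eager execution; the sizes M and N and the sample values are made up for illustration.

```python
import tensorflow as tf

M, N = 3, 2
X = tf.reshape(tf.range(M * N * N), [M, N, N])  # X[m, i, j] = 4*m + 2*i + j
IDX = tf.constant([[0, 2], [1, 0]])             # values in [0, M)

# Row and column positions, each of shape [N, N].
i, j = tf.meshgrid(tf.range(N), tf.range(N), indexing="ij")

# One full (m, i, j) triple per output element: shape [N, N, 3].
full_indices = tf.stack([IDX, i, j], axis=-1)

# Y[i, j] = X[IDX[i, j], i, j], shape [N, N].
Y = tf.gather_nd(X, full_indices)
```

All three index components must share one integer dtype; here they are all int32, the default for tf.range and for tf.constant on Python ints.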