I have the following setup:
B = Batchsize
N = Number of Objects
T = Number of Targets
L = Length of feature embedding per target
For each object, I want to attend to a target. The model decides which target to attend to by taking the argmax of a vector attention_weights with shape=[B,N,T]:
pick = tf.math.argmax(attention_weights, axis=2)
So pick has the shape [B,N] and each entry is an index. Now I would like to use these indices to access the correct target features
target_features.set_shape([B, T, L])
features_picked = tf.some_function(target_features, pick)
My question is, what to use for tf.some_function? Is it something related to tf.gather? I have trouble figuring out how to use it in this case.
Many thanks in advance for any help!
PS: I am using tf.version = '1.13.1'
I came up with the following solution; I will accept the answer once I have confirmed it does what it should:
I tiled target_features so it has shape [B, N, T, L]
Then I do:
features_picked = tf.batch_gather(target_features, indices=pick)
where features_picked has shape [B, N, 1, L]
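To make the intended indexing concrete, here is a minimal NumPy sketch of the semantics being asked for: for every batch b and object n, pick row pick[b, n] of target_features[b]. The sizes and random inputs are made up for illustration. As far as I understand the documented behaviour of tf.batch_gather, calling it on the untiled target_features of shape [B, T, L] with pick of shape [B, N] should give the same [B, N, L] result, but that is worth checking on your TensorFlow version.

```python
import numpy as np

# Hypothetical small sizes, chosen only for illustration.
B, N, T, L = 2, 3, 4, 5
rng = np.random.default_rng(0)
attention_weights = rng.random((B, N, T))
target_features = rng.random((B, T, L))

# Index of the attended target for every (batch, object) pair: shape [B, N].
pick = np.argmax(attention_weights, axis=2)

# Batched gather: features_picked[b, n] = target_features[b, pick[b, n]].
batch_idx = np.arange(B)[:, None]                    # shape [B, 1], broadcasts against pick
features_picked = target_features[batch_idx, pick]   # shape [B, N, L]
```

No tiling is needed here because the batch index broadcasts against pick.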
I'm new to automatic differentiation programming, so this may be a naive question. Below is a simplified version of what I'm trying to solve.
I have two input arrays, a vector A of size N and a matrix B of shape (N, M), as well as a parameter vector theta of size M. I define a new array C(theta) = B * theta to get a new vector of size N. I then obtain the indices of the elements that fall in the upper and lower quartiles of C, and use them to create the new arrays A_low(theta) = A[lower quartile indices of C] and A_high(theta) = A[upper quartile indices of C]. Clearly these two depend on theta, but is it possible to differentiate A_low and A_high w.r.t. theta?
My attempts so far seem to suggest no: I have tried the Python libraries autograd, JAX and TensorFlow, but they all return a gradient of zero. (The approaches I have tried so far involve using argsort or extracting the relevant sub-arrays using tf.top_k.)
What I'm seeking help with is either a proof that the derivative is not defined (or cannot be analytically computed) or if it does exist, a suggestion on how to estimate it. My eventual goal is to minimize some function f(A_low, A_high) wrt theta.
This is the JAX computation that I wrote based on your description:
import numpy as np
import jax.numpy as jnp
import jax
N = 10
M = 20
rng = np.random.default_rng(0)
A = jnp.array(rng.random((N,)))
B = jnp.array(rng.random((N, M)))
theta = jnp.array(rng.random(M))
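def f(A, B, theta, k=3):
    C = B @ theta  # matrix-vector product, shape (N,)
    _, i_upper = jax.lax.top_k(C, k)   # indices of the k largest entries of C
    _, i_lower = jax.lax.top_k(-C, k)  # indices of the k smallest entries of C
    return A[i_lower], A[i_upper]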
x, y = f(A, B, theta)
dx_dtheta, dy_dtheta = jax.jacobian(f, argnums=2)(A, B, theta)
The derivatives are all zero, and I believe this is correct, because the change in value of the outputs does not depend on the change in value of theta.
But, you might ask, how can this be? After all, theta enters into the computation, and if you put in a different value for theta, you get different outputs. How could the gradient be zero?
What you must keep in mind, though, is that differentiation doesn't measure whether an input affects an output. It measures the change in output given an infinitesimal change in input.
Let's use a slightly simpler function as an example:
import jax
import jax.numpy as jnp
A = jnp.array([1.0, 2.0, 3.0])
theta = jnp.array([5.0, 1.0, 3.0])
def f(A, theta):
    return A[jnp.argmax(theta)]
x = f(A, theta)
dx_dtheta = jax.grad(f, argnums=1)(A, theta)
Here the result of differentiating f with respect to theta is all zero, for the same reasons as above. Why? If you make an infinitesimal change to theta, it will in general not affect the sort order of theta. Thus, the entries you choose from A do not change given an infinitesimal change in theta, and thus the derivative with respect to theta is zero.
Now, you might argue that there are circumstances where this is not the case: for example, if two values in theta are very close together, then certainly perturbing one even infinitesimally could change their respective rank. This is true, but the gradient resulting from this procedure is undefined (the change in output is not smooth with respect to the change in input). The good news is this discontinuity is one-sided: if you perturb in the other direction, there is no change in rank and the gradient is well-defined. In order to avoid undefined gradients, most autodiff systems will implicitly use this safer definition of a derivative for rank-based computations.
The result is that the value of the output does not change when you infinitesimally perturb the input, which is another way of saying the gradient is zero. And this is not a failure of autodiff – it is the correct gradient given the definition of differentiation that autodiff is built on. Moreover, were you to try changing to a different definition of the derivative at these discontinuities, the best you could hope for would be undefined outputs, so the definition that results in zeros is arguably more useful and correct.
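The argument above can also be checked numerically. The following is a small NumPy sketch (not part of the original answer) that reuses the argmax example and estimates the gradient by forward finite differences; the helper name finite_difference_grad and the step size eps are my own choices for illustration.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
theta = np.array([5.0, 1.0, 3.0])

def f(A, theta):
    # Select the entry of A at the position of the largest theta.
    return A[np.argmax(theta)]

def finite_difference_grad(theta, eps=1e-6):
    # Forward-difference estimate of df/dtheta_j for every j.
    base = f(A, theta)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        bumped = theta.copy()
        bumped[j] += eps
        grad[j] = (f(A, bumped) - base) / eps
    return grad

fd_grad = finite_difference_grad(theta)

# A large change in theta does change the output, even though the
# derivative at the original theta is zero.
far_theta = np.array([5.0, 1.0, 6.0])
changed = f(A, far_theta) != f(A, theta)
```

A tiny perturbation leaves the argmax (and therefore the selected entry) unchanged, so every finite difference is zero, while a large enough change in theta switches the selection and gives a different output.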
Suppose I have a matrix of N users, and each user is associated with a vector of words (translated to integers). So for example for N = 2 I'd have:
user 0 corresponds to words ['20','56']
user 1 corresponds to words ['58','10','105']
So I have a list
user_words = [['20','56'],['58','10','105']]
Suppose further I created a 100-column embedding matrix (word_emb) for these words. I'd like to look up the (mean) embeddings of each of the user vectors and create a new Tensor, whose shape I would expect to be [2,100]. I tried doing this:
word_vec = []
for word_sequence_i in tf.map_fn(lambda x: x, user_words):
    all_word_vecs = tf.nn.embedding_lookup(word_emb, word_sequence_i)
    word_vec.append( tf.reduce_mean(all_word_vecs, 1))
But this gives me an error:
TypeError: `Tensor` objects are not iterable when eager execution is not enabled. To iterate over this tensor use `tf.map_fn`.
I thought I already was using tf.map_fn above! So what is Tensorflow complaining about? Is there even a way to do what I am trying to do?
Thanks so much!
tf.map_fn returns a Tensor object itself, which is a symbolic reference to a value that will be computed at Session.run() time. You can see this with type(tf.map_fn(lambda x: x, user_words)). So, it's the iteration implied in for word_sequence_i in tf.map_fn(...) that is generating the error.
Perhaps what you're looking for is something like:
all_word_vecs = tf.map_fn(lambda x: tf.nn.embedding_lookup(word_emb, x), user_words)
word_vec = tf.reduce_mean(all_word_vecs, axis=1)
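For reference, here is a NumPy sketch of the result the question is after: one mean embedding per user, giving shape [2, 100]. The vocabulary size, the random word_emb matrix and the use of integer ids instead of strings are assumptions made only for this illustration. Because the users have different numbers of words, the sketch averages each user's rows separately rather than stacking the ids into one rectangular array.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 200, 100             # hypothetical sizes
word_emb = rng.random((vocab_size, emb_dim))

# Word ids per user (ragged: users have different numbers of words).
user_words = [[20, 56], [58, 10, 105]]

# Look up each user's word vectors and average them over the word axis.
word_vec = np.stack(
    [word_emb[np.array(ids)].mean(axis=0) for ids in user_words]
)  # shape [num_users, emb_dim]
```

Note that the mean is taken over the word axis of each user's lookup, which corresponds to axis 1 of a padded [users, words, emb_dim] tensor.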
On a related note, if this distinction between graph construction and execution is getting bothersome, you might want to give TensorFlow's eager execution a spin. See getting started and the programmer's guide.
Hope that helps.
Let's say that we want to process images (or n-dimensional vectors) using Keras/TensorFlow.
And, as a fancy form of regularization, we want to shift each input by a random number of positions to the left (with the overflowing portion reappearing at the right side).
How could it be viewed and solved:
1)
Is there any variation to numpy roll function for TensorFlow?
2)
x - 2D tensor
ri - random integer
concatenate([x[:,ri:], x[:,0:ri]], axis=1) #executed for each single input to the layer, ri being random again and again (I can live with random only for each batch)
In TensorFlow v1.15.0 and up, you can use tf.roll which works just like numpy roll. https://github.com/tensorflow/tensorflow/pull/14953 .
To improve on the answer above you can do:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.roll(tensor, shift=i, axis=[1])
For older versions starting from v1.6.0 you will have to use tf.manip.roll :
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.manip.roll(tensor, shift=i, axis=[1])
I just had to do this myself, and I don't think there is a tensorflow op to do np.roll unfortunately. Your code above looks basically correct though, except it doesn't roll by ri, rather by (x.shape[1] - ri).
Also you need to be careful in choosing your random integer that it is from range(1,x.shape[1]+1) rather than range(0,x.shape[1]), as if ri was 0, then x[:,0:ri] would be empty.
So what I would suggest would be something more like (for rolling along dimension 1):
x_len = x.get_shape().as_list()[1]
i = np.random.randint(0,x_len) # The amount you want to roll by
y = tf.concat([x[:,x_len-i:], x[:,:x_len-i]], axis=1)
EDIT: added missing colon after hannes' correct comment.
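The direction of the two slicing variants is easy to mix up, so here is a small NumPy check (my own sketch, not from the answers above) that compares both against np.roll for every shift, including a shift of 0. The array x is an arbitrary example.

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
x_len = x.shape[1]

for i in range(x_len):
    # Variant from the answer: moves the last i columns to the front,
    # which is a roll to the right by i, the same as np.roll(x, i, axis=1).
    rolled_right = np.concatenate([x[:, x_len - i:], x[:, :x_len - i]], axis=1)
    assert np.array_equal(rolled_right, np.roll(x, i, axis=1))

    # Variant from the question: moves the first i columns to the back,
    # which is a roll to the left by i, the same as np.roll(x, -i, axis=1).
    rolled_left = np.concatenate([x[:, i:], x[:, :i]], axis=1)
    assert np.array_equal(rolled_left, np.roll(x, -i, axis=1))
```

In NumPy an empty slice concatenates without problems, so a shift of 0 simply returns the input unchanged.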
Suppose that I have tensors x[i,j,k] and y[p,q] in a graph. What is the correct way to specify the tensor z[i,j,k,p,q] = x[i,j,k]y[p,q]? This is the coordinate representation of the tensor product of x and y. I can get the job done using a combination of tf.expand_dims, tf.multiply and tf.tile, but I feel like there should be a better way...
I think you can get away without the tile operation using broadcasting.
x_reshaped = tf.reshape(x, (i, j, k, 1, 1))
y_reshaped = tf.reshape(y, (1, 1, 1, p, q))
z = x_reshaped * y_reshaped
When a dimension has size 1 and does not match the size of the corresponding dimension of the tensor it is being multiplied with, it is copied (broadcast) automatically along that dimension and the product is carried out. Tile is often unnecessary; I don't think I have ever used tile in TensorFlow. Here I used reshape rather than expand_dims, but the result is the same either way.
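The same broadcasting rule can be tried out in NumPy. This is a sketch with made-up sizes that builds z[i,j,k,p,q] = x[i,j,k] * y[p,q] by inserting size-1 axes and compares it with an explicit einsum formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((2, 3, 4))   # indices i, j, k
y = rng.random((5, 6))      # indices p, q

# Add size-1 axes so the shapes become (2, 3, 4, 1, 1) and (1, 1, 1, 5, 6);
# broadcasting then expands each size-1 axis to match the other operand.
z = x[:, :, :, None, None] * y[None, None, None, :, :]

# Reference result written out index by index.
z_einsum = np.einsum("ijk,pq->ijkpq", x, y)
```

Both results have shape (2, 3, 4, 5, 6) and no explicit tiling is involved.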
Given...
a Matrix A of shape [m, n]
a tensor I of shape [m]
I want to get a list J of elements from A where
J[i] = A[i, I[i]].
That is, I holds the index of the element to select from each row in A.
Context: I already have the argmax(A, 1) and now I also want the max.
I know that I can just use reduce_max.
And after trying around for a bit I also came up with this:
J = tf.gather_nd(A,
                 tf.transpose(tf.pack([tf.to_int64(tf.range(A.get_shape()[0])), I])))
Where the to_int64 is needed because range only produces int32 and argmax only produces int64.
Neither of the two strikes me as particularly elegant.
One has runtime overhead (probably about a factor of n) and the other has an unknown amount of cognitive overhead. Am I missing something here?
The gather() function provides a way to do it:
r = tf.random.uniform([4,5],0, 9, dtype=tf.int32)
i = tf.random.uniform([4], 0, 4, dtype=tf.int32)
tf.gather(r, i, axis=1, batch_dims=1)
This is a rather late answer, but could doing
mask = tf.one_hot(I, depth=n, dtype=tf.bool, on_value=True, off_value=False)
elements = tf.boolean_mask(A, mask)
accomplish what you're looking for?
edit: I should point out that this is NOT a good idea if A is already a very large tensor, as this ends up making a dense matrix.
The link provided by @yaroslav-bulatov mentions this solution:
def get_elements(data, indices):
    # Offset of each row in the flattened tensor, plus the column to pick.
    flat_indices = tf.range(0, tf.shape(indices)[0])*data.shape[1] + indices
    return tf.gather(tf.reshape(data, [-1]), flat_indices)
Your solution is not currently differentiable (because gradients for tf.gather_nd are not currently supported).
Hopefully, data[:, indices] will be introduced soon.
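To compare the approaches from this thread side by side, here is a NumPy sketch of the selection J[i] = A[i, I[i]] done three ways: with paired row/column indices, with the flattened-index trick, and with a per-row gather. The sizes and the random matrix are arbitrary. Since I is the row-wise argmax here, all three should reproduce the row-wise maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 5
A = rng.random((m, n))
I = np.argmax(A, axis=1)          # one column index per row, shape [m]

# 1) Paired (row, column) indices, the idea behind the gather_nd version.
J_fancy = A[np.arange(m), I]

# 2) Flattened-index trick: row i starts at offset i * n in the flat array.
J_flat = A.reshape(-1)[np.arange(m) * n + I]

# 3) Per-row gather along axis 1, similar in spirit to a batched gather.
J_take = np.take_along_axis(A, I[:, None], axis=1)[:, 0]
```

All three produce a vector of length m without building a dense one-hot mask.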