How to shift a tensor using api in tensorflow, just like nump.roll() or shift ? [duplicate] - tensorflow

Lets say, that we do want to process images (or ndim vectors) using Keras/TensorFlow.
And we want, for fancy regularization, to shift each input by a random number of positions to the left (owerflown portions reappearing at the right side ).
How could it be viewed and solved:
1)
Is there any variation to numpy roll function for TensorFlow?
2)
x - 2D tensor
ri - random integer
concatenate(x[:,ri:],x[:,0:ri], axis=1) #executed for each single input to the layer, ri being random again and again (I can live with random only for each batch)

In TensorFlow v1.15.0 and up, you can use tf.roll which works just like numpy roll. https://github.com/tensorflow/tensorflow/pull/14953 .
To improve on the answer above you can do:
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.roll(tensor, shift=i, axis=[1])
For older versions starting from v1.6.0 you will have to use tf.manip.roll :
# size of x dimension
x_len = tensor.get_shape().as_list()[1]
# random roll amount
i = tf.random_uniform(shape=[1], maxval=x_len, dtype=tf.int32)
output = tf.manip.roll(tensor, shift=i, axis=[1])

I just had to do this myself, and I don't think there is a tensorflow op to do np.roll unfortunately. Your code above looks basically correct though, except it doesn't roll by ri, rather by (x.shape[1] - ri).
Also you need to be careful in choosing your random integer that it is from range(1,x.shape[1]+1) rather than range(0,x.shape[1]), as if ri was 0, then x[:,0:ri] would be empty.
So what I would suggest would be something more like (for rolling along dimension 1):
x_len = x.get_shape().as_list()[1]
i = np.random.randint(0,x_len) # The amount you want to roll by
y = tf.concat([x[:,x_len-i:], x[:,:x_len-i]], axis=1)
EDIT: added missing colon after hannes' correct comment.

Related

Getting nans for gradient

I am trying to create a search relevance model where I take the dot product between query vector and resulting documents. I add a positional bias term on top to take into account the fact that position 1 is more likely to be clicked on. The final (unnormalised) log likelihood calculation is as follows:
query = self.query_model(query_input_ids, query_attention_mask)
docs = self.doc_model(doc_input_ids, doc_attention_mask)
positional_bias = self.position_model()
if optimizer_idx is not None:
if optimizer_idx == 0:
docs = docs.detach()
positional_bias = positional_bias.clone().detach()
elif optimizer_idx == 1:
query = query.detach()
positional_bias = positional_bias.clone().detach()
else:
query = query.detach()
docs = docs.detach()
similarity = (docs # query.unsqueeze(-1)).squeeze()
click_log_lik = (similarity + positional_bias)\
.reshape(doc_mask.shape)\
.masked_fill_((1 - doc_mask).bool(), float("-inf"))
The query and doc model is simply a distilbert model with a projection layer on top of CLS token. The models can be seen here: https://pastebin.com/g21g9MG3
When inspecting the first gradient descent step, it has nans, but only for the query model and not the doc model. My hypothesis is that normalizing the return values for doc and query models (return F.normalize(out, dim=-1)) is somehow playing up with the gradients.
Does anyone know 1. If my hypothesis is true and more importantly 2. How can I rectify nan gradients?.
Additional Info:
None of the losses are inf or nan.
query is BS x 768
docs is BS x DOC_RESULTS x 768
positional_bias is DOC_RESULTS
DOC_RESULTS is 10 in my case.
The masked_fill in the last line is because occasionally I have less than 10 data points for a query.
Update 1
The following changes made no difference to nans:
Changing masked_fill from -inf to 1e5.
Changing the projection from F.normalize(out, dim=-1) to out / 100.
Removed positional bias altogether with again no luck.
If it helps anyone, and you come across this while using Transformers this is what I did:
So in the end the bug was due to the fact that I was masking away nan's. Since I had some documents with zero length, the output of the transformer was nan. I was hoping that masked_fill would fix this problem, but it doesn't. The solution in my case was to only put non-zero length sequences through transformers, and then append with zeros to fill the batch size.

TFP Linear Regression yhat=model(x_tst) - doesn't work for other data

I cannot see the difference between what I am doing and the working Google TFP example, whose structure I am following. What am I doing wrong/should I be doing differently?
[Setup: Win 10 Home 64-bit 20H2, Python 3.7, TF2.4.1, TFP 0.12.2, running in Jupyter Lab]
I have been building a model step by step following the example of TFP Probabilistic Layers Regression. The Case 1 code runs fine, but my parallel model doesn't and I cannot see the difference that might cause this
yhat = model(x_tst)
to fail with message Input 0 of layer sequential_14 is incompatible with the layer: : expected min_ndim=2, found ndim=1. Full shape received: (2019,) (which is the correct 1D size of x_tst)
For comparison: Google's load_dataset function for the TFP example returns y, x, x_tst, which are all np.ndarray of size 150, whereas I read data from a csv file with pandas.read_csv, split it into train_ and test_datasets and then take 1 col of data as independent variable 'g' and dependent variable 'redz' from the training dataset.
I know x, y, etc. need to be np.ndarray, but one does not create ndarray directly, so I have...
x = np.array(train_dataset['g'])
y = np.array(train_dataset['redz'])
x_tst = np.array(test_dataset['g'])
where x, y, x_tst are all 1-dimensional - just like the TFP example.
The model itself runs
model = tf.keras.Sequential([
tf.keras.layers.Dense(1),
tfp.layers.DistributionLambda(lambda t: tfd.Normal(loc=t, scale=1)),
])
# Do inference.
model.compile(optimizer=tf.optimizers.Adam(learning_rate=0.01), loss=negloglik)
model.fit(x, y, epochs=1, verbose=False);
(and when plotted gives the expected output for the google data - I don't get this far):
But, per the example when I try to "profit" by doing yhat = model(x_tst) I get the dimensions error given above.
What's wrong?
(If I try mode.predict I think I hit a known bug/gap in TFP; then it fails the assert)
Update - Explicit Reshape Resolves Issue
The hint from Frightera led to further investigation: x_tst had shape (2019,)
Reshaping by x_tst = x_tst.rehape(2019,1) resolved the issue. Is TF inconsistent in its requirements or is there some good reason that the explicit final dimension 1 was required? Who knows. At least predictions can be made now.
In this question Difference between numpy.array shape (R, 1) and (R,), the OP asked for the difference between (R,) and (R,1) but the answers given did not address this specific point.
Similarly in this question Difference between these array shapes in numpy
I believe the answer lies in the numpy glossary, where it says of (n,) that
A parenthesized number followed by a comma denotes a tuple with one
element. The trailing comma distinguishes a one-element tuple from a
parenthesized n.
Which, naturally, echoes the Python statements concerning tuples here
Thus an array of shape (R,) is a tuple describing an array as being 1D of a certain extent R, where the comma is appended to distinguish the tuple (R,) from the non-tuple (R).
However, for a 1D array, there is no sense of row or column ordering; (R,1) is R rows by 1 column, but (1, R) would be 1 row of R columns, and though it shouldn't matter to a 1D iterator either it does or the iterator doesn't correctly recognise ( ,) and thinks it is 2D. (i.e. I don't know the technical details of that part, but these seem to be the only options that account for the behaviour.)
This issue is unrelated to the indeterminacy of size that occurs in tensor definition in Tensorflow. In the context of Tensorflow, Tensors (arrays) may have indeterminate shapes, so that more data may be added along a certain axis as processing occurs, e.g. in batches, in which case the initial Tensor shape includes a leading None to indicate where array expansion is expected to occur. (See e.g. tensor's shape here)

How does the gradient of the sum trick work to get maxpooling positions in keras?

The keras examples directory contains a lightweight version of a stacked what-where autoencoder (SWWAE) which they train on MNIST data. (https://github.com/fchollet/keras/blob/master/examples/mnist_swwae.py)
In the original SWWAE paper, the authors compute the what and where using soft functions. However, in the keras implementation, they use a trick to get these locations. I would like to understand this trick.
Here is the code of the trick.
def getwhere(x):
''' Calculate the 'where' mask that contains switches indicating which
index contained the max value when MaxPool2D was applied. Using the
gradient of the sum is a nice trick to keep everything high level.'''
y_prepool, y_postpool = x
return K.gradients(K.sum(y_postpool), y_prepool) # How exactly does this line work?
Where y_prepool is a MxN matrix and y_postpool is a M/2 x N/2 matrix (lets assume canonical pooling of a size 2 pixels).
I have verified that the output of getwhere() is a bed of nails matrix where the nails indicate the position of the max (the local argmax if you will).
Can someone construct a small example demonstrating how getwhere works using this "Trick?"
Lets focus on the simplest example, without really talking about convolutions, say we have a vector
x = [1 4 2]
which we max-pool over (with a single, big window), we get
mx = 4
mathematically speaking, it is:
mx = x[argmax(x)]
now, the "trick" to recover one hot mask used by pooling is to do
magic = d mx / dx
there is no gradient for argmax, however it "passes" the corresponding gradient to an element in a vector at the location of maximum element, so:
d mx / dx = [0/dx[1] dx[2]/dx[2] 0/dx[3]] = [0 1 0]
as you can see, all the gradient for non-maximum elements are zero (due to argmax), and "1" appears at the maximum value because dx/x = 1.
Now for "proper" maxpool you have many pooling regions, connected to many input locations, thus taking analogous gradient of sum of pooled values, will recover all the indices.
Note however, that this trick will not work if you have heavily overlapping kernels - you might end up with bigger values than "1". Basically if a pixel is max-pooled by K kernels, than it will have value K, not 1, for example:
[1 ,2, 3]
x = [13,3, 1]
[4, 2, 9]
if we max pool with 2x2 window we get
mx = [13,3]
[13,9]
and the gradient trick gives you
[0, 0, 1]
magic = [2, 0, 0]
[0, 0, 1]

Numpy- Deep Learning, Training Examples

Silly Question, I am going through the third week of Andrew Ng's newest Deep learning course, and getting stuck at a fairly simple Numpy function ( i think? ).
The exercise is to find How many training examples, m , we have.
Any idea what the Numpy function is to find out about the size of a preloaded training example.
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage-approach you use.
Most python-based tools use the [n_samples, n_features] approach where the first dimension is the sample-dimension, the second dimension is the feature-dimension (like in scikit-learn and co.). Alternatively expressed: samples are rows and features are columns.
So:
# feature 1 2 3 4
x = np.array([[1,2,3,4], # first sample
[2,3,4,5], # second sample
[3,4,5,6]
])
is a training-set of 3 samples with 4 features each.
The sizes M,N (again: interpretation might be different for others) you can get with:
M, N = x.shape
because numpy's first dimension are rows, numpy's second dimension are columns like in matrix-algebra.
For the above example, the target-array is of shape (M) = n_samples.
Anytime you want to find the number of training examples or the size of an array, you can use
m = X.size
This will give you the size or the total number of the examples. In this case, it would be 400.
The above method is also correct but not the optimal method to find the size since, in large datasets, the values could be large and while python easily handles large values, it is not advisable to utilize extra unneeded space.
Or a better way of doing the above scenario is
m=X.shape[1]

Elegant Way to Select one Element per Row in Tensorflow

Given...
a Matrix A of shape [m, n]
a tensor I of shape [m]
I want to get a list J of elements from A where
J[i] = A[i, I[i]].
That is, I holds the index of the element to select from each row in A.
Context: I already have the argmax(A, 1) and now I also want the max.
I know that I can just use reduce_max.
And after trying around for a bit I also came up with this:
J = tf.gather_nd(A,
tf.transpose(tf.pack([tf.to_int64(tf.range(A.get_shape()[0])), I])))
Where the to_int64 is needed because range only produces int32 and argmax only produces int64.
None of the two strike me as particularly elegant.
One has runtime overhead (probably about factor n) and the other has an unknown factor cognitive overhead. Am I missing something here?
The gather() function provides a way to do it:
r = tf.random.uniform([4,5],0, 9, dtype=tf.int32)
i = tf.random.uniform([4], 0, 4, dtype=tf.int32)
tf.gather(r, i, axis=1, batch_dims=1)
This is a rather late answer, but could doing
mask = tf.one_hot(I, depth=n, dtype=tf.bool, on_value=True, off_value=False)
elements = tf.boolean_mask(A, mask)
Accomplish what you're looking for?
edit: I should point out that this is NOT a good idea if A is already a very large tensor, as this ends up making a dense matrix.
Link provided by #yaroslav-bulatov mentiones this solution:
def get_elements(data, indices):
indeces = tf.range(0, tf.shape(indices)[0])*data.shape[1] + indices
return tf.gather(tf.reshape(data, [-1]), indeces)
Your solution is not currently differentiable (because gradients for tf.gather_nd are not currently supported).
Hopefully, data[:, indices] will be introduced soon.