how to prepare `bias` vector in tensor2tensor? - tensorflow

I am having trouble understanding how bias works in tensor2tensor, especially in multihead_attention or dot_product_attention. I want to use it as a library for my problem.
Let's say I have an input tensor T with dimensions (batch, max_input_length, hidden_unit) for a batch of sentences S. I also have a tensor sequence_length with dimension (batch) that gives the length of each sentence in S. How can I prepare the bias vector for this input?
I want to calculate the bias vector for self-attention, that is, when q, k, and v are the same.
Another thing:
What happens to the bias if q is different and k and v are the same? This is a sort of cross-attention. I think in that case we have to calculate the bias vector for k, but I'm not sure.
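For illustration, here is a minimal sketch of the usual padding-mask convention, not tensor2tensor's own code. It assumes the bias is added to the attention logits before the softmax, so it holds 0 at real key positions and a large negative number at padded ones. The helper names padding_bias and attention_weights, and the value -1e9, are hypothetical choices for this example. Because the mask runs over key positions, for cross-attention it would be built from the lengths of the k/v sequence.

```python
import math

NEG_INF = -1e9  # assumed large negative value; after softmax it becomes a weight of ~0


def padding_bias(sequence_length, max_length):
    """Per sentence: 0.0 at real key positions, NEG_INF at padded ones."""
    return [
        [0.0 if pos < length else NEG_INF for pos in range(max_length)]
        for length in sequence_length
    ]


def attention_weights(logits_row, bias_row):
    """Softmax over key positions of (logits + bias) for a single query."""
    biased = [l + b for l, b in zip(logits_row, bias_row)]
    top = max(biased)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in biased]
    total = sum(exps)
    return [e / total for e in exps]


# Two sentences of lengths 3 and 1, padded to max_input_length = 4.
bias = padding_bias(sequence_length=[3, 1], max_length=4)
# Attention weights of one query over the 4 key positions of the first sentence.
weights = attention_weights([0.2, 0.5, 0.1, 0.9], bias[0])
```

In a real model the same per-sentence rows would need an extra shape, such as [batch, 1, 1, key_length], so that they broadcast over heads and query positions.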

Related

Tensorflow: Accumulating gradients of a Tensor

TL;DR: you can just skip to the question in yellow box below.
Suppose I have an Encoder-Decoder neural network, with weights W_1 and W_2 of the encoder and decoder respectively. Let's denote Z as the output of the encoder. The network is trained with batch size n, and all the gradients are calculated with respect to the mean loss value over the batch (as shown in the image below, L_hat, which is the sum of per-sample loss L).
What I'm trying to achieve is, in the backward pass, to manipulate the gradients of Z before passing it further to the encoder's weights W_1. Suppose is a somehow modified gradients operator, for which the following holds:
The above, in the case of a synchronous pass (first calculate the modified gradients of Z, then propagate down to W_1), is very easy to implement (the Jacobian multiplication is done using grad_ys of tf.gradients):
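def modify_grad(grad_z):
    # do some modifications
    return grad_z  # placeholder: return the (modified) gradients

grad_z = tf.gradients(L_hat, Z)
mod_grad_z = modify_grad(grad_z)
# grad_ys weights the gradients of Z with the modified upstream gradients
mod_grad_w1 = tf.gradients(Z, W_1, mod_grad_z)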
The problem is, I need to accumulate the gradients grad_z of the tensor Z over several batches. As the shape of it is dynamic (with None in one of the dimensions, as in the illustration above), I cannot define a tf.Variable to store it. Furthermore, the batch size n may change during training. How can I store the average of grad_z over several batches?
PS: I just wanted to combine pareto-optimal training of ArXiv:1810.04650, the asynchronous network training of ArXiv:1609.02132, and batch size scheduling of ArXiv:1711.00489.

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all TensorFlow functions, things become very easy. Here is an example of how this function could be implemented.
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding matrix. In Tensorflow, this can be directly done
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(you_embedding_matrix, max_indices)
    # Do anything you want with normal keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately it will not work because it involves non-differentiable operations (tf.argmax has no gradient). Note that such operations are acceptable on the true value: a loss function takes a true value and a predicted value, and non-differentiable operations are only fine on the true value.
To be fair, that was what I was asking for in the first place. It is not possible to do what I wanted, but we can get a similar, differentiable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, 1, keepdims=True)  # keep the axis so x / norm broadcasts per row
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally, we want the vector from the dictionary. A lookup is not differentiable, but we can take the average vector using matrix multiplication. Because almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in 1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" from [batch, dictionary_length] to [batch, 1, dictionary_length] using tf.reshape, then tiled my embedding_matrix batch times, and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
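Steps 1 to 4 can be checked on a single example with plain Python lists. This is only a sketch of the arithmetic, not the Keras code itself: the helper names sharpen and soft_lookup, the three-word dictionary, and the two-dimensional embeddings are all made up for illustration.

```python
def sharpen(probs, power):
    """Steps 1-3: raise each probability to `power`, then renormalize to sum to 1."""
    powered = [p ** power for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def soft_lookup(weights, embedding_matrix):
    """Step 4: weighted average of embedding rows, a differentiable stand-in for a lookup."""
    dim = len(embedding_matrix[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_matrix))
        for d in range(dim)
    ]


softmax = [0.2, 0.1, 0.7]          # hypothetical softmax output over 3 words
embedding_matrix = [               # hypothetical 3 x 2 embedding matrix
    [1.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
]

almost_one_hot = sharpen(softmax, power=4)
predicted_vector = soft_lookup(almost_one_hot, embedding_matrix)
```

With power=4 almost all of the weight lands on the third word, so predicted_vector ends up close to that word's embedding row, which is the behaviour an argmax followed by a lookup would have given.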

Tensorflow reduce dimensions of rank 3 tensor

I am trying to build a CLDNN that is researched in the paper here
After the convolutional layers, the features go through a dim-reduction layer. At the point when the features leave the conv layers, the dimensions are [?, N, M]. N represents the number of windows and I think the network requires the reduction in the dimension M, so the dimensions of the features after the dim-red layer is [?,N,Q] , where Q < M.
I have two questions.
1. How do I do this in TensorFlow? I tried using a weight with
W = tf.Variable( tf.truncated_normal([M,Q],stddev=0.1) )
I thought the multiplication tf.matmul(x,W) would yield [?, N, Q], but [?, N, M] and [M, Q] are not valid dimensions for multiplication. I would like to keep N constant and reduce the dimension M.
2. What kind of non-linearity should I apply to the outcome of tf.matmul(x,W)? I was thinking about using a ReLU, but I couldn't even get #1 done.
According to the linked paper (T. N. Sainath et al.: "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks"),
[...] reducing the dimensionality, such that we have 256 outputs from the linear layer, was appropriate.
That means, whatever the input size is, i.e. [?, N, M] or any other dimensionality (always assuming that the first dimension is the number of samples in a mini-batch, denoted by ?), the output will be [?, Q], where typically Q=256.
As we are doing dimensionality reduction by multiplying the input with a weight matrix, no spatial information will be preserved. This means that it doesn't matter whether each input is a matrix or a vector, so we can reshape the input x of the linear layer to have the dimensions [?, N*M]. Then we can use a simple matrix multiplication tf.matmul(x, W), where W is a matrix with dimensions [N*M, Q].
W = tf.Variable(tf.truncated_normal([N*M, Q], stddev=0.1))
x_vec = tf.reshape(x, shape=(-1, N*M))
y = tf.matmul(x_vec, W)
Finally, regarding question 2: in the paper, the dimensionality reduction layer is a linear layer, i.e. you do not apply a non-linearity to the output.
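To see the shapes involved, here is a small pure-Python stand-in for the reshape and matmul above. It is only a sketch: the sizes batch=2, N=3, M=4, Q=5, the constant weights, and the helper names flatten_sample and matmul are assumptions made for this example.

```python
def flatten_sample(sample):
    """Turn one [N, M] sample into a flat vector of length N*M."""
    return [value for row in sample for value in row]


def matmul(a, b):
    """Plain matrix product of a [rows, inner] and b [inner, cols]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


batch, N, M, Q = 2, 3, 4, 5  # hypothetical sizes

# x has shape [batch, N, M]
x = [[[float(n + m) for m in range(M)] for n in range(N)] for _ in range(batch)]
# W has shape [N*M, Q]; constant 0.1 instead of a truncated normal, for determinism
W = [[0.1] * Q for _ in range(N * M)]

x_vec = [flatten_sample(sample) for sample in x]  # shape [batch, N*M]
y = matmul(x_vec, W)                              # shape [batch, Q]
```

Each sample is reduced from N*M = 12 values to Q = 5 values, and the window axis N no longer exists in the output.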

understanding tensorflow sequence_loss parameters

The source code of the sequence_loss module has three required parameters, listed as outputs, targets, and weights.
Outputs and targets are self-explanatory, but what I'm looking to better understand is the weights parameter. What is it?
The other thing I find confusing is that it states that the targets should be the same length as the outputs. What exactly do they mean by the length of a tensor, especially if it's a 3-dimensional tensor?
Think of the weights as a mask applied to the input tensor. In NLP applications, sentences often have different lengths. In order to batch multiple sentences into a minibatch to feed into a neural net, people use a mask matrix to denote which elements in the input tensor are actually valid input. For instance, the weights can be np.ones([batch, max_length]), which means all of the input elements are legitimate.
We can also use a matrix of the same shape as the labels, such as np.asarray([[1,1,1,0],[1,1,0,0],[1,1,1,1]]) (we assume the labels have shape 3x4); then the cross entropy of the first row, last column will be masked out as 0.
You can also use the weights to calculate a weighted accumulation of cross entropy.
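As a rough sketch of the masking idea, the 3x4 weight matrix from the example can be built from sentence lengths and applied to per-step losses in plain Python. The per-step loss values are invented, the helper names are hypothetical, and dividing by the total weight is just one possible way to average; it is not a claim about what sequence_loss does internally.

```python
def weights_from_lengths(lengths, max_length):
    """1.0 for real time steps, 0.0 for padded ones."""
    return [[1.0 if t < n else 0.0 for t in range(max_length)] for n in lengths]


def masked_mean_loss(per_step_loss, weights):
    """Sum of loss * weight over all steps, divided by the total weight."""
    total = sum(
        loss * w
        for loss_row, w_row in zip(per_step_loss, weights)
        for loss, w in zip(loss_row, w_row)
    )
    count = sum(w for w_row in weights for w in w_row)
    return total / count


lengths = [3, 2, 4]                       # three sentences padded to length 4
weights = weights_from_lengths(lengths, 4)

# Made-up cross-entropy values; 9.0 marks the padded steps that should be ignored.
per_step_loss = [
    [0.5, 0.5, 0.5, 9.0],
    [1.0, 1.0, 9.0, 9.0],
    [2.0, 2.0, 2.0, 2.0],
]
loss = masked_mean_loss(per_step_loss, weights)
```

The padded steps contribute nothing to loss, because their losses are multiplied by a weight of 0.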
We used this in a class and our professor said we could just pass it ones of the right shape (the comment says "list of 1D batch-sized float-Tensors of the same length as logits"). That doesn't help with what they mean, but maybe it will help you get your code to run. Worked for me.
This code should do the trick: [tf.ones(batch_size, tf.float32) for _ in logits].
Edit: from TF code:
for logit, target, weight in zip(logits, targets, weights):
  if softmax_loss_function is None:
    # TODO(irving,ebrevdo): This reshape is needed because
    # sequence_loss_by_example is called with scalars sometimes, which
    # violates our general scalar strictness policy.
    target = array_ops.reshape(target, [-1])
    crossent = nn_ops.sparse_softmax_cross_entropy_with_logits(
        logit, target)
  else:
    crossent = softmax_loss_function(logit, target)
  log_perp_list.append(crossent * weight)
The weights that are passed are multiplied by the loss for that particular logit. So I guess if you want to take a particular prediction extra-seriously you can increase the weight above 1.

TensorFlow: Are my logits in the right format for cross entropy function?

Alright, so I'm getting ready to run the tf.nn.softmax_cross_entropy_with_logits() function in Tensorflow.
It's my understanding that the 'logits' should be a Tensor of probabilities, each one corresponding to a certain pixel's probability that it is part of an image that will ultimately be a "dog" or a "truck" or whatever... a finite number of things.
These logits will get plugged into this cross entropy equation:
H(p, q) = -\sum_x p(x) \log q(x)
As I understand it, the logits are plugged into the right side of the equation. That is, they are the q of every x (image). If they were probabilities from 0 to 1... that would make sense to me. But when I'm running my code and ending up with a tensor of logits, I'm not getting probabilities. Instead I'm getting floats that are both positive and negative:
-0.07264724 -0.15262917 0.06612295 ..., -0.03235611 0.08587133 0.01897052 0.04655019 -0.20552202 0.08725972 ..., -0.02107313 -0.00567073 0.03241089 0.06872301 -0.20756687 0.01094618 ..., etc
So my question is... is that right? Do I have to somehow calculate all my logits and turn them into probabilities from 0 to 1?
The crucial thing to note is that tf.nn.softmax_cross_entropy_with_logits(logits, labels) performs an internal softmax on each row of logits so that they are interpretable as probabilities before they are fed to the cross entropy equation.
Therefore, the "logits" need not be probabilities (or even true log probabilities, as the name would suggest), because of the internal normalization that happens within that op.
An alternative way to write:
xent = tf.nn.softmax_cross_entropy_with_logits(logits, labels)
...would be:
softmax = tf.nn.softmax(logits)
xent = -tf.reduce_sum(labels * tf.log(softmax), 1)
However, this alternative would be (i) less numerically stable (since the softmax may compute much larger values) and (ii) less efficient (since some redundant computation would happen in the backprop). For real uses, we recommend that you use tf.nn.softmax_cross_entropy_with_logits().
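The stability point can be illustrated in plain Python. This is a sketch, not TensorFlow's actual implementation: naive_xent mirrors the softmax-then-log alternative, while stable_xent uses the log-sum-exp trick, which is one standard way to stay stable. Both function names and the example logits are made up.

```python
import math


def naive_xent(logits, labels):
    """Softmax first, then -sum(labels * log(softmax))."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(y * math.log(e / total) for y, e in zip(labels, exps))


def stable_xent(logits, labels):
    """Same quantity via log-softmax: z - logsumexp(z), shifted by the max logit."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))


# Small logits (positive and negative, not probabilities): both versions agree.
logits = [-0.07, -0.15, 0.07]
labels = [0.0, 0.0, 1.0]
small_naive = naive_xent(logits, labels)
small_stable = stable_xent(logits, labels)

# Extreme logits: the naive version overflows, the stable one does not.
big_logits = [1000.0, 0.0, -1000.0]
big_labels = [1.0, 0.0, 0.0]
try:
    naive_xent(big_logits, big_labels)
    naive_failed = False
except OverflowError:
    naive_failed = True
big_stable = stable_xent(big_logits, big_labels)
```

The first pair of calls shows that raw logits can be passed in directly, and the second shows why normalizing inside the loss is preferable to applying a separate softmax first.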