Clarification of TensorFlow AttentionWrapper's Layer Size - tensorflow

In tensorflow.contrib.seq2seq's AttentionWrapper, what does "depth" refer to as stated in the attention_layer_size documentation? When the documentation says to "use the context as attention" if the value is None, what is meant by "the context"?

In Neural Machine Translation by Jointly Learning to Align and Translate they give a description of the (Bahdanau) attention mechanism; essentially what happens is that you compute scalar "alignment scores" a_1, a_2, ..., a_n that indicate how important each element of your encoded input sequence is at a given moment in time (i.e. which part of the input sentence you should pay attention to right now in the current timestep).
Assuming your (encoded) input sequence that you want to "pay attention"/"attend over" is a sequence of vectors denoted as e_1, e_2, ..., e_n, the context vector at a given timestep is the weighted sum over all of these as determined by your alignment scores:
context = c := (a_1*e_1) + (a_2*e_2) + ... + (a_n*e_n)
(Remember that the a_k's are scalars; you can think of this as an "averaged-out" letter/word in your sentence --- so ideally if your model is trained well, the context looks most similar to the e_i you want to pay attention to the most, but bears a little bit of resemblance to e_{i-1}, e_{i+1}, etc. Intuitively, think of a "smeared-out" input element, if that makes any sense...)
Anyway, if attention_layer_size is not None, then it specifies the number of hidden units in a feedforward layer within your decoder that is used to mix this context vector with the output of the decoder's internal RNN cell to get the attention value. If attention_layer_size == None, it just uses the context vector above as the attention value, and no mixing of the internal RNN cell's output is done. (When I say "mixing", I mean that the context vector and the RNN cell's output are concatenated and then projected to the dimensionality that you specify by setting attention_layer_size.)
The relevant part of the implementation is at this line and has a description of how it's computed.
Hope that helps!

Related

Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?

To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented many places, for example: here. I am not referring to the seq2seq type of attention used in encoder-decoder models (i.e. Bahdanau), although my question might apply to that as well... I am just not as familiar with it.
Self-attention basically just computes a weighted average of RNN hidden states (a generalization of mean-pooling, i.e. un-weighted average). When there are variable length sequences in the same batch, they will typically be zero-padded to the length of the longest sequence in the batch (if using dynamic RNN). When the attention weights are computed for each sequence, the final step is a softmax, so the attention weights sum to 1.
However, in every attention implementation I have seen, there is no care taken to mask out, or otherwise cancel, the effects of the zero-padding on the attention weights. This seems wrong to me, but I fear maybe I am missing something since nobody else seems bothered by this.
For example, consider a sequence of length 2, zero-padded to length 5. Ultimately this leads to the attention weights being computed as the softmax of a similarly 0-padded vector, e.g.:
weights = softmax([0.1, 0.2, 0, 0, 0]) = [0.20, 0.23, 0.19, 0.19, 0.19]
and because exp(0)=1, the zero-padding in effect "waters down" the attention weights. This can be easily fixed, after the softmax operation, by multiplying the weights with a binary mask, i.e.
mask = [1, 1, 0, 0, 0]
and then re-normalizing the weights to sum to 1. Which would result in:
weights = [0.48, 0.52, 0, 0, 0]
When I do this, I almost always see a performance boost (in the accuracy of my models - I am doing document classification/regression). So why does nobody do this?
For a while I considered that maybe all that matters is the relative values of the attention weights (i.e., ratios), since the gradient doesn't pass through the zero-padding anyway. But then why would we use softmax at all, as opposed to just exp(.), if normalization doesn't matter? (plus, that wouldn't explain the performance boost...)
Great question! I believe your concern is valid and zero attention scores for the padded encoder outputs do affect the attention. However, there are few aspects that you have to keep in mind:
There are different score functions, the one in tf-rnn-attention uses simple linear + tanh + linear transformation. But even this score function can learn to output negative scores. If you look at the code and imagine inputs consists of zeros, vector v is not necessarily zero due to bias and the dot product with u_omega can boost it further to low negative numbers (in other words, plain simple NN with a non-linearity can make both positive and negative predictions). Low negative scores don't water down the high scores in softmax.
Due to bucketing technique, the sequences within a bucket usually have roughly the same length, so it's unlikely to have half of the input sequence padded with zeros. Of course, it doesn't fix anything, it just means that in real applications negative effect from the padding is naturally limited.
You mentioned it in the end, but I'd like to stress it too: the final attended output is the weighted sum of encoder outputs, i.e. relative values actually matter. Take your own example and compute the weighted sum in this case:
the first one is 0.2 * o1 + 0.23 * o2 (the rest is zero)
the second one is 0.48 * o1 + 0.52 * o2 (the rest is zero too)
Yes, the magnitude of the second vector is two times bigger and it isn't a critical issue, because it goes then to the linear layer. But relative attention on o2 is just 7% higher, than it would have been with masking.
What this means is that even if the attention weights won't do a good job in learning to ignore zero outputs, the end effect on the output vector is still good enough for the decoder to take the right outputs into account, in this case to concentrate on o2.
Hope this convinces you that re-normalization isn't that critical, though probably will speed-up learning if actually applied.
BERT implementation applies a padding mask for calculating attention score.
Adds 0 to the non-padding attention score and adds -10000 to padding attention scores. the e^-10000 is very small w.r.t to other attention score values.
attention_score = [0.1, 0.2, 0, 0, 0]
mask = [0, 0, -10000, -10000] # -10000 is a large negative value
attention_score += mask
weights = softmax(attention_score)

taking the gradient in Tensorflow, tf.gradient

I am using this function of tensorflow to get my function jacobian. Came across two problems:
The tensorflow documentation is contradicted to itself in the following two paragraph if I am not mistaken:
gradients() adds ops to the graph to output the partial derivatives of ys with respect to xs. It returns a list of Tensor of length len(xs) where each tensor is the sum(dy/dx) for y in ys.
Blockquote
Blockquote
Returns:
A list of sum(dy/dx) for each x in xs.
Blockquote
According to my test, it is, in fact, return a vector of len(ys) which is the sum(dy/dx) for each x in xs.
I do not understand why they designed it in a way that the return is the sum of the columns(or row, depending on how you define your Jacobian).
How can I really get the Jacobian?
4.In the loss, I need the partial derivative of my function with respect to input (x), but when I am optimizing with respect to the network weights, I define x as a placeholder whose value is fed later, and weights are variable, in this case, can I still define the symbolic derivative of function with respect to input (x)? and put it in the loss? ( which later when we optimize with respect to weights will bring second order derivative of the function.)
I think you are right and there is a typo there, it was probably meant to be "of length len(ys)".
For efficiency. I can't explain exactly the reasoning, but this seems to be a pretty fundamental characteristic of how TensorFlow handles automatic differentiation. See issue #675.
There is no straightforward way to get the Jacobian matrix in TensorFlow. Take a look at this answer and again issue #675. Basically, you need one call to tf.gradients per column/row.
Yes, of course. You can compute whatever gradients you want, there is no real difference between a placeholder and any other operation really. There are a few operations that do not have a gradient because it is not well defined or not implemented (in which case it will generally return 0), but that's all.

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1)). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence, Otherwise, I assign the vector for incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
According to keras documentation, the padding idea seems to be the one. There is the masking parameter in the embedding layer, that will make keras skip these values instead of processing them. In theory, you don't lose that much performance. If the library is well built, the skipping is actually skipping extra processing.
You just need to take care not to attribute the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will tranform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words in vectors because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, there is no logical relation between indices and words
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible that an embedding be capable of detecting words that are verbs, nouns, feminine, masculine, etc, everything encoded in a combination of numeric values (presence/abscence/intensity of features).
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape

Is Tensorflow RNN implements Elman network fully?

Q: Is Tensorflow RNN implemented to ouput Elman Network's hidden state?
cells = tf.contrib.rnn.BasicRNNCell(4)
outputs, state = tf.nn.dynamic_rnn(cell=cells, etc...)
I'm quiet new to TF's RNN and curious about meaning of outputs, and state.
I'm following stanford's tensorflow tutorial but there seems no detailed explanation so I'm asking here.
After testing, I think state is hidden state after sequence calculation and outputs is array of hidden states after each time steps.
so I want to make it clear. outputs and state are just hidden state vectors so to fully implement Elman network, I have to make V matrix in the picture and do matrix multiplication again. am I correct?
I believe you are asking what the output of a intermediate state and output is.
From what I understand, the state would be intermediate output after a convolution / sequence calculation and is hidden, so your understanding is in the right direction.
Output may vary as how you decide to implement your network model, but on a general basis, it is an array where any operation (convolution, sequence calc etc) has been applied after which activation & downsampling/pooling has been applied, to concentrate on identifiable features across that layer.
From Colah's blog ( http://colah.github.io/posts/2015-08-Understanding-LSTMs/ ):
Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanhtanh (to push the values to be between −1−1 and 11) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.
Hope this helps.
Thank you

How does one implement action masking?

The Actor Mimic paper talks about implementing an action-masking procedure. I quote
While playing a certain game, we
mask out AMN action outputs that are not valid for that game and take the softmax over only the subset of valid actions
Does anyone have an idea about how this action masking can be implemented in say Tensorflow? In specifc, how would one take a softmax only over specified subset of actions?
Say you have a valid state tensor which contains ones and zeros.
is_valid = [1, 0, 1, ...]
and then you have an actions tensor on which you want to take the softmax over those values which are valid. You could do the following.
(tf.exp(actions) * is_valid) / (tf.reduce_sum(tf.exp(actions) * is_valid) + epsilon)
In this case the is_valid is masking out the invalid values in the sum. I would also add a small epsilon to the division for the sake of numerical stability so you can never divide by zero.
You should rely on the built in softmax function. i.e. you should first mask the invalid action in your tensor by using a boolean_mask and then apply the softmax function.
(The solution provided by chasep255 has numerical issues.)