How to mask zero-padding values in Tensorflow Encoder-Decoder RNN with Attention? - tensorflow

In the official Tensorflow Neural Machine Translation example (https://www.tensorflow.org/alpha/tutorials/text/nmt_with_attention), in the Encoder model, a GRU layer is defined.
However, the zero-padded values will be processed normally by the GRU as there is no masking applied. And in the Decoder I think that the situation is even worse, because the Attention over the padded values will play an important role in the final computation of the context vector. I think that in the definition of the loss function below, the zeroes are masked, but at this point it is too late and the outputs of both the encoder and the attention decoder will be "broken".
Am I missing something in the whole process? Shouldn't the normal way of implementing this be with masking the padded values?

You are right. If you print the tensor returned from the encoder, you can see that the values in the right-hand (padded) positions differ from example to example, even though they come entirely from padding.
The usual implementation does indeed include masking. You would then use the mask when computing the attention weights in the next cell. The simplest way is to add something like -(1 - mask) * 1e9 to the attention logits in the score tensor, so that padded positions receive near-zero weight after the softmax (a minimal sketch is given below). The tutorial is a very basic one: for instance, the text preprocessing is very simple (it removes all non-ASCII characters), and the tokenization differs from what is usual in machine translation.
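Concretely, something along these lines (the shapes and the way the mask is computed are assumptions based on the tutorial's padded inputs, not its exact code):

    import tensorflow as tf

    # `score` holds the attention logits, shape (batch, max_len, 1); `mask`
    # is 1.0 for real tokens and 0.0 for padding, shape (batch, max_len),
    # e.g. mask = tf.cast(tf.math.not_equal(token_ids, 0), tf.float32).
    def masked_attention_weights(score, mask):
        # Push the logits of padded positions towards -inf so that the
        # softmax assigns them (almost) zero weight.
        score = score + (1.0 - mask[:, :, tf.newaxis]) * -1e9
        return tf.nn.softmax(score, axis=1)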

Related

Why does Conv2DTranspose output checkerboarded images if using matched kernel sizes and strides?

I've read multiple times the famous Distill article concerning why Conv2DTranspose results in checkerboard artefacts: https://distill.pub/2016/deconv-checkerboard/
I understand from it, however, that if both the kernel size and the strides are matched, there shouldn't be any issue. I'm downsampling using Conv2D with kernel=(2,2) and strides=(2,2), and I'm upsampling using Conv2DTranspose with the exact same values. However, if I visualize the output of the Conv2DTranspose layers, they're extremely checkerboarded. Nothing that the next conv layer can't fix, but... why is that? By the way, I don't see artefacts in the final output (also, I'm segmenting, so you don't see small intensity noise in the final quasi-binary mask).
What's the exact definition of Conv2DTranspose anyway? I'd expect the output of such a layer to be non "grid-like" when using matched kernel and strides (as shown in the examples in the distill link above), but in the documentation there's no exact mathematical definition of what it's doing. Why am I wrong?
Mathematically, the Conv2DTranspose operation is the transpose of the convolution, not its inverse; note that the use of "Deconvolution" or other such terms is mathematically incorrect.
The animation linked below is a perfect illustration, since it uses a stride of exactly two: you can see how the initial image size is 2x2, the white squares are zero padding, and the final output tensor is 4x4.
Although Theano is no longer in active development, its documentation on transposed convolution arithmetic, http://deeplearning.net/software/theano_versions/dev/tutorial/conv_arithmetic.html#transposed-convolution-arithmetic, explains the mathematics behind Conv2DTranspose beautifully. You should check it to better understand.
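For reference, a minimal Keras sketch of the matched kernel/stride setup described in the question (the filter counts and input shape are made up):

    import tensorflow as tf

    # Downsample and upsample with kernel size equal to stride, so the
    # transposed-convolution kernel windows tile the output without overlap.
    inputs = tf.keras.Input(shape=(64, 64, 1))
    down = tf.keras.layers.Conv2D(16, kernel_size=(2, 2), strides=(2, 2),
                                  padding="same", activation="relu")(inputs)
    up = tf.keras.layers.Conv2DTranspose(1, kernel_size=(2, 2), strides=(2, 2),
                                         padding="same", activation="relu")(down)
    model = tf.keras.Model(inputs, up)
    model.summary()  # (64, 64, 1) -> (32, 32, 16) -> (64, 64, 1)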

What's the attention model used in tfjs-examples/date-conversion-attention?

I've been looking at tfjs examples and trying to learn about seq2seq models. During the process, I've stumbled upon the date-conversion-attention example.
It's a great example, but what kind of attention mechanism is being used? There is no info in the README file. Can somebody point me to the paper that describes the attention used here?
Link to attention part:
https://github.com/tensorflow/tfjs-examples/blob/908ee32750ba750a14d15caeb53115e2d3dda2b3/date-conversion-attention/model.js#L102-L119
I believe I found the answer.
The attention model used in date-conversion-attention uses the dot-product alignment score, and it's described in Effective Approaches to Attention-based Neural Machine Translation: https://arxiv.org/pdf/1508.04025.pdf
I have twisted my head around this sample for some hours now, and this is what I have concluded so far:
The encoder looks at the full input, one character embedding per LSTM step. The decoder expects a time-shifted copy of the output as its input, starting with a special start character. The outputs (target strings) are provided as-is to the decoder during training. During evaluation, one character is predicted at a time, and the prediction is passed back into the decoder as the input for the next character.
The decoder does not see the input, but it receives the encoder's final-step output as its initial state. This state initialisation tells the decoder how to produce its outputs, something like an encoded description of the date format to work on (I assume).
The LSTM outputs from the encoder and the decoder, one per step (= character of input or output), are then dot-producted together and normalised with a softmax. This dot product is the attention matrix, basically a highlight of the activations from the encoder and the decoder. For the attention heatmap to light up for a given next character, the decoder must have output something that "matches" the encoder's outputs. The attention matrix is not learned weights or biases; it's just a product of the encoder's and decoder's outputs.
Finally, this attention matrix is dot-producted with the full encoder output and concatenated with the decoder output, allowing the final dense layers to decode the attention mappings and "read" the right values from the encoder output.
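In rough Python/TensorFlow pseudocode (the actual example is written in TensorFlow.js; the names and shapes here are my assumptions, not the example's code), that attention step looks something like this:

    import tensorflow as tf

    # encoder_out: (batch, input_steps, units)  -- one LSTM output per input char
    # decoder_out: (batch, output_steps, units) -- one LSTM output per output char
    def dot_product_attention(encoder_out, decoder_out):
        # (batch, output_steps, input_steps): similarity of every decoder
        # step with every encoder step -- the attention matrix in the heatmap.
        scores = tf.matmul(decoder_out, encoder_out, transpose_b=True)
        attention = tf.nn.softmax(scores, axis=-1)
        # Weighted sum of encoder outputs: one context vector per decoder step.
        context = tf.matmul(attention, encoder_out)
        # Concatenate with the decoder output; the final dense layers read from this.
        return tf.concat([context, decoder_out], axis=-1), attention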
In the prediction process, only the last character of the prediction is read at each step. Possibly because the earlier predictions might be unstable?
I read the excellent book Deep Learning with JavaScript: Neural Networks in TensorFlow.js. The book explains the examples one by one and adds lots of extra documentation, but I don't think it explains the general architecture of this sample very well - only the details.

What is the equivalent of the Keras Masking() function in TensorFlow? And do batch norm, conv, and relu support masking?

I am training a GRU layer where the inputs don't have the same length. Therefore, I have padded the input features with 0.0 to make all sequences the same length. On the other hand, I don't want to compute any loss at a time step, for any sample, if its input feature vector at that step is all zeros. For example, at time step 1000 I have a batch size of 34, but samples 33 and 34 of this batch lack data or feature values at time step 1000.
I have found that we can use the method Masking()(inputs) in Keras as long as all subsequent layers or operations support masking. But I have implemented my model in TensorFlow. So what is the equivalent of Masking() in TensorFlow?
Second, how can I know whether batch normalization, conv layers, and non-linear activation functions support the Masking() function in Keras?
Your help is much appreciated!!
I found a detailed solution in Danijar's blog post: https://danijar.com/variable-sequence-lengths-in-tensorflow/
Masking in Keras is used when you have incomplete sequences. Usually, you pad your sequences with 0.0 in the third dimension (the feature dimension, when the input has shape = [batch_size, sequence_length, num_features]). Afterwards, the Masking layer flags every timestep whose features all equal the mask value, so that mask-aware downstream layers skip those timesteps (a sketch follows).
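For example, in tf.keras (the layer sizes here are made up), assuming inputs padded with 0.0:

    import tensorflow as tf

    # Masking flags every timestep whose features all equal mask_value;
    # mask-aware layers such as GRU then skip those timesteps.
    model = tf.keras.Sequential([
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 8)),
        tf.keras.layers.GRU(32, return_sequences=True),
    ])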
In summary: he shows how to compute the sequence length of each sample in the batch with a length() function he implements. The resulting length vector is then fed into dynamic_rnn, which will output zero vectors for the padded part of incomplete sequences (for states and outputs), which is somewhat similar to what the Keras Masking() function does. Second, a mask should be applied when computing the loss function.
All the details are discussed in this blog post.
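For reference, a TF1-style sketch along the lines of that approach (the placeholder shapes and cell size are illustrative, not the blog's exact code):

    import tensorflow as tf  # TF1-style API, as in the blog post

    inputs = tf.placeholder(tf.float32, [None, None, 8])      # (batch, steps, features)
    per_step_loss = tf.placeholder(tf.float32, [None, None])  # (batch, steps)

    def length(sequence):
        # A timestep counts as "used" if any of its features is non-zero.
        used = tf.sign(tf.reduce_max(tf.abs(sequence), axis=2))
        return tf.cast(tf.reduce_sum(used, axis=1), tf.int32)

    cell = tf.nn.rnn_cell.GRUCell(64)
    outputs, state = tf.nn.dynamic_rnn(
        cell, inputs, dtype=tf.float32,
        sequence_length=length(inputs))  # outputs past each length are zeroed

    # Mask the per-timestep loss so padded steps don't contribute.
    mask = tf.sign(tf.reduce_max(tf.abs(inputs), axis=2))     # 1.0 real, 0.0 pad
    masked_loss = (tf.reduce_sum(per_step_loss * mask, axis=1)
                   / tf.reduce_sum(mask, axis=1))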
But regarding masking support in batch norm, conv, and non-linear activation functions: usually, if the output of the LSTM is zero, then with a sigmoid activation at the output, the derivative of the output with respect to the sigmoid's input is output * (1 - output). Hence, when the output is 0, this derivative is zero as well, and since back-propagation applies the chain rule, the gradients of that sample with respect to any weight parameter in the network will be 0 too. So there is usually no need to worry about support. The problem arises when the activation is, for example, relu; in that case the gradients should be explicitly multiplied by zeros before back-propagating (I guess). Maybe doing something like this will help:
final_output = output * mask
Then the derivative of final_output with respect to output will be the mask => 0 or 1 (at any time step, for any sample). Back-propagating this gradient from the output of the activation function to its inputs, followed by the chain rule, means the weights won't be affected for the masked steps.

Tensorflow LSTM Dropout Implementation

How specifically does tensorflow apply dropout when calling tf.nn.rnn_cell.DropoutWrapper() ?
Everything I read about applying dropout to RNNs references this paper by Zaremba et al., which says don't apply dropout to the recurrent connections. Neurons should be dropped out randomly before or after LSTM layers, but not on the recurrent connections between timesteps. Ok.
The question I have is how are the neurons turned off with respect to time?
In the paper that everyone cites, it seems that a random 'dropout mask' is applied at each timestep, rather than generating one random 'dropout mask', reusing it across all the timesteps of the layer being dropped out, and then generating a new 'dropout mask' for the next batch.
Further, and probably what matters more at the moment, how does tensorflow do it? I've checked the tensorflow api and tried searching around for a detailed explanation but have yet to find one.
Is there a way to dig into the actual tensorflow source code?
You can check the implementation here.
It uses the dropout op on the input into the RNNCell, then on the output, with the keep probabilities you specify.
It seems like each sequence you feed in gets a new mask for input, then for output. No changes inside of the sequence.
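In other words, usage looks something like this (a TF1-style sketch; the cell size and keep probabilities are arbitrary):

    import tensorflow as tf  # TF1-style API

    # Dropout is applied to the cell's inputs and outputs with the keep
    # probabilities you pass in, not to the recurrent state-to-state
    # connections.
    cell = tf.nn.rnn_cell.LSTMCell(128)
    cell = tf.nn.rnn_cell.DropoutWrapper(
        cell,
        input_keep_prob=0.8,    # dropout on the input into the cell
        output_keep_prob=0.8)   # dropout on the cell's output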

Tensorflow sequence2sequence model padding

In the seq2seq models, paddings are applied to make all sequences in a bucket have the same length. And apart from this, it looks like no special handling is applied to the paddings:
the encoder encodes the paddings as well
the basic decoder w/o attention decodes using the last encoding which encodes the paddings
the decoder with attention attends to the hidden states of the padding inputs too
It would be really helpful if this could be clarified: is it true that the paddings are basically just a special id/embedding, and that the current seq2seq implementation treats them just like any other embedding? Is no special mechanism needed to ignore the padding, for example when encoding a sequence containing padding, or when decoding such a sequence with the attention-based decoder? So after padding, nothing special is done to the paddings, and we can just pretend a padding token is another embedding (apart from, maybe, the weighted cross-entropy using target_weights)?
If the above is true, then when testing a trained model, is padding needed at all (since at test time, each sentence is decoded separately and not in a batch)? It looks like, from the code, that at test time an input sentence is still bucketed first and then padded?
I think your basic premise is correct: the model does not treat the padding symbol differently than any other symbol. However, when packing the data tensors the padding always shows up at the end of decoder training examples after the 'EOS' symbol, and at the beginning of encoder training examples (because the encoder sequences get reversed.)
Presumably the model will learn that padding on the encoder side carries no real semantic information, as it wouldn't correlate with any of the other words... I suppose it could convey something about the length of the sequence when attention is factored in, but that wouldn't be a bad thing. The bucket approach is an attempt to limit the amount of excess padding.
On the decoder side, the model will quickly learn that padding always occurs after the 'EOS' symbol, and you ignore everything after the 'EOS' symbol is emitted anyway.
Padding is mainly useful because when running tensors in batches, the sequences all have to be the same size...so when testing one-at-a-time you really don't need to pad. However, when testing large validation sets it is still useful to run in batches...with padding.
(I have more questions and concerns about the UNK symbol myself.)
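For completeness, here is a minimal TF1-style sketch of the target_weights idea mentioned in the question - padded target positions get weight 0.0 so they don't contribute to the cross-entropy (the names, shapes, and vocabulary size are assumptions, not the tutorial's exact code):

    import tensorflow as tf  # TF1-style API

    logits = tf.placeholder(tf.float32, [None, None, 10000])  # (batch, steps, vocab)
    targets = tf.placeholder(tf.int32, [None, None])          # (batch, steps)
    target_weights = tf.placeholder(tf.float32, [None, None]) # 1.0 real, 0.0 pad

    crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=targets, logits=logits)                         # (batch, steps)
    loss = (tf.reduce_sum(crossent * target_weights)
            / tf.cast(tf.shape(targets)[0], tf.float32))       # average per batch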