What's the attention model used in tfjs-examples/date-conversion-attention? - tensorflow

I've been looking at tfjs examples and trying to learn about seq2seq models. During the process, I've stumbled upon the date-conversion-attention example.
It's a great example, but what kind of attention mechanism is being used in it? There is no info in the README file. Can somebody point me to the paper that describes the attention being used here?
Link to attention part:
https://github.com/tensorflow/tfjs-examples/blob/908ee32750ba750a14d15caeb53115e2d3dda2b3/date-conversion-attention/model.js#L102-L119

I believe I found the answer.
The attention model used in date-conversion-attention uses the dot-product alignment score, described in Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015). Link: https://arxiv.org/pdf/1508.04025.pdf
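For reference, the "dot" variant of the alignment score in that paper is just the dot product between the current decoder hidden state $h_t$ and each encoder hidden state $\bar{h}_s$, normalised with a softmax over the source positions:

$$\mathrm{score}(h_t, \bar{h}_s) = h_t^\top \bar{h}_s, \qquad a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}{\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}$$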

I have twisted my head around this sample for some hours now, and this is what I have concluded so far:
The encoder looks at the full input, one character embedding per LSTM step. The decoder expects a time-shifted copy of the output as its input, starting with a special character. The outputs (target strings) are provided as-is to the decoder during training. During evaluation, one character is predicted at a time, and the prediction is fed back into the decoder for the next character.
The decoder does not see the input, but it receives the encoder's final-step output as its initial state. This state initialisation tells the decoder how to produce its outputs, something like an encoded description of the date format to work on (I assume).
The LSTM outputs, one per step (i.e., per character of input or output), from the encoder and decoder are then dot-producted and normalised with a softmax. This dot product is the attention matrix, basically a highlight of the activations from the encoder and the decoder. For the attention heatmap to light up for the given next character, the decoder must have output something that "matches" the encoder's outputs. The attention matrix is not learned weights or biases; it is just a product of the encoder's and decoder's outputs.
Finally, this attention matrix is dot-producted with the encoder's outputs over the full input and concatenated with the decoder output, to allow the final dense layers to decode the attention mappings and "read" the right values from the encoder output.
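Here is a rough sketch of that dot-product attention in Python/TensorFlow (the example itself is in tfjs, but the operations map one-to-one). encoder_out and decoder_out stand in for the per-step LSTM outputs of the encoder and decoder; all shapes are toy values.

```python
import tensorflow as tf

batch, in_len, out_len, units = 2, 12, 10, 64
encoder_out = tf.random.normal([batch, in_len, units])   # one vector per input character
decoder_out = tf.random.normal([batch, out_len, units])  # one vector per output character

# Dot-product alignment scores between every decoder step and every encoder step,
# normalised with a softmax over the input positions: this is the attention matrix.
scores = tf.matmul(decoder_out, encoder_out, transpose_b=True)  # [batch, out_len, in_len]
attention = tf.nn.softmax(scores, axis=-1)

# Weighted sum of encoder outputs (the context), concatenated with the decoder
# output so the final dense layers can "read" the attended encoder values.
context = tf.matmul(attention, encoder_out)               # [batch, out_len, units]
combined = tf.concat([context, decoder_out], axis=-1)     # [batch, out_len, 2 * units]
```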
In the prediction process, only the last character is read from the prediction, possibly because the earlier predictions might be unstable?
I read the excellent book Deep Learning with JavaScript: Neural Networks in TensorFlow.js. The book explains the examples one by one and adds lots of extra documentation. But I don't think it explains the general architecture of this sample very well, only the details.

Related

Why Bert transformer uses [CLS] token for classification instead of average over all tokens?

I am doing experiments on the BERT architecture and found out that most fine-tuning tasks take the final hidden layer as a text representation and later pass it to other models for the further downstream task.
BERT's last layer looks like this (image omitted), where we take the [CLS] token of each sentence.
I went through many discussions on this (a Hugging Face issue, a Data Science forum question, a GitHub issue). Most data scientists give this explanation:
BERT is bidirectional, the [CLS] is encoded including all representative information of all tokens through the multi-layer encoding procedure. The representation of [CLS] is individual in different sentences.
My question is: why did the author ignore the other information (each token's vector) rather than taking the average, max_pool, or other methods to make use of all the information, instead of just using the [CLS] token for classification?
How does this [CLS] token help compared to the average of all token vectors?
The use of the [CLS] token to represent the entire sentence comes from the original BERT paper, section 3:
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Your intuition is correct that averaging the vectors of all the tokens may produce superior results. In fact, that is exactly what is mentioned in the Huggingface documentation for BertModel:
Returns
pooler_output (torch.FloatTensor: of shape (batch_size, hidden_size)):
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.
This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
Update: Huggingface removed that statement ("This output is usually not a good summary of the semantic content ...") in v3.1.0. You'll have to ask them why.
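To make the difference concrete, here is a minimal sketch (assuming the transformers and torch packages, recent enough that the model returns an output object) of how you would get the raw [CLS] state, the pooler_output, and a simple average over all token states:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("A short example sentence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_state = outputs.last_hidden_state[:, 0]           # raw hidden state of [CLS]
pooled = outputs.pooler_output                        # [CLS] after Linear + Tanh
mean_state = outputs.last_hidden_state.mean(dim=1)    # average over all tokens
# (with padded batches you would weight the mean by the attention mask)
print(cls_state.shape, pooled.shape, mean_state.shape)  # each is [1, 768]
```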
BERT is designed primarily for transfer learning, i.e., fine-tuning on task-specific datasets. If you average the states, every state is averaged with the same weight, including stop words or other tokens that are not relevant for the task. The [CLS] vector gets computed using self-attention (like everything else in BERT), so it can collect only the relevant information from the rest of the hidden states. So, in some sense, the [CLS] vector is also an average over token vectors, only computed more cleverly, specifically for the tasks that you fine-tune on.
Also, my experience is that when I keep the weights fixed and do not fine-tune BERT, using the token average yields better results.

How to use Transformers for text classification?

I have two questions about how to use the TensorFlow implementation of Transformers for text classification.
First, it seems people mostly use only the encoder layer to do the text classification task. However, the encoder layer generates one prediction for each input word. Based on my understanding of Transformers, the input to the encoder at each time step is one word from the input sentence. Then the attention weights and the output are calculated using the current input word. And we can repeat this process for all of the words in the input sentence. As a result, we'll end up with pairs of (attention weights, outputs) for each word in the input sentence. Is that correct? Then how would you use these pairs to perform a text classification?
Second, based on the TensorFlow implementation of the Transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. When pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In the downstream tasks, it is also used for sentence classification. However, my experience is that sometimes averaging the hidden states gives a better result.
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
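As a concrete illustration of that route, here is a rough sketch (assuming the transformers package with TensorFlow 2; the exact output format varies a bit between versions) using a pre-trained BERT with a classification head, which internally follows the second approach above (the [CLS] representation feeding a classifier):

```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                        num_labels=2)

enc = tokenizer(["great movie", "terrible movie"],
                padding=True, truncation=True, return_tensors="tf")
logits = model(enc).logits                      # [2, 2] class scores
print(tf.nn.softmax(logits, axis=-1))

# For fine-tuning you would compile and fit like any Keras model, e.g.
# model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
#               loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```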
The Transformers are designed to take the whole input sentence at once. The main motive for designing a transformer was to enable parallel processing of the words in the sentences. This parallel processing is not possible in LSTMs or RNNs or GRUs as they take words of the input sentence as input one by one.
So in the encoder part of the Transformer, the very first layer contains a number of units equal to the number of words in a sentence, and each unit converts its word into the corresponding embedding vector. The rest of the processing is then carried out on top of that. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this Transformer for text classification: since in text classification our output is a single number, not a sequence of numbers or vectors, we can remove the decoder part and just use the encoder part. The output of the encoder is a set of vectors, as many as there are words in the input sentence. We can then feed this set of output vectors into a CNN, or add an LSTM or RNN model on top, and perform classification (see the sketch below).
The input is the whole sentence, or a batch of sentences, not word by word. You may simply have misunderstood that part.
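If you do want to build the encoder-only classifier yourself, here is a rough Keras sketch (TF 2.4+ for MultiHeadAttention; all sizes are toy values, and a real model would also add positional encodings and padding masks): the whole sentence goes in at once, the encoder emits one vector per token, and a pooled summary feeds the classification head.

```python
import tensorflow as tf

vocab_size, max_len, d_model, num_classes = 10000, 128, 64, 2

tokens = tf.keras.Input(shape=(max_len,), dtype="int32")
x = tf.keras.layers.Embedding(vocab_size, d_model)(tokens)          # one vector per token

# One self-attention encoder block (residual connections + layer norm).
attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)
ff = tf.keras.layers.Dense(d_model, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + ff)

# Collapse the per-token vectors into a single sentence vector, then classify.
pooled = tf.keras.layers.GlobalAveragePooling1D()(x)
probs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)

model = tf.keras.Model(tokens, probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```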

How to mask zero-padding values in Tensorflow Encoder-Decoder RNN with Attention?

In the official Tensorflow Neural Machine Translation example (https://www.tensorflow.org/alpha/tutorials/text/nmt_with_attention), in the Encoder model, a GRU layer is defined.
However, the zero-padded values will be processed normally by the GRU as there is no masking applied. And in the Decoder I think that the situation is even worse, because the Attention over the padded values will play an important role in the final computation of the context vector. I think that in the definition of the loss function below, the zeroes are masked, but at this point it is too late and the outputs of both the encoder and the attention decoder will be "broken".
Am I missing something in the whole process? Shouldn't the normal way of implementing this be with masking the padded values?
You are right. You can see it when you print the tensor returned from the encoder: the numbers on the right side differ even though most of them come from padding.
The usual implementation indeed includes masking. You would then use the mask when computing the attention weights in the next cell. The simplest way is to add something like -1e9 * (1 - mask) to the attention logits in the score tensor, so the padded positions end up with roughly zero weight after the softmax. The tutorial is a very basic one. For instance, the text preprocessing is very simple (remove all non-ASCII characters), and the tokenization differs from what is usual in machine translation.
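A minimal sketch of that masking (TF 2, toy shapes; I'm assuming the tutorial's score tensor of shape [batch, max_len, 1] with the softmax taken over the length axis):

```python
import tensorflow as tf

batch, src_len = 2, 6
score = tf.random.normal([batch, src_len, 1])          # stand-in for the score tensor
src_ids = tf.constant([[5, 3, 9, 0, 0, 0],
                       [7, 2, 4, 8, 1, 0]])            # 0 = padding id

mask = tf.cast(tf.not_equal(src_ids, 0), tf.float32)   # 1 for real tokens, 0 for pads
mask = tf.expand_dims(mask, -1)                        # [batch, src_len, 1], like `score`

# Push padded positions to a huge negative value so the softmax gives them ~0 weight.
masked_score = score + (1.0 - mask) * -1e9
attention_weights = tf.nn.softmax(masked_score, axis=1)
```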

Seq2Seq Models for Chatbots

I am building a chatbot with a sequence-to-sequence encoder-decoder model, as in NMT. From the data given, I can understand that during training they feed the decoder outputs into the decoder inputs along with the encoder cell states. What I cannot figure out is what I should feed into the decoder when actually deploying the chatbot in real time, since at that point the output is exactly what I have to predict. Can someone help me out with this, please?
The exact answer depends on which building blocks you take from Neural Machine Translation model (NMT) and which ones you would replace with your own. I assume the graph structure exactly as in NMT.
If so, at inference time, you can feed just a vector of zeros to the decoder.
Internal details: NMT uses the entity called Helper to determine the next input in the decoder (see tf.contrib.seq2seq.Helper documentation).
In particular, tf.contrib.seq2seq.BasicDecoder relies solely on the helper when it performs a step: the next_inputs that are fed into the subsequent cell are exactly the return value of Helper.next_inputs().
There are different implementations of Helper interface, e.g.,
tf.contrib.seq2seq.TrainingHelper returns the next decoder input (which is usually the ground truth). This helper is used in training, as indicated in the tutorial.
tf.contrib.seq2seq.GreedyEmbeddingHelper discards the inputs, and returns the argmax sampled token from the previous output. NMT uses this helper in inference when sampling_temperature hyper-parameter is 0.
tf.contrib.seq2seq.SampleEmbeddingHelper does the same, but samples the token according to categorical (a.k.a. generalized Bernoulli) distribution. NMT uses this helper in inference when sampling_temperature > 0.
...
The code is in BaseModel._build_decoder method.
Note that both GreedyEmbeddingHelper and SampleEmbeddingHelper don't care what the decoder input is. So in fact you can feed anything, but the zero tensor is the standard choice.
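For concreteness, here is a minimal TF 1.x sketch (tf.contrib only exists in TF 1.x; all sizes and the encoder state are toy stand-ins) showing that at inference time the helper decides what the decoder reads at each step, so you don't need to feed real decoder inputs:

```python
import tensorflow as tf  # TF 1.x

vocab_size, emb_dim, hidden_dim, batch_size = 100, 16, 32, 4
start_token_id, end_token_id = 1, 2  # hypothetical <s> and </s> ids

embedding_matrix = tf.get_variable("embedding", [vocab_size, emb_dim])
decoder_cell = tf.nn.rnn_cell.LSTMCell(hidden_dim)
# Stand-in for the encoder's final state; in NMT this comes from the encoder.
encoder_final_state = decoder_cell.zero_state(batch_size, tf.float32)

# GreedyEmbeddingHelper embeds the argmax of the previous output and feeds it
# back in, which is why any "decoder inputs" you provide are simply ignored.
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=embedding_matrix,
    start_tokens=tf.fill([batch_size], start_token_id),
    end_token=end_token_id)

decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=decoder_cell,
    helper=helper,
    initial_state=encoder_final_state,
    output_layer=tf.layers.Dense(vocab_size))

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=20)
predicted_ids = outputs.sample_id  # [batch, time] greedy predictions
```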

what is the difference between tf.nn.dynamic_rnn and tf.nn.raw_rnn in tensorflow?

I went through this tutorial. In the last block it says that the dynamic_rnn function cannot be applied to calculate attention. But what I don't understand is that all we need is the hidden state of the decoder in order to find the attention, which will be worked out with the encoder symbols.
The attention mechanism in the context of encoder-decoder models means that the decoder at each time step "attends" to the "useful" parts of the encoder. This is implemented as, for example, taking a weighted average of the encoder's outputs and feeding that value (called the context) into the decoder at a given time step.
dynamic_rnn computes the outputs of the LSTM cells across all time steps and gives you the final value. So there is no way to tell the model that the cell state at time step t should depend not only on the output of the previous cell and the input, but also on additional information such as the context. You can control the computation at each time step of the encoder or decoder LSTM using raw_rnn.
If I understand correctly, in this tutorial the author feeds the ground-truth input to the decoder at each time step. However, this is not the usual way it is done. Usually, you want to feed the output of the decoder at time t as the input to the decoder at time t+1. In short, the input to the decoder at each time step is variable, whereas with dynamic_rnn it is predefined.
Refer to https://arxiv.org/abs/1409.0473 for more technical details.
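This is not raw_rnn itself, but a sketch of the same idea in TF 2 eager style: an explicit loop over decoder steps where each step's input is augmented with an attention context, which is exactly the kind of per-step control that dynamic_rnn does not give you. All tensors and shapes are toy stand-ins.

```python
import tensorflow as tf

batch, src_len, hidden = 2, 7, 16
encoder_outputs = tf.random.normal([batch, src_len, hidden])  # stand-in encoder states
cell = tf.keras.layers.LSTMCell(hidden)
state = [tf.zeros([batch, hidden]), tf.zeros([batch, hidden])]  # initial [h, c]

dec_input = tf.zeros([batch, hidden])  # e.g. the embedding of a start token
outputs = []
for t in range(5):  # a few decoder steps
    # Dot-product attention: score the previous decoder output against every
    # encoder output, then take the weighted average as the context.
    query = tf.expand_dims(dec_input, 1)                           # [batch, 1, hidden]
    scores = tf.matmul(query, encoder_outputs, transpose_b=True)   # [batch, 1, src_len]
    weights = tf.nn.softmax(scores, axis=-1)
    context = tf.squeeze(tf.matmul(weights, encoder_outputs), 1)   # [batch, hidden]

    # The per-step input depends on the context -- exactly the kind of
    # step-by-step control that raw_rnn (or a manual loop) gives you.
    step_input = tf.concat([dec_input, context], axis=-1)
    dec_output, state = cell(step_input, state)
    outputs.append(dec_output)
    dec_input = dec_output  # feed the output back in at the next step
```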