Why does the BERT transformer use the [CLS] token for classification instead of an average over all tokens? - tensorflow

I am doing experiments on the BERT architecture and found out that most fine-tuning tasks take the final hidden layer as the text representation, and later they pass it to other models for the further downstream task.
BERT's last layer looks like this, where we take the [CLS] token of each sentence:
[figure omitted; see image source in the original post]
I went through many discussions of this (a Hugging Face issue, a Data Science forum question, a GitHub issue). Most data scientists give this explanation:
BERT is bidirectional, the [CLS] is encoded including all
representative information of all tokens through the multi-layer
encoding procedure. The representation of [CLS] is individual in
different sentences.
My question is: why did the authors ignore the other information (each token's vector) rather than take the average, a max pool, or some other method to make use of all of it, and instead use only the [CLS] token for classification?
How does this [CLS] token compare to an average of all the token vectors?

The use of the [CLS] token to represent the entire sentence comes from the original BERT paper, section 3:
The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Your intuition is correct that averaging the vectors of all the tokens may produce superior results. In fact, that is exactly what is mentioned in the Huggingface documentation for BertModel:
Returns
pooler_output (torch.FloatTensor: of shape (batch_size, hidden_size)):
Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training.
This output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
Update: Huggingface removed that statement ("This output is usually not a good summary of the semantic content ...") in v3.1.0. You'll have to ask them why.
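For concreteness, here is a minimal sketch (assuming the transformers package, a recent version that returns model-output objects, and the bert-base-uncased checkpoint) of how to get both the pooled [CLS] output and the raw [CLS] hidden state the quoted documentation talks about:
```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT uses a special [CLS] token.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Raw final hidden state of the [CLS] token (always position 0).
cls_hidden = outputs.last_hidden_state[:, 0, :]   # (batch_size, hidden_size)

# The same [CLS] state after the extra Linear + Tanh "pooler" head.
pooled = outputs.pooler_output                    # (batch_size, hidden_size)
```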

BERT is designed primarily for transfer learning, i.e., fine-tuning on task-specific datasets. If you average the states, every state is averaged with the same weight, including stop words or other tokens that are not relevant for the task. The [CLS] vector gets computed using self-attention (like everything in BERT), so it can collect only the relevant information from the rest of the hidden states. So, in some sense the [CLS] vector is also an average over token vectors, only computed more cleverly, specifically for the tasks that you fine-tune on.
Also, my experience is that when I keep the weights fixed and do not fine-tune BERT, using the token average yields better results.
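For comparison, a masked average of the final hidden states with the BERT weights kept frozen (a sketch of the "token average" mentioned above, not code from the paper) could look like this:
```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

sentences = ["a short example", "a slightly longer example sentence"]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():                                   # weights stay fixed
    hidden = model(**batch).last_hidden_state           # (batch, seq_len, hidden)

# Average only over real tokens; padding positions are masked out.
mask = batch["attention_mask"].unsqueeze(-1).float()    # (batch, seq_len, 1)
sentence_vectors = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```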

Related

Size of input and output layers in Keras implementation of an RNN Language Model

As part of my thesis, I am trying to build a recurrent neural network language model.
From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words in our vocabulary, followed by an embedding layer, which, in Keras, apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should be the size of our vocabulary, so that each output value maps 1-1 to a vocabulary word.
However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason gives an explanation that this is due to the implementation of the Embedding layer in Keras, but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities, and I have one probability too many that I do not know which word to map to.
Does anyone know the correct way of achieving the desired result? Did Jason just forget to subtract one from the output layer, while the Embedding layer simply needs a +1 for implementation reasons (it is stated in the official API, after all)?
Any help on the subject would be appreciated (why is the Keras API documentation so laconic?).
Edit:
This post, Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?, made me think that Jason does in fact have it wrong and that the size of the vocabulary should not be incremented by one when our word indices are 0, 1, ..., n-1.
However, when using Keras's Tokenizer our word indices are 1, 2, ..., n. In this case, which is the correct approach?
Set mask_zero=True to treat 0 differently (as there is never a 0 integer index input to the Embedding layer) and keep the vocabulary size equal to the number of vocabulary words (n)?
Set mask_zero=True but augment the vocabulary size by one?
Not set mask_zero=True and keep the vocabulary size the same as the number of vocabulary words?
The reason we add +1 is that during testing, or in production, there is a chance of encountering an unseen word (one that is out of our vocabulary). It is common to map all such words to a generic UNKNOWN term, and that is why we reserve an extra OOV entry that represents all out-of-vocabulary words.
Check this issue on github which explains it in detail:
https://github.com/keras-team/keras/issues/3110#issuecomment-345153450
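To make the indexing concrete, here is a small sketch (layer sizes are made up) that combines Keras's Tokenizer with an explicit OOV token and the +1 on input_dim; index 0 is reserved by Keras for padding/masking, so input_dim has to cover indices 0..n:
```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.models import Sequential

texts = ["the cat sat on the mat", "the dog ran away"]

# Word indices start at 1; oov_token gets index 1, so unseen words map there.
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(texts)
vocab_size = len(tokenizer.word_index)   # indices 1..n (OOV token included)

model = Sequential([
    # +1 because index 0 (padding) must also fit into the embedding table.
    Embedding(input_dim=vocab_size + 1, output_dim=32, mask_zero=True),
    LSTM(64),
    # The output also has vocab_size + 1 slots so a predicted index i maps
    # directly back to tokenizer.index_word[i]; slot 0 is never a real word.
    Dense(vocab_size + 1, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```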

What's the attention model used in tfjs-examples/date-conversion-attention?

I've been looking at tfjs examples and trying to learn about seq2seq models. During the process, I've stumbled upon the date-conversion-attention example.
It's a great example, but what kind of attention mechanism is being used in it? There is no info in the README file. Can somebody point me to the paper that describes the attention being used here?
Link to attention part:
https://github.com/tensorflow/tfjs-examples/blob/908ee32750ba750a14d15caeb53115e2d3dda2b3/date-conversion-attention/model.js#L102-L119
I believe I found the answer.
The attention model used in the date-conversion-attention uses the dot product alignment score and it's described in Effective Approaches to Attention-based Neural Machine Translation. Link: https://arxiv.org/pdf/1508.04025.pdf
I have twisted my head around this sample for some hours now, and this is what I have concluded so far:
The encoder looks at the full input, one character embedding per LSTM step. The decoder expects a time-shifted copy of the output as its input, starting with a special character. The output (target strings) is provided as-is to the decoder during training. During evaluation, one character is predicted at a time, and the prediction is passed back into the decoder for the next character.
The decoder does not see the input, but it receives the encoder's final-step output as its initial state. This state initialisation tells the decoder how to produce its outputs, something like an encoded description of the date format to work on (I assume).
The LSTM outputs, one for each step (= character of input or output), from the encoder and decoder are then dot-producted and normalised with softmax. This dot product is the attention matrix, basically a highlight of the activations from the encoder and the decoder. For the attention heatmap to light up for a given next character, the decoder must have output something that "matches" the encoder's outputs. The attention matrix is not learned weights or biases; it's just a product of the encoder's and decoder's outputs.
Finally, this attention matrix is dot-producted with the encoder's outputs and concatenated with the decoder output, to allow the final dense layers to decode the attention mappings and "read" the right values from the encoder output.
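In Python/TensorFlow terms, the steps described above amount to Luong-style dot-product attention; here is a rough sketch (tensor names and shapes are assumptions, not the tfjs code itself):
```python
import tensorflow as tf

def luong_dot_attention(encoder_out, decoder_out):
    """encoder_out: (batch, enc_steps, units); decoder_out: (batch, dec_steps, units)."""
    # Dot-product alignment scores between every decoder step and every encoder step.
    scores = tf.matmul(decoder_out, encoder_out, transpose_b=True)  # (batch, dec, enc)
    attention = tf.nn.softmax(scores, axis=-1)                      # the attention matrix
    # Weighted sum of encoder outputs (the context) ...
    context = tf.matmul(attention, encoder_out)                     # (batch, dec, units)
    # ... concatenated with the decoder output for the final dense layers.
    return tf.concat([context, decoder_out], axis=-1), attention
```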
In the prediction process, only the last character is read from the prediction. Possibly because the previous predictions might be unstable?
I read the excellent book Deep Learning with JavaScript: Neural Networks in TensorFlow.js. The book explains the examples one by one and adds lots of extra documentation. But I don't think it explains the general architecture of this sample very well, only the details.

How to use Transformers for text classification?

I have two questions about how to use the TensorFlow implementation of the Transformer for text classification.
First, it seems people mostly use only the encoder layer to do the text classification task. However, the encoder layer generates one prediction for each input word. Based on my understanding of Transformers, the input to the encoder each time is one word from the input sentence. Then the attention weights and the output are calculated using the current input word, and we can repeat this process for all of the words in the input sentence. As a result, we end up with a pair of (attention weights, output) for each word in the input sentence. Is that correct? Then how would you use these pairs to perform a text classification?
Second, based on the TensorFlow implementation of the Transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. During pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In the downstream tasks, it is also used for sentence classification. However, my experience is that sometimes averaging the hidden states gives better results.
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
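For instance, a sketch with the TensorFlow 2.0 classes from the transformers package (the model name, toy data, and hyper-parameters here are placeholders):
```python
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a terrible movie"]   # toy data for illustration
labels = [1, 0]

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="tf")
dataset = tf.data.Dataset.from_tensor_slices((dict(enc), labels)).batch(2)

model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
model.fit(dataset, epochs=1)
```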
Transformers are designed to take the whole input sentence at once. The main motive for designing the Transformer was to enable parallel processing of the words in a sentence; this parallel processing is not possible in LSTMs, RNNs, or GRUs, because they take the words of the input sentence one by one.
So in the encoder part of the Transformer, the very first layer contains a number of units equal to the number of words in the sentence, and each unit converts its word into the corresponding embedding vector. The rest of the processing is then carried out on top of that. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this Transformer for text classification: since in text classification our output is a single number, not a sequence of numbers or vectors, we can remove the decoder part and just use the encoder part. The output of the encoder is a set of vectors, one per word in the input sentence. We can then feed this set of output vectors into a CNN, or add an LSTM or RNN on top, and perform classification.
The input is the whole sentence or a batch of sentences, not word by word. You have probably misunderstood that part.
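As a concrete sketch of the encoder-only route (a single simplified self-attention block stands in for the full encoder stack, and all sizes are illustrative): pool the per-token encoder outputs into one sentence vector and attach a classification head.
```python
import tensorflow as tf
from tensorflow.keras import layers

def build_classifier(vocab_size=10000, seq_len=64, d_model=128, num_classes=2):
    token_ids = layers.Input(shape=(seq_len,), dtype="int32")
    x = layers.Embedding(vocab_size, d_model)(token_ids)
    # Stand-in for the encoder stack: one self-attention + feed-forward block.
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=d_model)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(d_model, activation="relu")(x)
    x = layers.LayerNormalization()(x + ff)
    # One output vector per token -> average them into a single sentence vector.
    pooled = layers.GlobalAveragePooling1D()(x)
    return tf.keras.Model(token_ids, layers.Dense(num_classes)(pooled))

model = build_classifier()
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
```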

Seq2Seq Models for Chatbots

I am building a chatbot with a sequence-to-sequence encoder-decoder model, as in NMT. From the data given, I can understand that during training they feed the decoder outputs into the decoder inputs along with the encoder cell states. What I cannot figure out is what I should input into the decoder when actually deploying the chatbot in real time, since at that point the output is exactly what I have to predict. Can someone help me out with this, please?
The exact answer depends on which building blocks you take from Neural Machine Translation model (NMT) and which ones you would replace with your own. I assume the graph structure exactly as in NMT.
If so, at inference time, you can feed just a vector of zeros to the decoder.
Internal details: NMT uses the entity called Helper to determine the next input in the decoder (see tf.contrib.seq2seq.Helper documentation).
In particular, tf.contrib.seq2seq.BasicDecoder relies solely on the helper when it performs a step: the next_inputs that are fed into the subsequent cell are exactly the return value of Helper.next_inputs().
There are different implementations of Helper interface, e.g.,
tf.contrib.seq2seq.TrainingHelper returns the next decoder input (which is usually the ground truth). This helper is used in training, as indicated in the tutorial.
tf.contrib.seq2seq.GreedyEmbeddingHelper discards the inputs and returns the embedding of the argmax of the previous output. NMT uses this helper at inference when the sampling_temperature hyper-parameter is 0.
tf.contrib.seq2seq.SampleEmbeddingHelper does the same, but samples the token according to a categorical (a.k.a. generalized Bernoulli) distribution. NMT uses this helper at inference when sampling_temperature > 0.
...
The code is in BaseModel._build_decoder method.
Note that both GreedyEmbeddingHelper and SampleEmbeddingHelper don't care what the decoder input is, so in fact you can feed anything; the zero tensor is just the standard choice.
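Here is a sketch of the inference-time wiring with the TF 1.x tf.contrib.seq2seq API (the sizes and token ids below are placeholders; in a real model, encoder_state would come from your encoder rather than a zero state):
```python
import tensorflow as tf  # TF 1.x, where tf.contrib is available

vocab_size, emb_dim, num_units, batch_size = 10000, 128, 256, 32
sos_id, eos_id = 1, 2

embedding_matrix = tf.get_variable("embedding", [vocab_size, emb_dim])
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Placeholder initial state; in practice this is the encoder's final state.
encoder_state = cell.zero_state(batch_size, tf.float32)

# Greedy decoding: the helper ignores fed inputs and embeds its own argmax picks.
helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding=embedding_matrix,
    start_tokens=tf.fill([batch_size], sos_id),
    end_token=eos_id)

decoder = tf.contrib.seq2seq.BasicDecoder(
    cell=cell,
    helper=helper,
    initial_state=encoder_state,
    output_layer=tf.layers.Dense(vocab_size))   # projects cell output to vocab logits

outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(decoder, maximum_iterations=50)
predicted_ids = outputs.sample_id               # greedily decoded token ids
```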

Seq2Seq for prediction of complex states

My problem:
I have a sequence of complex states and I want to predict the future states.
Input:
I have a sequence of states. Each sequence can be of variable length. Each state is a moment in time and is described by several attributes: [att1, att2, ...], where each attribute is a number within some interval ([0..5], [1..3651], ...).
The Seq2Seq example (and paper) assumes that each state (word) is taken from a dictionary, so each state has around 80,000 possibilities. But how would you represent each state when it is taken from a set of vectors, and the set is just every possible combination of the attributes?
Is there any method for working with more complex states in TensorFlow? Also, what is a good method for deciding the boundaries of your buckets when the relation between input length and output length is unclear?
May I suggest rephrasing and splitting your question into two parts? The first is really a general machine learning/LSTM question that is independent of TensorFlow: how to use an LSTM to predict when the sequence elements are general vectors. The second is how to represent this in TensorFlow. For the former, there is nothing really magical to do.
But a very quick answer: you've really just skipped the embedding-lookup part of seq2seq. You can feed dense tensors into a suitably modified version of it; your state is just a dense vector representation of the state, which is the same thing that comes out of an embedding lookup.
The vector representation tutorial discusses the preprocessing that turns, e.g., words into embeddings for use in later parts of the learning pipeline.
If you look at line 139 of seq2seq.py, you'll see that embedding_rnn_decoder takes in a 1D batch of things to decode (the dimension is the number of elements in the batch), but then uses the embedding lookup to turn it into a batch_size * cell.input_size tensor. You want to directly input a batch_size * cell.input_size tensor into the RNN, skipping the embedding step.
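A modern tf.keras sketch of the same idea (not the legacy seq2seq code referenced above): the dense state vectors go straight into the LSTM with no embedding lookup; the attribute count and sizes are made up for illustration.
```python
import numpy as np
import tensorflow as tf

num_attrs = 4    # e.g. [att1, att2, ...] per state, normalised to similar ranges
seq_len = 20     # pad/truncate variable-length sequences to this length

# A batch of state sequences: each timestep is already a dense attribute vector,
# so it is fed to the LSTM directly -- no embedding lookup involved.
states = np.random.rand(32, seq_len, num_attrs).astype("float32")
next_states = np.random.rand(32, num_attrs).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(seq_len, num_attrs)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_attrs),   # predict the attributes of the next state
])
model.compile(optimizer="adam", loss="mse")
model.fit(states, next_states, epochs=1, verbose=0)
```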