What does the embedding elements stand for in huggingFace bert model? - tensorflow

Prior to passing my tokens through encoder in BERT model, I would like to perform some processing on their embeddings. I extracted the embedding weight using:
from transformers import TFBertModel
# Load a pre-trained BERT model
model = TFBertModel.from_pretrained('bert-base-uncased')
# Get the embedding layer of the model
embedding_layer = model.get_layer('bert').get_input_embeddings()
# Extract the embedding weights
embedding_weights = embedding_layer.get_weights()
I found it contains 5 elements as shown in Figure.
enter image description here
In my understanding, the first three elements are the word embedding weights, token type embedding weights, and positional embedding weights. My question is what does the last two elements stand for?
I dive deep into the source code of bert model. But I cannot figure out the meaning of the last two elements.

In bert model, there is a post-processing of the embedding tensor that uses layer normalization followed by dropout ,
https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L362
I think that those two arrays are the gamma and beta of the normalization layer, https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization
They are learned parameters, and will span the axes of inputs specified in param "axis" which defaults to -1 (corresponding to 768 in embedding tensor).

Related

TensorFlow keras expected dimension not expected LSTM

I'm trying to complete the LSTM composing using TensorFlow from
https://www.datacamp.com/tutorial/using-tensorflow-to-compose-music
I've got so far as the LSTM model, but a dimension error for the inputs is given. I've used the code provided in the tutorial. A new trainingset is created, but not converted, as for the auto encoder models in the examples above.
This piece of code is not included in the preparation step for the LSTM model
# Convert to one-hot encoding and swap chord and sequence dimensions
trainChords = tf.keras.utils.to_categorical(trainChords).transpose(0,2,1)
# Convert data to numpy array of type float
trainChords = np.array(trainChords, np.float)
# Flatten sequence of chords into single dimension
trainChordsFlat = trainChords.reshape(nSamples, nChordsSequence)
What do these steps do? Are they also required for the LSTM model?

How to use embedding models in tensorflow hub with LSTM layer?

I'm learning tensorflow 2 working through the text classification with TF hub tutorial. It used an embedding module from TF hub. I was wondering if I could modify the model to include a LSTM layer. Here's what I've tried:
train_data, validation_data, test_data = tfds.load(
name="imdb_reviews",
split=('train[:60%]', 'train[60%:]', 'test'),
as_supervised=True)
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
dtype=tf.string, trainable=True)
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Embedding(10000, 50))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()
model.compile(optimizer='adam',
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
history = model.fit(train_data.shuffle(10000).batch(512),
epochs=10,
validation_data=validation_data.batch(512),
verbose=1)
results = model.evaluate(test_data.batch(512), verbose=2)
for name, value in zip(model.metrics_names, results):
print("%s: %.3f" % (name, value))
I don't know how to get the vocabulary size from the hub_layer. So I just put 10000 there. When run it, it throws this exception:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[480,1] = -6 is not in [0, 10000)
[[node sequential/embedding/embedding_lookup (defined at .../learning/tensorflow/text_classify.py:36) ]] [Op:__inference_train_function_36284]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/embedding/embedding_lookup:
sequential/embedding/embedding_lookup/34017 (defined at Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py:112)
Function call stack:
train_function
I stuck here. My questions are:
how should I use the embedding module from TF hub to feed an LSTM layer? it looks like embedding lookup has some issues with the setting.
how do I get the vocabulary size from the hub layer?
Thanks
Finally figured out the way to link pre-trained embeddings to LSTM or other layers. Just post the steps here in case anyone feels helpful.
Embedding layer has to be the first layer in the model. (hub_layer is the same as Embedding layer.) The not very intuitive part is that any text input to the hub layer will be converted to only one vector of shape [embedding_dim]. You need to do sentence splitting and tokenization to make sure whatever input to the model is a sequence in the form of array of arrays. e.g., "Let us prepare the data." should be converted to [["let"],["us"],["prepare"], ["the"], ["data"]]. You will also need to pad the sequences if you are using batch mode.
In addition, you will need to convert your target tokens to int if your training labels are strings. The input to the model is array of strings with shape [batch, seq_length], the hub embedding layer converts it to [batch, seq_length, embed_dim]. (If you add a LSTM or other RNN layer, the output from the layer is [batch, seq_length, rnn_units]. ) The output dense layer will output index of text instead of actual text. The index of text is stored in the downloaded tfhub directory as "tokens.txt". You can load the file and convert text to the corresponding index. Otherwise you cannot compute the loss.

How to get intermediate layers' output of pre-trained BERT model in HuggingFace Transformers library?

(I'm following this pytorch tutorial about BERT word embeddings, and in the tutorial the author is access the intermediate layers of the BERT model.)
What I want is to access the last, lets say, 4 last layers of a single input token of the BERT model in TensorFlow2 using HuggingFace's Transformers library. Because each layer outputs a vector of length 768, so the last 4 layers will have a shape of 4*768=3072 (for each token).
How can I implement this in TF/keras/TF2, to get the intermediate layers of pretrained model for an input token? (later I will try to get the tokens for each token in a sentence, but for now one token is enough).
I'm using the HuggingFace's BERT model:
!pip install transformers
from transformers import (TFBertModel, BertTokenizer)
bert_model = TFBertModel.from_pretrained("bert-base-uncased") # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence_marked = "hello"
tokenized_text = bert_tokenizer.tokenize(sentence_marked)
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
print (indexed_tokens)
>> prints [7592]
The output is a token ([7592]), which should be the input of the for the BERT model.
The third element of the BERT model's output is a tuple which consists of output of embedding layer as well as the intermediate layers hidden states. From documentation:
hidden_states (tuple(tf.Tensor), optional, returned when config.output_hidden_states=True):
tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For the bert-base-uncased model, the config.output_hidden_states is by default True. Therefore, to access hidden states of the 12 intermediate layers, you can do the following:
outputs = bert_model(input_ids, attention_mask)
hidden_states = outputs[2][1:]
There are 12 elements in hidden_states tuple corresponding to all the layers from beginning to the last, and each of them is an array of shape (batch_size, sequence_length, hidden_size). So, for example, to access the hidden state of third layer for the fifth token of all the samples in the batch, you can do: hidden_states[2][:,4].
Note that if the model you are loading does not return the hidden states by default, then you can load the config using BertConfig class and pass output_hidden_state=True argument, like this:
config = BertConfig.from_pretrained("name_or_path_of_model",
output_hidden_states=True)
bert_model = TFBertModel.from_pretrained("name_or_path_of_model",
config=config)

Weights update in Tensorflow embedding layer with pretrained fasttext weights

I'm not sure if my understanding is correct but...
While training a seq2seq model, one of the purpose I want to initiated a set of pre-trained fasttext weights in the embedding layers is to decrease the unknown words in the test environment (these unknown words are not in training set). Since pre-trained fasttext model has larger vocabulary, during test environment, the unknown word can be represented by fasttext out-of-vocabulary word vectors, which supposed to have similar direction of the semantic similar words in the training set.
However, due to the fact that the initial fasttext weights in the embedding layers will be updated through the training process (updating weights generates better results). I am wondering if the updated embedding weights would distort the relationship of semantic similarity between words and undermine the representation of fasttext out-of-vocabulary word vectors? (and, between those updated embedding weights and word vectors in the initial embedding layers but their corresponding ID didn't appear in the training data)
If the input ID can be distributed represented vectors extracted from pre-trained model and, then, map these pre-trained word vectors (fixed weights while training) via a lookup table to the embedding layers (these weights will be updated while training), would it be a better solution?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vector and fine-tuning them in your final model, the words that are infrequent or hasn't appear in your training set won't get any updates.
Now, usually one can test how much of the issue for your particular case this is. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what's the difference in model performance on validation set.
If you see a big difference in performance on validation set when you are not fine-tuning, here is a few ways to handle this:
a) Add a linear transformation layer after not-trainable embeddings. Fine-tuning embeddings in many cases does affine transformations to the space, so one can capture this in a separate layer that can be applied at test time.
E.g. A is pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_vairable("b", [embed_size])
embeds = tf.mul(embeds, X) + b
b) Keep pre-trained embeddings in the not-trainable embedding matrix A. Add trainable embedding matrix B, that has a smaller vocab of popular words in your training set and embedding size. Lookup words both in A and B (and if word is out of vocab use ID=0 for example), concat results and use it input to your model. This way you will teach your model to use mostly A and sometimes rely on B for popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
B = tf.get_variable("B", [smaller_vocab_size, embed_size])
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens, tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)

Implementing a many-to-many LSTM in TensorFlow?

I am using TensorFlow to make predictions on time-series data. So it is like I have 50 tags and I want to find out the next possible 5 tags.
As shown in the following picture, I want to make it like the 4th structure.
I went through the tutorial demo: Recurrent Neural Networks
But I found it can provide like the 5th one in the above picture, which is different.
I am wondering which model could I use? I am thinking of the seq2seq models, but not sure if it is the right way.
You are right that you can use a seq2seq model. For brevity I've written up an example of how you can do it in Keras which also has a Tensorflow backend. I've not run the example so it might need tweaking. If your tags are one-hot you need to use cross-entropy loss instead.
from keras.models import Model
from keras.layers import Input, LSTM, RepeatVector
# The input shape is your sequence length and your token embedding size
inputs = Input(shape=(seq_len, embedding_size))
# Build a RNN encoder
encoder = LSTM(128, return_sequences=False)(inputs)
# Repeat the encoding for every input to the decoder
encoding_repeat = RepeatVector(5)(encoder)
# Pass your (5, 128) encoding to the decoder
decoder = LSTM(128, return_sequences=True)(encoding_repeat)
# Output each timestep into a fully connected layer
sequence_prediction = TimeDistributed(Dense(1, activation='linear'))(decoder)
model = Model(inputs, sequence_prediction)
model.compile('adam', 'mse') # Or categorical_crossentropy
model.fit(X_train, y_train)