How to get intermediate layers' output of pre-trained BERT model in HuggingFace Transformers library? - tensorflow

(I'm following this PyTorch tutorial about BERT word embeddings, and in the tutorial the author accesses the intermediate layers of the BERT model.)
What I want is to access the last, let's say, 4 layers of a single input token of the BERT model in TensorFlow 2 using HuggingFace's Transformers library. Since each layer outputs a vector of length 768, the last 4 layers concatenated will have a shape of 4*768=3072 (for each token).
How can I implement this in TF/Keras/TF2 to get the intermediate layers of the pretrained model for an input token? (Later I will try to get the vectors for each token in a sentence, but for now one token is enough.)
I'm using the HuggingFace's BERT model:
!pip install transformers
from transformers import (TFBertModel, BertTokenizer)
bert_model = TFBertModel.from_pretrained("bert-base-uncased") # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence_marked = "hello"
tokenized_text = bert_tokenizer.tokenize(sentence_marked)
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
print (indexed_tokens)
>> prints [7592]
The output is a token ID ([7592]), which should be the input to the BERT model.

The third element of the BERT model's output is a tuple that consists of the output of the embedding layer as well as the hidden states of the intermediate layers. From the documentation:
hidden_states (tuple(tf.Tensor), optional, returned when config.output_hidden_states=True):
tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For the bert-base-uncased model, config.output_hidden_states is True by default. Therefore, to access the hidden states of the 12 intermediate layers, you can do the following:
outputs = bert_model(input_ids, attention_mask)
hidden_states = outputs[2][1:]
There are 12 elements in the hidden_states tuple, corresponding to all the layers from the first to the last, and each of them is an array of shape (batch_size, sequence_length, hidden_size). So, for example, to access the hidden state of the third layer for the fifth token of all the samples in the batch, you can do: hidden_states[2][:,4].
Note that if the model you are loading does not return the hidden states by default, you can load the config using the BertConfig class and pass the output_hidden_states=True argument, like this:
config = BertConfig.from_pretrained("name_or_path_of_model",
                                    output_hidden_states=True)
bert_model = TFBertModel.from_pretrained("name_or_path_of_model",
                                         config=config)
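Putting the pieces together, here is a minimal end-to-end sketch (assuming bert-base-uncased and that the hidden states are returned as described above) that builds the 4*768=3072-dimensional vector for the single token from the question by concatenating the last four layers:
import tensorflow as tf
from transformers import TFBertModel, BertTokenizer

bert_model = TFBertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize "hello" and build a batch of size 1.
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(bert_tokenizer.tokenize("hello"))
input_ids = tf.constant([indexed_tokens])     # shape (1, 1)
attention_mask = tf.ones_like(input_ids)      # shape (1, 1)

outputs = bert_model(input_ids, attention_mask=attention_mask)
hidden_states = outputs[2][1:]                # 12 layer outputs, each (1, 1, 768)

# Concatenate the last four layers for the first (and only) token.
last_four = tf.concat([layer[:, 0] for layer in hidden_states[-4:]], axis=-1)
print(last_four.shape)                        # (1, 3072)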

Related

What do the embedding elements stand for in the HuggingFace BERT model?

Prior to passing my tokens through the encoder in the BERT model, I would like to perform some processing on their embeddings. I extracted the embedding weights using:
from transformers import TFBertModel
# Load a pre-trained BERT model
model = TFBertModel.from_pretrained('bert-base-uncased')
# Get the embedding layer of the model
embedding_layer = model.get_layer('bert').get_input_embeddings()
# Extract the embedding weights
embedding_weights = embedding_layer.get_weights()
I found that it contains 5 elements.
In my understanding, the first three elements are the word embedding weights, the token type embedding weights, and the positional embedding weights. My question is: what do the last two elements stand for?
I dove into the source code of the BERT model, but I cannot figure out the meaning of the last two elements.
In the BERT model, there is a post-processing step on the embedding tensor that applies layer normalization followed by dropout:
https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L362
I think that those two arrays are the gamma and beta of the normalization layer, https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization
They are learned parameters that span the axes of the inputs specified by the axis parameter, which defaults to -1 (corresponding to the hidden size of 768 in the embedding tensor).
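A quick way to check this (a sketch assuming bert-base-uncased; the exact weight names and order can vary between transformers versions) is to print the name and shape of each of the five arrays and look for the two (768,)-shaped LayerNorm parameters:
from transformers import TFBertModel

model = TFBertModel.from_pretrained('bert-base-uncased')
embedding_layer = model.get_layer('bert').get_input_embeddings()

# Pair each weight variable's name with its numpy value and print the shapes.
for variable, value in zip(embedding_layer.weights, embedding_layer.get_weights()):
    print(variable.name, value.shape)

# Roughly expected output (names abbreviated):
#   word_embeddings/weight            (30522, 768)
#   token_type_embeddings/embeddings  (2, 768)
#   position_embeddings/embeddings    (512, 768)
#   LayerNorm/gamma                   (768,)
#   LayerNorm/beta                    (768,)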

TensorFlow keras expected dimension not expected LSTM

I'm trying to complete the LSTM music-composition example using TensorFlow from
https://www.datacamp.com/tutorial/using-tensorflow-to-compose-music
I've gotten as far as the LSTM model, but a dimension error is raised for the inputs. I've used the code provided in the tutorial. A new training set is created, but it is not converted the way it is for the autoencoder models in the earlier examples.
This piece of code is not included in the preparation step for the LSTM model:
# Convert to one-hot encoding and swap chord and sequence dimensions
trainChords = tf.keras.utils.to_categorical(trainChords).transpose(0, 2, 1)
# Convert data to a numpy array of type float (np.float is deprecated in recent NumPy; np.float32 works)
trainChords = np.array(trainChords, np.float32)
# Flatten sequence of chords into a single dimension
trainChordsFlat = trainChords.reshape(nSamples, nChordsSequence)
What do these steps do? Are they also required for the LSTM model?
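To see what each step does, here is a small sketch with made-up sizes (nSamples, the sequence length, and the chord vocabulary size are hypothetical) that prints the shape after every transformation:
import numpy as np
import tensorflow as tf

# Hypothetical sizes, just to illustrate how the shapes change.
nSamples, sequenceLength, nChords = 4, 6, 10
trainChords = np.random.randint(0, nChords, size=(nSamples, sequenceLength))

one_hot = tf.keras.utils.to_categorical(trainChords)  # (4, 6, 10): one one-hot vector per position
swapped = one_hot.transpose(0, 2, 1)                  # (4, 10, 6): chord dimension moved before sequence
flat = swapped.reshape(nSamples, -1)                  # (4, 60): flattened into one feature vector per sample
print(one_hot.shape, swapped.shape, flat.shape)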

How to use embedding models in tensorflow hub with LSTM layer?

I'm learning TensorFlow 2 by working through the text classification with TF Hub tutorial. It uses an embedding module from TF Hub. I was wondering if I could modify the model to include an LSTM layer. Here's what I've tried:
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_hub as hub

train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)

embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)

model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Embedding(10000, 50))
model.add(tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(1))
model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
    print("%s: %.3f" % (name, value))
I don't know how to get the vocabulary size from the hub_layer, so I just put 10000 there. When I run it, it throws this exception:
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[480,1] = -6 is not in [0, 10000)
[[node sequential/embedding/embedding_lookup (defined at .../learning/tensorflow/text_classify.py:36) ]] [Op:__inference_train_function_36284]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/embedding/embedding_lookup:
sequential/embedding/embedding_lookup/34017 (defined at Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/contextlib.py:112)
Function call stack:
train_function
I'm stuck here. My questions are:
How should I use the embedding module from TF Hub to feed an LSTM layer? It looks like the embedding lookup has some issues with this setup.
How do I get the vocabulary size from the hub layer?
Thanks
I finally figured out how to link pre-trained embeddings to an LSTM or other layers. I'm posting the steps here in case anyone finds it helpful.
The embedding layer has to be the first layer in the model (hub_layer plays the role of an Embedding layer). The not-very-intuitive part is that any text input to the hub layer is converted to only one vector of shape [embedding_dim]. You need to do sentence splitting and tokenization to make sure that whatever goes into the model is a sequence in the form of an array of arrays. For example, "Let us prepare the data." should be converted to [["let"], ["us"], ["prepare"], ["the"], ["data"]]. You will also need to pad the sequences if you are using batch mode.
In addition, you will need to convert your target tokens to ints if your training labels are strings. The input to the model is an array of strings with shape [batch, seq_length]; the hub embedding layer converts it to [batch, seq_length, embed_dim]. (If you add an LSTM or other RNN layer, the output from that layer is [batch, seq_length, rnn_units].) The output dense layer will output the index of a token instead of the actual text. The token-to-index mapping is stored in the downloaded TF Hub directory as "tokens.txt"; you can load that file and convert text to the corresponding indices. Otherwise you cannot compute the loss.
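As a rough illustration of the approach above (not the exact code from the answer): the hub module embeds each string it is given, so one way to get per-token embeddings is to flatten the [batch, seq_length] string tensor, embed it, and reshape back before the LSTM. The seq_length value and layer sizes below are made up; the hub URL and the 20-dim embedding size come from the question.
import tensorflow as tf
import tensorflow_hub as hub

class TokenEmbedding(tf.keras.layers.Layer):
    """Applies a sentence-level TF Hub embedding to every token in a sequence."""
    def __init__(self, hub_url, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.hub_layer = hub.KerasLayer(hub_url, dtype=tf.string, trainable=True)

    def call(self, tokens):
        # tokens: [batch, seq_length] strings
        shape = tf.shape(tokens)
        flat = tf.reshape(tokens, [-1])      # [batch * seq_length]
        embedded = self.hub_layer(flat)      # [batch * seq_length, embed_dim]
        return tf.reshape(embedded, [shape[0], shape[1], self.embed_dim])

seq_length = 5  # hypothetical padded sequence length
model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_length,), dtype=tf.string),
    TokenEmbedding("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1", 20),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1),
])
model.summary()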

Embedding feature vectors in Tensorflow

In text processing there are embeddings to show (if I understood it correctly) the words of the corpus as vectors (after dimensionality reduction).
Now I am wondering: is there any method like this to visualize the features extracted by a CNN?
For example, suppose we have a CNN and train and test sets. We want to train the CNN with the train set and, at the same time, see the extracted features (from a dense layer) together with the corresponding class labels in the embedding section of TensorBoard.
The purpose of this is to see the features of the input data in every batch and to understand how close or far they are from each other. Finally, with the trained model, we can find the accuracy of our classifier (softmax, etc.).
Thank you in advance for your help.
I have taken help from the TensorFlow documentation.
For in depth information on how to run TensorBoard and make sure you are logging all the necessary information, see TensorBoard: Visualizing Learning.
To visualize your embeddings, there are 3 things you need to do:
1) Set up a 2D tensor that holds your embedding(s).
embedding_var = tf.get_variable(....)
2) Periodically save your model variables in a checkpoint in LOG_DIR.
saver = tf.train.Saver()
saver.save(session, os.path.join(LOG_DIR, "model.ckpt"), step)
3) (Optional) Associate metadata with your embedding.
If you have any metadata (labels, images) associated with your embedding, you can tell TensorBoard about it either by directly storing a projector_config.pbtxt in the LOG_DIR, or use our python API.
For instance, the following projector_config.pbtxt associates the word_embedding tensor with metadata stored in $LOG_DIR/metadata.tsv:
embeddings {
  tensor_name: 'word_embedding'
  metadata_path: '$LOG_DIR/metadata.tsv'
}
The same config can be produced programmatically using the following code snippet:
import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

# Create randomly initialized embedding weights which will be trained.
vocabulary_size = 10000
embedding_size = 200
embedding_var = tf.get_variable('word_embedding',
                                [vocabulary_size, embedding_size])

# Format: tensorflow/tensorboard/plugins/projector/projector_config.proto
config = projector.ProjectorConfig()

# You can add multiple embeddings. Here we add only one.
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
# Link this tensor to its metadata file (e.g. labels).
embedding.metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')

# Use the same LOG_DIR where you stored your checkpoint.
summary_writer = tf.summary.FileWriter(LOG_DIR)

# The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard
# will read this file during startup.
projector.visualize_embeddings(summary_writer, config)
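Applied to the CNN case in the question, a minimal sketch might look like the following (TF1-style, matching the answer above; the dense-layer features, labels, and LOG_DIR here are placeholders, not part of the original answer):
import os
import numpy as np
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

LOG_DIR = '/tmp/cnn_embedding_logs'  # hypothetical log directory
os.makedirs(LOG_DIR, exist_ok=True)

# Placeholder data: dense-layer activations collected during training and their class labels.
dense_features = np.random.rand(1000, 128).astype(np.float32)
labels = np.random.randint(0, 10, size=1000)

# 1) Store the features in a 2-D variable.
embedding_var = tf.Variable(dense_features, name='dense_features')

# 3) Write the metadata file with one class label per row.
metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')
with open(metadata_path, 'w') as f:
    for label in labels:
        f.write('%d\n' % label)

# 2) Save a checkpoint and point the projector config at the variable.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver([embedding_var])
    saver.save(sess, os.path.join(LOG_DIR, 'model.ckpt'))

    config = projector.ProjectorConfig()
    embedding = config.embeddings.add()
    embedding.tensor_name = embedding_var.name
    embedding.metadata_path = metadata_path
    projector.visualize_embeddings(tf.summary.FileWriter(LOG_DIR), config)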

Implementing a many-to-many LSTM in TensorFlow?

I am using TensorFlow to make predictions on time-series data. So it is like I have 50 tags and I want to find out the next possible 5 tags.
As shown in the following picture, I want to build something like the 4th structure.
I went through the tutorial demo: Recurrent Neural Networks.
But I found it provides something like the 5th one in the picture, which is different.
I am wondering which model I could use. I am thinking of the seq2seq models, but I am not sure if that is the right way.
You are right that you can use a seq2seq model. For brevity, I've written up an example of how you can do it in Keras, which also has a TensorFlow backend. I've not run the example, so it might need tweaking. If your tags are one-hot, you need to use cross-entropy loss instead.
from keras.models import Model
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
# The input shape is your sequence length and your token embedding size
inputs = Input(shape=(seq_len, embedding_size))
# Build a RNN encoder
encoder = LSTM(128, return_sequences=False)(inputs)
# Repeat the encoding for every input to the decoder
encoding_repeat = RepeatVector(5)(encoder)
# Pass your (5, 128) encoding to the decoder
decoder = LSTM(128, return_sequences=True)(encoding_repeat)
# Output each timestep into a fully connected layer
sequence_prediction = TimeDistributed(Dense(1, activation='linear'))(decoder)
model = Model(inputs, sequence_prediction)
model.compile('adam', 'mse') # Or categorical_crossentropy
model.fit(X_train, y_train)
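As a usage note on the shapes implied by this sketch: X_train would be expected to have shape (num_samples, seq_len, embedding_size), and y_train shape (num_samples, 5, 1), since the decoder emits 5 timesteps with a single linear output each.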