How to get token index map from tf hub pre-trained embedding? - tensorflow2.0

I'm trying to use a tfhub pre-trained word embedding in a text generation project. The setup is that there is a corpus of English text. I want to convert each word to a dense vector (embedding), then feed the sequences to an LSTM model and learn to generate the next word given a sequence.
Initially I was trying to load the embedding as a KerasLayer.
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True, name='embedding')
However, the KerasLayer doesn't seem to accept sequences (i.e. 2D input) as input. It looks like I have to preprocess the text, tokenize it, convert each token to a vector myself, and then feed the vectors directly to an LSTM layer.
In this case, I will need the token-to-int mapping from the model. I located the tokens.txt file in the assets directory of the local cache:
./tf_cache/510580b203329a4a95dfdfefd838bdcd202f0d13/assets/tokens.txt
But I don't want to manually copy the file out and load it into memory. Is there an API in TensorFlow that I can call to get the token mapping instead of reading the file manually?

You should be able to manipulate the tensors so that you can pass them into the KerasLayer.
If you are using ragged tensors, tf.ragged.map_flat_values is your friend, e.g.:
sentences = ["sentence 1", "sentence number 2"]
words = tf.strings.split(sentences)
word_embeddings = tf.ragged.map_flat_values(hub_layer, words)
word_embeddings.to_tensor() # Convert to dense now to feed into next layers.
If you already have a dense tensor of shape [num_sentences, num_words], you could reshape it to [num_sentences * num_words], embed it (transforming it into [num_sentences * num_words, embedding_size]), and then reshape back into [num_sentences, num_words, embedding_size]. In this case tf.reshape is your friend.
Something like:
dense_features = tf.constant([["sentence", "with", "four", "words"], ["hello", "world", "", ""]])
# Reshape to 1-d tensor.
flatten_words = tf.reshape(dense_features, [-1])
# Embed each element as if it was a single batch of words.
flatten_word_embeddings = hub_layer(flatten_words)
# Reshape back to 3-d tensor.
num_sentences = tf.shape(dense_features)[0]
max_num_words = tf.shape(dense_features)[1]
embedded_features = tf.reshape(flatten_word_embeddings, [num_sentences, max_num_words, -1])
These examples differ in how they treat the non-existent words.
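If you do still need the token-to-index mapping itself, one possibility (not part of the answer above; it assumes your tensorflow_hub version exposes hub.resolve and that the module keeps its vocabulary in assets/tokens.txt, as found in the question) is to resolve the handle to its local cache directory and read the file from there:
import os
import tensorflow_hub as hub

# Resolve the handle to the local cache directory instead of hard-coding the path.
module_path = hub.resolve("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")
vocab_file = os.path.join(module_path, "assets", "tokens.txt")  # assumed asset name
with open(vocab_file) as f:
    tokens = [line.rstrip("\n") for line in f]
token_to_id = {token: index for index, token in enumerate(tokens)}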

Related

tfidf weighted average of word embeddings with Keras

I don't know how to do this, but I want to calculate a weighted average of the word embeddings in a sentence, using tfidf scores as the weights.
It is essentially this, but with weights added:
Averaging a sentence’s word vectors in Keras - Pre-trained Word Embedding
import keras
from keras.layers import Embedding
from keras.models import Sequential
import numpy as np
# Set parameters
vocab_size=1000
max_length=10
# Generate random embedding matrix for sake of illustration
embedding_matrix = np.random.rand(vocab_size,300)
model = Sequential()
model.add(Embedding(vocab_size, 300, weights=[embedding_matrix],
                    input_length=max_length, trainable=False))
# Average the output of the Embedding layer over the word dimension
model.add(keras.layers.Lambda(lambda x: keras.backend.mean(x, axis=1)))
model.summary()
How could you get, with a custom layer or a Lambda layer, the proper weight belonging to a specific word? You would somehow need access to the embedding layer to get the index and then look up the proper weight.
Or is there a simple way I don't see?
You can get the embedding matrix out of the trained model:
embeddings = model.layers[0].get_weights()[0]  # embedding matrix, shape (vocab_size, embedding_dim)
Alternatively, if you define the layer object:
embedding_layer = Embedding(vocab_size, 300, weights=[embedding_matrix],
                            input_length=max_length, trainable=False)
embeddings = embedding_layer.get_weights()[0]
From here, you can probably directly address the individual weights by just querying their positions using your unprocessed bag of words or integer inputs.
If you want to, you can additionally build a lookup from the string words to their vectors, though that shouldn't be necessary for simply accumulating the word vectors of each sentence:
# `word_to_index` is a mapping (i.e. dict) from words to their index that you need to provide (from your original input data which should be ints)
word_embeddings = {w:embeddings[idx] for w, idx in word_to_index.items()}
print(word_embeddings['chair']) # gives you the word vector
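For the tfidf part of the question, one way to avoid digging into the Embedding layer at all is to feed the per-word tfidf scores in as a second input and compute the weighted mean inside a Lambda layer. A minimal sketch under that assumption (the weight_input tensor and the weighted_average helper are illustrative, not part of the answer above):
import numpy as np
import keras.backend as K
from keras.layers import Input, Embedding, Lambda
from keras.models import Model

vocab_size = 1000
max_length = 10
embedding_matrix = np.random.rand(vocab_size, 300)

token_input = Input(shape=(max_length,), dtype='int32')     # word indices
weight_input = Input(shape=(max_length,), dtype='float32')  # per-word tfidf scores

embedded = Embedding(vocab_size, 300, weights=[embedding_matrix],
                     input_length=max_length, trainable=False)(token_input)

def weighted_average(inputs):
    vectors, weights = inputs
    weights = K.expand_dims(weights, axis=-1)        # (batch, max_length, 1)
    weighted_sum = K.sum(vectors * weights, axis=1)  # (batch, 300)
    return weighted_sum / (K.sum(weights, axis=1) + K.epsilon())

sentence_vector = Lambda(weighted_average)([embedded, weight_input])
model = Model(inputs=[token_input, weight_input], outputs=sentence_vector)
model.summary()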

How to get intermediate layers' output of pre-trained BERT model in HuggingFace Transformers library?

(I'm following this PyTorch tutorial about BERT word embeddings, in which the author accesses the intermediate layers of the BERT model.)
What I want is to access, let's say, the last 4 layers of the BERT model for a single input token, in TensorFlow 2 using HuggingFace's Transformers library. Since each layer outputs a vector of length 768, the last 4 layers give 4*768=3072 values for each token.
How can I implement this in TF/Keras/TF2 to get the intermediate layers of the pretrained model for an input token? (Later I will try to get the vectors for each token in a sentence, but for now one token is enough.)
I'm using HuggingFace's BERT model:
!pip install transformers
from transformers import (TFBertModel, BertTokenizer)
bert_model = TFBertModel.from_pretrained("bert-base-uncased") # Automatically loads the config
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence_marked = "hello"
tokenized_text = bert_tokenizer.tokenize(sentence_marked)
indexed_tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
print (indexed_tokens)
>> prints [7592]
The output is a token ID ([7592]), which should be the input for the BERT model.
The third element of the BERT model's output is a tuple which consists of the output of the embedding layer as well as the hidden states of the intermediate layers. From the documentation:
hidden_states (tuple(tf.Tensor), optional, returned when config.output_hidden_states=True):
tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
For the bert-base-uncased model, config.output_hidden_states is True by default. Therefore, to access the hidden states of the 12 intermediate layers, you can do the following:
outputs = bert_model(input_ids, attention_mask)
hidden_states = outputs[2][1:]
There are 12 elements in the hidden_states tuple, corresponding to all the layers from the first to the last, and each of them is an array of shape (batch_size, sequence_length, hidden_size). So, for example, to access the hidden state of the third layer for the fifth token of all the samples in the batch, you can do: hidden_states[2][:,4].
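To get the 4*768=3072-dimensional vector per token that the question asks about, a short sketch building on the hidden_states tuple above (variable names are illustrative):
import tensorflow as tf

# hidden_states[-4:] are the last four layers, each of shape (batch_size, sequence_length, 768)
last_four_layers = hidden_states[-4:]
token_vectors = tf.concat(last_four_layers, axis=-1)  # (batch_size, sequence_length, 3072)
single_token_vector = token_vectors[0, 0]             # vector for the first token of the first sample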
Note that if the model you are loading does not return the hidden states by default, then you can load the config using the BertConfig class and pass the output_hidden_states=True argument, like this:
config = BertConfig.from_pretrained("name_or_path_of_model",
                                    output_hidden_states=True)
bert_model = TFBertModel.from_pretrained("name_or_path_of_model",
                                         config=config)

Error when attempting to change tensor shape in keras model

I want to change the shape and the content of a tensor in a Keras model. The tensor is the output of a layer and has
shape1=(batch_size, max_sentences_in_doc, max_tokens_in_doc, embedding_size)
and I want to convert it to
shape2=(batch_size, max_document_length, embedding_size)
which is suitable as input for the next layer. Here sentences are made of tokens and are zero-padded so that every sentence has length max_tokens_in_doc.
In detail:
I want to concatenate all the sentences of a batch, taking only the nonzero part of each sentence;
then I zero-pad this concatenation to length max_document_length.
So passing from shape1 to shape2 is not only a reshape, as mathematical operations are involved.
I created the function embedding_to_docs(x) that iterates over a tensor of shape1 to transform it into shape2. I call the function using a Lambda layer in the model. It works in debug mode with fictitious data, but when I try to call it while building the model an error is raised:
Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
def embedding_to_docs(x):
    new_output = []
    for doc in x:
        document = []
        for sentence in doc:
            non_zero_indexes = np.nonzero(sentence[:, 0])
            max_index = max(non_zero_indexes[0])
            if max_index > 0:
                document.extend(sentence[0:max_index])
        if MAX_DOCUMENT_LENGTH - len(document) > 0:
            a = np.zeros((MAX_DOCUMENT_LENGTH - len(document), 1024))
            document.extend(a)
        else:
            document = document[0:MAX_DOCUMENT_LENGTH]
        new_output.append(document)
    return np.asarray(new_output)
...
# in the model:
tensor_of_shape2 = Lambda(embedding_to_docs)(tensor_of_shape1)
How to fix this?
You can use py_function, which allows you to switch from graph mode (used by Keras) to eager mode (where it is possible to iterate over tensors as in your function).
def to_docs(x):
    return tf.py_function(embedding_to_docs, [x], tf.float32)

tensor_of_shape2 = Lambda(to_docs)(tensor_of_shape1)
Note that the code run within your embedding_to_docs must be written for TensorFlow eager mode instead of numpy. This means that you'd need to replace some of the numpy calls with TensorFlow ops. You'd surely need to replace the return line with:
return tf.convert_to_tensor(new_output)
Using numpy arrays will stop the gradient computation, but you are likely not interested in gradient flowing through the input data anyway.
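As a rough illustration of what that rewrite could look like (this version drops padding rows with TF ops rather than reproducing the numpy index logic exactly, so treat it as a sketch rather than a drop-in replacement):
import tensorflow as tf

def embedding_to_docs_eager(x):
    new_output = []
    for doc in x:  # iterating over tensors works inside tf.py_function (eager mode)
        parts = []
        for sentence in doc:
            keep = tf.not_equal(sentence[:, 0], 0)         # non-padding rows
            parts.append(tf.boolean_mask(sentence, keep))
        document = tf.concat(parts, axis=0)[:MAX_DOCUMENT_LENGTH]
        pad_rows = MAX_DOCUMENT_LENGTH - int(tf.shape(document)[0])
        document = tf.pad(document, [[0, pad_rows], [0, 0]])  # zero-pad back to fixed length
        new_output.append(document)
    return tf.stack(new_output)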

Keras image captioning model not compiling because of concatenate layer when mask_zero=True in a previous layer

I am new to Keras and I am trying to implement a model for an image captioning project.
I am trying to reproduce the model from the image captioning pre-inject architecture (the picture is taken from this paper: Where to put the image in an image captioning generator), but with a minor difference: generating a word at each time step instead of only generating a single word at the end. In this architecture, the inputs for the LSTM at the first time step are the embedded CNN features. The LSTM should support variable input length, and in order to do this I padded all the sequences with zeros so that all of them have maxlen time steps.
The code for the model I have right now is the following:
def get_model(model_name, batch_size, maxlen, voc_size, embed_size,
              cnn_feats_size, dropout_rate):
    # create input layer for the cnn features
    cnn_feats_input = Input(shape=(cnn_feats_size,))
    # normalize CNN features
    normalized_cnn_feats = BatchNormalization(axis=-1)(cnn_feats_input)
    # embed CNN features to have same dimension with word embeddings
    embedded_cnn_feats = Dense(embed_size)(normalized_cnn_feats)
    # add time dimension so that this layer output shape is (None, 1, embed_size)
    final_cnn_feats = RepeatVector(1)(embedded_cnn_feats)
    # create input layer for the captions (each caption has max maxlen words)
    caption_input = Input(shape=(maxlen,))
    # embed the captions
    embedded_caption = Embedding(input_dim=voc_size,
                                 output_dim=embed_size,
                                 input_length=maxlen)(caption_input)
    # concatenate CNN features and the captions.
    # Output shape should be (None, maxlen + 1, embed_size)
    img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)
    # now feed the concatenation into an LSTM layer (many-to-many)
    lstm_layer = LSTM(units=embed_size,
                      input_shape=(maxlen + 1, embed_size),  # one additional time step for the image features
                      return_sequences=True,
                      dropout=dropout_rate)(img_caption_concat)
    # create a fully connected layer to make the predictions
    pred_layer = TimeDistributed(Dense(units=voc_size))(lstm_layer)
    # build the model with CNN features and captions as input and
    # predictions as output
    model = Model(inputs=[cnn_feats_input, caption_input],
                  outputs=pred_layer)
    optimizer = Adam(lr=0.0001,
                     beta_1=0.9,
                     beta_2=0.999,
                     epsilon=1e-8)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    model.summary()
    return model
The model (as it is above) compiles without any errors (see: model summary) and I managed to train it with my data. However, it doesn't take into account the fact that my sequences are zero-padded, and the results won't be accurate because of this. When I try to change the Embedding layer so that it supports masking (also making sure that I use voc_size + 1 instead of voc_size, as mentioned in the documentation), like this:
embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen,
                             mask_zero=True)(caption_input)
I get the following error:
Traceback (most recent call last):
File "/export/home/.../py3_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1567, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 200 and 1. Shapes are [200] and [1]. for 'concatenate_1/concat_1' (op: 'ConcatV2') with input shapes: [?,1,200], [?,25,1], [] and with computed input tensors: input[2] = <1>
I don't know why it says the shape of the second array is [?, 25, 1], as I am printing its shape before the concatenation and it's [?, 25, 200] (as it should be).
I don't understand why there'd be an issue with a model that compiles and works fine without that parameter, but I assume there's something I am missing.
I have also been thinking about using a Masking layer instead of mask_zero=True, but it would have to go before the Embedding layer, and the documentation says that the Embedding layer should be the first layer in a model (after the input).
Is there anything I could change in order to fix this or is there a workaround to this ?
The non-equal shape error refers to the mask rather than the tensors/inputs. Since concatenate supports masking, it needs to handle mask propagation. Your final_cnn_feats doesn't have a mask (None), while your embedded_caption has a mask of shape (?, 25). You can find this out by doing:
print(embedded_caption._keras_history[0].compute_mask(caption_input))
Since final_cnn_feats has no mask, concatenate will give it an all-True mask for proper mask propagation. While this is correct, that mask has the same shape as final_cnn_feats, which is (?, 1, 200) rather than (?, 1); i.e. it masks every feature at each time step rather than just each time step. This is where the non-equal shape error comes from ((?, 1, 200) vs (?, 25)).
To fix it, you need to give final_cnn_feats a correct/matching mask. I'm not familiar with your project here, but one option is to apply a Masking layer to final_cnn_feats, since it is designed to mask time steps.
final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))
This is correct only as long as not all 200 features in final_cnn_feats are zero, i.e. there is always at least one non-zero value in final_cnn_feats. Under that condition, the Masking layer will produce a (?, 1) mask and will not mask out the single time step in final_cnn_feats.
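Put together with the get_model code from the question, the relevant lines would look roughly like this (a sketch; it assumes Masking is imported from keras.layers and that voc_size + 1 is used as discussed above):
# mask the single image time step only if all of its features are zero
final_cnn_feats = Masking()(RepeatVector(1)(embedded_cnn_feats))
# embed the captions with zero-padding masked out
embedded_caption = Embedding(input_dim=voc_size + 1,
                             output_dim=embed_size,
                             input_length=maxlen,
                             mask_zero=True)(caption_input)
# the concatenation now propagates a (?, maxlen + 1) mask to the LSTM
img_caption_concat = concatenate([final_cnn_feats, embedded_caption], axis=1)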

CNTK Transfer Learning with LSTM: appending pretrained network to another network

I have a pretrained Seq-to-Seq slot tagger network which in its simplest form is as follows:
Network_1 = Sequential([
    Embedding(emb_dim),
    Recurrence(LSTM(LSTM_dim)),
    Dense(num_labels)
])
I would like to use the output of this as the initial layers in another network. Basically, I would like to concatenate the embeddings from network_1 (pretrained) with an embedding layer in network_2, as follows:
Network_2 = Sequential([
    Concat_embeddings(Embedding(emb_dim), Network_1_embed()),
    Recurrence(LSTM(LSTM_dim)),
    (Label('encoded_h'), Label('encoded_c'))
])
def Network_1_embed():
    loaded_model = load_model(path_to_network_1_saved_model)
    cloned_model = loaded_model.clone(CloneMethod.freeze)
    return cloned_model

def Concat_embeddings(emb1, emb2):
    X = Placeholder()
    return splice(emb1(X), emb2(X))
This is giving me the following error
ValueError: Times: The 1 leading dimensions of the right operand with shape '[50360]' do not match the left operand's trailing dimensions with shape '[293]'
For reference, we get [293] since emb_dim=256 and num_network_1_labels=37, while [50360] is the vocabulary size of the network_2 input. Network_1 also had the same vocabulary mapping when it was trained, so it can take the same input and output a 37-dimensional vector for each token.
How do I make this work?
Thanks
I think your problem is that you are using the entire Network_1 as the embedding, instead of just its embedding layer.
One way would be to define embed separately and train it through Network_1:
embed = Embedding(emb_dim)
Network_1 = Sequential([
    embed,
    Recurrence(LSTM(LSTM_dim)),
    Dense(num_labels)
])
Then train Network_1, but save embed:
embed.save(EMBED_PATH)
Explanation: Since Network_1 just invokes embed, they share parameters, so training Network_1 will train embed's parameters. Saving embed then gives you the embedding layer trained by Network_1. Quite straightforward, actually.
Then, to train your second model (in a second script), load embed from disk and just use it:
Network_1_embed = load_model(EMBED_PATH)
Network_2 = Sequential([
    (Embedding(emb_dim), Network_1_embed()),
    splice,
    Recurrence(LSTM(LSTM_dim)),
    (Label('encoded_h'), Label('encoded_c'))
])
Note the use of a function tuple as the first item passed to Sequential(). The tuple means to apply both functions to the same input, and generates two outputs, which are then the input to the subsequent function, splice.
To keep embed constant, clone it with the Freeze option as you already did in your example.
(I am not in front of a computer with the latest CNTK and cannot test this, so it is possible that I made a mistake.)