How to extract and use BERT encodings of sentences for text similarity (PyTorch/TensorFlow)

I want to build a text similarity model that I intend to use for FAQ matching and other tasks that retrieve the most related text. I want to use the highly optimised BERT model for this NLP task. I intend to use the encodings of all the sentences to build a similarity matrix with cosine_similarity and return the results.
As a hypothetical example, if I have two sentences such as hello world and hello hello world, I am assuming BERT would give me something like [0.2, 0.3, 0] (0 for padding) and [0.2, 0.2, 0.3], and I could pass these two into sklearn's cosine_similarity.
How am I supposed to extract the embeddings of the sentences to use them in the model? I found somewhere that they can be extracted like this:
Using PyTorch:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple
Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
Is this the right way to do it? I ask because I read somewhere that BERT offers different types of embeddings.
Also, please suggest any other method to find text similarity.

When you want to compare the embeddings of sentences, the recommended way to do this with BERT is to use the value of the CLS token. This corresponds to the first token of the output (after the batch dimension).
last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]
This gives you one embedding for the entire sentence. Since every sentence yields an embedding of the same size, you can then easily compute the cosine similarity.
If the CLS token does not give satisfactory results, you can also try averaging the output embeddings of the words in the sentence.
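For example, a minimal sketch building on the PyTorch snippet above, assuming a recent transformers version (the batched tokenizer call and the variable names are illustrative, not from the original post):
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Encode both sentences in one padded batch.
sentences = ["hello world", "hello hello world"]
encoded = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded)

last_hidden_states = outputs[0]               # shape: (batch, seq_len, hidden_size)
cls_embeddings = last_hidden_states[:, 0, :]  # one CLS vector per sentence
# Alternative: average the token embeddings instead of taking CLS
# (for a cleaner mean, mask out padding positions with encoded["attention_mask"]).
mean_embeddings = last_hidden_states.mean(dim=1)

print(cosine_similarity(cls_embeddings.numpy()))  # 2x2 similarity matrix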

Related

How to get token index map from tf hub pre-trained embedding?

I'm trying to use a TF Hub pre-trained word embedding in a text generation project. The setting is that there is a corpus of English text. I want to convert each word to a dense vector (embedding) and then feed the sequence to an LSTM model and try to learn how to generate the next word given a sequence.
Initially I was trying to load the embedding as a KerasLayer.
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
dtype=tf.string, trainable=True, name='embedding')
However, the KerasLayer doesn't seem to accept 2-D sequences as input. It looks like I have to preprocess the text, tokenize it, convert each token to a vector, and then feed the vectors directly to an LSTM layer.
In this case, I will need the token-to-int mapping from the model. I located the tokens.txt file in the assets directory of the local cache.
./tf_cache/510580b203329a4a95dfdfefd838bdcd202f0d13/assets/tokens.txt
But I don't want to manually copy the file out and load it into memory. Is there an API in TensorFlow that I can call to get the token mapping instead of reading the file manually?
You should be able to manipulate the tensors so that you can pass them into the KerasLayer.
If you are using ragged tensors tf.ragged.map_flat_values is your friend, e.g.:
sentences = ["sentence 1", "sentence number 2"]
words = tf.strings.split(sentences)
word_embeddings = tf.ragged.map_flat_values(hub_layer, words)
word_embeddings.to_tensor() # Convert to dense now to feed into next layers.
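If the goal is to feed these into an LSTM, as in the question, one possible continuation is the sketch below (the layer size is an arbitrary illustrative choice):
dense_embeddings = word_embeddings.to_tensor()  # zero-padded dense tensor [num_sentences, max_words, 20]
lstm_output = tf.keras.layers.LSTM(128)(dense_embeddings)  # 128 units picked only for illustration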
If you already have a dense tensor of shape [num_sentences, num_words], you could reshape it to [num_sentences * num_words], embed it, transforming it into [num_sentences * num_words, embedding_size], and then reshape back into [num_sentences, num_words, embedding_size]. In this case tf.reshape is your friend.
Something like:
dense_features = tf.constant([["sentence", "with", "four", "words"], ["hello", "world", "", ""]])
# Reshape to 1-d tensor.
flatten_words = tf.reshape(dense_features, [-1])
# Embed each element as if it was a single batch of words.
flatten_word_embeddings = hub_layer(flatten_words)
# Reshape back to 3-d tensor.
num_sentences = tf.shape(dense_features)[0]
max_num_words = tf.shape(dense_features)[1]
embedded_features = tf.reshape(flatten_word_embeddings, [num_sentences, max_num_words, -1])
These examples differ in how they treat the non-existent words: the ragged version never embeds them, while the dense version also embeds the empty padding strings.

How to get word vectors from pre-trained word2vec model downloaded from TFHub?

So I'm using the following word2vec model from TFHub:
embed = hub.load("https://tfhub.dev/google/Wiki-words-250-with-normalization/2")
The type of this object is:
tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject
While I can use the model to embed lists of text, it's not clear to me how I can access the word embeddings themselves.
First of all, let's discuss what embed actually is. According to the official documentation, the embed object is a TextEmbedding created from a Skipgram model and stored in TensorFlow 2 format.
The Skipgram model is just a feed-forward neural network that takes the one-hot encoded representations of the words in the vocabulary as input and calculates the word embeddings. So the word embeddings aren't stored within the model; they get calculated.
So, if you want the word embeddings of individual words, you can pass them in like so:
>>> # word embedding of `apple`
>>> apple_embedding = embed(["apple"])
>>> apple_embedding.shape
TensorShape([1, 250])
>>> # concatenation of three different word embeddings
>>> group = embed(["apple", "banana", "carrot"])
>>> group.shape
TensorShape([3, 250])
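For instance, a quick sketch of comparing two of those vectors with cosine similarity (the numpy usage here is illustrative, not from the original answer):
import numpy as np

apple, banana = embed(["apple", "banana"]).numpy()
cos_sim = np.dot(apple, banana) / (np.linalg.norm(apple) * np.linalg.norm(banana))
print(cos_sim)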

Load pretrained model on TF-Hub to calculate Word Mover's Distance (WMD) on Gensim or spaCy

I'd like to calculate Word Mover's Distance (WMD) with the Universal Sentence Encoder embedding from TensorFlow Hub.
I have tried the spaCy example for WMD-relax, which loads the 'en' model from spaCy, but I couldn't find a way to feed it other embeddings.
In gensim, it seems that it only accepts files loaded via load_word2vec_format (file.bin) or load (file.vec).
As far as I know, someone has written a BERT-to-token-embeddings converter based on PyTorch, but it is not generalized to other models on TF Hub.
Is there any other approach to convert pretrained models on TF Hub to the spaCy format or the word2vec format?
You need two different things.
First, tell spaCy to use an external vector for your documents, spans or tokens. This can be done by setting the user_hooks:
- user_hooks["vector"] is for the document vector
- user_span_hooks["vector"] is for the span vector
- user_token_hooks["vector"] is for the token vector
Given that you have a function that retrieves the vectors from TF Hub for a Doc/Span/Token (all of them have the text property):
import numpy as np
import spacy
import tensorflow_hub as hub

model = hub.load(TFHUB_URL)

def embed(element):
    # get the text
    text = element.text
    # then get your vector back. The signature is for batches/arrays
    results = model([text])
    # get the first element because we queried with just one text
    result = np.array(results)[0]
    return result
You can write the following pipe component, which tells spaCy how to retrieve the custom embedding for documents, spans and tokens:
def overwrite_vectors(doc):
    doc.user_hooks["vector"] = embed
    doc.user_span_hooks["vector"] = embed
    doc.user_token_hooks["vector"] = embed
    return doc

# add this to your nlp pipeline to get it on every document
nlp = spacy.blank('en')  # or any other Language
nlp.add_pipe(overwrite_vectors)
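After adding that component, every processed Doc gets its vector from the TF Hub model; a quick sanity check could look like this (the example text is arbitrary):
doc = nlp("How do I reset my password?")
print(doc.vector.shape)  # e.g. (512,) for the Universal Sentence Encoder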
For the part of your question about the custom distance, there is a user hook for that as well:
def word_mover_similarity(a, b):
    vector_a = a.vector
    vector_b = b.vector
    # your distance score needs to be converted to a similarity score
    similarity = TODO_IMPLEMENT(vector_a, vector_b)
    return similarity

def overwrite_similarity(doc):
    doc.user_hooks["similarity"] = word_mover_similarity
    doc.user_span_hooks["similarity"] = word_mover_similarity
    doc.user_token_hooks["similarity"] = word_mover_similarity
    return doc

# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
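With both components in the pipeline, similarity calls go through your custom function; for example (assuming word_mover_similarity has actually been implemented):
doc_a = nlp("I forgot my password")
doc_b = nlp("How do I reset my password?")
print(doc_a.similarity(doc_b))  # now computed by word_mover_similarity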
I have an implementation of the TF Hub Universal Sentence Encoder that uses the user_hooks in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub
Here is the implementation of WMD in spaCy. You can create a WMD object and load your own embeddings:
import numpy
from wmd import WMD
embeddings_numpy_array = ...  # your array with word vectors
calc = WMD(embeddings_numpy_array, ...)
Or, as shown in this example, you can create your own class:
import spacy

spacy_nlp = spacy.load('en_core_web_lg')

class SpacyEmbeddings(object):
    def __getitem__(self, item):
        return spacy_nlp.vocab[item].vector  # here you can return your own vector instead

calc = WMD(SpacyEmbeddings(), documents)
...
calc.nearest_neighbors("some text")
...

Output from elmo pretrained model

I am working on sentiment analysis. I am using ELMo to get word embeddings, but I am confused by the output this method gives. Consider the code given on the TensorFlow Hub website:
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
embeddings = elmo(["the cat is on the mat", "dogs are in the fog"],
                  signature="default", as_dict=True)["elmo"]
The embedding vectors for a particular sentence vary based on the number of strings you pass in. To explain in detail, let
x = "the cat is on the mat"
y = "dogs are in the fog"
x1 = elmo([x],signature="default",as_dict=True)["elmo"]
z1 = elmo([x,y] ,signature="default",as_dict=True)["elmo"]
So x1[0] will not be equal to z1[0], and the result changes as you change the input list of strings. Why does the output for one sentence depend on the others? I am not training anything; I am only using an existing pretrained model. Given that, I am confused about how to convert my comment texts to embeddings and use them for sentiment analysis. Please explain.
Note: to get the embedding vectors I use the following code:
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    # return average of ELMo features
    return sess.run(tf.reduce_mean(x1, 1))
When I run your code, x1[0] and z1[0] are the same. However, z1[1] differs from the result of
y1 = elmo([y],signature="default",as_dict=True)["elmo"]
return sess.run(tf.reduce_mean(y1,1))
because y has fewer tokens than x, and blindly reducing over outputs past the end of the sequence will pick up junk.
I recommend using the "default" output instead of "elmo", which does the intended reduction. Please see the module documentation.
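A minimal sketch of that suggestion, reusing the elmo module and the x, y strings defined in the question (the "default" output is already a mean-pooled sentence vector, so no manual tf.reduce_mean is needed):
sentence_embeddings = elmo([x, y], signature="default", as_dict=True)["default"]
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    print(sess.run(sentence_embeddings).shape)  # (2, 1024): one fixed-size vector per sentence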

How to use word embeddings for prediction in Tensorflow

I'm trying to work through the TensorFlow tutorials and have gotten stuck trying to enhance the RNN/language-model tutorial so that I can predict the next word in a sentence. The tutorial uses word embeddings as the representation for the words.
Since the model learns on word embeddings, I'm assuming that any sort of prediction I add will output embeddings as well. What I can't figure out is how to convert those embeddings back to the word IDs from the dataset. The only example I have seen kept an in-memory data structure with the reverse of the word ID -> embedding mapping and used that for lookups. This obviously won't work for all problems. Is there a better way?
Assuming you have both word_to_idx and idx_to_word from the vocabulary, this is the pseudocode for what you would do. Imagine the input for prediction is "this is sample":
batch_size = 1
num_steps = 3 # i.e each step for the word in "this is sample"
hidden_size = 1500
vocab_size = 10000
1. Translate the input "this is sample" to indices with `word_to_idx`.
2. Get the word embedding for each word in the input.
3. The input to the model will be a word embedding of size 1x1500 at each time step.
4. The output of the model at each time step will be of size 1x1500.
5. y_pred is the output at the last step of the model for the given input.
6. Add a projection to the output (i.e. y_pred x Weights(hidden_size, vocab_size) + bias(vocab_size,) = 1x10000).
7. Sample an index from that output distribution using the function below.
8. Use `idx_to_word` to turn the index we just got back into a word.
9. Use the generated word along with the previous input to generate the next word, until you get `<eos>` or some predefined stopping condition.
Here's an example of sampling from here:
import numpy as np

def sample(a, temperature=1.0):
    # helper function to sample an index from a probability array
    a = np.log(a) / temperature
    a = np.exp(a) / np.sum(np.exp(a))
    return np.argmax(np.random.multinomial(1, a, 1))
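To close the loop with the pseudocode above, the sampled index is mapped back to a word through idx_to_word (here, probabilities stands for the softmax of the 1x10000 projection output):
next_idx = sample(probabilities, temperature=0.8)  # sample an index from the distribution
next_word = idx_to_word[next_idx]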