How to convert the vectorized text into embedding matrix using Longformer transformer - tensorflow

It is common to convert text into vectors and stack those vectors into a matrix that can be fed as features to machine learning models such as LightGBM.
import pandas as pd
import transformers
from transformers import LongformerTokenizer, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig, LongformerTokenizerFast
import tensorflow as tf
#tokenizer=LongformerTokenizer.from_pretrained("hf-internal-testing/tiny-random-longformer")
#model=TFLongformerForSequenceClassification.from_pretrained("hf-internal-testing/tiny-random-longformer")
from torch.utils.data import Dataset, DataLoader

config = LongformerConfig()

test = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')
train = pd.read_csv('../input/feedback-prize-effectiveness/train.csv')

# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=False, attention_window=512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length=1024)

#inputs = tokenizer("Hello, my dog is cute killer and bad")
#print(inputs.input_ids)

# tokenize every discourse text and store the token ids in a new column
k = []
for i in train['discourse_text']:
    inputs = tokenizer(i)
    m = inputs.input_ids
    k.append(m)
train['long_tokens'] = k
The above code uses the Longformer tokenizer to encode the sentences in the dataset, so afterwards the dataset looks like the snapshot below.
The column "long_tokens" is meant to serve as a feature for the machine learning model (LightGBM).
My question is: how can we transform those token lists into features that the model can take as input?
The datatype of "long_tokens" is a tensor.
Thanks & Regards
Satwik Sunnam

You can use Sentence transformers to encode sentences into vectors and then use them as features.
https://www.sbert.net/
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('allenai/longformer-base-4096')
#Our sentences we like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']
#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
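If the goal is to feed these embeddings to LightGBM, one option is to stack them into a 2-D NumPy array and use that as the feature matrix. Below is a minimal sketch; the label column 'discourse_effectiveness' is an assumption about the dataset, so substitute your actual target.
import numpy as np
import lightgbm as lgb
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('allenai/longformer-base-4096')

# one fixed-size embedding vector per row -> a (n_rows, embedding_dim) feature matrix
X = np.asarray(encoder.encode(train['discourse_text'].tolist()))

# hypothetical label column; replace with the real target of your dataset
y = train['discourse_effectiveness'].astype('category').cat.codes

clf = lgb.LGBMClassifier()
clf.fit(X, y)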

Related

Tensorflow Recommender - ScaNN passing embedding as query

I want to pass a query embedding to ScaNN instead of a model. What data type should I use for this?
My query would look like this [1, 0.3, 0.4]
My candidate embedding would be something like:
[[0.2, 1, .4],
[0.3,0.1,0.56]]
All the examples I see pass a query model, not the embedding itself.
I tried passing a NumPy array, but it didn't work.
Embeddings are just lists of vectors that your model produces, in this case using the tf.keras.layers.Embedding layer.
self._embeddings = {}
# Compute embeddings for string features
for feature_name in str_features:
    vocabulary = vocabularies[feature_name]
    self._embeddings[feature_name] = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=vocabulary, mask_token=None),
        tf.keras.layers.Embedding(len(vocabulary) + 1,
                                  self.embedding_dimension)
    ])
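As a small standalone illustration of that idea (the vocabulary and embedding dimension below are made up, not from the original model):
import tensorflow as tf

vocabulary = ["action", "comedy", "drama"]   # assumed toy vocabulary
embedding_dimension = 4

embedder = tf.keras.Sequential([
    tf.keras.layers.StringLookup(vocabulary=vocabulary, mask_token=None),
    tf.keras.layers.Embedding(len(vocabulary) + 1, embedding_dimension),
])

# each input string becomes one dense vector; unknown strings map to the OOV index
print(embedder(tf.constant(["comedy", "unknown genre"])))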
You can also use another model such as a Sentence Transformer to create embeddings.
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
You do not need to pass a model to ScaNN; you can pass it the embeddings directly, as mentioned in the documentation here.
Here is a sample code snippet showing how to pass embeddings directly to ScaNN:
import numpy as np
import pandas as pd
import scann
from sklearn import preprocessing, metrics

df = pd.read_csv("./data/mydata.csv")

# normalize the embedding columns (the first column is assumed to be an id)
df_np = preprocessing.normalize(df.iloc[:, 1:], norm="l2")

num_neighbors = 100

# creating searcher
k = int(np.sqrt(df_np.shape[0]))
searcher = scann.scann_ops_pybind.builder(df_np, num_neighbors, "dot_product").tree(
    num_leaves=k,
    num_leaves_to_search=int(k / 20),
    training_sample_size=2500).score_brute_force(2).reorder(7).build()
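Once the searcher is built, a query embedding can be passed to it directly; a minimal sketch, reusing the first normalized row as a stand-in query:
# any 1-D vector with the same dimensionality as the candidate embeddings works as a query
query = df_np[0]
neighbors, distances = searcher.search(query, final_num_neighbors=10)
print(neighbors, distances)

# batches of queries go through search_batched
batch_neighbors, batch_distances = searcher.search_batched(df_np[:5])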
Here is a blog post on using ScaNN
ScaNN optimization and configuration

BERT encodings from TensorFlow hub

I am using the following code to generate embeddings for my text classification.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

e = get_sentence_embeding(["happy", "sad"])

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([e[0]], [e[1]])
The above gives array([[0.99355495]], dtype=float32), i.e. a similarity score between "happy" and "sad" of roughly 99%.
Why is it 99%? And can I use these embeddings for my text classification?
BERT wasn't optimized to put words with contrary meanings far away from each other in the embedding space. Instead, the two words are close together since both are adjectives.
This tutorial actually demonstrates how to fine-tune BERT for sentiment analysis.
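As a rough illustration of what that fine-tuning looks like (a minimal sketch modeled on that tutorial, not its exact code), the same two hub layers can be wrapped in a small trainable Keras classifier:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the preprocessing layer

def build_classifier():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    preprocessed = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")(text_input)
    encoder_outputs = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4",
        trainable=True)(preprocessed)
    x = tf.keras.layers.Dropout(0.1)(encoder_outputs['pooled_output'])
    output = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    return tf.keras.Model(text_input, output)

model = build_classifier()
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(train_texts, train_labels, epochs=3)  # your labeled text data goes here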

Updating a BERT model through Huggingface transformers

I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Huggingface transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
If you look at the code below, which is taken directly from the Huggingface docs, I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update it or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

SPECIAL_TOKEN_1 = "dogs are very cute"
SPECIAL_TOKEN_2 = "dogs are cute but i like cats better and my brother thinks they are more cute"

tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))

# Train our model
model.train()
model.eval()
BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM from the Transformers library. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you to perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.
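For completeness, a single training step with that loss might look like the following minimal sketch (the optimizer choice and learning rate are assumptions, not part of the original answer):
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)  # assumed hyperparameters

model.train()
outputs = model(**inputs, labels=labels)
outputs.loss.backward()   # compute gradients of the MLM loss
optimizer.step()          # update the weights
optimizer.zero_grad()     # reset gradients for the next batch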
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
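A quick way to verify this (a small illustrative check, not from the original answer) is to decode the tokenized ids and look for the automatically added [CLS] and [SEP] tokens:
ids = tokenizer("dogs are very cute")["input_ids"]
print(tokenizer.decode(ids))  # [CLS] dogs are very cute [SEP]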

Classify intent of random utterance of chat bot from training data and give different graphical visualization using random forest?

I am creating an NLP model to detect the intent of utterances provided in an Excel file that I am using for training. The file has 2 columns, as shown below:
Utterence                            Intent
hi can I have an Apple Watch         service
how much I will be paying monthly    service
you still around                     YOU_THERE
are you still there                  YOU_THERE
you there                            YOU_THERE
Speak to me if you are there.        YOU_THERE
you around                           YOU_THERE
There are around 3,000 utterances in the training file and many intents.
I trained my model using the scikit-learn module and my code looks like this:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import re

def preprocessing(userQuery):
    letters_only = re.sub("[^a-zA-Z\\d]", " ", userQuery)
    words = letters_only.lower().split()
    return " ".join(words)

# read utterance data from a xlsx file
train = pd.read_excel('training.xlsx')
query_features = train['Utterence']

# create tfidf
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 1))
new_query = [preprocessing(query) for query in query_features]
features = tfidf_vectorizer.fit_transform(new_query).toarray()

# create random forest classification model
model = RandomForestClassifier()
model.fit(features, train['Intent'])

# intent prediction on user query
userQuery = "I want apple watch"
userQueryList = []
userQueryList.append(preprocessing(userQuery))
utfidf = tfidf_vectorizer.transform(userQueryList)
print(" prediction: ", model.predict(utfidf))
One of the problems for me here is, for example: when I run the utterance "I want apple watch", it gives the predicted intent as YOU_THERE instead of service, as shown below (see the training snapshot above for confirmation):
C:\U\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\ensemble\forest.py:246: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
"10 in version 0.20 to 100 in 0.22.", FutureWarning)
prediction: ['YOU_THERE']
Please help me understand how I should train my model and what changes I should make to fix such issues, and how I can check accuracy. I would also like to see a graphical visualization and an ROC curve; how can that be achieved with a random forest? I am not very well versed in NLP, so any help would be appreciated.
You are using a bag-of-words approach, which does not perform well on sequence data.
For your problem, word order matters for classification.
I would suggest using an LSTM, which performs better on sequence data (a rough sketch follows).
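For illustration, here is a minimal Keras sketch of that LSTM suggestion (not part of the original answer; the vocabulary size, sequence length and other hyperparameters are assumptions):
import tensorflow as tf

num_words = 5000     # assumed vocabulary size
max_len = 20         # utterances padded/truncated to this many tokens
num_intents = 10     # number of distinct intents in the training data

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(num_words, 64, input_length=max_len),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_intents, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# model.fit(padded_token_ids, encoded_intents, epochs=10)  # tokenized, padded utterances and integer intent labels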
Let's address your first issue:
how should i train my model and what changes should I make to fix such issues
Below I'm using a word2vec approach, which maintains the semantic information contained in each sentence rather than losing it the way the TF-IDF approach does when it simply converts utterances into sparse count-based vectors.
To understand more about word2vec, refer to this blog:
[1] https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
Below is the code for predicting the intent using the word2vec approach (note: it is the same as your code, except that instead of TfidfVectorizer I'm using word2vec to obtain the vectors; the code is also divided into functions whose names make the logic easy to follow).
import pandas as pd
import numpy as np
from gensim.models import Word2Vec
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

def preprocess_lower(token):
    # utility for preprocessing
    return token.lower()

def load_data(file_name):
    # load a csv file in memory
    return pd.read_csv(file_name)

def process_training_data(training_data):
    # process the training data and split it between independent and dependent variables
    training_sentences = [list(map(preprocess_lower, sentence.split(" "))) for sentence in list(training_data.Utterence.values)]
    target_class = training_data.Intent.values
    label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(target_class))
    return target_class, training_sentences, label_encoded_Y

def process_user_query(training_data):
    # process the user queries the same way as the training sentences
    training_sentences = [list(map(preprocess_lower, sentence.split(" "))) for sentence in training_data]
    return training_sentences

def train_word2vec_model(train_sentences_list):
    # training word2vec on sentences list (inputted by user)
    model = Word2Vec(train_sentences_list, size=100, window=4, min_count=1, workers=4)
    return model

def convert_training_data_vectors(model, train_sentences_list):
    # get the sentences average vector
    training_sectences_vector = list()
    for sentence in train_sentences_list:
        sentence_vetor = [list(model.wv[token]) for token in sentence if token in model.wv.vocab]
        training_sectences_vector.append(list(np.mean(sentence_vetor, axis=0)))
    return training_sectences_vector

def training_rf_prediction_model(training_data_vectors, label_encoded_Y):
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    # training model on user inputted data
    rf_model = RandomForestClassifier()
    # here use the split function and divide the data into training and testing
    x_train, x_test, y_train, y_test = train_test_split(training_data_vectors, label_encoded_Y,
                                                        train_size=0.8, test_size=0.2)
    rf_model.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    return rf_model

def training_svm_prediction_model(training_data_vectors, label_encoded_Y):
    svm_model = SVC(gamma='auto')
    svm_model.fit(training_data_vectors, label_encoded_Y)
    return svm_model

def process_data_flow(file_name):
    training_data = load_data(file_name)
    target_class, training_sentences, label_encoded_Y = process_training_data(training_data)
    word2vec_model = train_word2vec_model(train_sentences_list=training_sentences)
    training_data_vectors = convert_training_data_vectors(word2vec_model, train_sentences_list=training_sentences)
    prediction_model = training_rf_prediction_model(training_data_vectors, label_encoded_Y)
    # intent prediction on user query
    userQuery = ["i want apple watch"]
    user_query_vectors = convert_training_data_vectors(word2vec_model, process_user_query(userQuery))
    predicted_class = prediction_model.predict(user_query_vectors)[0]
    predicted_intent = target_class[list(label_encoded_Y).index(predicted_class)]
    return predicted_intent

print("Predicted class: ", process_data_flow("sample_intent_data.csv"))
The sample data file is in CSV format; you just need to format and paste the data as below:
#sample_input_data.csv
Utterence,Intent
hi can I have an Apple Watch,service
how much I will be paying monthly,service
you still around,YOU_THERE
are you still there,YOU_THERE
you there,YOU_THERE
Speak to me if you are there,YOU_THERE
you around,YOU_THERE
Also note that your training data should contain a good number of training utterances for each intent for this approach to work well.
For accuracy, you can use the approach below.
Divide the data into training and testing sets (specify the split ratio):
x_train, x_test, y_train, y_test = train_test_split(training_vectors, label_encoded_Y,
                                                    train_size=0.8, test_size=0.2)
After training the model, use the predict function on x_test to get the predictions. Then compare the model's predictions for the test data with the actual labels from the dataset and you will easily be able to determine the accuracy.
Edit: Added the accuracy score calculation while predicting.
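For reference, a minimal standalone sketch of that accuracy check with scikit-learn, reusing the split above (the classification_report line is added here only for illustration and is not part of the original answer):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

y_pred = rf_model.predict(x_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # per-intent precision, recall and F1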

Load pretrained model on TF-Hub to calculate Word Mover's Distance (WMD) on Gensim or spaCy

I'd like to calculate Word Mover's Distance using Universal Sentence Encoder embeddings from TensorFlow Hub.
I have tried the spaCy example for WMD-relax, which loads the 'en' model from spaCy, but I couldn't find a way to feed it other embeddings.
In gensim, it seems that it only accepts files loaded with load_word2vec_format (file.bin) or load (file.vec).
As far as I know, someone has written a BERT-to-token-embeddings implementation based on PyTorch, but it isn't generalized to other models on TF Hub.
Is there any other approach to transfer pretrained models on TF Hub into spaCy format or word2vec format?
You need two different things.
First tell SpaCy to use an external vector for your documents, spans or tokens. This can be done by setting the user_hooks:
- user_hooks["vector"] is for the document vector
- user_span_hooks["vector"] is for the span vector
- user_token_hooks["vector"] is for the token vector
Assuming you have a function that retrieves the vectors from TF Hub for a Doc/Span/Token (all of them have the text property):
import numpy as np
import spacy
import tensorflow_hub as hub

model = hub.load(TFHUB_URL)

def embed(element):
    # get the text
    text = element.text
    # then get your vector back. The signature is for batches/arrays
    results = model([text])
    # get the first element because we queried with just one text
    result = np.array(results)[0]
    return result
You can write the following pipe component, which tells spaCy how to retrieve the custom embedding for documents, spans and tokens:
def overwrite_vectors(doc):
    doc.user_hooks["vector"] = embed
    doc.user_span_hooks["vector"] = embed
    doc.user_token_hooks["vector"] = embed
    return doc  # pipe components must return the Doc

# add this to your nlp pipeline to get it on every document
nlp = spacy.blank('en')  # or any other Language
nlp.add_pipe(overwrite_vectors)
For your question about the custom distance, there is a user hook for this as well:
def word_mover_similarity(a, b):
    vector_a = a.vector
    vector_b = b.vector
    # your distance score needs to be converted to a similarity score
    similarity = TODO_IMPLEMENT(vector_a, vector_b)
    return similarity

def overwrite_similarity(doc):
    doc.user_hooks["similarity"] = word_mover_similarity
    doc.user_span_hooks["similarity"] = word_mover_similarity
    doc.user_token_hooks["similarity"] = word_mover_similarity
    return doc  # as with the vector hooks, return the Doc

# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
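With both components in the pipeline, similarity calls are routed through the custom hooks; a minimal usage sketch (the sentences are arbitrary examples):
doc_1 = nlp("The weather is nice today.")
doc_2 = nlp("It is sunny outside.")

# vectors come from the TF Hub model, similarity from word_mover_similarity
print(doc_1.similarity(doc_2))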
I have an implementation of the TF Hub Universal Sentence Encoder that uses the user_hooks in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub
Here is an implementation of WMD that works with spaCy (the wmd package). You can create a WMD object and load your own embeddings:
import numpy
from wmd import WMD

embeddings_numpy_array = ...  # your array with word vectors
calc = WMD(embeddings_numpy_array, ...)
Or, as shown in this example, you can create your own class:
import spacy

spacy_nlp = spacy.load('en_core_web_lg')

class SpacyEmbeddings(object):
    def __getitem__(self, item):
        return spacy_nlp.vocab[item].vector  # here you can return your own vector instead

calc = WMD(SpacyEmbeddings(), documents)
...
...
calc.nearest_neighbors("some text")
...