The usual approach is to convert the text into vectors and stack those vectors into a matrix that can be fed to a machine learning model such as LightGBM as features.
import pandas as pd
import transformers
from transformers import LongformerTokenizer, LongformerForSequenceClassification, Trainer, TrainingArguments, LongformerConfig, LongformerTokenizerFast
import tensorflow as tf
#tokenizer=LongformerTokenizer.from_pretrained("hf-internal-testing/tiny-random-longformer")
#model=TFLongformerForSequenceClassification.from_pretrained("hf-internal-testing/tiny-random-longformer")
from torch.utils.data import Dataset, DataLoader

config = LongformerConfig()

test = pd.read_csv('../input/feedback-prize-effectiveness/test.csv')
train = pd.read_csv('../input/feedback-prize-effectiveness/train.csv')

# load model and tokenizer and define the length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=False, attention_window=512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length=1024)

#inputs = tokenizer("Hello, my dog is cute killer and bad")
#print(inputs.input_ids)

# tokenize every discourse text and store the token ids in a new column
k = []
for i in train['discourse_text']:
    inputs = tokenizer(i)
    m = inputs.input_ids
    k.append(m)
train['long_tokens'] = k
The code above uses the Longformer tokenizer to encode the sentences in the dataset, so afterwards the dataset contains an extra "long_tokens" column holding the token ids.
That "long_tokens" column is meant to serve as a feature for the machine learning model (LightGBM).
My question is: how can I transform those token sequences into features that can be fed into the model?
The datatype of "long_tokens" is a tensor.
You can use Sentence Transformers to encode each sentence into a fixed-size vector and then use those vectors as features.
https://www.sbert.net/
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('allenai/longformer-base-4096')

# the sentences we would like to encode
sentences = ['This framework generates embeddings for each input sentence',
             'Sentences are passed as a list of string.',
             'The quick brown fox jumps over the lazy dog.']

# sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)
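To tie this back to the original question: the array returned by model.encode() already has one fixed-length row per sentence, so it can be passed straight to LightGBM as a feature matrix. A minimal sketch, with placeholder texts and labels standing in for the real training columns:

from sentence_transformers import SentenceTransformer
from lightgbm import LGBMClassifier

# placeholder inputs: swap in train['discourse_text'] and the real target column
texts = ['first example sentence', 'second example sentence',
         'third example sentence', 'fourth example sentence']
labels = [0, 1, 0, 1]

# encode() returns a (n_sentences, embedding_dim) numpy array,
# which LightGBM accepts as a dense feature matrix
encoder = SentenceTransformer('allenai/longformer-base-4096')
X = encoder.encode(texts)

clf = LGBMClassifier()
clf.fit(X, labels)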
I am using the following code to generate embeddings for my text classification.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

# BERT preprocessing and encoder layers from TF Hub
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

def get_sentence_embedding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']

e = get_sentence_embedding(["happy", "sad"])

from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity([e[0]], [e[1]])
The above gives array([[0.99355495]], dtype=float32), i.e. a similarity score of 99% between "happy" and "sad".
Why is it 99%? Can I still use these embeddings for my text classification?
BERT wasn't optimized to put words with contrary meanings far away from each other in the embedding space. Instead, the two words are close together since both are adjectives.
This tutorial actually demonstrates how to fine-tune BERT for sentiment analysis.
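If classification is the end goal, a common route is to fine-tune BERT with a classification head rather than feeding raw pooled embeddings into a separate model. A minimal sketch with the Hugging Face Transformers library (the two-sentence batch and the 0/1 labels are placeholders, and a real run would loop over a DataLoader for several epochs):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# placeholder batch with hypothetical sentiment labels (1 = positive, 0 = negative)
batch = tokenizer(["happy", "sad"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

model.train()
outputs = model(**batch, labels=labels)  # returns loss and logits
outputs.loss.backward()
optimizer.step()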
I want to build a text similarity model, which I intend to use for FAQ retrieval and other ways of finding the most related text. I want to use the highly optimised BERT model for this NLP task. My plan is to take the encodings of all the sentences, build a similarity matrix with cosine_similarity, and return the results.
As a hypothetical example, if I have the two sentences "hello world" and "hello hello world", I am assuming BERT would give me something like [0.2, 0.3, 0] (0 for padding) and [0.2, 0.2, 0.3], and I can pass these two into sklearn's cosine_similarity.
How am I supposed to extract the embeddings of the sentences to use them in the model? I found somewhere that they can be extracted like this:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
Using TensorFlow:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
input_ids = tf.constant(tokenizer.encode("Hello, my dog is cute"))[None, :] # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
Is this the right way to do it? I ask because I read somewhere that BERT offers different types of embeddings.
Also, please do suggest any other methods for computing text similarity.
When you want to compare the embeddings of sentences, the recommended way to do this with BERT is to use the value of the [CLS] token. This corresponds to the first token of the output (after the batch dimension).
last_hidden_states = outputs[0]
cls_embedding = last_hidden_states[0][0]
This will give you one embedding for the entire sentence. As you will have the same size embedding for each sentence you can then easily compute the cosine similarity.
If you do not get satisfactory results using the CLS token you can also try averaging the output embedding for each word in the sentence.
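A short sketch of both options, continuing the PyTorch snippet from the question (the mean pooling uses the attention mask so that padding tokens do not dilute the average; this is one reasonable recipe, not the only one):

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

encoded = tokenizer(["hello world", "hello hello world"], padding=True, return_tensors="pt")
with torch.no_grad():
    last_hidden_states = model(**encoded)[0]  # shape (batch, seq_len, 768)

# option 1: CLS token embedding (first position of each sequence)
cls_embeddings = last_hidden_states[:, 0, :]

# option 2: mean pooling over real tokens only (padding masked out)
mask = encoded['attention_mask'].unsqueeze(-1)  # shape (batch, seq_len, 1)
mean_embeddings = (last_hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

# sentence-to-sentence similarity under each pooling strategy
print(cosine_similarity(cls_embeddings[:1].numpy(), cls_embeddings[1:].numpy()))
print(cosine_similarity(mean_embeddings[:1].numpy(), mean_embeddings[1:].numpy()))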
I'm trying to build a classifier for my study using BERT and Keras.
I get a BERT encoding of shape (1, X, 768), where X is the number of words and spaces in the sentence.
How can I build a Keras model if X is not consistent?
I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Hugging Face Transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
If you look at the code below, which is taken almost verbatim from the Hugging Face docs, I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update it or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

SPECIAL_TOKEN_1 = "dogs are very cute"
SPECIAL_TOKEN_2 = "dogs are cute but i like cats better and my brother thinks they are more cute"

# add the "special tokens" and resize the embedding matrix accordingly
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))

# Train our model
model.train()
model.eval()
BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The most important of those two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities - RoBERTa for example is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BertForMaskedLM from the Transformers library. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the standard way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you to perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and a test file.
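For completeness, a single update step on one in-house sentence could look like the sketch below (the sentence, learning rate, and one-step loop are placeholders; in practice you would iterate over a DataLoader of your corpus with randomly masked tokens):

import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# mask a token in an in-house sentence and train the model to recover it
inputs = tokenizer("dogs are very [MASK].", return_tensors="pt")
labels = tokenizer("dogs are very cute.", return_tensors="pt")["input_ids"]

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()       # backpropagate the MLM loss
optimizer.step()      # update the weights
optimizer.zero_grad()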
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
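A quick way to convince yourself of this is to tokenize a sentence and inspect the resulting tokens; [CLS] and [SEP] show up without being asked for:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

ids = tokenizer("dogs are very cute")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# expected to print something like ['[CLS]', 'dogs', 'are', 'very', 'cute', '[SEP]']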