pretrained tokenizer for TF-IDF for PyTorch

I am working on an MLP with PyTorch and applied bert-base-uncased as the tokenizer for multi-label text classification:
parser.add_argument("--tokenizer_name", default="bert-base-uncased", type=str,
                    help="Pretrained tokenizer name or path if not the same as model_name")
Please help me understand how I can use a TF-IDF tokenizer here, and which tokenizer to use to do it.
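One common approach (a sketch, not a definitive answer): TF-IDF is not a tokenizer in the Hugging Face sense, so instead of swapping --tokenizer_name you would replace the tokenization step with a TF-IDF feature extractor, e.g. scikit-learn's TfidfVectorizer, and feed its output straight into the PyTorch MLP. The texts, labels, and layer sizes below are illustrative assumptions:
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["first training document", "second training document"]   # placeholder corpus
labels = torch.tensor([[1, 0], [0, 1]], dtype=torch.float)         # multi-label targets

vectorizer = TfidfVectorizer(max_features=20000)
features = vectorizer.fit_transform(texts)                         # sparse (n_samples, n_features)
features = torch.tensor(features.toarray(), dtype=torch.float)     # dense input for the MLP

mlp = nn.Sequential(
    nn.Linear(features.shape[1], 256),
    nn.ReLU(),
    nn.Linear(256, labels.shape[1]),                               # one logit per label
)
logits = mlp(features)
loss = nn.BCEWithLogitsLoss()(logits, labels)                      # multi-label loss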

Related

What do the embedding elements stand for in the HuggingFace BERT model?

Prior to passing my tokens through the encoder in the BERT model, I would like to perform some processing on their embeddings. I extracted the embedding weights using:
from transformers import TFBertModel
# Load a pre-trained BERT model
model = TFBertModel.from_pretrained('bert-base-uncased')
# Get the embedding layer of the model
embedding_layer = model.get_layer('bert').get_input_embeddings()
# Extract the embedding weights
embedding_weights = embedding_layer.get_weights()
I found that it contains 5 elements.
In my understanding, the first three elements are the word embedding weights, token type embedding weights, and positional embedding weights. My question is: what do the last two elements stand for?
I dove into the source code of the BERT model, but I cannot figure out the meaning of the last two elements.
In the BERT model there is a post-processing step on the embedding tensor that applies layer normalization followed by dropout:
https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L362
I think those two arrays are the gamma and beta of the normalization layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization
They are learned parameters and span the axes of the inputs specified by the "axis" parameter, which defaults to -1 (corresponding to the 768-dimensional hidden axis of the embedding tensor).
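If it helps, the five arrays can be told apart by their shapes. A minimal sketch, reusing the code from the question above (the exact variable names printed may differ between transformers versions):
from transformers import TFBertModel

model = TFBertModel.from_pretrained('bert-base-uncased')
embedding_layer = model.get_layer('bert').get_input_embeddings()

# Print each variable next to its weight array so the shapes reveal the roles
for variable, array in zip(embedding_layer.weights, embedding_layer.get_weights()):
    print(variable.name, array.shape)

# For bert-base-uncased the shapes are:
#   (30522, 768) word embeddings
#   (2, 768)     token type embeddings
#   (512, 768)   position embeddings
#   (768,)       LayerNorm gamma (scale)
#   (768,)       LayerNorm beta (offset)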

Save trained gensim word2vec model as a tensorflow SavedModel

Do we have an option to save a trained Gensim Word2Vec model as a SavedModel using TF 2.0's tf.saved_model.save? In other words, how can I save a trained embedding vector as a SavedModel signature to work with TensorFlow 2.0? The following steps, of course, do not work:
model = gensim.models.Word2Vec(...)
model.init_sims(..)
model.train(..)
model.save(..)
module = gensim.models.KeyedVectors.load_word2vec(...)
tf.saved_model.save(
    module,
    export_dir
)
EDIT:
This example helped me figure out how to do it: https://keras.io/examples/nlp/pretrained_word_embeddings/
Gensim does not use TensorFlow and it has its own methods for loading and saving models.
You would need to convert the Gensim embeddings into a TensorFlow model, which only makes sense if you plan to further use your embeddings within TensorFlow and possibly fine-tune them for your task.
A Gensim Word2Vec model corresponds to two steps in TensorFlow:
Vocabulary lookup: a table that assigns indices to tokens.
Embedding lookup layer that picks up the actual embeddings for the indices.
Then you can save it like any other TensorFlow model; a sketch of both steps follows.
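A sketch of those two steps, assuming kv is a trained gensim KeyedVectors instance (e.g. model.wv) and TF 2.6+ with Keras; the layer choices below are assumptions, not the only way to do it:
import numpy as np
import tensorflow as tf

vocab = kv.index_to_key                      # list of tokens, ordered by index
vectors = np.asarray(kv.vectors)             # (vocab_size, embedding_dim)

# Step 1: vocabulary lookup, token string -> integer index.
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, mask_token=None)

# Step 2: embedding lookup, index -> pre-trained vector.
# Index 0 is reserved for OOV by StringLookup, so prepend a row for it.
embedding_matrix = np.vstack([np.zeros((1, vectors.shape[1])), vectors])
embedding = tf.keras.layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=embedding_matrix.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,                         # set True to fine-tune the vectors
)

# Wrap both steps in a model and save it like any other TensorFlow model.
inputs = tf.keras.Input(shape=(None,), dtype=tf.string)
outputs = embedding(lookup(inputs))
module = tf.keras.Model(inputs, outputs)
tf.saved_model.save(module, "word2vec_saved_model")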

how to build keras model on Bert output with different shapes

I'm trying to build a classifier for my study using BERT and Keras.
I got BERT encodings of shape (1, X, 768), where X is the number of words and spaces in the sentence.
How can I build a Keras model if X is not consistent?
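One common workaround (a sketch, assuming the (1, X, 768) encodings come from a Hugging Face TF BERT model): pool over the token axis so the downstream layers always see a fixed-size 768-dimensional vector, which makes the model independent of X:
import tensorflow as tf

encodings = tf.keras.Input(shape=(None, 768))                     # token count X left as None
pooled = tf.keras.layers.GlobalAveragePooling1D()(encodings)      # (batch, 768), independent of X
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(pooled)  # binary classifier head
classifier = tf.keras.Model(encodings, outputs)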

Updating a BERT model through Huggingface transformers

I am attempting to update the pre-trained BERT model using an in-house corpus. I have looked at the Huggingface transformers docs and I am a little stuck, as you will see below. My goal is to compute simple similarities between sentences using the cosine distance, but I need to update the pre-trained model for my specific use case.
The code below is taken precisely from the Huggingface docs. I am attempting to "retrain" or update the model, and I assumed that SPECIAL_TOKEN_1 and SPECIAL_TOKEN_2 represent "new sentences" from my in-house data or corpus. Is this correct? In summary, I like the already pre-trained BERT model, but I would like to update or retrain it using another in-house dataset. Any leads will be appreciated.
import tensorflow as tf
import tensorflow_datasets
from transformers import *
model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
SPECIAL_TOKEN_1="dogs are very cute"
SPECIAL_TOKEN_2=("dogs are cute but i like cats better and my "
                 "brother thinks they are more cute")
tokenizer.add_tokens([SPECIAL_TOKEN_1, SPECIAL_TOKEN_2])
model.resize_token_embeddings(len(tokenizer))
#Train our model
model.train()
model.eval()
BERT is pre-trained on 2 tasks: masked language modeling (MLM) and next sentence prediction (NSP). The more important of the two is MLM (it turns out that the next sentence prediction task is not really that helpful for the model's language understanding capabilities; RoBERTa, for example, is only pre-trained on MLM).
If you want to further train the model on your own dataset, you can do so by using BERTForMaskedLM in the Transformers repository. This is BERT with a language modeling head on top, which allows you to perform masked language modeling (i.e. predicting masked tokens) on your own dataset. Here's how to use it:
from transformers import BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict=True)
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
labels = tokenizer("The capital of France is Paris.", return_tensors="pt")["input_ids"]
outputs = model(**inputs, labels=labels)
loss = outputs.loss
logits = outputs.logits
You can update the weights of BertForMaskedLM using loss.backward(), which is the main way of training PyTorch models. If you don't want to do this yourself, the Transformers library also provides a Python script which allows you to perform MLM really quickly on your own dataset. See here (section "RoBERTa/BERT/DistilBERT and masked language modeling"). You just need to provide a training and test file.
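A minimal sketch of that manual update, reusing the model, inputs, and labels from the snippet above; the optimizer choice (AdamW with lr=5e-5) is an assumption, not something the docs prescribe:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
outputs = model(**inputs, labels=labels)  # forward pass with MLM labels
loss = outputs.loss
optimizer.zero_grad()
loss.backward()                           # backpropagate the MLM loss
optimizer.step()                          # update the BERT weights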
You don't need to add any special tokens. Examples of special tokens are [CLS] and [SEP], which are used for sequence classification and question answering tasks (among others). These are added by the tokenizer automatically. How do I know this? Because BertTokenizer inherits from PreTrainedTokenizer, and if you take a look at the documentation of its __call__ method here, you can see that the add_special_tokens parameter defaults to True.
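A quick way to check this (a sketch, reusing the tokenizer loaded above):
ids = tokenizer("dogs are very cute")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'dogs', 'are', 'very', 'cute', '[SEP]']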

Use pre-trained word2vec in lstm language model?

I used TensorFlow to train an LSTM language model; the code is from here.
According to the article here, it seems that the model works better if I use pre-trained word2vec:
Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training.
So, I want to use word2vec to redo the training, but I am a little bit confused about how to do this.
The embedding code goes here:
with tf.device("/cpu:0"):
    embedding = tf.get_variable(
        "embedding", [vocab_size, size], dtype=data_type())
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
How can I change this code to use pre-trained word2vec?
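A common pattern for this (a sketch, not from the original tutorial): keep the same variable but initialize it from a pre-trained matrix. Here pretrained_matrix is assumed to be a (vocab_size, size) NumPy array built from your word2vec model, with rows ordered to match the tutorial's word-to-id mapping; vocab_size, size, data_type(), and input_ are the same names used in the snippet above:
with tf.device("/cpu:0"):
    embedding = tf.get_variable(
        "embedding", [vocab_size, size], dtype=data_type(),
        initializer=tf.constant_initializer(pretrained_matrix),
        trainable=False)  # set trainable=True to fine-tune the vectors
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)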