Using Gensim Fasttext model with LSTM nn in keras - tensorflow

I have trained fasttext model with Gensim over the corpus of very short sentences (up to 10 words). I know that my test set includes words that are not in my train corpus, i.e some of the words in my corpus are like "Oxytocin" "Lexitocin", "Ematrophin",'Betaxitocin"
given a new word in the test set, fasttext knows pretty well to generate a vector with high cosine-similarity to the other similar words in the train set by using the characters level n-gram
How do i incorporate the fasttext model inside a LSTM keras network without losing the fasttext model to just a list of vectors in the vocab? because then I won't handle any OOV even when fasttext do it well.
Any idea?

here the procedure to incorporate the fasttext model inside an LSTM Keras network
# define dummy data and precproces them
docs = ['Well done',
'Good work',
'Great effort',
'nice work',
'Excellent',
'Weak',
'Poor effort',
'not good',
'poor work',
'Could have done better']
docs = [d.lower().split() for d in docs]
# train fasttext from gensim api
ft = FastText(size=10, window=2, min_count=1, seed=33)
ft.build_vocab(docs)
ft.train(docs, total_examples=ft.corpus_count, epochs=10)
# prepare text for keras neural network
max_len = 8
tokenizer = tf.keras.preprocessing.text.Tokenizer(lower=True)
tokenizer.fit_on_texts(docs)
sequence_docs = tokenizer.texts_to_sequences(docs)
sequence_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_docs, maxlen=max_len)
# extract fasttext learned embedding and put them in a numpy array
embedding_matrix_ft = np.random.random((len(tokenizer.word_index) + 1, ft.vector_size))
pas = 0
for word,i in tokenizer.word_index.items():
try:
embedding_matrix_ft[i] = ft.wv[word]
except:
pas+=1
# define a keras model and load the pretrained fasttext weights matrix
inp = Input(shape=(max_len,))
emb = Embedding(len(tokenizer.word_index) + 1, ft.vector_size,
weights=[embedding_matrix_ft], trainable=False)(inp)
x = LSTM(32)(emb)
out = Dense(1)(x)
model = Model(inp, out)
model.predict(sequence_docs)
how to deal unseen text
unseen_docs = ['asdcs work','good nxsqa zajxa']
unseen_docs = [d.lower().split() for d in unseen_docs]
sequence_unseen_docs = tokenizer.texts_to_sequences(unseen_docs)
sequence_unseen_docs = tf.keras.preprocessing.sequence.pad_sequences(sequence_unseen_docs, maxlen=max_len)
model.predict(sequence_unseen_docs)

Related

How to train Tensorflow's pre trained BERT on MLM task? ( Use pre-trained model only in Tensorflow)

Before Anyone suggests pytorch and other things, I am looking specifically for Tensorflow + pretrained + MLM task only. I know, there are lots of blogs for PyTorch and lots of blogs for fine tuning ( Classification) on Tensorflow.
Coming to the problem, I got a language model which is English + LaTex where a text data can represent any text from Physics, Chemistry, MAths and Biology and any typical example can look something like this:
Link to OCR image
"Find the value of function x in the equation: \n \\( f(x)=\\left\\{\\begin{array}{ll}x^{2} & \\text { if } x<0 \\\\ 2 x & \\text { if } x \\geq 0\\end{array}\\right. \\)"
So my language model needs to understand \geq \\begin array \eng \left \right other than the English language and that is why I need to train an MLM first on pre-trained BERT or SciBERT to have both. So I went up digging the internet and found some tutorials:
MLM training on Tensorflow BUT from Scratch; I need pre-trained
MLM on pre-trained but in Pytorch; I need Tensorflow
Fine Tuning with Keras; It is for classification but I want MLM
I already have a fine tuning classification model. Some of the code is as follows:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-large-uncased')
def regular_encode(texts, tokenizer, maxlen=maxlen):
enc_di = tokenizer.batch_encode_plus(texts, return_token_type_ids=False,padding='max_length',max_length=maxlen,truncation = True,)
return np.array(enc_di['input_ids'])
Xtrain_encoded = regular_encode(X_train.astype('str'), tokenizer, maxlen=maxlen)
ytrain_encoded = tf.keras.utils.to_categorical(y_train, num_classes=classes,dtype = 'int32')
def build_model(transformer, loss='categorical_crossentropy', max_len=maxlen, dense = 512, drop1 = 0.3, drop2 = 0.3):
input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
sequence_output = transformer(input_word_ids)[0]
cls_token = sequence_output[:, 0, :]
#Fine Tuning Model Start
x = tf.keras.layers.Dropout(drop1)(cls_token)
x = tf.keras.layers.Dense(512, activation='relu')(x)
x = tf.keras.layers.Dropout(drop2)(x)
out = tf.keras.layers.Dense(classes, activation='softmax')(x)
model = tf.keras.Model(inputs=input_word_ids, outputs=out)
return model
Only useful thing I could get was in the HuggingFace that
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it as a PyTorch model (or vice-versa)
from transformers import AutoModelForSequenceClassification
model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
So maybe I could train a pytorch MLM and then load it as a tensorflow fine tuned classification model?
Is there any other way?
I think after some research, I found something. I don't know if it'll work but it uses transformer and Tensorflow with XLM . I think it'll work for BERT too.
PRETRAINED_MODEL = 'jplu/tf-xlm-roberta-large'
from tensorflow.keras.optimizers import Adam
import transformers
from transformers import TFAutoModelWithLMHead, AutoTokenizer
def create_mlm_model_and_optimizer():
with strategy.scope():
model = TFAutoModelWithLMHead.from_pretrained(PRETRAINED_MODEL)
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
return model, optimizer
mlm_model, optimizer = create_mlm_model_and_optimizer()
def define_mlm_loss_and_metrics():
with strategy.scope():
mlm_loss_object = masked_sparse_categorical_crossentropy
def compute_mlm_loss(labels, predictions):
per_example_loss = mlm_loss_object(labels, predictions)
loss = tf.nn.compute_average_loss(
per_example_loss, global_batch_size = global_batch_size)
return loss
train_mlm_loss_metric = tf.keras.metrics.Mean()
return compute_mlm_loss, train_mlm_loss_metric
def masked_sparse_categorical_crossentropy(y_true, y_pred):
y_true_masked = tf.boolean_mask(y_true, tf.not_equal(y_true, -1))
y_pred_masked = tf.boolean_mask(y_pred, tf.not_equal(y_true, -1))
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true_masked,
y_pred_masked,
from_logits=True)
return loss
def train_mlm(train_dist_dataset, total_steps=2000, evaluate_every=200):
step = 0
### Training lopp ###
for tensor in train_dist_dataset:
distributed_mlm_train_step(tensor)
step+=1
if (step % evaluate_every == 0):
### Print train metrics ###
train_metric = train_mlm_loss_metric.result().numpy()
print("Step %d, train loss: %.2f" % (step, train_metric))
### Reset metrics ###
train_mlm_loss_metric.reset_states()
if step == total_steps:
break
#tf.function
def distributed_mlm_train_step(data):
strategy.experimental_run_v2(mlm_train_step, args=(data,))
#tf.function
def mlm_train_step(inputs):
features, labels = inputs
with tf.GradientTape() as tape:
predictions = mlm_model(features, training=True)[0]
loss = compute_mlm_loss(labels, predictions)
gradients = tape.gradient(loss, mlm_model.trainable_variables)
optimizer.apply_gradients(zip(gradients, mlm_model.trainable_variables))
train_mlm_loss_metric.update_state(loss)
compute_mlm_loss, train_mlm_loss_metric = define_mlm_loss_and_metrics()
Now train it as train_mlm(train_dist_dataset, TOTAL_STEPS, EVALUATE_EVERY)
Above ode is from this notebook and you need to do all the necessary things given exactly
The author says in the end that:
This fine tuned model can be loaded just as the original to build a classification model from it

NLP: How to train a spaCy NER Model using GoldParse objects

I am trying to train a spaCy NER model using GoldParse objects. This is what I have done:
Adding extra labels to NER model
add_ents = ['A1', 'B1', 'C1', 'D1', 'E1', 'F1', 'G1'] # sample labels
# Create a pipe if it does not exist
if "ner" not in nlp.pipe_names:
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
else:
ner = nlp.get_pipe("ner")
for e in add_ents:
ner.add_label(e)
Training the NER Model
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
model = None # Since we training a fresh model not a saved model
with nlp.disable_pipes(*other_pipes): # only train ner
if model is None:
optimizer = nlp.begin_training()
else:
optimizer = nlp.resume_training()
for i in range(20):
loss = {}
nlp.update(X, y, sgd=optimizer, drop=0.0, losses=loss)
print("Loss: ", loss)
Here X is a list of Doc objects and y is a list of corresponding GoldParse objects. While executing I am running into the following error:
nn_parser.pyx in spacy.syntax.nn_parser.Parser.update()
nn_parser.pyx in spacy.syntax.nn_parser.Parser._init_gold_batch()
ner.pyx in spacy.syntax.ner.BiluoPushDown.preprocess_gold()
ner.pyx in spacy.syntax.ner.BiluoPushDown.lookup_transition()
ValueError: 'A1' is not in list
I tried searching for the solution but couldn't find anything relevant. Is there a way to fix this issue?

How to save my own trained word embedding model (Using Keras) like word2vec and Glove

I trained my own word embedding model using my own data set. Now, after training the Keras Embeddings layer, I want to save the model in a text file so I can use for later classification of the unknown data set.
Source Code: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/#comment-507619:
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
However, I tried to save the model using a different format (.txt),
model.save('My_Model.txt')
however, the problem is that when I open the file, the data seems to have weird characters.
I also got Embedding weights by this:
How I could save my model in the same format of Glove and word2vec, the word followed by its vector..?
Please guide me, how to save it like
the -6.07421696e-02, -3.59963439e-02, 4.38806415e-02, -3.08707543e-02, -1.51436431e-02, -5.56904338e-02, 5.70665635e-02, -8.34236890e-02
As I understand you don't need to save exactly the model, but need to save pre-trained embeddings. I a bit adjust your code:
import numpy as np
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = np.array(['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.'])
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])
# train the tokenizer
vocab_size = 50
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(docs)
# encode the sentences
encoded_docs = tokenizer.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length, name='embeddings'))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# save embeddings
embeddings = model.get_layer('embeddings').get_weights()[0]
w2v_my = {}
for word, index in tokenizer.word_index.items():
w2v_my[word] = embeddings[index]
print(w2v_my['good'])
Please note, that you can find embeddings only for the words, that were present in the training dataset. In the current example, you will not find the embeddings for the word "the".
If you need to be able to get embeddings for unknown words, you can use Tokenizer with char_level.
To be able to store and re-use embeddings you can use numpy:
np.savez('embeddings.npz', **w2v_my)
embeddings = np.load('embeddings.npz')
# check existing words (keys)
print(embeddings.files)
# get embeddings for some specific word
print(embeddings['done'])

How to train any Hugging face transformer model (eg DistilBERT) for question answer from scratch using Tensorflow backend?

I want to understand how to train a hugging face transformer model (like BERT, DistilBERT, etc) for the question-answer system and TensorFlow as backend. Following is the logic that I am currently using (but I am not sure whether is it right approach):
I am using SQuAD v1.1 dataset.
In SQuAd dataset answer to any question is always present in context. So to put in simple words I am trying to predict start index and end index and answer.
I have transformed the dataset for the same purpose. I have added the start index and end index of on word level after performing tokenization. Here is how my dataset looks,
Next I am encoded question and context as per hugging face docs guide and returing input_ids, attention_ids and token_type_ids; which will be used as input to model.
def tokenize(questions, contexts):
input_ids, input_masks, input_segments = [],[],[]
for question,context in tqdm_notebook(zip(questions, contexts)):
inputs = tokenizer.encode_plus(question,context, add_special_tokens=True, max_length=512, pad_to_max_length=True,return_attention_mask=True, return_token_type_ids=True )
input_ids.append(inputs['input_ids'])
input_masks.append(inputs['attention_mask'])
input_segments.append(inputs['token_type_ids'])
return [np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')]
Finally I define a Keras model which takes this three input and predict two value, start and end word index of answer from given context.
input_ids_in = tf.keras.layers.Input(shape=(512,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(512,), name='masked_token', dtype='int32')
input_segment_in = tf.keras.layers.Input(shape=(512,), name='segment_token', dtype='int32')
embedding_layer = transformer_model({'inputs':input_ids_in,'attention_mask':input_masks_in,
'token_type_ids':input_segment_in})[0]
X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)
start_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
start_branch = tf.keras.layers.Dropout(0.3)(start_branch)
start_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='start_branch')(start_branch)
end_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
end_branch = tf.keras.layers.Dropout(0.3)(end_branch)
end_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='end_branch')(end_branch)
model = tf.keras.Model(inputs=[input_ids_in, input_masks_in, input_segment_in], outputs = [start_branch_output, end_branch_output])
I am using last softmax layer with 512 units as it is my max no of words I my aim is to predict index dromit.

Keras autoencoder and getting the compressed feature vector representation

I have one sentence per row in a file and sentences are not more than 30 words. I am building an autoencoder using Keras and I am very new to this - so I may be doing few things incorrectly. So, help me out here.
I am trying to use autoencoder to get the intermediate context vector - the compressed feature vectors after the encode step.
Vocabulary is nothing but a list of distinct words in my file. 300 is the dimension of word embedding matrix. 30 is the maximum words each sentence can have. X_train is (#of sentence, 30) matrix of numbers where each number is nothing but where in the dictionary the word existed.
print len(vocabulary)
model = Sequential()
model.add(Embedding(len(vocabulary), 300))
model.compile('rmsprop', 'mse')
input_i = Input(shape=(30, 300))
encoded_h1 = Dense(64, activation='tanh')(input_i)
encoded_h2 = Dense(32, activation='tanh')(encoded_h1)
encoded_h3 = Dense(16, activation='tanh')(encoded_h2)
encoded_h4 = Dense(8, activation='tanh')(encoded_h3)
encoded_h5 = Dense(4, activation='tanh')(encoded_h4)
latent = Dense(2, activation='tanh')(encoded_h5)
decoder_h1 = Dense(4, activation='tanh')(latent)
decoder_h2 = Dense(8, activation='tanh')(decoder_h1)
decoder_h3 = Dense(16, activation='tanh')(decoder_h2)
decoder_h4 = Dense(32, activation='tanh')(decoder_h3)
decoder_h5 = Dense(64, activation='tanh')(decoder_h4)
output = Dense(300, activation='tanh')(decoder_h5)
autoencoder = Model(input_i,output)
autoencoder.compile('adadelta','mse')
X_embedded = model.predict(X_train)
autoencoder.fit(X_embedded,X_embedded,epochs=10, batch_size=256, validation_split=.1)
print autoencoder.summary()
The idea is taken from Keras - Autoencoder for Text Analysis
So, after training (if I have done correctly) how should I just run the encoding part for each sentence to get the feature representation? Help is appreciated. Thanks!
make a standalone model for encoder
encoder=Model(input_i,latent)
suppose for mnist data the code should be like--
encoder.predict(x_train[0])
by this you will get latent_space vector as output
To do this, refer to 'popping' off last layer via model.pop(). After training "pop off" the last layer by using model.pop(), then use model.predict(X_train) to get the representation.
https://keras.io/getting-started/faq/#how-can-i-remove-a-layer-from-a-sequential-model