I am training a custom NER model from scratch using the spacy.blank("en") model. I add custom word vectors to it. The vectors are loaded as follows:
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
med_vec = KeyedVectors.load_word2vec_format('./wikipedia-pubmed-and-PMC-w2v.bin', binary=True, limit = 300000)
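As a quick sanity check (my assumption being that these PubMed vectors are 200-dimensional), gensim's reported dimensionality should match the width passed to reset_vectors() below:
print(med_vec.vector_size)       # expected: 200, matching reset_vectors(width=200)
print(len(med_vec.index2word))   # expected: 300000, the `limit` used above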
I add them to the blank model in this code snippet:
def main(model=None, n_iter=3, output_dir=None):
"""Set up the pipeline and entity recognizer, and train the new entity."""
random.seed(0)
if model is not None:
nlp = spacy.load(model) # load existing spaCy model
print("Loaded model '%s'" % model)
else:
nlp = spacy.blank("en") # create blank Language class
nlp.vocab.reset_vectors(width=200)
for idx in range(len(med_vec.index2word)):
word = med_vec.index2word[idx]
vector = med_vec.vectors[idx]
nlp.vocab.set_vector(word, vector)
for key, vector in nlp.vocab.vectors.items():
nlp.vocab.strings.add(nlp.vocab.strings[key])
nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
print("Created blank 'en' model")
......Code for training the NER
I then save this model.
When I try to load the model,
nlp = spacy.load("./NDLA/vectorModel0")
I get the following error:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\thinc\neural\_classes\static_vectors.py in __init__(self, lang, nO, drop_factor, column)
47 if self.nM == 0:
48 raise ValueError(
---> 49 "Cannot create vectors table with dimension 0.\n"
50 "If you're using pre-trained vectors, are the vectors loaded?"
51 )
ValueError: Cannot create vectors table with dimension 0.
If you're using pre-trained vectors, are the vectors loaded?
I also get this warning:
UserWarning: [W019] Changing vectors name from spacy_pretrained_vectors to spacy_pretrained_vectors_336876, to avoid clash with previously loaded vectors. See Issue #3853.
"__main__", mod_spec)
The vocab directory in the model has a vectors file of size 270 MB. So I know it is not empty... What is causing this error?
You could try to pass all vectors at once instead of using a for loop.
nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.syn0, keys=med_vec.vocab.keys())
So your else statement would become like this:
else:
nlp = spacy.blank("en") # create blank Language class
nlp.vocab.reset_vectors(width=200)
nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.syn0, keys=med_vec.vocab.keys())
nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
print("Created blank 'en' model")
I coded a simple AI chatbot with TensorFlow and tflearn and it runs just fine, but there is an issue: when the user inputs the wrong thing, the bot is supposed to say it doesn't understand if the prediction accuracy is less than 70%, but the bot always scores above that even if the user gives gibberish like "rjrigrejfr", which it assumes is a greeting. The patterns it is supposed to study in the JSON are "patterns": ["Hi", "How are you", "Wassup", "Hello", "Good day", "Waddup", "Yo"]. I can share the JSON file if needed; it's short. Anyway, this is the Python code:
import numpy as np
import nltk
import tensorflow
import tflearn
import random
import json
import pickle
# Some extra configuration:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
nltk.download('punkt')
# Load the data from the json file into a variable.
with open("intents.json") as file:
data = json.load(file)
# If we already have saved data, we do not need to retrain the model and waste time (could develop into an issue in more complex programs. Save in pickle. )
try:
with open("data.pickle", "rb") as f: # rb stands for bytes.
words, labels, training, output = pickle.load(f)
# --- Pre-training data preparation ---
except:
words = []
docsx = [] # Stores patterns
docsy = [] # Stores intents
labels = [] # All the specific tag values such as greeting, contact, etc.
for intent in data["intents"]:
for pattern in intent["patterns"]:
w = nltk.word_tokenize(pattern) # nltk function that splits the sentences inside intent into words list.
words.extend(w) # Add the tokenized list to words list.
docsx.append(w)
docsy.append(intent["tag"]) # append the classification of the sentence
if intent["tag"] not in labels:
labels.append(intent["tag"])
words = [stemmer.stem(w.lower()) for w in words if w not in ".?!"] # Stemming the words to remove unnecessary elements leaving their root. Convert all to lowercase.
words = sorted(list(set(words))) # Set ensures no duplicate elements then we convert back to list and sort it.
labels = sorted(labels)
training = []
output = []
    out_empty = [0 for i in range(len(labels))]  # A list of 0s, one per tag; used below when binarizing.
    # One-hot encoding the intent categories: "binarizing" the data is what the network trains on.
    # In this case, we build a list of 0s and 1s: if a word appears it is assigned a 1, else a 0.
for x, doc in enumerate(docsx):
bag = [] # Bag of words or the one-hot coded data for the ML.
docx_word_stemmed = [stemmer.stem(word) for word in doc] # Stemming the data in docx.
# Now adding and transforming data into the one-hot coded list/bag of words data.
for i in words:
if i in docx_word_stemmed: # Checking against stemmed words:
# Word exists
bag.append(1)
else:
bag.append(0)
output_row = out_empty[:] # Copying out_empty
        # Go through the labels list using .index() and set a 1 at the position of this pattern's tag (docsy[x]).
output_row[labels.index(docsy[x])] = 1
training.append(bag)
output.append(output_row)
# Required to use numpy arrays for use in tflearn. It is also faster.
training = np.array(training)
output = np.array(output)
# Saving the data so we do not need to do the data configuration every time.
with open("data.pickle", "wb") as f:
pickle.dump((words, labels, training, output), f)
# Build the network first, then try to load saved weights; only retrain if loading fails.
tensorflow.compat.v1.reset_default_graph()
net = tflearn.input_data(shape=[None, len(training[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(output[0]), activation='softmax')
net = tflearn.regression(net)
model = tflearn.DNN(net)
try:
    model.load('model.tflearn')
except Exception:
    model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
    model.save("model.tflearn")
def bagofwords(sentence, words):
bag = [0 for _ in range(len(words))] # blank bag of words.
    # Tokenize the sentence and then stem it.
sentence_words = nltk.word_tokenize(sentence)
sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
for string in sentence_words:
for i, word in enumerate(words):
if word == string:
bag[i] = 1
return np.array(bag)
def chat():
print("Hello there! I'm the SRO AI Virtual Assistant. How am I help you?")
# Figure out the error slime!
while True:
user_input = input("Type here:")
if user_input == "quit":
break
result = model.predict([bagofwords(user_input, words)])[0] #bagofwords func and predict function to give predictions on what the user is saying.
best_result = np.argmax(result) # We want to only use the best result.
tag = labels[best_result]
print(result[best_result])
# Open JSON file and pick a response.
if result[best_result] > 0.7:
for tg in data["intents"]:
if tg['tag'] == tag:
responses = tg['responses']
print(random.choice(responses))
else:
print("I don't quite understand")
chat()
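A side note on why the 70% threshold may never trigger: with only a few tags, a softmax output must spread its probability mass over very few classes, so the maximum easily clears 0.7 even for inputs that match nothing. A toy illustration, independent of the chatbot code above:
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Even a mild preference among 3 tags concentrates probability quickly:
print(softmax(np.array([1.2, 0.3, 0.1])))  # ~ [0.57, 0.23, 0.19]
print(softmax(np.array([2.0, 0.3, 0.1])))  # ~ [0.75, 0.14, 0.11]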
Recently, I have been training an LSTM with an attention mechanism for regression in TensorFlow 2.9, and I ran into a problem during training with model.fit():
At the beginning, the training time is fine, around 7s/step. However, it keeps increasing during training, and after a number of steps, say 1000, it can reach 50s/step. Below is part of the code for my model:
class AttentionModel(tf.keras.Model):
def __init__(self, encoder_output_dim, dec_units, dense_dim, batch):
super().__init__()
self.dense_dim = dense_dim
self.batch = batch
encoder = Encoder(encoder_output_dim)
decoder = Decoder(dec_units,dense_dim)
self.encoder = encoder
self.decoder = decoder
def call(self, inputs):
        # Create a list to record the per-step results
tempt = list()
encoder_output, encoder_state = self.encoder(inputs)
new_features = np.zeros((self.batch, 1, 1))
dec_initial_state = encoder_state
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt.append(dec_result.logits)
new_features = dec_result.logits
dec_initial_state = dec_state
result=tf.concat(tempt,1)
return result
In the official documents for tf.function, I notice: "Don't rely on Python side effects like object mutation or list appends".
Since I use a dynamic Python list with append() to record the intermediate variables, I guess a new tf.Graph is added each time during training. Is this the reason my training keeps getting slower?
Additionally, what should I use instead of a Python list to avoid this? I tried a numpy.zeros matrix, but it leads to another problem:
tempt = np.zeros(shape=(1,6))
...
for i in range(6):
dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
tempt[i]=(dec_result.logits)
...
Cannot convert a symbolic tf.Tensor (decoder/dense_3/BiasAdd:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.
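A graph-friendly alternative to the Python list is tf.TensorArray, which the tf.function docs recommend for accumulating values in a loop. Below is a minimal sketch of how call() could be rewritten with it; the shapes in the comments assume dec_result.logits has shape (batch, 1, 1), as suggested by new_features above, and tf.zeros replaces np.zeros so the whole method stays symbolic:
def call(self, inputs):
    encoder_output, encoder_state = self.encoder(inputs)
    new_features = tf.zeros((self.batch, 1, 1))  # tf.zeros instead of np.zeros
    dec_initial_state = encoder_state
    # TensorArray is the graph-safe accumulator for values produced in a loop
    tempt = tf.TensorArray(tf.float32, size=6)
    for i in range(6):
        dec_inputs = DecoderInput(new_features=new_features, enc_output=encoder_output)
        dec_result, dec_state = self.decoder(dec_inputs, dec_initial_state)
        tempt = tempt.write(i, dec_result.logits)
        new_features = dec_result.logits
        dec_initial_state = dec_state
    # stack() -> (6, batch, 1, 1) under the assumed shapes; rearrange to
    # (batch, 6, 1) so the result matches the original tf.concat(tempt, 1)
    stacked = tf.transpose(tempt.stack(), perm=[1, 0, 2, 3])
    return tf.reshape(stacked, (self.batch, 6, 1))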
Problem
After copying weights from a pretrained model, I do not get the same output.
Description
tf2cv repository provides pretrained models in TF2 for various backbones. Unfortunately the codebase is of limited use to me because they use subclassing via tf.keras.Model which makes it very hard to extract intermediate outputs and gradients at will. I therefore embarked upon rewriting the codes for the backbones using the functional API. After rewriting the resnet architecture codes, I copied their weights into my model and saved them in SavedModel format. In order to test if it is correctly done, I gave an input to my model instance and theirs and the results were different.
My approaches to debugging the problem
I checked the number of trainable and non-trainable parameters and they are the same between my model instance and theirs.
I checked if all trainable weights have been copied which they have.
My present line of thinking
I think it might be possible that weights have not been copied to the correct layers. For example: Layer X and Layer Y might have weights of the same shape, and during weight copying the weights of Layer Y might have gone into Layer X and vice versa. This is only possible if I have not mapped the layer names between the two models properly.
However, I have checked exhaustively and have not found any error so far.
The Code
My code is attached below. Their (tfcv) code for resnet can be found here
Please note that resnet_orig in the following snippet is the same as here
My converted code can be found here
from vision.image import resnet as myresnet
from glob import glob
from loguru import logger
import tensorflow as tf
import resnet_orig
import re
import os
import numpy as np
from time import time
from copy import deepcopy
tf.random.set_seed(time())
models = [
'resnet10',
'resnet12',
'resnet14',
'resnetbc14b',
'resnet16',
'resnet18_wd4',
'resnet18_wd2',
'resnet18_w3d4',
'resnet18',
'resnet26',
'resnetbc26b',
'resnet34',
'resnetbc38b',
'resnet50',
'resnet50b',
'resnet101',
'resnet101b',
'resnet152',
'resnet152b',
'resnet200',
'resnet200b',
]
def zipdir(path, ziph):
# ziph is zipfile handle
for root, dirs, files in os.walk(path):
for file in files:
ziph.write(os.path.join(root, file),
os.path.relpath(os.path.join(root, file),
os.path.join(path, '..')))
def find_model_file(model_type):
model_files = glob('*.h5')
for m in model_files:
if '{}-'.format(model_type) in m:
return m
return None
def remap_our_model_variables(our_variables, model_name):
remapped = list()
reg = re.compile(r'(stage\d+)')
for var in our_variables:
newvar = var.replace(model_name, 'features/features')
stage_search = re.search(reg, newvar)
if stage_search is not None:
stage_search = stage_search[0]
newvar = newvar.replace(stage_search, '{}/{}'.format(stage_search,
stage_search))
newvar = newvar.replace('conv_preact', 'conv/conv')
newvar = newvar.replace('conv_bn','bn')
newvar = newvar.replace('logits','output1')
remapped.append(newvar)
remap_dict = dict([(x,y) for x,y in zip(our_variables, remapped)])
return remap_dict
def get_correct_variable(variable_name, trainable_variable_names):
for i, var in enumerate(trainable_variable_names):
if variable_name == var:
return i
logger.info('Uffff.....')
return None
layer_regexp_compiled = re.compile(r'(.*)\/.*')
model_files = glob('*.h5')
a = np.ones(shape=(1,224,224,3), dtype=np.float32)
inp = tf.constant(a, dtype=tf.float32)
for model_type in models:
logger.info('Model is {}.'.format(model_type))
model = eval('myresnet.{}(input_height=224,input_width=224,'
'num_classes=1000,data_format="channels_last")'.format(
model_type))
model2 = eval('resnet_orig.{}(data_format="channels_last")'.format(
model_type))
model2.build(input_shape=(None,224, 224,3))
model_name=find_model_file(model_type)
logger.info('Model file is {}.'.format(model_name))
original_weights = deepcopy(model2.weights)
if model_name is not None:
e = model2.load_weights(model_name, by_name=True, skip_mismatch=False)
print(e)
loaded_weights = deepcopy(model2.weights)
else:
logger.info('Pretrained model is not available for {}.'.format(
model_type))
continue
diff = [np.mean(x.numpy()-y.numpy()) for x,y in zip(original_weights,
loaded_weights)]
our_model_weights = model.weights
their_model_weights = model2.weights
assert (len(our_model_weights) == len(their_model_weights))
our_variable_names = [x.name for x in model.weights]
their_variable_names = [x.name for x in model2.weights]
remap_dict = remap_our_model_variables(our_variable_names, model_type)
new_weights = list()
for i in range(len(our_model_weights)):
our_name = model.weights[i].name
remapped_name = remap_dict[our_name]
source_index = get_correct_variable(remapped_name, their_variable_names)
new_weights.append(
model2.weights[source_index].value())
logger.debug('Copying from {} ({}) to {} ({}).'.format(
model2.weights[
source_index].name,
model2.weights[source_index].value().shape,
model.weights[
i].name,
model.weights[i].value().shape))
logger.info(len(new_weights))
logger.info('Setting new weights')
model.set_weights(new_weights)
logger.info('Finished setting new weights.')
their_output = model2(inp)
our_output = model(inp)
logger.info(np.max(their_output.numpy() - our_output.numpy()))
logger.info(diff) # This must be 0.0
break
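As an extra sanity check (a sketch reusing remap_dict, their_variable_names, and get_correct_variable defined above), one can verify after set_weights() that every target weight is bit-identical to the source tensor it was mapped from, which would rule out the swapped-layers hypothesis:
# Hypothetical post-copy verification, run right after model.set_weights(new_weights):
for w in model.weights:
    src_idx = get_correct_variable(remap_dict[w.name], their_variable_names)
    if not np.array_equal(w.numpy(), model2.weights[src_idx].numpy()):
        logger.warning('Copied value differs: {} vs {}'.format(
            w.name, model2.weights[src_idx].name))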
I am encountering a ValueError in my Python code when trying to fine-tune Hugging Face's distribution of the GPT-2 model. Specifically:
ValueError: Dimensions must be equal, but are 64 and 0 for
'{{node Equal_1}} = Equal[T=DT_FLOAT, incompatible_shape_error=true](Cast_18, Cast_19)'
with input shapes: [64,0,1024], [2,0,12,1024].
I have around 100 text files that I concatenate into a string variable called raw_text and then pass into the following function to create training and testing TensorFlow datasets:
def to_datasets(raw_text):
# split the raw text in smaller sequences
seqs = [
raw_text[SEQ_LEN * i:SEQ_LEN * (i + 1)]
for i in range(len(raw_text) // SEQ_LEN)
]
# set up Hugging Face GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
# tokenize the character sequences
tokenized_seqs = [
tokenizer(seq, padding="max_length", return_tensors="tf")["input_ids"]
for seq in seqs
]
# convert tokenized sequences into TensorFlow datasets
trn_seqs = tf.data.Dataset \
.from_tensor_slices(tokenized_seqs[:int(len(tokenized_seqs) * TRAIN_PERCENT)])
tst_seqs = tf.data.Dataset \
.from_tensor_slices(tokenized_seqs[int(len(tokenized_seqs) * TRAIN_PERCENT):])
def input_and_target(x):
return x[:-1], x[1:]
# map into (input, target) tuples, shuffle order of elements, and batch
trn_dataset = trn_seqs.map(input_and_target) \
.shuffle(SHUFFLE_BUFFER_SIZE) \
.batch(BATCH_SIZE, drop_remainder=True)
tst_dataset = tst_seqs.map(input_and_target) \
.shuffle(SHUFFLE_BUFFER_SIZE) \
.batch(BATCH_SIZE, drop_remainder=True)
return trn_dataset, tst_dataset
I then try to train my model, calling train_model(*to_datasets(raw_text)):
def train_model(trn_dataset, tst_dataset):
# import Hugging Face GPT-2 model
model = TFGPT2Model.from_pretrained("gpt2")
model.compile(
optimizer=tf.keras.optimizers.Adam(),
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=tf.metrics.SparseCategoricalAccuracy()
)
model.fit(
trn_dataset,
epochs=EPOCHS,
initial_epoch=0,
validation_data=tst_dataset
)
The ValueError is triggered on the model.fit() call. The variables in all-caps are settings pulled in from a JSON file. Currently, they are set to:
{
"BATCH_SIZE":64,
"SHUFFLE_BUFFER_SIZE":10000,
"EPOCHS":500,
"SEQ_LEN":2048,
"TRAIN_PERCENT":0.9
}
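Since the error message contains 0-sized dimensions ([64,0,1024] and [2,0,12,1024]), it may help to inspect the element shapes the datasets actually produce before calling fit() (a small sketch, assuming to_datasets as defined above):
trn_dataset, tst_dataset = to_datasets(raw_text)
print(trn_dataset.element_spec)   # the (input, target) structure handed to fit()
for x, y in trn_dataset.take(1):
    print(x.shape, y.shape)       # a 0-length axis here would explain the ValueError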
Any information regarding what this error means or ideas on how to resolve it would be greatly appreciated. Thank you!
I'm having the same problem, but when I change the batch size to 12 (the same as the n_layer parameter in the GPT-2 config file) it works.
I don't know why it works, but you can try it...
If you manage to solve it a different way, I would be glad to hear.
General Explanation:
My code runs fine, but the results are weird. I don't know whether the problem is with
the network structure,
or the way I feed the data to the network,
or anything else.
I have been struggling with this for several weeks, and so far I have changed the loss function, optimizer, data generator, etc., but I could not solve it. I appreciate any help.
If the following information is not enough, please let me know.
Field of study:
I am using tensorflow, keras for multiclass classification. The dataset has 36 binary human attributes. I have used resnet50, then for each part of the body (head, upper body, lower body, shoes, accessories) I have added a separate branch to the network. The network has 1 input image with 36 labels and 36 output nodes (36 dense layers with sigmoid activation).
Problem:
The problem is that the accuracy Keras reports is high, but the f1-score is very low or zero for most of the outputs (even when I use f1-score as a metric when compiling the network, the f1-score on validation is very bad).
After training, when I use the network in prediction mode, it always returns one/zero for some classes. This means the network is not able to learn (even when I use a weighted loss function or a focal loss function).
Why is this weird? Because state-of-the-art methods report high f1 scores even after the first epoch (e.g. https://github.com/chufengt/iccv19_attribute, which I have run on my PC and got good results after one epoch).
Parts of the Code:
print("setup model ...")
input_image = KL.Input(args.img_input_shape, name= "input_1")
C1, C2, C3, C4, C5 = resnet_graph(input_image, architecture="resnet50", stage5=False, train_bn=True)
output_layers = merged_model (input_features=C4)
model = Model(inputs=input_image, outputs=output_layers, name='SoftBiometrics_Model')
...
print("model compiling ...")
OPTIM = optimizers.Adadelta(lr=args.learning_rate, rho=0.95)
model.compile(optimizer=OPTIM, loss=binary_focal_loss(alpha=.25, gamma=2), metrics=['acc',get_f1])
plot_model(model, to_file='model.png')
...
img_datagen = ImageDataGenerator(rotation_range=6, width_shift_range=0.03, height_shift_range=0.03, brightness_range=[0.85,1.15], shear_range=0.06, zoom_range=0.09, horizontal_flip=True, preprocessing_function=preprocess_input_resnet, rescale=1/255.)
img_datagen_test = ImageDataGenerator(preprocessing_function=preprocess_input_resnet, rescale=1/255.)
def multiple_outputs(generator, dataframe, batch_size, x_col):
Gen = generator.flow_from_dataframe(dataframe=dataframe,
directory=None,
x_col = x_col,
y_col = args.Categories,
target_size = (args.img_input_shape[0],args.img_input_shape[1]),
class_mode = "multi_output",
classes=None,
batch_size = batch_size,
shuffle = True)
while True:
gnext = Gen.next()
        # return the image batch and 36 sets of labels
labels = gnext[1]
output_dict = {"{}_output".format(Category): np.array(labels[index]) for index, Category in enumerate(args.Categories)}
yield {'input_1':gnext[0]}, output_dict
trainGen = multiple_outputs (generator = img_datagen, dataframe=Train_df_img, batch_size=args.BATCH_SIZE, x_col="Train_Filenames")
testGen = multiple_outputs (generator = img_datagen_test, dataframe=Test_df_img, batch_size=args.BATCH_SIZE, x_col="Test_Filenames")
STEP_SIZE_TRAIN = len(Train_df_img["Train_Filenames"]) // args.BATCH_SIZE
STEP_SIZE_VALID = len(Test_df_img["Test_Filenames"]) // args.BATCH_SIZE
...
print("Fitting the model to the data ...")
history = model.fit_generator(generator=trainGen,
epochs=args.Number_of_epochs,
steps_per_epoch=STEP_SIZE_TRAIN,
validation_data=testGen,
validation_steps=STEP_SIZE_VALID,
callbacks= [chekpont],
verbose=1)
There is a possibility that you are passing a binary f1-score to the compile function. This should fix the problem:
pip install tensorflow-addons
...
import tensorflow_addons as tfa
f1 = tfa.metrics.F1Score(num_classes=36, average='micro')  # or average='macro'
model.compile(..., metrics=[f1])
You can read more about how f1-micro and f1-macro are calculated, and which one can be useful, here.
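For a quick feel for the difference, here is a toy example (hypothetical data) comparing the two averaging modes:
import numpy as np
import tensorflow_addons as tfa

y_true = np.array([[1, 0, 1], [0, 1, 0]], dtype=np.float32)
y_pred = np.array([[0.9, 0.2, 0.4], [0.1, 0.8, 0.3]], dtype=np.float32)
for avg in ('micro', 'macro'):
    f1 = tfa.metrics.F1Score(num_classes=3, average=avg, threshold=0.5)
    f1.update_state(y_true, y_pred)
    print(avg, float(f1.result()))  # micro pools TP/FP/FN over classes; macro averages per-class f1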
Somehow, the predict_generator() of Keras' model does not work as expected. I would rather loop through all test images one by one and get the prediction for each image in each iteration. I am using Plaid-ML Keras as my backend, and to get predictions I am using the following code.
import os
from PIL import Image
import keras
import numpy
print("Prediction result:")
dir = "/path/to/test/images"
files = os.listdir(dir)
correct = 0
total = 0
# dictionary mapping class indices to their labels.
classes = {
0:'This is Cat',
1:'This is Dog',
}
for file_name in files:
total += 1
image = Image.open(dir + "/" + file_name).convert('RGB')
image = image.resize((100,100))
    image = numpy.array(image) / 255.0
    image = numpy.expand_dims(image, axis=0)
pred = model.predict_classes([image])[0]
sign = classes[pred]
if ("cat" in file_name) and ("cat" in sign):
print(correct,". ", file_name, sign)
correct+=1
elif ("dog" in file_name) and ("dog" in sign):
print(correct,". ", file_name, sign)
correct+=1
print("accuracy: ", (correct/total))