I trained a neural network with TensorFlow and extracted the weights from the embedding layer to make an array of embeddings. I wrote them to a .txt file, but I can't read it with KeyedVectors.
Generate the file
import io
out_vectors = io.open('vecs.txt', 'w', encoding='utf-8')
out_vectors.write(f'{vocab_size} {embedding_dim}')
for word_num in range(1, vocab_size):
    word = reversed_word_index[word_num]
    embeddings = weights[word_num]
    line_embedding = '\n' + word + '\t' + '\t'.join([str(x) for x in embeddings])
    out_vectors.write(line_embedding)
out_vectors.close()
Read the file
from gensim.models import KeyedVectors
model_embedding = KeyedVectors.load_word2vec_format('vecs.txt')
Try adding the mmap keyword argument to the loading call; it should look like:
model_embedding = KeyedVectors.load("vecs.txt", mmap="r")
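A likely reason the read fails (my assumption, not stated in the original answer): load_word2vec_format expects the plain word2vec text format, i.e. a header line "count dim" followed by one space-separated row per word, and the header count has to match the number of rows actually written. A minimal sketch of writing the file in that format, reusing the vocab_size, embedding_dim, weights and reversed_word_index names from the question:
import io

with io.open('vecs.txt', 'w', encoding='utf-8') as out_vectors:
    # range(1, vocab_size) writes vocab_size - 1 rows, so that is the count in the header.
    out_vectors.write(f'{vocab_size - 1} {embedding_dim}\n')
    for word_num in range(1, vocab_size):
        word = reversed_word_index[word_num]
        embeddings = weights[word_num]
        out_vectors.write(word + ' ' + ' '.join(str(x) for x in embeddings) + '\n')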
I coded a simple AI chatbot with TensorFlow and tflearn, and it runs just fine, but there is an issue when the user inputs the wrong thing: the bot is supposed to say it doesn't understand if the prediction accuracy is less than 70%, but the bot always scores above that even if the user types gibberish like "rjrigrejfr". The bot assumes they're greeting it. The patterns it's supposed to study in the JSON are "patterns": ["Hi", "How are you", "Wassup", "Hello", "Good day", "Waddup", "Yo"]. I can share the JSON file if needed; it's short. Anyway, this is the Python code:
import numpy as np
import nltk
import tensorflow
import tflearn
import random
import json
import pickle
# Some extra configuration:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
nltk.download('punkt')
# Load the data from the json file into a variable.
with open("intents.json") as file:
data = json.load(file)
# If we already have saved data, we do not need to retrain the model and waste time (could develop into an issue in more complex programs. Save in pickle. )
try:
    with open("data.pickle", "rb") as f:  # rb stands for read bytes.
        words, labels, training, output = pickle.load(f)
# --- Pre-training data preparation ---
except:
    words = []
    docsx = []  # Stores patterns
    docsy = []  # Stores intents
    labels = []  # All the specific tag values such as greeting, contact, etc.
    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            w = nltk.word_tokenize(pattern)  # nltk function that splits the sentences inside the intent into a word list.
            words.extend(w)  # Add the tokenized list to the words list.
            docsx.append(w)
            docsy.append(intent["tag"])  # Append the classification of the sentence.
        if intent["tag"] not in labels:
            labels.append(intent["tag"])
    words = [stemmer.stem(w.lower()) for w in words if w not in ".?!"]  # Stem the words to remove unnecessary elements, leaving their root. Convert all to lowercase.
    words = sorted(list(set(words)))  # Set ensures no duplicate elements, then we convert back to a list and sort it.
    labels = sorted(labels)
    training = []
    output = []
    out_empty = [0 for i in range(len(labels))]  # Gives a list of 0 ints based on # of tags. This is useful later in the program when binarizing.
    # One-hot encoding the intent categories. We need to one-hot encode ("binarize") the data, which improves the efficiency of the ML.
    # In this case, we have a list of 0s and 1s: if the word appears it is assigned a 1, else a 0.
    for x, doc in enumerate(docsx):
        bag = []  # Bag of words, i.e. the one-hot coded data for the ML.
        docx_word_stemmed = [stemmer.stem(word) for word in doc]  # Stemming the data in docsx.
        # Now adding and transforming data into the one-hot coded list / bag-of-words data.
        for i in words:
            if i in docx_word_stemmed:  # Checking against stemmed words:
                # Word exists
                bag.append(1)
            else:
                bag.append(0)
        output_row = out_empty[:]  # Copying out_empty.
        # Going through the labels list using .index() and, for the docsy value of this pattern, assigning binary 1.
        output_row[labels.index(docsy[x])] = 1
        training.append(bag)
        output.append(output_row)
    # numpy arrays are required for use in tflearn. It is also faster.
    training = np.array(training)
    output = np.array(output)
    # Saving the data so we do not need to do the data configuration every time.
    with open("data.pickle", "wb") as f:
        pickle.dump((words, labels, training, output), f)
try:
    model.load('model.tflearn')
except:
    tensorflow.compat.v1.reset_default_graph()
    net = tflearn.input_data(shape=[None, len(training[0])])
    net = tflearn.fully_connected(net, 8)
    net = tflearn.fully_connected(net, 8)
    net = tflearn.fully_connected(net, len(output[0]), activation='softmax')
    net = tflearn.regression(net)
    model = tflearn.DNN(net)
    model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
    model.save("model.tflearn")
def bagofwords(sentence, words):
    bag = [0 for _ in range(len(words))]  # Blank bag of words.
    # Tokenize the sentence and then stem it.
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    for string in sentence_words:
        for i, word in enumerate(words):
            if word == string:
                bag[i] = 1
    return np.array(bag)
def chat():
    print("Hello there! I'm the SRO AI Virtual Assistant. How may I help you?")
    # Figure out the error slime!
    while True:
        user_input = input("Type here:")
        if user_input == "quit":
            break
        result = model.predict([bagofwords(user_input, words)])[0]  # bagofwords func and predict function to give predictions on what the user is saying.
        best_result = np.argmax(result)  # We want to only use the best result.
        tag = labels[best_result]
        print(result[best_result])
        # Open the JSON file and pick a response.
        if result[best_result] > 0.7:
            for tg in data["intents"]:
                if tg['tag'] == tag:
                    responses = tg['responses']
            print(random.choice(responses))
        else:
            print("I don't quite understand")
chat()
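For context on why the score stays high (a hedged illustration, not from the original post): a softmax over a handful of intent classes tends to produce a large maximum probability even for inputs that look nothing like the training data, since there is no "unknown" class to absorb them. A small numeric sketch:
import numpy as np

# Hypothetical logits for three intents; the gap between them is modest,
# yet the winning class already clears the 0.7 threshold.
logits = np.array([2.0, 0.1, 0.2])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.max())  # ~0.76, above 0.7 despite an uninformative input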
I'm trying to get a "tf.transform encoded dict" with this tfx.components.Transform function.
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_taxi_transform_module_file),
    instance_name="taxi")
context.run(transform)
I need a dict like this: "a dict of the data you load ({feature_name: feature_value})".
Transform, as mentioned above, gives me a TFRecord file. How can I decode it properly?
Any help would be appreciated.
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    NUMERIC_FEATURE_KEYS = ['PetalLengthCm', 'PetalWidthCm',
                            'SepalLengthCm', 'SepalWidthCm']
    TARGET_FEATURES = "Species"
    outputs = inputs.copy()
    del outputs['Id']
    for key in NUMERIC_FEATURE_KEYS:
        outputs[key] = tft.scale_to_0_1(outputs[key])
    return outputs
Write a module like this; I have written one for the Iris dataset and it's simple to understand. You can do the same for your dataset, and the transformed data will be saved as a TFRecord dataset.
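To address the decoding part of the question, here is a hedged sketch (the paths are placeholders I made up, not from the post): the Transform component writes gzip-compressed TFRecord files of serialized tf.Example protos, and the feature spec saved with the transform graph can be used to parse them back into a {feature_name: feature_value} dict:
import tensorflow as tf
import tensorflow_transform as tft

# Placeholder paths: point these at the Transform component's
# transform_graph and transformed_examples artifact URIs.
tf_transform_output = tft.TFTransformOutput('/path/to/transform_graph')
feature_spec = tf_transform_output.transformed_feature_spec()

dataset = tf.data.TFRecordDataset(
    tf.io.gfile.glob('/path/to/transformed_examples/train/*'),
    compression_type='GZIP')

for serialized in dataset.take(1):
    example_dict = tf.io.parse_single_example(serialized, feature_spec)
    print(example_dict)  # {feature_name: feature_value}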
I am training a custom NER model from scratch using the spacy.blank("en") model. I add custom word vectors to it. The vectors are loaded as follows:
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors
med_vec = KeyedVectors.load_word2vec_format('./wikipedia-pubmed-and-PMC-w2v.bin', binary=True, limit = 300000)
and I add it to the blank model in this code snippet here:
def main(model=None, n_iter=3, output_dir=None):
    """Set up the pipeline and entity recognizer, and train the new entity."""
    random.seed(0)
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        nlp.vocab.reset_vectors(width=200)
        for idx in range(len(med_vec.index2word)):
            word = med_vec.index2word[idx]
            vector = med_vec.vectors[idx]
            nlp.vocab.set_vector(word, vector)
        for key, vector in nlp.vocab.vectors.items():
            nlp.vocab.strings.add(nlp.vocab.strings[key])
        nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
        print("Created blank 'en' model")
......Code for training the ner
I then save this model.
When I try to load the model,
nlp = spacy.load("./NDLA/vectorModel0")
I get the following error:
`~\AppData\Local\Continuum\anaconda3\lib\site-packages\thinc\neural\_classes\static_vectors.py in __init__(self, lang, nO, drop_factor, column)
47 if self.nM == 0:
48 raise ValueError(
---> 49 "Cannot create vectors table with dimension 0.\n"
50 "If you're using pre-trained vectors, are the vectors loaded?"
51 )
ValueError: Cannot create vectors table with dimension 0.
If you're using pre-trained vectors, are the vectors loaded?
I also get this warning:
UserWarning: [W019] Changing vectors name from spacy_pretrained_vectors to spacy_pretrained_vectors_336876, to avoid clash with previously loaded vectors. See Issue #3853.
"__main__", mod_spec)
The vocab directory in the model has a vectors file of size 270 MB. So I know it is not empty... What is causing this error?
You could try to pass all vectors at once instead of using a for loop.
nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.syn0, keys=med_vec.vocab.keys())
So your else statement would become like this:
else:
    nlp = spacy.blank("en")  # create blank Language class
    nlp.vocab.reset_vectors(width=200)
    nlp.vocab.vectors = spacy.vocab.Vectors(data=med_vec.syn0, keys=med_vec.vocab.keys())
    nlp.vocab.vectors.name = 'spacy_pretrained_vectors'
    print("Created blank 'en' model")
I have books in PDF and I want to do NLP tasks such as preprocessing, tf-idf calculation, word2vec, etc. on those books. So I converted them into .txt files and tried to get tf-idf scores. Previously I performed tf-idf on a CSV file, so I made some changes to that code and tried to use it for the .txt file, but I have been unsuccessful.
Below is my code:
import pandas as pd
import numpy as np
from itertools import islice
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
data = open('jungle book.txt', 'r+')
# print(data.read())
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec.fit(data)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
cvec_count = cvec.transform(data)
print('Sparse Matrix Shape : ', cvec_count.shape)
print('Non Zero Count : ', cvec_count.nnz)
print('sparsity: %.2f%%' % (100 * cvec_count.nnz / (cvec_count.shape[0] * cvec_count.shape[1])))
occ = np.asarray(cvec_count.sum(axis=0)).ravel().tolist()
count_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences' : occ})
term_freq = count_df.sort_values(by='occurrences', ascending=False).head(20)
print(term_freq)
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_count)
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weight_df = pd.DataFrame({'term' : cvec.get_feature_names(), 'weight' : weights})
tf_idf = weight_df.sort_values(by='weight', ascending=False).head(20)
print(tf_idf)
This code works up to print('Non Zero Count : ', cvec_count.nnz), printing:
Sparse Matrix Shape : (0, 7132)
Non Zero Count : 0
Then it is giving error:
ZeroDivisionError: division by zero
Even if I run this code ignoring the ZeroDivisionError, it is still wrong, as it is not counting any frequencies.
I have no idea how to work with a .txt file. What is the proper way to work with .txt files for NLP tasks?
Thanks in advance!
You are getting the error because the data variable ends up empty: data is a file object, and cvec.fit(data) reads through it line by line, so by the time cvec.transform(data) is called the file is already exhausted and yields nothing. Just opening the text file is not enough. You have to read the contents into a string variable and then do the preprocessing on that variable. Try replacing
data = open('jungle book.txt', 'r+')
# print(data.read())
with
with open('jungle book.txt', 'r') as file:
    data = file.read()
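One more hedged suggestion beyond the fix above (my assumption, not from the original answer): CountVectorizer expects an iterable of documents, and with the whole book as a single document idf has nothing to compare against (max_df=.5 also needs more than one document), so it may be worth splitting the text into chunks, for example on blank lines, before vectorizing:
from sklearn.feature_extraction.text import CountVectorizer

with open('jungle book.txt', 'r') as file:
    text = file.read()

# Treat each blank-line-separated chunk (roughly a paragraph) as one document.
documents = [chunk for chunk in text.split('\n\n') if chunk.strip()]

cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1, 2))
cvec_count = cvec.fit_transform(documents)
print(cvec_count.shape)  # (number_of_chunks, vocabulary_size)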
Hello, can somebody explain step by step what's happening in the following code?
Especially the part about classes and the reshape? Thanks.
import h5py
import numpy as np

def load_data():
    train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:])  # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:])  # your train set labels

    test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:])  # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:])  # your test set labels

    classes = np.array(test_dataset["list_classes"][:])  # the list of classes

    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))

    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
Most of the lines just load datasets from the h5 file. The np.array(...) wrapper isn't needed. test_dataset[name][:] is sufficient to load an array.
test_set_y_orig = test_dataset["test_set_y"][:]
test_dataset is the opened file. test_dataset["test_set_y"] is a dataset in that file. The [:] loads the dataset into a numpy array. Look up the h5py docs for more details on loading a dataset.
I deduce from
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
that the array, as loaded, is 1d with shape (n,), and this reshape is just adding an initial dimension, making it (1, n). I would have coded it as
train_set_y_orig = train_set_y_orig[None,:]
but the result is the same.
There's nothing special about the classes array (though it might well be an array of strings).
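To make the equivalence concrete, a tiny demonstration with a made-up label array (not from the original post):
import numpy as np

y = np.array([0, 1, 1, 0])                     # shape (4,), like the labels as loaded
a = y.reshape((1, y.shape[0]))                 # shape (1, 4)
b = y[None, :]                                 # shape (1, 4)
print(a.shape, b.shape, np.array_equal(a, b))  # (1, 4) (1, 4) True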