TensorFlow and tflearn chatbot keeps getting a high probability even when the user input is wrong

I coded a simple AI chatbot with TensorFlow and tflearn and it runs just fine, but there is an issue: when the user inputs the wrong thing, the bot is supposed to say it doesn't understand if the predicted probability is less than 70%, yet it always scores above that, even when the user types gibberish like "rjrigrejfr". The bot then assumes they are greeting it. The patterns it is supposed to learn from in the JSON are "patterns": ["Hi", "How are you", "Wassup", "Hello", "Good day", "Waddup", "Yo"]. I can share the JSON file if needed; it is short. Anyway, this is the Python code:
import numpy as np
import nltk
import tensorflow
import tflearn
import random
import json
import pickle

# Some extra configuration:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
nltk.download('punkt')

# Load the data from the json file into a variable.
with open("intents.json") as file:
    data = json.load(file)

# If we already have saved data, we do not need to redo the preparation and waste time
# (this could become an issue in more complex programs). Save it with pickle.
try:
    with open("data.pickle", "rb") as f:  # "rb" stands for read bytes.
        words, labels, training, output = pickle.load(f)
# --- Pre-training data preparation ---
except:
    words = []
    docsx = []   # Stores patterns
    docsy = []   # Stores intents
    labels = []  # All the specific tag values such as greeting, contact, etc.

    for intent in data["intents"]:
        for pattern in intent["patterns"]:
            w = nltk.word_tokenize(pattern)  # nltk function that splits each pattern sentence into a list of words.
            words.extend(w)                  # Add the tokenized list to the words list.
            docsx.append(w)
            docsy.append(intent["tag"])      # Append the classification of the sentence.
        if intent["tag"] not in labels:
            labels.append(intent["tag"])

    words = [stemmer.stem(w.lower()) for w in words if w not in ".?!"]  # Stem the words down to their roots and convert them to lowercase.
    words = sorted(list(set(words)))  # set() removes duplicates; then convert back to a list and sort it.
    labels = sorted(labels)

    training = []
    output = []
    out_empty = [0 for i in range(len(labels))]  # A list of 0s, one per tag. Useful later when binarizing.

    # One-hot encode the intent categories. One-hot ("binarized") data improves the efficiency of the ML model.
    # In this case we build a list of 0s and 1s: if a word appears it is assigned a 1, otherwise a 0.
    for x, doc in enumerate(docsx):
        bag = []  # Bag of words, i.e. the one-hot encoded data for the ML model.
        docx_word_stemmed = [stemmer.stem(word) for word in doc]  # Stem the words of this pattern.
        # Build the one-hot encoded list / bag-of-words data.
        for i in words:
            if i in docx_word_stemmed:  # Check against the stemmed words:
                bag.append(1)           # Word exists.
            else:
                bag.append(0)
        output_row = out_empty[:]  # Copy out_empty.
        # Find the position of this pattern's tag in the labels list with .index() and set it to 1.
        output_row[labels.index(docsy[x])] = 1
        training.append(bag)
        output.append(output_row)

    # tflearn requires numpy arrays; they are also faster.
    training = np.array(training)
    output = np.array(output)

    # Save the data so we do not need to redo the preparation every time.
    with open("data.pickle", "wb") as f:
        pickle.dump((words, labels, training, output), f)

try:
    model.load('model.tflearn')
except:
    tensorflow.compat.v1.reset_default_graph()
    net = tflearn.input_data(shape=[None, len(training[0])])
    net = tflearn.fully_connected(net, 8)
    net = tflearn.fully_connected(net, 8)
    net = tflearn.fully_connected(net, len(output[0]), activation='softmax')
    net = tflearn.regression(net)
    model = tflearn.DNN(net)
    model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
    model.save("model.tflearn")


def bagofwords(sentence, words):
    bag = [0 for _ in range(len(words))]  # Blank bag of words.
    # Tokenize the sentence and then stem it.
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
    for string in sentence_words:
        for i, word in enumerate(words):
            if word == string:
                bag[i] = 1
    return np.array(bag)


def chat():
    print("Hello there! I'm the SRO AI Virtual Assistant. How may I help you?")
    while True:
        user_input = input("Type here:")
        if user_input == "quit":
            break
        result = model.predict([bagofwords(user_input, words)])[0]  # bagofwords + predict give a prediction for what the user is saying.
        best_result = np.argmax(result)  # We only want to use the best result.
        tag = labels[best_result]
        print(result[best_result])
        # Look up the matching intent in the JSON data and pick a response.
        if result[best_result] > 0.7:
            for tg in data["intents"]:
                if tg['tag'] == tag:
                    responses = tg['responses']
            print(random.choice(responses))
        else:
            print("I don't quite understand")


chat()
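One way to see why the threshold never triggers: the softmax layer spreads its probability over only a handful of intent classes, so it always picks some class with fairly high confidence, even when none of the known words appear in the input. Below is a minimal sketch of a cheap guard, not part of the original code; it reuses bagofwords, words, labels and model from the script above, and classify is a hypothetical helper name.

import numpy as np

def classify(user_input, threshold=0.7):
    bag = bagofwords(user_input, words)
    if bag.sum() == 0:
        # None of the known (stemmed) words appear, e.g. "rjrigrejfr":
        # the softmax output is meaningless here, so give up early.
        return None, 0.0
    probs = model.predict([bag])[0]
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None, float(probs[best])
    return labels[best], float(probs[best])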

Related

What is the most efficient way of creating a tf.dataset from multiple json.gz files with multiple text records?

I have thousands of json.gz files, each with a variety of information about scientific papers. For each file, I have to extract the relevant information - e.g. title and labels - to build a dataset and then transform it into a tf.dataset. However, this is quite inefficient, since I cannot filter by subject or shuffle the records in a single step.
I would like to read the files using tf.data.Dataset.interleave in order to shuffle them, but also to filter them according to specific labels.
Here is how I am doing it so far.
import gzip
import json
import datetime

import tensorflow as tf
import pandas as pd

# For relevant feature extraction. `themes` (a subject -> label id mapping) is defined elsewhere.
def load_file(file):
    with gzip.open(file, 'r') as fin:
        json_bytes = fin.read()
    json_str = json_bytes.decode('utf-8')
    bb = json.loads(json_str)
    bb = pd.json_normalize(bb, 'items',
                           ['indexed', ['title', 'publisher', 'type', 'indexed.date-parts', 'subject']],
                           errors='ignore')
    bb.dropna(subset=['title', 'publisher', 'type', 'indexed.date-parts', 'subject'], inplace=True)
    bb.subject = bb.subject.apply(
        lambda x: int(themes[list(set(x) & set(list(themes.keys())))[0]])
        if len(list(set(x) & set(list(themes.keys())))) > 0
        else len(list(themes.keys())) + 1)
    bb.title = bb.title.str.join('').values
    bb['indexed.date-parts'] = bb['indexed.date-parts'].apply(
        lambda tpl: datetime.datetime.strptime('-'.join(str(x) for x in tpl[0]), '%Y-%m-%d').strftime('%Y-%m-%d'))
    return dict(bb[['title', 'publisher', 'type', 'indexed.date-parts', 'subject']])

file_list = ['file_2021_01/10625.json.gz',
             'file_2021_01/23897.json.gz',
             'file_2021_01/12169.json.gz',
             'file_2021_01/427.json.gz', ...]

filenames = tf.data.Dataset.list_files(file_list, shuffle=True)
dataset = filenames.apply(
    tf.data.experimental.parallel_interleave(
        lambda x: tf.data.Dataset.from_tensor_slices(tf.numpy_function(load_file, [x], (tf.int64))),
        cycle_length=1))
However, it results in an error:
InternalError: Unsupported object type dict
[[{{node PyFunc}}]] [Op:IteratorGetNext]
Thanks
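For what it's worth, tf.numpy_function can only return numpy arrays (or a tuple of them) whose dtypes match Tout; returning a Python dict is what triggers "Unsupported object type dict". Below is a rough sketch of one workaround under that assumption: return parallel arrays from the loader and rebuild the elements afterwards. load_file_arrays is a hypothetical wrapper around the load_file above, and the filter threshold is made up.

import numpy as np
import tensorflow as tf

def load_file_arrays(file):
    cols = load_file(file)  # the pandas-based loader from the question
    titles = cols['title'].astype(str).str.encode('utf-8').to_numpy()
    subjects = cols['subject'].to_numpy(np.int64)
    return titles, subjects

def make_records(filename):
    titles, subjects = tf.numpy_function(
        load_file_arrays, [filename], (tf.string, tf.int64))
    titles.set_shape([None])
    subjects.set_shape([None])
    return tf.data.Dataset.from_tensor_slices((titles, subjects))

filenames = tf.data.Dataset.list_files(file_list, shuffle=True)
dataset = (filenames
           .interleave(make_records, cycle_length=4)
           .filter(lambda title, subject: subject < 10)  # hypothetical label filter
           .shuffle(10_000))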

Text classification using embedding for two columns of dataset

I am working on a project where I am using mental-health-related subreddit posts containing two feature columns (text, title) and a label column (subreddit).
I want to use an LSTM for classification, which means I need to create an embedding matrix for both columns - in short, I need both columns for text classification - but I cannot find a way to embed both of them.
The code I am using for the text sequences is
text_sequences_train = token.texts_to_sequences(preprocessed_text_train)
title_sequences_train = token.texts_to_sequences(preprocessed_title_train)
#print(sequences_train)
train=np.hstack(text_sequences_train+title_sequences_train)
train.reshape(1,train.shape[0])
train_seq_x=pad_sequences(train, maxlen=300)
text_sequences_test = token.texts_to_sequences(preprocessed_text_test)
title_sequences_test = token.texts_to_sequences(preprocessed_title_test)
#print(sequences_train)
test=np.hstack(text_sequences_test+title_sequences_test)
test.reshape(1,test.shape[0])
test_seq_x=pad_sequences(test, maxlen=300)
text_sequences_val = token.texts_to_sequences(preprocessed_text_val)
title_sequences_val = token.texts_to_sequences(preprocessed_title_val)
#print(sequences_train)
val=np.hstack(text_sequences_val+title_sequences_val)
val.reshape(1,val.shape[0])
val_seq_x=pad_sequences(val, maxlen=300)
The above code gives me an error
ValueError: `sequences` must be a list of iterables. Found non-iterable: 428.0
The code I am using for the embedding matrix is
import tqdm
import numpy as np

glove_file = "glove.42B.300d.txt"
EMBEDDING_VECTOR_LENGTH = 300

def construct_embedding_matrix(glove_file, word_index):
    embedding_dict = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            # get the word
            word = values[0]
            if word in word_index.keys():
                # get the vector
                vector = np.asarray(values[1:], 'float32')
                embedding_dict[word] = vector
    ### oov words (out-of-vocabulary words) will be mapped to 0 vectors
    num_words = len(word_index) + 1
    # initialize it to 0
    embedding_matrix = np.zeros((num_words, EMBEDDING_VECTOR_LENGTH))
    for word, i in tqdm.tqdm(word_index.items()):
        if i < num_words:
            vect = embedding_dict.get(word, [])
            if len(vect) > 0:
                embedding_matrix[i] = vect[:EMBEDDING_VECTOR_LENGTH]
    return embedding_matrix

embedding_matrix = construct_embedding_matrix(glove_file, word_index)
If I convert the text sequences and then do the train/test split, I get an error saying that X and y do not have the same number of samples.
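Below is a rough sketch of one way to combine both columns, assuming the tokenizer output and embedding_matrix from the question: pad the text and title sequences separately (np.hstack flattens them into scalars, which is what causes the "non-iterable" error) and give the model two inputs that share a single Embedding layer. The sequence lengths, layer sizes, num_classes and y_train are placeholders.

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_TEXT_LEN, MAX_TITLE_LEN = 300, 30            # assumed maximum lengths

# Pad each column on its own instead of stacking them together.
text_seq_train = pad_sequences(text_sequences_train, maxlen=MAX_TEXT_LEN)
title_seq_train = pad_sequences(title_sequences_train, maxlen=MAX_TITLE_LEN)

text_in = Input(shape=(MAX_TEXT_LEN,), name="text")
title_in = Input(shape=(MAX_TITLE_LEN,), name="title")

# One Embedding layer initialised from the GloVe matrix, shared by both inputs.
num_words = embedding_matrix.shape[0]
emb = Embedding(num_words, EMBEDDING_VECTOR_LENGTH,
                weights=[embedding_matrix], trainable=False)

text_branch = LSTM(64)(emb(text_in))
title_branch = LSTM(32)(emb(title_in))

merged = concatenate([text_branch, title_branch])
out = Dense(num_classes, activation="softmax")(merged)   # num_classes = number of subreddits

model = Model([text_in, title_in], out)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit([text_seq_train, title_seq_train], y_train, epochs=5, batch_size=64)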

How do I train a pseudo-projective parser on spaCy?

I am trying to train a parser for custom semantics following the sample code from https://raw.githubusercontent.com/explosion/spaCy/master/examples/training/train_intent_parser.py
The idea is to get a non-projective parse, so that when I pass a text like "ROOT AAAA BBBB 12 21", 12 becomes a child of AAAA and 21 becomes a child of BBBB. To test this I train on only this case and then test on the same case, but it doesn't seem to work; the response I get is:
[('ROOT', 'ROOT', 'ROOT'), ('AAAA', 'LETTERS', 'ROOT'), ('BBBB', 'LETTERS', 'ROOT'), ('12', 'NUMBERS', 'BBBB'), ('21', 'NUMBERS', 'BBBB')]
As you can see, both numbers depend on BBBB, when 12 should depend on AAAA.
The code I am using to train and test is:
import plac
import random
import spacy
from spacy.util import minibatch, compounding

TRAIN_DATA = list()
samples = 1000
for _ in range(samples):
    sample = (
        'ROOT AAAA BBBB 12 21',
        {
            'heads': [0, 0, 0, 1, 2],
            'deps': ['ROOT', 'LETTERS', 'LETTERS', 'NUMBERS', 'NUMBERS']
        }
    )
    TRAIN_DATA.append(sample)

def test_model(nlp):
    texts = ['ROOT AAAA BBBB 12 21']
    docs = nlp.pipe(texts)
    for doc in docs:
        print(doc.text)
        print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])

@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    n_iter=("Number of training iterations", "option", "n", int),
)
# Just in case, I am using the German model since it supports pseudo-projective parsing
# (https://explosion.ai/blog/german-model#word-order)
def main(model='de_core_news_sm', n_iter=15):
    """Load the model, set up the pipeline and train the parser."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")
    # We'll use the built-in dependency parser class, but we want to create a
    # fresh instance - just in case.
    if "parser" in nlp.pipe_names:
        nlp.remove_pipe("parser")
    parser = nlp.create_pipe("parser")
    nlp.add_pipe(parser, first=True)
    for text, annotations in TRAIN_DATA:
        for dep in annotations.get("deps", []):
            parser.add_label(dep)
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train the parser
        optimizer = nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, losses=losses)
            print("Losses", losses)
    # test the trained model
    test_model(nlp)

if __name__ == "__main__":
    plac.call(main)
So, what am I doing wrong?
Thank you in advance for any help on this!
The problem is that the simple training example script isn't projectivizing the training instances when initializing and training the model. The parsing algorithm itself can only handle projective parses, but if the parser component finds projectivized labels in its output, they're deprojectivized in a postprocessing step. You don't need to modify any parser settings (so starting with a German model makes no difference), just provide projectivized input in the right format.
The initial projectivization is handled automatically by the train CLI, which uses GoldCorpus.train_docs() to prepare the training examples for nlp.update() and sets make_projective=True when creating the GoldParses. In general, I'd recommend switching to the train CLI (which also requires switching to the internal JSON training format, admittedly a minor hassle), because it sets a lot of better defaults.
However, a toy example also works fine as long as you create projectivized training examples (with GoldParse(make_projective=True)), add all the projectivized dependency labels to the parser, and train with Doc and the projectivized GoldParse input instead of the text/annotation input:
# tested with spaCy v2.2.4
import plac
import random
import spacy
from spacy.util import minibatch, compounding
from spacy.gold import GoldParse

TRAIN_DATA = [
    (
        'ROOT AAAA BBBB 12 21',
        {
            'heads': [0, 0, 0, 1, 2],
            'deps': ['ROOT', 'LETTERS', 'LETTERS', 'NUMBERS', 'NUMBERS']
        }
    )
]
samples = 200

def test_model(nlp):
    texts = ["ROOT AAAA BBBB 12 21"]
    for doc in nlp.pipe(texts):
        print(doc.text)
        print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"])
        spacy.displacy.serve(doc)

@plac.annotations(
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(n_iter=10):
    """Load the model, set up the pipeline and train the parser."""
    nlp = spacy.blank("xx")
    parser = nlp.create_pipe("parser")
    nlp.add_pipe(parser)
    docs_golds = []
    for text, annotation in TRAIN_DATA:
        doc = nlp.make_doc(text)
        gold = GoldParse(doc, **annotation, make_projective=True)
        # add the projectivized labels
        for dep in gold.labels:
            parser.add_label(dep)
        docs_golds.append((doc, gold))
    # duplicate the training instances
    docs_golds = docs_golds * samples
    pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    with nlp.disable_pipes(*other_pipes):  # only train the parser
        optimizer = nlp.begin_training(min_action_freq=1)
        for itn in range(n_iter):
            random.shuffle(docs_golds)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(docs_golds, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                docs, golds = zip(*batch)
                nlp.update(docs, golds, sgd=optimizer, losses=losses)
            print("Losses", losses)
    # test the trained model
    test_model(nlp)

if __name__ == "__main__":
    plac.call(main)

Using tf.py_func with pickle files in the Dataset API

I am trying to use the Dataset API with my dataset, which consists of pickle files. These files contain my data, which is a vector of floats, and the labels, which are a one-hot vector.
I have tried using tf.py_func to load the features, but I am unable to do it because of mismatching shapes. Since these pickle files include the label as well, I cannot pass them directly as the tuple in the example here. So I am a bit lost on how to continue.
This is my code so far
import glob
import os
import pickle

import tensorflow as tf

path = "my_dir_to_pkl_files"
pkl_files = glob.glob(path + "*.pkl")
dataset = tf.data.Dataset.from_tensor_slices(pkl_files)
dataset = dataset.map(
    lambda filename: tuple(tf.py_func(
        load_features, [filename], [tf.float32])))
And here is my Python function to read the features.
def load_features(name):
    decoded = name.decode("UTF-8")
    if os.path.exists(decoded):
        with open(decoded, 'rb') as f:
            file = pickle.load(f)
        return file['features']
        # I have commented out the line below, but this should return
        # the features and the label as a one-hot vector
        # return file['features'], file['targets']
    else:
        print("Something went wrong!")
        exit(-1)
I would expect the Dataset API to return a tuple with N features and a one-hot vector for each sample in my batch. Instead I'm getting
InvalidArgumentError: pyfunc_0 returns 30 values, but expects to see 1 values.
Any suggestions? Thanks.
Edit:
Here is what my pickle file looks like. The features array has a shape of [30, 100]. I have attached the same file here as well.
{'features': array([[0.64864044, 0.71419346, 0.35874235, ..., 0.66058507, 0.89013242,
0.67564707],
[0.15958826, 0.38115951, 0.46636267, ..., 0.49682084, 0.08863887,
0.17142761],
[0.26925915, 0.27901399, 0.91624607, ..., 0.30269212, 0.47494327,
0.43265325],
...,
[0.50405357, 0.7441127 , 0.04308265, ..., 0.06766902, 0.87449393,
0.31018099],
[0.44777562, 0.30836258, 0.48148097, ..., 0.74899213, 0.97264324,
0.43391464],
[0.50583501, 0.56803691, 0.61290449, ..., 0.8350931 , 0.52897295,
0.23731264]]), 'targets': array([0, 0, 1, 0])}
The error occurs after I try to get an element from the dataset:
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
print(sess.run(next_element))
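A rough sketch of what usually fixes this kind of mismatch, assuming the [30, 100] float features and length-4 targets shown in the example pickle above: tell tf.py_func about both return values (one dtype per returned array) and restore the static shapes afterwards. load_example is a hypothetical variant of load_features that returns both arrays.

import pickle
import tensorflow as tf

def load_example(name):
    decoded = name.decode("UTF-8")
    with open(decoded, "rb") as f:
        sample = pickle.load(f)
    return sample["features"].astype("float32"), sample["targets"].astype("int64")

def parse(filename):
    features, targets = tf.py_func(load_example, [filename],
                                   [tf.float32, tf.int64])
    features.set_shape([30, 100])  # shape of the features array in the example pickle
    targets.set_shape([4])         # length of the one-hot target in the example pickle
    return features, targets

dataset = tf.data.Dataset.from_tensor_slices(pkl_files).map(parse)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()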

Do not understand the classes part and the reshape when reading an h5 dataset file

Hello, can somebody explain step by step what's happening in the following code, especially the classes part and the reshape? Thanks.
def load_data():
    train_dataset = h5py.File('datasets/train_catvnoncat.h5', "r")
    train_set_x_orig = np.array(train_dataset["train_set_x"][:])  # your train set features
    train_set_y_orig = np.array(train_dataset["train_set_y"][:])  # your train set labels

    test_dataset = h5py.File('datasets/test_catvnoncat.h5', "r")
    test_set_x_orig = np.array(test_dataset["test_set_x"][:])  # your test set features
    test_set_y_orig = np.array(test_dataset["test_set_y"][:])  # your test set labels

    classes = np.array(test_dataset["list_classes"][:])  # the list of classes

    train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
    test_set_y_orig = test_set_y_orig.reshape((1, test_set_y_orig.shape[0]))

    return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
Most of the lines just load datasets from the h5 file. The np.array(...) wrapper isn't needed; test_dataset[name][:] is sufficient to load an array.
test_set_y_orig = test_dataset["test_set_y"][:]
test_dataset is the opened file, and test_dataset["test_set_y"] is a dataset in that file. The [:] loads that dataset into a numpy array. Look up the h5py docs for more details on loading a dataset.
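As a quick illustration of that slicing behaviour (file and dataset names taken from the question):

import h5py

with h5py.File('datasets/test_catvnoncat.h5', 'r') as f:
    y = f["test_set_y"][:]   # [:] reads the whole HDF5 dataset into a numpy ndarray
    print(type(y), y.shape, y.dtype)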
I deduce from
train_set_y_orig = train_set_y_orig.reshape((1, train_set_y_orig.shape[0]))
that the array, as loaded, is 1D with shape (n,), and this reshape just adds an initial dimension, making it (1, n). I would have coded it as
train_set_y_orig = train_set_y_orig[None, :]
but the result is the same.
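A tiny demonstration that the two spellings are equivalent:

import numpy as np

y = np.array([0, 1, 1, 0, 1])        # shape (5,)
a = y.reshape((1, y.shape[0]))       # shape (1, 5)
b = y[None, :]                       # shape (1, 5)
print(a.shape, b.shape, np.array_equal(a, b))   # (1, 5) (1, 5) True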
There's nothing special about the classes array (though it might well be an array of strings).