Emoji vectors via spacy - spacy

In short, emoji vectors in spacy? Where is this documented?
import spacy
nlp = spacy.load('en_core_web_sm')
a = "🔥"
b = "❄️"
v = "🥑"
h = "💔"
l = "💌"
e = [a,b,v,h,l]
# emoji vector
ev = [nlp(emoji).vector for emoji in e]
# numpy array
ev = np.array(ev)
ev.shape
The shape is (5, 96), so I am curious as to where I can learn more about the source of the vectors. At first, I assumed that these were OOV, but:
ev.sum(axis=1)
yields
array([2.906692 , 3.8687153, 1.2295313, 3.986846 , 1.9255924],
dtype=float32)
All above is via Colab environment as of 2/21/2021

The sm models do not contain word vectors. If there aren't any word vectors, token.vector returns token.tensor as a backoff, which is the context-sensitive tensor from the tagger component. See the first warning box here: https://v2.spacy.io/usage/vectors-similarity
If you want word vectors, use an md or lg model instead, and then the emoji will be OOV and token.vector will return an all-0 300d vector.

Related

Tensor flow and tflearn Chatbot keeps on getting high probability even when user input is wrong

I coded a simple AI chatbot with TensorFlow and tflearn and it runs just fine but the issue is when the user inputs the wrong thing, the bot is supposed to say it doesnt understand if the prediction accuracy is less than 70%, but the bot always scores above that even if the user gives jibberish like "rjrigrejfr". The bot assumes theyre greeting them. The patterns its supposed to study in the json are "patterns": ["Hi", "How are you", "Wassup", "Hello", "Good day", "Waddup", "Yo"]. I can share the json file if needed its short. Anyway, this is the python code:
import numpy as np
import nltk
import tensorflow
import tflearn
import random
import json
import pickle
# Some extra configuration:
from nltk.stem.lancaster import LancasterStemmer
stemmer = LancasterStemmer()
nltk.download('punkt')
# Load the data from the json file into a variable.
with open("intents.json") as file:
data = json.load(file)
# If we already have saved data, we do not need to retrain the model and waste time (could develop into an issue in more complex programs. Save in pickle. )
try:
with open("data.pickle", "rb") as f: # rb stands for bytes.
words, labels, training, output = pickle.load(f)
# --- Pre-training data preparation ---
except:
words = []
docsx = [] # Stores patterns
docsy = [] # Stores intents
labels = [] # All the specific tag values such as greeting, contact, etc.
for intent in data["intents"]:
for pattern in intent["patterns"]:
w = nltk.word_tokenize(pattern) # nltk function that splits the sentences inside intent into words list.
words.extend(w) # Add the tokenized list to words list.
docsx.append(w)
docsy.append(intent["tag"]) # append the classification of the sentence
if intent["tag"] not in labels:
labels.append(intent["tag"])
words = [stemmer.stem(w.lower()) for w in words if w not in ".?!"] # Stemming the words to remove unnecessary elements leaving their root. Convert all to lowercase.
words = sorted(list(set(words))) # Set ensures no duplicate elements then we convert back to list and sort it.
labels = sorted(labels)
training = []
output = []
out_empty = [0 for i in range(len(labels))] # Gives a list of 0 ints based on # of tags. This is useful later in the program when binerizing.
# One hot encoding the intent categories. Need to one-hot code the data which improves the efficiency of the ML to "binerize" the data.
# In this case, we have a list of 0s and 1s if the word appears it is assigned a 1 else a 0.
for x, doc in enumerate(docsx):
bag = [] # Bag of words or the one-hot coded data for the ML.
docx_word_stemmed = [stemmer.stem(word) for word in doc] # Stemming the data in docx.
# Now adding and transforming data into the one-hot coded list/bag of words data.
for i in words:
if i in docx_word_stemmed: # Checking against stemmed words:
# Word exists
bag.append(1)
else:
bag.append(0)
output_row = out_empty[:] # Copying out_empty
# Going through the labels list using .index() and for the occurance of docx value in docy, assign binary 1.
output_row[labels.index(docsy[x])] = 1
training.append(bag)
output.append(output_row)
# Required to use numpy arrays for use in tflearn. It is also faster.
training = np.array(training)
output = np.array(output)
# Saving the data so we do not need to do the data configuration every time.
with open("data.pickle", "wb") as f:
pickle.dump((words, labels, training, output), f)
try:
model.load('model.tflearn')
except:
tensorflow.compat.v1.reset_default_graph()
net = tflearn.input_data(shape=[None, len(training[0])])
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, 8)
net = tflearn.fully_connected(net, len(output[0]), activation='softmax')
net = tflearn.regression(net)
model = tflearn.DNN(net)
model.fit(training, output, n_epoch=1000, batch_size=8, show_metric=True)
model.save("model.tflearn")
def bagofwords(sentence, words):
bag = [0 for _ in range(len(words))] # blank bag of words.
# Tokenize s and then stem it.
sentence_words = nltk.word_tokenize(sentence)
sentence_words = [stemmer.stem(word.lower()) for word in sentence_words]
for string in sentence_words:
for i, word in enumerate(words):
if word == string:
bag[i] = 1
return np.array(bag)
def chat():
print("Hello there! I'm the SRO AI Virtual Assistant. How am I help you?")
# Figure out the error slime!
while True:
user_input = input("Type here:")
if user_input == "quit":
break
result = model.predict([bagofwords(user_input, words)])[0] #bagofwords func and predict function to give predictions on what the user is saying.
best_result = np.argmax(result) # We want to only use the best result.
tag = labels[best_result]
print(result[best_result])
# Open JSON file and pick a response.
if result[best_result] > 0.7:
for tg in data["intents"]:
if tg['tag'] == tag:
responses = tg['responses']
print(random.choice(responses))
else:
print("I don't quite understand")
chat()

Text classification using embedding for two columns of dataset

I am working on a project where i am using mental health related subreddit posts containing two feature columns (text, title) and a label column (Subreddit).
I want to use LSTM for classification where i need to create embedding matrix for both the columns in short need both columns for text classification but i cannot find the way to embed both columns.
Code i am using for text sequences is
text_sequences_train = token.texts_to_sequences(preprocessed_text_train)
title_sequences_train = token.texts_to_sequences(preprocessed_title_train)
#print(sequences_train)
train=np.hstack(text_sequences_train+title_sequences_train)
train.reshape(1,train.shape[0])
train_seq_x=pad_sequences(train, maxlen=300)
text_sequences_test = token.texts_to_sequences(preprocessed_text_test)
title_sequences_test = token.texts_to_sequences(preprocessed_title_test)
#print(sequences_train)
test=np.hstack(text_sequences_test+title_sequences_test)
test.reshape(1,test.shape[0])
test_seq_x=pad_sequences(test, maxlen=300)
text_sequences_val = token.texts_to_sequences(preprocessed_text_val)
title_sequences_val = token.texts_to_sequences(preprocessed_title_val)
#print(sequences_train)
val=np.hstack(text_sequences_val+title_sequences_val)
val.reshape(1,val.shape[0])
val_seq_x=pad_sequences(val, maxlen=300)
the above code gives me an error
ValueError: `sequences` must be a list of iterables. Found non-iterable: 428.0
code i am using for embedding matrix is
glove_file = "glove.42B.300d.txt"
import tqdm
EMBEDDING_VECTOR_LENGTH = 300 # <=200
def construct_embedding_matrix(glove_file, word_index):
embedding_dict = {}
with open(glove_file,'r', encoding='utf-8') as f:
for line in f:
values=line.split()
# get the word
word=values[0]
if word in word_index.keys():
# get the vector
vector = np.asarray(values[1:], 'float32')
embedding_dict[word] = vector
#print(embedding_dict[word].shape)
### oov words (out of vacabulary words) will be mapped to 0 vectors
num_words=len(word_index)+1
#initialize it to 0
embedding_matrix=np.zeros((num_words, EMBEDDING_VECTOR_LENGTH))
for word,i in tqdm.tqdm(word_index.items()):
if i < num_words:
vect=embedding_dict.get(word, [])
if len(vect)>0:
embedding_matrix[i] = vect[:EMBEDDING_VECTOR_LENGTH]
#print(embedding_matrix[i].shape)
print(embedding_matrix)
return embedding_matrix
embedding_matrix=construct_embedding_matrix(glove_file, word_index)
If I convert text sequences and then train test split it gives an error where X and Y no of samples do not match

Error in prediction in svm classifier after one hot encoding

I have used one hot encoding to my dataset before training my SVM classifier.
which increased number of features in training set to 982. But during
prediction of test dataset which has 7 features i am getting error " X has 7
features per sample; expecting 982". I don't understand how to increase
number of features in test dataset.
My code is:
df = pd.read_csv('train.csv',header=None);
features = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values
encode = LabelEncoder()
features[:,2] = encode.fit_transform(features[:,2])
features[:,3] = encode.fit_transform(features[:,3])
features[:,4] = encode.fit_transform(features[:,4])
features[:,5] = encode.fit_transform(features[:,5])
df1 = pd.DataFrame(features)
#--------------------------- ONE HOT ENCODING --------------------------------#
hotencode = OneHotEncoder(categorical_features=[2])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[14])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[37])
features = hotencode.fit_transform(features).toarray()
hotencode = OneHotEncoder(categorical_features=[466])
features = hotencode.fit_transform(features).toarray()
X = np.array(features)
y = np.array(labels)
clf = svm.LinearSVC()
clf.fit(X,y)
d_test = pd.read_csv('query.csv')
Z_test =np.array(d_test)
confidence = clf.predict(Z_test)
print("The query image belongs to Class ")
print(confidence)
######################### test dataset
query.csv
1 0.076 1 3232236298 2886732679 3128 60604
The short answer: you need to apply the same OHE transform (or LE+OHE in your case) on the test set.
For a good advice, see Scikit Learn OneHotEncoder fit and transform Error: ValueError: X has different shape than during fitting or How to deal with imputation and hot one encoding in pandas?

word2vec, sum or average word embeddings?

I'm using word2vec to represent a small phrase (3 to 4 words) as a unique vector, either by adding each individual word embedding or by calculating the average of word embeddings.
From the experiments I've done I always get the same cosine similarity. I suspect it has to do with the word vectors generated by word2vec being normed to unit length (Euclidean norm) after training? or either I have a BUG in the code, or I'm missing something.
Here is the code:
import numpy as np
from nltk import PunktWordTokenizer
from gensim.models import Word2Vec
from numpy.linalg import norm
from scipy.spatial.distance import cosine
def pattern2vector(tokens, word2vec, AVG=False):
pattern_vector = np.zeros(word2vec.layer1_size)
n_words = 0
if len(tokens) > 1:
for t in tokens:
try:
vector = word2vec[t.strip()]
pattern_vector = np.add(pattern_vector,vector)
n_words += 1
except KeyError, e:
continue
if AVG is True:
pattern_vector = np.divide(pattern_vector,n_words)
elif len(tokens) == 1:
try:
pattern_vector = word2vec[tokens[0].strip()]
except KeyError:
pass
return pattern_vector
def main():
print "Loading word2vec model ...\n"
word2vecmodelpath = "/data/word2vec/vectors_200.bin"
word2vec = Word2Vec.load_word2vec_format(word2vecmodelpath, binary=True)
pattern_1 = 'founder and ceo'
pattern_2 = 'co-founder and former chairman'
tokens_1 = PunktWordTokenizer().tokenize(pattern_1)
tokens_2 = PunktWordTokenizer().tokenize(pattern_2)
print "vec1", tokens_1
print "vec2", tokens_2
p1 = pattern2vector(tokens_1, word2vec, False)
p2 = pattern2vector(tokens_2, word2vec, False)
print "\nSUM"
print "dot(vec1,vec2)", np.dot(p1,p2)
print "norm(p1)", norm(p1)
print "norm(p2)", norm(p2)
print "dot((norm)vec1,norm(vec2))", np.dot(norm(p1),norm(p2))
print "cosine(vec1,vec2)", np.divide(np.dot(p1,p2),np.dot(norm(p1),norm(p2)))
print "\n"
print "AVG"
p1 = pattern2vector(tokens_1, word2vec, True)
p2 = pattern2vector(tokens_2, word2vec, True)
print "dot(vec1,vec2)", np.dot(p1,p2)
print "norm(p1)", norm(p1)
print "norm(p2)", norm(p2)
print "dot(norm(vec1),norm(vec2))", np.dot(norm(p1),norm(p2))
print "cosine(vec1,vec2)", np.divide(np.dot(p1,p2),np.dot(norm(p1),norm(p2)))
if __name__ == "__main__":
main()
and here is the output:
Loading word2vec model ...
Dimensions 200
vec1 ['founder', 'and', 'ceo']
vec2 ['co-founder', 'and', 'former', 'chairman']
SUM
dot(vec1,vec2) 5.4008677771
norm(p1) 2.19382594282
norm(p2) 2.87226958166
dot((norm)vec1,norm(vec2)) 6.30125952303
cosine(vec1,vec2) 0.857109242583
AVG
dot(vec1,vec2) 0.450072314758
norm(p1) 0.731275314273
norm(p2) 0.718067395416
dot(norm(vec1),norm(vec2)) 0.525104960252
cosine(vec1,vec2) 0.857109242583
I'm using the cosine similarity as defined here Cosine Similarity (Wikipedia). The values for the norms and dot products are indeed different.
Can anyone explain why the cosine is the same?
Thank you,
David
Cosine measures the angle between two vectors and does not take the length of either vector into account. When you divide by the length of the phrase, you are just shortening the vector, not changing its angular position. So your results look correct to me.

Bayesian Probabilistic Matrix Factorization (BPMF) with PyMC3: PositiveDefiniteError using `NUTS`

I've implemented the Bayesian Probabilistic Matrix Factorization algorithm using pymc3 in Python. I also implemented it's precursor, Probabilistic Matrix Factorization (PMF). See my previous question for a reference to the data used here.
I'm having trouble drawing MCMC samples using the NUTS sampler. I initialize the model parameters using the MAP from PMF, and the hyperparameters using Gaussian random draws sprinkled around 0. However, I get a PositiveDefiniteError when setting up the step object for the sampler. I've verified that the MAP estimate from PMF is reasonable, so I expect it has something to do with the way the hyperparameters are being initialized. Here is the PMF model:
import pymc3 as pm
import numpy as np
import pandas as pd
import theano
import scipy as sp
data = pd.read_csv('jester-dense-subset-100x20.csv')
n, m = data.shape
test_size = m / 10
train_size = m - test_size
train = data.copy()
train.ix[:,train_size:] = np.nan # remove test set data
train[train.isnull()] = train.mean().mean() # mean value imputation
train = train.values
test = data.copy()
test.ix[:,:train_size] = np.nan # remove train set data
test = test.values
# Low precision reflects uncertainty; prevents overfitting
alpha_u = alpha_v = 1/np.var(train)
alpha = np.ones((n,m)) * 2 # fixed precision for likelihood function
dim = 10 # dimensionality
# Specify the model.
with pm.Model() as pmf:
pmf_U = pm.MvNormal('U', mu=0, tau=alpha_u * np.eye(dim),
shape=(n, dim), testval=np.random.randn(n, dim)*.01)
pmf_V = pm.MvNormal('V', mu=0, tau=alpha_v * np.eye(dim),
shape=(m, dim), testval=np.random.randn(m, dim)*.01)
pmf_R = pm.Normal('R', mu=theano.tensor.dot(pmf_U, pmf_V.T),
tau=alpha, observed=train)
# Find mode of posterior using optimization
start = pm.find_MAP(fmin=sp.optimize.fmin_powell)
And here is BPMF:
n, m = data.shape
dim = 10 # dimensionality
beta_0 = 1 # scaling factor for lambdas; unclear on its use
alpha = np.ones((n,m)) * 2 # fixed precision for likelihood function
logging.info('building the BPMF model')
std = .05 # how much noise to use for model initialization
with pm.Model() as bpmf:
# Specify user feature matrix
lambda_u = pm.Wishart(
'lambda_u', n=dim, V=np.eye(dim), shape=(dim, dim),
testval=np.random.randn(dim, dim) * std)
mu_u = pm.Normal(
'mu_u', mu=0, tau=beta_0 * lambda_u, shape=dim,
testval=np.random.randn(dim) * std)
U = pm.MvNormal(
'U', mu=mu_u, tau=lambda_u, shape=(n, dim),
testval=np.random.randn(n, dim) * std)
# Specify item feature matrix
lambda_v = pm.Wishart(
'lambda_v', n=dim, V=np.eye(dim), shape=(dim, dim),
testval=np.random.randn(dim, dim) * std)
mu_v = pm.Normal(
'mu_v', mu=0, tau=beta_0 * lambda_v, shape=dim,
testval=np.random.randn(dim) * std)
V = pm.MvNormal(
'V', mu=mu_v, tau=lambda_v, shape=(m, dim),
testval=np.random.randn(m, dim) * std)
# Specify rating likelihood function
R = pm.Normal(
'R', mu=theano.tensor.dot(U, V.T), tau=alpha,
observed=train)
# `start` is the start dictionary obtained from running find_MAP for PMF.
for key in bpmf.test_point:
if key not in start:
start[key] = bpmf.test_point[key]
with bpmf:
step = pm.NUTS(scaling=start)
At the last line, I get the following error:
PositiveDefiniteError: Scaling is not positive definite. Simple check failed. Diagonal contains negatives. Check indexes [ 0 2 ... 2206 2207 ]
As I understand it, I can't use find_MAP with models that have hyperpriors like BPMF. This is why I'm attempting to initialize with the MAP values from PMF, which uses point estimates for the parameters on U and V rather than parameterized hyperpriors.
Unfortunately the Wishart distribution is non-functional. I recently added a warning here: https://github.com/pymc-devs/pymc3/commit/642f63973ec9f807fb6e55a0fc4b31bdfa1f261e
See here for more discussions on this tricky distribution: https://github.com/pymc-devs/pymc3/issues/538
You could confirm that that's the source by fixing the covariance matrix. If that's the case, I'd try using the JKL prior distribution: https://github.com/pymc-devs/pymc3/blob/master/pymc3/examples/LKJ_correlation.py