Merge two CountVectorizers and compute Cosine Similarity

Merge two CountVectorizers and compute Cosine Similarity - numpy

I'm trying to implement a technique described in an information retrieval paper, where documents are decomposed into vectors and then, their cosine similarity is computed, much like how it is explained here: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
In the example, we have:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = (
"The sky is blue",
"The sun is bright",
"The sun in the sky is bright",
"We can see the shining sun, the bright sun"
)
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
However, from time to time I'll get a new document. Is there a way to calculate the cosine similarity of this new document without recreating the documents tuple and the tfidf_matrix?

Yes, you can do:
new_docs = [
"This is new doc 1",
"This is new doc 2",
]
new_tfidf_matrix = tfidf_vectorizer.predict(new_docs)
cosine_similarity(new_tfidf_matrix, tfidf_matrix)
If you think the new docs will have new vocabulary not present in the training dataset, then you should consider retraining the Vectorizer with tfidf_vectorizer.fit(all_docs).

Related

How can I find the optimal number of topics in LDA with scikit-learn?

I'm computing topic models through scikit-learn with this script (I'm starting with a dataset "df" which has one document per row in the column "Text")
from sklearn.decomposition import LatentDirichletAllocation
#Applying LDA
# the vectorizer object will be used to transform text to vector form
vectorizer = CountVectorizer(max_df=int(0.9*len(df)), min_df=int(0.01*len(df)), token_pattern='\w+|\$[\d\.]+|\S+')
# apply transformation
tf = vectorizer.fit_transform(df.Text).toarray()
# tf_feature_names tells us what word each column in the matric represents
tf_feature_names = vectorizer.get_feature_names()
number_of_topics = 6
model = LatentDirichletAllocation(n_components=number_of_topics, random_state=0)
model.fit(tf)
I'm interested in comparing models with different number of topics (kind of from 2 to 20 topics) through a coherence measure. How can I do it?

Restricting the term-document matrix to most frequent unigrams

The code below as an example for analyzing massive corpus. I want to restrict the term-document matrix to 1000 most frequent unigrams, but changing the max-features parameter to n only return the first n unigrams. Any suggestion?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus).todense()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
df.to_csv('test.csv')

I am assuming that this is a problem in your example, but according to the sklearn documentation for TfidfVectorizer ssays the following for max_features:
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
It might be that the first n terms are considered when there are words of equal frequency, but otherwise it should return the correct result. If it still does not work, I strongly suggest to open a bug report in the sklearn repository. However, you can also manually construct a vocabulary yourself (with your own interpretation of "frequency", by setting the vocabulary option:
vocabulary: Mapping or iterable, default=None
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.

Is it possible to improve spaCy's similarity results with custom named entities?

I've found that spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out-of-the-box.
I'd like to tighten up relationships in some areas and thought adding custom NER labels to the model would help, but my results before and after show no improvements, even though I've been able to create a test set of custom entities.
Now I'm wondering, was my theory completely wrong, or could I simply be missing something in my pipeline?
If I was wrong, what's the best approach to improve results? Seems like some sort of custom labeling should help.
Here's an example of what I've tested so far:
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse
nlp = spacy.load("en_core_web_lg")
docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")
sim_before = docA.similarity(docB)
print(sim_before)
0.5949629181460099
^^ Not too shabby, but I'd like to see results closer to 0.85 in this example.
So, I use EntityRuler and add some patterns to try and tighten up the relationships:
ruler = EntityRuler(nlp)
patterns = [
{"label": "ADDITION", "pattern": "Add"},
{"label": "ADDITION", "pattern": "plus"},
{"label": "FRACTION", "pattern": "one-third"},
{"label": "FRACTION", "pattern": "fractions"},
{"label": "FRACTION", "pattern": "denominators"},
]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)
['tagger', 'parser', 'entity_ruler', 'ner']
Adding GoldParse seems to be important, so I added the following and updated NER:
doc1 = Doc(nlp.vocab, [u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
gold1 = GoldParse(doc1, [u'0', u'0', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])
doc2 = Doc(nlp.vocab, [u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, [u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])
ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
{'ner': 0.0}
You can see my custom entities are working, but the test results show zero improvement:
test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")
print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])
sim = test1.similarity(test2)
print(sim)
[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099
Any tips would be greatly appreciated!

Doc.similarity only uses the word vectors, not any other annotation. From the Doc API:
The default estimate is cosine similarity using an average of word vectors.

I found my solution was nestled in this tutorial: Text Classification in Python Using spaCy, which generates a BoW matrix for spaCy's text data by using SciKit-Learn's CountVectorizer.
I avoided sentiment analysis tutorials, due to binary classification, since I need support for multiple categories. The trick was to set multi_class='auto' on the LogisticRegression linear model, and to use average='micro' on the precision score and precision recall, so all my text data, like entities, were leveraged:
classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
and...
print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
Hope this helps save someone some time!

How to perform kmean clustering from Gensim TFIDF values

I am using Gensim for vector space model. after creating a dictionary and corpus from Gensim I calculated the (Term frequency*Inverse document Frequency)TFIDF using the following line
Term_IDF = TfidfModel(corpus)
corpus_tfidf = Term_IDF[corpus]
The corpus_tfidf contain list of the list having Terms ids and corresponding TFIDF. then I separated the TFIDF from ids using following lines:
for doc in corpus_tfidf:
for ids,tfidf in doc:
IDS.append(ids)
tfidfmtx.append(tfidf)
IDS=[]
now I want to use k-means clustering so I want to perform cosine similarities of tfidf matrix the problem is Gensim does not produce square matrix so when I run following line it generates an error. I wonder how can I get the square matrix from Gensim to calculate the similarities of all the documents in vector space model. Also how to convert tfidf matrix (which in this case is a list of lists) into 2D NumPy array. any comments are much appreciated.
dumydist = 1 - cosine_similarity(tfidfmtx)

When you fit your corpus to a Gensim Dictionary, get the number or documents and tokens in the dictionary:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(corpus_lists)
num_docs = dictionary.num_docs
num_terms = len(dictionary.keys())
Transform into bow:
corpus_bow = [dictionary.doc2bow(doc) for doc in corpus_lists]
Transform into tf-idf:
from gensim.models.tfidfmodel import TfidfModel
tfidf = TfidfModel(corpus_bow)
corpus_tfidf = tfidf[corpus_bow]
Now you can transform into sparse/dense matrix:
from gensim.matutils import corpus2dense, corpus2csc
corpus_tfidf_dense = corpus2dense(corpus_tfidf, num_terms, num_docs)
corpus_tfidf_sparse = corpus2csc(corpus_tfidf, num_terms, num_docs)
Now fit your model using either sparse/dense matrix (after transposing):
model = KMeans(n_clusters=7)
clusters = model.fit_predict(corpus_bow_dense.T)

To create document term matrix from gensim, you may use matutils.corpus2csv
Corpus - list of list(Genism Corpus)
from scipy.sparse import csc_matrix
scipy_csc_matrix =genism.matutils.corpus2csc(corpus)
full_matrix=csc_matrix(scipy_csc_matrix).toarray()
you may want to use scipy sparse format if your corpus size is very large.

Implementation of Hierarchical Sampling for Active Learning

I am working on a personal machine learning project where I am attempting to classify data into binary classes when the classes are extremely imbalanced. I am initially trying to implement the approach proposed in Hierarchical Sampling for Active Learning by S Dasgupta which exploits the cluster structure of the dataset to aide the active learner.
However, I am struggling to implement the algorithm proposed in the article. I have written this so far, however am unsure how to continue:
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
data_dist = pdist(X) # computing the distance
data_link = linkage(data_dist) # computing the linkage
The data is stored in X and the correct clasification in y. Sample dataset:
X = np.array([[0.3,0.7],[0.5,0.5] ,[0.2,0.8], [0.1,0.9]])
y = np.array([[0], [1], [0], [1]])
(Note the actual dataset is approximately 500 times larger)

Hierarchical Sampling for Active Learning as proposed by S Dasgupta is now implemented in libact, a Python active learning libary. For the source code, see this link.
Example (from doccumentation):
from libact.query_strategies import UncertaintySampling
from libact.query_strategies.multiclass import HierarchicalSampling
sub_qs = UncertaintySampling(
dataset, method='sm', model=SVM(decision_function_shape='ovr'))
qs = HierarchicalSampling(
dataset, # Dataset object
dataset.get_num_of_labels(),
active_selecting=True,
subsample_qs=sub_qs
)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Merge two CountVectorizers and compute Cosine Similarity - numpy

Related

How can I find the optimal number of topics in LDA with scikit-learn?

Restricting the term-document matrix to most frequent unigrams

Is it possible to improve spaCy's similarity results with custom named entities?

How to perform kmean clustering from Gensim TFIDF values

Implementation of Hierarchical Sampling for Active Learning

Categories

Resources