Lucene Term Vector Multivariate Bayes Model Expectation Maximization

I am trying to implement an Expectation Maximization algorithm for document clustering. I am planning to use Lucene term vectors for finding the similarity between two documents. There are two kinds of EM algorithms using naive Bayes: the multivariate model and the multinomial model. In simple terms, the multinomial model uses the frequencies of the different words in the documents, while the multivariate model just uses the information of whether a word is present or not in the document (a boolean vector).
I know that the term vectors in Lucene store the terms present in the current document along with their frequencies. This is exactly what is needed for the multinomial model.
But the multivariate model requires the following:
A vector that stores the presence or absence of each term; thus all the terms across all the documents must be covered by this vector.
As an example:
doc1 : field CONTENT has the following terms : this is the world of pleasure.
doc2 : field CONTENT has the following terms : this amazing world is full of sarcastic people.
Now the vector that I need should be
<this is the world of pleasure amazing full sarcastic people> (it contains all the words in all the documents)
For doc1 the value of this vector is <1 1 1 1 1 1 0 0 0 0>
For doc2 the value of this vector is <1 1 0 1 1 0 1 1 1 1>
Is there any way to generate such a boolean vector in Lucene?

I would first generate the multinomial vectors, and then process them (maybe their textual representation) to get the multivariate vectors, as sketched below.
If the set of documents is not very small, storing full vectors is wasteful. You should use a sparse representation, because every document contains only a small subset of the possible terms.
This blog post describes generating feature vectors from Lucene/Solr documents, although I do not think it goes much farther than what you already did.
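A minimal sketch of the multinomial-to-multivariate conversion (plain Python, outside Lucene; the doc_term_freqs structure below is a made-up stand-in for whatever you extract from the term vectors): build the global vocabulary once, keep a sparse presence set per document, and only expand to a dense boolean vector when you really need one.

# Hypothetical input: term -> frequency maps already pulled from Lucene's
# term vectors (the multinomial representation), one map per document.
doc_term_freqs = {
    "doc1": {"this": 1, "is": 1, "the": 1, "world": 1, "of": 1, "pleasure": 1},
    "doc2": {"this": 1, "amazing": 1, "world": 1, "is": 1, "full": 1,
             "of": 1, "sarcastic": 1, "people": 1},
}

# Global vocabulary over all documents, with a fixed position per term.
vocabulary = sorted({t for freqs in doc_term_freqs.values() for t in freqs})
term_index = {t: i for i, t in enumerate(vocabulary)}

# Sparse multivariate representation: the set of term positions present
# in each document (frequencies are deliberately dropped).
multivariate = {
    doc: {term_index[t] for t in freqs}
    for doc, freqs in doc_term_freqs.items()
}

def dense_boolean_vector(doc):
    """Expand one document to a dense 0/1 vector only when needed."""
    present = multivariate[doc]
    return [1 if i in present else 0 for i in range(len(vocabulary))]

Sorting the vocabulary fixes the column order, so the dense vectors are comparable across documents; for anything beyond a toy collection, keep the sparse sets and avoid materializing the dense vectors at all.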

Related

Understanding Time2Vec embedding for implementing this as a keras layer

The Time2Vec paper (the relevant theory is in section 4) shows an approach to include a time embedding for features to improve model performance. I would like to give this a try. I found an implementation as a Keras layer, which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. Concerning the paper, there are a few things I don't understand. The first thing is the term k as the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly, I'm about to put every feature I have through the sine transformation; for my dataset that would make 6 periodic features (sinusoids). The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings, or even with this particular approach?
From my limited understanding, the linear transformation of time is a fixed element of the produced embedding, and the parameter k lets you select how many different learned time representations you want to use in your model. So the resulting embedding has a size of k + 1 elements.
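To make that reading concrete, here is a minimal Keras layer sketch (not the authors' original code, which is no longer available): one learned linear term plus k learned sinusoids, concatenated into an output of size k + 1 for a single scalar time feature.

import tensorflow as tf

class Time2Vec(tf.keras.layers.Layer):
    """Sketch of a Time2Vec-style embedding: 1 linear term + k sinusoids."""

    def __init__(self, k, **kwargs):
        super().__init__(**kwargs)
        self.k = k

    def build(self, input_shape):
        # one weight/bias pair for the linear component
        self.w_lin = self.add_weight(name="w_lin", shape=(1,), initializer="glorot_uniform")
        self.b_lin = self.add_weight(name="b_lin", shape=(1,), initializer="zeros")
        # k weight/bias pairs for the periodic components
        self.w_per = self.add_weight(name="w_per", shape=(1, self.k), initializer="glorot_uniform")
        self.b_per = self.add_weight(name="b_per", shape=(1, self.k), initializer="zeros")

    def call(self, t):
        # t: (batch, 1) scalar time feature
        linear = self.w_lin * t + self.b_lin                      # (batch, 1)
        periodic = tf.sin(tf.matmul(t, self.w_per) + self.b_per)  # (batch, k)
        return tf.concat([linear, periodic], axis=-1)             # (batch, k + 1)

# usage: Time2Vec(k=64) gives 64 sinusoids plus 1 linear term -> output size 65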

Can home-made embeddings work for RNNs, or do they HAVE to be trained?

Let's say I'm training an RNN for classification, using a vocabulary of 100 words. I can skip the embedding and pass in the sentences as one-hot vectors, but using one-hot vectors for a space of 100 features seems very wasteful in terms of memory, and it just gets worse as the vocab grows. Is there any reason why I couldn't create my own embedding where each value from 0 to 100 is converted into binary and stored as an array of length 7, i.e. 0=[0,0,0,0,0,0,0], 1=[0,0,0,0,0,0,1], ..., 100=[1,1,0,0,1,0,0]? I realize the dimensionality is low, but aside from that, I wasn't sure if this arbitrary embedding is a bad idea, since there are no relationships between the word vectors like there are with GloVe. BTW I can't use pre-made embeddings here, and my sample size isn't huge, which is why I'm exploring making my own.
Nice logic, but it is missing two important features of embeddings.
1) We use word embeddings to get similar vector representations for similar words.
For example, Apple and Mango will have nearly the same representation, and Cricket and Football will have nearly the same representation. You might ask why this is beneficial. The answer is: imagine we have mostly trained our model only on Apples. If our test data has something to do with Mangoes, even though we haven't explicitly trained on Mangoes, we will still get an appropriate answer.
ex: Training: I like Apples, I drink apple juice every day.
Testing: I like Mangoes, I drink _____ juice every day.
The blank will be filled with "Mango" even though we didn't train on Mangoes explicitly.
This is achieved by having a similar vector representation for both Mangoes and Apples, which cannot be achieved by your method.
2) Don't you think that even with your logic the vectors will be sparse? I agree it's better compared to one-hot encoding, but not compared to word embeddings. Around 90% of the entries in word-embedding vectors aren't zeros, but in your case it would be only about 50%.
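For what it's worth, the binary-code idea from the question can be wired into Keras as a fixed, non-trainable embedding; this sketch only makes the comparison concrete and is not a recommendation over learned embeddings.

import numpy as np
import tensorflow as tf

VOCAB_SIZE = 101   # word ids 0..100
CODE_LEN = 7       # 2**7 = 128 >= 101, so 7 bits are enough

# Row i is the 7-bit binary code of word id i, e.g. 100 -> [1, 1, 0, 0, 1, 0, 0].
codes = np.array(
    [[int(b) for b in format(i, f"0{CODE_LEN}b")] for i in range(VOCAB_SIZE)],
    dtype="float32",
)

# Fixed "embedding": the layer just looks up the binary code of each word id.
binary_embedding = tf.keras.layers.Embedding(
    input_dim=VOCAB_SIZE,
    output_dim=CODE_LEN,
    embeddings_initializer=tf.keras.initializers.Constant(codes),
    trainable=False,
)

Making the same layer trainable (and a bit wider) is what lets the network learn the similarity structure described above, which fixed binary codes cannot provide.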

Spacy 2.0 en_vectors_web_lg vs en_core_web_lg

What is the difference between the word vectors given in en_core_web_lg and en_vectors_web_lg? The number of keys differs: 1.1m vs 685k. I assume this means that en_vectors_web_lg has broader coverage, perhaps by keeping more morphological information and therefore more distinct tokens, since both are trained on the Common Crawl corpus but end up with a different number of tokens.
The en_vectors_web_lg package contains exactly the vectors provided by the original GloVe model. The en_core_web_lg model uses the vocabulary from the v1.x en_core_web_lg model, which, from memory, pruned out all entries that occurred fewer than 10 times in a 10-billion-word dump of Reddit comments.
In theory, most of the vectors that were removed should be things that the spaCy tokenizer never produces. However, earlier experiments with the full GloVe vectors did score slightly higher on NER than the current model, so it's possible we're actually missing out on something by losing the extra vectors. I'll do more experiments on this, and will likely switch the lg model to include the unpruned vector table, especially now that we have the md model, which strikes a better compromise than the current lg package.
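If you want to verify the numbers for the packages you have installed, spaCy exposes the vector table directly; a quick comparison, assuming both packages have been downloaded:

import spacy

for name in ("en_core_web_lg", "en_vectors_web_lg"):
    nlp = spacy.load(name)
    vectors = nlp.vocab.vectors
    # n_keys: vocabulary entries mapped into the table
    # shape: (number of unique vector rows, vector width)
    print(name, "keys:", vectors.n_keys, "table shape:", vectors.shape)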

Inverse transform word count vector to original document

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great, and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform these word count vectors back to the original documents. CountVectorizer does have a function inverse_transform(X), but this only gives you back the unique non-zero tokens.
As far as I know, CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Does anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a TensorFlow or any other module for this?
CountVectorizer is "lossy": for a document such as "This is the amazing string in amazing program", it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it, you can create a document that has the same words repeated the same number of times, but their original sequence in the document cannot be retraced.
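A small demonstration of that lossiness (the attribute and method names below are standard scikit-learn, but check your version):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.vocabulary_)           # term -> column index, no positions stored
print(X.toarray())               # per-document counts, e.g. amazing -> 2
print(vec.inverse_transform(X))  # only the unique non-zero tokens come back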

how does lucene build VSM?

I understand the concepts of the VSM, TF-IDF and cosine similarity; however, I am still confused about how Lucene builds the VSM and calculates the similarity for each query, even after reading the Lucene website.
As I understood, VSM is a matrix where the values of TFIDF of each term are filled. When I tried building a VSM from a set of documents, it took a long time with this tool: http://sourceforge.net/projects/wvtool/
This is not really related to the coding; intuitively, building a VSM matrix over large data is time consuming, but that does not seem to be the case for Lucene.
In addition, even with a prebuilt VSM, finding the most similar document, which basically is the calculation of similarity between two documents or between a query and a document, is often time consuming (assume millions of documents, because one has to compute similarity to everyone else), but Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me understand two points here:
1. How does Lucene build the VSM so fast that it can be used for calculating similarity?
2. How come Lucene's similarity calculation among millions of documents is so fast?
I'd appreciate it if a real example were given.
Thanks
As I understood, VSM is a matrix where the values of TFIDF of each term are filled.
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix and the notion of cosine similarity arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So, the rows of the term-document matrix are represented in the index, which is a hash table mapping terms to (document, tf) pairs plus a separate table mapping terms to their df value.
one has to compute similarity to everyone else
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
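To make that concrete, here is a toy pure-Python sketch of the idea (deliberately not Lucene's actual data structures or scoring formula): an inverted index maps each term to its (doc_id, tf) postings, and a query only ever touches the postings of its own terms.

import math
from collections import defaultdict

docs = {
    1: "this is the world of pleasure",
    2: "this amazing world is full of sarcastic people",
}

# Inverted index: term -> list of (doc_id, term frequency) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    tf = defaultdict(int)
    for term in text.split():
        tf[term] += 1
    for term, freq in tf.items():
        index[term].append((doc_id, freq))

N = len(docs)

def idf(term):
    df = len(index.get(term, []))
    return math.log(1 + N / df) if df else 0.0

def search(query):
    """Score only documents that share at least one term with the query."""
    scores = defaultdict(float)
    for term in query.split():
        w_query = idf(term)                                 # query term weight (tf assumed 1)
        for doc_id, freq in index.get(term, []):
            scores[doc_id] += w_query * freq * idf(term)    # unnormalized dot product of tf-idf weights
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("amazing world"))   # only docs containing "amazing" or "world" are scored

Length normalization is omitted for brevity; Lucene's practical scoring adds normalization factors on top of this basic "look up postings, accumulate scores" pattern.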