How does Lucene build the VSM?

I understand the concepts of the VSM, TF-IDF, and cosine similarity; however, I am still confused about how Lucene builds the VSM and calculates similarity for each query, even after reading the Lucene website.
As I understood it, the VSM is a matrix filled with the TF-IDF value of each term. When I tried building a VSM from a set of documents with this tool, http://sourceforge.net/projects/wvtool/, it took a long time.
This is not really about that particular tool's code: intuitively, building a VSM matrix over a large dataset is time consuming, but that does not seem to be the case for Lucene.
In addition, even with a prebuilt VSM, finding the most similar document (which basically means calculating the similarity between two documents, or between a query and a document) is often time consuming with millions of documents, because one has to compute similarity to everyone else. Yet Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me understand two points here:
1. How does Lucene build the VSM so fast that it can be used for calculating similarity?
2. How is Lucene's similarity calculation among millions of documents so fast?
I'd appreciate it if a real example were given.
Thanks

As I understood it, the VSM is a matrix filled with the TF-IDF value of each term.
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix, and the notion of cosine similarity, arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So the rows of the term-document matrix are represented in the index, which is essentially a hash table mapping terms to (document, tf) pairs, plus a separate table mapping terms to their df values.
one has to compute similarity to everyone else
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
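To make that concrete, here is a minimal sketch of the idea (this is not Lucene's actual data structure or scoring code, just an illustration): an inverted index mapping each term to its (docId, tf) postings plus a df table, and a scorer that only touches documents sharing at least one term with the query. The tf/idf formulas loosely follow Lucene's classic DefaultSimilarity (sqrt for tf, 1 + log(N/(df+1)) for idf); the length normalization here is deliberately crude.

import java.util.*;

// Toy inverted index: term -> (docId, tf) postings, plus a df table, as described above.
public class ToyInvertedIndex {

    static class Posting {
        final int docId; final int tf;
        Posting(int docId, int tf) { this.docId = docId; this.tf = tf; }
    }

    private final Map<String, List<Posting>> postings = new HashMap<>();
    private final Map<String, Integer> docFreq = new HashMap<>();
    private final Map<Integer, Double> docNorm = new HashMap<>(); // crude per-document length norm
    private int numDocs = 0;

    public void addDocument(int docId, List<String> tokens) {
        numDocs++;
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new Posting(docId, e.getValue()));
            docFreq.merge(e.getKey(), 1, Integer::sum);
        }
        docNorm.put(docId, tokens.isEmpty() ? 0.0 : 1.0 / Math.sqrt(tokens.size()));
    }

    // Scores only documents that share at least one term with the query.
    public Map<Integer, Double> search(List<String> queryTokens) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : queryTokens) {
            List<Posting> plist = postings.get(term);   // one hash lookup per query term
            if (plist == null) continue;                // term absent from the corpus: contributes nothing
            double idf = 1.0 + Math.log((double) numDocs / (docFreq.get(term) + 1));
            for (Posting p : plist) {
                double w = Math.sqrt(p.tf) * idf * idf; // doc tf-idf times query idf weight
                scores.merge(p.docId, w, Double::sum);
            }
        }
        scores.replaceAll((doc, s) -> s * docNorm.get(doc)); // crude length normalization
        return scores;
    }
}

Documents that share no term with the query never even appear in the scores map; that, plus the per-term hash lookups, is where the speed comes from.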

Related

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kinds of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it, but I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, if you can, choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on two features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which basically says how many standard deviations a point lies from the mean. The Z-score is signed, meaning that if a point is below the mean, the Z-score will be negative. This can be used to analyze all the features of the dataset. You can then find the threshold value in your labelled dataset above which all points are labelled as outliers.
Lastly, we can find the interquartile range (IQR) and similarly filter based on it. The IQR is simply the difference between the 75th percentile and the 25th percentile. You can use it much like the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
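As a concrete illustration, here is a small sketch of the Z-score and IQR computations described above for a single feature column (plain Java, no libraries; class and method names are just examples):

import java.util.Arrays;

public class OutlierStats {

    // Signed Z-score of every value: (x - mean) / standard deviation.
    public static double[] zScores(double[] x) {
        double mean = Arrays.stream(x).average().orElse(0.0);
        double var = Arrays.stream(x).map(v -> (v - mean) * (v - mean)).average().orElse(0.0);
        double sd = Math.sqrt(var);
        return Arrays.stream(x).map(v -> (v - mean) / sd).toArray();
    }

    // IQR-based bounds: values outside [q1 - k*IQR, q3 + k*IQR] are flagged (k = 1.5 is a common choice).
    public static double[] iqrBounds(double[] x, double k) {
        double[] sorted = x.clone();
        Arrays.sort(sorted);
        double q1 = percentile(sorted, 25), q3 = percentile(sorted, 75);
        double iqr = q3 - q1;
        return new double[] { q1 - k * iqr, q3 + k * iqr };
    }

    // Percentile with linear interpolation on a sorted array.
    private static double percentile(double[] sorted, double p) {
        double rank = p / 100.0 * (sorted.length - 1);
        int lo = (int) Math.floor(rank), hi = (int) Math.ceil(rank);
        return sorted[lo] + (rank - lo) * (sorted[hi] - sorted[lo]);
    }
}

You can compute these per feature and then check how the labelled outliers are distributed relative to the thresholds you find.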
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on density, so it will be easy to apply these techniques to the outliers.
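If you want to experiment with that, here is a minimal, brute-force DBSCAN sketch (Euclidean distance, no spatial index), just to show the mechanics; for real data sizes you would use a library implementation. Running it on the full dataset and comparing the resulting labels with your outlier labels tells you whether the outliers end up as noise, form their own clusters, or get absorbed into clusters of good data.

import java.util.*;

public class Dbscan {
    private static final int NOISE = -1, UNVISITED = 0;

    // Returns a cluster id per point: -1 = noise, 1..k = cluster membership.
    public static int[] cluster(double[][] X, double eps, int minPts) {
        int n = X.length;
        int[] label = new int[n]; // 0 = unvisited
        int clusterId = 0;
        for (int p = 0; p < n; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> neighbors = regionQuery(X, p, eps);
            if (neighbors.size() < minPts) { label[p] = NOISE; continue; } // not a core point
            clusterId++;
            label[p] = clusterId;
            Deque<Integer> seeds = new ArrayDeque<>(neighbors);
            while (!seeds.isEmpty()) {
                int q = seeds.poll();
                if (label[q] == NOISE) { label[q] = clusterId; continue; } // border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(X, q, eps);
                if (qNeighbors.size() >= minPts) seeds.addAll(qNeighbors); // q is a core point: expand
            }
        }
        return label;
    }

    // All points within distance eps of point p (including p itself).
    private static List<Integer> regionQuery(double[][] X, int p, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < X.length; i++) {
            double d2 = 0;
            for (int j = 0; j < X[p].length; j++) {
                double diff = X[p][j] - X[i][j];
                d2 += diff * diff;
            }
            if (Math.sqrt(d2) <= eps) out.add(i);
        }
        return out;
    }
}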

Select important features then impute or first impute then select important features?

I have a dataset with lots of features (mostly categorical Yes/No features) and lots of missing values.
One of the techniques for dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute's usage statistics to find the most informative subset of features. That is, we can generate a large set of very shallow trees, with each tree being trained on a small fraction of the total number of attributes. If an attribute is often selected as the best split, it is most likely an informative feature to retain.
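For concreteness, here is a rough sketch of the kind of procedure I mean, using single-split decision stumps on random feature subsets (binary features and labels assumed; all class and method names are made up for illustration):

import java.util.*;

public class StumpFeatureImportance {

    // X: n x d matrix of 0/1 features, y: 0/1 labels.
    // Counts how often each feature wins the "best split" contest over many random subsets.
    public static int[] importanceCounts(int[][] X, int[] y, int rounds, int subsetSize, long seed) {
        int d = X[0].length;
        int[] counts = new int[d];
        Random rnd = new Random(seed);
        List<Integer> features = new ArrayList<>();
        for (int j = 0; j < d; j++) features.add(j);
        for (int r = 0; r < rounds; r++) {
            Collections.shuffle(features, rnd);        // pick a small random subset of features
            int best = -1;
            double bestGain = -1;
            for (int k = 0; k < Math.min(subsetSize, d); k++) {
                int f = features.get(k);
                double g = infoGain(X, y, f);
                if (g > bestGain) { bestGain = g; best = f; }
            }
            if (best >= 0) counts[best]++;             // this feature gave the most informative split
        }
        return counts;
    }

    // Information gain of splitting on binary feature f.
    private static double infoGain(int[][] X, int[] y, int f) {
        int n = y.length, n0 = 0, n1 = 0;
        int[] all = new int[2], left = new int[2], right = new int[2];
        for (int i = 0; i < n; i++) {
            all[y[i]]++;
            if (X[i][f] == 0) { left[y[i]]++; n0++; } else { right[y[i]]++; n1++; }
        }
        double h0 = n0 == 0 ? 0 : entropy(left, n0);
        double h1 = n1 == 0 ? 0 : entropy(right, n1);
        return entropy(all, n) - (n0 * h0 + n1 * h1) / n;
    }

    private static double entropy(int[] counts, int total) {
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }
}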
I am also using an imputer to fill the missing values.
My doubt is about the order of these two steps. Which of the two (dimensionality reduction or imputation) should be done first, and why?
From a mathematical perspective you should always avoid data imputation (in the sense of using it only if you have to). In other words, if you have a method that can work with missing values, use it; if you do not, you are left with data imputation.
Data imputation is nearly always heavily biased; this has been shown many times, and I believe I have even read a paper about it that is roughly 20 years old. In general, to do statistically sound data imputation you need to fit a very good generative model. Just imputing the most common value, the mean, etc. makes assumptions about the data of similar strength to Naive Bayes.

Extrapolate Sentence Similarity Given Word Similarities

Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?
The word scores are calculated using cosine similarity from vectors representing each word.
Now that I have individual word scores, is it too naive to sum the individual word scores and divide by the total word count of both sentences to get a score for the two sentences?
I've read about further constructing vectors to represent the sentences, using the word scores, and then again using cosine similarity to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores. Nor am I aware of what the tradeoffs are compared with the naive approach described above, which at the very least, I can easily comprehend. :).
Any insights are greatly appreciated.
Thanks.
What I ended up doing was taking the mean of each set of vectors and then applying cosine similarity to the two means, resulting in a score for the sentences.
I'm not sure how mathematically sound this approach is, but I've seen it done in other places (like Python's gensim).
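In code it is only a few lines. Here is a minimal sketch (plain Java; each sentence is assumed to be given as an array of its word vectors):

public class SentenceSimilarity {

    // Element-wise mean of a sentence's word vectors.
    static double[] meanVector(double[][] wordVectors) {
        int dim = wordVectors[0].length;
        double[] mean = new double[dim];
        for (double[] v : wordVectors)
            for (int i = 0; i < dim; i++) mean[i] += v[i] / wordVectors.length;
        return mean;
    }

    // Standard cosine similarity between two dense vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Sentence similarity = cosine of the two mean vectors.
    static double similarity(double[][] sentenceA, double[][] sentenceB) {
        return cosine(meanVector(sentenceA), meanVector(sentenceB));
    }
}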
It would be better to use contextual word embeddings (vector representations) for the words.
Here is an approach to sentence similarities by pairwise word similarities: BERTScore.
You can check the math here.
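For reference, the core of that matching is simple once the pairwise word-similarity matrix is available. Here is a simplified sketch of the BERTScore-style aggregation (it ignores the idf weighting and baseline rescaling the real metric adds): each word of one sentence is greedily matched to its most similar word in the other sentence, and the two directions are combined into an F1-style score.

public class GreedyMatchScore {

    // sim[i][j] = similarity between word i of sentence A and word j of sentence B.
    static double score(double[][] sim) {
        int n = sim.length, m = sim[0].length;
        double recall = 0, precision = 0;
        for (int i = 0; i < n; i++) {                 // each word of A matched to its best word in B
            double best = Double.NEGATIVE_INFINITY;
            for (int j = 0; j < m; j++) best = Math.max(best, sim[i][j]);
            recall += best / n;
        }
        for (int j = 0; j < m; j++) {                 // each word of B matched to its best word in A
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < n; i++) best = Math.max(best, sim[i][j]);
            precision += best / m;
        }
        return 2 * precision * recall / (precision + recall);  // F1-style combination
    }
}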

Weighted cosine similarity calculation using Lucene

This question is related to calculating cosine similarity between documents using Lucene.
The documents are marked up with Taxonomy and Ontology terms separately. When I calculate the similarity between documents, I want to give higher weights to those Taxonomy and Ontology terms.
When I index the documents, I define the document content, Taxonomy terms, and Ontology terms as Fields for each document, like this in my program:
Field ontologyTerm= new Field("fiboterms", fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field document = new Field(docNames[curDocNo], strRdElt, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
I'm using Lucene's TermFreqVector functions on the index to calculate TFIDF values, and then calculate cosine similarity between two documents using those TFIDF values.
I can use Lucene's field.setBoost() function to give higher weights to the fields before indexing. I used the debugger to inspect the frequency values of the Taxonomy terms after setting a boost value, but it doesn't change the term frequencies. Does that mean setBoost() has no effect on the TermFreqVector or TFIDF values? Does setBoost() only increase the weights used in document searching?
Another thing I can do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TFIDF scores. Will this give higher weight to the Taxonomy and Ontology terms in the document similarity calculation?
Are there any other Lucene functions that can be used to give higher weights to certain fields when calculating TFIDF values using TermFreqVector? Or can I just use setBoost() for this purpose, and if so, how?
The TermFreqVector is just that - the term frequencies. No weights. It says in the docs "Each location in the array contains the number of times this term occurs in the document or the document field."
You can see from Lucene's algorithm that boosts are applied as a multiplicative factor. So if you want to replicate that, then yes, this will give your terms a higher weight.
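For example, something along these lines reproduces that multiplicative behaviour when you compute the cosine yourself from the TermFreqVector arrays (field names and weight values are just placeholders, and idf weighting is left out for brevity):

import java.util.*;

public class WeightedCosine {

    // terms/freqs come from TermFreqVector.getTerms() / getTermFrequencies() for one field.
    static void addField(Map<String, Double> vec, String[] terms, int[] freqs, double fieldWeight) {
        for (int i = 0; i < terms.length; i++)
            vec.merge(terms[i], fieldWeight * freqs[i], Double::sum); // the weight acts as a boost
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;   // only shared terms contribute
        }
        for (double w : b.values()) normB += w * w;
        return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

You would call addField once per field and document, say the content field with weight 1.0 and the "fiboterms"/"taxoterms" fields with something larger, then compare the resulting maps with cosine (optionally multiplying each entry by idf first).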
I'm not sure what your use case is, but you might want to consider just using Lucene's Scorer class. Then you won't have to deal with making your own.

Lucene. How to build a term-doc matrix

I need to build that matrix, but I can't find a way to compute the normalized tf-idf for each cell.
The normalization I would perform is cosine normalization, that is, multiplying each tf-idf value (computed using DefaultSimilarity) by 1/sqrt(sum of squared tf-idf values in the column).
Does anyone know a way to perform that?
Thanks in advance
Antonio
One way, not using Lucene, is described in Sujit Pal's blog. Alternatively, you can build a Lucene index that stores term vectors per field, iterate over the terms to get idf, then iterate over each term's documents to get tf.
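As a sketch of that second approach, assuming the pre-4.0 API that TermFreqVector belongs to (Lucene 3.x) and a field indexed with TermVector.YES, something like this builds a cosine-normalized tf-idf vector per document, i.e. one column of the matrix at a time:

import java.util.*;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.DefaultSimilarity;

public class TermDocMatrixBuilder {

    // Returns docId -> (term -> normalized tf-idf weight) for the given field.
    public static Map<Integer, Map<String, Double>> build(IndexReader reader, String field) throws Exception {
        DefaultSimilarity sim = new DefaultSimilarity();
        int numDocs = reader.numDocs();
        Map<Integer, Map<String, Double>> matrix = new HashMap<>();

        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (reader.isDeleted(doc)) continue;
            TermFreqVector tfv = reader.getTermFreqVector(doc, field);
            if (tfv == null) continue;                        // no term vector stored for this doc/field
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();

            Map<String, Double> vec = new HashMap<>();
            double sumSq = 0;
            for (int i = 0; i < terms.length; i++) {
                double tf = sim.tf(freqs[i]);                 // sqrt(freq) in DefaultSimilarity
                double idf = sim.idf(reader.docFreq(new Term(field, terms[i])), numDocs);
                double w = tf * idf;
                vec.put(terms[i], w);
                sumSq += w * w;
            }
            double norm = Math.sqrt(sumSq);                   // cosine (L2) normalization of the column
            if (norm > 0)
                for (Map.Entry<String, Double> e : vec.entrySet()) e.setValue(e.getValue() / norm);
            matrix.put(doc, vec);
        }
        return matrix;
    }
}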