Does a larger tf always boost a document's score in Lucene? - lucene

I understand that the default term frequency (tf) is simply calculated as the square root of the number of times the searched term appears in a field. So documents containing multiple occurrences of a term you are searching on will have a higher tf and hence a higher weight.
What I'm unsure about is whether this increases the document's score because the weight is higher, or reduces the document's score because it moves the document vector away from the query vector, as the book Hibernate Search in Action seems to be saying (p. 363). I confess I'm really struggling to see how the document vector model fits in with Lucene's scoring equation.

I don't have this book to check, but basically (if we ignore the different boosts that can be set manually at indexing time), there are three reasons why the score of some document may be higher (or lower) than the score of other documents with Lucene's default scoring model and for a given query:
the queried term has a low document frequency (boosting the IDF part of the score),
the queried term has a high number of occurrences in the document (boosting the TF part of the score),
the queried term appears in a rather small field of the document (boosting the norm part of the score).
This means that for two documents D1 and D2 and one queried term T, if
T appears n times in D1,
T appears p > n times in D2,
the queried field of D2 has (almost) the same size (number of terms) as D1,
D2 will have a better score than D1.
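To make this concrete, here is a minimal sketch (not Lucene's actual code) of the classic DefaultSimilarity building blocks, tf = sqrt(freq), idf = 1 + ln(numDocs/(docFreq+1)) and lengthNorm = 1/sqrt(numTerms), ignoring boosts, coord and query normalization; the corpus statistics and occurrence counts below are invented for illustration:

    // Sketch of DefaultSimilarity's tf/idf/lengthNorm factors; all numbers are invented.
    public class TfScoreSketch {

        static double tf(int freq) { return Math.sqrt(freq); }

        static double idf(int docFreq, int numDocs) {
            return 1.0 + Math.log((double) numDocs / (docFreq + 1));
        }

        static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

        // Single-term score, ignoring boosts, coord and query normalization.
        static double score(int freq, int docFreq, int numDocs, int fieldLength) {
            return tf(freq) * idf(docFreq, numDocs) * lengthNorm(fieldLength);
        }

        public static void main(String[] args) {
            int numDocs = 1000, docFreq = 50;             // hypothetical corpus statistics
            double d1 = score(2, docFreq, numDocs, 100);  // T appears n = 2 times, 100-term field
            double d2 = score(5, docFreq, numDocs, 100);  // T appears p = 5 times, same field length
            System.out.printf("D1 = %.3f, D2 = %.3f%n", d1, d2);
        }
    }

With equal field lengths the only difference is the tf factor, so the document with more occurrences (D2) wins; if D2's field were much longer, its smaller lengthNorm could cancel that advantage.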

Related

How is the Gini-Index minimized in CART Algorithm for Decision Trees?

For neural networks, for example, I minimize the cost function by using the backpropagation algorithm. Is there something equivalent for the Gini index in decision trees?
The CART algorithm always states "choose the partition of set A that minimizes the Gini index", but how do I actually get that partition mathematically?
Any input on this would be helpful :)
For a decision tree, there are different methods for splitting continuous variables like age, weight, income, etc.
A) Discretize the continuous variable to use it as a categorical variable in all aspects of the DT algorithm. This can be done:
only once at the start, and then keeping this discretization static
at every stage where a split is required, using percentiles or interval ranges or clustering to bucketize the variable
B) Split at all possible distinct values of the variable and see where there is the highest decrease in the Gini Index. This can be computationally expensive. So, there are optimized variants where you sort the variables and instead of choosing all distinct values, choose the midpoints between two consecutive values as the splits. For example, if the variable 'weight' has 70, 80, 90 and 100 kgs in the data points, try 75, 85, 95 as splits and pick the best one (highest decrease in Gini or other impurities)
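As a sketch of option (B), assuming a binary class label, the midpoint search might look like this; the weight/label data below is invented for illustration:

    import java.util.Arrays;

    public class MidpointSplit {

        // Gini impurity of a two-class node.
        static double gini(int pos, int neg) {
            int n = pos + neg;
            if (n == 0) return 0.0;
            double p = (double) pos / n;
            return 1.0 - p * p - (1.0 - p) * (1.0 - p);
        }

        public static void main(String[] args) {
            // Invented data: weights in kg with a binary class label.
            double[] weight = {70, 80, 90, 100};
            boolean[] isPositive = {true, true, false, false};

            // Try midpoints between consecutive sorted values (75, 85, 95) and keep
            // the threshold with the lowest weighted Gini impurity.
            double[] sorted = weight.clone();
            Arrays.sort(sorted);
            double bestThreshold = Double.NaN, bestImpurity = Double.MAX_VALUE;
            for (int i = 0; i < sorted.length - 1; i++) {
                double threshold = (sorted[i] + sorted[i + 1]) / 2.0;
                int lp = 0, ln = 0, rp = 0, rn = 0;
                for (int j = 0; j < weight.length; j++) {
                    boolean left = weight[j] <= threshold;
                    if (isPositive[j]) { if (left) lp++; else rp++; }
                    else               { if (left) ln++; else rn++; }
                }
                int total = weight.length;
                double weighted = (lp + ln) / (double) total * gini(lp, ln)
                                + (rp + rn) / (double) total * gini(rp, rn);
                if (weighted < bestImpurity) { bestImpurity = weighted; bestThreshold = threshold; }
            }
            System.out.printf("best split: weight <= %.1f (weighted Gini %.3f)%n",
                              bestThreshold, bestImpurity);
        }
    }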
As for the exact split algorithm implemented in scikit-learn in Python, rpart in R, and MLlib in PySpark, and how they differ when splitting a continuous variable, I am not sure myself and am still researching.
Here is a good example of the CART algorithm. Basically, we compute the Gini index like this:
For each attribute we have different values, each of which will have a Gini index according to the classes its records belong to. For example, if we had two classes (positive and negative), each value of an attribute will have some records that belong to the positive class and some that belong to the negative class, so we can calculate the probabilities. Say the attribute is called weather and it has two values (e.g. rainy and sunny), and we have this information:
rainy: 2 positive, 3 negative
sunny: 1 positive, 2 negative
we could say:
Gini(rainy) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(sunny) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
Then we can compute the weighted sum of the Gini indexes for weather (assuming we had a total of 8 records):
Gini(weather) = (5/8) × 0.48 + (3/8) × 0.444 ≈ 0.467
We do this for all the other attributes (as we did for weather), and at the end we choose the attribute with the lowest Gini index as the one to split the tree on. We have to redo all of this at each split (unless we can classify the sub-tree without the need for further splitting).
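As a quick check of the weather example, this tiny sketch reproduces the numbers above (only the counts from the example are used):

    public class WeatherGini {
        // Gini impurity of a value with the given positive/negative counts.
        static double gini(double pos, double neg) {
            double n = pos + neg;
            return 1.0 - Math.pow(pos / n, 2) - Math.pow(neg / n, 2);
        }

        public static void main(String[] args) {
            double rainy = gini(2, 3);                            // 0.48
            double sunny = gini(1, 2);                            // ~0.444
            double weighted = 5.0 / 8 * rainy + 3.0 / 8 * sunny;  // ~0.467
            System.out.printf("rainy=%.3f sunny=%.3f weather=%.3f%n", rainy, sunny, weighted);
        }
    }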

Algorithm - finding the order of HMM from observations

I am given a data that consists of N sequences of variable lengths of hidden variables and their corresponding observed variables (i.e., I have both the hidden variables and the observed variables for each sequence).
Is there a way to find the order K of the "best" HMM model for this data, without exhaustive search? (justified heuristics are also legitimate).
I think there may be a confusion about the word "order":
A first-order HMM is an HMM whose transition probabilities depend only on the previous state. A second-order HMM is one whose transition probabilities depend on the two previous states, and so on. As the order increases, the theory (i.e., the equations) gets "thicker", and very few mainstream libraries implement such complex models.
A search on your favorite browser with the keywords "second-order HMM" will bring you to meaningful readings about these models.
If by order you mean the number of states, and with the assumption that you use a single distribution per state (i.e., you do not use HMMs with mixtures of distributions), then indeed the only hyperparameter you need to tune is the number of states.
You can estimate the optimal number of states using criteria such as the Bayesian Information Criterion, the Akaike Information Criterion, or the Minimum Message Length criterion, which are based on the model's likelihood. Usually, using these criteria requires training multiple models so that meaningful likelihoods can be computed and compared.
If you just want a rough idea of a good K value that may not be optimal, k-means clustering combined with the percentage of variance explained can do the trick: if X clusters explain more than, say, 90% of the variance of the observations in your training set, then going with an X-state HMM is a good start. The first three criteria are interesting because they include a penalty term that grows with the number of parameters of the model and can therefore prevent some overfitting.
These criteria can also be applied when one uses mixture-based HMMs, in which case there are more hyperparameters to tune (i.e., the number of states and the number of components of the mixture models).
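For example, assuming discrete emissions and that you have already trained one HMM per candidate number of states (e.g. with Baum-Welch), a BIC-based selection could be sketched like this; the log-likelihoods, alphabet size and observation count below are placeholders, not real results:

    public class HmmStateSelection {

        // Free parameters of a discrete-emission HMM with K states and M symbols:
        // (K-1) initial probabilities + K*(K-1) transition entries + K*(M-1) emission entries.
        static int numFreeParams(int K, int M) {
            return (K - 1) + K * (K - 1) + K * (M - 1);
        }

        // Bayesian Information Criterion: lower is better.
        static double bic(double logLikelihood, int numParams, int numObservations) {
            return numParams * Math.log(numObservations) - 2.0 * logLikelihood;
        }

        public static void main(String[] args) {
            int M = 10;      // hypothetical alphabet size
            int n = 5000;    // hypothetical total number of observations
            // Hypothetical log-likelihoods of one trained model per candidate K (placeholders).
            int[] candidateK = {2, 3, 4, 5};
            double[] logL = {-11200.0, -10650.0, -10580.0, -10560.0};

            int bestK = -1;
            double bestBic = Double.MAX_VALUE;
            for (int i = 0; i < candidateK.length; i++) {
                double b = bic(logL[i], numFreeParams(candidateK[i], M), n);
                System.out.printf("K=%d  BIC=%.1f%n", candidateK[i], b);
                if (b < bestBic) { bestBic = b; bestK = candidateK[i]; }
            }
            System.out.println("selected number of states: " + bestK);
        }
    }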

How does Lucene build the VSM? - lucene

I understand the concept of the VSM, TF-IDF and cosine similarity; however, I am still confused about how Lucene builds the VSM and calculates similarity for each query, even after reading the Lucene website.
As I understood it, the VSM is a matrix where the TF-IDF values of each term are filled in. When I tried building a VSM from a set of documents, it took a long time with this tool http://sourceforge.net/projects/wvtool/
This is not really about the coding, because intuitively building a VSM matrix over large data is time consuming, but that does not seem to be the case for Lucene.
In addition, with a prebuilt VSM, finding the most similar document, which is basically the calculation of similarity between two documents or between a query and a document, is often time consuming (assume millions of documents, because one has to compute similarity to everyone else), but Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me understand two points here:
1. How does Lucene build the VSM so fast that it can be used for calculating similarity?
2. How is Lucene's similarity calculation among millions of documents so fast?
I'd appreciate it if a real example is given.
Thanks
As I understood it, the VSM is a matrix where the TF-IDF values of each term are filled in.
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix, and the notion of cosine similarity arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So, the rows of the term-document matrix are represented in the index, which is a hash table mapping terms to (document, tf) pairs plus a separate table mapping terms to their df value.
one has to compute similarity to everyone else
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
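To illustrate the point, here is a toy sketch of an inverted index (plain Java, not Lucene's actual data structures): scoring only walks the postings of the query terms, so documents sharing no term with the query are never touched. The terms, documents and weighting are invented.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ToyInvertedIndex {

        // A posting: which document a term occurs in, and how often.
        record Posting(int docId, int termFreq) {}

        public static void main(String[] args) {
            // Toy index: term -> postings list. Real Lucene stores this on disk, per segment.
            Map<String, List<Posting>> index = new HashMap<>();
            index.put("lucene", List.of(new Posting(0, 3), new Posting(2, 1)));
            index.put("vsm",    List.of(new Posting(2, 2)));
            index.put("redis",  List.of(new Posting(1, 4)));

            int numDocs = 3;
            String[] query = {"lucene", "vsm"};

            // Accumulate scores only for documents that appear in some query term's postings.
            Map<Integer, Double> scores = new HashMap<>();
            for (String term : query) {
                List<Posting> postings = index.getOrDefault(term, List.of());
                double idf = 1.0 + Math.log((double) numDocs / (postings.size() + 1));
                for (Posting p : postings) {
                    double weight = Math.sqrt(p.termFreq()) * idf;   // tf-idf-style weight
                    scores.merge(p.docId(), weight, Double::sum);
                }
            }
            // Document 1 never shows up: it shares no term with the query, so it costs nothing.
            System.out.println(scores);
        }
    }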

Weighted cosine similarity calculation using Lucene

This question is related to calculating cosine similarity between documents using Lucene.
The documents are marked up with Taxonomy and Ontology terms separately. When I calculate the document similarity between documents, I want to give higher weights to those Taxonomy terms and Ontology terms.
When I index the document, I have defined the Document content, Taxonomy and Ontology terms as Fields for each document like this in my program.
Field ontologyTerm= new Field("fiboterms", fiboTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field taxonomyTerm = new Field("taxoterms", taxoTermList[curDocNo], Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
Field document = new Field(docNames[curDocNo], strRdElt, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
I'm using Lucene's TermFreqVector functions (via the index reader) to calculate TF-IDF values, and then calculate cosine similarity between two documents using those TF-IDF values.
I can use Lucene's field.setBoost() function to give higher weights to the fields before indexing. I used the debugger to see the frequency values of the Taxonomy terms after setting a boost value, but it doesn't change the term frequencies. Does that mean the setBoost() function has no effect on the TermFreqVector or the TF-IDF values? Does setBoost() increase the weights only during document searching?
Another thing I can do is programmatically multiply the Taxonomy and Ontology term frequencies by a defined weight factor before calculating the TF-IDF scores. Will this give higher weight to Taxonomy and Ontology terms in the document similarity calculation?
Are there any other Lucene functions that can be used to give higher weights to certain fields when calculating TF-IDF values using TermFreqVector? Or can I just use the setBoost() function for this purpose, and if so, how?
The TermFreqVector is just that - the term frequencies. No weights. It says in the docs "Each location in the array contains the number of times this term occurs in the document or the document field."
You can see from Lucene's algorithm that the way boosts are used is as a multiplicative factor. So if you want to replicate that then yes this will give your terms a higher weight.
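If you go that route, a sketch along these lines (using the Lucene 3.x TermFreqVector API from your snippet; the field-weight factor is your own choice, not something Lucene computes) builds a per-document map of weighted frequencies that you can feed into your existing TF-IDF/cosine code:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class WeightedTermVectors {

        // Builds a term -> weighted-frequency map for one field of one document,
        // multiplying the raw frequencies by a caller-chosen factor (e.g. 2.0 for the
        // taxonomy/ontology fields, 1.0 for plain content).
        static Map<String, Double> weightedFreqs(IndexReader reader, int docId,
                                                 String field, double fieldWeight) throws IOException {
            Map<String, Double> vector = new HashMap<>();
            TermFreqVector tfv = reader.getTermFreqVector(docId, field);   // Lucene 3.x API
            if (tfv == null) return vector;
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                vector.merge(terms[i], freqs[i] * fieldWeight, Double::sum);
            }
            return vector;
        }
    }

Merging the maps for "fiboterms" and "taxoterms" (with a factor greater than 1) and the content field (factor 1.0) into one vector per document gives those terms the same kind of multiplicative advantage in your cosine calculation that a boost has in Lucene's own scoring.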
I'm not sure what your use case is, but you might want to consider just using Lucene's Scorer class. Then you won't have to deal with making your own.

Markov chains with Redis

For self-education purposes, I want to implement a Markov chain generator, using as much Redis and as little application-level logic as possible.
Let's say I want to build a word generator, based on a frequency table with history depth N (say, 2).
As a not very interesting example, for a dictionary of two words, bar and baz, the frequency table is as follows ("." is a terminator, numbers are weights):
. . -> b x2
. b -> a x2
b a -> r x1
b a -> z x1
a r -> . x1
a z -> . x1
When I generate the word, I start with history of two terminators . .
There is only one possible outcome for the first two letters, b a.
Third letter may be either r or z, with equal probabilities, since their weights are equal.
Fourth letter is always a terminator.
(Things would be more interesting with longer words in dictionary.)
Anyway, how to do this with Redis elegantly?
Redis sets have SRANDMEMBER, but do not have weights.
Redis sorted sets have weights, but do not have random member retrieval.
Redis lists allow weights to be represented as duplicate entries, but how do you make set intersections with them?
Looks like application code is doomed to do some data processing...
You can accomplish a weighted random selection with a redis sorted set, by assigning each member a score between zero and one, according to the cumulative probability of the members of the set considered thus far, including the current member.
The ordering you use is irrelevant; you may choose any order which is convenient for you. The random selection is then accomplished by generating a random floating point number r uniformly distributed between zero and one, and calling
ZRANGEBYSCORE zset r 1 LIMIT 0 1,
which will return the first element with a score greater than or equal to r.
A little bit of reasoning should convince you that the probability of choosing a member is thus weighted correctly.
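Concretely, for the "b a" history of the bar/baz example, the setup and lookup might look like this with a Jedis-style client (the key name and the choice of client are assumptions, not part of the question):

    import java.util.Random;
    import redis.clients.jedis.Jedis;

    public class RedisWeightedChoice {

        public static void main(String[] args) {
            // Assumes a local Redis instance; the key naming is invented for the example.
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                String key = "next:b a";
                // Scores are *cumulative* probabilities: r covers (0, 0.5], z covers (0.5, 1].
                jedis.zadd(key, 0.5, "r");
                jedis.zadd(key, 1.0, "z");

                // Draw a random value in (0, 1] and take the first member with score >= r.
                double r = 1.0 - new Random().nextDouble();
                var hit = jedis.zrangeByScore(key, r, 1.0, 0, 1);
                System.out.println("next letter: " + hit.iterator().next());
            }
        }
    }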
Unfortunately, the fact that the scores assigned to the elements need to be proportional to the cumulative probability would seem to make it difficult to use the sorted set union or intersection operations in a way that preserves the significance of the scores for random selection of elements. That part would seem to require some significant application logic.