Implement Simple TF-IDF Scoring in Lucene

This is the Lucene practical scoring function (https://lucene.apache.org/core/7_5_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html):
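score(q,d) = coord(q,d) * queryNorm(q) * ∑ for each term t in q: ( tf(t in d) * idf(t)² * t.getBoost() * norm(t,d) )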
The documentation says:
idf(t) stands for Inverse Document Frequency. This value correlates to
the inverse of docFreq (the number of documents in which the term t
appears). This means rarer terms give higher contribution to the total
score. idf(t) appears for t in both the query and the document, hence
it is squared in the equation.
The last line is not clear to me (i.e., idf(t) appears for t in both the query and the document). How do I calculate idf(t) for t in the query?
What I am trying to do is to implement the simple TF-IDF scoring formula as a baseline approach for an experiment.
(TF-IDF Score(q,d) = ∑ for each term t in q: tf(t,d) * idf(t))
Lucene's scoring function is different from what I am trying to do. I can override the DefaultSimilarity class to ignore the effect of coord(q,d), queryNorm(q), t.getBoost(), and norm(t,d). My only concern is the idf(t) part. Why is it squared? How can I implement the simple TF-IDF scoring?
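To address the "why squared" part: idf(t) contributes once to the query's weight for t and once to the document's weight for t, so their product in the scoring formula contains idf(t)². A minimal sketch of the override approach described above, assuming an older Lucene release (4.x/5.x) where DefaultSimilarity still exists (the class name SimpleTfIdfSimilarity is made up): neutralize coord, queryNorm and the length norm, and return the square root of the real idf so the idf² in the practical formula collapses back to a single idf.

```java
import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Sketch only: reduces Lucene's practical formula to roughly: sum over t in q of tf(t,d) * idf(t).
public class SimpleTfIdfSimilarity extends DefaultSimilarity {

    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1f;                                   // ignore coord(q,d)
    }

    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        return 1f;                                   // ignore queryNorm(q)
    }

    @Override
    public float lengthNorm(FieldInvertState state) {
        return 1f;                                   // ignore norm(t,d): field boost and length norm
    }

    @Override
    public float tf(float freq) {
        return freq;                                 // DefaultSimilarity uses sqrt(freq); use raw counts instead
    }

    @Override
    public float idf(long docFreq, long numDocs) {
        // DefaultSimilarity.idf is log(numDocs / (docFreq + 1)) + 1; returning its square root
        // here means the idf^2 in the scoring formula ends up as a single idf.
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
    }
}
```

The similarity would have to be set in both places: on the IndexWriterConfig (setSimilarity) before indexing, since norms are baked in at index time, and on the IndexSearcher at search time.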

Related

Machine Learning text comparison model

I am creating a machine learning model that essentially returns the correctness of one text to another.
For example; “the cat and a dog”, “a dog and the cat”. The model needs to be able to identify that some words (“cat”/“dog”) are more important/significant than others (“a”/“the”). I am not interested in conjunction words etc. I would like to be able to tell the model which words are the most “significant” and have it determine how correct text 1 is to text 2, with the “significant” words bearing more weight than others.
It also needs to be able to recognise that phrases don’t necessarily have to be in the same order. The two above sentences should be an extremely high match.
What is the basic algorithm I should use to go about this? Is there an alternative to just creating a dataset with thousands of example texts and a score of correctness?
I am only after a broad overview/flowchart/process/algorithm.
I think TF-IDF might be a good fit for your problem, because:
Words occurring in many documents (say, 90% of your sentences/documents contain the conjunction 'and') get much less emphasis, essentially giving more weight to the more document-specific phrasing (this is the IDF part).
Word order does not matter for Term Frequency (TF), as opposed to methods using sliding windows, etc.
It is very lightweight compared to representation-oriented methods like the one mentioned above.
Big drawback: your data, depending on the size of the corpus, may have too many dimensions (as many dimensions as unique words); you can use stemming/lemmatization to mitigate this problem to some degree.
You can calculate the similarity between two TF-IDF vectors using, for example, cosine similarity.
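For what it's worth, here is a toy sketch of that pipeline in plain Java (no Lucene; the tiny two-sentence corpus, the whitespace tokenization and the idf variant are all made up for illustration): build a TF-IDF vector per sentence, then compare the vectors with cosine similarity.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfCosineDemo {

    // Build a sparse TF-IDF vector for one tokenized document against a small corpus.
    static Map<String, Double> tfidf(String[] doc, List<String[]> corpus) {
        Map<String, Double> vec = new HashMap<>();
        for (String term : doc) {
            vec.merge(term, 1.0, Double::sum);                      // raw term frequency
        }
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            long df = corpus.stream()
                            .filter(d -> Arrays.asList(d).contains(e.getKey()))
                            .count();                               // document frequency
            double idf = Math.log((double) corpus.size() / (df + 1)) + 1.0;
            e.setValue(e.getValue() * idf);                         // tf * idf
        }
        return vec;
    }

    // Cosine similarity between two sparse vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double v : b.values()) normB += v * v;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String[] s1 = "the cat and a dog".split(" ");
        String[] s2 = "a dog and the cat".split(" ");
        List<String[]> corpus = Arrays.asList(s1, s2);
        // Prints 1.0: word order is ignored, so the two sentences are a perfect match.
        System.out.println(cosine(tfidf(s1, corpus), tfidf(s2, corpus)));
    }
}
```

In a real setup the IDF statistics would come from a much larger corpus, and stop-word removal or stemming would shrink the vocabulary.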
EDIT: Woops, this question is 8 months old, sorry for the bump, maybe it will be of use to someone else though.

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we do a search for Dog, and a document with just the word dog and three other words is returned. We have another article with hundreds of words that has the word dog in it 27 times.
I would expect the second article to return first. However, the one with one word out of three returns first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should return first. Any help would be appreciated.
For 4.x, Solr still used regular TF-IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of matching terms (not their frequency) is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
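If it helps to see what those two flags correspond to at the Lucene level, here is a rough sketch using Lucene 5.x-style FieldType options (the field name and wiring are made up; in Solr itself you would simply set the attributes on the field or fieldType in schema.xml):

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class OmitFlagsSketch {

    static Field buildBodyField(String text) {
        FieldType type = new FieldType();
        type.setTokenized(true);
        type.setOmitNorms(true);                      // omitNorms="true": drops norm(t,d) entirely
        type.setIndexOptions(IndexOptions.DOCS);      // omitTermFreqAndPositions="true": only doc IDs
                                                      // are indexed, so no tf boost and no phrase queries
        type.freeze();
        return new Field("body", text, type);
    }
}
```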
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.

When indexing, what are the factors that can affect a term's score when searched

The question is a little confusing. I am new to Lucene and going through the documentation. I found out that adding a boost to a field increases the norm of the field and thus increases the score of the term when it is searched.
I.e., adding a boost to a field at indexing time can affect the score at search time. My question is: are there any other ways, other than boosting, to do the same? Please advise.
Before Lucene 4.x, there used to be a single scoring formula based on the Vector Space Model.
The following are the factors that contribute to the Lucene scoring.
1) Tf: term frequency, i.e. the frequency of a term in a document.
2) Idf: inverse document frequency: log(collection size / number of documents that contain the term). This formula may vary.
3) Field boost: the one you've mentioned. It's provided while indexing.
4) Coord: a score factor based on how many of the query terms are found in the specified document.
5) queryNorm(q): a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
6) norm(t,d) encapsulates a few (indexing time) boost and length factors:
a) Document boost - set by calling doc.setBoost() before adding the document to the index.
b) Field boost - set by calling field.setBoost() before adding the field to a document.
c) lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
7) Term boost: a search-time boost of term t in the query q.
For in-depth knowledge of Lucene's default scoring formula, check the documentation: Lucene Similarity.
With the release of Lucene 4.x, new scoring formulas such as BM25 have been introduced. For more details, check the subclasses of the Lucene 4.2 Similarity.
You can implement a subclass of Similarity to customize all the above scoring factors. Here is an Example...
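That linked example isn't reproduced here, but as an illustration of two of the factors above (the index-time field boost, 3, and the search-time term boost, 7), here is a rough sketch using Lucene 5.x-style APIs (class name, field name and contents are made up):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class BoostSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));

        Document doc = new Document();
        TextField title = new TextField("title", "lucene scoring explained", Field.Store.YES);
        title.setBoost(2.0f);          // index-time field boost: folded into norm(t,d)
        doc.add(title);
        writer.addDocument(doc);
        writer.close();

        // Search-time term boost: "scoring" contributes twice as much as "lucene".
        Query q = new QueryParser("title", analyzer).parse("lucene scoring^2");
        System.out.println(q);
    }
}
```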

Lucene - Scoring and payload

We have an application where every term position in a document is associated with an "engine score".
A term query should then be scored according to the sum of "engine scores" of the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be equivalent to term frequency of 1 with engine score 500.
I understood that if I keep the engine score per position in the payload, I will be able to use scorePayload in combination with a summing version of PayloadFunction to get the sum of engine scores of a term in a document, and so will be able to achieve my goal.
There are two issues with this solution:
Even the simplest term query would have to scan the positions file in order to get the payloads, which could be a performance issue.
We would prefer to index the sum of engine scores in advance per document, in addition to the term frequency. This is some sort of payload at the document level. Does Lucene support that, or have any other solution for this issue?
The "engine score" of a phrase occurrence is defined as the multiplication of engine scores of the terms that compose the phrase.
So in scorePayload I need the payloads of all the terms in the phrase in order to be able to appropriately score the phrase occurrence.
As much as I understand, the current interface of scorePayload does not provide this information.
Is there another way this can be achieved in Lucene?
One workaround for a document-level payload is to create a single Lucene document per document of yours that just contains the engine score for your whole document as a specially-named field (different from all other Lucene document field names). You can then combine/extract that document during your searches. Not much of a workaround, but there it is.
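As for the payload-summing part of the question, the hook Lucene exposes looks roughly like the sketch below (Lucene 4.x/5.x payload API; Lucene actually ships a SumPayloadFunction that behaves much like this, so treat it purely as an illustration of the shape):

```java
import org.apache.lucene.search.payloads.PayloadFunction;

public class EngineScoreSumFunction extends PayloadFunction {

    @Override
    public float currentScore(int docId, String field, int start, int end,
                              int numPayloadsSeen, float currentScore, float currentPayloadScore) {
        // Accumulate the decoded payload ("engine score") of the current position.
        return currentScore + currentPayloadScore;
    }

    @Override
    public float docScore(int docId, String field, int numPayloadsSeen, float payloadScore) {
        // The per-document payload score is simply the accumulated sum.
        return payloadScore;
    }

    @Override
    public int hashCode() {
        return getClass().hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj != null && getClass() == obj.getClass();
    }
}
```

Such a function would typically be combined with a payload-aware query (e.g. PayloadTermQuery) and a Similarity whose scorePayload decodes the engine score stored in the payload bytes; note that it does not remove the need to read positions, which is the first concern raised above.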

Does Lucene use Extended Boolean Model retrieval?

Some time ago I came across the Extended Boolean Model, which combines Boolean retrieval logic with the ability to rank documents in a way similar to the Vector Space Model.
As far as I understand, this is exactly the way Lucene does its job of ranking documents. Am I right?
It is a combination of the Vector Space Model and the Boolean Model. Check out the Scoring docs page:
Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query. In general, the idea behind the VSM is the more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query. It uses the Boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the Query specification.
If you compare the formulas at Similarity with the classic VSM formula you'll note that they are similar (though not equal).
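For reference, the classic VSM score that comparison refers to is the cosine of the angle between the query and document term-weight vectors:

cosine-similarity(q,d) = V(q) · V(d) / (|V(q)| |V(d)|)

Lucene's practical formula starts from this but refines it with coord(q,d), boosts and norm(t,d), which is why the two look similar without being equal.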