Lucene: custom scoring

I'm creating an index of simple documents of the form:
[paragraph-id] < numeric field (monotonically increasing ID value)
[paragraph-text] < medium (~500 word) text field
There are around 100K documents, indexed by a multi-threaded indexer that divides and conquers the document set, so the documents are inserted into the index in random paragraph-id order.
The semantics of my search system are such that the "relevance" or "score" of a document is dictated only by the paragraph-id (larger paragraph-id is more relevant). I want to fully ignore what Lucene internally calculates as a "score" for the document based on standard metrics such as TF or IDF.
What's the best way to achieve this?
My "dumb" solution is to call the search API IndexSearcher::search(Query q, Filter f, int max, Sort s) with a huge max value (100K, so as to cover all documents) and pass a Sort that orders the results by paragraph-id.
Lucene version 3.0.2 (I know it's old, but this shouldn't matter for this question)
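To make the idea concrete, here is a minimal sketch in plain Python (not Lucene code) of what "rank only by paragraph-id" amounts to. As far as I know, a sorted search only needs to keep a bounded queue of the best N hits under the sort key, so max can be the number of results you actually want rather than the size of the whole index. The document dictionaries and the top_by_paragraph_id helper are hypothetical names used only for illustration.

```python
import heapq

def top_by_paragraph_id(matching_docs, n):
    """Return the n matching docs with the largest paragraph_id, highest first.

    No text-based score is involved: the sort key is the only ranking signal.
    """
    return heapq.nlargest(n, matching_docs, key=lambda d: d["paragraph_id"])

# Hypothetical documents, inserted in random paragraph-id order.
docs = [
    {"paragraph_id": 17, "text": "lucene scoring basics"},
    {"paragraph_id": 3, "text": "lucene index layout"},
    {"paragraph_id": 42, "text": "custom lucene sort"},
    {"paragraph_id": 8, "text": "unrelated paragraph"},
]

# "Query": keep only documents containing the term, then rank by id alone.
matches = [d for d in docs if "lucene" in d["text"].split()]
top = top_by_paragraph_id(matches, 2)
print([d["paragraph_id"] for d in top])
```

The insertion order of the documents does not matter here, because the ranking depends only on the stored numeric value.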

Related

Lucene calculate score without indexing the document

For research purposes I want to determine how Lucene would score a given document to a given query if that document were in the index, but without actually indexing it. It should calculate e.g. the BM25 Score using the collection's statistics (like IDF, average document length etc.).
PyTerrier offers an API like that: TextScorer takes a batch of query-document-pairs and calculates their score with the option of specifying a background_index for the statistics.
How can I do the same with Lucene?
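For reference, the computation being asked for can be sketched in plain Python as below. This is not a Lucene API; it only illustrates scoring an unindexed document against statistics taken from a background collection. The function name bm25_score, the parameter names, and the example statistics are assumptions, and Lucene's own BM25 implementation differs in details such as its lossy encoding of document length.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Classic BM25 for one document that is NOT part of the collection.

    doc_freqs, num_docs and avg_doc_len come from the background collection;
    only term frequencies and length come from the document being scored.
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        f = tf.get(term, 0)
        if f == 0:
            continue  # term absent from the document contributes nothing
        df = doc_freqs.get(term, 0)
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        length_part = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * f * (k1 + 1) / (f + length_part)
    return score

# Hypothetical background statistics.
doc_freqs = {"lucene": 10, "score": 50}
num_docs, avg_doc_len = 1000, 100.0

doc = "lucene can score a document".split()
print(bm25_score(["lucene", "score"], doc, doc_freqs, num_docs, avg_doc_len))
```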

SOLR and Ratio of Matching Words

Using SOLR version 4.3, it appears that SOLR is valuing the percentage of matching terms more than the number of matching terms.
For example, we search for Dog, and a document containing just the word dog and three other words is returned. We have another article with hundreds of words that contains the word dog 27 times.
I would expect the second article to be returned first. However, the one with one matching word out of four is returned first. I was hoping to find out what in SOLR controls this so I can make the appropriate modifications. I have looked at the SOLR documentation and have seen COORD mentioned, but it seems to indicate that the article with 27 references should be returned first. Any help would be appreciated.
For 4.x Solr still used regular TF/IDF as its scoring formula, and you can see the Lucene implementation detailed in the documentation for TFIDFSimilarity.
For your question, the two factors that affect the score are:
The length of the field, as given in norm():
norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
.. while the number of terms matching (not their frequency), is given by coord():
coord(q,d) is a score factor based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms. This is a search time factor computed in coord(q,d) by the Similarity in effect at search time.
There are a few settings in your schema that can affect how Solr scores the documents in your example:
omitNorms
If true, omits the norms associated with this field (this disables length normalization for the field, and saves some memory)
.. this will remove the norm() part of the score.
omitTermFreqAndPositions
If true, omits term frequency, positions, and payloads from postings for this field.
.. and this will remove the boost from multiple occurrences of the same term. Be aware that this will remove positions as well, making phrase queries impossible.
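The effect of these two settings on the example from the question can be sketched numerically. The snippet below is a simplified model in Python, assuming the classic defaults tf = sqrt(freq) and lengthNorm = 1/sqrt(number of tokens); it ignores idf (identical for both documents) and the lossy byte encoding of norms. The field_weight function is a hypothetical helper, and 400 tokens is an assumed stand-in for "hundreds of words".

```python
import math

def field_weight(term_freq, field_length, omit_norms=False, omit_tf=False):
    """Simplified per-term weight: tf(freq) * lengthNorm(field_length)."""
    tf = 1.0 if omit_tf else math.sqrt(term_freq)
    norm = 1.0 if omit_norms else 1.0 / math.sqrt(field_length)
    return tf * norm

short_doc = dict(term_freq=1, field_length=4)     # "dog" plus three other words
long_doc = dict(term_freq=27, field_length=400)   # assumed length

# Default settings: the short field wins because of length normalization.
print(field_weight(**short_doc), field_weight(**long_doc))

# omitNorms=true: length no longer matters, so the 27 occurrences win.
print(field_weight(**short_doc, omit_norms=True),
      field_weight(**long_doc, omit_norms=True))

# omitTermFreqAndPositions=true: repeated occurrences no longer help at all.
print(field_weight(**short_doc, omit_tf=True),
      field_weight(**long_doc, omit_tf=True))
```

Under these assumptions, disabling norms is the change that moves the long article ahead of the short one.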
But you should also consider upgrading Solr, as the BM25 similarity that's the default from 6.x usually performs better. I can't remember if a version is available for 4.3.

When indexing, what are the factors that can affect a term's score when searched

The question is a little confusing. I am new to Lucene and am going through the documentation. I found that adding a boost to a field increases the norm of the field and thus increases the score of a term when it is searched.
That is, adding a boost to a field at indexing time can affect the score at search time. My question is: are there any other ways, besides boosting, to do the same? Please advise.
Before Lucene 4.x, there used to be a single scoring formula based on Vector Space Model.
The following are the factors that contribute to the Lucene scoring.
1) Tf : term frequency, i.e. frequency of a term in a document.
2) Idf : Inverse document frequency : log(Collection Size / Number of documents that have term). The exact formula may vary.
3) Field Boost : the one you've mentioned. It's provided while Indexing.
4) Coord : a score factor based on how many of the query terms are found in the specified document.
5) queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable
6) norm(t,d) encapsulates a few (indexing time) boost and length factors:
a) Document boost - set by calling doc.setBoost() before adding the document to the index.
b) Field boost - set by calling field.setBoost() before adding the field to a document.
c) lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
7) Term boost: is a search time boost of term t in the query q
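Put together, the factors listed above combine roughly as follows in the classic TF-IDF practical scoring function. This is written from memory of the TFIDFSimilarity documentation, so treat it as a sketch; the default forms of tf, idf, coord and queryNorm shown are those of the default similarity and can be overridden.

```latex
\mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)

\mathrm{tf}(t,d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}

\mathrm{coord}(q,d) = \frac{\mathrm{overlap}}{\mathrm{maxOverlap}}, \qquad
\mathrm{queryNorm}(q) = \frac{1}{\sqrt{\mathrm{sumOfSquaredWeights}}}
```

Only norm(t,d), which folds together the document boost, field boost and lengthNorm, is fixed at indexing time; the other factors are computed at search time.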
For in-depth knowledge of Lucene's default scoring formula : Check the documentation : Lucene Similarity
With the new release of Lucene 4.x, new scoring formulas have been introduced like BM25. For more details, please check the subclasses of Lucene 4.2 Similarity
You can implement a subclass of Similarity to customize all the above scoring factors. Here is an Example...

Lucene - Scoring and payload

We have an application where every term position in a document is associated with an "engine score".
A term query should then be scored according to the sum of "engine scores" of the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be equivalent to term frequency of 1 with engine score 500.
My understanding is that if I keep the engine score per position in the payload, I can use scorePayload in combination with a summing version of PayloadFunction to get the sum of the engine scores of a term in a document, and so achieve my goal.
There are two issues with this solution:
1) Even the simplest term query has to scan the positions file to read the payloads, which could be a performance issue. We would prefer to index the sum of engine scores per document in advance, in addition to the term frequency. This is a kind of document-level payload. Does Lucene support that, or offer any other solution for this issue?
2) The "engine score" of a phrase occurrence is defined as the product of the engine scores of the terms that compose the phrase. So in scorePayload I need the payloads of all the terms in the phrase to score the phrase occurrence appropriately. As far as I understand, the current interface of scorePayload does not provide this information. Is there another way this can be achieved in Lucene?
One workaround for a document-level payload is to create one extra Lucene document per source document that contains only the engine score for the whole document, stored in a specially named field (different from all other Lucene document field names). You can then look up and combine that document during your searches. Not much of a workaround, but there it is.
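The scoring semantics described in the question can be pinned down with a small sketch in plain Python (not Lucene code). It assumes a term's score in a document is the sum of its per-position engine scores, which could be precomputed at indexing time, and that a phrase occurrence scores as the product of its terms' engine scores. Summing over phrase occurrences is my own assumption by analogy with the term case, and all function names are hypothetical.

```python
from math import prod

def term_score(engine_scores):
    """Score of a term in one document: sum of its per-position engine scores."""
    return sum(engine_scores)

def phrase_occurrence_score(engine_scores_of_terms):
    """Score of one phrase occurrence: product of its terms' engine scores."""
    return prod(engine_scores_of_terms)

def phrase_score(occurrences):
    """Assumed aggregation: sum the scores of all phrase occurrences."""
    return sum(phrase_occurrence_score(o) for o in occurrences)

# Five occurrences averaging 100 are equivalent to one occurrence scoring 500.
print(term_score([100, 100, 100, 100, 100]) == term_score([500]))

# Precomputing the per-document sum at indexing time avoids reading every
# position payload at query time.
payloads = {("doc1", "engine"): [120, 80], ("doc2", "engine"): [500]}
precomputed = {key: term_score(scores) for key, scores in payloads.items()}
print(precomputed[("doc1", "engine")])

# A phrase of two terms occurring twice in the same document.
print(phrase_score([(2, 3), (4, 5)]))
```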

How does Lucene work

I would like to find out how lucene search works so fast. I can't find any useful docs on the web. If you have anything (short of lucene source code) to read, let me know.
A text search query using mysql5 text search with index takes about 18 minutes in my case. A lucene search for the same query takes less than a second.
Lucene is an inverted full-text index. This means that it takes all the documents, splits them into words, and then builds an index for each word. Since the index is an exact string-match, unordered, it can be extremely fast. Hypothetically, an SQL unordered index on a varchar field could be just as fast, and in fact I think you'll find the big databases can do a simple string-equality query very quickly in that case.
Lucene does not have to optimize for transaction processing. When you add a document, it need not ensure that queries see it instantly. And it need not optimize for updates to existing documents.
However, at the end of the day, if you really want to know, you need to read the source. Both things you reference are open source, after all.
Lucene creates a big index. The index contains each word's id, the number of documents in which the word is present, and the positions of the word in those documents. So for a single-word query it only has to look the word up in the index, which is very cheap compared with scanning the documents (often described as roughly O(1)). The result is then ranked using different algorithms. For a multi-word query where all words must match, it takes the intersection of the sets of documents in which the words are present.
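A toy version of this idea, as a minimal Python sketch rather than Lucene's actual on-disk format: build a map from each word to a sorted list of document ids, then answer a two-word AND query by intersecting the two lists with a linear merge. The build_index and intersect helpers and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to a sorted list of the ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(docs[doc_id].lower().split())):
            index[term].append(doc_id)
    return index

def intersect(a, b):
    """Intersect two sorted posting lists with a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

docs = {
    1: "fast full text search",
    2: "fast database index",
    3: "full text index search",
}
index = build_index(docs)

# Single-word query: one dictionary lookup, no document scanning.
print(index["index"])

# Two-word AND query: intersect the two posting lists.
print(intersect(index["fast"], index["search"]))
```

The work done per query depends on the length of the posting lists involved, not on the total amount of text in the collection.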
Thus Lucene is very very fast.
For more info read this article by Google developers- http://infolab.stanford.edu/~backrub/google.html
In a word: indexing.
Lucene creates an index of your document that allows it to search much more quickly.
It's the same difference between a list O(N) data structure and a hash table O(1) data structure. The list has to walk through the entire collection to find what you want. The hash table has an index that lets it figure out exactly where the desired item is and simply fetch it.
Update:
I'm not certain what you mean by "Lucene index searches are a lot faster than mysql index searches."
My guess is that you're using MySQL "WHERE document LIKE '%phrase%'" to search for a document. If that's true, then MySQL has to do a table scan on every row, which will be O(N).
Lucene gets to parse the document into tokens, group them into n-grams at your direction, and calculate indexes for each one of those. It's O(1) to find a word in an indexed Lucene document.
Lucene works with term frequency and inverse document frequency. It creates an index mapping each word to the documents that contain it, along with its frequency count in each, which is simply an inverted index over the documents.
Example :
File 1 : Random Access Memory is the main memory.
File 2 : Hard disks are secondary memory.
Lucene creates an inverted index something like this:
File 1 :
Term : Random
Frequency : 1
Position : 0
Term : Memory
Frequency : 2
Position : 2
Position : 6
So it is able to search and retrieve the searched content quickly. When there are many matches for the search query, it orders the results by weight. Consider the search query "Main Memory": it searches for each word individually, and the result would be like:
Main
File 1 : Frequency - 1
Memory
File 1 : Frequency - 2
File 2 : Frequency - 1
The result would be File 1 followed by File 2. To avoid being carried away by the weights of very common words like 'and', 'or', 'the', it also considers the inverse document frequency (i.e., it decreases the weight of words that are common across the document set).
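The two-file example can be reproduced with a short Python sketch. It is only an illustration of the idea, not how Lucene stores or scores things: positions are counted from 0 over every lower-cased token, and the weighting is a simplified raw-frequency times idf, with idf = 1 + ln(numDocs / (docFreq + 1)) assumed. The tokenize and rank helpers are hypothetical.

```python
import math
import re
from collections import defaultdict

files = {
    "File 1": "Random Access Memory is the main memory.",
    "File 2": "Hard disks are secondary memory.",
}

def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# term -> file name -> list of token positions (counted from 0)
positions = defaultdict(lambda: defaultdict(list))
for name, text in files.items():
    for pos, term in enumerate(tokenize(text)):
        positions[term][name].append(pos)

def rank(query):
    """Rank files by the sum over query terms of frequency * idf."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = positions.get(term, {})
        if not postings:
            continue
        # Terms found in many files get a smaller weight.
        idf = 1.0 + math.log(len(files) / (len(postings) + 1))
        for name, pos_list in postings.items():
            scores[name] += len(pos_list) * idf
    return sorted(scores, key=scores.get, reverse=True)

print(dict(positions["memory"]))
print(rank("Main Memory"))
```

Because "memory" occurs in both files, its idf is lower than that of "main", which occurs only in File 1.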