Lucene calculate score without indexing the document

For research purposes I want to determine how Lucene would score a given document against a given query if that document were in the index, but without actually indexing it. It should calculate, e.g., the BM25 score using the collection's statistics (IDF, average document length, etc.).
PyTerrier offers an API like that: TextScorer takes a batch of query-document pairs and calculates their scores, with the option of specifying a background_index for the statistics.
How can I do the same with Lucene?
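To make the goal concrete, here is a minimal sketch in plain Python (not Lucene API) of what "scoring an unindexed document against background statistics" amounts to for BM25. The function name bm25_score and its parameters are hypothetical; the statistics (document frequencies, document count, average length) are assumed to have been read from the background index beforehand. The IDF form and the defaults k1=1.2, b=0.75 follow a common BM25 formulation; some variants multiply each term by (k1 + 1), which does not change the ranking. An engine that stores document lengths in compressed form may produce slightly different numbers.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one unindexed document using background collection statistics.

    doc_freqs maps a term to the number of background documents containing it.
    """
    term_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = term_counts.get(term, 0)
        if tf == 0:
            continue  # term absent from the document contributes nothing
        df = doc_freqs.get(term, 0)
        # IDF from the background collection, not from the scored document
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Length normalisation relative to the background average length
        length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf / (tf + length_norm)
    return score


# Hypothetical background statistics for illustration only
background_df = {"lucene": 50, "score": 200}
example = bm25_score(["lucene", "score"],
                     "lucene computes a score for lucene queries".split(),
                     background_df, num_docs=1000, avg_doc_len=8.0)
```

The document being scored never contributes to the document frequencies or the average length, which is the property the question asks for.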

Related

lucene: custom scoring

I'm creating an index of simple documents of the form:
[paragraph-id] < numeric field (monotonically increasing ID value)
[paragraph-text] < medium (~500 word) text field
There are around 100K documents and they are indexed by a multi-threaded indexer that divides-and-conquers the documents so the paragraph-id order in which they are inserted into the index is random.
The semantics of my search system are such that the "relevance" or "score" of a document is dictated only by the paragraph-id (larger paragraph-id is more relevant). I want to fully ignore what Lucene internally calculates as a "score" for the document based on standard metrics such as TF or IDF.
What's the best way to achieve this?
My "dumb" solution to this is to call the search API IndexSearcher::search(Query q, Filter f, int max, Sort s) with a huge max value (100K, so as to cover all documents) and pass a Sort that orders the results by paragraph-id.
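Conceptually, ranking purely by paragraph-id means selecting the k largest ids among the matching documents, which does not require materialising all 100K results. The following is only an illustrative sketch in plain Python, not Lucene code; top_by_paragraph_id and the sample ids are hypothetical.

```python
import heapq


def top_by_paragraph_id(matching_ids, k):
    """Return the k largest paragraph ids among the matches, largest first."""
    # nlargest keeps only k candidates at a time instead of sorting everything
    return heapq.nlargest(k, matching_ids)


# Ids arrive in arbitrary order, as they would from a multi-threaded indexer
matches = [17, 4, 99, 42, 8, 63]
top_three = top_by_paragraph_id(matches, 3)
```

The insertion order of the ids is irrelevant to the result, so the random indexing order described above should not matter for this kind of ranking.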
Lucene version 3.0.2 (I know it's old, but this shouldn't matter for this question)

Elasticsearch - higher scoring if higher frequency of term

I have 2 documents, and am searching for the keyword "Twitter". Suppose both documents are blog posts with a "tags" field.
Document A has ONLY 1 term in the "tags" field, and it's "Twitter".
Document B has 100 terms in the "tags" field, and 3 of them are "Twitter".
Elasticsearch gives the higher score to Document A even though Document B has a higher frequency. The score of Document B is "diluted" because it has more terms. How do I give Document B a higher score, since it has a higher frequency of the search term?
I know Elasticsearch/Lucene performs some normalization based on the number of terms in the document. How can I disable this normalization, so that Document B gets the higher score?
As the other answer says, it would be interesting to see whether you get the same result on a single shard. I think you would, because the result depends on the norms for the tags field, which are taken into account when computing the score with the default TF/IDF similarity.
In fact, Lucene does take into account the term frequency, that is, the number of times the term appears within the field (1 or 3 in your case), and the inverse document frequency, that is, how frequent the term is in the index, which serves to weigh it against the other terms in the query (in your case it makes no difference, since you are searching for a single term).
But there is another factor, called norms, that rewards shorter fields and takes into account any index-time boosting, which can be per field (in the mapping) or even per document. You can verify that norms are the reason for your result by enabling the explain option in your search request and looking at the explain output.
I guess the fact that the first document contains only that tag makes it more important than the ones that contain that tag multiple times but a lot of other tags as well. If you don't like this behaviour you can just disable norms in your mapping for the tags field. Norms should be enabled by default if the field is "index":"analyzed" (the default). You can either switch to "index":"not_analyzed" if you don't want your tags field to be analyzed (it usually makes sense but depends on your data and domain) or add the "omit_norms": true option in the mapping for your tags field.
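The effect of norms on the two documents from the question can be worked through with a small sketch. It assumes the classic TF/IDF defaults, tf = sqrt(frequency) and lengthNorm = 1/sqrt(number of terms), leaves out IDF (identical for both documents with a single query term) and ignores the lossy encoding of stored norms. The function name classic_field_score is hypothetical.

```python
import math


def classic_field_score(term_freq, field_length, use_norms=True):
    """tf * lengthNorm for a single term, assuming classic TF/IDF defaults."""
    tf = math.sqrt(term_freq)
    length_norm = 1.0 / math.sqrt(field_length) if use_norms else 1.0
    return tf * length_norm


# Document A: 1 tag, 1 occurrence of "Twitter"
# Document B: 100 tags, 3 occurrences of "Twitter"
a_with_norms = classic_field_score(1, 1)
b_with_norms = classic_field_score(3, 100)
a_without_norms = classic_field_score(1, 1, use_norms=False)
b_without_norms = classic_field_score(3, 100, use_norms=False)
```

With norms, Document A gets 1.0 and Document B gets sqrt(3)/10, roughly 0.17, so A ranks first. Without norms, B gets sqrt(3), roughly 1.73, against 1.0 for A, so B ranks first.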
Are the documents found on different shards? From the Elasticsearch documentation:
"When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first execute the query against all shards and gather the relevant term frequencies, and then, based on it, execute the query."
The solution is to specify the search type. Use dfs_query_and_fetch search type to execute an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
You can read more here.

When indexing, what are the factors that can affect a term's score when searched

The question is a little confusing. I am new to Lucene and going through the documentation. I found out that adding a boost to a field increases the norm of the field and thus increases the score of a term when it is searched.
That is, adding a boost to a field at indexing time can affect the score at search time. My question is: are there any other ways, other than boosting, to do the same? Please advise.
Before Lucene 4.x, there used to be a single scoring formula based on the Vector Space Model.
The following are the factors that contribute to the Lucene scoring.
1) Tf : term frequency, i.e. frequency of a term in a document.
2) Idf : Inverse document frequency : log(Collection Size / Number of documents that have the term). The exact formula may vary.
3) Field Boost : the one you've mentioned. It's provided while Indexing.
4) Coord : a score factor based on how many of the query terms are found in the specified document.
5) queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
6) norm(t,d) encapsulates a few (indexing time) boost and length factors:
a) Document boost - set by calling doc.setBoost() before adding the document to the index.
b) Field boost - set by calling field.setBoost() before adding the field to a document.
c) lengthNorm - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. LengthNorm is computed by the Similarity class in effect at indexing.
7) Term boost: is a search time boost of term t in the query q
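As a rough illustration of how the factors above combine, the following sketch follows the classic "practical scoring function" as I recall it from the TF/IDF similarity documentation: score = coord * queryNorm * sum over terms of (tf * idf^2 * term boost * norm). The default functions used here (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1/sqrt(length)) are assumptions about the classic defaults, the names classic_score and idf are hypothetical, and the lossy storage of norms is ignored.

```python
import math


def idf(doc_freq, num_docs):
    """Inverse document frequency, classic default form (assumed)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1.0))


def classic_score(query_terms, doc_term_freqs, field_length, doc_freqs,
                  num_docs, term_boosts=None, field_boost=1.0, doc_boost=1.0):
    """Combine tf, idf, coord, queryNorm, norm and term boost for one field."""
    term_boosts = term_boosts or {}
    # queryNorm: 1 / sqrt(sum of squared query term weights)
    sum_sq = sum((idf(doc_freqs.get(t, 0), num_docs) * term_boosts.get(t, 1.0)) ** 2
                 for t in query_terms)
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 1.0
    # norm(t, d): index-time boosts times the length normalisation
    norm = doc_boost * field_boost / math.sqrt(field_length)
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    # coord: fraction of query terms found in the document
    coord = len(matched) / len(query_terms)
    total = sum(math.sqrt(doc_term_freqs[t])
                * idf(doc_freqs.get(t, 0), num_docs) ** 2
                * term_boosts.get(t, 1.0)
                * norm
                for t in matched)
    return coord * query_norm * total


# One query term, 4 occurrences in a 4-token field, present in 9 of 10 documents
plain = classic_score(["a"], {"a": 4}, 4, {"a": 9}, 10)
boosted = classic_score(["a"], {"a": 4}, 4, {"a": 9}, 10, field_boost=2.0)
```

In this example doubling the field boost doubles the score, which is the index-time effect described in the question; every other factor in the list enters the product in the same multiplicative way.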
For in-depth knowledge of Lucene's default scoring formula : Check the documentation : Lucene Similarity
With the new release of Lucene 4.x, new scoring formulas have been introduced like BM25. For more details, please check the subclasses of Lucene 4.2 Similarity
You can implement a subclass of Similarity to customize all the above scoring factors. Here is an Example...

Lucene - Scoring and payload

We have an application where every term position in a document is associated with an "engine score".
A term query should then be scored according to the sum of "engine scores" of the term in a document, rather than on the term frequency.
For example, term frequency of 5 with an average engine score of 100 should be equivalent to term frequency of 1 with engine score 500.
As I understand it, if I keep the engine score per position in the payload, I can use scorePayload in combination with a summing PayloadFunction to get the sum of the engine scores of a term in a document, and so achieve my goal.
There are two issues with this solution:
1) Even the simplest term query has to scan the positions file in order to get the payloads, which could be a performance issue. We would prefer to index the sum of engine scores in advance per document, in addition to the term frequency. This is some sort of payload at the document level. Does Lucene support that, or have any other solution for this issue?
2) The "engine score" of a phrase occurrence is defined as the product of the engine scores of the terms that compose the phrase. So in scorePayload I need the payloads of all the terms in the phrase in order to score the phrase occurrence appropriately. As far as I understand, the current interface of scorePayload does not provide this information. Is there another way this can be achieved in Lucene?
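The intended scoring semantics can be stated independently of Lucene with a small sketch. This is plain Python, not the payload API; the function names are hypothetical, and the per-position engine scores are assumed to be available as plain numbers.

```python
import math


def term_engine_score(position_scores):
    """Score of a term in a document: the sum of its per-position engine scores."""
    return sum(position_scores)


def phrase_occurrence_score(term_scores_at_occurrence):
    """Score of one phrase occurrence: the product of its terms' engine scores."""
    return math.prod(term_scores_at_occurrence)


# Five occurrences averaging 100 are equivalent to one occurrence scoring 500
five_occurrences = term_engine_score([100, 100, 100, 100, 100])
one_occurrence = term_engine_score([500])
# A three-term phrase occurrence with hypothetical engine scores 2, 3 and 4
phrase_example = phrase_occurrence_score([2, 3, 4])
```

The term score depends only on a per-document sum, which is why precomputing it at index time would avoid reading positions, while the phrase score needs the scores of all terms at each occurrence.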
One workaround for a document-level payload is to create a single Lucene document / your document that just contains the engine score for your whole document as a specially-named field (different from all other Lucene document field names). You can then combine / extract that document during your searches. Not much of a workaround, but there it is.

understanding the relationship between boosting a document in lucene at index time and its corresponding score at search time

When indexing, I boost certain documents, but they do not appear on the top of the list of retrieved documents. I looked at the score of those documents, and somehow, the score of the documents retrieved is always NaN.
What is the relationship between a boost of a document at index time and its score at retrieve time? I thought these would be correlated, and further, I thought I would get a wide range of scores in my scoredocs, not just NaN. If you can shed some light on this I would be grateful.
I have read http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/Similarity.html
and can't figure out what is missing.
Here is the simple boosting code:
if (myCondition)
{
    myDocument.SetBoost(1.1f);
}
myIndexWriter.AddDocument(myDocument);
I'm going to take a wild guess here, since you haven't provided a sample of your search code, but a common reason why the score of retrieved docs is NaN is that you use a Sort. When sorting, most of the time the score of the documents is not used, so scoring is disabled by default.
If you use a Sort for your search, and want the score, check the method setDefaultFieldSortScoring of the IndexSearcher class. This method allows you to enable scoring the documents in a search that uses a Sort.
http://lucene.apache.org/java/2_9_4/api/all/org/apache/lucene/search/IndexSearcher.html#setDefaultFieldSortScoring(boolean, boolean)