How to normalize Lucene scores? - lucene

I need to normalize the Lucene scores between 0 and 1.
For example, a random query returns the following scores...
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What's the biggest score ? 10.0 ?
thanks

You can divide all scores with the maximum score to get scores between 0 and 1.
However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.

There is no good standard way to normalize scores with lucene. Read this: ScoresAsPercentages and this explanation
In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.
See also how-do-i-normalise-a-solr-lucene-score

There is no maximum score in Solr, it depends on too many variables, so it can't be predicted.
But you can implement something called normalized score (Scores As Percentages) which is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?

A regular normalization will only help you to compare the scoring distribution among queries (and theirs retrieved lists).
You cannot simply normalize the score to compare the performance between queries.
Think of a query which all retrieved documents are highly relevant and received the same (high score), and on another query that the retrieved list comprise barley relevant document (again, with the same score) - now, no matter the per-query normalization you make - the normalized score will be the same.
You need to think on a cross-query factor that can bring all the scores to the same level.
For example - maybe computing similarity between the query and the whole index, and use that score somehow along with the document-score

If you want to compare two or more queries, i found an workaround.
You can compare your highest scored document with your queryterm using the LevenstheinDistance or LuceneLevenstheinDistance(Damerau) class to get the distance between your queryterm and your result. The result is the similiarity between them. Do this for each query you want to compare against. Now you have a tool to compare your queries using the similiarity of your querytherm and your highest result. You can now choose the query with the highest score of similiarity and use this for next proper actions.
//Damerau LevenstheinDistance
LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();
similiarity = d.getDistance(queryterm, yourResult );

I applied a non-linearity function in order to compress every queries.

Related

RavenDB -- More Like This -- Need a (similarity) metric; not just rank-orders

I have a RavenDB / 'More Like This' example running (C#) as per
Creating more like this in RavenDB
However, in addition to receiving similar documents back, I really need some measure of similarity back for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders I need the metric similarity results. This assumes (of course) that the rank orders are computed from a more continuous metric; e.g., tf-idf. If that is true, can I get a hold of those metric scores?
When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
And assuming you have setup the TermVector on the index properly, you'll get the results.
In the metadata of the results, you have the index score, which is what I think you are looking for.

How does score multiplier operator in Watson Discovery service work?

I have a set of JSON documents uploaded to my WDS instance. I want to understand the importance of the score multiplier operator(^). The document just says,"Increases the score value of the search term". I have tried a simple query on one field, it multiplies the score by the number specified.
If I specify two fields & I want Watson Discovery to know which of the two fields is more important for the search, is score multiplier applicable in this case? With two fields & a score multiplier applied to one, I could not identify the difference. Also, on what datatypes is this allowed? It didn't work with a number.
I found this through some more experiments. Score multiplier is used when you want to increase the relative importance of fields in the query. So, for example, you want to give more importance to Name.LastName in the below example:
Name.FirstName:"ABC",Name.LastName:"DEF"^3
Here, LastName is given more relevance & the search results are ordered in the same way.
Could be useful for someone.

How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a lo- to high -score, for example;
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000 and depending on the limit (and number of matches) it might never return any 1000 matches, even though they exist. I've tried to use sort:+- and grouping but nothing seems to have any impact on the returned result.
So; how can the order of results returned be controlled?
What I ideally want is a selection of matches from the range but I assume this isn't possible, given that the query just starts filling the results in from the top?
For reference the index looks like this;
function(doc) {
var score = doc.score;
index("score", score, {
"store": "yes"
});
...
I cannot comment on this so posting an answer here:
Based on the cloudant doc on lucene queries, there isn't a way to sort results of a query. The sort options given there are for grouping. And even for grouped results I never saw sort work. In any case it is supposed to sort the sequence of the groups themselves. Not the data within.
#pal2ie you are correct, and Cloudant has come back to me confirming it. It does make sense, in some way, but I was hoping I could at least control the direction (lo->hi, hi->lo). The solution I have implemented to get a better distribution across the range is to not use range queries but instead;
create a distribution of the number of desired results for each score in the range (a simple, discrete, Gaussian for example)
execute individual queries for each score in the range with limit set to the number of desired results for that score
execute step 2 from min to max, filling up the result
It's not the most effective since it means multiple round-trips to the server but at least it gives me full control over the distribution in the range

How to determine Lucene relevancy/cutoff?

What is the best way to determine relevancy and the cutoff of results to show?
So the system I'm working on right now involves searching the inventory and returning the results. Each result must be reviewed by an employee to determine whether it is a true match. Obviously, we want to minimize the number of false results we return.
I've been tweaking boosts and stuff to get it to score better, but we still have a few problems with determining relevancy.
An absolute threshold doesn't work because search scores are only meaningful relative to the results in a given query. So a score of 200 on one query may not be as relevant as a score of .2 on another.
The other method I've seen is a score normalized with respect to the top score of a query. Then we can return all results at are within x% of that score. However, if there are no good results, then the top result is very poor, and all the results we return will be poor.
How can I determine which documents are relevant and which are not?

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of wikipedia images with their textual description in xml files (1 xml file per image). I have indexed those xmls in Solr. Now while retrieving those, I want to maintain some threshold for Score values, so that docs with less score will not come in the result (because they are not of much importance). For example I want to retrieve all documents having similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get it working. Does anyone have any idea how can I do that?
What's the motivation for wanting to do this? The reason I ask, is
score is a relative thing determined by Lucene based on your index
statistics. It is only meaningful for comparing the results of a
specific query with a specific instance of the index. In other words,
it isn't useful to filter on b/c there is no way of knowing what a
good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general it's bad to cut off by some value, because you'll never know which threshold value is best. In good query it could be score=2, in bad query score=0.5, etc.
These two links should explain you why you DONT want to do it.
P.S. If you still want to do it take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend you to fix your search queries, so they will search better with high precision (http://en.wikipedia.org/wiki/Precision_and_recall)