Returning fuzzy match percentage in Solr query result - Lucene

I've implemented Solr/Lucene fuzzy matching for my system and it's working perfectly.
I have a requirement to display the fuzzy match percentage once the query response comes back.
As an example, if my indexed data is "rushikupadhyay" and my query is "rushikupadhya"~0.8, I should get the exact percentage as part of the response, e.g. 0.85 or 85%.
I want to use the percentage result in my application and perform additional steps based on the returned value, e.g. if the percentage match is 70-80% do X, 80-90% do Y, and > 90% do Z.
Any pointers are appreciated.

Please note: there is guidance in this post on the Lucene Wiki - ScoresAsPercentages - that you may want to review before deciding to go with purely percentage-based logic.
However, if you do decide to go with a percentage value, you can get it by also including the score field in the query response. On the Solr Admin page, the Full Interface link will take you to /admin/form.jsp; in the Fields to Return option it shows *,score. This will return the match score for each document in the result set. However, please note that this is the raw score of the document match and is relative to the maxScore value that is part of the <result> element. So in order to get a true percentage-based score for each document, you will need to normalize each document's score against maxScore, using logic such as (score/maxScore * 100), to get the correct percentage value to display.
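For illustration, here is a minimal SolrJ sketch of that normalization; the core URL, field name, and client setup are assumptions, and the threshold branches simply mirror the ones in the question:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class FuzzyMatchPercentage {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL and field name; adjust to your schema.
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("name:rushikupadhya~0.8");
            query.setFields("*", "score");                  // ask Solr to return the raw score

            QueryResponse response = solr.query(query);
            SolrDocumentList docs = response.getResults();
            float maxScore = docs.getMaxScore();            // highest raw score in this result set

            for (SolrDocument doc : docs) {
                float score = (Float) doc.getFieldValue("score");
                double percentage = score / maxScore * 100; // normalize against maxScore
                if (percentage > 90) {
                    // do Z
                } else if (percentage > 80) {
                    // do Y
                } else if (percentage >= 70) {
                    // do X
                }
            }
        }
    }
}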

Related

RavenDB -- More Like This -- Need a (similarity) metric; not just rank-orders

I have a RavenDB / 'More Like This' example running (C#) as per
Creating more like this in RavenDB
However, in addition to receiving similar documents back, I really need some measure of similarity back for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders I need the metric similarity results. This assumes (of course) that the rank orders are computed from a more continuous metric; e.g., tf-idf. If that is true, can I get a hold of those metric scores?
When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
And assuming you have set up the TermVector on the index properly, you'll get the results.
In the metadata of the results, you have the index score, which is what I think you are looking for.
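If it helps, a rough sketch of reading that score follows. It uses the RavenDB Java client for consistency with the other examples in this thread (the C# session API is analogous); the server URL, database name, and Product entity are assumptions:

import java.util.List;

import net.ravendb.client.documents.DocumentStore;
import net.ravendb.client.documents.session.IDocumentSession;

public class MoreLikeThisScores {

    // Minimal stand-in entity; your real Product class will have more fields.
    public static class Product {
        public String id;
        public String name;
    }

    public static void main(String[] args) {
        // Hypothetical server URL and database name; adjust to your setup.
        try (DocumentStore store = new DocumentStore("http://localhost:8080", "Northwind")) {
            store.initialize();
            try (IDocumentSession session = store.openSession()) {
                List<Product> results = session.advanced()
                        .rawQuery(Product.class,
                                "from index 'Product/Search' where morelikethis(id() = 'products/1-A')")
                        .toList();

                for (Product p : results) {
                    // The index score is exposed through each result's metadata.
                    Object score = session.advanced().getMetadataFor(p).get("@index-score");
                    System.out.println(p.name + " -> " + score);
                }
            }
        }
    }
}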

How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a low score to a high score, for example:
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000, and depending on the limit (and the number of matches) it might never return any 1000 matches, even though they exist. I've tried to use sort:+- and grouping, but nothing seems to have any impact on the returned result.
So; how can the order of results returned be controlled?
What I ideally want is a selection of matches from the range but I assume this isn't possible, given that the query just starts filling the results in from the top?
For reference, the index looks like this:
function(doc) {
  var score = doc.score;
  index("score", score, {
    "store": "yes"
  });
  // ...
}
I cannot comment on this so posting an answer here:
Based on the Cloudant documentation on Lucene queries, there isn't a way to sort the results of a query. The sort options given there are for grouping, and even for grouped results I never saw sort work; in any case, it is supposed to sort the sequence of the groups themselves, not the data within them.
#pal2ie you are correct, and Cloudant has come back to me confirming it. It does make sense, in some way, but I was hoping I could at least control the direction (lo->hi, hi->lo). The solution I have implemented to get a better distribution across the range is not to use range queries but instead (see the sketch after this list):
1. create a distribution of the number of desired results for each score in the range (a simple, discrete Gaussian, for example)
2. execute individual queries for each score in the range, with the limit set to the number of desired results for that score
3. execute step 2 from min to max, filling up the result
It's not the most efficient approach, since it means multiple round-trips to the server, but at least it gives me full control over the distribution in the range.
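A rough sketch of that workaround follows; queryScore(score, limit) is a hypothetical helper standing in for the per-score Cloudant search request (e.g. q=score:1000 with a limit), not a real client call:

import java.util.ArrayList;
import java.util.List;

public class ScoreDistributionFetch {

    // Hypothetical helper standing in for one Cloudant search request,
    // e.g. q=score:<score>&limit=<limit>; replace with your actual HTTP call.
    static List<String> queryScore(int score, int limit) {
        return new ArrayList<>(); // placeholder
    }

    public static void main(String[] args) {
        int minScore = 1000, maxScore = 3000, step = 1000;
        int desiredTotal = 30;
        int buckets = (maxScore - minScore) / step + 1;

        // Step 1: decide how many results to request per score value.
        // A flat distribution is used here; a discrete Gaussian works the same way.
        int perScore = desiredTotal / buckets;

        // Steps 2-3: issue one limited query per score value, from min to max,
        // and accumulate the results.
        List<String> results = new ArrayList<>();
        for (int score = minScore; score <= maxScore; score += step) {
            results.addAll(queryScore(score, perScore));
        }
        System.out.println("fetched " + results.size() + " documents");
    }
}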

SOLR search filter by relevancy score

So each Solr search result has its own relevancy score:
https://wiki.apache.org/solr/SolrRelevancyFAQ
"How can I see the relevancy scores for search results
Request that the pseudo-field named "score" be returned by adding it to the fl (field list) parameter. The "score" will then appear along with the stored fields in returned documents.
q=Justice League&fl=*,score"
My question is: is it possible to filter Solr results by this relevancy score?
E.g. perform a query along the lines of the following:
Search for keyword "LOL" and only fetch documents whose relevancy score > 50
If it's possible, how would you go about specifying this query syntactically?
You can specify a maximum number of results to return. The results will appear in descending order by score, so you could stop processing at a specific point in the result set.
solr/search/select?q=LOL&start=0&rows=10&fl=*%2Cscore
See the following article for a discussion about setting a minimum score: Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
I spent hours trying to filter out values with a relevance score of 0. I couldn't find any straightforward way to do this. I ended up accomplishing it with a workaround that assigns the query function to a local param. I call this local param in both the query ("q=") and the filter query ("fq=").
Example
Let's say you have a query like:
q={!func}sum(*your arguments*)
First, make the function component its own parameter:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
Now to only return results with scores between 1 and 10 simply add a filter query on that localParam:
q={!func}$localParam
&localParam={!func}sum(*your arguments*)
&fq={!frange l=1 u=10 inclusive=true}$localParam
Solr 6.6:
add a solr filter query (fq):
q=SEARCH_PHRASE
&fq={!frange l=50.0}query($q,0)
In this case Solr will return the results with "score" >= 50.0.
$q in query($q,0) is a reference to the q parameter.
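For completeness, a small SolrJ sketch of this approach; the collection URL is an assumption, and the 50.0 threshold is taken from the example above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MinScoreFilter {
    public static void main(String[] args) throws Exception {
        // Hypothetical collection URL and search phrase.
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
            SolrQuery query = new SolrQuery("LOL");
            query.setFields("*", "score");
            // Keep only documents whose score for the main query ($q) is >= 50.0.
            query.addFilterQuery("{!frange l=50.0}query($q,0)");

            QueryResponse response = solr.query(query);
            System.out.println("matches with score >= 50: " + response.getResults().getNumFound());
        }
    }
}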

How to determine Lucene relevancy/cutoff?

What is the best way to determine relevancy and the cutoff of results to show?
So the system I'm working on right now involves searching the inventory and returning the results. Each result must be reviewed by an employee to determine whether it is a true match. Obviously, we want to minimize the number of false results we return.
I've been tweaking boosts and stuff to get it to score better, but we still have a few problems with determining relevancy.
An absolute threshold doesn't work because search scores are only meaningful relative to the results in a given query. So a score of 200 on one query may not be as relevant as a score of .2 on another.
The other method I've seen is a score normalized with respect to the top score of a query. Then we can return all results that are within x% of that score. However, if there are no good results, then the top result is very poor, and all the results we return will be poor.
How can I determine which documents are relevant and which are not?
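For reference, the second approach described above (keep only hits within some fraction of the top score) might look roughly like this in raw Lucene; the searcher, query, result window, and threshold are all assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class TopScoreCutoff {

    // Keep only hits scoring at least `fraction` of the best hit's score.
    static List<ScoreDoc> cutoff(IndexSearcher searcher, Query query, float fraction) throws Exception {
        TopDocs topDocs = searcher.search(query, 100);
        List<ScoreDoc> kept = new ArrayList<>();
        if (topDocs.scoreDocs.length == 0) {
            return kept;
        }
        float best = topDocs.scoreDocs[0].score;
        for (ScoreDoc hit : topDocs.scoreDocs) {
            if (hit.score >= fraction * best) {
                kept.add(hit);
            }
        }
        return kept;
    }
}

Note that this still has the weakness described in the question: if the best hit is itself a poor match, everything kept relative to it will be poor as well.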

How to normalize Lucene scores?

I need to normalize the Lucene scores between 0 and 1.
For example, a random query returns the following scores...
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What's the biggest score? 10.0?
Thanks.
You can divide all scores by the maximum score to get scores between 0 and 1.
However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.
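As a quick illustration of that division, using the raw scores from the example above:

public class NormalizeScores {
    public static void main(String[] args) {
        // Raw scores from the example query in the question.
        float[] scores = {8.864665f, 2.792687f, 2.792687f, 2.792687f, 2.792687f,
                          0.49009037f, 0.33730242f, 0.33730242f, 0.33730242f, 0.33730242f};
        float max = scores[0]; // highest score, assuming results are sorted by score

        for (float score : scores) {
            // Normalized to [0, 1]; only meaningful within this one result list.
            System.out.printf("%.6f -> %.4f%n", score, score / max);
        }
    }
}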
There is no good standard way to normalize scores with Lucene. Read this: ScoresAsPercentages and this explanation.
In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.
See also how-do-i-normalise-a-solr-lucene-score
There is no maximum score in Solr; it depends on too many variables, so it can't be predicted.
But you can implement something called a normalized score (Scores As Percentages), which is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?
A regular normalization will only help you compare the scoring distributions among queries (and their retrieved lists).
You cannot simply normalize the score to compare performance between queries.
Think of a query for which all retrieved documents are highly relevant and receive the same (high) score, and another query for which the retrieved list comprises barely relevant documents (again, all with the same score). No matter what per-query normalization you apply, the normalized scores will be the same.
You need to think of a cross-query factor that can bring all the scores to the same level.
For example, maybe compute the similarity between the query and the whole index, and use that score somehow along with the document score.
If you want to compare two or more queries, I found a workaround.
You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between your query term and your result; the result is the similarity between them. Do this for each query you want to compare. Now you have a tool to compare your queries using the similarity between your query term and your highest-scored result, and you can choose the query with the highest similarity and use it for the next steps.
// Damerau-Levenshtein distance (org.apache.lucene.search.spell)
import org.apache.lucene.search.spell.LuceneLevenshteinDistance;

LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();
// getDistance returns a similarity in [0, 1]; 1 means the strings are identical.
float similarity = d.getDistance(queryterm, yourResult);
I applied a non-linear function in order to compress the scores of every query.