I am tinkering with the Freebase Search API. I am searching for entities and am interested in the scores that Freebase returns.
According to http://wiki.freebase.com/wiki/Search_Service, the score is defined as:
score: the score of this result - scores are out of 100
If my query contains only one word, like nirvana, the score returned is under 100, but if I query something like 'statue of liberty', the score is as high as 1100. Sample query link:
https://www.googleapis.com/freebase/v1/search?query=statue%20of%20liberty&indent=true
Can anyone please point out what I am doing wrong?
The docs are wrong: the scores can be higher than 100, and they are only meaningful when compared with other scores within the same response.
I have a RavenDB 'More Like This' example running (C#), as per "Creating more like this in RavenDB".
However, in addition to receiving similar documents back, I also need some measure of similarity for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders I need the metric similarity results. This assumes (of course) that the rank orders are computed from a more continuous metric; e.g., tf-idf. If that is true, can I get a hold of those metric scores?
When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
Assuming you have set up the TermVector on the index properly, you'll get the results.
In the metadata of the results, you have the index score, which is what I think you are looking for.
I'd like your advice on optimizing this:
Data:
I have an SQLite database with roughly 3000 cities, each of which has a name, a latitude and a longitude. Each city also has a relevance (based on how often the user visits it), stored as a plain integer. I also have the user's location, again as lat/lon coordinates.
Request:
I need to create an autocomplete editBox. Suggestions must satisfy these conditions:
1) The phrase in the editBox must be a substring of the suggested city name.
2) Suggestions must be ordered first by relevance. (Plain integer ordering, no problem.)
3) If relevance is the same, then suggestions are ordered by distance to user.
4) Display max. 10 suggestions.
Since there are usually a lot of cities with equal relevance, the biggest problem is the distance ordering.
My current approach:
A) Get the IDs and coordinates of cities that satisfy conditions (1) and (2) with a plain query: name LIKE '%phrase%', ordered by relevance.
B) Split the result into groups by relevance. Sort each relevance group by distance in Java.
C) Once 10 suggestions are fixed (e.g. there are 11 relevance groups, each containing one city, so no distance ordering is needed), stop sorting.
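For reference, steps A to C amount to a two-key sort followed by a cut-off at 10. Below is a minimal Java sketch of that ordering. It assumes a hypothetical City class holding the columns read from SQLite, and it uses the haversine formula for distance, since the distance measure is not specified here. It sorts the whole match list at once instead of stopping early per relevance group, which yields the same ordering.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical holder for one row returned by the LIKE query.
class City {
    final String name;
    final int relevance;
    final double lat;
    final double lon;

    City(String name, int relevance, double lat, double lon) {
        this.name = name;
        this.relevance = relevance;
        this.lat = lat;
        this.lon = lon;
    }
}

class CitySuggester {
    // Great-circle distance in kilometres (haversine formula, mean Earth radius 6371 km).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Highest relevance first; ties are broken by distance to the user, nearest first.
    static List<City> suggest(List<City> matches, double userLat, double userLon, int limit) {
        List<City> sorted = new ArrayList<>(matches);
        Comparator<City> byRelevanceDesc =
                Comparator.comparingInt((City c) -> c.relevance).reversed();
        Comparator<City> byDistanceAsc =
                Comparator.comparingDouble(c -> distanceKm(userLat, userLon, c.lat, c.lon));
        sorted.sort(byRelevanceDesc.thenComparing(byDistanceAsc));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

Calling CitySuggester.suggest(matches, userLat, userLon, 10) returns at most 10 cities. On older Android versions the Java 8 Comparator helpers may need desugaring or a hand-written Comparator.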
This works well, but there is a problem: usually, very few cities have different relevance.
So when the user starts typing and there are just one or two letters in the search phrase, I end up sorting 500 cities by distance just to get my 10 suggestions, which I find highly inefficient.
Is there any better way to handle such situations using SQLite?
P.S. It is running on Android, if that helps :)
I've implemented Solr/Lucene fuzzy matching for my system and it's working perfectly.
I have a requirement to display the fuzzy-match percentage once the query response comes back.
As an example, if my index data is "rushikupadhyay" and my query is "rushikupadhya"~0.8, I should get the exact percentage as part of the response, like 0.85 or 85%.
I want to use the percentage in my application and perform additional steps based on the returned value: for example, if the match is 70-80% do X, 80-90% do Y, and > 90% do Z.
Any pointers are appreciated.
Please note: you may want to review the guidance on the Lucene Wiki page ScoresAsPercentages before deciding to go with purely percentage-based logic.
However, if you do decide to go with a percentage value, you can get it by including the score field in the query response. On the Solr Admin page, the Full Interface link takes you to /admin/form.jsp, where the Fields to Return option shows *,score. This returns the match score for each document in the result set. Note, however, that this is the raw score of the document match and is relative to the maxScore value that is part of the <result> element. To get a percentage for each document, you will need to normalize each document's score against maxScore, using logic such as (score / maxScore * 100).
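As a rough illustration of that normalization, here is a small Java sketch. The class and method names are made up for this example, the action labels "X", "Y", "Z" and "NONE" are placeholders for whatever the application does, and the band boundaries (exactly 80 goes to Y, exactly 90 stays in Y) are an assumption, since the question leaves them open.

```java
class MatchPercentage {
    // Percentage of a document's raw score relative to the best score in the same response.
    static double percentOfMax(double score, double maxScore) {
        if (maxScore <= 0.0) {
            return 0.0; // nothing meaningful to compare against
        }
        return score / maxScore * 100.0;
    }

    // Maps a percentage onto the action bands from the question (placeholder labels).
    static String actionFor(double percent) {
        if (percent > 90.0) {
            return "Z";
        }
        if (percent >= 80.0) {
            return "Y";
        }
        if (percent >= 70.0) {
            return "X";
        }
        return "NONE";
    }
}
```

Keep in mind that the top document always comes out at 100% with this scheme, whether or not it is a good match, which is one of the reasons the wiki page advises caution.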
I am trying to query a Solr server in order to obtain the most relevant results for a list of terms.
For example, I have the list of words "nokia", "iphone", "charger".
My schema contains the following data:
nokia
iphone
nokia iphone otherwords
nokia white
iphone white
If I run a simple query like q=nokia OR iphone OR charger, I get "nokia iphone otherwords" as the most relevant result (because it contains more query terms).
I would like to get "nokia" or "iphone" or "iphone white" as the first results, because for each individual term they would be the most relevant.
To obtain the correct list, I would run a query for each term, then aggregate the results and order them by maximum score.
Can I make this query in one request?
I think you should look at the "coord". From the SolrRelevancyFAQ:
coord is the coordination factor - if there are multiple terms in a query, the more terms that match, the higher the score
You could write your own Similarity subclass to ignore the coord or only take the highest value when scoring.
There might be other ways too; you could ask on the solr-users mailing list.
This might also help: comparing lucene scores across queries
It seems to me that you should execute 3 separate searches.
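If you do run one search per term, the merge step described in the question (aggregate, then order by maximum score) might look roughly like the sketch below. It is plain Java with no Solr client involved: each term's results are assumed to have already been read into a map from document id to score, and the class and method names are invented for this example. Scores from different queries are not strictly comparable, so treat the merged order as a heuristic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PerTermMerger {
    // Each input map is one term's result list: document id -> score from that term's query.
    // Returns document ids ordered by the best score each achieved in any single-term query.
    static List<String> mergeByMaxScore(List<Map<String, Double>> perTermResults) {
        Map<String, Double> best = new LinkedHashMap<>();
        for (Map<String, Double> results : perTermResults) {
            for (Map.Entry<String, Double> e : results.entrySet()) {
                // Keep the highest score seen so far for this document.
                best.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        List<String> ids = new ArrayList<>(best.keySet());
        // Descending by best score; the sort is stable, so ties keep first-seen order.
        ids.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
        return ids;
    }
}
```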
I need to normalize the Lucene scores between 0 and 1.
For example, a random query returns the following scores...
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What's the biggest score? 10.0?
Thanks.
You can divide all scores by the maximum score to get scores between 0 and 1.
However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.
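A minimal Java sketch of that division, assuming the scores of one query have already been collected into an array (the class and method names are made up for this example):

```java
class ScoreNormalizer {
    // Divides every score by the largest one, so the top hit becomes 1.0.
    static double[] normalizeByMax(double[] scores) {
        double max = 0.0;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        if (max <= 0.0) {
            return out; // no positive score: leave everything at 0.0
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / max;
        }
        return out;
    }
}
```

With the scores from the question, 8.864665 maps to 1.0 and each 2.792687 maps to roughly 0.315.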
There is no good standard way to normalize scores with Lucene. Read this: ScoresAsPercentages, and this explanation.
In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.
See also how-do-i-normalise-a-solr-lucene-score
There is no maximum score in Solr; it depends on too many variables, so it can't be predicted.
You can implement a so-called normalized score (Scores As Percentages), but this is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?
A regular normalization will only help you compare the score distribution among queries (and their retrieved lists).
You cannot simply normalize the score to compare the performance between queries.
Think of one query for which all retrieved documents are highly relevant and receive the same (high) score, and another query whose retrieved list comprises barely relevant documents (again, all with the same score). No matter which per-query normalization you apply, the normalized scores will be the same.
You need to think of a cross-query factor that can bring all the scores to the same level.
For example, you could compute the similarity between the query and the whole index, and use that score together with the document score.
If you want to compare two or more queries, I found a workaround.
You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between the query term and the result. The result is the similarity between them. Do this for each query you want to compare. You now have a tool for comparing queries: the similarity between each query term and its highest-scored result. You can then choose the query with the highest similarity and use it for the next steps.
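import org.apache.lucene.search.spell.LuceneLevenshteinDistance;

// Damerau-Levenshtein; getDistance returns a similarity in [0, 1], where 1 means identical
LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();
float similarity = d.getDistance(queryterm, yourResult);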
I applied a non-linear function to compress the scores of every query.
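The answer does not say which function was used. Purely as an illustration, one common squashing function is s / (s + 1), which maps any non-negative score into the range [0, 1) while preserving order. The sketch below is an assumption, not the function the answer refers to.

```java
class ScoreSquash {
    // Maps a non-negative raw score into [0, 1): 0 stays 0, large scores approach 1.
    // Illustrative choice only; the original answer does not name its function.
    static double squash(double score) {
        if (score <= 0.0) {
            return 0.0;
        }
        return score / (score + 1.0);
    }
}
```

Unlike dividing by the per-query maximum, this does not force the top result of every query to 1.0, but the compressed values are still built from raw scores that are not comparable across queries.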