RavenDB -- More Like This -- Need a (similarity) metric; not just rank-orders

I have a RavenDB 'More Like This' example running (C#), as per "Creating more like this in RavenDB".
However, in addition to receiving similar documents back, I really need some measure of similarity back for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders I need the metric similarity results. This assumes (of course) that the rank orders are computed from a more continuous metric; e.g., tf-idf. If that is true, can I get a hold of those metric scores?

When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
Assuming you have set up the TermVector on the index properly, you'll get the results.
The metadata of each result contains the index score, which I think is what you are looking for.

Related

Internal db logic/operation to group/compress result

I have a CrateDB table storing various information for zipcodes. It contains around 30k zipcodes, and I need my query to return certain profiling information for all zipcodes at once. I understand that typically it wouldn't be feasible, but since I only need ballpark information and many zipcodes are consecutive, I think an optimization is possible.
For example, if I wanted to profile population, a grouped result such as this would work for me:
group 1 (0-1000): 00000-02000,02004-02010,02012
group 2 (1001-3000): ...
...
The populations and groups above are fake, but the idea should hold. Basically, group profiled category into buckets, assign zipcodes to correct bucket, and further reduce size by using range representation. I could settle for a predefined number of groups or have group buckets defined by request/query itself. This would hopefully reduce the response from something that would be too large for a single query to one that's manageable.
Is it possible to write a cratedb function to do something similar to avoid bandwidth issues from having this grouping done on a different service/container/vm?
You could probably create groups on the fly, or as columns if you wish, with a regex. I have done this on a 23M-row table and grouped by that.
In my example, the regex grouping and AVG took around 30s, but this depends heavily on my hardware.
Something like this would probably work as a general pointer:
SELECT avg(<value_column>), regexp_matches(postcode, '<your regex>', 'i')[1]
FROM "doc"."<your_table>"
GROUP BY regexp_matches(postcode, '<your regex>', 'i')[1]
ORDER BY regexp_matches(postcode, '<your regex>', 'i')[1]
You could also use a window function (OVER), but this doesn't yet have full SQL support for partitioning, etc.
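For reference, the bucket-and-range idea from the question can be sketched outside the database. This is only a minimal Python illustration of the output shape, assuming rows of (zipcode, value) pairs and caller-supplied bucket bounds; the function names are hypothetical and this is not a CrateDB feature.

```python
from collections import defaultdict

def compress_ranges(zipcodes):
    """Collapse 5-digit zipcode strings into sorted 'start-end' runs."""
    codes = sorted({int(z) for z in zipcodes})
    runs = []
    for code in codes:
        if runs and code == runs[-1][1] + 1:
            runs[-1][1] = code          # extend the current consecutive run
        else:
            runs.append([code, code])   # start a new run
    return [f"{a:05d}" if a == b else f"{a:05d}-{b:05d}" for a, b in runs]

def bucket_zipcodes(rows, edges):
    """Group zipcodes by value bucket, then compress each group.

    rows:  iterable of (zipcode, value) pairs
    edges: ascending, inclusive upper bounds of the buckets; values above
           the last bound land in one extra overflow bucket
    """
    buckets = defaultdict(list)
    for zipcode, value in rows:
        # index of the first bucket whose upper bound is >= value
        idx = next((i for i, hi in enumerate(edges) if value <= hi), len(edges))
        buckets[idx].append(zipcode)
    return {idx: compress_ranges(zs) for idx, zs in sorted(buckets.items())}
```

Calling bucket_zipcodes(rows, [1000, 3000]) returns one list of ranges per bucket index, which is much smaller than one row per zipcode when many codes are consecutive.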

How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a low to a high score, for example:
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000 and depending on the limit (and number of matches) it might never return any 1000 matches, even though they exist. I've tried to use sort:+- and grouping but nothing seems to have any impact on the returned result.
So, how can the order of the returned results be controlled?
What I ideally want is a selection of matches from across the range, but I assume this isn't possible, given that the query just starts filling in the results from the top?
For reference, the index looks like this:
function(doc) {
var score = doc.score;
index("score", score, {
"store": "yes"
});
...
I cannot comment on this so posting an answer here:
Based on the Cloudant documentation on Lucene queries, there isn't a way to sort the results of a query. The sort options given there are for grouping, and even for grouped results I never saw sort work. In any case, it is supposed to sort the sequence of the groups themselves, not the data within them.
@pal2ie you are correct, and Cloudant has come back to me confirming it. It does make sense, in some way, but I was hoping I could at least control the direction (lo->hi, hi->lo). The solution I have implemented to get a better distribution across the range is to not use range queries but instead:
1) create a distribution of the number of desired results for each score in the range (a simple, discrete Gaussian, for example)
2) execute individual queries for each score in the range, with limit set to the number of desired results for that score
3) execute step 2 from min to max, filling up the result
It's not the most efficient approach, since it means multiple round-trips to the server, but at least it gives me full control over the distribution in the range.
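The per-score approach described above could be planned along these lines. This is a rough Python sketch, not the poster's actual code: the Gaussian parameters, the function names and the "score:<n>" query string are all assumptions, and the HTTP calls to the search index are left out.

```python
import math

def gaussian_limits(scores, total, mean, sigma):
    """Split `total` desired results across `scores` using Gaussian weights."""
    weights = [math.exp(-((s - mean) ** 2) / (2 * sigma ** 2)) for s in scores]
    norm = sum(weights)
    limits = [int(total * w / norm) for w in weights]
    # Rounding down loses a few results; hand them to the largest weights.
    leftover = total - sum(limits)
    by_weight = sorted(range(len(scores)), key=lambda i: -weights[i])
    for i in by_weight[:leftover]:
        limits[i] += 1
    return dict(zip(scores, limits))

def plan_queries(limits):
    """One (query string, limit) pair per score, ordered from min to max."""
    # Assumed query form: an exact match on the indexed "score" field.
    return [(f"score:{s}", n) for s, n in sorted(limits.items()) if n > 0]
```

Each planned pair would then be sent as its own search request (q=<query>, limit=<n>) and the responses concatenated in order.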

Sqlite, autocomplete cities based on location and relevance

I'd like your advice on optimizing this:
Data:
I have an SQLite database with about 3000 cities, each of which has a name, a latitude and a longitude. Each city also has a relevance (based on how often the user visits it). Relevance is a plain integer. Then I have the user's location, again as lat/lon coordinates.
Request:
I need to create autocomplete editBox. Suggestions must satisfy these conditions:
1) Phrase in editBox must be a substring of suggested city name.
2) Suggestions must be ordered first by relevance. (Classic integer ordering, no problem)
3) If relevance is the same, then suggestions are ordered by distance to user.
4) Display max. 10 suggestions.
Since there are usually a lot of cities with equal relevance, biggest problem is the distance ordering.
My current approach:
A) Get the IDs and coordinates of cities that satisfy conditions (1) and (2) using the classic name LIKE '%phrase%', ordered by relevance.
B) Split the result into groups by relevance. Order the cities within these relevance groups by distance, sorting in Java.
C) Once 10 suggestions are fixed (e.g., 11 relevance groups, each containing one city, so no location ordering is needed), stop ordering.
This works well. But there is a problem: usually, very few cities have different relevance.
So when the user starts typing and there are just one or two letters in the search phrase, I end up sorting 500 cities by distance just to get my 10 suggestions, which I find highly inefficient.
Is there any better way to handle such situations using SQLite?
P.S. It is running on Android, if that helps :)
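One option worth measuring is to let SQLite do both orderings and the cut to 10 in a single statement. The sketch below uses Python's sqlite3 module purely for illustration; the table and column names (cities, name, lat, lon, relevance) are assumptions, higher relevance is assumed to rank first, and the squared lat/lon difference is only a rough stand-in for real distance (it ignores that longitude degrees shrink with latitude).

```python
import sqlite3

def suggest(conn, phrase, user_lat, user_lon, limit=10):
    """Return up to `limit` city names containing `phrase`,
    ordered by relevance first and approximate distance second."""
    sql = """
        SELECT name
        FROM cities
        WHERE name LIKE '%' || ? || '%'
        ORDER BY relevance DESC,
                 (lat - ?) * (lat - ?) + (lon - ?) * (lon - ?) ASC
        LIMIT ?
    """
    params = (phrase, user_lat, user_lat, user_lon, user_lon, limit)
    return [row[0] for row in conn.execute(sql, params)]
```

The same SQL string should be usable from Android's SQLite API with bound arguments. Whether it actually beats sorting in Java on a device is something to benchmark rather than assume.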

returning fuzzy match percentage in solr query result

I've implemented Solr/Lucene fuzzy matching for my system and it's working perfectly.
I have a requirement to display the fuzzy match percentage once the query response comes back.
As an example if my index data is "rushikupadhyay" and if my query is "rushikupadhya"~0.8, I should get exact percentage as part of response like 0.85 or 85%.
I want to use percentage result as part of application and perform additional steps based on return value, like if percentage match is 70-80% do X, 80-90% do Y, and > 90% do Z.
Any pointers are appreciated.
Please note the guidance on the Lucene Wiki page ScoresAsPercentages, which you may want to review before deciding to go with purely percentage-based logic.
However, if you do decide to go with a percentage value, you can get it by also including the score field in the query response. On the Solr Admin page, the Full Interface link will direct you to /admin/form.jsp, where the Fields to Return option shows *,score. This will return the match score for each document in the result set. Please note, however, that this is the raw score of the document match and is relative to the maxScore value that is part of the <result> element. So, to get a percentage-based score for each document, you will need to normalize each document's score against maxScore, using logic such as (score / maxScore * 100).
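As a rough illustration of that normalization, here is a small Python sketch that works on a Solr JSON response requested with fl=*,score. The helper names and the X/Y/Z actions are taken from the question or made up for the example. Note that the best hit always comes out at 100%, so this is a score relative to the top result, not a true fuzzy-match percentage.

```python
def percent_scores(response):
    """Add a 'percent' field to each doc: score / maxScore * 100."""
    body = response["response"]
    max_score = body.get("maxScore") or 0.0
    docs = []
    for doc in body["docs"]:
        pct = 100.0 * doc["score"] / max_score if max_score else 0.0
        docs.append({**doc, "percent": pct})
    return docs

def action_for(pct):
    """Map a percentage onto the X / Y / Z actions from the question."""
    if pct > 90:
        return "Z"
    if pct >= 80:
        return "Y"
    if pct >= 70:
        return "X"
    return None  # below 70%: no action defined in the question
```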

How to normalize Lucene scores?

I need to normalize the Lucene scores between 0 and 1.
For example, a random query returns the following scores...
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What's the biggest possible score? 10.0?
thanks
You can divide all scores by the maximum score to get scores between 0 and 1.
However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.
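A minimal Python sketch of that division, using the scores from the question (the function name is just for illustration). The second call shows why the result says nothing across queries: a list of uniformly weak matches also normalizes to 1.0.

```python
def normalize(scores):
    """Divide every score by the largest one, giving values in [0, 1]."""
    top = max(scores, default=0.0)
    return [s / top if top > 0 else 0.0 for s in scores]

# Scores from the question, already sorted by relevance.
scores = [8.864665, 2.792687, 2.792687, 2.792687, 2.792687,
          0.49009037, 0.33730242, 0.33730242, 0.33730242, 0.33730242]
normalized = normalize(scores)   # first entry is 1.0, the rest are smaller

# A different query whose hits are all equally weak still yields 1.0 each.
weak_query = normalize([0.2, 0.2, 0.2])
```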
There is no good standard way to normalize scores with lucene. Read this: ScoresAsPercentages and this explanation
In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.
See also how-do-i-normalise-a-solr-lucene-score
There is no maximum score in Solr, it depends on too many variables, so it can't be predicted.
But you can implement something called normalized score (Scores As Percentages) which is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?
A regular normalization will only help you to compare the scoring distribution among queries (and their retrieved lists).
You cannot simply normalize the score to compare the performance between queries.
Think of one query for which all retrieved documents are highly relevant and receive the same (high) score, and another query whose retrieved list comprises barely relevant documents (again, all with the same score). No matter which per-query normalization you apply, the normalized scores will be the same.
You need to think of a cross-query factor that can bring all the scores to the same level.
For example, you could compute the similarity between the query and the whole index, and use that score somehow along with the document score.
If you want to compare two or more queries, I found a workaround.
You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between your query term and your result. The result is the similarity between them. Do this for each query you want to compare. Now you have a tool to compare your queries using the similarity between your query term and your highest result. You can then choose the query with the highest similarity score and use it for further actions.
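import org.apache.lucene.search.spell.LuceneLevenshteinDistance;

// Damerau-Levenshtein based similarity between the query term and the top result
LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();
float similarity = d.getDistance(queryterm, yourResult);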
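To show what such a similarity value looks like without Lucene, here is a plain-Python sketch. It uses ordinary Levenshtein distance (not the Damerau variant) and normalizes by the longer string's length, which is an assumption about the scaling and may not match Lucene's classes exactly.

```python
def levenshtein(a, b):
    """Classic edit distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

For example, similarity("rushikupadhyay", "rushikupadhya") is 1 - 1/14, since the strings differ by a single deleted character.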
I applied a non-linear function in order to compress the scores of every query.