How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a low to a high score, for example:
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000, and depending on the limit (and the number of matches) it might never return any matches with a score of 1000, even though they exist. I've tried using sort:+- and grouping, but nothing seems to have any impact on the returned results.
So: how can the order of the returned results be controlled?
What I ideally want is a selection of matches from across the range, but I assume this isn't possible, given that the query just fills in results from the top?
For reference, the index function looks like this:
function(doc) {
  var score = doc.score;
  index("score", score, {
    "store": "yes"
  });
  ...

I cannot comment on this, so I'm posting an answer here:
Based on the Cloudant documentation on Lucene queries, there isn't a way to sort the results of a query. The sort options given there are for grouping, and even for grouped results I never saw sort work. In any case, it is supposed to sort the sequence of the groups themselves, not the data within them.

#pal2ie you are correct, and Cloudant has come back to me confirming it. It does make sense, in some ways, but I was hoping I could at least control the direction (low->high, high->low). The solution I have implemented to get a better distribution across the range is to not use range queries, but instead:
1. create a distribution of the number of desired results for each score in the range (a simple, discrete Gaussian, for example)
2. execute an individual query for each score in the range, with limit set to the number of desired results for that score
3. execute step 2 from min to max, filling up the result
It's not the most efficient approach, since it means multiple round-trips to the server, but at least it gives me full control over the distribution in the range.
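A minimal sketch of that approach, assuming the standard Cloudant search endpoint and a hypothetical account, database, design document, index name and weight table (all of which you would adjust to your own setup):

import requests

# Hypothetical account, database, design doc and index names.
BASE = "https://ACCOUNT.cloudant.com/mydb/_design/app/_search/by_score"
AUTH = ("user", "password")

# Step 1: desired number of results per score (a crude, discrete bell shape).
desired = {1000: 5, 2000: 20, 3000: 5}

# Steps 2-3: one limited query per score, from min to max, filling the result.
rows = []
for score in sorted(desired):
    resp = requests.get(
        BASE,
        params={"q": "score:%d" % score, "limit": desired[score]},
        auth=AUTH,
    )
    rows.extend(resp.json().get("rows", []))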

Related

RavenDB -- More Like This -- Need a (similarity) metric; not just rank-orders

I have a RavenDB / 'More Like This' example running (C#) as per
Creating more like this in RavenDB
However, in addition to receiving similar documents back, I really need some measure of similarity back for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders, I need the actual similarity metric values. This assumes (of course) that the rank orders are computed from a more continuous metric, e.g. tf-idf. If that is true, can I get hold of those metric scores?
When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
And assuming you have set up the TermVector on the index properly, you'll get the results.
In the metadata of the results, you have the index score, which is what I think you are looking for.

Limiting the number of rows returned by `.where(...)` in pytables

I am dealing with tables having up to a few billion rows, and I do a lot of "where(numexpr_condition)" lookups using PyTables.
We managed to optimise the HDF5 format so that a simple where-query over 600 million rows is done in under 20 s (we are still struggling to find out how to make this faster, but that's another story).
However, since it is still too slow for playing around, I need a way to limit the number of results in a query like this simple example (the foo column is of course indexed):
[row['bar'] for row in table.where('(foo == 234)')]
This would return, let's say, 100 million entries and takes 18 s, which is way too slow for prototyping and playing around.
How would you limit the result to, let's say, 10000?
The roughly equivalent database query would be:
SELECT bar FROM table WHERE foo = 234 LIMIT 10000
Using the stop= attribute is not the way, since it simply takes the first n rows and applies the condition to them. So in the worst case, if the condition is not fulfilled in those rows, I get an empty array:
[row['bar'] for row in table.where('(foo == 234)', stop=10000)]
Using a slice on the list comprehension is also not the right way, since it first creates the whole list and then applies the slice, which of course is no speed gain at all:
[row['bar'] for row in table.where('(foo == 234)')][:10000]
However, the iterator must know its own size while the list comprehension exhausts it, so there is surely a way to hack this together. I just could not find a suitable way of doing it.
Btw. I also tried using zip and range to force a StopIteration:
[row['bar'] for _, row in zip(range(10000), table.where('(foo == 234)'))]
But this just gave me the same row's value repeated over and over.
Since table.where() is an iterable that produces rows on demand, you should be able to speed it up with itertools.islice. Note that PyTables reuses a single Row object while iterating, so copy out the values you need inside the loop rather than keeping the Row references:
import itertools
rows = [row['bar'] for row in itertools.islice(table.where('(foo == 234)'), 10000)]

returning fuzzy match percentage in solr query result

I've implemented Solr/Lucene fuzzy matching for my system and it's working perfectly.
I have a requirement to display the fuzzy-match percentage once the query response comes back.
As an example, if my indexed data is "rushikupadhyay" and my query is "rushikupadhya"~0.8, I should get the exact percentage as part of the response, like 0.85 or 85%.
I want to use the percentage in the application and perform additional steps based on the returned value: if the match is 70-80% do X, 80-90% do Y, and > 90% do Z.
Any pointers are appreciated.
Please note: there is guidance on the Lucene wiki - ScoresAsPercentages - that you may want to review before deciding to go with purely percentage-based logic.
However, if you do decide to go with a percentage value, you can get it by also including the score field in the query response. On the Solr Admin page, the Full Interface link takes you to /admin/form.jsp, where the Fields to Return option shows *,score. This will return the match score for each document in the result set. Note, however, that this is the raw score of the document match and is relative to the maxScore value in the <result> element. So, to get a true percentage-based score for each document, you need to normalize each document's score against maxScore, using logic such as (score / maxScore * 100), to get the correct percentage value to display.
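If it helps, here is a rough client-side sketch of that calculation. The Solr URL, core and field names are hypothetical, and it assumes the JSON response includes maxScore whenever score is requested in fl:

import requests

# Hypothetical Solr core and field names -- adjust to your schema.
SOLR_URL = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "name:rushikupadhya~0.8",  # fuzzy query, as in the question
    "fl": "*,score",                # also return the raw match score
    "wt": "json",
}
data = requests.get(SOLR_URL, params=params).json()

max_score = data["response"]["maxScore"]
for doc in data["response"]["docs"]:
    pct = doc["score"] / max_score * 100  # normalize against maxScore
    print(doc.get("id"), "%.0f%%" % pct)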

How to normalize Lucene scores?

I need to normalize the Lucene scores between 0 and 1.
For example, a random query returns the following scores...
8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242
0.33730242
0.33730242
0.33730242
What's the biggest score? 10.0?
thanks
You can divide all scores by the maximum score to get scores between 0 and 1.
However, please note that the normalised scores should only be used to compare the results of a single query. It is not correct to compare the scores (normalised or not) of results from two different queries.
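A minimal sketch of this divide-by-max normalization, using the example scores from the question:

# Scores from the example query, normalized against the maximum score.
scores = [8.864665, 2.792687, 2.792687, 2.792687, 2.792687,
          0.49009037, 0.33730242, 0.33730242, 0.33730242, 0.33730242]

max_score = max(scores)
normalized = [s / max_score for s in scores]
print(normalized[0])   # 1.0 -- the top hit always maps to 1.0
print(normalized[-1])  # ~0.038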
There is no good standard way to normalize scores with Lucene. Read this: ScoresAsPercentages and this explanation.
In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every query.
See also how-do-i-normalise-a-solr-lucene-score
There is no maximum score in Solr; it depends on too many variables, so it can't be predicted.
But you can implement something called a normalized score (Scores As Percentages), which is not recommended.
See related links for more details:
Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)
how do I normalise a solr/lucene score?
Remove results below a certain score threshold in Solr/Lucene?
A regular normalization will only help you to compare the scoring distribution among queries (and their retrieved lists).
You cannot simply normalize the score to compare performance between queries.
Think of a query for which all the retrieved documents are highly relevant and receive the same (high) score, and another query for which the retrieved list comprises barely relevant documents (again, all with the same score). No matter what per-query normalization you apply, the normalized scores will look the same.
You need to think of a cross-query factor that can bring all the scores to the same level.
For example, maybe compute the similarity between the query and the whole index, and use that score somehow along with the document score.
If you want to compare two or more queries, I found a workaround.
You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between your query term and your result, which gives you the similarity between them. Do this for each query you want to compare. You now have a tool to compare your queries using the similarity between the query term and the top result, and you can pick the query with the highest similarity for the next steps.
// Damerau-Levenshtein distance
LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();
float similarity = d.getDistance(queryterm, yourResult);
I applied a non-linear function on top in order to compress the scores of every query.

Optimizing Solr for Sorting

I'm using Solr for a realtime search index. My dataset is about 60M large documents. Instead of sorting by relevance, I need to sort by time. Currently I'm using the sort flag in the query to sort by time. This works fine for specific searches, but when searches return large numbers of results, Solr has to take all of the resulting documents and sort them by time before returning. This is slow, and there has to be a better way.
What is the better way?
I found the answer.
If you want to sort by time, and not relevance, use fq= instead of q= for all of your filters. This way, Solr doesn't waste time figuring out the weighted value of the documents matching q=. It turns out that Solr was spending too much time weighting, not sorting.
Additionally, you can speed sorting up by pre-warming your sort fields in the newSearcher and firstSearcher event listeners in solrconfig.xml. This will ensure that sorts are done via cache.
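As an illustration of the fq approach, here is a rough client-side sketch of such a query; the core name, field names and filter values are hypothetical:

import requests

# Hypothetical core, field and filter names -- adjust to your schema.
SOLR_URL = "http://localhost:8983/solr/mycore/select"

# Filter queries (fq) are cached and not relevance-scored, so put all the
# filters there and keep q as a match-all when you only need the time sort.
params = {
    "q": "*:*",
    "fq": ["category:news", "author:smith"],
    "sort": "created_at desc",   # sort by time instead of relevance
    "rows": 20,
    "wt": "json",
}
docs = requests.get(SOLR_URL, params=params).json()["response"]["docs"]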
Obvious first question: what's the type of your time field? If it's a string, then sorting is obviously very slow. tdate is even faster than date.
Another point: do you have enough memory for Solr? If it starts swapping, then performance is immediately awful.
And a third one: if you have an older Lucene, then date is just a string, which is very slow.
Warning: Wild suggestion, not based on prior experience or known facts. :)
1. Perform a query without sorting and with rows=0 to get the number of matches. Disable faceting etc. to improve performance - we only need the total number of matches.
2. Based on the number of matches from Step #1, the distribution of your data and the count/offset of the results you need, fire another query which sorts by date and also adds a filter on the date, like fq=date:[NOW-xDAY TO *], where x is the estimated time period in days during which we will find the required number of matching documents.
3. If the number of results from Step #2 is less than what you need, relax the filter a bit and fire another query.
For starters, you can use the following to estimate x:
If you are uniformly adding n documents a day to the index of size N documents and a specific query matched d documents in Step #1, then to get the top r results you can use x = (N*r*1.2)/(d*n). If you have to relax your filter too often in Step #3, then slowly increase the value 1.2 in the formula as required.
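A quick worked example of that estimate, with purely illustrative numbers (a 60M-document index, 1M matches, 100k new documents per day, 100 results wanted):

N = 60_000_000   # total documents in the index
d = 1_000_000    # documents matched by the query in Step #1
n = 100_000      # documents added per day
r = 100          # results needed

x = (N * r * 1.2) / (d * n)
print(x)  # 0.072 days, i.e. filter to roughly the last couple of hours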