Getting maximum value of field in solr - lucene

I'd like to boost my query by the item's view count; I'd like to use something like view_count / max_view_count for this purpose, to be able to measure how the item's view count relates to the biggest view count in the index. I know how to boost the results with a function query, but how can I easily get the maximum view count? If anybody could provide an example it would be very helpful...

There aren't any aggregate functions under solr in the way you might be thinking about them from SQL. The easiest way to do it is to have a two-step process:
Get the max value via an appropriate query with a sort
use it with the max() function
So, something like:
q=*:*&sort=view_count desc&rows=1&fl=view_count
...to get an item with the max view_count, which you record somewhere, and then
q=whatever&bq=div(view_count, max(the_max_view_count, 1))
Note that that max() function isn't doing an aggregate max; just getting the maximum of the max-view-count you pass in or 1 (to avoid divide-by-zero errors).
If you have a multiValued field (which you can't sort on) you could also use the StatsComponent to get the max. Either way, you would probably want to do this once, not for every query (say, every night at midnight or whatever once your data set settles down).

You can add just:
&stats=true&stats.field=view_count
You will see a small statistics on that specified field. More documentation here

Related

RavenDB -- More Like This -- Need a (similarity) metric; not just rank-orders

I have a RavenDB / 'More Like This' example running (C#) as per
Creating more like this in RavenDB
However, in addition to receiving similar documents back, I really need some measure of similarity back for those documents.
I am assuming (correctly?) that the order in which I get the similar documents back represents the rank-order scores of the documents' similarities (first one back has the highest similarity, second one back has the second highest similarity, etc.).
However, rather than rank orders I need the metric similarity results. This assumes (of course) that the rank orders are computed from a more continuous metric; e.g., tf-idf. If that is true, can I get a hold of those metric scores?
When using MoreLikeThis, you can issue a query such as the following:
from index 'Product/Search'
where morelikethis(id() = 'products/1-A')
And assuming you have setup the TermVector on the index properly, you'll get the results.
In the metadata of the results, you have the index score, which is what I think you are looking for.

How can I control the order of results? Lucene range queries in Cloudant

I've got a simple index which outputs a "score" from 1000 to 12000 in increments of 1000. I want to get a range of results from a lo- to high -score, for example;
q=score:[1000 TO 3000]
However, this always returns a list of matches starting at 3000 and depending on the limit (and number of matches) it might never return any 1000 matches, even though they exist. I've tried to use sort:+- and grouping but nothing seems to have any impact on the returned result.
So; how can the order of results returned be controlled?
What I ideally want is a selection of matches from the range but I assume this isn't possible, given that the query just starts filling the results in from the top?
For reference the index looks like this;
function(doc) {
var score = doc.score;
index("score", score, {
"store": "yes"
});
...
I cannot comment on this so posting an answer here:
Based on the cloudant doc on lucene queries, there isn't a way to sort results of a query. The sort options given there are for grouping. And even for grouped results I never saw sort work. In any case it is supposed to sort the sequence of the groups themselves. Not the data within.
#pal2ie you are correct, and Cloudant has come back to me confirming it. It does make sense, in some way, but I was hoping I could at least control the direction (lo->hi, hi->lo). The solution I have implemented to get a better distribution across the range is to not use range queries but instead;
create a distribution of the number of desired results for each score in the range (a simple, discrete, Gaussian for example)
execute individual queries for each score in the range with limit set to the number of desired results for that score
execute step 2 from min to max, filling up the result
It's not the most effective since it means multiple round-trips to the server but at least it gives me full control over the distribution in the range

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
For a more concrete below is a typical query which is a result by a customer setting some of the filter options and following to the second page. Its written in a pseodo-sql code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above I calculate intersection of the sets of document IDs and apply the sorting using the result of timestamp query. Than finally I can apply the slice resolving the document IDs of the rows 15-30 and download their content using bulk get operation.
Needless to say, its not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part when I'm calculating the intersection of the sets can take like 4 seconds, obviously I need to optimize it further. I'm afraid to think, how slow its going to get in a few months when my dataset doubles, triples, etc.
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without loosing the flexibility of the tool?
Is the view structure I've used optimal ? At some point I was considering using a separate map() function generating the value of each field. This would result in a smaller b-trees but more work of the view server to generate the index. Can I benefit this way ?
The part of algorithm where I have to calculate intersections of the big sets just to later get the slice of the result bothers me. Its not a scalable approach. Does anyone know a better algorithm for this ?
Having map function:
function(doc){
if(doc.marked_as_test) return;
emit([doc.dataspace, doc.timestamp, doc.fico], null):
}
You can made similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the ways CouchDB fits the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.

search lucene NumericField for maximum value

I know there is a NumericRangeQuery in Lucene but is it possible to have lucene simply return the maximum value stored in in a NumericField. I can use a RangeQuery over the entire known range and then sort but this is extremely cumbersome and it may return a huge amount of results if there are a lot of records
The second parameter of IndexSearcher.search(Query query, int n, Sort sort) allows to specify the top n hits (in your case 1), which, if you sort correctly, only returns the desired result. There are other overloaded methods that allow achieving the same.
Can't argue about the cumbersomeness though :)
You could Term Enum through your index. Unfortunately I don't think they're sorted in a way which makes finding the maximum instantaneous, but at least you won't have to do an actual search to find it. You will need to use NumericUtils to convert from Lucene's internal structure to a normal number.
This thread contains an example.

Providing a LIMIT param in Django query without taking a slice of a QuerySet

I have a utility function in my program for searching for entities. It takes a max_count parameter. It returns a QuerySet.
I would like this function to limit the max number of entries. The standard way would be to take a slice out of my QuerySet:
return results[:max_count]
My problem is that the views which utilize this function sort in various ways by using .order_by(). This causes exceptions as re-ordering is not allowed after taking a slice.
Is it possible to force a "LIMIT 1000" into my SQL query without taking a slice?
Do results[:max_count] in a view, after .order_by(). Don't be afraid of requesting too much from DB, query won't be evaluated until the slice (and even after it either).
You could subclass QuerySet to achieve this, by simply ignore every slice and do [:max_count] at last in __getitem__, but I don't think it worths with the complex and side-effects.
If you are worrying about memory consumption by large queryset, follow http://www.mellowmorning.com/2010/03/03/django-query-set-iterator-for-really-large-querysets/
For normal usage, please just follow DrTyrsa's suggestion. You could write shortcuts to shorten the order_by and afterwards slice in code to simplify your code.