Search Lucene NumericField for maximum value

I know there is a NumericRangeQuery in Lucene, but is it possible to have Lucene simply return the maximum value stored in a NumericField? I can use a RangeQuery over the entire known range and then sort, but this is extremely cumbersome, and it may return a huge number of results if there are a lot of records.

The second parameter of IndexSearcher.search(Query query, int n, Sort sort) lets you specify the top n hits (in your case 1), which, if you sort correctly, returns only the desired result. There are other overloaded methods that achieve the same thing.
Can't argue about the cumbersomeness though :)
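A minimal sketch of that approach, assuming a long NumericField named "price" that is also stored (the field name is illustrative, and the SortField.Type constant is the Lucene 4.x spelling; in the 3.x era of NumericField it is SortField.LONG):

    // Match everything, but keep only the single top hit when sorting
    // by "price" descending; that hit holds the maximum value.
    Query all = new MatchAllDocsQuery();
    Sort byPriceDesc = new Sort(new SortField("price", SortField.Type.LONG, true));
    TopDocs top = searcher.search(all, 1, byPriceDesc);
    if (top.scoreDocs.length > 0) {
        Document doc = searcher.doc(top.scoreDocs[0].doc);
        long max = Long.parseLong(doc.get("price")); // only works if the field is stored
    }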

You could enumerate the terms in your index with a TermEnum. Unfortunately, I don't think they're sorted in a way that makes finding the maximum instantaneous, but at least you won't have to run an actual search to find it. You will need NumericUtils to convert from Lucene's internal trie-encoded representation back to a normal number.
This thread contains an example.
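Roughly, such an enumeration could look like this (Lucene 3.x API, with a hypothetical long field "price"; trie-encoded numeric fields also index lower-precision helper terms, so only the full-precision, shift-0 terms are decoded):

    IndexReader reader = IndexReader.open(directory);
    TermEnum terms = reader.terms(new Term("price", ""));
    long max = Long.MIN_VALUE;
    try {
        do {
            Term t = terms.term();
            if (t == null || !"price".equals(t.field())) break;
            // Higher shifts are the trie's lower-precision prefixes;
            // only shift-0 terms carry the full value.
            if (t.text().charAt(0) == NumericUtils.SHIFT_START_LONG) {
                max = Math.max(max, NumericUtils.prefixCodedToLong(t.text()));
            }
        } while (terms.next());
    } finally {
        terms.close();
    }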


Does ZRANGEBYLEX support contains query?

How can I query my sorted set to get all keys containing some characters?
"Starts with" works fine but I need "contains".
I am using the query below for "starts with", which works fine:
zrangebylex zset [2110 "[2110\xff" LIMIT 0 10
Is there any way we can do a "contains" query?
No. The lexicographical range for Redis' Sorted Sets can only be used for prefix searches.
Note that by using another Sorted Set that stores the reverse of the values you can also perform a suffix search on the values. However, even combining these two approaches will not provide the functionality you need.
Alternatively, you could perform a prefix search and then filter the results using a Lua script. Depending on your queries and data, this may or may not be an effective approach.
You could also consider implementing a full-text indexing mechanism on top of Redis, but that would be overkill in most cases; besides, there are existing, tested technologies that already do that.
But you can use ZSCAN with a glob-style pattern, for example to get all members that contain the characters "s" and/or "a":
ZSCAN key 0 MATCH *[sa]*
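From application code, the same scan can be driven through a client library; here is a rough sketch with the Jedis Java client (Jedis 3.x-style API; the key name "zset" and the pattern are just the ones from this example):

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.ScanParams;
    import redis.clients.jedis.ScanResult;
    import redis.clients.jedis.Tuple;

    public class ZscanContains {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost")) {
                ScanParams params = new ScanParams().match("*[sa]*");
                String cursor = ScanParams.SCAN_POINTER_START; // "0"
                do {
                    // ZSCAN returns one page of (member, score) tuples
                    // plus the cursor to pass to the next call.
                    ScanResult<Tuple> page = jedis.zscan("zset", cursor, params);
                    for (Tuple t : page.getResult()) {
                        System.out.println(t.getElement());
                    }
                    cursor = page.getCursor();
                } while (!ScanParams.SCAN_POINTER_START.equals(cursor));
            }
        }
    }

As with SCAN, results are only complete once the cursor returns to 0, and MATCH filtering happens per page, so individual pages may come back empty.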
From the ZRANGEBYLEX documentation (see also the zzlCompareElements function implementation in the source code):
The elements are considered to be ordered from lower to higher strings as compared byte-by-byte using the memcmp() C function. Longer strings are considered greater than shorter strings if the common part is identical.
From memcmp documentation:
memcmp compares the first num bytes of the block of memory pointed by ptr1 to the first num bytes pointed by ptr2, returning zero if they all match or a value different from zero representing which is greater if they do not.
So you can't use ZRANGEBYLEX for a contains query, and I'm afraid there is no "lite" workaround ("lite" meaning one that doesn't involve patching the Redis source).

Querying Apache Solr based on score values

I am working on an image retrieval task. I have a dataset of Wikipedia images with their textual descriptions in XML files (1 XML file per image). I have indexed those XML files in Solr. Now, when retrieving them, I want to apply a threshold on the score values, so that documents with a lower score are excluded from the results (because they are not of much importance). For example, I want to retrieve all documents having a similarity score greater than or equal to 2.0. I have already tried range queries like score:[2.0 TO *] but can't get them working. Does anyone have any idea how I can do that?
What's the motivation for wanting to do this? The reason I ask is that score is a relative thing determined by Lucene based on your index statistics. It is only meaningful for comparing the results of a specific query with a specific instance of the index. In other words, it isn't useful to filter on, because there is no way of knowing what a good cutoff value would be.
http://lucene.472066.n3.nabble.com/score-filter-td493438.html
Also, take a look here - http://wiki.apache.org/lucene-java/ScoresAsPercentages
So, in general, it's a bad idea to cut off at some fixed value, because you'll never know which threshold is best: for a good query it could be score=2, for a bad one score=0.5, and so on.
These two links should explain why you DON'T want to do it.
P.S. If you still want to do it, take a look here - https://stackoverflow.com/a/15765203/2663985
P.P.S. I recommend fixing your search queries instead, so that they search with higher precision (http://en.wikipedia.org/wiki/Precision_and_recall).
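For reference, the answer linked in the P.S. performs the cutoff with a Solr function range query over the score, rather than the score:[2.0 TO *] syntax (which doesn't work because score is not a real field). It looks something like this, with l=2.0 standing in for the threshold from the question:

    q=<your query>
    fq={!frange l=2.0}query($q)

The query($q) function re-evaluates the main query as a function, so the frange filter admits only documents whose score for $q is at least the lower bound l.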

Lucene Paging with search

Hello, I am currently using Lucene 4.6.1.
In my design I need to be able to search and page through possibly many results, so I have some general questions about optimization.
First, in search(Query q, int n), what is the purpose of the parameter n? Is n different from TopDocs.totalHits? How should this number be chosen, and based on what criteria?
Second, it seems that there are two general algorithms for paging: I can either use searchAfter or slice the ScoreDoc[] array given a page size.
Which way do most people currently recommend, and what design considerations apply?
searchAfter can be used for efficient "deep paging".
A tutorial on using it with Solr: http://heliosearch.org/solr/paging-and-deep-paging/
The int passed to search is the maximum number of hits the search will retrieve. totalHits, from the TopDocs, is the total number of hits for the query; it may be more or less than the value passed in.
Not clear to me what you mean by processing the ScoreDoc array. searchAfter is specifically intended to be used for pagination. Use it.
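A bare-bones sketch of searchAfter pagination (Lucene 4.x API; the query and PAGE_SIZE are placeholders):

    final int PAGE_SIZE = 10;
    TopDocs page = searcher.search(query, PAGE_SIZE);
    while (page.scoreDocs.length > 0) {
        for (ScoreDoc sd : page.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            // ... render this hit ...
        }
        // Resume collection after the last hit of the current page rather
        // than re-collecting all preceding pages from the top.
        ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
        page = searcher.searchAfter(last, query, PAGE_SIZE);
    }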

Optimizing Solr for Sorting

I'm using Solr for a realtime search index. My dataset contains about 60M large documents. Instead of sorting by relevance, I need to sort by time. Currently I'm using the sort parameter in the query to sort by time. This works fine for specific searches, but when a search returns a large number of results, Solr has to take all of the matching documents and sort them by time before returning them. This is slow, and there has to be a better way.
What is the better way?
I found the answer.
If you want to sort by time and not relevance, use fq= instead of q= for all of your filters. That way, Solr doesn't waste time computing relevance scores for the documents matching q=. It turned out that Solr was spending too much time scoring, not sorting.
Additionally, you can speed up sorting by pre-warming your sort fields in the newSearcher and firstSearcher event listeners in solrconfig.xml. This ensures that sorts are served from the cache.
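A minimal example of such a warming listener in solrconfig.xml (the field name "timestamp" is a placeholder for your actual sort field):

    <!-- Run a sorted query whenever a searcher opens, so the field
         cache for the sort field is populated before real traffic. -->
    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">timestamp desc</str>
        </lst>
      </arr>
    </listener>
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <str name="q">*:*</str>
          <str name="sort">timestamp desc</str>
        </lst>
      </arr>
    </listener>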
Obvious first question: what's the type of your time field? If it's a string, then sorting is obviously very slow. tdate is even faster than date.
Another point: do you have enough memory for Solr? If it starts swapping, performance immediately becomes awful.
And a third: if you are on an older Lucene version, then a date is just a string, which is very slow.
Warning: Wild suggestion, not based on prior experience or known facts. :)
Step #1: Perform a query without sorting and with rows=0 to get the number of matches. Disable faceting etc. to improve performance; we only need the total number of matches.
Step #2: Based on the number of matches from Step #1, the distribution of your data, and the count/offset of the results that you need, fire another query which sorts by date and also adds a filter on the date, like fq=date:[NOW-xDAYS TO *], where x is the estimated time period in days during which we will find the required number of matching documents.
Step #3: If the number of results from Step #2 is less than what you need, relax the filter a bit and fire another query.
For starters, you can use the following to estimate x:
If you are uniformly adding n documents a day to an index of N documents, and a specific query matched d documents in Step #1, then the query matches a fraction d/N of the index, so roughly n*d/N matching documents arrive per day; to collect the top r results you therefore need about r/(n*d/N) days, padded with a safety factor: x = (N*r*1.2)/(d*n). If you have to relax your filter too often in Step #3, slowly increase the value 1.2 in the formula as required.
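To make that concrete with made-up numbers: for N = 60M documents, n = 100K new documents per day, a query that matched d = 600K documents, and r = 100 desired results, x = (60,000,000 * 100 * 1.2) / (600,000 * 100,000) = 0.12 days, i.e. a filter covering roughly the last three hours should be enough.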

Lucene Scoring: TermQuery w & w/o TermVectors

Does TermQuery.ExtractTerms result in a higher count when term vectors/positions/offsets are turned on (assuming that there is more than one occurrence of a match)? Conversely, with the inverted-file info turned off, does ExtractTerms always return one and only one term?
EDIT: How and where does turning on term vectors in the schema affect scoring?
TermQuery.ExtractTerms extracts the terms in the query, not the result. So a search for "foo:bar" will always return exactly one term, regardless of what's in the index.
It sounds to me like you want to know about highlighting, not Query.ExtractTerms.
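To illustrate with the Java Lucene spelling of the method (Query.extractTerms(Set<Term>) in the 4.x line; the field and value are arbitrary):

    TermQuery tq = new TermQuery(new Term("foo", "bar"));
    Set<Term> terms = new HashSet<Term>();
    tq.extractTerms(terms);
    // terms now holds exactly one element, foo:bar, no matter how many
    // documents match or how many times the term occurs in the index.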
EDIT: Based on your comment, it sounds like you are asking: "how is scoring affected by term vectors?" The answer to that is: not at all. The term frequency, norms, etc. are calculated at index time, so it doesn't matter what you store.
The major exception is PhraseQuery with slop, which uses the term positions. A minor exception is that custom scoring classes can use whatever data they want, so not only term vectors but also payloads etc. can potentially affect the score.
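For instance, a sloppy phrase query, which relies on those term positions, is built like this (Lucene 4.x style; the field name and terms are arbitrary):

    PhraseQuery pq = new PhraseQuery();
    pq.add(new Term("body", "quick"));
    pq.add(new Term("body", "fox"));
    pq.setSlop(2); // allow up to two position moves between the terms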
If you're just running TermQuerys, though, what you store should have no effect.