Filtering by date range in Lucene

I know the title might suggest it is a duplicate but I haven't been able to find the answer to this specific issue:
I have to filter search results based on a date range. The date of each document is stored (but not indexed) on the document. When using a Filter, I noticed the filter is called with all the documents in the index.
This means the filter will get slower as the index grows (currently only ~300,000 documents in it) as it has to iterate through every single document.
I can't use RangeQuery since the date is not indexed.
How can I apply the filter only AFTER the query, so that it runs on just the documents that match, making it more efficient?
I'd prefer to do it before I am handed the results, so as not to mess up the scores and collectors I have.

Not quite sure if this will help, but I had a similar problem to yours and came up with the following (+ notes):
I think you're really going to have to index the date field. Nothing else makes any sense in terms of querying/filtering etc.
In Lucene.net v2.9, range querying where there are lots of terms seems to have got terribly slow compared to v2.4.
I fixed my speed issues when using date fields by switching to using a numeric field and numeric field queries. This actually gave me quite a speed boost over my Lucene.net v2.4 baseline.
Wrapping your query in a caching wrapper filter means you can hang onto the document bit set for the filter. This will also dramatically speed up subsequent queries using the same filter.
A filter doesn't play a part in the scoring for a set of query results.
Joining your cached filter to the rest of your query (where I guess you've got your custom scores and collectors) means it should meet the final part of your criteria
So, to summarise: index your date fields as numeric fields; build your queries as numeric range queries; transform these into cached filter wrappers and hang onto them.
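A minimal sketch of that recipe, using the Lucene 2.9-era Java API (Lucene.Net mirrors the same class names); the "date" field name and the yyyyMMdd encoding are just assumptions:

// index time: store the date as a trie-encoded numeric field
Document doc = new Document();
doc.add(new NumericField("date", Field.Store.YES, true).setLongValue(20100115L));

// query time: build a numeric range query and wrap it in a cached filter
Query range = NumericRangeQuery.newLongRange("date", 20100101L, 20100131L, true, true);
Filter dateFilter = new CachingWrapperFilter(new QueryWrapperFilter(range));

// hang onto dateFilter and reuse it; the cached bit set makes repeat queries cheap
TopDocs hits = searcher.search(mainQuery, dateFilter, 100);

(The classes come from org.apache.lucene.document and org.apache.lucene.search.)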
I think you'll see some spectacular speedups over your current index usage.
Good luck!
p.s.
I would never second guess what'll be fast or slow when using Lucene. I've always been surprised in both directions!

First, to filter on a field, it has to be indexed.
Second, using a Filter is considered the best way to restrict the set of documents to search on. One reason for this is that you can cache the filter results to be used for other queries. And the filter data structure is pretty efficient: it is a bit set of the documents matching the filter.
But if you insist on not using filters, I think the only way is to use a boolean query to do the filtering.
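For completeness, the boolean query route would look roughly like this (Java Lucene API; userQuery and dateRangeQuery are illustrative names):

BooleanQuery combined = new BooleanQuery();
combined.add(userQuery, BooleanClause.Occur.MUST);      // the original query
combined.add(dateRangeQuery, BooleanClause.Occur.MUST); // the date restriction
TopDocs hits = searcher.search(combined, 100);

Note that unlike a Filter, the extra MUST clause takes part in scoring.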


Why do functions on columns prevent the use of indexes?

On this question that I asked the other day, I got the following comment:
In almost any database, almost any function on a column prevents the use of indexes. There are exceptions here and there, but in general, functions prevent the use of indexes
I googled around and found more mentions of this same behavior, but I had trouble finding something more in depth than what the comment already told me.
Could someone elaborate on why this occurs, and perhaps strategies for avoiding it?
An index in its most basic form is just the sorted column data, making it easy to look up by some value. For example, a textbook can have the pages in some order, but then have an index in the back for all the terms. As you can see, the data is precomputed/sorted and stored in a separate area.
When you apply a function to the column and try to match/filter based on the output, the index is no longer useful. Let's take a look at our book example again, and say that the function we're applying is the reverse of the term (so reverse('integral') becomes 'largetni'). You won't find this value in the index, so you have to take all the terms, put them through the function, and only then compare, all at query time. Originally we could narrow the search to i, then in, then int and so on, making it easy to find the term; the function takes that away and makes everything much slower.
If you query using this function often, you could build an index on reverse(term) ahead of time to speed up lookups. But without doing so explicitly, it will always be slow.
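For instance, in MySQL 5.7+ you could do this with a stored generated column (a hypothetical sketch; the table and column names are made up):

-- precompute the reversed term so an ordinary index can serve the lookup
ALTER TABLE terms ADD reversed_term VARCHAR(100) AS (REVERSE(term)) STORED;
CREATE INDEX idx_reversed_term ON terms (reversed_term);

-- this query can now use idx_reversed_term instead of scanning every row
SELECT * FROM terms WHERE reversed_term = 'largetni';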
The indexes are stored separately from the data itself on the SQL server. So when you run a query, the B-tree index that ought to be used to provide the speed can no longer be referenced, because there is an operation (the function) on the column, and the query optimiser will opt not to use the index any more.
Here is a good explanation of why this occurs (it is a SQL Server-specific article, but it probably applies to other RDBMSs):
https://www.mssqltips.com/sqlservertip/1236/avoid-sql-server-functions-in-the-where-clause-for-performance/
The line from the article that really stands out is: "The reason for this is that the function value has to be evaluated for each row of data to determine if it matches your criteria."
Let's consider an extreme example. Let's say that you're looking up a row using a cryptographic hash function, like HASH(email_address) = 0x123456. The database has an index built on email_address, but now you're asking it to look up data on HASH(email_address) which it doesn't have. It could still use the index, but it would end up having to look at every single index entry for email_address and see if HASH(email_address) matches. If it's going to have to scan the full index, it may as well just scan the full table instead so it doesn't have to bounce back and forth fetching individual row locations.
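To make the contrast concrete (HASH stands in for any hash function, as in the example; the table and column are made up):

-- the index on email_address cannot be sought on a hashed value: full scan
SELECT * FROM users WHERE HASH(email_address) = 0x123456;

-- a bare column comparison can use the index directly: index seek
SELECT * FROM users WHERE email_address = 'someone@example.com';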

Storing search results for future use

I have the following scenario: a search returns a list of userid values (1, 2, 3, 4, 5, 6... etc.). If the search were run again, the results are guaranteed to change after some time. However, I need to store a snapshot of the search results to be used in the future.
We have a current implementation (legacy), which creates a record for the search_id with the criteria and inserts every row returned into a different table with the associated search_id.
table search_results
search_id unsigned int FK, PK (clustered index)
user_id unsigned int FK
This is an unacceptable approach, as this table has grown to millions of records. I've considered partitioning the table, but that would leave me with numerous partitions (in the thousands).
I've already optimized the existing tables so that search results expire unless they're used elsewhere; as a result, every stored search result is referenced somewhere.
In the current schema, I cannot store the results as serialized arrays or XML. I am looking to store the search result information efficiently, such that it can be accessed later without being burdened by the number of records.
EDIT: Thank you for the answers. I don't have any problems running the searches themselves, but the result set for a search gets used in this case for recipient lists, which will be used over and over again; the purpose of storing is exactly to have a snapshot of the data at a given time.
The answer is don't store query results. It's a terrible idea!
It introduces statefulness, which is very bad unless you really (really really) need it
It isn't scalable (as you're finding out)
The data is stale as soon as it's stored
The correct approach is to fix your query/database so it runs acceptably quickly.
If you can't make the queries faster using better SQL and/or indexes etc., I recommend using Lucene (or any text-based search engine) and denormalizing your database into it. Lucene queries are incredibly fast.
I recently did exactly this on a large web site that was doing what you're doing: it was caching query results from the production relational database in the session object in an attempt to speed up queries, but it was a mess, and wasn't much faster anyway.
I put in Solr (a search server built on top of Lucene) and kept it up to date with the relational database (using work queues), and the web queries are now just a few milliseconds.
Is there a reason why you need to store every search? Surely you would want the most up-to-date information available for the user?
I'll admit first, this isn't a great solution.
Set up another database alongside your current one [SYS_Searches]
The save script could use SELECT INTO [SYS_Searches].Results_{Search_ID}
The script that retrieves can do a simple SELECT out of the matching table.
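A rough T-SQL sketch of both scripts (the database, table, and column names are assumptions):

-- save: materialise one table per search_id in the side database
SELECT user_id
INTO [SYS_Searches].dbo.[Results_12345]
FROM dbo.users
WHERE /* original search criteria */ 1 = 1;

-- retrieve: a trivial read of a small, dedicated table
SELECT user_id FROM [SYS_Searches].dbo.[Results_12345];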
Benefits:
Every search is neatly packed into its own table [preferably in another DB]
The retrieval query is very simple
The retrieval time should be very quick, no massive table scans.
Drawbacks:
You will end up with one table for each of the x users * y searches a user can store.
This could get very silly very quickly unless there is management involved to expire results or the user can only have 1 cached search result set.
Not pretty, but I can't think of another way.

Determining search results quality in Lucene

I have been searching about score normalization for a few days (now I know this can't be done) in Lucene, using the mailing list, wiki, blog posts, etc. I'm going to lay out my problem, because I'm not sure that score normalization is what our project needs.
Background:
In our project, we are using Solr on top of Lucene with custom RequestHandlers and SearchComponents. For a given query, we need to detect when it gets poor results, in order to trigger different actions.
Assumptions:
Immutable index (once indexed, it is not updated) and same query typology (dismax qparser with the same field boosting, without boost functions or boost queries).
Problem:
We know that score normalization is not implementable. But is there any way to determine (given the TF/IDF and field-boosting assumptions) when search result match quality is poor?
Example: We've got one index with science papers and another with medcare centres' info. When a user queries the first index and gets poor results (inferring this from the score?), we want to query the second index and merge the results using some threshold (a score threshold?).
Thanks in advance
You're right that normalization of scores across different queries doesn't make sense, because nearly all similarity measures are based on term frequency, which is of course local to a query.
However, I think it is viable to compare the scores in the very special case that you are describing, if only you override the default similarity to use an IDF calculated jointly for both indexes. For instance, you could achieve this easily by keeping all the documents in one index and adding an extra (and hidden from users) 'type' field. Then you could compare the absolute values returned by these queries.
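A sketch of that single-index approach (Java Lucene API; the field and variable names are assumptions):

// index time: tag every document with its source collection
Document doc = new Document();
doc.add(new Field("type", "paper", Field.Store.NO, Field.Index.NOT_ANALYZED));

// query time: restrict to one source with a filter; IDF is still computed
// over the whole joint index, so scores stay comparable across sources
Filter papersOnly = new QueryWrapperFilter(new TermQuery(new Term("type", "paper")));
TopDocs paperHits = searcher.search(userQuery, papersOnly, 10);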
Generally, it could be possible to detect low quality results by looking at some features, for example a very small number of results or an odd distribution of scores, but I don't think that actually solves your problem. It looks more like the issue of merging isolated search results, which is discussed for example in this paper.

Lucene.Net memory consumption and slow search when too many clauses used

I have a DB holding text file attributes and text file primary key IDs, and I have indexed around 1 million text files along with their IDs (primary keys in the DB).
Now, I am searching at two levels.
First is a straightforward DB search, where I get primary keys as the result (roughly 2 or 3 million IDs).
Then I make a Boolean query, for instance as follows:
+Text:"test*" +(pkID:1 pkID:4 pkID:100 pkID:115 pkID:1041 .... )
and search it in my Index file.
The problem is that such a query (having 2 million clauses) takes far too much time to return results and consumes far too much memory...
Is there any optimization solution for this problem ?
Assuming you can reuse the dbid part of your queries:
Split the query into two parts: one part (the text query) will become the query and the other part (the pkID query) will become the filter
Make both parts into queries
Convert the pkid query to a filter (by using QueryWrapperFilter)
Convert the filter into a cached filter (using CachingWrapperFilter)
Hang onto the filter, perhaps via some kind of dictionary
Next time you do a search, use the overload that allows you to use a query and filter
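Roughly, in code (Java Lucene API shown; Lucene.Net uses the same class names, and all names here are illustrative):

// part 1: the text query stays a query
Query textQuery = new PrefixQuery(new Term("Text", "test"));

// part 2: the pkID query becomes a cached filter
BooleanQuery pkidQuery = new BooleanQuery();
for (int id : dbIds)
    pkidQuery.add(new TermQuery(new Term("pkID", Integer.toString(id))), BooleanClause.Occur.SHOULD);
// (BooleanQuery.setMaxClauseCount may need raising for very large id sets)

Filter pkidFilter = new CachingWrapperFilter(new QueryWrapperFilter(pkidQuery));
filterCache.put(dbQueryKey, pkidFilter); // e.g. a Map<String, Filter> you hang onto

// subsequent searches reuse the cached bit set
TopDocs hits = searcher.search(textQuery, pkidFilter, 100);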
As long as the pkid search can be reused, you should see quite a large improvement. As long as you don't optimise your index, the effect of caching should even work through commit points (I understand the bit sets are calculated on a per-segment basis).
HTH
p.s.
I think it would be remiss of me not to note that I think you're putting your index through all sorts of abuse by using it like this!
The best optimization is NOT to use the query with 2 million clauses. Any Lucene query with 2 million clauses will run slowly no matter how you optimize it.
In your particular case, I think it will be much more practical to search your index first with +Text:"test*" query and then limit the results by running a DB query on Lucene hits.
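Something like this, illustratively:

// search Lucene first, then push the surviving IDs back into the DB
TopDocs hits = searcher.search(new PrefixQuery(new Term("Text", "test")), 1000);
// ...collect the stored pkID field of each hit, then filter in the database:
// SELECT ... FROM files WHERE pkID IN (<hit ids>) AND <original DB criteria>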

Using IN or a text search

I want to search a table to find all rows where one particular field is one of two values. I know exactly what the values would be, but I'm wondering which is the most efficient way to search for them:
For the sake of example, the two values are "xpoints" and "ypoints". I know for certain that there will be no other values in that field which end in "points", so the two queries I'm considering are:
WHERE `myField` IN ('xpoints', 'ypoints')
--- or...
WHERE `myField` LIKE '_points'
which would give the best results in this case?
As always with SQL queries, run it through the profiler to find out. However, my gut instinct would have to say that the IN search would be quicker. Especially in the example you gave: if the field was indexed, it would only have to do 2 lookups. If you did a LIKE search, it may have to do a scan, because you are looking for records that end with a certain value. IN would also be more accurate, as LIKE '_points' could also return 'gpoints' or any other similar string.
Unless all of the data items in the column in question start with 'x' or 'y', I believe IN will always give you a better query. If it is indexed, as @Kibbee points out, you will only have to perform 2 lookups to get both. Alternatively, if it is not indexed, a table scan using IN will only have to check the first letter most of the time, whereas with LIKE it will have to check two characters every time (assuming all items are at least 2 characters), since the first character is allowed to be anything.
Try it and see. Create a large amount of test data, and try it with and without an index on myfield. While you are at it, see if there's a noticeable difference between
LIKE '_points' and LIKE 'xpoint_'.
It depends on what the optimizer does with each query.
For small amounts of data, the difference will be negligible. Do whichever one makes more sense. For large amounts of data the amount of disk I/O matters much more than the amount of CPU time.
I'm betting that IN will get you better results than LIKE, if there is an index on myfield. I'm also betting that 'xpoint_' runs faster than '_points'. But there's nothing like trying it yourself.
MySQL can't use an index when using string comparisons such as LIKE '%foo' or '_foo', but can use an index for comparisons like 'foo%' and 'foo_'.
So in your case, IN will be much faster assuming that the field is indexed.
If you're working with a limited set of possible values, it's worth specifying the field as an ENUM - MySQL will then store it internally as an integer and make this sort of lookup much faster, and save disk space.
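For example (hypothetical MySQL; this assumes only the two values ever occur):

ALTER TABLE myTable MODIFY myField ENUM('xpoints', 'ypoints') NOT NULL;
-- stored internally as a 1-byte integer, so equality and IN lookups get cheaper
SELECT * FROM myTable WHERE myField IN ('xpoints', 'ypoints');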
It will be faster to do the IN version than the LIKE version, especially when your wildcard isn't at the end of the comparison. Even under ideal conditions, IN would still come out ahead until your IN list nears your maximum query size.