Sort by date in Solr/Lucene performance problems

We have set up a Solr index containing 36 million documents (~1K-2K each) and we query for a maximum of 100 documents matching a single simple keyword. This works as fast as we had hoped.
However, if we now add "&sort=createDate+desc" to the query (thus asking for the 100 newest documents matching the query), it runs for a very, very long time and finally ends in an OutOfMemoryException.
From what I've understood from the manual this is caused by the fact that Lucene needs to load all the distinct values for this field (createDate) into memory (the FieldCache afaik) before it can execute the query. As the createDate field contains date and time the number of distinct values is pretty large.
Also important to mention is that we frequently update the index.
Perhaps someone can provide some insights and directions on how we can tune Lucene / Solr or change our approach in such a way that query times become acceptable?
Your input will be much appreciated! Thanks.

The problem is that Lucene stores numbers as strings. There are some utilities that split the date into YYYY, MM and DD and put them in different fields. That gives much better results.
Newer versions of Lucene (2.9 onwards) support numeric fields, and the performance improvements are significant (a couple of orders of magnitude, IIRC). Check this article about numeric queries.
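For illustration, a minimal sketch of the numeric API (Lucene 2.9/3.x style; the field name createDate and the millisecond variables are placeholders, and later Lucene versions replaced NumericField with LongField/LongPoint):
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Index the date as a long (epoch milliseconds) using the trie-encoded numeric field.
Document doc = new Document();
doc.add(new NumericField("createDate").setLongValue(createDate.getTime()));

// Range filters and sorting then work on the numeric representation instead of strings.
NumericRangeQuery<Long> lastWeek =
    NumericRangeQuery.newLongRange("createDate", weekAgoMillis, nowMillis, true, true);
Sort newestFirst = new Sort(new SortField("createDate", SortField.LONG, true));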

You can sort the results by index order instead.
The sort specification for descending by document number is:
new SortField(null, SortField.DOC, true)
You should also partition the index directories by the date field.
All matching documents are examined by Lucene when collecting the top N results.
The partitioning will split the examined set: you don't need to examine the older partitions if you already have N results in the newest partition.
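A rough sketch of both ideas together (Lucene 3.x-style API; the directory names and the variable query, which stands for the user's keyword query, are assumptions):
import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// Sort by descending document number (newest first if documents are indexed in date order)
// and walk date-partitioned index directories newest-first, stopping once 100 hits are found.
Sort byDocDesc = new Sort(new SortField(null, SortField.DOC, true));
int needed = 100;
for (String dir : new String[] {"index-2012-03", "index-2012-02", "index-2012-01"}) {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs top = searcher.search(query, needed, byDocDesc);
    // collect top.scoreDocs here (doc ids are relative to this partition's reader)
    needed -= top.scoreDocs.length;
    searcher.close();
    reader.close();
    if (needed <= 0) break;
}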

Try converting your Date-type data into a String type (such as milliseconds since the epoch).

Related

Limitations in using all string columns in BigQuery

I have an input table in BigQuery that has all fields stored as strings. For example, the table looks like this:
name dob age info
"tom" "11/27/2000" "45" "['one', 'two']"
And in the query, I'm currently doing the following
WITH
  table AS (
    SELECT
      "tom" AS name,
      "11/27/2000" AS dob,
      "45" AS age,
      "['one', 'two']" AS info )
SELECT
  EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob)) birth_year,
  ANY_VALUE(PARSE_DATE('%m/%d/%Y', dob)) bod,
  ANY_VALUE(name) example_name,
  ANY_VALUE(SAFE_CAST(age AS INT64)) AS age
FROM
  table
GROUP BY
  EXTRACT(YEAR FROM PARSE_DATE('%m/%d/%Y', dob))
Additionally, I tried doing a very basic GROUP BY operation casting an item to a string vs. not, and I didn't see any performance degradation on a data set of ~1M rows (actually, in this particular case, casting to a string was faster).
Other than it being bad practice to "keep" this all-string table and not convert it into its proper type, what are some of the limitations (either functional or performance-wise) that I would encounter by keeping a table all-string instead of storing it as their proper type. I know there would be a slight increase in size due to storing strings instead of number/date/bool/etc., but what would be the major limitations or performance hits I'd run into if I kept it this way?
Off the top of my head, the only limitations I see are:
Queries would become more complex (though wouldn't really matter if using a query-builder).
A bit more difficult to extract non-string items from array fields.
Inserting data becomes a bit trickier (for example, need to keep track of what the date format is).
But these all seem like very small items that can be worked around. Are there other, "bigger" reasons why using all string fields would be a huge limitation, either in limiting query-ability or in causing a huge performance hit in various cases?
First of all, I don't really see any bigger show-stoppers than the ones you already know and have listed.
Meanwhile,
though wouldn't really matter if using a query-builder ...
Based on the above excerpt, I wanted to touch on one aspect of this approach (storing everything as strings).
While we are usually concerned with CASTing from string to a native type in order to apply the relevant functions and so on, I have found that building a complex, generic query with some sort of query builder sometimes requires the opposite: casting a native type to string, for example to apply a function like STRING_AGG.
So, my thoughts are:
When the table is designed for direct user access with trivial or even complex queries, having native types is beneficial, both performance-wise and in being easier for users to understand.
Meanwhile, if you are developing your own query builder and you design the table so that users will query it via that query builder with some generic logic, having all fields as strings can be helpful in building the query builder itself.
So it is a balance: you can lose a little in performance, but you gain the ability to implement a generic query builder more easily. Where that balance lies depends on the nature of your business, both from a data perspective and in terms of the kinds of queries you envision supporting.
Note: your question is quite broad and opinion-based (which, by the way, is not much appreciated on SO), so obviously my answer is entirely my opinion, although it is based on quite a lot of experience with BigQuery.
Are you OK to store string "33/02/2000" as a date in one row and "21st of December 2012" in another row and "22ое октября 2013" in another row?
Are you OK to store string "45" as age in one row and "young" in another row?
Are you OK when age "10" is less than age "9"?
Data types provide some basic data validation mechanism at the database level.
Do BigQuery databases have a notion of indexes?
If yes, then most likely these indexes become useless as soon as you start casting your strings to proper types, such as
SELECT
...
WHERE
age > 10 and age < 30
vs
SELECT
...
WHERE
SAFE_CAST(age AS INT64) > 10
and SAFE_CAST(age AS INT64) < 30
It is normal that with fewer columns/rows you don't feel the problems. You start to feel them when your data gets huge.
Major concerns:
Maintenance of the code: Think of future requirements you may receive. Every conversion for data manipulation will add extra complexity to your code. For example, if your customer asks for retrieving teenagers in the future, you'll need to convert the string to a date to get the age before you can do the manipulation.
Data size: The data size has broader impacts that cannot be seen at the start. For example, if you have N parallel test teams which require their own test systems, you'll need to allocate more disk space.
Read performance: When you have more bytes to read in huge tables, it will cost you considerable time. For example, telco operators typically have a couple of billion rows of data per month.
If your code complexity increases, you'll need to replicate conversions in multiple places.
Even a single one of the above items should be enough to push you away from using strings for everything.
I would think the biggest issue with this would be if there are other users of this table/data. For instance, if someone is trying to write reports with it and do calculations, charts or date ranges, it could be a big headache to always have to cast or convert the data with whatever tool they are using. You or someone else would likely get a lot of complaints about it.
And if someone decided to build a layer between this data and the reporting tool which converted all of the data, then you may as well just do it one time to the table/data and be done with it.
With this approach you might face some storage and performance problems; you can find some guidance in the official documentation:
The main performance problem will come from the CAST operation; remember that the BigQuery engine will have to deal with a CAST operation for each value in each row.
In order to test the compute cost of this operation, I used the following query:
SELECT
street_number
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
Inspecting the stages executed in the execution details we are able to see the following:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
WRITE
$1
TO __stage00_output
Only the Read, Limit and Write operations are required. However, if we execute the same query adding the CAST operator:
SELECT
CAST(street_number AS int64)
FROM
`bigquery-public-data.austin_311.311_service_requests`
LIMIT
5000
We see that a compute operation is also required in order to perform the cast operation:
READ
$1:street_number
FROM bigquery-public-data.austin_311.311_service_requests
LIMIT
5000
COMPUTE
$10 := CAST($1 AS INT64)
WRITE
$10
TO __stage00_output
Those compute operations will consume some time, which might cause problems as the operation size scales up.
Also, remember that each time you want to use the properties of a data type, you will have to cast your value and deal with the compute time required.
Finally, regarding storage performance: as you mentioned, strings do not have a fixed size, and that might cause a size increase.

Cloud Datastore avoid exploding indexes on very simple table

I'm trying to use Google Cloud Datastore to store METAR observations (airport weather observations) but I am experiencing what I think is exploding indexes. My index for station_id (which is a 4 character string) is 20 times larger than the actual data itself. The database will increase by roughly 250 000 entities per day, so index size will become an issue.
Table
- observation_time (Date / Time) - indexed
- raw_text (String) (which is ~200 characters) - unindexed
- station_id (String) (which is always 4 characters) - indexed
Composite index:
- station_id (ASC), observation_time (ASC)
Query
The only query I will ever run is:
query.add_filter('station_id', '=', station_icao)
query.add_filter('observation_time', '>=', before)
query.add_filter('observation_time', '<=', after)
where before and after are datetime values
Index sizes
name type count size index size
observation_time Date/Time 1,096,184 26.14MB 313.62MB
station_id String 1,096,184 16.73MB 294.8MB
Datastore reports:
Resource Count Size
Entities 1,096,184 244.62MB
Built-in-indexes 5,488,986 740.63MB
Composite indexes 1,096,184 137.99MB
Help
I guess my first question is: what am I missing? I assume I'm doing something unoptimized, but I can't figure out what. Query time is not an immediate issue here, as long as lookups stay below ~2 s.
Can I simply remove the built-in indexes? Will the composite index continue to work?
I've read up on Google and StackOverflow but can't seem to wrap my head around this. The reason I don't simply try removing all the built-in indexes is that it takes quite some time to download/un-index/re-put all the data, and afterwards I need to wait 48 hours for the dashboard summary to update, i.e. it will take me days before I get a result.
As +Jeffrey Rennie pointed out, "Exploding Indexes" is a very specific term that does not apply here.
You can see how storage size is calculated in our documentation here, so you can apply it to your example to see where the size adds up.
TL;DR: You can save space by using slightly more concise (but still readable!) property names. For example, observation_time to observation, etc.
Key things to keep in mind:
To have a composite index, you need to have the individual properties indexed, so don't remove the built-ins or it'll stop working
Built-ins are indexed twice - once for ascending and once for descending
Kind names and property names are strings used in the index for each entity, so the longer they are the bigger the indexes
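As a purely hypothetical illustration of that last point, a sketch with the Java Datastore client (the question uses the Python client, but the idea is the same); the kind and property names (Metar, station, obs, raw) are assumed shorter replacements, and the raw text stays out of the indexes:
import com.google.cloud.Timestamp;
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Entity;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.StringValue;

// Shorter kind/property names shrink every index entry that stores them.
Datastore datastore = DatastoreOptions.getDefaultInstance().getService();
Key key = datastore.allocateId(datastore.newKeyFactory().setKind("Metar").newKey());
Entity metar = Entity.newBuilder(key)
    .set("station", "ENGM")
    .set("obs", Timestamp.now())
    .set("raw", StringValue.newBuilder("ENGM 221450Z 24008KT ...").setExcludeFromIndexes(true).build())
    .build();
datastore.put(metar);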

search lucene NumericField for maximum value

I know there is a NumericRangeQuery in Lucene, but is it possible to have Lucene simply return the maximum value stored in a NumericField? I could use a RangeQuery over the entire known range and then sort, but this is extremely cumbersome and it may return a huge number of results if there are a lot of records.
The second parameter of IndexSearcher.search(Query query, int n, Sort sort) allows you to specify the number of top hits (in your case 1), which, if you sort correctly, returns only the desired result. There are other overloaded methods that achieve the same.
Can't argue about the cumbersomeness though :)
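For what it's worth, a minimal sketch of that idea (Lucene 3.x-style API; the field name price is a placeholder, MatchAllDocsQuery stands in for your query, and an open IndexSearcher named searcher is assumed):
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Ask for a single top hit, sorted descending on the numeric field: that hit holds the maximum.
Sort byValueDesc = new Sort(new SortField("price", SortField.LONG, true));
TopDocs top = searcher.search(new MatchAllDocsQuery(), 1, byValueDesc);
if (top.totalHits > 0) {
    Document docWithMax = searcher.doc(top.scoreDocs[0].doc);
}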
You could use a TermEnum to walk through your index. Unfortunately I don't think the terms are sorted in a way which makes finding the maximum instantaneous, but at least you won't have to do an actual search to find it. You will need to use NumericUtils to convert from Lucene's internal representation to a normal number.
This thread contains an example.

Suggestions/Opinions for implementing a fast and efficient way to search a list of items in a very large dataset

Please comment and critique the approach.
Scenario: I have a large dataset (200 million entries) in a flat file. The data is of the form: a 10-digit phone number followed by 5-6 binary fields.
Every week I will be getting a delta file which will only contain changes to the data.
Problem: Given a list of items, I need to figure out whether each item (which will be a 10-digit number) is present in the dataset.
The approach I have planned:
1. Parse the dataset and put it in a DB (to be done at the start of the week) such as MySQL or Postgres. The reason I want an RDBMS in the first step is that I want to keep the full time-series data.
2. Then generate some kind of key-value store out of this database with the latest valid data, supporting the operation of finding out whether each item is present in the dataset or not (I'm thinking of some kind of NoSQL DB here, like Redis, optimised for search; it should have persistence and be distributed). This data structure will be read-only.
3. Query this key-value store to find out whether each item is present (if possible, matching a list of values all at once instead of matching one item at a time). I want this to be blazing fast. This functionality will be used as the back-end of a REST API.
Side note: my language of preference is Python.
A few considerations for the fast lookup:
If you want to check a set of numbers at a time, you could use Redis's SINTER command, which performs a set intersection (sketched below, after these points).
You might benefit from using a grid structure by distributing number ranges over some hash function, such as the first digit of the phone number (there are probably better ones; you have to experiment). With an optimal hash, this would reduce the size per node to around 20 million entries when using 10 nodes.
If you expect duplicate requests, which is quite likely, you could cache the last n requested phone numbers in a smaller set and query that one first.
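Expanding on the SINTER point, a hedged sketch with the Jedis client (the question mentions Python; the same calls exist in redis-py). The key names phones:current and req:batch42 are assumptions, and loading the snapshot into the set is assumed to happen elsewhere:
import java.util.Set;
import redis.clients.jedis.Jedis;

// The current snapshot of all phone numbers lives in the set "phones:current".
// A batch of requested numbers is written to a temporary set and intersected with it.
Jedis jedis = new Jedis("localhost", 6379);
jedis.sadd("req:batch42", "5551234567", "5559876543", "5550000000");
Set<String> present = jedis.sinter("phones:current", "req:batch42");
jedis.del("req:batch42");
jedis.close();
// "present" now contains exactly those requested numbers that exist in the dataset.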

Optimizing Solr for Sorting

I'm using Solr for a realtime search index. My dataset is about 60M large documents. Instead of sorting by relevance, I need to sort by time. Currently I'm using the sort flag in the query to sort by time. This works fine for specific searches, but when searches return large numbers of results, Solr has to take all of the resulting documents and sort them by time before returning. This is slow, and there has to be a better way.
What is the better way?
I found the answer.
If you want to sort by time, and not relevance, use fq= instead of q= for all of your filters. This way, Solr doesn't waste time figuring out the weighted value of the documents matching q=. It turns out that Solr was spending too much time weighting, not sorting.
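As an illustration, a hedged SolrJ sketch of the fq approach (the field names, the URL, and the HttpSolrServer class from older SolrJ versions are assumptions; exception handling omitted):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

// The keyword match goes into a filter query, so no relevance scores are computed;
// the results are ordered purely by the time field.
HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
SolrQuery query = new SolrQuery("*:*");
query.addFilterQuery("body:keyword");
query.setSort("timestamp", SolrQuery.ORDER.desc);
query.setRows(100);
QueryResponse rsp = solr.query(query);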
Additionally, you can speed sorting up by pre-warming your sort fields in the newSearcher and firstSearcher event listeners in solrconfig.xml. This ensures that sorts are served from a warmed cache.
Obvious first question: what is the type of your time field? If it's a string, then sorting is obviously very slow. tdate is even faster than date.
Another point: do you have enough memory for Solr? If it starts swapping, then performance is immediately awful.
And a third one: if you have an older Lucene, then a date is just a string, which is very slow.
Warning: Wild suggestion, not based on prior experience or known facts. :)
Perform a query without sorting and rows=0 to get the number of matches. Disable faceting etc. to improve performance - we only need the total number of matches.
Based on the number of matches from Step #1, the distribution of your data and the count/offset of the results that you need, fire another query which sorts by date and also adds a filter on the date, like fq=date:[NOW-xDAY TO *], where x is the estimated time period in days during which we will find the required number of matching documents.
If the number of results from Step #2 is less than what you need, then relax the filter a bit and fire another query.
For starters, you can use the following to estimate x:
If you are uniformly adding n documents a day to the index of size N documents and a specific query matched d documents in Step #1, then to get the top r results you can use x = (N*r*1.2)/(d*n). If you have to relax your filter too often in Step #3, then slowly increase the value 1.2 in the formula as required.
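Putting the three steps together, a hedged SolrJ sketch; the field names (body, date), the URL, the assumed ingest rate n, and the doubling strategy in Step #3 are all placeholders, and exception handling is omitted:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/core1");
long N = 60000000L; // total documents in the index
long n = 100000L;   // documents added per day (assumed)
int r = 100;        // results needed

// Step 1: count the matches only (rows=0, no sorting or faceting requested).
SolrQuery count = new SolrQuery("body:keyword");
count.setRows(0);
long d = solr.query(count).getResults().getNumFound();

// Estimate the look-back window in days: x = (N * r * 1.2) / (d * n).
long x = Math.max(1, (long) Math.ceil((N * r * 1.2) / ((double) d * n)));

// Step 2: sorted query restricted to the estimated window.
SolrQuery top = new SolrQuery("body:keyword");
top.addFilterQuery("date:[NOW-" + x + "DAYS TO *]");
top.setSort("date", SolrQuery.ORDER.desc);
top.setRows(r);
QueryResponse rsp = solr.query(top);

// Step 3: if too few results came back, relax the filter (here: double x) and query again.
if (rsp.getResults().getNumFound() < r) {
    top.removeFilterQuery("date:[NOW-" + x + "DAYS TO *]");
    top.addFilterQuery("date:[NOW-" + (2 * x) + "DAYS TO *]");
    rsp = solr.query(top);
}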