How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?

Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size.
Note that this varies greatly depending on the Analyzer you use and on the field types you define.

The index stores each distinct term of a text field only once (plus a posting for each document that contains it), so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample, index it, and use that to extrapolate to the complete source collection. However, the ratio of index size to source size decreases as the collection grows, because most words in new documents are already in the index, so you might want to make the sample a decent percentage of the original.
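The sample-and-extrapolate approach can be sketched as a simple linear scaling. This is only a rough sketch: the function name and the numbers are hypothetical, and because the index-to-source ratio tends to shrink as the collection grows, a linear estimate from a small sample will likely overestimate the final size.

```python
def extrapolate_index_size(sample_docs, sample_index_bytes, total_docs):
    """Scale a measured sample index size linearly to the full collection."""
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    bytes_per_doc = sample_index_bytes / sample_docs
    return bytes_per_doc * total_docs


# Hypothetical measurement: indexing 10,000 sample documents produced a 48 MiB index.
estimate = extrapolate_index_size(10_000, 48 * 1024**2, 1_000_000)
print(f"Estimated full index size: {estimate / 1024**3:.2f} GiB")
```

Measuring the sample index is just a matter of summing the file sizes in the index directory after indexing the sample with the same Analyzer and field settings you plan to use for the full collection.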

I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?
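To get a first look at the term frequency of the source data, something like the following could be used. It is a sketch under a strong assumption: tokens are produced by a naive lowercase split on non-alphanumeric characters, which will not match what a real Lucene Analyzer emits (stemming, stop words, etc.). The function name is made up for illustration.

```python
import re
from collections import Counter


def term_statistics(texts):
    """Return (total_tokens, unique_terms, per-term counts) for an iterable of strings.

    Tokenization is a naive lowercase split on non-alphanumerics; a real
    Lucene Analyzer will generally produce different tokens.
    """
    counts = Counter()
    for text in texts:
        counts.update(t for t in re.split(r"[^0-9a-z]+", text.lower()) if t)
    return sum(counts.values()), len(counts), counts


total, unique, counts = term_statistics(
    ["the quick brown fox", "the lazy dog", "the fox"]
)
print(total, unique, counts.most_common(3))
```

A low ratio of unique terms to total tokens suggests a lot of repetition, which is the case where the index should stay comparatively small.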

Related

How does geohash index work in Lucene

In Lucene Spatial 4, I'm wondering how the geohash index works behind the scenes. I understand the concept of the geohash, which basically takes a point's two coordinates (lat, lon) and creates a single "string" hash.
Is the index just a "string" index (r-tree or quad-tree) or something along these lines (such as just indexing a last name), or is there something special about it?
For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does this get indexed as d, dr, drg, drgt, etc.?
Is there a default number of n-grams that we might want indexed?
With this type of indexing, will search queries over 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general, gradual degradation of the index as lots of records are added?
Thanks
The best online explanation is my video: Lucene / Solr 4 Spatial deep dive
> Is the index just a "string" index (r-tree or quad-tree) or something along these lines (such as just indexing a last name), or is there something special about it?
Lucene, fundamentally, has just one index used for text, numbers, and now spatial. You could say it's a string index. It's a sorted list of bytes/strings. From a higher level view, using spatial in this way is the family of "Tries" AKA "PrefixTrees" in computer science.
> For prefix-type searches, do all of the n-grams of the hash get indexed? For example, if a geohash is drgt2abc, does this get indexed as d, dr, drg, drgt, etc.?
Yes.
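As a small illustration of what "d, dr, drg, drgt, etc." means in practice, the set of prefix terms for one geohash can be generated like this. This is only a sketch of the idea, not Lucene's actual implementation; the function name and the max_length parameter are hypothetical.

```python
def geohash_prefixes(geohash, max_length=None):
    """Return every leading prefix of a geohash string, shortest first.

    max_length caps the longest prefix, standing in for a chosen precision.
    """
    if max_length is None:
        max_length = len(geohash)
    longest = min(max_length, len(geohash))
    return [geohash[:i] for i in range(1, longest + 1)]


print(geohash_prefixes("drgt2abc"))
print(geohash_prefixes("drgt2abc", max_length=4))
```

Each prefix names a progressively smaller grid cell, which is why a prefix match on the sorted term list behaves like walking down a trie.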
> Is there a default number of n-grams that we might want indexed?
You specify it conveniently in terms of your precision requirements, and it will look up how long the hash needs to be. Or you can specify the length directly.
> With this type of indexing, will search queries over 100 thousand records versus 100 million records have similar performance for spatial queries (such as box/polygon or distance), or can I expect a general, gradual degradation of the index as lots of records are added?
Indeed, this type of indexing (and more specifically the clever recursive search tree algorithm that uses it) means that you'll have scalable search performance. 100m is a ton of documents for one filter to match so it's of course going to be slower than one that matches only 100k docs, but it's definitely sub-linear. And by next year it'll be even faster, due to work happening this summer on a new PrefixTree encoding plus a spatial benchmark in progress which will allow me to make further tuning optimizations I have planned.

lucene estimate index size, search time

I am looking for a way to estimate indexing time, index size, and search time with the Lucene library.
I have some numbers for 500 files and I would like to estimate the values for 5000 documents.
I searched the web and didn't find any good way to estimate these numbers.
The answer depends hugely on what you put into the index. Obviously, if you store full field content, then you can expect at least linear growth, with the factor within an order of magnitude from 1. If you only index the terms, you will need much less space, but at the same time the estimate will get much more difficult. The number of unique index terms is a very important factor, for example. This will probably start levelling off at some number that depends highly on the details of your content. All in all, in such a case measurement is probably your only reliable method.
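If you do measure at more than one collection size, one rough way to capture the "levelling off" is to fit a power law through two measurements and extrapolate. This is a sketch under the assumption that the quantity grows as y = k * n**beta over the range of interest, which may well not hold for your content; the function names and the numbers below are hypothetical.

```python
import math


def fit_power_law(n1, y1, n2, y2):
    """Fit y = k * n**beta through two measurements (n1, y1) and (n2, y2)."""
    beta = math.log(y2 / y1) / math.log(n2 / n1)
    k = y1 / n1 ** beta
    return k, beta


def predict(k, beta, n):
    """Evaluate the fitted power law at collection size n."""
    return k * n ** beta


# Hypothetical measurements: 250 files -> 12 MB of index, 500 files -> 20 MB.
k, beta = fit_power_law(250, 12.0, 500, 20.0)
print(f"beta = {beta:.2f}, estimate for 5000 files = {predict(k, beta, 5000):.1f} MB")
```

A beta close to 1 means roughly linear growth (typical when full field content is stored); a beta well below 1 means the term dictionary is saturating. Extrapolating tenfold from two points is crude, so more measurements at intermediate sizes would make the estimate more trustworthy.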

SOLR index size reduction

We have some massive Solr indices for a large project, and they are consuming more than 50 GB of space.
We have considered several ways to reduce the size that involve changing the content of the indices, but I am curious whether there are any changes we can make to a Solr index that would reduce its size by 2 orders of magnitude or more, and that are directly related to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set right.
Another relevant question is (3): is there a way to trade index size for performance inside Solr, and if so, how would it work?
Any thoughts on this would be appreciated. Thanks!
There are a couple things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint), but range queries will be slower when using an int.
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and query your database for the necessary data once you've got the results back from Solr.
Add omitNorms="true" to text fields that don't need length normalization
Add omitPositions="true" to text fields that don't require phrase matching
Special fields, like NGrams, can take up a lot of space
Are you removing stop words from text fields?

How to know the space occupied by each field in Solr

Our Solr index is about 30 GB and we want to do some optimization to reduce its size.
How can we find out the space occupied by each field?
Please check http://code.google.com/p/luke/
Luke can help you analyze your index, with its fields and terms, and probably the percentage of the index size taken by each field.
Some other questions to consider:
1. Is your index optimized?
2. Revisit your configuration for indexed fields, stored fields, multiple copies of the same fields, term vectors, and norms.
3. Take a look at the terms in your index and identify stopwords or common words.

Lucene Indexing

I would like to use Lucene for indexing a table in an existing database. I have been thinking the process is like:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of the size and if a hit is found for a search, fetch the appropriate Field from Document
Don't store the Field and if a hit is found for a search, query the database to get the relevant information out
I realize there may not be a one size fits all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger. And probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem with a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put considerable pressure on the GC;
stored fields are likely to impact reader reopen time.
The final answer, of course, is that it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can decide whether it actually needs to be written to the inverted index. If not, you can use Field.Index.NO to avoid writing too much data to the inverted index.
Similarly, you can decide whether a column will need to be retrieved by key-value lookup. If not, you needn't use Field.Store.YES to store the data.