Our Solr index is about 30 GB and we want to do some optimization to reduce its size.
How can we find out how much space each field occupies?
Please check http://code.google.com/p/luke/
Luke will help you analyze your index, showing the fields and terms and giving a rough idea of the percentage of the index each field occupies.
Also, a few other things to check:
1. Is your index optimized?
2. Revisit your configuration for indexed fields, stored fields, multiple copies of the same field, term vectors, and norms.
3. Take a look at the terms in your index and identify stopwords or other common words (a sketch for pulling per-field term statistics follows this list).
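If the Luke GUI isn't an option, you can pull rough per-field term statistics programmatically. This won't give you exact bytes per field, but fields with huge numbers of unique terms are usually the big consumers. A minimal sketch, assuming the Lucene 4.x API and a placeholder index path:

```java
import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class FieldTermStats {
    public static void main(String[] args) throws Exception {
        // Point this at your index, e.g. a Solr core's data/index directory (placeholder path).
        Directory dir = FSDirectory.open(new File("/path/to/index"));
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            Fields fields = MultiFields.getFields(reader);
            if (fields == null) return;
            for (String field : fields) {
                Terms terms = fields.terms(field);
                if (terms == null) continue;
                // size() and getSumTotalTermFreq() may return -1 if the codec doesn't track them.
                System.out.printf("field=%s uniqueTerms=%d totalTermFreq=%d docsWithField=%d%n",
                        field, terms.size(), terms.getSumTotalTermFreq(), terms.getDocCount());
            }
        } finally {
            reader.close();
        }
    }
}
```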
In Lucene spatial 4, I'm wondering how the geohash index works behind the scenes. I understand the concept of the geohash, which basically takes a coordinate pair (lat, lon) and creates a single "string" hash.
Is the index just a "string" index (treated like indexing a last name, say), or is it a spatial structure such as an R-tree or quad-tree, or is there something special about it?
For prefix-type searches, do all of the prefixes (n-grams) of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?
Is there a default number of n-grams that we might want indexed?
With this type of indexing, will spatial queries (such as box/polygon or distance) against 100 thousand records perform similarly to queries against 100 million records, or can I expect a general, gradual degradation of the index as lots of records are added?
Thanks
The best online explanation is my video: Lucene / Solr 4 Spatial deep dive
Is the index just a "string" index (treated like indexing a last name, say), or is it a spatial structure such as an R-tree or quad-tree, or is there something special about it?
Lucene, fundamentally, has just one kind of index, used for text, numbers, and now spatial data. You could say it's a string index: a sorted list of bytes/strings. From a higher-level view, using it for spatial this way puts it in the family of "tries", a.k.a. "prefix trees", in computer science.
For prefix-type searches, do all of the prefixes (n-grams) of the hash get indexed? For example, if a geohash is drgt2abc, does it get indexed as d, dr, drg, drgt, etc.?
Yes.
Is there a default number of n-grams that we might want indexed?
You conveniently tell it the precision you need and it will look up how many geohash levels that requires. Alternatively, you can specify the length directly.
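To make both options concrete, here is a minimal sketch of the setup in Lucene spatial 4. Treat it as an assumption-laden sketch: the field name "location", the coordinates, and the exact package names (the 4.x line, with spatial4j) are placeholders/assumptions rather than the canonical configuration.

```java
import com.spatial4j.core.context.SpatialContext;
import com.spatial4j.core.shape.Point;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.spatial.prefix.RecursivePrefixTreeStrategy;
import org.apache.lucene.spatial.prefix.tree.GeohashPrefixTree;

public class GeohashIndexingSketch {
    public static void main(String[] args) {
        SpatialContext ctx = SpatialContext.GEO;

        // Either pick the number of geohash levels (prefix lengths) directly, or derive it
        // from a distance tolerance via SpatialPrefixTree.getLevelForDistance(...).
        int maxLevels = 11; // fewer levels -> fewer indexed terms per shape, less precision
        GeohashPrefixTree grid = new GeohashPrefixTree(ctx, maxLevels);

        RecursivePrefixTreeStrategy strategy = new RecursivePrefixTreeStrategy(grid, "location");
        // Precision expressed as a fraction of a shape's size (0.025 is the usual default).
        strategy.setDistErrPct(0.025);

        // Indexing a point produces one term per prefix level: d, dr, drg, drgt, ...
        Point pt = ctx.makePoint(-73.98, 40.75); // x = longitude, y = latitude
        Document doc = new Document();
        for (Field f : strategy.createIndexableFields(pt)) {
            doc.add(f);
        }
    }
}
```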
With this type of indexing, will spatial queries (such as box/polygon or distance) against 100 thousand records perform similarly to queries against 100 million records, or can I expect a general, gradual degradation of the index as lots of records are added?
Indeed, this type of indexing (and, more specifically, the clever recursive search-tree algorithm that uses it) means you'll get scalable search performance. 100 million is a ton of documents for one filter to match, so it will of course be slower than a filter that matches only 100 thousand docs, but the relationship is definitely sub-linear. And by next year it'll be even faster, thanks to work happening this summer on a new PrefixTree encoding, plus a spatial benchmark in progress that will let me make further tuning optimizations I have planned.
I'm looking for a way to estimate indexing time, index size, and search time with the Lucene library.
I have some numbers for 500 files and I would like to estimate the values for 5000 documents.
I've searched the web and haven't found any good way to estimate these numbers.
The answer depends hugely on what you put into the index. Obviously, if you store the full field content, you can expect at least linear growth, with a factor within an order of magnitude of 1. If you only index the terms, you will need much less space, but the estimate also becomes much more difficult. The number of unique index terms is a very important factor, for example, and it will probably start levelling off at some value that depends heavily on the details of your content. All in all, in such a case measurement is probably your only reliable method.
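Since you already have numbers for 500 files, the practical approach is to index that sample and measure, then extrapolate cautiously. A rough sketch, assuming the Lucene 4.x API; the directory path and the 10x factor are placeholders:

```java
import java.io.File;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexSizeEstimate {
    public static void main(String[] args) throws Exception {
        // Measure the on-disk size of the sample index (e.g. built from the 500 files).
        Directory dir = FSDirectory.open(new File("/path/to/sample-index"));
        long bytes = 0;
        for (String name : dir.listAll()) {
            bytes += dir.fileLength(name);
        }
        System.out.printf("sample index: %.1f MB%n", bytes / (1024.0 * 1024.0));

        // Naive linear extrapolation: 10x the documents -> at most roughly 10x the size.
        // The term dictionary tends to grow sub-linearly, so treat this as an upper bound
        // and re-measure with a larger sample if you can.
        System.out.printf("estimate for 10x the documents: <= %.1f MB%n",
                10 * bytes / (1024.0 * 1024.0));
    }
}
```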
We have some massive Solr indices for a large project, and they consume over 50 GB of space.
We have considered several ways to reduce the size by changing the content in the indices, but I am curious whether there are any changes we can make to a Solr index that would reduce its size by two orders of magnitude or more, related to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set right.
Another relevant question: (3) Is there a way to trade index size for performance inside Solr, and if so, how would it work?
Any thoughts on this would be appreciated... Thanks!
There are a couple of things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint), but range queries will be slower when using an int.
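In Lucene terms (which is what Solr's int and tint field types boil down to), the knob is the numeric precisionStep: a small step indexes extra terms per value to speed up range queries, while a huge step indexes roughly one term per value and keeps the index smaller. A rough sketch, assuming the Lucene 4.x API; the field names are made up:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.IntField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class PrecisionStepSketch {
    public static void main(String[] args) {
        // "tint"-style: small precisionStep -> more terms per value, faster range queries, bigger index.
        FieldType trieType = new FieldType();
        trieType.setIndexed(true);
        trieType.setNumericType(FieldType.NumericType.INT);
        trieType.setNumericPrecisionStep(8);
        trieType.freeze();

        // "int"-style: effectively one term per value -> smaller index, slower range queries.
        FieldType plainType = new FieldType();
        plainType.setIndexed(true);
        plainType.setNumericType(FieldType.NumericType.INT);
        plainType.setNumericPrecisionStep(Integer.MAX_VALUE); // what Solr's precisionStep="0" maps to
        plainType.freeze();

        Document doc = new Document();
        doc.add(new IntField("price_trie", 42, trieType));
        doc.add(new IntField("price_plain", 42, plainType));

        // The precisionStep at query time must match the one used at index time.
        Query fastRange = NumericRangeQuery.newIntRange("price_trie", 8, 10, 100, true, true);
        Query slowRange = NumericRangeQuery.newIntRange("price_plain", Integer.MAX_VALUE, 10, 100, true, true);
    }
}
```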
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and query your database for the necessary data once you've got the results back from Solr. A few more things to check (a field-type sketch follows this list):
1. Add omitNorms="true" to text fields that don't need length normalization.
2. Add omitPositions="true" to text fields that don't require phrase matching.
3. Special fields, like NGram fields, can take up a lot of space.
4. Are you removing stop words from text fields?
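These Solr schema attributes map onto Lucene field-type settings. A minimal sketch of a "lean" field type, assuming the Lucene 4.x API; the field name and text are placeholders:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class LeanFieldTypeSketch {
    public static void main(String[] args) {
        FieldType leanText = new FieldType();
        leanText.setIndexed(true);
        leanText.setTokenized(true);
        leanText.setStored(false);                  // don't store the raw text; fetch it from the database
        leanText.setOmitNorms(true);                // omitNorms="true": no length normalization
        leanText.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // like omitPositions="true": no phrase matching
        leanText.setStoreTermVectors(false);        // term vectors are another big space consumer
        leanText.freeze();

        Document doc = new Document();
        doc.add(new Field("body", "some analyzed text", leanText)); // "body" is a hypothetical field
    }
}
```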
I use Lucene to search HTML documents. The issue I have is the increased size of the index files: I have about 300-400 MB of HTML files, but the index grows to about 0.98 GB. The reason, I believe, is our specification: we index the same content into four different fields (the same content once case-sensitive and once not, and once case-sensitive with special characters and once without).
Is there a way to reduce the size of the index while keeping the same requirements? Is there a different way to index the content once and still search it in all these ways?
I assume your problem is that you are storing these fields instead of just indexing them. So the solution is: don't store them.
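A minimal sketch of the idea, assuming the Lucene 4.x API; the duplicated field names are made up for illustration. Each variant is indexed (analyzed differently via your per-field analyzer setup) but not stored, and the original text is stored at most once:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class IndexOnlySketch {
    public static void main(String[] args) {
        String html = "<p>Some Example Content</p>";

        Document doc = new Document();
        // Indexed-only variants, none of them stored:
        doc.add(new TextField("content_ci", html, Field.Store.NO));         // case-insensitive
        doc.add(new TextField("content_cs", html, Field.Store.NO));         // case-sensitive
        doc.add(new TextField("content_ci_special", html, Field.Store.NO)); // case-insensitive + special chars
        doc.add(new TextField("content_cs_special", html, Field.Store.NO)); // case-sensitive + special chars

        // If you need to display the original text, store it once (or not at all,
        // if you can retrieve it from the source system):
        doc.add(new StoredField("content_raw", html));
    }
}
```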
Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed and the size of each field, and I know how many items will be indexed. So, once these are processed by Lucene, how does that translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (the .cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size.
Note that this varies greatly based on the Analyzer you use and on the field types you define.
The index stores each unique term only once, so the size depends on the nature of the material being indexed; add to that whatever is being stored as well. One good approach is to take a sample, index it, and use that to extrapolate out to the complete source collection. However, the ratio of index size to source size also decreases as more content is added, since many of the words are already in the index, so you might want to make the sample a decent percentage of the original.
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using term vectors or not, and certainly on whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?