I use Lucene to search HTML documents. The issue I have is the large size of the index files: I have about 300-400 MB of HTML files, but the index grows to roughly 0.98 GB. I believe the cause is our specification: we index the same content into four different fields (the same content analyzed case-sensitively and case-insensitively, and each of those with and without special characters), which I suspect is the problem.
Is there a way to reduce the size of the index while keeping the same requirements? Is there a different way to index the content once and still search it in all of these ways?
I assume your problem is that you are storing these fields instead of just indexing them. So the solution is: don't store them.
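A minimal sketch of that idea, assuming a recent Lucene version and purely illustrative field names: add the same text as several analyzed-but-unstored fields (each field mapped to its own analyzer, e.g. via PerFieldAnalyzerWrapper) and keep at most one stored copy.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class HtmlDocBuilder {
    // Build one Lucene Document: the same HTML text goes into four analyzed
    // fields (Store.NO), and at most one stored copy is kept.
    static Document build(String htmlText) {
        Document doc = new Document();
        doc.add(new TextField("content_ci", htmlText, Store.NO));          // case-insensitive analyzer
        doc.add(new TextField("content_cs", htmlText, Store.NO));          // case-sensitive analyzer
        doc.add(new TextField("content_ci_special", htmlText, Store.NO));  // case-insensitive, keeps special characters
        doc.add(new TextField("content_cs_special", htmlText, Store.NO));  // case-sensitive, keeps special characters
        doc.add(new StoredField("content_raw", htmlText));                 // single stored copy, only if you need the text back
        return doc;
    }
}
```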
Going over some interview info about data structures etc.
So, as I understand it, arrays are O(1) for indexing, which I believe means accessing the element at position x in the array. I just want to confirm this, as I am second-guessing myself.
Also, hash maps are O(1) for indexing, searching, insertion and deletion. Doesn't that make any data-structure question somewhat pointless, since a hash map would always be the best solution?
Thanks
Well, indexing is not only about arrays.
According to this definition, indexing is creating tables (indexes) that point to the location of folders, files and records. Depending on the purpose, indexing identifies the location of resources based on file names, key data fields in a database record, text within a file, or unique attributes in a graphics or video file.
For your second question: hash maps are not the absolute best data structure, for several reasons, mainly:
Collisions
Hash function calculation time
Extra memory used
Also, there are lots of data-structure questions where hash maps are not superior (see the sketch after this list):
A data structure for finding the k-th minimum element with updates (a hash map would essentially be brute force because it does not keep elements sorted, so we need something like a balanced binary search tree)
A data structure for finding whether a word is in a dictionary (sure, a hash map works, but a trie can be faster and use less memory)
A data structure for finding the minimum element in any range of an array with updates (again, a hash map is just too slow for this; we need something like a segment tree)
...
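As a minimal sketch of the first kind of problem, here is how an ordered structure (Java's TreeSet, a balanced binary search tree) answers order-based queries that a HashSet or HashMap cannot answer efficiently; the values are just illustrative.

```java
import java.util.TreeSet;

// Order-based queries that a HashSet/HashMap cannot answer in O(log n):
// TreeSet (a balanced binary search tree) keeps its elements sorted.
public class OrderedQueries {
    public static void main(String[] args) {
        TreeSet<Integer> sorted = new TreeSet<>();
        for (int v : new int[] {42, 7, 19, 3, 88}) {
            sorted.add(v);                         // O(log n) insert, stays sorted
        }
        System.out.println(sorted.first());        // minimum element: 3
        System.out.println(sorted.ceiling(20));    // smallest element >= 20: 42
        System.out.println(sorted.headSet(20));    // all elements < 20: [3, 7, 19]
    }
}
```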
We have some massive Solr indices for a large project, and they are consuming over 50 GB of space.
We have considered several ways to reduce the size that involve changing the content in the indices, but I am curious whether there are any changes we can make to a Solr index that would reduce its size by two orders of magnitude or more, related to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set correctly.
Another relevant question is (3): is there a way to trade index size for performance inside Solr, and if so, how would it work?
Any thoughts on this would be appreciated... Thanks!
There are a couple things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint), but range queries will be slower when using an int.
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and querying your database for the necessary data once you've got the results back from Solr.
Add omitNorms="true" to text fields that don't need length normalization
Add omitPositions="true" to text fields that don't require phrase matching
Special fields, like NGrams, can take up a lot of space
Are you removing stop words from text fields?
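At the Lucene level, those two toggles look roughly like the sketch below; this is only an assumption about how you might build fields in code, and in Solr itself they correspond to the omitNorms and omitPositions attributes on a field in schema.xml.

```java
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class SlimFieldExample {
    // A text field with norms and positions omitted and no stored copy.
    static Field slimTextField(String name, String text) {
        FieldType slimText = new FieldType();
        slimText.setTokenized(true);                             // analyze the text
        slimText.setStored(false);                               // no stored copy of the raw value
        slimText.setOmitNorms(true);                             // drop length-normalization data
        slimText.setIndexOptions(IndexOptions.DOCS_AND_FREQS);   // drop positions: phrase queries won't work
        slimText.freeze();
        return new Field(name, text, slimText);
    }
}
```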
Our Solr index is about 30 GB and we want to do some optimization to reduce its size.
How can we find out how much space each field occupies?
Please check http://code.google.com/p/luke/
Luke will help you analyze your index, showing the fields and terms and, roughly, the percentage of the index size that each field accounts for.
Also, some other questions to consider:
1. Is your index optimized?
2. Revisit your configuration for indexed fields, stored fields, multiple copies of the same fields, term vectors and norms.
3. Take a look at the terms in your index and identify stopwords or common words.
I would like to use Lucene to index a table in an existing database. I have been thinking the process would go something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document.
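A minimal sketch of those four steps against the current Lucene document API; the column names ("id", "title", "body") are just placeholders for your table's columns.

```java
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class RowIndexer {
    // One Lucene Document per table row.
    static void indexRow(IndexWriter writer, String id, String title, String body)
            throws IOException {
        Document doc = new Document();
        doc.add(new StringField("id", id, Store.YES));      // primary key: indexed as-is, not analyzed
        doc.add(new TextField("title", title, Store.YES));  // analyzed and stored
        doc.add(new TextField("body", body, Store.NO));     // the huge column: analyzed only, fetched from the DB on a hit
        writer.addDocument(doc);
    }
}
```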
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the Field regardless of its size, and if a hit is found for a search, fetch the appropriate Field from the Document
Don't store the Field, and if a hit is found for a search, query the database to get the relevant information
I realize there may not be a one-size-fits-all answer ...
For sure, your system will be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger. And probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem on a relatively slow I/O system;
stored fields are all loaded when you load a Document into memory, which can put significant stress on the GC;
stored fields are likely to impact reader reopen time.
The final answer, of course, is: it depends. If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can decide whether each column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing unnecessary data to the inverted index.
Likewise, you can decide whether a column will ever need to be retrieved with a hit. If not, you don't need to use Store.YES to store the data.
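A minimal sketch of those two decisions using the classic (pre-4.0) Field constructor that the Store.YES/Index.NO flags above come from; the column names are illustrative.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class ColumnMappingExample {
    static Document map(String bodyText, String thumbnailData) {
        Document doc = new Document();
        // Column that will be searched: analyzed into the inverted index, raw value not stored.
        doc.add(new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED));
        // Column only ever returned with a hit, never searched: stored but not indexed.
        doc.add(new Field("thumbnail", thumbnailData, Field.Store.YES, Field.Index.NO));
        return doc;
    }
}
```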
Is there a known mathematical formula that I can use to estimate the size of a new Lucene index? I know how many fields I want indexed and the size of each field, and I know how many items will be indexed. So, once these are processed by Lucene, how does that translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate of the .cfs file size.
Note that this varies greatly based on the Analyzer you use and on the field types you define.
The index stores each unique "token" only once, so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample, index it, and use that to extrapolate to the complete source collection. However, the ratio of index size to source size also decreases over time, as the words are already in the index, so you might want to make the sample a decent percentage of the original.
I think it also has to do with the frequency of each term (i.e., an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using term vectors or not, and certainly on whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?