Solr index size reduction and optimization

We have some massive Solr indices for a large project, and they are consuming over 50 GB of space.
We have considered several ways to reduce the size that involve changing the content in the indices, but I am curious whether there are any changes we can make to a Solr index that would reduce its size by two orders of magnitude or more, related directly to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set correctly.
Another relevant question is (3): is there a way to trade index size for performance inside Solr, and if so, how would it work?
Any thoughts on this would be appreciated... Thanks!

There are a couple of things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint) field, but range queries will be slower against a plain int.
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and query your database for the necessary data once you've got the results back from Solr.
Add omitNorms="true" to text fields that don't need length normalization
Add omitPositions="true" to text fields that don't require phrase matching
Special fields, like NGrams, can take up a lot of space
Are you removing stop words from text fields? (A rough Lucene-level sketch of these settings follows this list.)
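In case it helps to see what these flags actually control, here is a rough Java sketch at the Lucene level (Solr drives the same FieldType machinery from schema.xml). The class and field names are just examples, not anything from your setup:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexOptions;

    public class LeanFieldSketch {
        public static Document buildDoc(String id, String bodyText) {
            // A "body" field that is searched but never displayed from the index:
            //  - not stored (fetch display text from the database instead)
            //  - no norms (no length normalization)   ~ omitNorms="true"
            //  - docs + freqs only, no positions      ~ omitPositions="true"
            FieldType bodyType = new FieldType();
            bodyType.setTokenized(true);
            bodyType.setStored(false);
            bodyType.setOmitNorms(true);
            bodyType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
            bodyType.freeze();

            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES)); // only the key is stored
            doc.add(new Field("body", bodyText, bodyType));
            // Stop-word removal is an analyzer concern (a StopFilter in the analysis
            // chain), not a FieldType flag, so it is not shown here.
            return doc;
        }
    }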

Related

Is the Lucene query language hack-proof?

Obviously it cannot be used to trash the index or crack card numbers, passwords etc. (unless one is stupid enough to put card numbers or passwords in the index).
Is it possible to bring down the server with excessively complex searches?
I suppose what I really need to know is can I pass a user-entered Lucene query directly to the search engine without sanitization and be safe from malice.
It is impossible to modify the index from the input of a query parser. However, there are several things that could hurt a search server running Lucene:
A high value for the number of top results to collect
Lucene puts hits in a priority queue in order to rank them (the queue is implemented with a backing array the size of the priority queue). So a request that fetches the results from offset 99,999,900 to offset 100,000,000 will make the server allocate a few hundred megabytes for this priority queue. Running several queries of this kind in parallel is likely to make the server run out of memory.
Sorting on arbitrary fields
Sorting on a field requires the field cache for that field to be loaded. In addition to taking a lot of time, this operation uses a lot of memory (especially on text fields with many large, distinct values), and that memory is not reclaimed until the index reader for which the cache was loaded is no longer in use.
Term dictionary intensive queries
Some queries are more expensive than others. To prevent query execution from taking too long, Lucene already has some guards against overly complex queries: by default, a BooleanQuery cannot have more than 1024 clauses.
Other queries such as wildcard queries and fuzzy queries are very expensive too.
To prevent your users from hurting your search service, you should decide what they are allowed to do and what they are not. For example, Twitter (which uses Lucene for its search backend) used to limit queries to a few clauses in order to be certain of returning a response in reasonable time. (The question "Twitter api - search too complex?" discusses this limitation.)
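As a concrete illustration, the kind of application-level guard I mean might look like the sketch below. The limits, field names and class name are made up, and on newer Lucene versions the clause-count setter lives on IndexSearcher rather than BooleanQuery:

    import java.util.Set;
    import org.apache.lucene.search.BooleanQuery;

    public class SearchRequestGuard {
        // Arbitrary example limits; tune them to your own service.
        private static final int MAX_HITS_WINDOW = 1000;                      // start + rows must stay below this
        private static final Set<String> SORTABLE = Set.of("date", "price");  // fields whose field cache you can afford

        static {
            // Default is 1024 clauses; tightening it rejects huge expanded queries earlier.
            BooleanQuery.setMaxClauseCount(256);
        }

        public static void validate(int start, int rows, String sortField) {
            if (start < 0 || rows < 0 || start + rows > MAX_HITS_WINDOW) {
                throw new IllegalArgumentException("paging window too large");
            }
            if (sortField != null && !SORTABLE.contains(sortField)) {
                throw new IllegalArgumentException("sorting on this field is not allowed");
            }
        }
    }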
As far as I know, there are no major vulnerabilities that you need to worry about. Depending on the query parser you are using, you may want to do some simple sanitization.
Limit the length of the query string
Check for characters that you don't want to support. For example, +, -, [, ], *
If you let the user pick the number of results returned (e.g. 10, 20, 50), then make sure they can't use a really large value. (A minimal sanitization sketch follows this list.)
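Something along these lines would cover the points above. The length and row limits are arbitrary examples; QueryParser.escape simply backslash-escapes the query syntax characters so they are treated as literal text:

    import org.apache.lucene.queryparser.classic.QueryParser;

    public class QuerySanitizer {
        private static final int MAX_QUERY_LENGTH = 200; // arbitrary example limit
        private static final int MAX_ROWS = 100;         // arbitrary example limit

        public static String sanitize(String userInput) {
            if (userInput == null || userInput.length() > MAX_QUERY_LENGTH) {
                throw new IllegalArgumentException("query missing or too long");
            }
            // Escapes +, -, [, ], *, ~, :, etc. so the user cannot inject query syntax.
            return QueryParser.escape(userInput);
        }

        public static int clampRows(int requestedRows) {
            return Math.max(1, Math.min(requestedRows, MAX_ROWS));
        }
    }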

How to know the space occupied by each field in Solr

Our Solr index is about 30 GB and we want to do some optimization to reduce its size.
How can we find out how much space each field occupies?
Please check http://code.google.com/p/luke/
Luke will help you analyze your index, including its fields and terms, and it can give you a rough idea of how much of the index each field accounts for.
Also, some other things to check:
1. Is your index optimized? (See the sketch after this list.)
2. Revisit your configuration for indexed fields, stored fields, multiple copies of the same field, term vectors and norms.
3. Take a look at the terms in your index and identify stop words or other very common words.
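On point 1: at the Lucene level, optimizing corresponds to forcing the index down to a single segment, which also expunges deleted documents. A minimal sketch of doing that directly with the Lucene API is below (the index path is a placeholder; against a live Solr core you would issue an optimize through the update handler rather than opening the index yourself):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class OptimizeIndex {
        public static void main(String[] args) throws Exception {
            try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
                 IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                writer.forceMerge(1); // merge all segments into one, dropping deleted documents
                writer.commit();
            }
        }
    }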

Reducing the memory size of an index for Lucene

I use Lucene for searching HTML documents. The issue I have is the size of the index files: I have about 300-400 MB of HTML files, but the index runs up to about 0.98 GB. I think the reason is the specification we have: we index the same content in four different fields, which I guess is the problem (we use the same content, one field case-sensitive and one not, one case-sensitive with special characters kept and one without).
Is there a way to reduce the size of the index while keeping the same requirements? Is there a different way to index the content once and still search it differently to support all of these cases?
I assume your problem is that you are storing these fields instead of just indexing them. So the solution is: don't store them.
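For example, something along these lines: only the key is stored, and each analyzed variant of the content is indexed but not stored. The field names are just examples, and how each variant is analyzed (case-sensitive or not) is decided by the analyzer you attach to each field, for instance via PerFieldAnalyzerWrapper:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class IndexWithoutStoring {
        public static Document buildDoc(String id, String htmlText) {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));            // only the key is stored
            doc.add(new TextField("content", htmlText, Field.Store.NO));    // indexed, not stored
            doc.add(new TextField("content_cs", htmlText, Field.Store.NO)); // case-sensitive variant, not stored
            return doc;
        }
    }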

Lucene Indexing

I would like to use Lucene for indexing a table in an existing database. I have been thinking the process would be something like this:
Create a 'Field' for every column in the table
Store all the Fields
'ANALYZE' all the Fields except for the Field with the primary key
Store each row in the table as a Lucene Document. (A rough sketch of what I have in mind follows.)
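Concretely, assuming a hypothetical table with columns id (the primary key), title and body (the large searchable column), I picture something like:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class RowIndexer {
        public static Document toDocument(ResultSet row) throws SQLException {
            Document doc = new Document();
            // Primary key: indexed as-is (not analyzed) and stored.
            doc.add(new StringField("id", row.getString("id"), Field.Store.YES));
            // Small column: analyzed and stored.
            doc.add(new TextField("title", row.getString("title"), Field.Store.YES));
            // Huge column: analyzed; whether to store it is exactly my question below.
            doc.add(new TextField("body", row.getString("body"), Field.Store.NO));
            return doc;
        }
    }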
While most of the columns in this table are small in size, one is huge. This column is also the one containing the bulk of the data on which searches will be performed.
I know Lucene provides an option to not store a Field. I was thinking of two solutions:
Store the field regardless of its size and, if a hit is found for a search, fetch the appropriate field from the Document
Don't store the field and, if a hit is found for a search, query the database to get the relevant information out
I realize there may not be a one-size-fits-all answer ...
Your system will certainly be more responsive if you store everything in Lucene. Stored fields do not affect query time; they only make your index bigger, and probably not that much bigger if only a small portion of the rows have a lot of data. So if index size is not an issue for your system, I would go with that.
I strongly disagree with Pascal's answer. Index size can have a major impact on search performance. The main reasons are:
stored fields increase index size, which can be a problem on a relatively slow I/O system;
all stored fields are loaded when you load a Document into memory, which can put significant stress on the GC;
stored fields are likely to increase reader reopen time.
The final answer, of course, is "it depends". If the original data is already stored somewhere else, it's good practice to retrieve it from the original data store.
When adding a row from the database to Lucene, you can judge whether each column actually needs to be written to the inverted index. If not, you can use Index.NO to avoid writing too much data to the inverted index.
Likewise, you can judge whether a column's value actually needs to be retrieved along with the search results. If not, you don't need to use Store.YES for it.
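Those Index/Store flags come from the older (pre-4.0) Lucene Field constructor; a rough sketch of the per-column decision, with made-up column names, would be:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class LegacyFieldFlags {
        public static Document toDocument(String sku, String description) {
            Document doc = new Document();
            // Displayed with the results but never searched:
            // keep it out of the inverted index, but store its value.
            doc.add(new Field("sku", sku, Field.Store.YES, Field.Index.NO));
            // Searched, but the full text can be fetched from the database when needed:
            // analyze and index it, but don't store it.
            doc.add(new Field("description", description, Field.Store.NO, Field.Index.ANALYZED));
            return doc;
        }
    }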

How do I estimate the size of a Lucene index?

Is there a known math formula that I can use to estimate the size of a new Lucene index? I know how many fields I want to have indexed, and the size of each field. And, I know how many items will be indexed. So, once these are processed by Lucene, how does it translate into bytes?
Here is the Lucene index format documentation.
The major file is the compound index (.cfs file).
If you have term statistics, you can probably get an estimate for the .cfs file size.
Note that this varies greatly based on the Analyzer you use, and on the field types you define.
The index stores each "token" or text term only once, so the size depends on the nature of the material being indexed. Add to that whatever is being stored as well. One good approach might be to take a sample, index it, and use that to extrapolate out to the complete source collection. However, the ratio of index size to source size also decreases over time, as the words are already there in the index, so you might want to make the sample a decent percentage of the original.
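A rough way to put numbers on that sampling approach (the path and sample fraction are placeholders, and the linear scaling will overestimate somewhat because the term dictionary grows sub-linearly):

    import java.nio.file.Paths;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class IndexSizeEstimate {
        public static void main(String[] args) throws Exception {
            double sampleFraction = 0.10; // the sample was ~10% of the full collection
            try (Directory dir = FSDirectory.open(Paths.get("/path/to/sample-index"))) {
                long bytes = 0;
                for (String file : dir.listAll()) {
                    bytes += dir.fileLength(file);
                }
                System.out.printf("sample index: %d bytes, rough full estimate: %.0f bytes%n",
                        bytes, bytes / sampleFraction);
            }
        }
    }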
I think it also has to do with the frequency of each term (i.e. an index of 10,000 copies of the same terms should be much smaller than an index of 10,000 wholly unique terms).
Also, there's probably a small dependency on whether you're using Term Vectors or not, and certainly whether you're storing fields or not. Can you provide more details? Can you analyze the term frequency of your source data?