How to optimize a large index on Solr

Our index is growing relatively fast; we add 2,000-3,000 documents a day.
We run an optimize every night.
The problem is that Solr needs double the disk space while optimizing. Currently the index has a size of 44 GB, which fits on a 100 GB partition for the next few months.
That means 50% of the disk space is unused for 90% of the day and is only needed during the optimize.
Next thing: we have to add more space to that partition periodically, which is always a painful discussion with the guys from the storage department (because we have more than one index...).
So the question is: is there a way to optimize an index without reserving an additional 100% of the index size on disk?
I know that multiple cores and distributed search are an option, but that is only a "fall back" solution, because it would require fundamental changes to the application.
Thank you!

There is continuous merging going on under the hood in Lucene. Read up on the merge factor (the mergeFactor setting in solrconfig.xml). If you tweak this setting you probably won't have to optimize at all.

You can try a partial optimize by passing the maxSegments parameter.
This will reduce the index to the specified number of segments.
I suggest you do it in batches (e.g. if there are 50 segments, first reduce to 30, then to 15, and so on).
Here's the URL:
host:port/solr/CORE_NAME/update?optimize=true&maxSegments=(Enter the number of segments you want to reduce to. Ignore the parentheses)&waitFlush=false
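For example, a minimal sketch of that staged approach in Python, assuming Solr is reachable at localhost:8983, the core is called CORE_NAME (both placeholders), and the requests package is installed:

```python
import requests  # assumes the 'requests' package is available

# Placeholder host/port/core -- replace with your own values.
SOLR_UPDATE_URL = "http://localhost:8983/solr/CORE_NAME/update"

def partial_optimize(max_segments):
    """Ask Solr to merge the index down to at most `max_segments` segments."""
    response = requests.get(
        SOLR_UPDATE_URL,
        params={
            "optimize": "true",
            "maxSegments": max_segments,
            "waitFlush": "false",
        },
        timeout=3600,  # merging a large index can take a long time
    )
    response.raise_for_status()

# Reduce the segment count in stages, e.g. 50 -> 30 -> 15 -> 1.
for target in (30, 15, 1):
    partial_optimize(target)
```

The idea is that each step merges less data at once than a single full optimize, so the temporary disk overhead of each stage stays smaller.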

What is the mathematical relationship between "no. of rows affected" and "execution time" of a sql query?

The query itself remains constant, i.e. it does not change.
For example, a select query takes 30 minutes if it returns 10,000 rows.
Would the same query take 1 hour if it has to return 20,000 rows?
I am interested in the mathematical relation between the number of rows (N) and the execution time (T), keeping the other parameters constant (K).
i.e. T = N*K, or
T = N*K + C, or
some other formula?
I am reading http://research.microsoft.com/pubs/76556/progress.pdf in case it helps. If anybody understands it before me, please do reply. Thanks...
Well, that is a good question :), but there is no exact formula, because it depends on the execution plan.
The SQL query optimizer could choose a different execution plan for a query that returns a different number of rows.
I guess that if the execution plan is the same for both queries and you have some "lab" conditions, then the growth in time could be linear. You should research SQL execution plans and statistics further.
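Under such "lab" conditions, one way to check this empirically is to time the same query shape for increasing row counts and see whether T grows linearly with N. A rough sketch in Python, assuming a pyodbc connection (the connection string, table name big_table and column id are placeholders) and that the execution plan stays the same across runs:

```python
import time
import pyodbc  # assumes the pyodbc driver/package is installed

# Placeholder connection string -- adjust for your own server and database.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()

# Time the same query for increasing row counts (N) and compare T.
for n in (10_000, 20_000, 40_000):
    start = time.perf_counter()
    cursor.execute("SELECT * FROM big_table WHERE id <= ?", n)  # placeholder table/column
    rows = cursor.fetchall()
    elapsed = time.perf_counter() - start
    print(f"N={len(rows):>7}  T={elapsed:.2f}s")
```

If the timings roughly double when N doubles, you are in the linear T = N*K regime; if they jump disproportionately, the plan or the caching behaviour has probably changed.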
Take the very simple example of reading every row in a single table.
In the worst case, you will have to read every page of the table from your underlying storage, and the worst case for each page read is a random seek. The seek time will dominate all other factors, so you can estimate the total time:
time ~= seek time x number of data pages
Assuming your rows are of a fairly regular size, then this is linear in the number of rows.
However, databases do a number of things to try and avoid this worst case. For example, in SQL Server table storage is often allocated in extents of 8 consecutive pages. A hard drive has a much faster streaming IO rate than random IO rate. If you have a clustered index, reading the pages in cluster order tends to involve a lot more streaming IO than random IO.
The best case time, ignoring memory caching, is (8KB is the SQL Server page size)
time ~= 8KB * number of data pages / streaming IO rate in KB/s
This is also linear in the number of rows.
As long as you do a reasonable job managing fragmentation, you could reasonably extrapolate linearly in this simple case. This assumes your data is much larger than the buffer cache. If not, you also have to worry about the cliff edge where your query changes from reading from buffer to reading from disk.
I'm also ignoring details like parallel storage paths and access.
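To make the two estimates above concrete, here is a small back-of-envelope sketch; the seek time, streaming rate and rows-per-page figures are assumptions (roughly typical for a single spinning disk), not measurements:

```python
# Back-of-envelope estimates for scanning a table, using the two formulas above.
PAGE_SIZE_KB = 8            # SQL Server page size
ROWS_PER_PAGE = 50          # assumption: depends entirely on your row width
SEEK_TIME_S = 0.008         # assumption: ~8 ms random seek on a spinning disk
STREAM_RATE_KB_S = 100_000  # assumption: ~100 MB/s sequential read

def scan_time_estimates(row_count):
    pages = row_count / ROWS_PER_PAGE
    worst = pages * SEEK_TIME_S                     # time ~= seek time x pages
    best = pages * PAGE_SIZE_KB / STREAM_RATE_KB_S  # time ~= 8KB x pages / streaming rate
    return worst, best

for n in (10_000, 20_000):
    worst, best = scan_time_estimates(n)
    print(f"{n:>6} rows: worst ~{worst:.1f}s, best ~{best:.3f}s")
```

Both estimates double when the row count doubles, which is the linear behaviour described above; the absolute numbers are only as good as the assumed constants.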

Put and Delete with CouchDB + Lucene

I'm running CouchDB (1.2.1) + Lucene on Linux (https://github.com/rnewson/couchdb-lucene/), and I have a few questions.
I index everything into one index for all documents, and I've got around 20,000,000 documents.
How fast are puts/deletes applied to the index? I have about 10-50 puts/deletes per second.
Is there a rule of thumb, e.g. after 10,000 updates you have to optimize the index?
Are changes in documents immediately visible in the index? If not, is there a delay or a temporary table for these updates/deletes?
Thanks in advance - Brandon
Use a profiler to measure the put/delete performance. That's the only way you'll get reasonably accurate numbers.
Optimization depends on how quickly the index is changing -- again, you will need to experiment and profile.
Changes are immediately visible in the index, but not to already-open IndexReaders.
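If you just want rough numbers rather than a full profiler run, you can also time the HTTP round trips yourself. A minimal sketch, assuming CouchDB is reachable at localhost:5984, a database named docs exists, and the requests package is installed (the document IDs and payload are made up for the test):

```python
import time
import uuid
import requests  # assumes the 'requests' package is available

BASE = "http://localhost:5984/docs"  # placeholder database URL

# Time a batch of puts followed by deletes and report the average latency.
doc_ids = []
t0 = time.perf_counter()
for _ in range(100):
    doc_id = uuid.uuid4().hex
    resp = requests.put(f"{BASE}/{doc_id}", json={"type": "timing-test"})
    resp.raise_for_status()
    doc_ids.append((doc_id, resp.json()["rev"]))
put_avg = (time.perf_counter() - t0) / len(doc_ids)

t0 = time.perf_counter()
for doc_id, rev in doc_ids:
    requests.delete(f"{BASE}/{doc_id}", params={"rev": rev}).raise_for_status()
delete_avg = (time.perf_counter() - t0) / len(doc_ids)

print(f"avg put: {put_avg * 1000:.1f} ms, avg delete: {delete_avg * 1000:.1f} ms")
```

Note that this only measures the CouchDB side of each put/delete; how quickly couchdb-lucene reflects the change in its index has to be checked separately, e.g. by querying the index after the writes.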

Elasticsearch multi-index performance

I'm thinking about moving from one index to day-based indices (multi-index) in our Elasticsearch cluster with a huge number of records.
The actual question is how this will affect the performance of indexing, searching and mapping in the ES cluster.
Does it take more time to search through one huge index than through hundreds of smaller ones?
It will take less time to search through one large index than through hundreds of smaller ones.
Breaking an index up in this fashion can help performance if you will primarily be searching only one of the broken-out indices. In your case, if you will most often need to search for records added on a particular day, this might help you performance-wise. If you will mostly be searching across the entire range of indices, you would generally be better off searching a single monolithic index.
Finally, we have implemented ES multi-indexing in our company. For our application we chose a monthly index strategy, so we create a new index every month.
Of course, as femtoRgon advised, searching through all the smaller indices takes a little longer, but the application has become faster overall because of how it is used.
So, my advice to everybody who wants to move from one index to multiple indices: research your application's needs and choose appropriate slices of the whole index (if it's really needed).
As an example, I can share some results from analysing our application that helped us decide on monthly indices:
90-95% of our queries cover only the last 3 months
we have about 4 big groups of queries: today, last week, last month and last 3 months (of course, we could create weekly or daily indices, but they would be too small, since we don't have enough documents)
we can explain to customers why they need to wait if they make an "unusual" query across the whole period (all indices).
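For illustration, a minimal sketch of what the query side of such a setup can look like, assuming monthly indices named like records-2014.01 (the name pattern, host and query are placeholders) and the requests package:

```python
import requests  # assumes the 'requests' package is available

ES = "http://localhost:9200"                       # placeholder cluster address
query = {"query": {"match": {"status": "error"}}}  # placeholder query

# "Usual" query: hit only the indices covering the last 3 months.
recent = requests.post(
    f"{ES}/records-2014.01,records-2014.02,records-2014.03/_search", json=query
)

# "Unusual" query: fan out over every monthly index with a wildcard --
# each index is searched and the results are merged, so this is slower.
everything = requests.post(f"{ES}/records-*/_search", json=query)

print(recent.json()["hits"]["total"], everything.json()["hits"]["total"])
```

The application decides which index list to put in the URL based on the requested time range, which is why the common queries stay fast.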

SOLR index size reduction

We have some massive SOLR indices for a large project, and they consume over 50 GB of space.
We have considered several ways to reduce the size that involve changing the content in the indices, but I am curious whether there are any changes we can make to a SOLR index that would reduce its size by two orders of magnitude or more, related to either (1) maintenance commands we can run or (2) simple configuration parameters that may not be set correctly.
Another relevant question is: (3) is there a way to trade index size for performance inside SOLR, and if so, how would it work?
Any thoughts on this would be appreciated... Thanks!
There are a couple things you might be able to do to trade performance for index size. For example, an integer (int) field uses less space than a trie integer (tint), but range queries will be slower when using an int.
To make major reductions in your index, you will almost certainly need to look more closely at the fields you are using.
Are you using a lot of stored fields? If so, try removing the stored fields from the index and query your database for the necessary data once you've got the results back from Solr (see the sketch after this list).
Add omitNorms="true" to text fields that don't need length normalization
Add omitPositions="true" to text fields that don't require phrase matching
Special fields, like NGrams, can take up a lot of space
Are you removing stop words from text fields?
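As an illustration of the stored-fields suggestion, here is a rough sketch that asks Solr only for document IDs and then pulls the display data from the database; the host, core name, query, field names and database schema are all placeholders:

```python
import sqlite3
import requests  # assumes the 'requests' package is available

SOLR_SELECT = "http://localhost:8983/solr/CORE_NAME/select"  # placeholder host/core

# Ask Solr only for matching IDs (fl=id) instead of storing the full content.
resp = requests.get(SOLR_SELECT, params={
    "q": "title:solr",  # placeholder query
    "fl": "id",         # return only the id field
    "rows": 50,
    "wt": "json",
})
ids = [doc["id"] for doc in resp.json()["response"]["docs"]]

# Fetch the full records from the database instead of from Solr stored fields.
if ids:
    conn = sqlite3.connect("app.db")  # placeholder database
    marks = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT id, title, body FROM documents WHERE id IN ({marks})", ids
    ).fetchall()
```

The index then only has to hold the indexed terms and the id field, which is where much of the space saving from dropping stored fields comes from.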

When and why do I reindex an MSDE database

I understand that indexes should get updated automatically, but when that does not happen we need to reindex.
My questions are: (1) Why does this automatic update fail, or why does an index go bad?
(2) How do I programmatically know which table/index needs reindexing at a given point in time?
Indexes' statistics may be updated automatically. I do not believe that the indexes themselves would be rebuilt automatically when needed (although there may be some administrative feature that allows such a thing to take place).
Indexes associated with tables that receive a lot of changes (new rows, updated rows and deleted rows) may become fragmented, and less efficient. Rebuilding the index then "repacks" it into a contiguous section of storage space, a bit akin to the way defragmentation of the file system makes file access faster...
Furthermore, indexes (on several DBMSs) have a FILL_FACTOR parameter, which determines how much extra space should be left in each node for growth. For example, if you expect a given table to grow by 20% next year, declaring a fill factor of around 80% should keep fragmentation of the index minimal during the first year (there may be some if that 20% of growth is not evenly distributed...).
In SQL Server, it is possible to query properties of an index that indicate its level of fragmentation, and hence its possible need for maintenance. This can be done through the interactive management console. It is also possible to do this programmatically, by way of sys.dm_db_index_physical_stats in MSSQL 2005 and above (maybe even older versions?).
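A rough sketch of that programmatic check, assuming SQL Server 2005 or later, a pyodbc connection (the connection string is a placeholder) and a fragmentation threshold of 30% (an arbitrary cut-off, not a rule):

```python
import pyodbc  # assumes the pyodbc driver/package is installed

# Placeholder connection string -- point it at your own server and database.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes"
)

# List indexes in the current database whose fragmentation exceeds 30%.
sql = """
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30
ORDER BY ips.avg_fragmentation_in_percent DESC
"""
for table_name, index_name, fragmentation in conn.cursor().execute(sql):
    print(f"{table_name}.{index_name}: {fragmentation:.1f}% fragmented")
```

Indexes that show up here are candidates for a reorganize or rebuild; the 30% threshold is only a common starting point and should be tuned to your own workload.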