How often to call commit on an offline Solr/Lucene index?

I know there have been some semi-similar questions, but in this case I am building an index that is offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and the other has about 6.6 million records, also with full text (this is the page index).
Given that this index is being built offline, the only real performance concern is build speed. No one should be querying this data.
The auto-commit would apparently fire if I stopped adding items for 50 seconds, which I never do: I am adding ten at a time, every couple of seconds.
So, should I commit more often? I feel like the longer this runs, the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say I am using Solr 3.1 and SolrNet.

Although it's the commits that are taking time for you, you might want to consider tweaking things other than the commit frequency.
Is it the indexing core that also does the searching, or is the index replicated somewhere else after indexing concludes? If the latter is the case, then turning off caches could have a very noticeable impact on performance, since Solr rebuilds its caches every time you commit.

You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is specified as part of the document add command. I believe this is supported by SolrNet - please see the "Using the commitWithin attribute" thread for more details.
autoCommit is a Solr configuration value added to the update handler section of solrconfig.xml.
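On the SolrNet side, a minimal sketch of the commitWithin approach might look like the following. The document type, field names, and the 60-second window are invented for the example, and the exact SolrNet overloads can vary by version, so treat this as illustrative rather than definitive.

```csharp
using System.Collections.Generic;
using SolrNet;
using SolrNet.Attributes;
using SolrNet.Commands.Parameters;

// Hypothetical document type used only for this example.
public class PageDoc
{
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("text")]
    public string Text { get; set; }
}

public static class OfflineIndexer
{
    // Adds a batch and asks Solr to commit it within 60 seconds,
    // instead of issuing an explicit commit after every batch.
    public static void AddBatch(ISolrOperations<PageDoc> solr, IEnumerable<PageDoc> batch)
    {
        solr.AddRange(batch, new AddParameters { CommitWithin = 60000 });
    }

    // One explicit commit at the very end of the offline build.
    public static void FinishBuild(ISolrOperations<PageDoc> solr)
    {
        solr.Commit();
    }
}
```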

Related

Indexes in Azure SQL Database

I have an Azure SQL Database that has proved pretty successful so far. It's about 20 months old, no maintenance done... but it has handled a lot. Some tables have millions of rows, and when querying on columns that are indexed, query response times are acceptable when using the web application that talks to it.
However, I read conflicting advice on rebuilding indexes.
This guy says there is no point in doing it: http://beyondrelational.com/modules/2/blogs/76/posts/15290/index-fragmentation-in-sql-azure.aspx
This guy says go ahead rebuild:
https://alexandrebrisebois.wordpress.com/2013/02/06/dont-forget-about-index-maintenance-on-windows-azure-sql-database/
I have run some rebuild index statements on some of the smaller tables storing a few thousand rows. Some of the fragmentation would drop by about 1/2... then if I run it a second time, it might go down by about
These rebuilds ran in about 2-10 seconds, depending on the size of the table.
Then I ran a rebuild on a table that has about 2 million rows, where the indexes had the following fragmentation (index name, average fragmentation %):
PK__tmp_ms_x__CDEC17C03A4CDB46 55.2335782060565
PK__tmp_ms_x__CDEC17C03A4CDB46 0
IX_this_is_my_fk_index 15.7538877620014
It took 33 minutes.
The result was:
PK__tmp_ms_x__CDEC17C03A4CDB46 0.01
PK__tmp_ms_x__CDEC17C03A4CDB46 0
IX_this_is_my_fk_index 0
Questions:
Query speeds have not really changed since doing the above. Is this normal?
Given that there are many things I have no control over in SQL Azure, does it even make sense to rebuild indexes?
BTW: I am not and never have been a DBA... just a developer
Rebuilding indexes will matter if the indexes are actually being used. If the index isn't being used by the query you're running, then you won't see a difference. If it's only lightly used, you'll see a minor difference if you run stats. If it's heavily used, you should see a good performance increase most of the time. The other thing to note with Microsoft SQL is that index fragmentation is sometimes irrelevant. Usually when I'm choosing whether or not to rebuild an index, I'm looking at the page count combined with the fragmentation: if I'm running a query, I'm having performance issues, I'm using the index, and the index has more than 16,000 pages and is more than 50% fragmented, I'll rebuild it. If the table is small, or if I can use the online option, I'll just go ahead and rebuild all of them at the same time.
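As a rough sketch of that check, the query below pulls page count and average fragmentation from sys.dm_db_index_physical_stats and applies the same thresholds mentioned above. It assumes the Microsoft.Data.SqlClient package, and the connection string is a placeholder.

```csharp
using System;
using Microsoft.Data.SqlClient; // assumption: the Microsoft.Data.SqlClient NuGet package

class FragmentationCheck
{
    static void Main()
    {
        // Placeholder - replace with your Azure SQL Database connection string.
        const string connectionString = "Server=tcp:yourserver.database.windows.net;Database=yourdb;...";

        // Lists indexes that exceed the rough "worth rebuilding" thresholds above:
        // more than 16,000 pages and more than 50% fragmentation.
        const string sql = @"
            SELECT  OBJECT_NAME(ips.object_id)           AS TableName,
                    i.name                               AS IndexName,
                    ips.page_count                       AS PageCount,
                    ips.avg_fragmentation_in_percent     AS FragmentationPct
            FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
            JOIN    sys.indexes i
              ON    i.object_id = ips.object_id AND i.index_id = ips.index_id
            WHERE   i.name IS NOT NULL
              AND   ips.page_count > 16000
              AND   ips.avg_fragmentation_in_percent > 50
            ORDER BY ips.avg_fragmentation_in_percent DESC;";

        using var connection = new SqlConnection(connectionString);
        connection.Open();
        using var command = new SqlCommand(sql, connection) { CommandTimeout = 300 };
        using var reader = command.ExecuteReader();
        while (reader.Read())
        {
            Console.WriteLine($"{reader["TableName"]}.{reader["IndexName"]}: " +
                              $"{reader["PageCount"]} pages, {reader["FragmentationPct"]}% fragmented");
        }
    }
}
```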
Specifically for Azure, my opinion is that if you are trying to improve performance, it's still a good step to take because it's so easy, even if you can't be sure of the results. Whether or not it's a shared service, and whether or not you can control the hardware layer, reviewing index fragmentation and rebuilding indexes are things you do have access to, so why not make use of them?
So I guess the short answer is yes, in certain situations.
What I would suggest, rather than manually reviewing indexes and rebuilding them, is to set up a nightly or weekly job that runs when your database is least active. Have it go through all the tables and rebuild their indexes. You can also give it a fixed running time if you have a lot of tables, and make it "stateful" (you can use a table to retain progress information) so it remembers where it left off and resumes at the next scheduled run; a sketch of this approach follows below.
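To illustrate the "stateful" job idea, here is a minimal sketch. It assumes a hypothetical dbo.IndexRebuildProgress table (TableName, IndexName, RebuiltOn) that has already been populated with the indexes you care about, plus the same Microsoft.Data.SqlClient setup as above; the connection string and the one-hour time budget are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using Microsoft.Data.SqlClient; // assumption: the Microsoft.Data.SqlClient NuGet package

class NightlyIndexRebuildJob
{
    // Placeholders: connection string and per-run time budget.
    const string ConnectionString = "Server=tcp:yourserver.database.windows.net;Database=yourdb;...";
    static readonly TimeSpan TimeBudget = TimeSpan.FromMinutes(60);

    static void Main()
    {
        var stopwatch = Stopwatch.StartNew();
        using var connection = new SqlConnection(ConnectionString);
        connection.Open();

        foreach (var (table, index) in GetPendingIndexes(connection))
        {
            if (stopwatch.Elapsed > TimeBudget)
                break; // stop here; the next scheduled run resumes with the remaining rows

            // ONLINE = ON keeps the table available while the rebuild runs.
            var rebuild = $"ALTER INDEX [{index}] ON [{table}] REBUILD WITH (ONLINE = ON);";
            using (var cmd = new SqlCommand(rebuild, connection) { CommandTimeout = 0 })
            {
                cmd.ExecuteNonQuery();
            }

            MarkDone(connection, table, index);
        }
    }

    // Reads the not-yet-rebuilt indexes from the hypothetical progress table.
    static List<(string Table, string Index)> GetPendingIndexes(SqlConnection connection)
    {
        const string sql = @"SELECT TableName, IndexName
                             FROM   dbo.IndexRebuildProgress
                             WHERE  RebuiltOn IS NULL
                             ORDER BY TableName, IndexName;";
        var pending = new List<(string Table, string Index)>();
        using var cmd = new SqlCommand(sql, connection);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            pending.Add((reader.GetString(0), reader.GetString(1)));
        return pending;
    }

    // Records progress so an interrupted run can pick up where it left off.
    static void MarkDone(SqlConnection connection, string table, string index)
    {
        const string sql = @"UPDATE dbo.IndexRebuildProgress
                             SET    RebuiltOn = SYSUTCDATETIME()
                             WHERE  TableName = @t AND IndexName = @i;";
        using var cmd = new SqlCommand(sql, connection);
        cmd.Parameters.AddWithValue("@t", table);
        cmd.Parameters.AddWithValue("@i", index);
        cmd.ExecuteNonQuery();
    }
}
```

The job itself can be scheduled with whatever you already have available (an Azure WebJob, a timer-triggered function, or a console app on a schedule), since Azure SQL Database has no SQL Server Agent.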

Best practices for syncing Lucene repository with source data?

I am designing an application which will have a heavy reliance on searching using a Lucene.NET repository. The repository will be built using data from an operational database that is constantly changing. I'm trying to figure out the best strategy to keep the Lucene repository synced up with the source database. Should I have a service running that wakes up every few minutes, queries the database for updated records, and adds/removes from the Lucene index? Should I rebuild the Lucene repository every night and tolerate some latency in the data?
What are the best practices for keeping the data in a Lucene repository fresh? How do the different strategies affect latency, performance, etc.?
Lucene is capable of performing so-called near-real-time search, which means that updates to the index can be seen in query results almost instantly. So you can freely send updates as soon as they are saved in the database -- Lucene should have no problem handling even quite frequent updates; Twitter search, for example, is built with it (of course, to sustain that kind of load you would need to distribute your index).
So preferably, you would send your updates from some code that triggers after the transaction is committed. It is hard to say anything more specific without knowing which database or queuing system you are using. Some general thoughts on this matter, as well as examples of using it along with CouchDB or RabbitMQ, are shown in the elasticsearch river documentation.
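As a rough illustration of that pattern with Lucene.NET (this sketch assumes the 4.8 betas; the field names and the idea of calling Upsert from a post-commit hook are invented for the example), you keep a single long-lived IndexWriter, push each change as it happens, and refresh a SearcherManager instead of reopening the index:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public class NearRealTimeIndex
{
    private readonly IndexWriter _writer;
    private readonly SearcherManager _searcherManager;

    public NearRealTimeIndex(string indexPath)
    {
        var directory = FSDirectory.Open(indexPath);
        var config = new IndexWriterConfig(LuceneVersion.LUCENE_48,
                                           new StandardAnalyzer(LuceneVersion.LUCENE_48));
        _writer = new IndexWriter(directory, config);
        // true = apply deletes, so removed rows disappear from near-real-time searches too.
        _searcherManager = new SearcherManager(_writer, true, new SearcherFactory());
    }

    // Call this from whatever hook fires after the database transaction commits.
    public void Upsert(string id, string text)
    {
        var doc = new Document
        {
            new StringField("id", id, Field.Store.YES),
            new TextField("text", text, Field.Store.NO)
        };
        _writer.UpdateDocument(new Term("id", id), doc); // replaces any existing doc with this id
    }

    public void Delete(string id) => _writer.DeleteDocuments(new Term("id", id));

    public TopDocs Search(Query query, int n)
    {
        _searcherManager.MaybeRefresh();          // pick up recent, not-yet-committed changes
        var searcher = _searcherManager.Acquire();
        try
        {
            return searcher.Search(query, n);
        }
        finally
        {
            _searcherManager.Release(searcher);
        }
    }

    // Commit periodically (e.g. on a timer) for durability; searches do not depend on it.
    public void Commit() => _writer.Commit();
}
```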

Lucene experts: how best to run diagnostics against an IndexWriter to resolve performance issues?

I've got an index that currently occupies about 1 GB of space and has about 2.5 million documents. The index is stored on a solid-state drive for speed. I'm adding 2,500 documents at a time and committing after each batch has been added. The index is a "live" index and needs to be kept up to date throughout the day and night, so keeping write times to a minimum is very important. I'm using a merge factor of 10 and am never calling Optimize(), instead allowing the index to optimize itself as needed based on the merge factor.
I need to commit the documents after each batch has been added because I record this fact so that if the app crashes or restarts, it can pick up where it left off. If I didn't commit, the stored state would be inconsistent with what's in the index. I'm assuming my additions, deletions and updates are lost if the writer is destroyed without committing.
Anyway, I've noticed that after an arbitrary period of time, which could be anywhere from two minutes to two hours and some variable number of previous commits, the indexer seems to stall on the IndexWriter.AddDocument(doc) method, and I can't for the life of me figure out why it's stalling or how to fix it. The block can stay in place for upwards of two hours, which seems strange for an index taking up less than 2 GB, holding documents in the low millions, and sitting on an SSD.
What could cause AddDocument to block? Are there any Lucene diagnostic utilities that could help me? What else could I look for to track down the problem?
You can use IndexWriter.SetInfoStream() to redirect diagnostics output to a stream that might give you a hint of what's wrong.
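For example, with the Lucene.NET 3.x-era API that matches the version range implied here (the paths are placeholders), you can attach a StreamWriter before indexing begins; the writer then logs segment flushes, merges, and commits, which is usually where an AddDocument stall shows up:

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class IndexerDiagnostics
{
    static void Main()
    {
        var directory = FSDirectory.Open(new DirectoryInfo(@"C:\indexes\live"));   // placeholder index path
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // Log file that will receive the flush/merge diagnostics.
        var infoStream = new StreamWriter(@"C:\logs\lucene-infostream.log", true) { AutoFlush = true };

        var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.SetInfoStream(infoStream);   // diagnostics for flushes, merges and commits

        // ... AddDocument / Commit batches as usual. When AddDocument blocks, the tail of
        // the log shows whether the writer is flushing or waiting for merges to catch up.

        writer.Dispose();
        infoStream.Dispose();
    }
}
```

If the log shows the writer waiting on merges, that is a common culprit for long AddDocument pauses, and tuning the merge policy or merge scheduler is the usual next step.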

High disk IO rate

My Rails application keeps hitting the disk I/O rate threshold set by my VPS at Linode. It's set at 3000 (I upped it from 2000), and every hour or so I get a notification that it has reached 4000-5000+.
What are the methods that I can use to minimize the disk IO rate? I mostly use Sphinx (Thinking Sphinx plugin) and Latitude and Longitude distance search.
What are the methods to avoid?
I'm using Rails 2.3.11 and MySQL.
Thanks.
Did you check whether your server is swapping itself to death? What does "top" say?
Your Linode may have limited RAM, and it could very well be swapping like crazy to keep things running.
If you see red in the IO graph, that is swapping activity! You need to upgrade your Linode to more RAM,
or limit the number and size of the processes you are running. You should also add approximately 2x the RAM size as swap space (a swap partition).
http://tinypic.com/view.php?pic=2s0b8t2&s=7
Since your question is too vague to answer concisely, this is generally a sign of one of a few things:
Your data set is too large because of historical data that you could prune. Delete what is no longer relevant.
Your tables are not indexed properly and you are hitting a lot of table scans. Check each of your slow queries with EXPLAIN.
Your data structure is not optimized for the way you are using it, and you are doing too many joins. Some tactical de-normalization would help here. Make sure all your JOIN queries are strictly necessary.
You are retrieving more data than is required to service the request. It is, sadly, all too common that people load enormous TEXT or BLOB columns from a user table when displaying only a list of user names. Load only what you need.
You're being hit by some kind of automated scraper or spider robot that's systematically downloading your entire site, page by page. You may want to alter your robots.txt if this is an issue, or start blocking troublesome IPs.
Is it going high and staying high for a long time, or is it just spiking temporarily?
There aren't going to be specific methods to avoid (other than not writing to disk).
You could try using a profiler in production, such as New Relic, to get more insight into your performance. A profiler will highlight the actions that are taking a long time; when you then examine the specific algorithm used in each of those actions, you might discover what is inefficient about it.

How to "defragment" MongoDB index effectively in production?

I've been looking at MongoDB. Feels good.
I added some indexes to a collection, uploaded a bunch of data, then removed all the data, and I noticed the indexes did not change size, similar to the behavior reported here.
If I call
db.repairDatabase()
the indexes are then squashed to near-zero. Similarly, if I don't remove all the data but call repairDatabase(), the indexes are squashed somewhat (perhaps because unused extents are truncated?). I am getting the index size from "totalIndexSize" in db.collection.stats().
However, that takes a long time (I've read it could be hours on a large database). It's unclear to me how available the database is for reads or writes while it is running. I am guessing not so available.
Since I want to run as few instances of mongod as possible, I want to understand more about how indexes are managed after deletes. Can anyone point me to anything or give any advice?
To summarize David's linked question:
No way except db.repairDatabase().
If you need to minimize downtime, set up a master/slave configuration, and "repairDatabase" the slave, then switch slave and master.
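If you are driving this from .NET rather than the mongo shell, a minimal sketch of checking totalIndexSize and issuing repairDatabase might look like the following; it assumes the official MongoDB .NET driver and an older server that still supports the repairDatabase command, and the connection string, database, and collection names are placeholders.

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class IndexSizeCheck
{
    static void Main()
    {
        var client = new MongoClient("mongodb://localhost:27017"); // placeholder connection string
        var db = client.GetDatabase("mydb");                       // placeholder database name

        // Equivalent of db.mycollection.stats(): read totalIndexSize from collStats.
        var stats = db.RunCommand<BsonDocument>(new BsonDocument("collStats", "mycollection"));
        Console.WriteLine("totalIndexSize: " + stats["totalIndexSize"]);

        // Equivalent of db.repairDatabase(); expect the database to be effectively
        // unavailable while this runs, as discussed above.
        db.RunCommand<BsonDocument>(new BsonDocument("repairDatabase", 1));
    }
}
```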
I think Reducing MongoDB database file size answers your question.