Best way to keep index real time? - optimization

I have a Solr/Lucene index of approximately 700 GB. The documents that I need to index arrive in real time: roughly 1000 docs are submitted every 30 minutes and need to be indexed. In my scenario a script runs every 30 minutes and indexes the documents that are not yet indexed, since it is a requirement that new documents be searchable as soon as possible, but this process slows down searching.
Is this the best way to index the latest documents, or is there a better way?

First, remember that Solr is not a real-time search engine (yet). There is still work to be done.
You can use a master/slave setup, where indexing is done on the master and searching on the slave. With this, indexing does not affect search performance. After the commit is done on the master, force the slave to fetch the latest index from the master. While the new index is being replicated to the slave, the slave keeps processing queries with the previous index.
Also, check your cache warming settings. Remember that warming can slow down searches if those settings are too aggressive. Also check the queries launched on the newSearcher event.

You can do this with Lucene easily. Split the index into multiple parts (or, to be precise, create "smaller" parts while building the index). Create a searcher for each part and store a reference to it. You can then create a MultiSearcher on top of these individual searchers.
Now, there will be only one index that will get the new documents. At regular intervals, add documents to this index, commit and re-open this searcher.
After the last index is updated, you can create a new multi-searcher again, using the previously opened searchers.
Thus, at any point, you will be re-opening only one searcher and that will be quite fast.

Check out Zoie (http://code.google.com/p/zoie/), a wrapper around Lucene that makes it real time. The code was donated by LinkedIn.

^^ I do this with plain Lucene (not Solr), and it works really nicely. However, I'm not sure whether there is a Solr way to do that at the moment. Twitter recently went with Lucene for searching and has effectively real-time search by just writing to the index on every update. Their index resides completely in memory, so updating and reading the index is of no consequence and happens instantly. A Lucene index can always be read while it is being written to, as long as there is only one writer at a time.

Check out this wiki page

Related

lucene documents disappeared (index size shrinks)

I have a cluster of servers indexing documents, with the Lucene indexes located on a network drive. I synchronize IndexWriter creation so that only one instance exists at a time. About 50 documents are added to the index every second, so the index is actively updated. But within a day or so, the index suddenly shrinks in size, going from a few GB to almost empty, then starts growing again. It appears that all documents in the index are suddenly deleted and a new index is being built. I checked the index with Luke and could not find any documents labelled as deleted. I have not been able to catch exactly when this happens. But when it happens, all the files in the index directory have a new timestamp within the past few minutes, except the write.lock file, which has the timestamp of when the index was first created. There are no errors during indexing, and the indexes (before and after the size shrinks) are normal (searchable).
Has anyone seen anything like this? Any idea what is causing this behavior? I am using Lucene 6.1 on Windows Server 2012.
Thanks a lot in advance!

Solr tlog extremely large, not merging with index after commits

I am in the process of a bulk indexing operation into a Solr 5.0 collection that now holds approx. 200M documents. I am noticing that the tlog is building up and is not being deleted; additionally, indexing performance has gotten really slow. I am wondering why the tlog is not being removed. This is what the data directory looks like:
du -sh *
4.0K data
69G index
109G tlog
I've tried multiple variations of:
update?commit=true&expungeDeletes=true&openSearcher=true
I see in the log file that Solr is picking it up, but there are no changes.
The commit settings in solrconfig are:
<autoCommit>
<maxTime>15000</maxTime>
<maxDocs>1500000</maxDocs>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>900000</maxTime>
<maxDocs>2000000</maxDocs>
</autoSoftCommit>
One thing to keep in mind is that I had soft commit commented out during the indexing process. Also, these values are pretty high because this is a relatively index-heavy collection with pretty controlled querying, so the commit strategy is pretty relaxed.
I restarted Solr and naturally it is taking forever to start because it is replaying the tlog; I am not sure whether it will clear this up once fully started. I am under the impression that Solr keeps some tlogs around in case it needs to replicate the data to another collection, but this is a standalone instance, so that is not really necessary. Additionally, since the tlog is larger than the index folder, I am assuming there are items not yet committed to the main index. Is that right?
Any idea what's happening here?
So I thought I'd pass along an update, even though it's a bit late.
I restarted the Solr instance; naturally it took about 4 hours to start up since the tlogs had to be replayed. They were then purged after a commit.

Postgres Paginating a FTS Query

What is the best way to paginate an FTS query? LIMIT and OFFSET spring to mind. However, I am concerned that by using LIMIT and OFFSET I'd be running the same query over and over (i.e., once for page 1, another time for page 2, etc.).
Will PostgreSQL be smart enough to transparently cache the query result, and then satisfy the subsequent pagination queries from that cache? If not, how do I paginate efficiently?
edit
The database is for single-user desktop analytics. But I still want to know what the best way would be if this were a live OLTP application. I have addressed the problem in the past with SQL Server by creating an ordered set of document IDs and caching the query parameters against the IDs in a separate table, clearing the cache every few hours (so as to allow new documents to enter the result set).
Perhaps this approach is viable for postgres. But still I wanna know the mechanics present in the database and how best to leverage them. If I were a DB developer I'd enable the query-response cache to work with the FTS system.
A server-side SQL cursor can be effectively used for this if a client session can be tied to a specific db connection that stays open during the entire session. This is because cursors cannot be shared between different connections. But if it's a desktop app with a unique connection per running instance, that's fine.
The doc for DECLARE CURSOR explains how the resultset is going to be materialized when the cursor is declared WITH HOLD in a committed transaction.
Locking shouldn't be a concern at all. Should the data be modified while the cursor is already materialized, it wouldn't affect the reader nor block the writer.
Other than that, there is no implicit query cache in PostgreSQL. The LIMIT/OFFSET technique implies a new execution of the query for each page, which may be as slow as the initial query depending on the complexity of the execution plan and the effectiveness of the buffer cache and disk cache.
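As a rough illustration of the server-side cursor approach described above, here is a minimal Java sketch that only builds the SQL statements a session would send over its single connection. The class name, cursor name, and the example table and columns (docs, tsv, title) are hypothetical; the sketch assumes a SCROLL cursor so that pages can be revisited with MOVE ABSOLUTE, and it does not execute anything against a database.

```java
import java.util.List;

// Builds the SQL for paging through a held, scrollable cursor.
// All names passed in (cursor, table, columns) are the caller's and hypothetical here.
final class HeldCursorPager {
    private final String cursor;
    private final int pageSize;

    HeldCursorPager(String cursor, int pageSize) {
        if (!cursor.matches("[a-z_][a-z0-9_]*")) {
            throw new IllegalArgumentException("unsafe cursor name: " + cursor);
        }
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        this.cursor = cursor;
        this.pageSize = pageSize;
    }

    // Run once per session. WITH HOLD lets the cursor outlive the transaction;
    // its rows are materialized when the transaction commits.
    List<String> open(String query) {
        return List.of(
            "BEGIN",
            "DECLARE " + cursor + " SCROLL CURSOR WITH HOLD FOR " + query,
            "COMMIT");
    }

    // Statements for a zero-based page: position just before the page, then fetch it.
    List<String> page(int page) {
        long offset = (long) page * pageSize;
        return List.of(
            "MOVE ABSOLUTE " + offset + " IN " + cursor,
            "FETCH FORWARD " + pageSize + " FROM " + cursor);
    }

    // Held cursors stay open until closed or until the session ends.
    String close() {
        return "CLOSE " + cursor;
    }

    public static void main(String[] args) {
        HeldCursorPager pager = new HeldCursorPager("fts_results", 20);
        pager.open("SELECT id, title FROM docs WHERE tsv @@ to_tsquery('lucene') ORDER BY id")
             .forEach(System.out::println);
        pager.page(2).forEach(System.out::println);
        System.out.println(pager.close());
    }
}
```

The query runs once, when the cursor is declared; each page request afterwards only moves within the materialized result, which is the difference from re-running LIMIT/OFFSET.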
Well, to be honest, what you may want is for your query to return a live cursor that you can then reuse to fetch certain portions of the results it represents. Now, I don't know if PostgreSQL supports this. MongoDB does, and I've tried going down that road, but it's not cool. For example: do you know how much time will pass between when a query is done and when a second page of results from that query is requested? Can the cursor stay open for that amount of time? And if it can, what does that mean exactly? Will it block resources, such that if you have many lazy users who start queries but take a long time to navigate through pages, your server might be bogged down by locked cursors?
Honestly, I think redoing a paginated query each time someone asks for a certain page is OK. First of all, you'll be returning a small number of entries (no need to display more than 10-20 entries at a time) and that's going to be pretty fast. Second, you should rather tune your server so that it executes frequent requests fast (add indexes, put it behind a Solr server if necessary, etc.) than let those queries run slowly and cache them.
Finally, if you really want to speed up full text searches, and have fancy indexes like case insensitive, prefix and suffix enabled, etc, you should take a look at Lucene or better yet Solr (which is Lucene on steroids) as an in-between search and indexing solution between your users and your persistence tier.

LockObtainFailedException updating Lucene search index using solr

I've googled this a lot. Most of these issues are caused by a lock being left around after a JVM crash. This is not my case.
I have an index with multiple readers and writers. I am trying to do a mass index update (delete and add -- that's how Lucene does updates). I'm using Solr's embedded server (org.apache.solr.client.solrj.embedded.EmbeddedSolrServer). Other writers are using the remote, non-streaming server (org.apache.solr.client.solrj.impl.CommonsHttpSolrServer).
I kick off this mass update, it runs fine for a while, then dies with a
Caused by:
org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock#/.../lucene-ff783c5d8800fd9722a95494d07d7e37-write.lock
I've adjusted my lock timeouts in solrconfig.xml
<writeLockTimeout>20000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I'm about to start reading the lucene code to figure this out. Any help so I don't have to do this would be great!
EDIT: All my updates go through the following code (Scala):
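import org.apache.solr.client.solrj.request.{AbstractUpdateRequest, UpdateRequest}

val req = new UpdateRequest
// Commit as part of this request; waitFlush = false, waitSearcher = false
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false)
req.add(docs) // docs: a java.util.Collection of SolrInputDocument
val rsp = req.process(solrServer)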
solrServer is an instance of org.apache.solr.client.solrj.impl.CommonsHttpSolrServer, org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer, or org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.
ANOTHER EDIT:
I stopped using EmbeddedSolrServer and it works now. I have two separate processes that update the solr search index:
1) Servlet
2) Command line tool
The command line tool was using the EmbeddedSolrServer and it would eventually crash with the LockObtainFailedException. When I started using StreamingUpdateSolrServer, the problems went away.
I'm still a little confused that the EmbeddedSolrServer would work at all. Can someone explain this? I thought that it would play nicely with the Servlet process and that each would wait while the other is writing.
I'm assuming that you're doing something like:
writer1.writeSomeStuff();
writer2.writeSomeStuff(); // this one doesn't write
The reason this won't work is because the writer stays open unless you close it. So writer1 writes and holds on to the lock, even after it's done writing. (Once a writer gets a lock, it never releases until it's destroyed.) writer2 can't get the lock, since writer1 is still holding onto it, so it throws a LockObtainFailedException.
If you want to use two writers, you'd need to do something like:
writer1.writeSomeStuff();
writer1.close();
writer2.open();
writer2.writeSomeStuff();
writer2.close();
Since you can only have one writer open at a time, this pretty much negates any benefit you would get from using multiple writers. (It's actually much worse to open and close them all the time since you'll be constantly paying a warmup penalty.)
So the answer to what I suspect is your underlying question is: don't use multiple writers. Use a single writer with multiple threads accessing it (IndexWriter is thread safe). If you're connecting to Solr via REST or some other HTTP API, a single Solr writer should be able to handle many requests.
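To make the "one writer, many threads" advice concrete, here is a minimal Java sketch that uses only the standard library. SharedWriter is a hypothetical stand-in for a thread-safe writer such as Lucene's IndexWriter; it is not the Lucene API, and the thread and document counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SingleWriterDemo {
    // Hypothetical stand-in for a thread-safe index writer.
    static final class SharedWriter implements AutoCloseable {
        private final List<String> docs = new ArrayList<>();

        synchronized void addDocument(String doc) { docs.add(doc); }

        synchronized int count() { return docs.size(); }

        @Override
        public void close() { /* a real writer would release write.lock here */ }
    }

    // Indexes threads * docsPerThread documents through ONE writer instance,
    // opened once and closed once, and returns how many documents it holds.
    static int indexAll(int threads, int docsPerThread) {
        try (SharedWriter writer = new SharedWriter()) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int id = t;
                pool.submit(() -> {
                    for (int i = 0; i < docsPerThread; i++) {
                        writer.addDocument("doc-" + id + "-" + i);
                    }
                });
            }
            pool.shutdown();
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("indexing timed out");
            }
            return writer.count();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(indexAll(4, 250) + " documents indexed by 4 threads through one writer");
    }
}
```

The point of the design is that the lock is taken once for the lifetime of the writer, so the worker threads never compete for write.lock.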
I'm not sure what your use case is, but another possible answer is to see Solr's Recommendations for managing multiple indices. Particularly the ability to hot-swap cores might be of interest.
>> But you have multiple Solr servers writing to the same location, right?
No, wrong. Solr uses the Lucene libraries, and "Lucene in Action" * states that only one writer can have the index open at a time. That is why the writer takes a lock.
Your concurrent processes that are trying to write could, perhaps, check for the org.apache.lucene.store.LockObtainFailedException exception when instantiating the writer.
You could, for instance, put the process that instantiates writer2 in a waiting loop to wait until the active writing process finishes and issues writer1.close(); which will then release the lock and make the Lucene index available for writing again. Alternatively, you could have multiple Lucene indexes (in different locations) being written to concurrently and when doing a search you would need to search through all of them.
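The waiting loop could look roughly like the following Java sketch. It is standard-library only: LockBusyException is a hypothetical stand-in for org.apache.lucene.store.LockObtainFailedException, WriterOpener stands in for whatever code instantiates the second writer, and the attempt count and wait time are arbitrary.

```java
final class LockRetry {
    // Hypothetical stand-in for Lucene's LockObtainFailedException.
    static final class LockBusyException extends Exception {
        LockBusyException(String message) { super(message); }
    }

    // Hypothetical hook for the code that instantiates the writer.
    interface WriterOpener<W> {
        W open() throws LockBusyException;
    }

    // Tries to open the writer, sleeping between attempts while the lock is held.
    // Gives up with an IllegalStateException after maxAttempts failed attempts.
    static <W> W openWithRetry(WriterOpener<W> opener, int maxAttempts, long waitMillis) {
        for (int attempt = 1; ; attempt++) {
            try {
                return opener.open();
            } catch (LockBusyException e) {
                if (attempt >= maxAttempts) {
                    throw new IllegalStateException(
                        "index still locked after " + attempt + " attempts", e);
                }
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for lock", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulates writer1 holding the lock for the first two attempts.
        String writer = openWithRetry(() -> {
            if (++calls[0] < 3) throw new LockBusyException("write.lock is held");
            return "writer2";
        }, 5, 10);
        System.out.println(writer + " opened after " + calls[0] + " attempts");
    }
}
```

Bounding the number of attempts matters: if the other writer never closes, an unbounded loop would hang the process instead of reporting the problem.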
* "In order to enforce a single writer at a time, which means an IndexWriter or an IndexReader doing deletions or changing norms, Lucene uses a file-based lock: If the lock file (write.lock, by default) exists in your index directory, a writer currently has the index open. Any attempt to create another writer on the same index will hit a LockObtainFailedException. This is a vital protection mechanism, because if two writers are accidentally created on a single index, it will very quickly lead to index corruption."
Section 2.11.3, Lucene in Action, Second Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić, 2010

Solr Index slow after a while

I use SolrJ to send data to my Solr server.
When I start my program, it indexes at a rate of about 1000 docs per second (I commit every 250,000 docs).
I have noticed that once my index holds about 5 million docs, it starts crawling, not just at commit time but at add time too.
My Solr server and indexing program run on the same machine.
Here are some of the relevant portions from my solrconfig:
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>1024</ramBufferSizeMB>
<mergeFactor>150</mergeFactor>
Any suggestions about how to fix this?
That merge factor seems really, really (really) high.
Do you really want that?
If you aren't using compound files, that could easily lead to a ulimit problem (if you are on Linux).
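A back-of-the-envelope estimate shows why this can hit the open-file limit. The Java sketch below is only arithmetic: it assumes roughly 10 files per non-compound segment (the real number depends on the Lucene version and the index features in use), assumes up to mergeFactor segments can accumulate at each merge level, and picks 2 levels purely for illustration.

```java
final class OpenFileEstimate {
    // Rough upper bound on index files: each merge level can hold up to
    // mergeFactor segments before they are merged into one larger segment.
    static long estimate(int mergeFactor, int levels, int filesPerSegment) {
        return (long) mergeFactor * levels * filesPerSegment;
    }

    public static void main(String[] args) {
        int filesPerSegment = 10; // assumption, not a measured value
        int levels = 2;           // assumption, for illustration only
        System.out.println("mergeFactor=150: ~" + estimate(150, levels, filesPerSegment) + " files");
        System.out.println("mergeFactor=10:  ~" + estimate(10, levels, filesPerSegment) + " files");
    }
}
```

Under these assumptions a mergeFactor of 150 gives on the order of 3000 files versus about 200 for a mergeFactor of 10. Compare that with the output of ulimit -n, which is commonly 1024 by default on Linux.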