lucene.net - how to update an index very frequently?

I have an Azure WebJob with a queue that receives items to process. There can be many items every second, and the queue processes around 20 items simultaneously.
I want to index the items with Lucene.NET.
Opening an IndexWriter, calling Optimize(), and disposing it for every item that hits the queue takes too much time. It feels like I am doing it wrong.
I want the items to be ready for search as soon as possible.
Is it OK to have one IndexWriter shared by many threads?
Do I need to call Optimize(), or is it OK to never call it, or to call it from a separate process that runs once a day (for example)?
If I have only one IndexWriter and never dispose it (except when the program exits), would new items get stuck in the buffer?
Would new items added with the IndexWriter be available for search before disposing it?
Thank you.

IndexWriter is thread-safe; it's fine to share one instance across threads.
It's okay to never call Optimize(). (You could write a custom merge policy if the default doesn't work for you.)
Calling Commit() flushes all buffered documents to disk. There's no need to dispose of your writer; reuse it instead.
Documents become searchable once a reader sees them, which happens after you commit your writer and reopen your reader. You can search them before they are committed by using near-real-time (NRT) search, grabbing a reader via IndexWriter.GetReader().
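The commit-versus-NRT distinction can be sketched against the Lucene API. This is a fragment, not a runnable program: it assumes the Lucene library is available and that `directory`, `analyzer`, and `doc` were created elsewhere. The calls shown are the Lucene 4.x Java names; Lucene.NET 4.8 mirrors them closely (e.g. DirectoryReader.OpenIfChanged).

```
// Assumes: directory, analyzer, doc created elsewhere; requires the Lucene library.
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));

// Near-real-time reader: opened from the writer, it can see
// buffered documents without waiting for a full commit.
DirectoryReader reader = DirectoryReader.open(writer, true);

writer.addDocument(doc);   // analyzed and buffered; not yet visible to old readers

// Cheap reopen: returns a fresh reader if something changed, null otherwise.
DirectoryReader newer = DirectoryReader.openIfChanged(reader);
if (newer != null) {
    reader.close();
    reader = newer;        // searches on this reader now see the new document
}

writer.commit();           // durable flush + fsync; do this periodically, not per item
```

The usual pattern for a busy queue is exactly this: one long-lived writer, addDocument per item, a periodic reopen for visibility, and a commit on a timer for durability.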

Related

Clarification about the ReentrantReadWriteLock API

Directly from this API:
When constructed as fair, threads contend for entry using an
approximately arrival-order policy. When the currently held lock is
released either the longest-waiting single writer thread will be
assigned the write lock, or if there is a group of reader threads
waiting longer than all waiting writer threads, that group will be
assigned the read lock.
It compares a single writing thread to a group of reader threads. What if there were only one waiting reader thread instead of a group, as the API describes? Would that change anything, or does the rule apply both to individual threads and to groups of threads?
Thanks in advance.
I'm 95% certain that "group" in this case can be read as "one or more". It should be easy enough to write a test for this. Harder, but also possible, is to crack open the Java source and see what it's doing.
The idea here is you can give the lock to 1 writer or 1+ readers at the same time. It's just trying to say that if there are multiple readers waiting before the next writer, they all get the lock at the same time. This is safe because they're just reading.
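To test the "group of one" reading, here is a small runnable sketch (class and method names are mine) using ReentrantReadWriteLock in fair mode: any number of readers, including a single one, can hold the lock together, and the write lock stays exclusive either way.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FairRwLockDemo {
    // Fair mode: waiting threads acquire the lock in roughly arrival order.
    static final ReentrantReadWriteLock rw = new ReentrantReadWriteLock(true);

    /** Starts n reader threads, waits until all of them hold the read lock
     *  at once, then reports whether a writer could get in meanwhile. */
    static boolean writerCanEnterWhileReading(int n) throws InterruptedException {
        CountDownLatch allReading = new CountDownLatch(n);
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < n; i++) {
            new Thread(() -> {
                rw.readLock().lock();               // readers share the lock
                try {
                    allReading.countDown();
                    release.await();
                } catch (InterruptedException ignored) {
                } finally {
                    rw.readLock().unlock();
                }
            }).start();
        }
        allReading.await();                         // n readers are inside together
        boolean entered = rw.writeLock().tryLock(); // write lock is exclusive
        if (entered) rw.writeLock().unlock();
        release.countDown();
        return entered;
    }

    public static void main(String[] args) throws InterruptedException {
        // Behaves the same for a "group" of one reader as for three.
        System.out.println(writerCanEnterWhileReading(3)); // false
        System.out.println(writerCanEnterWhileReading(1)); // false
    }
}
```

In both calls the writer is shut out, which supports reading "group" as "one or more readers".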

Is Lucene IndexWriter.close() a must?

My program was running so slowly that I had to terminate it halfway through to optimize some of the code, right after 40,000 documents had been inserted into the database BUT before IndexWriter.close() was called. Afterwards I couldn't find any results for some of the records, which appear to be the 40,000 documents from that particular run.
Does that mean the records I indexed during that run were lost? Must an IndexWriter always be properly closed for the data to be written to the index?
Thanks in advance!
It's not close that you need to call, but commit. addDocument only analyzes the document and buffers data in memory, while commit flushes pending changes and performs an fsync.
close calls commit internally; I think that's why you assume close is required.
However, beware of calling commit too often, as it is very costly compared to addDocument.

Does writeToFile:atomically: block asynchronous reading?

A few times while using my application, I process some large data in the background (to have it ready when the user needs it; something like indexing). When this background process finishes, it needs to save the data to a cache file, but since the data is really large, this takes a few seconds.
At the same time, the user may open a dialog which displays images and text loaded from disk. If this happens while the background data is being saved, the user interface has to wait until the save is complete. (This is unwanted, since the user then has to wait 3-4 seconds for the images and text to load!)
So I am looking for a way to throttle the writing to disk. I thought of splitting the data into chunks and inserting a short delay between saving the chunks. During this delay, the user interface would be able to load the needed text and images, so the user would not notice a delay.
At the moment I am using [[array componentsJoinedByString:@"\n"] writeToFile:@"some name.dic" atomically:YES]. This is a very high-level solution which doesn't allow any customization. How can I write this large data to one file without saving it all in one shot?
Does writeToFile:atomically: block asynchronous reading?
No. The atomic variant writes to a temporary file first; once that completes successfully, it renames the temporary file to the destination (replacing the pre-existing file at the destination, if it exists).
You should consider how you can break your data up so the write is not so slow. If it's all divided by strings/lines and it takes seconds, an easy way to divide the database would be by first character. Of course, a better split can likely be devised based on how you access, search, and update the index/database.
…inserting a short delay between saving the different chunks. In this delay, the user interface will be able to load the needed texts and images, so the user will not recognize a delay.
Don't. Just implement the move/replace of the atomic write yourself (writing to a temporary file during index and write). Then your app can serialize read and write commands explicitly for fast, consistent and correct accesses to these shared resources.
Have a look at the NSFileHandle class.
Using a combination of seekToEndOfFile and writeData:, you can append the data in chunks, as you wish.
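Both answers combine into one language-agnostic pattern: write in chunks to a temporary file, yielding briefly between chunks, then atomically rename the temp file into place so readers never see a half-written cache. A sketch in Java with illustrative names (Thread.sleep stands in for whatever yielding mechanism your platform uses; on iOS/macOS the equivalents are writeToFile:atomically: and NSFileHandle):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class ChunkedAtomicWriter {
    /** Writes lines to `target` in chunks via a temp file, then atomically
     *  renames it into place, so readers never observe a partial file. */
    static void writeChunked(Path target, List<String> lines, int chunkSize)
            throws IOException, InterruptedException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try (BufferedWriter out = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            for (int i = 0; i < lines.size(); i++) {
                out.write(lines.get(i));
                out.write('\n');
                if ((i + 1) % chunkSize == 0) {
                    out.flush();
                    Thread.sleep(1); // brief pause: lets readers get disk time
                }
            }
        }
        // Readers see either the old file or the complete new one, never a mix.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }
}
```

The rename is the same trick atomically:YES uses internally; doing it yourself just lets you control the chunking and pacing of the slow part.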

How to handle multiple IndexWriter and multiple cross-process IndexWriter

I have been googling for two days... I really need help.
I have an application with multiple threads trying to update a Lucene index, opening and closing an IndexWriter for each update. Threads can start at any time.
Yes, the write.lock problem! There seem to be two (or more) possible solutions:
1) Check IndexWriter.IsLocked(index) and, if it is locked, put the thread to sleep.
2) Open one IndexWriter and never close it. The problem is that I have another application using the same index. Also, when should I close the index and finalize the whole process?
Here are some interesting related posts:
Lucene IndexWriter thread safety
Lucene - open a closed IndexWriter
Update:
Exactly 2 years later, I built a REST API which wraps this, and all writes and reads are now routed through that API.
Multiple threads, same process:
Keep your IndexWriter open and share it across multiple threads; IndexWriter is thread-safe.
For multiple processes, your solution #1 is prone to race conditions. You would need a named Mutex to implement it safely:
http://msdn.microsoft.com/en-us/library/bwe34f1k.aspx
That being said, I'd personally opt for a process dedicated to writing to the index, and communicate with it using something like WCF.
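The linked named-Mutex approach is .NET-specific. An analogous cross-process pattern, sketched here in Java with illustrative names, is an OS-level file lock, which is in fact how Lucene's own NativeFSLock guards write.lock: try to take the lock, and if another process holds it, back off instead of opening a second writer.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CrossProcessLock {
    /** Tries to acquire an exclusive OS-level lock on the channel's file.
     *  Returns null if another process already holds the lock. */
    static FileLock tryAcquire(FileChannel channel) throws IOException {
        return channel.tryLock();
    }

    public static void main(String[] args) throws IOException {
        Path lockFile = Path.of("index.write.lock"); // illustrative path
        try (FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = tryAcquire(ch);
            if (lock == null) {
                System.out.println("another process holds the index; backing off");
                return;
            }
            try {
                // Safe to open the IndexWriter and write here.
            } finally {
                lock.release();
            }
        }
    }
}
```

Note that within a single JVM a second tryLock on the same file throws OverlappingFileLockException rather than returning null, which is one more reason to funnel all same-process writes through one shared writer and reserve the file lock for cross-process coordination.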

LockObtainFailedException updating Lucene search index using solr

I've googled this a lot. Most of these issues are caused by a lock being left around after a JVM crash. This is not my case.
I have an index with multiple readers and writers. I am trying to do a mass index update (delete and add - that's how Lucene does updates). I'm using Solr's embedded server (org.apache.solr.client.solrj.embedded.EmbeddedSolrServer). Other writers are using the remote, non-streaming server (org.apache.solr.client.solrj.impl.CommonsHttpSolrServer).
I kick off this mass update; it runs fine for a while, then dies with:
Caused by:
org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock#/.../lucene-ff783c5d8800fd9722a95494d07d7e37-write.lock
I've adjusted my lock timeouts in solrconfig.xml
<writeLockTimeout>20000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I'm about to start reading the lucene code to figure this out. Any help so I don't have to do this would be great!
EDIT: All my updates go through the following code (Scala):
val req = new UpdateRequest
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false)
req.add(docs)
val rsp = req.process(solrServer)
solrServer is an instance of org.apache.solr.client.solrj.impl.CommonsHttpSolrServer, org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer, or org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.
ANOTHER EDIT:
I stopped using EmbeddedSolrServer and it works now. I have two separate processes that update the solr search index:
1) Servlet
2) Command line tool
The command line tool was using the EmbeddedSolrServer and it would eventually crash with the LockObtainFailedException. When I started using StreamingUpdateSolrServer, the problems went away.
I'm still a little confused that the EmbeddedSolrServer would work at all. Can someone explain this? I thought it would play nicely with the Servlet process, each waiting while the other is writing.
I'm assuming that you're doing something like:
writer1.writeSomeStuff();
writer2.writeSomeStuff(); // this one doesn't write
The reason this won't work is that the writer stays open unless you close it. So writer1 writes and holds on to the lock, even after it's done writing. (Once a writer gets the lock, it never releases it until it's closed.) writer2 can't get the lock since writer1 is still holding it, so it throws a LockObtainFailedException.
If you want to use two writers, you'd need to do something like:
writer1.writeSomeStuff();
writer1.close();
writer2.open();
writer2.writeSomeStuff();
writer2.close();
Since you can only have one writer open at a time, this pretty much negates any benefit you would get from using multiple writers. (It's actually much worse to open and close them all the time since you'll be constantly paying a warmup penalty.)
So the answer to what I suspect is your underlying question is: don't use multiple writers. Use a single writer with multiple threads accessing it (IndexWriter is thread safe). If you're connecting to Solr via REST or some other HTTP API, a single Solr writer should be able to handle many requests.
I'm not sure what your use case is, but another possible answer is to see Solr's Recommendations for managing multiple indices. Particularly the ability to hot-swap cores might be of interest.
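The single-writer advice above can be sketched as a pattern (names are illustrative, and a plain List stands in for the real index): many threads enqueue documents, while one dedicated thread drains the queue and performs all writes, so no second writer ever contends for the lock.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

public class SingleWriterQueue {
    private static final String STOP = "__stop__";  // sentinel; assumes no real doc equals it
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();
    private final List<String> index = new CopyOnWriteArrayList<>(); // stand-in for the index
    private final Thread writerThread;

    SingleWriterQueue() {
        // One dedicated thread owns the "writer"; producers only enqueue.
        writerThread = new Thread(() -> {
            try {
                String doc;
                while (!(doc = pending.take()).equals(STOP)) {
                    index.add(doc); // real code would call indexWriter.addDocument(...)
                }
            } catch (InterruptedException ignored) {
            }
        });
        writerThread.start();
    }

    /** Safe to call from any number of threads. */
    void submit(String doc) {
        pending.add(doc);
    }

    /** Stops the writer thread and returns what was "indexed". */
    List<String> shutdownAndGet() throws InterruptedException {
        pending.add(STOP);
        writerThread.join();
        return index;
    }
}
```

This is essentially what routing all writes through one Solr instance (or one service wrapping an IndexWriter) gives you: serialization at the writer without the callers ever touching the lock.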
>> But you have multiple Solr servers writing to the same location, right?
No, wrong. Solr is using the Lucene libraries and it is stated in "Lucene in Action" * that there can only be one process/thread writing to the index at a time. That is why the writer takes a lock.
Your concurrent processes that are trying to write could, perhaps, check for the org.apache.lucene.store.LockObtainFailedException exception when instantiating the writer.
You could, for instance, put the process that instantiates writer2 in a waiting loop until the active writing process finishes and issues writer1.close(), which releases the lock and makes the Lucene index available for writing again. Alternatively, you could have multiple Lucene indexes (in different locations) written to concurrently, and when searching you would need to search across all of them.
* "In order to enforce a single writer at a time, which means an IndexWriter or an IndexReader doing deletions or changing norms, Lucene uses a file-based lock: If the lock file (write.lock, by default) exists in your index directory, a writer currently has the index open. Any attempt to create another writer on the same index will hit a LockObtainFailedException. This is a vital protection mechanism, because if two writers are accidentally created on a single index, it will very quickly lead to index corruption."
Section 2.11.3, Lucene in Action, Second Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić, 2010