I have been googling for two days... I really need help.
I have an application with multiple threads updating a Lucene index, opening and closing an IndexWriter for each update. Threads can start at any time.
Yes, the write.lock problem! There seem to be two (or more) possible solutions:
1) Check IndexWriter.IsLocked(index) and, if it is locked, put the thread to sleep.
2) Open an IndexWriter and never close it. The problem is that I have another application using the same index. Also, when should I close the index and finalize the whole process?
Here are some interesting postings:
Lucene IndexWriter thread safety
Lucene - open a closed IndexWriter
Update:
Exactly two years later, I built a REST API that wraps all of this, and all writes and reads are now routed through it.
Multiple threads, same process:
Keep your IndexWriter open and share it across your threads; IndexWriter is thread-safe.
For multiple processes, your solution #1 is prone to race conditions: the index can become locked between your IsLocked check and your attempt to open the writer. You would need a named Mutex to implement it safely:
http://msdn.microsoft.com/en-us/library/bwe34f1k.aspx
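A minimal sketch of that approach, assuming Lucene.Net 3.0.3 (the mutex name, index path, and the CrossProcessIndexing/WriteSafely names are all illustrative, not from the original):

using System;
using System.IO;
using System.Threading;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Cross-process serialization of index writes with a named Mutex.
// Every process that writes to the index must agree on the mutex name.
public static class CrossProcessIndexing
{
    public static void WriteSafely(Action<IndexWriter> work)
    {
        using (var mutex = new Mutex(false, @"Global\MyApp.LuceneIndex"))
        {
            mutex.WaitOne(); // blocks until no other process holds the mutex
            try
            {
                using (var dir = FSDirectory.Open(new DirectoryInfo(@"C:\indexes\main")))
                using (var writer = new IndexWriter(dir,
                    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED))
                {
                    work(writer);
                    writer.Commit();
                }
            }
            finally
            {
                mutex.ReleaseMutex();
            }
        }
    }
}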
That being said, I'd personally opt for a process dedicated to writing to the index, and communicate with it using something like WCF.
My organization has recently had trouble with some SQL Server blocking processes. dbWarden has successfully reported blocking to us, but we often have the blocking SQL text reported as 'FETCH API_CURSOR'.
So, we're looking to alter the blocking alerts trigger in dbWarden to use sys.dm_exec_cursors and sys.dm_exec_sql_text to retrieve the text in the case where we find 'FETCH API_CURSOR' reported.
Trouble is, I cannot seem to come up with a way to recreate/simulate a blocking situation on our development server that will report as 'FETCH API_CURSOR'. I've started from the VB script here on SQL Authority to recreate the open cursor, but I cannot for the life of me figure out how to make it blocking.
I've seen many methods for recreating blocking transactions (open a transaction in one window but do not commit/close it, then try an update on the same table in another), but none that would utilize FETCH API_CURSOR in a way that would let us test successfully. I'm somewhat at a loss here.
Has anyone had success in simulating blocking cursors in the past and can offer suggestions?
I'd suggest using the Profiler tool to capture the actual code that creates and fetches the cursor. That way you'd see exactly what's going on in the application, and it is not so difficult to reproduce similar blocking on a development server.
Let's say one thread fetches rows from a cursor and another thread tries to UPDATE the same rows. Here's what's going on under the hood. The reading thread creates a cursor to fetch the result of a SELECT back to the application. This technique is ancient and extremely slow, and nowadays only some old (mostly Java) applications still use cursors for this purpose. Rows get fetched one by one, with the client driving the process, so it takes time. During this time the reading thread holds shared (S) locks on the data it reads. This is by design: SQL Server is a locking database engine and uses locks to function properly.

If another thread tries to update a row held under a shared lock, it gets blocked. The updating thread takes update (U) locks while searching for rows to modify and must convert them to exclusive (X) locks to actually change the data, but an X lock cannot be granted while another session holds an S lock on the same row.

So I'd create a cursor, fetch several rows, and try to update the same rows in another tab. If you have difficulty reproducing the block, try raising the reading transaction's isolation level.
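A minimal sketch of that experiment from C# (table, connection string, and class names are placeholders). Note that the blocked statement will be reported as a plain UPDATE rather than 'FETCH API_CURSOR', since that text only appears when the reader is a server-side API cursor, but the S-lock/X-lock mechanics are the same:

using System;
using System.Data;
using System.Data.SqlClient;

// Two sessions: one holds shared locks open (as a slow cursor fetch would),
// the other tries to UPDATE the same rows and blocks.
class BlockingDemo
{
    const string Cs = "Server=devserver;Database=TestDb;Integrated Security=true"; // placeholder

    static void Main()
    {
        using (var readerCn = new SqlConnection(Cs))
        using (var writerCn = new SqlConnection(Cs))
        {
            readerCn.Open();
            writerCn.Open();

            // Session 1: REPEATABLE READ keeps the S locks until the
            // transaction ends, mimicking a cursor held open mid-fetch.
            using (var tx = readerCn.BeginTransaction(IsolationLevel.RepeatableRead))
            {
                using (var select = new SqlCommand(
                    "SELECT * FROM dbo.SomeTable WHERE Id <= 10", readerCn, tx))
                using (var rows = select.ExecuteReader())
                {
                    while (rows.Read()) { /* S locks accumulate here */ }
                }

                // Session 2: needs X locks on the same rows, so it blocks
                // until the transaction above commits or the timeout fires.
                using (var update = new SqlCommand(
                    "UPDATE dbo.SomeTable SET Name = 'x' WHERE Id = 1", writerCn))
                {
                    update.CommandTimeout = 5; // fail fast so the demo terminates
                    try { update.ExecuteNonQuery(); }
                    catch (SqlException e) { Console.WriteLine("Blocked: " + e.Message); }
                }

                tx.Commit();
            }
        }
    }
}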
But seriously, I don't think you have to reproduce this or a similar scenario on a development server. Stop using cursors for recordset fetching! Reads will be much faster and blocking issues will be greatly reduced. Cursors had their heyday around 1989; client DB-access libraries have evolved a lot since then, and it is worth picking the fruits of that progress. Even in Java, using them is a configuration option.
I apologize if the cursor is used on purpose in this case. That is improbable but possible; I haven't seen such 'proper' cursor usage in ages! I'd be delighted to run into one more proper case.
I have an Azure WebJob with a queue that receives items to process. There can be many items to process every second, and the queue handles around 20 items simultaneously.
I want to index the items with Lucene .net.
Starting an IndexWriter, calling Optimize(), and disposing it for every item that hits the queue takes too much time. It feels like I am doing it wrong.
I want the items to be ready for search as soon as possible.
Is it OK to have one IndexWriter for many threads?
Do I need to call Optimize(), or is it OK to never call it, or to call it in a separate process that runs once a day (for example)?
If I have only one IndexWriter and never dispose it (except when the program exits), would new items get stuck in the buffer?
Would new items added with the IndexWriter be available for search before the IndexWriter is disposed?
Thank you.
IndexWriter is thread-safe; it's safe to call from different threads.
It's okay to never call optimize. (You could write a custom merge policy if the default doesn't work for you.)
Calling commit flushes all pending documents to disk; there's no need to dispose of your writer. Reuse it instead.
Documents become searchable once a reader sees them, which happens after you commit your writer and reopen your reader. You can search them before they are committed by using near-real-time (NRT) search, grabbing a reader from IndexWriter.OpenReader.
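A minimal sketch of that pattern, assuming Lucene.Net 3.0.3 (the path, field names, and the SearchIndex class are illustrative; in that version the NRT reader comes from IndexWriter.GetReader()):

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

// One long-lived writer for the whole process, shared by all queue threads.
public static class SearchIndex
{
    private static readonly IndexWriter Writer = new IndexWriter(
        FSDirectory.Open(new DirectoryInfo(@"D:\data\index")), // placeholder path
        new StandardAnalyzer(Version.LUCENE_30),
        IndexWriter.MaxFieldLength.UNLIMITED);

    // Safe to call concurrently from the WebJob queue handlers.
    public static void Add(string id, string body)
    {
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
        Writer.AddDocument(doc); // thread-safe
    }

    // Call periodically (say, every few seconds or every N items), not per item.
    public static void Flush()
    {
        Writer.Commit();
    }

    // Near-real-time reader: sees buffered documents without waiting for a commit.
    public static IndexReader OpenNrtReader()
    {
        return Writer.GetReader();
    }
}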
Directly from the API documentation:
When constructed as fair, threads contend for entry using an approximately arrival-order policy. When the currently held lock is released either the longest-waiting single writer thread will be assigned the write lock, or if there is a group of reader threads waiting longer than all waiting writer threads, that group will be assigned the read lock.
It compares a single writer thread to a group of reader threads. What if there were only one waiting reader thread instead of a group, as the API describes? Would that change anything, or does the rule apply both to individual threads and to groups of threads?
Thanks in advance.
I'm 95% certain that "group" in this case can be read as "one or more". It should be easy enough to write a test for this. Harder, but also possible, is to crack open the Java source and see what it's doing.
The idea here is that you can give the lock either to one writer or to one or more readers at the same time. The documentation is just trying to say that if there are multiple readers waiting ahead of the next writer, they all get the lock at the same time. This is safe because they're only reading.
I've just started reading up on Lucene.net, and I would like some of my REST-based web services to use the powerful search facilities of Lucene.net.
However, I came across a link which said that I should create a Windows service (with WCF) to do all the Lucene searches/indexing, because IIS recycles the application pool, which will cause all sorts of locking issues.
My question is: is this correct? If so, is there another way of resolving this problem without creating a Windows service (with WCF)? Also, since I have REST-based services, would calling the Windows WCF service from them make things slower?
Indexing
During your reading you will have picked up that indexing is done using the IndexWriter class. Lucene allows only one IndexWriter instance to be open at a time. With the default locking it creates a lock file in the index directory, which prevents any other IndexWriter instance from being created. For this reason it may be better to implement indexing in a process that you have more control over.
If your indexing process is terminated with extreme prejudice and your IndexWriter does not get closed, the lock on your index folder is maintained and no other instance will be allowed. Because of this, Lucene allows you to lift a lock from an indexed folder (using IndexWriter.Unlock), a dangerous method, because if two IndexWriters are open on the same index it will corrupt the index. If you have a Windows service that performs the indexing, and it's the only process in your solution that does the indexing (and any updates), you can confidently unlock the index folder on startup of the service. In a web-service-based environment where you perform indexing from a web method, controlling and recovering from locking issues becomes problematic.
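A sketch of that startup check, assuming Lucene.Net 3.0.3 (the IndexStartup name and the path handling are illustrative):

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class IndexStartup
{
    // Run once on startup of the dedicated indexing Windows service.
    // Safe only because this service is the sole writer in the solution.
    public static void ClearStaleLock(string indexPath)
    {
        var dir = FSDirectory.Open(new DirectoryInfo(indexPath));
        if (IndexWriter.IsLocked(dir))
        {
            // A previous instance died without closing its writer;
            // lift the stale lock before opening our own IndexWriter.
            IndexWriter.Unlock(dir);
        }
    }
}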
Searching
The IndexSearcher class is used for searches. Searching, in read-only mode, can be done from your service-based code; I don't think it's necessary to create a separate set of WCF methods for this purpose.
Optimization
The index may need to be optimized periodically for performance, depending on the volumes. Again, with the indexing in a separate process you can schedule the optimization nightly, weekly, or whatever is required. Optimization is done by a call to one method.
Indexing new data
How and when to get the indexing process to index new data depends on what data you're indexing, so it's hard to say. In my scenario I have WCF methods that are responsible for input data at high volume, and I require the data that has been received to be available for searching as soon as possible. So:
My Model layer has a notification component: when new records of the required type have been successfully committed, a simple notification message is inserted into a local MSMQ queue.
The reason for MSMQ is that the queue is persistent and transactional, so any messages in it remain available even after a crash or system reboot, allowing me to never (cough!) lose any messages.
The indexing service takes the notification, builds the Lucene Document, and indexes the data (a sketch of both ends of this queue follows below).
The indexing service can also be triggered to do a full re-index by deleting the existing index and crawling the DB.
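A sketch of both ends of that queue using System.Messaging (the queue path, message shape, and class names are all illustrative):

using System.Messaging; // reference System.Messaging.dll

// Producer (Model layer): post a notification after a successful commit.
public static class IndexNotifier
{
    private const string QueuePath = @".\private$\index-notifications"; // placeholder

    public static void NotifyCommitted(int recordId)
    {
        if (!MessageQueue.Exists(QueuePath))
            MessageQueue.Create(QueuePath, transactional: true); // persisted + transactional

        using (var queue = new MessageQueue(QueuePath))
        {
            // Single-message internal transaction: survives crashes and reboots.
            queue.Send(recordId, MessageQueueTransactionType.Single);
        }
    }
}

// Consumer (indexing Windows service): block on the queue and index each item.
public static class IndexingLoop
{
    public static void Run()
    {
        using (var queue = new MessageQueue(@".\private$\index-notifications"))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(int) });
            while (true)
            {
                using (var message = queue.Receive(MessageQueueTransactionType.Single))
                {
                    var recordId = (int)message.Body;
                    // Load the record from the DB, build the Lucene Document,
                    // and hand it to the (single) IndexWriter here.
                }
            }
        }
    }
}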
EDIT:
Example architecture:
WCF service methods take in data and commit it to the Model layer. The Model layer notifies a listening client that a CRUD operation occurred successfully on the items. The listening client posts the notification to a queue.
A Windows service handles the indexing of the data, watching the queue for indexing requests.
An ASP.Net app provides the user interface with search features.
You can simply disable application pool recycling and host your application/service in IIS.
To disable recycling on config changes, use the disallowRotationOnConfigChange parameter.
You can also split your application in two parts: index updates and searches.
Handle index updates from a Windows service, and have the IIS portion handle searches (read-only). You would do this by having a mechanism that detects index updates and refreshes the IndexSearchers. This way, if the performance penalty of using services is a concern for you, it won't impact search time, which is the important aspect for the users. With this configuration you can even have a master index-update node and distribute searches across different web servers in a farm. The only downside is that you don't have the near-real-time search functionality that's built into the IndexWriter class.
http://wiki.apache.org/lucene-java/NearRealtimeSearch
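A sketch of the refresh mechanism, assuming Lucene.Net 3.0.3 (the RefreshingSearcher name and the update-detection trigger are illustrative):

using Lucene.Net.Index;
using Lucene.Net.Search;

// Detect index updates and refresh the searcher, per the setup above.
public class RefreshingSearcher
{
    private readonly object _sync = new object();
    private volatile IndexSearcher _searcher;

    public RefreshingSearcher(IndexReader initialReader)
    {
        _searcher = new IndexSearcher(initialReader);
    }

    // Call on a timer, or from a FileSystemWatcher on the index folder.
    public void MaybeRefresh()
    {
        lock (_sync)
        {
            var current = _searcher.IndexReader;
            var reopened = current.Reopen(); // returns the same instance if unchanged
            if (!ReferenceEquals(reopened, current))
            {
                _searcher = new IndexSearcher(reopened);
                // NOTE: dispose the old reader only after in-flight
                // searches on it have completed (reference-count it).
            }
        }
    }

    public IndexSearcher Current
    {
        get { return _searcher; }
    }
}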
That being said, I've never had performance issues with setups that have the Lucene functions exposed over a WCF service, especially if you're running either on the same machine with NetNamedPipe or on a local LAN with NetTcp.
I've googled this a lot. Most of these issues are caused by a lock being left behind after a JVM crash. That is not my case.
I have an index with multiple readers and writers, and I am trying to do a mass index update (delete and add, which is how Lucene does updates). I'm using Solr's embedded server (org.apache.solr.client.solrj.embedded.EmbeddedSolrServer); other writers are using the remote, non-streaming server (org.apache.solr.client.solrj.impl.CommonsHttpSolrServer).
I kick off this mass update, and it runs fine for a while, then dies with:
Caused by:
org.apache.lucene.store.LockObtainFailedException:
Lock obtain timed out:
NativeFSLock#/.../lucene-ff783c5d8800fd9722a95494d07d7e37-write.lock
I've adjusted my lock timeouts in solrconfig.xml:
<writeLockTimeout>20000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>
I'm about to start reading the Lucene code to figure this out. Any help so I don't have to do that would be great!
EDIT: All my updates go through the following code (Scala):
import org.apache.solr.client.solrj.request.{AbstractUpdateRequest, UpdateRequest}

val req = new UpdateRequest
// Commit as part of the request; don't wait for the flush or a new searcher.
req.setAction(AbstractUpdateRequest.ACTION.COMMIT, false, false)
req.add(docs)
val rsp = req.process(solrServer)
solrServer is an instance of org.apache.solr.client.solrj.impl.CommonsHttpSolrServer, org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer, or org.apache.solr.client.solrj.embedded.EmbeddedSolrServer.
ANOTHER EDIT:
I stopped using EmbeddedSolrServer and it works now. I have two separate processes that update the solr search index:
1) Servlet
2) Command line tool
The command line tool was using the EmbeddedSolrServer and it would eventually crash with the LockObtainFailedException. When I started using StreamingUpdateSolrServer, the problems went away.
I'm still a little confused as to why the EmbeddedSolrServer worked at all. Can someone explain this? I thought it would play nicely with the servlet process, with each waiting while the other writes.
I'm assuming that you're doing something like:
writer1.writeSomeStuff();
writer2.writeSomeStuff(); // this one doesn't write: it can't obtain the lock
The reason this won't work is that the writer stays open until you close it. So writer1 writes and holds on to the lock even after it's done writing. (Once a writer gets the lock, it never releases it until it's destroyed.) writer2 can't get the lock, since writer1 is still holding it, so it throws a LockObtainFailedException.
If you want to use two writers, you'd need to do something like:
writer1.writeSomeStuff();
writer1.close();          // releases the write lock
writer2.open();           // can now obtain the lock
writer2.writeSomeStuff();
writer2.close();
Since you can only have one writer open at a time, this pretty much negates any benefit you would get from using multiple writers. (It's actually much worse to open and close them all the time, since you'll constantly be paying a warmup penalty.)
So the answer to what I suspect is your underlying question is: don't use multiple writers. Use a single writer with multiple threads accessing it (IndexWriter is thread-safe). If you're connecting to Solr via REST or some other HTTP API, a single Solr writer should be able to handle many requests.
I'm not sure what your use case is, but another possible answer is to look at Solr's recommendations for managing multiple indices. In particular, the ability to hot-swap cores might be of interest.
>> But you have multiple Solr servers writing to the same location, right?
No, wrong. Solr uses the Lucene libraries, and it is stated in "Lucene in Action" * that there can be only one process/thread writing to the index at a time. That is why the writer takes a lock.
Your concurrent processes that are trying to write could, perhaps, catch the org.apache.lucene.store.LockObtainFailedException that is thrown when instantiating the writer.
You could, for instance, put the process that instantiates writer2 in a waiting loop until the active writing process finishes and issues writer1.close(), which releases the lock and makes the Lucene index available for writing again (see the sketch below). Alternatively, you could have multiple Lucene indexes (in different locations) being written to concurrently; when searching, you would then need to search through all of them.
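A sketch of that waiting loop (shown with Lucene.NET spellings since most of this page is .NET; the Java API is analogous, and the WriterFactory name and retry policy are illustrative):

using System;
using System.Threading;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class WriterFactory
{
    // Retry until the other process closes its writer and write.lock is released.
    public static IndexWriter OpenWithRetry(Directory dir, TimeSpan delay, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return new IndexWriter(dir,
                    new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);
            }
            catch (LockObtainFailedException) // another writer holds write.lock
            {
                if (attempt >= maxAttempts) throw;
                Thread.Sleep(delay);
            }
        }
    }
}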
* "In order to enforce a single writer at a time, which means an IndexWriter or an IndexReader doing deletions or changing norms, Lucene uses a file-based lock: If the lock file (write.lock, by default) exists in your index directory, a writer currently has the index open. Any attempt to create another writer on the same index will hit a LockObtainFailedException. This is a vital protection mechanism, because if two writers are accidentally created on a single index, it will very quickly lead to index corruption."
Section 2.11.3, Lucene in Action, Second Edition, Michael McCandless, Erik Hatcher, and Otis Gospodnetić, 2010