How to stop a running indexing job before starting another?

I'm building a web app that uses Lucene as its search engine. The user first selects a file or directory to index and can then search it. My problem occurs when the user indexes a large amount of data: if indexing takes too long and the user refreshes the page and tries to index another directory, an exception is thrown because the first indexing run is still in progress (a write.lock file is present). Given that, how can I stop the first indexing run? I tried closing the IndexWriter, without success.
Thanks in advance.

Why do you want to interrupt the first indexing operation and restart it again?
In my opinion, you should display a simple image showing that the system is working (as Nielsen says: "The system should always keep users informed about what is going on, through appropriate feedback within reasonable time.") and, when the user presses refresh, intercept the event and prevent another indexing process from starting.
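As a rough illustration of "prevent the execution of another indexing process", here is a minimal sketch using only the Java standard library. IndexingGuard and IndexingGuardDemo are hypothetical names, not Lucene classes, and the loop merely stands in for real indexing work. The sketch assumes the indexing job checks for thread interruption between files and closes its IndexWriter in a finally block.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Lucene): at most one indexing job at a time. */
class IndexingGuard {
    // A single worker thread means a newly accepted job cannot begin until the
    // previous one has really returned (and closed its IndexWriter).
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private Future<?> current; // guarded by "this"

    /** Starts the job unless another one is still active; false means "refused". */
    synchronized boolean tryStart(Runnable job) {
        if (current != null && !current.isDone()) {
            return false;
        }
        current = worker.submit(job);
        return true;
    }

    /** Requests cancellation by interrupting the worker thread. */
    synchronized void cancel() {
        if (current != null) {
            current.cancel(true);
        }
    }

    void shutdown() {
        worker.shutdownNow();
    }
}

class IndexingGuardDemo {
    /** Returns {firstAccepted, secondAccepted, firstStoppedAfterCancel}. */
    static boolean[] run() {
        IndexingGuard guard = new IndexingGuard();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch stopped = new CountDownLatch(1);

        // Stand-in for a long indexing loop; a real job would add one document
        // per file and close the IndexWriter in the finally block.
        Runnable longJob = () -> {
            started.countDown();
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    Thread.sleep(10); // pretend to index one file
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // treat as a stop request
            } finally {
                stopped.countDown(); // the writer would be closed here
            }
        };

        try {
            boolean first = guard.tryStart(longJob);
            started.await();
            boolean second = guard.tryStart(() -> { }); // e.g. a page refresh
            guard.cancel();
            boolean didStop = stopped.await(5, TimeUnit.SECONDS);
            return new boolean[] { first, second, didStop };
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            guard.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("first accepted: " + r[0]);
        System.out.println("second accepted: " + r[1]);
        System.out.println("first stopped after cancel: " + r[2]);
    }
}
```

In a web app, the request handler would call tryStart and show a "still indexing" message when it returns false. If the user really wants to abandon the first run, cancel followed by a new tryStart queues the new job behind the old one on the same worker thread, so the two never hold the index at the same time.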

You are probably trying to open an IndexWriter on an index directory that already has an IndexWriter open on it. If the two IndexWriters were opened on two different index directories, the write.lock exception would not occur. Could you check that the new IndexWriter instance is not writing to an index directory on which another IndexWriter is still open?

Related

Multiple instances of application using Lucene.Net

I'm developing a WPF application that uses Lucene.Net to index data from files being generated by a third-party process. It's low volume with new files being created no more than once a minute.
My application uses a singleton IndexWriter instance that is created at startup. Similarly an IndexSearcher is also created at startup, but is recreated whenever an IndexWriter.Commit() occurs, ensuring that the newly added documents will appear in search results.
Anyway, some users need to run two instances of the application, but the problem is that newly added documents don't show up when searching within the second instance. I guess it's because the first instance is doing the commits, and there needs to be a way to tell the second instance to recreate its IndexSearcher.
One way would be to signal this using a file create/update in conjunction with a FileSystemWatcher, but first I wondered whether there is anything in Lucene.Net that I could use.
The only thing I can think of that might be helpful for you is IndexReader.Reopen(). This will refresh the IndexReader, but only if the index has changed since the reader was originally opened. It should cause minimal disk access in the case where the index hasn't been updated, and in the case where it has, it tries to only load segments that were changed or added.
One thing to note about the API: Reopen returns an IndexReader. In the case where the index hasn't changed, it returns the same instance; otherwise it returns a new one. The original index reader is not disposed, so you'll need to do it manually:
IndexReader reader = /* ... */;
IndexReader newReader = reader.Reopen();
if (newReader != reader)
{
    // Only close the old reader if we got a new one
    reader.Dispose();
}
reader = newReader;
I can't find the .NET docs right now, but here are the Java docs for Lucene 3.0.3 that explain the API.
If both instances have their own IndexWriter open on the same directory, you're in for a world of pain and intermittent bad behaviour.
An IndexWriter expects and requires exclusive control of the index directory. This is the reason for the lock file.
If the second instance can detect that there is an existing instance, then you might be able to just open an IndexReader/Searcher on the folder and reopen it when the directory changes.
But then what happens if the first instance closes? The index will no longer be updated, so the second instance would need to reinitialise, this time with an IndexWriter. Perhaps it could do this when the lock file is removed as the first instance closes.
The "better" approach would be to spin up a "service" (just a background process, maybe in the system tray). All instances of the app would then query this service. If the app is started and the service is not detected then spin it up.

Sitecore "Indexing is paused"

I seem to be having some issues with the IntervalAsynchronousStrategy for updating content items.
Sometimes, the indexes will not be automatically updated with this strategy, and a manual index rebuild is required.
These are the corresponding log file entries:
8404 09:20:24 INFO [Index=artscentre_web_index] IntervalAsynchronousUpdateStrategy executing.
8404 09:20:24 INFO [Index=artscentre_web_index] History engine is empty. Incremental rebuild returns
8032 09:20:21 WARN [Index=artscentre_web_index] IntervalAsynchronousUpdateStrategy triggered but muted. Indexing is paused.
I see this every time the index rebuilds, even though content is being edited and published during that time.
I have previously swapped from the OnPublishEnd rebuild strategy to the interval strategy as I was finding that publishing content would not trigger an index rebuild either.
Our environment is a single instance setup only, so the single IIS website handles both CM and CD. Therefore I can eliminate anything to do with remote events, I think?
Has anyone else had this much trouble getting Sitecore to maintain index updates?
Cheers,
Justin

Suggestion around Lucene 4.4 (Log Search)

I am new to Lucene and trying to use it for searching log files/entries generated by a SystemA.
Architecture
Each log entry (XML) is received in an INPUT directory. SystemA sends log entries to an MQ queue, which is polled by a small utility that picks up each message and creates a file in the INPUT directory.
WriteIndex.java (i.e. IndexWriter/Lucene) keeps checking whether a new file has arrived in the INPUT directory. If so, it takes the file, adds it to the index, and moves the file to the OUTPUT directory. As part of indexing, I put the filename, path, timestamp, and contents in the index.
"Note: I am indexing the content and also storing the whole content as a StringField."
SearchIndex.java (i.e. SearcherManager/Lucene/refreshIfChanged) handles searching. On creation it also starts a new thread that checks every minute whether the index has changed. I acquire an IndexSearcher for every request. It's working fine.
Everything has worked fine so far, but I am not sure what will happen in production: I have tested with a few hundred files, whereas in production I will receive about 500K log entries a day, which means 500K small files, each containing one XML document. WriteIndex.java will have to run non-stop to update the index whenever a new file arrives.
I have the following questions:
Has anyone done similar work? Are there any issues or best practices I should be aware of?
Do you see any problem with the index files generated for such a large number of XML files? Each XML file would be 2KB at most. Remember that I am indexing the content and also storing it as a string in the index, so that I can retrieve it from the index whenever a search finds a match.
I would expose SearchIndex.java as a servlet so that admins can open a web page and search log entries. Do you see any issues with that?
Please let me know if anyone needs anything specific.
Thanks,
Rohit Goyal
The architecture looks fine.
A few things:
Consider using TextField instead of StringField. A TextField is tokenized, so users can search on individual tokens. A StringField is not tokenized, so a document matches only when the query matches the field's entire text.
Performance should not be a problem for Lucene. Check out the Lucene performance graphs: the benchmarks there index millions of Wikipedia documents in minutes, and searching is fast too.
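To make the TextField/StringField difference concrete, here is a toy sketch in plain Java. It does not use Lucene: FieldMatchSketch and its methods are hypothetical names, and the tokenizer simply lowercases and splits on non-alphanumeric characters, which is only an approximation of what a real Lucene analyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model (not Lucene) of untokenized versus tokenized field matching. */
class FieldMatchSketch {
    /** StringField-like behaviour: the whole value is one term. */
    static boolean exactMatch(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    /** TextField-like behaviour: the value is split into lowercase tokens. */
    static List<String> tokenize(String fieldValue) {
        List<String> tokens = new ArrayList<>();
        String lower = fieldValue.toLowerCase(Locale.ROOT);
        for (String t : lower.split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** A single-term query matches if any token equals it. */
    static boolean tokenMatch(String fieldValue, String queryTerm) {
        return tokenize(fieldValue).contains(queryTerm.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String logLine = "ERROR: connection timeout on node7";
        System.out.println(exactMatch(logLine, "timeout")); // whole value differs
        System.out.println(tokenMatch(logLine, "timeout")); // one token matches
    }
}
```

For log search, where admins will typically type a word or two rather than the full entry, the tokenized behaviour is usually the one you want for the content field.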

Apache Lucene Index Writer

I'm new to Apache Lucene and have run into an issue. I indexed all the files in a directory but did not close the IndexWriter, and then tried to open the index in Luke. It reported the error "Index not closed". The problem is that the code had already finished executing. How do I unlock the index? If I instantiate a new IndexWriter on the same directory, will it overwrite the existing index?
I am not an expert either.
If I were you, I'd do the following:
1) Add the following line at the end of the code; closing the writer is a must.
myIndexWriter.close();
2) Delete the existing index directory manually, and rerun the whole code.
If you instantiate a new IndexWriter without deleting the directory, it will add documents to the existing index (unless the writer is opened in create mode, which overwrites it). So yes, rerunning the code will result in duplicate index entries.
However, from Lucene's perspective, all those entries are still unique: every addDocument() creates a new entry in the index with a new, unique, Lucene-internal document ID.
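The sketch below shows why the close() call matters, using a toy writer that guards a directory with a lock file. ToyWriter and ToyWriterDemo are hypothetical classes written with the standard library only. Lucene's own lock implementations differ (they are not necessarily based on mere file existence), so treat this purely as an illustration of the close-or-stay-locked pattern and of try-with-resources as a way to guarantee the close.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy stand-in for an index writer that guards a directory with a lock file. */
class ToyWriter implements AutoCloseable {
    private final Path lock;

    ToyWriter(Path indexDir) throws IOException {
        lock = indexDir.resolve("write.lock");
        try {
            Files.createFile(lock); // atomic: fails if the file already exists
        } catch (FileAlreadyExistsException e) {
            throw new IllegalStateException("index is locked: " + lock, e);
        }
    }

    /** Releases the lock; skipping this call leaves the directory locked. */
    @Override
    public void close() throws IOException {
        Files.deleteIfExists(lock);
    }
}

class ToyWriterDemo {
    /** Returns {secondOpenFailedWhileFirstOpen, reopenSucceededAfterClose}. */
    static boolean[] run() {
        try {
            Path dir = Files.createTempDirectory("toy-index");
            boolean secondFailed = false;
            boolean reopened = false;
            try (ToyWriter first = new ToyWriter(dir)) {
                try (ToyWriter second = new ToyWriter(dir)) {
                    // not reached: the directory is already locked
                } catch (IllegalStateException expected) {
                    secondFailed = true;
                }
            } // try-with-resources closes "first" even if indexing throws
            try (ToyWriter again = new ToyWriter(dir)) {
                reopened = true;
            }
            Files.deleteIfExists(dir);
            return new boolean[] { secondFailed, reopened };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("second open failed: " + r[0]);
        System.out.println("reopen after close succeeded: " + r[1]);
    }
}
```

Lucene's IndexWriter is closeable in the same way, so wrapping it in try-with-resources (or calling close() in a finally block) is one way to make sure the close in step 1 always happens.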

Can you read from a lucene index while updating the index

I can't find a straightforward yes or no answer to this!
I know I can send multiple reads in parallel, but can I query an index while a separate process/thread is updating it?
It's been a while since I used Lucene. However, assuming you are talking about the Java version, the FAQ has this to say:
Does Lucene allow searching and indexing simultaneously?
Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened. Any updates to the index, either added or deleted documents, will not be visible until the IndexReader is re-opened. So your application must periodically re-open its IndexReaders to see the latest updates. The IndexReader.isCurrent() method allows you to test whether any updates have occurred to the index since your IndexReader was opened.
See also Lucene's near-real-time feature, which gives fast turnaround upon changes (adds, deletes, updates) to the index to being able to search those changes. For example, using near-real-time you could make changes to the index and then reopen the reader every few seconds.
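The "point in time" behaviour described in the FAQ can be illustrated with a toy in-memory model. This is not Lucene code: ToyIndex, ToyReader, and reopenIfChanged are hypothetical names chosen for the sketch, and a real index works on segments rather than copying a list. The model only shows the visible semantics: a reader sees the documents that existed when it was opened, and reopening hands back the same reader when nothing has changed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory "index" (not Lucene) with point-in-time readers. */
class ToyIndex {
    private final List<String> docs = new ArrayList<>();
    private long version = 0;

    synchronized void add(String doc) {
        docs.add(doc);
        version++;
    }

    /** Opens a reader that sees only the documents present right now. */
    synchronized ToyReader open() {
        return new ToyReader(this, version, new ArrayList<>(docs));
    }

    synchronized long version() {
        return version;
    }
}

class ToyReader {
    private final ToyIndex index;
    private final long version;
    private final List<String> snapshot;

    ToyReader(ToyIndex index, long version, List<String> snapshot) {
        this.index = index;
        this.version = version;
        this.snapshot = snapshot;
    }

    /** Number of documents visible to this reader. */
    int count() {
        return snapshot.size();
    }

    /** True if nothing was added since this reader was opened. */
    boolean isCurrent() {
        return index.version() == version;
    }

    /** Returns this reader if it is still current, otherwise a fresh one. */
    ToyReader reopenIfChanged() {
        return isCurrent() ? this : index.open();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("first entry");
        ToyReader reader = index.open();
        index.add("second entry"); // written while the reader is open
        System.out.println("old reader sees " + reader.count());
        System.out.println("reopened reader sees " + reader.reopenIfChanged().count());
    }
}
```

A search thread would call something like reopenIfChanged periodically, or before each query, and swap in the returned reader, which is the pattern the FAQ describes with isCurrent() and re-opening.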