Lucene IndexWriter.Close() vs indexWriter.Commit()

Lucene IndexWriter.Close() vs indexWriter.Commit() - indexing

What is different between IndexWriter.Close() andIndexWriter.Commit() when I hava just single instance of indexWriter?
Note:The Data that I going to make index is very big then I can't close IndexWriter runtime.
Note:I want to search in documents when data are indexing at sametime.

Commit() commits pending, buffered changes to the index (which can then be found with IndexReader() ). The IndexWriter can then continue to be used for more changes. Close() also performs a Commit(), but additionally closes the IndexWriter. Note that IndexWriter implements IDisposable(), and I recommend using it.
By your first note, if you mean there are lots of documents to index, that's fine. You can use the same IndexWriter for many documents without closing it. Just loop through however many documents you want to index within the same IndexWriter using() statement.
With regards to your second note, you must perform a commit() ( or close()) before your IndexWriter() changes will be seen by an IndexReader(). You can always search with IndexReader(), but it will only see the index as it was since the last IndexWriter.Commit().
I recommend Lucene In Action for these important details. It helped me a great deal.

Related

Using IndexWriter with SearchManager

I have a few basic questions regarding the usage of SearcherManager with IndexWriter.
I need to periodically re-build the Lucene index in the application and currently it happens on a different thread other than the one that serves the search requests.
Can I use the same IndexWriter instance through the lifetime of the application to rebuild the index periodically? Currently, I create / open it once during the startup and just call IndexWriter#commit whenever a new index is built.
I'm using SearcherManagerto acquire and release IndexSearcher instances for each search request. After the index is periodically built, I'm planning to use SearcherManager#maybeRefresh method to get refreshed IndexSearcher instances.SearcherManager instance is also created once during the startup and I intend to maintain it through out.
I do not close the IndexWriter or SearcherManager throughout the app's lifetime.
Now for the questions,
If I create a new IndexWriter every time I need to rebuild the index, will SearcherManager#maybeRefresh be able to detect that it's a new IndexWriter Instance? Or do I need to create a new SearcherManager using the newly created IndexWriter ?
What's the difference between creating a SearcherManager instance using an IndexWriter, creating it using a DirectoryReader or creating it using a Directory ?

The answers depend on how you construct your SearcherManager:
If you construct it with a DirectoryReader, all future IndexSearchers acquired from the SearcherManager will be based on that reader, i.e. all searches will provide results from the point in time you instantiated the SearcherManager. If you write data to the index/directory and run SearcherManager.maybeRefresh() afterwards, the reader will not be updated and your search results will be outdated.
If you construct the SearcherManager with an IndexWriter, SearcherManager.maybeRefresh() will update the SearcherManager's reader if data has been written and commited by the writer. All newly acquired IndexSearchers will then reflect the new state of the underlying index.
Despite having limited experience, I recommend using the latter approach. It provides a very simple way to implement near-real-time searching: At application start you create an IndexWriter and construct a SearcherManager with it. Afterwards you start a background thread that periodically commits all changes in the IndexWriter and refreshes the SearcherManager. For the lifetime of your application you can keep using the initial IndexWriter and SearcherManager without having to close/reopen them.
PS: I have only started working with Lucene a few days ago, so don't take everything I wrote here as 100% certain.

What is a good practice to entirely replace an existing Lucene index?

We use Lucene as a search engine. Our Lucene index is created by a master server, which is then deployed to slave instances.
This deployment is currently done by a script that deletes the files, and copy the new ones.
We needed to know if there was any good practice to do a "hot deployment" of a Lucene index. Do we need to stop or suspend Lucene? Do we need to inform Lucene the index has changed?
Thanks

The first step is to open the index in append mode for writing. You can achieve this by calling IndexWriter with the open mode named IndexWriterConfig.OpenMode.CREATE_OR_APPEND.
Once this is done, you are ready to both update existing documents and add new documents. For updating documents, you need to provide some kind of a unique identifier for a document (could be the URL or something else that is guaranteed to be unique). Now if you want to update a document with id say "Doc001" simply call the updateDocument function of Lucene passing "Doc001" as the Term (the very first) argument.
By this you can update an existing index without deleting it.

Can you read from a lucene index while updating the index

I can't find a straightforward yes or no answer to this!
I know I can send multiple reads in parallel but can I query an index while a seperate process/thread is updating it?

It's been a while since I used Lucene. However, assuming you are talking about the Java version, the FAQ has this to say:
Does Lucene allow searching and indexing simultaneously?
Yes. However, an IndexReader only searches the index as of the "point in time" that it was opened. Any updates to the index, either added or deleted documents, will not be visible until the IndexReader is re-opened. So your application must periodically re-open its IndexReaders to see the latest updates. The IndexReader.isCurrent() method allows you to test whether any updates have occurred to the index since your IndexReader was opened.

See also Lucene's near-real-time feature, which gives fast turnaround upon changes (adds, deletes, updates) to the index to being able to search those changes. For example, using near-real-time you could make changes to the index and then reopen the reader every few seconds.

How to know when Lucene Index generation process is completed

I've a .net windows service which generates Lucene search indexes every night.
I first get all the records from the database and add it to Lucene index using IndexWriter's AddDocument method and then call Optimize method before returning from the method.
Since the records fetched are faily large, indexing takes around 2-3 minutes to complete.
As you already know,Lucene generates intermediate segment files while it is generating the index and it compresses the whole index into 3 files when Optimize is called.
Is there anyway I can know that this index generation process is finished by Lucene and index is avaialable for search?
I need to know this because I want to call another method when process is completed.

You can check for the existance of the write.lock file.
http://wiki.apache.org/lucene-java/LuceneFAQ#head-733eab8f4000ba0f6c9f4ea052dea77d3d541857

I don't understand why you would need to know when Lucene finishes indexing. You can perform searches while Lucene is indexing. In fact, I think you can search while it's optimizing.
I personally do not like the idea of searching for the lock file. Can you not set a boolean and toggle it after you call writer.optimize()?

Lucene index updation and performance

I am working on a job portal site and have been using Lucene for job search functionality.
Users will be posting a number jobs on our site on a daily basis.We need to make sure that new job posted is searchable on the site as soon as possible.
In this context, how do I update Lucene index when a new job is posted or when an existing job is edited?
Can lucene index updating and search work in parallel?
Also,can I know any tips/best practices with respect to Lucene indexing,optimizing,performance etc?
Appreciate ur help!
Thanks!

Yes, Lucene can search from and write to an index at the same time as long as no more than 1 IndexWriter writes to it. If you want the new records visible ASAP, have the IndexWriter call the commit() function often (see IndexWriter's JavaDoc for details).
These Wiki pages might also help:
ImproveIndexingSpeed
ImproveSearchingSpeed

I have used Lucene.Net on a web site similar to what you are doing. Yes, you can do live indexes, updating to keep everything up to date? What platform are you using Lucene on, .NET, Java?

Make sure you create a new IndexSearcher as any additions after an IndexSearcher has been created are not visible to that instance.
A better approach may be to ReOpen the IndexReader if you want to resuse the same index searcher.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Lucene IndexWriter.Close() vs indexWriter.Commit() - indexing

What is different between IndexWriter.Close() andIndexWriter.Commit() when I hava just single instance of indexWriter? Note:The Data that I going to make index is very big then I can't close IndexWriter runtime. Note:I want to search in documents when data are indexing at sametime.

Related

Using IndexWriter with SearchManager

What is a good practice to entirely replace an existing Lucene index?

Can you read from a lucene index while updating the index

How to know when Lucene Index generation process is completed

Lucene index updation and performance

Categories

Resources