Apache Lucene - How to update an existing index without concurrency issues?

I'm using Lucene Core 3.6.
I'm asking this within the context of a multi-user environment where many concurrent requests will be coming into the IndexSearcher.
Can I just create a new IndexWriter using the same Directory and Analyzer I used to originally populate the index and safely write to it? Will there be blocking, synchronization, or concurrency issues I have to be aware of?
From my reading I believe that a newly added document is available as soon as I open a new IndexSearcher; however, I've also read that for performance reasons I want to keep one IndexSearcher open for as long as possible. To me, this implies I have to keep track of when I write to the index, so I can return a new IndexSearcher on the next request.
I suspect my choice of Directory implementation has an effect on this. Until now I've only been using RAMDirectory.
EDIT: Updated the title to better clarify what I'm asking.

Use a SearcherManager. Mike McCandless has a blog post about SearcherManager and NRTManager which might be helpful.
There are various articles you can read online about how Lucene achieves near-real-time (NRT) index updates, but to answer your basic questions: only one IndexWriter should ever be open, but new readers are opened from that writer upon update. It's good to keep a reader open as long as possible, but since NRT updates come from memory, reopening is a quick turnaround (generally tens of milliseconds).
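To make the acquire/release discipline concrete, here is a minimal sketch of what a searcher manager does internally: hand out the current searcher with a reference count, and swap in a fresh one on refresh while in-flight searches keep the old one alive. Everything here (the class names, the Supplier-based "opener") is my own stand-in, not Lucene API; in real code you would simply use Lucene's SearcherManager.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// A reference-counted wrapper; the wrapped resource stands in for an IndexSearcher.
final class RefCounted<T> {
    final T resource;
    private final AtomicInteger refs = new AtomicInteger(1); // the manager's own reference

    RefCounted(T resource) { this.resource = resource; }

    boolean tryIncRef() {
        while (true) {
            int n = refs.get();
            if (n == 0) return false;                // already fully released
            if (refs.compareAndSet(n, n + 1)) return true;
        }
    }

    void decRef() {
        if (refs.decrementAndGet() == 0) {
            // in real Lucene this is where the underlying IndexReader would be closed
        }
    }
}

final class MiniSearcherManager<T> {
    private final Supplier<T> opener;                // opens a fresh "searcher"
    private final AtomicReference<RefCounted<T>> current;

    MiniSearcherManager(Supplier<T> opener) {
        this.opener = opener;
        this.current = new AtomicReference<>(new RefCounted<>(opener.get()));
    }

    // Borrow the current searcher; the caller must release() it when done.
    RefCounted<T> acquire() {
        while (true) {
            RefCounted<T> ref = current.get();
            if (ref.tryIncRef()) return ref;         // lost a race with a refresh; retry
        }
    }

    void release(RefCounted<T> ref) { ref.decRef(); }

    // Swap in a fresh searcher; searches still holding the old one are unaffected.
    void maybeRefresh() {
        RefCounted<T> fresh = new RefCounted<>(opener.get());
        RefCounted<T> old = current.getAndSet(fresh);
        old.decRef();                                // drop the manager's reference
    }
}
```

The point of the reference count is exactly the multi-user scenario in the question: a request that acquired a searcher before a refresh can finish on its old view of the index, and the old reader is only closed once the last such request releases it.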

Related

Apache Lucene Search program

I am currently reading a few tutorials on Apache Lucene. I had a question on how indexing works, which I was not able to find an answer to.
Say I have a set of documents that I want to index and then search with a query string. It seems a Lucene program would have to index all these documents and then search them each time the program is run. Won't this cause performance issues? Or am I missing something?
No, it would be pretty atypical to build a new index every time you run the program.
Many tutorials and examples out there use RAMDirectory; perhaps that's where the confusion is coming from. RAMDirectory creates an index entirely in memory. This is great for demos and tutorials, because you don't have to worry about the file system, or any of that nonsense, and it ensures you are working from a predictable blank state.
In practice, though, you usually won't be using it. You would, instead, use a directory on the file system, and open an existing index after creating it for the first time, rather than creating a new index every time you run the program.

Can a RavenDB collection be forced to stay in memory?

Can I force a RavenDB collection to stay in memory so that queries against it are fast? I read about aggressive caching, but the documentation only talks about request caching. If I have sharding enabled, can I force all the shards to cache the collection in memory?
Any help is appreciated,
Thanks
RavenDB doesn't really have "Collections" in the sense you are thinking. The only thing that collections are used for is to filter documents by their Raven-Entity-Name metadata. This serves a few purposes:
The Raven Studio UI can group things to make them easier to find.
Indexes can use a shortcut form of docs.EntityName instead of having a where clause against the metadata in every index.
But that's pretty much it. They aren't isolated on disk. For example, when Raven indexes documents, every index considers all documents. Docs get discarded quickly if they don't pass the collection filter, but they are still put through the pipeline.
You can read more about collections here.
Also - as long as you are still in a learning phase, you may want to post this style of question on the RavenDB Google Group instead. You will get a much better response. You won't get many upvotes on StackOverflow when you are asking non-code "can X do Y?" questions. Come back here when you have written some code. See the ravendb tag for other questions that have been answered, and you'll get a feel for what StackOverflow is for. Thanks.
You don't need to do that.
RavenDB will automatically detect usage patterns and keep frequently requested documents in memory.

lucene: reopen IndexReader after reindexing

When my search server starts, it loads the whole index once and serves all queries from it. However, it keeps using the old index even after I rebuild the index. So I think I should tell the searcher's IndexReader to reopen the index after the server rebuilds it, but how do I implement that?
Maybe use a producer-consumer pattern? I could use indexReader.isCurrent() to check whether the index has changed, but then I would have to check on every search, or on some schedule. Is there a more efficient, real-time way?
A convenient way to do what you are describing is to use Lucene's helper class SearcherManager. If you are interested in doing near-realtime search, you might also be interested in NRTManager.
There is a very nice blog article about these two classes on Mike McCandless' blog.
This is by no means a simple task. I had to write quite a lot of code to achieve it (unfortunately, it's in Clojure, so no Java code samples at hand). The basic principle is to have a thread-safe reference to your IndexSearcher that is accessible to both the index-reading code and the index-building code. The index builder creates the new index in the background; this does not interfere with the existing index. When done, it enters a synchronized block, closes the old IndexReader and IndexSearcher, opens a new IndexReader, and updates the shared IndexSearcher reference to a searcher created from it. All reading code must synchronize on the same lock as the one used in that synchronized block. A better alternative is a ReentrantReadWriteLock instead of a synchronized block, which avoids unnecessary contention between many reader threads.
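Since the original code is Clojure, here is a minimal Java sketch of the ReentrantReadWriteLock variant described above. A plain String is a hypothetical stand-in for the IndexSearcher; the class and method names are mine, not from any library.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Holds the shared "searcher"; readers take the read lock, the index
// builder takes the write lock only for the brief swap.
final class SearcherHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String searcher; // pretend this is an IndexSearcher

    SearcherHolder(String initial) { this.searcher = initial; }

    // Reading code: many threads may hold the read lock simultaneously.
    String search(String query) {
        lock.readLock().lock();
        try {
            return searcher + " handled: " + query;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Index builder: the new index is built in the background first, so
    // the write lock is held only while the reference is exchanged.
    void swap(String freshSearcher) {
        lock.writeLock().lock();
        try {
            // real code would close the old IndexSearcher/IndexReader here
            // and open the new ones before publishing the reference
            searcher = freshSearcher;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The read/write lock gives exactly the property the answer mentions: concurrent searches never block each other, and only the short swap excludes them.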
After initialization, during normal operation, you can use the NRTManager to simultaneously read the index and make incremental updates to it.

Indexing engine

I'm developing context discover system - which is mix of searching and suggestions.
Currently I'm looking for library for indexing.
After some investigation I settled on Lucene and Terrier, and found Indri uncomfortable to work with.
What are the downsides of each? What problems might I run into while using them?
Is it true that Terrier doesn't support incremental indexing (i.e., every time a new document is added, I need to rebuild and reindex everything)?
My requirements are:
- easy adding of new documents
- easy injection of scoring methods
- a reasonably well-defined model
And one more thing: is Terrier still active? I haven't seen any update since 10/03/2010 (see the Terrier changelog).
What sort of database are you going to be using? In my experience, Lucene is much better documented than Terrier.
Here's an article comparing Lucene and Terrier:
http://text-analytics.blogspot.com/2011/05/java-based-retrieval-toolkits.html

Lucene creating duplicate indexes

I created an app using Lucene. The server wound up throwing out-of-memory errors because I was new'ing up an IndexSearcher for every search in the app, and the garbage collector couldn't keep up.
I just finished implementing a singleton approach, and now there are multiple indexes being created.
Any clue why this is happening? IndexWriter is what I am keeping static. I get IndexSearchers from it.
You don't have multiple indexes, you just have multiple segments. Lucene splits the index up into segments over time, although you can compact it if you want.
See here and here for more info
You also probably want to "new up" one IndexSearcher and pass it around; it seems like you are creating the index every time here.
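One common way to share a single searcher app-wide is the initialization-on-demand holder idiom, sketched below. "FakeSearcher" is a hypothetical stand-in for IndexSearcher so the example stays self-contained; the pattern is what matters.

```java
// Counts its own instances so we can verify only one is ever created.
final class FakeSearcher {
    private static int instances = 0;
    final int id;
    FakeSearcher() { id = ++instances; }
    static int instanceCount() { return instances; }
}

final class SharedSearcher {
    private SharedSearcher() {}

    // The JVM guarantees Holder is initialized exactly once, lazily,
    // and thread-safely, with no explicit locking needed.
    private static final class Holder {
        static final FakeSearcher INSTANCE = new FakeSearcher();
    }

    static FakeSearcher get() { return Holder.INSTANCE; }
}
```

Every request then calls SharedSearcher.get() instead of constructing its own searcher, which is what keeps the garbage collector (and the open-file count) under control.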