lucene: reopen indexreader after index - lucene

When my search server starts, it loads the whole index once and serves all queries from it. However, it keeps using the old index even after I rebuild the index. So I think I should tell the searcher's IndexReader to reopen the index after the server rebuilds it, but how do I implement that?
Maybe use a producer-consumer pattern? I can call indexReader.isCurrent() to check whether the index has changed, but then I have to check on every search or on some fixed schedule. Is there a more efficient and real-time way?

A convenient way to do what you are describing is to use Lucene's helper class SearcherManager. If you are interested in doing near-real-time search, you might also be interested in NRTManager.
Mike McCandless has a very nice article about these two classes on his blog.
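A sketch of the typical SearcherManager lifecycle, using the modern (Lucene 5+) API; constructors differ slightly in older versions, and the index path is illustrative:

```java
import java.nio.file.Paths;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearcherManagerSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        // null means "use the default SearcherFactory"
        SearcherManager mgr = new SearcherManager(dir, null);

        // Per search request: borrow and return the current searcher.
        IndexSearcher searcher = mgr.acquire();
        try {
            // searcher.search(query, 10);
        } finally {
            mgr.release(searcher); // never close an acquired searcher yourself
        }

        // After your rebuild/commit finishes, swap in a fresh reader:
        mgr.maybeRefresh();
    }
}
```

This solves the question's "real-time" concern: search threads never check isCurrent() themselves; the thread that rebuilds the index simply calls maybeRefresh() when it is done.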

This is by no means a simple task. I had to write quite a lot of code to achieve it (unfortunately, it's in Clojure, so I have no Java code samples at hand). The basic principle is to keep a thread-safe reference to your IndexSearcher that is accessible to both the index-reading code and the index-building code. The index builder creates the new index in the background; this does not interfere with the existing index. When done, it enters a synchronized block, closes the old IndexReader and IndexSearcher, opens a new IndexReader, and points the global reference at an IndexSearcher created from it. All reading code must synchronize on the same lock as that synchronized block. A better alternative is to use a ReentrantReadWriteLock instead of a synchronized block, which avoids unnecessary contention between the many reader threads.
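A minimal sketch of that swap pattern in plain Java, with a placeholder Searcher class standing in for Lucene's IndexSearcher (the names here are illustrative, not Lucene API):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative stand-in for an IndexSearcher over one generation of the index.
class Searcher {
    final int generation;
    Searcher(int generation) { this.generation = generation; }
    String search(String query) { return query + "@gen" + generation; }
}

class SearcherHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Searcher current = new Searcher(0);

    // Many reader threads may search concurrently under the read lock.
    String search(String query) {
        lock.readLock().lock();
        try {
            return current.search(query);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The index builder calls this after building the new index in the
    // background; only the reference swap needs the exclusive write lock.
    void swap(Searcher fresh) {
        lock.writeLock().lock();
        try {
            // In real code: close the old IndexReader/IndexSearcher here.
            current = fresh;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The expensive work (building the new index) happens entirely outside the lock, so readers are only ever blocked for the duration of the reference swap.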
After initialization, during normal operation, you can use the NRTManager to simultaneously read the index and make incremental updates to it.


How to use Kotlin's sequences and lambdas in a way that's memory efficient

So I'm in the process of writing some code that needs to be both memory efficient and fast. I already have a working reference in Java, but I was rewriting it in Kotlin.
I basically need to load a lot of CSV files into a tree once, and then traverse them repeatedly once they're loaded.
I originally wrote the whole thing using sequences, but found it caused the GC to spike repeatedly.
I can't really share this code, but I was wondering if you all know what would cause this to happen.
I'll be happy to add details as you need them, but here's my basic pattern.
step1: inputStream -> csvLines: List<String>
step2: csvLines.drop(x).fold(emptySequence()) -> callOtherFunctionWithFold -> callOtherFunctionWithFold -> Sequence<OutputObjects>
I keep the csvLines as a separate list because I'm accessing specific rows based on the rules I need.
step3: Sequence<OutputObjects> -> nodes
The result is functional, but this code is much less memory efficient and less performant than the Java equivalent, which only uses ArrayLists and modifies them in place.
Looking at the VisualVM output, I'm creating a ton of kotlin.*.ArrayIterators. It looks like I create one every time I use a lambda.
So what can I do to make this more efficient? I thought sequences were supposed to lazily reduce object creation, but it looks like I'm doing things that break their ability to do so.
Do sequences re-evaluate after every GC run, or on every traversal in general? If so, that would make them unsuitable for objects that are loaded at startup, right?
To use Kotlin sequences, you need to start with asSequence()
csvLines.asSequence()
.drop(x)
.fold(...)
...
If you leave that out, the chain uses the Collection extension functions instead, which create a new (intermediate) collection after every step.
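A small, self-contained illustration of the difference, on toy data rather than the asker's CSV code:

```kotlin
fun main() {
    val lines = listOf("header", "a", "b", "c")

    // Eager: drop, map, and filter EACH build a full intermediate List.
    val eager = lines.drop(1).map { it.uppercase() }.filter { it != "B" }

    // Lazy: one pass over the data, elements flow through the whole
    // pipeline one at a time; nothing is materialized until toList().
    val lazy = lines.asSequence()
        .drop(1)
        .map { it.uppercase() }
        .filter { it != "B" }
        .toList()

    println(eager) // [A, C]
    println(lazy)  // [A, C]
}
```

On the re-evaluation question: a Sequence is not cached, so it is re-computed on every terminal operation (every traversal), not in response to GC. If you traverse the result repeatedly after startup, materialize it once with toList() and traverse the list.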

What does optimize method do? Alternatives for optimize method in latest versions of lucene

I am pretty new to Lucene and am trying to understand the segment merging process. I came across the method optimize (which merges all the available Lucene index segments at that instant).
My exact question is: does optimize merge all levels of segments and create one single segment?
What are the alternatives in the latest versions of Lucene (say, Lucene 6.5)?
Is it good to always call the optimize method after the indexing process, so that my index always has a single segment and searches are fast?
First of all, you do not always need to merge segments down to just one segment; this is configurable. In principle, the idea of merging segments/optimizing the index comes from how Lucene implements deletes: Lucene does not delete documents, but rather marks them as deleted, and new documents go into new segments.
Lucene keeps a lot of per-segment files - the term dictionary and many others - so merging segments together reduces heap usage and makes searches faster. However, the merging process itself is usually not fast.
Overall, you need a balance between merging/optimizing every time you index new documents and not doing it at all. One thing to look at is MergePolicy, which defines different types of merging with different strategies. If you do not find one that suits you (which I doubt), you could implement one for your needs.
As of Lucene 6.5 you can use
public void forceMerge(int maxNumSegments) of the IndexWriter class
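For example, a sketch of the modern replacement for optimize() (the index path and analyzer choice are illustrative):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeExample {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // Equivalent of the old optimize(): merge down to one segment.
            // This is I/O-heavy, so run it rarely - e.g. once, after a
            // fully built index that will no longer change.
            writer.forceMerge(1);
            writer.commit();
        }
    }
}
```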

Apache Lucene Search program

I am currently reading a few tutorials on Apache Lucene. I have a question about how indexing works, which I was not able to find an answer to.
Given a set of documents that I want to index and then search with a query string: it seems a Lucene program would have to index all these documents and then run the search string against them each time the program is run. Won't this cause performance issues? Or am I missing something?
No, it would be pretty atypical to build a new index every time you run the program.
Many tutorials and examples out there use RAMDirectory; perhaps that's where the confusion is coming from. RAMDirectory creates an index entirely in memory. This is great for demos and tutorials, because you don't have to worry about the file system or any of that nonsense, and it ensures you are working from a predictable blank state each time.
In practice, though, you usually won't be using it. You would, instead, use a directory on the file system, and open an existing index after creating it for the first time, rather than creating a new index every time you run the program.
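A sketch of the usual pattern with a file-system index (the path is illustrative; Lucene 6.x-style API):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class PersistentIndexSketch {
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));

        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // Create the index on the first run, append on later runs:
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            // add/update documents only when there is something new to index
        }

        // Searching just opens the existing index; no re-indexing needed:
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // searcher.search(query, 10);
        }
    }
}
```

The CREATE_OR_APPEND open mode is what makes the program safe to run repeatedly against the same directory.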

Apache Lucene - How to update an existing index without concurrency issues?

I'm using Lucene Core 3.6.
I'm asking this within the context of a multi-user environment where many concurrent requests will be coming into the IndexSearcher.
Can I just create a new IndexWriter using the same Directory and Analyzer I used to originally populate the index and safely write to it? Will there be blocking, synchronization, or concurrency issues I have to be aware of?
From my reading, I believe that a newly added document is available as soon as I open a new IndexSearcher; however, I've also read that for performance reasons I should keep one IndexSearcher open for as long as possible. To me, this implies I have to keep track of when I write to the index so I can return a new IndexSearcher on the next request.
I suspect my choice of Directory implementation has an effect on this. Until now I've only been using RAMDirectory.
EDIT: Updated the title to better clarify what I'm asking.
Use a SearcherManager. Mike McCandless has a blog post about searcher managers and NRT managers which might be helpful.
There are various articles you can read online about how Lucene achieves near-real-time (NRT) index updates, but to answer your basic questions: only one IndexWriter should ever be open, but new readers are opened from that writer upon update. It's good to keep a reader open as long as possible, but since with NRT the updates come from memory, it's a pretty quick turnaround (generally tens of milliseconds).
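A sketch of that single-writer, NRT-reader arrangement; this assumes the SearcherManager/SearcherFactory classes available in Lucene 3.6, so check the exact constructor signatures for your version:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

public class NrtSketch {
    // One IndexWriter for the whole application ("dir" and "config"
    // stand for your existing Directory and IndexWriterConfig):
    // IndexWriter writer = new IndexWriter(dir, config);

    static void sketch(IndexWriter writer, Document doc) throws Exception {
        // NRT: readers are opened from the writer, so they see recent
        // updates without waiting for a commit to disk.
        SearcherManager mgr =
            new SearcherManager(writer, true, new SearcherFactory());

        // Indexing thread:
        writer.addDocument(doc);
        mgr.maybeRefresh(); // or refresh periodically from a background thread

        // Each concurrent search request:
        IndexSearcher s = mgr.acquire();
        try {
            // s.search(query, 10);
        } finally {
            mgr.release(s);
        }
    }
}
```

SearcherManager handles the "when do I hand out a new searcher" bookkeeping from the question: acquire() always returns the most recently refreshed searcher, and readers in flight keep their old one until released.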

Lucene creating duplicate indexes

I created an app using Lucene. The server wound up throwing out-of-memory errors because I was new'ing up an IndexSearcher for every search in the app. The garbage collector couldn't keep up.
I just got done implementing a singleton approach and now there are multiple indexes being created.
Any clue why this is happening? IndexWriter is what I am keeping static. I get IndexSearchers from it.
You don't have multiple indexes, you just have multiple segments. Lucene splits the index up into segments over time, although you can compact it if you want.
See here and here for more info
You also probably want to "new up" one IndexSearcher and pass it around; it seems like you are creating the index every time here.