Lucene creating duplicate indexes - indexing

I created an app using Lucene. The server wound up throwing out-of-memory errors because I was new'ing up an IndexSearcher for every search in the app. The garbage collector couldn't keep up.
I just got done implementing a singleton approach and now there are multiple indexes being created.
Any clue why this is happening? IndexWriter is what I am keeping static. I get IndexSearchers from it.

You don't have multiple indexes, you just have multiple segments. Lucene splits the index up into segments over time, although you can compact it if you want.

You also probably want to "new up" one IndexSearcher and pass it around; it seems like you are creating the index every time here.
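A minimal sketch of that setup, assuming Lucene 5+ (the index path and class names are illustrative): keep one application-wide IndexWriter, and open cheap, short-lived near-real-time readers from it for each search.

```java
import java.io.IOException;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public final class LuceneHolder {
    // One writer for the life of the application.
    private static final IndexWriter WRITER;

    static {
        try {
            WRITER = new IndexWriter(
                    FSDirectory.open(Paths.get("/path/to/index")), // illustrative path
                    new IndexWriterConfig(new StandardAnalyzer()));
        } catch (IOException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private LuceneHolder() {}

    // A near-real-time searcher over the writer's current state.
    // Close the underlying reader when done: searcher.getIndexReader().close()
    public static IndexSearcher newSearcher() throws IOException {
        return new IndexSearcher(DirectoryReader.open(WRITER));
    }
}
```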


What does the optimize method do? Alternatives for optimize in the latest versions of Lucene

I am pretty new to Lucene and am trying to understand the segment merging process. I came across the optimize method (which merges all the available Lucene index segments at that instant).
My exact question is: does optimize merge all the levels of segments and create one single segment?
What are the alternatives in the latest versions of Lucene (say Lucene 6.5)?
Is it good to always call the optimize method after the indexing process, so that my index will always have a single segment and searches will be fast?
First of all, there is no need to always merge segments down to just one segment; this is configurable. In principle, the idea of merging segments/optimizing the index comes from how Lucene implements deletes: Lucene does not delete documents, but rather marks them for deletion, and new documents go into new segments.
Lucene has a lot of per-segment files - like the term dictionary and many others - so merging segments together reduces heap usage and makes searches faster. However, the merge process itself usually isn't fast.
Overall, you need to strike a balance between merging/optimizing every time you index new docs and not doing it at all. One thing to look at is MergePolicy, which defines different types of merging with different strategies. If you don't find one that suits you (which I doubt), you could implement one for your needs.
As of Lucene 6.5 you can use public void forceMerge(int maxNumSegments) of the IndexWriter class.
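A short sketch of both options in the Lucene 6.x API (the 512 MB cap is an illustrative value, not a recommendation):

```java
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;

public class MergeExamples {
    // Option 1: let background merging do the work, tuned via a MergePolicy.
    static IndexWriter openWriter(Directory dir, Analyzer analyzer) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        TieredMergePolicy policy = new TieredMergePolicy();
        policy.setMaxMergedSegmentMB(512.0); // illustrative cap on merged segment size
        config.setMergePolicy(policy);
        return new IndexWriter(dir, config);
    }

    // Option 2: the explicit replacement for optimize().
    static void compact(IndexWriter writer) throws IOException {
        writer.forceMerge(1); // merge down to a single segment; I/O-heavy, use sparingly
        writer.commit();
    }
}
```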

Apache Lucene Search program

I am currently reading a few tutorials on Apache Lucene. I had a question on how indexing works, which I was not able to find an answer to.
Say I have a set of documents that I want to index and then search with a query string. It seems a Lucene program would have to index all these documents and then search for the input string every time the program is run. Will this not cause performance issues? Or am I missing something?
No, it would be pretty atypical to build a new index every time you run the program.
Many tutorials and examples out there use RAMDirectory, perhaps that's where the confusion is coming from. RAMDirectory creates an index entirely in memory. This is great for demos and tutorials, because you don't have to worry about the file system, or any of that nonsense, and it ensures you are working from a predictable blank state from the start.
In practice, though, you usually won't be using it. You would, instead, use a directory on the file system, and open an existing index after creating it for the first time, rather than creating a new index every time you run the program.
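A minimal sketch of that pattern, assuming Lucene 5+ (the path is illustrative): CREATE_OR_APPEND reuses the on-disk index if it already exists, so subsequent runs only search.

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class PersistentIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index")); // illustrative path

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(OpenMode.CREATE_OR_APPEND); // reuse the index if it already exists
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // add documents here only when there is something new to index
        }

        // Searching just opens the existing index; nothing is re-indexed.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // searcher.search(query, 10);
        }
    }
}
```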

Why does RavenDB read all documents in the indexing process, and not only the collections used by the index?

I have a quite large database with ~2.6 million documents, where two collections hold 1.2 million each and the rest are small collections (<1000 documents). When I create a new index for a small collection, it takes a lot of time for indexing to complete (so temp indexes are useless). It seems that RavenDB's indexing process reads every document in the DB and checks whether it should be added to the index. I think it would perform better to index only the collections used by the index.
Also, when using Smuggler to export data, if I want to export only one small collection, it reads all documents and the export can take quite a long time. At the same time, a custom app that uses the RavenDB Linq API and indexes can export the data in seconds.
Why does RavenDB behave like this? Is there some configuration setting that might change this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems that the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing everything, but that isn't the case in RavenDB 3.0. What we have done is to effectively optimize the process so that in this case, we will preload all of the documents taking part in the relevant collection, and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which has already indexed everything in the database anyway. This is a nice little feature, because it allows us to really take advantage of the work we already did long ago. Using one index to pre-populate another is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

lucene: reopen IndexReader after indexing

When my search server starts, it loads the whole index once and uses it for all queries. However, it still uses the old index even after I rebuild the index. So I think I should tell the searcher's IndexReader to reopen the index after the server rebuilds it, but how do I implement that?
Maybe use a producer-consumer pattern? I could use indexReader.isCurrent() to check whether the index has changed, but then I would have to check on every search, or at some fixed period. Is there a more efficient and real-time way?
A convenient way to do what you are describing is to use Lucene's helper class SearcherManager. If you are interested in doing near-realtime search, you might also be interested in NRTManager.
There is a very nice blog article about these two classes on Mike McCandless' blog.
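A minimal sketch of SearcherManager usage (assuming Lucene 5+; older 4.x constructors also take an applyAllDeletes flag):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public class SearchService {
    private final SearcherManager manager;

    // One manager shared by all search threads, built over the live IndexWriter.
    public SearchService(IndexWriter writer) throws IOException {
        this.manager = new SearcherManager(writer, new SearcherFactory());
    }

    // Per search: acquire a searcher, use it, always release it.
    public TopDocs search(Query query) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.search(query, 10);
        } finally {
            manager.release(searcher);
        }
    }

    // Call after the index has been rebuilt/committed to pick up the changes.
    public void refresh() throws IOException {
        manager.maybeRefresh();
    }
}
```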
This is by no means a simple task. I had to write quite a lot of code to achieve it (unfortunately, it's in Clojure, so no Java code samples at hand). The basic principle is to have a thread-safe reference to your IndexSearcher that is accessible to both the index-reading code and the index-building code. The index builder starts creating a new index in the background; this does not interfere with the existing index. When done, it enters a synchronized block, closes the IndexReader and IndexSearcher, opens a new IndexReader, and updates the global IndexSearcher reference to the IndexSearcher created from it. All reading code must synchronize on the same lock as the one involved in the mentioned synchronized block. A better alternative is to use a ReentrantReadWriteLock instead of a synchronized block (sketched below); this avoids unnecessary contention between many reader threads.
After initialization, during normal operation, you can use the NRTManager to simultaneously read the index and make incremental updates to it.
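A rough Java sketch of that swap-under-a-lock idea (the original code was Clojure, so the class and method names here are made up for illustration):

```java
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private DirectoryReader reader;
    private IndexSearcher searcher;

    // Readers: hold the read lock for the duration of a search.
    public IndexSearcher acquire() {
        lock.readLock().lock();
        return searcher;
    }

    public void release() {
        lock.readLock().unlock();
    }

    // Builder: after the new index is ready, swap under the write lock,
    // which excludes all in-flight readers.
    public void swap(DirectoryReader newReader) throws IOException {
        lock.writeLock().lock();
        try {
            if (reader != null) {
                reader.close(); // closing the reader also invalidates the old searcher
            }
            reader = newReader;
            searcher = new IndexSearcher(newReader);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```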

Tips/recommendations for using Lucene

I'm working on a job portal using asp.net 3.5
I've used Lucene for job and resume search functionality.
Would like to know tips/recommendations if any with respect to Lucene performance optimization, scalability, etc.
Thanks a ton!
I've documented how I used Lucene.NET (in BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx
One thing you should keep in mind is that it is very hard to cluster or replicate Lucene indexes in large installations, such as failover scenarios or distributed systems. So you should either have a good way to replicate your indexing jobs or the whole database.
If you use a sort, watch out for the size of the comparators. When sorts are used, for each document returned by the searcher there will be a comparator object stored for each SortField in the Sort object. Depending on the size of the documents and the number of fields you want to sort on, this can become a big headache.
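For reference, this is the kind of search the warning applies to; a minimal sketch in modern Java Lucene syntax (the field name and page size are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

class SortedSearch {
    // Note: sorting on a LONG field requires it to be indexed with doc values in Lucene 5+.
    static TopDocs newestFirst(IndexSearcher searcher, Query query) throws IOException {
        // Each SortField adds a per-document memory cost while sorting,
        // so sort on as few fields as possible.
        Sort byDate = new Sort(new SortField("postedDate", SortField.Type.LONG, true));
        return searcher.search(query, 50, byDate);
    }
}
```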