What does the optimize method do? Alternatives to optimize in the latest versions of Lucene

I am pretty new to Lucene and am trying to understand the segment merging process. I came across the optimize method, which merges all the Lucene index segments available at that moment.
My exact question is: does optimize merge all levels of segments and create one complex segment?
What are the alternatives in the latest version of Lucene (say, Lucene 6.5)?
Is it good to always call optimize after indexing, so that my index always has a single segment and searches are fast?

First of all, you don't always need to merge down to a single segment; the target number of segments is configurable. In principle, the idea of merging segments (optimizing the index) comes from how Lucene implements deletes: Lucene does not remove deleted documents right away but only marks them as deleted, and new documents always go into new segments. Merging rewrites the segments without the documents marked as deleted.
Lucene keeps many per-segment files, such as the term dictionary, so merging segments reduces heap usage and makes searches faster. However, merging is usually not a fast operation.
Overall, you need to strike a balance between merging/optimizing every time you index new documents and not doing it at all. One thing to look at is MergePolicy, which defines how segments are merged; several implementations with different strategies are available. If none of them suits you (which I doubt), you can implement one for your needs.
In Lucene 6.5, the method to use is forceMerge on the IndexWriter class:
public void forceMerge(int maxNumSegments)
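To make the mark-then-merge idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene code: the names SegmentMergeSketch, Segment, and mergeDownTo are hypothetical, segments are modelled as in-memory lists, and the "merge the two smallest segments first" rule is an assumption chosen for simplicity rather than what any real MergePolicy does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: real Lucene segments are sets of on-disk files, not Java lists.
final class SegmentMergeSketch {

    static final class Segment {
        final List<String> docs = new ArrayList<>();
        final List<Boolean> deleted = new ArrayList<>();

        void add(String doc) {
            docs.add(doc);
            deleted.add(false);
        }

        // A delete only flips a flag; the document still takes up space.
        void markDeleted(String doc) {
            int i = docs.indexOf(doc);
            if (i >= 0) {
                deleted.set(i, true);
            }
        }
    }

    // Copies only the live documents of both inputs into a new segment.
    static Segment merge(Segment a, Segment b) {
        Segment out = new Segment();
        for (Segment s : new Segment[] {a, b}) {
            for (int i = 0; i < s.docs.size(); i++) {
                if (!s.deleted.get(i)) {
                    out.add(s.docs.get(i));
                }
            }
        }
        return out;
    }

    // Hypothetical policy: keep merging the two smallest segments
    // until at most maxNumSegments remain.
    static List<Segment> mergeDownTo(List<Segment> segments, int maxNumSegments) {
        if (maxNumSegments < 1) {
            throw new IllegalArgumentException("maxNumSegments must be >= 1");
        }
        List<Segment> result = new ArrayList<>(segments);
        while (result.size() > maxNumSegments) {
            result.sort(Comparator.comparingInt((Segment s) -> s.docs.size()));
            Segment a = result.remove(0);
            Segment b = result.remove(0);
            result.add(merge(a, b));
        }
        return result;
    }

    public static void main(String[] args) {
        Segment s1 = new Segment();
        s1.add("a"); s1.add("b"); s1.add("c");
        s1.markDeleted("b");
        Segment s2 = new Segment();
        s2.add("d");
        Segment s3 = new Segment();
        s3.add("e"); s3.add("f");
        s3.markDeleted("e");

        List<Segment> merged = mergeDownTo(List.of(s1, s2, s3), 1);
        System.out.println("segments: " + merged.size());
        System.out.println("live docs: " + merged.get(0).docs);
    }
}
```

In this model the deleted documents only disappear when their segment is rewritten by a merge, which is why merging both shrinks the index and costs time proportional to the data being copied.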

Related

Why does RavenDB read all documents during indexing and not only the collections used by the index?

I have a fairly large database with ~2.6 million documents: two collections of 1.2 million documents each, and the rest are small collections (<1000 documents). When I create a new index for a small collection, indexing takes a long time to complete (so temp indexes are useless). It seems that the RavenDB indexing process reads every document in the database and checks whether it should be added to the index. I think it would perform better if it indexed only the collections used by the index.
Also, when I use Smuggler to export only one small collection, it reads all documents, and the export can take quite a long time. At the same time, a custom app that uses the RavenDB LINQ API and indexes can export the data in seconds.
Why does RavenDB behave like this? Is there a configuration setting that changes this behavior?
RavenDB doesn't actually have any real concept of a "collection". All documents are pretty much the same. It simply looks at the Raven-Entity-Name metadata in each document to determine how to group things together for purposes of querying by type and displaying the "Collections" page in the management studio.
I am not sure of the specific rationale for this. I think it has something to do with the underlying ESENT tables used by the document store. Perhaps Ayende can answer better. Your particular use cases are good examples for why it might be done differently.
One thing you could try is to use multiple databases. You could put your large-quantity documents in one database and everything else in another. Of course, you may have problems with indexing related documents, multi-map/reduce, or other scenarios where documents of different types need to be together in the same database.
It seems that the answer to my question is coming in RavenDB 3.0. Ayende says:
In RavenDB 2.x, you still have to pay the full price for indexing
everything, but that isn’t the case in RavenDB 3.0. What we have done
is to effectively optimize the process so that in this case, we will
preload all of the documents taking part in the relevant collection,
and send them directly to be indexed.
We do this by utilizing the Raven/DocumentsByEntityName index. Which
has already indexed everything in the database anyway. This is a nice
little feature, because it allows us to really take advantage of the
work we already did long ago. Using one index to pre-populate another
is a neat trick, and one that I am very happy about.
And here is the full blog post: http://ayende.com/blog/165923/shiny-features-in-the-depth-new-index-optimization

How does Lucene use skip lists in the inverted index?

Some blogs and the Lucene website say that Lucene uses the "skip list" data structure in its inverted index, but a few things puzzle me.
1: In general, a skip list is used in memory, but the inverted index is stored on disk. So how does Lucene use it when searching the index? Does it scan the skip list on disk or load it into memory?
2: A skip list's insert operation usually uses random(0,1) to decide whether to promote an entry to the next level, but the Lucene introduction seems to describe a fixed interval for every term. So does Lucene build its "skip list" differently?
Please correct me if I am wrong.
Even though the index is persisted on disk, Lucene uses memory in a couple of different ways: when the IndexReader is created for searching, and for operations like sorting (field cache):
http://blog.mikemccandless.com/2010/07/lucenes-ram-usage-for-searching.html
Basically, those binary files get copied into RAM for much faster scanning and reduced I/O. The link above hints at how searching with certain parameters can make Lucene "skip terms in searching", which is where that data structure can be used.
Lucene is open source, so you can check the code yourself to see what is used, either in Java or in Lucene.NET, the C# implementation.
See also: "To accelerate posting list skips, Lucene uses skip lists".
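As a rough illustration of the second point in the question (fixed intervals instead of random promotion), here is a minimal single-level sketch in plain Java. It is not Lucene's on-disk format, which is more elaborate; the class name PostingSkipSketch, the advance method, and the interval of 3 are all hypothetical choices for the example. Because a posting list is written once in sorted order and never updated in place, skip entries can be placed deterministically at every skipInterval-th posting.

```java
// Toy single-level skip structure over a sorted posting list (not Lucene's format).
final class PostingSkipSketch {
    private final int[] docIds;     // sorted document ids for one term
    private final int skipInterval; // a skip entry is kept for every skipInterval-th posting
    private final int[] skipDocs;   // skipDocs[k] == docIds[k * skipInterval]

    PostingSkipSketch(int[] sortedDocIds, int skipInterval) {
        if (skipInterval < 1) {
            throw new IllegalArgumentException("skipInterval must be >= 1");
        }
        this.docIds = sortedDocIds.clone();
        this.skipInterval = skipInterval;
        int entries = (docIds.length + skipInterval - 1) / skipInterval;
        this.skipDocs = new int[entries];
        for (int k = 0; k < entries; k++) {
            skipDocs[k] = docIds[k * skipInterval];
        }
    }

    int size() {
        return docIds.length;
    }

    int docAt(int pos) {
        return docIds[pos];
    }

    // Returns the position of the first posting at or after 'from' whose
    // doc id is >= target, or size() if there is none. Assumes from >= 0.
    int advance(int from, int target) {
        int pos = from;
        // Jump over whole blocks whose first doc id is still below the target.
        int k = pos / skipInterval + 1;
        while (k < skipDocs.length && skipDocs[k] < target) {
            pos = k * skipInterval;
            k++;
        }
        // Finish with a short linear scan inside the block.
        while (pos < docIds.length && docIds[pos] < target) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        PostingSkipSketch postings =
            new PostingSkipSketch(new int[] {2, 5, 9, 14, 20, 27, 33, 41, 50, 62}, 3);
        int pos = postings.advance(0, 34);
        System.out.println("position " + pos + ", doc " + postings.docAt(pos));
    }
}
```

With the sample data, advancing to target 34 jumps over two blocks via the skip entries and then scans one posting, landing on position 7 (doc 41). The same idea works whether the skip entries are read from disk or from memory, since only the entries actually visited need to be loaded.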

Lucene creating duplicate indexes

I created an app using Lucene. The server wound up throwing out-of-memory errors because I was newing up an IndexSearcher for every search in the app. The garbage collector couldn't keep up.
I just finished implementing a singleton approach, and now multiple indexes are being created.
Any clue why this is happening? IndexWriter is what I am keeping static. I get IndexSearchers from it.
You don't have multiple indexes, you just have multiple segments. Lucene splits the index up into segments over time, although you can compact it if you want.
See here and here for more info
You also probably want to "new up" one IndexSearcher and pass it around; it seems like you are creating the index every time here.

Index verification tools for Lucene

How can we know the index in Lucene is correct?
Detail
I created a simple program that creates Lucene indexes and stores them in a folder. Using the diagnostic tool Luke, I could look inside an index and view its content.
I realise Lucene is a standard framework for building a search engine, but I want to be sure that Lucene indexes every term that exists in a file.
Can I verify that the Lucene index creation is dependable? That not even a single term went missing?
You could always build a small program that will perform the same analysis you use when indexing your content. Then, for all the terms, query your index to make sure that the document is among the results. Repeat for all the content. But personally, I would not waste time on this. If you can open your index in Luke and if you can make a couple of queries, everything is most probably fine.
Often, the real question is whether or not the analysis you did on the content will be appropriate for the queries that will be made against your index. You have to make sure that your index will have a good balance between recall and precision.
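The check described above can be sketched as follows. This is only a sketch under loud assumptions: a plain in-memory map stands in for the Lucene index, and analyze is a hypothetical stand-in for whatever analyzer was used at indexing time (here: lowercase, split on non-alphanumeric characters). Against a real index you would run the same loop but issue a term query per term instead of a map lookup.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy verification loop: an in-memory map stands in for the real index.
final class IndexVerificationSketch {

    // Hypothetical analyzer: must match the analysis used when indexing.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Builds term -> set of document ids (the list position is the doc id).
    static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId))) {
                index.computeIfAbsent(term, k -> new HashSet<>()).add(docId);
            }
        }
        return index;
    }

    // Re-analyzes every document and reports "docId:term" for each term
    // whose postings do not contain that document.
    static List<String> findMissing(Map<String, Set<Integer>> index, List<String> docs) {
        List<String> missing = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : analyze(docs.get(docId))) {
                Set<Integer> postings = index.get(term);
                if (postings == null || !postings.contains(docId)) {
                    missing.add(docId + ":" + term);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("Lucene indexes every term", "Luke inspects the index");
        Map<String, Set<Integer>> index = buildIndex(docs);
        System.out.println("missing: " + findMissing(index, docs));
    }
}
```

An empty result only shows that every analyzed term can be found again; it says nothing about whether the analysis itself suits the queries users will run, which is the recall/precision concern raised above.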

Tips/recommendations for using Lucene

I'm working on a job portal using ASP.NET 3.5.
I've used Lucene for job and resume search functionality.
Would like to know tips/recommendations if any with respect to Lucene performance optimization, scalability, etc.
Thanks a ton!
I've documented how I used Lucene.NET (in BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx
One thing you should keep in mind is that it is very hard to cluster or replicate Lucene indexes in large installations, such as failover scenarios or distributed systems. So you should have a good way to replicate either your index jobs or the whole database.
If you use a sort, watch out for the size of the comparators. When sorts are used, for each document returned by the searcher there will be a comparator object stored for each SortField in the Sort object. Depending on the size of the documents and the number of fields you want to sort on, this can become a big headache.