I have a question about optimizing a big Lucene index (it's 197 GB now - that may not sound very big to some of you).
I'm using Lucene version 2.9.4 and have reached the point where I need to optimize an index with 900 segments down to a much smaller number of segments (ideally 1-10). I'm still calling IndexWriter.optimize(), which is available in 2.9.4, but setting the merge factor instead fails in the same way.
What happens is that after an hour of optimizing, my logs (I have enabled every log I can) say that optimization is done, and there are no errors in any log file. Everything looks fine except that the files in the index directory are still the same - the number of files has not been reduced and no deletes have been expunged.
I have enough space on the drive (300 GB), and no readers or searchers are open - the index is isolated and dedicated to the optimization.
According to the index writer logs, the merge threads merge segments and iteratively print segment counts from 900 down to 456, and then suddenly report that they are merging everything down to 16 segments (the number of segments I told it to merge to).
Does anyone have any idea what could be happening? Am I merging too many segments? Could there be an OS-related issue (Windows Server 2008), such as too many open file handles (and where would I check for that message)?
Thanks in advance
It's not a failure. The issue is very simple - you just need to open an index reader (or reopen an existing one) after the optimization has completed. That's it. The old index files will be replaced by the new set of files within a few seconds of opening the reader.
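For illustration, here is a minimal sketch of what that looks like with the Lucene.NET 2.9.x API (the Java calls are analogous); the directory path is just a placeholder:

using System.IO;
using Lucene.Net.Index;
using Lucene.Net.Store;

var dir = FSDirectory.Open(new DirectoryInfo(@"D:\big-index"));

// ... IndexWriter.Optimize(16) has already run and been committed ...

// Opening (or reopening) a reader against the optimized index is what lets the
// old segment files be released; the directory listing then shrinks to the
// merged segments shortly afterwards.
using (var reader = IndexReader.Open(dir, true /* readOnly */))
{
    // nothing else to do - opening the reader is the important part
}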
I'm indexing a sequence of documents with the IndexWriter and committing the changes at the end of the iteration.
In order to access the uncommitted changes I'm using NRTS (near-real-time search) as described here.
Imagine that I'm indexing 1000 documents and iterating through them to check whether there are any I can reuse/update (due to some specific requirements I have).
I'm reopening the reader at each iteration:
using (var indexReader = writer.GetReader())
using (var searcher = new IndexSearcher(indexReader))
How slow should reopening the reader be? Once the index gets to around 300K documents, indexing 1000 documents can occasionally take around 60 seconds (there isn't much text per document).
Am I taking the wrong approach? Please advise.
To improve your performance, you need to optimize less often.
I use a separate timer for optimization. Every 40 minutes it enables optimization down to five segments (a good value according to "Lucene in Action"), which then happens if the indexer is running (there is no need to optimize if the indexer is shut down). Then, once a day, at a very low-usage time of day, it enables optimization down to one segment. I usually see about 5 minutes for the one-segment optimization. Feel free to borrow my strategy, but in any case, don't optimize so often - the optimization is hurting your overall indexing rate, especially given that your documents are small, so the 500-document iteration loop must be happening frequently.
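If it helps, here is a rough sketch of that schedule in C# against Lucene.NET; the class name, timers and flag handling are mine, not a drop-in implementation:

using System;
using System.Threading;
using Lucene.Net.Index;

public class OptimizeScheduler
{
    private readonly IndexWriter _writer;
    private readonly Timer _fiveSegmentTimer;
    private readonly Timer _oneSegmentTimer;
    private int _pendingSegmentTarget;   // 0 = no optimization requested

    public OptimizeScheduler(IndexWriter writer)
    {
        _writer = writer;
        // Every 40 minutes, request an optimize down to 5 segments.
        _fiveSegmentTimer = new Timer(_ => Interlocked.Exchange(ref _pendingSegmentTarget, 5),
                                      null, TimeSpan.FromMinutes(40), TimeSpan.FromMinutes(40));
        // Once a day (schedule it for a quiet hour), request a full optimize to 1 segment.
        _oneSegmentTimer = new Timer(_ => Interlocked.Exchange(ref _pendingSegmentTarget, 1),
                                     null, TimeSpan.FromHours(24), TimeSpan.FromHours(24));
    }

    // Call this from the indexing loop after each batch, so optimization only
    // runs while the indexer is actually running.
    public void MaybeOptimize()
    {
        int target = Interlocked.Exchange(ref _pendingSegmentTarget, 0);
        if (target > 0)
        {
            _writer.Optimize(target);   // merge down to 'target' segments
            _writer.Commit();
        }
    }
}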
You could also put some temporary logging code in at the various stages to see where your indexer is spending its time, so you can tweak the iteration size, the settling time between loops (if you're paranoid like me), the optimization frequency, and so on.
I have a Lucene-based application and, obviously, a problem.
When the number of indexed documents is low, no problems appear. When the number of documents grows, it seems that single words are not being indexed: searching for a single word (a single term) returns an empty result set.
We are using Lucene 3.1 on a 64-bit machine and the index is 10 GB.
Do you have any idea?
Thanks
According to the Lucene documentation, Lucene should be able to handle 274 billion distinct terms. I don't believe it is possible that you have reached that limitation in a 10GB index.
Without more information, it is difficult to help further. However, given that you only see problems with large numbers of documents, I suspect you are running into some kind of exceptional condition that causes the system to fail to read or respond correctly - a file handle leak or memory exhaustion, perhaps, to take a stab in the dark.
I have a program that produces a lot of data, which it writes to a CSV file line by line (as the data is created). If I were able to open the CSV file in Excel it would be about 1 billion cells (75,000 * 14,600). I get a System.OutOfMemoryException every time I try to access it (or even to create an array of this size). If anyone has any idea how I can get the data into VB.NET so I can do some simple operations on it (all the data needs to be available at once), I'll try every idea you have.
I've looked at increasing the amount of RAM used, but other articles/posts say this will fall short well before the 1 billion mark. There are no time constraints here - assuming it takes no more than a few days/weeks I can deal with it (I'll only be running it once or twice a year). If you don't know any way to do it, the only other solutions I can think of would be increasing the number of columns in Excel to ~75,000 (if that's even possible - I can't write the data the other way around), or perhaps another language that could handle this?
At present it fails right at the start:
Dim bigmatrix(75000, 14600) As Double
Many thanks,
Fraser :)
First, this will always require a 64-bit operating system and a fairly large amount of RAM, as you're trying to allocate about 8 GB.
This is theoretically possible in Visual Basic targeting .NET 4.5 if you turn on gcAllowVeryLargeObjects. That said, I would recommend using a jagged array instead of a multidimensional array if possible, as that removes the need for a single 8 GB allocation. (It may also allow this to work in .NET 4 or earlier.)
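For example, something along these lines (a sketch in C#; the VB.NET equivalent is very similar, and the dimensions are taken from the question):

const int rows = 75000;
const int cols = 14600;

// A jagged array: 75,000 separate row allocations of roughly 114 KB each,
// instead of one contiguous ~8 GB block (which would also need
// <gcAllowVeryLargeObjects enabled="true"/> under <runtime> in app.config).
var bigMatrix = new double[rows][];
for (int i = 0; i < rows; i++)
{
    bigMatrix[i] = new double[cols];
}

// Access looks almost the same as the two-dimensional version:
bigMatrix[42][7] = 3.14;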
I've got an index that currently occupies about 1 GB of space and holds about 2.5 million documents. The index is stored on a solid-state drive for speed. I'm adding 2,500 documents at a time and committing after each batch has been added. The index is a "live" index and needs to be kept up to date throughout the day and night, so keeping write times to a minimum is very important. I'm using a merge factor of 10 and never calling Optimize(), instead allowing the index to optimize itself as needed based on the merge factor.
I need to commit the documents after each batch has been added because I record this fact so that if the app crashes or restarts, it can pick up where it left off. If I didn't commit, the stored state would be inconsistent with what's in the index. I'm assuming my additions, deletions and updates are lost if the writer is destroyed without committing.
Anyway, I've noticed that after an arbitrary period of time - anywhere from two minutes to two hours, and some variable number of commits - the indexer seems to stall on the IndexWriter.AddDocument(doc) method, and I can't for the life of me figure out why it's stalling or how to fix it. The block can stay in place for upwards of two hours, which seems strange for an index taking up less than 2 GB, in the low millions of documents, with an SSD to work with.
What could cause AddDocument to block? Are there any Lucene diagnostic utilities that could help me? What else could I look for to track down the problem?
You can use IndexWriter.SetInfoStream() to redirect diagnostics output to a stream that might give you a hint of what's wrong.
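For example (a sketch using the Lucene.NET API - with the Java version you pass a PrintStream instead of a StreamWriter; the paths and version constant here are placeholders):

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

var dir = FSDirectory.Open(new DirectoryInfo(@"C:\my-index"));
var writer = new IndexWriter(dir, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29),
                             IndexWriter.MaxFieldLength.UNLIMITED);

// Send Lucene's internal flush/merge/commit chatter to a log file so you can
// see what the writer is doing when AddDocument appears to stall.
var infoLog = new StreamWriter(@"C:\logs\lucene-infostream.log") { AutoFlush = true };
writer.SetInfoStream(infoLog);

// ... AddDocument / Commit as usual, then inspect the log around the time of the stall.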
I know there have been some semi-similar questions, but in this case I am building an index that stays offline until the build is complete. I am building two cores from scratch: one has about 300k records with a lot of citation information and large blocks of full text (this is the document index), and the other has about 6.6 million records, also with full text (this is the page index).
Given that this index is being built offline, the only real performance concern is build speed. No one should be querying this data.
The auto-commit would apparently fire if I stopped adding items for 50 seconds, which I never do - I am adding ten at a time, and batches are added every couple of seconds.
So, should I commit more often? I feel like the longer this runs, the slower it gets, at least in my test case of 6k documents to index.
With no one searching this index, how often would you suggest I commit?
I should say that I am using Solr 3.1 and SolrNet.
Although it's the commits that are taking time for you, you might want to consider tweaking things other than the commit frequency.
Is the indexing core also the one that does the searching, or is the index replicated somewhere else after indexing concludes? If the latter is the case, then turning off caches might have a very noticeable impact on performance, since Solr rebuilds its caches every time you commit.
You could also look into using the autoCommit or commitWithin features of Solr.
commitWithin is specified as part of the document add command. I believe this is supported by SolrNet - please see the "Using the commitWithin attribute" thread for more details.
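For example, something like this (a sketch only - it assumes a SolrNet version whose AddParameters exposes CommitWithin, and the Page class and batch variable are placeholders for your own mapping):

using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Commands.Parameters;

var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Page>>();

foreach (var page in batchOfTen)
{
    // commitWithin = 60000 ms tells Solr to make the document searchable within
    // a minute, so you don't need an explicit Commit() after every small batch.
    solr.Add(page, new AddParameters { CommitWithin = 60000 });
}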
autoCommit is a Solr configuration value added to the update handler section of solrconfig.xml.