Lucene IndexSearcher locks index causing IOException when rebuilding

I've learned from the available documentation that an IndexSearcher instance should be shared across searches for optimal performance, and that a new instance must be created in order to pick up any changes made to the index. This implies that the index remains writable (using IndexWriter) after an IndexSearcher pointing to the same directory has been created. However, this is not the behaviour I see in my Lucene.Net implementation. I'm using FSDirectory; RAMDirectory is not a viable option. The IndexSearcher locks one of the index files (in my implementation it's the _1.cfs file), making the index non-updatable during the lifetime of the IndexSearcher instance.
Is this a known behaviour? Can't I rebuild the index from scratch while using an IndexSearcher instance created prior to rebuilding? Is it only possible to make modifications to the index, but not to rebuild it?
Here is how I create the IndexSearcher instance:
// Create FSDirectory
var directory = FSDirectory.GetDirectory(storagePath, false);
// Create IndexReader
var reader = IndexReader.Open(directory);
// I get the same behaviour regardless of whether I close the directory or not.
directory.Close();
// Create IndexSearcher
var searcher = new IndexSearcher(reader);
// Closing the reader will cause "object reference not set..." when searching.
//reader.Close();
Here is how I create the IndexWriter:
var directory = FSDirectory.GetDirectory(storagePath, true);
var indexWriter = new IndexWriter(directory, new StandardAnalyzer(), true);
I'm using Lucene.Net version 2.0.
Edit:
Upgrading to Lucene.Net 2.1 (thanks, KenE) and slightly modifying the way I create my IndexWriter fixed the problem:
var directory = FSDirectory.GetDirectory(storagePath, false);
var indexWriter = new IndexWriter(directory, new StandardAnalyzer(), true);

The latest version of Lucene.Net (2.1) appears to support opening an IndexWriter with create=true even when there are open readers:
http://incubator.apache.org/lucene.net/docs/2.1/Lucene.Net.Index.IndexWriter.html
Earlier versions are not clear as to whether they support this or not. I would try using 2.1.

Related

Memory cache dependencies

I'm trying to understand the Cache dependencies example, but some of it is eluding me.
var cts = new CancellationTokenSource();
_cache.Set(CacheKeys.DependentCTS, cts);
using (var entry = _cache.CreateEntry(CacheKeys.Parent))
{
    // Expire this entry if the dependent entry expires.
    entry.Value = DateTime.Now;
    entry.RegisterPostEvictionCallback(DependentEvictionCallback, this);
    _cache.Set(CacheKeys.Child,
        DateTime.Now,
        new CancellationChangeToken(cts.Token));
}
CancellationTokenSource allows evicting multiple cache entries as a group. But what's the point of the using block in this example? I downloaded the sample project and replaced the using block with the following:
_cache.Set(CacheKeys.Parent, DateTime.Now, new CancellationChangeToken(cts.Token));
_cache.Set(CacheKeys.Child, DateTime.Now, new CancellationChangeToken(cts.Token));
While my two lines are simpler, they seem to have the same effect as the using block. The doc says:
With the using pattern in the code above, cache entries created inside the using block will inherit triggers and expiration settings.
What does "triggers and expiration settings" refer to here?
I must add dependent cache entries in my application, but they must be added at separate times. Therefore I can't take the approach shown in the example's using block, since it requires both entries to be added at the same time, right? What functionality am I missing out on by taking the approach shown above in my two lines of code instead?
Finally, shouldn't Dispose be called on CancellationTokenSource after the Cancel call? The sample doesn't do that.

Remove PDFont caching with Apache Tika

I am trying to extract text only from a number of different documents (RTF, DOC, PDF). I naturally turned to Apache Tika because it can autodetect the document type and extract text accordingly. I am only interested in the text, not in formatting etc.
My application ends up with a big memory leak, and on investigating it, it comes from the caching in the PDFont class from the PDFBox dependency. I am not interested in caching font metrics and other font-formatting data from PDFs, as I only want to extract the text.
I am using Tika 1.12. Does anyone know how to get around this caching issue? This is how I am using AutoDetectParser:
AutoDetectParser parser = new AutoDetectParser();
// -1 disables BodyContentHandler's default 100,000-character write limit.
BodyContentHandler handler = new BodyContentHandler(-1);
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
FileInputStream inputstream = new FileInputStream(new File(child.getPath()));
try {
    parser.parse(inputstream, handler, metadata, context);
} finally {
    inputstream.close();
}
String s = handler.toString();
// Ask PDFBox to release its static font cache.
PDFont.clearResources();
So I fudged a workaround and just called System.gc(); every time a file had finished being processed, which works a treat but doesn't really answer the question.
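For illustration, a minimal sketch of such a per-file loop with that workaround (the TextExtractor class name and the directory argument are assumptions, not from the original code):
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TextExtractor {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        for (File child : new File(args[0]).listFiles()) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            InputStream in = new FileInputStream(child);
            try {
                parser.parse(in, handler, new Metadata(), new ParseContext());
            } finally {
                in.close();
            }
            String text = handler.toString();
            // ... consume the extracted text ...
            PDFont.clearResources(); // drop PDFBox's static font cache
            System.gc();             // the fudge: nudge the collector after each file
        }
    }
}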

How to read a Lucene 3.2 index with Lucene 4.10?

Getting Lucene 4.10 to read 3.2-version indexes.
We upgraded to 4.10 but still need to read our 3.2 indexes. We deployed JRE 7 as required and made all the changes in the existing code base, which is now failing on the old indexes. We still need to read the 3.2 indexes before we take on re-indexing. How can existing 3.2 indexes be read by Lucene 4.10 (what changes to the code are needed, if any)?
You can use IndexUpgrader, something like:
IndexUpgrader upgrader = new IndexUpgrader(myIndexDirectory, Version.LUCENE_4_10_0);
upgrader.upgrade();
or run it from the command line:
java -cp lucene-core.jar org.apache.lucene.index.IndexUpgrader myIndexDirectory
You can set the codec used to decode the indexes in the IndexWriterConfig. Lucene3xCodec would be the codec to use here:
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
config.setCodec(new Lucene3xCodec());
IndexWriter writer = new IndexWriter(directory, config);
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(writer, true));
Bear in mind, this codec is strictly read-only. Any attempt to add, delete, or update a document will result in an UnsupportedOperationException being thrown. If you wish to support writing to the index, you must upgrade it first (see the IndexUpgrader answer above).
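That said, for read-only searching it may not be necessary to set the codec explicitly at all: as far as I know, Lucene 4.10 detects 3.x segments and decodes them with the built-in read-only Lucene3x codec automatically, so a plain reader suffices. A minimal sketch, assuming the index lives in myIndexDirectory:
import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory directory = FSDirectory.open(new File("myIndexDirectory"));
// 3.x segments are decoded by the read-only Lucene3x codec automatically.
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);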

Avoid indexing documents again in Lucene

Every time I run my program in Eclipse, it indexes the documents again. However, I want to index just once, perhaps by deleting the index after each use, but I don't know how to go about doing that.
Set your IndexWriter to OpenMode.CREATE. It's probably set to OpenMode.CREATE_OR_APPEND now. Setting it to CREATE will cause the existing index at the specified directory to be overwritten when you open the IndexWriter, to make way for the new one.
Like:
IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
// etc.
IndexWriter writer = new IndexWriter(directory, config);
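Conversely, if the goal is to build the index once and reuse it on later runs, one option is to skip indexing when an index already exists. A minimal sketch, assuming Lucene 4.x (where DirectoryReader.indexExists is available) and the same directory, version, and analyzer variables as above:
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;

// Only build the index if the directory does not already contain one.
if (!DirectoryReader.indexExists(directory)) {
    IndexWriterConfig config = new IndexWriterConfig(version, analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(directory, config);
    try {
        // addDocument(...) calls go here
    } finally {
        writer.close();
    }
}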

How to reopen a closed IndexWriter in Lucene 3.2?

How do I reopen a closed IndexWriter in Lucene 3.2?
And how do I test whether an IndexWriter is closed?
When we create an instance of IndexWriter, we should do it like this:
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_32, analyzer);
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
If IndexWriterConfig.OpenMode.CREATE_OR_APPEND is used, IndexWriter will create a new index if there is not already an index at the provided path, and otherwise open the existing index.
The above is from: https://lucene.apache.org/core/4_6_0/core/org/apache/lucene/index/IndexWriter.html
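To answer the two original questions directly: a closed IndexWriter cannot be reopened; you construct a new IndexWriter over the same Directory instead. Lucene 3.2 also has no public "is closed" check, so one common approach is to catch the AlreadyClosedException that any operation on a closed writer throws. A minimal sketch (ensureWriter is a hypothetical helper, not a Lucene API):
import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

// Hypothetical helper: returns the writer if it is still open,
// otherwise constructs a fresh one over the same directory.
IndexWriter ensureWriter(IndexWriter writer, Directory dir, Analyzer analyzer)
        throws IOException {
    try {
        writer.commit(); // throws AlreadyClosedException if the writer was closed
        return writer;
    } catch (AlreadyClosedException e) {
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_32, analyzer);
        iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        return new IndexWriter(dir, iwc);
    }
}
Note that commit() is used here only as a liveness probe; it will flush any pending changes as a side effect.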