How to reuse an index that was already created using Apache Lucene?

I have a program using Lucene that creates an index in a directory on every execution. Since creating the index on each and every execution is a time-consuming process, I want to reuse the index already created during the initial execution.
Is this possible in Lucene? Does Lucene have this feature?

It is absolutely possible. Assuming indexDirPath is the location of your Lucene index, you can use the following code:
Directory dir = FSDirectory.open(new File(indexDirPath));
IndexReader ir = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(ir);
Queries against this searcher should then be built with the same Analyzer you used while creating the index.
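As a minimal end-to-end sketch (Lucene 4.x-style API to match the snippet above; the field names "contents" and "path" and the StandardAnalyzer are assumptions — substitute whatever your indexer actually used):

```java
// Reopen the existing index and run a query against it.
// NOTE: constructor signatures vary between Lucene versions; this follows
// the 4.x-style API used in the snippet above.
Directory dir = FSDirectory.open(new File(indexDirPath));
IndexReader ir = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(ir);

// Must be the same Analyzer type that built the index (assumption: StandardAnalyzer).
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser("contents", analyzer); // "contents" is a hypothetical field
Query query = parser.parse("lucene");

TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    Document doc = searcher.doc(sd.doc);
    System.out.println(doc.get("path")); // "path" is a hypothetical stored field
}
ir.close(); // close the reader and directory when done
dir.close();
```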

Related

Modify existing Solr 7.6.0 / Lucene index (add another field 'URL' to an already indexed file (.pdf, .docx etc.))

I have a Solr 7.6.0 Lucene index (lots of .pdf, .docx and .xlsx files).
The index was created using the post command in a command window, pointing to a directory share (mapped filepath) where the files exist.
There is also a web URL for the document which I have in a database and Lucene currently knows nothing about. I would like to 'enrich' the existing index with this URL data.
Can I extract the id of the currently indexed files and then use the Solr web interface to modify the existing index, injecting the URL?
I am looking at the following tutorial for advice:
https://www.tutorialspoint.com/apache_solr/apache_solr_indexing_data.htm
The tutorial shows an example of adding a document but not modifying one.
Thanks @MatsLindh, I managed to get it to work:
I used the Solr GUI to run the JSON add-field update:
{
  "add-field": {
    "name": "URL",
    "type": "string",
    "stored": true,
    "indexed": true
  }
}
I then inserted/set the property:
{
  "id": "S:\\Docs\\forIndexing\\indexThisFile_001.pdf",
  "URL": {"set": "https://localhost/urlToFiles/indexThisFile_001.pdf"}
}

SearcherManager and MultiReader in Lucene

I am using Lucene and I have a MultiReader from a few directory readers like:
MultiReader myMultiReader = new MultiReader(directoryReader1, directoryReader2, ...);
I want to use a SearcherManager with it, since there will be changes in the index from time to time. How can I do this? SearcherManager only accepts a single DirectoryReader or IndexWriter as a constructor parameter:
https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/search/SearcherManager.html
I don't understand how I could combine both MultiReader and SearcherManager.
By the way, I have already checked these links, which don't really answer this particular issue:
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
http://lucene.472066.n3.nabble.com/SearcherManager-vs-MultiReader-td4068411.html

How to use Sitecore Solr Custom Index

Could someone please help me understand the below?
Do we need to specify the name of the index in code when using a Sitecore Solr search?
If we make a new custom index called 'sitecore_web-index_custom', how do we make sure we are using this index in code?
Thank you.
In order to get a Sitecore index, use the GetIndex method of the ContentSearchManager class:
Sitecore.ContentSearch.ContentSearchManager.GetIndex(...)
You can either pass the index name:
// get Sitecore built in index for current database:
string dbName = (Sitecore.Context.ContentDatabase ?? Sitecore.Context.Database).Name;
var index = Sitecore.ContentSearch.ContentSearchManager.GetIndex("sitecore_" + dbName + "_index");
// get custom index
var customIndex = Sitecore.ContentSearch.ContentSearchManager.GetIndex("sitecore_web-index_custom");
or a Sitecore item:
// get index by Sitecore item
Sitecore.ContentSearch.ContentSearchManager.GetIndex((SitecoreIndexableItem)item);
In the second scenario, Sitecore will try to find the index in which the item is indexed.
There is no difference between getting Solr or Lucene indexes - the Sitecore API is transparent here.
More information about Sitecore search and indexing can be found in
Sitecore Search and Indexing Guide
Developer's Guide to Item Buckets and Search

Lucene's IndexWriter clarifications

In an attempt to find a bug in our project, I found myself with a few questions about Lucene's indexing API with no answer.
The first one is related to the following snippet:
IndexWriter writer = ...; // open an index for writing
// ... do some heavy updates (inserts and deletes) to the index using 'writer'
IndexReader reader = writer.GetReader();
long[] idsArray = FieldCache_Fields.DEFAULT.GetLongs(reader, "ID_Field");
// under the assumption that ALL indexed documents contain a field named "ID_Field"
Is it promised by Lucene's API that the reader I get will ALWAYS see the updated, even though uncommitted, index? Just to make sure my question is clear: every deleted doc WON'T be seen by the reader and every added doc WILL be.
The second question is related to the next snippet:
IndexWriter writer = ...; // open an index for writing, but don't change a thing - just commit metadata
var commitUserData = new Dictionary<string, string> { { "Hello", "World" } };
writer.Commit(commitUserData);
Is it promised that the metadata will be committed to the index, even though I opened the writer without making any actual change to the index?
For both questions I would be happy to know what was intended by the API, and also whether anyone knows about issues (any bugs?) specific to Lucene.NET 2.9.2.
Thanks, guys!
First question: yes.
From the doc:
Expert: returns a readonly reader, covering all committed as well as uncommitted changes to the index. This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling #commit.
Note that this is functionally equivalent to calling #commit and then using IndexReader#open to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly #commit.
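For reference, the same near-real-time pattern in Java Lucene looks roughly like this (a sketch against the 4.x+ Java API; Lucene.NET 2.9 exposes it as writer.GetReader() instead, and exact signatures vary by version):

```java
// Near-real-time reader: sees uncommitted adds/deletes made through the writer.
IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
writer.addDocument(doc);                // not committed yet

DirectoryReader reader = DirectoryReader.open(writer); // already sees 'doc'
IndexSearcher searcher = new IndexSearcher(reader);

// Later, pick up further uncommitted changes cheaply:
DirectoryReader newer = DirectoryReader.openIfChanged(reader);
if (newer != null) {                    // null means nothing changed
    reader.close();
    reader = newer;
}
```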

Minimize Lucene index file count

I have a specific app which requires that the number of files that make up an index be as few as possible. Previously, when I used Lucene.NET 2.9.2, I was able to keep the entire index in 3 (or 4) files by using:
writer.SetUseCompoundFile(true);
writer.Optimize(1, true);
After upgrading to Lucene.NET 2.9.4, the same code produces an index consisting of 10 files (fdt, fdx, fnm, frq, nrm, prx, tii, tis + segments.gen and segments_c). How can I bring that down again?
The cause for this is probably deep in Lucene and not that Lucene.NET-specific. Still, something changed between versions and I'd love to have control over this.
OK, I've finally found an answer. When inspecting the index directory during the lengthy indexing process, I observed that a CFS file comes and goes, but once the process is done there is no sign of it. I did some more research given some new keywords (thanks @jf-beaulac) and I've found this. They say that the default threshold for CFS is 10% of the entire index size: if any segment grows past that, no CFS is created regardless of writer.SetUseCompoundFile(true) usage.
So, after some digging through Lucene.NET I have come up with the following necessary step:
indexWriter.SetUseCompoundFile(true);
var mergePolicy = indexWriter.GetMergePolicy();
var logPolicy = mergePolicy as LogMergePolicy;
if (logPolicy != null)
{
logPolicy.SetNoCFSRatio(1);
}
Setting the "no-cfs-ratio" to 100% keeps all segments within CFS and things finally work the way I want them to.
So, @jf-beaulac, thanks a lot for getting me going. I suppose your sample would fail too if you added some more documents. Still, I recognize your help, and so I will accept your answer.
I'll post the exact code snippet I used to test this; comparing it to your code will maybe help you find what's wrong.
FSDirectory dir = FSDirectory.GetDirectory("C:\\temp\\CFSTEST");
IndexWriter writer = new IndexWriter(dir, new CJKAnalyzer());
writer.SetUseCompoundFile(true);

Document document = new Document();
document.Add(new Field(
    "text",
    "プーケット",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.AddDocument(document);

document.GetField("text").SetValue("another doc");
writer.AddDocument(document);

writer.Optimize(1, true);
writer.Close();
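For comparison, in current Java Lucene the same no-CFS-ratio fix is configured on the merge policy up front (a sketch of the modern Java API, not the Lucene.NET 2.9 API used above; forceMerge(1) is the successor of Optimize(1, true)):

```java
// Force compound files for every segment, regardless of size, by raising the
// no-CFS ratio to 100%. Sketch of the current Java Lucene API.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setUseCompoundFile(true);

TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setNoCFSRatio(1.0);        // 1.0 = always build a CFS when merging
config.setMergePolicy(mergePolicy);

try (Directory dir = FSDirectory.open(Paths.get("index"));
     IndexWriter writer = new IndexWriter(dir, config)) {
    // ... add documents ...
    writer.forceMerge(1);              // modern equivalent of Optimize(1, true)
}
```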