Lucene's IndexWriter clarifications

While trying to track down a bug in our project, I found myself with a few questions about Lucene's indexing API that I could not answer:
The first one is related to the following snippet:
IndexWriter writer = // open an index for writing.
// ... do some heavy updates (inserts and deletes) to the index using 'writer'
IndexReader reader = writer.GetReader();
long[] idsArray = FieldCache_Fields.DEFAULT.GetLongs(reader, "ID_Field");
// under the assumption that ALL indexed documents contain a field named "ID_Field".
Does Lucene's API guarantee that the reader I get will ALWAYS see the updated index, even though it is uncommitted? Just to make sure my question is clear: every deleted doc WON'T be seen by the reader, and every added doc WILL be.
The second question is related to the next snippet:
IndexWriter writer = // open an index for writing, but don't change a thing - just commit metadata.
IDictionary<string, string> commitUserData = new Dictionary<string, string>();
commitUserData["Hello"] = "World";
writer.Commit(commitUserData);
Is it guaranteed that the metadata will be committed to the index, even though I opened the writer without making any actual change to the index?
For both questions I would be happy to know what the API intends, and also whether anyone knows of issues (any bugs?) specific to Lucene.NET 2.9.2.
Thanks, guys!

First Question: yes
From the doc:
Expert: returns a readonly reader, covering all committed as well as un-committed changes to the index. This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling #commit .
Note that this is functionally equivalent to calling #commit and then using IndexReader#open to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly #commit.
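Both questions can be exercised in a few lines. The sketch below uses the current Java Lucene API (8+) - DirectoryReader.open(writer) and setLiveCommitData are the modern counterparts of Lucene(.NET) 2.9's writer.GetReader() and writer.Commit(commitUserData) - with the field name taken from the question:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NrtAndCommitDataSketch {
    public static void main(String[] args) throws IOException {
        ByteBuffersDirectory dir = new ByteBuffersDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("ID_Field", "1", Field.Store.YES));
        writer.addDocument(doc); // nothing committed yet

        // Question 1: an NRT reader sees the uncommitted add...
        DirectoryReader nrt = DirectoryReader.open(writer);
        System.out.println(nrt.numDocs()); // 1

        // ...and a reopened NRT reader sees an uncommitted delete.
        writer.deleteDocuments(new Term("ID_Field", "1"));
        DirectoryReader reopened = DirectoryReader.openIfChanged(nrt, writer);
        // non-null here because the delete is a pending change
        System.out.println(reopened.numDocs()); // 0
        nrt.close();
        reopened.close();

        // Question 2: attach commit user data, commit, and read it back.
        Map<String, String> commitUserData = new HashMap<>();
        commitUserData.put("Hello", "World");
        writer.setLiveCommitData(commitUserData.entrySet());
        writer.commit();
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        System.out.println(reader.getIndexCommit().getUserData().get("Hello")); // World
        reader.close();
    }
}
```

Note that the NRT reader is a point-in-time snapshot: the delete only becomes visible after reopening, which is why the snippet calls openIfChanged rather than reusing the first reader.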

Related

SearcherManager and MultiReader in Lucene

I am using Lucene and I have a MultiReader built from a few directory readers, like:
MultiReader myMultiReader = new MultiReader(directoryReader1, directoryReader2, ...);
I want to use a SearcherManager with it, since there will be changes in the index from time to time. How can I do this? SearcherManager only accepts a single DirectoryReader or IndexWriter as a constructor parameter:
https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/search/SearcherManager.html
I don't understand how I could combine both MultiReader and SearcherManager.
By the way, I have already checked these links, which don't really answer this particular issue:
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
http://lucene.472066.n3.nabble.com/SearcherManager-vs-MultiReader-td4068411.html

Lucene 6.2.1 - IllegalStateException "field-name" was indexed without position data; cannot run SpanTermQuery

I am not familiar with Lucene. Recently I got the chance to be involved in a project that is moving an application from the old Lucene version 2.4.1 to 6.2.1.
While running with the new version 6.2.1, we are facing an exception while searching:
Exception during query: field "field_name" was indexed without position data; cannot run SpanTermQuery (term=2887629129)
In code, the field is created as follows:
doc.add(new Field("field_1", "field_value", StringField.TYPE_STORED));
Finally we tried as given below:
FieldType type = new FieldType();
type.setStored(true);
type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
doc.add(new Field("field_1", "field_value", type));
With the above change, the previous error was gone, but we are not receiving any search results - we get an empty result.
Given that you are using a SpanQuery, I assume you want the field to be analyzed. StringField indexes without analysis, as a single token. You will want to use TextField.
doc.add(new Field("field_1", "field_value", TextField.TYPE_STORED));
No need to set IndexOptions here, the default will already be DOCS_AND_FREQS_AND_POSITIONS.
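The suggested fix can be verified end-to-end. A minimal sketch against the 6.x-era Java API (in 6.x, RAMDirectory and org.apache.lucene.search.spans are still in core); the field content and the term searched for are made up for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.RAMDirectory;

public class SpanTermQuerySketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        // TextField is analyzed and indexes positions by default,
        // so span queries can run against it.
        doc.add(new Field("field_1", "some field value", TextField.TYPE_STORED));
        writer.addDocument(doc);
        writer.close();

        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // The analyzer lowercased and split the text, so we search a single token.
        SpanTermQuery query = new SpanTermQuery(new Term("field_1", "value"));
        System.out.println(searcher.search(query, 10).totalHits); // 1
        reader.close();
    }
}
```

This also hints at why the question's second attempt returned nothing: a span/term query must match an analyzed token, not the raw stored value.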

How to retrieve older version document from elasticsearch?

Is there any way to retrieve older version of same document in elasticsearch?
Suppose I've indexed 1 document in ES:
put class/student/1
{
"marks":95
}
At a later point in time I want to update it to:
put class/student/1
{
"marks":96
}
As soon as I index the updated marks, I see '_version' getting updated to 2.
Is there any way to query ES and get _version=1 document?
This is not possible. Even though there is a version number associated with each create/index/update/delete operation, that version number can't be used to retrieve an older version of the document. Rather, it can be used for optimistic concurrency control - to prevent conflicting writes during read/modify/index cycles.
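To illustrate what the version number is actually for, here is a hypothetical follow-up request in the same pre-7.x request style as the question (on Elasticsearch 6.7+ the preferred parameters are if_seq_no and if_primary_term):

```
PUT class/student/1?version=1
{
  "marks": 97
}
```

Because the document's current _version is 2, this write is rejected with a 409 version_conflict_engine_exception instead of silently overwriting the newer data - concurrency control, not history retrieval.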

How to reuse an index that has already been created using Apache Lucene?

I have a program using Lucene that creates an index in a directory on every execution. Since creating the index on each and every execution is a time-consuming process, I want to reuse the index already created during the initial execution.
Is this possible in Lucene? Does Lucene have this feature?
It is absolutely possible. Assuming indexDirPath is the location of your Lucene index, you can use the following code:
Directory dir = FSDirectory.open(new File(indexDirPath));
IndexReader ir = DirectoryReader.open(dir);
IndexSearcher searcher = new IndexSearcher(ir);
This should be paired with the same Analyzer you used while creating the index.
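On the writing side, the complementary step is to open the IndexWriter in CREATE_OR_APPEND mode so a second run extends the existing index instead of rebuilding it. A sketch against a recent Java Lucene API (8+), using a temp directory and made-up field names:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReuseIndexSketch {
    static void addDoc(Path indexPath, String id) throws Exception {
        try (Directory dir = FSDirectory.open(indexPath)) {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            // CREATE_OR_APPEND reuses an existing index; CREATE would wipe it.
            cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Path indexPath = Files.createTempDirectory("lucene-index");
        addDoc(indexPath, "1"); // "first execution": creates the index
        addDoc(indexPath, "2"); // "second execution": appends to it
        try (Directory dir = FSDirectory.open(indexPath);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            System.out.println(reader.numDocs()); // 2
        }
    }
}
```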

Minimize Lucene index file count

I have a specific app which requires that the number of files that make up an index to be as few as possible. Previously when I used Lucene.NET 2.9.2 I was able to keep the entire index in 3 (or 4) files by using:
writer.SetUseCompoundFile(true);
writer.Optimize(1, true);
After upgrading to Lucene.NET 2.9.4, the same code produces an index consisting of 10 files (fdt, fdx, fnm, frq, nrm, prx, tii, tis, plus segments.gen and segments_c). How can I bring that number down again?
The cause for this is probably deep in Lucene and not all that Lucene.NET specific. Still, something changed between versions and I'd love to have control over it.
OK, I've finally found an answer. When inspecting the index directory during the lengthy indexing process, I observed that a CFS file comes and goes, but once the process is done there is no sign of it. I did some more research with some new keywords (thanks @jf-beaulac) and I've found this. They say that the default threshold for CFS is 10% of the entire index size. If any segment grows past that, no CFS is created for it, regardless of the writer.SetUseCompoundFile(true) call.
So, after some digging through Lucene.NET, I came up with the following necessary step:
indexWriter.SetUseCompoundFile(true);
var mergePolicy = indexWriter.GetMergePolicy();
var logPolicy = mergePolicy as LogMergePolicy;
if (logPolicy != null)
{
logPolicy.SetNoCFSRatio(1);
}
Setting the "no CFS ratio" to 100% keeps all segments within the CFS, and things finally work the way I want them to.
So, @jf-beaulac, thanks a lot for getting me going. I suppose your sample would fail too if you added some more documents. Still, I recognize your help, and so I will accept your answer.
I'll post the exact code snippet I used to test this; comparing it to your code may help you find what's wrong.
FSDirectory dir = FSDirectory.GetDirectory("C:\\temp\\CFSTEST");
IndexWriter writer = new IndexWriter(dir, new CJKAnalyzer());
writer.SetUseCompoundFile(true);
Document document = new Document();
document.Add(new Field(
"text",
"プーケット",
Field.Store.YES,
Field.Index.ANALYZED));
writer.AddDocument(document);
document.GetField("text").SetValue("another doc");
writer.AddDocument(document);
writer.Optimize(1, true);
writer.Close();
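For what it's worth, the same knob still exists in recent Java Lucene versions, where the merge policy is configured through IndexWriterConfig. A sketch (8+ API, made-up field content) that keeps every segment in the compound format and then checks that a .cfs file actually ended up on disk:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CompoundFileSketch {
    public static void main(String[] args) throws Exception {
        Path path = Files.createTempDirectory("cfs-test");
        try (Directory dir = FSDirectory.open(path)) {
            IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
            cfg.setUseCompoundFile(true);   // newly flushed segments use the compound format
            TieredMergePolicy mp = new TieredMergePolicy();
            mp.setNoCFSRatio(1.0);          // merged segments stay compound, whatever their size
            cfg.setMergePolicy(mp);
            try (IndexWriter writer = new IndexWriter(dir, cfg)) {
                Document doc = new Document();
                doc.add(new TextField("text", "hello compound world", Field.Store.YES));
                writer.addDocument(doc);
                writer.forceMerge(1);       // the modern equivalent of Optimize(1, true)
            }
            boolean hasCfs = Arrays.stream(dir.listAll()).anyMatch(f -> f.endsWith(".cfs"));
            System.out.println(hasCfs); // true
        }
    }
}
```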