I am using Lucene and I have a MultiReader built from a few directory readers, like:
MultiReader myMultiReader = new MultiReader(directoryReader1, directoryReader2, ...);
I want to use a SearcherManager with it, since there will be changes in the index from time to time. How can I do this? SearcherManager only accepts a single DirectoryReader or an IndexWriter as a constructor parameter:
https://lucene.apache.org/core/6_0_1/core/org/apache/lucene/search/SearcherManager.html
I don't understand how I could combine both MultiReader and SearcherManager.
By the way, I have already checked these links, which don't really answer this particular issue:
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
http://lucene.472066.n3.nabble.com/SearcherManager-vs-MultiReader-td4068411.html
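The closest I have come is the sketch below: give each sub-index its own SearcherManager (relying on the DirectoryReader constructor mentioned above) and build a throwaway MultiReader per search. It is untested, and I am not sure it is the intended approach:
// Untested sketch: one SearcherManager per sub-index, a MultiReader per search.
SearcherManager manager1 = new SearcherManager(directoryReader1, null);
SearcherManager manager2 = new SearcherManager(directoryReader2, null);

// Call maybeRefresh() on each manager whenever the underlying indexes change.
IndexSearcher s1 = manager1.acquire();
IndexSearcher s2 = manager2.acquire();
try {
    // closeSubReaders = false: the managers keep owning the readers' lifecycles
    MultiReader multiReader = new MultiReader(
        new IndexReader[] { s1.getIndexReader(), s2.getIndexReader() }, false);
    IndexSearcher searcher = new IndexSearcher(multiReader);
    // ... run queries with 'searcher' ...
} finally {
    manager1.release(s1);
    manager2.release(s2);
}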
I want to query an EmbeddedSolrServer instance with a filter query, the way one normally does from the admin panel, but I want to do this programmatically with Java. I know that we can do query.setQuery("*:*");, but this is not what I want if someone wants to search for a specific word in a document's content. I also found solrParams.add(CommonParams.QT, "*:*");, but it's not working. I think the problem may come from parsing the PDF document when I try to index it. So please, if someone knows how to index a document using EmbeddedSolrServer exactly the same way we index it using post.jar on the command line, let me know.
Indexing a file is as easy as
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

EmbeddedSolrServer server = new EmbeddedSolrServer(solrHome, defaultCoreName);

// Push the file through the extracting handler, just as post.jar does
ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
req.addFile(fileToIndex, "application/octet-stream");
req.setParam("commit", "true");   // commit so the document becomes searchable
req.setParam("literal.id", id);   // supply the unique key as a literal field
NamedList<Object> namedList = server.request(req);
server.close();
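To sanity-check that the extracted text is actually searchable, a query can be run against the same server before the close() call above. This is only a sketch: the "content" field name is an assumption and depends on how the extracting handler is mapped in your schema.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// Sketch: assumes the extracted text lands in a field named "content" (adjust to your schema).
SolrQuery query = new SolrQuery("content:whitepaper");
QueryResponse rsp = server.query(query);
System.out.println("hits: " + rsp.getResults().getNumFound());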
I have a program using Lucene that creates an index in a directory (the index directory) on every execution. As everyone knows, creating the index on each and every execution is a time-consuming process, so I want to reuse the index already created on the initial execution.
Is that possible in Lucene? Does Lucene have this feature?
It is absolutely possible. Assuming indexDirPath is the location of your Lucene index, you can use the following code:
Directory dir = FSDirectory.open(new File(indexDirPath)); // open the existing index folder
IndexReader ir = DirectoryReader.open(dir);               // reuse the already-built index
IndexSearcher searcher = new IndexSearcher(ir);
This should be followed by use of the same Analyzer you used while creating the index.
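For instance, a minimal sketch of querying the reopened index, assuming (these are assumptions, not part of the question) that the index was built with StandardAnalyzer, has a field named "content", and targets a 4.x-era Lucene to match the File-based FSDirectory.open above:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

// The analyzer and the "content" field must match whatever was used at indexing time.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_10_4);
QueryParser parser = new QueryParser(Version.LUCENE_4_10_4, "content", analyzer);
Query query = parser.parse("lucene");
TopDocs hits = searcher.search(query, 10); // 'searcher' from the snippet above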
In an attempt to find a bug in our project, I found myself with a few questions about Lucene's indexing API that have no answer:
The first one is related to the following snippet:
IndexWriter writer = // open an index for writing.
// ... do some heavy updates (inserts and deletes) to the index using 'writer'
IndexReader reader = writer.GetReader();
long[] idsArray = FieldCache_Fields.DEFAULT.GetLongs(reader, "ID_Field");
// under the assumption that ALL indexed documents contain a field named "ID_Field".
Is it promised by Lucene's API that the reader I get will ALWAYS see the updated, even though uncommitted, index? Just to make sure my question is clear: every deleted doc WON'T be seen by the reader, and every added doc WILL be.
The second question is related to the next snippet:
IndexWriter writer = // open an index for writing, but don't change a thing - just commit metadata.
// Commit user data travels with the commit point, even if no documents changed.
writer.Commit(new Dictionary<string, string> { { "Hello", "World" } });
Is it promised that the metadata will be committed to the index, even though I opened the writer without making any actual change to the index?
For both questions, I would be happy to know what the API intends, and also whether anyone knows of issues (any bugs?) specific to Lucene.NET 2.9.2.
Thanks, guys!
First Question: yes
From the doc:
Expert: returns a readonly reader, covering all committed as well as un-committed changes to the index. This provides "near real-time" searching, in that changes made during an IndexWriter session can be quickly made available for searching without closing the writer nor calling commit().
Note that this is functionally equivalent to calling commit() and then using IndexReader.open to open a new reader. But the turnaround time of this method should be faster since it avoids the potentially costly commit().
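In Java Lucene 2.9 terms (the Lucene.NET 2.9 port mirrors this API), the pattern the doc describes looks roughly like this sketch:
// Sketch of the near-real-time pattern from the doc quote (Lucene 2.9 Java API).
IndexWriter writer = // an already-open writer
writer.addDocument(doc);                  // an uncommitted change
IndexReader reader = writer.getReader();  // sees the add without commit()
IndexSearcher searcher = new IndexSearcher(reader);
// ... added docs are visible here, deleted docs are not ...
reader.close(); // close the NRT reader when done; the writer stays open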
I have a specific app which requires that the number of files that make up an index be as small as possible. Previously, when I used Lucene.NET 2.9.2, I was able to keep the entire index in 3 (or 4) files by using:
writer.SetUseCompoundFile(true);
writer.Optimize(1, true);
After upgrading to Lucene.NET 2.9.4, the same code produces an index consisting of 10 files (fdt, fdx, fnm, frq, nrm, prx, tii, tis, plus segments.gen and segments_c). How can I bring that down again?
The cause for this is probably deep in Lucene and not that much Lucene.NET specific. Still, something changed between versions, and I'd love to have control over this.
OK, I've finally found an answer. When inspecting the index directory during the lengthy indexing process, I observed that a CFS file comes and goes, but once the process is done, there is no sign of it. I did some more research given some new keywords (thanks @jf-beaulac) and I've found this. They say that the default threshold for CFS is 10% of the entire index size. If any segment grows past that, no CFS is created, regardless of writer.SetUseCompoundFile(true) usage.
So, after some digging through Lucene.NET I have come up with the following necessary step:
indexWriter.SetUseCompoundFile(true);

// The merge policy, not the writer, decides whether large segments skip CFS.
var mergePolicy = indexWriter.GetMergePolicy();
var logPolicy = mergePolicy as LogMergePolicy;
if (logPolicy != null)
{
    // 1.0 = 100% of the index size: never skip CFS, regardless of segment size
    logPolicy.SetNoCFSRatio(1.0);
}
Setting the "no-CFS ratio" to 100% keeps all segments within the CFS, and things finally work the way I want them to.
So, @jf-beaulac, thanks a lot for getting me going. I suppose your sample would fail too if you added some more documents. Still, I recognize your help, and so I will accept your answer.
I'll post the exact code snippet I used to test this; comparing it to your code will maybe help you find what's wrong.
FSDirectory dir = FSDirectory.GetDirectory("C:\\temp\\CFSTEST");
// Lucene.NET 2.9 requires the create flag and MaxFieldLength in the constructor.
IndexWriter writer = new IndexWriter(dir, new CJKAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.SetUseCompoundFile(true);

Document document = new Document();
document.Add(new Field(
    "text",
    "プーケット",
    Field.Store.YES,
    Field.Index.ANALYZED));
writer.AddDocument(document);

// Reuse the same Document instance with a new value for the second doc.
document.GetField("text").SetValue("another doc");
writer.AddDocument(document);

writer.Optimize(1, true);
writer.Close();
Some of the documents I store in Lucene have fields that contain file paths or URIs. I'd like users to be able to retrieve these documents if their query terms contain a path or URI segment.
For example, if the path is
C:\home\user\research\whitepapers\analysis\detail.txt
I'd like the user to be able to find it by querying for path:whitepapers.
Likewise, if the URI is
http://www.stackoverflow.com/questions/ask
a query containing uri:questions would retrieve it.
Do I need to use a special analyzer for these fields, or will StandardAnalyzer do the job? Will I need to do any pre-processing of these fields? (To replace the forward slashes or backslashes with spaces, for example?)
Suggestions welcome!
You can use StandardAnalyzer.
I tested this by adding the following function to Lucene's TestStandardAnalyzer.java:
public void testBackslashes() throws Exception {
    assertAnalyzesTo(a, "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
        new String[]{"c", "home", "user", "research", "whitepapers", "analysis", "detail.txt"});
    assertAnalyzesTo(a, "http://www.stackoverflow.com/questions/ask",
        new String[]{"http", "www.stackoverflow.com", "questions", "ask"});
}
This unit test passed using Lucene 2.9.1. You may want to try it with your specific Lucene distribution. I guess it does what you want, while keeping domain names and file names unbroken. Did I mention that I like unit tests?
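To put that to use outside the test suite, indexing and querying such a field could look like the rough sketch below; the "path" field name and the in-memory directory are illustrative choices, and the calls target the 2.9-era Java API mentioned above:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Illustrative field name "path"; StandardAnalyzer splits on the backslashes.
RAMDirectory dir = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field("path", "C:\\home\\user\\research\\whitepapers\\analysis\\detail.txt",
    Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(dir); // 2.9 still allows opening from a Directory
TopDocs hits = searcher.search(
    new QueryParser(Version.LUCENE_29, "path", analyzer).parse("whitepapers"), 10);
System.out.println(hits.totalHits); // 1: the "whitepapers" segment matches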