Lucene behaviour in mocked unit tests - rhino-mocks

Now this is just strange:
The code as it is below works fine in a NUnit unit test with RhinoMocks (the assert passes).
In this version, the IndexSearcher is created directly in the code.
If I use the mocked version of Get instead (swap the commented assignment of IndexSearcher), so that the searcher is returned by the mock, the assertion no longer passes.
Can anyone figure out why that is? (NUnit 2.5.2 - RhinoMocks 3.6 - Lucene 2.9.2)
[Test]
public void Test()
{
    ISearcherManager searcherManager = _repository.StrictMock<ISearcherManager>();
    Directory directory = new RAMDirectory();
    IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
    searcherManager.Expect(item => item.Get()).Return(new IndexSearcher(writer.GetReader())).Repeat.AtLeastOnce();
    _repository.ReplayAll();
    //searcherManager.Get();
    Document doc = new Document();
    doc.Add(new Field("F", "hello you", Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
    IndexSearcher searcher = searcherManager.Get();
    //IndexSearcher searcher = new IndexSearcher(writer.GetReader());
    QueryParser parser = new QueryParser("F", new StandardAnalyzer());
    Query q = parser.Parse("hello");
    TopDocs hits = searcher.Search(q, 2);
    Assert.AreEqual(1, hits.totalHits);
}

I'm not familiar with Lucene, but the only difference I see is that, via the Expect call, you create your IndexSearcher before adding the document to the writer (the argument to Return(...) is evaluated when the expectation is recorded, not when Get() is later called). In the code that is commented out, the IndexSearcher is created after you add the document to the writer. Is that an important distinction?
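If that is the cause, reordering the test so the document is added before the expectation is recorded (just a sketch of the idea, not tested) should make the mocked path behave like the direct one:
// Add the document first, then record the expectation, so that
// writer.GetReader() (evaluated while building Return(...)) already sees the document.
Document doc = new Document();
doc.Add(new Field("F", "hello you", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(doc);
searcherManager.Expect(item => item.Get())
    .Return(new IndexSearcher(writer.GetReader()))
    .Repeat.AtLeastOnce();
_repository.ReplayAll();
IndexSearcher searcher = searcherManager.Get();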

Related

What is the alternative for PdfStamper in iText 7?

I tried to find the alternative for PdfStamper in iText 7 but couldn't work out how to use it. I've already implemented the code in iTextSharp, where it works, but not in iText 7.
One more question: what is the alternative for AcroFields in iText 7?
public byte[] GeneratePDF(string pdfPath, Dictionary<string, string> formFieldMap, bool formFlattening = true)
{
    var output = new MemoryStream();
    var reader = new PdfReader(pdfPath);
    var stamper = new PdfStamper(reader, output);
    //PdfDocument pdfDocument = new PdfDocument(reader, writer);
    var formFields = stamper.AcroFields;
    foreach (var fieldName in formFieldMap.Keys)
        formFields.SetField(fieldName, formFieldMap[fieldName]);
    stamper.FormFlattening = formFlattening;
    stamper.Close();
    reader.Close();
    return output.ToArray();
}
The iText API was completely overhauled between versions 5.x and 7.x, so you do not always have a one-to-one correspondence between classes in the two versions. I would therefore propose studying the introductory ebooks on the iText knowledge base site before porting code.
There actually is an example in those ebooks very similar to your code:
//Initialize PDF document
PdfDocument pdf = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<String, PdfFormField> fields = form.GetFormFields();
PdfFormField toSet;
fields.TryGetValue("name", out toSet);
toSet.SetValue("James Bond");
fields.TryGetValue("language", out toSet);
toSet.SetValue("English");
fields.TryGetValue("experience1", out toSet);
toSet.SetValue("Off");
fields.TryGetValue("experience2", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("experience3", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("shift", out toSet);
toSet.SetValue("Any");
fields.TryGetValue("info", out toSet);
toSet.SetValue("I was 38 years old when I became an MI6 agent.");
form.FlattenFields();
pdf.Close();
("Flattening a Form" in "Chapter 4: Making a PDF interactive | .NET" of "iText 7: Jump-Start Tutorial for .NET")

Index field using: new TextField(String fieldName, Reader reader)

I've been trying to index a field using the readerValue() that Lucene provides in the Fields. The thing is that the terms are not being indexed. This is the interesting part of the code:
IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
IndexWriter indexWriter = new IndexWriter(directory, config);
indexWriter.deleteAll();
String str = "Some random text to be indexed";
Reader reader = new StringReader(str);
Document doc = new Document();
doc.add(new TextField("content", reader));
indexWriter.addDocument(doc);
Now, if I index that text as a String with the other TextField constructor it works fine, but like this it does not seem to index the terms, and it returns null when I try to get the value of the field after a search:
QueryParser queryParser = new QueryParser("content", new SimpleAnalyzer());
Query query = queryParser.parse(text);
TopDocs topDocs = indexSearcher.search(query,10);
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    Document document = indexSearcher.doc(scoreDoc.doc);
    Reader r = document.getField("content").readerValue();
I really can't see the problem, maybe it is some dumb thing that I missed, or maybe I'm using it wrong? Thanks in advance for any help.
By default, TextField is unstored. The behavior you are seeing is expected for an un-stored field. You should be able to search on it, but not retrieve it from the index. The constructor that takes a string argument for the field contents allows you to set whether to store the field or not, thus the different behavior.
The reason the store option is not available on that constructor is that Lucene explicitly disallows a stored field to be set from a Reader or TokenStream value. If you want to store the field, you will simply need to get the string value from the Reader yourself.
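For example, something along these lines (a sketch based on the snippet above, exception handling omitted) keeps the content both searchable and stored:
// Drain the Reader into a String, then use the constructor that accepts Field.Store.
StringBuilder sb = new StringBuilder();
char[] buffer = new char[1024];
int read;
while ((read = reader.read(buffer)) != -1) {
    sb.append(buffer, 0, read);
}
doc.add(new TextField("content", sb.toString(), Field.Store.YES));
indexWriter.addDocument(doc);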

Why can't I get the doc recently added by IndexWriter in the search results in Lucene 4.0?

As the title says, I have encountered a puzzling problem.
I have built an index for my test program, and then I use IndexWriter to add a document to the index. The code is:
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc1 = new Document();
doc1.add(new Field("name", "张三", Field.Store.YES, Field.Index.ANALYZED));
doc1.add(new IntField("year", 2013, Field.Store.YES));
doc1.add(new TextField("content", "123456789", Field.Store.YES));
iwriter.addDocument(doc1);
iwriter.commit();
iwriter.close();
When I search this index, I can't get this doc. I do get the correct result count (it is one more than before), but when I try to print doc.get("name"), the output is wrong.
The search code is:
DirectoryReader ireader = DirectoryReader.open(directory);
System.out.println(ireader.numDeletedDocs());
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "name", analyzer);
Query query = parser.parse("张");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
System.out.println(hits.length);
In the results, there is a "Name: 李四".
I'm sure that I use StandardAnalyzer during both indexing and searching, and StandardAnalyzer makes each Chinese character a single token. Why do I get "李四" when I search for "张"? Is there anything wrong with how I add the doc, or is the docid mismatched?
Did you (re)open the index after adding the doc? Lucene searches only return the documents that existed as of the time the index was opened for searching.
[edit...]
Use DirectoryReader.open() or DirectoryReader.openIfChanged() to open the index again. openIfChanged() has the advantage that it returns null if you can still use the old DirectoryReader instance (because the index has not changed).
(If I recall correctly, a reader returned by DirectoryReader.open() is a point-in-time snapshot of the index, so it does not see changes committed after it was opened; you need a fresh reader to pick them up.)
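A rough sketch using the variable names from the question (assuming the writer has committed in the meantime):
// Re-open the reader after iwriter.commit() so the newly added document becomes visible.
DirectoryReader newReader = DirectoryReader.openIfChanged(ireader);
if (newReader != null) {   // null means nothing changed since ireader was opened
    ireader.close();
    ireader = newReader;
    isearcher = new IndexSearcher(ireader);
}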

Stop words Removal in Lucene

How do I remove stop words in Lucene for the given String "This is the chemical orientation"?
I think that Lucene's StopFilter is what you are looking for.
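For example, a minimal sketch (not from the question; constructor signatures assume a recent Lucene release, older versions need an extra Version argument) that tokenizes the string and drops English stop words with StopFilter:
Tokenizer tokenizer = new StandardTokenizer();
tokenizer.setReader(new StringReader("This is the chemical orientation"));
TokenStream stream = new StopFilter(new LowerCaseFilter(tokenizer), EnglishAnalyzer.getDefaultStopSet());
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString());   // prints the surviving terms: chemical, orientation
}
stream.end();
stream.close();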
You should use StandardAnalyzer, which knows about certain token types, lowercases tokens, removes stop words, and so on.
An example of creating an IndexWriter with StandardAnalyzer:
public IndexWriter Indexer(String dir) throws IOException {
    IndexWriter writer;
    Directory indexDir = FSDirectory.open(new File(dir).toPath());
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
    cfg.setOpenMode(OpenMode.CREATE);
    writer = new IndexWriter(indexDir, cfg);
    return writer;
}

Lucene HTMLFormatter skipping last character

I have this simple Lucene search code (Modified from http://www.lucenetutorial.com/lucene-in-5-minutes.html)
class Program
{
    static void Main(string[] args)
    {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new RAMDirectory();
        IndexWriter w = new IndexWriter(index, analyzer, true,
            IndexWriter.MaxFieldLength.UNLIMITED);
        addDoc(w, "Table 1 <table> content </table>");
        addDoc(w, "Table 2");
        addDoc(w, "<table> content </table>");
        addDoc(w, "The Art of Computer Science");
        w.Close();
        String querystr = "table";
        Query q = new QueryParser("title", analyzer).Parse(querystr);
        Lucene.Net.Search.IndexSearcher searcher = new Lucene.Net.Search.IndexSearcher(index);
        Hits hitsFound = searcher.Search(q);
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("*", "*");
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(searcher.Rewrite(q)));
        for (int i = 0; i < hitsFound.Length(); i++)
        {
            Console.WriteLine(highlighter.GetBestFragment(analyzer, "title", hitsFound.Doc(i).Get("title")));
            // Console.WriteLine(hitsFound.Doc(i).Get("title"));
        }
        Console.ReadKey();
    }

    private static void addDoc(IndexWriter w, String value)
    {
        Document doc = new Document();
        doc.Add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.AddDocument(doc);
    }
}
The highlighted results always seem to skip the closing '>' of my last table tag. Any suggestions?
Lucene's highlighter, out of the box, is geared to handle plain text. It will work incorrectly if you try to highlight HTML or any mark-up text.
I recently ran into the same problem and found a solution in Solr's HTMLStripReader, which skips the content inside tags. The solution is outlined on my blog at the following URL.
http://sigabrt.blogspot.com/2010/04/highlighting-query-in-entire-html.html
I could have posted the code here, but my solution applies to Lucene Java. For .NET, you would have to find the equivalent of HTMLStripReader.
Solved. Apparently my Highlighter.Net version was archaic; upgrading to 2.3.2.1 solved the problem.