Lucene HTMLFormatter skipping last character - lucene

I have this simple Lucene search code (Modified from http://www.lucenetutorial.com/lucene-in-5-minutes.html)
class Program
{
static void Main(string[] args)
{
StandardAnalyzer analyzer = new StandardAnalyzer();
Directory index = new RAMDirectory();
IndexWriter w = new IndexWriter(index, analyzer, true,
IndexWriter.MaxFieldLength.UNLIMITED);
addDoc(w, "Table 1 <table> content </table>");
addDoc(w, "Table 2");
addDoc(w, "<table> content </table>");
addDoc(w, "The Art of Computer Science");
w.Close();
String querystr = "table";
Query q = new QueryParser("title", analyzer).Parse(querystr);
Lucene.Net.Search.IndexSearcher searcher = new
Lucene.Net.Search.IndexSearcher(index);
Hits hitsFound = searcher.Search(q);
SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("*", "*");
Highlighter highlighter = null;
highlighter = new Highlighter(formatter, new QueryScorer(searcher.Rewrite(q)));
for (int i = 0; i < hitsFound.Length(); i++)
{
Console.WriteLine(highlighter.GetBestFragment(analyzer, "title", hitsFound.Doc(i).Get("title")));
// Console.WriteLine(hitsFound.Doc(i).Get("title"));
}
Console.ReadKey();
}
private static void addDoc(IndexWriter w, String value)
{
Document doc = new Document();
doc.Add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
w.AddDocument(doc);
}
}
The highlighted results always seem to skip the closing '>' of my last table tag. Any suggestions?

Lucene's highlighter, out of the box, is geared to handle plain text. It will work incorrectly if you try to highlight HTML or any mark-up text.
I recently ran into the same problem and found a solution in Solr's HTMLStripReader which skips the content in tags. The solution is outlined on my blog at following URL.
http://sigabrt.blogspot.com/2010/04/highlighting-query-in-entire-html.html
I could have posted the code here, but my solution is applicable for Lucene Java. For .Net, you have to find out equivalent of HTMLStripReader.

Solved. Apparently my Highlighter.Net version was archaic. Upgrading to 2.3.2.1 Solved the problem

Related

what will be the alternative for Pdfstamper in itext7?

I tried to find the alternative for Pdfstamper in itext7 but didn't get how to use? I've already implemented code in itextshap its working but not in itext7.
I've one more doubt what will be the alternative for Acro Fields in itext7?
public byte[] GeneratePDF(string pdfPath, Dictionary<string, string> formFieldMap, bool formFlattening = true)
{
var output = new MemoryStream();
var reader = new PdfReader(pdfPath);
var stamper = new PdfStamper(reader, output);
//PdfDocument pdfDocument = new PdfDocument(reader, writer);
var formFields = stamper.AcroFields;
foreach (var fieldName in formFieldMap.Keys)
formFields.SetField(fieldName, formFieldMap[fieldName]);
stamper.FormFlattening = formFlattening;
stamper.Close();
reader.Close();
return output.ToArray();
}
The iText API got completely overhauled between versions 5.x and 7.x. Thus, you do not always have a one-to-one correspondence between classes here and there. Thus, I would propose studying the introductory ebooks on the iText knowledge base site before porting code.
There actually is an example in those ebooks very similar to your code:
//Initialize PDF document
PdfDocument pdf = new PdfDocument(new PdfReader(src), new PdfWriter(dest));
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdf, true);
IDictionary<String, PdfFormField> fields = form.GetFormFields();
PdfFormField toSet;
fields.TryGetValue("name", out toSet);
toSet.SetValue("James Bond");
fields.TryGetValue("language", out toSet);
toSet.SetValue("English");
fields.TryGetValue("experience1", out toSet);
toSet.SetValue("Off");
fields.TryGetValue("experience2", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("experience3", out toSet);
toSet.SetValue("Yes");
fields.TryGetValue("shift", out toSet);
toSet.SetValue("Any");
fields.TryGetValue("info", out toSet);
toSet.SetValue("I was 38 years old when I became an MI6 agent.");
form.FlattenFields();
pdf.Close();
("Flattening a Form" in "Chapter 4: Making a PDF interactive | .NET" of "iText 7: Jump-Start Tutorial for .NET")

After Lucene search, get character offsets of all matched words in document? (not just preview snippet)

I am creating a search engine for a large number of HTML documents using lucene.
I know I can use PostingsHighlighter and friends to show snippets, with bold words, similar to Google Search results, also similar to this random lucene-based example.
However, unlike these examples, I need a solution that preserves highlighted words, even after the matched document is opened by the user, similar to Google Books.
Some words are hyphenated, in the form <div> ... an inter-</div><div...>national audience ...</div> I am thinking I need to convert these to plain text first, and write some code to merge words that were hyphenated, before I send them to lucene.
Once the resulting document is opened by the user, I'm hoping that I can use lucene to get character offsets of each matched word in the document.
I will have to cross-reference the offsets in the plain text back to the original HTML, and write code to highlight <b> the words based on said offsets.
<div> ... an <b>inter-</b></div><div...><b>national</b> audience ...</div>
How can I get what I need from lucene? Surely I don't have to write my own search for this 'final inch'?
OK, I figured out something I can get started with. :)
To index:
StandardAnalyzer analyzer - new StandardAnalyzer()
Directory index = FSDirectory.open(new File("...").toPath());
IndexWriterConfig config = new IndexWriterConfig(analyzer);
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
addDoc(writer, "...", "...");
// documents need to be read from the data source..
// only add once, or else your docs will be duplicated as you continue to use the system
writer.close();
specify offsets to store for highlighting
private static final FieldType typeOffsets;
static {
typeOffsets = new FieldType(textField.TYPE_STORED);
typeOffsets.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
}
method addDoc
void addDoc(IndexWriter writer, String title, String body) {
Document doc = new Document();
doc.add(new Field("title", body, typeOffsets));
doc.add(new Field("body", body, typeOffsets));
// you can also add an store a TextField that does not have offsets,
// like a file ID that you wouldn't search on, just need to reference original doc.
writer.addDocument(doc);
}
Perform your first search
String q = "...";
String[] fields = new String[] {"title", "body"};
QueryParser parser = new MultiFieldQueryParser(fields, analyzer)
Query query = parser.parse(q)
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
PostingsHighlighter highlighter = new PostingsHighlighter();
TopDocs topDocs = searcher.search(query, 10, Sort.RELEVANCE);
Get highlighted snippets with highlighter.highlightFields(fields, query, searcher, topDocs). You can iterate over the results.
When you want to highlight the end document (i.e. after the search is completed and user selected the result), use this solution (needs minor edits). It works by using NullFragmenter to turn the whole thing into one snippet.
public static String highlight(String pText, String pQuery) throws Exception
{
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Highlighter highlighter = new Highlighter(new QueryScorer(parser.parse(pQuery)));
highlighter.setTextFragmenter(new NullFragmenter());
String text = highlighter.getBestFragment(analyzer, "", pText);
if (text != null)
{
return text;
}
return pText;
}
Edit: You can actually use PostingsHighlighter for this last step instead of Highlighter, but you have to override getBreakIterator, and then override your BreakIterator so that it thinks the whole document is one sentance.
Edit: You can override getFormatter to capture the offsets, rather than trying to parse the <b> tags normally output by PostingsHighlighter.

PhraseQuery+Lucene 4.6 is not working for PDF Word search

Iam Using lucene 4.6 version with Phrase Query for searching the words from PDF. Below is my code. Here Iam able to get the out put text from the PDF also getting the query as contents:"Following are the". But No.of hits is showing as 0. Any suggestions on it?? Thanks in advance.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
//Directory directory = FSDirectory.open("/tmp/testindex");
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
iwriter.deleteAll();
iwriter.commit();
Document doc = new Document();
PDDocument document = null;
try {
document = PDDocument.load(strFilepath);
}
catch (IOException ex) {
System.out.println("Exception Occured while Loading the document: " + ex);
}
String output=new PDFTextStripper().getText(document);
System.out.println(output);
//String text = "This is the text to be indexed";
doc.add(new Field("contents", output, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();
// Now search the index
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
String sentence = "Following are the";
//IndexSearcher searcher = new IndexSearcher(directory);
if(output.contains(sentence)){
System.out.println("");
}
PhraseQuery query = new PhraseQuery();
String[] words = sentence.split(" ");
for (String word : words) {
query.add(new Term("contents", word));
}
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
// Iterate through the results:
if(hits.length>0){
System.out.println("Searched text existed in the PDF.");
}
ireader.close();
directory.close();
}
catch(Exception e){
System.out.println("Exception: "+e.getMessage());
}
There are two reasons why your PhraseQuery is not working
StandardAnalyzer uses ENGLISH_STOP_WORDS_SET which contains a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with these words which will be removed from TokenStream while indexing. That means when you search "Following are the" in index, are and the will not be found. so you will never get any result for such a PhraseQuery as are and the will never be there in first place to search with.
Solution for this is use this constructor for
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46, CharArraySet.EMPTY_SET); while indexing this will make sure that StopFilter will not remove any word from TokenStream while indexing.
StandardAnalyzer also uses LowerCaseFilter that means all tokens will be normalized to lower case. so Following will be indexed as following that means searching "Following" won't give you result. For this .toLowerCase() will come to your rescue, just use this on your sentence and you should get results from search.
Also have a look at this link which specify Unicode Standard Annex #29 which is followed by StandardTokenizer. And from brief look at it, it looks like APOSTROPHE, QUOTATION MARK, FULL STOP, SMALL COMMA and many other characters under certain condition will be ignored while indexing.

Stop words Removal in Lucene

How to remove stop words in Lucene for the given String "This is the chemical orientation"
I think that Lucene's StopFilter is what you are looking for.
you should use standardAnalyser ,that knows about certain token types, lowercases, removes stop words, ...
example of creating an IndexWriter with standardAnalyser:
public IndexWriter Indexer(String dir) throws IOException {
IndexWriter writer;
Directory indexDir = FSDirectory.open(new File(dir).toPath());
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig cfg = new IndexWriterConfig(analyzer);
cfg.setOpenMode(OpenMode.CREATE);
writer = new IndexWriter(indexDir, cfg);
return writer;
}

Lucene behaviour in mocked unit tests

Now this is just strange:
The code as it is below works fine in a NUnit unit test with RhinoMocks (the assert passes).
This is creating an IndexSearcher in the code.
Now if I use the mocked version of Get (swap the commented assignment of IndexSearcher) so now the searcher is returned by the mock, it doesn't pass the assertion.
Can anyone figure out why that is? (NUnit 2.5.2 - RhinoMocks 3.6 - Lucene 2.9.2)
[Test]
public void Test()
{
ISearcherManager searcherManager = _repository.StrictMock<ISearcherManager>();
Directory directory = new RAMDirectory();
IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true);
searcherManager.Expect(item => item.Get()).Return(new IndexSearcher(writer.GetReader())).Repeat.AtLeastOnce();
_repository.ReplayAll();
//searcherManager.Get();
Document doc = new Document();
doc.Add(new Field("F", "hello you", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(doc);
IndexSearcher searcher = searcherManager.Get();
//IndexSearcher searcher = new IndexSearcher(writer.GetReader());
QueryParser parser = new QueryParser("F", new StandardAnalyzer());
Query q = parser.Parse("hello");
TopDocs hits = searcher.Search(q, 2);
Assert.AreEqual(1, hits.totalHits);
}
I'm not familiar with Lucene, but the only difference I see is that via the Expect call, you are creating your IndexSearcher before adding the document to the writer. In the code that is commented out, the creation of the IndexSearcher is happening after you add the document to the writer. Is that an important distinction?