Lucene.net highlight searched term in the text - lucene

I am using Lucene.net to search a given document. Requirement is once search is done, it should highlight the searched term in the document. I have seen examples which returns the best fragments. But what i need is to highlight in the main content.
using (StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_30, stopWords))
{
QueryParser parser = new QueryParser(Version.LUCENE_30, "Content", standardAnalyzer);
parser.AllowLeadingWildcard = true;
Query qry = parser.Parse(searchText);
Directory indexDir = CreateRAMDirectory(htmlContent);
IndexReader reader = IndexReader.Open(indexDir, true);
IndexSearcher searcher = new IndexSearcher(reader);
searcher.SetDefaultFieldSortScoring(true, true);
IFormatter formatter = new SimpleHTMLFormatter("<span style=\"font-weight:bold; background-color:yellow;\">", "</span>");
SimpleFragmenter fragmenter = new SimpleFragmenter(1000);
QueryScorer scorer = null;
scorer = new QueryScorer(qry);
ScoreDoc[] hits = searcher.Search(qry, null, 10000, Sort.RELEVANCE).ScoreDocs;
Highlighter highlighter = new Highlighter(formatter, scorer);
highlighter.TextFragmenter = fragmenter;
foreach (var result in hits)
{
int docId = result.Doc;
float score = result.Score;
Document doc = searcher.Doc(docId);
Lucene.Net.Analysis.TokenStream stream = standardAnalyzer.TokenStream("Content", new IO.StringReader(searchText));
String highlighterData = highlighter.GetBestFragments(stream, searchText, 1000, "");
}
}
I am a newbie to Lucene.net, how can i get the entire document with searched term content highlighted rather than fragments?

The fragmenter governs how large the chunks of text returned are. To use the entire field contents, just use NullFragmenter, instead of SimpleFragmenter.
Fragmenter fragmenter = new NullFragmenter();
.....
highlighter.TextFragmenter = fragmenter;

I had the same issue, even with the NullFragmenter, it only returned roughly 51 kB of text.
By analyzing the objects, I found out that there is another property at the highligher which sets how large a fragment would be at maximum. Set this value to the length of your string, then the whole document will be processed.
highlighter.TextFragmenter = new NullFragmenter();
highlighter.MaxDocCharsToAnalyze = text.Length;

Related

Querying part-of-speech tags with Lucene 7 OpenNLP

For fun and learning I am trying to build a part-of-speech (POS) tagger with OpenNLP and Lucene 7.4. The goal would be that once indexed I can actually search for a sequence of POS tags and find all sentences that match sequence. I already get the indexing part, but I am stuck on the query part. I am aware that SolR might have some functionality for this, and I already checked the code (which was not so self-expalantory after all). But my goal is to understand and implement in Lucene 7, not in SolR, as I want to be independent of any search engine on top.
Idea
Input sentence 1: The quick brown fox jumped over the lazy dogs.
Applied Lucene OpenNLP tokenizer results in: [The][quick][brown][fox][jumped][over][the][lazy][dogs][.]
Next, applying Lucene OpenNLP POS tagging results in: [DT][JJ][JJ][NN][VBD][IN][DT][JJ][NNS][.]
Input sentence 2: Give it to me, baby!
Applied Lucene OpenNLP tokenizer results in: [Give][it][to][me][,][baby][!]
Next, applying Lucene OpenNLP POS tagging results in: [VB][PRP][TO][PRP][,][UH][.]
Query: JJ NN VBD matches part of sentence 1, so sentence 1 should be returned. (At this point I am only interested in exact matches, i.e. let's leave aside partial matches, wildcards etc.)
Indexing
First, I created my own class com.example.OpenNLPAnalyzer:
public class OpenNLPAnalyzer extends Analyzer {
protected TokenStreamComponents createComponents(String fieldName) {
try {
ResourceLoader resourceLoader = new ClasspathResourceLoader(ClassLoader.getSystemClassLoader());
TokenizerModel tokenizerModel = OpenNLPOpsFactory.getTokenizerModel("en-token.bin", resourceLoader);
NLPTokenizerOp tokenizerOp = new NLPTokenizerOp(tokenizerModel);
SentenceModel sentenceModel = OpenNLPOpsFactory.getSentenceModel("en-sent.bin", resourceLoader);
NLPSentenceDetectorOp sentenceDetectorOp = new NLPSentenceDetectorOp(sentenceModel);
Tokenizer source = new OpenNLPTokenizer(
AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, sentenceDetectorOp, tokenizerOp);
POSModel posModel = OpenNLPOpsFactory.getPOSTaggerModel("en-pos-maxent.bin", resourceLoader);
NLPPOSTaggerOp posTaggerOp = new NLPPOSTaggerOp(posModel);
// Perhaps we should also use a lower-case filter here?
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
// Very important: Tokens are not indexed, we need a store them as payloads otherwise we cannot search on them
TypeAsPayloadTokenFilter payloadFilter = new TypeAsPayloadTokenFilter(posFilter);
return new TokenStreamComponents(source, payloadFilter);
}
catch (IOException e) {
throw new RuntimeException(e.getMessage());
}
}
Note that we are using a TypeAsPayloadTokenFilter wrapped around OpenNLPPOSFilter. This means, our POS tags will be indexed as payloads, and our query - however it'll look like - will have to search on payloads as well.
Querying
This is where I am stuck. I have no clue how to query on payloads, and whatever I try does not work. Note that I am using Lucene 7, it seems that in older versions querying on payload has changed several times. Documentation is extremely scarce. It's not even clear what the proper field name is now to query - is it "word" or "type" or anything else? For example, I tried this code which does not return any search results:
// Step 1: Indexing
final String body = "The quick brown fox jumped over the lazy dogs.";
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
Document document = new Document();
document.add(new TextField("body", body, Field.Store.YES));
writer.addDocument(document);
writer.close();
// Step 2: Querying
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
final String fieldName = "body"; // What is the correct field name here? "body", or "type", or "word" or anything else?
final String queryText = "JJ";
Term term = new Term(fieldName, queryText);
SpanQuery match = new SpanTermQuery(term);
BytesRef pay = new BytesRef("type"); // Don't understand what to put here as an argument
SpanPayloadCheckQuery query = new SpanPayloadCheckQuery(match, Collections.singletonList(pay));
System.out.println(query.toString());
TopDocs topDocs = searcher.search(query, topN);
Any help is very much appreciated here.
Why don't you use TypeAsSynonymFilter instead of TypeAsPayloadTokenFilter and just make a normal query. So in your Analyzer:
:
TokenFilter posFilter = new OpenNLPPOSFilter(source, posTaggerOp);
TypeAsSynonymFilter typeAsSynonymFilter = new TypeAsSynonymFilter(posFilter);
return new TokenStreamComponents(source, typeAsSynonymFilter);
And indexing side:
static Directory index() throws Exception {
Directory index = new RAMDirectory();
OpenNLPAnalyzer analyzer = new OpenNLPAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(index, indexWriterConfig);
writer.addDocument(doc("The quick brown fox jumped over the lazy dogs."));
writer.addDocument(doc("Give it to me, baby!"));
writer.close();
return index;
}
static Document doc(String body){
Document document = new Document();
document.add(new TextField(FIELD, body, Field.Store.YES));
return document;
}
And searching side:
static void search(Directory index, String searchPhrase) throws Exception {
final int topN = 10;
DirectoryReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser(FIELD, new WhitespaceAnalyzer());
Query query = parser.parse(searchPhrase);
System.out.println(query);
TopDocs topDocs = searcher.search(query, topN);
System.out.printf("%s => %d hits\n", searchPhrase, topDocs.totalHits);
for(ScoreDoc scoreDoc: topDocs.scoreDocs){
Document doc = searcher.doc(scoreDoc.doc);
System.out.printf("\t%s\n", doc.get(FIELD));
}
}
And then use them like this:
public static void main(String[] args) throws Exception {
Directory index = index();
search(index, "\"JJ NN VBD\""); // search the sequence of POS tags
search(index, "\"brown fox\""); // search a phrase
search(index, "\"fox brown\""); // search a phrase (no hits)
search(index, "baby"); // search a word
search(index, "\"TO PRP\""); // search the sequence of POS tags
}
The result looks like this:
body:"JJ NN VBD"
"JJ NN VBD" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"brown fox"
"brown fox" => 1 hits
The quick brown fox jumped over the lazy dogs.
body:"fox brown"
"fox brown" => 0 hits
body:baby
baby => 1 hits
Give it to me, baby!
body:"TO PRP"
"TO PRP" => 1 hits
Give it to me, baby!

Apache Lucene fuzzy search for multi-worded phrases

I have the following Apache Lucene 7 application:
StandardAnalyzer standardAnalyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(standardAnalyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", new FileReader("document.txt")));
writer.addDocument(document);
writer.close();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Query fuzzyQuery = new FuzzyQuery(new Term("content", "Company"), 2);
TopDocs results = searcher.search(fuzzyQuery, 5);
System.out.println("Hits: " + results.totalHits);
System.out.println("Max score:" + results.getMaxScore())
when I use it with :
new FuzzyQuery(new Term("content", "Company"), 2);
the application works fine and returns the following result:
Hits: 1
Max score:0.35161147
but when I try to search with multi term query, for example:
new FuzzyQuery(new Term("content", "Company name"), 2);
it returns the following result:
Hits: 0
Max score:NaN
Anyway, the phrase Company name exists in the source document.txt file.
How to properly use FuzzyQuery in this case in order to be able to do the fuzzy search for multi-word phrases.
UPDATED
Based on the provided solution I have tested it on the following text information:
Company name: BlueCross BlueShield Customer Service
1-800-521-2227
of Texas Preauth-Medical 1-800-441-9188
Preauth-MH/CD 1-800-528-7264
Blue Card Access 1-800-810-2583
For the following query:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCross"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
the search works fine:
Hits: 1
Max score:0.5753642
but when I try to corrupt a little bit the search query(for example from BlueCross to BlueCros)
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueCros"), 2));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("content", "BlueShield"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
it stops working and returns:
Hits: 0
Max score:NaN
The problem here is the following, you're using TextField, which is tokenizing field. E.g. your text "Company name is working on something" would be effectively split by spaces (and others delimeters). So, even if you have the text Company name, during indexation it will become Company, name, is, etc.
In this case this TermQuery won't be able to find what you're looking for. The trick which going to help you would look like this:
SpanQuery[] clauses = new SpanQuery[2];
clauses[0] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "some"), 2));
clauses[1] = new SpanMultiTermQueryWrapper(new FuzzyQuery(new Term("content", "text"), 2));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
However, I wouldn't recommend this approach much, especially if your load would be big and you're planning on searching on a 10 term long company names. One should be aware, that those query are potentially heavy to execute.
The following problem with BlueCros is the following. By default Lucene uses StandardAnalyzer for TextField. So it means it effectively lowercase the terms, basically it means that BlueCross in the content field becomes bluecross.
Fuzzy difference between BlueCros and bluecross is 3, that's the reason you do not have a match.
Simple proposal would be to convert term in query to the lowercase, by doing something like .toLowerCase()
In general, one should prefer to use same analyzers during the query time as well (e.g. during construction of the query)
For Lucene.Net it can be like this.
private string _IndexPath = #"Your Index Path";
private Directory _Directory;
private Searcher _IndexSearcher;
private MultiPhraseQuery _MultiPhraseQuery;
_Directory = FSDirectory.Open(_IndexPath);
IndexReader indexReader = IndexReader.Open(_Directory, true);
string field = "Name" // Your field name
string keyword = "big red fox"; // your search term
float fuzzy = 0,7f; // between 0-1
using (_IndexSearcher = new IndexSearcher(indexReader))
{
// "big red fox" to [big,red,fox]
var keywordSplit = keyword.Split();
_MultiPhraseQuery = new MultiPhraseQuery();
FuzzyTermEnum[] _FuzzyTermEnum = new FuzzyTermEnum[keywordSplit.Length];
Term[] _Term = new Term[keywordSplit.Length];
for (int i = 0; i < keywordSplit.Length; i++)
{
_FuzzyTermEnum[i] = new FuzzyTermEnum(indexReader, new Term(field, keywordSplit[i]),fuzzy);
_Term[i] = _FuzzyTermEnum[i].Term;
if (_Term[i] == null)
{
_MultiPhraseQuery.Add(new Term(field, keywordSplit[i]));
}
else
{
_MultiPhraseQuery.Add(_FuzzyTermEnum[i].Term);
}
}
var results = _IndexSearcher.Search(_MultiPhraseQuery, indexReader.MaxDoc);
foreach (var loopDoc in results.ScoreDocs.OrderByDescending(s => s.Score))
{
//YourCode Here
}
}

Apache Lucene 5.5.3 - Searching a string ending with special character

I'm using Apache Lucene 5.5.3. I'm using org.apache.lucene.analysis.standard.StandardAnalyzer in my code and using below code snippet to create index.
Document doc = new Document();
doc.add(new TextField("userName", getUserName(), Field.Store.YES));
Now if I search for a string 'ALL-' , then I'm not getting any search results but if I search for a string 'ALL-Categories', then I'm getting some search results.
The same thing is happening for a string with special characters '+' , '.', '!' etc.
Below is my search code:-
Directory directory = new RAMDirectory();
IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
Document document = new Document();
document.add(new TextField("body", ALL-THE GLITTERS IS NOT GOLD, Field.Store.YES));
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(buildAnalyzer()));
writer.addDocument(document);
writer.commit();
Builder builder = new BooleanQuery.Builder();
Query query1 = new QueryParser(IndexAttribute.USER_NAME, buildAnalyzer()).parse(searchQUery+"*");
Query query2 = new QueryParser(IndexAttribute.IS_VETERAN, buildAnalyzer()).parse(""+isVeteran);
builder.add(query1, BooleanClause.Occur.MUST);
builder.add(query2, BooleanClause.Occur.MUST);
Query q = builder.build();
TopDocs docs = searcher.search(q, 10);
ScoreDoc[] hits = docs.scoreDocs;
private static Analyzer buildAnalyzer() throws IOException {
return CustomAnalyzer.builder().withTokenizer("whitespace").addTokenFilter("lowercase")
.addTokenFilter("standard").build();
}
So, Please suggest me on this.
Please refer section Escaping Special Characters to know special characters in Lucene 5.5.3.
As suggested in above article, you need to place a \ or alternatively you can use method public static String escape(String s) of QueryParser class to achieve the same.
I got the solution with WildcardQuery, StringField and MultiFieldQueryParser combination. In addition to these classes, we have to do is escape the space in the query string

SpanNotQuery giving unexpected results (exclude is ignored)

We're having some problems with a SpanNotQuery in elasticsearch. It looks like the exclude part of the query is ignored.
To reproduce the problem I created a set of documents:
fiets kopen
fiets lopen
harrie kopen
harrie lopen
harrie fiets
kopen lopen
A SpanTermQuery for harrie will result in (3, 4, 5)
A SpanTermQuery for kopen will result in (1, 3, 6)
Now I want to combine this in a SpanNotQuery where the include is 'harrie' and exclude 'kopen'
I would expect the result to be (4, 5), but it is (3, 4, 5).
We have to use SpanQueries, this is just a small subset of the trouble we're running in to.
I created a unit test with only Lucene to show our problem
public class LuceneTest {
#Test
public void test() throws Exception {
RAMDirectory ram = new RAMDirectory();
createAndFillIndex(ram);
DirectoryReader directoryReader = DirectoryReader.open(ram);
IndexSearcher searcher = new IndexSearcher(directoryReader);
SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
Query spanNot = new SpanNotQuery(include, exclude);
TopDocs search = searcher.search(spanNot, 100);
for (ScoreDoc scoreDoc : search.scoreDocs) {
Document result = searcher.doc(scoreDoc.doc);
String dummy = result.get("dummy");
System.out.println(scoreDoc.doc + ": " + dummy);
}
}
private void createAndFillIndex(RAMDirectory ram) throws IOException {
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_47, new SimpleAnalyzer(Version.LUCENE_47));
IndexWriter writer = new IndexWriter(ram, conf);
add(writer, "nul"); //0
add(writer, "fiets kopen"); //1
add(writer, "fiets lopen"); //2
add(writer, "harrie kopen"); //3
add(writer, "harrie lopen"); //4
add(writer, "harrie fiets"); //5
add(writer, "kopen lopen"); //6
writer.close();
}
private void add(IndexWriter writer, String value) throws IOException {
Document doc = new Document();
IndexableField f = new TextField("dummy", value, Field.Store.YES);
doc.add(f);
writer.addDocument(doc);
}
}
Does anyone know what we're doing wrong?
Thanks!
The documentation gives a hint here. It matches:
spans from include which have no overlap with spans from exclude
We're dealing with spans, not whole documents. The matching span for a simple term query though, is just the single term. In each of the three matched documents in your example, the matched span is harrie, which does not have any overlap with the term kopen in any of them.
It's probably more helpful to look at an example that shows how it's intended to work. You should be able to copy-paste the following fragments into your example (and by the way, thanks for the MCVE!). Let's try this query:
SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
SpanQuery matchterm = new SpanTermQuery(new Term("dummy", "match"));
SpanQuery[] clauses = {include, matchterm};
SpanQuery nearQuery = new SpanNearQuery(clauses, 2, true);
Query spanNot = new SpanNotQuery(nearQuery, exclude);
against these documents:
add(writer, "harrie kopen match"); //1
add(writer, "harrie match kopen"); //2
add(writer, "harrie other stuff match kopen"); //3
You should see 2 hits.
Document 1: matches nearQuery with the span: "harrie kopen match". This contains "kopen" (that is, overlaps with the span matching exclude), and so it is eliminated by the SpanNotQuery
Document 2: matches nearQuery with the span: "harrie match". The document contains "kopen", but not within the matched span, so the document remains matched.
Document 3: matches nearQuery with the span: "marrie other stuff match". Again, the document contains "kopen", but not within the matched span, so it get through.
If you want the negation to be over the entire document, rather than just the matched span, use a BooleanQuery instead.
SpanQuery include = new SpanTermQuery(new Term("dummy", "harrie"));
SpanQuery exclude = new SpanTermQuery(new Term("dummy", "kopen"));
Query query = new BooleanQuery();
query.add(new BooleanClause(include, BooleanClause.Occur.MUST))
query.add(new BooleanClause(exclude, BooleanClause.Occur.MUST_NOT))

How to use Lucene IndexReader to read index in version 4.4?

For the just the sake of learning I've created an index from 1 file and wanted to search it. I am using Lucene Version 4.4. I know that indexing part is true.
tempFileName is the name of file which contains tokens and this file has the following words :
"odd plus odd is even ## even plus even is even ## odd plus even is odd ##"
However when I provide a query it returns nothing. I can't see what would be the problem. Any help is greatly appreciated.
Indexing part :
public void startIndexingDocument(String indexPath) throws IOException {
Analyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_44);
SimpleFSDirectory directory = new SimpleFSDirectory(new File(indexPath));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44,
analyzer);
IndexWriter writer = new IndexWriter(directory, config);
indexDocs(writer);
writer.close();
}
private void indexDocs(IndexWriter w) throws IOException {
Document doc = new Document();
File file = new File(tempFileName);
BufferedReader br = new BufferedReader(new FileReader(tempFileName));
Field field = new StringField(fieldName, br.readLine().toString(),
Field.Store.YES);
doc.add(field);
w.addDocument(doc);
}
Searching part :
public void readFromIndex(String indexPath) throws IOException,
ParseException {
Analyzer anal = new WhitespaceAnalyzer(Version.LUCENE_44);
QueryParser parser = new QueryParser(Version.LUCENE_44, fieldName, anal);
Query query = parser.parse("odd");
IndexReader reader = IndexReader.open(NIOFSDirectory.open(new File(
indexPath)));
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
// display
System.out.println("fieldName =" + fieldName);
System.out.println("Found : " + hits.length + " hits.");
for (int i = 0; i < hits.length; i++) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
System.out.println((i + 1) + ". " + d.get(fieldName));
}
reader.close();
}
The problem is that you are using a StringField. StringField indexes the entire input as a single token. Good for atomic strings, like keywords, identifiers, stuff like that. Not good for full text searching.
Use a TextField.
StringField have a single token. So, I try to test with simple code.
for example #yns~ If you have a file that this is cralwer file and this contents hava a single String.
ex) file name : data03.scd , contents : parktaeha
You try to search with "parktaeha" queryString.
You get the search result!
field name : acet, queryString parktaeha
======== start search!! ========== q=acet:parktaeha Found 1 hits. result array length :1 search result=> parktaeha
======== end search!! ==========
Look under the code. This code is test code.
while((target = in.readLine()) != null){
System.out.println("target:"+target);
doc.add(new TextField("acet",target ,Field.Store.YES)); // use TextField
// TEST : doc.add(new StringField("acet", target.toString(),Field.Store.YES));
}
ref url