Lucene MultiFieldQueryParser and Highlighter

I am indexing articles in a Lucene index with several fields: title, description, link, and publishDate.
I query the index using MultiFieldQueryParser, with a query like:
+(title:[text]^5.0 description:[text]^4.0 link:[text]^3.0) +publishDate:[20150101 TO 20150531]
and then I show the articles as the search results.
So far all is good.
Now I want to highlight the search text in the title and description.
How should I go about this?
The normal Highlighter gives me a NullPointerException while generating fragments,
and PostingsHighlighter gives me a Map with results grouped by field, but I don't want it that way. I want the entire document to be returned together, with the search text highlighted in title and description.
Any help, suggestion, or code snippet is appreciated.

I got it working by using the following FieldType for every field that I wanted highlighted:
// Index positions and offsets, and store the text so the highlighter can re-read it.
FieldType ft = new FieldType();
ft.setIndexed(true);
ft.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
ft.setStored(true);
ft.setStoreTermVectors(true);
ft.setStoreTermVectorOffsets(true);
ft.setTokenized(true);
// q is the parsed query that produced the search results.
QueryScorer qs = new QueryScorer(q);
Highlighter h = new Highlighter(qs);
h.setTextFragmenter(new SimpleFragmenter(300));
// fieldName is the field being highlighted; text is that field's stored value.
String highlighted = h.getBestFragment(new StandardAnalyzer(), fieldName, text);

Related

What is the difference between TermQuery and QueryParser in Lucene 6.0?

There are two queries. One is created by QueryParser:
QueryParser parser = new QueryParser(field, analyzer);
Query query1 = parser.parse("Lucene");
The other is a TermQuery:
Query query2 = new TermQuery(new Term("title", "Lucene"));
What is the difference between query1 and query2?
This is the definition of Term from the Lucene docs:
A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.
So in your case the query will be created to search the word "Lucene" in the field "title".
To explain the difference between the two, let me take a different example. Consider the following:
Query query2 = new TermQuery(new Term("title", "Apache Lucene"));
In this case the query searches the field title for the single, exact term "Apache Lucene" (space included); the text is not split into words.
Now for the QueryParser case. As an example, let's assume a Lucene index contains two fields, "title" and "body".
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query query1 = parser.parse("title:Apache body:Lucene");
Query query2 = parser.parse("body:Apache Lucene");
Query query3 = parser.parse("title:\"Apache Lucene\"");
A couple of things to note.
"title" is the default field, given in the constructor: QueryParser searches it for any term that is not prefixed with a field name.
query1, parser.parse("title:Apache body:Lucene"): the final query looks like title:Apache body:Lucene, because each term names its field explicitly.
query2, parser.parse("body:Apache Lucene"): the final query looks like body:Apache title:Lucene, but for a different reason.
The parser searches for "Apache" in the body field and "Lucene" in the title field, because a field prefix is only valid for the term that it directly precedes (http://lucene.apache.org/core/2_9_4/queryparsersyntax.html).
Since no field is specified for "Lucene", the default field, "title", is used.
query3, parser.parse("title:\"Apache Lucene\""): here we explicitly say that we want to search for "Apache Lucene" in the field "title". This is a phrase query, and is similar to a term query if analyzed correctly.
To summarize: a TermQuery does not analyze the term and searches for it as is, while QueryParser parses and analyzes the input according to the rules described above.
The QueryParser parses the string and constructs a BooleanQuery (as far as I know) consisting of BooleanClauses, analyzing the terms along the way.
The TermQuery does NOT do analysis, and takes the term as-is. This is the main difference.
So query1 and query2 might be equivalent (in the sense that they return the same search results) if the field is the same and the QueryParser's analyzer does not change the term.
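To make the analysis point concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: analyze, termLookup, and parsedLookup are hypothetical stand-ins, and the toy analyzer only splits on non-alphanumeric characters and lowercases, which is an assumption that roughly mimics what a lowercasing analyzer does at index time.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: none of these names are Lucene APIs.
class AnalysisSketch {

    // Stand-in for an analyzer: split on non-alphanumerics, then lowercase.
    static Set<String> analyze(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Stand-in for a TermQuery: exact lookup, the term is used as-is.
    static boolean termLookup(Set<String> indexedTerms, String term) {
        return indexedTerms.contains(term);
    }

    // Stand-in for a parsed query: analyze the input first, then match
    // if any resulting term is indexed (OR semantics).
    static boolean parsedLookup(Set<String> indexedTerms, String input) {
        for (String term : analyze(input)) {
            if (indexedTerms.contains(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The "index" holds the analyzed form of the title: apache, lucene, in, action.
        Set<String> title = analyze("Apache Lucene in Action");

        System.out.println(termLookup(title, "Lucene"));        // false: case differs
        System.out.println(termLookup(title, "lucene"));        // true
        System.out.println(termLookup(title, "Apache Lucene")); // false: no such single term
        System.out.println(parsedLookup(title, "Lucene"));      // true: input is analyzed first
    }
}
```

In this toy model the un-analyzed lookup only succeeds when the term already has the exact form stored in the index, which is the condition under which the two queries above return the same results.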

Lucene query result : get the words in the returned documents that were found by the query

To highlight the matched words in the documents returned by a Lucene query, I need the search result to include the words that caused each document to match.
For example :
Lucene query : "dog cat"
Result : ["dogs are nice","dog and cats are friends"]
How can I achieve this with Lucene? I can't do it manually, because the returned words (cats, dogs) can differ from the requested words (cat, dog).
Use Lucene's Highlighter. Something like this:
//By default, this formatter will wrap highlights with <b>, but that is configurable.
Formatter formatter = new SimpleHTMLFormatter();
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(formatter, scorer);
//You can set a fragmenter as well, by default it will split into fragments 100 chars in size, using SimpleFragmenter.
String highlightedSnippet = highlighter.getBestFragment(myAnalyzer, fieldName, fieldContent);

Sitecore term query for filter data

In Sitecore Lucene search I am using a TermQuery to filter data from Sitecore.
I have a field in Sitecore called "Description" and I want to filter on the term "Lorem", but every time I get 0 results. If I don't use the term query I get all results, which suggests my index configuration is correct. Please help.
TermQuery bothQuery = new TermQuery (new Term("Description", "Lorem"));
BooleanQuery query = new BooleanQuery();
query.Add(bothQuery, BooleanClause.Occur.MUST);
TopDocs topDocs = sc.Searcher.Search(query, int.MaxValue);
SearchHits searchHits = new SearchHits(topDocs, sc.Searcher.GetIndexReader());
return searchHits.FetchResults(0, int.MaxValue).Select(r => r.GetObject<Item>()).ToList();
I note that your Term definition above has a field name containing a capital letter. You don't specify the version of Sitecore / Lucene you're working in, but my experience with the 6.x series of Sitecore is that the indexing process transforms all the Field names to lower case at index time.
Hence your field in Sitecore might be called "Description" but in Lucene's index it is probably called "description". Try changing your code to use a lower case field name.
You can check this using an index display tool like the Lucene Index Viewer from the Sitecore Marketplace. It will show you the names of the fields in your index, and let you test queries against them without the need to recompile code.

Find typo with Lucene

I would like to use Lucene to index/search text. The text can contain mistyped words, names, etc. What is the most simple way of getting Lucene to find a document containing
"this is Licene"
when user searches for
"Lucene"?
This is only for a demo app, so we need the most simple solution.
Lucene's fuzzy queries are based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
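To show what "number of edits" means here, the following is a minimal sketch of plain Levenshtein distance in Java. It is not Lucene's implementation: the class and method names are hypothetical, and it counts only insertions, deletions, and substitutions (a swap of two adjacent characters costs two edits in this version).

```java
// Illustrative only: a textbook dynamic-programming Levenshtein distance.
class EditDistanceSketch {

    // Returns the minimum number of single-character insertions,
    // deletions, or substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                        prev[j - 1] + cost);                    // substitution or match
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when b is within maxEdits of a.
    static boolean withinEdits(String a, String b, int maxEdits) {
        return levenshtein(a, b) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "licene"));    // 1: one substitution (u -> i)
        System.out.println(withinEdits("lucene", "licene", 1)); // true
        System.out.println(withinEdits("lucene", "apache", 2)); // false
    }
}
```

With a maximum of 1 edit, the search term "lucene" reaches the mistyped "licene" from the question, assuming both are lowercased by the analyzer at index and query time.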
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It works out of the box and is very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default it uses LevensteinDistance, but you can provide your own customized edit distance.

Lucene.NET query + highlighting

I am using Umbraco and came across Lucene. I found a lot of code and articles on Lucene, but I still can't build an acceptable search.
I have a number of fields to search from, eg. "nodeName" and "bodyText"
What I need:
When I search for "men shoes", it should only return results that have both "men" and "shoes", but also return a page where the nodeName only has "shoes" and the bodyText only has "men".
When I search for "shoes", I want results containing "shoe" or "shoes." but not "hoes" if possible
Boost the nodeName field
Get a snippet of bodyText that contains the matched word(s)
Highlight the matched words on both the page name and the snippet of the bodyText
Has anyone ever done this?
This might get you started.
var manager = ExamineManager.Instance;
var searcher = manager.SearchProviderCollection["YOURSearcher"];
var query = searcher.CreateSearchCriteria(BooleanOperation.Or)
.Field("nodeName", keywords.Boost(10))
.Or().Field("nodeName", keywords.Fuzzy())
.Or().Field("bodyContent", keywords.Boost(5))
.Or().Field("otherField", keywords.Boost(3));
var results = searcher.Search(query.Compile());
The code by Jonathan Lathigee works; it's the most Google-like I could find so far:
http://our.umbraco.org/forum/developers/extending-umbraco/19329-Search-multiple-fields-for-multiple-terms-with-examine?p=0