Lucene.NET query + highlighting - lucene

I am using Umbraco and came across Lucene. I found a lot of code and articles on Lucene, but I still can't build an acceptable search.
I have a number of fields to search from, eg. "nodeName" and "bodyText"
What I need:
When I search for "men shoes", it should only return results that have both "men" and "shoes", but also return a page where the nodeName only has "shoes" and the bodyText only has "men".
When I search for "shoes", I want results containing "shoe" or "shoes." but not "hoes" if possible
Boost the nodeName field
Get a snippet of bodyText that contains the matched word(s)
Highlight the matched words on both the page name and the snippet of the bodyText
Has anyone ever done this?

This might get you started.
var manager = ExamineManager.Instance;
var searcher = manager.SearchProviderCollection["YOURSearcher"];
var query = manager.SearchProviderCollection["YOURSearcher"].CreateSearchCriteria(BooleanOperation.Or)
.Field("nodeName", keywords.Boost(10))
.Or().Field("nodeName", keywords.Fuzzy())
.Or().Field("bodyContent", keywords.Boost(5))
.Or().Field("otherField", keywords.Boost(3));
var results = searcher.Search(query.Compile());

The code by Jonathan Lathigee works, it's the most google-like I could find so far
http://our.umbraco.org/forum/developers/extending-umbraco/19329-Search-multiple-fields-for-multiple-terms-with-examine?p=0

Related

Get the position of matches in Lucene

Is it possible to find the position of words with a match when the indexed field isn't stored?
for example:
Query: "fox over dog"
Indexed text of matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note1: I know text can be highlighted using Lucene but I want the position of the words
Note2: The field isn't set to be stored by Lucene**
I have not done this for practical purposes - just to give a pseudo code and pointers that you can experiment with to reach to correct solution.
Also, you have not specified your Lucene version, I am using Lucene 6.0.0 with Java.
1.While Indexing, set these two booleans for your specific field for which positions are desired. Lucene will be able to give that data if indexing has stored that information otherwise not.
FieldType txtFieldType = new FieldType(
TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
2.At your searcher, you need to use Terms , TermsEnum & PostingsEnum like below,
`Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");`
if(terms.hasPositions()){
TermsEnum termsEnum = terms.iterator();
PostingsEnum postings = null;
while(termsEnum.next() != null){
postings = termsEnum.postings(postings ,PostingsEnum.ALL);
while(postings.nextDoc() != PostingsEnum.NO_MORE_DOCS){
System.out.println(postings.nextPosition());
}
You need to do some of your own analysis to arrive at the data that you need but your first need to save meta data as pointed in point # 1.
}
}
searcher is IndexSearcher instance, hit.doc is doc id and hit is a ScoreDoc .

Search by exact words in a phrase using Umbraco Examine

I have some description field per content and those are:
For content1:
The quick brown fox jumps over the lazy dog. And the lazy dog is good.
For content2:
The lazy fog is crazy.
Now, when I use keyword = lazy dog, I want to give result as content1 and not content2
I tried like:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog") )
.Compile();
ISearchResults result = searcher.Search( criteria );
But it didn't gave me desired results, it give me results: content1 and content2.
What should I do in order to get as content1 result ?
By default examine is compiling this query to:
+(+description:lazy dog)
and based on it it's returning the results with both: lazy and dog words.
What you want to achieve is:
+(+description:"lazy dog")
First of what you need to try is to escape the phrase. In your case it will be:
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria =
searcher.CreateSearchCriteria()
.GroupedAnd( new List<string> { "description" }, "lazy dog".Escape()) )
.Compile();
ISearchResults result = searcher.Search( criteria );
Can't test it now, but there were some problems with it in the past from what I remember. The second option and a life saver for you, may be building the search query manually and using the raw query.
BaseSearchProvider searcher = ExamineManager.Instance.SearchProviderCollection["MySearch"];
ISearchCriteria criteria = searcher.CreateSearchCriteria();
var query = criteria.RawQuery("+description:\"lazy dog\"");
ISearchResults result = searcher.Search( query );
And it should return you correct = matched result only. Personally, I've used also some boosting of specific words to just point some results higher in the score list, but if you want to have only matched items, try above solutions and let me know if it helped you.
If you want to deal with more than one property, you can either use some fluent API methods like GroupedAnd or GroupedOr (depending of the desired behaviour of search) or build more advanced raw query.
For the first option, check Grouped Operations documentation: https://github.com/Shazwazza/Examine/wiki/Grouped-Operations.
For the second scenario it would be the best to analyze how it's done e.g. in ezSearch package (which btw. is awesome!): https://github.com/umco/umbraco-ezsearch/blob/master/Src/Our.Umbraco.ezSearch/Web/UI/Views/MacroPartials/ezSearch.cshtml.

Lucene MultiFieldQueryParser and Highlighter

I am indexing articles in lucene index through different fields i.e. title, description, link, publishDate
I query the index using MultiFieldQueryParser like
+(title:[text]^5.0 description:[text]^4.0 link:[text]^3.0) +publishDate:[20150101 TO 20150531]
and then i show the articles as the search results.
So far all is good.
Now I want to highlight the search text in the title,description
How shall i go about this?
The normal Highlighter gives me NullPointerException while generrating fragments.
and PostingHighlighter gives me a Map with results grouped together according to the field.. but i don't want it that way. I was the entire document to be returned together with highlighting of search text in title and description.
Any help or suggestion or code snippet is appreciated..
I got it working by using the FieldType for all the fields that I wanted highlighted:
FieldType ft = new FieldType();
ft.setIndexed(true);
ft.setIndexOptionsFieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
ft.setStored(true);
ft.setStoreTermVectors(true);
ft.setStoreTermVectorOffsets(true);
ft.setTokenized(true);
ft.stored();
QueryScorer qs = new QueryScorer(q);
Highlighter h = new Highlighter(qs);
highlighter.setTextFragmenter(new SimpleFragmenter(300));
String highlighted = h.getBestFragment(new StandardAnalyzer(),f,Text);

read all document by using particular category name using alfresco search.luceneSearch or search.lib.js

Category Name
|
Geograpy (8)
Study Db (18)
i am implement my own advance search in alfresco. i need to read all files which related with particular category.
example:
if there is 20 file under geograpy, lucene query should read particular document under search key word "banana".
Further explanation -
I am using search.lib.js to search. I would like to analyze the result to find out to which category the documents belong to. For example I would like to know how many documents belong to the category under Languages and the subcategories. I experimented with the Classification API but I don't get the result I want. Any Idea how to go through the result to get the category name of each document?
is there any simple method like node.properties["cm:creator"]?
thanks
janaka
I think you should specify more your question:
Are you using cm:content or a customized content?
Are you going to search the keyword inside the content of the file? or are you going to search the keyword in a specific metadata(s)?
Do you want to create a webscript (java or javascript)?
One thing to take in consideration:
if you use +PATH:"cm:generalclassifiable/...." for the categorization in your lucene queries, the performance will be slow (following my experince)
You can use for example the next query to find all nodes at any depth below /cm:Languages:
var results = search.luceneSearch("+PATH:\"cm:generalclassifiable/cm:Languages//*\");
Take a look to this url: https://wiki.alfresco.com/wiki/Search#Path_Queries
Once you have all the elements, you can loop all, and get to which category below. Of course you need to create some counter per each category/subcategory:
for(i = 0; i < results.length; i++){
var node = results[i];
var categoryNodeRef = node.properties["cm:categories"];
var categoryDesc = categoryNodeRef.properties["cm:description"];
var categoryName = categoryNodeRef.properties["cm:name"];
}
This is not exactly the solution, but can be a useful idea to start.
Sorry if it's not what you're asking for, I have just arrived from my holidays.

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and Index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching/scoring based on position within a field. It would be possible to support it if you prefix every fields with some known fieldstart delimiter, or something. I don't really think it makes sense, in the lens of a full text search where position within the text field isn't relevant (ie. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "national bar association")
Apart from that though, a simple prefix search handles everything else. So if you simply add your start of word token, you can search for the modified term with a higher boost prefix query than the original, and then you should have precisely what you describe.
It can be achieved with linq. Make lucene search with hit count Int32.MaxValue. Loop the results of ScoreDocs and store it in collection Searchresults.
sample code:
Searchresults = (from scoreDoc in results.ScoreDocs select (new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })).OrderBy(x => x.suggestion).ToList();
Searchresultsstartswith = Searchresults.Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower())).Take(10).ToList();
if (SearchresultsStartswith.Count > 0)
return SearchresultsStartswith.ToList();
else
return Searchresults.Take(10).ToList();