How to do a startsWith and then Contains Search using Lucene.NET 3.0? - lucene

What is the best way to search and Index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association

Lucene doesn't generally support searching or scoring based on position within a field. You could support it by prefixing every field value with a known field-start delimiter token. I don't think that makes much sense through the lens of full-text search, where position within the text field isn't relevant (i.e. if I were searching for "Bar" in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that, a simple prefix search handles everything else. So if you add that start-of-field token at index time, you can query the modified (token-prefixed) term with a prefix query boosted higher than the one for the original term, and you should get the ordering you describe.

It can also be achieved in memory with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, project the ScoreDocs into a collection (Searchresults), and then filter that collection.
Sample code:
Searchresults = (from scoreDoc in results.ScoreDocs select (new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })).OrderBy(x => x.suggestion).ToList();
// Prefer suggestions that start with the search text (case-insensitive)
SearchresultsStartswith = Searchresults.Where(x => x.suggestion.StartsWith(searchStringLinq, StringComparison.OrdinalIgnoreCase)).Take(10).ToList();
if (SearchresultsStartswith.Count > 0)
return SearchresultsStartswith;
else
return Searchresults.Take(10).ToList();
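The same in-memory idea can be extended to all four tiers from the question. What follows is only a minimal sketch in plain Java, not Lucene.NET code: the class and method names (TierRanking, tier, rank) are made up for illustration, it assumes a single-word query, and it assumes the candidate titles have already been fetched from the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class TierRanking {
    // Lower tier = better match; 4 means the query does not occur at all.
    static int tier(String title, String query) {
        String t = title.trim().toLowerCase(Locale.ROOT);
        String q = query.trim().toLowerCase(Locale.ROOT);
        String[] words = t.split("\\s+");
        if (words[0].equals(q)) return 0;      // starts with the query as a full word
        if (t.startsWith(q)) return 1;         // starts with the query as a fragment
        for (String w : words) {
            if (w.equals(q)) return 2;         // contains the query as a full word
        }
        return t.contains(q) ? 3 : 4;          // contains it as a fragment, or no match
    }

    // Drop non-matching titles, then sort by tier and alphabetically within a tier.
    static List<String> rank(List<String> titles, String query) {
        List<String> out = new ArrayList<>();
        for (String s : titles) {
            if (tier(s, query) < 4) out.add(s);
        }
        out.sort(Comparator.comparingInt((String s) -> tier(s, query))
                .thenComparing(String.CASE_INSENSITIVE_ORDER));
        return out;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "International Bartenders Association", "Bart Simpson",
                "National Bar Association", "Bar Lunar", "Bar Acme", "Foo Corp");
        for (String s : rank(titles, "Bar")) {
            System.out.println(s);
        }
    }
}
```

With the sample data from the question this gives Bar Acme, Bar Lunar, Bart Simpson, National Bar Association, International Bartenders Association. Like the LINQ version, it only makes sense when the number of hits is small enough to hold in memory.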

Related

Accented or special characters in RavenDB index

I have several fields in my collection that contain accented characters, and the languages from which the words come are quite varied: Czech, German, Spanish, Finnish, Hungarian, etc.
I have noticed that when searching for, e.g. "Andalucía" (note the accented i), the query comes up empty - however, searching for "Andaluc*" returns what I am looking for.
I have found this in the RavenDB docs, and wanted to ask if changing the field indexing method from default to exact would solve my problem.
Thanks !
EDIT: RavenDB appears to be dropping letters after AND including the accented character in the search. In the cmd window, I can see the query (which I enter from RavenDB Studio as NAME_1:Andalucía) coming out as (...)/ByName?term=Andaluc&field=NAME_1&max(...)
When I navigate to the terms of the index, I can see "andalucía" (lowercase !!). The index definition is simply a "select new { NAME_1 = area.NAME_1 }". Forgot to mention I'm still on RavenDB 3.5.
Index definition:
Map = areas => from area in areas
select new
{
NAME_0 = area.NAME_0,
NAME_1 = area.NAME_1
};
Indexes.Add(x => x.NAME_1, FieldIndexing.Analyzed);
//Analyzers.Add(x => x.NAME_1, typeof(StandardAnalyzer).FullName);
The commented-out line doesn't work because the type StandardAnalyzer doesn't resolve in my VS2017 project. I'm currently looking into how to get either the dll or the correct using statement.
Querying for Andalucía in quotation marks results in the "correct query" being sent to Raven: (...)/ByName?term=Andalucía&field=NAME_1&max=5(...) - but produces no results.
FURTHER EDIT: Found the Lucene dll, included it in the project, and used the StandardAnalyzer as the analyzer - same result (no results found).
On RavenDB 4, this appears to be fixed.
You need to verify that both 'Full-Text-Search' and 'Suggestions' options are 'turned on' in the index.
You need to specify the field you want the suggestions for.
Add this in your index definition:
Suggestion(x => x.NAME_1);
You must not have the following line in your index definition for any property you run search operations on:
Indexes.Add(x => x.PropertyXYZ, FieldIndexing.No);
If you have not changed the default indexing, your query should work.

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stop word list? Would this have adverse effects? If so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is in case you want to use or filter this
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
Version.LUCENE_29, //depends on your version
new HashSet<string> //pass in your own stop word list
{
"hello",
"world"
});
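To see why the two example phrases end up with "?" placeholders, here is a small plain-Java simulation. It does not use Lucene at all: the tokenizer (lowercase, split on non-letters) is a simplification of what StandardAnalyzer does, the class and method names are invented for this sketch, and the stop word list is copied by hand from Lucene's default English set.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.StringJoiner;

class StopWordDemo {
    // Default English stop words of Lucene's StopAnalyzer, copied here for illustration.
    static final Set<String> ENGLISH = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
            "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    // Lowercase, split on non-letters, and show every removed stop word as "?".
    static String phrase(String text, Set<String> stopWords) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) continue;
            out.add(stopWords.contains(token) ? "?" : token);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String q = "not in education, employment, or training";
        System.out.println(phrase(q, ENGLISH));
        // With an empty stop word set every token survives.
        System.out.println(phrase(q, Collections.<String>emptySet()));
    }
}
```

With the default set, only "education", "employment" and "training" survive; with an empty set the whole phrase is kept, which is what exact matching needs. Keep in mind that the same analyzer has to be used at index time and at query time, so changing the stop word list means reindexing.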

Get the position of matches in Lucene

Is it possible to find the position of words with a match when the indexed field isn't stored?
for example:
Query: "fox over dog"
Indexed text of matched doc: "The quick brown fox jumps over the lazy dog"
What I want: [4,6,9]
Note1: I know text can be highlighted using Lucene but I want the position of the words
Note2: The field isn't set to be stored by Lucene.
I have not done this for practical purposes; this is just pseudocode and some pointers that you can experiment with to reach a correct solution.
Also, you have not specified your Lucene version; I am using Lucene 6.0.0 with Java.
1. While indexing, set these two booleans on the field for which positions are desired. Lucene can only return that data if it was stored at indexing time.
FieldType txtFieldType = new FieldType(
TextField.TYPE_NOT_STORED);
txtFieldType.setStoreTermVectors(true);
txtFieldType.setStoreTermVectorPositions(true);
2. At search time, use Terms, TermsEnum and PostingsEnum as below:
Terms terms = searcher.getIndexReader().getTermVector(hit.doc, "TEXT_FIELD");
// getTermVector returns null if no term vector was stored for this field
if (terms != null && terms.hasPositions()) {
    TermsEnum termsEnum = terms.iterator();
    PostingsEnum postings = null;
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        postings = termsEnum.postings(postings, PostingsEnum.ALL);
        while (postings.nextDoc() != PostingsEnum.NO_MORE_DOCS) {
            // nextPosition() may be called at most freq() times per document
            for (int i = 0; i < postings.freq(); i++) {
                System.out.println(term.utf8ToString() + " @ " + postings.nextPosition());
            }
        }
    }
}
Here searcher is an IndexSearcher instance, hit is a ScoreDoc and hit.doc is the document id. The positions are zero-based token positions.
You will need to do some analysis of your own to arrive at the data that you need, but you first need to save the metadata as described in point 1.
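The "analysis of your own" could look roughly like the sketch below. It is plain Java with no Lucene dependency: the map of term to zero-based positions stands in for what you would collect from the term vector loop above, and the names (MatchPositions, termPositions, positionsOf) are hypothetical. It assumes whitespace tokenization and no stop word removal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class MatchPositions {
    // Stand-in for a term vector: term -> zero-based token positions.
    static Map<String, List<Integer>> termPositions(String text) {
        Map<String, List<Integer>> map = new LinkedHashMap<>();
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            map.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
        }
        return map;
    }

    // Collect the positions of every query term, sorted and converted to 1-based.
    static List<Integer> positionsOf(String query, Map<String, List<Integer>> vector) {
        List<Integer> result = new ArrayList<>();
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (int pos : vector.getOrDefault(term, Collections.emptyList())) {
                result.add(pos + 1);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> vector =
                termPositions("The quick brown fox jumps over the lazy dog");
        System.out.println(positionsOf("fox over dog", vector)); // prints [4, 6, 9]
    }
}
```

With a real index, an analyzer that removes stop words can leave gaps in the position numbering, so the numbers may not line up with a naive word count of the original text.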

Lucene Query for "OR" and "IN"

I'm using Lucene.net within my project to search for customers. I've got my Lucene index built and search is returning expected results for all of my indexed fields, however, when I search specifically for customers in Indiana or Oregon, I receive zero results, despite my database reflecting otherwise.
In my test case, these states are abbreviated to IN and OR respectively in my lucene index. Searching for other fields will yield results for customers within these states, so I know they are indexed.
Example:
State:(fl) returns results for customers in Florida, as expected.
State:(in) returns no results
State:(or) returns no results
State:(ar*) returns results for customers in Arkansas, as expected.
State:(in*) returns no results
State:(or*) returns no results
State:("mi") returns results for customers in Michigan, as expected.
State:("or") returns no results
State:("in") returns no results
State:("\\ca") returns results for customers in California, as expected.
State:("\\or") returns no results
State:("\\in") returns no results
On a related note, searching for names that start with AND, OR, and IN works without issue:
Name:(and*) returns results for Andrew, Andrea, Andy, etc.
Name:(in*) returns results for Inge, Ina, Indie, etc.
Name:(or*) returns results for Oris, Orlando, Orville, etc.
I've tried the following for creating my indices:
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState) + " " + (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
I've also looked at other solutions to similar problems, such as how to properly escape OR and AND in lucene query? but I've had no luck in adapting these solutions to this issue. I'm using Lucene.NET 3.0.3.
The problem here isn't really the collision with query syntax. "IN" isn't even a lucene query keyword.
The problem is that standard analysis eliminates certain common words known as stop words, which are deemed not to be interesting search terms in most cases. By default, the stop words are common English words, including "in", "or" and "and", among others (full list here: What is the default list of stopwords used in Lucene's StopFilter?).
If this isn't desirable behavior in your case, you can define your StandardAnalyzer with a custom (or empty) stop word set:
StandardAnalyzer analyzer = new StandardAnalyzer(
Lucene.Net.Util.Version.LUCENE_30,
new HashSet<String>() //Empty stop word set
);
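If you want to check up front which values of a short-code field like State would be swallowed, a quick script is enough. The following is a plain-Java sketch, not Lucene.NET code: the stop word list is copied by hand from Lucene's default English set, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StateStopWords {
    // Default English stop words of Lucene's StopAnalyzer, copied here for illustration.
    static final Set<String> ENGLISH = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
            "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
            "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    static final String[] STATES = {
            "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
            "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
            "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
            "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
            "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"};

    // State codes that a lowercasing, stop-word-removing analyzer would discard.
    static List<String> dropped(Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String state : STATES) {
            if (stopWords.contains(state.toLowerCase(Locale.ROOT))) out.add(state);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dropped(ENGLISH)); // prints [IN, OR]
    }
}
```

For a field that only ever holds codes like these, another option worth considering is to index it without analysis at all (Field.Index.NOT_ANALYZED), though queries against it are then case-sensitive.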

Lucene: best way to do "starts-with" queries

I want to be able to do the following types of queries:
The data to index consists of (let's say), music videos where only the title is interesting.
I simply want to index these and then create queries for them such that, whatever word or words the user used in the query, the documents containing those words, in that order, at the beginning of the title will be returned first, followed (in no particular order) by documents containing at least one of the searched words in any position of the title. All of this should also be case-insensitive.
Example:
For documents:
Video1Title = Sea is blue
Video2Title = Wild sea
Video3Title = Wild sea Whatever
Video4Title = Seaside Whatever
If I search "sea" I want to get
"Video1Title = Sea is blue"
first followed by all the other documents that contain "sea" in title, but not at the beginning.
If I search "Wild sea" I want to get
Video2Title = Wild sea
Video3Title = Wild sea Whatever
first followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as title prefix.
If I search "Seasi" I don't want to get anything (I don't care about keyword tokenization and prefix queries).
Now, as far as I know, there's no actual way to tell Lucene "find me documents where word1, word2, etc. are in positions 1, 2, etc."
There are "workarounds" to simulate that behaviour:
Index the field twice. In field1 you have the words tokenized (using perhaps StandardAnalyzer) and in field2 you have them all clumped up into one element (using KeywordAnalyzer). Then if you search something like :
+(field1:word1 word2 word3) (field2:"word1 word2 word3*")
effectively telling Lucene "Documents must contain word1 or word2 or word3 in the title, and furthermore those that match 'title starts with >word1 word2 word3<' are better (get a higher score)".
Add a "lucene_start_token" to the beginning of the field when indexing them such that
Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea" and so on for the rest
Then do a query such that:
+(title:sea) (title:"lucene_start_token sea")
and having Lucene return all documents which contain my search word(s) in the title and also give a better score on those who matched "lucene_start_token+search words"
My question, then: are there better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?
You can use Lucene payloads for that. They let you give a custom boost to every term of the field value.
So, when you index your titles you can start using a boost factor of 3 (for example):
title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5
title: sea|3.0 creatures|2.5
Indexing this way boosts the terms nearest the start of the title.
The main drawback of this approach is that you have to tokenize the text yourself and add all the boost information "manually", because the analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
What you could do is index the title and each token separately, e.g. the text "wild deep blue endless sea" would be indexed as:
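The manual step is mostly string formatting. Here is a minimal plain-Java sketch of producing the term|boost form shown above; it involves no Lucene API, the names (PayloadBoostEncoder, encode) are invented, and the step of 0.5 with a floor of 1.0 is just an assumption that reproduces the example values.

```java
import java.util.Locale;
import java.util.StringJoiner;

class PayloadBoostEncoder {
    // Emit "term|boost" pairs, lowering the boost by `step` for each later term
    // and never going below `minBoost`.
    static String encode(String title, double startBoost, double step, double minBoost) {
        StringJoiner out = new StringJoiner(" ");
        double boost = startBoost;
        for (String token : title.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            out.add(token + "|" + boost);
            boost = Math.max(minBoost, boost - step);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints wild|3.0 creatures|2.5 blue|2.0 sea|1.5
        System.out.println(encode("wild creatures blue sea", 3.0, 0.5, 1.0));
    }
}
```

A title that itself contains the "|" character would need escaping or a different delimiter, which this sketch does not handle.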
title: wild deep blue endless sea
t1: wild
t2: deep
t3: blue
t4: endless
t5: sea
Then if someone queries "wild deep", the query would be rewritten into
title:"wild deep" OR (t1:wild AND t2:deep)
This way you will always find all matching documents (if they match title) but matching t1..tN tokens will score the relevant documents higher.
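The query rewrite described above is easy to automate. The following plain-Java sketch builds the rewritten query string from the user's input; the class and method names are hypothetical, it assumes a non-empty query, and it does no escaping of Lucene query syntax characters, which real user input would need.

```java
import java.util.Locale;
import java.util.StringJoiner;

class PositionalQueryBuilder {
    // "wild deep" -> title:"wild deep" OR (t1:wild AND t2:deep)
    static String rewrite(String userQuery) {
        String[] words = userQuery.toLowerCase(Locale.ROOT).trim().split("\\s+");
        StringJoiner positional = new StringJoiner(" AND ", "(", ")");
        for (int i = 0; i < words.length; i++) {
            // The i-th query word must sit in the i-th per-position field.
            positional.add("t" + (i + 1) + ":" + words[i]);
        }
        return "title:\"" + String.join(" ", words) + "\" OR " + positional;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("wild deep"));
    }
}
```

The number of t1..tN fields you index puts an upper bound on how many leading words can be matched positionally, so in practice you would probably cap N at a handful of tokens.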