Fuzzy search with stop words produces unexpected results with Lucene / ElasticSearch

I am noticing that the fuzzy operator on stop words does not produce the results I'd expect.
Here's my configuration:
index :
    analysis :
        analyzer :
            my_analyzer :
                tokenizer : my_tokenizer
                filter : [standard, my_stop_english_filter]
        tokenizer :
            my_tokenizer :
                type : standard
                max_token_length : 512
        filter :
            my_stop_english_filter :
                type : stop
                stopwords : [the]
                ignore_case : true
And suppose I have indexed:
the brown fox
If I search for:
the brown~ fox~, then I get a hit as expected.
However, if I search for: the~ brown~ fox~, then I do not get a hit, presumably because the fuzzy operator prevents "the" from being treated as a stop word.
Is there a way I can combine stop words with fuzzy search?
Thanks,
Eric

If I recall correctly, this is how Lucene is supposed to work as it is currently written: applying a fuzzy search to a term disables stop-word removal for that term. It would take some work, but you could create a modified version of the query parser so that stop words are ignored when applying fuzzy search (but then how do you handle a fuzzy search on something that merely looks like a stop word?).
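A lighter-weight alternative to a modified query parser might be to pre-process the query string before it reaches the parser. The following is a minimal Python sketch, assuming whitespace-separated terms and the one-word stop list from the question; `strip_fuzzy_stop_words` is a hypothetical helper, not part of Lucene or ElasticSearch. It does not answer the open question above: a fuzzy term that merely looks like a stop word loses its fuzzy operator too.

```python
import re

# Stop list copied from the question's my_stop_english_filter (assumes the
# client knows the same list the index analyzer uses).
STOP_WORDS = {"the"}

# A term followed by a fuzzy operator such as "~" or "~0.7".
FUZZY_TERM = re.compile(r"^(?P<word>[^~\s]+)~(?P<sim>\d*\.?\d*)$")


def strip_fuzzy_stop_words(query, stop_words=STOP_WORDS):
    """Remove the fuzzy operator from terms that are stop words.

    Hypothetical pre-processing step: the rewritten query leaves plain
    stop words for the analyzer's stop filter to discard as usual.
    """
    rewritten = []
    for term in query.split():
        match = FUZZY_TERM.match(term)
        if match and match.group("word").lower() in stop_words:
            rewritten.append(match.group("word"))  # "the~" -> "the"
        else:
            rewritten.append(term)
    return " ".join(rewritten)


print(strip_fuzzy_stop_words("the~ brown~ fox~"))  # the brown~ fox~
```

The rewritten string is the same query that already produced a hit in the question.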

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopwords list? Would this have adverse effects, and if so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer strips out stop words; in fact, it gets its default stop word list from the StopAnalyzer.
Use the WhitespaceAnalyzer or create your own by inheriting from one that most closely suits your needs and modify it to be what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
// This is the default stop word list, in case you want to use or filter it
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
// Create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29,   // depends on your version
    new HashSet<string>  // pass in your own stop word list
    {
        "hello",
        "world"
    });
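To see where the ? placeholders in the question come from, here is a rough plain-Python imitation of stop-word removal that leaves a gap for each dropped word. It is not Lucene.NET code: `analyze_phrase` is a made-up helper, and the stop list is the default English set as I recall it, so check it against StopAnalyzer.ENGLISH_STOP_WORDS_SET for your version.

```python
import re

# Default English stop words as I recall them from Lucene's StopAnalyzer;
# verify against StopAnalyzer.ENGLISH_STOP_WORDS_SET for your version.
DEFAULT_STOP_WORDS = frozenset("""
a an and are as at be but by for if in into is it no not of on or such
that the their then there these they this to was will with
""".split())


def analyze_phrase(text, stop_words=DEFAULT_STOP_WORDS):
    """Lower-case and split on non-letters; stop words become '?' gaps."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join("?" if t in stop_words else t for t in tokens)


print(analyze_phrase("not in education, employment, or training"))
# ? ? education employment ? training
print(analyze_phrase("there is no such thing"))
# ? ? ? ? thing
# A different phrase with the same kept word in the same slot looks identical:
print(analyze_phrase("this is not that thing"))
# ? ? ? ? thing
# With an empty stop set nothing is dropped:
print(analyze_phrase("there is no such thing", stop_words=frozenset()))
# there is no such thing
```

In this imitation, an empty stop set keeps every word, which is what an exact-match search needs. The usual cost of keeping stop words is a larger index, since very common words get postings too.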

Lucene Query for "OR" and "IN"

I'm using Lucene.net within my project to search for customers. I've got my Lucene index built and search is returning expected results for all of my indexed fields, however, when I search specifically for customers in Indiana or Oregon, I receive zero results, despite my database reflecting otherwise.
In my test case, these states are abbreviated to IN and OR respectively in my lucene index. Searching for other fields will yield results for customers within these states, so I know they are indexed.
Example:
State:(fl) returns results for customers in Florida, as expected.
State:(in) returns no results
State:(or) returns no results
State:(ar*) returns results for customers in Arkansas, as expected.
State:(in*) returns no results
State:(or*) returns no results
State:("mi") returns results for customers in Michigan, as expected.
State:("or") returns no results
State:("in") returns no results
State:("\\ca") returns results for customers in California, as expected.
State:("\\or") returns no results
State:("\\in") returns no results
On a related note, searching for names containing AND, OR, and IN works without issue:
Name:(and*) returns results for Andrew, Andrea, Andy, etc.
Name:(in*) returns results for Inge, Ina, Indie, etc.
Name:(or*) returns results for Oris, Orlando, Orville, etc.
I've tried the following for creating my indices:
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState) + " " + (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
I've also looked at other solutions to similar problems, such as "How to properly escape OR and AND in lucene query?", but I've had no luck adapting these solutions to this issue. I'm using Lucene.NET 3.0.3.
The problem here isn't really the collision with query syntax. "IN" isn't even a lucene query keyword.
The problem is that standard analysis eliminates certain common words known as stop words, which are deemed not to be interesting search terms in most cases. By default, the stop words are common English words, including "in", "or" and "and", among others (full list here: What is the default list of stopwords used in Lucene's StopFilter?).
If this isn't desirable behavior in your case, you can define your StandardAnalyzer with a custom (or empty) stop word set:
StandardAnalyzer analyzer = new StandardAnalyzer(
    Lucene.Net.Util.Version.LUCENE_30,
    new HashSet<String>() // Empty stop word set
);
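The effect on the State field can be imitated with a toy inverted index. This is a plain-Python sketch, not Lucene.NET code: `build_state_index` and the customer data are invented for illustration, and the stop set holds only the three words named above.

```python
# Only the stop words named in the answer; Lucene's real default list is longer.
SOME_DEFAULT_STOP_WORDS = {"in", "or", "and"}


def build_state_index(customers, stop_words):
    """Map each analyzed State token to the customer ids that contain it."""
    index = {}
    for customer_id, state in customers.items():
        for token in state.lower().split():
            if token in stop_words:
                continue  # the stop filter drops the token at index time
            index.setdefault(token, set()).add(customer_id)
    return index


# Hypothetical customers: id -> State field value.
customers = {1: "FL", 2: "IN", 3: "OR", 4: "IN OR"}

with_stops = build_state_index(customers, SOME_DEFAULT_STOP_WORDS)
without_stops = build_state_index(customers, stop_words=set())

print(sorted(with_stops.get("fl", ())))     # [1]
print(sorted(with_stops.get("in", ())))     # []
print(sorted(without_stops.get("in", ())))  # [2, 4]
print(sorted(without_stops.get("or", ())))  # [3, 4]
```

The filter compares whole tokens, which is why Name:(in*) still finds Inge and Ina: "inge" is not a stop word even though it starts with one. After switching analyzers the index has to be rebuilt, because the dropped tokens were never stored.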

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and Index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching or scoring based on position within a field. It would be possible if you prefixed every field value with some known start-of-field delimiter. I don't really think it makes sense through the lens of full-text search, where position within the text field isn't relevant (i.e. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that, though, a simple prefix search handles everything else. So if you add your start-of-field token, you can search for the modified term with a prefix query boosted higher than the one for the original term, and then you should have precisely what you describe.
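The boosting idea can be sketched outside Lucene as four clauses whose boosts add up. This is plain Python, not Lucene.NET code: the "^" marker, the boost values and the helper names are assumptions chosen for illustration, and real Lucene scores also depend on term statistics, so treat the numbers as relative only.

```python
START = "^"  # hypothetical start-of-field marker added at index time


def index_tokens(text):
    """Lower-cased word tokens, plus a marked copy of the first token."""
    words = text.lower().split()
    return set(words) | ({START + words[0]} if words else set())


def score(text, query):
    """Sum the boosts of every clause the document matches."""
    q = query.lower()
    tokens = index_tokens(text)
    clauses = [
        (8.0, lambda t: t == START + q),           # field starts with the whole word
        (4.0, lambda t: t.startswith(START + q)),  # field starts with a fragment
        (2.0, lambda t: t == q),                   # whole word anywhere
        (1.0, lambda t: t.startswith(q)),          # word fragment anywhere
    ]
    return sum(boost for boost, matches in clauses
               if any(matches(t) for t in tokens))


docs = [
    "International Bartenders Association",
    "National Bar Association",
    "Bart Simpson",
    "Bar Lunar",
    "Bar Acme",
]
# Highest score first; ties broken alphabetically.
for doc in sorted(docs, key=lambda d: (-score(d, "Bar"), d)):
    print(score(doc, "Bar"), doc)
```

With these boosts the five sample documents come out in the order listed in the question. Note that a prefix match only covers fragments at the start of a word, so text containing the query in the middle of a word would not match at all.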
This can be achieved with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, then loop over the ScoreDocs results and store them in a collection, Searchresults.
sample code:
Searchresults = (from scoreDoc in results.ScoreDocs
                 select new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })
                .OrderBy(x => x.suggestion).ToList();
// Prefer suggestions that start with the search string (case-insensitive).
SearchresultsStartswith = Searchresults
    .Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower()))
    .Take(10).ToList();
if (SearchresultsStartswith.Count > 0)
    return SearchresultsStartswith.ToList();
else
    return Searchresults.Take(10).ToList();

ElasticSearch: query_string search with phrase against snowball filtered fields

I'm doing a simple query_string query that looks like this:
"query_string" : {
    "default_operator" : "AND",
    "fields" : ["title^20","keywords^10","description^8","content^1","titles^6","highlights^4"],
    "query" : "\"south west\""
}
However, the search is matching documents with the words "south" and "west" that are not necessarily adjacent, e.g. "We are seeing low flying buzzards in the south of england and also the west". I would like it to only return results that match the exact phrase, e.g. "We are seeing low flying buzzards in the south west of buckinghamshire".
The analyzer used for both search and indexing is the snowball analyzer and I am guessing that this may be the root of the issue, i.e. do phrase queries not work with the snowball analyzer?
Any ideas?
TIA
Dominic
User error. The DSL was being incorrectly serialized.
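The answer does not say what exactly went wrong. One plausible failure is that the inner quotes, which mark the phrase, were lost or mangled on the way to the server. A small Python sketch, assuming the request body is built as a dictionary and serialized with the standard json module, shows a way to check that they survive:

```python
import json

query = {
    "query_string": {
        "default_operator": "AND",
        "fields": ["title^20", "keywords^10", "description^8",
                   "content^1", "titles^6", "highlights^4"],
        # The inner double quotes are part of the query text: they mark a phrase.
        "query": '"south west"',
    }
}

# The search body wraps the query under a top-level "query" key.
body = json.dumps({"query": query})
print(body)

# Round-trip check: the phrase quotes must still be there after serialization.
assert json.loads(body)["query"]["query_string"]["query"] == '"south west"'
```

Letting a JSON library do the escaping avoids hand-built strings in which the \" sequences are easy to drop.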

Do Lucene and Sphinx support prefix matching?

If not how do you make this work with them and which is better?
E.g. when searching for "mi" I would like results with "microsoft" to potentially show up, even though there is no "keyword" like "mi" specifically.
Yes and Yes.
Lucene has PrefixQuery:
BooleanQuery query = new BooleanQuery();
for (String token : tokenize(queryString)) {
    query.add(new PrefixQuery(new Term(LABEL_FIELD_NAME, token)), Occur.MUST);
}
return query;
You can also use the Lucene query parser syntax and define the prefix search with a trailing wildcard, e.g. exam*. The query parser syntax also works if you deploy a separate Lucene-based search server such as Solr, which is called over an HTTP API.
In Sphinx it seems you have to do the following:
Set minimum prefix length to a value larger than 0
Enable wildcard syntax
Generate a query string with a wildcard exam*
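Generating the wildcard query string is the same small step for either engine. A minimal Python sketch follows, assuming whitespace-separated user input; `to_prefix_query` and its `min_prefix_len` parameter are hypothetical names (the parameter loosely mirrors the minimum prefix length setting mentioned above), and special characters in the input are not escaped here.

```python
def to_prefix_query(user_input, min_prefix_len=2):
    """Append '*' to every term that is at least min_prefix_len characters long."""
    terms = []
    for token in user_input.lower().split():
        # Shorter tokens stay exact terms; very short prefixes expand to many terms.
        terms.append(token + "*" if len(token) >= min_prefix_len else token)
    return " ".join(terms)


print(to_prefix_query("mi"))         # mi*
print(to_prefix_query("exam prep"))  # exam* prep*
print(to_prefix_query("a mi"))       # a mi*
```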