lucene query special characters - lucene

I have trouble understanding how Lucene handles special characters.
My analyzer has no stop words, so special characters are not removed:
CharArraySet stopwords = new CharArraySet(0, true);
return new GermanAnalyzer(stopwords);
Then I create docs like:
doc.add(new TextField("tags", "23", Store.NO));
doc.add(new TextField("tags", "Brüder-Grimm-Weg", Store.NO));
The query tags:brüder\-g works fine, but the fuzzy query tags:brüder\-g~ does not return anything. If the street name were Eselgasse, the query tags:Esel~ would work fine.
I use Lucene 5.3.1.
Thanks for help!

Fuzzy Queries (as well as wildcard or regex queries) are not analyzed by the QueryParser.
If you are using StandardAnalyzer, for instance, "Brüder-Grimm-Weg" will be indexed as three terms, "brüder", "grimm", and "weg". So, after analysis you have:
"tags:brüder\-g" --> tags:brüder tags:g
This matches on tags:brüder
"tags:brüder\-g~" --> tags:brüder-g~2
Since this is not analyzed, it remains a single term. You get no matches because your index contains no single term like "brüder-g".
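To make the difference concrete, here is a minimal plain-Java sketch with no Lucene on the classpath. The `tokenize` helper is a hypothetical, simplified stand-in for analysis (split on anything that is not a letter or digit, then lowercase). Real analyzers do more, and GermanAnalyzer in particular also stems and normalizes, so this only illustrates why the analyzed and the unanalyzed query text end up as different terms. It checks exact membership only and does not model edit distance.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified illustration only: this is not Lucene's real analysis chain.
final class AnalyzedVsRawTermDemo {

    // Hypothetical stand-in for an analyzer: split on anything that is not
    // a letter or digit, and lowercase each piece.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The u-umlaut is written as a Unicode escape to avoid encoding issues.
        Set<String> indexedTerms = new HashSet<>(tokenize("Br\u00fcder-Grimm-Weg"));

        // A plain term query is analyzed, so it is looked up piece by piece.
        List<String> analyzedQuery = tokenize("br\u00fcder-g");

        // A fuzzy query is not analyzed, so the raw text stays one term.
        String rawFuzzyTerm = "br\u00fcder-g";

        System.out.println("indexed terms:  " + indexedTerms);
        System.out.println("analyzed query: " + analyzedQuery);
        System.out.println("raw term is in index: " + indexedTerms.contains(rawFuzzyTerm));
    }
}
```

The analyzed query shares a term with the index, while the raw fuzzy term is not present as a term at all.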

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches I'm given a real world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stop word list? Would this have adverse effects? If so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer, or create your own by inheriting from the one that most closely suits your needs and modifying it to do what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is in case you want to use or filter this
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
Version.LUCENE_29, //depends on your version
new HashSet<string> //pass in your own stop word list
{
"hello",
"world"
});
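The effect of the stop word set on the phrases from the question can be sketched in plain Java, without Lucene. Everything here is an assumption made for illustration: `phraseAfterStopFilter` is a hypothetical helper, and `SAMPLE_STOP_WORDS` is a small hand-picked subset, not the analyzer's actual list. A removed word is shown as "?" to mirror the position gaps in the printed queries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified illustration only: mimics stop word removal, not Lucene itself.
final class StopWordPhraseDemo {

    // Illustrative subset of English stop words, enough for the two examples.
    static final Set<String> SAMPLE_STOP_WORDS = new HashSet<>(
            Arrays.asList("not", "in", "or", "there", "is", "no", "such"));

    // Lowercase the phrase, split on non-alphanumerics, and show each
    // removed stop word as "?" so the position gap stays visible.
    static String phraseAfterStopFilter(String phrase, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : phrase.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            out.add(stopWords.contains(token) ? "?" : token);
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        String query = "not in education, employment, or training";

        // With stop words: most of the phrase turns into gaps.
        System.out.println(phraseAfterStopFilter(query, SAMPLE_STOP_WORDS));

        // With an empty stop word set: every word survives as a term.
        System.out.println(phraseAfterStopFilter(query, new HashSet<String>()));
    }
}
```

With an empty set, every word of the phrase survives as a term. The index must be built with the same analyzer for those terms to be found.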

Why isn't my query finding the results, is this an exact match or contains?

My index has the following data:
doc.add(new StringField("domain", "examplehouse.com", Field.Store.YES));
doc.add(new StringField("domain", "exampletree.com", Field.Store.YES));
doc.add(new StringField("domain", "exampleapple.com", Field.Store.YES));
Now I am trying to return all domains with the term "example" in them:
BooleanQuery bq = new BooleanQuery.Builder().add(new TermQuery(new Term("domain", "example")), BooleanClause.Occur.MUST).build();
indexSearcher.search(bq, 100);
The query when I print it out looks like:
+domain:example
Is this the correct type of query or is this an exact match?
TermQueries are always exact matches. In your case, a wildcard-based query such as a PrefixQuery would make more sense: https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/PrefixQuery.html
There are several wildcard types, and it is worth understanding how they differ:
prefix (multi-character trailing wildcard): examp*
single-character trailing wildcard: exampl?
multi-character leading wildcard: *ample
single-character leading wildcard: ?xample
An old but still valid link to the Lucene docs explaining the query syntax:
https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
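The difference between an exact term match and the wildcard forms listed above can be tried out with a small plain-Java sketch. It does not use Lucene: `toRegex` and `matching` are hypothetical helpers that translate `*` and `?` into a regular expression and test it against whole terms, which is roughly how a wildcard query is compared against the untokenized StringField values. Backslash escaping is deliberately left out.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Simplified illustration only: wildcard semantics modelled with java.util.regex.
final class WildcardVsExactDemo {

    // The three untokenized field values from the question.
    static final List<String> DOMAINS = Arrays.asList(
            "examplehouse.com", "exampletree.com", "exampleapple.com");

    // '*' matches any run of characters, '?' matches exactly one character,
    // everything else is taken literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // Keep only the terms that the wildcard pattern matches in full.
    static List<String> matching(List<String> terms, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        return terms.stream()
                .filter(term -> pattern.matcher(term).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Exact match, like a TermQuery: no stored term equals "example".
        System.out.println("exact:     " + DOMAINS.contains("example"));
        // Prefix form: every domain starts with "example".
        System.out.println("example*:  " + matching(DOMAINS, "example*"));
        // Leading wildcard: only one domain ends with "tree.com".
        System.out.println("*tree.com: " + matching(DOMAINS, "*tree.com"));
    }
}
```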

Lucene Query for "OR" and "IN"

I'm using Lucene.net within my project to search for customers. I've got my Lucene index built and search is returning expected results for all of my indexed fields, however, when I search specifically for customers in Indiana or Oregon, I receive zero results, despite my database reflecting otherwise.
In my test case, these states are abbreviated to IN and OR respectively in my lucene index. Searching for other fields will yield results for customers within these states, so I know they are indexed.
Example:
State:(fl) returns results for customers in Florida, as expected.
State:(in) returns no results
State:(or) returns no results
State:(ar*) returns results for customers in Arkansas, as expected.
State:(in*) returns no results
State:(or*) returns no results
State:("mi") returns results for customers in Michigan, as expected.
State:("or") returns no results
State:("in") returns no results
State:("\\ca") returns results for customers in California, as expected.
State:("\\or") returns no results
State:("\\in") returns no results
On a related note, searching for names containing AND, OR, and IN works without issue:
Name:(and*) returns results for Andrew, Andrea, Andy, etc.
Name:(in*) returns results for Inge, Ina, Indie, etc.
Name:(or*) returns results for Oris, Orlando, Orville, etc.
I've tried the following for creating my indices:
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
new Field("State", (String.IsNullOrWhiteSpace(ShippingState) ? "" : ShippingState) + " " + (String.IsNullOrWhiteSpace(BillingState) ? "" : BillingState), Field.Store.YES, Field.Index.ANALYZED);
I've also looked at other solutions to similar problems, such as how to properly escape OR and AND in lucene query? but I've had no luck in adapting these solutions to this issue. I'm using Lucene.NET 3.0.3.
The problem here isn't really the collision with query syntax. "IN" isn't even a lucene query keyword.
The problem is that standard analysis eliminates certain common words, known as stop words, which are deemed not to be interesting search terms in most cases. By default, the stop words are common English words, including "in", "or" and "and", among others (full list here: What is the default list of stopwords used in Lucene's StopFilter?).
If this isn't desirable behavior in your case, you can define your StandardAnalyzer with a custom (or empty) stop word set:
StandardAnalyzer analyzer = new StandardAnalyzer(
Lucene.Net.Util.Version.LUCENE_30,
new HashSet<String>() //Empty stop word set
);

lucene wildcard query with space

I have a Lucene index that contains city names.
Say I want to search for 'New Delhi'. I have the string 'New Del', which I want to pass to the Lucene searcher, and I expect 'New Delhi' as output.
If I generate a query like Name:New Del*, it gives me all cities with 'New' and 'Del' in them.
Is there any way to create a Lucene wildcard query with spaces in it?
I looked at and tried a few of the solutions given at http://www.gossamer-threads.com/lists/lucene/java-user/5487
It sounds like you have indexed your city names with analysis. That will tend to make this more difficult. With analysis, "new" and "delhi" are separate terms, and must be treated as such. Searching over multiple terms with wildcards like this tends to be a bit more difficult.
The easiest solution would be to index your city names without tokenization (lowercasing might not be a bad idea though). Then you would be able to search with the query parser simply by escaping the space:
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("cityname:new\\ del*");
Or you could use a simple WildcardQuery:
Query query = new WildcardQuery(new Term("cityname", "new del*"));
If the field is analyzed by the standard analyzer, you will need to rely on SpanQueries, something like this:
SpanQuery queryPart1 = new SpanTermQuery(new Term("cityname", "new"));
SpanQuery queryPart2 = new SpanMultiTermQueryWrapper<>(new WildcardQuery(new Term("cityname", "del*")));
Query query = new SpanNearQuery(new SpanQuery[] {queryPart1, queryPart2}, 0, true);
Or, you can use the surround query parser (which provides query syntax intended to provide more robust support of span queries), using a query like W(new, del*):
org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
SrndQuery srndquery = surroundparser.parse("W(new, del*)");
query = srndquery.makeLuceneQueryField("cityname", new BasicQueryFactory());
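What the span approach asks for (the term "new" immediately followed, in order, by a term starting with "del") can be modelled in a few lines of plain Java. This is only a sketch of the matching rule for slop 0 with ordering enforced, not of Lucene's implementation. `matchesAdjacent` is a hypothetical helper, and splitting on whitespace is a simplifying assumption about how the field was tokenized.

```java
import java.util.Locale;

// Simplified illustration only: models "term, then prefix, adjacent and in order".
final class OrderedNearDemo {

    // True if some token equals firstTerm and the very next token starts
    // with nextPrefix (no gap allowed, order enforced).
    static boolean matchesAdjacent(String fieldValue, String firstTerm, String nextPrefix) {
        String[] tokens = fieldValue.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i + 1 < tokens.length; i++) {
            if (tokens[i].equals(firstTerm) && tokens[i + 1].startsWith(nextPrefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] cities = {"New Delhi", "Delhi New", "New Old Delhi", "Newark Delta"};
        for (String city : cities) {
            System.out.println(city + " -> " + matchesAdjacent(city, "new", "del"));
        }
    }
}
```

Only "New Delhi" satisfies the rule. The other names fail because the order is reversed, another token sits in between, or the first token is not exactly "new".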
As I learned from the thread you mentioned (http://www.gossamer-threads.com/lists/lucene/java-user/5487), you can either do an exact match with the space or treat each part with a wildcard.
So something like this should work - [New* Del*]

Lucene search using StopWords in StandardAnalyzer

I have the following issue using Lucene.NET 3.0.3.
My project analyzes documents using StandardAnalyzer with a stop word list (combined German and English words).
When searching, I build my search term by hand and parse it using MultiFieldQueryParser. The parser is initialized with the same analyzer that is used for indexing documents.
The parsed search query is added to a BooleanQuery. The BooleanQuery and a TopScoreDocCollector are then used to search the Lucene index with IndexSearcher.
My code looks like:
using (StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, roxConnectionTools.getServiceInstance<ISearchIndexService>().GetStopWordList()))
{
...
MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, analyzer);
parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
parser.AllowLeadingWildcard = true;
...
Query searchQuery = parser.Parse(searchStringBuilder.ToString().Trim());
...
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.Add(searchQuery, Occur.MUST);
...
TopScoreDocCollector scoreCollector = TopScoreDocCollector.Create(SearchServiceTools.MAX_SCORE_COLLECTOR_SIZE, true);
...
searcher.Search(boolQuery, scoreCollector);
ScoreDoc[] scoreDocs = scoreCollector.TopDocs().ScoreDocs;
}
If I index a document field with the value "Test- und Produktivumgebung", I can't find this document by searching for this term.
I get results if I change the search term to "Test- Produktivumgebung".
The word "und" is in my stop word list.
My search query looks like the following:
Manually generated search query: (+*Test* +*und* +*Produktivumgebung*)
Parsed search query: +(title:*Test*) +(title:*und*) +(title:*Produktivumgebung*)
Why can't I find the document when searching for "Test- und Produktivumgebung"?
Wildcard queries are not analyzed (see this question for an example). Since you are (if I understand correctly) translating the query "Test- und Produktivumgebung" into (+*Test* +*und* +*Produktivumgebung*), the analyzer is not used for any of those wildcard queries, and stop words are not eliminated.
If you eliminate the step that performs that translation, the query "Test- und Produktivumgebung" should be parsed into a phrase query and analyzed, and it should work just fine. Another reason to eliminate that step is that applying a leading wildcard to every term will make performance very poor. That is why leading wildcards must be enabled manually: using them is generally a bad idea.
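The mismatch can be reproduced in a plain-Java sketch without Lucene. Both helpers are hypothetical: `indexedTerms` imitates indexing with a stop word filter, and `allRequiredFragmentsMatch` imitates a query in which every `*fragment*` clause is required and is compared against the indexed terms as written, with no stop word removal on the query side. Lowercasing the fragments is an assumption about the parser's handling of wildcard terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified illustration only: stop-filtered index vs. unanalyzed wildcard clauses.
final class RequiredWildcardClausesDemo {

    // Imitates indexing: lowercase, split on non-alphanumerics, drop stop words.
    static List<String> indexedTerms(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    // Imitates +*a* +*b* ...: every fragment must occur inside at least one
    // indexed term, otherwise the document does not match.
    static boolean allRequiredFragmentsMatch(List<String> terms, List<String> fragments) {
        for (String fragment : fragments) {
            String needle = fragment.toLowerCase(Locale.ROOT);
            boolean found = false;
            for (String term : terms) {
                if (term.contains(needle)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> stopWords = Collections.singleton("und");
        List<String> terms = indexedTerms("Test- und Produktivumgebung", stopWords);
        System.out.println("indexed terms: " + terms);

        // "und" was never indexed, so the required *und* clause cannot match.
        System.out.println(allRequiredFragmentsMatch(
                terms, Arrays.asList("Test", "und", "Produktivumgebung")));

        // Without the stop word, every required clause finds a term.
        System.out.println(allRequiredFragmentsMatch(
                terms, Arrays.asList("Test", "Produktivumgebung")));
    }
}
```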