Lucene wildcard query with space

I have a Lucene index that contains city names.
Suppose I want to search for 'New Delhi'. I have the string 'New Del', which I want to pass to the Lucene searcher, and I expect 'New Delhi' as the output.
If I generate a query like Name:New Del*, it gives me all cities with 'New' and 'Del' in them.
Is there any way to create a Lucene wildcard query with spaces in it?
I looked at and tried a few of the solutions given at http://www.gossamer-threads.com/lists/lucene/java-user/5487

It sounds like you have indexed your city names with analysis, which makes this more difficult. With analysis, "new" and "delhi" are separate terms and must be treated as such, and searching over multiple terms with wildcards is harder than searching over a single term.
The easiest solution would be to index your city names without tokenization (lowercasing might not be a bad idea though). Then you would be able to search with the query parser simply by escaping the space:
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("cityname:new\\ del*");
Or you could use a simple WildcardQuery:
Query query = new WildcardQuery(new Term("cityname", "new del*"));
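For the untokenized-field approach, the query string has to be built from raw user input. The following is a minimal sketch of that step, assuming the field was indexed lowercased and untokenized; the class name CityPrefixQuery and its build method are hypothetical helpers, not part of Lucene. It lowercases the input, backslash-escapes whitespace and the classic query parser's special characters, and appends a trailing wildcard.

```java
import java.util.Locale;

final class CityPrefixQuery {
    // Characters with special meaning in the classic query parser syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Builds a query string such as cityname:new\ del* from the raw input "New Del". */
    static String build(String field, String userInput) {
        StringBuilder sb = new StringBuilder(field).append(':');
        // Assumption: the field was lowercased at index time, so lowercase the input too.
        for (char c : userInput.trim().toLowerCase(Locale.ROOT).toCharArray()) {
            if (Character.isWhitespace(c) || SPECIAL.indexOf(c) >= 0) {
                sb.append('\\'); // escape so the parser keeps the input as one term
            }
            sb.append(c);
        }
        return sb.append('*').toString();
    }

    public static void main(String[] args) {
        System.out.println(build("cityname", "New Del")); // prints cityname:new\ del*
    }
}
```

The resulting string would then be passed to parser.parse(...) as in the snippet above.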
If the field is analyzed with the standard analyzer, you will need to rely on SpanQueries, something like this:
SpanQuery queryPart1 = new SpanTermQuery(new Term("cityname", "new"));
SpanQuery queryPart2 = new SpanMultiTermQueryWrapper<WildcardQuery>(new WildcardQuery(new Term("cityname", "del*")));
Query query = new SpanNearQuery(new SpanQuery[] {queryPart1, queryPart2}, 0, true);
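To make the matching rule of that span query concrete, here is a plain-Java sketch that does not use Lucene at all. It only imitates what a SpanNearQuery with slop 0 and in-order matching asks for: the term "new" immediately followed by a term starting with "del". The tokenize method is a rough stand-in for analysis and the class name is made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

final class SpanNearSketch {
    /** Rough stand-in for analysis: lowercase and split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    /** True if some token equals firstTerm and the very next token starts with prefix. */
    static boolean matches(String text, String firstTerm, String prefix) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(firstTerm) && tokens.get(i + 1).startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("New Delhi", "new", "del"));     // true
        System.out.println(matches("Delhi New", "new", "del"));     // false: wrong order
        System.out.println(matches("New Old Delhi", "new", "del")); // false: not adjacent
    }
}
```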
Or, you can use the surround query parser (which provides query syntax intended to provide more robust support of span queries), using a query like W(new, del*):
org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
SrndQuery srndquery = surroundparser.parse("W(new, del*)");
query = srndquery.makeLuceneQueryField("cityname", new BasicQueryFactory());

As I learned from the thread you mentioned (http://www.gossamer-threads.com/lists/lucene/java-user/5487), you can either do an exact match that includes the space or apply a wildcard to each part.
So something like this should work - [New* Del*]

Related

What is the difference between TermQuery and QueryParser in Lucene 6.0?

There are two queries. One is created by QueryParser:
QueryParser parser = new QueryParser(field, analyzer);
Query query1 = parser.parse("Lucene");
The other is a TermQuery:
Query query2 = new TermQuery(new Term("title", "Lucene"));
What is the difference between query1 and query2?
This is the definition of Term from lucene docs.
A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.
So in your case the query will be created to search for the word "Lucene" in the field "title".
To explain the difference between the two, let me take a different example.
Consider the following:
Query query2 = new TermQuery(new Term("title", "Apache Lucene"));
In this case the query will search for the exact term "Apache Lucene", as a single unanalyzed term, in the field title.
With QueryParser it works differently.
As an example, let's assume a Lucene index contains two fields, "title" and "body".
QueryParser parser = new QueryParser("title", new StandardAnalyzer());
Query query1 = parser.parse("title:Apache body:Lucene");
Query query2 = parser.parse("body:Apache Lucene");
Query query3 = parser.parse("title:\"Apache Lucene\"");
A couple of things to note.
"title" is the default field (given in the constructor) that QueryParser searches when a term is not prefixed with a field.
parser.parse("title:Apache body:Lucene"); -> in this case the final query will look like this: query1 = title:Apache body:Lucene.
parser.parse("body:Apache Lucene"); -> in this case the final query will look like this: query2 = body:Apache title:Lucene, but for a different reason.
The parser will search for "Apache" in the body field and "Lucene" in the title field, since a field is only valid for the term that it directly precedes (http://lucene.apache.org/core/2_9_4/queryparsersyntax.html).
Because we do not specify a field for "Lucene", the default field, "title", is used.
query3 = parser.parse("title:\"Apache Lucene\""); in this case we are explicitly saying that we want to search for "Apache Lucene" in the field "title". This is a phrase query, and it is similar to a term query if analyzed correctly.
So, to summarize: the term query does not analyze the term and searches for it as is, while the query parser parses the input according to the rules described above.
The QueryParser parses the string and constructs a BooleanQuery (afaik) consisting of BooleanClauses and analyzes the terms along the way.
The TermQuery does NOT do analysis, and takes the term as-is. This is the main difference.
So query1 and query2 might be equivalent (in the sense that they return the same search results) if the field is the same and the QueryParser's analyzer does not change the term.
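The effect of analysis can be imitated in a few lines of plain Java. This is only a sketch under simplifying assumptions: analyze is a hypothetical stand-in for an analyzer (lowercase, split on whitespace), and the "index" is just a set of terms. It shows why a raw term such as "Lucene" can miss a document that the analyzed query finds.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class AnalysisSketch {
    /** Hypothetical stand-in for an analyzer: lowercase, then split on whitespace. */
    static Set<String> analyze(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    /** TermQuery-like lookup: the term is compared as-is, with no analysis. */
    static boolean termLookup(Set<String> indexedTerms, String rawTerm) {
        return indexedTerms.contains(rawTerm);
    }

    /** QueryParser-like lookup: the input is analyzed first, then any resulting term may match. */
    static boolean parsedLookup(Set<String> indexedTerms, String queryText) {
        return analyze(queryText).stream().anyMatch(indexedTerms::contains);
    }

    public static void main(String[] args) {
        Set<String> title = analyze("Apache Lucene in Action"); // terms as they would be indexed
        System.out.println(termLookup(title, "Lucene"));        // false: index holds "lucene"
        System.out.println(termLookup(title, "Apache Lucene")); // false: no such single term
        System.out.println(parsedLookup(title, "Lucene"));      // true: analyzed to "lucene"
    }
}
```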

Lucene search using StopWords in StandardAnalyzer

I have the following issue using Lucene.NET 3.0.3.
My project analyzes documents using StandardAnalyzer with a stop word list (combined German and English words).
While searching, I build my search term by hand and parse it using MultiFieldQueryParser. The parser is initialized with the same analyzer that was used for indexing the documents.
The parsed search query is added to a BooleanQuery. The BooleanQuery and a TopScoreDocCollector are then used to search the Lucene index with IndexSearcher.
My code looks like this:
using (StandardAnalyzer analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, roxConnectionTools.getServiceInstance<ISearchIndexService>().GetStopWordList()))
{
...
MultiFieldQueryParser parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, searchFields, analyzer);
parser.MultiTermRewriteMethod = MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE;
parser.AllowLeadingWildcard = true;
...
Query searchQuery = parser.Parse(searchStringBuilder.ToString().Trim());
...
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.Add(searchQuery, Occur.MUST);
...
TopScoreDocCollector scoreCollector = TopScoreDocCollector.Create(SearchServiceTools.MAX_SCORE_COLLECTOR_SIZE, true);
...
searcher.Search(boolQuery, scoreCollector);
ScoreDoc[] scoreDocs = scoreCollector.TopDocs().ScoreDocs;
}
If I index a document field with the value "Test- und Produktivumgebung", I can't find this document by searching for this term.
I get results if I change the search term to "Test- Produktivumgebung".
The word "und" is in my stop word list.
My search query looks like the following:
Manually generated search query: (+*Test* +*und* +*Produktivumgebung*)
Parsed search query: +(title:*Test*) +(title:*und*) +(title:*Produktivumgebung*)
Why can't I find the document when searching for "Test- und Produktivumgebung"?
Wildcard queries are not analyzed (see this question for an example). Since you are, if I understand correctly, translating the query "Test- und Produktivumgebung" into (+*Test* +*und* +*Produktivumgebung*), the analyzer is not used for any of those wildcard queries, and stop words are not eliminated. The clause +*und* is therefore required, but "und" was removed from the index as a stop word, so nothing matches.
If you eliminate the step that performs that translation, the query "Test- und Produktivumgebung" should be parsed into a phrase query and analyzed, and should work just fine. Another reason to eliminate that step is that applying a leading wildcard to every term will make performance very poor. That is why leading wildcards must be enabled manually: it is generally a bad idea to use them.

Find typo with Lucene

I would like to use Lucene to index and search text. The text can contain mistyped words, names, etc. What is the simplest way of getting Lucene to find a document containing
"this is Licene"
when the user searches for
"Lucene"?
This is only for a demo app, so we need the simplest solution.
Lucene's fuzzy queries are based on Levenshtein edit distance.
Use a fuzzy query in the QueryParser, with syntax like:
Lucene~0.5
Or create a FuzzyQuery, passing in the maximum number of edits, something like:
Query query = new FuzzyQuery(new Term("field", "lucene"), 1);
Note: FuzzyQuery, in Lucene 4.x, does not support greater edit distances than 2.
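To see why a maximum of 1 edit is enough for this example, the edit distance can be computed by hand. The following is a textbook Levenshtein implementation in plain Java, given as an illustrative sketch; it is not Lucene's own implementation, and the class name is made up.

```java
final class EditDistance {
    /** Classic Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("licene", "lucene")); // 1
    }
}
```

Since "licene" is one substitution away from "lucene", a FuzzyQuery with a maximum of 1 edit should be able to reach it, assuming both terms are lowercased during analysis.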
Another option you could try is using the Lucene SpellChecker:
http://lucene.apache.org/core/6_4_0/suggest/org/apache/lucene/search/spell/SpellChecker.html
It works out of the box and is very easy to use:
SpellChecker spellchecker = new SpellChecker(spellIndexDirectory);
// To index a field of a user index:
spellchecker.indexDictionary(new LuceneDictionary(my_lucene_reader, a_field));
// To index a file containing words:
spellchecker.indexDictionary(new PlainTextDictionary(new File("myfile.txt")));
String[] suggestions = spellchecker.suggestSimilar("misspelt", 5);
By default it uses LevensteinDistance, but you can provide your own edit distance implementation.

Lucene fuzzy search on a phrase (FuzzyQuery + SpanQuery)

I am looking for a way to code a Lucene fuzzy query that finds all documents relevant to an exact phrase. If I search for "mosa employee appreciata", a document that contains "most employees appreciate" should be returned as a result.
I tried to use:
FuzzyQuery query = new FuzzyQuery(new Term("contents", "mosa employee appreicata"));
Unfortunately, in practice it doesn't work. FuzzyQuery uses edit distance, so in theory "mosa employee appreciata" should match "most employees appreciate", provided an appropriate distance is given. It seems a bit odd.
Any clues? Thank you.
There are two likely problems here. First: I'm guessing the "contents" field is being analyzed such that "most employees appreciate" is not a term, but rather three terms. Defining the query as a single term is not appropriate in that case.
However, even if the content listed is a single term, a second likely problem is that there is too much distance between the terms to get a match. The Damerau-Levenshtein distance between mosa employee appreicata and most employees appreciate is 4 (the approximate distance, incidentally, between my average first shot at spelling "Damerau-Levenshtein" and the correct spelling). FuzzyQuery, as of 4.0, handles edit distances of no more than 2, due to performance constraints and the assumption that larger distances are usually not particularly relevant.
If you need to perform a phrase query with fuzzy terms, you should look into either MultiPhraseQuery, or combine a set of SpanQueries (especially SpanMultiTermQueryWrapper and SpanNearQuery) to meet your needs.
SpanQuery[] clauses = new SpanQuery[3];
clauses[0] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "mosa")));
clauses[1] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "employee")));
clauses[2] = new SpanMultiTermQueryWrapper<FuzzyQuery>(new FuzzyQuery(new Term("contents", "appreicata")));
SpanNearQuery query = new SpanNearQuery(clauses, 0, true);
And since none of the individual terms have an edit distance greater than 2, this should be more effective.
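The per-term reasoning can be checked with a small plain-Java sketch. It does not use Lucene; it only imitates the idea of adjacent terms that are each within a maximum number of edits. The distance used here is the optimal string alignment variant of Damerau-Levenshtein, which counts a swap of two neighbouring characters as one edit; that this matches FuzzyQuery's default handling of transpositions is an assumption of the sketch. All names are hypothetical.

```java
final class FuzzyPhraseSketch {
    /** Edit distance where an adjacent transposition counts as a single edit. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // transposition
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** True if the query terms match consecutive tokens, each within maxEdits. */
    static boolean phraseMatches(String[] tokens, String[] queryTerms, int maxEdits) {
        for (int start = 0; start + queryTerms.length <= tokens.length; start++) {
            boolean all = true;
            for (int k = 0; k < queryTerms.length && all; k++) {
                all = distance(tokens[start + k], queryTerms[k]) <= maxEdits;
            }
            if (all) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] doc = {"most", "employees", "appreciate"};
        String[] query = {"mosa", "employee", "appreicata"};
        System.out.println(distance("mosa", "most"));             // 1
        System.out.println(distance("employee", "employees"));    // 1
        System.out.println(distance("appreicata", "appreciate")); // 2
        System.out.println(phraseMatches(doc, query, 2));         // true
    }
}
```

The three distances add up to the total of 4 mentioned above, while each individual term stays within the limit of 2.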
ComplexPhraseQueryParser handles fuzzy searching on individual words of a phrase, i.e. you specify which words should be fuzzy searched and which should not. It works as follows:
Query query = new ComplexPhraseQueryParser("content", analyzer)
.parse("some test~ query~ blah blah");
It seems to work nicely. I am not sure about performance, but it works well on small data sets.
I had some (very small) mileage with the following:
String[] searchTerms = searchString.split(" ");
FuzzyLikeThisQuery fltq = new FuzzyLikeThisQuery(searchTerms.length, new StandardAnalyzer());
Arrays.stream(searchTerms)
.forEach(term -> fltq.addTerms(term, FIELD, SIMILARITY_IN_EDITS, PREFIX_LENGTH));
This query matches strings that are far too distant from the indexed content. The only strings that don't match are those in which every term is more than 2 edits away from the terms in the indexed content.
Please use at your own peril.
The answer from femtoRgon is great! Thank you.
There is another way to solve this problem.
//declare a MultiPhraseQuery
MultiPhraseQuery childrenInOrder = new MultiPhraseQuery();
//use FuzzyTermEnum to enumerate the terms of your query string
FuzzyTermEnum fuzzyEnumeratedTerms1 = new FuzzyTermEnum(reader, new Term(searchField,"mosa"));
FuzzyTermEnum fuzzyEnumeratedTerms2 = new FuzzyTermEnum(reader, new Term(searchField,"employee"));
FuzzyTermEnum fuzzyEnumeratedTerms3 = new FuzzyTermEnum(reader, new Term(searchField,"appreicata"));
//this basically pulls the possible terms out of the index
Term termHolder1 = fuzzyEnumeratedTerms1.term();
Term termHolder2 = fuzzyEnumeratedTerms2.term();
Term termHolder3 = fuzzyEnumeratedTerms3.term();
//put the possible terms into the MultiPhraseQuery
if (termHolder1==null){
childrenInOrder.add(new Term(searchField,"mosa"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms1.term());
}
if (termHolder2==null){
childrenInOrder.add(new Term(searchField,"employee"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms2.term());
}
if (termHolder3==null){
childrenInOrder.add(new Term(searchField,"appreicata"));
}else{
childrenInOrder.add(fuzzyEnumeratedTerms3.term());
}
//close it - it is important to close it
fuzzyEnumeratedTerms1.close();
fuzzyEnumeratedTerms2.close();
fuzzyEnumeratedTerms3.close();

Lucene - "AND" sets of "OR" terms

Suppose I have a search using criteria such as a list of countries. A user can select a set of countries to search across and combine this set with other criteria.
In SQL I'd do this in my where clause, i.e. WHERE (country = 'brazil' OR country = 'france' OR country = 'china') AND (other search criteria).
It isn't clear how to do this in Lucene. Query.combine seems to have promise but that would increase in complexity very quickly if I have multiple sets of "OR" terms to work through.
Is Lucene capable in this regard? Or should I just hit my regular DB with these types of criteria and filter my Lucene results?
Digging deeper, it looks like you can nest boolean queries to accomplish this. I'll update with an answer if this technique works and if it is performant.
Using the standard query parser (and you can take a look at the relevant documentation), you can use syntax similar to a DB query, such as:
(country:brazil OR country:france OR country:china) AND (other search criteria)
Or, to simplify a bit:
country:(brazil OR france OR china) AND (other search criteria)
Alternatively, Lucene also supports queries written using +/-, rather than AND/OR syntax. I find that syntax more expressive for a Lucene query. The equivalent in this form would be:
+country:(brazil france china) +(other search criteria)
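When the set of countries comes from user selections, the query string in that form can be assembled programmatically. This is a minimal sketch with hypothetical names (CountryFilter, build); it assumes each country is a single term without query syntax characters and that the parser's default operator is OR. The second argument stands in for whatever the other search criteria are.

```java
import java.util.Arrays;
import java.util.List;

final class CountryFilter {
    /** Builds a string such as +country:(brazil france china) +(otherCriteria). */
    static String build(List<String> countries, String otherCriteria) {
        // Terms inside the parentheses are combined with the default operator (assumed OR).
        return "+country:(" + String.join(" ", countries) + ") +(" + otherCriteria + ")";
    }

    public static void main(String[] args) {
        List<String> selected = Arrays.asList("brazil", "france", "china");
        // "title:coffee" is a made-up placeholder for the other criteria.
        System.out.println(build(selected, "title:coffee"));
        // prints +country:(brazil france china) +(title:coffee)
    }
}
```

Multi-word country names would need quoting or escaping before being joined, which this sketch leaves out.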
If manually constructing queries, you can indeed nest BooleanQueries to create a similar structure, using the correct BooleanClauses to establish the And/Or logic you've specified:
BooleanQuery countryQuery = new BooleanQuery();
countryQuery.add(new TermQuery(new Term("country","brazil")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","france")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","china")),BooleanClause.Occur.SHOULD);
Query otherStuffQuery = //Set up the other query here,
//or get it from a query parser, or something
BooleanQuery rootQuery = new BooleanQuery();
rootQuery.add(countryQuery, BooleanClause.Occur.MUST);
rootQuery.add(otherStuffQuery, BooleanClause.Occur.MUST);
There are two ways.
Let Lucene formulate the query. To do that, pass in the query string in the following format:
Query: "country:(brazil france china)"
The built-in QueryParser parses the above string into a BooleanQuery with an OR operator.
QueryParser qp = new QueryParser(Version.LUCENE_41, "country", new StandardAnalyzer(Version.LUCENE_41));
Query q = qp.parse(s);
If you want to formulate the query yourself:
BooleanQuery bq = new BooleanQuery();
//
TermQuery tq = new TermQuery(new Term("country", "brazil"));
bq.add(tq, Occur.SHOULD); // SHOULD ==> OR operator
//
tq = new TermQuery(new Term("country", "france"));
bq.add(tq, Occur.SHOULD);
//
tq = new TermQuery(new Term("country", "china"));
bq.add(tq, Occur.SHOULD);
Unless you add hundreds of subqueries, Lucene will meet your expectations performance-wise.