Lucene - "AND" sets of "OR" terms - lucene

Suppose I have a search using criteria such as a list countries. A user can select a set of countries to search across and combine this set with other criteria.
In SQL I'd do this in my where clause i.e. WHERE (country = 'brazil' OR country = 'france' OR country = 'china) AND (other search criteria).
It isn't clear how to do this in Lucene. Query.combine seems to have promise but that would increase in complexity very quickly if I have multiple sets of "OR" terms to work through.
Is Lucene capable in this regard? Or should I just hit my regular DB with these types of criteria and filter my Lucene results?
Digging deeper, it looks like you can nest boolean queries to accomplish this. I'll update with an answer if this technique works and if it is performant.

Using the standard query parser(and you can take a look at the relevant documentation), you can use syntax similar to a DB query, such as:
(country:brazil OR country:france OR country:china) AND (other search criteria)
Or, to simplify a bit:
country:(brazil OR france OR china) AND (other search criteria)
Alternatively, Lucene also supports queries written using +/-, rather than AND/OR syntax. I find that syntax more expressive for a Lucene query. The equivalent in this form would be:
+country:(brazil france china) +(other search criteria)
If manually constructing queries, you can indeed nest BooleanQueries to create a similar structure, using the correct BooleanClauses to establish the And/Or logic you've specified:
Query countryQuery = new BooleanQuery();
countryQuery.add(new TermQuery(new Term("country","brazil")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","france")),BooleanClause.Occur.SHOULD);
countryQuery.add(new TermQuery(new Term("country","china")),BooleanClause.Occur.SHOULD);
Query otherStuffQuery = //Set up the other query here,
//or get it from a query parser, or something
Query rootQuery = new BooleanQuery();
rootQuery.add(countryQuery, BooleanClause.Occur.MUST);
rootQuery.add(otherStuffQuery, BooleanClause.Occur.MUST);

Two ways.
Let the Lucene formulate the query. To accomplish that, send in the query string in the following format.
Query: "country(brazil france china)"
An inbuilt QueryParser parses the above string to a BooleanQuery with an OR operator.
QueryParser qp = new QueryParser(Version.LUCENE_41, "country", new StandardAnalyzer(Version.LUCENE_41));
Query q = qp.parse(s);
If you want to formulate the query yourself,
BooleanQuery bq = new BooleanQuery();
//
TermQuery tq = new TermQuery(new Term("country", "brazil"));
bq.add(tq, Occur.SHOULD); // SHOULD ==> OR operator
//
tq = new TermQuery(new Term("country", "france"));
bq.add(tq, Occur.SHOULD);
//
tq = new TermQuery(new Term("country", "china"));
bq.add(tq, Occur.SHOULD);
Unless you add hundreds of subqueries, Lucene will meet your expectations performance-wise.

Related

what is the difference between TermQuery and QueryParser in Lucene 6.0?

There are two queries,one is created by QueryParser:
QueryParser parser = new QueryParser(field, analyzer);
Query query1 = parser.parse("Lucene");
the other is term query:
Query query2=new TermQuery(new Term("title", "Lucene"));
what is the difference between query1 and query2?
This is the definition of Term from lucene docs.
A Term represents a word from text. This is the unit of search. It is composed of two elements, the text of the word, as a string, and the name of the field that the text occurred in.
So in your case the query will be created to search the word "Lucene" in the field "title".
To explain the difference between the two let me take a difference example,
consider the following
Query query2 = new TermQuery(new Term("title", "Apache Lucene"));
In this case the query will search for the exact word "Apache Lucene" in the field title.
In the other case
As an example, let's assume a Lucene index contains two fields, "title" and "body".
QueryParser parser = new QueryParser("title", "StandardAnalyzer");
Query query1 = parser.parse("title:Apache body:Lucene");
Query query2 = parser.parse("title:Apache Lucene");
Query query3 = parser.parse("title:\"Apache Lucene\"");
couple of things.
"title" is the field that QueryParser will search if you don't prefix it with a field.(as given in the constructor).
parser.parse("title:Apache body:Lucene"); -> in this case the final query will look like this. query2 = title:Apache body:Lucene.
parser.parse("body:Apache Lucene"); -> in this case the final query will also look like this. query2 = body:Apache title:Lucene. but for a different reason.
So the parser will search "Apache" in body field and "Lucene" in title field. Since The field is only valid for the term that it directly precedes,(http://lucene.apache.org/core/2_9_4/queryparsersyntax.html)
So since we do not specify any field for lucene , the default field which is "title" will be used.
query2 = parser.parse("title:\"Apache Lucene\""); in this case we are explicitly telling that we want to search for "Apache Lucene" in field "title". This is phrase query and is similar to Term query if analyzed correctly.
So to summarize the term query will not analyze the term and search as it is. while Query parser parses the input based on some conditions described above.
The QueryParser parses the string and constructs a BooleanQuery (afaik) consisting of BooleanClauses and analyzes the terms along the way.
The TermQuery does NOT do analysis, and takes the term as-is. This is the main difference.
So the query1 and query2 might be equivalent (in a sense, that they provide the same search results) if the field is the same, and the QueryParser's analyzer is not changing the term.

lucene wildcard query with space

I have Lucene index which has city names.
Consider I want to search for 'New Delhi'. I have string 'New Del' which I want to pass to Lucene searcher and I am expecting output as 'New Delhi'.
If I generate query like Name:New Del* It will give me all cities with 'New and Del'in it.
Is there any way by which I can create Lucene query wildcard query with spaces in it?
I referred and tried few solutions given # http://www.gossamer-threads.com/lists/lucene/java-user/5487
It sounds like you have indexed your city names with analysis. That will tend to make this more difficult. With analysis, "new" and "delhi" are separate terms, and must be treated as such. Searching over multiple terms with wildcards like this tends to be a bit more difficult.
The easiest solution would be to index your city names without tokenization (lowercasing might not be a bad idea though). Then you would be able to search with the query parser simply by escaping the space:
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("cityname:new\\ del*");
Or you could use a simple WildcardQuery:
Query query = new WildcardQuery(new Term("cityname", "new del*"));
With the field analyzed by standard analyzer:
You will need to rely on SpanQueries, something like this:
SpanQuery queryPart1 = new SpanTermQuery(new Term("cityname", "new"));
SpanQuery queryPart2 = new SpanMultiTermQueryWrapper(new WildcardQuery(new Term("cityname", "del*")));
Query query = new SpanNearQuery(new SpanQuery[] {query1, query2}, 0, true);
Or, you can use the surround query parser (which provides query syntax intended to provide more robust support of span queries), using a query like W(new, del*):
org.apache.lucene.queryparser.surround.parser.QueryParser surroundparser = new org.apache.lucene.queryparser.surround.parser.QueryParser();
SrndQuery srndquery = surroundparser.parse("W(new, del*)");
query = srndquery.makeLuceneQueryField("cityname", new BasicQueryFactory());
As I learnt from the thread mentioned by you (http://www.gossamer-threads.com/lists/lucene/java-user/5487), you can either do an exact match with space or treat either parts w/ wild card.
So something like this should work - [New* Del*]

Lucene QueryParser : Parse multi-term string without analyzing

I serialized a BooleanQuery constructed using TermQuery's into a string. Now I am trying to de-serialize the string back into a BooleanQuery on a different node in a distributed system. So while de-serializing, I have multiple fields and I do not want to use an analyzer
Eg : I am trying to parse the below string without analyzing
+contents:maxItemsPerBlock +path:/lucene-5.1.0/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java
QueryParser in lucene requires an analyzer, but I want the above field values to be treated as terms. I am looking for a query parser which does something like the below since I do not want to parse the strings and construct the query myself.
TermQuery q1 = new TermQuery(new Term("contents", "maxItemsPerBlock"));
TermQuery q2 = new TermQuery(new Term("path", "/lucene-5.1.0/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java"));
BooleanQuery q = new BooleanQuery();
q.add(q1, BooleanClause.Occur.MUST);
q.add(q2, BooleanClause.Occur.MUST);
Also when I tried using a whitespace analyzer with a QueryParser, I got an "IllegalArgumentException : field must not be null" error. Below is the sample code
Analyzer analyzer = new WhitespaceAnalyzer();
String field = "contents";
QueryParser parser = new QueryParser(null, analyzer);
Query query = parser.parse("+contents:maxItemsPerBlock +path:/home/rchallapalli/Desktop/lucene-5.1.0/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java");
java.lang.IllegalArgumentException: field must not be null
at org.apache.lucene.search.MultiTermQuery.<init>(MultiTermQuery.java:233)
at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:99)
at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:81)
at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:108)
at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:93)
at org.apache.lucene.queryparser.classic.QueryParserBase.newRegexpQuery(QueryParserBase.java:572)
at org.apache.lucene.queryparser.classic.QueryParserBase.getRegexpQuery(QueryParserBase.java:774)
at org.apache.lucene.queryparser.classic.QueryParserBase.handleBareTokenQuery(QueryParserBase.java:844)
at org.apache.lucene.queryparser.classic.QueryParser.Term(QueryParser.java:348)
at org.apache.lucene.queryparser.classic.QueryParser.Clause(QueryParser.java:247)
at org.apache.lucene.queryparser.classic.QueryParser.Query(QueryParser.java:202)
at org.apache.lucene.queryparser.classic.QueryParser.TopLevelQuery(QueryParser.java:160)
at org.apache.lucene.queryparser.classic.QueryParserBase.parse(QueryParserBase.java:117)
Considering the text you offer in your question. Maybe WhitespaceAnalyzer which splits tokens at whitespace is a choice.
Before you serialize the BooleanQuery constructed by TermQuery, the term in TermQuery is actually what you want to match in the Lucene Index.
// code in Scala
val parser = new QueryParser(version, "", new WhitespaceAnalyzer((version)))
val parsedQuery = parser.parse(searchString)
I tried the following two cases: single-value field and multi-valued field, all work.
+contents:maxItemsPerBlock +path:/lucene-5.1.0/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java
+(contents:maxItemsPerBlock contents:minItemsPerBlock) +path:/lucene-5.1.0/core/src/java/org/apache/lucene/codecs/blocktree/Stats.java
Besides, in our system the serialization and deserialization when it comes to Query passing between nodes are based on
java's ObjectInputStream and ObjectOutputStream. So you may try in that way so you don't have to consider the Analyzer thing.

Lucene phrase query with wildcards

I come up with solution to programmaticlly create query to search for phrase with wildcards using this code:
public static Query createPhraseQuery(String[] phraseWords, String field) {
SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
for (int i = 0; i < phraseWords.length; i++) {
WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
}
return new SpanNearQuery(queryParts, //words
0, //max distance
true //exact order
);
}
Example creation and call toString() method will output:
String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());
outputs:
spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)
Which works great, and fast enough for most cases. For instance, if I create such query and search with it, It will output desired results, for example:
Sentence with foo bar.
Foolies beer drinkers.
...
And not something like:
Bar fooes.
Foo has bar.
I have mentioned that query work fast enough in most cases. Currently I have an index with size of aprox. 200GB and on average searching time is between 0.1 to 3 seconds. Depending on many factors like: cache, size of subsets of documents matching single word in phrase since lucene will perform set intersections between founded terms.
Example:
Let supose I want to query phrase "an* karenjin*" (which I will split into ["an*", "karenjin*"] and than create query using createPhraseQuery method) and I want that it matches sentences containing: "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different cases due croatian grammar).
This query is very slow that I haven't waited long enough to get results (over 1h) and sometimes causes GC overhead limit exceeded exception.
This behaviour is somewhat expected since "an*" itself matches a huge number of documents. I am aware of that I could query "an? karanjin*" which giver results in 30-40sec (faster but still slow).
This is where I am confused.
If I query just "karenjin*" it gives results in 1 sec. Therefore I have tried to query "an* karenjin*" and using a Filter "karenjin*" using WildcardQuery and QueryWrapperFilter. And it is still unacceptable slow (I killed process before it returned anythong).
Documentation says that Filter reduces search space of Query. So I tried to use filter:
Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karanjin*")));
And query:
Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");
Than search, (after several warm-up queries):
Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);
OK, what is my question?
How come is quering:
Query query = new WildcardQuery(new Term("text", "karanjin*"));
is fast, but using Filter described above is still slow?
Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly so. Hard to say for sure why that is occuring, but for an attempt.
I'll assume:
Query query = new WildcardQuery(new Term("text", "an*"));
On it's own, is performing very badly, as described. Since the wildcards you are looking for are both prefix style queries, it's a better idea to use a PrefixQuery instead.
Query query = new PrefixQuery(new Term("text", "an"));
Though I don't think that will make much of a difference if any at all. What might just make a different is changing you rewrite method. You could try limiting the number of Terms the query is rewritten into:
Query query = new PrefixQuery(new Term("text", "an"));
//or
//Query query = new WildcardQuery(new Term("text", "an*"));
query.setRewriteMethod(new MultiTermQuery.RewriteMethod.TopTermsRewrite(10));

emit every document in the database with lucene

I've got an index where I need to get all documents with a standard search, still ranked by relevance, even if a document isn't a hit.
My first idea is to add a field that is always matched, but that might deform the relevance score.
Use a BooleanQuery to combine your original query with a MatchAllDocsQuery. You can mitigate the effect this has on scoring by setting the boost on the MatchAllDocsQuery to zero before you combine it with your main query. This way you don't have to add an otherwise bogus field to the index.
For example:
// Parse a query by the user.
QueryParser qp = new QueryParser(Version.LUCENE_35, "text", new StandardAnalyzer());
Query standardQuery = qp.parse("User query may go here");
// Make a query that matches everything, but has no boost.
MatchAllDocsQuery matchAllDocsQuery = new MatchAllDocsQuery();
matchAllDocsQuery.setBoost(0f);
// Combine the queries.
BooleanQuery boolQuery = new BooleanQuery();
boolQuery.add(standardQuery, BooleanClause.Occur.SHOULD);
boolQuery.add(matchAllDocsQuery, BooleanClause.Occur.SHOULD);
// Now just pass it to the searcher.
This should give you hits from standardQuery followed by the rest of the documents in the index.