Lucene Boolean Query on Not Analyzed Fields

I'm using RavenDB to query a Lucene index.
This query parses okay:
X:[[a]] AND Y:[[b]] AND Z:[[c]]
However this query gives me a parse exception:
X:[[a]] AND Y:[[b]] AND Z:[[c]] AND P:[[d]]
"Lucene.Net.QueryParsers.ParseException: Cannot parse '( AND )': Encountered \" \"AND"
I tried this on a complex index and on simple reproduction cases, with the same result: once you go past three ANDs, it blows up. I'm using [[]] and not-analyzed fields because I want exact matches (the values sometimes contain whitespace, etc.), and from RavenDB I have very little control over the indexing.
How can I rewrite the query to avoid the parse exception?

This is now fixed in the latest RavenDB builds. See this thread for more info.

This looks rather like a bug in Lucene's QueryParser; consider reporting it on the user mailing list.
As a workaround, you could create a BooleanQuery manually and add the terms you want yourself. Since the fields are not analyzed and the query doesn't look too complicated, you may be better off without the query parser.

Related

IGNORE CASE query problems saving to a table and using Allow large results

I need case insensitivity in my queries so I found IGNORE CASE which works superbly when used in queries that target the browser (I am talking about BQ web UI). If I choose a destination table (an absolute must for me) and select Allow Large Results (with unchecked Flatten Results) then I get a cryptic error like this:
Error: unexpected LIMIT clause at: 2.200 - 2.206
Even though this Official Google BigQuery issue and feature request tracker post seems to speak of the same issue and even though the problem seems to have been acknowledged back in Jan 2015 the solution isn't apparent.
I could potentially use a bunch of temp tables with lowercased search columns as a workaround but that sounds awfully difficult with the number of tables and columns that I have and the complex queries that I intend to run.
Any other possible workarounds? Why isn't this working yet on BQ?
Yes, it is a known problem, and it has not been neglected. The code changes to fix it are (surprisingly) not trivial, but they are mostly done. Now the team is carefully looking at how to enable and deploy them. I cannot give you a timeline, but the fix to this problem is coming.
The only workaround in the meantime is to wrap the operands of all string comparisons, string GROUP BYs and string ORDER BYs in LOWER() (or UPPER()).

Lucene.NET 2.9 - MultiFieldQueryParser, boosted fields, stemming and prefixes

I have a system where the search queries multiple fields with different boost values. It is running on Lucene.NET 2.9.4 because it's an Umbraco (6.x) site, and that's what version of Lucene.NET the CMS uses.
My client asked me if I could add stemming, so I wrote a custom analyzer that does Standard / Lowercase / Stop / PorterStemmer. The stemming filter seems to work fine.
But now, when I try to use my new analyzer with the MultiFieldQueryParser, it's not finding anything.
The MultiFieldQueryParser is returning a query containing stemmed words - e.g. if I search for "the figure", what I get as part of the query it returns is:
keywords:figur^4.0 Title:figur^3.0 Collection:figur^2.0
i.e. it's searching the correct fields and applying the correct boosts, but trying to do an exact search on stemmed terms on indexes that contained unstemmed words.
I think what's actually needed is for the MultiFieldQueryParser to return a list of clauses of type PrefixQuery, so that it outputs a query like
keywords:figur*^4.0 Title:figur*^3.0 Collection:figur*^2.0
If I just add a wildcard to the end of the term and feed that into the parser, the stemmer doesn't kick in, i.e. it builds a query that looks for "figure*".
Is there any way to combine MultiFieldQueryParser boosting and prefix queries?
You need to reindex using your custom analyzer. Applying a stemmer only at query time is useless. You might kludge together something using wildcards, but it would remain an ugly, unreliable kludge.
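To see why query-time-only stemming finds nothing, here is a minimal toy model in plain Java. It is not Lucene code: the stem method is a hypothetical, deliberately crude stand-in for a real stemmer such as Porter, and the "index" is just a set of terms. It only illustrates that an exact term lookup succeeds when both sides are analyzed the same way.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only; nothing here is part of the Lucene.NET API.
class StemMismatchDemo {

    // Hypothetical crude stemmer: strips one trailing "e" or "s" from longer words.
    static String stem(String term) {
        if (term.length() > 3 && (term.endsWith("e") || term.endsWith("s"))) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // "Indexes" a text: lowercases, splits on non-letters, optionally stems each token.
    static Set<String> index(String text, boolean stemAtIndexTime) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(stemAtIndexTime ? stem(token) : token);
        }
        return terms;
    }

    // The query word is always stemmed, then looked up as an exact term.
    static boolean matches(Set<String> indexedTerms, String queryWord) {
        return indexedTerms.contains(stem(queryWord.toLowerCase()));
    }

    public static void main(String[] args) {
        String doc = "The figure on the wall";
        // Stemmed query term "figur" against an index that stored "figure".
        System.out.println(matches(index(doc, false), "figure"));
        // Same analysis at index time and query time.
        System.out.println(matches(index(doc, true), "figure"));
    }
}
```

In this model the first lookup fails and the second succeeds, which mirrors the situation in the question: the parser produces "figur" but the existing index still holds "figure".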

Hibernate Search - possible to get new Lucene query after facets applied?

A Lucene Query is generated as so:
Query luceneQuery = builder.all().createQuery();
Then facets are applied.
I'm not sure whether, when facets are applied, the luceneQuery is ANDed and ORed with other queries, resulting in a new Lucene Query. Alternatively, perhaps a bunch of BitSets are applied to the original Query to refine the results. (I don't know.)
If a new query is generated I'd like to retrieve it. If not, I need a rethink. That's the crux of the question.
Why:
I'm applying a faceted search on a field with multiple possible values.
E.g. TMovie.class many-to-many TTag.class (multiple-value-facet)
I'm filtering on TMovie where TTag is some value.
Anyway, the filtering works but there is a known problem whereby the Facet-counts returned are incorrect.
Detailed here: Add faceting over multivalued to application using Hibernate Search and https://forum.hibernate.org/viewtopic.php?f=9&t=1010472
I'm using this solution:
http://sujitpal.blogspot.ie/2007/04/lucene-search-within-search-with.html (see comment on new API under article)
The BitSet solution (in this example at least) generates counts based on the original Lucene Query. This works perfectly. However.....
If alternate (different, not TTags) facets are applied to the original query some complications arise.
The Bitset solution calculates on the original Lucene query. It does not calculate on the lucene query now reduced by the application of alternate Facets (a different FacetSelection) (or even TTag Facets themselves for that matter). I.e. the count calculations are irrespective of any other FacetSelection Facets applied.
So...
A. can I get the new Lucene query after facets are applied? The BitSet solution applied to this would be correct.
B. Any other alternative suggestions?
Thanks so much.. All comments welcome.
John
Regarding your first question, applying a facet does not modify the original query; it uses a custom Collector called FacetCollector - see https://github.com/hibernate/hibernate-search/blob/master/engine/src/main/java/org/hibernate/search/query/collector/impl/FacetCollector.java. Under the hood the collector uses a Lucene FieldCache for doing the facet count. That is also the root of the limitation for multi-value faceting: FieldCache does not support multiple values per field.
Anyway, no additional queries are applied during faceting and the original query is unmodified. The benefit of course is performance. The solution you are pointing to probably works as well, but relies on running multiple queries. However, it might be a valid workaround for your use case.

Lucene query fails with mixed MUST/MUST_NOT

Given a document with this text, indexed in a field named Content:
The dish ran away with the spoon.
The following query fails to match that document:
+Content:dish +(-Content:xyz) <-- no results!
I want the query to be treated as must include "dish", must not include "xyz". It's the "must not" part that is failing.
I know the +- combination looks funny but syntactically it should be correct, especially considering that the following variations all work:
+Content:dish +(-Content:xyz +Content:spoon) <-- this works
+Content:dish -Content:xyz <-- this works
So why doesn't +(-Content:xyz) work? Is that by design, or a bug, or am I just missing something? I'm using Lucene.Net but I assume regular Lucene behaves the same.
Lucene doesn't start with a full view of everything, like a SQL database. Lucene starts with no documents matched, and finds things based on the clauses searched on. This is why:
-Content:xyz
on its own doesn't really work. It knows not to bring in Content:xyz, but hasn't been given any documents to match. The same is true of your query, because the negation is placed in a subquery.
-Content:xyz is evaluated first, which gets no docs on its own. So then you have, effectively
+Content:dish +(no documents)
It's useful to think of - as an AND NOT rather than simply a NOT (though don't take that to imply the +/- and AND/OR/NOT syntax necessarily map to each other directly).
If you want to be able to execute a lonely negative query like that, you need to bring in all documents first. The MatchAllDocsQuery is the best way to accomplish that, something like:
BooleanQuery query = new BooleanQuery();
query.add(new BooleanClause(new MatchAllDocsQuery(), BooleanClause.Occur.SHOULD));
query.add(new BooleanClause(new TermQuery(new Term("Content","xyz")), BooleanClause.Occur.MUST_NOT));
This would be the equivalent of a SQL-style query whose WHERE clause contains only a negation.
Of course, this isn't really necessary in the case you've listed since:
+Content:dish -Content:xyz
Is perfectly adequate.
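The behaviour described above can be sketched with plain sets of document ids. This is a toy model, not Lucene code: the INDEX map, ALL_DOCS and evaluate are hypothetical names, and the match-all clause is treated as a required clause rather than a SHOULD clause to keep the model small. The point is only that candidates come from positive clauses, and negative clauses can only subtract.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of boolean-query evaluation over document ids; not Lucene code.
class BooleanModel {
    // Hypothetical inverted index: term -> ids of the documents containing it.
    static final Map<String, Set<Integer>> INDEX = Map.of(
            "dish", Set.of(1, 2),
            "spoon", Set.of(1),
            "xyz", Set.of(2));

    // Stand-in for what a match-all query would contribute.
    static final Set<Integer> ALL_DOCS = Set.of(1, 2, 3);

    static Set<Integer> docs(String term) {
        return INDEX.getOrDefault(term, Set.of());
    }

    // Positive clauses select candidates; negative clauses only remove from them.
    static Set<Integer> evaluate(List<Set<Integer>> must, List<Set<Integer>> mustNot) {
        if (must.isEmpty()) {
            return new TreeSet<>(); // nothing positive, so nothing is selected
        }
        Set<Integer> result = new TreeSet<>(must.get(0));
        for (Set<Integer> m : must) {
            result.retainAll(m);
        }
        for (Set<Integer> n : mustNot) {
            result.removeAll(n);
        }
        return result;
    }

    public static void main(String[] args) {
        // (-Content:xyz) on its own
        Set<Integer> negativeOnly = evaluate(List.of(), List.of(docs("xyz")));
        // +Content:dish +(-Content:xyz)
        System.out.println(evaluate(List.of(docs("dish"), negativeOnly), List.of()));
        // +Content:dish -Content:xyz
        System.out.println(evaluate(List.of(docs("dish")), List.of(docs("xyz"))));
        // match-all combined with -Content:xyz
        System.out.println(evaluate(List.of(ALL_DOCS), List.of(docs("xyz"))));
    }
}
```

In this model the nested negative subquery contributes an empty set, so intersecting it with the "dish" documents leaves nothing, while the flat form and the match-all form both return documents.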

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net, and the resources I found suggest raising the limit with BooleanQuery.SetMaxClauseCount().
This sounds fishy to me. What should I raise it to? How can I be sure that this new magic number will be sufficient for my query? How far can I increase this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.
The paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
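As a rough illustration of the expansion described above, the following plain-Java sketch counts how many indexed terms a prefix would expand to and compares that with the 1024 limit. It is a toy model, not Lucene's implementation: the class and method names are hypothetical, and the sorted term set stands in for the term dictionary of a single field.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of prefix expansion against a sorted term dictionary; not Lucene code.
class PrefixExpansion {
    // The clause limit mentioned above.
    static final int MAX_CLAUSES = 1024;

    // Every term starting with the prefix, found by a range scan over the sorted terms.
    // Assumes terms do not contain the character U+FFFF right after the prefix.
    static SortedSet<String> expand(SortedSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // True when the expanded boolean query would need more clauses than allowed.
    static boolean exceedsLimit(SortedSet<String> terms, String prefix) {
        return expand(terms, prefix).size() > MAX_CLAUSES;
    }

    public static void main(String[] args) {
        SortedSet<String> paintCodes = new TreeSet<>();
        for (int i = 0; i < 2000; i++) {
            paintCodes.add(String.format("a%04d", i)); // made-up sample codes
        }
        System.out.println(expand(paintCodes, "a").size());
        System.out.println(exceedsLimit(paintCodes, "a"));
        System.out.println(exceedsLimit(paintCodes, "a00"));
    }
}
```

The count depends on the number of distinct terms sharing the prefix, not on the number of documents, which is why a longer prefix or a dedicated first-letter field avoids the problem.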
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.
It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
At query time, the number of characters in your term decides which field to search. This means that you will perform a prefix query only for terms with more than 3 characters, which greatly decreases the number of expanded terms and prevents the infamous TooManyClauses exception. Apparently this also speeds up the search.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article
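The naming scheme from the example could be automated along these lines. This is a minimal sketch in plain Java under stated assumptions: indexFields and queryFor are hypothetical helpers, the returned map stands in for the fields you would add to a document, and the query strings are illustrative only (no escaping of special characters is attempted).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the prefix-field naming scheme; helper names are hypothetical.
class PrefixFields {

    // Builds field -> value pairs: the full value plus one field per prefix length.
    static Map<String, String> indexFields(String field, String value, int maxPrefix) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put(field, value);
        int limit = Math.min(maxPrefix, value.length());
        for (int n = 1; n <= limit; n++) {
            fields.put(field + n + "n", value.substring(0, n));
        }
        return fields;
    }

    // Short input: exact term on the matching prefix field.
    // Longer input: a real prefix query on the full field.
    static String queryFor(String field, String typed, int maxPrefix) {
        if (typed.isEmpty()) {
            throw new IllegalArgumentException("empty search term");
        }
        if (typed.length() <= maxPrefix) {
            return field + typed.length() + "n:" + typed;
        }
        return field + ":" + typed + "*";
    }

    public static void main(String[] args) {
        System.out.println(indexFields("paintCode", "a4c2d3", 3));
        System.out.println(queryFor("paintCode", "a", 3));
        System.out.println(queryFor("paintCode", "a4c2", 3));
    }
}
```

With a maximum prefix length of 3, a one-letter search becomes a single exact term on paintCode1n, so no expansion happens at all; only inputs longer than 3 characters fall back to a prefix query.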