Lucene data range search

Lucene data range search - lucene

I'm using Umbraco v7.2 for a site, and have run into a highly entertaining issue trying to search for things using the External Searcher by a date of ranges.
If I perform a Lucene search using the examine management search tools in the backoffice, I get results using this query:
{(+__NodeTypeAlias:bookingperiod)} AND startDate:2016-03-01T00\:00\:00
Subsequently, I KNOW that I can get results that include this date in a range. However, what's highly entertaining, quite puzzling and really rather frustrating, is that if I use a range query, I get no results. Here's the syntax:
{(+__NodeTypeAlias:bookingperiod)} AND +(startDate:[2016-02-28T00:00:00 TO 2016-03-20T00:00:00])
Now, in the interests of clarity, I've tried escaping the colon characters in the dates, the dashes in the dates and both, but it makes no difference at all. Can anyone explain to me where I'm going wrong?
Thanks!

I ran into this issue a while back, not sure why though, but changing to the format :"yyyyMMddHHmmss" helped, might be something with the parser.
So the query becomes:
+__NodeTypeAlias:bookingperiod AND +startDate:[20160228000000 TO 20160320000000]

Related

How to set permissions efficiently

I have a list of Exchanges (extract shown below) and I want to give a user WRITE permissions to all but those containing Invoice in the name. Here an extract of the list of Exchanges as an example:
E.BankAccount.Commands
E.BankAccount.Commands.Confirmations
E.BankAccount.Commands.Errors
E.BankAccount.Events
E.Customer.Commands
E.Customer.Commands.Confirmations
E.Customer.Commands.Errors
E.Customer.Events
E.Invoice.Commands
E.Invoice.Commands.Confirmations
E.Invoice.Commands.Errors
E.Payment.Commands
E.Payment.Commands.Confirmations
E.Payment.Commands.Errors
E.Payment.EventsEvents.test
So I can write a regex like this:
E\.(BankAccount|Customer|Payment)\.[a-zA-Z.]+
This works but since I have more than 100 different types it is a lot to type and almost impossible to maintain. Better would be to have a regex which excludes invoice. I tried a couple of things but without any look, e.g.
E\.([a-zA-Z.]+|?!Invoice)\.[a-zA-Z.]+ or
E\.([a-zA-Z.]+|?!(Invoice))\.[a-zA-Z.]+
But that is all syntactically wrong. Another strategy could be to reverse the result of the expression e.g. use
E\.(Invoice)\.[a-zA-Z.]+
and then somehow reverse the entire result but I could not figure out how!
I am not using regex very often so I can't find any better solution then the first one which is impracticable in my production environment.
Does anybody have more experience and a solution to that problem?? Any help would be greatly appreciated!

IGNORE CASE query problems saving to a table and using Allow large results

I need case insensitivity in my queries so I found IGNORE CASE which works superbly when used in queries that target the browser (I am talking about BQ web UI). If I choose a destination table (an absolute must for me) and select Allow Large Results (with unchecked Flatten Results) then I get a cryptic error like this:
Error: unexpected LIMIT clause at: 2.200 - 2.206
Even though this Official Google BigQuery issue and feature request tracker post seems to speak of the same issue and even though the problem seems to have been acknowledged back in Jan 2015 the solution isn't apparent.
I could potentially use a bunch of temp tables with lowercased search columns as a workaround but that sounds awfully difficult with the number of tables and columns that I have and the complex queries that I intend to run.
Any other possible workarounds? Why isn't this working yet on BQ?

Yes, it is a known problem, and it has not been neglected. The code changes to fix it are (surprisingly) not trivial, but they are mostly done. Not team is carefully looking how to enable and deploy them. I cannot give you a timeline, but the fix to this problem is coming.
The only workarounds in the meantime, are to wrap all the string comparisons, string GROUP BYs and string ORDER BYs with conversion to LOWER() (or UPPER()) of operands.

Lucene Boolean Query on Not ANalyzed Fields

Using RavenDB to do a query on Lucene Index.
This query parses okay:
X:[[a]] AND Y:[[b]] AND Z:[[c]]
However this query gives me a parse exception:
X:[[a]] AND Y:[[b]] AND Z:[[c]] AND P:[[d]]
"Lucene.Net.QueryParsers.ParseException: Cannot parse '( AND )': Encountered \" \"AND"
I tried this on complexed index and simple reproduce cases and same result it seems once you go past three ands it blows up. Im using [[]] and not analyzed because i want exact matches (also sometimes values contain whitespace etc..) and from RavenDB I have veyr little control over the indexing.
Im wondering how I can rewrite the query to avoid the parse exception?

This is now fixed in the latest RavenDB builds. See this thread for more info.

This looks rather like a bug in Lucene's QueryParser, perhaps try reporting this in the user mailing list.
As a bypass, you could create a BooleanQuery manually and add the terms you want yourself. Since they are not analyzed, and the query doesn't look too complicated, you may be better off without the query-parser.

Why are my Lucene Document results empty?

I'm running a simple test--trying to index something and then search for it. I index a simple document, but then when a search for a string in it, I get back what looks to be an empty document (it has no fields). Lucene seems to be doing something, because if I search for a word that's not in the document, it returns 0 results.
Any reason why Lucene would reliably return a document when it finds one that matches the given query, and yet that document has nothing in it?
More details:
I'm actually running Lucandra (Lucene + Cassandra). That certainly may be a relevant detail, but not sure.
The fields are set to Field.Store/YES and Field.Index/ANALYZED
Interestingly, I'm able to get this to work just fine on my local machine, but when we put it on our main server (which is a multi-node cassandra setup), I get the behavior described above. So this seems like probably the relevant detail, but unfortunately, I see no error message to clue me in to what specifically is causing it.

Unsure if this will work with Lucandra, but you have tried opening the index using Luke? Viewing the index contents with Luke might help

It's hard to tell what the problem is since you only provide a very abstract description. However, it sounds a bit like you are not storing the field value in the index. There are different modes for indexing a field. One option determines whether the original value is stored in the index to retrieve it later:
http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/document/Field.Store.html
See also the description of the enclosing class Field

Read: http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/

Prevent "Too Many Clauses" on lucene query

In my tests I suddenly bumped into a Too Many Clauses exception when trying to get the hits from a boolean query that consisted of a termquery and a wildcard query.
I searched around the net and on the found resources they suggest to increase the BooleanQuery.SetMaxClauseCount().
This sounds fishy to me.. To what should I up it? How can I rely that this new magic number will be sufficient for my query? How far can I increment this number before all hell breaks loose?
In general I feel this is not a solution. There must be a deeper problem..
The query was +{+companyName:mercedes +paintCode:a*} and the index has ~2.5M documents.

the paintCode:a* part of the query is a prefix query for any paintCode beginning with an "a". Is that what you're aiming for?
Lucene expands prefix queries into a boolean query containing all the possible terms that match the prefix. In your case, apparently there are more than 1024 possible paintCodes that begin with an "a".
If it sounds to you like prefix queries are useless, you're not far from the truth.
I would suggest you change your indexing scheme to avoid using a Prefix Query. I'm not sure what you're trying to accomplish with your example, but if you want to search for paint codes by first letter, make a paintCodeFirstLetter field and search by that field.
ADDED
If you're desperate, and are willing to accept partial results, you can build your own Lucene version from source. You need to make changes to the files PrefixQuery.java and MultiTermQuery.java, both under org/apache/lucene/search. In the rewrite method of both classes, change the line
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
to
try {
query.add(tq, BooleanClause.Occur.SHOULD); // add to query
} catch (TooManyClauses e) {
break;
}
I did this for my own project and it works.
If you really don't like the idea of changing Lucene, you could write your own PrefixQuery variant and your own QueryParser, but I don't think it's much better.

It seems like you are using this on a field that is sort of a Keyword type (meaning there will not be multiple tokens in your data source field).
There is a suggestion here that seems pretty elegant to me: http://grokbase.com/t/lucene.apache.org/java-user/2007/11/substring-indexing-to-avoid-toomanyclauses-exception/12f7s7kzp2emktbn66tdmfpcxfya
The basic idea is to break down your term into multiple fields with increasing length until you are pretty sure you will not hit the clause limit.
Example:
Imagine a paintCode like this:
"a4c2d3"
When indexing this value, you create the following field values in your document:
[paintCode]: "a4c2d3"
[paintCode1n]: "a"
[paintCode2n]: "a4"
[paintCode3n]: "a4c"
By the time you query, the number of characters in your term decide which field to search on. This means that you will perform a prefix query only for terms with more of 3 characters, which greatly decreases the internal result count, preventing the infamous TooManyBooleanClausesException. Apparently this also speeds up the searching process.
You can easily automate a process that breaks down the terms automatically and fills the documents with values according to a name scheme during indexing.
Some issues may arise if you have multiple tokens for each field. You can find more details in the article

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Lucene data range search - lucene

I ran into this issue a while back, not sure why though, but changing to the format :"yyyyMMddHHmmss" helped, might be something with the parser. So the query becomes: +__NodeTypeAlias:bookingperiod AND +startDate:[20160228000000 TO 20160320000000]

Related

How to set permissions efficiently

IGNORE CASE query problems saving to a table and using Allow large results

Lucene Boolean Query on Not ANalyzed Fields

Why are my Lucene Document results empty?

Prevent "Too Many Clauses" on lucene query

Categories

Resources