Lucene operator precedence for boolean operators - lucene

What is the order of operations for boolean operators? Left to right? Right to left? Specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?

Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
And the NOT operator takes precendence over the AND operator, since we are discussing precedence.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer Answer
One key thing to understand is that Lucene boolean operators are not really "boolean" in the sense that you may think, based on Boolean algebra, where you use parentheses to help avoid ambiguity (or where you need to know what rules a programming language may be applying) - and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator. It:
requires that the term after the "+" symbol exist somewhere in the field
And the lack of a + operator means the default of "may" as in "may contain" - meaning the term is optional: it does not need to be present, if there is some other clause in the query which does match a document.
The use of AND forces the terms on either side of the AND to be required.
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom Line:
Use parentheses to explicitly make your intentions clear, when using AND and OR and also NOT.
Regarding NOT, since we mentioned it - that takes prescendence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.

Related

Is it acceptable to use `to` to create a `Pair`?

to is an infix function within the standard library. It can be used to create Pairs concisely:
0 to "hero"
in comparison with:
Pair(0, "hero")
Typically, it is used to initialize Maps concisely:
mapOf(0 to "hero", 1 to "one", 2 to "two")
However, there are other situations in which one needs to create a Pair. For instance:
"to be or not" to "be"
(0..10).map { it to it * it }
Is it acceptable, stylistically, to (ab)use to in this manner?
Just because some language features are provided does not mean they are better over certain things. A Pair can be used instead of to and vice versa. What becomes a real issue is that, does your code still remain simple, would it require some reader to read the previous story to understand the current one? In your last map example, it does not give a hint of what it's doing. Imagine someone reading { it to it * it}, they would be most likely confused. I would say this is an abuse.
to infix offer a nice syntactical sugar, IMHO it should be used in conjunction with a nicely named variable that tells the reader what this something to something is. For example:
val heroPair = Ironman to Spiderman //including a 'pair' in the variable name tells the story what 'to' is doing.
Or you could use scoping functions
(Ironman to Spiderman).let { heroPair -> }
I don't think there's an authoritative answer to this.  The only examples in the Kotlin docs are for creating simple constant maps with mapOf(), but there's no hint that to shouldn't be used elsewhere.
So it'll come down to a matter of personal taste…
For me, I'd be happy to use it anywhere it represents a mapping of some kind, so in a map{…} expression would seem clear to me, just as much as in a mapOf(…) list.  Though (as mentioned elsewhere) it's not often used in complex expressions, so I might use parentheses to keep the precedence clear, and/or simplify the expression so they're not needed.
Where it doesn't indicate a mapping, I'd be much more hesitant to use it.  For example, if you have a method that returns two values, it'd probably be clearer to use an explicit Pair.  (Though in that case, it'd be clearer still to define a simple data class for the return value.)
You asked for personal perspective so here is mine.
I found this syntax is a huge win for simple code, especial in reading code. Reading code with parenthesis, a lot of them, caused mental stress, imagine you have to review/read thousand lines of code a day ;(

ABNF rule `zero = ["0"] "0"` matches `00` but not `0`

I have the following ABNF grammar:
zero = ["0"] "0"
I would expect this to match the strings 0 and 00, but it only seems to match 00? Why?
repl-it demo: https://repl.it/#DanStevens/abnf-rule-zero-0-0-matches-00-but-not-0
Good question.
ABNF ("Augmented Backus Naur Form"9 is defined by RFC 5234, which is the current version of a document intended to clarify a notation used (with variations) by many RFCs.
Unfortunately, while RFC 5234 exhaustively describes the syntax of ABNF, it does not provide much in the way of a clear statement of semantics. In particular, it does not specify whether ABNF alternation is unordered (as it is in the formal language definitions of BNF) or ordered (as it is in "PEG" -- Parsing Expression Grammar -- notation). Note that optionality/repetition are just types of alternation, so if you choose one convention for alternation, you'll most likely choose it for optionality and repetition as well.
The difference is important in cases like this. If alternation is ordered, then the parser will not backup to try a different alternative after some alternative succeeds. In terms of optionality, this means that if an optional element is present in the stream, the parser will never reconsider the decision to accept the optional element, even if some subsequent element cannot be matched. If you take that view, then alternation does not distribute over concatenation. ["0"]"0" is precisely ("0"/"")"0", which is different from "00"/"0". The latter expression would match a single 0 because the second alternative would be tried after the first one failed. The former expression, which you use, will not.
I do not believe that the authors of RFC 5234 took this view, although it would have been a lot more helpful had they made that decision explicit in the document. My only real evidence to support my belief is that the ABNF included in RFC 5234 to describe ABNF itself would fail if repetition was considered ordered. In particular, the rule for repetitions:
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
cannot match 7*"0", since the 7 will be matched by the first alternative of repeat, which will be accepted as satisfying the optional [repeat] in repetition, and element will subsequently fail.
In fact, this example (or one similar to it) was reported to the IETF as an erratum in RFC 5234, and the erratum was rejected as unnecessary, because the verifier believed that the correct parse should be produced, thus providing evidence that the official view is that ABNF is not a variant of PEG. Apparently, this view is not shared by the author of the APG parser generator (who also does not appear to document their interpretation.) The suggested erratum chose roughly the same solution as you came up with:
repeat = *DIGIT ["*" *DIGIT]
although that's not strictly speaking the same; the original repeat cannot match the empty string, but the replacement one can. (Since the only use of repeat in the grammar is optional, this doesn't make any practical difference.)
(Disclosure note: I am not a fan of PEG. So it's possible the above answer is not free of bias.)

Use of math in ALFA

How to get a rule like that working:
rule adminCanViewAllExams {
condition (integerOneAndOnly(my.company.attributes.subject.rights) & 0x00000040) == 0
permit
}
Syntax highlighter complains it doesn't know those items:
& (This is a binary math operation)
0x00000040 (this is the hexadecimal representation of an integer)
EDIT
(adding OP's comment inside the question)
I want to keep as much as possible in my current application. Meaning, I don't want to change a lot in my database model. I just want to implement the PEP and PDP part new. So, currently the rights of the user are stored in a Long. Each bit in the number represents a right. To get the right we do a binary &-operation which masks the other bits in the Long. We might redesign this part, but it's still good to know how far the support for mathematic operations goes
XACML does not support bitwise logic. It can do boolean logic (AND and OR) but that's about it.
To achieve what you are looking for, you could use a Policy Information Point which would take in my.company.attributes.subject.rights and 0x00000040. It would return an attribute called allowed.
Alternatively, you can extend XACML (and ALFA) to add missing datatypes and functions. But I would recommend going for human-readable policies.

SonarLint - questions about some of the rules for VB.NET

The large majority of SonarLint rules that I've come across in Java seemed plausible and justified. However, ever since I've started using SonarLint for VB.NET, I've come across several rules that left me questioning their usefulness or even whether or not they are working correctly.
I'd like to know if this is simply a problem of me using some VB.NET constructs in a suboptimal way or whether the rule really is flawed.
(Apologies if this question is a little longer. I didn't know if I should create a separate question for each individual rule.)
The following rules I found to leave some cases unconsidered that would actually turn up as false-positives:
S1871: Two branches in the same conditional structure should not have exactly the same implementation
I found this one to bring up a lot of false-positives for me, because sometimes the order in which the conditions are checked actually does matter. Take the following pseudo code as example:
If conditionA() Then
doSomething()
ElseIf conditionB() AndAlso conditionC() Then
doSomethingElse()
ElseIf conditionD() OrElse conditionE() Then
doYetAnotherThing()
'... feel free to have even more cases in between here
Else Then
doSomething() 'Non-compliant
End If
If I wanted to follow this Sonar rule and still make the code behave the same way, I'd have to add the negated version of each ElseIf-condition to the first If-condition.
Another example would be the following switch:
Select Case i
Case 0 To 40
value = 0
Case 41 To 60
value = 1
Case 61 To 80
value = 3
Case 81 To 100
value = 5
Case Else
value = 0 'Non-compliant
There shouldn't be anything wrong with having that last case in a switch. True, I could have initialized value beforehand to 0 and ignored that last case, but then I'd have one more assignment operation than necessary. And the Java ruleset has conditioned me to always put a default case in every switch.
S1764: Identical expressions should not be used on both sides of a binary operator
This rule does not seem to take into account that some functions may return different values every time you call them, for instance collections where accessing an element removes it from the collection:
stack.Push(stack.Pop() / stack.Pop()) 'Non-compliant
I understand if this is too much of an edge case to make special exceptions for it, though.
The following rules I am not actually sure about:
S3385: "Exit" statements should not be used
While I agree that Return is more readable than Exit Sub, is it really bad to use a single Exit For to break out of a For or a For Each loop? The SonarLint rule for Java permits the use of a single break; in a loop before flagging it as an issue. Is there a reason why the default in VB.NET is more strict in that regard? Or is the rule built on the assumption that you can solve nearly all your loop problems with LINQ extension methods and lambdas?
S2374: Signed types should be preferred to unsigned ones
This rule basically states that unsigned types should not be used at all because they "have different arithmetic operators than signed ones - operators that few developers understand". In my code I am only using UInteger for ID values (because I don't need negative values and a Long would be a waste of memory in my case). They are stored in List(Of UInteger) and only ever compared to other UIntegers. Is this rule even relevant to my case (are comparisons part of these "arithmetic operators" mentioned by the rule) and what exactly would be the pitfall? And if not, wouldn't it be better to make that rule apply to arithmetic operations involving unsigned types, rather than their declaration?
S2355: Array literals should be used instead of array creation expressions
Maybe I don't know VB.NET well enough, but how exactly would I satisfy this rule in the following case where I want to create a fixed-size array where the initialization length is only known at runtime? Is this a false-positive?
Dim myObjects As Object() = New Object(someOtherList.Count - 3) {} 'Non-compliant
Sure, I could probably just use a List(Of Object). But I am curious anyway.
Thanks for raising these points. Note that not all rules apply every time. There are cases when we need to balance between false positives/false negatives/real cases. For example with identical expressions on both sides of an operator rule. Is it a bug to have the same operands? No it's not. If it was, then the compiler would report it. Is it a bad smell, is it usually a mistake? Yes in many cases. See this for example in Roslyn. Should we tune this rule to exclude some cases? Yes we should, there's nothing wrong with 2 << 2. So there's a lot of balancing that needs to happen, and we try to settle for an implementation that brings the most value for the users.
For the points you raised:
Two branches in the same conditional structure should not have exactly the same implementation
This rule generally states that having two blocks of code match exactly is a bad sign. Copy-pasted code should be avoided for many reasons, for example if you need to fix the code in one place, you'll need to fix it in the other too. You're right that adding negated conditions would be a mess, but if you extract each condition into its own method (and call the negated methods inside them) with proper names, then it would probably improves the readability of your code.
For the Select Case, again, copy pasted code is always a bad sign. In this case you could do this:
Select Case i
...
Case 0 To 40
Case Else
value = 0 ' Compliant
End Select
Or simply remove the 0-40 case.
Identical expressions should not be used on both sides of a binary operator
I think this is a corner case. See the first paragraph of the answer.
"Exit" statements should not be used
It's almost always true that by choosing another type of loop, or changing the stop condition, you can get away without using any "Exit" statements. It's good practice to have a single exit point from loops.
Signed types should be preferred to unsigned ones
This is a legacy rule from SonarQube VB.NET, and I agree with you that it shouldn't be enabled by default in SonarLint. I created the following ticket in our JIRA: https://jira.sonarsource.com/browse/SLVS-1074
Array literals should be used instead of array creation expressions
Yes, it seems to be a false positive, we shouldn't report on array creations when the size is explicitly specified. https://jira.sonarsource.com/browse/SLVS-1075

Why not have operators as both keywords and functions?

I saw this question and it got me wondering.
Ignoring the fact that pretty much all languages have to be backwards compatible, is there any reason we cannot use operators as both keywords and functions, depending on if it's immediately followed by a parenthesis? Would it make the grammar harder?
I'm thinking mostly of python, but also C-like languages.
Perl does something very similar to this, and the results are sometimes surprising. You'll find warnings about this in many Perl texts; for example, this one comes from the standard distributed Perl documentation (man perlfunc):
Any function in the list below may be used either with or without parentheses around its arguments. (The syntax descriptions omit the parentheses.) If you use parentheses, the simple but occasionally surprising rule is this: It looks like a function, therefore it is a function, and precedence doesn't matter. Otherwise it's a list operator or unary operator, and precedence does matter. Whitespace between the function and left parenthesis doesn't count, so sometimes you need to be careful:
print 1+2+4; # Prints 7.
print(1+2) + 4; # Prints 3.
print (1+2)+4; # Also prints 3!
print +(1+2)+4; # Prints 7.
print ((1+2)+4); # Prints 7.
An even more surprising case, which often bites newcomers:
print
(a % 7 == 0 || a % 7 == 1) ? "good" : "bad";
will print 0 or 1.
In short, it depends on your theory of parsing. Many people believe that parsing should be precise and predictable, even when that results in surprising parses (as in the Python example in the linked question, or even more famously, C++'s most vexing parse). Others lean towards Perl's "Do What I Mean" philosophy, even though the result -- as above -- is sometimes rather different from what the programmer actually meant.
C, C++ and Python all tend towards the "precise and predictable" philosophy, and they are unlikely to change now.
Depending on the language, not() is not defined. If not() is not defined in some language, you can not use it. Why not() is not defined in some language? Because creator of that language probably had not need this type of language construction. Because it is better to let things be simpler.