ABNF rule `zero = ["0"] "0"` matches `00` but not `0` - grammar

I have the following ABNF grammar:
zero = ["0"] "0"
I would expect this to match the strings 0 and 00, but it only seems to match 00? Why?
repl-it demo: https://repl.it/#DanStevens/abnf-rule-zero-0-0-matches-00-but-not-0

Good question.
ABNF ("Augmented Backus Naur Form"9 is defined by RFC 5234, which is the current version of a document intended to clarify a notation used (with variations) by many RFCs.
Unfortunately, while RFC 5234 exhaustively describes the syntax of ABNF, it does not provide much in the way of a clear statement of semantics. In particular, it does not specify whether ABNF alternation is unordered (as it is in the formal language definitions of BNF) or ordered (as it is in "PEG" -- Parsing Expression Grammar -- notation). Note that optionality/repetition are just types of alternation, so if you choose one convention for alternation, you'll most likely choose it for optionality and repetition as well.
The difference is important in cases like this. If alternation is ordered, then the parser will not back up to try a different alternative after some alternative succeeds. In terms of optionality, this means that if an optional element is present in the stream, the parser will never reconsider the decision to accept the optional element, even if some subsequent element cannot be matched. On that view, alternation does not distribute over concatenation: ["0"]"0" is precisely ("0"/"")"0", which is different from "00"/"0". The latter expression would match a single 0, because the second alternative would be tried after the first one failed; the former expression, which you use, will not.
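To make the two conventions concrete, here is a minimal sketch in plain Python (invented helper names, not the repl.it code) of how each one handles ["0"]"0":

def peg_match(s):
    # Ordered/PEG-style: ["0"] greedily consumes a leading "0" if present,
    # and that decision is never revisited.
    i = 0
    if s[i:i+1] == "0":        # ["0"] succeeds; commit
        i += 1
    return s[i:] == "0"        # the mandatory "0" must consume the rest

def backtracking_match(s):
    # Unordered/BNF-style: try ["0"] present, then absent, until the
    # rest of the rule also succeeds.
    for prefix in ("0", ""):   # the two alternatives of ["0"]
        if s.startswith(prefix) and s[len(prefix):] == "0":
            return True
    return False

print(peg_match("0"), backtracking_match("0"))    # False True
print(peg_match("00"), backtracking_match("00"))  # True True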
I do not believe that the authors of RFC 5234 took this view, although it would have been a lot more helpful had they made that decision explicit in the document. My only real evidence to support my belief is that the ABNF included in RFC 5234 to describe ABNF itself would fail if repetition were considered ordered. In particular, the rule for repetitions:
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
cannot match 7*"0", since the 7 will be matched by the first alternative of repeat, which will be accepted as satisfying the optional [repeat] in repetition, and element will subsequently fail.
In fact, this example (or one similar to it) was reported to the IETF as an erratum in RFC 5234, and the erratum was rejected as unnecessary, because the verifier believed that the correct parse should be produced; this provides evidence that the official view is that ABNF is not a variant of PEG. Apparently, this view is not shared by the author of the APG parser generator (who also does not appear to document their interpretation). The suggested erratum chose roughly the same solution as you came up with:
repeat = *DIGIT ["*" *DIGIT]
although that's not strictly speaking the same; the original repeat cannot match the empty string, but the replacement one can. (Since the only use of repeat in the grammar is optional, this doesn't make any practical difference.)
(Disclosure note: I am not a fan of PEG. So it's possible the above answer is not free of bias.)

Lucene operator precedence for boolean operators

What is the order of operations for boolean operators? Left to right? Right to left? Specific operators have higher priority?
For example, if I search for:
jakarta OR apache AND website
What do I get? Is it
Anything with "jakarta" as well as anything with both "apache" and "website"?
Anything with "website" that also has either "jakarta" or "apache"?
Something else?
Short answer:
In Lucene, the AND operator takes precedence over the OR operator. So, you are effectively doing this:
jakarta OR (apache AND website)
You can verify this for yourself by parsing your query string and seeing how it converts AND and OR to the "required" and "optional" operators.
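For instance, a minimal sketch using Lucene's classic QueryParser (the field name "contents" and the StandardAnalyzer are just illustrative choices):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class PrecedenceDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("jakarta OR apache AND website");
        // Prints the rewritten clauses (e.g. "jakarta +apache +website"),
        // showing that AND bound more tightly than OR.
        System.out.println(query.toString("contents"));
    }
}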
And since we are discussing precedence: the NOT operator takes precedence over the AND operator.
But you need to be very careful when dealing with Lucene's so-called "boolean" operators, as they do not behave the way you may expect based on their collective name ("boolean").
(Unfortunately I have never seen any official documentation which provides a citation for these precedence rules - but instead I am relying on empirical observations. See below for more about that. If the documentation for this does exist, that would be great to see.)
Longer Answer
One key thing to understand is that Lucene's boolean operators are not really "boolean" in the sense you may expect from Boolean algebra, where parentheses remove ambiguity (or where you need to know which rules a programming language applies) and where everything evaluates to TRUE or FALSE.
Lucene boolean operators serve a subtly different purpose.
They are not purely concerned with TRUE/FALSE inclusion/exclusion, but also concerned with how to score results so that the more relevant results have higher scores than less relevant results.
The Lucene query jakarta OR apache AND website is equivalent to the following:
jakarta +apache +website
This means the document's field must contain apache and website, but may also include jakarta (for a higher relevance score).
You can see this for yourself by taking your original query string and parsing it:
Query query = parser.parse(queryString);
...and then printing the resulting string representation of the query. The + operator is the "required" operator: it requires that the term after the "+" symbol exist somewhere in the field.
The lack of a + operator means the default of "may", as in "may contain": the term is optional, and does not need to be present as long as some other clause in the query matches the document.
The use of AND forces the terms on either side of the AND to be required.
You can encounter some potentially surprising situations.
Consider this:
foo AND bar OR baz AND bat
This parses to the following:
+foo +bar +baz +bat
This is because the AND operators are transformed to + operators for every term, rendering the OR redundant.
It's the same result as if you had written this:
foo AND bar AND baz AND bat
But not the same as this:
(foo AND bar) OR (baz AND bat)
which is parsed to this, where the parentheses are retained:
(+foo +bar) (+baz +bat)
Bottom Line:
Use parentheses to make your intentions explicit whenever you combine AND, OR, and NOT.
Regarding NOT, since we mentioned it: it takes precedence over AND.
The query:
foo AND bar NOT baz AND bat
Is parsed as:
+foo +bar -baz +bat
So, a document field must contain foo, bar and bat - and must not contain baz.
Why does this situation exist?
I don't know, but I think Lucene originally did not include AND, OR and NOT - but instead used + (must include), - (must not include) and "nothing" (may include). The so-called boolean operators AND, OR, NOT were added later on, as a kind of "syntactic sugar" for these original operators - introduced for people who were more familiar with AND, OR and NOT from other contexts. I'm basing this on the following thread:
Getting a Better Understanding of Lucene's Search Operators
A summary of that thread is included in this answer about the NOT operator.

Is it acceptable to use `to` to create a `Pair`?

to is an infix function within the standard library. It can be used to create Pairs concisely:
0 to "hero"
in comparison with:
Pair(0, "hero")
Typically, it is used to initialize Maps concisely:
mapOf(0 to "hero", 1 to "one", 2 to "two")
However, there are other situations in which one needs to create a Pair. For instance:
"to be or not" to "be"
(0..10).map { it to it * it }
Is it acceptable, stylistically, to (ab)use to in this manner?
Just because a language feature is provided does not mean it is always the better choice. A Pair can be used instead of to and vice versa. The real issue is whether your code remains simple: would a reader need to read the previous story to understand the current one? Your last map example gives no hint of what it's doing. Imagine someone reading { it to it * it }; they would most likely be confused. I would say this is an abuse.
The to infix function offers nice syntactic sugar. IMHO it should be used in conjunction with a well-named variable that tells the reader what this something-to-something is. For example:
val heroPair = Ironman to Spiderman //including a 'pair' in the variable name tells the story what 'to' is doing.
Or you could use scoping functions
(Ironman to Spiderman).let { heroPair -> }
I don't think there's an authoritative answer to this.  The only examples in the Kotlin docs are for creating simple constant maps with mapOf(), but there's no hint that to shouldn't be used elsewhere.
So it'll come down to a matter of personal taste…
For me, I'd be happy to use it anywhere it represents a mapping of some kind, so in a map{…} expression would seem clear to me, just as much as in a mapOf(…) list.  Though (as mentioned elsewhere) it's not often used in complex expressions, so I might use parentheses to keep the precedence clear, and/or simplify the expression so they're not needed.
Where it doesn't indicate a mapping, I'd be much more hesitant to use it.  For example, if you have a method that returns two values, it'd probably be clearer to use an explicit Pair.  (Though in that case, it'd be clearer still to define a simple data class for the return value.)
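For instance, a minimal sketch (the parseDimensions names and the "800x600" format are invented for illustration):

fun parseDimensionsPair(s: String): Pair<Int, Int> {
    val (w, h) = s.split("x")
    return w.toInt() to h.toInt()   // `to` reads oddly here: this is not a mapping
}

data class Dimensions(val width: Int, val height: Int)

fun parseDimensions(s: String): Dimensions {
    val (w, h) = s.split("x")
    return Dimensions(w.toInt(), h.toInt())   // the names carry the meaning
}

fun main() {
    val (first, second) = parseDimensionsPair("800x600")   // which is which?
    val dims = parseDimensions("800x600")
    println("$first x $second")                 // 800 x 600
    println("${dims.width} x ${dims.height}")   // 800 x 600
}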
You asked for personal perspective so here is mine.
I find this syntax a huge win for simple code, especially when reading it. Code with a lot of parentheses causes mental stress; imagine having to review/read a thousand lines of code a day ;(

Where is contains(Junction) defined?

This code works:
(3,6...66).contains( 9|21 ).say # OUTPUT: «any(True, True)␤»
And returns a Junction. It's also tested, but not documented.
The problem is I can't find its implementation anywhere. The Str code, which is also called from Cool, never returns a Junction (it does not take a Junction, either). There are no other contains methods in the source.
Since it's autothreaded, it's probably specially defined somewhere. I have no idea where, though. Any help?
TL;DR Junction autothreading is handled by a single central mechanism. I have a go at explaining it below.
(The body of your question starts with you falling into a trap, one I think you documented a year or two back. It seems pretty irrelevant to what you're really asking but I cover that too.)
How junctions get handled
Where is contains(Junction) defined? ... The problem is I can't find [the Junctional] implementation anywhere. ... Since it's autothreaded, it's probably specially defined somewhere.
Yes. There's a generic mechanism that automatically applies autothreading to all P6 routines (methods, operators etc.) that don't have signatures that explicitly control what happens with Junction arguments.
Only a tiny handful of built in routines have these explicit Junction handling signatures -- print is perhaps the most notable. The same is true of user defined routines.
.contains does not have any special handling. So it is handled automatically by the generic mechanism.
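For illustration, a minimal sketch with invented routine names (the exact output formatting may vary):

# No Junction-aware signature: the generic mechanism autothreads,
# calling the sub once per eigenstate and re-junctioning the results.
sub double($n) { $n * 2 }
say double(9|21);            # any(18, 42)

# A signature that explicitly accepts a Junction: no autothreading;
# the junction arrives intact as a single argument (as with print).
sub show(Junction $j) { $j.raku }
say show(9|21);              # any(9, 21)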
Perhaps the section The magic of Junctions of my answer to an earlier SO question, Filtering elements matching two regexes, will be helpful as a high-level description of the low-level details that follow below. Just substitute your 9|21 for the foo & bar in that answer, and your .contains for the grep, and it hopefully makes sense.
Spelunking the code
I'll focus on methods. Other routines are handled in a similar fashion.
method AUTOTHREAD does the work for full P6 methods.
This is setup in this code that sets up handling for both nqp and full P6 code.
The above linked P6 setup code in turn calls setup_junction_fallback.
When a method call occurs in a user's program, it involves calling find_method (modulo cache hits as explained in the comment above that code; note that the use of the word "fallback" in that comment is about a cache miss -- which is technically unrelated to the other fallback mechanisms evident in this code we're spelunking thru).
The bit of code near the end of this find_method handles (non-cache-miss) fallbacks.
Which arrives at find_method_fallback which starts off with the actual junction handling stuff.
A trap
This code works:
(3,6...66).contains( 9|21 ).say # OUTPUT: «any(True, True)␤»
It "works" to the degree this does too:
(3,6...66).contains( 2 | '9 1' ).say # OUTPUT: «any(True, True)␤»
See Lists become strings, so beware .contains() and/or discussion of the underlying issues such as pmichaud's comment.
Routines like print, put, infix ~, and .contains are string routines. That means they coerce their arguments to Str. By default the .Str coercion of a listy value is its elements separated by spaces:
put 3,6...18; # 3 6 9 12 15 18
put (3,6...18).contains: '9 1'; # True
It's also tested
Presumably you mean the two tests with a *.contains argument passed to classify:
my $m := @l.classify: *.contains: any 'a'..'f';
my $s := classify *.contains( any 'a'..'f'), @l;
Routines like classify are list routines. While some list routines do a single operation on their list argument/invocant, eg push, most of them, including classify, iterate over their list doing something with/to each element within the list.
Given a sequence invocant/argument, classify will iterate it and pass each element to the test, in this case a *.contains.
The latter will then coerce individual elements to Str. This is a fundamental difference compared to your example which coerces a sequence to Str in one go.
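A minimal sketch of that difference:

# The whole sequence coerced to Str in one go:
say (3,6...18).Str;           # 3 6 9 12 15 18
# Element-wise coercion, as classify's test effectively does:
say (3,6...18).map(*.Str);    # (3 6 9 12 15 18) -- six separate strings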

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (e.g. recursive descent, packrat, PEG, and probably various others; Coco/R is one reasonable package for this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to the new generator), I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and has likely introduced other subtle issues I've not yet come across.
Your problem does not actually require shorter tokens to be preferred (in that case MONTH would never be matched). What you need is backtracking behaviour that depends on whether the rest of the text can be matched. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However other parser generators are specialized on problems like yours. Packrat Parsers (PEG) are backtracking and allow tokenization on the fly. Try out parboiled for this purpose.
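To make the backtracking idea concrete, here is a minimal sketch in plain Python (a toy token table and a toy validity rule, not parboiled's API and not UCUM's real grammar):

TOKENS = {"m": "MILLI_OR_METRE", "osm": "OSMOLE", "mo": "MONTH", "s": "SECOND"}

def tokenizations(text, pos=0):
    # Enumerate every possible tokenization, backtracking at dead ends.
    if pos == len(text):
        yield []
        return
    for lexeme, name in TOKENS.items():
        if text.startswith(lexeme, pos):
            for rest in tokenizations(text, pos + len(lexeme)):
                yield [name] + rest

def parser_accepts(tokens):
    # Toy "grammar": one unit token, optionally preceded by a metric prefix.
    return tokens in (["OSMOLE"], ["MILLI_OR_METRE", "OSMOLE"])

print([t for t in tokenizations("mosm") if parser_accepts(t)])
# [['MILLI_OR_METRE', 'OSMOLE']] -- MONTH SECOND MILLI_OR_METRE is rejected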
It appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

RFC 6570 URL Templates : the role of / vs. other prefixes

I recently read some of: https://www.rfc-editor.org/rfc/rfc6570#section-1
And I found the following URL template examples:
Given:
var="value";
x=1024;
path=/foo/bar;
{/var,x}/here    expands to    /value/1024/here
{#path,x}/here   expands to    #/foo/bar,1024/here
These seem contradictory.
In the first one, it appears that the / replaces the ,.
In the second one, it appears that the , is kept.
Thus, I'm wondering whether there are inconsistencies in this particular RFC. I'm new to these RFCs, so maybe I don't fully understand the culture behind how they develop.
There's no contradiction in those two examples. They illustrate the point that the rules for expanding an expression whose first character is / are different from the rules for expanding an expression whose first character is #. These alternative expansion rules are pretty much the entire point of having a variety of different magic leading characters -- which are called operators in the RFC.
The expression with the leading / is expanded according to a rule that says "each variable in the expression is replaced by its value, preceded by a / character". (I'm paraphrasing the real rule, which is described in section 3.2.6 of that RFC.) The expression with the leading # is expanded according to a rule that says "each variable in the expression is replaced by its value, with the first variable preceded by a # and subsequent variables preceded by a ,". (Again paraphrased; see section 3.2.4 for the real rule.)
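To see that the two rules really do produce both sample expansions, here is a minimal sketch of just the two paraphrased rules in Python (not a full RFC 6570 implementation; percent-encoding and value modifiers are ignored):

variables = {"var": "value", "x": "1024", "path": "/foo/bar"}

def expand_path(varnames):      # {/a,b}: each value preceded by "/"
    return "".join("/" + variables[v] for v in varnames)

def expand_fragment(varnames):  # {#a,b}: "#" once, then ","-joined values
    return "#" + ",".join(variables[v] for v in varnames)

print(expand_path(["var", "x"]) + "/here")       # /value/1024/here
print(expand_fragment(["path", "x"]) + "/here")  # #/foo/bar,1024/here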