Regex/token/rule to match nested curly braces?

I need to match the values of key = value pairs in BibTeX files. The values are delimited by braces and can contain arbitrarily nested braces. I've got as far as matching braces nested at most two deep, like {some {stuff} like {this}}, with the kludgey:
token brace-value {
    '{' <-[{}]>* [ '{' <-[}]>* '}' <-[{}]>* ]* '}'
}
I shudder at the idea of going one level further down... but proper parsing of my BibTeX stuff needs at least three levels deep.
Yes, I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile. My *.bib files are rather tame (and I wouldn't mind handling a few stray entries by hand); the problem is that I have a lot of them, with much overlap. But some of the "same" entries have different keys, or extra data. I want to consolidate them into a few master files (the whole idea behind BibTeX, right?). Not fun by hand when bibtool gives a file with no duplicates (ha!) of some 20 thousand lines...

After perusing Lenz's "Parsing with Perl 6 Regexes and Grammars" (Apress, 2017), I realized the "regex" machinery (based on backtracking) might actually be a lot more capable than officially admitted, as a regex can call another, and nowhere do I see a prohibition on recursive calls.
Before digging in, a bit of context-free grammar: a way to describe nested braces (and nothing else) is with the grammar:
S -> { S } S | <nothing>
I.e., nested braces are either an opening brace, nested braces, a closing brace, more nested braces; or nothing whatsoever. This translates more or less directly to Raku (there is no empty regex, so fake it by making the construction optional):
my regex nb {
    [ '{' <nb> '}' <nb> ]?
}
Lo and behold, this works. It still needs fixing up to avoid captures, kill backtracking (if it doesn't match on the first try, it won't ever match), and to decorate it with "anything else" fillers.
my regex nested-braces {
    :ratchet
    <-[{}]>*
    [ '{' <.nested-braces> '}' <.nested-braces> ]?
    <-[{}]>*
};
This checks out with my test cases.
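For example, using the regex just defined on a test string of my own:
say 'a {b {c} d {e {f}}} g' ~~ /^ <&nested-braces> $/;
# 「a {b {c} d {e {f}}} g」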
For not-so-adventurous souls, there is the Text::Balanced module for Perl (formerly known as Perl 5), callable from Raku using Inline::Perl5. Not directly useful to me inside a grammar, unfortunately.

Solution
A way to describe nested braces (and nothing else)
Presuming a rule named &R, I'd likely write the following pattern if I was writing a quick small one-off script:
\{ <&R>* \}
If I was writing a larger program that should be maintainable, I'd likely be writing a grammar and, using a rule named R, the pattern would be:
'{' ~ '}' <R>*
The latter avoids leaning-toothpick syndrome and uses the regex ~ operator.
These will both parse arbitrarily deeply nested paired braces, eg:
say '{{{{}}}}' ~~ token { \{ <&?ROUTINE>* \} } # 「{{{{}}}}」
(&?ROUTINE refers to the routine in which it appears. A regex is a routine. Though you can't use <&?ROUTINE> in a regex declared with / ... / syntax.)
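The grammar-flavoured version can be checked the same way; a minimal sketch (the grammar and rule names are mine):
grammar Braces {
    token TOP { <R> }
    token R   { '{' ~ '}' <R>* }
}
say so Braces.parse('{{}{{}}}');  # True
say so Braces.parse('{{}{{}}');   # False (unbalanced)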
regex vs token
kill backtracking
my regex nested-braces {
    :ratchet
The only difference between patterns declared with regex and token is that the former turns ratcheting off. So using it and then immediately turning ratcheting on is notably unidiomatic. Instead:
my token nested-braces {
Backtracking
the "regex" machinery (based on backtracking)
The grammar/regex engine does include backtracking as an optional feature because that's occasionally exactly what one wants.
But the engine is not "based on backtracking", and many grammars/parsers make little or no use of backtracking.
Recursion
a regex can call another, and nowhere do I see a prohibition on recursive calls.
This alone is nothing special for contemporary regex engines.
PCRE has supported recursion since 2000, and named regexes since 2003. Perl's default regex engine has supported both since 2007.
Their support for deeper levels of recursion and more named regexes being stored at once has been increasing over time.
Damian Conway's PPR uses these features of regexes to build non-trivial (but still small) parse trees.
Capabilities
a lot more capable
Raku "regexes" can be viewed as a cleaned up take on the unfolding regex evolution. To the degree this helps someone understand them, great.
But really, it's a whole new deal. For example, they're turing complete, in a sensible way, and thus able to parse anything.
than officially admitted
Well that's an odd thing to say! Raku's Grammars are frequently touted as one of Raku's most innovative features.
There are three major caveats:
Performance: The primary current caveat is that a well-written C parser will blow the socks off a well-written Raku Grammar-based parser.
Payoff: It's often not worth the effort it takes to write a fully correct parser for a non-trivial format if there's an existing parser.
Left recursion: Raku does not automatically rewrite left recursion (infinite loops).
Using existing parsers
I know there are BibTeX parsers around, but I need to grab the complete entry for further processing, and peek at a few keys meanwhile.
Using a foreign module in Raku can be a bit of a revelation. It is not necessarily like anything you'll have experienced before. Raku's foreign language adaptors can do smart marshaling for you so it can be like you're using native Raku features.
Two of the available foreign language adaptors are already sufficiently polished to be amazing -- the ones for Perl and for C.
I'm pretty sure there's a BibTeX package for Perl that wraps a C BibTeX parser. If you used that you'd hopefully get parsing results all nicely wrapped up into Raku objects as if it was all Raku in the first place, but retaining much of the high performance of the C code.
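If it's the one I'm thinking of, it's Text::BibTeX, a wrapper around the btparse C library. A minimal untested sketch, assuming Inline::Perl5 and Text::BibTeX are both installed (the file name is made up):
use Text::BibTeX:from<Perl5>;

my $file = Text::BibTeX::File.new('master.bib');

# Entry.new reads the next entry from the file, returning a false
# value once the file is exhausted (per the Text::BibTeX synopsis)
while Text::BibTeX::Entry.new($file) -> $entry {
    next unless $entry.parse_ok;
    say $entry.key, ': ', $entry.get('title');
}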
A Raku BibTeX Grammar?
Perhaps your needs do call for creating and using a small Raku Grammar.
(Maybe you're doing this partly as an exercise to familiarize yourself with Raku, or the regex/grammar aspect of Raku. For that it sounds pretty ideal.)
As soon as you begin to use multiple regexes together -- even just two -- you are closing in on grammar territory. After all, they're just an easy-to-use construct for using multiple regexes together.
So if you decide you want to stick with writing parsing code in Raku, expect to write it something like this:
grammar BiBTeX {
    token TOP { ... }
    token ...
    token ...
}
BiBTeX.parse: my-bib-file
For more details, see the official doc's Grammar tutorial or read Moritz's book.
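To make that skeleton slightly less abstract, here's a rough sketch of my own -- not a validated BibTeX grammar; the rule names and the simplified entry/field structure are assumptions:
grammar BibTeX {
    token TOP   { [ <entry> | <-[@]>+ ]* }
    token entry { '@' <type> \s* '{' \s* <key> \s* ',' <field>* %% [ ',' \s* ] \s* '}' }
    token type  { \w+ }
    token key   { <-[,\s]>+ }
    token field { \s* <ident> \s* '=' \s* <value> \s* }
    token value { <nested-braces> | \d+ | '"' <-["]>* '"' }

    # balanced braces with arbitrary filler, as discussed above
    token nested-braces { '{' ~ '}' [ <-[{}]>+ || <.nested-braces> ]* }
}

my $match = BibTeX.parse: q:to/END/;
    @article{knuth84,
      author = {Donald E. Knuth},
      title  = {Literate {Programming}}
    }
    END
say $match<entry>[0]<key>;  # 「knuth84」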

OK, just (re)checked. The documentation of '{' ~ '}' leaves a whole lot to be desired; in particular, it isn't made clear that ~ only sets up goal matching (for better error reporting), so correctly nested delimiters still need the recursive call.
So my final solution is really just along the lines of:
my token nested-braces {
    '{' ~ '}' [ <-[{}]>+ || <.nested-braces> ]*
}
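A quick check (the test string is mine):
say 'pre {a {b {c} d} e} post' ~~ /<&nested-braces>/;
# 「{a {b {c} d} e}」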
Thanks everyone! Learned quite a bit today.

Related

How do I resolve this ANTLR ambiguity?

I have a 4000-line text file which parses slowly, taking perhaps 3 minutes. I am running the IntelliJ ANTLR plugin. Looking at the profiler, the time consumed by words_and_trash is the largest of all rules, by a factor of 15 or so. That's OK; the file is full of things I actually don't care about (hence 'trash'). However, the profiler says words_and_trash is ambiguous, and I don't know why. Here are the productions in question (there are many others, of course):
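// Productions as reconstructed from the answer below;
// the originals may have differed in detail.
words_and_trash : so_much_trash
                | so_much_trash words_and_trash
                ;

so_much_trash : word
              | trash
              | OPEN_PAREN words_and_trash CLOSE_PAREN
              ;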
I have no idea why this is ambiguous. The parser isn't complaining about so_much_trash and I don't think word, trash, and OPEN_PAREN overlap.
What's my strategy for solving this ambiguity?
It's ambiguous because, given your two alternatives for words_and_trash, anything that matches the first alternative could also match the second alternative (that's the definition of ambiguity in this context).
It appears you might be using a technique common in other grammar tools to handle repetition. ANTLR can do this like so:
words_and_trash : so_much_trash+ ;

so_much_trash : word
              | trash
              | OPEN_PAREN words_and_trash CLOSE_PAREN
              ;
You might also find the following video useful: ANTLR4 Intellij Plugin -- Parser Preview, Parse Tree, and Profiling. It's by the author of ANTLR and covers ambiguities.

Raku: how do I make an argument optional, have a default, with a where test?

Can't find a way to get this to work:
sub triple(Str:D $mod where * ~~ any @modifiers = 'command' ) { }
If I don't pass in an argument, I get an error:
Too few positionals passed; expected 1 argument but got 0
With a question mark after $mod:
sub triple(Str:D $mod? where * ~~ any @modifiers = 'command' ) { }
I get:
Constraint type check failed in binding to parameter '$mod'; expected anonymous constraint to be met but got Str (Str)
Looks like it may have been a precedence problem. This works:
sub triple(Str:D $mod? where (* ~~ any @modifiers) = 'command' ) {}
TL;DR You've identified the problem in your answer -- precedence -- and provided a solution. This answer covers what happened; why the precedence issue arises; why Raku's grammar/parser didn't just get it right; and lists some solutions, a couple of which I'll start with.
Instead of:
sub triple(Str:D $mod? where * ~~ any @modifiers = 'command' ) { }
I suggest moving the any and writing one of these:
sub triple(Str:D $mod? where * ~~ @modifiers.any = 'command' ) { }
sub triple(Str:D $mod? where @modifiers.any = 'command' ) { }
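For example, assuming a @modifiers declared along these lines (the values are made up):
my @modifiers = <command option setting>;

sub triple(Str:D $mod? where * ~~ @modifiers.any = 'command') { say $mod }

triple;           # command  (the default now applies)
triple 'option';  # option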
What happened
The = ... at the end of the where clause is parsed as an assignment (to @modifiers) instead of as a default value (for $mod):
@modifiers = 'command' is evaluated, overwriting whatever values @modifiers had.
The any creates a junction with one element ('command').
Now the only argument that triple will accept is 'command'.
Why the precedence issue arises
Raku's grammar is designed to have nice ergonomics. This includes design details that reduce the need for parens and braces. Overall these design details yield a big net win. But there are wrinkles, and you've encountered one.
Raku lets one write where ... to specify a where clause without requiring that one uses an explicit braced lambda ({...}) for the ... bit. One can even create a lambda using just a *. Nice! But where does the lambda end? If you use explicit braces, it's clear. If not, what determines the end of the lambda?
More generally, shouldn't the parser just know that the = in = 'command' is not part of any lambda? That it should instead just automatically finish a where clause if there is one before parsing the = ... part? That the = ... should always be parsed as a default value for a parameter?
One can easily see the ambiguity (once one's attention is drawn to it), and so does/could Raku's grammar/parser. It just needs to resolve that ambiguity either by rejecting such syntax, demanding the coder explicitly disambiguates (eg with parens, as you've done), or by choosing which way to parse.
What Raku's grammar/parser does in the face of the ambiguity is choose. And it chooses wrong. (Unless of course one wanted it to be an assignment of some value on the ='s left, not a default for a parameter, though that's gotta be pretty unlikely.)
Why Raku's grammar/parser didn't just get it right
Why doesn't the parser reject this code as being too ambiguous, or be smart enough to choose the "it's a default" interpretation? It certainly could -- the Raku grammar/parser feature is Turing complete, equivalent in power to the unrestricted grammars category in the Chomsky parsing hierarchy -- so why doesn't it just get it right?
In a nutshell, it gets it right the right amount, at least imo. But that's subjective, oddly worded, and vague, so it's probably not a satisfactory summary. So I'll try to provide a bit more detail in the hope it's more informative.
Every Raku design decision is discussed openly, and there are searchable public records of essentially all of it. To dig into these discussions I recommend starting out with Liz++'s awesome IRC log service, and within the numerous channels listed, focusing on the #perl6 logs that ran from 2005 thru 2019 or so.
Although I've been around for many of the Raku design discussions of the last 20 years, I don't have a good recollection of all the discussions surrounding this decision about the ambiguity of meaning of a = ... at the end of a where clause, and what to do about it. And I haven't myself recently done the digging I suggest; for now I'll leave that for any interested readers. Instead I will outline what I think will have been some contributing factors:
Single pass parsing
Raku's "braided" approach to language design requires single pass parsing.
Longest parsing orientation
Longest Token Matching is all but essential for user definable braiding (see link in previous point) to be truly viable. LTM reflects a general principle that humans naturally tend toward recognizing the longest "token" (within reason of course). That is to say, if one sees $100, it strikes one cognitively as a hundred dollars, not as a dollar sign, a 1, a 0, and then another 0.
A similar deal applies to parsing of a string of tokens (again, within reason); if it weren't for the fact one learns to think of = ... as specifying a parameter's default, the @modifiers = 'command' would naturally be read as an assignment into @modifiers.
Limited backtracking
Backtracking is slow, and pathological backtracking is utterly evil. So Raku's grammar/parser avoids potential backtracking in all but three cases for which it really is the right solution, and entirely avoids any risk of pathological backtracking.
Handling ambiguity
While artificial languages can aim to expunge all ambiguity, the closer one gets to eliminating all ambiguity, the greater the amount of extraneous and distracting syntax one requires, such as frequent required use of delimiters (parens, braces, square brackets, etc) to ensure disambiguation. That makes a language increasingly unfriendly and verbose for that reason instead. Raku culture avoids ideological "foolish consistency" extremes.
Raku's designers (principally Larry Wall) considered all these factors and many more and arrived at Raku's solution:
Be rationally predictable
A sufficiently simple and predictable approach to parsing, and requisite sensitivity to the likelihood and costs of any surprises a user may encounter, goes a long way, and the design relative to where clauses is a case in point.
While the precedence issue may have been a surprise, and the error message unhelpful, I, er, predict you'll find your ERN signal regarding this will tune up quite nicely in fairly short order, just as it will for most of the things that might trip you up as you learn Raku.
Use predictive parsing
While there are several ways to accommodate all of the above, predictive parsing[1] is a great choice, and -- not coincidentally! -- the one most naturally written using Raku grammars, and the one used for Raku's own grammar/parser.
Some other solutions
Here's what does not work as expected:
sub triple(Str:D $mod? where * ~~ any @modifiers = 'command' ) { }
                                                 ^ Needs to be end of `where` clause
You've suggested a solution, and I suggested a couple at the start. Some more follow.
You used parens. Here are some other ways to use parens:
sub triple(Str:D $mod? where * ~~ any(@modifiers) = 'command' ) { }
sub triple(Str:D $mod? where * ~~ (any @modifiers) = 'command' ) { }
Or, switch to use of $_ (aka "it") instead of * (aka "whatever") inside braces:
sub triple(Str:D $mod? where { $_ ~~ any @modifiers } = 'command' ) { }
Footnotes
[1] The Wikipedia page discusses "grammars" and "ambiguity" in a manner that may be confusing given that they are not used in the same way those words are used in the context of Raku and of this answer. But discussing that would be a rabbit hole inappropriate for this SO.

Where is contains(Junction) defined?

This code works:
(3,6...66).contains( 9|21 ).say # OUTPUT: «any(True, True)␤»
It returns a Junction. It's also tested, but not documented.
The problem is I can't find its implementation anywhere. The Str code, which is also called from Cool, never returns a Junction (it does not take a Junction, either). There are no other contains methods in the source.
Since it's autothreaded, it's probably specially defined somewhere. I have no idea where, though. Any help?
TL;DR Junction autothreading is handled by a single central mechanism. I have a go at explaining it below.
(The body of your question starts with you falling into a trap, one I think you documented a year or two back. It seems pretty irrelevant to what you're really asking but I cover that too.)
How junctions get handled
Where is contains(Junction) defined? ... The problem is I can't find [the Junctional] implementation anywhere. ... Since it's autothreaded, it's probably specially defined somewhere.
Yes. There's a generic mechanism that automatically applies autothreading to all P6 routines (methods, operators etc.) that don't have signatures that explicitly control what happens with Junction arguments.
Only a tiny handful of built-in routines have these explicit Junction-handling signatures -- print is perhaps the most notable. The same is true of user-defined routines.
.contains does not have any special handling. So it is handled automatically by the generic mechanism.
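A tiny demonstration of the generic mechanism at work (the sub and its name are mine); a user-defined routine with a non-Junction parameter autothreads just like .contains:
sub double(Int $n) { $n * 2 }

say double( 9 | 21 );               # any(18, 42)
say (3,6...66).contains( 9 | 21 );  # any(True, True)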
Perhaps the section The magic of Junctions of my answer to an earlier SO Filtering elements matching two regexes will be helpful as a high level description of the low level details that follow below. Just substitute your 9|21 for the foo & bar in that SO, and your .contains for the grep, and it hopefully makes sense.
Spelunking the code
I'll focus on methods. Other routines are handled in a similar fashion.
method AUTOTHREAD does the work for full P6 methods.
This is set up in this code that sets up handling for both nqp and full P6 code.
The above linked P6 setup code in turn calls setup_junction_fallback.
When a method call occurs in a user's program, it involves calling find_method (modulo cache hits as explained in the comment above that code; note that the use of the word "fallback" in that comment is about a cache miss -- which is technically unrelated to the other fallback mechanisms evident in this code we're spelunking thru).
The bit of code near the end of this find_method handles (non-cache-miss) fallbacks.
Which arrives at find_method_fallback which starts off with the actual junction handling stuff.
A trap
This code works:
(3,6...66).contains( 9|21 ).say # OUTPUT: «any(True, True)␤»
It "works" to the degree this does too:
(3,6...66).contains( 2 | '9 1' ).say # OUTPUT: «any(True, True)␤»
See Lists become strings, so beware .contains() and/or discussion of the underlying issues such as pmichaud's comment.
Routines like print, put, infix ~, and .contains are string routines. That means they coerce their arguments to Str. By default the .Str coercion of a listy value is its elements separated by spaces:
put 3,6...18; # 3 6 9 12 15 18
put (3,6...18).contains: '9 1'; # True
It's also tested
Presumably you mean the two tests with a *.contains argument passed to classify:
my $m := @l.classify: *.contains: any 'a'..'f';
my $s := classify *.contains( any 'a'..'f'), @l;
Routines like classify are list routines. While some list routines do a single operation on their list argument/invocant, eg push, most of them, including classify, iterate over their list doing something with/to each element within the list.
Given a sequence invocant/argument, classify will iterate it and pass each element to the test, in this case a *.contains.
The latter will then coerce individual elements to Str. This is a fundamental difference compared to your example which coerces a sequence to Str in one go.
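A sketch of that difference (the list is mine):
my @l = 3, 6 ... 18;
say @l.Str;                     # 3 6 9 12 15 18  (the whole list as one string)
say @l.contains('9 1');         # True            (coerced in one go)
say @l.map(*.contains('9 1'));  # (False False False False False False)  (element by element)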

Why not have operators as both keywords and functions?

I saw this question and it got me wondering.
Ignoring the fact that pretty much all languages have to be backwards compatible, is there any reason we cannot use operators as both keywords and functions, depending on whether they're immediately followed by a parenthesis? Would it make the grammar harder?
I'm thinking mostly of python, but also C-like languages.
Perl does something very similar to this, and the results are sometimes surprising. You'll find warnings about this in many Perl texts; for example, this one comes from the standard distributed Perl documentation (man perlfunc):
Any function in the list below may be used either with or without parentheses around its arguments. (The syntax descriptions omit the parentheses.) If you use parentheses, the simple but occasionally surprising rule is this: It looks like a function, therefore it is a function, and precedence doesn't matter. Otherwise it's a list operator or unary operator, and precedence does matter. Whitespace between the function and left parenthesis doesn't count, so sometimes you need to be careful:
print 1+2+4;      # Prints 7.
print(1+2) + 4;   # Prints 3.
print (1+2)+4;    # Also prints 3!
print +(1+2)+4;   # Prints 7.
print ((1+2)+4);  # Prints 7.
An even more surprising case, which often bites newcomers:
print
($a % 7 == 0 || $a % 7 == 1) ? "good" : "bad";
will print 0 or 1.
In short, it depends on your theory of parsing. Many people believe that parsing should be precise and predictable, even when that results in surprising parses (as in the Python example in the linked question, or even more famously, C++'s most vexing parse). Others lean towards Perl's "Do What I Mean" philosophy, even though the result -- as above -- is sometimes rather different from what the programmer actually meant.
C, C++ and Python all tend towards the "precise and predictable" philosophy, and they are unlikely to change now.
Depending on the language, not() may simply not be defined. If not() is not defined in some language, you cannot use it. Why would not() be left undefined? Probably because the creator of that language saw no need for this construction -- it is often better to keep things simple.

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require shorter tokens to be preferred (in that case MONTH would never be matched). You need backtracking behaviour that depends on whether the surrounding text matches. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However, other parser generators are specialized for problems like yours. Packrat parsers (PEG) backtrack and allow tokenization on the fly. Try out parboiled for this purpose.
It appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.