How do I match using :global in Raku grammar?

I'm trying to write a Raku grammar that can parse commands that ask for programming puzzles.
This is a simplified version just for my question, but the commands combine a difficulty level with an optional list of languages.
Sample valid input:
No language: easy
One language: hard javascript
Multiple languages: medium javascript python raku
I can get it to match one language, but not multiple languages. I'm not sure where to add the :g.
Here's an example of what I have so far:
grammar Command {
rule TOP { <difficulty> <languages>? }
token difficulty { 'easy' | 'medium' | 'hard' }
rule languages { <language>+ }
token language { \w+ }
}
multi sub MAIN(Bool :$test) {
use Test;
plan 5;
# These first 3 pass.
ok Command.parse('hard', :token<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :token<difficulty>), '<difficulty> should not parse random words';
# Why does this parse <languages>, but <language> fails below?
ok Command.parse('js', :rule<languages>), '<languages> can parse a language';
# These last 2 fail.
ok Command.parse('js', :token<language>), '<language> can parse a language';
# Why does this not match both words? Can I use :g somewhere?
ok Command.parse('js python', :rule<languages>), '<languages> can parse multiple languages';
}
This works, even though my test #4 fails:
my token wrd { \w+ }
'js' ~~ &wrd; #=> 「js」
Extracting multiple languages works with a regex using this syntax, but I'm not sure how to use that in a grammar:
'js python' ~~ m:g/ \w+ /; #=> (「js」 「python」)
Also, is there an ideal way to make the order unimportant so that difficulty could come anywhere in the string? Example:
rule TOP { <languages>* <difficulty> <languages>? }
Ideally, I'd like for anything that is not a difficulty to be read as a language. Example: raku python medium js should read medium as a difficulty and the rest as languages.

There are two things at issue here.
To specify a subrule in a grammar parse, the named argument is always :rule, regardless of whether it's declared in the grammar as a rule, token, method, or regex. Your first two tests pass only because they are actually run against the full grammar (that is, TOP) and happen to give the expected results there; the :token named argument is unknown and silently ignored.
That gets us:
ok Command.parse('hard', :rule<difficulty>), '<difficulty> can parse a difficulty';
nok Command.parse('no', :rule<difficulty>), '<difficulty> should not parse random words';
ok Command.parse('js', :rule<languages> ), '<languages> can parse a language';
ok Command.parse('js', :rule<language> ), '<language> can parse a language';
ok Command.parse('js python', :rule<languages> ), '<languages> can parse multiple languages';
# Output
ok 1 - <difficulty> can parse a difficulty
ok 2 - <difficulty> should not parse random words
ok 3 - <languages> can parse a language
ok 4 - <language> can parse a language
not ok 5 - <languages> can parse multiple languages
The second issue is how implied whitespace is handled in a rule. In a token, the following are equivalent:
token foo { <alpha>+ }
token bar { <alpha> + }
But in a rule, they would be different. Compare the token equivalents for the following rules:
rule foo { <alpha>+ }
token foo { <alpha>+ <.ws> }
rule bar { <alpha> + }
token bar { [<alpha> <.ws>] + }
In your case, you have <language>+, and since language is \w+, it's impossible to match two (because the first one will consume all the \w). Easy solution though, just change <language>+ to <language> +.
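As a minimal sketch of that fix, here is the grammar from the question with only the languages rule changed. The variable names $langs and $cmd are just for illustration:

```raku
grammar Command {
    rule  TOP        { <difficulty> <languages>? }
    token difficulty { 'easy' | 'medium' | 'hard' }
    # The space before + puts the implied <.ws> inside the repeated group.
    rule  languages  { <language> + }
    token language   { \w+ }
}

# Parse the sub-rule on its own, as in test 5.
my $langs = Command.parse('js python', :rule<languages>);
say $langs<language>.map(*.Str).join(', ');          # js, python

# Parse a full command through TOP.
my $cmd = Command.parse('medium javascript python raku');
say ~$cmd<difficulty>;                               # medium
say $cmd<languages><language>.map(*.Str).join(', ');
```

Each language ends up as a separate entry under the language key, so no :g is needed inside the grammar.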
To allow the <difficulty> token to float around, the first solution that jumps to my mind is to match it and bail in a <language> token:
token language { <!difficulty> \w+ }
<!foo> will fail if, at that position, it can match <foo>. This will work almost perfectly until you get a language like 'easyFoo'. The easy fix there is to ensure that the difficulty token always occurs at a word boundary:
token difficulty {
    [
    | easy
    | medium
    | hard
    ]
    >>
}
where >> asserts a word boundary on the right.
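Putting the pieces together, a sketch of a grammar with a floating difficulty could look like the following. The TOP rule here is an assumption based on the one proposed in the question, and the way the languages are flattened at the end is just one possible approach:

```raku
grammar Command {
    # Assumed TOP: languages may appear before and after the difficulty.
    rule  TOP        { <languages>? <difficulty> <languages>? }
    token difficulty { [ easy | medium | hard ] >> }
    rule  languages  { <language> + }
    # A language is any word that is not a difficulty.
    token language   { <!difficulty> \w+ }
}

say so Command.parse('medium',  :rule<language>);   # False
say so Command.parse('easyFoo', :rule<language>);   # True

my $m = Command.parse('raku python medium js');
say ~$m<difficulty>;                                 # medium

# <languages> appears twice in TOP, so collect the words from every occurrence.
my @langs = $m<languages>.map({ |.<language>.map(*.Str) });
say @langs.join(', ');
```

Because of the >> boundary, 'easyFoo' is not mistaken for the difficulty 'easy' followed by 'Foo'.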

Related

Perl6 regex not matching end $ character with filenames

I've been trying to learn Perl6 from Perl5, but the issue is that the regex works differently, and it isn't working properly.
I am making a test case to list all files in a directory ending in ".p6$"
This code works with the end character
if 'read.p6' ~~ /read\.p6$/ {
say "'read.p6' contains 'p6'";
}
However, if I try to fit this into a subroutine:
multi list_files_regex (Str $regex) {
my @files = dir;
for @files -> $file {
if $file.path ~~ /$regex/ {
say $file.path;
}
}
}
it no longer works. I don't think the issue is with the regex but with the file name; there may be some attribute I'm not aware of.
How can I get the file name to match the regex in Perl6?
Regexes are a first-class language within Perl 6, rather than simply strings, and what you're seeing here is a result of that.
The form /$foo/ in Perl 6 regex will search for the string value in $foo, so it will be looking, literally, for the characters read\.p6$ (that is, with the dot and dollar sign).
Depending on the situation of the calling code, there are a couple of options:
If you really are receiving regexes as strings, for example read as input or from a file, then use $file.path ~~ /<$regex>/. This means it will treat what's in $regex as regex syntax.
If you will just be passing a range of different regexes in, change the parameter to be of type Regex, and then do $file.path ~~ $regex. In this case, you'd pass them like list_files_regex(/foo/).
Last but not least, dir takes a test parameter, and so you can instead write:
for dir(test => /<$regex>/) -> $file {
say $file.path;
}
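The difference between the interpolation forms can be seen without touching the file system. This is a small sketch; $source and name-matches are hypothetical names used only for illustration:

```raku
my $source = 'read\.p6$';            # regex source held in a plain string

say so 'read.p6' ~~ /$source/;       # False: searches for the literal text read\.p6$
say so 'read.p6' ~~ /<$source>/;     # True: the string is compiled as regex syntax

# Hypothetical helper that takes a Regex object instead of a string.
sub name-matches(Str $name, Regex $pattern --> Bool) {
    so $name ~~ $pattern
}

say name-matches('read.p6',  / '.p6' $ /);   # True
say name-matches('read.p6~', / '.p6' $ /);   # False
```

Passing a Regex object keeps the pattern checked at compile time, whereas <$source> compiles the string each time it is used.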

Split a BibTeX author field into parts

I am trying to parse a BibTeX author field using the following grammar:
use v6;
use Grammar::Tracer;
# Extract BibTeX author parts from string. The parts are separated
# by a comma and optional space around the comma
grammar Author {
token TOP {
<all-text>
}
token all-text {
[<author-part> [[\s* ',' \s*] || [\s* $]]]+
}
token author-part {
[<-[\s,]> || [\s* <!before ','>]]+
}
}
my $str = "Rockhold, Mark L";
my $result = Author.parse( $str );
say $result;
Output:
TOP
| all-text
| | author-part
| | * MATCH "Rockhold"
| | author-part
But here the program hangs (I have to press CTRL-C to abort).
I suspect the problem is related to the negative lookahead assertion. I tried to remove it, and then the program does not hang anymore, but then I am also not able to extract the last part "Mark L" with an internal space.
Note that for debugging purposes, the Author grammar above is a simplified version of the one used in my actual program.
The expression [\s* <!before ','>] may not make any progress. Since it's in a quantifier, it will be retried again and again (but not move forward), resulting in the hang observed.
Such a construct will reliably hang at the end of the string; doing [\s* <!before ',' || $>] fixes it by making the lookahead fail at the end of the string also (being at the end of the string is a valid way to not be before a ,).
At least for this simple example, it looks like the whole author-part token could just be <-[,]>+, but perhaps that's an oversimplification for the real problem that this was reduced from.
Glancing at all-text, I'd also point out the % quantifier modifier which makes matching comma-separated (or anything-separated, really) things easier.
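A minimal sketch of the % approach, assuming the simplified author-part suggested above is good enough for the real data:

```raku
grammar Author {
    # One or more parts, separated by a comma with optional surrounding space.
    token TOP         { <author-part>+ % [ \s* ',' \s* ] }
    # Simplification: a part is any run of non-comma characters.
    token author-part { <-[,]>+ }
}

my $result = Author.parse('Rockhold, Mark L');
say $result<author-part>.map(*.Str).join(' | ');   # Rockhold | Mark L
```

Note that with this simplified author-part, whitespace directly before a comma would stay attached to the preceding part, so the real grammar may need a stricter token.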

How to make Perl 6 grammar produce more than one match (like :ex and :ov)?

I want grammar to do something like this:
> "abc" ~~ m:ex/^ (\w ** 1..2) (\w ** 1..2) $ {say $0, $1}/
「ab」「c」
「a」「bc」
Or like this:
> my regex left { \S ** 1..2 }
> my regex right { \S ** 1..2 }
> "abc" ~~ m:ex/^ <left><right> $ {say $<left>, $<right>}/
「ab」「c」
「a」「bc」
Here is my grammar:
grammar LR {
regex TOP {
<left>
<right>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
my $string = "abc";
my $match = LR.parse($string);
say "input: $string";
printf "split: %s|%s\n", ~$match<left>, ~$match<right>;
Its output is:
$ input: abc
$ split: ab|c
So, <left> can be only greedy leaving nothing to <right>. How should I modify the code to match both possible variants?
$ input: abc
$ split: a|bc, ab|c
Grammars are designed to give zero or one answers, not more than that, so you have to use some tricks to make them do what you want.
Since Grammar.parse returns just one Match object, you have to use a different approach to get all matches:
sub callback($match) {
say $match;
}
grammar LR {
regex TOP {
<left>
<right>
$
{ callback($/) }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
LR.parse('abc');
Making the match fail by calling the <!> assertion (which always fails) forces the previous atoms to backtrack and thus to find different solutions. Of course this makes the grammar less reusable, because it works outside the regular calling conventions for grammars.
Note that, for the caller, the LR.parse seems to always fail; you get all the matches as calls to the callback function.
A slightly nicer API (but the same approach underneath) is to use gather/take to get a sequence of all matches:
grammar LR {
regex TOP {
<left>
<right>
$
{ take $/ }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
.say for gather LR.parse('abc');
I think Moritz Lenz, nickname moritz, author of the upcoming book "Parsing with Perl 6 Regexes and Grammars", is the person to ask about this. I probably should have just asked him to answer this SO...
Notes
In case anyone considers attempting to modify grammar.parse so that it supports :exhaustive, or otherwise hacking things to do what @evb wants, the following documents potentially useful inspiration/guidance that I gleaned from spelunking the relevant speculations document (S05) and searching the #perl6 and #perl6-dev irc logs.
7 years ago Moritz added an edit of S05:
A [regex] modifier that affects only the calling behaviour, and not the regex itself [eg :exhaustive] may only appear on constructs that involve a call (like m// [or grammar.parse]), and not on rx// [or regex { ... }].
(The [eg :exhaustive], [or grammar.parse], and [or regex { ... }] bits are extrapolations/interpretations/speculations I've added in this SO answer. They're not in the linked source.)
5 years ago Moritz expressed interest in implementing :exhaustive for matching (not parsing) features. Less than 2 minutes later jnthn showed a one liner that demo'd how he guessed he'd approach it. Less than 30 minutes later Moritz posted a working prototype. The final version landed 7 days later.
1 year ago Moritz said on #perl6 (emphasis added by me): "regexes and grammars aren't a good tool to find all possible ways to parse a string".
Hth.

Does .parse anchor or :sigspace first in a Perl 6 rule?

I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
rule TOP { \d+ }
}
my @strings = '137', '137 ', ' 137 ';
for @strings -> $string {
my $result = Number.parse( $string );
given $result {
when Match { put "<$string> worked!" }
when Any { put "<$string> failed!" }
}
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but the docs do not say which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs [1] (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, <.ws>. Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
              ^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
            ^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 629 [1]:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 4423 [1]:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
1) The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars, is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.
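To see that the anchoring comes from .parse and applies to whichever rule is selected, here is a small sketch based on the grammar from the question. The rule name lenient is a hypothetical addition for illustration:

```raku
grammar Number {
    rule TOP     { \d+ }     # initial whitespace in a rule is ignored, so no leading <.ws>
    rule lenient { ^ \d+ }   # the space after ^ becomes <.ws>, so leading blanks match
}

for '137', '137 ', ' 137 ' -> $string {
    my $top     = so Number.parse($string);
    my $lenient = so Number.parse($string, :rule<lenient>);
    say "<$string> TOP: $top, lenient: $lenient";
}
# <137> TOP: True, lenient: True
# <137 > TOP: True, lenient: True
# < 137 > TOP: False, lenient: True
```

Both rules are anchored in the same way by .parse; only the whitespace handling written inside each rule differs.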

Why doesn't this perl 6 grammar work?

I don't know perl 5, but I thought I'd have a play with perl 6. I am trying out its grammar capabilities, but so far I'm having no luck. Here's my code so far:
grammar CopybookGrammar {
token TOP { {say "at TOP" } <aword><num>}
token aword { {say "at word" } [a..z]+ }
token num { {say "at NUM" } [0..9]+ }
}
sub scanit($contents) {
my $match1 = CopybookGrammar.parse($contents);
say $match1;
}
scanit "outline1";
The output is as follows:
at TOP
at word
(Any)
For some reason, it does not appear to be matching the <num> rule. Any ideas?
You forgot the angled brackets in the character classes syntax:
[a..z]+ should be <[a..z]>+
[0..9]+ should be <[0..9]>+
By themselves, square brackets [ ] simply act as a non-capturing group in Perl 6 regexes. So [a..z]+ would match the letter "a", followed by any two characters, followed by the letter "z", and then the whole thing again any number of times. Since this does not match the word "outline", the <aword> token failed to match for you, and parsing did not continue to the <num> token.
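With the character classes corrected, the grammar from the question can be sketched as follows (the debugging say blocks are left out for brevity):

```raku
grammar CopybookGrammar {
    token TOP   { <aword> <num> }
    token aword { <[a..z]>+ }    # one or more lowercase ASCII letters
    token num   { <[0..9]>+ }    # one or more ASCII digits
}

my $match = CopybookGrammar.parse('outline1');
say ~$match<aword>;   # outline
say ~$match<num>;     # 1
```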
PS: When debugging grammars, a more convenient alternative to adding {say ...} blocks everywhere, is to use Grammar::Debugger. After installing that module, you can temporarily add the line use Grammar::Debugger; to your code, and run your program - then it'll go through your grammar step by step (using the ENTER key to continue), and tell you which tokens/rules match along the way.