As part of my grammar I have:
rule EX1 { <EX2> ( '/' <EX2>)* }
In my actions class I have written:
method EX1($/) {
my #ex2s = map *.made, $/.<EX2>;
my $ex1 = #ex2s.join('|');
#say "EX1 making $ex1";
$/.make($ex1);
}
So basically I am just trying to join all the EX2's together with a '|' between them instead of a '/'. However something is not right with my code, as it only picks up the first EX2, not the subsequent ones. How do I find out what the optional ones are?
TL;DR Your action method would work if your rule created the data structure the method is expecting. So we'll fix the rule and leave the method alone.
The main problem
Let's assume the EX1 rule is slotted into a working grammar; a string has been successfully parsed; the substring ex2/ex2/ex2 matched the EX1 rule; and we've displayed the corresponding part of the parse tree (by just saying the results of .parse using the grammar):
EX1 => 「ex2/ex2/ex2」
EX2 => 「ex2」
0 => 「/ex2」
EX2 => 「ex2」
0 => 「/ex2」
EX2 => 「ex2」
Note the extraneous 0 => captures and how the second and third EX2s are indented under them and indented relative to the first EX2. That's the wrong nesting structure relative to your method's assumptions.
Brad's solution to the main problem
As Brad++ points out in their comment responding to the first version of this answer, you can simply switch from the construct that both groups and captures ((...)) to the one that only groups ([...]).
rule EX1 { <EX2> [ '/' <EX2>]* }
Now the corresponding parse tree fragment for the same input string as above is:
EX1 => 「ex2/ex2/ex2」
EX2 => 「ex2」
EX2 => 「ex2」
EX2 => 「ex2」
The 0 captures are gone and the EX2s are now all siblings. For further discussion of when and why P6 nests captures the way it does, see jnthn's answer to Why/how ... capture groups?.
Your action method should now work -- for some inputs...
Håkon's solution to another likely problem
If Brad's solution works for some of the inputs you'd expect it to work for, but not all, part of the problem is likely how your rule matches between <EX2> and the / character.
As Håkon++ points out in their answer, your rule has spacing that probably doesn't do what you want.
If you don't intend the spacing in your pattern to be significant, then don't use a rule. In a token or regex all spaces in a pattern (ignoring inside a string eg ' ') is just to make your pattern more readable and isn't meaningful relative to any input string being matched. If in doubt, use a token (or regex) not a rule:
token EX1 { <EX2> ( '/' <EX2>)* }
🡅 🡅 🡅 🡅 🡅 🡅
Spacing indicated with 🡅 is NOT significant. You could omit it or extend it and it'll make no difference to how the rule matches input. It's only for readability.
In contrast, the entire point of the rule construct is that whitespace after each atom and each quantifier in a pattern is significant. Such spacing implicitly applies a (user overridable) boundary matching rule (by default a rule that allows whitespace and/or a transition between "word" and non-"word" characters) after the corresponding substring in the input.
In your EX1 rule, which I repeat below with exaggerated spacing to ensure clarity, some of the spacing is not significant, just as it isn't in a token or regex:
rule EX1 { <EX2> ( '/' <EX2>)* }
🡅 🡅 🡅
As before 🡅 indicates spacing that is NOT significant -- you can omit or extend it and it'll make no difference. The thing to remember is that spaces at the start of a pattern (or sub-pattern) is just for readability. (Experience from use showed that it was much better if any spacing there is not treated as significant.)
But spacing or lack of spacing after an atom or quantifier is significant:
This spacing is significant: ⮟ ⮟ ⮟
rule EX1 { <EX2> ( '/' <EX2>)* }
This LACK of spacing is significant: ⮝⮝
By writing your rule as you did you're telling P6 to match input with boundary matching (which by default allows whitespace) only:
after the first <EX2> (and thus before the first /);
between / and subsequent <EX2> matches;
after the last <EX2> match.
So your rule tells P6 to allow spaces between a / and <EX2> match when they occur in that order -- /, then <EX2>.
But it also tells P6 to not allow spaces the other way around -- between an <EX2> match and a / match in that order! Except with the very first <EX2> '/' pair!! P6 will let you declare match patterns of arbitrary complexity, including spacing, but I doubt this is what you meant or want.
For a complete listing of what "after an atom" means (i.e. when whitespace in rules is significant) see When is white space really important in Perl6 grammars?.
This significant spacing feature is:
Classic Perl DWIMery designed to make life easier;
Idiomatic -- used in most grammars because it does indeed make life easier;
The only reason the rule declarator exists (this significant whitespace aspect is the only difference between a rule and a token);
Completely optional because you can just use a token instead.
If someone reading this thinks they'd rather not take advantage of this significant space feature, then they can just use tokens instead. (This in turn will likely lead them to see why rule exists as an option, and then, or perhaps later, to see why it works the way it does, and to appreciate its DWIMery anew. :) )
The built in construct for the pattern you're matching
Finally, here's the idiomatic way to write the pattern you're trying to match:
rule EX1 { <EX2> + % '/' }
This tells P6 to match one or more <EX2>s separated by / characters. See Modified quantifier: %, %% for an explanation of this nice construct.
This is still a rule so most of the spacing in it remains significant. The precise details for when it is and isn't are at their most apparently fiddly for this construct because it has up to three significant spacers and one that's not:
NOT significant: ⮟ ⮟
rule EX1 { <EX2> + % '/' }
Significant: ⮝ ⮝ ⮝
Including spacing both before and after the + is redundant:
rule EX1 { <EX2> + % '/' }
rule EX1 { <EX2> +% '/' } # same match result
rule EX1 { <EX2>+ % '/' } # same match result
White space is significant in rules. So I think you are missing a whitespace after the last <EX2>:
rule EX1 { <EX2> ( '/' <EX2>)+ }
It should be:
rule EX1 { <EX2> ( '/' <EX2> )+ }
This allows for space to separate the terms in EX1.
Related
I have this pair of rules in my ANTLR lexer grammar, which match the same pattern, but have mutually exclusive predicates:
MAGIC: '#' ~[\r\n]* {getLine() == 1}? ;
HASH_COMMENT: '#' ~[\r\n]* {getLine() != 1}? -> skip;
When I look at the tokens in the ANTLR Preview, I see:
So it seems like the predicate isn't being used, and regardless of the line I'm on, the token comes out as MAGIC.
I also tried a different approach to try and work around this:
tokens { MAGIC }
HASH_COMMENT: '#' ~[\r\n]* {if (getLine() == 1) setType(MAGIC); else skip();};
But now, both come out as HASH_COMMENT:
I really expected the first attempt using two predicates to work, so that was surprising, but now it seems like the action doesn't work either, which is even more odd.
How do I make this work?
I'd rather not try to match "#usda ..." as a different token because that comment could occur further down the file, and it should be treated as a normal comment unless it's on the first line.
I would not try to force semantics in the parse step. The letter combination is a HASH_COMMENT, period.
Instead I would handle that as normal syntax and handle anything special you might need in the step after parsing. For example:
document: HASH_COMMENT? content EOF;
This way you define a possible HASH_COMMENT (which you might interpret as MAGIC later, without using such a token type) before any content. Might not be line one, but before anything else (which resembles real document better, where you can have whitespaces before your hash comment).
I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
| $
| '_' <?before $>
| '_' <?before ['_' <?before $>]>
| '_' <?before ['_' <?before ['_' <?before $>]>]>
# ...
]>
]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per #user0721090601's comment on your Q, and as a variant of #p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading
I have a simple grammar, and I am using it to parse some text. The text is user inputted, but my program guarantees that it stars with a match to the grammar. (ie, if my grammar matched only a, the text might be abc or a or a_.) However, when I use the .parse method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, Grammar.parse has to match the whole string. This is what causes it to fail if your grammar would only match a in the string abc. To allow matching only part of the input string, you can use Grammar.subparse instead.
grammar Foo {
token TOP { 'a' }
}
my $string = 'abc';
say Foo.parse($string); # Nil
say Foo.subparse($string); # 「a」
The input string will need to start with the potential Match. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
token TOP {
<-[a]>* # Match 0 or more characters that are *not* a
<( 'a' # Start the match, and match a single 'a'
}
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because <-[a]>*, a character class that includes any character except the letter a, will consume all the characters before a potential a. However, the Capture marker will cause these to be dropped from the eventual Match object, leaving you with just the a you wanted to match.
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
Vocabulary
There are traditionally two takes on the general notion of text "matching":
"Parsing"
"Regexes"
Raku:
Provides a unified text pattern language and engine that do both jobs.
Makes it easy to stick to one perspective, or other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
Takes "parsing" to mean more or less a single match starting at the start of the input string whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one to try help you and/or other readers to be clearer about Raku's use of vocabulary, and your options functionality wise.
Limited "partial matching" via .parse et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built in Grammar methods with parse in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is .subparse:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
Search of SO for "[raku] subparse".
raku doc for .subparse.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will .subparse only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first a encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first a was a false positive, and there was a second or a subsequent a you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the ~~ smartmatch operator.
Unlimited "partial matching" via ~~
Raku lets you do unlimited partial matching if you use its ~~ construct with a regex.
For example, you could write:
# End of match at end of string:
↓
say 'abcaa' ~~ token { a* $ } # 「aa」
~~ with a regex tells Raku to:
Try match starting at the first character position in the string on the LHS;
If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which ~~ takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the aa at the end of the string.
This anchoring freedom illustrates just one of the many ways that ~~ smart matching provides much greater matching flexibility than using the parse methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string:
↓
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside <...>. And you need to insert a . to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used :g, which is short for :global, which is a well known feature among traditional regex engines. :g matches as many times as a match is found in the input string (but not overlapping).
The match operation then returns either Nil (no matches at all) or a list of match objects (one or more). I've applied a .max(*.chars) to yield the longest match (the first if there are multiple longest sub-strings).
I have two questions. Is the behavior I show correct, and if so, is it documented somewhere?
I was playing with the grammar TOP method. Declared as a rule, it implies beginning- and end-of-string anchors along with :sigspace:
grammar Number {
rule TOP { \d+ }
}
my #strings = '137', '137 ', ' 137 ';
for #strings -> $string {
my $result = Number.parse( $string );
given $result {
when Match { put "<$string> worked!" }
when Any { put "<$string> failed!" }
}
}
With no whitespace or trailing whitespace only, the string parses. With leading whitespace, it fails:
<137> worked!
<137 > worked!
< 137 > failed!
I figure this means that rule is applying :sigspace first and the anchors afterward:
grammar Foo {
regex TOP { ^ :sigspace \d+ $ }
}
I expected a rule to allow leading whitespace, which would happen if you switched the order:
grammar Foo {
regex TOP { :sigspace ^ \d+ $ }
}
I could add an explicit token in rule for the beginning of the string:
grammar Number {
rule TOP { ^ \d+ }
}
Now everything works:
<137> worked!
<137 > worked!
< 137 > worked!
I don't have any reason to think it should be one way or the other. The Grammars docs say two things happen, but the docs do not say which order these effects apply:
Note that if you're parsing with .parse method, token TOP is automatically anchored
and
When rule instead of token is used, any whitespace after an atom is turned into a non-capturing call to ws.
I think the answer is that the rule isn't actually anchored in the pattern sense. It's the way .parse works. The cursor has to start at position 0 and end at the last position in the string. That's something outside of the pattern.
The behavior is intended, and is a culmination of these language features:
Sigspace ignores whitespace before the first atom.
From the design docs1 (S05: Regexes and Rules, line 348, emphasis added):
The new :s (:sigspace) modifier causes certain whitespace sequences to be considered "significant"; they are replaced by a whitespace matching rule, . Only whitespace sequences immediately following a matching construct (atom, quantified atom, or assertion) are eligible. Initial whitespace is ignored at the front of any regex, to make it easy to write rules that can participate in longest-token-matching alternations. Trailing space inside the regex delimiters is significant.
This means:
rule TOP { \d+ }
^-------- <.ws> automatically inserted
rule TOP { ^ \d+ $ }
^---^-^---- <.ws> automatically inserted
Regexes are first-class compiled code with lexical scoping.
A regex/rule is not a string that may have characters concatenated to it later to change its behavior. It is a self-contained routine, which is parsed and has its behavior nailed down at compile time.
Regex modifiers like :sigspace, including the one implicitly added by the rule keyword, apply only to their lexical scope - i.e. to the fragment of source code they appear in at compile time. S05, line 6291:
The :i, :m, :r, :s, :dba, :Perl5, and Unicode-level modifiers can be placed inside the regex (and are lexically scoped)
The anchoring of rule TOP is done at run time by .parse.
S05, line 44231:
The .parse and .parsefile methods anchor to the beginning and ending of the text, and fail if the end of text is not reached. (The TOP rule can check against $ itself if it wishes to produce its own error message.)
I.e. the anchoring to the beginning of the string is not intrinsic to the rule TOP, and doesn't affect how the lexical scope of TOP is parsed and compiled. It is done when method .parse is called.
It has to be this way, because because the same grammar can be used with different starting rules instead of TOP, using .parse(..., rule => ...).
So when you write
rule TOP { \d+ }
it is compiled as
regex TOP { :r \d+ <.ws> }
And when you .parse that grammar, it effectively invokes the regex code ^ <TOP> $, with the anchors not being part of TOP's lexical scope but rather of a scope that merely calls the routine TOP. The combined behavior is as if the rule TOP had been written as:
regex TOP { ^ [:r :s \d+] $ }
1) The design docs are in general not to be taken as gospel for what is or isn't part of the Perl 6 language, but S05 is pretty accurate in that regard, except that it mentions some features that haven't been implemented yet but are planned. Anyone who wants to truly grok the intricacies of Perl 6 regexes/grammars, is IMO well served by reading the full S05 from top to bottom at least once.
There aren't two regex effects going on. The rule applies :sigspace. After that, the grammar is defined. When you call .parse, it starts at the beginning of the string and goes to the end (or fails). That anchoring isn't part of the grammar. It's part of how .parse applies the grammar.
My main issue was the odd way some of the things are worded in the docs. They aren't technically wrong, but they also tend to assume knowledge about things the reader might not know. In this case, the casual comment about anchoring TOP isn't as special as it seems. Any rule passed to .parse is anchored in the same way. There's no special behavior for that rule name other than it's the default value for :rule in a call to .parse.
how can I define the lex pattern ( ), or ( /* rem / ), and ( / foo / 100 / foo */ )
in using gnu (f)lex tool.
_space [ \t]
id [a-zA-Z_]+[a-zA-Z0-9_]
digit [0-9]
math_ops [\+\-\/\*\^\%]
rem_expr (({_space}*)*|("/*".*"*/")*|("//".*)*|([\n]*))*
arr_digid ("("*({digit}*|{id}*)*")"*){arr_expr1}*{math_ops}+
arr_expr1 {rem_expr}*{digit}*{rem_expr}*
arr_expr2 {rem_expr}*
%%
\({arr_expr2}*\) {
return _REM_;
}
\({arr_expr1}*\) {
return _PATTERN2_;
}
Generally, you do not return comments or whitespace from a lexer. Why would you? They are, by definition, not part of the semantics of the program you are trying to parse.
On the whole, the easiest way to deal with them is to just ignore them. Below, the first pattern matches any whitespace character other than newline (Use [[:space:]] to also ignore newlines), and the second one is a way of matching C-style comments. ("/*".*"*/" doesn't work because it will match from the beginning of the first comment on a line to the end of the last one.)
[[:blank:]] ;
[/][*][^*]*[*]+([^/*][^*]*[*]+)[/] ;
The fact that the patterns do not have an action (or, in general, do not have a return statement in their action) means that the (f)lex-generated scanner will simply proceed to analyze the next token.
Some other notes:
It's really not necessary to define a shortcut for every pattern. There is no problem with putting a pattern directly in the lex actions. And you certainly don't need to define shortcuts for character classes which already have shortcuts (like [[:blank:]] and [[:digit:]].
You don't need to backslash escape characters inside a character class, although with a couple of characters order is important. (That's why I used [*] in the C-comment pattern; I could equally have used "*" or \*, but I personally prefer [*].) So you could have defined:
math_ops [+/*^%-]
The - must go either at the end or the beginning of the list; ^ cannot go at the beginning, and (though you don't use it) ] would have to go at the beginning. The only character which requires backslash-escaping is a backslash itself.
However, my preference is always to let single-character tokens be handled with a single default rule at the end:
. { return yytext[0]; }
This is much more maintainable, and avoids the need to invent arbitrary token names for single-character tokens. You can just use a single-quoted character in your bison/yacc file.