In Perl 6, what is the meaning of <.before ... > in a regex?
It seems to mean the same as <?before ... >:
my $str = "Hello";
$str ~~ /<alpha> <?before 'o'>/;
say $/;
$str ~~ /<alpha> <.before 'o'>/;
say $/;
Output:
「l」
alpha => 「l」
「l」
alpha => 「l」
I would like to clarify this, because I saw some code that used <.before ...> and I was not sure what it was suppose to mean. It could be illegal syntax, or it could mean negative lookahead <!before ...> or positive lookahead <?before ...> or something else. I know that I can make a rule like <alpha> non-capturing by putting a dot in front of it <.alpha>, but it does not make sense for me for lookaheads since they are always non-capturing. In any case, it should then be written <.?before ... > or <.!before ... > to distinguish between positive and negative lookahead?
<.before ...> is the same as <?before ...>, though less obvious a look-ahead (which is why I recommend using the <?before ...> form).
Basically, <?anything> turns anything into a positive look-ahead, and <.anything> is a non-capturing subrule call. Since before is already a look-ahead, and neither form captures, they behave the same.
Related
I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
| $
| '_' <?before $>
| '_' <?before ['_' <?before $>]>
| '_' <?before ['_' <?before ['_' <?before $>]>]>
# ...
]>
]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per #user0721090601's comment on your Q, and as a variant of #p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading
It's possible to make a conjunction, so that the string matches 2 or more regex patterns.
> "banana" ~~ m:g/ . a && b . /
(「ba」)
Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract character class of vowels:
> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(「c」 「m」 「l」)
But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:
> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(「ana」)
TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").
Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:
say "banana" ~~ m:g/ <!before ban> . ** 3 / # (「ana」)
The before assertion is called a "zero width" assertion. This means that if it succeeds (which in this case means it does not "match" because we've written !before rather than just before), the matching position is not moved.
(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)
One way that will work for fairly simple patterns is using a negative after assertion:
say "banana" ~~ m:g/ . ** 3 <!after ban> / # (「ana」)
However, if the negative pattern is sufficiently complex you may need to use this formulation:
say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (「ana」)
This inserts a && regex conjunction operator that, presuming the LHS pattern succeeds, tries the RHS as well after resetting the matching position (which is why the RHS now starts with <!before ban> rather than <!after ban>) and requires that the RHS matches the same length of input (which is why the <!before ban> is followed by the .*? "padding").
What does it even mean to "negate" a regex?
When you talk about the computer science definition of a regex, then it always needs to match a whole string. In this scenario, negation is pretty easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".
If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with <!before ...>, but you need to be careful with anchoring: / ^ <!before A $ > .* / is this exact negation.
If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like / ^ [<!before A> .]* $ /.
If you have another definition of negation in mind, please share it.
When I use a named regex, I can print its contents:
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ / <rgx> /;
say $<rgx>; # 「ab」
But if I want to match with :g or :ex adverb, so there is more than one match, it doesn't work. The following
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ m:g/ <rgx> /;
say $<rgx>; # incorrect
gives an error:
Type List does not support associative indexing.
in block <unit> at test1.p6 line 5
How should I modify my code?
UPD: Based on #piojo's explanation, I modified the last line as follows and that solved my problem:
say $/[$_]<rgx> for ^$/.elems;
The following would be easier, but for some reason it doesn't work:
say $_<verb> for $/; # incorrect
It seems like :g and :overlap are special cases: if your match is repeated within the regex, like / <rgx>* /, then you would access the matches as $<rgx>[0], $<rgx>[1], etc.. But in this case, the engine is doing the whole match more than once. So you can access those matches through the top-level match operator, $/. In fact, $<foo> is just a shortcut for $/<foo>.
So based on the error message, we know that in this case, $/ is a list. So we can access your matches as $/[0]<rgx> and $/[1]<rgx>.
The < > has term precedence. Here's the example from the docs:
say <a b c>[1];
I figured the same precedence would apply to all of the quoting operators. This works:
my $string = '5+8i';
my $number = <<$string>>;
say $number;
This interpolates $string and creates allomorphes (in this case a ComplexStr):
(5+8i)
But, if I try to index it like the example from the docs, it doesn't compile:
my $string = '5+8i';
my $number = <<$string>>[0];
say $number;
I'm not quite sure what Perl 6 thinks is happening here. Perhaps it's thinking it's a hyperoperator:
===SORRY!=== Error while compiling ...
Cannot use variable $number in declaration to initialize itself
at /Users/brian/Desktop/scratch.pl:6
------> say $⏏number;
expecting any of:
statement end
statement modifier
statement modifier loop
term
I can skip the variable:
my $string = '5+8i';
say <<$string>>[0];
But that's a different error that can't find the closing quotes:
===SORRY!=== Error while compiling ...
Unable to parse expression in shell-quote words; couldn't find final '>>'
at /Users/brian/Desktop/scratch.pl:8
------> <BOL>⏏<EOL>
expecting any of:
statement end
statement modifier
statement modifier loop
I think this warrants a rakudobug email. I think the parser gets confused trying to interpret it as a hyper (aka >>.method). The following workaround seems to corroborate this:
my $string = '5+8i';
my $number = <<$string >>[0]; # note space before >>
say $number;
To satisfy your OCD, you could probably also put a space before $string.
And yes, whitespace is not meaningless in Perl 6.
Jonathan has the answer in response to RT #131695.
The >>[] is a postfix operator to index a list, so it tries to use it as such. This is intended behavior. Fair enough, although I think the parser is being a bit too clever for regular code monkeys here.
I'm using whittle to parse a grammar, but I'm running into the classical LALR ambiguity problem. My grammar looks like this (simplified):
<comment> ::= '{' <string> '}' # string enclosed in braces
<tag> ::= '[' <name> <quoted-string> ']' # [tagname "tag value"]
<name> ::= /[A-Za-z_]+/ # subset of all printable chars
<quoted-string> ::= '"' <string> '"' # string enclosed in quotes
<string> ::= /[:print:]/ # regex for all printable chars
The problem, of course, is <string>. It contains all printable characters and is therefore very greedy. Since it's an LALR parser, it tries to parse a <name> as a <string> and everything breaks. The grammar complicates things because it uses different string delimiters for different things, which is why I tried to make the <string> rule in the first place.
Is there a canonical way to normalize this grammar to make it LALR compliant, if it's even possible?
This is not "the classical LALR ambiguity problem", whatever that might be. It is simply an error in the lexical specification of the language.
I took a quick glance at the Whittle readme, but it didn't bear any resemblance to the grammar in the OP. So I'm assuming that the text in the OP is conceptual rather than literal, and the fact that it includes the obviously incorrect
<string> ::= /[:print:]/ # regex for all printable chars
is just a typo.
Better would have been /[:print:]*/, assuming that Ruby lets you get away with [:print:] rather than the Posix-standard [[:print:]].
But that wouldn't be correct either because lexing (usually) matches the longest possible string, and consequently that will gobble up the closing quote and any following text.
So the correct solution for quoted-string is to write it out correctly:
<quoted-string> ::= /"[^"]*"/
or even
<quoted-string> :: /"([^\\"]|\\.)*"/
# any number of characters other than quote or escape, or escaped pairs
You might have other ideas about how to escape internal double quotes; those are just examples. In both cases, you need to postprocess the token in order to (at least) strip the double-quotes and possible interpret escape sequences. That's just the way it goes.
Your comment sequences present a more difficult issue, assuming that your intention was that a comment might include nested braces (eg. {This comment {with this} ends here}) because the nested brace syntax is not regular and thus cannot be matched with a regular expression. Of course, very few "regular expression" libraries are really regular these days, and I don't know if Ruby contains some sort of brace-counting extension, like for example Lua's pattern syntax. The nested brace syntax is certainly context-free but to actually parse it you need to lexically analyze the contents of the outer {...} in a different way than the rest of the program.
It is this latter observation, and not any weakness in the LALR algorithm, that is causing you pain, and I'd say that this is a weakness with the (mostly undocumented afaics) lexical analysis section of whittle. In a flex-generated lexer, for example, it would be normal to use start conditions to separate the lexical environments (program / quoted string / braced comment), and the parser would then have no ambiguity.
Hope that helps.