How can Raku's capturing-group behavior in alternations be made the same as Perl's?

How can Raku's handling of capturing groups inside an alternation be made to behave like Perl's regex engine? For example:
> 'abefo' ~~ /a [(b) | (c) (d)] (e)[(f)|(g)]/
「abef」
0 => 「b」
2 => 「e」
3 => 「f」
What I need is the 'usual' Perl regex result (keeping Raku's zero-based indexing):
$0 = 'b'
$1 = undef
$2 = undef
$3 = 'e'
$4 = 'f'
Please give some useful guidance.

Quoting the Synopsis 5: Regexes and Rules design speculation document:
it is still possible to mimic the monotonic Perl 5 capture indexing semantics
Inserting a $3= for the (e):
/ a [ (b) | (c) (d) ] $3=(e) [ (f) | (g) ] /
andthen say 'abefo' ~~ $_
「abef」
0 => 「b」
3 => 「e」
4 => 「f」
I've briefly looked for a mention of this in the doc but didn't see it.
So maybe we should file a doc issue asking for this to be mentioned, presumably in the "Capture numbers" and "$ ($1, $2, ...)" sections.
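To make the renumbering easier to check, here is a minimal sketch that stores the regex from above in a variable (the name $rx is just for illustration) and prints the captures individually. I've left the last line without an expected result; it is only there so you can inspect what $1 holds when the first branch matched.

```raku
# Same regex as above; $3=(e) forces the (e) group to index 3,
# and the groups after it continue counting from there.
my $rx = / a [ (b) | (c) (d) ] $3=(e) [ (f) | (g) ] /;

if 'abefo' ~~ $rx {
    say ~$0;    # b
    say ~$3;    # e
    say ~$4;    # f
    # $1 and $2 belong to the branch that did not match:
    say $1.defined ?? '$1 is set' !! '$1 is unset';
}
```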

The question is a little unclear to me, but going back to Perl 5 semantics surely also means changing the alternation operator.
Perl5's | alternation operator is one in which the "first matching alternative" wins. The equivalent alternation operator in Raku is ||.
Raku's | alternation operator performs Longest Token Matching (LTM), which roughly means that if you separate your alternatives with |, you don't have to order them by longest token yourself to get the desired result.
https://docs.raku.org/language/regexes#Alternation:_||
https://docs.raku.org/language/regexes#Longest_alternation:_|
(As for capture numbering, maybe you can submit a request that that gets handled by the :Perl5 or :P5 regex adverb? See: https://docs.raku.org/language/regexes#Perl_compatibility_adverb )
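A small sketch of the difference between the two operators, using a made-up input string where the first alternative is a prefix of the second:

```raku
# || tries the alternatives left to right; the first one that matches wins.
say ~('foobar' ~~ / foo || foobar /);   # foo

# | uses longest token matching; the longest alternative wins.
say ~('foobar' ~~ / foo |  foobar /);   # foobar
```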

Related

Anti-matching against an infinite family of <!before> patterns in Raku

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
| $
| '_' <?before $>
| '_' <?before ['_' <?before $>]>
| '_' <?before ['_' <?before ['_' <?before $>]>]>
# ...
]>
]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
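Here is a minimal sketch of the marker in action. The input string is made up, and I've grouped \w+ in brackets (my own assumption about what is wanted) so that the % separator applies between whole words rather than between single characters:

```raku
my $text = "foo bar  baz   ";

# Everything to the right of )> must still match,
# but it is not included in the resulting match.
if $text ~~ / [\w+]+ % \s+ )> \s* $ / {
    say ~$/;    # foo bar  baz
}
```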
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
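As a tiny illustration of "rules can contain arbitrary code", here is a sketch using a code assertion, <?{ ... }>, which makes the match succeed only if the embedded code returns a true value. The input string and the evenness test are made up for the example:

```raku
# Keep only the numbers that are even; the check is ordinary Raku code
# that reads the capture made so far.
my @even = ('12 7 30' ~~ m:g/ (\d+) <?{ $0 %% 2 }> /).map(*.Str);
say @even;    # [12 30]
```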
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such a regular expression if I don't know the value to detect at compile time? Let's say I want to detect phishing that contains the homoglyphed word 'cash' in messages. I can build a sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously the following cannot "unpack" the array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can work around this by manually quoting each element and composing a text representation of the alternatives, giving a string that can be evaluated as a regular expression. Then I build a token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
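To show what a plain table lookup without regexes could look like, here is a minimal sketch. It is not how Unicode::Security is implemented; the %confusables table and the looks-like sub are hypothetical names made up for this example, and it only handles 1-to-1 relationships.

```raku
# Hypothetical lookup table: ASCII character => look-alike characters.
my %confusables =
    c => <с ϲ ς>,
    a => <а α>;

# True if every character of $text is the corresponding character of
# $source or one of its listed look-alikes.
sub looks-like(Str $text, Str $source --> Bool) {
    return False unless $text.chars == $source.chars;
    for $text.comb Z $source.comb -> ($t, $s) {
        my @alts = (%confusables{$s} // ()).list;
        return False unless $t eq $s or $t eq any(@alts);
    }
    True
}

say looks-like('саsh', 'cash');   # True  (Cyrillic с and а)
say looks-like('bash', 'cash');   # False
```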
I haven't implemented a confusables[1] module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.[2]
my token confusable($source) {
:my $i = 0; # create a counter var
[
<?{ # succeed only if
my $a = self.orig.substr: self.pos+$i, 1; # the test character A
my $b = $source.substr: $i++, 1; # the source character B and
so $a eq $b # are the same or
|| $a eq %*confusables{$b}.any; # the A is one of B's confusables
}>
. # because we succeeded, consume a char
] ** {$source.chars} # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way. How will depend on your module, and it may not even need to be dynamic (for example, the token could have the signature :($source, %confusables), or it could reference a module variable).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
:my @chars = $source.comb; # split the source
@( # match the array based on
|( # a slip of
%confusables{@chars.head} # the confusables
// Empty # (or nothing, if none)
), #
@chars.shift # and the char itself
) #
** {$source.chars} # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash %*confusables could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.

How to negate/subtract regexes (not only character classes) in Perl 6?

It's possible to make a conjunction, so that the string matches 2 or more regex patterns.
> "banana" ~~ m:g/ . a && b . /
(「ba」)
Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract character class of vowels:
> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(「c」 「m」 「l」)
But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:
> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(「ana」)
TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").
Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:
say "banana" ~~ m:g/ <!before ban> . ** 3 / # (「ana」)
The before assertion is called a "zero width" assertion. This means that if it succeeds (which in this case means it does not "match" because we've written !before rather than just before), the matching position is not moved.
(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)
One way that will work for fairly simple patterns is using a negative after assertion:
say "banana" ~~ m:g/ . ** 3 <!after ban> / # (「ana」)
However, if the negative pattern is sufficiently complex you may need to use this formulation:
say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (「ana」)
This inserts a && regex conjunction operator that, presuming the LHS pattern succeeds, tries the RHS as well after resetting the matching position (which is why the RHS now starts with <!before ban> rather than <!after ban>) and requires that the RHS matches the same length of input (which is why the <!before ban> is followed by the .*? "padding").
What does it even mean to "negate" a regex?
When you talk about the computer science definition of a regex, then it always needs to match a whole string. In this scenario, negation is pretty easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".
If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with <!before ...>, but you need to be careful with anchoring: / ^ <!before A $ > .* / is this exact negation.
If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like / ^ [<!before A> .]* $ /.
If you have another definition of negation in mind, please share it.
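A short sketch of the two definitions above, with A replaced by the literal pattern ban (the variable names are made up for the example):

```raku
# "A does not match the whole string": only the exact string 'ban' is rejected.
my $not-exactly-ban = / ^ <!before ban $ > .* /;
say so 'ban'    ~~ $not-exactly-ban;   # False
say so 'banana' ~~ $not-exactly-ban;   # True

# "A matches nowhere in the string": any string containing 'ban' is rejected.
my $no-ban-anywhere = / ^ [ <!before ban> . ]* $ /;
say so 'apple'  ~~ $no-ban-anywhere;   # True
say so 'cabana' ~~ $no-ban-anywhere;   # False
```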

Accessing parts of match in Perl 6

When I use a named regex, I can print its contents:
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ / <rgx> /;
say $<rgx>; # 「ab」
But if I want to match with :g or :ex adverb, so there is more than one match, it doesn't work. The following
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ m:g/ <rgx> /;
say $<rgx>; # incorrect
gives an error:
Type List does not support associative indexing.
in block <unit> at test1.p6 line 5
How should I modify my code?
UPD: Based on @piojo's explanation, I modified the last line as follows and that solved my problem:
say $/[$_]<rgx> for ^$/.elems;
The following would be easier, but for some reason it doesn't work:
say $_<rgx> for $/; # incorrect
It seems like :g and :overlap are special cases: if your match is repeated within the regex, like / <rgx>* /, then you would access the matches as $<rgx>[0], $<rgx>[1], etc. But in this case, the engine is doing the whole match more than once. So you can access those matches through the top-level match variable, $/. In fact, $<foo> is just a shortcut for $/<foo>.
So based on the error message, we know that in this case, $/ is a list. So we can access your matches as $/[0]<rgx> and $/[1]<rgx>.
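Putting that together, here is a minimal sketch using the regex from the question. Calling .list on $/ is my own addition; presumably the plain `for $/` loop fails because $/ holds the list as a single item, so the loop runs only once over the whole list.

```raku
my regex rgx { \w\w };
'abcd' ~~ m:g/ <rgx> /;

# With :g, $/ is a list of Match objects, one per overall match.
say ~$/[0]<rgx>;    # ab
say ~$/[1]<rgx>;    # cd

# Going through .list visits each Match in turn.
my @parts = $/.list.map({ ~.<rgx> });
say @parts;         # [ab cd]
```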

How to make Perl 6 grammar produce more than one match (like :ex and :ov)?

I want grammar to do something like this:
> "abc" ~~ m:ex/^ (\w ** 1..2) (\w ** 1..2) $ {say $0, $1}/
「ab」「c」
「a」「bc」
Or like this:
> my regex left { \S ** 1..2 }
> my regex right { \S ** 1..2 }
> "abc" ~~ m:ex/^ <left><right> $ {say $<left>, $<right>}/
「ab」「c」
「a」「bc」
Here is my grammar:
grammar LR {
regex TOP {
<left>
<right>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
my $string = "abc";
my $match = LR.parse($string);
say "input: $string";
printf "split: %s|%s\n", ~$match<left>, ~$match<right>;
Its output is:
$ input: abc
$ split: ab|c
So, <left> can be only greedy leaving nothing to <right>. How should I modify the code to match both possible variants?
$ input: abc
$ split: a|bc, ab|c
Grammars are designed to give zero or one answers, not more than that, so you have to use some tricks to make them do what you want.
Since Grammar.parse returns just one Match object, you have to use a different approach to get all matches:
sub callback($match) {
say $match;
}
grammar LR {
regex TOP {
<left>
<right>
$
{ callback($/) }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
LR.parse('abc');
Making the match fail by calling the <!> assertion (which always fails) forces the previous atoms to backtrack, and thus finding different solutions. Of course this makes the grammar less reusable, because it works outside the regular calling conventions for grammars.
Note that, for the caller, the LR.parse seems to always fail; you get all the matches as calls to the callback function.
A slightly nicer API (but the same approach underneath) is to use gather/take to get a sequence of all matches:
grammar LR {
regex TOP {
<left>
<right>
$
{ take $/ }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
.say for gather LR.parse('abc');
I think Moritz Lenz, nickname moritz, author of the upcoming book "Parsing with Perl 6 Regexes and Grammars", is the person to ask about this. I probably should have just asked him to answer this SO...
Notes
In case anyone considers attempting to modify grammar.parse so that it supports :exhaustive, or otherwise hacking things to do what @evb wants, the following documents potentially useful inspiration/guidance that I gleaned from spelunking the relevant speculations document (S05) and searching the #perl6 and #perl6-dev irc logs.
7 years ago Moritz added an edit of S05:
A [regex] modifier that affects only the calling behaviour, and not the regex itself [eg :exhaustive] may only appear on constructs that involve a call (like m// [or grammar.parse]), and not on rx// [or regex { ... }].
(The [eg :exhaustive], [or grammar.parse], and [or regex { ... }] bits are extrapolations/interpretations/speculations I've added in this SO answer. They're not in the linked source.)
5 years ago Moritz expressed interest in implementing :exhaustive for matching (not parsing) features. Less than 2 minutes later jnthn showed a one liner that demo'd how he guessed he'd approach it. Less than 30 minutes later Moritz posted a working prototype. The final version landed 7 days later.
1 year ago Moritz said on #perl6 (emphasis added by me): "regexes and grammars aren't a good tool to find all possible ways to parse a string".
Hth.