How to insert variable in user-defined character class? - variables

What I am trying to do is allow a program to define a character class depending on the text it encounters. However, <[]> takes characters literally, and the following yields an error:
my $all1Line = slurp "htmlFile";
my @a = ($all1Line ~~ m:g/ (\" || \') ~ $0 {} :my $marker = $0; http <-[ $marker ]>*? page <-[ $marker ]>*? /); # error: $marker is taken literally as $ m a r k e r
I wanted to match all links that are the format "https://foo?page=0?ssl=1" or 'http ... page ...'

Based on your example code and text, I'm not entirely sure what your source data looks like, so I can't provide much more detailed information. That said, going by how to match characters from an earlier part of the match, the easiest way to do this is with array matching:
my $input = "(abc)aaaaaa(def)ddee(ghi)gihgih(jkl)mnmnoo";
my @output = $input ~~ m:g/
    :my @valid;                    # initialize variable in regex scope
    '(' ~ ')' $<valid>=(.*?)       # capture initial text
    { @valid = $<valid>.comb }     # split the text into characters
    $<text>=(@valid+)              # capture text, so long as it contains the characters
/;
say @output;
.say for @output.map(*<text>.Str);
The output of which is
[「(abc)aaaaaa」
valid => 「abc」
text => 「aaaaaa」 「(def)ddee」
valid => 「def」
text => 「ddee」 「(ghi)gihgih」
valid => 「ghi」
text => 「gihgih」]
aaaaaa
ddee
gihgih
Alternatively, you could store the entire character class definition in a variable and reference the variable as <$marker-char-class>, or, if you want to avoid that, you can define it all inline as code to be interpreted as regex with <{ '<[' ~ $marker ~ ']>' }>. Note that both methods are subject to the same problem: you're constructing the character class from regex syntax, which may require escape characters or a particular ordering, and so is definitely suboptimal.
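As a minimal sketch of those two interpolation forms: the variable names and the sample string below are made up for illustration, and the sketch assumes the characters need no escaping inside a character class (no ], \, or similar).

```raku
my $chars = 'abc';                    # characters discovered at run time (illustrative)
my $class = '<[' ~ $chars ~ ']>';     # regex source text for the character class

# Form 1: interpolate a string variable as regex source
say 'xxabcabyy' ~~ / <$class>+ /;                      # 「abcab」

# Form 2: build the same source inline in a code block
say 'xxabcabyy' ~~ / <{ '<[' ~ $chars ~ ']>' }>+ /;    # 「abcab」
```

Both forms re-interpret a string as regex syntax at run time, which is exactly why special characters in $chars would need care.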
If it's something you'll do very often and not very adhoc, you could also define your own regex method token, but that's probably very overkill and would serve better as its own question.

Related

Anti-matching against an infinite family of <!before> patterns in Raku

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
[
| 'x'
| '_' <!before [
| $
| '_' <?before $>
| '_' <?before ['_' <?before $>]>
| '_' <?before ['_' <?before ['_' <?before $>]>]>
# ...
]>
]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a Capture..., thusly:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
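A small sketch of the capture marker in action, using whitespace-separated single characters in the spirit of the question's example (the input string is made up for illustration):

```raku
# Everything after )> must still match, but is left out of the resulting Match.
my $m = 'x x   x    ' ~~ / \w+ % \s+ )> \s* $ /;
say "[$m]";    # [x x   x]
say $m.to;     # 7
```

Note that the quantifier in \w+ % \s+ applies to the single atom \w, so this pattern matches single word characters separated by whitespace; for multi-character words you would presumably group the word first, as in [\w+]+ % \s+.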
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
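To make the recursion point concrete, here is a minimal sketch of a grammar whose rule invokes itself to match nested parentheses. The grammar and token names are invented for this illustration.

```raku
grammar Nested {
    token TOP   { <group> }
    # A group is '(' ... ')' containing non-paren characters or further groups.
    token group { '(' [ <-[\(\)]> || <group> ]* ')' }
}

say so Nested.parse('(a(b)c)');   # True
say so Nested.parse('(a(b');      # False
```

Each iteration of the quantified bracket either consumes one character or a whole nested group, so the loop cannot spin without making progress.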
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading

Partial Match in a Grammar

I have a simple grammar, and I am using it to parse some text. The text is user-supplied, but my program guarantees that it starts with a match to the grammar. (That is, if my grammar matched only a, the text might be abc or a or a_.) However, when I use the .parse method on my grammar, it fails on any non-exact match. How can I perform a partial match?
In Raku, Grammar.parse has to match the whole string. This is what causes it to fail if your grammar would only match a in the string abc. To allow matching only part of the input string, you can use Grammar.subparse instead.
grammar Foo {
token TOP { 'a' }
}
my $string = 'abc';
say Foo.parse($string); # Nil
say Foo.subparse($string); # 「a」
The input string will need to start with the potential Match. Otherwise, you will get a failed match.
say Foo.subparse('cbacb'); # #<failed match>
You can work around this using a Capture marker.
grammar Bar {
token TOP {
<-[a]>* # Match 0 or more characters that are *not* a
<( 'a' # Start the match, and match a single 'a'
}
}
say Bar.parse('a'); # 「a」
say Bar.subparse('a'); # 「a」
say Bar.parse('abc'); # Nil
say Bar.subparse('abc'); # 「a」
say Bar.parse('cbabc'); # Nil
say Bar.subparse('cbabc'); # 「a」
This works because <-[a]>*, a character class that includes any character except the letter a, will consume all the characters before a potential a. However, the Capture marker will cause these to be dropped from the eventual Match object, leaving you with just the a you wanted to match.
TL;DR
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
# Partial match anchored to end of string:
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
Vocabulary
There are traditionally two takes on the general notion of text "matching":
"Parsing"
"Regexes"
Raku:
Provides a unified text pattern language and engine that do both jobs.
Makes it easy to stick to one perspective, or other, or blend them, or refactor between them, as suits an individual dev and/or individual use case.
Takes "parsing" to mean more or less a single match starting at the start of the input string whereas "regexes" are much more flexible.
What you've written in your question and your first comment on Tyil's answer reflects the inherent ambiguity of the topic. I'll provide two answers rather than one, to try to help you and/or other readers be clearer about Raku's use of vocabulary and about your options functionality-wise.
Limited "partial matching" via .parse et al
You began with:
Partial match in a grammar ... I have a simple grammar ... my program guarantees that it starts with a match to the grammar
With that in mind, here's your question:
How can I perform a partial match?
The phrases "guarantees that it starts" and "partial match" are ambiguous.
One take is that you want what I'll call a "prefix" match, matching one or more characters anchored from the start of the string, and not merely any sub-string starting and ending anywhere in the input string.
This nicely fits with "parsing", or at least Raku's use of the word in its grammar methods.
All the built in Grammar methods with parse in their name insert an anchor to the start of the string in whatever grammar rule they use to start the parsing process. You cannot remove that anchor. This reflects the choice of vocabulary; "parse" is taken to mean matching from the start no matter what else happens.
The parse method for this "prefix" scenario is .subparse:
grammar foo { token TOP { a* } }
# Partial match anchored at start of string:
say .subparse: 'abcaa' given foo; # 「a」
See also:
Search of SO for "[raku] subparse".
raku doc for .subparse.
But perhaps "guarantees that it starts" and "partial match" did not mean that you wanted anchoring at the start. Your comment on Tyil's answer highlights this ambiguity:
Will .subparse only match at the start, or match anywhere in the string?
Tyil provides a workaround. You can do what Tyil shows, but it'll only match if the very first a encountered in the input string is the one that's at the start of the sub-string you want your "parse" to match.
If instead the first a was a false positive, and there was a second or a subsequent a you wanted the "parse" match to start at, then, at least in the Raku world, it's helpful to call that "regexing" rather than "parsing" and to use "regex" matching via the ~~ smartmatch operator.
Unlimited "partial matching" via ~~
Raku lets you do unlimited partial matching if you use its ~~ construct with a regex.
For example, you could write:
# End of match at end of string:
#                         ↓
say 'abcaa' ~~ token { a* $ } # 「aa」
~~ with a regex tells Raku to:
Try to match starting at the first character position in the string on the LHS;
If that fails, step forward one character, and try again, with the new position in the input string treated as a fresh starting point;
Repeat that until either matching once, or failing to find any match in the entire string.
Here I've left the start position of the match unspecified (which ~~ takes to mean it can be anywhere in the string) and anchored the end of the pattern to the end of the input string. So it successfully matches the aa at the end of the string.
This anchoring freedom illustrates just one of the many ways that ~~ smart matching provides much greater matching flexibility than using the parse methods.
If you have an existing grammar you can still use that:
grammar foo { token TOP { a* } }
# Anchor matching to end of string:
#                            ↓
say 'abcaa' ~~ / <.foo::TOP> $ /; # 「aa」
You have to name both the grammar and the rule within it you wish to invoke and put them inside <...>. And you need to insert a . to avoid a correspondingly named sub-capture, presuming you don't want that.
Here's another example:
# Longest partial match, no anchoring:
say ('abcaaabcaabc' ~~ m:g/ <.foo::TOP> /).max(*.chars); # 「aaa」
"Parsing" in Raku always starts at the beginning of an input string and results in either no match or one match.
In contrast, a "regex" can match arbitrary fragments, and can match any number of fragments. (You can even match overlapping fragments.)
In my last example I used :g, which is short for :global, which is a well known feature among traditional regex engines. :g matches as many times as a match is found in the input string (but not overlapping).
The match operation then returns either Nil (no matches at all) or a list of match objects (one or more). I've applied a .max(*.chars) to yield the longest match (the first if there are multiple longest sub-strings).

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can workaround this by manually quoting each element and compose text representation of alternatives to get string that can be evaluated as regular expression. And build token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
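For comparison, the table-lookup idea can be sketched without any regex at all. This is not the Unicode::Security API; the %alternatives hash and the looks-like sub are hypothetical names made up for this illustration, and the table holds only a couple of entries.

```raku
# Hypothetical lookup table: plain character => its known look-alikes.
my %alternatives = o => ('о', 'ο'), a => ('а', 'α');

# True if $candidate equals $plain character by character,
# allowing each position to be a listed look-alike instead.
sub looks-like(Str $candidate, Str $plain --> Bool) {
    return False unless $candidate.chars == $plain.chars;
    for $candidate.comb Z $plain.comb -> ($c, $p) {
        return False unless $c eq $p || $c (elem) (%alternatives{$p} // ());
    }
    True
}

say looks-like('fоο', 'foo');   # True  (Cyrillic о, Greek ο)
say looks-like('fxo', 'foo');   # False
```

The // () fallback keeps characters with no table entry from being compared against an undefined value. This only checks a whole string of known length; scanning a longer text for a disguised word would still need a search loop or a regex around it.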
I haven't implemented a confusables[1] module in Intl:: yet, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.[2]
my token confusable($source) {
:my $i = 0; # create a counter var
[
<?{ # succeed only if
my $a = self.orig.substr: self.pos+$i, 1; # the test character A
my $b = $source.substr: $i++, 1; # the source character B and
so $a eq $b # are the same or
|| $a eq %*confusables{$b}.any; # the A is one of B's confusables
}>
. # because we succeeded, consume a char
] ** {$source.chars} # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way. How depends on your module, and it may not even need to be dynamic (for example, you could use the signature :($source, %confusables) or reference a module variable).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module, and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
    :my @chars = $source.comb;          # split the source
    @(                                  # match the array based on
        |(                              #   a slip of
            %confusables{@chars.head}   #   the confusables
            // Empty                    #   (or nothing, if none)
        ),                              #
        @chars.shift                    #   and the char itself
    )                                   #
    ** {$source.chars}                  # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence // Empty).
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash here, %confusables, could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.

Perl6 regex not matching end $ character with filenames

I've been trying to learn Perl6 from Perl5, but the issue is that the regex works differently, and it isn't working properly.
I am making a test case to list all files in a directory ending in ".p6$"
This code works with the end character
if 'read.p6' ~~ /read\.p6$/ {
say "'read.p6' contains 'p6'";
}
However, if I try to fit this into a subroutine:
multi list_files_regex (Str $regex) {
    my @files = dir;
    for @files -> $file {
        if $file.path ~~ /$regex/ {
            say $file.path;
        }
    }
}
it no longer works. I don't think the issue is with the regex but with the file name; there may be some attribute I'm not aware of.
How can I get the file name to match the regex in Perl6?
Regexes are a first-class language within Perl 6, rather than simply strings, and what you're seeing here is a result of that.
The form /$foo/ in Perl 6 regex will search for the string value in $foo, so it will be looking, literally, for the characters read\.p6$ (that is, with the dot and dollar sign).
Depending on the situation of the calling code, there are a couple of options:
If you really are receiving regexes as strings, for example read as input or from a file, then use $file.path ~~ /<$regex>/. This means it will treat what's in $regex as regex syntax.
If you will just be passing a range of different regexes in, change the parameter to be of type Regex, and then do $file.path ~~ $regex. In this case, you'd pass them like list_files_regex(/foo/).
Last but not least, dir takes a test parameter, and so you can instead write:
for dir(test => /<$regex>/) -> $file {
say $file.path;
}
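The first two options can be sketched side by side without touching the file system. The matching-names helper and the list of names are hypothetical, made up for this illustration; it filters plain strings rather than the results of dir.

```raku
# Hypothetical helper with one candidate per kind of argument.
multi matching-names(Regex $pattern, @names) { @names.grep($pattern) }      # a real Regex object
multi matching-names(Str $source, @names)    { @names.grep(/<$source>/) }   # regex source held in a string

my @names = <read.p6 notes.txt write.p6>;
say matching-names(/ '.p6' $ /, @names);   # (read.p6 write.p6)
say matching-names('\.p6$', @names);       # (read.p6 write.p6)
```

Passing a Regex keeps the pattern checked at compile time; the Str variant is only needed when the pattern genuinely arrives as text.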

Accessing parts of match in Perl 6

When I use a named regex, I can print its contents:
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ / <rgx> /;
say $<rgx>; # 「ab」
But if I want to match with :g or :ex adverb, so there is more than one match, it doesn't work. The following
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ m:g/ <rgx> /;
say $<rgx>; # incorrect
gives an error:
Type List does not support associative indexing.
in block <unit> at test1.p6 line 5
How should I modify my code?
UPD: Based on @piojo's explanation, I modified the last line as follows and that solved my problem:
say $/[$_]<rgx> for ^$/.elems;
The following would be easier, but for some reason it doesn't work:
say $_<verb> for $/; # incorrect
It seems like :g and :overlap are special cases: if your match is repeated within the regex, like / <rgx>* /, then you would access the matches as $<rgx>[0], $<rgx>[1], etc. But in this case, the engine is doing the whole match more than once, so you access those matches through the top-level match variable, $/. In fact, $<foo> is just a shortcut for $/<foo>.
So based on the error message, we know that in this case, $/ is a list. So we can access your matches as $/[0]<rgx> and $/[1]<rgx>.
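Put together, a short sketch that collects the list returned by m:g and pulls the named capture out of each element (the @matches variable name is just for illustration):

```raku
my regex rgx { \w\w };
my @matches = 'abcd' ~~ m:g/ <rgx> /;   # one Match object per global match

say @matches.elems;                # 2
say @matches.map(*<rgx>.Str);      # (ab cd)
```

Each element is an ordinary Match, so the usual $<name> style access works on it once you index into the list.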