"Nothing" in Lookaround terms [RAKU] - regex-lookarounds

I was reading the "Tilde for nesting structures" section of the regexes documentation.
The side note explaining the use of <?> says:
Here <?> successfully matches the null string.
I assumed I could use <?[]> in its place, but that does not work.
As an example:
say so "" ~~ / <?> /;
say so "test" ~~ / <?> /;
say so "" ~~ / <?[]> /;
say so "test" ~~ / <?[]> /;
The response:
True
True
False
False
Could someone give me an explanation about this?

The syntax <?[]> means a lookahead matching an empty character class. Observe that an empty character class also never matches:
say "x" ~~ /<[]>/ # Nil
A character class specifies a set of characters that could be matched. An empty character class implies an empty set of characters, and so cannot possibly match anything.
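As a minimal sketch of the difference (the non-empty classes [t] and [x] are illustrative and not part of the question), a lookahead on a character class succeeds only where the next character belongs to the class, whereas <?> asserts nothing and therefore succeeds anywhere:

```raku
# <?> asserts nothing, so it succeeds at any position, even in an empty string.
my $null-assert = so "" ~~ / <?> /;
# <?[t]> is a zero-width lookahead: it succeeds where the next character is "t".
my $class-hit = so "test" ~~ / <?[t]> /;
# No position in "test" is followed by "x", so this lookahead fails everywhere.
my $class-miss = so "test" ~~ / <?[x]> /;
# An empty class contains no characters, so no position can satisfy it.
my $empty-class = so "test" ~~ / <?[]> /;

say $null-assert;   # True
say $class-hit;     # True
say $class-miss;    # False
say $empty-class;   # False
```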

Related

Raku - String invoking anonymous regex

Why does the first expression interpret but not the second?
I understand them to be identical, that is, a string invoking an anonymous regex.
("foo").first: /foo/ andthen say "$_ bar";
> foo bar
"foo": /foo/ andthen say "$_ bar";
> Confused
> at /home/dmc7z/raku/foo.raku:2
> ------> "foo":⏏ /foo/ andthen say "$_ bar";
> expecting any of:
> colon pair
This is a method call:
("foo").first: /foo/
It's the same as if you had written:
("foo").first( /foo/ )
Or just:
"foo".first( /foo/ )
(Note that each of the three English descriptions above ends with a colon. That is where the idea comes from: a : signals that whatever follows is part of the same expression.)
In this case it doesn't make a whole lot of sense to use first. Instead I would use ~~.
"foo" ~~ /foo/ andthen say "$_ bar";
first is used to find the first item in a list that matches some description. If you use it on a single item it will either return the item, or return Nil. It always returns one value, and Nil is the most sensible single undefined value in this case.
The reason it says it's expecting a colon pair is that a colon pair is the only use of : that could be valid in that location. Honestly, I half expected it to complain about an invalid label.
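A short sketch of the points above, using a made-up list (@fruits is purely illustrative): the colon form and the parenthesized form are the same method call, first returns Nil when nothing matches, and ~~ is the more natural choice for a single string.

```raku
my @fruits = <apple banana cherry>;   # hypothetical sample data

# The colon form and the parenthesized form are the same method call.
my $colon-form = @fruits.first: /an/;
my $paren-form = @fruits.first(/an/);
say $colon-form;                  # banana
say $colon-form eq $paren-form;   # True

# With no matching item, first returns Nil, which is undefined.
my $missing-defined = "foo".first(/bar/).defined;
say $missing-defined;             # False

# For a single string, a smartmatch reads more naturally.
"foo" ~~ /foo/ andthen say "$_ bar";   # foo bar
```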

How to insert variable in user-defined character class?

What I am trying to do is to allow programs to define character class depending on text encountered. However, <[]> takes characters literally, and the following yields an error:
my $all1Line = slurp "htmlFile";
my @a = ($all1Line ~~ m:g/ (\" || \') ~ $0 {} :my $marker = $0; http <-[ $marker ]>*? page <-[ $marker ]>*? /); # error: $marker is taken literally as $ m a r k e r
I wanted to match all links that are the format "https://foo?page=0?ssl=1" or 'http ... page ...'
Based on your example code and text, I'm not entirely sure what your source data looks like, so I can't provide much more detailed information. That said, if the goal is to match characters taken from an earlier part of the match, the easiest way to do this is with array matching:
my $input = "(abc)aaaaaa(def)ddee(ghi)gihgih(jkl)mnmnoo";
my @output = $input ~~ m:g/
    :my @valid;                 # initialize variable in regex scope
    '(' ~ ')' $<valid>=(.*?)    # capture initial text
    { @valid = $<valid>.comb }  # split the text into characters
    $<text>=(@valid+)           # capture text, so long as it contains the characters
/;
say @output;
.say for @output.map(*<text>.Str);
The output of which is
[「(abc)aaaaaa」
valid => 「abc」
text => 「aaaaaa」 「(def)ddee」
valid => 「def」
text => 「ddee」 「(ghi)gihgih」
valid => 「ghi」
text => 「gihgih」]
aaaaaa
ddee
gihgih
Alternatively, you could store the entire character class definition in a variable and reference it as <$marker-char-class>, or, if you want to avoid that, define it inline as code to be interpreted as regex with <{ '<[' ~ $marker ~ ']>' }>. Note that both methods share the same problem: you're constructing the character class from regex syntax, which may require escape characters or particular ordering, so this is definitely suboptimal.
If it's something you'll do often rather than ad hoc, you could also define your own regex method token, but that's probably overkill and would be better served by its own question.
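A minimal sketch of the two interpolation forms, assuming the marker contains only plain letters so that no escaping is needed (the variable names and sample string are hypothetical):

```raku
my $marker = 'abc';                   # hypothetical set of allowed characters
my $class  = '<[' ~ $marker ~ ']>';   # the string "<[abc]>"

# <$class> interpolates the string and interprets it as regex source.
my $via-variable = "xxabcabz" ~~ / <$class>+ /;
# <{ ... }> runs the code and interprets its result as regex source.
my $via-block = "xxabcabz" ~~ / <{ '<[' ~ $marker ~ ']>' }>+ /;

say ~$via-variable;   # abcab
say ~$via-block;      # abcab
```

With markers such as quotes, brackets, or hyphens, the string would have to be escaped before it is spliced into the class, which is the drawback noted above.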

Making character class with modifier symbols in Perl 6

I'd like to make a user-defined character class of "vowels", which will match any literal English vowel letter (a, e, i, o, u) as well as any of these letters with any possible diacritics: ắ ḗ ú̱ å ų̄ ẹ́ etc.
This is what I've tried to do, but it doesn't work:
> my $vowel = / <[aeiou]> <:Sk>* /
/ <[aeiou]> <:Sk>* /
> "áei" ~~ m:g/ <$vowel> /
(「e」 「i」)
You could try using ignoremark:
The :ignoremark or :m adverb instructs the regex engine to only
compare base characters, and ignore additional marks such as combining
accents.
For your example:
my $vowel = /:m<[aeiou]>/;
.say for "áeikj" ~~ m:g/ <$vowel> /;
Output:
「á」
「e」
「i」
The reason you can't match a vowel with a combining character using / <[aeiou]> <:Sk>* / is that strings in Perl 6 are operated on at the grapheme level. At that level, ų̄ is already just a single character, and <[aeiou]> being a character class already matches one whole character.
The right solution is, as Håkon pointed out in the other answer, to use the ignoremark adverb. You can put it before the regex like rx:m/ <[aeiou]> / or inside of it, or even turn it on and off at different points with :m and :!m.
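To illustrate the grapheme-level behaviour, here is a small sketch; the escaped string is an assumed spelling of a "u" followed by two combining marks (ogonek and macron), and the rest reuses the example from the answer above:

```raku
# A base letter plus two combining marks still counts as one character.
my $decorated = "u\x[0328]\x[0304]";
say $decorated.chars;   # 1

# With :m (ignoremark), the class compares base characters only.
my $vowel   = / :m <[aeiou]> /;
my @matches = "áeikj" ~~ m:g/ <$vowel> /;
say @matches.elems;     # 3
```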

How to negate/subtract regexes (not only character classes) in Perl 6?

It's possible to make a conjunction, so that the string matches 2 or more regex patterns.
> "banana" ~~ m:g/ . a && b . /
(「ba」)
Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract character class of vowels:
> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(「c」 「m」 「l」)
But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:
> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(「ana」)
TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").
Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:
say "banana" ~~ m:g/ <!before ban> . ** 3 / # (「ana」)
The before assertion is called a "zero width" assertion. This means that if it succeeds (which in this case means it does not "match" because we've written !before rather than just before), the matching position is not moved.
(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)
One way that will work for fairly simple patterns is using a negative after assertion:
say "banana" ~~ m:g/ . ** 3 <!after ban> / # (「ana」)
However, if the negative pattern is sufficiently complex you may need to use this formulation:
say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (「ana」)
This inserts a && regex conjunction operator that, presuming the LHS pattern succeeds, tries the RHS as well after resetting the matching position (which is why the RHS now starts with <!before ban> rather than <!after ban>) and requires that the RHS matches the same length of input (which is why the <!before ban> is followed by the .*? "padding").
What does it even mean to "negate" a regex?
When you talk about the computer science definition of a regex, then it always needs to match a whole string. In this scenario, negation is pretty easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".
If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with <!before ...>, but you need to be careful with anchoring: / ^ <!before A $ > .* / is this exact negation.
If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like / ^ [<!before A> .]* $ /.
If you have another definition of negation in mind, please share it.
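The two definitions can be compared side by side. This is only a sketch that takes the literal ban as the regex A; the test strings are made up:

```raku
# Definition 1: match every string except one that is exactly "ban".
my $not-exactly-ban = / ^ <!before ban $ > .* /;
# Definition 2: match only strings in which "ban" occurs nowhere.
my $ban-nowhere = / ^ [ <!before ban> . ]* $ /;

my $d1-ban    = so "ban"    ~~ $not-exactly-ban;
my $d1-banana = so "banana" ~~ $not-exactly-ban;
my $d2-cabana = so "cabana" ~~ $ban-nowhere;
my $d2-nab    = so "nab"    ~~ $ban-nowhere;

say $d1-ban;      # False: the whole string is "ban"
say $d1-banana;   # True:  "ban" is only a prefix
say $d2-cabana;   # False: "ban" occurs inside the string
say $d2-nab;      # True:  "ban" occurs nowhere
```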

Accessing parts of match in Perl 6

When I use a named regex, I can print its contents:
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ / <rgx> /;
say $<rgx>; # 「ab」
But if I want to match with :g or :ex adverb, so there is more than one match, it doesn't work. The following
my regex rgx { \w\w };
my $string = 'abcd';
$string ~~ m:g/ <rgx> /;
say $<rgx>; # incorrect
gives an error:
Type List does not support associative indexing.
in block <unit> at test1.p6 line 5
How should I modify my code?
UPD: Based on @piojo's explanation, I modified the last line as follows and that solved my problem:
say $/[$_]<rgx> for ^$/.elems;
The following would be easier, but for some reason it doesn't work:
say $_<verb> for $/; # incorrect
It seems like :g and :overlap are special cases: if your match is repeated within the regex, like / <rgx>* /, then you would access the matches as $<rgx>[0], $<rgx>[1], etc. But in this case, the engine is doing the whole match more than once, so you access those matches through the top-level match variable, $/. In fact, $<foo> is just a shortcut for $/<foo>.
So based on the error message, we know that in this case, $/ is a list. So we can access your matches as $/[0]<rgx> and $/[1]<rgx>.
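A sketch of two ways to loop over those matches without indexing by hand. The likely reason a bare "for $/" fails is that $/ is a scalar container, so the loop sees the whole list as one item; treat that as an assumption. Calling .list on it, or assigning the result to an array, iterates over the individual matches:

```raku
my regex rgx { \w\w };

# Assigning the result of m:g to an array gives one Match per element.
my @matches = 'abcd' ~~ m:g/ <rgx> /;
my @found   = @matches.map(*<rgx>.Str);
say @found;                # [ab cd]

# Alternatively, flatten $/ explicitly before looping over it.
say .<rgx> for $/.list;    # 「ab」 then 「cd」
```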