Making character class with modifier symbols in Perl 6 - raku

I'd like to make a user-defined character class of "vowels", which will match any literal English vowel letter (a, e, i, o, u) as well as any of these letters with any possible diacritics: ắ ḗ ú̱ å ų̄ ẹ́ etc.
This is what I've tried to do, but it doesn't work:
> my $vowel = / <[aeiou]> <:Sk>* /
/ <[aeiou]> <:Sk>* /
> "áei" ~~ m:g/ <$vowel> /
(「e」 「i」)

You could try using ignoremark:
The :ignoremark or :m adverb instructs the regex engine to only
compare base characters, and ignore additional marks such as combining
accents.
For your example:
my $vowel = /:m<[aeiou]>/;
.say for "áeikj" ~~ m:g/ <$vowel> /;
Output:
「á」
「e」
「i」

The reason you can't match a vowel with a combining character using / <[aeiou]> <:Sk>* / is that strings in Perl 6 are operated on at the grapheme level. At that level, ų̄ is already just a single character, and <[aeiou]> being a character class already matches one whole character.
The right solution is, as Håkon pointed out in the other answer, to use the ignoremark adverb. You can put it before the regex like rx:m/ <[aeiou]> / or inside of it, or even turn it on and off at different points with :m and :!m.
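A minimal sketch of the three placements mentioned above, assuming a Rakudo build in which :ignoremark is implemented. The names $text, @outside, @inside and @reused are illustrative only. If the adverb behaves as the answers describe, each list should hold á, e and i.

```raku
my $text = "áeikj";

# Adverb given outside the regex, next to :g.
my @outside = ($text ~~ m:g:m/ <[aeiou]> /).map(~*);

# Adverb given inside the regex; it applies from that point onwards.
my @inside = ($text ~~ m:g/ :m <[aeiou]> /).map(~*);

# A reusable regex object built with rx:m and interpolated later.
my $vowel = rx:m/ <[aeiou]> /;
my @reused = ($text ~~ m:g/ <$vowel> /).map(~*);

# Print the three result lists, one per line.
.say for @outside, @inside, @reused;
```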

Anti-matching against an infinite family of <!before> patterns in Raku

I am trying to avoid matching whitespace at the end of a string while still matching whitespace in the middle of words.
Here is an example of a regex that matches underscores within x but does not match up to three trailing underscores.
say 'x_x___x________' ~~ /
    [
    | 'x'
    | '_' <!before [
        | $
        | '_' <?before $>
        | '_' <?before ['_' <?before $>]>
        | '_' <?before ['_' <?before ['_' <?before $>]>]>
        # ...
      ]>
    ]+
/;
Is there a way to construct the rest of the pattern implied by the ...?
It is a little difficult to discern what you are asking for.
You could be looking for something as simple as this:
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..3 /
# 「x_x___x」
or
say 'x_x___x________' ~~ / 'x'+ % '_' ** 1..2 /
# 「x_x」
or
say 'x_x___x________' ~~ / 'x'+ % '_'+ /
# 「x_x___x」
I would suggest using a capture, like this:
'x_x___x________' ~~ /(.*?) _* $/;
say $0; #「x_x___x」
(The ? modifier makes the * 'non-greedy'.)
Please let me know if I have missed the point!
avoid matching whitespace at the end of a string while still matching whitespace in the middle of words
Per Brad's answer, and your comment on it, something like this:
/ \w+ % \s+ /
what I'm looking for is a way to match arbitrarily long streams that end with a known pattern
Per @user0721090601's comment on your Q, and as a variant of @p6steve's answer, something like this:
/ \w+ % \s+ )> \s* $ /
The )> capture marker marks where capture is to end.
You can use arbitrary patterns on the left and right of that marker.
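As a small sketch of how the capture markers behave (the input strings and variable names here are made up for illustration), the pattern to the right of )> still has to match, but it is left out of the reported match:

```raku
my $input = "foo bar baz   ";

# Only )> is used: the regex must still reach the end of the string,
# but the reported match stops before the trailing whitespace.
my $end-only = $input ~~ / \w+ % \s+ )> \s* $ /;
say $end-only.Str;   # foo bar baz

# <( can likewise mark where the reported match starts.
my $both = "  foo bar  " ~~ / ^ \s* <( \w+ % \s+ )> \s* $ /;
say $both.Str;       # foo bar
```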
an infinite family of <!before> patterns
Generalizing to an infinite family of patterns of any type, whether they are zero-width or not, the most natural solution in a regex is iteration using any of the standard quantifiers that are open ended. For example, \s+ for one or more whitespace characters.[1] [2]
Is there a way to construct the rest of the pattern implied by the ...?
I'll generalize that to "Is there a way in a Raku regex to match some arbitrary pattern that could in theory be recognized by a computer program?"
The answer is always "Yes":
While Raku rules/regexes might look like traditional regexes they are in fact arbitrary functions embedded in an arbitrary program over which you ultimately have full control.
Rules have arbitrary read access to capture state.[3]
Rules can do arbitrary turing complete computation.[4]
A collection of rules/regexes can arbitrarily consume input and drive the parse/match state, i.e. can implement any parser.
In short, if it can be matched/parsed by any program written in any programming language, it can be matched/parsed using Raku rules/regexes.
Footnotes
[1] If you use an open ended quantifier you do need to make sure that each match iteration/recursion either consumes at least one character, or fails, so that you avoid an infinite loop. For example, the * quantifier will succeed even if the pattern it qualifies does not match, so be careful that that won't lead to an infinite loop.
[2] Given the way you wrote your example, perhaps you are curious about recursion rather than iteration. Suffice to say, it's easy to do that too.[1]
[3] In Raku rules, captures form a hierarchy. There are two special variables that track the capture state of two key levels of this hierarchy:
$¢ is the capture state of the innermost enclosing overall capture. Think of it as something analogous to a return value being constructed by the current function call in a stack of function calls.
$/ is the capture state of the innermost enclosing capture. Think of it as something analogous to a value being constructed by a particular block of code inside a function.
For example:
'123' ~~ / 1* ( 2* { print "$¢ $/" } ) 3* { print "$¢ $/" } / ; # 1 2123 123
The overall / ... / is analogous to an ordinary function call. The first 1 and first 123 of the output show what has been captured by that overall regex.
The ( ... ) sets up an inner capture for a part of the regex. The 2* { print "$¢ $/" } within it is analogous to a block of code. The 2 shows what it has captured.
The final 123 shows that, at the top level of the regex, $/ and $¢ have the same value.
[4] For example, the code in footnote 3 above includes arbitrary code inside the { ... } blocks. More generally:
Rules can be invoked recursively;
Rules can have full signatures and pass arguments;
Rules can contain arbitrary code;
Rules can use multiple dispatch semantics for resolution. Notably, this can include resolution based on longest match length.
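To make the "arbitrary code" point concrete, here is a hedged sketch using a code assertion, <?{ ... }>, which lets ordinary Raku code decide whether the match may continue. The octet token and the is-octet helper are hypothetical names invented for this example.

```raku
# Accept a run of digits only if its numeric value is below 256.
# The code assertion reads the token's own capture, $0.
my token octet { (\d+) <?{ $0 < 256 }> }

# Hypothetical helper: True if the whole string is a valid octet.
sub is-octet(Str $s --> Bool) { so $s ~~ / ^ <octet> $ / }

for '300', '25', '256', '7' -> $s {
    say "$s: ", is-octet($s);
}
# 300: False
# 25: True
# 256: False
# 7: True
```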
I’m wondering if Raku’s trim() routines might suit your purpose, for example: .trim, .trim-trailing or even .trim-leading. In the Raku REPL:
> say 'x x x ' ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x 」)
> say 'x x x '.trim-trailing ~~ m:g/ 'x'+ \s* /;
(「x 」 「x 」 「x」)
HTH.
https://docs.raku.org/routine/trim https://docs.raku.org/routine/trim-trailing https://docs.raku.org/routine/trim-leading

Grammar and unicode characters

Why does the grammar below fail to parse Unicode characters?
It parses fine after removing the word boundaries from <sym>.
#!/usr/bin/env perl6
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<✓> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('✓'); # Nil
From the « and » "left and right word boundary" doc:
[«] matches positions where there is a non-word character at the left, or the start of the string, and a word character to the right.
✓ isn't a word character. So the word boundary assertion fails.
What is and isn't a "word character"
"word", in the sense of the \w character class, has the same definition in P6 as it does in P5 (when not using the P5 /a regex modifier), namely letters, some decimal digits, or an underscore:
Characters whose Unicode general category starts with an L, which stands for Letter.[1]
Characters whose Unicode general category is Nd, which stands for Number, decimal.[2]
_, an underscore.
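A quick way to check these rules yourself is to print each character's general category next to the result of a \w test. This is only a sketch: the sample characters are taken from this thread, and is-word-char is a hypothetical helper name.

```raku
# Hypothetical helper: does a single character match \w?
sub is-word-char(Str $c --> Bool) { so $c ~~ /\w/ }

# uniprop with no argument reports the Unicode general category.
for 'y', '✓', 'ら', '१', '一', '¹', '_' -> $c {
    my $category = $c.uniprop;
    my $is-word  = is-word-char($c);
    say "$c  $category  $is-word";
}
```

If the definition above holds, only characters whose category starts with L, characters in Nd, and the underscore should report True.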
"alpha 'Nd under"
In a comment below @p6steve++ contributes a cute mnemonic that adds "under" to the usual "alphanum".
But "num" is kinda wrong because it isn't any number but only some decimal digits, specifically the characters that match the Unicode General Category Nd (matched by P6 regex /<:Nd>/).[2]
This leads naturally to alphaNdunder (alpha Nd under) pronounced "alpha 'nd under".
Footnotes
[1] Letters are matched by the P6 regex /<:L>/. This includes Ll (Letter, lowercase) (matched by /<:Ll>/) as JJ notes but also others including Lu (Letter, uppercase) and Lo (Letter, other), which latter includes the ら character JJ also mentions. There are other letter sub-categories too.
[2] Decimal digits with the Unicode general category Nd are matched by the P6 regex /<:Nd>/. This covers decimal digits that can be chained together to produce arbitrarily large decimal numbers where each digit position adds a power of ten. It excludes decimal digits that have a "typographic context" (my phrasing follows the example of Wikipedia). For example, 1 is the English decimal digit denoting one; it is included. But ¹ and ① are excluded because they have a "typographic context". For a billion+ people their native languages use १ to denote one and १ is included in the Nd category for decimal digits. But for another billion+ people their native languages use 一 for one but it is excluded from the Nd category (and is in the L category for letters instead). Similarly ६ (Devanagari 6) is included in the Nd category but 六 (Han number 6) is excluded.
I keep starting my answers with "Raiph is right". But he is. Also, an example of why this is so:
for <y ✓ Ⅲ> {
    say $_.uniprops;
    say m/<|w>/;
}
The first line of the loop prints each character's Unicode properties: the first character is a letter (Ll), the other two are not. The second line of the loop tests against the word boundary anchor; only the first character, which can be part of an actual word, matches that anchor. You can use any Ll character as part of a word, and in your grammar, but only characters with that Unicode property can actually form words.
grammar G {
    proto rule TOP { * }
    rule TOP:sym<y> { «<.sym>» }
    rule TOP:sym<ら> { «<.sym>» }
}
say G.parse('y'); # 「y」
say G.parse('ら'); # This is a hiragana letter, so it works.

How to negate/subtract regexes (not only character classes) in Perl 6?

It's possible to make a conjunction, so that the string matches 2 or more regex patterns.
> "banana" ~~ m:g/ . a && b . /
(「ba」)
Also, it's possible to negate a character class: if I want to match only consonants, I can take all the letters and subtract character class of vowels:
> "camelia" ~~ m:g/ <.alpha> && <-[aeiou]> /
(「c」 「m」 「l」)
But what if I need to negate/subtract not a character class, but a regex of any length? Something like this:
> "banana" ~~ m:g/ . **3 && NOT ban / # doesn't work
(「ana」)
TL;DR Moritz's answer covers some important issues. This answer focuses on matching sub-strings per Eugene's comment ("I want to find substring(s) that match regex R, but don't match regex A.").
Write an assertion that says you are NOT sitting immediately before the regex you don't want to match and then follow that with the regex you do want to match:
say "banana" ~~ m:g/ <!before ban> . ** 3 / # (「ana」)
The before assertion is called a "zero width" assertion. This means that if it succeeds (which in this case means it does not "match" because we've written !before rather than just before), the matching position is not moved.
(Of course, if such an assertion fails and there's no alternative pattern that matches at the current match position, the match engine then steps forward one character position.)
It's possible that you want the patterns in the opposite order, with the positive match first and the negative second, as you showed in your question. (Perhaps the positive match is faster than the negative, so reversing their order will speed up the match.)
One way that will work for fairly simple patterns is using a negative after assertion:
say "banana" ~~ m:g/ . ** 3 <!after ban> / # (「ana」)
However, if the negative pattern is sufficiently complex you may need to use this formulation:
say "banana" ~~ m:g/ . ** 3 && <!before ban> .*? / # (「ana」)
This inserts a && regex conjunction operator that, presuming the LHS pattern succeeds, tries the RHS as well after resetting the matching position (which is why the RHS now starts with <!before ban> rather than <!after ban>) and requires that the RHS matches the same length of input (which is why the <!before ban> is followed by the .*? "padding").
What does it even mean to "negate" a regex?
When you talk about the computer science definition of a regex, then it always needs to match a whole string. In this scenario, negation is pretty easy to define. But by default, regexes in Perl 6 search, so they don't have to match the whole string. This means you have to be careful to define what you mean by "negate".
If by negation of a regex A you mean a regex that matches whenever A does not match a whole string, and vice versa, you can indeed work with <!before ...>, but you need to be careful with anchoring: / ^ <!before A $ > .* / is this exact negation.
If by negation of a regex A you mean "only match if A matches nowhere in the string", you have to use something like / ^ [<!before A> .]* $ /.
If you have another definition of negation in mind, please share it.
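The two definitions give different answers on the same input. The following sketch runs both anchored patterns from this answer with A replaced by the literal ban; the variable names are illustrative only.

```raku
# "A does not match the whole string"
my $not-exactly-ban = / ^ <!before ban $ > .* /;

# "A matches nowhere in the string"
my $ban-nowhere = / ^ [ <!before ban> . ]* $ /;

for <ban banana nana> -> $s {
    say "$s: ", so($s ~~ $not-exactly-ban), " ", so($s ~~ $ban-nowhere);
}
# ban: False False
# banana: True False
# nana: True True
```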

How to remove diacritics in Perl 6

Two related questions.
Perl 6 is so smart that it understands a grapheme as one character, whether it is one Unicode symbol (like ä, U+00E4) or two and more combined symbols (like p̄ and ḏ̣). This little code
my @symb;
@symb.push("ä");
@symb.push("p" ~ 0x304.chr); # "p̄"
@symb.push("ḏ" ~ 0x323.chr); # "ḏ̣"
say "$_ has {$_.chars} character" for @symb;
gives the following output:
ä has 1 character
p̄ has 1 character
ḏ̣ has 1 character
But sometimes I would like to be able to do the following.
1) Remove diacritics from ä. So I need some method like
"ä".mymethod → "a"
2) Split "combined" symbols into parts, i.e. split p̄ into p and Combining Macron U+0304. E.g. something like the following in bash:
$ echo p̄ | grep . -o | wc -l
2
Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.
Per the documentation:
multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method samemark(Str:D: Str:D $pattern --> Str:D)
Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.
Examples:
say 'åäö'.samemark('aäo'); # OUTPUT: «aäo␤»
say 'åäö'.samemark('a'); # OUTPUT: «aao␤»
say samemark('Pêrl', 'a'); # OUTPUT: «Perl␤»
say samemark('aöä', ''); # OUTPUT: «aöä␤»
This can be used both to remove marks/diacritics from letters, as well as to add them.
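Applied to the question's first case, one possible sketch is a small wrapper that passes an unmarked letter as the pattern, so that every character takes on "no mark". The name strip-marks is hypothetical and not part of Raku.

```raku
# Hypothetical helper: copy the (absent) mark of 'a' onto every character.
sub strip-marks(Str $s --> Str) { $s.samemark('a') }

say strip-marks('ä');     # a
say strip-marks('Pêrl');  # Perl
```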
For (2), there are a few ways to do this (TIMTOWTDI). If you want a list of all the codepoints in a string, you can use the ords method to get a List (technically a Positional) of all the codepoints in the string.
say "p̄".ords; # OUTPUT: «(112 772)␤»
You can use the uniname method/routine to get the Unicode name for a codepoint:
.uniname.say for "p̄".ords; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
or just use the uninames method/routine:
.say for "p̄".uninames; # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»
If you just want the number of codepoints in the string, you can use codes:
say "p̄".codes; # OUTPUT: «2␤»
This differs from chars, which counts the number of characters (graphemes) in the string:
say "p̄".chars; # OUTPUT: «1␤»
Also see @hobbs' answer using NFD.
This is the best I was able to come up with from the docs — there might be a simpler way, but I'm not sure.
my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino
The .NFD method converts the string to normalization form D (decomposed), which separates graphemes out into base codepoints and combining codepoints whenever possible. The grep then returns a list of only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. The Uni.new(...).Str then assembles those codepoints back into a string.
You can also put these pieces together to answer your second question; e.g.:
$in.NFD.map: { Uni.new($_).Str }
will return a list of 1-character strings, each with a single decomposed codepoint, or
$in.NFD.map(&uniname).join("\n")
will make a nice little unicode debugger.
I can't say this is better or faster, but I strip diacritics in this way:
my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"

Regex for letters, digits, no spaces

I'm trying to create a regex to check for 6-12 characters, one being a digit, the rest being any characters, no spaces. Can regex do this? I'm trying to do this in Objective-C and I'm not familiar with regex at all. I've been reading a couple of tutorials, but most are for matching simple cases of a number, or a set of numbers, which is not exactly what I'm looking for. I can do it with methods, but I was wondering if that would be too slow, and I figured I could try learning something new.
asdfg1 == ok
asdfg 1 != ok
asdfgh != ok
123456 != ok
asdfasgdasgdasdfasdf != ok
Use this regex: ^(?=.*\d)(?=.*[a-zA-Z])[^ ]{6,12}$
It seems that you mean "letter" when you say "character", right? And (thanks to burning_LEGION for pointing that out) there may be only one digit?
In that case, use
^(?=\D*\d\D*$)[^\W_]{6,12}$
Explanation:
^ # Start of string
(?=\D*\d\D*$) # Assert that there is exactly one digit in the string
[^\W_] # Match a letter or digit (explanation below)
{6,12} # 6-12 times
$ # End of string
[^\W_] might look a little odd. How does it work? Well, \w matches any letter, digit or underscore. \W matches anything that \w doesn't match. So [^\W] (meaning "match any character that is not not alphanumeric/underscore") is essentially the same as \w, but by adding _ to this character class, we can remove the underscore from the list of allowed characters.
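For readers following the Raku threads above, here is an assumed translation of that pattern into Raku regex syntax, not something the original answer provides. <?before ...> stands in for (?=...), and <:L + :Nd> stands in for [^\W_]. The name $valid is illustrative.

```raku
# Assumed Raku equivalent of ^(?=\D*\d\D*$)[^\W_]{6,12}$
my $valid = / ^ <?before \D* \d \D* $ > <:L + :Nd> ** 6..12 $ /;

# The five sample inputs from the question.
for 'asdfg1', 'asdfg 1', 'asdfgh', '123456', 'asdfasgdasgdasdfasdf' -> $s {
    say "'$s': ", so($s ~~ $valid);
}
```

Under these assumptions only asdfg1 should be accepted, which agrees with the list in the question.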
I haven't tested this, but I think this is the answer:
(^[^\d\x20]*\d[^\d\x20]*$){6,12}
This is for one digit: ^[^\d\x20]{0,11}\d{1}[^\d\x20]{0,11}$ but I can't limit the total length to 6-12 this way. You can use another function to check the length first and, if it is between 6 and 12, check the string with the regex I wrote.