How to loop on sorted (with custom sort) keys of a hash in Raku?

Trying to progressively convert some Perl scripts to Raku. I am quite stuck with the following one, even after browsing quite a lot here and reading Learning Perl 6 more deeply.
The part on which I can't make progress is the last loop (converted to for); getting keys and sorting them by month name and day number looks impossible, but I am sure it is doable.
Any hints on how to achieve this with "idiomatic" syntax would be really welcome.
#!/usr/bin/perl
use strict;
my %totals;
while (<>) {
if (/redis/ and /Partial/) {
my($f1, $f2) = split(' ');
my $w = $f1 . ' ' . $f2;
$totals{$w}++;
}
}
my %m = ("jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,
"jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12);
foreach my $e (sort { my($a1, $a2) = split(' ', $a) ; my($b1, $b2) = split(' ', $b) ;
$m{lc $a1} <=> $m{lc $b1} or $a2 <=> $b2 } keys %totals) {
print "$e", " ", $totals{$e}, "\n";
}

Fed with the same sample data, your Perl code produces the same output as this.
my $data = q:to/END/;
may 01 xxx3.1 Partial redis
may 01 xxx3.2 Partial redis
may 01 xxx3.3 Partial redis
apr 22 xxx2.2 Partial redis
apr 22 xxx2.1 Partial redis
mar 01 xxx1 redis Partial
some multi-line
string
END
sub sort-by( $value )
{
state %m = <jan feb mar apr may jun jul aug sep oct nov dec> Z=> 1..12;
%m{ .[0].lc }, .[1] with $value.key.words;
}
say .key, ' ', .value.elems
for $data
.lines
.grep( /redis/ & /Partial/ )
.classify( *.words[0..1].Str )
.sort( &sort-by );

You could try something like:
enum Month (jan => 1, |<feb mar apr may jun jul aug sep oct nov dec>);
lines()
andthen .grep: /redis/&/Partial/
andthen .map: *.words
andthen .map: {Month::{.[0].lc} => .[1].Int}\
#or andthen .map: {Date.new: year => Date.today.year, month => Month::{.[0].lc}, day => .[1], }\
andthen bag $_
andthen .sort
andthen .map: *.put;

I think that this is close to what you are asking for... also shows that perl6/raku is quite closely related to perl5 unless you want to get fancy...
#test data...
my %totals = %(
"jan 2" => 3,
"jan 4" => 1,
"feb 7" => 1,
);
my %m = %("jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,
"jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12);
my &sorter = {
my ($a1, $a2) = split(' ', $^a);
my ($b1, $b2) = split(' ', $^b);
%m{lc $a1} <=> %m{lc $b1} or $a2 <=> $b2
}
for %totals.keys.sort(&sorter) ->$e {
say "$e => {%totals{$e}}"
}
#output
jan 2 => 3
jan 4 => 1
feb 7 => 1
The main changes are:
%totals{$e} for $totals{$e}
%() instead of {} for hash literals
for with method syntax and -> instead of foreach with sub syntax
$^a and $^b in sort routine need caret twigils (^)
say is a bit cleaner than print

TL;DR @wamba provides an idiomatic solution. This answer is a minimal "mechanical" translation instead.
I think your question and this answer suggest that a great way to learn many of Raku's basics as they relate to Perl is:
Feed a small Perl program into Rakudo;
Methodically investigate/fix each reported error until it works;
Post a question to StackOverflow if you get stuck.
Presuming that's what you did, great. If not, hopefully this answer will inspire you or other readers to try doing just that. 😍
The code
my %totals;
for lines() {
if (/redis/ and /Partial/) {
my ($f1, $f2) = split(' ', $_);
my $w = $f1 ~ ' ' ~ $f2;
%totals{$w}++;
}
}
my %m = ("jan" => 1, "feb" => 2, "mar" => 3, "apr" => 4, "may" => 5, "jun" => 6,
"jul" => 7, "aug" => 8, "sep" => 9, "oct" => 10, "nov" => 11, "dec" => 12);
for sort { my ($a1, $a2) = split(' ', $^a) ; my ($b1, $b2) = split(' ', $^b) ;
%m{lc $a1} <=> %m{lc $b1} or $a2 <=> $b2 }, keys %totals
-> $e {
print "$e", " ", %totals{$e}, "\n";
}
works for the following test input:
feb 1 redis Partial
jan 2 Partial redis
jan 2 redis Partial
The mechanical translation process
I began properly working on your question by feeding the code in your question to Rakudo. And was surprised to get Unsupported use of <>. .... It was Perl code, not Rakunian! 🙃
Then I saw @wamba had provided an idiomatic solution. I decided I'd do the most direct translation possible instead. My first attempt worked. Order restored. 🙂
I pondered how best to explain my changes. I wondered what the error messages would be if I went back to the start and just fixed one at a time. The result was a delightful series of good error messages. So I've structured the rest of this answer as a series of error messages/fixes/discussions, each one leading to the next, until the program just works.
In the interests of simplicity I drop most of the info from the error messages. The messages/fixes are in the order I encountered them by fixing one at a time:
Unsupported use of <>. In Raku please use: lines() to read input ...
------> while (<⏏>) {
(⏏ is Unicode's eject symbol, marking the point where the compiler conceptually "ejects" the code.)
The idiomatic replacement of Perl's while (<>) is for lines().
Variable '$f1' is not declared
------> my(⏏$f1, $f2) ...
Raku interprets code of the form foo(...) as a function call if it's where a function call makes sense. This takes priority over interpreting foo as a keyword (i.e. my as a variable declarator).
Next, because my($f1, $f2) is interpreted as a function call, the $f1 is interpreted as an argument that you haven't declared, leading to the error message.
Inserting whitespace after the my fixes both the real problem and this apparent one.
(This error occurred in multiple locations in your code; I applied the same fix each time.)
Unsupported use of . to concatenate strings. In Raku please use: ~.
------> my $w = $f1 .⏏ ' ' . $f2;
To help remember that ~ is used as a string operation in Raku, note that it looks like a piece of string. 🧵
Variable '$totals' is not declared. Did you mean '%totals'?
------> ⏏$totals{$w}++;
As Damian Conway notes, "We took this Perl table of what do you use when, and we made it this table instead".
The code $totals{...} is syntactically valid. One can bind or assign a hash (reference) to a scalar. But Rakudo (the Raku compiler) knows at compile time that the code hasn't declared a $totals variable, so it rightly complains.
Your code has declared a %totals variable, so Rakudo helpfully asks if you meant that.
(This error occurred in multiple locations in your code; I applied the same fix each time.)
Unsupported use of 'foreach'. In Raku please use: 'for'.
------> foreach⏏ my
Raku code tends to be shorter (and more readable) than Perl code. It's mostly due to design that goes beyond mere paint, but little things like s/foreach/for/ don't hurt.
This appears to be Perl code
------> for ⏏my $e
This error message is arguably LTA (I'm thinking "non-descriptive"). But it's equally arguably pretty good, all things considered.
Perl and Raku support binding of a value to a new lexical variable/parameter scoped to a block. Perl uses my, and puts the variable(s) in front of the value(s). Raku puts the value(s) first, inserts a -> between them and any variables, and skips the my.
(There's a good deal of richness to this use of -> that I'll not get into here because it doesn't matter for this example. But it's worth being aware that this change buys Rakoons a good deal, and you've got that to look forward to.)
Variable '$a' is not declared
------> for sort { my ($a1, $a2) = split(' ', ⏏$a)
As Perl devs know, Perl has the special variables $a and $b for use in sort blocks.
Raku generalized this notion to $^foo parameters, a convenient DRY way to add positional parameters to a closure while skipping the usual ceremony in which one has to specify the name twice (once to declare it, another time to use it).
An unusual aspect of these that might initially seem crazy, but is actually very ergonomic, is that their formal parameter position is determined by their name. So, given $^foo and $^bar, the latter will bind to the first positional argument, $^foo to the second.
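To make the name-based ordering concrete, here is a minimal sketch. The routine name `describe` and the sample strings are made up for illustration:

```raku
# Placeholder parameters get their positions from the sorted order of
# their names, not from the order in which they appear in the block.
my &describe = { 'foo=' ~ $^foo ~ ' bar=' ~ $^bar };

# 'bar' sorts before 'foo', so $^bar receives the first argument.
say describe('first', 'second');   # foo=second bar=first

# In a sort block, $^a and $^b therefore play the role of Perl's $a and $b.
say <pear fig banana>.sort({ $^a.chars <=> $^b.chars });   # (fig pear banana)
```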
Missing comma after block argument to sort
------> { ... }⏏ keys
I inserted a comma where indicated.
Calling split(Str) will never work with signature of the proto ($, $, |)
------> my ($f1, $f2) = ⏏split(' '
Some Perl routines implicitly presume use of $_. There's no syntactic way to know whether or not any given routine is implicitly using it. You just have to read each routine's definition. Raku dropped that.
So Rakudo concludes the split routine is missing the string that's to be split.
(The | in the "signature of the proto ($, $, |)" just means "other arguments that can optionally be passed", so you can ignore that. The $, $ indicates two arguments are required, so we're missing one.)
A quick check of the routine definition shows the sub version of split requires the string to be split as its second positional argument. Thus I switched to split(' ', $_).
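One caveat with this mechanical translation, as far as I can tell: Raku's split(' ', ...) splits on each single space and does not have the special whitespace-collapsing behaviour of Perl's split(' '). A small sketch, using a made-up input line with a doubled space:

```raku
my $line = 'jan  2 redis Partial';   # hypothetical line: two spaces after the month

# The sub form takes the delimiter first and the string second.
my @fields = split(' ', $line);
say @fields.elems;        # 5: the doubled space yields an empty field

# .words splits on runs of whitespace, much like Perl's split(' ').
my ($month, $day) = $line.words;
say "$month $day";        # jan 2
```

If the log lines are always separated by exactly one space, both forms give the same first two fields.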
The code works. \o/
Notably, all the actual error messages started with ===SORRY!=== Error while compiling. That means they were all caught before the program even ran, which is nice. 😎

You've already got good answers, but I'm taking the opportunity to expose you to some other standard Raku tools and idioms that seemed natural to me for your problem.
For both my solutions:
My equivalents of your %totals variable store keys in a structured data form rather than just as string keys. The supposed rationale is to simplify the sort and presentation. (But really it's to show you another way. It would of course be trivial to ensure the month and day numbers are concatenated as two two digit numbers to ensure correct sorting.) I use two different key types to show variations on this theme.
I deal with conversion to/from month names by constructing hashes mapping names to numbers. I declare one with the .pairs or .antipairs method, and then apply the reverse to convert in the other direction. I do this one way in the first solution and the other in the second. And I set the number for jan to 0 in one solution and 1 in the other.
Short and sweet, lean on Pairs
When declaring a %foo variable, if you don't specify its key type, it defaults to Str. But in this code, the key of each Pair in %totals is itself a Pair:
my %totals{Pair}; # Keys of the `Pair`s in `%totals` are themselves `Pair`s
my %months = <jan feb mar apr may jun jul aug sep oct nov dec> .pairs; # 0 => jan
for lines.grep(/redis/ & /Partial/)».words {
++%totals{ %months.antipairs.hash{ lc .[0] }.Int => .[1].Int }
}
for %totals .sort {
printf "%3s %2d : %-d\n", %months{.key.key}, .key.value, .value
}
If no sort closure(s) are specified, Raku's sort routine, when applied to a hash, sorts its entries by comparing their keys using cmp. Furthermore, for an ordinary hash, comparing two keys means comparing two strings.
That would work fine for your situation if these strings were each date's month and day formatted as two digits each and then concatenated. Alternatively, splitting and schwartzian works fine too. Raku's really good at that stuff but I preferred to go a different way with this answer, so that the default sort did the right thing.
For this first solution, I picked Pairs as the key type. When cmp compares Pairs, it sorts first by key and then by value within that. Both key and value were coerced to Ints, thus the above code correctly sorts by month, then days within that.
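A quick way to check this behaviour in isolation is a sketch like the following, where the sample (month => day) pairs are invented for illustration:

```raku
# Pairs compare by key first, then by value.
say (1 => 5) cmp (1 => 10);    # Less
say (2 => 1) cmp (1 => 30);    # More

# The Int coercion matters: as strings, '22' sorts before '3'.
say 22 cmp 3;                  # More
say '22' cmp '3';              # Less

# Hypothetical (month => day) pairs, all Ints.
my @dates = 5 => 1, 4 => 22, 3 => 1, 4 => 3;
say @dates.sort;               # (3 => 1 4 => 3 4 => 22 5 => 1)
```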
More structure, use Dates
This version adds structure and more fancy typing. It wraps the equivalent of the %totals hash (renamed %.data) into an outer object containing some utility routines, and makes the inner key object be a Date instead of a Pair:
role Totals {
my %months = <jan feb mar apr may jun jul aug sep oct nov dec> .antipairs «+» 1; # jan => 1
method month-name (Int $num --> Str) { %months.antipairs.hash{$num} }
method month-num (Str $name --> Int) { %months{lc $name} }
has %.data{Date} handles <sort>;
}
my $totals = Totals.new;
for lines.grep(/redis/ & /Partial/)».words {
++$totals.data{ Date.new: :year(2000), :month(Totals.month-num: .[0]), :day(.[1]) }
}
for $totals .sort {
printf "%3s %2d : %-d\n", Totals.month-name(.key.month), .key.day, .value
}
In the first solution, sort did the right thing because it was comparing Pairs, and cmp in turn did the right thing given how I'd set the pairs up.
In this solution sort/cmp do the right thing without needing to coerce string values to Ints, because the totals entries are Dates and they compare according to ordinary date comparison rules.

Related

Is it possible to interpolate Array values in token?

I'm working on homoglyphs module and I have to build regular expression that can find homoglyphed text corresponding to ASCII equivalent.
So for example I have character with no homoglyph alternatives:
my $f = 'f';
and character that can be obfuscated:
my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron
I can easily build regular expression that will detect homoglyphed phrase 'foo':
say 'Suspicious!' if $text ~~ / $f @o @o /;
But how should I compose such regular expression if I don't know the value to detect in compile time? Let's say I want to detect phishing that contains homoglyphed 'cash' word in messages. I can build sequence with all the alternatives:
my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length
Now obviously following solution cannot "unpack" array elements into the regular expression:
/ @lookup / # doing LTM, not searching elements in sequence
I can workaround this by manually quoting each element and compose text representation of alternatives to get string that can be evaluated as regular expression. And build token from that using string interpolation:
my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }
But that is quite error-prone.
Is there any cleaner solution to compose regular expression on the fly from arbitrary amount of elements not known at compile time?
The Unicode::Security module implements confusables by using the Unicode consortium tables. It's actually not using regular expressions, just looking up different characters in those tables.
I'm not sure this is the best approach to use.
I haven't implemented a confusables [1] module yet in Intl::, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking. [2]
my token confusable($source) {
:my $i = 0; # create a counter var
[
<?{ # succeed only if
my $a = self.orig.substr: self.pos+$i, 1; # the test character A
my $b = $source.substr: $i++, 1; # the source character B and
so $a eq $b # are the same or
|| $a eq %*confusables{$b}.any; # the A is one of B's confusables
}>
. # because we succeeded, consume a char
] ** {$source.chars} # repeat for each grapheme in the source
}
Here I used the dynamic hash %*confusables, which would be populated in some way. That will depend on your module, and it may not even need to be dynamic (for example, you could use the signature :($source, %confusables) or reference a module variable).
You can then have your code work as follows:
say $foo ~~ /<confusable: 'foo'>/
This is probably the best way to go about things, as it will give you a lot more control. I took a peek at your module and it's clear you want to enable 2-to-1 glyph relationships, and eventually you'll probably want to be running code directly over the characters.
If you are okay with just 1-to-1 relationships, you can go with a much simpler token:
my token confusable($source) {
:my @chars = $source.comb; # split the source
@( # match the array based on
|( # a slip of
%confusables{@chars.head} # the confusables
// Empty # (or nothing, if none)
), #
@chars.shift # and the char itself
) #
** {$source.chars} # repeating for each source char
}
The @(…) structure lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables with the original, and that's that. You have to be careful, though, because a non-existent hash item will return the type object (Any), and that messes things up here (hence // Empty).
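The underlying mechanism is that an array interpolated into a regex is matched as a set of literal alternatives. A minimal sketch of just that part, using a plain named array (the question's own three-way "o" list) rather than the ad hoc form:

```raku
my @o = 'o', 'о', 'ο';        # ASCII o, Cyrillic o, Greek omicron

# Each @o matches exactly one of the array's elements, taken literally.
my $rx = / ^ f @o @o $ /;

say so 'foo' ~~ $rx;          # True  (all ASCII)
say so 'fоο' ~~ $rx;          # True  (Cyrillic o, then Greek omicron)
say so 'fox' ~~ $rx;          # False
```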
In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolations errors.
[1] Unicode calls homographs both "visually similar characters" and "confusables".
[2] The dynamic hash here, %confusables, could be populated any number of ways, and may not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.

Perl6 regex not matching end $ character with filenames

I've been trying to learn Perl6 from Perl5, but the issue is that the regex works differently, and it isn't working properly.
I am making a test case to list all files in a directory ending in ".p6$"
This code works with the end character
if 'read.p6' ~~ /read\.p6$/ {
say "'read.p6' contains 'p6'";
}
However, if I try to fit this into a subroutine:
multi list_files_regex (Str $regex) {
my @files = dir;
for @files -> $file {
if $file.path ~~ /$regex/ {
say $file.path;
}
}
}
it no longer works. I don't think the issue is with the regex, but with the file name; there may be some attribute I'm not aware of.
How can I get the file name to match the regex in Perl6?
Regexes are a first-class language within Perl 6, rather than simply strings, and what you're seeing here is a result of that.
The form /$foo/ in Perl 6 regex will search for the string value in $foo, so it will be looking, literally, for the characters read\.p6$ (that is, with the dot and dollar sign).
Depending on the situation of the calling code, there are a couple of options:
If you really are receiving regexes as strings, for example read as input or from a file, then use $file.path ~~ /<$regex>/. This means it will treat what's in $regex as regex syntax.
If you will just be passing a range of different regexes in, change the parameter to be of type Regex, and then do $file.path ~~ $regex. In this case, you'd pass them like list_files_regex(/foo/).
Last but not least, dir takes a test parameter, and so you can instead write:
for dir(test => /<$regex>/) -> $file {
say $file.path;
}
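The difference between the two interpolation forms, and the Regex-typed alternative, can be sketched as follows. The sub name `name-matches` and the sample file names are invented for this example:

```raku
my $pattern = 'read\.p6$';          # regex source held in a plain string

# /$pattern/ searches for the string's characters literally.
say so 'read.p6' ~~ /$pattern/;     # False
# /<$pattern>/ treats the string's contents as regex syntax.
say so 'read.p6' ~~ /<$pattern>/;   # True

# Alternative: accept a real Regex object instead of a string.
sub name-matches(Regex $rx, Str $name --> Bool) { so $name ~~ $rx }

say name-matches(/ '.p6' $ /, 'read.p6');       # True
say name-matches(/ '.p6' $ /, 'read.p6.bak');   # False
```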

Perl 6: Difference between .. and ...?

What's the difference between .. and ... in Perl 6?
For example, the following lines will produce the same output:
for 1..5 { .say };
for 1...5 { .say };
.. constructs a range object (think mathematical interval).
... constructs a sequence (think lazily generated one-shot list).
If all I want to do is iterate over consecutive integers (e.g. for indexing), I prefer the former (it's the less general tool, and a character shorter to boot).
If you need more precise control, use the latter (e.g. the idiomatic example for generating the Fibonacci sequence in Perl 6 is given by the expression 1, 1, *+* ... *, where the third term *+* is the rule for inductively generating the elements).
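A few one-liners that show where the two operators diverge (a small sketch; the particular numbers are arbitrary):

```raku
say (1..5).^name;             # Range
say (1...5).^name;            # Seq

# A Range is an interval, so it can be tested for membership.
say 3.5 ~~ 1..5;              # True

# A Range only counts upwards; a sequence can count down.
say (5..1).elems;             # 0
say (5...1);                  # (5 4 3 2 1)

# A sequence can deduce a step or apply a generator rule.
say (1, 3 ... 9);             # (1 3 5 7 9)
say (1, 1, *+* ... *)[^8];    # (1 1 2 3 5 8 13 21)
```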

How to make Perl 6 grammar produce more than one match (like :ex and :ov)?

I want grammar to do something like this:
> "abc" ~~ m:ex/^ (\w ** 1..2) (\w ** 1..2) $ {say $0, $1}/
「ab」「c」
「a」「bc」
Or like this:
> my regex left { \S ** 1..2 }
> my regex right { \S ** 1..2 }
> "abc" ~~ m:ex/^ <left><right> $ {say $<left>, $<right>}/
「ab」「c」
「a」「bc」
Here is my grammar:
grammar LR {
regex TOP {
<left>
<right>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
my $string = "abc";
my $match = LR.parse($string);
say "input: $string";
printf "split: %s|%s\n", ~$match<left>, ~$match<right>;
Its output is:
$ input: abc
$ split: ab|c
So, <left> can be only greedy leaving nothing to <right>. How should I modify the code to match both possible variants?
$ input: abc
$ split: a|bc, ab|c
Grammars are designed to give zero or one answers, not more than that, so you have to use some tricks to make them do what you want.
Since Grammar.parse returns just one Match object, you have to use a different approach to get all matches:
sub callback($match) {
say $match;
}
grammar LR {
regex TOP {
<left>
<right>
$
{ callback($/) }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
LR.parse('abc');
Making the match fail by calling the <!> assertion (which always fails) forces the previous atoms to backtrack, and thus to find different solutions. Of course this makes the grammar less reusable, because it works outside the regular calling conventions for grammars.
Note that, for the caller, the LR.parse seems to always fail; you get all the matches as calls to the callback function.
A slightly nicer API (but the same approach underneath) is to use gather/take to get a sequence of all matches:
grammar LR {
regex TOP {
<left>
<right>
$
{ take $/ }
# make the match fail, thus forcing backtracking:
<!>
}
regex left {
\w ** 1..2
}
regex right {
\w ** 1..2
}
}
.say for gather LR.parse('abc');
I think Moritz Lenz, nickname moritz, author of the upcoming book "Parsing with Perl 6 Regexes and Grammars", is the person to ask about this. I probably should have just asked him to answer this SO...
Notes
In case anyone considers attempting to modify grammar.parse so that it supports :exhaustive, or otherwise hacking things to do what @evb wants, the following documents potentially useful inspiration/guidance that I gleaned from spelunking the relevant speculations document (S05) and searching the #perl6 and #perl6-dev irc logs.
7 years ago Moritz added an edit of S05:
A [regex] modifier that affects only the calling behaviour, and not the regex itself [eg :exhaustive] may only appear on constructs that involve a call (like m// [or grammar.parse]), and not on rx// [or regex { ... }].
(The [eg :exhaustive], [or grammar.parse], and [or regex { ... }] bits are extrapolations/interpretations/speculations I've added in this SO answer. They're not in the linked source.)
5 years ago Moritz expressed interest in implementing :exhaustive for matching (not parsing) features. Less than 2 minutes later jnthn showed a one liner that demo'd how he guessed he'd approach it. Less than 30 minutes later Moritz posted a working prototype. The final version landed 7 days later.
1 year ago Moritz said on #perl6 (emphasis added by me): "regexes and grammars aren't a good tool to find all possible ways to parse a string".
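For a case as small as the one in the question, it may be simpler to skip the grammar and collect the results of a plain :exhaustive match. A minimal sketch, assuming the two-capture regex from the question is all that is needed:

```raku
# m:ex returns a list with one Match object per way of matching.
my @matches = 'abc' ~~ m:ex/ ^ (\w ** 1..2) (\w ** 1..2) $ /;

say @matches.elems;           # 2

# Render each split as "left|right"; sorted so the order is predictable.
my @splits = @matches.map(-> $m { $m[0] ~ '|' ~ $m[1] }).sort;
say @splits.join(', ');       # ab|c, a|bc
```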
Hth.

What's the deal with all the different Perl 6 equality operators? (==, ===, eq, eqv, ~~, =:=, ...)

Perl 6 seems to have an explosion of equality operators. What is =:=? What's the difference between leg and cmp? Or eqv and ===?
Does anyone have a good summary?
=:= tests if two containers (variables or items of arrays or hashes) are aliased, ie if one changes, does the other change as well?
my $x;
my @a = 1, 2, 3;
# $x =:= @a[0] is false
$x := @a[0];
# now $x == 1, and $x =:= @a[0] is true
$x = 4;
# now @a is 4, 2, 3
As for the others: === tests if two references point to the same object, and eqv tests if two things are structurally equivalent. So [1, 2, 3] === [1, 2, 3] will be false (not the same array), but [1, 2, 3] eqv [1, 2, 3] will be true (same structure).
leg compares strings like Perl 5's cmp, while Perl 6's cmp is smarter and will compare numbers like <=> and strings like leg.
13 leg 4 # Less (-1): leg compares as strings, and "1" sorts before "4"
13 cmp 4 # More (+1): both are numbers, so numeric comparison is used
Finally ~~ is the "smart match"; it answers the question "does $x match $y". If $y is a type, it's a type check. If $y is a regex, it's a regex match, and so on.
Does the summary in Synopsis 3: Comparison semantics do what you want, or were you already reading that? The design docs link to the test files where those features are used, so you can see examples of their use and their current test state.
Perl 6's comparison operators are much more suited to a dynamic language and all of the things going on. Instead of just comparing strings or numbers (or turning things into strings or numbers), now you can test things precisely with an operator that does what you want. You can test the value, the container, the type, and so on.
In one of the comments, you ask about eqv and cmp. In the old days of Perl 5, cmp was there for sorting and returns one of three magic values (-1,0,1), and it did that with string semantics always. In Perl 6, cmp returns one of three types of Order objects, so you don't have to remember what -1, 0, or 1 means. Also, the new cmp doesn't force string semantics, so it can be smarter when handed numbers (unlike Perl 5's which would sort like 1, 10, 11, 2, 20, 21 ...).
The leg (less than, equal, greater than) is cmp with string semantics. It's defined as Perl 6's ~$a cmp ~$b, where ~ is the new "string contextualizer" that forces string semantics. With leg, you are always doing a string comparison, just like the old Perl 5 cmp.
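The operators discussed in these answers can be tried side by side. This is a small sketch with arbitrary sample values, not an exhaustive tour:

```raku
my @a = 1, 2, 3;
my @b = 1, 2, 3;

say @a === @b;      # False: two distinct Array objects
say @a eqv @b;      # True:  same type and same structure
say @a =:= @a;      # True:  the very same container

say 1 == 1.0;       # True:  numeric equality
say 1 eqv 1.0;      # False: Int versus Rat

say 13 leg 4;       # Less: string comparison
say 13 cmp 4;       # More: numeric comparison, since both are numbers
say '13' cmp '4';   # Less: both are strings, so cmp behaves like leg

say 42 ~~ Int;      # True: smart match against a type is a type check
```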
If you still have questions on the other operators, let's break them down into separate questions. :)