Put named capture from regex in Subset into a variable in the signature - raku

Consider
subset MySubset of Str where * ~~ /^ \d $<interesting> = ( \d+ ) $/;
Now I want to use the subset as a Type in my signature, but put the captured part(s) into a variable via unpacking, kinda like
sub f( MySubset $( :$interesting ) )
{
say $interesting;
}
f( "12345678" ); # should say 2345678
That's not working of course. Is it even possible to do this?

Subsignature unpacking is about turning a value into a Capture and matching against that.
class Point {
has ( $.x, $.y );
}
my ( :$x, :$y ) := Point.new( x => 3, y => 4 ).Capture;
say "[$x,$y]"; # [3,4]
Since a Str doesn't have a public attribute named $.interesting, it won't match.
A subset is just extra code to check a value more completely than you could otherwise do. It does not turn the value into a new type.
It would be more likely to work if you used $<interesting>.
sub f( MySubset )
{
say $<interesting>;
}
Of course since blocks get their own $/, this also does not work.
While it might be nice to pass information from a subset to a signature, I am not aware of any way to do it.
As a side note, where already does smart matching so it is an incredibly bad idea to use ~~ inside of it.
This is basically how your subset works:
"12345678" ~~ ( * ~~ /…/ )
In this particular case you could just use .substr
sub f( MySubset $_ ) {
.substr(1)
}

I can't figure out a way with a subset type, however there is a way - with a little...creativity - to do a match and unpack it in the signature.
Match inherits from Capture, so having one be unpacked in a signature is straightforward - if only we can arrange for there to be a parameter that contains the Match we wish to unpack. One way to do that is to introduce a further parameter with a default. We can't really stop anyone passing to it - though we can make it a pain to do so by using the anonymous named parameter. Thus, if we write this:
sub foo($value, :$ (:$col, :$row) = $value.match(/^$<col>=[<:L>+]$<row>=[\d+]$/)) {
say $col;
say $row;
}
And call it as foo("AB23"), the output is:
「AB」
「23」
Finally, we may factor the rule out to a named token, achieving:
my token colrow { ^$<col>=[<:L>+]$<row>=[\d+]$ }
sub foo($value, :$ (:$col, :$row) = $value.match(&colrow)) {
say $col;
say $row;
}

I'm pretty sure wheres (and subsets) just answer True/False. Brad concurs.
There are essentially always metaprogramming answers to questions but I presume you don't mean that (and almost never dig that deep anyway).
So here are a couple ways to get something approaching what you seem to be after.
A (dubious due to MONKEYing) solution based on Brad's insights:
use MONKEY;
augment class Str {
method MyMatch { self ~~ / ^ \d $<interesting> = ( \d+ ) $ / }
}
class MyMatch is Match {}
sub f( MyMatch() $foo (:$interesting) ) { say ~$interesting }
f( "12345678" ); # 2345678
The bad news is that the sub dispatch works even if the string doesn't match. The doc makes it clear that the coercer method (method MyMatch in the above) cannot currently signal failure:
The method is assumed to return the correct type — no additional checks on the result are currently performed.
One can hope that one day augmenting a class will be an officially respectable thing to do (rather than requiring a use MONKEY...) and that coercing can signal failure. At that point I think this might be a decent solution.
A variant on the above that binds to $/ so you can use $<interesting>:
use MONKEY;
augment class Str {
method MyMatch { self ~~ / ^ \d $<interesting> = ( \d+ ) $ / }
}
class MyMatch is Match {}
sub f( MyMatch() $/ ) { say ~$<interesting> }
f( "12345678" ); # 2345678
Another way that avoids MONKEYing around is to use a subset as you suggest but separate the regex and subset:
my regex Regex { ^ \d $<interesting> = ( \d+ ) $ }
subset Subset of Str where &Regex;
sub f( Subset $foo ; $interesting = ~($foo ~~ &Regex)<interesting> )
{
say $interesting;
}
f( "12345678" ); # 2345678
Notes:
The regex parses the input value at least twice. First in the Subset to decide whether the call dispatches to the sub. But the result of the match is thrown away -- the value arrives as a string. Then the regex matches again so the match can be deconstructed. With current Rakudo, if the sub were a multi, it would be even worse -- the regex would be used three times because Rakudo currently does both a trial bind as part of deciding which multi to match, and then does another bind for the actual call.
Parameters can be set to values based on previous parameters. I've done that with $interesting. A signature can have parameters that are part of dispatch decisions, and others that are not. These are separated by a semicolon. I've combined these two features to create another variable, thinking you might think that a positive thing. Your comment suggests you don't, which is more than reasonable. :)

Related

Why no "each" method on Perl6 sequences?

Sometimes I'll start writing a chain of method calls at the Perl 6 REPL, like:
".".IO.dir.grep(...).map(...).
...and then I realize that what I want to do with the final list is print every element on its own line. I would expect sequences to have something like an each method so I could end the chain with .each(*.say), but there's no method like that that I can find. Instead I have to return to the beginning of the line and prepend .say for. It feels like it breaks up the flow of my thoughts.
It's a minor annoyance, but it strikes me as such a glaring omission that I wonder if I'm missing some easy alternative. The only ones I can think of are ».say and .join("\n").say, but the former can operate on the elements out of order (if I understand correctly) and the latter constructs a single string which could be problematically large, depending on the input list.
You can roll your own.
use MONKEY;
augment class Any
{
method each( &block )
{
for self -> $value {
&block( $value );
}
}
};
List.^compose;
Seq.^compose;
(1, 2).each({ .say });
(2, 3).map(* + 1).each({ .say });
# 1
# 2
# 3
# 4
If you like this, there's your First CPAN module opportunity right there.
As you wrote in the comment, adding another .map(*.say) also prints a line of True values when used in the REPL. You can try calling the .sink method after the last map:
".".IO.dir.grep({$_.contains('e')}).map(*.uc).map(*.say).sink

How to require 1 or more of an argument in MAIN

Right now, I have a MAIN sub that can take one or more string arguments. But I am using two separate parameters for MAIN to do it:
sub MAIN (
Str:D $file,
*@files,
) {
@files.prepend: $file;
# Rest of the program
}
Now I am wondering if there's a more idiomatic way to achieve this, as my current solution feels a little clunky, and not very Perly.
You could do it with a proto sub
proto sub MAIN ( $, *@ ){*}
multi sub MAIN ( *@files ) {
# Rest of the program
}
or with sub-signature deparsing
sub MAIN ( *@files ($,*@) ) {
# Rest of the program
}
At the risk of "over answering" - my take on "Perly" is concise as possible without becoming obscure (perhaps I'm just replacing one subjective term with two others... :-)
If you have a "slurpy" array as the only parameter, then it will happily accept no arguments, which is outside the spec you put in the comments. However, a positional parameter is compulsory by default, and protos are only necessary if you want to factor out constraints on all multis, which is presumably overkill for what you want here. So, this is enough:
sub MAIN($file , *@others) {
say "Received file, $file, and @others.elems() others."
}
This is close to what mr_ron put - but why not go with the default Usage message that MAIN kindly whips up for you by examining your parameters:
$ ./f.pl
Usage:
./f.pl <file> [<others> ...]
Some might say I cheated by dropping the Str type constraint on the first parameter but it really doesn't buy you much when you're restricting to strings because numerics specified at the CLI come through as type IntStr (a kind-of hybrid type) that satisfies a Str constraint. OTOH, when constraining CLI parameters to Num or Int, Perl6 will check that you're actually putting digits there - or at least, what unicode considers digits.
If you're wanting actual filenames, you can save yourself a validation step by constraining to type IO(). Then it will only work if you name a file. And finally, putting where .r after the parameter will insist that it be readable to boot:
sub MAIN(IO() $file where .r, *@others) { ...
One short line that insists on one compulsory argument that is a filename referencing a readable file, with a varying number of other parameters and a useful Usage message auto generated if it all goes sideways...
Perhaps good enough answer here:
sub MAIN(*@a where {.elems > 0 and .all ~~ Str}) {
say "got at least 1 file name"
}
sub USAGE {
say "{$*PROGRAM-NAME}: <file-name> [ <file-name> ... ]"
}
Based on docs here:
https://docs.perl6.org/type/Signature#Constraining_Slurpy_Arguments
You can also try and use simply dynamic variables:
die "Usage: $*EXECUTABLE <file> <file2>*" if !+@*ARGS;
my @files = @*ARGS;
where @*ARGS is an array holding the arguments passed on the command line.
You can even use $*ARGFILES, since the arguments are actually files.

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
value = reflect.ValueOf(m).Elem()
for i := 0; i < value.NumField(); i++ {
if t, ok := quick.Value(value.Field(i).Type(), r); ok {
value.Field(i).Set(t)
}
}
return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with random generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
out = in // Make sure you return something
if in.Kind() == reflect.String {
instr := in.String()
t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
outstr, _, _ := transform.String(t, instr)
out = reflect.ValueOf(outstr)
}
return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generator interface needs the method to be defined on the type itself, not on a pointer to the type. You want your method signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
var buffer bytes.Buffer
for i := 0; i < size; i++ {
buffer.WriteString(string(latin[rand.Intn(len(latin))]))
}
s := LatinString(buffer.String())
return reflect.ValueOf(s)
}
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin)"
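The difference between the two predicates can be sketched with the standard unicode package alone. The helper names (inUnion, inIntersection, keep) are hypothetical, and this sketch uses strings.Map rather than the transform chain from the question.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// inUnion reports whether r is in Letter OR Number OR Latin.
// Almost every letter in any script passes this test.
func inUnion(r rune) bool {
	return unicode.In(r, unicode.Letter, unicode.Number, unicode.Latin)
}

// inIntersection reports whether r is (Letter OR Number) AND Latin.
func inIntersection(r rune) bool {
	return unicode.In(r, unicode.Letter, unicode.Number) && unicode.Is(unicode.Latin, r)
}

// keep returns s with every rune that fails pred removed.
func keep(s string, pred func(rune) bool) string {
	return strings.Map(func(r rune) rune {
		if pred(r) {
			return r
		}
		return -1 // a negative value tells strings.Map to drop the rune
	}, s)
}

func main() {
	in := "Ab穊솲\u00e9"
	fmt.Println(keep(in, inUnion))        // Ab穊솲é
	fmt.Println(keep(in, inIntersection)) // Abé
}
```

One caveat with the intersection: the ASCII digits belong to the Common script rather than Latin, so a filter that should keep 0-9 needs to allow them explicitly.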
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There's some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string down to values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of code points that testing/quick draws from is just so large (up to 0x10FFFF) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.) Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

Bison parser with operator tokens in variable name

I am new to bison, and have the misfortune of needing to write a parser for a language that may have what would otherwise be an operator within a variable name. For example, depending on context, the expression
FOO = BAR-BAZ
could be interpreted as either:
the variable "FOO" being assigned the value of the variable "BAR" minus the value of the variable "BAZ", OR
the variable "FOO" being assigned the value of the variable "BAR-BAZ"
Fortunately the language requires variable declarations ahead of time, so I can determine whether a given string is a valid variable via a function I've implemented:
bool isVariable(char* name);
that will return true if the given string is a valid variable name, and false otherwise.
How do I tell bison to attempt the second scenario above first, and only if (through use of isVariable()) that path fails, go back and try it as the first scenario above? I've read that you can have bison try multiple parsing paths and cull invalid ones when it encounters a YYERROR, so I've tried a set of rules similar to:
variable:
STRING { if(!isVariable($1)) YYERROR; }
;
expression:
expression '-' expression
| variable
;
but when given "BAR-BAZ" the parser tries it as a single variable and just stops completely when it hits the YYERROR instead of exploring the "BAR" - "BAZ" path as I expect. What am I doing wrong?
Edit:
I'm beginning to think that my flex rule for STRING might be the culprit:
(([A-Z0-9][-A-Z0-9_///.]+)|([A-Z])) {yylval.sval = strdup(yytext); return STRING;}
In this case, if '-' appears in the middle of alphanumeric characters, the whole lot is treated as 1 STRING, without the possibility for subdivision by the parser (and therefore only one path explored). I suppose I could manually parse the STRING in the parser action, but it seems like there should be a better way. Perhaps flex could give back alternate token streams (one for the "BAR-BAZ" case and another for the "BAR"-"BAZ" case) that are diverted to different parser stacks for exploration? Is something like that possible?
It's not impossible to solve this problem within a bison-generated parser, but it's not easy, and the amount of hackery required might detract from the readability and verifiability of the grammar.
To be clear, GLR parsers are not fallback parsers. The GLR algorithm explores all possible parses in parallel, and rejects invalid ones as it goes. (The bison implementation requires that the parse converge to a single possible parse; the original GLR algorithm produces a forest of parse trees.) Also, the GLR algorithm does not contemplate multiple lexical analyses.
If you want to solve this problem in the context of the parser, you'll probably need to introduce special handling for whitespace, or at least for - which are not surrounded by whitespace. Otherwise, you will not be able to distinguish between a - b (presumably always subtraction) and a-b (which might be the variable a-b if that variable were defined). Leaving aside that issue, you would be looking for something like this (but this won't work, as explained below):
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: var
| '(' expr ')'
var : ident { if (!isVariable($1)) { /* reject this production */ } }
ident : WORD
| ident '-' WORD { $$ = concatenate($1, "-", $3); }
This won't work because the action associated with var : ident is not executed until after the parse has been disambiguated. So if the production is rejected, the parse fails, because the parser has already determined that the production is necessary. (Until the parser makes that determination, actions are deferred.)
Bison allows GLR grammars to use semantic predicates, which are executed immediately instead of being deferred. But that doesn't help, because semantic predicates cannot make use of computed semantic values (since the semantic value computations are still deferred when the semantic predicate is evaluated). You might think you could get around this by making the computation of the concatenated identifier (in the second ident production) a semantic predicate, but then you run into another limitation: semantic predicates do not themselves have semantic values.
Probably there is a hack which will get around this problem, but that might leave you with a different problem. Suppose that a, c, a-b and b-c are defined variables. Then, what is the meaning of a-b-c? Is it (a-b) - c or a - (b-c) or an error?
If you expect it to be an error, then there is no problem since the GLR parser will find both possible parses and bison-generated GLR parsers signal a syntax error if the parse is ambiguous. But then the question becomes: is a-b-c only an error if it is ambiguous? Or is it an error because you cannot use a subtraction operator without surrounding whitespace if its arguments are hyphenated variables? (So that a-b-c can only be resolved to (a - b) - c or to (a-b-c), regardless of whether a-b and b-c exist?) To enforce the latter requirement, you'll need yet more complication.
If, on the other hand, your language is expected to model a "fallback" approach, then the result should be (a-b) - c. But making that selection is not a simple merge procedure between two expr reductions, because of the possibility of a higher precedence * operator: d * a-b-c either resolves to (d * a-b) - c or (d * a) - b-c; in those two cases, the parse trees are radically different.
An alternative solution is to put the disambiguation of hyphenated variables into the scanner, instead of the parser. This leads to a much simpler and somewhat clearer definition, but it leads to a different problem: how do you tell the scanner when you don't want the semantic disambiguation to happen? For example, you don't want the scanner to insist on breaking up a variable name into segments when the name occurs in a declaration.
Even though the semantic tie-in with the scanner is a bit ugly, I'd go with that approach in this case. A rough outline of a solution is as follows:
First, the grammar. Here I've added a simple declaration syntax, which may or may not have any resemblance to the one in your grammar. See notes below.
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: VARIABLE
| '(' expr ')'
decl : { splitVariables(false); } "set" VARIABLE
{ splitVariables(true); } '=' expr ';'
{ addVariable($2); /* ... */ }
(See below for the semantics of splitVariables.)
Now, the lexer. Again, it's important to know what the intended result for a-b-c is; I'll outline two possible strategies. First, the fallback strategy, which can be implemented in flex:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] {
yymore();
if (isVariable(yytext))
candidate_len = yyleng;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]* { if (!isVariable(yytext))
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
}
That uses yymore and yyless to find the longest prefix sequence of hyphenated words which is a valid variable. (If there is no such prefix, it chooses the first word. An alternative would be to select the entire sequence if there is no such prefix.)
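The selection logic of this fallback strategy can be sketched outside of flex. The following Go function is an illustration only, not scanner code: longestVariablePrefix and the symbol table in main are hypothetical, and the input is assumed to be a single run of words joined by hyphens.

```go
package main

import (
	"fmt"
	"strings"
)

// longestVariablePrefix returns the longest prefix of s, cut at a hyphen,
// that names a known variable. If no hyphenated prefix is a variable, it
// returns the first word, mirroring the candidate_len bookkeeping.
func longestVariablePrefix(s string, isVariable func(string) bool) string {
	words := strings.Split(s, "-")
	best := words[0] // initial candidate: the first word, unconditionally
	for i := 2; i <= len(words); i++ {
		candidate := strings.Join(words[:i], "-")
		if isVariable(candidate) {
			best = candidate // a longer valid prefix replaces the candidate
		}
	}
	return best
}

func main() {
	// Hypothetical symbol table: a, c, a-b and b-c are declared.
	vars := map[string]bool{"a": true, "c": true, "a-b": true, "b-c": true}
	isVariable := func(name string) bool { return vars[name] }

	fmt.Println(longestVariablePrefix("a-b-c", isVariable)) // a-b
	fmt.Println(longestVariablePrefix("x-y", isVariable))   // x
}
```

With this strategy the remainder of the input ("-c" in the first call) is handed back to be scanned as separate tokens, which is what yields the (a-b) - c reading.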
A similar alternative only allows the complete hyphenated sequence (in the case where that is a valid variable) or individual words. Again, we use yyless and yymore, but this time we don't bother checking intermediate prefixes and we use a second start condition for the case where we know we're not going to combine words:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>("-"[[:alpha:]][[:alnum:]]*)*[[:alpha:]][[:alnum:]]* {
if (isVariable(yytext)) {
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
} else {
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(NO_COMBINE);
return WORD;
}
}
<NO_COMBINE>[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<NO_COMBINE>"-" { return '-'; }
<NO_COMBINE>.|\n { yyless(0); /* rescan */
BEGIN(INITIAL);
}
Both of the above solutions use isVariable to decide whether or not a hyphenated sequence is a valid variable. As mentioned earlier, there must be a way to turn off the check, for example in the case of a declaration. To accomplish this, we need to implement splitVariables(bool). The implementation is straightforward; it simply needs to set a flag visible to isVariable. If the flag is set to true, then isVariable always returns true without actually checking for the existence of the variable in the symbol table.
All of that assumes that the symbol table and the splitVariables flag are shared between the parser and the scanner. A naïve solution would make both of these variables globals; a cleaner solution would be to use a pure parser and lexer, and pass the symbol table structure (including the flag) from the main program into the parser, and from there (using %lex-param) into the lexer.

What does => do in Ada

I am debugging some Ada code, and have come across a loop in which there are several lines containing the operator: =>. I have not come across this before, and a quick Google hasn't really been much help in finding out what it does... Can anyone help me here?
For example, in the loop, there are lines such as:
time => data.time;
distance => data.distance;
Is this assigning the value of the variables on the right hand side to the ones on the left- so that the ones on the left are now equal to the ones on the right, or maybe assigning the memory addresses of the variables on the left, so that they point to the location of the variables on the right?
Any help would be much appreciated.
Edited to show surrounding code (04/02/2015 @ 1700)
So, a fuller example of somewhere that => is used would be:
if data.IASType /= Types.TOA and data.IASType /= Types.RNG then
-- Calculate positionOfTarget using the laterRelativeTime
...
SteeringUtilities.calculateApproachData
(...
time => data.time,
distance => data.distance,
end if;
Apologies - just realised I misquoted the two lines earlier by putting ; at the end of the lines rather than ,.
=> is not an "operator". It's a syntax element whose most common purpose is to let you specify a list of things (such as parameters to a subprogram call) by showing what each item in the list means, instead of simply listing them in order. For example, one of the Put_Line procedures is defined like this:
procedure Put_Line(File : in File_Type; Item : in String);
When you call it, the following calls are all equivalent:
Put_Line(My_File, "Hello, world");
Put_Line(File => My_File, Item => "Hello, world");
Put_Line(Item => "Hello, world", File => My_File);
The syntax is used for many other things, such as lists of discriminants, parameters in a generic instantiation, parameters to a pragma, etc. It's also used for record and array aggregates--for array aggregates, you can have an index, multiple indexes, ranges of indexes, or others on the left side of =>.