In Perl 6 grammars, as explained here (note, the design documents are not guaranteed to be up-to-date as the implementation is finished), if an opening angle bracket is followed by an identifier, then the construct is a call to a subrule, method or function.
If the character following the identifier is an opening paren, then it's a call to a method or function, e.g. <foo('bar')>. As explained further down the page, if the first character after the identifier is a space, then the rest of the string up to the closing angle bracket is interpreted as a regex argument to the method. To quote:
<foo bar>
is more or less equivalent to
<foo(/bar/)>
What's the proper way to use this feature? In my case, I'm parsing line-oriented data and I'm trying to declare a rule that will start a separate search on the current line being parsed:
#!/usr/bin/env perl6
# use Grammar::Tracer ;
grammar G {
my $SOLpos = -1 ; # Start-of-line pos
regex TOP { <line>+ }
method SOLscan($regex) {
# Start a new cursor
my $cur = self."!cursor_start_cur"() ;
# Set pos and from to start of the current line
$cur.from($SOLpos) ;
$cur.pos($SOLpos) ;
# Run the given regex on the cursor
$cur = $regex($cur) ;
# If pos is >= 0, we found what we were looking for
if $cur.pos >= 0 {
$cur."!cursor_pass"(self.pos, 'SOLscan')
}
self
}
token line {
{ $SOLpos = self.pos ; say '$SOLpos = ' ~ $SOLpos }
[
|| <word> <ws> 'two' { say 'matched two' } <SOLscan \w+> <ws> <word>
|| <word>+ %% <ws> { say 'matched words' }
]
\n
}
token word { \S+ }
token ws { \h+ }
}
my $mo = G.subparse: q:to/END/ ;
hello world
one two three
END
As it is, this code produces:
$ ./h.pl
$SOLpos = 0
matched words
$SOLpos = 12
matched two
Too many positionals passed; expected 1 argument but got 2
in method SOLscan at ./h.pl line 14
in regex line at ./h.pl line 32
in regex TOP at ./h.pl line 7
in block <unit> at ./h.pl line 41
$
Line 14 is $cur.from($SOLpos). If commented out, line 15 produces the same error. It appears as though .pos and .from are read only... (maybe :-)
Any ideas what the proper incantation is?
Note, any proposed solution can be a long way from what I've done here - all I'm really wanting to do is understand how the mechanism is supposed to be used.
It does not seem to be in the corresponding directory in roast, so that would make it a "Not Yet Implemented" feature, I'm afraid.
In Edit distance: Ignore start/end, I offered a Perl 6 solution to a fuzzy matching problem. I had a grammar like this (although maybe I've improved it after Edit #3):
grammar NString {
regex n-chars { [<.ignore>* \w]**4 }
regex ignore { \s }
}
The literal 4 itself was the length of the target string in the example. But the next problem might be some other length. So how can I tell the grammar how long I want that match to be?
Although the docs don't show an example of using the $args parameter, I found one in S05-grammar/example.t in roast.
Specify the arguments in :args and give the regex an appropriate signature. Inside the regex, access the arguments in a code block:
grammar NString {
regex n-chars ($length) { [<.ignore>* \w]**{ $length } }
regex ignore { \s }
}
class NString::Actions {
method n-chars ($/) {
put "Found $/";
}
}
my $string = 'The quick, brown butterfly';
loop {
state $from = 0;
my $match = NString.subparse(
$string,
:rule('n-chars'),
:actions(NString::Actions),
:c($from++),
:args( \(5) )
);
last unless ?$match;
}
I'm still not sure about the rules for passing the arguments though. This doesn't work:
:args( 5 )
I get:
Too few positionals passed; expected 2 arguments but got 1
This works:
:args( 5, )
But that's enough thinking about this for one night.
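What the parameterized n-chars rule matches can also be spelled out without a grammar. The following is only a sketch in C under stated assumptions: match_n_chars and is_word_char are hypothetical names, \w is approximated by ASCII letters, digits and underscore, and <.ignore> by isspace. It anchors at a start offset, as :c does in the subparse call, and consumes exactly the requested number of word characters while skipping whitespace in between.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Rough ASCII stand-in for \w (an assumption; Raku's \w is Unicode-aware). */
static int is_word_char(unsigned char c) { return isalnum(c) || c == '_'; }

/* Mimics [<.ignore>* \w] ** length, anchored at offset `from`.
 * Returns the offset just past the match, or -1 if there is no match. */
static long match_n_chars(const char *s, size_t from, size_t length) {
    size_t len = strlen(s);
    if (from > len) return -1;
    size_t pos = from;
    for (size_t matched = 0; matched < length; matched++) {
        /* <.ignore>* : skip any run of whitespace */
        while (pos < len && isspace((unsigned char)s[pos])) pos++;
        /* \w : exactly one word character must follow */
        if (pos >= len || !is_word_char((unsigned char)s[pos])) return -1;
        pos++;
    }
    return (long)pos;
}
```

With the string 'The quick, brown butterfly' and a length of 5, the match starting at offset 0 covers "The qu", and the attempt at offset 5 fails at the comma, which is neither whitespace nor a word character.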
I was playing around with Interpolating into names. I was mostly interested in this colon syntax feature to turn a variable into a pair where the identifier is the key.
my %Hamadryas = map { slip $_, 0 }, <
februa
honorina
velutina
>;
{
my $pair = :%Hamadryas;
say $pair; # Hamadryas => { ... }
}
put '-' x 50;
But, just for giggles, I wanted to try it with variable name interpolation too. I know this is stupid because if I know the name I don't need the colon syntax to get it. But, I also thought that it should work by accident:
{
my $name = 'Hamadryas';
# Since I already have the name, I could just:
# my $pair = $name => %::($name)
# But, couldn't I just line up the syntax?
my $pair = :%::($name); # does not work
say $pair;
}
Why doesn't that :%::($name) syntax work? That's more a question of when the parser decides that it's not parsing something it wants to understand. I figured it would see the : and start processing a colon pair, then see the % and know it had a hash, even though there's the :: after the %.
Is there a way to make it work with tricks and grammar mutations?
I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
: char-head general-text character-insides { PopChoices(); }
;
character-insides
: LEFTBRACKET options RIGHTBRACKET
;
char-head
: char-namesWT opt-imgsWT char-desc opt-cond
;
char-desc
: general-text { SetText($1); }
;
char-namesWT
: DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
;
opt-cond
: %empty
| condition
;
condition
: condition-reason condition-main general-text
{ AddCondition($1, $2, $3); }
;
condition-reason
: DOTU { $$ = 'u'; }
| DOTV { $$ = 'v'; }
;
condition-main
: money-conditionWT
| have-conditionWT
| moves-conditionWT
| flag-conditionWT
;
have-conditionWT
: PERCENT_SLASH opt-bang ID
{ $$ = MkCondID($1, $2, $3) ; expect(EXP_TEXT); }
;
opt-bang
: %empty { $$ = TRUE; }
| BANG { $$ = FALSE; }
;
ID: WORD
;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphanumeric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
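The one-shot expectation described above can be sketched in C. This is a minimal illustration and not the actual lexer from this question: struct Lexer, lexer_expect and lexer_next are hypothetical names, FILENAME and keyword handling are left out, and the token codes 289 and 292 are just the values printed for WORD and TEXT in the yylex trace shown further down.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum Expect { EXP_NORMAL, EXP_TEXT };
enum { TOK_EOF = 0, TOK_WORD = 289, TOK_TEXT = 292 };

struct Lexer {
    const char *cur;        /* next unread character */
    enum Expect expecting;  /* what the parser asked for */
    char text[128];         /* spelling of the last token */
};

/* Called from a grammar action to request a TEXT token next. */
static void lexer_expect(struct Lexer *lx, enum Expect what) {
    lx->expecting = what;
}

static int lexer_next(struct Lexer *lx) {
    enum Expect mode = lx->expecting;
    lx->expecting = EXP_NORMAL;  /* the expectation lasts for one token only */
    while (isspace((unsigned char)*lx->cur)) lx->cur++;
    if (*lx->cur == '\0') return TOK_EOF;
    size_t n = 0;
    if (mode == EXP_TEXT) {
        /* TEXT: everything up to the end of the line */
        while (*lx->cur != '\0' && *lx->cur != '\n' && n + 1 < sizeof lx->text)
            lx->text[n++] = *lx->cur++;
        lx->text[n] = '\0';
        return TOK_TEXT;
    }
    /* WORD: a run of non-whitespace characters */
    while (*lx->cur != '\0' && !isspace((unsigned char)*lx->cur) && n + 1 < sizeof lx->text)
        lx->text[n++] = *lx->cur++;
    lx->text[n] = '\0';
    return TOK_WORD;
}
```

In this sketch the expectation only takes effect if lexer_expect runs before the next call to lexer_next. In a yacc-generated parser that ordering depends on when the action is executed relative to reading the lookahead token.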
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
Here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the equations above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
Brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?
To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 at the top of your grammar file. The debugging code is controlled by the global variable yydebug: set it to non-zero to turn on the trace and to zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
extern int yydebug;
if (char *p = getenv("YYDEBUG"))
yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.
I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.
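The difference between a named token and a character-literal token comes down to their integer codes. The fragment below is a small sketch: the value 269 is taken from the yylex trace in the question, the value of '[' assumes an ASCII execution character set, and is_char_literal_token is a hypothetical helper.

```c
#include <assert.h>

/* Code of the named token, as printed in the question's yylex trace. */
enum { LEFTBRACKET = 269 };

/* A character-literal token such as '[' is represented by the character's
 * own code, so it always fits in the range 1..255. */
static int is_char_literal_token(int token) {
    return token > 0 && token < 256;
}

/* What the parser would need to receive for a grammar that uses '['. */
static int bracket_literal_code(void) {
    return '[';  /* 91 in ASCII */
}
```

A scanner that returns LEFTBRACKET therefore can never satisfy a rule written with '[', and the reverse also holds.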
I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there someway to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
/* Type of the previously printed token, kept between calls */
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
/* Only the first piece of a run of OTHER text gets a prefix */
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
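To check the idea against the example string from the question without running lex, the same bookkeeping can be driven by a small hand-written scanner. This is a sketch under assumptions: annotate, append_with_prefix and keyword_length are hypothetical names, the output goes into a buffer instead of stdout, and a keyword is recognized wherever it starts, as the unanchored lex pattern would.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };

/* Appends the token to `out`, adding a prefix unless it continues OTHER text. */
static void append_with_prefix(char *out, size_t cap, enum OutputType *prev,
                               enum OutputType type, const char *tok, size_t len) {
    const char *prefix = "";
    if (type == WORD) prefix = "<>";
    else if (type == NUMBER) prefix = "{}";
    else if (type == OTHER && *prev != OTHER) prefix = "[]";
    *prev = type;
    size_t used = strlen(out);
    snprintf(out + used, cap - used, "%s%.*s", prefix, (int)len, tok);
}

/* Length of the keyword starting at s, or 0 if none starts there. */
static size_t keyword_length(const char *s) {
    static const char *const keywords[] = { "word1", "word2", "word3", "word4" };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t n = strlen(keywords[i]);
        if (strncmp(s, keywords[i], n) == 0) return n;
    }
    return 0;
}

/* Stand-in for the three lex rules: keyword, run of digits, any other char. */
static void annotate(const char *in, char *out, size_t cap) {
    enum OutputType prev = NO_TOKEN;
    if (cap == 0) return;
    out[0] = '\0';
    while (*in != '\0') {
        enum OutputType type = WORD;
        size_t n = keyword_length(in);
        if (n == 0) {
            while (isdigit((unsigned char)in[n])) n++;
            type = NUMBER;
        }
        if (n == 0) { n = 1; type = OTHER; }  /* the catch-all "." rule */
        append_with_prefix(out, cap, &prev, type, in, n);
        in += n;
    }
}
```

Because each OTHER character is emitted as soon as it is seen, nothing has to be accumulated and no end condition is needed; the prefix decision depends only on the previous token type.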
In Perl 5, I can use Getopt::Long to parse commandline arguments with some validation (see below from http://perldoc.perl.org/Getopt/Long.html).
use Getopt::Long;
my $data = "file.dat";
my $length = 24;
my $verbose;
GetOptions ("length=i" => \$length, # numeric
"file=s" => \$data, # string
"verbose" => \$verbose) # flag
or die("Error in command line arguments\n");
say $length;
say $data;
say $verbose;
Here =i in "length=i" creates a numeric type constraint on the value associated with --length and =s in "file=s" creates a similar string type constraint.
How do I do something similar in Raku (née Perl 6)?
Basics
That feature is built into Raku (formerly known as Perl 6). Here is the equivalent of your Getopt::Long code in Raku:
sub MAIN ( Str :$file = "file.dat"
, Int :$length = 24
, Bool :$verbose = False
)
{
$file.say;
$length.say;
$verbose.say;
}
MAIN is a special subroutine that automatically parses command line arguments based on its signature.
Str and Int provide string and integer type constraints, matching the =s and =i specifiers of Getopt::Long.
Bool makes $verbose a binary flag which is False if absent or if called as --/verbose. (The / in --/foo is a common Unix command line syntax for setting an argument to False).
: prepended to the variables in the subroutine signature makes them named (instead of positional) parameters.
Defaults are provided using $variable = followed by the default value.
Aliases
If you want single character or other aliases, you can use the :f(:$foo) syntax.
sub MAIN ( Str :f(:$file) = "file.dat"
, Int :l(:$length) = 24
, Bool :v(:$verbose) = False
)
{
$file.say;
$length.say;
$verbose.say;
}
:x(:$smth) creates an additional alias for --smth, such as the short alias -x in this example. Multiple aliases, including fully named ones, are possible too. For example, :foo(:x(:bar(:y(:$baz)))) gives you --foo, -x, --bar, -y and --baz, and a value passed through any of them ends up in $baz.
Positional arguments (and example)
MAIN can also be used with positional arguments. For example, here is Guess the number (from Rosetta Code). It defaults to a min of 0 and max of 100, but any min and max number could be entered. Using is copy allows the parameter to be changed within the subroutine:
#!/usr/bin/env perl6
multi MAIN
#= Guessing game (defaults: min=0 and max=100)
{
MAIN(0, 100)
}
multi MAIN ( $max )
#= Guessing game (min defaults to 0)
{
MAIN(0, $max)
}
multi MAIN
#= Guessing game
( $min is copy #= minimum of range of numbers to guess
, $max is copy #= maximum of range of numbers to guess
)
{
# swap min and max if min is higher
if $min > $max { ($min, $max) = ($max, $min) }
say "Think of a number between $min and $max and I'll guess it!";
while $min <= $max {
my $guess = (($max + $min)/2).floor;
given lc prompt "My guess is $guess. Is your number higher, lower or equal (or quit)? (h/l/e/q)" {
when /^e/ { say "I knew it!"; exit }
when /^h/ { $min = $guess + 1 }
when /^l/ { $max = $guess - 1 }
when /^q/ { say "quitting"; exit }
default { say "WHAT!?!?!" }
}
}
say "How can your number be both higher and lower than $max?!?!?";
}
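The bisection at the heart of the game above can be checked in isolation. The following C sketch replaces the interactive prompt with a known secret number. The name count_guesses is hypothetical, and the sketch narrows the upper bound to guess - 1 on a "lower" answer so that the loop always terminates.

```c
#include <assert.h>

/* Returns how many guesses the bisection needs to find `secret`,
 * or -1 if `secret` lies outside the range. */
static int count_guesses(int min, int max, int secret) {
    /* Swap the bounds if they were given in the wrong order. */
    if (min > max) { int tmp = min; min = max; max = tmp; }
    int guesses = 0;
    while (min <= max) {
        int guess = min + (max - min) / 2;  /* midpoint, rounded down */
        guesses++;
        if (guess == secret) return guesses;
        if (secret > guess) min = guess + 1;  /* answer "higher" */
        else max = guess - 1;                 /* answer "lower" */
    }
    return -1;  /* the answers were inconsistent with the range */
}
```

For the default range of 0 to 100 there are 101 candidates, so the search needs at most 7 guesses.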
Usage message
Also, if your command line arguments don't match a MAIN signature, you get a useful usage message, by default. Notice how subroutine and parameter comments starting with #= are smartly incorporated into this usage message:
./guess --help
Usage:
./guess -- Guessing game (defaults: min=0 and max=100)
./guess <max> -- Guessing game (min defaults to 0)
./guess <min> <max> -- Guessing game
<min> minimum of range of numbers to guess
<max> maximum of range of numbers to guess
Here --help isn't a defined command line parameter, thus triggering this usage message.
See also
See also the 2010, 2014, and 2018 Perl 6 advent calendar posts on MAIN, the post Parsing command line arguments in Perl 6, and the section of Synopsis 6 about MAIN.
Alternatively, there is a Getopt::Long for perl6 too. Your program works in it with almost no modifications:
use Getopt::Long;
my $data = "file.dat";
my $length = 24;
my $verbose;
get-options("length=i" => $length, # numeric
"file=s" => $data, # string
"verbose" => $verbose); # flag
say $length;
say $data;
say $verbose;