I would like to match any Num from part of a text string. So far, this (stolen from https://docs.perl6.org/language/regexes.html#Best_practices_and_gotchas) does the job...
my token sign { <[+-]> }
my token decimal { \d+ }
my token exponent { 'e' <sign>? <decimal> }
my regex float {
<sign>?
<decimal>?
'.'
<decimal>
<exponent>?
}
my regex int {
<sign>?
<decimal>
}
my regex num {
<float>?
<int>?
}
$str ~~ s/( <num>? \s*) ( .* )/$1/;
This seems like a lot of (error-prone) reinvention of the wheel. Is there a Perl 6 trick to match built-in types (Num, Real, etc.) in a grammar?
If you can make reasonable assumptions about the number, like that it's delimited by word boundaries, you can do something like this:
regex number {
« # left word boundary
\S+ # actual "number"
» # right word boundary
<?{ defined +"$/" }>
}
The final line in this regex stringifies the Match ("$/"), and then tries to convert it to a number (+). If it works, it returns a defined value, otherwise a Failure. This string-to-number conversion recognizes the same syntax as the Perl 6 grammar. The <?{ ... }> construct is an assertion, so it makes the match fail if the expression on the inside returns a false value.
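As a rough illustration of how that assertion behaves, here is a minimal sketch that applies the same regex to a few strings. The sample strings and the variable $m are made up for the example and are not part of the original answer.

```raku
# Same regex as above: a run of non-space characters between word
# boundaries that must also convert to a number.
my regex number {
    «                      # left word boundary
    \S+                    # candidate text
    »                      # right word boundary
    <?{ defined +"$/" }>   # reject candidates that do not numify
}

# Hypothetical sample inputs, chosen only for illustration.
for 'abc 3.14 def', 'width 1e3 px', 'no digits here' -> $str {
    my $m = $str ~~ / <number> /;
    say $str, ' => ', ($m ?? $m<number>.Str !! 'no match');
}
```

Words such as "abc" satisfy the word-boundary part but fail the numeric conversion, so the regex moves on to the next candidate instead of matching them.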
Related
The example for sym shows * (WhateverCode) standing in for a single symbol
grammar Foo {
token TOP { <letter>+ }
proto token letter {*}
token letter:sym<P> { <sym> }
token letter:sym<e> { <sym> }
token letter:sym<r> { <sym> }
token letter:sym<l> { <sym> }
token letter:sym<*> { . }
}.parse("I ♥ Perl", actions => class {
method TOP($/) { make $<letter>.grep(*.<sym>).join }
}).made.say; # OUTPUT: «Perl»
It will, however, fail if we use it to stand in for a symbol composed of several letters:
grammar Foo {
token TOP { <action>+ % " " }
proto token action {*}
token action:sym<come> { <sym> }
token action:sym<bebe> { <sym> }
token action:sym<*> { . }
}.parse("come bebe ama").say; # Nil
Since sym, by itself, does work with symbols with more than one character, how can we define a default sym token that matches a set of characters?
Can * be used in sym tokens for more than one character? ... The example for sym shows * (WhateverCode) standing in for a single symbol
It's not a WhateverCode or Whatever (see footnote 1).
The <...> in foo:sym<...> is a quote words constructor, so the ... is just a literal string.
That's why this works:
grammar g { proto token foo {*}; token foo:sym<*> { <sym> } }
say g.parse: '*', rule => 'foo'; # matches
As far as P6 is concerned, the * in foo:sym<*> is just a random string. It could be abracadabra. I presume the writer chose * to represent the mental concept of "whatever" because it happens to match the P6 concept Whatever. Perhaps they were being too cute.
For the rest of this answer I will write JJ instead of * wherever the latter is just an arbitrary string as far as P6 is concerned.
The * in the proto is a Whatever. But that's completely unrelated to your question:
grammar g { proto token foo {*}; token foo:sym<JJ> { '*' } }
say g.parse: '*', rule => 'foo'; # matches
In the body of a rule (tokens and regexes are rules) whose name includes a :sym<...> part, you can write <sym> and it will match the string between the angles of the :sym<...>:
grammar g { proto token foo {*}; token foo:sym<JJ> { <sym> } }
say g.parse: 'JJ', rule => 'foo'; # matches
But you can write anything you like in the rule/token/regex body. A . matches a single character:
grammar g { proto token foo {*}; token foo:sym<JJ> { . } }
say g.parse: '*', rule => 'foo'; # matches
It will, however, fail if we use it to stand in for a symbol composed of several letters
No. That's because you changed the grammar.
If you change the grammar back to the original coding (apart from the longer letter:sym<...>s) it works fine:
grammar Foo {
token TOP { <letter>+ }
proto token letter {*}
token letter:sym<come> { <sym> }
token letter:sym<bebe> { <sym> }
token letter:sym<JJ> { . }
}.parse(
"come bebe ama",
actions => class { method TOP($/) { make $<letter>.grep(*.<sym>).join } })
.made.say; # OUTPUT: «comebebe»
Note that in the original, the letter:sym<JJ> token is waiting in the wings to match any single character -- and that includes a single space, so it matches those and they're dealt with.
But in your modification you added a required space between tokens in the TOP token. That had two effects:
It matched the space after "come" and after "bebe";
After the "a" was matched by letter:sym<JJ>, the lack of a space between the "a" and "m" meant the overall match failed at that point.
sym, by itself, does work with symbols with more than one character
Yes. All token foo:sym<bar> { ... } does is add:
A multiple dispatch alternative to foo;
A token sym, lexically scoped to the body of the foo token, that matches 'bar'.
how can we define a default sym token that matches a set of characters?
You can write such a sym token but, to be clear, because you don't want it to match a fixed string it can't use the <sym> in the body. (Because a <sym> has to be a fixed string.) If you still want to capture under the key sym then you could write $<sym>= in the token body as Håkon showed in a comment under their answer. But it could also be letter:whatever with $<sym>= in the body.
I'm going to write it as a letter:default token to emphasize that it being :sym<something> doesn't make any difference. (As explained above, the :sym<something> is just about being an alternative, along with other :baz<...>s and :bar<...>s, with the only addition being that if it's :sym<something>, then it also makes a <sym> subrule available in the body of the associated rule, which, if used, matches the fixed string 'something'.)
The winning dispatch among all the rule foo:bar:baz:qux<...> alternatives is chosen according to LTM logic among the rules starting with foo. So you need to write such a token that does not win as a longest token prefix but only matches if nothing else matches.
To immediately go to the back of the pack in an LTM race, insert a {} at the start of the rule body (see footnote 2):
token letter:default { {} \w+ }
Now, from the back of the pack, if this rule gets a chance it'll match with the \w+ pattern, which will stop the token when it hits a non-word character.
The bit about making it match if nothing else matches may mean listing it last. So:
grammar Foo {
token TOP { <letter>+ % ' ' }
proto token letter {*}
token letter:sym<come> { <sym> } # matches come
token letter:sym<bebe> { <sym> } # matches bebe
token letter:boo { {} \w**6 } # match 6 char string except eg comedy
token letter:default { {} \w+ } # matches any other word
}.parse(
"come bebe amap",
actions => class { method TOP($/) { make $<letter>.grep(*.<sym>).join } })
.made.say; # OUTPUT: «comebebe»
that just can't be the thing causing it ... "come bebe ama" shouldn't work in your grammar
The code had errors which I've now fixed and apologize for. If you run it you'll find it works as advertised.
But your comment prodded me to expand my answer. Hopefully it now properly answers your question.
Footnote
1 Not that any of this has anything to do with what's actually going on but... In P6 a * in "term position" (in English, where a noun belongs, in general programming lingo, where a value belongs) is a Whatever, not a WhateverCode. Even when * is written with an operator, eg. +* or * + *, rather than on its own, the *s are still just Whatevers, but the compiler automatically turns most such combinations of one or more *s with one or more operators into a sub-class of Code called a WhateverCode. (Exceptions are listed in a table here.)
2 See footnote 2 in my answer to SO "perl6 grammar , not sure about some syntax in an example".
The :sym<...> contents are for the reader of your program, not for the compiler, and are used to distinguish multi tokens of otherwise identical names.
It just so happened that programmers started to write grammars like this:
token operator:sym<+> { '+' }
token operator:sym<-> { '-' }
token operator:sym</> { '/' }
To avoid duplicating the symbols (here +, -, /), a special rule <sym> was introduced that matches whatever is inside :sym<...> as a literal, so you can write the above tokens as
token operator:sym<+> { <sym> }
token operator:sym<-> { <sym> }
token operator:sym</> { <sym> }
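To see what <sym> gives you in practice, here is a small sketch. The grammar name Calc, the operand token, and the input string are hypothetical; only the three operator tokens come from the text above.

```raku
# Hypothetical grammar wrapped around the operator tokens shown above.
grammar Calc {
    token TOP     { <operand> [ \s* <operator> \s* <operand> ]* }
    token operand { \d+ }

    proto token operator {*}
    token operator:sym<+> { <sym> }
    token operator:sym<-> { <sym> }
    token operator:sym</> { <sym> }
}

my $m = Calc.parse('8 / 2 - 1');

# Each operator match carries a <sym> capture holding the literal
# text taken from its :sym<...> name.
say $m<operator>.map({ $_<sym>.Str });
```

Because each alternative captures under the same key, code that walks the parse tree can read the operator from sym without knowing which alternative matched.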
If you don't use <sym> inside the regex, you are free to write anything you want inside :sym<...>, so you can write something like
token operator:sym<fallback> { . }
Maybe like this:
grammar Foo {
token TOP { <action>+ % " " }
proto token action {*}
token action:sym<come> { <sym> }
token action:sym<bebe> { <sym> }
token action:sym<default> { \w+ }
}.parse("come bebe ama").say;
Output:
「come bebe ama」
action => 「come」
sym => 「come」
action => 「bebe」
sym => 「bebe」
action => 「ama」
Not sure whether grammars are meant to do such things: I want tokens to be defined at runtime (in the future, with data from a file). So I wrote some simple test code, and as expected it wouldn't even compile.
grammar Verb {
token TOP {
<root>
<ending>
}
token root {
(\w+) <?{ ~$0 (elem) @root }>
}
token ending {
(\w+) <?{ ~$0 (elem) @ending }>
}
}
my @root = <go jump play>;
my @ending = <ing es s ed>;
my $string = "going";
my $match = Verb.parse($string);
.Str.say for $match<root>;
What's the best way of doing such things in Perl 6?
To match any of the elements of an array, just write the name of the array variable (starting with a @ sigil) in the regex:
my @root = <go jump play>;
say "jumping" ~~ / @root /; # Matches 「jump」
say "jumping" ~~ / @root 'ing' /; # Matches 「jumping」
So in your use-case, the only tricky part is passing the arrays from the code that creates them (e.g. by parsing data files), to the grammar tokens that need them.
The easiest way would probably be to make them dynamic variables (signified by the * twigil):
grammar Verb {
token TOP {
<root>
<ending>
}
token root {
@*root
}
token ending {
@*ending
}
}
my @*root = <go jump play>;
my @*ending = <ing es s ed>;
my $string = "going";
my $match = Verb.parse($string);
say $match<root>.Str;
Another way would be to pass a Capture with the arrays to the args adverb of method .parse, which will pass them on to token TOP, from where you can in turn pass them on to the sub-rules using the <foo(...)> or <foo: ...> syntax:
grammar Verb {
token TOP (@known-roots, @known-endings) {
<root: @known-roots>
<ending: @known-endings>
}
token root (@known) {
@known
}
token ending (@known) {
@known
}
}
my @root = <go jump play>;
my @ending = <ing es s ed>;
my $string = "going";
my $match = Verb.parse($string, args => \(@root, @ending));
say $match<root>.Str; # go
The approach you were taking could have worked but you made three mistakes.
Scoping
Lexical variable declarations need to appear textually before the compiler encounters their use:
my $foo = 42; say $foo; # works
say $bar; my $bar = 42; # compile time error
Backtracking
say .parse: 'going' for
grammar using-token {token TOP { \w+ ing}}, # Nil
grammar using-regex-with-ratchet {regex TOP {:ratchet \w+ ing}}, # Nil
grammar using-regex {regex TOP { \w+ ing}}; # 「going」
The regex declarator has exactly the same effect as the token declarator except that it defaults to doing backtracking.
Your first use of \w+ in the root token matches the entire input 'going', which then fails to match any element of @root. And then, because there's no backtracking, the overall parse immediately fails.
(Don't take this to mean that you should default to using regex. Relying on backtracking can massively slow down parsing and there's typically no need for it.)
Debugging
See https://stackoverflow.com/a/19640657/1077672
This works:
my @root = <go jump play>;
my @ending = <ing es s ed>;
grammar Verb {
token TOP {
<root>
<ending>
}
regex root {
(\w+) <?{ ~$0 (elem) @root }>
}
token ending {
(\w+) <?{ ~$0 (elem) @ending }>
}
}
}
my $string = "going";
my $match = Verb.parse($string);
.Str.say for $match<root>;
outputs:
go
In Edit distance: Ignore start/end, I offered a Perl 6 solution to a fuzzy matching problem. I had a grammar like this (although maybe I've improved it after Edit #3):
grammar NString {
regex n-chars { [<.ignore>* \w]**4 }
regex ignore { \s }
}
The literal 4 itself was the length of the target string in the example. But the next problem might be some other length. So how can I tell the grammar how long I want that match to be?
Although the docs don't show an example of using the $args parameter, I found one in S05-grammar/example.t in roast.
Specify the arguments in :args and give the regex an appropriate signature. Inside the regex, access the arguments in a code block:
grammar NString {
regex n-chars ($length) { [<.ignore>* \w]**{ $length } }
regex ignore { \s }
}
class NString::Actions {
method n-chars ($/) {
put "Found $/";
}
}
my $string = 'The quick, brown butterfly';
loop {
state $from = 0;
my $match = NString.subparse(
$string,
:rule('n-chars'),
:actions(NString::Actions),
:c($from++),
:args( \(5) )
);
last unless ?$match;
}
I'm still not sure about the rules for passing the arguments though. This doesn't work:
:args( 5 )
I get:
Too few positionals passed; expected 2 arguments but got 1
This works:
:args( 5, )
But that's enough thinking about this for one night.
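For comparison, here is a stripped-down sketch of the two forms reported to work above. The grammar Repeat and its inputs are hypothetical and only meant to isolate the :args behaviour; it does not try to explain why a bare :args( 5 ) is rejected.

```raku
# Hypothetical grammar: TOP takes one positional argument, the
# number of word characters to match.
grammar Repeat {
    token TOP ($n) { \w ** {$n} }
}

# A Capture literal passes 3 as the single positional argument.
say so Repeat.parse('abc',  :args( \(3) ));

# A one-element list (note the trailing comma) is accepted as well.
say so Repeat.parse('abc',  :args( (3,) ));

# .parse must consume the whole string, so the wrong length fails
# and this prints False.
say so Repeat.parse('abcd', :args( \(3) ));
```

In both working forms the value handed to :args is list-like, which fits the observation that adding the trailing comma is what makes the call succeed.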
I'm trying to gather all text that is not defined by a previous rule into a string and prefix it with a formatting string using lex. I'm wondering if there's a standard way of doing this.
For example, say I have the rules:
word1|word2|word3|word4 {printf("%s%s", "<>", yytext);}
[0-9]+ {printf("%s%s", "{}", yytext);}
everything else {printf("%s%s", "[]", yytext);}
And I attempt to lex the string:
word1 this is some other text ; word2 98 foo bar .
I would want this to produce the following when run through the lexer:
<>word1[] this is some other text ; <>word2[] {}98[] foo bar .
I attempted to do this using states, but realize I can't determine when to stop the check, like:
%x OTHER
%%
. {yymore(); BEGIN OTHER;}
<OTHER>.|\n yymore();
<OTHER>how to determine when to end? {printf("%s%s", "[]", yytext); BEGIN INITIAL;}
What is a good way to do this? Is there some way to continue as long as another rule isn't met?
AFAIK, there is no "standard" solution, but a simple one is to keep a bit of context (the prefix last printed) and use that to decide whether or not to print a new prefix. For example, you could use a custom printer like this:
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
enum OutputType { NO_TOKEN = 0, WORD, NUMBER, OTHER };
void print_with_prefix(enum OutputType type, const char* token) {
static enum OutputType prev = NO_TOKEN;
const char* prefix = "";
switch (type) {
case WORD: prefix = "<>"; break;
case NUMBER: prefix = "{}"; break;
case OTHER: if (prev != OTHER) prefix = "[]"; break;
default: assert(false);
}
prev = type;
printf("%s%s", prefix, token);
}
Then you just need to change the calls to printf to invoke print_with_prefix instead (and, as written, to supply an enum value instead of a string).
For the OTHER case, you then don't need to do anything special to accumulate the token. Just
. { print_with_prefix(OTHER, yytext); }
(I'm skating over the handling of whitespace and newlines, but it's just conceptual.)
I am trying to write a lexer for an IntelliJ language plugin. The JFlex manual has an example that lexes string literals. However, that example uses a StringBuffer to collect each part of the lexed characters and gradually build up a single string. The problem I have with this method is that it creates a copy of the characters being read, and I don't know how to integrate that example with IntelliJ. In IntelliJ one always returns an IElementType, and the associated text is then taken from yytext() using the functions getTokenStart() and getTokenEnd(), such that the start and end of the whole token map directly onto the input string.
So I want to be able to return a token and the associated yytext() should span over the whole text since the last time another token was returned. For example in the string literal example, I would read \" which marks the literal start, then I change into state STRING and when I read \" again I change back into another state and return the string literal token. At that point I want yytext() to contain the whole string literal.
Is this possible with JFlex? If not, what is the recommended way to pass the content from a StringBuffer to the IntelliJ API after matching a token that spans multiple actions?
You could write a regular expression that matches the entire String literal so that you get it in one yytext() call, but this match would contain escape sequences unprocessed.
From the JFlex java example:
<STRING> {
\" { yybegin(YYINITIAL); return symbol(STRING_LITERAL, string.toString()); }
{StringCharacter}+ { string.append( yytext() ); }
/* escape sequences */
"\\b" { string.append( '\b' ); }
"\\t" { string.append( '\t' ); }
"\\n" { string.append( '\n' ); }
"\\f" { string.append( '\f' ); }
"\\r" { string.append( '\r' ); }
"\\\"" { string.append( '\"' ); }
"\\'" { string.append( '\'' ); }
"\\\\" { string.append( '\\' ); }
\\[0-3]?{OctDigit}?{OctDigit} { char val = (char) Integer.parseInt(yytext().substring(1),8);
string.append( val ); }
/* error cases */
\\. { throw new RuntimeException("Illegal escape sequence \""+yytext()+"\""); }
{LineTerminator} { throw new RuntimeException("Unterminated string at end of line"); }
}
This code doesn't just match escape sequences like "\\t", but turns them into the single character '\t'. You could match the whole string in a single expression like this
\" ({StringCharacter} | \\[0-3]?{OctDigit}?{OctDigit} | "\\b" | "\\t" | .. | "\\\\") * \"
but yytext will then contain the unprocessed sequence \\t instead of the character '\t'.
If that is acceptable, then that's the easy solution. If the token is supposed to be an actual substring of the input, then it sounds like this is what you want.
If it's not, you'll need something more complicated, for instance an intermediate interface function that is not yytext(), but that returns the StringBuffer content when the last match was a string match (a flag you could set in the string action), and otherwise returns yytext().