What does fragment mean in ANTLR?
I've seen both rules:
fragment DIGIT : '0'..'9';
and
DIGIT : '0'..'9';
What is the difference?
A fragment is somewhat akin to an inline function: it makes the grammar more readable and easier to maintain.
A fragment is never counted as a token; it only serves to simplify a grammar.
Consider:
NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
In this example, matching a NUMBER will always produce a NUMBER token from the lexer, regardless of whether it matched "1234", "0xab12", or "0777".
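The inlining idea can be imitated outside ANTLR. The following is a minimal Python sketch, assuming regular expressions as a stand-in for lexer rules; the helper name `token_type` is made up for illustration and is not part of ANTLR.

```python
import re

# Fragments: reusable sub-patterns that never become tokens themselves.
DIGITS = r"[1-9][0-9]*"
OCTAL_DIGITS = r"0[0-7]+"
HEX_DIGITS = r"0x[0-9a-fA-F]+"

# The only real token rule; the fragments are simply inlined into it.
NUMBER = re.compile(rf"(?:{HEX_DIGITS}|{OCTAL_DIGITS}|{DIGITS})")


def token_type(text):
    """Return 'NUMBER' if the whole text matches the NUMBER rule, else None."""
    return "NUMBER" if NUMBER.fullmatch(text) else None


for sample in ("1234", "0xab12", "0777"):
    print(sample, token_type(sample))
```

Whichever fragment did the matching, the caller only ever sees NUMBER, which mirrors what the parser sees from the lexer.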
According to The Definitive ANTLR 4 Reference:
Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.
In practice, they improve the readability of your grammars.
Look at this example:
STRING : '"' (ESC | ~["\\])* '"' ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
STRING is a lexer rule that uses the fragment rule ESC. UNICODE is used in the ESC rule, and HEX is used in the UNICODE fragment rule.
The ESC, UNICODE and HEX rules can't be referenced from parser rules, because they never produce tokens of their own.
The Definitive ANTLR 4 Reference (Page 106):
Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.
Abstract Concepts:
Case1: ( if I need the RULE1, RULE2, RULE3 entities or group info )
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
Case2: ( if I don't care about RULE1, RULE2, RULE3 and just focus on RULE0 )
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
// RULE0 is a terminal node.
// You can't name it 'rule0', or you will get syntax errors:
// 'A-C' came as a complete surprise to me while matching alternative
// 'DEF' came as a complete surprise to me while matching alternative
Case3: ( equivalent to Case2, but more readable )
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
// You can't name it 'rule0', or you will get warnings:
// warning(125): implicit definition of token RULE1 in parser
// warning(125): implicit definition of token RULE2 in parser
// warning(125): implicit definition of token RULE3 in parser
// and failed to capture rule0 content (?)
Differences between Case1 and Case2/3 ?
The lexer rules are equivalent
Each of RULE1/2/3 in Case1 is a capturing group, similar to Regex:(X)
Each of RULE1/2/3 in Case3 is a non-capturing group, similar to Regex:(?:X)
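The regex analogy can be made concrete with Python's `re` module. This is only an illustration of the analogy, not of how ANTLR works internally: named groups play the role of the Case1 token rules, and non-capturing groups play the role of the Case3 fragments.

```python
import re

text = "ABBCCCDDDD"

# Case1 analogue: each sub-rule is a named (capturing) group,
# so we can tell which one produced the match.
capturing = re.compile(r"(?P<RULE1>[A-C]+)|(?P<RULE2>[DEF]+)|(?P<RULE3>[GHI]+)")

# Case2/3 analogue: the sub-rules are non-capturing,
# so only the overall match is available.
non_capturing = re.compile(r"(?:[A-C]+)|(?:[DEF]+)|(?:[GHI]+)")

case1 = [(m.lastgroup, m.group()) for m in capturing.finditer(text)]
case3 = [(m.lastgroup, m.group()) for m in non_capturing.finditer(text)]

print(case1)  # [('RULE1', 'ABBCCC'), ('RULE2', 'DDDD')]
print(case3)  # [(None, 'ABBCCC'), (None, 'DDDD')]
```

Both patterns match the same text, but only the first one remembers which sub-rule matched.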
Let's see a concrete example.
Goal: identify [ABC]+, [DEF]+, [GHI]+ tokens
input.txt
ABBCCCDDDDEEEEE ABCDE
FFGGHHIIJJKK FGHIJK
ABCDEFGHIJKL
Main.py
import sys
from antlr4 import *
from AlphabetLexer import AlphabetLexer
from AlphabetParser import AlphabetParser
from AlphabetListener import AlphabetListener
class MyListener(AlphabetListener):
    # Exit a parse tree produced by AlphabetParser#content.
    def exitContent(self, ctx:AlphabetParser.ContentContext):
        pass
    # (For Case1 Only) enable it when testing Case1
    # Exit a parse tree produced by AlphabetParser#rule0.
    def exitRule0(self, ctx:AlphabetParser.Rule0Context):
        print(ctx.getText())
# end-of-class
def main():
    file_name = sys.argv[1]
    input = FileStream(file_name)
    lexer = AlphabetLexer(input)
    stream = CommonTokenStream(lexer)
    parser = AlphabetParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))
    listener = MyListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def
main()
Case1 and results:
Alphabet.g4 (Case1)
grammar Alphabet;
content : (rule0|ANYCHAR)* EOF;
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
ABBCCC
DDDDEEEEE
ABC
DE
FF
GGHHII
F
GHI
ABC
DEF
GHI
Case2/3 and results:
Alphabet.g4 (Case2)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Alphabet.g4 (Case3)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)
Did you notice the "capturing groups" and "non-capturing groups" parts?
Let's look at a second concrete example.
Goal: identify octal / decimal / hexadecimal numbers
input.txt
0
123
1~9999
001~077
0xFF, 0x01, 0xabc123
Number.g4
grammar Number;
content
: (number|ANY_CHAR)* EOF
;
number
: DECIMAL_NUMBER
| OCTAL_NUMBER
| HEXADECIMAL_NUMBER
;
DECIMAL_NUMBER
: [1-9][0-9]*
| '0'
;
OCTAL_NUMBER
: '0' '0'..'7'+
;
HEXADECIMAL_NUMBER
: '0x'[0-9A-Fa-f]+
;
ANY_CHAR
: .
;
Main.py
import sys
from antlr4 import *
from NumberLexer import NumberLexer
from NumberParser import NumberParser
from NumberListener import NumberListener
class Listener(NumberListener):
    # Exit a parse tree produced by NumberParser#number.
    def exitNumber(self, ctx:NumberParser.NumberContext):
        print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
            ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
    # end-of-def
# end-of-class
def main():
    input = FileStream(sys.argv[1])
    lexer = NumberLexer(input)
    stream = CommonTokenStream(lexer)
    parser = NumberParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))
    listener = Listener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def
main()
Result:
# Input data (for reference)
# 0
# 123
# 1~9999
# 001~077
# 0xFF, 0x01, 0xabc123
$ python3 Main.py input.txt
(content (number 0) \n (number 123) \n (number 1) ~ (number 9999) \n (number 001) ~ (number 077) \n (number 0xFF) , (number 0x01) , (number 0xabc123) \n <EOF>)
0, dec: 0 , oct: None , hex: None
123, dec: 123 , oct: None , hex: None
1, dec: 1 , oct: None , hex: None
9999, dec: 9999 , oct: None , hex: None
001, dec: None , oct: 001 , hex: None
077, dec: None , oct: 077 , hex: None
0xFF, dec: None , oct: None , hex: 0xFF
0x01, dec: None , oct: None , hex: 0x01
0xabc123, dec: None , oct: None , hex: 0xabc123
If you add the modifier 'fragment' to DECIMAL_NUMBER, OCTAL_NUMBER, HEXADECIMAL_NUMBER, you won't be able to capture the number entities (since they are not tokens anymore). And the result will be:
$ python3 Main.py input.txt
(content 0 \n 1 2 3 \n 1 ~ 9 9 9 9 \n 0 0 1 ~ 0 7 7 \n 0 x F F , 0 x 0 1 , 0 x a b c 1 2 3 \n <EOF>)
This blog post has a very clear example where fragment makes a significant difference:
grammar number;
number: INT;
DIGIT : '0'..'9';
INT : DIGIT+;
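The grammar will recognize '42' but not '7'. A single digit matches both DIGIT and INT with the same length, and when two lexer rules tie, ANTLR picks the one defined first, so '7' becomes a DIGIT token while the parser expects an INT. You can fix it by making DIGIT a fragment (or by moving DIGIT after INT).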
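The '42' versus '7' behaviour follows from two lexer conventions: the longest match wins, and on a tie the rule listed first wins. Here is a minimal Python sketch of just those two conventions, assuming regex patterns as stand-ins for lexer rules; the `tokenize` helper is hypothetical and much simpler than ANTLR's real lexer.

```python
import re


def tokenize(rules, text):
    """Tiny lexer sketch: longest match wins; ties go to the rule listed first."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly greater: on equal length the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# DIGIT defined as a real token before INT, as in the grammar above.
with_token = [("DIGIT", r"[0-9]"), ("INT", r"[0-9]+")]
# DIGIT as a fragment: it is only inlined into INT and is not a rule of its own.
with_fragment = [("INT", r"[0-9]+")]

print(tokenize(with_token, "42"))     # [('INT', '42')]
print(tokenize(with_token, "7"))      # [('DIGIT', '7')]
print(tokenize(with_fragment, "7"))   # [('INT', '7')]
```

With DIGIT as a token, '7' never reaches the parser as an INT; with DIGIT as a fragment, INT is the only candidate.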
I'm trying to play with dependent types, but I've gotten stuck on a syntax error that I can't seem to find the cause for. It's on my first lin definition in the concrete syntax.
For context, I've been reading Inari's embedded grammars tutorial, and I'm interested in making functions which I can call from the outside, but which won't actually be used in linearisation unless I do it manually from Haskell. Please let me know if you know a better way to achieve this.
My abstract syntax:
cat
S;
Pre_S Clude ;
VP Clude ;
NP Clude ;
V2 ;
N ;
Clude ;
fun
Include, Exclude : Clude ;
To_S : Pre_S Include -> S ;
Pred : (c : Clude) -> NP c -> VP c -> Pre_S c ;
Compl : (c : Clude) -> V2 -> NP c -> VP c ;
-- I'd like to call this one from Haskell
Scramble : (c : Clude) -> VP c -> VP Exclude ;
To_NP : (c : Clude) -> N -> NP c ;
Bob, Billy, John, Mary, Lisa : N ;
Liked, Loved, Hated : V2 ;
Concrete:
lincat
Pre_S, S, V2, NP, N = {s : Str} ;
VP = {fst : Str ; snd : Str} ;
lin
To_S pre = {s = pre.s} ;
Pred _ sub vp = {s = sub ++ vp.fst ++ vp.snd} ;
Compl _ ver obj = {fst = ver ; snd = obj} ;
Scramble _ vp = {fst = vp.snd ; snd = vp.fst } ;
To_NP _ n = {s = n.s} ;
Bob = {s = "Bob"} ; --and seven more like this
Is it because this isn't possible this way, or did I do something wrong that I'm just not managing to find?
1. Fixing the syntax errors
None of the errors in your code is due to dependent types. The first issue, the actual syntax error is this:
To_S pre = {s = pre.s} ;
pre is a reserved word in GF, so you can't use it as a variable name. You can write e.g.
To_S pr = {s = pr.s} ;
or, even shorter, since To_S just looks like a coercion function, you can do this:
To_S pr = pr ;
After fixing that, you get new error messages:
Happened in linearization of Pred
type of sub
expected: Str
inferred: {s : Str}
Happened in linearization of Compl
type of ver
expected: Str
inferred: {s : Str}
Happened in linearization of Compl
type of obj
expected: Str
inferred: {s : Str}
These are fixed as follows. You can't put a {s : Str} into a field that is supposed to contain a Str, so you need to access the sub.s, ver.s and obj.s.
Pred _ sub vp = {s = sub.s ++ vp.fst ++ vp.snd} ;
Compl _ ver obj = {fst = ver.s ; snd = obj.s} ;
Full grammar after fixing the syntax errors
This works for me.
lin
To_S pr = pr ;
Pred _ sub vp = {s = sub.s ++ vp.fst ++ vp.snd} ;
Compl _ ver obj = {fst = ver.s ; snd = obj.s} ;
Scramble _ vp = {fst = vp.snd ; snd = vp.fst } ;
To_NP _ n = {s = n.s} ;
Bob = {s = "Bob"} ;
Loved = {s = "loved"} ; -- just add rest of the lexicon
Include, Exclude = {s = ""} ; -- these need to have a linearisation, otherwise nothing works!
2. Tips and tricks on working with your grammar
With the lexicon of Bob and Loved, it generates exactly one tree. When I use the command gt (generate_trees), if I don't give a category flag, it automatically uses the start category, which is S.
> gt | l -treebank
To_S (Pred Include (To_NP Include Bob) (Compl Include Loved (To_NP Include Bob)))
Bob loved Bob
Parsing in S works too:
p "Bob loved Bob"
To_S (Pred Include (To_NP Include Bob) (Compl Include Loved (To_NP Include Bob)))
Parsing VPs: add a linref
With the current grammar, we can't parse any VPs:
> p -cat="VP ?" "Bob loved"
The parser failed at token 2: "loved"
> p -cat="VP ?" "loved Bob"
The parser failed at token 2: "Bob"
That's because the lincat of VP is discontinuous and doesn't have a single s field. If you would also like to parse VPs, you can add a linref for the category VP, like this:
linref
VP = \vp -> vp.fst ++ vp.snd ;
Now it works to parse even VPs, sort of:
Cl> p -cat="VP ?" "loved Bob"
Compl ?3 Loved (To_NP ?3 Bob)
These weird question marks are metavariables, and we'll fix it next.
Metavariables
If you've read my blog, you may have already seen this bit about metavariables: https://inariksit.github.io/gf/2018/08/28/gf-gotchas.html#metavariables-or-those-question-marks-that-appear-when-parsing
GF has a weird requirement that every argument needs to contribute with a string (even an empty string!), otherwise it isn’t recognised when parsing. This happens even if there is no ambiguity.
We have already linearised Include and Exclude with empty strings; they must have some linearisation in any case, otherwise the whole grammar doesn't work. So we need to add the Clude's empty string to all of the linearisations, otherwise the GF parser gets confused.
In all those linearisations where you just marked the Clude argument with an underscore, we do this now. It doesn't matter which field we add it to; it's just an empty string and has no effect on the output.
Pred _clu sub vp = {s = sub.s ++ vp.fst ++ vp.snd ++ _clu.s} ;
Compl _clu ver obj = {fst = ver.s ; snd = obj.s ++ _clu.s} ;
Scramble _clu vp = {fst = vp.snd ; snd = vp.fst ++ _clu.s} ;
To_NP _clu n = {s = n.s ++ _clu.s} ;
After this, parsing in VP works without question marks:
Cl> p -cat="VP ?" "loved Bob"
Compl Exclude Loved (To_NP Exclude Bob)
Compl Include Loved (To_NP Include Bob)
Cl> p -cat="VP ?" "Bob loved"
Scramble Exclude (Compl Exclude Loved (To_NP Exclude Bob))
I'm still getting metavariables????
So we fixed it, but why do I still get question marks when I gt and get a tree that uses Scramble?
> gt -cat="VP Exclude"
Compl Exclude Loved (To_NP Exclude Bob)
Scramble ?3 (Compl ?3 Loved (To_NP ?3 Bob))
Scramble Exclude (Scramble ?4 (Compl ?4 Loved (To_NP ?4 Bob)))
> gt -cat="VP Include"
Compl Include Loved (To_NP Include Bob)
That's because the Scramble operation truly suppresses its arguments. This is not just a case of the GF compiler being stupid and refusing to cooperate even when it's obvious which argument is there. In this case there is no way to retrieve which argument it was: Scramble makes everything into a VP Exclude.
Full grammar after adding linref and including Clude's empty string
Just for the sake of completeness, here's all the changes.
lin
To_S pr = pr ;
Pred _clu sub vp = {s = sub.s ++ vp.fst ++ vp.snd ++ _clu.s} ;
Compl _clu ver obj = {fst = ver.s ; snd = obj.s ++ _clu.s} ;
Scramble _clu vp = {fst = vp.snd ; snd = vp.fst ++ _clu.s} ;
To_NP _clu n = {s = n.s ++ _clu.s} ;
Bob = {s = "Bob"} ;
Loved = {s = "loved"} ;
Include, Exclude = {s = ""} ;
linref
VP = \vp -> vp.fst ++ vp.snd ;
3. Alternative way of making this grammar without dependent types
If you would like to use your grammar from Python or the Haskell bindings for the C runtime, you are out of luck: the C runtime doesn't support dependent types. So here's a version of your grammar where we mimic the behaviour using parameters in the concrete syntax and the nonExist token (see https://inariksit.github.io/gf/2018/08/28/gf-gotchas.html#raise-an-exception).
I have kept the documentation minimal (because this answer is already so long 😅), but if you have any questions about some parts of this solution, just ask!
Abstract syntax
cat
S ; VP ; NP ; V2 ; N ;
fun
Pred : NP -> VP -> S ;
Compl : V2 -> NP -> VP ;
Scramble : VP -> VP ;
To_NPIncl,
To_NPExcl : N -> NP ;
Bob, Billy, John, Mary, Lisa : N ;
Liked, Loved, Hated : V2 ;
Concrete syntax
param
Clude = Include | Exclude ;
lincat
-- To mimic your categories VP Clude and NP Clude,
-- we add a parameter in VP and NP
VP = {fst, snd : Str ; clude : Clude} ;
NP = {s : Str ; clude : Clude} ;
-- The rest are like in your grammar
S, V2, N = {s : Str} ;
lin
-- No need for Pre_S, we can match NP's and VP's Clude in Pred
-- Only make a sentence if both the NP and the VP have Clude = Include
Pred np vp = case <np.clude, vp.clude> of {
<Include,Include> => {s = np.s ++ vp.fst ++ vp.snd} ;
_ => {s = Predef.nonExist}
} ;
-- As per your grammar, V2 has no inherent Clude, but NP does
Compl v2 np = {
fst = v2.s ;
snd = np.s ;
clude = np.clude ;
} ;
-- Scramble doesn't look at its argument VP's clude,
-- just makes it into Exclude automatically.
Scramble vp = {
fst = vp.snd ;
snd = vp.fst ;
clude = Exclude ;
} ;
-- Your grammar has the function To_NP : (c : Clude) -> N -> NP c ;
-- We translate it into two functions.
To_NPIncl n = n ** {clude = Include} ;
To_NPExcl n = n ** {clude = Exclude} ;
-- Finally, lexicon.
Bob = {s = "Bob"} ;
Loved = {s = "loved"} ;
Now when we generate all trees in category S, there is one that uses Scramble, but it doesn't have a linearisation.
> gt | l -treebank
ClParam: Pred (To_NPIncl Bob) (Compl Loved (To_NPIncl Bob))
ClParamEng: Bob loved Bob
ClParam: Pred (To_NPIncl Bob) (Scramble (Compl Loved (To_NPIncl Bob)))
ClParamEng:
Maybe a bit less elegant than your version, where the tree wasn't even generated, but this is just to demonstrate the different approaches. If you're working in Haskell and won't need to use the C runtime, feel free to continue with your approach!
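For readers who don't know GF, the parameter-based version can be imitated in plain Python. This is only a rough sketch of the idea, not GF semantics: records become dicts, the Clude parameter becomes a string, and `None` stands in for Predef.nonExist. All function names here are made up for illustration.

```python
INCLUDE, EXCLUDE = "Include", "Exclude"


def to_np(name, clude):
    """Mimics To_NPIncl / To_NPExcl: attach a clude value to a noun."""
    return {"s": name, "clude": clude}


def compl(v2, np):
    """Mimics Compl: the VP inherits the clude of its object NP."""
    return {"fst": v2, "snd": np["s"], "clude": np["clude"]}


def scramble(vp):
    """Mimics Scramble: swap the fields and always mark the result Exclude."""
    return {"fst": vp["snd"], "snd": vp["fst"], "clude": EXCLUDE}


def pred(np, vp):
    """Mimics Pred: only linearise when both parts are Include."""
    if np["clude"] == INCLUDE and vp["clude"] == INCLUDE:
        return " ".join([np["s"], vp["fst"], vp["snd"]])
    return None  # stands in for Predef.nonExist


bob = to_np("Bob", INCLUDE)
print(pred(bob, compl("loved", bob)))            # Bob loved Bob
print(pred(bob, scramble(compl("loved", bob))))  # None
```

As in the GF output above, the scrambled tree can still be built, but it has no linearisation.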
I have an expression IF 1 THEN 2 ELSE 3 * 4. I want this parsed as IF 1 THEN 2 ELSE (3 * 4), however using my grammar (extract) below, it parses it as (IF 1 THEN 2 ELSE 3) * 4.
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
My understanding is that because I have the ifExp alternative appearing before the muldivExp it should use that first, then because I have the muldivExp before atomExp (which handles numbers) it should do 3 * 4 to end the ELSE, rather than using just the 3. In which case I can't see why it's making the IF..THEN..ELSE a child of the multiplication.
I don't think the rest of the grammar is relevant here, but in case it is see below for the whole thing.
grammar AnaplanFormula;
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
signedAtom
: PLUS signedAtom #plusSignedAtom
| MINUS signedAtom #minusSignedAtom
| func_ #funcAtom
| atom #atomAtom
;
atom
: SCIENTIFIC_NUMBER #numberAtom
| LPAREN expression RPAREN #expressionAtom // Do we need this?
| entity #entityAtom
;
func_: functionname LPAREN (expression (',' expression)*)? RPAREN #funcParameterised
| entity LSQUARE dimensionmapping (',' dimensionmapping)* RSQUARE #funcSquareBrackets
;
dimensionmapping: WORD COLON entity; // Could make WORD more specific here
functionname: WORD; // Could make WORD more specific here
entity: QUOTELITERAL #quotedEntity
| WORD+ #wordsEntity
| left=entity DOT right=entity #dotQualifiedEntity
;
WS: [ \r\n\t]+ -> skip;
/////////////////
// Fragments //
/////////////////
fragment NUMBER: DIGIT+ (DOT DIGIT+)?;
fragment DIGIT: [0-9];
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment WORDSYMBOL: [#?_£%];
//////////////////
// Tokens //
//////////////////
IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';
WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;
STRINGLITERAL: DOUBLEQUOTES (~'"' | ('""'))* DOUBLEQUOTES;
QUOTELITERAL: '\'' (~'\'' | ('\'\''))* '\'';
LSQUARE: '[';
RSQUARE: ']';
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
COLON: ':';
EQUALS: '=';
NOTEQUALS: LT GT;
LT: '<';
GT: '>';
AMPERSAND: '&';
DOUBLEQUOTES: '"';
UNDERSCORE: '_';
QUESTIONMARK: '?';
HASH: '#';
POUND: '£';
PERCENT: '%';
DOT: '.';
PIPE: '|';
SCIENTIFIC_NUMBER: NUMBER (('e' | 'E') (PLUS | MINUS)? NUMBER)?;
Move your ifExp down near the end of your alternatives (in particular, below any alternative that you want to be able to match your elseExpression).
Your “if ... then ... else ...” ends up below the muldivExp in the parse tree precisely because you've given it a higher priority. Items lower in the tree are evaluated before items higher in the tree, so higher-priority items belong lower in the tree.
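To see why the priority of the IF alternative decides where the multiplication ends up, here is a small precedence-climbing parser in Python. It is a simplified model of operator precedence in general, not ANTLR's actual algorithm, and the numeric binding powers and the `if_power` parameter are assumptions made up for this sketch.

```python
BINARY_POWER = {"+": 10, "*": 20}  # higher number = binds tighter


def parse(tokens, if_power):
    """Parse `tokens`; `if_power` is the (hypothetical) binding power of IF..ELSE."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(word):
        if advance() != word:
            raise SyntaxError(f"expected {word}")

    def expr(min_power):
        tok = advance()
        if tok == "IF":
            cond = expr(0)
            expect("THEN")
            then_branch = expr(0)
            expect("ELSE")
            # The ELSE branch only absorbs operators that bind tighter than IF.
            left = ("if", cond, then_branch, expr(if_power))
        else:
            left = int(tok)
        while peek() in BINARY_POWER and BINARY_POWER[peek()] > min_power:
            op = advance()
            left = (op, left, expr(BINARY_POWER[op]))
        return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()}")
    return tree


tokens = "IF 1 THEN 2 ELSE 3 * 4".split()
print(parse(tokens, if_power=30))  # ('*', ('if', 1, 2, 3), 4)
print(parse(tokens, if_power=0))   # ('if', 1, 2, ('*', 3, 4))
```

When IF binds tighter than `*`, the ELSE branch stops at 3 and the whole IF becomes the left operand of the multiplication; when IF binds loosest, the ELSE branch swallows `3 * 4`.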
With:
expression:
LPAREN expression RPAREN # parenthesisExp
| left = expression BINARYOPERATOR right = expression # binaryoperationExp
| left = expression op = (TIMES | DIV) right = expression # muldivExp
| left = expression op = (PLUS | MINUS) right = expression # addsubtractExp
| left = expression op = (EQUALS | NOTEQUALS | LT | GT) right = expression # comparisonExp
| left = expression AMPERSAND right = expression # concatenateExp
| NOT expression # notExp
| STRINGLITERAL # stringliteralExp
| signedAtom # atomExp
| IF condition = expression THEN thenExpression = expression ELSE elseExpression = expression # ifExp
;
FAQ: In Raku, how do I parse a String and get a Number ? For example:
xxx("42"); # 42 (Int)
xxx("0x42"); # 66 (Int)
xxx("42.123456789123456789"); # 42.123456789123456789 (Rat)
xxx("42.4e2"); # 4240 (Num)
xxx("42.4e-2"); # 0.424 (Num)
Just use the prefix +:
say +"42"; # 42 (Int)
say +"0x42"; # 66 (Int)
say +"42.123456789123456789"; # 42.123456789123456789 (Rat)
say +"42.4e2"; # 4240 (Num)
say +"42.4e-2"; # 0.424 (Num)
Info
val, a Str routine, does exactly what you (I) want.
Beware that it returns an Allomorph object. Use unival or just the + prefix to convert it to a Number.
Links:
Learning Raku: Number, Strings, and NumberString Allomorphs
Same question in Python, Perl
Rosetta Code: Determine if a string is numeric
Edited thanks to @Holli's comment
my regex number {
\S+ #grab chars
<?{ defined +"$/" }> #assertion that coerces via '+' to Real
}
#strip factor [leading] e.g. 9/5 * Kelvin
if ( $defn-str ~~ s/( <number>? ) \s* \*? \s* ( .* )/$1/ ) {
my $factor = $0;
#...
}
I have a yacc parser in which commands like "abc=123" (VAR=VAR),
abc=[1 2 3] (VAR=value_series/string) and abc=[[123]] can be parsed.
I need to parse abc=[xyz=[1 2 3] jkl=[3 4 5]]. This grammar fails due to an ambiguity with rule 2 (I guess it can't differentiate between value_series and the new rule).
I have tried a case:
VAR_NAME EQUAL quote_or_brace model EQUAL quote_or_brace value_series quote_or_brace net EQUAL quote_or_brace value_series quote_or_brace quote_or_brace
It didn't work.
series:
| PLUS series
{
}
| series VAR_NAME EQUAL VAR_NAME
{
delete [] $2;
delete [] $4;
}
| series VAR_NAME EQUAL quote_or_brace value_series quote_or_brace
{
delete [] $2;
}
| series VAR_NAME EQUAL quote_or_brace quote_or_brace value_series quote_or_brace quote_or_brace
{
delete [] $2;
}
| error
{
setErrorMsg(string("error"));
YYABORT;
};
If it was me, I would probably write a set of rules similar to this:
assignment_list
: assignment
| assignment assignment_list
;
assignment
: VAR_NAME '=' assignment_rhs
;
assignment_rhs
: expression
| '[' opt_expression_list ']' /* allows var1 = [ ... ] */
;
opt_expression_list
: /* empty */
| expression_list
;
expression_list
: expression
| expression expression_list
;
expression
: VAR_NAME /* a variable name, as in var1 = var2 */
| NUMBER /* some kind of number, as in var1 = 123 */
| STRING /* quoted string, as in var1 = "foo" */
| assignment /* allows nesting of assignments, as in var1 = [var2 = 4] */
/* also allows things like var1 = var2 = 123 */
;
Note that this isn't tested in any way, and that I'm not fully sure about the recursion from expression to assignment.
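One way to sanity-check the shape of these rules without running yacc is a small hand-written recursive-descent parser. The following Python sketch follows the rules above under some assumptions: the token patterns are guesses (the original lexer isn't shown), and the one-token look-ahead for '=' is how this sketch decides between a plain VAR_NAME and a nested assignment.

```python
import re

# Assumed token shapes; the real lexer is not shown in the question.
TOKEN = re.compile(
    r'\s*(?:(?P<NUMBER>\d+)|(?P<VAR_NAME>[A-Za-z_]\w*)'
    r'|(?P<STRING>"[^"]*")|(?P<PUNCT>[=\[\]]))'
)


def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at {pos}")
        kind, value = m.lastgroup, m.group(m.lastgroup)
        # Punctuation is its own kind: '=', '[' or ']'.
        tokens.append((value if kind == "PUNCT" else kind, value))
        pos = m.end()
    return tokens


def parse(text):
    tokens = tokenize(text)
    pos = 0

    def kind(offset=0):
        i = pos + offset
        return tokens[i][0] if i < len(tokens) else None

    def take(expected):
        nonlocal pos
        if kind() != expected:
            raise SyntaxError(f"expected {expected}, got {kind()}")
        value = tokens[pos][1]
        pos += 1
        return value

    def assignment():
        name = take("VAR_NAME")
        take("=")
        if kind() == "[":              # '[' opt_expression_list ']'
            take("[")
            items = []
            while kind() != "]":
                items.append(expression())
            take("]")
            return (name, items)
        return (name, expression())

    def expression():
        if kind() == "VAR_NAME" and kind(1) == "=":
            return assignment()        # nested assignment
        if kind() == "NUMBER":
            return int(take("NUMBER"))
        if kind() == "STRING":
            return take("STRING")[1:-1]
        return take("VAR_NAME")

    result = [assignment()]            # assignment_list
    while kind() is not None:
        result.append(assignment())
    return result


print(parse("abc=[xyz=[1 2 3] jkl=[3 4 5]]"))
# [('abc', [('xyz', [1, 2, 3]), ('jkl', [3, 4, 5])])]
```

Like the rules above, this sketch does not accept directly nested brackets such as abc=[[123]]; that form would need an extra alternative.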
Given the grammar below, I'm seeing very poor performance when parsing longer strings, on the order of seconds (this is with both the Python and Go implementations). Is there something in this grammar that is causing that?
Example output:
0.000061s LEXING "hello world"
0.014349s PARSING "hello world"
0.000052s LEXING 5 + 10
0.015384s PARSING 5 + 10
0.000061s LEXING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.634113s PARSING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.000095s LEXING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
1.552758s PARSING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
This is on Python. Though I don't expect blazing performance, I would expect sub-second parsing for any input. What am I doing wrong?
grammar Excellent;
parse
: expr EOF
;
expr
: atom # expAtom
| concatenationExpr # expConcatenation
| equalityExpr # expEquality
| comparisonExpr # expComparison
| additionExpr # expAddition
| multiplicationExpr # expMultiplication
| exponentExpr # expExponent
| unaryExpr # expUnary
;
path
: NAME (step)*
;
step
: LBRAC expr RBRAC
| PATHSEP NAME
| PATHSEP NUMBER
;
parameters
: expr (COMMA expr)* # functionParameters
;
concatenationExpr
: atom (AMP concatenationExpr)? # concatenation
;
equalityExpr
: comparisonExpr op=(EQ|NE) comparisonExpr # equality
;
comparisonExpr
: additionExpr (op=(LT|GT|LTE|GTE) additionExpr)? # comparison
;
additionExpr
: multiplicationExpr (op=(ADD|SUB) multiplicationExpr)* # addition
;
multiplicationExpr
: exponentExpr (op=(MUL|DIV) exponentExpr)* # multiplication
;
exponentExpr
: unaryExpr (EXP exponentExpr)? # exponent
;
unaryExpr
: SUB? atom # negation
;
funcCall
: function=NAME LPAR parameters? RPAR # functionCall
;
funcPath
: function=funcCall (step)* # functionPath
;
atom
: path # contextReference
| funcCall # atomFuncCall
| funcPath # atomFuncPath
| LITERAL # stringLiteral
| NUMBER # decimalLiteral
| LPAR expr RPAR # parentheses
| TRUE # true
| FALSE # false
;
NUMBER
: DIGITS ('.' DIGITS?)?
;
fragment
DIGITS
: ('0'..'9')+
;
TRUE
: [Tt][Rr][Uu][Ee]
;
FALSE
: [Ff][Aa][Ll][Ss][Ee]
;
PATHSEP
:'.';
LPAR
:'(';
RPAR
:')';
LBRAC
:'[';
RBRAC
:']';
SUB
:'-';
ADD
:'+';
MUL
:'*';
DIV
:'/';
COMMA
:',';
LT
:'<';
GT
:'>';
EQ
:'=';
NE
:'!=';
LTE
:'<=';
GTE
:'>=';
QUOT
:'"';
EXP
: '^';
AMP
: '&';
LITERAL
: '"' ~'"'* '"'
;
Whitespace
: (' '|'\t'|'\n'|'\r')+ ->skip
;
NAME
: NAME_START_CHARS NAME_CHARS*
;
fragment
NAME_START_CHARS
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment
NAME_CHARS
: NAME_START_CHARS
| '0'..'9'
| '\u00B7' | '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
ERRROR_CHAR
: .
;
You can always try to parse with SLL(*) first and only if that fails you need to parse it with LL(*) (which is the default).
See this ticket on ANTLR's GitHub for further explanation and here is an implementation that uses this strategy.
This method will save you (a lot of) time when parsing syntactically correct input.
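The control flow of that two-stage strategy is small enough to sketch in Python. The version below is deliberately generic so it runs without a generated parser: `fast_parse`, `full_parse` and `FastParseFailed` are hypothetical stand-ins. With the ANTLR Python runtime, stage one would presumably set `parser._interp.predictionMode = PredictionMode.SLL` and use `BailErrorStrategy`, and stage two would rewind the token stream and parse again with `PredictionMode.LL`; treat that mapping as an assumption to check against your runtime version.

```python
class FastParseFailed(Exception):
    """Hypothetical stand-in for the error a bail-out strategy raises in stage one."""


def two_stage_parse(text, fast_parse, full_parse):
    """Try the cheap parse first; fall back to the full one only on failure."""
    try:
        return fast_parse(text), "fast"
    except FastParseFailed:
        return full_parse(text), "full"


# Toy stand-ins so the sketch runs without a generated parser.
def fast_parse(text):
    if "?" in text:  # pretend the fast mode cannot handle inputs containing '?'
        raise FastParseFailed(text)
    return f"tree({text})"


def full_parse(text):
    return f"tree({text})"


print(two_stage_parse("5 + 10", fast_parse, full_parse))  # ('tree(5 + 10)', 'fast')
print(two_stage_parse("5 ? 10", fast_parse, full_parse))  # ('tree(5 ? 10)', 'full')
```

The slow path is only paid for inputs where the fast path gives up, which is why the saving shows up mainly on syntactically correct input.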
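Seems like this performance is due to the way the addition, multiplication, etc. operators were structured as a chain of nested rules that all start with atom. Rewriting them as binary alternatives of a single expression rule (which ANTLR 4 handles through its support for direct left recursion) yields parsing that is effectively instant (see below).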
grammar Excellent;
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
EXPONENT : '^';
EQ : '=';
NEQ : '!=';
LTE : '<=';
LT : '<';
GTE : '>=';
GT : '>';
AMPERSAND : '&';
DECIMAL : [0-9]+('.'[0-9]+)?;
STRING : '"' (~["] | '""')* '"';
TRUE : [Tt][Rr][Uu][Ee];
FALSE : [Ff][Aa][Ll][Ss][Ee];
NAME : [a-zA-Z][a-zA-Z0-9_.]*; // variable names, e.g. contact.name or function names, e.g. SUM
WS : [ \t\n\r]+ -> skip; // ignore whitespace
ERROR : . ;
parse : expression EOF;
atom : fnname LPAREN parameters? RPAREN # functionCall
| atom DOT atom # dotLookup
| atom LBRACK expression RBRACK # arrayLookup
| NAME # contextReference
| STRING # stringLiteral
| DECIMAL # decimalLiteral
| TRUE # true
| FALSE # false
;
expression : atom # atomReference
| MINUS expression # negation
| expression EXPONENT expression # exponentExpression
| expression (TIMES | DIVIDE) expression # multiplicationOrDivisionExpression
| expression (PLUS | MINUS) expression # additionOrSubtractionExpression
| expression (LTE | LT | GTE | GT) expression # comparisonExpression
| expression (EQ | NEQ) expression # equalityExpression
| expression AMPERSAND expression # concatenation
| LPAREN expression RPAREN # parentheses
;
fnname : NAME
| TRUE
| FALSE
;
parameters : expression (COMMA expression)* # functionParameters
;