I'm facing an issue while validating the below formula with the given grammar rules.
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
However, I am able to validate the formulas below:
if(2>3?loopup(12):matrix(2,3))
if(2>3?ceil(12.2):floor(2.3))
ast
: expr+ EOF
;
expr: nestedexpr
| LOOKUP_FIELD '(' idrule ')'
| TIER_FIELD '(' idrule ',' idrule ')'
| MATRIX_FIELD '(' idrule ',' idrule ')'
| IF '(' conditionalrule '?' expr ':' expr')'
| ROUND '(' idrule ',' roundnumberrule ')'
;
nestedexpr:
nestedexpr ('*'|'/'|'+'|'-') nestedexpr
| '(' '-' nestedexpr ')'
| '(' nestedexpr ')'
| ROUND '(' expr ',' roundnumberrule ')'
| MATH_FUNCTION_FIELD '(' expr ')'
| DYNAMIC_FIELD_ID
;
arithematicexpr:
arithematicexpr ('*'|'/'|'+'|'-') arithematicexpr
| '(' '-' arithematicexpr ')'
| '(' arithematicexpr ')'
| DYNAMIC_FIELD_ID
;
orrule: OR '(' conditionalrule (',' conditionalrule)+ ')';
andrule: AND '(' conditionalrule (',' conditionalrule)+ ')';
conditionalrule: orrule | andrule | relationalrule;
relationalrule: DYNAMIC_FIELD_ID RELATIONAL_OPERATOR DYNAMIC_FIELD_ID;
idrule :
DYNAMIC_FIELD_ID
;
LOOKUP_FIELD: L O O K U P ;
TIER_FIELD: T I E R;
MATRIX_FIELD: M A T R I X;
IF: I F;
AND: A N D;
OR: O R;
ROUND : R O U N D;
MATH_FUNCTION_FIELD : C E I L | F L O O R;
RELATIONAL_OPERATOR: '<' | '>' | '<=' | '>=' | '<>' | '=';
BOOL_FIELD : T R U E | F A L S E;
DYNAMIC_FIELD_ID: {isDynamicFieldId()}? . ;
roundnumberrule: ROUND_NUMBER;
ROUND_NUMBER: [0-7];
WS : [ \t\r\n]+ -> skip ;
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
The above should get parsed by the mentioned grammar rule.
I wrote the following grammar in a .g4 file:
expr
:function=expr'('((arg',')*arg)?')' #callFun
|object=expr'.'SYMBOL #objCall
|coll=expr'['arg']' #collage
|varId=SYMBOL #varValue
|(expr) #brackets
|expr ADD ADD #lastInc
|expr SUB SUB #lastDec
//2
|ADD ADD <assoc=right> expr #postInc
|SUB SUB <assoc=right> expr #postDec
|NOT <assoc=right> expr #not
|SUB <assoc=right> expr #negative
|'new' expr #createObj
//3
|expr MUL expr #mul
|expr DIV expr #div
|expr MOD expr #mod
//4
|expr ADD expr #add
|expr SUB expr #sub
//5
|expr LMOVE expr #lMove
|expr RMOVE expr #rMove
//6
|expr LESS expr #less
|expr LESS EQUAL expr #lessEqual
|expr GREATER expr #greater
|expr GREATER EQUAL expr #greaterEqual
//7
|expr EQUAL EQUAL expr #equal
|expr NOT EQUAL expr #notEqual
//8
|expr AND ADD expr #and
//9
|expr OR OR expr #or
//10
|expr '?' <assoc=right> exprA = expr':'exprB = expr #trueAfalseB
//11
|expr EQUAL <assoc=right> expr #putIn
|expr ADD EQUAL <assoc=right> expr #addPutIn
|expr SUB EQUAL <assoc=right> expr #subPutIn
|expr MUL EQUAL <assoc=right> expr #mulPutIn
|expr DIV EQUAL <assoc=right> expr #divPutIn
|expr MOD EQUAL <assoc=right> expr #modPutIn
|expr LMOVE EQUAL <assoc=right> expr #lMovePutIn
|expr RMOVE EQUAL <assoc=right> expr #rMovePutIn
//12
|THROW <assoc=right> expr #throw
//13
|expr','expr #comma
;
If I'm not mistaken, there's only immediate left recursion in this grammar. But there is an error message: "The following sets of rules are mutually left-recursive [expr]". Why?
As mentioned by Kaby76 in the comments, this alternative:
expr
: ...
| (expr)
| ...
;
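is the cause. Unquoted parentheses are only grouping syntax in ANTLR, so this alternative is a bare reference to expr itself rather than a parenthesized expression. You probably meant to do this: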
expr
: ...
| '(' expr ')'
| ...
;
I have an expression IF 1 THEN 2 ELSE 3 * 4. I want this parsed as IF 1 THEN 2 ELSE (3 * 4), however using my grammar (extract) below, it parses it as (IF 1 THEN 2 ELSE 3) * 4.
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
My understanding is that because I have the ifExp alternative appearing before the muldivExp it should use that first, then because I have the muldivExp before atomExp (which handles numbers) it should do 3 * 4 to end the ELSE, rather than using just the 3. In which case I can't see why it's making the IF..THEN..ELSE a child of the multiplication.
I don't think the rest of the grammar is relevant here, but in case it is see below for the whole thing.
grammar AnaplanFormula;
formula: expression EOF;
expression
: LPAREN expression RPAREN #parenthesisExp
| IF condition=expression THEN thenExpression=expression ELSE elseExpression=expression #ifExp
| left=expression BINARYOPERATOR right=expression #binaryoperationExp
| left=expression op=(TIMES|DIV) right=expression #muldivExp
| left=expression op=(PLUS|MINUS) right=expression #addsubtractExp
| left=expression op=(EQUALS|NOTEQUALS|LT|GT) right=expression #comparisonExp
| left=expression AMPERSAND right=expression #concatenateExp
| NOT expression #notExp
| STRINGLITERAL #stringliteralExp
| signedAtom #atomExp
;
signedAtom
: PLUS signedAtom #plusSignedAtom
| MINUS signedAtom #minusSignedAtom
| func_ #funcAtom
| atom #atomAtom
;
atom
: SCIENTIFIC_NUMBER #numberAtom
| LPAREN expression RPAREN #expressionAtom // Do we need this?
| entity #entityAtom
;
func_: functionname LPAREN (expression (',' expression)*)? RPAREN #funcParameterised
| entity LSQUARE dimensionmapping (',' dimensionmapping)* RSQUARE #funcSquareBrackets
;
dimensionmapping: WORD COLON entity; // Could make WORD more specific here
functionname: WORD; // Could make WORD more specific here
entity: QUOTELITERAL #quotedEntity
| WORD+ #wordsEntity
| left=entity DOT right=entity #dotQualifiedEntity
;
WS: [ \r\n\t]+ -> skip;
/////////////////
// Fragments //
/////////////////
fragment NUMBER: DIGIT+ (DOT DIGIT+)?;
fragment DIGIT: [0-9];
fragment LOWERCASE: [a-z];
fragment UPPERCASE: [A-Z];
fragment WORDSYMBOL: [#?_£%];
//////////////////
// Tokens //
//////////////////
IF: 'IF' | 'if';
THEN: 'THEN' | 'then';
ELSE: 'ELSE' | 'else';
BINARYOPERATOR: 'AND' | 'and' | 'OR' | 'or';
NOT: 'NOT' | 'not';
WORD: (DIGIT* (LOWERCASE | UPPERCASE | WORDSYMBOL)) (LOWERCASE | UPPERCASE | DIGIT | WORDSYMBOL)*;
STRINGLITERAL: DOUBLEQUOTES (~'"' | ('""'))* DOUBLEQUOTES;
QUOTELITERAL: '\'' (~'\'' | ('\'\''))* '\'';
LSQUARE: '[';
RSQUARE: ']';
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
MINUS: '-';
TIMES: '*';
DIV: '/';
COLON: ':';
EQUALS: '=';
NOTEQUALS: LT GT;
LT: '<';
GT: '>';
AMPERSAND: '&';
DOUBLEQUOTES: '"';
UNDERSCORE: '_';
QUESTIONMARK: '?';
HASH: '#';
POUND: '£';
PERCENT: '%';
DOT: '.';
PIPE: '|';
SCIENTIFIC_NUMBER: NUMBER (('e' | 'E') (PLUS | MINUS)? NUMBER)?;
Move your ifExp alternative down near the end of your alternatives (in particular, below any alternative that you would want to match as your elseExpression).
Your “if ... then ... else ...” is below the muldivExp precisely because you've made it a higher priority. Items lower in the tree are evaluated before items higher in the tree, so higher priority items belong lower in the tree.
With:
expression:
LPAREN expression RPAREN # parenthesisExp
| left = expression BINARYOPERATOR right = expression # binaryoperationExp
| left = expression op = (TIMES | DIV) right = expression # muldivExp
| left = expression op = (PLUS | MINUS) right = expression # addsubtractExp
| left = expression op = (EQUALS | NOTEQUALS | LT | GT) right = expression # comparisonExp
| left = expression AMPERSAND right = expression # concatenateExp
| NOT expression # notExp
| STRINGLITERAL # stringliteralExp
| signedAtom # atomExp
| IF condition = expression THEN thenExpression = expression ELSE elseExpression = expression # ifExp
;
I get the parse you wanted: 3 * 4 ends up as the elseExpression of the ifExp.
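The effect of moving the alternative can be mimicked outside ANTLR with a small precedence-climbing parser. The Python sketch below is only an analogy, not how ANTLR implements precedence, and the names tokenize, parse and if_power are hypothetical. A low if_power plays the role of ifExp listed last (lowest priority); a high if_power plays the role of ifExp listed first.

```python
import re

# Binding powers of the binary operators (higher binds tighter).
BINARY_POWER = {"*": 20, "/": 20, "+": 10, "-": 10}

def tokenize(text):
    # Numbers, keywords and single-character operators; whitespace is skipped.
    return re.findall(r"\d+|[A-Za-z]+|[()*/+-]", text)

def parse(tokens, if_power):
    """Parse into nested tuples; if_power is the minimum binding power
    an operator needs in order to extend the ELSE branch."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expect(word):
        tok = advance()
        if tok.upper() != word:
            raise SyntaxError(f"expected {word}, got {tok}")

    def expression(min_power):
        tok = advance()
        if tok.upper() == "IF":
            cond = expression(0)
            expect("THEN")
            then = expression(0)
            expect("ELSE")
            other = expression(if_power)
            left = ("if", cond, then, other)
        elif tok == "(":
            left = expression(0)
            expect(")")
        else:
            left = tok
        # Keep absorbing operators that bind tighter than the caller allows.
        while peek() in BINARY_POWER and BINARY_POWER[peek()] > min_power:
            op = advance()
            right = expression(BINARY_POWER[op])
            left = (op, left, right)
        return left

    tree = expression(0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]}")
    return tree

tokens = tokenize("IF 1 THEN 2 ELSE 3 * 4")
print(parse(tokens, if_power=0))   # ('if', '1', '2', ('*', '3', '4'))
print(parse(tokens, if_power=30))  # ('*', ('if', '1', '2', '3'), '4')
```

With if_power=0 the ELSE branch swallows 3 * 4, which is the tree asked for in the question; with if_power=30 the branch stops at 3 and the multiplication wraps the whole IF.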
I'm using Antlr4. Here is my grammar:
assign : id '=' expr ;
id : 'A' | 'B' | 'C' ;
expr : expr '+' term
| expr '-' term
| term ;
term : term '*' factor
| term '/' factor
| factor ;
factor : expr '**' factor
| '(' expr ')'
| id ;
WS : [ \t\r\n]+ -> skip ;
I know this grammar is ambiguous and also I know I should add an element to the grammar but I don't know how to make the grammar unambiguous.
factor : expr '**' factor
Consider the input
A + B ** C
A + B is an expr, so we could analyse that as the left operand of **, semantically (A+B)^C.
But the other, more conventional interpretation, A + (B^C), is also possible:
<expr> =>
<expr> + <term> =>
<term> + <term> =>
<factor> + <term> =>
A + <term> =>
A + <factor> =>
A + <expr> ** <factor> =>
A + <term> ** <factor> =>
A + <factor> ** <factor> =>
A + B ** <factor> =>
A + B ** C
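One conventional way to remove the ambiguity is to stop the left operand of ** from being a full expr, for example factor : base ('**' factor)? with base : '(' expr ')' | id. That rewrite is an assumption about the intended meaning (** binds tighter than * and +, and is right-associative); the original grammar does not state it, and base and parse_expr are hypothetical names. A minimal recursive-descent sketch of the rewritten grammar in Python:

```python
import re

def parse_expr(text):
    """Parse the right-hand side of an assignment into nested tuples."""
    # Hypothetical tokenizer: '**' must be tried before '*'. Characters that
    # match nothing (such as whitespace) are silently skipped.
    tokens = re.findall(r"\*\*|[A-C()+\-*/]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():    # expr : term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = (op, node, term())
        return node

    def term():    # term : factor (('*' | '/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = take()
            node = (op, node, factor())
        return node

    def factor():  # factor : base ('**' factor)?   (right-associative)
        node = base()
        if peek() == "**":
            take()
            node = ("**", node, factor())
        return node

    def base():    # base : '(' expr ')' | id
        tok = take()
        if tok == "(":
            node = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in ("A", "B", "C"):
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

print(parse_expr("A + B ** C"))    # ('+', 'A', ('**', 'B', 'C'))
print(parse_expr("(A + B) ** C"))  # ('**', ('+', 'A', 'B'), 'C')
```

Because the left operand of ** is now a base, A + B can only be raised to a power when it is written in parentheses, so each input has exactly one parse.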
What does fragment mean in ANTLR?
I've seen both rules:
fragment DIGIT : '0'..'9';
and
DIGIT : '0'..'9';
What is the difference?
A fragment is somewhat akin to an inline function: It makes the grammar more readable and easier to maintain.
A fragment will never be counted as a token, it only serves to simplify a grammar.
Consider:
NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
In this example, matching a NUMBER will always return a NUMBER to the lexer, regardless of if it matched "1234", "0xab12", or "0777".
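The inline-function analogy can be sketched with ordinary regular expressions in Python. This is an illustration only, not ANTLR's lexer: the string constants stand in for the fragments, the compiled pattern stands in for the NUMBER token, and token_type is a hypothetical helper.

```python
import re

# "Fragments": reusable pieces that are never matched on their own.
DIGITS = r"[1-9][0-9]*"
OCTAL_DIGITS = r"0[0-7]+"
HEX_DIGITS = r"0x[0-9a-fA-F]+"

# The only "token": NUMBER, built by inlining the three fragments.
NUMBER = re.compile(f"(?:{DIGITS}|{OCTAL_DIGITS}|{HEX_DIGITS})")

def token_type(text):
    """Return the token name for text, or None if it is not a NUMBER."""
    return "NUMBER" if NUMBER.fullmatch(text) else None

for sample in ("1234", "0xab12", "0777"):
    print(sample, token_type(sample))  # every sample reports NUMBER
```

Whichever sub-pattern matched, the caller only ever sees NUMBER, just as the lexer only ever emits NUMBER in the grammar above.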
According to The Definitive ANTLR 4 Reference:
Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.
In practice, they improve the readability of your grammars.
Look at this example:
STRING : '"' (ESC | ~["\\])* '"' ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
STRING is a lexer rule that uses the fragment rule ESC. UNICODE is used in the ESC rule, and HEX is used in the UNICODE fragment rule.
The ESC, UNICODE and HEX rules cannot be referenced from parser rules, and they are never emitted as tokens on their own.
The Definitive ANTLR 4 Reference (Page 106):
Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.
Abstract Concepts:
Case1: ( if I need the RULE1, RULE2, RULE3 entities or group info )
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
Case2: ( if I don't care about RULE1, RULE2, RULE3 and just focus on RULE0 )
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
// RULE0 is a terminal node.
// You can't name it 'rule0', or you will get syntax errors:
// 'A-C' came as a complete surprise to me while matching alternative
// 'DEF' came as a complete surprise to me while matching alternative
Case3: ( is equivalent to Case2, making it more readable than Case2)
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
// You can't name it 'rule0', or you will get warnings:
// warning(125): implicit definition of token RULE1 in parser
// warning(125): implicit definition of token RULE2 in parser
// warning(125): implicit definition of token RULE3 in parser
// and failed to capture rule0 content (?)
Differences between Case1 and Case2/3 ?
The lexer rules are equivalent
Each of RULE1/2/3 in Case1 is a capturing group, similar to Regex:(X)
Each of RULE1/2/3 in Case3 is a non-capturing group, similar to Regex:(?:X)
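The regex comparison can be made concrete with Python's re module. This is a rough sketch of the analogy only, since ANTLR does not use regex groups; CASE1, CASE3 and scan are hypothetical names.

```python
import re

# Case1 analogue: each rule is a named (capturing) group, so we can tell
# which rule produced a match.
CASE1 = re.compile(r"(?P<RULE1>[A-C]+)|(?P<RULE2>[DEF]+)|(?P<RULE3>[GHI]+)")

# Case3 analogue: the same pieces as non-capturing groups; the match is
# found, but the information about which piece matched is gone.
CASE3 = re.compile(r"(?:[A-C]+)|(?:[DEF]+)|(?:[GHI]+)")

def scan(pattern, text):
    """Return (group name or None, matched text) for every match in text."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

print(scan(CASE1, "ABBCCCDDDDEEEEE"))
# [('RULE1', 'ABBCCC'), ('RULE2', 'DDDDEEEEE')]
print(scan(CASE3, "ABBCCCDDDDEEEEE"))
# [(None, 'ABBCCC'), (None, 'DDDDEEEEE')]
```

Both patterns split the input identically; only the first one keeps track of which rule each piece came from.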
Let's see a concrete example.
Goal: identify [ABC]+, [DEF]+, [GHI]+ tokens
input.txt
ABBCCCDDDDEEEEE ABCDE
FFGGHHIIJJKK FGHIJK
ABCDEFGHIJKL
Main.py
import sys
from antlr4 import *
from AlphabetLexer import AlphabetLexer
from AlphabetParser import AlphabetParser
from AlphabetListener import AlphabetListener
class MyListener(AlphabetListener):
    # Exit a parse tree produced by AlphabetParser#content.
    def exitContent(self, ctx:AlphabetParser.ContentContext):
        pass

    # (For Case1 Only) enable it when testing Case1
    # Exit a parse tree produced by AlphabetParser#rule0.
    def exitRule0(self, ctx:AlphabetParser.Rule0Context):
        print(ctx.getText())
# end-of-class

def main():
    file_name = sys.argv[1]
    input = FileStream(file_name)
    lexer = AlphabetLexer(input)
    stream = CommonTokenStream(lexer)
    parser = AlphabetParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))
    listener = MyListener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def

main()
Case1 and results:
Alphabet.g4 (Case1)
grammar Alphabet;
content : (rule0|ANYCHAR)* EOF;
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
ABBCCC
DDDDEEEEE
ABC
DE
FF
GGHHII
F
GHI
ABC
DEF
GHI
Case2/3 and results:
Alphabet.g4 (Case2)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Alphabet.g4 (Case3)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)
Did you see "capturing groups" and "non-capturing groups" parts?
Let's look at a second concrete example.
Goal: identify octal / decimal / hexadecimal numbers
input.txt
0
123
1~9999
001~077
0xFF, 0x01, 0xabc123
Number.g4
grammar Number;
content
: (number|ANY_CHAR)* EOF
;
number
: DECIMAL_NUMBER
| OCTAL_NUMBER
| HEXADECIMAL_NUMBER
;
DECIMAL_NUMBER
: [1-9][0-9]*
| '0'
;
OCTAL_NUMBER
: '0' '0'..'7'+
;
HEXADECIMAL_NUMBER
: '0x'[0-9A-Fa-f]+
;
ANY_CHAR
: .
;
Main.py
import sys
from antlr4 import *
from NumberLexer import NumberLexer
from NumberParser import NumberParser
from NumberListener import NumberListener
class Listener(NumberListener):
    # Exit a parse tree produced by NumberParser#Number.
    def exitNumber(self, ctx:NumberParser.NumberContext):
        print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
            ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
    # end-of-def
# end-of-class

def main():
    input = FileStream(sys.argv[1])
    lexer = NumberLexer(input)
    stream = CommonTokenStream(lexer)
    parser = NumberParser(stream)
    tree = parser.content()
    print(tree.toStringTree(recog=parser))
    listener = Listener()
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
# end-of-def

main()
Result:
# Input data (for reference)
# 0
# 123
# 1~9999
# 001~077
# 0xFF, 0x01, 0xabc123
$ python3 Main.py input.txt
(content (number 0) \n (number 123) \n (number 1) ~ (number 9999) \n (number 001) ~ (number 077) \n (number 0xFF) , (number 0x01) , (number 0xabc123) \n <EOF>)
0, dec: 0 , oct: None , hex: None
123, dec: 123 , oct: None , hex: None
1, dec: 1 , oct: None , hex: None
9999, dec: 9999 , oct: None , hex: None
001, dec: None , oct: 001 , hex: None
077, dec: None , oct: 077 , hex: None
0xFF, dec: None , oct: None , hex: 0xFF
0x01, dec: None , oct: None , hex: 0x01
0xabc123, dec: None , oct: None , hex: 0xabc123
If you add the modifier 'fragment' to DECIMAL_NUMBER, OCTAL_NUMBER, HEXADECIMAL_NUMBER, you won't be able to capture the number entities (since they are not tokens anymore). And the result will be:
$ python3 Main.py input.txt
(content 0 \n 1 2 3 \n 1 ~ 9 9 9 9 \n 0 0 1 ~ 0 7 7 \n 0 x F F , 0 x 0 1 , 0 x a b c 1 2 3 \n <EOF>)
This blog post has a very clear example where fragment makes a significant difference:
grammar number;
number: INT;
DIGIT : '0'..'9';
INT : DIGIT+;
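The grammar will recognize '42' but not '7'. For '7', both DIGIT and INT match one character, and when two lexer rules match the same length the one defined first wins, so the lexer emits DIGIT while the parser expects INT. You can fix it by making DIGIT a fragment (or by moving DIGIT after INT).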
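The lexer behaviour behind this can be sketched in a few lines of Python. This is a simplified model of "longest match wins, ties go to the rule defined first", not ANTLR's actual lexer; first_token and the two rule lists are hypothetical names.

```python
import re

def first_token(rules, text):
    """Return (rule name, lexeme) for the start of text.

    The longest match wins; on a tie the rule listed first is kept,
    because a later rule must be strictly longer to replace it.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

# DIGIT is a real token and is defined before INT.
without_fragment = [("DIGIT", r"[0-9]"), ("INT", r"[0-9]+")]
# DIGIT is a fragment, so it is not a candidate token at all.
with_fragment = [("INT", r"[0-9]+")]

print(first_token(without_fragment, "42"))  # ('INT', '42')
print(first_token(without_fragment, "7"))   # ('DIGIT', '7')
print(first_token(with_fragment, "7"))      # ('INT', '7')
```

In the second call the parser rule number: INT receives a DIGIT token and fails, which is the failure described above.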