I'm facing an issue while validating the below formula with the given grammar rules.
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
However, I am able to inject the below formulas:
if(2>3?loopup(12):matrix(2,3))
if(2>3?ceil(12.2):floor(2.3))
ast
: expr+ EOF
;
expr: nestedexpr
| LOOKUP_FIELD '(' idrule ')'
| TIER_FIELD '(' idrule ',' idrule ')'
| MATRIX_FIELD '(' idrule ',' idrule ')'
| IF '(' conditionalrule '?' expr ':' expr')'
| ROUND '(' idrule ',' roundnumberrule ')'
;
nestedexpr:
nestedexpr ('*'|'/'|'+'|'-') nestedexpr
| '(' '-' nestedexpr ')'
| '(' nestedexpr ')'
| ROUND '(' expr ',' roundnumberrule ')'
| MATH_FUNCTION_FIELD '(' expr ')'
| DYNAMIC_FIELD_ID/users-ack-status
;
arithematicexpr:
arithematicexpr ('*'|'/'|'+'|'-') arithematicexpr
| '(' '-' arithematicexpr ')'
| '(' arithematicexpr ')'
| DYNAMIC_FIELD_ID
;
orrule: OR '(' conditionalrule (',' conditionalrule)+ ')';
andrule: AND '(' conditionalrule (',' conditionalrule)+ ')';
conditionalrule: orrule | andrule | relationalrule;
relationalrule: DYNAMIC_FIELD_ID RELATIONAL_OPERATOR DYNAMIC_FIELD_ID;
idrule :
DYNAMIC_FIELD_ID
;
LOOKUP_FIELD: L O O K U P ;
TIER_FIELD: T I E R;
MATRIX_FIELD: M A T R I X;
IF: I F;
AND: A N D;
OR: O R;
ROUND : R O U N D;
MATH_FUNCTION_FIELD : C E I L | F L O O R;
RELATIONAL_OPERATOR: '<' | '>' | '<=' | '>=' | '<>' | '=';
BOOL_FIELD : T R U E | F A L S E;
DYNAMIC_FIELD_ID: {isDynamicFieldId()}? . ;
roundnumberrule: ROUND_NUMBER;
ROUND_NUMBER: [0-7];
WS : [ \t\r\n]+ -> skip ;
if(2>3?ceil(loopup(12)):floor(matrix(2,3)))
The above should get parsed by the mentioned grammar rule.
I write follow grammar in a .g4 file:
expr
:function=expr'('((arg',')*arg)?')' #callFun
|object=expr'.'SYMBOL #objCall
|coll=expr'['arg']' #collage
|varId=SYMBOL #varValue
|(expr) #brackets
|expr ADD ADD #lastInc
|expr SUB SUB #lastDec
//2
|ADD ADD <assoc=right> expr #postInc
|SUB SUB <assoc=right> expr #postDec
|NOT <assoc=right> expr #not
|SUB <assoc=right> expr #negative
|'new' expr #createObj
//3
|expr MUL expr #mul
|expr DIV expr #div
|expr MOD expr #mod
//4
|expr ADD expr #add
|expr SUB expr #sub
//5
|expr LMOVE expr #lMove
|expr RMOVE expr #rMove
//6
|expr LESS expr #less
|expr LESS EQUAL expr #lessEqual
|expr GREATER expr #greater
|expr GREATER EQUAL expr #greaterEqual
//7
|expr EQUAL EQUAL expr #equal
|expr NOT EQUAL expr #notEqual
//8
|expr AND ADD expr #and
//9
|expr OR OR expr #or
//10
|expr '?' <assoc=right> exprA = expr':'exprB = expr #trueAfalseB
//11
|expr EQUAL <assoc=right> expr #putIn
|expr ADD EQUAL <assoc=right> expr #addPutIn
|expr SUB EQUAL <assoc=right> expr #subPutIn
|expr MUL EQUAL <assoc=right> expr #mulPutIn
|expr DIV EQUAL <assoc=right> expr #divPutIn
|expr MOD EQUAL <assoc=right> expr #modPutIn
|expr LMOVE EQUAL <assoc=right> expr #lMovePutIn
|expr RMOVE EQUAL <assoc=right> expr #rMovePutIn
//12
|THROW <assoc=right> expr #throw
//13
|expr','expr #comma
;
If I'm not mistaken, there's only immediate left recursion in this grammar.But there is a error message:"The following sets of rules are mutually left-recursive [expr]".why?
As mentioned by Kaby76 in the comments, this alternative:
expr
: ...
| (expr)
| ...
;
is the cause. You probably meant to do this:
expr
: ...
| '(' expr ')'
| ...
;
I'm using Antlr4. Here is my grammar:
assign : id '=' expr ;
id : 'A' | 'B' | 'C' ;
expr : expr '+' term
| expr '-' term
| term ;
term : term '*' factor
| term '/' factor
| factor ;
factor : expr '**' factor
| '(' expr ')'
| id ;
WS : [ \t\r\n]+ -> skip ;
I know this grammar is ambiguous and also I know I should add an element to the grammar but I don't know how to make the grammar unambiguous.
factor : expr '**' factor
Consider the input
A + B ** C
A + B is an expr so we could analyse that as a factor, semantically (A+B)C
But the other, more conventional interpretation (A + (BC)) is also possible:
<expr> =>
<expr> + <term> =>
<term> + <term> =>
<factor> + <term> =>
A + <term> =>
A + <factor> =>
A + <expr> ** <factor> =>
A + <term> ** <factor> =>
A + <factor> ** <factor> =>
A + B ** <factor> =>
A + B ** C
Given the grammar below, I'm seeing very poor performance when parsing longer strings, on the order of seconds. (this on both Python and Go implementations) Is there something in this grammar that is causing that?
Example output:
0.000061s LEXING "hello world"
0.014349s PARSING "hello world"
0.000052s LEXING 5 + 10
0.015384s PARSING 5 + 10
0.000061s LEXING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.634113s PARSING FIRST_WORD(WORD_SLICE(contact.blerg, 2, 4))
0.000095s LEXING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
1.552758s PARSING (DATEDIF(DATEVALUE("01-01-1970"), date.now, "D") * 24 * 60 * 60) + ((((HOUR(date.now)+7) * 60) + MINUTE(date.now)) * 60))
This is on Python.. though I don't expect blazing performance I would expect sub-second for any input. What am I doing wrong?
grammar Excellent;
parse
: expr EOF
;
expr
: atom # expAtom
| concatenationExpr # expConcatenation
| equalityExpr # expEquality
| comparisonExpr # expComparison
| additionExpr # expAddition
| multiplicationExpr # expMultiplication
| exponentExpr # expExponent
| unaryExpr # expUnary
;
path
: NAME (step)*
;
step
: LBRAC expr RBRAC
| PATHSEP NAME
| PATHSEP NUMBER
;
parameters
: expr (COMMA expr)* # functionParameters
;
concatenationExpr
: atom (AMP concatenationExpr)? # concatenation
;
equalityExpr
: comparisonExpr op=(EQ|NE) comparisonExpr # equality
;
comparisonExpr
: additionExpr (op=(LT|GT|LTE|GTE) additionExpr)? # comparison
;
additionExpr
: multiplicationExpr (op=(ADD|SUB) multiplicationExpr)* # addition
;
multiplicationExpr
: exponentExpr (op=(MUL|DIV) exponentExpr)* # multiplication
;
exponentExpr
: unaryExpr (EXP exponentExpr)? # exponent
;
unaryExpr
: SUB? atom # negation
;
funcCall
: function=NAME LPAR parameters? RPAR # functionCall
;
funcPath
: function=funcCall (step)* # functionPath
;
atom
: path # contextReference
| funcCall # atomFuncCall
| funcPath # atomFuncPath
| LITERAL # stringLiteral
| NUMBER # decimalLiteral
| LPAR expr RPAR # parentheses
| TRUE # true
| FALSE # false
;
NUMBER
: DIGITS ('.' DIGITS?)?
;
fragment
DIGITS
: ('0'..'9')+
;
TRUE
: [Tt][Rr][Uu][Ee]
;
FALSE
: [Ff][Aa][Ll][Ss][Ee]
;
PATHSEP
:'.';
LPAR
:'(';
RPAR
:')';
LBRAC
:'[';
RBRAC
:']';
SUB
:'-';
ADD
:'+';
MUL
:'*';
DIV
:'/';
COMMA
:',';
LT
:'<';
GT
:'>';
EQ
:'=';
NE
:'!=';
LTE
:'<=';
GTE
:'>=';
QUOT
:'"';
EXP
: '^';
AMP
: '&';
LITERAL
: '"' ~'"'* '"'
;
Whitespace
: (' '|'\t'|'\n'|'\r')+ ->skip
;
NAME
: NAME_START_CHARS NAME_CHARS*
;
fragment
NAME_START_CHARS
: 'A'..'Z'
| '_'
| 'a'..'z'
| '\u00C0'..'\u00D6'
| '\u00D8'..'\u00F6'
| '\u00F8'..'\u02FF'
| '\u0370'..'\u037D'
| '\u037F'..'\u1FFF'
| '\u200C'..'\u200D'
| '\u2070'..'\u218F'
| '\u2C00'..'\u2FEF'
| '\u3001'..'\uD7FF'
| '\uF900'..'\uFDCF'
| '\uFDF0'..'\uFFFD'
;
fragment
NAME_CHARS
: NAME_START_CHARS
| '0'..'9'
| '\u00B7' | '\u0300'..'\u036F'
| '\u203F'..'\u2040'
;
ERRROR_CHAR
: .
;
You can always try to parse with SLL(*) first and only if that fails you need to parse it with LL(*) (which is the default).
See this ticket on ANTLR's GitHub for further explaination and here is an implementation that uses this strategy.
This method will save you (a lot of) time when parsing syntactically correct input.
Seems like this performance is due to the left recursion used in the addition / multiplication etc, operators. Rewriting these to be binary rules instead yields performance that is instant. (see below)
grammar Excellent;
COMMA : ',';
LPAREN : '(';
RPAREN : ')';
LBRACK : '[';
RBRACK : ']';
DOT : '.';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
EXPONENT : '^';
EQ : '=';
NEQ : '!=';
LTE : '<=';
LT : '<';
GTE : '>=';
GT : '>';
AMPERSAND : '&';
DECIMAL : [0-9]+('.'[0-9]+)?;
STRING : '"' (~["] | '""')* '"';
TRUE : [Tt][Rr][Uu][Ee];
FALSE : [Ff][Aa][Ll][Ss][Ee];
NAME : [a-zA-Z][a-zA-Z0-9_.]*; // variable names, e.g. contact.name or function names, e.g. SUM
WS : [ \t\n\r]+ -> skip; // ignore whitespace
ERROR : . ;
parse : expression EOF;
atom : fnname LPAREN parameters? RPAREN # functionCall
| atom DOT atom # dotLookup
| atom LBRACK expression RBRACK # arrayLookup
| NAME # contextReference
| STRING # stringLiteral
| DECIMAL # decimalLiteral
| TRUE # true
| FALSE # false
;
expression : atom # atomReference
| MINUS expression # negation
| expression EXPONENT expression # exponentExpression
| expression (TIMES | DIVIDE) expression # multiplicationOrDivisionExpression
| expression (PLUS | MINUS) expression # additionOrSubtractionExpression
| expression (LTE | LT | GTE | GT) expression # comparisonExpression
| expression (EQ | NEQ) expression # equalityExpression
| expression AMPERSAND expression # concatenation
| LPAREN expression RPAREN # parentheses
;
fnname : NAME
| TRUE
| FALSE
;
parameters : expression (COMMA expression)* # functionParameters
;
What does fragment mean in ANTLR?
I've seen both rules:
fragment DIGIT : '0'..'9';
and
DIGIT : '0'..'9';
What is the difference?
A fragment is somewhat akin to an inline function: It makes the grammar more readable and easier to maintain.
A fragment will never be counted as a token, it only serves to simplify a grammar.
Consider:
NUMBER: DIGITS | OCTAL_DIGITS | HEX_DIGITS;
fragment DIGITS: '1'..'9' '0'..'9'*;
fragment OCTAL_DIGITS: '0' '0'..'7'+;
fragment HEX_DIGITS: '0x' ('0'..'9' | 'a'..'f' | 'A'..'F')+;
In this example, matching a NUMBER will always return a NUMBER to the lexer, regardless of if it matched "1234", "0xab12", or "0777".
See item 3
According to the Definitive Antlr4 references book :
Rules prefixed with fragment can be called only from other lexer rules; they are not tokens in their own right.
actually they'll improve readability of your grammars.
look at this example :
STRING : '"' (ESC | ~["\\])* '"' ;
fragment ESC : '\\' (["\\/bfnrt] | UNICODE) ;
fragment UNICODE : 'u' HEX HEX HEX HEX ;
fragment HEX : [0-9a-fA-F] ;
STRING is a lexer using fragment rule like ESC .Unicode is used in Esc rule and Hex is used in Unicode fragment rule.
ESC and UNICODE and HEX rules can't be used explicitly.
The Definitive ANTLR 4 Reference (Page 106):
Rules prefixed with fragment can
be called only from other lexer rules; they are not tokens in their own right.
Abstract Concepts:
Case1: ( if I need the RULE1, RULE2, RULE3 entities or group info )
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
Case2: ( if I don't care RULE1, RULE2, RULE3, I just focus on RULE0 )
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
// RULE0 is a terminal node.
// You can't name it 'rule0', or you will get syntax errors:
// 'A-C' came as a complete surprise to me while matching alternative
// 'DEF' came as a complete surprise to me while matching alternative
Case3: ( is equivalent to Case2, making it more readable than Case2)
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
// You can't name it 'rule0', or you will get warnings:
// warning(125): implicit definition of token RULE1 in parser
// warning(125): implicit definition of token RULE2 in parser
// warning(125): implicit definition of token RULE3 in parser
// and failed to capture rule0 content (?)
Differences between Case1 and Case2/3 ?
The lexer rules are equivalent
Each of RULE1/2/3 in Case1 is a capturing group, similar to Regex:(X)
Each of RULE1/2/3 in Case3 is a non-capturing group, similar to Regex:(?:X)
Let's see a concrete example.
Goal: identify [ABC]+, [DEF]+, [GHI]+ tokens
input.txt
ABBCCCDDDDEEEEE ABCDE
FFGGHHIIJJKK FGHIJK
ABCDEFGHIJKL
Main.py
import sys
from antlr4 import *
from AlphabetLexer import AlphabetLexer
from AlphabetParser import AlphabetParser
from AlphabetListener import AlphabetListener
class MyListener(AlphabetListener):
# Exit a parse tree produced by AlphabetParser#content.
def exitContent(self, ctx:AlphabetParser.ContentContext):
pass
# (For Case1 Only) enable it when testing Case1
# Exit a parse tree produced by AlphabetParser#rule0.
def exitRule0(self, ctx:AlphabetParser.Rule0Context):
print(ctx.getText())
# end-of-class
def main():
file_name = sys.argv[1]
input = FileStream(file_name)
lexer = AlphabetLexer(input)
stream = CommonTokenStream(lexer)
parser = AlphabetParser(stream)
tree = parser.content()
print(tree.toStringTree(recog=parser))
listener = MyListener()
walker = ParseTreeWalker()
walker.walk(listener, tree)
# end-of-def
main()
Case1 and results:
Alphabet.g4 (Case1)
grammar Alphabet;
content : (rule0|ANYCHAR)* EOF;
rule0 : RULE1 | RULE2 | RULE3 ;
RULE1 : [A-C]+ ;
RULE2 : [DEF]+ ;
RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content (rule0 ABBCCC) (rule0 DDDDEEEEE) (rule0 ABC) (rule0 DE) (rule0 FF) (rule0 GGHHII) (rule0 F) (rule0 GHI) (rule0 ABC) (rule0 DEF) (rule0 GHI) <EOF>)
ABBCCC
DDDDEEEEE
ABC
DE
FF
GGHHII
F
GHI
ABC
DEF
GHI
Case2/3 and results:
Alphabet.g4 (Case2)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : [A-C]+ | [DEF]+ | ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Alphabet.g4 (Case3)
grammar Alphabet;
content : (RULE0|ANYCHAR)* EOF;
RULE0 : RULE1 | RULE2 | RULE3 ;
fragment RULE1 : [A-C]+ ;
fragment RULE2 : [DEF]+ ;
fragment RULE3 : ('G'|'H'|'I')+ ;
ANYCHAR : . -> skip;
Result:
# Input data (for reference)
# ABBCCCDDDDEEEEE ABCDE
# FFGGHHIIJJKK FGHIJK
# ABCDEFGHIJKL
$ python3 Main.py input.txt
(content ABBCCC DDDDEEEEE ABC DE FF GGHHII F GHI ABC DEF GHI <EOF>)
Did you see "capturing groups" and "non-capturing groups" parts?
Let's see the concrete example2.
Goal: identify octal / decimal / hexadecimal numbers
input.txt
0
123
1~9999
001~077
0xFF, 0x01, 0xabc123
Number.g4
grammar Number;
content
: (number|ANY_CHAR)* EOF
;
number
: DECIMAL_NUMBER
| OCTAL_NUMBER
| HEXADECIMAL_NUMBER
;
DECIMAL_NUMBER
: [1-9][0-9]*
| '0'
;
OCTAL_NUMBER
: '0' '0'..'9'+
;
HEXADECIMAL_NUMBER
: '0x'[0-9A-Fa-f]+
;
ANY_CHAR
: .
;
Main.py
import sys
from antlr4 import *
from NumberLexer import NumberLexer
from NumberParser import NumberParser
from NumberListener import NumberListener
class Listener(NumberListener):
# Exit a parse tree produced by NumberParser#Number.
def exitNumber(self, ctx:NumberParser.NumberContext):
print('%8s, dec: %-8s, oct: %-8s, hex: %-8s' % (ctx.getText(),
ctx.DECIMAL_NUMBER(), ctx.OCTAL_NUMBER(), ctx.HEXADECIMAL_NUMBER()))
# end-of-def
# end-of-class
def main():
input = FileStream(sys.argv[1])
lexer = NumberLexer(input)
stream = CommonTokenStream(lexer)
parser = NumberParser(stream)
tree = parser.content()
print(tree.toStringTree(recog=parser))
listener = Listener()
walker = ParseTreeWalker()
walker.walk(listener, tree)
# end-of-def
main()
Result:
# Input data (for reference)
# 0
# 123
# 1~9999
# 001~077
# 0xFF, 0x01, 0xabc123
$ python3 Main.py input.txt
(content (number 0) \n (number 123) \n (number 1) ~ (number 9999) \n (number 001) ~ (number 077) \n (number 0xFF) , (number 0x01) , (number 0xabc123) \n <EOF>)
0, dec: 0 , oct: None , hex: None
123, dec: 123 , oct: None , hex: None
1, dec: 1 , oct: None , hex: None
9999, dec: 9999 , oct: None , hex: None
001, dec: None , oct: 001 , hex: None
077, dec: None , oct: 077 , hex: None
0xFF, dec: None , oct: None , hex: 0xFF
0x01, dec: None , oct: None , hex: 0x01
0xabc123, dec: None , oct: None , hex: 0xabc123
If you add the modifier 'fragment' to DECIMAL_NUMBER, OCTAL_NUMBER, HEXADECIMAL_NUMBER, you won't be able to capture the number entities (since they are not tokens anymore). And the result will be:
$ python3 Main.py input.txt
(content 0 \n 1 2 3 \n 1 ~ 9 9 9 9 \n 0 0 1 ~ 0 7 7 \n 0 x F F , 0 x 0 1 , 0 x a b c 1 2 3 \n <EOF>)
This blog post has a very clear example where fragment makes a significant difference:
grammar number;
number: INT;
DIGIT : '0'..'9';
INT : DIGIT+;
The grammar will recognize '42' but not '7'. You can fix it by making digit a fragment (or moving DIGIT after INT).