How to have both function calls and parenthetical grouping without backtrack - antlr

Is there any way to specify a grammar which allows the following syntax:
f(x)(g, (1-(-2))*3, 1+2*3)[0]
which is transformed into (in pseudo-lisp to show order):
(index
((f x)
g
(* (- 1 -2) 3)
(+ (* 2 3) 1)
)
0
)
along with things like limited operator precedence etc.
The following grammar works with backtrack = true, but I'd like to avoid that:
grammar T;
options {
output=AST;
backtrack=true;
memoize=true;
}
tokens {
CALL;
INDEX;
LOOKUP;
}
prog: (expr '\n')* ;
expr : boolExpr;
boolExpr
: relExpr (boolop^ relExpr)?
;
relExpr
: addExpr (relop^ addExpr)?
| a=addExpr oa=relop b=addExpr ob=relop c=addExpr
-> ^(LAND ^($oa $a $b) ^($ob $b $c))
;
addExpr
: mulExpr (addop^ mulExpr)?
;
mulExpr
: atomExpr (mulop^ atomExpr)?
;
atomExpr
: INT
| ID
| OPAREN expr CPAREN -> expr
| call
;
call
: callable ( OPAREN (expr (COMMA expr)*)? CPAREN -> ^(CALL callable expr*)
| OBRACK expr CBRACK -> ^(INDEX callable expr)
| DOT ID -> ^(INDEX callable ID)
)
;
fragment
callable
: ID
| OPAREN expr CPAREN
;
fragment
boolop
: LAND | LOR
;
fragment
relop
: (EQ|GT|LT|GTE|LTE)
;
fragment
addop
: (PLUS|MINUS)
;
fragment
mulop
: (TIMES|DIVIDE)
;
EQ : '==' ;
GT : '>' ;
LT : '<' ;
GTE : '>=' ;
LTE : '<=' ;
LAND : '&&' ;
LOR : '||' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIVIDE : '/' ;
ID : ('a'..'z')+ ;
INT : '0'..'9' ;
OPAREN : '(' ;
CPAREN : ')' ;
OBRACK : '[' ;
CBRACK : ']' ;
DOT : '.' ;
COMMA : ',' ;

There are a couple of things wrong with your grammar:
1
Only lexer rules can be fragments, not parser rules. Some ANTLR targets (like the Java target) simply ignore the fragment keyword in front of parser rules, but it's better to remove it from your grammar: if you decide to create a parser for a different target language, you may run into problems because of it.
2
Without backtrack=true, relExpr needs a single alternative instead of the two alternatives you have now, because both start with addExpr and are therefore ambiguous. Inside that single alternative you cannot mix tree-rewrite operators (^ and !) with rewrite rules (->).
In your case, you can't create the desired AST with just ^ (inside a single alternative), so you'll need to do it like this:
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
(yes, I know, it's not particularly pretty, but that can't be helped AFAIK)
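What the nested rewrite in relExpr computes can be sketched in plain Python. This is only an illustration of the tree shapes, not ANTLR-generated code; the function name build_rel_tree and the tuple representation of AST nodes are assumptions made for the example.

```python
def build_rel_tree(a, rest):
    """Mimic the relExpr rewrites.

    a    -- the first operand (the result of the first addExpr)
    rest -- a list of at most two (operator, operand) pairs
    Trees are represented as tuples: (root, child, child).
    """
    if len(rest) > 2:
        raise ValueError("relExpr allows at most two comparison operators")
    tree = a                                   # (a=addExpr -> $a)
    if rest:
        oa, b = rest[0]
        tree = (oa, a, b)                      # -> ^($oa $a $b)
        if len(rest) == 2:
            ob, c = rest[1]
            # -> ^(LAND ^($oa $a $b) ^($ob $b $c))
            tree = ("&&", (oa, a, b), (ob, b, c))
    return tree


print(build_rel_tree("a", [(">=", "b"), ("<", "c")]))
```

Each stage replaces the tree built by the previous one, which is why the grammar version needs a rewrite rule at every nesting level.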
Also, you can only put the LAND token in the rewrite rules if it is defined in the tokens { ... } block:
tokens {
// literal tokens
LAND='&&';
...
// imaginary tokens
CALL;
...
}
Otherwise you can only use tokens (and other parser rules) in rewrite rules if they really occur inside the parser rule itself.
3
You did not account for the unary minus in your grammar. Implement it like this:
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
Now, to create a grammar that does not need backtrack=true, remove the ID and '(' expr ')' from your atomExpr rule:
atomExpr
: INT
| call
;
and make everything past callable optional inside your call rule:
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
That way, ID and '(' expr ')' are already matched by call (and there's no ambiguity).
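To see why the ( ... )* loop needs no backtracking, here is a minimal hand-written Python sketch of the same idea, restricted to INT, ID, parentheses, calls, indexing and member lookup (no operators). The Parser class, tokenize and parse are hypothetical names for this illustration, not anything ANTLR produces.

```python
import re

TOKEN = re.compile(r"\s*([a-z]+|[0-9]+|[()\[\],.])")


def tokenize(text):
    text = text.rstrip()
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


class Parser:
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def atom(self):
        # atomExpr : INT | call
        tok = self.peek()
        if tok is not None and tok.isdigit():
            return self.eat()
        return self.call()

    def call(self):
        # call : callable ( '(' params ')' | '[' expr ']' | '.' ID )*
        tree = self.callable()
        while self.peek() in ("(", "[", "."):
            op = self.eat()
            if op == "(":
                args = []
                if self.peek() != ")":
                    args.append(self.atom())
                    while self.peek() == ",":
                        self.eat(",")
                        args.append(self.atom())
                self.eat(")")
                tree = ("CALL", tree, *args)       # wraps the tree so far
            elif op == "[":
                tree = ("INDEX", tree, self.atom())
                self.eat("]")
            else:
                tree = ("INDEX", tree, self.eat())
        return tree

    def callable(self):
        # callable : ID | '(' expr ')'
        if self.peek() == "(":
            self.eat("(")
            inner = self.atom()
            self.eat(")")
            return inner
        tok = self.eat()
        if not tok.isalpha():
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.atom()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("f(x)(g, 1)[0]"))
# ('INDEX', ('CALL', ('CALL', 'f', 'x'), 'g', '1'), '0')
```

One token of lookahead is enough at every decision point: an opening parenthesis at the start of an atom always means grouping, and one after a callable always means a call.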
Taking all the remarks above into account, you could get the following grammar:
grammar T;
options {
output=AST;
}
tokens {
// literal tokens
EQ = '==' ;
GT = '>' ;
LT = '<' ;
GTE = '>=' ;
LTE = '<=' ;
LAND = '&&' ;
LOR = '||' ;
PLUS = '+' ;
MINUS = '-' ;
TIMES = '*' ;
DIVIDE = '/' ;
OPAREN = '(' ;
CPAREN = ')' ;
OBRACK = '[' ;
CBRACK = ']' ;
DOT = '.' ;
COMMA = ',' ;
// imaginary tokens
CALL;
INDEX;
LOOKUP;
UNARY_MINUS;
PARAMS;
}
prog
: expr EOF -> expr
;
expr
: boolExpr
;
boolExpr
: relExpr ((LAND | LOR)^ relExpr)?
;
relExpr
: (a=addExpr -> $a) ( (oa=relOp b=addExpr -> ^($oa $a $b))
( ob=relOp c=addExpr -> ^(LAND ^($oa $a $b) ^($ob $b $c))
)?
)?
;
addExpr
: mulExpr ((PLUS | MINUS)^ mulExpr)*
;
mulExpr
: unaryExpr ((TIMES | DIVIDE)^ unaryExpr)*
;
unaryExpr
: MINUS atomExpr -> ^(UNARY_MINUS atomExpr)
| atomExpr
;
atomExpr
: INT
| call
;
call
: (callable -> callable) ( OPAREN params CPAREN -> ^(CALL $call params)
| OBRACK expr CBRACK -> ^(INDEX $call expr)
| DOT ID -> ^(INDEX $call ID)
)*
;
callable
: ID
| OPAREN expr CPAREN -> expr
;
params
: (expr (COMMA expr)*)? -> ^(PARAMS expr*)
;
relOp
: EQ | GT | LT | GTE | LTE
;
ID : 'a'..'z'+ ;
INT : '0'..'9'+ ;
SPACE : (' ' | '\t') {skip();};
which would parse the input "a >= b < c" into the following AST:
[image: AST diagram, not preserved]
and the input "f(x)(g, (1-(-2))*3, 1+2*3)[0]" as follows:
[image: AST diagram, not preserved]

Related

Why parse failing after upgrading from Antlr 3 to Antlr 4?

Recently I have been trying to upgrade my project from ANTLR 3 to ANTLR 4. But after changing the grammar file, equations that parsed previously no longer work. I am new to ANTLR 4, so I can't tell whether my changes broke something.
Here is my original grammar file:
grammar equation;
options {
language=CSharp2;
output=AST;
ASTLabelType=CommonTree;
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF!;
equation: variable ASSIGN expression -> ^(EQUATION variable expression)
;
parExpression
: LPAREN expression RPAREN -> ^(PAREXPR expression)
;
expression
: conditionalexpression -> ^(EXPR conditionalexpression)
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR^ andExpression )*
;
andExpression
: comparisonExpression ( AND^ comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ^ | NE^ | LTE^ | GTE^ | LT^ | GT^) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS^ | MINUS^) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES^ | DIVIDE^) unaryExpression )*
;
unaryExpression
: NOT unaryExpression -> ^(UNARYEXPR NOT unaryExpression)
| MINUS unaryExpression -> ^(UNARYEXPR MINUS unaryExpression)
| exponentexpression;
exponentexpression
: primary (CARET^ primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING -> ^(CONSTANT STRING) | numeric -> ^(CONSTANT numeric);
booleantok : BOOLEAN -> ^(BOOLEAN);
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER -> IDENTIFIER+;
function
: scopedidentifier LPAREN argumentlist RPAREN -> ^(FUNCTION scopedidentifier argumentlist);
variable: scopedidentifier -> ^(VARIABLE scopedidentifier);
argumentlist: (expression) ? (COMMA! expression)*;
WS : (' '|'\r'|'\n'|'\t')+ {$channel=HIDDEN;};
COMMENT : '/*' .* '*/' {$channel=HIDDEN;};
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;};
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
And here is what I have after my changes:
grammar equation;
options {
}
tokens {
VARIABLE;
CONSTANT;
EXPR;
PAREXPR;
EQUATION;
UNARYEXPR;
FUNCTION;
BINARYOP;
LIST;
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .* '*/' ->channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: (('\"') ( (~('\"')) )* ('\"'))+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
A sample equation that I am trying to parse (which was working OK before) is:
[a].[b] = 1.76 * [Product_DC].[PDC_Inbound_Pallets] * if(product_dc.[PDC_DC] =="US84",1,0)
Thanks in advance.
Tokens should be separated with a comma (,), not a semicolon (;). See also the Token Section paragraph in the official doc.
Since ANTLR 4.7, a backslash is not required to escape a double quote. STRING: (('\"') ( (~('\"')) )* ('\"'))+; should be rewritten to STRING: ('"' ~'"'* '"')+;.
You missed the question mark for non-greedy matching in the multiline comment token: '/*' .* '*/' should be '/*' .*? '*/'.
So, the fixed grammar looks like this:
grammar equation;
options {
}
tokens {
VARIABLE,
CONSTANT,
EXPR,
PAREXPR,
EQUATION,
UNARYEXPR,
FUNCTION,
BINARYOP,
LIST
}
equationset: equation* EOF;
equation: variable ASSIGN expression
;
parExpression
: LPAREN expression RPAREN
;
expression
: conditionalexpression
;
conditionalexpression
: orExpression
;
orExpression
: andExpression ( OR andExpression )*
;
andExpression
: comparisonExpression ( AND comparisonExpression )*;
comparisonExpression:
additiveExpression ((EQ | NE | LTE | GTE | LT | GT) additiveExpression)*;
additiveExpression
: multiplicativeExpression ( (PLUS | MINUS) multiplicativeExpression )*
;
multiplicativeExpression
: unaryExpression ( ( TIMES | DIVIDE) unaryExpression )*
;
unaryExpression
: NOT unaryExpression
| MINUS unaryExpression
| exponentexpression;
exponentexpression
: primary (CARET primary)*;
primary : parExpression | constant | booleantok | variable | function;
numeric: INTEGER | REAL;
constant: STRING | numeric;
booleantok : BOOLEAN;
scopedidentifier
: (IDENTIFIER DOT)* IDENTIFIER;
function
: scopedidentifier LPAREN argumentlist RPAREN;
variable: scopedidentifier;
argumentlist: (expression) ? (COMMA expression)*;
WS : (' '|'\r'|'\n'|'\t')+ ->channel(HIDDEN);
COMMENT : '/*' .*? '*/' -> channel(HIDDEN);
LINE_COMMENT : '//' ~('\n'|'\r')* '\r'? '\n' ->channel(HIDDEN);
STRING: ('"' ~'"'* '"')+;
fragment ALPHA: 'a'..'z'|'_';
fragment DIGIT: '0'..'9';
fragment ALNUM: ALPHA|DIGIT;
EQ : '==';
ASSIGN : '=';
NE : '!=' | '<>';
OR : 'or' | '||';
AND : 'and' | '&&';
NOT : '!'|'not';
LTE : '<=';
GTE : '>=';
LT : '<';
GT : '>';
TIMES : '*';
DIVIDE : '/';
BOOLEAN : 'true' | 'false';
IDENTIFIER: ALPHA (ALNUM)* | ('[' (~(']'))+ ']') ;
REAL: DIGIT* DOT DIGIT+ ('e' (PLUS | MINUS)? DIGIT+)?;
INTEGER: DIGIT+;
PLUS : '+';
MINUS : '-';
COMMA : ',';
RPAREN : ')';
LPAREN : '(';
DOT : '.';
CARET : '^';
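The difference between the greedy and the non-greedy comment rule can be illustrated with Python's re module. This is only an analogy using regular expressions, not ANTLR itself, and the sample string is made up for the example.

```python
import re

source = "a /* one */ b /* two */ c"

# Greedy: .* runs on to the LAST "*/", swallowing the code between comments.
greedy = re.findall(r"/\*.*\*/", source, flags=re.DOTALL)

# Non-greedy: .*? stops at the FIRST "*/", so each comment is matched alone.
lazy = re.findall(r"/\*.*?\*/", source, flags=re.DOTALL)

print(greedy)  # ['/* one */ b /* two */']
print(lazy)    # ['/* one */', '/* two */']
```

With the greedy form, everything between the first and the last comment in the input would end up on the hidden channel.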

ANTLR4 Token is not recognized when substituted

I'm trying to modify the grammar of the SQLite syntax (I'm only interested in a variant of the WHERE clause) and I keep getting a weird error when I replace the literal 'and' with its own token, K_AND.
grammar wtfql;
/*
SQLite understands the following binary operators, in order from highest to
lowest precedence:
||
* / %
+ -
<< >> & |
< <= > >=
= != <> IS IS NOT IN LIKE GLOB MATCH REGEXP
AND
OR
*/
start : expr EOF?;
expr
: literal_value
//BIND_PARAMETER
| ( table_name '.' )? column_name
| unary_operator expr
| expr '||' expr
| expr ( '*' | '/' | '%' ) expr
| expr ( '+' | '-' ) expr
| expr ( '<' | '<=' | '>' | '>=' ) expr
| expr ( '=' | '<>' | K_IN ) expr
| expr K_AND expr
| expr K_OR expr
| function_name '(' ( expr ( ',' expr )* )? ')'
| '(' expr ')'
| expr K_NOT expr
| expr ( K_NOT K_NULL )
| expr K_NOT? K_IN ( '(' ( expr ( ',' expr )* ) ')' )
;
unary_operator
: '-'
| '+'
| K_NOT
;
literal_value
: NUMERIC_LITERAL
| STRING_LITERAL
| K_NULL
;
function_name
: IDENTIFIER
;
table_name
: any_name
;
column_name
: any_name
;
any_name
: IDENTIFIER
| keyword
// | '(' any_name ')'
;
keyword
: K_AND
| K_NOT
| K_NULL
| K_IN
| K_OR
;
IDENTIFIER
: [a-zA-Z_] [a-zA-Z_0-9]* // TODO check: needs more chars in set
;
NUMERIC_LITERAL
: DIGIT+ ( '.' DIGIT* )? ( E [-+]? DIGIT+ )?
| '.' DIGIT+ ( E [-+]? DIGIT+ )?
;
STRING_LITERAL
: '\"' ( ~'\"' | '\"\"' )* '\"'
;
SPACES
: [ \u000B\t\r\n] -> channel(HIDDEN)
;
DOT : '.';
OPEN_PAR : '(';
CLOSE_PAR : ')';
COMMA : ',';
STAR : '*';
PLUS : '+';
MINUS : '-';
TILDE : '~';
DIV : '/';
MOD : '%';
AMP : '&';
PIPE : '|';
LT : '<';
LT_EQ : '<=';
GT : '>';
GT_EQ : '>=';
EQ : '=';
NOT_EQ2 : '<>';
K_AND : A N D;
K_NOT : N O T;
K_NULL : N U L L;
K_OR : O R;
K_IN : I N;
fragment DIGIT : [0-9];
fragment A : [aA];
fragment B : [bB];
fragment C : [cC];
fragment D : [dD];
fragment E : [eE];
fragment F : [fF];
fragment G : [gG];
fragment H : [hH];
fragment I : [iI];
fragment J : [jJ];
fragment K : [kK];
fragment L : [lL];
fragment M : [mM];
fragment N : [nN];
fragment O : [oO];
fragment P : [pP];
fragment Q : [qQ];
fragment R : [rR];
fragment S : [sS];
fragment T : [tT];
fragment U : [uU];
fragment V : [vV];
fragment W : [wW];
fragment X : [xX];
fragment Y : [yY];
fragment Z : [zZ];
writing
| expr K_AND expr
with the input
field1=1 and field2 = 2
results in
line 1:8 mismatched input 'and' expecting {<EOF>, '||', '*', '+', '-', '/', '%', '<', '<=', '>', '>=', '=', '<>', K_AND, K_NOT, K_OR, K_IN}
while
| expr 'and' expr
works like a charm:
$ antlr4 wtfql.g4 && javac -classpath /usr/local/Cellar/antlr/4.4/antlr-4.4-complete.jar wtfql*.java && cat test.txt | grun wtfql start -tree -gui
(start (expr (expr (expr (column_name (any_name feld1))) = (expr (literal_value 1))) and (expr (expr (column_name (any_name feld2))) = (expr (literal_value 2)))) <EOF>)
What am I missing?
I presume "and" is lexed as an IDENTIFIER: the IDENTIFIER rule comes before the K_AND rule, and when two lexer rules match the same text, the one defined first wins.
If you write 'and' in the parser rule, this implicitly creates a token (not K_AND!) which comes before IDENTIFIER and thus wins.
Rule of thumb: put more specific lexer rules first, and don't create new lexer tokens implicitly in parser rules.
If you check the token type, you'll get a clue what's going on.
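The tie-breaking behaviour can be sketched with a toy lexer in Python: the longest match wins, and on a tie the rule listed first wins. The helper make_lexer and its regex-based rules are assumptions for this illustration and only imitate the lexer's rule selection.

```python
import re


def make_lexer(rules):
    """rules is an ordered list of (token_name, regex) pairs."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]

    def lex(text):
        pos, tokens = 0, []
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            best = None
            for name, rx in compiled:
                m = rx.match(text, pos)
                # A strictly longer match replaces the current best;
                # on a tie the earlier rule is kept.
                if m and m.end() > pos and (best is None or m.end() > best[1]):
                    best = (name, m.end())
            if best is None:
                raise SyntaxError(f"no rule matches at offset {pos}")
            tokens.append((best[0], text[pos:best[1]]))
            pos = best[1]
        return tokens

    return lex


IDENTIFIER = ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*")
K_AND = ("K_AND", r"[aA][nN][dD]")

identifier_first = make_lexer([IDENTIFIER, K_AND])
keyword_first = make_lexer([K_AND, IDENTIFIER])

print(identifier_first("a and b"))
print(keyword_first("a and b"))
```

With IDENTIFIER listed first, "and" comes out as an IDENTIFIER; with K_AND listed first it comes out as K_AND, while a longer word such as "android" is still an IDENTIFIER because the longer match wins.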

Trying to resolve left-recursion trying to build Parser with ANTLR

I’m currently trying to build a parser for the language Oberon using ANTLR and Eclipse.
This is what I have got so far:
grammar oberon;
options
{
language = Java;
//backtrack = true;
output = AST;
}
@parser::header {package dhbw.Oberon;}
@lexer::header {package dhbw.Oberon; }
T_ARRAY : 'ARRAY' ;
T_BEGIN : 'BEGIN';
T_CASE : 'CASE' ;
T_CONST : 'CONST' ;
T_DO : 'DO' ;
T_ELSE : 'ELSE' ;
T_ELSIF : 'ELSIF' ;
T_END : 'END' ;
T_EXIT : 'EXIT' ;
T_IF : 'IF' ;
T_IMPORT : 'IMPORT' ;
T_LOOP : 'LOOP' ;
T_MODULE : 'MODULE' ;
T_NIL : 'NIL' ;
T_OF : 'OF' ;
T_POINTER : 'POINTER' ;
T_PROCEDURE : 'PROCEDURE' ;
T_RECORD : 'RECORD' ;
T_REPEAT : 'REPEAT' ;
T_RETURN : 'RETURN';
T_THEN : 'THEN' ;
T_TO : 'TO' ;
T_TYPE : 'TYPE' ;
T_UNTIL : 'UNTIL' ;
T_VAR : 'VAR' ;
T_WHILE : 'WHILE' ;
T_WITH : 'WITH' ;
module : T_MODULE ID SEMI importlist? declarationsequence?
(T_BEGIN statementsequence)? T_END ID PERIOD ;
importlist : T_IMPORT importitem (COMMA importitem)* SEMI ;
importitem : ID (ASSIGN ID)? ;
declarationsequence :
( T_CONST (constantdeclaration SEMI)*
| T_TYPE (typedeclaration SEMI)*
| T_VAR (variabledeclaration SEMI)*)
(proceduredeclaration SEMI | forwarddeclaration SEMI)*
;
constantdeclaration: identifierdef EQUAL expression ;
identifierdef: ID MULT? ;
expression: simpleexpression (relation simpleexpression)? ;
simpleexpression : (PLUS|MINUS)? term (addoperator term)* ;
term: factor (muloperator factor)* ;
factor: number
| stringliteral
| T_NIL
| set
| designator '(' explist? ')'
;
number: INT | HEX ; // TODO add real
stringliteral : '"' ( ~('\\'|'"') )* '"' ;
set: '{' elementlist? '}' ;
elementlist: element (COMMA element)* ;
element: expression (RANGESEP expression)? ;
designator: qualidentifier
('.' ID
| '[' explist ']'
| '(' qualidentifier ')'
| UPCHAR )+
;
explist: expression (COMMA expression)* ;
actualparameters: '(' explist? ')' ;
muloperator: MULT | DIV | MOD | ET ;
addoperator: PLUS | MINUS | OR ;
relation: EQUAL ; // TODO
typedeclaration: ID EQUAL type ;
type: qualidentifier
| arraytype
| recordtype
| pointertype
| proceduretype
;
qualidentifier: (ID '.')* ID ;
arraytype: T_ARRAY expression (',' expression) T_OF type;
recordtype: T_RECORD ('(' qualidentifier ')')? fieldlistsequence T_END ;
fieldlistsequence: fieldlist (SEMI fieldlist) ;
fieldlist: (identifierlist COLON type)? ;
identifierlist: identifierdef (COMMA identifierdef)* ;
pointertype: T_POINTER T_TO type ;
proceduretype: T_PROCEDURE formalparameters? ;
variabledeclaration: identifierlist COLON type ;
proceduredeclaration: procedureheading SEMI procedurebody ID ;
procedureheading: T_PROCEDURE MULT? identifierdef formalparameters? ;
formalparameters: '(' params? ')' (COLON qualidentifier)? ;
params: fpsection (SEMI fpsection)* ;
fpsection: T_VAR? idlist COLON formaltype ;
idlist: ID (COMMA ID)* ;
formaltype: (T_ARRAY T_OF)* (qualidentifier | proceduretype);
procedurebody: declarationsequence (T_BEGIN statementsequence)? T_END ;
forwarddeclaration: T_PROCEDURE UPCHAR? ID MULT? formalparameters? ;
statementsequence: statement (SEMI statement)* ;
statement : assignment
| procedurecall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;
ifstatement: T_IF expression T_THEN statementsequence
(T_ELSIF expression T_THEN statementsequence)*
(T_ELSE statementsequence)? T_END ;
casestatement: T_CASE expression T_OF caseitem ('|' caseitem)*
(T_ELSE statementsequence)? T_END ;
caseitem: caselabellist COLON statementsequence ;
caselabellist: caselabels (COMMA caselabels)* ;
caselabels: expression (RANGESEP expression)? ;
whilestatement: T_WHILE expression T_DO statementsequence T_END ;
repeatstatement: T_REPEAT statementsequence T_UNTIL expression ;
loopstatement: T_LOOP statementsequence T_END ;
withstatement: T_WITH qualidentifier COLON qualidentifier T_DO statementsequence T_END ;
ID : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
fragment DIGIT : '0'..'9' ;
INT : ('-')?DIGIT+ ;
fragment HEXDIGIT : '0'..'9'|'A'..'F' ;
HEX : HEXDIGIT+ 'H' ;
ASSIGN : ':=' ;
COLON : ':' ;
COMMA : ',' ;
DIV : '/' ;
EQUAL : '=' ;
ET : '&' ;
MINUS : '-' ;
MOD : '%' ;
MULT : '*' ;
OR : '|' ;
PERIOD : '.' ;
PLUS : '+' ;
RANGESEP : '..' ;
SEMI : ';' ;
UPCHAR : '^' ;
WS : ( ' ' | '\t' | '\r' | '\n'){skip();};
My problem is when I check the grammar I get the following error and just can’t find an appropriate way to fix this:
rule statement has non-LL(*) decision
due to recursive rule invocations reachable from alts 1,2.
Resolve by left-factoring or using syntactic predicates
or using backtrack=true option.
|---> statement : assignment
Also I have the problem with declarationsequence and simpleexpression.
When I use options { … backtrack = true; … } it at least compiles, but it obviously doesn’t work correctly when I run a test file, and I can’t find a way to resolve the left-recursion on my own (or maybe I’m just too blind at the moment because I’ve looked at this for far too long). Any ideas how I could change the lines where the errors occur to make it work?
EDIT
I was able to fix one of the three problems: statement works now. The problem was that assignment and procedurecall both started with designator.
statement : procedureassignmentcall
| ifstatement
| casestatement
| whilestatement
| repeatstatement
| loopstatement
| withstatement
| T_EXIT
| T_RETURN expression?
;
procedureassignmentcall : (designator ASSIGN)=> assignment | procedurecall;
assignment: designator ASSIGN expression ;
procedurecall: designator actualparameters? ;

Why is my tree grammar ambiguous?

I'm a little confused. I have a grammar that works well and matches my language just as I want it to. Recently, I added a couple rules to the grammar and in the process of converting the new grammar rules to the tree grammar I am getting some strange errors. The first error I was getting was that the tree grammar was ambiguous.
The errors I received were:
[10:53:16] warning(200): ShiroDefinitionPass.g:129:17:
Decision can match input such as "'subjunctive node'" using multiple alternatives: 5, 6
As a result, alternative(s) 6 were disabled for that input
[10:53:16] warning(200): ShiroDefinitionPass.g:129:17:
Decision can match input such as "PORT_ASSIGNMENT" using multiple alternatives: 2, 6
and about 10 more similar errors.
I can't tell why the tree grammar is ambiguous. It was fine before I added the sNode, subjunctDeclNodeProd, subjunctDecl, and subjunctSelector rules.
My grammar is:
grammar Shiro;
options{
language = Java;
ASTLabelType=CommonTree;
output=AST;
}
tokens{
NEGATION;
STATE_DECL;
PORT_DECL;
PORT_INIT;
PORT_ASSIGNMENT;
PORT_TAG;
PATH;
PORT_INDEX;
EVAL_SELECT;
SUBJ_SELECT;
SUBJ_NODE_PROD;
ACTIVATION;
ACTIVATION_LIST;
PRODUCES;
}
shiro : statement+
;
statement
: nodestmt
| sNode
| graphDecl
| statestmt
| collection
| view
| NEWLINE!
;
view : 'view' IDENT mfName IDENT -> ^('view' IDENT mfName IDENT)
;
collection
: 'collection' IDENT orderingFunc path 'begin' NEWLINE
(collItem)+ NEWLINE?
'end'
-> ^('collection' IDENT orderingFunc path collItem+)
;
collItem: IDENT -> IDENT
;
orderingFunc
: IDENT -> IDENT
;
statestmt
: 'state' stateName 'begin' NEWLINE
stateHeader
'end' -> ^(STATE_DECL stateName stateHeader)
;
stateHeader
: (stateTimeStmt | stateCommentStmt | stateParentStmt | stateGraphStmt | activationPath | NEWLINE!)+
;
stateTimeStmt
: 'Time' time -> ^('Time' time)
;
stateCommentStmt
: 'Comment' comment -> ^('Comment' comment)
;
stateParentStmt
: 'Parent' stateParent -> ^('Parent' stateParent)
;
stateGraphStmt
: 'Graph' stateGraph -> ^('Graph' stateGraph)
;
stateName
: IDENT
;
time : STRING_LITERAL
;
comment : STRING_LITERAL
;
stateParent
: IDENT
;
stateGraph
: IDENT
;
activationPath
: l=activation ('.'^ (r=activation | activationList))*
;
activationList
: '<' activation (',' activation)* '>' -> ^(ACTIVATION_LIST activation+)
;
activation
: c=IDENT ('[' v=IDENT ']')? -> ^(ACTIVATION $c ($v)?)
;
graphDecl
: 'graph' IDENT 'begin' NEWLINE
graphLine+
'end'
-> ^('graph' IDENT graphLine+)
;
graphLine
: nodeProduction | portAssignment | NEWLINE!
;
nodeInternal
: (nodeProduction
| portAssignment
| portstmt
| nodestmt
| sNode
| NEWLINE!)+
;
nodestmt
: 'node'^ IDENT ('['! activeSelector ']'!)? 'begin'! NEWLINE!
nodeInternal
'end'!
;
sNode
: 'subjunctive node'^ IDENT '['! subjunctSelector ']'! 'begin'! NEWLINE!
(subjunctDeclNodeProd | subjunctDecl | NEWLINE!)+
'end'!
;
subjunctDeclNodeProd
: l=IDENT '->' r=IDENT 'begin' NEWLINE
nodeInternal
'end' -> ^(SUBJ_NODE_PROD $l $r nodeInternal )
;
subjunctDecl
: 'subjunct'^ IDENT ('['! activeSelector ']'!)? 'begin'! NEWLINE!
nodeInternal
'end'!
;
subjunctSelector
: IDENT -> ^(SUBJ_SELECT IDENT)
;
activeSelector
: IDENT -> ^(EVAL_SELECT IDENT)
;
nodeProduction
: path ('->'^ activationPath )+ NEWLINE!
;
portAssignment
: path '(' mfparams ')' NEWLINE -> ^(PORT_ASSIGNMENT path mfparams)
;
portDecl
: portType portName mfName -> ^(PORT_DECL ^(PORT_TAG portType) portName mfName)
;
portDeclInit
: portType portName mfCall -> ^(PORT_INIT ^(PORT_TAG portType) portName mfCall)
;
portstmt
: (portDecl | portDeclInit ) NEWLINE!
;
portName
: IDENT
;
portType: 'port'
| 'eval'
;
mfCall : mfName '(' mfparams ')' -> ^(mfName mfparams)
;
mfName : IDENT
;
mfparams: expression(',' expression)* -> expression+
;
// Path
path : (IDENT)('.' IDENT)*('[' pathIndex ']')? -> ^(PATH IDENT+ pathIndex?)
;
pathIndex
: portIndex -> ^(PORT_INDEX portIndex)
;
portIndex
: ( NUMBER |STRING_LITERAL )
;
// Expressions
term : path
| '(' expression ')' -> expression
| NUMBER
| STRING_LITERAL
;
unary : ('+'^ | '-'^)* term
;
mult : unary (('*'^ | '/'^ | '%'^) unary)*
;
add
: mult (( '+'^ | '-'^ ) mult)*
;
expression
: add (( '|'^ ) add)*
;
// LEXEMES
STRING_LITERAL
: '"' .* '"'
;
NUMBER : DIGIT+ ('.'DIGIT+)?
;
IDENT : (LCLETTER | UCLETTER | DIGIT)(LCLETTER | UCLETTER | DIGIT|'_')*
;
COMMENT
: '//' ~('\n'|'\r')* {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' NEWLINE?{$channel=HIDDEN;}
;
WS
: (' ' | '\t' | '\f')+ {$channel = HIDDEN;}
;
NEWLINE : '\r'? '\n'
;
fragment
LCLETTER
: 'a'..'z'
;
fragment
UCLETTER: 'A'..'Z'
;
fragment
DIGIT : '0'..'9'
;
My tree grammar for the section looks like:
tree grammar ShiroDefinitionPass;
options{
tokenVocab=Shiro;
ASTLabelType=CommonTree;
}
shiro
: statement+
;
statement
: nodestmt
| sNode
| graphDecl
| statestmt
| collection
| view
;
view : ^('view' IDENT mfName IDENT)
;
collection
: ^('collection' IDENT orderingFunc path collItem+)
;
collItem: IDENT
;
orderingFunc
: IDENT
;
statestmt
: ^(STATE_DECL stateHeader)
;
stateHeader
: (stateTimeStmt | stateCommentStmt | stateParentStmt| stateGraphStmt | activation )+
;
stateTimeStmt
: ^('Time' time)
;
stateCommentStmt
: ^('Comment' comment)
;
stateParentStmt
: ^('Parent' stateParent)
;
stateGraphStmt
: ^('Graph' stateGraph)
;
stateName
: IDENT
;
time : STRING_LITERAL
;
comment : STRING_LITERAL
;
stateParent
: IDENT
;
stateGraph
: IDENT
;
activationPath
: l=activation ('.' (r=activation | activationList))*
;
activationList
: ^(ACTIVATION_LIST activation+)
;
activation
: ^(ACTIVATION IDENT IDENT?)
;
// Graph Declarations
graphDecl
: ^('graph' IDENT graphLine+)
;
graphLine
: nodeProduction
| portAssignment
;
// End Graph declaration
nodeInternal
: (nodeProduction
|portAssignment
|portstmt
|nodestmt
|sNode )+
;
nodestmt
: ^('node' IDENT activeSelector? nodeInternal)
;
sNode
: ^('subjunctive node' IDENT subjunctSelector (subjunctDeclNodeProd | subjunctDecl)*)
;
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal+ )
;
subjunctDecl
: ^('subjunct' IDENT activeSelector? nodeInternal )
;
subjunctSelector
: ^(SUBJ_SELECT IDENT)
;
activeSelector returns
: ^(EVAL_SELECT IDENT)
;
nodeProduction
: ^('->' nodeProduction)
| path
;
portAssignment
: ^(PORT_ASSIGNMENT path)
;
// Port Statement
portDecl
: ^(PORT_DECL ^(PORT_TAG portType) portName mfName)
;
portDeclInit
: ^(PORT_INIT ^(PORT_TAG portType) portName mfCall)
;
portstmt
: (portDecl | portDeclInit)
;
portName
: IDENT
;
portType returns
: 'port' | 'eval'
;
mfCall
: ^(mfName mfparams)
;
mfName
: IDENT
;
mfparams
: (exps=expression)+
;
// Path
path
: ^(PATH (id=IDENT)+ (pathIndex)? )
;
pathIndex
: ^(PORT_INDEX portIndex)
;
portIndex
: ( NUMBER
|STRING_LITERAL
)
;
// Expressions
expression
: ^('+' op1=expression op2=expression)
| ^('-' op1=expression op2=expression)
| ^('*' op1=expression op2=expression)
| ^('/' op1=expression op2=expression)
| ^('%' op1=expression op2=expression)
| ^('|' op1=expression op2=expression)
| NUMBER
| path
;
In your tree grammar, here is the declaration of rule nodeInternal:
nodeInternal
: (nodeProduction
|portAssignment
|portstmt
|nodestmt
|sNode)+
;
And here is your declaration of rule subjunctDeclNodeProd:
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal+ )
;
When subjunctDeclNodeProd is being processed, ANTLR doesn't know how to process an input such as this, with two PATH children:
^(SUBJ_NODE_PROD IDENT IDENT ^(PATH IDENT) ^(PATH IDENT))
Should it follow rule nodeInternal once and process nodeProduction, nodeProduction or should it follow nodeInternal twice and process nodeProduction each time?
Consider rewriting subjunctDeclNodeProd without the +:
subjunctDeclNodeProd
: ^(SUBJ_NODE_PROD IDENT IDENT nodeInternal)
;
I think that will take care of the problem.
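The size of the ambiguity can be sketched in Python by counting how many ways a run of children can be split into consecutive non-empty groups, which is what nodeInternal+ allows when nodeInternal itself is a (...)+ loop. The function name count_groupings is an assumption for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_groupings(n):
    """Ways to split n children into one or more non-empty consecutive groups.

    Each group stands for one invocation of nodeInternal, which itself
    consumes one or more children.
    """
    if n == 0:
        return 0  # nodeInternal+ needs at least one child
    total = 1     # a single invocation that consumes all n children
    for first in range(1, n):
        # The first invocation takes `first` children, the rest is split again.
        total += count_groupings(n - first)
    return total


for n in range(1, 5):
    print(n, count_groupings(n))
```

For two children there are already two derivations, and the count doubles with every extra child. Without the outer +, exactly one derivation remains for any number of children.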

ANTLR grammar error

I'm trying to build a C-- compiler using ANTLR 3.4.
The full grammar is listed here:
program : (vardeclaration | fundeclaration)* ;
vardeclaration : INT ID (OPENSQ NUM CLOSESQ)? SEMICOL ;
fundeclaration : typespecifier ID OPENP params CLOSEP compoundstmt ;
typespecifier : INT | VOID ;
params : VOID | paramlist ;
paramlist : param (COMMA param)* ;
param : INT ID (OPENSQ CLOSESQ)? ;
compoundstmt : OPENCUR vardeclaration* statement* CLOSECUR ;
statementlist : statement* ;
statement : expressionstmt | compoundstmt | selectionstmt | iterationstmt | returnstmt;
expressionstmt : (expression)? SEMICOL;
selectionstmt : IF OPENP expression CLOSEP statement (options {greedy=true;}: ELSE statement)?;
iterationstmt : WHILE OPENP expression CLOSEP statement;
returnstmt : RETURN (expression)? SEMICOL;
expression : (var EQUAL expression) | sampleexpression;
var : ID ( OPENSQ expression CLOSESQ )? ;
sampleexpression: addexpr ( ( LOREQ | LESS | GRTR | GOREQ | EQUAL | NTEQL) addexpr)?;
addexpr : mulexpr ( ( PLUS | MINUS ) mulexpr)*;
mulexpr : factor ( ( MULTI | DIV ) factor )*;
factor : ( OPENP expression CLOSEP ) | var | call | NUM;
call : ID OPENP arglist? CLOSEP;
arglist : expression ( COMMA expression)*;
The lexer rules are as follows:
ELSE : 'else' ;
IF : 'if' ;
INT : 'int' ;
RETURN : 'return' ;
VOID : 'void' ;
WHILE : 'while' ;
PLUS : '+' ;
MINUS : '-' ;
MULTI : '*' ;
DIV : '/' ;
LESS : '<' ;
LOREQ : '<=' ;
GRTR : '>' ;
GOREQ : '>=' ;
EQUAL : '==' ;
NTEQL : '!=' ;
ASSIGN : '=' ;
SEMICOL : ';' ;
COMMA : ',' ;
OPENP : '(' ;
CLOSEP : ')' ;
OPENSQ : '[' ;
CLOSESQ : ']' ;
OPENCUR : '{' ;
CLOSECUR: '}' ;
SCOMMENT: '/*' ;
ECOMMENT: '*/' ;
ID : ('a'..'z' | 'A'..'Z')+/*(' ')*/ ;
NUM : ('0'..'9')+ ;
WS : (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;};
COMMENT: '/*' .* '*/' {$channel = HIDDEN;};
But when I try to save this, it gives me the error:
error(211): /CMinusMinus/src/CMinusMinus/CMinusMinus.g:33:13: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
|---> expression : (var EQUAL expression) | sampleexpression;
1 error
How can I resolve this problem?
As already mentioned, your grammar rule expression is ambiguous: both alternatives in that rule can start with a var.
You need to "help" your parser a bit. If the parser can see a var followed by an EQUAL, it should choose alternative 1, else alternative 2. This can be done by using a syntactic predicate (the (var EQUAL)=> part in the rule below).
expression
: (var EQUAL)=> var EQUAL expression
| sampleexpression
;
More about predicates in this Q&A: What is a 'semantic predicate' in ANTLR?
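Conceptually, the (var EQUAL)=> predicate is a speculative scan ahead before the parser commits to an alternative. A rough Python sketch of that check, working on a list of token type names taken from the grammar (ID, OPENSQ, CLOSESQ, EQUAL), might look like this. The function looks_like_assignment is a hypothetical helper and it simplifies the index to bracket matching rather than parsing a full expression.

```python
def looks_like_assignment(tokens):
    """Return True if tokens start with: ID (OPENSQ ... CLOSESQ)? EQUAL."""
    if not tokens or tokens[0] != "ID":
        return False
    i = 1
    if i < len(tokens) and tokens[i] == "OPENSQ":
        depth = 0
        while i < len(tokens):
            if tokens[i] == "OPENSQ":
                depth += 1
            elif tokens[i] == "CLOSESQ":
                depth -= 1
                if depth == 0:
                    i += 1      # step past the matching CLOSESQ
                    break
            i += 1
        else:
            return False        # ran out of tokens: unbalanced brackets
    return i < len(tokens) and tokens[i] == "EQUAL"


# a[i + 1] EQUAL 2  -> alternative 1 (var EQUAL expression)
print(looks_like_assignment(
    ["ID", "OPENSQ", "ID", "PLUS", "NUM", "CLOSESQ", "EQUAL", "NUM"]))
# a[i] + 2          -> alternative 2 (sampleexpression)
print(looks_like_assignment(
    ["ID", "OPENSQ", "ID", "CLOSESQ", "PLUS", "NUM"]))
```

The scan has to skip an index expression of arbitrary length before it reaches the EQUAL, which is why no fixed amount of lookahead can make this decision.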
The problem is this:
expression : (var EQUAL expression) | sampleexpression;
where you either start with var or sampleexpression. But sampleexpression can be reduced to var as well, via sampleexpression -> addexpr -> mulexpr -> factor -> var.
So no fixed lookahead length k is enough for the parser to choose between the two alternatives.
As the error message suggests, you can set backtrack=true to see whether this solves your problem, but it might not produce the AST or parse tree you would expect, and it might also be slow on certain inputs.
You could also try to refactor your grammar to avoid such recursions.