My grammar identifies keywords as identifiers - ANTLR

I'm trying to parse expressions from the Jakarta Expression Language. In summary, it is a simplified form of Java expressions, with a few additions:
Support for creating maps, like: {"foo": "bar"}
Support for creating lists and sets, like: [1,2,3,4] {1,2,3,4}
Use of some identifiers instead of symbols, like: foo gt bar (foo > bar), foo mod bar (foo % bar), and so on.
I'm struggling with the last bit: the parser always treats "mod", "gt", "ge" as identifiers instead of using the expression alternatives that have "%", ">", ">=".
I'm new to ANTLR. My grammar is based on the Java grammar at https://github.com/antlr/grammars-v4/tree/master/java/java and the JavaCC grammar provided at: https://jakarta.ee/specifications/expression-language/4.0/jakarta-expression-language-spec-4.0.html#collected-syntax
grammar ExpressionLanguageGrammar;
prog: compositeExpression;
compositeExpression: (dynamicExpression | deferredExpression | literalExpression)*;
dynamicExpression: '${' expression RCURL;
deferredExpression: '#{' expression RCURL;
literalExpression: literal;
literal: BOOL_LITERAL | FLOATING_POINT_LITERAL | INTEGER_LITERAL | StringLiteral | NULL;
mapData | listData | setData;
methodArguments: LPAREN expressionList? RPAREN;
expressionList: (expression ((COMMA expression)*));
lambdaExpressionOrCall: LPAREN lambdaExpression RPAREN methodArguments*;
lambdaExpression: lambdaParameters ARROW expression;
lambdaParameters: IDENTIFIER | (LPAREN (IDENTIFIER ((COMMA IDENTIFIER)*))? RPAREN);
mapEntry: expression COLON expression;
mapEntries: mapEntry (COMMA mapEntry)*;
expression
: primary
|'[' expressionList? ']'
| '{' expressionList? '}'
| '{' mapEntries? '}'
| expression bop='.' (IDENTIFIER | IDENTIFIER '(' expressionList? ')')
| expression ('[' expression ']')+
| prefix=('-' | '!' | NOT1 | EMPTY) expression
| expression bop=('*' | '/' | '%' | MOD1 | DIV1) expression
| expression bop=('+' | '-') expression
| expression bop=('<=' | '>=' | '>' | '<' | LE1 | GE1 | LT1 | GT1) expression
| expression bop=INSTANCEOF IDENTIFIER
| expression bop=('==' | '!=' | EQ1 | NE1) expression
| expression bop=('&&' | AND1) expression
| expression bop=('||' | OR1) expression
| <assoc=right> expression bop='?' expression bop=':' expression
| <assoc=right> expression
bop=('=' | '+=' | '-=' | '*=' | '/=')
expression
| lambdaExpression
| lambdaExpressionOrCall
;
primary
: '(' expression ')'
| literal
| IDENTIFIER
;
BOOL_LITERAL: TRUE | FALSE;
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER_LITERAL: [0-9]+;
FLOATING_POINT_LITERAL: [0-9]+ '.' [0-9]* EXPONENT? | '.' [0-9]+ EXPONENT? | [0-9]+ EXPONENT?;
fragment EXPONENT: ('e'|'E') ('+'|'-')? [0-9]+;
StringLiteral: ('"' DoubleStringCharacter* '"'
| '\'' SingleStringCharacter* '\'') ;
fragment DoubleStringCharacter
: ~["\\\r\n]
| '\\' EscapeSequence
;
fragment SingleStringCharacter
: ~['\\\r\n]
| '\\' EscapeSequence
;
fragment EscapeSequence
: CharacterEscapeSequence
| '0'
| HexEscapeSequence
| UnicodeEscapeSequence
| ExtendedUnicodeEscapeSequence
;
fragment CharacterEscapeSequence
: SingleEscapeCharacter
| NonEscapeCharacter
;
fragment HexEscapeSequence
: 'x' HexDigit HexDigit
;
fragment UnicodeEscapeSequence
: 'u' HexDigit HexDigit HexDigit HexDigit
| 'u' '{' HexDigit HexDigit+ '}'
;
fragment ExtendedUnicodeEscapeSequence
: 'u' '{' HexDigit+ '}'
;
fragment SingleEscapeCharacter
: ['"\\bfnrtv]
;
fragment NonEscapeCharacter
: ~['"\\bfnrtv0-9xu\r\n]
;
fragment EscapeCharacter
: SingleEscapeCharacter
| [0-9]
| [xu]
;
fragment HexDigit
: [_0-9a-fA-F]
;
fragment DecimalIntegerLiteral
: '0'
| [1-9] [0-9_]*
;
fragment ExponentPart
: [eE] [+-]? [0-9_]+
;
fragment IdentifierPart
: IdentifierStart
| [\p{Mn}]
| [\p{Nd}]
| [\p{Pc}]
| '\u200C'
| '\u200D'
;
fragment IdentifierStart
: [\p{L}]
| [$_]
| '\\' UnicodeEscapeSequence
;
LCURL: '{';
RCURL: '}';
LETTER: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff';
DIGIT: '\u0030'..'\u0039'|
'\u0660'..'\u0669'|
'\u06f0'..'\u06f9'|
'\u0966'..'\u096f'|
'\u09e6'..'\u09ef'|
'\u0a66'..'\u0a6f'|
'\u0ae6'..'\u0aef'|
'\u0b66'..'\u0b6f'|
'\u0be7'..'\u0bef'|
'\u0c66'..'\u0c6f'|
'\u0ce6'..'\u0cef'|
'\u0d66'..'\u0d6f'|
'\u0e50'..'\u0e59'|
'\u0ed0'..'\u0ed9'|
'\u1040'..'\u1049';
TRUE: 'true';
FALSE: 'false';
NULL: 'null';
DOT: '.';
LPAREN: '(';
RPAREN: ')';
LBRACK: '[';
RBRACK: ']';
COLON: ':';
COMMA: ',';
SEMICOLON: ';';
GT0: '>';
GT1: 'gt';
LT0: '<';
LT1: 'lt';
GE0: '>=';
GE1: 'ge';
LE0: '<=';
LE1: 'le';
EQ0: '==';
EQ1: 'eq';
NE0: '!=';
NE1: 'ne';
NOT0: '!';
NOT1: 'not';
AND0: '&&';
AND1: 'and';
OR0: '||';
OR1: 'or';
EMPTY: 'empty';
INSTANCEOF: 'instanceof';
MULT: '*';
PLUS: '+';
MINUS: '-';
QUESTIONMARK: '?';
DIV0: '/';
DIV1: 'div';
MOD0: '%';
MOD1: 'mod';
CONCAT: '+=';
ASSIGN: '=';
ARROW: '->';
DOLLAR: '$';
HASH: '#';
WS: [ \t\r\n]+ -> skip;

Move the lexer rules for these keywords so that they come before the lexer rule for IDENTIFIER.
If more than one lexer rule matches input of the same length, ANTLR chooses the first matching rule in the grammar.
For example, "mod" is matched by both IDENTIFIER and MOD1, but IDENTIFIER comes first, so ANTLR chooses IDENTIFIER. Move the MOD1 rule before IDENTIFIER and it will match MOD1.
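The reordering could look roughly like the sketch below. It reuses the token names from the question's grammar and shows only a few of the word-like tokens; the same move would apply to the others defined after IDENTIFIER (lt, le, eq, ne, not, and, or, empty, instanceof, div, null).

```antlr
// Word-like operator tokens come first, so they win a same-length tie
// against the catch-all IDENTIFIER rule.
GT1: 'gt';
GE1: 'ge';
MOD1: 'mod';

// IDENTIFIER comes after every keyword rule.
IDENTIFIER: LETTER (LETTER | DIGIT)*;
```

Longer input such as "modulo" still becomes an IDENTIFIER, because the lexer prefers the longest match and only falls back to rule order when the match lengths are equal.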
———-
By the way, unless you care about having different token types for "%" and "mod", you can define a single rule:
MOD: '%' | 'mod';
You can still get the token text if you need it, and you can then just specify MOD in your parser rules instead of (MOD0 | MOD1).
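A minimal sketch of the single-rule idea follows. The grammar name ModSketch, the reduced expr rule, and the simplified IDENTIFIER character classes are illustrative assumptions, not part of the original grammar. The parser refers to the tokens by name: in a combined grammar, a quoted '%' in a parser rule is only treated as an alias for a lexer rule that consists of exactly that one literal, so with a two-alternative MOD rule it would become a separate implicit token.

```antlr
grammar ModSketch;   // hypothetical name, for illustration only

expr
    : expr bop=(MULT | DIV | MOD) expr   // MOD covers both '%' and 'mod'
    | expr bop=(PLUS | MINUS) expr
    | IDENTIFIER
    | INTEGER_LITERAL
    ;

MULT: '*';
DIV: '/' | 'div';
MOD: '%' | 'mod';
PLUS: '+';
MINUS: '-';

// Still defined after the word-like tokens, as in the ordering fix.
IDENTIFIER: [a-zA-Z_$] [a-zA-Z_$0-9]*;
INTEGER_LITERAL: [0-9]+;
WS: [ \t\r\n]+ -> skip;
```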

Related

ANTLR arithmetic and comparison expressions grammar ANTLR

How can I add relational operators to my grammar?
Thanks.
My code is:
grammar denem1;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: stat+ EOF -> ^(ROOT stat+)
;
stat
: expr ';'
;
expr
: Id Assign expr -> ^(Assign Id expr)
| add
;
add
: mult (('+' | '-')^ mult)*
;
mult
: atom (('*' | '/')^ atom)*
;
atom
: Id
| Num
| '('! expr ')' !
;
Assign : '=' ;
Comment : '//' ~('\r' | '\n')* {skip();};
Id : 'a'..'z'+;
Num : '0'..'9'+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
Like this:
...
expr
: Id Assign expr -> ^(Assign Id expr)
| rel
;
rel
: add (('<=' | '<' | '>=' | '>')^ add)?
;
add
: mult (('+' | '-')^ mult)*
;
...
If possible, use ANTLR v4 instead of the old v3. In v4, you can simply do this:
stat
: expr ';'
;
expr
: '-' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| expr ('<=' | '<' | '>=' | '>') expr
| Id Assign expr // listed after the binary operators so that a = b + c parses as a = (b + c)
| Id
| Num
| '(' expr ')'
;

Square brackets not recognized as tokens in ANTLR

I am currently creating a programming language for my semester project. We are using ANTLR as our compiler-compiler, and we have run into a problem. When specifying the grammar for array declarations, ANTLR does not seem to recognize square brackets as tokens. For example, the following line of code:
string[] names = { "Bob", "Hans" }
will produce the error
extraneous input 'string[]' expecting {'end', 'num', 'bool', 'string', 'block', 'item', 'coords', 'break', 'for', 'while', 'until', 'switch', 'if', IDENTIFIER}
when the grammar for declarations is specified as follows
dcl
: 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
| 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
| 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
| 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
| 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
;
However, it seems to work fine if I exchange the '[]' with '{}' or '()'. For example, the following line of code
string() names = { "Bob", "Hans" }
works fine with the following grammar
| 'string' '(' ')' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
Why does it work with other kinds of brackets and symbols, when it does not work with square brackets?
Edit
Here is the entire grammar
grammar Minecraft;
/* LEXER RULES */
SINGLE_COMMENT : '//' ~('\r' | '\n')* -> skip ;
MULTILINE_COMMENT : '/*' .*? '*/' -> skip ;
WS : [ \t\n\r]+ -> skip ;
fragment LETTER : ('a' .. 'z') | ('A' .. 'Z') ;
IDENTIFIER : LETTER+ ;
fragment NUMBER : ('0' .. '9') ;
BOOL : 'true' | 'false' ;
NUM : NUMBER+ | NUMBER+ '.' NUMBER+ ;
STR : '"' (LETTER | NUMBER)* '"' | '\'' (LETTER | NUMBER)* '\'' ;
COORDS : NUM ',' NUM ',' NUM ;
ITEM_ID : NUMBER+ | NUMBER+ ':' NUMBER+ ;
MULDIVMODOP : '*' | '/' | '%' ;
ADDSUBOP : '+' | '-' ;
NEGOP : '!' ;
EQOP : '==' | '!=' | '<' | '<=' | '>' | '>=' ;
LOGOP : '&&' | '||' ;
/* PROGRAM GRAMMAR */
prog : 'begin' 'bot' body 'end' 'bot' ;
body : glob_var* initiate main function* ;
initiate : 'initiate' stmt* 'end' 'initiate' ;
main : 'loop' stmt* 'end' 'loop' ;
type : 'num' | 'bool' | 'string' | 'block' | 'item' | 'coords' ;
function
: 'function' IDENTIFIER '(' args ')' stmt* 'end' 'function'
| 'activity' IDENTIFIER '(' args ')' stmt* 'end' 'activity'
;
arg
: (type | arr_names) IDENTIFIER
| dcl
;
args : arg ',' args | arg ;
i_args : IDENTIFIER ',' i_args | /* epsilon */ ;
cond
: '(' cond ')'
| left=cond MULDIVMODOP right=cond
| left=cond ADDSUBOP right=cond
| NEGOP cond
| left=cond EQOP right=cond
| left=cond LOGOP right=cond
| (NUM | STR | BOOL | ITEM_ID | COORDS | IDENTIFIER)
;
stnd_stmt
: dcl
| 'for' IDENTIFIER '=' NUM ('to' | 'downto') NUM 'do' stmt* 'end' 'for'
| ('while' | 'until') cond 'repeat' stmt* 'end' 'repeat'
| IDENTIFIER '(' i_args ')'
| 'break'
;
stmt : stnd_stmt | if_stmt ;
else_stmt : stnd_stmt | ifelse_stmt ;
if_stmt
: 'if' cond 'then' stmt* 'end' 'if'
| 'if' cond 'then' stmt* 'else' else_stmt* 'end' 'if'
;
ifelse_stmt
: 'if' cond 'then' else_stmt*
| 'if' cond 'then' else_stmt* 'else' else_stmt*
;
glob_var : 'global' dcl ;
str_arr_items : (STR | IDENTIFIER) ',' str_arr_items | (STR | IDENTIFIER) ;
dcl
: 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
| 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
| 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
| 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
| 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
;
arr_items : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;
accessing
: IDENTIFIER '[' ('X' | 'Y' | 'Z') ']'
| IDENTIFIER '[' NUM+ ']'
;
Seems like the line
arr_items : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;
created the tokens
num[]
string[]
block[] and
item[]
which means that when the lexer came across the input string[], it would emit the single token 'string[]' and not the three tokens 'string', '[' and ']'. When I deleted the line from the grammar, the parser behaved as expected. Thanks to Bart Kiers for pointing me towards this :)
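If an array-type rule is still wanted, one option is to build it from separate tokens instead of single quoted literals that contain the brackets. This is only a sketch: it assumes the rule keeps the name arr_items, and it does not show how the rule would be wired into the rest of the grammar.

```antlr
// Every quoted literal here is its own token, so no 'string[]' token is
// created and the alternative 'string' '[' ']' in dcl keeps working.
arr_items : ('num' | 'string' | 'block' | 'item') '[' ']' ;
```

Because WS is skipped, this form would also accept whitespace between the type name and the brackets.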

Parsing DECAF grammar in ANTLR

I am creating a parser for DECAF with ANTLR:
grammar DECAF ;
//********* LEXER ******************
LETTER: ('a'..'z'|'A'..'Z') ;
DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
COMMENTS: '//' ~('\r' | '\n' )* -> channel(HIDDEN);
WS : [ \t\r\n\f | ' '| '\r' | '\n' | '\t']+ ->channel(HIDDEN);
CHAR: (LETTER|DIGIT|' '| '!' | '"' | '#' | '$' | '%' | '&' | '\'' | '(' | ')' | '*' | '+'
| ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '#' | '[' | '\\' | ']' | '^' | '_' | '`'| '{' | '|' | '}' | '~'
'\t'| '\n' | '\"' | '\'');
// ********** PARSER *****************
program : 'class' 'Program' '{' (declaration)* '}' ;
declaration: structDeclaration| varDeclaration | methodDeclaration ;
varDeclaration: varType ID ';' | varType ID '[' NUM ']' ';' ;
structDeclaration : 'struct' ID '{' (varDeclaration)* '}' ;
varType: 'int' | 'char' | 'boolean' | 'struct' ID | structDeclaration | 'void' ;
methodDeclaration : methodType ID '(' (parameter (',' parameter)*)* ')' block ;
methodType : 'int' | 'char' | 'boolean' | 'void' ;
parameter : parameterType ID | parameterType ID '[' ']' ;
parameterType: 'int' | 'char' | 'boolean' ;
block : '{' (varDeclaration)* (statement)* '}' ;
statement : 'if' '(' expression ')' block ( 'else' block )?
| 'while' '(' expression ')' block
|'return' expressionA ';'
| methodCall ';'
| block
| location '=' expression
| (expression)? ';' ;
expressionA: expression | ;
location : (ID|ID '[' expression ']') ('.' location)? ;
expression : location | methodCall | literal | expression op expression | '-' expression | '!' expression | '('expression')' ;
methodCall : ID '(' arg1 ')' ;
arg1 : arg2 | ;
arg2 : (arg) (',' arg)* ;
arg : expression;
op: arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
When I give it the input:
class Program {
void main(){
return 3+5 ;
}
}
The parse tree is not building correctly since it is not recognizing the 3+5 as an expression. Is there anything wrong with my grammar that is causing the problem?
Lexer rules are matched from top to bottom. When 2 or more lexer rules match the same amount of characters, the one defined first will win. Because of that, a single digit integer will get matched as a DIGIT instead of a NUM.
Try parsing the following instead:
class Program {
void main(){
return 33 + 55 ;
}
}
which will be parsed just fine. This is because 33 and 55 are matched as NUMs, because NUM can now match 2 characters (DIGIT only 1, so NUM wins).
To fix it, make DIGIT a fragment (and LETTER as well):
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
Lexer fragments are only used internally by other lexer rules, and will never become tokens of their own.
A couple of other things: your WS rule matches way too much (it now also matches a | and a '), it should be:
WS : [ \t\r\n\f]+ ->channel(HIDDEN);
and you shouldn't match a char literal in your parser: do it in the lexer:
CHAR : '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
If you don't, the following will not get parsed properly:
class Program {
void main(){
return '1';
}
}
because the 1 will be tokenized as a NUM and not as a CHAR.
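With the quotes moved into the lexer, the parser rule for character literals would need to change as well. A sketch, assuming the CHAR rule suggested above and the rule name char_literal from the question:

```antlr
// Lexer: the whole quoted character, including both quotes, is one token.
CHAR : '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'' ;

// Parser: the rule just references that token (replaces '\'' CHAR '\'').
char_literal : CHAR ;
```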

ANTLR - Field that accept attributes with more than one word

My Grammar file (see below) parses queries of the type:
(name = Jon AND age != 16 OR city = NY);
However, it doesn't allow something like:
(name = 'Jon Smith' AND age != 16);
i.e., it doesn't allow assigning to a field a value of more than one word, separated by white space. How can I modify my grammar file to accept that?
options
{
language = Java;
output = AST;
}
tokens {
BLOCK;
RETURN;
QUERY;
ASSIGNMENT;
INDEXES;
}
@parser::header {
package pt.ptinovacao.agorang.antlr;
}
@lexer::header {
package pt.ptinovacao.agorang.antlr;
}
query
: expr ('ORDER BY' NAME AD)? ';' EOF
-> ^(QUERY expr ^('ORDER BY' NAME AD)?)
;
expr
: logical_expr
;
logical_expr
: equality_expr (logical_op^ equality_expr)*
;
equality_expr
: NAME equality_op atom -> ^(equality_op NAME atom)
| '(' expr ')' -> ^('(' expr)
;
atom
: ID
| id_list
| Int
| Number
;
id_list
: '(' ID (',' ID)* ')'
-> ID+
;
NAME
: 'equipType'
| 'equipment'
| 'IP'
| 'site'
| 'managedDomain'
| 'adminState'
| 'dataType'
;
AD : 'ASC' | 'DESC' ;
equality_op
: '='
| '!='
| 'IN'
| 'NOT IN'
;
logical_op
: 'AND'
| 'OR'
;
Number
: Int ('.' Digit*)?
;
ID
: ('a'..'z' | 'A'..'Z' | '_' | '.' | '-' | Digit)*
;
String
@after {
setText(getText().substring(1, getText().length()-1).replaceAll("\\\\(.)", "$1"));
}
: '"' (~('"' | '\\') | '\\' ('\\' | '"'))* '"'
| '\'' (~('\'' | '\\') | '\\' ('\\' | '\''))* '\''
;
Comment
: '//' ~('\r' | '\n')* {skip();}
| '/*' .* '*/' {skip();}
;
Space
: (' ' | '\t' | '\r' | '\n' | '\u000C') {skip();}
;
fragment Int
: '1'..'9' Digit*
| '0'
;
fragment Digit
: '0'..'9'
;
indexes
: ('[' expr ']')+ -> ^(INDEXES expr+)
;
Include the String token as an alternative in your atom rule:
atom
: ID
| id_list
| Int
| Number
| String
;

Extracting recursion in ANTLR

I've got a grammar in ANTLR and don't understand how it can be recursive. Is there any way to get ANTLR to show the derivation that it used to see that my rules are recursive?
The recursive grammar in its entirety:
grammar DeadMG;
options {
language = C;
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
program
: namespace_scope_definitions;
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
namespace_definition
: 'namespace' ID '{' namespace_scope_definitions '}';
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
function_definition
: ID '(' function_argument_list ')' ('(' function_argument_list ')')? ('->' expression)? compound_statement;
function_argument_list
: expression? ID (':=' expression)? (',' function_argument_list)?;
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
literal_expression
: CHAR
| FLOAT
| INT
| STRING
| 'auto'
| 'type'
| type_definition;
primary_expression
: literal_expression
| ID
| '(' expression ')';
expression
: assignment_expression;
assignment_expression
: logical_or_expression (('=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<='| '>>=' | '&=' | '^=' | '|=') assignment_expression)*;
logical_or_expression
: logical_and_expression ('||' logical_and_expression)*;
logical_and_expression
: inclusive_or_expression ('&&' inclusive_or_expression)*;
inclusive_or_expression
: exclusive_or_expression ('|' exclusive_or_expression)*;
exclusive_or_expression
: and_expression ('^' and_expression)*;
and_expression
: equality_expression ('&' equality_expression)*;
equality_expression
: relational_expression (('=='|'!=') relational_expression)*;
relational_expression
: shift_expression (('<'|'>'|'<='|'>=') shift_expression)*;
shift_expression
: additive_expression (('<<'|'>>') additive_expression)*;
additive_expression
: multiplicative_expression (('+' multiplicative_expression) | ('-' multiplicative_expression))*;
multiplicative_expression
: unary_expression (('*' | '/' | '%') unary_expression)*;
unary_expression
: '++' primary_expression
| '--' primary_expression
| ('&' | '*' | '+' | '-' | '~' | '!') primary_expression
| 'sizeof' primary_expression
| postfix_expression;
postfix_expression
: primary_expression
| '[' expression ']'
| '(' expression* ')'
| '.' ID
| '->' ID
| '++'
| '--';
initializer_statement
: expression ';'
| variable_definition ';';
return_statement
: 'return' expression ';';
try_statement
: 'try' compound_statement catch_statement;
catch_statement
: 'catch' '(' ID ')' compound_statement catch_statement?
| 'catch' '(' '...' ')' compound_statement;
for_statement
: 'for' '(' initializer_statement expression? ';' expression? ')' compound_statement;
while_statement
: 'while' '(' initializer_statement ')' compound_statement;
do_while_statement
: 'do' compound_statement 'while' '(' expression ')';
switch_statement
: 'switch' '(' expression ')' '{' case_statement '}';
case_statement
: 'case:' (statement)* case_statement?
| 'default:' (statement)*;
if_statement
: 'if' '(' initializer_statement ')' compound_statement;
statement
: compound_statement
| return_statement
| try_statement
| initializer_statement
| for_statement
| while_statement
| do_while_statement
| switch_statement
| if_statement;
compound_statement
: '{' (statement)* '}';
More specifically, I am having trouble with the following rules:
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
ANTLR is saying that alternatives 2 and 4, that is, type_definition and variable_definition, are recursive. Here's variable_definition:
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
and here's type_definition:
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
'type' itself, and type_definition, are valid expressions in my expression syntax. However, removing them does not resolve the ambiguity, so it doesn't originate there. I also have plenty of other ambiguities I need to resolve. Detailing all the warnings and errors would be too much, so I'd really like to see more details from ANTLR itself on how these rules are recursive.
My suggestion is to remove most of the operator precedence rules for now:
expression
: multiplicative_expression
(
('+' multiplicative_expression)
| ('-' multiplicative_expression)
)*;
and then inline the rules that have a single caller to isolate the ambiguities. Yes, it is tedious.
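As an illustration of the inlining step, return_statement and try_statement in the question's grammar are each referenced only from statement, so their bodies could be folded into it. This is a sketch of the idea, not a tested fix for the reported warnings:

```antlr
statement
    : compound_statement
    | 'return' expression ';'                    // was return_statement
    | 'try' compound_statement catch_statement   // was try_statement
    | initializer_statement
    | for_statement
    | while_statement
    | do_while_statement
    | switch_statement
    | if_statement;
```

The separate return_statement and try_statement rules would then be deleted, leaving fewer rule boundaries to trace when a warning names an alternative.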
I found a few ambiguities in the grammar, fixed them, and got far fewer warnings. However, I think that LL is probably just not the right parsing algorithm for me, so I am writing a custom parser and lexer. It would still have been nice if ANTLR had shown me how it found the problems, so that I could intervene and fix them.