How to detect more than one error in yacc

In this code (a parser for the C language using yacc), program execution stops after the first error is detected. What should I do so that all the errors are shown before execution stops? I read somewhere that you can use yyerrok, but I couldn't apply it. Please help.
%token IDENTIFIER CONSTANT STRING_LITERAL SIZEOF
%token PTR_OP INC_OP DEC_OP LEFT_OP RIGHT_OP LE_OP GE_OP EQ_OP NE_OP
%token AND_OP OR_OP MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN ADD_ASSIGN
%token SUB_ASSIGN LEFT_ASSIGN RIGHT_ASSIGN AND_ASSIGN
%token XOR_ASSIGN OR_ASSIGN TYPE_NAME
%token TYPEDEF EXTERN STATIC AUTO REGISTER
%token CHAR SHORT INT LONG SIGNED UNSIGNED FLOAT DOUBLE CONST VOLATILE VOID
%token STRUCT UNION ENUM ELLIPSIS
%token CASE DEFAULT IF ELSE SWITCH WHILE DO FOR GOTO CONTINUE BREAK RETURN
%start translation_unit
%%
primary_expression
: IDENTIFIER
| CONSTANT
| STRING_LITERAL
| '(' expression ')'
;
postfix_expression
: primary_expression
| postfix_expression '[' expression ']'
| postfix_expression '(' ')'
| postfix_expression '(' argument_expression_list ')'
| postfix_expression '.' IDENTIFIER
| postfix_expression PTR_OP IDENTIFIER
| postfix_expression INC_OP
| postfix_expression DEC_OP
;
argument_expression_list
: assignment_expression
| argument_expression_list ',' assignment_expression
;
unary_expression
: postfix_expression
| INC_OP unary_expression
| DEC_OP unary_expression
| unary_operator cast_expression
| SIZEOF unary_expression
| SIZEOF '(' type_name ')'
;
unary_operator
: '&'
| '*'
| '+'
| '-'
| '~'
| '!'
;
cast_expression
: unary_expression
| '(' type_name ')' cast_expression
;
multiplicative_expression
: cast_expression
| multiplicative_expression '*' cast_expression
| multiplicative_expression '/' cast_expression
| multiplicative_expression '%' cast_expression
;
additive_expression
: multiplicative_expression
| additive_expression '+' multiplicative_expression
| additive_expression '-' multiplicative_expression
;
shift_expression
: additive_expression
| shift_expression LEFT_OP additive_expression
| shift_expression RIGHT_OP additive_expression
;
relational_expression
: shift_expression
| relational_expression '<' shift_expression
| relational_expression '>' shift_expression
| relational_expression LE_OP shift_expression
| relational_expression GE_OP shift_expression
;
equality_expression
: relational_expression
| equality_expression EQ_OP relational_expression
| equality_expression NE_OP relational_expression
;
and_expression
: equality_expression
| and_expression '&' equality_expression
;
exclusive_or_expression
: and_expression
| exclusive_or_expression '^' and_expression
;
inclusive_or_expression
: exclusive_or_expression
| inclusive_or_expression '|' exclusive_or_expression
;
logical_and_expression
: inclusive_or_expression
| logical_and_expression AND_OP inclusive_or_expression
;
logical_or_expression
: logical_and_expression
| logical_or_expression OR_OP logical_and_expression
;
conditional_expression
: logical_or_expression
| logical_or_expression '?' expression ':' conditional_expression
;
assignment_expression
: conditional_expression
| unary_expression assignment_operator assignment_expression
;
assignment_operator
: '='
| MUL_ASSIGN
| DIV_ASSIGN
| MOD_ASSIGN
| ADD_ASSIGN
| SUB_ASSIGN
| LEFT_ASSIGN
| RIGHT_ASSIGN
| AND_ASSIGN
| XOR_ASSIGN
| OR_ASSIGN
;
expression
: assignment_expression
| expression ',' assignment_expression
;
constant_expression
: conditional_expression
;
declaration
: declaration_specifiers ';'
| declaration_specifiers init_declarator_list ';'
;
declaration_specifiers
: storage_class_specifier
| storage_class_specifier declaration_specifiers
| type_specifier
| type_specifier declaration_specifiers
| type_qualifier
| type_qualifier declaration_specifiers
;
init_declarator_list
: init_declarator
| init_declarator_list ',' init_declarator
;
init_declarator
: declarator
| declarator '=' initializer
;
storage_class_specifier
: TYPEDEF
| EXTERN
| STATIC
| AUTO
| REGISTER
;
type_specifier
: VOID
| CHAR
| SHORT
| INT
| LONG
| FLOAT
| DOUBLE
| SIGNED
| UNSIGNED
| struct_or_union_specifier
| enum_specifier
| TYPE_NAME
;
struct_or_union_specifier
: struct_or_union IDENTIFIER '{' struct_declaration_list '}'
| struct_or_union '{' struct_declaration_list '}'
| struct_or_union IDENTIFIER
;
struct_or_union
: STRUCT
| UNION
;
struct_declaration_list
: struct_declaration
| struct_declaration_list struct_declaration
;
struct_declaration
: specifier_qualifier_list struct_declarator_list ';'
;
specifier_qualifier_list
: type_specifier specifier_qualifier_list
| type_specifier
| type_qualifier specifier_qualifier_list
| type_qualifier
;
struct_declarator_list
: struct_declarator
| struct_declarator_list ',' struct_declarator
;
struct_declarator
: declarator
| ':' constant_expression
| declarator ':' constant_expression
;
enum_specifier
: ENUM '{' enumerator_list '}'
| ENUM IDENTIFIER '{' enumerator_list '}'
| ENUM IDENTIFIER
;
enumerator_list
: enumerator
| enumerator_list ',' enumerator
;
enumerator
: IDENTIFIER
| IDENTIFIER '=' constant_expression
;
type_qualifier
: CONST
| VOLATILE
;
declarator
: pointer direct_declarator
| direct_declarator
;
direct_declarator
: IDENTIFIER
| '(' declarator ')'
| direct_declarator '[' constant_expression ']'
| direct_declarator '[' ']'
| direct_declarator '(' parameter_type_list ')'
| direct_declarator '(' identifier_list ')'
| direct_declarator '(' ')'
;
pointer
: '*'
| '*' type_qualifier_list
| '*' pointer
| '*' type_qualifier_list pointer
;
type_qualifier_list
: type_qualifier
| type_qualifier_list type_qualifier
;
parameter_type_list
: parameter_list
| parameter_list ',' ELLIPSIS
;
parameter_list
: parameter_declaration
| parameter_list ',' parameter_declaration
;
parameter_declaration
: declaration_specifiers declarator
| declaration_specifiers abstract_declarator
| declaration_specifiers
;
identifier_list
: IDENTIFIER
| identifier_list ',' IDENTIFIER
;
type_name
: specifier_qualifier_list
| specifier_qualifier_list abstract_declarator
;
abstract_declarator
: pointer
| direct_abstract_declarator
| pointer direct_abstract_declarator
;
direct_abstract_declarator
: '(' abstract_declarator ')'
| '[' ']'
| '[' constant_expression ']'
| direct_abstract_declarator '[' ']'
| direct_abstract_declarator '[' constant_expression ']'
| '(' ')'
| '(' parameter_type_list ')'
| direct_abstract_declarator '(' ')'
| direct_abstract_declarator '(' parameter_type_list ')'
;
initializer
: assignment_expression
| '{' initializer_list '}'
| '{' initializer_list ',' '}'
;
initializer_list
: initializer
| initializer_list ',' initializer
;
statement
: labeled_statement
| compound_statement
| expression_statement
| selection_statement
| iteration_statement
| jump_statement
;
labeled_statement
: IDENTIFIER ':' statement
| CASE constant_expression ':' statement
| DEFAULT ':' statement
;
compound_statement
: '{' '}'
| '{' statement_list '}'
| '{' declaration_list '}'
| '{' declaration_list statement_list '}'
;
declaration_list
: declaration
| declaration_list declaration
;
statement_list
: statement
| statement_list statement
;
expression_statement
: ';'
| expression ';'
;
selection_statement
: IF '(' expression ')' statement
| IF '(' expression ')' statement ELSE statement
| SWITCH '(' expression ')' statement
;
iteration_statement
: WHILE '(' expression ')' statement
| DO statement WHILE '(' expression ')' ';'
| FOR '(' expression_statement expression_statement ')' statement
| FOR '(' expression_statement expression_statement expression ')' statement
;
jump_statement
: GOTO IDENTIFIER ';'
| CONTINUE ';'
| BREAK ';'
| RETURN ';'
| RETURN expression ';'
;
translation_unit
: external_declaration
| translation_unit external_declaration
;
external_declaration
: function_definition
| declaration
;
function_definition
: declaration_specifiers declarator declaration_list compound_statement
| declaration_specifiers declarator compound_statement
| declarator declaration_list compound_statement
| declarator compound_statement
;
%%
#include <stdio.h>
extern char yytext[];
extern int column;
yyerror(s)
char *s;
{
fflush(stdout);
printf("\n%*s\n%*s\n", column, "^", column, s);
}

You need to give yacc some rules to recover from the syntax error and attempt to continue. In your grammar, you might add a rule like:
declaration: error ';'
This rule will make it possible to recover from errors seen while parsing a declaration -- the parser will scan through the input until it sees a ';' and say that's the end of the declaration and attempt to continue from there. You might also add rules like:
struct_or_union_specifier
: struct_or_union IDENTIFIER '{' error '}'
| struct_or_union '{' error '}'
to skip up to the next } when you hit an error in a struct specifier. You can experiment with adding more rules, but it gets tricky to figure out which error recovery rule will be used in any given situation: yacc pops states until it finds one that has an action for the error token, so you really need to understand the state machine it builds for your parser.
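What a rule such as `declaration: error ';'` asks the parser to do can be sketched outside yacc. The following is a hand-written Python illustration, not yacc output: it parses a hypothetical mini-language of `name = number ;` statements, and on a syntax error it records the error, discards tokens through the next `;`, and resumes. The names `classify` and `parse_statements` are invented for this sketch.

```python
def classify(tok):
    """Map a raw token string to a token kind (toy lexer for the sketch)."""
    if tok in ("=", ";"):
        return tok
    if tok.isdigit():
        return "NUMBER"
    if tok.isidentifier():
        return "IDENT"
    return "UNKNOWN"


def parse_statements(tokens):
    """Parse `IDENT = NUMBER ;` statements and keep going after errors.

    Returns (ok_count, errors). On a bad statement the error is recorded,
    tokens are discarded through the next ';' (the synchronizing token),
    and parsing resumes with the following statement.
    """
    errors = []
    ok_count = 0
    i = 0
    while i < len(tokens):
        failed = False
        for kind in ("IDENT", "=", "NUMBER", ";"):
            if i < len(tokens) and classify(tokens[i]) == kind:
                i += 1
            else:
                found = tokens[i] if i < len(tokens) else "end of input"
                errors.append(f"token {i}: expected {kind}, found {found}")
                failed = True
                break
        if failed:
            # Panic-mode recovery: skip everything up to the next ';' ...
            while i < len(tokens) and tokens[i] != ";":
                i += 1
            i += 1  # ... and consume the ';' itself
        else:
            ok_count += 1
    return ok_count, errors
```

Calling `parse_statements("a = 1 ; b 2 ; c = ; d = 4 ;".split())` reports both bad statements in a single pass instead of stopping at the first. In yacc itself, the parser stays in its error state until three tokens have been shifted successfully, and further error messages are suppressed during that time; putting `yyerrok;` in the action of the recovery rule ends that state immediately.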

Boy, this takes me back a few years. I thought folks used Bison now. It is tricky to recover from errors; yacc tries to set things right again. I remember sometimes having to play with the stack and use yyerrok so the parser could continue. Remember that when things go south there is often a good reason, and that good reason ripples through, creating many false positives. That said, with a lot of finagling, if you know your grammar and are willing to tweak it, some pretty good error recovery is possible.
I found a URL to the reference I used to use: http://dinosaur.compilertools.net/yacc/index.html

Related

My grammar identifies keywords as identifiers

I'm trying to parse expressions from the Jakarta Expression Language. In summary, it is a simplified form of Java expressions, with the addition of a few things:
Support for creating maps like: {"foo": "bar"}
Support for creating lists and sets like: [1,2,3,4] {1,2,3,4}
Use some identifiers instead of symbols, like: foo gt bar (foo > bar), foo mod bar(foo % bar), and so on.
I'm struggling in the last bit, where it always understands the "mod", "gt", "ge" as identifiers instead of using the expression that has the "%", ">", ">=".
I'm new to ANTLR. My grammar is based on the Java grammar in the https://github.com/antlr/grammars-v4/tree/master/java/java and the JavaCC provided by: https://jakarta.ee/specifications/expression-language/4.0/jakarta-expression-language-spec-4.0.html#collected-syntax
grammar ExpressionLanguageGrammar;
prog: compositeExpression;
compositeExpression: (dynamicExpression | deferredExpression | literalExpression)*;
dynamicExpression: '${' expression RCURL;
deferredExpression: '#{' expression RCURL;
literalExpression: literal;
literal: BOOL_LITERAL | FLOATING_POINT_LITERAL | INTEGER_LITERAL | StringLiteral | NULL;
mapData | listData | setData;
methodArguments: LPAREN expressionList? RPAREN;
expressionList: (expression ((COMMA expression)*));
lambdaExpressionOrCall: LPAREN lambdaExpression RPAREN methodArguments*;
lambdaExpression: lambdaParameters ARROW expression;
lambdaParameters: IDENTIFIER | (LPAREN (IDENTIFIER ((COMMA IDENTIFIER)*))? RPAREN);
mapEntry: expression COLON expression;
mapEntries: mapEntry (COMMA mapEntry)*;
expression
: primary
|'[' expressionList? ']'
| '{' expressionList? '}'
| '{' mapEntries? '}'
| expression bop='.' (IDENTIFIER | IDENTIFIER '(' expressionList? ')')
| expression ('[' expression ']')+
| prefix=('-' | '!' | NOT1 | EMPTY) expression
| expression bop=('*' | '/' | '%' | MOD1 | DIV1) expression
| expression bop=('+' | '-') expression
| expression bop=('<=' | '>=' | '>' | '<' | LE1 | GE1 | LT1 | GT1) expression
| expression bop=INSTANCEOF IDENTIFIER
| expression bop=('==' | '!=' | EQ1 | NE1) expression
| expression bop=('&&' | AND1) expression
| expression bop=('||' | OR1) expression
| <assoc=right> expression bop='?' expression bop=':' expression
| <assoc=right> expression
bop=('=' | '+=' | '-=' | '*=' | '/=')
expression
| lambdaExpression
| lambdaExpressionOrCall
;
primary
: '(' expression ')'
| literal
| IDENTIFIER
;
BOOL_LITERAL: TRUE | FALSE;
IDENTIFIER: LETTER (LETTER|DIGIT)*;
INTEGER_LITERAL: [0-9]+;
FLOATING_POINT_LITERAL: [0-9]+ '.' [0-9]* EXPONENT? | '.' [0-9]+ EXPONENT? | [0-9]+ EXPONENT?;
fragment EXPONENT: ('e'|'E') ('+'|'-')? [0-9]+;
StringLiteral: ('"' DoubleStringCharacter* '"'
| '\'' SingleStringCharacter* '\'') ;
fragment DoubleStringCharacter
: ~["\\\r\n]
| '\\' EscapeSequence
;
fragment SingleStringCharacter
: ~['\\\r\n]
| '\\' EscapeSequence
;
fragment EscapeSequence
: CharacterEscapeSequence
| '0'
| HexEscapeSequence
| UnicodeEscapeSequence
| ExtendedUnicodeEscapeSequence
;
fragment CharacterEscapeSequence
: SingleEscapeCharacter
| NonEscapeCharacter
;
fragment HexEscapeSequence
: 'x' HexDigit HexDigit
;
fragment UnicodeEscapeSequence
: 'u' HexDigit HexDigit HexDigit HexDigit
| 'u' '{' HexDigit HexDigit+ '}'
;
fragment ExtendedUnicodeEscapeSequence
: 'u' '{' HexDigit+ '}'
;
fragment SingleEscapeCharacter
: ['"\\bfnrtv]
;
fragment NonEscapeCharacter
: ~['"\\bfnrtv0-9xu\r\n]
;
fragment EscapeCharacter
: SingleEscapeCharacter
| [0-9]
| [xu]
;
fragment HexDigit
: [_0-9a-fA-F]
;
fragment DecimalIntegerLiteral
: '0'
| [1-9] [0-9_]*
;
fragment ExponentPart
: [eE] [+-]? [0-9_]+
;
fragment IdentifierPart
: IdentifierStart
| [\p{Mn}]
| [\p{Nd}]
| [\p{Pc}]
| '\u200C'
| '\u200D'
;
fragment IdentifierStart
: [\p{L}]
| [$_]
| '\\' UnicodeEscapeSequence
;
LCURL: '{';
RCURL: '}';
LETTER: '\u0024' |
'\u0041'..'\u005a' |
'\u005f' |
'\u0061'..'\u007a' |
'\u00c0'..'\u00d6' |
'\u00d8'..'\u00f6' |
'\u00f8'..'\u00ff' |
'\u0100'..'\u1fff' |
'\u3040'..'\u318f' |
'\u3300'..'\u337f' |
'\u3400'..'\u3d2d' |
'\u4e00'..'\u9fff' |
'\uf900'..'\ufaff';
DIGIT: '\u0030'..'\u0039'|
'\u0660'..'\u0669'|
'\u06f0'..'\u06f9'|
'\u0966'..'\u096f'|
'\u09e6'..'\u09ef'|
'\u0a66'..'\u0a6f'|
'\u0ae6'..'\u0aef'|
'\u0b66'..'\u0b6f'|
'\u0be7'..'\u0bef'|
'\u0c66'..'\u0c6f'|
'\u0ce6'..'\u0cef'|
'\u0d66'..'\u0d6f'|
'\u0e50'..'\u0e59'|
'\u0ed0'..'\u0ed9'|
'\u1040'..'\u1049';
TRUE: 'true';
FALSE: 'false';
NULL: 'null';
DOT: '.';
LPAREN: '(';
RPAREN: ')';
LBRACK: '[';
RBRACK: ']';
COLON: ':';
COMMA: ',';
SEMICOLON: ';';
GT0: '>';
GT1: 'gt';
LT0: '<';
LT1: 'lt';
GE0: '>=';
GE1: 'ge';
LE0: '<=';
LE1: 'le';
EQ0: '==';
EQ1: 'eq';
NE0: '!=';
NE1: 'ne';
NOT0: '!';
NOT1: 'not';
AND0: '&&';
AND1: 'and';
OR0: '||';
OR1: 'or';
EMPTY: 'empty';
INSTANCEOF: 'instanceof';
MULT: '*';
PLUS: '+';
MINUS: '-';
QUESTIONMARK: '?';
DIV0: '/';
DIV1: 'div';
MOD0: '%';
MOD1: 'mod';
CONCAT: '+=';
ASSIGN: '=';
ARROW: '->';
DOLLAR: '$';
HASH: '#';
WS: [ \t\r\n]+ -> skip;
Move the lexer rules for those keywords so that they come before the lexer rule for IDENTIFIER.
If more than one lexer rule matches input of the same length, ANTLR chooses the first matching rule in the grammar.
For example, "mod" is matched by both IDENTIFIER and MOD1, but IDENTIFIER comes first, so ANTLR chooses IDENTIFIER. Move the MOD1 rule before IDENTIFIER and it will match MOD1.
By the way, unless you care about having different token types for "%" and "mod", you can define a single rule:
MOD: '%' | 'mod';
You can still get the token text if you need it, and you can then specify MOD in your parser rules instead of (MOD0 | MOD1).
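The selection rule described above (longest match first, then the rule listed earliest) can be simulated in a few lines of Python. This is only a model of the behaviour, not ANTLR code; `next_token` and the simplified IDENTIFIER pattern are assumptions made for the sketch.

```python
import re


def next_token(rules, text, pos=0):
    """Return (rule_name, matched_text) for the token starting at pos.

    The longest match wins; when two rules match the same length, the
    rule that appears first in `rules` is kept.
    """
    best = None
    for name, pattern in rules:
        m = re.compile(pattern).match(text, pos)
        # Strict '>' keeps the earlier rule when lengths are equal.
        if m and m.end() > pos and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Simplified stand-in for the grammar's IDENTIFIER rule (assumption).
IDENT = r"[A-Za-z_$][A-Za-z_$0-9]*"

identifier_first = [("IDENTIFIER", IDENT), ("MOD1", r"mod")]
keyword_first = [("MOD1", r"mod"), ("IDENTIFIER", IDENT)]

print(next_token(identifier_first, "mod"))   # IDENTIFIER wins the tie
print(next_token(keyword_first, "mod"))      # MOD1 wins the tie
print(next_token(keyword_first, "modulus"))  # longer IDENTIFIER match wins
```

Moving the keyword rule first changes only the tie: a longer identifier such as `modulus` is still lexed as IDENTIFIER.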

Square brackets not recognized as tokens in ANTLR

I am currently creating a programming language for my semester project. We are using ANTLR as our compiler-compiler, and we have run into a problem. When specifying the grammar for array declarations, ANTLR does not seem to recognize square brackets as tokens. For example, the following line of code:
string[] names = { "Bob", "Hans" }
will produce the error
extraneous input 'string[]' expecting {'end', 'num', 'bool', 'string', 'block', 'item', 'coords', 'break', 'for', 'while', 'until', 'switch', 'if', IDENTIFIER}
when the grammar for declarations is specified as follows
dcl
: 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
| 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
| 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
| 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
| 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
;
However, it seems to work fine if I exchange the '[]' with '{}' or '()'. For example, the following line of code
string() names = { "Bob", "Hans" }
works fine with the following grammar
| 'string' '(' ')' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
Why does it work with other kinds of brackets and symbols, when it does not work with square brackets?
Edit
Here is the entire grammar
grammar Minecraft;
/* LEXER RULES */
SINGLE_COMMENT : '//' ~('\r' | '\n')* -> skip ;
MULTILINE_COMMENT : '/*' .*? '*/' -> skip ;
WS : [ \t\n\r]+ -> skip ;
fragment LETTER : ('a' .. 'z') | ('A' .. 'Z') ;
IDENTIFIER : LETTER+ ;
fragment NUMBER : ('0' .. '9') ;
BOOL : 'true' | 'false' ;
NUM : NUMBER+ | NUMBER+ '.' NUMBER+ ;
STR : '"' (LETTER | NUMBER)* '"' | '\'' (LETTER | NUMBER)* '\'' ;
COORDS : NUM ',' NUM ',' NUM ;
ITEM_ID : NUMBER+ | NUMBER+ ':' NUMBER+ ;
MULDIVMODOP : '*' | '/' | '%' ;
ADDSUBOP : '+' | '-' ;
NEGOP : '!' ;
EQOP : '==' | '!=' | '<' | '<=' | '>' | '>=' ;
LOGOP : '&&' | '||' ;
/* PROGRAM GRAMMAR */
prog : 'begin' 'bot' body 'end' 'bot' ;
body : glob_var* initiate main function* ;
initiate : 'initiate' stmt* 'end' 'initiate' ;
main : 'loop' stmt* 'end' 'loop' ;
type : 'num' | 'bool' | 'string' | 'block' | 'item' | 'coords' ;
function
: 'function' IDENTIFIER '(' args ')' stmt* 'end' 'function'
| 'activity' IDENTIFIER '(' args ')' stmt* 'end' 'activity'
;
arg
: (type | arr_names) IDENTIFIER
| dcl
;
args : arg ',' args | arg ;
i_args : IDENTIFIER ',' i_args | /* epsilon */ ;
cond
: '(' cond ')'
| left=cond MULDIVMODOP right=cond
| left=cond ADDSUBOP right=cond
| NEGOP cond
| left=cond EQOP right=cond
| left=cond LOGOP right=cond
| (NUM | STR | BOOL | ITEM_ID | COORDS | IDENTIFIER)
;
stnd_stmt
: dcl
| 'for' IDENTIFIER '=' NUM ('to' | 'downto') NUM 'do' stmt* 'end' 'for'
| ('while' | 'until') cond 'repeat' stmt* 'end' 'repeat'
| IDENTIFIER '(' i_args ')'
| 'break'
;
stmt : stnd_stmt | if_stmt ;
else_stmt : stnd_stmt | ifelse_stmt ;
if_stmt
: 'if' cond 'then' stmt* 'end' 'if'
| 'if' cond 'then' stmt* 'else' else_stmt* 'end' 'if'
;
ifelse_stmt
: 'if' cond 'then' else_stmt*
| 'if' cond 'then' else_stmt* 'else' else_stmt*
;
glob_var : 'global' dcl ;
str_arr_items : (STR | IDENTIFIER) ',' str_arr_items | (STR | IDENTIFIER) ;
dcl
: 'num' IDENTIFIER '=' (NUM | IDENTIFIER | accessing)
| 'bool' IDENTIFIER '=' (BOOL | IDENTIFIER | accessing)
| 'string' '[' ']' IDENTIFIER '=' ('{' str_arr_items '}' | IDENTIFIER)
| 'string' IDENTIFIER '=' (STR | IDENTIFIER | accessing)
| 'block' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'item' IDENTIFIER '=' (ITEM_ID | IDENTIFIER | accessing)
| 'coords' IDENTIFIER '=' (COORDS | IDENTIFIER | accessing)
;
arr_items : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;
accessing
: IDENTIFIER '[' ('X' | 'Y' | 'Z') ']'
| IDENTIFIER '[' NUM+ ']'
;
It seems that the line
arr_items : 'num[]' | 'string[]' | 'block[]' | 'item[]' ;
created the tokens
num[]
string[]
block[]
item[]
This means that when the lexer reached the input string[], it emitted the single token 'string[]' rather than the three tokens 'string', '[' and ']'. When I deleted the line from the grammar, the parser behaved as expected. Thanks to Bart Kiers for pointing me towards this :)
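The effect can be reproduced with a small Python model of a longest-match lexer. It is a sketch, not ANTLR: `tokenize` is a made-up helper, and the IDENTIFIER fallback is reduced to a run of letters as in the grammar above.

```python
import re


def tokenize(literals, text):
    """Split text into tokens, taking the longest candidate at each position.

    Candidates are the literal tokens that match at the current position,
    plus a run of letters (the IDENTIFIER fallback). Whitespace is skipped.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        candidates = [lit for lit in literals if text.startswith(lit, pos)]
        ident = re.match(r"[A-Za-z]+", text[pos:])
        if ident:
            candidates.append(ident.group())
        if not candidates:
            raise ValueError(f"no token matches at position {pos}")
        tok = max(candidates, key=len)  # longest match wins
        tokens.append(tok)
        pos += len(tok)
    return tokens


# With the literal 'string[]' present (as introduced by arr_items):
print(tokenize(["string", "[", "]", "string[]"], "string[] names"))
# With that literal removed:
print(tokenize(["string", "[", "]"], "string[] names"))
```

With `'string[]'` among the literals, the first token swallows the brackets, so the `'string' '[' ']'` alternative in `dcl` can never match. Without it, the input splits into `string`, `[`, `]`, `names`.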

Parsing DECAF grammar in ANTLR

I am creating a parser for DECAF with ANTLR
grammar DECAF ;
//********* LEXER ******************
LETTER: ('a'..'z'|'A'..'Z') ;
DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
COMMENTS: '//' ~('\r' | '\n' )* -> channel(HIDDEN);
WS : [ \t\r\n\f | ' '| '\r' | '\n' | '\t']+ ->channel(HIDDEN);
CHAR: (LETTER|DIGIT|' '| '!' | '"' | '#' | '$' | '%' | '&' | '\'' | '(' | ')' | '*' | '+'
| ',' | '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '#' | '[' | '\\' | ']' | '^' | '_' | '`'| '{' | '|' | '}' | '~'
'\t'| '\n' | '\"' | '\'');
// ********** PARSER *****************
program : 'class' 'Program' '{' (declaration)* '}' ;
declaration: structDeclaration| varDeclaration | methodDeclaration ;
varDeclaration: varType ID ';' | varType ID '[' NUM ']' ';' ;
structDeclaration : 'struct' ID '{' (varDeclaration)* '}' ;
varType: 'int' | 'char' | 'boolean' | 'struct' ID | structDeclaration | 'void' ;
methodDeclaration : methodType ID '(' (parameter (',' parameter)*)* ')' block ;
methodType : 'int' | 'char' | 'boolean' | 'void' ;
parameter : parameterType ID | parameterType ID '[' ']' ;
parameterType: 'int' | 'char' | 'boolean' ;
block : '{' (varDeclaration)* (statement)* '}' ;
statement : 'if' '(' expression ')' block ( 'else' block )?
| 'while' '(' expression ')' block
|'return' expressionA ';'
| methodCall ';'
| block
| location '=' expression
| (expression)? ';' ;
expressionA: expression | ;
location : (ID|ID '[' expression ']') ('.' location)? ;
expression : location | methodCall | literal | expression op expression | '-' expression | '!' expression | '('expression')' ;
methodCall : ID '(' arg1 ')' ;
arg1 : arg2 | ;
arg2 : (arg) (',' arg)* ;
arg : expression;
op: arith_op | rel_op | eq_op | cond_op ;
arith_op : '+' | '-' | '*' | '/' | '%' ;
rel_op : '<' | '>' | '<=' | '>=' ;
eq_op : '==' | '!=' ;
cond_op : '&&' | '||' ;
literal : int_literal | char_literal | bool_literal ;
int_literal : NUM ;
char_literal : '\'' CHAR '\'' ;
bool_literal : 'true' | 'false' ;
When I give it the input:
class Program {
void main(){
return 3+5 ;
}
}
The parse tree is not building correctly since it is not recognizing the 3+5 as an expression. Is there anything wrong with my grammar that is causing the problem?
Lexer rules are matched from top to bottom. When two or more lexer rules match the same number of characters, the one defined first wins. Because of that, a single-digit integer will get matched as a DIGIT instead of a NUM.
Try parsing the following instead:
class Program {
void main(){
return 33 + 55 ;
}
}
which will be parsed just fine. This is because 33 and 55 are matched as NUMs, because NUM can now match 2 characters (DIGIT only 1, so NUM wins).
To fix it, make DIGIT a fragment (and LETTER as well):
fragment LETTER: ('a'..'z'|'A'..'Z') ;
fragment DIGIT : '0'..'9' ;
ID : LETTER( LETTER | DIGIT)* ;
NUM: DIGIT(DIGIT)* ;
Lexer fragments are only used internally by other lexer rules, and will never become tokens of their own.
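To see the difference on the inputs from the question, here is a rough Python model of the lexer. It is an illustration under assumptions: `lex` is a hypothetical helper, `PLUS` stands in for the implicit `'+'` token, and a fragment is modelled simply by leaving it out of the rule list.

```python
import re


def lex(rules, text):
    """Tokenize text: longest match wins, ties go to the rule listed first."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.match(pattern, text[pos:])
            # Strict '>' means an earlier rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


# DIGIT is a real token and is listed before NUM, as in the question.
with_digit_token = [("DIGIT", r"[0-9]"), ("NUM", r"[0-9]+"), ("PLUS", r"\+")]
# DIGIT is a fragment: it only exists inside NUM and never becomes a token.
digit_as_fragment = [("NUM", r"[0-9]+"), ("PLUS", r"\+")]

print(lex(with_digit_token, "3+5"))    # single digits come out as DIGIT
print(lex(with_digit_token, "33+55"))  # two digits: NUM is longer, so NUM
print(lex(digit_as_fragment, "3+5"))   # NUM in both positions
```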
A couple of other things: your WS rule matches way too much (it now also matches a | and a '), it should be:
WS : [ \t\r\n\f]+ ->channel(HIDDEN);
and you shouldn't match a char literal in your parser: do it in the lexer:
CHAR : '\'' ( ~['\r\n\\] | '\\' ['\\] ) '\'';
If you don't, the following will not get parsed properly:
class Program {
void main(){
return '1';
}
}
because the 1 will be tokenized as a NUM and not as a CHAR.

Ambiguity in Bison grammar

I've got a problem in my Bison grammar. I've got a pair of shift/reduces which are fine, and six reduce/reduces. The issue is that I don't understand how the reduce/reduce conflicts come about, since the parser should know which to choose from tokens prior.
%token STRING_LITERAL
%token INTEGER
%token FLOAT
%token CHARACTER
%token PTR_OP INC_OP DEC_OP LEFT_OP RIGHT_OP LE_OP GE_OP EQ_OP NE_OP
%token AND_OP OR_OP MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN ADD_ASSIGN
%token SUB_ASSIGN LEFT_ASSIGN RIGHT_ASSIGN AND_ASSIGN
%token XOR_ASSIGN OR_ASSIGN STATIC CATCH DOUBLE_COLON ELLIPSIS FUNCTION VAR
%token SIZEOF
%token GOTO
%token AUTO
%token THIS VAR_ASSIGN
%token NAMESPACE
%token TRY
%token TYPE
%token DECLTYPE
%token PUBLIC
%token PRIVATE
%token PROTECTED
%token USING
%token THROW
%token FRIEND
%token COMPILETIME
%token RUNTIME
%token VIRTUAL
%token ABSTRACT
%token CASE DEFAULT IF ELSE SWITCH WHILE DO FOR CONTINUE BREAK RETURN
%%
global_scope_definition
: namespace_definition
| function_definition
| variable_definition
| using_definition
| type_definition;
global_scope_definitions
: global_scope_definition
| global_scope_definitions global_scope_definition
program
: global_scope_definitions;
type_expression
: expression
variable_assignment
: VAR_ASSIGN;
name_or_qualified_name
: IDENTIFIER
| name_or_qualified_name '.' IDENTIFIER;
namespace_definition
: NAMESPACE name_or_qualified_name '{' namespace_scope_definitions '}';
accessibility_definition
: PUBLIC ':'
| PRIVATE ':'
| PROTECTED ':'
| FRIEND ':';
using_definition
: USING IDENTIFIER '=' name_or_qualified_name ';'
| USING name_or_qualified_name ';';
type_definition
: TYPE IDENTIFIER type_literal;
namespace_scope_definition
: accessibility_definition
| global_scope_definition;
namespace_scope_definitions
: namespace_scope_definition
| namespace_scope_definitions namespace_scope_definition;
accessibility_modifier
: PUBLIC
| PROTECTED
| PRIVATE
| FRIEND;
accessibility_block
: phase_block
| accessibility_modifier phase_block;
phase_modifier
: COMPILETIME
| RUNTIME;
phase_block
: definition_block
| phase_modifier definition_block;
definition_block
: default_definition_block
| STATIC static_definition_block
| VIRTUAL virtual_definition_block
| ABSTRACT abstract_definition_block;
static_definition_block
: '{' static_definitions '}';
static_definitions
: static_definition
| static_definitions static_definition;
static_definition
: variable_definition
| function_definition;
abstract_definition_block
: '{' abstract_definitions '}';
abstract_definitions
: abstract_definition
| abstract_definitions abstract_definition;
abstract_definition
: function_definition;
virtual_definition_block
: '{' virtual_definitions '}';
virtual_definitions
: virtual_definition
| virtual_definitions virtual_definition;
virtual_definition
: function_definition;
default_definition_block
: '{' default_definitions '}';
default_definitions
: default_definition
| default_definitions default_definition;
default_definition
: variable_definition
| function_definition
| constructor_definition
| destructor_definition
| type_definition;
type_scope_definition
: using_definition
| default_definition
| accessibility_block;
type_scope_definitions
: type_scope_definition
| type_scope_definitions type_scope_definition;
destructor_definition
: '~' TYPE '(' ')' compound_statement;
constructor_definition
: TYPE function_definition_arguments statements_and_inits;
statements_and_inits
: inits compound_statement
| compound_statement;
init
: ':' IDENTIFIER function_call_expression;
inits
: init
| inits init;
function_definition_arguments
: '(' ')'
| '(' function_argument_list ')';
function_definition
: type_expression IDENTIFIER function_definition_arguments compound_statement
| type_expression IDENTIFIER function_definition_arguments function_definition_arguments compound_statement;
function_argument_definition
: IDENTIFIER
| type_expression IDENTIFIER
| IDENTIFIER variable_assignment expression
| type_expression IDENTIFIER variable_assignment expression
| IDENTIFIER variable_assignment '{' expressions '}'
| type_expression IDENTIFIER variable_assignment '{' expressions '}';
function_argument_list
: function_argument_definition
| function_argument_list ',' function_argument_definition;
static_variable_definition
: STATIC variable_definition
| FRIEND variable_definition
| STATIC FRIEND variable_definition
| variable_definition;
variable_definition
: IDENTIFIER variable_assignment expression ';'
| type_expression IDENTIFIER variable_assignment expression ';'
| type_expression IDENTIFIER ';'
| type_expression IDENTIFIER function_call_expression ';';
base_class_list
: ':' type_expression
| base_class_list ',' type_expression;
type_literal
: base_class_list '{' type_scope_definitions '}'
| '{' type_scope_definitions '}'
| base_class_list '{' '}'
| '{' '}';
literal_expression
: INTEGER
| FLOAT
| CHARACTER
| STRING_LITERAL
| AUTO
| THIS
| TYPE type_literal;
primary_expression
: literal_expression
| '(' expression ')'
| IDENTIFIER;
expression
: variadic_expression;
variadic_expression
: assignment_expression
| assignment_expression ELLIPSIS;
assignment_operator
: '='
| MUL_ASSIGN
| DIV_ASSIGN
| MOD_ASSIGN
| ADD_ASSIGN
| SUB_ASSIGN
| LEFT_ASSIGN
| RIGHT_ASSIGN
| AND_ASSIGN
| XOR_ASSIGN
| OR_ASSIGN;
assignment_expression
: logical_or_expression
| unary_expression assignment_operator assignment_expression;
logical_or_expression
: logical_and_expression
| logical_or_expression OR_OP logical_and_expression;
logical_and_expression
: inclusive_or_expression
| logical_and_expression AND_OP inclusive_or_expression;
inclusive_or_expression
: exclusive_or_expression
| inclusive_or_expression '|' exclusive_or_expression;
exclusive_or_expression
: and_expression
| exclusive_or_expression '^' and_expression;
and_expression
: equality_expression
| and_expression '&' equality_expression;
equality_expression
: relational_expression
| equality_expression EQ_OP relational_expression
| equality_expression NE_OP relational_expression;
comparison_operator
: '<'
| '>'
| LE_OP
| GE_OP;
relational_expression
: shift_expression
| relational_expression comparison_operator shift_expression;
shift_operator
: LEFT_OP
| RIGHT_OP;
shift_expression
: additive_expression
| shift_expression shift_operator additive_expression;
additive_operator
: '+'
| '-';
additive_expression
: multiplicative_expression
| additive_expression additive_operator multiplicative_expression;
multiplicative_operator
: '*'
| '/'
| '%';
multiplicative_expression
: unary_expression
| multiplicative_expression multiplicative_operator unary_expression;
lambda_expression
: '[' capture_list ']' function_argument_list compound_statement
| '[' capture_list ']' compound_statement
| '[' ']' function_argument_list compound_statement
| '[' ']' compound_statement;
default_capture
: '&' | '=' ;
capture_list
: default_capture comma_capture_list
| comma_capture_list;
comma_capture_list
: variable_capture
| comma_capture_list ',' variable_capture;
variable_capture
: '&' IDENTIFIER
| '=' IDENTIFIER
| AND_OP IDENTIFIER;
unary_operator
: '&'
| '*'
| '+'
| '-'
| '~'
| '!'
| INC_OP
| DEC_OP;
unary_expression
: unary_operator unary_expression
| SIZEOF '(' expression ')'
| DECLTYPE '(' expression ')'
| lambda_expression
| postfix_expression;
postfix_expression
: primary_expression { $$ = $1; }
| postfix_expression '[' expression ']'
| postfix_expression function_call_expression
| postfix_expression '.' IDENTIFIER
| postfix_expression PTR_OP IDENTIFIER
| postfix_expression INC_OP
| postfix_expression DEC_OP
| postfix_expression FRIEND;
expressions
: expression
| expressions ',' expression;
function_argument
: expression
| IDENTIFIER variable_assignment '{' expressions '}'
| IDENTIFIER variable_assignment expression;
function_arguments
: function_argument
| function_arguments ',' function_argument;
function_call_expression
: '(' function_arguments ')'
| '(' ')';
initializer_statement
: expression
| IDENTIFIER variable_assignment expression
| type_expression IDENTIFIER variable_assignment expression;
destructor_statement
: expression '~' TYPE '(' ')' ';';
return_statement
: RETURN expression ';'
| RETURN ';';
try_statement
: TRY compound_statement catch_statements;
catch_statement
: CATCH '(' type_expression IDENTIFIER ')' compound_statement;
catch_statements
: catch_statement
| catch_statements catch_statement
| CATCH '(' ELLIPSIS ')' compound_statement
| catch_statements CATCH '(' ELLIPSIS ')' compound_statement;
for_statement_initializer
: initializer_statement ';'
| ';';
for_statement_condition
: expression ';'
| ';';
for_statement_repeat
: expression
| ;
for_statement
: FOR '(' for_statement_initializer for_statement_condition for_statement_repeat ')' statement;
while_statement
: WHILE '(' initializer_statement ')' statement;
do_while_statement
: DO statement WHILE '(' expression ')';
switch_statement
: SWITCH '(' initializer_statement ')' '{' case_statements '}';
default_statement
: DEFAULT ':' statements;
case_statement
: CASE expression DOUBLE_COLON statements;
case_statements
: case_statement
| case_statements case_statement { $1.push_back($2); $$ = std::move($1); }
| case_statements default_statement { $1.push_back($2); $$ = std::move($1); };
if_statement
: IF '(' initializer_statement ')' statement
| IF '(' initializer_statement ')' statement ELSE statement;
continue_statement
: CONTINUE ';';
break_statement
: BREAK ';';
label_statement
: IDENTIFIER ':';
goto_statement
: GOTO IDENTIFIER ';';
throw_statement
: THROW ';'
| THROW expression ';';
runtime_statement
: RUNTIME compound_statement;
compiletime_statement
: COMPILETIME compound_statement;
statement
: compound_statement
| return_statement
| try_statement
| expression ';'
| static_variable_definition
| for_statement
| while_statement
| do_while_statement
| switch_statement
| if_statement
| continue_statement
| break_statement
| goto_statement
| label_statement
| using_definition
| throw_statement
| compiletime_statement
| runtime_statement
| destructor_statement ;
statements
: statement
| statements statement;
compound_statement
: '{' '}'
| '{' statements '}';
%%
This is my grammar. Bison takes issue with supposed ambiguity between function_argument_definition and primary_expression, and between function_argument_definition and function_argument. However, I'm pretty sure that it should already know which to pick by the time it encounters any such thing. How can I resolve these ambiguities?
Consider the rules
function_definition:
type_expression IDENTIFIER function_definition_arguments compound_statement
variable_definition:
type_expression IDENTIFIER function_call_expression ';'
Either of these can appear in the same context in a variety of ways, so the parser has no way to tell which one it is looking at until it reaches the ; in the variable_definition or the { of the compound_statement in the function_definition. As a result, it cannot tell whether it is processing a function_definition_arguments or a function_call_expression, which leads to the reduce/reduce conflicts you see.
To find this sort of problem yourself, run bison with the -v option to produce a .output file showing the state machine it built. Then look at the states with conflicts and backtrack to see how the parser gets to those states. In your example, state 280 has (two of) the reduce/reduce conflicts. One way of getting there is state 177, which is parsing function_definition_arguments and function_call_expression in parallel: the parser is in a state where either is legal. State 177 comes from state 77, which comes from state 26, which shows the two rules I reproduced above.
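The same kind of conflict can be reproduced in a much smaller grammar, which may be easier to study in the .output file than the full one. The following is only a sketch: the file name conflict.y and the nonterminals definition, parameter and argument are made up for illustration and are not taken from the grammar in the question.

```yacc
/* conflict.y: a reduced, hypothetical grammar (not the asker's) that
 * shows the same kind of reduce/reduce conflict.
 * Inspect it with:  bison -v conflict.y   (writes conflict.output) */
%token TYPE IDENTIFIER
%%
definition
    : function_definition
    | variable_definition
    ;
function_definition
    : TYPE IDENTIFIER '(' parameter ')' '{' '}'
    ;
variable_definition
    : TYPE IDENTIFIER '(' argument ')' ';'
    ;
/* After TYPE IDENTIFIER '(' IDENTIFIER, with ')' as the lookahead,
 * the parser must already pick one of these two reductions, although
 * only the later ';' or '{' says which one is right. */
parameter : IDENTIFIER ;
argument  : IDENTIFIER ;
%%
```

In a grammar this small, one way out is to parse the parenthesized part with a single shared nonterminal and to distinguish the two kinds of definition only once the ; or { has been seen. Whether that merge is practical for the full grammar is a judgment call. Bison's %glr-parser mode is another option, and Bison 3.7 and later also accept -Wcounterexamples to print an example input for each conflict.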

Extracting recursion in ANTLR

I've got a grammar in ANTLR and don't understand how it can be recursive. Is there any way to get ANTLR to show the derivation that it used to see that my rules are recursive?
The recursive grammar in its entirety:
grammar DeadMG;
options {
language = C;
}
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
INT : '0'..'9'+
;
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
COMMENT
: '//' ~('\n'|'\r')* '\r'? '\n' {$channel=HIDDEN;}
| '/*' ( options {greedy=false;} : . )* '*/' {$channel=HIDDEN;}
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
STRING
: '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
;
CHAR: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
program
: namespace_scope_definitions;
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
namespace_definition
: 'namespace' ID '{' namespace_scope_definitions '}';
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
function_definition
: ID '(' function_argument_list ')' ('(' function_argument_list ')')? ('->' expression)? compound_statement;
function_argument_list
: expression? ID (':=' expression)? (',' function_argument_list)?;
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
literal_expression
: CHAR
| FLOAT
| INT
| STRING
| 'auto'
| 'type'
| type_definition;
primary_expression
: literal_expression
| ID
| '(' expression ')';
expression
: assignment_expression;
assignment_expression
: logical_or_expression (('=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<='| '>>=' | '&=' | '^=' | '|=') assignment_expression)*;
logical_or_expression
: logical_and_expression ('||' logical_and_expression)*;
logical_and_expression
: inclusive_or_expression ('&&' inclusive_or_expression)*;
inclusive_or_expression
: exclusive_or_expression ('|' exclusive_or_expression)*;
exclusive_or_expression
: and_expression ('^' and_expression)*;
and_expression
: equality_expression ('&' equality_expression)*;
equality_expression
: relational_expression (('=='|'!=') relational_expression)*;
relational_expression
: shift_expression (('<'|'>'|'<='|'>=') shift_expression)*;
shift_expression
: additive_expression (('<<'|'>>') additive_expression)*;
additive_expression
: multiplicative_expression (('+' multiplicative_expression) | ('-' multiplicative_expression))*;
multiplicative_expression
: unary_expression (('*' | '/' | '%') unary_expression)*;
unary_expression
: '++' primary_expression
| '--' primary_expression
| ('&' | '*' | '+' | '-' | '~' | '!') primary_expression
| 'sizeof' primary_expression
| postfix_expression;
postfix_expression
: primary_expression
| '[' expression ']'
| '(' expression* ')'
| '.' ID
| '->' ID
| '++'
| '--';
initializer_statement
: expression ';'
| variable_definition ';';
return_statement
: 'return' expression ';';
try_statement
: 'try' compound_statement catch_statement;
catch_statement
: 'catch' '(' ID ')' compound_statement catch_statement?
| 'catch' '(' '...' ')' compound_statement;
for_statement
: 'for' '(' initializer_statement expression? ';' expression? ')' compound_statement;
while_statement
: 'while' '(' initializer_statement ')' compound_statement;
do_while_statement
: 'do' compound_statement 'while' '(' expression ')';
switch_statement
: 'switch' '(' expression ')' '{' case_statement '}';
case_statement
: 'case:' (statement)* case_statement?
| 'default:' (statement)*;
if_statement
: 'if' '(' initializer_statement ')' compound_statement;
statement
: compound_statement
| return_statement
| try_statement
| initializer_statement
| for_statement
| while_statement
| do_while_statement
| switch_statement
| if_statement;
compound_statement
: '{' (statement)* '}';
More specifically, I am having trouble with the following rules:
namespace_scope_definitions
: (namespace_definition | type_definition | function_definition | variable_definition)+;
type_scope_definitions
: (type_definition | function_definition | variable_definition)*;
ANTLR is saying that alternatives 2 and 4, that is, type_definition and variable_definition, are recursive. Here's variable_definition:
variable_definition
: 'static'? expression? ID ':=' expression
| 'static'? expression ID ('(' (expression)* ')')?;
and here's type_definition:
type_definition
: 'type' ID? (':' expression (',' expression)+ )? '{' type_scope_definitions '}';
Both 'type' itself and type_definition are valid expressions in my expression syntax. However, removing it does not resolve the ambiguity, so it doesn't originate there. I also have plenty of other ambiguities to resolve; detailing all the warnings and errors would be too much, so I'd really like ANTLR itself to show me in more detail how these rules are recursive.
My suggestion is to remove most of the operator precedence rules for now:
expression
: multiplicative_expression
(
('+' multiplicative_expression)
| ('-' multiplicative_expression)
)*;
and then inline the rules that have a single caller to isolate the ambiguities. Yes, it is tedious.
I found a few ambiguities in the grammar, fixed them, and got far fewer warnings. However, I think LL is probably just not the right parsing algorithm for me. I am writing a custom parser and lexer. It would still have been nice if ANTLR had shown me how it found the problems, so that I could intervene and fix them.