Is there a way to eliminate these reduce/reduce conflicts? - grammar

I am working on a toy language for fun, (called NOP) using Bison and Flex, and I have hit a wall. I am trying to parse sequences that look like name1.name2 and name1.func1().name2 and I get a lot of reduce/reduce conflicts. I know -why- am getting them, but I am having a heck of a time figuring out what to do about it.
So my question is whether this is a legitimate irregularity that can't be "fixed", or if my grammar is just wrong. The productions in question are compound_name and compound_symbol. It seems to me that they should parse separately. If I try to combine them I get conflicts with that as well. In the grammar, I am trying to illustrate what I want to do, rather than anything "clever".
%debug
%defines
%locations
%{
%}
%define parse.error verbose
%locations
%token FPCONST INTCONST UINTCONST STRCONST BOOLCONST
%token SYMBOL
%token AND OR NOT EQ NEQ LTE GTE LT GT
%token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%token DICT LIST BOOL STRING FLOAT INT UINT NOTHING
%right ADD_ASSIGN SUB_ASSIGN
%right MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%left AND OR
%left EQ NEQ
%left LT GT LTE GTE
%right ':'
%left '+' '-'
%left '*' '/' '%'
%left NEG
%right NOT
%%
program
: {} all_module {}
;
all_module
: module_list
;
module_list
: module_element {}
| module_list module_element {}
;
module_element
: compound_symbol {}
| expression {}
;
compound_name
: SYMBOL {}
| compound_name '.' SYMBOL {}
;
compound_symbol_element
: compound_name {}
| func_call {}
;
compound_symbol
: compound_symbol_element {}
| compound_symbol '.' compound_symbol_element {}
;
func_call
: compound_name '(' expression_list ')' {}
;
formatted_string
: STRCONST {}
| STRCONST '(' expression_list ')' {}
;
type_specifier
: STRING {}
| FLOAT {}
| INT {}
| UINT {}
| BOOL {}
| NOTHING {}
;
constant
: FPCONST {}
| INTCONST {}
| UINTCONST {}
| BOOLCONST {}
| NOTHING {}
;
expression_factor
: constant { }
| compound_symbol { }
| formatted_string {}
;
expression
: expression_factor {}
| expression '+' expression {}
| expression '-' expression {}
| expression '*' expression {}
| expression '/' expression {}
| expression '%' expression {}
| expression EQ expression {}
| expression NEQ expression {}
| expression LT expression {}
| expression GT expression {}
| expression LTE expression {}
| expression GTE expression {}
| expression AND expression {}
| expression OR expression {}
| '-' expression %prec NEG {}
| NOT expression { }
| type_specifier ':' SYMBOL {} // type cast
| '(' expression ')' {}
;
expression_list
: expression {}
| expression_list ',' expression {}
;
%%
This is a very stripped down parser. The "real" one is about 600 lines. It has no conflicts (and passes a bunch of tests) if I don't try to use a function call in a variable name. I am looking at re-writing it to be a packrat grammar if I cannot get Bison to do that I want. The rest of the project is here: https://github.com/chucktilbury/nop
$ bison -tvdo temp.c temp.y
temp.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
temp.y: warning: 16 reduce/reduce conflicts [-Wconflicts-rr]

All of the reduce/reduce conflicts are the result of:
module_element
: expression
| compound_symbol
That creates an ambiguity because you also have
expression
: expression_factor
expression_factor
: compound_symbol
So the parser can't tell whether or not you need the unit productions to be reduced. Eliminating module_element: compound_symbol doesn't change the set of sentences which can be produced; it just requires that a compound_symbol be reduced through expression before becoming a module_element.
As Chris Dodd points out in a comment, the fact that two module_elements can appear consecutively without a delimiter creates an additional ambiguity: the grammar allows a - b to be parsed either as a single expression (and consequently module_element) or as two consecutive expressions —a and -b— and thus two consecutive module_elements. That ambiguity accounts for three of the four shift/reduce conflicts.
Both of these are probably errors introduced when you simplified the grammar, since it appears that module elements in the full grammar are definitions, not expressions. Removing modules altogether and using expression as the starting symbol leaves only a single conflict.
That conflict is indeed the result of an ambiguity between compound_symbol and compound_name, as noted in your question. The problem is seen in these productions (non-terminals shortened to make typing easier):
name: SYMBOL
| name '.' SYMBOL
symbol
: element
| symbol '.' element
element
: name
That means that both a and a.b are names and hence
elements. But a symbol is a .-separated list of elements, so a.b could be derived in two ways:
symbol → element symbol → symbol . element
→ name → element . element
→ a.b → name . element
→ a . element
→ a . name
→ a . b
I fixed this by simplifying the grammar to:
compound_symbol
: compound_name
| compound_name '(' expression_list ')'
compound_name
: SYMBOL
| compound_symbol '.' SYMBOL
That gets rid of func_call and compound_symbol_element, which as far as I can see serve no purpose. I don't know if the non-terminal names remaining really capture anything sensible; I think it would make more sense to call compound_symbol something like name_or_call.
This grammar could be simplified further if higher-order functions were possible; the existing grammar forbids hof()(), presumably because you don't contemplate allowing a function to return a function object.
But even with higher-order functions, you might want to differentiate between function calls and member access/array subscript, because in many languages a function cannot return an object reference and hence a function call cannot appear on the left-hand side of an assignment operator. In other languages, such as C, the requirement that the left-hand side of an assignment operator be a reference ("lvalue") is enforced outside of the grammar. (And in C++, a function call or even an overloaded operator can return a reference, so the restriction needs to be enforced after type analysis.)

Related

Im just starting with ANTLR and I cant decipher where Im messing up with mismatched input error

I've just started using antlr so Id really appreciate the help! Im just trying to make a variable declaration declaration rule but its not working! Ive put the files Im working with below, please lmk if you need anything else!
INPUT CODE:
var test;
GRAMMAR G4 FILE:
grammar treetwo;
program : (declaration | statement)+ EOF;
declaration :
variable_declaration
| variable_assignment
;
statement:
expression
| ifstmnt
;
variable_declaration:
VAR NAME SEMICOLON
;
variable_assignment:
NAME '=' NUM SEMICOLON
| NAME '=' STRING SEMICOLON
| NAME '=' BOOLEAN SEMICOLON
;
expression:
operand operation operand SEMICOLON
| expression operation expression SEMICOLON
| operand operation expression SEMICOLON
| expression operation operand SEMICOLON
;
ifstmnt:
IF LPAREN term RPAREN LCURLY
(declaration | statement)+
RCURLY
;
term:
| NUM EQUALITY NUM
| NAME EQUALITY NUM
| NUM EQUALITY NAME
| NAME EQUALITY NAME
;
/*Tokens*/
NUM : '0' | '-'?[1-9][0-9]*;
STRING: [a-zA-Z]+;
BOOLEAN: 'true' | 'false';
VAR : 'var';
NAME : [a-zA-Z]+;
SEMICOLON : ';';
LPAREN: '(';
RPAREN: ')';
LCURLY: '{';
RCURLY: '}';
EQUALITY: '==' | '<' | '>' | '<=' | '>=' | '!=' ;
operation: '+' | '-' | '*' | '/';
operand: NUM;
IF: 'if';
WS : [ \t\r\n]+ -> skip;
Error I'm getting:
(line 1,char 0): mismatched input 'var' expecting {NUM, 'var', NAME, 'if'}
Your STRING rule is the same as your NAME rule.
With the ANTLR lexer, if two lexer rules match the same input, the first one declared will be used. As a result, you’ll never see a NAME token.
Most tutorials will show you have to dump out the token stream. It’s usually a good idea to view the token stream and verify your Lexer rules before getting too far into your parser rules.

How to detect an expression result is unused in an interpreted programming language?

I'm working on a simple procedural interpreted scripting language, written in Java using ANTLR4. Just a hobby project. I have written a few DSLs using ANTLR4 and the lexer and parser presented no real problems. I got quite a bit of the language working by interpreting directly from the parse tree but that strategy, apart from being slow, started to break down when I started to add functions.
So I've created a stack-based virtual machine, based on Chapter 10 of "Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages". I have an assembler for the VM that works well and I'm now trying to make the scripting language generate assembly via an AST.
Something I can't quite see is how to detect when an expression or function result is unused, so that I can generate a POP instruction to discard the value from the top of the operand stack.
I want things like assignment statements to be expressions, so that I can do things like:
x = y = 1;
In the AST, the assignment node is annotated with the symbol (the lvalue) and the rvalue comes from visiting the children of the assignment node. At the end of the visit of the assignment node, the rvalue is stored into the lvalue, and this is reloaded back into the operand stack so that it can be used as an expression result.
This generates ( for x = y = 1):
CLOAD 1 ; Push constant value
GSTOR y ; Store into global y and pop
GLOAD y ; Push value of y
GSTOR x ; Store into global x and pop
GLOAD x ; Push value of x
But it needs a POP instruction at the end to discard the result, otherwise the operand stack starts to grow with these unused results. I can't see the best way of doing this.
I guess my grammar could be flawed, which is preventing me seeing a solution here.
grammar g;
// ----------------------------------------------------------------------------
// Parser
// ----------------------------------------------------------------------------
parse
: (functionDefinition | compoundStatement)*
;
functionDefinition
: FUNCTION ID parameterSpecification compoundStatement
;
parameterSpecification
: '(' (ID (',' ID)*)? ')'
;
compoundStatement
: '{' compoundStatement* '}'
| conditionalStatement
| iterationStatement
| statement ';'
;
statement
: declaration
| expression
| exitStatement
| printStatement
| returnStatement
;
declaration
: LET ID ASSIGN expression # ConstantDeclaration
| VAR ID ASSIGN expression # VariableDeclaration
;
conditionalStatement
: ifStatement
;
ifStatement
: IF expression compoundStatement (ELSE compoundStatement)?
;
exitStatement
: EXIT
;
iterationStatement
: WHILE expression compoundStatement # WhileStatement
| DO compoundStatement WHILE expression # DoStatement
| FOR ID IN expression TO expression (STEP expression)? compoundStatement # ForStatement
;
printStatement
: PRINT '(' (expression (',' expression)*)? ')' # SimplePrintStatement
| PRINTF '(' STRING (',' expression)* ')' # PrintFormatStatement
;
returnStatement
: RETURN expression?
;
expression
: expression '[' expression ']' # Indexed
| ID DEFAULT expression # DefaultValue
| ID op=(INC | DEC) # Postfix
| op=(ADD | SUB | NOT) expression # Unary
| op=(INC | DEC) ID # Prefix
| expression op=(MUL | DIV | MOD) expression # Multiplicative
| expression op=(ADD | SUB) expression # Additive
| expression op=(GT | GE | LT | LE) expression # Relational
| expression op=(EQ | NE) expression # Equality
| expression AND expression # LogicalAnd
| expression OR expression # LogicalOr
| expression IF expression ELSE expression # Ternary
| ID '(' (expression (',' expression)*)? ')' # FunctionCall
| '(' expression ')' # Parenthesized
| '[' (expression (',' expression)* )? ']' # LiteralArray
| ID # Identifier
| NUMBER # LiteralNumber
| STRING # LiteralString
| BOOLEAN # LiteralBoolean
| ID ASSIGN expression # SimpleAssignment
| ID op=(CADD | CSUB | CMUL | CDIV) expression # CompoundAssignment
| ID '[' expression ']' ASSIGN expression # IndexedAssignment
;
// ----------------------------------------------------------------------------
// Lexer
// ----------------------------------------------------------------------------
fragment
IDCHR : [A-Za-z_$];
fragment
DIGIT : [0-9];
fragment
ESC : '\\' ["\\];
COMMENT : '#' .*? '\n' -> skip;
// ----------------------------------------------------------------------------
// Keywords
// ----------------------------------------------------------------------------
DO : 'do';
ELSE : 'else';
EXIT : 'exit';
FOR : 'for';
FUNCTION : 'function';
IF : 'if';
IN : 'in';
LET : 'let';
PRINT : 'print';
PRINTF : 'printf';
RETURN : 'return';
STEP : 'step';
TO : 'to';
VAR : 'var';
WHILE : 'while';
// ----------------------------------------------------------------------------
// Operators
// ----------------------------------------------------------------------------
ADD : '+';
DIV : '/';
MOD : '%';
MUL : '*';
SUB : '-';
DEC : '--';
INC : '++';
ASSIGN : '=';
CADD : '+=';
CDIV : '/=';
CMUL : '*=';
CSUB : '-=';
GE : '>=';
GT : '>';
LE : '<=';
LT : '<';
AND : '&&';
EQ : '==';
NE : '!=';
NOT : '!';
OR : '||';
DEFAULT : '??';
// ----------------------------------------------------------------------------
// Literals and identifiers
// ----------------------------------------------------------------------------
BOOLEAN : ('true'|'false');
NUMBER : DIGIT+ ('.' DIGIT+)?;
STRING : '"' (ESC | .)*? '"';
ID : IDCHR (IDCHR | DIGIT)*;
WHITESPACE : [ \t\r\n] -> skip;
ANYCHAR : . ;
So my question is where is the usual place to detect unused expression results, i.e. when expressions are used as plain statements? Is it something I should detect during the parse, then annotate the AST node? Or is this better done when visiting the AST for code generation (assembly generation in my case)? I just can't see where best to do it.
IMO it's not a question of the right grammar, but how you process the AST/parse tree. The fact if a result is used or not could be determined by checking the siblings (and parent siblings etc.). An assignment for instance is made of the lvalue, the operator and the rvalue, hence when you determined the rvalue, check the previous tree node sibling if that is an operator. Similarly you can check if the parent is a parentheses expression (for nested function calls, grouping etc.).
statement
: ...
| expression
If you label this case with # ExpressionStatement, you can generate a pop after every expression statement by overriding exitExpressionStatement() in the listener or visitExpressionStatement in the visitor.

Reduce/reduce conflict in clike grammar in jison

I'm working on the clike language compiler using Jison package. I went really well until I've introduced classes, thus Type can be a LITERAL now. Here is a simplified grammar:
%lex
%%
\s+ /* skip whitespace */
int return 'INTEGER'
string return 'STRING'
boolean return 'BOOLEAN'
void return 'VOID'
[0-9]+ return 'NUMBER'
[a-zA-Z_][0-9a-zA-Z_]* return 'LITERAL'
"--" return 'DECR'
<<EOF>> return 'EOF'
"=" return '='
";" return ';'
/lex
%%
Program
: EOF
| Stmt EOF
;
Stmt
: Type Ident ';'
| Ident '=' NUMBER ';'
;
Type
: INTEGER
| STRING
| BOOLEAN
| LITERAL
| VOID
;
Ident
: LITERAL
;
And the jison conflict:
Conflict in grammar: multiple actions possible when lookahead token is LITERAL in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
Conflict in grammar: multiple actions possible when lookahead token is = in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
States with conflicts:
State 10
Type -> LITERAL . #lookaheads= LITERAL =
Ident -> LITERAL . #lookaheads= LITERAL =
I've found quite a similar question that has no been answered, does any one have any clue how to solve this?
That's evidently a bug in jison, since the grammar is certainly LALR(1), and is handled without problems by bison. Apparently, jison is incorrectly computing the lookahead for the state in which the conflict occurs. (Update: It seems to be bug 205, reported in January 2014.)
If you ask jison to produce an LR(1) parser instead of an LALR(1) grammar, then it correctly computes the lookaheads and the grammar passes without warnings. However, I don't think that is a sustainable solution.
Here's another work-around. The Decl and Assign productions are not necessary; the "fix" was to remove LITERAL from Type and add a separate production for it.
Program
: EOF
| Stmt EOF
;
Decl
: Type Ident ';'
| LITERAL Ident ';'
;
Assign
: Ident '=' NUMBER ';'
;
Stmt
: Decl
| Assign
;
Type
: INTEGER
| STRING
| BOOLEAN
| VOID
;
Ident
: LITERAL
;
You might want to consider recognizing more than one statement:
Program
: EOF
| Stmts EOF
;
Stmts
: Stmt
| Stmts Stmt
;

What is wrong with this ANTLR Grammar? Conditional statement nested parenthesis

I've been tasked with writing a prototype of my team's DSL in Java, so I thought I would try it out using ANTLR. However I'm having problems with the 'expression' and 'condition' rules.
The DSL is already well defined so I would like to keep as close to the current spec as possible.
grammar MyDSL;
// Obviously this is just a snippet of the whole language, but it should give a
// decent view of the issue.
entry
: condition EOF
;
condition
: LPAREN condition RPAREN
| atomic_condition
| NOT condition
| condition AND condition
| condition OR condition
;
atomic_condition
: expression compare_operator expression
| expression (IS NULL | IS NOT NULL)
| identifier
| BOOLEAN
;
compare_operator
: EQUALS
| NEQUALS
| GT | LT
| GTEQUALS | LTEQUALS
;
expression
: LPAREN expression RPAREN
| atomic_expression
| PREFIX expression
| expression (MULTIPLY | DIVIDE) expression
| expression (ADD | SUBTRACT) expression
| expression CONCATENATE expression
;
atomic_expression
: SUBSTR LPAREN expression COMMA expression (COMMA expression)? RPAREN
| identifier
| INTEGER
;
identifier
: WORD
;
// Function Names
SUBSTR: 'SUBSTR';
// Control Chars
LPAREN : '(';
RPAREN : ')';
COMMA : ',';
// Literals and Identifiers
fragment DIGIT : [0-9] ;
INTEGER: DIGIT+;
fragment LETTER : [A-Za-z#$#];
fragment CHARACTER : DIGIT | LETTER | '_';
WORD: LETTER CHARACTER*;
BOOLEAN: 'TRUE' | 'FALSE';
// Arithmetic Operators
MULTIPLY : '*';
DIVIDE : '/';
ADD : '+';
SUBTRACT : '-';
PREFIX: ADD| SUBTRACT ;
// String Operators
CONCATENATE : '||';
// Comparison Operators
EQUALS : '==';
NEQUALS : '<>';
GTEQUALS : '>=';
LTEQUALS : '<=';
GT : '>';
LT : '<';
// Logical Operators
NOT : 'NOT';
AND : 'AND';
OR : 'OR';
// Keywords
IS : 'IS';
NULL: 'NULL';
// Whitespace
BLANK: [ \t\n\r]+ -> channel(HIDDEN) ;
The phrase I'm testing with is
(FOO == 115 AND (SUBSTR(BAR,2,1) == 1 OR SUBSTR(BAR,4,1) == 1))
However it is breaking on the nested parenthesis, matching the first ( with the first ) instead of the outermost (see below). In ANTLR3 I solved this with semantic predicates but it seems that ANTLR4 is supposed to have fixed the need for those.
I'd really like to keep the condition and the expression rules separate if at all possible. I have been able to get it to work when merged together in a single expression rule (based on examples here and elsewhere) but the current DSL spec has them as different and I'm trying to reduce any possible differences in behaviour.
Can anyone point out how I can get this all working while maintaining a separate rule for conditions' andexpressions`? Many thanks!
The grammar seems fine to me.
There's one thing going wrong in the lexer: the WORD token is defined before various keywords/operators causing it to get precedence over them. Place your WORD rule at the very end of your lexer rules (or at least after the last keywords which WORD could also match).

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for coffeescript grammar. It's at the beginning yet, I just moved a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL() and Left Factoring gave me more insights, but I still can't get a handle.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the trouble maker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers are also welcome to tips how to "debug" this kind of erroneous grammar in your head.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid to intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first without a usable AST. (Well, it would deserve a separates question if this is a feasible approach.) So I don't care about tokens at the moment, they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY macth this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
In may occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rules like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent it's code block with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefor jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.