I'm finding myself challenged on how to properly format rewrite rules when certain conditions occur in the original rule.
What is the appropriate way to rewrite this:
unaryExpression: op=('!' | '-') t=term
-> ^(UNARY_EXPR $op $t)
ANTLR doesn't seem to let me attach a label to anything in parentheses, so "op=" fails. Also, I've tried:
unaryExpression: ('!' | '-') t=term
-> ^(UNARY_EXPR ('!' | '-') $t)
Antlr doesn't like the or '|' and throws a grammar error.
Replacing the character class with a token name does solve this problem, however it creates a quagmire of other issues with my grammar.
--- edit ----
A second problem has been added. Please help me format this rule with tree grammar:
multExpression
: unaryExpression (MULT_OP unaryExpression)*
;
Pretty simple: My expectation is to enclose every matched token in a parent (imaginary) token MULT so that I end up with something like:
        MULT
          o
          |
  o---o---o---o---o
  |   |   |   |   |
 '3' '*' '6' '%'  2
unaryExpression
: (op='!' | op='-') term
-> ^(UNARY_EXPR[$op] $op term)
;
I used the UNARY_EXPR[$op] so the root node gets some useful line/column information instead of defaulting to -1.
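To illustrate why seeding the imaginary node from $op matters, here is a minimal Python sketch. The Token class and the imaginary helper are hypothetical stand-ins, not ANTLR API; they only model how a node created from an existing token can inherit its position, while a node created from nothing keeps the -1 default mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical token record; -1 mimics the 'no position' default."""
    type: str
    text: str
    line: int = -1
    column: int = -1


def imaginary(token_type, source=None):
    """Create an imaginary token, optionally copying position from `source`."""
    if source is None:
        # No real token to borrow from: position stays at -1.
        return Token(token_type, token_type)
    # Borrow line/column so later error messages can point at the input.
    return Token(token_type, token_type, source.line, source.column)


op = Token("BANG", "!", line=3, column=7)
with_position = imaginary("UNARY_EXPR", op)
without_position = imaginary("UNARY_EXPR")
```

Here with_position carries line 3, column 7, while without_position still reports -1 for both.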
I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But since BEACON_ANTENNA_TRAIN is a lexer token, even though the rule states that I expect an ID, the parser matches the token and raises the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser to match the content as an ID rather than as a token?
(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue.)
In this case, I've had to add the following lexer rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. (As posted, your token stream is as follows and could not generate the error you show.)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at your input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
select the rule that matches the longest sequence of input characters
if multiple lexer rules match the same sequence of input characters, use the rule that appears first (that's why I had to reorder the rules to get your error).
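The two tie-breaking steps above can be sketched in a few lines of Python. This is only a rough model under stated assumptions, not ANTLR's implementation: the RULES table and the EQ name for the '=' literal are hypothetical, and the regular expressions are approximations of the lexer rules in the reordered grammar.

```python
import re

# Hypothetical rule table mirroring the reordered lexer rules above.
# Order matters: earlier entries win when two rules match the same text.
RULES = [
    ("INIT_ABSCISSA", r"INIT_ABSCISSA"),
    ("BEACON_ANTENNA_TRAIN", r"BEACON_ANTENNA_TRAIN"),
    ("OPERATOR1", r"[*/]"),
    ("OPERATOR2", r"[+-]"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("INT_VALUE", r"[-+]?[0-9]+"),
    ("EQ", r"="),  # assumed name for the implicit '=' token
    ("WS", r"[ \t\r\n]+"),
]


def tokenize(text, rules=RULES):
    """Return (type, text) pairs using longest match, then first rule."""
    compiled = [(name, re.compile(pattern)) for name, pattern in rules]
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, regex in compiled:
            match = regex.match(text, pos)
            # A strictly longer match wins; on a tie the earlier rule is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic '-> skip'
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


stream = tokenize("INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN")
longer = tokenize("BEACON_ANTENNA_TRAIN_2")
```

In stream the last token has type BEACON_ANTENNA_TRAIN because the keyword rule and ID match the same 20 characters and the keyword rule comes first. In longer the ID rule matches two more characters, so the longest-match step picks ID regardless of rule order.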
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand, starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what determines whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put together an example that assumes it's an ID if it's on the right-hand side (rhs) of the elem_course rule.
Then this grammar:
grammar IDG;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: depending on what drives your decision, you might be able to leverage lexer modes, but there isn't anything in your example that leaves that impression.
This is the well-known keyword-as-identifier problem, and Mike Cargal gave you a working solution. I just want to add that the general approach to this problem is to add every keyword that should also be accepted as an identifier to a parser id rule. To restrict which keywords are allowed in certain grammar positions, you can use multiple id rules. For example, the MySQL grammar uses this approach to a large extent to define keywords that can serve as identifiers in general, or only as labels, role names, etc.
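The idea of several id rules, each admitting a different set of keywords, can be sketched in Python as sets of accepted token types. Everything here is illustrative: IDENTIFIER_LIKE, LABEL_LIKE and match_identifier are hypothetical names, and the choice of which keywords go into which set is an assumption, not something taken from the MySQL grammar.

```python
# Token types accepted by a hypothetical general-purpose `identifier` rule.
IDENTIFIER_LIKE = {"ID", "BEACON_ANTENNA_TRAIN", "INIT_ABSCISSA"}
# A narrower, equally hypothetical rule for another grammar position.
LABEL_LIKE = {"ID", "BEACON_ANTENNA_TRAIN"}


def match_identifier(token, allowed=IDENTIFIER_LIKE):
    """Accept `token` (a (type, text) pair) if its type is in `allowed`."""
    token_type, text = token
    if token_type not in allowed:
        raise SyntaxError(f"{token_type} is not allowed as an identifier here")
    # The lexer's decision is untouched; the parser just treats it as a name.
    return text


keyword = ("BEACON_ANTENNA_TRAIN", "BEACON_ANTENNA_TRAIN")
name = match_identifier(keyword)

try:
    match_identifier(("INIT_ABSCISSA", "INIT_ABSCISSA"), allowed=LABEL_LIKE)
    rejected = False
except SyntaxError:
    rejected = True
```

The token type assigned by the lexer never changes; only the parser-side set decides whether a keyword token is acceptable in a given position.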
Good day everyone,
I am using antlr4 to create a parser and lexer for Hive SQL (Hplsql.g4).
I believe this is the latest grammar file.
https://github.com/AngersZhuuuu/Spark-Hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4
However, I found at least two additions that are needed: IF and array indices.
For example, in a select statement, I may have:
a) SELECT if(a>8,12,20) FROM x
b) SELECT column_name[2] FROM x
Both are valid in Hive but both do not parse when I create a parser and lexer for java from the Hplsql.g4 above. I added an expression for the IF and it appears to work.
I added
expr :
...
| expr_if //I added
and a new rule:
expr_if :
T_IF T_OPEN_P bool_expr T_COMMA expr T_COMMA expr T_CLOSE_P //I added
;
However, figuring out how to allow an array index is not so easy because the grammar allows aliases:
select a from x
select a alias_of_a from x
select a[1] from x
select a[1] alias_of_a from x
should all be valid.
I tried adding a new expression for this like so:
expr :
...
| expr_array //I added
expr_array :
T_OPEN_SB L_INT T_OPEN_CB //I added
;
This didn't work for me. (T_OPEN_SB L_INT T_OPEN_CB are [ integer ] respectively). I tried so many variations on this as well. My questions are:
Am I using the right grammar file - if not is there a newer one with IF and array handling?
Has anyone been successful in extending this grammar to handle my cases above?
As per Bart's recommendations:
I updated ident.
I updated expr_atom.
I added array_index.
I had // | '[' .*? ']' commented out before.
Test Sql: select a[0] from t
Result:
line 1:8 no viable alternative at input 'selecta[0]'
line 1:8 mismatched input '[0]'
Tree
(program (block stmt (stmt select) (stmt (expr_stmt (expr (expr_atom (ident a)))))) [0] from t)
I feel like the problem is somehow related to select_list_alias below.
With select_list_alias containing ident and T_AS optional, ident is matching the array index.
I can't reconcile why this happens, especially since ident has been updated.
Excerpt from Hplsql.g4:
select_list :
select_list_set? select_list_limit? select_list_item (T_COMMA select_list_item)*
;
select_list_item :
(ident T_EQUAL)? expr select_list_alias?
| select_list_asterisk
;
select_list_alias :
{!_input.LT(1).getText().equalsIgnoreCase("INTO") && !_input.LT(1).getText().equalsIgnoreCase("FROM")}? T_AS? ident
| T_OPEN_P T_TITLE L_S_STRING T_CLOSE_P
;
If I pass in a simple SQL stmt to grun such as
select a[1] from t
The parse tree should look similar to this:
Instead of expr_atom, I want to see expr_array where it would split into expr_atom for the a and array_index for the [1].
Note that there is one SQL statement here. With my existing g4, the array index [1] (and the remainder of the stmt) gets parsed as a separate SQL statement.
Bart, I see from your parse tree that parsing resulted in two SQL statements from "select a[0] from t" - I was getting the same situation.
I will continue to explore different approaches - I am still suspicious of the select_list_alias which has T_AS? ident at the end. Just to confirm, I have commented out one line from ident_part like this: // | '[' .*? ']'
As mentioned in the comments: [ ... ] will be tokenised as an L_ID token. If you don't want that, remove the | '[' .*? ']' part:
fragment
L_ID_PART :
[a-zA-Z] ([a-zA-Z] | L_DIGIT | '_')* // Identifier part
| ('_' | '#' | ':' | '#' | '$') ([a-zA-Z] | L_DIGIT | '_' | '#' | ':' | '#' | '$')+ // (at least one char must follow special char)
| '"' .*? '"' // Quoted identifiers
// | '[' .*? ']' <-- removed
| '`' .*? '`'
;
and create/edit the grammar like this:
expr_atom :
date_literal
| timestamp_literal
| bool_literal
| expr_array // <-- added
| ident
| string
| dec_number
| int_number
| null_const
;
// new rule
expr_array
: ident array_index+
;
// new rule
array_index
: T_OPEN_SB expr T_CLOSE_SB
;
The rules above will cause select a[1] alias_of_a from x to be parsed successfully, but will fail on input like select a[1] alias_of_a from [identifier]: the [identifier] will not be matched as an identifier.
You could try adding something like this:
ident :
L_ID
| T_OPEN_SB ~T_CLOSE_SB+ T_CLOSE_SB // <-- added
| non_reserved_words
;
which will parse select a[1] alias_of_a from [identifier] properly, but I don't have a good picture of the whole grammar (or deep knowledge of HPL/SQL) to determine whether that will mess up other things :)
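The effect of the '[' .*? ']' alternative in the identifier rule can be reproduced with a small Python sketch. The patterns are simplified, hypothetical stand-ins for L_ID, L_INT and the bracket tokens, and the loop takes the first matching rule rather than the longest one, which is enough for this input.

```python
import re

# Simplified stand-ins for the lexer rules; not the real Hplsql definitions.
WORD = r"[a-zA-Z][a-zA-Z0-9_]*"
ID_WITH_BRACKETS = re.compile(WORD + r"|\[.*?\]")
ID_WITHOUT_BRACKETS = re.compile(WORD)
INTEGER = re.compile(r"[0-9]+")
BRACKET = re.compile(r"[\[\]]")


def tokenize(text, ident):
    """Tokenize `text`, trying the identifier pattern `ident` first."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, regex in (("L_ID", ident), ("L_INT", INTEGER), ("BRACKET", BRACKET)):
            match = regex.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character at offset {pos}")
    return tokens


before = tokenize("a[0]", ID_WITH_BRACKETS)
after = tokenize("a[0]", ID_WITHOUT_BRACKETS)
```

With the bracket alternative in place, before contains two identifier tokens, a and [0], so a parser rule expecting an opening bracket never sees one. Without it, after contains a, [, 0 and ] as separate tokens that a rule like array_index can match.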
EDIT
With my proposed changes, the grammar looks like this: https://gist.github.com/bkiers/4aedd6074726cbcd5d87ede00000cd0d (I cannot post it here on SO because of the char limit)
Parsing select a[0] from t with this will result in the parse tree:
And parsing select a[0] from [t] with this will result in this parse tree:
You're also able to test it by running the following Java code:
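import java.util.Arrays;
import javax.swing.JFrame;
import javax.swing.JPanel;
import org.antlr.v4.gui.TreeViewer;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;
// HplsqlLexer and HplsqlParser are the classes ANTLR generates from Hplsql.g4.
public class Main {
  public static void main(String[] args) {
    String source = "select a[0] from [t]";
    HplsqlLexer lexer = new HplsqlLexer(CharStreams.fromString(source));
    HplsqlParser parser = new HplsqlParser(new CommonTokenStream(lexer));
    ParseTree root = parser.program();
    JFrame frame = new JFrame("Antlr AST");
    JPanel panel = new JPanel();
    TreeViewer viewer = new TreeViewer(Arrays.asList(parser.getRuleNames()), root);
    viewer.setScale(1.5);
    panel.add(viewer);
    frame.add(panel);
    frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    frame.pack();
    frame.setVisible(true);
  }
}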
I'm still trying to parse a simple Javadoc style format using ANTLR. Basically the format looks like this:
/**
* Description
*
* #name someId
*/
My parser grammar is here:
query_doc : BEGIN_QDOC description name NOMANSLAND* END_QDOC;
description : (DESCRIPTION_TEXT | NOMANSLAND)*;
name : OPEN_NAME INNER_WS NAMEID INNER_WS* CLOSE_NAME;
My lexer grammar is here:
BEGIN_QDOC : '/**';
END_QDOC : ('*/');
NOMANSLAND : '\r'? '\n' (' ' | '\t')* '*' (' ' | '\t')*;
DESCRIPTION_TEXT : ~('\n');
OPEN_NAME : '#name' -> mode(NAME);
mode NAME;
INNER_WS : (' ' | '\t')+;
NAMEID : ('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_' | '?')+;
CLOSE_NAME : (('\r'? '\n') | '*/') -> mode(DEFAULT_MODE);
This appears to be working okay for the most part except for closing the #name definition in the following case:
/**
* #name someId*/
The above should be perfectly valid. We should not need a new line before ending the comment with '*/'. The issue I am having is that '*/' terminates the name definition successfully, but it consumes the token and only returns to the default mode so I need to have:
/**
* #name someId*/*/
if I actually want it to end the comment. I want it to return to the default mode and then realize that this token should end the comment (i.e. it should match END_QDOC). How can I accomplish this in ANTLR? I tried fixing it so that CLOSE_NAME is the inverse of ID:
CLOSE_NAME : ~('a'..'z' | 'A'..'Z' | '0'..'9' | '-' | '_' | '?');
But ANTLR still consumes the *, leaving an unrecognized token error on the remaining '/'. What I would really like to do is have ANTLR exit the mode without consuming the token so that '*/' is the next token when we return to DEFAULT_MODE. Any thoughts?
First of all, rather than use the mode command, you probably want to use -> pushMode(NAME) and -> popMode to go back to the default mode.
For your CLOSE_NAME rule, you could use a predicate instead of a matching literal for handling the end of a comment:
CLOSE_NAME
: ( '\r'? '\n'
| {_input.LA(1) == '*' && _input.LA(2) == '/'}?
)
-> popMode
;
This can produce a zero-length token and wasn't allowed in ANTLR 4.0, but the restriction was removed (changed to a warning) in ANTLR 4.1 since we realized that a zero-length token could be used to trigger a mode change and thus avoid infinite loops.
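To make the zero-length CLOSE_NAME token more concrete, here is a rough Python model of a mode-stack lexer. It is a sketch under assumptions, not ANTLR's algorithm: the MODES table layout and the "push NAME" / "pop" action strings are invented for illustration, and the predicate is approximated with a regular-expression lookahead that matches without consuming input.

```python
import re

# Per-mode rule tables: (token name, pattern, action). The layout and the
# action strings are hypothetical; the token names follow the question.
MODES = {
    "DEFAULT": [
        ("BEGIN_QDOC", r"/\*\*", None),
        ("END_QDOC", r"\*/", None),
        ("NOMANSLAND", r"\r?\n[ \t]*\*[ \t]*", None),
        ("OPEN_NAME", r"#name", "push NAME"),
        ("DESCRIPTION_TEXT", r"[^\n]", None),
    ],
    "NAME": [
        ("INNER_WS", r"[ \t]+", None),
        ("NAMEID", r"[a-zA-Z0-9_?-]+", None),
        # Either a newline, or an empty match when '*/' comes next.
        ("CLOSE_NAME", r"\r?\n|(?=\*/)", "pop"),
    ],
}


def tokenize(text):
    """Longest-match lexer with a mode stack; earlier rules win ties."""
    stack = ["DEFAULT"]
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern, action in MODES[stack[-1]]:
            match = re.compile(pattern).match(text, pos)
            if match and (best is None or len(match.group()) > len(best[1])):
                best = (name, match.group(), action)
        if best is None:
            raise ValueError(f"no rule matches at offset {pos}")
        name, lexeme, action = best
        tokens.append((name, lexeme))
        pos += len(lexeme)  # a zero-length token leaves pos unchanged
        if action == "pop":
            stack.pop()
        elif action is not None:
            stack.append(action.split()[1])
    return tokens


tokens = tokenize("/**\n * #name someId*/")
```

The empty CLOSE_NAME token does not loop forever here because it always pops the mode, so the next iteration runs against the default rules and END_QDOC consumes the '*/'.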
I'm trying to parse values with ANTLR. Here's the relevant part of my grammar:
root : IDENTIFIER | SELF | literal | constructor | call | indexer;
hierarchy : root (SUB^ (IDENTIFIER | call | indexer))*;
factor : hierarchy ((MULT^ | DIV^ | MODULO^) hierarchy)*;
sum : factor ((PLUS^ | MINUS^) factor)*;
comparison : sum (comparison_operator^ sum)*;
value : comparison | '(' value ')';
I won't describe each token or rule since their names are fairly self-explanatory. This grammar compiles and works well, allowing me to parse, using value, things such as:
a.b[c(5).d[3] * e()] < e("f")
The only thing left for value recognition is to be able to have parenthesized hierarchy roots. For instance:
(a.b).c
(3 < d()).e
...
Naively, and without much expectations, I tried adding the following alternative to my root rule:
root : ... | '(' value ')';
This however breaks the value rule due to non-LL(*)ism:
rule value has non-LL(*) decision due to recursive rule invocations reachable
from alts 1,2. Resolve by left-factoring or using syntactic predicates or using
backtrack=true option.
Even after reading most of The Definitive ANTLR Reference, I still don't understand these errors. However, what I do understand is that, upon seeing a parenthesis opening, ANTLR cannot know if it's looking at the beginning of a parenthesized value, or at the beginning of a parenthesized root.
How can I clearly define the behavior of parenthesized hierarchy root?
Edit: As requested, the additional rules:
parameter : type IDENTIFIER -> ^(PARAMETER ^(type IDENTIFIER));
constructor : NEW type PAREN_OPEN (arguments+=value (SEPARATOR arguments+=value)*)? PAREN_CLOSE -> ^(CONSTRUCTOR type ^(ARGUMENTS $arguments*)?);
call : IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE -> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?);
indexer : IDENTIFIER INDEX_START (values+=value (SEPARATOR values+=value)*)? INDEX_END -> ^(INDEXER IDENTIFIER ^(ARGUMENTS $values*));
Remove '(' value ')' from value and place it in root:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '(' value ')';
...
value : comparison;
Now (a.b).c will result in the following parse:
And (3 < d()).e in:
Of course, you'll probably want to omit the parenthesis from the AST:
root : IDENTIFIER | SELF | literal | constructor | call | indexer | '('! value ')'!;
Also, you don't need to append tokens in a List using += in your parser rules. The following:
call
: IDENTIFIER PAREN_OPEN (values+=value (SEPARATOR values+=value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS $values*)?)
;
can be rewritten into:
call
: IDENTIFIER PAREN_OPEN (value (SEPARATOR value)*)? PAREN_CLOSE
-> ^(CALL IDENTIFIER ^(ARGUMENTS value*)?)
;
EDIT
Your main problem is the fact that certain input can be parsed in two (or more!) ways. For example, the input (a) could be parsed by alternative 1 and 2 of your value rule:
value
: comparison // alternative 1
| '(' value ')' // alternative 2
;
Run through your parser rules: a comparison (alternative 1) can match (a) because it matches the root rule, which in turn matches '(' value ')'. But that is also what alternative 2 matches! And there you have it: the parser "sees" two different parses for one input and reports this ambiguity.
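A hand-written recursive-descent parser makes the fix easy to see: once parentheses are handled in exactly one place (the root rule), there is only one way to parse (a). The Python sketch below is a simplified, hypothetical version of the grammar with only '.', '*', '+' and '<' and no calls or indexers; the Parser class and the tuple-shaped trees are illustrative choices, not what ANTLR generates.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[().<*+])")


def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def binary(self, operand, operators):
        # operand (operator operand)* with the operator as the tree root
        node = operand()
        while self.peek() in operators:
            node = (self.take(), node, operand())
        return node

    def value(self):  # value : comparison ;  (no parenthesized alternative)
        return self.binary(self.sum, {"<"})

    def sum(self):
        return self.binary(self.factor, {"+"})

    def factor(self):
        return self.binary(self.hierarchy, {"*"})

    def hierarchy(self):  # hierarchy : root ('.' IDENTIFIER)* ;
        node = self.root()
        while self.peek() == ".":
            self.take(".")
            node = (".", node, self.take())
        return node

    def root(self):  # root : IDENTIFIER | NUMBER | '(' value ')' ;
        if self.peek() == "(":
            self.take("(")
            node = self.value()
            self.take(")")  # parentheses are dropped from the tree
            return node
        token = self.take()
        if not (token[0].isalnum() or token[0] == "_"):
            raise SyntaxError(f"unexpected token {token!r}")
        return token


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.value()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at {parser.peek()!r}")
    return tree
```

For example, parse("(a.b).c") yields a '.' node whose left child is the tree for a.b, and parse("(3 < d).e") yields a '.' node whose left child is the comparison.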
I'm writing an ANTLR/Xtext parser for the CoffeeScript grammar. It's still at an early stage: I have only ported a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it only demonstrates a very simple case that I cannot use in real life. The post ANTLR Grammar Tip: LL(*) and Left Factoring gave me more insight, but I still can't get a handle on it.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the troublemaker is the term rule, so I'd appreciate a local fix to it rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers to tips on how to "debug" this kind of erroneous grammar in your head are also welcome.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working ANTLR parser first, without a usable AST. (Whether this is a feasible approach would deserve a separate question.) So I don't care about tokens at the moment; they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: additiveOp
;
additiveOp
: multiOp (('+' | '-') multiOp)*
;
multiOp
: term (('*' | '/') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
compareOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
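One way to "debug" this kind of overlap by hand is to compute which tokens can start each alternative. The Python sketch below does that for the leading symbols of the rules in the question. It is a deliberately coarse approximation and not ANTLR's LL(*) analysis: the GRAMMAR table, first and overlap are hypothetical helpers, and the sketch assumes no rule can match empty input, so only the first symbol of each alternative is recorded.

```python
# Leading symbol of every alternative, copied from the question's grammar.
# Upper-case names are tokens, lower-case names are parser rules.
GRAMMAR = {
    "expression": ["value", "assign", "operation"],
    "value": ["assignable", "literal", "parenthetical"],
    "assign": ["assignable"],
    "assignable": ["simpleAssignable"],
    "simpleAssignable": ["identifier"],
    "identifier": ["IDENTIFIER"],
    "literal": ["alphaNumeric"],
    "alphaNumeric": ["NUMBER", "STRING"],
    "parenthetical": ["LPAREN"],
    "term": ["value", "assign"],
    "questionOp": ["term"],
    "mathOp": ["questionOp"],
    "additiveOp": ["mathOp"],
    "shiftOp": ["additiveOp"],
    "relationOp": ["shiftOp"],
    "compareOp": ["relationOp"],
    "logicOp": ["compareOp"],
    "operation": ["UNARY", "MINUS", "PLUS", "MINUS_MINUS", "PLUS_PLUS",
                  "simpleAssignable", "logicOp"],
}


def first(symbol, seen=frozenset()):
    """Tokens that can start `symbol` (assumes nothing matches empty input)."""
    if symbol not in GRAMMAR:
        return {symbol}  # a token starts with itself
    if symbol in seen:
        return set()  # guard against cycles
    result = set()
    for leading in GRAMMAR[symbol]:
        result |= first(leading, seen | {symbol})
    return result


def overlap(rule, alt_a, alt_b):
    """Tokens that can start both alternatives (1-based) of `rule`."""
    alternatives = GRAMMAR[rule]
    return first(alternatives[alt_a - 1]) & first(alternatives[alt_b - 1])


shared = overlap("expression", 1, 3)
```

Here shared contains NUMBER and STRING, the two tokens named in the warning, along with IDENTIFIER and LPAREN, which this one-token view cannot tell apart either.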
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY match this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
On many occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rule like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
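The difference shows up as soon as something walks the tree. In the Python sketch below, trees are plain tuples, which is purely an illustrative assumption; evaluate dispatches on distinct AND / OR token types, while evaluate_combined has to look at the matched text of a single LOGIC token.

```python
def evaluate(node):
    """Evaluate ('AND' | 'OR', left, right) tuples with bool leaves."""
    if isinstance(node, bool):
        return node
    token_type, left, right = node
    if token_type == "AND":
        return evaluate(left) and evaluate(right)
    if token_type == "OR":
        return evaluate(left) or evaluate(right)
    raise ValueError(f"unknown token type {token_type!r}")


def evaluate_combined(node):
    """Same idea with one LOGIC type: the token text must be inspected."""
    if isinstance(node, bool):
        return node
    token_type, text, left, right = node
    if token_type != "LOGIC":
        raise ValueError(f"unknown token type {token_type!r}")
    if text == "&&":
        return evaluate_combined(left) and evaluate_combined(right)
    if text == "||":
        return evaluate_combined(left) or evaluate_combined(right)
    raise ValueError(f"unknown operator text {text!r}")


separate = evaluate(("OR", False, ("AND", True, True)))
combined = evaluate_combined(("LOGIC", "||", False, ("LOGIC", "&&", True, True)))
```

Both give the same result; the first version just never needs to compare strings.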
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent its code blocks with spaces (and tabs), just like Python? But perhaps you're going to handle that at a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefore jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.