Why can't I have operator associativity with precedence in Antlr v4?

Why can't I have operator associativity with precedence in Antlr v4? - antlr

I'm using antlr v4 (eliminates direct left recursion).
My grammar's non-terminals are: and, or, id.
and has higher priority than or, and both of them are left associative.
According to Antlr4 reference if I put and before or it will have higher precedence.
So I have written this simple grammar:
expr : 'id'
| expr BINOP expr
;
BINOP: 'and'<assoc=left> //higher precedence
| 'or'<assoc=left> //lower precedence
;
But when it parses the string id and id or id and id the associativity is ok but
precedence is not ok: ((id and id) or id) and id.
If I turn BINOP into a parser rule:
binop: 'and'<assoc=left>
| 'or'<assoc=left>
;
neither associativity nor precedence work correctly:
id and (id or (id and id))
However when I implement BINOP inside expr parser rule:
expr : 'id'
| expr 'and'<assoc=left> expr
| expr 'or'<assoc=left> expr
;
everything works fine and I get the desired parse tree:
(id and id) or (id and id)
I googled a lot about the problem but I couldn't find anything.
I would be very glad if anyone could tell me where the problem is, and how
can I get correct associativity and precedence by having a separate rule for BINOP.
Thanks for your time.

The assoc=left is default so does nothing here. Also, precedence works on alternative level and you have put both operators at the same level:
expr BINOP expr
Ter

Related

How to disambiguate a subselect from a parenthesized expression?

I have the following expression notation:
expr
: OpenParen expr (Comma expr)* Comma? CloseParen # parenExpr
| OpenParen simpleSelect CloseParen # subSelectExpr
Unfortunately, a simpleSelect can also have a parenthetical around it, and so the following statement becomes ambiguous:
select ((select 1))
Here is the current grammar that I have, simplified down to only showing the issue:
grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
expr
: '(' expr ')' # parenExpr
| '(' query_expr ')' # subSelect
| Atom # identifier
;
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
And on the parse of select ((select 1)), here is the output:
What would be a possible way to disambiguate this?
I suppose the main thing is here:
'(' query_statement ')'
Since that recursively calls itself -- is there a way to do indirection or something else such that a query_statement called from within parens can never itself have parens?
Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:
I would be curious whether any of the grammars can solve the issue here: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there are good tools to do that, or do I need to write an additional antlr parser just to do that?)

Your input generates this parse tree:
That's a reasonable interpretation of your input and it is identified as a subSelect expr. It's a subSelect nested in a parenExpr (both of which are exprs).
If I switch up your rule a bit:
expr: '(' query_expr ')' # subSelect
| '(' expr ')' # parenExpr
| Atom # identifier
;
Now it's a subSelect that interprets the nested (select 1) as a query expression.
It's ambiguous because the outer parenthesized expression could match either of the first two alternatives resulting in different parse trees.
In ANTLR, ambiguities in alternatives are resolved by "using" the first alternative that matches. In this way ANTLR has deterministic behavior where you can control which interpretation is used (with alternative order). It's not uncommon for ANTLR grammars to have ambiguities like this.
IMHO, the IntelliJ plugin has caused many people to stumble over this as an indication that something is "wrong" with the grammar. There's a reason that ANTLR itself does not report an error in this situation. It has defined, deterministic behavior.
So far as "resolving" this ambiguity: the simple fact that the syntax uses parentheses to indicate two different "things" indicates that it is inherently ambiguous, so I don't believe you can "fix" the grammar ambiguity without modifying the syntax. (I might be wrong about this, and would find it interesting if someone provides a refactoring that manages to remove the ambiguity.)

EDIT:
After trying an earlier solution that proved incorrect with some additional test data, I've tried a different approach.
I added Atom as a viable alternative for query_expr since that Atom '1` is being offered as test data. In the full grammar implementation, it's hard to predict if this is necessary, even sufficent. I have only the grammar above with which to test.
I used some semantic predicates to strip parentheses (avoids the effort of writing an additional parser).
For testing purposes only, I added SQL-style line comments so that I could test many different inputs quickly.
The following SQL statements were tested, showing no ambiguity.
select 1
select (1)
select ((select 1))
select ((select (abc)))
select abc from ((select 1 from (select((select(1))))))
(select 1 from (select((select(1)))))
((select (xyz) from (select (((((foo))))) from tableX)))
select a from (select x from xyz)
union
select b from abc
select a from ((select x from xyz ))
intersect
((select b from foo))
select a from (select x from xyz )
intersect
(select b from foo)
The grammar is as follows:
grammar Subselect;
options { caseInsensitive=true; }
#header
{
import java.util.*;
}
#parser::members
{
String stripParens(String phrase)
{
String temp1 = phrase.substring[1];
temp2 = temp1.substring(0, s.length()-1);
return temp2;
}
}
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
| Atom
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
lrpExpr
: {stripParens(_input.LT[1].getText())}? query_expr
;
expr
: '(' lrpExpr ')' # parenExpr
| Atom # identifier
;
//---------------------------------------------
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
LineComment : '--' ~[\r\n]* -> skip ;
I'm not including images of parse trees in this edit to conserve space. However, from the inputs I tested, lrpExpr, being a separate rule, would give e.g. a Visitor class to evaluate what is inside the parentheses before moving further down the parse tree, so order of evaluation e.g. mathematical operator precedence could still be honored.
All still fast and with zero ambiguity.
I hope this suits your needs better.
Attribution: I used this answer as a starting point for the Java code for the semantic predicate.

How do I force the the parser to match a content as an ID rather than a token?

I have a grammar as the following (It's a partial view with only the relevant parts):
elem_course : INIT_ABSCISSA '=' expression;
expression
: ID
| INT_VALUE
| '(' expression ')'
| expression OPERATOR1 expression
| expression OPERATOR2 expression
;
OPERATOR1 : '*' | '/' ;
OPERATOR2 : '+' | '-' ;
fragment
WORD : LETTER (LETTER | NUM | '_' )*;
ID : WORD;
fragment
NUM : [0-9];
fragment
LETTER : [a-zA-Z];
BEACON_ANTENNA_TRAIN : 'BEACON_ANTENNA_TRAIN';
And, I would like to match the following line :
INIT_ABSCISSA = 40 + BEACON_ANTENNA_TRAIN
But as BEACON_ANTENNA_TRAIN is a lexer token and even the rule states that I except and ID, the parser matchs the token and raise me the following error when parsing:
line 11:29 mismatched input 'BEACON_ANTENNA_TRAIN' expecting {'(', INT_VALUE, ID}
Is there a way to force the parser that it should match the content as an ID rather than a token?

(Quick note: It's nice to abbreviate content in questions, but it really helps if it is functioning, stand-alone content that demonstrates your issue)
In this case, I've had to add the following lever rules to get this to generate, so I'm making some (probably legitimate) assumptions.
INT_VALUE: [\-+]? NUM+;
INIT_ABSCISSA: 'INIT_ABSCISSA';
WS: [ \t\r\n]+ -> skip;
I'm also going to have to assume that BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN'; appears before your ID rule. As posted your token stream is as follows and could not generate the error you show)
[#0,0:12='INIT_ABSCISSA',<ID>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<ID>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
If I reorder the lexer rules like this:
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
I can get your error message.
The lexer looks at you input stream of characters and attempts to match all lexer rules. To choose the token type, ANTLR will:
select the rule that matches the longest stream of input characters
If multiple Lever rules match the same sequence of input characters, then the rule that appears first will be used (that's why I had to re-order the rules to get your error.
With those assumptions, now to your question.
The short answer is "you can't". The Lexer processes input and determines token types before the parser is involved in any way. There is nothing you can do in parser rules to influence Token Type.
The parser, on the other hand starts with the start rule and then uses a recursive descent algorithm to attempt to match your token stream to parser rules.
You don't really give any idea what really guides whether BEACON_ANTENNA_TRAIN should be a BEACON_ANTENNA_TRAIN or an ID, so I'll put an example together that assumes that it's an ID if it's on the right hand side (rhs) of the elemen_course rule.
Then this grammar:
grammar IDG
;
elem_course: INIT_ABSCISSA '=' rhs_expression;
rhs_expression
: id = (ID | BEACON_ANTENNA_TRAIN | INIT_ABSCISSA)
| INT_VALUE
| '(' rhs_expression ')'
| rhs_expression OPERATOR1 rhs_expression
| rhs_expression OPERATOR2 rhs_expression
;
INIT_ABSCISSA: 'INIT_ABSCISSA';
BEACON_ANTENNA_TRAIN: 'BEACON_ANTENNA_TRAIN';
OPERATOR1: '*' | '/';
OPERATOR2: '+' | '-';
fragment WORD: LETTER (LETTER | NUM | '_')*;
ID: WORD;
fragment NUM: [0-9];
fragment LETTER: [a-zA-Z];
INT_VALUE: [\-+]? NUM+;
WS: [ \t\r\n]+ -> skip;
produces this token stream and parse tree:
$ grun IDG elem_course -tokens -tree IDG.txt
[#0,0:12='INIT_ABSCISSA',<'INIT_ABSCISSA'>,1:0]
[#1,14:14='=',<'='>,1:14]
[#2,16:17='40',<INT_VALUE>,1:16]
[#3,19:19='+',<OPERATOR2>,1:19]
[#4,21:40='BEACON_ANTENNA_TRAIN',<'BEACON_ANTENNA_TRAIN'>,1:21]
[#5,41:40='<EOF>',<EOF>,1:41]
(elem_course INIT_ABSCISSA = (rhs_expression (rhs_expression 40) + (rhs_expression BEACON_ANTENNA_TRAIN)))
As a side note: It's possible that, depending on what drives your decision, you might be able to leverage Lexer modes, but there's not anything in your example to leaves that impression.

This is the well known keyword-as-identifier problem and Mike Cargal gave you a working solution. I just want to add that the general approach for this problem is to add all keywords to a parser id rule that should be matched as an id. To restrict which keyword is allowed in certain grammar positions, you can use multiple id rules. For example the MySQL grammar uses this approach to a large extend to define keywords that can go as identifier in general or only as a label, for role names etc.

Antlr lexer tokens that match similar strings, what if the greedy lexer makes a mistake?

It seems that sometimes the Antlr lexer makes a bad choice on which rule to use when tokenizing a stream of characters... I'm trying to figure out how to help Antlr make the obvious-to-a-human right choice. I want to parse text like this:
d/dt(x)=a
a=d/dt
d=3
dt=4
This is an unfortunate syntax that an existing language uses and I'm trying to write a parser for. The "d/dt(x)" is representing the left hand side of a differential equation. Ignore the lingo if you must, just know that it is not "d" divided by "dt". However, the second occurrence of "d/dt" really is "d" divided by "dt".
Here's my grammar:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : DDT ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
When using this grammar the lexer grabs the first "d/dt(" and turns it to the token DDT. Perfect! Now later the lexer sees the second "d" followed by a "/" and says "hmmm, I can match this as an ID and a '/' or I can be greedy and match DDT". The lexer chooses to be greedy... but little does it know, there is no "(" a few characters later in the input stream. When the lexer looks for the missing "(" it throws a MismatchedTokenException!
The only solution I've found so far, is to move all the rules into the parser with a grammar like:
grammar diffeq_grammar;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt id ')' '=' id;
assignment
: id '=' number
| id '=' id '/' id
;
ddt : 'd' '/' 'd' 't' '(';
id : CHAR+;
number : DIGIT+;
CHAR : 'a'..'z';
DIGIT : '0'..'9';
NEWLINE : '\r\n'|'\r'|'\n';
This is a fine solution if I didn't already have thousands of lines of working code that depend on the first grammar working. After spending 2 days researching this problem I have come to the conclusion that a lexer... really ought to be able to distinguish the two cases. At some point the Antlr lexer is deciding between two rules: DDT and ID. It chooses DDT because the lexer is greedy. But when matching DDT fails, I'd like the lexer to go back to using ID.
I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).
Ideally I can modify the lexer rule for DDT with any valid Antlr code... and be done.
My target language is Java.
Thanks!
UPDATE
Thank you guys for some great answers!! I accepted the answer that best fit my question. The actual solution I used is in my own answer (not the accepted answer), and there are more answers that could have worked. Readers, check out all the answers; some of them may suit your case better than mine.

I'm okay with using predicates or other tricks as long as the grammar remains basically the same (i.e., the rules in the lexer, stay in the lexer. And most rules are left untouched.).
In that case, force the lexer to look ahead in the char-stream to make sure there really is "d/dt(" using a gated syntactic predicate.
A demo:
grammar diffeq_grammar;
#parser::members {
public static void main(String[] args) throws Exception {
String src =
"d/dt(x)=a\n" +
"a=d/dt\n" +
"d=3\n" +
"dt=4\n";
diffeq_grammarLexer lexer = new diffeq_grammarLexer(new ANTLRStringStream(src));
diffeq_grammarParser parser = new diffeq_grammarParser(new CommonTokenStream(lexer));
parser.program();
}
}
#lexer::members {
private boolean ahead(String text) {
for(int i = 0; i < text.length(); i++) {
if(input.LA(i + 1) != text.charAt(i)) {
return false;
}
}
return true;
}
}
program
: (statement? NEWLINE)* EOF
;
statement
: diffeq {System.out.println("diffeq : " + $text);}
| assignment {System.out.println("assignment : " + $text);}
;
diffeq
: DDT ID ')' '=' ID
;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
DDT : {ahead("d/dt(")}?=> 'd/dt(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n' | '\r' | '\n';
If you now run the demo:
java -cp antlr-3.3.jar org.antlr.Tool diffeq_grammar.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar diffeq_grammarParser
(when using Windows, replace the : with ; in the last command)
you will see the following output:
diffeq : d/dt(x)=a
assignment : a=d/dt
assignment : d=3
assignment : dt=4

Although this is not what you are trying to do considering the large amount of working code that you have in the project, you should still consider separating your parser and lexer more thoroughly. I is best to let the parser and the lexer do what they do best, rather than "fusing" them together. The most obvious indication of something being wrong is the lack of symmetry between your ( and ) tokens: one is part of a composite token, while the other one is a stand-alone token.
If refactoring is at all an option, you could change the parser and lexer like this:
grammar diffeq_grammar;
program : (statement? NEWLINE)* EOF; // <-- You forgot EOF
statement
: diffeq
| assignment;
diffeq : D OVER DT OPEN id CLOSE EQ id; // <-- here, id is a parser rule
assignment
: id EQ NUMBER
| id EQ id OVER id
;
id : ID | D | DT; // <-- Nice trick, isn't it?
D : 'D';
DT : 'DT';
OVER : '/';
EQ : '=';
OPEN : '(';
CLOSE : ')';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';
You may need to enable backtracking and memoization for this to work (but try compiling it without backtracking first).

Here's the solution I finally used. I know it violates one of my requirements: to keep lexer rules in the lexer and parser rules in the parser, but as it turns out moving DDT to ddt required no change in my code. Also, dasblinkenlight makes some good points about mismatched parenthesis in his answer and comments.
grammar ddt_problem;
program : (statement? NEWLINE)*;
statement
: diffeq
| assignment;
diffeq : ddt ID ')' '=' ID;
assignment
: ID '=' NUMBER
| ID '=' ID '/' ID
;
ddt : ( d=ID ) { $d.getText().equals("d") }? '/' ( dt=ID ) { $dt.getText().equals("dt") }? '(';
ID : 'a'..'z'+;
NUMBER : '0'..'9'+;
NEWLINE : '\r\n'|'\r'|'\n';

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for coffeescript grammar. It's at the beginning yet, I just moved a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL() and Left Factoring gave me more insights, but I still can't get a handle.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the trouble maker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers are also welcome to tips how to "debug" this kind of erroneous grammar in your head.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid to intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first without a usable AST. (Well, it would deserve a separates question if this is a feasible approach.) So I don't care about tokens at the moment, they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;

When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY macth this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
In may occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rules like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent it's code block with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefor jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.

ANTLR ambiguity in DeCaf - professor unsure where error is

I'm working on a project for school with converting a BNF form Decaf spec into a context-free grammar and building it in ANTLR. I've been working on it for a few weeks and been going to the professor when I've become stuck, but I finally ran into something that he says should not be causing an error. Here's the isolated part of my grammar, expr is the starting point. Before I do that I have one question.
Does it matter if my lexer rules appear before my parser rules in my grammar, or if they're mixed in intermittently through my grammar file?
calloutarg: expr | STRING;
expr: multexpr ((PLUS|MINUS) multexpr)* ;
multexpr : atom ((MULT|DIVISION) atom)*
;
atom : OPENPAR expr CLOSEPAR | ID ((OPENBRACKET expr CLOSEBRACKET)? | OPENPAR ((expr (COMMA)* )+)? CLOSEPAR)|
CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR | constant;
constant: INT | CHAR | boolconstant;
boolconstant: TRUE|FALSE;
The ugly formatting is because part of his advice for debugging was to take individual rules and break them down where the ambiguity is to see where the errors are starting. In this case, it's saying the problem is in the long ID portion, that OPENBRACKET and OPENPAR are the cause. If you have any ideas at all, I am deeply appreciative. Thank you, and sorry for how nasty the formatting is on the code I posted.

Does it matter if my lexer rules appear before my parser rules in my grammar ...
No, that does not matter.
The problem is that inside your atom rule, ANTLR cannot make a choice between these three variants:
ID ( ...
ID [ ...
ID
without resorting to (possibly) backtracking. You could resolve it by using some syntactic predicates (which looks like: (...)=> ...). A syntactic predicates is nothing more than a "look ahead" and if this "look ahead" is successful, it chooses that particular path.
Your current atom rule can be rewritten as follows:
atom
: OPENPAR expr CLOSEPAR
| ID OPENPAR ((expr (COMMA)* )+)? CLOSEPAR
| ID OPENBRACKET expr CLOSEBRACKET
| ID
| CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR
| constant
;
And with the predicates it will look like:
atom
: OPENPAR expr CLOSEPAR
| (ID OPENPAR)=> ID OPENPAR ((expr (COMMA)* )+)? CLOSEPAR
| (ID OPENBRACKET)=> ID OPENBRACKET expr CLOSEBRACKET
| ID
| CALLOUT OPENPAR STRING (COMMA (calloutarg)+ COMMA)? CLOSEPAR
| constant
;
which should do the trick.
Note: do not use ANTLRWorks to generate or test the parser! It cannot handle predicates (well). Best do it on the command line.
Also see: https://wincent.com/wiki/ANTLR_predicates
EDIT
Let's label the six different "branches" from your atom rule from A to F:
atom // branch
: OPENPAR expr CLOSEPAR // A
| ID OPENBRACKET expr CLOSEBRACKET // B
| ID OPENPAR ((expr COMMA*)+)? CLOSEPAR // C
| ID // D
| CALLOUT OPENPAR STRING (COMMA calloutarg+ COMMA)? CLOSEPAR // E
| constant // F
;
Now, when the (future) parser should handle input like this:
ID OPENPAR expr CLOSEPAR
ANTLR does not know how the parser should handle it. It could be parsed in two different ways:
branch D followed by branch A
branch C
Which is the source of the ambiguity ANTLR is complaining about. If you were to comment out one of the branches A, C or D, the error would disappear.
Hope that helps.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Why can't I have operator associativity with precedence in Antlr v4? - antlr

The assoc=left is default so does nothing here. Also, precedence works on alternative level and you have put both operators at the same level: expr BINOP expr Ter

Related

How to disambiguate a subselect from a parenthesized expression?

How do I force the the parser to match a content as an ID rather than a token?

Antlr lexer tokens that match similar strings, what if the greedy lexer makes a mistake?

Left-factoring grammar of coffeescript expressions

ANTLR ambiguity in DeCaf - professor unsure where error is

Categories

Resources