bison shift/reduce problem moving add op into a subexpr - yacc

Originally in the example there was this
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
I wanted it to be 'more simple' so i wrote this (i realize it would do '+' for both add and subtract. But this is an example)
expr:
INTEGER
| expr addOp expr { $$ = $1 + $3; }
;
addOp:
'+' { $$ = $1; }
| '-' { $$ = $1; }
;
Now i get a shift/reduce error. It should be exactly the same -_- (to me). What do i need to do to fix this?
edit: To make things clear. The first has NO warning/error. I use %left to set the precedence (and i will use %right for = and those other right ops). However it seems to not apply when going into sub expressions.

Are you sure the conflicts involve just those two rules? The first one should have more conflicts than the second. At least with one symbol of look-ahead the decision to shift to a state with addOp on the stack is easier the second time around.
Update (I believe I can prove my theory... :-):
$ cat q2.y
%% expr: '1' | expr '+' expr | expr '-' expr;
$ cat q3.y
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
$ yacc q2.y
conflicts: 4 shift/reduce
$ yacc q3.y
conflicts: 2 shift/reduce
Having said all that, it's normal for yacc grammars to have ambiguities, and any real-life system is likely to have not just a few but literally dozens of shift/reduce conflicts. By definition, this conflict occurs when there is a perfectly valid shift available, so if you don't mind the parser taking that shift, then don't worry about it.
Now, in yacc you should prefer left-recursive rules. You can achieve that and get rid of your grammar ambiguity with:
$ cat q4.y
%% expr: expr addOp '1' | '1';
addOp: '+' | '-';
$ yacc q4.y
$
Note: no conflicts in the example above. If you like your grammar the way it is, just do:
%expect 2
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';

The problem is that the rule
expr: expr addOp expr { ..action.. }
has no precedence. Normally rules get the precedence of the first token on the RHS, but this rule has no tokens on its RHS. You need to add a %prec directive to it:
expr: expr addOp expr %prec '+' { ..action.. }
to explicitly give the rule a precedence.
Note that doing this doesn't get rid of the shift/reduce conflict, which was present in your original grammar. It just resolves it according to the precedence rules you specify, which means that bison won't give you a message about it. In general, using precedence to resolve conflicts can be tricky, since it can hide conflicts that you might have wanted to resolve differently, or might be unresolvable in your grammar as written.
Also see my answer to this question

Related

ANTLR4 Grammar - Issue with "dot" in fields and extended expressions

I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this :-
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above :-
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the "dots" in Fields.V1 and Fields.V2 are missing from the text and also there is a <missing ')' Error node. I believe I should somehow make ANTLR understand that an expression can also have fields with dot operators.
A question on top of this :-
(Var1)(Var2)
ANTLR is not throwing me error for this above scenario , the expressions should not be (Var1)(Var2) -- It should always have the operator (var1)*(var2) or (var1)+(var2) etc. The parser error tree is not generating this error. How should the grammar be modified to make sure even this scenario is taken into consideration.
To recognize IDs like Fields.V1, change you Lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow numbers in IDs
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
You also didn't add ID as a valid expr alternative
To handle detection of the error condition in (Var1)(Var2), you need Mike's advice to add the EOF Lexer rule to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)

Is it possible to make this YACC grammar unambiguous? expr: ... | expr expr

I am writing a simple calculator in yacc / bison.
The grammar for an expression looks somewhat like this:
expr
: NUM
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '+' expr %prec '*' { $$ = $1; }
| '-' expr %prec '*' { $$ = $1; }
| '(' expr ')' { $$ = $2; }
| expr expr { $$ = $1 '*' $2; }
;
I have declared the precedence of the operators like this.
%left '+' '-'
%left '*' '/'
%nonassoc '('
The problem is with the last rule:
expr expr { $$ = $1 $2; }
I want this rule because I want to be able to write expressions like 5(3+4)(3-24) in my calculator.
Is it possible to make this grammar unambiguous?
The ambiguity results from the fact that you allow unary operators (- expr), so 2 - 2 can be parsed either as a simple subtraction (yielding 0) or as an implicit product (of 2 and -2, yielding -4).
It's clear that subtraction is intended (otherwise subtraction would be impossible to represent) so it is necessary to ban the production expr: expr expr if the second expr on the right-hand side is a unary operation.
That can't be done with precedence declarations (or at least it cannot be done in an obvious way), so the best solution is to write out the grammar explicitly, without relying on precedence to disambiguate.
You will also have to decide exactly what is the precedence of implicit multiplication: either the same as explicit multiplication/division, or stronger. That affects how ab/cd is parsed. There is no consensus that I know of, so it is more or less up to you.
In the following, I assume that implicit multiplication binds more tightly. I also ensure that -ab is parsed as (-a)b, although -(ab) has the same end result (until you start dealing with things like non-arithmetic types and automatic conversions). So just take it as an example.
term: NUM
| '(' expr ')'
unop: term
| '-' unop
| '+' unop
conc: unop
| conc term
prod: conc
| prod '*' conc
| prod '/' conc
expr: prod
| expr '+' prod
| expr '-' prod

How do I find reduce/shift conflicts in my yacc code?

I'm writing a flex/yacc program that should read some tokens and easy grammar using cygwin.
I'm guessing something is wrong with my BNF grammar, but I can't seem to locate the problem. Below is some code
%start statement_list
%%
statement_list: statement
|statement_list statement
;
statement: condition|simple|loop|call_func|decl_array|decl_constant|decl_var;
call_func: IDENTIFIER'('ID_LIST')' {printf("callfunc\n");} ;
ID_LIST: IDENTIFIER
|ID_LIST','IDENTIFIER
;
simple: IDENTIFIER'['NUMBER']' ASSIGN expr
|PRINT STRING
|PRINTLN STRING
|RETURN expr
;
bool_expr: expr E_EQUAL expr
|expr NOT_EQUAL expr
|expr SMALLER expr
|expr E_SMALLER expr
|expr E_LARGER expr
|expr LARGER expr
|expr E_EQUAL bool
|expr NOT_EQUAL bool
;
expr: expr ADD expr {$$ = $1+$3;}
| expr MULT expr {$$ = $1-$3;}
| expr MINUS expr {$$ = $1*$3;}
| expr DIVIDE expr {if($3 == 0) yyerror("divide by zero");
else $$ = $1 / $3;}
|expr ASSIGN expr
| NUMBER
| IDENTIFIER
;
bool: TRUE
|FALSE
;
decl_constant: LET IDENTIFIER ASSIGN expr
|LET IDENTIFIER COLON "bool" ASSIGN bool
|LET IDENTIFIER COLON "Int" ASSIGN NUMBER
|LET IDENTIFIER COLON "String" ASSIGN STRING
;
decl_var: VAR IDENTIFIER
|VAR IDENTIFIER ASSIGN NUMBER
|VAR IDENTIFIER ASSIGN STRING
|VAR IDENTIFIER ASSIGN bool
|VAR IDENTIFIER COLON "Bool" ASSIGN bool
|VAR IDENTIFIER COLON "Int" ASSIGN NUMBER
|VAR IDENTIFIER COLON "String" ASSIGN STRING
;
decl_array: VAR IDENTIFIER COLON "Int" '[' NUMBER ']'
|VAR IDENTIFIER COLON "Bool" '[' NUMBER ']'
|VAR IDENTIFIER COLON "String" '[' NUMBER ']'
;
condition: IF '(' bool_expr ')' statement ELSE statement;
loop: WHILE '(' bool_expr ')' statement;
I've tried changing statement into
statement:';';
,reading a simple token to test if it works, but it seems like my code refuses to enter that part of the grammar.
Also when I compile it, it tells me there are 18 shift/reduce conflicts. Should I try to locate and solve all of them ?
EDIT: I have edited my code using Chris Dodd's answer, trying to solve each conflict by looking at the output file. The last few conflicts seem to be located in the below code.
expr: expr ADD expr {$$ = $1+$3;}
| expr MULT expr {$$ = $1-$3;}
| expr MINUS expr {$$ = $1*$3;}
| expr DIVIDE expr {if($3 == 0) yyerror("divide by zero");
else $$ = $1 / $3;}
|expr ASSIGN expr
| NUMBER
| IDENTIFIER
;
And here is part of the output file telling me what's wrong.
state 60
28 expr: expr . ADD expr
29 | expr . MULT expr
30 | expr . MINUS expr
31 | expr . DIVIDE expr
32 | expr . ASSIGN expr
32 | expr ASSIGN expr .
ASSIGN shift, and go to state 36
ADD shift, and go to state 37
MINUS shift, and go to state 38
MULT shift, and go to state 39
DIVIDE shift, and go to state 40
ASSIGN [reduce using rule 32 (expr)]
ADD [reduce using rule 32 (expr)]
MINUS [reduce using rule 32 (expr)]
MULT [reduce using rule 32 (expr)]
DIVIDE [reduce using rule 32 (expr)]
$default reduce using rule 32 (expr)
I don't understand, why would it choose rule 32 when it read ADD, MULT, DIVIDE or other tokens ? What's wrong with this part of my grammar?
Also, even though that above part of the grammar is wrong, shouldn't my compiler be able to read other grammar correctly ? For instance,
let a = 5
should be readable, yet the program returns syntax error ?
Your grammar looks reasonable, though it does have ambiguities in expressions, most of which could be solved by precedence. You should definitely look at ALL the conflicts reported and understand why they occur, and preferrably change the grammar to get rid of them.
As for your specific issue, if you change it to have statement: ';' ;, it should accept that. You don't show any of your lexing code, so the problem may be there. It may be helpful to compile you parser with -DYYDEBUG=1 to enable the debugging code generated by yacc/bison, and set the global variable yydebug to 1 before calling yyparse, which will dump a trace of everything the parser is doing to stderr.

Antlr4 - Implicit Definitions

I am trying to create a simple for now only integer-arithmetic expression parser. For now i have:
grammar MyExpr;
input: (expr NEWLINE)+;
expr: '(' expr ')'
| '-' expr
| <assoc = right> expr '^' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| ID '(' ExpressionList? ')'
| INT;
ExpressionList : expr (',' expr)*;
ID : [a-zA-Z]+;
INT : DIGIT+;
DIGIT: [0-9];
NEWLINE : '\r'?'\n';
WS : [\t]+ -> skip;
The rule ExpressionList seems to cause some problems. If i remove everything containing ExpressionList, everything compiles and seems to work out fine. But like above, i get errors like:
error(160): MyExpr.g4:14:17: reference to parser rule expr in lexer rule ExpressionList
error(126): MyExpr.g4:7:6: cannot create implicit token for string literal in non-combined grammar: '-'
I am using Eclipse and the Antlr4 Plugin. I try to orient myself on the cymbol grammar given in the antlr4-book.
Can someone tell me whats going wrong in my little grammar?
Found it out by myself:
Rules starting with capital letter refer to Lexer-rules. SO all I had to do was renaming my ExpressionList to expressionList.
Maybe someone else will find this useful some day ;)

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for coffeescript grammar. It's at the beginning yet, I just moved a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL() and Left Factoring gave me more insights, but I still can't get a handle.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the trouble maker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers are also welcome to tips how to "debug" this kind of erroneous grammar in your head.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid to intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first without a usable AST. (Well, it would deserve a separates question if this is a feasible approach.) So I don't care about tokens at the moment, they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY macth this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
In may occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rules like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent it's code block with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefor jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.