I'm trying to convert my existing Antlr3 project to Antlr4 to get more functionality. I have this grammar that wouldn't compile with Antlr4.9
expr
: term ( OR^ term )* ;
and
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
Mostly because Antlr4 doesn't support ^ and ! anymore. From the documentation it seems like those are
AST root operator. When generating abstract syntax trees (ASTs), token
references suffixed with the "^" root operator force AST nodes to be
created and added as the root of the current tree. This symbol is only
effective when the buildAST option is set. More information about ASTs
is also available.
AST exclude operator. When generating abstract syntax trees, token
references suffixed with the "!" exclude operator are not included in
the AST constructed for that rule. Rule references can also be
suffixed with the exclude operator, which implies that, while the tree
for the referenced rule is constructed, it is not linked into the tree
for the referencing rule. This symbol is only effective when the
buildAST option is set. More information about ASTs is also available.
If I took those out it would compile but I'm not sure what do those mean and how would Antlr4 supports it.
LPAREN and RPAREN is tokens
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
which Antlr4 kindly provides the way to convert that in the error messages but not ^ and !. The grammar is for parsing boolean expression for example (a=b AND b=c)
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
The v3 grammar:
...
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
...
expr
: term ( OR^ term )* ;
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
in v4 would look like this:
...
expr
: term ( OR term )* ;
factor
: ava | NOT factor | (LPAREN expr RPAREN) ;
EQUALS : '=';
LPAREN : '(';
RPAREN : ')';
So, just remove the inline ^ and ! operators (tree rewriting is no longer available in ANTLR4), and move the literal tokens in the tokens { ... } sections into own lexer rules.
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
What you posted there is part of a tree grammar for which there is no equivalent. In ANTLR4 you'd use a visitor to evaluate your expressions instead of inside a tree grammar.
Here are part of my grammar:
expr_address
: expr_address_category expr_opt { $$ = new ExprAddress($1,*$2);}
| axis axis_data { $$ = new ExprAddress($1,*$2);}
;
axis_data
: expr_opt { $$ = $1;}
| sign { if($1 == MINUS)
$$ = new IntergerExpr(-1000000000);
else if($1 == PLUS)
$$ = new IntergerExpr(+1000000000);}
;
expr_opt
: { $$ = new IntergerExpr(0);}
| expr { $$ = $1;}
;
expr_address_category
: I { $$ = NCAddress_I;}
| J { $$ = NCAddress_J;}
| K { $$ = NCAddress_K;}
;
axis
: X { $$ = NCAddress_X;}
| Y { $$ = NCAddress_Y;}
| Z { $$ = NCAddress_Z;}
| U { $$ = NCAddress_U;}
| V { $$ = NCAddress_V;}
| W { $$ = NCAddress_W;}
;
expr
: '[' expr ']' {$$ = $2;}
| COS parenthesized_expr {$$ = new BuiltinMethodCallExpr(COS,*$2);}
| SIN parenthesized_expr {$$ = new BuiltinMethodCallExpr(SIN,*$2);}
| ATAN parenthesized_expr {$$ = new BuiltinMethodCallExpr(ATAN,*$2);}
| SQRT parenthesized_expr {$$ = new BuiltinMethodCallExpr(SQRT,*$2);}
| ROUND parenthesized_expr {$$ = new BuiltinMethodCallExpr(ROUND,*$2);}
| variable {$$ = $1;}
| literal
| expr '+' expr {$$ = new BinaryOperatorExpr(*$1,PLUS,*$3);}
| expr '-' expr {$$ = new BinaryOperatorExpr(*$1,MINUS,*$3);}
| expr '*' expr {$$ = new BinaryOperatorExpr(*$1,MUL,*$3);}
| expr '/' expr {$$ = new BinaryOperatorExpr(*$1,DIV,*$3);}
| sign expr %prec UMINUS {$$ = new UnaryOperatorExpr($1,*$2);}
| expr EQ expr {$$ = new BinaryOperatorExpr(*$1,EQ,*$3);}
| expr NE expr {$$ = new BinaryOperatorExpr(*$1,NE,*$3);}
| expr GT expr {$$ = new BinaryOperatorExpr(*$1,GT,*$3);}
| expr GE expr {$$ = new BinaryOperatorExpr(*$1,GE,*$3);}
| expr LT expr {$$ = new BinaryOperatorExpr(*$1,LT,*$3);}
| expr LE expr {$$ = new BinaryOperatorExpr(*$1,LE,*$3);}
;
variable
: d_h_address {$$ = new AddressExpr(*$1);}
;
d_h_address
: D INTEGER_LITERAL { $$ = new IntAddress(NCAddress_D,$2);}
| H INTEGER_LITERAL { $$ = new IntAddress(NCAddress_H,$2);}
;
I hope my grammar support that like:
H100=20;
X;
X+0;
X+;
X+H100; //means H100 variable ref
The top two are same with X0; By the way,sign -> +/-;
But bison report conflicts,the key part of bison.output:
State 108
11 expr: sign . expr
64 axis_data: sign .
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 64 (axis_data)]
H [reduce using rule 64 (axis_data)]
$default reduce using rule 64 (axis_data)
State 69
62 expr_address: axis . axis_data
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
State 68
61 expr_address: expr_address_category . expr_opt
INTEGER_LITERAL shift, and go to state 93
REAL_LITERAL shift, and go to state 94
'+' shift, and go to state 74
'-' shift, and go to state 75
COS shift, and go to state 95
SIN shift, and go to state 96
ATAN shift, and go to state 97
SQRT shift, and go to state 98
ROUND shift, and go to state 99
D shift, and go to state 35
H shift, and go to state 36
'[' shift, and go to state 100
D [reduce using rule 65 (expr_opt)]
H [reduce using rule 65 (expr_opt)]
$default reduce using rule 65 (expr_opt)
I don't know how to deal with this,thanks advance.
EDIT:
I make a minimal grammar:
%{
#include <stdio.h>
extern "C" int yylex();
void yyerror(const char *s) { printf("ERROR: %s/n", s); }
%}
%token PLUS '+' MINUS '-'
%token D H I J K X Y Z INT
/*%type sign expr var expr_address_category expr_opt
%type axis */
%start word_list
%%
/*Above grammar lost this rule,it makes ambiguous*/
word_list
: word
| word_list word
;
sign
: PLUS
| MINUS
;
expr
: var
| sign expr
| '[' expr ']'
;
var
: D INT
| H INT
;
word
: expr_address
| var '=' expr
;
expr_address
: expr_address_category expr_opt
/*| '(' axis sign ')'*/
| axis sign
;
expr_opt
: /* empty */
| expr
;
expr_address_category
: I
| J
| K
| axis
;
axis
: X
| Y
| Z
;
%%
and I hope it can support:
X;
X0;
X+0; //the top three are same with X0
X+;
X+H100; //this means X's data is ref +H100;
X+H100=10; //two word on a block,X+ and H100=10;
XH100=10; //two word on a block,X and H100=10;
EDIT2:
The above EDIT lost this rule.
block
: word_list ';'
| ';'
;
Because I have to allow such grammar:
H000 = 100 H001 = 200 H002 = 300;
This is essentially the classic LR(2) grammar, except that in your case it is LR(3) because your variables consist of two tokens [Note 1]:
var : D INT | H INT
The basic problem is the concatenation of words without separators:
word_list : word | word_list word
combined with the fact that one of the options for word ends with an optional var:
word: expr_address
expr_address: expr_address_category expr_opt
while the other one starts with a var:
word: var '=' expr
The = makes this unambiguous, since nothing in an expr can contain that symbol. But at the point where a decision needs to be made, the = is not visible, because the lookahead is the first token of a var -- either an H or a D -- and the equals sign is still two tokens away.
This LR(2) grammar is very similar to the grammar used by yacc/bison itself, a fact which I always find to be ironic, because the grammar for yacc does not require ; between productions:
production: SYMBOL ':' | production SYMBOL /* Lots of detail omitted */
As with your grammar, this makes it impossible to know whether a SYMBOL should be shifted or trigger a reduce because the disambiguating : is still not visible.
Since the grammar is (I assume) unambiguous, and bison can now generate GLR parsers, that will be the simplest solution: just add
%glr-parser
to your bison prologue (but read the section of the bison manual on GLR parsers to understand the trade-off).
Note that the shift-reduce conflicts will still be reported as warnings; since it is impossible to reliably decide whether a grammar is ambiguous, bison doesn't attempt to do so and ambiguities will be reported at run-time if they exist.
You should also fix the issue mentioned in #ChrisDodd's answer regarding the refactoring of expr_address (although with a GLR parser it is not strictly necessary).
If, for whatever reason, you feel that a GLR parser will not meet your needs, you could use the solution in most implementations of yacc (including bison), which is a hack in the lexical scanner. The basic idea is to mark whether a symbol is followed by a colon or not in the lexer, so that the above production could be rewritten as:
production: SYMBOL_COLON | production SYMBOL
This solution would work for you if you were willing to combine the letter and the number into a single token:
word: expr_address expr_opt
| VARIABLE_EQUALS expr
// ...
expr: VARIABLE
My preference is to do this transformation in a wrapper around the lexer, which keeps a (one-element) queue of pending tokens:
/* The use of static variables makes this yylex wrapper unreliable
* if it is reused after a syntax error.
*/
int yylex_wrapper() {
static int saved_token = -1;
static YYSTYPE saved_yylval = {0};
int token = saved_token;
saved_token = -1;
yylval = saved_yylval;
// Read a new token only if we don't have one in the queue.
if (token < 0) token = yylex();
// If the current token is IDENTIFIER, check the next token
if (token == IDENTIFIER) {
// Read the next token into the queue (saved_token / saved_yylval)
YYSTYPE temp_val = yylval;
saved_token = yylex();
saved_yylval = yylval;
yylval = temp_val;
// If the second token is '=', then modify the current token
// and delete the '=' from the queue
if (saved_token == '=') {
saved_token = -1;
token = IDENTIFIER_EQUALS;
}
}
return token;
}
Notes
Personally, I would start by making a var a single token (do you really want to allow people to write:
H /* Some comment in the middle of the variable name */ 100
but that's not going to solve any problems; it merely reduces the grammar's lookahead requirement from LR(3) to LR(2).
The main problem is that it can't figure out where one word in a word_list ends and the next one begins, because there is no separator token between words. This is in contrast to your examples, which all have ; terminators. So that suggests one obvious fix -- put in the ; separators:
word: expr_address ';'
| var '=' expr ';'
That fixes most of the problems, but leaves a lookahead conflict where it can't decide whether an axis is an expr_address_category or not when the lookahead is a sign, because it depends on whether there's an expr after the sign or not. You can fix that by refactoring to defer deciding:
expr_address
: expr_address_category expr_opt
| axis expr_opt
| axis sign
..and remove axis from expr_address_category
I'm trying to get: (20 + (-3)) * 3 / (20 / 3) / 2 to equal 4. Right now it equals 17.
So basically it's doing (20/3) then dividing that by 2, then dividing 3 by [(20/3)/2], then multiplying that by 17. Not sure how to alter my grammar/rules/precedences to get it to read correctly. Any guidance would be appreciated, thanks.
%%
start: PROGRAMtoken IDtoken IStoken compoundstatement
compoundstatement: BEGINtoken {print_header();} statement semi ENDtoken {print_end();}
semi: SEMItoken statement semi
|
statement: IDtoken EQtoken exp
{ regs[$1] = $3; }
| PRINTtoken exp
{ cout << $2 << endl; }
| declaration
declaration: VARtoken IDtoken comma
comma: COMMAtoken IDtoken comma
|
exp: exp PLUStoken term
{ $$ = $1 + $3; }
| exp MINUStoken term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| factor TIMEStoken term
{$$ = $1 * $3;
}
| factor DIVIDEtoken term
{ $$ = $1 / $3;
}
factor: ICONSTtoken
{ $$ = $1;}
| IDtoken
{ $$ = regs[$1]; }
| LPARENtoken exp RPARENtoken
{ $$ = $2;}
%%
My tokens and types look like:
%token BEGINtoken
%token COMMAtoken
%left DIVIDEtoken
%left TIMEStoken
%token ENDtoken
%token EOFtoken
%token EQtoken
%token <value> ICONSTtoken
%token <value> IDtoken
%token IStoken
%token LPARENtoken
%left PLUStoken MINUStoken
%token PRINTtoken
%token PROGRAMtoken
%token RPARENtoken
%token SEMItoken
%token VARtoken
%type <value> exp
%type <value> term
%type <value> factor
You really wanted someone to work hard to give you an answer, which is why the question has hung around for a year. Have a read of this help page: https://stackoverflow.com/help/how-to-ask, in particular the part about simplifying the problem. There are lots of rules in your grammar file that were not needed to reproduce the problem. We did not need:
%token BEGINtoken
%token COMMAtoken
%token ENDtoken
%token EOFtoken
%token EQtoken
%token <value> IDtoken
%token IStoken
%token PROGRAMtoken
%token VARtoken
%%
start: PROGRAMtoken IDtoken IStoken compoundstatement
compoundstatement: BEGINtoken {print_header();} statement semi ENDtoken {print_end();}
semi: SEMItoken statement semi
|
| declaration
declaration: VARtoken IDtoken comma
comma: COMMAtoken IDtoken comma
|
You could have just removed these tokens and rules to get to the heart of the operator precedence issue. We did not need any variables, declarations, assignments or program structure to illustrate the failure. Learning to simplify is the heart of competent debugging and thus programming. If you'd done this simplification more people would have had a go at answering. I'm saying this not for the OP, but for those that will follow with similar problems!
I'm wondering what school is setting this assignment, as I've seen a fair number of yacc questions on SO around the same dumb problem. I suspect more will come here every year, so answering this will help them. I knew what the issue was on inspection of the grammar, but to test my solution I had to code up a working lexer, some symbol table routines, a main program and other ancillary code. Again, another deterrent for problem solvers.
Lets get to the heart of the problem. You have these token declarations:
%left DIVIDEtoken
%left TIMEStoken
%left PLUStoken MINUStoken
These tell yacc that if any rules are ambiguous that the operators associate left. Your rules for these operators are:
exp: exp PLUStoken term
{ $$ = $1 + $3; }
| exp MINUStoken term
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| factor TIMEStoken term
{$$ = $1 * $3;
}
| factor DIVIDEtoken term
{ $$ = $1 / $3;
}
However, these rules are not ambiguous, and thus the operator precedence declaration is not required. Yacc will follow the non-ambiguous grammar you have used. The ways these rules are written, it tells yacc that the operators have right associativity, which is the opposite of what you want. Now, it could be clearly seen from the simple arithmetic in your example that the operators were being calculated in a right associative way, and you wanted the opposite. There were really big clues there weren't there?
OK. How to change the associativity? One way would be to make the grammar ambiguous again so that the %left declaration is used, or just flip the rules around to invert the associativity. That's what I did:
exp: term PLUStoken exp
{ $$ = $1 + $3; }
| term MINUStoken exp
{ $$ = $1 - $3; }
| term
{ $$ = $1; }
| MINUStoken term
{ $$ = -$2;}
term: factor
{ $$ = $1;
}
| term TIMEStoken factor
{$$ = $1 * $3;
}
| term DIVIDEtoken factor
{ $$ = $1 / $3;
}
Do you see what I did there? I rotated the grammar rule around the operator.
Now for some more disclaimers. I said this is a dumb exercise. Interpreting expressions on the fly is a poor use of the yacc tool, and not what happens in real compilers or interpreters. In a realistic implementation, a parse tree would be built and the value calculations would be performed during the tree walk. This would then enable the issues of undeclared variables to be resolved (which also occurs in this exercise). The use of the regs array to store values is also dumb, because there is clearly an ancillary symbol table in use to return a unique integer ID for the symbols. In a real compiler/interpreter those values would also be stored in that symbol table.
I hope this tutorial has helped further students understand these parsing issues in their classwork.
Say I have a grammar like this:
expr : expr '+' expr { $$ = operation('+', $1, $3); }
| expr '-' expr { $$ = operation('-', $1, $3); }
| expr '*' expr { $$ = operation('*', $1, $3); }
| expr '/' expr { $$ = operation('/', $1, $3); }
| num
;
Where each of those operators has a precedence attached and is marked as left associative.
Then I want to refactor my grammar such that:
op : '+' | '-' | '*' | '/' ;
expr : expr op expr { $$ = operation($2, $1, $3); }
| num
;
How does yacc (if even at all) determine the associativity and precedence of op in this case? Will it trace its way through all the possible precedences/associativities of +, -, * and / when evaluating op, or does defining an associativity for nonterminal symbols make no sense?
AFAIK, with precedence order for nonterminals, it uses the precedence of the rightmost terminal symbol, but I can't find any documentation on the associativity rules themselves for nonterminals.
The "normal" way to do this (as far as I'm aware) is to define a different expr type for each operator, that way you get very explicit control over what's happening.
Python's grammar is a good example of this: http://docs.python.org/reference/grammar.html.
I'm back and now writing my own language and my OS, but as I'm now starting in the development of my own development language, I'm getting some errors when using Bison and I don't know how to solve them. This is my *.y file code:
input:
| input line
;
line: '\n'
| exp '\n' { printf ("\t%.10g\n", $1); }
;
exp: NUM { $$ = $1; }
| exp exp '+' { $$ = $1 + $2; }
| exp exp '-' { $$ = $1 - $2; }
| exp exp '*' { $$ = $1 * $2; }
| exp exp '/' { $$ = $1 / $2; }
/* Exponentiation */
| exp exp '^' { $$ = pow ($1, $2); }
/* Unary minus */
| exp 'n' { $$ = -$1; }
;
%%
And when I try to use Bison with this source code I'm getting this error:
calc.y:1.1-5: syntax error, unexpected identifier:
You need a '%%' before the rules as well as after them (or, strictly, instead; if there is no code after the second '%%', you can omit that line).
You will also need a '%token NUM' before the first '%%'; the grammar then passes Bison.
Another alternative solution exists, which is to upgrade to bison version 3.0.4. I guess between version 2.x and 3.x, they changed the file syntax.