I'm writing a flex/yacc program that should read some tokens and easy grammar using cygwin.
I'm guessing something is wrong with my BNF grammar, but I can't seem to locate the problem. Below is some code
%start statement_list
%%
statement_list: statement
|statement_list statement
;
statement: condition|simple|loop|call_func|decl_array|decl_constant|decl_var;
call_func: IDENTIFIER'('ID_LIST')' {printf("callfunc\n");} ;
ID_LIST: IDENTIFIER
|ID_LIST','IDENTIFIER
;
simple: IDENTIFIER'['NUMBER']' ASSIGN expr
|PRINT STRING
|PRINTLN STRING
|RETURN expr
;
bool_expr: expr E_EQUAL expr
|expr NOT_EQUAL expr
|expr SMALLER expr
|expr E_SMALLER expr
|expr E_LARGER expr
|expr LARGER expr
|expr E_EQUAL bool
|expr NOT_EQUAL bool
;
expr: expr ADD expr {$$ = $1+$3;}
| expr MULT expr {$$ = $1-$3;}
| expr MINUS expr {$$ = $1*$3;}
| expr DIVIDE expr {if($3 == 0) yyerror("divide by zero");
else $$ = $1 / $3;}
|expr ASSIGN expr
| NUMBER
| IDENTIFIER
;
bool: TRUE
|FALSE
;
decl_constant: LET IDENTIFIER ASSIGN expr
|LET IDENTIFIER COLON "bool" ASSIGN bool
|LET IDENTIFIER COLON "Int" ASSIGN NUMBER
|LET IDENTIFIER COLON "String" ASSIGN STRING
;
decl_var: VAR IDENTIFIER
|VAR IDENTIFIER ASSIGN NUMBER
|VAR IDENTIFIER ASSIGN STRING
|VAR IDENTIFIER ASSIGN bool
|VAR IDENTIFIER COLON "Bool" ASSIGN bool
|VAR IDENTIFIER COLON "Int" ASSIGN NUMBER
|VAR IDENTIFIER COLON "String" ASSIGN STRING
;
decl_array: VAR IDENTIFIER COLON "Int" '[' NUMBER ']'
|VAR IDENTIFIER COLON "Bool" '[' NUMBER ']'
|VAR IDENTIFIER COLON "String" '[' NUMBER ']'
;
condition: IF '(' bool_expr ')' statement ELSE statement;
loop: WHILE '(' bool_expr ')' statement;
I've tried changing statement into
statement:';';
,reading a simple token to test if it works, but it seems like my code refuses to enter that part of the grammar.
Also when I compile it, it tells me there are 18 shift/reduce conflicts. Should I try to locate and solve all of them ?
EDIT: I have edited my code using Chris Dodd's answer, trying to solve each conflict by looking at the output file. The last few conflicts seem to be located in the below code.
expr: expr ADD expr {$$ = $1+$3;}
| expr MULT expr {$$ = $1-$3;}
| expr MINUS expr {$$ = $1*$3;}
| expr DIVIDE expr {if($3 == 0) yyerror("divide by zero");
else $$ = $1 / $3;}
|expr ASSIGN expr
| NUMBER
| IDENTIFIER
;
And here is part of the output file telling me what's wrong.
state 60
28 expr: expr . ADD expr
29 | expr . MULT expr
30 | expr . MINUS expr
31 | expr . DIVIDE expr
32 | expr . ASSIGN expr
32 | expr ASSIGN expr .
ASSIGN shift, and go to state 36
ADD shift, and go to state 37
MINUS shift, and go to state 38
MULT shift, and go to state 39
DIVIDE shift, and go to state 40
ASSIGN [reduce using rule 32 (expr)]
ADD [reduce using rule 32 (expr)]
MINUS [reduce using rule 32 (expr)]
MULT [reduce using rule 32 (expr)]
DIVIDE [reduce using rule 32 (expr)]
$default reduce using rule 32 (expr)
I don't understand, why would it choose rule 32 when it read ADD, MULT, DIVIDE or other tokens ? What's wrong with this part of my grammar?
Also, even though that above part of the grammar is wrong, shouldn't my compiler be able to read other grammar correctly ? For instance,
let a = 5
should be readable, yet the program returns syntax error ?
Your grammar looks reasonable, though it does have ambiguities in expressions, most of which could be solved by precedence. You should definitely look at ALL the conflicts reported and understand why they occur, and preferrably change the grammar to get rid of them.
As for your specific issue, if you change it to have statement: ';' ;, it should accept that. You don't show any of your lexing code, so the problem may be there. It may be helpful to compile you parser with -DYYDEBUG=1 to enable the debugging code generated by yacc/bison, and set the global variable yydebug to 1 before calling yyparse, which will dump a trace of everything the parser is doing to stderr.
Related
I have the following ANTLR4 Grammar
grammar ExpressionGrammar;
parse: (expr)
;
expr: MIN expr
| expr ( MUL | DIV ) expr
| expr ( ADD | MIN ) expr
| NUM
| function
| '(' expr ')'
;
function : ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL : '*';
DIV : '/';
MIN : '-';
ADD : '+';
OPEN_PAR : '(' ;
CLOSE_PAR : ')' ;
NUM : '0' | [1-9][0-9]*;
ID : [a-zA-Z_] [a-zA-Z]*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
I have an input expression like this :-
(Fields.V1)*(Fields.V2) + (Constants.Value1)*(Constants.Value2)
The ANTLR parser generated the following text from the grammar above :-
(FieldsV1)*(FieldsV2)+(Constants<missing ')'>
As you can see, the "dots" in Fields.V1 and Fields.V2 are missing from the text and also there is a <missing ')' Error node. I believe I should somehow make ANTLR understand that an expression can also have fields with dot operators.
A question on top of this :-
(Var1)(Var2)
ANTLR is not throwing me error for this above scenario , the expressions should not be (Var1)(Var2) -- It should always have the operator (var1)*(var2) or (var1)+(var2) etc. The parser error tree is not generating this error. How should the grammar be modified to make sure even this scenario is taken into consideration.
To recognize IDs like Fields.V1, change you Lexer rule for ID to something like this:
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
Notice, since each "node" of the ID follows the same rule, I made it a lexer fragment that I could use to compose the ID rule. I also added 0-9 to the second part of the fragment, since it appears that you want to allow numbers in IDs
Then the ID rule uses the fragment to build out the Lexer rule that allows for dots in the ID.
You also didn't add ID as a valid expr alternative
To handle detection of the error condition in (Var1)(Var2), you need Mike's advice to add the EOF Lexer rule to the end of the parse parser rule. Without the EOF, ANTLR will stop parsing as soon as it reaches the end of a recognized expr ((Var1)). The EOF says "and then you need to find an EOF", so ANTLR will continue parsing into the (Var2) and give you the error.
A revised version that handles both of your examples:
grammar ExpressionGrammar;
parse: expr EOF;
expr:
MIN expr
| expr ( MUL | DIV) expr
| expr ( ADD | MIN) expr
| NUM
| ID
| function
| '(' expr ')';
function: ID '(' arguments? ')';
arguments: expr ( ',' expr)*;
/* Tokens */
MUL: '*';
DIV: '/';
MIN: '-';
ADD: '+';
OPEN_PAR: '(';
CLOSE_PAR: ')';
NUM: '0' | [1-9][0-9]*;
fragment ID_NODE: [a-zA-Z_][a-zA-Z0-9]*;
ID: ID_NODE ('.' ID_NODE)*;
COMMENT: '//' ~[\r\n]* -> skip;
WS: [ \t\n]+ -> skip;
(Now that I've read through the comments, this is pretty much just applying the suggestions in the comments)
I am trying to write a grammar that allows for
Signed integers (i.e. integers with or without a sign; 3, -2, +5)
Unary minus (-)
Binary addition and subtraction (+, -)
Here is the relevant grammar:
expr: INTLITER
| unaryOp expr
| expr binaryOp expr
| OPEN_PAREN expr CLOSE_PAREN
;
unaryOp: MINUS ; // Other operators ommitted for clarity
binaryOp: PLUS | MINUS ;
INTLITER: INTSIGN? DIGIT+ ;
fragment INTSIGN : PLUS | MINUS
WS: [ \r\n\t] -> skip ; // Ignore whitespace
I'm finding a strange issue concerning whitespace.
Consider the expression (2+ 1); this gives a correct parse tree, as expected, like so:
However, (2+1) gives this parse tree:
Since the WS rule means that whitespace is ignored, how is the whitespace here affecting the parse tree?
How might I fix this problem?
The problem with the grammar is that you are trying to represent signed numbers as a token in the lexer. Define the INTLITER without "INTSIGN?". The grammar now works.
grammar arithmetic;
expr: INTLITER
| unaryOp expr
| expr binaryOp expr
| OPEN_PAREN expr CLOSE_PAREN
;
unaryOp: MINUS ; // Other operators ommitted for clarity
binaryOp: PLUS | MINUS ;
INTLITER: DIGIT+ ;
WS: [ \r\n\t] -> skip ; // Ignore whitespace
fragment DIGIT
: ('0' .. '9')+
;
OPEN_PAREN
: '('
;
CLOSE_PAREN
: ')'
;
PLUS
: '+'
;
MINUS
: '-'
;
I am trying to create a simple for now only integer-arithmetic expression parser. For now i have:
grammar MyExpr;
input: (expr NEWLINE)+;
expr: '(' expr ')'
| '-' expr
| <assoc = right> expr '^' expr
| expr ('*' | '/') expr
| expr ('+' | '-') expr
| ID '(' ExpressionList? ')'
| INT;
ExpressionList : expr (',' expr)*;
ID : [a-zA-Z]+;
INT : DIGIT+;
DIGIT: [0-9];
NEWLINE : '\r'?'\n';
WS : [\t]+ -> skip;
The rule ExpressionList seems to cause some problems. If i remove everything containing ExpressionList, everything compiles and seems to work out fine. But like above, i get errors like:
error(160): MyExpr.g4:14:17: reference to parser rule expr in lexer rule ExpressionList
error(126): MyExpr.g4:7:6: cannot create implicit token for string literal in non-combined grammar: '-'
I am using Eclipse and the Antlr4 Plugin. I try to orient myself on the cymbol grammar given in the antlr4-book.
Can someone tell me whats going wrong in my little grammar?
Found it out by myself:
Rules starting with capital letter refer to Lexer-rules. SO all I had to do was renaming my ExpressionList to expressionList.
Maybe someone else will find this useful some day ;)
I'm having trouble figuring this one out as well as the shift reduce problem.
Adding ';' to the end doesn't solve the problem since I can't change the language, it needs to go just as the following example. Does any prec operand work?
The example is the following:
A variable can be declared as: as a pointer or int as integer, so, both of this are valid:
<int> a = 0
int a = 1
the code goes:
%left '<'
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It obviously gives a shift/reduce problem afer expr. since it can shift for expr of "less" operator or reduce for another variable declaration.
I want precedence to be given on variable declaration, and have tried to create a %nonassoc prec_aux and put after '<' type '>' %prec prec_aux and after type tNAME but it doesn't solve my problem :S
How can I solve this?
Output was:
Well cant figure hwo to post linebreaks and code on reply... so here it goes the output:
35: shift/reduce conflict (shift 47, reduce 7) on '<'
state 35
variable : type tNAME '=' expr . (7)
expr : expr . '+' expr (26)
expr : expr . '-' expr (27)
expr : expr . '*' expr (28)
expr : expr . '/' expr (29)
expr : expr . '%' expr (30)
expr : expr . '<' expr (31)
expr : expr . '>' expr (32)
'>' shift 46
'<' shift 47
'+' shift 48
'-' shift 49
'*' shift 50
'/' shift 51
'%' shift 52
$end reduce 7
tINT reduce 7
Thats the output and the error seems the one I mentioned.
Does anyone know a different solution, other than adding a new terminal to the language that isn't really an option?
I think the resolution is to rewrite the grammar so it can lookahead somehow and see if its a type or expr after the '<' but I'm not seeing how to.
Precedence is unlikely to work since its the same character. Is there a way to give precendence for types that we define? such as declaration?
Thanks in advance
Your grammar gets confused in text like this:
int a = b
<int> c
That '<' on the second line could be part of an expression in the first declaration. It would have to look ahead further to find out.
This is the reason most languages have a statement terminator. This produces no conflicts:
%%
%token tNAME;
%token tINT;
%token tINTEGER;
%token tTERM;
%left '<';
declaration: variable
| declaration variable
variable : type tNAME '=' expr tTERM
| type tNAME tTERM
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It helps when creating a parser to know how to design a grammar to eliminate possible conflicts. For that you would need an understanding of how parsers work, which is outside the scope of this answer :)
The basic problem here is that you need more lookahead than the 1 token you get with yacc/bison. When the parser sees a < it has no way of telling whether its done with the preivous declaration and its looking at the beginning of a bracketed type, or if this is a less-than operator. There's two basic things you can do here:
Use a parsing method such as bison's %glr-parser option or btyacc, which can deal with non-LR(1) grammars
Use the lexer to do extra lookahead and return disambiguating tokens
For the latter, you would have the lexer do extra lookahead after a '<' and return a different token if its followed by something that looks like a type. The easiest is to use flex's / lookahead operator. For example:
"<"/[ \t\n\r]*"<" return OPEN_ANGLE;
"<"/[ \t\n\r]*"int" return OPEN_ANGLE;
"<" return '<';
Then you change your bison rules to expect OPEN_ANGLE in types instead of <:
type : OPEN_ANGLE type '>'
| tINT
expr : tINTEGER
| expr '<' expr
For more complex problems, you can use flex start states, or even insert an entire token filter/transform pass between the lexer and the parser.
Here is the fix, but not entirely satisfactory:
%{
%}
%token tNAME tINT tINTEGER
%left '<'
%left '+'
%nonassoc '=' /* <-- LOOK */
%%
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
| expr '+' expr
;
This issue is a conflict between these two LR items: the dot-final:
variable : type tNAME '=' expr_no_less .
and this one:
expr : expr . '<' expr
Notice that these two have different operators. It is not, as you seem to think, a conflict between different productions involving the '<' operator.
By adding = to the precedence ranking, we fix the issue in the sense that the conflict diagnostic goes away.
Note that I gave = a high precedence. This will resolve the conflict by favoring the reduce. This means that you cannot use a '<' expression as an initializer:
int name = 4 < 3 // syntax error
When the < is seen, the int name = 4 wants to be reduced, and the idea is that < must be the start of the next declaration, as part of a type production.
To allow < relational expressions to be used as initializers, add the support for parentheses into the expression grammar. Then users can parenthesize:
int foo = (4 < 3) <int> bar = (2 < 1)
There is no way to fix that without a more powerful parsing method or hacks.
What if you move the %nonassoc before %left '<', giving it low precedence? Then the shift will be favored. Unfortunately, that has the consequence that you cannot write another <int> declaration after a declaration.
int foo = 3 <int> bar = 4
^ // error: the machine shifted and is now doing: expr '<' . expr.
So that is the wrong way to resolve the conflict; you want to be able to write multiple such declarations.
Another Note:
My TXR language, which implements something equivalent to Parse Expression Grammars handles this grammar fine. This is essentially LL(infinite), which trumps LALR(1).
We don't even have to have a separate lexical analyzer and parser! That's just something made necessary by the limitations of one-symbol-lookahead, and the need for utmost efficiency on 1970's hardware.
Example output from shell command line, demonstrating the parse by translation to a Lisp-like abstract syntax tree, which is bound to the variable dl (declaration list). So this is complete with semantic actions, yielding an output that can be further processed in TXR Lisp. Identifiers are translated to Lisp symbols via calls to intern and numbers are translated to number objects also.
$ txr -l type.txr -
int x = 3 < 4 int y
(dl (decl x int (< 3 4)) (decl y int nil))
$ txr -l type.txr -
< int > x = 3 < 4 < int > y
(dl (decl x (pointer int) (< 3 4)) (decl y (pointer int) nil))
$ txr -l type.txr -
int x = 3 + 4 < 9 < int > y < int > z = 4 + 3 int w
(dl (decl x int (+ 3 (< 4 9))) (decl y (pointer int) nil)
(decl z (pointer int) (+ 4 3)) (decl w int nil))
$ txr -l type.txr -
<<<int>>>x=42
(dl (decl x (pointer (pointer (pointer int))) 42))
The source code of (type.txr):
#(define ws)#/[ \t]*/#(end)
#(define int)#(ws)int#(ws)#(end)
#(define num (n))#(ws)#{n /[0-9]+/}#(ws)#(filter :tonumber n)#(end)
#(define id (id))#\
#(ws)#{id /[A-Za-z_][A-Za-z_0-9]*/}#(ws)#\
#(set id #(intern id))#\
#(end)
#(define type (ty))#\
#(local l)#\
#(cases)#\
#(int)#\
#(bind ty #(progn 'int))#\
#(or)#\
<#(type l)>#\
#(bind ty #(progn '(pointer ,l)))#\
#(end)#\
#(end)
#(define expr (e))#\
#(local e1 op e2)#\
#(cases)#\
#(additive e1)#{op /[<>]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(additive e)#\
#(end)#\
#(end)
#(define additive (e))#\
#(local e1 op e2)#\
#(cases)#\
#(num e1)#{op /[+\-]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(num e)#\
#(end)#\
#(end)
#(define decl (d))#\
#(local type id expr)#\
#(type type)#(id id)#\
#(maybe)=#(expr expr)#(or)#(bind expr nil)#(end)#\
#(bind d #(progn '(decl ,id ,type ,expr)))#\
#(end)
#(define decls (dl))#\
#(coll :gap 0)#(decl dl)#(end)#\
#(end)
#(freeform)
#(decls dl)
Originally in the example there was this
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
I wanted it to be 'more simple' so i wrote this (i realize it would do '+' for both add and subtract. But this is an example)
expr:
INTEGER
| expr addOp expr { $$ = $1 + $3; }
;
addOp:
'+' { $$ = $1; }
| '-' { $$ = $1; }
;
Now i get a shift/reduce error. It should be exactly the same -_- (to me). What do i need to do to fix this?
edit: To make things clear. The first has NO warning/error. I use %left to set the precedence (and i will use %right for = and those other right ops). However it seems to not apply when going into sub expressions.
Are you sure the conflicts involve just those two rules? The first one should have more conflicts than the second. At least with one symbol of look-ahead the decision to shift to a state with addOp on the stack is easier the second time around.
Update (I believe I can prove my theory... :-):
$ cat q2.y
%% expr: '1' | expr '+' expr | expr '-' expr;
$ cat q3.y
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
$ yacc q2.y
conflicts: 4 shift/reduce
$ yacc q3.y
conflicts: 2 shift/reduce
Having said all that, it's normal for yacc grammars to have ambiguities, and any real-life system is likely to have not just a few but literally dozens of shift/reduce conflicts. By definition, this conflict occurs when there is a perfectly valid shift available, so if you don't mind the parser taking that shift, then don't worry about it.
Now, in yacc you should prefer left-recursive rules. You can achieve that and get rid of your grammar ambiguity with:
$ cat q4.y
%% expr: expr addOp '1' | '1';
addOp: '+' | '-';
$ yacc q4.y
$
Note: no conflicts in the example above. If you like your grammar the way it is, just do:
%expect 2
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
The problem is that the rule
expr: expr addOp expr { ..action.. }
has no precedence. Normally rules get the precedence of the first token on the RHS, but this rule has no tokens on its RHS. You need to add a %prec directive to it:
expr: expr addOp expr %prec '+' { ..action.. }
to explicitly give the rule a precedence.
Note that doing this doesn't get rid of the shift/reduce conflict, which was present in your original grammar. It just resolves it according to the precedence rules you specify, which means that bison won't give you a message about it. In general, using precedence to resolve conflicts can be tricky, since it can hide conflicts that you might have wanted to resolve differently, or might be unresolvable in your grammar as written.
Also see my answer to this question