Yacc/Lexer: how does this work, really? - yacc

I have tried to sit here and read tutorials on how YACC works with the lex file, but I'm not sure I can wrap my head around it. I understand it is for reading the actual input file and determining whether constructs such as add or subtract are in the right format, but how does this work? I understand how Lex files work and have constructed one that returns defined tokens based on what it comes across in the file.
If I have a very simple program, how would I go about analyzing the first lines of this test programming language? Let's assume "program" is a defined token in lex, as well as "is", "var", "begin", "+", "print", ";", ",", and "end".
How would the yacc file need to be written in order for the first few lines to be read?
Test file:
program xyz is
var a, b, c
begin
a = 2;
b = 3;
c = a + b;
print c
end
Yaccer.y
%token EOFNUM 0
%token SEMINUM 1
%token LPARENNUM 2
%token RPARENNUM 3
%token ICONSTNUM 4
%token BEGINNUM 5
%token PROGRAMNUM 6
%token MINUSNUM 7
%token TIMESNUM 8
%token VARNUM 9
%token COMMANUM 10
%token IDNUM 11
%token ENDNUM 12
%token ISNUM 13
%token PLUSNUM 14
%token DIVNUM 15
%token PRINTNUM 16
%token EQUALNUM 17
%left '+' '-'
%left '*' '/'
%%
%%
#include "lex.yy.c"
#include <stdio.h>
int yyerror(char *str)
{ printf("yyerror: %s at line %d\n", str, yyline); return 0; }
int main() {
if (!yyparse()) { printf("accept\n"); }
else { printf("reject\n"); }
return 0;
}
I know there need to be some %type declarations for the expression rules, but I am not sure how to use the types along with the %left declarations, or even whether this is correct. I am also confused about how to analyze the first line.

You would need to create both a lex file and a yacc file. In the lex file you define the individual tokens for every kind of token that your grammar contains. Then you construct the grammar in BNF form suitable for yacc. For your simple input it looks a bit like this. I assume it's a typo that the print statement doesn't have a ; after it; this grammar requires the semicolon.
%token IDENTIFIER
%token PROGRAM
%token BEGIN
%token END
%token IS
%token VAR
%token PRINT
%token NUMBER
/* non-terminals: program, statementlist, statement,
   printstatement, assignstatement, expression, value */
%%
program : PROGRAM IDENTIFIER IS VAR variables BEGIN statementlist END;
variables : IDENTIFIER | variables ',' IDENTIFIER;
statementlist : statement | statementlist statement;
statement : assignstatement | printstatement;
printstatement : PRINT IDENTIFIER ';';
assignstatement : IDENTIFIER '=' expression ';';
expression : value | expression '+' value;
value : NUMBER | IDENTIFIER;
%%
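For completeness, the matching lex side might look roughly like this. This is only a sketch: the exact patterns are guesses, yylval handling is omitted, and the BEGIN token is shown renamed to T_BEGIN because BEGIN collides with a macro that flex itself defines.
%%
"program"               return PROGRAM;
"is"                    return IS;
"var"                   return VAR;
"begin"                 return T_BEGIN;   /* BEGIN clashes with flex's own BEGIN macro */
"end"                   return END;
"print"                 return PRINT;
[0-9]+                  return NUMBER;
[a-zA-Z_][a-zA-Z0-9_]*  return IDENTIFIER;
[ \t\n]+                ;                 /* skip whitespace */
.                       return yytext[0]; /* '=', '+', ',' and ';' come through as themselves */
%%
With the #include "lex.yy.c" arrangement from the question, the token macros are already visible to these actions; otherwise you would include the generated y.tab.h in the lex file.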
The %left and %right directives declare operator associativity (and, through their ordering, precedence). They aren't really needed in your case, because the expression rule above is already written unambiguously. They become useful if you write the usual ambiguous expression rules, or add more operators and want yacc to sort out precedence for you; even then they aren't strictly required, but they make the grammar easier to write and read.
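For illustration only, here is the kind of rule where those declarations do matter (an ambiguous expression grammar, not part of the grammar above):
%token NUMBER
%left '+' '-'
%left '*' '/'
%%
expression : expression '+' expression
           | expression '-' expression
           | expression '*' expression
           | expression '/' expression
           | NUMBER
           ;
Without the two %left declarations this produces shift/reduce conflicts; with them, 1 - 2 - 3 groups as (1 - 2) - 3 and '*' binds more tightly than '+'.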
Yacc isn't easy to understand, and I recommend a tutorial on the subject. Rules very quickly become recursive in nature and you have to have a certain mindset when working with it. It's easier to build your grammar incrementally, starting with the highest order statement. Make a parser that just recognizes program begin end and then work in the features as you go.
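A first cut along those lines, reusing the tokens above, could be as small as:
%%
program : PROGRAM IDENTIFIER IS BEGIN END ;
%%
Once that accepts program xyz is begin end, add the variable list, then the statement list, and so on.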
A good tool for testing and verifying grammars is the JavaScript parser generator found here.

Related

Is there a way to eliminate these reduce/reduce conflicts?

I am working on a toy language for fun (called NOP) using Bison and Flex, and I have hit a wall. I am trying to parse sequences that look like name1.name2 and name1.func1().name2, and I get a lot of reduce/reduce conflicts. I know -why- I am getting them, but I am having a heck of a time figuring out what to do about it.
So my question is whether this is a legitimate irregularity that can't be "fixed", or if my grammar is just wrong. The productions in question are compound_name and compound_symbol. It seems to me that they should parse separately. If I try to combine them I get conflicts with that as well. In the grammar, I am trying to illustrate what I want to do, rather than anything "clever".
%debug
%defines
%locations
%{
%}
%define parse.error verbose
%locations
%token FPCONST INTCONST UINTCONST STRCONST BOOLCONST
%token SYMBOL
%token AND OR NOT EQ NEQ LTE GTE LT GT
%token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%token DICT LIST BOOL STRING FLOAT INT UINT NOTHING
%right ADD_ASSIGN SUB_ASSIGN
%right MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%left AND OR
%left EQ NEQ
%left LT GT LTE GTE
%right ':'
%left '+' '-'
%left '*' '/' '%'
%left NEG
%right NOT
%%
program
: {} all_module {}
;
all_module
: module_list
;
module_list
: module_element {}
| module_list module_element {}
;
module_element
: compound_symbol {}
| expression {}
;
compound_name
: SYMBOL {}
| compound_name '.' SYMBOL {}
;
compound_symbol_element
: compound_name {}
| func_call {}
;
compound_symbol
: compound_symbol_element {}
| compound_symbol '.' compound_symbol_element {}
;
func_call
: compound_name '(' expression_list ')' {}
;
formatted_string
: STRCONST {}
| STRCONST '(' expression_list ')' {}
;
type_specifier
: STRING {}
| FLOAT {}
| INT {}
| UINT {}
| BOOL {}
| NOTHING {}
;
constant
: FPCONST {}
| INTCONST {}
| UINTCONST {}
| BOOLCONST {}
| NOTHING {}
;
expression_factor
: constant { }
| compound_symbol { }
| formatted_string {}
;
expression
: expression_factor {}
| expression '+' expression {}
| expression '-' expression {}
| expression '*' expression {}
| expression '/' expression {}
| expression '%' expression {}
| expression EQ expression {}
| expression NEQ expression {}
| expression LT expression {}
| expression GT expression {}
| expression LTE expression {}
| expression GTE expression {}
| expression AND expression {}
| expression OR expression {}
| '-' expression %prec NEG {}
| NOT expression { }
| type_specifier ':' SYMBOL {} // type cast
| '(' expression ')' {}
;
expression_list
: expression {}
| expression_list ',' expression {}
;
%%
This is a very stripped down parser. The "real" one is about 600 lines. It has no conflicts (and passes a bunch of tests) if I don't try to use a function call in a variable name. I am looking at rewriting it as a packrat grammar if I cannot get Bison to do what I want. The rest of the project is here: https://github.com/chucktilbury/nop
$ bison -tvdo temp.c temp.y
temp.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
temp.y: warning: 16 reduce/reduce conflicts [-Wconflicts-rr]
All of the reduce/reduce conflicts are the result of:
module_element
: expression
| compound_symbol
That creates an ambiguity because you also have
expression
: expression_factor
expression_factor
: compound_symbol
So the parser can't tell whether or not you need the unit productions to be reduced. Eliminating module_element: compound_symbol doesn't change the set of sentences which can be produced; it just requires that a compound_symbol be reduced through expression before becoming a module_element.
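That is, the rule would shrink to just:
module_element
    : expression {}
    ;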
As Chris Dodd points out in a comment, the fact that two module_elements can appear consecutively without a delimiter creates an additional ambiguity: the grammar allows a - b to be parsed either as a single expression (and consequently module_element) or as two consecutive expressions —a and -b— and thus two consecutive module_elements. That ambiguity accounts for three of the four shift/reduce conflicts.
Both of these are probably errors introduced when you simplified the grammar, since it appears that module elements in the full grammar are definitions, not expressions. Removing modules altogether and using expression as the starting symbol leaves only a single conflict.
That conflict is indeed the result of an ambiguity between compound_symbol and compound_name, as noted in your question. The problem is seen in these productions (non-terminals shortened to make typing easier):
name: SYMBOL
| name '.' SYMBOL
symbol
: element
| symbol '.' element
element
: name
That means that both a and a.b are names and hence
elements. But a symbol is a .-separated list of elements, so a.b could be derived in two ways:
symbol → element
       → name
       → a.b
and
symbol → symbol . element
       → element . element
       → name . element
       → a . element
       → a . name
       → a . b
I fixed this by simplifying the grammar to:
compound_symbol
: compound_name
| compound_name '(' expression_list ')'
compound_name
: SYMBOL
| compound_symbol '.' SYMBOL
That gets rid of func_call and compound_symbol_element, which as far as I can see serve no purpose. I don't know if the non-terminal names remaining really capture anything sensible; I think it would make more sense to call compound_symbol something like name_or_call.
This grammar could be simplified further if higher-order functions were possible; the existing grammar forbids hof()(), presumably because you don't contemplate allowing a function to return a function object.
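A sketch of that further simplification, presumably along these lines (it additionally permits chained calls such as hof()() and keeps call results usable to the left of '.'):
compound_symbol
    : compound_name {}
    | compound_symbol '(' expression_list ')' {}
    ;
compound_name
    : SYMBOL {}
    | compound_symbol '.' SYMBOL {}
    ;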
But even with higher-order functions, you might want to differentiate between function calls and member access/array subscript, because in many languages a function cannot return an object reference and hence a function call cannot appear on the left-hand side of an assignment operator. In other languages, such as C, the requirement that the left-hand side of an assignment operator be a reference ("lvalue") is enforced outside of the grammar. (And in C++, a function call or even an overloaded operator can return a reference, so the restriction needs to be enforced after type analysis.)

How does one remove shift/reduce errors caused by limited lookahead?

In my grammar, it's usually possible to only use declaration which looks like:
int x, y, z = 23;
int i = 1;
int j;
In for I'd like to use a set of comma separated declarations of different types e.g.
for (int i = 0, double d = 2.0; i < 0; i++) { ... }
Using yacc, the limited lookahead creates problems. Here is the naive grammar:
variables_decl
: type_expression IDENT
| variables_decl ',' IDENT
;
declaration
: variables_decl
| variables_decl '=' initializer
;
declaration_list
: declaration
| declaration_list ',' declaration
;
This causes a shift/reduce error on ',':
state 149
100 variables_decl: variables_decl . ',' IDENT
101 declaration: variables_decl .
102 | variables_decl . '=' initializer
',' shift, and go to state 261
'=' shift, and go to state 262
',' [reduce using rule 101 (declaration)]
$default reduce using rule 101 (declaration)
I'd like fix this issue so that this actually works:
for (double x, y, int i, j = 0, long l = 1; i < 0; i++) { ... }
But it's not obvious to me how to do that.
In general terms, you avoid this type of shift/reduce conflict by avoiding forcing the parser to make a decision until absolutely necessary.
It's understandable why you have structured the grammar as you have; intuitively, a declaration list is a list of declarations, where each declaration is a type and a list of variables of that type. The problem is that this definition makes it impossible to know whether a comma belongs to an inner or outer list.
Moreover, one extra lookahead token might not be enough, since the following IDENT could be a typename or the name of a variable to be declared, assuming type-expression is the usual C syntax which can start with an identifier corresponding to a typename.
But that's not the only way to look at the declaration list syntax. Instead, you can think of it as a list of individual declarations, each of which starts with an optional type (except the first in the list, which must have an explicit type), using the semantic convention that an omitted type is the same as the type of the previous variable. That leads to the following conflict-free grammar:
declaration_list: explicit_decl
| declaration_list ',' declaration
declaration : explicit_decl
| implicit_decl
explicit_decl : type_expression implicit_decl
implicit_decl : IDENT opt_init
opt_init : %empty | '=' expr
That does not capture the syntax of C declarations, since not all C declarations have the form type_expression IDENT. The IDENT being defined can be buried inside the declaration, as with, for example, int a[4] or int f(int i); fortunately, these forms are of limited use in a for loop.
On the other hand, unlike your grammar it does allow all declared variables to be initialised, so
int a = 1, b = 0, double x = -1.0, y = 0.0
should work.
Another note: the first item in a C for clause can be empty, a declaration (possibly in the form of a list) or an expression. In the last case, a top-level , is an operator, not a list indicator.
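For example, both of these are valid C, and the top-level commas mean different things:
for (int i = 0, j = 10; i < j; i++) { }   /* one declaration with two declarators */
for (i = 0, j = 10; i < j; i++) { }       /* an expression; here ',' is the comma operator */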
In short, the above fragment might or might not be a solution in the context of your actual grammar. But it is conflict-free in a simple test framework where typed declarations are always of the form typename identifier.
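For reference, a self-contained version of that test framework (the token names are placeholders, and the lexer is assumed to be able to distinguish TYPENAME from IDENT) could be:
%token IDENT TYPENAME INTCONST
%%
declaration_list: explicit_decl
                | declaration_list ',' declaration
                ;
declaration     : explicit_decl
                | implicit_decl
                ;
explicit_decl   : type_expression implicit_decl
                ;
implicit_decl   : IDENT opt_init
                ;
opt_init        : %empty        /* %empty is bison syntax; plain yacc uses an empty alternative */
                | '=' expr
                ;
type_expression : TYPENAME
                ;
expr            : INTCONST
                ;
This should go through bison without conflicts; the assumption that the lexer can tell a typename from an ordinary identifier is doing real work here, as noted above.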

Reduce/reduce conflict in clike grammar in jison

I'm working on the clike language compiler using Jison package. I went really well until I've introduced classes, thus Type can be a LITERAL now. Here is a simplified grammar:
%lex
%%
\s+ /* skip whitespace */
int return 'INTEGER'
string return 'STRING'
boolean return 'BOOLEAN'
void return 'VOID'
[0-9]+ return 'NUMBER'
[a-zA-Z_][0-9a-zA-Z_]* return 'LITERAL'
"--" return 'DECR'
<<EOF>> return 'EOF'
"=" return '='
";" return ';'
/lex
%%
Program
: EOF
| Stmt EOF
;
Stmt
: Type Ident ';'
| Ident '=' NUMBER ';'
;
Type
: INTEGER
| STRING
| BOOLEAN
| LITERAL
| VOID
;
Ident
: LITERAL
;
And the jison conflict:
Conflict in grammar: multiple actions possible when lookahead token is LITERAL in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
Conflict in grammar: multiple actions possible when lookahead token is = in state 10
- reduce by rule: Ident -> LITERAL
- reduce by rule: Type -> LITERAL
States with conflicts:
State 10
Type -> LITERAL . #lookaheads= LITERAL =
Ident -> LITERAL . #lookaheads= LITERAL =
I've found quite a similar question that has not been answered; does anyone have any clue how to solve this?
That's evidently a bug in jison, since the grammar is certainly LALR(1), and is handled without problems by bison. Apparently, jison is incorrectly computing the lookahead for the state in which the conflict occurs. (Update: It seems to be bug 205, reported in January 2014.)
If you ask jison to produce an LR(1) parser instead of an LALR(1) parser, then it correctly computes the lookaheads and the grammar passes without warnings. However, I don't think that is a sustainable solution.
Here's another workaround. The Decl and Assign non-terminals below aren't essential; the actual "fix" is to remove LITERAL from Type and add a separate Decl production that begins with LITERAL.
Program
: EOF
| Stmt EOF
;
Decl
: Type Ident ';'
| LITERAL Ident ';'
;
Assign
: Ident '=' NUMBER ';'
;
Stmt
: Decl
| Assign
;
Type
: INTEGER
| STRING
| BOOLEAN
| VOID
;
Ident
: LITERAL
;
You might want to consider recognizing more than one statement:
Program
: EOF
| Stmts EOF
;
Stmts
: Stmt
| Stmts Stmt
;
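With those productions, input along these lines should be accepted (given the lexer rules from the question):
int x;
y = 42;
MyClass z;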

byacc shift/reduce

I'm having trouble figuring this one out, as well as the shift/reduce problem.
Adding ';' to the end doesn't solve it, since I can't change the language; it has to be written exactly as in the following example. Does any %prec operand work?
The example is the following:
A variable can be declared as <int> for a pointer or as int for an integer, so both of these are valid:
<int> a = 0
int a = 1
the code goes:
%left '<'
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It obviously gives a shift/reduce conflict after expr, since the parser can either shift '<' as the less-than operator or reduce because the '<' starts the type of another variable declaration.
I want precedence to be given to the variable declaration, and I have tried creating a %nonassoc prec_aux and putting %prec prec_aux after '<' type '>' and after type tNAME, but it doesn't solve my problem :S
How can I solve this?
Output was:
35: shift/reduce conflict (shift 47, reduce 7) on '<'
state 35
variable : type tNAME '=' expr . (7)
expr : expr . '+' expr (26)
expr : expr . '-' expr (27)
expr : expr . '*' expr (28)
expr : expr . '/' expr (29)
expr : expr . '%' expr (30)
expr : expr . '<' expr (31)
expr : expr . '>' expr (32)
'>' shift 46
'<' shift 47
'+' shift 48
'-' shift 49
'*' shift 50
'/' shift 51
'%' shift 52
$end reduce 7
tINT reduce 7
That's the output, and the error seems to be the one I mentioned.
Does anyone know a different solution, other than adding a new terminal to the language, which isn't really an option?
I think the resolution is to rewrite the grammar so it can somehow look ahead and see whether a type or an expr follows the '<', but I'm not seeing how to do that.
Precedence is unlikely to work since it's the same character. Is there a way to give precedence to non-terminals that we define, such as declaration?
Thanks in advance
Your grammar gets confused in text like this:
int a = b
<int> c
That '<' on the second line could be part of an expression in the first declaration. It would have to look ahead further to find out.
This is the reason most languages have a statement terminator. This produces no conflicts:
%%
%token tNAME;
%token tINT;
%token tINTEGER;
%token tTERM;
%left '<';
declaration: variable
| declaration variable
variable : type tNAME '=' expr tTERM
| type tNAME tTERM
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It helps when creating a parser to know how to design a grammar to eliminate possible conflicts. For that you would need an understanding of how parsers work, which is outside the scope of this answer :)
The basic problem here is that you need more lookahead than the one token you get with yacc/bison. When the parser sees a '<', it has no way of telling whether it's done with the previous declaration and is looking at the beginning of a bracketed type, or whether this is a less-than operator. There are two basic things you can do here:
Use a parsing method such as bison's %glr-parser option or btyacc, which can deal with non-LR(1) grammars
Use the lexer to do extra lookahead and return disambiguating tokens
For the latter, you would have the lexer do extra lookahead after a '<' and return a different token if it's followed by something that looks like a type. The easiest way is to use flex's / lookahead operator. For example:
"<"/[ \t\n\r]*"<" return OPEN_ANGLE;
"<"/[ \t\n\r]*"int" return OPEN_ANGLE;
"<" return '<';
Then you change your bison rules to expect OPEN_ANGLE in types instead of <:
type : OPEN_ANGLE type '>'
| tINT
expr : tINTEGER
| expr '<' expr
For more complex problems, you can use flex start states, or even insert an entire token filter/transform pass between the lexer and the parser.
Here is a fix, though it is not entirely satisfactory:
%{
%}
%token tNAME tINT tINTEGER
%left '<'
%left '+'
%nonassoc '=' /* <-- LOOK */
%%
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
| expr '+' expr
;
This issue is a conflict between these two LR items: the dot-final item
variable : type tNAME '=' expr .
and this one:
expr : expr . '<' expr
Notice that these two have different operators. It is not, as you seem to think, a conflict between different productions involving the '<' operator.
By adding = to the precedence ranking, we fix the issue in the sense that the conflict diagnostic goes away.
Note that I gave = a high precedence. This will resolve the conflict by favoring the reduce. This means that you cannot use a '<' expression as an initializer:
int name = 4 < 3 // syntax error
When the < is seen, the int name = 4 wants to be reduced, and the idea is that < must be the start of the next declaration, as part of a type production.
To allow < relational expressions to be used as initializers, add the support for parentheses into the expression grammar. Then users can parenthesize:
int foo = (4 < 3) <int> bar = (2 < 1)
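The grammar side of that workaround is small; relative to the expr rules above it is roughly one extra alternative (the lexer must also pass '(' and ')' through):
expr : tINTEGER
     | expr '<' expr
     | expr '+' expr
     | '(' expr ')'
     ;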
There is no way to allow an unparenthesized relational expression there without a more powerful parsing method or hacks.
What if you move the %nonassoc before %left '<', giving it low precedence? Then the shift will be favored. Unfortunately, that has the consequence that you cannot write another <int> declaration after a declaration.
int foo = 3 <int> bar = 4
^ // error: the machine shifted and is now doing: expr '<' . expr.
So that is the wrong way to resolve the conflict; you want to be able to write multiple such declarations.
Another Note:
My TXR language, which implements something equivalent to Parsing Expression Grammars, handles this grammar fine. This is essentially LL(infinite), which trumps LALR(1).
We don't even have to have a separate lexical analyzer and parser! That's just something made necessary by the limitations of one-symbol lookahead, and the need for utmost efficiency on 1970s hardware.
Example output from the shell command line, demonstrating the parse by translation to a Lisp-like abstract syntax tree, which is bound to the variable dl (declaration list). So this is complete with semantic actions, yielding output that can be further processed in TXR Lisp. Identifiers are translated to Lisp symbols via calls to intern, and numbers are translated to number objects as well.
$ txr -l type.txr -
int x = 3 < 4 int y
(dl (decl x int (< 3 4)) (decl y int nil))
$ txr -l type.txr -
< int > x = 3 < 4 < int > y
(dl (decl x (pointer int) (< 3 4)) (decl y (pointer int) nil))
$ txr -l type.txr -
int x = 3 + 4 < 9 < int > y < int > z = 4 + 3 int w
(dl (decl x int (+ 3 (< 4 9))) (decl y (pointer int) nil)
(decl z (pointer int) (+ 4 3)) (decl w int nil))
$ txr -l type.txr -
<<<int>>>x=42
(dl (decl x (pointer (pointer (pointer int))) 42))
The source code (type.txr):
#(define ws)#/[ \t]*/#(end)
#(define int)#(ws)int#(ws)#(end)
#(define num (n))#(ws)#{n /[0-9]+/}#(ws)#(filter :tonumber n)#(end)
#(define id (id))#\
#(ws)#{id /[A-Za-z_][A-Za-z_0-9]*/}#(ws)#\
#(set id #(intern id))#\
#(end)
#(define type (ty))#\
#(local l)#\
#(cases)#\
#(int)#\
#(bind ty #(progn 'int))#\
#(or)#\
<#(type l)>#\
#(bind ty #(progn '(pointer ,l)))#\
#(end)#\
#(end)
#(define expr (e))#\
#(local e1 op e2)#\
#(cases)#\
#(additive e1)#{op /[<>]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(additive e)#\
#(end)#\
#(end)
#(define additive (e))#\
#(local e1 op e2)#\
#(cases)#\
#(num e1)#{op /[+\-]/}#(expr e2)#\
#(bind e #(progn '(,(intern op) ,e1 ,e2)))#\
#(or)#\
#(num e)#\
#(end)#\
#(end)
#(define decl (d))#\
#(local type id expr)#\
#(type type)#(id id)#\
#(maybe)=#(expr expr)#(or)#(bind expr nil)#(end)#\
#(bind d #(progn '(decl ,id ,type ,expr)))#\
#(end)
#(define decls (dl))#\
#(coll :gap 0)#(decl dl)#(end)#\
#(end)
#(freeform)
#(decls dl)

Shift reduce conflict

I'm having a problem understanding the shift/reduce conflict for a grammar that I know has no ambiguity. The case is one of the if/else type, but it's not the 'dangling else' problem, since I have mandatory END clauses delimiting code blocks.
Here is the grammar for gppg (it's a Bison-like compiler-compiler ... and that was not an echo):
%output=program.cs
%start program
%token FOR
%token END
%token THINGS
%token WHILE
%token SET
%token IF
%token ELSEIF
%token ELSE
%%
program : statements
;
statements : /*empty */
| statements stmt
;
stmt : flow
| THINGS
;
flow : '#' IF '(' ')' statements else
;
else : '#' END
| '#' ELSE statements '#' END
| elseifs
;
elseifs : elseifs '#' ELSEIF statements else
| '#' ELSEIF statements else
;
Here is the conflict output:
// Parser Conflict Information for grammar file "program.y"
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 10: else -> elseifs
Shift "'#'": State-22 -> State-23
Items for From-state State 22
10 else: elseifs .
-lookahead: '#', THINGS, EOF
11 elseifs: elseifs . '#' ELSEIF statements else
Items for Next-state State 23
11 elseifs: elseifs '#' . ELSEIF statements else
// End conflict information for parser
I already switched around everything, and I do know how to resolve it, but that solution involves giving up the left recursion on 'elseifs' for a right recursion.
I've been through all the scarce documentation I have found on the internet regarding this issue (I post some links at the end) and still have not found an elegant solution. I know about ANTLR and I don't want to consider it right now. Please limit your solutions to Yacc/Bison-style parsers.
I would appreciate elegant solutions. I managed to do it by eliminating the /* empty */ rules and duplicating everything that needed an empty list, but in the larger grammar I'm working on it just ends up as 'spaghetti grammar syndrome'.
Here are some links:
http://nitsan.org/~maratb/cs164/bison.html
http://compilers.iecc.com/comparch/article/98-01-079
GPPG, the parser I'm using
Bison manual
Your revised ELSEIF rule has no markers for a condition -- it should nominally have '(' and ')' added.
More seriously, you now have a rule for
elsebody : else
| elseifs else
;
and
elseifs : /* Nothing */
| elseifs ...something...
;
The 'nothing' is not needed; it is implicitly taken care of by the 'elsebody' without the 'elseifs'.
I would be very inclined to use rules 'opt_elseifs', 'opt_else', and 'end':
flow : '#' IF '(' ')' statements opt_elseifs opt_else end
;
opt_elseifs : /* Nothing */
| opt_elseifs '#' ELSEIF '(' ')' statements
;
opt_else : /* Nothing */
| '#' ELSE statements
;
end : '#' END
;
I've not run this through a parser generator, but I find this relatively easy to understand.
I think the problem is in the elseifs clause.
elseifs : elseifs '#' ELSEIF statements else
| '#' ELSEIF statements else
;
I think the first version is not required, since the else clause refers back to elseifs anyway:
else : '#' END
| '#' ELSE statements '#' END
| elseifs
;
What happens if you change elseifs?:
elseifs : '#' ELSEIF statements else
;
The answer from Jonathan above seems like it would be the best, but since it's not working for you, here are a few suggestions you could try that will help you in debugging the error.
Firstly, have you considered making the hash/sharp symbol part of the tokens themselves (e.g. #END, #IF, etc.), so that they get taken out by the lexer and don't have to appear in the parser? (A lex-style sketch of this follows these suggestions.)
Secondly I would urge you to rewrite the rules without duplicating any token streams. (Part of the Don't Repeat Yourself principle.) So the rule " '#' ELSEIF statements else " should only exist in one place in that file (not two as you have above).
Lastly I suggest that you look into precedence and associativity of the IF/ELSEIF/ELSE tokens. I know that you should be able to write a parser that doesn't require this but it might be the thing that you need in this case.
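A sketch of that first suggestion on the lexer side, in plain lex-style notation (the token names are made up, and the actual keyword spelling depends on your language; whatever lexer you pair with gppg will have an equivalent rule form):
"#IF"       return HASH_IF;
"#ELSEIF"   return HASH_ELSEIF;
"#ELSE"     return HASH_ELSE;
"#END"      return HASH_END;
The flow rule then becomes something like flow : HASH_IF '(' ')' statements else, with no bare '#' token anywhere in the grammar.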
I'm still switching things around, and my original question had some errors, since the elseifs sequence always had an else at the end, which was wrong. Here is another take at the question; this time I get two shift/reduce conflicts:
flow : '#' IF '(' ')' statements elsebody
;
elsebody : else
| elseifs else
;
else : '#' ELSE statements '#' END
| '#' END
;
elseifs : /* empty */
| elseifs '#' ELSEIF statements
;
The conflicts now are:
// Parser Conflict Information for grammar file "program.y"
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 12: elseifs -> /* empty */
Shift "'#'": State-10 -> State-13
Items for From-state State 10
7 flow: '#' IF '(' ')' statements . elsebody
4 statements: statements . stmt
Items for Next-state State 13
10 else: '#' . ELSE statements '#' END
11 else: '#' . END
7 flow: '#' . IF '(' ')' statements elsebody
Shift/Reduce conflict on symbol "'#'", parser will shift
Reduce 13: elseifs -> elseifs, '#', ELSEIF, statements
Shift "'#'": State-24 -> State-6
Items for From-state State 24
13 elseifs: elseifs '#' ELSEIF statements .
-lookahead: '#'
4 statements: statements . stmt
Items for Next-state State 6
7 flow: '#' . IF '(' ')' statements elsebody
// End conflict information for parser
Empty rules just aggravate gppg, I'm afraid. But they seem so natural to use that I keep trying them.
I already know right recursion solves the problem, as 1800 INFORMATION has said. But I'm looking for a solution with left recursion on the elseifs clause.
elsebody : elseifs else
| elseifs
;
elseifs : /* empty */
| elseifs '#' ELSEIF statements
;
else : '#' ELSE statements '#' END
;
I think this should left recurse and always terminate.
OK - here is a grammar (not minimal) for if blocks. I dug it out of some code I have (called adhoc, based on hoc from Kernighan & Pike's "The UNIX Programming Environment"). This outline grammar compiles with Yacc with no conflicts.
%token NUMBER IF ELSE
%token ELIF END
%token THEN
%start program
%%
program
: stmtlist
;
stmtlist
: /* Nothing */
| stmtlist stmt
;
stmt
: ifstmt
;
ifstmt
: ifcond endif
| ifcond else begin
| ifcond eliflist begin
;
ifcond
: ifstart cond then stmtlist
;
ifstart
: IF
;
cond
: '(' expr ')'
;
then
: /* Nothing */
| THEN
;
endif
: END IF begin
;
else
: ELSE stmtlist END IF
;
eliflist
: elifblock
| elifcond eliflist begin /* RIGHT RECURSION? */
;
elifblock
: elifcond else begin
| elifcond endif
;
elifcond
: elif cond then stmtlist end
;
elif
: ELIF
;
begin
: /* Nothing */
;
end
: /* Nothing */
;
expr
: NUMBER
;
%%
I used 'NUMBER' as the dummy element, instead of THINGS, and I used ELIF instead of ELSEIF. It includes a THEN, but that is optional. The 'begin' and 'end' operations were used to grab the program counter in the generated program - and therefore should be removable from this without affecting it.
There was a reason I thought I needed to use right recursion instead of the normal left recursion - but I think it was to do with the code generation strategy I was using, rather than anything else. The question mark in the comment was in the original; I remember not being happy with it. The program as a whole does work - it is a project that's been on the back burner for the last decade or so (hmmm...I did some work at the end of 2004 and beginning of 2005; prior to that, it was 1992 and 1993).
I've not spent the time working out why this compiles conflict-free and what I outlined earlier does not. I hope it helps.