How to resolve a shift/reduce conflict forcing a shift or a reduce? - grammar

When there is a shift/reduce conflict in Yacc/Bison, is it possible to force the conflict to be solved exactly as you want? In other words: is it possible explicitly force it to prioritize the shift or the reduce?
For what I have read, if you are happy with the default resolution you can tell the generator to not complain about it. I really don't like this because it is obfuscating your rational choice.
Another option is to rewrite the grammar to fix the issue. I don't know if this is always possible and often this makes it much harder to understand.
Finally, I have read the precedence rules can fix this. I clueless tried that in many ways and I couldn't make it work. Is it possible to use the precedence rule for that? How?
Though my ambiguous grammar is very different, I can use the classical if-then-else from the Bison manual to give a concrete example:
%token IF THEN ELSE variable
%%
stmt:
expr
| if_stmt
;
if_stmt:
IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr:
variable
;

As far as I can tell, it is not possible to direct the parser to resolve a S/R conflict by choosing to reduce. Though I might be wrong, it is probably ill-advised to proceed this way anyway. Therefore, the only possibilities are either rewriting the grammar, or solving the conflict by shifting.
The following usage of right predecence for THEN and ELSE describes the desired behavior for the if-then-else statement (that is, associating else with the innermost if statement).
%token IF THEN ELSE variable
%right THEN ELSE
%%
stmt
: expr
| if_stmt
;
if_stmt
: IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr
: variable
;
By choosing right association for the above tokens, the following sequence:
IF expr1 THEN IF expr2 THEN IF expr3 THEN x ELSE y
is parsed as:
IF expr1 THEN (IF expr2 THEN (IF expr3 THEN (x ELSE (y))))
and Bison does not complain about the case any longer.
Remember that you can always run bison file.y -r all and inspect file.output in order to see if the generated parser state machine is correct.

Well, the default resolution for a shift/reduce conflict is to shift, so if that's what you want, you don't need to do anything (other than ignoring the warning).
If you want to resolve a shift/reduce conflict by reducing, you can use the precedence rules -- just make sure that the rule to be reduced is higher precedence than the token to be shifted. The tricky part comes if there are multiple shift/reduce conflicts involving the same rules and tokens, it may not be possible to find a globally consistent set of precedences for the rules and tokens which resolves things the way you want.

Related

Effect of yacc operator associativity declaration on expressions that have just a few tokens

When you have a grammar like this one:
B: 'a' A 'a'
| 'b' A 'b'
A: 'a' 'a'
| 'a'
The %right 'a' declaration causes aa.a not to be accepted because a shift happens instead of a reduce at '.',
and %left 'a' doesn't accept neither aa.aa nor ba.ab because the parse always reduces at the dot.
It is not quite clear to me how to figure out what effect do associativity declarations have in such cases where the token ('a') is not being used as an operator straightforwardly.
Why do you think LR(1) would be more intuitive? The grammar is not LR(1) so any LR(1) parser generator should report a shift/reduce conflict, just as an LALR(1) parser generator would.
Of course, yacc/bison is not a pure LALR(1) parser generator. If it resolves shift/reduce conflicts using a precedence/associativity declaration, then it suppresses the warning. That doesn't make the grammar unambiguous, though. One of the (many) issues with using precedence declarations is that it is no longer clear what language you are parsing. Silently ignoring a shift/reduce and statically resolving it in favour of one or the other action will produce a parser which recognizes some context-free language, but it is not the language described by the grammar.
None of that has to do with the LALR algorithm, though.
To answer your question: the algorithm bison/yacc uses to resolve shift/reduce conflicts is really simple.
Every terminal mentioned in a precedence declaration is assigned a precedence value. All terminals mentioned in the same declaration have the same precedence, and have a higher precedence than any terminal mentioned in a previous declaration.
Every production whose last terminal has a precedence value is assigned the same precedence value. (If the production includes a %prec TERMINAL modifier, that terminal is used instead of the last terminal in the production.
If there is a shift-reduce conflict for a production with some lookahead symbol, and both the production and the lookahead symbol have precedence values, then the reduction is applied if the production's precedence is higher or if the precedences are equal and the precedence was assigned by a %left declaration. The shift is applied if the lookahead symbol's precedence is higher or if the precedences are equal and the precedence was assigned by a %right declaration.
That's it. Note that in the above algorithm, there is no mention of operators, which is actually not a concept in any form of LR parsing.

shift/reduce error in yacc

I know this part of my grammar cause error but I don't know how to fix it I even use %left and right but it didn't help. Can anybody please help me to find out what is the problem with this grammar.
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators, (a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators.) But if you are using = as an equality operator, it might not actually be what you intended.

Reduce/Reduce conflict when introducing pointers in my grammar

I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar but I got a reduce/reduce conflict by doing it.
Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.
entrypoints Stmt ;
-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr; -- ex: "printInt(123); "
-- Types
-------------
TInt. Type ::= "int" ;
TPointer. Type ::= Type "*" ;
TAlias. Type ::= Ident ; -- This is how I implement typedefs
-- Expressions
--------------
EMult. Expr1 ::= Expr1 "*" Expr2 ;
ELitInt. Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;
-- and the standard corecions
_. Expr ::= Expr1 ;
_. Expr1 ::= Expr2 ;
I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs
main(){
int a;
int b;
a * b;
}
and
typedef int my_type;
main(){
my_type * my_type_pointer_variable;
}
(The typedef and main(){} part isn't relevant and in my grammar. But they give some context)
In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.
At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.
But the grammar stage have no idea if an identifier originally is a type alias or a variable.
(1) How can I get rid of the reduce/reduce conflict and (2) am I the only one having this issue? Is there an obvious solution and how does the c grammar resolve this issue?
So far I have successfully just been able to change the syntax of my language by using "&" or some other symbol instead of "*", but that's very undesirable. Also I cannot make sense from various public c grammars and tried to see why they don't have this issue but I have had no luck in this.
And last, how do I resolve issues like these on my own? All I understood from happys more verbose output is how the conflict happens, is cleverness the only way to work around these conflicts? I'm afraid I'll stumble on even more issues for example when introducing EIndir. Expr = '*' Expr;
The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". Its a 'hack' in the sense that it doesn't deal with it in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that are after the current point. This is why C requires that you declare typedefs before their first use in each compilation unit.

Yacc "rule useless due to conflicts"

i need some help with yacc.
i'm working on a infix/postfix translator, the infix to postfix part was really easy but i'm having some issue with the postfix to infix translation.
here's an example on what i was going to do (just to translate an easy ab+c- or an abc+-)
exp: num {printf("+ ");} exp '+'
| num {printf("- ");} exp '-'
| exp {printf("+ ");} num '+'
| exp {printf("- ");} num '-'
|/* empty*/
;
num: number {printf("%d ", $1);}
;
obiously it doesn't work because i'm asking an action (with the printfs) before the actual body so, while compiling, I get many
warning: rule useless in parser due to conflict
the problem is that the printfs are exactly where I need them (or my output wont be an infix expression). is there a way to keep the print actions right there and let yacc identify which one it needs to use?
Basically, no there isn't. The problem is that to resolve what you've got, yacc would have to have an unbounded amount of lookahead. This is… problematic given that yacc is a fairly simple-minded tool, so instead it takes a (bad) guess and throws out some of your rules with a warning. You need to change your grammar so yacc can decide what to do with a token with only a very small amount of lookahead (a single token IIRC). The usual way to do this is to attach the interpretations of the values to the tokens and either use a post-action or, more practically, build a tree which you traverse as a separate step (doing print out of an infix expression from its syntax tree is trivial).
Note that when you've got warnings coming out of yacc, that typically means that your grammar is wrong and that the resulting parser will do very unexpected things. Refine it until you get no warnings from that stage at all. That is, treat grammar warnings as errors; anything else and you'll be sorry.

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.