How to solve a shift/reduce conflict? - grammar

I'm using CUP to create a parser that I need for my thesis. I have a shift/reduce conflict in my grammar. I have this production rule:
command ::= IDENTIFIER | IDENTIFIER LPAREN parlist RPAREN;
and I have this warning:
Warning : *** Shift/Reduce conflict found in state #3
between command ::= IDENTIFIER (*)
and command ::= IDENTIFIER (*) LPAREN parlist RPAREN
under symbol LPAREN
Now, I actually wanted it to shift so I'm pretty ok with it, but my professor told me to find a way to solve the conflict. I'm blind. I've always read about the if/else conflict but to me this doesn't seem the case.
Can you help me?
P.S.: IDENTIFIER, LPAREN "(" and RPAREN ")" are terminal, parlist and command are not.

Your problem is not in those rules at all. Although Michael Mrozek's answer is the correct approach to resolving the "dangling else" problem, it does not address the problem at hand.
If you look at the error message, you see that the shift/reduce conflict occurs when the lookahead token is LPAREN. I am pretty sure that these rules alone will not create a conflict.
I can't see the rest of your grammar, so I can't point to the exact cause. But your conflict probably arises when a command is followed by a different rule that starts with LPAREN.
Look at any other rules that can potentially be after command and start with LPAREN. You will then have to consolidate the rules. There is a very good chance that your grammar is erroneous for a specific input.
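One way to check this by hand is to compute FOLLOW(command) and see whether LPAREN is in it. The sketch below does that in Python for a made-up surrounding grammar: the program and group rules and the one-token parlist are assumptions for illustration, since the real grammar is not shown. It also assumes no empty productions, and FOLLOW is only an upper bound on the lookaheads an LALR generator such as CUP actually uses.

```python
def first_sets(grammar, terminals):
    """FIRST sets for a grammar with no empty productions."""
    first = {t: {t} for t in terminals}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                # Without empty productions only the first symbol matters.
                new = first[prod[0]] - first[nt]
                if new:
                    first[nt] |= new
                    changed = True
    return first


def follow_sets(grammar, terminals, start):
    """FOLLOW sets, iterated until nothing changes."""
    first = first_sets(grammar, terminals)
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")  # "$" marks end of input
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in grammar:
                        continue  # terminals have no FOLLOW set
                    if i + 1 < len(prod):
                        source = first[prod[i + 1]]
                    else:
                        source = follow[nt]
                    new = source - follow[sym]
                    if new:
                        follow[sym] |= new
                        changed = True
    return follow


# Hypothetical grammar: only the two command productions come from the question.
TERMINALS = {"IDENTIFIER", "LPAREN", "RPAREN"}
GRAMMAR = {
    "program": [["command", "group"], ["command"]],
    "group": [["LPAREN", "command", "RPAREN"]],
    "command": [["IDENTIFIER"],
                ["IDENTIFIER", "LPAREN", "parlist", "RPAREN"]],
    "parlist": [["IDENTIFIER"]],
}
```

Calling follow_sets(GRAMMAR, TERMINALS, "program") puts LPAREN into FOLLOW(command) because of the hypothetical rule program ::= command group. That is the kind of rule to look for in the real grammar.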

You have two productions:
command ::= IDENTIFIER
command ::= IDENTIFIER LPAREN parlist RPAREN;
It's a shift/reduce conflict when the input tokens are IDENTIFIER LPAREN, because:
LPAREN could be the start of a new production you haven't listed, in which case the parser should reduce the IDENTIFIER already on the stack into command, and have command LPAREN remaining
They could both be the start of the second production, so it should shift the LPAREN onto the stack next to IDENTIFIER and keep reading, trying to find a parlist.
You can fix it by doing something like this:
command ::= IDENTIFIER command2;
command2 ::= LPAREN parlist RPAREN |;

Try to set a precedence:
precedence left LPAREN, RPAREN;
It forces CUP to decide the conflict, taking the left match.

Related

ANTLR4 No Viable Alternative At Input

I'm implementing a simple PseudoCode language with ANTLR4, this is my current grammar:
// Define a grammar called PseudoCode
grammar PseudoCode;
prog : FUNCTION SIGNATURE '(' ')'
| FUNCTION SIGNATURE '{' VARB '}' ;
param: VARB | VARB ',' param ;
assignment: VARB '=' NUMBER ;
FUNCTION: 'function' ;
VARB: [a-z0-9]+ ;
SIGNATURE: [a-zA-Z0-9]+ ;
NUMBER: [0-9]+ | [0-9]+ '.' [0-9]+ ;
WS: [ \t\r\n]+ -> skip ;
The problem is after compiling and generating the Parser, Lexer, etc... and then running with grun PseudoCode prog -tree with the input being for example: function bla{bleh}
I keep on getting the following error:
line 1:9 no viable alternative at input 'functionbla'
Can someone point out what is wrong with my grammar?
bla is a VARB, not a SIGNATURE, because it matches both rules and VARB comes first in the grammar. The way you defined your lexer rules, an identifier can only be matched as a SIGNATURE if it contains capital letters.
The simplest solution to this problem would be to have a single lexer rule for identifiers and then use that everywhere where you currently use SIGNATURE or VARB. If you want to disallow capital letters in certain places, you could simply check for this condition in an action or listener, which would also allow you to produce clearer error messages than syntax errors (e.g. "capital letters are not allowed in variable names").
If you absolutely do need capital letters in variable names to be syntax errors, you could define one rule for identifiers with capital letters and one without. Then you could use ID_WITH_CAPITALS | ID_LOWER_CASE_ONLY in places where you want to allow both and ID_LOWER_CASE_ONLY in cases where you only want to allow lower case letters.
PS: You'll also want to make sure that your identifier rule does not match numbers (which both VARB and SIGNATURE currently do). Currently NUMBER tokens will only be generated for numbers with a decimal point.
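The tie-breaking described above (longest match wins, and on equal length the rule listed first wins) can be imitated with a small Python tokenizer. This is only a sketch of that selection rule, not ANTLR's implementation. The PUNCT rule is an assumed stand-in for the literal tokens such as '{' that the parser rules define implicitly.

```python
import re

# Lexer rules in the order they appear in the question's grammar.
RULES = [
    ("FUNCTION", re.compile(r"function")),
    ("VARB", re.compile(r"[a-z0-9]+")),
    ("SIGNATURE", re.compile(r"[a-zA-Z0-9]+")),
    ("NUMBER", re.compile(r"[0-9]+\.[0-9]+|[0-9]+")),
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("PUNCT", re.compile(r"[(){},=]")),  # assumption: stands in for literals
]


def tokenize(text):
    """Return (rule name, text) pairs, skipping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match replaces the current best, so on a
            # tie the rule that was listed first is kept.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"cannot tokenize at position {pos}")
        name, end = best
        if name != "WS":
            tokens.append((name, text[pos:end]))
        pos = end
    return tokens
```

With these rules, tokenize("function bla{bleh}") classifies bla as VARB, tokenize("42") yields a VARB rather than a NUMBER, and only an input such as "4.2" produces a NUMBER token.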

Antlr: Unintended behavior

Why this simple grammar
grammar Test;
expr
: Int | expr '+' expr;
Int
: [0-9]+;
doesn't match the input 1+1? It says "No method for rule expr or it has arguments" but in my opinion it should be matched.
It looks like I haven't used ANTLR for a while... ANTLRv3 did not support left-recursive rules, but ANTLRv4 does support immediate left recursion. It also supports the regex-like character class syntax you used in your post. I tested this version and it works in ANTLRWorks2 (running on ANTLR4):
grammar Test;
start : expr
;
expr : expr '+' expr
| INT
;
INT : [0-9]+
;
If you add the start rule then ANTLR is able to infer that EOF goes at the end of that rule. It doesn't seem to be able to infer EOF for more complex rules like expr and expr2 since they're recursive...
There are a lot of comments below, so here is (co-author of ANTLR4) Sam Harwell's response (emphasis added):
You still want to include an explicit EOF in the start rule. The problem the OP faced with using expr directly is ANTLR 4 internally rewrote it to be expr[int _p] (it does so for all left recursive rules), and the included TestRig is not able to directly execute rules with parameters. Adding a start rule resolves the problem because TestRig is able to execute that rule. :)
I've posted a follow-up question with regard to EOF: When is EOF needed in ANTLR 4?
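To give a feel for why a left-recursive rule ends up with a precedence parameter, here is a hand-written precedence-climbing parser in Python for expr : expr '+' expr | INT. It is a sketch in the same spirit as the expr[int _p] rewrite mentioned above, not ANTLR's generated code. The names parse_expr, min_prec and PLUS_PREC are made up for this example.

```python
import re

PLUS_PREC = 1  # assumed precedence level for the single '+' operator


def tokenize(text):
    """Split the input into integer and '+' tokens."""
    return re.findall(r"[0-9]+|\+", text)


def parse_expr(tokens, pos=0, min_prec=0):
    """Parse an expression whose operators bind at least as tightly as min_prec.

    Returns (tree, next position).
    """
    # The non-recursive alternative (INT) is parsed first.
    left = int(tokens[pos])
    pos += 1
    # The left-recursive alternative becomes a loop guarded by the
    # precedence argument instead of a recursive call on the left.
    while pos < len(tokens) and tokens[pos] == "+" and PLUS_PREC >= min_prec:
        # Asking for a higher precedence on the right makes '+' left-associative.
        right, pos = parse_expr(tokens, pos + 1, PLUS_PREC + 1)
        left = ("+", left, right)
    return left, pos
```

The extra min_prec argument plays the role of the rule parameter: a caller that can only invoke zero-argument rule methods cannot start parsing from such a rule directly, which matches the TestRig limitation described in the quote.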
If your command looks like this:
grun MYGRAMMAR xxx -tokens
And this exception is thrown:
No method for rule xxx or it has arguments
Then this exception will get thrown with the rule you specified in the command above. It means the rule probably doesn't exist.
System.err.println("No method for rule "+startRuleName+" or it has arguments");
So startRuleName here should print xxx if it's not the first (start) rule in the grammar. Put xxx as the first rule in your grammar to prevent this.

Antlr 3 keywords and identifiers colliding

Surprise, I am building an SQL like language parser for a project.
I had it mostly working, but when I started testing it against real requests it would be handling, I realized it was behaving differently on the inside than I thought.
The main issue in the following grammar is that I define a lexer rule PCT_CONTAINS for the language keyword 'pct_contains'. This works fine, but if I try to match a field like 'attributes.pct_vac', I get the field having text of 'attributes.ac' and a pretty ANTLR error of:
line 1:15 mismatched character u'v' expecting 'c'
GRAMMAR
grammar Select;
options {
language=Python;
}
eval returns [value]
: field EOF
;
field returns [value]
: fieldsegments {print $field.text}
;
fieldsegments
: fieldsegment (DOT (fieldsegment))*
;
fieldsegment
: ICHAR+ (USCORE ICHAR+)*
;
WS : ('\t' | ' ' | '\r' | '\n')+ {self.skip();};
ICHAR : ('a'..'z'|'A'..'Z');
PCT_CONTAINS : 'pct_contains';
USCORE : '_';
DOT : '.';
I have been reading everything I can find on the topic. How the Lexer consumes stuff as it finds it even if it is wrong. How you can use semantic predication to remove ambiguity/how to use lookahead. But everything I read hasn't helped me fix this issue.
Honestly I don't see how it even CAN be an issue. I must be missing something super obvious because other grammars I see have Lexer rules like EXISTS but that doesn't cause the parser to take a string like 'existsOrNot' and spit out an IDENTIFIER with the text of 'rNot'.
What am I missing or doing completely wrong?
Convert your fieldsegment parser rule into a lexer rule. As it stands now it will accept input like
"abc
_ abc"
which is probably not what you want. The keyword "pct_contains" won't be matched by this rule since it is defined separately. If you want to accept the keyword in certain sequences as regular identifier you will have to include it in the accepted identifier rule.
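The idea of matching a whole field segment as one token can be sketched in Python. Note that this uses a different technique from writing two competing lexer rules: it matches the longest identifier first and only then looks it up in a keyword table. The names TOKEN_RE, KEYWORDS and IDENT are invented for the example.

```python
import re

# Keyword table consulted after a full identifier has been matched.
KEYWORDS = {"pct_contains": "PCT_CONTAINS"}

# One identifier is letters, optionally followed by underscore-separated
# groups of letters, mirroring ICHAR+ (USCORE ICHAR+)* from the question.
TOKEN_RE = re.compile(
    r"(?P<IDENT>[A-Za-z]+(?:_[A-Za-z]+)*)|(?P<DOT>\.)|(?P<WS>\s+)"
)


def tokenize(text):
    """Return (token type, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, value = m.lastgroup, m.group()
        if kind == "IDENT":
            # Only an exact, whole-token match counts as the keyword.
            kind = KEYWORDS.get(value, "IDENT")
        if kind != "WS":
            tokens.append((kind, value))
        pos = m.end()
    return tokens
```

Here tokenize("attributes.pct_vac") keeps pct_vac intact as one IDENT, while the input pct_contains on its own is still reported as the keyword.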

Reduce/Reduce conflict when introducing pointers in my grammar

I'm working on a small compiler in order to get a greater appreciation of the difficulties of creating one's own language. Right now I'm at the stage of adding pointer functionality to my grammar but I got a reduce/reduce conflict by doing it.
Here is a simplified version of my grammar that is compilable by bnfc. I use the happy parser generator and that's the program telling me there is a reduce/reduce conflict.
entrypoints Stmt ;
-- Statements
-------------
SDecl. Stmt ::= Type Ident; -- ex: "int my_var;"
SExpr. Stmt ::= Expr; -- ex: "printInt(123); "
-- Types
-------------
TInt. Type ::= "int" ;
TPointer. Type ::= Type "*" ;
TAlias. Type ::= Ident ; -- This is how I implement typedefs
-- Expressions
--------------
EMult. Expr1 ::= Expr1 "*" Expr2 ;
ELitInt. Expr2 ::= Integer ;
EVariable. Expr2 ::= Ident ;
-- and the standard coercions
_. Expr ::= Expr1 ;
_. Expr1 ::= Expr2 ;
I'm in a learning stage of how grammars work. But I think I know what happens. Consider these two programs
main(){
int a;
int b;
a * b;
}
and
typedef int my_type;
main(){
my_type * my_type_pointer_variable;
}
(The typedef and main(){} parts aren't relevant and aren't in my grammar, but they give some context.)
In the first program I wish it would parse a "*" b as Stmt ==(SExpr)==> Expr ==(EMult)==> Expr * Expr ==(..)==> Ident "*" Ident, that is to essentially start stepping using the SExpr rule.
At the same time I would like my_type * my_type_pointer_variable to be expanded using the rules. Stmt ==(SDecl)==> Type Ident ==(TPointer)==> Type "*" Ident ==(TAlias)==> Ident "*" Ident.
But the grammar stage has no idea whether an identifier was originally a type alias or a variable.
(1) How can I get rid of the reduce/reduce conflict and (2) am I the only one having this issue? Is there an obvious solution and how does the C grammar resolve this issue?
So far the only thing that has worked is changing the syntax of my language to use "&" or some other symbol instead of "*", but that's very undesirable. I also looked at various public C grammars to see why they don't have this issue, but I could not make sense of them.
And last, how do I resolve issues like these on my own? All I understood from happy's more verbose output is how the conflict happens. Is cleverness the only way to work around these conflicts? I'm afraid I'll stumble on even more issues, for example when introducing EIndir. Expr = '*' Expr;
The usual way this problem is dealt with in C parsers is something generally called "the lexer feedback hack". It's a 'hack' in the sense that it doesn't deal with it in the grammar at all; instead, when the lexer recognizes an identifier, it classifies that identifier as either a typename or a non-typename, and returns a different token for each case (usually designated 'TypeIdent' for an identifier that is a typename and simply 'Ident' for any other). The lexer makes this selection by looking at the current state of the symbol table, so it sees all the typedefs that have occurred prior to the current point in the parse, but not typedefs that are after the current point. This is why C requires that you declare typedefs before their first use in each compilation unit.
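A minimal Python sketch of that feedback loop follows. It is independent of bnfc and happy, and the class and method names (SymbolTable, declare_typedef, classify) are invented for illustration: the parser records each typedef as it is reduced, and the lexer consults the same table when it sees an identifier.

```python
class SymbolTable:
    """Names known to be typedefs at the current point in the parse."""

    def __init__(self):
        self.typedefs = set()

    def declare_typedef(self, name):
        # Called by the parser once a typedef declaration has been recognised.
        self.typedefs.add(name)


def classify(name, table):
    """Choose the token type the lexer returns for an identifier."""
    return "TypeIdent" if name in table.typedefs else "Ident"


def token_types(names, table):
    """Classify a sequence of identifiers against the current table."""
    return [classify(name, table) for name in names]
```

Before declare_typedef("my_type") is called, my_type is classified as a plain Ident; afterwards it is a TypeIdent. The grammar can then use TypeIdent in TAlias and Ident in EVariable, so the two reductions no longer compete for the same token.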

How to resolve a shift/reduce conflict forcing a shift or a reduce?

When there is a shift/reduce conflict in Yacc/Bison, is it possible to force the conflict to be solved exactly as you want? In other words: is it possible explicitly force it to prioritize the shift or the reduce?
From what I have read, if you are happy with the default resolution you can tell the generator not to complain about it. I really don't like this because it obfuscates your rational choice.
Another option is to rewrite the grammar to fix the issue. I don't know if this is always possible, and often this makes the grammar much harder to understand.
Finally, I have read that precedence rules can fix this. I cluelessly tried that in many ways and couldn't make it work. Is it possible to use precedence rules for that? How?
Though my ambiguous grammar is very different, I can use the classical if-then-else from the Bison manual to give a concrete example:
%token IF THEN ELSE variable
%%
stmt:
expr
| if_stmt
;
if_stmt:
IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr:
variable
;
As far as I can tell, it is not possible to direct the parser to resolve a S/R conflict by choosing to reduce. Though I might be wrong, it is probably ill-advised to proceed this way anyway. Therefore, the only possibilities are either rewriting the grammar, or solving the conflict by shifting.
The following usage of right precedence for THEN and ELSE describes the desired behavior for the if-then-else statement (that is, associating else with the innermost if statement).
%token IF THEN ELSE variable
%right THEN ELSE
%%
stmt
: expr
| if_stmt
;
if_stmt
: IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr
: variable
;
By choosing right association for the above tokens, the following sequence:
IF expr1 THEN IF expr2 THEN IF expr3 THEN x ELSE y
is parsed as:
IF expr1 THEN (IF expr2 THEN (IF expr3 THEN (x ELSE (y))))
and Bison does not complain about the case any longer.
Remember that you can always run bison file.y -r all and inspect file.output in order to see if the generated parser state machine is correct.
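The effect of resolving the conflict by shifting can be illustrated with a small hand-written recursive-descent parser in Python. This is a sketch of the resulting behavior, not of Bison's table-driven algorithm. It assumes tokens arrive as a list of strings and that any token other than IF, THEN and ELSE is a variable.

```python
def parse_stmt(tokens, pos=0):
    """Parse stmt -> variable | IF variable THEN stmt [ELSE stmt].

    Returns (tree, next position).
    """
    if tokens[pos] == "IF":
        cond = tokens[pos + 1]
        if tokens[pos + 2] != "THEN":
            raise SyntaxError("expected THEN")
        then_branch, pos = parse_stmt(tokens, pos + 3)
        # Taking an ELSE as soon as it appears corresponds to choosing
        # "shift": the ELSE binds to the innermost IF still being parsed.
        if pos < len(tokens) and tokens[pos] == "ELSE":
            else_branch, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_branch, else_branch), pos
        return ("if", cond, then_branch), pos
    # Anything else is treated as a variable.
    return tokens[pos], pos + 1
```

For the input "IF a THEN IF b THEN x ELSE y".split(), the ELSE ends up inside the inner if node, and the outer if has no else branch.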
Well, the default resolution for a shift/reduce conflict is to shift, so if that's what you want, you don't need to do anything (other than ignoring the warning).
If you want to resolve a shift/reduce conflict by reducing, you can use the precedence rules -- just make sure that the rule to be reduced is higher precedence than the token to be shifted. The tricky part comes if there are multiple shift/reduce conflicts involving the same rules and tokens, it may not be possible to find a globally consistent set of precedences for the rules and tokens which resolves things the way you want.