Yacc conflict i cant fix - yacc

i've been trying to fix a shift/reduce conflict in my yacc specification and i can't seem to find where it is.
%union{
char* valueBase;
char* correspondencia;
}
%token pal palT palC
%type <valueBase> pal
%type <correspondencia> palT palC Smth
%%
Dicionario : Traducao
| Dicionario Traducao
;
Traducao : Palavra Correspondencia
;
Palavra : Base Delim
| Exp
;
Delim :
| ':'
;
Correspondencia :
| palC {printf("PT Tradução: %s\n",$1);}
;
Exp : Smth '-' Smth {aux = yylval.valueBase; printf("PT Tradução: %s %s %s\n", $1, aux, $3);}
;
Smth : palT {$$ = strdup($1);}
| {$$ = "";}
;
Base : pal {printf("EN Palavra base: %s\n",$1);}
;
Any help to find and fix this conflict would be extremely appreciated.

So looking at the y.output file from your grammar, you have a shift/reduce conflict in state 13:
State 13
10 Exp: Smth '-' . Smth
palT shift, and go to state 2
palT [reduce using rule 12 (Smth)]
$default reduce using rule 12 (Smth)
Smth go to state 16
Basically, what this is saying is that when parsing an Exp after having seen a Smth '-' and looking at a lookahead of palT, it doesn't know whether it should reduce an empty Smth to finish the Exp (leaving the palT as part of some later construct) OR shift the palT so it can then be reduced (recognized) as a Smth that completes this Exp.
The language you are recognizing is a sequence of one or more Traducao, each of which consists of a Palavra followed by an optional palC (Correspondencia that may be a palC or empty). That means that you might have a Palavra directly following another Palavra (the Correspondencia for the first one is empty). So the parser needs to find the boundary between one Palavra and the next just by looking at its current state and one token of lookahead, which is a problem.
In particular, when you have an input like PalT '-' PalT '-' PalT, that is two consecutive Palavra, but it is not clear whether the middle PalT belongs to the first one or the second. It is ambiguous, because it could be parsed successfully either way.
If you want the parser to just accept as much as possible into the first Palavra, then you can just accept the default resolution (of shift). If that is wrong and you would want the other interpretation, then you are going to need more lookahead to recognize this case, as it depends on whether or not there is a second '-' after the second palT or something else.

Related

How to disambiguate a subselect from a parenthesized expression?

I have the following expression notation:
expr
: OpenParen expr (Comma expr)* Comma? CloseParen # parenExpr
| OpenParen simpleSelect CloseParen # subSelectExpr
Unfortunately, a simpleSelect can also have a parenthetical around it, and so the following statement becomes ambiguous:
select ((select 1))
Here is the current grammar that I have, simplified down to only showing the issue:
grammar Subselect;
options { caseInsensitive=true; }
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
expr
: '(' expr ')' # parenExpr
| '(' query_expr ')' # subSelect
| Atom # identifier
;
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
And on the parse of select ((select 1)), here is the output:
What would be a possible way to disambiguate this?
I suppose the main thing is here:
'(' query_statement ')'
Since that recursively calls itself -- is there a way to do indirection or something else such that a query_statement called from within parens can never itself have parens?
Also, maybe this is a common thing? I get the same ambiguous output when running this on the official MySQL grammar here:
I would be curious whether any of the grammars can solve the issue here: https://github.com/antlr/grammars-v4/tree/master/sql. Maybe the best approach is just to remove duplicate parens before parsing the text? (If so, are there are good tools to do that, or do I need to write an additional antlr parser just to do that?)
Your input generates this parse tree:
That's a reasonable interpretation of your input and it is identified as a subSelect expr. It's a subSelect nested in a parenExpr (both of which are exprs).
If I switch up your rule a bit:
expr: '(' query_expr ')' # subSelect
| '(' expr ')' # parenExpr
| Atom # identifier
;
Now it's a subSelect that interprets the nested (select 1) as a query expression.
It's ambiguous because the outer parenthesized expression could match either of the first two alternatives resulting in different parse trees.
In ANTLR, ambiguities in alternatives are resolved by "using" the first alternative that matches. In this way ANTLR has deterministic behavior where you can control which interpretation is used (with alternative order). It's not uncommon for ANTLR grammars to have ambiguities like this.
IMHO, the IntelliJ plugin has caused many people to stumble over this as an indication that something is "wrong" with the grammar. There's a reason that ANTLR itself does not report an error in this situation. It has defined, deterministic behavior.
So far as "resolving" this ambiguity: the simple fact that the syntax uses parentheses to indicate two different "things" indicates that it is inherently ambiguous, so I don't believe you can "fix" the grammar ambiguity without modifying the syntax. (I might be wrong about this, and would find it interesting if someone provides a refactoring that manages to remove the ambiguity.)
EDIT:
After trying an earlier solution that proved incorrect with some additional test data, I've tried a different approach.
I added Atom as a viable alternative for query_expr since that Atom '1` is being offered as test data. In the full grammar implementation, it's hard to predict if this is necessary, even sufficent. I have only the grammar above with which to test.
I used some semantic predicates to strip parentheses (avoids the effort of writing an additional parser).
For testing purposes only, I added SQL-style line comments so that I could test many different inputs quickly.
The following SQL statements were tested, showing no ambiguity.
select 1
select (1)
select ((select 1))
select ((select (abc)))
select abc from ((select 1 from (select((select(1))))))
(select 1 from (select((select(1)))))
((select (xyz) from (select (((((foo))))) from tableX)))
select a from (select x from xyz)
union
select b from abc
select a from ((select x from xyz ))
intersect
((select b from foo))
select a from (select x from xyz )
intersect
(select b from foo)
The grammar is as follows:
grammar Subselect;
options { caseInsensitive=true; }
#header
{
import java.util.*;
}
#parser::members
{
String stripParens(String phrase)
{
String temp1 = phrase.substring[1];
temp2 = temp1.substring(0, s.length()-1);
return temp2;
}
}
statement: query_statement EOF;
query_statement
: query_expr # simple
| query_statement set_op query_statement # set
;
query_expr
: with_clause?
( select | '(' query_statement ')' )
limit_clause?
| Atom
;
select
: select_clause
(from_clause
where_clause?)?
;
with_clause: 'WITH' expr 'AS (' select ')';
select_clause: 'SELECT' expr (',' expr)*;
from_clause: 'FROM' expr;
where_clause: 'WHERE' expr;
limit_clause: 'LIMIT' expr;
set_op: 'UNION'|'INTERSECT'|'EXCEPT';
lrpExpr
: {stripParens(_input.LT[1].getText())}? query_expr
;
expr
: '(' lrpExpr ')' # parenExpr
| Atom # identifier
;
//---------------------------------------------
Atom: [a-z_0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
LineComment : '--' ~[\r\n]* -> skip ;
I'm not including images of parse trees in this edit to conserve space. However, from the inputs I tested, lrpExpr, being a separate rule, would give e.g. a Visitor class to evaluate what is inside the parentheses before moving further down the parse tree, so order of evaluation e.g. mathematical operator precedence could still be honored.
All still fast and with zero ambiguity.
I hope this suits your needs better.
Attribution: I used this answer as a starting point for the Java code for the semantic predicate.

ANTLR: "for" keyword used for loops conflicts with "for" used in messages

I have the following grammar:
myg : line+ EOF ;
line : ( for_loop | command params ) NEWLINE;
for_loop : FOR WORD INT DO NEWLINE stmt_body;
stmt_body: line+ END;
params : ( param | WHITESPACE)*;
param : WORD | INT;
command : WORD;
fragment LOWERCASE : [a-z] ;
fragment UPPERCASE : [A-Z] ;
fragment DIGIT : [0-9] ;
WORD : (LOWERCASE | UPPERCASE | DIGIT | [_."'/\\-])+ (DIGIT)* ;
INT : DIGIT+ ;
WHITESPACE : (' ' | '\t')+ -> skip;
NEWLINE : ('\r'? '\n' | '\r')+ -> skip;
FOR: 'for';
DO: 'do';
END: 'end';
My problem is that the 2 following are valid in this language:
message please wait for 90 seconds
This would be a valid command printing a message with the word "for".
for n 2 do
This would be the beginning of a for loop.
The problem is that with the current lexer it doesn't match the for loop since 'for' is matched by the WORD rule as it appears first.
I could solve that by putting the FOR rule before the WORD rule but then 'for' in message would be matched by the FOR rule
This is the typical keywords versus identifier problem and I thought there were quite a number of questions regarding that here on Stackoverflow. But to my surprise I can only find an old answer of mine for ANTLR3.
Even though the principle mentioned there remains the same, you no longer can change the returned token type in a parser rule, with ANTLR4.
There are 2 steps required to make your scenario work.
Define the keywords before the WORD rule. This way they get own token types you need for grammar parts which require specific keywords.
Add keywords selectively to rules, which parse names, where you want to allow those keywords too.
For the second step modify your rules:
param: WORD | INT | commandKeyword;
command: WORD | commandKeyword;
commandKeyword: FOR | DO | END; // Keywords allowed as names in commands.

Left-factoring grammar of coffeescript expressions

I'm writing an Antlr/Xtext parser for coffeescript grammar. It's at the beginning yet, I just moved a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL() and Left Factoring gave me more insights, but I still can't get a handle.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the trouble maker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers are also welcome to tips how to "debug" this kind of erroneous grammar in your head.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid to intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first without a usable AST. (Well, it would deserve a separates question if this is a feasible approach.) So I don't care about tokens at the moment, they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY macth this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
In may occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rules like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent it's code block with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefor jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.

ANTLR grammar problem with parenthetical expressions

I'm using ANTLRWorks 1.4.2 to create a simple grammar for the purpose of evaluating an user-provided expression as boolean result. This ultimately will be part of a larger grammar, but I have some questions about this current fragment. I want users to be able to use expressions such as:
2 > 1
2 > 1 and 3 < 1
(2 > 1 or 1 < 3) and 4 > 1
(2 > 1 or 1 < 3) and (4 > 1 or (2 < 1 and 3 > 1))
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, and I am not sure why. So, I seem to be missing out on some insight into the right way to handle parenthetical grouping in a grammar.
How can I change my grammar to properly handle parentheses?
My grammar is below:
grammar conditional_test;
boolean
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
boolean_term
: boolean_factor (AND boolean_factor)*
;
boolean_factor
: (NOT)? boolean_test
;
boolean_test
: predicate
;
predicate
: expression relational_operator expression
| LPAREN boolean_value_expression RPAREN
;
relational_operator
: EQ
| LT
| GT
;
expression
: NUMBER
;
LPAREN : '(';
RPAREN : ')';
NUMBER : '0'..'9'+;
EQ : '=';
GT : '>';
LT : '<';
AND : 'and';
OR : 'or' ;
NOT : 'not';
Chris Farmer wrote:
The first two expressions are legal in my grammar, but the last two are not, and I am not sure why. ...
You should remove the EOF token from:
boolean_value_expression
: boolean_term (OR boolean_term)*
EOF
;
You normally only use the EOF after the entry point of your grammar (boolean in your case). Be careful boolean is a reserved word in Java and can therefor not be used as a parser rule!
So the first two rules should look like:
bool
: boolean_value_expression
EOF
;
boolean_value_expression
: boolean_term (OR boolean_term)*
;
And you may also want to ignore literal spaces by adding the following lexer rule:
SPACE : ' ' {$channel=HIDDEN;};
(you can include tabs an line breaks, of course)
Now all of your example input matches properly (tested with ANTLRWorks 1.4.2 as well).
Chris Farmer wrote:
Also, ANTLRworks seems to suggest that input such as ((((1 > 2) with mismatched parentheses is legal, ...
No, ANTLRWorks does produce errors, perhaps not very noticeable ones. The parse tree ANTLRWorks produces has a NoViableAltException as a leaf, and there are some errors on the "Console" tab.

bison shift/reduce problem moving add op into a subexpr

Originally in the example there was this
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
I wanted it to be 'more simple' so i wrote this (i realize it would do '+' for both add and subtract. But this is an example)
expr:
INTEGER
| expr addOp expr { $$ = $1 + $3; }
;
addOp:
'+' { $$ = $1; }
| '-' { $$ = $1; }
;
Now i get a shift/reduce error. It should be exactly the same -_- (to me). What do i need to do to fix this?
edit: To make things clear. The first has NO warning/error. I use %left to set the precedence (and i will use %right for = and those other right ops). However it seems to not apply when going into sub expressions.
Are you sure the conflicts involve just those two rules? The first one should have more conflicts than the second. At least with one symbol of look-ahead the decision to shift to a state with addOp on the stack is easier the second time around.
Update (I believe I can prove my theory... :-):
$ cat q2.y
%% expr: '1' | expr '+' expr | expr '-' expr;
$ cat q3.y
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
$ yacc q2.y
conflicts: 4 shift/reduce
$ yacc q3.y
conflicts: 2 shift/reduce
Having said all that, it's normal for yacc grammars to have ambiguities, and any real-life system is likely to have not just a few but literally dozens of shift/reduce conflicts. By definition, this conflict occurs when there is a perfectly valid shift available, so if you don't mind the parser taking that shift, then don't worry about it.
Now, in yacc you should prefer left-recursive rules. You can achieve that and get rid of your grammar ambiguity with:
$ cat q4.y
%% expr: expr addOp '1' | '1';
addOp: '+' | '-';
$ yacc q4.y
$
Note: no conflicts in the example above. If you like your grammar the way it is, just do:
%expect 2
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
The problem is that the rule
expr: expr addOp expr { ..action.. }
has no precedence. Normally rules get the precedence of the first token on the RHS, but this rule has no tokens on its RHS. You need to add a %prec directive to it:
expr: expr addOp expr %prec '+' { ..action.. }
to explicitly give the rule a precedence.
Note that doing this doesn't get rid of the shift/reduce conflict, which was present in your original grammar. It just resolves it according to the precedence rules you specify, which means that bison won't give you a message about it. In general, using precedence to resolve conflicts can be tricky, since it can hide conflicts that you might have wanted to resolve differently, or might be unresolvable in your grammar as written.
Also see my answer to this question