In a functional language compiler written using the happy parser generator, which is quite similar to yacc/bison, I implemented lists and, along with them, the core functions map, concat and filter, using the following rules:
Exp:
...
| concat '(' Exp ',' Exp ')' { Concat $3 $5 }
| map '(' Exp ',' Exp ')' { Map $3 $5 }
| filter '(' Exp ',' Exp ')' { Filter $3 $5 }
This works just fine, but in most functional languages there are no parentheses or commas, so instead of map(myfun, [1,2,3]) I would rather write map myfun [1,2,3]. The obvious modification to the grammar is the following:
Exp:
...
| concat Exp Exp { Concat $2 $3 }
| map Exp Exp { Map $2 $3 }
| filter Exp Exp { Filter $2 $3 }
But this modification introduces lots of reduce/reduce conflicts. How can I parse function calls without commas and parentheses?
The smallest conflicting grammar I could extract was this:
Exp :
-- Math
Exp '+' Exp { Op $1 Add $3 }
| Exp '-' Exp { Op $1 Sub $3 }
-- Literals
| num { Num $1 }
| '-' num %prec NEGATIVE { Num (-$2) }
-- Lists
| map Exp Exp { Map $2 $3 }
It generates 4 reduce/reduce conflicts. Removing any of the rules also ends up with the conflicts. Here is the full grammar if you are interested.
The problem is that since there's no token in a function application, token-based precedence conflict resolution doesn't work very well. When it's trying to decide between a shift that might be a function application and a reduce of some other expression, the lookahead token is whatever the argument expression begins with; there is no 'blank space' token that can be used.
To hack around that problem and make it work, you need to set the precedence of EVERY token that might begin an expression (every token in FIRST(Exp)) to that of function application. If any of those tokens need some other precedence (e.g., any token that might be either infix or prefix), this gets much trickier and might not work.
An alternative that might work better is to not use precedence rules at all -- instead, disambiguate the grammar with different rules for each level of precedence:
Exp: Term | Exp '+' Term
Term: Factor | Term '*' Factor
Factor: Primary | Factor Primary
Primary: num | id | '(' Exp ')'
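To see how the stratified grammar treats juxtaposition as application, here is a minimal hand-written recursive-descent sketch in Python (not happy output). The tuple-based tree shape, the tokenizer, and the name `parse` are assumptions made for illustration; each function mirrors one nonterminal, with the left recursion turned into a loop.

```python
import re

def parse(src):
    # Tokens: numbers, identifiers, and the punctuation ( ) + *.
    # Anything else (including whitespace) is skipped in this sketch.
    toks = re.findall(r"\d+|[A-Za-z_]\w*|[()+*]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def advance():
        nonlocal pos
        if pos >= len(toks):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return toks[pos - 1]

    def primary():
        # Primary: num | id | '(' Exp ')'
        tok = advance()
        if tok == "(":
            node = exp()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in (")", "+", "*"):
            raise SyntaxError("unexpected " + tok)
        return ("num", int(tok)) if tok.isdigit() else ("id", tok)

    def factor():
        # Factor: Primary | Factor Primary  (application, left-associative)
        node = primary()
        while peek() is not None and peek() not in (")", "+", "*"):
            node = ("app", node, primary())
        return node

    def term():
        # Term: Factor | Term '*' Factor
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def exp():
        # Exp: Term | Exp '+' Term
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    tree = exp()
    if pos != len(toks):
        raise SyntaxError("unexpected " + toks[pos])
    return tree
```

With this layering, `parse("map f xs + 1")` groups the application first and the addition last, because application lives at the tightest level (Factor).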
I am working on a toy language for fun (called NOP) using Bison and Flex, and I have hit a wall. I am trying to parse sequences that look like name1.name2 and name1.func1().name2 and I get a lot of reduce/reduce conflicts. I know why I am getting them, but I am having a heck of a time figuring out what to do about it.
So my question is whether this is a legitimate irregularity that can't be "fixed", or if my grammar is just wrong. The productions in question are compound_name and compound_symbol. It seems to me that they should parse separately. If I try to combine them I get conflicts with that as well. In the grammar, I am trying to illustrate what I want to do, rather than anything "clever".
%debug
%defines
%locations
%{
%}
%define parse.error verbose
%locations
%token FPCONST INTCONST UINTCONST STRCONST BOOLCONST
%token SYMBOL
%token AND OR NOT EQ NEQ LTE GTE LT GT
%token ADD_ASSIGN SUB_ASSIGN MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%token DICT LIST BOOL STRING FLOAT INT UINT NOTHING
%right ADD_ASSIGN SUB_ASSIGN
%right MUL_ASSIGN DIV_ASSIGN MOD_ASSIGN
%left AND OR
%left EQ NEQ
%left LT GT LTE GTE
%right ':'
%left '+' '-'
%left '*' '/' '%'
%left NEG
%right NOT
%%
program
: {} all_module {}
;
all_module
: module_list
;
module_list
: module_element {}
| module_list module_element {}
;
module_element
: compound_symbol {}
| expression {}
;
compound_name
: SYMBOL {}
| compound_name '.' SYMBOL {}
;
compound_symbol_element
: compound_name {}
| func_call {}
;
compound_symbol
: compound_symbol_element {}
| compound_symbol '.' compound_symbol_element {}
;
func_call
: compound_name '(' expression_list ')' {}
;
formatted_string
: STRCONST {}
| STRCONST '(' expression_list ')' {}
;
type_specifier
: STRING {}
| FLOAT {}
| INT {}
| UINT {}
| BOOL {}
| NOTHING {}
;
constant
: FPCONST {}
| INTCONST {}
| UINTCONST {}
| BOOLCONST {}
| NOTHING {}
;
expression_factor
: constant { }
| compound_symbol { }
| formatted_string {}
;
expression
: expression_factor {}
| expression '+' expression {}
| expression '-' expression {}
| expression '*' expression {}
| expression '/' expression {}
| expression '%' expression {}
| expression EQ expression {}
| expression NEQ expression {}
| expression LT expression {}
| expression GT expression {}
| expression LTE expression {}
| expression GTE expression {}
| expression AND expression {}
| expression OR expression {}
| '-' expression %prec NEG {}
| NOT expression { }
| type_specifier ':' SYMBOL {} // type cast
| '(' expression ')' {}
;
expression_list
: expression {}
| expression_list ',' expression {}
;
%%
This is a very stripped-down parser. The "real" one is about 600 lines. It has no conflicts (and passes a bunch of tests) if I don't try to use a function call in a variable name. I am looking at rewriting it as a packrat grammar if I cannot get Bison to do what I want. The rest of the project is here: https://github.com/chucktilbury/nop
$ bison -tvdo temp.c temp.y
temp.y: warning: 4 shift/reduce conflicts [-Wconflicts-sr]
temp.y: warning: 16 reduce/reduce conflicts [-Wconflicts-rr]
All of the reduce/reduce conflicts are the result of:
module_element
: expression
| compound_symbol
That creates an ambiguity because you also have
expression
: expression_factor
expression_factor
: compound_symbol
So the parser can't tell whether or not you need the unit productions to be reduced. Eliminating module_element: compound_symbol doesn't change the set of sentences which can be produced; it just requires that a compound_symbol be reduced through expression before becoming a module_element.
As Chris Dodd points out in a comment, the fact that two module_elements can appear consecutively without a delimiter creates an additional ambiguity: the grammar allows a - b to be parsed either as a single expression (and consequently module_element) or as two consecutive expressions —a and -b— and thus two consecutive module_elements. That ambiguity accounts for three of the four shift/reduce conflicts.
Both of these are probably errors introduced when you simplified the grammar, since it appears that module elements in the full grammar are definitions, not expressions. Removing modules altogether and using expression as the starting symbol leaves only a single conflict.
That conflict is indeed the result of an ambiguity between compound_symbol and compound_name, as noted in your question. The problem is seen in these productions (non-terminals shortened to make typing easier):
name: SYMBOL
| name '.' SYMBOL
symbol
: element
| symbol '.' element
element
: name
That means that both a and a.b are names and hence elements. But a symbol is a .-separated list of elements, so a.b could be derived in two ways:
symbol → element        symbol → symbol . element
       → name                  → element . element
       → a.b                   → name . element
                               → a . element
                               → a . name
                               → a . b
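The size of this ambiguity can be checked by counting parse trees directly from the shortened productions. The following is a rough Python sketch; the function names `count_name` and `count_symbol` are made up for illustration, and the argument n is the number of SYMBOLs in a dotted chain.

```python
from functools import lru_cache

def count_name(n):
    # name: SYMBOL | name '.' SYMBOL
    # Pure left recursion: exactly one tree for any chain of n >= 1 SYMBOLs.
    return 1 if n >= 1 else 0

@lru_cache(maxsize=None)
def count_symbol(n):
    # symbol: element | symbol '.' element, with element: name
    if n < 1:
        return 0
    total = count_name(n)  # symbol -> element -> name covering the whole chain
    for k in range(1, n):  # symbol -> symbol (k SYMBOLs) '.' element (n - k SYMBOLs)
        total += count_symbol(k) * count_name(n - k)
    return total
```

Here `count_symbol(2)` is 2, matching the two derivations of a.b shown above, and the count doubles with every additional dot.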
I fixed this by simplifying the grammar to:
compound_symbol
: compound_name
| compound_name '(' expression_list ')'
compound_name
: SYMBOL
| compound_symbol '.' SYMBOL
That gets rid of func_call and compound_symbol_element, which as far as I can see serve no purpose. I don't know if the non-terminal names remaining really capture anything sensible; I think it would make more sense to call compound_symbol something like name_or_call.
This grammar could be simplified further if higher-order functions were possible; the existing grammar forbids hof()(), presumably because you don't contemplate allowing a function to return a function object.
But even with higher-order functions, you might want to differentiate between function calls and member access/array subscript, because in many languages a function cannot return an object reference and hence a function call cannot appear on the left-hand side of an assignment operator. In other languages, such as C, the requirement that the left-hand side of an assignment operator be a reference ("lvalue") is enforced outside of the grammar. (And in C++, a function call or even an overloaded operator can return a reference, so the restriction needs to be enforced after type analysis.)
I've been trying to fix a shift/reduce conflict in my yacc specification and I can't seem to find where it is.
%union{
char* valueBase;
char* correspondencia;
}
%token pal palT palC
%type <valueBase> pal
%type <correspondencia> palT palC Smth
%%
Dicionario : Traducao
| Dicionario Traducao
;
Traducao : Palavra Correspondencia
;
Palavra : Base Delim
| Exp
;
Delim :
| ':'
;
Correspondencia :
| palC {printf("PT Tradução: %s\n",$1);}
;
Exp : Smth '-' Smth {aux = yylval.valueBase; printf("PT Tradução: %s %s %s\n", $1, aux, $3);}
;
Smth : palT {$$ = strdup($1);}
| {$$ = "";}
;
Base : pal {printf("EN Palavra base: %s\n",$1);}
;
Any help to find and fix this conflict would be extremely appreciated.
So looking at the y.output file from your grammar, you have a shift/reduce conflict in state 13:
State 13
10 Exp: Smth '-' . Smth
palT shift, and go to state 2
palT [reduce using rule 12 (Smth)]
$default reduce using rule 12 (Smth)
Smth go to state 16
Basically, what this is saying is that when parsing an Exp after having seen a Smth '-' and looking at a lookahead of palT, it doesn't know whether it should reduce an empty Smth to finish the Exp (leaving the palT as part of some later construct) OR shift the palT so it can then be reduced (recognized) as a Smth that completes this Exp.
The language you are recognizing is a sequence of one or more Traducao, each of which consists of a Palavra followed by an optional palC (Correspondencia that may be a palC or empty). That means that you might have a Palavra directly following another Palavra (the Correspondencia for the first one is empty). So the parser needs to find the boundary between one Palavra and the next just by looking at its current state and one token of lookahead, which is a problem.
In particular, when you have an input like palT '-' palT '-' palT, that is two consecutive Palavra, but it is not clear whether the middle palT belongs to the first one or the second. It is ambiguous, because it could be parsed successfully either way.
If you want the parser to just accept as much as possible into the first Palavra, then you can just accept the default resolution (of shift). If that is wrong and you would want the other interpretation, then you are going to need more lookahead to recognize this case, as it depends on whether or not there is a second '-' after the second palT or something else.
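The two readings can be enumerated mechanically. This is a small Python sketch under simplifying assumptions: only the Exp form of Palavra is considered, every Correspondencia is empty, and the token names "T" (for palT) and "-" as well as the function `splits` are invented for the example.

```python
def splits(tokens):
    """Return every way to cut `tokens` into consecutive Exp groups,
    where Exp: Smth '-' Smth and Smth: palT | (empty)."""
    if not tokens:
        return [[]]  # nothing left to parse: one way, with no further groups
    results = []
    for left in (0, 1):  # left Smth is empty (0) or one palT (1)
        if left == 1 and tokens[0] != "T":
            continue
        dash = left
        if dash >= len(tokens) or tokens[dash] != "-":
            continue
        for right in (0, 1):  # right Smth is empty (0) or one palT (1)
            end = dash + 1 + right
            if right == 1 and (end > len(tokens) or tokens[dash + 1] != "T"):
                continue
            for rest in splits(tokens[end:]):
                results.append([tokens[:end]] + rest)
    return results

# palT '-' palT '-' palT
parses = splits(["T", "-", "T", "-", "T"])
```

For this input `parses` contains exactly two groupings: one where the middle palT ends the first Exp, and one where it starts the second. Taking the default shift corresponds to the former.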
I am writing a simple calculator in yacc / bison.
The grammar for an expression looks somewhat like this:
expr
: NUM
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '+' expr %prec '*' { $$ = $2; }
| '-' expr %prec '*' { $$ = -$2; }
| '(' expr ')' { $$ = $2; }
| expr expr { $$ = $1 * $2; }
;
I have declared the precedence of the operators like this.
%left '+' '-'
%left '*' '/'
%nonassoc '('
The problem is with the last rule:
expr expr { $$ = $1 * $2; }
I want this rule because I want to be able to write expressions like 5(3+4)(3-24) in my calculator.
Is it possible to make this grammar unambiguous?
The ambiguity results from the fact that you allow unary operators (- expr), so 2 - 2 can be parsed either as a simple subtraction (yielding 0) or as an implicit product (of 2 and -2, yielding -4).
It's clear that subtraction is intended (otherwise subtraction would be impossible to represent) so it is necessary to ban the production expr: expr expr if the second expr on the right-hand side is a unary operation.
That can't be done with precedence declarations (or at least it cannot be done in an obvious way), so the best solution is to write out the grammar explicitly, without relying on precedence to disambiguate.
You will also have to decide exactly what is the precedence of implicit multiplication: either the same as explicit multiplication/division, or stronger. That affects how ab/cd is parsed. There is no consensus that I know of, so it is more or less up to you.
In the following, I assume that implicit multiplication binds more tightly. I also ensure that -ab is parsed as (-a)b, although -(ab) has the same end result (until you start dealing with things like non-arithmetic types and automatic conversions). So just take it as an example.
term: NUM
| '(' expr ')'
unop: term
| '-' unop
| '+' unop
conc: unop
| conc term
prod: conc
| prod '*' conc
| prod '/' conc
expr: prod
| expr '+' prod
| expr '-' prod
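To check how this layered grammar behaves on the inputs from the question, here is a hand-written recursive-descent evaluator in Python that follows the same five levels. It is only a sketch, not what bison would generate; the tokenizer, the function name `evaluate`, and the use of floating-point division are assumptions.

```python
import re

def evaluate(src):
    toks = re.findall(r"\d+|[-+*/()]", src)  # whitespace is skipped
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def starts_term(tok):
        return tok is not None and (tok == "(" or tok.isdigit())

    def term():
        # term: NUM | '(' expr ')'
        nonlocal pos
        tok = peek()
        if tok == "(":
            pos += 1
            value = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return value
        if tok is not None and tok.isdigit():
            pos += 1
            return int(tok)
        raise SyntaxError("expected a number or '('")

    def unop():
        # unop: term | '-' unop | '+' unop
        nonlocal pos
        if peek() == "-":
            pos += 1
            return -unop()
        if peek() == "+":
            pos += 1
            return unop()
        return term()

    def conc():
        # conc: unop | conc term   (implicit product; the right operand is never a unary op)
        value = unop()
        while starts_term(peek()):
            value *= term()
        return value

    def prod():
        # prod: conc | prod '*' conc | prod '/' conc
        nonlocal pos
        value = conc()
        while peek() in ("*", "/"):
            op = toks[pos]
            pos += 1
            rhs = conc()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():
        # expr: prod | expr '+' prod | expr '-' prod
        nonlocal pos
        value = prod()
        while peek() in ("+", "-"):
            op = toks[pos]
            pos += 1
            rhs = prod()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if pos != len(toks):
        raise SyntaxError("unexpected " + toks[pos])
    return result
```

Because `conc` only accepts a `term` on its right, `evaluate("2 - 2")` is a subtraction and gives 0, while `evaluate("5(3+4)(3-24)")` gives -735. With implicit multiplication binding tighter, `8/2(2)` is read as 8/(2*2).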
I'm writing an Antlr/Xtext parser for the CoffeeScript grammar. It's still at an early stage: I have only moved over a subset of the original grammar, and I am stuck with expressions. It's the dreaded "rule expression has non-LL(*) decision" error. I found some related questions here, Help with left factoring a grammar to remove left recursion and ANTLR Grammar for expressions. I also tried How to remove global backtracking from your grammar, but it just demonstrates a very simple case which I cannot use in real life. The post about ANTLR Grammar Tip: LL(*) and Left Factoring gave me more insights, but I still can't get a handle on it.
My question is how to fix the following grammar (sorry, I couldn't simplify it and still keep the error). I guess the troublemaker is the term rule, so I'd appreciate a local fix to it, rather than changing the whole thing (I'm trying to stay close to the rules of the original grammar). Pointers to tips on how to "debug" this kind of erroneous grammar in your head are also welcome.
grammar CoffeeScript;
options {
output=AST;
}
tokens {
AT_SIGIL; BOOL; BOUND_FUNC_ARROW; BY; CALL_END; CALL_START; CATCH; CLASS; COLON; COLON_SLASH; COMMA; COMPARE; COMPOUND_ASSIGN; DOT; DOT_DOT; DOUBLE_COLON; ELLIPSIS; ELSE; EQUAL; EXTENDS; FINALLY; FOR; FORIN; FOROF; FUNC_ARROW; FUNC_EXIST; HERECOMMENT; IDENTIFIER; IF; INDENT; INDEX_END; INDEX_PROTO; INDEX_SOAK; INDEX_START; JS; LBRACKET; LCURLY; LEADING_WHEN; LOGIC; LOOP; LPAREN; MATH; MINUS; MINUS; MINUS_MINUS; NEW; NUMBER; OUTDENT; OWN; PARAM_END; PARAM_START; PLUS; PLUS_PLUS; POST_IF; QUESTION; QUESTION_DOT; RBRACKET; RCURLY; REGEX; RELATION; RETURN; RPAREN; SHIFT; STATEMENT; STRING; SUPER; SWITCH; TERMINATOR; THEN; THIS; THROW; TRY; UNARY; UNTIL; WHEN; WHILE;
}
COMPARE : '<' | '==' | '>';
COMPOUND_ASSIGN : '+=' | '-=';
EQUAL : '=';
LOGIC : '&&' | '||';
LPAREN : '(';
MATH : '*' | '/';
MINUS : '-';
MINUS_MINUS : '--';
NEW : 'new';
NUMBER : ('0'..'9')+;
PLUS : '+';
PLUS_PLUS : '++';
QUESTION : '?';
RELATION : 'in' | 'of' | 'instanceof';
RPAREN : ')';
SHIFT : '<<' | '>>';
STRING : '"' (('a'..'z') | ' ')* '"';
TERMINATOR : '\n';
UNARY : '!' | '~' | NEW;
// Put it at the end, so keywords will be matched earlier
IDENTIFIER : ('a'..'z' | 'A'..'Z')+;
WS : (' ')+ {skip();} ;
root
: body
;
body
: line
;
line
: expression
;
assign
: assignable EQUAL expression
;
expression
: value
| assign
| operation
;
identifier
: IDENTIFIER
;
simpleAssignable
: identifier
;
assignable
: simpleAssignable
;
value
: assignable
| literal
| parenthetical
;
literal
: alphaNumeric
;
alphaNumeric
: NUMBER
| STRING;
parenthetical
: LPAREN body RPAREN
;
// term should be the same as expression except operation to avoid left-recursion
term
: value
| assign
;
questionOp
: term QUESTION?
;
mathOp
: questionOp (MATH questionOp)*
;
additiveOp
: mathOp ((PLUS | MINUS) mathOp)*
;
shiftOp
: additiveOp (SHIFT additiveOp)*
;
relationOp
: shiftOp (RELATION shiftOp)*
;
compareOp
: relationOp (COMPARE relationOp)*
;
logicOp
: compareOp (LOGIC compareOp)*
;
operation
: UNARY expression
| MINUS expression
| PLUS expression
| MINUS_MINUS simpleAssignable
| PLUS_PLUS simpleAssignable
| simpleAssignable PLUS_PLUS
| simpleAssignable MINUS_MINUS
| simpleAssignable COMPOUND_ASSIGN expression
| logicOp
;
UPDATE:
The final solution will use Xtext with an external lexer to avoid the intricacies of handling significant whitespace. Here is a snippet from my Xtext version:
CompareOp returns Operation:
AdditiveOp ({CompareOp.left=current} operator=COMPARE right=AdditiveOp)*;
My strategy is to make a working Antlr parser first, without a usable AST. (Well, whether this is a feasible approach would deserve a separate question.) So I don't care about tokens at the moment; they are included to make development easier.
I am aware that the original grammar is LR. I don't know how close I can stay to it when transforming to LL.
UPDATE2 and SOLUTION:
I could simplify my problem with the insights gained from Bart's answer. Here is a working toy grammar to handle simple expressions with function calls to illustrate it. The comment before expression shows my insight.
grammar FunExp;
ID: ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*;
NUMBER: '0'..'9'+;
WS: (' ')+ {skip();};
root
: expression
;
// atom and functionCall would go here,
// but they are reachable via operation -> term
// so they are omitted here
expression
: operation
;
atom
: NUMBER
| ID
;
functionCall
: ID '(' expression (',' expression)* ')'
;
operation
: multiOp
;
multiOp
: additiveOp (('*' | '/') additiveOp)*
;
additiveOp
: term (('+' | '-') term)*
;
term
: atom
| functionCall
| '(' expression ')'
;
When you generate a lexer and parser from your grammar, you see the following error printed to your console:
error(211): CoffeeScript.g:52:3: [fatal] rule expression has non-LL(*) decision due to recursive rule invocations reachable from alts 1,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
warning(200): CoffeeScript.g:52:3: Decision can match input such as "{NUMBER, STRING}" using multiple alternatives: 1, 3
As a result, alternative(s) 3 were disabled for that input
(I've emphasized the important bits)
This is only the first error, but you start with the first and with a bit of luck, the errors below that first one will also disappear when you fix the first one.
The error posted above means that when you're trying to parse either a NUMBER or a STRING with the parser generated from your grammar, the parser can go two ways when it ends up in the expression rule:
expression
: value // choice 1
| assign // choice 2
| operation // choice 3
;
Namely, choice 1 and choice 3 both can parse a NUMBER or a STRING, as you can see by the "paths" the parser can follow to match these 2 choices:
choice 1:
expression
value
literal
alphaNumeric : {NUMBER, STRING}
choice 3:
expression
operation
logicOp
compareOp
relationOp
shiftOp
additiveOp
mathOp
questionOp
term
value
literal
alphaNumeric : {NUMBER, STRING}
In the last part of the warning, ANTLR informs you that it ignores choice 3 whenever either a NUMBER or a STRING will be parsed, causing choice 1 to match such input (since it is defined before choice 3).
So, either the CoffeeScript grammar is ambiguous in this respect (and handles this ambiguity somehow), or your implementation of it is wrong (I'm guessing the latter :)). You need to fix this ambiguity in your grammar: i.e. don't let the expression's choices 1 and 3 both match the same input.
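One rough way to spot this kind of overlap by hand is to compare the FIRST sets of the competing alternatives. The Python sketch below does that for a cut-down version of the grammar; the `GRAMMAR` table is an abbreviation made for illustration (the operator chain from logicOp down to term is collapsed, and no rule is nullable), and a FIRST-set comparison is a coarser check than ANTLR's LL(*) analysis, so it only hints at where to look.

```python
# Nonterminals map to lists of alternatives; names not in the table are tokens.
GRAMMAR = {
    "expression": [["value"], ["assign"], ["operation"]],
    "assign": [["assignable", "EQUAL", "expression"]],
    "assignable": [["IDENTIFIER"]],
    "value": [["assignable"], ["literal"], ["parenthetical"]],
    "literal": [["NUMBER"], ["STRING"]],
    "parenthetical": [["LPAREN", "expression", "RPAREN"]],
    "term": [["value"], ["assign"]],
    "logicOp": [["term"]],  # operator chain collapsed to its leading symbol
    "operation": [["UNARY", "expression"], ["MINUS", "expression"], ["logicOp"]],
}

def first_of_symbol(symbol, path=frozenset()):
    """Tokens that can start `symbol` (assumes no alternative is empty)."""
    if symbol not in GRAMMAR:
        return {symbol}  # a token starts with itself
    if symbol in path:
        return set()     # guard against left-recursive cycles
    result = set()
    for alternative in GRAMMAR[symbol]:
        result |= first_of_symbol(alternative[0], path | {symbol})
    return result

def alternative_overlap(rule, i, j):
    """Tokens that can start both alternative i and alternative j of `rule`."""
    alts = GRAMMAR[rule]
    return first_of_symbol(alts[i][0]) & first_of_symbol(alts[j][0])

# Choices 1 and 3 of expression (indices 0 and 2)
overlap = alternative_overlap("expression", 0, 2)
```

In this cut-down grammar `overlap` includes NUMBER and STRING, the tokens named in the warning, which shows that choice 3 can begin with everything choice 1 can.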
I noticed 3 other things in your grammar:
1
Take the following lexer rules:
NEW : 'new';
...
UNARY : '!' | '~' | NEW;
Be aware that the token UNARY can never match the text 'new' since the token NEW is defined before it. If you want to let UNARY match this, remove the NEW rule and do:
UNARY : '!' | '~' | 'new';
2
On many occasions, you're collecting multiple types of tokens in a single one, like LOGIC:
LOGIC : '&&' | '||';
and then you use that token in a parser rule like this:
logicOp
: compareOp (LOGIC compareOp)*
;
But if you're going to evaluate such an expression at a later stage, you don't know what this LOGIC token matched ('&&' or '||') and you'll have to inspect the token's inner text to find that out. You'd better do something like this (at least, if you're doing some sort of evaluating at a later stage):
AND : '&&';
OR : '||';
...
logicOp
: compareOp ( AND compareOp // easier to evaluate, you know it's an AND expression
| OR compareOp // easier to evaluate, you know it's an OR expression
)*
;
3
You're skipping white spaces (and no tabs?) with:
WS : (' ')+ {skip();} ;
but doesn't CoffeeScript indent its code blocks with spaces (and tabs) just like Python? But perhaps you're going to do that in a later stage?
I just saw that the grammar you're looking at is a jison grammar (which is more or less a bison implementation in JavaScript). But bison, and therefore jison, generates LR parsers while ANTLR generates LL parsers. So trying to stay close to the rules of the original grammar will only result in more problems.
Originally in the example there was this
expr:
INTEGER
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
;
I wanted it to be simpler, so I wrote this (I realize it would do '+' for both add and subtract, but this is an example):
expr:
INTEGER
| expr addOp expr { $$ = $1 + $3; }
;
addOp:
'+' { $$ = $1; }
| '-' { $$ = $1; }
;
Now I get a shift/reduce error. It should be exactly the same (to me). What do I need to do to fix this?
Edit: To make things clear, the first version has NO warning/error. I use %left to set the precedence (and I will use %right for = and the other right-associative operators). However, it seems not to apply when going into sub-expressions.
Are you sure the conflicts involve just those two rules? The first one should have more conflicts than the second. At least with one symbol of look-ahead the decision to shift to a state with addOp on the stack is easier the second time around.
Update (I believe I can prove my theory... :-):
$ cat q2.y
%% expr: '1' | expr '+' expr | expr '-' expr;
$ cat q3.y
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
$ yacc q2.y
conflicts: 4 shift/reduce
$ yacc q3.y
conflicts: 2 shift/reduce
Having said all that, it's normal for yacc grammars to have ambiguities, and any real-life system is likely to have not just a few but literally dozens of shift/reduce conflicts. By definition, this conflict occurs when there is a perfectly valid shift available, so if you don't mind the parser taking that shift, then don't worry about it.
Now, in yacc you should prefer left-recursive rules. You can achieve that and get rid of your grammar ambiguity with:
$ cat q4.y
%% expr: expr addOp '1' | '1';
addOp: '+' | '-';
$ yacc q4.y
$
Note: no conflicts in the example above. If you like your grammar the way it is, just do:
%expect 2
%% expr: '1' | expr addOp expr;
addOp: '+' | '-';
The problem is that the rule
expr: expr addOp expr { ..action.. }
has no precedence. Normally rules get the precedence of the last token on the RHS, but this rule has no tokens on its RHS. You need to add a %prec directive to it:
expr: expr addOp expr %prec '+' { ..action.. }
to explicitly give the rule a precedence.
Note that doing this doesn't get rid of the shift/reduce conflict, which was present in your original grammar. It just resolves it according to the precedence rules you specify, which means that bison won't give you a message about it. In general, using precedence to resolve conflicts can be tricky, since it can hide conflicts that you might have wanted to resolve differently, or might be unresolvable in your grammar as written.
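As a rough illustration of how precedence-based resolution decides such a conflict, here is a Python sketch. It assumes Bison's documented behavior that a rule takes the precedence of the last terminal in its right-hand side unless %prec overrides it; the tables `TERMINALS` and `PRECEDENCE` and the function names are made up for the example.

```python
TERMINALS = {"+", "-", "*", "=", "INTEGER"}
# (level, associativity); a higher level binds tighter. Hypothetical declarations:
#   %right '='   %left '+' '-'   %left '*'
PRECEDENCE = {"=": (0, "right"), "+": (1, "left"), "-": (1, "left"), "*": (2, "left")}

def rule_precedence(rhs, prec_override=None):
    """Precedence of a rule: the %prec token if given, else the last terminal in the RHS."""
    if prec_override is not None:
        return PRECEDENCE.get(prec_override)
    for symbol in reversed(rhs):
        if symbol in TERMINALS:
            return PRECEDENCE.get(symbol)
    return None  # no terminal in the RHS, so the rule has no precedence

def resolve(rhs, lookahead, prec_override=None):
    """Decide a shift/reduce conflict between reducing `rhs` and shifting `lookahead`."""
    rule = rule_precedence(rhs, prec_override)
    token = PRECEDENCE.get(lookahead)
    if rule is None or token is None:
        return "conflict"  # reported to the user; the default action is to shift
    if rule[0] != token[0]:
        return "reduce" if rule[0] > token[0] else "shift"
    # Same level: associativity decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token[1]]
```

For example, `resolve(["expr", "addOp", "expr"], "+")` returns "conflict" because the rule contains no terminal, while passing `prec_override="+"` makes it return "reduce", which is the left-associative behavior the %left declaration asks for.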
Also see my answer to this question