Why do i have a shift reduce/conflict on the ')' and not '('? - yacc

I have syntax like
%(var)
and
%var
and
(var)
My rules are something like
optExpr:
| '%''('CommaLoop')'
| '%' CommaLoop
CommaLoop:
val | CommaLoop',' val
Expr:
MoreRules
| '(' val ')'
The problem is it doesnt seem to be able to tell if ) belongs to %(CommaLoop) or % (val) but it complains on the ) instead of the (. What the heck? shouldnt it complain on (? and how should i fix the error? i think making %( a token is a good solution but i want to be sure why $( isnt an error before doing this.

This is due to the way LR parsing works. LR parsing is effectively bottom-up, grouping together tokens according to the RHS of your grammar rules, and replacing them with the LHS. When the parser 'shifts', it puts a token on the stack, but doesn't actually match a rule yet. Instead, it tracks partially matched rules via the current state. When it gets to a state that corresponds to the end of the rule, it can reduce, popping the symbols for the RHS off the stack and pushing back a single symbol denoting the LHS. So if there are conflicts, they don't show up until the parser gets to the end of some rule and can't decide whether to reduce (or what to reduce).
In your example, after seeing % ( val, that is what will be on the stack (top is at the right side here). When the lookahead is ), it can't decide whether it should pop the val and reduce via the rule CommaLoop: val, or if it should shift the ) so it can then pop 3 things and reduce with the rule Expr: '(' val ')'
I'm assuming here that you have some additional rules such as CommaLoop: Expr, otherwise your grammar doesn't actually match anything and bison/yacc will complain about unused non-terminals.

Right now, your explanation and your grammar don't seem to match. In your explanation, you show all three phrases as having 'var', but your grammar shows the ones starting with '%' as allowing a comma-separated list, while the one without allows only a single 'val'.
For the moment, I'll assume all three should allow a comma-separated list. In this case, I'd factor the grammar more like this:
optExpr: '%' aList
aList: CommaLoop
| parenList
parenList: '(' CommaLoop ')'
CommaLoop:
| val
| CommaLoop ',' val
Expr: MoreRules
| parenList
I've changed optExpr and Expr so neither can match an empty sequence -- my guess is you probably didn't intend that to start with. I've fleshed this out enough to run it through byacc; it produces no warnings or errors.

Related

ANTLR - Checking for a String's "contruction"

Currently working with ANTLR and found out something interesting that's not working as I intended.
I try to run something along the lines of "test 10 cm" through my grammar and it fails, however "test 10 c m" works as the previous should. The "cm" portion of the code is what I call "wholeunit" in my grammar and it is as follows:
wholeunit :
siunit
| unitmod siunit
| wholeunit NUM
| wholeunit '/' wholeunit
| wholeunit '.' wholeunit
;
What it's doing right now is the "unitmod siunit" portion of the rule where unitmod = c and siunit = m .
What I'd like to know is how would I make it so the grammar would still follow the rule "unitmod siunit" without the need for a space in the middle, I might be missing something huge. (Yes, I have spaces and tabs marked to be skipped)
Probable cause is "cm" being considered another token together (possibly same token type as "test"), rather than "c" and "m" as separate tokens.
Remember that in ANTLR lexer, the rule matching the longest input wins.
One solution would possibly be to make the wholeunit a lexer rule rather than parser rule, and make sure it's above the rule that matches any word (like "test") - if same input can be matched by multiple rules, ANTLR selects the first rule in order they're defined in.

shift/reduce error in yacc

I know this part of my grammar cause error but I don't know how to fix it I even use %left and right but it didn't help. Can anybody please help me to find out what is the problem with this grammar.
Thanks in advance for your help.
%token VARIABLE NUM
%right '='
%left '+' '-'
%left '*' '/'
%left '^'
%start S_PROOP
EQUATION_SEQUENCE
: FORMULA '=' EQUATION
;
EQUATION
: FORMULA
| FORMULA '=' EQUATION
;
FORMULA
: SUM EXPRESSION
| PRODUCT EXPRESSION
| EXPRESSION '+' EXPRESSION
| EXPRESSION '*' EXPRESSION
| EXPRESSION '/' EXPRESSION
| EXPRESSION '^' EXPRESSION
| EXPRESSION '-' EXPRESSION
| EXPRESSION
;
EXPRESSION
: EXPRESSION EXPRESSION
| '(' EXPRESSION ')'
| NUM
| VARIABLE
;
Normal style is to use lower case for non-terminals and upper case for terminals; using upper case indiscriminately makes your grammar harder to read (at least for those of us used to normal yacc/bison style). So I've written this answer without so much recourse to the caps lock key.
The basic issue is the production
expression: expression expression
which is obviously ambiguous, since it does not provide any indication of associativity. In that, it is not different from
expression: expression '+' expression
but that conflict can be resolved using a precedence declaration:
%left '+'
The difference is that the first production does not have any terminal symbol, which makes it impossible to use precedence rules to disambiguate: in yacc/bison, precedence is always a comparison between a potential reduction and a potential shift. The potential reduction is some production which could be reduced; the potential shift is a terminal symbol which might be able to extend some production. Since the potential shift must be a terminal symbol, that is what is used in the precedence declaration; by default, the precedence of the potential reduction is defined by the last terminal symbol in the right-hand side but it is possible to specify a different terminal using a %prec marker. In any case, the precedence relation involves a terminal symbol, and if the grammar allows juxtaposition of two terminals, there is no relevant terminal symbol.
That's easy to work around, since you are under no obligation to use precedence to resolve conflicts. You could just avoid the conflict:
/* Left associative rule */
expr_sequence: expr | expr_sequence expr
/* Alternative: right associative rule */
expr_sequence: expr | expr expr_sequence
Since there is no indication what you intend by the juxtaposition, I'm unable to recommend one or the other of the above alternatives, but normally I would incline towards the first one.
That's not terribly different from your grammar for equation_sequence, although equation_sequence actually uses a terminal symbol so it could have been handled with a precedence declaration. It's worth noting that equation_sequence, as written, is right-associative. That's usually considered correct for assignment operators, (a = b = c + 3, in a language like C, is parsed as a = (b = c + 3) and not as (a = b) = c + 3, making assignment one of the few right-associative operators.) But if you are using = as an equality operator, it might not actually be what you intended.

Why do parser combinators don't backtrack in case of failure?

I looked through the Artima guide on parser combinators, which says that we need to append failure(msg) to our grammar rules to make error-reporting meaningful for the user
def value: Parser[Any] =
obj | stringLit | num | "true" | "false" | failure("illegal start of value")
This breaks my understanding of the recursive mechanism, used in these parsers. One one hand, Artima guide makes sense saying that if all productions fail then parser will arrive at the failure("illegal start of value") returned to the user. It however does not make sense, nevertheless, once we understand that grammar is not the list of value alternatives but a tree instead. That is, value parser is a node that is called when value is sensed at the input. This means that calling parser, which is also a parent, detects failure on value parsing and proceeds with value sibling alternative. Suppose that all alternatives to value also fail. Grandparser will try its alternatives then. Failed in turn, the process unwinds upward until the starting symbol parser fails. So, what will be the error message? It seems that the last alternative of the topmost parser is reported errorenous.
To figure out, who is right, I have created a demo where program is the topmost (starting symbol) parser
import scala.util.parsing.combinator._
object ExprParserTest extends App with JavaTokenParsers {
// Grammar
val declaration = wholeNumber ~ "to" ~ wholeNumber | ident | failure("declaration not found")
val term = wholeNumber | ident ; lazy val expr: Parser[_] = term ~ rep ("+" ~ expr)
lazy val statement: Parser[_] = ident ~ " = " ~ expr | "if" ~ expr ~ "then" ~ rep(statement) ~ "else" ~ rep(statement)
val program = rep(declaration) ~ rep(statement)
// Test
println(parseAll(program, "1 to 2")) // OK
println(parseAll(program, "1 to '2")) // failure, regex `-?\d+' expected but `'' found at '2
println(parseAll(program, "abc")) // OK
}
It fails with 1 to '2 due to extra ' tick. Yes, it seems to stuck in the program -> declaration -> num "to" num rule and does not even try the ident and failure("declaration not found") alternatives! I does not back track to the statements either for the same reason. So, neither my guess nor Artima guide seems right on what parser combinators are actually doing. I wonder: what is the real logic behind rule sensing, backtracking and error reporting in parser combinators? Why does the error message suggests that no backtracking to declaration -> ident | failure(), nor statements occured? What is the point of Artima guide suggesting to place failure() in the end if it is not reached as we see or ignored, as the backtracking logic should be, anyway?
Isn't parser combinator just a plain dumb PEG? It behaves like predictive parser. I expected it is PEG and, thus, that starting symbol parser should return all failed branches and wonder why/how does the actual parser manage to select the most appropriate failure.
Many parser combinators backtrack, unless they're in an 'or' block. As a speed optimization, they'll commit to the 1st successful 'or' item and not backtrack. So 1) try to avoid '|' as much as possible in your grammar, and 2) if using '|' is unavoidable, place the longest or least-likely-to-match items first.

Yacc "rule useless due to conflicts"

i need some help with yacc.
i'm working on a infix/postfix translator, the infix to postfix part was really easy but i'm having some issue with the postfix to infix translation.
here's an example on what i was going to do (just to translate an easy ab+c- or an abc+-)
exp: num {printf("+ ");} exp '+'
| num {printf("- ");} exp '-'
| exp {printf("+ ");} num '+'
| exp {printf("- ");} num '-'
|/* empty*/
;
num: number {printf("%d ", $1);}
;
obiously it doesn't work because i'm asking an action (with the printfs) before the actual body so, while compiling, I get many
warning: rule useless in parser due to conflict
the problem is that the printfs are exactly where I need them (or my output wont be an infix expression). is there a way to keep the print actions right there and let yacc identify which one it needs to use?
Basically, no there isn't. The problem is that to resolve what you've got, yacc would have to have an unbounded amount of lookahead. This is… problematic given that yacc is a fairly simple-minded tool, so instead it takes a (bad) guess and throws out some of your rules with a warning. You need to change your grammar so yacc can decide what to do with a token with only a very small amount of lookahead (a single token IIRC). The usual way to do this is to attach the interpretations of the values to the tokens and either use a post-action or, more practically, build a tree which you traverse as a separate step (doing print out of an infix expression from its syntax tree is trivial).
Note that when you've got warnings coming out of yacc, that typically means that your grammar is wrong and that the resulting parser will do very unexpected things. Refine it until you get no warnings from that stage at all. That is, treat grammar warnings as errors; anything else and you'll be sorry.

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSLs for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aides for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.