Left-recursion in ANTLR

The stm and stmList rules give me the error below. ANTLR seems to see them as a possible infinite recursion loop. How can I avoid it?
The following sets of rules are mutually left-recursive [stmList]
stmList: stm stmList | ;
stm: ifStm | whStm;
ifStm: ifPart elifPart* elsePart?;
ifPart: IF LB exp RB CLB stmList CRB;
elifPart: ELIF LB exp RB CLB stmList CRB;
elsePart: ELSE CLB stmList CRB;
whStm: WHILE LB exp RB CLB stmList CRB;
LB: '(';
RB: ')';
CLB: '{';
CRB: '}';
WHILE: 'While';
IF: 'If';
ELIF: 'Elif';
ELSE: 'Else';

This is probably because of the empty alt in stmList, though I also wonder why this error comes up; it doesn't seem right. In any case, I recommend not using empty alts unless you guard the other alt(s) with a predicate and "call" the containing rule unconditionally, because it is easy to forget that and run into problems. Instead, remove the empty alt and make the call optional:
stmList: stm stmList?;
elsePart: ELSE CLB stmList? CRB;
Additionally, stmList is written the way you would define it in yacc, where no EBNF suffixes are available. In ANTLR you can just write:
stmList: stm+;
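To make the effect of stm+ and stmList? concrete, here is a minimal hand-written recognizer sketch in Python. It is not what ANTLR generates. It only illustrates that the repetition becomes a plain loop and the optional list is decided at the call site. The exp rule is not defined in the question, so this sketch assumes it is a single identifier; the names Parser and count_statements are made up for the example.

```python
import re

# Assumption: `exp` is a single identifier; the original post does not define it.
TOKEN_RE = re.compile(r"\s*([(){}]|[A-Za-z_]\w*)")
KEYWORDS = {"While", "If", "Elif", "Else"}
PUNCT = {"(", ")", "{", "}"}


def tokenize(text):
    """Split text into punctuation and word tokens, skipping whitespace."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def exp(self):
        tok = self.peek()
        if tok is None or tok in KEYWORDS or tok in PUNCT:
            raise SyntaxError(f"expected an expression, got {tok!r}")
        self.pos += 1

    def stm_list_opt(self):
        # stmList? with stmList: stm+ ; a loop, with no empty alternative needed
        count = 0
        while self.peek() in ("If", "While"):
            count += self.stm()
        return count

    def block(self):
        # CLB stmList? CRB
        self.expect("{")
        count = self.stm_list_opt()
        self.expect("}")
        return count

    def cond_block(self, keyword):
        # KEYWORD LB exp RB CLB stmList? CRB
        self.expect(keyword)
        self.expect("(")
        self.exp()
        self.expect(")")
        return self.block()

    def stm(self):
        # Returns 1 for this statement plus the number of nested statements.
        if self.peek() == "If":
            nested = self.cond_block("If")
            while self.peek() == "Elif":      # elifPart*
                nested += self.cond_block("Elif")
            if self.peek() == "Else":         # elsePart?
                self.expect("Else")
                nested += self.block()
            return 1 + nested
        return 1 + self.cond_block("While")


def count_statements(text):
    """Parse text and return the total number of statements, nested ones included."""
    parser = Parser(tokenize(text))
    count = parser.stm_list_opt()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return count
```

Because the empty case is handled where the list is used (inside the braces), the list rule itself never has to match nothing.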

Related

warning: rule useless in parser due to conflicts

here CR is create
SP is space
RE is replace
I am getting the correct output for create or replace, but not for just create. Could anyone please tell me what is wrong with the code?
I am still getting this warning, and hence it is not working:
p.y:10.5-6: warning: rule useless in parser due to conflicts
%token CR TRI SP RE OR BEF AFT IOF INS UPD DEL ON OF
%%
s:e '\n' { printf("valid variable\n");f=1; };
e:TPR SP TRI;
TPR:CR
|CR SP OR SP RE;
It's rarely a good idea to pass whitespace to the parser. It only complicates the grammar, providing little or no additional value.
It is also always a good idea to adopt a single convention for the names of terminals and non-terminals. If you are going to use ALL CAPS for terminals (which is the normal convention), then don't use it also for non-terminals such as TPR. Also, the use of meaningful names and literal strings will make your grammar much more readable.
The "rule useless in parser due to conflicts" warning is always accompanied by one or more shift/reduce or reduce/reduce conflicts. Normally, the solution is to fix the conflicts. In this case, you could do so by simply not passing the whitespace to the parser.
Here is your grammar, I think: (I'm guessing what your abbreviations mean)
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer TABLE_IDENTIFIER
table_producer
: "create"
| "create" "or" "replace"
Written this way, without the whitespace, the grammar does not have any conflicts. If we reintroduce the whitespace:
%token CR "create" OR "or" RE "replace"
%token TABLE_IDENTIFIER SPACE
%%
statement: expr '\n' { /* Some action */ }
expr: table_producer SPACE TABLE_IDENTIFIER
table_producer
: "create"
| "create" SPACE "or" SPACE "replace"
then there is a shift/reduce conflict after create is recognized. The lookahead will be SPACE, but the parser cannot know whether that SPACE is part of the second table_producer production (create or...) or part of the expr production (create table_name).
There must be some punctuation between two words, otherwise they would be recognized by the lexer as a single word. So the fact that the words are separated by whitespace is not meaningful; if the lexer simply keeps the whitespace to itself, as is normal, then the conflict disappears.
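As an illustration of a lexer that keeps the whitespace to itself, here is a small Python sketch. It stands in for the flex scanner the question presumably uses. The token names CR, OR, RE and TABLE_IDENTIFIER follow the grammar above; the NEWLINE name and the identifier pattern are assumptions made for the example.

```python
import re

# Keyword spellings follow the %token declarations above.
KEYWORDS = {"create": "CR", "or": "OR", "replace": "RE"}

# Assumption: identifiers are letters, digits and underscores.
TOKEN_RE = re.compile(
    r"(?P<ws>[ \t]+)"
    r"|(?P<nl>\n)"
    r"|(?P<word>[A-Za-z_][A-Za-z0-9_]*)"
)


def tokenize(text):
    """Return the token names the parser would see; whitespace is never one of them."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup == "ws":
            continue  # separates words, but is not passed to the parser
        if m.lastgroup == "nl":
            tokens.append("NEWLINE")
        else:
            tokens.append(KEYWORDS.get(m.group(), "TABLE_IDENTIFIER"))
    return tokens
```

With this token stream, the parser sees either OR or TABLE_IDENTIFIER right after CR, so one token of lookahead is enough to pick the production.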

How to resolve ambiguity without backtracking in ANTLR?

expr
: atom
| atom BINARY expr -> ^(BINARY atom expr)
;
I would like to resolve this without backtracking if possible. Using backtracking breaks my code for some reason. There isn't a lot of documentation on syntactic predicates, so I'm wondering how to do this with predicates instead.

How to resolve a shift/reduce conflict forcing a shift or a reduce?

When there is a shift/reduce conflict in Yacc/Bison, is it possible to force the conflict to be resolved exactly as you want? In other words: is it possible to explicitly force it to prioritize the shift or the reduce?
From what I have read, if you are happy with the default resolution you can tell the generator not to complain about it. I really don't like this, because it hides what should be a deliberate choice.
Another option is to rewrite the grammar to fix the issue. I don't know if this is always possible, and it often makes the grammar much harder to understand.
Finally, I have read that the precedence rules can fix this. I tried that blindly in many ways and couldn't make it work. Is it possible to use the precedence rules for that? How?
Though my ambiguous grammar is very different, I can use the classical if-then-else from the Bison manual to give a concrete example:
%token IF THEN ELSE variable
%%
stmt:
expr
| if_stmt
;
if_stmt:
IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr:
variable
;
As far as I can tell, it is not possible to direct the parser to resolve an S/R conflict by choosing to reduce. Though I might be wrong, it is probably ill-advised to proceed this way anyway. Therefore, the only possibilities are either rewriting the grammar, or resolving the conflict by shifting.
The following use of right precedence for THEN and ELSE describes the desired behavior for the if-then-else statement (that is, associating else with the innermost if statement).
%token IF THEN ELSE variable
%right THEN ELSE
%%
stmt
: expr
| if_stmt
;
if_stmt
: IF expr THEN stmt
| IF expr THEN stmt ELSE stmt
;
expr
: variable
;
By choosing right association for the above tokens, the following sequence:
IF expr1 THEN IF expr2 THEN IF expr3 THEN x ELSE y
is parsed as:
IF expr1 THEN (IF expr2 THEN (IF expr3 THEN x ELSE y))
and Bison does not complain about the case any longer.
Remember that you can always run bison file.y -r all and inspect file.output in order to see if the generated parser state machine is correct.
Well, the default resolution for a shift/reduce conflict is to shift, so if that's what you want, you don't need to do anything (other than ignoring the warning).
If you want to resolve a shift/reduce conflict by reducing, you can use the precedence rules: just make sure that the rule to be reduced has higher precedence than the token to be shifted. The tricky part comes when there are multiple shift/reduce conflicts involving the same rules and tokens: it may not be possible to find a globally consistent set of precedences for the rules and tokens that resolves things the way you want.
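The comparison Bison makes when precedence is declared can be modelled in a few lines. The Python sketch below is only a model of the documented behavior, not Bison code. The function name resolve and the integer precedence levels are invented for the example, and it covers only %left, %right and %nonassoc. By default a rule takes the precedence of the last terminal it mentions, unless %prec overrides it.

```python
from typing import Optional


def resolve(rule_prec: Optional[int], token_prec: Optional[int], token_assoc: Optional[str]) -> str:
    """Model how a shift/reduce conflict is resolved from declared precedence.

    rule_prec   -- precedence level of the rule that could be reduced (None if undeclared)
    token_prec  -- precedence level of the lookahead token (None if undeclared)
    token_assoc -- 'left', 'right' or 'nonassoc' for the lookahead token
    Higher numbers mean higher precedence (declared later in the file).
    """
    if rule_prec is None or token_prec is None:
        # No precedence to compare: the default is to shift, and the conflict is still reported.
        return "shift"
    if rule_prec > token_prec:
        return "reduce"
    if rule_prec < token_prec:
        return "shift"
    # Same level: associativity of the token decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token_assoc]
```

For the if-then-else grammar with %right THEN ELSE, the rule IF expr THEN stmt has the precedence of THEN, the lookahead is ELSE at the same level, and right associativity selects the shift. To force a reduce instead, the rule would need a higher level than ELSE, for example by declaring ELSE on an earlier line than THEN.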

Using precedence in Bison for unary minus doesn't solve shift/reduce conflict

I'm devising a very simple grammar, where I use the unary minus operator. However, I get a shift/reduce conflict. In the Bison manual, and everywhere else I look, it says that I should define a new token and give it higher precedence than the binary minus operator, and then use "%prec TOKEN" in the rule.
I've done that, but I still get the warning. Why?
I'm using bison (GNU Bison) 2.4.1. The grammar is shown below:
%{
#include <string>
extern "C" int yylex(void);
%}
%union {
std::string token;
}
%token <token> T_IDENTIFIER T_NUMBER
%token T_EQUAL T_LPAREN T_RPAREN
%right T_EQUAL
%left T_PLUS T_MINUS
%left T_MUL T_DIV
%left UNARY
%start program
%%
program : statements expr
;
statements : '\n'
| statements line
;
line : assignment
| expr
;
assignment : T_IDENTIFIER T_EQUAL expr
;
expr : T_NUMBER
| T_IDENTIFIER
| expr T_PLUS expr
| expr T_MINUS expr
| expr T_MUL expr
| expr T_DIV expr
| T_MINUS expr %prec UNARY
| T_LPAREN expr T_RPAREN
;
%prec doesn't do as much as you might hope here. It tells Bison that in a situation where you have - a * b you want to parse this as (- a) * b instead of - (a * b). In other words, here it will prefer the UNARY rule over the T_MUL rule. In either case, you can be certain that the UNARY rule will get applied eventually, and it is only a question of the order in which the input gets reduced to the unary argument.
In your grammar, things are very different. Any sequence of line non-terminals makes up statements, and nothing says that a line non-terminal must end at an end-of-line. In fact, any expression can be a line. So there are basically two ways to parse a - b: either as a single line with a binary minus, or as two “lines”, the second starting with a unary minus. Nothing decides which of these rules applies, so rule-based precedence won't work here yet.
The solution is to correct your line splitting by requiring every line to actually end with, or be followed by, an end-of-line symbol.
If you really want the behaviour your grammar indicates with respect to line endings, you'd need two separate non-terminals for expressions which can and which cannot start with a T_MINUS. You'd have to propagate this up the tree: the first line may start with a unary minus, but subsequent ones must not. Inside a parenthesis, starting with a minus would be all right again.
The expr rule is ok (without the %prec UNARY). Your shift/reduce conflict comes from the rule:
statements : '\n'
| statements line
;
The rule does not do what you think. For example, you can write:
a + b c + d
I think that is not supposed to be valid input.
But also the program rule is not very sane:
program : statements expr
;
The rules should be something like:
program: lines;
lines: line | lines line;
line: statement '\n' | '\n';
statement: assignment | expr;

How can I construct a clean, Python like grammar in ANTLR?

G'day!
How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?
I'm trying to write a simple DSL for expressions:
# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)
Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:
# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
+AnotherValueWithAGratuitouslyLongName)
I started off with an ANTLR grammar like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL!?
;
empty_line
: NL;
assignment
: ID '=' expr
;
// ... and so on
It seems simple, but I'm already in trouble with the newlines:
warning(200): StackOverflowQuestion.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Graphically, in org.antlr.works.IDE:
Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png
I've kicked the grammar around, but always end up with violations of expected behavior:
A newline is not required at the end of the file
Empty lines are acceptable
Everything in a line from a pound sign onward is discarded as a comment
Assignments end with end-of-line, not semicolons
Expressions can span multiple lines if wrapped in brackets
I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.
Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?
I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:
Count parentheses, brackets, and braces, and don't generate NL tokens while there are unclosed groups. That'll give you line continuations for free without your grammar being any the wiser.
Always generate an NL token at the end of file whether or not the last line ends with a '\n' character, then you don't have to worry about a special case of a statement without a NL. Statements always end with an NL.
The second point would let you simplify your grammar to something like this:
exprlist
: ( assignment_statement | empty_line )* EOF!
;
assignment_statement
: assignment NL
;
empty_line
: NL
;
assignment
: ID '=' expr
;
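The two tokenizer rules described above can be sketched in Python as follows. This is an illustration of the idea rather than ANTLR lexer code. The token kinds ID, NUM and NL, the operator set, and the choice to use each operator's own text as its kind are all assumptions made for the example.

```python
import re

# Assumption: this small token set is enough for the sample DSL in the question.
TOKEN_RE = re.compile(
    r"(?P<comment>#[^\n]*)"
    r"|(?P<ws>[ \t\r]+)"
    r"|(?P<nl>\n)"
    r"|(?P<id>[A-Za-z_]\w*)"
    r"|(?P<num>\d+)"
    r"|(?P<op>[-+*/=,()\[\]{}])"
)

OPEN, CLOSE = "([{", ")]}"


def tokenize(text):
    """Return (kind, value) pairs; NL is suppressed inside brackets and forced at end of input."""
    tokens = []
    depth = 0  # number of currently unclosed (, [ or {
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind in ("comment", "ws"):
            continue  # comments and blanks never reach the parser
        if kind == "nl":
            if depth == 0:
                tokens.append(("NL", "\n"))  # only a top-level newline ends a statement
            continue
        if kind == "op":
            if value in OPEN:
                depth += 1
            elif value in CLOSE:
                depth = max(depth - 1, 0)
            tokens.append((value, value))
        else:
            tokens.append((kind.upper(), value))
    # Make sure the stream ends with NL even if the last line has no trailing newline.
    if not tokens or tokens[-1][0] != "NL":
        tokens.append(("NL", "\n"))
    return tokens
```

With a token stream like this, the grammar only ever sees NL where a statement really ends, so NL can stay mandatory in assignment_statement.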
How about this?
exprlist
: (expr)? (NL+ expr)* NL!? EOF!
;
expr
: assignment | ...
;
assignment
: ID '=' expr
;
I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.
While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.
In your case, the parser doesn't know if it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.
My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!
It may seem a little unsavory, but in reality it will save you a lot of future headaches.