Removing Left Recursion in ANTLR

As explained in Removing left recursion, there are two ways to deal with left recursion:
Modify the original grammar to remove the left recursion using some procedure, or
write the grammar from the start so that it has no left recursion.
What do people normally use for removing (or avoiding) left recursion with ANTLR? I've used flex/bison for parsers, but I need to use ANTLR. The only thing that concerns me about using ANTLR (or an LL parser in general) is left recursion removal.
In practical terms, how serious a problem is removing left recursion in ANTLR? Is this a showstopper in using ANTLR? Or does nobody in the ANTLR community care about it?
I like ANTLR's idea of AST generation. In terms of getting an AST quickly and easily, which of the two left-recursion-removal methods is preferable?
Added
I did some experimenting with the following grammar.
E -> E + T|T
T -> T * F|F
F -> INT | ( E )
After left recursion removal, I get the following:
E -> TE'
E' -> null | + TE'
T -> FT'
T' -> null | * FT'
I came up with the following ANTLR representation. Even though it's relatively simple and straightforward, it seems that a grammar written without left recursion in the first place would be the better way to go.
grammar T;
options {
    language=Python;
}

start returns [value]
    : e {$value = $e.value}
    ;

e returns [value]
    : t ep
        {
        $value = $t.value
        if $ep.value != None:
            $value += $ep.value
        }
    ;

ep returns [value]
    : {$value = None}
    | '+' t r=ep
        {
        $value = $t.value
        if $r.value != None:
            $value += $r.value
        }
    ;

t returns [value]
    : f tp
        {
        $value = $f.value
        if $tp.value != None:
            $value *= $tp.value
        }
    ;

tp returns [value]
    : {$value = None}
    | '*' f r=tp
        {
        $value = $f.value
        if $r.value != None:
            $value *= $r.value
        }
    ;

f returns [value]
    : INT {$value = int($INT.text)}
    | '(' e ')' {$value = $e.value}
    ;

INT : '0'..'9'+ ;
WS  : (' '|'\n'|'\r')+ {$channel=HIDDEN;} ;
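For reference, here is a minimal test driver sketch, assuming the ANTLR 3 Python runtime and the generated TLexer.py and TParser.py modules (those names follow from "grammar T;"; the exact API can vary between ANTLR versions):

import antlr3                 # ANTLR 3 Python runtime
from TLexer import TLexer     # generated by ANTLR from the grammar above
from TParser import TParser   # generated by ANTLR from the grammar above

chars = antlr3.ANTLRStringStream('1 + 2 * 3')
tokens = antlr3.CommonTokenStream(TLexer(chars))
parser = TParser(tokens)
print(parser.start())  # expected: 7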

Consider something like a typical parameter list:
parameter_list: parameter
              | parameter_list ',' parameter
              ;
Since you don't care about anything like precedence or associativity with parameters, this is fairly easy to convert to right recursion, at the expense of adding an extra production:
parameter_list: parameter more_params
              ;

more_params: /* empty */
           | ',' parameter more_params
           ;
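If you're using ANTLR specifically, its EBNF operators let you skip the helper rule entirely; a sketch in ANTLR notation:

parameter_list : parameter (',' parameter)* ;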
For the most serious cases, you might want to spend some time in the Dragon Book. Doing a quick check, this is covered primarily in chapter 4.
As far as seriousness goes, I'm pretty sure ANTLR simply won't accept a grammar that contains left recursion, which would put it into the "absolute necessity" category.

In practical terms, how serious a problem is removing left recursion in ANTLR? Is this a showstopper in using ANTLR?
I think that you have a misunderstanding of left-recursion. It is a property of the grammar, not of the parser generator or the interaction between the parser generator and the specification. It happens when the first symbol on the right side of a rule is the nonterminal of the rule itself (directly, or indirectly through other rules).
To understand the inherent problem here, you need to know something about how a recursive-descent (LL) parser works. In an LL parser, the rule for each nonterminal symbol is implemented by a function corresponding to that rule. So, suppose I have a grammar like this:
S -> A B
A -> a
B -> b
Then, the parser would look (roughly) like this:
boolean eat(char x) {
    // if the next character is x, advance the stream and return true
    // otherwise, return false
}

boolean S() {
    if (!A()) return false;
    if (!B()) return false;
    return true;
}

boolean A() {
    return eat('a');
}

boolean B() {
    return eat('b');
}
However, what happens if I change the grammar to be the following?
S -> A B
A -> A c | null
B -> b
Presumably, I want this grammar to represent a language like c*b. The corresponding function in the LL parser would look like this:
boolean A() {
    if (!A()) return false; // stack overflow! We continually call A()
                            // without consuming any input.
    eat('c');
    return true;
}
So, we can't have left-recursion. Rewrite the grammar as:
S -> A B
A -> c A | null
B -> b
and the parser changes as such:
boolean A() {
    if (!eat('c')) return true;
    A();
    return true;
}
(Disclaimer: this is my rudimentary approximation of an LL parser, meant only for demonstration purposes regarding this question. It has obvious bugs in it.)
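To make the difference concrete, here is a runnable Python sketch (my own illustration, not generated by any tool) of the right-recursive version, recognizing the language c*b:

# Recursive-descent recognizer for: S -> A B, A -> c A | null, B -> b
def parse(s):
    pos = 0

    def eat(ch):
        nonlocal pos
        if pos < len(s) and s[pos] == ch:
            pos += 1
            return True
        return False

    def A():
        # A -> c A | null: each call consumes one 'c' before recursing,
        # so the recursion always makes progress (no stack overflow).
        if not eat('c'):
            return True
        return A()

    def B():
        return eat('b')

    return A() and B() and pos == len(s)

assert parse('cccb')
assert parse('b')
assert not parse('cc')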

I can't speak for ANTLR, but in general, the way to eliminate a left recursion of the form:
A -> A B
-> B
is to change it to be:
A -> B+
(note that B must appear at least once)
or, if ANTLR doesn't support the Kleene closure, you can do:
A -> B B'
B' -> B B'
   -> (empty)
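For what it's worth, ANTLR's EBNF syntax does support the closure form directly; in ANTLR notation the first rewrite would look something like:

a : b+ ;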
If you provide an example of your rules that are having conflicts, I can provide a better, more specific answer.

If you are writing the grammar, then of course you try to write it to avoid the pitfalls of your particular parser generator.
Usually, in my experience, I get some reference manual for the (legacy) language of interest, and it already contains a grammar or railroad diagrams, and it is what it is.
In that case, left recursion removal from a grammar is pretty much done by hand. There's no market for left-recursion-removal tools, and if you had one, it would be specialized to a grammar syntax that didn't match the grammar syntax you have.
Doing this removal is mostly a matter of sweat in many cases, and there isn't usually tons of it. So the usual approach is get out your grammar knife and have at it.
I don't think how you remove left recursion changes how ANTLR gets trees. You have to do the left recursion removal first, or ANTLR (or whatever LL parser generator you are using) simply won't accept your grammar.
There are those of us that don't want the parser generator to put any serious constraints on what we can write for a context free grammar. In this case you want to use something like a GLR parser generator, which handles left- or right-recursion with ease. Unreasonable people can even insist on automated AST generation with no effort on the part of the grammar writer. For a tool that can do both, see DMS Software Reengineering Toolkit.

This is only orthogonally relevant, but I just published a preprint of a paper on a new parsing method that I call "pika parsing" (cf. packrat parsing) that directly handles left-recursive grammars without the need for rule rewriting.
https://arxiv.org/abs/2005.06444

Related

How to synthesise compiler testing data?

I am writing a simple compiler as school work. I am looking for an automated approach to generate both positive and negative testing data for my compiler, given the formal grammar and other specifications. The language I am dealing with is of moderate size, with 38 or so non-terminals. For the sake of illustration, here is a snapshot of the grammar:
program: const_decl* declaration* ENDMARKER
# statement
stmt: flow_stmt | '{' stmt* '}' | NAME [stmt_trailer] ';' | ';'
stmt_trailer: arglist | ['[' expr ']'] '=' expr
flow_stmt: if_stmt | for_stmt | while_stmt | read_stmt ';' | write_stmt ';' | return_stmt ';'
return_stmt: 'return' ['(' expr ')']
if_stmt: 'if' '(' condition ')' stmt ['else' stmt]
condition: expr ('<'|'<='|'>'|'>='|'!='|'==') expr | expr
for_stmt: ('for' '(' NAME '=' expr ';' condition ';'
           NAME '=' NAME ('+'|'-') NUMBER ')' stmt)
Are there any tools to generate input files with the help of the grammar? The hand-written tests are too tedious or too weak to discover problems. An example of this language:
void main() {
    int N;
    int temp;
    int i, j;
    int array_size;
    reset_heap;
    scanf(N);
    for (i = 0; i < N; i = i + 1) {
        scanf(array_size);
        if (array_size > max_heap_size) {
            printf("array_size exceeds max_heap_size");
        } else {
            for (j = 0; j < array_size; j = j + 1) {
                scanf(temp);
                heap[j] = temp;
            }
            heap_sort(array_size);
            print_heap(array_size);
        }
    }
}
Generating controllable testing data automatically could save days of work. Given the simplicity of the language, there must be some way to do this effectively. Any pointers and insights are greatly appreciated.
Any pointers and insights are greatly appreciated.
This should have the subtopic of How to avoid combinatorial explosion when generating test data.
I would not be surprised if there are tools to do this, but having had the same need to generate test data for grammars, I have created a few one-off applications.
One of the best series of articles I have found on this is by Eric Lippert: Every Binary Tree There Is. Think of the BNF as converted to binary operators and then to an AST as you read the tree. However, he uses Catalan trees (every branch has two leaves), whereas when I wrote my app I preferred Motzkin trees (a branch can have one or two leaves).
Also, he did his in C# with LINQ, and I did mine in Prolog using DCG.
Generating the data based on the BNF or DCG is not hard; the real trick is to limit the area and size of the expansion, and to inject bad data.
By area of expansion: let's say you want to test nested if statements three levels deep, but the code still has to be valid and compile. Obviously you need boilerplate code to make it compile, and then you start varying the deeply nested if by adding or removing the else clause. So you need to put in constraints so that the boilerplate code stays constant and only the part under test varies.
By size of expansion: let's say you want to test conditional expressions. You can easily calculate that if you have many operators and you want to test them in all combinations, you soon run into combinatorial explosion. The trick is to ensure you test deeply enough and with enough breadth, but not every combination. Again, the judicious use of constraints helps.
So the point of all of this is that you start with a tool that takes in the BNF and generates valid code. Then you modify the BNF to add constraints and modify the generator to understand the constraints to generate the code examples.
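As a minimal sketch of that first step (my own toy illustration, with a hypothetical two-rule grammar and a depth cap standing in for the constraints discussed above):

import random

# Nonterminals map to lists of alternatives; anything not in the
# table is treated as a terminal.
GRAMMAR = {
    'expr': [['term', '+', 'expr'], ['term']],
    'term': [['INT'], ['(', 'expr', ')']],
}

def generate(symbol, depth=0, max_depth=4):
    if symbol not in GRAMMAR:
        # Terminal: INT expands to a random literal, others stand for themselves.
        return str(random.randint(0, 99)) if symbol == 'INT' else symbol
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Constrain the size of the expansion: once deep enough, prefer the
        # short alternatives so that generation is guaranteed to terminate.
        alternatives = [a for a in alternatives if len(a) == 1] or alternatives
    return ' '.join(generate(s, depth + 1, max_depth)
                    for s in random.choice(alternatives))

print(generate('expr'))  # e.g. "42 + ( 7 + 13 )"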
Then you modify the BNF for invalid data and likewise the generator to understand those rules.
After that is working you can then start layering on levels of automation.
If you do go this route and decide that you will have to learn Prolog, take a look at Mercury first. I have not done this with Mercury, but if I do it again Mercury is high on the list.
While my actual code is not public, this and this are the closest public examples.
Along the way I had some fun with it in Code Golf.
When generating terminals such as reserved words or values for types, you can use a predefined list with both valid and invalid data, e.g. for if, if the language is case sensitive, I would include in the list if, If, IF, iF, etc. For value types such as unsigned byte I would include -1, 0, 255, and 256.
When I was testing basic binary math expressions with +, -, * and ^, I generated all the tests with five basic numbers: -2, -1, 0, 1, and 2. I thought it would be useless since I already had hundreds of test cases, but since it took only a few minutes to generate all of the test cases and several hours to run them, to my surprise it found a pattern I did not cover. The point here is that, contrary to what most people say about having too many test cases, it is only time on a computer; so change a few constraints and do run the large number of tests.

How to modify parsing grammar to allow assignment and non-assignment statements?

So the question is about the grammar below. I'm working on a mini-interpreted language for fun (we learned about some compiler design in class, so I want to take it to the next level and try something on my own). I'm stuck trying to make the non-terminal symbol Expr.
Statement ::= Expr SC
Expr ::= /* I need help here */
Assign ::= Name EQUAL Expr
AddSub ::= MulDiv {(+|-) AddSub}
MulDiv ::= Primary {(*|/) MulDiv}
Primary ::= INT | FLOAT | STR | LP Expr RP | Name
Name ::= ID {. Name}
Expr has to be made such that Statement must allow for the two cases:
x = 789; (regular assignment, followed by semicolon)
x+2; (no assignment, just calculation, discarded; followed by a semicolon)
The purpose of the second case is to setup the foundation for more changes in the future. I was thinking about unary increment and decrement operators, and also function calls; both of which don't require assignment to be meaningful.
I've looked at other grammars (C# namely), but it was too complicated and lengthy to understand. Naturally I'm not looking for solutions, but only for guidance on how I could modify my grammar.
All help is appreciated.
EDIT: I should say that my initial thought was Expr ::= Assign | AddSub, but that wouldn't work since it would create ambiguity since both could start with the non-terminal symbol Name. I have made my tokenizer such that it allows one token look ahead (peek), but I have not made such a thing for the non terminals, since it would be trying to fix a problem that could be avoided (ambiguity). In the grammar, the terminals are the ones that are all-caps.
The simplest solution is the one actually taken by the designers of C, and thus by the various C derivatives: treat assignment simply as yet another operator, without restricting it to being at the top-level of a statement. Hence, in C, the following is unproblematic:
while ((ch = getchar()) != EOF) { ... }
Not everyone will consider that good style, but it is certainly common (particularly in the clauses of the for statement, whose syntax more or less requires that assignment be an expression).
There are two small complications, which are relatively easy to accomplish:
Logically, and unlike most operators, assignment associates to the right so that a = b = 0 is parsed as a = (b = 0) and not (a = b) = 0 (which would be highly unexpected). It also binds very weakly, at least to the right.
Opinions vary as to how tightly it should bind to the left. In C, for the most part a strict precedence model is followed, so that a = 2 + b = 3 is rejected since it is parsed as a = ((2 + b) = 3). a = 2 + b = 3 might seem like terrible style, but consider also a < b ? (x = a) : (y = a). In C++, where the result of the ternary operator can be a reference, you could write that as (a < b ? x : y) = a, in which the parentheses are required even though assignment has lower precedence than the ternary operator.
None of these options are difficult to implement in a grammar, though.
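Concretely, applied to the grammar in the question, one common shape is the following sketch (my illustration, not the only option; [X] marks an optional X, and the right recursion in Assign is what yields right associativity). The restriction of the left-hand side to names can then be checked after the parse, as discussed next:

Expr   ::= Assign
Assign ::= AddSub [EQUAL Assign]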
In many languages, the left-hand side of an assignment has a restricted syntax. In C++, which has reference values, the restriction could be considered semantic, and I believe it is usually implemented with a semantic check, but in many C derivatives lvalue can be defined syntactically. Such definitions are unambiguous, but they are often not amenable to parsing with a top-down grammar, and they can create complications even for a bottom-up grammar. Doing the check post-parse is always a simple solution.
If you really want to distinguish assignment statements from expression statements, then you indeed run into the problem of prediction failure (not ambiguity) if you use a top-down parsing technique such as recursive descent. Since the grammar is not ambiguous, a simple solution is to use an LALR(1) parser generator such as bison/yacc, which has no problems parsing such a grammar since it does not require an early decision as to which kind of statement is being parsed. On the whole, the use of LALR(1) or even GLR parser generators simplifies implementation of a parser by allowing you to specify a grammar in a form which is easily readable and corresponds to the syntactic analysis. (For example, an LALR(1) parser can handle left-associative operators naturally, while an LL(1) grammar can only produce right-associative parses and therefore requires some kind of reconstruction of the syntax tree.)
A recursive descent parser is a computer program, not a grammar, and its expressiveness is thus not limited by the formal constraints of LL(1) grammars. That is both a strength and a weakness: the strength is that you can find solutions which are not limited by the limitations of LL(1) grammars; the weakness is that it is much more complicated (even, sometimes, impossible) to extract a clear statement about the precise syntax of the language. This power, for example, allows recursive descent grammars to handle left associativity in a more-or-less natural way despite the restriction mentioned above.
If you want to go down this road, then the solution is simple enough. You will have some sort of function:
/* This function parses and returns a single expression */
Node expr() {
    Node left = value();
    while (true) {
        switch (lookahead) {
            /* handle each possible operator token. I left out
             * the detail of handling operator precedence since it's
             * not relevant here
             */
            case OP_PLUS: {
                accept(lookahead);
                left = MakeNode(OP_PLUS, left, value());
                break;
            }
            /* If no operator found, return the current expression */
            default:
                return left;
        }
    }
}
That can easily be modified to parse both expressions and statements. First, refactor the function so that it parses the "rest" of an expression, given the first operand. (The only change is a new prototype and the deletion of the first line in the body.)
/* This function parses and returns a single expression
 * after the first value has been parsed. The value must be
 * passed as an argument.
 */
Node expr_rest(Node left) {
    while (true) {
        switch (lookahead) {
            /* handle each possible operator token. I left out
             * the detail of handling operator precedence since it's
             * not relevant here
             */
            case OP_PLUS: {
                accept(lookahead);
                left = MakeNode(OP_PLUS, left, value());
                break;
            }
            /* If no operator found, return the current expression */
            default:
                return left;
        }
    }
}
With that in place, it is straightforward to implement both expr and stmt:
Node expr() {
    return expr_rest(value());
}

Node stmt() {
    /* Check lookahead for statements which start with
     * a keyword. Omitted for simplicity.
     */
    /* either first value in an expr or target of assignment */
    Node left = value();
    switch (lookahead) {
        case OP_ASSIGN:
            accept(lookahead);
            return MakeAssignment(left, expr());
        /* Handle += and other mutating assignments if desired */
        default:
            /* Not an assignment, just an expression */
            return MakeExpressionStatement(expr_rest(left));
    }
}

conditional compilation in yacc

How do you do conditional compilation in yacc, similar to what is done in C using #ifdef?
I want to create a rule based on a condition. Is this possible in yacc? For example, based on a condition, ruleA is defined as follows:
ruleA : A | B /* For condition 1 */
ruleA : C /* If condition 1 is not satisfied */
btyacc has conditional compilation based on defined flags, similar to the C preprocessor. You can say:
%ifdef VERSION_A
ruleA: A | B ;
%endif
%ifdef VERSION_B
ruleA: C ;
%endif
and then use a -DVERSION_A or -DVERSION_B command line argument to get one version or the other. It's pretty primitive (you can only test a single flag per %ifdef, you can't nest %ifdefs, and there's no %else), but it's adequate for simple things.
If you can't preprocess your Yacc grammar with an appropriate preprocessor, then you can use a solution based on actions:
ruleA : A { if (condition1) { process OK; } else YYERROR; }
      | B { if (condition1) { process OK; } else YYERROR; }
      | C { if (!condition1) { process OK; } else YYERROR; }
      ;
The YYERROR action triggers Yacc's normal error processing mechanism. This means your grammar must 'work' with both sets of rules in operation as far as Yacc is concerned. If this leads to complexity because of shift/reduce (or even reduce/reduce) conflicts, then preprocessing is the way to go. The resulting grammar will be smaller and more focused.
My suggestion is that you expose the conditional flag as a terminal symbol, we can call it the mode terminal. (So, technically, it's a run-time not compile-time condition.)
You need to return the mode terminal at every point that it would make a difference in the parse. Your grammar can ignore it when it receives extra ones. It can return the mode terminal only in the "condition 1" case, only in the "not condition 1" case, or you can return a different one in both cases. So say you have two tokens C1 and C2, one for each mode.
This may not work out so well if you are parsing an existing grammar, but if you are designing the grammar and the parser together all problems are solvable.
Then you end up with:
ruleA : C1 A | C1 B | C2 C ;

Is the grammar LALR?

Let's say a given grammar is not LR(1); can we safely say that the grammar is not LALR either?
If not, what are the conditions for a grammar to be LALR? (Or, what conditions make a grammar not LALR?)
Thanks for the help!
LALR(1) ⊂ LR(1), so yes, we can safely say that. The two methods express languages in a similar manner, but an LR(1) parser keeps track of more left context than an LALR(1) parser. Cf. these lecture notes, which discuss the differences in state between the two representations.
In general, parser generators will handle all the details of creating the shift-reduce steps for you; the difference is that generators based on the larger grammars are more likely to find conflict-free parsing strategies.
This document compares both.
Here is a simple grammar that is LR(1) but not LALR(1):
G -> S
S -> c X t
-> c Y n
-> r Y t
-> r X n
X -> a
Y -> a
An LALR(1) parser generator gives you an LR(0) state machine; an LR(1) parser generator gives you an LR(1) state machine. With this grammar, the LR(1) state machine has one more state than the LR(0) state machine.
The LR(0) state machine contains this state:
X -> a .
Y -> a .
The LR(1) state machine contains these two states instead of the one shown above:
X -> a . { t }
Y -> a . { n }

X -> a . { n }
Y -> a . { t }
The problem with LALR is that the states are made first, without any knowledge of the lookaheads. The lookaheads are then examined, or created, after the states are made. So LALR ends up with the one merged state, and the lookaheads, added later, will look like this:
X -> a . { t, n }
Y -> a . { n, t }
Can anybody see a problem here? If the lookahead is t, which reduction do you choose? It's ambiguous! So the LALR(1) parser generator gives you a reduce-reduce conflict report, which can be confusing to the inexperienced grammar writer.
That is why I made LRSTAR an LR(1) parser generator. It can handle the above grammar.
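If you want to reproduce this yourself, here is one way to transliterate that grammar for bison (a sketch; the terminals are renamed to upper-case token names per yacc convention). Plain bison, which builds LALR(1) tables, should report a reduce/reduce conflict here, while bison -Dlr.type=canonical-lr (full LR(1), available in bison 2.5 and later) should accept the grammar:

/* G -> S; S -> c X t | c Y n | r Y t | r X n; X -> a; Y -> a */
%token C R T N A
%%
g : s ;
s : C x T
  | C y N
  | R y T
  | R x N
  ;
x : A ;
y : A ;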

First and follow of the non-terminals in two grammars

Given the following grammar:
S -> L=L
S -> L
L -> *L
L -> id
What are the first and follow for the non-terminals?
If the grammar is changed into:
S -> L=R
S -> R
L -> *R
L -> id
R -> L
What will be the first and follow?
When I took a compiler course in college I didn't understand FIRST and FOLLOWS at all. I implemented the algorithms described in the Dragon book, but I had no clue what was going on. I think I do now.
I assume you have some book that gives a formal definition of these two sets, and the book is completely incomprehensible. I'll try to give an informal description of them, and hopefully that will help you make sense of what's in your book.
The FIRST set is the set of terminals you could possibly see as the first part of the expansion of a non-terminal. The FOLLOWS set is the set of terminals you could possibly see following the expansion of a non-terminal.
In your first grammar, there are only three kinds of terminals: =, *, and id. (You might also consider $, the end-of-input symbol, to be a terminal.) The only non-terminals are S (a statement) and L (an Lvalue -- a "thing" you can assign to).
Think of FIRST(S) as the set of terminals that could possibly start a statement. Intuitively, you know you do not start a statement with =. So you wouldn't expect that to show up in FIRST(S).
So how does a statement start? There are two production rules that define what an S looks like, and they both start with L. So to figure out what's in FIRST(S), you really have to look at what's in FIRST(L). There are two production rules that define what an Lvalue looks like: it either starts with a * or with an id. So FIRST(S) = FIRST(L) = { *, id }.
FOLLOWS(S) is easy. Nothing follows S because it is the start symbol. So the only thing in FOLLOWS(S) is $, the end-of-input symbol.
FOLLOWS(L) is a little trickier. You have to look at every production rule where L appears, and see what comes after it. In the first rule, you see that = may follow L. So = is in FOLLOWS(L). But you also notice in that rule that there is another L at the end of the production rule. So another thing that could follow L is anything that could follow that production. We already figured out that the only thing that can follow the S production is the end-of-input. So FOLLOWS(L) = { =, $ }. (If you look at the other production rules, L always appears at the end of them, so you just get $ from those.)
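If you like to check this kind of reasoning mechanically, here is a small Python fixpoint sketch (my own illustration; it deliberately ignores ϵ-productions, which this first grammar doesn't have):

# First grammar: S -> L = L | L ;  L -> * L | id
GRAMMAR = {
    'S': [['L', '=', 'L'], ['L']],
    'L': [['*', 'L'], ['id']],
}
TERMINALS = {'=', '*', 'id'}

first = {nt: set() for nt in GRAMMAR}
follow = {'S': {'$'}, 'L': set()}   # $ (end of input) follows the start symbol

changed = True
while changed:
    changed = False
    for nt, prods in GRAMMAR.items():
        for prod in prods:
            # FIRST: the first symbol of each production (no epsilons here).
            head = prod[0]
            add = {head} if head in TERMINALS else first[head]
            if not add <= first[nt]:
                first[nt] |= add
                changed = True
            # FOLLOW: whatever can appear after each nonterminal occurrence.
            for i, sym in enumerate(prod):
                if sym not in GRAMMAR:
                    continue
                if i + 1 < len(prod):
                    nxt = prod[i + 1]
                    add = {nxt} if nxt in TERMINALS else first[nxt]
                else:
                    add = follow[nt]  # A -> ... B means FOLLOW(A) ⊆ FOLLOW(B)
                if not add <= follow[sym]:
                    follow[sym] |= add
                    changed = True

print(first)   # {'S': {'*', 'id'}, 'L': {'*', 'id'}}
print(follow)  # {'S': {'$'}, 'L': {'=', '$'}}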
Take a look at this Easy Explanation, and for now ignore all the stuff about ϵ, because you don't have any productions which contain the empty-string. Under "Rules for First Sets", rules #1, #3, and #4.1 should make sense. Under "Rules for Follows Sets", rules #1, #2, and #3 should make sense.
Things get more complicated when you have ϵ in your production rules. Suppose you have something like this:
D -> S C T id = V // Declaration is [Static] [Const] Type id = Value
S -> static | ϵ // The 'static' keyword is optional
C -> const | ϵ // The 'const' keyword is optional
T -> int | float // The Type is mandatory and is either 'int' or 'float'
V -> ... // The Value gets complicated, not important here.
Now if you want to compute FIRST(D) you can't just look at FIRST(S), because S may be "empty". You know intuitively that FIRST(D) is { static, const, int, float }. That intuition is codified in rule #4.2. Think of SCT in this example as Y1Y2Y3 in the "Easy Explanation" rules.
If you want to compute FOLLOWS(S), you can't just look at FIRST(C), because that may be empty, so you also have to look at FIRST(T). So FOLLOWS(S) = { const, int, float }. You get that by applying "Rules for follow sets" #2 and #4 (more or less).
I hope that helps and that you can figure out FIRST and FOLLOWS for the second grammar on your own.
If it helps, R represents an Rvalue -- a "thing" you can't assign to, such as a constant or a literal. An Lvalue can also act as an Rvalue (but not the other way around).
a = 2; // a is an lvalue, 2 is an rvalue
a = b; // a is an lvalue, b is an lvalue, but in this context it's an rvalue
2 = a; // invalid because 2 cannot be an lvalue
2 = 3; // invalid, same reason.
*4 = b; // Valid! You would almost never write code like this, but it is
// grammatically correct: dereferencing an Rvalue gives you an Lvalue.