Let's say a given grammar is not LR(1). Can we safely say that the grammar is not LALR either?
If not, what are the conditions for a grammar to be LALR? (Or what are the conditions that make a grammar not LALR?)
Thanks for the help!
LALR(1) ⊂ LR(1), so yes: if a grammar is not LR(1), it cannot be LALR(1) either. The two grammar classes express languages in a similar manner, but an LR(1) parser keeps track of more left context (per-item lookahead) than an LALR(1) parser. Cf. these lecture notes, which discuss the differences in state between the two representations.
In general, parser generators will handle all the details of creating the shift-reduce steps for you; the difference is that generators based on the larger grammar class are more likely to find a conflict-free parsing strategy.
This document compares both.
Here is a simple grammar that is LR(1) but not LALR(1):
G -> S
S -> c X t
  -> c Y n
  -> r Y t
  -> r X n
X -> a
Y -> a
An LALR(1) parser generator gives you an LR(0) state machine (with lookaheads attached to it afterwards), while an LR(1) parser generator gives you an LR(1) state machine. With this grammar, the LR(1) state machine has one more state than the LR(0) state machine.
The LR(0) state machine contains this state:
X -> a .
Y -> a .
The LR(1) state machine contains these two states instead of the one shown above:
X -> a . { t }
Y -> a . { n }
X -> a . { n }
Y -> a . { t }
The problem with LALR is that the states are built first, without any knowledge of the lookaheads; the lookaheads are computed and attached only after the states have been made. LALR therefore keeps the one merged state, and its lookaheads end up looking like this:
X -> a . { t, n }
Y -> a . { n, t }
Can anybody see the problem here? If the lookahead is 't', which reduction do you choose? The choice is ambiguous! So an LALR(1) parser generator gives you a reduce-reduce conflict report, which can be confusing to an inexperienced grammar writer.
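To make that merging step concrete, here is a toy Python sketch (my own simplified illustration, not how a real generator stores its tables). An LR(1) item is modeled as a core ("X -> a .") plus a lookahead set; LALR merges items that share a core and unions their lookaheads:

state_1 = {"X -> a .": {"t"}, "Y -> a .": {"n"}}   # LR(1) state reached via 'c a'
state_2 = {"X -> a .": {"n"}, "Y -> a .": {"t"}}   # LR(1) state reached via 'r a'

merged = {}
for state in (state_1, state_2):
    for core, lookaheads in state.items():
        merged.setdefault(core, set()).update(lookaheads)

# A lookahead that selects more than one reduction is a reduce-reduce conflict.
for la in sorted(set.union(*merged.values())):
    reductions = [core for core, las in merged.items() if la in las]
    if len(reductions) > 1:
        print("reduce-reduce conflict on '%s': %s" % (la, reductions))

Running it reports a conflict on both 't' and 'n', which is exactly the reduce-reduce report an LALR(1) generator would produce for this grammar.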
That is why I made LRSTAR an LR(1) parser generator. It can handle the above grammar.
I have a language L, defined as L = { b^n c^n a^n | n ≥ 1 }.
The corresponding grammar would be able to generate words such as:
bca
bbccaa
bbbcccaaa
...
What would such a grammar look like? Making two variables depend on each other is relatively simple, but I have trouble doing it for three.
Thanks in advance!
L = { b^n c^n a^n | n ≥ 1 }
As pointed out in the comments, this is a canonical example of a language which is not context-free. It can be shown using the pumping lemma for context-free languages. Basically, consider a string like b^p c^p a^p, where p is the pumping length, and then show that no matter which part you pump, you will throw off the balance (the pumped portion has length less than p, so it cannot "span" all three symbol blocks to keep them in sync).
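Spelled out in the usual notation, the skeleton of that argument is: let $p$ be the pumping length and take $s = b^p c^p a^p \in L$. The lemma gives $s = uvwxy$ with $|vwx| \le p$ and $|vx| \ge 1$. Because $|vwx| \le p$, the window $vwx$ can touch at most two of the three blocks $b^p$, $c^p$, $a^p$, so pumping to $uv^2wx^2y$ raises the count of at most two of the symbols while the third stays at $p$; hence $uv^2wx^2y \notin L$, contradicting the lemma.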
L = { a^m b^n c^n a^(m+n) | m ≥ 0, n ≥ 1 }
As suggested in the comments, this is not context free either. It can be shown using the pumping lemma for context-free languages as well. However, given a proof (or acceptance) of the above, there is an easier way. Recall that the intersection of a regular language and a context-free language must be context free. Assume L is context-free. Then so must be its intersection with the regular language (b+c)(b+c)* a*. However, that intersection can be expressed as b^n c^n a^n (since m is forced to be zero), which we know is not context-free, a contradiction. Therefore, our assumption was wrong and L is not context free either.
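In symbols, that step is: $L \cap L\big((b+c)(b+c)^* a^*\big) = \{\, b^n c^n a^n \mid n \ge 1 \,\}$, because a string of $L$ can begin with $b$ or $c$ only when $m = 0$.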
I have a problem with the recognition of languages. Given a certain language, for example { a^n c b^(2n) | n > 0 }, how do I quickly determine which Chomsky type it belongs to?
My idea was to determine the grammar that generates it and then work up to the language, but that is a long process. I think there must be a way to recognize it by eye,
without writing grammars or automata.
Can someone help me?
Unfortunately, associating an arbitrary language with a level of the Chomsky hierarchy is, in the general case, undecidable. (See Rice's Theorem.)
Of course, it is easy to categorize a given grammar, since the Chomsky hierarchy is defined by a simple syntactic analysis of the grammar itself. However, languages do not have unique grammars; the existence of (for example) a Type 2 (context-free) grammar for a language does not mean that there is not a Type 3 (regular) grammar for the same language.
So there are no shortcuts.
However, there is a lot to be said for experience. The language { a^n c b^(2n) | n > 0 } is context-free (and not regular), as are all languages of similar forms. That it is context-free is demonstrated by the grammar
L → c
L → a L b b
and the fact that it is not regular can be proven using the pumping lemma for regular languages. (The linked Wikipedia article contains, as an example of the use of the lemma, a proof for a similar language which should be easy to adapt.)
On the other hand, a language which requires three equal counts ({ a^n c^n b^n | n > 0 }) is not context-free (but is context-sensitive). (That's not the same as { a^n c^(n+m) b^m | n > 0, m ≥ 0 }, which is context-free.)
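To see why that last language is context-free, here is one possible grammar sketch: split c^(n+m) into c^n c^m and pair each half off separately:
S → X Y
X → a X c
X → a c
Y → c Y b
Y → ε
X generates a^n c^n (n ≥ 1) and Y generates c^m b^m (m ≥ 0), so their concatenation is exactly a^n c^(n+m) b^m.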
How do you convert a context-free grammar to a DFA? This is easy if we have productions like
A -> aB. But when we have productions like A -> aBc, how should we represent it as a DFA?
There is no general procedure to convert an arbitrary CFG into a DFA. For example, consider this CFG:
S → aSb | ε
This grammar is for the language { a^n b^n | n ≥ 0 }, which is a canonical nonregular language. Since we can only build DFAs for regular languages, there's no way to build a DFA with the same language as this CFG.
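For completeness, the standard argument that this language is not regular (a pumping-lemma sketch): let $p$ be the pumping length and take $s = a^p b^p$. The lemma writes $s = xyz$ with $|xy| \le p$ and $|y| \ge 1$, which forces $y = a^k$ for some $k \ge 1$. Then $xy^2z = a^{p+k} b^p$ is not of the form $a^n b^n$, a contradiction; so no DFA can exist.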
First, you should convert your grammar to CNF (Chomsky Normal Form).
Then the steps for the conversion are as follows:
Convert it to a left-linear or right-linear grammar, which is called a regular grammar.
Convert the regular grammar into a finite automaton.
The transitions for the automaton are obtained as follows:
For every production A -> aB, make δ(A, a) = B; that is, make an arc labeled 'a' from A to B.
For every production A -> a, make δ(A, a) = final state.
For every production A -> ε, make δ(A, ε) = A and make A a final state.
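Assuming you have gotten the grammar into right-linear form, those three rules translate almost directly into code. A minimal Python sketch (the production encoding, the FINAL marker, and the NFA simulation are my own illustration):

# Productions are (lhs, rhs) pairs with rhs one of:
#   (terminal, nonterminal)  e.g. A -> aB
#   (terminal,)              e.g. A -> a
#   ()                       e.g. A -> epsilon
FINAL = "FINAL"  # a single fresh accepting state

def build_nfa(productions, start):
    delta = {}            # (state, symbol) -> set of next states
    finals = {FINAL}
    for lhs, rhs in productions:
        if len(rhs) == 2:                    # A -> aB  gives  delta(A, a) = B
            a, b = rhs
            delta.setdefault((lhs, a), set()).add(b)
        elif len(rhs) == 1:                  # A -> a   gives  delta(A, a) = FINAL
            delta.setdefault((lhs, rhs[0]), set()).add(FINAL)
        else:                                # A -> eps makes A itself accepting
            finals.add(lhs)
    return delta, start, finals

def accepts(nfa, string):
    delta, start, finals = nfa
    states = {start}
    for ch in string:
        states = set().union(*(delta.get((s, ch), set()) for s in states))
        if not states:
            return False
    return bool(states & finals)

# Example: A -> aA | b recognizes a*b
nfa = build_nfa([("A", ("a", "A")), ("A", ("b",))], "A")
print(accepts(nfa, "aaab"))  # True
print(accepts(nfa, "aba"))   # False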
No. For this grammar, no DFA can be formed.
Why? Because it requires memory: memory of how many occurrences of 'a' have been seen.
Yes, it is a CFL (context-free language), so we can design a PDA (pushdown automaton) instead. Here the stack is the memory: PUSH for each 'a' and POP for each 'b'.
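Here is a minimal Python sketch of that PDA idea for { a^n b^n | n ≥ 1 }, with a plain list standing in for the stack:

def accepts_anbn(s):
    """Recognize { a^n b^n | n >= 1 } using an explicit stack."""
    stack = []
    seen_b = False
    for ch in s:
        if ch == 'a':
            if seen_b:             # an 'a' after a 'b' is out of order
                return False
            stack.append('a')      # PUSH for each a
        elif ch == 'b':
            seen_b = True
            if not stack:          # more b's than a's
                return False
            stack.pop()            # POP for each b
        else:
            return False
    return seen_b and not stack    # every a matched, at least one pair

print(accepts_anbn("aabb"))  # True
print(accepts_anbn("aab"))   # False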
In the dragon book, LL grammar is defined as follows:
A grammar is LL if and only if, for any production A -> a | b (where a and b are alternative right-hand sides), the following two conditions apply:
1. FIRST(a) and FIRST(b) are disjoint. This implies that a and b cannot both derive EMPTY.
2. If b can derive EMPTY, then a cannot derive any string that begins with a terminal in FOLLOW(A); that is, FIRST(a) and FOLLOW(A) must be disjoint.
And I know that an LL grammar can't be left-recursive, but what is the formal reason? I guess a left-recursive grammar will contradict rule 2, right? For example, I've written the following grammar:
S->SA|empty
A->a
Because FIRST(SA) = {a} and FOLLOW(S) = {$, a}, FIRST(SA) and FOLLOW(S) are not disjoint, so this grammar is not LL. But I don't know whether it is the left recursion that makes FIRST(SA) and FOLLOW(S) overlap, or whether there is some other reason. Put another way, is it true that every left-recursive grammar has a production that violates condition 2 of the LL definition?
OK, I figured it out. If a grammar contains a left-recursive production like:
S -> SA
then it must somehow contain another production to "finish" the recursion, say:
S -> B
And since FIRST(B) is a subset of FIRST(SA), the two sets are not disjoint, which violates condition 1: there will be a conflict when filling the parse-table entries for terminals that appear in both FIRST(B) and FIRST(SA). To summarize, a left-recursive grammar causes the FIRST sets of two or more productions to share terminals, thus violating condition 1.
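To check this mechanically, here is a small Python sketch (the grammar encoding and the fixed-point loop are my own) that computes FIRST sets for the example grammar:

# Grammar: S -> S A | eps ;  A -> a    (eps = the empty string)
grammar = {
    "S": [["S", "A"], []],
    "A": [["a"]],
}
EPS = "eps"

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                              # iterate to a fixed point
        changed = False
        for nt, alts in grammar.items():
            for alt in alts:
                before = len(first[nt])
                nullable = True
                for sym in alt:
                    if sym in grammar:          # nonterminal
                        first[nt] |= first[sym] - {EPS}
                        if EPS not in first[sym]:
                            nullable = False
                            break
                    else:                       # terminal
                        first[nt].add(sym)
                        nullable = False
                        break
                if nullable:                    # whole alternative can vanish
                    first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first

print(first_sets(grammar))   # {'S': {'a', 'eps'}, 'A': {'a'}}

Here FIRST(S) = {a, eps}: the left-recursive alternative S A therefore has FIRST = {a}, which overlaps FOLLOW(S) = {a, $} (condition 2); with a non-nullable base case B instead of eps, FIRST(B) would be absorbed into FIRST(SA), violating condition 1 exactly as argued above.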
Consider your grammar:
S->SA|empty
A->a
This is a shorthand for the three rules:
S -> SA
S -> empty
A -> a
Now consider the string aaa. How was it produced? You can only read one character at a time if you have no lookahead, so you start off like this (you have S as start symbol):
S -> SA
S -> empty
A -> a
Fine, you have produced the first a. But now you cannot apply any more rules, because there are no more nonterminals. You are stuck!
What you should have done was this:
S -> SA
S -> SA
S -> SA
S -> empty
A -> a
A -> a
A -> a
But you don't know this without reading the entire string. You would need an infinite amount of lookahead.
In a general sense, yes: with any finite lookahead, a left-recursive grammar leaves you unable to decide which rule to apply. Look at the example again: there are two different rules for S. Which one should we use?
An LL(k) grammar is one that allows the construction of a deterministic, recursive-descent parser with only k symbols of lookahead. The problem with left recursion is that it makes it impossible to determine which rule to apply until the complete input string has been examined, which makes the required k potentially infinite.
Using your example, choose a k, and give the parser an input sequence of length n >= k:
aaaaaaa...
A parser cannot decide whether to apply S->SA or S->empty by looking at the k symbols ahead, because the decision depends on how many times S->SA has been chosen before, and that is information the parser does not have.
The parser would have to choose S->SA exactly n times and S->empty once, and it's impossible to decide which choice is right by looking at only the first k symbols of the input stream.
To know, a parser would have to both examine the complete input sequence and keep count of how many times S->SA has been chosen, but such a parser falls outside the definition of LL(k).
Note that unlimited lookahead is not a solution, because a parser runs on limited resources, so there will always be a finite input sequence long enough to exhaust those resources before the parser produces any output.
In the book "The Theory of Parsing", Volume 2, by Aho and Ullman, page 681 you can find Lemma 8.3 that states: "No LL(k) grammar is left-recursive".
The proof says:
Suppose that G = (N, T, P, S) has a left-recursive nonterminal A. Then there is a derivation A ⇒+ Aw. If w ⇒* ε, then it is easy to show that G is ambiguous and hence cannot be LL. Thus, assume that w ⇒* v for some v in T+ (a non-empty string of terminals). We can further assume that A ⇒* u for some string of terminals u, so that for every n ≥ 0 there is a derivation A ⇒* Aw^n ⇒* uv^n.
Hence, there is another derivation A ⇒* Aw^(n+1) ⇒* uv^(n+1), and no fixed lookahead of k symbols can distinguish which expansion of A the parser should take.
As explained in Removing left recursion, there are two ways to deal with left recursion:
Modify the original grammar to remove the left recursion using some procedure.
Write the grammar without left recursion in the first place.
What do people normally use for removing (or avoiding) left recursion with ANTLR? I've used flex/bison for parsers before, but I need to use ANTLR now. The only thing I'm concerned about in using ANTLR (or an LL parser in general) is left-recursion removal.
In a practical sense, how serious a problem is removing left recursion in ANTLR? Is it a showstopper for using ANTLR? Or does nobody in the ANTLR community care about it?
I like ANTLR's idea of AST generation. In terms of getting an AST quickly and easily, which of the two left-recursion-removal methods is preferable?
Added
I did some experiment with the following grammar.
E -> E + T|T
T -> T * F|F
F -> INT | ( E )
After left recursion removal, I get the following one
E -> TE'
E' -> null | + TE'
T -> FT'
T' -> null | * FT'
I could come up with the following ANTLR representation. Even though it's relatively simple and straightforward, it seems that writing the grammar without left recursion in the first place is the better way to go.
grammar T;
options {
    language=Python;
}

start returns [value]
    : e {$value = $e.value}
    ;

e returns [value]
    : t ep
      {
      $value = $t.value
      if $ep.value != None:
          $value += $ep.value
      }
    ;

ep returns [value]
    : {$value = None}
    | '+' t r=ep
      {
      $value = $t.value
      if $r.value != None:
          $value += $r.value
      }
    ;

t returns [value]
    : f tp
      {
      $value = $f.value
      if $tp.value != None:
          $value *= $tp.value
      }
    ;

tp returns [value]
    : {$value = None}
    | '*' f r=tp
      {
      $value = $f.value
      if $r.value != None:
          $value *= $r.value
      }
    ;

f returns [value]
    : INT {$value = int($INT.text)}
    | '(' e ')' {$value = $e.value}
    ;

INT : '0'..'9'+ ;
WS  : (' '|'\n'|'\r')+ {$channel=HIDDEN;} ;
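For what it's worth, driving the generated parser from Python would look roughly like this (a sketch against the ANTLR 3 Python runtime; TLexer/TParser are the modules ANTLR generates for `grammar T;` above, and I'm assuming the single declared return value is returned directly by the rule method):

import antlr3                  # ANTLR 3 Python runtime
from TLexer import TLexer      # generated by ANTLR from grammar T
from TParser import TParser

stream = antlr3.ANTLRStringStream("1 + 2 * 3")
lexer = TLexer(stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TParser(tokens)
print(parser.start())          # expected: 7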
Consider something like a typical parameter list:
parameter_list: parameter
| parameter_list ',' parameter
;
Since you don't care about anything like precedence or associativity with parameters, this is fairly easy to convert to right recursion, at the expense of adding an extra production:
parameter_list: parameter more_params
;
more_params:
| ',' parameter more_params
;
For the most serious cases, you might want to spend some time in the Dragon Book. Doing a quick check, this is covered primarily in chapter 4.
As far as seriousness goes, I'm pretty sure ANTLR simply won't accept a grammar that contains left recursion, which would put it into the "absolute necessity" category.
"In a practical sense, how serious a problem is removing left recursion in ANTLR? Is it a showstopper for using ANTLR?"
I think you have a misunderstanding of left recursion. It is a property of the grammar, not of the parser generator or of the interaction between the parser generator and the specification. Direct left recursion happens when the first symbol on the right side of a rule is the same nonterminal as the one the rule defines.
To understand the inherent problem here, you need to know something about how a recursive-descent (LL) parser works. In an LL parser, the rule for each nonterminal symbol is implemented by a function corresponding to that rule. So, suppose I have a grammar like this:
S -> A B
A -> a
B -> b
Then, the parser would look (roughly) like this:
boolean eat(char x) {
    // if the next character is x, advance the stream and return true
    // otherwise, return false
}

boolean S() {
    if (!A()) return false;
    if (!B()) return false;
    return true;
}

boolean A() {
    return eat('a');
}

boolean B() {
    return eat('b');
}
However, what happens if I change the grammar to be the following?
S -> A B
A -> A c | null
B -> b
Presumably, I want this grammar to represent a language like c*b. The corresponding function in the LL parser would look like this:
boolean A() {
    if (!A()) return false; // stack overflow! We continually call A()
                            // without consuming any input.
    eat('c');
    return true;
}
So, we can't have left-recursion. Rewrite the grammar as:
S -> A B
A -> c A | null
B -> b
and the parser changes as such:
boolean A() {
    if (!eat('c')) return true;
    A();
    return true;
}
(Disclaimer: this is my rudimentary approximation of an LL parser, meant only for demonstration purposes regarding this question. It has obvious bugs in it.)
I can't speak for ANTLR, but in general, the way to eliminate left recursion of the form:
A -> A B
  -> B
is to change it to:
A -> B+
(note that B must appear at least once)
or, if ANTLR doesn't support that closure operator, you can do:
A  -> B B'
B' -> B B'
   -> ε
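This rewrite can also be done mechanically. Here is a small Python sketch for immediate left recursion (the symbol encoding and the A' naming convention are my own):

def remove_immediate_left_recursion(nt, alternatives):
    """Rewrite  A -> A b1 | ... | A bn | c1 | ... | cm
       as       A  -> c1 A' | ... | cm A'
                A' -> b1 A' | ... | bn A' | eps
    Alternatives are lists of symbols; returns a dict of new rules."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    base = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not recursive:
        return {nt: alternatives}          # nothing to rewrite
    tail = nt + "'"
    return {
        nt: [alt + [tail] for alt in base],
        tail: [alt + [tail] for alt in recursive] + [[]],  # [] is epsilon
    }

# A -> A B | B   becomes   A -> B A' ;  A' -> B A' | eps
print(remove_immediate_left_recursion("A", [["A", "B"], ["B"]]))
# {'A': [['B', "A'"]], "A'": [['B', "A'"], []]}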
If you provide an example of the rules that have conflicts, I can give a better, more specific answer.
If you are writing the grammar, then of course you try to write it to avoid the pitfalls of your particular parser generator.
Usually, in my experience, I get a reference manual for the (legacy) language of interest, and it already contains a grammar or railroad diagrams; it is what it is.
In that case, left-recursion removal from the grammar is pretty much done by hand. There's no market for left-recursion-removal tools, and if you had one, it would be specialized to a grammar syntax that didn't match the grammar syntax you have.
Doing this removal is mostly a matter of sweat, and there usually isn't a ton of it. So the usual approach is to get out your grammar knife and have at it.
I don't think the way you remove left recursion changes how ANTLR builds its trees. You have to do the left-recursion removal first, or ANTLR (or whatever LL parser generator you are using) simply won't accept your grammar.
There are those of us who don't want the parser generator to put any serious constraints on what we can write as a context-free grammar. In that case you want something like a GLR parser generator, which handles left or right recursion with ease. Unreasonable people can even insist on automated AST generation with no effort on the part of the grammar writer. For a tool that can do both, see the DMS Software Reengineering Toolkit.
This is only tangentially relevant, but I just published a preprint of a paper on a new parsing method that I call "pika parsing" (cf. packrat parsing), which directly handles left-recursive grammars without the need for rule rewriting.
https://arxiv.org/abs/2005.06444