Why can't a LL grammar be left-recursive? - grammar

In the Dragon Book, an LL(1) grammar is defined as follows:
A grammar is LL(1) if and only if, whenever A -> a | b are two distinct productions, the following two conditions hold:
1. FIRST(a) and FIRST(b) are disjoint. This implies that a and b cannot both derive EMPTY.
2. If b can derive EMPTY, then a cannot derive any string that begins with a terminal in FOLLOW(A); that is, FIRST(a) and FOLLOW(A) must be disjoint.
I know that an LL grammar can't be left-recursive, but what is the formal reason? I guess a left-recursive grammar will contradict rule 2, right? E.g., I've written the following grammar:
S->SA|empty
A->a
Because FIRST(SA) = {a} (A cannot derive EMPTY, so neither can SA) and FOLLOW(S) = {$, a}, FIRST(SA) and FOLLOW(S) are not disjoint, so this grammar is not LL. But I don't know whether it is the left recursion that makes FIRST(SA) and FOLLOW(S) overlap, or whether there is some other reason. Put another way, is it true that every left-recursive grammar has a production that violates condition 2 of the LL definition?

OK, I figured it out. If a grammar contains a left-recursive production, like:
S->SA
then it must contain another production to "finish" the recursion, say:
S->B
Since every terminal in FIRST(B) is also in FIRST(SA), the two sets are not disjoint (unless B derives only EMPTY). This violates condition 1: there must be a conflict when filling the parse table entries for the terminals that are in both FIRST(B) and FIRST(SA). To summarize, a left-recursive grammar causes the FIRST sets of two or more productions to share terminals, thus violating condition 1.
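The overlap described above can be checked mechanically. The following is a minimal sketch, assuming a toy grammar S -> S A | B, A -> a, B -> b with no empty productions; the names GRAMMAR, first_sets and condition1_conflicts are illustrative only and do not come from the Dragon Book.

```python
from itertools import combinations

# Hypothetical toy grammar: S -> S A | B, A -> a, B -> b (no empty productions).
GRAMMAR = {
    "S": [["S", "A"], ["B"]],
    "A": [["a"]],
    "B": [["b"]],
}


def first_sets(grammar):
    """Fixed-point computation of FIRST, assuming no empty productions."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                head = rhs[0]
                # Without empty productions only the first symbol matters.
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def first_of_rhs(rhs, first):
    """FIRST of one alternative (again assuming nothing derives EMPTY)."""
    head = rhs[0]
    return set(first[head]) if head in first else {head}


def condition1_conflicts(grammar):
    """List (nonterminal, rhs1, rhs2, shared terminals) for overlapping alternatives."""
    first = first_sets(grammar)
    conflicts = []
    for nt, alternatives in grammar.items():
        for rhs1, rhs2 in combinations(alternatives, 2):
            shared = first_of_rhs(rhs1, first) & first_of_rhs(rhs2, first)
            if shared:
                conflicts.append((nt, rhs1, rhs2, shared))
    return conflicts


for nt, rhs1, rhs2, shared in condition1_conflicts(GRAMMAR):
    print(f"{nt} -> {' '.join(rhs1)}  vs  {nt} -> {' '.join(rhs2)}: "
          f"both can start with {sorted(shared)}")
```

For this grammar the only conflict reported is between S -> S A and S -> B, which share the terminal b, so the parse table cell for (S, b) would receive two productions.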

Consider your grammar:
S->SA|empty
A->a
This is a shorthand for the three rules:
S -> SA
S -> empty
A -> a
Now consider the string aaa. How was it produced? You can only read one character at a time if you have no lookahead, so you start off like this (you have S as start symbol):
S -> SA
S -> empty
A -> a
Fine, you have produced the first a. But now you cannot apply any more rules because there are no more non-terminals. You are stuck!
What you should have done was this:
S -> SA
S -> SA
S -> SA
S -> empty
A -> a
A -> a
A -> a
But you don't know this without reading the entire string. You would need an infinite amount of lookahead.
In a general sense, yes: with any left-recursive grammar there are strings for which the choice of rule is ambiguous unless you have unbounded lookahead. Look at the example again: there are two different rules for S. Which one should we use?

An LL(k) grammar is one that allows the construction of a deterministic top-down (descent) parser with only k symbols of lookahead. The problem with left recursion is that it makes it impossible to determine which rule to apply until the complete input string is examined, which makes the required k potentially infinite.
Using your example, choose a k, and give the parser an input sequence of length n >= k:
aaaaaaa...
A parser cannot decide if it should apply S->SA or S->empty by looking at the k symbols ahead because the decision would depend on how many times S->SA has been chosen before, and that is information the parser does not have.
The parser would have to choose S->SA exactly n times and S->empty once, and it's impossible to decide which is right by looking at the first k symbols in the input stream.
To know, a parser would have to both examine the complete input sequence, and keep count of how many times S->SA has been chosen, but such a parser would fall outside of the definition of LL(k).
Note that unlimited lookahead is not a solution because a parser runs on limited resources, so there will always be a finite input sequence of a length large enough to make the parser crash before producing any output.
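One concrete symptom of this problem shows up in a hand-written recursive-descent parser. The sketch below is a hypothetical illustration, not a proper LL(k) parser: it assumes the parser predicts S->SA whenever the next symbol is a. The function names parse_S, parse_A and try_parse are invented for the example.

```python
def parse_S(tokens, pos):
    """Naive predictive procedure for S -> S A | empty, with A -> a.

    Assumption for this sketch: predict S -> S A whenever the lookahead is 'a'.
    """
    if pos < len(tokens) and tokens[pos] == "a":
        # Expanding S -> S A calls parse_S again at the same position:
        # no input has been consumed, so the same decision repeats forever.
        pos = parse_S(tokens, pos)
        return parse_A(tokens, pos)
    return pos  # S -> empty


def parse_A(tokens, pos):
    """A -> a: consume exactly one 'a'."""
    if pos < len(tokens) and tokens[pos] == "a":
        return pos + 1
    raise SyntaxError(f"expected 'a' at position {pos}")


def try_parse(text):
    """Return True/False for a finished parse, or a marker if the parser loops."""
    try:
        return parse_S(list(text), 0) == len(text)
    except RecursionError:
        return "infinite recursion"


print(try_parse(""))     # only S -> empty is needed
print(try_parse("aaa"))  # recurses without consuming any input
```

The lookahead is identical at every level of the recursion, so it carries no information about how many times S->SA still has to be applied.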

In the book "The Theory of Parsing", Volume 2, by Aho and Ullman, page 681 you can find Lemma 8.3 that states: "No LL(k) grammar is left-recursive".
The proof says:
Suppose that G = (N, T, P, S) has a left-recursive nonterminal A. Then there is a derivation A ⇒+ Aw. If w ⇒* e, then it is easy to show that G is ambiguous and hence cannot be LL. Thus, assume that w ⇒* v for some v in T+ (a non-empty string of terminals). We can further assume that A ⇒* u, where u is some string of terminals, and that there exists a derivation
Hence, there is another derivation:


Well typed and ill typed lambda terms

I have been trying to understand the applied lambda calculus. So far, I have understood how type inference works. But I cannot follow what it means to say that a term is well-typed or ill-typed, or how I can determine whether a given term is well-typed or ill-typed.
For example, consider a lambda term tw defined as λx[(x x)]. How do I determine whether it is a well-typed or an ill-typed term?
If we are talking about Simply Typed Lambda Calculus with some additional constants and basic types (i.e. applied lambda calculus), then the term λx:σ. (x x) is well-formed, but ill-typed.
'Well-formed' means syntactically correct, i.e. will be accepted by a parser for STLC. 'Ill-typed' means the type-checker would not pass it further.
A type-checker works according to the typing rules, which are usually expressed as a set of typing judgements (one typing rule for each syntactic form).
Let me show that the term you provided is indeed ill-typed.
According to rule (3) [see the typing rules link], λx:σ. (x x) must have a type of the general form σ -> τ (since it is a function, or more correctly an abstraction). But that means the body (x x) must have some type τ (assuming x : σ). This is basically the same rule (3) expressed in natural language. So now we need to figure out the type of the function's body, which is an application.
Now, the rule for application (4) says that if we have an expression like (e1 e2), then e1 must be some function e1 : α -> β and e2 : α must be an argument of the right type. Let's apply this rule to our expression for the body (x x): (1) x : α -> β and (2) x : α. Since a term in STLC can have only one type, we get an equation: α -> β = α.
But there is no way we can unify both types together, since α is a subpart of α -> β. That's why this won't typecheck.
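The failure to unify α -> β with α is what an "occurs check" detects. Below is a minimal sketch of first-order unification over type variables and arrow types only. The classes TVar and Arrow and the function unify are hypothetical names for this illustration, not part of any particular type-checker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TVar:
    name: str


@dataclass(frozen=True)
class Arrow:
    arg: object
    res: object


def resolve(t, subst):
    """Follow variable bindings until an unbound variable or an arrow is reached."""
    while isinstance(t, TVar) and t in subst:
        t = subst[t]
    return t


def occurs(var, t, subst):
    """True if the type variable `var` appears anywhere inside the type `t`."""
    t = resolve(t, subst)
    if t == var:
        return True
    if isinstance(t, Arrow):
        return occurs(var, t.arg, subst) or occurs(var, t.res, subst)
    return False


def unify(t1, t2, subst):
    """Return an extended substitution, or None if the two types cannot be unified."""
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if isinstance(t1, TVar):
        if occurs(t1, t2, subst):
            return None  # occurs check: t1 would have to contain itself
        return {**subst, t1: t2}
    if isinstance(t2, TVar):
        return unify(t2, t1, subst)
    # Both sides are arrows: unify argument types, then result types.
    subst = unify(t1.arg, t2.arg, subst)
    if subst is None:
        return None
    return unify(t1.res, t2.res, subst)


alpha, beta = TVar("α"), TVar("β")
print(unify(Arrow(alpha, beta), alpha, {}))               # None: the (x x) equation
print(unify(Arrow(alpha, beta), Arrow(beta, alpha), {}))  # succeeds with α bound to β
```

The first call corresponds to the equation α -> β = α from the body (x x). It returns None because α occurs inside α -> β.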
By the way, one of the major points of STLC was to forbid self-application (like (x x)): self-application makes it possible to write non-terminating computations (see, for instance, the Y combinator), and that prevents the (untyped) lambda calculus from being used as a logic.

Is it possible to separate the concepts of precedence and associativity in yacc?

I would like to have one clear example of precedence and one of associativity in yacc, but I still have trouble separating these two concepts.
Perhaps this is because I'm associating these two concepts with math and mathematical operations. These are two old examples I built:
Associativity (*) is used to specify the kind of association to be applied (left,right, non assoc....)
In fact
%left '+' '*'
instructs that plus and multiplication are left-associative. So far, so good (not exactly, but it serves the purpose of the example).
Precedence (**) is used to give precedence to one operator over another.
%left '+'
%left '*'
Here multiplication has higher precedence than addition.
So we get the wanted parsing action for E+E*E:
E+(E*E) in case of (**)
(E+E)*E in case of only (*) --> this is clearly wrong - but it's fine for the example
So the question is, can I separate clearly associativity from precedence without using the concept of associativity?
Even non-associativity implies knowledge of associativity, so how, if possible, can I talk about them separately?
No. In a parser definition, associativity is just a small detail within the precedence algorithm.
To understand that, it's important to understand what precedence actually means, in parsing terms.
A left-to-right shift-reduce parser has a stack and an input stream. Initially, the stack is empty, and the input stream contains the input to be parsed. The SR parser repeatedly does one of the following two actions until the stack consists only of the start symbol and the input stream is empty (in which case the parse has succeeded), or neither action is possible (in which case the parse has failed):
reduce the production whose right-hand side is on the top of the stack by popping the right-hand side off of the stack and pushing the left-hand side non-terminal;
shift one input symbol from the input onto the stack.
It's an important feature of this framework that reductions can only occur when the production's right-hand side is on the top of the stack.
The shift action is always possible unless the input stream is exhausted, but a reduce action can only be taken if the top of the stack precisely matches the right-hand side of some production.
Different ways of building SR parsers will involve different mechanisms for deciding which action to take in any given stack configuration. One such mechanism is the precedence algorithm. Some very simple languages can be SR parsed only with the precedence algorithm. In other cases, it can be used as an auxiliary decision algorithm in order to resolve ambiguous grammar specifications; this is the use case for precedence in yacc-derived parser generators.
For precedence to work, it is necessary that at most one reduction action be possible in any stack configuration, which means that there cannot be two productions with the same right-hand side. [Note 1]
Given that there is at most one possible reduction action and at most one possible shift action (since the next input symbol, if any, is given), the only issue is deciding whether to shift or reduce. The precedence algorithm involves a precedence function PREC(A→α, a) ⇒ { SHIFT, REDUCE }, whose arguments are a production A→α and a terminal symbol a, which are mapped onto either SHIFT or REDUCE.
Although the precedence relationship is usually written as though it were a comparison, it is not a normal comparison operator because the two arguments are from different domains. It always involves a production and a terminal.
In simple cases, however, it is possible to implement PREC using numeric comparisons. To do that, we define two functions which map productions and terminals, respectively, onto integers: f(A→α) and g(a). We use those to compute PREC:
PREC(A→α, a) ≡
REDUCE if f(A→α) > g(a)
SHIFT if f(A→α) < g(a)
[Note 2]
In any event, the precedence algorithm for a given stack configuration is:
Identify the production P (=A→α) of the possible reduce action, if any.
If only a shift or only a reduce is possible, do that. Otherwise, if both a reduce and a shift are possible, compute PREC(P, input) and reduce using P if the result is REDUCE; otherwise, shift input.
Now that might seem confusing, since most descriptions of precedence relations describe them as though they compared terminals, rather than a production with a terminal. That's because it is normal to "name" each production using the last terminal in the production. Usually, that is unambiguous, because of the restriction on production right-hand sides: since any two right-hand sides must differ, it is likely that all production right-hand sides have different terminal symbols. [Note 3]
Although that short-hand allows us to say, for example, that "* has higher precedence than +" instead of the somewhat more cumbersome "the production E→E*E has precedence over the terminal +", it is important to remember that the latter statement is what we really mean.
Precedence also applies to single operators. With most operators, we prefer to group from left to right, so that E-E-E should be parsed as though it had been written (E-E)-E. However, some operators like exponentiation group to the right, meaning that E**E**E should be parsed as E**(E**E). This is simple to define using the PREC function; for a left-grouping operator ⊕, we'll have:
PREC(E→E⊕E, ⊕) ≡ REDUCE
while a right-grouping operator ⊗ would have
PREC(E→E⊗E, ⊗) ≡ SHIFT
That's clear when we use the actual arguments to PREC, but it becomes confusing when we use the shorthand notation, which leaves us trying to say that ⊕ has higher precedence than ⊕ while ⊗ has lower precedence than ⊗. To avoid the ambiguity and still let us get away with the shorthand, we describe ⊕ as "left-associative" (%left) and ⊗ as "right-associative" (%right). But the implementation is simply an application of the normal precedence algorithm.
As an example, consider the simple expression language:
E → E + E
E → E * E
E → E ** E
E → id
Here we expect * to bind more tightly than + with ** binding tightest; the first two group to the left while exponentiation groups to the right. To achieve that, we can assign f and g functions as follows:
Production     f(Production)    Terminal    g(Terminal)
E → E + E      2                +           1
E → E * E      4                *           3
E → E ** E     5                **          6
E → id         8                id          7
Yacc-generated parsers don't use precedence to decide when to reduce the E→id production, but the above will work since the grammar can be parsed completely using only the precedence algorithm.
Parentheses can easily be added; I'll leave that as an exercise.
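To see the f and g values in action, here is a minimal sketch of a shift-reduce loop driven only by the comparison f(production) > g(lookahead). It assumes whitespace-separated tokens and treats every non-operator token as id. The names F, G, kind, reducible and parse are invented for this illustration and have nothing to do with how yacc itself is implemented.

```python
# f values keyed by production right-hand side, and g values keyed by terminal,
# for the example grammar E -> E + E | E * E | E ** E | id.
F = {
    ("E", "+", "E"): 2,
    ("E", "*", "E"): 4,
    ("E", "**", "E"): 5,
    ("id",): 8,
}
G = {"+": 1, "*": 3, "**": 6, "id": 7}


def kind(token):
    """Operators are their own terminal; every other token counts as id."""
    return token if token in G else "id"


def reducible(stack):
    """Return the right-hand side that matches the top of the stack, if any."""
    symbols = tuple(sym for sym, _ in stack)
    for rhs in F:
        if symbols[-len(rhs):] == rhs:
            return rhs
    return None


def parse(text):
    """Parse whitespace-separated tokens; return a fully bracketed string."""
    tokens = text.split()
    stack = []  # entries are (grammar symbol, bracketed text of the subtree)
    pos = 0
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        rhs = reducible(stack)
        if rhs and (lookahead is None or F[rhs] > G[kind(lookahead)]):
            # REDUCE: pop the right-hand side and push E.
            popped = stack[-len(rhs):]
            del stack[-len(rhs):]
            if len(popped) == 1:
                tree = popped[0][1]
            else:
                tree = "(" + " ".join(t for _, t in popped) + ")"
            stack.append(("E", tree))
        elif lookahead is not None:
            # SHIFT: move one input symbol onto the stack.
            stack.append((kind(lookahead), lookahead))
            pos += 1
        else:
            break
    if len(stack) == 1 and stack[0][0] == "E":
        return stack[0][1]
    raise SyntaxError(f"cannot parse {text!r}")


print(parse("a + b * c"))    # * binds tighter than +
print(parse("a + b + c"))    # + groups to the left
print(parse("a ** b ** c"))  # ** groups to the right
```

Left and right grouping both fall out of the same numeric comparison: f(E→E+E) = 2 is greater than g(+) = 1, so the parser reduces, while f(E→E**E) = 5 is less than g(**) = 6, so it shifts.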
Notes
1. There might be some other mechanism to decide between reduction actions, so the restriction is only absolute for a parser which only uses precedence. There might also be some other mechanism to restrict possible shift actions. For example, for a shift to be feasible, the tokens on the top of the stack need to eventually be reduced, which means that some suffix of the stack must be a prefix of the right-hand side of some production. Similarly, a reduction is only feasible if, post-reduce, some suffix of the stack is the prefix of the right-hand side of some production.
2. You'll see formulations using < and ≥ (or ≤ and >), but to avoid confusion, I'm assuming that the ranges of f and g are different sets of integers. Since the functions are arbitrary, this does not restrict generality.
3. That's not always the case. For example, languages which allow - to be either a unary or a binary operator will have productions with right-hand sides - E and E - E. Yacc-derived parser generators use the %prec TERMINAL declaration to associate a production with a terminal other than the default.
This is all very confused.
Associativity ... Is used to give precedence to one operator over another
No. Absolutely not. Associativity is used to determine which order two adjacent instances of the same operator are evaluated in. (E+E)+E or E+(E+E). All arithmetic operators except exponentiation are left-associative in mathematics.
%left '+' '*'
This says that + and * are both left-associative and have the same precedence, because they are both on the same line. And it is therefore wrong.
can I separate clearly associativity from precedence without using the concept of associativity
I'm sorry but this is just meaningless.

Automata theory : Conversion of a Context free grammar to a DFA

How do I convert a context-free grammar to a DFA? This is easy if we have transitions like A->a B. But when we have transitions such as A->a B c, how should we represent them in a DFA?
There is no general procedure to convert an arbitrary CFG into a DFA. For example, consider this CFG:
S → aSb | ε
This grammar is for the language { a^n b^n | n ≥ 0 }, which is a canonical nonregular language. Since we can only build DFAs for regular languages, there's no way to build a DFA with the same language as this CFG.
First, you should convert your grammar to CNF (Chomsky Normal Form).
Then the steps for conversion are as follows:
Convert it to a left-linear or right-linear grammar, which is called a regular grammar (this is only possible when the language is regular).
Convert the regular grammar into a finite automaton.
The transitions for the automaton are obtained as follows:
For every production A -> aB, make δ(A, a) = B; that is, make an arc labeled 'a' from A to B.
For every production A -> a, make δ(A, a) = final state.
For every production A -> ϵ, make δ(A, ϵ) = A, and A will be a final state.
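The three transition rules above can be sketched in a few lines. This example assumes a right-linear grammar given as (lhs, terminal, next_nonterminal) triples; the encoding, the FINAL state name and the sample grammar are all hypothetical. In general the result is a nondeterministic automaton, so the sketch simulates it by tracking a set of states rather than building a DFA explicitly.

```python
FINAL = "<final>"  # extra accepting state used for productions of the form A -> a


def grammar_to_nfa(productions):
    """Build transitions from (lhs, terminal, next_nonterminal) triples.

    ("A", "a", "B")   encodes A -> aB
    ("A", "a", None)  encodes A -> a
    ("A", "", None)   encodes A -> ϵ
    """
    delta = {}           # (state, terminal) -> set of successor states
    accepting = {FINAL}
    for lhs, terminal, nxt in productions:
        if terminal == "":
            accepting.add(lhs)  # A -> ϵ makes A an accepting state
        else:
            target = nxt if nxt is not None else FINAL
            delta.setdefault((lhs, terminal), set()).add(target)
    return delta, accepting


def accepts(delta, accepting, start, word):
    """Simulate the automaton by tracking the set of reachable states."""
    current = {start}
    for ch in word:
        current = set().union(*(delta.get((state, ch), set()) for state in current))
    return bool(current & accepting)


# Sample grammar (an assumption for this sketch): S -> aS | bA, A -> bA | c | ϵ
PRODUCTIONS = [
    ("S", "a", "S"),
    ("S", "b", "A"),
    ("A", "b", "A"),
    ("A", "c", None),
    ("A", "", None),
]

delta, accepting = grammar_to_nfa(PRODUCTIONS)
for word in ["ab", "aabbc", "a", "abcb"]:
    print(word, accepts(delta, accepting, "S", word))
```

A production such as A -> a B c does not fit any of the three forms, which is why the grammar has to be right-linear (or left-linear) before this construction applies.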
No. For this grammar, no DFA can be constructed.
Why?
Because it requires memory: memory of the number of occurrences of a.
Yes, it is a CFL (context-free language).
We can design a PDA (pushdown automaton). Here, memory (a stack) is used: PUSH for each a and POP for each b.

Is the language of all strings over the alphabet "a,b,c" with the same number of substrings "ab" & "ba" regular?

Is the language of all strings over the alphabet "a,b,c" with the same number of substrings "ab" & "ba" regular?
I believe the answer is NO, but it is hard to make a formal demonstration of it, even a NON formal demonstration.
Any ideas on how to approach this?
It's clearly not regular. How is an FA going to recognize (abc)^n c (cba)^n? Strings like this are in your language, right? The argument is a simple one based on the fact that there are infinitely many equivalence classes under the indistinguishability relation I_L.
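The equivalence-class argument can be spot-checked for small cases. The sketch below assumes the prefixes (abc)^i and the suffixes c(cba)^j suggested by the strings above. The helper names are invented for the example, and a finite check like this only illustrates the argument; it does not prove it.

```python
def in_language(s):
    """Membership test: equal numbers of 'ab' and 'ba' substrings."""
    return s.count("ab") == s.count("ba")


def prefix(i):
    return "abc" * i


def suffix(j):
    return "c" + "cba" * j


def distinguished(i, j):
    """True if the suffix for i separates prefix(i) from prefix(j)."""
    return in_language(prefix(i) + suffix(i)) != in_language(prefix(j) + suffix(i))


N = 8
pairs = [(i, j) for i in range(N) for j in range(N) if i != j]
print(all(distinguished(i, j) for i, j in pairs))
```

Since (abc)^i c (cba)^j contains i copies of "ab" and j copies of "ba", it is in the language exactly when i = j. So no two of the prefixes (abc)^i can lead to the same state of a finite automaton.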
The most common way to prove a language is NOT regular is to use one of the pumping lemmas.
Using the lemma is a little tricky, since it has all those "exists" and so on. To prove a language L is not regular using the pumping lemma you have to prove that
for any integer p,
there is a word w in L of length n, with n>=p, such that
for all possible ways to decompose w as xyz, with len(xy) <= p and y non empty
there exists an i such that x(y^i)z (repeating the y bit i times) is NOT in L
whooo!
I'll show how the proof looks for the "same number of a's and b's" language. It should be straightforward to convert to your case:
for any given p, we can make a word of length n = 2*p
a^p b^p (p a's followed by p b's)
any way you decompose this into xyz with |xy| <= p, y will only contain a's.
Thus, pumping the y part will make the word have more a's than b's,
thus NOT belonging to L.
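The proof steps above can be spot-checked by brute force for small p. This is a minimal sketch for the "same number of a's and b's" language only; the function names in_L and every_split_pumps_out are made up for the illustration, and checking a few values of p does not replace the proof.

```python
def in_L(s):
    """Membership in the 'same number of a's and b's' language."""
    return s.count("a") == s.count("b")


def every_split_pumps_out(p, i=2):
    """Check that pumping with exponent i breaks every legal split of a^p b^p."""
    w = "a" * p + "b" * p
    for xy_len in range(1, p + 1):      # |xy| <= p
        for x_len in range(0, xy_len):  # y is non-empty
            x, y, z = w[:x_len], w[x_len:xy_len], w[xy_len:]
            if in_L(x + y * i + z):
                return False            # this split survives pumping
    return True


print(all(every_split_pumps_out(p) for p in range(1, 10)))
```

Because |xy| <= p forces y to lie inside the leading block of a's, repeating y changes the number of a's but not the number of b's.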
If you need intuition on why this works, it follows from how you need to be able to count to arbitrarily large numbers to verify whether a word belongs to one of these languages. However, regular languages are described by finite automata, and no finite automaton can have the infinite number of states required to represent all the numbers. (The Wikipedia article should have a formal proof.)
EDIT: It looks like you can't use the pumping lemma directly in this particular case: if you always make y one character long, you can never make a word stop being accepted (aba becoming abbbba makes no difference, and so on).
Just do the equivalence class approach suggested by Patrick87 - it will probably turn out to be cleaner than any of the dirty hacks you would need to do to make the pumping lemma applicable here.

First and follow of the non-terminals in two grammars

Given the following grammar:
S -> L=L
S -> L
L -> *L
L -> id
What are the first and follow for the non-terminals?
If the grammar is changed into:
S -> L=R
S -> R
L -> *R
L -> id
R -> L
What will be the first and follow ?
When I took a compiler course in college I didn't understand FIRST and FOLLOWS at all. I implemented the algorithms described in the Dragon book, but I had no clue what was going on. I think I do now.
I assume you have some book that gives a formal definition of these two sets, and the book is completely incomprehensible. I'll try to give an informal description of them, and hopefully that will help you make sense of what's in your book.
The FIRST set is the set of terminals you could possibly see as the first part of the expansion of a non-terminal. The FOLLOWS set is the set of terminals you could possibly see following the expansion of a non-terminal.
In your first grammar, there are only three kinds of terminals: =, *, and id. (You might also consider $, the end-of-input symbol, to be a terminal.) The only non-terminals are S (a statement) and L (an Lvalue -- a "thing" you can assign to).
Think of FIRST(S) as the set of terminals that could possibly start a statement. Intuitively, you know you do not start a statement with =. So you wouldn't expect that to show up in FIRST(S).
So how does a statement start? There are two production rules that define what an S looks like, and they both start with L. So to figure out what's in FIRST(S), you really have to look at what's in FIRST(L). There are two production rules that define what an Lvalue looks like: it either starts with a * or with an id. So FIRST(S) = FIRST(L) = { *, id }.
FOLLOWS(S) is easy. Nothing follows S because it is the start symbol. So the only thing in FOLLOWS(S) is $, the end-of-input symbol.
FOLLOWS(L) is a little trickier. You have to look at every production rule where L appears, and see what comes after it. In the first rule, you see that = may follow L. So = is in FOLLOWS(L). But you also notice in that rule that there is another L at the end of the production rule. So another thing that could follow L is anything that could follow that production. We already figured out that the only thing that can follow the S production is the end-of-input. So FOLLOWS(L) = { =, $ }. (If you look at the other production rules, L always appears at the end of them, so you just get $ from those.)
Take a look at this Easy Explanation, and for now ignore all the stuff about ϵ, because you don't have any productions which contain the empty-string. Under "Rules for First Sets", rules #1, #3, and #4.1 should make sense. Under "Rules for Follows Sets", rules #1, #2, and #3 should make sense.
Things get more complicated when you have ϵ in your production rules. Suppose you have something like this:
D -> S C T id = V // Declaration is [Static] [Const] Type id = Value
S -> static | ϵ // The 'static' keyword is optional
C -> const | ϵ // The 'const' keyword is optional
T -> int | float // The Type is mandatory and is either 'int' or 'float'
V -> ... // The Value gets complicated, not important here.
Now if you want to compute FIRST(D) you can't just look at FIRST(S), because S may be "empty". You know intuitively that FIRST(D) is { static, const, int, float }. That intuition is codified in rule #4.2. Think of SCT in this example as Y1Y2Y3 in the "Easy Explanation" rules.
If you want to compute FOLLOWS(S), you can't just look at FIRST(C), because that may be empty, so you also have to look at FIRST(T). So FOLLOWS(S) = { const, int, float }. You get that by applying "Rules for follow sets" #2 and #4 (more or less).
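The informal rules used above can be written down as a small fixed-point computation. This is a sketch, not the exact algorithm from the Dragon Book or from the linked explanation. V -> num is a placeholder assumption because the original leaves V unspecified, and all function names are illustrative.

```python
EPS, END = "ϵ", "$"

# Declaration grammar from the text; an empty list encodes an ϵ production.
DECL = {
    "D": [["S", "C", "T", "id", "=", "V"]],
    "S": [["static"], []],
    "C": [["const"], []],
    "T": [["int"], ["float"]],
    "V": [["num"]],  # placeholder assumption: V is elided in the original
}

# The first grammar from the question.
QUESTION = {
    "S": [["L", "=", "L"], ["L"]],
    "L": [["*", "L"], ["id"]],
}


def first_of_sequence(symbols, first):
    """FIRST of a symbol string; contains EPS only if every symbol can vanish."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # only non-terminals have FOLLOW sets
                    rest = first_of_sequence(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:
                        new |= follow[nt]  # what follows nt can follow sym
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


first = compute_first(DECL)
follow = compute_follow(DECL, first, "D")
print(sorted(first["D"]), sorted(follow["S"]))

q_first = compute_first(QUESTION)
q_follow = compute_follow(QUESTION, q_first, "S")
print(sorted(q_first["S"]), sorted(q_follow["L"]))
```

Running the same functions on the second grammar from the question only requires writing it in the same dictionary form.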
I hope that helps and that you can figure out FIRST and FOLLOWS for the second grammar on your own.
If it helps, R represents an Rvalue -- a "thing" you can't assign to, such as a constant or a literal. An Lvalue can also act as an Rvalue (but not the other way around).
a = 2; // a is an lvalue, 2 is an rvalue
a = b; // a is an lvalue, b is an lvalue, but in this context it's an rvalue
2 = a; // invalid because 2 cannot be an lvalue
2 = 3; // invalid, same reason.
*4 = b; // Valid! You would almost never write code like this, but it is
// grammatically correct: dereferencing an Rvalue gives you an Lvalue.