Definition of FIRST and FOLLOW sets of the right-hand sides of productions

I am learning about LL(1) grammars. My task is to check whether a grammar is LL(1) and, if it is not, to find the rules that prevent it from being LL(1). I came across this link https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/Syntax.html/node14.html which has a theorem that can be used as a criterion for deciding whether a grammar is LL(1). It says that for any pair of rules A -> alpha | beta, certain conditions on the FIRST and FOLLOW sets must hold. Therefore, I need to find the FIRST and FOLLOW sets of these right-hand sides of productions.
Say I have the following rules: A -> a b B S | eps. How do I calculate FIRST and FOLLOW of a b B S? As far as I understand, by definition these sets are only defined for a single non-terminal symbol.

The idea behind the FIRST function is that it returns the set of terminals which could possibly start the expansion of its argument. It's usual to also add the special object ε (which is a way of writing an empty sequence of symbols) if ε is a possible expansion.
So if a is a terminal, FIRST(a) is just { a }. And if A is a non-terminal, FIRST(A) is the set of terminals which could possibly appear at the beginning of a derivation of A. Finally, FIRST(ε) must be { ε }, according to the convention described above.
Now suppose α is a (possibly empty) sequence of grammar symbols:
If α is empty (that is, it's ε), FIRST(α) is { ε }
If the first symbol in α is the terminal a, FIRST(α) is { a }.
If the first symbol in α is the non-terminal A, there are two possibilities. Let TAIL(α) be the rest of α after the first symbol. Now:
if ε ∈ FIRST(A), then FIRST(α) is (FIRST(A) ∖ {ε}) ∪ FIRST(TAIL(α)).
otherwise, FIRST(α) is FIRST(A).
Now, how do we compute FIRST(A), for every non-terminal A? Using the above definition of FIRST(α), we recursively define FIRST(A) to be the union of the sets FIRST(α) for every α which is the right-hand side of a production A → α.
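Here is a small Python sketch of those rules, just to make them concrete (the grammar encoding, the terminals set and the first_of_nonterminal dictionary are assumptions of mine for illustration, not part of your question):

EPSILON = "eps"

def first_of_sequence(alpha, terminals, first_of_nonterminal):
    # FIRST of a (possibly empty) list of grammar symbols, following the cases above.
    if not alpha:                          # alpha is the empty sequence, i.e. eps
        return {EPSILON}
    head, tail = alpha[0], alpha[1:]
    if head in terminals:                  # the sequence starts with a terminal
        return {head}
    head_first = set(first_of_nonterminal[head])
    if EPSILON in head_first:              # the leading non-terminal can derive eps
        return (head_first - {EPSILON}) | first_of_sequence(tail, terminals, first_of_nonterminal)
    return head_first

For your rule A -> a b B S | eps, the sequence a b B S starts with the terminal a, so FIRST(a b B S) is just { a }; and FIRST(A) is then FIRST(a b B S) ∪ FIRST(ε) = { a, ε }.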
The FOLLOW function defines the set of terminal symbols which might appear after the expansion of a non-terminal. It is only defined on non-terminals; if you look carefully at the LL(1) conditions on the page you cite, you'll see that FIRST is applied to a right-hand side, while FOLLOW is only applied to left-hand sides.


Why does JFlap fail to build a usable LL(1) parser from my calculator grammar?

I entered the following grammar in JFlap:
E → TK
K → +TK
K → λ
T → FM
M → *FM
M → λ
F → i
F → (E)
and tried to parse i * (i + i). I am sure the LL(1) grammar is correct and the input string should be accepted, but JFlap said that the string is rejected. (See screenshot). Why?
The grammar is fine.
However, you somehow entered it incorrectly. Notice that your parsing table has a λ column. That means that JFlap interpreted the λ as a character, not as an empty right-hand side, probably because you typed a real λ rather than letting JFlap fill it in automatically. You should just leave the right-hand side empty if you want an empty right-hand side. JFlap will then show that as λ but it won't treat it as a symbol.
By way of illustration, here are screenshots for the correctly entered grammar (with empty right-hand sides) and a grammar where I typed a λ instead of leaving the right-hand side empty. I stopped the second parse one step earlier than the error message, so that you can see the problem being reported: since M does not have an empty production, it blocks the parser from recognising the + sign.
Here's the correctly entered grammar:
And here's the one which I generated the same way that you did (if my guess is right). Note that it has a λ column in the transition table. You would also see it being handled differently in the FIRST and FOLLOW sets.
As a postscript, JFlap seems to handle most Unicode characters in tokens, but using a λ character as a token triggers a variety of bugs. So you shouldn't do that even if you intended the λ to be a legitimate character.
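As a cross-check that the grammar itself is fine and that i * (i + i) is accepted, here is a tiny table-driven LL(1) parser sketch in Python (the table is the one you get from the usual FIRST/FOLLOW computation; the code is mine and has nothing to do with JFlap's implementation):

# '' stands for the empty right-hand side (λ); '$' is the end-of-input marker.
TABLE = {
    ('E', 'i'): 'TK',  ('E', '('): 'TK',
    ('K', '+'): '+TK', ('K', ')'): '',   ('K', '$'): '',
    ('T', 'i'): 'FM',  ('T', '('): 'FM',
    ('M', '*'): '*FM', ('M', '+'): '',   ('M', ')'): '', ('M', '$'): '',
    ('F', 'i'): 'i',   ('F', '('): '(E)',
}

def ll1_parse(tokens, start='E'):
    stack = ['$', start]
    tokens = list(tokens) + ['$']
    pos = 0
    while stack:
        top = stack.pop()
        if top == tokens[pos]:                  # match a terminal (or '$')
            pos += 1
        elif (top, tokens[pos]) in TABLE:       # expand a non-terminal
            stack.extend(reversed(TABLE[(top, tokens[pos])]))
        else:
            return False
    return pos == len(tokens)

print(ll1_parse('i*(i+i)'))   # True: the string is accepted

With the λ entries present for K and M (the '' right-hand sides above), the parse goes through; if you delete them, you get roughly the failure described above, with the + sign blocking the parse.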

Formal Languages - Grammar

I am taking a Formal Languages and Computability class and am having a little trouble understanding the concept of grammar. One of my assignment questions is this:
Take Σ = {a,b}, and let na(w) and nb(w) denote the number of a's and b's in the string w, respectively. Then the grammar G with productions:
S -> SS
S -> λ
S -> aSb
S -> bSa
generates the language L = {w: na(w) = nb(w)}.
1) The language in the example contains an empty string. Modify the given grammar so that it generates L - {λ}.
I am thinking that I should modify the condition of L, something like:
L = {w: na(w) = nb(w), na(w), nb(w) > 0}
That way, we indicate that the string is never empty.
2) Modify the grammar in the example so that it will generate L ∪ {a^n b^(n+1) : n >= 0}.
I am not sure how to do this one. Does that mean I add one more production to the grammar, something like S -> aSbb?
Any explanation of these two questions would be greatly appreciated. I'm still trying to figure this grammar stuff out, so I am not sure about my answers.
1) The question is about modifying the grammar to obtain a new language, so don't modify the language directly…
Your grammar generates the empty word because of the production:
S -> λ
So you could think of removing this production altogether. This yields the following grammar:
S -> SS
S -> aSb
S -> bSa
Unfortunately, this grammar no longer generates anything at all (a bit like an induction that is missing its base case: there is no production whose right-hand side consists only of terminals). To fix this, add the following productions:
S -> ab
S -> ba
2) Don't randomly try to add production rules in the hope that it's going to work. Here you want a's followed by b's. So the production rule
S -> bSa
must certainly disappear. Also, the rule
S -> SS
would produce, e.g., abab (try to see how this is obtained). So we'll have to remove it too. We're left with:
S -> λ
S -> aSb
Now this grammar generates:
λ
ab
aabb
aaabbb
etc. That's not bad at all! To get an extra trailing b, we could create a new non-terminal, say T, replace our current S by T, and add that trailing b in S:
T -> λ
T -> aTb
S -> Tb
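If you want to convince yourself that this grammar really produces b, abb, aabbb, ..., a tiny brute-force enumerator is enough (just a checking aid I wrote in Python, not part of the exercise):

# Rewrite the leftmost non-terminal in every possible way and collect the
# strings that contain terminals only, up to a length bound.
RULES = {'S': ['Tb'], 'T': ['', 'aTb']}   # '' plays the role of λ

def words(max_len=7):
    found, todo = set(), ['S']
    while todo:
        form = todo.pop()
        nts = [i for i, c in enumerate(form) if c in RULES]
        if not nts:                        # terminals only: a word of the language
            found.add(form)
            continue
        i = nts[0]
        for rhs in RULES[form[i]]:
            new = form[:i] + rhs + form[i+1:]
            if len(new.replace('S', '').replace('T', '')) <= max_len:
                todo.append(new)
    return sorted(found, key=len)

print(words())   # ['b', 'abb', 'aabbb', 'aaabbbb']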
I know that this is homework; I gave you the solutions to your homework: that's because, from the way you asked your question, it seems you're completely lost. I hope this answer will help you get on the right path!

Computation of follow set

To compute FOLLOW(A) for all non-terminals A, apply the following rules
until nothing can be added to any FOLLOW set.
Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right endmarker.
If there is a production A -> aBb, then everything in FIRST(b) except epsilon is in FOLLOW(B).
If there is a production A -> aB, or a production A -> aBb where FIRST(b) contains epsilon, then everything in FOLLOW(A) is in FOLLOW(B).
Here a and b are actually alpha and beta (sentential forms). This is from the dragon book.
Now my question is: in this case, can we take a = epsilon?
And can b (beta) be 2 non-terminals like XY? (If it is a sentential form, then it should be possible...)
Here's what the Dragon book actually says: [See note 1]
Place $ in FOLLOW(S).
For every production A → αBβ, place everything in FIRST(β) except ε into FOLLOW(B).
For every production A → αB or A → αBβ where FIRST(β) contains ε, place FOLLOW(A) into FOLLOW(B).
There is a section earlier in the book on "notational conventions" in which it is made clear that a lower-case greek letter like α or β represents a possibly empty string of grammar symbols. So, yes, α could be empty and β could be two nonterminals (or any other string of grammar symbols).
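In code, the "apply until nothing can be added" part is just a fixed-point loop. Here is a Python sketch of those three rules (a sketch only: the grammar encoding, with productions stored as lists of symbols, and the first_seq helper, which must return FIRST of a symbol sequence with ε included, are my own assumptions):

EPSILON = "eps"

def follow_sets(grammar, start, first_seq):
    # grammar: dict mapping each non-terminal to a list of right-hand sides.
    follow = {nt: set() for nt in grammar}
    follow[start].add('$')                        # rule 1: $ follows the start symbol
    changed = True
    while changed:                                # repeat until no FOLLOW set grows
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                for i, b in enumerate(rhs):
                    if b not in grammar:          # terminals have no FOLLOW set
                        continue
                    beta = rhs[i + 1:]
                    fb = first_seq(beta)          # FIRST(β), ε included when β is empty or nullable
                    before = len(follow[b])
                    follow[b] |= fb - {EPSILON}   # rule 2
                    if EPSILON in fb:             # rule 3: β is empty or can vanish
                        follow[b] |= follow[a]
                    changed |= len(follow[b]) != before
    return follow

Note that taking β to be whatever follows B on the right-hand side automatically covers both of your cases: α can be empty (B is the first symbol), and β can be several symbols such as XY.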
Note:
Here I'm using a variant on the formatting suggestion made by @leftroundabout in this meta post. (The only difference is that I put the formulae in bold.) It's easy to type Greek letters as entities if you don't have a Greek keyboard handy; just use, for example, &alpha; (α) or &beta; (β). For upper-case Greek letters, write the name with an upper-case initial letter: &Sigma; (Σ). Other useful symbols are arrows: &rarr; (→) and &rArr; (⇒).

Specifying language for a given grammar

DFA problem : Write a complete grammar for L, including the quadruple and the production rules
L ={x: ∃y ∈ {a, b}* : x = ay}
Answer:
G={{S, A}, {a, b}, S, P}
P: S => aA
A => aA | bA | λ
My question is :
Why is there a λ for A, but no λ for S?
From the language definition, it is any string that begins with an a and contains only a's and b's, but why is A => bA in the answer? Doesn't that mean the string starts with b if A => bA?
Thank you so much
1. Why is there a λ for A, but no λ for S?
A λ (the empty string) can be derived from A in order to turn a sentential form into a sentence. Additionally, according to the language definition, the suffix y ∈ {a, b}* can be empty, e.g. "a" on its own is a string that belongs to the language. If y contains any symbols, the string is simply longer than one symbol.
S doesn't derive λ because the empty string is not in the language; the smallest string in the language is the single "a".
2. From the language definition, it is any string that begins with an a and contains only a's and b's, but why is A => bA in the answer? Doesn't that mean the string starts with b?
Note that only the strings that can be derived from the start variable S are included in the language of the grammar. You can't start a derivation from A (it is not the start variable), and if you start a derivation from S, your string will always start with the symbol a, for example S => aA => abA => abaA => aba.
I suggest you read "Why the need for terminals? Is my solution sufficient enough?", where I wrote about the basic definition of a formal grammar.

Chomsky Language Types

I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is unrestricted grammar, type 1 is context sensitive, type 2 is context free, whilst type 3 is regular. So, could someone please explain this and put it into context? Thanks.
A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.
Note: there can be more than one set of rules that describe the same language.
In general, the more restrictions placed on the rules, the less expressive the language (fewer words can be generated from the rules), but the easier it is to recognize whether a word belongs to the language the rules specify. Because of the latter, we want to identify languages by the most restricted set of rules that will still generate the same language.
A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):
The set of non-terminal symbols (N)
The set of terminal symbols (T)
The set of production rules (P)
The start symbol (S)
The terminal symbols (AKA letters) are the symbols that words of the language consist of, usually a subset of lowercase English letters (e.g. 'a', 'b', 'c').
The non-terminal symbols mark an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets is empty). The symbols used for non-terminals are usually uppercase English letters (e.g. 'A', 'B', 'C').
The rules denote possible transformations on a series of terminal and non-terminal symbols. They are of the form left-side -> right-side, where both the left-side and the right-side consist of a series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that the series of symbols "aBc" (as part of an intermediary "word") can be replaced with the series "cBa" during the generation of words.
The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.
The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).
The generation of a word is unsuccessful when not all non-terminals have been replaced with terminals, but no production rule can be applied to the current intermediary "word". In this case the generation has to start anew from the start symbol, following a different path of production rule applications.
Example:
G={T, N, P, S},
where
T={a, b, c}
N={A, B, C, S}
P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}
which denotes the language of three words: "a", "abb" and "bbccc"
An example application of the rules:
S->AB->aB->abb
where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.
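As a quick cross-check of the three-word claim, here is a tiny Python sketch that blindly applies the production rules above until only terminals remain (my own illustration, nothing standard about it):

RULES = {'S': ['A', 'AB', 'BC'], 'A': ['a'], 'B': ['bb'], 'C': ['ccc']}

def words(start='S'):
    done, todo = set(), [start]
    while todo:
        form = todo.pop()
        nts = [i for i, sym in enumerate(form) if sym in RULES]
        if not nts:                      # no non-terminals left: it's a word
            done.add(form)
            continue
        i = nts[0]                       # rewrite the leftmost non-terminal
        for rhs in RULES[form[i]]:
            todo.append(form[:i] + rhs + form[i+1:])
    return sorted(done)

print(words())   # ['a', 'abb', 'bbccc']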
When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).
The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).
Languages generated by Type 0 (or Unrestricted) grammars are the most expressive (least restricted). The set of Recursively Enumerable languages contains the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each and every word of the language (and only those). The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").
Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty, and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rule). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this example rule can be replaced with gamma only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta do not change).
Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and have no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages have context free generating grammars.
Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the starting symbol S does not appear on the right-side of any rules), which means that each non-terminal must produce exactly one terminal symbol (and possibly one non-terminal as well). The regular expressions generate/recognize languages of this type.
Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.
Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).
Examples
The regular grammar
T={a, b}
N={A, B, S}
P={S->aA, A->aA, A->bB, A->b, B->bB, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.
The context-free grammar
T={a, b}
N={A, B, S}
P={S->ASB, S->AB, A->a, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.
The context-sensitive grammar
T={a, b, c}
N={A, B, C, H, S}
P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that there are languages that cannot be described by any context-sensitive grammar but can still be generated by an unrestricted grammar, for example a (suitably encoded) language of Turing machines that halt on empty input: context-sensitive languages are always decidable, while that language is not.
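Because it is harder to see by eye what a context-sensitive grammar produces, here is a small bounded search over sentential forms in Python (again my own illustration; multi-symbol left-hand sides are handled by plain substring replacement, and the length bound of 9 is arbitrary):

RULES = [('S', 'aBC'), ('S', 'aSBC'), ('CB', 'HB'), ('HB', 'HC'), ('HC', 'BC'),
         ('aB', 'ab'), ('bB', 'bb'), ('bC', 'bc'), ('cC', 'cc')]

def words(max_len=9):
    seen, frontier, found = {'S'}, ['S'], set()
    while frontier:
        form = frontier.pop()
        if form.islower():                 # only terminals left: it's a word
            found.add(form)
            continue
        for lhs, rhs in RULES:
            start = 0
            while (i := form.find(lhs, start)) != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    frontier.append(new)
                start = i + 1
    return sorted(found, key=len)

print(words())   # ['abc', 'aabbcc', 'aaabbbccc']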
There are 4 types of grammars:
TYPE-0:
Grammar: Unrestricted grammar
Language: Recursively enumerable language
Automaton: Turing machine
TYPE-1:
Grammar: Context-sensitive grammar
Language: Context-sensitive language
Automaton: Linear bounded automaton
TYPE-2:
Grammar: Context-free grammar
Language: Context-free language
Automaton: Pushdown automaton
TYPE-3:
Grammar: Regular grammar
Language: Regular language
Automaton: Finite state automaton
Chomsky Normal Form is a normal form in which each production has the form:
A -> BC or A -> a
i.e. there are either two variables or only one terminal on the right-hand side of each rule.
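For example (an illustration of mine, not from the question), the context-free grammar S -> aSb | ab can be rewritten in Chomsky Normal Form as:
S -> AX | AB
X -> SB
A -> a
B -> b
Every right-hand side is now either two variables or a single terminal, and the grammar still generates the same strings ab, aabb, aaabbb, ...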