Computation of follow set - grammar

To compute FOLLOW(A) for all non-terminals A, apply the following rules
until nothing can be added to any FOLLOW set.
Place $ in FOLLOW(S) , where S is the start symbol, and $ is the input
right endmarker .
If there is a production A -> B, then everything in FIRST(b) except epsilon
is in FOLLOW(B) .
If there is a production A -> aBb, or a production A -> aBb, where
FIRST(b) contains t, then everything in FOLLOW(A) is in FOLLOW(B).
a,b is actually alpha and beta(sentential form). This is from dragon book.
Now my question is in this case can we take a=epsilon ?
and can b(beta) be 2 non-terminals like XY? (if senetntial then it solud be..)

Here's what the Dragon book actually says: [See note 1]
Place $ in FOLLOW(S).
For every production A→αBβ, place everything
in FIRST(β) except ε into
FOLLOW(B)
For every production A→αB or
A→αBβ where FIRST(β) contains
ε, place FOLLOW(A) into
FOLLOW(B).
There is a section earlier in the book on "notational conventions" in which it is made clear that a lower-case greek letter like α or β represents a possibly empty string of grammar symbols. So, yes, α could be empty and β could be two nonterminals (or any other string of grammar symbols).
Note:
Here I'm using a variant on the formatting suggesting made by #leftroundabout in this meta post. (The only difference is that I put the formulae in bold.) It's easy to type Greek letters as entities if you don't have a Greek keyboard handy; just use, for example, α (α) or β (β). For upper-case Greek letters, write the name with an upper-case letter: Σ (Σ). Other useful symbols are arrows: → (→) and ⇒ (⇒).

Related

Definition of First and Follow sets of the right-hand sides of production

I am learning about LL(1) grammars. I have a task of checking if grammar is LL(1) and if not, I then need to find the rules, which prevent it from being LL(1). I came across this link https://www.csd.uwo.ca/~mmorenom/CS447/Lectures/Syntax.html/node14.html which has a theorem which can be used as a criteria for deciding if grammar is LL(1) or not. It says that for any rule A -> alpha | beta some equalities, considering FIRST and FOLLOW sets need to be true. Therefore, I need to find FIRST and FOLLOW sets of these right-hand sides of production.
Let's say, I have following rules A -> a b B S | eps. How do I calculate FIRST and FOLLOW of a b B S? As far as I understand by definition these sets are defined only for 1 non-terminal symbol.
The idea behind the FIRST function is that it returns the set of terminals which could possibly start the expansion of its argument. It's usual to also add the special object ε (which is a way of writing an empty sequence of symbols) if ε is a possible expansion.
So if a is a terminal, FIRST(a) is just { a }. And if A is a non-terminal, FIRST(A) is the set of non-terminals which could possibly appear at the beginning of a derivation of A. Finally, FIRST(ε) must be { ε }, according to the convention described above.
Now suppose α is a (possibly empty) sequence of grammar symbols:
If α is empty (that is, it's ε), FIRST(α) is { ε }
If the first symbol in α is the terminal a, FIRST(α) is { a }.
If the first symbol in α is the non-terminal A, there are two possibilities. Let TAIL(α) be the rest of α after the first symbol. Now:
if ε ∈ FIRST(A), then FIRST(α) is FIRST(A) ∪ FIRST(TAIL(α)).
otherwise, FIRST(α) is FIRST(A).
Now, how do we compute FIRST(A), for every non-terminal A? Using the above definition of FIRST(α), we recursively define FIRST(A) to be the union of the sets FIRST(α) for every α which is the right-hand side of a production A → α.
The FOLLOW function defines the set of terminal symbols which might appear after the expansion of a non-terminal. It is only defined on non-terminals; if you look carefully at the LL(1) conditions on the page you cite, you'll see that FIRST is applied to a right-hand side, while FOLLOW is only applied to left-hand sides.

Why does JFlap fail to build a usable LL(1) parser from my calculator grammar?

I entered the following grammar in JFlap:
E → TK
K → +TK
K → λ
T → FM
M → *FM
M → λ
F → i
F → (E)
and tried to parse i * (i + i). I am sure the LL(1) grammar is correct and the input string should be accepted but JFlap said that the string in rejected. (See screenshot). Why?
The grammar is fine.
However, you somehow entered it incorrectly. Notice that your parsing table has a λ column. That means that JFlap interpreted the λ as a character, not as an empty right-hand side, probably because you typed a real λ rather than letting JFlap fill it in automatically. You should just leave the right-hand side empty if you want an empty right-hand side. JFlap will then show that as λ but it won't treat it as a symbol.
By way of illustration, here are screenshots for the correctly entered grammar (with empty right-hand sides) and a grammar where I typed a λ instead of leaving the right-hand side empty. I stopped the second parse one step earlier than the error message, so that you can see the problem being reported: since M does not have an empty production, it blocks the parser from recognising the + sign.
Here's the correctly entered grammar:
And here's the one which I generated the same way that you did (if my guess is right). Note that it has a λ column in the transition table. You would also see it being handled differently in the FIRST and FOLLOW sets.
As a postscript, JFlap seems to handle most Unicode characters in tokens, but using a λ character as a token triggers a variety of bugs. So you shouldn't do that even if you intended the λ to be a legitimate character.

What are terminal and nonterminal symbols?

I am reading Rebol Wikipedia page.
"Parse expressions are written in the parse dialect, which, like the do dialect, is an expression-oriented sublanguage of the data exchange dialect. Unlike the do dialect, the parse dialect uses keywords representing operators and the most important nonterminals"
Can you explain what are terminals and nonterminals? I have read a lot about grammars, but did not understand what they mean. Here is another link where this words are used very often.
Definitions of terminal and non-terminal symbols are not Parse-specific, but are concerned with grammars in general. Things like this wiki page or intro in Grune's book explain them quite well. OTOH, if you're interested in how Red Parse works and yearn for simple examples and guidance, I suggest to drop by our dedicated chat room.
"parsing" has slightly different meanings, but the one I prefer is conversion of linear structure (string of symbols, in a broad sense) to a hierarchical structure (derivation tree) via a formal recipe (grammar), or checking if a given string has a tree-like structure specified by a grammar (i.e. if "string" belongs to a "language").
All symbols in a string are terminals, in a sense that tree derivation "terminates" on them (i.e. they are leaves in a tree). Non-terminals, in turn, are a form of abstraction that is used in grammar rules - they group terminals and non-terminals together (i.e. they are nodes in a tree).
For example, in the following Parse grammar:
greeting: ['hi | 'hello | 'howdy]
person: [name surname]
name: ['john | 'jane]
surname: ['doe | 'smith]
sentence: [greeting person]
greeting, person, name, surname and sentence are non-terminals (because they never actually appear in the linear input sequence, only in grammar rules);
hi, hello, howdy with john, jane, doe and smith are terminals (because parser cannot "expand" them into a set of terminals and non-terminals as it does with non-terminals, hence it "terminates" by reaching the bottom).
>> parse [hi jane doe] sentence
== true
>> parse [howdy john smith] sentence
== true
>> parse [wazzup bubba ?] sentence
== false
As you can see, terminal and non-terminal are disjoint sets, i.e. a symbol can be either in one of them, but not in both; moreso, inside grammar rules, only non-terminals can be written on the left side.
One grammar can match different strings, and one string can be matched by different grammars (in the example above, it could be [greeting name surname], or [exclamation 2 noun], or even [some noun], provided that exclamation and noun non-terminals are defined).
And, as usual, one picture is worth a thousand words:
Hope that helps.
think of it like that
a digit can be 1-9
now i will tell you to write down on a page a digit.
so you know that you can write down 1,2,3,4,5,6,7,8,9
basically the nonterminal symbol is "digit"
and the terminals symbols are the 1,2,3,4,5,6,7,8,9
when i told you to write down on a page a digit you wrote down 1 or 2 or 3 or 4 or 5 or 6 or 7 or 8 or 9
you didn't wrote down the word "digit" you wrote down the 1 or 2 or 3....
do you see where i'm going ?
let's try to make our own "rules"
let's "create" a nonterminal symbol we will call it "Olaf"
Olaf can be a dog (NOTE: dog is terminal)
Olaf can be a cat (NOTE: cat is terminal)
Olaf can be a digit (NOTE: digit is nonterminal)
Now i'm telling you that you can write down on a page an Olaf.
so that's mean that you can write down "dog"
you can also write down "cat"
you can also write down a digit so that's mean you can write down 1 or 2 or 3...
because digit is nonterminal symbol you dont write down "digit" you write down
the symbols that digit is referring to which is 1 or 2 or 3 etc...
in the end only terminals symbols are written on the "page"
one more thing i have to say is something that you may encounter one day, basically when you say "a nonterminal can be something".
there is a special term for that and that's basically called a "production rule"(can also be called a "production")
for example
Olaf can be "dog"
Olaf can be "cat"
Olaf can be digit
we got 3 productions here in other words we got here 3 definitions of Olaf
specifications of Programming languages use those ideas quite a lot when defining a syntax of a language

Specifying language for a given grammar

DFA problem : Write a complete grammar for L, including the quadruple and the production rules
L ={x: ∃y ∈ {a, b}* : x = ay}
Answer:
G={{S, A}, {a, b}, S, P}
P: S => aA
A => aA | bA | λ
My question is :
Why there is λ for A, but there is no λ for S?
From the language definition, it is any string that begins with an a and contains only a's and b's , but why in the answer A => bA. Does not it mean that the string starts with b if it is A => bA?
Thank you so much
1. Why there is λ for A, but there is no λ for S?
λ nul can be derived from A to convert a sentimental from into sentence. Additionally according to language statement prefix sub-string y ∈ {a, b}* can be nul (a empty string) e.g. "a" is a string belongs to the language. If y contain any symbol then length of language will be more than one.
S doesn't derive λ nul because empty (or say nul string) is not in language. The smallest string in language is single "a".
2. From the language definition, it is any string that begins with an a and contains only a's and b's , but why in the answer A => bA. Does not it mean that the string starts with b if it is A => bA?
Note only strings those can derived from start variable S are included in language of grammar. You can't start derivation from A (that is not start variable). And if you start a derivation from S your string will always start with a symbol.
I suggest you to read: "Why the need for terminals? Is my solution sufficient enough?" Where I written about basic definition of formal grammar.

Chomsky Language Types

I'm trying to understand the four different Chomsky language types but the definitions that I have found don't really mean anything to me. I know type 0 is free grammar, type 1 is context sensitive, type 2 is context free whilst type 3 is regular. So, could someone please explain this and put it into context, thanks.
A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.
Note: there can be more than one set of rules that desrcibe the same language.
In general, the more restrictions placed on the rules, the less expressive the language (less words can be generated from the rules), but easier to recognize if a word belongs to the language the rules specify. Because of the latter, we want to identify languages with the most restrictions on their rules that will still allow us to generate the same language.
A few words about the rules: In general, you describe a formal language with four items (AKA a four-tuple):
The set of non-terminal symbols (N)
The set of terminal symbols (T)
The set of production rules (P)
The start symbol (S)
The terminal symbols (AKA letters) are the symbols that words of the language consist of, ususally a subset of lowercase English letters (e.g. 'a', 'b', 'c')
The non-terminal symbols are marking an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets are empty). The symbols used for non-terminals are usually subsets of uppercase English letters (e.g. 'A', 'B', 'C')
The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.
The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.
The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).
The generation of a word is unsuccessful, when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied on the current intermediary "word". In this case the generation has to strart anew from the starting symbol, following a different path of production rule applications.
Example:
L={T, N, P, S},
where
T={a, b, c}
N={A, B, C, S}
P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}
which denotes the language of three words: "a", "abb" and "bbccc"
An example application of the rules:
S->AB->aB->abb
where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.
When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).
The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).
Languages generated by Type 0 (or Unrestricted) grammars are most expressive (less restricted). The set of Recursively Enumerable languages contain the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each an every (and only these) words of the language. The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").
Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty, and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rules). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this rule example can be replaced with gamma, only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta does not change).
Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and have no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages have context free generating grammars.
Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the starting symbol S does not appear on the right-side of any rules), which means that each non-terminal must produce exactly one terminal symbol (and possibly one non-terminal as well). The regular expressions generate/recognize languages of this type.
Some of these restrictions can be lifted/modified in a way to keep the modified grammar have the same expressive power. The modified rules can allow other algorithms to recognize the words of a language.
Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).
Examples
The regular grammar
T={a, b}
N={A, B, S}
P={S->aA, A->aA, A->aB, B->bB, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that is it not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.
The context-free grammar
T={a, b}
N={A, B, S}
P={S->ASB, A->a, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.
The context-sensitive grammar
T={a, b, c}
N={A, B, C, H, S}
P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}
generates tha language which contains words that start with non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination to a BC combination, so the B's can be gathered on the left, and the C's can be gathered on the right. Note that it is not possible to describe a language where the words consist of a series of 'a's, where the number of 'a's is a prime with context-sensitive grammars, but it is possible to write an unrestricted grammar that generates that language.
There are 4 types of grammars
TYPE-0 :
Grammar accepted by Unrestricted Grammar
Language accepted by Recursively enumerable language
Automaton is Turing machine
TYPE-1 :
Grammar accepted by Context sensitive Grammar
Language accepted by Context sensitive language
Automaton is Linear bounded automaton
TYPE-2 :
Grammar accepted by Context free Grammar
Language accepted by Context free language
Automaton is PushDown automaton
TYPE-3 :
Grammar accepted by Regular Grammar
Language accepted by Regular language
Automaton is Finite state automaton
-
Chomsky is a Normal Form that each production has form:
A->BC or A->a
There can be Two variables or Only one terminal for rule in
Chomsky