Simplification of lambda-productions, unary rules and non-useful symbols of a grammar

I know this is not a general question, but I would like to learn how to do it through an example I've already been working on. With that said:
I have the following grammar. I tried to simplify it, but I'm unsure of the result. Could someone help me confirm whether it's correct or not?
S -> BC | lambda
A -> aA | lambda
B -> bB
C -> c
If I have to simplify the grammar, I first apply lambda-elimination, after which I have something like:
S -> BC | B | C
A -> aA | a
B -> bB
C -> c
And finally I have to eliminate non-useful symbols:
first I eliminate the ones that are not productive, and then the ones that are unreachable, so:
S -> BC | bB | C
A -> aA | a
B -> bB ---> non-productive
C -> c
S -> C | b | C
A -> aA | a --> unreachable
C -> c
Finally I eliminate C because it is unnecessary, and I also eliminate BC because B and C were eliminated, so it should be something like:
S -> b | c
But to be honest, I don't think what I've done is correct, though I don't know exactly where it goes wrong.

It looks like there might be some problems with your simplification, or I'm having trouble following it. I'll walk through what I would do, and you can compare it to your understanding.
S -> BC | lambda
A -> aA | lambda
B -> bB
C -> c
I presume the goal is to eliminate as many lambda productions and non-terminal symbols as possible. The first thing to note is that the production S -> lambda cannot be eliminated without changing the language, but all other lambda productions can be. There is exactly one other, A -> lambda, so we are assured it can be eliminated. How do we eliminate it?
We note that A is unreachable from S, the start symbol. So we can quite easily eliminate A -> lambda by eliminating A altogether. We arrive at this simpler, equivalent, grammar:
S -> BC | lambda
B -> bB
C -> c
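To make the reachability check concrete, here is a small Python sketch (my own illustration, not part of the original answer) that computes the non-terminals reachable from S in the original grammar by a fixed-point iteration, with lambda written as the empty string:

# Fixed-point computation of the non-terminals reachable from S.
# RULES maps each non-terminal to its alternatives; lambda is "".
RULES = {"S": ["BC", ""], "A": ["aA", ""], "B": ["bB"], "C": ["c"]}

reachable = {"S"}
changed = True
while changed:
    changed = False
    for nt in list(reachable):
        for rhs in RULES[nt]:
            for sym in rhs:
                if sym in RULES and sym not in reachable:
                    reachable.add(sym)
                    changed = True

print(sorted(reachable))  # ['B', 'C', 'S'] -- A is unreachable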
Now, as our goal is to eliminate non-terminal symbols (we have already eliminated all extraneous lambda productions), we can look at S, B and C. We know we need a start symbol, so we might as well keep S as it is. B can only generate bB, which contains a non-terminal, and bB can never lead to a string of only terminals; B is unproductive and we can eliminate it. When we eliminate an unproductive symbol, any concatenation in which it appears must also be eliminated, since the concatenated expression can never arrive at a string of only terminals (any concatenation in which an unproductive symbol appears is itself unproductive):
S -> lambda
C -> c
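Productivity is the dual fixed point: a non-terminal is productive once some alternative consists only of terminals and already-productive non-terminals. A sketch of mine in the same style, run on the grammar after A has been removed:

# Fixed-point computation of the productive non-terminals.
RULES = {"S": ["BC", ""], "B": ["bB"], "C": ["c"]}

productive = set()
changed = True
while changed:
    changed = False
    for nt, alts in RULES.items():
        if nt in productive:
            continue
        # productive if some alternative has no unproductive symbols
        if any(all(sym not in RULES or sym in productive for sym in rhs)
               for rhs in alts):
            productive.add(nt)
            changed = True

print(sorted(productive))  # ['C', 'S'] -- B is unproductive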
Applying our analysis to C, we easily see it is as unreachable as A was initially and so it can be eliminated in the same way:
S -> lambda
This grammar is in simplest terms and is minimal in terms of non-terminal symbols and productions for the language {lambda}.

Related

A simple CFG claimed to have no equivalent PEG, that seems to have one anyway

In "Packrat Parsing: a Practical Linear-Time Algorithm with Backtracking" on page 30 the author states that the context-free grammar (CFG):
S -> a S a | a S b | b S a | b S b | a
appears not to have a corresponding parsing expression grammar (PEG).
The above CFG is equivalent to:
S -> (a | b) S (a | b) | a
and can be summarized as "an odd number of a's and b's with an 'a' in the middle". However, the straightforward translation of this to a PEG:
S <- (a / b) S (a / b) / a
seems to work fine and to accept the same language.
You can try this out yourself online using peg.js (enter the grammar as S = ('a' / 'b') S ('a' / 'b') / 'a').
Is the author wrong or am I misunderstanding something?
You just didn't test enough. Try inputs consisting of an odd number of a's: all of them match the CFG, but the PEG accepts only those of length 2^k - 1 for some integer k.
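To see this concretely, here is a tiny matcher with PEG semantics in Python (my own sketch, not from the paper): ordered choice means the second alternative is tried only if the first fails, and a nonterminal that has succeeded at a position is never re-parsed to a shorter match.

# PEG rule S <- (a/b) S (a/b) / a; returns the end position or None.
def peg_s(s, i):
    # first alternative: (a/b) S (a/b)
    if i < len(s) and s[i] in "ab":
        j = peg_s(s, i + 1)
        if j is not None and j < len(s) and s[j] in "ab":
            return j + 1
    # second alternative: a (no retry of S with a shorter match)
    if i < len(s) and s[i] == "a":
        return i + 1
    return None

for n in range(1, 16, 2):                  # odd lengths only
    word = "a" * n
    print(n, peg_s(word, 0) == len(word))  # True only for 1, 3, 7, 15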

Creating Context-Free-Grammar with restrictions

I'm trying to understand CFG by using an example with some obstacles in it.
For example, I want to match the declaration of a double variable:
double d; In this case, "d" could be any other valid identifier.
There are some cases that should not be matched, e.g. "double double;", but I don't understand how to prevent a match of the second "double".
My approach:
G = (Σ, V, S, P)
Σ = {a-z}
V = {S,T,U,W}
P = { S -> doubleTUW
T -> _(space)
U -> (a-z)U | (a-z)
W -> ;
}
Now there must be a way to limit the possible outcomes of this grammar using L(G). Unfortunately, I couldn't find a syntax that meets my requirement of ruling out a second "double".
Here's a somewhat tedious regular expression to match any identifier other than double (note that the bare prefixes d, do, dou, doub and doubl are themselves valid identifiers, so they need alternatives of their own at the end):
([a-ce-z]|d[a-np-z]|do[a-tv-z]|dou[ac-z]|doub[a-km-z]|doubl[a-df-z]|double[a-z])[a-z]*|d|do|dou|doub|doubl
Converting it to a CFG can be done mechanically but it is even more tedious:
ALPHA → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_B → a|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_D → a|b|c|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_E → a|b|c|d|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_L → a|b|c|d|e|f|g|h|i|j|k|m|n|o|p|q|r|s|t|u|v|w|x|y|z
NOT_O → a|b|c|d|e|f|g|h|i|j|k|l|m|n|p|q|r|s|t|u|v|w|x|y|z
NOT_U → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|v|w|x|y|z
WORD → NOT_D
| d NOT_O
| do NOT_U
| dou NOT_B
| doub NOT_L
| doubl NOT_E
| double ALPHA
| WORD ALPHA
| d | do | dou | doub | doubl
This is why many of us usually use scanner generators like (f)lex which handle such exclusions automatically.
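A quick sanity check of the regular expression above (my own sketch; as in the question, identifiers are assumed to consist of lowercase a-z only):

import re

NOT_DOUBLE = re.compile(
    r"([a-ce-z]|d[a-np-z]|do[a-tv-z]|dou[ac-z]|doub[a-km-z]"
    r"|doubl[a-df-z]|double[a-z])[a-z]*"
    r"|d|do|dou|doub|doubl"
)

for word in ["d", "do", "doubly", "doubles", "dominoes", "double"]:
    print(word, bool(NOT_DOUBLE.fullmatch(word)))
# every word above matches except "double" itself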

How can I construct a grammar that generates this language?

I'm studying for a finite automata & grammars test and I'm stuck with this question:
Construct a grammar that generates L:
L = {a^n b^m c^2n | n>=0, m>=0}
I think this should do the trick. I verified this on http://mdaines.github.io/grammophone/ .
S -> a S c c
| B
B -> b B
| .
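Whichever grammar you settle on, it helps to have an independent membership test for checking derived strings against the definition of L; a small sketch of my own:

import re

# membership test for L = {a^n b^m c^(2n) | n>=0, m>=0}
def in_L(w):
    m = re.fullmatch(r"(a*)(b*)(c*)", w)
    return bool(m) and len(m.group(3)) == 2 * len(m.group(1))

for w in ["", "bb", "abcc", "aabbbcccc", "abc", "aabbcc"]:
    print(repr(w), in_L(w))
# True for the first four, False for "abc" and "aabbcc"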
I find it always helps with these kinds of questions to come up with some rules for how to build big strings out of little strings. First, identify the littlest strings in your language. In our case, we can start with the observation that if n = 0, b^m is in our language; that is, w in b* is in our language. We then note that if x is a string in our language we get another string by adding one a on the left and two cs on the right; that is, axcc is a string in our language also. So our rules are:
b* in L
if x in L then axcc in L
Writing this in terms of a CFG is now straightforward:
S -> B
S -> aScc
Here, S generates our language L and B generates the language b*. We complete the grammar by providing a grammar for b* with start symbol B:
(1) S -> B
(2) S -> aScc
(3) B -> e
(4) B -> bB
Any string a^n b^m c^2n can be generated using n applications of rule 2, 1 application of rule 1, m applications of rule 4 and 1 application of rule 3. That this grammar generates no strings not in the language is left as an exercise.
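As an illustration, here is that recipe carried out in Python for a^2 b^3 c^4 (a sketch of mine; it relies on each sentential form of this grammar containing exactly one variable, so a single replace suffices):

# two applications of rule (2), one of rule (1),
# three of rule (4), one of rule (3)
form = "S"
for rhs in ["aScc", "aScc", "B", "bB", "bB", "bB", ""]:
    var = "S" if "S" in form else "B"  # the single variable in the form
    form = form.replace(var, rhs, 1)
    print(form)
# last line printed: aabbbcccc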

Necessary condition for grammar ambiguity

In my notebook I wrote:
The necessary condition for grammar ambiguity is
It contains the rule A->BB, where A and B are non-terminals.
OR it contains the rule A->a|b, where A is a non-terminal and {a,b} are terminals.
Would you please confirm or refute this statement?
That's not true because there are other ambiguous grammars that don't have either of those rules.
For example cc can be produced by A -> Bc -> cc but also by A -> cC -> cc in the following grammar:
A -> Bc | cC
B -> c
C -> c
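One way to convince yourself is to count leftmost derivations by brute force; two or more derivations of the same string demonstrate ambiguity. A Python sketch of mine for this toy grammar:

from collections import deque

RULES = {"A": ["Bc", "cC"], "B": ["c"], "C": ["c"]}

def leftmost_derivations(start, target):
    count = 0
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            count += 1
            continue
        if len(form) > len(target):
            continue  # safe: no rule here shrinks the form
        for i, sym in enumerate(form):
            if sym in RULES:  # expand the leftmost non-terminal
                for rhs in RULES[sym]:
                    queue.append(form[:i] + rhs + form[i + 1:])
                break
    return count

print(leftmost_derivations("A", "cc"))  # 2: via A -> Bc and A -> cC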

Why the need for terminals? Is my solution sufficient enough?

I'm trying to get my head around context-free grammars and I think I'm close. What is baffling me is this one question (I'm doing practice questions as I have an exam in a month's time):
I've come up with this grammar, but I believe it's wrong.
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
Apparently this is the correct solution:
S --> aSb | aA | bB
A --> aA | Σ
B --> bB | Σ
What I don't quite understand is why we have S --> aSb | aA | bB and not just S --> aSb | A | B. What is the need for the terminals? Can't I just call A instead and grab my terminals that way?
Testing to see if I can generate the string: aaabbbb
S --> aSb --> aaSbb --> aaaSbbb --> aaaBbbb --> aaabbbb
I believe I generate the string correctly, but I'm not quite sure. I'm telling myself that the reason for S --> aSb | aA | bB is that if we start with aA and then replace A with a, we have two a's, which gives us a correct string since the numbers of a's and b's are not equal; the same can be done with b. Any advice is greatly appreciated.
Into the Tuple (G-4-tuple)
V (non-terminals) = {A, B}
Σ (Terminals) = {a, b}
P = { } // not quite sure how to express my solution in R? Would I have to use a test string to do so?
S = A
First, some notation:
Σ means the language symbols; in your language Σ = {a, b}.
^ means the null symbol (it is theoretical; ^ is not a member of any language's symbols).
ε means the empty string (it is also theoretical, but ε can be a member of some languages).
The ^ symbol means nothing by itself; we use it purely as a theoretical device, like the ∞ symbol in mathematics (no number really is ∞, but we use it to state and prove theorems). Similarly, ^ stands for nothing, yet we still use it.
This point is not written in any book; I am writing it purely to aid understanding.
As you say, your language is L = {a^m b^n | m != n}. Suppose the productions are as follows:
First:
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
It means (only a very rare book may use Σ in grammar rules):
S --> aSb | A | B
A --> aA | a | b
B --> bB | a | b
I replaced Σ by a | b (the language symbols a and b).
This grammar can generate strings with equal numbers of a's and b's (a^n b^n). How can it generate a^n b^n? See the example derivation below:
S ---> aSb ---> aAb ---> aaAb ---> aabb
(using S --> aSb, then S --> A, then A --> aA, then A --> b)
But strings of this kind are not in the language L, because L requires m != n.
Second:
For the same reason, the production rules S --> aSb | aA | bB also fail to give a correct grammar if A --> aA | Σ or B --> bB | Σ are in the grammar.
I think in the second grammar you mean:
S --> aSb | aA | bB
A --> aA | ^
B --> bB | ^
Then this is a correct grammar for the language L = {a^m b^n | m != n}, because using:
S --> aSb
you can only generate equal numbers of a's and b's, and by replacing S with either aA or bB you create a sentential form with unequal numbers of a's and b's, which can never be turned back into a string of the form a^n b^n (since A never generates a b and B never generates an a).
Third:
But usually we write the grammar rules like this:
S --> aSb | A | B
A --> aA | a
B --> bB | b
Both forms are equivalent (they generate the same language L = {a^m b^n | m != n}), because once you rewrite S to either A or B you have to generate at least one extra a or b (or more), respectively, and thus the constraint m != n holds.
Remember: proving whether two grammars are equivalent is an undecidable problem, so no algorithm can decide it in general (a human argument can still settle particular cases, though :P :) ).
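What an algorithm can do is a bounded comparison: enumerate every string each grammar generates with at most a given number of terminals and compare the sets. A Python sketch of mine (uppercase letters are variables, lowercase are terminals; it terminates here because every recursive rule adds a terminal):

G1 = {"S": ["aSb", "A", "B"], "A": ["aA", "a"], "B": ["bB", "b"]}
G2 = {"S": ["aSb", "aA", "bB"], "A": ["aA", ""], "B": ["bB", ""]}

def words_up_to(grammar, limit):
    words, seen, stack = set(), set(), ["S"]
    while stack:
        form = stack.pop()
        # terminals never disappear, so prune on the terminal count
        if form in seen or sum(c not in grammar for c in form) > limit:
            continue
        seen.add(form)
        nts = [i for i, c in enumerate(form) if c in grammar]
        if not nts:
            words.add(form)
            continue
        i = nts[0]  # expand the leftmost variable
        for rhs in grammar[form[i]]:
            stack.append(form[:i] + rhs + form[i + 1:])
    return words

print(words_up_to(G1, 6) == words_up_to(G2, 6))  # True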
Fourth:
At the end I would also like to add that the grammar:
S --> aSb | A | B
A --> aA | ^
B --> bB | ^
doesn't produce L = {a^m b^n | m != n}, because it can also generate a^n b^n, for example:
S ---> aSb ---> aAb ---> ab
(the last step uses A --> ^)
Grammar in formal languages
A formal language can be represented by a formal grammar consisting of the four-tuple (S, V, Σ, P). (Note: a grammar or an automaton is a finite representation of a language, whether the language itself is finite or infinite.)
Σ: finite set of language symbols.
In a grammar we usually call this the finite set of terminals (in contrast to the variables V). Terminals are the things from which the language's strings (sentences) are constructed. In your example the set of terminals Σ is {a, b}. In a natural language (natural language meaning what we speak: Hindi, English, and so on) you can think of the terminals as the vocabulary or dictionary words.
V: finite set of non-terminals.
A non-terminal, also called a 'variable', should always participate in the grammar's production rules (otherwise it counts as a useless variable, i.e. a variable that never derives a string of terminals). The ultimate aim of a grammar is to produce the language's strings in their correct form, so every variable should be useful in some way. In a natural language you can compare the variables to categories such as nouns and verbs, each standing for a class of words with a particular semantic role (a verb stands for eating or sleeping, a noun for he or she, and so on).
Note: in books you will find V ∩ Σ = ∅, which means that variables are not terminals.
S: start variable (S ∈ V).
S is a special variable symbol called the 'start symbol'. A string is in the language of the grammar, L(G), only if it can be derived from the start variable S. If a string cannot be derived from S (even if it consists only of symbols from Σ), it is not in the language of the grammar; it belongs instead to the complement language L' = Σ* - L(G). (Compare the complement language in the case of regular languages.)
P: finite set of production rules.
Production rules are replacement rules of the form α --> β: during the derivation of a string from S, α (the left-hand side) may at any point be replaced by β (the right-hand side). (This is similar to how, in a natural language, a noun can be replaced by 'he' or 'she', and a verb by 'eating' or 'sleeping'.)
Production rules define how the language's sentences are formed. Formal languages are like natural languages in having a pattern: certain things can occur only in certain forms. We call this syntax in a programming language, and this ability of grammars is what makes them usable for syntax checking, that is, parsing.
Note: in α --> β, both α and β are strings over (V ∪ Σ)*, with the constraint that α must contain at least one variable (we can only replace a string that contains a variable; a terminal cannot be replaced by another terminal, or, put differently, a sentence cannot be replaced by another sentence).
Remember that a string in a derivation takes one of two forms, a sentence or a sentential form:
Sentence: all symbols are terminals (a sentence is either in L(G) or in the complement language L' = Σ* - L(G)).
Sentential form: at least one symbol is a variable (not a language string, but an intermediate string of the derivation).
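A tiny illustration of the distinction, using the convention that uppercase letters are variables and lowercase letters are terminals (my own sketch):

def classify(form):
    has_variable = any(c.isupper() for c in form)
    return "sentential form" if has_variable else "sentence"

print(classify("aAb"))   # sentential form
print(classify("aabb"))  # sentence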
From #MAV (Thanks!!):
To represent a grammar of the above language L = {a^m b^n | m != n}, the 4-tuple is:
V = {S, A, B}
Σ = {a, b}
P = {S --> aSb | A | B, A --> aA | a, B --> bB | b}
S = S
Note: I generally use P for the production rules; your book may use R for rules.
Terminology used in the theory of formal languages and automata
Capital letters are used for variables, e.g. S, A, B, in grammar construction.
Lowercase letters from the start of the alphabet are used for terminals (language symbols), for example a, b (sometimes digits like 0, 1 are used; ^ is the null symbol).
Lowercase letters from the end of the alphabet, e.g. w, x, y, z, are used for strings of terminals (you will find this notation in the pumping lemma; these symbols stand for language strings or substrings).
α, β, γ are used for sentential forms.
Σ is used for language symbols.
Γ is used for input or output tape symbols other than the language symbols.
^ is used for the null symbol; # or ☐ is used for the blank symbol in Turing machines and PDAs (^, #, ☐ are not language symbols).
ε is used for the empty string (it can be part of a language string; for example, { } is an empty body in C: both while(1); and while(1){ } are valid, so a valid program can contain empty sentences).
∅ means the empty set in set theory.
Φ, Ψ are used for substrings of sentential forms.
Note: ∅ means a set is empty, ε means a string is empty, and ^ means no symbol at all; don't mix them up, they are semantically different.
There are no fixed rules for notation that I know of, but this is commonly used terminology that you will find in most standard books.