How can I prove that derivations in Chomsky Normal Form require 2n - 1 steps?

I'm trying to prove the following:
If G is a context-free grammar in Chomsky Normal Form, then for any string w ∈ L(G) of length n ≥ 1, any derivation of w takes exactly 2n - 1 steps.
How would I go about proving this?

As a hint: since every production in Chomsky Normal Form either has the form
S → AB, for nonterminals A and B, or the form
S → x, for terminal x,
Then deriving a string would work in the following way:
Create a string of exactly n nonterminals, then
Expand each nonterminal out to a single terminal.
Applying productions of the first form will increase the number of nonterminals from k to k + 1, since you replace one nonterminal (-1) with two nonterminals (+2) for a net gain of +1 nonterminal. Since you start with one nonterminal, this means you need to do n - 1 productions of the first form. You then need n more of the second form to convert the nonterminals to terminals, giving a total of n + (n - 1) = 2n - 1 productions.
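If you want to see that count concretely before writing the proof, here is a small Python sketch; the CNF grammar in it is a made-up toy example, and the 6-symbol cutoff is just there to keep the enumeration finite. It enumerates complete leftmost derivations (any derivation of a word uses the same multiset of productions as its leftmost version, so the step count is identical) and checks the 2n - 1 total:

RULES = {
    "S": [("A", "B"), ("a",)],   # toy CNF grammar, illustration only
    "A": [("a",)],
    "B": [("A", "B"), ("b",)],
}

def derivations(sentential, steps=0):
    # Yield (word, step count) for every complete leftmost derivation.
    # In CNF a sentential form never shrinks, so cutting off forms longer
    # than 6 symbols prunes nothing whose final word has length <= 6.
    if len(sentential) > 6:
        return
    nts = [i for i, s in enumerate(sentential) if s.isupper()]
    if not nts:
        yield "".join(sentential), steps
        return
    i = nts[0]                           # expand the leftmost nonterminal
    for rhs in RULES[sentential[i]]:
        yield from derivations(sentential[:i] + list(rhs) + sentential[i+1:],
                               steps + 1)

for word, steps in derivations(["S"]):
    assert steps == 2 * len(word) - 1    # the claimed count
    print(word, steps)                   # e.g. a 1, ab 3, aab 5, ...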
To show that you need exactly this many, I would suggest doing a proof by contradiction and showing that you can't do it with any more or any fewer. As a hint, try counting the number of productions of each type that are made and showing that if it isn't 2n - 1, either the string is too short, or you will still have nonterminals remaining.
Hope this helps!

Related

Prove the language is not context-free?

How can you prove that the language L given below is not context-free? I would like to know whether my proof below makes any sense; if not, what would be the correct method?
L = {a^n b^n c^i | n ≤ i ≤ 2n}
I am trying to prove this by contradiction. Suppose L is regular, with pumping length p, and let S = a^p b^p c^p. Observe that S ∉ L. Since there must be a pumping piece xy of length less than p, we can duplicate y, which consists of some number of b's, to cause x(y^2)z to enter the language, because the number of b's exceeds the number of c's and is no longer bound by the given condition n ≤ i ≤ 2n. Therefore we have a contradiction, and hence the language L is not context-free.
The proof is by contradiction. Assume the language is context-free. Then, by the pumping lemma for context-free languages, any string in L of length at least p can be written as uvxyz, where |vxy| ≤ p, |vy| > 0, and for all natural numbers k, u(v^k)x(y^k)z is in the language as well. Choose a^p b^p c^(p+1). Then we must be able to write this string as uvxyz so that |vy| > 0. There are several possibilities to consider:
1. v and y consist only of a's. In this case, pumping in either direction causes the numbers of a's and b's to differ, producing a string not in the language; so this cannot be the case.
2. v and y consist only of a's and b's. In this case, pumping might keep the numbers of a's and b's the same, but pumping up will eventually cause the number of c's to be less than the number of a's and b's; so this cannot be the case.
3. v and y consist only of b's. This case is similar to (1) above and so cannot be a valid choice.
4. v and y consist only of b's and c's. Also similar to (1) and (3) in that pumping will cause the numbers of a's and b's to differ.
5. v and y consist only of c's. Pumping up will eventually cause there to be more c's than twice the number of a's; so this cannot be the case either.
No matter how we choose v and y, pumping will produce strings not in the language. This is a contradiction; this means our assumption that the language is context-free must have been incorrect.
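To make the case analysis tangible, here is a quick Python check of case 5. The membership test and the particular decomposition (v and y are single c's inside the c-block) are illustrative choices of mine, and p = 4 is arbitrary:

import re

def in_L(w):
    # Membership in L = { a^n b^n c^i : n <= i <= 2n }
    m = re.fullmatch(r"(a*)(b*)(c*)", w)
    if not m:
        return False
    na, nb, nc = (len(g) for g in m.groups())
    return na == nb and na <= nc <= 2 * na

p = 4                                    # hypothetical pumping length
u, v, x, y, z = "a"*p + "b"*p, "c", "", "c", "c"*(p - 1)
assert u + v + x + y + z == "a"*p + "b"*p + "c"*(p + 1)
assert in_L(u + v + x + y + z)           # the chosen string is in L

for k in range(5):                       # pump the c-only pieces up and down
    print(k, in_L(u + v*k + x + y*k + z))
# k = 0 and k >= 3 fall outside n <= i <= 2n, exactly as case 5 predicts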

Topcoder Binary Search Tutorial

What we can call the main theorem states that binary search can be used if and only if for all x in S, p(x) implies p(y) for all y > x. This property is what we use when we discard the second half of the search space. It is equivalent to saying that ¬p(x) implies ¬p(y) for all y < x (the symbol ¬ denotes the logical not operator), which is what we use when we discard the first half of the search space.
Please explain this paragraph in simpler and detailed terms.
Consider that p(x) is some property of x. When using binary search, this property is usually x being greater than, less than, or equal to some other value k that you are looking for.
What we can call the main theorem states that binary search can be used if and only if for all x in S, p(x) implies p(y) for all y > x.
Let's say that x is some value in the middle of the list and you are looking for where k is. Let's also say that p(x) means that x is greater than k. If the list is sorted in ascending order, then p(y) must also hold for all values y to the right of x (y > x), since those values are greater still (the property carries over to the right). This is the basis of binary search: if you are looking for k and some value x is known to be greater than k, then all elements to its right are also greater than k. Notice that this is only true if the list is sorted. Consider the list [a,b,c] and a value k that you are looking for. If it is known that a < b and b < c, then if k < b is true, k < c must also be true.
This property is what we use when we discard the second half of the search space.
This is what the previous conclusion allows you to do. Since you know that the property that holds for x also holds for all those y (that is, they cannot be the element you are looking for, because they are all greater), it is safe to discard them, and you keep looking for k only in the lower half.
The rest of the paragraph says pretty much the same thing for discarding the lower half.
In short, p(x) is some transitive property that, once it holds for a given value x, must hold for all values to the right of x (again, because it is transitive). ¬p(x), on the other hand, must hold for all values to the left of x. Being able to conclude that those values cannot be the element you are looking for is what makes it safe to discard one half of the list.
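Putting it together, here is a minimal sketch of predicate-based binary search; the function names and the example list are mine, not from the tutorial:

def binary_search(lo, hi, p):
    # Smallest index in [lo, hi) for which p holds, or hi if none.
    # Requires: p(x) implies p(y) for all y > x.
    while lo < hi:
        mid = (lo + hi) // 2
        if p(mid):
            hi = mid        # p holds at mid and everywhere to its right
        else:
            lo = mid + 1    # not p(mid), so not p anywhere to its left
    return lo

# Example: find k = 42 in a sorted list via the predicate "lst[i] >= 42".
lst = [3, 17, 42, 42, 80]
i = binary_search(0, len(lst), lambda i: lst[i] >= 42)
print(i, lst[i])            # -> 2 42

The loop invariant is exactly the theorem's condition: once p holds at mid, it holds everywhere to the right, so that half can be discarded safely.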

Why can't an LL grammar be left-recursive?

In the dragon book, an LL grammar is defined as follows:
A grammar is LL if and only if, whenever A -> a | b are two distinct productions, the following two conditions apply:
FIRST(a) and FIRST(b) are disjoint. This implies that a and b cannot both derive EMPTY.
If b can derive EMPTY, then a cannot derive any string that begins with a terminal in FOLLOW(A); that is, FIRST(a) and FOLLOW(A) must be disjoint.
And I know that an LL grammar can't be left-recursive, but what is the formal reason? I guess a left-recursive grammar will contradict rule 2, right? E.g., I've written the following grammar:
S->SA|empty
A->a
Because FIRST(SA) = {a} and FOLLOW(S) = {$, a}, FIRST(SA) and FOLLOW(S) are not disjoint, so this grammar is not LL. But I don't know whether it is the left recursion that makes FIRST(SA) and FOLLOW(S) intersect, or whether there is some other reason. Put another way: is it true that every left-recursive grammar has a production that violates condition 2 of the LL definition?
OK, I figured it out. If a grammar contains a left-recursive production, like:
S->SA
then it must somehow contain another production to "finish" the recursion, say:
S->B
And since FIRST(B) is a subset of FIRST(SA), the two sets are not disjoint, which violates condition 1: there must be a conflict when filling the parse table entries for the terminals that appear in both FIRST(B) and FIRST(SA). To summarize, a left-recursive grammar can cause the FIRST sets of two or more productions to share terminals, thus violating condition 1.
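You can check this mechanically. Here is a small fixpoint computation of FIRST sets for a grammar of exactly that shape; the grammar S -> SA | B, A -> a, B -> b is a made-up minimal example:

EPS = "ε"
GRAMMAR = {"S": [["S", "A"], ["B"]], "A": [["a"]], "B": [["b"]]}

def first_of(seq, first):
    # FIRST set of a symbol sequence, given FIRST sets of the nonterminals.
    out = set()
    for sym in seq:
        if sym in first:                   # nonterminal
            out |= first[sym] - {EPS}
            if EPS not in first[sym]:
                return out
        else:                              # terminal
            return out | {sym}
    return out | {EPS}                     # the whole sequence can vanish

def first_sets(grammar):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                         # iterate to a fixpoint
        changed = False
        for nt, alts in grammar.items():
            new = set().union(*(first_of(a, first) for a in alts))
            if not new <= first[nt]:
                first[nt] |= new
                changed = True
    return first

first = first_sets(GRAMMAR)
print(first_of(["S", "A"], first))         # {'b'}
print(first_of(["B"], first))              # {'b'}  -- same terminal

Both alternatives of S can begin with b, so an LL(1) parse table would need two entries in the cell for (S, b): a conflict.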
Consider your grammar:
S->SA|empty
A->a
This is a shorthand for the three rules:
S -> SA
S -> empty
A -> a
Now consider the string aaa. How was it produced? You can only read one character at a time if you have no lookahead, so you start off like this (with S as the start symbol):
S -> SA
S -> empty
A -> a
Fine, you have produced the first a. But now you cannot apply any more rules, because there are no more nonterminals. You are stuck!
What you should have done was this:
S -> SA
S -> SA
S -> SA
S -> empty
A -> a
A -> a
A -> a
But you don't know this without reading the entire string. You would need an infinite amount of lookahead.
In a general sense, yes: with any left-recursive grammar, the choice of rule is ambiguous without unbounded lookahead. Look at the example again: there are two different rules for S. Which one should we use?
An LL(k) grammar is one that allows the construction of a deterministic, top-down (descent) parser with only k symbols of lookahead. The problem with left recursion is that it makes it impossible to determine which rule to apply until the complete input string has been examined, which makes the required k potentially infinite.
Using your example, choose a k, and give the parser an input sequence of length n >= k:
aaaaaaa...
A parser cannot decide whether it should apply S->SA or S->empty by looking at the next k symbols, because the decision depends on how many times S->SA has been chosen before, and that is information the parser does not have.
The parser would have to choose S->SA exactly n times and S->empty once, and it is impossible to decide which choice is right by looking at only the first k symbols of the input stream.
To know which is right, a parser would have to both examine the complete input sequence and keep count of how many times S->SA has been chosen, but such a parser falls outside the definition of LL(k).
Note that unlimited lookahead is not a solution because a parser runs on limited resources, so there will always be a finite input sequence of a length large enough to make the parser crash before producing any output.
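The same problem shows up concretely if you hand the rule to a naive recursive-descent parser. This sketch (my own illustration, not from the answers above) tries the grammar S -> SA | empty, A -> a:

import sys

def parse_S(s, i):
    # Naive recursive descent for  S -> S A | empty ;  A -> a
    j = parse_S(s, i)           # alternative S -> S A: immediate self-call,
                                # no input has been consumed yet
    if j < len(s) and s[j] == "a":
        return j + 1            # ...then consume the 'a' for A
    return i                    # alternative S -> empty

sys.setrecursionlimit(50)       # keep the inevitable failure small
try:
    parse_S("aaa", 0)
except RecursionError:
    print("S -> SA recursed forever without consuming a single character")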
In the book "The Theory of Parsing", Volume 2, by Aho and Ullman, page 681 you can find Lemma 8.3 that states: "No LL(k) grammar is left-recursive".
The proof says:
Suppose that G = (N, T, P, S) has a left-recursive nonterminal A. Then there is a derivation A -> Aw. If w -> e, then it is easy to show that G is ambiguous and hence cannot be LL. Thus, assume that w -> v for some v in T+ (a non-empty string of terminals). We can further assume that A -> u, where u is some string of terminals, and that there exists a derivation
Hence, there is another derivation:

What is the language of this deterministic finite automaton?

Given: [transition diagram of a DFA over {a, b} with states 1-6, start state 1, and accepting states 2, 3, and 5; diagram omitted]
I have no idea what the accepted language is.
From looking at it you can get several end results:
1.) bb
2.) ab(a,b)
3.) bbab(a, b)
4.) bbaaa
How to write a regular expression for a DFA
In any automaton, a state serves as a memory element. A state stores some information, like an ON/OFF fan switch.
A Deterministic Finite Automaton (DFA) is called a finite automaton because it has a finite amount of memory, present in the form of its states. For any Regular Language (RL), a DFA is always possible.
Let's see what information is stored in each state of the DFA (refer to my colorful figure).
(Note: in my explanation, "any number" means zero or more times, and Λ is the null string.)
State-1: the START state, and the information stored in it is that an even number of a's has been seen so far, and ZERO b's.
The Regular Expression (RE) for this state is (aa)*.
State-4: an odd number of a's has been seen so far, and ZERO b's.
The Regular Expression for this state is (aa)*a.
Figure: BLUE states = an EVEN number of a's seen, and RED states = an ODD number of a's seen.
NOTICE: once the first b has been read, the automaton can never move back to state-1 or state-4.
State-5: comes after the Yellow b. The Yellow b means a b read after an odd number of a's.
Once you get a b after an odd number of a's (at state-5), everything is acceptable from then on, because state-5 has a self-loop on both a and b.
You can write for state-5: the Yellow b followed by any string over a and b, that is, Yellow-b (a + b)*.
State-6: exists just to keep track of whether the number of a's is odd or even.
State-2: comes after an even number of a's, then a b, then any number of b's = (aa)* bb*.
State-3: comes after state-2, then a first a, then a loop via state-6.
We can write for state-3: state-2 a (aa)* = (aa)*bb* a (aa)*.
Because our DFA has three final states, the language it accepts is the union (+ in RE) of three RLs (or three REs).
So the language accepted by the DFA corresponds to the three accepting states 2, 3, and 5, and we can write it as:
state-2 + state-3 + state-5
(aa)*bb* + (aa)*bb* a (aa)* + Yellow-b (a + b)*
I forgot to explain where the Yellow b comes from.
ANSWER: the Yellow b is a b read after state-4 or state-3, and we can write it as:
Yellow-b = ( state-4 + state-3 ) b = ( (aa)*a + (aa)*bb* a (aa)* ) b
[ANSWER]
(aa)*bb* + (aa)*bb* a (aa)* + ( (aa)*a + (aa)*bb* a (aa)* ) b (a + b)*
English description of the language: the DFA accepts the union of three languages:
an EVEN NUMBER of a's, FOLLOWED BY ONE OR MORE b's;
an EVEN NUMBER of a's, FOLLOWED BY ONE OR MORE b's, FOLLOWED BY an ODD NUMBER of a's;
a PREFIX STRING over a and b WITH an ODD NUMBER of a's, FOLLOWED BY b, FOLLOWED BY ANY STRING over a and b (including Λ).
The English description is complex, but it is the only way to describe the language. You can improve it by first converting the given DFA into a minimized DFA, then writing the RE and the description.
Also, there is a derivative method to find the RE from a given transition graph using Arden's Theorem. I have explained here how to write a regular expression for a DFA using Arden's theorem. The transition graph must first be converted into a standard form, without null moves and with a single start state. But I prefer to learn theory of computation by analysis instead of using the mathematical derivation approach.
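As a sanity check, the final expression can be transcribed into Python regular-expression syntax and tested against the strings from the question (assuming I have transcribed it correctly):

import re

# (aa)*bb* + (aa)*bb*a(aa)* + ( (aa)*a + (aa)*bb*a(aa)* ) b (a + b)*
LANG = re.compile(r"(?:aa)*bb*"
                  r"|(?:aa)*bb*a(?:aa)*"
                  r"|(?:(?:aa)*a|(?:aa)*bb*a(?:aa)*)b[ab]*")

for w in ["bb", "aba", "abb", "bbabb", "bbaaa", "ba", "ab", "a", ""]:
    print(repr(w), bool(LANG.fullmatch(w)))
# The question's examples 1-4 (with one option chosen where it writes
# (a,b)) all match; "a" and the empty string are rejected.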
I guess this question isn't relevant anymore :) and it's probably better to guide you through it than just to state the answer, but I think I got a basic expression that covers it (it's probably minimizable), so I'll just write it down for future searchers:
(aa)*b(b)* // for stopping at 2
U
(aa)*b(b)*a(aa)* // for stopping at 3
U
(aa)*b(b)*a(aa)*b((a)*(b)*)* // for stopping at 5 via 3
U
a(aa)*b((a)*(b)*)* // for stopping at 5 via 4
The examples (1-4) that you give are not the language accepted by the DFA; they are merely strings that belong to the language the DFA accepts. Therefore, they all fall in the same language.
If you want to figure out the regular expression that defines this DFA, you will need to do something called k-path induction, and you can read up on it here.

Is the language of all strings over the alphabet "a,b,c" with the same number of substrings "ab" & "ba" regular?

Is the language of all strings over the alphabet "a,b,c" with the same number of substrings "ab" & "ba" regular?
I believe the answer is NO, but it is hard to give a formal demonstration of it, or even a non-formal one.
Any ideas on how to approach this?
It's clearly not regular. How is an FA going to recognize (abc)^n c (cba)^n? Strings like this are in your language, right? The argument is a simple one, based on the fact that there are infinitely many equivalence classes under the indistinguishability relation I_L.
The most common way to prove a language is NOT regular is using one of the Pumping Lemmas.
Using the lemma is a little tricky, since it has all those "exists" and "for all" quantifiers. To prove a language L is not regular using the pumping lemma, you have to prove that
for any integer p,
there is a word w in L of length n, with n >= p, such that
for all possible ways to decompose w as xyz, with len(xy) <= p and y nonempty,
there exists an i such that x(y^i)z (repeating the y part i times) is NOT in L.
whooo!
I'll show how the proof looks for the "same number of a's and b's" language. It should be straightforward to convert it to your case:
for any given p, we can make a word of length n = 2*p:
a^p b^p (p a's followed by p b's)
any way you decompose this into xyz with |xy| <= p, y will only contain a's.
Thus, pumping the y part will make the word have more a's than b's,
thus NOT belonging to L.
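Here is the same argument as a quick Python check; p = 5 is an arbitrary choice:

p = 5
w = "a"*p + "b"*p                       # the chosen word, length 2*p >= p

def in_lang(s):                         # same number of a's and b's
    return s.count("a") == s.count("b")

# Every decomposition w = xyz with |xy| <= p and y nonempty keeps y
# inside the leading block of a's, so pumping y unbalances the counts.
for xy in range(1, p + 1):              # xy = |xy|
    for ylen in range(1, xy + 1):
        x, y, z = w[:xy - ylen], w[xy - ylen:xy], w[xy:]
        assert set(y) == {"a"}          # y contains only a's
        assert not in_lang(x + y*2 + z) # one extra copy of y breaks it
print("no decomposition survives pumping")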
If you need intuition on why this works: it follows from how you need to be able to count up to arbitrarily large numbers to verify whether a word belongs to one of these languages. However, regular languages are described by finite automata, and no finite automaton can provide the infinite number of states required to represent all those counts. (The Wikipedia article should have a formal proof.)
EDIT: It looks like you can't use the pumping lemma directly in this particular case: if you always make y one character long, you can never make a word stop being accepted (aba becoming abbbba makes no difference, and so on).
Just do the equivalence class approach suggested by Patrick87 - it will probably turn out to be cleaner than any of the dirty hacks you would need to do to make the pumping lemma applicable here.
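For completeness, here is a small script backing the equivalence-class argument: the suffix c(cba)^m completes the prefix (abc)^m into the language but no other prefix (abc)^n, so all these prefixes sit in different classes, and there are infinitely many of them:

def balanced(w):
    # True iff w contains equally many "ab" and "ba" substrings.
    count = lambda t: sum(w[i:i+2] == t for i in range(len(w) - 1))
    return count("ab") == count("ba")

for m in range(1, 5):
    suffix = "c" + "cba"*m
    assert balanced("abc"*m + suffix)          # (abc)^m . suffix is in L
    for n in range(1, 5):
        if n != m:
            assert not balanced("abc"*n + suffix)   # no other prefix works
print("infinitely many equivalence classes, so the language is not regular")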