BNF Grammar Ambiguity

I was recently thinking of the following BNF
A -> x | yA | yAzA
where x,y,z are terminals.
I'm pretty sure this grammar is ambiguous, but how would one make it unambiguous?

A grammar is ambiguous if a particular string can have more than one parse tree. In your language the string yyxzx can have either of these two parse trees:
      A                      A
     / \                   / | | \
    y   A                 y  A z  A
      / | | \               / \   |
     y  A z  A             y   A  x
        |    |                 |
        x    x                 x
Therefore the grammar is ambiguous.
This is in fact the same as the notorious "dangling else" ambiguity in C-like languages, with y = if, z = else, and x = statement; see http://en.wikipedia.org/wiki/Dangling_else for ideas on how to work around the problem.
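One standard repair, sketched here using the usual matched/unmatched split (the "match each z with the nearest unmatched y" policy that C uses for else), is:
A -> M | U
M -> x | yMzM
U -> yA | yMzU
Here M derives the "matched" strings, in which every y has a z, and U the "unmatched" ones. This generates the same language, but now yyxzx has exactly one parse tree: the z attaches to the nearest y.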

Is this a regular language?

This is the initial grammar:
S → ε | c | bSb | aAa
A → aSa | bAb
Some resulting words are:
baabaabbbbaabaab
bcb
baabaacaabaab
bbaabcbaabb
ababbaba
bbbb
ababbcbbaba
baabcbaab
aaaacaaaa
aacaa
baaaaaacaaaaaab
bbbbaacaabbbb
abaaabbaaaba
abbbaaacaaabbba
At first I wrote the regular expression (a|b)*c?(a|b)*, but later I noticed that the numbers of a's and of b's are always even, so this regular expression is wrong. Considering that an automaton can't count, can I conclude that the language is not regular? Thank you very much!
This language is not regular. You can prove as much using the Myhill-Nerode theorem, as follows. Consider the prefix b^n c. The shortest string that can be appended to this prefix to get a string in the language of the grammar is b^n; no shorter string works. Now imagine the language were regular. Then after processing the prefix b^n c, the DFA would have to have a shortest path to an accepting state of length exactly n. But a shortest path never repeats a state, so its length is bounded by the DFA's fixed number of states, while n is an arbitrary natural number. This is a contradiction, so our language has no DFA and is not regular.
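To make the distinguishing extensions concrete, here is a small Python sketch (my own addition, not from the original answer). It uses a characterization I read off the grammar: each production adds the same symbol at both ends of the string, the a-wrappings toggle between S and A, and only S can terminate, so a string is in the language iff it equals x + t + reverse(x) with t in {"", "c"} and an even number of a's in x:

def in_language(w):
    # Characterization derived from the grammar (an assumption of this
    # sketch): w is in L iff w == x + t + x[::-1] with t in {"", "c"}
    # and an even number of 'a' in x.  Assumes w is over {a, b, c}.
    h = len(w) // 2
    if len(w) % 2 == 0:
        x, t, y = w[:h], "", w[h:]
    else:
        x, t, y = w[:h], w[h], w[h + 1:]
    return t in ("", "c") and y == x[::-1] and x.count("a") % 2 == 0

# The distinguishing extensions from the proof: after the prefix b^n c,
# the shortest completion is exactly b^n, so the prefixes b^m c and
# b^n c with m != n must lead a DFA to different states.
for n in range(1, 5):
    shortest = next(k for k in range(10)
                    if in_language("b" * n + "c" + "b" * k))
    print(n, shortest)   # prints n followed by n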
An example of a non-regular grammar producing a regular language is trivially the following:
S -> aSa | aS | A | e
aSSaa -> Saa
A -> AA | ASA | aSaSA
SAS -> ASASSASaaSa
This grammar is not regular; it is not even context-free (some productions have more than a single nonterminal on the left-hand side). But it generates the regular language a*, the set of all strings of a's, owing to the productions S -> aS | e alone.

Constructing a linear grammar for the language

I'm having difficulty constructing a grammar for this language, especially a linear grammar.
Can anyone give me some basic tips or a methodology for constructing a grammar for a given language? Thanks in advance.
I also have doubts about whether the following answer to the question "Construct a linear grammar for the language" is right:
L = {a^n b c^n | n ∈ N}
Solution:
Right-Linear Grammar:
S--> aS | bA
A--> cA | ^
Left-Linear Grammar:
S--> Sc | Ab
A--> Aa | ^
As pointed out in the comments, these grammars are wrong since they generate strings not in the language. Here's a derivation of abcc in both grammars:
S -> aS -> abA -> abcA -> abccA -> abcc
S -> Sc -> Scc -> Abcc -> Aabcc -> abcc
Also as pointed out in the comments, there is a simple linear grammar for this language, where a linear grammar is defined as having at most one nonterminal symbol in the RHS of any production:
S -> aSc | b
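A quick sanity check by derivation (each step of S -> aSc wraps one a and one c around the middle):
S -> aSc -> aaScc -> aabcc
which gives a^2 b c^2, as required.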
There are some general rules for constructing grammars for languages. These are either obvious simple rules or rules derived from closure properties and the way grammars work (a worked example follows the list). For instance:
if L = {a} for an alphabet symbol a, then S -> a is a grammar for L.
if L = {e} for the empty string e, then S -> e is a grammar for L.
if L = R U T for languages R and T, then S -> S' | S'', together with the grammars for R and T, is a grammar for L, where S' is the start symbol of the grammar for R and S'' is the start symbol of the grammar for T.
if L = RT for languages R and T, then S -> S'S'', together with the grammars for R and T, is a grammar for L, with S' and S'' as above.
if L = R* for language R, then S -> S'S | e, together with the grammar for R, is a grammar for L, where S' is the start symbol of the grammar for R.
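As a worked example (my own illustration, not part of the original answer): to build a grammar for L = ({a} U {b}){c}* = (a|b)c*, combine the base grammars using rules 3, 5, and 4:
S'  -> a | b          (rule 3: union of {a} and {b})
C   -> c              (base grammar for {c})
S'' -> CS'' | e       (rule 5: {c}*)
S   -> S'S''          (rule 4: concatenation)
A sample derivation of bcc: S -> S'S'' -> bS'' -> bCS'' -> bcS'' -> bcCS'' -> bccS'' -> bcc. Note that S -> S'S'' is not linear, which leads to the next point.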
Rules 4 and 5, as written, do not preserve linearity. Linearity can be preserved for left-linear and right-linear grammars (since those grammars describe regular languages, and regular languages are closed under these kinds of operations); but linearity cannot be preserved in general. To prove this, an example suffices:
R -> aRb | ab
T -> cTd | cd
L = RT = {a^n b^n c^m d^m | n, m >= 1}
L' = R* = (a^n b^n)*, where n >= 1 and may differ from factor to factor
Suppose there were a linear grammar for L. We must have a production for the start symbol S that produces something. To produce something, we require a string of terminal and nonterminal symbols. To be linear, we must have at most one nonterminal symbol. That is, our production must be of the form
S := xYz
where x is a string of terminals, Y is a single nonterminal, and z is a string of terminals. If x is non-empty, reflection shows that the only useful choice for it is a; a production beginning with any other terminal fails to derive known strings in the language. Similarly, if z is non-empty, the only useful choice is d. This gives four cases:
x empty, z empty. This is useless, since we now have the same problem to solve for nonterminal Y as we had for S.
x = a, z empty. Y must now generate exactly a^n' b^n' b c^m d^m where n' = n - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
x empty, z = d. Y must now generate exactly a^n b^n c c^m' d^m' where m' = m - 1. But then the exact same argument applies to the grammar whose start symbol is Y.
x = a, z = d. Y must now generate exactly a^n' b^n' bc c^m' d^m' where n' and m' are as in 2 and 3. But then the exact same argument applies to the grammar whose start symbol is Y.
None of the possible productions for S makes progress: each case leaves a nonterminal facing exactly the same problem we started with, so the descent never terminates in a string. Therefore, no strings are derived, a contradiction, meaning that no grammar for L can be linear.
Now suppose there were a linear grammar for L'. That grammar would have to generate all the strings in (a^n b^n)R(a^m b^m), as well as those in R ∪ {e}. But it cannot generate the former, by the argument used above: any production useful for that purpose would get us no closer to a string in the language.

Creating a context-free grammar for a specific language

I am trying to create a context-free grammar for the language
L = {u2v | u, v ∈ {a,b}*, |u| >= |v|}
however, I can't really figure out how to proceed from here.
My idea is that for every a/b character that I generate in u, I should generate another a/b character in the string v. My biggest problem is the symbol 2 there, as I don't know how to add it after doing all of this or how to write a rule saying that it should be skipped.
How can this grammar be constructed?
A context-free grammar for this language would be:
G = ({S,T}, {a,b,2}, S, P)
P:
S-> aSa | aSb | bSa | bSb | T
T-> aT | bT | 2
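To see why this works: each S step generates one symbol of u together with one symbol of v (in every combination of a and b), so when T takes over, the u-part so far is exactly as long as v; T then appends the remaining surplus of u and terminates with the 2, so |u| >= |v| always holds. For example, a derivation of ab2a (u = ab, v = a):
S -> aSa -> aTa -> abTa -> ab2a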

Why the need for terminals? Is my solution sufficient?

I'm trying to get my head around context-free grammars and I think I'm close. What's baffling me is this one question, asking for a grammar for L = {a^m b^n | m != n} (I'm doing practice questions as I have an exam in a month's time):
I've come up with this grammar, but I believe it's wrong.
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
Apparently this is the correct solution:
S --> aSb | aA | bB
A --> aA | Σ
B --> bB | Σ
What I don't quite understand is why we have S --> aSb | aA | bB and not just S --> aSb | A | B. What is the need for the terminals? Can't I just call A instead and grab my terminals that way?
Testing to see if I can generate the string: aaabbbb
S --> aSb --> aaSbb --> aaaSbbb --> aaaBbbb --> aaabbbb
I believe I generate the string correctly, but I'm not quite sure. I'm telling myself that the reason for S --> aSb | aA | bB is that if we start with aA and then replace A with a, we have two a's, which gives a correct string since the counts are unequal; the same can be done with b. Any advice is greatly appreciated.
Into the Tuple (G-4-tuple)
V (Non-terminals) = {A, B}
Σ (Terminals) = {a, b}
P = { } // not quite sure how to express my solution in R? Would I have to use a test string to do so?
S = A
First:
Σ denotes the language symbols; in your language Σ = {a, b}.
^ denotes the null symbol (it is purely notational; ^ is not a member of any language's alphabet).
ε denotes the empty string (unlike ^, ε can be a member of a language).
The symbol ^ stands for nothing at all; we use it only as a notational device, much like the ∞ symbol in mathematics (no number actually is ∞, but the symbol helps us state and prove theorems).
This point is not written in any book; I am including it purely to aid understanding. The subject belongs to theory and mathematics, and I am coming from computer science.
As you say, your language is L = {a^m b^n | m != n}. Suppose the productions are as follows:
S --> aSb | A | B
A --> aA | Σ
B --> bB | Σ
This effectively means (very few books use Σ in grammar rules):
S --> aSb | A | B
A --> aA | a | b
B --> bB | a | b
Here I have replaced Σ by a | b (the language symbols).
This grammar can generate strings with equal numbers of a's and b's (a^n b^n). How can it generate a^n b^n? See the example derivation below:
S ---> aSb ---> aAb ---> aaAb ---> aabb
        ^        ^          ^        ^
    S-->aSb    S-->A     A-->aA    A-->b
But such strings are not in the language L, because m != n.
Second:
For the same reason, the production rules S --> aSb | aA | bB also do not form a correct grammar if A --> aA | Σ or B --> bB | Σ are in the grammar.
I think that in the second grammar you mean:
S --> aSb | aA | bB
A --> aA | ^
B --> bB | ^
Then this is a correct grammar for the language L = {a^m b^n | m != n}: using
S --> aSb
you can only generate equal numbers of a's and b's, and by replacing S with either aA or bB you create a sentential form with unequal numbers of a and b symbols that can never be turned back into a string of the form a^n b^n (since A never generates a b and B never generates an a).
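For instance (my own example), a derivation of aab in this grammar:
S ---> aSb ---> aaAb ---> aab
using S --> aSb, then S --> aA, then A --> ^; here m = 2 and n = 1, so m != n as required.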
Third:
But usually we write grammar rules like:
S --> aSb | A | B
A --> aA | a
B --> bB | b
Both forms are equivalent (they generate the same language L = {a^m b^n | m != n}) because once you convert S into either A or B you must generate at least one extra a or b (or more) respectively, and thus the constraint m != n holds.
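For example, a derivation of aaab in this third form:
S ---> aSb ---> aAb ---> aaAb ---> aaab
using S --> aSb, S --> A, A --> aA, and finally A --> a; the forced terminal in A --> a is what guarantees m != n.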
Remember that proving whether two grammars are equivalent is, in general, an undecidable problem: no algorithm can decide it, although a proof by hand may still be possible in particular cases.
Fourth:
Finally, I would also like to add that the grammar:
S --> aSb | A | B
A --> aA | ^
B --> bB | ^
does not produce L = {a^m b^n | m != n}, because it can generate a^n b^n, for example:
S ---> aSb ---> aAb ---> ab
                         ^
                      A --> ^
Grammar in formal languages
Any class of formal languages can be represented by a formal grammar consisting of the four-tuple (S, V, Σ, P). (Note that a grammar, like an automaton, is a finite representation of a language, whether that language is finite or infinite.)
Σ: Finite set of language symbols.
In a grammar we usually call this the finite set of terminals (in contrast to the variables V). Terminals, or language symbols, are the things out of which the language's strings (sentences) are built. In your example the set of terminals Σ is {a, b}. In a natural language you can think of the terminals as the vocabulary or dictionary words.
(By natural language I mean spoken languages such as Hindi or English.)
V: Finite set of Non-terminals.
A non-terminal, or 'variable', should always participate in the grammar's production rules (otherwise it counts as a useless variable: one that derives no terminal string, or derives nothing at all).
The ultimate aim of a grammar is to produce the language's strings in their correct form, so every variable should be useful in some way.
In a natural language you can think of the variables as syntactic categories such as Noun, Verb, or Tense, which capture a specific semantic role in the language (e.g. Verb covers eating/sleeping; Noun/Pronoun covers he, she, etc.).
Note: in some books you will find the requirement V ∩ Σ = ∅, which means that no variable is a terminal.
S: Start Variable. (S ∈ V)
S is a special variable called the 'start symbol'. A string is considered to be in the language of the grammar, L(G), only if it can be derived from the start variable S. If a string cannot be derived from S (even if it consists only of language symbols from Σ), then it is not in the language of the grammar; it belongs instead to the complement language L' = Σ* - L(G). (Compare: the complement language in the case of regular languages.)
P: Finite set of Production Rules.
Production rules define replacement rules of the form α --> β, meaning that during the derivation of a string from S, the substring α (lhs) may at any time be replaced by β (rhs). (This is similar to natural language, where Noun can be replaced by he or she, and Verb can be replaced by eating, sleeping, etc.)
Production rules define the formation rules for the language's sentences. A formal language, like a natural language, has patterns: certain things can occur only in certain forms, which in a programming language we call syntax. It is this property of grammars that lets them be used for syntax checking, i.e. parsing.
Note: in α --> β, both α and β consist of variables and terminals, i.e. strings in (V ∪ Σ)*, with the constraint that α must contain at least one variable (we can only rewrite a string that contains a variable; a terminal cannot be replaced by another terminal, which is to say a finished sentence cannot be rewritten into another sentence).
Remember that a string can be in one of two forms, a sentential form or a sentence:
Sentence: all symbols are terminals (a sentence is either in L(G) or in the complement language L' = Σ* - L).
Sentential form: at least one symbol is a variable (not a language string, but an intermediate derivation string).
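For example, in the grammar S --> aSb | A | B with A --> aA | a and B --> bB | b above, aAb is a sentential form (it still contains the variable A), while aab is a sentence (all terminals; here it is also in L(G), since 2 != 1).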
From #MAV (Thanks!!):
To represent a grammar for the above language L = {a^m b^n | m != n}, the 4-tuple is:
V = {S, A, B}
Σ = {a, b}
P = {S --> aSb | A | B, A --> aA | a, B --> bB | b}
S = S
Note: generally I use P for production rules; your book may use R for rules.
Terminology used in the theory of formal languages and automata
Capital letters are used for variables, e.g. S, A, B in grammar constructions.
Lowercase letters from the start of the alphabet are used for terminals (language symbols), e.g. a, b. (Sometimes digits such as 0, 1 are used; also, ^ is the null symbol.)
Lowercase letters from the end of the alphabet, e.g. w, x, y, z, are used for strings of terminals (you will find these notations in the pumping lemma, for example; they denote language strings or substrings).
α, β, γ are used for sentential forms.
Σ is used for the set of language symbols.
Γ is used for input or output tape symbols other than language symbols.
^ is used for the null symbol; # or ☐ is used for the blank symbol in Turing machines and PDAs (^, #, ☐ are not language symbols).
ε is used for the empty string (which can be part of a language string: for example, { } is an empty body in C, so both while(1); and while(1){ } are valid programs with empty statements).
∅ means the empty set, as in set theory.
Φ, Ψ are used for substrings of sentential forms.
Note: ∅ means the set is empty, ε means the string is empty, and ^ means no symbol at all; don't mix them up, as all three differ in meaning.
There are no fixed rules I know of for symbol notation, but the above conventions are what one finds in most standard books.
Next post: Tips for writing Context free grammar

":=" and "=>" in Mercury

I recently came across this code example in Mercury:
append(X,Y,Z) :-
X == [],
Z := Y.
append(X,Y,Z) :-
X => [H | T],
append(T,Y,NT),
Z <= [H | NT].
Being a Prolog programmer, I wonder: what's the difference between a normal unification =
and the := or => which are used here?
In the Mercury reference, these operators are given different priorities, but the difference in meaning isn't explained.
First, let's rewrite the code using indentation:
append(X, Y, Z) :-
    X == [],
    Z := Y.
append(X, Y, Z) :-
    X => [H | T],
    append(T, Y, NT),
    Z <= [H | NT].
The code above isn't real Mercury code, it is pseudo code. It doesn't make sense as real Mercury code because the <= and => operators are used for typeclasses (IIRC), not unification. Additionally, I haven't seen the := operator before and I'm not sure what it does.
In this style of pseudo code, I believe the author is trying to show that := is an assignment type of unification, where Z is assigned the value of Y. Similarly, => shows a deconstruction of X, and <= shows a construction of Z. Also, == shows an equality test between X and the empty list. All of these operations are types of unification. The compiler knows which type of unification should be used for each mode of the predicate. For this code, the mode that makes sense is append(in, in, out).
Mercury is different from Prolog in this respect: it knows which type of unification to use, and can therefore generate more efficient code and ensure that the program is mode-correct.
One more thing, the real Mercury code for this pseudo code would be:
:- pred append(list(T)::in, list(T)::in, list(T)::out) is det.
append(X, Y, Z) :-
    X = [],
    Z = Y.
append(X, Y, Z) :-
    X = [H | T],
    append(T, Y, NT),
    Z = [H | NT].
Note that every unification is now written with a plain =, and that a predicate declaration with mode and determinism information has been added.
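For what it's worth, here is a hedged sketch (my own addition, not from the original answer) of how additional modes could be declared, assuming the standard Mercury style of separating the mode declarations from the pred declaration:

:- pred append(list(T), list(T), list(T)).
:- mode append(in, in, out) is det.
:- mode append(out, out, in) is multi.

In the (out, out, in) mode the compiler would compile Z = [H | NT] as a deconstruction of Z and X = [H | T] as a construction of X, the reverse of their roles in the (in, in, out) mode; this is exactly the mode-specific choice of unification described above.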
In concrete Mercury syntax the operator := is used for field updates.
Recent Mercury releases may not let you use operators like ':=', '<=', '=>' and '==' this way, but these operators are indeed specialized unifications, according to the description in Nancy Mazur's thesis.
'=>' stands for deconstruction, e.g. X => f(Y1, Y2, ..., Yn), where X is input and all the Yi are output; it is semidet. '<=' is the opposite (construction) and is det. '==' is used when both sides are ground, and it is semidet. ':=' is just like the regular assignment operator in any other language, and it is det. In older papers one can even see '==' used instead of '=>' to perform a deconstruction.
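To summarize the four specialized unifications described above:

:=   assignment (det): binds the output on the left to the value on the right
=>   deconstruction (semidet): matches a ground term on the left against the pattern on the right, binding the pattern's variables
<=   construction (det): builds the term on the left from the pattern on the right
==   equality test (semidet): both sides already ground

All four correspond to what the single = of real Mercury code compiles to, depending on the mode of the predicate.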