Ambiguous grammar? - grammar

hi
there is this question in the book that said
Given this grammer
A --> AA | (A) | epsilon
a- what it generates\
b- show that is ambiguous
now the answers that i think of is
a- adjecent paranthesis
b- it generates diffrent parse tree so its abmbiguous and i did a draw showing two scenarios .
is this right or there is a better answer ?

a is almost correct.
Grammar really generates (), ()(), ()()(), … sequences.
But due to second rule it can generate (()), ()((())), etc.
b is not correct.
This grammar is ambiguous due ot immediate left recursion: A → AA.
How to avoid left recursion: one, two.

a) Nearly right...
This grammar generates exactly the set of strings composed of balanced parenthesis. To see why is that so, let's try to make a quick demonstration.
First: Everything that goes out of your grammar is a balanced parenthesis string. Why?, simple induction:
Epsilon is a balanced (empty) parenthesis string.
if A is a balanced parenthesis string, the (A) is also balanced.
if A1 and A2 are balanced, so is A1A2 (I'm using too different identifiers just to make explicit the fact that A -> AA doesn't necessary produces the same for each A).
Second: Every set of balanced string is produced by your grammar. Let's do it by induction on the size of the string.
If the string is zero-sized, it must be Epsilon.
If not, then being N the size of the string and M the length of the shortest prefix that is balanced (note that the rest of the string is also balanced):
If M = N then you can produce that string with (A).
If M < N the you can produce it with A -> AA, the first M characters with the first A and last N - M with the last A.
In either case, you have to produce a string shorter than N characters, so by induction you can do that. QED.
For example: (()())(())
We can generate this string using exactly the idea of the demonstration.
A -> AA -> (A)A -> (AA)A -> ((A)(A))A -> (()())A -> (()())(A) -> (()())((A)) -> (()())(())
b) Of course left and right recursion is enough to say it's ambiguous, but to see why specially this grammar is ambiguous, follow the same idea for the demonstration:
It is ambiguous because you don't need to take the shortest balanced prefix. You could take the longest balanced (or in general any balanced prefix) that is not the size of the string and the demonstration (and generation) would follow the same process.
Ex: (())()()
You can chose A -> AA and generate with the first A the (()) substring, or the (())() substring.

Yes you are right.
That is what ambigious grammar means.
the problem with mbigious grammars is that if you are writing a compiler, and you want to identify each token in certain line of code (or something like that), then ambigiouity wil inerrupt you in identifying as you will have "two explainations" to that line of code.

It sounds like your approach for part B is correct, showing two independent derivations for the same string in the languages defined by the grammar.
However, I think your answer to part A needs a little work. Clearly you can use the second clause recursively to obtain strings like (((((epsilon))))), but there are other types of derivations possible using the first clause and second clause together.

Related

Construct CFG from {w element of {a, b}* : 2#a(w)=3#b(w)}

If i have following language { x is element of {a,b}*, where 2#a(x)=3#b(x), then the cfg of that language is :
S=>SaSaSaSbSbS |SaSaSbSaSbS|SaSaSbSbSaS | SaSbSaSaSbS| SaSbSaSbSaS | SaSbSbSaSaS |SbSaSaSaSbS |SbSaSaSbSaS |SbSaSbSaSaS |SbSbSaSaSaS | epsilon/lambda
Is this correct? If this isnt correct/there's another more simple form, can you tell it? I have no clue on another form other than that.
At a glance it looks like this probably works:
your base case is good; the empty string is in the language
you cover all your inductive cases: you only add 2 a and 3 b and you cover all arrangements
I am not seeing a fundamentally simpler solution than this, although you might be able to remove either the leading or the trailing S from the right-hand side of all productions; then, by choosing a production you'd be committing to that first or last terminal symbol, but I think that still works out. Possibly even removing both leading and trailing S so you commit to both the first and the last. Any other simplification seems like it would increase the number of productions or the number of nonterminals, or both, which while possibly reducing the total number of symbols needed to encode the grammar, arguably doesn't make the grammar any simpler (indeed, more nonterminals and productions is typically seen as more complicated, not less). If you want to experiment with adding productions or nonterminals, consider e.g. T => Sa and R => Sb, just to cut down on repetition.

what is ambiguity in alphabet in automata theory?

I am just new in automata field. I have read many articles, and seen many video. I stuck in some first topics. It can be easy for others. but after spending a lot of time,i am still unable to understand it.
TOPIC is: Ambiguity in alphabet
An alphabet is = {A, Aa, bab, d}
and a string is s= AababA
and author says that, this is ambiguous alphabet, because when computer reads it , it reads from left to right. After the capital A, there again A that is prefix of small a, will create ambiguity. A letter(symbol) should not be prefix again of a new letter.
moreover author says.
we will tokenize it (AababA) in two ways:
(Aa) (bab) (A)
(A) (abab) (A)
after that , first one is ok, second is not ok due to ambiguity in alphabet define above.
What is procedure to tokenize the above string in two ways? is there any specific rule?
How alphabet is ambiguous due to second group.
If it is invalid due to prefix of A, then how? What is the role of prefix in ambiguity of alphabet?
If we don't think about prefix, and we just simply match the both string group with above alphabet, then we can easily judge, that second is not matching with above alphabet, then why do we need to discuss that prefix?
I hope, this question will be considered important, so that answer will help me to make my self out of this confusion. I will be very thankful .
The author chose a confusing example. If you share the source where you got this example, I could give a better answer, but I would argue that in this case, there is no practical ambiguity. If you see Aa, you can know that the first lexeme must be "Aa", because nothing in the alphabet starts with "a".
For an easier example, consider the alphabet {A, a, Aa} and string "AAaAaaA"
You could tokenize this in the following ways:
(A) (A) (a) (A) (a) (a) (A)
(A) (Aa) (A) (a) (a) (A)
(A) (A) (a) (Aa) (a) (A)
(A) (Aa) (Aa) (A)
This is most often resolved by choosing the longest lexeme that matches in each case, which would yield the last tokenization.
Now let us return to your example, but let's make the string a little bit different: "AababAe".
You could tokenize the string in the following ways:
(Aa) (bab) (A) <error>
(A) <error>
In one branch, you have an error. In one branch, you don't. As you noted, the tokenizer should choose the first. Both have errors, though. The point is that there is an explicit choice here to prefer the longest valid tokenization. Nothing in the alphabet forces you to make this choice. It is just as valid to choose the shortest matching option. This would be massively impractical, but it is a valid choice.

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

How to show that a grammar with Bitwise operator is ambiguous using the expression a>>b^c

I am trying to solve this question but I really don't know how to get started. I would appreciate some help.
The bitwise operators for a language are shown in the table below alongside the grammar. The operators and the grammar rules are in order of precedence from highest to lowest. The characters a, b and c represent terminals in the language.
Grammar table:
Show that the grammar is ambiguous using expression: a >> b ^ c
Rewrite the grammar so that it is unambiguous.
The Dragon Book says: "A grammar that produces more than one parse tree for some sentence is said to be ambiguous." So to show that a grammar is ambiguous, you need to show at least two parse trees for a single sentence generated by the grammar. In this case, the sentence to use is already given to you, so for Q1 you just need to find two different parse trees for a >> b ^ c. Shiping's comment gives you a big clue for that.
For Q2, where they ask you to "rewrite the grammar", I suspect the unspoken requirement is that the resulting grammar generate exactly the same language as the original. (So Shiping's suggestion to introduce parentheses to the language would not be accepted.) The general approach for doing this is to introduce a nonterminal for each precedence level in the precedence chart, and then modify the grammar rules to use the new nonterminals in such a way that the grammar can only generate parse trees that respect the precedence chart.
For example, look at the two trees you found for Q1. You should observe that one of them conforms to the precedence chart and one does not. You want a new grammar that allows the precedence-conforming tree but not the other.
As another clue, consider the difference between these two grammars:
E -> E + E
E -> E * E
E -> a | b
and
E -> E + T
T -> T * F
F -> a | b
Although they generate the same language, the first is ambiguous but the second is not.

Context sensitive grammar

Noam Chomsky - formal languages - type 1 - context sensitive grammar
Does AB->BA violate the rule? I assume it does.
A -> aAB does not violate condition?
aAB->ABc violates condition?
Using the wikipedia link provided, you can answer each question if you can map your production rules to the form:
iAr -> ibr, where A is a single non-terminal, i and r are (possibly empty) strings of terminals and non-terminals, and b is a non-empty string of terminals and non-terminals.
In other words, look at each of your rules, and try to make suitable choices for i, A, r, and b.
Before we look at your questions, let's look at some hypothetical examples:
Is CRC -> CRRRRRC a valid context-sensitive rule?
Yes. I can choose i=empty, A=C, r=RC, and b=CRRRR. Note, I could have made other choices that work, too.
Is xYz -> xWzv a valid context-sensitive rule?
No. There is no choice for i, A, and r that allow a match. If I chose i=x A=Y, r=z, and b=W, that trailing v screws things up.
Is xY -> xWzv a valid context-sensitive rule?
Yes. I can choose i=x, A=Y, r=empty, and b=Wzv.
This is the scheme you should use to answer your questions. Now, let's look at those:
AB -> BA: Assume you choose either A or B to be your single non-terminal. The choice fixes i and r (one will be empty, the other will be the non-terminal you didn't choose). Is there a string of the form ibr that can match based on how you fixed i and r? In other words, can you choose the string to replace b that maps to your rule?
A -> aAB. I hope the choice of your single non-terminal on the left is intuitively obvious. This choice will again fix i and r. Does the right map to a suitable ibr form where b is a nonempty string of terminals and nonterminals?
aAB -> ABc. Again, choose A or B to be your single non-terminal. This fixes i and r. Is there a choice that allows you to choose a suitable ibr?