regular/context free grammar - grammar

Im hoping someone can help me understand a question I have, its not homework, its just an example question I am trying to work out. The problem is to define a grammar that generates all the sums of any number of operands. For example, 54 + 3 + 78 + 2 + 5... etc. The way that I found most easy to define the problem is:
non-terminal {S,B}
terminal {0..9,+,epsilon}
Rules:
S -> [0..9]S
S -> + B
B -> [0..9]B
B -> + S
S -> epsilon
B -> epsilon
epsilon is an empty string.
This seems like it should be correct to me as you could define the first number recursively with the first rule, then to add the next integer, you could use the second rule and then define the second integer using the third rule. You could then use the fourth rule to go back to S and define as many integers as you need.
This solution seems to me to be a regular grammar as it obeys the rule A -> aB or A -> a but in the notes it says for this question that it is no possible to define this problem using a regular grammar. Can anyone please explain to me why my attempt is wrong and why this needs to be context free?
Thanks.

Although it's not the correct definition, it's easier to think that for a language to be non-regular it would need to balance something (like parenthesis).
Your problem can be solved using direct recursion only on the sides of the rules, not in the middle, so it can be solved using a regular language. (Again, this is not the correct definition, but it's easier to remember!)
For example, for a regular expression engine (like in Perl or JavaScript) one could easily write /(\d+)(\+(\d+))*/.
You could write it this way:
non-terminal {S,R,N,N'}
terminal {0..9,+,epsilon}
Rules:
S -> N R
S -> epsilon
N -> [0..9] N'
N' -> N
N' -> epsilon
R -> + N R
R -> epsilon
Which should work correctly.

The language is regular. A regular expression would be:
((0|1|2|...|9)*(0|1|2|...|9)+)*(0|1|2|...|9)*(0|1|2|...|9)
Terminals are: {0,1,2,...,9,+}
"|" means union and * stands for Star closure
If you need to have "(" and ")" in your language, then it will not be regular as it needs to match parentheses.
A sample context free grammar would be:
E->E+E
E->(E)
E->F
F-> 0F | 1F | 2F | ... | 9F | 0 | 1 | ... | 9

Related

Grammars: How to add a level of precedence

So lets say I have the following Context Free Grammar for a simple calculator language:
S->TS'
S'->OP1 TE'|e
T->FT'
T'->OP2 FT'|e
F->id|(S)
OP1->+|-
OP2->*|/
As one can see the * and / have higher precedence over + and -.
However, how can I add another level of precedence? Example would be for exponents, ^, (ex:3^2=9) or something else? Please explain your procedure and reasoning on how you got there so I can do it for other operators.
Here's a more readable grammar:
expr: sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
factor: ID
| '(' expr ')'
add_op: '+' | '-'
mul_op: '*' | '/'
This can be easily extended using the same pattern:
expr: bool
bool: bool or_op conj
| conj
conj: conj and_op comp
| comp
/* This one doesn't allow associativity. No a < b < c in this language */
comp: sum comp_op sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
/* Here we'll add an even higher precedence operators */
/* Unlike the other operators, though, this one is right associative */
factor: atom exp_op factor
| atom
atom: ID
| '(' expr ')'
/* I left out the operator definitions. I hope they are obvious. If not,
* let me know and I'll put them back in
*/
I hope the pattern is more or less obvious there.
Those grammars won't work in a recursive descent parser, because recursive descent parsers choke on left recursion. The grammar you have has been run through a left-recursion elimination algorithm, and you could do that to the grammar above as well. But note that eliminating left recursion more or less erases the difference between left- and right-recursion, so after you identify the parse with a recursive descent grammar, you need to fix it according to your knowledge about the associativity of the operator, because associativity is no longer inherent in the grammar.
For these simple productions, eliminating left-recursion is really simple, in two steps. We start with some non-terminal:
foo: foo foo_op bar
| bar
and we flip it around so that it is right associative:
foo: bar foo_op foo
| bar
(If the operator was originally right associative, as with exponentiation above, then this step isn't needed.)
Then we need to left-factor, because LL parsing requires that every alternative for a non-terminal has a unique prefix:
foo : bar foo'
foo': foo_op foo
| ε
Doing that to every recursive production above (that is, all of them except for expr, comp and atom) will yield a grammar which looks like the one you started with, only with more operators.
In passing, I emphasize that there is no mysterious magical force at work here. When the grammar says, for example:
term: term mul_op factor
| factor
what it's saying is that a term (or product, if you prefer) cannot be the right-hand argument of a multiplication, but it can be the left-hand argument. It's also saying that if you're at a point in which a product would be valid, you don't actually need something with a multiplication operator; you can use a factor instead. But obviously you cannot use a sum, since factor doesn't parse expressions with a sum operator. (It does parse anything inside parentheses. But those are things inside parentheses.)
That's the sense in which both associativity and precedence are implicit in the grammar.

Difference between grammar rules

Say there are two grammar rules
Rule 1 B -> aB | cB
and
Rule 2 B -> Ba | Bc
I'm a bit confused as the difference of these two. Would rule 1's expression be (a+c)* ? Then what would Rule 2's expression be?
Both of those grammars yield the empty language since there is no non-recursive rule, so no sentence consisting only of terminals can be derived.
If you add the production B→ε, both grammars would yield the same language, equivalent to the regular expression (a+c)*. However, the parse trees produced by the parse would be quite different.

Formal Languages - Grammar

I am taking a Formal Languages and Computability class and am having a little trouble understanding the concept of grammar. One of my assignment questions is this:
Take ∑ = {a,b}, and let na(w) and nb(w) denote the number of a's and b's in the string w, respectively. Then the grammar G with productions:
S -> SS
S -> λ
S -> aSb
S -> bSa
generates the language L = {w: na(w) = nb(w)}.
1) The language in the example contains an empty string. Modify the given grammar so that it generates L - {λ}.
I am thinking that I should modify the condition of L, something like:
L = {w: na(w) = nb(w), na, nb > 0}
That way, we indicate that the string is never empty.
2) Modify the grammar in the example so that it will generate L ∪ {anbn+1: n >= 0}.
I am not sure on how to do this one. Should that mean I make one more condition in the grammar, adding something like S -> aSbb?
Any explanation about these two questions would be greatly appreciated. I'm still trying to figure these grammar stuff out so I am not sure about my answers.
1) The question is about modifying the grammar to obtain a new language; so don't modify directly the language…
Your grammar generates the empty word because of the production:
S -> λ
So you could think of removing this production altogether. This yields the following grammar:
S -> SS
S -> aSb
S -> bSa
Unfortunately, this grammar doesn't generate a language (a bit like in induction, it misses an initial: there are no productions that only consist of terminals). To fix this, add the following productions:
S -> ab
S -> ba
2) Don't randomly try to add production rules in the hope that it's going to work. Here you want a's followed by b's. So the production rule
S -> bSa
must certainly disappear. Also, the rule
S -> SS
would produce, e.g., abab (try to see how this is obtained). So we'll have to remove it too. We're left with:
S -> λ
S -> aSb
Now this grammar generates:
λ
ab
aabb
aaabbb
etc. That's not bad at all! To get an extra trailing b, we could create a new non-terminal, say T, replace our current S by T, and add that trailing b in S:
T -> λ
T -> aTb
S -> Tb
I know that this is homework; I gave you the solutions to your homework: that's because, from the way you asked your question, it seems you're completely lost. I hope this answer will help you get on the right path!

is this regular grammar- S -> 0S0/00?

Let L denotes the language generated by the grammar S -> 0S0/00. Which of the following is true?
(A) L = 0+
(B) L is regular but not 0+
(C) L is context free but not regular
(D) L is not context free
HI can anyone explain me how the language represented by the grammar S -> 0S0/00 is regular? I know very well the grammar is context free but not sure how can that be regular?
If you mean the language generated by the grammar
S -> 0S0
S -> 00
then it should be clear that it is the same language as is generated by
S -> 00S
S -> 00
which is a left regular grammar, and consequently generates a regular language. (Some people would say that a left regular grammar can only have a single terminal in each production, but it is trivial to create a chain of aN productions to produce the same effect.)
It should also be clear that the above differs from
S -> 0S
S -> S
We know that a language is regular if there exists a DFA (deterministic finite automata) that recogognizes it, or a RE (Regular expression). Either way we can see here that your grammar generates word like : 00, 0000, 000000, 00000000.. etc so it's words that starts and ends with '0' and with an even number of zeroes greater or equal than length two.
Here's a DFA for this grammar
Also here is a RE (Regular expression) that recognizes the language :
(0)(00)*(0)
Therefore you know this language recognized by this grammar is regular.
(Sorry if terms aren't 100% accurate, i took this class in french so terms might differ a bit) let me know if you have any other questions!
Consider first the definition of a regular grammar here
https://www.cs.montana.edu/ross/theory/contents/chapter02/green/section05/page04.xhtml
So first we need a set N of non terminal symbols (symbols that can be rewritten as a combination of terminal and non-terminal symbols), for our example N={S}
Next we need a set T of terminal symbols (symbols that cannot be replaced), for our example T={0}
Now a set P of grammer rules that fit a very specific form (see link), for L we see that P={S->0S0,S->00}. Both of these rules are of regular form (meaning each non-terminal can be replaced with a terminal, a terminal then a non-terminal, or the empty string, see link for more info). So we have our rules.
Now we just need a starting symbol X, we can trivally say that our starting symbol is S.
Therefore the tuple (N={S},T={0},P={S->0S0,S->00},X=S) fits the requirements to be defined a regular grammar.
We don't need the machinery of regular grammars to answer your question. Just note the possible derivations all look like this:
S -> (0 S 0) -> 0 (0 S 0) 0 -> 0 0 (0 S 0) 0 0 -> ... -> 0...0 (0 0) 0...0
\_ _/ \_ _/
k k
Here I've added parens ( ) to show the result of the previous expansion of S. These aren't part of the derived string. I.e. we substitute S with 0 S 0 k >= 0 times followed by a single substitution with 00.
From this is should be easy to see L is the set of strings of 0's of length 2k + 2 for some integer k >= 0. A shorthand notation for this is
L = { 02m | m >= 1 }
In words: The set of all even length strings of zeros excluding the empty string.
To prove L is regular, all we need is a regular expression for L. This is easy: (00)+. Or if you prefer, 00(00)*.
You might be confused because a small change to the grammar makes its language context free but not regular:
S -> 0S1/01
This is the more complex language { 0m 1m | m >= 1 }. It's straightforward to show this isn't a regular language using the Pumping Lemma.

ANTLR Parser Question

I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#
but Antlr complains that it's ambiguous which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem ?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like excessive overkill for ANTLR, any reason you're just not splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.