Chomsky Normal Form correctness - grammar

I have these productions:
S->aSb
S->eps (eps = the empty string)
I need to convert this grammar to Chomsky Normal Form.
My reasoning:
1) eliminate the eps rules
Given:
S->aSb
S-> eps
I get:
S->ab
S->aSb
2) eliminate the unit rules
There are none
3) remove useless symbols
I get:
S->ab
So, the given grammar after applying CNF (Chomsky Normal Form) becomes:
S->ab
Am I right?

What you have at the end is not quite the same language you started with. Notice that the empty string is no longer part of your language, nor are the strings aabb, aaabbb, etc.
Check the step where you eliminate useless rules. Is that second rule really useless?
Also, are you sure you can eliminate the epsilon production?
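To see the difference concretely, here is a throwaway Python sketch (not part of the original exchange) that enumerates the short words each grammar derives, treating uppercase letters as non-terminals and lowercase letters as terminals:
from collections import deque

def words(productions, start="S", max_len=6):
    # Breadth-first application of the productions; collect every
    # all-terminal word of length <= max_len that is reachable.
    seen, found = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            if len(form) <= max_len:
                found.add(form)
            continue
        for lhs, rhs in productions:
            i = form.find(lhs)
            while i != -1:
                nxt = form[:i] + rhs + form[i + len(lhs):]
                if len(nxt) <= max_len + 1 and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                i = form.find(lhs, i + 1)
    return sorted(found, key=len)

original = [("S", "aSb"), ("S", "")]  # S->aSb, S->eps
final    = [("S", "ab")]              # the grammar the question ends with
print(words(original))  # ['', 'ab', 'aabb', 'aaabbb']
print(words(final))     # ['ab']
The empty string and everything beyond ab have disappeared, which is exactly the point above.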

Related

ANTLR4 Best practice on token ambiguities: Lexer predicate, or Parser tree walker

I have a question about a certain ambiguity I am encountering in a grammar I am currently working on. Here is the problem, in brief. Consider these two inputs:
1010
0101
In isolation, my grammar interprets the first input as a decimal number and the second as an octal number, due to the leading zero.
However, if the preceding character to each of these sequences is a % then both would be interpreted as a binary number. This wouldn't be a problem if we stopped there.
Now, let's say a 5 precedes the %. What would happen? Does my grammar consider each of these valid input:
5%1010
5%0101
The answer is "Yes!" The rightmost sequences of 1s and 0s simply revert to decimal and octal, respectively, and the % is a modulo operator.
This wouldn't be a problem if expressions in my grammar consisted only of digits, but that unfortunately is not the case: any number of non-digit tokens could substitute for the 5 in the example above, such as variables, braces, parentheses, and even other math operators like a minus sign.
The solution I have come to in ANTLR is simply to have an expression rule where one of the alternatives concatenates an expression and a binary number, so you have:
expr
: expr Binary
| expr '%' expr
| Integer
| Octal
| Binary
;
Integer
: '0'
| [1-9] [0-9]*
;
Octal
: '0' [0-7]+
;
Binary
: '%' [01]+
;
I then leave it up to my visitor to actually "pull apart" the right-hand side of that expr Binary alternative and properly calculate the modulo, which means I essentially have to "re-tokenize" the % and the digits that follow it.
I guess my question is: is this the best solution in my case? I fully accept it if so, but I am curious whether others have had to resort to things like this.
I cooked up a lexer predicate to do some crazy lookaheads (and lookbehinds) in the input, but my instinct was that this felt wrong: I was essentially hand-parsing rather than leveraging the tool itself to give me what I needed to work with.
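For what it's worth, the "re-tokenize" step itself is small. Here is a sketch in plain Python (the function name is made up, and it stands in for whatever the visitor does with the text of the matched Binary token):
def split_binary_token(token_text):
    # Reinterpret a Binary token ('%' + digits) as a modulo operator
    # followed by a number, using the same decimal/octal rule as the lexer.
    assert token_text.startswith("%")
    digits = token_text[1:]
    base = 8 if digits.startswith("0") else 10
    return "%", int(digits, base)

print(split_binary_token("%0101"))  # ('%', 65): 0101 re-read as octal
print(split_binary_token("%1010"))  # ('%', 1010): re-read as decimal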

Modifying ANTLR v4 auto-generated lexer?

So I am writing a small language and I am using ANTLR v4 as my tool. ANTLR auto-generates lexer and parser files when you compile your grammar file (.g4). I am using javac, by the way. I want my language to have no semicolons, and the way I want to do this is: if there is an identifier or ")" as the last token in a line, the lexer will automatically insert the semicolon (similar to what the Go language does). How would I approach something like this? There are also other things like the ATN (which I think is an augmented transition network) and the DFA (which I think is a deterministic finite automaton) in the lexer file, which I don't understand, nor how they relate to the lexing process. Any help is appreciated. (By the way, I am still working on the grammar file, so I don't have that fully completed.)
Several points here: the ATN and the DFA are internal structures for parser + lexer and not something you would touch to change parsing behavior. Also, it's not clear to me why you want to have the lexer insert a semicolon at some point. What exactly do you want to accomplish by that (don't say: to make semicolons optional in the parser, I mean the underlying reason).
If you want to accept a command without a trailing semicolon you can make that optional:
assignment: simpleAssignment | complexAssignment SEMI?;
The parser will give you the content of the assignment rule regardless of whether there is a trailing semicolon. Is that what you want?
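If making SEMI optional isn't enough, one low-tech alternative (sketched here in plain Python, deliberately avoiding any ANTLR API) is to insert the semicolons in a preprocessing pass over the source text before the lexer ever sees it, using the rule stated in the question; note that checking the last character is only an approximation of "the last token is an identifier":
def insert_semicolons(source):
    out = []
    for line in source.splitlines():
        line = line.rstrip()
        # the question's rule: identifier or ')' as the last token of a line
        if line and (line.endswith(")") or line[-1].isalnum()):
            line += ";"
        out.append(line)
    return "\n".join(out)

print(insert_semicolons("x = f(1)\nreturn x"))  # x = f(1); / return x;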

Do independent rules influence one another?

When I was debugging my grammar for C# I noticed something very unusual: some inputs that are not accepted by a full grammar are being accepted by the same grammar with some independent rules deleted. I could not find a logical explanation. For example:
CS - this grammar does not accept the input a<a<a>><EOF>
CS' - and this grammar which is basically the same as CS but with some independent rules deleted (rules are not reordered) does accept a<a<a>><EOF>
As you can see both grammars start with the rule start: namespaceOrTypeName EOF; and therefore they should call the same set of rules (CS will never call those rules that are deleted in CS'). I spent a day debugging this, deleting or adding new rules, but couldn't find a flaw in the logic. Any help would be of use, thank you.
EDIT:
After changing the start rule in CS to start: Identifier EOF; the grammar starts rejecting the input method, which is normally accepted when only the Identifier rules are defined. So I guess that, since there is a rule attributeTarget: ...| 'method' | ..., some phrases such as 'method' get reserved after compiling the grammar, but I'm still not sure if that's the case.
The first grammar includes the overloadableBinaryOperator rule which implicitly defines the >> token. Since >> is a 2-character token, the lexer will never treat the input >> as two separate 1-character tokens >, >. If you open the grammar in ANTLRWorks 2, you'll see a warning indicator for each implicitly-defined token. You should remove all of these warnings by:
Creating explicit lexer rules for every token you intend to appear in the input.
Only using a literal such as 'new' in a parser rule if a corresponding lexer rule exists for the literal 'new'.
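To see why the implicit >> token breaks a<a<a>>, here is a toy longest-match tokenizer in Python (made-up names, literal tokens only; ANTLR's lexer applies the same maximal-munch rule, just with a DFA):
def tokenize(text, specs):
    # specs: (name, literal) pairs; always take the longest literal that
    # matches at the current position (maximal munch). Assumes some
    # literal always matches, as it does for this input.
    pos, out = 0, []
    while pos < len(text):
        name, lit = max(
            ((n, t) for n, t in specs if text.startswith(t, pos)),
            key=lambda nt: len(nt[1]),
        )
        out.append(name)
        pos += len(lit)
    return out

specs = [("ID", "a"), ("LT", "<"), ("GT", ">"), ("SHR", ">>")]
print(tokenize("a<a<a>>", specs))
# ['ID', 'LT', 'ID', 'LT', 'ID', 'SHR'] -- never 'GT', 'GT'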

Chomsky Language Types

I'm trying to understand the four different Chomsky language types, but the definitions that I have found don't really mean anything to me. I know type 0 is unrestricted grammar, type 1 is context-sensitive, type 2 is context-free, whilst type 3 is regular. So, could someone please explain this and put it into context? Thanks.
A language is the set of words that belong to that language. Many times, however, instead of listing each and every word in the language, it is enough to specify the set of rules that generate the words of the language (and only those) to identify what is the language-in-question.
Note: there can be more than one set of rules that describe the same language.
In general, the more restrictions placed on the rules, the less expressive the grammar (fewer words can be generated from the rules), but the easier it is to recognize whether a word belongs to the language the rules specify. Because of the latter, we want to identify languages by the most restricted grammars that can still generate them.
A few words about the rules: in general, you describe a formal grammar with four items (AKA a four-tuple):
The set of non-terminal symbols (N)
The set of terminal symbols (T)
The set of production rules (P)
The start symbol (S)
The terminal symbols (AKA letters) are the symbols that words of the language consist of, usually a subset of lowercase English letters (e.g. 'a', 'b', 'c').
The non-terminal symbols mark an intermediate state in the generation of a word, indicating that a possible transformation can still be applied to the intermediate "word". There is no overlap between the terminal and non-terminal symbols (i.e. the intersection of the two sets is empty). The symbols used for non-terminals are usually uppercase English letters (e.g. 'A', 'B', 'C').
The rules denote possible transformations on a series of terminal and non-terminal symbols. They are in the form of: left-side -> right-side, where both the left-side and the right-side consists of series of terminal and non-terminal symbols. An example rule: aBc -> cBa, which means that a series of symbols "aBc" (as part of intermediary "words") can be replaced with the series "cBa" during the generation of words.
The start symbol is a dedicated non-terminal symbol (usually denoted by S) that denotes the "root" or the "start" of the word generation, i.e. the first rule applied in the series of word-generation always has the start-symbol as its left-side.
The generation of a word is successfully over when all non-terminals have been replaced with terminals (so the final "intermediary word" consists only of terminal symbols, which indicates that we arrived at a word of the language-in-question).
The generation of a word is unsuccessful when not all non-terminals have been replaced with terminals, but there are no production rules that can be applied to the current intermediary "word". In this case the generation has to start anew from the start symbol, following a different path of production-rule applications.
Example:
G = (T, N, P, S),
where
T={a, b, c}
N={A, B, C, S}
P={S->A, S->AB, S->BC, A->a, B->bb, C->ccc}
which denotes the language of three words: "a", "abb" and "bbccc"
An example application of the rules:
S->AB->aB->abb
where we 1) started from the start symbol (S), 2) applied the second rule (replacing S with AB), 3) applied the fourth rule (replacing A with a) and 4) applied the fifth rule (replacing B with bb). As there are no non-terminals in the resulting "abb", it is a word of the language.
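The same derivation can be replayed mechanically; a throwaway Python sketch (not part of the original answer), where each step replaces the first occurrence of a rule's left-hand side:
form = "S"
for lhs, rhs in [("S", "AB"), ("A", "a"), ("B", "bb")]:
    form = form.replace(lhs, rhs, 1)  # apply one production
    print(form)  # prints AB, then aB, then abb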
When talking in general about the rules, the Greek symbols alpha, beta, gamma etc. denote (a potentially empty) series of terminal+non-terminal symbols; the Greek letter epsilon denotes the empty string (i.e. the empty series of symbols).
The four different types in the Chomsky hierarchy describe grammars of different expressive power (different restrictions on the rules).
Languages generated by Type 0 (or Unrestricted) grammars are the most expressive (least restricted). The set of Recursively Enumerable languages contains the languages that can be generated using a Turing machine (basically a computer). This means that if you have a language that is more expressive than this type (e.g. English), you cannot write an algorithm that can list each and every word of the language (and only those words). The rules have one restriction: the left-side of a rule cannot be empty (no symbols can be introduced "out of the blue").
Languages generated by Type 1 (Context-sensitive) grammars are the Context-sensitive languages. The rules have the restriction that they are in the form: alpha A beta -> alpha gamma beta, where alpha and beta can be empty and gamma is non-empty (exception: the S->epsilon rule, which is only allowed if the start symbol S does not appear on the right-side of any rule). This restricts the rules to have at least one non-terminal on their left-side and allows them to have a "context": the non-terminal A in this rule example can be replaced with gamma only if it is surrounded by ("is in the context of") alpha and beta. The application of the rule preserves the context (i.e. alpha and beta do not change).
Languages generated by Type 2 (Context-free) grammars are the Context-free languages. The rules have the restriction that they are in the form: A -> gamma. This restricts the rules to have exactly one non-terminal on their left-side and no "context". This essentially means that if you see a non-terminal symbol in an intermediary word, you can apply any one of the rules that have that non-terminal symbol on their left-side to replace it with their right-side, regardless of the surroundings of the non-terminal symbol. Most programming languages can be generated by context-free grammars.
Languages generated by Type 3 (Regular) grammars are the Regular languages. The rules have the restriction that they are of the form: A->a or A->aB (the rule S->epsilon is permitted if the start symbol S does not appear on the right-side of any rule), which means that each rule produces exactly one terminal symbol (and possibly one non-terminal as well). Regular expressions generate/recognize languages of this type.
Some of these restrictions can be lifted or modified in a way that keeps the modified grammar's expressive power the same. The modified rules can allow other algorithms to recognize the words of a language.
Note that (as noted earlier) a language can often be generated by multiple grammars (even grammars belonging to different types). The expressive power of a language family is usually equated with the expressive power of the type of the most restrictive grammars that can generate those languages (e.g. languages generated by regular (Type 3) grammars can also be generated by Type 2 grammars, but their expressive power is still that of Type 3 grammars).
Examples
The regular grammar
T={a, b}
N={A, B, S}
P={S->aA, S->aB, A->aA, A->aB, B->bB, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by a non-zero number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's followed by an equal number of 'b's with regular grammars.
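As a sanity check, here is the same language written as a regular expression in a small Python snippet (regular grammars and regular expressions have the same expressive power):
import re

a_then_b = re.compile(r"a+b+\Z")  # non-zero 'a's, then non-zero 'b's
print(bool(a_then_b.match("ab")))     # True
print(bool(a_then_b.match("aaabb")))  # True
print(bool(a_then_b.match("aabba")))  # False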
The context-free grammar
T={a, b}
N={A, B, S}
P={S->ASB, S->AB, A->a, B->b}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's. Note that it is not possible to describe a language where each word consists of a number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's with context-free grammars.
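A quick membership test for this language can mirror the two rules S->ASB and S->AB directly (a Python sketch, not part of the original answer): peel one 'a' and one 'b' off the ends and recurse:
def is_anbn(word):
    if word == "ab":  # corresponds to S -> AB
        return True
    return (len(word) >= 4 and word[0] == "a" and word[-1] == "b"
            and is_anbn(word[1:-1]))  # corresponds to S -> ASB

print(is_anbn("aaabbb"))  # True
print(is_anbn("aabbb"))   # False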
The context-sensitive grammar
T={a, b, c}
N={A, B, C, H, S}
P={S->aBC, S->aSBC, CB->HB, HB->HC, HC->BC, aB->ab, bB->bb, bC->bc, cC->cc}
generates the language which contains words that start with a non-zero number of 'a's, followed by an equal number of 'b's, followed by an equal number of 'c's. The role of H in this grammar is to enable "swapping" a CB combination into a BC combination, so the B's can be gathered on the left and the C's can be gathered on the right. Note that context-sensitive grammars have limits too: every context-sensitive language is decidable, so a recursively enumerable but undecidable language (e.g. a unary encoding of the halting problem) can be generated by an unrestricted grammar but not by any context-sensitive one.
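For example, the word "aabbcc" can be derived as:
S->aSBC->aaBCBC->aaBHBC->aaBHCC->aaBBCC->aabBCC->aabbCC->aabbcC->aabbcc
where the three H steps in the middle swap the out-of-order CB into BC before the lowercase rules finish the word.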
There are 4 types of grammars:
TYPE-0:
Grammar: Unrestricted grammar
Language: Recursively enumerable language
Automaton: Turing machine
TYPE-1:
Grammar: Context-sensitive grammar
Language: Context-sensitive language
Automaton: Linear bounded automaton
TYPE-2:
Grammar: Context-free grammar
Language: Context-free language
Automaton: Pushdown automaton
TYPE-3:
Grammar: Regular grammar
Language: Regular language
Automaton: Finite state automaton
Chomsky Normal Form requires that each production has the form:
A->BC or A->a
i.e. the right-hand side of every rule is either exactly two non-terminals or exactly one terminal.
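For example, the rule S->aSb from the first question above is not in Chomsky Normal Form; an equivalent CNF fragment introduces helper non-terminals: S->AX, X->SB, A->a, B->b.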

Yacc "rule useless due to conflicts"

I need some help with yacc.
I'm working on an infix/postfix translator; the infix-to-postfix part was really easy, but I'm having some issues with the postfix-to-infix translation.
Here's an example of what I was trying to do (just to translate a simple ab+c- or abc+-):
exp: num {printf("+ ");} exp '+'
| num {printf("- ");} exp '-'
| exp {printf("+ ");} num '+'
| exp {printf("- ");} num '-'
|/* empty*/
;
num: number {printf("%d ", $1);}
;
Obviously it doesn't work, because I'm asking for an action (with the printfs) before the actual body, so while compiling I get many
warning: rule useless in parser due to conflict
The problem is that the printfs are exactly where I need them (or my output won't be an infix expression). Is there a way to keep the print actions right there and let yacc identify which one it needs to use?
Basically, no there isn't. The problem is that to resolve what you've got, yacc would have to have an unbounded amount of lookahead. This is… problematic given that yacc is a fairly simple-minded tool, so instead it takes a (bad) guess and throws out some of your rules with a warning. You need to change your grammar so that yacc can decide what to do with only a very small amount of lookahead (a single token, IIRC). The usual way to do this is to attach the interpretations of the values to the tokens and either use a post-action or, more practically, build a tree which you traverse as a separate step (printing out an infix expression from its syntax tree is trivial).
Note that when you've got warnings coming out of yacc, that typically means that your grammar is wrong and that the resulting parser will do very unexpected things. Refine it until you get no warnings from that stage at all. That is, treat grammar warnings as errors; anything else and you'll be sorry.
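To make the tree-traversal suggestion concrete, here is a minimal sketch in plain Python rather than yacc actions (the function name is made up): it parses postfix input with a stack and formats each node in infix order, which is exactly what the separate traversal step would do.
def postfix_to_infix(tokens):
    # Build expressions bottom-up with a stack; each operator node
    # formats itself in infix order with explicit parentheses.
    stack = []
    for tok in tokens:
        if tok in "+-":
            right = stack.pop()
            left = stack.pop()
            stack.append("(" + left + " " + tok + " " + right + ")")
        else:
            stack.append(tok)  # operand
    return stack.pop()

print(postfix_to_infix(list("ab+c-")))  # ((a + b) - c)
print(postfix_to_infix(list("abc+-")))  # (a - (b + c))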