Standard algorithms for recognizing grammar productions correctness - grammar

I'm sure there's a standard way to do this, but I don't even know where to start searching for it.
How can i recognize, in any language, structures (grammars) in the form of, for example:
Exp ::= Number |(Exp) | Exp + Exp
Number ::= Number Digit | Digit
Digit ::= 0 | ... | 9
I mean, given a string like 32 + (43 + 23), how can I tell if it's legal? Is there a standard algorithm or something? I don't know what to search for, so I wasn't able to search this site neither.

You are looking for parsing algorithm (membership algorithim). Parsing is the process of analyzing a string of language symbols in Formal Languages. And Yes for any context-free-grammar there is a possible parsing algorithm-that is Brute-Force a fundamental algorithm but inefficient, Like phrase-structure parsing, its worst-case complexity is O(n3) (you are to start from here) REFF1 . But if grammar is in standard (restricted) form then more efficient algorithm is possible. there are various parsing algorithms e.g. LL parser and LR parser..etc. REFFRENCE

Related

ANTLR4 Best practice on token ambiguities: Lexer predicate, or Parser tree walker

I have a question about a certain ambiguity I am encountering in a grammar I am currently working on. Here is the problem, in brief. Consider these two inputs:
1010
0101
In isolation, in my grammar the first input is interpreted as a decimal number, the second as an octal due to the leading zero.
However, if the preceding character to each of these sequences is a % then both would be interpreted as a binary number. This wouldn't be a problem if we stopped there.
Now, let's say before the % we encountered a 5, what would happen? Does my grammar consider each of these as valid input:
5%1010
5%0101
The answer is "Yes!" The rightmost sequences of 1s and 0s simply revert back to decimal and octal, respectively, and the % is a modulo operator.
This wouldn't be a problem if expressions in my grammar only consisted of digits, but that unfortunately is not the case, as any number of non-digit tokens could substitute for the 5 in the example above, like variables, braces, and even other math operators like parentheses and minus signs.
The solution I have come to in ANTLR is simply to have an expression rule where one of the alternatives concatenates an expression and a binary number, so you have:
expr
: expr Binary
| expr '%' expr
| Integer
| Octal
| Binary
;
Integer
: '0'
| [1-9] [0-9]*
;
Octal
: '0' [0-7]+
;
Binary
: '%' [01]+
;
I then leave it up to my visitor to actually "pull apart" the right hand side of the expression type above (the expr Binary one), and properly calculate the modulo, which means I have to "re-tokenize" essentially the % and following digits.
I guess my question is: Is this the best solution given my case? I fully accept it if so, but I am curious if others have had to resort to things like these.
I cooked up a lexer predicate to do some crazy lookaheads (and lookbehinds) in the input, but my instinct was this felt wrong, as I was essentially hand-parsing, rather than leveraging the tool itself to give me enough what I needed to work with.

Grammars: How to add a level of precedence

So lets say I have the following Context Free Grammar for a simple calculator language:
S->TS'
S'->OP1 TE'|e
T->FT'
T'->OP2 FT'|e
F->id|(S)
OP1->+|-
OP2->*|/
As one can see the * and / have higher precedence over + and -.
However, how can I add another level of precedence? Example would be for exponents, ^, (ex:3^2=9) or something else? Please explain your procedure and reasoning on how you got there so I can do it for other operators.
Here's a more readable grammar:
expr: sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
factor: ID
| '(' expr ')'
add_op: '+' | '-'
mul_op: '*' | '/'
This can be easily extended using the same pattern:
expr: bool
bool: bool or_op conj
| conj
conj: conj and_op comp
| comp
/* This one doesn't allow associativity. No a < b < c in this language */
comp: sum comp_op sum
sum : sum add_op term
| term
term: term mul_op factor
| factor
/* Here we'll add an even higher precedence operators */
/* Unlike the other operators, though, this one is right associative */
factor: atom exp_op factor
| atom
atom: ID
| '(' expr ')'
/* I left out the operator definitions. I hope they are obvious. If not,
* let me know and I'll put them back in
*/
I hope the pattern is more or less obvious there.
Those grammars won't work in a recursive descent parser, because recursive descent parsers choke on left recursion. The grammar you have has been run through a left-recursion elimination algorithm, and you could do that to the grammar above as well. But note that eliminating left recursion more or less erases the difference between left- and right-recursion, so after you identify the parse with a recursive descent grammar, you need to fix it according to your knowledge about the associativity of the operator, because associativity is no longer inherent in the grammar.
For these simple productions, eliminating left-recursion is really simple, in two steps. We start with some non-terminal:
foo: foo foo_op bar
| bar
and we flip it around so that it is right associative:
foo: bar foo_op foo
| bar
(If the operator was originally right associative, as with exponentiation above, then this step isn't needed.)
Then we need to left-factor, because LL parsing requires that every alternative for a non-terminal has a unique prefix:
foo : bar foo'
foo': foo_op foo
| ε
Doing that to every recursive production above (that is, all of them except for expr, comp and atom) will yield a grammar which looks like the one you started with, only with more operators.
In passing, I emphasize that there is no mysterious magical force at work here. When the grammar says, for example:
term: term mul_op factor
| factor
what it's saying is that a term (or product, if you prefer) cannot be the right-hand argument of a multiplication, but it can be the left-hand argument. It's also saying that if you're at a point in which a product would be valid, you don't actually need something with a multiplication operator; you can use a factor instead. But obviously you cannot use a sum, since factor doesn't parse expressions with a sum operator. (It does parse anything inside parentheses. But those are things inside parentheses.)
That's the sense in which both associativity and precedence are implicit in the grammar.

How can I show that this grammar is ambiguous?

I want to prove that this grammar is ambiguous, but I'm not sure how I am supposed to do that. Do I have to use parse trees?
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
You can show it is ambiguous if you can find a string that parses more than one way:
if i then ( if i then print i else print i ; )
if i then ( if i then print i ) else print i ;
This happens to be the classic "dangling else" ambiguity. Googling your tag(s), title & grammar gives other hits.
However, if you don't happen to guess at an ambiguous string then googling your tag(s) & title:
how can i prove that this grammar is ambiguous?
There is no easy method for proving a context-free grammar ambiguous -- in fact,
the question is undecidable, by reduction to the Post correspondence problem.
You can put the grammar into a parser generator which supports all context-free grammars, a context-free general parser generator. Generate the parser, then parse a string which you think is ambiguous and find out by looking at the output of the parser.
A context-free general parser generator generates parsers which produce all derivations in polynomial time. Examples of such parser generators include SDF2, Rascal, DMS, Elkhound, ART. There is also a backtracking version of yacc (btyacc) but I don't think it does it in polynomial time. Usually the output is encoded as a graph where alternative trees for sub-sentences are encoded with a nested set of alternative trees.

Yacc "rule useless due to conflicts"

i need some help with yacc.
i'm working on a infix/postfix translator, the infix to postfix part was really easy but i'm having some issue with the postfix to infix translation.
here's an example on what i was going to do (just to translate an easy ab+c- or an abc+-)
exp: num {printf("+ ");} exp '+'
| num {printf("- ");} exp '-'
| exp {printf("+ ");} num '+'
| exp {printf("- ");} num '-'
|/* empty*/
;
num: number {printf("%d ", $1);}
;
obiously it doesn't work because i'm asking an action (with the printfs) before the actual body so, while compiling, I get many
warning: rule useless in parser due to conflict
the problem is that the printfs are exactly where I need them (or my output wont be an infix expression). is there a way to keep the print actions right there and let yacc identify which one it needs to use?
Basically, no there isn't. The problem is that to resolve what you've got, yacc would have to have an unbounded amount of lookahead. This is… problematic given that yacc is a fairly simple-minded tool, so instead it takes a (bad) guess and throws out some of your rules with a warning. You need to change your grammar so yacc can decide what to do with a token with only a very small amount of lookahead (a single token IIRC). The usual way to do this is to attach the interpretations of the values to the tokens and either use a post-action or, more practically, build a tree which you traverse as a separate step (doing print out of an infix expression from its syntax tree is trivial).
Note that when you've got warnings coming out of yacc, that typically means that your grammar is wrong and that the resulting parser will do very unexpected things. Refine it until you get no warnings from that stage at all. That is, treat grammar warnings as errors; anything else and you'll be sorry.

ANTLR Parser Question

I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#
Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.
When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be:
E1+E2+E3+++#
I have tried the following rule in Antlr:
'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#
but Antlr complains that it's ambiguous which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem ?
An alternative would be to specify all the different permutations in the rule, but that would be a major task.
Best regards and thanks,
Michael
That task sounds like excessive overkill for ANTLR, any reason you're just not splitting the string into an array using the '+' as a separator?
If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.
e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
Just a thought.
If this is ambiguous, it's likely because your Es all have the same format (a more complicated case would be that your Es all just start with the same k characters where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)
So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a + - you can have 5 segments, and an optional trailing E.
This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):
r : (e_opt? PLUS){1,5} e_opt? END
e_opt : E // whatever your E is
PLUS : '+'
END : '#'
If ANTLR doesn't support anything like {1,5} then this is the same as:
(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?
which is not that clean, so maybe there is a nicer way to do it.