LL(1) predictive parsing -- Avoid Left recursion - grammar

When defining a grammar, say one for evaluating arithmetic expressions, we divide expressions into terms and factors, like so:
E ::= E + T | T
T ::= T * F | F
F ::= num
| (E)
Then we need to resolve left recursion.
So why not define the grammar like so:
E ::= T + E | T
T ::= F * T | F
F ::= num
| (E)
And have only right recursion.

The problem is that it gets the associativity wrong -- a left-recursive grammar is left associative while a right-recursive grammar is right associative. Since associativity doesn't matter for + or * you don't see a problem, but if you add an operator (such as -) for which associativity DOES matter, you see the problem.
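To see the difference concretely, here is a minimal Python sketch that groups the same operands both ways for subtraction. The helper names `eval_left` and `eval_right` are made up for this illustration; they are not part of any parser library.

```python
from functools import reduce

def eval_left(nums):
    """Left-associative grouping: ((a - b) - c), as a left-recursive grammar would give."""
    return reduce(lambda acc, x: acc - x, nums)

def eval_right(nums):
    """Right-associative grouping: (a - (b - c)), as a right-recursive grammar would give."""
    if len(nums) == 1:
        return nums[0]
    return nums[0] - eval_right(nums[1:])

print(eval_left([8, 3, 2]))   # (8 - 3) - 2
print(eval_right([8, 3, 2]))  # 8 - (3 - 2)
```

The two calls print different values for the same input, which is exactly the problem that stays hidden as long as the grammar only has + and *.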
Note that the way that you deal with left recursion in an LL grammar is essentially by converting to right recursion and then post-processing the parse tree to turn it back into left recursion. Breaking it down, you convert to
E ::= T + E | T
which you then left-factor into
E ::= T E'
E' ::= \epsilon | + E
this will parse the expression T + T + T as
  E
 / \
T   E'
   / \
  +   E
     / \
    T   E'
       / \
      +   E
         / \
        T   E'
            |
            \epsilon
which you then evaluate by treating it as a linked list of alternating terms and operators which you evaluate/perform top to bottom (left to right):
tmp1 = eval_term(pop list head)
while (list not empty)
    op = pop list head
    tmp2 = eval_term(pop list head)
    tmp1 = tmp1 op tmp2
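As a rough, runnable sketch of the same idea in Python: the parser below follows the left-factored rules E ::= T E' and E' ::= \epsilon | op E, flattens the right-leaning tree into a list of alternating terms and operators, and then evaluates that list left to right. For brevity it assumes a term is just an integer and the only operators are + and -; the function names (`tokenize`, `parse_E`, `parse_E_prime`, `evaluate`) are invented for this example.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub}

def tokenize(text):
    # Assumption for this sketch: only integers and the operators + and -.
    return re.findall(r"\d+|[+-]", text)

def parse_E(tokens, pos=0):
    """E ::= T E'   Returns (flat list of terms and operators, next position)."""
    items = [int(tokens[pos])]            # T is a plain number here
    rest, pos = parse_E_prime(tokens, pos + 1)
    return items + rest, pos

def parse_E_prime(tokens, pos):
    """E' ::= epsilon | op E   The choice needs one token of lookahead."""
    if pos < len(tokens) and tokens[pos] in OPS:
        op = tokens[pos]
        rest, pos = parse_E(tokens, pos + 1)
        return [op] + rest, pos
    return [], pos                         # epsilon alternative

def evaluate(items):
    """Consume the flat list from the head, accumulating left to right."""
    items = list(items)
    result = items.pop(0)
    while items:
        op = items.pop(0)
        operand = items.pop(0)
        result = OPS[op](result, operand)
    return result

items, _ = parse_E(tokenize("8 - 3 - 2"))
print(items)
print(evaluate(items))
```

Although the parse is right-recursive, the accumulating loop applies each operator to the running result, so subtraction comes out left-associative.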

In the specific example you show, order doesn't matter, so you can swap the operands.
That does not hold for grammars in general, because reordering their symbols may change their meaning; in those cases you need another way to eliminate left recursion.

Related

How to remove indirect left recursion in Antlr grammar

I have a grammar as follows:
expression : scalar
| vector;
scalar : <bunch of rules>
| vector[scalar] #VectorIndex
;
vector : <bunch of rules>
| scalar ('*' | '+' | '-') vector
;
Is there any possibility to remove indirect left recursion from this grammar? Replacing vector with all its sub-rules will make the grammar too repetitive and messy.

How do I label expression alternatives with same precedence level?

With antlr4 I can label rule alternatives like this:
e : e '*' e # Mult
| e '+' e # Add
| INT # Int
;
From what I understand, in the rule above, Mult has higher precedence over Add because Mult comes before Add in the list of alternatives.
So for instance, if I wrote:
e : e '*' e # Mult
| e ('+'|'-') e # Add
| INT # Int
;
The + in 1 + 2 and - in 4 - 2 have the same precedence.
However, now the alternative is not in the top level. Is there a way I can label the rules e '+' e # Add and e '-' e #Sub separately while still having both alternatives have same precedence level?
I'm afraid not. You can label the operator, though, with op=('+'|'-'), then read ctx.op (the labeled token) during a tree walk and check its token type.

Rascal error when specifying grammar

I have a simple file in rascal for specifying a toy grammar
module temp
import IO;
import ParseTree;
layout LAYOUT = [\t-\n\r\ ]*;
start syntax Simple
= A B ;
syntax A = "Hello"+ ("joe" "pok")* ;
syntax A= "Hi";
syntax B = "world"*|"wembly";
syntax B = C | C C* ;
public void main () {
println("hello");
iprint(parse(#start[Simple], "Hello Hello world world world"));
}
This works fine, however, the problem is that I didn't want to write
syntax B = C | C C* ;
I wanted to write
syntax B = ( C | C C* )?
but Rascal rejected it as a parse error, even though all of
syntax B = ( C C C* )? ;
syntax B = ( C | C* )? ;
syntax B = C | C C* ;
are accepted fine. Can anyone explain to me what I'm doing wrong?
The sequence symbol (nested sequence) always requires brackets in rascal. The meta notation is defined as
syntax Sym = sequence: "(" Sym+ ")" | opt: Sym "?" | alternative: "(" Sym "|" {Sym "|"}+ ")" | ... ;
So, in your example you should have written:
syntax B = (C | (C C*))?;
What is perhaps confusing is that Rascal uses the | sign twice. Once for separating top-level alternatives, once for nested alternative:
syntax X = "a" | "b"; // top-level
syntax Y = ("c" | "d"); // nested, will internally generate a new rule:
syntax ("c" | "d") = "c" | "d";
Finally, normal alternatives have sequences without brackets, as in:
syntax B
= C
| C C*
;
// or less abstractly:
syntax Exp = left Exp "*" Exp
> left Exp "+" Exp
;
BTW, we generally avoid the use of too many nested regular expressions because they are so anonymous and therefore make interpreting parse trees harder. The best usage of regular expressions is for expressing lexical syntax where we are not so much interested in the internal structure anyhow.

Explanations about FOLLOW function - Grammar

I have some trouble understanding the FOLLOW function. I cannot compute the FOLLOW sets of a grammar. I tried exercises to understand it, in particular this one, which uses the following grammar:
S -> E
E -> T E'
E' -> + T E' | minus T E' | ε
T -> F T'
T' -> * F T' | ε
F -> id | ( F'
F' -> E ) | n )
Here are the computed FOLLOW sets:
S $
E ), $
E' ), $
T +, minus, ), $
T' +, minus, ), $
F *, +, minus, ), $
F' *, +, minus, ), $
I really don't understand why FOLLOW(T) = FOLLOW(T') = { +, minus, ), $ }.
In the grammar I gave, the terminal symbols + and minus never appear directly to the right of T or T', so if someone can explain this, that would be great.
Conceptually, FOLLOW(X) is the set of tokens that can come AFTER an X in a legal sentence in the grammar. So to calculate it, you look at where X appears on the right side of a rule (any rule) and see what comes after it. In the case of T', you have
T -> F T'
T' -> * F T'
since T' is the last thing on the rhs in both cases, you end up with FOLLOW(T') = FOLLOW(T) ∪ FOLLOW(T'), which is equivalent to FOLLOW(T') = FOLLOW(T).
For T you have:
E -> T E'
E' -> + T E'
which gives you FOLLOW(T) = FIRST(E') ∪ FOLLOW(E) ∪ FOLLOW(E') -- the FOLLOWs are included because E' expands to ε. Depending on exactly whose formulation of FIRST and FOLLOW you use, that may mean that ε ∈ FIRST(E') (in which case you remove it from FOLLOW(T)) or that NULLABLE(E') = true, but the overall effect on FOLLOW(T) is the same -- it gets + and minus from FIRST(E') and ) and $ from FOLLOW(E)
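The rules above can be checked mechanically. The following is a small Python sketch of the usual fixed-point computation, using the formulation in which ε ∈ FIRST(X) marks a nullable symbol. The grammar encoding (a dict of productions, with an empty list standing for ε) and the function names are choices made for this example only.

```python
EPS = "ε"
END = "$"

# The grammar from the question; an empty list is the ε-production.
GRAMMAR = {
    "S":  [["E"]],
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["minus", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["id"], ["(", "F'"]],
    "F'": [["E", ")"], ["n", ")"]],
}

def first_of_seq(seq, first):
    """FIRST of a symbol sequence; includes EPS if the whole sequence can vanish."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                      # iterate until nothing new is added
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                new = first_of_seq(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first, start="S"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in GRAMMAR:
                        continue        # FOLLOW is only defined for nonterminals
                    tail = first_of_seq(prod[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:     # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
for nt in GRAMMAR:
    print(nt, sorted(FOLLOW[nt]))
```

Running it yields the same sets as the table in the question; in particular T picks up + and minus from FIRST(E'), and ) and $ from FOLLOW(E), because E' is nullable.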

Write an unambiguous Statement grammar that meets the following requirements:

Write a “Statement” grammar that meets the following requirements:
skip is a valid statement
Assignment of the form x := E is a valid statement, where x is an identifier and E is an arithmetic expression
The composition of two statements S0 ; S1 is a valid statement
I have the following solution, but am not sure if it is correct:
x:: E|skip|s0 E|s1 E
S:
  SKIP
| ID ':=' E
| S ';' S
;
You still need another rule for E; SKIP and ID are lexical tokens.
I'm not sure what would count as a "valid" arithmetic expression or a valid identifier, but how about something like this?
S :: 'skip'
S :: IDENTIFIER ':=' E
S :: S | S ';' S
A1 :: '+' | '-'
A2 :: '*' | '/'
NBR :: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'
O :: NBR /* remove this if arithm. expression only on identifiers */
O :: IDENTIFIER
O :: '(' E ')'
F :: O
F :: F A2 O
E :: F
E :: E A1 F
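Note that a rule of the shape S ';' S is ambiguous on its own, since s0 ; s1 ; s2 can be grouped two ways. One common way around this is to treat composition as a list of atomic statements. The Python sketch below is a hand-written recursive-descent parser for that variant; it is only an illustration under several assumptions: identifiers look like `[A-Za-z_]\w*`, numbers are unsigned integers, * and / bind tighter than + and -, and all names (`tokenize`, `Parser`, `parse`) are invented here.

```python
import re

TOKEN_RE = re.compile(r"\s*(:=|[A-Za-z_]\w*|\d+|[-+*/();])")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")

def tokenize(text):
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def is_ident(tok):
    # 'skip' is reserved, so it cannot be used as an identifier.
    return tok is not None and tok != "skip" and IDENT_RE.fullmatch(tok) is not None

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def statements(self):            # S ::= A (';' A)*
        result = [self.atomic()]
        while self.peek() == ";":
            self.pos += 1
            result.append(self.atomic())
        return result

    def atomic(self):                # A ::= 'skip' | ID ':=' E
        tok = self.peek()
        if tok == "skip":
            self.pos += 1
            return ("skip",)
        if is_ident(tok):
            self.pos += 1
            self.expect(":=")
            return ("assign", tok, self.expr())
        raise SyntaxError(f"expected a statement, got {tok!r}")

    def binary(self, operand, ops):  # operand (op operand)*, grouped to the left
        node = operand()
        while self.peek() in ops:
            op = self.peek()
            self.pos += 1
            node = (op, node, operand())
        return node

    def expr(self):                  # E ::= T (('+' | '-') T)*
        return self.binary(self.term, ("+", "-"))

    def term(self):                  # T ::= P (('*' | '/') P)*
        return self.binary(self.primary, ("*", "/"))

    def primary(self):               # P ::= NUM | ID | '(' E ')'
        tok = self.peek()
        if tok == "(":
            self.pos += 1
            node = self.expr()
            self.expect(")")
            return node
        if tok is not None and (tok.isdigit() or is_ident(tok)):
            self.pos += 1
            return tok
        raise SyntaxError(f"expected an operand, got {tok!r}")

def parse(text):
    parser = Parser(tokenize(text))
    result = parser.statements()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return result

print(parse("skip; x := 1 - 2 - 3"))
```

Because every input has exactly one path through these functions, each accepted program gets a single parse, which is what the exercise means by unambiguous.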