Write a “Statement” grammar that meets the following requirements:
skip is a valid statement
Assignment of the form x := E is a valid statement, where x is an identifier and E is an
arithmetic expression
The composition of two statements S0 ; S1 is a valid statement
I have the following solution, but am not sure if it is correct:
x:: E|skip|s0 E|s1 E
S:
SKIP
| ID ':=' E
| S ';' S
;
There must be another rule for E; SKIP and ID are lexical tokens.
How about this? I'm not sure what counts as a "valid" arithmetic expression or a valid identifier, but something like this might work:
S :: 'skip'
S :: IDENTIFIER ':=' E
S :: S ';' S
A1 :: '+' | '-'
A2 :: '*' | '/'
NBR :: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'
O :: NBR /* remove this if arithm. expression only on identifiers */
O :: IDENTIFIER
O :: '(' E ')'
F :: O
F :: F A2 O /* '*' and '/' bind tighter than '+' and '-' */
E :: F
E :: E A1 F
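To check a grammar like this against a few inputs, a small recursive-descent parser can help. The following is only a sketch under some assumptions that the question does not fix: identifiers are letters, digits and underscores (not starting with a digit), numbers are unsigned integers, '*' and '/' bind tighter than '+' and '-', and ';' groups to the left. The tuple-shaped tree and all function names are made up for illustration.

```python
import re

TOKEN = re.compile(r"\s*(:=|[A-Za-z_]\w*|\d+|[-+*/();])")
IDENT = re.compile(r"[A-Za-z_]\w*")

def tokenize(text):
    """Split the input into tokens; raise SyntaxError on unknown characters."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, text):
        self.toks = tokenize(text)
        self.i = 0

    def peek(self):
        return self.toks[self.i] if self.i < len(self.toks) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.i += 1
        return tok

    def statement(self):
        # S -> simple (';' simple)*   (left-nested composition)
        node = self.simple()
        while self.peek() == ";":
            self.take(";")
            node = ("seq", node, self.simple())
        return node

    def simple(self):
        # simple -> 'skip' | ID ':=' E
        tok = self.take()
        if tok == "skip":
            return ("skip",)
        if IDENT.fullmatch(tok):
            self.take(":=")
            return ("assign", tok, self.expr())
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr(self):
        # E -> term (('+' | '-') term)*
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # term -> operand (('*' | '/') operand)*
        node = self.operand()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.operand())
        return node

    def operand(self):
        # operand -> NUMBER | ID | '(' E ')'
        tok = self.take()
        if tok == "(":
            node = self.expr()
            self.take(")")
            return node
        if tok.isdigit():
            return ("num", int(tok))
        if IDENT.fullmatch(tok):
            return ("var", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(text):
    parser = Parser(text)
    tree = parser.statement()
    if parser.peek() is not None:
        raise SyntaxError(f"trailing input at token {parser.peek()!r}")
    return tree

print(parse("x := 1 + 2 * 3 ; skip"))
```

The loops in statement, expr and term stand in for the left-recursive rules, which a recursive-descent parser cannot follow directly.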
With antlr4 I can label rule alternatives like this:
e : e '*' e # Mult
| e '+' e # Add
| INT # Int
;
From what I understand, in the rule above, Mult has higher precedence than Add because Mult comes before Add in the list of alternatives.
So for instance, if I wrote:
e : e '*' e # Mult
| e ('+'|'-') e # Add
| INT # Int
;
The + in 1 + 2 and - in 4 - 2 have the same precedence.
However, the two operators are no longer separate top-level alternatives. Is there a way to label e '+' e # Add and e '-' e # Sub separately while keeping both alternatives at the same precedence level?
I'm afraid not. You can label the operator token, though, with op=('+'|'-'), then read the ctx.op field during a tree walk and ask for its token type.
When defining a grammar, say a grammar to evaluate an arithmetic expression, we divide the Expression into Terms and Factors, like so:
E ::= E + T
T ::= T * F
F ::= num
| (E)
Then we need to resolve left recursion.
So why not define the grammar like so:
E ::= T + E
T ::= F * T
F ::= num
| (E)
And have only right recursion.
The problem is that it gets the associativity wrong -- a left-recursive grammar is left associative while a right-recursive grammar is right associative. Since associativity doesn't matter for + or * you don't see a problem, but if you add an operator (such as -) for which associativity DOES matter, you see the problem.
Note that the way that you deal with left recursion in an LL grammar is essentially by converting to right recursion and then post-processing the parse tree to turn it back into left recursion. Breaking it down, you convert to
E ::= T + E | T
which you then left-factor into
E ::= T E'
E' ::= \epsilon | + E
this will parse the expression T + T + T as
      E
     / \
    T   E'
       / \
      +   E
         / \
        T   E'
           / \
          +   E
             / \
            T   E'
                |
             \epsilon
which you then evaluate by treating it as a linked list of alternating terms and operators which you evaluate/perform top to bottom (left to right):
tmp1 = eval_term(pop list head)
while (list not empty)
    op = pop list head
    tmp2 = eval_term(pop list head)
    tmp1 = tmp1 op tmp2
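To see the difference on an operator where associativity matters, here is a small Python sketch. It assumes the terms are already plain numbers and uses a hypothetical parse_E helper that builds the right-nested E / E' shape as tuples; eval_left is the list walk from the pseudocode above, and eval_right is what you get by evaluating the right-recursive tree naively.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub}

def parse_E(tokens):
    """E ::= T E' ; E' ::= epsilon | op E. Returns (term, None) or (term, (op, subtree))."""
    term = tokens[0]
    if len(tokens) == 1:
        return (term, None)           # E' -> epsilon
    return (term, (tokens[1], parse_E(tokens[2:])))

def flatten(tree):
    """Turn the right-nested tree into a flat list: term, op, term, op, ..."""
    items = []
    while True:
        term, rest = tree
        items.append(term)
        if rest is None:
            return items
        op, tree = rest
        items.append(op)

def eval_left(tree):
    """Walk the list left to right, which gives left associativity."""
    items = flatten(tree)
    acc = items[0]
    for i in range(1, len(items), 2):
        acc = OPS[items[i]](acc, items[i + 1])
    return acc

def eval_right(tree):
    """Evaluate the tree as parsed, which gives right associativity."""
    term, rest = tree
    if rest is None:
        return term
    op, subtree = rest
    return OPS[op](term, eval_right(subtree))

tree = parse_E([10, "-", 4, "-", 3])
print(eval_left(tree))   # (10 - 4) - 3 = 3
print(eval_right(tree))  # 10 - (4 - 3) = 9
```

With only + in the input, both functions agree, which is why the problem stays hidden in the original grammar.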
In the specific example you show, the order of the operands doesn't matter, so you can swap them.
That is not the case for grammars in general, though, because moving symbols around may change the meaning; there you need another way to eliminate left recursion.
I have a simple file in rascal for specifying a toy grammar
module temp
import IO;
import ParseTree;
layout LAYOUT = [\t-\n\r\ ]*;
start syntax Simple
= A B ;
syntax A = "Hello"+ ("joe" "pok")* ;
syntax A= "Hi";
syntax B = "world"*|"wembly";
syntax B = C | C C* ;
public void main () {
println("hello");
iprint(parse(#start[Simple], "Hello Hello world world world"));
}
This works fine, however, the problem is that I didn't want to write
syntax B = C | C C* ;
I wanted to write
syntax B = ( C | C C* )?
but Rascal rejected it with a parse error, even though all of
syntax B = ( C C C* )? ;
syntax B = ( C | C* )? ;
syntax B = C | C C* ;
are accepted fine. Can anyone explain to me what I'm doing wrong?
The sequence symbol (nested sequence) always requires brackets in rascal. The meta notation is defined as
syntax Sym = sequence: "(" Sym+ ")" | opt: Sym "?" | alternative: "(" Sym "|" {Sym "|"}+ ")" | ... ;
So, in your example you should have written:
syntax B = (C | (C C*))?;
What is perhaps confusing is that Rascal uses the | sign twice. Once for separating top-level alternatives, once for nested alternative:
syntax X = "a" | "b"; // top-level
syntax Y = ("c" | "d"); // nested, will internally generate a new rule:
syntax ("c" | "d") = "c" | "d";
Finally, normal alternatives have sequences without brackets, as in:
syntax B
= C
| C C*
;
// or less abstractly:
syntax Exp = left Exp "*" Exp
> left Exp "+" Exp
;
BTW, we generally avoid the use of too many nested regular expressions because they are so anonymous and therefore make interpreting parse trees harder. The best usage of regular expressions is for expressing lexical syntax where we are not so much interested in the internal structure anyhow.
I have some trouble understanding the FOLLOW function: I cannot calculate the FOLLOW sets of a grammar. I tried some exercises to understand it, in particular this one, which uses the following grammar:
S -> E
E -> T E'
E' -> + T E' | minus T E' | ε
T -> F T'
T' -> * F T' | ε
F -> id | ( F'
F' -> E ) | n )
Here are the computed FOLLOW sets:
FOLLOW(S)  = { $ }
FOLLOW(E)  = { ), $ }
FOLLOW(E') = { ), $ }
FOLLOW(T)  = { +, minus, ), $ }
FOLLOW(T') = { +, minus, ), $ }
FOLLOW(F)  = { *, +, minus, ), $ }
FOLLOW(F') = { *, +, minus, ), $ }
I really don't understand why FOLLOW(T) = FOLLOW(T') = { +, minus, ), $ }.
In the grammar I gave, the terminal symbols + and minus never appear to the right of T or T', so if someone could explain this, that would be great.
Conceptually, FOLLOW(X) is the set of tokens that can come AFTER an X in a legal sentence in the grammar. So to calculate it, you look at where X appears on the right side of a rule (any rule) and see what comes after it. In the case of T', you have
T -> F T'
T' -> * F T'
since T' is the last thing on the rhs in both cases, you end up with FOLLOW(T') = FOLLOW(T) ∪ FOLLOW(T'), which is equivalent to FOLLOW(T') = FOLLOW(T).
For T you have:
E -> T E'
E' -> + T E'
which gives you FOLLOW(T) = FIRST(E') ∪ FOLLOW(E) ∪ FOLLOW(E') -- the FOLLOWs are included because E' expands to ε. Depending on exactly whose formulation of FIRST and FOLLOW you use, that may mean that ε ∈ FIRST(E') (in which case you remove it from FOLLOW(T)) or that NULLABLE(E') = true, but the overall effect on FOLLOW(T) is the same -- it gets + and minus from FIRST(E') and ) and $ from FOLLOW(E)
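If you want to check the sets mechanically, the usual fixed-point computation is short. The sketch below is in Python and follows the "ε ∈ FIRST" formulation; the dictionary encoding of the grammar (an empty list for an ε-production) and the function names are my own choices, not part of the exercise.

```python
EPS = "ε"
END = "$"

# The grammar from the question; [] stands for the empty (ε) alternative.
GRAMMAR = {
    "S":  [["E"]],
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["minus", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["id"], ["(", "F'"]],
    "F'": [["E", ")"], ["n", ")"]],
}

def first_of_seq(seq, first):
    """FIRST of a symbol sequence; contains EPS if the whole sequence can vanish."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:                      # iterate until nothing grows any more
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                new = first_of_seq(prod, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first, start="S"):
    follow = {nt: set() for nt in GRAMMAR}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for nt, prods in GRAMMAR.items():
            for prod in prods:
                for i, sym in enumerate(prod):
                    if sym not in GRAMMAR:
                        continue        # FOLLOW is only defined for nonterminals
                    tail = first_of_seq(prod[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:     # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

first = compute_first()
follow = compute_follow(first)
for nt in GRAMMAR:
    print(nt, sorted(follow[nt]))
```

For T, the tail after T in both E -> T E' and E' -> + T E' is E', so the loop adds FIRST(E') without ε, and then FOLLOW(E) and FOLLOW(E') because E' can vanish.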
I'm having trouble figuring this one out, as well as the shift/reduce problem.
Adding ';' to the end of each declaration isn't an option, since I can't change the language; the input has to look exactly like the following example. Would any %prec directive work?
The example is the following:
A variable can be declared with <int> as a pointer or with int as an integer, so both of these are valid:
<int> a = 0
int a = 1
the code goes:
%left '<'
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
It obviously gives a shift/reduce conflict after expr, since the parser can either shift for the "less than" operator or reduce to start another variable declaration.
I want precedence to be given to the variable declaration. I tried creating a %nonassoc prec_aux and putting %prec prec_aux after '<' type '>' and after type tNAME, but it doesn't solve my problem.
How can I solve this?
The output was:
35: shift/reduce conflict (shift 47, reduce 7) on '<'
state 35
variable : type tNAME '=' expr . (7)
expr : expr . '+' expr (26)
expr : expr . '-' expr (27)
expr : expr . '*' expr (28)
expr : expr . '/' expr (29)
expr : expr . '%' expr (30)
expr : expr . '<' expr (31)
expr : expr . '>' expr (32)
'>' shift 46
'<' shift 47
'+' shift 48
'-' shift 49
'*' shift 50
'/' shift 51
'%' shift 52
$end reduce 7
tINT reduce 7
That's the output, and the error seems to be the one I mentioned.
Does anyone know a different solution, other than adding a new terminal to the language, which isn't really an option?
I think the resolution is to rewrite the grammar so that it can somehow look ahead and see whether a type or an expr follows the '<', but I don't see how.
Precedence is unlikely to work since it's the same character. Is there a way to give precedence to the nonterminals we define, such as declaration?
Thanks in advance
Your grammar gets confused in text like this:
int a = b
<int> c
That '<' on the second line could be part of an expression in the first declaration. It would have to look ahead further to find out.
This is the reason most languages have a statement terminator. This produces no conflicts:
%token tNAME
%token tINT
%token tINTEGER
%token tTERM
%left '<'
%%
declaration: variable
           | declaration variable
variable : type tNAME '=' expr tTERM
         | type tNAME tTERM
type : '<' type '>'
     | tINT
expr : tINTEGER
     | expr '<' expr
It helps when creating a parser to know how to design a grammar to eliminate possible conflicts. For that you would need an understanding of how parsers work, which is outside the scope of this answer :)
The basic problem here is that you need more lookahead than the one token you get with yacc/bison. When the parser sees a <, it has no way of telling whether it is done with the previous declaration and is looking at the beginning of a bracketed type, or whether this is a less-than operator. There are two basic things you can do here:
1. Use a parsing method such as bison's %glr-parser option or btyacc, which can deal with non-LR(1) grammars.
2. Use the lexer to do extra lookahead and return disambiguating tokens.
For the latter, you would have the lexer do extra lookahead after a '<' and return a different token if it's followed by something that looks like a type. The easiest way is to use flex's / lookahead (trailing context) operator. For example:
"<"/[ \t\n\r]*"<" return OPEN_ANGLE;
"<"/[ \t\n\r]*"int" return OPEN_ANGLE;
"<" return '<';
Then you change your bison rules to expect OPEN_ANGLE in types instead of <:
type : OPEN_ANGLE type '>'
| tINT
expr : tINTEGER
| expr '<' expr
For more complex problems, you can use flex start states, or even insert an entire token filter/transform pass between the lexer and the parser.
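If you want to try the lookahead idea outside of flex, here is a rough Python equivalent using a regex lookahead in place of flex's trailing context. The token names other than OPEN_ANGLE, tINT, tINTEGER and tNAME are invented for this sketch, and the \b after int is my own addition so that a name such as integer is not mistaken for the type keyword.

```python
import re

# Order matters: OPEN_ANGLE must be tried before the plain '<'.
TOKEN_SPEC = [
    ("OPEN_ANGLE", r"<(?=\s*(?:<|int\b))"),  # '<' followed by '<' or 'int'
    ("LESS",       r"<"),
    ("GREATER",    r">"),
    ("tINT",       r"int\b"),
    ("tINTEGER",   r"\d+"),
    ("tNAME",      r"[A-Za-z_]\w*"),
    ("ASSIGN",     r"="),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return (token_name, lexeme) pairs, dropping whitespace."""
    out, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        if m.lastgroup != "SKIP":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

for tok in tokenize("int a = 1 < 2 <int> b"):
    print(tok)
```

The first < is followed by a number and comes out as LESS; the second is followed by int and comes out as OPEN_ANGLE, so the parser never has to guess. This only works because, in the grammar shown, an expression operand can never start with int or <.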
Here is a fix, though not an entirely satisfactory one:
%{
%}
%token tNAME tINT tINTEGER
%left '<'
%left '+'
%nonassoc '=' /* <-- LOOK */
%%
declaration: variable
| declaration variable
variable : type tNAME '=' expr
| type tNAME
type : '<' type '>'
| tINT
expr : tINTEGER
| expr '<' expr
| expr '+' expr
;
This issue is a conflict between these two LR items: the dot-final item
variable : type tNAME '=' expr .
and this one:
expr : expr . '<' expr
Notice that these two have different operators. It is not, as you seem to think, a conflict between different productions involving the '<' operator.
By adding = to the precedence ranking, we fix the issue in the sense that the conflict diagnostic goes away.
Note that I gave = a high precedence. This will resolve the conflict by favoring the reduce. This means that you cannot use a '<' expression as an initializer:
int name = 4 < 3 // syntax error
When the < is seen, the int name = 4 wants to be reduced, and the idea is that < must be the start of the next declaration, as part of a type production.
To allow < relational expressions to be used as initializers, add the support for parentheses into the expression grammar. Then users can parenthesize:
int foo = (4 < 3) <int> bar = (2 < 1)
There is no way to fix that without a more powerful parsing method or hacks.
What if you move the %nonassoc before %left '<', giving it low precedence? Then the shift will be favored. Unfortunately, that has the consequence that you cannot write another <int> declaration after a declaration.
int foo = 3 <int> bar = 4
             ^ // error: the machine shifted and is now doing: expr '<' . expr.
So that is the wrong way to resolve the conflict; you want to be able to write multiple such declarations.
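The rule yacc applies here can be written down in a few lines. This Python sketch models only the decision itself: a rule takes the precedence of its last terminal (here '=' for variable : type tNAME '=' expr), the lookahead is '<', and later declarations rank higher. The function names and the list encoding of the declarations are hypothetical.

```python
def levels(declarations):
    """Map each token to (level, assoc); later declarations get higher levels."""
    table = {}
    for level, (assoc, tokens) in enumerate(declarations, start=1):
        for tok in tokens:
            table[tok] = (level, assoc)
    return table

def resolve(rule_prec, token_prec, token_assoc):
    """Decide a shift/reduce conflict the way yacc's precedence rules do."""
    if rule_prec is None or token_prec is None:
        return "shift"                 # no precedence info: default shift, conflict reported
    if rule_prec > token_prec:
        return "reduce"
    if rule_prec < token_prec:
        return "shift"
    # Equal precedence: associativity of the token decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[token_assoc]

def decide(declarations, rule_last_terminal, lookahead):
    table = levels(declarations)
    rule_prec = table.get(rule_last_terminal, (None, None))[0]
    token_prec, token_assoc = table.get(lookahead, (None, None))
    return resolve(rule_prec, token_prec, token_assoc)

# '=' declared last (highest): the declaration is reduced when '<' shows up.
high = [("left", ["<"]), ("left", ["+"]), ("nonassoc", ["="])]
# '=' declared first (lowest): the '<' is shifted as a less-than operator.
low = [("nonassoc", ["="]), ("left", ["<"]), ("left", ["+"])]

print(decide(high, "=", "<"))  # reduce
print(decide(low, "=", "<"))   # shift
```

Either way the choice is made once, for every '<' seen in that state, which is why neither ordering accepts both 4 < 3 as an initializer and a following <int> declaration.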
Another note:
My TXR language, which implements something equivalent to Parsing Expression Grammars, handles this grammar fine. This is essentially LL(infinite), which trumps LALR(1).
We don't even have to have a separate lexical analyzer and parser! That's just something made necessary by the limitations of one-symbol-lookahead, and the need for utmost efficiency on 1970's hardware.
Here is example output from the shell command line, demonstrating the parse by translation to a Lisp-like abstract syntax tree, which is bound to the variable dl (declaration list). It is complete with semantic actions, yielding output that can be further processed in TXR Lisp. Identifiers are translated to Lisp symbols via calls to intern, and numbers are translated to number objects.
$ txr -l type.txr -
int x = 3 < 4 int y
(dl (decl x int (< 3 4)) (decl y int nil))
$ txr -l type.txr -
< int > x = 3 < 4 < int > y
(dl (decl x (pointer int) (< 3 4)) (decl y (pointer int) nil))
$ txr -l type.txr -
int x = 3 + 4 < 9 < int > y < int > z = 4 + 3 int w
(dl (decl x int (+ 3 (< 4 9))) (decl y (pointer int) nil)
(decl z (pointer int) (+ 4 3)) (decl w int nil))
$ txr -l type.txr -
<<<int>>>x=42
(dl (decl x (pointer (pointer (pointer int))) 42))
The source code (type.txr):
@(define ws)@/[ \t]*/@(end)
@(define int)@(ws)int@(ws)@(end)
@(define num (n))@(ws)@{n /[0-9]+/}@(ws)@(filter :tonumber n)@(end)
@(define id (id))@\
@(ws)@{id /[A-Za-z_][A-Za-z_0-9]*/}@(ws)@\
@(set id @(intern id))@\
@(end)
@(define type (ty))@\
@(local l)@\
@(cases)@\
@(int)@\
@(bind ty @(progn 'int))@\
@(or)@\
<@(type l)>@\
@(bind ty @(progn '(pointer ,l)))@\
@(end)@\
@(end)
@(define expr (e))@\
@(local e1 op e2)@\
@(cases)@\
@(additive e1)@{op /[<>]/}@(expr e2)@\
@(bind e @(progn '(,(intern op) ,e1 ,e2)))@\
@(or)@\
@(additive e)@\
@(end)@\
@(end)
@(define additive (e))@\
@(local e1 op e2)@\
@(cases)@\
@(num e1)@{op /[+\-]/}@(expr e2)@\
@(bind e @(progn '(,(intern op) ,e1 ,e2)))@\
@(or)@\
@(num e)@\
@(end)@\
@(end)
@(define decl (d))@\
@(local type id expr)@\
@(type type)@(id id)@\
@(maybe)=@(expr expr)@(or)@(bind expr nil)@(end)@\
@(bind d @(progn '(decl ,id ,type ,expr)))@\
@(end)
@(define decls (dl))@\
@(coll :gap 0)@(decl dl)@(end)@\
@(end)
@(freeform)
@(decls dl)