ifequal construct grammar implementation - grammar

I have to create grammar that recognizes this:
ifequal(exp1, exp2)
statement1
smaller
statement2
larger
statement3
statement1 is executed if two expressions are equal, second if first is smaller third if it is larger. I tried to make something similar to this solution but no luck. I must not use precedence so the grammar has to be changed properly. I am using cup tool for generating parser.
EDIT: smaller and larger parts are optional.

It's the ancient dangling if (if without an else) problem. if you write some code like
if (a == b) then
if (b == c) then
do_something;
else
do_something_else;
a parser will have problems deciding to which "if" the "else" belongs.
One solution is to add a delimiter, like "endif". Example:
if (a == b) then
if (b == c) then
do_something;
endif;
else
do_something_else;
endif;
Not confused by indentation any more (which I did wrong deliberately), it is clear what to do now.
That's why a grammar like the one outlined below doesn't work (it produces a lot of conflicts):
stmtlist: stmt
| stmtlist stmt
stmt: ifequal
| something_else
ifequal: IFEQUAL '(' expr ',' expr ')' stmtlist opt_lt_gt
opt_lt_gt:
| SMALLER stmtlist
| LARGER stmtlist
| SMALLER stmtlist LARGER stmtlist
But as soon as the statement list is somehow separated from the IFEQUAL statement, e.g. by using braces, the problems vanish.
stmtlist: '{' stmtseq '}'
stmtseq: stmt
| stmtseq stmt
The other possibility is to prohibit incomplete statements in the stmtlist. This will more or less double your grammar. It could look like
fullifequal: IFEQUAL '(' expr ',' expr ')' fstmtlist SMALLER fstmtlist LARGER fstmtlist
fstmtlist: fullstmt
| fstmtlist fullstmt
fullstmt: fullifequal
| some_other
ifequal: IFEQUAL '(' expr ',' expr ')' fstmtlist opt_lt_gt
opt_lt_gt:
| SMALLER fstmtlist
| LARGER fstmtlist
| SMALLER fstmtlist LARGER fstmtlist
Personally I prefer the braces solution because it's easy to read, doesn't constrain the statements within an "ifequal" statement and the parser code will be shorter. But I guess you'll need the other solution.

Related

Multiplication by juxtaposition in yacc

I'm trying to implement a grammar that allows multiplication by juxtaposition.
This is for parsing polynomial inputs for a CAS.
It works quite well, except few edge cases, as far as I'm aware of.
There are two problems I have identified:
Conflict with other rules, e.g., a^2 b is (erroneously) parsed as (^ a (* 2 b)), not as (* (^ a 2) b).
yacc(bison) reports 28 shift/reduce conflicts and 8 reduce/reduce conflicts.
I'm pretty sure properly resolving the first issue will resolve the second as well, but so far I haven't been successful.
The following is the gist of the grammar that I'm working with:
%start prgm
%union {
double num;
char *var;
ASTNode *node;
}
%token <num> NUM
%token <var> VAR
%type <node> expr
%left '+' '-'
%left '*' '/'
%right '^'
%%
prgm: // nothing
| prgm '\n'
| prgm expr '\n'
;
expr: NUM
| VAR
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '^' expr
| expr expr %prec '*'
| '-' expr
| '(' expr ')'
;
%%
Removing the rule for juxtaposition (expr expr %prec '*') resolves the shift/reduce & reduce/reduce warnings.
Note that ab in my grammar should mean (* a b).
Multi-character variables should be preceded by a quote('); this is already handled fine in the lex file.
The lexer ignores spaces( ) and tabs(\t) entirely.
I'm aware of this question, but the use of juxtaposition here does not seem to indicate multiplication.
Any comments or help would be greatly appreciated!
P.S. If it helps, this is the link to the entire project.
As indicated in the answer to the question you linked, it is hard to specify the operator precedence of juxtaposition because there is no operator to shift. (As in your code, you can specify the precedence of the production expr: expr expr. But what lookahead token will this reduction be compared with? Adding every token in FIRST(expr) to your precedence declarations is not very scalable, and might lead to unwanted precedence resolutions.
An additional problem with the precedence solution is the behaviour of the unary minus operator (an issue not addressed in the linked question), because as written your grammar allows a - b to be parsed either as a subtraction or as the juxtaposed multiplication of a and -b. (And note that - is in FIRST(expr), leading to one of the possibly unwanted resolutions I referred to above.)
So the best solutions, as recommended in the linked question, is to use a grammar with explicit precedence, such as the following: (Here, I used juxt as the name of the non-terminal, rather than expr_sequence):
%start prgm
%token NUM
%token VAR
%left '+' '-'
%left '*' '/'
%right '^'
%%
prgm: // nothing
| prgm '\n'
| prgm expr '\n'
expr: juxt
| '-' juxt
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '^' expr
juxt: atom
| juxt atom
atom: NUM
| VAR
| '(' expr ')'
This grammar may not be what you want:
it's rather simple-minded handling of unary minus has a couple of issues. I don't think it's problematic that it parses -xy into -(xy) instead of (-x)y, but it's not ideal. Also, it doesn't allow --x (also, probably not a problem but not ideal). Finally, it does not parse -x^y as -(x^y), but as (-x)^y, which is contrary to frequent practice.
In addition, it incorrectly binds juxtaposition too tightly. You might or might not consider it a problem that a/xy parses as a/(xy), but you would probably object to 2x^7 being parsed as (2x)^7.
The simplest way to avoid those issues is to use a grammar in which operator precedence is uniformly implemented with unambiguous grammar rules.
Here's an example which implements standard precedence rules (exponentiation takes precedence over unary minus; juxtaposing multiply has the same precedence as explicit multiply). It's worth taking a few minutes to look closely at which non-terminal appears in which production, and think about how that correlates with the desired precedence rules.
%union {
double num;
char *var;
ASTNode *node;
}
%token <num> NUM
%token <var> VAR
%type <node> expr mult neg expt atom
%%
prgm: // nothing
| prgm '\n'
| prgm error '\n'
| prgm expr '\n'
expr: mult
| expr '+' mult
| expr '-' mult
mult: neg
| mult '*' neg
| mult '/' neg
| mult expt
neg : expt
| '-' neg
expt: atom
| atom '^' neg
atom: NUM
| VAR
| '(' expr ')'

Antlr4 parser not parsing reassignment statement correctly

I've been creating a grammar parser using Antlr4 and wanted to add variable reassignment (without having to declare a new variable)
I've tried changing the reassignment statement to be an expression, but that didn't change anything
Here's a shortened version of my grammar:
grammar MyLanguage;
program: statement* EOF;
statement
: expression EOC
| variable EOC
| IDENTIFIER ASSIGNMENT expression EOC
;
variable: type IDENTIFIER (ASSIGNMENT expression)?;
expression
: STRING
| INTEGER
| IDENTIFIER
| expression MATH expression
| ('+' | '-') expression
;
MATH: '+' | '-' | '*' | '/' | '%' | '//' | '**';
ASSIGNMENT: MATH? '=';
EOC: ';';
WHITESPACE: [ \t\r\n]+ -> skip;
STRING: '"' (~[\u0000-\u0008\u0010-\u001F"] | [\t])* '"' | '\'' (~[\u0000-\u0008\u0010-\u001F'] | [\t])* '\'';
INTEGER: '0' | ('+' | '-')? [1-9][0-9]*;
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_]*;
type: 'str';
if anything else might be of relevance, please ask
so I tried to parse
str test = "empty";
test = "not empty";
which worked, but when I tried (part of the fibbionaci function)
temp = n1;
n1 = n1 + n2;
n2 = temp;
it got an error and parsed it as
temp = n1; //statement
n1 = n1 //statement - <missing ';'>
+n2; //statement
n2 = temp; //statement
Your problem has nothing to do with assignment statements. Additions simply don't work at all - whether they're part of an assignment or not. So the simplest input to get the error would be x+y;. If you print the token stream for that input (using grun with the -tokens option for example), you'll get the following output:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='+',<'+'>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
line 1:1 no viable alternative at input 'x+'
Now compare this to x*y;, which works fine:
[#0,0:0='x',<IDENTIFIER>,1:0]
[#1,1:1='*',<MATH>,1:1]
[#2,2:2='y',<IDENTIFIER>,1:2]
[#3,3:3=';',<';'>,1:3]
[#4,4:3='<EOF>',<EOF>,1:4]
The important difference here is that * is recognized as a MATH token, but + isn't. It's recognized as a '+' token instead.
This happens because you introduced a separate '+' (and '-') token type in the alternative | ('+' | '-') expression. So whenever the lexer sees a + it produces a '+' token, not a MATH token, because string literals in parser rules take precedence over named lexer rules.
If you turn MATH into a parser rule math (or maybe mathOperator) instead, all of the operators will be literals and the problem will go away. That said, you probably don't want a single rule for all math operators because that doesn't give you the precedence you want, but that's a different issue.
PS: Something like x+1 still won't work because it will see +1 as a single INTEGER token. You can fix that by removing the leading + and - from the INTEGER rule (that way x = -2 would be parsed as a unary minus applied to the integer 2 instead of just the integer -2, but that's not a problem).

Extend Antlr grammar file with lists

I have the following assignment about extending an Antlr grammar.
What I've tried is:
I am not sure whether this is the correct solution or not. Can anyone guide me in the right direction?
2 problems here: 1) you have 2 the same alt-labels (# Lists), and 2) you only allow zero or a single expression in your list. It should be this:
expr
: ...
| '(' expr ')' # Parenthesis
| '[' ( expr ( ',' expr )* )? ']' # Lists
;

Error in array declaration grammar

%token DIGIT RETURN IDENTIFIER COLON COMMA ELSE IF NL KEYWORD BR READ WRITE WHILE EQUAL
%start y2
%left '-'
%left '+'
%right '='
%%
stmt1:KEYWORD IDENTIFIER X1 //for initialization.
;
y2:stmt1 stmt2 //y2 is starting variable
|
;
X1:COLON {printf(" for int a/ char a");}
|'['DIGIT']'COLON {printf("for array declarations");}
;
stmt2:KEYWORD IDENTIFIER"("stmt3")"stmt5 {printf("for functions");}
|
;
stmt3:KEYWORD IDENTIFIER X2
|
;
X2:stmt4 {printf("for parameter int/char");}
|"["DIGIT"]"COLON {printf("for parameter int arr[]/char arr[]");} //in this production parser is not responding
;
stmt4:COMMA stmt3 {printf("to have multiple arguments");}
|
;
%%
I am parsing string int a[10];
but it instead of parsing,
execute yyerror() every time.
This code parses int a; single statement char a; also.
Make sure that your lexer returns '[' when it sees an open bracket.
Mixing single-quoted tokens like '[' and named tokens like COLON is confusing and suggests you are copy-and-pasting from different sources, rather than actually designing a program. Since the lexer and parser must agree on the handling of tokens, this form of creating programs is error-prone. I recommend using single-quoted single-character tokens throughout, since it is more readable and simplifies the lexer.
With respect to X2, there is a difference between '[' and "[". You probably want the first one. The same problem is found in stmt2, which uses​ "(" and ")" instead of the single-quoted versions.

Bison/YACC - avoid reduce/reduce conflict with two negation rules

The following grammar (where INTEGER is a sequence of digits) gives rise to a reduce/reduce conflict, because e.g. -4 can be reduced by expr -> -expr or expr -> num -> -INTEGER. In my grammar, num and expr return different types so that I have to distinguish -num and -expr. My goal is that -5 is reduced by num while e.g. -(...) is an expr. How could I achieve this?
%token INTEGER
%left '+' '-'
%%
start: expr
;
expr: expr '+' expr
| expr '-' expr
| '-' expr
| '(' expr ')'
| num
;
num: INTEGER
| '-' INTEGER
;
%%
For this specific case, you could change the rule for negative expressions to
expr: '-' '(' expr ')'
and only recognize negations on parenthesized expressions. This however won't recognize double-negatives (eg - - x) and, more importantly, won't scale in that it will break if you try to add other unary operators.
Now you could simply put the num rules BEFORE the expr rules and allow the default reduce/reduce conflict resolution to deal with it (the first rule appearing in the file will be used if both are possible), but that's kind of ugly in that you get these conflict warnings every time you run bison, and ignoring them when you don't know exactly what is going on is a bad idea.
The general way of addressing this kind of ambiguity is by factoring the grammar to split the offending rule into two rules and using the appropriate version in each context so that you don't get conflicts. In this case, you'd split expr into num_expr for expressions that start with a num and non_num_expr for other expressions:
expr: num_expr | non_num_expr ;
num_expr: num_expr '+' expr
| num_expr '-' expr
| num
;
non_num_expr: non_num_expr '+' expr
| non_num_expr '-' expr
| '-' non_num_expr
| '(' expr ')'
;
Basically, every rule for expr that begins with an expr on the RHS needs to be duplicated, and other uses of expr may need to be changed to one of the variants so as to avoid the conflict.
Unfortunately, in this case, it doesn't work cleanly, as you're using precedence levels to resolve the inherent ambiguity of the expression grammar, and the factored rules get in the way of that -- the extra one-step rules cause problems. So you need to either factor those rules out of existence (duplicating every rule with expr on the RHS -- one with the num_expr version and one with the non_num_version OR you need to refactor your grammar with extra rules for the precedence/associativity
expr: expr '+' term
| expr '-' term
| term
;
term: non_num_term | num_term ;
non_num_term: '-' non_num_term
| '(' expr ')'
;
num_term: num ;
Note in this case, the num/non_num factoring has been done on term rather than expr
You are not clear on why num needs to represent negative numbers. I can't tell if you use num elsewhere in your grammar. You also don't say why you want num and expr to be distinct.
Normally, negative numbers are handled at the lexer level. In your case, the rule would be something like -?[0-9]+. This eliminates the need for num at all, and results in the following:
expr: expr '+' expr
| expr '-' expr
| '-' expr
| '(' expr ')'
| INTEGER
;
EDIT: Chris Dodd has a point. So you need to move negation entirely into the parser. You still get rid of num, just don't test for negatives in the INTEGER lexer pattern (i.e. the pattern would be something like [0-9]+, which is what you're doing now, right?). The expr rule I gave above does not change.
A negative number (-5) parses as: '-' INTEGER, which becomes '-' expr (choice 5), then expr (choice 3).
A difference between two integers (3-2) parses as INTEGER '-' INTEGER, which becomes expr - expr (choice 5 twice), then expr (choice 2).
A difference between an integer and a negative integer (5--1) parses as INTEGER '-' '-' INTEGER, which becomes expr '-' '-' expr (choice 5 twice), then expr '-' expr (choice 3), then expr (choice 2).
And so forth. The fundamental problem is you have negation in two different places and there is no way that can't be ambiguous.