Multiplication by juxtaposition in yacc - grammar

I'm trying to implement a grammar that allows multiplication by juxtaposition.
This is for parsing polynomial inputs for a CAS.
It works quite well, except for a few edge cases, as far as I'm aware.
There are two problems I have identified:
Conflict with other rules, e.g., a^2 b is (erroneously) parsed as (^ a (* 2 b)), not as (* (^ a 2) b).
yacc(bison) reports 28 shift/reduce conflicts and 8 reduce/reduce conflicts.
I'm pretty sure properly resolving the first issue will resolve the second as well, but so far I haven't been successful.
The following is the gist of the grammar that I'm working with:
%start prgm
%union {
double num;
char *var;
ASTNode *node;
}
%token <num> NUM
%token <var> VAR
%type <node> expr
%left '+' '-'
%left '*' '/'
%right '^'
%%
prgm: // nothing
| prgm '\n'
| prgm expr '\n'
;
expr: NUM
| VAR
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '^' expr
| expr expr %prec '*'
| '-' expr
| '(' expr ')'
;
%%
Removing the rule for juxtaposition (expr expr %prec '*') resolves the shift/reduce & reduce/reduce warnings.
Note that ab in my grammar should mean (* a b).
Multi-character variables should be preceded by a quote('); this is already handled fine in the lex file.
The lexer ignores spaces( ) and tabs(\t) entirely.
I'm aware of this question, but the use of juxtaposition here does not seem to indicate multiplication.
Any comments or help would be greatly appreciated!
P.S. If it helps, this is the link to the entire project.

As indicated in the answer to the question you linked, it is hard to specify the operator precedence of juxtaposition because there is no operator to shift. (As in your code, you can specify the precedence of the production expr: expr expr. But what lookahead token will this reduction be compared with? Adding every token in FIRST(expr) to your precedence declarations is not very scalable, and might lead to unwanted precedence resolutions.)
An additional problem with the precedence solution is the behaviour of the unary minus operator (an issue not addressed in the linked question), because as written your grammar allows a - b to be parsed either as a subtraction or as the juxtaposed multiplication of a and -b. (And note that - is in FIRST(expr), leading to one of the possibly unwanted resolutions I referred to above.)
So the best solution, as recommended in the linked question, is to use a grammar with explicit precedence, such as the following. (Here, I used juxt as the name of the non-terminal, rather than expr_sequence.)
%start prgm
%token NUM
%token VAR
%left '+' '-'
%left '*' '/'
%right '^'
%%
prgm: // nothing
| prgm '\n'
| prgm expr '\n'
expr: juxt
| '-' juxt
| expr '+' expr
| expr '-' expr
| expr '*' expr
| expr '/' expr
| expr '^' expr
juxt: atom
| juxt atom
atom: NUM
| VAR
| '(' expr ')'
This grammar may not be what you want:
its rather simple-minded handling of unary minus has a couple of issues. I don't think it's problematic that it parses -xy into -(xy) instead of (-x)y, but it's not ideal. Also, it doesn't allow --x (also probably not a problem, but not ideal). Finally, it does not parse -x^y as -(x^y), but as (-x)^y, which is contrary to frequent practice.
In addition, it incorrectly binds juxtaposition too tightly. You might or might not consider it a problem that a/xy parses as a/(xy), but you would probably object to 2x^7 being parsed as (2x)^7.
The simplest way to avoid those issues is to use a grammar in which operator precedence is uniformly implemented with unambiguous grammar rules.
Here's an example which implements standard precedence rules (exponentiation takes precedence over unary minus; juxtaposing multiply has the same precedence as explicit multiply). It's worth taking a few minutes to look closely at which non-terminal appears in which production, and think about how that correlates with the desired precedence rules.
%union {
double num;
char *var;
ASTNode *node;
}
%token <num> NUM
%token <var> VAR
%type <node> expr mult neg expt atom
%%
prgm: // nothing
| prgm '\n'
| prgm error '\n'
| prgm expr '\n'
expr: mult
| expr '+' mult
| expr '-' mult
mult: neg
| mult '*' neg
| mult '/' neg
| mult expt
neg : expt
| '-' neg
expt: atom
| atom '^' neg
atom: NUM
| VAR
| '(' expr ')'
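To see the effect of this layering without running bison, here is a minimal hand-written recursive-descent sketch in Python that mirrors the grammar above, one method per non-terminal. It is an illustration only, not generated parser code: the tokenize function, the Parser class, the "neg" node label and the show helper are assumptions of the sketch, and it handles only numbers and single-letter variables (the quoted multi-character variables from the question are left out).

```python
import re


def tokenize(src):
    # Numbers, single-letter variables and one-character operators; blanks are skipped.
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]|[-+*/^()]", src)


class Parser:
    def __init__(self, src):
        self.toks = tokenize(src)
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def starts_atom(self):
        # True when the next token is NUM, VAR or '(' (the tokens that can begin an atom).
        tok = self.peek()
        return tok is not None and (tok == "(" or tok[0].isalnum())

    def expr(self):  # expr: mult | expr '+' mult | expr '-' mult
        node = self.mult()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.mult())
        return node

    def mult(self):  # mult: neg | mult '*' neg | mult '/' neg | mult expt
        node = self.neg()
        while True:
            if self.peek() in ("*", "/"):
                op = self.take()
                node = (op, node, self.neg())
            elif self.starts_atom():  # juxtaposition: the right operand is an expt
                node = ("*", node, self.expt())
            else:
                return node

    def neg(self):  # neg: expt | '-' neg
        if self.peek() == "-":
            self.take()
            return ("neg", self.neg())
        return self.expt()

    def expt(self):  # expt: atom | atom '^' neg
        base = self.atom()
        if self.peek() == "^":
            self.take()
            return ("^", base, self.neg())
        return base

    def atom(self):  # atom: NUM | VAR | '(' expr ')'
        tok = self.take()
        if tok == "(":
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or not tok[0].isalnum():
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok


def parse(src):
    parser = Parser(src)
    node = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return node


def show(node):
    # Render the tuple tree as an S-expression string.
    if isinstance(node, tuple):
        return "(" + " ".join([node[0]] + [show(child) for child in node[1:]]) + ")"
    return node
```

In this sketch show(parse("a^2 b")) gives (* (^ a 2) b), -x^y gives (neg (^ x y)), and a - b stays a subtraction because juxtaposition only continues when the next token can start an atom.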

Related

How to write yacc rules

I'm a newbie to yacc and don't really understand how to write the rules, especially how to handle recursive definitions.
%token NUMBER
%token VARIABLE
%left '+' '-'
%left '*' '/' '%'
%left '(' ')'
%%
S: VARIABLE'='E {
printf("\nEntered arithmetic expression is Valid\n\n");
return 0;
}
E : E'+'E
| E'-'E
| E'*'E
| E'/'E
| E'%'E
| '('E')'
| NUMBER
| VARIABLE
;
%%
The above example works well, but when I changed it as below, I got "5 shift/reduce conflicts".
%token NUMBER
%token VARIABLE
%token MINS
%token PULS
%token MUL
%token DIV
%token MOD
%token LP
%token RP
%left MINS PULS
%left MUL DIV MOD
%left LP RP
%%
S: VARIABLE'='E {
printf("\nEntered arithmetic expression is Valid\n\n");
return 0;
}
E : E operator E
| LP E RP
| NUMBER
| VARIABLE
;
operator: MINS
| PULS
| MUL
| DIV
| MOD
;
%%
Can anyone tell me what the difference between these examples is? Thanks a lot.
The difference is the additional indirection with the non-terminal operator. That serves to defeat your precedence declarations.
Precedence is immediate, not transparent. That is, it only functions in the production directly including the terminal. In your second grammar, that production is:
operator: MINS
| PULS
| MUL
| DIV
| MOD
;
But there is no ambiguity to resolve in that production. All of those terminals are unambiguously reduced to operator. The ambiguity is in the production
E : E operator E
And that production has no terminals in it.
By contrast, in your first grammar, the productions
E : E'+'E
| E'-'E
| E'*'E
| E'/'E
| E'%'E
(which would be easier to read with a bit more whitespace) do include terminals whose precedences can be compared with each other.
The precise working of precedence declarations is explained in the Bison manual. In case it's useful, here's a description of the algorithm I wrote a few years ago in a different answer on this site.
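As a rough illustration of the comparison described above, here is a small Python model of how a shift/reduce decision is made from precedence declarations. It is a simplified sketch, not bison's implementation: the function names and the two tables are made up for the example, the '(' ')' level is left out, and the only rule it encodes is that a production takes its precedence from its last terminal and is compared against the lookahead token.

```python
# Precedence tables for the two grammars in the question: (level, associativity).
PREC_FIRST = {"+": (1, "left"), "-": (1, "left"),
              "*": (2, "left"), "/": (2, "left"), "%": (2, "left")}
PREC_SECOND = {"MINS": (1, "left"), "PULS": (1, "left"),
               "MUL": (2, "left"), "DIV": (2, "left"), "MOD": (2, "left")}
NONTERMINALS = {"E", "operator"}


def rule_precedence(rhs, prec):
    # A production inherits the precedence of its last terminal, if it has one.
    terminals = [sym for sym in rhs if sym not in NONTERMINALS]
    return prec.get(terminals[-1]) if terminals else None


def resolve(rhs, lookahead, prec):
    """Decide between reducing by `rhs` and shifting `lookahead`."""
    rule = rule_precedence(rhs, prec)
    token = prec.get(lookahead)
    if rule is None or token is None:
        return "conflict"  # nothing to compare: the conflict is reported
    if rule[0] != token[0]:
        return "reduce" if rule[0] > token[0] else "shift"
    # Same level: associativity decides.
    return {"left": "reduce", "right": "shift", "nonassoc": "error"}[rule[1]]


print(resolve(["E", "+", "E"], "*", PREC_FIRST))          # shift
print(resolve(["E", "*", "E"], "+", PREC_FIRST))          # reduce
print(resolve(["E", "operator", "E"], "MUL", PREC_SECOND))  # conflict
```

In this model E: E operator E has no terminal of its own, so each of the five operator lookaheads is left unresolved, which is consistent with the five shift/reduce conflicts reported in the question.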

How can I understand this binary expression grammar?

I don't understand this binary expression grammar
expr -> expr '+' term
| expr '-' term
| term
term -> term '*' factor
| term '/' factor
| factor
factor -> '(' expr ')'
| NUM
In plain english:
An expr can be one of the following:
another expr followed by the character + followed by a term
another expr followed by the character - followed by a term
a term
A term can be one of the following:
another term followed by the character * followed by a factor
another term followed by the character / followed by a factor
a factor
A factor can be one of the following:
a character ( followed by an expr followed by a character )
a number
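The same reading can be turned into code almost line by line. The following is a minimal Python sketch of a recursive-descent evaluator for this grammar, assuming NUM is a non-negative integer and that "another expr followed by ..." is handled with a loop instead of left recursion; the function name evaluate and the regex tokenizer are choices made for the example.

```python
import re


def evaluate(src):
    toks = re.findall(r"\d+|[-+*/()]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr -> expr '+' term | expr '-' term | term
        value = term()
        while peek() in ("+", "-"):
            if take() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():  # term -> term '*' factor | term '/' factor | factor
        value = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():  # factor -> '(' expr ')' | NUM
        tok = take()
        if tok == "(":
            value = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result
```

Because term sits below expr, evaluate("1+2*3") multiplies first and returns 7, while evaluate("(1+2)*3") returns 9; the loops make the operators left-associative, so evaluate("8-3-2") is 3.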

Negative expression ANTLR

I'm writing a parser but I don't know why I can't parse this:
Proceso A
varX <- - 4;
FinProceso
I'm getting
line 2:12 extraneous input '-' expecting {NEGOP, '(', '-', INT, DOUBLE, STRING, BOOL, ID}
This is my grammar in ANTLR
Grammar
Explanation
Your grammar contains two lexer rules that match the character -: SUMOP and NEG. In your case the lexer will always produce SUMOP, because it is defined before NEG. Therefore the rule operatorUnary is never used.
SUMOP : ('+' | '-');
NEG : '-';
expr
: expr SUMOP expr
| operatorUnary expr
;
operatorUnary: '-';
Solution
You should reorganize your lexer rules. For example, you can delete the NEG rule and use only SUMOP.
SUMOP : ('+' | '-');
expr
: SUMOP expr // higher precedence
| expr SUMOP expr // lower precedence
;
Also, it is often a good idea to give the unary negation operator higher precedence than the binary addition and/or subtraction operator. You can achieve this by changing the order of the alternatives in the rule expr.
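The shadowing described in the explanation can be reproduced with a toy lexer. This Python sketch is not ANTLR; it only models the tie-breaking behaviour the answer relies on (longest match wins, and on a tie the rule listed first wins). The lex function and the INT rule are assumptions added for the example.

```python
import re


def lex(text, rules):
    """Tokenize `text`; `rules` is an ordered list of (name, regex) pairs."""
    out, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # Strictly longer matches replace the current best, so ties keep the earlier rule.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        out.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return out


rules = [("SUMOP", r"[+-]"), ("NEG", r"-"), ("INT", r"\d+")]
print(lex("- 4", rules))  # [('SUMOP', '-'), ('INT', '4')]
```

With SUMOP listed before NEG, no input ever yields a NEG token, so a parser alternative that expects NEG can never match.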

terminal/datatype/parser rules in xtext

I'm using xtext 2.4.
What I want to do is a SQL-like syntax.
What confuses me is that I'm not sure which things should be treated as terminal, datatype, or parser rules. So far my grammar related to MyTerm is:
Model:
(terms += MyTerm ';')*
;
MyTerm:
constant=MyConstant | variable?='?'| collection_literal=CollectionLiteral
;
MyConstant
: string=STRING
| number=MyNumber
| date=MYDATE
| uuid=UUID
| boolean=MYBOOLEAN
| hex=BLOB
;
MyNumber:
int=SIGNINT | float=SIGNFLOAT
;
SIGNINT returns ecore::EInt:
'-'? INT
;
SIGNFLOAT returns ecore::EFloat:
'-'? INT '.' INT
;
CollectionLiteral:
=> MapLiteral | SetLiteral | ListLiteral
;
MapLiteral:
'{' {MapLiteral} (entries+=MapEntry (',' entries+=MapEntry)* )? '}'
;
MapEntry:
key=MyTerm ':' value=MyTerm
;
SetLiteral:
'{' {SetLiteral} (values+=MyTerm (',' values+=MyTerm)* )+ '}'
;
ListLiteral:
'[' {ListLiteral} ( values+=MyTerm (',' values+=MyTerm)* )? ']'
;
terminal MYDATE:
'0'..'9' '0'..'9' '0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9' '-'
'0'..'9' '0'..'9'
;
terminal HEX:
'a'..'h'|'A'..'H'|'0'..'9'
;
terminal UUID:
HEX HEX HEX HEX HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX '-'
HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX HEX
;
terminal BLOB:
'0' ('x'|'X') HEX+
;
terminal MYBOOLEAN returns ecore::EBoolean:
'true' | 'false' | 'TRUE' | 'FALSE'
;
Few questions:
How to define integer with sign? If I define another terminal rule terminal SIGNINT: '-'? '0'..'9'+;, antlr will complain about INT becoming unreachable. Therefore I define it as a datatype rule SIGNINT: '-'? INT; Is this the correct way to do it?
How to define float with sign? I did exactly the same as define integer with sign, SIGNFLOAT: '-'? INT '.' INT;, not sure if this is correct as well.
How to define a date rule? I want to use a parser rule to store year/month/day info in fields, but define it as MyDate: year=INT '-' month=INT '-' date=INT; antlr will complain Decision can match input such as "RULE_INT '-' RULE_INT '-' RULE_INT" using multiple alternatives: 2, 3
As a result, alternative(s) 3 were disabled for that input
I also have some other rules like the following:
RelationCompare:
name=ID compare=COMPARE term=MyTerm
;
but a=4 won't be a valid RelationCompare because a and 4 will be treated as HEX tokens. I found this because if I change the relation to j=44 then it works. This post says that a terminal rule defined earlier will shadow those defined later. However, if I redefine terminal ID in my grammar, whether I put it before or after terminal HEX, antlr will complain The following token definitions can never be matched because prior tokens match the same input: RULE_HEX,RULE_MYBOOLEAN. This problem happens with k=0x00b as well: k=0xaab is valid but k=0x00b is not.
Any suggestion?
How do you define an integer with sign?
Treat it as two separate tokens '-' and INT, and use a parser rule instead of a lexer rule.
How do you define a float with sign?
Treat it as two separate tokens '-' and FLOAT, and use a parser rule instead of a lexer rule.
How do you define a date rule?
Treat it as five separate tokens and use a parser rule instead of a lexer rule.
I don't know the answer to the last question since this is in Xtext as opposed to just ANTLR.
Later I found the original antlr grammar for what I want to do, so I simply translated the antlr grammar to an xtext grammar. Here is how I define those basic types:
terminal fragment A: 'a'|'A';
...
terminal fragment Z: 'z'|'Z';
terminal fragment DIGIT: '0'..'9';
terminal fragment LETTER: ('a'..'z'|'A'..'Z');
terminal fragment HEX: ('a'..'f'|'A'..'F'|'0'..'9');
terminal fragment EXPONENT: E ('+'|'-')? DIGIT+;
terminal INTEGER returns ecore::EInt: '-'? DIGIT+;
terminal FLOAT returns ecore::EFloat: INTEGER EXPONENT | INTEGER '.' DIGIT* EXPONENT?;
terminal BOOLEAN: T R U E | F A L S E;
The Date rule in original grammar is treated as a string.
About rule names (antlr grammar => xtext grammar):
parser rule: starting with lowercase => rules starting with uppercase (each will be a Java Class)
terminal rule: starting with uppercase => using all uppercase with terminal prefix
fragment terminal rule: fragment ID => terminal fragment ID
In antlr a list of arguments is defined like this:
functionArgs
: '(' ')'
| '(' t1=term ( ',' tn=term )* ')'
;
The corresponding xtext grammar is:
FunctionArgs
: '(' ')'
| '(' ts+=Term (',' ts+=Term )* ')'
;
For those parser rules with an argument enclosed by [ ]
properties[PropertyDefinitions props]
: property[props] (K_AND property[props])*
;
Most of the time they could be moved to the left hand side
Properties
: props+=Property (K_AND props+=Property)*
;
Now it's working as expected.

Bison/YACC - avoid reduce/reduce conflict with two negation rules

The following grammar (where INTEGER is a sequence of digits) gives rise to a reduce/reduce conflict, because e.g. -4 can be reduced by expr -> -expr or expr -> num -> -INTEGER. In my grammar, num and expr return different types so that I have to distinguish -num and -expr. My goal is that -5 is reduced by num while e.g. -(...) is an expr. How could I achieve this?
%token INTEGER
%left '+' '-'
%%
start: expr
;
expr: expr '+' expr
| expr '-' expr
| '-' expr
| '(' expr ')'
| num
;
num: INTEGER
| '-' INTEGER
;
%%
For this specific case, you could change the rule for negative expressions to
expr: '-' '(' expr ')'
and only recognize negations on parenthesized expressions. This, however, won't recognize double negatives (e.g. - - x) and, more importantly, won't scale, in that it will break if you try to add other unary operators.
Now you could simply put the num rules BEFORE the expr rules and allow the default reduce/reduce conflict resolution to deal with it (the first rule appearing in the file will be used if both are possible), but that's kind of ugly in that you get these conflict warnings every time you run bison, and ignoring them when you don't know exactly what is going on is a bad idea.
The general way of addressing this kind of ambiguity is by factoring the grammar to split the offending rule into two rules and using the appropriate version in each context so that you don't get conflicts. In this case, you'd split expr into num_expr for expressions that start with a num and non_num_expr for other expressions:
expr: num_expr | non_num_expr ;
num_expr: num_expr '+' expr
| num_expr '-' expr
| num
;
non_num_expr: non_num_expr '+' expr
| non_num_expr '-' expr
| '-' non_num_expr
| '(' expr ')'
;
Basically, every rule for expr that begins with an expr on the RHS needs to be duplicated, and other uses of expr may need to be changed to one of the variants so as to avoid the conflict.
Unfortunately, in this case, it doesn't work cleanly, as you're using precedence levels to resolve the inherent ambiguity of the expression grammar, and the factored rules get in the way of that -- the extra one-step rules cause problems. So you need to either factor those rules out of existence (duplicating every rule with expr on the RHS, one with the num_expr version and one with the non_num_expr version), or refactor your grammar with extra rules for the precedence/associativity:
expr: expr '+' term
| expr '-' term
| term
;
term: non_num_term | num_term ;
non_num_term: '-' non_num_term
| '(' expr ')'
;
num_term: num ;
Note that in this case, the num/non_num factoring has been done on term rather than expr.
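To check which rule ends up owning the minus sign, here is a minimal recursive-descent sketch in Python of the refactored grammar above. It is a hand-written approximation, not bison output: the tuple labels "num", "neg" and "expr" are invented for the example, and the parser looks one token past a '-' to decide between num and non_num_term, which roughly corresponds to the decision the generated parser makes after shifting the '-'.

```python
import re


def parse(src):
    toks = re.findall(r"\d+|[-+()]", src)
    pos = 0

    def peek(k=0):
        return toks[pos + k] if pos + k < len(toks) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():  # expr: expr '+' term | expr '-' term | term
        node = term()
        while peek() in ("+", "-"):
            op = take()
            node = ("expr", op, node, term())
        return node

    def term():  # term: non_num_term | num_term
        nxt = peek(1)
        if peek() == "-" and nxt is not None and nxt.isdigit():
            take()
            return ("num", -int(take()))  # num: '-' INTEGER
        if peek() is not None and peek().isdigit():
            return ("num", int(take()))   # num: INTEGER
        return non_num_term()

    def non_num_term():  # non_num_term: '-' non_num_term | '(' expr ')'
        tok = take()
        if tok == "-":
            return ("neg", non_num_term())
        if tok == "(":
            node = expr()
            if take() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

    node = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return node
```

Here parse("-5") yields ("num", -5), while parse("-(5)") yields ("neg", ("num", 5)), so the two kinds of negation are kept apart as the question asks.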
You are not clear on why num needs to represent negative numbers. I can't tell if you use num elsewhere in your grammar. You also don't say why you want num and expr to be distinct.
Normally, negative numbers are handled at the lexer level. In your case, the rule would be something like -?[0-9]+. This eliminates the need for num at all, and results in the following:
expr: expr '+' expr
| expr '-' expr
| '-' expr
| '(' expr ')'
| INTEGER
;
EDIT: Chris Dodd has a point. So you need to move negation entirely into the parser. You still get rid of num, just don't test for negatives in the INTEGER lexer pattern (i.e. the pattern would be something like [0-9]+, which is what you're doing now, right?). The expr rule I gave above does not change.
A negative number (-5) parses as: '-' INTEGER, which becomes '-' expr (choice 5), then expr (choice 3).
A difference between two integers (3-2) parses as INTEGER '-' INTEGER, which becomes expr '-' expr (choice 5 twice), then expr (choice 2).
A difference between an integer and a negative integer (5--1) parses as INTEGER '-' '-' INTEGER, which becomes expr '-' '-' expr (choice 5 twice), then expr '-' expr (choice 3), then expr (choice 2).
And so forth. The fundamental problem is you have negation in two different places and there is no way that can't be ambiguous.
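The lexer side of this can be checked quickly in Python. The sketch below uses regular expressions as a stand-in for the lex patterns (the tokens helper is hypothetical) and compares a signed integer pattern with an unsigned one on the input 3-2.

```python
import re


def tokens(int_pattern, text):
    # Try the integer pattern first, then the single-character operators.
    return re.findall(int_pattern + r"|[-+()]", text)


signed = tokens(r"-?[0-9]+", "3-2")
unsigned = tokens(r"[0-9]+", "3-2")
print(signed)    # ['3', '-2']
print(unsigned)  # ['3', '-', '2']
```

With the sign folded into the integer pattern, the input contains no '-' token at all, so expr '-' expr cannot match; with the unsigned pattern, the parser sees INTEGER '-' INTEGER as described above.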