Operator Associativity - grammar

I have the following EBNF expression grammar:
<expr> -> <term> { (+|-) <term> }
<term> -> <factor> { (*|/|%) <factor> }
<factor> -> <pow> { ** <pow> }
<pow> -> ( <expr> ) | <id>
<id> -> A | B | C
I need to determine if the grammar enforces any particular associativity for its operators, or if that would have to be implemented in the parser code. From what I have read so far, it doesn't look like it does, but I am having a hard time understanding what causes associativity. Any help would be greatly appreciated!

The standard transformation which converts an expression grammar into a form which can be parsed with a top-down (LL) parser has already removed associativity information, because an LL grammar cannot cope with left-associative operators. In effect, the parse tree induced by an LL grammar makes all binary operators right-associative. However, you can generally re-associate the operators without too much trouble in a semantic action.
That's why the multiplication and exponentiation operators seem to have analogous grammar productions, although normally exponentiation would be right-associative while the other binary operators are left-associative.
In an LR grammar, this would be evident:
<expr> -> <term> | <expr> + <term> | <expr> - <term>
<term> -> <factor> | <term> * <factor> | <term> / <factor> | <term> % <factor>
<factor> -> <pow> | <pow> ** <factor>
<pow> -> ( <expr> ) | <id>
<id> -> A | B | C
In the above grammar, an operator is left-associative if the production is left-recursive (because the operator can only occur as part of the non-terminal on the left of the operator). Similarly, the right associative operator has a right-recursive rule, for the same reason.
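To make the "re-associate in a semantic action" point concrete, here is a minimal sketch of a hand-written recursive-descent parser for the EBNF grammar from the question. It is written in Python and the names (tokenize, Parser, the tuple-shaped tree) are my own choices for illustration, not something from the question. The repetition loops for +, -, *, /, % fold to the left as they go, while the loop for ** collects its operands and folds from the right, so the associativity lives entirely in the parser code rather than in the grammar.

```python
import re


def tokenize(text):
    # '**' must be tried before '*' so exponentiation becomes one token.
    # Characters that match nothing (such as spaces) are silently dropped.
    return re.findall(r"\*\*|[-+*/%()]|[ABC]", text)


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # <expr> -> <term> { (+|-) <term> }, folded to the left inside the loop.
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())
        return node

    def term(self):
        # <term> -> <factor> { (*|/|%) <factor> }, also folded to the left.
        node = self.factor()
        while self.peek() in ("*", "/", "%"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # <factor> -> <pow> { ** <pow> }: same grammar shape as above, but the
        # operands are collected first and then folded from the right.
        operands = [self.pow()]
        while self.peek() == "**":
            self.take()
            operands.append(self.pow())
        node = operands.pop()
        while operands:
            node = ("**", operands.pop(), node)
        return node

    def pow(self):
        # <pow> -> ( <expr> ) | <id>
        if self.peek() == "(":
            self.take()
            node = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        tok = self.take()
        if tok not in ("A", "B", "C"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok
```

With this sketch, Parser("A - B - C").expr() groups as (A - B) - C, while Parser("A ** B ** C").expr() groups as A ** (B ** C), even though both productions have the same { ... } shape in the grammar.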

Related

antlr4: Mismatched input on simple grammar

I have a simple grammar that keeps giving me mismatched input on seemingly right inputs. My grammar is as follows
root: expression;
expression
: METRIC comparator RHS
| expression AND expression
| expression OR expression
| LPAREN expression RPAREN
;
comparator
: EQ | GT | GE | LT | LE;
EQ: [eE][qQ];
GE: [gG][eE];
GT: [gG][tT];
LE: [lL][eE];
LT: [lL][tT];
LPAREN: '(';
RPAREN: ')';
AND: [aA][nN][dD];
OR: [oO][rR];
WS: [ \t\n\r]+;
METRIC: 'latency' | 'qps';
RHS: 'foobar' | 'foobaz';
Why does this grammar give a mismatched input 'latency' error when the input is latency eq foobar? Surely this follows the first production, METRIC comparator RHS.
The grammar you posted does not produce the error/warning "mismatched input 'latency'". You probably didn't regenerate the lexer- and parser classes if this is the case.
The only problem with the grammar from your question is the fact that for the input latency eq foobar, the lexer produces WS tokens which your parser does not accept.
You probably want to skip these WS tokens in the lexer:
WS: [ \t\n\r]+ -> skip;
With that change, your parser will produce the following parse tree for the input latency eq foobar:
(root (expression latency (comparator eq) foobar))
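The effect of the WS tokens can be reproduced outside ANTLR with a small sketch. This is not ANTLR's lexer; it is a regex-based stand-in written in Python, and the names lex, parse_comparison and the skip_ws flag are hypothetical. It also folds the comparator rule into a single COMPARATOR token kind to keep things short. The point it shows is only that a parser expecting METRIC comparator RHS fails as soon as whitespace tokens are left in the stream.

```python
import re

# One named group per token kind; METRIC and RHS are listed first so the
# two-letter comparator patterns never get a chance to split them.
TOKEN_RE = re.compile(
    r"(?P<METRIC>latency|qps)"
    r"|(?P<RHS>foobar|foobaz)"
    r"|(?P<COMPARATOR>[eE][qQ]|[gG][eEtT]|[lL][eEtT])"
    r"|(?P<WS>[ \t\r\n]+)"
)


def lex(text, skip_ws):
    """Return (kind, text) pairs; drop WS tokens only when skip_ws is true."""
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"cannot tokenize input at offset {pos}")
        if not (skip_ws and match.lastgroup == "WS"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def parse_comparison(tokens):
    """Accept exactly METRIC comparator RHS and build a tuple-shaped tree."""
    kinds = [kind for kind, _ in tokens]
    if kinds != ["METRIC", "COMPARATOR", "RHS"]:
        raise SyntaxError(f"mismatched input, token kinds were {kinds}")
    metric, comparator, rhs = (text for _, text in tokens)
    return ("expression", metric, ("comparator", comparator), rhs)
```

Calling parse_comparison(lex("latency eq foobar", skip_ws=True)) succeeds, whereas the same call with skip_ws=False raises, because the token stream then contains two WS tokens the parser has no rule for.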

Parsing haskell-like lambda in antlr

I'm trying to parse a haskell-like language using antlr4 and I'm stuck with lambdas. In haskell, lambdas can be mixed with operators. So, given operators >>=, + and lambda syntax '\\' args* '->' expr, the following expression is valid:
a >>= \a -> b >>= \b -> Just(a + b)
and it should be parsed into the following AST:
>>=
/ \
a ->
/ \
a >>=
/ \
b ->
/ \
b Just
|
+
/ \
a b
So I can think of two ways of structuring grammar for this kind of syntax.
The first is to put the lambda expression into the top expression rule, along with ifs and other syntax constructs:
grammar Test;
root
: expr0 EOF
;
expr0
: '\\' ID '->' expr0
| expr1
;
expr1
: expr2 ('>>=' expr2)*
;
expr2
: expr3 ('+' expr3)*
;
expr3
: '(' expr0 ')'
| ID ('(' expr0 ')')?
;
This grammar cannot parse the above expression. It is required to add parens around lambda: a >>= (\a -> b >>= (\b -> Just(a + b))). While I understand why parens are required, this behaviour is pretty inconvenient.
The second approach would be to put the lambda into the last expression rule, along with literals and nested expressions:
grammar Test;
root
: expr0 EOF
;
expr0
: expr1 ('>>=' expr1)*
;
expr1
: expr2 ('+' expr2)*
;
expr2
: '(' expr0 ')'
| ID ('(' expr0 ')')?
| '\\' ID '->' expr0
;
This grammar accepts my expression, however, it contains ambiguity because a >>= \a -> b >>= \b -> Just(a + b) can be parsed either as a >>= \a -> (b >>= \b) -> Just(a + b) or as a >>= \a -> (b >>= \b -> Just(a + b)).
So my question is, how to implement this kind of grammar properly?
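For reference, here is what the second grammar does when it is hand-translated into a plain recursive-descent parser in which the lambda body simply consumes as much input as it can. This is a sketch in Python, not ANTLR output, and the names (tokenize, LambdaParser, the tuple-shaped AST) are made up for the illustration. I am not claiming ANTLR resolves the ambiguity the same way; the sketch only shows that the greedy reading yields the AST drawn above.

```python
import re

IDENT = re.compile(r"[A-Za-z_]\w*")


def tokenize(text):
    # Multi-character operators come first so '>>=' and '->' stay whole.
    return re.findall(r">>=|->|[\\+()]|[A-Za-z_]\w*", text)


class LambdaParser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def ident(self):
        tok = self.take()
        if not IDENT.fullmatch(tok):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        return tok

    def expr0(self):
        # expr0 : expr1 ('>>=' expr1)*
        node = self.expr1()
        while self.peek() == ">>=":
            self.take()
            node = (">>=", node, self.expr1())
        return node

    def expr1(self):
        # expr1 : expr2 ('+' expr2)*
        node = self.expr2()
        while self.peek() == "+":
            self.take()
            node = ("+", node, self.expr2())
        return node

    def expr2(self):
        if self.peek() == "(":
            self.take()
            node = self.expr0()
            self.take(")")
            return node
        if self.peek() == "\\":
            # '\\' ID '->' expr0: the body is a full expr0, so it swallows
            # every following '>>=' before control returns to the caller.
            self.take()
            param = self.ident()
            self.take("->")
            return ("->", param, self.expr0())
        name = self.ident()
        if self.peek() == "(":
            self.take()
            arg = self.expr0()
            self.take(")")
            return (name, arg)
        return name
```

Running LambdaParser(r"a >>= \a -> b >>= \b -> Just(a + b)").expr0() on this sketch nests each lambda body to the right, which corresponds to the reading a >>= \a -> (b >>= \b -> Just(a + b)).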

Controlling Parameter Slurping

I'm trying to write a grammar that supports functions calls without using parentheses:
f x, y
As in Haskell, I'd like function calls to minimally slurp up their parameters. That is, I want
g 5 + 3
to mean
(g 5) + 3
instead of
g (5 + 3)
Unfortunately, I'm getting the second parse with this grammar:
grammar Parameters;
expr
: '(' expr ')'
| expr MULTIPLICATIVE_OPERATOR expr
| expr ADDITIVE_OPERATOR expr
| ID (expr (',' expr)*?)??
| INT
;
MULTIPLICATIVE_OPERATOR: [*/%];
ADDITIVE_OPERATOR: '+';
ID: [a-z]+;
INT: '-'? [0-9]+;
WHITESPACE: [ \t\n\r]+ -> skip;
The parse tree I'm getting is this:
I had thought that the subrule listed first would get attempted first. In this case, expr ADDITIVE_OPERATOR expr appears before the ID subrule, so why is the ID subrule taking higher precedence?
In this case ANTLR does not perform the correct rule transformation (to eliminate left recursion and to handle precedences). It generates:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[0] (',' expr_1[0])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5 + (expr_1 3))))
Correct would be:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[5] (',' expr_1[5])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5) + (expr_1 3)))
I am not certain if this is a bug in ANTLR4 or a trade-off of the transformation algorithm. Perhaps one should file an issue in the ANTLR4 issue tracker.
To solve your problem you can simply put the correctly transformed grammar into your code and it should work. The explanation of rule transformation is found in "The Definitive ANTLR4 Reference" on pages 249ff (and perhaps somewhere on the web).
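The difference between the two transformed rules comes down to the precedence passed to the recursive call that parses the function arguments. The following Python sketch imitates the expr_1[p] rule by hand; it is not generated by ANTLR, and the names (Parser, arg_prec, BINARY) are assumptions of mine. It also simplifies two things: integers are unsigned, and the argument list is matched greedily rather than with the non-greedy *? and ?? operators.

```python
import re

# operator -> (precedence compared against p, precedence for the right operand)
BINARY = {"*": (4, 5), "/": (4, 5), "%": (4, 5), "+": (3, 4)}


def tokenize(text):
    return re.findall(r"\d+|[a-z]+|[+*/%(),]", text)


class Parser:
    def __init__(self, text, arg_prec):
        self.tokens = tokenize(text)
        self.pos = 0
        self.arg_prec = arg_prec  # 0 imitates the generated rule, 5 the corrected one

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def starts_operand(self):
        tok = self.peek()
        return tok is not None and (tok == "(" or tok.isdigit() or tok.isalpha())

    def expr(self, p=0):
        tok = self.take()
        if tok == "(":
            left = self.expr(0)
            self.take(")")
        elif tok.isdigit():
            left = int(tok)
        elif tok.isalpha():
            # ID (expr (',' expr)*)?, with arguments parsed at arg_prec.
            args = []
            if self.starts_operand():
                args.append(self.expr(self.arg_prec))
                while self.peek() == ",":
                    self.take()
                    args.append(self.expr(self.arg_prec))
            left = (tok, *args) if args else tok
        else:
            raise SyntaxError(f"unexpected token {tok!r}")
        # ( {4 >= p}? MULTIPLICATIVE_OPERATOR expr[5] | {3 >= p}? ADDITIVE_OPERATOR expr[4] )*
        while self.peek() in BINARY and BINARY[self.peek()][0] >= p:
            op = self.take()
            left = (op, left, self.expr(BINARY[op][1]))
        return left
```

With arg_prec=0 the argument call is allowed to continue into the + operator, so g 5 + 3 comes out as g (5 + 3). With arg_prec=5 neither predicate passes inside the argument, the argument stops after 5, and the + is picked up by the outer loop, giving (g 5) + 3.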

How to get rid of useless nodes from this AST tree?

I have already looked at this question and even though the question titles seem to be the same; it doesn't answer my question, at least not in any way that I can understand.
Parsing Math
Here is what I am parsing:
PI -> 3.14.
Number area(Number radius) -> PI * radius^2.
This is how I want my AST tree to look, minus all the useless root nodes.
how it should look http://vertigrated.com/images/How%20I%20want%20the%20tree%20to%20look.png
Here are what I hope are the relevant fragments of my grammar:
term : '(' expression ')'
| number -> ^(NUMBER number)
| (function_invocation)=> function_invocation
| ATOM
| ID
;
power : term ('^' term)* -> ^(POWER term (term)* ) ;
unary : ('+'! | '-'^)* power ;
multiply : unary ('*' unary)* -> ^(MULTIPLY unary (unary)* ) ;
divide : multiply ('/' multiply)* -> ^(DIVIDE multiply (multiply)* );
modulo : divide ('%' divide)* -> ^(MODULO divide (divide)*) ;
subtract : modulo ('-' modulo)* -> ^(SUBTRACT modulo (modulo)* ) ;
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (and_or relation)*
| string
| container_access
;
and_or : '&' | '|' ;
Precedence
I still want to keep the precedence as illustrated in the following diagrams, but want to eliminate the useless nodes if at all possible.
Source: Number a(x) -> 0 - 1 + 2 * 3 / 4 % 5 ^ 6.
Here are the nodes I want to eliminate:
how I want the precedence tree to look http://vertigrated.com/images/example%202%20desired%20result.png
Basically I want to eliminate any of those nodes that don't directly have a branch under them to binary operations.
You must realize that the two rules:
add : sub ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ) ;
and
add : sub ('+'^ sub)* ;
do not produce the same AST. Given the input 1+2+3, the first rule will produce:
ADD
|
.--+--.
| | |
1 2 3
where the second rule produces:
(+)
|
.--+--.
| |
(+) 3
|
.--+--.
| |
1 2
The latter makes more sense: infix expressions are expected to have 2 child nodes, not more.
Why not simply remove the literals in your parser rules and just do:
add : sub (ADD^ sub)*;
ADD : '+';
Creating the same AST using a rewrite rule would look like this:
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
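The two tree shapes described above can be mimicked outside ANTLR with a few lines of Python. This is only an illustration with made-up names (flat_add, nested_add) and tuples standing in for AST nodes; it assumes the operands have already been parsed into a list.

```python
from functools import reduce


def flat_add(operands):
    """Mimic ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ): one n-ary ADD node,
    or the lone operand passed through unchanged."""
    if len(operands) == 1:
        return operands[0]
    return ("ADD", *operands)


def nested_add(operands):
    """Mimic sub ('+'^ sub)*: every '+' becomes the root of everything parsed
    so far, which is a left fold over the operand list."""
    return reduce(lambda left, right: ("+", left, right), operands)
```

For the operands of 1+2+3, flat_add produces a single ADD node with three children, while nested_add produces a '+' node whose left child is another '+' node, matching the second diagram.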
Your rule (and others like it)
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
produces the useless production when you don't have a sequence of add operations.
I'm not an ANTLR expert, but I'd guess you need two cases, one for an add term
that is unary, and one for a set of children, the first of which generates your
standard tree, and the second of which simply passes the child tree up to the parent,
without creating a new node?
add : subtract ( ('+' subtract)+ -> ^(ADDITION subtract (subtract)*)
| -> subtract ) ;
Similar changes for other rules with sequences of operands to an operator.
To get rid of the irrelevant nodes, just be explicit:
subtract
:
modulo
(
( '-' modulo)+ -> ^(SUBTRACT modulo+) // no need for parenthesis or asterisk
|
() -> modulo
)
;
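If changing every rule is not an option, the same pruning can be done after the fact. Note that this is a different technique from the rewrite rules above: a post-processing pass over an already built tree rather than a change to the grammar. The sketch below is plain Python with hypothetical names (collapse, OPERATOR_LABELS) and tuples standing in for AST nodes, with the label in the first slot.

```python
# Imaginary tokens used as operator roots in the question's grammar.
OPERATOR_LABELS = {"POWER", "MULTIPLY", "DIVIDE", "MODULO", "SUBTRACT", "ADDITION"}


def collapse(node):
    """Replace every operator node that has exactly one child with that child."""
    if not isinstance(node, tuple):
        return node  # a leaf such as "PI" or "2"
    label, *children = node
    children = [collapse(child) for child in children]
    if label in OPERATOR_LABELS and len(children) == 1:
        return children[0]
    # Nodes like ("NUMBER", "2") keep their single child because NUMBER
    # is not an operator label.
    return (label, *children)
```

For example, a tree shaped like the one the original rules build for PI * radius^2, with ADDITION, SUBTRACT, MODULO and DIVIDE each wrapping a single child, collapses to a MULTIPLY node whose children are PI and the POWER node.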
Even though I accepted Bart's answer as correct, I wanted to post my own complete answer with example code that I got working, just for completeness.
Here is what I did based on Bart's answer:
unary : ('+'! | '-'^)? term ;
pow : (unary -> unary) ('^' s=unary -> ^(POWER $pow $s))*;
mod : (pow -> pow) ('%' s=pow -> ^(MODULO $mod $s))*;
mult : (mod -> mod) ('*' s=mod -> ^(MULTIPLY $mult $s))*;
div : (mult -> mult) ('/' s=mult -> ^(DIVIDE $div $s))*;
sub : (div -> div) ('-' s=div -> ^(SUBTRACT $sub $s))*;
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
And here is what the resulting tree looks like:
working answer http://vertigrated.com/images/working_answer.png
There is an alternative solution to just not use the rewrites and promote the symbols themselves to roots, but I want all descriptive labels in my tree if at all possible. I am just being anal about how the tree is represented so that my tree walking code will be as clean as possible!
power : unary ('^'^ unary)* ;
mod : power ('%'^ power)* ;
mult : mod ('*'^ mod)* ;
div : mult ('/'^ mult)* ;
sub : div ('-'^ div)* ;
add : sub ('+'^ sub)* ;
And this looks like this:
without rewrites http://vertigrated.com/images/without_the_rewrites.png

Consider the following BNF Grammar

Consider the following BNF grammar (where non-terminals are enclosed in angle-brackets and <identifier> matches any legal Java variable identifier).
<exp> ::= <exp> + <term>
| <exp> - <term>
| <term>
<term> ::= <term> * <factor>
| <term> / <factor>
| <factor>
<factor> ::= ( <exp> )
| <identifier>
Produce a derivation tree for the following expression:
(x - a) * (y + b)
Starting with exp:
<exp>
replace exp with term:
<term>
replace term with:
<term> * <factor>
replace term with factor:
<factor> * <factor>
replace both factors with (exp):
( <exp> ) * ( <exp> )
replace the first exp with exp - term and the second with exp + term
( <exp> - <term> ) * ( <exp> + <term> )
replace both exp's with term, and then replace all 4 terms with factors.
( <factor> - <factor> ) * ( <factor> + <factor> )
replace all factors with identifiers
( <identifier> - <identifier> ) * ( <identifier> + <identifier> )
Does this suffice?
You need to go one step further - <factor> is a nonterminal, and you should reduce it down to <identifier>.
Additionally, you should be starting from <exp> (and then reducing it to <term>) rather than starting from <term> directly.
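A derivation like this can be checked mechanically by building the derivation tree and reading its leaves back. Here is a minimal sketch in Python; the names (DerivationParser, leaves) and the tuple representation are my own, identifiers are restricted to ASCII for simplicity, and the left-recursive productions are handled with loops that still produce the left-nested <exp> and <term> nodes the grammar describes.

```python
import re

IDENT = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def tokenize(text):
    return re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*|[-+*/()]", text)


class DerivationParser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def exp(self):
        # <exp> ::= <exp> + <term> | <exp> - <term> | <term>
        node = ("exp", self.term())
        while self.peek() in ("+", "-"):
            op = self.take()
            node = ("exp", node, op, self.term())
        return node

    def term(self):
        # <term> ::= <term> * <factor> | <term> / <factor> | <factor>
        node = ("term", self.factor())
        while self.peek() in ("*", "/"):
            op = self.take()
            node = ("term", node, op, self.factor())
        return node

    def factor(self):
        # <factor> ::= ( <exp> ) | <identifier>
        tok = self.take()
        if tok == "(":
            inner = self.exp()
            self.take(")")
            return ("factor", "(", inner, ")")
        if IDENT.fullmatch(tok):
            return ("factor", ("identifier", tok))
        raise SyntaxError(f"unexpected token {tok!r}")


def leaves(node):
    """Read the terminals off the tree from left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:  # node[0] is the nonterminal's name
        result.extend(leaves(child))
    return result
```

For (x - a) * (y + b) the root is an exp node with a single term child, that term node has the shape term * factor, and leaves() returns the original tokens in order, which is the check that the tree really derives the expression.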