I'm trying to parse a Haskell-like language using ANTLR4 and I'm stuck on lambdas. In Haskell, lambdas can be mixed with operators. So, given the operators >>= and + and the lambda syntax '\\' args* '->' expr, the following expression is valid:
a >>= \a -> b >>= \b -> Just(a + b)
and it should be parsed into the following AST:
  >>=
 /   \
a     ->
     /  \
    a    >>=
        /   \
       b     ->
            /  \
           b    Just
                 |
                 +
                / \
               a   b
So I can think of two ways of structuring grammar for this kind of syntax.
The first is to put the lambda expression into the top expression rule, along with ifs and other syntax constructs:
grammar Test;
root
: expr0 EOF
;
expr0
: '\\' ID '->' expr0
| expr1
;
expr1
: expr2 ('>>=' expr2)*
;
expr2
: expr3 ('+' expr3)*
;
expr3
: '(' expr0 ')'
| ID ('(' expr0 ')')?
;
This grammar cannot parse the above expression. Parentheses have to be added around each lambda: a >>= (\a -> b >>= (\b -> Just(a + b))). While I understand why the parens are required, this behaviour is pretty inconvenient.
The second approach would be to put the lambda into the last expression rule, along with literals and nested expressions:
grammar Test;
root
: expr0 EOF
;
expr0
: expr1 ('>>=' expr1)*
;
expr1
: expr2 ('+' expr2)*
;
expr2
: '(' expr0 ')'
| ID ('(' expr0 ')')?
| '\\' ID '->' expr0
;
This grammar accepts my expression; however, it contains an ambiguity, because a >>= \a -> b >>= \b -> Just(a + b) can be parsed either as a >>= \a -> (b >>= \b) -> Just(a + b) or as a >>= \a -> (b >>= \b -> Just(a + b)).
So my question is, how to implement this kind of grammar properly?
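For illustration only, here is a hand-written recursive-descent sketch in Python (not ANTLR) of one common way to get the tree shown above: a lambda is accepted as the right operand of any operator, and its body extends as far to the right as possible. The names tokenize, Parser and parse, and the tuple encoding of the AST, are assumptions made for this sketch.

```python
import re

# Tokens: >>=, ->, identifiers, and the single characters \ ( ) +
TOKEN = re.compile(r"\s*(>>=|->|[A-Za-z_]\w*|[\\()+])")
LEVELS = [">>=", "+"]  # binary operators, lowest precedence first


def tokenize(src):
    src, pos, out = src.rstrip(), 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, tok):
        if self.advance() != tok:
            raise SyntaxError(f"expected {tok!r}")

    def expr(self):
        return self.lam() if self.peek() == "\\" else self.binary(0)

    def lam(self):
        # '\' ID '->' expr : the body is a full expression, so it swallows
        # everything up to a closing paren or the end of input.
        self.expect("\\")
        param = self.advance()
        self.expect("->")
        return ("->", param, self.expr())

    def binary(self, level):
        if level == len(LEVELS):
            return self.atom()
        left = self.binary(level + 1)
        while self.peek() == LEVELS[level]:
            op = self.advance()
            # A lambda may appear only as the right operand of an operator.
            right = self.lam() if self.peek() == "\\" else self.binary(level + 1)
            left = (op, left, right)
        return left

    def atom(self):
        tok = self.advance()
        if tok == "(":
            inner = self.expr()
            self.expect(")")
            return inner
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"unexpected token {tok!r}")
        if self.peek() == "(":  # ID '(' expr ')' is a call such as Just(...)
            self.advance()
            arg = self.expr()
            self.expect(")")
            return (tok, arg)
        return tok


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With this sketch the unparenthesized and the fully parenthesized forms of the expression produce the same nested tuple, because the lambda body is always parsed greedily.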
I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)
I'm not sure. On the ReplaceAll page(*), in the Possible Issues paragraph, it is said that "Periods bind to numbers more strongly than to slash", so that /.123 will always be interpreted as a division by the number .123. It then says that, to avoid this issue, a space must be inserted in the input between the /. operator and the number if you want it to be understood as a replacement.
So there is only one possible parse tree (otherwise how could the Wolfram parser decide how to interpret the statement?).
The ANTLR4 lexer and parser are greedy: the lexer (parser) tries to read as many input characters (tokens) as it can while matching a rule. With your OP rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is OP : '/' | '/.' ;). This means there is no ambiguity, and there is no chance of the input being interpreted as OP=/ and NUMBER=.123.
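The longest-match behaviour described here can be imitated with a small Python sketch. It is only a model of the idea, not ANTLR's implementation; RULES and tokenize are names invented for the example. Every alternative of every rule is tried at the current position, the longest match wins, and ties go to the rule listed first.

```python
import re

# Rules from the question's grammar. The OP alternatives are deliberately
# listed shortest first to show that their order does not matter.
RULES = [
    ("ID", [r"[a-z]+"]),
    ("NUMBER", [r"\.?[0-9]+"]),
    ("OP", [r"/", r"/\."]),
    ("WS", [r"[ \t\r\n]+"]),
]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, alternatives in RULES:
            for alt in alternatives:
                m = re.compile(alt).match(text, pos)
                # Strict ">" keeps the earlier rule when lengths are equal.
                if m and len(m.group()) > best_len:
                    best_name, best_len = name, len(m.group())
        if best_name is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        if best_name != "WS":  # mimic "-> skip"
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

In this model x/.123 always yields OP=/. followed by NUMBER=123, while x/ .123 (with a space) yields OP=/ followed by NUMBER=.123.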
Given my small experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
@init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[#0,0:0='x',<ID>,1:0]
[#1,1:1='/',<'/'>,1:1]
[#2,2:5='.123',<NUMBER>,1:2]
[#3,7:7='x',<ID>,2:0]
[#4,8:8='/',<'/'>,2:1]
[#5,9:9='.',<'.'>,2:2]
[#6,10:10='x',<ID>,2:3]
[#7,12:13='->',<'->'>,2:5]
[#8,15:15='1',<NUMBER>,2:8]
[#9,17:17='{',<'{'>,3:0]
...
[#29,47:47='}',<'}'>,4:5]
[#30,48:48='/',<'/'>,4:6]
[#31,49:50='.0',<NUMBER>,4:7]
...
[#40,67:67='}',<'}'>,5:5]
[#41,68:68='/',<'/'>,5:6]
[#42,69:69='.',<'.'>,5:7]
[#43,71:71='0',<NUMBER>,5:9]
...
[#48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous until the slash. Then the parser has two choices: / NUMBER in the division rule or / . replacement in the replace_all rule. Because the lexer has already absorbed .123 into a single NUMBER token, only the division rule can match, so there is no more ambiguity.
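To make the effect of splitting /. into two tokens concrete, here is a rough Python model of it. It is a sketch under assumptions: RULES, tokenize and classify are hypothetical names, and classify only looks at the token that follows the slash instead of running the full parser rules.

```python
import re

# Token rules loosely mirroring Question.g4: '/' and '.' are separate literals.
RULES = [
    ("ID", re.compile(r"[a-z]+")),
    ("NUMBER", re.compile(r"\.?[0-9]+")),
    ("ARROW", re.compile(r"->")),
    ("PUNCT", re.compile(r'[/.{},"]')),
    ("WS", re.compile(r"[ \t\r\n]+")),
]


def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = None
        for name, rx in RULES:
            m = rx.match(text, pos)
            # Longest match wins; the earlier rule wins a tie.
            if m and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        name, m = best
        if name != "WS":
            tokens.append((name, m.group()))
        pos = m.end()
    return tokens


def classify(statement):
    """Return 'division' or 'replace_all' based on what follows the slash."""
    toks = tokenize(statement)
    for i, tok in enumerate(toks):
        if tok == ("PUNCT", "/") and i + 1 < len(toks):
            if toks[i + 1][0] == "NUMBER":
                return "division"
            if toks[i + 1] == ("PUNCT", "."):
                return "replace_all"
    return "unknown"
```

The classification agrees with the run above: a digit directly after the period makes the lexer produce a NUMBER, so the statement is a division, while a space or an identifier after the period leaves a standalone '.' token.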
(*) the link was yesterday in a comment that has disappeared, i.e. Wolfram Language & System, ReplaceAll
When defining a grammar, say a grammar to evaluate an arithmetic expression: we divide the Expression to Terms and Factors, like so:
E ::= E + T
| T
T ::= T * F
| F
F ::= num
| (E)
Then we need to resolve left recursion.
So why not define the grammar like so:
E ::= T + E
| T
T ::= F * T
| F
F ::= num
| (E)
And have only right recursion.
The problem is that it gets the associativity wrong -- a left-recursive grammar is left associative while a right-recursive grammar is right associative. Since associativity doesn't matter for + or * you don't see a problem, but if you add an operator (such as -) for which associativity DOES matter, you see the problem.
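A quick way to see the difference is to fold the same operand list from the left and from the right. The helpers fold_left and fold_right below are illustrative names, not part of any parser library.

```python
from functools import reduce
import operator


def fold_left(op, operands):
    """((a op b) op c): what a left-recursive grammar gives you."""
    return reduce(op, operands)


def fold_right(op, operands):
    """(a op (b op c)): what a right-recursive grammar gives you."""
    *init, last = operands
    return reduce(lambda acc, x: op(x, acc), reversed(init), last)


print(fold_left(operator.sub, [10, 4, 3]))   # (10 - 4) - 3
print(fold_right(operator.sub, [10, 4, 3]))  # 10 - (4 - 3)
```

For addition both folds agree, which is why the problem is invisible in the grammar from the question; for subtraction they give 3 and 9 respectively.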
Note that the way that you deal with left recursion in an LL grammar is essentially by converting to right recursion and then post-processing the parse tree to turn it back into left recursion. Breaking it down, you convert to
E ::= T + E | T
which you then left-factor into
E ::= T E'
E' ::= \epsilon | + E
this will parse the expression T + T + T as
  E
 / \
T   E'
   / \
  +   E
     / \
    T   E'
       / \
      +   E
         / \
        T   E'
            |
            \epsilon
which you then evaluate by treating it as a linked list of alternating terms and operators which you evaluate/perform top to bottom (left to right):
tmp1 = eval_term(pop list head)
while (list not empty)
    op = pop list head
    tmp2 = eval_term(pop list head)
    tmp1 = tmp1 op tmp2
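The same procedure as a runnable Python sketch. It assumes terms are already numbers and supports only + and -; parse_E, parse_E_prime, flatten and evaluate are names chosen for the example. The parser is right-recursive, yet the evaluation loop restores left associativity.

```python
import operator
from collections import deque

OPS = {"+": operator.add, "-": operator.sub}


def parse_E(tokens, i=0):
    """E ::= T E'   Returns ((term, rest), next_index)."""
    term = tokens[i]
    rest, i = parse_E_prime(tokens, i + 1)
    return (term, rest), i


def parse_E_prime(tokens, i):
    """E' ::= epsilon | op E   Returns (None or (op, E-tree), next_index)."""
    if i < len(tokens) and tokens[i] in OPS:
        subtree, j = parse_E(tokens, i + 1)
        return (tokens[i], subtree), j
    return None, i


def flatten(tree):
    """Walk the right-leaning tree top to bottom into term, op, term, ..."""
    items = deque()
    while True:
        term, rest = tree
        items.append(term)
        if rest is None:
            return items
        op, tree = rest
        items.append(op)


def evaluate(items):
    acc = items.popleft()
    while items:
        op = items.popleft()
        rhs = items.popleft()
        acc = OPS[op](acc, rhs)  # left to right, hence left associative
    return acc


tree, _ = parse_E([10, "-", 4, "-", 3])
print(evaluate(flatten(tree)))
```

For the input 10 - 4 - 3 the parse tree nests to the right, but the loop computes (10 - 4) - 3.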
In the specific example you show, order doesn't matter, so you can swap operands.
But that is not the case for every grammar: reordering symbols may change the meaning, so in general you need to find another way to eliminate left recursion.
Write a “Statement” grammar that meets the following requirements:
skip is a valid statement
Assignment of the form x := E is a valid statement, where x is an identifier and E is an
arithmetic expression
The composition of two statements S0 ; S1 is a valid statement
I have the following solution, but am not sure if it is correct:
x:: E|skip|s0 E|s1 E
S:
SKIP
| ID ':=' E
| S ';' S
;
There must be another rule for E; SKIP and ID are lexical tokens.
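As a sanity check of the three requirements, here is a small recursive-descent sketch in Python. Two assumptions are made: composition is parsed iteratively as S (';' S)* to avoid the left recursion in S ';' S, and E is taken to be the usual +, -, *, / arithmetic over numbers, identifiers and parentheses. The names StatementParser and parse are invented for the example.

```python
import re

TOKEN = re.compile(r"\s*(:=|[A-Za-z_]\w*|\d+|[-+*/();])")


def tokenize(src):
    src, pos, out = src.rstrip(), 0, []
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


class StatementParser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expect(self, tok):
        if self.advance() != tok:
            raise SyntaxError(f"expected {tok!r}")

    def program(self):
        stmts = [self.statement()]
        while self.peek() == ";":  # S ';' S, written as a loop
            self.advance()
            stmts.append(self.statement())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return stmts

    def statement(self):
        tok = self.advance()
        if tok == "skip":
            return ("skip",)
        if tok and (tok[0].isalpha() or tok[0] == "_"):
            self.expect(":=")
            return ("assign", tok, self.expr())
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr(self):  # additive level
        left = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            left = (op, left, self.term())
        return left

    def term(self):  # multiplicative level
        left = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            left = (op, left, self.factor())
        return left

    def factor(self):
        tok = self.advance()
        if tok == "(":
            inner = self.expr()
            self.expect(")")
            return inner
        if tok is not None and (tok.isdigit() or tok[0].isalpha() or tok[0] == "_"):
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(src):
    return StatementParser(tokenize(src)).program()
```

For example, parse("skip; x := 1 + 2 * y") returns a list with a skip statement followed by an assignment whose right-hand side respects the usual precedence.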
How about this? I'm not sure about what would be considered a "valid" arithmetic expression and what would be considered valid identifiers but how about something like this?
S :: 'skip'
S :: IDENTIFIER ':=' E
S :: S ';' S
A1 :: '+' | '-'
A2 :: '*' | '/'
NBR :: '1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'|'0'
O :: NBR /* remove this if arithm. expression only on identifiers */
O :: IDENTIFIER
O :: '(' E ')'
F :: O
F :: F A2 O
E :: F
E :: E A1 F
I have already looked at this question, and even though the question titles seem to be the same, it doesn't answer my question, at least not in any way that I can understand.
Parsing Math
Here is what I am parsing:
PI -> 3.14.
Number area(Number radius) -> PI * radius^2.
This is how I want my AST tree to look, minus all the useless root nodes.
how it should look http://vertigrated.com/images/How%20I%20want%20the%20tree%20to%20look.png
Here are what I hope are the relevant fragments of my grammar:
term : '(' expression ')'
| number -> ^(NUMBER number)
| (function_invocation)=> function_invocation
| ATOM
| ID
;
power : term ('^' term)* -> ^(POWER term (term)* ) ;
unary : ('+'! | '-'^)* power ;
multiply : unary ('*' unary)* -> ^(MULTIPLY unary (unary)* ) ;
divide : multiply ('/' multiply)* -> ^(DIVIDE multiply (multiply)* );
modulo : divide ('%' divide)* -> ^(MODULO divide (divide)*) ;
subtract : modulo ('-' modulo)* -> ^(SUBTRACT modulo (modulo)* ) ;
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
relation : add (('=' | '!=' | '<' | '<=' | '>=' | '>') add)* ;
expression : relation (and_or relation)*
| string
| container_access
;
and_or : '&' | '|' ;
Precedence
I still want to keep the precedence as illustrated in the following diagrams, but want to eliminate the useless nodes if at all possible.
Source: Number a(x) -> 0 - 1 + 2 * 3 / 4 % 5 ^ 6.
Here are the nodes I want to eliminate:
how I want the precedence tree to look http://vertigrated.com/images/example%202%20desired%20result.png
Basically, I want to eliminate any of those nodes that don't directly have a branch under them to a binary operation.
You must realize that the two rules:
add : sub ( ('+' sub)+ -> ^(ADD sub (sub)*) | -> sub ) ;
and
add : sub ('+'^ sub)* ;
do not produce the same AST. Given the input 1+2+3, the first rule will produce:
ADD
|
.--+--.
| | |
1 2 3
where the second rule produces:
(+)
|
.--+--.
| |
(+) 3
|
.--+--.
| |
1 2
The latter makes more sense: infix expressions are expected to have 2 child nodes, not more.
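The two shapes can be reproduced with plain tuples in Python. This is only a model of the resulting trees, not of ANTLR's tree construction; flat_tree and nested_tree are names made up for the example.

```python
def flat_tree(label, operands):
    """One n-ary node, like ^(ADD sub sub*)."""
    return (label, *operands)


def nested_tree(op, operands):
    """Left-nested binary nodes, like sub ('+'^ sub)*."""
    tree = operands[0]
    for operand in operands[1:]:
        tree = (op, tree, operand)
    return tree


print(flat_tree("ADD", [1, 2, 3]))
print(nested_tree("+", [1, 2, 3]))
```

Note that with a single operand nested_tree returns the operand itself, so no wrapper node is created when no operator is present.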
Why not simply remove the literals in your parser rules and just do:
add : sub (ADD^ sub)*;
ADD : '+';
Creating the same AST using a rewrite rule would look like this:
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
Your rule (and others like it)
add : subtract ('+' subtract)* -> ^(ADDITION subtract (subtract)*) ;
produces the useless production when you don't have a sequence of add operations.
I'm not an ANTLR expert, but I'd guess you need two cases, one for an add term
that is unary, and one for a set of children, the first of which generates your
standard tree, and the second of which simply passes the child tree up to the parent,
without creating a new node?
add : subtract ( ('+' subtract)+ -> ^(ADDITION subtract (subtract)*)
| -> subtract ) ;
Similar changes for other rules with sequences of operands to an operator.
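The idea behind the two alternatives, stripped of ANTLR syntax, is a node constructor that passes a lone child straight through. A minimal Python sketch, where build is a hypothetical helper and tuples stand in for tree nodes:

```python
def build(label, operands):
    """Create a labelled node only when an operator was actually matched."""
    if len(operands) == 1:
        return operands[0]      # like "-> subtract": pass the child up
    return (label, *operands)   # like "-> ^(ADDITION subtract subtract*)"


# A lone operand travels up through every precedence level unchanged.
print(build("ADDITION", [build("SUBTRACT", [build("MODULO", [5])])]))
print(build("ADDITION", [1, build("SUBTRACT", [2])]))
```

Applied at every level of the precedence chain, this leaves only the nodes that correspond to operators present in the input.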
To get rid of the irrelevant nodes, just be explicit:
subtract
:
modulo
(
( '-' modulo)+ -> ^(SUBTRACT modulo+) // no need for parenthesis or asterisk
|
() -> modulo
)
;
Even though I accepted Bart's answer as correct, I wanted to post my own complete answer, with the example code that I got working, for completeness.
Here is what I did based on Bart's answer:
unary : ('+'! | '-'^)? term ;
pow : (unary -> unary) ('^' s=unary -> ^(POWER $pow $s))*;
mod : (pow -> pow) ('%' s=pow -> ^(MODULO $mod $s))*;
mult : (mod -> mod) ('*' s=mod -> ^(MULTIPLY $mult $s))*;
div : (mult -> mult) ('/' s=mult -> ^(DIVIDE $div $s))*;
sub : (div -> div) ('-' s=div -> ^(SUBTRACT $sub $s))*;
add : (sub -> sub) ('+' s=sub -> ^(ADD $add $s))*;
And here is what the resulting tree looks like:
working answer http://vertigrated.com/images/working_answer.png
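The rule chain above can be modelled in Python as a table-driven parser in which each level folds its operands to the left into a labelled binary node, much like the $add reference does in the rewrite. This is a sketch only: it handles whitespace-separated integer tokens, omits unary operators and parentheses, and LEVELS and parse are names chosen for the example.

```python
# Lowest precedence first, in the same order as the rule chain:
# add -> sub -> div -> mult -> mod -> pow
LEVELS = [
    ("+", "ADD"),
    ("-", "SUBTRACT"),
    ("/", "DIVIDE"),
    ("*", "MULTIPLY"),
    ("%", "MODULO"),
    ("^", "POWER"),
]


def parse(tokens):
    pos = 0

    def level(n):
        nonlocal pos
        if n == len(LEVELS):  # innermost level: a plain number
            tok = tokens[pos]
            pos += 1
            return int(tok)
        symbol, label = LEVELS[n]
        node = level(n + 1)
        while pos < len(tokens) and tokens[pos] == symbol:
            pos += 1
            # The previous result becomes the left child of the new node.
            node = (label, node, level(n + 1))
        return node

    tree = level(0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("0 - 1 + 2 * 3 / 4 % 5 ^ 6".split()))
```

A level that sees no operator returns its operand untouched, which is why no useless single-child nodes appear in the result.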
There is an alternative solution to just not use the rewrites and promote the symbols themselves to roots, but I want all descriptive labels in my tree if at all possible. I am just being anal about how the tree is represented so that my tree walking code will be as clean as possible!
power : unary ('^'^ unary)* ;
mod : power ('%'^ power)* ;
mult : mod ('*'^ mod)* ;
div : mult ('/'^ mult)* ;
sub : div ('-'^ div)* ;
add : sub ('+'^ sub)* ;
And this looks like this:
without rewrites http://vertigrated.com/images/without_the_rewrites.png