Parsing Haskell-like lambda in ANTLR

I'm trying to parse a Haskell-like language using ANTLR4 and I'm stuck on lambdas. In Haskell, lambdas can be mixed with operators. So, given the operators >>= and +, and the lambda syntax '\\' args* '->' expr, the following expression is valid:
a >>= \a -> b >>= \b -> Just(a + b)
and it should be parsed into the following AST:
  >>=
 /   \
a     ->
     /  \
    a    >>=
        /   \
       b     ->
            /  \
           b    Just
                 |
                 +
                / \
               a   b
I can think of two ways to structure a grammar for this kind of syntax.
The first is to put the lambda expression into the top expression rule, along with ifs and other syntax constructs:
grammar Test;
root
: expr0 EOF
;
expr0
: '\\' ID '->' expr0
| expr1
;
expr1
: expr2 ('>>=' expr2)*
;
expr2
: expr3 ('+' expr3)*
;
expr3
: '(' expr0 ')'
| ID ('(' expr0 ')')?
;
This grammar cannot parse the above expression. It requires parens around each lambda: a >>= (\a -> b >>= (\b -> Just(a + b))). While I understand why the parens are required, this behaviour is pretty inconvenient.
The second approach would be to put the lambda into the last expression rule, along with literals and nested expressions:
grammar Test;
root
: expr0 EOF
;
expr0
: expr1 ('>>=' expr1)*
;
expr1
: expr2 ('+' expr2)*
;
expr2
: '(' expr0 ')'
| ID ('(' expr0 ')')?
| '\\' ID '->' expr0
;
This grammar accepts my expression, but it is ambiguous: a >>= \a -> b >>= \b -> Just(a + b) can be parsed either as (a >>= \a -> b) >>= \b -> Just(a + b) or as a >>= \a -> (b >>= \b -> Just(a + b)).
So my question is: how do I implement this kind of grammar properly?
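For comparison, here is a minimal hand-written recursive-descent sketch in Python (not ANTLR, and not a drop-in answer). It assumes the rule Haskell itself uses, namely that a lambda body extends as far to the right as possible, and it assumes >>= and + are left-associative. The names tokenize, Parser and parse are made up for this illustration.

```python
import re

TOKEN = re.compile(r"\s*(>>=|->|[\\+()]|[A-Za-z_]\w*)")

def tokenize(src):
    """Split source text into operator, punctuation and identifier tokens."""
    src, tokens, pos = src.rstrip(), [], 0
    while pos < len(src):
        m = TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, got {self.peek()!r}")
        self.pos += 1

    def ident(self):
        tok = self.peek()
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise SyntaxError(f"expected identifier, got {tok!r}")
        self.pos += 1
        return tok

    def expr(self):                      # expr : add ('>>=' add)* ;
        left = self.add()
        while self.peek() == ">>=":
            self.eat(">>=")
            left = (">>=", left, self.add())
        return left

    def add(self):                       # add : atom ('+' atom)* ;
        left = self.atom()
        while self.peek() == "+":
            self.eat("+")
            left = ("+", left, self.atom())
        return left

    def atom(self):
        if self.peek() == "\\":          # lambda: the body swallows the rest
            self.eat("\\")
            param = self.ident()
            self.eat("->")
            return ("\\", param, self.expr())
        if self.peek() == "(":
            self.eat("(")
            inner = self.expr()
            self.eat(")")
            return inner
        name = self.ident()
        if self.peek() == "(":           # call with one argument, as in Just(...)
            self.eat("(")
            arg = self.expr()
            self.eat(")")
            return ("call", name, arg)
        return name

def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

if __name__ == "__main__":
    print(parse(r"a >>= \a -> b >>= \b -> Just(a + b)"))
```

The lambda sits among the atoms, as in the second grammar, but because the body is parsed greedily with the top-level rule there is only one possible result: the nested tree drawn above.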

Related

ANTLR operator precedence broken by optional right recursion?

I'm confused by the behavior of this grammar (in ANTLR 4.8):
grammar Bug;
stat: expr ';' ;
expr: expr '*' expr?
| expr '+' expr
| '(' expr ')'
| INT
| ID
;
ID : [a-zA-Z]+ ;
INT : [0-9]+ ;
WS : [ \t\n\r]+ -> skip ;
That's a minimal modification of an example from the book; all I've done is add a ? to the first alternative for expr, so that * can be either a postfix unary operator or a binary operator.
To my surprise that seems to break the logic around binary operator precedence:
without the ?, 3*4+5; parses as (stat (expr (expr (expr 3) * (expr 4)) + (expr 5)) ;) (as expected)
with the ?, 3*4+5; parses as (stat (expr (expr 3) * (expr (expr 4) + (expr 5))) ;) (wat?)
Is this a bug, or is this behavior expected? How do I get the behavior I was hoping for?
I'm not sure whether this is expected behavior. However, a more readable grammar like this one seems to do what you expect (and preserves precedence):
expr: expr '*' expr #MulExpr
| expr '+' expr #AddExpr
| expr '*' #PointerExpr
| '(' expr ')' #NestedExpr
| INT #IntExpr
| ID #IdExpr
;
The #...Expr after each alternative are called labels.
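To make the intended behavior concrete, here is a small precedence-climbing sketch in Python. It is a hand-written illustration, not what ANTLR generates. It assumes that * is a binary operator when the next token can start an operand and a postfix operator otherwise; that lookahead rule, and the names tokenize, parse and starts_operand, are assumptions of this sketch.

```python
import re

PREC = {"+": 1, "*": 2}   # higher number binds tighter

def tokenize(src):
    # Anything that is not a number, a name, an operator or a paren is dropped.
    return re.findall(r"\d+|[A-Za-z]+|[*+()]", src)

def starts_operand(tok):
    return tok is not None and (tok == "(" or tok[0].isalnum())

def parse(src):
    tokens, pos = tokenize(src), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def primary():
        tok = advance()
        if tok == "(":
            inner = expr(1)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if not tok[0].isalnum():
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

    def expr(min_prec):
        left = primary()
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = advance()
            if op == "*" and not starts_operand(peek()):
                left = ("post*", left)          # postfix use of '*'
                continue
            right = expr(PREC[op] + 1)          # left-associative binary operator
            left = (op, left, right)
        return left

    tree = expr(1)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

if __name__ == "__main__":
    print(parse("3*4+5"))
    print(parse("3*+5"))
```

With this scheme 3*4+5 keeps the usual grouping (3*4)+5, and a * that is not followed by an operand is treated as postfix.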

Inline comments and empty line in antlr4 grammar

Can anyone please explain what I need to change in this grammar to support inline comments (such as // some text) and empty lines (which contain any number of space characters)? I wrote the following grammar, but it doesn't work.
program: line* EOF ;
line: (expression | assignment) (NEWLINE | EOF);
assignment : VARIABLE '=' expression ;
expression : '(' expression ')' #parenthesisExpression
| '-' expression #unaryExpression
| left=expression OP1 right=expression #firstPriorityExpression
| left=expression OP2 right=expression #secondPriorityExpression
| number=NUMBER #numericExpression
| variable=VARIABLE #variableExpression
;
NUMBER : [0-9]+ ;
VARIABLE : [a-zA-Z][a-zA-Z0-9]* ;
OP1 : '*' | '/' ;
OP2 : '+' | '-' ;
NEWLINE : '\r'? '\n' ;
WHITESPACE : [ \t\r]+ -> skip ;
COMMENT : '//' ~[\n\r]* -> skip ;
Because you used - as a literal token in a parser rule and also made OP2 match this character, OP2 will never match a -. You need a lexer rule that matches only the single minus sign (as I showed earlier):
op1
: MUL
| DIV
;
op2
: ADD
| MIN
;
...
MUL : '*' ;
DIV : '/' ;
ADD : '+' ;
MIN : '-' ;
and then use MIN in your unary alternative:
...
| MIN expression #unaryExpression
...
When you have a separate MIN : '-' ; rule, you could do this:
...
| '-' expression #unaryExpression
...
because now ANTLR "knows" you mean the rule that matches a single -. ANTLR does not "know" this when you have a lexer rule that matches either a - or a +, like your OP2 rule:
OP2 : '+' | '-' ;
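The effect can be imitated with a toy lexer in Python. This is only a sketch of the tie-breaking rule (longest match wins, and on a tie the rule defined first wins), not ANTLR's implementation. It assumes that the literal '-' becomes an implicit token defined ahead of the explicit lexer rules; the name T__0 for that token and the helper lex are illustrative.

```python
import re

def lex(rules, text):
    """Tokenize text: the longest match wins, ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = re.compile(pattern).match(text, pos)
            # The strict '>' keeps the earlier rule when two matches tie.
            if m and m.end() - pos > best_len:
                best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        if best_name != "WS":                     # whitespace is skipped
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

# Literal '-' in a parser rule plus OP2 : '+' | '-' ;
IMPLICIT_MINUS = [
    ("T__0", r"-"),            # assumed implicit token for the literal '-'
    ("NUMBER", r"[0-9]+"),
    ("OP1", r"[*/]"),
    ("OP2", r"[+-]"),
    ("WS", r"[ \t\r]+"),
]

# One lexer rule per operator character.
SEPARATE_TOKENS = [
    ("NUMBER", r"[0-9]+"),
    ("MUL", r"\*"),
    ("DIV", r"/"),
    ("ADD", r"\+"),
    ("MIN", r"-"),
    ("WS", r"[ \t\r]+"),
]

if __name__ == "__main__":
    print(lex(IMPLICIT_MINUS, "1 - 2 + 3"))
    print(lex(SEPARATE_TOKENS, "1 - 2 + 3"))
```

In the first rule set the - never comes out as OP2, so an alternative such as expression OP2 expression cannot match a subtraction. In the second, every - is a MIN token that both the unary and the binary alternative can refer to.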

how to resolve an ambiguity

I have a grammar:
grammar Test;
s : ID OP (NUMBER | ID);
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
OP : '/.' | '/' ;
WS : [ \t\r\n]+ -> skip ;
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123). With the grammar above I get the first variant.
Is there a way to get both parse trees? Is there a way to control how it is parsed? Say, if there is a number after the /. then I emit the / otherwise I emit /. in the tree.
I am new to ANTLR.
An expression like x/.123 can either be parsed as (s x /. 123), or as (s x / .123)
I'm not sure. In the Possible Issues section of the ReplaceAll page (*), it is said that "Periods bind to numbers more strongly than to slash", so /.123 will always be interpreted as division by the number .123. It goes on to say that, to avoid this, a space must be inserted between the /. operator and the number if you want it to be understood as a replacement.
So there is only one possible parse tree (otherwise, how could the Wolfram parser decide how to interpret the statement?).
The ANTLR4 lexer and parser are greedy: the lexer (parser) tries to consume as many input characters (tokens) as it can while matching a rule. With your rule OP : '/.' | '/' ; the lexer will always match the input /. to the /. alternative (even if the rule is written OP : '/' | '/.' ;). So there is no ambiguity, and the input can never be interpreted as OP=/ and NUMBER=.123.
Given my limited experience with ANTLR, I have found no other solution than to split the ReplaceAll operator into two tokens.
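The longest-match behavior can be reproduced with a toy lexer in Python. This is a sketch of the matching rule only, not of ANTLR itself. Each rule is given as a list of alternatives so that their order can be swapped, and the token names SLASH and DOT in the split variant are made up for the illustration.

```python
import re

def lex(rules, text):
    """Longest match over all alternatives; ties go to the earliest rule."""
    tokens, pos = [], 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, alternatives in rules:
            for alt in alternatives:
                m = re.compile(alt).match(text, pos)
                if m and m.end() - pos > best_len:
                    best_name, best_len = name, m.end() - pos
        if best_name is None:
            raise SyntaxError(f"cannot tokenize at offset {pos}")
        if best_name != "WS":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

COMMON = [
    ("ID", [r"[a-z]+"]),
    ("NUMBER", [r"\.?[0-9]+"]),
    ("WS", [r"[ \t\r\n]+"]),
]

# OP : '/.' | '/' ;  and the same rule with the alternatives swapped
ORIGINAL = COMMON + [("OP", [r"/\.", r"/"])]
SWAPPED = COMMON + [("OP", [r"/", r"/\."])]

# '/' and '.' as two separate tokens
SPLIT = [("SLASH", [r"/"]), ("DOT", [r"\."])] + COMMON

if __name__ == "__main__":
    for rules in (ORIGINAL, SWAPPED, SPLIT):
        print(lex(rules, "x/.123"))
```

With a single OP rule the two characters /. are always taken together, whatever the order of the alternatives. Once the operator is split, the lexer takes / alone and the longer match .123 wins over a lone dot, while x/.x still yields a separate dot token.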
Grammar Question.g4 :
grammar Question;
/* Parse Wolfram ReplaceAll. */
question
@init {System.out.println("Question last update 0851");}
: s+ EOF
;
s : division
| replace_all
;
division
: expr '/' NUMBER
{System.out.println("found division " + $expr.text + " by " + $NUMBER.text);}
;
replace_all
: expr '/' '.' replacement
{System.out.println("found ReplaceAll " + $expr.text + " with " + $replacement.text);}
;
expr
: ID
| '"' ID '"'
| NUMBER
| '{' expr ( ',' expr )* '}'
;
replacement
: expr '->' expr
| '{' replacement ( ',' replacement )* '}'
;
ID : [a-z]+ ;
NUMBER : '.'? [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
Input file t.text :
x/.123
x/.x -> 1
{x, y}/.{x -> 1, y -> 2}
{0, 1}/.0 -> "zero"
{0, 1}/. 0 -> "zero"
Execution :
$ export CLASSPATH=".:/usr/local/lib/antlr-4.6-complete.jar"
$ alias a4='java -jar /usr/local/lib/antlr-4.6-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Q*.java
$ grun Question question -tokens -diagnostics t.text
[@0,0:0='x',<ID>,1:0]
[@1,1:1='/',<'/'>,1:1]
[@2,2:5='.123',<NUMBER>,1:2]
[@3,7:7='x',<ID>,2:0]
[@4,8:8='/',<'/'>,2:1]
[@5,9:9='.',<'.'>,2:2]
[@6,10:10='x',<ID>,2:3]
[@7,12:13='->',<'->'>,2:5]
[@8,15:15='1',<NUMBER>,2:8]
[@9,17:17='{',<'{'>,3:0]
...
[@29,47:47='}',<'}'>,4:5]
[@30,48:48='/',<'/'>,4:6]
[@31,49:50='.0',<NUMBER>,4:7]
...
[@40,67:67='}',<'}'>,5:5]
[@41,68:68='/',<'/'>,5:6]
[@42,69:69='.',<'.'>,5:7]
[@43,71:71='0',<NUMBER>,5:9]
...
[@48,83:82='<EOF>',<EOF>,6:0]
Question last update 0851
found division x by .123
found ReplaceAll x with x->1
found ReplaceAll {x,y} with {x->1,y->2}
found division {0,1} by .0
line 4:10 extraneous input '->' expecting {<EOF>, '"', '{', ID, NUMBER}
found ReplaceAll {0,1} with 0->"zero"
The input x/.123 is ambiguous up to the slash. After that the parser has two choices: / NUMBER in the division rule or / . replacement in the replace_all rule. I think that NUMBER absorbs the .123, so there is no ambiguity left.
(*) The link was in a comment that has since disappeared; it pointed to the ReplaceAll page of the Wolfram Language & System documentation.

Controlling Parameter Slurping

I'm trying to write a grammar that supports function calls without using parentheses:
f x, y
As in Haskell, I'd like function calls to minimally slurp up their parameters. That is, I want
g 5 + 3
to mean
(g 5) + 3
instead of
g (5 + 3)
Unfortunately, I'm getting the second parse with this grammar:
grammar Parameters;
expr
: '(' expr ')'
| expr MULTIPLICATIVE_OPERATOR expr
| expr ADDITIVE_OPERATOR expr
| ID (expr (',' expr)*?)??
| INT
;
MULTIPLICATIVE_OPERATOR: [*/%];
ADDITIVE_OPERATOR: '+';
ID: [a-z]+;
INT: '-'? [0-9]+;
WHITESPACE: [ \t\n\r]+ -> skip;
The parse tree I'm getting is this:
I had thought that the subrule listed first would get attempted first. In this case, expr ADDITIVE_OPERATOR expr appears before the ID subrule, so why is the ID subrule taking higher precedence?
In this case ANTLR does not perform the correct rule transformation (the one that eliminates left recursion and handles precedence). It produces:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[0] (',' expr_1[0])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5 + (expr_1 3))))
correct would be:
expr
: expr_1[0]
;
expr_1[int p]
: ('(' expr_1[0] ')' | INT | ID (expr_1[5] (',' expr_1[5])*?)??)
( {4 >= $p}? MULTIPLICATIVE_OPERATOR expr_1[5]
| {3 >= $p}? ADDITIVE_OPERATOR expr_1[4]
)*
;
leading to (expr (expr_1 a (expr_1 5) + (expr_1 3)))
I am not certain whether this is a bug in ANTLR4 or a trade-off of the transformation algorithm. Perhaps one should file an issue with the ANTLR4 project.
To solve your problem, you can simply put the correctly transformed grammar into your code and it should work. The rule transformation is explained in "The Definitive ANTLR4 Reference" on pages 249ff (and perhaps somewhere on the web).
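The difference between the two transformed rules can be tried out with a small Python sketch of the same precedence-climbing scheme. It is hand-written, not ANTLR output. The parameter arg_prec stands for the precedence passed to expr_1 when parsing call arguments (0 in the generated rule, 5 in the corrected one). The sketch takes arguments greedily whenever the next token can start an expression, which is an assumption, and all function names are illustrative.

```python
import re

# Precedences as in the transformed rule: {4 >= $p}? for * / %, {3 >= $p}? for +
PREC = {"*": 4, "/": 4, "%": 4, "+": 3}

def tokenize(src):
    return re.findall(r"-?\d+|[a-z]+|[*/%+(),]", src)

def parse(src, arg_prec):
    tokens, pos = tokenize(src), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def starts_expr(tok):
        return tok is not None and (tok == "(" or tok[0].isalnum() or tok[0] == "-")

    def primary():
        tok = advance()
        if tok == "(":
            inner = expr(0)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return inner
        if tok[0].isalpha():                      # ID with optional arguments
            args = []
            if starts_expr(peek()):
                args.append(expr(arg_prec))
                while peek() == ",":
                    advance()
                    args.append(expr(arg_prec))
            return ("call", tok, args) if args else tok
        if tok.lstrip("-").isdigit():             # INT
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    def expr(p):                                  # corresponds to expr_1[int p]
        left = primary()
        while peek() in PREC and PREC[peek()] >= p:
            op = advance()
            left = (op, left, expr(PREC[op] + 1))
        return left

    tree = expr(0)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

if __name__ == "__main__":
    print(parse("g 5 + 3", arg_prec=0))   # argument swallows "5 + 3"
    print(parse("g 5 + 3", arg_prec=5))   # argument stops before "+"
```

Parsing arguments at precedence 5 means no binary operator can be absorbed into an argument, which is exactly the (g 5) + 3 grouping asked for.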

fatal error in grammar - piecewise definition

I am translating a grammar from LALR to ANTLR and I am having trouble with translating this one rule, piecewise expression.
Attached is the sample grammar:
grammar Test;
options {
language = Java;
output = AST;
}
parse : expression ';'
;
expression : binaryExpression
| piecesExpression
;
binaryExpression : addingExpression (('=='|'!='|'<='|'>='|'>'|'<') addingExpression)*
;
addingExpression : multiplyingExpression (('+'|'-') multiplyingExpression)*
;
multiplyingExpression : unaryExpression
(('*'|'/') unaryExpression)*
;
unaryExpression: ('!'|'-')* primitiveElement;
primitiveElement : literalExpression
| id
| '(' expression ')'
;
literalExpression : INT
;
id : IDENTIFIER
;
piecesExpression : 'piecewise' '{' piece expression '}' ('(' expression ',' expression ')')? expression?
;
piece : expression '->' expression ';' (expression '->' expression ';')*
;
// L E X I C A L R U L E S
INT : DIGITS ;
IDENTIFIER : LETTER (LETTER | DIGIT)*;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_') ;
fragment DIGITS: DIGIT+;
fragment DIGIT : '0'..'9';
ANTLR v3.5 is complaining about the piecesExpression rule. It reports 2 fatal errors, and I would rather not use the backtrack option.
Expected results:
piecewise {t -> s; t -> x; 100}
piecewise {t -> s; t -> x; 100} (0, x+1)
piecewise {t -> s; t -> x; 100} (0, x+1) y+5
How can piecesExpression be written so that it matches the inputs above?
Thanks in advance!
ANTLR has problems determining which alternative to take in (at least) 2 cases:
1. piece starts with an expression, but the content of piecewise{...} must also end with an expression
2. piecesExpression ends with '(' expression ... but also has an optional trailing expression (and a primitiveElement, in turn, also matches '(' expression ...)
There's no need to use global backtracking, but unless you rewrite many rules, you do need to add some syntactic predicates (the (...)=> in the example below) to fix the two issues outlined above.
Try this:
piecesExpression
: 'piecewise' '{' ((expression '->')=> piece)+ expression '}'
( ('(' expression ',')=> '(' expression ',' expression ')' expression?
| expression
)
;
piece
: expression '->' expression ';'
;
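What the first predicate, (expression '->')=>, does can be sketched in plain Python as "remember the position, try to parse, rewind". This is a hand-written illustration under simplifying assumptions: expressions are reduced to sums of names and numbers, the optional parenthesized part after the closing brace is left out, and the names Parser, predicts_piece and parse_piecewise are invented for the sketch.

```python
import re

def tokenize(src):
    return re.findall(r"->|[A-Za-z_]\w*|\d+|[{};+]", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expression(self):                 # simplified: term ('+' term)*
        node = self.term()
        while self.peek() == "+":
            self.eat("+")
            node = ("+", node, self.term())
        return node

    def term(self):
        tok = self.eat()
        if not (tok[0].isalnum() or tok[0] == "_"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

    def predicts_piece(self):
        """Speculatively match expression '->', then rewind in every case."""
        mark = self.pos
        try:
            self.expression()
            return self.peek() == "->"
        except SyntaxError:
            return False
        finally:
            self.pos = mark

    def pieces_expression(self):
        self.eat("piecewise")
        self.eat("{")
        pieces = []
        while self.predicts_piece():      # ((expression '->')=> piece)+
            cond = self.expression()
            self.eat("->")
            value = self.expression()
            self.eat(";")
            pieces.append((cond, value))
        if not pieces:
            raise SyntaxError("at least one piece is required")
        default = self.expression()       # the trailing expression before '}'
        self.eat("}")
        return ("piecewise", pieces, default)

def parse_piecewise(src):
    return Parser(tokenize(src)).pieces_expression()

if __name__ == "__main__":
    print(parse_piecewise("piecewise {t -> s; t -> x; 100}"))
```

The speculation costs a second pass over each condition, but only at this one decision point, which is the advantage of a local predicate over the global backtrack option.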