How to handle the precedence of assignment operator in a PHP parser? - antlr

I wrote a PHP5 parser in ANTLR 3.4. It is almost ready, but I cannot handle one of PHP's tricky features: the precedence of the assignment operator. As the PHP manual says, assignment is almost at the end of the precedence list; only and, xor, or and , come after it.
But there is a note on that manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser; I can handle it as a special case in the assignment rule.
But there are more complex cases, e.g.:
if ($a && $b = func()) {}
My parser fails here, because it first recognizes $a && $b and cannot deal with the rest of the condition. This is because && has higher precedence than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
then the parser recognizes the structure correctly.
The operators are built the way the ANTLR book recommends: the base expressions come first, and each level of operators follows the previous one.
Is there any way to handle this precedence jumping?

Don't look at it as an assignment statement, but as an assignment expression. Put this assignment expression "below" the unary expressions, so that it binds more tightly (has a higher precedence) than the unary operators:
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:
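To see why this layering works, here is a rough hand-written sketch in Java. It is not the ANTLR-generated parser: it is a small recursive-descent parser that assumes the same rule layering as the grammar above (or_expr, and_expr, unary_expr, assign_expr, atom) and prints each expression as an S-expression. The class name, method names and the regex tokenizer are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AssignPrecedenceSketch {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;

    AssignPrecedenceSketch(String src) {
        // Variables, identifiers, numbers and the few operators used in the grammar.
        Matcher m = Pattern.compile("\\$[a-z]+|[a-z]+|[0-9]+|&&|\\|\\||[!\\-=(),]").matcher(src);
        while (m.find()) tokens.add(m.group());
    }

    private String peek() { return pos < tokens.size() ? tokens.get(pos) : ""; }
    private String next() { return tokens.get(pos++); }

    // or_expr : and_expr ('||' and_expr)*
    String orExpr() {
        String left = andExpr();
        while (peek().equals("||")) { next(); left = "(|| " + left + " " + andExpr() + ")"; }
        return left;
    }

    // and_expr : unary_expr ('&&' unary_expr)*
    String andExpr() {
        String left = unaryExpr();
        while (peek().equals("&&")) { next(); left = "(&& " + left + " " + unaryExpr() + ")"; }
        return left;
    }

    // unary_expr : ('!' | '-') assign_expr | assign_expr
    String unaryExpr() {
        if (peek().equals("!") || peek().equals("-")) {
            String op = next();
            return "(" + op + " " + assignExpr() + ")";
        }
        return assignExpr();
    }

    // assign_expr : Var ('=' atom)* | atom
    // Because this rule sits below and_expr, "$b = func()" is consumed before '&&' is applied.
    String assignExpr() {
        if (peek().startsWith("$")) {
            String left = next();
            while (peek().equals("=")) { next(); left = "(= " + left + " " + atom() + ")"; }
            return left;
        }
        return atom();
    }

    // atom : Num | Id '(' (expr (',' expr)*)? ')'
    String atom() {
        String t = next();
        if (Character.isDigit(t.charAt(0))) return t;
        next(); // consume '('
        StringBuilder call = new StringBuilder(t).append('(');
        while (!peek().equals(")")) {
            if (peek().equals(",")) { next(); call.append(", "); }
            else call.append(orExpr());
        }
        next(); // consume ')'
        return call.append(')').toString();
    }

    static String parse(String src) { return new AssignPrecedenceSketch(src).orExpr(); }

    public static void main(String[] args) {
        System.out.println(parse("!$a = foo()"));       // (! (= $a foo()))
        System.out.println(parse("$a && $b = func()")); // (&& $a (= $b func()))
        System.out.println(parse("$a = baz() && $b"));  // (&& (= $a baz()) $b)
    }
}
```

Note the third case: with this layering, an assignment inside a condition only swallows a single atom, so $a = baz() && $b groups as ($a = baz()) && $b. The statement-level assignment rule, which takes a full expr on the right, is not modelled in this sketch.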

Related

How do I convert this Antlr3 AST to Antlr4?

I'm trying to convert my existing Antlr3 project to Antlr4 to get more functionality. I have this grammar that wouldn't compile with Antlr4.9
expr
: term ( OR^ term )* ;
and
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
Mostly because Antlr4 doesn't support ^ and ! anymore. From the documentation it seems like those are
AST root operator. When generating abstract syntax trees (ASTs), token
references suffixed with the "^" root operator force AST nodes to be
created and added as the root of the current tree. This symbol is only
effective when the buildAST option is set. More information about ASTs
is also available.
AST exclude operator. When generating abstract syntax trees, token
references suffixed with the "!" exclude operator are not included in
the AST constructed for that rule. Rule references can also be
suffixed with the exclude operator, which implies that, while the tree
for the referenced rule is constructed, it is not linked into the tree
for the referencing rule. This symbol is only effective when the
buildAST option is set. More information about ASTs is also available.
If I take those out, the grammar compiles, but I'm not sure what they mean or how ANTLR4 supports them.
LPAREN and RPAREN are tokens:
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
ANTLR4 kindly explains in its error messages how to convert these, but not ^ and !. The grammar is for parsing boolean expressions, for example (a=b AND b=c)
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
The v3 grammar:
...
tokens {
EQUALS = '=';
LPAREN = '(';
RPAREN = ')';
}
...
expr
: term ( OR^ term )* ;
factor
: ava | NOT^ factor | (LPAREN! expr RPAREN!) ;
in v4 would look like this:
...
expr
: term ( OR term )* ;
factor
: ava | NOT factor | (LPAREN expr RPAREN) ;
EQUALS : '=';
LPAREN : '(';
RPAREN : ')';
So, just remove the inline ^ and ! operators (tree rewriting is no longer available in ANTLR4), and move the literal tokens from the tokens { ... } section into their own lexer rules.
I think this is the rule
targetingexpr returns [boolean value]
: expr { $value = $expr.value; } ;
expr returns [boolean value]
: ^(NOT a=expr) { $value = !a; }
| ^(AND a=expr b=expr) { $value = a && b; }
| ^(OR a=expr b=expr) { $value = a || b; }
| ^(EQUALS A=ALPHANUM B=ALPHANUM) { $value = targetingContext.contains($A.text,$B.text); }
;
What you posted there is part of a tree grammar, for which there is no equivalent in ANTLR4. In ANTLR4 you'd use a visitor to evaluate your expressions instead of embedding the evaluation in a tree grammar.
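To give an idea of what the visitor approach looks like, here is a minimal, self-contained sketch in plain Java. It does not use the ANTLR runtime: the node classes (Not, And, Or, Equals) are hand-written stand-ins for the context classes ANTLR4 would generate, and the Set of "key=value" strings is a hypothetical replacement for targetingContext. With a real grammar you would run the tool with the -visitor option and extend the generated base visitor instead.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class BoolVisitorSketch {
    // Hand-written stand-ins for parse-tree nodes; ANTLR4 would generate context classes.
    interface Expr { <T> T accept(Visitor<T> v); }

    interface Visitor<T> {
        T visitNot(Not e);
        T visitAnd(And e);
        T visitOr(Or e);
        T visitEquals(Equals e);
    }

    static final class Not implements Expr {
        final Expr a;
        Not(Expr a) { this.a = a; }
        public <T> T accept(Visitor<T> v) { return v.visitNot(this); }
    }

    static final class And implements Expr {
        final Expr a, b;
        And(Expr a, Expr b) { this.a = a; this.b = b; }
        public <T> T accept(Visitor<T> v) { return v.visitAnd(this); }
    }

    static final class Or implements Expr {
        final Expr a, b;
        Or(Expr a, Expr b) { this.a = a; this.b = b; }
        public <T> T accept(Visitor<T> v) { return v.visitOr(this); }
    }

    static final class Equals implements Expr {
        final String key, value;
        Equals(String key, String value) { this.key = key; this.value = value; }
        public <T> T accept(Visitor<T> v) { return v.visitEquals(this); }
    }

    // The evaluation that used to live in the v3 tree grammar actions now lives here.
    static final class Evaluator implements Visitor<Boolean> {
        private final Set<String> context; // hypothetical stand-in for targetingContext
        Evaluator(Set<String> context) { this.context = context; }
        public Boolean visitNot(Not e) { return !e.a.accept(this); }
        public Boolean visitAnd(And e) { return e.a.accept(this) && e.b.accept(this); }
        public Boolean visitOr(Or e) { return e.a.accept(this) || e.b.accept(this); }
        public Boolean visitEquals(Equals e) { return context.contains(e.key + "=" + e.value); }
    }

    static boolean eval(Expr e, Set<String> context) {
        return e.accept(new Evaluator(context));
    }

    public static void main(String[] args) {
        Set<String> ctx = new HashSet<>(Arrays.asList("a=b", "b=c"));
        Expr e = new And(new Equals("a", "b"), new Equals("b", "c")); // (a=b AND b=c)
        System.out.println(eval(e, ctx));          // true
        System.out.println(eval(new Not(e), ctx)); // false
    }
}
```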

Parse Nested Block Structure using ANTLR

I have this program
{
run_and_branch(Test1)
then
{
}
else
{
}
{
run_and_branch(Test2)
then
{
}
else
{
run(Test3);
run(Test4);
run(Test5);
}
}
run_and_branch(Test6)
then
{
}
else
{
}
run(Test7);
{
run(Test8);
run(Test9);
run(Test_10);
}
}
Below is my ANTLR grammar file:
prog
: block EOF;
block
: START_BLOCK END_BLOCK -> BLOCK|
START_BLOCK block* END_BLOCK -> block*|
test=run_statement b=block* -> ^($test $b*)|
test2=run_branch_statement THEN pass=block ELSE fail=block -> ^($test2 ^(PASS $pass) ^(FAIL $fail))
;
run_branch_statement
: RUN_AND_BRANCH OPEN_BRACKET ID CLOSE_BRACKET -> ID;
run_statement
: RUN OPEN_BRACKET ID CLOSE_BRACKET SEMICOLON -> ID;
THEN : 'then';
ELSE : 'else';
RUN_AND_BRANCH : 'run_and_branch';
RUN : 'run';
START_BLOCK
: '{' ;
END_BLOCK
: '}' ;
OPEN_BRACKET
: '(';
CLOSE_BRACKET
: ')';
SEMICOLON
: ';'
;
ID : ('a'..'z'|'A'..'Z'|'_'|'0'..'9') (':'|'%'|'='|'\''|'a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'.'|'+'|'*'|'/'|'\\')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Using ANTLRWorks I get the following AST:
As you can see, the AST contains no link between Test1 and Test2 as a dependency. I want the AST to show this information so that I can traverse it and get the test dependency structure.
I am expecting the AST to look something like this:
ANTLR doesn't work this way. ANTLR produces a tree, not a graph, so there is no way to represent the desired output at the grammar level. In addition, if you tried to write tail-recursive rules to link control flow this way you would quickly run into stack overflow exceptions since ANTLR produces recursive-descent parsers.
You need to take the AST produced by ANTLR and perform separate control flow analysis on it to get a control flow graph.
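As a rough illustration of such a separate pass, here is a self-contained Java sketch. It assumes a hand-made Node class rather than ANTLR's tree type, and the edge semantics are my assumption, since the intended graph is not shown: each test is linked to whichever test can run next, and an empty then/else branch falls through directly from the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ControlFlowSketch {
    // Hand-made tree node; in practice this shape would be read off the ANTLR AST.
    static final class Node {
        final String test;     // test name, or null for a plain { ... } block
        final List<Node> pass; // "then" branch, or the body of a plain block
        final List<Node> fail; // "else" branch
        Node(String test, List<Node> pass, List<Node> fail) {
            this.test = test; this.pass = pass; this.fail = fail;
        }
    }

    static Node run(String test) {
        return new Node(test, Collections.<Node>emptyList(), Collections.<Node>emptyList());
    }

    static Node branch(String test, List<Node> pass, List<Node> fail) {
        return new Node(test, pass, fail);
    }

    static Node block(Node... body) {
        return new Node(null, Arrays.asList(body), Collections.<Node>emptyList());
    }

    // Links a sequence of statements; returns the tests control can leave the sequence from.
    static List<String> link(List<Node> seq, List<String> preds, List<String> edges) {
        for (Node n : seq) preds = linkNode(n, preds, edges);
        return preds;
    }

    static List<String> linkNode(Node n, List<String> preds, List<String> edges) {
        if (n.test == null) return link(n.pass, preds, edges); // plain block: just a sequence
        for (String p : preds) edges.add(p + " -> " + n.test);
        List<String> self = Collections.singletonList(n.test);
        // An empty branch exits directly from the test itself.
        List<String> exits = new ArrayList<>(link(n.pass, self, edges));
        for (String e : link(n.fail, self, edges)) if (!exits.contains(e)) exits.add(e);
        return exits;
    }

    static List<String> edges(Node root) {
        List<String> edges = new ArrayList<>();
        linkNode(root, Collections.<String>emptyList(), edges);
        return edges;
    }

    public static void main(String[] args) {
        List<Node> none = Collections.emptyList();
        Node program = block(
            branch("Test1", none, none),
            block(branch("Test2", none, Arrays.asList(run("Test3"), run("Test4")))),
            run("Test7"));
        // [Test1 -> Test2, Test2 -> Test3, Test3 -> Test4, Test2 -> Test7, Test4 -> Test7]
        System.out.println(edges(program));
    }
}
```

The result is a list of edges, i.e. a graph, which is exactly what a single tree cannot express: Test7 has two predecessors.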

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:
someVariable, somVariable.someMember, functionCall(param).someMember, foo.bar.baz(bjork).buffalo().xyzzy
This is the ANTLR grammar which I've come up with so far, and the access_operation rule throws the error
The following sets of rules are mutually left-recursive [access_operation, expression]:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
LHS;
RHS;
CALL;
PARAMS;
}
start
: body? EOF
;
body
: expression (',' expression)*
;
expression
: function -> ^(CALL)
| access_operation
| atom
;
access_operation
: (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
;
function
: (IDENT '(' body? ')') -> ^(IDENT PARAMS?)
;
atom
: IDENT
| NUMBER
;
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
IDENT : (LETTER)+ ;
NUMBER : (DIGIT)+ ;
SPACE : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };
What I could manage so far was to refactor the access_operation rule to '.' expression, which generates an AST where the access_operation node only contains the right side of the operation.
What I'm looking for instead is something like this:
How can the left-recursion problem be solved in this case?
By "wrong AST" I'll make a semi-educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child, which in turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you; I'd keep the AST as flat as possible:
      foo
     / | \
    /  |  \
 bar  baz  ...
That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.
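As a small sketch of that left-to-right walk, assuming (purely for illustration) that objects are modelled as nested Map instances and that the flat AST has already been reduced to a root name plus a list of member names:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FlatAccessSketch {
    // Look up the root, then walk the remaining member names from left to right.
    static Object resolve(Map<String, Object> scope, String root, List<String> members) {
        Object current = scope.get(root);
        for (String member : members) {
            if (!(current instanceof Map)) {
                throw new IllegalStateException("cannot access '" + member + "' on a non-object");
            }
            current = ((Map<?, ?>) current).get(member);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> bar = new HashMap<>();
        bar.put("baz", 42);
        Map<String, Object> foo = new HashMap<>();
        foo.put("bar", bar);
        Map<String, Object> scope = new HashMap<>();
        scope.put("foo", foo);

        // foo.bar.baz
        System.out.println(resolve(scope, "foo", Arrays.asList("bar", "baz"))); // 42
    }
}
```

With a nested AST the same lookup would need recursion over the tree; with the flat form it is a single loop. Calls are left out of this sketch.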
A quick demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS;
CALL;
PARAMS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (IDENT -> IDENT) ( tail -> ^(IDENT tail)
| call tail? -> ^(CALL IDENT call tail?)
)?
;
tail
: (access)+
;
access
: ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
;
call
: '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
;
IDENT : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " +
"foo.bar.baz(bjork).buffalo().xyzzy";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of Main corresponds to the following AST:
EDIT
And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS_OP;
CALL;
PARAMS;
LHS;
RHS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (ID -> ID) ( ('(' params ')' -> ^(CALL ID params))
('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
| '.' expression -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
)?
;
params
: (expression (',' expression)*)? -> ^(PARAMS expression*)
;
ID : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which creates the following AST if you run the Main class:
The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:
which means atom can be any of the following 5 alternatives (with their corresponding ASTs):
+----------------------+---------------------------------------------------------+
| alternative          | generated AST                                           |
+----------------------+---------------------------------------------------------+
| NUMBER               | NUMBER                                                  |
| ID                   | ID                                                      |
| ID params            | ^(CALL ID params)                                       |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)) |
| ID expression        | ^(ACCESS_OP ^(LHS ID) ^(RHS expression))                |
+----------------------+---------------------------------------------------------+

why is this grammar an error 208?

I don't understand why the following grammar leads to error 208, complaining that IF will never be matched:
error(208): test.g:11:1: The following token definitions can never be matched because prior tokens match the same input: IF
ANTLRWorks 1.4.3
ANTLR 3.4
grammar test;
@lexer::members {
private boolean rawAhead() {
}
}
parse : IF*;
RAW : ({rawAhead()}?=> . )+;
IF : 'if';
ID : ('A'..'Z'|'a'..'z')+;
Removing either the RAW rule or the ID rule solves the error...
From my point of view, IF does have the possibility to be matched when rawAhead() returns false.
Bood wrote:
I think it actually matters, say if we have an and just an 'if' outside of the mmode, e.g. <#/>if<#/>, then the if here will be matched with IF, not RAW it should be (same length, match the first), right?
Yeah, you're right, good point. Giving it some more thought, that is the expected behavior AFAIK. But it seems things work a bit differently: the RAW rule gets precedence over the ID and IF rules, even when placed at the end of the lexer grammar, as you can see:
freemarker_simple.g
grammar freemarker_simple;
@lexer::members {
private boolean mmode = false;
private boolean rawAhead() {
if(mmode) return false;
int ch1 = input.LA(1), ch2 = input.LA(2), ch3 = input.LA(3);
return !(
(ch1 == '<' && ch2 == '#') ||
(ch1 == '<' && ch2 == '/' && ch3 == '#') ||
(ch1 == '$' && ch2 == '{')
);
}
}
parse
: (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF
;
OUTPUT_START : '${' {mmode=true;};
TAG_START : '<#' {mmode=true;};
TAG_END_START : '</' ('#' {mmode=true;} | ~'#' {$type=RAW;});
OUTPUT_END : '}' {mmode=false;};
TAG_END : '>' {mmode=false;};
EQUALS : '==';
IF : 'if';
STRING : '"' ~'"'* '"';
ID : ('a'..'z' | 'A'..'Z')+;
SPACE : (' ' | '\t' | '\r' | '\n')+ {skip();};
RAW : ({rawAhead()}?=> . )+;
Main.java
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
freemarker_simpleLexer lexer = new freemarker_simpleLexer(new ANTLRStringStream("<#/if>if<#if>foo<#if>"));
freemarker_simpleParser parser = new freemarker_simpleParser(new CommonTokenStream(lexer));
parser.parse();
}
}
will print the following to the console:
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'if'
TAG_START '<#'
IF 'if'
TAG_END '>'
RAW 'foo'
TAG_START '<#'
IF 'if'
TAG_END '>'
As you can see, the 'if' and 'foo' are tokenized as RAW in the input:
<#/if>if<#if>foo<#if>
      ^^     ^^^

ANTLR expression interpreter

I have created the following grammar. I would like some ideas on how to build an interpreter in Java that returns a tree which I can later use for printing on the screen. I'm a bit stuck on how to start.
grammar myDSL;
options {
language = Java;
}
@header {
package DSL;
}
@lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This is for the whitespace between tokens; it is hidden from the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0' should most probably be '0'..'9'.
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no commas at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
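The difference between the two sub rules can be checked quickly with regular expressions as an analogy. This is only a sketch: the patterns below are my translation of the two sub rules, with whitespace matched explicitly because a regex has no hidden channel like the WS rule.

```java
import java.util.regex.Pattern;

class CommaListSketch {
    // Analogue of (INTEGER ','?)* : every comma is optional.
    static final Pattern LOOSE = Pattern.compile("(\\d+\\s*,?\\s*)*");

    // Analogue of (INTEGER (',' INTEGER)*)? : integers must be separated by commas.
    static final Pattern STRICT = Pattern.compile("(\\d+(\\s*,\\s*\\d+)*)?");

    static boolean loose(String s) { return LOOSE.matcher(s).matches(); }
    static boolean strict(String s) { return STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(loose("1 2 3 4"));  // true: no commas needed
        System.out.println(strict("1 2 3 4")); // false
        System.out.println(loose("1,2,"));     // true: trailing comma accepted
        System.out.println(strict("1,2,"));    // false
        System.out.println(strict("1, 2, 3")); // true
    }
}
```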
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;
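To picture the tree that either variant builds, here is a tiny stand-alone Java sketch. The Tree class is hand-rolled for illustration and is not ANTLR's CommonTree; it only mimics the LISP-style (root child child) rendering used to describe such trees.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RewriteTreeSketch {
    static final class Tree {
        final String text;
        final List<Tree> children = new ArrayList<>();
        Tree(String text) { this.text = text; }
        Tree add(Tree child) { children.add(child); return this; }

        // A leaf prints as its text, a parent as (root child child ...).
        String toStringTree() {
            if (children.isEmpty()) return text;
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Tree c : children) sb.append(' ').append(c.toStringTree());
            return sb.append(')').toString();
        }
    }

    // Mimics ^(IDENT components*): IDENT becomes the root, '={' and '}' are dropped.
    static Tree program(String ident, List<Tree> components) {
        Tree root = new Tree(ident);
        for (Tree c : components) root.add(c);
        return root;
    }

    public static void main(String[] args) {
        Tree t = program("scene", Arrays.asList(new Tree("a"), new Tree("b")));
        System.out.println(t.toStringTree()); // (scene a b)
    }
}
```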