Parse Nested Block Structure using ANTLR - antlr

I have this program
{
run_and_branch(Test1)
then
{
}
else
{
}
{
run_and_branch(Test2)
then
{
}
else
{
run(Test3);
run(Test4);
run(Test5);
}
}
run_and_branch(Test6)
then
{
}
else
{
}
run(Test7);
{
run(Test8);
run(Test9);
run(Test_10);
}
}
Below is my ANLTR Grammar File
prog
: block EOF;
block
: START_BLOCK END_BLOCK -> BLOCK|
START_BLOCK block* END_BLOCK -> block*|
test=run_statement b=block* -> ^($test $b*)|
test2=run_branch_statement THEN pass=block ELSE fail=block -> ^($test2 ^(PASS $pass) ^(FAIL $fail))
;
run_branch_statement
: RUN_AND_BRANCH OPEN_BRACKET ID CLOSE_BRACKET -> ID;
run_statement
: RUN OPEN_BRACKET ID CLOSE_BRACKET SEMICOLON -> ID;
THEN : 'then';
ELSE : 'else';
RUN_AND_BRANCH : 'run_and_branch';
RUN : 'run';
START_BLOCK
: '{' ;
END_BLOCK
: '}' ;
OPEN_BRACKET
: '(';
CLOSE_BRACKET
: ')';
SEMICOLON
: ';'
;
ID : ('a'..'z'|'A'..'Z'|'_'|'0'..'9') (':'|'%'|'='|'\''|'a'..'z'|'A'..'Z'|'0'..'9'|'_'|'-'|'.'|'+'|'*'|'/'|'\\')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Using ANTLWorks I get the following AST:
As you can see in the AST there is no link between the Test1 and Test2 as depedency. I want to have the AST show this information so that I can traverse the AST and get the Test depedency Structure
I am expecting the AST look something like this

ANTLR doesn't work this way. ANTLR produces a tree, not a graph, so there is no way to represent the desired output at the grammar level. In addition, if you tried to write tail-recursive rules to link control flow this way you would quickly run into stack overflow exceptions since ANTLR produces recursive-descent parsers.
You need to take the AST produced by ANTLR and perform separate control flow analysis on it to get a control flow graph.

Related

How to handle the precedence of assignment operator in a PHP parser?

I wrote a PHP5 parser in ANTLR 3.4, which is almost ready, but I can not handle one of the tricky feature of PHP. My problem is with the precedence of assignment operator. As the PHP manual says the precedence of assignment is almost at the end of the list. Only and, xor, or and , are after it in the list.
But there is a note on this the manual page which says:
Although = has a lower precedence than most other operators, PHP will
still allow expressions similar to the following: if (!$a = foo()), in
which case the return value of foo() is put into $a.
The small example in the note isn't a problem for my parser, I can handle this as a special case in the assigment rule.
But there are more complex codes eg:
if ($a && $b = func()) {}
My parser fails here, because it recognizes first $a && $b and can not deal with the rest of the conditioin. This is because the && has higher precedence, than =.
If I put brackets around the right side of &&:
if ($a && ($b = func())) {}
In this way the parser recognizes the structure well.
The operators are built in the way that the ANTLR book recommends: there are the base exressions at the first step and each level of operators are coming after each other.
Is there any way to handle this precedence jumping?
Don't look at it as an assignment, but let's name it an assignment expression. Put this assignment expression "below" the unary expressions (so they have a higher precedence than the unary ones):
grammar T;
options {
output=AST;
}
tokens {
BLOCK;
FUNC_CALL;
EXPR_LIST;
}
parse
: stat* EOF!
;
stat
: assignment ';'!
| if_stat
;
assignment
: Var '='^ expr
;
if_stat
: If '(' expr ')' block -> ^(If expr block)
;
block
: '{' stat* '}' -> ^(BLOCK stat*)
;
expr
: or_expr
;
or_expr
: and_expr ('||'^ and_expr)*
;
and_expr
: unary_expr ('&&'^ unary_expr)*
;
unary_expr
: '!'^ assign_expr
| '-'^ assign_expr
| assign_expr
;
assign_expr
: Var ('='^ atom)*
| atom
;
atom
: Num
| func_call
;
func_call
: Id '(' expr_list ')' -> ^(FUNC_CALL Id expr_list)
;
expr_list
: (expr (',' expr)*)? -> ^(EXPR_LIST expr*)
;
If : 'if';
Num : '0'..'9'+;
Var : '$' Id;
Id : ('a'..'z')+;
Space : (' ' | '\t' | '\r' | '\n')+ {skip();};
If you'd now parse the source:
if (!$a = foo()) { $a = 1 && 2; }
if ($a && $b = func()) { $b = 2 && 3; }
if ($a = baz() && $b) { $c = 3 && 4; }
the following AST would get constructed:

Solving antlr left recursion

I'm trying to parse a language using ANTLR which can contain the following syntax:
someVariable, somVariable.someMember, functionCall(param).someMember, foo.bar.baz(bjork).buffalo().xyzzy
This is the ANTLR grammar which i've come up with so far, and the access_operation throws the error
The following sets of rules are mutually left-recursive [access_operation, expression]:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
LHS;
RHS;
CALL;
PARAMS;
}
start
: body? EOF
;
body
: expression (',' expression)*
;
expression
: function -> ^(CALL)
| access_operation
| atom
;
access_operation
: (expression -> ^(LHS)) '.'! (expression -> ^(RHS))
;
function
: (IDENT '(' body? ')') -> ^(IDENT PARAMS?)
;
atom
: IDENT
| NUMBER
;
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
IDENT : (LETTER)+ ;
NUMBER : (DIGIT)+ ;
SPACE : (' ' | '\t' | '\r' | '\n') { $channel=HIDDEN; };
What i could manage so far was to refactor the access_operation rule to '.' expression which generates an AST where the access_operation node only contains the right side of the operation.
What i'm looking for instead is something like this:
How can the left-recursion problem solved in this case?
By "wrong AST" I'll make a semi educated guess that, for input like "foo.bar.baz", you get an AST where foo is the root with bar as a child who in its turn has baz as a child, which is a leaf in the AST. You may want to have this reversed. But I'd not go for such an AST if I were you: I'd keep the AST as flat as possible:
foo
/ | \
/ | \
bar baz ...
That way, evaluating is far easier: you simply look up foo, and then walk from left to right through its children.
A quick demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS;
CALL;
PARAMS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (IDENT -> IDENT) ( tail -> ^(IDENT tail)
| call tail? -> ^(CALL IDENT call tail?)
)?
;
tail
: (access)+
;
access
: ('.' IDENT -> ^(ACCESS IDENT)) (call -> ^(CALL IDENT call))?
;
call
: '(' (expression (',' expression)*)? ')' -> ^(PARAMS expression*)
;
IDENT : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which can be tested with:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "someVariable, somVariable.someMember, functionCall(param).someMember, " +
"foo.bar.baz(bjork).buffalo().xyzzy";
TestLexer lexer = new TestLexer(new ANTLRStringStream(src));
TestParser parser = new TestParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
The output of Main corresponds to the following AST:
EDIT
And since you indicated your ultimate goal is not evaluating the input, but that you rather need to conform the structure of the AST to some 3rd party API, here's a grammar that will create an AST like you indicated in your edited question:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BODY;
ACCESS_OP;
CALL;
PARAMS;
LHS;
RHS;
}
start
: body EOF -> body
;
body
: expression (',' expression)* -> ^(BODY expression+)
;
expression
: atom
;
atom
: NUMBER
| (ID -> ID) ( ('(' params ')' -> ^(CALL ID params))
('.' expression -> ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression)))?
| '.' expression -> ^(ACCESS_OP ^(LHS ID) ^(RHS expression))
)?
;
params
: (expression (',' expression)*)? -> ^(PARAMS expression*)
;
ID : LETTER+;
NUMBER : DIGIT+;
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
fragment LETTER : ('a'..'z' | 'A'..'Z');
fragment DIGIT : '0'..'9';
which creates the following AST if you run the Main class:
The atom rule may be a bit daunting, but you can't shorten it much since the left ID needs to be available to most of the alternatives. ANTLRWorks helps in visualizing the alternative paths this rule may take:
which means atom can be any of the 5 following alternatives (with their corresponding AST's):
+----------------------+--------------------------------------------------------+
| alternative | generated AST |
+----------------------+--------------------------------------------------------+
| NUMBER | NUMBER |
| ID | ID |
| ID params | ^(CALL ID params) |
| ID params expression | ^(ACCESS_OP ^(LHS ^(CALL ID params)) ^(RHS expression))|
| ID expression | ^(ACCESS_OP ^(LHS ID) ^(RHS expression) |
+----------------------+--------------------------------------------------------+

Switching lexer state in antlr3 grammar

I'm trying to construct an antlr grammar to parse a templating language. that language can be embedded in any text and the boundaries are marked with opening/closing tags: {{ / }}. So a valid template looks like this:
foo {{ someVariable }} bar
Where foo and bar should be ignored, and the part inside the {{ and }} tags should be parsed. I've found this question which basically has an answer for the problem, except that the tags are only one { and }. I've tried to modify the grammar to match 2 opening/closing characters, but as soon as i do this, the BUFFER rule consumes ALL characters, also the opening and closing brackets. The LD rule is never being invoked.
Has anyone an idea why the antlr lexer is consuming all tokens in the Buffer rule when the delimiters have 2 characters, but does not consume the delimiters when they have only one character?
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: (tag | BUFFER )*
;
tag
: LD IDENT^ RD
;
LD #after {
// flip lexer the state
insideTag=true;
System.err.println("FLIPPING TAG");
} : '{{';
RD #after {
// flip the state back
insideTag=false;
} : '}}';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER)*;
BUFFER : { !insideTag }?=> ~(LD | RD)+;
fragment LETTER : ('a'..'z' | 'A'..'Z');
You can match any character once or more until you see {{ ahead by including a predicate inside the parenthesis ( ... )+ (see the BUFFER rule in the demo).
A demo:
grammar Test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: tag EOF
;
tag
: LD IDENT^ RD
;
LD
#after {insideTag=true;}
: '{{'
;
RD
#after {insideTag=false;}
: '}}'
;
BUFFER
: ({!insideTag && !(input.LA(1)=='{' && input.LA(2)=='{')}?=> .)+ {$channel=HIDDEN;}
;
SPACE
: (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;}
;
IDENT
: ('a'..'z' | 'A'..'Z')+
;
Note that it's best to keep the BUFFER rule as the first lexer rule in your grammar: that way, it will be the first token that is tried.
If you now parse "foo {{ someVariable }} bar", the following AST is created:
Wouldn't a grammar like this fit your needs? I don't see why the BUFFER needs to be that complicated.
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean inTag=false;
}
start
: tag* EOF
;
tag
: LD IDENT RD -> IDENT
;
LD
#after { inTag=true; }
: '{{'
;
RD
#after { inTag=false; }
: '}}'
;
IDENT : {inTag}?=> ('a'..'z'|'A'..'Z'|'_') 'a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
BUFFER
: . {$channel=HIDDEN;}
;

Parsing a templating language

I'm trying to parse a templating language and I'm having trouble correctly parsing the arbitrary html that can appear between tags. So far what I have is below, any suggestions? An example of a valid input would be
{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}This Should Be Parsed as a Buffer.{/bar2}
And the grammar is:
grammar g;
options {
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
/* LEXER RULES */
tokens {
}
LD : '{';
RD : '}';
LOOP : '#';
END_LOOP: '/';
PARTIAL : '>';
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER options {greedy=false;} : ~(LD | RD)+ ;
/* PARSER RULES */
start : body EOF
;
body : (tag | loop | partial | BUFFER)*
;
tag : LD! IDENT^ RD!
;
loop : LD! LOOP^ IDENT RD!
body
LD! END_LOOP! IDENT RD!
;
partial : LD! PARTIAL^ IDENT RD!
;
buffer : BUFFER
;
Your lexer tokenizes independently from your parser. If your parser tries to match a BUFFER token, the lexer does not take this info into account. In your case with input like: "blah blah blah", the lexer creates 3 IDENT tokens, not a single BUFFER token.
What you need to "tell" your lexer is that when you're inside a tag (i.e. you encountered a LD tag), a IDENT token should be created, and when you're outside a tag (i.e. you encountered a RD tag), a BUFFER token should be created instead of an IDENT token.
In order to implement this, you need to:
create a boolean flag inside the lexer that keeps track of the fact that you're in- or outside a tag. This can be done inside the #lexer::members { ... } section of your grammar;
after the lexer either creates a LD- or RD-token, flip the boolean flag from (1). This can be done in the #after{ ... } section of the lexer rules;
before creating a BUFFER token inside the lexer, check if you're outside a tag at the moment. This can be done by using a semantic predicate at the start of your lexer rule.
A short demo:
grammar g;
options {
output=AST;
ASTLabelType=CommonTree;
}
#lexer::members {
private boolean insideTag = false;
}
start
: body EOF -> body
;
body
: (tag | loop | partial | BUFFER)*
;
tag
: LD IDENT RD -> IDENT
;
loop
: LD LOOP IDENT RD body LD END_LOOP IDENT RD -> ^(LOOP body IDENT IDENT)
;
partial
: LD PARTIAL IDENT RD -> ^(PARTIAL IDENT)
;
LD #after{insideTag=true;} : '{';
RD #after{insideTag=false;} : '}';
LOOP : '#';
END_LOOP : '/';
PARTIAL : '>';
SPACE : (' ' | '\t' | '\r' | '\n') {$channel=HIDDEN;};
IDENT : (LETTER | '_') (LETTER | '_' | DIGIT)*;
BUFFER : {!insideTag}?=> ~(LD | RD)+;
fragment DIGIT : '0'..'9';
fragment LETTER : ('a'..'z' | 'A'..'Z');
(note that you probably want to discard spaces between tag, so I added a SPACE rule and discarded these spaces)
Test it with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "{foo}{#bar}blah blah blah{zed}{/bar}{>foo2}{#bar2}" +
"This Should Be Parsed as a Buffer.{/bar2}";
gLexer lexer = new gLexer(new ANTLRStringStream(src));
gParser parser = new gParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.start().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and after running the main class:
*nix/MacOS
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Windows
java -cp antlr-3.3.jar org.antlr.Tool g.g
javac -cp antlr-3.3.jar *.java
java -cp .;antlr-3.3.jar Main
You'll see some DOT-source being printed to the console, which corresponds to the following AST:
(image created using graphviz-dev.appspot.com)

ANTLR expression interpreter

I have created the following grammar: I would like some idea how to build an interpreter that returns a tree in java, which I can later use for printing in the screen, Im bit stack on how to start on it.
grammar myDSL;
options {
language = Java;
}
#header {
package DSL;
}
#lexer::header {
package DSL;
}
program
: IDENT '={' components* '}'
;
components
: IDENT '=('(shape)(shape|connectors)* ')'
;
shape
: 'Box' '(' (INTEGER ','?)* ')'
| 'Cylinder' '(' (INTEGER ','?)* ')'
| 'Sphere' '(' (INTEGER ','?)* ')'
;
connectors
: type '(' (INTEGER ','?)* ')'
;
type
: 'MG'
| 'EL'
;
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
INTEGER: '0'..'9'+;
// This if for the empty spaces between tokens and avoids them in the parser
WS: (' ' | '\t' | '\n' | '\r' | '\f')+ {$channel=HIDDEN;};
COMMENT: '//' .* ('\n' | '\r') {$channel=HIDDEN;};
A couple of remarks:
There's no need to set the language for Java, which is the default target language. So you can remove this:
options {
language = Java;
}
Your IDENT contains an error:
IDENT: ('a'..'z' | 'A'..'Z')('a'..'z' | 'A'..'Z' | '0'..'0')*;
the '0'..'0') should most probably be '0'..'9').
The sub rule (INTEGER ','?)* also matches source like 1 2 3 4 (no comma's at all!). Perhaps you meant to do: (INTEGER (',' INTEGER)*)?
Now, as to your question: how to let ANTLR construct a proper AST? This can be done by adding output = AST; in your options block:
options {
//language = Java;
output = AST;
}
And then either adding the "tree operators" ^ and ! in your parser rules, or by using tree rewrite rules: rule: a b c -> ^(c b a).
The "tree operator" ^ is used to define the root of the (sub) tree and ! is used to exclude a token from the (sub) tree.
Rewrite rules have ^( /* tokens here */ ) where the first token (right after ^() is the root of the (sub) tree, and all following tokens are child nodes of the root.
An example might be in order. Let's take your first rule:
program
: IDENT '={' components* '}'
;
and you want to let IDENT be the root, components* the children and you want to exclude ={ and } from the tree. You can do that by doing:
program
: IDENT^ '={'! components* '}'!
;
or by doing:
program
: IDENT '={' components* '}' -> ^(IDENT components*)
;