How do I rewrite a subtree with a composite root using ANTLR? - antlr

I have an antlr grammer with subtrees like this:
^(type ID)
that I want to convert to:
^(type DUMMY ID)
where type is 'a'|'b'.
Note: what I really want to do is convert anonymous instantiations to explicit by generating dummy names.
I've narrowed it down to the grammars below, but I'm getting this:
(a bar) (b bar)
got td
got bu
Exception in thread "main" org.antlr.runtime.tree.RewriteEmptyStreamException: rule type
at org.antlr.runtime.tree.RewriteRuleElementStream._next(RewriteRuleElementStream.java:157)
at org.antlr.runtime.tree.RewriteRuleSubtreeStream.nextNode(RewriteRuleSubtreeStream.java:77)
at Pattern.bu(Pattern.java:382)
The error message continues. My debug so far:
The input made it through the initial grammar generating two trees. a bar and b bar.
The second grammar does match the trees. it's printing td and bu.
The rewrite crashes, but I have no idea why? What does RewriteEmptyStreamException mean.
What the proper way to do this kind of a rewrite?
My main grammer Rewrite.g:
grammar Rewrite;
options {
output=AST;
}
#members{
public static void main(String[] args) throws Exception {
RewriteLexer lexer = new RewriteLexer(new ANTLRStringStream("a foo\nb bar"));
RewriteParser parser = new RewriteParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.test().getTree();
System.out.println(tree.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
Pattern p = new Pattern(nodes);
CommonTree newtree = (CommonTree) p.downup(tree);
}
}
type
: 'a'
| 'b'
;
test : id+;
id : type ID -> ^(type ID["bar"]);
DUMMY : 'dummy';
ID : ('a'..'z')+;
WS : (' '|'\n'|'r')+ {$channel=HIDDEN;};
and Pattern.g
tree grammar Pattern;
options {
tokenVocab = Rewrite;
ASTLabelType=CommonTree;
output=AST;
filter=true; // tree pattern matching mode
}
topdown
: td
;
bottomup
: bu
;
type
: 'a'
| 'b'
;
td
: ^(type ID) { System.out.println("got td"); }
;
bu
: ^(type ID) { System.out.println("got bu"); }
-> ^(type DUMMY ID)
;
to do compile:
java -cp ../jar/antlr-3.4-complete-no-antlrv2.jar org.antlr.Tool Rewrite.g
java -cp ../jar/antlr-3.4-complete-no-antlrv2.jar org.antlr.Tool Pattern.g
javac -cp ../jar/antlr-3.4-complete-no-antlrv2.jar *.java
java -classpath .:../jar/antlr-3.4-complete-no-antlrv2.jar RewriteParser
EDIT 1: I have also tried using antlr4 and I get the same crash.

There are two small problems to address to get the rewrite to work, one problem in Rewrite and the other in Pattern.
The Rewrite grammar produces ^(type ID) as root elements in the output AST, as shown in the output (a bar) (b bar). A root element can't be transformed because transforming is actually a form of child-swapping: the element's parent drops the element and replaces it with the new, "transformed" version. Without a parent, you'll get the error Can't set single child to a list. Adding the root is a matter of creating an imaginary token ROOT or whatever name you like and referencing it in your entry-level rule's AST generation like so: test : id+ -> ^(ROOT id+);.
The Pattern grammar, the one producing the error you're getting, is confused by the type rule: type : 'a' | 'b' ; as part of the rewrite. I don't know the low-level details here, but apparently a tree parser doesn't maintain the state of a visited root rule like type in ^(type ID) when writing a transform (or maybe it can't or shouldn't, or maybe it's some other limitation). The easiest way to address this is with the following two changes:
Let text "a" and "b" match rule ID in the lexer by changing rule type in Rewrite from type: 'a' | 'b'; to just type: ID;.
Let rule bu in Pattern match against ^(ID ID) and transform to ^(ID DUMMY ID).
Now with a couple of minor debugging changes to Rewrite's main, input "a foo\nb bar" produces the following output:
(ROOT (a foo) (b bar))
got td
got bu
(a foo) -> (a DUMMY foo)
got td
got bu
(b bar) -> (b DUMMY bar)
(ROOT (a DUMMY foo) (b DUMMY bar))
Here are the files as I've changed them:
Rewrite.g
grammar Rewrite;
options {
output=AST;
}
tokens {
ROOT;
}
#members{
public static void main(String[] args) throws Exception {
RewriteLexer lexer = new RewriteLexer(new ANTLRStringStream("a foo\nb bar"));
RewriteParser parser = new RewriteParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.test().getTree();
System.out.println(tree.toStringTree());
CommonTreeNodeStream nodes = new CommonTreeNodeStream(tree);
Pattern p = new Pattern(nodes);
CommonTree newtree = (CommonTree) p.downup(tree, true); //print the transitions to help debugging
System.out.println(newtree.toStringTree()); //print the final result
}
}
type : ID;
test : id+ -> ^(ROOT id+);
id : type ID -> ^(type ID);
DUMMY : 'dummy';
ID : ('a'..'z')+;
WS : (' '|'\n'|'r')+ {$channel=HIDDEN;};
Pattern.g
tree grammar Pattern;
options {
tokenVocab = Rewrite;
ASTLabelType=CommonTree;
output=AST;
filter=true; // tree pattern matching mode
}
topdown
: td
;
bottomup
: bu
;
td
: ^(ID ID) { System.out.println("got td"); }
;
bu
: ^(ID ID) { System.out.println("got bu"); }
-> ^(ID DUMMY ID)
;

I don't have much experience with tree-patterns, with or without rewrites. But when using rewrite rules in them, I believe your options should also include rewrite=true;. The Definitive ANTLR Reference doesn't handle them, so I'm not entirely sure (have a look at the ANTLR wiki for more info).
However, for such (relatively) simple rewrites, you don't really need a separate grammar. You could make DUMMY an imaginary token and inject it in some other parser rule, like this:
grammar T;
options {
output=AST;
}
tokens {
DUMMY;
}
test : id+;
id : type ID -> ^(type DUMMY["dummy"] ID);
type
: 'a'
| 'b'
;
ID : ('a'..'z')+;
WS : (' '|'\n'|'r')+ {$channel=HIDDEN;};
which would parse the input:
a bar
b foo
into the following AST:
Note that if your lexer is also meant to tokenize the input "dummy" as a DUMMY token, change the tokens { ... } block into this:
tokens {
DUMMY='dummy';
}
and you'd still be able to inject a DUMMY in other rules.

Related

How can I build an ANTLR Works style parse tree?

I've read that you need to use the '^' and '!' operators in order to build a parse tree similar to the ones displayed in ANTLR Works (even though you don't need to use them to get a nice tree in ANTLR Works). My question then is how can I build such a tree? I've seen a few pages on tree construction using the two operators and rewrites, and yet say I have an input string abc abc123 and a grammar:
grammar test;
program : idList;
idList : id* ;
id : ID ;
ID : LETTER (LETTER | NUMBER)* ;
LETTER : 'a' .. 'z' | 'A' .. 'Z' ;
NUMBER : '0' .. '9' ;
ANTLR Works will output:
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
What I dont understand is how you can get the 'idList' node on top of this tree (as well as the grammar one as a matter of fact). How can I reproduce this tree using rewrites and those operators?
You can't use ^ and ! alone. These operators only operate on existing tokens, while you want to create extra tokens (and make these the root of your sub trees). You can do that using rewrite rules and defining some imaginary tokens.
A quick demo:
grammar test;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
IdList;
Id;
}
#parser::members {
private static void walk(CommonTree tree, int indent) {
if(tree == null) return;
for(int i = 0; i < indent; i++, System.out.print(" "));
System.out.println(tree.getText());
for(int i = 0; i < tree.getChildCount(); i++) {
walk((CommonTree)tree.getChild(i), indent + 1);
}
}
public static void main(String[] args) throws Exception {
testLexer lexer = new testLexer(new ANTLRStringStream("abc abc123"));
testParser parser = new testParser(new CommonTokenStream(lexer));
walk((CommonTree)parser.program().getTree(), 0);
}
}
program : idList EOF -> idList;
idList : id* -> ^(IdList id*);
id : ID -> ^(Id ID);
ID : LETTER (LETTER | DIGIT)*;
SPACE : ' ' {skip();};
fragment LETTER : 'a' .. 'z' | 'A' .. 'Z';
fragment DIGIT : '0' .. '9';
If you run the demo above, you will see the following being printed to the console:
IdList
Id
abc
Id
abc123
As you can see, imaginary tokens must also start with an upper case letter, just like lexer rules. If you want to give the imaginary tokens the same text as the parser rule they represent, do something like this instead:
idList : id* -> ^(IdList["idList"] id*);
id : ID -> ^(Id["id"] ID);
which will print:
idList
id
abc
id
abc123

ANTLR Is it possible to make grammar with embed grammar inside?

ANTLR: Is it possible to make grammar with embed grammar (with it's own lexer) inside?
For example in my language I have ability to use embed SQL language:
var Query = [select * from table];
with Query do something ....;
Is it possible with ANTLR?
Is it possible to make grammar with embed grammar (with it's own lexer) inside?
If you mean whether it is possible to define two languages in a single grammar (using separate lexers), then the answer is: no, that's not possible.
However, if the question is whether it is possible to parse two languages into a single AST, then the answer is: yes, it is possible.
You simply need to:
define both languages in their own grammar;
create a lexer rule in you main grammar that captures the entire input of the embedded language;
use a rewrite rule that calls a custom method that parses the external AST and inserts it in the main AST using { ... } (see the expr rule in the main grammar (MyLanguage.g)).
MyLanguage.g
grammar MyLanguage;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
ROOT;
}
#members {
private CommonTree parseSQL(String sqlSrc) {
try {
MiniSQLLexer lexer = new MiniSQLLexer(new ANTLRStringStream(sqlSrc));
MiniSQLParser parser = new MiniSQLParser(new CommonTokenStream(lexer));
return (CommonTree)parser.parse().getTree();
} catch(Exception e) {
return new CommonTree(new CommonToken(-1, e.getMessage()));
}
}
}
parse
: assignment+ EOF -> ^(ROOT assignment+)
;
assignment
: Var Id '=' expr ';' -> ^('=' Id expr)
;
expr
: Num
| SQL -> {parseSQL($SQL.text)}
;
Var : 'var';
Id : ('a'..'z' | 'A'..'Z')+;
Num : '0'..'9'+;
SQL : '[' ~']'* ']';
Space : ' ' {skip();};
MiniSQL.g
grammar MiniSQL;
options {
output=AST;
ASTLabelType=CommonTree;
}
parse
: '[' statement ']' EOF -> statement
;
statement
: select
;
select
: Select '*' From ID -> ^(Select '*' From ID)
;
Select : 'select';
From : 'from';
ID : ('a'..'z' | 'A'..'Z')+;
Space : ' ' {skip();};
Main.java
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
String src = "var Query = [select * from table]; var x = 42;";
MyLanguageLexer lexer = new MyLanguageLexer(new ANTLRStringStream(src));
MyLanguageParser parser = new MyLanguageParser(new CommonTokenStream(lexer));
CommonTree tree = (CommonTree)parser.parse().getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
Run the demo
java -cp antlr-3.3.jar org.antlr.Tool MiniSQL.g
java -cp antlr-3.3.jar org.antlr.Tool MyLanguage.g
javac -cp antlr-3.3.jar *.java
java -cp .:antlr-3.3.jar Main
Given the input:
var Query = [select * from table]; var x = 42;
the output of the Main class corresponds to the following AST:
And if you want to allow string literals inside your SQL (which could contain ]), and comments (which could contain ' and ]), the you could use the following SQL rule inside your main grammar:
SQL
: '[' ( ~(']' | '\'' | '-')
| '-' ~'-'
| COMMENT
| STR
)*
']'
;
fragment STR
: '\'' (~('\'' | '\r' | '\n') | '\'\'')+ '\''
| '\'\''
;
fragment COMMENT
: '--' ~('\r' | '\n')*
;
which would properly parse the following input in a single token:
[
select a,b,c
from table
where a='A''B]C'
and b='' -- some ] comment ] here'
]
Just beware that trying to create a grammar for an entire SQL dialect (or even a large subset) is no trivial task! You may want to search for existing SQL parsers, or look at the ANTLR wiki for example-grammars.
Yes, with AntLR it is called Island grammar.
You can get a working example in the v3 examples, inside the island-grammar folder : it shows the usage of a grammar to parse javadoc comments inside of java code.
You can also find some clues in the doc Island Grammars Under Parser Control and that Another one.

ANTLR, heterogeneous AST problem

I examine heterogeneous trees in ANTLR (using ANTLRWorks 1.4.2).
Here is the example of what I have already done in ANTLR.
grammar test;
options {
language = java;
output = AST;
}
tokens {
PROGRAM;
VAR;
}
#members {
class Program extends CommonTree {
public Program(int ttype) {
token = new CommonToken(ttype, "<start>");
}
}
}
start
: program var function
// Works fine:
//-> ^(PROGRAM program var function)
// Does not work (described below):
-> ^(PROGRAM<Program> program var function)
;
program
: 'program'! ID ';'!
;
var
: TYPE^ ID ';'!
;
function
: ID '('! ')'! ';'!
;
TYPE
: 'int'
| 'string'
;
ID
: ('a'..'z' | 'A'..'Z')+
;
WHITESPACE
: (' ' | '\t' '\n'| '\r' | '\f')+ {$channel = HIDDEN;}
;
Sample input:
program foobar;
int foo;
bar();
When I use rewrite rule ^(PROGRAM<Program> program var function), ANTLR stumbles over and I get AST like this:
Whereas when I use this rewrite rule ^(PROGRAM program var function) it works:
Could anyone explain where am I wrong, please? Frankly, I do not really get the idea of heterogeneous trees and how do I use <…> syntax in ANTLR.
What do r0 and r1 mean (first picture)?
I have no idea what these r0 and r1 mean: I don't use ANTLRWorks for debugging, so can't comment on that.
Also, language = java; causes ANTLR 3.2 to produce the error:
error(10): internal error: no such group file java.stg
error(20): cannot find code generation templates java.stg
error(10): internal error: no such group file java.stg
error(20): cannot find code generation templates java.stg
ANTLR 3.2 expects it to be language = Java; (capital "J"). But, by default the target is Java, so, mind as well remove the language = ... entirely.
Now, as to you problem: I cannot reproduce it. As I mentioned, I tested it with ANTLR 3.2, and removed the language = java; part from your grammar, after which everything went as (I) expected.
Enabling the rewrite rule -> ^(PROGRAM<Program> program var function) produces the following ATS:
and when enabling the rewrite rule -> ^(PROGRAM program var function) instead, the following AST is created:
I tested both rewrite rules this with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("program foobar; int foo; bar();");
testLexer lexer = new testLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
testParser parser = new testParser(tokens);
testParser.start_return returnValue = parser.start();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
And the images are produced using graph.gafol.net (and the output of the Main class, of course).

ANTLR Grammar gives me upsidedown-tree

I have a grammar which parses dot notion expressions like this:
a.b.c
memberExpression returns [Expression value]
: i=ID { $value = ParameterExpression($i.value); }
('.' m=memberExpression { $value = MemberExpression($m.value, $i.value); }
)*
;
This parses expressions fine and gives me a tree structure like this:
MemberExpression(
MemberExpression(
ParameterExpression("c"),
"b"
)
, "a"
)
But my problem is that I want a tree that looks like this:
MemberExpression(
MemberExpression(
ParameterExpression("a"),
"b"
)
, "c"
)
for the same expression "a.b.c"
How can I achieve this?
You could do this by collecting all tokens in a java.util.List using ANTLR's convenience += operator and create the desired tree using a custom method in your #parser::members section:
// grammar def ...
// options ...
#parser::members {
private Expression customTree(List tks) {
// `tks` is a java.util.List containing `CommonToken` objects
}
}
// parser ...
memberExpression returns [Expression value]
: ids+=ID ('.' ids+=ID)* { $value = customTree($ids); }
;
I think what you are asking for is mutually left recursive, and therefore ANTLR is not a good choice to parse it.
To elaborate, you need C at the root of the tree and therefore your rule would be:
rule: rule ID;
This rule will be uncertain whether it should match
a.b
or
a.b.c

Processing an n-ary ANTLR AST one child at a time

I currently have a compiler that uses an AST where all children of a code block are on the same level (ie, block.children == {stm1, stm2, stm3, etc...}). I am trying to do liveness analysis on this tree, which means that I need to take the value returned from the processing of stm1 and then pass it to stm2, then take the value returned by stm2 and pass it to stm3, and so on. I do not see a way of executing the child rules in this fashion when the AST is structured this way.
Is there a way to allow me to chain the execution of the child grammar items with my given AST, or am I going to have to go through the painful process of refactoring the parser to generate a nested structure and updating the rest of the compiler to work with the new AST?
Example ANTLR grammar fragment:
block
: ^(BLOCK statement*)
;
statement
: // stuff
;
What I hope I don't have to go to:
block
: ^(BLOCK statementList)
;
statementList
: ^(StmLst statement statement+)
| ^(StmLst statement)
;
statement
: // stuff
;
Parser (or lexer) rules can take parameter values and can return a value. So, in your case, you can do something like:
block
#init {Object o = null; /* initialize the value being passed through */ }
: ^(BLOCK (s=statement[o] {$o = $s.returnValue; /*re-assign 'o' */ } )*)
;
statement [Object parameter] returns [Object returnValue]
: // do something with 'parameter' and 'returnValue'
;
Here's a very simple example that you can use to play around with:
grammar Test;
#members{
public static void main(String[] args) throws Exception {
ANTLRStringStream in = new ANTLRStringStream("1;2;3;4;");
TestLexer lexer = new TestLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestParser parser = new TestParser(tokens);
parser.parse();
}
}
parse
: block EOF
;
block
#init{int temp = 0;}
: (i=statement[temp] {temp = $i.ret;} ';')+
;
statement [int param] returns [int ret]
: Number {$ret = $param + Integer.parseInt($Number.text);}
{System.out.printf("param=\%d, Number=\%s, ret=\%d\n", $param, $Number.text, $ret);}
;
Number
: '0'..'9'+
;
When you've generated a parser and lexer from it and compiled these classes, execute the TestParser class and you'll see the following printed to your console:
param=0, Number=1, ret=1
param=1, Number=2, ret=3
param=3, Number=3, ret=6
param=6, Number=4, ret=10