Generating AST from ANTLR grammar - antlr

For the question and the grammar suggested by @BartKiers (thank you!), I added an options block to specify the output:
options{
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
However, I am not able to figure out how to access the output, i.e. the AST. I need to traverse the tree and process each operation that was specified in the input.
Using your example here, I am trying to implement rules that return values. However, I am running into the following errors:
relational returns [String val]
: STRINGVALUE ((operator)^ term)?
{val = $STRINGVALUE.text + $operator.text + $term.text; }
;
term returns [String rhsOperand]
: QUOTEDSTRINGVALUE {rhsOperand = $QUOTEDSTRINGVALUE.text;}
| NUMBERVALUE {rhsOperand = $NUMBERVALUE.text; }
| '(' condition ')'
;
Compilation Error:
Checking Grammar RuleGrammarParser.g...
\output\RuleGrammarParser.java:495: cannot find symbol
symbol : variable val
location: class RuleGrammarParser
val = (STRINGVALUE7!=null?STRINGVALUE7.getText():null) + (operator8!=null?input.toString(operator8.start,operator8.stop):null) + (term9!=null?input.toString(term9.start,term9.stop):null);
^
\output\RuleGrammarParser.java:612: cannot find symbol
symbol : variable rhsOperand
location: class RuleGrammarParser
rhsOperand = (QUOTEDSTRINGVALUE10!=null?QUOTEDSTRINGVALUE10.getText():null);
^
\output\RuleGrammarParser.java:632: cannot find symbol
symbol : variable rhsOperand
location: class RuleGrammarParser
rhsOperand = (NUMBERVALUE11!=null?NUMBERVALUE11.getText():null);
^
3 errors
Can you please help me understand why this fails to compile?
Added the pastebin: http://pastebin.com/u1Bv3L0A

Simply adding output=AST to the options section does not create an AST; it produces a flat, one-dimensional list of tokens. To mark certain tokens as roots (or children), you need to do a bit of work.
Check out this answer, which explains how to create a proper AST and how to get access to the tree the parser then produces (the CommonTree tree in the main method of the answer I mentioned).
Note that you can safely remove language=Java;: by default the target language is Java (no harm in leaving it there though).
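Once a real tree exists, traversing it is a plain recursive walk. The following is only a minimal sketch of that walk, not ANTLR code: the nested Node class is a hypothetical stand-in for CommonTree (which exposes getText(), getChildCount() and getChild(int)), and the sample condition in main is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class AstWalkSketch {
    // Hypothetical stand-in for an AST node; not part of ANTLR.
    static class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text, Node... kids) {
            this.text = text;
            for (Node k : kids) children.add(k);
        }
    }

    // Depth-first walk: a leaf renders as its text, an inner node as (root child child ...).
    static String render(Node n) {
        if (n.children.isEmpty()) return n.text;
        StringBuilder sb = new StringBuilder("(").append(n.text);
        for (Node c : n.children) sb.append(' ').append(render(c));
        return sb.append(')').toString();
    }

    // Collect the text of every node that has children (the operators), in pre-order.
    static List<String> operators(Node n) {
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }

    private static void collect(Node n, List<String> out) {
        if (!n.children.isEmpty()) out.add(n.text);
        for (Node c : n.children) collect(c, out);
    }

    public static void main(String[] args) {
        // Made-up tree for a condition like: name = "x" AND age > 3
        Node tree = new Node("AND",
                new Node("=", new Node("name"), new Node("\"x\"")),
                new Node(">", new Node("age"), new Node("3")));
        System.out.println(render(tree));
        System.out.println(operators(tree));
    }
}
```

With a flat token list there would be nothing to recurse into, which is why the root/child marking in the grammar matters before any walk like this is useful.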

Related

Xtext: Recursive Rule Invocations

I am attempting to build a language that can declare methods and fields, with intrinsic support for generics. I would like to be able to use primitive types like String, as well as declare my own classes.
This should be valid syntax:
String somePrimitive
class MyClass { }
MyClass someObject;
class List { }
List<String> stringList;
List<MyClass> objectList;
List<String> getNames() { }
I have a grammar that supports these operations:
Model:
(members+=ModelMembers)*;
ModelMembers:
Class | Field | MethodDeclaration
;
Class:
'class' name=ID '{' '}'
;
Field:
type=Type
name=ID
;
enum PrimitiveType: STRING="String" | NUMBER="Number";
Type:
(
{TypeObject} clazz=[Class] ("<" a+=Type ("," a+=Type)* ">")?
|
{TypePrimitive} type=PrimitiveType
)
;
MethodDeclaration:
returnType=Type name=ID "(" ")" "{"
"}"
;
But it contains an error:
[fatal] rule rule__ModelMembers__Alternatives has non-LL(*) decision due to recursive rule invocations reachable from alts 2,3. Resolve by left-factoring or using syntactic predicates or using backtrack=true option.
The problem seems to stem from the fact that the Type rule is recursive, and can be matched as the beginning of either a MethodDeclaration or a Field.
However, it is possible to figure out what rule one is building, as the method will have () { } after the name.
What really confuses me, is if I replace the recursive rule with simply [Class], e.g. Field: type=[Class] name=ID (and the same for the MethodDeclaration) the grammar is valid.
I get that there is some ambiguity when one sees an instance of Type, as it could lead onto a method or field.... but that's exactly the same when I replace with [Class]. Instances of class can lead onto a method or field.
How can it be ambiguous using Type, but not ambiguous using [Class]?
This is not a direct answer to your question, but have you considered using Xbase instead of plain Xtext? With Xbase you can simply use predefined rules that match everything you need:
Java types with generics
Expressions
Annotations
and many more.
Here are a couple of useful links:
Xbase: https://wiki.eclipse.org/Xbase
Extending Xbase blog: http://koehnlein.blogspot.de/2011/07/extending-xbase.html
7 Languages For The JVM (7 examples): http://www.eclipse.org/Xtext/7languages.html
Screencasts: http://xtextcasts.org/?tag_id=10
If Xbase doesn't suit you, you can still learn from its Xbase-Xtext-Grammar.
This grammar parses the example code without throwing errors (the layout is the one favored by Terence Parr, the ANTLR man. I find it helps greatly):
Model
: (members+=ModelMembers)*
;
ModelMembers
: Class
| MethodDeclaration
| Field
;
Class
: 'class' name=ID '{' '}'
;
Field
: type=Type name=ID ';'
;
PrimitiveType
: ("String" |"Number")
;
TypeReferenceOrPrimitive
: {TypeClass} type=[Class]
| {TypePrimitive} PrimitiveType
;
Type
: {TypeObject} clazz=[Class] ("<" a+=TypeReferenceOrPrimitive ("," a+=TypeReferenceOrPrimitive)* ">")?
| {TypePrimitive} type=PrimitiveType
;
MethodDeclaration
: returnType=Type name=ID "(" ")" "{" "}"
;
I'm no Xtext expert so there may be better ways. My 'trick' is to define TypeReferenceOrPrimitive. You will probably need to play around with the grammar a bit more in order to get an AST that is easier to process.

ANTLR - grandchild nodes in tree construction

I'm trying to write a declarative grammar where the order of declarations and other statements is unimportant. However, for parsing, I'd like to have the grammar output a tree in an ordered fashion. Let's say the language consists of declarations (decl) and assignments (assign). An example might be:
decl x
assign y 2
assign x 1
decl y
I'd like to have the program represented by a tree with all the declarations in one subtree, and all the assignments in another. For the example above, something like:
(PROGRAM
(DECLARATIONS x y)
(ASSIGNMENTS
(y 2)
(x 1)))
Can I perform this rearrangement during tree construction, or should I write a tree grammar?
I think that there is an easier answer than the other one here:
tokens { DECLS; ASSIGNS; }
prog: (d+=decl | a+=assign)* EOF -> ^(DECLS $d*) ^(ASSIGNS $a*) ;
...
Which can be adapted for as many rules as you like of course.
However, are you sure you need to do this? Why not just build the symbol table of DECL instructions in the parser, and then only build an AST of ASSIGNs, which you can check in the tree walk.
Jim
Can I perform this rearrangement during tree construction, or should I write a tree grammar?
It's possible to do either, but I recommend grouping the nodes during token parsing.
I haven't been happy with any tree-rewriting grammars I've written that group nodes because those grammars have to rediscover where each groupable node is at -- hence the need for grouping. The token parser touches all that data during regular processing, and the tree grammar ends up walking the tree for those nodes exactly like the token parser already walked its input for tokens. I don't think the tree parser is worth the hassle if it's just for grouping.
Anyway, managing the grouping in the parser boils down to saving off the decl and assign nodes after they're produced then pushing them out again when their grouping level occurs. Here's a quick example.
Declarative.g
grammar Declarative;
options {
output = AST;
}
tokens {
PROGRAM; DECLARATIONS; ASSIGNMENTS;
}
@parser::header {
import java.util.ArrayList;
}
@members {
private ArrayList<Object> assigns = new ArrayList<Object>();
private ArrayList<Object> decls = new ArrayList<Object>();
private Object createTree(int ttype, ArrayList<Object> children) {
Object tree = adaptor.create(ttype, tokenNames[ttype]);
for (Object child : children){
adaptor.addChild(tree, child);
}
return tree;
}
}
compilationUnit : statement* EOF -> ^(PROGRAM {createTree(DECLARATIONS, decls)} {createTree(ASSIGNMENTS, assigns)});
statement : decl {decls.add($decl.tree);}
| assign {assigns.add($assign.tree);}
;
decl : DECL^ ID;
assign : ASSIGN^ ID INT;
DECL : 'decl';
ASSIGN : 'assign';
ID : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z')*;
INT : ('0'..'9')+;
WS : (' '|'\t'|'\f'|'\n'|'\r'){skip();};
Each decl node is saved off by the statement rule in the decls list, and similarly each assign node in the assigns list.
Method createTree uses the parser's TreeAdaptor to build the group nodes and populate them.
CommonTree tree = (CommonTree) adaptor.create(ttype, tokenNames[ttype]);
for (Object child : children){
adaptor.addChild(tree, child);
}
return tree;
The production for compilationUnit is ^(PROGRAM {createTree(DECLARATIONS, decls)} {createTree(ASSIGNMENTS, assigns)}), which adds the grouping nodes to PROGRAM. Method createTree is used to build the grouping node and its children in one go.
There may be a tricky way to get ANTLR to pull it all together for you, but this works and is fairly self-explanatory.
So given this input...
decl x
assign y 2
assign x 1
decl y
... the token parser produced for the grammar above produces this tree as output:
(PROGRAM
(DECLARATIONS
(decl x)
(decl y))
(ASSIGNMENTS
(assign y 2)
(assign x 1)))
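Stripped of the ANTLR machinery, the grouping idea is just "sort each node into a list as it is seen, then emit the lists under one root". The sketch below shows that in plain Java, assuming the statements arrive as already-split strings; the class name GroupingSketch and its string-based createTree are stand-ins for illustration, not the TreeAdaptor calls used in the grammar.

```java
import java.util.ArrayList;
import java.util.List;

class GroupingSketch {
    // Build "(ROOT child child ...)" from already-rendered children.
    static String createTree(String root, List<String> children) {
        StringBuilder sb = new StringBuilder("(").append(root);
        for (String c : children) sb.append(' ').append(c);
        return sb.append(')').toString();
    }

    // Save each statement in its group as it is seen, then push the groups out under PROGRAM.
    static String group(List<String> statements) {
        List<String> decls = new ArrayList<>();
        List<String> assigns = new ArrayList<>();
        for (String s : statements) {
            String stmt = s.trim();
            String node = "(" + stmt + ")";
            if (stmt.startsWith("decl ")) {
                decls.add(node);
            } else if (stmt.startsWith("assign ")) {
                assigns.add(node);
            } else {
                throw new IllegalArgumentException("unknown statement: " + s);
            }
        }
        return createTree("PROGRAM", List.of(
                createTree("DECLARATIONS", decls),
                createTree("ASSIGNMENTS", assigns)));
    }

    public static void main(String[] args) {
        System.out.println(group(List.of("decl x", "assign y 2", "assign x 1", "decl y")));
    }
}
```

Source order is preserved within each group, which matches the tree shown above: x before y in the declarations, y before x in the assignments.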

NullPointerException with ANTLR text attribute

I have a problem that I've been stuck on for a while and I would appreciate some help if possible.
I have a few rules in an ANTLR tree grammar:
block
: compoundstatement
| ^(VAR declarations) compoundstatement
;
declarations
: (^(t=type idlist))+
;
idlist
: IDENTIFIER+
;
type
: REAL
| i=INTEGER
;
I have written a Java class VarTable that I will insert all of my variables into as they are declared at the beginning of my source file. The table will also hold their variable types (ie real or integer). I'll also be able to use this variable table to check for undeclared variables or duplicate declarations etc.
So basically I want to be able to send the variable type down from the 'declarations' rule to the 'idlist' rule and then loop through every identifier in the idlist rule, adding them to my variable table one by one.
The major problem is that I get a NullPointerException when I try to access the 'text' attribute of the $t variable in the 'declarations' rule (this is the one which refers to the type).
And yet if I try and access the 'text' attribute of the $i variable in the 'type' rule, there's no problem.
I have looked at the place in the Java file where the NullPointerException is being generated and it still makes no sense to me.
Is the problem that there could be multiple types, because the rule is
(^(t=type idlist))+
?
I have the same issue when I get down to the idlist rule, because I'm unsure how to write an action that will let me loop through all of the IDENTIFIER tokens found.
Grateful for any help or comments.
Cheers
You can't reference the attributes from production rules like you tried inside tree grammars, only in parser (or combined) grammars (they're different objects!). Note that INTEGER is not a production rule, just a "simple" token (terminal). That's why you can invoke its .text attribute.
So, if you want to get hold of the text of the type rule in your tree grammar and print it in your declarations rule, you could do something like this:
tree grammar T;
...
declarations
: (^(t=type idlist {System.out.println($t.returnValue);}))+
;
...
type returns [String returnValue]
: i=INTEGER {$returnValue = "[" + $i.text + "]";}
;
...
But if you really want to do it without specifying a return object, you could do something like this:
declarations
: (^(t=type idlist {System.out.println($t.start.getText());}))+
;
Note that type returns an instance of a TreeRuleReturnScope which has an attribute called start which in its turn is a CommonTree instance. You could then call getText() on that CommonTree instance.
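Once the type text is available, the table side is straightforward. The question does not show its VarTable class, so the following is only a guess at its shape: VarTableSketch, declareAll and typeOf are hypothetical names, and the sketch assumes the identifiers of one declaration have already been gathered into a list.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class VarTableSketch {
    // Variable name -> declared type, in declaration order.
    private final Map<String, String> types = new LinkedHashMap<>();

    // Record every identifier of one declaration under the given type; reject duplicates.
    void declareAll(String type, List<String> ids) {
        for (String id : ids) {
            if (types.containsKey(id)) {
                throw new IllegalStateException("duplicate declaration: " + id);
            }
            types.put(id, type);
        }
    }

    // Type of a declared variable, or null when it was never declared.
    String typeOf(String id) {
        return types.get(id);
    }

    public static void main(String[] args) {
        VarTableSketch table = new VarTableSketch();
        table.declareAll("integer", List.of("i", "j"));
        table.declareAll("real", List.of("x"));
        System.out.println(table.typeOf("j"));
        System.out.println(table.typeOf("missing"));
    }
}
```

In the tree grammar, an action in the declarations rule could then pass $t.start.getText() (or a declared return value) as the type argument.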

Interpreting a variable number of tree nodes in ANTLR Tree Grammar

Whilst creating an inline ANTLR Tree Grammar interpreter I have come across an issue regarding the multiplicity of procedure call arguments.
Consider the following (faulty) tree grammar definition.
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME arguments=expression*)
{
if(procedureName.equals("foo")) {
callFooMethod(arguments[0], arguments[1]);
} else if(procedureName.equals("bar")) {
callBarMethod(arguments[0], arguments[1], arguments[2]);
}
}
;
My problem lies with the retrieval of the given arguments. If there would be a known quantity of expressions I would just assign the values coming out of these expressions to their own variable, e.g.:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument1=expression argument2=expression)
{
...
}
;
This however is not the case.
Given a case like this, what is the recommendation on interpreting a variable number of tree nodes inline within the ANTLR Tree Grammar?
Use the += operator. To handle any number of arguments, including zero:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument+=expression*)
{
...
}
;
See the tree construction documentation on the antlr website.
The above will change the type of the variable argument from typeof(expression) to a List (well, at least when you're generating Java code). Note that the list is untyped, so it's just a plain List and its elements need to be cast.
If you use += on multiple elements with the same label, they are collected into the same list, for example:
twoParameterCall
: ^(PROCEDURECALL procedureName=NAME argument+=expression argument+=expression)
{
...
}
;
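With the arguments in a list, the action body from the question can dispatch on the name and check the argument count before indexing. This is a minimal plain-Java sketch of that action logic, assuming the evaluated argument values are already in a List; ProcedureDispatchSketch, call and requireArity are illustrative names, and the returned strings merely stand in for calling callFooMethod and callBarMethod.

```java
import java.util.List;

class ProcedureDispatchSketch {
    // Dispatch on the procedure name, checking the argument count before indexing the list.
    static String call(String procedureName, List<?> arguments) {
        switch (procedureName) {
            case "foo":
                requireArity(procedureName, arguments, 2);
                return "foo(" + arguments.get(0) + ", " + arguments.get(1) + ")";
            case "bar":
                requireArity(procedureName, arguments, 3);
                return "bar(" + arguments.get(0) + ", " + arguments.get(1)
                        + ", " + arguments.get(2) + ")";
            default:
                throw new IllegalArgumentException("unknown procedure: " + procedureName);
        }
    }

    // Fail with a readable message instead of an IndexOutOfBoundsException.
    private static void requireArity(String name, List<?> arguments, int expected) {
        if (arguments.size() != expected) {
            throw new IllegalArgumentException(
                    name + " expects " + expected + " arguments, got " + arguments.size());
        }
    }

    public static void main(String[] args) {
        System.out.println(call("foo", List.of(1, 2)));
        System.out.println(call("bar", List.of(1, 2, 3)));
    }
}
```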

Am I forced to use %glr-parser?

I have been keeping the shift/reduce errors away. Now I think I have finally met my match.
Int[] a
a[0] = 1
The problem is int[] is defined as
Type OptSquareBrackets
while a[0] is defined as
Var | Var '[' expr ']'
Var and Type are both defined as VAR, which is any valid variable name: [a-zA-Z][a-zA-Z0-9_]*. Apart from adding a dummy token (such as Decl Type OptSquareBrackets instead), is there a way to write this without a conflict? From this one rule I get 1 shift/reduce and 1 reduce/reduce warning.
Could you define a new Token
VarLBracket [a-zA-Z][a-zA-Z0-9_]*\[
And therefore define declaration
Type | VarLBracket ']';
and define assignment target as
Var | VarLBracket expr ']';
Create a Lex rule with [] since [] is only used in declaration and everywhere else would use [var]
Technically, this problem stems from trying to tie the grammar to a semantic meaning that doesn't actually differ in syntax.
It seems to me that you just need a single grammar construct that describes both types and expressions. Make the distinction in code and not in the grammar, especially if there is not actually a syntactic difference. Yacc is called a compiler generator, but that is not really accurate: it only generates parsers.
Having said that, recognizing [] as a terminal symbol might be an easier way to fix the problem and get on with things. Yacc isn't very good at ambiguous grammars and it needs to make early decisions on which path to follow.
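To make the "[] as a terminal" idea concrete, here is a small tokenizer sketch. It is written in Java with java.util.regex rather than as a Lex rule, purely to show the effect on the token stream; BracketLexerSketch and the token names are made up, and whitespace or any unrecognized character is silently skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BracketLexerSketch {
    // Alternatives are tried left to right, so "[]" is matched before a lone "[".
    private static final Pattern TOKEN = Pattern.compile(
            "(?<EMPTYBRACKETS>\\[\\])"
            + "|(?<VAR>[a-zA-Z][a-zA-Z0-9_]*)"
            + "|(?<NUM>[0-9]+)"
            + "|(?<SYM>[\\[\\]=])");

    // Turn one line into token names; find() skips anything the pattern does not match.
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group("EMPTYBRACKETS") != null) {
                out.add("EMPTY_BRACKETS");
            } else if (m.group("VAR") != null) {
                out.add("VAR(" + m.group() + ")");
            } else if (m.group("NUM") != null) {
                out.add("NUM(" + m.group() + ")");
            } else {
                out.add("'" + m.group() + "'");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Int[] a"));
        System.out.println(tokenize("a[0] = 1"));
    }
}
```

After VAR, the parser now sees either EMPTY_BRACKETS (a declaration) or '[' (an indexed variable), so one token of lookahead is enough to tell the two apart.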