ANTLR - grandchild nodes in tree construction - antlr

I'm trying to write a declarative grammar where the order of declarations and other statements is unimportant. However, for parsing, I'd like to have the grammar output a tree in an ordered fashion. Let's say the language consists of declarations (decl) and assignments (assign). An example might be:
decl x
assign y 2
assign x 1
decl y
I'd like to have the program represented by a tree with all the declarations in one subtree, and all the assignments in another. For the example above, something like:
(PROGRAM
(DECLARATIONS x y)
(ASSIGNMENTS
(y 2)
(x 1)))
Can I perform this rearrangement during tree construction, or should I write a tree grammar?

I think that there is an easier answer than the other one here:
token { DECLS; ASSIGNS; }
prog: (d+=decl | a+=assign)* EOF -> ^(DECLS $d*) ^(ASSIGNS $a*) ;
...
Which can be adapted for as many rules as you like of course.
However, are you sure you need to do this? Why not just build the symbol table of DECL instructions in the parser, and then only build an AST of ASSIGNs, which you can check in the tree walk.
Jim

Can I perform this rearrangement during tree construction, or should I write a tree grammar?
It's possible to do either, but I recommend grouping the nodes during token parsing.
I haven't been happy with any tree-rewriting grammars I've written that group nodes because those grammars have to rediscover where each groupable node is at -- hence the need for grouping. The token parser touches all that data during regular processing, and the tree grammar ends up walking the tree for those nodes exactly like the token parser already walked its input for tokens. I don't think the tree parser is worth the hassle if it's just for grouping.
Anyway, managing the grouping in the parser boils down to saving off the decl and assign nodes after they're produced then pushing them out again when their grouping level occurs. Here's a quick example.
Declarative.g
grammar Declarative;
options {
output = AST;
}
tokens {
PROGRAM; DECLARATIONS; ASSIGNMENTS;
}
#parser::header {
import java.util.ArrayList;
}
#members {
private ArrayList<Object> assigns = new ArrayList<Object>();
private ArrayList<Object> decls = new ArrayList<Object>();
private Object createTree(int ttype, ArrayList<Object> children) {
Object tree = adaptor.create(ttype, tokenNames[ttype]);
for (Object child : children){
adaptor.addChild(tree, child);
}
return tree;
}
}
compilationUnit : statement* EOF -> ^(PROGRAM {createTree(DECLARATIONS, decls)} {createTree(ASSIGNMENTS, assigns)});
statement : decl {decls.add($decl.tree);}
| assign {assigns.add($assign.tree);}
;
decl : DECL^ ID;
assign : ASSIGN^ ID INT;
DECL : 'decl';
ASSIGN : 'assign';
ID : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z')*;
INT : ('0'..'9')+;
WS : (' '|'\t'|'\f'|'\n'|'\r'){skip();};
Each decl node is saved off by the statement rule in the decls list, and similarly for for each assign node.
Method createTree uses the parser's TreeAdaptor to build the group nodes and populate them.
CommonTree tree = (CommonTree) adaptor.create(ttype, tokenNames[ttype]);
for (Object child : children){
adaptor.addChild(tree, child);
}
return tree;
The production for compilationUnit is ^(PROGRAM {createTree(DECLARATIONS, decls)} {createTree(ASSIGNMENTS, assigns)}), which adds the grouping nodes to PROGRAM. Method createTree is used to build the grouping node and its children in one go.
There may be a tricky way to get ANTLR to pull it all together for you, but this works and is fairly self-explanatory.
So given this input...
decl x
assign y 2
assign x 1
decl y
... the token parser produced for the grammar above produces this tree as output:
(PROGRAM
(DECLARATIONS
(decl x)
(decl y))
(ASSIGNMENTS
(assign y 2)
(assign x 1)))

Related

Generate classes from grammar rules, objects on parse

Is it possible to generate .m and .h's for any grammar/ rules so that during parsing it creates an object that represents that rule.
So some grammar
coolObjName = Word;
could generate a class that is named coolObjName (or some variation) and has a field for the word, and generates the action:
coolObjName = Word{
CoolObjName* newName = [[CoolObjName alloc] initWithWord:POP_STR()];
PUSH(newName);
};
Then a higher level rule such as:
myhigherlevel = coolObjName Number;
would create a myHigherLevel class that has a coolObjName member and a number, which then adds the action:
myhigherlevel = coolObjName Number{
double num = POP_DOUBLE();
coolObjName* name = POP();
MyHigherLevel* higherLevel = [[MyHigherLevel alloc] init];
higherLevel.number = num;
higherLevel.name = name;
PUSH(higherLevel);
};
Empty tags turn to empty objects and * and + result in arrays.
Is there a tool that can do this or where would I go to create such. (seems super useful and awesome)
Creator of PEGKit here.
There's nothing currently in PEGKit which will inspect your rules and auto-generate ObjC AST or model classes which might match your intent. For now, PEGKit can only produce source code for your parser, but not your related AST or model classes.
It is probably very likely that you will be building an Abstract Syntax Tree (or Intermediate Representation) in your Actions or Parser Delegate Callbacks. ANLTR has some wonderful high-level Tree Rewriting features which allow you to specify in the grammar how to build your AST. Although looking at the docs it seems this might have changed significantly since ANTLR 3? Not sure. It used to look like something like this (maybe it still does, I'm not an ANTLR expert):
addExpr : lhs=NUM op='+' rhs=NUM -> ^(op lhs rhs);
The -> means rewrite to tree like…
And the ^(op …) means tree rooted by a plus operator containing children….
I would love to add tree rewriting features like this to PEGKit, for the express purpose of building ASTs (but not other model objects). But honestly, it's a huge task, and it is not likely to appear soon.

Serialization of ANTLR ParseTree

I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate, lets call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParseRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
return a+b;
}
int mul(int a, int b) {
return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects each method text and store it. The parser starts the process at the methodList rule.
void visitMethodList(MethodListContext ctx) {
if(ctx.methodDeclaration() != null) {
String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
// store methodStr for later parsing
}
// visit next method list item, if any
if(ctx.methodList() != null) {
visit(ctx.methodList());
}
}
The second run launch the parsing of each method declaration (in a separate thread for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
// parse the method block
}
The reason why the text of a methodDeclaration rule is formatted if because calling directly ctx.methodDeclaration().getText() would combine the text of all child nodes AntLR doc, possibly making it unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
StringBuilder builder = new StringBuilder();
for(int i = 0; i < tree.getChildCount(); i ++) {
ParseTree child = tree.getChild(i);
if(child instanceof TerminalNode) {
builder.append(child.getText());
builder.append(separator);
} else if(child instanceof RuleContext) {
builder.append(formatParseTree(child, separator));
}
}
return builder.toString();
}

how to use the generic void pointer of parser tree in ANTLR3

I'm writing a grammar for a simple language using ANTLR3 C target. I want to attach some data
to the AST generated by ANTLR. As the data to attach is small, using the generic void pointer in ANTLR3_BASE_TREE_struct is quite straightforward. The ANTLR3c documentation states that
void * u
Generic void pointer allows the grammar programmer to attach any structure they like to
a tree node, in many cases saving the need to create their own tree and tree adaptors.
Then I wrote the grammar as below(only a piece of grammar is shown here)
datum
#declarations {
parser_data_t* data;
}
#init{
data = HPS_MALLOC(sizeof(parser_data_t));
HPS_RT_ASSERT(data, HPS_ERR_NOMEM);
}
: simpleDatum
{
data->candidate = 0;
$datum.tree->u = data;
}
-> ^(I_DATUM simpleDatum)
| compoundDatum
{
data->candidate = 1;
$datum.tree->u = data;
}
-> ^(I_DATUM compoundDatum)
;
then I generate the lexer/parser, compiled it and linked it to a main function. When I ran the program, a segment fault occurred. I checked the parser generated by ANTLR and found the problem. The assignment $datum.tree->u = data caused the segment fault because
the data structure $datum.tree hadn't been assigned a valid value, so $datum.tree->u is an invalid reference.
What I want is to attach some data to the tree node representing the non-terminal datum, anyone could tell me how to achieve that? I've searched in google and StackOverflow but no answer found.
Thank you!

Generating AST from ANTLR grammar

For the question and the grammar suggested by #BartKiers (Thank you!), I added the options block to specify the output to be
options{
language=Java;
output=AST;
ASTLabelType=CommonTree;
}
However, I am not able to figure out how to access the output i.e. AST. I need to traverse through the tree and process each operation that was specified in the input.
Using your example here, I am trying to implement rules returning values. However, I am running into following errors:
relational returns [String val]
: STRINGVALUE ((operator)^ term)?
{val = $STRINGVALUE.text + $operator.text + $term.text; }
;
term returns [String rhsOperand]
: QUOTEDSTRINGVALUE {rhsOperand = $QUOTEDSTRINGVALUE.text;}
| NUMBERVALUE {rhsOperand = $NUMBERVALUE.text; }
| '(' condition ')'
;
Compilation Error:
Checking Grammar RuleGrammarParser.g...
\output\RuleGrammarParser.java:495: cannot find symbol
symbol : variable val
location: class RuleGrammarParser
val = (STRINGVALUE7!=null?STRINGVALUE7.getText():null) + (operator8!=null?input.toString(operator8.start,operator8.stop):null) + (term9!=null?input.toString(term9.start,term9.stop):null);
^
\output\RuleGrammarParser.java:612: cannot find symbol
symbol : variable rhsOperand
location: class RuleGrammarParser
rhsOperand = (QUOTEDSTRINGVALUE10!=null?QUOTEDSTRINGVALUE10.getText():null);
^
\output\RuleGrammarParser.java:632: cannot find symbol
symbol : variable rhsOperand
location: class RuleGrammarParser
rhsOperand = (NUMBERVALUE11!=null?NUMBERVALUE11.getText():null);
^
3 errors
Can you please help me understand why this fails to compiler?
Added the pastebin: http://pastebin.com/u1Bv3L0A
By simply adding output=AST to the options section you don't create a AST, but a flat, 1 dimensional list of tokens. To mark certain tokens as root (or children), you need to do a bit of work.
Checkout this answer which explains how to create a proper AST and get access to the tree the parser then produces (the CommonTree tree in the main method of the answer I mentioned).
Note that you can safely remove language=Java;: by default the target language is Java (no harm in leaving it there though).

Interpreting a variable number of tree nodes in ANTLR Tree Grammar

Whilst creating an inline ANTLR Tree Grammar interpreter I have come across an issue regarding the multiplicity of procedure call arguments.
Consider the following (faulty) tree grammar definition.
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME arguments=expression*)
{
if(procedureName.equals("foo")) {
callFooMethod(arguments[0], arguments[1]);
}elseif(procedureName.equals("bar")) {
callBarMethod(arguments[0], arguments[1], arguments[2]);
}
}
;
My problem lies with the retrieval of the given arguments. If there would be a known quantity of expressions I would just assign the values coming out of these expressions to their own variable, e.g.:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument1=expression argument2=expression)
{
...
}
;
This however is not the case.
Given a case like this, what is the recommendation on interpreting a variable number of tree nodes inline within the ANTLR Tree Grammar?
Use the += operator. To handle any number of arguments, including zero:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument+=expression*)
{
...
}
;
See the tree construction documentation on the antlr website.
The above will change the type of the variable argument from typeof(expression) to a List (well, at least when you're generating Java code). Note that the list types are untyped, so it's just a plain list.
If you use multiple parameters with the same variable name, they will also create a list, for example:
twoParameterCall
: ^(PROCEDURECALL procedureName=NAME argument=expression argument=expression)
{
...
}
;