Whilst creating an inline ANTLR Tree Grammar interpreter I have come across an issue regarding the multiplicity of procedure call arguments.
Consider the following (faulty) tree grammar definition.
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME arguments=expression*)
{
if(procedureName.equals("foo")) {
callFooMethod(arguments[0], arguments[1]);
} else if(procedureName.equals("bar")) {
callBarMethod(arguments[0], arguments[1], arguments[2]);
}
}
;
My problem lies with retrieving the given arguments. If there were a known number of expressions, I would just assign the value coming out of each expression to its own variable, e.g.:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument1=expression argument2=expression)
{
...
}
;
This however is not the case.
Given a case like this, what is the recommendation on interpreting a variable number of tree nodes inline within the ANTLR Tree Grammar?
Use the += operator. To handle any number of arguments, including zero:
procedureCallStatement
: ^(PROCEDURECALL procedureName=NAME argument+=expression*)
{
...
}
;
See the tree construction documentation on the ANTLR website.
The above will change the type of the variable argument from typeof(expression) to a List (well, at least when you're generating Java code). Note that the list types are untyped, so it's just a plain list.
If you use multiple parameters with the same variable name, they will also create a list, for example:
twoParameterCall
: ^(PROCEDURECALL procedureName=NAME argument=expression argument=expression)
{
...
}
;
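Once the arguments arrive as a plain list, the action body from the question can check the argument count before indexing into it. The following is only a sketch in plain Java, assuming the collected values are available as a java.util.List; ProcedureDispatcher and requireCount are hypothetical names, and callFooMethod / callBarMethod are stand-ins for the methods mentioned in the question.

```java
import java.util.List;

// Hypothetical stand-in for the code that would live inside the tree-grammar action.
final class ProcedureDispatcher {

    // `arguments` plays the role of the untyped list collected by `argument+=expression*`.
    static String dispatch(String procedureName, List<?> arguments) {
        switch (procedureName) {
            case "foo":
                requireCount(procedureName, arguments, 2);
                return callFooMethod(arguments.get(0), arguments.get(1));
            case "bar":
                requireCount(procedureName, arguments, 3);
                return callBarMethod(arguments.get(0), arguments.get(1), arguments.get(2));
            default:
                throw new IllegalArgumentException("Unknown procedure: " + procedureName);
        }
    }

    // Fail early with a readable message instead of an IndexOutOfBoundsException.
    private static void requireCount(String name, List<?> arguments, int expected) {
        if (arguments.size() != expected) {
            throw new IllegalArgumentException(
                name + " expects " + expected + " argument(s) but got " + arguments.size());
        }
    }

    // Placeholders for the real methods; they only describe the call they received.
    static String callFooMethod(Object a, Object b) {
        return "foo(" + a + ", " + b + ")";
    }

    static String callBarMethod(Object a, Object b, Object c) {
        return "bar(" + a + ", " + b + ", " + c + ")";
    }
}
```

Checking the size first means a call with the wrong number of arguments is reported as a problem in the interpreted program rather than as a crash in the interpreter.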
With a regex, token or rule, it's possible to define a variable like so:
token directive {
:my $foo = "in command";
<command> <subject> <value>?
}
There is nothing about it in the language documentation here, and very little in S05 - Regexes and Rules. To quote:
Any grammar regex is really just a kind of method, and you may declare variables in such a routine using a colon followed by any scope declarator parsed by the Perl 6 grammar, including my, our, state, and constant. (As quasi declarators, temp and let are also recognized.) A single statement (up through a terminating semicolon or line-final closing brace) is parsed as normal Perl 6 code:
token prove-nondeterministic-parsing {
:my $threshold = rand;
'maybe' \s+ <it($threshold)>
}
I get that regexen within grammars are very similar to methods in classes; I get that you can start a block anywhere within a rule and if parsing successfully gets to that point, the block will be executed - but I don't understand what on earth this thing is for.
Can someone clearly define what its scope is, explain what need it fulfills, and give the typical use case?
What scope does :my $foo; have?
:my $foo ...; has the lexical scope of the rule/token/regex in which it appears.
(And :my $*foo ...; -- note the extra * signifying a dynamic variable -- has both the lexical and dynamic scope of the rule/token/regex in which it appears.)
What this is used for
Here's what happens without this construct:
regex scope-too-small { # Opening `{` opens a regex lexical scope.
{ my $foo = / bar / } # Block with its own inner lexical scope.
$foo # ERROR: Variable '$foo' is not declared
}
grammar scope-too-large { # Opening `{` opens lexical scope for grammar.
my $foo = / bar / ;
regex r1 { ... } # `$foo` is recognized inside `r1`...
...
regex r999 { ... } # ...but also inside r999
}
So the : ... ; syntax is used to get exactly the desired scope -- neither too broad nor too narrow.
Typical use cases
This feature is typically used in large or complex grammars to avoid lax scoping (which breeds bugs).
For a suitable example of precise lexical-only scoping, see the declaration and use of @extra_tweaks in token babble as defined in a current snapshot of Rakudo's Grammar.nqp source code.
P6 supports action objects. These are classes with methods corresponding one-to-one with the rules in a grammar. Whenever a rule matches, it calls its corresponding action method. Dynamic variables provide precisely the right scoping for declaring variables that are scoped to the block (method, rule, etc.) they're declared in both lexically and dynamically -- the latter meaning that they're available in the corresponding action method too. For an example of this, see the declaration of @*nibbles in Rakudo's Grammar module and its use in Rakudo's Actions module.
I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate; let's call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParserRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
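The hash check suggested earlier in this answer can be sketched as follows. This is only an illustration under stated assumptions: ValidationCache, isKnownValid and markValid are hypothetical names, the HashMap stands in for the database table, and SHA-256 is one reasonable choice of hash rather than a requirement.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the hash of the last input that passed validation.
final class ValidationCache {

    // Stand-in for the database table: input id -> hex hash of the last valid text.
    private final Map<String, String> lastValidHash = new HashMap<>();

    // Hex-encoded SHA-256 of the input text.
    static String sha256(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // True if this exact text was already validated, so validation can be skipped.
    boolean isKnownValid(String id, String text) {
        return sha256(text).equals(lastValidHash.get(id));
    }

    // Call after the input has passed validation.
    void markValid(String id, String text) {
        lastValidHash.put(id, sha256(text));
    }
}
```

With this in place, the input still has to be parsed to obtain the tree, but the validation pass only needs to run when isKnownValid returns false.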
If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
return a+b;
}
int mul(int a, int b) {
return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects the text of each method and stores it. The parser starts the process at the methodList rule.
void visitMethodList(MethodListContext ctx) {
if(ctx.methodDeclaration() != null) {
String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
// store methodStr for later parsing
}
// visit next method list item, if any
if(ctx.methodList() != null) {
visit(ctx.methodList());
}
}
The second run launches the parsing of each method declaration (in a separate thread, for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
// parse the method block
}
The text of a methodDeclaration rule is formatted because calling ctx.methodDeclaration().getText() directly would concatenate the text of all child nodes (see the ANTLR documentation), possibly making it unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
StringBuilder builder = new StringBuilder();
for(int i = 0; i < tree.getChildCount(); i ++) {
ParseTree child = tree.getChild(i);
if(child instanceof TerminalNode) {
builder.append(child.getText());
builder.append(separator);
} else if(child instanceof RuleContext) {
builder.append(formatParseTree(child, separator));
}
}
return builder.toString();
}
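The second run over the stored blocks could be driven by a thread pool along these lines. This is a sketch, not a definitive implementation: BlockParserPool and parseAll are hypothetical names, and parseBlock is a placeholder for whatever creates a lexer and parser for one block and starts at the methodDeclaration rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper that parses independent blocks in parallel.
final class BlockParserPool {

    // Applies parseBlock to every stored block and returns the results in input order.
    static <T> List<T> parseAll(List<String> blocks, Function<String, T> parseBlock, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String block : blocks) {
                Callable<T> task = () -> parseBlock.apply(block);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // blocks until that block has been parsed
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while parsing blocks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Parsing a block failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task should build its own lexer and parser instance, since sharing one parser object between threads is not something this sketch accounts for.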
I have a problem that I've been stuck on for a while and I would appreciate some help if possible.
I have a few rules in an ANTLR tree grammar:
block
: compoundstatement
| ^(VAR declarations) compoundstatement
;
declarations
: (^(t=type idlist))+
;
idlist
: IDENTIFIER+
;
type
: REAL
| i=INTEGER
;
I have written a Java class VarTable that I will insert all of my variables into as they are declared at the beginning of my source file. The table will also hold their variable types (ie real or integer). I'll also be able to use this variable table to check for undeclared variables or duplicate declarations etc.
So basically I want to be able to send the variable type down from the 'declarations' rule to the 'idlist' rule and then loop through every identifier in the idlist rule, adding them to my variable table one by one.
The major problem is that I get a NullPointerException when I try to access the 'text' attribute of the $t variable in the 'declarations' rule (this is the one that refers to the type).
And yet if I try and access the 'text' attribute of the $i variable in the 'type' rule, there's no problem.
I have looked at the place in the Java file where the NullPointerException is being generated and it still makes no sense to me.
Is it a problem with the fact that there could be multiple types, because the rule is
(^(t=type idlist))+
?
I have the same issue when I get down to the idlist rule, because I'm unsure how to write an action that will allow me to loop through all of the IDENTIFIER tokens found.
Grateful for any help or comments.
Cheers
You can't reference the attributes from production rules like you tried inside tree grammars, only in parser (or combined) grammars (they're different objects!). Note that INTEGER is not a production rule, just a "simple" token (terminal). That's why you can invoke its .text attribute.
So, if you want to get hold of the text of the type rule in your tree grammar and print it in your declarations rule, you could do something like this:
tree grammar T;
...
declarations
: (^(t=type idlist {System.out.println($t.returnValue);}))+
;
...
type returns [String returnValue]
: i=INTEGER {$returnValue = "[" + $i.text + "]";}
;
...
But if you really want to do it without specifying a return object, you could do something like this:
declarations
: (^(t=type idlist {System.out.println($t.start.getText());}))+
;
Note that type returns an instance of a TreeRuleReturnScope which has an attribute called start which in its turn is a CommonTree instance. You could then call getText() on that CommonTree instance.
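The question mentions a VarTable class but does not show it. A minimal sketch of such a table might look like the following; the method names declare, isDeclared and typeOf are assumptions made for illustration. An action in the tree grammar could then call declare once per IDENTIFIER, using the type text obtained as described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical variable table: maps each declared name to its type (e.g. "real" or "integer").
final class VarTable {

    // LinkedHashMap keeps declaration order, which helps when printing diagnostics.
    private final Map<String, String> typesByName = new LinkedHashMap<>();

    // Records a declaration. Returns false (and changes nothing) for a duplicate name.
    boolean declare(String name, String type) {
        if (typesByName.containsKey(name)) {
            return false;
        }
        typesByName.put(name, type);
        return true;
    }

    // True if the name has been declared, useful for reporting undeclared variables.
    boolean isDeclared(String name) {
        return typesByName.containsKey(name);
    }

    // The declared type, or null if the name was never declared.
    String typeOf(String name) {
        return typesByName.get(name);
    }
}
```

Returning a boolean from declare leaves the decision about how to report a duplicate declaration to the caller.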
I have an ANTLR grammar that can parse and evaluate simple expressions like 1+2*4, etc.
What I would like to do is to evaluate expressions like 2+$a-$b/4 where the $ variables are dynamic variables, that come from an external source and are continuously updated.
Is there any design pattern on how to do this using ANTLR, best practices, etc?
Shall I "substring" the $a with the updated value ($a -> 4.34)?
Or is there a nicer way to do this?
Thx
There is actually an example of this in the ANTLR book (The Definitive ANTLR Reference). The pattern is to parse the variable values and add them to a dictionary in the target language:
@members { var dict = new Dictionary<string, int>(); }
decl: e=ID '=' v=expr { dict[$e.Text] = int.Parse($v.Value); };
ID : '$' ('a'..'z'|'A'..'Z')+;
where 'expr' can be any valid expression (including an expression containing a variable.)
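For values that keep changing, the same dictionary idea can be applied at evaluation time instead of parse time: parse the expression once and look each variable up whenever the expression is evaluated. The sketch below is plain Java and independent of ANTLR; the Expr interface and its factory methods are hypothetical names for whatever tree the grammar actions would build.

```java
import java.util.Map;

// Hypothetical expression tree whose variables are resolved on every evaluation.
interface Expr {

    double eval(Map<String, Double> variables);

    // A numeric literal.
    static Expr num(double value) {
        return variables -> value;
    }

    // A variable such as "$a"; its current value is read from the map at eval time.
    static Expr var(String name) {
        return variables -> {
            Double value = variables.get(name);
            if (value == null) {
                throw new IllegalStateException("Undefined variable: " + name);
            }
            return value;
        };
    }

    static Expr add(Expr left, Expr right) {
        return variables -> left.eval(variables) + right.eval(variables);
    }

    static Expr sub(Expr left, Expr right) {
        return variables -> left.eval(variables) - right.eval(variables);
    }

    static Expr div(Expr left, Expr right) {
        return variables -> left.eval(variables) / right.eval(variables);
    }
}
```

The expression 2+$a-$b/4 would then be built once as Expr.sub(Expr.add(Expr.num(2), Expr.var("$a")), Expr.div(Expr.var("$b"), Expr.num(4))) and re-evaluated against the map whenever the external source changes a value, so no string substitution is needed. If the values are updated from another thread, a ConcurrentHashMap would be a reasonable choice for the map.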
I'm using ANTLR 3 to create an AST. I want to have two AST analysers, one to use in production code and one to use in the Eclipse plugin we already have. However, the plugin doesn't require all information in the tree. What I'm looking for is a way to parse the tree without specifying all branches in the grammar. Is there a way to do so?
You may have figured this out already, but I've used . or .* in my tree grammars to skip either a given node or any number of nodes.
For example, I have a DSL that allows function declarations, and one of my tree grammars just cares about names and arguments, but not the contents (which could be arbitrarily long). I skip the processing of the code block using .* as a placeholder:
^(Function type_specifier? variable_name formal_parameters implemented_by? .*)
I don't know about the runtime performance hit, if any, but I'm not using this construct in any areas where performance is an issue for my application.
I don't know what exactly you want to do though, but I set up a boolean flag in the tree walker when I encountered this problem last time. For example:
@members
{
boolean executeAction = true;
}
...
equation
@init{
if(executeAction){
//do your things
}
}
@after{
if(executeAction){
//do your things
}
}
: exp { if(executeAction){/* Do your things */} } EQU exp
;
exp
@init{
if(executeAction){
//do your things
}
}
@after{
if(executeAction){
//do your things
}
}
: integer OPE integer
;
...
...
This way, you can easily switch the execution on or off. You just have to wrap all the action code in an if statement.
The thing is that ANTLR has no built-in way to skip the subsequent rules. They are walked anyway, so the skipping has to be done manually.