Setting AST nodes as transient (effectively removing them from the AST)?

For many cases, a complete AST - as specified in a grammar spec - is great, since other code can obtain any syntactic detail.
My ANTLR-generated parser is meant to statically analyze a programming language. Look at the AST forest it produces: a chain like variable -> base_variable_with_function_calls -> base_variable -> ... wouldn't be of interest there.
The fact that $d is a compound_variable would, on its own, be enough detail.
Therefore: is there some way to mark an ANTLR production rule as transient, so that ANTLR silently parses the rule but doesn't create the intermediate AST nodes?
Obviously, such a tag could only be applied to productions that have a single child node.

No, ANTLR 4 does not support this. The generated parse tree will contain every token matched by the grammar, and will contain a RuleNode for every rule invoked by the grammar.
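There is no such flag, but you can hide the uninteresting levels from your own analysis code by skipping single-child rule nodes while walking the tree. A minimal Java sketch of that workaround (the parse tree itself stays complete; only your traversal collapses the chain):
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ParseTree;

public final class TreeShortcuts {

    // Descends through rule nodes that have exactly one child and returns the
    // first node that is either a leaf or has several children, so a chain like
    // variable -> base_variable_with_function_calls -> base_variable is skipped.
    public static ParseTree collapseSingleChildChain(ParseTree node) {
        while (node instanceof ParserRuleContext && node.getChildCount() == 1) {
            node = node.getChild(0);
        }
        return node;
    }
}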

Related

My lexer token action is not invoked

I use ANTLR 4 with the JavaScript target.
Here is a sample grammar:
P : T ;
T : [a-z]+ {console.log(this.text);} ;
start: P ;
When I run the generated parser, nothing is printed, although the input is matched. If I move the action to the token P, then it gets invoked. Why is that?
Actions are ignored in referenced rules. This was the original behavior of ANTLR 4, back when the lexer only supported a single action per token (and that action had to appear at the end of the token rule).
Several releases later the limitation of one-action-per-rule was lifted, allowing any number of actions to be executed for a token. However, we found that many existing users relied on the original behavior, and wrote their grammars assuming that actions in referenced rules were ignored. Many of these grammars used complicated logic in these rules, so changing the behavior would be a severe breaking change that would prevent people from using new versions of ANTLR 4.
Rather than break so many existing ANTLR 4 lexers, we decided to preserve the original behavior and only execute actions that appear in the same rule as the matched token. Newer versions do allow you to place multiple actions in each rule though.
tl;dr: We considered allowing actions in other rules to execute, but decided not to because it would break a lot of grammars already written and used by people.
I found that @init and @after actions will override this default behavior.
Change the example code to:
grammar Test;
ALPHA : [a-z]+ ;
p : t ;
t
@init {
    console.log(this.text);
}
@after {
    console.log(this.text);
}
  : ALPHA ;
start : p ;
I changed the parser rules to lower case, as my Eclipse tool was complaining about the syntax otherwise. I also had to introduce the lexer rule ALPHA for [a-z]+ for the same reason. The above text compiled, but I haven't tried running the generated parser. However, I am successfully working around this issue with @init/@after in my larger parser.
Hope this is helpful.

How to use ANTLR as an unparser

Does the ANTLR 4 generated code include anything like an unparser that can use the grammar and the parse tree to reconstruct the original source? How would I invoke that if it exists? I ask because it might be useful in some applications and for debugging.
It really depends on what you want to achieve. Remember that lexer tokens that are put onto the HIDDEN channel (like comments and whitespace) are not parsed at all.
The approach I used was:
- use additional user-specific information in the lexer token class
- parse the source and get the AST
- rewind the lexer (token source) and loop over all lexemes, including the hidden ones (an ANTLR 4 sketch of this step follows below)
- for each hidden lexeme, append a reference to it to the corresponding AST leaf, so every AST leaf "knows" which whitespace lexemes follow it
- recursively traverse the AST and print all the lexemes
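In ANTLR 4 terms (the steps above are written with the ANTLR 3 AST model in mind), the "loop over all lexemes" step can look roughly like this; MyLexer stands in for whatever lexer ANTLR generated for your grammar:
import org.antlr.v4.runtime.*;

public class Relex {
    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName(args[0]);
        MyLexer lexer = new MyLexer(input);            // placeholder name
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();                                 // buffer every token, hidden-channel ones included

        StringBuilder rebuilt = new StringBuilder();
        for (Token t : tokens.getTokens()) {
            if (t.getType() != Token.EOF) {
                rebuilt.append(t.getText());           // hidden tokens keep their original text
            }
        }
        System.out.print(rebuilt);                     // matches the original source
    }
}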
Yes! ANTLR's infrastructure (usually) makes the original source data available.
In the default case, you will be using a CommonTokenStream. This inherits from BufferedTokenStream, which offers a whole slew of methods for getting at stuff.
The methods getHiddenTokensToLeft and getHiddenTokensToRight will get you lists of tokens that do not appear on the DEFAULT channel. Those tokens will reveal their source text via getText().
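For example, this little helper (a sketch only; the method names are the real BufferedTokenStream API, everything else is up to you) collects the hidden text that follows a rule:
// Collect the hidden-channel text (whitespace, comments) right after a rule.
String hiddenTextAfter(CommonTokenStream tokens, ParserRuleContext ctx) {
    StringBuilder sb = new StringBuilder();
    List<Token> trailing = tokens.getHiddenTokensToRight(ctx.getStop().getTokenIndex());
    if (trailing != null) {                  // null when nothing hidden follows
        for (Token t : trailing) {
            sb.append(t.getText());
        }
    }
    return sb.toString();
}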
What I find even more convenient is BufferedTokenStream.getText(interval), which will give you the text (hidden tokens included) for an Interval, which you can get from your tree element (a RuleContext).
To make use of your CommonTokenStream and its methods, you just need to pass it from where you create it and set up your parser to whatever class is examining the parse tree, such as your XXXBaseListener - I just gave my Listener a constructor that stores the CommonTokenStream as an instance field.
So when I want the complete text for a rule ctx, I use this little method:
String originalString(ParserRuleContext ctx) {
    return this.tokenStream.getText(ctx.getSourceInterval());
}
Alternatively, the tokens also contain line numbers and offsets, if you want to fiddle with those.
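Put together, the wiring looks roughly like this; XXXLexer, XXXParser, XXXBaseListener and startRule are placeholders for whatever ANTLR generated from your grammar:
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

public class PrintOriginalText {

    static class SourceTextListener extends XXXBaseListener {
        private final CommonTokenStream tokenStream;

        SourceTextListener(CommonTokenStream tokenStream) {
            this.tokenStream = tokenStream;   // handed in from main() below
        }

        @Override
        public void enterEveryRule(ParserRuleContext ctx) {
            // Original text of this rule, hidden whitespace/comments included.
            System.out.println(tokenStream.getText(ctx.getSourceInterval()));
        }
    }

    public static void main(String[] args) throws Exception {
        CharStream input = CharStreams.fromFileName(args[0]);
        XXXLexer lexer = new XXXLexer(input);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        XXXParser parser = new XXXParser(tokens);
        ParseTreeWalker.DEFAULT.walk(new SourceTextListener(tokens), parser.startRule());
    }
}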

Is "Implicit token definition in parser rule" something to worry about?

I'm creating my first grammar with ANTLR and ANTLRWorks 2. I have mostly finished the grammar itself (it recognizes the code written in the described language and builds correct parse trees), but I haven't started anything beyond that.
What worries me is that every first occurrence of a token in a parser rule is underlined with a yellow squiggle saying "Implicit token definition in parser rule".
For example, in this rule, the 'var' has that squiggle:
variableDeclaration: 'var' IDENTIFIER ('=' expression)?;
The odd thing is that ANTLR itself doesn't seem to mind these rules (when running the test rig, I can't see any of these warnings in the parser generator output, just something about an incorrect Java version being installed on my machine), so it's just ANTLRWorks complaining.
Is it something to worry about, or should I ignore these warnings? Should I declare all the tokens explicitly in lexer rules? Most examples in the official bible, The Definitive ANTLR Reference, seem to be written exactly the way I write the code.
I highly recommend correcting all instances of this warning in code of any importance.
This warning was created (by me actually) to alert you to situations like the following:
shiftExpr : ID (('<<' | '>>') ID)?;
Since ANTLR 4 encourages action code to be written in separate files in the target language instead of embedding it directly in the grammar, it's important to be able to distinguish between << and >>. If tokens are not explicitly created for these operators, they will be assigned arbitrary types, and no named constants will be available for referencing them.
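For illustration, assume the grammar declares the operators as named lexer rules, say SHL : '<<' ; and SHR : '>>' ; (names chosen here, not mandated by ANTLR). Listener code can then test against the generated constants; MyParser and MyLexer are likewise placeholders:
// Hypothetical listener method for the shiftExpr rule above. SHL and SHR only
// exist as constants once the operators have explicit lexer rules.
@Override
public void exitShiftExpr(MyParser.ShiftExprContext ctx) {
    if (ctx.getChildCount() == 3) {
        Token op = ((TerminalNode) ctx.getChild(1)).getSymbol();
        if (op.getType() == MyLexer.SHL) {
            // handle '<<'
        } else if (op.getType() == MyLexer.SHR) {
            // handle '>>'
        }
    }
}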
This warning also helps avoid the following problematic situations:
A parser rule contains a misspelled token reference. Without the warning, this could lead to silent creation of an additional token that may never be matched.
A parser rule contains an unintentional token reference, such as the following:
number : zero | INTEGER;
zero : '0'; // <-- this implicit definition causes 0 to get its own token
If you're writing a lexer grammar that won't be used across multiple parser grammars, then you can ignore this warning shown by ANTLRWorks 2.

Generate Java classes from a DSL grammar file

I'm looking for a way to generate a parser from a grammar file (BNF/BNF-like) that will populate an AST. However, I'm also looking to automatically generate the various AST classes in a way that is developer-readable.
Example:
For the following grammar file
expressions = expression+;
expression = CONST | math_expression;
math_expression = add_expression | substract_expression;
add_expression = expression PLUS expression;
substract_expression = expression MINUS expression;
CONST: ('0'..'9')+;
PLUS: '+';
MINUS: '-';
I would like to have the following Java classes generated (with example of what I expect their fields to be):
class Expressions {List<Expression> expression};
class Expression {String const; MathExpression mathExpression;} //only one should be filled.
class MathExpression {AddExpression addExpression; SubstractExpression substractExpression;}
class AddExpression {Expression expression1; Expression expression2;}
class SubstractExpression {Expression expression1; Expression expression2;}
And, in runtime, I would like the expression "1+1-2" to generate the following object graph to represent the AST:
Expressions(Expression(MathExpression(AddExpression(1, SubstractExpression(1, 2)))))
(never mind operator precedence).
I've been exploring DSL parser generators (JavaCC/ANTLR and friends), and the closest thing I could find was using ANTLR to generate a listener class with "enterExpression" and "exitExpression" style methods. I found somewhat similar code generated using JavaCC and jjtree with "multi", but it's extremely awkward and difficult to use.
My grammar needs are somewhat simple - and I would like to automate the AST object graph creation as much as possible.
Any hints?
If you want a lot of support for DSL construction, ANTLR and JavaCC probably aren't the way to go. They provide parsing and some support for building trees... and after that you're on your own. But, as you've figured out, it's a lot of work to design your own trees and work out the details, and you're hardly done with the DSL at that point; you still can't use it.
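To give a concrete idea of what "on your own" means with ANTLR 4: the usual route is a hand-written visitor that maps the generated rule contexts onto your own classes, roughly as sketched below. ExprBaseVisitor, ExprParser and the context/accessor names are what ANTLR would generate for a grammar named Expr with rules like those in the question; treat all of them as illustrative.
// Sketch only: AddExpression and Expression are the hand-written classes from
// the question; the generated names depend on how the grammar is written.
public class AstBuilder extends ExprBaseVisitor<Object> {

    @Override
    public Object visitAdd_expression(ExprParser.Add_expressionContext ctx) {
        AddExpression node = new AddExpression();
        node.expression1 = (Expression) visit(ctx.expression(0));   // left operand
        node.expression2 = (Expression) visit(ctx.expression(1));   // right operand
        return node;
    }

    // ... one visit method per rule, each mapping a context onto its class
}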
There are more complete solutions out there: JetBrains MPS, Xtext, Spoofax, DMS. All of them provide ways to define a DSL, convert it to an internal form ("build trees"), and provide support for code generation. The first three have integrated IDE support and are intended for "small" DSLs; DMS does not, but handles real languages like C++ as well as DSLs. I think the first three are open source; DMS is commercial (I'm the party behind DMS).
Markus Voelter has just released an online book on DSL Engineering, available for your idea of a donation. He goes into great detail on MPS, Xtext and Spoofax, but not on DMS. He tells you what you need to know and what you need to do; based on my skim of the book, it is pretty extensive. You are probably not going to get away with "simple"; DSLs have a lot of semantic complexity, and the supporting machinery is difficult.
I know the author, have huge respect for his skills in this arena, and have co-lectured with him at technical summer schools, including having some nice beer. Otherwise I have nothing to do with this book.

AST with fixed nodes instead of error nodes in ANTLR

I have an ANTLR-generated parser for Java that uses the C target, and it works quite well. The problem is that I also want it to parse erroneous code and still produce a meaningful AST. If I feed it a minimal Java class with one import statement whose semicolon is missing, it produces two "Tree Error Node" objects where the "import" token and the tokens for the imported class should be.
But since it parses the following code correctly and produces the correct nodes for it, it must recover from the error by adding the semicolon or by resyncing. Is there a way to make ANTLR reflect this fixed-up input, which it produces internally, in the AST? Or can I at least get the tokens/text that produced the "Tree Error Node" objects somehow?
In the C target's antlr3commontreeadaptor.c, around line 200, the following fragment indicates that the C target only creates dummy error nodes so far:
static pANTLR3_BASE_TREE
errorNode (pANTLR3_BASE_TREE_ADAPTOR adaptor, pANTLR3_TOKEN_STREAM ctnstream, pANTLR3_COMMON_TOKEN startToken, pANTLR3_COMMON_TOKEN stopToken, pANTLR3_EXCEPTION e)
{
    // Use the supplied common tree node stream to get another tree from the factory
    // TODO: Look at creating the erronode as in Java, but this is complicated by the
    //       need to track and free the memory allocated to it, so for now, we just
    //       want something in the tree that isn't a NULL pointer.
    //
    return adaptor->createTypeText(adaptor, ANTLR3_TOKEN_INVALID, (pANTLR3_UINT8)"Tree Error Node");
}
Am I out of luck here, and would only the error nodes produced by the Java target allow me to retrieve the text of the erroneous input?
I haven't used ANTLR much, but typically the way you handle this type of error is to add rules for matching wrong syntax, make them produce error nodes, and try to fix up after errors so that you can keep parsing. Fixing up afterwards is the hard part, because you don't want one error to trigger more and more errors for each new token until the end.
I solved the problem by adding new alternate rules to the grammar for all possible erroneous statements.
Each Java import statement, for example, gets translated to an AST subtree with the artificial symbol IMPORT as its root. To make sure that I can differentiate between ASTs from correct and erroneous code, the rules for the erroneous statements rewrite them to an AST whose root symbol carries the prefix ERR_; in the example of the import statement, the artificial root symbol would be ERR_IMPORT.
More different root symbols could be used to encode more detailed information about the parse error.
My parser is now as error tolerant as I need it to be, and it's very easy to add rules for new kinds of erroneous input whenever I need to do so. You have to watch out not to introduce any ambiguities into your grammar, though.