Which rule does the string match? - antlr

I'm using the Java syntax defined at https://github.com/antlr/grammars-v4/tree/master/java/java
My users are free to input anything, for example
assert image != null;
,
public Color[][] smooth(Color[][] image, int neighberhoodSize)
{
...
}
,
package myapplication.mylibrary;
, and
import static java.lang.System.out; //'out' is a static field in java.lang.System
import static screen.ColorName.*;
My program should tell which syntax the input matches.
What I have up to now is
var stream = CharStreams.fromString(input);
ITokenSource lexer = new JavaLexer(stream);
ITokenStream tokens = new CommonTokenStream(lexer);
// Use the concrete type: the base Parser class has no statement() method.
var parser = new JavaParser(tokens);
// Throw on the first syntax error instead of trying to recover.
parser.ErrorHandler = new BailErrorStrategy();
try
{
    var tree = parser.statement();
    Console.WriteLine("The input is a statement");
}
catch (Exception e)
{
    Console.WriteLine("The input is not a statement");
}
Is there a better way to check whether the input matches any of the 100 rules?

No, there's no other way than trial-and-error. Note that your generated parser has the property:
public static final String[] ruleNames
which you can use in combination with reflection to call all parser rules automatically instead of trying them manually.
Also, trying parser.statement() might not be enough: the input String s = "mu"; FUBAR could be properly parsed by parser.statement() and leave the trailing Identifier (FUBAR) in the token stream. After all, the statement rule probably does not end with an EOF token forcing the parser to consume all tokens. You'll probably have to manually check if all tokens are consumed before determining the input was successfully parsed by a certain parser rule. Also see this Q&A: How to test ANTLR translation without adding EOF to every rule
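The two points above (looping over ruleNames with reflection, and rejecting a match that leaves input unconsumed) can be sketched as follows. This is only an illustration under stated assumptions: ToyParser is a hypothetical stand-in for a generated parser, and atEndOfInput() plays the role of checking that the next token in the stream is EOF. With a real generated parser you would construct a fresh lexer and parser for every attempt.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated parser: a ruleNames array plus one
// public no-argument method per rule that throws when the input does not match.
class ToyParser {
    public static final String[] ruleNames = {"packageDeclaration", "assertStatement"};

    private final String input;
    private int pos = 0; // index just past the text consumed by the last rule

    ToyParser(String input) { this.input = input.trim(); }

    public void packageDeclaration() { consumeThroughSemicolon("package"); }

    public void assertStatement() { consumeThroughSemicolon("assert"); }

    // Toy matching logic: keyword, anything, then the first ';'.
    private void consumeThroughSemicolon(String keyword) {
        int end = input.indexOf(';');
        if (!input.startsWith(keyword + " ") || end < 0) {
            throw new IllegalStateException("input is not a " + keyword + " construct");
        }
        pos = end + 1;
    }

    // Plays the role of "is the next token EOF?" on a real token stream.
    public boolean atEndOfInput() { return input.substring(pos).trim().isEmpty(); }
}

final class RuleProbe {
    // Returns the names of all rules that match the complete input.
    static List<String> matchingRules(String input) {
        List<String> matches = new ArrayList<>();
        for (String rule : ToyParser.ruleNames) {
            ToyParser parser = new ToyParser(input); // fresh parser per attempt
            try {
                Method ruleMethod = ToyParser.class.getMethod(rule);
                ruleMethod.invoke(parser);
                if (parser.atEndOfInput()) {
                    matches.add(rule); // matched and nothing was left over
                }
            } catch (InvocationTargetException e) {
                // The rule method threw: this rule does not match the input.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("rule method not callable: " + rule, e);
            }
        }
        return matches;
    }
}
```

Calling RuleProbe.matchingRules("assert x; FUBAR") yields an empty list in this sketch, because the assert rule succeeds but leaves FUBAR unconsumed.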

Unless you really mean that your users can enter anything (and I suspect that, with some thought, that's not really the case), you could add a parser rule that includes an alternative for each construct your users could enter. You might have to take a little care with the order of the alternatives.
Since the generated parser is recursive descent, a new rule that is not referenced by any other rule has no impact on the rest of the grammar.
Could be worth a shot.

Related

Code indentor using ANTLR 4

I am writing a code indenter using ANTLR4 and Java. I have successfully generated the lexer and the parser, and my approach is to walk the generated parse tree.
ParseTreeWalker mywalker = new ParseTreeWalker();
mywalker.walk(myListener, myTree);
The auto-generated *BaseListener has methods like below...
@Override public void enterEveryRule(ParserRuleContext ctx) { }
I'm very new to ANTLR, but as I understand it, I need to extend *BaseListener, override the relevant methods, and write the indenting code there. So my question is: which methods should I override to indent the input code file? Or, if there is an alternative approach I should take, please let me know.
Thanks!
None. You don't need a parser for this task. Requiring a parser limits you to valid code, so you cannot reformat code that contains a syntax error. Instead, take the lexer and iterate over all tokens. Keep some state to know where you are (in a block, a function, and so on) and indent according to that.
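A minimal sketch of that token-driven approach, assuming a brace-delimited language. The tokens are plain strings here; with a real lexer they would be the texts of the tokens it produces. The class name, the spacing rule (one space between tokens), and the line-break rule (after "{", "}" and ";") are all simplifications chosen for illustration.

```java
import java.util.List;

final class TokenIndenter {
    // Re-indents a flat token sequence using only a nesting counter as state.
    static String indent(List<String> tokens, String unit) {
        StringBuilder out = new StringBuilder();
        int depth = 0;            // current block nesting level
        boolean lineStart = true; // nothing written on the current line yet
        for (String tok : tokens) {
            if (tok.equals("}")) {
                depth = Math.max(0, depth - 1); // tolerate unbalanced input
                if (!lineStart) {
                    out.append('\n'); // a closing brace always starts its own line
                    lineStart = true;
                }
            }
            if (lineStart) {
                out.append(unit.repeat(depth));
            } else {
                out.append(' ');
            }
            out.append(tok);
            lineStart = false;
            if (tok.equals("{") || tok.equals("}") || tok.equals(";")) {
                if (tok.equals("{")) {
                    depth++;
                }
                out.append('\n');
                lineStart = true;
            }
        }
        return out.toString();
    }
}
```

Because no parser is involved, the sketch still produces output for input that would not parse, such as a stray closing brace.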

Serialization of ANTLR ParseTree

I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate; let's call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParserRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated. This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
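The hash check could look roughly like this. It is a sketch under assumptions: the HashMap stands in for the database table, the Predicate stands in for the slow validation step, and the names ValidationCache, isValid and validatorCalls are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

final class ValidationCache {
    // Stand-in for the database: key -> hash of the last input that validated.
    private final Map<String, String> lastValidHash = new HashMap<>();
    // Stand-in for the expensive validation step.
    private final Predicate<String> validator;
    int validatorCalls = 0; // exposed only so the effect of the cache is visible

    ValidationCache(Predicate<String> validator) {
        this.validator = validator;
    }

    boolean isValid(String key, String input) {
        String hash = sha256(input);
        if (hash.equals(lastValidHash.get(key))) {
            return true; // unchanged since the last successful validation
        }
        validatorCalls++;
        boolean ok = validator.test(input);
        if (ok) {
            lastValidHash.put(key, hash);
        }
        return ok;
    }

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```

Only inputs that passed validation are remembered, so an input that failed is always checked again.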
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
If the input data to your grammar is made of several independent blocks, you could try to store the string of each block separately, and run the parsing process again for each block independently, using a ThreadPool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
    return a+b;
}
int mul(int a, int b) {
    return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects the text of each method and stores it. The parser starts at the methodList rule.
void visitMethodList(MethodListContext ctx) {
    if(ctx.methodDeclaration() != null) {
        String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
        // store methodStr for later parsing
    }
    // visit next method list item, if any
    if(ctx.methodList() != null) {
        visit(ctx.methodList());
    }
}
The second run launches the parsing of each method declaration (in a separate thread, for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
    // parse the method block
}
The text of a methodDeclaration rule is formatted because calling ctx.methodDeclaration().getText() directly would concatenate the text of all child nodes without the hidden white space between them (see the ANTLR documentation), possibly making it unusable for parsing again. If white space is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
    StringBuilder builder = new StringBuilder();
    for(int i = 0; i < tree.getChildCount(); i ++) {
        ParseTree child = tree.getChild(i);
        if(child instanceof TerminalNode) {
            builder.append(child.getText());
            builder.append(separator);
        } else if(child instanceof RuleContext) {
            builder.append(formatParseTree(child, separator));
        }
    }
    return builder.toString();
}
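The second run, with one parsing task per stored block, might be organized along these lines. This is a sketch, not the answer's actual code: BlockParsing and parseAll are hypothetical names, and the parseBlock function is a placeholder for building a fresh lexer and parser and starting at the methodDeclaration rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class BlockParsing {
    // Runs parseBlock on every stored block using a fixed-size thread pool and
    // returns the results in the same order as the input blocks.
    static <T> List<T> parseAll(List<String> blocks, Function<String, T> parseBlock, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String block : blocks) {
                Callable<T> task = () -> parseBlock.apply(block);
                futures.add(pool.submit(task));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get()); // waits for that block to finish
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while parsing blocks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a block failed to parse", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task should create its own parser instance rather than share one across threads, which is why the sketch passes only the block text into the task.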

How to use Antlr as an Unparser

Does the ANTLR4 generated code include anything like an unparser that can use the grammar and the parse tree to reconstruct the original source? How would I invoke that if it exists? I ask because it might be useful in some applications and for debugging.
It really depends on what you want to achieve. Remember that lexer tokens placed on the HIDDEN channel (such as comments and white space) are not seen by the parser at all.
The approach I used was:
store additional user-specific information in the lexer token class
parse the source and get the AST
rewind the lexer (token source) and loop over all lexemes, including the hidden ones
for each hidden lexeme, attach a reference to it to the corresponding AST leaf
so every AST leaf "knows" which white-space lexemes follow it
recursively traverse the AST and print all the lexemes
Yes! ANTLR's infrastructure (usually) makes the original source data available.
In the default case, you will be using a CommonTokenStream. This inherits from BufferedTokenStream, which offers a whole slew of methods for getting at stuff.
Methods getHiddenTokensToLeft (and ...ToRight) will get you lists of tokens that are not on the default channel. Those tokens will reveal their source text using getText().
What I find even more convenient is BufferedTokenStream.getText(interval), which will give you the text (including hidden) on an Interval, which you can get from your tree element (RuleContext).
To make use of your CommonTokenStream and its methods, you just need to pass it from where you create it and set up your parser to whatever class is examining the parse tree, such as your XXXBaseListener - I just gave my Listener a constructor that stores the CommonTokenStream as an instance field.
So when I want the complete text for a rule ctx, I use this little method:
String originalString(ParserRuleContext ctx) {
return this.tokenStream.getText(ctx.getSourceInterval());
}
Alternatively, the tokens also contain line numbers and offsets, if you want to fiddle with those.
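To see why reading text from the token buffer recovers the original source while the parser's view does not, here is a toy model. Everything in it is a stand-in invented for illustration (Tok, textBetween, visibleText); it does not use the ANTLR runtime. It only mimics the idea that hidden-channel tokens stay in the buffer even though the parser skips them.

```java
import java.util.List;

final class SourceRebuilder {
    static final int DEFAULT_CHANNEL = 0;
    static final int HIDDEN_CHANNEL = 1;

    // Hypothetical minimal token: its text and the channel it was emitted on.
    static final class Tok {
        final String text;
        final int channel;

        Tok(String text, int channel) {
            this.text = text;
            this.channel = channel;
        }
    }

    // Concatenates every buffered token from start to stop (inclusive),
    // whatever its channel, which reproduces the original source text.
    static String textBetween(List<Tok> buffer, int start, int stop) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= stop; i++) {
            sb.append(buffer.get(i).text);
        }
        return sb.toString();
    }

    // Concatenates only default-channel tokens, i.e. what the parser sees.
    static String visibleText(List<Tok> buffer, int start, int stop) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i <= stop; i++) {
            Tok t = buffer.get(i);
            if (t.channel == DEFAULT_CHANNEL) {
                sb.append(t.text);
            }
        }
        return sb.toString();
    }
}
```

In this model the index range plays the role of the interval obtained from a rule context.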

ANTLR: Define new channel in grammar

I know it is possible to switch between the default and hidden token channels in an ANTLR grammar, but let's say I want a third channel. How can I define a new token channel in the grammar? For instance, let's say I want a channel named ALTERNATIVE.
They're just final ints in the Token class, so you could simply introduce an extra int in your lexer like this:
grammar T;
@lexer::members {
public static final int ALTERNATIVE = HIDDEN + 1;
}
// parser rules ...
FOO
: 'foo' {$channel=ALTERNATIVE;}
;
// other lexer rules ...
A related Q&A: How do I get an Antlr Parser rule to read from both default AND hidden channel
For the C target you can use
//This must be assigned somewhere
@lexer::context {
ANTLR3_UINT32 defaultChannel;
}
TOKEN : 'blah' {$channel=defaultChannel;};
This gets reset after every rule so if you want a channel assignment to persist across rules you may have to override nextTokenStr().

Skip part of a tree when parsing an ANTLR AST

I'm using ANTLR 3 to create an AST. I want to have two AST analysers, one to use in production code and one to use in the Eclipse plugin we already have. However, the plugin doesn't require all information in the tree. What I'm looking for is a way to parse the tree without specifying all branches in the grammar. Is there a way to do so?
You may have figured this out already, but I've used . or .* in my tree grammars to skip either a given node or any number of nodes.
For example, I have a DSL that allows function declarations, and one of my tree grammars just cares about names and arguments, but not the contents (which could be arbitrarily long). I skip the processing of the code block using .* as a placeholder:
^(Function type_specifier? variable_name formal_parameters implemented_by? .*)
I don't know about the runtime performance hit, if any, but I'm not using this construct in any areas where performance is an issue for my application.
I don't know what exactly you want to do though, but I set up a boolean flag in the tree walker when I encountered this problem last time. For example:
@members
{
    boolean executeAction = true;
}
...
equation
@init{
    if(executeAction){
        //do your things
    }
}
@after{
    if(executeAction){
        //do your things
    }
}
    : exp { if(executeAction){/* Do your things */} } EQU exp
    ;
exp
@init{
    if(executeAction){
        //do your things
    }
}
@after{
    if(executeAction){
        //do your things
    }
}
    : integer OPE integer
    ;
...
This way, you can easily switch the execution on or off. You just have to wrap all the action code in an if statement.
The thing is that ANTLR has no built-in way to skip the subsequent rules. They are walked through anyway, so the skipping can only be done manually.