Skip part of a tree when parsing an ANTLR AST

I'm using ANTLR 3 to create an AST. I want to have two AST analysers, one to use in production code and one to use in the Eclipse plugin we already have. However, the plugin doesn't require all information in the tree. What I'm looking for is a way to parse the tree without specifying all branches in the grammar. Is there a way to do so?

You may have figured this out already, but I've used . or .* in my tree grammars to skip either a given node or any number of nodes.
For example, I have a DSL that allows function declarations, and one of my tree grammars just cares about names and arguments, but not the contents (which could be arbitrarily long). I skip the processing of the code block using .* as a placeholder:
^(Function type_specifier? variable_name formal_parameters implemented_by? .*)
I don't know about the runtime performance hit, if any, but I'm not using this construct in any areas where performance is an issue for my application.
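For example, a rule that only cares about an assignment's target could skip the right-hand side with a single . wildcard (a minimal sketch with hypothetical node names):
assignment
    :   ^(ASSIGN target=IDENT .)   // '.' matches exactly one child node, whatever it is
    ;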

I don't know exactly what you want to do, but when I ran into this problem I set up a boolean flag in the tree walker. For example:
@members {
    boolean executeAction = true;
}
...
equation
@init {
    if (executeAction) {
        // do your things
    }
}
@after {
    if (executeAction) {
        // do your things
    }
}
    :   exp { if (executeAction) { /* do your things */ } } EQU exp
    ;
exp
@init {
    if (executeAction) {
        // do your things
    }
}
@after {
    if (executeAction) {
        // do your things
    }
}
    :   integer OPE integer
    ;
...
This way, you can easily switch execution on or off; you just have to wrap all the action code in an if statement.
The thing is that ANTLR has no notion of skipping the subsequent rules: they are walked through anyway, so the actions can only be disabled manually.

Related

Alternative to the try (?) operator suited to iterator mapping

In the process of learning Rust, I am getting acquainted with error propagation and the choice between unwrap and the ? operator. After writing some prototype code that only uses unwrap(), I would like to remove unwrap from reusable parts, where panicking on every error is inappropriate.
How would one avoid the use of unwrap in a closure, like in this example?
// todo is VecDeque<PathBuf>
let dir = fs::read_dir(&filename).unwrap();
todo.extend(dir.map(|dirent| dirent.unwrap().path()));
The first unwrap can be easily changed to ?, as long as the containing function returns Result<(), io::Error> or similar. However, the second unwrap, the one in dirent.unwrap().path(), cannot be changed to dirent?.path() because the closure must return a PathBuf, not a Result<PathBuf, io::Error>.
One option is to change extend to an explicit loop:
let dir = fs::read_dir(&filename)?;
for dirent in dir {
todo.push_back(dirent?.path());
}
But that feels wrong - the original extend was elegant and clearly reflected the intention of the code. (It might also have been more efficient than a sequence of push_backs.) How would an experienced Rust developer express error checking in such code?
How would one avoid the use of unwrap in a closure, like in this example?
Well, it really depends on what you wish to do upon failure.
should the failure be reported to the user, or be silent?
if reported, should only the first failure be reported, or all of them?
if a failure occurs, should it interrupt processing?
For example, you could decide to silently ignore all failures and just skip the entries that fail. In this case, Iterator::filter_map combined with Result::ok is exactly what you are asking for.
let dir = fs::read_dir(&filename)?;
todo.extend(dir.filter_map(Result::ok).map(|dirent| dirent.path()));
The Iterator interface is full of goodies; it's definitely worth perusing when looking for tidier code.
Here is a solution based on the filter_map suggested by Matthieu. It calls Result::map_err to ensure the error is "caught" and logged, then passes it on to Result::ok so that filter_map removes it from the iteration:
fn log_error(e: io::Error) {
    eprintln!("{}", e);
}

(|| -> io::Result<()> {
    let dir = fs::read_dir(&filename)?;
    todo.extend(dir
        .filter_map(|res| res.map_err(log_error).ok())
        .map(|dirent| dirent.path()));
    Ok(())
})().unwrap_or_else(log_error)

my lexer token action is not invoked

I use ANTLR 4 with the JavaScript target.
Here is a sample grammar:
P : T ;
T : [a-z]+ {console.log(this.text);} ;
start: P ;
When I run the generated parser, nothing is printed, although the input is matched. If I move the action to the token P, then it gets invoked. Why is that?
Actions are ignored in referenced rules. This was the original behavior of ANTLR 4, back when the lexer only supported a single action per token (and that action had to appear at the end of the token).
Several releases later the limitation of one-action-per-rule was lifted, allowing any number of actions to be executed for a token. However, we found that many existing users relied on the original behavior, and wrote their grammars assuming that actions in referenced rules were ignored. Many of these grammars used complicated logic in these rules, so changing the behavior would be a severe breaking change that would prevent people from using new versions of ANTLR 4.
Rather than break so many existing ANTLR 4 lexers, we decided to preserve the original behavior and only execute actions that appear in the same rule as the matched token. Newer versions do allow you to place multiple actions in each rule though.
tl;dr: We considered allowing actions in other rules to execute, but decided not to because it would break a lot of grammars already written and used by people.
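In practice, the workaround the question already hints at is to move the action into the rule that actually emits the token, since only actions in that rule are executed:
P : T {console.log(this.text);} ;
T : [a-z]+ ;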
I found that @init and @after actions will override this default behavior.
Change the example code to:
grammar Test;
ALPHA : [a-z]+ ;
p : t ;
t
@init {
    console.log(this.text);
}
@after {
    console.log(this.text);
}
    : ALPHA ;
start : p ;
I changed the parser rules to lower case, as my Eclipse tool was complaining about the syntax otherwise. I also had to introduce the ALPHA lexer rule in place of the literal [a-z]+ for the same reason. The above text compiled, but I haven't tried running the generated parser. However, I am successfully working around this issue with @init/@after in my larger parser.
Hope this is helpful.

Serialization of ANTLR ParseTree

I have a generated grammar that does two things:
Check the syntax of a domain specific language
Evaluate input against that domain specific language
These two functions are separate; let's call them validate() and evaluate().
The validate() function builds the tree from a String input while ensuring it meets the requirements of the BNF for the language. The evaluate() function plugs in values to that tree to get a result (usually true or false).
What the code is currently doing is running validate() each time on the input, just to generate the tree that evaluate() uses. Some of the inputs take up to 60 seconds to be checked. What I would LIKE to do is serialize the results of validate() (assuming it meets the syntax requirements), store the serialized form in the backend database, and just load it from the database as part of evaluate().
I noticed that I can execute the method toStringTree() on the parse tree, and retrieve a LISP style tree. However, can I restore a LISP style tree to an ANTLR parse tree? If not, can anyone recommend another way to serialize and store the generated parse tree?
Thanks for any help.
Jason
ANTLR 4's ParserRuleContext data structure (the specific implementation of ParseTree used by generated parsers to represent grammar rules in the parse tree) is not serializable by default. Open issue #233 on the project issue tracker covers the feature request. However, based on my experience with many applications using ANTLR for parsing, I'm not convinced serializing the parse trees would be useful in the long run. For each problem serializing the parse tree is meant to address, a better solution already exists.
Another option is to store a hash of the last known valid file in the database. After you use the parser to create a parse tree, you could skip the validation step if the input file has the same hash as the last time it was validated (a sketch follows the list below). This leverages two aspects of ANTLR 4:
For the same input file, running the parser twice will produce the same parse tree.
The ANTLR 4 parser is extremely fast in almost all cases (e.g. the Java grammar can process around 20MB of source per second). The remaining cases tend to be caused by poorly structured grammar rules that the new parser interpreter feature in ANTLRWorks 2.2 can analyze and make suggestions for improvement.
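A minimal sketch of that hash check (hypothetical helper name, using the standard java.security.MessageDigest API; HexFormat requires Java 17+):
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;

// Returns true if 'file' still matches the hash stored the last time it
// validated, in which case the validation step can be skipped.
static boolean alreadyValidated(Path file, String storedHexHash) throws Exception {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    byte[] hash = digest.digest(Files.readAllBytes(file));
    return HexFormat.of().formatHex(hash).equals(storedHexHash);
}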
If you need performance beyond what you get with this, then a parse tree isn't the data structure you should be using. StringTemplate 4's enormous performance advantage over StringTemplate 3 came primarily from the fact that the interpreter switched from using ASTs (equivalent to parse trees for this reasoning) to a linear bytecode representation/interpreter. The ASTs for ST4 would never need to be serialized for performance reasons because the bytecode would be serialized instead. In fact, the C# port of StringTemplate 4 provides exactly this feature.
If the input data to your grammar is made up of several independent blocks, you could store the string of each block separately, and re-run the parsing process for each block independently, using a thread pool for example.
Say for example your input data is a set of method declarations:
int add(int a, int b) {
return a+b;
}
int mul(int a, int b) {
return a*b;
}
...
and the grammar is something like:
methodList : methodDeclaration methodList
|
;
methodDeclaration : // your method declaration rules...
The first run of the parser just collects the text of each method and stores it. The parser starts the process at the methodList rule.
void visitMethodList(MethodListContext ctx) {
    if (ctx.methodDeclaration() != null) {
        String methodStr = formatParseTree(ctx.methodDeclaration(), " ");
        // store methodStr for later parsing
    }
    // visit next method list item, if any
    if (ctx.methodList() != null) {
        visit(ctx.methodList());
    }
}
The second run launches the parsing of each method declaration (in a separate thread, for example). For this, the parser starts at the methodDeclaration rule.
void visitMethodDeclaration(MethodDeclarationContext ctx) {
    // parse the method block
}
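The body would re-run the parser on the stored text, starting at the methodDeclaration sub-rule. A rough sketch, assuming ANTLR 4.7+ and hypothetical MyLexer/MyParser grammar classes:
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTree;

// Re-parse a previously stored method string in isolation.
ParseTree parseStoredMethod(String methodStr) {
    CharStream input = CharStreams.fromString(methodStr);
    MyLexer lexer = new MyLexer(input);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    MyParser parser = new MyParser(tokens);
    return parser.methodDeclaration();  // start at the sub-rule, not the top rule
}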
The reason the text of a methodDeclaration rule is formatted is that calling ctx.methodDeclaration().getText() directly would concatenate the text of all child nodes without any separator (see the ANTLR getText() documentation), possibly making it unusable for parsing again. If whitespace is a token separator in the grammar, then adding one space between tokens should not change the parse tree.
String formatParseTree(ParseTree tree, String separator) {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < tree.getChildCount(); i++) {
        ParseTree child = tree.getChild(i);
        if (child instanceof TerminalNode) {
            builder.append(child.getText());
            builder.append(separator);
        } else if (child instanceof RuleContext) {
            builder.append(formatParseTree(child, separator));
        }
    }
    return builder.toString();
}

Interpreting a variable number of tree nodes in ANTLR Tree Grammar

Whilst creating an inline ANTLR Tree Grammar interpreter I have come across an issue regarding the multiplicity of procedure call arguments.
Consider the following (faulty) tree grammar definition.
procedureCallStatement
    :   ^(PROCEDURECALL procedureName=NAME arguments=expression*)
        {
            if (procedureName.equals("foo")) {
                callFooMethod(arguments[0], arguments[1]);
            } else if (procedureName.equals("bar")) {
                callBarMethod(arguments[0], arguments[1], arguments[2]);
            }
        }
    ;
My problem lies with the retrieval of the given arguments. If there were a known number of expressions, I would just assign the values coming out of these expressions to their own variables, e.g.:
procedureCallStatement
    :   ^(PROCEDURECALL procedureName=NAME argument1=expression argument2=expression)
        {
            ...
        }
    ;
This however is not the case.
Given a case like this, what is the recommendation on interpreting a variable number of tree nodes inline within the ANTLR Tree Grammar?
Use the += operator. To handle any number of arguments, including zero:
procedureCallStatement
    :   ^(PROCEDURECALL procedureName=NAME argument+=expression*)
        {
            ...
        }
    ;
See the tree construction documentation on the ANTLR website.
The above changes the type of the variable argument from that of a single expression to a List (at least when you're generating Java code). Note that the list is untyped, so it's just a plain List.
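In an action the list label is then available as $argument, a plain java.util.List, so the original dispatch could look like this (a sketch reusing the hypothetical callFooMethod helper from the question; elements come back as Object and may need a cast):
procedureCallStatement
    :   ^(PROCEDURECALL procedureName=NAME argument+=expression*)
        {
            if (procedureName.getText().equals("foo")) {
                callFooMethod($argument.get(0), $argument.get(1));
            }
        }
    ;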
Using += with the same variable name on multiple parameters will likewise collect them into a list, for example:
twoParameterCall
    :   ^(PROCEDURECALL procedureName=NAME argument+=expression argument+=expression)
        {
            ...
        }
    ;

Writing a TemplateLanguage/ViewEngine

Aside from getting any real work done, I have an itch. My itch is to write a view engine that closely mimics a template system from another language (Template Toolkit/Perl). This is one of those "if I had time / do it to learn something new" kinds of projects.
I've spent time looking at CoCo/R and ANTLR, and honestly, it makes my brain hurt, but some of CoCo/R is sinking in. Unfortunately, most of the examples are about creating a compiler that reads source code, but none seem to cover how to create a processor for templates.
Yes, those are the same thing, but I can't wrap my head around how to define the language for templates where most of the source is the html, rather than actual code being parsed and run.
Are there any good beginner resources out there for this kind of thing? I've taken a gander at Spark, which didn't appear to have the grammar in the repo.
Maybe that is overkill, and one could just text-replace the template syntax with C# in the file and compile it. http://msdn.microsoft.com/en-us/magazine/cc136756.aspx#S2
If you were in my shoes and weren't a language creating expert, where would you start?
The Spark grammar is implemented with a kind-of-fluent domain specific language.
It's declared in a few layers. The rules which recognize the HTML syntax are declared in MarkupGrammar.cs - those are based on grammar rules copied directly from the XML spec.
The markup rules refer to a limited subset of C# syntax rules declared in CodeGrammar.cs - those are a subset because Spark only needs to recognize enough C# to adjust single quotes around strings to double quotes, match curly braces, etc.
The individual rules themselves are of type ParseAction<TValue> delegate which accept a Position and return a ParseResult. The ParseResult is a simple class which contains the TValue data item parsed by the action and a new Position instance which has been advanced past the content which produced the TValue.
That isn't very useful on its own until you introduce a small number of operators, as described in Parsing expression grammar, which can combine single parse actions to build very detailed and robust expressions about the shape of different syntax constructs.
The technique of using a delegate as a parse action came from Luke H's blog post Monadic Parser Combinators using C# 3.0. I also wrote a post about Creating a Domain Specific Language for Parsing.
It's also entirely possible, if you like, to reference the Spark.dll assembly and inherit a class from the base CharGrammar to create an entirely new grammar for a particular syntax. It's probably the quickest way to start experimenting with this technique, and an example of that can be found in CharGrammarTester.cs.
Step 1. Use regular expressions (regexp substitution) to split your input template string into a token list, for example, split
hel<b>lo[if foo]bar is [bar].[else]baz[end]world</b>!
to
write('hel<b>lo')
if('foo')
write('bar is')
substitute('bar')
write('.')
else()
write('baz')
end()
write('world</b>!')
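A rough Java sketch of such a regex pass, assuming the [ ... ] directive syntax shown above (a later pass would map "if foo" to if('foo'), "bar" to substitute('bar'), and so on):
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Split a template into write(...) tokens for literal text plus one raw
// token per [ ... ] directive.
static List<String> tokenize(String template) {
    List<String> tokens = new ArrayList<>();
    Matcher m = Pattern.compile("\\[(.*?)\\]").matcher(template);
    int last = 0;
    while (m.find()) {
        if (m.start() > last) {
            tokens.add("write('" + template.substring(last, m.start()) + "')");
        }
        tokens.add(m.group(1));  // e.g. "if foo", "bar", "else", "end"
        last = m.end();
    }
    if (last < template.length()) {
        tokens.add("write('" + template.substring(last) + "')");
    }
    return tokens;
}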
Step 2. Convert your token list to a syntax tree:
* Sequence
** Write
*** ('hel<b>lo')
** If
*** ('foo')
*** Sequence
**** Write
***** ('bar is')
**** Substitute
***** ('bar')
**** Write
***** ('.')
*** Write
**** ('baz')
** Write
*** ('world</b>!')
class Instruction {
}
class Write : Instruction {
    string text;
}
class Substitute : Instruction {
    string varname;
}
class Sequence : Instruction {
    Instruction[] items;
}
class If : Instruction {
    string condition;
    Instruction thenBranch;
    Instruction elseBranch;  // 'else' is a reserved word, so the field needs another name
}
Step 3. Write a recursive function (called the interpreter), which can walk your tree and execute the instructions there.
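A minimal sketch of such an interpreter, written in Java for illustration (assuming Java equivalents of the Instruction classes above, Java 16+ pattern matching, and a hypothetical rule that a condition is true when its variable is set to "true"):
import java.util.Map;

// Walk the instruction tree and append the rendered output to 'out'.
static void execute(Instruction ins, Map<String, String> vars, StringBuilder out) {
    if (ins instanceof Write w) {
        out.append(w.text);
    } else if (ins instanceof Substitute s) {
        out.append(vars.getOrDefault(s.varname, ""));
    } else if (ins instanceof Sequence seq) {
        for (Instruction item : seq.items) {
            execute(item, vars, out);
        }
    } else if (ins instanceof If branch) {
        Instruction taken = "true".equals(vars.get(branch.condition))
                ? branch.thenBranch : branch.elseBranch;
        if (taken != null) {
            execute(taken, vars, out);
        }
    }
}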
An alternative approach (instead of steps 1-3), if your language supports eval() (such as Perl, Python, or Ruby): use a regexp substitution to convert the template to an eval()-able string in the host language, then run eval() to instantiate the template.
There are so many things to do, but it does work for one simple GET statement plus a test. That's a start.
http://github.com/claco/tt.net/
In the end, I had already invested too much time in ANTLR to give loudejs' method a go. I wanted to spend a little more time on the whole process rather than on the parser/lexer. Maybe in version 2 I can have a go at the Spark way, once my brain understands things a little more.
Vici Parser (formerly known as LazyParser.NET) is an open-source tokenizer/template parser/expression parser which can help you get started.
If it's not what you're looking for, then you may get some ideas by looking at the source code.