Generate AST for Java with ANTLR


As far as I know, there are two mechanisms in ANTLR for building abstract syntax trees. I want to build an AST for Java source files.
Question: There are so many grammar rules in Java.g (the Java specification) that specifying AST-generating rules for every one of them would be a lot of work. So I am wondering whether there is a ready-made grammar that already does this, and where I can get it.

This Java 1.5 grammar from the ANTLR Wiki (Java.g) generates an AST and also provides a tree grammar (JavaTreeParser.g).
22 Aug 2014 - UPDATE
Since the original link appears to be dead, the grammar and tree grammar are available in a public Gist: https://gist.github.com/bkiers/741125a606954b24bbf4

Related

antlr - generate grammar from java source code

I am wondering if I can generate an ANTLR grammar from Java source code. I want to do some kind of research project, but for now I am just exploring different open-source tools to see which one is best.
For ANTLR, do I always have to write a grammar and pass it to the ANTLR?
Is there a way to generate grammar from an existing Java source code?

Not easily. ANTLR generates a recursive descent parser from your grammar, encoding the tests into procedural code, as well as lots of other bookkeeping stuff.
Knowing how the code is generated, you might be able to take it apart but you'll have to reach into the middle of generated statements and that isn't easy without a full parser for the generated language. (Hint: regex won't work).
I don't see a lot of point of this exercise. Why don't you just use the original grammar?
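To illustrate what "encoding the tests into procedural code" means, here is a hand-written sketch in the recursive descent style. It is not actual ANTLR output; the grammar, the class name ListRecognizer, and the pre-split token list are all assumptions made for illustration. Each rule becomes a method, and each alternative or loop becomes a lookahead test, which is why recovering the grammar from such code means reading it back out of the control flow.

```java
import java.util.List;

// Hand-written sketch (not actual ANTLR output) for the hypothetical grammar:
//   list : '[' item (',' item)* ']' ;
//   item : ID | list ;
final class ListRecognizer {
    private final List<String> tokens; // pre-split tokens, e.g. "[", "a", ",", "]"
    private int pos = 0;

    ListRecognizer(List<String> tokens) {
        this.tokens = tokens;
    }

    // One token of lookahead; "<EOF>" once the input is exhausted.
    private String la() {
        return pos < tokens.size() ? tokens.get(pos) : "<EOF>";
    }

    private void match(String expected) {
        if (!la().equals(expected)) {
            throw new IllegalStateException("expected " + expected + " but found " + la());
        }
        pos++;
    }

    // list : '[' item (',' item)* ']' ;
    void list() {
        match("[");
        item();
        while (la().equals(",")) { // the (...)* loop becomes a lookahead test
            match(",");
            item();
        }
        match("]");
    }

    // item : ID | list ;
    void item() {
        if (la().equals("[")) { // the alternative is chosen by lookahead
            list();
        } else if (la().matches("[A-Za-z_]\\w*")) {
            pos++; // consume ID
        } else {
            throw new IllegalStateException("unexpected token " + la());
        }
    }

    // Returns true if the whole token list is one well-formed list.
    static boolean accepts(List<String> tokens) {
        ListRecognizer r = new ListRecognizer(tokens);
        try {
            r.list();
            return r.pos == tokens.size();
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Calling ListRecognizer.accepts with the tokens "[", "a", ",", "[", "b", "]", "]" succeeds, while a trailing comma before the closing bracket is rejected.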

"Human-readable" ANTLR-generated code?

I've been learning ANTLR for a few days now. My goal in learning it was to be able to generate parsers and lexers, and then personally hand-translate them from Java into my target language (it is none of C/C++/Java/C#/Python, and no tool supports it). I chose ANTLR because, according to its About page, ANTLR is widely used because it's easy to understand, powerful, flexible, generates human-readable output[...]
In learning this tool, I decided to start with a simple lexer for a simple grammar: JSON. However, once I generated the .java file for this lexer using ANTLR4, I was caught wildly off guard. I got a huge mess of far-from-human-readable serialized code, followed by:
    public static final ATN _ATN =
        ATNSimulator.deserialize(_serializedATN.toCharArray());
    static {
        _decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
    }
A few Google searches were unable to provide me a way to disable this behavior.
Is there a way to disable this behavior and produce human-readable code, or am I going to have to hand-write my lexers and parsers for this target programming language?

ANTLR 4 uses a new algorithm for prediction. Terence Parr is currently working on a tech report describing the algorithm in detail. The human-readable output refers to the generated parsers.
ANTLR 4 lexers use a DFA recognizer for a massive speed and memory usage improvement over previous releases of ANTLR. For parsers, the _ATN field is a data structure used within calls to adaptivePredict (you'll notice lines in the generated code calling that method).
You won't be able to manually translate the generated Java code of an ANTLR 4 lexer to another programming language. You might be able to manually translate the code of a generated parser provided the grammar is strictly LL(1) (i.e. the generated code does not contain any calls to adaptivePredict). However, you will lose the error recovery ability that draws from information encoded in the serialized ATN.
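If hand-writing the lexer ends up being the route taken, the amount of code for JSON is modest. The following is only a minimal sketch under stated assumptions: it is not derived from ANTLR output, the class name TinyJsonLexer and the string-based token format are made up for illustration, and number and string handling are deliberately simplified (no exponents, no validation of escape sequences).

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hand-written lexer sketch for a simplified JSON subset.
// Not derived from ANTLR output; number and string handling are simplified.
final class TinyJsonLexer {
    static List<String> tokenize(String in) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip whitespace between tokens
            } else if ("{}[]:,".indexOf(c) >= 0) {
                out.add("PUNCT(" + c + ")");
                i++;
            } else if (c == '"') {
                int start = i++;
                while (i < in.length() && in.charAt(i) != '"') {
                    if (in.charAt(i) == '\\') i++; // skip the escaped character
                    i++;
                }
                if (i >= in.length()) throw new IllegalArgumentException("unterminated string");
                i++; // consume the closing quote
                out.add("STRING(" + in.substring(start, i) + ")");
            } else if (c == '-' || Character.isDigit(c)) {
                int start = i++;
                while (i < in.length()
                        && (Character.isDigit(in.charAt(i)) || in.charAt(i) == '.')) {
                    i++;
                }
                out.add("NUMBER(" + in.substring(start, i) + ")");
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < in.length() && Character.isLetter(in.charAt(i))) i++;
                String word = in.substring(start, i);
                if (!word.equals("true") && !word.equals("false") && !word.equals("null")) {
                    throw new IllegalArgumentException("unknown literal: " + word);
                }
                out.add("LITERAL(" + word + ")");
            } else {
                throw new IllegalArgumentException("unexpected character: " + c);
            }
        }
        return out;
    }
}
```

A loop of this shape uses only character tests and index arithmetic, so it ports to most target languages without depending on a serialized ATN.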

Purpose of antlr in xtext

I'm new to Xtext and wondering what the purpose of ANTLR is in Xtext. As I understand it so far, ANTLR generates a parser based on the grammar, and the parser then deals with the text models. Right?
And what about the other generated artifacts, like the editor or the Ecore model? Are there other components behind Xtext that generate them?

Xtext needs a parser generator to produce a parser for the language you define. They could have built one of their own. They chose to use ANTLR instead.
I don't know what other third party machinery they might have chosen to use.

I've been hacking one Xtext based plugin and from what I saw I think it works like this:
Xtext has its own BNF syntax, which is very similar to ANTLR's. In fact, it's a subset of it.
Xtext takes your grammar and generates an ANTLR grammar (a .g file) from it. The generated ANTLR grammar adds specific actions to your BNF rules. The action code interacts with the Xtext runtime and (maybe) with Eclipse itself. The .g file is processed by some older version of ANTLR to generate a .java file, which is then compiled.

Systematic way to generate ANTLR tree grammar?

I have a fairly large ANTLR parser grammar file and want to make a tree grammar for it. But, as far as I know, tree grammar generation can't be done automatically, i.e., I have to create it manually by copying the parser grammar, removing some unnecessary code, etc. I want to know if there is a systematic way to generate a tree grammar file from a parser grammar file.
P.S. I read an article that insists that 'Manual Tree Walking Is Better Than Tree Grammars'. Is this reliable information? If so, would it be better for me to make a manual tree walker than writing an ANTLR tree grammar file? And then, how do I make a manual tree walker with my ANTLR parser grammar file (it makes an AST using rewrite rules)?
Thanks in advance.

sky wrote:
I want to know if there is a systematic way to generate a tree grammar file from a parser grammar file
You've already described the systematic way to do this: copy the parser/production rules into the tree grammar and leave only the rewrite rules in it. This will probably handle the larger part of your rules, but for other parser rules (those using inline AST rewrite operators), the result might look slightly different. Because of that, there is no automatic way to generate a tree grammar.
sky wrote:
P.S. I read an article that insists that 'Manual Tree Walking Is Better Than Tree Grammars'. Is this reliable information?
Yes, it is. Note that Terence Parr (creator of ANTLR) posted the article on the ANTLR wiki himself, which suggests that its author (Andy Tripp) raises valid points.
sky wrote:
If so, would it be better for me to make a manual tree walker than writing an ANTLR tree grammar file?
As Andy mentioned in his conclusion: "The decision about whether to use a "Tree Grammar" approach to translation vs. just "doing it by hand" is a matter of taste." So, if you think writing a tree grammar is too much hassle, go the manual way. It's up to you: there is no best way here.
sky wrote:
And then, how do I make a manual tree walker with my ANTLR parser grammar file (it makes an AST using rewrite rules)?
Your parser will create an AST, which by default is of type CommonTree. You can use that tree to get the children, the parent, the type of the token, etc., which is all you need to walk the tree manually.
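A manual walk can be sketched as a plain recursive method. To keep the example runnable without the ANTLR jar, it uses a hypothetical stand-in class Node that mirrors only three CommonTree methods (getText(), getChildCount() and getChild(int)); in real ANTLR 3 code you would walk the tree returned by the parser instead. The operators "+" and "*" and the class name ManualWalker are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an AST node so the sketch runs without the ANTLR jar.
// It mirrors three CommonTree methods: getText(), getChildCount(), getChild(int).
final class Node {
    private final String text;
    private final List<Node> children = new ArrayList<>();

    Node(String text, Node... kids) {
        this.text = text;
        for (Node k : kids) children.add(k);
    }

    String getText() { return text; }
    int getChildCount() { return children.size(); }
    Node getChild(int i) { return children.get(i); }
}

final class ManualWalker {
    // Depth-first walk that renders the tree in a LISP-like form.
    static String render(Node n) {
        if (n.getChildCount() == 0) return n.getText();
        StringBuilder sb = new StringBuilder("(").append(n.getText());
        for (int i = 0; i < n.getChildCount(); i++) {
            sb.append(' ').append(render(n.getChild(i)));
        }
        return sb.append(')').toString();
    }

    // Example action: evaluate an AST whose inner nodes are "+" or "*"
    // and whose leaves are integer literals.
    static int eval(Node n) {
        switch (n.getText()) {
            case "+": return eval(n.getChild(0)) + eval(n.getChild(1));
            case "*": return eval(n.getChild(0)) * eval(n.getChild(1));
            default:  return Integer.parseInt(n.getText());
        }
    }
}
```

For the tree built by new Node("+", new Node("1"), new Node("*", new Node("2"), new Node("3"))), render returns "(+ 1 (* 2 3))" and eval returns 7. Dispatching on the node text keeps the sketch short; with a real AST you would typically switch on the token type instead.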
EDIT
Note that in the next version of ANTLR (version 4) it will (most likely) be possible to automatically generate a tree walker from a combined grammar or a parser grammar.
See:
https://web.archive.org/web/20130620232750/http://www.antlr.org/wiki/display/~admin/ANTLR+v4+plans
https://web.archive.org/web/20130927174157/http://www.antlr.org/wiki/display/~admin/2011/09/05/Auto+tree+construction+and+visitors
https://web.archive.org/web/20130927175520/http://www.antlr.org/wiki/display/~admin/2011/09/08/Sample+v4+generated+visitor

Source for parsing C grammar using JavaCC

As a project assignment, I need to parse a plain-C grammar from Java to generate AST output. As a starting point, I am using the file c.jj that I found among the grammar files at
http://java.net/projects/javacc/sources/svn/
but I found that it only has syntactic and lexical actions and no real semantics for parsing C source. Is there some other source that incorporates typedefs, variables, function constructs, and include files?

You could go looking for a complete grammar. Will you learn much this way?
You could ask your lecturer which would impress them more: implementing some small subset of the C grammar by writing your own rules, or searching Google for a complete set of ready-made rules?
I trust that writing your own rules, or even your own hand-crafted parser, will be a more useful exercise, even if it only parses expressions.
