I want to learn how to create refactor tools.
As an example: I want to create migration scripts
for when some library removes deprecated function
and we want to transform code to use the newer adequate functionality.
My idea was to use ANTLR to parse the code into AST,
use some pattern matching on this tree to modify contents,
and output the modified contents.
However, from what I have read, ANTLR does not preserve formatting in the AST,
therefore it would be hard to get unbroken content back.
Do you have a solution that meets these requirements:
lets me modify code while preserving formatting
(optionally) lets me use AST transformations for code transformation
(optionally) can handle a variety of languages, as ANTLR can
The question is not limited to one particular language;
I'd be happy to hear about solutions created for different languages.
If you want a
general purpose tool to parse source code from arbitrary languages producing ASTs
apply procedural or preferably source-to-source pattern-directed rewrite rules to manipulate the ASTs
regenerate valid source code retaining formatting and comments
I know of only two systems at present that can do this robustly.
RASCAL Metaprogramming language, a research platform
Semantic Designs' (my company) DMS Software Reengineering Toolkit (DMS)
You probably don't want to try building frameworks like this yourself; these tools both have decades of PhD level investment to make them practical.
One issue that occurs repeatedly is the mistake of thinking that having a parser (e.g., ANTLR) solves most of the problem. See my essay on Life After Parsing. A key insight is that you can't transform "just the syntax (ASTs)" without context; you have to take into account the language semantics, and whatever framework you choose had better help you do the semantic analysis needed to support the rewrite rules.
There are other (general purpose) program transformation systems out there. Many are research. Few have been used to do serious software reengineering in practice. RASCAL has been applied to some quite interesting tasks but not in any commercial context that I know. DMS has been used in production for over 20 years to carry out massive code base changes including refactoring, API revision, and fully automated language migrations.
ANTLR has a TokenStreamRewriter class that is very good at preserving your source input.
It has very robust capabilities. It allows you to delete, insert, or replace text in the input stream. It actually stores up a series of pending changes and then applies them when you ask for the modified input stream (it even allows for rolling back changes, as well as multiple sets of changes).
A couple of examples from a recent presentation I did that touched on the Rewriter:
private void plus0(RefactorUtilContext ctx, String pName) {
    for (var match : plus0PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
    }
    for (var match : plus0PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
    }
}

private void times1(RefactorUtilContext ctx, String pName) {
    for (var match : times1PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
        rewriter.insertBefore(pName, matchCtx.op, "/* ");
        rewriter.insertAfter(pName, matchCtx.rhs.getStart(), " */");
    }
    for (var match : times1PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
        rewriter.insertBefore(pName, matchCtx.lhs.getStart(), "/* ");
        rewriter.insertAfter(pName, matchCtx.op, " */");
    }
}
TokenStreamRewriter basically just stores a set of instructions about how to modify your input stream, so everything about your input stream that you don't modify is, ummmm, unmodified :).
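To make the "pending changes" idea concrete, here is a minimal sketch in plain Java. It is not ANTLR's implementation: the class PendingEditRewriter and its character-offset API are hypothetical, invented for illustration (the real TokenStreamRewriter works on token indexes and Token objects). It only shows why text you never touch comes back byte for byte.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the idea behind a stream rewriter. */
class PendingEditRewriter {
    /** One queued change: replace [start, end) of the original with text. */
    private static final class Edit {
        final int start;
        final int end;
        final String text;

        Edit(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    private final String original;
    private final List<Edit> edits = new ArrayList<>();

    PendingEditRewriter(String original) {
        this.original = original;
    }

    // Each operation only records an instruction; nothing is modified yet.
    void delete(int start, int end) { edits.add(new Edit(start, end, "")); }
    void insertBefore(int index, String text) { edits.add(new Edit(index, index, text)); }
    void replace(int start, int end, String text) { edits.add(new Edit(start, end, text)); }

    /** Applies the queued edits to a copy; the original text is never mutated. */
    String getText() {
        List<Edit> sorted = new ArrayList<>(edits);
        sorted.sort((a, b) -> Integer.compare(a.start, b.start)); // stable sort
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Edit e : sorted) {
            if (e.start < pos) {
                throw new IllegalStateException("overlapping edits");
            }
            out.append(original, pos, e.start); // untouched text copied verbatim
            out.append(e.text);
            pos = e.end;
        }
        out.append(original, pos, original.length());
        return out.toString();
    }
}

class PendingEditDemo {
    public static void main(String[] args) {
        PendingEditRewriter r = new PendingEditRewriter("x = a + 0;  // keep me");
        r.delete(5, 9); // drop " + 0", like the plus0 refactoring above
        System.out.println(r.getText()); // prints "x = a;  // keep me"
    }
}
```

The comment and the double space before it survive because they were never part of any edit, which is the whole trick.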
You may also wish to look into the XPath capabilities that ANTLR has. These allow you to find very specific patterns in the parse tree to locate the portions you would like to refactor. As the name suggests, the syntax is very similar to XPath for XML documents, but works on the parse tree instead of an XML DOM.
Note: all of these operate on the parse tree, not an AST. An AST would necessarily be of your own design, so ANTLR wouldn't know its structure. Also, it's in the nature of ASTs to drop irrelevant information (comments, whitespace, etc.), so they're just not a good starting point for anything where you want to preserve formatting.
I put together a quite small project for this presentation and it's on GitHub at LittleCalc. The most relevant file is LittleCalcExecutionVisitor.
You can start a REPL and test things out by running LittleCalcRepl.java
Related
If I have an AST and modify it, can I use StringTemplates to generate the source code for the modified AST?
I have successfully implemented my grammar for Antlr4. It generates the AST of a source code and I use the Visitor Class to perform the desired actions. I then modify something in the AST and I would like to generate the source code for that modified AST. (I believe it is called pretty-printing?).
Does Antlr's built in StringTemplates have all the functionality to do this? Where should one start (practical advice is very welcome)?
You can walk the tree and use string templates (or even plain string prints) to spit out text equivalents that to some extent reproduce the source text.
But you will find reproducing the source text in a realistic way harder to do than this suggests. If you want back code that the original programmer will not reject, you need to:
Preserve comments. I don't think ANTLR ASTs do this.
Generate layout that preserves the original indentation.
Preserve the radix, leading-zero count, and other "format" properties of literal values.
Regenerate strings with reasonable escapes.
Doing all of this well is tricky. See my SO answer How to compile an AST back to source code for more details. (Weirdly, the ANTLR guy suggests not using an AST at all; I'm guessing this is because string templates only work on ANTLR parse trees whose structure ANTLR understands, vs. ASTs which are whatever you home-rolled.)
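As a small illustration of the literal-format point, here is a sketch of a tree node that carries the original spelling next to the decoded value. The class IntLiteral and its methods are hypothetical names, not part of ANTLR or any other tool; the only library call is the standard Long.decode.

```java
/** Hypothetical AST node that keeps the source spelling of an integer literal. */
class IntLiteral {
    final long value;    // decoded value, used for analysis
    final String lexeme; // exact source text, used for regeneration

    private IntLiteral(long value, String lexeme) {
        this.value = value;
        this.lexeme = lexeme;
    }

    /** Long.decode understands decimal, 0x/0X/# hex, and leading-0 octal. */
    static IntLiteral parse(String lexeme) {
        return new IntLiteral(Long.decode(lexeme), lexeme);
    }

    /** What a naive pretty-printer emits: radix and leading zeros are gone. */
    String naivePrint() {
        return Long.toString(value);
    }

    /** What the original programmer expects to see again. */
    String toSource() {
        return lexeme;
    }
}

class IntLiteralDemo {
    public static void main(String[] args) {
        IntLiteral mask = IntLiteral.parse("0x00FF");
        System.out.println(mask.naivePrint()); // prints "255"
        System.out.println(mask.toSource());   // prints "0x00FF"
    }
}
```

This only covers literals you leave alone. Once a rewrite changes the value, you still have to decide which radix and width to regenerate, which is part of why the job is bigger than it looks.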
If you get all of this right, what you are likely to discover is that modifying the parse tree/AST is harder than it looks. For almost any interesting task on complex languages, you need information which is not trivial to extract from the tree (e.g., what is the meaning of this identifier?, where is this variable used?,...) I call this the problem of Life After Parsing. My main point is that it takes a lot of machinery to modify ASTs and regenerate code; be aware of the size of your project.
Where do we start to manually build a CST from scratch? Or does ANTLR4 always require the lex/parse process as our input step?
I have some visual elements in my program that represent code structures.
e.g. a square represents a class, while a circle embedded within that square represents a method.
Now I want to turn those into code. How do I use ANTLR4 to do this, at runtime (using ANTLR4.js)? Most of the ANTLR examples seem to rely on lexing and parsing existing code to get to a syntax tree. So rather than:
input code->lex->parse->syntax tree->output code (1)
I want
manually create syntax tree->output code (2)
(Later, as the user adds code to that class and its methods, then ANTLR will be used as in (1).)
EDIT Maybe I'm misunderstanding this. Do I create some custom data structure and then run the parser over it? i.e. write structures to some in-memory format->parse->output code (3)?
IIUC, you could use StringTemplate directly.
By way of background, Antlr itself builds an in-memory parse-tree and then walks it, incrementally calling StringTemplate to output code snippets qualified by corresponding parse-tree node data. That Antlr uses an internal parse-tree is just a convenience for simplifying walking (since Antlr is built using Antlr).
If you have your own data structure, regardless of its specific implementation, procedurally process it to progressively call ST templates to emit the corresponding code. And, you can directly use the same templates that Antlr uses (JavaScript.stg), if they meet your requirements.
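A rough sketch of the "procedurally process your own data structure" route, written in Java with plain string building instead of StringTemplate so that it stays self-contained. ClassShape and CodeEmitter are hypothetical names for the square-and-circle model from the question, and the emitted JavaScript-style skeleton is only an assumed target format. With ST you would fill a template at each step instead of appending strings.

```java
import java.util.List;

/** Hypothetical model of one visual element: a class "square" with method "circles". */
class ClassShape {
    final String name;
    final List<String> methods;

    ClassShape(String name, List<String> methods) {
        this.name = name;
        this.methods = methods;
    }
}

/** Walks the model directly and emits source text; no lexer or parser is involved. */
class CodeEmitter {
    static String emit(ClassShape shape) {
        StringBuilder out = new StringBuilder();
        out.append("class ").append(shape.name).append(" {\n");
        for (String method : shape.methods) {
            // One snippet per child element, like one template per node type.
            out.append("  ").append(method).append("() {\n  }\n");
        }
        out.append("}\n");
        return out.toString();
    }
}

class CodeEmitterDemo {
    public static void main(String[] args) {
        ClassShape dog = new ClassShape("Dog", List.of("bark", "fetch"));
        System.out.print(CodeEmitter.emit(dog));
    }
}
```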
Of course, if your data structure is of a nature that can be lex'd/parsed into a standard Antlr parse-tree, you can then use a standard Antlr visitor to call and populate node-specific templates.
I want to use several encodings in the presentation layer to encode an object/structure in the application layer independently of the encoding scheme (such as binary, XML, etc.) and programming language (Java, JavaScript, PHP, C).
An example would be to transfer an object from a producer to a consumer in a byte stream. The Java client would encode it using something like this:
Object var = new Dog();
output.writeObject(var);
The server would share the Dog class definitions and could regenerate the object doing something like this:
Object var = input.readObject();
assertTrue(var instanceof Dog); // passes
It is important to note that producer and consumer would not share the type of var, and, therefore, the consumer would not need the type to decode var. They only would share data type definitions, if ever:
public interface Pojo {}
public class Dog implements Pojo { int i; String s; } // Generated by framework from a spec
What I found:
Java Serialization: It is language dependent. It cannot be used with, for example, JavaScript.
Protobuf library: It is limited to a specific binary format. It is not possible to support additional binary formats. Need name of class ("class" of message).
XStream, Simple, etc. They are rather limited to text/XML and require name of the class.
ASN.1: The standards are there and could be used with OBJECT IDENTIFIER and type definitions, but they lack documentation and tutorials.
I prefer the fourth option because, among other reasons, it is a standard. Is there any active project that supports such requirements (especially something based on ASN.1)? Any usage example? Does the project include codecs (DER, BER, XER, etc.) that can be selected at runtime?
Thanks
You can find several open source and commercial implementations of tools for ASN.1. These usually include:
a compiler for the schema, which will generate code in your desired programming language
a runtime library which is used together with the generated code for encoding and decoding
ASN.1 is mainly used with the standardized communication protocols for telecom industry, so the commercial tools have very good support for the ASN.1 standard and various encoding rules.
Here are some starter tutorials and even free e-books:
http://www.oss.com/asn1/resources/asn1-made-simple/introduction.html
http://www.oss.com/asn1/resources/reference/asn1-reference-card.html
http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html
I know that the OSS ASN.1 commercial tools (http://www.oss.com/asn1/products/asn1-products.html) will support switching the encoding rules at runtime.
To add to bosonix's answer, there are also Objective Systems' tools at http://www.obj-sys.com/. The documentation from both OSS and Objective Systems includes many example uses.
ASN.1 is pretty much perfect for what you're looking for. I know of no other serialisation system that does this quite so thoroughly.
As well as a whole array of different binary encodings (ranging from the comprehensively tagged BER all the way down to the very packed-together PER), it does XML and now also JSON encodings too. These are well standardised by the ITU, so it is in theory fully interoperable between tool vendors, programming languages, OSes, etc.
There are other significant benefits to ASN.1. The schema language lets you define constraints on the value of message fields, or the sizes of arrays. These then get checked for you by the generated code. This is far more complete than many other serialisations. For instance, Google Protocol Buffers doesn't let you do this, meaning that you have to check the range of message fields (where applicable) in hand written code. That's tedious, error prone, and hard to maintain.
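To show what that buys you: in ASN.1 a constraint is part of the type, e.g. Age ::= INTEGER (0..150). Without schema-level constraints you end up writing something like the sketch below by hand for every constrained field. The class Age and the 0..150 range are made-up illustrations, not the output of any particular ASN.1 compiler.

```java
/** Hand-written range check, roughly the chore a constraint-aware codec generator does for you. */
class Age {
    static final int MIN = 0;
    static final int MAX = 150;

    final int value;

    Age(int value) {
        // Reject out-of-range values at construction, before they can be encoded.
        if (value < MIN || value > MAX) {
            throw new IllegalArgumentException(
                    "Age must be in " + MIN + ".." + MAX + " but was " + value);
        }
        this.value = value;
    }
}

class AgeDemo {
    public static void main(String[] args) {
        System.out.println(new Age(42).value); // prints "42"
        try {
            new Age(200);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```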
The only other ones that do this are XSD and JSON schemas. However with those you're at the mercy of the varying quality of tools used to turn those into source code - I've not yet seen any decent ones for JSON schemas. I'm not aware of whether or not Microsoft's xsd.exe honours such constraints either.
I've been learning ANTLR for a few days now. My goal in learning it was to be able to generate parsers and lexers and then hand-translate them from Java into my target language (which is not C, C++, Java, C#, or Python; no tool supports it). I chose ANTLR because of this from its About page: ANTLR is widely used because it's easy to understand, powerful, flexible, generates human-readable output[...]
In learning this tool, I decided to start with a simple lexer for a simple grammar: JSON. However, once I generated the .java file for this lexer using ANTLR4, I was caught wildly off guard. I got a huge mess of far-from-human-readable serialized code, followed by:
public static final ATN _ATN =
ATNSimulator.deserialize(_serializedATN.toCharArray());
static {
_decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
}
A few Google searches were unable to provide me a way to disable this behavior.
Is there a way to disable this behavior and produce human-readable code, or am I going to have to hand-write my lexers and parsers for this target programming language?
ANTLR 4 uses a new algorithm for prediction. Terence Parr is currently working on a tech report describing the algorithm in detail. The human-readable output refers to the generated parsers.
ANTLR 4 lexers use a DFA recognizer for a massive speed and memory usage improvement over previous releases of ANTLR. For parsers, the _ATN field is a data structure used within calls to adaptivePredict (you'll notice lines in the generated code calling that method).
You won't be able to manually translate the generated Java code of an ANTLR 4 lexer to another programming language. You might be able to manually translate the code of a generated parser provided the grammar is strictly LL(1) (i.e. the generated code does not contain any calls to adaptivePredict). However, you will lose the error recovery ability that draws from information encoded in the serialized ATN.
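For a feel of what strictly LL(1) code looks like when every decision is made by peeking at one symbol, here is a hand-written sketch. It is not ANTLR output: the grammar (a bracketed list of integers) and the class Ll1ListParser are invented for illustration. Code of this shape ports to another language easily precisely because no ATN or adaptivePredict call is involved.

```java
import java.util.ArrayList;
import java.util.List;

/** Hand-written LL(1) recursive descent parser for: list : '[' ( num ( ',' num )* )? ']' ; */
class Ll1ListParser {
    private final String in;
    private int pos = 0;

    Ll1ListParser(String in) {
        this.in = in;
    }

    List<Integer> list() {
        List<Integer> out = new ArrayList<>();
        expect('[');
        if (peek() != ']') {          // one symbol of lookahead picks the alternative
            out.add(num());
            while (peek() == ',') {   // and decides whether the loop continues
                expect(',');
                out.add(num());
            }
        }
        expect(']');
        return out;
    }

    private int num() {
        skipWhitespace();
        int start = pos;
        while (pos < in.length() && Character.isDigit(in.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalStateException("digit expected at offset " + pos);
        }
        return Integer.parseInt(in.substring(start, pos));
    }

    private char peek() {
        skipWhitespace();
        return pos < in.length() ? in.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalStateException("'" + c + "' expected at offset " + pos);
        }
        pos++;
    }

    private void skipWhitespace() {
        while (pos < in.length() && Character.isWhitespace(in.charAt(pos))) {
            pos++;
        }
    }
}

class Ll1Demo {
    public static void main(String[] args) {
        System.out.println(new Ll1ListParser("[1, 22,3]").list()); // prints "[1, 22, 3]"
    }
}
```

Note that this sketch just throws on the first error, which mirrors the caveat above: the recovery machinery is what you give up.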
public void visitToken(DetailAST aAST) {}
I am trying to write a custom Checkstyle rule. I am interested in TokenTypes.STRING_LITERAL. The problem with this approach is that a string might be a concatenated string, might be built with a StringBuffer or StringBuilder, or could be within a method.
Bear with me, as I am a newbie to the Checkstyle coding.
How do I get a full string if it is concatenated? The aAST seems to be spitting them out as individual string literals.
Is there another way to grab a complete string?
Any pointers, greatly appreciated.
This is hard to do in Checkstyle, because Checkstyle works purely on the AST. It is no compiler, so it does not know about runtime types or semantic meaning.
So, in order to do this using Checkstyle, you would have to analyze the AST manually and build your concatenated String by hand. If parts of the String are generated by, say, static methods, or by using a StringBuilder/StringBuffer, then I would say the task of finding the complete String by AST analysis becomes virtually impossible.
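The "build it by hand" part could look roughly like the sketch below for the simple case of literals joined with +. It deliberately does not use Checkstyle's DetailAST API: Node, its Kind enum, and ConcatFolder are hypothetical stand-ins, and the literal text is assumed to be already unquoted. In a real check you would apply the same recursion to the children of the PLUS token.

```java
import java.util.Optional;

/** Hypothetical, minimal expression node; a stand-in for a real AST node type. */
class Node {
    enum Kind { STRING_LITERAL, PLUS, OTHER }

    final Kind kind;
    final String text; // literal contents without quotes (assumption of this sketch)
    final Node left;
    final Node right;

    private Node(Kind kind, String text, Node left, Node right) {
        this.kind = kind;
        this.text = text;
        this.left = left;
        this.right = right;
    }

    static Node lit(String text) { return new Node(Kind.STRING_LITERAL, text, null, null); }
    static Node plus(Node left, Node right) { return new Node(Kind.PLUS, null, left, right); }
    static Node other() { return new Node(Kind.OTHER, null, null, null); }
}

class ConcatFolder {
    /** Returns the full string if the subtree is made only of literals and '+'. */
    static Optional<String> fold(Node node) {
        switch (node.kind) {
            case STRING_LITERAL:
                return Optional.of(node.text);
            case PLUS: {
                Optional<String> left = fold(node.left);
                Optional<String> right = fold(node.right);
                if (left.isPresent() && right.isPresent()) {
                    return Optional.of(left.get() + right.get());
                }
                return Optional.empty();
            }
            default:
                // Method calls, variables, builders: not knowable from syntax alone.
                return Optional.empty();
        }
    }
}

class ConcatFolderDemo {
    public static void main(String[] args) {
        Node expr = Node.plus(Node.plus(Node.lit("SELECT * "), Node.lit("FROM t ")), Node.lit("WHERE x"));
        System.out.println(ConcatFolder.fold(expr).orElse("<not a constant>"));
    }
}
```

The default branch is where the approach runs out: anything that is not a literal or a + gives up, which is exactly the StringBuilder and static-method case described above.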
Instead, you might want to look at other static code analysis tools which might be better suited to your task. FindBugs, for instance, works on the compiled code and is generally able to perform quite sophisticated checks. However, it takes more resources to run than Checkstyle, and on older machines you may not be able to have FindBugs run automatically on save in your IDE.