ANTLR: Source to Target Language Conversion

I have a fair understanding of ANTLR and grammars. Is it correct to say that ANTLR can do source-language to target-language conversion, like ASP to JSP or COBOL to JSP? If yes, could you point me to some information/tutorials/links for exploring the possibilities?
The idea is to programmatically translate huge amounts of code from the source language to the target language using ANTLR.
Thanks

The basic steps to building a translator in Antlr4 are:
1. Generate a parse tree from the input text in the source language.
2. Repeatedly walk the parse tree to analyze its nodes, adding and evolving properties (decorator pattern) associated with individual parse-tree nodes -- the properties describe the change(s) required to represent the content of each node in the target language.
3. Do a final walk of the parse tree to collect and output the target-language text.
The form and content of the properties and the progression of creation and evolution will be entirely dependent on the nature of the source and target languages and the architect's conversion strategy.
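A minimal Java sketch of those steps (the grammar name Source and the pass listeners ControlFlowPass, ScopingPass, and OutputPass are placeholders for your own generated classes and analysis code):

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.tree.*;

    public class Translator {
        public static void main(String[] args) throws Exception {
            // Step 1: parse the source text into a parse tree.
            CharStream input = CharStreams.fromFileName(args[0]);
            SourceLexer lexer = new SourceLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            SourceParser parser = new SourceParser(tokens);
            ParseTree tree = parser.compilationUnit();   // your start rule

            // The "decorations": per-node target-language text and analysis results.
            ParseTreeProperty<String> emitted = new ParseTreeProperty<>();

            // Step 2: repeated, logically independent walks that add and
            // evolve the properties hung off individual parse-tree nodes.
            ParseTreeWalker walker = new ParseTreeWalker();
            walker.walk(new ControlFlowPass(emitted), tree);
            walker.walk(new ScopingPass(emitted), tree);

            // Step 3: a final walk collects and outputs the target text.
            OutputPass output = new OutputPass(emitted);
            walker.walk(output, tree);
            System.out.println(output.getResult());
        }
    }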
Since Antlr parse-tree walks can be logically independent of one another, specific conversion aspects can be addressed in separate walks. For example, one walk can evaluate (possibly among other things) whether individual PERFORM UNTIL statements will be converted to if or while statements. Another walk can be dedicated to analyzing variable names, ensuring they are created/accessed in the correct scope, and determining the naming and scope of any temporary variables the target language requires. Etc.
Given that the conversion is a one-time affair, there is no fundamental penalty to implementing 5, 10, or even more walks -- just whatever makes sense in your case.
The (relevant) caveat addressed in the other QA is how to handle conversions where there is no simple or near identity between statements in the two languages. Converting such a unique source-language statement then requires creating a target-language run-time package that implements the corresponding function.
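For instance (purely illustrative), a COBOL MOVE to a numeric-edited field has no single-statement Java equivalent, so converted code would call into a hand-written run-time helper such as this hypothetical one:

    import java.math.BigDecimal;
    import java.text.DecimalFormat;

    // Hypothetical run-time support package: implements source-language
    // behavior that has no direct counterpart in the target language.
    public final class CobolRuntime {
        private CobolRuntime() {}

        // Approximates a MOVE to a field edited like PIC Z,ZZ9.99; a real
        // version would interpret the full PICTURE clause.
        public static String moveEdited(BigDecimal value) {
            return new DecimalFormat("#,##0.00").format(value);
        }
    }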
GenPackage (I am the author) automates the generation of a basic conversion project. The generated project represents but one possible architectural approach and leaves substantial work to be done to tailor it to any particular end use.

ANTLR4: Extracting expressions from languages

I have a programming language that has many constructs in it; however, I am only interested in extracting expressions from the language.
Is it possible to do that without having to write the entire grammar?
Yes, it's possible. You want what is called an "island parser": https://en.wikipedia.org/wiki/Island_grammar. You might not actually decide to do this; more on that below.
The essential idea is to provide detailed grammar rules for the part of the language ("islands") you care about, and sloppy rules for the rest ("water").
The detailed grammar rules you write as you would normally write them. This includes building a lexer and parser to parse the part you want.
The "water" part is implemented as far as you can by defining sloppy lexemes. You may need more than one, and you will likely have to handle nested structures, e.g., things involving "("...")", "["..."]", and "{"..."}"; you will end up with explicit tokens for the boundaries of these structures and recursive grammar rules that track the nesting (because lexers, being FSAs, typically can't track this).
Not obvious when you start, but painfully obvious once you are deep into this mess, is skipping over long comment bodies, and especially string literals with the various quotes allowed by the language (consider Python for an over-the-top set) and the escape sequences inside. You'll get burned by languages that allow interpolated strings when you figure out that you have to lex the raw string content separately from the interpolated expressions, because these are also typically nested structures. PHP and C# allow arbitrary expressions in their interpolated strings... including expressions which can themselves contain... more interpolated strings!
The upside is that none of this is really hard technically, if you ignore the sweat labor of dreaming up and handling all the funny cases.
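To make the detailed/sloppy split concrete, here is a minimal ANTLR4 sketch for a made-up input format where the islands are ${...} expressions and everything else is water; lexer modes keep the detailed rules from firing outside the islands (nested braces and the caveats above are deliberately ignored):

    // IslandLexer.g4 -- one sloppy "water" rule plus a mode for the island
    lexer grammar IslandLexer;

    OPEN  : '${' -> pushMode(ISLAND) ;
    WATER : .    ;                      // catch-all: one uninterpreted character

    mode ISLAND;
    CLOSE  : '}' -> popMode ;
    LPAREN : '(' ;   RPAREN : ')' ;
    ADD    : '+' ;   SUB    : '-' ;
    MUL    : '*' ;   DIV    : '/' ;
    INT    : [0-9]+ ;
    ID     : [a-zA-Z_] [a-zA-Z_0-9]* ;
    WS     : [ \t\r\n]+ -> skip ;

    // IslandParser.g4 -- detailed rules only for the part we care about
    parser grammar IslandParser;
    options { tokenVocab = IslandLexer; }

    file   : (island | WATER)* EOF ;
    island : OPEN expr CLOSE ;
    expr   : expr (MUL | DIV) expr
           | expr (ADD | SUB) expr
           | LPAREN expr RPAREN
           | ID
           | INT
           ;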
... but ... considering typical parsing goals, island grammars tend to fall apart when used for this purpose.
To process expressions, you usually need the language's declarations, which provide types for the identifiers. If you leave them in the "water" part, you don't get type declarations, and now it is hard to reason about your expressions. If you are processing Java and you encounter (a+b), is that addition or string concatenation? Without type information you just don't know.
If you decide you need the type information, now you need the detailed grammar for the variable and type declarations. And suddenly you're a lot closer to a full parser. At some point, you bail and just build a full parser; then you don't have to think about whether you've cheated properly.
You don't mention your language, but there's a good chance that there's an ANTLR grammar for it in the community ANTLR Grammars repository.
These grammars will parse the entire contents of the source. (By doing this, you can avoid some of the "messiness" that can come with trying to decide when to pop into, and out of, island grammars, which could be particularly messy for expressions, since they can occur in so many places within a typical source file.)
Once you have the resulting ParseTree, ANTLR provides a Listener capability: you call a method to walk the tree, and it calls you back for only those parts you are interested in -- in your case, expressions.
A quick search on ANTLR Listeners should turn up several resources on how to write a Listener for your needs; short articles that cover the basics are common (often written for when you're only interested in methods, but expressions would be similar).
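For instance, assuming your grammar is named MyLang and has an expression rule (both names are placeholders), a listener that collects just the expressions could look like:

    import org.antlr.v4.runtime.tree.ParseTreeWalker;
    import java.util.ArrayList;
    import java.util.List;

    // Walks the whole parse tree but reacts only to expression nodes.
    class ExpressionCollector extends MyLangBaseListener {
        final List<String> expressions = new ArrayList<>();

        @Override
        public void enterExpression(MyLangParser.ExpressionContext ctx) {
            expressions.add(ctx.getText());   // raw text of this expression
        }
    }

    // Usage, given a ParseTree named tree:
    //   ExpressionCollector collector = new ExpressionCollector();
    //   ParseTreeWalker.DEFAULT.walk(collector, tree);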

Semantic Model from a grammar

I would like to ask for some thoughts about the concepts of a Domain Object and a Semantic Model.
I really want to understand what a Domain Object / Semantic Model is for, and what it is not for.
As far as I've been able to figure out, given a grammar, it is absolutely advisable to make this separation of concepts.
However, I can't quite figure out how to do it. For example, given this slight grammar, how do you build a Domain Object or a Semantic Model?
It's exactly what I'm trying to figure out...
Most books suggest this approach for going through an AST: instead of translating directly as you go through the AST, create a semantic model, and then connect an interpreter to it.
Example (SQL Syntax Tree):
Instead of generating a SQL sentence directly, I create a semantic model, and then I'm able to connect an interpreter that translates this semantic model into a SQL sentence.
Abstract Syntax Tree -> Semantic Model -> Interpreter
This way, I could have a Transact-SQL interpreter and another one for SQLite.
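A minimal Java sketch of what I mean (all names are illustrative): the semantic model records what the query means, and each interpreter renders it for one dialect.

    import java.util.List;

    // Semantic model: dialect-neutral description of a query.
    class SelectQuery {
        final String table;
        final List<String> columns;
        final int limit;                 // -1 means "no limit"
        SelectQuery(String table, List<String> columns, int limit) {
            this.table = table; this.columns = columns; this.limit = limit;
        }
    }

    interface SqlGenerator { String generate(SelectQuery q); }

    // SQLite spells row limiting as LIMIT n ...
    class SqliteGenerator implements SqlGenerator {
        public String generate(SelectQuery q) {
            String sql = "SELECT " + String.join(", ", q.columns) + " FROM " + q.table;
            return q.limit < 0 ? sql : sql + " LIMIT " + q.limit;
        }
    }

    // ... while Transact-SQL spells it TOP n.
    class TransactSqlGenerator implements SqlGenerator {
        public String generate(SelectQuery q) {
            String top = q.limit < 0 ? "" : "TOP " + q.limit + " ";
            return "SELECT " + top + String.join(", ", q.columns) + " FROM " + q.table;
        }
    }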
The terms "domain object" and "semantic model" aren't really standard terms from the compiler literature, so you'll get lots of random answers.
The usual terms related to parsing are "concrete syntax tree" (matches the shape of the grammar rules) and "abstract syntax tree" (an attempt to make a tree which contains less accidental detail, although it might not be worth the trouble).
Parsing is only a small part of the problem of processing a language. You need a lot of semantic interpretation of the syntax, however you represent it (AST, CST, ...). This includes concepts such as:
Name resolution (for every identifier, where is it defined? where is it used?)
Type resolution (for every identifier/expression/syntax construct, what is the type of that entity?)
Type checking (is that syntax construct used in a valid way?)
Control flow analysis (what order are the program parts executed in, possibly even parallel/dynamic/constraint-determined)
Data flow analysis (where are values defined? consumed?)
Optimization (replacement of one set of syntax constructs by another semantically equivalent set with some nice property [executes faster after compilation is common]), at high or low levels of abstraction
High-level code generation, e.g., interpreting sets of syntactic constructs in the source language into equivalent sets in the targeted [often assembly-language-like] language
Each of these concepts more or less builds on top of the preceding ones.
The closest I can come to "semantic model" is that high-level code generation. That takes a lot of machinery that you have to build on top of trees.
ANTLR parses. You have to do/supply the rest.
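As a tiny illustration of the kind of machinery you have to supply, here is a toy nested-scope symbol table for name resolution (plain Java, no ANTLR API): resolution looks in the current scope, then walks outward through the enclosing scopes.

    import java.util.HashMap;
    import java.util.Map;

    class Scope {
        private final Scope parent;                                   // null at global scope
        private final Map<String, String> symbols = new HashMap<>();  // name -> type

        Scope(Scope parent) { this.parent = parent; }

        void declare(String name, String type) { symbols.put(name, type); }

        // Look here first, then in each enclosing scope.
        String resolve(String name) {
            String type = symbols.get(name);
            if (type != null) return type;
            return parent == null ? null : parent.resolve(name);
        }
    }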

General strategy for designing Flexible Language application using ANTLR4

Requirement:
I am trying to develop a language application using antlr4. The language in question is not important. The important thing is that the grammar is very vast (easily >2000 rules!!!). I want to do a number of operations:
Extract a bunch of information. This can be call graphs, variable names, constant expressions, etc.
Any number of transformations:
If a loop can be expanded, we go ahead and expand it.
If we can eliminate dead code, we might choose to do that.
We might choose to rename all variable names to conform to some norms.
Each of these operations can be applied independently of the others. And after applying these steps, I want to rewrite the input as close as possible to the original input.
E.g., we might want to eliminate loops, rename the variables, and then output the result in the original language's format.
Questions:
I see a need to build a custom tree (read: AST) for this, so that I can modify the tree with each of the transformations. However, when I want to generate the output, I lose the nice abilities of the TokenStreamRewriter: I have to specify how to write each of the nodes of the tree, and I lose the original input formatting for the places where I didn't do any transformations. Does antlr4 provide a good way to get around this problem?
Is an AST the best way to go? Or do I build my own object representation? If so, how do I create that object efficiently? Creating an object representation is a very big pain for such a vast language, but it may be better in the long run. Again, how do I get back the original formatting?
Is it possible to work just on the parse tree?
Are there similar language applications which do the same thing? If so what strategy do they use?
Any input is welcome.
Thanks in advance.
In general, what you want is called a Program Transformation System (PTS).
PTSs generally have parsers, build ASTs, and can prettyprint the ASTs to recover compilable source text. More importantly, they have standard ways to navigate/inspect/modify the ASTs so that you can change them programmatically.
Many offer these capabilities in the form of pattern-matching code fragments written in the surface syntax of the language being transformed; this avoids the need to forever know excruciatingly fine details about which nodes are in your AST and how they are related to their children. This is incredibly useful when you have big, complex grammars, as most of our modern (and legacy) languages seem to have.
More sophisticated PTSs (very few) provide additional facilities for teasing out the semantics of the source code. It is pretty hard to analyze/transform most code without knowing what scopes individual symbols belong to, or their type, and many other details such as data flow. Full disclosure: I build one of these.

Generating random but still valid expressions based on yacc/bison/ANTLR grammar

Is it possible? Any tool available for this?
You can do this with any system that gives you access to the base grammar. ANTLR and YACC compile your grammar away, so you don't have it anymore. In ANTLR's case, the grammar has been turned into code; you're not going to get it back. In YACC's case, you end up with parser tables, which contain the essence of the grammar; you could walk such parser tables if you understood them well enough to do what I describe below.
It is easy enough to traverse a set of explicitly represented grammar rules and randomly choose expansions/derivations. By definition this will get you valid syntax.
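A toy Java version of that traversal, with the grammar represented explicitly as a map from nonterminal to alternatives (the grammar and the depth bound are illustrative):

    import java.util.Map;
    import java.util.Random;

    public class RandomSentence {
        // Alternative 0 of each rule is deliberately non-recursive so the
        // depth bound below can force every derivation to terminate.
        static final Map<String, String[][]> RULES = Map.of(
            "expr", new String[][] { {"term"}, {"expr", "+", "term"} },
            "term", new String[][] { {"atom"}, {"term", "*", "atom"} },
            "atom", new String[][] { {"NUMBER"}, {"(", "expr", ")"} });
        static final Random RND = new Random();

        static String derive(String symbol, int depth) {
            String[][] alts = RULES.get(symbol);
            if (alts == null)                      // terminal symbol
                return symbol.equals("NUMBER") ? Integer.toString(RND.nextInt(100)) : symbol;
            // Randomly pick an expansion; past the bound, pick the non-recursive one.
            String[] alt = depth > 8 ? alts[0] : alts[RND.nextInt(alts.length)];
            StringBuilder sb = new StringBuilder();
            for (String s : alt) sb.append(derive(s, depth + 1)).append(' ');
            return sb.toString().trim();
        }

        public static void main(String[] args) {
            System.out.println(derive("expr", 0)); // e.g. "7 + ( 42 * 3 )"
        }
    }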
What it won't do is get you valid code. The problem here is that most languages really have context-sensitive syntax; most programs aren't valid unless the declared identifiers are used in a way consistent with their declaration and scoping rules. The latter requires a full semantic check.
Our DMS Software Reengineering Toolkit is used to parse code in arbitrary languages [using a grammar], build ASTs, let you analyze and transform those trees, and finally prettyprint valid (syntactic) text. DMS provides direct access to the grammar rules and its tree-building facilities, so it is pretty easy to generate random syntactic trees (and prettyprint them). Making sure they are semantically valid is hard with DMS too; however, many of DMS's front ends can take a (random) tree and do semantic checking, so at least you'd know whether the tree was semantically valid.
What you do if it says "no" is still an issue. Perhaps you can generate identifier names in a way that guarantees at least not-inconsistent usage, but I suspect that would be language-dependent.
yacc and bison turn your grammar into a state machine (driven with a stack). You should be able to traverse the state machine randomly to find valid inputs.
Basically, at each state you can either shift a new token onto the stack and move to a new state, or reduce the top of the stack based on a set of valid reductions. (See the Bison manual for details about how this works.)
Your random generator will traverse the state machine making random but valid shifts or reductions at each state. Once you reach the terminal state you have a valid input.
For a human-readable description of the states, you can use the -v or --report=state option to bison.
I'm afraid I can't point you to any existing tools that can do this.

removing dead variables using antlr

I am currently servicing an old VBA (Visual Basic for Applications) application. I've got a legacy tool which analyzes that application and prints out dead variables. As there are more than 2000 of them, I do not want to do this by hand.
Therefore I had the idea to transform the separate code files which contain the dead variables (according to the aforementioned tool) into ASTs and remove the variables that way.
My question: Is there a recommended way to do this?
I do not want to use StringTemplate, as I would need to create templates for all rules; also, if I had a comment on the hidden channel, it would be lost, right?
All I need is to remove parts of the code and print out the rest as it was read in.
Does anyone have any recommendations, please?
Some theory
I assume that regular expressions are not enough to solve your task. That is, you can't capture the notion of a dead-code section in a regular language; you need the context-free language described by some antlr grammar.
The algorithm
The following algorithm can be suggested:
1. Tokenize the source code with a lexer.
Since you want to preserve all the correct code, don't skip or hide its tokens. Make sure to define separate tokens for the parts which may be removed or which will be used to determine the dead code; all other characters can be collected under a single token type. Here you can use the output of your auxiliary tool in predicates to reduce the number of tokens generated. I guess antlr's tokenization (like any other tokenization) is expressed in a regular language, so you can't remove all the dead code in this step.
2. Construct an AST with a parser.
Here all the power of a context-free language can be applied: define dead-code sections in the parser's rules and omit them from the AST being constructed.
3. Convert the AST back to source code. You can use a tree parser here, but I guess there is an easier way, which can be found by observing toString and similar methods of the tree type returned by the parser.
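If you stay with ANTLR4's parse tree instead of building a separate AST, TokenStreamRewriter gives you the "print the rest as it was read in" behavior almost for free: edits are recorded against the original token stream, and everything untouched -- including comments, as long as your lexer routes them to the hidden channel rather than skipping them -- comes back out verbatim. A sketch, where the Vba* classes and rule names are hypothetical stand-ins for a real grammar:

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.tree.*;
    import java.util.Set;

    public class DeadVariableRemover {
        public static void main(String[] args) throws Exception {
            // Names reported dead by the legacy analysis tool.
            Set<String> dead = Set.of("unusedCounter", "tmpOld");

            VbaLexer lexer = new VbaLexer(CharStreams.fromFileName(args[0]));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            VbaParser parser = new VbaParser(tokens);
            ParseTree tree = parser.module();          // hypothetical start rule

            TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
            ParseTreeWalker.DEFAULT.walk(new VbaBaseListener() {
                @Override
                public void exitVariableDeclaration(VbaParser.VariableDeclarationContext ctx) {
                    if (dead.contains(ctx.ID().getText()))
                        // Drop only this declaration's tokens; all other text,
                        // hidden-channel comments included, prints unchanged.
                        rewriter.delete(ctx.getStart(), ctx.getStop());
                }
            }, tree);

            System.out.println(rewriter.getText());
        }
    }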