Structure of an ANTLR based translator (best practices)

I want to write a translator from a DSL to Java using ANTLR. I have already written the lexer and the parser using two different grammars. Now I have to write the tree grammar, and I would like to know the best (or recommended) practices for obtaining my result. More precisely, I would like to know the best ways to do things like enriching the tree with attributes (for example, adding types) and performing optimizations.
Should I write different tree grammars for identifying types and for optimizations, and then call them serially after the parser and before the final code generation tree grammar? Is there another way that is easier to maintain? I also thought about manually walking the tree generated by the parser in order to identify the types, but this is quite hard to maintain.
Thank you.

There are no real best practices: just common sense and personal preference.
However, it is more logical to perform the adding of certain attributes to nodes and optimization rewrites (rewrite ^(* 0 ^(...)) to 0) in separate passes over the AST. Don't worry too much about performance: tree walking is pretty fast, and most of the time is usually spent during parsing anyway. And with the tree pattern matching added in ANTLR 3.2, you can write pretty small tree grammars that each perform a very specific operation on your AST (easy to maintain!).
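For instance, here are two tiny hand-rolled passes in plain Java (a minimal sketch; the Node, TypePass, and FoldPass classes are invented for illustration, and with ANTLR the generated tree walker would drive these actions instead):

    import java.util.ArrayList;
    import java.util.List;

    // Minimal hand-rolled AST with two independent passes: one annotates
    // nodes with a type, the other rewrites ^(* 0 ...) to the literal 0.
    class Node {
        String text;                      // token text, e.g. "*", "0", "x"
        String type;                      // filled in by the typing pass
        List<Node> children = new ArrayList<>();
        Node(String text) { this.text = text; }
    }

    class TypePass {                      // pass 1: add attributes
        void run(Node n) {
            for (Node c : n.children) run(c);                 // bottom-up
            n.type = n.text.matches("\\d+") ? "int" : "unknown";
        }
    }

    class FoldPass {                      // pass 2: optimize
        Node run(Node n) {
            for (int i = 0; i < n.children.size(); i++)
                n.children.set(i, run(n.children.get(i)));
            if (n.text.equals("*")
                    && n.children.stream().anyMatch(c -> c.text.equals("0")))
                return new Node("0");     // x*0 (or 0*x) folds to 0
            return n;
        }
    }

Keeping each pass this small is the maintainability argument: you can add, reorder, or drop passes without touching the others.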
Also see this previous Q&A that is about manually walking the AST or using a tree grammar for it:
Systematic way to generate ANTLR tree grammar?

Related

Modify Parse Tree

I have an ANTLR ParseTree for an SQL grammar.
My goal is to edit this tree so that I can delete all the intermediate nodes (booleanExpression, predicated, valueExpression, primaryExpression) in between.
I have explored visitors and listeners, but they don't generate the tree for me. And I'd like not to touch the grammar, since it's the official source one.
So how can I do it?
Thanks
There is no such feature in ANTLR's API (removing/mutating the ParseTree). You'll have to walk the tree yourself, creating a copy of the ParseTree and ignoring the sub-trees you do not want/need.
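A minimal sketch of that manual copy, using only ANTLR 4's generic tree interfaces (the MyNode target class is invented for the example; substitute whatever tree type you actually want):

    import java.util.ArrayList;
    import java.util.List;
    import org.antlr.v4.runtime.RuleContext;
    import org.antlr.v4.runtime.tree.ParseTree;
    import org.antlr.v4.runtime.tree.TerminalNode;

    // Illustrative target tree type; not part of ANTLR.
    class MyNode {
        String label;
        List<MyNode> children = new ArrayList<>();
        MyNode(String label) { this.label = label; }
    }

    class TreeCopier {
        MyNode copy(ParseTree t) {
            // Descend through any rule context with exactly one child,
            // which removes chains like booleanExpression -> predicated
            // -> valueExpression -> primaryExpression.
            while (t instanceof RuleContext && t.getChildCount() == 1)
                t = t.getChild(0);
            if (t instanceof TerminalNode)
                return new MyNode(t.getText());
            MyNode result = new MyNode(t.getClass().getSimpleName());
            for (int i = 0; i < t.getChildCount(); i++)
                result.children.add(copy(t.getChild(i)));
            return result;
        }
    }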
I've usually found the ParseTree to be too cumbersome (for exactly the reason you've shown). I don't know of any automated way to alter the tree in memory. (Could be an interesting project to attempt.)
In a couple of implementations I've written a generalized approach to make the process easier. The "best" one was where I wrote a small DSL to define the structure I wanted. It generated the classes as well as ANTLR-style visitors and listeners for them. I then used a listener to transform the ANTLR ParseTree into my ideal tree, and wrote the rest of my code against that much simpler tree. You can have a listener/visitor generate a tree of your own design as an artifact of processing the ParseTree, but that's as close as I've come.
It was actually a relatively minor effort to set that up early in the process, and it accounted for quite a small percentage of the overall effort of implementing the language (so, well worth it).

Split Grammar and Grammar Attributes (Procedural Generation of Buildings)

I am trying to implement algorithm for procedural generation of buildings for my thesis. I've been reading a lot about shape/split grammars. Here's the link to the most popular paper covering that topic.
I managed to implement a very basic grammar without attributes and guard conditions. It can generate primitives and do geometric transformations. I'm using flex and bison to parse shape grammar files and to generate objects (Symbols, Rules, etc.) that represent the given grammar in an object-oriented manner and can later be called to generate geometry.
But now I'm stuck with the attribute part. For example:
fac(h) : h > 9; floor(h/3) floor(h/3) floor(h/3)
I am clueless about how to represent the grammar so that it contains information about attributes, how to pass values to the symbols on the left, and how to evaluate the condition.
Can anyone help me with this, please? I'm using C++.
Note: I have some knowledge of grammars and parsers, and I know how to implement a top-down parser with attributes using recursive descent, but that approach is useless here. I can't generate source code for functions, because the interpretation of grammar files will be done in the same runtime as the application. Even if I could, this is not a parser but a generator of sentences, and there are a lot more problems, like condition evaluation and production selection.
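One way to represent this, sketched in Java for consistency with the rest of the thread (the same structure ports directly to C++; the Shape, Successor, and Rule classes are invented for illustration): store each rule's formal parameters, a guard expression, and one argument expression per successor parameter, then evaluate everything against an environment when the rule is applied.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Predicate;
    import java.util.function.ToDoubleFunction;

    // In a real interpreter the guard and argument expressions would be
    // small expression trees built by the bison parser and evaluated
    // recursively; lambdas stand in for them here.
    class Shape {                                 // an instantiated symbol
        String symbol; double[] args;
        Shape(String symbol, double[] args) { this.symbol = symbol; this.args = args; }
    }

    class Successor {
        String symbol;                            // e.g. "floor"
        List<ToDoubleFunction<Map<String, Double>>> argExprs;  // e.g. h/3
        Successor(String symbol, List<ToDoubleFunction<Map<String, Double>>> argExprs) {
            this.symbol = symbol; this.argExprs = argExprs;
        }
    }

    class Rule {
        List<String> params;                      // e.g. ["h"]
        Predicate<Map<String, Double>> guard;     // e.g. env -> env.get("h") > 9
        List<Successor> successors;
        Rule(List<String> params, Predicate<Map<String, Double>> guard,
             List<Successor> successors) {
            this.params = params; this.guard = guard; this.successors = successors;
        }

        // Bind actual values to the parameters, test the guard, and if it
        // holds, evaluate each successor's arguments in that environment.
        List<Shape> apply(double... values) {
            Map<String, Double> env = new HashMap<>();
            for (int i = 0; i < params.size(); i++) env.put(params.get(i), values[i]);
            if (!guard.test(env)) return null;    // guard rejected: rule not applicable
            List<Shape> result = new ArrayList<>();
            for (Successor s : successors) {
                double[] args = new double[s.argExprs.size()];
                for (int i = 0; i < args.length; i++)
                    args[i] = s.argExprs.get(i).applyAsDouble(env);
                result.add(new Shape(s.symbol, args));
            }
            return result;
        }
    }

For the fac(h) rule above, the guard would be env -> env.get("h") > 9, and there would be three floor successors, each with the single argument expression env -> env.get("h") / 3.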

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract certain actions in a closed context.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about using ANTLR to develop such a solution, I don't really know whether it has that functionality; I mean, whether I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. I know it would be possible to modify it somewhere in there for use in natural language processing, but to date I haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch, but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain-specific language. For now the domain is quite basic: just some variables, some loops, and if statements.
Edit: The language will be non-English-based, with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, i.e., compiling to Java byte code.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser, but how do I go forward from here?
About semantic analysis: does it have to be written manually, or are there tools to create it?
How can the output from the lexer and parser be converted to Java byte code?
I know that there are libraries like ASM or BCEL, but what is the exact procedure?
Are there any frameworks for doing this? And if there are, which is the simplest one?
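On the ASM question specifically, the procedure is: create a ClassWriter, describe the class and each method through visit calls, then load or write the resulting bytes. A minimal sketch (the Demo class and its answer() method are invented for the example; a real compiler would emit the method body by walking the AST):

    import org.objectweb.asm.ClassWriter;
    import org.objectweb.asm.MethodVisitor;
    import org.objectweb.asm.Opcodes;

    // Emits the equivalent of:
    //   public class Demo { public static int answer() { return 42; } }
    public class Emit implements Opcodes {
        public static byte[] generate() {
            ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
            cw.visit(V1_8, ACC_PUBLIC, "Demo", null, "java/lang/Object", null);

            MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC,
                    "answer", "()I", null, null);
            mv.visitCode();
            mv.visitLdcInsn(42);     // push the int constant
            mv.visitInsn(IRETURN);   // return it
            mv.visitMaxs(0, 0);      // recomputed because of COMPUTE_FRAMES
            mv.visitEnd();

            cw.visitEnd();
            return cw.toByteArray(); // feed to a ClassLoader or write Demo.class
        }
    }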
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem revolves around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless no less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way: write an ad hoc parser (no ANTLR or the like), write your own semantic passes, and write your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics, and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and since Clojure itself is translated to the JVM, you get that for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case, you can stick with an easy one because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot that is clearly helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this means, go read about LL(1) grammars as an example).
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL.
Ability to access/modify/build new ASTs.
Ability to capture information about symbols and their meaning (symbol tables).
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree.
Ability to reorganize the AST easily to achieve local optimizations.
Ability to construct/analyze control and dataflow information, if the DSL has some procedural aspects and the code generation requires deep reasoning or optimization.
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above, and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

Is it easier to write a recursive-descent parser using an EBNF or a BNF?

I've got a BNF and an EBNF for a grammar. The BNF is obviously more verbose. I have a fairly good idea of how to use the BNF to build a recursive-descent parser; there are many resources for this. I am having trouble finding resources for converting an EBNF to a recursive-descent parser. Is this because it's more difficult? I recall from my CS theory classes that we went over EBNFs, but we didn't cover converting them into a recursive-descent parser. We did cover converting BNFs into a recursive-descent parser.
The reason I'm asking is that the EBNF is more compact.
From looking at EBNFs in general, I notice that terms enclosed between { and } can be converted into a while loop. Are there any other guidelines or rules?
You should investigate so-called metacompilers, which essentially compile EBNF into recursive-descent parsers. How they do it is exactly the answer to your question.
(It's pretty straightforward, but it's good to understand the details.)
A really wonderful paper is the "MetaII" paper by Val Schorre. This is metacompiler technology from honest-to-God 1964. In 10 pages, he shows you how to build a metacompiler, and provides not just that, but another compiler too, and the output of both! There's an astonishing moment you come to if you go build one of these, when you realize how the metacompiler compiles itself using its own grammar. That moment got me hooked on compilers back in about 1970, when I first tripped over this paper. This is one of those computer science papers that everybody in the software business should read.
James Neighbors (the inventor of the term "domain" in software engineering, and builder of the first program transformation system, based on these metacompilers) has a great online MetaII tutorial, for those of you who don't want the do-it-from-scratch experience. (I have nothing to do with this, except that Neighbors and I were undergraduates together.)
Either way is a fine way to learn about metacompilers and generating parsers from EBNF.
The key idea is that the left-hand side of a rule creates a function that parses that nonterminal: it returns true and advances the input stream on a match, and returns false without advancing the input stream on a mismatch.
The contents of the function are determined by the right-hand side. Literal tokens are matched directly.
Nonterminals cause calls to the functions generated for the other rules.
Kleene star maps to while loops, and alternations map to conditional branches. What EBNF doesn't address, and the metacompilers do, is: how does parsing do anything other than say "matched" or not?
The secret is weaving output operations into the EBNF. The MetaII paper makes all this crystal clear.
Neither is harder than the other. It is really the difference between implementing something iteratively and implementing something recursively. In BNF, everything is recursive. In EBNF, some of the recursion is expressed iteratively. There are different variations in EBNF syntax, so I'll just use English: "zero or more" is a simple while loop, as you have discovered. "One or more" is the same as one followed by "zero or more". "Zero or one times" is a simple if statement. That should cover most of the cases.
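To make those mappings concrete, here is a small hand-written recursive-descent parser in Java (the toy grammar, a parenthesized comma-separated identifier list, is invented just to exercise each EBNF construct):

    // EBNF:  list = "(" [ ident { "," ident } ] ")"
    // Each nonterminal becomes a boolean method that advances pos on a
    // match, exactly as described above. No backtracking: a failure simply
    // returns false, leaving error recovery to the caller.
    public class TinyParser {
        private final String[] tokens;
        private int pos = 0;

        public TinyParser(String[] tokens) { this.tokens = tokens; }

        private boolean match(String expected) {
            if (pos < tokens.length && tokens[pos].equals(expected)) { pos++; return true; }
            return false;
        }

        private boolean ident() {
            if (pos < tokens.length && tokens[pos].matches("[A-Za-z]\\w*")) { pos++; return true; }
            return false;
        }

        public boolean list() {
            if (!match("(")) return false;
            if (ident()) {                       // [ ... ]  ->  if statement
                while (match(",")) {             // { ... }  ->  while loop
                    if (!ident()) return false;  // "," must be followed by ident
                }
            }
            return match(")");
        }

        public static void main(String[] args) {
            System.out.println(new TinyParser(new String[]{"(", "a", ",", "b", ")"}).list()); // true
            System.out.println(new TinyParser(new String[]{"(", ")"}).list());                // true
            System.out.println(new TinyParser(new String[]{"(", "a", ","}).list());           // false
        }
    }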
The early metacompilers META II and TREEMETA and their kin are not exactly recursive-descent parsers. They were described as using recursive functions, which just means those functions could call themselves.
We do not call C a recursive language; a C or C++ function is recursive in the same way the early metacompilers are recursive.
Recursion can be used, but these were programming languages, and recursion is generally used only when analyzing nested language constructs, for example parenthesized expressions and nested blocks.
They are more of an LR/recursive-descent combination. CWIC, the last documented one, has extensive backtracking and look-ahead features. The '-' (not) operator can match any language construct and inverts its success or failure: -term fails if a term is matched, for example, and the input is never advanced. The '?' operator looks ahead and matches any language construct: ?expr, for example, would try to parse an expr, but the construct matched by the look-ahead is not kept, nor is the input advanced.
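Both operators can be modeled with a save-and-restore cursor, shown here as hypothetical helpers in the style of the sketch above (the method names are invented; CWIC itself worked differently internally):

    import java.util.function.BooleanSupplier;

    // Models CWIC's '?' (look-ahead) and '-' (not) operators on top of a
    // parser whose state is an integer cursor pos. In both cases the
    // cursor is restored, so the input never advances.
    public class Lookahead {
        private int pos = 0;   // the parser's input cursor

        // '?' p : succeeds iff p would succeed, consuming nothing.
        boolean peek(BooleanSupplier p) {
            int saved = pos;
            boolean ok = p.getAsBoolean();
            pos = saved;       // undo whatever p consumed
            return ok;
        }

        // '-' p : succeeds iff p would fail, consuming nothing.
        boolean not(BooleanSupplier p) {
            return !peek(p);
        }
    }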