Modify Parse Tree - ANTLR

I have an ANTLR ParseTree for an SQL grammar.
My goal is to edit this tree so that I can delete all the intermediate (booleanExpression, predicated, valueExpression, primaryExpression) nodes in between.
I have explored visitors and listeners, but they don't generate the tree for me. And I'd like to not touch the grammar, since it's the official source grammar.
So how can I do it?
Thanks

There is no such feature in ANTLR's API: you cannot remove or mutate nodes of a ParseTree. You'll have to walk the tree yourself, create a copy of the ParseTree, and ignore the sub-trees you do not want/need.
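A minimal sketch of that copy-and-skip approach, assuming the ANTLR 4 Java runtime (SimpleNode and TreeFlattener are made-up names, not part of the ANTLR API):

```java
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.TerminalNode;
import java.util.ArrayList;
import java.util.List;

// Copies a ParseTree into a plain node class, collapsing every rule
// context that merely wraps a single child (booleanExpression,
// predicated, valueExpression, ... in the question's case).
class SimpleNode {
    final String label;                         // rule name or token text
    final List<SimpleNode> children = new ArrayList<>();
    SimpleNode(String label) { this.label = label; }
}

class TreeFlattener {
    static SimpleNode copy(ParseTree t, String[] ruleNames) {
        // Skip "pass-through" rule nodes that have exactly one child.
        while (t instanceof ParserRuleContext && t.getChildCount() == 1) {
            t = t.getChild(0);
        }
        if (t instanceof TerminalNode) {
            return new SimpleNode(t.getText());
        }
        ParserRuleContext ctx = (ParserRuleContext) t;
        SimpleNode node = new SimpleNode(ruleNames[ctx.getRuleIndex()]);
        for (int i = 0; i < ctx.getChildCount(); i++) {
            node.children.add(copy(ctx.getChild(i), ruleNames));
        }
        return node;
    }
}
```

Calling copy(tree, parser.getRuleNames()) on the root context then yields a tree without the single-child wrapper rules; the original ParseTree is left untouched.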

I've usually found the ParseTree to be too cumbersome (for exactly the reason you've shown). I don't know of any automated way to alter the tree in memory. (Could be an interesting project to attempt.)
In a couple of implementations I've written a generalized approach to make the process easier. The "best" one was where I wrote a small DSL to define the structure I wanted; it generated the classes, as well as ANTLR-style visitors and listeners for them. I then used a listener to transform the ANTLR ParseTree into my ideal tree, and wrote the rest of my code against that much simpler tree. You can have a listener/visitor generate a tree of your own design as an artifact of processing the ParseTree, but that's as close as I've come.
It was actually a relatively minor effort to set that up early in the process, and it accounted for quite a small percentage of the overall effort of implementing the language (so, well worth it).
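For illustration, a grammar-independent sketch of that listener-driven rebuild, using ANTLR 4's generic ParseTreeListener hooks (MyNode and TreeBuildingListener are made-up names; a real implementation would more likely use the generated *BaseListener with one exit method per rule of interest):

```java
import org.antlr.v4.runtime.ParserRuleContext;
import org.antlr.v4.runtime.tree.ErrorNode;
import org.antlr.v4.runtime.tree.ParseTreeListener;
import org.antlr.v4.runtime.tree.TerminalNode;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Rebuilds the ParseTree as a tree of MyNode objects while the walker fires events.
class TreeBuildingListener implements ParseTreeListener {
    static class MyNode {
        final String label;
        final List<MyNode> children = new ArrayList<>();
        MyNode(String label) { this.label = label; }
    }

    private final String[] ruleNames;            // from parser.getRuleNames()
    private final Deque<MyNode> stack = new ArrayDeque<>();
    MyNode root;

    TreeBuildingListener(String[] ruleNames) { this.ruleNames = ruleNames; }

    @Override public void enterEveryRule(ParserRuleContext ctx) {
        MyNode n = new MyNode(ruleNames[ctx.getRuleIndex()]);
        if (stack.isEmpty()) root = n; else stack.peek().children.add(n);
        stack.push(n);
    }
    @Override public void exitEveryRule(ParserRuleContext ctx) { stack.pop(); }
    @Override public void visitTerminal(TerminalNode node) {
        stack.peek().children.add(new MyNode(node.getText()));
    }
    @Override public void visitErrorNode(ErrorNode node) { /* ignored in this sketch */ }
}
```

Running ParseTreeWalker.DEFAULT.walk(new TreeBuildingListener(parser.getRuleNames()), tree) leaves the rebuilt tree in the listener's root field; the unwanted intermediate rules can be filtered out in enterEveryRule or collapsed afterwards.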

Related

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract certain actions in a closed context.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such a solution, I don't really know if it has that functionality; I mean, whether I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. Somewhere in there I know it would be possible to modify it for use in natural language processing, but to date I haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch, but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.

Why do tools like yacc and ANTLR generate source code?

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.
There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied to end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and voilà, people build code-generating parser generators.
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
EDIT: Addendum. The historical reason is probably driven by space and time demands. Earlier systems didn't have a lot of room (32 Kb in 1975), didn't run very fast (1 MIPS in the same time frame), and people already had big source files. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc.) and invoking actions (YYACCEPT, etc.). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd likely end up creating source code and compiling that into a DSO (dynamic shared object) or DLL (dynamic link library) and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.

Structure of an ANTLR based translator (best practices)

I want to write a translator from a DSL to Java using ANTLR. So, I wrote the lexer and the parser using two different grammars. Now I have to write the tree grammar, and I want to know what the best (or recommended) practices are for obtaining my result. More exactly, I would like to know the best ways to do things like enriching the tree with attributes (for example, adding types) and performing optimizations.
Should I write different tree grammars for identifying types and for optimizations, and then call them serially after the parser and before the final code-generation tree grammar? Is there another way that is easier to maintain? I also thought about manually walking the tree generated by the parser in order to identify the types, but that is quite hard to maintain.
Thank you.
There are no real best practices: just common sense and personal preference.
However, it is more logical to separate the adding of certain attributes to nodes and the optimization actions (rewriting ^(* 0 ^(...)) to 0, for example) into separate passes over the AST. Don't worry too much about performance: tree walking is pretty fast, and most of the time is usually spent during parsing. And with ANTLR 3.2's addition of tree pattern matching, you can write pretty small tree grammars to perform a very specific operation on your AST (easy to maintain!).
Also see this previous Q&A that is about manually walking the AST or using a tree grammar for it:
Systematic way to generate ANTLR tree grammar?

Creating a simple Domain Specific Language

I am curious to learn about creating a domain-specific language. For now the domain is quite basic: just having some variables, running some loops and using if statements.
Edit: The language will be non-English based, with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, i.e. compiling to Java bytecode.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
About semantic analysis: does it have to be written manually, or are there tools to create it?
How can the output from the lexer and parser be converted to Java bytecode?
I know that there are libraries like ASM or BCEL, but what is the exact procedure? (See the ASM sketch after these questions.)
Are there any frameworks for doing this? And if there are, what is the simplest one?
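For the bytecode part, a minimal sketch of the ASM route (assuming the ASM library, org.objectweb.asm, is on the classpath; the class name Hello and the printed string are made up for illustration, and a real DSL compiler would emit instructions driven by its own AST instead):

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Emits the bytecode for:  public class Hello { public static void main(String[] args) { ... } }
public class BytecodeSketch {
    public static byte[] generateHelloClass() {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES | ClassWriter.COMPUTE_MAXS);
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Hello", null, "java/lang/Object", null);

        MethodVisitor mv = cw.visitMethod(Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC,
                "main", "([Ljava/lang/String;)V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        mv.visitLdcInsn("hello from generated code");
        mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println",
                "(Ljava/lang/String;)V", false);
        mv.visitInsn(Opcodes.RETURN);
        mv.visitMaxs(0, 0);                 // recomputed because of COMPUTE_MAXS
        mv.visitEnd();

        cw.visitEnd();
        return cw.toByteArray();            // write to Hello.class or feed to a ClassLoader
    }
}
```

The usual shape of such a backend is a visitor over your AST that calls these visitXxx methods as it walks; the resulting byte array can be written to disk as a .class file or loaded directly through a custom ClassLoader.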
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem is built around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless no less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way - write an ad hoc parser (no ANTLR and the like), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and since Clojure itself compiles to the JVM, you get that part for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case, stick with an easy one because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot that is clearly helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this is, go read about LL(1) grammars as an example).
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct/analyze control and dataflow information if the DSL has some procedural aspects and the code generation requires deep reasoning or optimization
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

What is a good tool for automatically calculating FIRST and FOLLOW sets?

I'm currently in the middle of playing with a BNF grammar that I hope to be able to wrangle into an LL(1) form. However, I've just finished making changes and calculating the new FIRST and FOLLOW sets for the grammar by hand for the third time today, and I'm getting tired of it. There has to be a better way!
Can someone suggest a tool that, given a grammar, will automatically calculate the FIRST and FOLLOW sets for all of the nonterminals?
A year ago, we had a semester project at the university I attend, where our task was to create a programming language. As a group, we decided we wanted to be able to hand-write the parser from scratch, so we had to aim for an LL(1) grammar, since it would have been completely unrealistic to write a parser otherwise.
Of course, our starting point was far from being LL(1), so we too had to wrangle it into place. For that purpose, we used the kfgEdit tool from the AtoCC package. All you do is enter your rules, and then it can check if it's an LL(1) grammar at the click of a button.
A fair word of warning: the tool is a bit finicky about what it accepts. While you'd often use EBNF for the real grammar, writing ?, * and + to signal how many times a token may appear is not supported; grouping is also not supported. You may very well find that rewriting the grammar in that plain form takes a very long time, and you will almost surely want to do some "rearranging" after you've reached LL(1) to make the grammar even close to readable.
Of course, depending on the type of grammar you're dealing with, this may not be much of a problem for you. We had created a sort of Pascal/C hybrid, with a fairly restricted set of constructs (procedures, functions, only built-in primitive types and arrays of them, ifs, a single loop construct we'd come up with ourselves in place of the standard 3...), and it took me at least a week to wrangle it into an LL(1) grammar - probably 2, actually. Note that this is out of a total of about 4 months, so that was a lot of time spent there.
If you absolutely MUST have an LL(1) grammar, then you obviously will need to press on if you get into a situation like this, but if you're allowed to use parser generators like yacc/bison or SableCC then you will, in the long run, most likely find it a LOT easier to go down that route. That doesn't mean you SHOULD go down that route - I found that actually writing everything by hand provided some insight I probably wouldn't have gained otherwise - but it might be better for you to gain that insight in a different situation than your current one.
tl;dr version: Use kfgEdit from the AtoCC package.
For recursive descent parsing, it would be worth looking at ANTLR. However, I'm not sure it provides an exact answer to your question - finding the FIRST and FOLLOW sets for a given grammar.
The DMS Software Reengineering Toolkit has a parser generator that computes FIRST and FOLLOW sets; it will also let you inspect the L(AL)R state machine it generates.
However, if you have a legitimate context-free grammar, you don't have to "wrangle it" into LL shape; the DMS parser generator produces GLR parsers from any context-free grammar.
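If you do end up computing the sets yourself after all, FIRST is a short fixpoint iteration. Here is a minimal Java sketch (the grammar representation and the FirstSets class are made up for illustration and are not part of any of the tools mentioned above; FOLLOW is computed by a very similar second pass that uses these FIRST sets):

```java
import java.util.*;

// Each production is a String[]: element 0 is the left-hand nonterminal,
// the remaining elements are the right-hand side (none = epsilon production).
// Any symbol that never appears on a left-hand side is treated as a terminal.
// The empty string "" stands for epsilon in the resulting sets.
class FirstSets {
    static Map<String, Set<String>> compute(List<String[]> productions) {
        Set<String> nonterminals = new HashSet<>();
        for (String[] p : productions) nonterminals.add(p[0]);

        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : nonterminals) first.put(nt, new HashSet<>());

        boolean changed = true;
        while (changed) {                              // iterate to a fixpoint
            changed = false;
            for (String[] p : productions) {
                Set<String> lhs = first.get(p[0]);
                boolean allNullable = true;            // prefix so far can derive epsilon
                for (int i = 1; i < p.length && allNullable; i++) {
                    String sym = p[i];
                    if (!nonterminals.contains(sym)) { // terminal: add it and stop
                        changed |= lhs.add(sym);
                        allNullable = false;
                    } else {                           // nonterminal: add FIRST(sym) minus epsilon
                        for (String t : first.get(sym)) {
                            if (!t.isEmpty()) changed |= lhs.add(t);
                        }
                        allNullable = first.get(sym).contains("");
                    }
                }
                if (allNullable) changed |= lhs.add(""); // whole right-hand side can vanish
            }
        }
        return first;
    }
}
```

For example, the grammar A -> a B | epsilon, B -> b would be passed as the productions {"A","a","B"}, {"A"} and {"B","b"}, giving FIRST(A) = {a, epsilon} and FIRST(B) = {b}.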