Why do tools like yacc and ANTLR generate source code? - antlr

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.

There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied to end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and voilà, people build code-generating parser generators.
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
EDIT: Addendum. The historical reason is probably driven by space and time demands. Early systems had little memory (32 KB in 1975), didn't run very fast (1 MIPS in the same time frame), and people already had big source files. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
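To make the contrast concrete, here is a minimal sketch of the two styles for a toy grammar. Everything in it (the grammar, `interpret`, `parse_expr`, `parse_term`) is made up for illustration. The first recognizer walks the grammar as data on every call; the second is the kind of specialized code a generator could emit ahead of time. Real tools differ in detail (yacc emits LALR tables plus a driver, ANTLR emits recursive-descent code), and the sketch shows only the structural difference, not measured speed.

```python
# Toy grammar (hypothetical):  expr -> term '+' expr | term
#                              term -> NUM | '(' expr ')'
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["NUM"], ["(", "expr", ")"]],
}


def interpret(grammar, symbol, tokens, pos):
    """Library style: yield every position where `symbol` can end from `pos`."""
    if symbol not in grammar:  # anything without rules is a terminal
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for alternative in grammar[symbol]:
        positions = [pos]
        for part in alternative:
            # The grammar is re-examined on every call, for every alternative.
            positions = [end
                         for p in positions
                         for end in interpret(grammar, part, tokens, p)]
        yield from positions


def accepts_interpreted(tokens):
    return any(end == len(tokens)
               for end in interpret(GRAMMAR, "expr", tokens, 0))


# Generated style: the grammar has been "compiled away" into straight code.
def parse_term(tokens, pos):
    if pos < len(tokens) and tokens[pos] == "NUM":
        return pos + 1
    if pos < len(tokens) and tokens[pos] == "(":
        pos = parse_expr(tokens, pos + 1)
        if pos is not None and pos < len(tokens) and tokens[pos] == ")":
            return pos + 1
    return None  # no alternative matched


def parse_expr(tokens, pos):
    pos = parse_term(tokens, pos)
    while pos is not None and pos < len(tokens) and tokens[pos] == "+":
        pos = parse_term(tokens, pos + 1)
    return pos


def accepts_generated(tokens):
    return parse_expr(tokens, 0) == len(tokens)


sample = ["NUM", "+", "(", "NUM", ")"]
both_accept = accepts_interpreted(sample) and accepts_generated(sample)
```

The generated-style functions never consult `GRAMMAR`: the decisions about which alternative to try were made when the code was written, which is the work a parser generator does offline.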

Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc.) and invoking actions (YYACCEPT, etc.). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd likely end up creating source code, compiling that into a DSO (dynamic shared object) or DLL (dynamic link library), and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.
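As a rough illustration of what those stack references mean, here is a simplified sketch of a reduce step over a value stack. The names `reduce` and `add_action` are hypothetical, and this is not yacc's actual generated code: in yacc the action is C text pasted into the generated parser, whereas here a Python callable stands in for it, which is roughly what a library-form tool would have to accept instead.

```python
def reduce(value_stack, rhs_length, action):
    """Pop the values for $1..$n, run the action, and push its result ($$)."""
    start = len(value_stack) - rhs_length
    args = value_stack[start:]      # these play the role of $1, $2, ..., $n
    del value_stack[start:]
    value_stack.append(action(*args))  # the return value plays the role of $$


# Stand-in for the yacc rule:  expr : expr '+' expr   { $$ = $1 + $3; }
def add_action(d1, d2, d3):
    return d1 + d3


stack = [2, "+", 40]
reduce(stack, 3, add_action)
# stack is now [42]
```

A callable like this only covers the simple case. Fragments that reach into parser internals (YYACCEPT and friends) have no obvious equivalent, which is the difficulty described above.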


Looking for a VBA/VB parser/compiler written in OCaml

I am planning to write a compiler (including parser) in OCaml to parse and run VBA and/or VB programs. I have done this for simple imperative languages, but I am not sure how to handle the "object" features of VBA and/or VB...
Does anyone know of any existing work that I could draw inspiration from?
Not an OCaml solution (but OP asked):
Our DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery. It is intended to be a convenient foundation for custom software engineering tools for computer languages, with the goal being to help the tool engineer get his job done, rather than spending his time reinventing the wheel. In particular, many people think that getting a parser is the big part of the job. This is simply false. See Life After Parsing.
DMS has production front ends for many languages, both modern and legacy, including Visual Basic in its variety of dialects (VB6, VBA [essentially the same as VB6]) and VB.net.
By production I mean they have been applied to real code systems of significant size and handle all the corresponding parsing issues. This is pretty hard for legacy languages, e.g., VB, especially the older dialects, because such languages are generally poorly documented (VB6 and VBA especially so). The only way to get this right is to build a draft parser, run it against reality, and revise until lots of code goes through sensibly. This often takes longer than doing the draft parser because it isn't easy to understand the errors (they're undocumented!), you have to decide if they are real or the code base just has junk (more often than you'd think), guess what it means for the grammar, and try it all again.
These front ends as a minimum parse source code and build ASTs; they can also invert this process to regenerate legal compilable code with the comments back as source text files. The VisualBasic front ends do this. Some of our other front ends (C, C++, Java, COBOL) go further: name/type resolution, flow analysis, etc.; they do that by collecting key program facts from the language-specific AST and then apply DMS-supplied machinery to compute the results. This would be possible for VisualBasic, too, if such facts were useful.
For an example of a tiny OO language written in OCaml check out the source code for boa at: http://andrej.com/plzoo/.
The OO flavour is not class based though so I'm not sure how useful it will be.

Shallow parsing with ANTLR

I'm trying to develop a solution able to extract, in a closed-context, certain actions.
For example, in a context of booking cinema tickets, if a user says:
"I'd like to go to the cinema tomorrow night, it would be Casablanca, I'd like to be at the last row, please"
I've designed grammars for getting the name of the film, desired seat, date and hour of the projection, etc.
However, though I've thought about ANTLR for developing such solution, I don't really know if it has such functionality, I mean, if I can define several root symbols.
ANTLR has methods of addressing ambiguities in grammars. These methods are improved in ANTLR 4, but when it comes to processing ambiguous languages (especially human language), you'll face one giant limitation that will inevitably make ANTLR unsuitable for the task:
ANTLR eventually resolves an ambiguity by deciding that one specific option among multiple potential options is the correct solution. Since this resolution happens at a very early stage in the parsing process with ANTLR, it's very difficult to incorporate semantic logic in this decision making process (as opposed to logic involving syntax alone).
Edit: One thing that's particularly interesting about ANTLR 4 in the context of NLP is the fact that ANTLR 4 uses an augmented transition network as the basis for its parser. Somewhere in there I know it would be possible to modify it for use in natural language processing, but to date I haven't figured out just how to make it work. Reference: I developed the optimized version of the ANTLR 4 runtime, which is currently slightly behind the reference branch but I'll catch up later this summer.
ANTLR isn't well suited to parse human languages: they're too ambiguous. Try NLP instead. Here's a list of natural language processing toolkits.

Creating a simple Domain Specific Language

I am curious to learn about creating a domain specific language. For now the domain is quite basic, just have some variables and run some loops, if statements.
Edit: The language will be non-English based with a very simple syntax.
I am thinking of targeting the Java Virtual Machine, ie compile to Java byte code.
Currently I know how to write some simple grammars using ANTLR.
I know that ANTLR creates a lexer and parser but how do I go forward from here?
About semantic analysis: does it have to be written manually, or are there tools to create it?
How can the output from the lexer and parser be converted to Java byte code?
I know that there are libraries like ASM or BCEL, but what is the exact procedure?
Are there any frameworks for doing this? And if there are, what is the simplest one?
You should try Xtext, an Eclipse-based DSL toolkit. Version 2 is quite powerful and stable. From its home page you have plenty of resources to get you started, including some video tutorials. Because the Eclipse ecosystem runs around Java, it seems the best choice for you.
You can also try MPS, but this is a projectional editor, and beginners may find it more difficult. It is nevertheless not less powerful than Xtext.
If your goal is to learn as much as possible about compilers, then indeed you have to go the hard way: write an ad hoc parser (no ANTLR or the like), write your own semantic passes and your own code generation.
Otherwise, you'd better extend an existing extensible language with your DSL, reusing its parser, its semantics and its code generation functionality. For example, you can easily implement an almost arbitrarily complex DSL on top of Clojure macros (and Clojure itself is then translated to JVM bytecode, so you get that for free).
A DSL with simple syntax may or may not mean simple semantics.
Simple semantics may or may not mean easy translation to a target language; such translations are "technically easy" only if the DSL and the target language share a lot of common data types and execution models. (Constraint systems have simple semantics, but translating them to Fortran is really hard!) (You gotta wonder: if translating your DSL is easy, why do you have it?)
If you want to build a DSL (in your case you stick with easy because you are learning), you want DSL compiler infrastructure that has whatever you need in it, including support for difficult translations. "What is needed" to handle translating all DSLs to all possible target languages is clearly an impossibly large set of machinery.
However, there is a lot of machinery that is clearly helpful:
Strong parsing machinery (who wants to diddle with grammars whose structure is forced by the weakness of the parsing machinery? If you don't know what this is, go read about LL(1) grammars as an example)
Automatic construction of a representation (e.g., an abstract syntax tree) of the parsed DSL
Ability to access/modify/build new ASTs
Ability to capture information about symbols and their meaning (symbol tables)
Ability to build analyses of the AST for the DSL, to support translations that require information from "far away" in the tree, to influence the translation at a particular point in the tree
Ability to reorganize the AST easily to achieve local optimizations
Ability to construct/analyze control and dataflow information if the DSL has some procedural aspects, and the code generation requires deep reasoning or optimization
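A few of the items above (an AST, a symbol table, a semantic check, and a translation step) can be sketched in miniature. This is a toy under stated assumptions, not how any particular toolkit does it: the node classes, `check`, `generate`, and the instruction names are all hypothetical. The instructions only loosely resemble a stack machine such as the JVM; a real JVM backend would emit actual bytecode through a library like ASM or BCEL.

```python
from dataclasses import dataclass


# Hypothetical AST node types for a tiny DSL with assignments and addition.
@dataclass
class Num:
    value: int


@dataclass
class Var:
    name: str


@dataclass
class Add:
    left: object
    right: object


@dataclass
class Assign:
    name: str
    expr: object


def check(program):
    """Semantic pass: build a symbol table and reject use-before-assignment."""
    symbols = {}  # variable name -> storage slot index

    def visit(node):
        if isinstance(node, Var) and node.name not in symbols:
            raise NameError(f"undefined variable: {node.name}")
        if isinstance(node, Add):
            visit(node.left)
            visit(node.right)

    for stmt in program:
        visit(stmt.expr)
        symbols.setdefault(stmt.name, len(symbols))
    return symbols


def generate(program, symbols):
    """Code generation: emit instructions for a made-up stack machine."""
    code = []

    def emit(node):
        if isinstance(node, Num):
            code.append(("PUSH", node.value))
        elif isinstance(node, Var):
            code.append(("LOAD", symbols[node.name]))
        elif isinstance(node, Add):
            emit(node.left)
            emit(node.right)
            code.append(("ADD",))

    for stmt in program:
        emit(stmt.expr)
        code.append(("STORE", symbols[stmt.name]))
    return code


# x = 1; y = x + 2   (the AST a parser would normally build)
program = [Assign("x", Num(1)), Assign("y", Add(Var("x"), Num(2)))]
symbols = check(program)
code = generate(program, symbols)
```

Even at this size the parser is the smallest part: the check and the translation are separate passes over the tree, and both depend on the symbol table.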
Most of the tools available for "building DSL generators" provide some kind of parsing, perhaps tree building, and then leave you to fill in all the rest. This puts you in the position of having a small, clean DSL but taking forever to implement it. That's not good. You really want all that infrastructure.
Our DMS Software Reengineering Toolkit has all the infrastructure sketched above and more. (It clearly doesn't, and can't, have the moon.) You can see a complete, all-in-one-"page", simple DSL example that exercises some interesting parts of this machinery.

decompilation resources and theory

There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular HW platform. I'm looking for good books/research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has to a greater or lesser degree optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
In cases where the source is for a well defined VM, then often this optimisation is left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
As a counter, there are code obfuscators that deliberately transform the code to prevent simple restoration of the original source by scrambling names, changing the sequence in which code is generated (without changing its resulting meaning), and introducing constructs for which there is no source language equivalent.
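The idea that a decompiler is just a compiler from object code to source can be sketched for a small stack machine. The instruction set, `decompile`, and the `names` table are hypothetical and chosen only for illustration; the technique shown is symbolic execution, where the stack holds source expressions instead of runtime values.

```python
def decompile(code, names):
    """Rebuild source text from stack-machine code by symbolic execution."""
    stack = []  # holds source-level expression strings, not runtime values
    lines = []
    for instr in code:
        op = instr[0]
        if op == "PUSH":
            stack.append(str(instr[1]))
        elif op == "LOAD":
            stack.append(names[instr[1]])
        elif op == "ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} + {right})")
        elif op == "STORE":
            lines.append(f"{names[instr[1]]} = {stack.pop()}")
        else:
            raise ValueError(f"unknown opcode: {op}")
    return lines


# Object code for a made-up VM, plus a symbol table of slot -> name.
code = [("PUSH", 1), ("STORE", 0), ("LOAD", 0),
        ("PUSH", 2), ("ADD",), ("STORE", 1)]
names = {0: "x", 1: "y"}
source = decompile(code, names)
# source == ["x = 1", "y = (x + 2)"]
```

Here the names survive because a symbol table is supplied. If the symbols had been stripped or scrambled, the same loop would still work but could only invent placeholder names for the slots.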

Can PMD be customized to fully support a new language?

Can PMD be customized to fully support a new language in a reasonable amount of time? I know that technically almost anything can be done, but I'm wondering whether this can be done in a reasonable amount of time, e.g. < 2 weeks.
This page mentions how to write a CPD parser http://pmd.sourceforge.net/cpd-parser-howto.html
But is this just for copy/paste detection? Does writing a CPD parser give me full support of PMD in terms of rule sets?
I would guess not, but I'm not a PMD expert (and I have my own bias, check my bio).
The issues are:
Can you define a syntax for your language quickly? (Maybe, depending on how good you are, how messy the language is, and the strength of the parsing machinery offered by PMD.)
Can you define the semantics of your language so that the "semantic checks" provided by PMD work? You have to do this, because syntax tells you (and a tool) literally nothing about the semantics of the syntax. I would guess that the PMD tool's "semantic checks" are pretty wired into the precise details of Java; if your language matched Java perfectly, this would be zero work. But it doesn't, or you wouldn't be asking the question. And language semantics differences, even minor ones, cause discontinuous changes to the interpretation of the code. Before you get to doing even "serious" semantics, you're likely to have to build a symbol table mapping identifiers in the code to declarations (and the "semantic" type) for those symbols. Based on the tool infrastructure I work with, this step alone takes 1-2 months for a real language.
Lastly, you are likely to have to code special PMD checks that are specific to your language. That takes time and energy, too.
I build generic compiler-type machinery (parsers, flow analyzers, style/error checkers) and get asked the equivalent of this question all the time with respect to our machinery. We try to have a lot of machinery available, try to make it easy to integrate new languages, and we've been working on trying to make this "convenient and fast" for 15+ years. It's still not convenient, and there's no way to do this with our tools in a few weeks. I doubt PMD is better.