Compiler code optimization: AST vs. IR - optimization

, where I define IR as a 3-address code type representation (I realize that one can mean by it an AST representation as well).
It is my understanding that, when writing a best-practice compiler for an imperative language, code optimization happens both on the AST (probably best using a Visitor Pattern), and on the IR produced from the AST. 
(a) Is that correct?
(b) Which type of optimization steps are best handled on the AST before even producing an IR? (reference to an article/a list online welcome too as long as it deals with an imperative language) 
The compiler I'm working on is for Decaf (which some might know) which has a fairly deep CFG up to (single) class inheritance; I'll add features not part of it such as type coercion. It will be completely hand-coded (using no tools whatsoever). This is not homework; writing it for fun. 

(a) Yes.
(b) Constant folding is one example; CSE is another; in fact almost anything to do with expression evaluation. IR-phase optimizations are more about what results from flow analysis.

IR is a form of an AST (often it is "flattened", but there are deep tree IRs as well), it may not be easy to distinguish one from another, especially if compiler is implemented as a sequence of very small rewrites from an original AST all the way down to a final IR suitable for instruction selection.
Optimisations may happen anywhere on this chain, but some representations are more suitable for a wide range of optimisations, most notably, an SSA form, used by most of the modern compilers to do nearly all the optimisations.

It's never too early to optimise (to coin a phrase). So there are optimisations performed before and during AST creation, on the AST itself, on the IR (if you have one) and on the code as it is generated. In C-like languages and those that compile to machine code, the effort goes into the later stages. In compilers targeting a VM I think there is less room for improvements at that stage.
Some early optimisations obviously work better than others. I don't know much about Decaf, but there are the obvious things like constant folding and constant expression evaluation. If you get the whole program in tree form before you have to generate any code you can find common subexpressions, do code migration, eliminate dead code/dead stores, hoist invariants, eliminate tail recursion and some kinds of strength reduction.
A lot of it depends on how hard you want to work and what your target is. You didn't say much about that.

Related

How compiled language is better than interpreted language in optimizing the hardware?

Specifically how is compiled language able to better optimize the hardware compared to interpreted language? Other online sources that I have read only gave vague explanations like because it is written in the native code of the target machine while some do not even offer explanation at all. Would appreciate if the explanation provided can be as "Layman" as possible given that I've only just started to code.
One major reason is optimizing compilers. Compiling "in advance" makes it much easier to apply optimizations to code, especially if you're compiling to native assembly code (as you typically do in C, for example). The fact that you know some stuff about the machine that it's going to be deployed on allows you to do machine-specific optimizations. This is especially important for, for example, Pentium-based processors, which have numerous complicated instructions that would tend to require some degree of knowledge of program structure in order to use (e.g. the MMX instruction set).
There are also some cases where the compiler can make structural changes to programs. For example, under special circumstances, some compilers can replace recursion with loops. (I once heard of someone writing a recursive Factorial function in C to learn about how to implement recursion in assembly language only to realize to his horror that the compiler had recognized an optimization and replaced his recursion with a for loop).

What are motivations behind compiling to byte-code?

I'm working on my own toy programming language. For now I'm interpreting the source language from AST and I'm wondering what advantages compiling to a byte-code and then interpreting it could provide me.
For now I have three things in mind:
Traversing the syntax tree hundreds of time may be slower than running instructions in an array, especially if the array support O(1) random access(ie. jumping 10 instructions up and down).
In typed execution environment, I have some run-time costs because my AST is typed, and I'm constantly traversing it(ie. I have 10 types of nodes and I need to check what type I'm on now to execute). Maybe compiling to an untyped byte-code could help to improve this, since after type-checking and compiling, I would have an untyped values and code.
Compiling to byte-code may provide better portability.
Are my points correct? What are some other motivations behind compiling to bytecode?
Speed is the main reason; interpreting ASTs is just too slow in practice.
Another reason to use bytecode is that it can be trivially serialized (stored on disk), so that you can distribute it. This is what Java does.
The point of generating byte code (or any other "easily interpreted" form such as threaded code) is essentially performance.
For an AST intepreter to decide what to do next, it needs to traverse the tree, inspect nodes, determine the type of nodes, check the type of any operands, verify legality, and decide which special case of the AST-designated operator applies (it says "+", but it means 16 bit add or string concatenate?), before it finally performs some action.
If one takes the final action and generates some kind of easily interpreted structure, then at "execution" time the interpreter can focus simply on performing actions without all that checking/special-case determination.
Another recent excuse is that if you generate byte code for any of a number of well-known virtual machines (JVM, MSIL, Parrot, etc.) you don't even have to code the interpreter. For the JVM and MSIL, you also get the benefit of the JIT compilers associated with them, and with careful design of your language, compatibility with huge libraries, which are the real attraction of Java and C#.

Why do tools like yacc and ANTLR generate source code?

These tools basically input a grammar and output code which processes a series of tokens into something more useful, like a syntax tree. But could these tools be written in the form of a library instead? What is the reason for generating source code as output? Is there a performance gain? Is it more flexible for the end user? Easier to implement for the authors of yacc and ANTLR?
Sorry if the question is too vague, I'm just curious about the historical reasons behind the decisions the authors made, and what purpose auto-generated code has in today's environment.
There's a big performance advantage achieved by the parser generator working out the interactions of the grammar rules with respect to one another, and compiling the result to code.
One could build interpreters that simply accepted grammars and did the parsing; there are parser types (Earley) that would actually be relatively good at that, and one could compute the grammar interactions at runtime (Earley parsers kind of do this anyway) rather than offline and then execute the parsing algorithm.
But you would pay a parsing performance penalty of 10 to 100x slowdown, and probably a big storage demand.
If you are parsing using only very small grammars, or you are parsing only very small documents, this might not matter. But the grammars that many parser generators get applied too end up being fairly big (people keep wanting to add things to what you can say in a language), and they often end up processing pretty big documents. So performance now matters, and viola, people build code-generating parser generators.
Once you have a tool, it is often easier to use even in simple cases. So now that you have parser generators, you can even apply them to little grammars or to parsing little documents.
EDIT: Addendum. The historical reason is probably driven by space and time demands. Earlier systems had not a lot of room (32Kb in 1975), didn't run very fast (1 MIPS same time frame), and people had big source files already. Parser generators tended to help with this set of problems; interpreted grammars would have had intolerably bad performance.
Ira Baxter gave you one set of reasons for not handling the grammar parsing at runtime.
There is another reason too. Associated with each rule in the grammar is the appropriate action. The action is normally a fragment of a separate language (for example, C or C++). All actions in a grammar interpreted at runtime would have to be mappable to something appropriate in the program. In general, that's a losing proposition. The fragments can do all sorts of things, referencing parts of the stack ($$, $1, etc) and invoking actions (YYACCEPT, etc). Designing the runtime system so that it could be reliably used with such fragments would be tough. You'd like be into creating source code and compiling that into a DSO (dynamic shared object) or DLL (dynamic link library) and loading it. That requires a compiler on the customer's machine, where the customer may have deliberately designed their production system to be compiler-free.

decompilation resources and theory

There must be a million of books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? Im not interested in any particular HW platform. Looking for good books/research papers that examine the subject and difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has to a greater or lesser degree optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
In cases where the source is for a well defined VM, then often this optimisation is left to the JIT compiler and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
As a counter, there are code obfuscators that deliberately perform transformations to the code that prevent simple restoration of the original source by scrambling names, change the sequence code is generated (without changing its resulting meaning) and introducing constructs for which there is no source language equivalent.

Why is Clojure dynamically typed?

One thing I like very much is reading about different programming languages. Currently, I'm learning Scala but that doesn't mean I'm not interested in Groovy, Clojure, Python, and many others. All these languages have a unique look and feel and some characteristic features. In the case of Clojure I don't understand one of these design decisions. As far as I know, Clojure puts great emphasis on its functional paradigm and pretty much forces you to use immutable "variables" wherever possible. So if half of your values are immutable, why is the language dynamically typed?
The Clojure website says:
First and foremost, Clojure is dynamic. That means that a Clojure program is not just something you compile and run, but something with which you can interact.
Well, that sounds completely strange. If a program is compiled you can't change it anymore. Sure you can "interact" with it, that's what UIs are used for but the website certainly doesn't mean a neat "dynamic" GUI.
How does Clojure benefit from dynamical typing
I mean the special case of Clojure and not general advantages of dynamic typing.
How does the dynamic type system help improve functional programming
Again, I know the pleasure of not spilling "int a;" all over the source code but type inference can ease a lot of the pain. Therefore I would just like to know how dynamic typing supports the concepts of a functional language.
If a program is compiled you can't change it anymore.
This is wrong. In image-based systems, like Lisp (Clojure can be seen as a Lisp dialect) and Smalltalk, you can change the compiled environment. Development in such a language typically means working on a running system, adding and changing function definitions, macro definitions, parameters etc. (adding means compiling and loading into the image).
This has a lot of benefits. For one, all the tools can interact directly with the program and do not need to guess at the system's behaviour. You also do not have any long compilation pauses, because each compiled unit is very small (it is very rare to recompile everything). The NASA JPL once corrected a running Lisp system on a probe hundreds of thousands of kilometres away in space.
For such a system, it is very natural to have type information available at runtime (that is what dynamic typing means). Of course, nothing hinders you from also doing type inference and type checks at compilation time. These concepts are orthogonal. Modern Lisp implementations typically can do both.
Well first of all Clojure is a Lisp and Lisps traditionally have always been dynamically typed.
Second as the excerpt you quoted said Clojure is a dynamic language. This means, among other things, that you can define new functions at runtime, evaluate arbitrary code at runtime and so on. All of these things are hard or impossible to do in statically typed languages (without plastering casts all over the place).
Another reason is that macros might complicate debugging type errors immensely. I imagine that generating meaningful error messages for type errors produced by macro-generated code would be quite a task for the compiler.
I agree, a purely functional language can still have an interactive read-eval-print-loop, and would have an easier time with type inference. I assume Clojure wanted to attract lisp programmers by being "lisp for the jvm", and chose to be dynamic like other lisps. Another factor is that type systems need to be designed as the very first step of the language, and it's faster for language implementors to just skip that step.
(I'm rephrasing the original answer since it generated too much misunderstanding)
One of the reasons to keep Clojure (and any Lisp) dynamically typed is to simplify creation of macros. In short, macros deal with abstract syntax trees (ASTs) which can contain nodes of many, many different types (usually, any objects at all). In theory, it's possible to make full statically typed macro system, but in practice such systems are usually limited and sparsely spread. Please, see examples below and extended discussion in the thread.
EDIT 2020: Wow, 9 years passed from the time I posted this answer, and people still add comments. What a legacy we all have left!
Some people noted in comments that having a statically typed language doesn't prevent you from expressing code as data structure. And, strictly speaking, it's true - union types allow to express data structures of any complexity, including syntax of a language. However I claim that to express the syntax, you must either reduce expressiveness, or use such wide unions that you lose all advantages of static typing. To prove this claim I will use another language - Julia.
Julia is optionally typed - you can constrain any function or struct field to have a particular type, and Julia will check it. The language supports AST as a first class citizen using Expr and Symbol types. Expression definition looks something like this:
struct Expr
head::Symbol
args::Vector{Any}
end
Expression consists of a head which is always a symbol and list of arguments which may have any types. Julia also supports special Union which can constrain argument to specific types, e.g. Symbols and other Exprs:
struct Expr
head::Symbol
args::Vector{Union{Symbol, Expr}}
end
Which is sufficient to express e.g. :(x + y):
dump(:(x + y))
Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol +
2: Symbol x
3: Symbol y
But Julia also supports a number of other types in expressions. One obvious and helpful example is literals:
:(x + 1)
Moreover, you can use interpolation or construct expressions manually to put any object to AST:
obj = create_some_object()
ex1 = :(x + $objs)
ex2 = Expr(:+, :x, obj)
These examples are not just a funny experiments, they are actively used in real code, especially in macros. So you cannot constrain expression arguments to a specific union of types - expressions may contain any values.
Of course, when designing a new language you can put any restrictions on it. Perhaps, restricting Expr to contain only Symbol, Expr and some Literals would be useful in some contexts. But it goes against principles of simplicity and flexibility in both - Julia and Clojure, and would significantly reduce usefulness of macros.
Because that's what the world/market needed. No sense in building what's already built.
I hear the JVM already has a statically typed language ;)