I'm planning on writing a programming language targeting the .NET platform, which led me to start thinking about the code generation aspect of targeting such a platform. I'm new at writing compilers, but I know that optimization is (or can be) one of the phases of compiling. I started to wonder whether there is any benefit to spending time optimizing the output (in this case CIL, but this would apply to the JVM too), because the JIT compiler and things like the JVM's HotSpot can optimize at run time. Is there any benefit to optimizing the generated code (CIL or the JVM equivalent) when targeting .NET or the JVM, since the JIT will already optimize?
It depends. There are countless optimizations. Any given compiler (your compiler, the JIT compiler, or any other compiler) necessarily implements only a subset of those. This choice depends on available time, typical/expected input code, priorities, etc. and therefore the engineers who built the JIT compiler may have selected optimizations which work well for the programs they were expecting, but not so well for the kind of program you care about.
You will have to determine what optimizations the JIT compiler misses. The way to do this is, of course, empirical: Actually write programs, letting the JIT compiler optimize them (be sure to do this part properly - disable debugging, compile for release, choose realistic benchmarks, etc.), and then inspect the final machine code. Look for unexpected code (you will, of course, need assembly knowledge for this) and determine if it's a missed optimization or if the JIT was smarter than you thought.
If it is a missed optimization, you have another problem: You can't output the machine code you want, you have to generate different IL instead.
A missed optimization is probably due to a language feature the VM doesn't know about (e.g. multimethods on the JVM). You lowered it into the VM's terms during compilation, but the translation you chose doesn't sit well with the JIT's order of passes, heuristics, etc.
As you can't just output machine code yourself, you must now find an alternative IL fragment for the same input language code. Ideally, one which the JIT compiler does handle well. Finding that may be an exercise in imagination, but it's not technically hard, just guesswork interleaved with benchmarking.
As another answer points out, JIT compilers work under time constraints. This may lead to optimizations that could happen being missed (e.g. constant propagation running out of time), but as the creators of the JIT compiler faced the same problem, this probably isn't too severe if you don't create much larger/more complicated code.
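To make "constant propagation" concrete, here is a purely illustrative before/after in C (hand-written to show the idea, not the output of any particular compiler):

    /* before: every value in this function is known at compile time */
    int buffer_bytes(void) {
        int elems = 4 * 4;      /* folds to 16 */
        int bytes = elems * 8;  /* the 16 propagates, folds to 128 */
        return bytes + 2;       /* folds to 130 */
    }

    /* after constant propagation and folding, the body collapses to: */
    int buffer_bytes_optimized(void) {
        return 130;
    }

A JIT that runs out of its time budget may leave the first form largely as-is, whereas an AOT pass that has already done the folding costs the JIT nothing.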
If you create such bad code that the JIT compiler can't fix it all, then you have to duplicate its optimizations in your AOT compiler. I'm not convinced that this is a likely scenario though, and even if it happens, very simple optimizations should mostly fix the problem.
So, in summary: Start with a straightforward translation, then seek out missed optimizations and either make it easier to optimize for the JIT compiler, or do it yourself (if possible - adaptive optimization is much harder in an AOT setting).
I think this question is hard to answer in general.
For example, the F# compiler performs tail call optimization: tail-recursive functions are common in that language, the F# compiler can in some cases optimize them better than the JIT compiler, and some versions of the JIT compiler don't perform the optimization at all.
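As a rough sketch of the idea in C (hand-written to show the transformation, not actual F# or JIT output):

    /* tail-recursive form: the recursive call is the last thing the function does */
    long sum_to(long n, long acc) {
        if (n == 0) return acc;
        return sum_to(n - 1, acc + n);   /* tail call */
    }

    /* what a tail-call-optimizing compiler can emit instead: a plain loop,
       so the stack does not grow with n */
    long sum_to_loop(long n, long acc) {
        while (n != 0) {
            acc += n;
            n -= 1;
        }
        return acc;
    }

Without the transformation, a large n means a deep call stack (or a stack overflow); with it, the cost is constant stack space.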
So, your language might have some common operation whose straightforward implementation wouldn't perform well. In that case, it makes sense to emit IL code that's optimized.
What I think you should do is the same as when you're writing a normal program: first write your code in a way that is simple and readable. Only if something doesn't perform well should you attempt to optimize it. It might be worth keeping in mind that you may need some optimizations in the future, and making your code modular enough that you don't have to rewrite half of it because of one. But for now, that should be enough.
Writing a compiler is a hard enough job already (even if you're targeting an IL). Finish it first and think about optimizations later.
Generally, JIT compilers have some thresholds governing how much optimization they will attempt to perform. These may be based on the size of a method's IL and/or the amount of time already spent JIT compiling the method. So yes, IL which has already been optimized may benefit from further JIT optimization. As always, there is a trade-off: how much time do you want to spend adding AOT optimizations to your compiler (and testing/maintaining them) versus how quickly your code can be JIT compiled, and with what level of optimization.
The magnitude of the improvement depends largely on how much simpler (and smaller) the AOT-optimized IL is relative to the unoptimized IL, as well as the thresholds governing the JIT compiler (which, at least for the Microsoft CLR, are not widely known). The only way to find out is to do some testing yourself.
There must be a million books and papers on the theory and techniques of building compilers. Are there any resources on doing the reverse? I'm not interested in any particular HW platform. Looking for good books/research papers that examine the subject and its difficulties in depth.
I've worked on an AS3 and Java decompiler and I can assure you that everything I've learned in regards to decompilation is straight from compiler theory. Intermediate representations, data flow analysis, term rewriting, and other related concepts can all be found in the dragon book.
I've written about decompilers for dynamic languages here and for Python specifically.
Note though this is for dynamic languages with custom (high-level) VMs.
Decompilation is really a misnomer. Decompilers compile object code into a source representation. In many ways they are easier to write than traditional compilers - the 'source' code is already syntax checked and usually very precisely formatted.
They build up a symbol table (of addresses) and construct a target language representation of the application. The usual difficulty is that the original compiler has to a greater or lesser degree optimised the original application by removing common sub-expressions, hoisting constant code out of loops and many other similar techniques. These are often not possible to represent in the target language.
Where the object code targets a well-defined VM, this optimisation is often left to the JIT compiler, and the resulting decompiled code is very readable - in many cases almost identical to the original. Compilers of this type often leave some or all of the symbols in the object code, allowing these to be recovered. Others include line numbers to help with debugging and troubleshooting. These all help to recover the original code.
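To make the reconstruction step concrete, here is a deliberately tiny sketch in C (hypothetical opcodes and a single hard-coded function; real decompilers are vastly more involved): it lifts a stack-machine bytecode sequence back into an infix source expression by symbolically executing it onto a stack of expression strings.

    #include <stdio.h>
    #include <string.h>

    /* hypothetical stack-machine opcodes for a two-argument function */
    enum op { OP_LOAD_A, OP_LOAD_B, OP_ADD, OP_MUL, OP_RET };

    int main(void) {
        /* object code for: return (a + b) * a; */
        enum op code[] = { OP_LOAD_A, OP_LOAD_B, OP_ADD, OP_LOAD_A, OP_MUL, OP_RET };

        char stack[8][64];   /* each slot holds a reconstructed source expression */
        int sp = 0;
        char tmp[64];

        for (size_t i = 0; i < sizeof code / sizeof code[0]; i++) {
            switch (code[i]) {
            case OP_LOAD_A: strcpy(stack[sp++], "a"); break;
            case OP_LOAD_B: strcpy(stack[sp++], "b"); break;
            case OP_ADD:
            case OP_MUL:
                /* pop two sub-expressions, push their infix combination */
                snprintf(tmp, sizeof tmp, "(%s %c %s)",
                         stack[sp - 2], code[i] == OP_ADD ? '+' : '*', stack[sp - 1]);
                sp -= 2;
                strcpy(stack[sp++], tmp);
                break;
            case OP_RET:
                printf("return %s;\n", stack[--sp]);   /* prints: return ((a + b) * a); */
                break;
            }
        }
        return 0;
    }

A real decompiler does the same kind of thing over a control-flow graph, with names coming from a recovered symbol table rather than being hard-coded.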
As a counter, there are code obfuscators that deliberately perform transformations to prevent simple restoration of the original source: scrambling names, changing the sequence in which code is generated (without changing its meaning), and introducing constructs for which there is no source-language equivalent.
Of the object-oriented languages I know, pretty much all but C++ and Objective-C compile to bytecode running on some sort of virtual machine. Why have so many different languages settled on compiling to bytecode, as opposed to machine code? Is it possible in principle to have a high-level memory-managed OOP language that compiles to machine code?
Edit: I'm aware that multiplatform support is often advanced as an advantage of this approach. However, it's quite possible to compile natively on multiple platforms without writing a new compiler per platform. One can, for example, emit C code and then compile that with GCC.
There's no inherent reason; it's largely a coincidence. OOP is now the leading concept in "big" programming, and so are virtual machines.
Also note that there are two distinct parts to traditional virtual machines - the garbage collector and the bytecode interpreter/JIT compiler - and these parts can exist separately. For example, the Common Lisp implementation SBCL compiles programs to native code, but at runtime it still relies heavily on garbage collection.
This is done to give a VM or JIT compiler the chance to compile the code on demand, optimally for the architecture on which the code is executed, so hardware-specific optimizations can be placed into the compiled code. It also allows cross-platform bytecode to be created once and then executed on multiple hardware architectures.
Since bytecode is not tied to a particular microarchitecture, it can be smaller than machine code: complex operations can be represented as single instructions, versus the much more primitive instructions available in modern CPUs, because the constraints on designing CPU instructions are very different from the constraints on designing a bytecode architecture.
Then there's the issue of security. The bytecode can be verified and analyzed prior to execution (e.g., no buffer overflows, no variables of one type being accessed as if they were another), etc...
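As a very reduced sketch in C of what "verified prior to execution" can mean (hypothetical opcodes; real verifiers also track types, branch targets, and much more - this one only checks operand-stack depth):

    #include <stdio.h>

    /* hypothetical one-byte opcodes */
    enum op { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    /* returns 1 only if the code can never underflow or overflow the operand stack */
    int verify(const unsigned char *code, int len, int max_stack) {
        int depth = 0;
        for (int i = 0; i < len; i++) {
            switch (code[i]) {
            case OP_PUSH:
                depth += 1;
                i += 1;                      /* skip the immediate operand */
                break;
            case OP_ADD:
                if (depth < 2) return 0;     /* would pop from an empty stack */
                depth -= 1;                  /* pops two, pushes one */
                break;
            case OP_PRINT:
                if (depth < 1) return 0;
                depth -= 1;
                break;
            case OP_HALT:
                return 1;
            default:
                return 0;                    /* unknown opcode */
            }
            if (depth > max_stack) return 0; /* would overflow the stack */
        }
        return 0;                            /* fell off the end without OP_HALT */
    }

    int main(void) {
        unsigned char good[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        unsigned char bad[]  = { OP_ADD, OP_HALT };
        printf("%d %d\n", verify(good, (int)sizeof good, 8),
                          verify(bad, (int)sizeof bad, 8));   /* prints: 1 0 */
        return 0;
    }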
Java uses bytecode because two of its initial design goals were portability and compactness. Those both came from the initial vision of a language for embedded devices, where fragments of code could be downloaded on the fly.
Python, Ruby, Smalltalk, JavaScript, awk and so on use bytecode because writing a native compiler is a lot of work, but a textual interpreter is too slow - bytecode hits a sweet spot of being fairly easy to write, but also satisfactorily quick to run.
I have no idea why the Microsoft languages use bytecode, since for them neither portability nor compactness is a big deal. A lot of the thinking behind the CLR came out of computer scientists in Cambridge, so I imagine considerations like ease of program analysis and verification were involved.
Note that as well as C++ and Objective C, Eiffel, Ada 9X, Vala and Go are OO languages (of varying vintage) that are compiled straight to native code.
All in all, I'd say that OO and bytecode do not go hand in hand. Rather, we have a coincidental convergence of several streams of development: the traditional bytecoded interpreters of scripting languages like Python and Ruby, the mad Gosling masterplan of Java, and whatever Microsoft's motives are.
The biggest reason why most interpreted languages (not specifically OO languages) are compiled to bytecode is for performance. The most expensive part of interpreting code is transforming text source to an intermediate representation. For instance, to perform something like:
foo + bar;
The interpreter would have to scan 10 characters, transform them into 4 tokens, build an AST for the operation, resolve three symbols (+ is a symbol, which depends on the types of foo and bar), all before it can perform any action that actually depends on the run-time state of the program. None of this can change from run to run, and so many languages try to store some form of intermediate representation.
Bytecode, rather than an AST, has a few advantages as the stored form. For one, bytecodes are easy to serialize, so the IR can be written to disk and reused at the next invocation, further reducing interpretation time. Another is that bytecode often takes up less RAM. Most significantly, bytecode representations are often easy to just-in-time compile, because they are usually structurally similar to typical machine code.
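As a hypothetical sketch in C of what that cached intermediate form might look like for the "foo + bar;" example above (made-up opcode names, integers only):

    #include <stdio.h>

    /* the expensive scanning/parsing happens once; what gets cached is this flat array */
    enum op { OP_LOAD_LOCAL, OP_ADD, OP_POP, OP_END };

    int main(void) {
        /* bytecode for the statement: foo + bar;  (local slot 0 = foo, slot 1 = bar) */
        int code[] = { OP_LOAD_LOCAL, 0, OP_LOAD_LOCAL, 1, OP_ADD, OP_POP, OP_END };

        int locals[2] = { 40, 2 };   /* foo = 40, bar = 2 */
        int stack[16], sp = 0, result = 0;

        for (int pc = 0; code[pc] != OP_END; ) {
            switch (code[pc]) {
            case OP_LOAD_LOCAL: stack[sp++] = locals[code[pc + 1]]; pc += 2; break;
            case OP_ADD:        sp--; stack[sp - 1] += stack[sp];   pc += 1; break;
            case OP_POP:        result = stack[--sp];               pc += 1; break;
            }
        }
        printf("%d\n", result);   /* prints 42 */
        return 0;
    }

Executing the flat array touches none of the lexing, parsing, or symbol-resolution work described above; that has all been paid for before the first run.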
As another data point, the D programming language is GC'ed, OO, and a lot higher level than C++ while still being compiled to native code.
Bytecode is a significantly more flexible medium than machine code. First, it provides the basis for platform portability without needing a compiler on every machine or shipping source code. A developer can distribute a single version of the application without needing to give up the source, require complex developer tools, or anticipate potential target platforms. While the latter is not always practical, it does happen, especially with developer libraries: say I distribute a library that I've only tested on Windows, but someone else uses it on Linux or Android. It happens quite frequently, actually, and most of the time it works as expected.
Bytecode is also generally faster than a plain interpreter, because it's closer to machine instructions and therefore faster to translate into them. Not all OO languages are compiled: Ruby, Python, and even JavaScript are interpreted, so they aren't compiled to anything ahead of time; the Ruby interpreter has to take a very flexible language and turn it into instructions, and that flexibility comes at a price paid at runtime: parse text, generate an AST, translate the AST to machine code, etc. Bytecode also makes it easy to do optimizations like JIT, where bytecode is translated to machine code directly, and even opens the possibility of creating optimizations for specific hardware.
Finally, just because one language compiles to bytecode doesn't preclude other languages from taking advantage of that bytecode. Any optimization applied to that bytecode benefits every other language that knows how to translate itself into it. That makes the bytecode a very important layer of reusability for other languages.
OO and bytecode compilation go back to the 70s with Smalltalk, and I'm sure someone will say LISP as early as the 50s/60s. But it really wasn't until the 90s that it began to be used in production systems on a large scale.
Native compilation sounds like the optimal path, which is probably why our industry spent 20 years or more thinking it was THE ANSWER to all our problems, but over the last 15 years we've seen bytecode compilation take the stage, and it has been a significant advantage over what we did before. Looking back, we realize how much time we wasted natively compiling everything, mostly by hand.
I agree with Chubbard's answer, and I'd add that in OO languages type information can be very important for enabling optimizations by virtual machines or last-level compilers.
It is easier to develop an interpreter than a compiler.
Effort in development of...:
interpreter < bytecode-interpreter < bytecode-jit-compiler < compiler-to-platform-independent-language < compiler-to-multiple-machine-dependent-assembler.
It is a general trend to stop development at JIT compilers because of platform independence. Only the languages preferred for performance and for research in theoretical computer science are, and will be, developed in ALL possible directions, including new bytecode interpreters, even where good and advanced compilers to platform-independent languages and to different machine-dependent assemblers already exist.
The research in OOP languages is pretty ...let's say dull, compared to functional languages, because really new language and compiler technologies are more easily expressed with/in/using mathematical category theory and mathematical descriptions of Turing-complete type systems. In other words: it is nearly functional in itself, while imperative languages are nearly only assembler frontends with some syntactic sugar. OOP languages tend to be imperative languages, because functional languages already have closures and lambdas. There are other ways to implement Java-like "interfaces" in functional languages, and there is just no need for additional object-oriented features.
In Haskell, for example, adding OOP-like programming as a feature would probably be more than just a few steps back in technology - there would be no point in using it. (And that is not only IMHO... have you ever heard of GADTs or multi-parameter type classes?) There are probably even better ways to dynamically create objects with interfaces to communicate with OOP languages than changing the language itself. But there are other functional languages, too, that explicitly combine functional and OOP aspects. There is just more science around mainly functional languages than around non-functional OO languages.
OO languages cannot easily be compiled to other OO languages if they are in some way more "advanced". Usually they have features like stack protection, advanced debugging abilities, abstract and inspectable multi-threading, dynamic object loading from files or from the internet... Many of these features are not (easily) realisable with C or C++ as a compiler backend. The functional language LISP (which is 50 years old!) was, AFAIK, the first with a garbage collector. As a compiler backend, LISP used a hacked version of C, because plain C did not allow some things that assembler does, e.g. proper tail calls or tables-next-to-code. C-- allows that.
Another aspect: imperative languages are intended to run on a specific architecture, i.e. C and C++ programs run only on the architectures they are compiled for. Java is more extreme: it runs on only a single architecture, a virtual one, which itself runs on others.
Functional languages are usually pretty architecture-independent by design: LISP was developed to be so immensely architecture-unspecific that, in some distant future, it could be compiled to genetic code. Yes, like programs running in living biological cells.
With the bytecode of LLVM, functional languages will most likely be compiled to bytecode in the future, too. Most imperative languages will most likely still have the same inherited problems they have now from not abstracting far enough. Well, I'm not that sure about Clang and D, but those two are not "the most" anyway.
How do modern optimizing compilers determine when to apply certain optimizations such as loop unrolling and code inlining?
Since both of these affect caching, naively inlining functions with less than X lines, or whatever other simple heuristic, is likely to generate worse performing code. So, how do modern compilers deal with this?
I'm having a hard time finding information on this (especially information that's reasonably easy to understand...); about the best I could find is the Wikipedia article. Any details, links to books/articles/papers are greatly appreciated!
EDIT: Since answers are talking mainly about the two optimizations I mentioned (inlining and loop unrolling) I just wanted to clarify that I'm interested in all and any compiler optimizations, not just those two. I'm also more interested in the optimizations which can be performed during ahead-of-time compilation, though JIT optimization is of interest too (though to a slightly lesser extent).
Thanks!
Usually by being that naive anyway and hoping it is an improvement.
This is why just-in-time compilation is such a winning strategy. Collect statistics then optimize for the common case.
References:
http://lambda-the-ultimate.org/node/768
GCC supports Profile Guided Optimization
And of course the Sun hotspot JVM
You can look at the Spiral project.
On top of that, optimizing is a tough thing to do generically. This is, in part, why there are so many options to the gcc compiler. If you know something about cache and pages you can do some things by hand and request that others be done through the compiler, but no two machines are the same, so the approach must be ad hoc.
In short: better than we do!
You can have a look at this: http://www.linux-kongress.org/2009/slides/compiler_survey_felix_von_leitner.pdf
Didier
Good question. You are asking about so-called speculative optimizations.
Dynamic compilers use both static heuristics and profile information. Static compilers employ heuristics and (off-line) profile information. The latter is often referred to as PGO (Profile-Guided Optimization).
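For the off-line case, a minimal GCC PGO workflow looks roughly like this (the toy program and file name are invented; the two -fprofile-* flags are real GCC options):

    /* Build steps with GCC:
     *   gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo    (instrumented build)
     *   ./pgo_demo                                           (run on representative input; writes .gcda profile data)
     *   gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo         (rebuild, optimizing with the recorded profile)
     */
    #include <stdio.h>

    int main(void) {
        long long sum = 0;
        for (long long i = 0; i < 100000000; i++) {
            if (i % 1000 != 0)   /* the profile tells the compiler this branch is taken 99.9% of the time */
                sum += i;
        }
        printf("%lld\n", sum);
        return 0;
    }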
There are a lot of articles on inlining policies. The most comprehensive one is
An Empirical Study of Method Inlining for a Java Just-In-Time Compiler
It also contains references to related work and sharp (and justified) criticism of some of the articles considered.
In general, state-of-the-art compilers try to use impact analysis to estimate potential effect of speculative optimizations before applying them.
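Purely as a toy illustration of such a cost/benefit check (not any real compiler's policy; every field name and threshold here is invented):

    /* hypothetical per-call-site information a compiler might have gathered */
    struct call_site {
        int  callee_size;       /* e.g. IR instruction count of the callee */
        long call_count;        /* how hot the site is, from static heuristics or profiling */
        int  callee_has_loops;  /* loops make the size/benefit estimate less trustworthy */
    };

    /* toy policy: always inline tiny callees; inline larger ones only at hot,
       loop-free sites and only while a per-caller size budget remains */
    int should_inline(const struct call_site *cs, int remaining_budget) {
        if (cs->callee_size <= 8)
            return 1;
        if (cs->call_count > 10000 && !cs->callee_has_loops)
            return cs->callee_size <= remaining_budget;
        return 0;
    }

Real policies weigh many more signals - callee size after earlier inlining, constant arguments at the call site, code-cache pressure, and so on.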
P.S. Loop unrolling is classic old stuff which helps only for some tight loops that perform only number-crunching ops (no calls and so on). Method inlining is a much more important optimization in modern compilers.
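For reference, the unrolling transformation itself looks like this (a hand-written sketch; a real compiler does it on its IR and also emits a cleanup loop for trip counts that aren't a multiple of the unroll factor):

    /* original tight loop */
    void scale(float *a, int n, float k) {
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }

    /* unrolled by 4: fewer branches and counter updates per element, and more
       independent work for the CPU to overlap (assumes n is a multiple of 4) */
    void scale_unrolled(float *a, int n, float k) {
        for (int i = 0; i < n; i += 4) {
            a[i]     *= k;
            a[i + 1] *= k;
            a[i + 2] *= k;
            a[i + 3] *= k;
        }
    }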
What techniques promote efficient opcode dispatch to make a fast interpreter? Are there some techniques that only work well on modern hardware and others that don't work well anymore due to hardware advances? What trade offs must be made between ease of implementation, speed, and portability?
I'm pleased that Python's C implementation is finally moving beyond a simple switch (opcode) {...} implementation for opcode dispatch to indirect threading as a compile time option, but I'm less pleased that it took them 20 years to get there. Maybe if we document these strategies on stackoverflow the next language will get fast faster.
There are a number of papers on different kinds of dispatch:
M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.
M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.
An excellent summary is provided by Yunhe Shi in his PhD thesis.
Also, someone discovered a new technique a few years ago which is valid ANSI C.
Before you start anything, check out Lua.
It's small (around 150 KB), pure ANSI C, and works on anything that has a C compiler. Very fast.
And most important - source code is clean and readable. Worth checking out.
Indirect threading is a strategy where each opcode implementation has its own JMP to the next opcode. The patch to the Python interpreter looks something like this:
add:
result = a + b;
goto *opcode_targets[*next_instruction++];
opcode_targets maps the instruction in the language's bytecode to the location in memory of the opcode implementation. This is faster because the processor's branch predictor can make a different prediction for each bytecode, in contrast to a switch statement that has only one branch instruction.
The compiler must support computed goto for this to work, which mostly means gcc.
Direct threading is similar, but in direct threading the array of opcodes is replaced with pointers to the opcode implementations, like so:
goto *next_opcode_target++;
These techniques are only useful because modern processors are pipelined and must clear their pipelines (slow) on a mispredicted branch. The processor designers put in branch prediction to avoid having to clear the pipeline as often, but branch prediction only works for branches that are more likely to take a particular path.
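Putting the pieces together, here is a minimal self-contained sketch of this dispatch style using the computed-goto extension (made-up opcodes, no error checking; builds with GCC or Clang, not MSVC):

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    int main(void) {
        /* bytecode for: print(2 + 3) */
        int code[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };

        /* one label address per opcode; &&label is the GCC/Clang "labels as values" extension */
        void *targets[] = { &&op_push, &&op_add, &&op_print, &&op_halt };

        int stack[16], sp = 0;
        int *ip = code;

        /* every opcode body ends with its own dispatch, so each indirect branch
           gets its own predictor entry, unlike a single shared switch */
        #define DISPATCH() goto *targets[*ip++]

        DISPATCH();

    op_push:
        stack[sp++] = *ip++;
        DISPATCH();
    op_add:
        sp--;
        stack[sp - 1] += stack[sp];
        DISPATCH();
    op_print:
        printf("%d\n", stack[--sp]);   /* prints 5 */
        DISPATCH();
    op_halt:
        return 0;
    }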
Just-in-time compilation is one such technique.
One big win is to store the source code in an intermediate form, rather than redoing lexical analysis and parsing during execution.
This can range all the way from just storing the tokens, through Forth style threaded code and on to JIT compilation.
Benchmarking is a good technique for making anything fast on a given platform (or platforms). Test, refine, test again, improve.
I don't think you can get a better answer than that. There are lots of techniques for making interpreters. But I'll give you a tip: don't make trade-offs, just choose what you really need and pursue those goals.
I found a blog post on threaded interpreter implementation that was useful.
The author describes GCC label-based threading and also how to do this in Visual Studio using inline assembler.
http://abepralle.wordpress.com/2009/01/25/how-not-to-make-a-virtual-machine-label-based-threading/
The results are interesting. He reports 33% performance improvement when using GCC but surprisingly the Visual Studio inline assembly implementation is 3 times slower!
The question is a bit vague. But, it seems you're asking about writing an interpreter.
Interpreters typically utilize traditional parsing components: lexer, parser, and abstract syntax tree (AST). This allows the designer to read and interpret valid syntax and build a tree structure of commands with associated operators, parameters, etc.
Once the entire input has been parsed into an AST, the interpreter can begin executing by traversing the tree.
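A stripped-down sketch in C of that tree-walking step (hypothetical node layout, arithmetic only; the lexer/parser that would build the tree is omitted):

    #include <stdio.h>

    /* a tiny AST node: either a number leaf or a binary operator */
    struct node {
        char op;                      /* 0 for a leaf, otherwise '+', '-', '*' or '/' */
        double value;                 /* used when op == 0 */
        struct node *left, *right;    /* used when op != 0 */
    };

    /* the interpreter proper: evaluate by recursively traversing the tree */
    double eval(const struct node *n) {
        if (n->op == 0)
            return n->value;
        double l = eval(n->left), r = eval(n->right);
        switch (n->op) {
        case '+': return l + r;
        case '-': return l - r;
        case '*': return l * r;
        default:  return l / r;
        }
    }

    int main(void) {
        /* hand-built tree for: (1 + 2) * 4  (a parser would normally build this) */
        struct node one  = { 0, 1, NULL, NULL };
        struct node two  = { 0, 2, NULL, NULL };
        struct node four = { 0, 4, NULL, NULL };
        struct node sum  = { '+', 0, &one, &two };
        struct node prod = { '*', 0, &sum, &four };
        printf("%g\n", eval(&prod));   /* prints 12 */
        return 0;
    }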
There are many options, but I've recently used ANTLR as a parser generator that can build parsers in various target languages, including C/C++ and C#.