How does language design influence VM and bytecode design? - language-design

For example, how did the design of C# and VB.NET shape the development of CIL (and vice-versa)? What about Java and the JVM? How did the nature of PHP affect the development of HHBC/the HHVM, or Perl and Parrot, or Smalltalk and the VMs for various implementations?

Language design will influence the VM if the designers want it to. Some VMs are more independent than others. For example, Java does not have multiple inheritance, so JVM does not either.

Generally, a language machine (such as the Java Virtual Machine or the .NET CLR) will closely reflect the requirements of the language (Java for the JVM, C# for the CLR) for which it was designed.
For example, pretty much every Java byte code in the original JVM v1.0 was needed by the compiler. One could suggest that the needs of the JavaC compiler author(s) were being provided on demand by the JVM author(s). (It was a small team, so it may have even been the same person.)
The CLR is a bit different, because in addition to C#, they jammed in some stuff to support a pretend-C++ language, which required at least 3 additional op codes (IIRC). Nonetheless, the CLR was pretty much designed just to support C#.
It's interesting to analyze the Android Davlik engine, since it was designed as a JVM-but-without-using-JVM-byte-codes engine. (It is also register based, instead of stack based.)
At some level, the primary decision becomes this: Whether the engine is a low level Turing complete machine (something like a software RISC machine), or whether the engine's primitive language (its IL) is simply a binary form of its primary source code language. The former is more like WASM (arguably general purpose), while the latter is more like the JVM and CLR specs.

Related

Anyone program "low-level" on JVM?

Sometimes we hear about brave people who understand and write assembly language for performance reasons, as opposed to using a compiler with a high-level language. Can the same be done on the JVM? I've reviewed the JVM instruction set, and it resembles assembly language in some respects, though it's much higher level (I'm assuming that the system-specific implementations of the JVM are extremely efficient).
Is it possible to, say, write JVM instructions and put them into a Java-executable binary?
Yes. You can do this via the asm library.
In fact, this is typically how people implement non-Java languages on top of the JVM, and how many Java metaprogramming libraries work.
You may very well want to do this for the same kind of metaprogramming capabilities - e.g., generating classes at runtime, or using the InvokeDynamic instruction to generate your own method dispatch rules.
There isn't a whole lot of performance benefit to be gained from using raw Java bytecode rather than writing the corresponding high-level Java (the JIT is your main performance booster, and it's optimized for the sorts of patterns "vanilla" Java code generates) but it does give you flexibility for things that are difficult, verbose, or impossible to express in Java.

Does JVM or CLR use registers for running JIT'ed code?

I understand that JVM and CLR were designed as stack-based virtual machines. When JIT compiles bytecode into native code, does it also translate stack primitives (load/store) to registers on X86 platform?
If yes, it looks like whether bytecode is stack-based or register-based doesn't really matter. JIT matters.
I think that you are confusing two different concepts.
At least for Java, the JVM acts as a virtual machine - it's an idealized computing machine with a comparatively high-level assembly language (the bytecode) that is based on a call stack with stack frames. When compiling Java into bytecode, the Java program is turned into (essentially) an assembly program for controlling this machine.
When actually running Java on a given system, the job of the JVM implementation is to faithfully simulate the execution of this stack-based machine using whatever hardware is actually available. This typically means that a huge number of stack operations would be implemented using registers when possible, and perhaps using other specialized hardware that isn't present in the description of the Java virtual machine. The actual details of how this is done is implementation-specific - some implementations might compile it down to machine code that does almost everything in registers, while a simpler implementation might just compile down to in-memory operations. I worked for a few months on a JavaScript implementation of the JVM, in which case we "compiled" the code down to JS functions, which were in turn handed off to the browser's JS implementation.
The reason for this distinction is that Java was designed to be easily downloaded and embedded (think applets). In this case, security and portability are important concerns. The bytecode had to have some way to be inspected automatically to rule out certain types of malicious code (buffer overruns, for example). Similarly, whatever format was used had to be sufficiently high-level that it could be run on a variety of different platforms (handheld devices, supercomputers, PCs, etc.) The choice of the stack-based JVM made both of these concerns possible to satisfy simultaneously. It's high-level enough that it's possible to inspect the bytecode to rule out many type errors or reads/writes of uninitialized memory, while sufficiently low-level that a JVM can use tricks like compiling down to code using registers.
If you are curious what your particular JVM will do to a specific piece of code, you should take a look at the documentation. Most JVMs have some way of giving you information about how they're executing the code. If your question is "why not just have bytecode do register-based manipulation," the reason is twofold:
There is an analog of registers in bytecode - each stack frame has some extra dedicated space for temporary values to be stored, and
There isn't as robust support for registers as is present in x86 or MIPS because the JVM code had to be easy to execute on multiple pieces of hardware, and hardcoding in a number of registers might complicate things.
Hope this helps!
It is impossible to not use registers on an x86 core. The processor doesn't have an instruction to, say, add two local variables. One of them has to be loaded in a register. Then you can add the value in the register to the value in a variable. And store the result back to a stack variable.
The optimization opportunities are obvious from this sequence. Like not storing it back but keeping the result in a register and using it later, saving both the store and the load. That's the job of the optimizer, it looks for ways to make the best use of the available registers.
The only way to know for sure would be to examine JIT compiled output, but it's quite safe to say that using registers is one of the JIT compiler's lamest optimizations. I believe most programmers would be hard pressed to write faster code than the JIT compiler does.
The JIT compiler is capable of a lot, and probably uses registers as much as is appropriate. Things like method inlining encourage the use of registers, and a lot of imperative program code can be expressed more simply on a register-based architecture, so it only makes sense for the JIT compiler to use registers.

Matching a virtual machine design with its primary programming language

As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity here is an example. The JVM uses a stack machine rather then a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so intending platform portability, and converting a stack machine back into a register machine was easier and more efficient then overcoming an impedance mismatch where there were too many or too few registers virtual.
Here's another example: for Haskell, the paper to look at is Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. This is very different from any other type of VM I know about. And in point of fact GHC (the premier implementation of Haskell) does not run live, but is used as an intermediate step in compilation. Peyton-Jones lists no less then 8 other virtual machines that didn't work. I would like to understand why some VM's succeed where other fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for "interpreter" of a lower level language than the source language. Here I'm using the black box meaning of the word "interpreter". I don't care how a VM gets implemented (as a bytecode intepereter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing it's the low level language.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two it will also make it easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like Hotspot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or, perhaps some more complicated shape than a line because this isn't a flat Euclidean space, but you get the idea). If you build your VM language far outside of that line then it won't be very useful. That's what constrains VM design: putting it somewhere into that ideal line.
That line is also why high level VMs tend to be very language specific while low level VMs are more language agnostic but don't provide many services. A high level VM is by its nature close to the source language which makes it far from other, different source languages. A low level VM is by its nature close to the target platform thus close to the platform end of the ideal lines for many languages but that low level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques of deriving a VM is to just go down the compilation chain, transforming your source language into more and more low level intermediate languages. Once you spot a low level enough language suitable for a flat representation (i.e., the one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformations chain from the point you selected for a serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
It is a bit different story if you intend to interpret you VM code immediately instead of compiling it. There is no consensus currently in what kind of VMs are better suited for interpretation. Both stack- (or I'd dare to say, Forth-) based VMs and register-based had proven to be efficient.
I found this book to be helpful. It discusses many of the points you are asking about. (note I'm not in any way affiliated with Amazon, nor am I promoting Amazon; just was the easiest place to link from).
http://www.amazon.com/dp/1852339691/

Why do almost all OO languages compile to bytecode?

Of the object-oriented languages I know, pretty much all but C++ and Objective-C compile to bytecode running on some sort of virtual machine. Why have so many different languages settled on compiling to bytecode, as opposed to machine code? Is it possible in princible to have a high-level memory-managed OOP language that compiled to machine code?
Edit: I'm aware that multiplatform support is often advanced as an advantage of this approach. However, it's quite possible to compile natively on multiple platforms, without making a new compiler per platform. One can, per example, emit C code and then compile that with GCC.
There's no reason in fact, this is a kind of coincidence. OOP now is the leading concept in "big" programming, and so virtual machines are.
Also note, that there are 2 distinct parts of traditional virtual machines - garbage collector and bytecode interpreter/JIT-compiler, and these parts can exist separately. For example, Common Lisp implementation called SBCL compiles program to a native code, but at runtime heavily uses garbage collection.
This is done to allow a VM or JIT compiler the chance to compile the code on demand optimally for the architecture on which the code is executed. Also, it allows for cross-platform bytecode to be created once and then executed on multiple hardware architectures. This allows for hardware specific optimizations to be placed into the compiled code.
Since byte code is not limited to a microarchitecture, it can be smaller than machine code. Complex instructions can be represented vs. the much more primitive instructions available in modern day CPUs, since the constraints in the design of CPU instructions are very different from the constraints in designing a bytecode architecture.
Then there's the issue of security. The bytecode can be verified and analyzed prior to execution (i.e., no buffer overflows, variables of a certain type being accessed as something they are not), etc...
Java uses bytecode because two of its initial design goals were portability and compactness. Those both came from the initial vision of a language for embedded devices, where fragments of code could be downloaded on the fly.
Python, Ruby, Smalltalk, javascript, awk and so on use bytecode because writing a native compiler is a lot of work, but a textual interpreter is too slow - bytecode hits a sweet spot of being fairly easy to write, but also satisfactorily quick to run.
I have no idea why the Microsoft languages use bytecode, since for them, neither portability nor compactness is a big deal. A lot of the thinking behind the CLR came out of computer scientists in Cambridge, so i imagine considerations like ease of program analysis and verification were involved.
Note that as well as C++ and Objective C, Eiffel, Ada 9X, Vala and Go are OO languages (of varying vintage) that are compiled straight to native code.
All in all, i'd say that OO and bytecode do not go hand in hand. Rather, we have a coincidental convergence of several streams of development: the traditional bytecoded interpreters of scripting languages like Python and Ruby, the mad Gosling masterplan of Java, and whatever it is Microsoft's motives are.
The biggest reason why most interpreted languages (not specifically OO languages) are compiled to bytecode is for performance. The most expensive part of interpreting code is transforming text source to an intermediate representation. For instance, to perform something like:
foo + bar;
The interpreter would have to scan 10 characters, transform them into 4 tokens, build an AST for the operation, resolve three symbols (+ is a symbol, which depends on the types of foo and bar), all before it can perform any action that actually depends on the run-time state of the program. None of this can change from run to run, and so many languages try to store some form of intermediate representation.
bytecode, rather than storing an AST has a few advantages. For one, bytecodes are easy to serialize, so the IR can be written to disk and reused at the next invocation, further reducing interpretation time. Another reason is that bytecode often takes up less actual ram. significantly bytecode representations are often easy to just in time compile, because they are often structurally similar to typical machine code.
As another data point, the D programming language is GC'ed, OO, and a lot higher level than C++ while still being compiled to native code.
Bytecode is significantly more flexible medium than machine code. First, it provides the basis for platform portability without the need for a compiler or shipping source code. So a developer can distribute a single version of the application without needing to give up the source, require complex developer tools, or anticipate potential target platforms. While the later is not always practical it does happen. Especially with developer libraries say I distribute a library that I've only tested on Windows, but someone else uses it on Linux or Android. It happens quite frequently actually, and most of the time it works as expected.
Byte code is also generally more optimized that an interpreter because it's closer to machine instructions therefore faster to translate to machine instructions. Not all OO languages are compiled. Ruby, Python, and even Javascript are interpreted so they aren't compiled to anything so the ruby interpreter has to take a very flexible language and turn that into instructions, but that flexibility comes at a price paid an runtime: parse text, generate AST, translate AST to machine code, etc. It's also easy to do optimizations like JIT where byte code is translated to machine code directly, and even gives the possibility for creating optimizations for specific hardware.
Finally, just because one language compiles to bytecode doesn't preclude other languages taking advantage of of that byte code. Now any optimization using that byte code can be applied to these other languages that might know how to translate themselves to that byte code. That makes the byte code a very important layer for reusability for other languages.
OO and byte code compilation goes back to the 70s with Smalltalk, and I'm sure someone will say LISP as early as the 50s/60s. But, it really wasn't until the 90s that it started to really be used in production systems on a large scale.
Native compilation sounds like the optimal path, and probably why our industry spent 20 years or more thinking that was THE ANSWER to all our problems, but the last 15 years we've seen byte code compilation take stage and it's been a significant advantage over what we did before. Looking back we realize how much time wasted natively compiling everything mostly by hand.
I agree with Chubbard's answer and I'd add that in OO languages type information can be very important for enabling optimizations by virtual-machines or last-level compilers
It is easier to develop an interpreter than a compiler.
Effort in development of...:
interpreter < bytecode-interpreter < bytecode-jit-compiler < compiler-to-platform-independent-language < compiler-to-multiple-machine-dependent-assembler.
It is a general trend to stop the development at jit-compilers because of platform independence. Only the preferred languages in respect to performance and research in theoretical computer science are and will be developed in ALL possible directions, including new bytecode-interpreter, even while there are good and advanced compilers to platform independent languages and to different machine-dependant assemblers.
The research in OOP languages is pretty ...let's say dull, compared to functional languages, because really new language and compiler technologies are more easily expressed with/in/using mathematical cathegory theory and mathematical descriptions of touring-complete type-systems. In other words: it is nearly functional in itself, while imperative languages are nearly only assembler-frontends with some syntactic sugar. OOP languages tend to be imperative languages, because functional languages have already closures and lambda. There are other ways to implement java-like "interfaces" in functional languages, and there is just no need for additional object oriented features.
In i.e. Haskell, adding the feature of OOP-like programming would probably be more than only a few steps back in technology – there would be no point in using that. (<- that is not only IMHO... you ever heard of GADTs or Multi-parameter-type-classes?) Probably there might be even better ways to dynamically create Objects with Interfaces to communicate with OOP-languges than changing that language itself. But there are other functional languages, too, that explicitely combine functional and OOP aspects. There is just more science with mainly functional languages than non-functional OO-languages.
OO languages can not be easily compiled to other OO languages, iff they are in some way more "advanced". Usually, they have features like stack-protector, advanced debugging abilities, abstract and inspectable multi-threading, dynamic object-loading from files from the internet... Many of these features are not or not-easily realisable with C or C++ as compiler-backend. The functional language LISP (which is 50 years old!) was AFAIK the first with garbage collector. As compiler-backend LISP used a hacked version of the language C, because plain C did not allow some of those things, assembler did allow, i.e. proper-tail-calls or tables-next-to-code. C-- allows that.
An other aspect: Imperative languages are intended to run on a specific architecture, i.e. C and C++ programs run on only those architectures, they are programmed for. Java is more extreme: it runs only on a single architecture, a virtual one, which itself runs on others.
Functional languages are usually by design pretty architecture-independent: LISP was developed to be so immense architecture-unspecific, that it could be compiled to genetic code, in some distant future. Yes, like programs running in living biologic cells.
With the bytecode for the LLVM, functional languages will most-likely be compiled to bytecode in the future, too. Most imperative languages will most likely still have the same inherited problems as they have now from not-abstracting-far-enough. Well, I'm not that sure about clang and D, but those two are not "the most" anyway.

How does Parrot compare to other virtual machines?

Parrot is the virtual machine originally designed for Perl 6.
What technical capabilities does the Parrot VM offer that competing virtual machines such as the Java Virtual Machine (JVM)/Hotspot VM and Common Language Runtime (CLR) lack?
The following answer was written in 2009. See also this 2015 update by raiph.
To expand on #Reed and point out some highlights, Parrot's opcodes are at a far higher level than most virtual machines. For example, while most machines store integers and floats, the basic registers are integers, numbers, strings and Parrot Magic Cookies (PMCs). Just having strings built in is a step up from the JVM.
More interesting is the PMC, sort of like JVM's object type but far more fungible. PMCs are a container for all the other more complicated types you need in a real language like arrays, tables, trees, iterators, I/O etc. The PMC and the wide variety of built in ops for it means less work for the language writer. Parrot does not shy away from the messy but necessary bits of implementing a language.
My information may be out of date, but I believe opcodes are pluggable, you can ship a Parrot VM that only contains the opcodes your language needs. They were also going to be inheritable, if your language wants their arrays to work a little different from stock Parrot arrays you can do that subclass it.
Finally, Parrot can be written for not just in assembler (PASM) but also a slightly higher level language, Parrot Intermediate Representation (PIR). PIR has loops, subroutines, localized variables and some basic math and comparison ops, all the basics people expect in a programming language, without getting too far away from the metal.
All in all, Parrot is very friendly to language designers (it is written by and for them) who want to design a language and leave as much of the implementation as possible to somebody else.
You can read about much of this on the Parrot VM Intro page.
The main advantage Parrot has over the JVM or the CLR would be that it is designed to support dynamic languages first, and potentially provide better support and performance for dynamically typed languages. The JVM and the CLR are both geared more towards supporting statically typed languages, and many of the design decisions show that.
Parrot is the virtual machine originally designed for Perl 6.
There are now two VMs originally designed for Perl 6; commits to MoarVM began in 2012.
What technical capabilities does the Parrot VM offer that competing virtual machines such as the Java Virtual Machine (JVM)/Hotspot VM and Common Language Runtime (CLR) lack?
In another answer on this page, Reini Urban, the current (April 2015) Parrot lead dev, provides a brief comparison of Parrot with the JVM and CLR VM.
According to Reini, a key advantage Parrot has over MoarVM is "effectively lock-less threads".
One other thing that makes Parrot different from most VMs (certainly different from the JVM), is that it's a register machine rather than a stack machine. But I think people will be arguing for a long long time whether that can be called an advantage or a disadvantage.
I don't know JVM and CLR enough, but my tips:
dynamic languages (closures, polymorphic scalars, continuations, co-routines) support (speed)
dynamic method dispatch,
first class functions,
first-class continuations,
parameters (optional, named, ..),
register based
has HLL interoperability provided at an assembly level
range of platforms
Update: This is probably irrelevant as JVM is one of Rakudo Perl 6 backends nowadays. See Rakudo Perl 6 on the JVM (Perl 6 Advent calendar 2013, Day 3).
The main advantage and technical difference over the JVM and the CLR is that types (classes called PMC's) and ops (methods) may be dynamically loaded from efficient user-provided C implementations, and the parser framework to create and extend languages is built-in.
This question is outdated. Rakudo Perl 6 no longer targets Parrot as a backend; MoarVM is the preferred backend, with the JVM backend a work in progress (generally works, but many Perl 6 features not implemented or currently broken). Development work (not ready for users) is being done to add Javascript as a third backend.