Does JVM or CLR use registers for running JIT'ed code? - jvm

I understand that the JVM and the CLR were designed as stack-based virtual machines. When the JIT compiles bytecode into native code, does it also translate stack primitives (load/store) into register operations on the x86 platform?
If so, it looks like whether the bytecode is stack-based or register-based doesn't really matter. The JIT is what matters.

I think that you are confusing two different concepts.
At least for Java, the JVM acts as a virtual machine - it's an idealized computing machine with a comparatively high-level assembly language (the bytecode) that is based on a call stack with stack frames. When compiling Java into bytecode, the Java program is turned into (essentially) an assembly program for controlling this machine.
When actually running Java on a given system, the job of the JVM implementation is to faithfully simulate the execution of this stack-based machine using whatever hardware is actually available. This typically means that a huge number of stack operations will be implemented using registers when possible, and perhaps using other specialized hardware that isn't present in the description of the Java virtual machine. The actual details of how this is done are implementation-specific: some implementations might compile the bytecode down to machine code that does almost everything in registers, while a simpler implementation might just compile down to in-memory operations. I worked for a few months on a JavaScript implementation of the JVM, in which case we "compiled" the code down to JS functions, which were in turn handed off to the browser's JS implementation.
The reason for this distinction is that Java was designed to be easily downloaded and embedded (think applets). In this case, security and portability are important concerns. The bytecode had to have some way to be inspected automatically to rule out certain types of malicious code (buffer overruns, for example). Similarly, whatever format was used had to be sufficiently high-level that it could be run on a variety of different platforms (handheld devices, supercomputers, PCs, etc.). The choice of the stack-based JVM made it possible to satisfy both of these concerns simultaneously. It's high-level enough that it's possible to inspect the bytecode to rule out many type errors or reads/writes of uninitialized memory, while sufficiently low-level that a JVM can use tricks like compiling down to code using registers.
If you are curious what your particular JVM will do to a specific piece of code, you should take a look at the documentation. Most JVMs have some way of giving you information about how they're executing the code. If your question is "why not just have bytecode do register-based manipulation," the reason is twofold:
1) There is an analog of registers in bytecode: each stack frame has some extra dedicated space for storing temporary values, and
2) There isn't as robust support for registers as is present in x86 or MIPS because JVM code had to be easy to execute on many kinds of hardware, and hardcoding a fixed number of registers might complicate things.
Hope this helps!

It is impossible to not use registers on an x86 core. The processor doesn't have an instruction to, say, add two local variables. One of them has to be loaded in a register. Then you can add the value in the register to the value in a variable. And store the result back to a stack variable.
The optimization opportunities are obvious from this sequence, such as not storing the result back but keeping it in a register and using it later, which saves both the store and the load. That's the job of the optimizer: it looks for ways to make the best use of the available registers.
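To make the idea concrete, here is a minimal sketch of how a compiler can turn stack-style code into register-style code. It is not how any particular JIT works; the op names (`load`, `const`, `add`, `store`) and the `r0`, `r1` temporaries are made up for illustration and are not real JVM or CIL opcodes. The trick is that the operand stack is tracked at translation time, so the pushes and pops themselves produce no instructions.

```python
def stack_to_registers(bytecode):
    """Translate stack-style ops into three-address register code.

    The operand stack is simulated at translation time: it holds the
    *names* of registers or variables, not values, so pushing and popping
    emit no code of their own.
    """
    stack = []        # compile-time model of the operand stack
    code = []         # emitted register-style instructions
    next_temp = 0     # counter for hypothetical temporaries r0, r1, ...

    for op, *args in bytecode:
        if op == "load":            # push a local variable
            stack.append(args[0])
        elif op == "const":         # push a constant
            stack.append(str(args[0]))
        elif op in ("add", "sub", "mul"):
            right = stack.pop()
            left = stack.pop()
            temp = f"r{next_temp}"
            next_temp += 1
            code.append(f"{temp} = {op} {left}, {right}")
            stack.append(temp)
        elif op == "store":         # pop into a local variable
            code.append(f"{args[0]} = {stack.pop()}")
        else:
            raise ValueError(f"unknown op: {op}")
    return code


# c = (a + b) * 2, written as stack code
program = [
    ("load", "a"), ("load", "b"), ("add",),
    ("const", 2), ("mul",),
    ("store", "c"),
]
register_code = stack_to_registers(program)
```

Six stack operations collapse into three register-style instructions here. A real JIT would then map the unlimited temporaries onto the handful of physical registers the CPU has.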

The only way to know for sure would be to examine the JIT-compiled output, but it's quite safe to say that using registers is one of the JIT compiler's most basic optimizations. I believe most programmers would be hard pressed to write faster code than the JIT compiler does.
The JIT compiler is capable of a lot, and probably uses registers as much as is appropriate. Things like method inlining encourage the use of registers, and a lot of imperative program code can be expressed more simply on a register-based architecture, so it only makes sense for the JIT compiler to use registers.

Related

How does language design influence VM and bytecode design?

For example, how did the design of C# and VB.NET shape the development of CIL (and vice-versa)? What about Java and the JVM? How did the nature of PHP affect the development of HHBC/the HHVM, or Perl and Parrot, or Smalltalk and the VMs for various implementations?
Language design will influence the VM if the designers want it to. Some VMs are more independent than others. For example, Java does not have multiple inheritance, so the JVM does not either.
Generally, a language machine (such as the Java Virtual Machine or the .NET CLR) will closely reflect the requirements of the language (Java for the JVM, C# for the CLR) for which it was designed.
For example, pretty much every Java bytecode in the original JVM v1.0 was needed by the compiler. One could suggest that the needs of the javac compiler author(s) were being provided on demand by the JVM author(s). (It was a small team, so it may have even been the same person.)
The CLR is a bit different, because in addition to C#, they jammed in some stuff to support a pretend-C++ language, which required at least 3 additional op codes (IIRC). Nonetheless, the CLR was pretty much designed just to support C#.
It's interesting to analyze the Android Dalvik engine, since it was designed as a JVM-but-without-using-JVM-byte-codes engine. (It is also register based, instead of stack based.)
At some level, the primary decision becomes this: Whether the engine is a low level Turing complete machine (something like a software RISC machine), or whether the engine's primitive language (its IL) is simply a binary form of its primary source code language. The former is more like WASM (arguably general purpose), while the latter is more like the JVM and CLR specs.

Anyone program "low-level" on JVM?

Sometimes we hear about brave people who understand and write assembly language for performance reasons, as opposed to using a compiler with a high-level language. Can the same be done on the JVM? I've reviewed the JVM instruction set, and it resembles assembly language in some respects, though it's much higher level (I'm assuming that the system-specific implementations of the JVM are extremely efficient).
Is it possible to, say, write JVM instructions and put them into a Java-executable binary?
Yes. You can do this via the ASM library.
In fact, this is typically how people implement non-Java languages on top of the JVM, and how many Java metaprogramming libraries work.
You may very well want to do this for the same kind of metaprogramming capabilities, e.g., generating classes at runtime, or using the invokedynamic instruction to define your own method dispatch rules.
There isn't a whole lot of performance benefit to be gained from using raw Java bytecode rather than writing the corresponding high-level Java (the JIT is your main performance booster, and it's optimized for the sorts of patterns "vanilla" Java code generates) but it does give you flexibility for things that are difficult, verbose, or impossible to express in Java.

Matching a virtual machine design with its primary programming language

As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity, here is an example. The JVM uses a stack machine rather than a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so with platform portability in mind, and converting a stack machine back into a register machine was easier and more efficient than overcoming an impedance mismatch where there were too many or too few virtual registers.
Here's another example: for Haskell, the paper to look at is "Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine". This is very different from any other type of VM I know about. And in point of fact, in GHC (the premier implementation of Haskell) this machine does not run live, but is used as an intermediate step in compilation. Peyton Jones lists no fewer than 8 other virtual machines that didn't work. I would like to understand why some VMs succeed where others fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for an "interpreter" of a lower-level language than the source language. Here I'm using the black-box meaning of the word "interpreter": I don't care how a VM gets implemented (as a bytecode interpreter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing; the low-level language is.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two, it will also be easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like HotSpot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or perhaps some more complicated shape than a line, because this isn't a flat Euclidean space, but you get the idea.) If you build your VM language far away from that line then it won't be very useful. That's what constrains VM design: putting it somewhere on that ideal line.
That line is also why high-level VMs tend to be very language-specific, while low-level VMs are more language-agnostic but don't provide many services. A high-level VM is by its nature close to the source language, which makes it far from other, different source languages. A low-level VM is by its nature close to the target platform, and thus close to the platform end of the ideal lines for many languages, but that low-level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques of deriving a VM is to just go down the compilation chain, transforming your source language into more and more low level intermediate languages. Once you spot a low level enough language suitable for a flat representation (i.e., the one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformations chain from the point you selected for a serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
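As a rough illustration of the "it is only a serialisation" point, here is a small sketch that flattens an expression tree into stack-style instructions and then rebuilds the identical tree from them. The tuple-based tree format and the `push`/`op` instruction names are invented for this example and do not correspond to CLR, LLVM, or Dalvik encodings.

```python
def serialise(tree):
    """Flatten an expression tree into postfix (stack-style) instructions.

    A tree is either a leaf (a name or a number) or a tuple
    (operator, left_subtree, right_subtree).
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return serialise(left) + serialise(right) + [("op", op)]
    return [("push", tree)]


def deserialise(instructions):
    """Rebuild the expression tree by replaying the instructions on a stack."""
    stack = []
    for kind, value in instructions:
        if kind == "push":
            stack.append(value)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append((value, left, right))
    (tree,) = stack   # well-formed input leaves exactly one tree
    return tree


expr = ("*", ("+", "a", "b"), ("-", "c", 1))   # (a + b) * (c - 1)
flat = serialise(expr)
round_tripped = deserialise(flat)
```

Nothing is lost in the round trip, so a compiler that loads the flat form can carry on from the tree as if the serialisation step had never happened.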
It is a bit of a different story if you intend to interpret your VM code immediately instead of compiling it. There is currently no consensus on which kind of VM is better suited for interpretation. Both stack-based (or, I'd dare to say, Forth-based) VMs and register-based VMs have proven to be efficient.
I found this book to be helpful. It discusses many of the points you are asking about. (note I'm not in any way affiliated with Amazon, nor am I promoting Amazon; just was the easiest place to link from).
http://www.amazon.com/dp/1852339691/

Choosing CPU architecture for LLVM/CLANG

I am designing a TTL serial computer, and I am struggling to choose the architecture more suitable for an LLVM compiler backend (I want to be able to run any C++ software there). There will be no MMU, no multiplication/division, no hardware stack, no interrupts.
I have 2 main options:
1) 8-bit memory, 8-bit ALU, 8-bit registers (~12-16). Memory address width is 24 bits, so I will need to use 3 registers as the IP and 3 registers to address any memory location.
Needless to say, any address calculations would be pure pain to implement in the compiler.
2) 24-bit memory, 24-bit ALU, 24-bit registers (~6-8). Flat memory, nice. The drawback is that, due to the serial nature of the design, each operation would take 3 times as many clocks, even if we are only operating on booleans. 24-bit memory data width is expensive, and it's harder to implement in hardware in general.
The question is: do you think implementing all C++ features on this 8-bit, stack-less hardware is possible, or do I need more complex hardware to get generated code of reasonable quality and speed?
I second the suggestion to use LCC. I used it in this homebrew 16-bit RISC project: http://fpgacpu.org/xsoc/cc.html .
I don't think it should make much difference whether you build the 8-bit variant and use 3 add-with-carries to increment IP, or the 24-bit variant and do the whole thing in hardware. You can hide the difference in your assembler.
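For what it's worth, here is a minimal sketch of the "3 add-with-carries" idea for the 8-bit variant. The byte order ([low, mid, high]) and the function names are my own assumptions for illustration, not anything from the original poster's design.

```python
def adc8(a, b, carry_in):
    """8-bit add with carry: returns (sum_byte, carry_out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def increment_ip(ip_bytes):
    """Increment a 24-bit IP held in three 8-bit registers [low, mid, high]."""
    low, carry = adc8(ip_bytes[0], 1, 0)
    mid, carry = adc8(ip_bytes[1], 0, carry)
    high, _ = adc8(ip_bytes[2], 0, carry)   # carry out of bit 23 is dropped
    return [low, mid, high]
```

An assembler could expand a single "increment IP" pseudo-instruction into these three steps, which is what hiding the difference in the assembler amounts to.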
If you look at my article above, or an even simpler CPU here: http://fpgacpu.org/papers/soc-gr0040-paper.pdf you will see you really don't need that many operators / instructions to cover the integer C repertoire. In fact there is an lcc utility (ops) to print the min operator set for a given machine.
For more information see my article on porting lcc to a new machine here: http://www.fpgacpu.org/usenet/lcc.html
Once I had ported lcc, I wrote an assembler, and it synthesized a larger repertoire of instructions from the basic ones. For example, my machine had load-byte-unsigned but not load-byte-signed, so I emitted this sequence:
lbs rd,imm(rs) ->
lbu rd,imm(rs)
lea r1,0x80
xor rd,r1
sub rd,r1
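To see why that xor/sub pair sign-extends the byte, here is the same sequence modelled in Python. The 16-bit register width is an assumption based on the 16-bit RISC project mentioned above; the function name is made up.

```python
REG_MASK = 0xFFFF   # assumed 16-bit registers


def load_byte_signed(byte):
    """Model of the synthesized lbs sequence: lbu, lea, xor, sub."""
    rd = byte & 0xFF              # lbu rd,imm(rs): zero-extended byte
    r1 = 0x80                     # lea r1,0x80
    rd = rd ^ r1                  # xor rd,r1: flip the sign bit of the byte
    rd = (rd - r1) & REG_MASK     # sub rd,r1: wraps around at register width
    return rd
```

Flipping bit 7 and then subtracting 0x80 leaves values 0x00..0x7F unchanged, while 0x80..0xFF come out as the byte minus 0x100, i.e. with all the upper register bits set.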
So I think you can get by with this min cover of operations:
registers
load register with constant
load rd = *rs
store *rs1 = rs2
+ - (w/ w/o carry) // actually can do + with - and ^
>> 1 // << 1 is just +
& ^ // (synthesize ~ from ^, | from & and ^)
jump-and-link rd,rs // rd = pc, pc = rs
skip-z/nz/n/nn rs // skip next insn on rs==0, !=0, <0, >=0
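The synthesized operators in the comments above can be checked with a quick sketch. The 24-bit word width is just an assumption to match option 2 in the question, and the function names are invented.

```python
WORD_MASK = 0xFFFFFF   # assumed 24-bit word


def bit_not(x):
    """~x synthesized from ^ : xor with an all-ones word."""
    return x ^ WORD_MASK


def bit_or(a, b):
    """a | b synthesized from & and ^ .

    a ^ b has the bits set in exactly one operand, a & b the bits set in
    both; the two never overlap, so xor-ing them gives the union.
    """
    return (a ^ b) ^ (a & b)


def shift_left_1(x):
    """x << 1 synthesized from + : a word added to itself."""
    return (x + x) & WORD_MASK
```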
Even simpler is to have no registers (or equivalently blur registers with memory -- all registers have a memory address).
Set aside a register for SP, and write the function prolog/epilog handler in the compiler and you won't have to worry about stack instructions. There's just code to store each of the callee save registers, adjust the SP by the frame size, and so forth.
Interrupts (and return from interrupts) are straightforward. All you need to do is force a jump-and-link instruction into the instruction register. If you choose the bit pattern for that to be something like 0, and put the right addresses into the source register rs (especially if it is r0), it can be done with a flip-flop reset input or an extra force-to-0 AND gate. I use a similar trick in the second paper above.
Interesting project. I see a TTL / 7400 contest is underway and I was thinking myself of how simple a machine could you get away with and would it be cheating to add a 32 KB or 128 KB async SRAM to the machine to hold the code and data.
Anyway, happy hacking!
p.s.
1) You will want to decide how large each integral type is. You can certainly make char, short, int, long, long long, etc. the same size, one 24b word, if you wish, although it won't be compliant in min representation ranges.
2) And although I focused on lcc here, you were asking about C++. I recommend pursuing C first. Once you have things figured out for C, including the *, /, % operators in software, etc., it should be more tractable to move to full-blown C++, whether in LLVM or GCC. The difference between C and C++ is "only" the extra vtables and RTTI tables and code sequences (entirely built up out of the primitive C integer operator repertoire) required to handle virtual function calls, pointer-to-member dereference, dynamic casts, static constructors, exception handling, etc.
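As a sketch of what doing *, /, % "in software" can look like on a machine with only add, subtract, and one-bit shifts, here are the textbook shift-and-add multiply and restoring division for unsigned values. This is not lcc's or any real runtime library's code; the names and the default 24-bit width are assumptions.

```python
def soft_mul(a, b):
    """Unsigned shift-and-add multiply using only +, >> 1, and a bit test.

    Python integers never overflow; a real routine would wrap the
    product at the machine's word width.
    """
    product = 0
    while b != 0:
        if b & 1:              # low bit of b set: add the shifted a
            product = product + a
        a = a + a              # a << 1, done with an add
        b = b >> 1
    return product


def soft_divmod(n, d, width=24):
    """Unsigned restoring division: returns (quotient, remainder).

    Processes n from its top bit down. The n >> i here stands in for
    shifting n left one bit per iteration on real hardware.
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        remainder = remainder + remainder + ((n >> i) & 1)
        quotient = quotient + quotient
        if remainder >= d:     # divisor fits: subtract it and set the bit
            remainder = remainder - d
            quotient = quotient + 1
    return quotient, remainder
```

The remainder from `soft_divmod` covers the % operator, so one routine serves both / and %.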
IMHO, it is possible for a C compiler. I am not sure about C++, though.
LLVM/Clang could be a hard choice for an 8-bit computer.
Instead, try lcc first, then LLVM etc. HTH.
Bill Buzbee succeeded in retargeting the lcc compiler for his Magic-1 (known as homebrewcpu):
Although the hardware design and construction of Magic-1 usually gets the most attention, the largest part of the project (by far) has been developing/porting the software. To this end, I've had to write an assembler and linker from scratch, retarget a C compiler, write and port the standard C libraries, write a simplified operating system and then port a more sophisticated one. It's been a challenge, but a fun one. I suppose I'm somewhat twisted, but I happen to enjoy debugging difficult problems. And, when the bug you're trying to track down could involve one or more of: hardware design flaw, loose or broken wire, loose or bad TTL chip, assembler bug, linker bug, compiler bug, C runtime library bug, or finally a bug in the program in question there's lot of opportunity for fun. Oh, and I also don't have the luxury of blaming the bugs on anyone else.
I'm continually amazed that the damn thing runs at all, much less runs as well as it does.
In my opinion, stackless hardware is already poorly suited for C and C++ code. If you have nested function calls, you will need to emulate a stack in software anyway, which of course is much slower.
When going the stackless route, you will probably allocate most of your variables as 'static', and have no re-entrant functions. In this case, 6502-style addressing modes can be effective. You could for example have these addressing modes:
Immediate address (24bit) as part of opcode
Immediate address (24bit) plus index register (8bit)
Indirect access: immediate 24bit address to memory, which contains the actual address
Indirect access: 24 bit address to memory, 8 bit index register added to value from memory.
The address modes outlined above would allow efficient access to arrays, structures and objects allocated at a constant address (static allocation). They would be less efficient (but still usable) for dynamically and stack-allocated objects.
You would also get some benefit from your serial design: usually the 24 bit + 8 bit addition does not take 24 cycles, but you can instead short-circuit the addition when carry is 0.
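A rough sketch of that short-circuit, modelled per byte for simplicity (a bit-serial design could apply the same idea per bit). The [low, mid, high] layout and the step counter are assumptions made for illustration only.

```python
def add_index(address_bytes, index):
    """Add an 8-bit index to a 24-bit address stored as [low, mid, high].

    Returns (result_bytes, byte_steps), where byte_steps counts how many
    bytes actually had to go through the adder.
    """
    result = list(address_bytes)

    # The low byte always goes through the adder.
    total = result[0] + (index & 0xFF)
    result[0] = total & 0xFF
    carry = total >> 8
    steps = 1

    # The upper bytes are only touched while a carry is still propagating.
    for i in (1, 2):
        if carry == 0:
            break
        total = result[i] + carry
        result[i] = total & 0xFF
        carry = total >> 8
        steps += 1
    return result, steps
```

With typical array indexing most additions finish after the first byte, and the full three steps are only needed when the carry ripples all the way up.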
Instead of mapping the IP as registers directly, you could allow changing it only through goto/branch instructions, using the same address modes as above. Jumps into dynamically computed addresses are quite rare so it makes more sense to give the whole 24-bit address directly in the opcode.
I think that if you design the CPU carefully, you can use many C++ features quite efficiently. However, do not expect that any random C++ code would run fast on such a limited CPU.
The implementation is certainly possible, but I doubt it will be usable (at least for C++ code). As was already noted, the first problem is the lack of a stack. Next, a lot of C++ relies heavily on dynamic memory allocation, and C++ "internal" structures are quite big.
So, as it seems to me, it would be better if you:
Get rid of the C++ requirement (or at least limit yourself to some subset)
Use 24 bits, not 8 bits, for everything (for registers as well)
Add a hardware stack
You are not going to be able to run "any" C++ code there: for example, anything that calls fork(), system(), etc., or anything that clearly relies on interrupts. You can get a long way there, sure.
Now do you mean any programs that can/have been written in C++ or are you limiting yourself to the language only and not the libraries that are commonly associated with C/C++? The language itself is a much easier rule to live with.
I think the easier question/answer is: why not just try? What have you tried so far? It could be argued that the x86 is an 8-bit machine: no regard for alignment, and many 8-bit instructions. The MSP430 was ported to LLVM to show how easily and quickly it could be done; I would like to see that platform, a 16-bit platform, with better support (it's not where my strengths lie, otherwise I would be doing it). No MMU. It does have a stack and interrupts, sure, but you don't have to use them, and if you remove the library rules then what is left that needs an interrupt?
I would look at LLVM, but note that the documentation that shows how easy it is to port is dated and wrong, and you basically have to figure it out on your own from the compiler sources. lcc has a book and is known for that; it is not optimized. Its sources don't compile well on modern computers, so you always have to go backwards in time to use it; any time I go near it, after an evening of just trying to build it as is, I give up. vbcc is simple, clean, documented, and not unfriendly to smaller processors. Does it do C++? I don't remember. Of all of them, it is the easiest to get a compiler up and running with, though. Of all of them, LLVM is the most attractive and most useful when all is said and done. Don't go near gcc or even think of it: duct tape and baling wire inside holding it together.
Have you invented your instruction set yet? Do you have a simulator and assembler yet? Look up lsasim on GitHub to find my instruction set. You can write an LLVM backend for mine as practice for yours...grin...(my vbcc backend is horrible, I need to start over)...
You have to have some idea of how the high level will be implemented but you really have to start with an instruction set and an instruction set simulator and an assembler of some sort. Then start hand converting C/C++ code into assembly for your instruction set, that should pretty quickly get you through "can I do this without a stack", etc. In this process define your calling convention, implement more C/C++ code by hand using your calling convention. THEN dig into a compiler and make a back end. I think you should consider vbcc as a stepping stone, then head for LLVM if it appears like it (the isa) will work.

Getting Embedded with D (the programming language)

I like a lot of what I've read about D.
Unified Documentation (That would make my job a lot easier.)
Testing capability built in to the language.
Debug code support in the language.
Forward Declarations. (I always thought it was stupid to declare the same function twice.)
Built in features to replace the Preprocessor.
Modules
Typedef used for proper type checking instead of aliasing.
Nested functions. (Cough PASCAL Cough)
In and Out Parameters. (How obvious is that!)
Supports low level programming - Embedded systems, oh yeah!
However:
Can D support an embedded system that's not going to be running an OS?
Does the outright declaration that it doesn't support 16 bit processors preclude it entirely from embedded applications running on such machines? Sometimes you don't need a hammer to solve your problem.
Garbage collection is great on Windows or Linux, but unfortunately embedded applications sometimes must do explicit memory management.
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
Is there a D-Lite for embedded systems?
So basically, is D suitable for embedded systems with only a few megabytes (sometimes less than a megabyte) of memory, not running an OS, where max memory usage must be known at compile time (per requirements), and possibly on something smaller than a 32 bit processor?
I'm very interested in some of the features, but I get the impression it's aimed at desktop application developers.
What specifically makes it unsuitable for a 16-bit implementation? (Assuming the 16 bit architecture could address sufficient amounts of memory to hold the runtimes, either in flash memory or RAM.) 32 bit values could still be calculated using library code, albeit more slowly than 16 bit values and with more operations.
I have to say that the short answer to this question is "No".
If your machines are 16 bit, you'll have big problems fitting D into it - it is explicitly not designed for it.
D is not a light language in itself. It generates a lot of runtime type info that normally is linked into your app, and that is also needed for typesafe variadics (and thus the standard formatting features, be it Tango or Phobos). This means that even the smallest applications are surprisingly large in size, which may disqualify D from systems with low RAM. Also, D with a runtime as a shared lib (which could alleviate some of these issues) has been little tested.
All current D libraries require a C standard library below them, and thus typically also an OS, so even that works against using D. However, there do exist experimental kernels in D, so it is not impossible per se. There just wouldn't be any libraries for it, as of today.
I would personally like to see you succeed, but doubt that it will be easy work.
First and foremost, read larsivi's answer. He's worked on the D runtime and knows what he's talking about.
I just wanted to add: Some of what you asked about is already possible. It won't get you all the way, and a miss is as good as a mile here but still, FYI:
Garbage collection is great on Windoze or Linux, but unfortunately embedded apps sometimes must do explicit memory management.
You can turn garbage collection off. The various experimental D OSes out there do it. See the std.gc module, in particular std.gc.disable. Note also that you do not need to allocate memory with new: you can use malloc and free. Even arrays can be allocated with it, you just need to attach a D array around the allocated memory using a slice.
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
The specification for arrays specifically requires that compilers allow for bounds checking to be turned off (see the "Implementation Note"). gdc provides -fno-bounds-check, and in dmd using -release should disable it.
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
This I'm less clear on, but given that most C runtimes allow turning off multithreading, it seems likely one could get the D runtime to disable it as well. Whether that's easy or possible right now though I can't tell you.
The answers to this question are outdated:
Can D support an embedded system that's not going to be running an OS?
D can be cross-compiled for ARM Linux and for ARM Cortex-M. Some projects aim at creating libraries for Cortex-M architectures like MiniLibD for the STM32 or this project which uses a generic library for the STM32. (You could implement your own minimalistic OS in D on ARM Cortex-M.)
Does the outright declaration that it doesn't support 16 bit processors preclude it entirely from embedded applications running on such machines? Sometimes you don't need a hammer to solve your problem.
No, see answer above... (But I would not expect that "smaller" architectures than Cortex-M will be supported in the near future.)
Garbage collection is great on Windows or Linux, but unfortunately embedded applications sometimes must do explicit memory management.
You can write garbage-collection-free code. (The D Language Foundation seems to be aiming for a GC-free standard library, Phobos, but that is a work in progress.)
Array bounds checking, you love it, you hate it. Great for design assurance, but not always permissible for performance reasons.
(As you said this depends on your "personal taste" and design decisions. But I would assume an acceptable performance overhead for bound checking due to the background of the D compiler developers and D's design aims.)
What are the implications on an embedded system, not running an OS, for multithreading support? We have a customer that doesn't even like interrupts. Much less OS/multithreading.
(What is the question? One could implement multithreading using D's language capabilities, e.g. as explained in this question. BTW: If you want to use interrupts, consider this "hello world" project for a Cortex-M3.)
Is there a D-Lite for embedded systems?
The SafeD subset of D targets the embedded domain.