Why is the JVM stack-based and the Dalvik VM register-based?

I'm curious, why did Sun decide to make the JVM stack-based and Google decide to make the DalvikVM register-based?
I suppose the JVM can't really assume that a certain number of registers are available on the target platform, since it is supposed to be platform independent. Therefore it just postpones register allocation etc. to the JIT compiler. (Correct me if I'm wrong.)
So the Android guys thought, "hey, that's inefficient, let's go for a register-based VM right away..."? But wait, there are multiple different Android devices; what number of registers did Dalvik target? Are the Dalvik opcodes hardcoded for a certain number of registers?
Do all current Android devices on the market have about the same number of registers? Or, is there a register re-allocation performed during dex-loading? How does all this fit together?

There are a few attributes of a stack-based VM that fit in well with Java's design goals:
A stack-based design makes very few assumptions about the target hardware (registers, CPU features), so it's easy to implement a VM on a wide variety of hardware.
Since the operands for instructions are largely implicit, the object code will tend to be smaller. This is important if you're going to be downloading the code over a slow network link.
Going with a register-based scheme probably means that Dalvik's code generator doesn't have to work as hard to produce performant code. Running on an extremely register-rich or register-poor architecture would probably handicap Dalvik, but that's not the usual target - ARM is a very middle-of-the-road architecture.
I had also forgotten that the initial version of Dalvik didn't include a JIT at all. If you're going to interpret the instructions directly, then a register-based scheme is probably a winner for interpretation performance.
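To make the interpretation argument concrete, here is a minimal sketch in Python of the same statement, c = a + b, run on a toy stack machine and a toy register machine. The opcode names and tuple encodings are made up for illustration; they are not real JVM or Dalvik bytecode.

```python
# Toy illustration only: neither encoding is real JVM or Dalvik bytecode.

def run_stack(code, local_vars):
    """Interpret a tiny stack-machine program; operands are implicit."""
    stack = []
    for op, *args in code:
        if op == "load":        # push a local variable onto the stack
            stack.append(local_vars[args[0]])
        elif op == "add":       # pop two operands, push their sum
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(lhs + rhs)
        elif op == "store":     # pop the top of stack into a local variable
            local_vars[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack op: {op}")
    return local_vars


def run_register(code, regs):
    """Interpret a tiny register-machine program; operands are explicit."""
    for op, *args in code:
        if op == "add":         # regs[dst] = regs[src1] + regs[src2]
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        else:
            raise ValueError(f"unknown register op: {op}")
    return regs


# c = a + b, with a, b, c living in slots 0, 1, 2.
STACK_PROGRAM = [("load", 0), ("load", 1), ("add",), ("store", 2)]
REGISTER_PROGRAM = [("add", 2, 0, 1)]

print(run_stack(STACK_PROGRAM, [2, 3, 0]))        # [2, 3, 5]
print(run_register(REGISTER_PROGRAM, [2, 3, 0]))  # [2, 3, 5]
print(len(STACK_PROGRAM), len(REGISTER_PROGRAM))  # 4 1
```

The stack version goes around the dispatch loop four times and the register version once, but the single register instruction has to carry three operand fields. That is the trade-off described above: fewer dispatches against a wider instruction.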

I can't find a reference, but I think Sun chose the stack-based bytecode approach because it makes it easy to run the JVM on an architecture with few registers (e.g. IA32).
In Dalvik VM Internals from Google I/O 2008, the Dalvik creator Dan Bornstein gives the following arguments for choosing a register-based VM on slide 35 of the presentation slides:
Register Machine
Why?
avoid instruction dispatch
avoid unnecessary memory access
consume instruction stream efficiently (higher semantic density per instruction)
and on slide 36:
Register Machine
The stats
30% fewer instructions
35% fewer code units
35% more bytes in the instructions stream
but we get to consume two at a time
According to Bornstein this is "a general expectation what you could find when you convert a set of class files to dex files".
The relevant part of the presentation video starts at 25:00.
There is also an insightful paper titled "Virtual Machine Showdown: Stack Versus Registers" by Shi et al. (2005), which explores the differences between stack- and register-based virtual machines.

I don't know why Sun decided to make the JVM stack-based. Erlang's virtual machine, BEAM, is register-based for performance reasons, and Dalvik also seems to be register-based for performance reasons.
From Pro Android 2:
Dalvik uses registers as primarily units of data storage instead of the stack. Google is hoping to accomplish 30 percent fewer instructions as a result.
And regarding the code size:
The Dalvik VM takes the generated Java class files and combines them into one or more Dalvik Executables (.dex) files. It reuses duplicate information from multiple class files, effectively reducing the space requirement (uncompressed) by half from traditional .jar file. For example, the .dex file of the web browser app in Android is about 200k, whereas the equivalent uncompressed .jar version is about 500k. The .dex file of the alarm clock is about 50k, and roughly twice that size in its .jar version.
And as I remember, Computer Architecture: A Quantitative Approach also concludes that a register machine performs better than a stack-based machine.

Connection between microprogramming and embedded systems

What is the connection between microprogramming and embedded systems?
Is microprogramming a machine language?
Is microprogramming the same as microcode?
Are embedded systems built using microprogramming only?
Or is microprogramming not exclusive to embedded systems?
If possible, please exemplify. Thanks!
Microprogramming / microcoding is an implementation technique for processors, and as such it dates back pretty far.
A processor implements an instruction set; programs using these instructions are generated by a compiler or assembly language programmer and stored in program files, later loaded into memory to execute the program.
A microcoded processor is like having another, different processor-within-the-processor that is used to interpret the instruction stream (sequences of machine language) of the program. This processor within the processor has its own instruction set and its own program. Unlike the externally visible instruction set (which can load & run any program), the processor within the processor generally only runs one dedicated program (the instruction set interpreter), which is stored in a ROM (or re-writable flash) inside the processor.
(In some such systems, the processor within the processor has instructions that are very wide (as in horizontal microcode), and impractical (regarding code size) for general use by regular programs.)
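As a rough sketch of the processor-within-the-processor idea, the Python below models a hypothetical accumulator machine whose visible instructions (LOAD, ADD, STORE) are each expanded into micro-operations read from a fixed table standing in for the microcode ROM. The instruction names, the register names (ACC, MAR, MDR) and the micro-operation strings are all invented for illustration and do not describe any real processor.

```python
# Hypothetical accumulator machine, for illustration only.
# The "ROM": one fixed micro-program per externally visible instruction.
MICROCODE_ROM = {
    "LOAD":  ["MAR<-operand", "MDR<-mem[MAR]", "ACC<-MDR"],
    "ADD":   ["MAR<-operand", "MDR<-mem[MAR]", "ACC<-ACC+MDR"],
    "STORE": ["MAR<-operand", "MDR<-ACC", "mem[MAR]<-MDR"],
}


def execute_micro_op(micro_op, operand, state, memory):
    """Carry out one micro-operation on the internal registers."""
    if micro_op == "MAR<-operand":
        state["MAR"] = operand
    elif micro_op == "MDR<-mem[MAR]":
        state["MDR"] = memory[state["MAR"]]
    elif micro_op == "ACC<-MDR":
        state["ACC"] = state["MDR"]
    elif micro_op == "ACC<-ACC+MDR":
        state["ACC"] += state["MDR"]
    elif micro_op == "MDR<-ACC":
        state["MDR"] = state["ACC"]
    elif micro_op == "mem[MAR]<-MDR":
        memory[state["MAR"]] = state["MDR"]
    else:
        raise ValueError(f"unknown micro-op: {micro_op}")


def run(program, memory):
    """Interpret visible instructions by stepping through their micro-programs."""
    state = {"ACC": 0, "MAR": 0, "MDR": 0}
    micro_steps = 0
    for opcode, operand in program:
        for micro_op in MICROCODE_ROM[opcode]:
            execute_micro_op(micro_op, operand, state, memory)
            micro_steps += 1
    return memory, micro_steps


memory, steps = run([("LOAD", 0), ("ADD", 1), ("STORE", 2)], [2, 3, 0])
print(memory, steps)  # [2, 3, 5] 9
```

Three visible instructions cost nine internal steps here, which is the kind of multi-cycle indirection that hardwired designs later removed.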
What is the connection between microprogramming and embedded systems?
There is no particular relationship between microcoding and embedded systems. A processor may or may not be microcoded, and it may or may not be used in an embedded system; any combination of the two is possible.
Is microprogramming a machine language?
Yes, I would say it is, but it is generally not accessible to operating systems and user programs.
Microcoding was particularly popular when virtually all instructions each executed in multiple cycles. Later techniques removed the indirection of the microcoded machine in favor of direct hardwired execution, with single-cycle approaches. This publication sheds some light on some of the thinking of the day during the transition of the state of the art from microcoding to hard wiring. See also IBM 801.
Most processors these days are not microprogrammed; however, the very advanced techniques applied by x86 processors may mimic microprogramming techniques here and there.
Embedded systems are simply processors used in devices that are not seen as "computers", for example, a thermostat, a microwave, or a car (which might have numerous embedded systems). Considerations here are that these systems are dedicated: they tend to run a single program (rather than running an operating system capable of running any program the user directs); they have low power requirements and disconnected requirements (disconnected from user terminal/screen/keyboard, perhaps from the network, etc.). Still, embedded systems keep getting even more powerful.

When to use full system FS vs syscall emulation SE with userland programs in gem5?

Since syscall emulation is much easier to set up, I'm wondering what the advantages are of using full system emulation when running a userland program.
Or in other words, what interesting aspects are modeled in the full system but not syscall emulation mode, and when are they significant?
It is mentioned in the docs at: http://gem5.org/Splash_benchmarks that full system is
Realistic: you're getting the actual Linux thread scheduler to schedule your threads
Is this the only advantage, or are there any other advantage for users that are optimizing their applications or investigating micro-architecture?
I also suspect that the MMU simulation is another important feature that is only modeled properly in full system mode, and could affect program performance.
Full system mode should be preferred (when it is possible to use it). There are benefits to using it, primarily fidelity in the simulation which is not possible with system call emulation mode. (The kernel interactions with an application can be important depending on the study that a researcher is trying to conduct.) Also, the user does not need to worry about implementing (or debugging) the system call implementation.
With that said, system call emulation mode can be useful under the right conditions. It is faster to run application code because there is no kernel running in the background. There is also no system noise if you want to mitigate it entirely. Arguably, it is easier to bootstrap a new device model as well. You can work on the model without driver support and make magic happen through fake interfaces. (It saves you having to model the bare-metal interface perfectly or having to write your own device driver.)
Your comments about dynamic linking and multi-threading support are related. If dynamic linking is fixed, you should be able to use your system's pthreads library and can forget about linking with m5threads entirely. The pthread library support has existed in the simulator for a while now (the system calls necessary for it to work properly).
However, there's a caveat to the threading implementation. You need to preallocate enough thread contexts at the start of simulation (by invoking with the -n option on the se.py script).
To elaborate, there is no operating system running in the background to schedule threads on the processors. (I use the terms threads and processors very loosely here.) To obviate the scheduling problem, you have to preallocate enough processors so that the threads can be created on calls to clone/execve. There is a constraint that you can never have more threads than processors (unlike a real system where the operating system can schedule them as it pleases).
The configuration scripts probably do not behave how a researcher would want them to behave for a multi-threaded workload. The researcher would need to verify that the caches were configured correctly and that they are sharing certain cache levels like a real machine would do. If the application calls clone/execve many times, it may not be possible to cause the generated configuration to behave realistically.
Your last statement about modeling accelerators is incorrect. The AMD GFX8 model does use system call emulation mode. (Also, we developed a NIC model which was never publicly released.) It involves creating a fake driver and manipulating it through the same ioctl interfaces that a real driver would use. Linux treats everything like a file so the driver is opened through the open system call interface and you can capture it there. There are other things which you might need to do (like map mmio ranges in the configuration), but the driver interface is the main piece. The application interacts with the driver and the driver interacts with the accelerator model.
Advantages of SE:
sometimes easier to set up benchmarks, if all syscalls you need are implemented, and if you have just the right cross compiler, which of course no one has properly documented.
SE runs Dhrystone about 2x faster than full system: https://github.com/cirosantilli/linux-kernel-module-cheat/tree/00d282d912173b72c63c0a2cc893a97d45498da5#user-mode-vs-full-system-benchmark That benchmark makes no syscalls (except for printing information before / after the actual benchmark runs).
it is easier to get greater visibility and control of what the application is doing since the kernel is not running in parallel. E.g. stats will be just for the application, GDB will be just for the application: thread-aware gdb for the Linux kernel
Disadvantages of SE:
in practice, harder to set up benchmarks, because it is too fragile / has too many restrictions.
If your content does not work immediately out of the box, it is easier to just create or download a full system image and go for that instead, which is much more reliable.
Here is a sample minimal working Ubuntu setup if you are still interested: How to compile and run an executable in gem5 syscall emulation mode with se.py?
less representative, since no actual OS is running
no dynamic linking for ARM as of June 2018: How to run a dynamically linked executable syscall emulation mode se.py in gem5?
if you want to evaluate an accelerator like a GPU, you will have to create some slightly custom interface for it, since there is no kernel driver running on top of the kernel as usual.
Brandon has pointed out in his answer that this has in fact been done before: https://stackoverflow.com/a/56371006/9160762
So my recommendation is:
try SE first. If it works, great. If it doesn't, try to fix it quickly, since most problems are trivial. Having the SE setup will save you a lot of time over full system, and it is often representative enough.
otherwise, use FS mode. It is just simpler to set up, more representative, and the performance hit is acceptable for most.
You could also use SE first, and then go to FS to further validate only your most important SE results, since FS is slower and you can therefore validate fewer setups.

What decides which structure a process has in memory?

I've learned that a process has the following structure in memory:
(Image from Operating System Concepts, page 82)
However, it is not clear to me what decides that a process looks like this. I guess processes could (and do?) look different if you have a look at non-standard OS / architectures.
Is this structure decided by the OS? By the compiler of the program? By the computer architecture? A combination of those?
Related and possible duplicate: Why do stacks typically grow downwards?.
On some ISAs (like x86), a downward-growing stack is baked in. (e.g. call decrements SP/ESP/RSP before pushing a return address, and exceptions / interrupts push a return context onto the stack so even if you wrote inefficient code that avoided the call instruction, you can't escape hardware usage of at least the kernel stack, although user-space stacks can do whatever you want.)
On others (like MIPS where there's no implicit stack usage), it's a software convention.
The rest of the layout follows from that: you want as much room as possible for downward stack growth and/or upward heap growth before they collide. (Or allowing you to set larger limits on their growth.)
Depending on the OS and executable file format, the linker may get to choose the layout, like whether text is above or below BSS and read-write data. The OS's program loader must respect where the linker asks for sections to be loaded (at least relative to each other, for executables that support ASLR of their static code/data/BSS). Normally such executables use PC-relative addressing to access static data, so ASLRing the text relative to the data or bss would require runtime fixups (and isn't done).
Or position-dependent executables have all their segments loaded at fixed (virtual) addresses, with only the stack address randomized.
The "heap" isn't normally a real thing, especially in systems with virtual memory where each process has its own private virtual address space. Normally you have some space reserved for the stack, and everything outside that which isn't already mapped is fair game for malloc (actually its underlying mmap(MAP_ANONYMOUS) system calls) to choose when allocating new pages. But yes, even modern glibc's malloc on modern Linux does still use brk() to move the "program break" upward for small allocations, increasing the size of "the heap" the way your diagram shows.
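For a feel of what "asking the OS for anonymous pages" looks like, here is a small sketch using Python's standard mmap module. Passing -1 as the file descriptor requests an anonymous mapping (memory not backed by a file). The helper name alloc_pages is made up for this example, and the sketch only illustrates the kind of request an allocator makes; it does not show how glibc's malloc is actually implemented.

```python
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size reported by the running system


def alloc_pages(n_pages):
    """Hypothetical helper: map n_pages of anonymous, zero-filled memory."""
    return mmap.mmap(-1, n_pages * PAGE_SIZE)


region = alloc_pages(4)
region[0:5] = b"hello"            # the mapping is ordinary writable memory
print(len(region) // PAGE_SIZE)   # 4
print(bytes(region[0:5]))         # b'hello'
region.close()                    # unmap: the pages go back to the OS
```

Where the mapping lands in the address space is chosen by the kernel, which is the point made above: outside the areas fixed by the executable and the stack, there is no single "heap" region that has to sit where the textbook diagram puts it.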
I think this is recommended by some committee, and then tools like GCC conform to that recommendation. The binary format defines these segments, and the operating system and its tools make it possible for a binary in that format to run on the system. Let's say ELF is recommended by System V and then adopted by Unix, and GCC produces ELF binaries to be run on Unix. So I feel the story may start from the binary format, as it decides the memory mappings (code, data/heap/stack) to be set up when loading programs. For example, ELF defines segments (arranging code into text, data and stack to be loaded in memory), GCC generates those segments in the ELF binary, and the loader loads them. The operating system also has freedom to adjust properties of these segments, like the stack size. These are debatable thoughts, thinking out loud, which I am trying to consolidate.
That figure represents a specific implementation or an idealized one. A process does not necessarily have that structure. On many systems a process looks only somewhat similar to what is in the diagram.

What is a good FAT file system for ARM7-TDMI

I'm using the ARM7TDMI-S (NXP processor) and I need a file system capable of reading/writing to an SD card. There are so many available, what have people used and been happy with? One that requires the least amount of setup is best - so the less I have to do to get it started (i.e. write device drivers to NXP's hardware) the better.
I am currently using CMX's RTOS as the OS for this project.
I suggest that you use either EFSL or Chan's FAT File System Module. I have used both on MMC/SD cards without problems. The choice between them may come down to the license terms and pre-existing ports to your target. Martin Thomas's ARM Projects site has examples for both libraries.
FAT is popular precisely because it's so simple. The main problems with FAT are performance (because of its simplicity, it's not very fast) and its limited size (2GB for FAT16, though 2TB for FAT32).
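The size limits quoted above can be reproduced with a little arithmetic. The sketch below assumes the commonly cited parameters: 16-bit cluster numbers with clusters of at most 32 KiB for FAT16, and a 32-bit sector count with 512-byte sectors for FAT32. The exact usable figures are slightly lower because some cluster values are reserved.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4

# Assumed parameters (commonly cited, not taken from the posts above).
FAT16_CLUSTER_NUMBER_BITS = 16
FAT16_MAX_CLUSTER_BYTES = 32 * KIB
FAT32_SECTOR_COUNT_BITS = 32
SECTOR_BYTES = 512


def fat16_limit_bytes():
    """Upper bound: every 16-bit cluster number addresses a 32 KiB cluster."""
    return (2 ** FAT16_CLUSTER_NUMBER_BITS) * FAT16_MAX_CLUSTER_BYTES


def fat32_limit_bytes():
    """Upper bound: a 32-bit sector count of 512-byte sectors."""
    return (2 ** FAT32_SECTOR_COUNT_BITS) * SECTOR_BYTES


print(fat16_limit_bytes() // GIB, "GiB")  # 2 GiB
print(fat32_limit_bytes() // TIB, "TiB")  # 2 TiB
```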

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research I heard of one person (no source available) who used JS: they read the machine code, output JS source and used eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide-open question; I'm not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code of the machine being emulated or virtualized is replaced with native code for the host. The more the code is run, the more is replaced.
I think you need to do a few things. First, decide whether you are talking about an emulator or a virtual machine like VMware or VirtualBox. In an emulator, the processor and hardware are emulated in software: the emulator reads the next instruction, pulls the opcode apart in code, and determines what to do with it. I have been doing some 6502 emulation and static binary translation, which is dynamic recompilation done as a preprocessing step instead of in real time.
Say your emulator encounters LDA #10 (load A with an immediate). The emulator sees the load-A-immediate instruction and knows it has to read the next byte, which is the immediate. The emulator has a variable in its code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags: in this case the Z (zero) flag is cleared, the N flag is cleared, and C and V are untouched. But what if the next instruction is a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute the flags because they will be overwritten; that flag computation is dead code in the emulation.
You can continue with this kind of thinking. Say you see code that copies the X register to the A register, pushes A on the stack, then copies the Y register to A and pushes it on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carry instructions chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you are looking for operations that the emulated processor couldn't do directly but that are easy to do in the emulation.
Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, that is, before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
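Here is a small sketch of the load-immediate example. It handles only two real 6502 opcodes, LDA #imm (0xA9) and LDX #imm (0xA2), first by plain interpretation and then by a static translation that skips flag updates that the next instruction would overwrite anyway. Everything else is an assumption made to keep the example short: the Cpu class, the helper names, and emitting Python source instead of C.

```python
LDA_IMM = 0xA9  # 6502 opcode: load A with an immediate byte
LDX_IMM = 0xA2  # 6502 opcode: load X with an immediate byte
TARGET_REG = {LDA_IMM: "a", LDX_IMM: "x"}


class Cpu:
    """Guest registers and flags held in ordinary host variables."""

    def __init__(self):
        self.a = 0
        self.x = 0
        self.z = False  # zero flag
        self.n = False  # negative flag

    def state(self):
        return (self.a, self.x, self.z, self.n)


def interpret(code):
    """Plain emulation: decode each instruction and always update Z and N."""
    cpu = Cpu()
    pc = 0
    while pc < len(code):
        opcode, imm = code[pc], code[pc + 1]
        pc += 2
        setattr(cpu, TARGET_REG[opcode], imm)
        cpu.z = (imm == 0)
        cpu.n = bool(imm & 0x80)
    return cpu


def translate(code):
    """Static translation to Python source lines, dropping dead flag updates.

    Every opcode in this tiny subset overwrites Z and N, so only the flags
    of the last instruction can ever be observed.
    """
    ops = [(code[i], code[i + 1]) for i in range(0, len(code), 2)]
    lines = []
    for index, (opcode, imm) in enumerate(ops):
        lines.append(f"cpu.{TARGET_REG[opcode]} = {imm}")
        if index == len(ops) - 1:
            lines.append(f"cpu.z = {imm == 0}")
            lines.append(f"cpu.n = {bool(imm & 0x80)}")
    return lines


def run_translated(lines):
    """Execute the translated source against a fresh Cpu."""
    cpu = Cpu()
    exec("\n".join(lines), {"cpu": cpu})
    return cpu


PROGRAM = bytes([LDA_IMM, 10, LDX_IMM, 0])  # LDA #10 ; LDX #0
print(interpret(PROGRAM).state())                      # (10, 0, True, False)
print(translate(PROGRAM))
print(run_translated(translate(PROGRAM)).state())      # (10, 0, True, False)
```

The translated program has four statements where the interpreter performs six assignments, and both end in the same state. A real translator would need proper liveness analysis across branches, which this sketch sidesteps by supporting straight-line code only.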
Once the concepts of emulation and translation are understood, you can try to do it dynamically; it is certainly not trivial. I would suggest first trying a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, x86 on x86 for example. The beauty here is that on reasonably recent x86 processors you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things. Loading values into AX, adding BX, and so on all happen in real time on the processor. When the program wants to read or write memory, what happens depends on your trap mechanism: if the addresses are within the virtual machine's RAM space, there are no traps. But say the program writes to an address which is the virtualized UART; you have the processor trap that, and then VMware or whatever decodes the write and emulates it, perhaps talking to a real serial port. That one instruction wasn't real time, though; it took quite a while to execute. What you could do, if you chose to, is replace the instruction or set of instructions that writes a value to the virtualized serial port so that it writes to a different address, one that could be the real serial port or some other location that will not cause a fault and force the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have the original code branch to this UART write routine instead. The next time you hit that chunk of code it runs in real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .NET). The byte code contains enough "high-level" structure (high-level in the sense of higher-level than machine code) that the VM can take chunks out of the byte code and replace them with a compiled memory block. The VM typically decides which parts get compiled by counting how many times the code has already been interpreted, since the compilation itself is a complex and time-consuming process. So it is useful to compile only the parts which get executed many times.
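The counting scheme can be sketched in a few lines. The toy VM below interprets a made-up stack bytecode, counts how often it has been called, and once a threshold is reached swaps in a "compiled" version. Two things are assumptions for the sake of a runnable example: the bytecode format and the HOT_THRESHOLD value are invented, and the "compiler" emits a Python function through exec rather than machine code, which is what a real JIT would produce.

```python
HOT_THRESHOLD = 3  # hypothetical: compile after this many interpreted calls


class ToyVM:
    """Runs one single-argument function given as toy stack bytecode."""

    def __init__(self, bytecode):
        self.bytecode = bytecode
        self.interpreted_calls = 0
        self.compiled = None  # filled in once the code is considered hot

    def interpret(self, x):
        """Slow path: decode and execute every instruction on each call."""
        stack = []
        for op, arg in self.bytecode:
            if op == "push_arg":
                stack.append(x)
            elif op == "push_const":
                stack.append(arg)
            elif op in ("add", "mul"):
                rhs = stack.pop()
                lhs = stack.pop()
                stack.append(lhs + rhs if op == "add" else lhs * rhs)
            else:
                raise ValueError(f"unknown op: {op}")
        return stack.pop()

    def compile(self):
        """Translate the bytecode once into a host function (Python here)."""
        exprs = []  # a stack of source expressions instead of values
        for op, arg in self.bytecode:
            if op == "push_arg":
                exprs.append("x")
            elif op == "push_const":
                exprs.append(repr(arg))
            elif op in ("add", "mul"):
                rhs = exprs.pop()
                lhs = exprs.pop()
                symbol = "+" if op == "add" else "*"
                exprs.append(f"({lhs} {symbol} {rhs})")
            else:
                raise ValueError(f"unknown op: {op}")
        namespace = {}
        exec(f"def compiled(x):\n    return {exprs.pop()}\n", namespace)
        return namespace["compiled"]

    def call(self, x):
        if self.compiled is not None:
            return self.compiled(x)          # fast path
        self.interpreted_calls += 1
        if self.interpreted_calls >= HOT_THRESHOLD:
            self.compiled = self.compile()   # hot: compile for later calls
        return self.interpret(x)


# f(x) = x * 2 + 1
vm = ToyVM([("push_arg", None), ("push_const", 2), ("mul", None),
            ("push_const", 1), ("add", None)])
print([vm.call(n) for n in range(5)])  # [1, 3, 5, 7, 9]
print(vm.interpreted_calls)            # 3
```

The first three calls go through the interpreter; from the fourth call on the compiled function is used, and the results are the same either way.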
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM. Typically there is an embedded compiler which is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: you don't. The code in the VM can't interact with real hardware. The VM interacts with the OS and transfers OS events to the code by jumping to or calling specific parts of the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is translating 16-bit real-mode code to 32- or 64-bit protected-mode code, to avoid having to emulate a CPU in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before jumping to the real code destination. But I doubt that the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine that a user changes some files. Then IIS will recompile the assemblies in the following steps:
Recompiles (all requests are still handled by the old code)
Copies the new assemblies (all requests are still handled by the old code)
All new requests are handled by the new code; requests already in progress are handled by the old code.
I hope this is helpful.
A virtual machine loads "byte code" or "intermediate language" rather than machine code, so I suppose it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation