Simulate available simd instruction sets for testing - virtual-machine

I want to test how some code with hand written intrinsics will behave on platforms with different available instruction sets. By "behave" I mostly mean choose the right code path and not crash.
How can I do this without buying a bunch of machines? Do any of the virtual machine packages support this?

Related

When to use full system FS vs syscall emulation SE with userland programs in gem5?

Since syscall emulation is much easier to setup, I'm wondering what are the advantages of using the full system emulation when running an userland program.
Or in other words, what interesting aspects are modeled in the full system but not syscall emulation mode, and when are they significant?
It is mentioned in the docs at: http://gem5.org/Splash_benchmarks that full system is
Realistic: you're getting the actual Linux thread scheduler to schedule your threads
Is this the only advantage, or are there any other advantage for users that are optimizing their applications or investigating micro-architecture?
I also suspect that the MMU simulation is another important feature that is only modeled properly in full system mode, and could affect program performance.
Full system mode should be preferred (when it is possible to use it). There are benefits to using it, primarily fidelity in the simulation which is not possible with system call emulation mode. (The kernel interactions with an application can be important depending on the study that a researcher is trying to conduct.) Also, the user does not need to worry about implementing (or debugging) the system call implementation.
With that said, system call emulation mode can be useful under the right conditions. It is faster to run application code because there is no kernel running in the background. There is also no system noise if you want to mitigate it entirely. Arguably, it is easier to bootstrap a new device model as well. You can work on the model without driver support and make magic happen though fake interfaces. (It saves you having to model the bare-metal interface perfectly or having to write your own device driver.)
Your comments about dynamic linking and multi-threading support are related. If dynamic linking is fixed, you should be able to use your system's pthreads library and can forget about linking with m5threads entirely. The pthread library support has existed in the simulator for a while now (the system calls necessary for it to work properly).
However, there's a caveat to the threading implementation. You need to preallocate enough thread contexts at the start of simulation (by invoking with the -n option on the se.py script).
To elaborate, there is no operating system running in the background to schedule threads on the processors. (I use the terms threads and processors very loosely here.) To obviate the scheduling problem, you have to preallocate enough processors so that the threads can be created on calls to clone/execve. There is a constraint that you can never have more threads than processors (unlike a real system where the operating system can schedule them as it pleases).
The configuration scripts probably do not behave how a researcher would want them to behave for a multi-threaded workload. The researcher would need to verify that the caches were configured correctly and that they are sharing certain cache levels like a real machine would do. If the application calls clone/execve many times, it may not be possible to cause the generated configuration to behave realistically.
Your last statement about modeling accelerators is incorrect. The AMD GFX8 model does use system call emulation mode. (Also, we developed a NIC model which was never publicly released.) It involves creating a fake driver and manipulating it through the same ioctl interfaces that a real driver would use. Linux treats everything like a file so the driver is opened through the open system call interface and you can capture it there. There are other things which you might need to do (like map mmio ranges in the configuration), but the driver interface is the main piece. The application interacts with the driver and the driver interacts with the accelerator model.
Advantages of SE:
sometimes easier to setup benchmarks, if all syscalls you need are implemented (see also, see also), and if you have just the right cross compiler, which of course no one has documented properly which one that is.
SE runs Dhrystone about 2x https://github.com/cirosantilli/linux-kernel-module-cheat/tree/00d282d912173b72c63c0a2cc893a97d45498da5#user-mode-vs-full-system-benchmark That benchmark makes no syscalls (except for information before / after the actual benchmark runs)
it is easier to get greater visibility and control of what the application is doing since the kernel is not running in parallel. E.g. stats will be just for the application, GDB will be just for the application: thread-aware gdb for the Linux kernel
Disadvantages of SE:
in practice, harder to setup benchmarks, because it is too fragile / has too many restrictions.
If your content does not work immediately out of the box, it is easier to just create or download a full system image and go for that instead, which is much more reliable.
Here is a sample minimal working Ubuntu setup if you are still interested: How to compile and run an executable in gem5 syscall emulation mode with se.py?
less representative, since no actual OS is running
no dynamic linking for ARM as of June 2018: How to run a dynamically linked executable syscall emulation mode se.py in gem5?
if you want to evaluate an accelerator like a GPU, you will have to create some slightly custom interface for it, since there is no kernel driver running on top the the kernel as usual.
Brandon has pointed out in his answer that this has in fact been done before: https://stackoverflow.com/a/56371006/9160762
So my recommendation is:
try SE first. If it works, great. If it doesn't, try to fix it quickly, since most problems are trivial. Having the SE setup will save you a lot of time over full system, and it is often representative enough.
otherwise, use FS mode. It is just simpler to setup, more representative, and the performance hit is acceptable for most.
You could also use SE first, and then go to FS to further validate only your most important SE results, since FS is slower and you can therefore validate less different setups.

Simulating target on PC

I'm doing a project in C language that runs on a target with vxWorks operating system.
I would like to run my code on PC also for two reasons:
The HW of the target is not available yet, and i want to start testing my SW.
Even when the target will be ready it will be easier for me perform testing and simulations on a PC.
Is there some interesting way to do it?
Thanks.
You have three choices:
Use the VxWorks Simulator (vxsim) - it's part of the Workbench and can be accessed like a real target
Pros:
Easy to to use
Integrated into workbench
Debug functionality and good control of the system
Doesn't need any further hardware
Documentation (check Wind River VxWorks Simulator User's Guide)
Cons:
Not the real target system (but this is a con for all points here)
Use a x86 machine and boot eg. through ftp
Pros:
You can test booting via network and network
Cons:
The system may lack of drivers
Possible you have to change kernel
Debug is not as good as vxsim
The difference to your target may be realy big
Use a Virtual Maschine
Pros:
Runs on same pc - no further hardware required
Possible to test several bootloaders
Cons:
Not possible to simulate the target cpu etc.
A VM is not the best way for VxWorks testing
As Archie, I recommend you the VxWorks Simulator too.
A third way is to abstract the HW and OS in a separate layer in your application architecture, and provide both a PC and VxWorks versions of this layer.
This is rather costly, of course, but will have other advantages, i.e. insulation from vendor instability (like when pSos support was stopped years ago...) It might also nudge you in the direction of a nice, layered architecture.

Better way to implement I/O in a virtual machine?

I'm writing a virtual machine - not an existing architecture emulator like Virtualbox, but rather something like the JVM or BEAM - with its own instruction set, memory model, etc. Eventually I'm planning to implement a very small and simple (but turing-complete) high-level language that would compile into its bytecode, just for fun.
Of course, the machine must have some support of I/O, but I do not want to limit it only to manipulations with stdin/stdout. I imagine something like modular "virtual devices", which can be implemented as shared libraries so that the VM can load them at runtime and communicate with them through a standard interface. This way, for example, we can have "virtual devices" for standard input/output, graphics (imagine a virtual device that lets your VM program draw stuff inside an SDL window) or maybe even network.
The question is: how should the programs written for the VM communicate with the virtual devices? I decided to mimic techniques which are employed with actual hardware and learned about port-based I/O and memory-mapped I/O. However, I'm not sure which one of them is more suitable for my goals. Can you suggest which one is better or maybe even point out a totally different technique for dealing with input/output?
Thanks in advance.
Both memory-mapped and port based are inappropriate for most I/O.
DMA request with block-copy is usually what you want.

Why is the JVM stack-based and the Dalvik VM register-based?

I'm curious, why did Sun decide to make the JVM stack-based and Google decide to make the DalvikVM register-based?
I suppose the JVM can't really assume that a certain number of registers are available on the target platform, since it is supposed to be platform independent. Therefor it just postpones the register-allocation etc, to the JIT compiler. (Correct me if I'm wrong.)
So the Android guys thought, "hey, that's inefficient, let's go for a register based vm right away..."? But wait, there are multiple different android devices, what number of registers did the Dalvik target? Are the Dalvik opcodes hardcoded for a certain number of registers?
Do all current Android devices on the market have about the same number of registers? Or, is there a register re-allocation performed during dex-loading? How does all this fit together?
There are a few attributes of a stack-based VM that fit in well with Java's design goals:
A stack-based design makes very few
assumptions about the target
hardware (registers, CPU features),
so it's easy to implement a VM on a
wide variety of hardware.
Since the operands for instructions
are largely implicit, the object
code will tend to be smaller. This
is important if you're going to be
downloading the code over a slow
network link.
Going with a register-based scheme probably means that Dalvik's code generator doesn't have to work as hard to produce performant code. Running on an extremely register-rich or register-poor architecture would probably handicap Dalvik, but that's not the usual target - ARM is a very middle-of-the-road architecture.
I had also forgotten that the initial version of Dalvik didn't include a JIT at all. If you're going to interpret the instructions directly, then a register-based scheme is probably a winner for interpretation performance.
I can't find a reference, but I think Sun decided for the stack-based bytecode approach because it makes it easy to run the JVM on an architecture with few registers (e.g. IA32).
In Dalvik VM Internals from Google I/O 2008, the Dalvik creator Dan Bornstein gives the following arguments for choosing a register-based VM on slide 35 of the presentation slides:
Register Machine
Why?
avoid instruction dispatch
avoid unnecessary memory access
consume instruction stream efficiently (higher semantic density per instruction)
and on slide 36:
Register Machine
The stats
30% fewer instructions
35% fewer code units
35% more bytes in the instructions stream
but we get to consume two at a time
According to Bornstein this is "a general expectation what you could find when you convert a set of class files to dex files".
The relevant part of the presentation video starts at 25:00.
There is also an insightful paper titled "Virtual Machine Showdown: Stack Versus Registers" by Shi et al. (2005), which explores the differences between stack- and register-based virtual machines.
I don't know why Sun decided to make JVM stack based. Erlangs virtual machine, BEAM is register based for performance reasons. And Dalvik also seem to be register based because of performance reasons.
From Pro Android 2:
Dalvik uses registers as primarily units of data storage instead of the stack. Google is hoping to accomplish 30 percent fewer instructions as a result.
And regarding the code size:
The Dalvik VM takes the generated Java class files and combines them into one or more Dalvik Executables (.dex) files. It reuses duplicate information from multiple class files, effectively reducing the space requirement (uncompressed) by half from traditional .jar file. For example, the .dex file of the web browser app in Android is about 200k, whereas the equivalent uncompressed .jar version is about 500k. The .dex file of the alarm clock is about 50k, and roughly twice that size in its .jar version.
And as I remember Computer Architecture: A Quantitative Approach also conclude that a register machine perform better than a stack based machine.

How does one use dynamic recompilation?

It came to my attention some emulators and virtual machines use dynamic recompilation. How do they do that? In C i know how to call a function in ram using typecasting (although i never tried) but how does one read opcodes and generate code for it? Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C? If so how do you find the length of the code? How do you account for system interrupts?
-edit-
system interrupts and how to (re)compile the data is what i am most interested in. Upon more research i heard of one person (no source available) used js, read the machine code, output js source and use eval to 'compile' the js source. Interesting.
It sounds like i MUST have knowledge of the target platform machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code. The more the code is run the more is replaced.
I think you need to do a few things, first decide if you are talking about an emulation or a virtual machine like a vmware or virtualbox. An emulation the processor and hardware is emulated using software, so the next instruction is read by the emulator, the opcode pulled apart by code and you determine what to do with it. I have been doing some 6502 emulation and static binary translation which is dynamic recompilation but pre processed instead of real time. So your emulator may take a LDA #10, load a with immediate, the emulator sees the load A immediate instruction, knows it has to read the next byte which is the immediate the emulator has a variable in the code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags, in this case the Zero flag is clear the N flag is clear C and V are untouched. But what if the next instruction was a load X immediate? No big deal right? Well, the load x will also modify the z and n flags, so the next time you execute the load a instruction you may figure out that you dont have to compute the flags because they will be destroyed, it is dead code in the emulation. You can continue with this kind of thinking, say you see code that copies the x register to the a register then pushes the a register on the stack then copies the y register to the a register and pushes on the stack, you could replace that chunk with simply pushing the x and y registers on the stack. Or you may see a couple of add with carries chained together to perform a 16 bit add and store the result in adjacent memory locations. Basically looking for operations that the processor being emulated couldnt do but is easy to do in the emulation. Static binary translation which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, as in, before you run the code. Instead of emulating you translate the opcodes to C for example and remove as much dead code as you can (a nice feature is the C compiler can remove more dead code for you).
Once the concept of emulation and translation are understood then you can try to do it dynamically, it is certainly not trivial. I would suggest trying to again doing a static translation of a binary to the machine code of the target processor, which a good exercise. I wouldnt attempt dynamic run time optimizations until I had succeeded in performing them statically against a/the binary.
virtualization is a different story, you are talking about running the same processor on the same processor. So x86 on an x86 for example. the beauty here is that using non-old x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, no emulation. You setup traps built into the processor to catch things, so loading values in AX and adding BX, etc these all happen at real time on the processor, when AX wants to read or write memory it depends on your trap mechanism if the addresses are within the virtual machines ram space, no traps, but lets say the program writes to an address which is the virtualized uart, you have the processor trap that then then vmware or whatever decodes that write and emulates it talking to a real serial port. That one instruction though wasnt realtime it took quite a while to execute. What you could do if you chose to is replace that instruction or set of instructions that write a value to the virtualized serial port and maybe have then write to a different address that could be the real serial port or some other location that is not going to cause a fault causing the vm manager to have to emulate the instruction. Or add some code in the virtual memory space that performs a write to the uart without a trap, and have that code instead branch to this uart write routine. The next time you hit that chunk of code it now runs at real time.
Another thing you can do is for example emulate and as you go translate to a virtual intermediate bytcode, like llvm's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explaination of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpteter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .net). The byte code contains enough "high level" structures (high level in terms of higher level than machine code) so that the VM can take chunks out of the byte code and replace it by a compiled memory block. The VM typically decide which part is getting compiled by counting how many times the code was already interpreted, since the compilation itself is a complex and time-consuming process. So it is usefull to only compile the parts which get executed many times.
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM, typically there is a compiler embedded, which is capable to transform the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: none. The code in the VM can't interact with real hardware. The VM interact with the OS, and transfer OS events to the code by jumping/calling specific parts inside the interpreted code. Every event in the code or from the OS must pass the VM.
Also hardware virtualization products can use some kind of JIT. A typical use cases in the X86 world is the translation of 16bit real mode code to 32 or 64bit protected mode code to not to be forced to emulate a CPU in real mode. Also a software-only VM replaces jump instructions in the executing code by jumps into the VM control software, which at each branch the following code path for jump instructions scans and them replace, before it jumps to the real code destination. But I doubt if the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine, that user change some files. Then IIS will recompile asseblies in next steps:
Recompile (all requests handled by old code)
Copies new assemblies (all requests handled by old code)
All new requests will be handled by new code, all requests - by old.
I hope this'd be helpful.
A virtual Machine loads "byte code" or "intermediate language" and not machine code therefore, I suppose, that it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation