Where is the VM in LLVM? - virtual-machine

Note: marked as community wiki.
Where is the Low Level Virtual Machine in LLVM?
I see that we have llvm-g++ and clang, but to me, an LLVM is something almost like Valgrind or a simulator, where instructions are executed on it, and I can write programs to instrument the running code, interrupt when certain conditions happen, and so on.
Where are the tools like this built on LLVM?
Thanks!

I think you're looking for QEMU, not LLVM.
The "low-level virtual machine" in LLVM refers to this: after converting the higher-level C and C++ language input into an internal low-level representation (as a stage in the normal compiling process), LLVM can save this low-level representation and execute it with a JIT compiler (which thus acts somewhat like a virtual machine). This JIT compiler does a substantial amount of optimization, and so I expect it would be difficult to instrument in quite the form that you're thinking of. In particular, it does not step through the execution instruction by instruction.
QEMU, by contrast, is an open-source emulator that steps through machine code instruction by instruction. It already has some ability to instrument code to look for certain conditions, in that it can connect to GDB and set watchpoints and so forth, which are implemented in QEMU itself.

To use LLVM for running x86 code, you should check out libcpu or the outdated llvm-qemu.
Look at running x86 program _on_ llvm

Related

How does the source code get compiled only by a compiler?

My question is: how does a compiler convert the source code? It's just software.
Does any hardware or the CPU play a role in helping the compiler do its job? Thanks.
Like other software, compilers need a CPU to run on. However, I haven't heard of any CPUs with any compiler-specific support. For example, gcc is just another C++ (was C prior to 2012) program. It can compile and run itself just like any other C++ program, and the CPU runs it just like any other piece of code.

How to do code coverage on embedded

I am writing a project for a non-POSIX embedded system, so I cannot use the gcc option --coverage (I don't have read or write). What else can I do to produce gcov-like output? I do have an output function.
It can be most easily done with a processor with embedded trace, a board design that exposes the trace port, and a suitable hardware debugger and associated software. For example, many Cortex-M based devices include ARM's embedded trace macrocell (ETM), and this is supported by Keil's uVision IDE and ULINK-Pro debugger to provide code coverage and instruction/source level trace as well as real-time profiling. Hardware trace has the advantage that it is non-intrusive: the code runs in real time.
If you do not have the hardware support, you may have to resort to simulation. Many tool-chains include an instruction level simulator that will perform trace, code-coverage, and profiling, but you may have to create debug scripts or code stubs to simulate hardware to coerce the execution of all paths.
A third alternative is to build the code on a desktop platform with stubs to replace target hardware dependencies, and perform testing and code coverage on that. You have to trust that the target C compiler and the test system compiler both translate the source with identical semantics. The advantage here is that the debug tools available are often superior to those available to embedded systems. You can also test much of your code before any hardware is available, and in most cases execute code much faster, possibly allowing more extensive testing.
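To make the stub idea concrete, here is a minimal sketch in C. Everything in it is hypothetical (the hal_t structure, the temperature/fan example, the stub names); the only point is that the application logic reaches the hardware through one table of function pointers, so a desktop build can swap in stubs and then be measured with ordinary host coverage tools.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware abstraction: the only place target hardware is touched. */
typedef struct {
    int (*read_temperature_c)(void);  /* on target: would read a sensor register */
    void (*set_fan)(int on);          /* on target: would drive a GPIO pin */
} hal_t;

/* Application logic under test; it sees hardware only through hal_t. */
int control_step(const hal_t *hal, int limit_c)
{
    assert(hal != NULL);
    int too_hot = hal->read_temperature_c() > limit_c;
    hal->set_fan(too_hot);
    return too_hot;
}

/* Host-side stubs standing in for the real drivers. */
int stub_temperature_c = 25;  /* value the fake sensor reports */
int stub_fan_state = 0;       /* last value written to the fake fan */

int stub_read_temperature_c(void) { return stub_temperature_c; }
void stub_set_fan(int on) { stub_fan_state = on; }

const hal_t stub_hal = { stub_read_temperature_c, stub_set_fan };
```

A host test would set stub_temperature_c to values on both sides of the limit and call control_step(&stub_hal, limit) each time, which is how the stubs coerce execution down both paths.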
Not having a POSIX API does not preclude using GCC, it merely precludes using the GNU C library. On embedded systems without POSIX, alternative C libraries are used such as Newlib. Newlib has a system porting layer where I/O and basic heap management are implemented.
Disclaimer: The company (Rapita Systems) I work for provides a code coverage solution aimed at embedded applications.
Because embedded systems bring their own, special and widely varying requirements, the "best" solution for code coverage also varies widely.
Where you have trace-based devices, like ARM chips with ETM or NEXUS-enabled parts, you can perform coverage without instrumentation via debuggers.
Otherwise, you are most likely faced with an instrumentation-based solution:
For RAM-limited targets, a good approach is to write instrumentation to an I/O port.
Alternatively, you can record instrumentation to a RAM buffer, and use a wide variety of means to extract this from the target.
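As a rough illustration of the RAM-buffer variant, here is a sketch in C. It is not how any particular coverage tool works; the names (cov_hit, cov_dump, sink_put_byte), the 64-point limit and the one-bit-per-point layout are all assumptions made for the example. Each instrumentation point sets one bit, and the bitmap is later pushed out through whatever byte-output function the target has.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COV_POINTS 64u  /* hypothetical number of instrumentation points */

/* One bit per point keeps the RAM cost at COV_POINTS / 8 bytes. */
uint8_t cov_bitmap[(COV_POINTS + 7u) / 8u];

/* Instrumentation call placed at the start of each block of interest. */
void cov_hit(unsigned id)
{
    assert(id < COV_POINTS);
    cov_bitmap[id / 8u] |= (uint8_t)(1u << (id % 8u));
}

int cov_was_hit(unsigned id)
{
    assert(id < COV_POINTS);
    return (int)((cov_bitmap[id / 8u] >> (id % 8u)) & 1u);
}

/* Push the bitmap through the target's byte-output routine; returns bytes sent. */
size_t cov_dump(void (*put_byte)(uint8_t))
{
    size_t i;
    for (i = 0; i < sizeof cov_bitmap; i++) {
        put_byte(cov_bitmap[i]);
    }
    return sizeof cov_bitmap;
}

/* Stand-in for the output function; a real one might write to a UART. */
uint8_t sink[sizeof cov_bitmap];
size_t sink_len = 0;
void sink_put_byte(uint8_t b)
{
    if (sink_len < sizeof sink) {
        sink[sink_len++] = b;
    }
}

/* Example of instrumented application code. */
int clamp(int x, int lo, int hi)
{
    cov_hit(0);
    if (x < lo) { cov_hit(1); return lo; }
    if (x > hi) { cov_hit(2); return hi; }
    cov_hit(3);
    return x;
}
```

After the tests have run on the target, calling cov_dump(sink_put_byte) (or the real output function) sends the bitmap to the host, where each bit can be mapped back to a source location. Turning that into gcov-style reports is a separate host-side step not shown here.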
Of course lots of different flavours of code coverage are also available: function, statement, decision/branch, MC/DC
Our family of C/C++ test coverage tools instrument the source code, producing a program you compile with your embedded compiler, which will collect test coverage data into a "small" data structure added to the program. This works with various dialects including ANSI, GCC, Microsoft and GreenHills.
You have to export that data structure from the embedded execution context to a file on a PC; this is often easy to do with a spare serial or parallel port and a small bit of custom code specific to your port. The tools will provide test coverage views and summaries from the resulting file.
So, you can use these tools to collect test coverage data from your embedded system, in most practical circumstances.
If your embedded target is supported by GCC-based cross-toolchains, you may find my blog post useful.
The main idea is that you compile your code with the appropriate gcov options, and then create the coverage information in memory (what in the end is stored in .gcda files). You can then place appropriate breakpoints with your GDB, and dump this information over your debug link (serial, JTAG, whatever).
Have a look at my blog post - I describe things in great detail.

Is it possible to embed LLVM Interpreter in my software and does it make sense?

Suppose I have some software and I want to make cross-platform plugins. You compile the plugin for a virtual machine, and any platform running my software would be able to run this code.
I am wondering if it is possible to use the LLVM interpreter and bitcode for this purpose. Also, I am wondering if it makes sense to use LLVM for this purpose instead of something else, i.e. is it what LLVM was made for?
I'm not sure that LLVM was designed for it. However, I doubt there is anything that hasn't been done using LLVM [1].
Other virtual-machine-based scripting engines were created specifically for the job:
Lua is very popular
Wikipedia lists some other Extension/embeddable languages under the Scripting language entry
If you're looking for embeddable virtual machines:
IKVM supports embedding JVM and CLR in a bridged mode (interoperable)
Parrot supports embedding (and includes a Python interpreter; mind you, you can just run python bytecode images)
Perl has similar architecture and supports embedding
Javascript supports embedding (not sure about the architecture of v8, but I guess it would use a virtual machine)
Mono's CLR engine supports embedding: http://www.mono-project.com/Embedding_Mono
[1] including compiling C++ to JavaScript to run in your browser...
There is VMIR (https://github.com/andoma/vmir), which is an LLVM bitcode interpreter / JIT engine that's intended to be embedded into other apps.
Disclaimer: I'm the author of it and it's still a work in progress, but it works reasonably well.
In theory, there exists a limited subset of LLVM IR which can be portable across various platforms. You shall not specify alignments, you shall not bitcast pointers to integral types, you must avoid intrinsics, etc. This means you can't immediately use code generated by a stock C compiler (llvm-gcc, Clang, whatever), unless you specify a limited target for it and implement sanitising LLVM passes. Another issue is that the bitcode format from different LLVM versions is not guaranteed to be compatible.
In practice, I would not go there. Mono is a reasonably small, embeddable, fast VM, and the whole .NET stack of tools is available for it. The VM itself is pretty low-level (as long as you do not care about verifiability).
LLVM includes an interpreter, so if you can build this interpreter for your target platforms, you can then evaluate LLVM bitcode on the fly.
It's apparently not so fast though.
This particular issue came up in the classic LLVM vs libJIT discussion (which you do not want to miss if you're a fan of open source, LLVM, or compilers), held long before LLVM became famous and established. Rhys Weatherley, the author of libJIT, stated that LLVM is not suitable to be embedded. Chris Lattner, the author of LLVM, stated otherwise: it is modular, and you can use it in any possible fashion, including embedding only the parts you need.

Do JVMs on Desktops Use JIT Compilation?

I always come across articles which claim that Java is interpreted. I know that Oracle's HotSpot JRE provides just-in-time compilation, however is this the case for a majority of desktop users? For example, if I download Java via: http://www.java.com/en/download, will this include a JIT Compiler?
Yes, absolutely. Articles claiming Java is interpreted are typically written by people who either don't understand how Java works or don't understand what interpreted means.
Having said that, HotSpot will interpret code sometimes - and that's a good thing. There are definitely portions of any application (around startup, usually) which are only executed once. If you can interpret that faster than you can JIT compile it, why bother with the overhead? On the other hand, my experience of "Java is interpreted" articles is that this isn't what they mean :)
EDIT: To take T. J. Crowder's point in: yes, the JVM downloaded from java.com will be HotSpot. There are two different JITs for HotSpot, however - server and desktop. To sum up the differences in a single sentence, the desktop JIT is designed to start apps quickly, whereas the server JIT is more focused on high performance over time: server apps typically run for a very long time, so time spent optimising them really heavily pays off in the long run.
There is nothing in the JVM specification that mandates any particular execution strategy. Some JVMs only interpret; they don't even have a compiler. Some JVMs only JIT compile; they don't even have an interpreter. Some JVMs have both an interpreter and a compiler (or even multiple compilers) and statically choose between the two on startup. Some have both and dynamically switch back and forth during runtime. Some aren't even virtual machines in the usual sense of the word at all; they just statically compile JVM bytecode into native machine code ahead of time.
The particular JVM that you are asking about, Oracle's HotSpot JVM, has one interpreter and two compilers, called the C1 and C2 compiler, also colloquially known as the client and server compilers, after their corresponding commandline options. HotSpot dynamically switches back and forth between the interpreter and one of the compilers at runtime (but it will not switch between the two compilers, you have to specify one of them on the commandline and then only that one will be used for the entire runtime of the JVM).
As per the documentation, starting with some of the later Java SE 7 releases, a new feature called tiered compilation became available. This feature uses the C1 compiler mode at the start to provide better startup performance. Once the application is properly warmed up, the C2 compiler mode takes over to provide more-aggressive optimizations and, usually, better performance.
The C1 compiler is an optimizing compiler which is pretty fast and doesn't use a lot of memory. The C2 compiler is much more aggressively optimizing, but is also slower and uses more memory.
You select between the two by specifying the -client and -server commandline options (-client is the default if you don't specify one), which also sets a couple of other JVM parameters like the default JIT threshold (in -client mode, methods will be compiled after they have been interpreted 1500 times, in -server mode after 10000 times, can be set with the -XX:CompileThreshold commandline argument).
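The threshold mechanism described above can be pictured with a toy model. This is a deliberately simplified sketch in C, not HotSpot's actual logic: the type and function names are made up for illustration, and a real JVM tracks more than a single invocation counter. The two constants are simply the figures quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Thresholds quoted above for -client and -server. */
enum { CLIENT_THRESHOLD = 1500, SERVER_THRESHOLD = 10000 };

typedef enum { MODE_INTERPRETED, MODE_COMPILED } exec_mode;

typedef struct {
    unsigned invocations;  /* calls seen while still interpreting */
    unsigned threshold;    /* plays the role of -XX:CompileThreshold */
    exec_mode mode;
} method_state;

void method_init(method_state *m, unsigned threshold)
{
    assert(m != NULL && threshold > 0);
    m->invocations = 0;
    m->threshold = threshold;
    m->mode = MODE_INTERPRETED;
}

/* Returns the mode used for this call, then updates the counter. */
exec_mode method_invoke(method_state *m)
{
    assert(m != NULL);
    exec_mode used = m->mode;
    if (m->mode == MODE_INTERPRETED && ++m->invocations >= m->threshold) {
        /* A real VM would hand the method to the JIT compiler here. */
        m->mode = MODE_COMPILED;
    }
    return used;
}
```

In this model a method initialised with CLIENT_THRESHOLD is interpreted for its first 1500 calls and runs compiled from call 1501 onwards. A method that is called only a handful of times never leaves MODE_INTERPRETED, which is the short-script case.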
Whether or not "the majority of desktop users" actually will run in compiled or interpreted mode depends largely on what code they are running. My guess is that the vast majority of desktop users run the HotSpot JVM from Oracle's JRE/JDK or one of its forks (e.g. SoyLatte on OSX, IcedTea or OpenJDK on Unix/BSD/Linux) and they don't fiddle with the commandline options, so they will probably get the C1 compiler with the default 1500 JIT threshold. (But applications such as IntelliJ, Eclipse or NetBeans have their own launcher scripts that usually supply different commandline arguments.)
In my case, for example, I often run small scripts which never actually reach the JIT threshold, so they are never compiled. (Nor should they be.)
Some of these links about the Hotspot JVM (what you are downloading in the java.com download link above) might help:
Java SE HotSpot at a Glance
The Java HotSpot Performance Engine Architecture
Frequently Asked Questions About the Java HotSpot VM
Neither of the (otherwise-excellent) answers so far seems to have actually answered your last question, so: Yes, the Java runtime you downloaded from www.java.com is Oracle's (Sun's) Hotspot JVM, and so yes, it will do JIT compilation. HotSpot isn't just for servers or anything like that, it runs on desktops and takes full advantage of its (very mature) optimizing JIT compiler.
The JVM spec never prescribes how to execute Java bytecode. However, if you use the HotSpot JVM, you can specify which JIT compiler to use. JIT is just a technique to optimize bytecode execution.

Will static linking on one unix distribution work but not another?

If I statically link an executable in Ubuntu, is there any chance that the executable won't work on another distribution such as Mint or Fedora? I know processor types are affected, but other than that, is there anything else I have to be wary of? Sorry if this is a dumb question. Thanks for any help.
There are a few corner cases, but for the most part, you should be in good shape with static linking. The one that comes to mind is libnss. This particular library is essentially impossible to link statically, because of the way it does its job (permissions, authentication, security tasks). As long as the glibc-versions are similar, you should be ok on this issue, though.
If your program needs to work with subtle features of the kernel, like volume managers, you've got a pretty slim chance of getting your program to work, statically linked, across distros, because the kernel interfaces may change slightly.
Most typical applications, the kind for which it even makes sense to discuss portability, like network services, GUI applications, and language tools (like compilers/interpreters), won't have a problem with any of this.
If you statically link a program on one computer and then move it to another computer in which the system basically runs the same way, then it should work just fine. That's the point of static linking; that there are no other files the program depends on - it's entirely self-contained, so as long as it can run at all, it will run the same way it does on its "host" system.
This contrasts with dynamic linking, in which the program incorporates elements of other files (libraries) at runtime. If you move a dynamically linked program to another system where the libraries it depends on are different (or nonexistent), it won't work.
In most cases, your executable will work just fine. As long as your executable doesn't depend on anything unusual being present for it to function, there will be no problem. (And, if it does depend on something unusual being present, then you'll have the same issue even if you dynamically link.)
Statically linking is usually safer than dynamically linking for compatibility between different UNIX environments, as long as the same CPU is in use.
To have a statically linked binary fail, again assuming the same processor architecture, you would have to do something such as link on a system using the a.out binary format and try to execute it on a system running ELF, in which case the dynamically linked version would fail just as badly.
So why do people not routinely link statically? Two reasons:
It makes the executable larger, sometimes MUCH larger, and
If bugs in the libraries are fixed, you'll have to relink your program to get access to the bug fixes. If a critical security bug is fixed in the libraries, you have to relink and redistribute your exe.
On the contrary. Whatever your chances are of getting a binary to work across distributions or even OSes, those chances are maximized by static linking. Static linking makes an executable self-contained in terms of libraries. It can still go wrong if it tries to read a file that's not there on another system.
For even better chances of portability, try linking against dietlibc or some other libc. An article at Linux Journal mentions some candidates. A smaller, simpler libc is less likely to depend on things in the filesystem that differ from distro to distro.
I would, for the reasons noted above, avoid statically linking something unless you absolutely must.
That being said, it should work on any other similar kernel of the same architecture (e.g. if you statically link on a machine running Linux 2.4.x, the loader VDSO is going to be different on Linux 2.6; the VDSO is the virtual dynamic shared object, a shared object that the kernel exposes to every process and that contains loader code).
Other pitfalls include things in /etc not being where you'd think, logs being in different places, system utilities being absent or different (ubuntu uses update-rc.d, RHEL uses chkconfig), etc.
There are times when you just have no choice. I was writing a program that talked to LVM2's string-based cmdlib interface instead of using execv(). Lo and behold, 30% of the distros I needed to support did NOT include that library and offered no way of getting it. So, I had to link against the static object when producing binary packages.
If you are using glibc, you can be confident that stuff like getpwnam() and friends will still work. Just make sure to watch any hard-coded paths (better yet, make them configurable at run time).
As long as you can guarantee it'll only be executed on a similar version of the OS on similar hardware, your program will work fine if it is statically linked. So, if you build for a 2.6 Linux and statically link, you will be fine to run on (almost) all 2.6 Linux distributions.
Be warned you can't statically link some parts of GLIBC so if you're using them you'll have to dynamically link anyway. From memory the name service stuff (nss) parts required dynamic linking when I was investigating it.
You can't statically link a program for (say) Linux then expect it to run on BSD or Windows. BSD and Unix don't present or handle their system calls in the same way Linux does. I tell a slight lie because the BSDs have a Linux emulation layer that can be enabled, but out of the box it won't work.
No, it will not work. Static linking for distribution independence is a concept from the old Unix ages and is not recommended. In fact, you often can't do it, as many libraries are not available as static libraries anyway.
Follow the Linux Standard Base way, this is your only chance to get as much cross distribution portability as possible.
The LSB also works fine if you program for FreeBSD and Solaris.
There are two compatibility questions at issue here: library versions and library inventory.
You don't say what libraries you are using.
If you have no '-l' options, then the only 'library' is glibc itself, which serves as the interface to the kernel. Glibc versions are upward compatible. If you link on a glibc 2.x system you can run on a glibc 2.y, for y > x. The developers make a firm commitment to this.
If you have -l options, static linking is always safe. If you are dynamically linked, you have to ensure that (1) the library is present on the target system, and (2) has a compatible version. Your Mileage Might Vary as to whether the target distro has what you need.
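The glibc rule above (link on 2.x, run on 2.y with y at least x) is easy to get wrong if versions are compared as strings, since "2.9" sorts after "2.10". Here is a small sketch in C of a numeric check. The function names are hypothetical, and treating a different major version as incompatible is my own conservative assumption, since the rule above only talks about 2.x versus 2.y.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "major.minor"; any trailing patch level is ignored. Returns 1 on success. */
int parse_version(const char *s, int *major, int *minor)
{
    assert(s != NULL && major != NULL && minor != NULL);
    return sscanf(s, "%d.%d", major, minor) == 2;
}

/*
 * Returns 1 if a binary linked against glibc `built` is expected to run on
 * glibc `runtime` under the upward-compatibility rule, 0 if not, and -1 if
 * either string cannot be parsed.
 */
int glibc_can_run(const char *built, const char *runtime)
{
    int bmaj, bmin, rmaj, rmin;
    if (!parse_version(built, &bmaj, &bmin) ||
        !parse_version(runtime, &rmaj, &rmin)) {
        return -1;
    }
    if (rmaj != bmaj) {
        return 0;  /* assumption: the rule is only stated within one major version */
    }
    return rmin >= bmin;
}
```

For example, glibc_can_run("2.17", "2.31") reports compatible, while the reverse does not. On a glibc system, the runtime version string can be obtained from gnu_get_libc_version(), declared in <gnu/libc-version.h>.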