Related
For example, how did the design of C# and VB.NET shape the development of CIL (and vice-versa)? What about Java and the JVM? How did the nature of PHP affect the development of HHBC/the HHVM, or Perl and Parrot, or Smalltalk and the VMs for various implementations?
Language design will influence the VM if the designers want it to. Some VMs are more independent than others. For example, Java does not have multiple inheritance, so JVM does not either.
Generally, a language machine (such as the Java Virtual Machine or the .NET CLR) will closely reflect the requirements of the language (Java for the JVM, C# for the CLR) for which it was designed.
For example, pretty much every Java byte code in the original JVM v1.0 was needed by the compiler. One could suggest that the needs of the JavaC compiler author(s) were being provided on demand by the JVM author(s). (It was a small team, so it may have even been the same person.)
The CLR is a bit different, because in addition to C#, they jammed in some stuff to support a pretend-C++ language, which required at least 3 additional op codes (IIRC). Nonetheless, the CLR was pretty much designed just to support C#.
It's interesting to analyze the Android Davlik engine, since it was designed as a JVM-but-without-using-JVM-byte-codes engine. (It is also register based, instead of stack based.)
At some level, the primary decision becomes this: Whether the engine is a low level Turing complete machine (something like a software RISC machine), or whether the engine's primitive language (its IL) is simply a binary form of its primary source code language. The former is more like WASM (arguably general purpose), while the latter is more like the JVM and CLR specs.
I understand that JVM and CLR were designed as stack-based virtual machines. When JIT compiles bytecode into native code, does it also translate stack primitives (load/store) to registers on X86 platform?
If yes, it looks like whether bytecode is stack-based or register-based doesn't really matter. JIT matters.
I think that you are confusing two different concepts.
At least for Java, the JVM acts as a virtual machine - it's an idealized computing machine with a comparatively high-level assembly language (the bytecode) that is based on a call stack with stack frames. When compiling Java into bytecode, the Java program is turned into (essentially) an assembly program for controlling this machine.
When actually running Java on a given system, the job of the JVM implementation is to faithfully simulate the execution of this stack-based machine using whatever hardware is actually available. This typically means that a huge number of stack operations would be implemented using registers when possible, and perhaps using other specialized hardware that isn't present in the description of the Java virtual machine. The actual details of how this is done is implementation-specific - some implementations might compile it down to machine code that does almost everything in registers, while a simpler implementation might just compile down to in-memory operations. I worked for a few months on a JavaScript implementation of the JVM, in which case we "compiled" the code down to JS functions, which were in turn handed off to the browser's JS implementation.
The reason for this distinction is that Java was designed to be easily downloaded and embedded (think applets). In this case, security and portability are important concerns. The bytecode had to have some way to be inspected automatically to rule out certain types of malicious code (buffer overruns, for example). Similarly, whatever format was used had to be sufficiently high-level that it could be run on a variety of different platforms (handheld devices, supercomputers, PCs, etc.) The choice of the stack-based JVM made both of these concerns possible to satisfy simultaneously. It's high-level enough that it's possible to inspect the bytecode to rule out many type errors or reads/writes of uninitialized memory, while sufficiently low-level that a JVM can use tricks like compiling down to code using registers.
If you are curious what your particular JVM will do to a specific piece of code, you should take a look at the documentation. Most JVMs have some way of giving you information about how they're executing the code. If your question is "why not just have bytecode do register-based manipulation," the reason is twofold:
There is an analog of registers in bytecode - each stack frame has some extra dedicated space for temporary values to be stored, and
There isn't as robust support for registers as is present in x86 or MIPS because the JVM code had to be easy to execute on multiple pieces of hardware, and hardcoding in a number of registers might complicate things.
Hope this helps!
It is impossible to not use registers on an x86 core. The processor doesn't have an instruction to, say, add two local variables. One of them has to be loaded in a register. Then you can add the value in the register to the value in a variable. And store the result back to a stack variable.
The optimization opportunities are obvious from this sequence. Like not storing it back but keeping the result in a register and using it later, saving both the store and the load. That's the job of the optimizer, it looks for ways to make the best use of the available registers.
The only way to know for sure would be to examine JIT compiled output, but it's quite safe to say that using registers is one of the JIT compiler's lamest optimizations. I believe most programmers would be hard pressed to write faster code than the JIT compiler does.
The JIT compiler is capable of a lot, and probably uses registers as much as is appropriate. Things like method inlining encourage the use of registers, and a lot of imperative program code can be expressed more simply on a register-based architecture, so it only makes sense for the JIT compiler to use registers.
As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity here is an example. The JVM uses a stack machine rather then a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so intending platform portability, and converting a stack machine back into a register machine was easier and more efficient then overcoming an impedance mismatch where there were too many or too few registers virtual.
Here's another example: for Haskell, the paper to look at is Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. This is very different from any other type of VM I know about. And in point of fact GHC (the premier implementation of Haskell) does not run live, but is used as an intermediate step in compilation. Peyton-Jones lists no less then 8 other virtual machines that didn't work. I would like to understand why some VM's succeed where other fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for "interpreter" of a lower level language than the source language. Here I'm using the black box meaning of the word "interpreter". I don't care how a VM gets implemented (as a bytecode intepereter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing it's the low level language.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two it will also make it easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like Hotspot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or, perhaps some more complicated shape than a line because this isn't a flat Euclidean space, but you get the idea). If you build your VM language far outside of that line then it won't be very useful. That's what constrains VM design: putting it somewhere into that ideal line.
That line is also why high level VMs tend to be very language specific while low level VMs are more language agnostic but don't provide many services. A high level VM is by its nature close to the source language which makes it far from other, different source languages. A low level VM is by its nature close to the target platform thus close to the platform end of the ideal lines for many languages but that low level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques of deriving a VM is to just go down the compilation chain, transforming your source language into more and more low level intermediate languages. Once you spot a low level enough language suitable for a flat representation (i.e., the one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformations chain from the point you selected for a serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
It is a bit different story if you intend to interpret you VM code immediately instead of compiling it. There is no consensus currently in what kind of VMs are better suited for interpretation. Both stack- (or I'd dare to say, Forth-) based VMs and register-based had proven to be efficient.
I found this book to be helpful. It discusses many of the points you are asking about. (note I'm not in any way affiliated with Amazon, nor am I promoting Amazon; just was the easiest place to link from).
http://www.amazon.com/dp/1852339691/
As Oracle sues Google over the Dalvik VM it becomes clear, that you cannot implement a Java VM without license from Oracle (EDIT: Matthew Flaschen points out, that the claims of Oracle may not be valid. Anyways we have currently a situation, where Oracle threats VM-implementations.). That may become the death for Open-Source-implementations of Java (like Apache Harmony).
I don't want to discuss the impact or the legitimation of this lawsuit. but as a Java-programmer I want to take a deeper look into the alternatives, to be prepared for every case. As I see the creation of a compiler as a minor problem, my main interest are alternative VM-implementations, that serve a similar purpose as the JVM.
The VM I'm looking for, should meet some conditions:
free of patent-issues
an Open-Source-implementation exists
potential for optimizations/good performance
platform independent (the VM can be ported to different platforms without bigger hurdles)
Please add some recommendations for me.
LLVM is a really good optimizing, low level virtual machine. It can support languages like C and C++, and does not have built in support for high level features like garbage collection.
VMKit is an implementation of the Java and CLI virtual machines on top of LLVM. Since it uses Java bytecode, this probably wouldn't help with the patent issues.
HLVM is another interesting high level virtual machine built on top of LLVM. It is probably different enough to avoid most well known patents, but it is mainly targeted at numerical computing and functional programming.
On the dynamically typed side, there is Parrot.
I am actually working on a compiler and VM for a language of my own design, but don't count on it ever being finished. ;-)
Keep in mind that any large piece of software will infringe on numerous patents, the important thing is how well known they are (and how much the patents' owners actively seek out infringers). Of course, the whole patent system is absurd, and we would be much better off getting rid of it.
I don't think there is any significant piece of software that is free from patent issues.
If you are an independent developer or working for a smaller company you probably won't get hit directly by the problems though. It's unlikely that big companies holding patents will go after lots of small claims - it's an expensive process and causes a lot of resentment. SCO tried something like that and it didn't work out too well for them.
I would concentrate on finding the best tool for the job without worrying too much about the patent issues, otherwise you will never get anything done.
GraalVM is a research project developed by Oracle Labs and already in production at Twitter. I can't believe my eyes that no one mentions anything about it, it’s so weird. Anyways, GraalVM is a well promising extension of the java virtual machine to support more language and execution modes for running applications like JavaScript, Python, Ruby, R, JVM-based languages, and LLVM-based languages such as C and C++.The GraalVM project includes a new high-performance Java compiler, itself called Graal, which can be used in a just-in-time configuration on the HotSpot VM, or in an ahead-of-time configuration on the SubstrateVM. The main goal of this project is to improve the performance of the java virtual machine base language to match the performance of native languages. Let’s sum up the novel features that this project offers and make a brief explanation according to the docs why you should adopt it.
Polyglot: All languages (even LLVM-based) share the same VM and its capabilities. Zero overhead interoperability between programming languages allows you to write polyglot applications and select the best language for your task
Native: Native images compiled with GraalVM ahead-of-time improve the startup time and reduce the memory footprint of JVM-based applications.
Embeddable: GraalVM can be embedded in both managed and native applications. There are existing integrations into OpenJDK, Node.js, Oracle Database, and MySQL GraalVM removes the isolation between programming languages and enables interoperability in a shared runtime. It can run either standalone or in the context of OpenJDK, Node.js, Oracle Database, or MySQL.
Performance: Graal benchmark reports show great performance improvements in almost all of its implementations thanks to the way that GraalVM performs object allocations
If someone don’t get convinced by now that is a good choice and it is a really awesome project you can see this talk by Christian Thalinger on “on why Graal is a good fit for Twitter”
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I'm looking for suggestions on possible IPC mechanisms that are:
Cross platform (Win32 and Linux at least)
Simple to implement in C++ as well as the most common scripting languages (perl, ruby, python, etc).
Finally, simple to use from a programming point of view!
What my options are? I'm programming under Linux, but I'd like what I write to be portable to other OSes in the future. I've thought about using sockets, named pipes, or something like DBus.
In terms of speed, the best cross-platform IPC mechanism will be pipes. That assumes, however, that you want cross-platform IPC on the same machine. If you want to be able to talk to processes on remote machines, you'll want to look at using sockets instead. Luckily, if you're talking about TCP at least, sockets and pipes behave pretty much the same behavior. While the APIs for setting them up and connecting them are different, they both just act like streams of data.
The difficult part, however, is not the communication channel, but the messages you pass over it. You really want to look at something that will perform verification and parsing for you. I recommend looking at Google's Protocol Buffers. You basically create a spec file that describes the object you want to pass between processes, and there is a compiler that generates code in a number of different languages for reading and writing objects that match the spec. It's much easier (and less bug prone) than trying to come up with a messaging protocol and parser yourself.
For C++, check out Boost IPC.
You can probably create or find some bindings for the scripting languages as well.
Otherwise if it's really important to be able to interface with scripting languages your best bet is simply to use files, pipes or sockets or even a higher level abstraction like HTTP.
Why not D-Bus? It's a very simple message passing system that runs on almost all platforms and is designed for robustness. It's supported by pretty much every scripting language at this point.
http://freedesktop.org/wiki/Software/dbus
If you want a portable, easy to use, multi-language and LGPLed solution, I would recommend you ZeroMQ:
Amazingly fast, almost linear scaleable and still simple.
Suitable for simple and complex systems/architectures.
Very powerful communication patterns available: REP-REP, PUSH-PULL, PUB-SUB, PAIR-PAIR.
You can configure the transport protocol to make it more efficient if you are passing messages between threads (inproc://), processes (ipc://) or machines ({tcp|pgm|epgm}://), with a smart option to shave off some part of the protocol overheads in case of connections are running between VMware virtual machines (vmci://).
For serialization I would suggest MessagePack or Protocol Buffers (which other have already mentioned as well), depending on your needs.
You might want to try YAMI , it's very simple yet functional, portable and comes with binding to few languages
I can suggest you to use the plibsys C library. It is very simple, lightweight and cross-platform. Released under the LGPL. It provides:
named system-wide shared memory regions (System V, POSIX and Windows implementations);
named system-wide semaphores for access synchronization (System V, POSIX and Windows implementations);
named system-wide shared buffer implementation based on the shared memory and semaphore;
sockets (TCP, UDP, SCTP) with IPv4 and IPv6 support (UNIX and Windows implementations).
It is easy to use library with quite a good documentation. As it is written in C you can easily make bindings from scripting languages.
If you need to pass large data sets between processes (especially if speed is essential) it is better to use shared memory to pass the data itself and sockets to notify a process that the data is ready. You can make it as following:
a process puts the data into a shared memory segment and sends a notification via a socket to another process; as a notification usually is very small the time overhead is minimal;
another process receives the notification and reads the data from the shared memory segment; after that it sends a notification that the data was read back to the first process so it can feed more data.
This approach can be implemented in a cross-platform fashion.
How about Facebook's Thrift?
Thrift is a software framework for scalable cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml.
I think you'll want something based on sockets.
If you want RPC rather than just IPC I would suggest something like XML-RPC/SOAP which runs over HTTP, and can be used from any language.
YAMI - Yet Another Messaging Infrastructure is a lightweight messaging and networking framework.
If you're willing to try something a little different, there's the ICE platform from ZeroC. It's open source, and is supported on pretty much every OS you can think of, as well as having language support for C++, C#, Java, Ruby, Python and PHP. Finally, it's very easy to drive (the language mappings are tailored to fit naturally into each language). It's also fast and efficient. There's even a cut-down version for devices.
Distributed computing is usually complex and you are well advised to use existing libraries or frameworks instead of reinventing the wheel. Previous poster have already enumerated a couple of these libraries and frameworks. Depending on your needs you can pick either a very low level (like sockets) or high level framework (like CORBA). There can not be a generic "use this" answer. You need to educate yourself about distributed programming and then will find it much easier to pick the right library or framework for the job.
There exists a wildly used C++ framework for distributed computing called ACE and the CORBA ORB TAO (which is buildt upon ACE). There exist very good books about ACE http://www.cs.wustl.edu/~schmidt/ACE/ so you might take a look. Take care!
TCP sockets to localhost FTW.
It doesn't get more simple than using pipes, which are supported on every OS I know of, and can be accessed in pretty much every language.
Check out this tutorial.
Python has a pretty good IPC library: see https://docs.python.org/2/library/ipc.html
Xojo has built-in cross-platform IPC support with its IPCSocket class. Although you obviously couldn't "implement" it in other languages, you could use it in a Xojo console app and call it from other languages making this option perhaps very simple for you.
In the current ages there is available a very easy, C++1x compliant, well documented, Linux and Windows compatible, open-source "CommonAPI" library: CommonAPI C++.
The underlying IPC system is D-Bus (libdbus) or SomeIP if one wish. Application interfaces are specified using a simple and tailored for that Franca IDL language.