Interpretation in scripting languages - interpreter

Why is pure interpretation preferred for scripting languages, compared to other programming languages? That is, why are scripting-language programs not converted to machine language and then executed? From what I have read, one reason is speed: speed is not so important for scripting purposes, so the fact that interpretation is slower does not matter much. Are there other reasons for using interpretation in scripting?

Some of your assumptions are incorrect.
However, the normal reasons for choosing interpreting rather than compiling (to machine code) are:
it is easier (less effort) to implement an interpreter,
interpreters are easier to port to multiple platforms,
compilation to native code takes time, which can slow down the development cycle and/or lead to longer application startup times in the JIT compilation case [1].
[1] The latter is complicated, and it is difficult to do an even-handed comparison. The flip side is that after the slow startup, the JIT-compiled program runs much faster than interpreted code, and possibly faster than statically compiled code.

Scripting, and especially interactive scripting (command shells, etc.), mostly involves code that runs only once, and in such a scenario latency matters more than anything else. A JIT is of little use here: you expect immediate feedback from your REPL, and waiting for each command to be compiled before it executes is rarely justified.
Therefore ad hoc interpretation or, better, very lightweight compilation (e.g., compiling to a high-level VM that is then interpreted) is preferred to proper heavyweight compilation.
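As a rough illustration of "lightweight compilation to a VM that is then interpreted", here is a minimal sketch using CPython's built-in compile and exec. The source line, the "<repl>" filename and the variable names are made up for the example; it only shows the shape of what a REPL does for each command, not how any particular shell is implemented.

```python
import dis

# One "command" typed at a hypothetical REPL prompt.
source = "total = sum(range(10))"

# Lightweight compilation: source text -> code object holding VM bytecode.
# No native machine code is produced at this step.
code = compile(source, "<repl>", "exec")

# The VM interprets the bytecode straight away.
namespace = {}
exec(code, namespace)
print(namespace["total"])

# Show the VM instructions that were generated (output varies by Python version).
dis.dis(code)
```

The compile step here is cheap enough to run on every command, which is the trade-off described above: some translation work up front, but nothing close to the cost of a full optimizing compiler.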

Related

Analyzing the speed-up of Oracle's HotSpot versus other compilation techniques

I'm currently working on a project that must involve research into JIT techniques. I'm a complete beginner when it comes to anything related to compilers, but I did some research and learned about Java's HotSpot VM. I was hoping to do an analysis of the benefits (or downsides) of using HotSpot versus traditional compilers (for example, g++).
My initial idea was to create some sort of simple program that can be run through both compilers in order to compare compilation times but this brought up a number of questions:
From my understanding, Java source code is initially turned into bytecode by the javac compiler (creating .class files) and then, in turn, this bytecode can be run through HotSpot at runtime to execute the program. Given this, would it even be relevant to compare results with a traditional compiler that converts sources directly to machine code?
Another concern I'm facing is that the programs would be in different languages (e.g., C++ vs. Java). Although the functionality would be identical, could this skew results when attempting to compare?
Moving on, if the above two points are not a problem, my main question is:
How can I actually go about benchmarking the speed-up in one method versus the other?
I did some brief research about this but all I was able to find were ways to measure the efficiency of the program itself, not the compilation technique used to run it. Is what I'm trying to do possible? Are there methods to actually analyze the speed up of one compiler over another?
Any help is appreciated!
How can I actually go about benchmarking the speed-up in one method versus the other?
You first need to consider what you actually intend to measure. In other words, saying "the speed-up" is not sufficiently rigorous.
Are we talking about CPU cycles spent compiling? Or wall-clock time from source code to running program? Or peak performance of a few critical methods in a microbenchmark? Overall steady-state program performance? Speed of program initialization? ...
In the end you're comparing two systems that made quite different trade-offs. You can find a few roughly comparable benchmarks already mentioned in the comments, but they mostly represent a specific type of throughput-bound task and not large applications. It's not as if you can find an application such as Firefox written in both C and Java with identical feature sets and comparable code quality. So any comparison you do will be incomplete, because you'll have to use some limited proxy measurement of how comparable the two code bases are.
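To make the "what do you intend to measure" point concrete, here is a small sketch that reports translation time, first-call time and repeated-call time as separate numbers instead of one "speed-up". It uses Python's own compile step purely as a stand-in; the function work, the measure helper and the parameter values are invented for illustration, and a real HotSpot vs. g++ comparison would need each toolchain's own tooling.

```python
import time

# Hypothetical workload, kept as source text so that translating it can be timed.
SOURCE = """
def work(n):
    total = 0
    for i in range(n):
        total += i * i
    return total
"""

def measure(source, n=10_000, repeats=50):
    """Return separate timings rather than a single 'speed-up' figure."""
    t0 = time.perf_counter()
    code = compile(source, "<bench>", "exec")    # translation cost
    t1 = time.perf_counter()

    namespace = {}
    exec(code, namespace)                        # defines work()
    work = namespace["work"]

    t2 = time.perf_counter()
    work(n)                                      # first (cold) call
    t3 = time.perf_counter()

    for _ in range(repeats):                     # repeated (warm) calls
        work(n)
    t4 = time.perf_counter()

    return {
        "compile_s": t1 - t0,
        "first_call_s": t3 - t2,
        "steady_call_s": (t4 - t3) / repeats,
    }

timings = measure(SOURCE)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.6f}")
```

On a JIT-based runtime the gap between the first call and the steady-state calls is usually the interesting part; on a plain interpreter the two will be close. Either way, which of the three numbers matters depends on the question being asked.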

How compiled language is better than interpreted language in optimizing the hardware?

Specifically, how is a compiled language able to make better use of the hardware than an interpreted language? Other online sources that I have read gave only vague explanations, such as "because it is written in the native code of the target machine", and some offered no explanation at all. I would appreciate an explanation that is as layman-friendly as possible, given that I've only just started to code.
One major reason is optimizing compilers. Compiling "in advance" makes it much easier to apply optimizations to code, especially if you're compiling to native machine code (as you typically do in C, for example). Knowing details of the machine the program will be deployed on allows machine-specific optimizations. This is especially important for, for example, Pentium-based processors, which have numerous complicated instructions that tend to require some knowledge of program structure in order to use (e.g. the MMX instruction set).
There are also some cases where the compiler can make structural changes to programs. For example, under special circumstances, some compilers can replace recursion with loops. (I once heard of someone writing a recursive Factorial function in C to learn about how to implement recursion in assembly language only to realize to his horror that the compiler had recognized an optimization and replaced his recursion with a for loop).
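The recursion-to-loop rewrite mentioned above can be shown by hand. The sketch below is in Python only to keep it short; Python itself does not perform this transformation, and the two function names are made up. The second function is roughly the shape an optimizing C compiler may produce from the first under the right circumstances.

```python
def factorial_recursive(n):
    # What the programmer wrote: one stack frame per call.
    if n <= 1:
        return 1
    return n * factorial_recursive(n - 1)

def factorial_loop(n):
    # What the optimizer may turn it into: a single frame and a loop.
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

# Both versions compute the same values.
for n in range(10):
    assert factorial_recursive(n) == factorial_loop(n)
print(factorial_loop(5))
```

The loop version avoids the call overhead and the stack growth, which is why a compiler that can prove the two are equivalent prefers it.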

Anyone program "low-level" on JVM?

Sometimes we hear about brave people who understand and write assembly language for performance reasons, as opposed to using a compiler with a high-level language. Can the same be done on the JVM? I've reviewed the JVM instruction set, and it resembles assembly language in some respects, though it's much higher level (I'm assuming that the system-specific implementations of the JVM are extremely efficient).
Is it possible to, say, write JVM instructions and put them into a Java-executable binary?
Yes. You can do this via the ASM library.
In fact, this is typically how people implement non-Java languages on top of the JVM, and how many Java metaprogramming libraries work.
You may very well want to do this for the same kind of metaprogramming capabilities, e.g., generating classes at runtime, or using the invokedynamic instruction to define your own method dispatch rules.
There isn't a whole lot of performance benefit to be gained from using raw Java bytecode rather than writing the corresponding high-level Java (the JIT is your main performance booster, and it's optimized for the sorts of patterns "vanilla" Java code generates) but it does give you flexibility for things that are difficult, verbose, or impossible to express in Java.

Is optimization necessary when code generation is targeting a runtime with JIT?

I'm planning on writing a programming language targeting the .NET platform, which led me to start thinking about the code generation aspect of targeting such a platform. I'm new at writing compilers, but I know that optimization is (or can be) one of the phases of compiling. I started to wonder whether there is any benefit to spending time optimizing the output (in this case CIL, but this would apply to the JVM too), because the JIT compiler and things like the JVM's HotSpot could optimize at run time. Is there any benefit from optimizing the generated code (CIL or the JVM equivalent) when targeting .NET or the JVM, since the JIT will already optimize?
It depends. There are countless optimizations. Any given compiler (your compiler, the JIT compiler, or any other compiler) necessarily implements only a subset of those. This choice depends on available time, typical/expected input code, priorities, etc. and therefore the engineers who built the JIT compiler may have selected optimizations which work well for the programs they were expecting, but not so well for the kind of program you care about.
You will have to determine what optimizations the JIT compiler misses. The way to do this is, of course, empirical: Actually write programs, letting the JIT compiler optimize them (be sure to do this part properly - disable debugging, compile for release, choose realistic benchmarks, etc.), and then inspect the final machine code. Look for unexpected code (you will, of course, need assembly knowledge for this) and determine if it's a missed optimization or if the JIT was smarter than you thought.
If it is a missed optimization, you have another problem: You can't output the machine code you want, you have to generate different IL instead.
A missed optimization is probably due to a language feature the VM doesn't know about (e.g. multimethods on the JVM). You lowered it into the VM's terms during compilation, but the translation you chose doesn't sit well with the JIT's order of passes, heuristics, etc.
As you can't just output machine code yourself, you must now find an alternative IL fragment for the same input language code. Ideally, one which the JIT compiler does handle well. Finding that may be an exercise in imagination, but it's not technically hard, just guesswork interleaved with benchmarking.
As another answer points out, JIT compilers work under time constraints. This may lead to optimizations that could happen being missed (e.g. constant propagation running out of time), but as the creators of the JIT compiler faced the same problem, this probably isn't too severe if you don't create much larger/more complicated code.
If you create such bad code that the JIT compiler can't fix it all, then you have to duplicate its optimizations in your AOT compiler. I'm not convinced that this is a likely scenario though, and even if it happens, even very simple optimizations should mostly fix the problem.
So, in summary: Start with a straightforward translation, then seek out missed optimizations and either make it easier to optimize for the JIT compiler, or do it yourself (if possible - adaptive optimization is much harder in an AOT setting).
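As an example of the kind of "very simple optimization" an AOT front end can do itself before handing code to the JIT, here is a sketch of constant folding over a toy stack-based IL. The instruction names (PUSH, LOAD, ADD, SUB, MUL) and the tuple encoding are invented for this example and are not CIL or JVM bytecode; the pass also assumes straight-line code with no jump targets in the middle of a folded sequence.

```python
import operator

# Binary operations of the toy IL that are safe to evaluate at compile time.
FOLDABLE = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}

def fold_constants(code):
    """Peephole pass: rewrite PUSH a, PUSH b, <op> into PUSH (a op b)."""
    out = []
    for instr in code:
        op, _ = instr
        if (op in FOLDABLE and len(out) >= 2
                and out[-1][0] == "PUSH" and out[-2][0] == "PUSH"):
            b = out.pop()[1]
            a = out.pop()[1]
            out.append(("PUSH", FOLDABLE[op](a, b)))
        else:
            out.append(instr)
    return out

# Straightforward translation of: x * (2 + 3 * 4)
unoptimized = [
    ("LOAD", "x"),
    ("PUSH", 2),
    ("PUSH", 3),
    ("PUSH", 4),
    ("MUL", None),
    ("ADD", None),
    ("MUL", None),
]

optimized = fold_constants(unoptimized)
print(optimized)  # [('LOAD', 'x'), ('PUSH', 14), ('MUL', None)]
```

Whether a pass like this pays off is exactly the empirical question raised above: the JIT may already fold these constants, in which case the only gain is smaller IL.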
I think this question is hard to answer in general.
For example, the F# compiler performs tail call optimization, because tail-recursive functions are common in that language. The F# compiler can do a better job of optimizing them in some cases than the JIT compiler, and some versions of the JIT compiler don't perform the optimization at all.
So, your language might have some common operation whose straightforward implementation wouldn't perform well. In that case, it makes sense to emit optimized IL code.
What I think you should do is the same as when you're writing a normal program: first write your code in a way that is simple and readable. Only if something doesn't perform well, attempt to optimize that. It might be worth considering that you might need some optimizations in the future and make your code modular enough, so that you don't have to rewrite half of it because of some optimization. But for now, that should be enough.
Writing a compiler is a hard enough job already (even if you're targeting an IL). Finish it first and think about optimizations later.
Generally, JIT compilers have some thresholds governing how much optimization they will attempt to perform. These may be based on the size of a method's IL and/or the amount of time already spent JIT compiling the method. So yes, IL which has already been optimized may benefit from further JIT optimization. As always, there is a trade-off: how much time do you want to spend adding AOT optimizations to your compiler (and testing/maintaining them) versus how quickly your code can be JIT compiled, and with what level of optimization.
The magnitude of the improvement depends largely on how much simpler (and smaller) the AOT-optimized IL is relative to the unoptimized IL, as well as the thresholds governing the JIT compiler (which, at least for the Microsoft CLR, are not widely known). The only way to find out is to do some testing yourself.

Why do almost all OO languages compile to bytecode?

Of the object-oriented languages I know, pretty much all but C++ and Objective-C compile to bytecode running on some sort of virtual machine. Why have so many different languages settled on compiling to bytecode, as opposed to machine code? Is it possible in principle to have a high-level, memory-managed OOP language that compiles to machine code?
Edit: I'm aware that multiplatform support is often advanced as an advantage of this approach. However, it's quite possible to compile natively on multiple platforms without writing a new compiler per platform. One can, for example, emit C code and then compile that with GCC.
There's no real reason; this is something of a coincidence. OOP is now the leading concept in "big" programming, and so are virtual machines.
Also note that there are two distinct parts of a traditional virtual machine, the garbage collector and the bytecode interpreter/JIT compiler, and these parts can exist separately. For example, the Common Lisp implementation SBCL compiles programs to native code but relies heavily on garbage collection at runtime.
This is done to allow a VM or JIT compiler the chance to compile the code on demand optimally for the architecture on which the code is executed. Also, it allows for cross-platform bytecode to be created once and then executed on multiple hardware architectures. This allows for hardware specific optimizations to be placed into the compiled code.
Since byte code is not limited to a microarchitecture, it can be smaller than machine code. Complex instructions can be represented vs. the much more primitive instructions available in modern day CPUs, since the constraints in the design of CPU instructions are very different from the constraints in designing a bytecode architecture.
Then there's the issue of security. The bytecode can be verified and analyzed prior to execution (e.g., to rule out buffer overflows or variables of one type being accessed as something they are not).
Java uses bytecode because two of its initial design goals were portability and compactness. Those both came from the initial vision of a language for embedded devices, where fragments of code could be downloaded on the fly.
Python, Ruby, Smalltalk, JavaScript, awk and so on use bytecode because writing a native compiler is a lot of work, but a textual interpreter is too slow. Bytecode hits a sweet spot of being fairly easy to write, but also satisfactorily quick to run.
I have no idea why the Microsoft languages use bytecode, since for them neither portability nor compactness is a big deal. A lot of the thinking behind the CLR came out of computer scientists in Cambridge, so I imagine considerations like ease of program analysis and verification were involved.
Note that as well as C++ and Objective-C, Eiffel, Ada 9X, Vala and Go are OO languages (of varying vintage) that are compiled straight to native code.
All in all, I'd say that OO and bytecode do not go hand in hand. Rather, we have a coincidental convergence of several streams of development: the traditional bytecoded interpreters of scripting languages like Python and Ruby, the mad Gosling masterplan of Java, and whatever Microsoft's motives are.
The biggest reason why most interpreted languages (not specifically OO languages) are compiled to bytecode is for performance. The most expensive part of interpreting code is transforming text source to an intermediate representation. For instance, to perform something like:
foo + bar;
The interpreter would have to scan 10 characters, transform them into 4 tokens, build an AST for the operation, resolve three symbols (+ is a symbol, which depends on the types of foo and bar), all before it can perform any action that actually depends on the run-time state of the program. None of this can change from run to run, and so many languages try to store some form of intermediate representation.
Storing bytecode rather than an AST has a few advantages. For one, bytecode is easy to serialize, so the IR can be written to disk and reused at the next invocation, further reducing interpretation time. Another is that bytecode often takes up less RAM. Significantly, bytecode representations are also often easy to just-in-time compile, because they are structurally similar to typical machine code.
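The "translate once, run many times" idea can be sketched with a toy interpreter for the foo + bar; example. Everything here (the tokenizer, the instruction names LOAD, PUSH, ADD and so on, and the left-to-right evaluation without operator precedence) is invented for illustration and does not correspond to any real language's bytecode.

```python
import operator
import re

BINOPS = {"ADD": operator.add, "SUB": operator.sub,
          "MUL": operator.mul, "DIV": operator.truediv}
OPCODES = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def tokenize(source):
    # Identifiers, integer literals, and single-character operators.
    return re.findall(r"[A-Za-z_]\w*|\d+|[+\-*/;]", source)

def compile_expr(tokens):
    """Compile `operand (op operand)* ;` to stack-machine instructions.
    Simplification: no precedence, operators apply left to right."""
    def operand(tok):
        return ("PUSH", int(tok)) if tok.isdigit() else ("LOAD", tok)
    code = [operand(tokens[0])]
    i = 1
    while i < len(tokens) and tokens[i] != ";":
        code.append(operand(tokens[i + 1]))
        code.append((OPCODES[tokens[i]], None))
        i += 2
    return code

def run(code, env):
    """Execute the instructions; only this step depends on run-time state."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "LOAD":
            stack.append(env[arg])
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(BINOPS[op](left, right))
    return stack.pop()

tokens = tokenize("foo + bar;")
bytecode = compile_expr(tokens)       # done once
print(tokens)                         # ['foo', '+', 'bar', ';']
print(bytecode)                       # [('LOAD', 'foo'), ('LOAD', 'bar'), ('ADD', None)]
print(run(bytecode, {"foo": 2, "bar": 3}))     # 5
print(run(bytecode, {"foo": 10, "bar": -4}))   # 6
```

The scanning and compiling happen once; each later run only walks the short instruction list. That list is also trivially serializable, which is the property that lets an implementation cache it on disk.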
As another data point, the D programming language is GC'ed, OO, and a lot higher level than C++ while still being compiled to native code.
Bytecode is a significantly more flexible medium than machine code. First, it provides the basis for platform portability without the need for a compiler or for shipping source code. So a developer can distribute a single version of the application without needing to give up the source, require complex developer tools, or anticipate potential target platforms. While the latter is not always practical, it does happen, especially with libraries: say I distribute a library that I've only tested on Windows, but someone else uses it on Linux or Android. It happens quite frequently, actually, and most of the time it works as expected.
Bytecode is also generally faster to execute than source handled directly by an interpreter, because it is closer to machine instructions and therefore faster to translate into them. Not all OO languages are compiled ahead of time. Ruby, Python, and even JavaScript are typically shipped as source and interpreted, so, for example, the Ruby interpreter has to take a very flexible language and turn it into instructions, and that flexibility comes at a price paid at runtime: parse the text, generate an AST, translate the AST into executable form, etc. With bytecode it's also easy to do optimizations like JIT compilation, where bytecode is translated to machine code directly, which even opens up the possibility of optimizing for specific hardware.
Finally, just because one language compiles to bytecode doesn't preclude other languages from taking advantage of that bytecode. Any optimization that works on the bytecode can be applied to other languages that know how to translate themselves to it. That makes the bytecode a very important layer of reuse across languages.
OO and byte code compilation goes back to the 70s with Smalltalk, and I'm sure someone will say LISP as early as the 50s/60s. But, it really wasn't until the 90s that it started to really be used in production systems on a large scale.
Native compilation sounds like the optimal path, which is probably why our industry spent 20 years or more thinking it was THE ANSWER to all our problems, but over the last 15 years we've seen bytecode compilation take the stage, and it has been a significant advantage over what we did before. Looking back, we realize how much time was wasted natively compiling everything, mostly by hand.
I agree with Chubbard's answer, and I'd add that in OO languages type information can be very important for enabling optimizations by virtual machines or last-level compilers.
It is easier to develop an interpreter than a compiler.
Effort in development of...:
interpreter < bytecode-interpreter < bytecode-jit-compiler < compiler-to-platform-independent-language < compiler-to-multiple-machine-dependent-assembler.
There is a general trend to stop development at JIT compilers because of platform independence. Only the languages favored for performance and for research in theoretical computer science are, and will be, developed in ALL possible directions, including new bytecode interpreters, even where good and advanced compilers to platform-independent languages and to various machine-dependent assemblers already exist.
Research in OOP languages is pretty, let's say, dull compared to functional languages, because really new language and compiler technologies are more easily expressed using mathematical category theory and mathematical descriptions of Turing-complete type systems. In other words, that research is nearly functional in itself, while imperative languages are little more than assembler front ends with some syntactic sugar. OOP languages tend to be imperative languages, because functional languages already have closures and lambdas. There are other ways to implement Java-like "interfaces" in functional languages, and there is simply no need for additional object-oriented features.
In Haskell, for example, adding OOP-style programming would probably be more than a few steps back in technology; there would be no point in using it. (That is not only my opinion... have you ever heard of GADTs or multi-parameter type classes?) There are probably even better ways to dynamically create objects with interfaces for communicating with OOP languages than changing the language itself. But there are also other functional languages that explicitly combine functional and OOP aspects. There is just more science in mainly functional languages than in non-functional OO languages.
OO languages cannot easily be compiled to other OO languages if they are in some way more "advanced". Usually they have features like stack protection, advanced debugging abilities, abstract and inspectable multi-threading, dynamic loading of objects from files or from the internet... Many of these features are impossible, or not easy, to realize with C or C++ as a compiler back end. The functional language LISP (which is 50 years old!) was, AFAIK, the first with a garbage collector. As a compiler back end, LISP used a hacked version of C, because plain C did not allow some of the things that assembler allows, e.g. proper tail calls or tables-next-to-code. C-- allows those.
Another aspect: imperative languages are intended to run on a specific architecture, e.g. C and C++ programs run only on the architectures they are programmed for. Java is more extreme: it runs on only a single architecture, a virtual one, which itself runs on others.
Functional languages are usually, by design, pretty architecture-independent: LISP was developed to be so architecture-independent that it could, in some distant future, be compiled to genetic code. Yes, as in programs running in living biological cells.
With LLVM's bytecode, functional languages will most likely be compiled to bytecode in the future, too. Most imperative languages will most likely still have the same inherited problems they have now from not abstracting far enough. Well, I'm not that sure about clang and D, but those two are not "the most" anyway.