Analyzing the speed-up of Oracle's HotSpot versus other compilation techniques

I'm currently working on a project that must involve research into JIT techniques. I'm a complete beginner when it comes to anything related to compilers, but I did some research and learned about Java's HotSpot VM. I was hoping to analyze the benefits (or downsides) of using HotSpot versus traditional compilers (for example, g++).
My initial idea was to create some sort of simple program that can be run through both compilers in order to compare compilation times but this brought up a number of questions:
From my understanding, Java source code is initially turned into bytecode by the javac compiler (creating .class files) and then, in turn, this bytecode can be run through HotSpot at runtime to execute the program. Given this, would it even be relevant to compare results with a traditional compiler that converts sources directly to machine code?
Another concern I'm facing is that the programs would be in different languages (e.g. C++ vs. Java). Although the functionality would be identical, could this skew the results when attempting to compare?
Moving on, if the above two points are not a problem, my main question is:
How can I actually go about benchmarking the speed-up in one method versus the other?
I did some brief research about this but all I was able to find were ways to measure the efficiency of the program itself, not the compilation technique used to run it. Is what I'm trying to do possible? Are there methods to actually analyze the speed up of one compiler over another?
Any help is appreciated!

How can I actually go about benchmarking the speed-up in one method versus the other?
You first need to consider what you actually intend to measure. In other words, saying "the speed-up" is not sufficiently rigorous.
Are we talking about CPU cycles spent compiling? Or wall-clock time from source code to running program? Or peak performance of a few critical methods in a micro-benchmark? Overall steady-state program performance? Speed of program initialization? ...
In the end you're comparing two systems that made quite different trade-offs. You can find a few roughly comparable benchmarks already mentioned in the comments, but they mostly represent a specific type of throughput-bound task and not large applications. It's not like you can find an application such as Firefox written both in C and Java with identical feature sets and comparable code quality. So any comparison you do will be incomplete, because you'll have to use some limited proxy measurement of how comparable two code bases are when you compare them.
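To make the distinction between wall-clock time and CPU time concrete, here is a minimal sketch in C, assuming a POSIX system that provides clock_gettime. The workload function and the names wall_seconds and cpu_seconds are hypothetical stand-ins, not part of any benchmark suite.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* assumption: POSIX clock_gettime is available */
#endif
#include <time.h>

/* Hypothetical workload standing in for the code under test. */
static unsigned long workload(unsigned long n)
{
    unsigned long sum = 0;
    for (unsigned long i = 0; i < n; i++)
        sum += i % 7;
    return sum;
}

/* Wall-clock seconds taken by one call, measured with the monotonic clock. */
double wall_seconds(unsigned long n, unsigned long *result)
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    *result = workload(n);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* CPU seconds consumed by the process during one call. */
double cpu_seconds(unsigned long n, unsigned long *result)
{
    clock_t start = clock();
    *result = workload(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

The two figures can differ considerably, for example when the process waits on I/O or is descheduled, which is one reason to decide up front which of them "the speed-up" refers to.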

static and dynamic code analysis

I found several questions about this topic, all of them with lots of references, but I still don't have a clear idea of it, because most of the references talk about concrete tools rather than the general concepts behind the analysis. So I have some questions:
About static analysis:
1. I would like a reference, or a summary, of which techniques are successful and most relevant nowadays.
2. What can they really do in terms of discovering bugs? Can we summarize that, or does it depend on the tool?
About symbolic execution:
1. Where does symbolic execution fit? I guess it depends on the approach,
but I would like to know whether it is dynamic analysis or a mix of static and dynamic analysis, if that can be determined.
I find it hard to tell the two techniques apart in the tools, even though I think I know the theoretical difference.
I'm currently working with C.
Thanks in advance.
I'm trying to give a short answer:
Static analysis looks at the syntactic structure of code and draws conclusions about the program's behavior. These conclusions are not always correct.
A typical example of static analysis is data flow analysis, where you compute sets like used, read, write for every statement. This will help to find e.g. uninitialized values.
You can also analyze the code for code patterns. This way, these tools can be used to check whether you are complying with a specific coding standard. A prominent example is MISRA. This coding standard is used for safety-critical systems and avoids problematic constructs in C. This way you can already say a lot about the robustness of your applications against memory leaks, dangling pointers, etc.
Dynamic analysis is not looking at the syntax only, but takes state information into account. In symbolic execution, you are adding assumptions about the possible values of all variables to the statements.
The most expensive and powerful method of dynamic analysis is model checking, where you really look at all possible execution states of the system. You can think of a model-checked system as a system that is tested with 100% coverage, but there are of course a lot of practical problems that prevent real systems from being checked that way.
These methods are very powerful, and you can gain a lot from the static code analysis tools especially when combined with a good coding standard.
A feature my software team found really impressive is e.g. that it will tell you in C++ when a class with virtual methods does not have a virtual destructor. Easy to check in fact, but really helpful.
The commercial tools are very expensive, but worth the money once you have learned how to use them. A typical problem in the beginning is that you will get a lot of false alarms and won't know where to look for the real problem.
Note that nowadays g++ has some of this built in, and that you can use something like PC-lint (a commercial product) or a free checker such as Cppcheck or Splint.
Sorry - this is already getting quite long...hope it's interesting.
The term "static analysis" means that the analysis does not actually run the code. "Dynamic analysis", on the other hand, runs the code and also requires some kind of real test input. That is the definition. Nothing more.
Static analysis employs various formal methods such as abstract interpretation, model checking, and symbolic execution. In general, abstract interpretation or model checking is suitable for software verification. Symbolic execution is more appropriate for bug finding.
Symbolic execution is categorized as static analysis. However, there is a hybrid method called concolic execution which uses both symbolic execution and dynamic testing.
Added for Zane's comment:
Maybe my explanation was a little confusing.
The difference between software verification and bug finding is whether the analysis is sound or not. For example, when we say a buffer overrun analyzer is sound, it means that the analyzer must report all possible buffer overruns. If the analyzer reports nothing, it proves the absence of buffer overruns in the target program. Because model checking is a method that guarantees soundness, it is mostly used for software verification.
On the other hand, symbolic execution, which is actively used by most of today's commercial static analyzers, does not guarantee soundness, since sound analysis inherently issues a great many false positives. For the purpose of bug finding, it is more important to reduce false positives even if some true positives are also lost.
In summary,
soundness: there are no false negatives
completeness: there are no false positives
software verification: soundness is more important than completeness
bug finding: completeness is more important than soundness
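The two definitions above can be written down as a small sketch in C. The encoding is a made-up assumption for illustration: each bit of a mask stands for one candidate defect site, real_bugs marks the sites that are truly defective, and reported marks the sites the analyzer flagged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: bit i set means "site i is (reported as) a bug". */

/* Sound in the sense used above: every real bug is reported,
   i.e. there are no false negatives. */
bool is_sound(uint32_t real_bugs, uint32_t reported)
{
    return (real_bugs & ~reported) == 0;
}

/* Complete in the sense used above: every report is a real bug,
   i.e. there are no false positives. */
bool is_complete(uint32_t real_bugs, uint32_t reported)
{
    return (reported & ~real_bugs) == 0;
}
```

With real bugs at sites 1 and 2 (mask 0x6), a report of 0xE is sound but not complete, while a report of 0x2 is complete but not sound.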

What is the difference between profilers that need recompiling and those that do not?

What are the differences between profilers that require recompiling the source code with debugging options (like gprof) and profilers that do not need recompiling (like Valgrind, OProfile, ...)?
I'm not familiar with the named profilers but there are two major approaches to profiling:
Instrumentation: this method usually requires recompiling (not always; for example, Java and .NET applications can be instrumented dynamically). With this method it is possible to measure exactly how often a routine is called, or how many iterations a certain loop makes.
Sampling: this method does not require any recompiling; it simply takes a snapshot of the stack at set intervals. This has proven to be an effective way to find bottlenecks.
There is some more info about the two strategies here
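As a rough illustration of what instrumentation means, here is a hand-rolled sketch in C. It is not how gprof or any particular profiler is implemented; the PROBE macro, the counter array, and the fib routine are all hypothetical names chosen for the example.

```c
#include <stddef.h>

/* Hypothetical hand-rolled instrumentation: one counter per routine. */
enum { PROBE_FIB, PROBE_COUNT };
static unsigned long call_counts[PROBE_COUNT];

/* The probe a compiler or tool would normally insert for you. */
#define PROBE(id) (call_counts[(id)]++)

/* An instrumented routine: every entry bumps its counter. */
unsigned long fib(unsigned n)
{
    PROBE(PROBE_FIB);
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Read back the exact number of calls recorded for a routine. */
unsigned long calls_to(int id)
{
    return call_counts[id];
}

/* Clear all counters before a measurement run. */
void reset_counts(void)
{
    for (size_t i = 0; i < PROBE_COUNT; i++)
        call_counts[i] = 0;
}
```

Because the probe runs on every call, the count is exact, but the probe itself adds overhead to each call. A sampling profiler avoids that overhead at the price of only estimating where time goes.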
I can speak on Valgrind and gprof at least.
The primary differences between using the two is basically what you already said. For gprof, you have to compile it specially to include the profiling code. When you then run your executable, the profiling code is executed (since it's built into your program), and a gmon.out file is created that can then be processed by gprof to show you runtime statistics of your program.
Valgrind is different in that you don't need to compile your program in any special way (except to add debug symbols if you want the output to be useful). Valgrind dynamically translates your program into an internal format that is run on a simulated CPU (although this is slow). This means that any program can be run through Valgrind without needing the special compilation.
Another important difference is that Valgrind can report a lot more information than gprof does, but that's not specifically related to using it.
Any profiling technique is going to need symbol table information, so that has to be requested in the compilation and linking.
Other than that, some profilers work by compiling-in calls to record-keeping routines at the beginning and possibly end of each function.
Those functions can attempt to record the time used by the function, and some record of where it was called from.
The timing figures of such a profiler are made inaccurate by the overhead of calling those recording functions.
Other profilers do not need to do that, instead relying on periodic samples of the call stack.
Such a profiler has lower overhead.
Its timing figures are made inaccurate by the statistical nature of its sampling.
Implicit in this is that accuracy of timing is necessary for locating "bottlenecks", which has never, to my knowledge, been shown to be true.
The method I've always used to get orders of magnitude speedup relies on insight into what the program is doing as it spends time, rather than on precisely how much time is spent. If you're interested in the statistical rationale, you could look here.

Matching a virtual machine design with its primary programming language

As background for a side project, I've been reading about different virtual machine designs, with the JVM of course getting the most press. I've also looked at BEAM (Erlang), GHC's RTS (kind of but not quite a VM) and some of the JavaScript implementations. Python also has a bytecode interpreter that I know exists, but have not read much about.
What I have not found is a good explanation of why particular virtual machine design choices are made for a particular language. I'm particularly interested in design choices that would fit with concurrent and/or very dynamic (Ruby, JavaScript, Lisp) languages.
Edit: In response to a comment asking for specificity, here is an example. The JVM uses a stack machine rather than a register machine, which was very controversial when Java was first introduced. It turned out that the engineers who designed the JVM had done so with platform portability in mind, and converting a stack machine back into a register machine was easier and more efficient than overcoming an impedance mismatch where there were too many or too few virtual registers.
Here's another example: for Haskell, the paper to look at is Implementing lazy functional languages on stock hardware: the Spineless Tagless G-machine. This is very different from any other type of VM I know about. And in point of fact, in GHC (the premier implementation of Haskell) it does not run live, but is used as an intermediate step in compilation. Peyton Jones lists no fewer than 8 other virtual machines that didn't work. I would like to understand why some VMs succeed where others fail.
I'll answer your question from a different tack: what is a VM? A VM is just a specification for an "interpreter" of a lower-level language than the source language. Here I'm using the black-box meaning of the word "interpreter". I don't care how a VM gets implemented (as a bytecode interpreter, a JIT compiler, whatever). When phrased that way, from a design point of view the VM isn't the interesting thing; the low-level language is.
The ideal VM language will do two things. One, it will make it easy to compile the source language into it. And two, it will also make it easy to interpret on the target platform(s) (where again the interpreter could be implemented very naively or could be some really sophisticated JIT like HotSpot or V8).
Obviously there's a tension between those two desirable properties, but they do more or less form two end points on a line through the design space of all possible VMs. (Or perhaps some more complicated shape than a line, because this isn't a flat Euclidean space, but you get the idea.) If you build your VM language far away from that line then it won't be very useful. That's what constrains VM design: putting it somewhere on that ideal line.
That line is also why high level VMs tend to be very language specific while low level VMs are more language agnostic but don't provide many services. A high level VM is by its nature close to the source language which makes it far from other, different source languages. A low level VM is by its nature close to the target platform thus close to the platform end of the ideal lines for many languages but that low level VM will also be pretty far from the "easy to compile to" end of the ideal line of most source languages.
Now, more broadly, conceptually any compiler can be seen as a series of transformations from the source language to intermediate forms that themselves can be seen as languages for VMs. VMs for the intermediate languages may never be built, but they could be. A compiler eventually emits the final form. And that final form will itself be a language for a VM. We might call that VM "JVM", "V8"...or we might call that VM "x86", "ARM", etc.
Hope that helps.
One of the techniques of deriving a VM is to just go down the compilation chain, transforming your source language into more and more low level intermediate languages. Once you spot a low level enough language suitable for a flat representation (i.e., the one which can be serialised into a sequence of "instructions"), this is pretty much your VM. And your VM interpreter or JIT compiler would just continue your transformations chain from the point you selected for a serialisation.
Some serialisation techniques are very common - e.g., using a pseudo-stack representation for expression trees (like in .NET CLR, which is not a "real" stack machine at all). Otherwise you may want to use an SSA-form for serialisation, as in LLVM, or simply a 3-address VM with an infinite number of registers (as in Dalvik). It does not really matter which way you take, since it is only a serialisation and it would be de-serialised later to carry on with your normal way of compilation.
It is a slightly different story if you intend to interpret your VM code immediately instead of compiling it. There is currently no consensus on which kind of VM is better suited for interpretation. Both stack-based (or, I'd dare to say, Forth-based) VMs and register-based VMs have proven to be efficient.
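To make the "flat sequence of instructions" idea concrete, here is a minimal sketch of a stack-based interpreter in C. The three-opcode instruction set, the struct layout, and the fixed 16-slot stack are assumptions invented for this example and do not correspond to any real VM.

```c
#include <stddef.h>

/* Hypothetical three-opcode instruction set for illustration. */
enum opcode { OP_PUSH, OP_ADD, OP_MUL };

struct instr {
    enum opcode op;
    int operand;   /* used by OP_PUSH only */
};

/* Interpret a flat instruction sequence on an operand stack.
   Assumes well-formed code that never needs more than 16 slots. */
int run(const struct instr *code, size_t len)
{
    int stack[16];
    size_t sp = 0;

    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc].op) {
        case OP_PUSH:
            stack[sp++] = code[pc].operand;
            break;
        case OP_ADD:            /* pop two values, push their sum */
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:            /* pop two values, push their product */
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        }
    }
    return stack[sp - 1];
}

/* The expression tree (2 + 3) * 4 serialised in postfix order. */
const struct instr example_program[] = {
    { OP_PUSH, 2 }, { OP_PUSH, 3 }, { OP_ADD, 0 },
    { OP_PUSH, 4 }, { OP_MUL, 0 },
};
```

Running the five-instruction example program yields 20. A register-based design would instead name its operands explicitly in each instruction, trading longer instructions for fewer of them.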
I found this book to be helpful. It discusses many of the points you are asking about. (note I'm not in any way affiliated with Amazon, nor am I promoting Amazon; just was the easiest place to link from).
http://www.amazon.com/dp/1852339691/

assembly language and optimization [closed]

How can programming in assembly help in achieving optimization?
The most likely way programming in assembly can improve your code is by improving you: teaching you more about what is happening at a low level and getting the discipline of optimization can help you make good decisions in higher-level languages.
As far as actually helping one program: as others have noted it's rarely worth it. It's just possible you can use it as a kind of advanced profile-driven optimization: try many variations until you find one that's best on your particular problem.
To start with this: write a program in C or C++ or whatever compiled language you normally use, fire up your debugger, disassemble a small but nontrivial function, and have a think about why the compiler did what it did. Then try writing a small bit of inline assembler yourself. On modern systems assembly is most easily embedded within C rather than written from scratch.
Or alternatively, get a teeny machine like a PIC and make it flash a LED...
These days, you have to be very good at assembly to beat the compiler.
I can do it any day of the week, but only by viewing the compiler's output first.
And then, if it gains more than a couple of percentage points I'd be surprised.
These days, I only program in assembly when I'm doing something the compiler can't do.
In principle, you can write highly-optimized code in assembly because the compiler is limited to specific, general-purpose optimizations that should apply to many programs, while you can be creative and use your knowledge of this particular program.
To take a simple example, back when I was new to this business compilers were very limited in their ability to optimize register usage. You know that to perform any sort of arithmetic or logical operation, the CPU must generally load one of the values into a register, then perform the operation on the other, then save the result? Like to add two numbers together -- and I'll use a pseudo-assembler here because I don't know what assembly languages you know and I've forgotten most of the details myself -- you'd write something like this:
LOAD A,value1
ADD A,value2
STORE A,destination
Compilers used to generate the loads for every operation. So if your C program said:
x=x+y;
z=z+x;
The compiler would generate something like:
LOAD A,x
ADD A,y
STORE A,x
LOAD A,z
ADD A,x
STORE A,z
But a human could observe that by the time we get to the second statement, register A already contains x, and addition is commutative, so we could optimize this to:
LOAD A,x
ADD A,y
STORE A,x
ADD A,z
STORE A,z
Et cetera. One could go through all sorts of tiny micro-optimizations like this. I used to do that all the time back when I was young and the world was green.
But over the years compilers have gotten much smarter, and CPUs have gotten more powerful so the micro-optimizations don't matter as much.
Thus, I haven't written any assembly language code in, wow, probably 15 years. I used to read the assembly generated by the compiler when debugging, sometimes it would give a clue to a subtle problem, but I haven't done that in years now either.
I don't think compilers are even written in assembly any more. Instead, you write the first draft of the compiler in a high level language on some other computer, i.e. you write a cross-compiler to get yourself off the ground.
I suspect the only real use of assembly today is for extremely constrained environments, embedded systems and that sort of thing; and for programs that have to deal intimately with the hardware, like device drivers.
I'd be interested to hear if there are any assembly programmers on this forum who care to tell us why they are assembly programmers.
Programming in assembly won't, in and of itself, optimize your code. The main thing about assembly is that it allows you to have very low-level access and to choose exactly what instructions the processor executes.
Since you won't have some compiler generating the assembly for you, you can perform code optimizations when you write the program yourself, if you know how.
So, you think you are smarter than the gcc optimizing compiler?
If not, then fughed aboud it (learning assembly for the sake of getting better at optimization). That would be akin to learning Scheme language for the sake of getting better at recursion :)
In general, the compiler will do a fairly good job at generating optimal code. There are, however, cases where writing your own assembly can result in even more optimized (in terms of space and/or speed) code.
Typically, this happens when there is something that you know about the target system that the compiler doesn't. Compilers are designed to work on a variety of systems; if you want to take advantage of something unique to your target system, sometimes you have to go in and do it yourself. Here's an example. A few months ago, I was writing some code for a MIPS-based embedded system. There are many different types of MIPS CPUs, and some support certain opcodes that others do not. My compiler would generate MIPS code using the set of assembly operations that all MIPS architectures support. However, I knew that my chip could do more. I had a subroutine that needed to count the number of leading zeroes in a 32-bit number. The compiler synthesized this into a loop that took about 10 lines of assembly to do. I re-wrote it in one line by using the CLZ opcode that was designed to do just this. I knew that my chip supported the opcode but the compiler didn't. Admittedly, situations like this aren't very common; when they do pop up, however, it's nice to have enough of a background in assembly to take advantage of them.
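For readers who want to see the two variants side by side, here is a sketch in C. It assumes a GCC- or Clang-compatible compiler (for the __builtin_clz builtin) and a 32-bit unsigned int; the function names are hypothetical. It is not the code from the MIPS project described above, and whether the builtin compiles down to a single CLZ-style instruction depends on the target and compiler flags.

```c
#include <stdint.h>

/* Portable fallback: shift left until the top bit is set,
   counting the shifts. Roughly the kind of loop a compiler
   generates when it cannot assume a CLZ instruction. */
unsigned clz32_loop(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* GCC/Clang builtin (assumes unsigned int is 32 bits wide).
   The builtin's result is undefined for 0, so guard that case. */
unsigned clz32_builtin(uint32_t x)
{
    return x == 0 ? 32u : (unsigned)__builtin_clz(x);
}
```

Both functions return the same values; the difference is only in what the compiler is allowed to emit for them.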
Sometimes one will need to perform a task which maps particularly well onto some CPU instructions, but does not fit well into any high-level-language constructs. For example, on many processors one may easily perform extended-precision arithmetic using something like:
add r0,r4
addc r1,r5
addc r2,r6
addc r3,r7
This will regard r3:r2:r1:r0 and r7:r6:r5:r4 as numbers four words long, adding the second to the first. Four nice easy instructions, and anyone who understands assembly would know what they do. I know of no way to perform the same task in C without it not only generating bigger and slower object code, but also being an incomprehensible mess of source code.
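For comparison, a portable C version of the same four-word addition might look like the sketch below. The function name and the choice of 32-bit words stored least significant first are assumptions made for this example. Standard C has no carry flag, so the carry has to be reconstructed from comparisons, and nothing guarantees that a compiler turns this back into add-with-carry instructions.

```c
#include <stdint.h>

/* Add the four-word number b into a (least significant word first),
   propagating the carry by hand. Returns the carry out of the top word. */
unsigned add_4words(uint32_t a[4], const uint32_t b[4])
{
    unsigned carry = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t sum = a[i] + b[i];
        unsigned c1 = sum < a[i];   /* a[i] + b[i] wrapped around */
        sum += carry;
        unsigned c2 = sum < carry;  /* adding the carry-in wrapped around */
        a[i] = sum;
        carry = c1 | c2;            /* at most one of the two can be set */
    }
    return carry;
}
```

Each loop iteration does the work of one add or addc instruction, which gives a feel for the overhead the answer is describing.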
A somewhat more extreme but specialized real-world example: Given two arrays array1[0..63] and array2[0..63], compute array1[0]*array2[0] + array1[1]*array2[1] + array1[2]*array2[2] ... + array1[63]*array2[63]. On a DSP I used, the computation could be done in machine code in about 75 machine cycles (about 67 of which are a repeating MAC instruction). There's no way C code could come anywhere close.
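The computation itself is a plain multiply-accumulate loop. A minimal C sketch follows; the 16-bit element type and the 32-bit accumulator are assumptions, since the answer does not say what the DSP used. Whether a given compiler maps the loop body onto a hardware MAC instruction with a zero-overhead repeat is exactly the point in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of element-wise products of two arrays of length n.
   Each iteration is one multiply-accumulate step. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

For the case in the answer, the call would be dot_product(array1, array2, 64).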
About the only time I can think of using assembly language for optimizing code is when you need something very specific, like a GPIO on a microcontroller that must toggle between high and low exactly every 9 clock cycles. That's too short a time to manage with an interrupt, and higher-level language compilers don't normally offer this kind of control over the instruction stream.
Typically you wouldn't program in assembly. You would program in C, and then look at the generated assembly to see which optimizations the C compiler did (or did not) make automatically. Adjusting your C code (to allow for better vectorization, for example) will let the compiler rearrange code better, which will give you optimized assembly.
You are unlikely to beat the compiler at writing assembly code. It is more likely that knowing how typical tasks translate to assembly will help you write better high-level language code.
Typically you do not resort to assembly for optimization purposes. Where that is possible, usually someone will already have provided the essential code ready for you to call, for example in the form of a linear algebra library.
Likewise, assembly offers direct access to the processor (e.g. for atomicity, time measurement, I/O), but the important kinds of access will already have been made available to your high-level language.
Compilers do a good job of generating assembler.
However, there's a bad reason why hand-written assembler is faster. Since it's harder to write, you write less of it.
It would be nice if programmers could discipline themselves to get the same job done in minimal code, regardless of language.
When writing assembly, or even just the raw bytes the assembler outputs, you can write programs that use hardware-specific features or do something that is otherwise very precisely specified.
There might be really high benefits if your program runs the optimized part far more often than it does anything else. Always set up benchmarks before attempting optimizations.
The downside is that your hand-written assembly works on fewer kinds of hardware. It may even end up limited to a specific hardware model and revision!
It's rare that you can or need to write assembly routines, because commonly written software must work on almost every piece of hardware you can find, and your kitten.
There's one interesting application if you know assembly. You can then write programs that produce assembly routines. Though it's mostly only fun unless you keep it really small so you can port it easily.
Read the Graphics Programming Black Book by Michael Abrash
In most modern applications, it can't to any significant degree.
Inter-Process Communication Affects Application Response Time explains why algorithms are unlikely to be bottlenecks. (But always profile - never guess.)
In general, programming in assembly will increase time-to-market, bug density, and maintenance costs. Instead, strive for simplicity and readability in your code.
As poolie mentioned, the main benefit of learning assembly today is a deeper understanding of software and hardware. From that perspective, there's quite a bit of information on Steve Gibson's site.
If you understand why there is sometimes a need to write asm, you will appreciate both its strengths and its costs (headaches for you).

How do you organize code in embedded projects?

Highly embedded (limited code and ram size) projects pose unique challenges for code organization.
I have seen quite a few projects with no organization at all. (Mostly by hardware engineers who, in my experience are not typically concerned with non-functional aspects of code.)
However, I have been trying to organize my code accordingly:
hardware specific (drivers, initialization)
application specific (not likely to be reused)
reusable, hardware independent
For each module I try to keep the purpose to one of these three types.
Due to the limited size of embedded projects and the emphasis on performance, it is often hard to keep this organization.
For some context, my current project is a limited DSP application on an MSP430 with 8k flash and 256 bytes of RAM.
I've written and maintained multiple embedded products (30+ and counting) on a variety of target micros, including MSP430s. The "rules of thumb" I have been most successful with are:
Try to modularize generic concepts as much as possible (e.g. separate driver code from application code). -- It makes for easier maintenance and reuse/porting of a project to another target micro in the future.
DO NOT start by worrying about optimized code at the very beginning. Try to solve the domain's problem first and optimize second. -- Your target micro can handle a lot more "stuff" than you might expect.
Work to ensure readability. Although most embedded projects seem to have short development-cycles, the projects often live longer than you might expect and another developer will undoubtedly have to work with your code.
I've worked on 8-bit PIC processors with similar limitations.
One restriction you don't have is on how many comments you write or what you choose to name your methods, variables, etc. Take advantage. Speed and size constraints do sometimes trump organization, but you can always explain.
Another tip is to break up a logical source file into even more pieces than you need, then bind them by #includeing them in a compilation unit. This allows you to have lots of reusable code (even one routine per file) but combine in whatever order you need. This is useful e.g. when trying to meet compilation unit size restrictions, or to pick and choose which common subroutines you need on the next project.
I try to organize it as if I had unlimited RAM and ROM, and it usually works out fine. As mentioned elsewhere, do not try to optimize it until you absolutely need to.
If you can get a pin-compatible processor that has more resources, it's better to get it working on that, concentrating on good structure and layout, then optimize for size later when you understand the code better.
Except under exceptional circumstances (see note), the organisation of your code will have no impact on the final product. (contents of the code are obviously a different matter)
So with that in mind you should organise your code as you would any other project.
With that said, the following are fairly typical:
If this is a processor that you've worked on before, or will be working on in the future, you will usually want to keep a dedicated hardware abstraction layer (HAL) that can be shared between projects. Typically this module would contain items like routines for managing any UARTs, timers, etc.
Usually it's reasonable to maintain a set of platform-specific code for initialisation and setup that performs all of the configuration and initialisation up to the point where your executive takes over and runs your application. It will also include platform-specific HAL routines.
The executive/application is probably maintained as a separate module. All of the hardware specific code should be hidden in the hal (as mentioned above).
By splitting your code up like this you also have the option of compiling and running your application as a simulation, on a completely different platform, just by replacing the hardware specific code with routines that mimic the hardware.
This can be good for unit testing and for debugging any algorithmic problems you might have.
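As a rough sketch of that split, the C below hides an LED behind a small interface and supplies a simulated backend for host-side testing. All names (led_hal, sim_led, blink) are hypothetical, and the use of a function-pointer table is just one option chosen for the example.

```c
#include <stdbool.h>

/* Hypothetical HAL interface: the application only sees this. */
struct led_hal {
    void (*set)(bool on);
};

/* Simulated backend for host-side testing: records state instead of
   writing to a port register. */
static bool sim_led_state;
static unsigned sim_led_writes;

static void sim_led_set(bool on)
{
    sim_led_state = on;
    sim_led_writes++;
}

const struct led_hal sim_led = { sim_led_set };

/* Application code: hardware independent, depends only on the interface. */
void blink(const struct led_hal *led, unsigned times)
{
    for (unsigned i = 0; i < times; i++) {
        led->set(true);
        led->set(false);
    }
}

/* Inspection helpers used by tests running on the host. */
bool sim_led_is_on(void) { return sim_led_state; }
unsigned sim_led_write_count(void) { return sim_led_writes; }
```

On a very small target the indirect call may cost more than you want to pay; an alternative with the same effect is to declare the HAL functions in a header and link either the real or the simulated implementation file.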
Note: exceptional circumstances are those that might be imposed by unusual compiler restrictions, e.g. I've come across some compilers that expect all interrupt service routines to be compiled within a single object file.
I've worked with some sensors like the Tmote Sky. I too have seen poor organization, and I have to admit I have contributed to it. Anyway, I'd say that some confusion is unavoidable, because loading too many modules or too many parts of a program will (imho) also eat resources, so try to be aware of the threshold between organization and usability on low-resource devices.
Obviously this doesn't mean letting chaos reign, but, for example, take a look at the organization of the TinyOS source code and applications; it gives an idea of what I'm trying to say.
Although it is a bit painful, one organization technique that is somewhat common with embedded C libraries is to split every single function and variable into a separate C source file, and then aggregate the resulting collection of object (.o) files into a library file.
The motivation for doing this is that for most normal linkers the unit of linkage is an object file: for every object file you either get the whole thing or none of it. Since there is a 1-1 relationship between C files and object files, putting each symbol in its own C file gives each one its own object file. This in turn lets the linker pull in only that subset of functions and variables that are actually used.
This sort of game doesn't help at all for headers; they can happily be left as single files.