Is it possible to optimize a compiled binary? - optimization

This is more of a curiosity I suppose, but I was wondering whether it is possible to apply compiler optimizations post-compilation. Are most optimization techniques highly-dependent on the IR, or can assembly be translated back and forth fairly easily?

This has been done, though I don't know of many standard tools that do it.
This paper describes an optimizer for Compaq Alpha processors that works after linking has already been done and some of the challenges they faced in writing it.
If you strain the definition a bit, you can use profile-guided optimization to instrument a binary and then rewrite it based on its observable behaviors with regards to cache misses, page faults, etc.
There's also been some work in dynamic translation, in which you run an existing binary in an interpreter and use standard dynamic compilation techniques to try to speed this up. Here's one paper that details this.
Hope this helps!

There's been some recent research interest in this space. Alex Aiken's STOKE project is doing exactly this with some pretty impressive results. In one example, their optimizer found a function that is twice as fast as gcc -O3 for the Montgomery Multiplication step in OpenSSL's RSA library. It applies these optimizations to already-compiled ELF binaries.
Here is a link to the paper.

Some compiler backends have a peephole optimizer which basically does just that, before it commits to the assembly that represents the IR, it has a little opportunity to optimize.
Basically you would want to do the same thing, from the binary, machine code to machine code. Not the same tool but the same kind of process, examine some size block of code and optimize.
Now the problem you will end up with though is for example you may have had some variables that were marked volatile in C so they are being very inefficiently used in the binary, the optimizer wont know the programmers desire there and could end up optimizing that out.
You could certainly take this back to IR and forward again, nothing to stop you from that.

Related

What is the purpose of Vulkan's new extension VK_KHR_vulkan_memory_model?

Khronos just released their new memory model extension, but there is yet to be an informal discussion, example implementation, etc. so I am confused about the basic details.
https://www.khronos.org/blog/vulkan-has-just-become-the-worlds-first-graphics-api-with-a-formal-memory-model.-so-what-is-a-memory-model-and-why-should-i-care
https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#memory-model
What do these new extensions try to solve exactly? Are they trying to solve synchronization problems at the language level (to say remove onerous mutexes in your C++ code), or is it a new and complex set of features to give you more control over how the GPU deals with sync internally?
(Speculative question) Would it be a good idea to learn and incorporate this new model in the general case or would this model only apply to certain multi-threaded patterns and potentially add overhead?
Most developers won't need to know about the memory model in detail, or use the extensions. In the same way that most C++ developers don't need to be intimately familiar with the C++ memory model (and this isn't just because of x86, it's because most programs don't need anything beyond using standard library mutexes appropriately).
But the memory model allows specifying Vulkan's memory coherence and synchronization primitives with a lot less ambiguity -- and in some cases, additional clarity and consistency. For the most part the definitions didn't actually change: code that was data-race-free before continues to be data-race-free. For a few developers doing advanced or fine-grained synchronization, the additional precision and clarity allows them to know exactly how to make their programs data-race-free without using expensive overly-strong synchronization.
Finally, in building the model the group found a few things that were poorly-designed or broken previously (many of them going all the way back into OpenGL). They've been able to now say precisely what those things do, whether or not they're still useful, and build replacements that do what was actually intended.
The extension advertises that these changes are available, but even more than that, once the extension is final instead of provisional, it will mean that the implementation has been validated to actually conform to the memory model.
Among other things, it enables the same kind of fine grained memory ordering guarantees for atomic operations that are described for C++ here. I would venture to say that many, perhaps even most, PC C++ developers don't really know much about this because the x86 architecture basically makes the memory ordering superfluous.
However, GPUs are not x86 architecture and compute operations executed massively parallel on GPU shader cores probably can, and in some cases must use explicit ordering guarantees to be valid when working against shared data sets.
This video is a good presentation on atomics and ordering as it applies to C++.

GCC optimization flags for matrix/vector operations

I am performing matrix operations using C. I would like to know what are the various compiler optimization flags to improve speed of execution of these matrix operations for double and int64 data - like Multiplication, Inverse, etc. I am not looking for hand optimized code, I just want to make the native code more faster using compiler flags and learn more about these flags.
The flags that I have found so far which improve matrix code.
-O3/O4
-funroll-loops
-ffast-math
First of all, I don't recommend using -ffast-math for the following reasons:
It has been proved that the performance actually degrades when
using this option in most (if not all) cases. So "fast math" is
not actually that fast.
This option breaks strict IEEE compliance on floating-point
operations which ultimately results in accumulation of computational
errors of unpredictable nature.
You may well get different results in different environments and the difference may be
substantial. The term environment (in this case) implies the combination of: hardware,
OS, compiler. Which means that the diversity of situations when you can get unexpected
results has exponential growth.
Another sad consequence is that programs which link against the
library built with this option might
expect correct (IEEE compliant) floating-point math, and this is
where their expectations break, but it will be very tough to figure
out why.
Finally, have a look at this article.
For the same reasons you should avoid -Ofast (as it includes the evil -ffast-math). Extract:
-Ofast
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math and the Fortran-specific -fno-protect-parens and -fstack-arrays.
There is no such flag as -O4. At least I'm not aware of that one, and there is no trace of it in the official GCC documentation. So the maximum in this regard is -O3 and you should be definitely using it, not only to optimize math, but in release builds in general.
-funroll-loops is a very good choice for math routines, especially involving vector/matrix operations where the size of the loop can be deduced at compile-time (and as a result unrolled by the compiler).
I can recommend 2 more flags: -march=native and -mfpmath=sse. Similarly to -O3, -march=native is good in general for release builds of any software and not only math intensive. -mfpmath=sse enables use of XMM registers in floating point instructions (instead of stack in x87 mode).
Furthermore, I'd like to say that it's a pity that you don't want to modify your code to get better performance as this is the main source of speedup for vector/matrix routines. Thanks to SIMD, SSE Intrinsics, and Vectorization, the heavy-linear-algebra code can be orders of magnitude faster than without them. However, proper application of these techniques requires in-depth knowledge of their internals and quite some time/effort to modify (actually rewrite) the code.
Nevertheless, there is one option that could be suitable in your case. GCC offers auto-vectorization which can be enabled by -ftree-vectorize, but it is unnecessary since you are using -O3 (because it includes -ftree-vectorize already). The point is that you should still help GCC a little bit to understand which code can be auto-vectorized. The modifications are usually minor (if needed at all), but you have to make yourself familiar with them. So see the Vectorizable Loops section in the link above.
Finally, I recommend you to look into Eigen, the C++ template-based library which has highly efficient implementation of most common linear algebra routines. It utilizes all the techniques mentioned here so far in a very clever way. The interface is purely object-oriented, neat, and pleasing to use. The object-oriented approach looks very relevant to linear algebra as it usually manipulates the pure objects such as matrices, vectors, quaternions, rotations, filters, and so on. As a result, when programming with Eigen, you never have to deal with such low level concepts (as SSE, Vectorization, etc.) yourself, but just enjoy solving your specific problem.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

How fast is VB .net compared to native code for arithmetic?

I need to write software that will do a lot of math. Mostly it will be matrix multiplication with integers to compute DCT. How much faster should I expect the code to run in native c as compared to VB .Net? Factor of 2, factor of 10, factor of 1000...? Has someone tried and collected statistics on this?
.Net code is JIT-compiled to native code before execution, so it should not be slower than native code in general. I'd expect a factor < 10.
Moreover, adaptive optimization techniques profile the code as it runs, gaining more information than a typical static compiler. So, the JIT can make more informed decisions for further optimizations
The .NET code is compiled into native code by the JIT compiler, so you get native code in both cases.
The difference is that the C code has somewhat less overhead around the calculations, so you should perhaps expect a performace difference of factor 2.
VB is 93.7% as fast as C. If you pick the right scenario.
Actually, if your 'native C' includes regular calls to malloc() and free(), any kind of Gargage Collected language like VB.Net is going to literally run circles around it. GC can be 10x faster than mallocs in your inner loops.
If you break down and use C, try to reuse structures that you declared just once instead of making new ones, to avoid this problem. This may be of benefit even in VB if your solution lends itself to it. However it will be harder to program and GC is very fast.
As far as bounds/overflow checks, if speed is important and testing has revealed they don't happen, and you're not risking life or millions from an error or abend, they are a waste of time. But if you can't get rid of them, your time is likely still more valuable in a language with which you can program more quickly.
If you expect serious size and usage, it pays to split the task with a controlling program and store the allocated 'task definitions' into a shared directory with a file per task solver, or a database. Then you can run a solver per processor (2 per HT CPU), or network computers. Be weary of queue structures - it's tough to atomicly 'Mark-Taken-And-Get-Data-If-Not-Taken'. You know how many task solvers you're going to start. I did this with an imaging utility I develop, it was much easier than expected, and it creamed the previous version. Plus if you use multiple processes with a properly dividable problem domain, you avoid the slight-to-significant programming burden of multithreading. Or convincing your coworkers that your culrly braces are in the right place. Peace.

Compiler optimizations: Where/how can I get a feel for what the payoff is for different optimizations?

In my independent study of various compiler books and web sites, I am learning about many different ways that a compiler can optimize the code that is being compiled, but I am having trouble figuring out how much of a benefit each optimization will tend to give.
How do most compiler writers go about deciding which optimizations to implement first? Or which optimizations are worth the effort or not worth the effort? I realize that this will vary between types of code and even individual programs, but I'm hoping that there is enough similarity between most programs to say, for instance, that one given technique will usually give you a better performance gain than another technique.
I found when implementing textbook compiler optimizations that some of them tended to reverse the improvements made by other optimizations. This entailed a lot of work trying to find the right balance between them.
So there really isn't a good answer to your question. Everything is a tradeoff. Many optimizations work well on one type of code, but are pessimizations for other types. It's like designing a house - if you make the kitchen bigger, the pantry gets smaller.
The real work in building an optimizer is trying out the various combinations, benchmarking the results, and, like a master chef, picking the right mix of ingredients.
Tongue in cheek:
Hubris
Benchmarks
Embarrassment
More seriously, it depends on your compiler's architecture and goals. Here's one person's experience...
Go for the "big payoffs":
native code generation
register allocation
instruction scheduling
Go for the remaining "low hanging fruit":
strength reduction
constant propagation
copy propagation
Keep bennchmarking.
Look at the output; fix anything that looks stupid.
It is usually the case that combining optimizations, or even repeating optimization passes, is more effective than you might expect. The benefit is more than the sum of the parts.
You may find that introduction of one optimization may necessitate another. For example, SSA with Briggs-Chaitin register allocation really benefits from copy propagation.
Historically, there are "algorithmical" optimizations from which the code should benefit in most of the cases, like loop unrolling (and compiler writers should implement those "documented" and "tested" optimizations first).
Then there are types of optimizations that could benefit from the type of processor used (like using SIMD instructions on modern CPUs).
See Compiler Optimizations on Wikipedia for a reference.
Finally, various type of optimizations could be tested profiling the code or doing accurate timing of repeated executions.
I'm not a compiler writer, but why not just incrementally optimize portions of your code, profiling all the while?
My optimization scheme usually goes:
1) make sure the program is working
2) find something to optimize
3) optimize it
4) compare the test results with what came out from 1; if they are different, then the optimization is actually a breaking change.
5) compare the timing difference
Incrementally, I'll get it faster.
I choose which portions to focus on by using a profiler. I'm not sure what extra information you'll garner by asking the compiler writers.
This really depends on what you are compiling. There is was a reasonably good discussion about this on the LLVM mailing list recently, it is of course somewhat specific to the optimizers they have available. They use abbreviations for a lot of their optimization passes, if you not familiar with any of acronyms they are tossing around you can look at their passes page for documentation. Ultimately you can spend years reading academic papers on this subject.
This is one of those topics where academic papers (ACM perhaps?) may be one of the better sources of up-to-date information. The best thing to do if you really want to know could be to create some code in unoptimized form and some in the form that the optimization would take (loops unrolled, etc) and actually figure out where the gains are likely to be using a compiler with optimizations turned off.
It is worth noting that in many cases, compiler writers will NOT spend much time, if any, on ensuring that their libraries are optimized. Benchmarks tend to de-emphasize or even ignore library differences, presumably because you can just use different libraries. For example, the permutation algorithms in GCC are asymptotically* less efficient than they could be when trying to permute complex data. This relates to incorrectly making deep copies during calls to swap functions. This will likely be corrected in most compilers with the introduction of rvalue references (part of the C++0x standard). Rewriting the STL to be much faster is surprisingly easy.
*This assumes the size of the class being permuted is variable. E.g. permutting a vector of vectors of ints would slow down if the vectors of ints were larger.
One that can give big speedups but is rarely done is to insert memory prefetch instructions. The trick is to figure out what memory the program will be wanting far enough in advance, never ask for the wrong memory and never overflow the D-cache.