Intel Assembler optimization

I'm currently trying to optimize the code emitted from a home-made compiler, for a home-made language.
I've tried out Intel VTune to see where the bottlenecks are: http://www.imada.sdu.dk/~sorenh07/misc/vtune-assembly-optimization.png
I find it very impressive that a "subl" instruction is responsible for over 38% of the clockticks in a program that runs for 30-90 seconds! Can anybody explain why?
The "optimization report" feature in VTune apparently doesn't exist for programs not compiled with icc. Is there a program that suggests optimizations for assembler code (that is, code not coming from a high-level language)?

My guess is that it's the idivl instruction that's actually taking up the 38%... division taking longer makes a bit more sense than subtraction, no?
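A plausible explanation is sampling skid: event-based profilers often charge the clockticks of a long-latency instruction to the instruction that retires just after it. Here is a minimal sketch (the function and variable names are invented for illustration) of the kind of source where that effect would make a following subl look expensive:

    // Hypothetical inner loop: the compiler emits an idivl for the division
    // and a subl for the subtraction that follows it.
    int hot_loop(int n, int divisor) {
        int acc = 0;
        for (int i = 1; i <= n; ++i) {
            int q = i / divisor;   // idivl: tens of cycles of latency
            acc -= q;              // subl: one cycle, but samples often land here
        }
        return acc;
    }

When the sampling interrupt fires, the reported instruction pointer tends to sit on the instruction after the expensive one, so the cheap subl gets credited with time that really belongs to the preceding idivl.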

Related

Running cachegrind on OpenJDK JVM

I want to use cachegrind to do some performance profiling on the OpenJDK JVM. (BTW, if this is Not A Good Idea, I would like to understand why.)
The problem is that it keeps tripping assertions in the JVM. So what can I do to get a run with cachegrind? Otherwise, please tell me why this won't work, and suggest an alternative to cachegrind if you can. (Note that I have looked at and used perf; it's just that I was curious how different a tool like cachegrind/callgrind would be in terms of results.)
Valgrind and the tools based on it are typically not very good for performance profiling. Valgrind is a kind of virtual machine that does not execute the original code directly but rather translates it. Programs often run much slower under Valgrind, and this can change how the application actually executes.
Also, Valgrind does not work well with dynamic and self-modifying code. With the JVM's JIT compilation, this can result in the random assertion failures and JVM crashes that you are apparently observing. Valgrind is known to be much more stable when Java is run with the JIT disabled, i.e. with -Xint. But in -Xint mode profiling does not make much sense, since it does not reflect the real application performance.
perf is the preferable way to do this kind of measurement: it relies on hardware counters and does not modify the executed code. For deeper performance analysis there is Intel VTune Amplifier. It is not free, though a trial version is available.
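To make the "hardware counters, no code modification" point concrete, here is a minimal Linux sketch of the mechanism perf is built on, the perf_event_open(2) syscall; error handling is omitted and the measured loop is just a placeholder:

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstring>
    #include <cstdint>
    #include <cstdio>

    // Thin wrapper: glibc provides no perf_event_open() function, only the syscall.
    static long perf_event_open(perf_event_attr *attr, pid_t pid, int cpu,
                                int group_fd, unsigned long flags) {
        return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main() {
        perf_event_attr attr;
        std::memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;   // retired-instructions counter
        attr.disabled = 1;
        attr.exclude_kernel = 1;

        int fd = (int)perf_event_open(&attr, 0 /* this process */, -1, -1, 0);

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile std::uint64_t sum = 0;             // the code being measured
        for (int i = 0; i < 1000000; ++i) sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        std::uint64_t count = 0;
        read(fd, &count, sizeof(count));
        std::printf("instructions: %llu\n", (unsigned long long)count);
        close(fd);
        return 0;
    }

The code under test runs unmodified at full speed; the CPU's performance monitoring unit does the counting, which is why perf's numbers reflect real execution far better than a simulator such as Valgrind.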

Valgrind vs. Linux perf correlation

Suppose that I choose the perf events instructions, LLC-load-misses and LLC-store-misses. Suppose further that I test a program prog while varying its input. Is Valgrind supposed to give me the "same" functional results for the same input and the same counter? That is, if one value in perf goes up, should the one in Valgrind always go up too? Is there any impact from Valgrind being a simulation that I should be aware of while profiling my code?
EDIT: BTW, before people grill me for not experimenting myself, I have to say that I (kinda) have; the problem is that I have a Sandy Bridge processor, and perf has a "bug" that prevents me from measuring LLC-* events. There is a patch, but I don't feel like recompiling my kernel...
Well, Cachegrind is a cache simulator. Even though it tries to mimic some of your hardware's characteristics (cache size, associativity, etc.), it does not model every single feature and behavior of your system. Therefore you might see some differences in some cases.
For example, Valgrind's doc states that "Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004". Sandy Bridge processors first appeared in 2011, and you can guess that branch predictors have improved quite a lot since 2004.
That being said, Valgrind is still a wonderful tool to have in your toolbox.
What's the problem with perf's LLC events on Sandy Bridge processors? I use these events every day at work on my Sandy Bridge laptop and they work as expected (Arch Linux 64-bit, Linux 3.6).

Are there modern compilers for high level languages on simple processors which produce self-modifying code?

Back in the days before caches and branch prediction, it was relatively common, if not encouraged, to write self-modifying code for certain kinds of optimizations. It was probably most common in games and demos written in assembler in the eras from 8-bit up to early 32-bit machines such as the Amiga.
I'm not sure if any compilers from those days emitted self-modifying assembler or machine code.
What I'm wondering is whether there are any current/modern compilers that do. Obviously it would be useless or too difficult on powerful processors with caches.
But what about the very many simple processors used in embedded systems? Is self-modifying code considered a viable optimization strategy by any modern compiler for any simple / 8-bit / embedded processor?
There is a question with a similar title, "Is there any self-improving compiler around?", but note that it's not about the same subject:
Please note that I am talking about a compiler that improves itself - not a compiler that improves the code it compiles.
Practically all embedded systems today use flash ROM. I believe the Amiga and similar machines were RAM-based systems. The only way "self-modification" exists in embedded systems is in boot loaders that re-program certain parts of the flash depending on which program and/or functionality should be used.
Apart from that, it doesn't make sense for the program to modify itself at runtime. Running code from RAM is generally discouraged, for safety reasons (the potential for accidental modification caused by bugs) and for electrical reasons (RAM is volatile and far more sensitive to EMC than flash).

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, although knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail about what to do about this, as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents); if you have a large amount of data transfer you want to overlap the compute and the I/O (there is a sketch of this after the list)
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput)
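To make the cudaEvents timing and the transfer/compute overlap concrete, here is a minimal sketch; the kernel, the sizes and the stream layout are placeholders, and error checking is omitted:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial placeholder kernel; consecutive threads touch consecutive floats,
    // so the loads and stores are coalesced.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_a, *h_b, *d_a, *d_b;
        cudaMallocHost((void**)&h_a, bytes);   // pinned memory, needed for async copies
        cudaMallocHost((void**)&h_b, bytes);
        cudaMalloc((void**)&d_a, bytes);
        cudaMalloc((void**)&d_b, bytes);

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        // While the kernel works on chunk A in stream s0, the copy of chunk B
        // proceeds in stream s1, overlapping I/O with compute.
        cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
        scale<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n, 2.0f);
        cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // GPU-side elapsed time in ms
        std::printf("elapsed: %.3f ms\n", ms);
        return 0;
    }

Comparing the timed span with and without the overlap (or against the bandwidth you expect from the transfer sizes) tells you whether the copies are actually hidden behind the compute.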
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The nVidia CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on a vanilla processor, and either take stackshots of it, or use a profiler such as OProfile or RotateRight/Zoom that can give you equivalent information.
Run it on a CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.

Optimisation, Compilers and Their Effects

(i) If a program is optimised for one CPU class (e.g. multi-core Core i7) by compiling the code on that same CPU, will its performance be sub-optimal on CPUs from older generations (e.g. Pentium 4)? Can optimising for one CPU prove harmful for performance on other CPUs?
(ii) For optimisation, compilers may use x86 extensions (like SSE4) which are not available in older CPUs. So, is there a fall-back to some non-extension-based routine on older CPUs?
(iii) Is the Intel C++ Compiler more optimising than the Visual C++ Compiler or GCC?
(iv) Will a truly multi-core threaded application perform efficiently on older CPUs (like Pentium III or 4)?
Compiling on a platform does not mean optimizing for that platform. (Maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used; e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms; e.g. when optimizing for i7, typically a small improvement on i7 will not be chosen if it means a major hit for another common platform.
It also depends on the performance differences between the instruction sets - my impression is that they've become much smaller in the last decade (but I haven't delved too deep lately - I might be wrong for the latest generations). Also consider that optimizations make a notable difference in only a few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
the "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.
(i) It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
(ii) Probably not.
(iii) Impossible to generalise. You have to test your code and come to your own conclusions.
(iv) Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation), some clever SOer will find a counter-argument, and for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y, first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics; these will check internally for the best performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction intrinsics here (those are specific to an instruction set), but intrinsic functions.
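One way to get the "compile two versions and pick the right one" behaviour described above is an explicit runtime dispatch on CPUID. Here is a minimal sketch using GCC/Clang's __builtin_cpu_supports; the function names and the scalar fallback are invented for illustration, and depending on your compiler version the SSE4.2 path may need to live in a translation unit built with -msse4.2 instead of relying on the target attribute:

    #include <nmmintrin.h>   // SSE4.2 intrinsics, only used in the guarded path
    #include <cstddef>

    // Hypothetical fast path: uses the SSE4.2 CRC32 instruction.
    __attribute__((target("sse4.2")))
    unsigned crc_sse42(const unsigned char *p, std::size_t n) {
        unsigned crc = 0;
        for (std::size_t i = 0; i < n; ++i)
            crc = _mm_crc32_u8(crc, p[i]);
        return crc;
    }

    // Plain fallback that runs on any x86 CPU (not a real CRC, just a placeholder).
    unsigned crc_fallback(const unsigned char *p, std::size_t n) {
        unsigned h = 0;
        for (std::size_t i = 0; i < n; ++i)
            h = h * 31 + p[i];
        return h;
    }

    unsigned checksum(const unsigned char *p, std::size_t n) {
        // The CPU features are queried once by the runtime support; the branch is cheap.
        if (__builtin_cpu_supports("sse4.2"))
            return crc_sse42(p, n);
        return crc_fallback(p, n);
    }

The same pattern scales up to loading one of two shared libraries at startup, which is the approach hinted at in the answer; the important point is that the unsupported instructions are never reached on an older CPU, so the program degrades gracefully instead of crashing.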
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though: Intel compilers are known not to compile well for running on anything other than Intel (e.g. intrinsic functions may not recognize the instruction set of an AMD or VIA CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes, it may perform as well on an older CPU. Multi-core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g. memory bandwidth, NUMA, chip-to-chip bus) and on differences in multi-core communication (e.g. cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare the multi-processing efficiency of newer and older CPUs, but that is not what you are asking, I believe. So on the whole, an MP program made for newer CPUs should not use the MP aspects of older CPUs less efficiently. In other words, tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to use a specific CPU more efficiently (e.g. a shared cache may permit you to use an algorithm that exchanges more data between working threads, but this algorithm will die on a system with no shared cache, full bus locks and low memory latency/bandwidth), but that involves a lot more than just MP-related tweaks.
(1) Not only is it possible, but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock, the newer processor was slower for the then-current mainstream applications and operating systems (including Linux). The 32-to-64-bit transition is not helping; more cores and lower clock speeds are making it even worse. And this is true going backwards as well, for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, but most of the time you don't. There are new instructions, yes, and supporting them would probably mean trapping on an undefined instruction and having a software emulation of that instruction, which would be horribly slow, and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions, but more than that, the bulk of the optimization I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So if you arrange them to be fast on one generation of processor, they will be slower on another, because within the x86 family the cores change too much. AMD had a good run there for a while, as they would make the same code just run faster instead of trying to invent new processors that would eventually be faster once the software caught up. That is no longer true; both AMD and Intel are struggling just to keep chips running without crashing.
(3) Generally, yes. For example, gcc is a horrible compiler: one size fits all fits no one well, and it can never and will never be any good at optimizing. For example, code built with gcc 4.x is slower than code built with gcc 3.x for the same processor (yes, all of this is subjective; it all depends on the specific application being compiled). The in-house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price, though? That is the question.
In general, because of the horrible new programming languages and gobs of memory, storage and layers of caching, software engineering skills are at an all-time low. Which means the pool of engineers capable of making a good compiler, much less a good optimizing compiler, decreases with time; this has been going on for at least 10 years. So even the in-house compilers are degrading with time, or companies just have their employees work on and contribute to the open source tools instead of maintaining an in-house tool. Also the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope will just run without crashing, rather than ones we try hard to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line: gcc has single-handedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. The operating system you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that binaries optimized for your Pentium III ran slower on your Pentium 4 and vice versa, code written to work well on multi-core processors will run slower on single-core processors than the same application optimized for a single-core processor.
The root of the problem is that the x86 instruction set is dreadful. So many far superior instruction sets have come along that do not require hardware tricks to make them faster every generation, but the Wintel machine created two monopolies and the others couldn't penetrate the market. My friends keep reminding me that these x86 machines are microcoded, so you don't really see the instruction set inside. Which angers me even more: the horrible ISA is just an interpretation layer. It is kind of like using Java. The problems you have outlined in your questions will continue as long as Intel stays on top; if the replacement does not become the monopoly, then we will be stuck forever in the Java model, where you are on one side or the other of a common denominator: either you emulate the common platform on your specific hardware, or you write apps and compile to the common platform.