Measuring the performance impact of assembly on gem5

I'm trying to measure the performance impact of replacing key sections of a program's C code with hand-written assembly, and the resulting execution time is about 40% of the original when run on a TimingSimpleCPU in SE mode.
So I'm wondering whether a change of this size is to be expected, or whether it could be caused by the CPU choice or some other factor such as the compiler's optimization level.
I know the question is a bit broad, but I just need a general idea of where to start looking for an explanation.
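One thing worth checking first is that both versions are measured over exactly the same region and that the C baseline is compiled at the optimization level you actually care about (an -O0 baseline can make even straightforward hand-written assembly look dramatically faster). A minimal sketch, assuming a recent gem5 with the m5 ops library available; the header path and the kernel_c/kernel_asm names are placeholders for your own code:

    /* Build two binaries from this file, e.g. with and without -DUSE_ASM,
       using the same -O2/-O3 flags for the surrounding C code, and link
       against gem5's m5 library so the pseudo-ops work in SE mode. */
    #include <gem5/m5ops.h>

    extern void kernel_c(void);    /* hypothetical C version of the hot section   */
    extern void kernel_asm(void);  /* hypothetical hand-written assembly version  */

    int main(void)
    {
        m5_reset_stats(0, 0);      /* start a fresh statistics window             */
    #ifdef USE_ASM
        kernel_asm();
    #else
        kernel_c();
    #endif
        m5_dump_stats(0, 0);       /* dump stats covering only the replaced code  */
        return 0;
    }

Comparing the stats windows from the two runs, rather than total runtime (which includes startup and I/O), makes it easier to tell whether the difference comes from the assembly itself or from how the C baseline was compiled.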

Related

Fine-tuning performance vs. executable size optimizations

I have a Rust program whose executable size I'd like to decrease by 15-20% compared to its opt-level = 3 size. What makes me hopeful is that by compiling with opt-level = "s", i.e. asking Rust to optimize for size, I get a 45% size decrease compared to the performance-optimized version:
Optimization level | Code size (.text + .data segments)
3                  | 27,380 bytes
2                  | 23,934 bytes
1                  | 27,852 bytes
s                  | 15,132 bytes
z                  | 15,264 bytes, but miscompiled (must be some LLVM AVR backend bug, wouldn't be the first...)
However, the runtime performance of the s-optimized version doesn't meet my performance requirements (it's a game for a fixed hardware platform, so there's a performance requirement coming from the frame rate).
Before I start putting in "real work" to decrease code size and/or improve performance by writing "better" code, I'd like to explore what the compiler can do for me. Initially I thought I could just try tuning the inline-threshold setting, but based on the comments on that question, there's (a lot) more going on between Rust's "s" and 3 optimization levels.
So my question is: what are the various parameters/axes/degrees of freedom of the Rust compiler that I can fine-tune in the hope of getting acceptable performance and smaller code size?
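For concreteness, here is a minimal sketch of the kind of profile tuning being asked about, assuming a Cargo-based build; beyond opt-level, the lto, codegen-units, panic, and strip settings are common size levers on recent toolchains, listed as things to experiment with rather than measured results:

    [profile.release]
    opt-level = "s"      # size-oriented codegen ("z", 1, 2 or 3 are the other levels tried above)
    lto = true           # cross-crate inlining and dead-code elimination
    codegen-units = 1    # slower builds, but better optimization across the crate
    panic = "abort"      # drops unwinding machinery
    strip = true         # strip symbols from the final binary (recent Cargo only)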

How does Denver/Carmel handle optimization cache overflow

Prologue
I am not sure whether I should ask this question on Stack Overflow, Super User, or even another Stack Exchange site, so please direct me if there is a more suitable place.
Main Body
I read the IEEE paper by Nvidia introducing Denver and found that, regarding optimization cache management, only the part quoted below is relevant.
Optimization cache management uses a mark-and-sweep algorithm to manage the available optimization execution memory.
However, I would like to know: if the optimization cache is full but the dynamic code optimizer (DCO) decides to translate some more ARM instructions into native micro-ops, what will happen?
I can think of three possible implementations, but I am not sure which one is actually used.
Keep using the hardware decoder to deal with newly-encountered ARM instructions until the optimization cache has enough space.
Record some statistics to decide whether an optimized block of code needs to be replaced.
Simply flush the optimization cache.
Google yields no results, and I would appreciate any references I may have overlooked.

Is faster code also more power efficient?

Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example, execution speed is in many cases dominated by memory access. As a practical example, some code could process the pixels of an image first in rows and then in columns... a different, more complex version could instead process rows and columns in the same pass.
The second version could execute more instructions because of the more complex housekeeping of the data, but I wouldn't be surprised if it were faster because of how memory is organized: reading an image one column at a time is going to "thrash the cache", and it's very possible that, despite being simpler, the code working that way is a LOT slower than the more complex version doing the processing in a memory-friendly way. The simpler code may end up being "stalled" a lot, waiting for cache lines to be filled or flushed to external memory.
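A minimal C sketch of the row-versus-column effect described above (the image size and row-major layout are made up for the example; real numbers depend on the cache hierarchy):

    #include <stddef.h>

    #define W 4096
    #define H 4096

    /* Walks memory sequentially, so every byte of each fetched cache line is used. */
    size_t sum_by_rows(unsigned char img[H][W])
    {
        size_t s = 0;
        for (size_t y = 0; y < H; y++)
            for (size_t x = 0; x < W; x++)
                s += img[y][x];
        return s;
    }

    /* Jumps W bytes between consecutive accesses: each access may touch a new
       cache line, so this version can be far slower even though it executes
       almost exactly the same instructions. */
    size_t sum_by_cols(unsigned char img[H][W])
    {
        size_t s = 0;
        for (size_t x = 0; x < W; x++)
            for (size_t y = 0; y < H; y++)
                s += img[y][x];
        return s;
    }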
This is just an example, but in reality what happens inside a CPU when code is executed is, for many powerful processors today, a very complex process: instructions are broken down into micro-operations, registers are renamed, there is speculative execution of parts of the code depending on what the branch predictors guess even before the program counter really reaches a certain instruction, and so on. Today, in many cases, the only way to know for sure whether something is faster or slower is to try it with real data and measure.
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take a different number of cycles. Secondly, there are effects such as cache misses, page faults, and branch mispredictions that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
Further reading.
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is a clear positive correlation: faster languages also tend to use less energy.

Usage of frame pointer optimizations

This is related to, but NOT the same as, "Frame pointer omitting? Any risk?"
I am trying to follow this old (but still relevant) article:
http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx
Larry (the author) writes:
machines got sufficiently faster since 1995 that the performance
improvements that were achieved by FPO weren't sufficient to counter
the pain in debugging and analysis that FPO caused
However, in the discussion further down the page, one user writes:
Disabling FPO can have both serious code size and performance impact.
Tail call optimizations have to be disabled when a frame pointer is
present, leading to much greater stack usage in affected paths. Small
functions are also disproportionately affected by prolog/epilog code.
Third, although there are still six registers available with a frame
pointer on X86, only three of them are nonvolatile with respect to
nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop
out a bunch of spill code.
I have a couple of questions:
1. Spill code == register spillage?
2. Is the author correct that FPO is generally considered a pain and the gain does not outweigh the benefits?
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
4. Do you use FPO? What for (if yes), and does it make a difference to you?
Finally, in this article
http://www.altdevblogaday.com/2012/05/24/x64-abi-intro-to-the-windows-x64-calling-convention/
The author says
[with respect to the Windows x64 calling convention] ...
All parameters have space reserved on the stack, even the ones passed in registers. In fact, there’s stack space for 4 parameters
even if your function doesn’t have any params. Those parameters are 8
bytes so that’s at least 32 bytes on the stack for every function
(every function actually has at least 48 bytes on the stack…I’ll
explain that another time). This stack area is called the home space.
There are a few reasons behind this home space:
If the registers need to be used for something else, the called function can store the data in the home space without moving the stack
pointer.
It keeps the stack structure easy to determine. That's very handy for debugging, and perhaps necessary for x64's stack metadata (another
point I'll come back to another time). ... The compiler can use it
for whatever it wants, and an optimized build will likely make great
use of it.
Wouldn't an optimized build optimize the excess allocation away?
1. Spill code == register spillage?
Almost. Strictly speaking, spill code is the code added by the compiler to implement a register spill. The spill itself is the decision to tag a live range as not able to be placed in a register.
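A hedged illustration (the function and variable names are made up; what actually spills depends on the target and the register allocator):

    /* With more simultaneously-live values than the target has registers,
       the compiler inserts spill code: stores to stack slots and reloads
       from them, e.g. "mov [esp+4], eax" ... "mov eax, [esp+4]" on x86. */
    int lots_of_live_values(int a, int b, int c, int d,
                            int e, int f, int g, int h)
    {
        int t1 = a * b, t2 = c * d, t3 = e * f, t4 = g * h;
        /* t1..t4 and several of the inputs are all live here; on a
           register-poor target (32-bit x86 especially) some get spilled. */
        return (t1 + t2) * (t3 + t4) + (a - h);
    }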
2. Is the author correct that FPO is generally considered a pain and the gain does not outweigh the benefits?
The author is probably correct that on modern processor architectures, the set of functions where FPO will generate a significant performance gain is smaller than in the past. Yet FPO does make code smaller, reducing cache pressure, and it does reduce register pressure. These can be important in some settings. It also speeds up prolog and epilog code by a few instructions. The point about debuggers not working well without the FP is noteworthy: it means core dumps are less useful for post-mortems on production-optimized code. You'd never want to use FPO during development except for final testing.
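A rough sketch of the kind of prolog/epilog difference being discussed (illustrative only; the exact code depends on the compiler, flags such as gcc's -fomit-frame-pointer / -fno-omit-frame-pointer, and the target):

    int add3(int a, int b, int c)
    {
        return a + b + c;
    }

    /* With a frame pointer (32-bit x86), the compiler typically emits:
           push ebp
           mov  ebp, esp
           ...body, with locals and arguments addressed via ebp...
           pop  ebp
           ret
       With FPO, the body addresses everything relative to esp, the
       push/mov/pop disappear, and EBP becomes available as a fourth
       nonvolatile register alongside EBX, ESI and EDI. */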
3. Is FPO still relevant today on the x64 architecture, since there are a LOT more registers to play with?
Modern processors are so various and complex that you just about never know what's "relevant" until you try it and measure.
4. Do you use FPO? What for (if yes), and does it make a difference to you?
I have written a medium-size C library (20K SLOC) where it made a small (~5%) difference in run time overall under gcc. This was a native language extension to a scripting language that had to compile under both gcc and Visual C. Using it would have split the build path. I decided 5% was not worth it for the purpose the extension served. But if it had been a dynamic fluid simulation to predict the weather, 5% could have been worth many millions of dollars. The decision would have been different.
5. Wouldn't an optimized build optimize the excess allocation away?
That's entirely up to the compiler and optimizer designer. It looks from the MS documentation as if MS has defined the ABI to require home space for all parameters, even if their whole lifetime is spent in a register.
1) When you need to use a register and don't have any unused ones, you need to write code to save some register value on the stack and later restore it.
2) FPO was a pain back when unwinding was primarily done by walking the stack. Nowadays standard unwind ABIs exist anyway (e.g. to enable exception handling), so the information already exists, and is organized more efficiently (away from the hot code), so there's no pain. Sure, there would be some pain if you wrote all your machine code by hand, but that's not the typical use case.
3) Typical x86_64 ABIs don't use frame pointers at all (except when absolutely necessary, like for variable-length arrays in C).
4) I'm not a compiler. My compiler doesn't generate frame pointers.
5) Not sure what your question is. The space consumption for the home area isn't a problem. The benefit of not having to adjust any stack pointers is a big advantage, since you need a lot less code. The same goes for the red zone just beyond the stack frame, which allows leaf code to use a lot of memory without ever needing any stack pointer gymnastics.
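A hedged sketch of what the home (shadow) space looks like in practice under the Windows x64 convention (the callee is hypothetical; exact codegen varies by compiler and optimization level):

    extern long long callee(long long a, long long b);   /* hypothetical */

    long long caller(long long x)
    {
        /* A typical non-leaf prologue here is just "sub rsp, 40":
           32 bytes of home space for the callee's RCX/RDX/R8/R9 plus
           8 bytes to keep the stack 16-byte aligned across the call.
           An unoptimized callee may spill its register arguments there
           (e.g. "mov [rsp+8], rcx" at its entry); an optimized build can
           reuse the same slots as scratch space instead. */
        return callee(x, 42);
    }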

Profiling / Code optimizing tool [closed]

Please let me know which tool (GNU or third-party) is best for profiling and optimizing our code. Is gprof an effective tool? Has dtrace been ported to Linux?
You're not alone in conflating the terms "profiling" and "optimizing", but they're really very different. As different as weighing a book versus reading it.
As far as gprof goes, here are some issues.
Among profilers, the most useful are the ones that:
sample the whole call stack (not just the program counter), or at least as much of it as contains your code;
sample on wall-clock time (not just CPU time); if you're losing time in I/O or other blocking calls, a CPU-only profiler simply won't see it;
tell you, by line of code (not just by function), the percent of stack samples containing that line. That's important because any such line of code that you could eliminate would save you that percent of overall time, and you don't have to eyeball functions to guess where it is.
A good profiler for Linux that does this is Zoom. I'm sure there are others.
(Don't get confused about what matters. Efficiency and/or timing accuracy of a profiler is not important for helping you locate speed bugs. What's important is that it draws your attention to the right things.)
Personally, I use the random-pausing method, because I find it's the most effective.
Not only is it simple and requires no investment, but it also finds speedup opportunities that are not localized to particular routines or lines of code, as in this example.
This is reflected in the speedup factors that can be achieved.
gprof is better than nothing, but not much. It estimates the time spent not only in a function but also in all of the functions called by it; beware that this is an estimate, not a direct measurement. It does not account for the fact that two callers of the same subroutine may spend widely differing amounts of time inside it per call. To do better than that you need a real call-graph profiler, one that looks at several levels of the stack on a timer tick.
dtrace is good for what it does.
If you are doing low-level performance optimization on x86, you should consider Intel's VTune tool. Not only does it provide the best access I am aware of to the low-level performance measurement hardware on the chip, the so-called EMON (Event Monitoring) system (some of which I designed), but VTune also has some pretty good high-level tools, such as call-graph profiling that, I believe, is better than gprof's. On the low level, I like doing things like generating profiles of the leading suspects, like branch mispredictions and cache misses, and looking at the code to see if there is something that can be done. Sometimes simple stuff, such as making an array size 255 rather than 256, helps a lot.
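A hedged sketch of the "255 rather than 256" remark: with a power-of-two row stride, walking down a column hits addresses that all map to the same few cache sets and evict each other; padding the row breaks the pattern (the sizes below are made up for the example):

    #define ROWS 1024
    #define COLS 1024

    float conflict_prone[ROWS][COLS];   /* 4096-byte stride between rows: a column
                                           walk keeps hitting the same cache sets  */
    float padded[ROWS][COLS + 1];       /* 4100-byte stride: consecutive rows land
                                           in different sets, so a column walk
                                           spreads across the whole cache          */

    float column_sum_padded(int col)
    {
        float s = 0.0f;
        for (int r = 0; r < ROWS; r++)
            s += padded[r][col];        /* compare against the same loop over
                                           conflict_prone                          */
        return s;
    }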
Generic Linux oprofile, http://oprofile.sourceforge.net/about/, is almost as good as Vtune, and better in some ways. And available for x86 and ARM. I haven't used it much, but I particularly like that you can use it in an almost completely non-intrusive manner, with no need to create the special -pg binary that gprof needs.
There are many tools with which you can optimize your code.
For web applications there are different tools for optimizing the code, e.g. JavaScript/CSS compressors such as the YUI Compressor.
For desktop applications, an optimizing compiler is the main tool.