Any Macro or Technique for Partial Optimization? - optimization

I am working on a lock-free structure with the g++ compiler. It seems that with the -O1 switch, g++ changes the execution order of my code. How can I forbid g++'s optimization on a certain part of my code while keeping the optimization for the other parts? I know I can split it into two files and link them, but that looks ugly.

If you find that gcc changes the order of execution in your code, you should consider using a memory barrier. Just don't assume that volatile variables will protect you from that issue. They only guarantee that, within a single thread, the behavior is what the language specifies, and that the variable is always read from its memory location to account for changes "invisible" to the executing code (e.g. changes to a variable made by a signal handler).
GCC has supported OpenMP since version 4.2. You can use it to create a memory barrier with a special #pragma directive (#pragma omp flush).
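As a rough illustration of what a barrier buys you, here is a minimal sketch that uses C++11 std::atomic with release/acquire ordering instead of the OpenMP pragma. It assumes a C++11 compiler and thread support (e.g. -std=c++11 -pthread); the names payload, ready and publish_and_consume are made up for this example.

```cpp
#include <atomic>
#include <thread>

// Hypothetical names for illustration; not taken from the question.
int payload = 0;                     // ordinary, non-atomic data
std::atomic<bool> ready{false};      // flag that publishes the data

void producer() {
    payload = 42;                                    // 1. write the data
    // 2. publish it; neither the compiler nor the CPU may move the
    //    write above to after this release store
    ready.store(true, std::memory_order_release);
}

int consumer() {
    // spin until the producer has published
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;  // the acquire load makes the producer's write visible
}

// Runs one producer and one consumer; returns what the consumer saw.
int publish_and_consume() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

With a plain bool (or a volatile bool) in place of the atomic flag, nothing would stop the write to payload from being reordered past the flag update.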
A very good insight into lock-free code is this PDF by Scott Meyers and Andrei Alexandrescu: C++ and the Perils of Double-Checked Locking

You can use a function attribute, "__attribute__((optimize("O0")))", to set the optimization level for a single function, or "#pragma GCC optimize" for a block of code. These are only available from GCC 4.4 onwards, though, I think - check your GCC manual. If they aren't supported, separating the source is your only option.
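A minimal sketch of both forms, assuming GCC 4.4 or later (other compilers may warn about or ignore them). The function names are invented for the example. Note that the GCC manual describes the optimize attribute as a debugging aid rather than something to rely on in production code.

```cpp
// GCC-specific; function names are hypothetical.

// Compile just this function at -O0, whatever the command line says.
__attribute__((optimize("O0")))
int sum_unoptimized(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += data[i];
    return total;
}

// Alternatively, change the level for every function defined in a region.
#pragma GCC push_options
#pragma GCC optimize ("O0")
int sum_in_region(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += data[i];
    return total;
}
#pragma GCC pop_options   // restore the command-line options
```

Both functions behave identically to an optimized version; only the generated code differs.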
I would also say, though, that if your code fails with optimization turned on, it is most likely that your code is just wrong, especially as you're trying to do something that is fundamentally very difficult. The processor will potentially perform reordering on your code (within the limits of sequential consistency) so any re-ordering that you're getting with GCC could potentially occur anyway.

Related

Ways to make a D program faster

I'm working on a very demanding project (actually an interpreter), exclusively written in D, and I'm wondering what type of optimizations would generally be recommended. The project makes heavy use of GC, classes, associative arrays, and pretty much everything else.
Regarding compilation, I've already experimented both with DMD and LDC flags and LDC with -flto=full -O3 -Os -boundscheck=off seems to be making a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to your mind that could help speed up the performance, related or not to the D language. (I'm sure I'm missing several things).
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in comments. Personally, under Linux I have a script which dumps a process's stack trace, and I run it a few times to get an idea of where the program is spending its time.
Not sure what you mean by GS.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.
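The same trade-off exists in C++, so here is a C++ (not D) sketch of the idea: one loop dispatches through a virtual method, the other takes the operation type as a template parameter so the call can be resolved and inlined at compile time. All names are hypothetical. In D, the closest equivalents would be marking methods final or using templates.

```cpp
#include <cstdint>

// Dynamic dispatch: every call goes through the vtable.
struct Op {
    virtual ~Op() = default;
    virtual std::int64_t apply(std::int64_t x) const = 0;
};
struct AddOneVirtual final : Op {
    std::int64_t apply(std::int64_t x) const override { return x + 1; }
};
std::int64_t run_dynamic(const Op& op, std::int64_t x, int times) {
    for (int i = 0; i < times; ++i) x = op.apply(x);  // indirect call
    return x;
}

// Static dispatch: the concrete type is known at compile time,
// so apply() can be inlined into the loop.
struct AddOne {
    std::int64_t apply(std::int64_t x) const { return x + 1; }
};
template <typename OpT>
std::int64_t run_static(const OpT& op, std::int64_t x, int times) {
    for (int i = 0; i < times; ++i) x = op.apply(x);  // direct call
    return x;
}
```

Whether this is a measurable win depends on how hot the call site is, so it is worth profiling before restructuring.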
Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.experimental.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.

Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build?

While looking for ways to speed up my simulation, I came across the --force-lto option.
I've heard about LTO (Link Time Optimization) before, so that made me wonder why isn't --force-lto the default while building gem5?
Would that make a simulation go much faster than a gem5.fast build compared to a gem5.opt build?
In gem5 fe15312aae8007967812350f8cdac9ad766dcff7 (2019), the gem5.fast build already enables LTO by default, so you generally never want to use that option explicitly, but rather want just gem5.fast.
Other things to keep in mind about .fast:
- it also removes -g, so you get no debug symbols. I wonder why, since that does not make runs any faster.
- it also turns on NDEBUG, which has the standard library effect of disabling asserts entirely, plus some gem5-specific effects spread throughout the code with #ifndef NDEBUG checks
- it disables TRACING_ON, which makes DPRINTF and family become empty statements, as seen at: src/base/trace.hh
Those effects can be seen easily at src/SConstruct.
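To make the NDEBUG and TRACING_ON effects concrete, here is a small stand-alone sketch. MY_TRACING_ON and MY_DPRINTF are hypothetical stand-ins written for this example; gem5's real macros in src/base/trace.hh differ in detail.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins, not gem5's actual definitions.
#ifndef MY_TRACING_ON
#define MY_TRACING_ON 1
#endif

#if MY_TRACING_ON
#define MY_DPRINTF(...) std::fprintf(stderr, __VA_ARGS__)
#else
#define MY_DPRINTF(...) do {} while (0)  // empty statement, no run-time cost
#endif

int checked_divide(int a, int b) {
    assert(b != 0);  // compiled out entirely when built with -DNDEBUG
    MY_DPRINTF("dividing %d by %d\n", a, b);
#ifndef NDEBUG
    // debug-only bookkeeping would go here; absent in an NDEBUG build
#endif
    return a / b;
}
```

Building with -DNDEBUG -DMY_TRACING_ON=0 removes both the check and the trace call from the generated code, which is part of where a speedup like .fast's comes from.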
The --force-lto option exists because the more common gem5.opt build also uses partial linking, which in some versions of GCC was incompatible with LTO.
Therefore, as its name suggests, --force-lto forces the use of LTO together with partial linking, which might not be stable. That's why I recommend that you use gem5.fast rather than touching --force-lto.
The goal of partial linking is presumably to speed up the link step, which can easily be the bottleneck in a "change one file, rebuild, relink, test" loop, although in my experiments it is not clear that it is effective at doing that. Today it might just be a relic from the past.
To try to speed up linking, I recommend that you try scons --gold-linker instead, which uses the gold linker instead of ld. Note, however, that this option was more noticeably effective for gem5.debug.
I have found that gem5.fast is generally 20% faster than gem5.opt for Atomic CPUs.

Why is my code performing poorly when built with Realview tools but better with Codesourcery?

I have a C project which was previously being built with Codesourcery's GNU tool chain. Recently it was converted to use Realview's armcc compiler, but the performance that we are getting with Realview tools is very poor compared to when it is compiled with GNU tools. Shouldn't it be the opposite, i.e. shouldn't it give better performance when compiled with Realview's tools? What am I missing here? How can I improve the performance with Realview's tools?
Also, I have noticed that if I run the binary produced by Realview tools with Lauterbach it crashes, but if I run it using Realview ICE it runs fine.
UPDATE 1
Realview Command line:
armcc -c --diag_style=ide --depend_format=unix_escaped --no_depend_system_headers --no_unaligned_access --c99 --arm_only --debug --gnu --cpu=ARM1136J-S --fpu=SoftVFP --apcs=/nointerwork -O3 -Otime
GNU GCC command line:
arm-none-eabi-gcc -mcpu=arm1136jf-s -mlittle-endian -msoft-float -O3 -Wall
I am using Realview Tools version 4.1 and GCC version 4.4.1
UPDATE 2
The Lauterbach issue has been solved. It was caused by semihosting: the semihosting SWI was not being handled in the Lauterbach environment. Retargeting the C library to avoid semihosting did the trick, and now my program runs successfully with Lauterbach as well as Realview ICE. But the performance issue remains.
Since you have optimisations on, and in some environments it crashes, it may be that your code uses undefined behaviour or has some other latent error. Such behaviour can change with optimisation, or even break altogether.
I suggest that you try both tool-chains without optimisation, make sure that the warning level is set high, and fix every warning. GCC is far better than armcc at error checking, so it is a reasonable static analysis check. If the code builds clean it is more likely to work and may be easier for the optimiser to handle.
Have you tried removing the '--no_unaligned_access'? ARM11s can typically do unaligned access (if enabled in the startup code) and forcing the compiler/library to not do them may be slowing down your code.
The current version of RVCT says of '--fpu=SoftVFP':
"In previous releases of RVCT, if you specified --fpu=softvfp and a CPU with implicit VFP hardware, the linker chose a library that implemented the software floating-point calls using VFP instructions. This is no longer the case. If you require this legacy behavior, use --fpu=softvfp+vfp."
This suggests to me that if you have an old version of RVCT, the behaviour will be to use software floating point regardless of the presence of hardware floating point, whereas in the GNU version -msoft-float will use hardware floating point instructions when an FPU is available.
So what version of RVCT are you using?
Either way, I suggest that you remove the --fpu option, since the compiler will implicitly make an appropriate selection based on the --cpu option. You also need to correct the CPU selection: your RVCT option says --cpu=ARM1136J-S, not ARM1136JF-S as you told GCC. This will no doubt prevent the compiler from generating VFP instructions, since you told it the CPU has no VFP.
The same source code can produce dramatically different binaries due to factors like: different compilers (llvm vs gcc, gcc 4 vs gcc 3, etc.); different versions of the same compiler; different compiler options for the same compiler; optimization (on either compiler); and compiling for release or debug (or whatever terms you want to use, the binaries are quite different). When going embedded, you add in the complication of a bootloader or ROM monitor (debugger) and things like that. Then add to that the host-side tools that talk to the ROM monitor or compiled-in debugger. Despite being far better compilers than gcc, ARM's compilers were infected with the assumption that the binaries would always be run on top of their ROM monitor. I seem to remember that by the time RVCT became their primary compiler that assumption was on its way out, but I have not really used their tools since then.
The bottom line is that there are a handful of major factors that affect the differences between binaries, and they can and will lead to a different experience. Assuming that you will get the same performance or results is a bad assumption; the expectation should be that the results will differ. Likewise, within the same environment, you should be able to create binaries that give dramatically different performance results, all from the same source code.
Do you have compiler optimizations turned on in your CodeSourcery build, but not in the Realview build?

Why would one ever want to compile with -O2 instead of -O3

We usually compile with -O2 because -O3 would "trigger subtle bugs".
For our GCC version, -O3 enables more aggressive inlining, which would actually reveal bugs that otherwise go unnoticed (e.g. use of uninitialized values from functions taking them as reference arguments, or out-of-bounds access for arrays). It seems to me this aggressive inlining also allows a more expressive way of coding with smaller functions, and -funswitch-loops helps keep variable definitions more local in loops.
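A minimal sketch of the uninitialized reference-argument pattern described above, with hypothetical names. The buggy caller is shown only as a comment because reading the uninitialized value is undefined behaviour; once the callee is inlined, the optimizer can see both sides and the result may change between -O2 and -O3.

```cpp
// Bug pattern: the out-parameter is only written on the success path.
bool parse_digit(char c, int& out) {
    if (c >= '0' && c <= '9') {
        out = c - '0';
        return true;
    }
    return false;  // 'out' is left untouched here
}

// Buggy caller (do not use): reads 'value' even when parsing failed.
//   int value;
//   parse_digit(c, value);
//   return value * 2;      // undefined behaviour if c is not a digit

// Fixed caller: initialise the variable and check the result.
int doubled_digit_or(char c, int fallback) {
    int value = fallback;
    if (!parse_digit(c, value)) return fallback;
    return value * 2;
}
```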
Given that bugs in our code are orders of magnitude more likely than compiler bugs and that we use -Wall -Wextra without any issues what kind of bugs should we be looking for?
If it matters we use gcc-4.3.2. Compile time is not a major issue for us.
Size. Of course, if size really matters (sometimes it does, as in embedded work), one would use -Os. But the main difference at -O3 is the inlining you already mentioned. This can increase the generated code size (though the code is faster). Maybe you want speed, but not at any cost in space? Otherwise I see no reason not to use -O3 (unless you know of a gcc compiler bug that only occurs in your code at -O3; but as long as you don't have an error that you can't reproduce at -O2, I would not care).
Don't kid yourself that compiler bugs aren't lurking out there to make your life hell. Here's a nasty one which cropped up in Debian last year, and where the fix was to fall back to -O2.
Sometimes aggressive optimisation can break code just like you mentioned. If this is a project you are currently working on, then perhaps this is not a problem. However, if the code in question is legacy code that is fragile, poorly written, and not well-understood, then you want to take as few chances as possible.
Also, not all optimisations are formally proven. That means that they may alter the behaviour of programs in undesirable ways.
The best example I can think of is a Java one, but it should illustrate my point about optimisations in general.
It is common to have code like this:
    while( keepGoing ){
        doStuff();
    }
Then the value of keepGoing gets modified by another thread. One optimisation that the JVM will do is to see that keepGoing is not modified within the body of the loop, so it "hoists" the check out of the loop, essentially transforming the code into:
    if( keepGoing ){
        while( true ){
            doStuff();
        }
    }
Which in a multi-threaded environment is not the same thing, but in a single-threaded it is. These are the kinds of things that can break with optimisations. This is a frequent source of "Heisenbugs".
PS: In Java the proper answer is to make keepGoing "volatile" so the JVM cannot presume cached values and the code does what you intend.
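For comparison, a C++ sketch of the same stop-flag loop. In C++, volatile does not give the cross-thread guarantee that Java's volatile does, so the sketch uses std::atomic<bool> instead. It assumes C++11 with thread support; the function names are made up for the example.

```cpp
#include <atomic>
#include <thread>

// With a plain bool the compiler would be free to hoist the check out of
// the loop, exactly as described for the JVM above.
std::atomic<bool> keepGoing{true};

// Hypothetical worker: counts iterations until told to stop.
long run_until_stopped() {
    long iterations = 0;
    while (keepGoing.load(std::memory_order_acquire)) {  // re-read each time
        ++iterations;
    }
    return iterations;
}

// Starts the worker, asks it to stop from this thread, and waits for it.
long start_and_stop_worker() {
    keepGoing.store(true);
    long iterations = -1;
    std::thread worker([&iterations] { iterations = run_until_stopped(); });
    keepGoing.store(false, std::memory_order_release);   // request stop
    worker.join();
    return iterations;
}
```

The number of iterations varies from run to run; the point is only that the loop is guaranteed to notice the store and terminate.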

gfortran optimization causes fortran do-variable loop error during runtime

I have written a Fortran routine that uses some legacy Fortran 77 code for finite elements. However, with a particular mesh, when the -O optimization flag is turned on, an important do-loop iterator is somehow being modified, even though Fortran supposedly prohibits this. I have compiled this code using gfortran 4.5 with the -fcheck=do run-time checking enabled, and it has verified what I've noted above. A runtime error occurs only when optimizations are turned on, and it points directly to the do-iterator.
Using gdb on the optimized code (although stepping seems erratic, with lines bouncing back and forth) seems to clearly indicate that the do-iterator somehow gets set back to zero, which essentially causes a nice infinite loop.
Any suggestions as to how to hunt down and fix whatever is causing this bug would be greatly appreciated, as I'd like to make sure the whole project can be consistently compiled with the same flags.
You say that you use -fcheck=do; why not go all the way and use -fcheck=all? What you're seeing sounds like a typical case of memory corruption due to an array bounds violation, which -fcheck=all can in some cases catch. Where the array bounds checking doesn't work that well is with implicit interfaces and incorrect bounds being passed; a solution here is to put your procedures into modules, allowing the compiler to check interfaces.
And, like Jonathan Dursi said, consider using a tool like valgrind.