How to enable stronger optimization builds - cmake

I am trying to build PETSc and am having problems enabling optimization. By default, PETSc creates a debugging build, but I can turn that off by passing --with-debugging=0 to cmake. However, this only enables -O1 by default, and since my application is extremely time-consuming and very time-critical, I want at least -O2. The only option I can find is --CFLAGS, which works, but the options are always appended at the end, so -O1 would override my -O2.
I grepped for "-O" to find where to set the flag manually, but that gave me a million lines, mostly from configure.log, and didn't help.
Does anybody know which file to set the flag in, or a workaround, such as another option that disables the last specified -O# and uses the strongest (or first) one instead?

Citing PETSc's install instructions:
Configure defaults to building PETSc in debug mode. One can switch to
using optimized mode with the toggle option --with-debugging [defaults
to debug enabled]. Additionally one can specify more suitable
optimization flags with the options COPTFLAGS, FOPTFLAGS, CXXOPTFLAGS.
./configure --with-cc=gcc --with-fc=gfortran --with-debugging=0 COPTFLAGS='-O3 -march=p4 -mtune=p4' FOPTFLAGS='-O3 -qarch=p4 -qtune=p4'
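Adapted to the question, a minimal invocation might look like this (a sketch; pick optimization flags that suit your compiler and machine):
./configure --with-debugging=0 COPTFLAGS='-O2' CXXOPTFLAGS='-O2' FOPTFLAGS='-O2'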

Related

Does the --force-lto gem5 scons build option speed up simulation significantly and how does it compare to a gem5.fast build?

While looking for ways to speed up my simulation, I came across the --force-lto option.
I've heard about LTO (Link Time Optimization) before, so that made me wonder: why isn't --force-lto the default when building gem5?
Would it make a simulation run much faster, and how would the result compare to a gem5.fast build versus a gem5.opt build?
In gem5 fe15312aae8007967812350f8cdac9ad766dcff7 (2019), the gem5.fast build already enables LTO by default, so you generally never want to use that option explicitly; you just want gem5.fast instead.
Other things to keep in mind about .fast:
it also removes -g and so you get no debug symbols. I wonder why, since that does not make runs any faster.
it also turns on NDEBUG, which has the standard-library effect of disabling asserts entirely, plus some gem5-specific effects spread throughout the code behind #ifndef NDEBUG checks
it disables TRACING_ON, which makes DPRINTF and family become empty statements as seen at: src/base/trace.hh
Those effects can be seen easily at src/SConstruct.
That option exists because the more common gem5.opt build also uses partial linking, which in some versions of GCC was incompatible with LTO.
Therefore, as its name suggests, --force-lto forces the use of LTO together with partial linking, which might not be stable. That's why I recommend using gem5.fast rather than touching --force-lto.
The goal of partial linking is presumably to speed up the link step, which can easily be the bottleneck in a "change one file, rebuild, relink, test" loop, although in my experiments it is not clear that it is effective at doing that. Today it might just be a relic from the past.
To speed up linking, I recommend trying scons --gold-linker instead, which uses the gold linker instead of the default ld. Note, however, that this option was more noticeably effective for gem5.debug.
I have found that gem5.fast is generally 20% faster than gem5.opt for Atomic CPUs.
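For reference, typical invocations might look like this (a sketch; the ISA directory and job count are assumptions, adjust them to your target and machine):
scons build/X86/gem5.fast -j"$(nproc)"              # LTO on, asserts and DPRINTF compiled out
scons --gold-linker build/X86/gem5.opt -j"$(nproc)"  # opt build, linked with gold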

LLVM lit testing: is it possible to configure number of threads via `lit.cfg`?

I wonder if it's possible to configure the number of threads for testing in a lit.cfg file.
lit offers a command-line flag for specifying the number of threads:
llvm/utils/lit/lit.py -j1 <test directory>
However, I'm not sure how to do this in a lit.cfg file. I want to force all tests in a subdirectory to be run with -j1, but I'm not sure if this is possible.
Edit: for reference, I'm working on the Swift codebase, which has a large test suite (4000+ tests) with multiple test subdirectories.
I want to run just one subdirectory with -j1 and the rest with the default number of threads (-j12 for my machine).
I was wondering about that too a while back, but I don't think there is such an option, because of this line here. Usually, the main project's compilation time dwarfs the lit tests' execution time.
It is easy to change, but I'd suggest using your build configuration to do this (e.g. make or cmake). So make test could execute something like lit -j $(nproc) underneath.
Edit (after OP update):
I haven't worked with the Swift repo, but maybe you could hack your way around it. One thing I can see is that you could influence the LIT_ARGS cmake variable by appending the options you want to it.
Now, to force single-process execution for a specific directory, you could add a lit.local.cfg that sets the singleProcess flag. This seems to override multi-threaded execution:
config.singleProcess = True
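Put together, the lit.local.cfg for that subdirectory could be as small as this (a sketch; lit merges a lit.local.cfg into the parent configuration for tests under that directory only):
# lit.local.cfg in the subdirectory whose tests must not run in parallel
config.singleProcess = True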

Modelsim Optimization Issue

I am having a problem when trying to run the following Verilog code snippet in optimized mode using the Modelsim simulator v10.2c.
always @(*)
  if (dut.rtl_module.enable == 1'b1)
    force dut.rtl_module.abc_reg = xyz;
If the above snippet is run in non-optimized mode, it works fine, but in optimized mode it fails.
PS: I am using the -O5 optimization level.
Optimisation typically disables access to simulator objects. Your force command requires that access.
You'll need to explicitly enable access. Unfortunately I can't see anything useful in the Modelsim AE documentation; however, from Riviera-PRO:
+accs
Enables access to design structure. This is the default in -O0,
-O1 and -O2 and may optionally be specified for -O3. If omitted,
the compiler may limit read access to selected objects.
Modelsim supports +acc; it just doesn't appear to be well documented. The only reference appears to be this suggestion:
While optimization is not necessary for class based debugging, you might want to use
vsim -voptargs=+acc=lprn to enable visibility into your design for RTL debugging.
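So in this case something along these lines should restore the visibility that the force needs (a sketch; tb_top is a placeholder for your top-level design unit):
# enable access to design objects despite optimization
vsim -voptargs=+acc=lprn work.tb_top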

How to enable the ARC optimizer in debug configuration?

The Transitioning to ARC Release Notes makes this statement:
One issue to be aware of is that the optimizer is not run in common
debug configurations, so expect to see a lot more retain/release
traffic at -O0 than at -Os.
How can we enable the optimizer in a default debug configuration?
You can set the optimization level in Xcode's Build Settings independently for the Debug and Release configurations: just go to Build Settings, scroll down until you find the optimization setting, and pick the one you want from the menu.
Note: you should probably only do this out of curiosity (which is to be encouraged :-)), as optimization can (re)move code, etc. Debugging may become a little harder; e.g. a variable may "disappear", so you can't easily track its value because it has been assigned to a register.
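If you drive builds from the command line, the same setting can be overridden per invocation (a sketch; GCC_OPTIMIZATION_LEVEL is the build setting behind that menu item, and the scheme name is a placeholder):
xcodebuild -scheme MyApp -configuration Debug GCC_OPTIMIZATION_LEVEL=s build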

why is my code performing poorly when built with Realview tools but better with Codesourcery?

I have a C project which was previously built with Codesourcery's GNU toolchain. Recently it was converted to use Realview's armcc compiler, but the performance we are getting with the Realview tools is very poor compared to when it is compiled with the GNU tools. Shouldn't it be the opposite, i.e. better performance when compiled with Realview's tools? What am I missing here? How can I improve the performance with Realview's tools?
Also, I have noticed that the binary produced by the Realview tools crashes if I run it with Lauterbach, but runs fine with Realview ICE.
UPDATE 1
Realview Command line:
armcc -c --diag_style=ide --depend_format=unix_escaped --no_depend_system_headers --no_unaligned_access --c99 --arm_only --debug --gnu --cpu=ARM1136J-S --fpu=SoftVFP --apcs=/nointerwork -O3 -Otime
GNU GCC command line:
arm-none-eabi-gcc -mcpu=arm1136jf-s -mlittle-endian -msoft-float -O3 -Wall
I am using Realview Tools version 4.1 and GCC version 4.4.1
UPDATE 2
The Lauterbach issue has been solved. It was caused by semihosting: the semihosting SWI was not being handled in the Lauterbach environment. Retargeting the C library to avoid semihosting did the trick, and now my program runs successfully with Lauterbach as well as Realview ICE. But the performance issue remains.
Since you have optimisations on, and it crashes in some environments, it may be that your code relies on undefined behaviour or has some other latent error. Such behaviour can change with optimisation, or break altogether.
I suggest that you try both tool-chains without optimisation, make sure that the warning level is set high, and fix all the warnings. GCC is far better than armcc at error checking, so it serves as a reasonable static-analysis pass. If the code builds cleanly, it is more likely to work and may be easier for the optimiser to handle.
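For example, a no-optimisation, high-warning pass with the GNU toolchain might look like this (a sketch; the source file name is a placeholder):
arm-none-eabi-gcc -mcpu=arm1136jf-s -mlittle-endian -msoft-float -O0 -Wall -Wextra -c main.c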
Have you tried removing '--no_unaligned_access'? ARM11s can typically perform unaligned accesses (if enabled in the startup code), and forcing the compiler/library not to use them may be slowing down your code.
The current version of RVCT says of '--fpu=SoftVFP':
In previous releases of RVCT, if you specified --fpu=softvfp and a CPU with implicit VFP hardware, the linker chose a library that implemented the software floating-point calls using VFP instructions. This is no longer the case. If you require this legacy behavior, use --fpu=softvfp+vfp.
This suggests to me that if you perhaps have an old version of RVCT the behaviour will be to use software floating point regardless of the presence of hardware floating point. While in the GNU version -msoft-float will use hardware floating point instructions when an FPU is available.
So what version of RVCT are you using?
Either way, I suggest that you remove the --fpu option, since the compiler will make an appropriate implicit selection based on the --cpu option. You also need to correct the CPU selection: your RVCT option says --cpu=ARM1136J-S, not ARM1136JF-S as you told GCC. This will no doubt prevent the compiler from generating VFP instructions, since you have told it there is no VFP.
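Putting those two suggestions together, the RVCT command line would change to something like this (a sketch based only on this answer, keeping the other options as posted; check that your RVCT version accepts ARM1136JF-S as a --cpu name):
armcc -c --diag_style=ide --depend_format=unix_escaped --no_depend_system_headers --no_unaligned_access --c99 --arm_only --debug --gnu --cpu=ARM1136JF-S --apcs=/nointerwork -O3 -Otime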
The same source code can produce dramatically different binaries due to factors such as: different compilers (llvm vs gcc, gcc 4 vs gcc 3, etc.), different versions of the same compiler, different compiler options with the same compiler, optimization settings (on either compiler), and whether you compile for release or debug (whatever terms you want to use, the binaries are quite different). When going embedded, you add the complication of a bootloader or ROM monitor (debugger) and things like that, and then the host-side tools that talk to the ROM monitor or to the compiled-in debugger. Despite being a far better compiler than gcc, ARM's compilers were infected with the assumption that the binaries would always be run on top of their ROM monitor. I want to remember that by the time RVCT became their primary compiler that assumption was on its way out, but I have not really used their tools since then.
The bottom line is that there are a handful of major factors affecting the differences between binaries, and they can and will lead to a different experience. Assuming that you will get the same performance or results is a bad assumption; the expectation is that the results will differ. Likewise, within the same environment, you should be able to create binaries that give dramatically different performance results, all from the same source code.
Do you have compiler optimizations turned on in your CodeSourcery build, but not in the Realview build?