Meaning of the LLVM Optimization Levels - optimization

I recently started working with Clang/LLVM and would like to know if there is any particular documentation on what the -Ox optimization levels do?
I couldn't find much on the LLVM documentation page. Can someone share a few links?
Thanks.

Clang's command-line options documentation is indeed very poor, and in particular you are correct that there's almost no explanation of what the optimization levels do.
FreeBSD, however, does add a man page with a useful summary:
-O0 -O1 -O2 -Os -Oz -O3 -O4
Specify which optimization level to use. -O0 means "no
optimization": this level compiles the fastest and generates the
most debuggable code. -O2 is a moderate level of optimization
which enables most optimizations. -Os is like -O2 with extra
optimizations to reduce code size. -Oz is like -Os (and thus -O2),
but reduces code size further. -O3 is like -O2, except that it
enables optimizations that take longer to perform or that may
generate larger code (in an attempt to make the program run
faster). On supported platforms, -O4 enables link-time
optimization; object files are stored in the LLVM bitcode file
format and whole program optimization is done at link time. -O1 is
somewhere between -O0 and -O2.
If you're looking for the exact list of passes performed at each optimization level, see this Stack Overflow question:
Clang optimization levels

Related

Decreasing runtime of Fortran program with compiling and linking flags in GNU

My problem concerns the runtime of a .exe app. I've been given a very large code base (which I know is bug-free), but it takes too much time to run. I compiled it with GNU, and I cannot use parallel programming either, since my computer has only two processors.
The problem is tied to a single subroutine of 2000 lines. I have noticed that it is mainly made up of loops, which is where I think the problem lies. It is also called about 20000 times by the main program.
First I tried the -O flags (the best runtime was with -Ofast). After that, I tried to improve the loop performance with -fforce-addr, but no measurable acceleration happened. Lately I have been using other flags such as -mtune to create code optimized for the local machine.
Here are my main tests and results:
Original program (31s)
COMPOPTS= -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 LINKOPTS= -l unlimit -s unlimited
Using -Ofast (25s)
COMPOPTS= -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 -cpp LINKOPTS= -l unlimit -s unlimited
Last situation (24s)
COMPOPTS= -mtune=native -pthread -finline-functions -fbacktrace -fzero-initialized-in-bss -fno-automatic -frecord-marker=4 -cpp -fforce-addr -fschedule-insns2 -ffp-contract=off LINKOPTS=-l ulimit -s unlimited
I have a .exe version compiled with Intel and its runtime is 7s. I know Intel is usually around 20-40% faster than GNU, so I think there is some room for improvement.

Table of optimization levels of the GNU C++ compiler g++, accurate?

Although I know each and every program is a different scenario, I have a rather specific question considering the below table.
Optimization levels of the GNU C++ compiler g++
Ox      WHAT IS BEING OPTIMIZED                         | EXEC | CODE | MEM | COMP
                                                        | TIME | SIZE |     | TIME
----------------------------------------------------------------------------------
O0      optimize for compilation time                   |  +   |  +   |  -  |  -
O1      optimize for code size and execution time #1    |  -   |  -   |  +  |  +
O2      optimize for code size and execution time #2    |  --  |  0   |  +  |  ++
O3      optimize for code size and execution time #3    |  --- |  0   |  +  |  +++
Ofast   O3 with fast, non-accurate math calculations    |  --- |  0   |  +  |  +++
Os      optimize for code size                          |  0   |  --  |  0  |  ++

+ increase    ++ increase more    +++ increase even more
- reduce      -- reduce more      --- reduce even more
I am using version 8.2, though this should be a generic table, taken from here and re-written as plain text.
My question is whether this table can be trusted; I don't know that website, so I'd better ask the professionals here. So, is the table more or less accurate?
Your table is broadly accurate.
Notice that GCC has a huge number of optimization options. Some obscure optimization passes are not even enabled at -O3 (GCC has several hundred optimization passes).
But there is no guarantee that -O3 always gives code which runs faster than the same code compiled with -O2. This is generally the case, but not always. You can find pathological (or just weird) C source code which, when compiled with -O3, gives slightly slower binary code than the same C source code compiled with -O2. For example, -O3 is likely to unroll loops "better" (at least "more") than -O2, but some code might perform worse if a particular loop in it is unrolled more. The Phoronix website and others benchmark GCC and observe such phenomena.
Be aware that optimization is an art; it is in general an intractable or undecidable problem, and current processors are so complex that there is no exact and complete model of their performance (think of caches, branch predictors, pipelines, out-of-order execution). Besides, the detailed micro-architecture of x86 processors is obviously not public (you cannot get the VHDL or chip layout of Intel or AMD chips). Hence, the -march= option to GCC also matters (the same binary code is not always good on both AMD and Intel chips, or even across several generations of Intel processors). So, if you compile code on the same machine that runs it, passing -march=native in addition to -O2 or -O3 is recommended.
People paid by Intel and by AMD are actively contributing to GCC, but they are not allowed to share all the knowledge they have internally about Intel or AMD chips. They are allowed to share (with the GPLv3+ license of GCC) the source code they are contributing to the GCC compiler. Probably engineers from AMD are observing the Intel-contributed GCC code to guess micro-architectural details of Intel chips, and vice versa.
And Intel or AMD interests obviously include making GCC working well with their proprietary chips. That corporate interests justify paying (both at Intel and at AMD) several highly qualified compiler engineers contributing full time to GCC.
In practice, I observed that both AMD and Intel engineers are "playing the game" of open source: they routinely contribute GCC code which also improves their competitor's performance. This is more a social & ethical & economical issue than a technical one.
PS. You can find many papers and books on the economics of open source.

GCC -mthumb against -marm

I am working on performance optimizations of ARM C/C++ code, compiled with GCC. CPU is Tegra 3.
As far as I know, the -mthumb flag means generating the old 16-bit Thumb instructions. In various tests, I see a 10-15% performance increase with -marm over -mthumb.
Is -mthumb used only for compatibility, while -marm is generally better for performance?
I am asking because android-cmake uses -mthumb in Release mode and -marm in Debug. This is very confusing to me.
Thumb is not the older instruction set but in fact the newer one, the current revision being Thumb-2, a mixed 16/32-bit instruction set. The original Thumb-1 instruction set was a compressed version of the original ARM instruction set: the CPU would fetch the instruction, decompress it into ARM and then process it. These days (ARMv7 and above), Thumb-2 is preferred for everything but performance-critical or system code. For example, GCC will by default generate Thumb-2 for ARMv7 (like your Tegra 3), as the higher code density provided by the 16/32-bit ISA allows for better icache utilization. But this is something which is very hard to measure in a normal benchmark, because most benchmarks will fit into the L1 icache anyway.
For more information check the Wikipedia site: http://en.wikipedia.org/wiki/ARM_architecture#Thumb
ARM is a 32-bit instruction set, so an instruction has more bits to do more things in a single instruction, while Thumb, with only 16 bits, might have to split the same functionality between two instructions. Based on the assumption that non-memory instructions take more or less the same time, fewer instructions mean faster code. There were also some things that just couldn't be done in Thumb code.
The idea was then that ARM would be used for performance-critical functionality, while Thumb (which fits two instructions into a 32-bit word) would be used to minimize the storage space of programs.
As CPU memory caching became more critical, having more instructions in the icache became a bigger determinant of speed than functional density per instruction. This meant that Thumb code became faster than the equivalent ARM code. ARM (the company) therefore created Thumb-2, a variable-length instruction set that incorporates most ARM functionality. In most cases Thumb-2 should give you denser as well as faster code, due to better caching.

Clang vs GCC - which produces faster binaries? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
I'm currently using GCC, but I discovered Clang recently and I'm pondering switching. There is one deciding factor though: the quality (speed, memory footprint, reliability) of the binaries it produces. If gcc -O3 can produce a binary that runs 1% faster, or Clang binaries take up more memory or just fail due to compiler bugs, it's a deal-breaker.
Clang boasts better compile speeds and lower compile-time memory footprint than GCC, but I'm really interested in benchmarks/comparisons of resulting compiled software - could you point me to some pre-existing resources or your own benchmarks?
Here are some up-to-date albeit narrow findings of mine with GCC 4.7.2
and Clang 3.2 for C++.
UPDATE: GCC 4.8.1 v clang 3.3 comparison appended below.
UPDATE: GCC 4.8.2 v clang 3.4 comparison is appended to that.
I maintain an OSS tool that is built for Linux with both GCC and Clang,
and with Microsoft's compiler for Windows. The tool, coan, is a preprocessor
and analyser of C/C++ source files and codelines of such: its
computational profile majors on recursive-descent parsing and file-handling.
The development branch (to which these results pertain)
comprises at present around 11K LOC in about 90 files. It is coded,
now, in C++ that is rich in polymorphism and templates but is still
mired in many patches by its not-so-distant past in hacked-together C.
Move semantics are not expressly exploited. It is single-threaded. I
have devoted no serious effort to optimizing it, while the "architecture"
remains so largely ToDo.
I employed Clang prior to 3.2 only as an experimental compiler
because, despite its superior compilation speed and diagnostics, its
C++11 standard support lagged the contemporary GCC version in the
respects exercised by coan. With 3.2, this gap has been closed.
My Linux test harness for current coan development processes roughly
70K sources files in a mixture of one-file parser test-cases, stress
tests consuming 1000s of files and scenario tests consuming < 1K files.
As well as reporting the test results, the harness accumulates and
displays the totals of files consumed and the run time consumed in coan (it just passes each coan command line to the Linux time command and captures and adds up the reported numbers). The timings are flattered by the fact that any number of tests which take 0 measurable time will all add up to 0, but the contribution of such tests is negligible. The timing stats are displayed at the end of make check like this:
coan_test_timer: info: coan processed 70844 input_files.
coan_test_timer: info: run time in coan: 16.4 secs.
coan_test_timer: info: Average processing time per input file: 0.000231 secs.
I compared the test harness performance as between GCC 4.7.2 and
Clang 3.2, all things being equal except the compilers. As of Clang 3.2,
I no longer require any preprocessor differentiation between code
tracts that GCC will compile and Clang alternatives. I built to the
same C++ library (GCC's) in each case and ran all the comparisons
consecutively in the same terminal session.
The default optimization level for my release build is -O2. I also
successfully tested builds at -O3. I tested each configuration 3
times back-to-back and averaged the 3 outcomes, with the following
results. The number in a data-cell is the average number of
microseconds consumed by the coan executable to process each of
the ~70K input files (read, parse and write output and diagnostics).
| -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 231 | 237 |0.97 |
----------|-----|-----|-----|
Clang-3.2 | 234 | 186 |1.25 |
----------|-----|-----|------
GCC/Clang |0.99 | 1.27|
Any particular application is very likely to have traits that play
unfairly to a compiler's strengths or weaknesses. Rigorous benchmarking
employs diverse applications. With that well in mind, the noteworthy
features of these data are:
-O3 optimization was marginally detrimental to GCC
-O3 optimization was importantly beneficial to Clang
At -O2 optimization, GCC was faster than Clang by just a whisker
At -O3 optimization, Clang was importantly faster than GCC.
A further interesting comparison of the two compilers emerged by accident
shortly after those findings. Coan liberally employs smart pointers and
one such is heavily exercised in the file handling. This particular
smart-pointer type had been typedef'd in prior releases for the sake of
compiler-differentiation, to be an std::unique_ptr<X> if the
configured compiler had sufficiently mature support for its usage as
that, and otherwise an std::shared_ptr<X>. The bias to std::unique_ptr was
foolish, since these pointers were in fact transferred around,
but std::unique_ptr looked like the fitter option for replacing
std::auto_ptr at a point when the C++11 variants were novel to me.
In the course of experimental builds to gauge Clang 3.2's continued need
for this and similar differentiation, I inadvertently built
std::shared_ptr<X> when I had intended to build std::unique_ptr<X>,
and was surprised to observe that the resulting executable, with default -O2
optimization, was the fastest I had seen, sometimes achieving 184
msecs. per input file. With this one change to the source code,
the corresponding results were these:
| -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.7.2 | 234 | 234 |1.00 |
----------|-----|-----|-----|
Clang-3.2 | 188 | 187 |1.00 |
----------|-----|-----|------
GCC/Clang |1.24 |1.25 |
The points of note here are:
Neither compiler now benefits at all from -O3 optimization.
Clang beats GCC just as importantly at each level of optimization.
GCC's performance is only marginally affected by the smart-pointer type
change.
Clang's -O2 performance is importantly affected by the smart-pointer type
change.
Before and after the smart-pointer type change, Clang is able to build a
substantially faster coan executable at -O3 optimisation, and it can
build an equally faster executable at -O2 and -O3 when that
pointer-type is the best one - std::shared_ptr<X> - for the job.
An obvious question that I am not competent to comment upon is why
Clang should be able to find a 25% -O2 speed-up in my application when
a heavily used smart-pointer-type is changed from unique to shared,
while GCC is indifferent to the same change. Nor do I know whether I should
cheer or boo the discovery that Clang's -O2 optimization harbours
such huge sensitivity to the wisdom of my smart-pointer choices.
UPDATE: GCC 4.8.1 v clang 3.3
The corresponding results now are:
| -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.1 | 442 | 443 |1.00 |
----------|-----|-----|-----|
Clang-3.3 | 374 | 370 |1.01 |
----------|-----|-----|------
GCC/Clang |1.18 |1.20 |
The fact that all four executables now take a much greater average time than previously to process
1 file does not reflect on the latest compilers' performance. It is due to the
fact that the later development branch of the test application has taken on a lot of
parsing sophistication in the meantime and pays for it in speed. Only the ratios are
significant.
The points of note now are not arrestingly novel:
GCC is indifferent to -O3 optimization
clang benefits very marginally from -O3 optimization
clang beats GCC by a similarly important margin at each level of optimization.
Comparing these results with those for GCC 4.7.2 and clang 3.2, it stands out that
GCC has clawed back about a quarter of clang's lead at each optimization level. But
since the test application has been heavily developed in the meantime one cannot
confidently attribute this to a catch-up in GCC's code-generation.
(This time, I have noted the application snapshot from which the timings were obtained
and can use it again.)
UPDATE: GCC 4.8.2 v clang 3.4
I finished the update for GCC 4.8.1 v Clang 3.3 saying that I would
stick to the same coan snapshot for further updates. But I decided
instead to test on that snapshot (rev. 301) and on the latest development
snapshot I have that passes its test suite (rev. 619). This gives the results a
bit of longitude, and I had another motive:
My original posting noted that I had devoted no effort to optimizing coan for
speed. This was still the case as of rev. 301. However, after I had built
the timing apparatus into the coan test harness, every time I ran the test suite
the performance impact of the latest changes stared me in the face. I saw that
it was often surprisingly big and that the trend was more steeply negative than
I felt to be merited by gains in functionality.
By rev. 308 the average processing time per input file in the test suite had
well more than doubled since the first posting here. At that point I made a
U-turn on my 10-year policy of not bothering about performance. In the intensive
spate of revisions up to 619 performance was always a consideration and a
large number of them went purely to rewriting key load-bearers on fundamentally
faster lines (though without using any non-standard compiler features to do so). It would be interesting to see each compiler's reaction to this U-turn.
Here is the now familiar timings matrix for the latest two compilers' builds of rev.301:
coan - rev.301 results
| -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 428 | 428 |1.00 |
----------|-----|-----|-----|
Clang-3.4 | 390 | 365 |1.07 |
----------|-----|-----|------
GCC/Clang | 1.1 | 1.17|
The story here is only marginally changed from GCC-4.8.1 and Clang-3.3. GCC's showing
is a trifle better. Clang's is a trifle worse. Noise could well account for this.
Clang still comes out ahead by -O2 and -O3 margins that wouldn't matter in most
applications but would matter to quite a few.
And here is the matrix for rev. 619.
coan - rev.619 results
| -O2 | -O3 |O2/O3|
----------|-----|-----|-----|
GCC-4.8.2 | 210 | 208 |1.01 |
----------|-----|-----|-----|
Clang-3.4 | 252 | 250 |1.01 |
----------|-----|-----|------
GCC/Clang |0.83 | 0.83|
Taking the 301 and the 619 figures side by side, several points stand out.
I was aiming to write faster code, and both compilers emphatically vindicate
my efforts. But:
GCC repays those efforts far more generously than Clang. At -O2
optimization Clang's 619 build is 46% faster than its 301 build; at -O3 Clang's
improvement is 31%. Good, but at each optimization level GCC's 619 build is
more than twice as fast as its 301 build.
GCC more than reverses Clang's former superiority. And at each optimization
level GCC now beats Clang by 17%.
Clang's ability in the 301 build to get more leverage than GCC from -O3 optimization
is gone in the 619 build. Neither compiler gains meaningfully from -O3.
I was sufficiently surprised by this reversal of fortunes that I suspected I
might have accidentally made a sluggish build of clang 3.4 itself (since I built
it from source). So I re-ran the 619 test with my distro's stock Clang 3.3. The
results were practically the same as for 3.4.
So as regards reaction to the U-turn: on the numbers here, Clang did much
better than GCC at wringing speed out of my C++ code when I was giving it no
help. When I put my mind to helping, GCC did a much better job than Clang.
I don't elevate that observation into a principle, but I take
the lesson that "Which compiler produces the better binaries?" is a question
that, even if you specify the test suite to which the answer shall be relative,
still is not a clear-cut matter of just timing the binaries.
Is your better binary the fastest binary, or is it the one that best
compensates for cheaply crafted code? Or best compensates for expensively
crafted code that prioritizes maintainability and reuse over speed? It depends on the
nature and relative weights of your motives for producing the binary, and of
the constraints under which you do so.
And in any case, if you deeply care about building "the best" binaries then you
had better keep checking how successive iterations of compilers deliver on your
idea of "the best" over successive iterations of your code.
Phoronix did some benchmarks about this, but it concerned a snapshot version of Clang/LLVM from a few months back. The result was that things were more or less a push; neither GCC nor Clang is definitively better in all cases.
Since you'd use the latest Clang, it's maybe a little less relevant. Then again, GCC 4.6 is slated to have some major optimizations for Core 2 and Core i7, apparently.
I figure Clang's faster compilation speed will be nicer for original developers, and then when you push the code out into the world, Linux distribution, BSD, etc. end-users will use GCC for the faster binaries.
The fact that Clang compiles code faster may not be as important as the speed of the resulting binary. However, here is a series of benchmarks.
There is very little overall difference between GCC 4.8 and Clang 3.3 in terms of speed of the resulting binary. In most cases code generated by both compilers performs similarly. Neither of these two compilers dominates the other one.
Benchmarks claiming that there is a significant performance gap between GCC and Clang are coincidental.
Program performance is affected by the choice of the compiler. If a developer or a group of developers is exclusively using GCC then the program can be expected to run slightly faster with GCC than with Clang, and vice versa.
From developer viewpoint, a notable difference between GCC 4.8+ and Clang 3.3 is that GCC has the -Og command line option. This option enables optimizations that do not interfere with debugging, so for example it is always possible to get accurate stack traces. The absence of this option in Clang makes clang harder to use as an optimizing compiler for some developers.
A peculiar difference I have noted on GCC 5.2.1 and Clang 3.6.2 is
that if you have a critical loop like:
for (;;) {
    if (!visited) {
        /* ... */
    }
    node++;
    if (!*node)
        break;
}
Then GCC will, when compiling with -O3 or -O2, speculatively
unroll the loop eight times. Clang will not unroll it at all. Through
trial and error I found that in my specific case with my program data,
the right amount of unrolling is five so GCC overshot and Clang
undershot. However, overshooting was more detrimental to performance, so GCC performed much worse here.
I have no idea if the unrolling difference is a general trend or
just something that was specific to my scenario.
A while back I wrote a few garbage collectors to teach myself more about performance optimization in C, and the results I got are, in my mind, enough to slightly favor Clang, especially since garbage collection is mostly about pointer chasing and copying memory.
The results are (numbers in seconds):
+---------------------+-----+-----+
|Type |GCC |Clang|
+---------------------+-----+-----+
|Copying GC |22.46|22.55|
|Copying GC, optimized|22.01|20.22|
|Mark & Sweep | 8.72| 8.38|
|Ref Counting/Cycles |15.14|14.49|
|Ref Counting/Plain | 9.94| 9.32|
+---------------------+-----+-----+
This is all pure C code, and I make no claim about either compiler's
performance when compiling C++ code.
These tests were run on Ubuntu 15.10 (Wily Werewolf), x86-64, with an AMD Phenom II X6 1090T processor.
The only way to determine this is to try it. FWIW, I have seen some really good improvements using Apple's LLVM GCC 4.2 compared to the regular GCC 4.2 (for x86-64 code with quite a lot of SSE), but YMMV for different code bases.
Assuming you're working with x86/x86-64 and that you really do care about the last few percent then you ought to try Intel's ICC too, as this can often beat GCC - you can get a 30-day evaluation license from intel.com and try it.
Basically speaking, the answer is: it depends.
There are many, many benchmarks focusing on different kinds of applications.
My benchmark on my application is: GCC > ICC > Clang.
There is rarely any I/O, but many CPU floating-point and data-structure operations.
The compile flags are -Wall -g -DNDEBUG -O3.
https://github.com/zhangyafeikimi/ml-pack/blob/master/gbdt/profile/benchmark

Optimization in GCC

I have two questions:
(1) I learned somewhere that -O3 is not recommended with GCC, because
The -O3 optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower.
In fact it should not be used system-wide with gcc 4.x. The behavior of gcc has changed significantly since version 3.x. In 3.x, -O3 has been shown to lead to marginally faster execution times over -O2, but this is no longer the case with gcc 4.x. Compiling all your packages with -O3 will result in larger binaries that require more memory, and will significantly increase the odds of compilation failure or unexpected program behavior (including errors). The downsides outweigh the benefits; remember the principle of diminishing returns. Using -O3 is not recommended for gcc 4.x.
Suppose I have a workstation (Kubuntu9.04) which has 128 GB of memory and 24 cores and is shared by many users, some of whom may run intensive programs using like 60 GB memory. Is -O2 a better choice for me than -O3?
(2) I also learned that when a running program crashes unexpectedly, any debugging information is better than none, so the use of -g is recommended for optimized programs, both for development and deployment. But when compiled with -ggdb3 together with -O2 or -O3, will it slow down the speed of execution? Assume I am still using the same workstation.
The only way to know for sure is to benchmark your application compiled with -O2 and -O3. Also, -O3 includes some individual optimization options that you can turn on and off individually. Concerning the warning about larger binaries: just comparing executable file sizes compiled with -O2 and -O3 will not tell you much, because it is the size of the small critical inner loops that matters most. You really have to benchmark.
It will result in a larger executable, but there shouldn't be any measurable slowdown.
Try it
You can rarely make accurate judgments about speed and optimisation without any data.
P.S. This will also tell you whether it's worth the effort. How many milliseconds saved in a function used once at startup is worthwhile?
Firstly, it does appear that the compiler team is essentially admitting that -O3 isn't reliable. It seems like they are saying: try -O3 on your critical loops or critical modules, or your Lattice QCD program, but it's not reliable enough for building the whole system or library.
Secondly, the problem with making the code bigger (inline functions and other things) isn't only that it uses more memory. Even if you have extra RAM, it can slow you down. This is because the faster the CPU chip gets, the more it hurts to have to go out to DRAM. They are saying that some programs will run faster WITH the extra routine calls and unexploded branches (or whatever O3 replaces with bigger things) because without O3 they will still fit in the cache, and that's a bigger win than the O3 transformations.
On the other issue, I wouldn't normally build anything with -g unless I was currently working on it.
-g and/or -ggdb just adds debugging symbols to the executable. It makes the executable file bigger, but that part isn't loaded into memory (except when run in a debugger or similar).
As for what's best for performance of -O2 and -O3, there's no silver bullet. You have to measure/profile it for your particular program.
In my experience, GCC does not generate the best assembly with -O2 or -O3. The best way is to apply specific optimization flags, which you can find in the list linked below; these can generate better code than -O2 and -O3 alone, because there are flags you will not find in -O2 and -O3 that can be useful for faster code.
One good example is that code and data prefetch instructions will never be inserted in your code with -O2 and -O3, but using the additional prefetch flags can make memory-intensive code 2-3% faster.
You can find list of GCC optimization flags at http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html.
I think this pretty much answers your question:
The downsides outweigh the benefits; remember the principle of diminishing returns. Using -O3 is not recommended for gcc 4.x.
If the guys writing the compiler say not to do it, I wouldn't second guess them.