Hot/cold splitting is an effective code optimization in LLVM.
This built-in LLVM pass is located at:
/llvm/lib/Transforms/IPO/HotColdSplitting.cpp
I want to use this pass to optimize my code, but I couldn't find any documentation on how to use it.
I already know that I should use the LLVM opt command to run the pass, but I couldn't find the proper way to apply it to my program.
I have two questions so far:
1) How do I use opt properly to run this pass on my code?
2) Can I enable this pass directly from clang when compiling C/C++ code, the way a switch like -fsanitize=address applies to the program being compiled?
Thanks.
You can pass the -mllvm -hot-cold-split=true flag to clang, which will enable the hot/cold splitting pass in the optimizer when compiling your file.
Yes, in principle you can directly use this pass (as of the time of answering the question); hot/cold splitting in LLVM, in its current form, only optimizes for code size. Alternatively you might want to try first collecting profiling data via PGO, and then feeding the profiling data into clang for it to take advantage of profile information during the build (which might help hot/cold splitting in terms of performance).
Hot cold splitting can be used to optimize an app for startup performance, as well as for runtime performance in some cases. To enable the hot cold splitting optimization, pass the flag to LLVM using -mllvm -hot-cold-split.
Hot cold splitting gives the best performance improvement in the presence of profile data, although it also optimizes applications without profile data by using built-in static analysis. For example, catch blocks and non-returning functions are already known to be cold, and hot cold splitting uses this information.
Currently there is no direct flag in the clang frontend to enable this, so you'll have to use -mllvm -hot-cold-split. For more details on hot cold splitting, the YouTube video from llvm-dev is quite informative: https://www.youtube.com/watch?v=Q8rqGg6vHAE
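To see what the pass has to work with, here is a minimal C++ sketch containing a region that static analysis can treat as cold: a branch that ends in a call to a non-returning function. The names fail and sum_checked are made up for illustration, and whether the region is actually outlined depends on the LLVM version and the pass's cost model. With opt, the pass is, as far as I know, registered as hotcoldsplit in the new pass manager (opt -passes=hotcoldsplit in.ll -S -o out.ll); confirm the name with opt --print-passes on your build.

```cpp
// Build (flag taken from the answers above; file name is an assumption):
//   clang++ -O2 -mllvm -hot-cold-split=true example.cpp -c
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical error handler. [[noreturn]] tells the compiler that control
// never comes back, so a block that calls it can be considered cold.
[[noreturn]] void fail(const char *msg) {
    std::fprintf(stderr, "fatal: %s\n", msg);
    std::abort();
}

// Sums the elements of data and aborts on the first negative element.
// The error branch is the candidate for being split out of the hot loop.
long sum_checked(const int *data, int n) {
    assert(n >= 0);
    long total = 0;
    for (int i = 0; i < n; ++i) {
        if (data[i] < 0) {
            // Cold region: only reached on invalid input.
            std::fprintf(stderr, "bad element at index %d\n", i);
            fail("negative input");
        }
        total += data[i];  // hot path
    }
    return total;
}
```

If the split happens, the object file should contain an extra function whose name starts with sum_checked and carries a .cold suffix; listing the symbols with llvm-nm is a quick way to check.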
Related
I'm working on a very demanding project (actually an interpreter), written exclusively in D, and I'm wondering what type of optimizations would generally be recommended. The project makes heavy use of GC, classes, associative arrays, and pretty much everything else.
Regarding compilation, I've already experimented both with DMD and LDC flags and LDC with -flto=full -O3 -Os -boundscheck=off seems to be making a difference.
However, as rudimentary as this may sound, I would like you to suggest anything that comes to your mind that could help speed up the performance, related or not to the D language. (I'm sure I'm missing several things).
Compiler flags: I would add -mcpu=native if the program will be running on your machine. Not sure what effect -Os has in addition to -O3.
Profiling has been mentioned in comments. Personally, under Linux I have a script which dumps a process's stack trace, and I run it a few times to get an idea of where the process is getting hung up.
Not sure what you mean by GS.
Since you mentioned classes: in D, methods are virtual by default; virtual methods add indirections and are not inlineable. Make sure only those methods that must be virtual are. See if you can rewrite your program using a form of polymorphism that doesn't involve indirections, such as using template metaprogramming.
Since you mentioned associative arrays: these make heavy use of the GC; to speed them up, switch to a third-party library that works on top of std.experimental.allocator, such as https://github.com/dlang-community/containers
If some parts of your code are parallelizable, std.parallelism is a good tool for this.
Since you mentioned that the project is an interpreter: there are many avenues for optimizing them, up to JIT/AOT compilation. Perhaps you could link to an existing library such as LLVM or libjit.
While looking for ways to speed up my simulation, I came across the --force-lto option.
I've heard about LTO (Link Time Optimization) before, so that made me wonder why isn't --force-lto the default while building gem5?
Would that make a simulation go much faster than a gem5.fast build compared to a gem5.opt build?
In gem5 fe15312aae8007967812350f8cdac9ad766dcff7 (2019), the gem5.fast build already enables LTO by default, so you generally never want to use that option explicitly, but rather want just gem5.fast.
Other things to also keep in mind about .fast:
it also removes -g and so you get no debug symbols. I wonder why, since that does not make runs any faster.
it also turns on NDEBUG, which has the standard library effect of disabling asserts entirely, plus some gem5-specific effects spread throughout the code with #ifndef NDEBUG checks
it disables TRACING_ON, which makes DPRINTF and family become empty statements as seen at: src/base/trace.hh
Those effects can be seen easily at src/SConstruct.
That option exists because the more common gem5.opt build also uses partial linking, which in some versions of GCC was incompatible with LTO.
Therefore, as its name suggests, --force-lto forces the use of LTO together with partial linking, which might not be stable. That's why I recommend that you use gem5.fast rather than touching --force-lto.
The goal of partial linking is presumably to speed up the link step, which can easily be the bottleneck in a "change one file, rebuild, relink, test" loop, although in my experiments it is not clear that it is effective at doing that. Today it might just be a relic from the past.
To try to speed up linking, I recommend that you try scons --gold-linker instead, which uses the GOLD linker instead of ld. Note that this option was more noticeably effective for gem5.debug however.
I have found that gem5.fast is generally 20% faster than gem5.opt for Atomic CPUs.
Is it easy to achieve high level of optimization with LLVM?
To give a concrete example, let's assume that I have a simple language that I want to write a compiler for.
simple functions
simple structs
tables
pointers (with arithmetic)
control structures
etc.
I can quite easily create a compile-to-C backend and rely on clang -O3.
Is it as easy to use LLVM API for that purpose?
Except perhaps for a few high-level (as in, aware of high-level language features or details that aren't encoded in LLVM IR) optimizations, Clang's backend does little more than generate straightforward IR and run some set of LLVM optimization passes on it. All of these (or at least most) should be available through the opt command and also as API calls when using the C++ libraries that all LLVM tools are built on. See the tutorial for a simple example. I see several advantages:
LLVM IR is far simpler than C and there's already a convenient API for generating it programmatically. To generate C, you either have lots of ugly and unreliable string fiddling or have to build an AST for the C language yourself. Or both.
You get to choose the set of optimizations yourself (it's quite possible that Clang's set of passes isn't ideal for the code the language supports and the IR representation your compiler generates). This also means you can, during development, just run the passes that check IR well-formedness (uncovering compiler bugs faster). You can just copy Clang's pass order, but if you feel like it, you can also experiment.
It will allow better compile times. Clang is fast for a C compiler, but you'd be adding unnecessary overhead: You generate C code, then Clang parses it, converts it to IR, and goes on to do pretty much what you could do right away.
You may have access to a broader range of features, or at least you'd get them more easily (i.e. without having to incorporate #defines, obscure pragmas, intrinsics or command line options to provide them). I'm talking about things like vectors, guaranteed (well, more than in C anyway - AFAIK, some code generators ignore them) tail calls, pure/readonly functions, more control over memory layout and type conversions (for instance zero extending vs. sign extending). Granted, you may not need most of them.
LLVM has built-in optimization passes, so you can achieve O3-like optimizations using the API.
I am developing a command line utility that has a LOT of flags. A typical command looks like this:
mycommand --foo=A --bar=B --jar=C --gnar=D --binks=E
In most cases, a 'success' message is printed but I still want to verify against other sources like an external database to ensure actual success.
I'm starting to create integration tests and I am unsure of the best way to do this. My main concerns are:
There are many many flag combinations, how do I know which combinations to test? If you do the math for the 10+ flags that can be used together...
Is it necessary to test permutations of flags?
How to build a framework capable of automating the tests and then verifying results.
How to keep track of a large number of flags and providing an order so it is easy to tell what combinations have been implemented and what has not.
The thought of manually writing out individual cases and verifying results in a unit-test like format is daunting.
Does anyone know of a pattern that can be used to automate this type of test? Perhaps even software that attempts to solve this problem? How did people working on GNU commandline tools test their software?
I think this is very specific to your application.
First, how do you determine the success of the execution of your application? Is it a result code? Is it something printed to the console?
For question 2, it depends how you parse those flags in your application. Most of the time, order of flags isn't important, but there are cases where it is. I hope you don't need to test for permutations of flags, because it would add a lot of cases to test.
In the general case, you should analyse the impact of each flag. It is possible that a flag doesn't interfere with the others, and then it just needs to be tested once. This is also the case for flags that are meant to be used alone (--help or --version, for example). You also need to analyse what values you should test for each flag. Usually, you want to try each kind of possible valid value, and each kind of possible invalid value.
I think a simple bash script could be written to perform the tests, or one in any scripting language, like Python. Using nested loops, you could try, for each flag, its possible values, including invalid values and the case where the flag isn't set. This will produce a multidimensional matrix of results, which should be analysed to see whether the results conform to what is expected.
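The nested-loop idea can be sketched without hand-writing one loop per flag. The answer suggests bash or Python; the version below is C++ only to stay with one language, and FlagSpec and all_commands are hypothetical names, not part of any library. It enumerates the cartesian product of the values listed for each flag, where an empty string stands for "flag not set".

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One flag and the values to try for it; "" means "leave the flag out".
struct FlagSpec {
    std::string name;
    std::vector<std::string> values;
};

// Returns one command line per combination of flag values.
std::vector<std::string> all_commands(const std::string &program,
                                      const std::vector<FlagSpec> &flags) {
    for (const FlagSpec &f : flags) assert(!f.values.empty());

    std::vector<std::string> commands;
    // index[i] selects the current value of flag i (a mixed-radix counter).
    std::vector<std::size_t> index(flags.size(), 0);
    while (true) {
        std::string cmd = program;
        for (std::size_t i = 0; i < flags.size(); ++i) {
            const std::string &v = flags[i].values[index[i]];
            if (!v.empty()) cmd += " --" + flags[i].name + "=" + v;
        }
        commands.push_back(cmd);

        // Advance the counter; stop once every digit has wrapped around.
        std::size_t pos = 0;
        while (pos < flags.size()) {
            if (++index[pos] < flags[pos].values.size()) break;
            index[pos] = 0;
            ++pos;
        }
        if (pos == flags.size()) break;
    }
    return commands;
}
```

Each returned string can then be run by the test harness and its result checked against the external source of truth. The number of commands is the product of the value counts, so flags that do not interact with the others are better left out of the product and tested once on their own.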
When I write apps (in scripting languages), I have a function that parses a command line string. I source the file that I'm developing and unit test that function directly rather than involving the shell.
I am working on a lock-free structure with the g++ compiler. It seems that with the -O1 switch, g++ changes the execution order of my code. How can I forbid g++ from optimizing a certain part of my code while keeping the optimization for the rest? I know I can split it into two files and link them, but that looks ugly.
If you find that gcc changes the order of execution in your code, you should consider using a memory barrier. Just don't assume that volatile variables will protect you from that issue. They will only make sure that in a single thread, the behavior is what the language guarantees, and will always read variables from their memory location to account for changes "invisible" to the executing code (e.g., changes to a variable made by a signal handler).
GCC supports OpenMP since version 4.2. You can use it to create a memory barrier with a special #pragma directive.
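As an alternative to the OpenMP pragma, and assuming a C++11 or later compiler, the same kind of ordering can be expressed with std::atomic. This is a different technique from the one named above, shown only as a minimal sketch; payload, ready, producer and consumer are illustrative names. The release store keeps the earlier write to payload from being moved after it, and the acquire load keeps the later read from being moved before it, for both the compiler and the CPU.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                 // plain data, written before the flag
std::atomic<bool> ready{false};  // publication flag

// Writer side: fill in the data, then publish it.
void producer() {
    payload = 42;
    // Release: the write to payload cannot be reordered after this store.
    ready.store(true, std::memory_order_release);
}

// Reader side: wait for the flag, then read the data.
int consumer() {
    // Acquire: the read of payload cannot be reordered before this load.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}
```

The two functions are meant to run on separate threads, for example one std::thread per function. Once consumer sees ready as true, it is guaranteed to see the value that producer wrote to payload.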
A very good insight into lock-free code is this PDF by Herb Sutter and Andrei Alexandrescu: C++ and the Perils of Double-Checked Locking
You can use the function attribute __attribute__((optimize("O0"))) to set the optimization level for a single function, or "#pragma GCC optimize" for a block of code. These are only available from GCC 4.4 onward, though, I think; check your GCC manual. If they aren't supported, separating the source is your only option.
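A minimal sketch of both forms, assuming GCC; NO_OPT, sum_attr and sum_pragma are illustrative names. The macro guard only keeps the file compiling on other compilers, where the attribute is not available. Note that the GCC manual describes the optimize attribute as a debugging aid rather than something for production code, and that neither form is a memory barrier.

```cpp
// The attribute is GCC-specific, so define it away elsewhere.
#if defined(__GNUC__) && !defined(__clang__)
#define NO_OPT __attribute__((optimize("O0")))
#else
#define NO_OPT
#endif

// Form 1: a single function compiled at -O0, whatever the command line says.
NO_OPT int sum_attr(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += v[i];
    return total;
}

// Form 2: every function between push_options and pop_options is compiled
// at -O0; the command-line optimization level is restored afterwards.
#pragma GCC push_options
#pragma GCC optimize ("O0")
int sum_pragma(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += v[i];
    return total;
}
#pragma GCC pop_options
```

Comparing the assembly from g++ -O2 -S for these functions with that of an unannotated copy should show whether the setting took effect.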
I would also say, though, that if your code fails with optimization turned on, it is most likely that your code is just wrong, especially as you're trying to do something that is fundamentally very difficult. The processor will potentially perform reordering on your code (within the limits of sequential consistency) so any re-ordering that you're getting with GCC could potentially occur anyway.