Measuring build times to identify bottlenecks

Measuring build times to identify bottlenecks - optimization

I'm working on improving the build for a few projects. I've improved build times quite significantly, and I'm at a point now where I think the bottlenecks are more subtle.
The build uses GNU style makefiles. I generate a series of dependency files (.d) and include them in the makefile, otherwise there's nothing fancy going on (eg, no pre-compiled headers or other caching mechanisms).
The build takes about 95 seconds on a 32-core sparc ultra, running with 16 threads in parallel. Idle time hovers around 80% while the build runs, with kernel time hovering between 8-10%. I put the code in /tmp, but most of the compiler support files are NFS mounted and I believe this may be creating a performance bottleneck.
What tools exist for measuring & tracking down these sorts of problems?

From my own experience, compiling C/C++ code requires reading a lot of header files by C preprocessor. I've experienced situations when it took more than 50% of g++ run-time to generate a complete translation unit.
As you mentioned that it idles 80% when compiling it must be waiting for I/O then. iostat and DTrace would be a good starting point.

Related

Memory/Address Sanitizer vs Valgrind

I want some tool to diagnose use-after-free bugs and uninitialized bugs. I am considering Sanitizer(Memory and/or Address) and Valgrind. But I have very little idea about their advantages and disadvantages. Can anyone tell the main features, differences and pros/cons of Sanitizer and Valgrind?
Edit: I found some of comparisons like: Valgrind uses DBI(dynamic binary instrumentation) and Sanitizer uses CTI(compile-time instrumentation). Valgrind makes the program much slower(20x) whether Sanitizer runs much faster than Valgrind(2x). If anyone can give me some more important points to consider, it will be a great help.

I think you'll find this wiki useful.
TLDR main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself, you also need to use relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs

One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).

HLSL compilation speed

I have one relatively complicated shader, which I want to compile.
Shader has ~700 lines, which are compiled into ~3000 instructions.
Compilation time is with fxc (Windows 8 SDK) about 90 seconds.
I have another shader of similar size and compilation time is 20 seconds.
So here are my questions:
Is possible to speed up the compilation from application viewpoint (faster version of fxc or fxc alternative)?
Is possible to speed up the compilation from code view point (is code constructs which massively slows down the compilation - which ones, how avoid them)?
Is possible to speed up the compilation from fxc settings viewpoint (some secre options as --fast-compile or whatever)?
Edit:
Parallel thread on msdn forum:
https://social.msdn.microsoft.com/Forums/en-US/5e60c68e-8902-48d6-b497-e86ac4f2cfe7/hlsl-compilation-speed?forum=vclanguage

There is no "faster" fxc or d3dcompile library.
You can different things to speed up things, turn off optimisation is one of them, as the driver will optimise anyway from dxbc to final microcode.
But the best advice is to implement a shader cache, if you for example pre-process and hash the shader file and trigger compilation only if it is actually different, you will save time.
The d3dcompile library is multi thread safe, and you want to take advantage of multi core CPU. Implementing the include interface to cache file load can be valuable too if you compile many shader.
Finally, when everything fail, you have no choice but experiment and find what takes that long, and do some rewrite, sometimes, a [branch] or [unroll] on the culprit may be enough to solve the compilation time.

Why is the shader compilation time a problem? Fxc is an offline compiler, meaning that the resulting bytecode is hardware independent and can be distributed with your application.
If you're looking to cut down on the iteration times during development, disabling optimalizations with the "/Od" command line option should help.

Speeding up the Dojo Build

We are running a build of our application using Dojo 1.9 and the build itself is taking an inordinate amount of time to complete. Somewhere along the lines of 10-15 minutes.
Our application is not huge by any means. Maybe 150K LOC. Nothing fancy. Furthermore, when running this build locally using Node, it takes less than a minute.
However, we run the build on a RHEL server with plenty of space and memory, using Rhino. In addition, the tasks are invoked through Ant.
We also use Shrinksafe as the compression mechanism, which could also be the problem. It seems like Shrinksafe is compressing the entire Dojo library (which is enormous) each time the build runs, which seems silly.
Is there anything we can do to speed this up? Or anything we're doing wrong?

Yes, that is inordinate. I have never seen a build take so long, even on an Atom CPU.
In addition to the prior suggestion to use Node.js and not Rhino (by far the biggest killer of build performance), if all of your code has been correctly bundled into layers, you can set optimize to empty string (don’t optimize) and layerOptimize to "closure" (Closure Compiler) in your build profile so only the layers will be run through the optimizer.
Other than that, you should make sure that there isn’t something wrong with the system you are running the build on. (Build files are on NAS with a slow link? Busted CPU fan forcing CPUs to underclock? Ancient CPU with only a single core? Insufficient/bad RAM? Someone else decided to install a TF2 server on it and didn’t tell you?)

Clock_gettime showing high usage during profiling of code

I am profiling a userland application on netbsd with gprof and seeing clock_gettime using upwards of 30% cycles. Gprof does not show where it is getting called from (it shows some function which clearly does not call clock_getttime).
The application uses third party code including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from that but could not determine much.
I don't understand why it would take that much of time. Any inputs will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles
Is there a way that one can optimize clock_gettime () or can we use any other call?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers

It's all relative to whatever else your program is doing, and keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.

Optimisation , Compilers and Its Effects

(i) If a Program is optimised for one CPU class (e.g. Multi-Core Core i7)
by compiling the Code on the same , then will its performance
be at sub-optimal level on other CPUs from older generations (e.g. Pentium 4)
... Optimizing may prove harmful for performance on other CPUs..?
(ii)For optimization, compilers may use x86 extensions (like SSE 4) which are
not available in older CPUs.... so ,Is there a fall-back to some non-extensions
based routine on older CPUs..?
(iii)Is Intel C++ Compiler is more optimizing than Visual C++ Compiler or GCC..
(iv) Will a truly Multi-Core Threaded application will perform effeciently on a
older CPUs (like Pentium III or 4)..?

Compiling on a platform does not mean optimizing for this platform. (maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not not be chosen if it means a major hit for another common platform.
It also depends in the performance differences in the instruction sets - my impression is that they've become much less in the last decade (but I haven't delved to deep lately - might be wrong for the latest generations). Also consider that optimizations make a notable difference only in few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
the "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.

It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
Probably not.
Impossible to generalise. You have to test your code and come to your own conclusions.
Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.

I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics, these will check internally for the best performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction instrinsics here (those are specific to an instruction set), but intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though; Intel compilers are known not to compile well for running on anything other than Intel (e.g.: intrinsic functions may not recognize the instruction set of a AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes it may perform as well on older CPU. Multi-Core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g.: memory bandwidth, NUMA, chip-to-chip bus), and differences in the Multi-Core communication (e.g.: cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but that is not what you are asking I believe. So on the whole, a MP program made for newer CPUs, should not be using less efficiently the MP aspects of older CPUs. Or in other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to more efficiently use a specific CPU (e.g.: A shared cache may permit you to use an algorithm that exchanges more data between working threads, but this algo will die on a system with no shared cache, full bus lock and low memory latency/bandwidth), but it involves a lot more than just MP related tweaks.

(1) Not only is it possible but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock the newer processor was slower for the current mainstream applications and operating systems (including Linux). The 32 to 64 bit transition is not helping, more cores and less clock speed is making it even worse. And this is true going backward as well for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, most of the time you dont. There are new instructions yes, and to support them would probably mean trap for an undefined instruction and have a software emulation of that instruction which would be horribly slow and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions but more than that the bulk of the optimization that I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So you arrange them to be fast on one generation processor they will be slower on another because in the x86 family the cores change too much. AMD had a good run there for a while as they would make the same code just run faster instead of trying to invent new processors that eventually would be faster when the software caught up. No longer true both amd and intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example gcc is a horrible compiler, one size fits all fits no one well, it can never and will never be any good at optimizing. For example gcc 4.x code is slower on gcc 3.x code for the same processor (yes all of this is subjective, it all depends on the specific application being compiled). The in house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price though? That is the question.
In general because of the horrible new programming languages and gobs of memory, storage, layers of caching, software engineering skills are at an all time low. Which means the pool of engineers capable of making a good compiler much less a good optimizing compiler decreases with time, this has been going on for at least 10 years. So even the in house compilers are degrading with time, or they just have their employees to work on and contribute to the open source tools instead having an in house tool. Also the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope to just run without crashing and not so much try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line, gcc has singlehandedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. Your operating system that you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that the binaries optimized for your pentium III ran slower on your Pentium 4 and vice versa. Code written to work well on multi core processors will run slower on single core processors than if you had optimized the same application for a single core processor.
The root of the problem is the x86 instruction set is dreadful. So many far superior instructions sets have come along that do not require hardware tricks to make them faster every generation. But the wintel machine created two monopolies and the others couldnt penetrate the market. My friends keep reminding me that these x86 machines are microcoded such that you really dont see the instruction set inside. Which angers me even more that the horrible isa is just an interpretation layer. It is kinda like using Java. The problems you have outlined in your questions will continue so long as intel stays on top, if the replacement does not become the monopoly then we will be stuck forever in the Java model where you are one side or the other of a common denominator, either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas