Clock_gettime showing high usage during profiling of code - clock

I am profiling a userland application on netbsd with gprof and seeing clock_gettime using upwards of 30% cycles. Gprof does not show where it is getting called from (it shows some function which clearly does not call clock_getttime).
The application uses third party code including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from that but could not determine much.
I don't understand why it would take that much of time. Any inputs will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles
Is there a way that one can optimize clock_gettime () or can we use any other call?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers

It's all relative to whatever else your program is doing, and keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.

Related

adding processing delay within a block in gnuradio

I am working on a block with gnuradio. I have come across a strange performance improvement when i was printing out some huge data on to terminal and the performance degrades without giving a print statement on to terminal.
I infer that while printing out to terminal i am giving gnuradio block an extra processing time for the block to process. This is just my hunch and might not be the exact reason. Kindly correct if this is not correct.
So, is there a way to add a specific amount of processing delay within a block(as what i got during printing out data to terminal) in gnuradio.
Thanks in advance
First of all, the obvious: don't print large amounts of data to a terminal. Terminals aren't meant for that, and your CPU will pretty quickly be the limiting factor, as it tries to render all the text.
I infer that while printing out to terminal i am giving gnuradio block an extra processing time for the block to process. This is just my hunch and might not be the exact reason. Kindly correct if this is not correct.
Printing to terminal is an IO thing to do. That means that the program handling the data (typically, the linux kernel might be handling the PTY data, or it might be directly handed off to the process of your terminal emulator) will set a limit on how it accepts data from the program printing.
Your GNU Radio block's work function is simply blocked, because the resources you're trying to use are limited.
So, can i add a specific amount of processing delay within a block(as what i got during printing out data to terminal) in gnuradio.
Yes, but it doesn't help in any way here.
You're IO bound. Do something that is not printing to terminal – makes a lot of sense, because you can't read that fast, anyway.

Minimum callgrind command for callgraph generation and profiling

I want to use callgrind to profile my program, but it is slowed down too much. What I want to do is generate a callgraph using kcachegrind where every node shows how much percentage the program spent in which function. Can you tell me which features I can safely disable for better performance so this info is still generated?
Thanks a lot!
Quick Overview
Callgrind is essentially a cache profiler (both instruction and data) that works at function-level granularity in order to reproduce the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.
However, this fine-grained simulation of cache events comes at a heavy cost of program runtime. You should know that even with all profiling turned off and no useful data being collected, Callgrind will still have a minimum of about 2-4x hit in runtime. When actively collecting data, it would be an average of 10-20x slower.
Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, speeding up large, uninteresting chunks of your program to only a 2-4x slowdown sounds reasonable, read on!
Available Hooks
Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:
Intrumentation state - When disabled, no program actions are observed and thus, no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum I mentioned above (see Nulgrind).
But be warned, this should be used carefully! While it offers attractive benefits, this can have non-trivial effects on accuracy. From the documentation:
However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrinds internal cache of instrumented code blocks, resulting in latency penalty at switching time.
Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to streamline collected data to only the interesting parts of your call stack.
However, intuitively, this does not offer any noticeable speedup in execution time. And of course, instrumentation needs to be switched on for collection to be enabled.
Commands
valgrind --tool=callgrind
--instr-atstart=<yes|no> ;; default = yes
--collect-atstart=<yes|no> ;; default = yes
--toggle-collect=<function> ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>
Instrumentation - Turning this off in the beginning indicates you have to turn it back on again at the appropriate time. 2 alternate ways to do this:
During program execution, use the following command from the shell at the appropriate time.
callgrind_control -i <on|off>
This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.
Insert the following macros into your program code and recompile your binary.
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_STOP_INSTRUMENTATION;
Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. 2 alternate ways to do this:
Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.
Tip: Wildcards are supported in the function name!
Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.
CALLGRIND_TOGGLE_COLLECT;
Summary
To combine all the ideas above, a good approach would be:
#include <callgrind.h>
// Uninteresting program chunk
CALLGRIND_START_INSTRUMENTATION;
// A few extra lines to allow cache warm-up
CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;
// Rest of the program
Recompile, and launch Callgrind with:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>
Note that there will be 2 Callgrind output files generated by this method - the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second log will report 0 events.
Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.

Memory/Address Sanitizer vs Valgrind

I want some tool to diagnose use-after-free bugs and uninitialized bugs. I am considering Sanitizer(Memory and/or Address) and Valgrind. But I have very little idea about their advantages and disadvantages. Can anyone tell the main features, differences and pros/cons of Sanitizer and Valgrind?
Edit: I found some of comparisons like: Valgrind uses DBI(dynamic binary instrumentation) and Sanitizer uses CTI(compile-time instrumentation). Valgrind makes the program much slower(20x) whether Sanitizer runs much faster than Valgrind(2x). If anyone can give me some more important points to consider, it will be a great help.
I think you'll find this wiki useful.
TLDR main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself, you also need to use relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs
One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback ?
The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Running it on a CUDA processor, and doing the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.

Is "the optimized delay" a myth or is it real?

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy how is so into code optimization that he optimizes his delay loop.
Since this really sounds like it's a strange thing to do as it's much better to start a "timer interrupt" instead of a optimized buzy wait,
and nobody ever tend to tells you the name of the optimizing hacker.
That has left me to wonder if it is a urban myth or is it real?
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something,
wonder if we can find a example.
Update: Just a little clarification, this question is about the "delay" function it self and how that is implemented, not how and where you call it.
And what that purpose was, and how that system became better.
Update: It's no myth, those guys seems to exist
Thanks
ShuggyCoUk
This has more than a kernel of truth about it...
Spin wait can be much better than a signal based interrupt or a yield.
You trade some throughput for much reduced latency.
Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler
memory allocation for example.
You can get considerably finer grained control of the interval waited since you can essentially measure the cycle count.
However spin waits are tricky to get right.
If you can you should use use proper idle instructions which:
can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster.
In Hyper Thread based CPUs you allow the other logical thread to use the full CPU pipeline while you spin.
an instruction you might think was a no-op could cause the CPU to execute them out of order via the super scalar execution units. The resulting code may get unforeseen out of order artefacts which force the CPU to apply a great deal of effort in terms of stalls and memory barriers which are unwanted.
This is why you let someone else write the spin wait loop for you in most cases..
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
As to some citations of using PAUSE for spin waits:
PostGresSQL
Linux
See also the note that this is better on non P4 as well due to reducing power
The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice Of Programming, but even there they admit it may be an Urban Myth.
I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.