Can we use massif to measure only a certain period? - valgrind

Does valgrind --tool=massif have controls similar to Callgrind's for profiling memory during only a certain period? Can we turn profiling on and off during a program run?

No. Massif does not even have a header with user request macros.
It's not what you're asking, but if you're using Massif, you might be interested in the new xtree result visualization feature, part of the imminent Valgrind 3.13 release.

Related

Minimum callgrind command for callgraph generation and profiling

I want to use Callgrind to profile my program, but it slows it down too much. What I want to do is generate a call graph using kcachegrind where every node shows what percentage of its time the program spent in each function. Can you tell me which features I can safely disable for better performance while still generating this information?
Thanks a lot!
Quick Overview
Callgrind is essentially a cache profiler (both instruction and data) that works at function-level granularity in order to reproduce the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.
However, this fine-grained simulation of cache events comes at a heavy cost in program runtime. Be aware that even with all profiling turned off and no useful data being collected, Callgrind still imposes a minimum slowdown of about 2-4x. When actively collecting data, it averages 10-20x slower.
Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, speeding up large, uninteresting chunks of your program to only a 2-4x slowdown sounds reasonable, read on!
Available Hooks
Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:
Instrumentation state - When disabled, no program actions are observed, and thus no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum mentioned above (see Nulgrind).
But be warned, this should be used carefully! While it offers attractive benefits, this can have non-trivial effects on accuracy. From the documentation:
However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrinds internal cache of instrumented code blocks, resulting in latency penalty at switching time.
Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to streamline collected data to only the interesting parts of your call stack.
However, intuitively, this does not offer any noticeable speedup in execution time. And of course, instrumentation needs to be switched on for collection to be enabled.
Commands
valgrind --tool=callgrind
--instr-atstart=<yes|no> ;; default = yes
--collect-atstart=<yes|no> ;; default = yes
--toggle-collect=<function> ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>
Instrumentation - Turning this off at the start means you have to turn it back on at the appropriate time. There are two alternative ways to do this:
During program execution, use the following command from the shell at the appropriate time.
callgrind_control -i <on|off>
This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.
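For example, to switch instrumentation on in one specific running process (the PID here is a hypothetical placeholder; callgrind_control accepts an optional PID or program name):
callgrind_control -i on 1234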
Insert the following macros into your program code and recompile your binary.
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_STOP_INSTRUMENTATION;
Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. There are two alternative ways to do this:
Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.
Tip: Wildcards are supported in the function name!
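For example, to collect data only within a particular parent function and its callees (the function name pattern here is a hypothetical placeholder):
valgrind --tool=callgrind --collect-atstart=no --toggle-collect='MyClass::process*' <PROGRAM> <PROGRAM_OPTIONS>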
Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.
CALLGRIND_TOGGLE_COLLECT;
Summary
To combine all the ideas above, a good approach would be:
#include <valgrind/callgrind.h>
// Uninteresting program chunk
CALLGRIND_START_INSTRUMENTATION;
// A few extra lines to allow cache warm-up
CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;
// Rest of the program
Recompile, and launch Callgrind with:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>
Note that this method generates two Callgrind output files - the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second file will report 0 events.
Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.
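As a minimal sketch of that multi-chunk idea (the two phase functions are hypothetical placeholder workloads; CALLGRIND_DUMP_STATS_AT is the variant of the dump macro that attaches a label to each dump):

#include <valgrind/callgrind.h>

// Hypothetical placeholder workloads standing in for real program phases.
static long phase_one(void) { long s = 0; for (long i = 0; i < 1000000; ++i) s += i; return s; }
static long phase_two(void) { long s = 0; for (long i = 0; i < 500000; ++i) s += i * i; return s; }

int main(void) {
    CALLGRIND_START_INSTRUMENTATION;

    CALLGRIND_TOGGLE_COLLECT;              // collection on
    phase_one();
    CALLGRIND_TOGGLE_COLLECT;              // collection off
    CALLGRIND_DUMP_STATS_AT("phase_one");  // labelled dump; counters are zeroed

    CALLGRIND_TOGGLE_COLLECT;
    phase_two();
    CALLGRIND_TOGGLE_COLLECT;
    CALLGRIND_DUMP_STATS_AT("phase_two");

    CALLGRIND_STOP_INSTRUMENTATION;
    return 0;
}

Each dump produces its own output file, so the chunks can be inspected separately with kcachegrind or callgrind_annotate. Outside of Valgrind, the macros compile to no-ops.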

How to generate metrics or reports using JProfiler?

I have successfully started my application in profiling mode, but I am not sure how to generate reports or metrics from JProfiler.
I can see the live memory views (all objects, recorded objects, instance counts, etc.), the heap walker, and so on, but I am not sure what JProfiler concludes or recommends about my application.
Can someone help?
The approach you're describing is JProfiler's live profiling session. The objective is essentially to look at the charts it produces and identify anomalies.
For example, on CPU profiling, you will be looking at CPU Hot Spots (i.e. individual methods that consume a disproportionate amount of time).
In the memory profiler, you will be able to identify the objects that occupy the most memory (also hot spots).

Valgrind vs. Linux perf correlation

Suppose that I choose the perf events instructions, LLC-load-misses, and LLC-store-misses, and that I test a program prog while varying its input. Is Valgrind supposed to give me the "same" functional results for the same input and the same counter? That is, if one value in perf goes up, should the corresponding value in Valgrind always go up as well? Is there any impact from Valgrind being a simulation that I should be aware of while profiling my code?
EDIT: BTW, before people grill me for not experimenting myself, I have to say that I (kinda) have; the problem is that I have a Sandy Bridge processor, and perf has a "bug" that prevents me from measuring LLC-* events. There is a patch, but I don't feel like recompiling my kernel...
Well, Cachegrind is a cache simulator. Even though it tries to mimic some of your hardware's characteristics (cache size, associativity, etc), it does not model every single feature and behavior of your system. Therefore you might in some cases see some differences.
For example, Valgrind's doc states that "Cachegrind simulates branch predictors intended to be typical of mainstream desktop/server processors of around 2004". Sandy Bridge processors first appeared in 2011, and you can guess that branch predictors have improved quite a lot since 2004.
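If you want to line the two up empirically, something along these lines should work (prog is a placeholder; Cachegrind reports last-level cache misses as LL misses in its output):
perf stat -e instructions,LLC-load-misses,LLC-store-misses ./prog
valgrind --tool=cachegrind ./prog
cg_annotate cachegrind.out.<pid>
The absolute numbers will differ, since one comes from hardware and the other from a simulation; the interesting question is whether they move in the same direction as you vary the input.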
That being said, Valgrind is still a wonderful tool to have in your toolbox.
What's the problem with perf's LLC events on Sandy Bridge processors? I use these events everyday at work on my Sandy Bridge laptop and it works as expected (archlinux 64bits, linux 3.6).

VisualVM CPU profiling: run() methods obscure the results

I am profiling a big JBoss server with a lot of classes in it. When I profile the CPU, the result is always something like java.util.TimerThread.run() = 62% and java.util.concurrent.ThreadPoolExecutor$Worker.run() = 34.8%.
Beneath these two methods, thousands of other methods show 0%.
I think that's a bad bug, because most of these methods run in those threads. But how can I see which ones...
The thread dump function isn't useful for this either.
If you don't know which part of the code is slow, it is better to start with CPU sampling. Once you know better (based on the sampling results) what is wrong, you can profile just part of your JBoss server. See Profiling With VisualVM, Part 1 and Profiling With VisualVM, Part 2 for more information about profiling and how to set profiling roots and the instrumentation filter.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, but knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do, and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current hardware). Any other loads are inefficient. The profiling information will probably improve with future hardware.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample).
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents; see the sketch after this list). If you have a large amount of data transfer, you want to overlap the compute and the I/O.
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput).
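As a minimal sketch of the cudaEvents approach (the kernel and the sizes are hypothetical placeholders), recording events around the transfer and the kernel tells you where the time goes and how much overlap is worth chasing:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: scales an array in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost(&h_data, bytes);  // pinned host memory, needed for async copies
    cudaMalloc(&d_data, bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    // Bracket the transfer and the kernel with events on the same stream.
    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(afterCopy, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
    cudaEventRecord(afterKernel, stream);
    cudaEventSynchronize(afterKernel);

    float copyMs = 0.0f, kernelMs = 0.0f;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("H2D copy: %.3f ms, kernel: %.3f ms\n", copyMs, kernelMs);

    cudaEventDestroy(start);
    cudaEventDestroy(afterCopy);
    cudaEventDestroy(afterKernel);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}

If the copy time is comparable to the kernel time, splitting the work into chunks across two streams lets one chunk's copy overlap with the previous chunk's kernel.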
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
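To make the access-pattern point concrete, here is a hedged illustration (hypothetical kernels; the stride of 32 is arbitrary): both kernels move the same amount of data, but consecutive threads reading consecutive addresses coalesce into few memory transactions, while strided reads multiply them - exactly the kind of difference the profiler's memory counters expose:

#include <cuda_runtime.h>

__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];                 // thread i -> element i: coalesced
}

__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];  // scattered reads: uncoalesced
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    // Run both under the profiler and compare the memory transaction counters:
    // same amount of work, very different efficiency.
    copy_coalesced<<<(n + 255) / 256, 256>>>(in, out, n);
    copy_strided<<<(n + 255) / 256, 256>>>(in, out, n, 32);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}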
Maybe you could post your kernel code here and get some feedback ?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on a vanilla processor, and either take stackshots of it, or use a profiler such as OProfile or RotateRight/Zoom that can give you equivalent information.
Run it on a CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.