CUDA profiler shows strange gaps? - optimization

I am trying to figure out what a profile result means before I start optimizing. I am very new to CUDA and to profiling in general, and I am confused by the result.
Specifically, I want to know what is happening during seemingly unoccupied chunks of computation. When I look from top to bottom at the CPU and the GPU, there appears to be nothing happening during large portions of the run: these look like columns with nothing in Thread1 and nothing in GeForce. Is this normal? What's happening here?
The run was done on a multicore machine under no load with nvprof. The GPU code was compiled with -arch=sm_20 -m32 -g -G for CUDA 5.

The error here was to profile the code in debug mode (the -G compiler flag: "Generate debug information for device code"). Debug mode deeply changes the behavior of the program and should not be used when profiling and optimizing one's code.
One other thing: thorough documentation of nvcc's debug mode is hard to find. nvcc probably dumps the registers/shared memory to global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. the discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, tools such as cuda-memcheck --tool racecheck should be run against release builds too.
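To make that concrete, here is a hedged sketch of the canonical shared-memory hazard racecheck is meant to catch. The kernel and all names are invented for illustration, not taken from the original question:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Deliberately buggy: thread 0 reads buf[] before the other threads are
// guaranteed to have written it. Per the answer above, a -G build may mask
// this hazard, while `cuda-memcheck --tool racecheck` on a release build
// should report it.
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    // BUG: a __syncthreads() is missing here.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < blockDim.x; ++i) s += buf[i];
        out[blockIdx.x] = s;
    }
}

int main() {
    float *in, *out;
    cudaMalloc((void**)&in, 256 * sizeof(float));
    cudaMalloc((void**)&out, sizeof(float));
    cudaMemset(in, 0, 256 * sizeof(float));
    block_sum<<<1, 256>>>(in, out);
    cudaDeviceSynchronize();
    std::printf("kernel finished\n");
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Compiling this twice, once with -G and once with -O2, and running racecheck on both builds is a quick way to see the difference the answer describes.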

Related

Is it possible to set a baseline memory usage in valgrind for leak detection?

Is there a way to tell valgrind from inside my code when to start and when to stop checking for memory leaks?
I am using a legacy testing framework which must link with my testing program in order to run. The framework has memory leaks in it - valgrind shows about 50KB of memory that has not been released, but is reachable via heuristic. This is annoying, because I must keep this number in mind to see how much memory is leaked from my code. It would be a lot more convenient if I could tell valgrind to start collecting memory stats when my first test begins, and stop collecting when the last test is over. Is there an API for it?
Valgrind memcheck allows you to do a "differential" leak search, which reports the delta between the previous leak search and the current situation.
You can do such a differential leak search using monitor commands with vgdb, either from the shell or from gdb. See https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.monitor-commands.
You can also use the client request VALGRIND_DO_CHANGED_LEAK_CHECK from within your program; see https://www.valgrind.org/docs/manual/mc-manual.html#mc-manual.clientreqs.
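A minimal sketch of the client-request route; run_all_tests() is a hypothetical stand-in for the legacy framework's entry point, and the macros are no-ops when the binary is not running under Valgrind:

```cpp
#include <valgrind/memcheck.h>

static void run_all_tests() { /* the framework's tests would run here */ }

int main() {
    // Assume the framework's leaky initialization has already run by here.
    // Take a first leak-search snapshot so later searches report deltas.
    VALGRIND_DO_LEAK_CHECK;

    run_all_tests();

    // Report only blocks whose leak status changed since the search above,
    // i.e. excluding the framework's pre-existing ~50 KB.
    VALGRIND_DO_CHANGED_LEAK_CHECK;
    return 0;
}
```

Run it as usual under valgrind --leak-check=full; the output of the second search then reflects only leaks introduced by your own tests.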

Would a Vulkan program run on a device without gpu (discrete or integrated)?

Perhaps this question could be rephrased as "what would happen if I were to try to run a Vulkan program on a CPU-only build?"
I'm wondering whether the program would run but produce no output, crash, or fail to build in the first place (although I expect the build process to target a CPU architecture rather than a GPU architecture).
Would it use the on-motherboard graphics to produce output? In that case, what would happen if the program were run on a CPU-only server?
It depends on how the program initializes Vulkan.
Any machine can have the Vulkan loader installed; this is the dynamically loaded library that finds the actual driver. If it is missing, the program will be unable to load the loader and will either fail to start or show an error message, depending on how it tries to load it.
If no device is available, then the number of devices reported is 0. This is again up to the application to manage, either by falling back to an alternative graphics API (OpenGL) or by showing an error message and failing to start.
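As an illustrative sketch (not from the answer): a program that links the loader directly never reaches main() if the loader library is absent, but it can handle the other two failure modes explicitly:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

int main() {
    // If the loader library itself is absent and the program links it
    // directly, we never get here: the dynamic linker fails at startup.
    VkInstanceCreateInfo info = {};
    info.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;

    VkInstance instance;
    if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS) {
        // Loader present, but no usable driver (e.g. a CPU-only server).
        std::fprintf(stderr, "Vulkan unavailable; fall back or exit.\n");
        return 1;
    }

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    if (count == 0) {
        std::fprintf(stderr, "Vulkan loaded, but zero devices found.\n");
        vkDestroyInstance(instance, NULL);
        return 1;
    }

    std::printf("%u Vulkan device(s) available.\n", count);
    vkDestroyInstance(instance, NULL);
    return 0;
}
```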

Running cachegrind on OpenJDK JVM

I want to use cachegrind to do some performance profiling on the OpenJDK JVM. (BTW, if this is Not A Good Idea, I would like to understand why.)
The problem is that it keeps tripping assertions in the JVM. So what can I do to get a run under cachegrind? Or else, please tell me why this wouldn't work, and suggest an alternative to cachegrind if you can. (Note that I have looked at and used perf. It's just that I was curious how different a tool like cachegrind/callgrind would be in terms of results.)
Valgrind and the tools based on it are typically not very good for performance profiling. Valgrind is a kind of virtual machine that does not execute the original code directly but rather translates it. Programs often run much slower under Valgrind, and this can change the application's execution scenario.
Valgrind also does not work well with dynamic and self-modifying code. With the JVM's JIT compilation, this can result in the random assertion failures and JVM crashes that you are apparently observing. Valgrind is known to be much more stable when Java is run with the JIT disabled, i.e. with -Xint. But profiling in -Xint mode does not make much sense, since it does not reflect the real application performance.
perf is the preferable way to do this kind of measurement: it relies on hardware counters and does not modify the executed code. For deeper performance analysis there is Intel VTune Amplifier. It is not free, but a trial version is available.

Clock_gettime showing high usage during profiling of code

I am profiling a userland application on NetBSD with gprof and seeing clock_gettime consuming upwards of 30% of cycles. gprof does not show where it is being called from (it shows some function which clearly does not call clock_gettime).
The application uses third-party code, including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from there but could not determine much.
I don't understand why it would take that much time; any input will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles?
Is there a way to optimize clock_gettime(), or some other call we could use?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers.
It's all relative to whatever else your program is doing. Keep in mind that gprof only sees CPU time, so if you're doing any I/O, the actual CPU time your program uses may be small, and gprof sees nothing else.
So if some calls to timing routines get stuck in there and they are called often enough, sure, they can show a high percentage.
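If the calls come from your own polling loop rather than from inside libevent, one common mitigation is to read the clock once per batch of iterations instead of on every one. A sketch under that assumption; do_work() is a hypothetical stand-in:

```cpp
#include <time.h>
#include <stdio.h>

static void do_work(void) { /* stand-in for the real per-iteration work */ }

int main() {
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    now = start;

    long iterations = 0;
    // Run for ~2 seconds, querying the clock only every 1024 iterations
    // instead of once per iteration.
    while (now.tv_sec - start.tv_sec < 2) {
        do_work();
        if ((++iterations & 1023) == 0)
            clock_gettime(CLOCK_MONOTONIC, &now);
    }
    printf("%ld iterations\n", iterations);
    return 0;
}
```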
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds: the caller arc is recorded by instrumentation compiled into the callee, and a system library routine like clock_gettime usually isn't built with -pg, so its time shows up without reliable caller attribution.
Anyway, that's gprof.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss to figure out both what the right questions are and which tool can give me the answers.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux, then the CUDA Visual Profiler gives you a whole load of information; knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current hardware). Any other loads are inefficient. The profiling information will probably improve with future hardware.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization; the presentation goes into more detail on what to do about this, as does the SDK (e.g. the reduction sample).
Overlap I/O and compute: this is where Nexus really shines (you can get the same information manually using cudaEvents; see the sketch after this list). If you have a large amount of data transfer, you want to overlap the compute and the I/O.
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput).
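As mentioned in the overlap item above, here is a hedged sketch of the manual cudaEvents approach; my_kernel and the sizes are invented for illustration:

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void my_kernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;   // placeholder work
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost((void**)&h, n * sizeof(float));  // pinned host memory,
    cudaMalloc((void**)&d, n * sizeof(float));      // needed for async copies

    cudaEvent_t start, afterCopy, afterKernel;
    cudaEventCreate(&start);
    cudaEventCreate(&afterCopy);
    cudaEventCreate(&afterKernel);

    // Bracket the transfer and the kernel with events on the same stream.
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, 0);
    cudaEventRecord(afterCopy, 0);
    my_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(afterKernel, 0);
    cudaEventSynchronize(afterKernel);

    float copyMs, kernelMs;
    cudaEventElapsedTime(&copyMs, start, afterCopy);
    cudaEventElapsedTime(&kernelMs, afterCopy, afterKernel);
    printf("copy %.3f ms, kernel %.3f ms\n", copyMs, kernelMs);

    // If the copy time rivals the kernel time, split the work across two
    // streams so chunk k+1's copy overlaps chunk k's kernel.
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```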
This is just a start; check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it or use a profiler such as OProfile or RotateRight/Zoom that can give you equivalent information.
Run it on the CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.