I am trying to use gprof to profile my code. However, when I compile with the flag -pg, gprof displays no time accumulated and the report contains all 0's. The version I have is GNU gprof (GNU Binutils for Ubuntu) 2.34. How can I solve this issue?
The program is definitely running for quite a while (>30 seconds) and it contains only two functions. The same thing happens when I use code and instructions from this article: https://www.thegeekstuff.com/2012/08/gprof-tutorial/.
Let me preface this question by saying that I know it takes programs longer to run in valgrind as there is a lot of overhead. This question is not about that.
To ensure that our implementations of data structures have the appropriate runtime, all test cases time out after a certain amount of time (usually around 10 times as long as the teacher-produced solutions take to run in Valgrind). I ran the test cases on my laptop early in the day and everything was fine. I made two very minor changes later at night (adding one to something and adding a counter for something else, both of which are constant-time operations). I reran the tests and timed out on even the most basic test cases, like inserting one node.
I was freaking out, so I went to the 24/7 computer lab on campus and ran my code on a virtual machine, where it worked fine. When I run the binaries directly on my laptop, they're speedy. I tried turning my computer off and back on, which didn't fix anything, so I tried updating valgrind, but it is already up to date. I removed valgrind and re-installed it, and that didn't fix the problem either.
To verify that the problem is with valgrind and not my code, I wrote a hello_world.cpp and ran the binary in valgrind with no extra flags. It takes about 15-20 seconds to run. I have absolutely no idea why this is happening. I've not made any changes to my computer. I've skimmed the valgrind documentation, but I cannot pin down what is wrong. I run Fedora 27.
I am writing a small utility that reads the hardware time and the software time and prints both to a file.
The goal is to check whether the two are in sync. I am looking for a VxWorks function that returns the hardware time with millisecond resolution.
Thanks
I looked this one up for you in the VxWorks 7.0 manual.
Try clock(). If this doesn't work (I am not able to test it), search the manual for terms like 'Time' and 'Clock'; that yielded good results for me.
It is well known that callgrind, the analysis tool of the valgrind suite, makes it possible to start and stop data collection from the command line with callgrind_control -i on or callgrind_control -i off. For instance, the following commands collect data only after the first hour.
(sleep 3600; callgrind_control -i on) &
valgrind --tool=callgrind --instr-atstart=no ./myprog
Is there a similar option for the cachegrind tool? If so, how can I use it (I cannot find anything in the documentation)? If not, how can I start collecting data after a certain amount of time with cachegrind?
As far as I know, there is no such function for Cachegrind.
However, Callgrind is an extension of Cachegrind, which means that you can use Cachegrind features on Callgrind.
For example:
valgrind --tool=callgrind --cache-sim=yes --branch-sim=yes ./myprog
This will measure your program's cache and branch performance as if you were using Cachegrind.
I am profiling a userland application on NetBSD with gprof and seeing clock_gettime use upwards of 30% of the cycles. gprof does not show where it is being called from (it shows some function which clearly does not call clock_gettime).
The application uses third party code including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from that but could not determine much.
I don't understand why it would take that much time. Any input will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles?
Is there a way to optimize clock_gettime(), or is there another call we can use instead?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers
It's all relative to whatever else your program is doing, and keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.
I am trying to figure out what a profile result means before I start to optimize. I am very new to CUDA and to profiling in general, and I am confused by the result.
Specifically, I want to know what is happening during seemingly unoccupied chunks of computation. When I look from top to bottom at the CPU and GPU, there appears to be nothing happening during large portions of the run. These look like columns with nothing in Thread1 and nothing in GeForce. Is this normal? What's happening here?
The run was done on a multicore machine under no load with nvprof. The GPU code was compiled with -arch=sm_20 -m32 -g -G for CUDA 5.
The error here was profiling the code in debug mode (-G compiler flag: "Generate debug information for device code"). Debug mode changes the behavior of the program substantially, so it should not be used to profile and optimize one's code.
One other thing: thorough documentation of nvcc's debug mode is hard to find. nvcc probably dumps the registers and shared memory into global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, tools such as cuda-memcheck --tool racecheck should be run in release mode too.