Minimum callgrind command for callgraph generation and profiling - valgrind

I want to use callgrind to profile my program, but it slows the program down too much. What I want to do is generate a call graph using kcachegrind where every node shows what percentage of its time the program spent in each function. Can you tell me which features I can safely disable for better performance so that this info is still generated?
Thanks a lot!

Quick Overview
Callgrind is essentially a cache profiler (both instruction and data) that works at function-level granularity in order to reconstruct the call graph. The profiler observes actions that trigger events during program execution and updates various aggregate counters maintained by the simulator.
However, this fine-grained simulation of cache events comes at a heavy cost in program runtime. You should know that even with all profiling turned off and no useful data being collected, Callgrind still imposes a minimum slowdown of about 2-4x. When actively collecting data, expect an average slowdown of 10-20x.
Is this theoretical minimum acceptable for your requirement? If not, you should consider other profiling options - discussed here. But if, with some careful control, running large, uninteresting chunks of your program at only a 2-4x slowdown sounds reasonable, read on!
Available Hooks
Callgrind offers 2 forms of control over the collection of profiling data. It's important to understand their inter-dependencies in order to make an informed choice:
Instrumentation state - When disabled, no program actions are observed and thus no events are triggered or collected. The simulator basically switches to an 'idle' state; this is what helps you achieve the theoretical 2-4x minimum mentioned above (see Nulgrind).
But be warned, this should be used carefully! While it offers attractive benefits, it can have non-trivial effects on accuracy. From the documentation:
However, this only should be used with care and in a coarse fashion: every mode change resets the simulator state (ie. whether a memory block is cached or not) and flushes Valgrind's internal cache of instrumented code blocks, resulting in latency penalty at switching time.
Collection state - When disabled, the aggregate counters are not updated with triggered events. This provides a way to narrow the collected data down to only the interesting parts of your call stack.
However, intuitively, this does not offer any noticeable speedup in execution time. And of course, instrumentation needs to be switched on for collection to be enabled.
Commands
valgrind --tool=callgrind
--instr-atstart=<yes|no> ;; default = yes
--collect-atstart=<yes|no> ;; default = yes
--toggle-collect=<function> ;; Toggle collection at entry/exit of specific function
<PROGRAM> <PROGRAM_OPTIONS>
Instrumentation - Turning this off at the start means you have to turn it back on again at the appropriate time. Two alternative ways to do this:
During program execution, use the following command from the shell at the appropriate time.
callgrind_control -i <on|off>
This would require visibility into your program execution as well as some tolerance in accuracy due to the latency of deploying the command. You could use a few shell tricks to help, of course.
Insert the following macros into your program code and recompile your binary.
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_STOP_INSTRUMENTATION;
Collection - Similarly, if disabled at the start, collection needs to be toggled around the interesting parts of the code. Two alternative ways to do this:
Use the --toggle-collect=<function> flag during launch. By definition, this would be inclusive of all the sub-calls within this function. If you can thus identify a particular parent function as your bottleneck, this can be a useful method to isolate relevant data and keep the generated call graph minimal.
Tip: Wildcards are supported in the function name!
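For example, assuming a hypothetical family of bottleneck functions whose names start with solve, you could launch with:
valgrind --tool=callgrind --collect-atstart=no --toggle-collect='solve*' <PROGRAM> <PROGRAM_OPTIONS>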
Use the following macro before and after the relevant portion of your program code and recompile your binary. This can give you more fine-grained control within functions.
CALLGRIND_TOGGLE_COLLECT;
Summary
To combine all the ideas above, a good approach would be:
#include <valgrind/callgrind.h>
// Uninteresting program chunk
CALLGRIND_START_INSTRUMENTATION;
// A few extra lines to allow cache warm-up
CALLGRIND_TOGGLE_COLLECT;
// Portion to profile
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS;
CALLGRIND_STOP_INSTRUMENTATION;
// Rest of the program
Recompile, and launch Callgrind with:
valgrind --tool=callgrind --instr-atstart=no --collect-atstart=no <PROGRAM> <PROGRAM_OPTIONS>
Note that this method generates two Callgrind output files - the first created by the DUMP_STATS macro, and the second at program exit. DUMP_STATS zeroes all counters after use, which means the second log will report 0 events.
Within the active instrumentation block, you could also toggle collection multiple times and dump collected stats for each chunk.
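As a sketch of that multi-chunk idea (phase_one and phase_two are hypothetical stand-ins for your program's interesting chunks; the CALLGRIND_DUMP_STATS_AT macro tags each dump so the resulting files are easy to tell apart in kcachegrind):
#include <valgrind/callgrind.h>
CALLGRIND_START_INSTRUMENTATION;
CALLGRIND_TOGGLE_COLLECT;
phase_one(); // first interesting chunk
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS_AT("phase_one"); // dump and zero the counters
CALLGRIND_TOGGLE_COLLECT;
phase_two(); // second interesting chunk
CALLGRIND_TOGGLE_COLLECT;
CALLGRIND_DUMP_STATS_AT("phase_two");
CALLGRIND_STOP_INSTRUMENTATION;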

Related

Vulkan's execution model and synchronization [closed]

I am trying to clear up my confusion around Vulkan's execution model and I would like to have my understanding verified and get answers to questions that still remain unclear to me.
So my understanding is the following:
1. The host and the device execute completely asynchronously with respect to each other. I have to use VkFence to synchronize between them, i.e. when I want to know that a particular submission has finished executing on the device, I have to wait on the host for the appropriate VkFence to be signaled.
2. Different command queues execute asynchronously with respect to each other. The Vulkan specification does not provide any guarantees about the order in which submissions to these queues start or finish execution. So vkQueueSubmit on queue A executes completely independently from vkQueueSubmit on queue B, and I have to use VkSemaphore in order to make sure that, for example, the submission to queue B starts executing after the submission to queue A is finished.
3. However, different commands submitted to the same command queue respect their submission order, which means that commands submitted later won't start execution unless commands submitted earlier have already started their execution; on the other hand, this does not mean that these later commands cannot finish execution before earlier commands.
4. State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers, ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw). They actually execute right away on the host (not on the device) and modify the state of the VkCommandBuffer, and this cumulatively modified state is used in recording the action commands that come after.
5. From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers. It can be thought of as having one pipeline barrier at the beginning of the render pass instance (in place of vkCmdBeginRenderPass), one at the end of the render pass instance (in place of vkCmdEndRenderPass), and one pipeline barrier after each subpass (in place of vkCmdNextSubpass).
6. In my head, the mental model of how commands execute on a single command queue is as one huge stream of commands (ordered in the order they were recorded to command buffers and the order those command buffers were submitted to the queue), split by pipeline barriers. Each pipeline barrier splits the stream into two sections: commands before the barrier (section A) and commands after the barrier (section B). Commands in section B are allowed to start (or rather continue their execution with pipeline stage Y) only after all commands in section A have finished executing pipeline stage X.
Questions:
The Vulkan specification (section 2.2.1. Queue Operation) states:
Command buffer submissions to a single queue respect submission order
and other implicit ordering guarantees, but otherwise may overlap or
execute out of order. Other types of batches and queue submissions
against a single queue (e.g. sparse memory binding) have no implicit
ordering constraints with any other queue submission or batch.
Let's say that in my program I have only one general queue that can issue all kinds of commands (graphics, compute, transfer, presentation, ...), so does the above statement mean the following?
vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started, ... but vkQueueBindSparse or vkQueuePresentKHR can start at any time regardless of when they were issued by the host. In other words, I always have to use VkSemaphore to ensure that presentation (vkQueuePresentKHR) starts at the right time (only after all my graphics work has been submitted and executed, and is thus ready to be presented).
I am a little bit confused with the definition of submission order within command buffers themselves. Specification states (section 6.2. Implicit Synchronization Guarantees):
1)
For commands recorded outside a render pass, this includes all other
commands recorded outside a render pass, including
vkCmdBeginRenderPass and vkCmdEndRenderPass commands; it does not
directly include commands inside a render pass.
2)
For commands recorded inside a render pass, this includes all other
commands recorded inside the same subpass, including the
vkCmdBeginRenderPass and vkCmdEndRenderPass commands that delimit the
same render pass instance; it does not include commands recorded to
other subpasses.
The first bullet point seems clear. The submission order is the order in which commands were recorded to command buffers, whilst whatever is inside a vkCmdBeginRenderPass/vkCmdEndRenderPass block is considered one command for the purpose of this bullet point. The second bullet point is a bit unclear to me, though. How is the submission order defined here? It is clear that any command within a specific subpass does not start its execution unless a previous command has already started its execution, or unless vkCmdBeginRenderPass was executed. But what about different subpasses? Does this mean that subpass 1 can start its execution before subpass 0 has started its execution? This does not make sense to me. What would make sense is if later subpasses were only allowed to start once previous subpasses have finished.
Vulkan specification (section 6.1.2. Pipeline Stages) states:
Execution of operations across pipeline stages must adhere to implicit
ordering guarantees, particularly including pipeline stage order.
Does this mean that, for example, the vertex shader stage from draw call 2 is not allowed to begin execution unless the vertex shader stage from draw call 1 has already started its execution?
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question of whether a pipeline barrier submitted at the beginning of a command buffer (B) would affect an earlier command buffer (A). I mean, would it make the commands in command buffer B wait to start execution until the commands in command buffer A are finished? I read somewhere that synchronization between different command buffers is the job of events, but according to my understanding this should also be possible with barriers.
Also, if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as the source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as the destination stage of a pipeline barrier, that should basically disable any overlap between the commands before and after the barrier, right?
So as I see it, there are several different parallelisms in Vulkan:
- Between the CPU and GPU; these are synchronized with VkFence.
- Between different command queues on the GPU; these are synchronized with VkSemaphore.
- Between different submissions to the same queue; the exception seems to be submissions with vkQueueSubmit. These are also synchronized with VkSemaphore.
- Between different draw calls; these are synchronized with pipeline barriers. This one is the most confusing to me. So if I have a draw call that in some way uses the results of any previous draw call or writes to the same render target (framebuffer), then, as far as I understand, I need to make sure that the later draw call sees the memory effects of all previous draw calls. But what about when I am rendering a scene with a bunch of game characters, trees and buildings? Let's say that each such object is one draw call and all these draw calls write to the same framebuffer. Do I need to issue a memory barrier after every draw call? Intuitively this feels redundant, and the demos that I checked out did not issue any barriers in this case, but are there any guarantees that draw calls logically following will see the memory effects of draw calls logically before them? The question is: when do I need to synchronize between different draw calls?
- Within a single draw call; synchronization on this level is possible with shader atomic instructions. However, as long as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. an implementation of custom blending in the fragment shader), I should be fine. In other words, if every fragment shader reads and writes only its corresponding pixel or vertex data, I do not need to worry about synchronization within the same draw call.
The host and the device execute completely asynchronously with respect to each other.
Yes.
Unless explicit synchronization is used (that is, VkFence, vk*WaitIdle, VkEvent). Or the one rare implicit synchronization (host writes are visible to device access from any subsequent vkQueueSubmit).
Do note there also has to be a "memory domain operation". I.e. you must use VK_PIPELINE_STAGE_HOST_BIT when reading the output of the GPU on the CPU. (A VkFence alone, providing the execution and memory dependency, does not suffice.)
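A minimal sketch of such a barrier (assuming, for illustration, that the GPU produced the data with a transfer write, and that cmd is the command buffer being recorded):
VkMemoryBarrier toHost = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT, // make the GPU writes available...
    .dstAccessMask = VK_ACCESS_HOST_READ_BIT,      // ...and visible to host reads
};
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT, // srcStageMask: the stage that wrote the data
    VK_PIPELINE_STAGE_HOST_BIT,     // dstStageMask: subsequent host access
    0,                              // dependencyFlags
    1, &toHost, 0, NULL, 0, NULL);
// After the fence for this submission signals, the host may safely read.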
Different command queues execute asynchronously with respect to each other.
Correct. In other words, commands from any two queues may run serially, next to each other (in parallel), or even be pre-empted and time-shared, or some combination of the above. Anything goes. Unless explicit synchronization (VkSemaphore or VkFence) is used.
However different commands submitted to the same command queue respect their submission order
Yes. But it is only a specification formalism that has no real-world effect on its own. It is only specified so that we have a formal linguistic framework in which to describe other stuff in the specification (e.g. it provides the nomenclature necessary to describe the behavior of pipeline barriers).
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw).
No, that is not exactly how I would describe it.
They are not "delayed". They are simply executed exactly where they are recorded in the command buffers.
This is perhaps one of the places where we need the "submission order" formalism: all commands later in submission order than a state command see the new state. (I.e. only the commands recorded after the state command see the new state.)
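A toy illustration (pipelineA, pipelineB and the draw parameters are placeholders): each bind executes at record time, and each draw is recorded against whatever state precedes it.
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineA);
vkCmdDraw(cmd, 3, 1, 0, 0); // recorded with pipelineA bound
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineB);
vkCmdDraw(cmd, 3, 1, 0, 0); // recorded with pipelineB bound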
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers.
I don't think so. It is actually perhaps a bit more complex.
What it enables is more efficient synchronization, although it perhaps defines functionally the same synchronization as pipeline barriers could. What it does differently is that (among other things) it defines this synchronization as a monolith (i.e. you tell the driver upfront what resources you are going to use, and you outline all the things you are going to do to them later).
A render pass is a harness necessitated by mobile tiling-architecture GPUs. On desktop GPUs it is also useful if they have some architectural inspiration from mobile GPUs, or simply as an oracle for driver optimization.
so does the above statement mean the following? vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started
Yes, and no. Read above about the formalism of submission order.
Technically, yes, the commands are guaranteed to execute their VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT in order. But that stage does nothing.
As I said, it is only a specification formalism used for other things. It does not say anything in and of itself.
I am a little bit confused with the definition of submission order within command buffers themselves.
Yes, the language is a bit tricky. The part that trips you up is the subpasses. Note that subpasses are by definition also asynchronous, therefore we cannot use the simple rule from quote "1)".
If I decode it, what the spec quote means is:
a) Any command recorded before the Render Pass Instance (i.e. before vkCmdBeginRenderPass) is earlier in submission order than the vkCmdBeginRenderPass, and earlier than any and all the commands in the subpasses. (And vice versa, anything in the subpasses is later in submission order.)
b) Similarly any command recorded after the Render Pass Instance (i.e. after vkCmdEndRenderPass) is later in submission order than the vkCmdEndRenderPass, and later than any and all the commands in the subpasses.
c) The commands in a single subpass have the submission order same as the order they were recorded in (vkCmd*).
d) Commands in any two subpasses do not have submission order wrt each other.
Remember, submission order is only a formalism. What "d)" means in reality is only that you cannot execute vkCmdPipelineBarrier in subpass 1 and expect that barrier to cover anything from subpass 0. (What you must do instead is use a VkSubpassDependency rather than vkCmdPipelineBarrier to achieve a dependency between subpasses 0 and 1.)
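A sketch of such a dependency, declared up front at render pass creation time (the stage and access masks below are only illustrative; they should match what subpass 0 actually writes and subpass 1 actually reads):
VkSubpassDependency dep = {
    .srcSubpass      = 0, // producer subpass
    .dstSubpass      = 1, // consumer subpass
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
// Passed to the driver via VkRenderPassCreateInfo::pDependencies / dependencyCount.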
Execution of operations across pipeline stages must adhere to implicit ordering guarantees, particularly including pipeline stage order.
This is only an introductory statement linking to some of the other stuff in the specification. It does not say anything in and of itself.
"implicit ordering guarantees" links to the submission order we covered.
"pipeline stage order" simply links to pipeline stage ordering. This simply specifies "logical order" between pipeline stages (e.g. Vertex Shader is before Fragment Shader). What it means is whenever you use stage flag bit in any srcStage parameter, Vulkan will implicitly assume you also mean any logically earlier stage flag bit. (And similarly for dstStage).
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question of whether a pipeline barrier submitted at the beginning of a command buffer (B) would affect an earlier command buffer (A)
Yes, that is the general idea.
Think of it like this: vkQueueSubmit appends the commands from the command buffer to the end of the queue. It is called a "queue" for a reason. Therefore a pipeline barrier also affects command buffers that were submitted earlier. (And BTW, that's why it is called submission order.)
Also, if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as the source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as the destination stage of a pipeline barrier, that should basically disable any overlap between the commands before and after the barrier, right?
Yes, but that is code rot.
In this case, use VK_PIPELINE_STAGE_ALL_COMMANDS_BIT instead. It is much easier to understand for anyone reading such code.
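For illustration, such a "stop everything" barrier would look roughly like this (useful when debugging a suspected synchronization bug, but it defeats the point of Vulkan if left in shipping code):
VkMemoryBarrier full = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT | VK_ACCESS_MEMORY_WRITE_BIT,
};
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // everything before must finish...
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, // ...before anything after may start
    0, 1, &full, 0, NULL, 0, NULL);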
So as I see it, there are several different parallelisms in Vulkan:
Asynchrony.
Parallelism is not guaranteed. I.e. the driver is allowed to serialize the workload, or time-share it.
But e.g. with some common sense you can guess there will be (notable) parallelism between CPU and GPU, if it is a dedicated GPU.
The question is: when do I need to synchronize between different draw calls?
Yes, I think no framebuffer sync between draw commands is one of the exceptions/simplifications Vulkan has.
I believe people support this by the specification of Primitive Order and Rasterization Order.
I.e. within a single subpass you should not need a pipeline barrier between two vkCmdDraw* calls to synchronize the color and depth buffers. (I think) you still need to explicitly synchronize a draw in a subpass with other subpasses and with anything outside the render pass instance.
However, as long as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. an implementation of custom blending in the fragment shader), I should be fine.
Yes. The pipeline and the fixed and programmable stages should work similarly to OpenGL's. You should, for the most part, be able to use OpenGL's shaders with little to no modification and achieve the same behavior.

Programmable "real-time" MIDI processing

In my band, all musicians have both hands busy at any time. However, we want to add whole synthesizer chords (1/4 .. whole note length), maybe triggered by a simple foot switch each time (because playing along with a sequencer is currently too difficult for us).
Some time ago I wrote a (Windows) console application in C (MinGW) that converted incoming MIDI events to text, piped that text to an external program (AWK script), and re-converted that external program's text output back to MIDI events.
Basically every sort of filtering or event generation was possible; I actually produced chords triggered by simple control messages; I kept note-ONs in memory to be able to turn them OFF whenever a new chord was sent, etc. - the actual processing (execution) times were not a problem at all(!)
But I had to accept that not only latency, but also the notoriously unreliable (with respect to "when" and "for how long") user-application multitasking/switching of the OS made this concept practically worthless, at least for "real-time" use. There were always clearly perceivable delays of unpredictable duration.
I read about user-mode driver programming and downloaded some resources, but somehow stopped working on that project without a real result.
Apart from that specific project, I even have some experience in writing small "virtual" machines that allow for expressing exactly the variables, conditionals and math, stored as a token tree and processed quite fast. Maybe there is also the option to embed Lua, V8, or anything like that. So calling another (external) program is not necessarily the issue here, since that can be avoided.
The problem that remains is that the processing as a whole is still done by a (user) application. So I figure there is no way around a (user mode) driver, in this scenario.
Alternatively, I was even considering (more "real-time") hardware - a Raspi or the like - but then the MIDI interface may be an additional challenge.
Is there any hardware or software solution (or project) available that may serve as a base for such a generic MIDI filter/processor? Apart from predictable timing behaviour, it is desirable not to need a (C) compilation environment when building filters/rules, since that "creative" step will probably happen in our rehearsal room (laptop available), which is certainly not a "programming lab". Text-based "programs" are fine - in the long term I'll maybe build a GUI for wiring/generating rules anyway.
MIDI is handled pretty well in Windows. I'm not sure of the source of the original problems you had. No doubt there is some latency, though.
You can handle this in real time with a microcontroller. The good news is that you don't even have to build the hardware. Off-the-shelf controllers are available for this. For example: http://www.midisolutions.com/prodevp.htm

Is there a way to make threads run slow?

This question is probably the opposite of what every developer wants their system to do.
I am creating software that looks in a directory for specific files, reads them in, and does certain things with them. This can create a high CPU load. I use GCD to create threads, which are put into an NSOperationQueue. What I was wondering is: can I make this operation not take such a huge CPU load? I want to run things way slower; speed is not an issue, but it is very important that the app plays nice in the background.
In short: can I make NSOperationQueue, or threads in general, run slowly without using things like sleep?
The app traverses a directory structure, finds all images and creates thumbnails. Just the traversing of the directories makes the CPU load quite high.
Process priority: nice / renice.
See:
https://superuser.com/questions/42817/is-there-any-way-to-set-the-priority-of-a-process-in-mac-os-x#42819
but you can also do it programmatically.
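For instance (a minimal sketch, assuming a POSIX system such as OS X), a process can lower its own priority at startup with setpriority(2) - the +10 mirrors the renice example above:
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    // Set our own nice value to +10 (lower scheduling priority).
    // PRIO_PROCESS with who == 0 targets the calling process.
    if (setpriority(PRIO_PROCESS, 0, 10) == -1)
        perror("setpriority");

    // ... do the CPU-heavy directory traversal here ...
    return 0;
}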
Your threads are being CPU-intensive. This leads to two questions:
Do they need to be so CPU-intensive? What are they doing that's CPU-intensive? Profile the code. Are you using (say) a quadratic algorithm when you could be using a linear one?
Playing nicely with other processes on the box. If there's nothing else on the box, then you /want/ to use all of the available CPU resource: otherwise you're just wasting time. However, if there are other things running, then you want to defer to them (within reason), which means giving your process a lower priority (i.e. a /higher/ nice value) than other processes. Processes by default have nice value 0, so just make it bigger (say +10). You have to be root to give a process negative niceness.
The Operation Queues section of the Concurrency Programming Guide describes the process for changing the priority of an NSOperation:
Changing the Underlying Thread Priority
In OS X v10.6 and later, it is possible to configure the execution priority of an operation’s underlying thread. Thread policies in the system are themselves managed by the kernel, but in general higher-priority threads are given more opportunities to run than lower-priority threads. In an operation object, you specify the thread priority as a floating-point value in the range 0.0 to 1.0, with 0.0 being the lowest priority and 1.0 being the highest priority. If you do not specify an explicit thread priority, the operation runs with the default thread priority of 0.5.
To set an operation’s thread priority, you must call the setThreadPriority: method of your operation object before adding it to a queue (or executing it manually). When it comes time to execute the operation, the default start method uses the value you specified to modify the priority of the current thread. This new priority remains in effect for the duration of your operation’s main method only. All other code (including your operation’s completion block) is run with the default thread priority. If you create a concurrent operation, and therefore override the start method, you must configure the thread priority yourself.
Having said that, I'm not sure how much of a performance difference you'll really see from adjusting the thread priority. For more dramatic performance changes, you may have to use timers, sleep/suspend the thread, etc.
If you're scanning the file system looking for changed files, you might want to refer to the File System Events Programming Guide for guidance on lightweight techniques for responding to file system changes.

Clock_gettime showing high usage during profiling of code

I am profiling a userland application on NetBSD with gprof and seeing clock_gettime using upwards of 30% of cycles. Gprof does not show where it is getting called from (it shows some function which clearly does not call clock_gettime).
The application uses third party code including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from that but could not determine much.
I don't understand why it would take that much time. Any input will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles?
Is there a way to optimize clock_gettime(), or can we use another call?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers
It's all relative to whatever else your program is doing. Keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.

Why does the JVM execute the same program faster over time after launch?

I have written a simple Sudoku solver. To roughly test the performance I'm using simple System.currentTimeMillis calls.
I have prepared a set of initial Sudoku configurations in a text file. The program reads the file and solves each configuration. When running the tests I noticed that the first 3-4 solve runs are really slower than the rest - and by slower I mean by an order of magnitude.
Here is a sample pseudo-code snippet:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class Benchmark {
    public static void main(String[] args) throws IOException {
        // "puzzles.txt" is a placeholder for the prepared configuration file.
        try (BufferedReader file = new BufferedReader(new FileReader("puzzles.txt"))) {
            String configuration;
            while ((configuration = file.readLine()) != null) {
                Solver s = new Solver(configuration);
                long now1 = System.currentTimeMillis();
                s.solve();
                long now2 = System.currentTimeMillis();
                System.out.println(now2 - now1); // only solve() is timed; IO is excluded
            }
        }
    }
}
I measure only the solve() method, so IO is not a problem; I even hardcoded some data into the program - the first few runs are still slower. The difficulty of the puzzles is not an issue either: I have tried different permutations and difficulties of configurations, and it is always the same - the first few are slower.
My question is - why is that and is there a way to prevent it?
This is supposed to happen. The JIT compiler optimizes code that gets called more often as your program runs for longer.
This only reflects the general fact that the technique you're using to test performance simply isn't reliable in Java.
In practice, methods are not compiled by the JIT the first time they are called by the JVM. For each method, the JVM maintains a call count, which is incremented every time the method is called. The JVM interprets a method until its call count exceeds a JIT compilation threshold. Therefore, often-used methods are compiled soon after the JVM has started, and less-used methods are compiled much later, or not at all. This JIT compilation threshold helps the JVM start quickly.
So the busiest methods of a Java program are always optimized most aggressively, which increases their execution speed each time they are called.
Here is the source for the above information.
In performance-testing engagements we would always run the system under test for a while to let it reach a steady state, and only then start capturing the performance metrics. You might try the same: run the solve() method a number of times before capturing your metrics.