How to use Callgrind to profile specific functions? - valgrind

Following this, I wrapped my functions with CALLGRIND_xxx_INSTRUMENTATION macros. However, I always get "out of memory".
Here is a simplified version of my program. Callgrind still runs out of memory on it, even though it runs fine when I do not use the macros.
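#include <cstdio>
#include <valgrind/callgrind.h>
void foo(int i) { printf("i=%d\n", i); }
int main()
{
    for (int i = 0; i < 1048576; i++) {
        CALLGRIND_START_INSTRUMENTATION;  // instrumentation is switched on in every iteration
        foo(i);
        CALLGRIND_STOP_INSTRUMENTATION;
    }
    return 0;
}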
I run it with "valgrind --tool=callgrind --instr-atstart=no ./foo >foo.out".
Did I do anything wrong? Please help. Thanks!

The typical use case for CALLGRIND_START_INSTRUMENTATION is to skip instrumenting the application startup code. Calling it in a loop is costly in both memory and CPU,
as callgrind re-instruments the code each time.
If you are interested in only measuring some functions, you should rather
start instrumentation somewhere before the loop, and then use CALLGRIND_TOGGLE_COLLECT before/after the function calls you are interested in.
That will use both less cpu and less memory.
If you want to do the above, use the options --instr-atstart=no and --collect-atstart=no. Then start instrumentation at the relevant place in your program (e.g. after the startup/initialisation code) and insert calls to CALLGRIND_TOGGLE_COLLECT in the functions you are interested in.
Note that instead of modifying your program to call CALLGRIND_TOGGLE_COLLECT for a bunch of functions,
you can also pass the command line option --toggle-collect=<function> one or more times.


How to disable all optimization when using COSMIC compiler?

I am using the COSMIC compiler in the STVD IDE, and even though optimization is turned off with -no (the documentation says "-no: do not use optimizer"), some lines of code get removed: no breakpoint can be placed on them, and they do not appear in the disassembly.
I tried setting -oc (leave removed instructions as comments), but the removed lines do not even show up as comments.
bool foo(void)
{
    uint8_t val;
    if (globalvar > 5)
        val = 0;
    for (val = 0; val < 8; val++)
    {
        /* some code... */
    }
    return true;
}
I do know it seems idiotic to set val to 0 prior to the for loop, but let's just assume it is necessary for some reason. When I disable optimization I expect the code not to be optimized, but instead the val = 0; gets removed without a trace.
I am not looking for a workaround like declaring val volatile, which solves the problem. I am rather looking for a way to prevent the optimization, or at least to understand what changes are made to my code when compiling.
It is not clear from the manual, but it seems that the -no option prevents assembly level optimisation. It seems possible that the code generator stage that runs before assembly optimisation may perform higher level optimisation such as redundant code removal.
From the manual:
disable the constant propagation optimization. By default,
when a variable is assigned with a constant, any subsequent access to that variable is replaced by the constant
itself until the variable is modified or a flow break is
encountered (function call, loop, label ...).
It seems that it is this constant propagation feature that you must explicitly disable.
It is unusual perhaps, but it appears that this compiler optimises by default, distinguishes between compiler optimisations and assembler optimisations (performed as the compilation stage), and then makes you switch off each individual optimisation separately.
To avoid this in the code, rather than switching it off globally, you could initialise val to a non-zero value in this case:
int val = -1 ;
Then the later assignment to zero will require explicit code. This has the advantage over volatile perhaps in that it will not block optimisations when you do enable them.
I believe that this behaviour is allowed by the C language specification.
You are effectively writing the same value either once or twice to the same variable on successive lines of code. The compiler could assign this value to either a processor register or a memory location as it sees fit and knows that the value following the initial assignment in the for loop is the same as the value assigned when the if clause is actioned. As a result the language spec allows the compiler to throw the redundant code away.
The way to force the compiler to perform all read and write accesses to the variable is to use the volatile keyword. That is what it is for.

Code sharing between multiple independently compiled binaries/hex files

I'm looking for documentation/information on how to share information/code between multiple binaries compiled for Cortex-M0/M4/M7 architectures. The two binaries will be on the same chip and same architecture. They are flashed at different locations, and one of them sets the main stack pointer and resets the program counter so that it "jumps" to the other binary. I want to share code between these two binaries.
I've done a simple copy of an array of function pointers into a RAM section defined in the linker script. The other binary then reads that RAM, casts it to an array, and uses an index to call functions in the first binary. This works as a proof of concept, but I think what I'm looking for is a bit more complex, as I want some way of describing compatibility between the two binaries. I want something like the functionality of shared libraries, but I'm unsure if I need position-independent code.
As an example, the current copy process is basically:
Source binary:
void copy_func()
{
    memcpy(address_custom_ram_section, array_of_function_pointers, fixed_size);
}
Binary which is jumped to from the source binary:
array_fp_type get_funcs()
{
    memcpy(array_of_fp, address_custom_ram_section, fixed_size);
    return array_of_fp;
}
Then I can use the array_of_fp to call into functions residing in the source binary from the jump binary.
So what I'm looking for is some resources or input from someone who has implemented a similar system. Ideally I would not need a custom RAM section to copy the function pointers into.
I would be fine with having the compilation step of source binary outputting something which can be included into the compilation step of the jump binary. However it needs to be reproducible and recompiling the source binary shouldn't break the compatibility with the jump binary(even if it included a different file from what is now outputted) as long as you don't change the interface.
To clarify, the source binary shouldn't require any specific knowledge about the jump binary. The code should not reside in both binaries, as this would defeat the purpose of this mechanism. The overall goal of this mechanism is to save space when creating multi-binary applications on Cortex-M processors.
Any ideas or links to resources are welcome. If you have any more questions feel free to comment on the question and I'll try to answer it.
It's very hard for me to picture what you want to do, but if you're interested in having an application link against your bootloader/ROM, then see Loading symbol file while linking for a hint on what you could do.
Build your "source"(?) image, scrape its mapfile and make a symbol file, then use that when you link your "jump"(?) image.
This does mean you need to link your "jump" image against a specific version of your "source" image.
If you need them to be semi-version independent (i.e. you define a set of functions that get exported, but you can rebuild on either side), then you need to export function pointers at known locations in your "source" image and link against those function pointers in your "jump" image. You can simplify the bookkeeping by making a structure of function pointers and accessing the functions through it on either side.
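For example, in a header shared by both images (shared_functions.h):
struct FunctionPointerTable {
    void (*function1)(int);
    void (*function2)(char);
};
extern struct FunctionPointerTable sharedFunctions;
Source file in "source" image:
void function2Implementation(char b) {
    printf("You sent me a char: %c\r\n", b);
}
void function1Implementation(int a) {
    printf("You sent me an integer: %d\r\n", a);
    function2Implementation((char)a);    /* direct call: always the "source" version */
    sharedFunctions.function2((char)a);  /* through the table: can be redirected */
}
struct FunctionPointerTable sharedFunctions = {
    function1Implementation,
    function2Implementation,
};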
Source file in "jump" image:
#include "shared_functions.h"
When you compile/link the "source" image, take its mapfile, extract the location of sharedFunctions, and create a symbol file that is linked with the "jump" image.
Note: the printfs (or anything directly called by the shared functions) would come from the "source" image (and not the "jump" image).
If you need them to come from the "jump" image (or be overridable) , then you need to access them through the same function pointer table, and the "jump" image needs to fix the function pointer table up with its version of the relevant function. I updated the function1() to show this. The direct call to function2 will always be the "source" version. The shared function call version of it will go through the jump table and call the "source" version unless the "jump" image updates the function table to point to its implementation.
You CAN get away from the structure, but then you need to export the function pointers one by one (not a big problem), but you want to keep them in order and at a fixed location, which means explicitly putting them in the linker descriptor file, etc. etc. I showed the structure method to distill it down to the easiest example.
As you can see, things get pretty hairy, and there is some penalty: calling through the function pointer is slower because you need to load the address to jump to.
As explained in comment, we could imagine an application and a bootloader relying on same dynamic library. So application and bootloader rely on library, application can be changed without impact on library or boot.
I did not find an easy way to do a shared library with arm-none-eabi-gcc. However,
this document gives some alternatives to shared libraries. In your case, I would recommend the jump table solution.
Write a library with the functions that need to be used in both the bootloader and the application.
"library" code
typedef void (*genericFunctionPointer)(void);
void lib_f1(void);
uint8_t lib_f2(uint8_t param);
// use the linker script to set MySection at a known address
// I think this could be a structure like Russ Schultz's solution, but a struct may or may not compile identically in lib and boot. However, a struct would be much easier and would avoid many function pointer casts.
const genericFunctionPointer FpointerArray[] __attribute__ ((section ("MySection"))) =
{
    (genericFunctionPointer)lib_f1,
    (genericFunctionPointer)lib_f2,
};
void lib_f1(void)
{
    //some code
}
uint8_t lib_f2(uint8_t param)
{
    //some code
}
application and/or bootloader code
typedef void (*genericFunctionPointer)(void);
// Use the linker script to set MySection at the same address the library was compiled with
// in the linker script, also mark this section as `NOLOAD` because it is initialised by the library and not by our code
// volatile is needed here because you read from flash memory and the compiler may otherwise initialise usage of this array to NULL pointers
volatile const genericFunctionPointer FpointerArray[NB_F] __attribute__ ((section ("MySection")));
int main(void)
{
    // the cast has to wrap the pointer itself, not the result of the call
    uint8_t a = ((correctCastF2)FpointerArray[lib_f2])(10);
}
You can look into using linker sections. If you have your bootloader source code in folder bootloader, you can use
} >flash_region1
} >flash_region2
} >flash_region3

Compile-time information in CUDA

I'm optimizing a very time-critical CUDA kernel. My application accepts a wide range of switches that affect the behavior (for instance, whether to use 3rd or 5th order derivative). Consider as an approximation a set of 50 switches, where every switch is an integer variable (a bool sometimes, or a float, but this case is not so relevant for this question).
All these switches are constant during the execution of the application. Most of these switches are run-time and I store them in constant memory, so as to exploit the caching mechanism. Some other switches can be compile-time, and the customer is fine with having to re-compile the application if he wants to change the value of the switch. A very simple example could be:
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ ) {
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
    }
}
Assume that do_this and do_that are compute-bound and very cheap, that I optimize the for loop so that its overhead is negligible, that I have to place the if inside the iteration. If the compiler recognizes that compile_time_switch is static information it can optimize out the call to the "wrong" function and create code that is just as optimized as if the if weren't there. Now the real question:
In which ways can I provide the compiler with the static value of this switch? I see two such ways, listed below, but neither of them works for me. What other possibilities remain?
Template parameters
Providing a template parameter enables this static optimization.
template<int compile_time_switch>
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ ) {
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
    }
}
This simple solution does not work for me, since I don't have direct access to the code that calls the kernel.
Static members
Consider the following struct:
struct GlobalParameters
{
    static const bool compile_time_switch = true;
};
Now GlobalParameters::compile_time_switch contains the static information as I want it, and the compiler would be able to optimize the kernel. Unfortunately, CUDA does not support such static members.
EDIT: the last statement is apparently wrong. The definition of the struct is of course legitimate, and you are able to use the static member GlobalParameters::compile_time_switch in device code. The compiler inlines the variable, so the final code directly contains the value rather than a run-time variable access, which is the behavior you would expect from an optimizing compiler. So the second option is actually suitable.
I consider my problem solved both thanks to this fact and to kronos' answer. However, I'm still looking for other alternative methods to provide compile-time information to the compiler.
Your third option is a preprocessor definition:
#define compile_time_switch 1
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ ) {
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
    }
}
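The preprocessor replaces compile_time_switch with the literal 1, so the compiler sees if (1) and its dead-code elimination pass removes the else branch completely. If you want the preprocessor itself to discard the unused branch, so that the compiler never even sees it, select between the two calls with #if compile_time_switch / #else / #endif instead.
Furthermore, you can specify the definition with the -D command line switch, and (I think) any compiler supported by NVIDIA will accept -D (MSVC may use a different switch).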

Implement lua scripting through dll calls?

Is it possible to write a program that can execute lua scripts just by using the lua52.dll file?
Or do I have to create a new C project and use all these header and source files?
I just want to create a few global variables and functions and make them available in the lua scripts that should be executed.
So in theory:
The standard command line interpreter for Lua is an example of just such a program. On Windows, it is a small executable that is linked to lua52.dll. Its source is, of course, part of the Lua distribution.
Despite being located in the same folder as the sources to the Lua DLL, lua.c only references the public API for Lua, and depends only on the four public header files and the DLL itself.
An even simpler example that embeds a Lua interpreter in a C program is the following, derived from the example shown in the PiL book available online:
#include <stdio.h>
#include <string.h>
#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>
int main (void) {
    char buff[256];
    int error;
    lua_State *L = luaL_newstate(); /* create state */
    luaL_openlibs(L); /* open standard libraries */
    while (fgets(buff, sizeof(buff), stdin) != NULL) {
        error = luaL_loadbuffer(L, buff, strlen(buff), "line") ||
                lua_pcall(L, 0, 0, 0);
        if (error) {
            fprintf(stderr, "%s", lua_tostring(L, -1));
            lua_pop(L, 1); /* pop error message from the stack */
        }
    }
    lua_close(L); /* let Lua clean up its objects */
    return 0;
}
In your existing application, you would need to call luaL_newstate() once and store the returned handle. Along with a call to luaL_openlibs(), you would likely want to also define one or more Lua modules representing your application's scriptable API. And of course, you need to call lua_close() sometime before exiting so that Lua has a chance to clean up its objects and in particular a chance to deal with any objects that the script authors are depending on to get resources released when the application exits.
With that in place, you generally provide a way to load script fragments provided by your user using luaL_loadbuffer() or any of several other functions built on top of lua_load(). Loading a script compiles it and leaves an anonymous function on the top of the stack that when called will execute all top-level statements in the script.
For a lot more discussion of this, see the chapters of Programming in Lua (an older edition is available online) that relate to the C API.
What language is the above written in? What application is running it? If this is a Lua script, then "AddFunctionToLua" is simply function name() end. If this is C, then you've already got a C project, no need to "create a new C project". So it's unclear what you're asking.

How can bytecode be used for optimizing the execution time of dynamic languages?

I am interested in some optimization methods or general bytecode designs, which might help speed up execution using VM in comparison to interpretation of an AST.
The main win of bytecode over AST interpretation is operation dispatch cost; for highly optimised interpreters this starts to become a real problem. "Dispatch" is the term used to describe the overhead required to start executing an operation (such as arithmetic, property access, etc).
A fairly normal AST based interpreter would look something like this:
class ASTNode {
public:
    virtual double execute() = 0;
};
class NumberNode : public ASTNode {
public:
    virtual double execute() { return m_value; }
    double m_value;
};
class AddNode : public ASTNode {
public:
    virtual double execute() { return left->execute() + right->execute(); }
    ASTNode *left, *right;
};
So executing the code for something as simple as 1+1 requires 3 virtual calls. Virtual calls are very expensive (in the grand scheme of things) due to the multiple indirections needed to make the call, and the general cost of making a call in the first place.
In a bytecode interpreter you have a different dispatch model -- rather than virtual calls you have an execution loop, akin to:
while (1) {
    switch (op.type) {
        case op_add:
            // Efficient interpreters use "registers" rather than
            // a stack these days, but the example code would be more
            // complicated
            push(pop() + pop());
            break;
        case op_end:
            return pop();
    }
}
This still has a reasonably expensive dispatch cost vs native code, but is much faster than virtual dispatch. You can further improve perf using a gcc extension called "computed goto" which allows you to remove the switch dispatch, reducing total dispatch cost to basically a single indirect branch.
In addition to improving dispatch costs, bytecode-based interpreters have a number of additional advantages over AST interpreters, mostly due to the ability of the bytecode to "directly" jump to other locations as a real machine would. For example, imagine a snippet of code like this:
while (1) {
    if (a)
        break;
}
To implement this correctly, every time a statement is executed you would need to indicate whether execution is meant to stay in the loop or stop, so the execution loop becomes something like:
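bool done = false;
while (!done && condition->execute() == true) {
    for (i = 0; i < statements->length(); i++) {
        result = statements[i]->execute();
        if (result.type == BREAK)
            done = true;   // leave the outer loop as well
        if (result.type == BREAK || result.type == CONTINUE)
            break;         // skip the remaining statements of this iteration
    }
}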
As you add more forms of flow control this signalling becomes more and more expensive. Once you add exceptions (i.e. flow control that can happen everywhere) you start needing to check for these things in the middle of even basic arithmetic, leading to ever increasing overhead. If you want to see this in the real world I encourage you to look at the ECMAScript spec, where they describe the execution model in terms of an AST interpreter.
In a bytecode interpreter these problems basically go away, as the bytecode is able to express control flow directly rather than indirectly through signalling, e.g. continue is simply converted into a jump instruction, and you only pay that cost if it's actually hit.
Finally an AST interpreter by definition is recursive, and so has to be prevented from overflowing the system stack, which puts very heavy restrictions on how much you can recurse in your code, something like:
Has 8 levels of recursion (at least) in the interpreter -- this can be a very significant cost; older versions of Safari (pre-SquirrelFish) used an AST interpreter, and for this reason JS was allowed only a couple of hundred levels of recursion vs the thousands allowed in modern browsers.
Perhaps you could look at the various methods which the llvm "opt" tool provides. Those are bytecode-to-bytecode optimisations, and the tool itself will provide analysis on the benefits of applying a particular optimisation.