Compile-time information in CUDA - optimization

I'm optimizing a very time-critical CUDA kernel. My application accepts a wide range of switches that affect the behavior (for instance, whether to use 3rd or 5th order derivative). Consider as an approximation a set of 50 switches, where every switch is an integer variable (a bool sometimes, or a float, but this case is not so relevant for this question).
All these switches are constant during the execution of the application. Most of these switches are run-time and I store them in constant memory, so to exploit the caching mechanism. Some other switches can be compile-time and the customer is fine with having to re-compile the application if he wants to change the value in the switch. A very simple example could be:
__global__ void mykernel(const float* in, float *out)
{
for ( /* many many times */ )
if (compile_time_switch)
do_this(in, out);
else
do_that(in, out);
}
Assume that do_this and do_that are compute-bound and very cheap, that I optimize the for loop so that its overhead is negligible, that I have to place the if inside the iteration. If the compiler recognizes that compile_time_switch is static information it can optimize out the call to the "wrong" function and create code that is just as optimized as if the if weren't there. Now the real question:
In which ways can I provide the compiler with the static value of this switch? I see two such ways, listed below, but none of them work for me. What other possibilities remain?
Template parameters
Providing a template parameter enables this static optimization.
template<int compile_time_switch>
__global__ void mykernel(const float* in, float *out)
{
for ( /* many many times */ )
if (compile_time_switch)
do_this(in, out);
else
do_that(in, out);
}
This simple solution does not work for me, since I don't have direct access to the code that calls the kernel.
Static members
Consider the following struct:
struct GlobalParameters
{
static const bool compile_time_switch = true;
};
Now GlobalParameters::compile_time_switch contains the static information as I want it, and that compiler would be able to optimize the kernel. Unfortunately, CUDA does not support such static members.
EDIT: the last statement is apparently wrong. the definition of the struct is of course legit and you are able to use the static member GlobalParameters::compile_time_switch in device code. The compiler inlines the variable, so that the final code will directly contain the value, not a run-time variable access, which is the behavior you would expect from an optimizer compiler. So, the second options is actually suitable.
I consider my problem solved both thanks to this fact and to kronos' answer. However, I'm still looking for other alternative methods to provide compile-time information to the compiler.

Yor third options are preprocessor definitions:
#define compile_time_switch 1
__global__ void mykernel(const float* in, float *out)
{
for ( /* many many times */ )
if (compile_time_switch)
do_this(in, out);
else
do_that(in, out);
}
The preprocessor will discard the else case compleatly and the compiler has nothing to optimize in his dead code elemination pass, because there is no dead code.
Furthermore, you can specify the definition with the -D comand line switch and (I think) any by nvidia supported compiler will accept -D (msvc may use a different switch).

Related

How to disable all optimization when using COSMIC compiler?

I am using the COSMIC compiler in the STVD ide and even though optimization is turned of with -no (documentation says "-no: do not use optimizer") some lines of code get removed and cannot have a breakpoint placed upon them, nor are they to be found in the disassembly.
I tried to set -oc (leave removed instructions as comments) which resulted in not even showing the removed lines as comment.
bool foo(void)
{
uint8_t val;
if (globalvar > 5)
val = 0;
for (val = 0; val < 8; val++)
{
some code...
}
return true;
}
I do know it seems idiotic to set val to 0 prior to the for loop but lets just assume it is for some reason necessary. When I set no optimization I expect it to be not optimized but insted the val = 0; gets removed without any traces.
I am not looking for a workaround like declaring val volatile whitch solves the problem. I am rather looking for a way to prevent the optimization or at least understand/know what changes are made to my code when compiling.
It is not clear from the manual, but it seems that the -no option prevents assembly level optimisation. It seems possible that the code generator stage that runs before assembly optimisation may perform higher level optimisation such as redundant code removal.
From the manual:
-cp
disable the constant propagation optimization. By default,
when a variable is assigned with a constant, any subsequent access to that variable is replaced by the constant
itself until the variable is modified or a flow break is
encountered (function call, loop, label ...).
It seems that it is this constant propagation feature that you must explicitly disable.
It is unusual perhaps, but it appears that this compiler optimises by default, and distinguishes between compiler optimisations and assembler optimisations (performed as the compilation stage), and them makes you switch off each individual optimisation separately.
To avoid this in the code, rather than switching it off globally, you could initialise val to a non-zero value in this case:
int val = -1 ;
Then the later assignment to zero will require explicit code. This has the advantage over volatile perhaps in that it will not block optimisations when you do enable them.
I believe that this behaviour is allowed by the C language specification.
You are effectively writing the same value either once or twice to the same variable on successive lines of code. The compiler could assign this value to either a processor register or a memory location as it sees fit and knows that the value following the initial assignment in the for loop is the same as the value assigned when the if clause is actioned. As a result the language spec allows the compiler to throw the redundant code away.
The way to force the compiler to perform all read and write accesses to the variable is to use the volatile keyword. That is what it is for.

Code sharing between multiple independently compiled binaries/hex files

I'm looking for documentation/information on how to share information/code between multiple binaries compiled for a Cortex-m/0/4/7 architectures. The two binaries will be on the same chip and same architecture. They are flashed at different locations and sets the main stack pointer and resets the program counter so that one binary "jumps" to the other binary. I want to share code between these two binaries.
I've done a simple copy of an array of function pointers into a section defined in the linker script into RAM. Then read the RAM out in the other binary and cast it to an array then use the index to call functions in the other binary. This does work as a Proof-of-concept, but I think what I'm looking for is a bit more complex. As I want some way of describing compatibility between the two binaries. I want some what the functionality of shared libraries, but I'm unsure if I need position independent code.
As an example how the current copy process is done it is basically:
Source binary:
void copy_func()
{
memncpy(array_of_function_pointers, fixed_size, address_custom_ram_section)
}
Binary which is jumped too from source binary:
array_fp_type get_funcs()
{
memncpy(adress_custom_ram_section, fixed_size, array_of_fp)
return array_of_fp;
}
Then I can use the array_of_fp to call into functions residing in the source binary from the jump binary.
So what I'm looking for is some resources or input for someone who have implemented a similar system. Like I would like to not have to have a custom RAM section where I'm copying the function pointers into.
I would be fine with having the compilation step of source binary outputting something which can be included into the compilation step of the jump binary. However it needs to be reproducible and recompiling the source binary shouldn't break the compatibility with the jump binary(even if it included a different file from what is now outputted) as long as you don't change the interface.
To clarify source binary shouldn't require any specific knowledge about the jump binary. The code should not reside in both binaries as this would defeat the purpose of this mechanism. The overall goal if this mechanism is a way to save space when creating multi-binary applications on cortex-m processors.
Any ideas or links to resources are welcome. If you have any more questions feel free to comment on the question and I'll try to answer it.
Its very hard for me to picture what you want to do, but if you're interested in having an application link against your bootloader/ROM, then see Loading symbol file while linking for a hint on what you could do.
Build your "source"(?) image, scrape its mapfile and make a symbol file, then use that when you link your "jump"(?) image.
This does mean you need to link your "jump" image against a specific version of your "source" image.
If you need them to be semi-version independent (i.e. you define a set of functions that get exported, but you can rebuild on either side), then you need to export function pointers at known locations in your "source" image and link against those function pointers in your "jump" image. You can simplify the bookkeeping by making a structure of function pointers access the functions through that on either side.
For example:
shared_functions.h:
struct FunctionPointerTable
{
void(*function1)(int);
void(*function2)(char);
};
extern struct FunctionPointerTable sharedFunctions;
Source file in "source" image:
void function1Implementation(int a)
{
printf("You sent me an integer: %d\r\n", a);
function2Implementation((char)(a%256))
sharedFunctions.function2((char)(a%256));
}
void function2Implementation(char b)
{
printf("You sent me an char: %c\r\n", b);
}
struct FunctionPointerTable sharedFunctions =
{
function1Implementation,
function2Implementation,
};
Source file in "jump" image:
#include "shared_functions.h"
sharedFunctions.function1(1024);
sharedFunctions.function2(100);
When you compile/link the "source", take its mapfile and extract the location of sharedFunctions and create a symbol file that is linked with the source the "jump" image.
Note: the printfs (or anything directly called by the shared functions) would come from the "source" image (and not the "jump" image).
If you need them to come from the "jump" image (or be overridable) , then you need to access them through the same function pointer table, and the "jump" image needs to fix the function pointer table up with its version of the relevant function. I updated the function1() to show this. The direct call to function2 will always be the "source" version. The shared function call version of it will go through the jump table and call the "source" version unless the "jump" image updates the function table to point to its implementation.
You CAN get away from the structure, but then you need to export the function pointers one by one (not a big problem), but you want to keep them in order and at a fixed location, which means explicitly putting them in the linker descriptor file, etc. etc. I showed the structure method to distill it down to the easiest example.
As you can see, things get pretty hairy, and there is some penalty (calling through the function pointer is slower because you need to load up the address to jump to)
As explained in comment, we could imagine an application and a bootloader relying on same dynamic library. So application and bootloader rely on library, application can be changed without impact on library or boot.
I did not find an easy way to do a shared library with arm-none-eabi-gcc. However
this document gives some alternatives to shared libraries. I your case, I would recommand the jump table solution.
Write a library with the functions that need to be used in bootloader and in applicative.
"library" code
typedef void (*genericFunctionPointer)(void)
// use the linker script to set MySection at a known address
// I think this could be a structure like Russ Schultz solution but struct may or may not compile identically in lib and boot. However yes struct would be much easyer and avoiding many function pointer cast.
const genericFunctionPointer FpointerArray[] __attribute__ ((section ("MySection")))=
{
(genericFunctionPointer)lib_f1,
(genericFunctionPointer)lib_f2,
}
void lib_f1(void)
{
//some code
}
uint8_t lib_f2(uint8_t param)
{
//some code
}
applicative and/or bootloader code
typedef void (*genericFunctionPointer)(void)
// Use the linker script to set MySection at same address as library was compiled
// in linker script also put this section as `NOLOAD` because it is init by library and not by our code
//volatile is needed here because you read in flash memory and compiler may initialyse usage of this array to NULL pointers
volatile const genericFunctionPointer FpointerArray[NB_F] __attribute__ ((section ("MySection")));
enum
{
lib_f1,
lib_f2,
NB_F,
}
int main(void)
{
(correctCastF1)(FpointerArray[lib_f1])();
uint8_t a = (correctCastF2)(FpointerArray[lib_f2])(10);
}
You can look into using linker sections. If you have your bootloader source code in folder bootloader, you can use
SECTIONS
{
.bootloader:
{
build_output/bootloader/*.o(.text)
} >flash_region1
.binary1:
{
build_output/binary1/*.o(.text)
} >flash_region2
.binary2:
{
build_output/binary2/*.o(.text)
} >flash_region3
}

C++/CLI method calls native method to modify int - need pin_ptr?

I have a C++/CLI method, ManagedMethod, with one output argument that will be modified by a native method as such:
// file: test.cpp
#pragma unmanaged
void NativeMethod(int& n)
{
n = 123;
}
#pragma managed
void ManagedMethod([System::Runtime::InteropServices::Out] int% n)
{
pin_ptr<int> pinned = &n;
NativeMethod(*pinned);
}
void main()
{
int n = 0;
ManagedMethod(n);
// n is now modified
}
Once ManagedMethod returns, the value of n has been modified as I would expect. So far, the only way I've been able to get this to compile is to use a pin_ptr inside ManagedMethod, so is pinning in fact the correct/only way to do this? Or is there a more elegant way of passing n to NativeMethod?
Yes, this is the correct way to do it. Very highly optimized inside the CLR, the variable gets the [pinned] attribute so the CLR knows that it stores an interior pointer to an object that should not be moved. Distinct from GCHandle::Alloc(), pin_ptr<> can do it without creating another handle. It is reported in the table that the jitter generates when it compiles the method, the GC uses that table to know where to look for object roots.
Which only ever matters when a garbage collection occurs at the exact same time that NativeMethod() is running. Doesn't happen very often in practice, you'd have to use threads in the program. YMMV.
There is another way to do it, doesn't require pinning but requires a wee bit more machine code:
void ManagedMethod(int% n)
{
int copy = n;
NativeMethod(copy);
n = copy;
}
Which works because local variables have stack storage and thus won't be moved by the garbage collector. Does not win any elegance points for style but what I normally use myself, estimating the side-effects of pinning is not that easy. But, really, don't fear pin_ptr<>.

#define vs const in Objective-C

I'm new to Objective-C, and I have a few questions regarding const and the preprocessing directive #define.
First, I found that it's not possible to define the type of the constant using #define. Why is that?
Second, are there any advantages to use one of them over the another one?
Finally, which way is more efficient and/or more secure?
First, I found that its not possible to define the type of the constant using #define, why is that?
Why is what? It's not true:
#define MY_INT_CONSTANT ((int) 12345)
Second, are there any advantages to use one of them over the another one?
Yes. #define defines a macro which is replaced even before compilation starts. const merely modifies a variable so that the compiler will flag an error if you try to change it. There are contexts in which you can use a #define but you can't use a const (although I'm struggling to find one using the latest clang). In theory, a const takes up space in the executable and requires a reference to memory, but in practice this is insignificant and may be optimised away by the compiler.
consts are much more compiler and debugger friendly than #defines. In most cases, this is the overriding point you should consider when making a decision on which one to use.
Just thought of a context in which you can use #define but not const. If you have a constant that you want to use in lots of .c files, with a #define you just stick it in a header. With a const you have to have a definition in a C file and
// in a C file
const int MY_INT_CONST = 12345;
// in a header
extern const int MY_INT_CONST;
in a header. MY_INT_CONST can't be used as the size of a static or global scope array in any C file except the one it is defined in.
However, for integer constants you can use an enum. In fact that is what Apple does almost invariably. This has all the advantages of both #defines and consts but only works for integer constants.
// In a header
enum
{
MY_INT_CONST = 12345,
};
Finally, which way is more efficient and/or more secure?
#define is more efficient in theory although, as I said, modern compilers probably ensure there is little difference. #define is more secure in that it is always a compiler error to try to assign to it
#define FOO 5
// ....
FOO = 6; // Always a syntax error
consts can be tricked into being assigned to although the compiler might issue warnings:
const int FOO = 5;
// ...
(int) FOO = 6; // Can make this compile
Depending on the platform, the assignment might still fail at run time if the constant is placed in a read only segment and it's officially undefined behaviour according to the C standard.
Personally, for integer constants, I always use enums for constants of other types, I use const unless I have a very good reason not to.
From a C coder:
A const is simply a variable whose content cannot be changed.
#define name value, however, is a preprocessor command that replaces all instances of the name with value.
For instance, if you #define defTest 5, all instances of defTest in your code will be replaced with 5 when you compile.
It is important to understand the difference between the #define and the const instructions which are not meant to the same things.
const
const is used to generate an object from the asked type that will be, once initialised, constant. It means that it is an object in the program memory and can be used as readonly.
The object is generated every time the program is launched.
#define
#define is used in order to ease the code readability and future modifications. When using a define, you only mask a value behind a name. Hence when working with a rectangle you can define width and height with corresponding values. Then in the code, it will be easier to read since instead of numbers there will be names.
If later you decide to change the value for the width you would only have to change it in the define instead of a boring and dangerous find/replace in your whole file.
When compiling, the preprocessor will replace all the defined name by the values in the code. Hence, there is no time lost using them.
In addition to other peoples comments, errors using #define are notoriously difficult to debug as the pre-processor gets hold of them before the compiler.
Since pre-processor directives are frowned upon, I suggest using a const. You can't specify a type with a pre-processor because a pre-processor directive is resolved before compilation. Well, you can, but something like:
#define DEFINE_INT(name,value) const int name = value;
and use it as
DEFINE_INT(x,42)
which would be seen by the compiler as
const int x = 42;
First, I found that its not possible to define the type of the constant using #define, why is that?
You can, see my first snippet.
Second, are there any advantages to use one of them over the another one?
Generally having a const instead of a pre-processor directive helps with debugging, not as much in this case (but still does).
Finally, which way is more efficient and/or more secure?
Both are as efficient. I'd say the macro can potentially be more secure as it can't be changed during run-time, whereas a variable could.
I have used #define before to help create more methods out of one method like if I have something like.
// This method takes up to 4 numbers, we don't care what the method does with these numbers.
void doSomeCalculationWithMultipleNumbers:(NSNumber *)num1 Number2:(NSNumber *)num2 Number3:(NSNumber *)num23 Number3:(NSNumber *)num3;
But I also what to have a method that only takes 3 numbers and 2 numbers so instead of writing two new methods I am going to use the same one using the #define, like so.
#define doCalculationWithFourNumbers(num1, num2, num3, num4) \
doSomeCalculationWithMultipleNumbers((num1), (num2), (num3), (num4))
#define doCalculationWithThreeNumbers(num1, num2, num3) \
doSomeCalculationWithMultipleNumbers((num1), (num2), (num3), nil)
#define doCalculationWithTwoNumbers(num1, num2) \
doSomeCalculationWithMultipleNumbers((num1), (num2), nil, nil)
I think this is a pretty cool thing to have, I know you can go straight to the method and just put nil in the spaces you don't want but if you are building a library it is very useful. Also this is how
NSLocalizedString(<#key#>, <#comment#>)
NSLocalizedStringFromTable(<#key#>, <#tbl#>, <#comment#>)
NSLocalizedStringFromTableInBundle(<#key#>, <#tbl#>, <#bundle#>, <#comment#>)
are done.
Whereas I don't believe you can do this with constants. But constants do have there benefits over #define like you can't specify a type with a #define because it is a pre-processor directive that is resolved before compilation, and if you get an error with #define they are harder to debug then constants. Both have there benefits and downsides but I would say it all depends on the programmer to which one you decided to use. I have written a library with both them in using the #define to do what I have shown and constants for declaring constant variables which I need to specify a type on.

avoiding duplicate SWIG boilerplate when using many SWIG-generated modules

When generating an interface module with SWIG, the generated C/C++ file contains a ton of static boilerplate functions. So if one wants to modularize the use of SWIG-generated interfaces by using many separately compiled small interfaces in the same application, there ends up being a lot of bloat due to these duplicate functions.
Using gcc's -ffunction-sections option, and the GNU linker's --icf=safe option (-Wl,--icf=safe to the compiler), one can remove some of the duplication, but by no means all of it (I think it won't coalesce anything that has a relocation in it—which many of these functions do).
My question: I'm wondering if there's a way to remove more of this duplicated boilerplate, ideally one that doesn't rely on GNU-specific compiler/linker options.
In particular, is there a SWIG option/flag/something that says "don't include boilerplate in each output file"? There actually is a SWIG option, -external-runtime that tells it to generate a "boilerplate-only" output file, but no apparent way of suppressing the copy included in each normal output file. [I think this sort of thing should be fairly simple to implement in SWIG, so I'm surprised that it doesn't seem to exist... but I can't seem to find anything documented.]
Here's a small example:
Given the interface file swg-oink.swg for module swt_oink:
%module swt_oink
%{ extern int oinker (const char *x); %}
extern int oinker (const char *x);
... and a similar interface swg-barf.swg for swt_barf:
%module swt_barf
%{ extern int barfer (const char *x); %}
extern int barfer (const char *x);
... and a test main file, swt-main.cc:
extern "C"
{
#include "lua.h"
#include "lualib.h"
#include "lauxlib.h"
extern int luaopen_swt_oink (lua_State *);
extern int luaopen_swt_barf (lua_State *);
}
int main ()
{
lua_State *L = lua_open();
luaopen_swt_oink (L);
luaopen_swt_barf (L);
}
int oinker (const char *) { return 7; }
int barfer (const char *) { return 2; }
and compiling them like:
swig -lua -c++ swt-oink.swg
g++ -c -I/usr/include/lua5.1 swt-oink_wrap.cxx
swig -lua -c++ swt-barf.swg
g++ -c -I/usr/include/lua5.1 swt-barf_wrap.cxx
g++ -c -I/usr/include/lua5.1 swt-main.cc
g++ -o swt swt-main.o swt-oink_wrap.o swt-barf_wrap.o
then the size of each xxx_wrap.o file is about 16KB, of which 95% is boilerplate, and the size of the final executable is roughly the sum of these, about 39K. If one compiles each interface file with -ffunction-sections, and links with -Wl,--icf=safe, the size of the final executable is 34KB, but there's still clearly a lot of duplication (using nm on the executable one can see tons of functions defined multiple times, and looking at their source, it's clear that it would be fine to use a single global definition for most of them).
I'm fairly sure SWIG doesn't have an option for doing this. I'm speculating now, but I think the reason might well be concern about visibility of this for modules built with different versions of SWIG. Imagine the following scenario:
Two libraries X and Y both provide an interface to their code using SWIG. They both opt to make the "SWIG glue" stuff visible across different translation units in order to reduce code size. This will all be well and good if both X and Y are using the same version of SWIG. What happens though if X uses SWIG 1.1 and Y uses SWIG 1.3? Both modules work fine on their own, but depending on how the platform treats shared objects and how the language itself loads them (RTLD_GLOBAL?) some potentially very bad things would happen from the combination of the two modules being used in the same VM.
The penalty of the code duplication is pretty low I suspect - the cost of swapping between VM and native code is typically quite high, which probably dwarfs the slightly reduced instruction cache hits, although it might be interesting to see real benchmarks. On the up side this is code no users ever need to worry about it, since it's all auto generated and all correctly kept with interfaces written for the corresponding version.
I might be a bit late, but here is a workaround:
In SWIG (<= 1.3 ) there is -noruntime command-line option
Since SWIG 2.0 -noruntime was deprecated, so now one should pass -DSWIG_NOINCLUDE to the C preprocessor - not to the swig itself
I am completely not sure that this is correct, but it at least works for me. I am going to clarify this question in the SWIG's mailing list.