In an attempt to implement wrappers for parts of the CGAL library, I have run into some performance issues, some of which seem to be linked to C++/CLI interop. To identify the issue I created a simple example that makes a few calls to the parts of CGAL I want to wrap. I don't think the specific calls are central to the issue, but I have included them for reference. My program looks as follows:
std::ifstream stream1("Sphere1.OFF");
std::ifstream stream2("Sphere2.OFF");
Polyhedron P1 = Polyhedron();
Polyhedron P2 = Polyhedron();
stream1 >> P1;
stream2 >> P2;
Nef_polyhedron N1(P1);
Nef_polyhedron N2(P2);
Nef_polyhedron N3 = N1 + N2;
Polyhedron and Nef_polyhedron are CGAL types defined as:
typedef CGAL::Simple_cartesian<CGAL::Lazy_exact_nt<CGAL::Gmpq> > Kernel;
typedef CGAL::Nef_polyhedron_3<Kernel> Nef_polyhedron;
typedef CGAL::Polyhedron_3<Kernel> Polyhedron;
If I compile the program without CLR support, it runs in around 10 seconds. This is not ideal, but it will have to do for now. The problem is that if I enable CLR support, the time to run this small sample program triples to over 30 seconds.
I suspect the issue has something to do with CGAL being largely header-only. As a consequence, Kernel, Polyhedron and Nef_polyhedron are compiled with CLR support, which makes it unpredictable how often interop calls take place. The idea was that the Nef_polyhedron N3 = N1 + N2; call should be a single interop call into native code, to keep the overhead down.
I tried using #pragma managed(push, off) to force native compilation of the headers, but this only made things worse: the execution time went up to around 70 seconds.
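Roughly, the attempt looked like this (a sketch only; the exact CGAL headers and include order may differ from what I actually use):

#pragma managed(push, off)
// Force native code generation for the CGAL headers and the typedefs built on them
#include <CGAL/Simple_cartesian.h>
#include <CGAL/Lazy_exact_nt.h>
#include <CGAL/Gmpq.h>
#include <CGAL/Polyhedron_3.h>
#include <CGAL/Nef_polyhedron_3.h>

typedef CGAL::Simple_cartesian<CGAL::Lazy_exact_nt<CGAL::Gmpq> > Kernel;
typedef CGAL::Nef_polyhedron_3<Kernel> Nef_polyhedron;
typedef CGAL::Polyhedron_3<Kernel> Polyhedron;
#pragma managed(pop)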
Why am I getting such a big performance hit with CLR? Might it have something to do with marshalling of the Polyhedron and Nef_polyhedron types? And if so, would it help to specify custom marshalling for these types?
Related
I have a C++/CLI method, ManagedMethod, with one output argument that will be modified by a native method as such:
// file: test.cpp
#pragma unmanaged
void NativeMethod(int& n)
{
    n = 123;
}
#pragma managed
void ManagedMethod([System::Runtime::InteropServices::Out] int% n)
{
    pin_ptr<int> pinned = &n;
    NativeMethod(*pinned);
}
int main()
{
    int n = 0;
    ManagedMethod(n);
    // n is now modified
}
Once ManagedMethod returns, the value of n has been modified as I would expect. So far, the only way I've been able to get this to compile is to use a pin_ptr inside ManagedMethod, so is pinning in fact the correct/only way to do this? Or is there a more elegant way of passing n to NativeMethod?
Yes, this is the correct way to do it, and it is very highly optimized inside the CLR. The variable gets the [pinned] attribute, so the CLR knows that it stores an interior pointer to an object that should not be moved. Unlike GCHandle::Alloc(), pin_ptr<> can do this without creating another handle. It is recorded in the table that the jitter generates when it compiles the method; the GC uses that table to know where to look for object roots.
This only ever matters when a garbage collection occurs at the exact moment that NativeMethod() is running. That doesn't happen very often in practice; you'd have to use threads in the program. YMMV.
There is another way to do it, doesn't require pinning but requires a wee bit more machine code:
void ManagedMethod(int% n)
{
    int copy = n;
    NativeMethod(copy);
    n = copy;
}
This works because local variables have stack storage and thus won't be moved by the garbage collector. It does not win any elegance points for style, but it is what I normally use myself; estimating the side effects of pinning is not that easy. But, really, don't fear pin_ptr<>.
I am working on refactoring a large amount of code from an unmanaged C++ assembly into a C# assembly. There is currently a mixed-mode assembly going between the two with, of course, a mix of managed and unmanaged code. There is a function I am trying to call in the unmanaged C++ which relies on FILE*s (as defined in stdio.h). This function ties into a much larger process which cannot be refactored into the C# code yet, but which now needs to be called from the managed code.
I have searched but cannot find a definitive answer to what kind of underlying system pointer the System::IO::FileStream class uses. Is this just applied on top of a FILE*? Or is there some other way to convert a FileStream^ to a FILE*? I found FileStream::SafeFileHandle, on which I can call DangerousGetHandle().ToPointer() to get a native void*, but I'm just trying to be certain that if I cast this to FILE* that I'm doing the right thing...?
void Write(FILE *out)
{
    Data->Write(out); // huge bulk of code, writing the data
}

virtual void __clrcall Write(System::IO::FileStream ^out)
{
    // is this right??
    FILE *pout = (FILE*)out->SafeFileHandle->DangerousGetHandle().ToPointer();
    Write(pout);
}
You'll need _open_osfhandle followed by _fdopen.
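A sketch of that route (untested; it assumes the FileStream was opened for writing and stays open, and that the FILE* is not separately closed while the stream still owns the handle):

#include <cstdint>   // intptr_t
#include <cstdio>    // FILE, _fdopen
#include <io.h>      // _open_osfhandle
#include <fcntl.h>   // _O_APPEND, _O_TEXT, ... if needed

FILE *GetFilePointer(System::IO::FileStream ^fs)
{
    // Wrap the raw OS handle in a CRT file descriptor...
    intptr_t osHandle = (intptr_t)fs->SafeFileHandle->DangerousGetHandle().ToPointer();
    int fd = _open_osfhandle(osHandle, 0); // pass flags such as _O_APPEND if appropriate
    if (fd == -1)
        return NULL;
    // ...then wrap the descriptor in a FILE*.
    return _fdopen(fd, "w");
}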
Casting is not magic. Just because the input and output types are right for your situation doesn't mean the values are.
I'm optimizing a very time-critical CUDA kernel. My application accepts a wide range of switches that affect the behavior (for instance, whether to use 3rd or 5th order derivative). Consider as an approximation a set of 50 switches, where every switch is an integer variable (a bool sometimes, or a float, but this case is not so relevant for this question).
All these switches are constant during the execution of the application. Most of them are run-time switches and I store them in constant memory, so as to exploit the caching mechanism. Other switches can be compile-time, and the customer is fine with having to re-compile the application to change the value of a switch. A very simple example could be:
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ )
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
}
Assume that do_this and do_that are compute-bound and very cheap, that I optimize the for loop so that its overhead is negligible, that I have to place the if inside the iteration. If the compiler recognizes that compile_time_switch is static information it can optimize out the call to the "wrong" function and create code that is just as optimized as if the if weren't there. Now the real question:
In which ways can I provide the compiler with the static value of this switch? I see two such ways, listed below, but neither of them works for me. What other possibilities remain?
Template parameters
Providing a template parameter enables this static optimization.
template<int compile_time_switch>
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ )
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
}
This simple solution does not work for me, since I don't have direct access to the code that calls the kernel.
Static members
Consider the following struct:
struct GlobalParameters
{
    static const bool compile_time_switch = true;
};
Now GlobalParameters::compile_time_switch contains the static information as I want it, and the compiler would be able to optimize the kernel. Unfortunately, CUDA does not support such static members.
EDIT: the last statement is apparently wrong. The definition of the struct is of course legitimate, and you are able to use the static member GlobalParameters::compile_time_switch in device code. The compiler inlines the variable, so the final code directly contains the value rather than a run-time variable access, which is the behavior you would expect from an optimizing compiler. So the second option is actually suitable.
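For illustration, the kernel would then read the switch directly from the struct, along these lines (a sketch based on the kernel above):

__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ )
        if (GlobalParameters::compile_time_switch)  // inlined to a compile-time constant
            do_this(in, out);
        else
            do_that(in, out);
}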
I consider my problem solved both thanks to this fact and to kronos' answer. However, I'm still looking for other alternative methods to provide compile-time information to the compiler.
Your third option is preprocessor definitions:
#define compile_time_switch 1
__global__ void mykernel(const float* in, float *out)
{
    for ( /* many many times */ )
        if (compile_time_switch)
            do_this(in, out);
        else
            do_that(in, out);
}
After the preprocessor has substituted the value, the condition is a literal constant, so the compiler can discard the else branch completely in its dead-code elimination pass.
Furthermore, you can specify the definition with the -D command-line switch, and (I think) any host compiler supported by NVIDIA will accept -D (MSVC may use a different switch).
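For example, assuming nvcc is invoked directly:

nvcc -Dcompile_time_switch=1 -c mykernel.cu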
When generating an interface module with SWIG, the generated C/C++ file contains a ton of static boilerplate functions. So if one wants to modularize the use of SWIG-generated interfaces by using many separately compiled small interfaces in the same application, there ends up being a lot of bloat due to these duplicate functions.
Using gcc's -ffunction-sections option, and the GNU linker's --icf=safe option (-Wl,--icf=safe to the compiler), one can remove some of the duplication, but by no means all of it (I think it won't coalesce anything that has a relocation in it—which many of these functions do).
My question: I'm wondering if there's a way to remove more of this duplicated boilerplate, ideally one that doesn't rely on GNU-specific compiler/linker options.
In particular, is there a SWIG option/flag/something that says "don't include boilerplate in each output file"? There actually is a SWIG option, -external-runtime that tells it to generate a "boilerplate-only" output file, but no apparent way of suppressing the copy included in each normal output file. [I think this sort of thing should be fairly simple to implement in SWIG, so I'm surprised that it doesn't seem to exist... but I can't seem to find anything documented.]
Here's a small example:
Given the interface file swt-oink.swg for module swt_oink:
%module swt_oink
%{ extern int oinker (const char *x); %}
extern int oinker (const char *x);
... and a similar interface swt-barf.swg for swt_barf:
%module swt_barf
%{ extern int barfer (const char *x); %}
extern int barfer (const char *x);
... and a test main file, swt-main.cc:
extern "C"
{
#include "lua.h"
#include "lualib.h"
#include "lauxlib.h"
extern int luaopen_swt_oink (lua_State *);
extern int luaopen_swt_barf (lua_State *);
}
int main ()
{
lua_State *L = lua_open();
luaopen_swt_oink (L);
luaopen_swt_barf (L);
}
int oinker (const char *) { return 7; }
int barfer (const char *) { return 2; }
and compiling them like:
swig -lua -c++ swt-oink.swg
g++ -c -I/usr/include/lua5.1 swt-oink_wrap.cxx
swig -lua -c++ swt-barf.swg
g++ -c -I/usr/include/lua5.1 swt-barf_wrap.cxx
g++ -c -I/usr/include/lua5.1 swt-main.cc
g++ -o swt swt-main.o swt-oink_wrap.o swt-barf_wrap.o
then the size of each xxx_wrap.o file is about 16KB, of which 95% is boilerplate, and the size of the final executable is roughly the sum of these, about 39K. If one compiles each interface file with -ffunction-sections, and links with -Wl,--icf=safe, the size of the final executable is 34KB, but there's still clearly a lot of duplication (using nm on the executable one can see tons of functions defined multiple times, and looking at their source, it's clear that it would be fine to use a single global definition for most of them).
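For reference, the build with those options would look roughly like this (assuming the gold linker is used, since --icf is a gold option):

g++ -c -ffunction-sections -I/usr/include/lua5.1 swt-oink_wrap.cxx
g++ -c -ffunction-sections -I/usr/include/lua5.1 swt-barf_wrap.cxx
g++ -c -I/usr/include/lua5.1 swt-main.cc
g++ -fuse-ld=gold -Wl,--icf=safe -o swt swt-main.o swt-oink_wrap.o swt-barf_wrap.o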
I'm fairly sure SWIG doesn't have an option for doing this. I'm speculating now, but I think the reason might well be concern about visibility of this for modules built with different versions of SWIG. Imagine the following scenario:
Two libraries X and Y both provide an interface to their code using SWIG. They both opt to make the "SWIG glue" stuff visible across different translation units in order to reduce code size. This will all be well and good if both X and Y are using the same version of SWIG. What happens though if X uses SWIG 1.1 and Y uses SWIG 1.3? Both modules work fine on their own, but depending on how the platform treats shared objects and how the language itself loads them (RTLD_GLOBAL?) some potentially very bad things would happen from the combination of the two modules being used in the same VM.
The penalty of the code duplication is pretty low, I suspect - the cost of swapping between VM and native code is typically quite high, which probably dwarfs the slightly reduced instruction cache hits, although it might be interesting to see real benchmarks. On the up side, this is code no users ever need to worry about, since it's all auto-generated and correctly kept with interfaces written for the corresponding version.
I might be a bit late, but here is a workaround:
In SWIG (<= 1.3) there is a -noruntime command-line option.
Since SWIG 2.0, -noruntime has been deprecated, so now one should pass -DSWIG_NOINCLUDE to the C preprocessor - not to swig itself.
I am not at all sure that this is correct, but it at least works for me. I am going to ask for clarification on the SWIG mailing list.
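Applied to the example in the question, I believe (an untested guess, in the same spirit as the answer above) that would mean letting one wrapper keep the runtime and compiling the others with the macro defined:

g++ -c -I/usr/include/lua5.1 swt-oink_wrap.cxx
g++ -c -I/usr/include/lua5.1 -DSWIG_NOINCLUDE swt-barf_wrap.cxx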
I'm evaluating Server 2008. My C++ executable is getting this error. I've seen this error mentioned on MSDN, where it seems to have required a hot-fix for several previous OSes. Has anyone else seen this? I get the same results on both the 32- and 64-bit OS.
Code snippet:
HRESULT GroupStart([in] short iClientId, [in] VARIANT GroupDataArray,
[out] short* pGroupInstance, [out] long* pCommandId);
Where the GroupDataArray VARIANT argument wraps a single-dimension SAFEARRAY of VARIANTs wrapping DCAPICOM_GroupData struct entries:
// DCAPICOM_GroupData
[
    uuid(F1FE2605-2744-4A2A-AB85-1E1845C280EB),
    helpstring("removed")
]
typedef struct DCAPICOM_GroupData {
    [helpstring("removed")]
    long m_lImageID;
    [helpstring("removed")]
    unsigned char m_ucHeadID;
    [helpstring("removed")]
    unsigned char m_ucPlateID;
} DCAPICOM_GroupData;
After opening a support case with Microsoft, I can now answer my own question. This is (now) a recognized bug. The issue has to do with marshalling on the server side, but before the server code is called. Our structure is 6 bytes long, but this COM implementation interprets it as 8, so the marshalling fails, and this is the error you get back. The workaround, until a Service Pack is released to deal with this, is to add two extra bytes to the structure to pad it up to 8 bytes. We haven't run across any more instances that fail yet, but we still have a lot of testing to do.
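A sketch of the padded struct (the padding field names are made up for illustration; the point is just the two extra trailing bytes):

typedef struct DCAPICOM_GroupData {
    [helpstring("removed")]
    long m_lImageID;             // 4 bytes
    [helpstring("removed")]
    unsigned char m_ucHeadID;    // 1 byte
    [helpstring("removed")]
    unsigned char m_ucPlateID;   // 1 byte
    unsigned char m_ucPadding1;  // 2 extra bytes so the marshaller sees the
    unsigned char m_ucPadding2;  // 8-byte size it expects
} DCAPICOM_GroupData;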
We ran into the same error recently with a client/server app communicating via DCOM. It turned out that the size of a marshalled interface pointer going across the wire (i.e., not local) had changed (gotten bigger). You might like to check whether your code is doing any special marshalling via CoMarshalInterface or the like.