CGSSetWindowWarp not working in 64-bit apps - objective-c

I am developing a Mac OS X app using the undocumented CGSSetWindowWarp function.
Everything is fine when compiling for 32-bit, but it stops working (the window disappears completely) when compiling for 64-bit.
Do you have any idea where the issue could be?
Thanks in advance for your help.
Regards,

Are you sure about the function signature? The size of the parameters might have changed, or it might not have changed but an int now needs to be a short, etc.
Or they might have stopped supporting that function altogether.

The points in the warp mesh are not actually CGPoints; they use float for x and y even in 64-bit, whereas CGPoint uses double on 64-bit.
You can redefine the warp mesh and function like this:
typedef struct CGSPoint {
    float x;
    float y;
} CGSPoint;

typedef struct {
    CGSPoint local;
    CGSPoint global;
} CGSPointWarp;

extern CGError CGSSetWindowWarp(CGSConnectionID conn, CGSWindowID window, int w, int h, CGSPointWarp **mesh);
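For reference, here is a minimal, hypothetical sketch of how such a mesh might be filled and passed: a 2x2 "identity" warp in which each local corner maps to the corresponding screen position, so the window should appear undistorted. The connection/window ID types, _CGSDefaultConnection, and the exact pointer type of the mesh parameter are private SPI and assumptions on my part (some CGSPrivate.h headers declare the last parameter as a flat CGSPointWarp * rather than CGSPointWarp **), so adjust the declarations to whatever you already use:

#include <ApplicationServices/ApplicationServices.h>   /* CGError, CGRect */
#include <stdint.h>

typedef int CGSConnectionID;       /* assumption: private SPI typedef */
typedef uint32_t CGSWindowID;      /* assumption: private SPI typedef */

typedef struct CGSPoint { float x; float y; } CGSPoint;
typedef struct { CGSPoint local; CGSPoint global; } CGSPointWarp;

extern CGSConnectionID _CGSDefaultConnection(void);    /* private SPI */
extern CGError CGSSetWindowWarp(CGSConnectionID conn, CGSWindowID window,
                                int w, int h, const CGSPointWarp *mesh);

static void applyIdentityWarp(CGSWindowID windowID, CGRect frame)
{
    CGSPointWarp mesh[2][2];   /* 2 rows x 2 columns of control points */
    for (int row = 0; row < 2; ++row) {
        for (int col = 0; col < 2; ++col) {
            float localX = (float)(col * frame.size.width);
            float localY = (float)(row * frame.size.height);
            mesh[row][col].local  = (CGSPoint){ localX, localY };
            mesh[row][col].global = (CGSPoint){ (float)frame.origin.x + localX,
                                                (float)frame.origin.y + localY };
        }
    }
    CGSSetWindowWarp(_CGSDefaultConnection(), windowID, 2, 2, &mesh[0][0]);
}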

Related

How to tell if Nvidia GPU cores are 32/64 bit processors

Is there a Linux/Windows command that shows the type of the processors in an Nvidia GPU? I am not talking about the operating system or the CPU type; I am asking about the processors (cores) in the GPU itself. At the end of the day, they are processors. How can I tell whether they have 32/64-bit registers and 32/64-bit ALUs?
A related question: are 64-bit instructions, such as adding two (unsigned long int) numbers, emulated using 32-bit instructions by the compiler or some other intermediate layer, or are they executed natively by the hardware?
This question is not quite the same as that one; I need a way to tell what kind of machine the GPU itself is. Also, the answer to that question does not explain how 64-bit instructions are specifically executed.
I have coded two simple kernels, each adding two vectors of type int (32-bit) or long int (64-bit). It turns out that on my GPU (a Tesla K80), which happens to be a pretty new and capable one, the cores are just 32-bit:
the execution time roughly doubles when moving from the 32-bit to the 64-bit version.
The kernels are as follows:
__global__ void add_32(int *c, int *a, int *b)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];
}

typedef long int int64;

__global__ void add_64(int64 *c, int64 *a, int64 *b)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] + b[gid];
}
When the vector size is 1 mega-element, add_32 takes about 102.911 microseconds, whereas add_64 takes 192.669 microseconds. (Execution times were reported by the Nvidia profiler while running the release-mode binary.)
It seems that 64-bit instructions are just emulated via 32-bit instructions!
This could be a brute-force way to find out what kind of machines the GPU cores are, but it is definitely not an elegant one.
Update:
Thanks to Paul A. Clayton's comment, it seems the comparison above is not fair, since the data size doubles in the 64-bit case. So we should not launch both kernels with the same number of elements; the correct principle is to launch the 64-bit version with half the number of elements.
To be even more sure, let's consider element-wise vector multiplication instead of addition. If the GPU emulates 64-bit instructions via 32-bit instructions, then it needs at least three 32-bit multiplication instructions to multiply two 64-bit numbers (using, for instance, the Karatsuba algorithm). This implies that if 64-bit multiplication were merely emulated, launching the 64-bit multiplication kernel with N/2 elements would still take longer than launching the 32-bit kernel with N elements; the decomposition sketch below illustrates why.
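For what it's worth, here is a small, plain-C sketch of the decomposition such emulation implies: schoolbook splitting of each 64-bit operand into 32-bit halves already costs three 32-bit multiplies for the low 64 bits of the product. This is only an illustration of the counting argument, not a claim about what the compiler or hardware actually emits; each (uint64_t) cast models a widening 32x32 -> 64-bit hardware multiply.

#include <stdint.h>

/* Low 64 bits of a*b using only 32x32 -> 64-bit multiplies.
   Three such multiplies are needed; the aH*bH partial product
   would shift entirely out of the low 64 bits, so it is dropped. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b)
{
    uint32_t aL = (uint32_t)a, aH = (uint32_t)(a >> 32);
    uint32_t bL = (uint32_t)b, bH = (uint32_t)(b >> 32);

    uint64_t low = (uint64_t)aL * bL;                      /* aL*bL                 */
    uint64_t mid = (uint64_t)aL * bH + (uint64_t)aH * bL;  /* cross terms           */
    return low + (mid << 32);                              /* result modulo 2^64    */
}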
Here are the kernels:
__global__ void mul_32(int *c, int *a, int *b)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];
}

typedef long int int64;

__global__ void mul_64(int64 *c, int64 *a, int64 *b)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    c[gid] = a[gid] * b[gid];
}
And here are the experiment details (times reported are from the Nvidia profiler on the release-mode binary):
1. Kernel mul_32 with vector size N = 256 mega-elements takes 25.608 milliseconds.
2. Kernel mul_64 with vector size N = 128 mega-elements takes 24.153 milliseconds.
I am aware that both kernels produce incorrect results, but I think that has nothing to do with the way the computation is carried out.

Why is CGFloat float on 32 bit and double on 64 bit?

From "CoreGraphics/CGBase.h":
#if defined(__LP64__) && __LP64__
# define CGFLOAT_TYPE double
# define CGFLOAT_IS_DOUBLE 1
# define CGFLOAT_MIN DBL_MIN
# define CGFLOAT_MAX DBL_MAX
#else
# define CGFLOAT_TYPE float
# define CGFLOAT_IS_DOUBLE 0
# define CGFLOAT_MIN FLT_MIN
# define CGFLOAT_MAX FLT_MAX
#endif
Why did Apple do this? What's the advantage?
I can only seem to think of downsides. Please enlighten me.
Apple explicitly says they did it "to provide a wider range and accuracy for graphical quantities." You can debate whether the wider range and accuracy have been really helpful in practice, but Apple is clear on what they were thinking.
It's worth remembering, by the way, that CGFloat was added in OS X 10.5, long before iPhones (and certainly long before 64-bit iPhones). Going 64-bit is more obviously beneficial on "big memory" machines like Macs, and Apple made "local architecture" types that were supposed to make it easier to transition between the "old" and "new" worlds.
I think it's interesting that Swift brought over NSInteger as the default Int type (i.e. Int is architecture-specific), but made Float and Double architecture-independent; there is no equivalent of CGFloat in the language. I read this as a tacit acknowledgement that CGFloat wasn't the greatest idea.
As an aside, NEON only supports single-precision floating-point math; double-precision math has to be done on the VFP (not that NEON was a consideration when CGFloat was invented).
It's a performance thing.
On a 32-bit CPU, a single-precision, 32-bit float can be stored in a single register and moved around quickly and efficiently, because it is the same size as an architecture-native pointer.
On a 64-bit CPU architecture, a 64-bit IEEE double has the same advantage of being the same size as a native pointer/register/etc.
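As a quick, hedged way to confirm which definition is in effect on a given target (assuming the CoreGraphics headers are available):

#include <CoreGraphics/CGBase.h>
#include <stdio.h>

int main(void)
{
    /* CGFLOAT_IS_DOUBLE and CGFloat come straight from CGBase.h, quoted above. */
#if CGFLOAT_IS_DOUBLE
    printf("CGFloat is double (%zu bytes)\n", sizeof(CGFloat)); /* 8 on 64-bit targets */
#else
    printf("CGFloat is float (%zu bytes)\n", sizeof(CGFloat));  /* 4 on 32-bit targets */
#endif
    return 0;
}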

CGFloat: round, floor, abs, and 32/64 bit precision

TL;DR: How do I call standard floating-point functions in a way that compiles for both 32-bit and 64-bit CGFloat without warnings?
CGFloat is defined as either double or float, depending on the compiler settings and platform. I'm trying to write code that works well in both situations without generating a lot of warnings.
When I use functions like floor, abs, ceil, and other simple floating-point operations, I get warnings about values being truncated. For example:
warning: implicit conversion shortens 64-bit value into a 32-bit value
I'm not concerned about correctness or loss of precision in the calculations; I realize I could just use the double-precision versions of all functions all of the time (floor instead of floorf, etc.), but then I would still have to tolerate these warnings.
Is there a way to write clean code that supports both 32-bit and 64-bit floats without having to either use a lot of #ifdef __LP64__'s or write wrapper functions for all of the standard floating-point functions?
You can use the type-generic versions of those functions from tgmath.h.
#include <tgmath.h>
...
double d = 1.5;
double e = floor(d); // will choose the 64-bit version of 'floor'
float f = 1.5f;
float g = floor(f); // will choose the 32-bit version, 'floorf'
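Applied to CGFloat specifically, a minimal C sketch (the helper name is hypothetical) lets the same call resolve to floorf or floor depending on how CGFloat is defined:

#include <tgmath.h>
#include <CoreGraphics/CGBase.h>

/* Hypothetical helper: with <tgmath.h>, floor() resolves to floorf when
   CGFloat is float and to floor when CGFloat is double, so there is no
   truncation warning on either architecture. */
static CGFloat pixelFloor(CGFloat value)
{
    return floor(value);
}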
If you only need a few functions you can use this instead:
#if CGFLOAT_IS_DOUBLE
#define roundCGFloat(x) round(x)
#define floorCGFloat(x) floor(x)
#define ceilCGFloat(x) ceil(x)
#else
#define roundCGFloat(x) roundf(x)
#define floorCGFloat(x) floorf(x)
#define ceilCGFloat(x) ceilf(x)
#endif
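And a short, hypothetical usage example of the macros above (assuming <math.h> and <CoreGraphics/CGBase.h> are included so that floorf/floor and CGFloat are declared):

/* Hypothetical helper built on the macros above: compiles without
   truncation warnings whether CGFloat is float or double. */
static CGFloat snapDown(CGFloat value)
{
    return floorCGFloat(value);
}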

Best (and fastest) way to store triangles and lines in C++?

I've got a few 3D apps going, and I was wondering: what is the best way to store lines and triangles? At the moment, I store lines as an array of typedef'd vectors, like this:
typedef struct
{
    float x, y, z;
} Vector;

Vector line[2];
Now, I could do it like this:
typedef struct
{
    Vector start, end;
} Line;

Line lineVar;
Faces could be similar:
typedef struct
{
    Vector v1, v2, v3;
} Face;

Face faceVar;
My question is this: Is there a better or faster way to store lines and faces? Or am I doing it OK?
Thanks,
James
What you have is pretty much how vectors are represented in computer programs. I can't imagine any other way to do it. This is perfectly fine:
typedef struct
{
    float x, y, z;
} Vector;
(DirectX stores vector components like this, by the way.)
However, 3D intensive programs typically have the faces index into a vector array to save space since the same points often appear on different faces of a 3D model:
typedef struct
{
    int vectorIndex1, vectorIndex2, vectorIndex3;
} Face;
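As a small illustrative sketch with made-up data, using the Vector and Face typedefs above, two triangles sharing an edge can then be stored as four unique vertices plus two index triples, instead of six full Vectors:

/* A unit quad in the z = 0 plane, split into two triangles that share an edge. */
static const Vector vertices[4] = {
    { 0.0f, 0.0f, 0.0f },   /* 0: bottom-left  */
    { 1.0f, 0.0f, 0.0f },   /* 1: bottom-right */
    { 1.0f, 1.0f, 0.0f },   /* 2: top-right    */
    { 0.0f, 1.0f, 0.0f },   /* 3: top-left     */
};

static const Face faces[2] = {
    { 0, 1, 2 },            /* first triangle */
    { 0, 2, 3 },            /* second triangle reuses vertices 0 and 2 */
};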

gcc memory alignment pragma

Does gcc have a memory alignment pragma, akin to #pragma vector aligned in the Intel compiler?
I would like to tell the compiler to optimize a particular loop using aligned load/store instructions. To avoid possible confusion, this is not about struct packing.
e.g.:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
    q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
    q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
    q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
    q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
    q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
    q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang 7.0 or ICC 19; see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps, instead of using a separate movups.) You have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));

// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        // math!
    }
}
This won't make aligned_double 16 bytes wide; it will just make it aligned to a 16-byte boundary, or rather the first one in an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a LOT of vector ops. I am using a Power architecture machine at the moment, so it's AltiVec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because AltiVec doesn't support double-precision floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler.
Does gcc have a memory alignment pragma, akin to #pragma vector aligned
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments; if it has three, the third argument should have integer type, and a nonzero value means a misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.
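Applied back to the original question, a minimal sketch might look like the following; the function and variable names are illustrative, and the promise is only valid if the caller really does pass 16-byte-aligned pointers (otherwise the behaviour is undefined):

/* Promise GCC that dst and src are 16-byte aligned so it can emit aligned
   vector loads/stores when it vectorizes the loop. */
void scale(float *dst, const float *src, float k, int n)
{
    float *d = (float *)__builtin_assume_aligned(dst, 16);
    const float *s = (const float *)__builtin_assume_aligned(src, 16);
    for (int i = 0; i < n; ++i)
        d[i] = k * s[i];
}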