Do modern GPUs optimize multiplication by powers of 2 by doing a bit shift? For example suppose I do the following in a shader:
float t = 0;
t *= 16;
t *= 17;
Is it possible the first multiplication will run faster than the second?
Floating point multiplication cannot be done by bit shift. Howerver, in theory floating point multiplication by power of 2 constants can be optimized. Floating point value is normally stored in the form of S * M * 2 ^ E, where S is a sign, M is mantissa and E is exponent. Multiplying by a power of 2 constant can be done by adding/substracting to the exponent part of the float, without modifying the other parts. But in practice, I would bet that on GPUs a generic multiply instruction is always used.
I had an interesting observation regarding the power of 2 constants while studying the disassembly output of the PVRShaderEditor (PowerVR GPUs). I have noticed that a certain range of power of 2 constants ([2^(-16), 2^10] in my case), use special notation, e.g. C65, implying that they are predefined. Whereas arbitrary constants, such as 3.0 or 2.3, use shared register notation (e.g. SH12), which implies they are stored as a uniform and probably incur some setup cost. Thus using power of 2 constants may yield some optimizational benefit at least on some hardware.
From "CoreGraphics/CGBase.h":
#if defined(__LP64__) && __LP64__
# define CGFLOAT_TYPE double
# define CGFLOAT_IS_DOUBLE 1
# define CGFLOAT_MIN DBL_MIN
# define CGFLOAT_MAX DBL_MAX
#else
# define CGFLOAT_TYPE float
# define CGFLOAT_IS_DOUBLE 0
# define CGFLOAT_MIN FLT_MIN
# define CGFLOAT_MAX FLT_MAX
#endif
Why did Apple do this? What's the advantage?
I can seem to think of downsides only. Please enlighten me.
Apple explicitly says they did it "to provide a wider range and accuracy for graphical quantities." You can debate whether the wider range and accuracy have been really helpful in practice, but Apple is clear on what they were thinking.
It's worth remembering, BTW, that CGFloat was added in OS X 10.5, long before iPhones (and certainly long before 64-bit iPhones). Going 64-bit is more obviously beneficial on "big memory" machines like Macs. And Apple made "local architecture" types that were supposed to make it easier to transition between the "old" and "new" worlds. I think it's interesting that Swift brought over NSInteger as the default Int type (i.e. Int is architecture-specific). But they made Float and Double architecture independent. There is no equivalent of CGFloat in the language. I read this as a tacit acknowledgement that CGFloat wasn't the greatest idea. NEON only supports single precision floating point math. Double precision math has to be done on the VFP. (Not that NEON was a consideration when CGFloat was invented.)
It's a performance thing.
On a 32-bit CPU, a single precision, 32-bit float can be stored in a single register, and moved around quickly and efficiently, because it's the same size as an architecture-native pointer.
On a 64-bit CPU architecture, a 64-bit IEEE double has the same advantage of being the same size as a native pointer/register/etc.
The Programming Guide has instructions for doubles (%ld) and vector types (e.g. %v4f), but not half-precision floats.
Normally in C, varargs arguments are automatically promoted to larger datatypes, such as float to double. The OpenCL documentation seems to imply that a similar promotion applies there.
Therefore a simple %f should work also for half-length floats.
I am developping a Mac OS X app using the undocummented CGSSetWindowWarp function.
Everything is ok when compiing in 32 bits but it stop to work (window dissapear completly) when compiling in 64 bits.
Do you have any idée where the issue can be?
Thanks in advance for your help
Regards,
Are you sure about the function signature? The size of the parameters might have changed / might not have changed but int needs to be short, etc.
Or they might have stopped supporting that function altogether.
The points in the warp mesh are not actually CGPoints, they use float for x and y even in 64-bit. (CGPoint uses double on 64-bit)
You can redefine the warp mesh and function like this:
typedef struct CGSPoint {
float x;
float y;
} CGSPoint;
typedef struct {
CGSPoint local;
CGSPoint global;
} CGSPointWarp;
extern CGError CGSSetWindowWarp(CGSConnectionID conn, CGSWindowID window, int w, int h, CGSPointWarp **mesh)
Does gcc have memory alignment pragma, akin #pragma vector aligned in Intel compiler?
I would like to tell compiler to optimize particular loop using aligned loads/store instructions. to avoid possible confusion, this is not about struct packing.
e.g:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang7.0 or ICC19, see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps, instead of using a separate movups). You have have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));
// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
for (int i = 0; i < n; ++i) {
// math!
}
}
This won't make aligned_double 16 bytes wide. This will just make it aligned to a 16-byte boundary, or rather the first one in an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a LOT of vector ops. I am using a Power architecture computer at the moment so it's altivec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because there altivec doesn't support double floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?
Does gcc have memory alignment pragma, akin #pragma vector aligned
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments, if it has three,
the third argument should have integer type, and if it is nonzero
means misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.