gcc memory alignment pragma - optimization

Does gcc have memory alignment pragma, akin #pragma vector aligned in Intel compiler?
I would like to tell compiler to optimize particular loop using aligned loads/store instructions. to avoid possible confusion, this is not about struct packing.
e.g:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks

You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang7.0 or ICC19, see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps, instead of using a separate movups). You have have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));
// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
for (int i = 0; i < n; ++i) {
// math!
}
}
This won't make aligned_double 16 bytes wide. This will just make it aligned to a 16-byte boundary, or rather the first one in an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a LOT of vector ops. I am using a Power architecture computer at the moment so it's altivec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because there altivec doesn't support double floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html

I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?

Does gcc have memory alignment pragma, akin #pragma vector aligned
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments, if it has three,
the third argument should have integer type, and if it is nonzero
means misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.

Related

OpenCL Kernel and traditional loops

I'm studying OpenCL and I don't understand the relationship between traditional loop in a C/C++ code and kernel code.
Just for be clear a situation like that:
So my question is: In the traditional loops I have n variable as my boundary while in kernel code I don't have it but I have get_global_id(0) that indicates the memory scope of my array, this means that I start from 0, and iterate until get_global_id matches with the maximum size of the array, n in this case? Or is something different?
Because in this other example I don't know how to write the correspond kernel code
I hope my question is clear because I'm not very well in english, sorry.
Thanks in advance for the help, if there are problems let me know!
An OpenCL kernel is coded like a single iteration of a for-loop, but all iterations are run in parallel with random order.
Consider this vector addition example in C++, where for i=0..N-1, you add each element of the vectors one after the other:
for(int i=0; i<N; i++) { // loop index i
C[i] = A[i]+B[i]; // compute one after the other
}
In OpenCL, the vector addition looks like the inside of this for-loop, but as a function with the kernel keyword and all vectors as parameters:
kernel void add_kernel(const global float* A, const global float* B, global float* C) {
const int i = get_global_id(0);
C[i] = A[i]+B[i]; // compute all loop indices i in parallel
}
You might be wondering: Where is N? You give N to the kernel on the C++ side as its "global range", so the kernel knows how much elements i to calculate in parallel.
Because in the OpenCL kernel every iteration runs in parallel, there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other). In your second example with A[i] = B[i-1]+B[i]+B[i+1] you do exactly that: only read from B, only write to A. The implementation with periodic boundaries can be done branch-less, see here.

How clever is a compiler (e.g. g++) at SSE optimisation (line re-ordering, operation collation)

Without getting into a discussion about premature optimisation, I have a few questions about how well g++ or other compilers handle SSE optimisation, when the relevant compiler flags are selected:
Do multiple lines of code get re-ordered in order for SSE instructions to be performed on bunches of lines? e.g.
a[0] = a1+a2+a3;
x[0] = a1*a1;
a[1] = b1+b2+b3;
x[1] = b1*b1;
a[2] = c1+c2+c3;
x[2] = c1*c1;
where the compiler could reorder these lines into two sets of SSE instructions?
Does the compiler realise when to take similar sets of operations, that are not in arrays and combine them into SSE instructions? e.g.
a = a1+a2+a3;
b = b1+b2+b3;
c = c1+c2+c3;
Does the compiler optimise instructions in a for loop for SSE optimisation? e.g.
for(unsigned int i = 0; i < 4; i++)
{
x[i] = x[i]*k;
a[i] = a[i]*c;
}
Will a compiler combine 1, 2 and 3 when trying to optimise?
Would be interesting to hear peoples thoughts on this for various SSE optimising compilers.
edit: I'm mostly asking about g++, but other "mainstream" compilers are of interest. I'm also predominantly talking about floating point operations.
In my experience, compilers made a real improvement for vectorization three years ago. Presently, all of your examples will be vectorized efficiently. Moreover, if you have the chance to use Intel's compiler, you will get a huge speed-up, and its reporting mode will give you additional information about the optimizations it applied.
In my day-to-day life, I've seen that you can have the craziest code, but for the computation part, you should help the compiler and use a C approach where you extract your pointer and do your loop:
float * pa = whatever; // data must be contigious
float * pb = whatever;
for (int i=0; i <n; ++i)
{
pa[I] = pa[i]*pb[i]; // example
}
Now we also have OpenMP 4.5, which provides directives for vectorization. This will only be 10% slower than a hand-written solution. Therefore, I do not recommend today to move to intrinsics, except in very specific cases where #pragma will not work.

MKL Sparse BLAS segfault when transposing CSR with 100M rows

I am trying to use MKL Sparse BLAS for CSR matrices with number of rows/columns on the order of 100M. My source code that seems to work fine for 10M rows/columns fails with segfault when I increase it to 100M.
I isolated the problem to the following code snippet:
void TestSegfault1() {
float values[1] = { 1.0f };
int col_indx[1] = { 0 };
int rows_start[1] = { 0 };
int rows_end[1] = { 1 };
// Step 1. Create 1 x 100M matrix
// with single non-zero value at (0,0)
sparse_matrix_t A;
mkl_sparse_s_create_csr(
&A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
rows_start, rows_end, col_indx, values);
// Step 2. Transpose it to get 100M x 1 matrix
sparse_matrix_t B;
mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);
}
This function segfaults in mkl_sparse_convert_csr with backtrace
#0 0x00000000004c0d03 in mkl_sparse_s_convert_csr_i4_avx ()
#1 0x0000000000434061 in TestSegfault1 ()
For slightly different code (but essentially the same) it has a little more detail:
#0 0x00000000008fc09b in mkl_serv_free ()
#1 0x000000000099949e in mkl_sparse_s_export_csr_data_i4_avx ()
#2 0x0000000000999ee4 in mkl_sparse_s_convert_csr_i4_avx ()
Apparently something goes bad in memory allocation. And it sure looks like some kind of integer overflow from the outside. The build of MKL I have uses MKL_INT = int = int32.
Is it indeed the case and the limit on number of rows I can have in Sparse BLAS CSR matrix is < 100M (looks more like ~65M)? Or am I doing it wrong?
EDIT 1: MKL version string is "Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications".
EDIT 2: Figured it out. There is indeed a subtle kind of integer overflow when allocating memory for internal per-thread buffers. At some point inside mkl_sparse_s_export_csr_data_i4_avx it attempts to allocate (omp_get_max_threads() + 1) * num_rows * 4 bytes; the number doesn't fit in 32-bit signed integer. Subsequent call to mkl_serv_malloc causes memory corruption and eventually segfault. One possible solution is to alter the number of OpenMP threads via omp_set_num_threads call.
Could you check your example on last version of MKL? I run it on MKL 11.3.2 and it passed correctly for 100M matrix. However it could depend on number of threads on your machine (size of matrix mult number of threads have to be less than max int). To prevent such issue I am strongly recommend to use ilp64 version of MKL libraries
Thanks,
Alex
check how this example works with the latest mkl 2019 u4.
compiling the example with ILP64 mode like as follows:
icc -I/opt/intel/compilers_and_libraries_2019/linux/mkl/include test_csr.cpp \
-L/opt/intel/compilers_and_libraries_2019/linux/mkl/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
./a.out
mkl_sparse_convert_csr passed

CGFloat: round, floor, abs, and 32/64 bit precision

TLDR: How do I call standard floating point code in a way that compiles both 32 and 64 bit CGFloats without warnings?
CGFloat is defined as either double or float, depending on the compiler settings and platform. I'm trying to write code that works well in both situations, without generating a lot of warnings.
When I use functions like floor, abs, ceil, and other simple floating point operations, I get warnings about values being truncated. For example:
warning: implicit conversion shortens 64-bit value into a 32-bit value
I'm not concerned about correctness or loss of precision in of calculations, as I realize that I could just use the double precision versions of all functions all of the time (floor instead of floorf, etc); however, I would have to tolerate these errors.
Is there a way to write code cleanly that supports both 32 bit and 64 bit floats without having to either use a lot of #ifdef __ LP64 __ 's, or write wrapper functions for all of the standard floating point functions?
You may use those functions from tgmath.h.
#include <tgmath.h>
...
double d = 1.5;
double e = floor(d); // will choose the 64-bit version of 'floor'
float f = 1.5f;
float g = floor(f); // will choose the 32-bit version of 'floorf'.
If you only need a few functions you can use this instead:
#if CGFLOAT_IS_DOUBLE
#define roundCGFloat(x) round(x)
#define floorCGFloat(x) floor(x)
#define ceilCGFloat(x) ceil(x)
#else
#define roundCGFloat(x) roundf(x)
#define floorCGFloat(x) floorf(x)
#define ceilCGFloat(x) ceilf(x)
#endif

CGSSetWindowWarp not working in 64 bits apps

I am developping a Mac OS X app using the undocummented CGSSetWindowWarp function.
Everything is ok when compiing in 32 bits but it stop to work (window dissapear completly) when compiling in 64 bits.
Do you have any idée where the issue can be?
Thanks in advance for your help
Regards,
Are you sure about the function signature? The size of the parameters might have changed / might not have changed but int needs to be short, etc.
Or they might have stopped supporting that function altogether.
The points in the warp mesh are not actually CGPoints, they use float for x and y even in 64-bit. (CGPoint uses double on 64-bit)
You can redefine the warp mesh and function like this:
typedef struct CGSPoint {
float x;
float y;
} CGSPoint;
typedef struct {
CGSPoint local;
CGSPoint global;
} CGSPointWarp;
extern CGError CGSSetWindowWarp(CGSConnectionID conn, CGSWindowID window, int w, int h, CGSPointWarp **mesh)