How clever is a compiler (e.g. g++) at SSE optimisation (line re-ordering, operation collation) - g++

Without getting into a discussion about premature optimisation, I have a few questions about how well g++ or other compilers handle SSE optimisation, when the relevant compiler flags are selected:
Do multiple lines of code get re-ordered so that SSE instructions can operate across groups of statements? e.g.
a[0] = a1+a2+a3;
x[0] = a1*a1;
a[1] = b1+b2+b3;
x[1] = b1*b1;
a[2] = c1+c2+c3;
x[2] = c1*c1;
where the compiler could reorder these lines into two sets of SSE instructions?
Does the compiler recognise similar sets of operations on scalars that are not in arrays, and combine them into SSE instructions? e.g.
a = a1+a2+a3;
b = b1+b2+b3;
c = c1+c2+c3;
Does the compiler vectorise the body of a for loop using SSE instructions? e.g.
for (unsigned int i = 0; i < 4; i++)
{
    x[i] = x[i]*k;
    a[i] = a[i]*c;
}
Will a compiler combine 1, 2 and 3 when trying to optimise?
It would be interesting to hear people's thoughts on this for various SSE-optimising compilers.
edit: I'm mostly asking about g++, but other "mainstream" compilers are of interest. I'm also predominantly talking about floating point operations.
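For reference, a quick way to see what a given compiler actually does with snippets like these is to ask it for a vectorisation report. A minimal sketch for g++ (these are standard GCC options; adjust for your version):

g++ -O3 -march=native -fopt-info-vec -fopt-info-vec-missed -c example.cpp

-O3 enables -ftree-vectorize, -march=native allows the newest SSE/AVX instructions your CPU supports, and the -fopt-info-vec* flags report which loops were (or were not) vectorised and why.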

In my experience, compilers made a real improvement for vectorization three years ago. Presently, all of your examples will be vectorized efficiently. Moreover, if you have the chance to use Intel's compiler, you will get a huge speed-up, and its reporting mode will give you additional information about the optimizations it applied.
In my day-to-day work, I've seen that you can have the craziest code elsewhere, but for the computational part you should help the compiler: use a C-style approach where you extract raw pointers and write a plain loop:
float * pa = whatever; // data must be contiguous
float * pb = whatever;
for (int i = 0; i < n; ++i)
{
    pa[i] = pa[i]*pb[i]; // example
}
Now we also have OpenMP 4.5, which provides directives for vectorization. Code written that way will typically be only around 10% slower than a hand-written solution, so today I do not recommend moving to intrinsics except in very specific cases where a #pragma will not do the job.
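For completeness, a minimal sketch of what that looks like with an OpenMP 4.x simd directive (my own illustration; pa, pb and n are as in the loop above):

void scale(float *pa, const float *pb, int n)
{
    // ask the compiler to vectorise this loop explicitly
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        pa[i] = pa[i] * pb[i];
}

Compile with the appropriate OpenMP flag (e.g. -fopenmp, or -fopenmp-simd if you only want the simd directives honoured).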

Related

Big O: What is the name for the complexity O(a * b)?

I am new to studying Big O notation and have thought of this question: what is the name for the complexity O(a * b)? Is it linear complexity, polynomial, or something else? The code for the implementation is below.
function twoInputsMult(a, b) {
  for (let i = 0; i < a; i++) {
    for (let j = 0; j < b; j++) {
      // do something
    }
  }
}
Edit: According to the course I'm going through, it is not n^2 or quadratic, since it uses two different numbers for the loops.
O(ab) is just O(ab). Technically, ab is a multivariate polynomial of 2nd degree. But this is not equivalent to a quadratic polynomial, such as a^2.
If you know more about a and b, you may be able to deduce more about their relationship. For instance, if a = O(b), then O(ab) = O(b^2), which is quadratic. On the other hand, if a is a constant, then we can reduce it to O(b), which is linear.
Notice, by the way, that O(a + b) is just O(max(a, b)).
And if the real world interests you, I might also mention that both of these complexity classes show up a lot, e.g. in graph theory, where we have the number of vertices |V| and the number of edges |E|, and typically |E| = O(|V|^2) but not necessarily. For instance, depth-first search has a time complexity of O(|V| + |E|), which just means that it is linear in terms of whichever there is more of: vertices or edges.

OpenCL Kernel and traditional loops

I'm studying OpenCL and I don't understand the relationship between a traditional loop in C/C++ code and kernel code.
To be clear, take a situation like this: in a traditional loop I have the variable n as my boundary, while in the kernel code I don't have it; instead I have get_global_id(0), which gives the position within my array. Does this mean that I start from 0 and iterate until get_global_id matches the maximum size of the array, n in this case? Or is it something different?
And in the other example (A[i] = B[i-1]+B[i]+B[i+1]) I don't know how to write the corresponding kernel code.
I hope my question is clear, because I'm not very good at English, sorry.
Thanks in advance for the help; if there are problems let me know!
An OpenCL kernel is coded like a single iteration of a for-loop, but all iterations are run in parallel, in arbitrary order.
Consider this vector addition example in C++, where for i=0..N-1, you add each element of the vectors one after the other:
for(int i=0; i<N; i++) { // loop index i
    C[i] = A[i]+B[i]; // compute one after the other
}
In OpenCL, the vector addition looks like the inside of this for-loop, but as a function with the kernel keyword and all vectors as parameters:
kernel void add_kernel(const global float* A, const global float* B, global float* C) {
    const int i = get_global_id(0);
    C[i] = A[i]+B[i]; // compute all loop indices i in parallel
}
You might be wondering: where is N? You give N to the kernel on the C++ side as its "global range", so the kernel knows how many elements i to calculate in parallel.
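For instance, a minimal host-side sketch with the plain OpenCL C API (queue and kernel are assumed to be an already created cl_command_queue and cl_kernel for add_kernel, with its arguments already set):

size_t global_range = N; // one work-item per index i
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_range, NULL, 0, NULL, NULL);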
Because in the OpenCL kernel every iteration runs in parallel, there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other). In your second example with A[i] = B[i-1]+B[i]+B[i+1] you do exactly that: only read from B, only write to A. The implementation with periodic boundaries can be done branch-less, see here.
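As an illustration (my own sketch, not part of the original answer), a kernel for your second example could look like this, reading only from B and writing only to A; here the boundary cells are simply clamped instead of using the branch-less periodic trick mentioned above:

kernel void stencil_kernel(const global float* B, global float* A, const int N) {
    const int i = get_global_id(0);
    const int im1 = (i > 0) ? i - 1 : 0;         // clamp at the left edge
    const int ip1 = (i < N - 1) ? i + 1 : N - 1; // clamp at the right edge
    A[i] = B[im1] + B[i] + B[ip1];
}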

Implementing an FFT using vDSP

I have data regarding the redness of the user's finger that is currently quite noisy, so I'd like to run it through an FFT to reduce the noise. The data on the left side of this image is similar to my data currently. I've familiarized myself with the Apple documentation regarding vDSP, but there doesn't seem to be a clear or concise guide on how to implement a Fast Fourier Transform using Apple's vDSP and the Accelerate framework. How can I do this?
I have already referred to this question, which is on a similar topic, but is significantly outdated and doesn't involve vDSP.
Using vDSP for FFT calculations is pretty easy. I'm assuming you have real values on input. The only thing you need to keep in mind is that you need to convert your real-valued array into the packed complex format that vDSP's FFT uses internally.
You can see a good overview in the documentation:
https://developer.apple.com/library/content/documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html
Here is a minimal example of calculating a real-valued FFT:
const int n = 1024;
const int log2n = 10; // 2^10 = 1024

// packed split-complex buffer holding n/2 complex values
DSPSplitComplex a;
a.realp = new float[n/2];
a.imagp = new float[n/2];

// prepare the FFT setup (you want to reuse the setup across FFT calculations)
FFTSetup setup = vDSP_create_fftsetup(log2n, kFFTRadix2);

// 'input' is assumed to be a float array holding your n real-valued samples;
// copy it into the packed complex array that the FFT uses
vDSP_ctoz((DSPComplex *) input, 2, &a, 1, n/2);

// calculate the in-place real-to-complex FFT
vDSP_fft_zrip(setup, &a, 1, log2n, FFT_FORWARD);

// do something with the complex spectrum
for (size_t i = 0; i < n/2; ++i) {
    // a.realp[i] and a.imagp[i] hold bin i of the packed spectrum
}
One trick is that a.realp[0] is the DC offset and a.imagp[0] is the real valued magnitude at the Nyquist frequency.
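If the goal is to smooth noisy data, a common next step is to look at the magnitude spectrum. A small follow-on sketch (my addition, reusing a, setup and n from above):

// magnitudes of the packed spectrum; note that vDSP_fft_zrip output is scaled
// by a factor of 2, and a.imagp[0] holds the Nyquist value rather than an
// imaginary part, so clear a.imagp[0] first if you want bin 0 to be pure DC
float magnitudes[n/2];
vDSP_zvabs(&a, 1, magnitudes, 1, n/2);

// release resources when you are done
vDSP_destroy_fftsetup(setup);
delete[] a.realp;
delete[] a.imagp;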

Is there a blas implementation using cilkplus array notation?

To my surprise, I'm not able to find any implementation of BLAS based on Cilk Plus' array notation on the web. This is strange, because Cilk Plus should ensure (more than) decent performance on today's multicore workstation CPUs, coupled with a very expressive and compact representation of the BLAS algorithms. Even more strange, considering that BLAS/LAPACK is the de facto standard for dense matrix calculations (at least as a specification).
I understand that there are other, more recent and sophisticated libraries that try to improve on or extend BLAS/LAPACK (for example I've looked at Eigen and FLENS), but it would still be nice to have a Cilk Plus version of the "standard" BLAS implementation.
Is this due to the very limited adoption of Cilk Plus?
http://parallelbook.com/downloads has Cilk Plus code (see "CODE EXAMPLES FROM BOOK") for a few BLAS operations used in a Cholesky decomposition example: gemm, potrf, syrk, and trsm. The routines are templates, so they work for any precision.
On the plus side, the Cilk Plus versions give you good composition properties, i.e. you can use them in separate parts of a spawn tree without worry. On the negative side, if you don't need the clean composition, then it's hard to compete with highly tuned parallel BLAS libraries, because the Cilk Plus algorithms tend to be cache oblivious, whereas the highly tuned libraries can exploit cache awareness. E.g., a cache aware algorithm can carefully schedule multiple threads on the same core to work on the same blocks, and thus save memory fetch overhead. It's a lot of work to get the cache awareness right for each machine, but BLAS authors are willing to do the work.
It's exactly the cache awareness ("I own the whole machine" programming) that thwarts clean composition, so you can't have both.
For some BLAS operations, the fork-join structure of Cilk Plus also seems to limit performance compared to less structured parallelism. See slide 2 of http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/cscads-libtune-09/talk17-knobe.pdf for some examples.
Taking gemm as an example, in the end the parallel routine is just calling a BLAS (sgemm, dgemm, etc.) routine. That might be the netlib reference, or ATLAS, or OpenBLAS, or MKL, but this is opaque in the suggested citation. I was asking about the existence of a Cilk Plus implementation of the reference routine, e.g. something like:
void dgemm(MATRIX & A, MATRIX & B, MATRIX & C) {
    #pragma cilk grainsize = 64
    cilk_for(int i = 1; i <= A.rows; i++) {
        double *x = &A(i, 1);
        for (int j = 1; j <= A.cols; j++, x += A.colstride)
            ROW(C, i) += (*x) * ROW(B, j);
    }
}
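For what it's worth, here is a rough sketch (my own illustration, not a reference implementation) of how the same idea could be written with cilk_for plus array-notation sections, assuming square row-major double arrays with leading dimension n:

#include <cilk/cilk.h>

void dgemm_cilk(const double *A, const double *B, double *C, int n)
{
    cilk_for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // C[i][:] += A[i][j] * B[j][:], written as an array-notation section
            C[i*n : n] += A[i*n + j] * B[j*n : n];
        }
    }
}

(Bear in mind that array notation was only ever supported by the Intel compilers and a few GCC releases, which is probably part of why such implementations are rare.)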

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.
The easiest way to do this appears to be:
Load the variables into shared memory
Do vectorized loads for unsigned char, etc. if needed.
Do an atomic add in shared memory
Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;

    // per-block histogram in shared memory
    __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
    block_reduced[threadIdx.x] = 0;
    __syncthreads();

    // accumulate this block's values with shared-memory atomics
    atomicAdd(&block_reduced[data[tid]], 1);
    __syncthreads();

    // flush the per-block bins to the global histogram
    for(int i = threadIdx.x; i < NUM_BINS; i += NUM_BINS)
        atomicAdd(&count[i], block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared-memory bank conflicts are the reasons usually given. I don't know if there is any way to remove either of these for this calculation in any significant way.
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.
__global__ void ballot_popc_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;

    //need lane_ids since we are going warp level
    uint lane_id = threadIdx.x%32;

    //for ballot
    uint warp_set_bits=0;

    //to store warp level sum
    __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];

    //shared data
    __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

    //load shared data - could store to registers
    s_data[threadIdx.x] = data[tid];
    __syncthreads();

    //suspicious loop - I think we need more parallelism
    for(int i=0; i<NUM_BINS; i++){
        warp_set_bits = __ballot(s_data[threadIdx.x]==i);

        if(lane_id==0){
            warp_reduced_count[warp_id] = __popc(warp_set_bits);
        }
        __syncthreads();

        //do warp level reduce
        //could use shfl, but it does not change the overall picture
        if(warp_id==0){
            int t = threadIdx.x;
            for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
                if(t<j) warp_reduced_count[t] += warp_reduced_count[t+j];
                __syncthreads();
            }
        }
        __syncthreads();

        if(threadIdx.x==0){
            atomicAdd(&count[i],warp_reduced_count[0]);
        }
    }
}
This gives decent numbers for the single-bin case (well, that is moot: peak device memory bandwidth is 133 GB/s, and things seem to depend on launch configuration), namely 35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics, but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason is perhaps the single thread looping through all the bins, which calls for parallelising the NUM_BINS loop.
I have tried several ways of going about parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction for bins gives similarly uninspiring results.
The other idea I tried just for kicks was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to no real compute work for the child kernels and the launch overhead.
The most promising approach seems to be the one from Nicholas Wilt's article on using so-called privatized histograms, with a set of bins for each thread in shared memory, which would ostensibly be very heavy on shared-memory usage (and we only have 48 kB per SM on Maxwell).
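As a rough sketch of what I have in mind (a per-warp variant rather than the per-thread one from the article, to keep the shared-memory footprint modest; NUM_BINS and WARPS_PER_BLOCK are assumed to be compile-time constants small enough that the privatized bins fit in shared memory):

__global__ void warp_private_hist(const int *data, int *count, int n)
{
    __shared__ int priv[WARPS_PER_BLOCK][NUM_BINS];
    const int warp = threadIdx.x / 32;

    // zero this block's sub-histograms
    for (int b = threadIdx.x; b < WARPS_PER_BLOCK * NUM_BINS; b += blockDim.x)
        priv[b / NUM_BINS][b % NUM_BINS] = 0;
    __syncthreads();

    // each warp accumulates into its own sub-histogram, so atomics from
    // different warps never collide on the same shared-memory location
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&priv[warp][data[i]], 1);
    __syncthreads();

    // reduce the sub-histograms and flush them to the global histogram
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) {
        int sum = 0;
        for (int w = 0; w < WARPS_PER_BLOCK; ++w)
            sum += priv[w][b];
        atomicAdd(&count[b], sum);
    }
}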
Perhaps someone could shed some insight into the problem? I feel that one ought to change the algorithm instead, so as not to use histograms at all and to use something less frequentist. Otherwise, I suppose we just use the atomics version.
Edit: The context for my problem is computing probability density functions to be used for pattern classification. We can compute approximate histograms (more precisely, pdfs) using non-parametric methods such as Parzen windows or kernel density estimation. However, this does not overcome the problem of dimensionality, as we need to sum over all data points for every bin, which becomes expensive when the number of bins is large. See here: Parzen
I faced similar challenges when working with clustering, but in the end the best solution was to use the scan pattern to group the processing, so I don't think that approach would work for you. Since you asked for experience in this area, I'll share mine with you.
The issues
In your first kernel, I suspect the low performance as the number of bins decreases is linked to warp stalls, since you perform very little processing per loaded element. When the number of bins increases, the ratio of processing to global memory loads (data info) for that kernel also increases. You can check this very easily with the "Issue Efficiency" experiments in Nsight's Performance Analysis. You are probably getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).
Since I was not able to improve the number of eligible warps to somewhere close to 95%, I gave up this approach, since in some cases it gets worse (memory dependencies stalled 90% of my processing cycles).
The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for every bin filter, so you may end up with a lot of code divergence, which is very undesirable for parallel processing. You may try to group the divergent work in order to remove branching and get good control flow, so that each warp/block does similar processing, even if the work varies a lot across blocks.
A feasible solution
I don't remember where, but I have seen very good solutions to your problem around. Did you try this one?
You can also use a vectorized load and try something like the following, but I'm not sure how much it would improve your performance:
__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {
    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
    __syncthreads(); // make sure the shared bins are zeroed before use

    for (int i = 0; i < N; i += warpSize) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }

    // process remaining elements if the data is not a multiple of 4,
    // using a recast and an additional control
    for (int i = 0; i < rem; i++) {
        atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
    }

    // update your (global) histogram data here
}