OpenCL Kernel and traditional loops - gpu

I'm studying OpenCL and I don't understand the relationship between a traditional loop in C/C++ code and kernel code.
Just to be clear, here is my situation: in a traditional loop I have the variable n as my boundary, while in the kernel code I don't have it; instead I have get_global_id(0), which tells me which element of my array I am working on. So my question is: does this mean I start from 0 and iterate until get_global_id(0) matches the maximum size of the array, n in this case? Or is it something different?
And for this other example (a loop computing A[i] = B[i-1] + B[i] + B[i+1]) I don't know how to write the corresponding kernel code.
I hope my question is clear; I'm not very good at English, sorry.
Thanks in advance for the help, if there are problems let me know!

An OpenCL kernel is coded like a single iteration of a for-loop, but all iterations are run in parallel and in random order.
Consider this vector addition example in C++, where for i=0..N-1, you add each element of the vectors one after the other:
for(int i=0; i<N; i++) { // loop index i
    C[i] = A[i]+B[i]; // compute one after the other
}
In OpenCL, the vector addition looks like the inside of this for-loop, but as a function with the kernel keyword and all vectors as parameters:
kernel void add_kernel(const global float* A, const global float* B, global float* C) {
    const int i = get_global_id(0);
    C[i] = A[i]+B[i]; // compute all loop indices i in parallel
}
You might be wondering: where is N? You give N to the kernel on the C++ side as its "global range", so the kernel knows how many elements i to compute in parallel.
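On the host side with the OpenCL C API, that could look roughly like this minimal sketch (the names queue, add_kernel and the buffer objects A_buf/B_buf/C_buf are assumptions; they must be created beforehand):
size_t global_size = N; // the "global range": one work-item per element
size_t local_size = 64; // example work-group size; assumes N is a multiple of it
clSetKernelArg(add_kernel, 0, sizeof(cl_mem), &A_buf);
clSetKernelArg(add_kernel, 1, sizeof(cl_mem), &B_buf);
clSetKernelArg(add_kernel, 2, sizeof(cl_mem), &C_buf);
clEnqueueNDRangeKernel(queue, add_kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);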
Because in the OpenCL kernel every iteration runs in parallel, there must not be any data dependencies from one iteration to the next; otherwise you have to use a double buffer (only read from one buffer and only write to the other). In your second example with A[i] = B[i-1]+B[i]+B[i+1] you do exactly that: only read from B, only write to A. The implementation with periodic boundaries can be done branch-less, see here.
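For the second example, a minimal kernel sketch could look like this (assuming the two boundary elements are simply skipped rather than treated periodically, and that N is passed in as a kernel argument):
kernel void stencil_kernel(const global float* B, global float* A, const int N) {
    const int i = get_global_id(0);
    if(i==0 || i==N-1) return; // skip the boundaries in this simple sketch
    A[i] = B[i-1]+B[i]+B[i+1]; // read only from B, write only to A
}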

Related

Is there a way to check feasibility for given values of decision variables in CPLEX?

Right now I am trying to implement a particle swarm optimization algorithm in CPLEX to solve a Vehicle Routing Problem with a large number of customers.
First, I wrote an optimization model in OPL that can now solve smaller instances. In order to also solve problems with a larger number of customers, I want to use a heuristic I have from a paper (Particle Swarm Optimization).
For this, I am implementing each step of the algorithm in a main-block (control flow) in ILOG Script.
In each iteration, the algorithm proposes a solution ( = values for the decision variables) that now needs to be checked if it is feasible and what the solution value is.
In the next iteration, the algorithm then tries to improve that solution.
This is repeated until a certain number of iterations is reached.
I already got the algorithm to work. But now I don't know how I can do the feasibility check.
Basically, I now have values for the decision variables, and I need to run them through the model to check whether all constraints hold for this solution.
How does this work in CPLEX using control flow?
Obviously, I know how to trigger OPL to generate the model and run the CPLEX solver to get a solution for it, but how does it work when you want to give CPLEX the values of the decision variables and just want it to test feasibility, without doing any optimization?
In Making Optimization Simple you can have a look at the "check feasibility" example, which relies on a fixed start:
int nbKids=300;
// a tuple is like a struct in C, a class in C++ or a record in Pascal
tuple bus
{
    key int nbSeats;
    float cost;
}
// This is a tuple set
{bus} pricebuses={<40,500>,<30,400>};
// asserts help make sure data is fine
assert forall(b in pricebuses) b.nbSeats>0;
assert forall(b in pricebuses) b.cost>0;
// solutions we want to test
range options=1..3;
int testSolution[options][pricebuses]=[[5,2],[5,3],[5,4]];
// decision variable array
dvar int+ nbBus[pricebuses];
// objective
minimize
    sum(b in pricebuses) b.cost*nbBus[b];
// constraints
subject to
{
    sum(b in pricebuses) b.nbSeats*nbBus[b]>=nbKids;
}
float cost=sum(b in pricebuses) b.cost*nbBus[b];
execute DISPLAY_After_SOLVE
{
    writeln("The minimum cost is ",cost);
    for(var b in pricebuses) writeln(nbBus[b]," buses ",b.nbSeats, " seats");
}
main
{
    thisOplModel.generate();
    // Test feasibility through fixed start
    for(var o in thisOplModel.options)
    {
        for(var b in thisOplModel.pricebuses)
        {
            thisOplModel.nbBus[b].UB=thisOplModel.testSolution[o][b];
            thisOplModel.nbBus[b].LB=thisOplModel.testSolution[o][b];
        }
        if (cplex.solve())
        {
            write(thisOplModel.testSolution[o]," is feasible");
            writeln(" and the cost is ",cplex.getObjValue());
        }
        else writeln(thisOplModel.testSolution[o]," is not feasible");
    }
}
/*
which gives
[5 2] is not feasible
[5 3] is not feasible
[5 4] is feasible and the cost is 4100
*/
But if you prefer asserts, you can also rely on those.

OpenCL 2.x - Sum Reduction function

From this previous post: strategy-for-doing-final-reduction, I would like to know about the latest features offered by OpenCL 2.x (not 1.x, which is the subject of the previous post above), especially the atomic functions that allow one to perform reductions of an array (in my case a sum reduction).
I was told that the performance of the OpenCL 1.x atomic functions (atom_add) was bad, and I could verify this, so I am looking for a way to get the best performance for the final reduction (i.e. the sum of the partial sums computed by each work-group).
Here is the typical kernel code I am using at the moment:
__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);
    // Copy from global memory to local memory
    localSums[local_id] = input[get_global_id(0)];
    // Loop for computing localSums
    for (uint stride = group_size/2; stride>0; stride /=2)
    {
        // Waiting for each 2x2 addition into given workgroup
        barrier(CLK_LOCAL_MEM_FENCE);
        // Divide WorkGroup into 2 parts and add elements 2 by 2
        // between local_id and local_id + stride
        if (local_id < stride)
            localSums[local_id] += localSums[local_id + stride];
    }
    // Write result into partialSums[nWorkGroups]
    if (local_id == 0)
        partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you please tell me how to perform a second and final reduction of this array, with the best possible performance using the functions available in OpenCL 2.x? A classic solution is to perform this final reduction on the CPU, but ideally I would like to do it directly in kernel code.
A suggested code snippet is welcome.
One last point: I am working on macOS High Sierra 10.13.5 with the following model:
Can OpenCL 2.x be installed on my Mac hardware?
Atomic functions should be avoided because they harm performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do a single invocation for large amounts of data, as there is no way to synchronize between work groups.
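For illustration, the host-side chaining could look roughly like this sketch (it assumes a cl_kernel handle sum_kernel built from the sumGPU kernel above, two hypothetical scratch buffers buf0/buf1 with buf0 holding the input, a power-of-two element count N, and a work-group size WG that divides every intermediate size):
size_t n = N;                              // number of elements left to reduce
cl_mem src = buf0, dst = buf1;             // ping-pong scratch buffers
while (n > 1) {
    size_t local_size = (n < WG) ? n : WG; // last pass: local size equals global size
    size_t global_size = n;                // one work-item per remaining element
    clSetKernelArg(sum_kernel, 0, sizeof(cl_mem), &src);
    clSetKernelArg(sum_kernel, 1, sizeof(cl_mem), &dst);
    clSetKernelArg(sum_kernel, 2, local_size*sizeof(double), NULL); // __local scratch
    clEnqueueNDRangeKernel(queue, sum_kernel, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    n /= local_size;                       // one partial sum per work-group remains
    cl_mem tmp = src; src = dst; dst = tmp; // swap buffers for the next pass
}
// after the loop, the total sum sits in the first element of src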
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocations together is through OpenCL events.

Tensorflow: what does index denote in CUDA_1D_KERNEL_LOOP(index, nthreads)?

I have seen in several standard ops of tensorflow layers ( such as https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/maxpooling_op_gpu.cu.cc ), the code CUDA_1D_KERNEL_LOOP(index, nthreads) as part of the Forward and Backward passes...
I think the "index" here is related somehow to the bottom feature map coordinates but am not so sure of its exact meaning... Anyone who could help?
CUDA_1D_KERNEL_LOOP(i, n) is a preprocessor macro defined in tensorflow/core/util/cuda_kernel_helper.h. It provides a generic control flow statement that is used in many Cuda kernels within the Tensorflow codebase.
The statement is typically used for iterating through elements of an array within a kernel. The argument i is the name of the control variable and the argument n is the stopping condition for the control statement. Cuda kernels are launched in parallel threads. Each thread typically operates on a subset of array elements. The macro provides some convenience for accessing the desired array elements.
In the example that you link to, CUDA_1D_KERNEL_LOOP(index, nthreads) is interpreted as:
for (int index = blockIdx.x * blockDim.x + threadIdx.x; index < nthreads; index += blockDim.x * gridDim.x)
Hence index is declared and initialized within CUDA_1D_KERNEL_LOOP before entering the subsequent code block. The precise meaning of index depends on how it is used within the code block.
One thing that puzzled me when I first read this macro is, "Why is it a loop, isn't this inside the kernel which has already been parallelized?" The answer is, the loop handles the case when you have more "threads" than your GPU actually supports.
For example, suppose you are doing a parallelized vector addition, and you have decided that for your GPU, you will be using 512 threads per block, scheduling a maximum of 4096 blocks (these are the default parameters in Caffe2). This means that you can only schedule a maximum of 2097152 threads. Suppose your vector actually has 4M elements; now you can't actually allocate a thread per element. So each thread must be responsible for summing more than one element in the vector: that is what this loop is for!
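For illustration, a plain CUDA grid-stride vector addition written without the macro might look like this sketch:
__global__ void add_kernel(const float* a, const float* b, float* c, int n) {
    // each thread starts at its global index and strides by the total number of threads
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += blockDim.x * gridDim.x) {
        c[i] = a[i] + b[i]; // one thread handles several elements when n exceeds the thread count
    }
}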
Here is a smaller example which precisely describes how work ends up being scheduled. Suppose that blockDim.x == 2, gridDim.x == 2, and nthreads == 7. Then if we identify a GPU thread as (blockIdx.x, threadIdx.x), we allocate them to do the following work on the vector: [(0,0), (0,1), (1,0), (1,1), (0,0), (0,1), (1,0)]. In particular, we can see that according to the grid size, there are only four GPU threads available; so for blockIdx.x == 0 threadIdx.x == 0, index will handle processing the vector elements at BOTH 0 and 4.
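That mapping can be verified by simulating the macro's index assignment with a small, purely illustrative host-side C program:
#include <stdio.h>
int main(void) {
    const int blockDim_x = 2, gridDim_x = 2, nthreads = 7;
    // enumerate every (blockIdx.x, threadIdx.x) pair and the indices it would process
    for (int block = 0; block < gridDim_x; block++)
        for (int thread = 0; thread < blockDim_x; thread++)
            for (int index = block * blockDim_x + thread; index < nthreads; index += blockDim_x * gridDim_x)
                printf("(%d,%d) handles element %d\n", block, thread, index);
    return 0;
}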

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.
The easiest way to do this appears to be:
Load the variables into shared memory
Do vectorized loads for unsigned char, etc. if needed.
Do an atomic add in shared memory
Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
    block_reduced[threadIdx.x] = 0;
    __syncthreads();
    atomicAdd(&block_reduced[data[tid]],1);
    __syncthreads();
    for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_BINS)
        atomicAdd(&count[i],block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are given as reasons. I don't know if there is any way to remove either of these for this calculation in any significant way.
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.
__global__ void ballot_popc_reducer(int *data, int *count ){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;
    //need lane_ids since we are going warp level
    uint lane_id = threadIdx.x%32;
    //for ballot
    uint warp_set_bits=0;
    //to store warp level sum
    __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
    //shared data
    __shared__ uint s_data[NUM_THREADS_PER_BLOCK];
    //load shared data - could store to registers
    s_data[threadIdx.x] = data[tid];
    __syncthreads();
    //suspicious loop - I think we need more parallelism
    for(int i=0; i<NUM_BINS; i++){
        warp_set_bits = __ballot(s_data[threadIdx.x]==i);
        if(lane_id==0){
            warp_reduced_count[warp_id] = __popc(warp_set_bits);
        }
        __syncthreads();
        //do warp level reduce
        //could use shfl, but it does not change the overall picture
        if(warp_id==0){
            int t = threadIdx.x;
            for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
                if(t<j) warp_reduced_count[t] += warp_reduced_count[t+j];
                __syncthreads();
            }
        }
        __syncthreads();
        if(threadIdx.x==0){
            atomicAdd(&count[i],warp_reduced_count[0]);
        }
    }
}
This gives decent numbers (well, that is moot - peak device memory bandwidth is 133 GB/s, and things seem to depend on launch configuration) for the single-bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason is probably the single loop iterating through all the bins, which calls for parallelizing the NUM_BINS loop.
I have tried several ways of going about parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction for bins gives similarly uninspiring results.
The other idea I tried just for kicks was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to no real compute work for the child kernels and the launch overhead.
The most promising approach seems to be, from Nicholas Wilt's article, using these so-called privatized histograms, containing bins for each thread in shared memory, which would ostensibly be very heavy on shared memory usage (and we only have 48 kB per SM on Maxwell).
Perhaps someone could shed some light on the problem? I feel that one ought to change the algorithm instead, so as not to use histograms but something less frequentist. Otherwise, I suppose we just use the atomics version.
Edit: The context for my problem is in computing probability density functions to be used for pattern-classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen Windows or Kernel Density Estimation. However, this does not overcome the problem of dimensionality as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
I faced similar challenges when working with clustering, but in the end the best solution was to use the scan pattern to group the processing. So I don't think it would work for you. Since you asked for some experience in this area, I'll share mine with you.
The issues
In your first code, I guess that the low performance when the number of bins is reduced is linked to warp stalls, since you perform very little processing for every piece of data you evaluate. When the number of bins is increased, the ratio between processing and global memory loads (data info) for that kernel also increases. You can check that very easily with the "Issue Efficiency" experiments in the Performance Analysis from Nsight. You are probably getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).
Since I was not able to improve the number of eligible warps to somewhere close to 95%, I gave up on this approach, since in some cases it gets worse (memory dependencies stalled 90% of my processing cycles).
The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for every bin filter, so you may end up with a lot of code divergence, which is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have good control flow, so that the whole warp/block performs similar processing, but with a lot of variation across blocks.
A feasible solution
I don't remember where, but I have seen very good solutions to your problem around. Did you try this one?
You can also use a vectorized load and try something like the following, but I'm not sure how much it would improve your performance:
__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {
    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
    __syncthreads(); // make sure the shared bins are zeroed before counting
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // grid-stride loop: each vectorized int4 load covers four histogram samples
    for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }
    // process remaining elements if the data is not a multiple of 4,
    // using a recast and an additional control (handled by one block only)
    if (blockIdx.x == 0)
        for (int i = threadIdx.x; i < rem; i += blockDim.x)
            atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
    __syncthreads(); // wait for all block-local counts
    // update the global histogram with this block's counts
    if (threadIdx.x < N_OF_BINS) atomicAdd(&count[threadIdx.x], sBins[threadIdx.x]);
}

What is reduction variable? Could anyone give me some examples?

What is reduction variable?
Could anyone give me some examples?
Here's a simple example in a C-like language of computing the sum of an array:
int x = 0;
for (int i = 0; i < n; i++) {
x += a[i];
}
In this example,
i is an induction variable - in each iteration it changes by some constant. It can be +1 (as in the above example) or *2 or /3 etc., but the key is that in all the iterations the number is the same.
In other words, in each iteration i_new = i_old op constant, where op is +, *, etc., and neither op nor constant change between iterations.
x is a reduction variable - it accumulates data from one iteration to the next. It always has some initialization (x = 0 in this case), and while the data accumulated can be different in each iteration, the operator remains the same.
In other words, in each iteration x_new = x_old op data, and op remains the same in all iterations (though data may change).
In many languages there's a special syntax for performing something like this - often called a "fold" or "reduce" or "accumulate" (it has other names too) - but in the context of LLVM IR, a reduction variable will be represented by a phi node in the loop, choosing between the initialization value from before the loop and the result of the binary operation from inside the loop.
Commutative* operations in reduction variables (such as addition) are particularly interesting for an optimizing compiler because they appear to show a stronger dependency between iterations than there really is; for instance the above example could be rewritten into a vectorized form - adding, say, 4 numbers at a time, and followed by a small loop to sum the final vector into a single value.
* there are actually more conditions that the reduction variable has to fulfill before a vectorization like this can be applied, but that's really outside the scope here
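For illustration, the vectorized rewrite mentioned above could look roughly like this sketch in plain C, reusing a and n from the example above (four scalar accumulators stand in for a SIMD register, and n is assumed to be a multiple of 4 for simplicity):
int x0 = 0, x1 = 0, x2 = 0, x3 = 0;
for (int i = 0; i < n; i += 4) { // four independent partial sums, no dependency between them
    x0 += a[i];
    x1 += a[i + 1];
    x2 += a[i + 2];
    x3 += a[i + 3];
}
int x = x0 + x1 + x2 + x3; // small final reduction of the partial sums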