CUDA uncorrectable ECC error encountered - crash

My environment is
Windows 7 x64
Matlab 2012a x64
CUDA SDK 4.2
Tesla C2050 GPU
I am having trouble figuring out why my GPU is crashing with the "uncorrectable ECC error encountered". This error only occurs when I use 512 threads or more. I can't post the kernel, but I will try to describe what it does.
In general, the kernel takes a number of parameters and produces 2 complex matrices whose dimensions are the thread count, M, and another number, N. So the returned matrices are of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the dimensions are 256x256.
Each thread extracts a 999-element vector out of a 2D array (of size 999xM) based on its thread id, then cycles through the rows (0 .. N-1) of the output matrices for the calculation. A number of intermediate parameters are calculated, using only pow, sin and cos along with the + - * / operators. To calculate one of the output matrices, an additional loop needs to be executed to sum up the contribution of the 999-element vector extracted earlier. This loop does some intermediate calculations to determine the range of values that are allowed to contribute. The contribution is then scaled by a factor determined by the cos and sin values of a calculated fractional value. This is where it crashes. If I stick in a constant value (1.0 or any other, for that matter), the kernel executes without trouble. However, when even one of the calls (cos or sin) is included, the kernel crashes.
Some pseudocode follows:
kernel()
{
    /* Extract the 999-element vector from the 999xM 2D array - one vector per thread. */
    for (int i = 0; i < 999; i++)
    {
        .....
    }
    /* Cycle through the 2nd dimension of the output matrices */
    for (int j = 0; j < N; j++)
    {
        /* Calculate some intermediate variables */
        /* Calculate the real and imaginary components of the first output matrix */
        /* real = cos(value), imaginary = sin(value) */
        /* Construct the first output matrix from some intermediate variables and the real and imaginary components */
        /* Calculate some more intermediate variables */
        /* Cycle through the extracted vector (0 .. 998) */
        for (int k = 0; k < 999; k++)
        {
            /* Calculate some more intermediate variables */
            /* Determine the range of allowed values that contribute to the second output matrix */
            /* Calculate the real and imaginary components of the second output matrix */
            /* real = cos(value), imaginary = sin(value) */
            /* This is where it crashes, unless real and imaginary are constant values (1.0) */
            /* Sum up the contributions of the extracted vector to the second output matrix */
        }
        /* Construct the second output matrix from some intermediate variables and the real and imaginary components */
    }
}
I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case: I'm using fewer than the 32,768 available registers with 512 threads. Can anyone give any suggestions as to what the cause of this could be?
Here is the ptxas info:
ptxas info : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __internal_trig_reduction_slowpathd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp

"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU device memory.
Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.
There are diagnostic utilities floating around to stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analog for testing the device RAM banks of the GPU.
If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, this confirms that the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.
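If it helps to rule out the software side first, a minimal host-side check (a sketch using the CUDA runtime API; the commented-out launch is a placeholder, not the actual kernel) can confirm that ECC is enabled on the device and surface the exact error code returned after a launch:

/* Hedged sketch: query ECC state and report the CUDA error after a launch.
   "myKernel" and its launch configuration are placeholders. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s, ECC enabled: %d\n", prop.name, prop.ECCEnabled);

    /* myKernel<<<grid, block>>>(...);  // placeholder launch */

    cudaError_t err = cudaDeviceSynchronize();  /* surfaces asynchronous kernel errors */
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 0;
}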

Related

OpenCL 2.x - Sum Reduction function

Following on from this previous post: strategy-for-doing-final-reduction, I would like to know the latest functionality offered by OpenCL 2.x (not 1.x, which is the subject of the previous post), especially the atomic functions that allow performing reductions of an array (in my case a sum reduction).
I was told that the performance of the OpenCL 1.x atomic functions (atom_add) was bad, and I was able to confirm it, so I am looking for the best-performing way to do the final reduction (i.e. the sum of the partial sums computed by each work-group).
Here is the typical kind of kernel code that I am using at the moment:
__kernel void sumGPU ( __global const double *input,
                       __global double *partialSums,
                       __local double *localSums)
{
    uint local_id = get_local_id(0);
    uint group_size = get_local_size(0);

    // Copy from global memory to local memory
    localSums[local_id] = input[get_global_id(0)];

    // Loop for computing localSums
    for (uint stride = group_size/2; stride>0; stride /=2)
    {
        // Waiting for each 2x2 addition into given workgroup
        barrier(CLK_LOCAL_MEM_FENCE);

        // Divide WorkGroup into 2 parts and add elements 2 by 2
        // between local_id and local_id + stride
        if (local_id < stride)
            localSums[local_id] += localSums[local_id + stride];
    }

    // Write result into partialSums[nWorkGroups]
    if (local_id == 0)
        partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you please tell me how to perform a second and final reduction of this array with the best possible performance, using the functions available in OpenCL 2.x? A classic solution is to perform this final reduction on the CPU, but ideally I would like to do it directly in kernel code.
A code snippet suggestion is welcome.
One last point: I am working on macOS High Sierra 10.13.5 with the following model:
Can OpenCL 2.x be used on this macOS hardware?
Atomic functions should be avoided because they harm performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do it in a single invocation for large amounts of data, as there is no way to synchronize between work groups.
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocations together is through OpenCL events.
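To make the multi-pass idea concrete, here is a minimal host-side sketch assuming the sumGPU kernel above, a work-group size of 256, and a power-of-two element count; the buffer names (bufA, bufB) and the kernel/queue handles are placeholders, not from the question:

/* Hedged sketch: keep launching sumGPU, ping-ponging buffers, until one value remains.
   Assumes total_elements is a power of two >= 256; error checking omitted. */
size_t wg = 256;                       /* assumed work-group size */
size_t n  = total_elements;            /* number of doubles to reduce */
cl_mem in = bufA, out = bufB;          /* placeholder buffers */
cl_event prev = NULL, next = NULL;

while (n > 1) {
    size_t local  = (n < wg) ? n : wg; /* shrink the group for the last pass */
    size_t global = n;                 /* one work-item per remaining value */

    clSetKernelArg(sumGPU, 0, sizeof(cl_mem), &in);
    clSetKernelArg(sumGPU, 1, sizeof(cl_mem), &out);
    clSetKernelArg(sumGPU, 2, local * sizeof(cl_double), NULL);  /* __local scratch */

    clEnqueueNDRangeKernel(queue, sumGPU, 1, NULL, &global, &local,
                           prev ? 1 : 0, prev ? &prev : NULL, &next);
    if (prev) clReleaseEvent(prev);
    prev = next;                       /* chain the next pass on this one's event */

    n = global / local;                /* one partial sum per work-group survives */
    cl_mem tmp = in; in = out; out = tmp;
}
/* element 0 of 'in' now holds the final sum; read it back with clEnqueueReadBuffer */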

headache for clEnqueueNDRangeKernel local work size

For OpenCL optimization, my idea is to try to match
1 work-group (kernel coding) to 1 compute unit (GPU hardware)
1 work-item (kernel coding) to 1 processing element (GPU hardware)
(Maybe my idea is not correct, please teach me.)
For example:
1. I have a global work size of 4000 by 3000.
2. My GPU OpenCL device has a maximum work-group size of 8192.
3. I call clEnqueueNDRangeKernel with the desired local work size (along with all other necessary parameters).
4. By function call:
a. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
b. clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, sizeof(size_t), (void*)&workGroupSizeUsed, NULL);
Both a and b above return 8192.
The maximum work-group size, CL_KERNEL_WORK_GROUP_SIZE, and CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE are all 8192.
I have no idea what I should follow to define my local work size...
(Q1) Any good idea for setting the local work size? (10x10? 40x30? X by Y?)
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL, global_work_item_size, local_work_item_size, 0, NULL, NULL);
It is a real headache to define this "local_work_item_size" for the clEnqueueNDRangeKernel function.
(Q2) Could someone explain the difference between setting the local work size to 1,1 versus 4000,3000?
Thank you in advance!
(Q1) Any good idea for setting the local work size? (10x10? 40x30? X by Y?)
As pmdj pointed out, this highly depends on your application. Since it is unclear how you selected your global_work_size, and it is also linked to the local_work_size, I would like to explain that one first. Usually what you want to do is map the size of the data you want to process to the global_work_size. E.g. if you have an array with 1024 values you would also pick a global_work_size of 1024, because then you can easily use the global id as an index in your OpenCL program:
int index = get_global_id(0);
input_array[index]++; // your data processing
However, the global_work_size is limited to a maximum of 2^32 - 1. If you have more data to process than that, you can pass your global_work_size and data size as parameters and use a loop like the following one:
int index = get_global_id(0);
for (int i = index; i < data_size; i += global_work_size) {
    input_array[i]++; // your data processing
}
The last fact which is important for the global_work_size is that it needs to be divisible by the local_work_size. This can result in your global_work_size being bigger than your data size; e.g. you could have 1000 values while your local_work_size is 32. Then you would make your global_work_size 1024 and ensure through a condition like the one above (i < data_size) that the redundant work items do not do anything weird like accessing unallocated memory areas.
The local_work_size depends on your platform. First of all, you should always have a local_work_size which is a multiple of 32 for NVIDIA or a multiple of 64 for AMD GPUs. This is the number of work items that are scheduled together anyway (a warp/wavefront). If you use a different number, the GPU will have idle threads which won't do anything but decrease your performance.
Not only the manufacturer but also the specific type of your GPU has to be considered to find the optimal local_work_size. The global_work_size divided by the local_work_size is the number of work groups. Each work group is executed by one thread inside your CPU/GPU. If you use OpenCL to run your application on powerful hardware you want to make sure that it runs as parallel as possible. E.g. if you use an Intel i7 with 8 threads you would want to make sure that you have at least 8 work groups (global_work_size / local_work_size >= 8). If you use an NVIDIA GeForce GTX 1060 with 1280 CUDA cores you would want to have at least 1280 work groups. But never at the cost of having a local_work_size of less than 32, which is more important!
Having more work groups than your hardware has threads does not matter; they will be processed sequentially. Hence for most applications you can always set your local_work_size to 32/64. The only exception is if you require synchronization among work items. E.g. barriers only work inside work groups, not among different work groups. An example: if you need to sum up chunks of 1024 values before being able to proceed with your algorithm, you would need to set your local_work_size to 1024 for the barrier to work as desired.
(Q2) Could someone explain the difference between setting the local work size to 1,1 versus 4000,3000?
Both the global_work_size and the local_work_size can have more than one dimension. Whether this is used or not depends solely on the preference of the programmer. All algorithms can be implemented in one dimension as well, and the number of work groups is calculated from the product of the dimensions; e.g. if your global_work_size is 20*20 and your local_work_size is 10*10, you would run the program with (20*20) / (10*10) = 4 work groups.
I personally like to use the dimensions if I am processing data which has multiple dimensions. Imagine your input is a two-dimensional image, you could simply use its width and height as global_work_size (e.g. 1024 * 1024) and the local_work_size accordingly (e.g. 32 * 32). In your code you could then use the following indices:
int x = get_global_id(0);
int y = get_global_id(1);
// flatten the 2D index into the 1D buffer (row-major; width = global size in x)
input_array[y * get_global_size(0) + x]++; // your data processing
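For completeness, here is a minimal host-side sketch of the 2D launch described above, assuming a 1024x1024 image and a 32x32 local size; the queue and kernel handles are placeholders:

/* Hedged sketch: enqueue a 2D NDRange matching the kernel indices above. */
size_t global[2] = { 1024, 1024 };   /* image width and height */
size_t local[2]  = { 32, 32 };       /* work-group size; must divide global evenly */

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    2,      /* work_dim */
                                    NULL,   /* no global offset */
                                    global, local,
                                    0, NULL, NULL);
/* (1024/32) * (1024/32) = 32 * 32 = 1024 work-groups are created */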

Speeding up CUDA atomics calculation for many bins/few bins

I am trying to optimize my histogram calculations in CUDA. It gives me an excellent speedup over the corresponding OpenMP CPU calculation. However, I suspect (in keeping with intuition) that most of the pixels fall into a few buckets. For argument's sake, assume that we have 256 pixels falling into, let us say, two buckets.
The easiest way to do it appears to be:
Load the variables into shared memory
Do vectorized loads for unsigned char, etc. if needed.
Do an atomic add in shared memory
Do a coalesced write to global.
Something like this:
__global__ void shmem_atomics_reducer(int *data, int *count){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ int block_reduced[NUM_THREADS_PER_BLOCK];
    block_reduced[threadIdx.x] = 0;
    __syncthreads();

    atomicAdd(&block_reduced[data[tid]],1);
    __syncthreads();

    for(int i=threadIdx.x; i<NUM_BINS; i+=NUM_BINS)
        atomicAdd(&count[i],block_reduced[i]);
}
The performance of this kernel drops (naturally) when we decrease the number of bins, from around 45 GB/s at 32 bins to around 10 GB/s at 1 bin. Contention and shared memory bank conflicts are given as reasons. I don't know if there is any way to remove either of these for this calculation in any significant way.
I've also been experimenting with another (beautiful) idea from the parallelforall blog involving warp level reductions using __ballot to grab warp results and then using __popc() to do the warp level reduction.
__global__ void ballot_popc_reducer(int *data, int *count ){
    uint tid = blockIdx.x*blockDim.x + threadIdx.x;
    uint warp_id = threadIdx.x >> 5;

    //need lane_ids since we are going warp level
    uint lane_id = threadIdx.x%32;

    //for ballot
    uint warp_set_bits=0;

    //to store warp level sum
    __shared__ uint warp_reduced_count[NUM_WARPS_PER_BLOCK];
    //shared data
    __shared__ uint s_data[NUM_THREADS_PER_BLOCK];

    //load shared data - could store to registers
    s_data[threadIdx.x] = data[tid];
    __syncthreads();

    //suspicious loop - I think we need more parallelism
    for(int i=0; i<NUM_BINS; i++){
        warp_set_bits = __ballot(s_data[threadIdx.x]==i);

        if(lane_id==0){
            warp_reduced_count[warp_id] = __popc(warp_set_bits);
        }
        __syncthreads();

        //do warp level reduce
        //could use shfl, but it does not change the overall picture
        if(warp_id==0){
            int t = threadIdx.x;
            for(int j = NUM_WARPS_PER_BLOCK/2; j>0; j>>=1){
                if(t<j) warp_reduced_count[t] += warp_reduced_count[t+j];
                __syncthreads();
            }
        }

        __syncthreads();

        if(threadIdx.x==0){
            atomicAdd(&count[i],warp_reduced_count[0]);
        }
    }
}
This gives decent numbers (well, that is moot - peak device memory bandwidth is 133 GB/s, and things seem to depend on launch configuration) for the single-bin case (35-40 GB/s for 1 bin, as against 10-15 GB/s using atomics), but performance drops drastically when we increase the number of bins. When we run with 32 bins, performance drops to about 5 GB/s. The reason is perhaps the single thread looping serially through all the bins, which calls for parallelizing the NUM_BINS loop.
I have tried several ways of going about parallelizing the NUM_BINS loop, none of which seem to work properly. For example, one could (very inelegantly) manipulate the kernel to create some blocks for each bin. This seems to behave the same way, possibly because we would again suffer from contention with multiple blocks attempting to read from global memory. Plus, the programming is clunky. Likewise, parallelizing in the y direction for bins gives similarly uninspiring results.
The other idea I tried just for kicks was dynamic parallelism, launching a kernel for each bin. This was disastrously slow, possibly owing to no real compute work for the child kernels and the launch overhead.
The most promising approach seems to be, from Nicholas Wilt's article,
using these so-called privatized histograms, containing bins for each thread in shared memory, which would ostensibly be very heavy on shared memory usage (and we only have 48 kB per SM on Maxwell).
Perhaps someone could shed some light on the problem? I feel that one ought to change the algorithm instead, so as not to use histograms, and use something less frequentist. Otherwise, I suppose we just use the atomics version.
Edit: The context for my problem is in computing probability density functions to be used for pattern-classification. We can compute approximate histograms (more precisely, pdfs) by using non-parametric methods such as Parzen Windows or Kernel Density Estimation. However, this does not overcome the problem of dimensionality as we need to sum over all data points for every bin, which becomes expensive when the number of bins becomes large. See here: Parzen
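For reference, the textbook Parzen-window / kernel density estimate (standard form, not specific to this code) is
\hat{f}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),
i.e. every evaluation point (bin) sums a kernel contribution over all n data points, which is exactly the cost that grows as the number of bins grows.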
I faced similar challenges working with clustering, but in the end the best solution was to use the scan pattern to group the processing, so I don't think it would work for you. Since you asked for some experience in this area, I'll share mine with you.
The issues
In your first code, I guess the low performance as the number of bins is reduced is linked to warp stalls, since you perform very little processing for every evaluated data element. When the number of bins is increased, the ratio of processing to global memory loads (data info) for that kernel also increases. You can check that very easily with the "Issue Efficiency" experiments in the Performance Analysis from Nsight. You are probably getting a low rate of cycles with at least one eligible warp (Warp Issue Efficiency).
Since I was not able to improve the number of eligible warps to somewhere close to 95%, I gave up this approach, since in some cases it gets worse (memory dependencies stalled 90% of my processing cycles).
The shuffle and vote reduction is very useful if the number of bins is not too large. If it is too large, only a small number of threads will be active for every bin filter, so you may end up with a lot of code divergence, and that is very undesirable for parallel processing. You may try to group the divergence in order to remove branching and have good control flow, so the whole warp/block performs similar processing, with most of the variation occurring across blocks.
A feasible solution
I don't remember where, but I have seen very good solutions to your problem around. Did you try this one?
Also, you can use a vectorized load and try something like the following, but I'm not sure how much it would improve your performance:
__global__ void hist(int4 *data, int *count, int N, int rem, unsigned int init) {
    __shared__ unsigned int sBins[N_OF_BINS]; // you may want to declare this one dynamically
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (threadIdx.x < N_OF_BINS) sBins[threadIdx.x] = 0;
    __syncthreads(); // make sure the shared bins are zeroed before use

    for (int i = 0; i < N; i += warpSize) {
        atomicAdd(&sBins[data[i + init].w], 1);
        atomicAdd(&sBins[data[i + init].x], 1);
        atomicAdd(&sBins[data[i + init].y], 1);
        atomicAdd(&sBins[data[i + init].z], 1);
    }

    // process remaining elements if the data size is not a multiple of 4,
    // using a recast and an additional control
    for (int i = 0; i < rem; i++) {
        atomicAdd(&sBins[reinterpret_cast<int*>(data)[N * 4 + init + i]], 1);
    }

    // update your histogram data here, e.g. flush the shared bins to the global count
    __syncthreads();
    if (threadIdx.x < N_OF_BINS) atomicAdd(&count[threadIdx.x], sBins[threadIdx.x]);
}

MKL Sparse BLAS segfault when transposing CSR with 100M rows

I am trying to use MKL Sparse BLAS for CSR matrices with the number of rows/columns on the order of 100M. My source code, which seems to work fine for 10M rows/columns, fails with a segfault when I increase it to 100M.
I isolated the problem to the following code snippet:
void TestSegfault1() {
    float values[1] = { 1.0f };
    int col_indx[1] = { 0 };
    int rows_start[1] = { 0 };
    int rows_end[1] = { 1 };

    // Step 1. Create 1 x 100M matrix
    // with single non-zero value at (0,0)
    sparse_matrix_t A;
    mkl_sparse_s_create_csr(
        &A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
        rows_start, rows_end, col_indx, values);

    // Step 2. Transpose it to get 100M x 1 matrix
    sparse_matrix_t B;
    mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);
}
This function segfaults in mkl_sparse_convert_csr with backtrace
#0 0x00000000004c0d03 in mkl_sparse_s_convert_csr_i4_avx ()
#1 0x0000000000434061 in TestSegfault1 ()
For slightly different code (but essentially the same) it has a little more detail:
#0 0x00000000008fc09b in mkl_serv_free ()
#1 0x000000000099949e in mkl_sparse_s_export_csr_data_i4_avx ()
#2 0x0000000000999ee4 in mkl_sparse_s_convert_csr_i4_avx ()
Apparently something goes bad in memory allocation. And it sure looks like some kind of integer overflow from the outside. The build of MKL I have uses MKL_INT = int = int32.
Is it indeed the case that the limit on the number of rows I can have in a Sparse BLAS CSR matrix is < 100M (it looks more like ~65M)? Or am I doing it wrong?
EDIT 1: MKL version string is "Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications".
EDIT 2: Figured it out. There is indeed a subtle kind of integer overflow when allocating memory for internal per-thread buffers. At some point inside mkl_sparse_s_export_csr_data_i4_avx it attempts to allocate (omp_get_max_threads() + 1) * num_rows * 4 bytes; that number doesn't fit in a 32-bit signed integer. The subsequent call to mkl_serv_malloc causes memory corruption and eventually a segfault. One possible workaround is to lower the number of OpenMP threads via an omp_set_num_threads call.
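To see how quickly this overflows, here is a small hedged arithmetic check (the thread count of 24 is an assumption for illustration; the formula is the one from the EDIT above):

#include <limits.h>
#include <stdio.h>

int main(void) {
    long long num_rows = 100000000LL;   /* 100M rows, as in the question */
    long long threads  = 24;            /* assumed omp_get_max_threads() */
    long long bytes    = (threads + 1) * num_rows * 4;  /* = 10,000,000,000 */

    printf("requested %lld bytes, INT_MAX = %d -> %s\n", bytes, INT_MAX,
           bytes > INT_MAX ? "overflows a 32-bit signed size" : "fits");
    return 0;
}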
Could you check your example with the latest version of MKL? I ran it on MKL 11.3.2 and it passed correctly for a 100M matrix. However, it could depend on the number of threads on your machine (the matrix size multiplied by the number of threads has to be less than max int). To prevent such issues I strongly recommend using the ILP64 version of the MKL libraries.
Thanks,
Alex
Check how this example works with the latest MKL 2019 u4,
compiling the example in ILP64 mode as follows:
icc -I/opt/intel/compilers_and_libraries_2019/linux/mkl/include test_csr.cpp \
-L/opt/intel/compilers_and_libraries_2019/linux/mkl/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
./a.out
mkl_sparse_convert_csr passed

How to optimize OpenCL code for neighbors accessing?

Edit: Results of the proposed solutions are added at the end of the question.
I'm starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each element has a bunch of information (around 200 bytes). Every step, every element accesses its neighbors' information and accumulates it to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create a one-dimensional OpenCL buffer and fill it with structs representing the elements, which have an "int neighbors[6]" where I store the indices of the neighbors in the buffer. I launch a kernel that consults the neighbors and accumulates their information into element variables not consulted in this step, and then I launch another kernel that uses these variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct{
    float4 var1;
    float4 var2;
    float4 nextStepVar1;
    int neighbors[8];
    int var3;
    int nextStepVar2;
    bool var4;
} Element;

__kernel void step1(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    for (int i=0; i < 6; ++i){
        if (elem.neighbors[i] != -1){
            //Gather information of the neighbor and accumulate it in elem.nextStepVars
        }
    }
    elements[id] = elem;
}

__kernel void step2(__global Element *elements, int nelements){
    int id = get_global_id(0);
    if (id >= nelements){
        return;
    }
    Element elem = elements[id];
    //update elem variables by using elem.nextStepVariables
    //restart elem.nextStepVariables
}
Right now, my OpenCL implementation takes basically the same time as my C++ implementation.
So, the question is: How would you (the experts :P) address this problem?
I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Edit: Third and fifth optimizations proposed by Huseyin Tugrul were already in the code. As mentioned here, to make structs behave properly, they need to satisfy some restrictions, so it is worth understanding that to avoid headaches.
Edit 1: Applying the seventh optimization proposed by Huseyin Tugrul performance increased from 7 fps to 60 fps. In a more general experimentation, the performance gain was about x8.
Edit 2: Applying the first optimization proposed by Huseyin Tugrul, performance increased by about x1.2. I think the real gain is higher, but it is hidden by another bottleneck not yet solved.
Edit 3: Applying the 8th and 9th optimizations proposed by Huseyin Tugrul didn't change performance, because of the lack of significant code taking advantage of these optimizations, worth trying in other kernels though.
Edit 4: Passing invariant arguments (such as n_elements or workgroupsize) to the kernels as #DEFINEs instead of kernel args, as mentioned here, increased performance around x1.33. As explained in the document, this is because of the aggressive optimizations that the compiler can do when it knows the variables at compile time. (A build-option sketch follows after Edit 6.)
Edit 5: Applying the second optimization proposed by Huseyin Tugrul, but using 1 bit per neighbor and using bitwise operations to check if a neighbor is present (so, if neighbors & 1 != 0, the top neighbor is present; if neighbors & 2 != 0, the bottom neighbor is present; if neighbors & 4 != 0, the right neighbor is present, etc.), increased performance by a factor of x1.11. I think this was mostly because of the data transfer reduction, because data movement was, and continues to be, my bottleneck. Soon I will try to get rid of the dummy variables used to add padding to my structs.
Edit 6: By eliminating the structs that I was using, and creating separated buffers for each property, I eliminated the padding variables, saving space, and was able to optimize the global memory access and local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing this, despite the programmatic complexity and unreadability.
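Regarding Edit 4: a minimal host-side sketch of how invariant values can be baked in at build time via clBuildProgram options (the macro names N_ELEMENTS and WG_SIZE are placeholders, not from the original code; program and device are assumed to exist already):

/* Hedged sketch: pass invariants as compile-time defines instead of kernel args. */
char options[128];
snprintf(options, sizeof(options), "-DN_ELEMENTS=%d -DWG_SIZE=%d", nelements, 128);
clBuildProgram(program, 1, &device, options, NULL, NULL);

/* Inside the kernel source, N_ELEMENTS and WG_SIZE then behave as constants:
   if (get_global_id(0) >= N_ELEMENTS) return;                               */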
According to your step1 and step2, you are not making your GPU cores work hard. What is your kernel's complexity? What is your GPU usage? Did you check with monitoring programs like Afterburner? Mid-range desktop gaming cards can get 10k threads each doing 10k iterations.
Since you are working with only neighbours, data size/calculation size is too big and your kernels may be bottlenecked by VRAM bandwidth. Your main system RAM could be as fast as your PCI-e bandwidth, and this could be the issue.
1) Use of Dedicated Cache: get each thread's own grid cell into private registers; that is fastest. Then get the neighbours into a __local array so the comparisons/calculations are done on chip only.
Load current cell into __private
Load neighbours into __local
start looping for local array
get next neighbour into __private from __local
compute
end loop
(if it has many neighbours, the lines after "Load neighbours into __local" can be in another loop that fetches from main memory in patches)
What is your GPU? Nice, it is a GTX 660. You should have 64 kB of controllable cache per compute unit. CPUs have only about 1 kB of registers, which are not addressable for array operations.
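A rough kernel-side sketch of this caching pattern, assuming the Element struct from the question (WG_SIZE is an assumed work-group size, nelements is assumed to be a multiple of it, and the accumulation line is illustrative only):

/* Hedged sketch: current cell in private registers, neighbours staged in __local.
   Local-memory budget is WG_SIZE * 6 * sizeof(Element), so keep the group small. */
#define WG_SIZE 32

__kernel void step1_cached(__global const Element *elements, __global Element *out)
{
    int id  = get_global_id(0);
    int lid = get_local_id(0);

    Element elem = elements[id];             /* the thread's own cell, in __private */

    __local Element neigh[WG_SIZE * 6];      /* this group's neighbours, in __local */
    for (int i = 0; i < 6; ++i)
        if (elem.neighbors[i] != -1)
            neigh[lid * 6 + i] = elements[elem.neighbors[i]];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0; i < 6; ++i) {
        if (elem.neighbors[i] == -1) continue;
        Element nb = neigh[lid * 6 + i];     /* next neighbour pulled into __private */
        elem.nextStepVar1 += nb.var1;        /* illustrative accumulation only */
    }
    out[id] = elem;
}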
2) Shorter Indexing: you could use a single byte as the stored neighbour index instead of an int. Saving precious L1 cache space on "id" fetches is important, so that other threads can hit the L1 cache more!
Example:
0=neighbour from left
1=neighbour from right
2=neighbour from up
3=neighbour from down
4=neighbour from front
5=neighbour from back
6=neighbour from upper left
...
...
So you can just derive the neighbour index from a single byte instead of a 4-byte int, which decreases main memory accesses, at least for neighbour access. Your kernel will derive the neighbour index from the table above using its compute power, not memory bandwidth, because you would do this from core registers (__privates). If your total grid size is constant, this is very easy, such as just adding 1 to the actual cell id, adding 256 to the id, or adding 256*256 to the id, and so on.
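A small sketch of deriving the linear index from the one-byte code, assuming a constant 256 x 256 x depth grid with x varying fastest (the offsets are illustrative; match them to the real layout):

/* Hedged sketch: map a 1-byte neighbour code to a linear index offset. */
int neighbor_index(int cell_id, uchar code)
{
    switch (code) {
        case 0: return cell_id - 1;           /* left  */
        case 1: return cell_id + 1;           /* right */
        case 2: return cell_id - 256;         /* up    */
        case 3: return cell_id + 256;         /* down  */
        case 4: return cell_id - 256 * 256;   /* front */
        case 5: return cell_id + 256 * 256;   /* back  */
        default: return -1;                   /* no neighbour stored */
    }
}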
3) Optimum Object Size: make your struct/cell-object size a multiple of 4 bytes. If your total object size is around 200 bytes, you can pad it or augment it with some empty bytes to make it exactly 200 bytes, 220 bytes, or 256 bytes.
4) Branchless Code (Edit: depends!): use fewer if-statements. Using if-statements can make computation much slower. Rather than checking for -1 as an invalid neighbour index, you can use another way, because lightweight cores are not as capable as heavyweight ones. You can use surface buffer cells to wrap the surface so that computed cells always have 6 neighbours, and you get rid of if (elem.neighbors[i] != -1). Worth a try, especially on a GPU.
Just computing all neighbours is faster than doing an if-statement. Just multiply the resulting change by zero when it is not a valid neighbour. How can we know that it is not a valid neighbour? By using a byte array of 6 elements per cell (parallel to the neighbour id array) (invalid=0, valid=1 --> multiply the result by this; see the sketch just below).
The if-statement is inside a loop which counts to six. Loop unrolling gives a similar speed-up if the workload in the loop is relatively light.
But if all threads within the same warp go into the same if-or-else branch, they don't lose performance. So this depends on whether your code diverges or not.
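A minimal sketch of the multiply-by-validity idea (the 'valid' array is hypothetical, a per-cell uchar[6] parallel to elem.neighbors; invalid slots are assumed to point at a dummy surface cell so the read is always safe):

/* Hedged sketch: branchless accumulation via a 0/1 validity mask. */
float4 acc = (float4)(0.0f);
for (int i = 0; i < 6; ++i) {
    float w = (float)valid[i];                  /* 0 = invalid, 1 = valid */
    Element nb = elements[elem.neighbors[i]];   /* always a safe read */
    acc += w * nb.var1;                         /* scaled to zero when invalid */
}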
5) Data Element Reordering: you can move the int[8] member to the top of the struct so memory accesses may become more efficient, and the smaller members towards the bottom can be read in a single read operation.
6) Size of Work-group: trying different local work-group sizes can give 2-3x performance. Going from 16 up to 512 gives different results. For example, AMD GPUs like integer multiples of 64 while NVIDIA GPUs like integer multiples of 32. Intel does fine with 8 and upwards, since it can meld multiple compute units together to work on the same work-group.
7) Separation of Variables (only if you can't get rid of if-statements): separate the comparison elements from the struct. This way you don't need to load a whole struct from main memory just to compare an int or a boolean. When the comparison requires it, then load the struct from main memory (if you already have the local memory optimization, put this operation before it so that loading into local memory is only done for the selected neighbours).
This optimisation makes the best case (no neighbours or only one neighbour) considerably faster. It does not affect the worst case (maximum neighbours).
8a) Magic: use shifting instead of dividing by a power of 2, and do similarly for modulo. Put "f" at the end of floating-point literals (1.0f instead of 1.0) to avoid automatic promotion to double precision.
8b) Magic-2: the -cl-mad-enable compiler option can increase multiply+add operation speed.
9) Latency Hiding: execution configuration optimization. You need to hide memory access latency and take care of occupancy.
Get maximum cycles of latency for instructions and global memory access.
Then divide memory latency by instruction latency.
Now you have the ratio of arithmetic instructions per memory access needed to hide latency.
If it takes N instructions to hide memory latency and you have only M instructions in your code, then you will need N/M warps (wavefronts?) to hide the latency, because a thread on the GPU can do arithmetic while another thread is getting things from memory.
10) Mixed-Type Computing: after memory access is optimized, swap or move some instructions where applicable to get better occupancy; use the half type to help floating-point operations where precision is not important.
11) Latency Hiding again: try your kernel code with only arithmetic (comment out all memory accesses and initialize the values with 0 or something you like), then try your kernel code with only memory access instructions (comment out the calculations/ifs).
Compare the kernel times with the original kernel time. Which is affecting the original time more? Concentrate on that.
12) Lane & Bank Conflicts: correct any LDS lane conflicts and global memory bank conflicts, because accesses to the same address can be serialized, slowing the process (newer cards have a broadcast ability to reduce this).
13) Using Registers: try to replace any independent locals with privates, since your GPU can give nearly 10 TB/s of throughput using registers.
14) Not Using Registers: don't use too many registers or they will spill to global memory and slow the process.
15) Minimalistic Approach for Occupancy: look at local/private usage to get an idea of occupancy. If you use much more local and private memory, fewer threads can be utilized in the same compute unit, leading to lower occupancy. Less resource usage gives a higher chance of occupancy (if you have enough total threads).
16) Gather/Scatter: when neighbours are different particles (like an n-body NNS) at random addresses in memory, it may be hard to apply, but gather-read optimization can give 2x-3x speed on top of the previous optimizations (it needs the local memory optimization to work): it reads from memory in order instead of randomly and reorders the data as needed in local memory to share (scatter) it among the threads.
17) Divide and Conquer: in case the buffer is too big and is copied between host and device, making the GPU wait idle, divide it in two, send them separately, start computing as soon as one arrives, and send the results back concurrently at the end. Even process-level parallelism could push a GPU to its limits this way. Also, the L2 cache of the GPU may not be enough for the whole of the data. This is cache-tiled computing, but done implicitly instead of through direct use of local memory.
18) Bandwidth from Memory Qualifiers: when the kernel needs some extra 'read' bandwidth, you can use the '__constant' keyword (instead of __global) on some parameters which are small in size and read-only. If those parameters are too large then you can still get good streaming from the '__read_only' qualifier (after the '__global' qualifier). Similarly, '__write_only' increases throughput, but these give mostly hardware-specific performance. On AMD's HD 5000 series, constant is good. Maybe the GTX 660 is faster with its cache, so __read_only may become more usable (or Nvidia uses the cache for __constant?).
Have three parts of the same buffer, one as __global __read_only, one as __constant, and one as just __global (if building them doesn't cost more than the reads' benefits).
I just tested my card using the AMD APP SDK examples: LDS bandwidth shows 2 TB/s while constant is 5 TB/s (same indexing instead of linear/random) and main memory is 120 GB/s.
Also, don't forget to add restrict to kernel parameters where possible. This lets the compiler do more optimizations on them (if you are not aliasing them).
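An illustrative kernel signature combining these qualifiers (hedged; 'small_lookup' is a hypothetical small read-only table, and whether each qualifier actually helps is hardware-specific, as noted above):

/* Hedged sketch: const/__constant qualifiers plus restrict on kernel parameters. */
__kernel void step1_qualified(__global const Element * restrict elements,
                              __constant int *small_lookup,
                              __global Element * restrict out,
                              int nelements)
{
    int id = get_global_id(0);
    if (id >= nelements) return;
    /* ... same body as step1, reading small_lookup where a table is needed ... */
}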
19) Modern hardware: transcendental functions are faster than the old bit-hack versions (like the Quake 3 fast inverse square root).
20) Now there is OpenCL 2.0, which enables spawning kernels inside kernels, so you can further increase resolution at a 2D grid point and offload it to a work-group when needed (something like dynamically increasing vorticity detail on the edges of a fluid).
A profiler can help with all of these, but any FPS indicator will do if only a single optimization is done per step.
Even if benchmarking is not for architecture-dependent code paths, you could try having a multiple of 192 dots per row in your compute space, since your GPU has a multiple of that number of cores, and benchmark whether that makes the GPU more occupied and gives more giga-floating-point operations per second.
There must still be some room for optimization after all these options, but I don't know whether it would damage your card or be feasible within the production time of your projects. For example:
21) Lookup Tables: when there is 10% more memory bandwidth headroom but no compute power headroom, offload 10% of the work-items to a LUT version such that it gets precomputed values from a table. I didn't try it, but something like the following should work:
8 compute groups
2 LUT groups
8 compute groups
2 LUT groups
so they are evenly distributed among the "threads in flight" and take advantage of latency hiding. I'm not sure if this is a preferable way of doing science.
22) Z-order Pattern: traversing neighbours in Z-order increases the cache hit rate. A higher cache hit rate saves some global memory bandwidth for other jobs, so overall performance increases. But this depends on the size of the cache, the data layout, and some other things I don't remember.
23) Asynchronous Neighbor Traversal
iteration-1: Load neighbor 2 + compute neighbor 1 + store neighbor 0
iteration-2: Load neighbor 3 + compute neighbor 2 + store neighbor 1
iteration-3: Load neighbor 4 + compute neighbor 3 + store neighbor 2
so each loop body doesn't have any chain of dependencies and is fully pipelined on the GPU processing elements, and OpenCL has special instructions for asynchronously loading/storing global data using all cores of a work-group. Check this:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
Maybe you can even divide the computing part in two and have one part use transcendental functions and the other part use add/multiply, so that the add/multiply operations don't wait for a slow sqrt. If there are at least several neighbours to traverse, this should hide some latency behind the other iterations.
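A minimal sketch of the async-copy pattern from the linked page, double-buffering tiles of a plain float4 array (rather than the Element struct, purely to keep the illustration simple); TILE is an assumed tile size equal to the work-group size, and a single work-group is shown:

/* Hedged sketch: prefetch the next tile with async_work_group_copy while computing
   on the current one. Assumes get_local_size(0) == TILE. */
#define TILE 64

__kernel void process_async(__global const float4 *in, __global float4 *out, int ntiles)
{
    __local float4 buf[2][TILE];
    int lid = get_local_id(0);

    /* kick off the copy of tile 0 for the whole work-group */
    event_t ev = async_work_group_copy(buf[0], in, TILE, 0);

    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        wait_group_events(1, &ev);            /* tile 'cur' is now ready */
        barrier(CLK_LOCAL_MEM_FENCE);         /* everyone is done with buf[nxt] */
        if (t + 1 < ntiles)                   /* prefetch the next tile */
            ev = async_work_group_copy(buf[nxt], in + (t + 1) * TILE, TILE, 0);

        /* compute on the current tile while the next one streams in */
        out[t * TILE + lid] = buf[cur][lid] * 2.0f;
    }
}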