OpenCL sum reduction across different work-groups gives the wrong result

I'm trying to write an OpenCL kernel that sum-reduces each row of a matrix (g_idata) into an array (g_odata). The matrix is stored as a float array of length column_count * row_count, and the resulting array has length row_count. I've implemented the following kernel:
#define T float
#define Operation(X, Y) ((X) + (Y))
__kernel void marrow_kernel( __global T *g_odata,__global T *g_idata,
const unsigned long column_count, const unsigned long row_count, __local volatile T* sdata) {
size_t tid = get_local_id(0);
size_t gid = get_global_id(0);
size_t row = gid / column_count;
size_t column = gid % column_count;
if(row < row_count && column < column_count)
{
sdata[tid] = g_idata[gid];
}
barrier(CLK_LOCAL_MEM_FENCE);
if(row < row_count && column < column_count)
{
size_t step = column_count / 2;
size_t limit = column_count;
while(step > 0)
{
if(column + step < limit) {
if(tid + step < get_local_size(0))
{
sdata[tid] = Operation(sdata[tid], sdata[tid + step]);
}
else if (gid + step < column_count * row_count)
{
sdata[tid] = Operation(sdata[tid], g_idata[gid + step]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
step /= 2;
limit /= 2;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(row < row_count && column == 0)
{
g_odata[row] = column_count % 2 == 0 ? sdata[tid] : sdata[tid] + g_idata[gid + (column_count - 1)];
}
}
The kernel is currently launched with work-groups of 128 work-items, and I have no control over the work-group size.
Here's the issue: if a row is split across two different work-groups, the kernel returns the wrong result. A work-item cannot access the local memory of the next work-group, so it reads the value from g_idata instead. After the first iteration that is the wrong value (the original input rather than the partial sum), and it affects the final result of the operation.
Can anyone give me a hint on how to solve this problem?
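One common way around the cross-group dependency is to give each row its own work-group, so that no partial sum ever has to leave local memory. The sketch below only simulates that layout on the host in plain C; it is not an OpenCL kernel, and the names LOCAL_SIZE, reduce_row and reduce_rows are hypothetical. It assumes the launch can use a global size of row_count * 128, which the question does not confirm.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE 128 /* work-group size from the question; a power of two */

/* Simulates one work-group reducing one row. "Work-item" tid first sums the
   columns tid, tid + LOCAL_SIZE, ... so any column_count is handled. */
static float reduce_row(const float *row, size_t column_count) {
    float sdata[LOCAL_SIZE]; /* stands in for __local memory */
    for (size_t tid = 0; tid < LOCAL_SIZE; tid++) {
        float acc = 0.0f;
        for (size_t c = tid; c < column_count; c += LOCAL_SIZE)
            acc += row[c];
        sdata[tid] = acc;
    }
    /* Tree reduction; on a device a barrier would separate the steps. */
    for (size_t step = LOCAL_SIZE / 2; step > 0; step /= 2)
        for (size_t tid = 0; tid < step; tid++)
            sdata[tid] += sdata[tid + step];
    return sdata[0];
}

/* One simulated work-group per row: group id == row index. */
void reduce_rows(float *g_odata, const float *g_idata,
                 size_t column_count, size_t row_count) {
    assert(g_odata != NULL && g_idata != NULL);
    for (size_t row = 0; row < row_count; row++)
        g_odata[row] = reduce_row(g_idata + row * column_count, column_count);
}
```

In a real kernel the row would come from get_group_id(0) and tid from get_local_id(0); the strided first pass replaces the reads from a neighbouring group's data that go wrong in the original kernel.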

How to not parallelize inner loops in OpenACC

I am a beginner in GPU programming with OpenACC. I am trying to implement a direct convolution, which consists of 6 nested loops, and I only want the outermost loop to be parallelized. I put #pragma acc loop on the first loop and #pragma acc loop seq on the rest, but the output I am getting is not correct. Is this the right way to parallelize only the outer loop? Specifications for the convolution: input channels: 3, input size: 224x224x3, output channels: 64, output size: 111x111x64, filter size: 3x3x3x64. The header files dog.h and squeezenet_params.h are available here: https://drive.google.com/drive/folders/1a9XRjBTrEFIorrLTPFHS4atBOPrG886i
# include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"
void conv3x3(
const int input_channels, const int input_size,
const int pad, const int stride, const int start_channel,
const int output_size, const float* restrict input_im, const float* restrict filter_weight,
const float* restrict filter_bias, float* restrict output_im){
#pragma acc data copyin (input_im[0:150528],filter_weight[0:1728],filter_bias[0:64]) copyout(output_im[0:788544])
{
#pragma acc parallel
{
#pragma acc loop
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
output_im += (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop seq
for(int i = 0; i < output_size; i++)
{
#pragma acc loop seq
for(int j = 0; j < output_size; j++)
{
//compute one element in the output feature map
float tmp = bias;
//compute dot product of 2 input_channels x 3 x 3 matrix
#pragma acc loop seq
for(int k = 0; k < input_channels; k++)
{
#pragma acc loop seq
for(int l = 0; l < 3; l++)
{
int h = i * stride + l - pad;
#pragma acc loop seq
for(int m = 0; m < 3; m++)
{
int w = j * stride + m - pad;
if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
{
tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
* filter_weight[9 * k + 3 * l + m];
}
}
}
}
//add relu activation after conv
output_im[i * output_size + j] = (tmp > 0.0) ? tmp : 0.0;
}
}
}
}
}
}
int main(void){
float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
for(int i=0;i<64 * 111 * 111;++i){
//if(result[i]>0)
printf("%f:%d\n",result[i],i);
}
}
The contributor posted the same question on the PGI User Forums, where I've answered it (see: https://www.pgroup.com/userforum/viewtopic.php?f=4&t=7614). The question title is misleading: the inner loops are not being parallelized, nor are they the cause of the issue.
The problem here is that the code has a race condition on the shared "output_im" pointer. My suggested solution is to compute a per thread offset into the array rather than trying to manipulate the pointer itself.
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
int offset;
offset = (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop vector collapse(2)
for(int i = 0; i < output_size; i++)
{
for(int j = 0; j < output_size; j++)
{
... cut ...
}
}
//add relu activation after conv
int idx = offset + (i * output_size + j);
output_im[idx] = (tmp > 0.0) ? tmp : 0.0;
}
}
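The offset idea can be seen in isolation in the following minimal sketch. It uses a deliberately simplified, hypothetical layer (scale_bias_relu, one weight per channel) rather than the real 3x3 convolution, and it is plain serial C; the OpenACC directive is only mentioned in a comment. The point it illustrates is that every index is derived from the loop variable p, so the shared pointers are never modified. Note also that a cumulative `+=` on a pointer inside the loop advances it by the running sum of all previous offsets, which is a second reason to index instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified layer:
   out[p][i][j] = relu(bias[p] + weight[p] * in[i][j]). */
void scale_bias_relu(int channels, int size,
                     const float *in, const float *weight,
                     const float *bias, float *out) {
    assert(channels >= 0 && size >= 0);
    /* With OpenACC, this outer loop is where "#pragma acc parallel loop"
       would go; iterations are independent because nothing shared changes. */
    for (int p = 0; p < channels; p++) {
        /* Offset computed from p alone; "out" itself is never advanced. */
        const size_t offset = (size_t)p * size * size;
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                const float tmp = bias[p] + weight[p] * in[i * size + j];
                out[offset + (size_t)i * size + j] = (tmp > 0.0f) ? tmp : 0.0f;
            }
        }
    }
}
```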

How to pass a pointer argument to a function without knowing the size to be allocated for that pointer

I know this question is very noob. I am trying to understand how the pointer thing works. I studied basics of C but still did not understand this.
Given this piece of function:
+ (void)nv21ToRgbWithWidth:(unsigned int)width height:(unsigned int)height yuyv:(unsigned char *)yuyv rgb:(unsigned char *)rgb
{
const int nv_start = width * height ;
UInt32 i, j, index = 0, rgb_index = 0;
UInt8 y, u, v;
int r, g, b, nv_index = 0;
for(i = 0; i < height ; i++)
{
for(j = 0; j < width; j ++){
//nv_index = (rgb_index / 2 - width / 2 * ((i + 1) / 2)) * 2;
nv_index = i / 2 * width + j - j % 2;
y = yuyv[rgb_index];
u = yuyv[nv_start + nv_index ];
v = yuyv[nv_start + nv_index + 1];
r = y + (140 * (v-128))/100; //r
g = y - (34 * (u-128))/100 - (71 * (v-128))/100; //g
b = y + (177 * (u-128))/100; //b
if(r > 255) r = 255;
if(g > 255) g = 255;
if(b > 255) b = 255;
if(r < 0) r = 0;
if(g < 0) g = 0;
if(b < 0) b = 0;
index = rgb_index % width + (height - i - 1) * width;
rgb[index * 3+0] = b;
rgb[index * 3+1] = g;
rgb[index * 3+2] = r;
rgb_index++;
}
}
}
How am I supposed to know how the unsigned char * for rgb should be initialized before passing it to the function?
I tried calling the function like this:
unsigned char *rgb = NULL;
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
But the program crashes on this line:
rgb[index * 3+0] = b;
I see rgb was initialized with NULL, so you can't assign values. So, I thought of initializing an array and pass it to pointer rgb like this:
unsigned char rgbArr[10000];
unsigned char *rgb = rgbArr;
but the function still crashes. I really don't know how I should pass the rgb parameter to this function. Please help me understand this.
The expected size in bytes is at least height*width*3, because the function writes three bytes per pixel. A fixed array such as unsigned char rgbArr[10000] is too small for anything but a tiny image, so the function writes past its end and the program is likely to crash. Declaring a local array that is large enough may in turn exceed the stack limit, so I'd use the heap instead:
unsigned char* rgb = malloc(imageHeight*imageWidth*3);
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
...
free(rgb);
That is what the malloc(), calloc(), realloc() and free() functions are for. Don't forget to use the free() function to prevent memory leaks... I hope that helps.
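As a minimal sketch of the sizing rule, the helpers below compute the byte count before allocating. The names rgb_buffer_size, alloc_rgb_buffer and fill_bgr are hypothetical and not part of the original class; the only assumption taken from the question's code is that the output holds three bytes per pixel in B, G, R order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes the conversion writes: 3 per pixel. Returns 0 if the size would
   overflow size_t (or if the image is empty). */
size_t rgb_buffer_size(unsigned int width, unsigned int height) {
    const size_t pixels = (size_t)width * height;
    if (width != 0 && pixels / width != height) return 0;
    if (pixels > SIZE_MAX / 3) return 0;
    return pixels * 3;
}

/* Heap buffer large enough for the whole image; caller must free() it. */
unsigned char *alloc_rgb_buffer(unsigned int width, unsigned int height) {
    const size_t size = rgb_buffer_size(width, height);
    return size ? malloc(size) : NULL;
}

/* Writes one solid colour using the same index pattern as the conversion
   (index * 3 + 0..2), which shows the highest index touched is size - 1. */
void fill_bgr(unsigned char *rgb, unsigned int width, unsigned int height,
              unsigned char b, unsigned char g, unsigned char r) {
    assert(rgb != NULL);
    const size_t pixels = (size_t)width * height;
    for (size_t index = 0; index < pixels; index++) {
        rgb[index * 3 + 0] = b;
        rgb[index * 3 + 1] = g;
        rgb[index * 3 + 2] = r;
    }
}
```

For a 640x480 frame this gives 921600 bytes, far more than the 10000-byte array tried in the question.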

Algorithm to group consecutive words minimizing length per group

From an input of space-delimited words, how to concatenate consecutive words so that:
each group has a minimum length L (spaces don't count)
longest group length is minimal (spaces don't count)
Example input:
would a cat eat a mouse
Example minimum length:
L = 5
Naive algorithm that solves the first condition but not the second one:
while length of a group is less than L, concatenate next word to group
if last group is shorter than L, concatenate last two groups together
This naive algorithm produces:
group 1: would
group 2: acateat
group 3: amouse
longest group length: 7
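For reference, the naive algorithm described above can be sketched in C as follows. The function name naive_longest_group and its signature are hypothetical; it only tracks group lengths (spaces not counted) and returns the length of the longest group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Greedy grouping: keep appending words until the group reaches min_len,
   then start a new group. A short leftover group is merged into the last one.
   Returns the longest group length and stores the number of groups. */
size_t naive_longest_group(const char *const *words, size_t n,
                           size_t min_len, size_t *group_count) {
    assert(words != NULL && group_count != NULL);
    size_t longest = 0; /* longest closed group so far */
    size_t last = 0;    /* length of the most recently closed group */
    size_t current = 0; /* length of the group being built */
    size_t groups = 0;
    for (size_t i = 0; i < n; i++) {
        current += strlen(words[i]);
        if (current >= min_len) {
            if (current > longest) longest = current;
            last = current;
            current = 0;
            groups++;
        }
    }
    if (current > 0) { /* leftover group shorter than min_len */
        if (groups == 0) groups = 1;
        else current += last; /* merge with the previous group */
        if (current > longest) longest = current;
    }
    *group_count = groups;
    return longest;
}
```

Called with `const char *const words[] = {"would", "a", "cat", "eat", "a", "mouse"};` and min_len 5, it reports 3 groups with a longest length of 7, matching the grouping listed above.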
Second condition is not solved because a better solution would be:
group 1: woulda
group 2: cateat
group 3: amouse
longest group length: 6
Which algorithm would solve the second condition (minimal longest group) with relatively fast execution as a program? (by fast, I'd like to avoid testing all possible combinations)
I know C, ObjC, Swift, Javascript, Python, but pseudocode is fine.
This can be done with a dynamic programming approach. Define F(i) as the minimum length of the longest group over all valid divisions of the first i words into groups.
F(0) = 0
F(i) = Min(Max(F(j), totalLen(j+1, i))), for j in [0..i-1]
Where
totalLen(i, j) = total length of words from i to j, if the length is at least L
totalLen(i, j) = MAX, if total length is less than L
The answer is the value of F(n). To get the groups themselves we can save the indices of the best j for every i.
Here is an implementation from scratch in C++:
const vector<string> words = {"would", "a", "cat", "eat", "a", "mouse"};
const int L = 5;
int n = words.size();
vector<int> prefixLen = countPrefixLen(words);
vector<int> f(n+1);
vector<int> best(n+1, -1);
int maxL = prefixLen[n];
f[0] = 0;
for (int i = 1; i <= n; ++i) {
f[i] = maxL;
for (int j = 0; j < i; ++j) {
int totalLen = prefixLen[i] - prefixLen[j];
if (totalLen >= L) {
int maxLen = max(f[j], totalLen);
if (f[i] > maxLen) {
f[i] = maxLen;
best[i] = j;
}
}
}
}
output(f[n], best, words);
Preprocessing and output details:
vector<int> countPrefixLen(const vector<string>& words) {
int n = words.size();
vector<int> prefixLen(n+1);
for (int i = 1; i <= n; ++i) {
prefixLen[i] = prefixLen[i-1] + words[i-1].length();
}
return prefixLen;
}
void output(int answer, const vector<int>& best, const vector<string>& words) {
cout << answer << endl;
int j = best.size()-1;
vector<int> restoreIndex(1, j);
while (j > 0) {
int i = best[j];
restoreIndex.push_back(i);
j = i;
}
reverse(restoreIndex.begin(), restoreIndex.end());
for (int i = 0; i+1 < restoreIndex.size(); ++i) {
for (int j = restoreIndex[i]; j < restoreIndex[i+1]; ++j) {
cout << words[j] << ' ';
}
cout << endl;
}
}
Output:
6
would a
cat eat
a mouse
Runnable: https://ideone.com/AaV5C8
Further improvement
The complexity of this algorithm is O(N^2). If it is too slow for your data I can suggest a simple optimization:
Let's reverse the direction of the inner loop. First, this lets us get rid of the prefixLen array and its preprocessing, because we now add words to the group one by one (we could have dropped the preprocessing in the initial version too, but at the expense of simplicity). More importantly, we can break out of the loop as soon as totalLen is not less than the already computed f[i], because further iterations can never lead to an improvement. The inner loop can be changed to:
int totalLen = 0;
for (int j = i-1; j >= 0; --j) {
totalLen += words[j].length();
if (totalLen >= L) {
int maxLen = max(f[j], totalLen);
if (f[i] > maxLen) {
f[i] = maxLen;
best[i] = j;
}
}
if (totalLen >= f[i]) break;
}
This can drastically improve the performance for not very big values of L.

Find nth int with 10 set bits

Find the nth int with 10 set bits
n is an int in the range 0<= n <= 30 045 014
The 0th int = 1023, the 1st = 1535 and so on
snob() same number of bits,
returns the lowest integer bigger than n with the same number of set bits as n
int snob(int n) {
int a=n&-n, b=a+n;
return b|(n^b)/a>>2;
}
calling snob n times will work
int nth(int n){
int o =1023;
for(int i=0;i<n;i++)o=snob(o);
return o;
}
example
https://ideone.com/ikGNo7
Is there some way to find it faster?
I found one pattern but not sure if it's useful.
using factorial you can find the "indexes" where all 10 set bits are consecutive
1023 << x = the (x+10)! / (x! * 10!) - 1 th integer
1023<<1 is the 10th
1023<<2 is the 65th
1023<<3 the 285th
...
Btw I'm not a student and this is not homework.
EDIT:
Found an alternative to snob()
https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
int lnbp(int v){
int t = (v | (v - 1)) + 1;
return t | ((((t & -t) / (v & -v)) >> 1) - 1);
}
I have built an implementation that should satisfy your needs.
/** A lookup table to see how many combinations preceded this one */
private static int[][] LOOKUP_TABLE_COMBINATION_POS;
/** The number of possible combinations with i bits */
private static int[] NBR_COMBINATIONS;
static {
LOOKUP_TABLE_COMBINATION_POS = new int[Integer.SIZE][Integer.SIZE];
for (int bit = 0; bit < Integer.SIZE; bit++) {
// Ignore less significant bits, compute how many combinations have to be
// visited to set this bit, i.e.
// (bit = 4, pos = 5), before came 0b1XXX and 0b1XXXX, that's C(3, 3) + C(4, 3)
int nbrBefore = 0;
// The nth-bit can be only encountered after pos n
for (int pos = bit; pos < Integer.SIZE; pos++) {
LOOKUP_TABLE_COMBINATION_POS[bit][pos] = nbrBefore;
nbrBefore += nChooseK(pos, bit);
}
}
NBR_COMBINATIONS = new int[Integer.SIZE + 1];
for (int bits = 0; bits < NBR_COMBINATIONS.length; bits++) {
NBR_COMBINATIONS[bits] = nChooseK(Integer.SIZE, bits);
assert NBR_COMBINATIONS[bits] > 0; // Important for modulo check. Otherwise we must use unsigned arithmetic
}
}
private static int nChooseK(int n, int k) {
assert k >= 0 && k <= n;
if (k > n / 2) {
k = n - k;
}
long nCk = 1; // (N choose 0)
for (int i = 0; i < k; i++) {
// (N choose K+1) = (N choose K) * (n-k) / (k+1);
nCk *= (n - i);
nCk /= (i + 1);
}
return (int) nCk;
}
public static int nextCombination(int w, int n) {
// TODO: maybe for small n just advance naively
// Get the position of the current pattern w
int nbrBits = 0;
int position = 0;
while (w != 0) {
final int currentBit = Integer.lowestOneBit(w); // w & -w;
final int bitPos = Integer.numberOfTrailingZeros(currentBit);
position += LOOKUP_TABLE_COMBINATION_POS[nbrBits][bitPos];
// toggle off bit
w ^= currentBit;
nbrBits++;
}
position += n;
// Wrapping, optional
position %= NBR_COMBINATIONS[nbrBits];
// And reverse lookup
int v = 0;
int m = Integer.SIZE - 1;
while (nbrBits-- > 0) {
final int[] bitPositions = LOOKUP_TABLE_COMBINATION_POS[nbrBits];
// Search for largest bitPos such that position >= bitPositions[bitPos]
while (Integer.compareUnsigned(position, bitPositions[m]) < 0)
m--;
position -= bitPositions[m];
v ^= (0b1 << m--);
}
return v;
}
Now for some explanation. LOOKUP_TABLE_COMBINATION_POS[bit][pos] is the core of the algorithm and what makes it as fast as it is. The table is designed so that a bit pattern with k bits at positions p_0 < p_1 < ... < p_{k - 1} has a position of \sum_{i = 0}^{k - 1} LOOKUP_TABLE_COMBINATION_POS[i][p_i].
The intuition is that we try to move the bits back one by one until we reach the pattern where all bits are at the lowest possible positions. Moving the i-th bit from position k + 1 to k moves back by C(k-1, i-1) positions, provided that all lower bits are at the right-most position (no moving bits into or through each other), since we skip over all possible combinations of the i-1 bits in k-1 slots.
We can thus "decode" a bit pattern to a position, keeping track of the bits encountered. We then advance by n positions (rolling over in case we enumerated all possible positions for k bits) and encode this position again.
To encode a pattern, we reverse the process. For this, we move bits from their starting position forward, as long as the position is smaller than what we're aiming for. Instead of a linear search through LOOKUP_TABLE_COMBINATION_POS, we could employ a binary search for our target index m, but it's hardly needed since an int is not big. Nevertheless, we reuse our invariant that a smaller bit must also come at a less significant position, so that our algorithm is effectively O(n) where n = Integer.SIZE.
I remain with the following assertions to show the resulting algorithm:
nextCombination(0b1111111111, 1) == 0b10111111111;
nextCombination(0b1111111111, 10) == 0b11111111110;
nextCombination(0x00FF , 4) == 0x01EF;
nextCombination(0x7FFFFFFF , 4) == 0xF7FFFFFF;
nextCombination(0x03FF , 10) == 0x07FE;
// Correct wrapping
nextCombination(0b1 , 32) == 0b1;
nextCombination(0x7FFFFFFF , 32) == 0x7FFFFFFF;
nextCombination(0xFFFFFFEF , 5) == 0x7FFFFFFF;
Let us consider the numbers with k=10 bits set.
The trick is to determine the rank of the most significant one, for a given n.
There is a single number of length at most k: C(k, k)=1. There are k+1 = C(k+1, k) numbers of length at most k + 1. ... There are C(m, k) numbers of length at most m.
For k=10, the limits for n are the partial sums of 1 + 10 + 55 + 220 + 715 + 2002 + 5005 + 11440 + ...
For a given n, you easily find the corresponding m. Then the problem is reduced to finding the n - C(m, k)-th number with k - 1 bits set. And so on recursively.
With precomputed tables, this can be very fast. 30045015 takes 30 lookups, so that I guess that the worst case is 29 x 30 / 2 = 435 lookups.
(This is based on linear lookups, to favor small values. By means of dichotomic search, you reduce this to less than 29 x lg(30) = 145 lookups at worst.)
Update:
My previous estimates were pessimistic. Indeed, as we are looking for k bits, there are only 10 determinations of m. In the linear case, at worst 245 lookups; in the dichotomic case, less than 50.
(I don't exclude off-by-one errors in the estimates, but clearly this method is very efficient and requires no snob.)
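The table-based idea above corresponds to unranking in the combinatorial number system: for each remaining bit count k, pick the largest position m with C(m, k) <= n, set that bit, and subtract C(m, k) from n. Below is a minimal C sketch of it; the names binom, init_binom and nth_with_k_bits are hypothetical, and it uses plain linear lookups rather than the dichotomic search mentioned above.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BITS 32

/* binom[m][k] = C(m, k); entries with k > m stay 0 because static storage
   is zero-initialised. */
static uint64_t binom[MAX_BITS + 1][MAX_BITS + 1];

static void init_binom(void) {
    if (binom[0][0] != 0) return; /* table already built */
    for (int m = 0; m <= MAX_BITS; m++) {
        binom[m][0] = 1;
        for (int k = 1; k <= m; k++)
            binom[m][k] = binom[m - 1][k - 1] + binom[m - 1][k];
    }
}

/* Returns the n-th (0-based) smallest integer with exactly k set bits.
   Assumes 0 <= k <= 32 and n < C(32, k). */
uint32_t nth_with_k_bits(uint64_t n, int k) {
    assert(k >= 0 && k <= MAX_BITS);
    init_binom();
    assert(n < binom[MAX_BITS][k]);
    uint32_t result = 0;
    int m = MAX_BITS - 1; /* highest candidate bit position */
    for (; k > 0; k--) {
        /* Largest m with C(m, k) <= n; stops because C(k - 1, k) == 0. */
        while (binom[m][k] > n) m--;
        result |= (uint32_t)1 << m;
        n -= binom[m][k];
        m--; /* the remaining bits sit at strictly lower positions */
    }
    return result;
}
```

With k = 10 this reproduces the values quoted in the question: index 0 gives 1023, index 1 gives 1535, and index 10 gives 1023 << 1.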

dot product using cblas is slow

I want to calculate the product A^T*A (A is a 2000x1000 matrix), and I only need the upper triangular part of the result. In the inner loop I have to compute the dot product of two vectors.
Now, here is the problem: using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (Using an Intel Core(TM) i7 CPU M620 @ 2.67GHz, 1.92GB RAM.)
The problem is caused essentially by the matrix size, not by ddot. Your matrices are so large that they do not fit in cache memory. The solution is to rearrange the three nested loops so that as much work as possible is done with a line while it is in cache, which reduces cache refreshes. A model implementation follows for both the ddot and a daxpy approach. On my computer the ratio of running times was about 15:1 in favor of the daxpy version.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that need for cache refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t1 = clock();
printf("Time for DOT : %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
t0 = t1;
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
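The effect of the loop order can also be shown without BLAS. The sketch below is a plain C illustration with hypothetical function names (ata_strided, ata_unit_stride), operating on row-major arrays passed as flat pointers. Both functions compute the same A^T * A; only the memory access pattern differs.

```c
#include <assert.h>
#include <stddef.h>

/* c (cols x cols) = a^T * a, with a stored row-major as rows x cols.
   "Row times column" order: the inner loop walks down two columns of a,
   touching memory with a stride of cols doubles. */
void ata_strided(size_t rows, size_t cols, const double *a, double *c) {
    assert(a != NULL && c != NULL);
    for (size_t i = 0; i < cols; i++) {
        for (size_t j = 0; j < cols; j++) {
            double sum = 0.0;
            for (size_t r = 0; r < rows; r++)
                sum += a[r * cols + i] * a[r * cols + j];
            c[i * cols + j] = sum;
        }
    }
}

/* Same result, but the inner loop runs along one row of a and one row of c,
   so both are read and written with unit stride (the daxpy pattern). */
void ata_unit_stride(size_t rows, size_t cols, const double *a, double *c) {
    assert(a != NULL && c != NULL);
    for (size_t i = 0; i < cols; i++) {
        double *ci = c + i * cols;
        for (size_t j = 0; j < cols; j++) ci[j] = 0.0;
        for (size_t r = 0; r < rows; r++) {
            const double aij = a[r * cols + i];
            const double *ar = a + r * cols;
            for (size_t j = 0; j < cols; j++) ci[j] += aij * ar[j];
        }
    }
}
```

How large the speed difference is depends on the machine and matrix size; this sketch makes no timing claim of its own.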
The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
i.e. just a loop unrolled to a stride of 5.
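For readers who don't read Fortran, the same unrolling pattern looks roughly like this in C. This is a sketch under assumptions, not the reference implementation: ddot_unrolled is a hypothetical name, only unit-stride vectors are handled, and the leftover n mod 5 elements are summed first before the unrolled loop takes over.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two unit-stride vectors, unrolled by 5. */
double ddot_unrolled(size_t n, const double *x, const double *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    double sum = 0.0;
    const size_t m = n % 5;
    /* Clean-up loop for the elements that don't fill a group of 5. */
    for (size_t i = 0; i < m; i++)
        sum += x[i] * y[i];
    /* Main loop: n - m is a multiple of 5, so i + 4 stays below n. */
    for (size_t i = m; i < n; i += 5)
        sum += x[i] * y[i] + x[i + 1] * y[i + 1] + x[i + 2] * y[i + 2]
             + x[i + 3] * y[i + 3] + x[i + 4] * y[i + 4];
    return sum;
}
```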
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2);
__m128d dtemp;
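if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
/* odd n: seed the accumulator with the last element pair (index n-1) */
dtemp = _mm_set_sd(x[n-1] * y[n-1]);
}
for(int i=0; i<n2; i+=2) {
/* unaligned loads: x+i and y+i are not guaranteed to be 16-byte aligned */
__m128d x1 = _mm_loadu_pd(x+i);
__m128d y1 = _mm_loadu_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
/* unaligned store: result[] has no alignment guarantee either */
_mm_storeu_pd(&result[0],dtemp);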
return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may not be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about 1 order of magnitude faster than the 3 loop code (but also 1 order of magnitude slower than a cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL.)