OpenACC red-black Gauss-Seidel slower than CPU - gpu

I added OpenACC directives to my red-black Gauss-Seidel solver for the Laplace equation (a simple heated plate problem), but the GPU-accelerated code is no faster than the CPU, even for large problems.
I also wrote a CUDA version, and that is much faster than both (for 512x512, on the order of 2 seconds compared to 25 for CPU and OpenACC).
Can anyone think of a reason for this discrepancy? I realize that CUDA offers the most potential speed, but OpenACC should give something better than the CPU for larger problems (like the Jacobi solver for the same sort of problem demonstrated here).
Here is the relevant code (the full working source is here):
#pragma acc data copyin(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size]) copy(temp_red[0:size_temp], temp_black[0:size_temp])
// red-black Gauss-Seidel with SOR iteration loop
for (iter = 1; iter <= it_max; ++iter) {
Real norm_L2 = 0.0;
// update red cells
#pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
#pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
#pragma acc loop independent gang vector(4)
for (int col = 1; col < NUM + 1; ++col) {
#pragma acc loop independent gang vector(64)
for (int row = 1; row < (NUM / 2) + 1; ++row) {
int ind_red = col * ((NUM / 2) + 2) + row; // local (red) index
int ind = 2 * row - (col % 2) - 1 + NUM * (col - 1); // global index
#pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])
Real res = b[ind] + (aW[ind] * temp_black[row + (col - 1) * ((NUM / 2) + 2)]
+ aE[ind] * temp_black[row + (col + 1) * ((NUM / 2) + 2)]
+ aS[ind] * temp_black[row - (col % 2) + col * ((NUM / 2) + 2)]
+ aN[ind] * temp_black[row + ((col + 1) % 2) + col * ((NUM / 2) + 2)]);
Real temp_old = temp_red[ind_red];
temp_red[ind_red] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);
// calculate residual
res = temp_red[ind_red] - temp_old;
norm_L2 += (res * res);
} // end for row
} // end for col
// update black cells
#pragma omp parallel for shared(aP, aW, aE, aS, aN, temp_black, temp_red) \
#pragma acc kernels present(aP[0:size], aW[0:size], aE[0:size], aS[0:size], aN[0:size], b[0:size], temp_red[0:size_temp], temp_black[0:size_temp])
#pragma acc loop independent gang vector(4)
for (int col = 1; col < NUM + 1; ++col) {
#pragma acc loop independent gang vector(64)
for (int row = 1; row < (NUM / 2) + 1; ++row) {
int ind_black = col * ((NUM / 2) + 2) + row; // local (black) index
int ind = 2 * row - ((col + 1) % 2) - 1 + NUM * (col - 1); // global index
#pragma acc cache(aP[ind], b[ind], aW[ind], aE[ind], aS[ind], aN[ind])
Real res = b[ind] + (aW[ind] * temp_red[row + (col - 1) * ((NUM / 2) + 2)]
+ aE[ind] * temp_red[row + (col + 1) * ((NUM / 2) + 2)]
+ aS[ind] * temp_red[row - ((col + 1) % 2) + col * ((NUM / 2) + 2)]
+ aN[ind] * temp_red[row + (col % 2) + col * ((NUM / 2) + 2)]);
Real temp_old = temp_black[ind_black];
temp_black[ind_black] = temp_old * (1.0 - omega) + omega * (res / aP[ind]);
// calculate residual
res = temp_black[ind_black] - temp_old;
norm_L2 += (res * res);
} // end for row
} // end for col
// calculate residual
norm_L2 = sqrt(norm_L2 / ((Real)size));
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);
// if tolerance has been reached, end SOR iterations
if (norm_L2 < tol) {

Alright, I found a partial solution that reduces the run time considerably, at least for smaller problems.
If I insert the line:
acc_set_device_num(0, acc_device_nvidia);
before I start my timer, so that the GPU is selected and initialized ahead of time, the time for the 512x512 problem drops to 9.8 seconds, and to 42 seconds for 1024x1024. Increasing the problem size further shows how fast even OpenACC can be compared to running on four CPU cores.
With this change, the OpenACC code is on the order of 2x slower than the CUDA code, with the gap narrowing to about 1.2x as the problem size grows.
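A minimal sketch of the timing idea above, assuming the one-time device start-up is what inflated the measurement. The names warm_up_accelerator, sample_work and time_excluding_startup are hypothetical and not part of the original source; the acc_init call is an optional addition to the acc_set_device_num call used above. The OpenACC calls are guarded by the standard _OPENACC macro so the file still builds with a plain C compiler.

```c
#include <assert.h>
#include <time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Selects and initialises the device up front.
   Returns 1 when built with OpenACC, 0 otherwise. */
int warm_up_accelerator(void)
{
#ifdef _OPENACC
    acc_set_device_num(0, acc_device_nvidia); /* same call as in the post */
    acc_init(acc_device_nvidia);              /* assumption: force start-up now */
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical stand-in for the solver; replace with the real work. */
void sample_work(void)
{
    volatile double sum = 0.0;
    for (int i = 0; i < 100000; i++) {
        sum += (double)i;
    }
}

/* Wall-clock seconds spent in work(), with device start-up excluded. */
double time_excluding_startup(void (*work)(void))
{
    struct timespec start, end;

    warm_up_accelerator();            /* done before the timer starts */
    timespec_get(&start, TIME_UTC);   /* C11 wall-clock time */
    work();
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling time_excluding_startup(sample_work) returns the elapsed seconds for the work alone; timing without the warm-up call would fold the device start-up into the result.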

I downloaded your full code, then compiled and ran it. The run did not stop, and for the statement
if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, norm_L2);
the result was:
100, nan
200, nan
I changed all variables with type Real into type float and the result was:
100, 0.000654
200, 0.000370
..., ....
..., ....
8800, 0.000002
8900, 0.000002
9000, 0.000001
9100, 0.000001
9200, 0.000001
9300, 0.000001
9400, 0.000001
9500, 0.000001
9600, 0.000001
9700, 0.000001
Iterations: 9796
Total time: 5.594017 s
With NUM = 1024 the result was:
Iterations: 27271
Total time: 25.949905 s


How do I get the complexity of bilinear/nearest neighbour interpolation algorithm? (calculate the big O)

I want to calculate the big O of the following algorithms for resizing binary images:
Bilinear interpolation:
double scale_x = (double)new_height/(height-1);
double scale_y = (double)new_width/(width-1);
for (int i = 0; i < new_height; i++) {
    int ii = i / scale_x;
    for (int j = 0; j < new_width; j++) {
        int jj = j / scale_y;
        double v00 = matrix[ii][jj], v01 = matrix[ii][jj + 1],
               v10 = matrix[ii + 1][jj], v11 = matrix[ii + 1][jj + 1];
        double fi = i / scale_x - ii, fj = j / scale_y - jj;
        double temp = (1 - fi) * ((1 - fj) * v00 + fj * v01) +
                      fi * ((1 - fj) * v10 + fj * v11);
        // threshold the interpolated value back to a binary pixel
        if (temp >= 0.5)
            result[i][j] = 1;
        else
            result[i][j] = 0;
    }
}
Nearest neighbour interpolation
double scale_x = (double)height/new_height;
double scale_y = (double)width/new_width;
for (int i = 0; i < new_height; i++) {
    int srcx = floor(i * scale_x);
    for (int j = 0; j < new_width; j++) {
        int srcy = floor(j * scale_y);
        result[i][j] = matrix[srcx][srcy];
    }
}
I assumed that the complexity of both of them is given by the loop dimensions, i.e. O(new_height*new_width). However, the bilinear interpolation surely runs much slower than the nearest neighbour. Could you please explain how to correctly compute the complexity?
They are both running in Theta(new_height*new_width) time because except for the loop iterations all operations are constant time.
This doesn't in any way imply that the two programs will execute equally fast. It merely means that if you increase new_height and/or new_width to infinity, the ratio of execution time between the two programs will neither go to infinity nor to zero.
(This is making the assumption that the integer types are unbounded and that all arithmetic operations are constant time operations independent of the length of the operands. Otherwise there will be another relevant factor accounting for the cost of the arithmetic.)
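A rough sketch that makes the point concrete, assuming images stored as flat row-major arrays of 0/1 bytes (resize_nearest and resize_bilinear are hypothetical names, not the asker's functions). Both routines return how many output pixels they wrote: the counts are identical, and only the constant amount of work per pixel differs.

```c
#include <assert.h>
#include <stddef.h>

/* Nearest neighbour: one array read per output pixel. */
size_t resize_nearest(const unsigned char *src, int h, int w,
                      unsigned char *dst, int new_h, int new_w)
{
    size_t pixels = 0;
    for (int i = 0; i < new_h; i++) {
        int si = (int)((long long)i * h / new_h);     /* floor(i * h / new_h) */
        for (int j = 0; j < new_w; j++) {
            int sj = (int)((long long)j * w / new_w); /* floor(j * w / new_w) */
            dst[(size_t)i * new_w + j] = src[(size_t)si * w + sj];
            pixels++;
        }
    }
    return pixels;
}

/* Bilinear: four reads and a handful of multiplications per output pixel. */
size_t resize_bilinear(const unsigned char *src, int h, int w,
                       unsigned char *dst, int new_h, int new_w)
{
    assert(h >= 2 && w >= 2);                 /* needs a 2x2 neighbourhood */
    double step_i = (double)(h - 1) / new_h;  /* source rows per output row */
    double step_j = (double)(w - 1) / new_w;
    size_t pixels = 0;
    for (int i = 0; i < new_h; i++) {
        double pos_i = i * step_i;
        int ii = (int)pos_i;                  /* upper source row */
        double fi = pos_i - ii;               /* fractional part */
        for (int j = 0; j < new_w; j++) {
            double pos_j = j * step_j;
            int jj = (int)pos_j;
            double fj = pos_j - jj;
            double v00 = src[(size_t)ii * w + jj];
            double v01 = src[(size_t)ii * w + jj + 1];
            double v10 = src[(size_t)(ii + 1) * w + jj];
            double v11 = src[(size_t)(ii + 1) * w + jj + 1];
            double t = (1 - fi) * ((1 - fj) * v00 + fj * v01)
                     + fi * ((1 - fj) * v10 + fj * v11);
            dst[(size_t)i * new_w + j] = (t >= 0.5) ? 1 : 0;
            pixels++;
        }
    }
    return pixels;
}
```

Doubling new_height doubles the pixel count of both functions, so their running-time ratio stays roughly constant even though the bilinear version is slower in absolute terms.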

how to do 3d sum using openmp

I am new to OpenMP. I am having trouble with a 3D sum and I don't know how to improve my code. Here is the code I want to parallelize with OpenMP; my aim is to speed up the calculation of this 3D sum. What should I add to my code according to the rules of OpenMP?
I added #pragma omp parallel for reduction(+:integral) to my code, but the compiler reports an error saying that the initialization of the 'for' statement is not correct. The error message was posted as a screenshot; my IDE is in Chinese. I use Visual Studio 2019.
int main()
{
    double a = 0.3291;
    double d_title = 2.414;
    double b = 3.8037;
    double c = 4086;
    double nu_start = 0;
    double mu_start = 0;
    double z_start = 0;
    double step_nu = 2 * 3.1415926 / 100;
    double step_mu = 3.1415926 / 100;
    double step_z = 0;
    double nu = 0;
    double mu = 0;
    double z = 0;
    double integral=0;
    double d_uv = 0;
    int i = 0;
    int j = 0;
    int k = 0;
    #pragma omp parallel for default(none) shared(a, d_title, b, c, nu_start, mu_start, z_start, step_nu, step_mu) private( j,k,mu, nu, step_z, z, d_uv) reduction(+:integral)
    for (i = 0; i < 100; i++) {
        mu = mu_start + (i + 1) * step_mu;
        for (j = 0; j < 100; j++) {
            nu = nu_start + (j + 1) * step_nu;
            for (k = 0; k < 500; k++) {
                d_uv = (sin(mu) * sin(mu) * cos(nu) * cos(nu) + sin(mu) * sin(mu) * (a * sin(nu) - d_title * cos(nu)) * (a * sin(nu) - d_title * cos(nu)) + b * b * cos(mu) * cos(mu)) / (c * c);
                step_z = 20 / (d_uv * 500);
                z = z_start + (k + 1) * step_z;
                integral = integral + sin(mu) * (1 - 3 * sin(mu) * sin(mu) * cos(nu) * cos(nu)) * exp(-d_uv * z) * log(1 + z * z) * step_z * step_mu * step_nu;
            }
        }
    }
    double out = 0;
    out = integral / (c * c);
    return 0;
}
Solutions (UPDATE: this is an answer to the original question):
To do the least typing, you just have to add the following line before for (int i = ...):
#pragma omp parallel for private( mu, nu, step_z, z, d_uv) reduction(+:integral)
Here you declare which variables have to be private to avoid data races. Note that variables declared outside the parallel region are shared by default, so integral would also be shared; since every thread updates it, that would be a data race. There are two ways to avoid it: use an atomic operation or, much better, use a reduction (add the reduction(+:integral) clause).
Since you mentioned that you are a beginner in OpenMP, I recommend using the default(none) clause in the #pragma omp parallel for directive, so that you have to define the sharing attributes explicitly. If you forget a variable you will get a compile error, which forces you to consider every variable used in the parallel region and think about possible data races:
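To illustrate the two options for the shared accumulator, here is a small sketch on a toy sum rather than the integral above (sum_with_atomic and sum_with_reduction are hypothetical names). The terms are whole numbers so that both versions give exactly the same result regardless of the order in which threads add them; without OpenMP enabled the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Option 1: every update of the shared total is made atomic.
   Correct, but the threads contend on one memory location. */
double sum_with_atomic(int n)
{
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double term = (double)i;   /* declared in the loop, so private */
        #pragma omp atomic
        total += term;
    }
    return total;
}

/* Option 2: each thread accumulates a private copy of total,
   and OpenMP combines the copies once at the end of the loop. */
double sum_with_reduction(int n)
{
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += (double)i;
    }
    return total;
}
```

The reduction version usually scales better because the threads only synchronise once, when the private partial sums are merged.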
#pragma omp parallel for default(none) shared(a, d_title, b, c, nu_start, mu_start, z_start, step_nu, step_mu) private( mu, nu, step_z, z, d_uv) reduction(+:integral)
Generally, it is recommended to declare variables in the smallest scope that needs them: variables declared inside the parallelized for loop are automatically private. In that case you only have to add #pragma omp parallel for reduction(+:integral) before your outermost for loop, so your code will be:
#pragma omp parallel for reduction(+:integral)
for (int i = 0; i < 100; i++) {
    double mu = mu_start + (i + 1) * step_mu;
    for (int j = 0; j < 100; j++) {
        //int id = omp_get_thread_num();
        double nu = nu_start + (j + 1) * step_nu;
        for (int k = 0; k < 500; k++) {
            double d_uv = (sin(mu) * sin(mu) * cos(nu) * cos(nu) + sin(mu) * sin(mu) * (a * sin(nu) - d_title * cos(nu)) * (a * sin(nu) - d_title * cos(nu)) + b * b * cos(mu) * cos(mu)) / (c * c);
            double step_z = 20 / (d_uv * 500);
            double z = z_start + (k + 1) * step_z;
            //int id = omp_get_thread_num();
            integral = integral + sin(mu) * (1 - 3 * sin(mu) * sin(mu) * cos(nu) * cos(nu)) * exp(-d_uv * z) * log(1 + z * z) * step_z * step_mu * step_nu;
        }
    }
}
Runtimes: 44 ms (1 thread) and 11 ms (4 threads) on my computer (g++ -O3 -mavx2 -fopenmp).

Finding out the complexity of given program

I'm trying to find out the complexity of the given program. Suppose we have:
int a = 0;
for (i = 0; i < n; i++) {
    for (j = n; j > i; j--) {
        a = a + i + j;
    }
}
Complexity: O(N*N)
The inner statement runs a total of
N + (N - 1) + (N - 2) + ... + 1
= N * (N + 1) / 2
= 1/2 * N^2 + 1/2 * N
times, which is O(N^2).
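The count can be checked by instrumenting the loops. This is only a sketch: count_inner_iterations is a hypothetical helper that keeps the loop bounds from the question and counts executions of the inner statement instead of computing a.

```c
#include <assert.h>

/* Counts how often the inner statement of the nested loops runs. */
long count_inner_iterations(long n)
{
    long count = 0;
    for (long i = 0; i < n; i++) {
        /* for a fixed i the inner loop runs n - i times */
        for (long j = n; j > i; j--) {
            count++;
        }
    }
    return count;
}
```

For n = 10 this returns 55, which matches N * (N + 1) / 2 = 10 * 11 / 2.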

Solve: T(n) = T(n/2) + n/2 + 1

I struggle to define the running time for the following algorithm in O notation. My first guess was O(n), but the gap between the iterations and the number I apply isn't steady. How have I incorrectly defined this?
public int function (int n ) {
    if ( n == 0) {
        return 0;
    }
    int i = 1;
    int j = n ;
    while ( i < j ) {
        i = i + 1;
        j = j - 1;
    }
    return function ( i - 1) + 1;
}
The while loop executes about n/2 times.
The recursive call receives a value that is about half of the original n, so the loop counts are:
n/2 (first call)
n/4 (second call, equal to (n/2)/2)
This is a geometric series.
In fact, the total can be written as
n * (1/2 + 1/4 + 1/8 + 1/16 + ...)
which converges to n * 1 = n.
So the running time is O(n).
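Another approach is to write it down as T(n) = T(n/2) + n/2 + 1.
The while loop does n/2 work. The argument passed to the next call is n/2.
Solving this using the master theorem, where:
a = 1
b = 2
f(n) = n/2 + 1
Since log_b(a) = log_2(1) = 0, f(n) grows polynomially faster than n^0, so case 3 applies provided the regularity condition a*f(n/b) <= c*f(n) holds for some c < 1 and all sufficiently large n.
Let c = 0.9:
1*f(n/2) <? c*f(n)
1*(n/4 + 1) <? 0.9*(n/2 + 1)
0.25n + 1 <? 0.45n + 0.9
0 < 0.2n - 0.1
This holds for all n > 0.5, which gives:
T(n) = Θ(f(n)) = Θ(n)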
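The linear bound can also be checked empirically. The sketch below is a C port of the method from the question with a step counter added; function_counted, total_loop_steps and the loop_steps counter are hypothetical names introduced only for this illustration.

```c
#include <assert.h>

static long loop_steps;   /* total while-loop iterations over all calls */

/* Same logic as the method in the question, plus counting. */
int function_counted(int n)
{
    if (n == 0) {
        return 0;
    }
    int i = 1;
    int j = n;
    while (i < j) {
        i = i + 1;
        j = j - 1;
        loop_steps++;
    }
    /* at this point i - 1 equals n / 2 (integer division) */
    return function_counted(i - 1) + 1;
}

/* Resets the counter, runs the recursion, and reports the total work. */
long total_loop_steps(int n)
{
    loop_steps = 0;
    function_counted(n);
    return loop_steps;
}
```

For n = 8 the loop runs 4 + 2 + 1 + 0 = 7 times in total, and for n = 10 it runs 5 + 2 + 1 + 0 = 8 times, so the total stays below n, consistent with Θ(n).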

how to zoom mandelbrot set

I have successfully implemented the Mandelbrot set as described in the Wikipedia article, but I do not know how to zoom into a specific section. This is the code I am using:
+(void)createSetWithWidth:(int)width Height:(int)height Thing:(void(^)(int, int, int, int))thing
{
    for (int i = 0; i < height; ++i) {
        for (int j = 0; j < width; ++j) {
            double x0 = ((4.0f * (i - (height / 2))) / (height)) - 0.0f;
            double y0 = ((4.0f * (j - (width / 2))) / (width)) + 0.0f;
            double x = 0.0f;
            double y = 0.0f;
            int iteration = 0;
            int max_iteration = 15;
            while ((((x * x) + (y * y)) <= 4.0f) && (iteration < max_iteration)) {
                double xtemp = ((x * x) - (y * y)) + x0;
                y = ((2.0f * x) * y) + y0;
                x = xtemp;
                iteration += 1;
            }
            thing(j, i, iteration, max_iteration);
        }
    }
}
It was my understanding that x0 should be in the range -2.5 to 1 and y0 should be in the range -1 to 1, and that reducing those ranges would zoom, but that didn't really work at all. How can I zoom?
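Suppose the center of the region is (cx, cy) and the lengths you want to display are (lx, ly). You can then use the following scaling formula (with i running over height and j over width, as in your loops, and the division done in floating point):
x0 = cx + ((double)i / height - 0.5) * lx;
y0 = cy + ((double)j / width - 0.5) * ly;
This first scales the pixel index down to the unit interval (0 <= i/height < 1), then shifts it so that it is centered on zero (-0.5 <= i/height - 0.5 < 0.5), then scales it up to the desired size (-0.5*lx <= (i/height - 0.5)*lx < 0.5*lx). Finally, it shifts the result to the center you gave. Zooming in then amounts to reducing lx and ly while keeping (cx, cy) fixed.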
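A minimal C sketch of this mapping, under the assumption that px counts pixels horizontally and py vertically (PlanePoint, pixel_to_plane and escape_iterations are hypothetical names, not part of the code in the question). The escape loop is the same iteration as in the question, moved into a function.

```c
#include <assert.h>

typedef struct {
    double re;   /* x0 in the question */
    double im;   /* y0 in the question */
} PlanePoint;

/* Maps pixel (px, py) of a width x height image onto the window of
   size lx x ly centered on (cx, cy). Smaller lx, ly means more zoom. */
PlanePoint pixel_to_plane(int px, int py, int width, int height,
                          double cx, double cy, double lx, double ly)
{
    PlanePoint p;
    /* the casts avoid integer division, which would always yield 0 */
    p.re = cx + ((double)px / width  - 0.5) * lx;
    p.im = cy + ((double)py / height - 0.5) * ly;
    return p;
}

/* Escape-time iteration count for the point (x0, y0). */
int escape_iterations(double x0, double y0, int max_iteration)
{
    double x = 0.0;
    double y = 0.0;
    int iteration = 0;
    while (x * x + y * y <= 4.0 && iteration < max_iteration) {
        double xtemp = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xtemp;
        iteration += 1;
    }
    return iteration;
}
```

For example, pixel_to_plane(px, py, width, height, -0.75, 0.1, 0.5, 0.5) would show a 0.5 x 0.5 window around (-0.75, 0.1); halving lx and ly zooms in by a factor of two.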
First off, with a max_iteration of 15, you're not going to see much detail. Mine uses 1000 iterations per point as a baseline, and can go to about 8000 iterations before it really gets too slow to wait for.
this might help:
this too: