I am a freshman in openmp. I have some trouble in a 3d sum, and I don't know how to improve my code. Here's the code I want to improve in openmp. My aim is to speed up the calculation of this 3d sum. What should I add in my code according to the rules of openmp?
I add #pragma omp parallel for reduction(+:integral) in my code. But an error happens which says the initialization of 'for' is not correct. This is the information of this error:enter image description here I am a chinese, so the language of my IDE is chinese. I use Visual Studio 2019.
#include<omp.h>
#include<stdio.h>
#include<math.h>
int main()
{
double a = 0.3291;
double d_title = 2.414;
double b = 3.8037;
double c = 4086;
double nu_start = 0;
double mu_start = 0;
double z_start = 0;
double step_nu = 2 * 3.1415926 / 100;
double step_mu = 3.1415926 / 100;
double step_z = 0;
double nu = 0;
double mu = 0;
double z = 0;
double integral=0;
double d_uv = 0;
int i = 0;
int j = 0;
int k = 0;
#pragma omp parallel for default(none) shared(a, d_title, b, c, nu_start, mu_start, z_start, step_nu, step_mu) private( j,k,mu, nu, step_z, z, d_uv) reduction(+:integral)
for (i = 0; i < 100; i++)
{
mu = mu_start + (i + 1) * step_mu;
for (j = 0; j < 100; j++)
{
nu = nu_start + (j + 1) * step_nu;
for (k = 0; k < 500; k++)
{
d_uv = (sin(mu) * sin(mu) * cos(nu) * cos(nu) + sin(mu) * sin(mu) * (a * sin(nu) - d_title * cos(nu)) * (a * sin(nu) - d_title * cos(nu)) + b * b * cos(mu) * cos(mu)) / (c * c);
step_z = 20 / (d_uv * 500);
z = z_start + (k + 1) * step_z;
integral = integral + sin(mu) * (1 - 3 * sin(mu) * sin(mu) * cos(nu) * cos(nu)) * exp(-d_uv * z) * log(1 + z * z) * step_z * step_mu * step_nu;
}
}
}
double out = 0;
out = integral / (c * c);
return 0;
}
Solutions (UPDATE: It is an answer to the original question:)
To do the least typing you just have to add the following line before for(int i=..)
#pragma omp parallel for private( mu, nu, step_z, z, d_uv) reduction(+:integral)
Here you define which variables have to be private to avoid data race. Note that variables are shared by default, so variable integral also shared, but all threads update its value, which is a data race. To avoid it, you have 2 possibilities: use atomic operation, or a much better option is to use use reduction (add reduction(+:integral) clause).
As you mentioned that you are beginner in OpenMP it is recommended to use default(none) clause in the #pragma omp parallel for directive, so you have to explicitly define sharing attributes. If you forget a variable you will get an error, so you have to consider all variables involved in your parallel region and can think about possible data races:
#pragma omp parallel for default(none) shared(a, d_title, b, c, nu_start, mu_start, z_start, step_nu, step_mu) private( mu, nu, step_z, z, d_uv) reduction(+:integral)
Generally, it is recommended to define your variables in their minimum required scope, so variables defined inside the for loop to parallelize will be private. In this case you just have to add #pragma omp parallel for reduction(+:integral) before your outermost for loop, so your code will be:
#pragma omp parallel for reduction(+:integral)
for (int i = 0; i < 100; i++)
{
double mu = mu_start + (i + 1) * step_mu;
for (int j = 0; j < 100; j++)
{
//int id = omp_get_thread_num();
double nu = nu_start + (j + 1) * step_nu;
for (int k = 0; k < 500; k++)
{
double d_uv = (sin(mu) * sin(mu) * cos(nu) * cos(nu) + sin(mu) * sin(mu) * (a * sin(nu) - d_title * cos(nu)) * (a * sin(nu) - d_title * cos(nu)) + b * b * cos(mu) * cos(mu)) / (c * c);
double step_z = 20 / (d_uv * 500);
double z = z_start + (k + 1) * step_z;
//int id = omp_get_thread_num();
integral = integral + sin(mu) * (1 - 3 * sin(mu) * sin(mu) * cos(nu) * cos(nu)) * exp(-d_uv * z) * log(1 + z * z) * step_z * step_mu * step_nu;
}
}
}
Runtimes: 44 ms (1 thread) and 11 ms (4 threads) on my computer (g++ -O3 -mavx2 -fopenmp).
Related
I want to calculate the big O of the following algorithms for resizing binary images:
Bilinear interpolation:
double scale_x = (double)new_height/(height-1);
double scale_y = (double)new_width/(width-1);
for (int i = 0; i < new_height; i++)
{
int ii = i / scale_x;
for (int j = 0; j < new_width; j++)
{
int jj = j / scale_y;
double v00 = matrix[ii][jj], v01 = matrix[ii][jj + 1],
v10 = matrix[ii + 1][jj], v11 = matrix[ii + 1][jj + 1];
double fi = i / scale_x - ii, fj = j / scale_y - jj;
double temp = (1 - fi) * ((1 - fj) * v00 + fj * v01) +
fi * ((1 - fj) * v10 + fj * v11);
if (temp >= 0.5)
result[i][j] = 1;
else
result[i][j] = 0;
}
}
Nearest neighbour interpolation
double scale_x = (double)height/new_height;
double scale_y = (double)width/new_width;
for (int i = 0; i < new_height; i++)
{
int srcx = floor(i * scale_x);
for (int j = 0; j < new_width; j++)
{
int srcy = floor(j * scale_y);
result[i][j] = matrix[srcx][srcy];
}
}
I assumed that the complexity of both of them is the loop dimensions, i.e O(new_height*new_width). However, the bilinear interpolation surely works much slower than the nearest neighbour. Could you please explain how to correctly compute complexity?
They are both running in Theta(new_height*new_width) time because except for the loop iterations all operations are constant time.
This doesn't in any way imply that the two programs will execute equally fast. It merely means that if you increase new_height and/or new_width to infinity, the ratio of execution time between the two programs will neither go to infinity nor to zero.
(This is making the assumption that the integer types are unbounded and that all arithmetic operations are constant time operations independent of the length of the operands. Otherwise there will be another relevant factor accounting for the cost of the arithmetic.)
I am a beginner in doing GPU programming with OpenACC. I was trying to do a direct convolution. Convolution consists of 6 nested loops. I only want the first loop to be parallelized. I gave the pragma #pragma acc loop for the first loop and #pragma acc loop seq for the rest. But the output that I am getting is not correct. Is the approach taken by me to parallelize the loop correct ? Specifications for the convolution: Input channels-3, Input Size- 224X224X3, Output channels- 64, Output Size- 111X111X64, filter size- 3X3X3X64. Following is the link to the header files dog.h and squeezenet_params.h. https://drive.google.com/drive/folders/1a9XRjBTrEFIorrLTPFHS4atBOPrG886i
# include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"
void conv3x3(
const int input_channels, const int input_size,
const int pad, const int stride, const int start_channel,
const int output_size, const float* restrict input_im, const float* restrict filter_weight,
const float* restrict filter_bias, float* restrict output_im){
#pragma acc data copyin (input_im[0:150527],filter_weight[0:1727],filter_bias[0:63]) copyout(output_im[0:788543])
{
#pragma acc parallel
{
#pragma acc loop
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
output_im += (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop seq
for(int i = 0; i < output_size; i++)
{
#pragma acc loop seq
for(int j = 0; j < output_size; j++)
{
//compute one element in the output feature map
float tmp = bias;
//compute dot product of 2 input_channels x 3 x 3 matrix
#pragma acc loop seq
for(int k = 0; k < input_channels; k++)
{
#pragma acc loop seq
for(int l = 0; l < 3; l++)
{
int h = i * stride + l - pad;
#pragma acc loop seq
for(int m = 0; m < 3; m++)
{
int w = j * stride + m - pad;
if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
{
tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
* filter_weight[9 * k + 3 * l + m];
}
}
}
}
//add relu activation after conv
output_im[i * output_size + j] = (tmp > 0.0) ? tmp : 0.0;
}
}
}
}
}
}
void main(){
float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
for(int i=0;i<64 * 111 * 111;++i){
//if(result[i]>0)
printf("%f:%d\n",result[i],i);
}
}
The contributor posted the same question on the PGI User Forums where I've answered. (See: https://www.pgroup.com/userforum/viewtopic.php?f=4&t=7614). The topic question is incorrect in that the inner loops are not getting parallelized nor are the cause of the issue.
The problem here is that the code has a race condition on the shared "output_im" pointer. My suggested solution is to compute a per thread offset into the array rather than trying to manipulate the pointer itself.
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
int offset;
offset = (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop vector collapse(2)
for(int i = 0; i < output_size; i++)
{
for(int j = 0; j < output_size; j++)
{
... cut ...
}
}
//add relu activation after conv
int idx = offset + (i * output_size + j);
output_im[idx] = (tmp > 0.0) ? tmp : 0.0;
}
}
I know this question is very noob. I am trying to understand how the pointer thing works. I studied basics of C but still did not understand this.
Given this piece of function:
+ (void)nv21ToRgbWithWidth:(unsigned int)width height:(unsigned int)height yuyv:(unsigned char *)yuyv rgb:(unsigned char *)rgb
{
const int nv_start = width * height ;
UInt32 i, j, index = 0, rgb_index = 0;
UInt8 y, u, v;
int r, g, b, nv_index = 0;
for(i = 0; i < height ; i++)
{
for(j = 0; j < width; j ++){
//nv_index = (rgb_index / 2 - width / 2 * ((i + 1) / 2)) * 2;
nv_index = i / 2 * width + j - j % 2;
y = yuyv[rgb_index];
u = yuyv[nv_start + nv_index ];
v = yuyv[nv_start + nv_index + 1];
r = y + (140 * (v-128))/100; //r
g = y - (34 * (u-128))/100 - (71 * (v-128))/100; //g
b = y + (177 * (u-128))/100; //b
if(r > 255) r = 255;
if(g > 255) g = 255;
if(b > 255) b = 255;
if(r < 0) r = 0;
if(g < 0) g = 0;
if(b < 0) b = 0;
index = rgb_index % width + (height - i - 1) * width;
rgb[index * 3+0] = b;
rgb[index * 3+1] = g;
rgb[index * 3+2] = r;
rgb_index++;
}
}
}
How am I suppose to know how the unsigned char * for rgb should be initialized before passing in to the function?
I tried calling the function like this:
unsigned char *rgb = NULL;
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
But the the program crashes on this line:
rgb[index * 3+0] = b;
I see rgb was initialized with NULL, so you can't assign values. So, I thought of initializing an array and pass it to pointer rgb like this:
unsigned char rgbArr[10000];
unsigned char *rgb = rgbArr;
but the function still crashes. I really don't know how should I pass the rgb parameter in this function. Please help me understand this.
The expected size in bytes seems to be at least height*width*3; it might be that allocating such an array as a local variable (as you do with unsigned char rgbArr[10000]) exceeds a stack limit; The program likely crashes in such a case. I'd try to use the heap instead:
unsigned char* rgb = malloc(imageHeight*imageWidth*3);
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
...
free(rgb);
That is what the malloc(), calloc(), realloc() and free() functions are for. Don't forget to use the free() function to prevent memory leaks... I hope that helps.
I have the following code, but in this line of code I have warning x[i] = (rhs[i] - x[i - 1]) / b;, compiler is telling me that rhs[i] is a garbage value. Why it's happend? And how to remove this warning?
double* getFirstControlPoints(double* rhs, const int n) {
double *x;
x = (double*)malloc(n * sizeof(double));
double *tmp; // Temp workspace.
tmp = (double*)malloc(n * sizeof(double));
double b = 2.0;
x[0] = rhs[0] / b;
for (int i = 1; i < n; i++) // Decomposition and forward substitution.
{
tmp[i] = 1 / b;
b = (i < n - 1 ? 4.0 : 3.5) - tmp[i];
x[i] = (rhs[i] - x[i - 1]) / b; //The left operand of '-' is a garbage value
}
for (int i = 1; i < n; i++) {
x[n - i - 1] -= tmp[n - i] * x[n - i]; // Backsubstitution.
}
free(tmp);
return x;
}
All compiler warnings and calling getFirstControlPoints you may see on screenshots.
You need a check to make sure you have at least 4 points in the points array because this loop (line 333):
for (NSInteger i = 1 ; i < n - 1 ; ++i) {
// initialisation stuff
}
will not execute at all for n = 0, 1, 2.
Assume that points has 3 objects in it, At line 311 you set n to the count - 1 i.e. n == 2
Then the loop condition is i < 2 - 1 i.e. i < 1.
I think you need the loop condition to be i < n
if points.count is 0 or 1 you are facing some problems, because then, n is -1 or 0, and you access rhs[n-1]; and you malloc n* bytes;
maybe that can be the problem. that you put some rangechecks int o the code?
I have successfully implemented the mandelbrot set as described in the wikipedia article, but I do not know how to zoom into a specific section. This is the code I am using:
+(void)createSetWithWidth:(int)width Height:(int)height Thing:(void(^)(int, int, int, int))thing
{
for (int i = 0; i < height; ++i)
for (int j = 0; j < width; ++j)
{
double x0 = ((4.0f * (i - (height / 2))) / (height)) - 0.0f;
double y0 = ((4.0f * (j - (width / 2))) / (width)) + 0.0f;
double x = 0.0f;
double y = 0.0f;
int iteration = 0;
int max_iteration = 15;
while ((((x * x) + (y * y)) <= 4.0f) && (iteration < max_iteration))
{
double xtemp = ((x * x) - (y * y)) + x0;
y = ((2.0f * x) * y) + y0;
x = xtemp;
iteration += 1;
}
thing(j, i, iteration, max_iteration);
}
}
It was my understanding that x0 should be in the range -2.5 - 1 and y0 should be in the range -1 - 1, and that reducing that number would zoom, but that didnt really work at all. How can I zoom?
Suppose the center is the (cx, cy) and the length you want to display is (lx, ly), you can use the following scaling formula:
x0 = cx + (i/width - 0.5)*lx;
y0 = cy + (j/width - 0.5)*ly;
What it does is to first scale down the pixel to the unit interval (0 <= i/width < 1), then shift the center (-0.5 <= i/width-0.5 < 0.5), scale up to your desired dimension (-0.5*lx <= (i/width-0.5)*lx < 0.5*lx). Finally, shift it to the center you given.
first off, with a max_iteration of 15, you're not going to see much detail. mine has 1000 iterations per point as a baseline, and can go to about 8000 iterations before it really gets too slow to wait for.
this might help: http://jc.unternet.net/src/java/com/jcomeau/Mandelbrot.java
this too: http://www.wikihow.com/Plot-the-Mandelbrot-Set-By-Hand