OpenCL work items running in parallel - optimization

I am asking about speed, i.e. how to optimize the code. The kernel performs Sobel edge detection on a grayscale image. When I run the program without any processing (it only shows the input video and an output identical to the input), I get 70 frames per second; with the Sobel processing on the GPU kernel, it drops to 20 fps.
Does anyone have an idea of how to speed up this code? I used local memory instead of global memory, but the change is small.
How can I make all work items process the image?
Sobel kernel:
__kernel void hello_kernel(const __global uchar *input, __global uchar *output, const uint width, const uint height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int w = (int)width;
    int index = w * y + x;

    // Skip the one-pixel border so the 3x3 neighbourhood stays in bounds
    // (this was the intent of the commented-out index check).
    if (x < 1 || x >= w - 1 || y < 1 || y >= (int)height - 1)
        return;

    // Horizontal gradient (Sobel X): -1 0 +1 / -2 0 +2 / -1 0 +1
    float sobelX = -1.0f*input[index - 1 - w] + 1.0f*input[index + 1 - w]
                 -  2.0f*input[index - 1]     + 2.0f*input[index + 1]
                 -  1.0f*input[index - 1 + w] + 1.0f*input[index + 1 + w];

    // Vertical gradient (Sobel Y): -1 -2 -1 / 0 0 0 / +1 +2 +1
    float sobelY = -1.0f*input[index - 1 - w] - 2.0f*input[index - w] - 1.0f*input[index + 1 - w]
                 +  1.0f*input[index - 1 + w] + 2.0f*input[index + w] + 1.0f*input[index + 1 + w];

    // Gradient magnitude, saturated to 0..255 on store (a plain float-to-uchar
    // conversion would wrap instead of clamping).
    output[index] = convert_uchar_sat(sqrt(sobelX * sobelX + sobelY * sobelY));
}
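On the question of making all work items process the image: launch the kernel with a two-dimensional NDRange whose global size covers the whole image, rounded up to a multiple of the work-group size, and let the bounds check in the kernel reject the padding. A minimal host-side sketch, assuming queue and kernel are an already-created cl_command_queue and cl_kernel (the helper name is illustrative):

#include <CL/cl.h>

// Hypothetical helper: enqueue the Sobel kernel with one work-item per pixel.
cl_int enqueueSobel(cl_command_queue queue, cl_kernel kernel,
                    size_t width, size_t height)
{
    size_t local[2]  = {16, 16};   // work-group size; worth tuning per device
    // Round the global size up to a multiple of the work-group size so the
    // NDRange covers every pixel; the kernel's bounds check skips the padding.
    size_t global[2] = {
        (width  + local[0] - 1) / local[0] * local[0],
        (height + local[1] - 1) / local[1] * local[1]
    };
    return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                  global, local, 0, NULL, NULL);
}

Beyond that, the kernel is memory-bound: each output pixel reads several neighbours from __global memory, so reading the image through an image2d_t sampler or staging a work-group tile (with its halo) in __local memory is usually where any remaining speed-up comes from, although as you note the gain depends on the device.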

Related

How to pass a pointer argument to a function without knowing the size to be allocated for that pointer

I know this is a very basic question. I am trying to understand how pointers work. I studied the basics of C but still do not understand this.
Given this function:
+ (void)nv21ToRgbWithWidth:(unsigned int)width height:(unsigned int)height yuyv:(unsigned char *)yuyv rgb:(unsigned char *)rgb
{
    const int nv_start = width * height;   // interleaved VU plane starts after the Y plane
    UInt32 i, j, index = 0, rgb_index = 0;
    UInt8 y, u, v;
    int r, g, b, nv_index = 0;

    for (i = 0; i < height; i++)
    {
        for (j = 0; j < width; j++) {
            //nv_index = (rgb_index / 2 - width / 2 * ((i + 1) / 2)) * 2;
            nv_index = i / 2 * width + j - j % 2;   // each 2x2 pixel block shares one UV pair

            y = yuyv[rgb_index];
            u = yuyv[nv_start + nv_index];
            v = yuyv[nv_start + nv_index + 1];

            // integer approximation of the YUV -> RGB conversion
            r = y + (140 * (v - 128)) / 100;
            g = y - (34 * (u - 128)) / 100 - (71 * (v - 128)) / 100;
            b = y + (177 * (u - 128)) / 100;

            if (r > 255) r = 255;
            if (g > 255) g = 255;
            if (b > 255) b = 255;
            if (r < 0) r = 0;
            if (g < 0) g = 0;
            if (b < 0) b = 0;

            // write bottom-up, in BGR order
            index = rgb_index % width + (height - i - 1) * width;
            rgb[index * 3 + 0] = b;
            rgb[index * 3 + 1] = g;
            rgb[index * 3 + 2] = r;
            rgb_index++;
        }
    }
}
How am I supposed to know how the unsigned char * for rgb should be initialized before passing it into the function?
I tried calling the function like this:
unsigned char *rgb = NULL;
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
But the program crashes on this line:
rgb[index * 3+0] = b;
I see rgb was initialized with NULL, so you can't assign values through it. So I thought of initializing an array and passing it to the pointer rgb like this:
unsigned char rgbArr[10000];
unsigned char *rgb = rgbArr;
but the function still crashes. I really don't know how I should pass the rgb parameter to this function. Please help me understand this.
The expected size in bytes is at least height*width*3, so a 10000-byte array is almost certainly too small. In addition, allocating a large array as a local variable (as you do with unsigned char rgbArr[10000]) can exceed the stack limit; the program will likely crash in such a case. I'd use the heap instead:
unsigned char* rgb = malloc(imageHeight*imageWidth*3);
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
...
free(rgb);
That is what the malloc(), calloc(), realloc() and free() functions are for. Don't forget to call free() when you are done, to prevent memory leaks. I hope that helps.
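For completeness, a small sketch of how both buffer sizes relate to the frame dimensions (plain C; the dimensions here are illustrative): an NV21 frame is a full-resolution Y plane followed by a half-resolution interleaved VU plane, i.e. width*height*3/2 bytes in total, while the RGB output needs width*height*3 bytes.

#include <stdlib.h>

unsigned int imageWidth = 640, imageHeight = 480;   /* example dimensions */

/* NV21 input: Y plane (width*height bytes) followed by an interleaved
   VU plane (width*height/2 bytes). */
unsigned char *yuyvValues = malloc(imageWidth * imageHeight * 3 / 2);

/* RGB output: 3 bytes (B, G, R in this function) per pixel. */
unsigned char *rgb = malloc(imageWidth * imageHeight * 3);

/* ... fill yuyvValues, run the conversion, use rgb ... */

free(yuyvValues);
free(rgb);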

Path Tracing - Generate Camera Rays with a Left Handed coordinate system

I've been having some issues implementing a camera for my renderer. As the question states, I would like to know the necessary steps to generate such a camera, with field of view and aspect ratio included. It's important that the coordinate system be left-handed, such that -z pushes the camera away from the screen (as I understand it). I have tried looking online, but most of the implementations are incomplete or have failed me. Any help is appreciated. Thank you.
I had trouble with this too and it took me a long time to figure out. Here is the code for my camera class.
#ifndef CAMERA_H_
#define CAMERA_H_

#include "common.h"

struct Camera {
    Vec3fa position, direction;
    float fovDist, aspectRatio;
    double imgWidth, imgHeight;
    Mat4 camMatrix;

    // cDir is currently unused; direction is derived from the rotation cRot.
    Camera(Vec3fa pos, Vec3fa cRot, Vec3fa cDir, float cfov, int width, int height) {
        position = pos;
        aspectRatio = width / (float)height;
        imgWidth = width;
        imgHeight = height;

        // Negate z so the rotation matches the left-handed convention.
        Vec3fa angle = Vec3fa(cRot.x, cRot.y, -cRot.z);
        camMatrix.setRotationRadians(angle * M_PI / 180.0f);

        // The default view direction is down -z; rotate it by the camera matrix.
        direction = Vec3fa(0.0f, 0.0f, -1.0f);
        camMatrix.rotateVect(direction);

        // Image-plane scale derived from the vertical field of view.
        fovDist = 2.0f * tan(M_PI * 0.5f * cfov / 180.0);
    }

    // x and y are expected as normalized screen coordinates in [0, 1].
    Vec3fa getRayDirection(float x, float y) {
        Vec3fa delta = Vec3fa((x - 0.5f) * fovDist * aspectRatio,
                              (y - 0.5f) * fovDist, 0.0f);
        camMatrix.rotateVect(delta);
        return (direction + delta);
    }
};

#endif
In case you need it, here is the rotateVect() code from the Mat4 class:
void Mat4::rotateVect(Vector3& vect) const
{
    Vector3 tmp = vect;
    vect.x = tmp.x * (*this)[0] + tmp.y * (*this)[4] + tmp.z * (*this)[8];
    vect.y = tmp.x * (*this)[1] + tmp.y * (*this)[5] + tmp.z * (*this)[9];
    vect.z = tmp.x * (*this)[2] + tmp.y * (*this)[6] + tmp.z * (*this)[10];
}
Here is our setRotationRadians code:
void Mat4::setRotationRadians(Vector3 rotation)
{
    const float cr = cos(rotation.x);
    const float sr = sin(rotation.x);
    const float cp = cos(rotation.y);
    const float sp = sin(rotation.y);
    const float cy = cos(rotation.z);
    const float sy = sin(rotation.z);

    (*this)[0] = (cp * cy);
    (*this)[1] = (cp * sy);
    (*this)[2] = (-sp);

    const float srsp = sr * sp;
    const float crsp = cr * sp;

    (*this)[4] = (srsp * cy - cr * sy);
    (*this)[5] = (srsp * sy + cr * cy);
    (*this)[6] = (sr * cp);

    (*this)[8] = (crsp * cy + sr * sy);
    (*this)[9] = (crsp * sy - sr * cy);
    (*this)[10] = (cr * cp);
}
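For completeness, a minimal usage sketch: loop over pixels, map each to normalized [0, 1] coordinates, and trace along the returned direction. The normalize() call, the resolution, and the y-flip are my assumptions, not part of the class above.

// Hypothetical render loop: one primary ray through each pixel center.
Camera cam(Vec3fa(0.0f, 0.0f, 0.0f),    // position
           Vec3fa(0.0f, 0.0f, 0.0f),    // rotation in degrees
           Vec3fa(0.0f, 0.0f, -1.0f),   // cDir (recomputed by the constructor)
           60.0f, 800, 600);            // vertical fov, width, height

for (int py = 0; py < 600; ++py) {
    for (int px = 0; px < 800; ++px) {
        float x = (px + 0.5f) / 800.0f;          // normalized [0, 1] coordinates
        float y = 1.0f - (py + 0.5f) / 600.0f;   // flip y so the image isn't inverted
        Vec3fa dir = normalize(cam.getRayDirection(x, y));
        // trace(cam.position, dir) ...
    }
}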

OpenCL kernel doesn't finish executing

I am writing a simple Monte Carlo code to simulate electron scattering. I ran the kernel for 10 million electrons and it runs fine, but when I increase the number of electrons to a higher number, say 50 million, the code just wouldn't finish and the computer freezes. I wanted to know if this is a hardware issue or if there is a possible bug in the code. I am running the code on an iMac with an ATI Radeon HD 5870.
int rand_r (unsigned int seed)
{
    unsigned int next = seed;
    int result;

    next *= 1103515245;
    next += 12345;
    result = (unsigned int) (next / 65536) % 2048;

    next *= 1103515245;
    next += 12345;
    result <<= 10;
    result ^= (unsigned int) (next / 65536) % 1024;

    next *= 1103515245;
    next += 12345;
    result <<= 10;
    result ^= (unsigned int) (next / 65536) % 1024;

    // Note: 'seed' is passed by value, so this assignment never reaches the
    // caller; the kernel below re-seeds from the returned value instead.
    seed = next;
    return result;
}
__kernel void MC(const float E, __global float* bse, const int count) {
    int tx = get_global_id(0);
    int ty = get_global_id(1);

    float RAND_MAX = 2147483647.0f;
    int rand_seed;
    int seed = count*ty + tx;       // unique per-work-item starting seed
    float rand;

    float PI = 3.14159f;
    float z = 28.0f;                // atomic number (nickel)
    float rho = 8.908f;             // density
    float A = 58.69f;               // atomic weight

    int num = 10000000/(count*count);   // electrons per work-item (total hard-coded)
    int counter = 0, counter1, counter2;

    float4 c_new, r_new;
    float E_new, alpha, de_ds, phi, psi, mfp, sig_eNA, step, dsq, dsqi, absc0z;
    float J = (9.76f*z + 58.5f*powr(z,-0.19f))*1E-3f;   // mean ionization potential

    float4 r0 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    float2 tilt = (float2)((70.0f/180.0f)*PI, 0.0f);
    float4 c0 = (float4)(cos(tilt.y)*sin(tilt.x), sin(tilt.y)*sin(tilt.x), cos(tilt.x), 0.0f);
    for (int i = 0; i < num; ++i){
        rand_seed = rand_r(seed);
        seed = rand_seed;
        rand = rand_seed/RAND_MAX;   // uniform random number from the GPU-side generator

        r0 = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
        c0 = (float4)(cos(tilt.y)*sin(tilt.x), sin(tilt.y)*sin(tilt.x), cos(tilt.x), 0.0f);
        E_new = E;
        c_new = c0;

        alpha = (3.4E-3f)*powr(z,0.67f)/E_new;
        sig_eNA = (5.21f * 602.3f)*((z*z)/(E_new*E_new))*((4.0f*PI)/(alpha*(1+alpha)))*((E_new + 511.0f)*(E_new + 511.0f)/((E_new + 1024.0f)*(E_new + 1024.0f)));
        mfp = A/(rho*sig_eNA);
        step = -mfp * log(rand);

        r_new = (float4)(r0.x + step*c_new.x, r0.y + step*c_new.y, r0.z + step*c_new.z, 0.0f);
        r0 = r_new;
        counter1 = 0;
        counter2 = 0;

        while (counter1 < 1000){
            alpha = (3.4E-3f)*powr(z,0.67f)/E_new;
            sig_eNA = (5.21f * 602.3f)*((z*z)/(E_new*E_new))*((4*PI)/(alpha*(1+alpha)))*((E_new + 511.0f)*(E_new + 511.0f)/((E_new + 1024.0f)*(E_new + 1024.0f)));
            mfp = A/(rho*sig_eNA);

            rand_seed = rand_r(seed);
            seed = rand_seed;
            rand = rand_seed/RAND_MAX;   // second random number
            step = -mfp * log(rand);

            de_ds = -78500.0f*(z/(A*E_new)) * log((1.66f*(E_new + 0.85f*J))/J);

            rand_seed = rand_r(seed);
            seed = rand_seed;
            rand = rand_seed/RAND_MAX;   // third random number
            phi = acos(1 - ((2*alpha*rand)/(1 + alpha - rand)));

            rand_seed = rand_r(seed);
            seed = rand_seed;
            rand = rand_seed/RAND_MAX;   // fourth random number
            psi = 2*PI*rand;

            if ((c0.z >= 0.999f) || (c0.z <= -0.999f)){
                absc0z = fabs(c0.z);   // fabs(): abs() is integer-only in OpenCL C
                c_new = (float4)(sin(phi) * cos(psi), sin(phi) * sin(psi), (c0.z/absc0z)*cos(phi), 0.0f);
            }
            else {
                dsq = sqrt(1-c0.z*c0.z);
                dsqi = 1/dsq;
                c_new = (float4)(sin(phi)*(c0.x*c0.z*cos(psi) - c0.y*sin(psi))*dsqi + c0.x*cos(phi),
                                 sin(phi)*(c0.y*c0.z*cos(psi) + c0.x*sin(psi))*dsqi + c0.y*cos(phi),
                                 -sin(phi)*cos(psi)*dsq + c0.z*cos(phi), 0.0f);
            }

            r_new = (float4)(r0.x + step*c_new.x, r0.y + step*c_new.y, r0.z + step*c_new.z, 0.0f);
            r0 = r_new;
            c0 = c_new;
            E_new += step*rho*de_ds;

            // count each electron at most once when it leaves through the surface
            if (r0.z <= 0 && counter2 == 0){
                counter++;
                counter2 = 1;
            }
            counter1++;
        }
    }
    bse[count*ty + tx] = counter;
}
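One possibility worth ruling out (an assumption on my part, not something verifiable from the code alone): on a GPU that also drives the display, the OS watchdog resets kernels that run longer than a few seconds, and a five-times-larger electron count may simply push a single launch past that limit, which would match the freeze. A common workaround is to split the work into several shorter launches; a host-side sketch, where the chunk argument is hypothetical and would need a matching kernel change:

#include <CL/cl.h>

// Hypothetical host loop: run the simulation as several short launches so
// each one stays under the display watchdog limit.
cl_int runInChunks(cl_command_queue queue, cl_kernel kernel,
                   size_t count, int numChunks)
{
    size_t global[2] = {count, count};
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        // Pass the chunk index as an extra (hypothetical) kernel argument so
        // each launch simulates only its share of the electrons.
        cl_int err = clSetKernelArg(kernel, 3, sizeof(int), &chunk);
        if (err != CL_SUCCESS) return err;
        err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                     global, NULL, 0, NULL, NULL);
        if (err != CL_SUCCESS) return err;
        err = clFinish(queue);   // block until this chunk completes
        if (err != CL_SUCCESS) return err;
    }
    return CL_SUCCESS;
}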

Maya-like camera implementation

I am working on a Maya-like camera implementation, and I've done the track and dolly functions correctly, but I just cannot implement tumble.
I am working in the PhiloGL engine (WebGL based), so I would really appreciate some help with code in this engine.
I've looked at how Maya's camera actually works, but I cannot figure it out. Here is my code so far:
if (mode == "rot")
{
    var angleX = diffx / 150;
    var angleY = diffy / 150;
    //var angleZ = sign * Math.sqrt((diffx * diffx) + (diffy * diffy)) / 150;
    e.stop();

    //axis Z
    //camera.position.x = x * Math.cos(angleX) - y * Math.sin(angleX);
    //camera.position.y = x * Math.sin(angleX) + y * Math.cos(angleX);

    //axis X
    //camera.position.y = y * Math.cos(angleY) - z * Math.sin(angleY);
    //camera.position.z = y * Math.sin(angleY) + z * Math.cos(angleY);
    //camera.update();

    //axis Y
    camera.position.z = z * Math.cos(angleX) - x * Math.sin(angleX);
    camera.position.x = z * Math.sin(angleX) + x * Math.cos(angleX);
    camera.update();

    position.x = e.x;
    position.y = e.y;
    position.z = e.z;
}
This isn't working, nor do I know what I am doing wrong.
Any clues?
I use this in inka3d (www.inka3d.com), but it does not depend on inka3d. The output is a 4x4 matrix. Can you make use of that?
// turntable-like camera, y is the up-vector
// tx, ty and tz are the camera target position
// rx, ry and rz are the camera rotation angles (rad)
// di is the camera distance from the target
// fr is an array the resulting view matrix is written into (16 values, row major)
control.cameraY = function(tx, ty, tz, rx, ry, rz, di, fr)
{
    var a = rx * 0.5;
    var b = ry * 0.5;
    var c = rz * 0.5;
    var d = Math.cos(a);
    var e = Math.sin(a);
    var f = Math.cos(b);
    var g = Math.sin(b);
    var h = Math.cos(c);
    var i = Math.sin(c);

    // quaternion components built from the half-angles
    var j = f * e * h + g * d * i;
    var k = f * -e * i + g * d * h;
    var l = f * d * i - g * e * h;
    var m = f * d * h - g * -e * i;

    var n = j * j;
    var o = k * k;
    var p = l * l;
    var q = m * m;
    var r = j * k;
    var s = k * l;
    var t = j * l;
    var u = m * j;
    var v = m * k;
    var w = m * l;

    // rotation matrix entries from the quaternion
    var x = q + n - o - p;
    var y = (r + w) * 2.0;
    var z = (t - v) * 2.0;
    var A = (r - w) * 2.0;
    var B = q - n + o - p;
    var C = (s + u) * 2.0;
    var D = (t + v) * 2.0;
    var E = (s - u) * 2.0;
    var F = q - n - o + p;

    // translation: back the camera off from the target along the view axis
    var G = di;
    var H = -(tx + D * G);
    var I = -(ty + E * G);
    var J = -(tz + F * G);

    fr[0] = x;
    fr[1] = A;
    fr[2] = D;
    fr[3] = 0.0;
    fr[4] = y;
    fr[5] = B;
    fr[6] = E;
    fr[7] = 0.0;
    fr[8] = z;
    fr[9] = C;
    fr[10] = F;
    fr[11] = 0.0;
    fr[12] = x * H + y * I + z * J;
    fr[13] = A * H + B * I + C * J;
    fr[14] = D * H + E * I + F * J;
    fr[15] = 1.0;
};
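A sketch of how it might be driven from mouse input (mouseDX/mouseDY and the sensitivity are illustrative, not part of inka3d or PhiloGL):

var viewMatrix = new Array(16);
var rotX = 0, rotY = 0, dist = 10;

// hypothetical per-drag handler: accumulate deltas into tumble angles
function onDrag(mouseDX, mouseDY) {
    rotY += mouseDX * 0.01;            // horizontal drag orbits around y
    rotX += mouseDY * 0.01;            // vertical drag tilts up/down
    control.cameraY(0, 0, 0,           // orbit around the origin
                    rotX, rotY, 0,     // rotation angles in radians
                    dist, viewMatrix);
    // upload viewMatrix as the view matrix for the frame
}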

Return CATransform3D to map quadrilateral to quadrilateral

I'm trying to derive a CATransform3D that will map a quad with 4 corner points to another quad with 4 new corner points. I've spent a bit of time researching this, and it seems the steps involve converting the original quad to a square and then converting that square to the new quad. My methods look like this (code borrowed from here):
- (CATransform3D)quadFromSquare_x0:(float)x0 y0:(float)y0 x1:(float)x1 y1:(float)y1 x2:(float)x2 y2:(float)y2 x3:(float)x3 y3:(float)y3 {
    float dx1 = x1 - x2, dy1 = y1 - y2;
    float dx2 = x3 - x2, dy2 = y3 - y2;
    float sx = x0 - x1 + x2 - x3;
    float sy = y0 - y1 + y2 - y3;

    float g = (sx * dy2 - dx2 * sy) / (dx1 * dy2 - dx2 * dy1);
    float h = (dx1 * sy - sx * dy1) / (dx1 * dy2 - dx2 * dy1);
    float a = x1 - x0 + g * x1;
    float b = x3 - x0 + h * x3;
    float c = x0;
    float d = y1 - y0 + g * y1;
    float e = y3 - y0 + h * y3;
    float f = y0;

    CATransform3D mat;
    mat.m11 = a;
    mat.m12 = b;
    mat.m13 = 0;
    mat.m14 = c;
    mat.m21 = d;
    mat.m22 = e;
    mat.m23 = 0;
    mat.m24 = f;
    mat.m31 = 0;
    mat.m32 = 0;
    mat.m33 = 1;
    mat.m34 = 0;
    mat.m41 = g;
    mat.m42 = h;
    mat.m43 = 0;
    mat.m44 = 1;
    return mat;
}

- (CATransform3D)squareFromQuad_x0:(float)x0 y0:(float)y0 x1:(float)x1 y1:(float)y1 x2:(float)x2 y2:(float)y2 x3:(float)x3 y3:(float)y3 {
    CATransform3D mat = [self quadFromSquare_x0:x0 y0:y0 x1:x1 y1:y1 x2:x2 y2:y2 x3:x3 y3:y3];

    // invert through adjoint
    float a = mat.m11, d = mat.m21, /* ignore */ g = mat.m41;
    float b = mat.m12, e = mat.m22, /* 3rd col */ h = mat.m42;
    /* ignore 3rd row */
    float c = mat.m14, f = mat.m24;

    float A = e - f * h;
    float B = c * h - b;
    float C = b * f - c * e;
    float D = f * g - d;
    float E = a - c * g;
    float F = c * d - a * f;
    float G = d * h - e * g;
    float H = b * g - a * h;
    float I = a * e - b * d;

    // Probably unnecessary since 'I' is also scaled by the determinant,
    // and 'I' scales the homogeneous coordinate, which, in turn,
    // scales the X,Y coordinates.
    // Determinant = a * (e - f * h) + b * (f * g - d) + c * (d * h - e * g);
    float idet = 1.0f / (a * A + b * D + c * G);

    mat.m11 = A * idet; mat.m21 = D * idet; mat.m31 = 0; mat.m41 = G * idet;
    mat.m12 = B * idet; mat.m22 = E * idet; mat.m32 = 0; mat.m42 = H * idet;
    mat.m13 = 0;        mat.m23 = 0;        mat.m33 = 1; mat.m43 = 0;
    mat.m14 = C * idet; mat.m24 = F * idet; mat.m34 = 0; mat.m44 = I * idet;
    return mat;
}
After calculating both matrices, multiplying them together, and assigning to the view in question, I end up with a transformed view, but it is wildly incorrect. In fact, it seems to be sheared like a parallelogram no matter what I do. What am I missing?
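For reference, a sketch of the combination step described above (sx0...sy3 and dx0...dy3 stand for the source and destination corners; the names are illustrative, and view is the UIView being warped):

CATransform3D fromQuad = [self squareFromQuad_x0:sx0 y0:sy0 x1:sx1 y1:sy1
                                              x2:sx2 y2:sy2 x3:sx3 y3:sy3];
CATransform3D toQuad   = [self quadFromSquare_x0:dx0 y0:dy0 x1:dx1 y1:dy1
                                              x2:dx2 y2:dy2 x3:dx3 y3:dy3];
// CATransform3DConcat(a, b) applies a first, then b.
view.layer.transform = CATransform3DConcat(fromQuad, toQuad);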
UPDATE 2/1/12
It seems the reason I'm running into issues may be that I need to account for FOV and focal length in the model view matrix (which is the only matrix I can alter directly in Quartz). I'm not having any luck finding documentation online on how to calculate the proper matrix, though.
I was able to achieve this by porting and combining the quad warping and homography code from these two URLs:
http://forum.openframeworks.cc/index.php/topic,509.30.html
http://forum.openframeworks.cc/index.php?topic=3121.15
UPDATE: I've open sourced a small class that does this: https://github.com/dominikhofmann/DHWarpView