OpenCL AMD GPU compiler crash

I am working on a kernel that finds intersections between a ray and a list of triangles, but (there is always a "but") I am having trouble with my OpenCL compiler: it crashes when I try to compile the kernel.
The kernel compiles fine with the CPU compiler, but the GPU compiler crashes.
//-----------------------------------------------------------------------------
//---------------------------------DEFINES-------------------------------------
//-----------------------------------------------------------------------------
#define RAYON_SORTANT -1000
#define RAYON_ENTRANT 1000
#define MIN_LONGUEUR_RT 1.E-6f
//-----------------------------------------------------------------------------
//---------------------------------CONTENT-------------------------------------
//-----------------------------------------------------------------------------
typedef struct s_CDPoint
{
float x;
float y;
float z;
} CDPoint;
typedef struct s_TTriangle
{
CDPoint triangle_[3];
CDPoint normal_;
} TTriangle;
typedef struct s_GridIntersection
{
CDPoint pos_;
float distance_;
int sensNormale_;
unsigned int idTriangle_;
} TGridIntersection;
//-----------------------------------------------------------------------------
//---------------------------------MUTEX---------------------------------------
//-----------------------------------------------------------------------------
void GetSemaphor(__global int * semaphor)
{
int occupied = atomic_xchg(semaphor, 1);
while(occupied > 0)
{
occupied = atomic_xchg(semaphor, 1);
}
}
void ReleaseSemaphor(__global int * semaphor)
{
int prevVal = atomic_xchg(semaphor, 0);
}
//-----------------------------------------------------------------------------
//---------------------------------GEOMETRIE-----------------------------------
//-----------------------------------------------------------------------------
float dotProduct(const CDPoint* pA, const CDPoint* pB)
{
return (pA->x * pB->x + pA->y * pB->y + pA->z * pB->z);
}
CDPoint crossProduct(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->y * pB->z - pB->y * pA->z;
res.y = pA->z * pB->x - pB->z * pA->x;
res.z = pA->x * pB->y - pB->x * pA->y;
return res;
}
CDPoint soustraction(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->x - pB->x;
res.y = pA->y - pB->y;
res.z = pA->z - pB->z;
return res;
}
CDPoint addition(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->x + pB->x;
res.y = pA->y + pB->y;
res.z = pA->z + pB->z;
return res;
}
CDPoint homothetie(const CDPoint* pA, float val)
{
CDPoint pnt;
pnt.x = pA->x * val;
pnt.y = pA->y * val;
pnt.z = pA->z * val;
return pnt;
}
//-----------------------------------------------------------------------------
//---------------------------------KERNEL--------------------------------------
//-----------------------------------------------------------------------------
__kernel void IntersectionTriangle( __global const TTriangle* pTriangleListe,
const unsigned int idxDebutTriangle,
const unsigned int idxFin,
__constant const CDPoint* pPointOrigine,
__constant const CDPoint* pDir,
__global int *nbInter,
__global TGridIntersection* pResults )
{
__private unsigned int index = get_global_id(0) + idxDebutTriangle;
if (index > idxFin) return;
__global const TTriangle *pTriangle = &pTriangleListe[index];
__private float distance = 0.f;
// Côté du triangle et normale au plan
__private CDPoint edge1 = soustraction(&pTriangle->triangle_[1], &pTriangle->triangle_[0]);
__private CDPoint edge2 = soustraction(&pTriangle->triangle_[2], &pTriangle->triangle_[0]);
__private CDPoint pvec = crossProduct(pDir, &edge2); // produit vectoriel
// Le rayon et le triangle sont il parallèle ?
__private float det = dotProduct(&edge1, &pvec);
if (det == 0.f)
{
return ;
}
__private float inv_det = 1.f / det;
// Distance origin t0
__private CDPoint tvec = soustraction(pPointOrigine, &pTriangle->triangle_[0]);
//Calculate u parameter and test bound
__private float u = (dotProduct(&tvec, &pvec)) * inv_det;
//The intersection lies outside of the triangle
if (u < -MIN_LONGUEUR_RT
|| u > 1.f + MIN_LONGUEUR_RT)
{
return ;
}
u = max(u, 0.f);
//Prepare to test v parameter
__private CDPoint qvec = crossProduct(&tvec, &edge1);
//Calculate V parameter and test bound
__private float v = dotProduct(pDir, &qvec) * inv_det;
//The intersection lies outside of the triangle
if (v < -MIN_LONGUEUR_RT
|| u + v > 1.f + MIN_LONGUEUR_RT)
{
return ;
}
// Get distance
distance = dotProduct(&edge2, &qvec) * inv_det;
if (distance > -MIN_LONGUEUR_RT)
{
// We are using nbInter as semaphor index
GetSemaphor(nbInter);
__private int idxInter = *nbInter;
pResults[idxInter].distance_ = max(distance, 0.f);
// Intersection point
__private CDPoint vDir = homothetie(pDir, distance);
pResults[idxInter].pos_ = addition(pPointOrigine, &vDir);
// Get ray way
pResults[idxInter].sensNormale_ = dotProduct(&pTriangle->normal_, pDir) > 0.f ? RAYON_SORTANT : RAYON_ENTRANT;
// Triangle id
pResults[idxInter].idTriangle_ = index - idxDebutTriangle;
// inc nb inter
*nbInter = *nbInter + 1;
ReleaseSemaphor(nbInter);
}
}
I noticed that if I change "__global const TTriangle* pTriangleListe" to "const TTriangle* pTriangleListe" it compiles, but that is not the code I want!
What I want to do is store all the triangles in "pTriangleListe" and use a uniform grid to get the index range of the triangles to check (idxDebutTriangle/idxFin). "pPointOrigine" is the ray origin and "pDir" is the direction. "nbInter" and "pResults" are shared and will contain the intersections (they are protected by the semaphore).
Here is my OpenCL configuration:
Platform [0]
id = 5339E7D8
profile = FULL_PROFILE
version = OpenCL 1.2 AMD-APP (1445.5)
name = AMD Accelerated Parallel Processing
vendor = Advanced Micro Devices, Inc.
extensions = cl_khr_icd
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_dx9_media_sharing
cl_amd_event_callback
cl_amd_offline_devices
cl_amd_hsa
2 Devices detected
Device [0]
id = 010DFA00
type = CL_DEVICE_TYPE_GPU
name = Cedar
vendor = Advanced Micro Devices, Inc.
driver version = 1445.5 (VM)
device version = OpenCL 1.2 AMD-APP (1445.5)
profile = FULL_PROFILE
max compute units = 2
max work items dimensions = 3
max work item sizes = 128 / 128 / 128
max work group size = 128
max clock frequency = 650 MHz
address_bits = 32
max mem alloc size = 512 MB
global mem size = 1024 MB
image support = CL_TRUE
max read image args = 128
max write image args = 8
2D image max size = 16384 x 16384
3D image max size = 2048 x 2048 x 2048
max samplers = 16
max parameter size = 1024
mem base addr align = 2048
min data type align size = 128
single fp config = CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
global mem cache type = CL_NONE
max constant buffer size = 64 KB
max constant args = 8
local mem type = CL_LOCAL
local mem size = 32 KB
error correction support = CL_FALSE
profiling timer resolution = 1 ns
endian little = CL_TRUE
available = CL_TRUE
compiler available = CL_TRUE
execution capabilities = CL_EXEC_KERNEL
queue properties = CL_QUEUE_PROFILING_ENABLE
extensions = cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_atomic_counters_32
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_media_ops2
cl_amd_popcnt
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_dx9_media_sharing
cl_amd_image2d_from_buffer_read_only
cl_khr_spir
cl_khr_gl_event
Device [1]
id = 03501CD0
type = CL_DEVICE_TYPE_CPU
name = Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz
vendor = GenuineIntel
driver version = 1445.5 (sse2,avx)
device version = OpenCL 1.2 AMD-APP (1445.5)
profile = FULL_PROFILE
max compute units = 4
max work items dimensions = 3
max work item sizes = 1024 / 1024 / 1024
max work group size = 1024
max clock frequency = 3392 MHz
address_bits = 32
max mem alloc size = 1024 MB
global mem size = 2048 MB
image support = CL_TRUE
max read image args = 128
max write image args = 8
2D image max size = 8192 x 8192
3D image max size = 2048 x 2048 x 2048
max samplers = 16
max parameter size = 4096
mem base addr align = 1024
min data type align size = 128
single fp config = CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
global mem cache type = CL_READ_WRITE_CACHE
global mem cacheline size = 64
global mem cache size = 32768
max constant buffer size = 64 KB
max constant args = 8
local mem type = CL_GLOBAL
local mem size = 32 KB
error correction support = CL_FALSE
profiling timer resolution = 301 ns
endian little = CL_TRUE
available = CL_TRUE
compiler available = CL_TRUE
execution capabilities = CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
queue properties = CL_QUEUE_PROFILING_ENABLE
extensions = cl_khr_fp64
cl_amd_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_device_fission
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_media_ops2
cl_amd_popcnt
cl_khr_d3d10_sharing
cl_khr_spir
cl_amd_svm
cl_khr_gl_event
Thank you for reading!

Related

Binary operator '+=' cannot be applied to operands of type 'Int' and 'UInt8'

I'm translating Obj-C to Swift. As you can see, I declared let buf = UnsafeMutablePointer<UInt8>(CVPixelBufferGetBaseAddress(cvimgRef)), and I get this error in the for loop below it:
Binary operator '+=' cannot be applied to operands of type 'Int' and 'UInt8'
As a small addendum, I also don't know how to translate the remaining Obj-C code below the for loop. What does that slash mean, and how do I deal with the pointer? Do I have to say UnsafeMutableFloat somewhere?
// process the frame of video
func captureOutput(captureOutput:AVCaptureOutput, didOutputSampleBuffer sampleBuffer:CMSampleBuffer, fromConnection connection:AVCaptureConnection) {
// if we're paused don't do anything
if currentState == CurrentState.statePaused {
// reset our frame counter
self.validFrameCounter = 0
return
}
// this is the image buffer
var cvimgRef:CVImageBufferRef = CMSampleBufferGetImageBuffer(sampleBuffer)
// Lock the image buffer
CVPixelBufferLockBaseAddress(cvimgRef, 0)
// access the data
var width: size_t = CVPixelBufferGetWidth(cvimgRef)
var height:size_t = CVPixelBufferGetHeight(cvimgRef)
// get the raw image bytes
let buf = UnsafeMutablePointer<UInt8>(CVPixelBufferGetBaseAddress(cvimgRef))
var bprow: size_t = CVPixelBufferGetBytesPerRow(cvimgRef)
var r = 0
var g = 0
var b = 0
for var y = 0; y < height; y++ {
for var x = 0; x < width * 4; x += 4 {
b += buf[x]; g += buf[x + 1]; r += buf[x + 2] // error
}
buf += bprow() // error
}
Remaining Obj-C code.
r/=255*(float) (width*height);
g/=255*(float) (width*height);
b/=255*(float) (width*height);
You have several type mismatch errors.
The type of x should not be UInt8, because x has to count up to width * 4.
for var x:UInt8 = 0; x < width * 4; x += 4 { // error: '<' cannot be applied to operands of type 'UInt8' and 'Int'
So fix it like below:
for var x = 0; x < width * 4; x += 4 {
To advance the pointer address, you can use the advancedBy() function.
buf += bprow(UnsafeMutablePointer(UInt8)) // error: '+=' cannot be applied to operands of type 'UnsafeMutablePointer<UInt8>' and 'size_t'
Like below:
var pixel = buf.advancedBy(y * bprow)
And this line,
RGBtoHSV(r, g, b) // error
Unfortunately, Swift has no implicit conversion between Float and CGFloat, so you have to convert explicitly to CGFloat.
RGBtoHSV(CGFloat(r), g: CGFloat(g), b: CGFloat(b))
The whole edited code is here:
func RGBtoHSV(r: CGFloat, g: CGFloat, b: CGFloat) -> (h: CGFloat, s: CGFloat, v: CGFloat) {
var h: CGFloat = 0.0
var s: CGFloat = 0.0
var v: CGFloat = 0.0
let col = UIColor(red: r, green: g, blue: b, alpha: 1.0)
col.getHue(&h, saturation: &s, brightness: &v, alpha: nil)
return (h, s, v)
}
// process the frame of video
func captureOutput(captureOutput:AVCaptureOutput, didOutputSampleBuffer sampleBuffer:CMSampleBuffer, fromConnection connection:AVCaptureConnection) {
// if we're paused don't do anything
if currentState == CurrentState.statePaused {
// reset our frame counter
self.validFrameCounter = 0
return
}
// this is the image buffer
var cvimgRef = CMSampleBufferGetImageBuffer(sampleBuffer)
// Lock the image buffer
CVPixelBufferLockBaseAddress(cvimgRef, 0)
// access the data
var width = CVPixelBufferGetWidth(cvimgRef)
var height = CVPixelBufferGetHeight(cvimgRef)
// get the raw image bytes
let buf = UnsafeMutablePointer<UInt8>(CVPixelBufferGetBaseAddress(cvimgRef))
var bprow = CVPixelBufferGetBytesPerRow(cvimgRef)
var r: Float = 0.0
var g: Float = 0.0
var b: Float = 0.0
for var y = 0; y < height; y++ {
var pixel = buf.advancedBy(y * bprow)
for var x = 0; x < width * 4; x += 4 {
b += Float(pixel[x])
g += Float(pixel[x + 1])
r += Float(pixel[x + 2])
}
}
r /= 255 * Float(width * height)
g /= 255 * Float(width * height)
b /= 255 * Float(width * height)
// convert from rgb to hsv colourspace
var h: Float = 0.0
var s: Float = 0.0
var v: Float = 0.0
RGBtoHSV(CGFloat(r), g: CGFloat(g), b: CGFloat(b))
}
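Since the original was Obj-C, it may help to see the same loop in plain C, where the row stride and the "/=" step are explicit. This is only a sketch: the function name average_bgra and its parameters are invented for the illustration, and it assumes 4 bytes per pixel in B, G, R, A order, as the indexing in the question suggests.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Averages the B, G and R channels of a 4-bytes-per-pixel buffer into 0..1.
   bytes_per_row may be larger than width * 4 because rows can be padded.
   The caller must pass a non-empty image (width * height > 0). */
static void average_bgra(const uint8_t *buf, size_t width, size_t height,
                         size_t bytes_per_row, float *r, float *g, float *b)
{
    float sum_r = 0.f, sum_g = 0.f, sum_b = 0.f;
    for (size_t y = 0; y < height; y++) {
        /* Same role as buf.advancedBy(y * bprow): start of row y. */
        const uint8_t *row = buf + y * bytes_per_row;
        for (size_t x = 0; x < width * 4; x += 4) {
            sum_b += row[x];
            sum_g += row[x + 1];
            sum_r += row[x + 2];
        }
    }
    /* "r /= 255 * (float)(width * height)" divides the sum by the largest
       possible total, so each result ends up between 0 and 1. */
    float scale = 255.f * (float)(width * height);
    *r = sum_r / scale;
    *g = sum_g / scale;
    *b = sum_b / scale;
}
```

The slash in the remaining Obj-C lines is just division combined with assignment, which is why the Swift version above writes r /= 255 * Float(width * height).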

opencl workitem run parallel

I am asking how to speed up or optimize this code.
This is a Sobel edge-detection kernel for grayscale images.
When I run the program without any processing, just showing the input video and an output identical to the input, I get 70 frames per second. With processing enabled (the Sobel kernel running on the GPU) it drops to 20 fps.
Does anyone have an idea of how to speed up this code? I tried using local memory instead of global memory, but the difference was small.
How can I make all work items process the image?
Sobel kernel:
__kernel void hello_kernel(const __global uchar *input, __global uchar *output,const uint width,const uint height)
{
int x = get_global_id(0);
int y = get_global_id(1);
int index = width * y + x;
float a,b,c,d,e,f,g,h,i;
float8 v;
float sobelX = 0;
float sobelY = 0;
//if(index > width && index < (height*width)-width && (index % width-1) > 0 && (index % width-1) < width-1){
a = input[index-1-width] * -1.0f;
b =input[index-0-width] * 0.0f;
c = input[index+1-width] * +1.0f;
d = input[index-1] * -2.0f;
e = input[index-0] * 0.0f;
f = input[index+1] * +2.0f;
g = input[index-1+width] * -1.0f;
h = input[index-0+width] * 0.0f;
i = input[index+1+width] * +1.0f;
sobelX = a+b+c+d+e+f+g+h+i;
a = input[index-1-width] * -1.0f;
b = input[index-0-width] * -2.0f;
c = input[index+1-width] * -1.0f;
d = input[index-1] * 0.0f;
e = input[index-0] * 0.0f;
f = input[index+1] * 0.0f;
g = input[index-1+width] * +1.0f;
h = input[index-0+width] * +2.0f;
i = input[index+1+width] * +1.0f;
sobelY = a+b+c+d+e+f+g+h+i;
output[index] = sqrt(pow(sobelX,2) + pow(sobelY,2));
}

CUDA Kernel Optimization regarding register

I'm quite new to CUDA and GPU programming. I'm trying to write a kernel for a physics application. The parallelization is over a quadrature of directions, each direction resulting in a sweep of a 2D Cartesian domain. Here is the kernel. It works well and gives good results.
However, the very high number of registers per block leads to spilling to local memory, which severely slows down the code.
__global__ void KERNEL (int imax, int jmax, int mmax, int lg, int lgmax,
double *x, double *y, double *qd, double *kappa,
double *S, double *G, double *qw, double *SkG,
double *Ska,double *a, double *Ljm, int *data)
{
int m = 1+blockIdx.x*blockDim.x + threadIdx.x ;
int tid = threadIdx.x ;
//Var needed for thread execution
...
extern __shared__ double shared[] ;
//Read some data from Global mem
mu = qd[ (m-1)];
eta = qd[ MSIZE+(m-1)];
wm = qd[3*MSIZE+(m-1)];
amu = fabs(mu);
aeta= fabs(eta);
ista = data[ (m-1)] ;
iend = data[1*MSIZE+(m-1)] ;
istp = data[2*MSIZE+(m-1)] ;
jsta = data[3*MSIZE+(m-1)] ;
jend = data[4*MSIZE+(m-1)] ;
jstp = data[5*MSIZE+(m-1)] ;
j1 = (1-jstp) ;
j2 = (1+jstp)/2 ;
i1 = (1-istp) ;
i2 = (1+istp)/2 ;
isw = ista-istp ;
jsw = jsta-jstp ;
dy = dx = 1.0e-2 ;
for(i=1 ; i<=imax; i++) Ljm[MSIZE*(i-1)+m] = S[jsw*(imax+2)+i] ;
//Beginning of the vertical Sweep, can be from left to right,
// or opposite depending on the thread
for(j=jsta ; j1*jend + j2*j<=j2*jend + j1*j ; j=j+jstp) {
Lw = S[j*(imax+2)+isw] ;
//Beginning of the horizontal Sweep, can be from left to right,
// or opposite depending on the thread
for(i=ista ; i1*iend + i2*i<=i2*iend + i1*i ; i=i+istp) {
ax = dy ;
Lx = ax*amu/ex ;
ay = dx ;
Ly = ay*aeta/ey ;
dv = ax*ay ;
L0 = dv*kappaij ;
Sp = S[j*(imax+2)+i]*dv ;
Ls = Ljm[MSIZE*(i-1)+m] ;
Lp = (Lx*Lw+Ly*Ls+Sp)/(Lx+Ly+L0) ;
Lw = Lw+(Lp-Lw)/ex ;
Ls = Ls+(Lp-Ls)/ey ;
Ljm[MSIZE*(i-1)+m] = Ls ;
shared[tid] = wm*Lp ;
__syncthreads();
for (s=16; s>0; s>>=1) {
if (tid < s) {
shared[tid] += shared[tid + s] ;
}
}
if(tid==0) atomicAdd(&SkG[imax*(j-1)+(i-1)],shared[tid]*kappaij);
}
// End of horizontal sweep
}
// End of vertical sweep
}
How can I optimize the execution of this code? I run it with 8 blocks of 32 threads.
The occupancy for this kernel is really low; according to the Visual Profiler it is limited by registers.
I have no idea how to improve it.
Thanks!
First of all, you are using blocks of 32 threads, and because of that the kernel's occupancy is too low. Your GPU is running only 256 threads in parallel, but it can run up to 1536 threads per multiprocessor (compute capability 2.x).
How many registers are you using?
You can also try declaring your variables in the innermost scope where they are used, which can help the registers be reused better.
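To make the occupancy argument concrete, here is a rough back-of-the-envelope estimate written as plain C. It is only a sketch under stated assumptions: the limits in the comments (1536 threads, 8 resident blocks and 32768 registers per multiprocessor) are the figures I believe apply to compute capability 2.x and should be checked against the CUDA Programming Guide, and the model ignores shared memory and register allocation granularity. SmLimits and resident_threads are hypothetical names.

```c
#include <assert.h>

/* Hypothetical per-multiprocessor limits (assumed values for compute capability 2.x). */
typedef struct {
    int max_threads_per_sm;   /* assumed 1536 */
    int max_blocks_per_sm;    /* assumed 8 */
    int registers_per_sm;     /* assumed 32768 */
} SmLimits;

/* Rough upper bound on the number of threads resident on one multiprocessor:
   whole blocks only, limited by block count, thread count and register file. */
static int resident_threads(SmLimits lim, int block_size, int regs_per_thread)
{
    int blocks = lim.max_blocks_per_sm;
    int blocks_by_threads = lim.max_threads_per_sm / block_size;
    int blocks_by_regs = lim.registers_per_sm / (regs_per_thread * block_size);
    if (blocks_by_threads < blocks) blocks = blocks_by_threads;
    if (blocks_by_regs < blocks) blocks = blocks_by_regs;
    return blocks * block_size;
}
```

Under these assumptions, blocks of 32 threads can never exceed 8 * 32 = 256 resident threads out of 1536, however few registers each thread uses, while larger blocks make the register count per thread the deciding factor.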

OpenCL Memory Optimization - Nearest Neighbour

I'm writing a program in OpenCL that receives two arrays of points and calculates the nearest neighbour for each point.
I have two programs for this. One calculates distances in 4 dimensions and the other in 6 dimensions. They are below:
4 dimensions:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float dist = fast_distance(curY, m[x]);
if (dist < minDist)
{
minDist = dist;
minIdx = x;
}
}
i[index] = minIdx;
y[index] = minDist;
}
6 dimensions:
kernel void BruteForce(
global read_only float8* m,
global float8* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float8 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float8 mx = m[x];
float d0 = mx.s0 - curY.s0;
float d1 = mx.s1 - curY.s1;
float d2 = mx.s2 - curY.s2;
float d3 = mx.s3 - curY.s3;
float d4 = mx.s4 - curY.s4;
float d5 = mx.s5 - curY.s5;
float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
if (dist < minDist)
{
minDist = dist;
minIdx = index;
}
}
i[index] = minIdx;
y[index] = minDist;
}
I'm looking for ways to optimize this program for GPGPU. I've read some articles (including http://www.macresearch.org/opencl_episode6, which comes with source code) about GPGPU optimization using local memory. I tried applying it and came up with this code:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
__local float4 * shared)
{
int index = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = 64000;
int x = 0;
for(x = 0; x < {0}; x += lsize)
{
if((x+lsize) > {0})
lsize = {0} - x;
if ( (x + lid) < {0})
{
shared[lid] = m[x + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);
for (int x1 = 0; x1 < lsize; x1++)
{
float dist = distance(curY, shared[x1]);
if (dist < minDist)
{
minDist = dist;
minIdx = x + x1;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
i[index] = minIdx;
y[index] = minDist;
}
I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'd appreciate any answer that helps me improve this code, or that finds the problem with the optimized version above.
Thank you very much
Cauê
One way to get a big speed up here is to use local data structures and compute entire blocks of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you because you will know the specifics of your application.
Some pseudo-ish code (4D):
computeBlockSize is the size of the blocks to read from global memory and crunch.
This value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. This will be
allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory.
(max value for ocl 1.0 ~448, ocl 1.1 ~896)
#define computeBlockSize 256
__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];
Now blockA gets computed against all blockB combinations. This is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global.
Steps:
load blockA into local memory with async_work_group_copy
    blockA is located at get_group_id(0) * computeBlockSize in the global vector
    optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed
need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
    for(k=0; k<computeBlockSize; k++)
        if(j==k) continue; //no self-comparison
        calculate distance of blockA[j] vs blockA[k]
        store min distance in blockA[j].w
        store global index (= get_group_id(0)*computeBlockSize + k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
for (i=0; i<get_num_groups(0); i++)
    if (i==get_group_id(0)) continue;
    load blockB into local memory: async_work_group_copy(...)
    for(j=get_local_id(0); j<computeBlockSize; j+=get_local_size(0))
        for(k=0; k<computeBlockSize; k++)
            calculate distance of blockA[j] vs blockB[k]
            store min distance in blockA[j].w
            store global index (= i*computeBlockSize + k) of nearest in blockAnearestIndex[j]
    barrier(local_mem_fence)
write blockA and blockAnearestIndex to global memory using two async_work_group_copy
There should be no problem in reading a blockB while another work group writes the same block (as its own blockA), because only the W values may have changed. If there happens to be trouble with this -- or if you do require two different vectors of points, you could use two global vectors like you have above, one with the A's (writeable) and the other with the B's (read only).
This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks that would work in a similar manner to the one above. The new kernel can execute after the first, and you could save the second PCIe transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if either blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommend writing a new kernel. Only a very small percentage of your data points will actually fall in an edge block.

2nd order IIR filter, coefficients for a butterworth bandpass (EQ)?

Important update: I already figured out the answers and put them in this simple open-source library: http://bartolsthoorn.github.com/NVDSP/ Check it out, it will probably save you quite some time if you're having trouble with audio filters in iOS!
I have created a (realtime) audio buffer (float *data) that holds a few sin(theta) waves with different frequencies.
The code below shows how I created my buffer, and I've tried to do a bandpass filter but it just turns the signals to noise/blips:
// Multiple signal generator
__block float *phases = nil;
[audioManager setOutputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels)
{
float samplingRate = audioManager.samplingRate;
NSUInteger activeSignalCount = [tones count];
// Initialize phases
if (phases == nil) {
phases = new float[10];
for(int z = 0; z <= 10; z++) {
phases[z] = 0.0;
}
}
// Multiple signals
NSEnumerator * enumerator = [tones objectEnumerator];
id frequency;
UInt32 c = 0;
while(frequency = [enumerator nextObject])
{
for (int i=0; i < numFrames; ++i)
{
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
{
float theta = phases[c] * M_PI * 2;
if (c == 0) {
data[i*numChannels + iChannel] = sin(theta);
} else {
data[i*numChannels + iChannel] = data[i*numChannels + iChannel] + sin(theta);
}
}
phases[c] += 1.0 / (samplingRate / [frequency floatValue]);
if (phases[c] > 1.0) phases[c] = -1;
}
c++;
}
// Normalize data with active signal count
float signalMulti = 1.0 / (float(activeSignalCount) * (sqrt(2.0)));
vDSP_vsmul(data, 1, &signalMulti, data, 1, numFrames*numChannels);
// Apply master volume
float volume = masterVolumeSlider.value;
vDSP_vsmul(data, 1, &volume, data, 1, numFrames*numChannels);
if (fxSwitch.isOn) {
// H(s) = (s/Q) / (s^2 + s/Q + 1)
// http://www.musicdsp.org/files/Audio-EQ-Cookbook.txt
// BW 2.0 Q 0.667
// http://www.rane.com/note170.html
//The order of the coefficients are, B1, B2, A1, A2, B0.
float Fs = samplingRate;
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
float Q = 0.50f;
float alpha = sin(omega)/(2*Q); // sin(w0)/(2*Q)
// Through H
for (int i=0; i < numFrames; ++i)
{
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
{
data[i*numChannels + iChannel] = (data[i*numChannels + iChannel]/Q) / (pow(data[i*numChannels + iChannel],2) + data[i*numChannels + iChannel]/Q + 1);
}
}
float b0 = alpha;
float b1 = 0;
float b2 = -alpha;
float a0 = 1 + alpha;
float a1 = -2*cos(omega);
float a2 = 1 - alpha;
float *coefficients = (float *) calloc(5, sizeof(float));
coefficients[0] = b1;
coefficients[1] = b2;
coefficients[2] = a1;
coefficients[3] = a2;
coefficients[3] = b0;
vDSP_deq22(data, 2, coefficients, data, 2, numFrames);
free(coefficients);
}
// Measure dB
[self measureDB:data:numFrames:numChannels];
}];
My aim is to make a 10-band EQ for this buffer using vDSP_deq22. The syntax of the method is:
vDSP_deq22(<float *vDSP_A>, <vDSP_Stride vDSP_I>, <float *vDSP_B>, <float *vDSP_C>, <vDSP_Stride vDSP_K>, <vDSP_Length __vDSP_N>)
See: http://developer.apple.com/library/mac/#documentation/Accelerate/Reference/vDSPRef/Reference/reference.html#//apple_ref/doc/c_ref/vDSP_deq22
Arguments:
float *vDSP_A is the input data
float *vDSP_B are 5 filter coefficients
float *vDSP_C is the output data
I have to make 10 filters (10 calls to vDSP_deq22). Then I set the gain for every band and combine them back together. But what coefficients do I feed each filter? I know vDSP_deq22 is a 2nd-order (Butterworth) IIR filter, but how do I turn it into a bandpass?
Now I have three questions:
a) Do I have to de-interleave and interleave the audio buffer? I know that setting the stride to 2 filters just one channel, but how do I filter the other? A stride of 1 would process both channels as one.
b) Do I have to transform/process the buffer before it enters the vDSP_deq22 method? If so, do I also have to transform it back to normal?
c) What values of the coefficients should I set to the 10 vDSP_deq22s?
I've been trying for days now but I haven't been able to figure this one out, please help me out!
Your omega value needs to be normalised, i.e. expressed as a fraction of Fs. It looks like you left out the f0 when you calculated omega, which will make alpha wrong too:
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
should probably be:
float omega = 2*M_PI*f0/Fs; // w0 = 2*pi*f0/Fs
where f0 is the centre frequency in Hz.
For your 10 band equaliser you'll need to pick 10 values of f0, spaced logarithmically, e.g. 25 Hz, 50 Hz, 100 Hz, 200 Hz, 400 Hz, 800 Hz, 1.6 kHz, 3.2 kHz, 6.4 kHz, 12.8 kHz.