dot product using cblas is slow - blas

I want to calculate the product A^T*A ( A is 2000x1000 Matrix). Also i only want to solve the upper triangular Matrix. In the inner loop i have to solve the dot product of two vectors.
Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM)i7 CPU M620 #2,67GHz, 1,92GB RAM)

The problem is caused essentially by matrix size, not by ddot. Your matrices are so large that they do not fit in the cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, so reducing cache refreshes. A model implementation follows for both the ddot and an daxpy approach. On my computer the time consumption was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that need for cash refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t0 = clock();
printf("Time for DOT : %f sec.\n",(double)t0/CLOCKS_PER_SEC);
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}

The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
ie. just a loop unrolled to a stride of 5.
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2);
__m128d dtemp;
if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
dtemp = _mm_set_sd(x[n] * y[n]);
}
for(int i=0; i<n2; i+=2) {
__m128d x1 = _mm_loadr_pd(x+i);
__m128d y1 = _mm_loadr_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
_mm_store_pd(&result[0],dtemp);
return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.

If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about 1 order of magnitude faster than the the 3 loop code (but also 1 order of magnitude slower than cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL).

Related

How to not parallelize inner loops in OpenACC

I am a beginner in doing GPU programming with OpenACC. I was trying to do a direct convolution. Convolution consists of 6 nested loops. I only want the first loop to be parallelized. I gave the pragma #pragma acc loop for the first loop and #pragma acc loop seq for the rest. But the output that I am getting is not correct. Is the approach taken by me to parallelize the loop correct ? Specifications for the convolution: Input channels-3, Input Size- 224X224X3, Output channels- 64, Output Size- 111X111X64, filter size- 3X3X3X64. Following is the link to the header files dog.h and squeezenet_params.h. https://drive.google.com/drive/folders/1a9XRjBTrEFIorrLTPFHS4atBOPrG886i
# include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"
void conv3x3(
const int input_channels, const int input_size,
const int pad, const int stride, const int start_channel,
const int output_size, const float* restrict input_im, const float* restrict filter_weight,
const float* restrict filter_bias, float* restrict output_im){
#pragma acc data copyin (input_im[0:150527],filter_weight[0:1727],filter_bias[0:63]) copyout(output_im[0:788543])
{
#pragma acc parallel
{
#pragma acc loop
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
output_im += (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop seq
for(int i = 0; i < output_size; i++)
{
#pragma acc loop seq
for(int j = 0; j < output_size; j++)
{
//compute one element in the output feature map
float tmp = bias;
//compute dot product of 2 input_channels x 3 x 3 matrix
#pragma acc loop seq
for(int k = 0; k < input_channels; k++)
{
#pragma acc loop seq
for(int l = 0; l < 3; l++)
{
int h = i * stride + l - pad;
#pragma acc loop seq
for(int m = 0; m < 3; m++)
{
int w = j * stride + m - pad;
if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
{
tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
* filter_weight[9 * k + 3 * l + m];
}
}
}
}
//add relu activation after conv
output_im[i * output_size + j] = (tmp > 0.0) ? tmp : 0.0;
}
}
}
}
}
}
void main(){
float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
for(int i=0;i<64 * 111 * 111;++i){
//if(result[i]>0)
printf("%f:%d\n",result[i],i);
}
}
The contributor posted the same question on the PGI User Forums where I've answered. (See: https://www.pgroup.com/userforum/viewtopic.php?f=4&t=7614). The topic question is incorrect in that the inner loops are not getting parallelized nor are the cause of the issue.
The problem here is that the code has a race condition on the shared "output_im" pointer. My suggested solution is to compute a per thread offset into the array rather than trying to manipulate the pointer itself.
for(int p=0;p<64;++p){
filter_weight += p * input_channels * 9;
float bias = filter_bias[p];
int offset;
offset = (start_channel + p) * output_size * output_size;
//loop over output feature map
#pragma acc loop vector collapse(2)
for(int i = 0; i < output_size; i++)
{
for(int j = 0; j < output_size; j++)
{
... cut ...
}
}
//add relu activation after conv
int idx = offset + (i * output_size + j);
output_im[idx] = (tmp > 0.0) ? tmp : 0.0;
}
}

OpenCL Sum reduction across different work-groups gives the wrong result

So I'm currently trying to write a kernel in OpenCL with the goal of sum reducing each row of a matrix (g_idata) into an array (g_odata). Said matrix is represented by a float array with column_count * row_count length, and the resulting array has a length of row_count. As such I've implemented the following kernel:
#define T float
#define Operation(X, Y) ((X) + (Y))
__kernel void marrow_kernel( __global T *g_odata,__global T *g_idata,
const unsigned long column_count, const unsigned long row_count, __local volatile T* sdata) {
size_t tid = get_local_id(0);
size_t gid = get_global_id(0);
size_t row = gid / column_count;
size_t column = gid % column_count;
if(row < row_count && column < column_count)
{
sdata[tid] = g_idata[gid];
}
barrier(CLK_LOCAL_MEM_FENCE);
if(row < row_count && column < column_count)
{
size_t step = column_count / 2;
size_t limit = column_count;
while(step > 0)
{
if(column + step < limit) {
if(tid + step < get_local_size(0))
{
sdata[tid] = Operation(sdata[tid], sdata[tid + step]);
}
else if (gid + step < column_count * row_count)
{
sdata[tid] = Operation(sdata[tid], g_idata[gid + step]);
}
}
barrier(CLK_LOCAL_MEM_FENCE);
step /= 2;
limit /= 2;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
if(row < row_count && column == 0)
{
g_odata[row] = column_count % 2 == 0 ? sdata[tid] : sdata[tid] + g_idata[gid + (column_count - 1)];
}
}
Said kernel is currently being instantiated with a work-group of 128 work-units. I currently have no control over the size of the work-group.
Now here's the issue: If lets say I've a row that's shared between two different work-groups, it'll return the wrong result, since it'll fetch the value in the g_idata, since it's impossible to access the result of the next work-group local memory. After the first iteration, that's the wrong value, and it'll afect the final result of the operation.
Anyone can give me an hint on how to solve this problem?

Algorithm to group consecutive words minimizing length per group

From an input of space-delimited words, how to concatenate consecutive words so that:
each group has a minimum length L (spaces don't count)
longest group length is minimal (spaces don't count)
Example input:
would a cat eat a mouse
Example minimum length:
L = 5
Naive algorithm that solves the first condition but not the second one:
while length of a group is less than L, concatenate next word to group
if last group is shorter than L, concatenate last two groups together
This naive algorithm produces:
group 1: would
group 2: acateat
group 3: amouse
longest group length: 7
Second condition is not solved because a better solution would be:
group 1: woulda
group 2: cateat
group 3: amouse
longest group length: 6
Which algorithm would solve the second condition (minimal longest group) with relatively fast execution as a program? (by fast, I'd like to avoid testing all possible combinations)
I know C, ObjC, Swift, Javascript, Python, but pseudocode is fine.
This can be done with dynamic programming approach. Let's count a function F(i) - the minimum length of the longest group among correct divisions of the first i words into groups.
F(0) = 0
F(i) = Min(Max(F(j), totalLen(j+1, i))), for j in [0..i-1]
Where
totalLen(i, j) = total length of words from i to j, if the length is at least L
totalLen(i, j) = MAX, if total length is less than L
The answer is the value of F(n). To get the groups themselves we can save the indices of the best j for every i.
There is a implementation from the scratch in c++:
const vector<string> words = {"would", "a", "cat", "eat", "a", "mouse"};
const int L = 5;
int n = words.size();
vector<int> prefixLen = countPrefixLen(words);
vector<int> f(n+1);
vector<int> best(n+1, -1);
int maxL = prefixLen[n];
f[0] = 0;
for (int i = 1; i <= n; ++i) {
f[i] = maxL;
for (int j = 0; j < i; ++j) {
int totalLen = prefixLen[i] - prefixLen[j];
if (totalLen >= L) {
int maxLen = max(f[j], totalLen);
if (f[i] > maxLen) {
f[i] = maxLen;
best[i] = j;
}
}
}
}
output(f[n], prev, words);
Preprocessing and output details:
vector<int> countPrefixLen(const vector<string>& words) {
int n = words.size();
vector<int> prefixLen(n+1);
for (int i = 1; i <= n; ++i) {
prefixLen[i] = prefixLen[i-1] + words[i-1].length();
}
return prefixLen;
}
void output(int answer, const vector<int>& best, const vector<string>& words) {
cout << answer << endl;
int j = best.size()-1;
vector<int> restoreIndex(1, j);
while (j > 0) {
int i = best[j];
restoreIndex.push_back(i);
j = i;
}
reverse(restoreIndex.begin(), restoreIndex.end());
for (int i = 0; i+1 < restoreIndex.size(); ++i) {
for (int j = restoreIndex[i]; j < restoreIndex[i+1]; ++j) {
cout << words[j] << ' ';
}
cout << endl;
}
}
Output:
6
would a
cat eat
a mouse
Runnable: https://ideone.com/AaV5C8
Further improvement
The complexity of this algorithm is O(N^2). If it is too slow for your data I can suggest a simple optimization:
Let's inverse the inner loop. First, this allows to get rid of the prefixLen array and it's preprocessing, because now we add words one by one to the group (actually, we could get rid of this preprocessing in the initial version, but at the expense of simplicity). What is more important we can break our loop when totalLen would be not less than already computed f[i] because further iterations will never lead to an improvement. The code of the inner loop could be changed to:
int totalLen = 0;
for (int j = i-1; j >= 0; --j) {
totalLen += words[j].length();
if (totalLen >= L) {
int maxLen = max(f[j], totalLen);
if (f[i] > maxLen) {
f[i] = maxLen;
best[i] = j;
}
}
if (totalLen >= f[i]) break;
}
This can drastically improve the performance for not very big values of L.

FFTW / CUFFT over given axis of multidimensional array [duplicate]

I'm trying to compute batch 1D FFTs using cufftPlanMany. The data set comes from a 3D field, stored in a 1D array, where I want to compute 1D FFTs in the x and y direction. The data is stored as shown in the figure below; continuous in x then y then z.
Doing batch FFTs in the x-direction is (I believe) straighforward; with input stride=1, distance=nx and batch=ny * nz, it computes the FFTs over elements {0,1,2,3}, {4,5,6,7}, ..., {28,29,30,31}. However, I can't think of a way to achieve the same for the FFTs in the y-direction. A batch for each xy plane is again straightforward (input stride=nx, dist=1, batch=nx results in FFTs over {0,4,8,12}, {1,5,9,13}, etc.). But with batch=nx * nz, going from {3,7,11,15} to {16,20,24,28}, the distance is larger than 1. Can this somehow be done with cufftPlanMany?
I think that the short answer to your question (possibility of using a single cufftPlanMany to perform 1D FFTs of the columns of a 3D matrix) is NO.
Indeed, transformations performed according to cufftPlanMany, that you call like
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
must obey the Advanced Data Layout. In particular, 1D FFTs are worked out according to the following layout
input[b * idist + x * istride]
where b addresses the b-th signal and istride is the distance between two consecutive items in the same signal. If the 3D matrix has dimensions M * N * Q and if you want to perform 1D transforms along the columns, then the distance between two consecutive elements will be M, while the distance between two consecutive signals will be 1. Furthermore, the number of batched executions must be set equal to M. With those parameters, you are able to cover only one slice of the 3D matrix. Indeed, if you try increasing M, then the cuFFT will start trying to compute new column-wise FFTs starting from the second row. The only solution to this problem is an iterative call to cufftExecC2C to cover all the Q slices.
For the record, the following code provides a fully worked example on how performing 1D FFTs of the columns of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
printf("\n");
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { N }; // --- Size of the Fourier transform
int istride = M, ostride = M; // --- Distance between two successive input/output elements
int idist = 1, odist = 1; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = M; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
for (int k=0; k<Q; k++)
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data()) + k * M * N), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data()) + k * M * N), CUFFT_FORWARD);
cufftDestroy(handle);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
}
The situation is different for the case when you want to perform 1D transforms of the rows. In that case, the distance between two consecutive elements is 1, while the distance between two consecutive signals is M. This allows you to set a number of N * Q transformations and then invoking cufftExecC2C only one time. For the record, the code below provides a full example of 1D transformations of the rows of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
printf("\n");
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { M }; // --- Size of the Fourier transform
int istride = 1, ostride = 1; // --- Distance between two successive input/output elements
int idist = M, odist = M; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = N * Q; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data())), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data())), CUFFT_FORWARD);
cufftDestroy(handle);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
}
I guess, idist=nx*nz could also jump a whole plane and batch=nz would then cover one yx plane. The decision should be made according to whether nx or nz is larger.

Setup the accelerator framework for fft on the iPhone

I have set a function to setup the accelerator, after i have read :
Using the Apple FFT and Accelerate Framework
iPhone FFT with Accelerate framework vDSP
and apple docs.
i did this :
void fftSetup()
{
COMPLEX_SPLIT A;
FFTSetup setupReal;
uint32_t log2n;
uint32_t n, nOver2;
int32_t stride;
uint32_t i;
float *originalReal, *obtainedReal;
float scale;
uint32_t L = 1024;
float *mag = new float[L/2];
log2n = 10 ;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
for (i = 0; i < n; i++)
originalReal[i] = (float) (i + 1);
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);
//get magnitude;
for(i = 1; i < L/2; i++){
mag[i] = sqrtf(A.realp[i]*A.realp[i] + A.imagp[i] * A.imagp[i]);
}
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
}
questions :
my app is always crash with no error(BAD ACCESS) on one of this 2 lines :
originalReal[i] = (float) (i + 1); // or
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
i guess i did not set a good value for log2n ? (10 to get 1024 window ? )
how do i get the real magnitude of the bins? my actual fft? the same i wrote here ?
where do i input MY data buffer array (exactly where in my code ? instead originalReal?)
thanks a lot.
I actually manage to make it work ,when i insert into it a sin wave of a certain f.
This is the code :
COMPLEX_SPLIT A;
FFTSetup setupReal;
uint32_t log2n;
uint32_t n, nOver2;
int32_t stride;
uint32_t i;
float *originalReal, *obtainedReal;
float scale;
uint32_t L = 1024;
float *mag = new float[L/2];
log2n = 10 ;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
//printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
for (i = 0; i < n; i++)
originalReal[i] = cos(2*3.141592*11000*i/44100);//(float) (i + 1);
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
//vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
//get magnitude;
for(i = 1; i < L/2; i++)
{
mag[i] = sqrtf(A.realp[i]*A.realp[i] + A.imagp[i] * A.imagp[i]);
NSLog(#"%d:%f",i,mag[i]);
}
Actually its not 44hz between bins,as the guy wrote in the post above! but 43 ! 22050/512=43 . this thing is critical ! because in the higher bins- such as bin[300] you get a completely different resault for 44 and 43 ! (its 300hz drift). so take care of that .