How do I traverse a TensorFlow graph using the C API?

The small program below creates a simple TensorFlow graph. I need to traverse the graph, printing information about the nodes as I go.
Is it right to assume that every graph has a root (or distinguished node)? I believe this graph has 3 nodes, and I've heard that the edges are tensors.
#include <stdio.h>
#include <string.h>
#include <tensorflow/c/c_api.h>

TF_Graph* g;
TF_Status* s;
#define CHECK_OK(x) if(TF_OK != TF_GetCode(s))return printf("%s\n",TF_Message(s)),(void*)0

// Allocate a 2x2 float tensor and copy `values` into it.
TF_Tensor* FloatTensor2x2(const float* values) {
  const int64_t dims[2] = {2, 2};
  TF_Tensor* t = TF_AllocateTensor(TF_FLOAT, dims, 2, sizeof(float) * 4);
  memcpy(TF_TensorData(t), values, sizeof(float) * 4);
  return t;
}

// Add a 2x2 float "Const" node named `name` to the graph.
TF_Operation* FloatConst2x2(TF_Graph* graph, TF_Status* s, const float* values, const char* name) {
  TF_Tensor* tensor=FloatTensor2x2(values);
  TF_OperationDescription* desc = TF_NewOperation(graph, "Const", name);
  TF_SetAttrTensor(desc, "value", tensor, s);
  TF_DeleteTensor(tensor);  // the attribute keeps its own reference to the data
  if (TF_GetCode(s) != TF_OK) return 0;
  TF_SetAttrType(desc, "dtype", TF_FLOAT);
  TF_Operation* op = TF_FinishOperation(desc, s);
  return op;
}

// Add a "MatMul" node that consumes output 0 of `l` and `r`.
TF_Operation* MatMul(TF_Graph* graph, TF_Status* s, TF_Operation* l, TF_Operation* r, const char* name,
                     char transpose_a, char transpose_b) {
  TF_OperationDescription* desc = TF_NewOperation(graph, "MatMul", name);
  if (transpose_a) {
    TF_SetAttrBool(desc, "transpose_a", 1);
  }
  if (transpose_b) {
    TF_SetAttrBool(desc, "transpose_b", 1);
  }
  TF_AddInput(desc,(TF_Output){l, 0});
  TF_AddInput(desc,(TF_Output){r, 0});
  TF_Operation* op = TF_FinishOperation(desc, s);
  return op;
}

TF_Graph* BuildSuccessGraph(TF_Output* inputs, TF_Output* outputs) {
  //                |
  //               z|
  //                |
  //              MatMul
  //             /      \
  //            ^        ^
  //            |        |
  //     x Const_0    y Const_1
  float const0_val[] = {1.0, 2.0, 3.0, 4.0};
  float const1_val[] = {1.0, 0.0, 0.0, 1.0};
  TF_Operation* const0 = FloatConst2x2(g, s, const0_val, "Const_0");
  TF_Operation* const1 = FloatConst2x2(g, s, const1_val, "Const_1");
  TF_Operation* matmul = MatMul(g, s, const0, const1, "MatMul",0,0);
  inputs[0] = (TF_Output){const0, 0};
  inputs[1] = (TF_Output){const1, 0};
  outputs[0] = (TF_Output){matmul, 0};
  return g;
}

int main(int argc, char const *argv[]) {
  g = TF_NewGraph();
  s = TF_NewStatus();
  TF_Output inputs[2],outputs[1];
  BuildSuccessGraph(inputs, outputs);  // populate g before traversing it
  /* HERE traverse g -- maybe with {inputs,outputs} -- to print the graph */
  fprintf(stdout, "OK\n");
  return 0;
}
If someone could help with what functions to use to get info about the graph, it would be appreciated.

from c_api.h:
// Iterate through the operations of a graph. To use:
// size_t pos = 0;
// TF_Operation* oper;
// while ((oper = TF_GraphNextOperation(graph, &pos)) != nullptr) {
// DoSomethingWithOperation(oper);
// }
TF_CAPI_EXPORT extern TF_Operation* TF_GraphNextOperation(TF_Graph* graph,
size_t* pos);
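Note that this only returns the operations; it does not give you a way to walk from one node (operation) to the next. The edge relationships are stored in the nodes themselves: for each operation, TF_OperationNumInputs and TF_OperationInput report which outputs of other operations feed it, while TF_OperationName and TF_OperationOpType identify the node. The C API does not expose a distinguished root node, so iterating over every operation this way is how you visit them all.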


LAPACKE or MAGMA GPU - inversion of matrix with Cholesky factorization - functions magma_dpotrf_gpu and magma_dpotri_gpu

I have a first version of a function that inverts a matrix of size m using
magma_dgetrf_gpu and magma_dgetri_gpu, like this:
// Inversion
magma_dgetrf_gpu( m, m, d_a, m, piv, &info);
magma_dgetri_gpu( m, d_a, m, piv, dwork, ldwork, &info);
Now I would like to compute the inverse using the Cholesky decomposition instead. The function looks like the first version, except that it calls these functions:
// Inversion
magma_dpotrf_gpu( MagmaLower, m, d_a, m, &info);
magma_dpotri_gpu( MagmaLower, m, d_a, m, &info);
Here is the entire function that performs the inversion:
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
  // Index for loop and arrays
  int i, j, ip, idx;
  // Start magma part
  magma_int_t m = F_matrix.size();
  if (m) {
    magma_init (); // initialize Magma
    magma_queue_t queue=NULL;
    magma_int_t dev=0;
    magma_queue_create(dev ,&queue );
    double gpu_time , *dwork; // dwork - workspace
    magma_int_t ldwork; // size of dwork
    magma_int_t *piv, info; // piv - array of indices of inter -
    // changed rows; a - mxm matrix
    magma_int_t mm=m*m; // size of a, r, c
    double *a; // a- mxm matrix on the host
    double *d_a; // d_a - mxm matrix a on the device
    magma_int_t err;
    ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
    // allocate matrices
    err = magma_dmalloc_cpu( &a , mm ); // host memory for a
    // Convert matrix to *a double pointer
    for (i = 0; i<m; i++){
      for (j = 0; j<m; j++){
        idx = i*m + j;
        a[idx] = F_matrix[i][j];
      }
    }
    err = magma_dmalloc( &d_a , mm ); // device memory for a
    err = magma_dmalloc( &dwork , ldwork );// dev. mem. for ldwork
    piv=( magma_int_t *) malloc(m*sizeof(magma_int_t ));// host mem.
    magma_dsetmatrix( m, m, a, m, d_a, m, queue); // copy a -> d_a
    // Inversion
    magma_dpotrf_gpu( MagmaLower, m, d_a, m, &info);
    magma_dpotri_gpu( MagmaLower, m, d_a, m, &info);
    magma_dgetmatrix( m, m, d_a , m, a, m, queue); // copy d_a ->a
    // Save Final matrix
    for (i = 0; i<m; i++){
      for (j = 0; j<m; j++){
        idx = i*m + j;
        F_output[i][j] = a[idx];
      }
    }
    magma_free_cpu(a); // free host memory (it came from magma_dmalloc_cpu)
    free(piv); // free host memory
    magma_free(d_a); // free device memory
    magma_free(dwork); // free device workspace
    magma_queue_destroy(queue); // destroy queue
    magma_finalize ();
    // End magma part
  }
}
Unfortunately, after checking the output data, the inversion produced by this implementation is wrong.
I have doubts about this line:
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
Can anyone see at first sight where the error comes from in my use of the dpotrf and dpotri functions (actually magma_dpotrf_gpu and magma_dpotri_gpu)?
Following the advice of Damir Tenishev, here is an example of a function that inverts a matrix using LAPACKE:
// LAPACK version
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
  // Index for loop and arrays
  int i, j, ip, idx;
  // Size of F_matrix
  int N = F_matrix.size();
  int *IPIV = new int[N];
  // Statement of main array to inverse
  double *arr = new double[N*N];
  // Output Diagonal block
  double *diag = new double[N];
  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      arr[idx] = F_matrix[i][j];
    }
  }
  // LAPACKE routines
  int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
  int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
  for (i = 0; i<N; i++){
    for (j = 0; j<N; j++){
      idx = i*N + j;
      F_output[i][j] = arr[idx];
    }
  }
  delete[] IPIV;
  delete[] arr;
  delete[] diag; // was allocated above but never released
}
As you can see, this is a classical version of matrix inversion, which uses LAPACKE_dgetrf and LAPACKE_dgetri
EDIT 2: The MAGMA version is :
// MAGMA version
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Start magma part
magma_int_t m = F_matrix.size();
if (m) {
magma_init (); // initialize Magma
magma_queue_t queue=NULL;
magma_int_t dev=0;
magma_queue_create(dev ,&queue );
double gpu_time , *dwork; // dwork - workspace
magma_int_t ldwork; // size of dwork
magma_int_t *piv, info; // piv - array of indices of inter -
// changed rows; a - mxm matrix
magma_int_t mm=m*m; // size of a, r, c
double *a; // a- mxm matrix on the host
double *d_a; // d_a - mxm matrix a on the device
magma_int_t ione = 1;
magma_int_t ISEED [4] = { 0,0,0,1 }; // seed
magma_int_t err;
const double alpha = 1.0; // alpha =1
const double beta = 0.0; // beta=0
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
// allocate matrices
err = magma_dmalloc_cpu( &a , mm ); // host memory for a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
a[idx] = F_matrix[i][j];
}
}
err = magma_dmalloc( &d_a , mm ); // device memory for a
err = magma_dmalloc( &dwork , ldwork );// dev. mem. for ldwork
piv=( magma_int_t *) malloc(m*sizeof(magma_int_t ));// host mem.
magma_dsetmatrix( m, m, a, m, d_a, m, queue); // copy a -> d_a
// find the inverse matrix: d_a*X=I using the LU factorization
// with partial pivoting and row interchanges computed by
// magma_dgetrf_gpu; row i is interchanged with row piv(i);
// d_a -mxm matrix; d_a is overwritten by the inverse
magma_dgetrf_gpu( m, m, d_a, m, piv, &info);
magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info);
magma_dgetmatrix( m, m, d_a , m, a, m, queue); // copy d_a ->a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
F_output[i][j] = a[idx];
free(a); // free host memory
free(piv); // free host memory
magma_free(d_a); // free device memory
magma_queue_destroy(queue); // destroy queue
magma_finalize ();
// End magma part
As you can see, I have used magma_dgetrf_gpu and magma_dgetri_gpu functions.
Now I would like to do the same, either with LAPACKE or with MAGMA+LAPACK, using the dpotrf and dpotri functions. As a reminder, the matrices that I invert are symmetric.
EDIT 3: my attempts come from this documentation link
Especially, see section 4.4.21 magma dpotri - invert a positive definite matrix in double precision, CPU interface on page 325.

How to do batching without UBOs?

I'm trying to implement batching for a WebGL renderer which is struggling with lots of small objects due to too many draw calls. What I thought is I'd batch them all by the kind of shader they use, then draw a few at a time, uploading material parameters and the model matrix for each object once in uniforms.
My problem is that the uniform size limits for non-UBO uniforms are extremely low, as in 256 floats low at a minimum. If my material uses, say, 8 floats, and if you factor in the model matrix, I barely have enough uniforms to draw 10 models in a single batch, which isn't really going to be enough.
Is there any hope to make this work without UBOs? Are textures an option? How are people doing batching without WebGL2 UBOs?
More details: I have no skinning or complex animations, I just have some shaders (diffuse, cook-torrance, whatever) and each model has different material settings for each shader, e.g. color, roughness, index of refraction which can be changed dynamically by the user (so it's not realistic to bake them into the vertex array because we have some high poly data, also users can switch shaders and not all shaders have the same number of parameters) as well as material maps obviously. The geometry itself is static and just has a linear transform on each model. For the most part all meshes are different so geometry instancing won't help a whole lot, but I can look at that later.
I don't know that this is actually faster than lots of draw calls, but here is a way to draw 4 models with a single draw call.
It works by adding an id per model. So, for every vertex in model #0 put a 0, for every vertex in model #1 put a 1, etc.
Then it uses the model id to index data in a texture. The easiest layout is to let the model id choose the row of the texture, so that all the data for that model can be pulled out of that row.
For WebGL1
attribute float modelId;
#define TEXTURE_WIDTH ??
#define COLOR_OFFSET ((0.0 + 0.5) / TEXTURE_WIDTH)
#define MATERIAL_OFFSET ((1.0 + 0.5) / TEXTURE_WIDTH)
float modelOffset = (modelId + .5) / textureHeight;
vec4 color = texture2D(perModelData, vec2(COLOR_OFFSET, modelOffset));
vec4 roughnessIndexOfRefaction = texture2D(perModelData,
vec2(MATERIAL_OFFSET, modelOffset));
As long as you are not drawing more than gl.getParameter(gl.MAX_TEXTURE_SIZE) models it will work. If you have more than that either use more draw calls or change the texture coordinate calculations so there's more than one model per row
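The "more than one model per row" variant comes down to a little index arithmetic. The sketch below writes it out in C for clarity; the same computation would live in the shader (in WebGL1 using floor and mod on floats, since GLSL ES 1.0 has no integer % operator). The names per_model_texel, texelsPerModel and modelsPerRow are made up for this illustration and do not come from the example code.

```c
#include <assert.h>

typedef struct {
    int column;  /* texel x index */
    int row;     /* texel y index */
    float u;     /* normalized coordinate of the texel center */
    float v;
} TexelCoord;

/* Locate data slot `slot` (0 .. texelsPerModel-1) of model `modelId` when
   each texture row holds `modelsPerRow` models side by side. */
TexelCoord per_model_texel(int modelId, int slot, int texelsPerModel,
                           int modelsPerRow, int textureHeight)
{
    assert(modelId >= 0 && slot >= 0 && slot < texelsPerModel);
    assert(modelsPerRow > 0 && modelId / modelsPerRow < textureHeight);

    TexelCoord t;
    int textureWidth = texelsPerModel * modelsPerRow;
    t.column = (modelId % modelsPerRow) * texelsPerModel + slot;
    t.row = modelId / modelsPerRow;
    /* +0.5 addresses the texel center, as in the #defines above */
    t.u = ((float)t.column + 0.5f) / (float)textureWidth;
    t.v = ((float)t.row + 0.5f) / (float)textureHeight;
    return t;
}
```

With this layout the model limit becomes modelsPerRow times the texture height, as long as texelsPerModel * modelsPerRow also stays within gl.MAX_TEXTURE_SIZE.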
In WebGL2 you'd change the code to use texelFetch with integer texel coordinates:
in uint modelId;
#define COLOR_OFFSET 0
#define MATERIAL_OFFSET 1
// texelFetch takes an ivec2 texel coordinate plus a mip level
vec4 color = texelFetch(perModelData, ivec2(COLOR_OFFSET, modelId), 0);
vec4 roughnessIndexOfRefaction = texelFetch(perModelData,
ivec2(MATERIAL_OFFSET, modelId), 0);
example of 4 models drawn with 1 draw call. For each model the model matrix and color are stored in the texture.
const m4 = twgl.m4;
const v3 = twgl.v3;
const gl = document.querySelector('canvas').getContext('webgl');
const ext = gl.getExtension('OES_texture_float');
if (!ext) {
  alert('need OES_texture_float');
}

// texel-center offsets for each column of the per-model data texture
const COMMON_STUFF = `
#define TEXTURE_WIDTH 5.0
#define MATRIX_ROW_0_OFFSET ((0. + 0.5) / TEXTURE_WIDTH)
#define MATRIX_ROW_1_OFFSET ((1. + 0.5) / TEXTURE_WIDTH)
#define MATRIX_ROW_2_OFFSET ((2. + 0.5) / TEXTURE_WIDTH)
#define MATRIX_ROW_3_OFFSET ((3. + 0.5) / TEXTURE_WIDTH)
#define COLOR_OFFSET ((4. + 0.5) / TEXTURE_WIDTH)
`;

const vs = `
attribute vec4 position;
attribute vec3 normal;
attribute float modelId;

uniform float textureHeight;
uniform sampler2D perModelDataTexture;
uniform mat4 projection;
uniform mat4 view;

varying vec3 v_normal;
varying float v_modelId;

${COMMON_STUFF}

void main() {
  v_modelId = modelId;  // pass to fragment shader

  float modelOffset = (modelId + 0.5) / textureHeight;

  // note: in WebGL2 better to use texelFetch
  mat4 model = mat4(
    texture2D(perModelDataTexture, vec2(MATRIX_ROW_0_OFFSET, modelOffset)),
    texture2D(perModelDataTexture, vec2(MATRIX_ROW_1_OFFSET, modelOffset)),
    texture2D(perModelDataTexture, vec2(MATRIX_ROW_2_OFFSET, modelOffset)),
    texture2D(perModelDataTexture, vec2(MATRIX_ROW_3_OFFSET, modelOffset)));

  gl_Position = projection * view * model * position;
  v_normal = mat3(view) * mat3(model) * normal;
}
`;

const fs = `
precision highp float;

varying vec3 v_normal;
varying float v_modelId;

uniform float textureHeight;
uniform sampler2D perModelDataTexture;
uniform vec3 lightDirection;

${COMMON_STUFF}

void main() {
  float modelOffset = (v_modelId + 0.5) / textureHeight;
  vec4 color = texture2D(perModelDataTexture, vec2(COLOR_OFFSET, modelOffset));

  float l = dot(lightDirection, normalize(v_normal)) * .5 + .5;

  gl_FragColor = vec4(color.rgb * l, color.a);
}
`;

// compile shader, link, look up locations
const programInfo = twgl.createProgramInfo(gl, [vs, fs]);

// make some vertex data
const modelVerts = [
  twgl.primitives.createSphereVertices(1, 6, 4),
  twgl.primitives.createCubeVertices(1, 1, 1),
  twgl.primitives.createCylinderVertices(1, 1, 10, 1),
  twgl.primitives.createTorusVertices(1, .2, 16, 8),
];
// merge all the vertices into one
const arrays = twgl.primitives.concatVertices(modelVerts);

// fill an array so each vertex of each model has a modelId
const modelIds = new Uint16Array(arrays.position.length / 3);
let offset = 0;
modelVerts.forEach((verts, modelId) => {
  const end = offset + verts.position.length / 3;
  while(offset < end) {
    modelIds[offset++] = modelId;
  }
});
arrays.modelId = { numComponents: 1, data: modelIds };

// calls gl.createBuffer, gl.bindBuffer, gl.bufferData
const bufferInfo = twgl.createBufferInfoFromArrays(gl, arrays);

const numModels = modelVerts.length;
const tex = gl.createTexture();
const textureWidth = 5; // 4x4 matrix, 4x1 color
gl.bindTexture(gl.TEXTURE_2D, tex);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, textureWidth, numModels, 0, gl.RGBA, gl.FLOAT, null);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);

// this data is for the texture, one row per model
// first 4 pixels are the model matrix, 5th pixel is the color
const perModelData = new Float32Array(textureWidth * numModels * 4);
const stride = textureWidth * 4;
const modelOffset = 0;
const colorOffset = 16;

// set the colors at init time
for (let modelId = 0; modelId < numModels; ++modelId) {
  perModelData.set([r(), r(), r(), 1], modelId * stride + colorOffset);
}

function r() {
  return Math.random();
}

function render(time) {
  time *= 0.001;  // seconds

  gl.viewport(0, 0, gl.canvas.width, gl.canvas.height);
  gl.enable(gl.DEPTH_TEST);

  const fov = Math.PI * 0.25;
  const aspect = gl.canvas.clientWidth / gl.canvas.clientHeight;
  const near = 0.1;
  const far = 20;
  const projection = m4.perspective(fov, aspect, near, far);

  const eye = [0, 0, 10];
  const target = [0, 0, 0];
  const up = [0, 1, 0];
  const camera = m4.lookAt(eye, target, up);
  const view = m4.inverse(camera);

  // set the matrix for each model in the texture data
  const mat = m4.identity();
  for (let modelId = 0; modelId < numModels; ++modelId) {
    const t = time * (modelId + 1) * 0.3;
    m4.rotateX(mat, t, mat);
    m4.rotateY(mat, t, mat);
    m4.translate(mat, [0, 0, Math.sin(t * 1.1) * 4], mat);
    m4.rotateZ(mat, t, mat);
    perModelData.set(mat, modelId * stride + modelOffset);
  }

  // upload the texture data
  gl.bindTexture(gl.TEXTURE_2D, tex);
  gl.texSubImage2D(gl.TEXTURE_2D, 0, 0, 0, textureWidth, numModels,
                   gl.RGBA, gl.FLOAT, perModelData);

  // the program must be current before its attributes and uniforms are set
  gl.useProgram(programInfo.program);

  // calls gl.bindBuffer, gl.enableVertexAttribArray, gl.vertexAttribPointer
  twgl.setBuffersAndAttributes(gl, programInfo, bufferInfo);
  // calls gl.activeTexture, gl.bindTexture, gl.uniformXXX
  twgl.setUniforms(programInfo, {
    lightDirection: v3.normalize([1, 2, 3]),
    perModelDataTexture: tex,
    textureHeight: numModels,
    projection,  // the vertex shader declares both matrices
    view,
  });

  // calls gl.drawArrays or gl.drawElements
  twgl.drawBufferInfo(gl, bufferInfo);

  requestAnimationFrame(render);
}
requestAnimationFrame(render);
body { margin: 0; }
canvas { width: 100vw; height: 100vh; display: block; }
<script src=""></script>
Here's 2000 models in one draw call

FFTW / CUFFT over given axis of multidimensional array [duplicate]

I'm trying to compute batch 1D FFTs using cufftPlanMany. The data set comes from a 3D field, stored in a 1D array, where I want to compute 1D FFTs in the x and y direction. The data is stored as shown in the figure below; continuous in x then y then z.
Doing batch FFTs in the x-direction is (I believe) straightforward; with input stride=1, distance=nx and batch=ny * nz, it computes the FFTs over elements {0,1,2,3}, {4,5,6,7}, ..., {28,29,30,31}. However, I can't think of a way to achieve the same for the FFTs in the y-direction. A batch for each xy plane is again straightforward (input stride=nx, dist=1, batch=nx results in FFTs over {0,4,8,12}, {1,5,9,13}, etc.). But with batch=nx * nz, going from {3,7,11,15} to {16,20,24,28}, the distance is larger than 1. Can this somehow be done with cufftPlanMany?
I think that the short answer to your question (possibility of using a single cufftPlanMany to perform 1D FFTs of the columns of a 3D matrix) is NO.
Indeed, transformations performed according to cufftPlanMany, that you call like
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
must obey the Advanced Data Layout. In particular, 1D FFTs are worked out according to the following layout
input[b * idist + x * istride]
where b addresses the b-th signal and istride is the distance between two consecutive items in the same signal. If the 3D matrix has dimensions M * N * Q and you want to perform 1D transforms along the columns, then the distance between two consecutive elements (istride) is M, while the distance between two consecutive signals (idist) is 1. Furthermore, the number of batched executions must be set equal to M. With those parameters, you can cover only one slice of the 3D matrix. Indeed, if you try increasing the batch count beyond M, cuFFT will start computing new column-wise FFTs starting from the second row. The only solution to this problem is an iterative call to cufftExecC2C to cover all the Q slices.
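To make the layout argument concrete, here is a small host-only C sketch (no cuFFT involved) that enumerates which array elements each scheme touches. The helper names layout_index, visit_columns and visit_rows are invented for this illustration; only the addressing formula comes from the advanced data layout described above.

```c
#include <assert.h>
#include <stddef.h>

/* Array index of element x of signal b, starting at offset `base`. */
size_t layout_index(size_t base, size_t b, size_t x,
                    size_t idist, size_t istride)
{
    return base + b * idist + x * istride;
}

/* Column transforms: length N, istride = M, idist = 1, batch = M,
   issued once per slice k with the pointer advanced by k * M * N.
   Increments visits[] for every element read; returns the call count. */
size_t visit_columns(size_t M, size_t N, size_t Q, unsigned *visits)
{
    size_t calls = 0;
    assert(visits != NULL);
    for (size_t k = 0; k < Q; k++) {
        size_t base = k * M * N;
        for (size_t b = 0; b < M; b++)
            for (size_t x = 0; x < N; x++)
                visits[layout_index(base, b, x, 1, M)]++;
        calls++;
    }
    return calls;
}

/* Row transforms: length M, istride = 1, idist = M, batch = N * Q,
   which a single call can cover. */
size_t visit_rows(size_t M, size_t N, size_t Q, unsigned *visits)
{
    assert(visits != NULL);
    for (size_t b = 0; b < N * Q; b++)
        for (size_t x = 0; x < M; x++)
            visits[layout_index(0, b, x, M, 1)]++;
    return 1;
}
```

With M = 3, N = 4, Q = 2, the column scheme needs Q = 2 calls and the row scheme needs one, and each scheme visits every one of the 24 elements exactly once. With the column parameters, signal b = M would start at index M, which is the first element of the second row, so a larger batch cannot reach the next slice.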
For the record, the following code provides a fully worked example of how to perform 1D FFTs of the columns of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
if (code != cudaSuccess)
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { N }; // --- Size of the Fourier transform
int istride = M, ostride = M; // --- Distance between two successive input/output elements
int idist = 1, odist = 1; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = M; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
for (int k=0; k<Q; k++)
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data()) + k * M * N), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data()) + k * M * N), CUFFT_FORWARD);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
The situation is different when you want to perform 1D transforms of the rows. In that case, the distance between two consecutive elements is 1, while the distance between two consecutive signals is M. This allows you to set up N * Q transforms in one plan and then invoke cufftExecC2C only once. For the record, the code below provides a full example of 1D transforms of the rows of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
if (code != cudaSuccess)
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { M }; // --- Size of the Fourier transform
int istride = 1, ostride = 1; // --- Distance between two successive input/output elements
int idist = M, odist = M; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = N * Q; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data())), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data())), CUFFT_FORWARD);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
I guess idist=nx*ny could also be used to jump a whole xy plane, so that batch=nz covers one yz plane per call (looping over the nx starting offsets). Whether to loop over x or over z should then depend on whether nx or nz is larger.

Imlib2 error when cross-compiling for an ARM target

I am trying to port X11 to an ARM processor, and I am using the Imlib2 library for JPEG pictures. I have successfully cross-compiled the Imlib2 library with X Windows for ARM, and my sample program also built successfully. But when I run the binary, the JPEG is not displayed properly (it shows an image upload error):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <X11/Xlib.h>
#include <Imlib2.h>
int main(int argc, char **argv)
Imlib_Image img;
Display *dpy;
Pixmap pix;
Window root;
Screen *scn;
int width, height;
const char *filename = NULL;
if (argc < 2)
return 0;
filename = argv[1];
img = imlib_load_image(filename);
printf("img values %x",img);
if (!img) {
fprintf(stderr, "%s:Unable to load image\n", filename);
return 0;
width = imlib_image_get_width();
height = imlib_image_get_height();
dpy = XOpenDisplay(NULL);
if (!dpy)
return 0;
scn = DefaultScreenOfDisplay(dpy);
root = DefaultRootWindow(dpy);
pix = XCreatePixmap(dpy, root, width, height,
imlib_render_image_on_drawable(0, 0);
Window w = XCreateSimpleWindow(dpy, root, 0, 0, width, height, 0, None, None);
XSelectInput(dpy, w, ExposureMask);
XMapWindow(dpy, w);
//XCopyPlane(dpy, pix, w, gc, 0, 0, width, height, 0, 0, DefaultDepthOfScreen(scn));
XSetWindowBackgroundPixmap(dpy, w, pix);
XClearWindow(dpy, root);
XEvent ev;
while (XNextEvent(dpy, &ev)) {
if( ev.type == Expose )
//XCopyPlane(dpy, pix, w, gc, 0, 0, width, height, 0, 0, DefaultDepthOfScreen(scn));
printf("Expose called\n");
XFreePixmap(dpy, pix);
return 0;
fprintf(stderr, "usage: %s <image_file>\n", argv[0]);
return 1;

2nd order IIR filter, coefficients for a butterworth bandpass (EQ)?

Important update: I already figured out the answers and put them in this simple open-source library: Check it out, it will probably save you quite some time if you're having trouble with audio filters in iOS!
I have created a (realtime) audio buffer (float *data) that holds a few sin(theta) waves with different frequencies.
The code below shows how I created my buffer, and I've tried to do a bandpass filter but it just turns the signals to noise/blips:
// Multiple signal generator
__block float *phases = nil;
[audioManager setOutputBlock:^(float *data, UInt32 numFrames, UInt32 numChannels)
float samplingRate = audioManager.samplingRate;
NSUInteger activeSignalCount = [tones count];
// Initialize phases
if (phases == nil) {
phases = new float[10];
for(int z = 0; z <= 10; z++) {
phases[z] = 0.0;
// Multiple signals
NSEnumerator * enumerator = [tones objectEnumerator];
id frequency;
UInt32 c = 0;
while(frequency = [enumerator nextObject])
for (int i=0; i < numFrames; ++i)
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
float theta = phases[c] * M_PI * 2;
if (c == 0) {
data[i*numChannels + iChannel] = sin(theta);
} else {
data[i*numChannels + iChannel] = data[i*numChannels + iChannel] + sin(theta);
phases[c] += 1.0 / (samplingRate / [frequency floatValue]);
if (phases[c] > 1.0) phases[c] = -1;
// Normalize data with active signal count
float signalMulti = 1.0 / (float(activeSignalCount) * (sqrt(2.0)));
vDSP_vsmul(data, 1, &signalMulti, data, 1, numFrames*numChannels);
// Apply master volume
float volume = masterVolumeSlider.value;
vDSP_vsmul(data, 1, &volume, data, 1, numFrames*numChannels);
if (fxSwitch.isOn) {
// H(s) = (s/Q) / (s^2 + s/Q + 1)
// BW 2.0 Q 0.667
//The order of the coefficients are, B1, B2, A1, A2, B0.
float Fs = samplingRate;
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
float Q = 0.50f;
float alpha = sin(omega)/(2*Q); // sin(w0)/(2*Q)
// Through H
for (int i=0; i < numFrames; ++i)
for (int iChannel = 0; iChannel < numChannels; ++iChannel)
data[i*numChannels + iChannel] = (data[i*numChannels + iChannel]/Q) / (pow(data[i*numChannels + iChannel],2) + data[i*numChannels + iChannel]/Q + 1);
float b0 = alpha;
float b1 = 0;
float b2 = -alpha;
float a0 = 1 + alpha;
float a1 = -2*cos(omega);
float a2 = 1 - alpha;
float *coefficients = (float *) calloc(5, sizeof(float));
coefficients[0] = b1;
coefficients[1] = b2;
coefficients[2] = a1;
coefficients[3] = a2;
coefficients[3] = b0;
vDSP_deq22(data, 2, coefficients, data, 2, numFrames);
// Measure dB
[self measureDB:data:numFrames:numChannels];
My aim is to make a 10-band EQ for this buffer, using vDSP_deq22, the syntax of the method is:
vDSP_deq22(<float *vDSP_A>, <vDSP_Stride vDSP_I>, <float *vDSP_B>, <float *vDSP_C>, <vDSP_Stride vDSP_K>, <vDSP_Length __vDSP_N>)
float *vDSP_A is the input data
float *vDSP_B are 5 filter coefficients
float *vDSP_C is the output data
I have to make 10 filters (10 times vDSP_deq22). Then I set the gain for every band and combine them back together. But what coefficients do I feed every filter? I know vDSP_deq22 is a 2nd order (butterworth) IIR filter, but how do I turn this into a bandpass?
Now I have three questions:
a) Do I have to de-interleave and interleave the audio buffer? I know that setting the stride to 2 filters just one channel, but how do I filter the other one? A stride of 1 would process both channels as one.
b) Do I have to transform/process the buffer before it enters the vDSP_deq22 method? If so, do I also have to transform it back to normal?
c) What values of the coefficients should I set to the 10 vDSP_deq22s?
I've been trying for days now but I haven't been able to figure this one out, please help me out!
Your omega value needs to be normalised, i.e. expressed as a fraction of Fs. It looks like you left out the f0 when you calculated omega, which will make alpha wrong too:
float omega = 2*M_PI*Fs; // w0 = 2*pi*f0/Fs
should probably be:
float omega = 2*M_PI*f0/Fs; // w0 = 2*pi*f0/Fs
where f0 is the centre frequency in Hz.
For your 10 band equaliser you'll need to pick 10 values of f0, spaced logarithmically, e.g. 25 Hz, 50 Hz, 100 Hz, 200 Hz, 400 Hz, 800 Hz, 1.6 kHz, 3.2 kHz, 6.4 kHz, 12.8 kHz.