Reading TIFF file BitsPerSample and ImageWidth - objective-c

I'm writing an Objective-C application that needs to work with TIFF images much faster than NSImage does (merging images with NSImage, for example, costs a lot of memory), so I'm starting with a TIFF reader/writer written in a combination of Objective-C and C for best performance.
By reading Adobe's documentation for the TIFF format, I've been able to read all the metadata of my TIFF images, with two exceptions: BitsPerSample and ImageWidth. While the other tags returned plausible values, the BitsPerSample value/offset reads as hexadecimal fffffffffffffffe (i.e. -2). That tag holds 3 values of size 2 (short), which means the field should contain an offset, but since the beginning of the TIFF file is offset 0 and the file weighs 5,846,655 bytes, that offset would be invalid even read as unsigned (18446744073709551614).
In the same way, ImageWidth returned 944 while the actual image width is 1200. Since I can detect that a file is a TIFF by obtaining the short value at index 2 with length 2 and comparing it to 42, I assume my shortIntegerFromBytesAtRangeWithEndian function is working.
unsigned short shortIntegerFromBytesAtRangeWithEndian(char* bytes, unsigned long start, unsigned long length, int endian) {
    unsigned short returnedInt = 0;
    BOOL isBigEndian = endian == TIFF_IMAGE_ENDIAN_BIG;
    for (unsigned long index = isBigEndian ? 0 : length-1 ; index < length; index += endian){
        returnedInt = (returnedInt << BYTE_SIZE) + bytes[index + start];
    }
    return returnedInt;
}
endian is -1 for little endian and 1 for big endian.
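As an illustration, here is a minimal sketch of how that magic-number check might look (assuming the file contents are in a char buffer named bytes, and that TIFF_IMAGE_ENDIAN_LITTLE and TIFF_IMAGE_ENDIAN_BIG are defined as -1 and 1, per the convention above):
// Hypothetical usage: detect the byte order from the first two header bytes
// ("II" = little endian, "MM" = big endian), then verify the TIFF magic number.
int endian = (bytes[0] == 'M') ? TIFF_IMAGE_ENDIAN_BIG : TIFF_IMAGE_ENDIAN_LITTLE;
unsigned short magic = shortIntegerFromBytesAtRangeWithEndian(bytes, 2, 2, endian);
BOOL isTiff = (magic == 42);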
Are these variables read in a different way than other values? These are the values of the image in hexadecimal and decimal, with the size in bytes (I've omitted the offset values since some of them are too big):
? (-2): 0 (0) - Size: 4
ImageWidth (256): 3b0 (944) - Size: 2
ImageLength (257): 320 (800) - Size: 2
Compression (259): 1 (1) - Size: 2
PhotometricInterpretation (262): 2 (2) - Size: 2
StripOffsets (273): 5a6c (23148) - Size: 4
Orientation (274): 1 (1) - Size: 2
SamplesPerPixel (277): 3 (3) - Size: 2
RowsPerStrip (278): 320 (800) - Size: 2
StripByteCounts (279): 56e400 (5694464) - Size: 4
PlanarConfiguration (284): 1 (1) - Size: 2
ResolutionUnit (296): 2 (2) - Size: 2
? (34665): 583e6c (5783148) - Size: 4

Largest set of different byte values unique when clearing bits

I am creating a data format, which will be stored in a DS2431 1-wire EEPROM. One page will be using EPROM emulation mode (where data once written can only be modified by clearing bits). In this page I want to store a byte with an ID, which cannot be changed to another valid value (due to only allowing clearing bits).
I am considering using the set of values that have a popcount of 4 (there are 70 different values). Clearing any bits means popcount is no longer 4, so this satisfies the desired property.
But could a set of byte values be found with more than 70 different values, that satisfy the property?
No. For an 8-bit value, using four bits is optimal.
If you have your 70 4-bit values and decide to add a 5-bit value as valid, you have to give up the five 4-bit values that can be created from it by clearing a bit (for example, admitting 00011111 forces you to drop 00001111, 00010111, 00011011, 00011101 and 00011110). Similarly, if you want a valid 3-bit value, you also have to give up five 4-bit values.
If you could increase the number of bits, then you can increase the ratio of possible values to bits used.
Since there are only 256 possible values and 9 possible population counts (0 through 8), it is a trivial task to test them all:
#include <stdio.h>
#include <stdint.h>

int popcount( uint8_t byte )
{
    int count = 0 ;
    for( uint8_t b = 0x01; b != 0; b <<= 1 )
    {
        count = count + (((byte & b) != 0) ? 1 : 0) ;
    }
    return count ;
}

int main()
{
    int valuecount[9] = {0} ;  // 9 possible counts: 0..8
    for( int i = 0; i < 256; i++ )
    {
        valuecount[popcount(i)]++ ;
    }
    printf( "popcount\tvalues\n") ;
    for( int p = 0; p < 9; p++ )
    {
        printf( " %d\t\t %d\n", p, valuecount[p] ) ;
    }
    return 0;
}
Result:
popcount values
0 1
1 8
2 28
3 56
4 70
5 56
6 28
7 8
8 1
The optimum population count for any word length n is always n / 2; this is an instance of Sperner's theorem, which says the largest such set (an antichain under bit-clearing) has C(n, n/2) elements. For 16 bits the number of values with 8 one-bits is C(16, 8) = 12870.
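A quick check of that claim, as a standalone sketch computing C(n, n/2) with the multiplicative formula:
#include <stdio.h>
#include <stdint.h>

// Binomial coefficient C(n, k) via the multiplicative formula.
// Each intermediate product is itself a binomial coefficient,
// so the division at every step is exact.
uint64_t binomial( unsigned n, unsigned k )
{
    uint64_t result = 1 ;
    for( unsigned i = 1; i <= k; i++ )
    {
        result = result * (n - k + i) / i ;
    }
    return result ;
}

int main()
{
    // Largest antichain size for common word lengths: C(n, n/2)
    for( unsigned n = 8; n <= 32; n += 8 )
    {
        printf( "%2u bits: %llu values\n", n, (unsigned long long)binomial( n, n / 2 ) ) ;
    }
    return 0;
}
This prints 70 for 8 bits, 12870 for 16 bits, 2704156 for 24 bits, and 601080390 for 32 bits.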

CUDA 5.0 Replay Overhead

I am a novice CUDA programmer. I recently learned more about achieving better performance at lower occupancy. Here is a code snippet; I need help understanding a few things about replay overhead and instruction-level parallelism (ILP).
__global__ void myKernel(double *d_dst, double *d_a1, double *d_a2, size_t SIZE)
{
    int tId = threadIdx.x + blockDim.x * blockIdx.x;
    d_dst[tId] = d_a1[tId] * d_a2[tId];
    d_dst[tId + SIZE] = d_a1[tId + SIZE] * d_a2[tId + SIZE];
    d_dst[tId + SIZE * 2] = d_a1[tId + SIZE * 2] * d_a2[tId + SIZE * 2];
    d_dst[tId + SIZE * 3] = d_a1[tId + SIZE * 3] * d_a2[tId + SIZE * 3];
}
This is my simple kernel, which multiplies two 2D arrays to form a third 2D array (from a logical perspective), where all of these arrays are laid out as flat 1D arrays in device memory.
Below is another code snippet, the host side:
void doCompute() {
    double *h_a1;
    double *h_a2;
    size_t SIZE = pow(31, 3) + 1;
    // Imagine h_a1, h_a2 as 2D arrays
    // with 4 rows and SIZE columns.
    // For convenience created as 1D arrays.
    h_a1 = (double *) malloc(SIZE * 4 * sizeof(double));
    h_a2 = (double *) malloc(SIZE * 4 * sizeof(double));
    // memset fills bytes, so it cannot set doubles to 5.0; fill explicitly
    for (size_t i = 0; i < SIZE * 4; i++) {
        h_a1[i] = 5.0;
        h_a2[i] = 5.0;
    }
    double *d_dst;
    double *d_a1;
    double *d_a2;
    cudaMalloc(&d_dst, SIZE * 4 * sizeof(double));
    cudaMalloc(&d_a1, SIZE * 4 * sizeof(double));
    cudaMalloc(&d_a2, SIZE * 4 * sizeof(double));
    cudaMemcpy(d_a1, h_a1, SIZE * 4 * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_a2, h_a2, SIZE * 4 * sizeof(double), cudaMemcpyHostToDevice);
    int BLOC_SIZE = 32;
    int GRID_SIZE = (SIZE + BLOC_SIZE - 1) / BLOC_SIZE;
    myKernel <<< GRID_SIZE, BLOC_SIZE >>> (d_dst, d_a1, d_a2, SIZE);
}
Q1) Am I breaking any coalesced memory access pattern here?
Q2) Can I say that the memory accesses, the way they are coded in the kernel, are also an example of instruction-level parallelism? If so, am I using ILP2 or ILP4, and why?
Q3) If everything I am doing is right, then why does the nvvp profiler give me the following messages?
Total Replay Overhead: 4.6%
Global Cache Replay Overhead: 30.3%
How can I reduce them or fix them?
Cheers,
The compiler has a limited ability to schedule instructions for possible ILP exploitation. The GPU itself must also have ILP capability, and the extent of this varies by GPU generation. Yes, any resource that is not available can cause a warp to stall, the typical one being data required from memory. The definitions of the replay quantities you're asking about are given here.
So, for example, the global cache replay overhead will be triggered by a cache miss, and your code is going to have some cache misses. Cache misses are possible even though you have 100% coalesced access and (nearly) 100% bandwidth utilization efficiency.
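One experiment (a sketch, not a guaranteed fix) is to hoist the independent loads ahead of the stores, so all eight loads can be issued before any multiply or store; whether this helps depends on the architecture and on what the compiler was already doing:
// Same computation with the four independent load pairs made explicit.
__global__ void myKernelILP(double *d_dst, double *d_a1, double *d_a2, size_t SIZE)
{
    int tId = threadIdx.x + blockDim.x * blockIdx.x;
    double a0 = d_a1[tId];            double b0 = d_a2[tId];
    double a1 = d_a1[tId + SIZE];     double b1 = d_a2[tId + SIZE];
    double a2 = d_a1[tId + SIZE * 2]; double b2 = d_a2[tId + SIZE * 2];
    double a3 = d_a1[tId + SIZE * 3]; double b3 = d_a2[tId + SIZE * 3];
    d_dst[tId]            = a0 * b0;
    d_dst[tId + SIZE]     = a1 * b1;
    d_dst[tId + SIZE * 2] = a2 * b2;
    d_dst[tId + SIZE * 3] = a3 * b3;
}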

accelerate framework cepstrum peak find

I'm trying to find peak values of cepstrum analysis with the Accelerate framework. I always get peak values at the beginning or end of the frame. I'm analysing audio from the microphone in real time. What is wrong with my code? My code is below:
OSStatus microphoneInputCallback (void *inRefCon,
                                  AudioUnitRenderActionFlags *ioActionFlags,
                                  const AudioTimeStamp *inTimeStamp,
                                  UInt32 inBusNumber,
                                  UInt32 inNumberFrames,
                                  AudioBufferList *ioData){
    // get reference of test app we need for test app attributes
    TestApp *this = (TestApp *)inRefCon;
    COMPLEX_SPLIT complexArray = this->fftA;
    void *dataBuffer = this->dataBuffer;
    float *outputBuffer = this->outputBuffer;
    FFTSetup fftSetup = this->fftSetup;
    uint32_t log2n = this->fftLog2n;
    uint32_t n = this->fftN; // 4096
    uint32_t nOver2 = this->fftNOver2;
    uint32_t stride = 1;
    int bufferCapacity = this->fftBufferCapacity; // 4096
    SInt16 index = this->fftIndex;
    OSStatus renderErr;
    // observation objects
    float *observerBufferRef = this->observerBuffer;
    int observationCountRef = this->observationCount;
    renderErr = AudioUnitRender(rioUnit, ioActionFlags,
                                inTimeStamp, bus1, inNumberFrames, this->bufferList);
    if (renderErr < 0) {
        return renderErr;
    }
    // Fill the buffer with our sampled data. If we fill our buffer, run the
    // fft.
    int read = bufferCapacity - index;
    if (read > inNumberFrames) {
        memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, inNumberFrames*sizeof(SInt16));
        this->fftIndex += inNumberFrames;
    } else {
        // If we enter this conditional, our buffer will be filled and we should PERFORM FFT.
        memcpy((SInt16 *)dataBuffer + index, this->bufferList->mBuffers[0].mData, read*sizeof(SInt16));
        // Reset the index.
        this->fftIndex = 0;
        /*************** FFT ***************/
        // multiply by window
        vDSP_vmul((SInt16 *)dataBuffer, 1, this->window, 1, this->outputBuffer, 1, n);
        // We want to deal with only floating point values here.
        vDSP_vflt16((SInt16 *) dataBuffer, stride, (float *) outputBuffer, stride, bufferCapacity );
        /**
         Look at the real signal as an interleaved complex vector by casting it.
         Then call the transformation function vDSP_ctoz to get a split complex
         vector, which for a real signal, divides into an even-odd configuration.
         */
        vDSP_ctoz((COMPLEX*)outputBuffer, 2, &complexArray, 1, nOver2);
        // Carry out a Forward FFT transform.
        vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n, FFT_FORWARD);
        vDSP_ztoc(&complexArray, 1, (COMPLEX *)outputBuffer, 2, nOver2);
        complexArray.imagp[0] = 0.0f;
        vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, nOver2);
        bzero(complexArray.imagp, (nOver2) * sizeof(float));
        // scale
        float scale = 1.0f / (2.0f*(float)n);
        vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, nOver2);
        // step 2 get log for cepstrum
        float *logmag = malloc(sizeof(float)*nOver2);
        for (int i=0; i < nOver2; i++)
            logmag[i] = logf(sqrtf(complexArray.realp[i]));
        // configure float array into acceptable input array format (interleaved)
        vDSP_ctoz((COMPLEX*)logmag, 2, &complexArray, 1, nOver2);
        // create cepstrum
        vDSP_fft_zrip(fftSetup, &complexArray, stride, log2n-1, FFT_INVERSE);
        // convert interleaved to real
        float *displayData = malloc(sizeof(float)*n);
        vDSP_ztoc(&complexArray, 1, (COMPLEX*)displayData, 2, nOver2);
        float dominantFrequency = 0;
        int currentBin = 0;
        float dominantFrequencyAmp = 0;
        // find peak of cepstrum
        for (int i=0; i < nOver2; i++){
            // get current frequency magnitude
            if (displayData[i] > dominantFrequencyAmp) {
                // DLog("Bufferer filled %f", displayData[i]);
                dominantFrequencyAmp = displayData[i];
                currentBin = i;
            }
        }
        DLog("currentBin : %i amplitude: %f", currentBin, dominantFrequencyAmp);
    }
    return noErr;
}
I haven't worked with the Accelerate Framework, but your code appears to be taking the proper steps to calculate the Cepstrum.
The Cepstrum of real acoustic signals tends to have a very large DC component, a large peak at and near zero quefrency [sic]. Just ignore the near-DC portion of the Cepstrum and look for peaks above 20 Hz frequency (above quefrency of Cepstrum_Width/20Hz).
If the input signal contains a series of very closely spaced overtones, the Cepstrum will also have a large peak at the high quefrency end.
For example, the plot below shows the Cepstrum of a Dirichlet Kernel of N=128 and Width=4096, the spectrum of which is a series of very closely spaced overtones.
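In code, the restricted peak search might look like the following sketch. It reuses nOver2 and displayData from your snippet; a quefrency bin q corresponds to a pitch of sampleRate / q Hz, and the 50-1000 Hz bounds and the sampleRate variable are illustrative assumptions:
// Restrict the search to a plausible pitch range instead of scanning from bin 0.
int minBin = (int)(sampleRate / 1000.0f); // skip the near-DC region
int maxBin = (int)(sampleRate / 50.0f);   // skip the high-quefrency end
if (maxBin > nOver2) maxBin = nOver2;
int bestBin = minBin;
float bestAmp = displayData[minBin];
for (int i = minBin + 1; i < maxBin; i++) {
    if (displayData[i] > bestAmp) {
        bestAmp = displayData[i];
        bestBin = i;
    }
}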
You may want to use a static synthetic signal to test and debug your code. A good choice for a test signal is any sinusoid with a fundamental F and several overtones at exact integer multiples of F.
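For instance, a minimal C sketch of such a test signal (the function name and the 1/h amplitude rolloff are arbitrary choices):
#include <math.h>

// Generate n samples of a fundamental f0 plus its first `harmonics` exact
// integer-multiple overtones, with a simple 1/h amplitude rolloff.
void makeTestSignal(float *out, int n, float sampleRate, float f0, int harmonics) {
    for (int i = 0; i < n; i++) {
        float t = (float)i / sampleRate;
        float s = 0.0f;
        for (int h = 1; h <= harmonics; h++) {
            s += sinf(2.0f * (float)M_PI * f0 * (float)h * t) / (float)h;
        }
        out[i] = s;
    }
}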
Your Cepstra should look something like the following examples.
First a synthetic signal.
The plot below shows the Cepstrum of a synthetic steady-state E2 note, synthesized using a typical near-DC component, a fundamental at 82.4 Hz, and 8 harmonics at integer multiples of 82.4 Hz. The synthetic sinusoid was programmed to generate 4096 samples.
Observe the prominent non-DC peak at 12.36. The Cepstrum width is 1024 (the output of the second FFT), therefore the peak corresponds to 1024/12.36 = 82.8 Hz which is very close to 82.4 Hz the true fundamental frequency.
Now a real acoustical signal.
The plot below shows the Cepstrum of a real acoustic guitar's E2 note. The signal was not windowed prior to the first FFT. Observe the prominent non-DC peak at 542.9. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/542.9 = 60.4 Hz which is fairly far from 82.4 Hz the true fundamental frequency.
The plot below shows the Cepstrum of the same real acoustic guitar's E2 note, but this time the signal was Hann windowed prior to the first FFT. Observe the prominent non-DC peak at 268.46. The Cepstrum width is 32768 (the output of the second FFT), therefore the peak corresponds to 32768/268.46 = 122.1 Hz which is even farther from 82.4 Hz the true fundamental frequency.
The acoustic guitar's E2 note used for this analysis was sampled at 44.1 kHz with a high quality microphone under studio conditions; it contains essentially zero background noise, no other instruments or voices, and no post-processing.
References:
Real audio signal data, synthetic signal generation, plots, FFT, and Cepstral analysis were done here: Musical instrument cepstrum

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
                    read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
        }
    }
    write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on an NVS 4200M with a 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same as for all its neighbors in a 16x16 area, plus another, offset block of data that mostly overlaps with those of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int dx = pos.x % 16;
    int dy = pos.y % 16;
    __local uint4 local_src[16*16];
    __local uint4 local_src2[32*32];
    local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
    local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
    local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
        }
    }
    write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware, AMD Radeon HD 7850 gets 420GFLOPS, spec is 1751GFLOPS. To be fair the spec is for multiply-add, and there is no multiply here so the expected is ~875GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time

CL_SOURCE = '''
// kernel goes here
'''

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()

h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))

# warmup
for n in range(10):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()

# benchmark
t1 = time()
for n in range(100):
    event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
    event.wait()
t2 = time()
print("Duration (host): ", (t2-t1)/100)
print("Duration (event): ", (event.profile.end-event.profile.start)*1e-9)
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value for n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 KB; OpenCL 1.0 promises 16 KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area once and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer, i.e. uint4 buff[31][31]; the buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for the memory copy operations
do the same i,j for loops you had originally, except leave out the pos and pos0 values and use only i and j for the location; accumulate 'diff' in the same way you did originally
write the result to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
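Putting those steps together, a minimal sketch of what the kernel could look like, under the assumptions above (16x16 work groups, uint4 pixels; illustrative, not tested on your hardware):
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel_lds(
    read_only image2d_t src,
    write_only image2d_t dest,
    const int width,
    const int height
)
{
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    int lx = get_local_id(0);
    int ly = get_local_id(1);
    int2 pos0 = (int2)(pos.x - lx, pos.y - ly); // same for the whole group
    // 31x31 tile covering pos0 .. pos0+30, enough for every offset block
    // in the group (31*31*16 = 15376 bytes of LDS)
    __local uint4 buff[31][31];
    // cooperative load: 256 work items fill the 961 tile entries
    for (int idx = ly * 16 + lx; idx < 31 * 31; idx += 256)
    {
        int bx = idx % 31;
        int by = idx / 31;
        buff[by][bx] = read_imageui(src, sampler, (int2)(pos0.x + bx, pos0.y + by));
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    uint4 diff = (uint4)(0, 0, 0, 0);
    for (int i=0; i<16; i++)
    {
        for (int j=0; j<16; j++)
        {
            // buff[j][i] is src(pos0 + (i,j)); buff[j+ly][i+lx] is src(pos + (i,j))
            diff += buff[j][i] - buff[j + ly][i + lx];
        }
    }
    write_imageui(dest, pos, diff);
}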
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.
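For example, with the host program above you could estimate the requested read bandwidth like this (a sketch; the 2 * 256 reads of 16 bytes per work item follow from the kernel's inner loops, and cache hits mean actual DRAM traffic will be lower):
elapsed = (event.profile.end - event.profile.start) * 1e-9
bytes_requested = w * h * 2 * 256 * 16  # two uint4 reads per inner-loop iteration
print("Read requests: %.1f GB/s" % (bytes_requested / elapsed * 1e-9))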

Using CUDA to find the pixel-wise average value of a bunch of images

So I have a cube of images, 512x512x512. I want to sum up the images pixel-wise and save the result to a final image. So if all the pixels had value 1, the final image would be all 512s. I am having trouble understanding the indexing to do this in CUDA. I figure one thread's job will be to sum up all 512 values at its pixel, so the total thread count will be 512x512. So I plan to do it with 512 blocks of 512 threads each. From here, I am having trouble coming up with the indexing to sum over the depth. Any help will be greatly appreciated.
One way to solve this problem is to imagine the cube as a set of Z slices. The X, Y coordinates refer to the width and height of the image, and the Z coordinate to each slice in the Z dimension. Each thread will iterate over the Z coordinate to accumulate the values.
With this in mind, configure a kernel to launch a block of 16x16 threads and a grid of enough blocks to process the width and height of the image (I'm assuming a gray scale image with 1 byte per pixel):
#define THREADS 16
// kernel configuration
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
dim3 dimGrid = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_Data, WIDTH, HEIGHT, DEPTH);
If you know how to index a 2D array, looping through the Z dimension is also straightforward:
__global__ void kernel(unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
    // in your kernel map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // calculate the global index of a pixel into the image array;
    // this global index points into the first slice of the cube
    int idx = x + y * WIDTH;
    // partial result; note the sum can reach 512 * 255, so the output
    // type must be wider than unsigned char
    unsigned int r = 0;
    // iterate in the Z dimension
    for (int z = 0; z < DEPTH; ++z)
    {
        // WIDTH * HEIGHT is the offset of one slice
        int idx_z = z * WIDTH*HEIGHT + idx;
        r += i_data[ idx_z ];
    }
    // o_data is a 2D array, so you can use the global index idx
    o_data[ idx ] = r;
}
This is a naive implementation. In order to maximize memory throughput, the data should be properly aligned.
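For instance, one option is a pitched allocation so each row starts on an aligned boundary (a sketch; the kernel indexing would then use the returned pitch instead of WIDTH):
size_t pitch;
unsigned char *d_i_data;
// allocate DEPTH stacked slices of HEIGHT rows, each row padded to `pitch` bytes
cudaMallocPitch((void **)&d_i_data, &pitch, WIDTH * sizeof(unsigned char), HEIGHT * DEPTH);
// row y of slice z then starts at d_i_data + (z * HEIGHT + y) * pitch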
This can be done easily using the ArrayFire GPU library (free). In ArrayFire you can construct 3D arrays like the following:
Two approaches:
// Method 1:
array data = rand(x, y, z);
// Just reshaping the array; this is a no-op
data = newdims(data, x * y, z, 1);
// Pixel-wise sum across the z dimension (each of the x*y rows is one pixel)
array res = sum(data, 1);

// Method 2:
// Use ArrayFire "GFOR"
array data = rand(x, y, z);
array res = zeros(z, 1);
gfor (array i, z) {
    // note: this reduces each slice to a single total, one value per slice
    res(i) = sum(sum(data(span, span, i)));
}