PyCUDA large nonuniform matrix operations

PyCUDA large nonuniform matrix operations - indexing

I am working with large, nonuniform matrices and am running into what I believe is a mismatch in how the elements are indexed.
In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of size 250000 and 25000 respectively. The code also hardcodes sr=25.
My code is attempting to complex multiply tx into echo along different stretches, depending on specified ranges and value of sr. This will then be stored in an array S.
After searching through some other people's examples, I found a way of building blocks and grids here that I thought would work well. I'm unfamiliar with C code, but have been trying to learn over the past week. Here is my code:
#!/usr/bin/python
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
((echo,tx)) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/block_dim_x).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/block_dim_y).astype(np.int32).item()
kernel_code="""
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
int *ranges, int sr)
{
unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int threads_in_block = blockDim.x * blockDim.y;
unsigned long int idx = threads_in_block * block_num + thread_num;
//aligning the i,j to idx, something is mismatched?
int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
((block_num % gridDim.x) * blockDim.x);
int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
((block_num / gridDim.x) * blockDim.y);
result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
}
"""
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri,r in enumerate(ranges):
compare[ri,:] = echo[txidx+r*sr]*tx
print np.subtract(S, compare)
At the bottom here, I've put in a CPU implementation of what I'm attempting to accomplish and put in a subtraction. The result is that the very first and very last elements come out as 0+0j, but the rest do not. The kernel is attempting to align an i and j to the idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why might the result not come out as all 0+0j as I intend?
Edit:
Trying a little example to get a better grasp of how the arrays are being indexed with this block/grid configuration, I stumbled upon something very strange. Before trying to index the elements, I just wanted to run a little test multiplication. It seems like my block/grid covers all of the ary_in matrix, but the result ends up only doubling the top half of ary_in, and the bottom half returns whatever was left over in memory from the previous calculation.
If I change blocks_x to 4 so that I cover more space than needed, however, the doubling works fine. If I then run it with a 4x4 grid, with * 3 instead, it'll work out fine with ary_out as ary_in tripled. When I run it again with a 2x4 grid and only doubling, the top half of ary_out returns the doubled ary_in, but the bottom half returns the previous result in memory, a tripled value instead. I would understand this to be something in my index/block/grid mapping wrongly to the values, but I can't figure out what.
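import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
ary_in = np.arange(128).reshape((8,16))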
print ary_in
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{
unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int threads_in_block = blockDim.x * blockDim.y;
unsigned int idx = threads_in_block * block_num + thread_num;
if (idx < n) {
// ary_out[idx] = thread_num;
ary_out[idx] = ary_in[idx] * 2;
}
}
""")
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
print ary_out
FINAL EDIT:
I figured out the problems. In the edit just above, ary_in is an int64 array by default, which does not match the 32-bit int declared in the C code. As a result the kernel only covered half of the array's bytes, so only the top half was operated on. Adding a .astype(np.int32) solved this problem.
This allowed me to figure out the ordering of the indexing in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to get this working with non even division of block dimensions into the output array (e.g. 16x16), even with an if (idx

I figured out the problems. In the edit just above, ary_in is an int64 array by default, which does not match the 32-bit int declared in the C code. As a result the kernel only covered half of the array's bytes, so only the top half was operated on. Adding a .astype(np.int32) solved this problem.
This allowed me to figure out the ordering of the indexing in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
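For the case where the block dimensions do not divide the output shape evenly, one common pattern is to round the grid size up and guard on the row and column separately, instead of guarding on the flat idx. The sketch below emulates such a launch in plain Python rather than CUDA so the mapping can be checked on the CPU. The function name launch and the choice of mapping x to columns are assumptions made for illustration, not taken from the code above.

```python
def launch(rows, cols, block=(4, 4)):
    """Emulate a 2D CUDA launch; return how often each output cell is written."""
    bx, by = block
    # Round the grid up so partial blocks at the edges are still launched.
    grid_x = (cols + bx - 1) // bx
    grid_y = (rows + by - 1) // by
    hits = [[0] * cols for _ in range(rows)]
    for block_y in range(grid_y):
        for block_x in range(grid_x):
            for ty in range(by):
                for tx in range(bx):
                    j = block_x * bx + tx      # column handled by this thread
                    i = block_y * by + ty      # row handled by this thread
                    if i < rows and j < cols:  # guard: skip threads past the edge
                        idx = i * cols + j     # row-major flat index into the output
                        hits[idx // cols][idx % cols] += 1
    return hits


# A 5x7 output is not a multiple of the 4x4 block in either direction.
hits = launch(5, 7)
print(all(count == 1 for row in hits for count in row))
```

Computing idx from i and j after the guard means threads in the padding region never produce an index, which is why every cell is written exactly once even when the blocks overhang the array.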


CUDA profiling - high shared transactions/access but low local replay rate

After running the Visual Profiler, guided analysis tells me that I'm memory-bound, and that in particular my shared memory accesses are poorly aligned/accessed - basically every line I access shared memory is marked as ~2 transactions per access.
However, I couldn't figure out why that was the case (my shared memory is padded/strided so that there shouldn't be bank conflicts), so I went back and checked the shared replay metric - and that says that only 0.004% of shared accesses are replayed.
So, what's going on here, and what should I be looking at to speed up my kernel?
EDIT: Minimal reproduction:
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gp
mod = SourceModule("""
typedef unsigned char ubyte;
__global__ void identity(ubyte *arr, int stride)
{
const int dim2 = 16;
const int dim1 = 64;
const int dim0 = 33;
int shrstrd1 = dim2;
int shrstrd0 = dim1 * dim2;
__shared__ ubyte shrarr[dim0 * dim1 * dim2];
auto shrget = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k) -> int{
return shrarr[i * shrstrd0 + j * shrstrd1 + k];
};
auto shrset = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k, ubyte val) -> void {
shrarr[i * shrstrd0 + j * shrstrd1 + k] = val;
};
int in_x = threadIdx.x;
int in_y = threadIdx.y;
shrset(in_y, in_x, 0, arr[in_y * stride + in_x]);
arr[in_y * stride + in_x] = shrget(in_y, in_x, 0);
}
""",
options=['-std=c++11'])
#Equivalent to identity<<<1, dim3(32, 32, 1)>>>(arr, 64);
identity = mod.get_function("identity")
identity(gp.zeros((64, 64), np.ubyte), np.int32(64), block=(32, 32, 1))
2 transactions per access, shared replay overhead 0.083. Decreasing dim2 to 8 makes the problem go away, which I also don't understand.

Partial answer: I had a fundamental misunderstanding of how shared memory banks worked (namely, that they are banks of around a thousand byte-banks each) and so didn't realize that they looped around, so that too much padding meant that 32 row elements might end up using each bank more than once.
Presumably, though, that conflict just didn't come up every time - instead it came up, oh, about 85 times a block, from the numbers.
I'll leave this here for a day in hopes of a more complete explanation, then close and accept this answer.
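To make the wrap-around concrete, here is a small CPU-side model of bank assignment, written in Python. It assumes 32 banks and a 32-thread warp, and it takes the bank width as a parameter because that depends on the hardware and its configuration; the 8-byte setting in particular is an assumption, not something confirmed for the GPU in the question. The function name conflict_degree is made up for this sketch.

```python
def conflict_degree(stride_bytes, bank_bytes, num_banks=32, warp_size=32):
    """Worst-case number of distinct words a single bank must serve for one warp."""
    words_per_bank = {}
    for lane in range(warp_size):
        addr = lane * stride_bytes   # byte address touched by this lane
        word = addr // bank_bytes    # bank-width word that holds the byte
        bank = word % num_banks      # banks wrap around after num_banks words
        # Lanes hitting the same word are served together, so count distinct words.
        words_per_bank.setdefault(bank, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# In the kernel above, neighbouring threads in x are dim2 bytes apart.
for dim2 in (16, 8):
    print(dim2, conflict_degree(dim2, bank_bytes=8), conflict_degree(dim2, bank_bytes=4))
```

Under the 8-byte assumption the model gives 2 for dim2 = 16 and 1 for dim2 = 8, which is consistent with the figures reported in the question. That is suggestive rather than a confirmed diagnosis.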

Find nth int with 10 set bits

Find the nth int with 10 set bits
n is an int in the range 0<= n <= 30 045 014
The 0th int = 1023, the 1st = 1535 and so on
snob() same number of bits,
returns the lowest integer bigger than n with the same number of set bits as n
int snob(int n) {
int a=n&-n, b=a+n;
return b|(n^b)/a>>2;
}
calling snob n times will work
int nth(int n){
int o =1023;
for(int i=0;i<n;i++)o=snob(o);
return o;
}
example
https://ideone.com/ikGNo7
Is there some way to find it faster?
I found one pattern but not sure if it's useful.
using factorial you can find the "indexes" where all 10 set bits are consecutive
1023 << x = the (x+10)! / (x! * 10!) - 1 th integer
1023<<1 is the 10th
1023<<2 is the 65th
1023<<3 the 285th
...
Btw I'm not a student and this is not homework.
EDIT:
Found an alternative to snob()
https://graphics.stanford.edu/~seander/bithacks.html#NextBitPermutation
int lnbp(int v){
int t = (v | (v - 1)) + 1;
return t | ((((t & -t) / (v & -v)) >> 1) - 1);
}

I have built an implementation that should satisfy your needs.
/** A lookup table to see how many combinations preceded this one */
private static int[][] LOOKUP_TABLE_COMBINATION_POS;
/** The number of possible combinations with i bits */
private static int[] NBR_COMBINATIONS;
static {
LOOKUP_TABLE_COMBINATION_POS = new int[Integer.SIZE][Integer.SIZE];
for (int bit = 0; bit < Integer.SIZE; bit++) {
// Ignore less significant bits, compute how many combinations have to be
// visited to set this bit, i.e.
// (bit = 4, pos = 5), before came 0b1XXX and 0b1XXXX, that's C(3, 3) + C(4, 3)
int nbrBefore = 0;
// The nth-bit can be only encountered after pos n
for (int pos = bit; pos < Integer.SIZE; pos++) {
LOOKUP_TABLE_COMBINATION_POS[bit][pos] = nbrBefore;
nbrBefore += nChooseK(pos, bit);
}
}
NBR_COMBINATIONS = new int[Integer.SIZE + 1];
for (int bits = 0; bits < NBR_COMBINATIONS.length; bits++) {
NBR_COMBINATIONS[bits] = nChooseK(Integer.SIZE, bits);
assert NBR_COMBINATIONS[bits] > 0; // Important for modulo check. Otherwise we must use unsigned arithmetic
}
}
private static int nChooseK(int n, int k) {
assert k >= 0 && k <= n;
if (k > n / 2) {
k = n - k;
}
long nCk = 1; // (N choose 0)
for (int i = 0; i < k; i++) {
// (N choose K+1) = (N choose K) * (n-k) / (k+1);
nCk *= (n - i);
nCk /= (i + 1);
}
return (int) nCk;
}
public static int nextCombination(int w, int n) {
// TODO: maybe for small n just advance naively
// Get the position of the current pattern w
int nbrBits = 0;
int position = 0;
while (w != 0) {
final int currentBit = Integer.lowestOneBit(w); // w & -w;
final int bitPos = Integer.numberOfTrailingZeros(currentBit);
position += LOOKUP_TABLE_COMBINATION_POS[nbrBits][bitPos];
// toggle off bit
w ^= currentBit;
nbrBits++;
}
position += n;
// Wrapping, optional
position %= NBR_COMBINATIONS[nbrBits];
// And reverse lookup
int v = 0;
int m = Integer.SIZE - 1;
while (nbrBits-- > 0) {
final int[] bitPositions = LOOKUP_TABLE_COMBINATION_POS[nbrBits];
// Search for largest bitPos such that position >= bitPositions[bitPos]
while (Integer.compareUnsigned(position, bitPositions[m]) < 0)
m--;
position -= bitPositions[m];
v ^= (0b1 << m--);
}
return v;
}
Now for some explanation. LOOKUP_TABLE_COMBINATION_POS[bit][pos] is the core of the algorithm that makes it as fast as it is. The table is designed so that a bit pattern with k bits at positions p_0 < p_1 < ... < p_{k - 1} has a position of \sum_{i = 0}^{k - 1}{ LOOKUP_TABLE_COMBINATION_POS[i][p_i] }.
The intuition is that we try to move the bits back one by one until we reach the pattern where all bits are at the lowest possible positions. Moving the i-th bit from position k + 1 to k moves back by C(k-1, i-1) positions, provided that all lower bits are at the right-most position (no moving bits into or through each other), since we skip over all possible combinations with the i-1 bits in k-1 slots.
We can thus "decode" a bit pattern to a position, keeping track of the bits encountered. We then advance by n positions (rolling over in case we enumerated all possible positions for k bits) and encode this position again.
To encode a pattern, we reverse the process. For this, we move bits from their starting position forward, as long as the position is smaller than what we're aiming for. Instead of a linear search through LOOKUP_TABLE_COMBINATION_POS, we could employ a binary search for our target index m, but it's hardly needed since the size of an int is not big. Nevertheless, we reuse our invariant that a smaller bit must also come at a less significant position, so that our algorithm is effectively O(n) where n = Integer.SIZE.
I leave you with the following assertions to demonstrate the resulting algorithm:
nextCombination(0b1111111111, 1) == 0b10111111111;
nextCombination(0b1111111111, 10) == 0b11111111110;
nextCombination(0x00FF , 4) == 0x01EF;
nextCombination(0x7FFFFFFF , 4) == 0xF7FFFFFF;
nextCombination(0x03FF , 10) == 0x07FE;
// Correct wrapping
nextCombination(0b1 , 32) == 0b1;
nextCombination(0x7FFFFFFF , 32) == 0x7FFFFFFF;
nextCombination(0xFFFFFFEF , 5) == 0x7FFFFFFF;
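The "decode" half of the explanation can be sketched compactly in Python. Summing C(q, bit) for q from bit up to pos - 1, as the static initializer does, equals C(pos, bit + 1) by the hockey-stick identity, so this sketch uses math.comb in place of the table. The function name rank is an assumption for illustration and is not part of the Java code above.

```python
from math import comb


def rank(w):
    """Position of w among all ints with the same number of set bits, in increasing order."""
    position = 0
    i = 0                            # number of lower set bits already consumed
    while w:
        low = w & -w                 # isolate the lowest set bit
        p = low.bit_length() - 1     # its bit position
        position += comb(p, i + 1)   # closed form of the table entry [i][p]
        w ^= low                     # toggle the bit off
        i += 1
    return position


print(rank(0b1111111111), rank(0b10111111111), rank(0b11111111110))
```

The ranks of the patterns in the first two assertions differ from rank(0b1111111111) by 1 and 10 respectively, matching the second argument passed to nextCombination there.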

Let us consider the numbers with k=10 bits set.
The trick is to determine the rank of the most significant one, for a given n.
There is a single number of length k: C(k, k)=1. Counting everything of length at most k + 1 gives k+1 = C(k+1, k) numbers. ... In general there are C(m, k) numbers of length at most m.
For k=10, the limits on n are the partial sums of 1 + 10 + 55 + 220 + 715 + 2002 + 5005 + 11440 + ...
For a given n, you easily find the corresponding m. Then the problem is reduced to finding the n - C(m, k)-th number with k - 1 bits set. And so on recursively.
With precomputed tables, this can be very fast. 30045015 takes 30 lookups, so that I guess that the worst case is 29 x 30 / 2 = 435 lookups.
(This is based on linear lookups, to favor small values. By means of dichotomic search, you reduce this to less than 29 x lg(30) = 145 lookups at worst.)
Update:
My previous estimates were pessimistic. Indeed, as we are looking for k bits, there are only 10 determinations of m. In the linear case, at worst 245 lookups, in the dichotomic case, less than 50.
(I don't exclude off-by-one errors in the estimates, but clearly this method is very efficient and requires no snob.)
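A minimal Python sketch of this recursion, written as a loop, is shown below. It uses math.comb and a linear search instead of precomputed tables, so it illustrates the idea and is not a tuned implementation. The name nth_with_k_bits is an assumption for illustration.

```python
from math import comb


def nth_with_k_bits(n, k=10):
    """Return the n-th (0-based) non-negative integer with exactly k set bits."""
    result = 0
    for bits_left in range(k, 0, -1):
        # Find the largest position p such that C(p, bits_left) <= n.
        p = bits_left - 1
        while comb(p + 1, bits_left) <= n:
            p += 1
        result |= 1 << p             # the most significant remaining bit goes here
        n -= comb(p, bits_left)      # skip every pattern whose top bit is lower
    return result


print(nth_with_k_bits(0), nth_with_k_bits(1), nth_with_k_bits(10))
```

The outer loop runs k times, which matches the observation above that only 10 determinations of m are needed for k = 10.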

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow: only ~25 GFLOPS on an NVS4200M with 1k by 1k input (the hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data, which is the same block read by all its neighbors in a 16x16 area, and also another offset block that most of the time overlaps with those of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware, AMD Radeon HD 7850 gets 420GFLOPS, spec is 1751GFLOPS. To be fair the spec is for multiply-add, and there is no multiply here so the expected is ~875GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).

This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by nx15 (or 15xn). You need to calculate the largest possible value for n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32KB. OpenCL 1.0 promises 16KB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
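The arithmetic above can be reproduced with a few lines of Python. The constants mirror the figures in this answer (32KB of LDS, 16 bytes per uint4, a 15-pixel overlap); the helper name tile_stats is an assumption for illustration.

```python
from math import isqrt

LDS_BYTES = 32 * 1024   # local memory assumed available per work group
PIXEL_BYTES = 16        # sizeof(uint4)
HALO = 15               # a 16x16 window needs 15 extra pixels along each axis


def tile_stats(tile_w, tile_h):
    """Return (bytes of local memory used, work items served) for a cached tile."""
    used = tile_w * tile_h * PIXEL_BYTES
    items = (tile_w - HALO) * (tile_h - HALO)
    return used, items


n = isqrt(LDS_BYTES // PIXEL_BYTES)   # largest square edge that fits
print(n, tile_stats(n, n))            # 45 (32400, 900)
print(tile_stats(64, 32))             # (32768, 833)
```

Whether a device actually accepts 900 or 833 work items in a single group is a separate limit that has to be checked on the target hardware.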
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer ie: uint4 buff[31][31]; The buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
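The steps above can be checked on the CPU with a small Python emulation before porting them to OpenCL. This is a sketch under simplifying assumptions: each pixel is a single integer instead of a uint4, differences are ordinary signed integers and do not wrap like unsigned values, and the image is only 32x32. The helper names read_clamped, naive and tiled_group are made up for illustration.

```python
import random

TILE = 16           # work-group edge, as in the question
HALO = TILE - 1     # extra pixels needed to the right and below
N = TILE + HALO     # 31: edge of the cached region


def read_clamped(img, x, y):
    """Emulate CLK_ADDRESS_CLAMP_TO_EDGE for a 2D list indexed as img[y][x]."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]


def naive(img, x, y):
    """What a single work item computes in the original kernel."""
    x0, y0 = x - x % TILE, y - y % TILE
    return sum(read_clamped(img, x0 + i, y0 + j) - read_clamped(img, x + i, y + j)
               for i in range(TILE) for j in range(TILE))


def tiled_group(img, x0, y0):
    """One work group: fill a 31x31 buffer once, then serve all 256 items from it."""
    buff = [[read_clamped(img, x0 + bx, y0 + by) for bx in range(N)]
            for by in range(N)]
    out = {}
    for dy in range(TILE):
        for dx in range(TILE):
            # buff[0][0] is pos0, so only i, j and the item's local offset are needed.
            out[(x0 + dx, y0 + dy)] = sum(buff[j][i] - buff[j + dy][i + dx]
                                          for i in range(TILE) for j in range(TILE))
    return out


random.seed(0)
img = [[random.randrange(256) for _ in range(32)] for _ in range(32)]
result = {}
for gy in range(0, 32, TILE):
    for gx in range(0, 32, TILE):
        result.update(tiled_group(img, gx, gy))
print(result[(5, 9)] == naive(img, 5, 9))
```

In the emulation each group reads 31x31 = 961 source values instead of 256 work items each reading two 16x16 blocks, which is the saving the local buffer is meant to provide.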

Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.

Using CUDA to find the pixel-wise average value of a bunch of images

So I have a cube of images, 512x512x512. I want to sum up the images pixel-wise and save the result to a final image. So if all the pixels had value 1, the final image would be all 512. I am having trouble understanding the indexing to do this in CUDA. I figure one thread's job will be to sum up all 512 values at its pixel, so the total number of threads will be 512x512. So I plan to do it with 512 blocks, with 512 threads each. From here, I am having trouble coming up with the indexing to sum over the depth. Any help will be greatly appreciated.

One way to solve this problem is to imagine the cube as a set of Z slices. The X and Y coordinates refer to the width and height of the image, and the Z coordinate to each slice in the Z dimension. Each thread iterates over the Z coordinate to accumulate the values.
With this in mind, configure a kernel to launch a block of 16x16 threads and a grid of enough blocks to process the width and height of the image (I'm assuming a gray scale image with 1 byte per pixel):
#define THREADS 16
// kernel configuration
dim3 dimBlock = dim3 ( THREADS, THREADS, 1 );
dim3 dimGrid = dim3 ( WIDTH / THREADS, HEIGHT / THREADS );
// call the kernel
kernel<<<dimGrid, dimBlock>>>(i_data, o_data, WIDTH, HEIGHT, DEPTH);
If you are clear on how to index a 2D array, looping through the Z dimension should be clear as well:
__global__ void kernel(const unsigned char* i_data, unsigned int* o_data, int WIDTH, int HEIGHT, int DEPTH)
{
    // map from threadIdx/blockIdx to pixel position
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    // skip threads that fall outside the image (matters if the grid is ever rounded up)
    if (x >= WIDTH || y >= HEIGHT) return;
    // global index of the pixel within the first slice of the cube
    int idx = x + y * WIDTH;
    // running sum; the output is unsigned int because a sum over 512 slices
    // of 8-bit pixels does not fit in an unsigned char
    unsigned int r = 0;
    // iterate in the Z dimension
    for (int z = 0; z < DEPTH; ++z)
    {
        // WIDTH * HEIGHT is the offset of one slice
        int idx_z = z * WIDTH * HEIGHT + idx;
        r += i_data[idx_z];
    }
    // o_data is a 2D array, so the global index idx can be used directly
    o_data[idx] = r;
}
This is a naive implementation. In order to maximize memory throughput, the data should be properly aligned.
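The index arithmetic can be checked on the CPU with a short Python emulation, where each (x, y) pair plays the role of one thread. The flat layout (slice by slice, each slice row-major) is the same one the kernel assumes; the function name sum_over_depth is made up for this sketch.

```python
def sum_over_depth(cube, width, height, depth):
    """Sum a flat cube (slice by slice, each slice row-major) along Z."""
    out = [0] * (width * height)
    for y in range(height):              # each (x, y) pair stands in for one thread
        for x in range(width):
            idx = x + y * width          # index of the pixel within one slice
            total = 0
            for z in range(depth):
                # width * height is the offset from one slice to the next
                total += cube[z * width * height + idx]
            out[idx] = total
    return out


W, H, D = 4, 3, 5
ones = [1] * (W * H * D)
print(sum_over_depth(ones, W, H, D) == [D] * (W * H))
```

With every input value set to 1, each output pixel equals the depth, which is the 512 expected for the cube in the question.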

This can be done easily using the ArrayFire GPU library (free). In ArrayFire, you can construct 3D arrays and sum them using either of the following
two approaches:
// Method 1:
array data = rand(x,y,z);
// Just reshaping the array, this is a noop
data = newdims(data,x*y, z, 1);
// Sum of pixels
res = sum(data);
// Method 2:
// Use ArrayFire "GFOR"
array data = rand(x,y,z);
res = zeros(z,1);
gfor(array i, z) {
res(i) = sum(data(:,:,i));
}

Optimizing Vector elements swaps using CUDA

Since I am new to cuda .. I need your kind help
I have this long vector, for each group of 24 elements, I need to do the following:
for the first 12 elements, the even numbered elements are multiplied by -1,
for the second 12 elements, the odd numbered elements are multiplied by -1 then the following swap takes place:
Graph: because I don't yet have enough points, I couldn't post the image so here it is:
http://www.freeimagehosting.net/image.php?e4b88fb666.png
I have written this piece of code, and wonder if you could help me further optimize it to solve for divergence or bank conflicts ..
//subvector is a multiple of 24, Mds and Nds are shared memory
__shared__ double Mds[subVector];
__shared__ double Nds[subVector];
int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;
int basex = __umul24(blockDim.x, blockIdx.x);
Mds[tx] = M.elements[basex + tx];
__syncthreads();
// flip the signs
if (tx < (tx/24)*24 + 12)
{
//if < 12 and even
if ((tx & 0x0001)==0)
Mds[tx] = -Mds[tx];
}
else
if (tx < (tx/24)*24 + 24)
{
//if >12 and < 24 and odd
if ((tx & 0x0001)==1)
Mds[tx] = -Mds[tx];
}
__syncthreads();
if (tx < (tx/24)*24 + 6)
{
//for the first 6 elements .. swap with last six in the 24elements group (see graph)
Nds[tx] = Mds[tx_mod + 18];
Mds [tx_mod + 18] = Mds [tx];
Mds[tx] = Nds[tx];
}
else
if (tx < (tx/24)*24 + 12)
{
// for the second 6 elements .. swp with next adjacent group (see graph)
Nds[tx] = Mds[tx_mod + 6];
Mds [tx_mod + 6] = Mds [tx];
Mds[tx] = Nds[tx];
}
__syncthreads();
Thanks in advance ..

Paul gave you pretty good starting points in your previous questions.
A couple of things to watch out for: you are doing division by a number that is not a power of two, which is expensive.
Instead, try to use the multidimensional nature of the thread block. For example, make the x-dimension of size 24, which eliminates the need for division.
In general, try to make the thread block dimensions reflect your data dimensions.
Simplify the sign flipping: for example, if you do not want to flip the sign, you can still multiply by the identity 1. Figure out how to map even/odd numbers to 1 and -1 using just arithmetic: for example sign = (even*2+1) - 2, where even is either 1 or 0.
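As a hedged illustration of the arithmetic-only sign, here is a Python sketch that can be compared against the branchy version from the question. It treats element indices as 0-based, as the question's (tx & 0x0001) test does. The function names sign and sign_branchy are made up for this sketch, and the modulo in it would disappear if the block's x-dimension were 24, because pos would then simply be threadIdx.x.

```python
def sign(tx):
    """Branch-free sign for element tx within its group of 24."""
    pos = tx % 24        # position inside the 24-element group
    half = pos // 12     # 0 for the first 12 elements, 1 for the second 12
    odd = pos & 1        # 1 if the position is odd
    # Flip (-1) when parity matches the half: even in the first, odd in the second.
    return 2 * (odd ^ half) - 1


def sign_branchy(tx):
    """Reference version that mirrors the if/else logic in the question."""
    pos = tx % 24
    if pos < 12:
        return -1 if pos % 2 == 0 else 1
    return -1 if pos % 2 == 1 else 1


print(all(sign(tx) == sign_branchy(tx) for tx in range(48)))
```

Every thread then executes the same multiply, Mds[tx] *= sign, so the sign step no longer depends on a branch.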
