cuda filter with ouput of this block is the input of the next block - gpu

Working on a filter following, I am having a problem of doing these pieces of codes for processing an image in GPU:
for(int h=0; h<height; h++) {
for(int w=1; w<width; w++) {
image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
}
}
If I define:
dim3 threads_perblock(32, 32)
then each block I have: 32 threads can be communicated. The threads of this block can not communicate with the threads from other blocks.
Within a thread_block, I can translate that pieces of code using shared_memory however, for edge (I would say): image[0,31] and image[0,32] in different threadblocks. The image[0,31] should get value from image[0,32] to calculate its value. But they are in different threadblocks.
so that is the problem.
How would I solve this?
Thanks in advance.

If image is in global memory then there is no problem - you don't need to use shared memory and you can just access pixels directly from image without any problem.
However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.

You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute and the thread processing a[31] do
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.

Related

Exiting after N threads in a compute shader

So I have a compute shader kernel with the following logic:
[numthreads(64,1,1)]
void CVProjectOX(uint3 t : SV_DispatchThreadID){
if(t.x >= TotalN)
return;
uint compt = DbMap[t.x];
....
I do understand that it's not ideal to have ifs elses/branching in compute shaders? if so, what is the best way to limit thread work if number of total expected threads aren't expected to match exactly the kernel's numthreads?
For instance in my example, the kernel group of 64 threads, let's say I expect total 961 threads (it could be anything really), if, I dispatch 960, 1 db slot won't be processed, if I dispatch 1024, there will be 63 unnecessary work or maybe work pointing to non-existing db slot. (db slots number will vary).
Is if(t.x > TotalN)/return fine and the right approach here?
Should I just do min, tx = min(t.x, TotalN) and keep writing on the final db slot?
Should I just modulo? tx = t.x % TotalN and rewrite the first db slots?
What other solutions?
Limiting the number of threads this way is fine, yes. But, be aware that an early return like this doesn't actually save (as much) work as you'd expect:
The hardware utilizes SIMD like thread collections (called wavefonts in directX). Depending on the hardware, the usual size of such a wavefont is usually 4 (Intel iGPUs), 32 (NVidia and most AMD GPUs) or 64 (a few AMD GPUs). Due to the nature of SIMD, all threads in such a wavefont always do exactly the same work, you can only "mask out" some of them (meaning, their writes will be ignored and they are fine reading out-of-bounds memory).
This means that, in the worst case (when the wavefont size is 64), when you need to execute 961 threads and are therefore dispatching 1024, there will still be 63 threads executing the code, they just behave like they wouldn't exist. If the wave size is smaller, the hardware might at least early out on some wavefonts, so in these cases the early return does actually save some work.
So it would be the best if you'd never actually need a number of threads that is not a multiple of your group size (which, in turn, is hopefully a multiple of the hardwares wavefont size). But, if that's not possible, limiting the number of threads in that way is the next best option, especially because all threads that do reach the early return are next to each other, which maximizes the chance that a whole wavefont can early out.

Is Unique Thread Id guaranteed for each Kernel Call in CUDA?

I have recently started to work with Cuda, I have multithread, multiprocess coding experience on C++, Java and Python.
With PyCuda I see example codes like this,
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")
It seems the thread id itself partakes in the logic of the code. Then the question is will there be enough thread ids covering my entire array (whose indexing I apparently need to reach all elements there), and what happens if I change the size of the array.
Will the indexing always be between 0 and N?
In CUDA the thread id is only unique per so-called thread block, meaning, that your example kernel only does the right thing with only one block doing work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread-block also limits you in the number of threads and therefore in the problem-size, if your kernel doesn't contain a loop so every thread can compute more than one element.
Kernel execution is seen strongly hierarchically: Restricting ourselves to one dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads numbered per block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a unique ID for a thread (in a fashion that ideally uses the hardware to load elements from an array), you have to take blockIdx.x * blockDim.x + threadIdx.x. If more than one element shall be computed by every thread, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
/* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the number of all threads working on the kernel. Different strides (especially having a thread working on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Fastest way to compare uchar arrays in OpenCL

I need do many comparsions in opencl programm. Now i make it like this
int memcmp(__global unsigned char* a,__global unsigned char* b,__global int size){
for (int i = 0; i<size;i++){
if(a[i] != b[i])return 0;
}
return 1;
}
How i can make it faster? Maybe using vectors like uchar4 or somethins else? Thanks!
I guess that your kernel computes "size" elements for each thread. I think that your code can improve if your accesses are more coalesced. Thanks to the L1 caches of the current GPUs this is not a huge problem but it can imply a noticeable performance penalty. For example, you have 4 threads(work-items), size = 128, so the buffers have 512 uchars. In your case, thread #0 acceses to a[0] and b[0], but it brings to cache a[0]...a[63] and the same for b. thread #1 wich belongs to the same warp (aka wavefront) accesses to a[128] and b[128], so it brings to cache a[128]...a[191], etc. After thread #3 all the buffer is in the cache. This is not a problem here taking into account the small size of this domain.
However, if each thread accesses to each element consecutively, only one "cache line" is necessary all the time for your 4 threads execution (the accesses are coalesced). The behavior will be better when more threads per block are considered. Please, try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old but their concepts are not so old.
PS: By the way, after this I would try to use uchar4 as you commented and also the "loop unrolling".

Concurrent access to a single FFTSetup data structure in GCD

Is it okay to create a single FFTSetup data structure and use it to perform multiple FFT computations concurrently? Would something like the following work?
FFTSetup fftSetup = vDSP_create_fftsetup(
16, // vDSP_Length __vDSP_log2n,
kFFTRadix2 // FFTRadix __vDSP_radix
);
NSAssert(fftSetup != NULL, #"vDSP_create_fftsetup() failed to allocate storage");
for (int i = 0; i < 100; i++)
{
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0),
^{
vDSP_fft_zrip(
fftSetup, // FFTSetup __vDSP_setup,
&(splitComplex[i]), // DSPSplitComplex *__vDSP_ioData,
1, // vDSP_Stride __vDSP_stride,
16, // vDSP_Length __vDSP_log2n,
kFFTDirection_Forward // FFTDirection __vDSP_direction
);
});
}
I suppose that the answer depends on the following considerations:
1) Does vDSP_fft_zrip() only access the data within fftSetup (or the data pointed to by it) in a "read-only" fashion? Or are there perhaps some temporary buffers (scratch space) within fftSetup that is written to by vDSP_fft_zrip() in performing its FFT computations?
2) If data like that in fftSetup is being accessed in a "read-only" fashion, is it okay for multiple processes/threads/tasks/blocks to access it simultaneously? (I am thinking of the case where it is possible for more than one process to open the same file for reading, though not necessarily for writing or appending. Is this analogy appropriate?)
On a related note, just how much memory is being taken up by the FFTSetup data structure? Is there any way to find out? (It is an opaque data type.)
You may create one FFT setup and use it repeatedly and concurrently. This is the intended use. (I am the author of the current implementations of vDSP_fft_zrip and other FFT implementations in vDSP.)
In Using Fourier Transforms we're told that the FFTSetup contains the FFT weight array which is a series of complex exponentials. The vDSP_create_fftsetup documentation says
Once prepared, the setup structure can be used repeatedly by FFT
functions (which read the data in the structure and do not alter it)
for any (power of two) length up to that specified when you created
the structure.
so
conceptually, vDSP_fft_zrip should not need to modify the weight array and so it would appear to be one of the FFT functions that do not alter the FFTSetup (I haven't seen any that do apart from create/destroy), however there are no guarantees on what the actual implementation does - it could do anything.
if vDSP_fft_zrip truly accesses its FFTSetup in a read-only fashion, then it's fine to do that from multiple threads.
As for memory usage, the FFT weight array is e^{i*k*2*M_PI/N} for k = [0..N-1], which are N complex float values, so that would 2*N*sizeof(float).
But those complex exponentials are very symmetric so who knows, under the hood the implementation could require less memory. Or more!
In your case, N = 2^16, so it wouldn't be strange to see up to 256k being used.
Where does that leave you? I think it seems reasonable that the FFTSetup be accessible from multiple threads, but it appears to be undocumented. You could be lucky. Or unlucky and unpleasantly surprised now or in a future version of the framework.
So... do you feel lucky?
I wouldn't attempt any explicit concurrency with the vDSP functions, or any other function in the Accelerate framework (of which vDSP is a part) for that matter. Why? Because Accelerate is already designed to take advantage of multiple cores, as well as specific nuances of a given processor implementation, on your behalf - see http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man7/vecLib.7.html. You may end up essentially re-parallelizing already parallel computations that are internal to the implementation (if not now, then possibly in a later version). The best approach to the Accelerate framework is generally to assume that it's more clever than you are and just use it in the simplest way possible, then do your performance measurements. If those measurements reflect a level of performance that is somehow insufficient for your needs, then try your own optimizations (and/or file a bug report against the Accelerate framework at http://bugreport.apple.com since the authors of that framework are always interested in knowing where or if their efforts somehow fell short of developer requirements).