I have recently started to work with CUDA; I have multithreaded, multiprocess coding experience in C++, Java and Python.
With PyCUDA I see example code like this:
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar * vec[i];
}
""")
It seems the thread id itself takes part in the logic of the code. The question, then, is whether there will be enough thread ids to cover my entire array (whose indexing I apparently need in order to reach all of its elements), and what happens if I change the size of the array.
Will the indexing always be between 0 and N?
In CUDA the thread id is only unique per so-called thread block, meaning that your example kernel only does the right thing when a single block does the work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread block also limits you in the number of threads and therefore in the problem size, if your kernel doesn't contain a loop that lets every thread compute more than one element.
Kernel execution is strongly hierarchical: restricting ourselves to one-dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads that are numbered per block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a unique ID for a thread (in a fashion that makes ideal use of the hardware when loading elements from an array), you take blockIdx.x * blockDim.x + threadIdx.x. If every thread is to compute more than one element, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
/* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the total number of threads working on the kernel. Different strides (especially having each thread work on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.
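Applied to the scalar-multiply example above, a grid-stride version might look like the following sketch (the extra size parameter n is an assumption; it is not part of the original PyCUDA snippet):
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec, int n)
{
    // Unique global index, then stride by the total number of threads in the grid,
    // so the kernel is correct for any n and any launch configuration.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
    {
        outvec[i] = scalar * vec[i];
    }
}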
I need to do many comparisons in an OpenCL program. Right now I do it like this:
int memcmp(__global unsigned char* a, __global unsigned char* b, int size) {
    for (int i = 0; i < size; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4 or something else? Thanks!
I guess that your kernel computes "size" elements for each thread. I think your code could improve if your accesses were more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can imply a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers have 512 uchars. In your case, thread #0 accesses a[0] and b[0], but it brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, etc. After thread #3 the whole buffer is in the cache. This is not a problem here, given the small size of this domain.
However, if consecutive threads access consecutive elements, only one "cache line" is needed at a time for your 4-thread execution (the accesses are coalesced). The behavior improves further when more threads per block are considered. Please try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old, but the concepts still apply.
PS: By the way, after this I would try to use uchar4 as you suggested, and also loop unrolling.
I have two pieces of code: one written in C and the corresponding operation written in CUDA.
Please help me understand how __syncthreads() works in the context of the following programs. As per my understanding, __syncthreads() ensures synchronization of threads limited to one block.
C program:
{
    for (i = 1; i < 10000; i++)
    {
        t = a[i] + b[i];
        a[i-1] = t;
    }
}
The equivalent CUDA program:
__global__ void kernel0(int *b, int *a, int *t, int N)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    int tid = b0 * blockDim.x + t0;
    int private_t;
    if (tid < 10000)
    {
        private_t = a[tid] + b[tid];
        if (tid > 1)
            a[tid-1] = private_t;
        __syncthreads();
        if (tid == 9999)
            *t = private_t;
    }
}
Kernel Dimensions:
dim3 k0_dimBlock(32);
dim3 k0_dimGrid(313);
kernel0 <<<k0_dimGrid, k0_dimBlock>>>
The surprising fact is that the output from the C and CUDA programs is identical. Given the nature of the problem, which has a dependency of a[] onto itself, a[i] is loaded by thread-ID i and written to a[i-1] by the same thread. Now the same happens for thread-ID i-1. Had the problem size been less than 32, the output would be obvious. But for a problem of size 10000 with 313 blocks of 32 threads each, how does the dependency get respected?
As per my understanding, __syncthreads() ensures synchronization of
threads limited to one block.
You're right. __syncthreads() is a synchronization barrier in the context of a block. Therefore, it is useful, for instance, when you must ensure that all your data is updated before starting the next stage of your algorithm.
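As a hypothetical illustration (not taken from the question above), the classic pattern is to stage data in shared memory and place the barrier between the writes and the reads, so no thread reads an element before every thread in the block has written its own:
__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[256];                       // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];                    // stage 1: every thread writes its element
    __syncthreads();                                // all writes to tile are now visible block-wide
    data[i] = tile[blockDim.x - 1 - threadIdx.x];   // stage 2: safely read another thread's element
}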
Given the nature of problem, which has dependency of a[] onto itself,
a[i] is loaded by thread-ID i and written to a[i-1] by the same thread.
Just imagine thread 2 reaches the if statement; since it matches the condition, it enters the statement. Now that thread does the following:
private_t=a[2]+b[2];
a[1]=private_t;
Which is equivalent to:
a[1]=a[2]+b[2];
As you pointed out, there is a data dependency on array a. Since you can't control the order of execution of the warps, at some point you'll be using an already-updated version of the a array. In my mind, you need to add an extra __syncthreads() statement:
if (tid > 0 && tid < 10000)
{
    private_t = a[tid] + b[tid];
    __syncthreads();
    a[tid-1] = private_t;
    __syncthreads();
    if (tid == 9999)
        *t = private_t;
}
In this way, every thread gets its own version of the private_t variable using the original array a, and then the array is updated in parallel.
About the *t value:
If you're only looking at the value of *t, you may not notice the effect of this random scheduling, depending on the launch parameters. That's because the thread with tid==9999 could be in the last warp together with the thread with tid==9998. Since the two array positions needed to create that private_t value are read before the barrier, and you already had that synchronization barrier, the answer should be right.
This kernel is doing the right thing and giving me the correct result. My problem is more with the correctness of the while loop if I want to improve the performance. I tried several configurations of blocks and threads, but if I change them, the while loop won't give me the correct result.
The result I obtain when changing the configuration of the kernel is that firstArray and secondArray are not filled completely (some cells still contain 0). Both arrays must be filled with the curValue obtained from the if statement.
Any advice is welcomed :)
Thank you in advance
#define N 65536

__global__ void whileLoop(int* firstArray_device, int* secondArray_device)
{
    int curValue = 0;
    int curIndex = 1;
    int i = threadIdx.x + 2;
    while (i < N) {
        if (i % curIndex == 0) {
            curValue = curValue + curIndex;
            curIndex *= 2;
        }
        firstArray_device[i] = curValue;
        secondArray_device[i] = curValue;
        i += blockDim.x * gridDim.x;
    }
}
int main() {
    firstArray_host[0] = 0;
    firstArray_host[1] = 1;
    secondArray_host[0] = 0;
    secondArray_host[1] = 1;

    // memory allocation + copy on GPU
    // definition of number of blocks and threads
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);

    whileLoop<<<dimGrid, dimBlock>>>(firstArray_device, secondArray_device);

    // copy back to CPU + free memory
}
You have a data dependency issue here which prevents you from doing any meaningful optimization. The variables curValue and curIndex are changed within the while loop and feed forward into the next iteration. As soon as you try to optimize the loop, you will find yourself in a situation where these variables have different states and the result changes.
I do not really know what you are trying to achieve, but try to make the while loop independent of the values from previous iterations of the loop to avoid the dependencies. Try to split the data into threads and data chunks in such a way that the indices and values are calculated from the environment state, i.e. threadIdx, blockDim, gridDim...
Also try to avoid conditional loops. It is better to use for loops with a constant number of iterations. This is also easier to optimize.
A few things:
You left out the code you used to declare your global arrays on the device. It would be helpful to have this info.
Your algorithm is not thread-safe when multiple blocks are used. In other words, if you are running multiple blocks, not only would they be doing redundant work (thus giving you no gains), but they would also likely at some point try to write to the same global memory locations, creating errors.
Your code is thus correct when only one block is used, but this makes it rather pointless: you're running a serial, or lightly-threaded, operation on a parallel device. You cannot run on all your available resources (multiple blocks on multiple SMs) without memory conflicts (see below).
Currently there are two main issues with this code from a parallel standpoint:
int i = (threadIdx.x)+2; yields a starting index of 2 for a single thread, 2 and 3 for two threads in a single block, and so on. I doubt this is what you want, as the first two positions (0, 1) never get addressed. (Remember, arrays start at index 0 in C.)
Further, if you include multiple blocks (say 2 blocks, each with one thread) then you would have multiple duplicate indices (e.g. for 2 blocks x 1 thread --> indices b1t1: 2, b2t1: 2), and when you used those indices to write to global memory you would create conflicts and errors. Doing something like int i = threadIdx.x + blockDim.x * blockIdx.x; would be the typical way to correctly calculate your indices so as to avoid this issue.
Your final expression i += blockDim.x * gridDim.x; is okay, because it adds a number equal to the total number of threads to i and thus does not create additional clashing or overlap.
Why use the GPU to shuffle memory and do a trivial computation? You may not see much speedup versus a fast CPU once you factor in the time to move your arrays onto and off of the device.
Work on problems 1 and 2 if you wish, but beyond that consider your overall goal and exactly what kind of algorithm you are trying to optimize, and come up with a more parallel-friendly solution -- or consider whether GPU computing really makes sense for your problem.
To parallelize this algorithm, you need to come up with a formula that can directly calculate the value for a given index in the array. So, pick a random index within the range of the array, then consider which factors go into determining what the value will be for that location. After finding a formula, test it by comparing output values for random indices with the calculated values from your serial algorithm. When that is correct, create a kernel that starts out by selecting a unique index based on its thread and block indices. Then calculate the value for that index and store it in the corresponding index in the array.
A trivial example:
Serial:
__global__ void serial(int* array)
{
    int j(0);
    for (int i(0); i < 1024; ++i) {
        array[i] = j;
        j += 5;
    }
}

int main() {
    dim3 dimBlock(1);
    dim3 dimGrid(1);
    serial<<<dimGrid, dimBlock>>>(array);
}
Parallel:
__global__ void parallel(int* array)
{
    int i(threadIdx.x + blockDim.x * blockIdx.x);
    int j(i * 5);
    array[i] = j;
}

int main() {
    dim3 dimBlock(256);
    dim3 dimGrid(1024 / 256);
    parallel<<<dimGrid, dimBlock>>>(array);
}
Working on the following filter, I am having a problem translating this piece of code to process an image on the GPU:
for (int h = 0; h < height; h++) {
    for (int w = 1; w < width; w++) {
        image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
    }
}
If I define:
dim3 threads_perblock(32, 32)
then in each block I have 32 threads that can communicate with each other. The threads of this block cannot communicate with the threads from other blocks.
Within a thread block, I can translate that piece of code using shared memory. However, there is an edge case (I would say): image[0][31] and image[0][32] are in different thread blocks. image[0][32] should get the value of image[0][31] to calculate its value, but they are in different thread blocks.
So that is the problem.
How would I solve this?
Thanks in advance.
If image is in global memory then there is no problem - you don't need to use shared memory and you can just access pixels directly from image without any problem.
However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
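A minimal sketch of the overlap idea for this particular filter (the names, the 32-wide tile, and the use of a separate output array are assumptions; since only the left neighbour is needed here, a single halo column per block is enough, and the sketch reads the original left neighbour rather than the updated one):
#define TILE_W 32

__global__ void row_filter(const float *in, float *out, int width, int height, float a)
{
    __shared__ float tile[TILE_W + 1];               // one extra column on the left for the halo

    int w = blockIdx.x * TILE_W + threadIdx.x;       // global column, assumes blockDim.x == TILE_W
    int h = blockIdx.y;                              // one image row per block in y

    if (h < height && w < width) {
        tile[threadIdx.x + 1] = in[h * width + w];   // load this block's pixels
        if (threadIdx.x == 0 && w > 0)
            tile[0] = in[h * width + w - 1];         // load the halo pixel owned by the neighbouring block
    }
    __syncthreads();

    if (h < height && w > 0 && w < width)
        out[h * width + w] = (1 - a) * tile[threadIdx.x + 1] + a * tile[threadIdx.x];
}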
You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute and the thread processing a[31] do
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.
I have a kernel which uses 17 registers; reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, excluding completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?
Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)
Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.
In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't randomly access into local arrays (which would result in non-coalesced accesses).
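If you want to try that, the register limit can be set either per compilation unit with the nvcc flag -maxrregcount, or per kernel with the __launch_bounds__ qualifier; a minimal sketch (the kernel and the 256/2 values are only placeholders):
// nvcc -maxrregcount=16 file.cu   -- limit for the whole compilation unit
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) -- per-kernel hint
__global__ void __launch_bounds__(256, 2) scale_kernel(float *data, int n)
{
    // The compiler limits register usage so that blocks of up to 256 threads
    // can have at least 2 blocks resident per multiprocessor.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}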
To answer your specific question about reducing registers: post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
It's really hard to say; the nvcc compiler is not very smart in my opinion.
You can try the obvious things, for example using short instead of int, passing and using variables by reference (e.g. &variable), unrolling loops, and using templates (as in C++). If you have divisions or transcendental functions applied in sequence, try to turn them into a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
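As a hypothetical illustration of the unrolling/template suggestions (the kernel is made up, not from the question), a compile-time trip count lets the compiler unroll the loop completely; whether that helps or hurts register usage has to be measured:
template <int STEPS>
__global__ void accumulate(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    // STEPS is a compile-time constant, so this loop can be fully unrolled:
    // no loop counter survives, and the compiler is free to reschedule the loads.
    #pragma unroll
    for (int s = 0; s < STEPS; ++s)
        acc += in[i + s * gridDim.x * blockDim.x];
    out[i] = acc;
}

// launched e.g. as accumulate<4><<<grid, block>>>(d_in, d_out);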
If you post some code, maybe you will get specific answers.
Utilizing shared memory as a cache may lead to less register usage and prevent register spilling to local memory...
Suppose the kernel calculates some values and these calculated values are used by all of the threads:
__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;
    int reg = id0 * ...;
    int reg0 = reg * a / x + y;
    ...
    int val = reg + reg0 + 2 * idx;
    output[idx] = val > 10;
}
So, instead of keeping reg and reg0 as registers and making them possibly spill out to local memory (global memory), we may use shared memory:
__global__ void kernel(...) {
    __shared__ int cache[10];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadIdx.x == 0) {
        int id0 = blockDim.x * blockIdx.x;
        cache[0] = id0 * ...;
        cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();
    ...
    int val = cache[0] + cache[1] + 2 * idx;
    output[idx] = val > 10;
}
Take a look at this paper for further information.
It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes lots of factors into account, including register usage.
How does it work when reducing registers causes slower speed?
Most probably the compiler had to spill data that no longer fits in registers into "local" memory, which is essentially the same as global memory, and thus very slow.
For optimization purposes I would recommend using keywords like const, volatile and so on where necessary, to help the compiler in the optimization phase.
Anyway, it is not tiny issues like registers that usually make CUDA kernels run slow. I'd recommend optimizing the work with global memory, the access pattern, caching in texture memory if possible, and transactions over the PCIe bus.
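As a hypothetical illustration of the const suggestion (the kernel and names are made up): declaring read-only, non-aliasing inputs as const with __restrict__ tells the compiler the pointers don't alias, which gives it more freedom when scheduling and caching loads.
__global__ void saxpy(int n, float a,
                      const float * __restrict__ x,
                      float * __restrict__ y)
{
    // x is read-only and does not alias y, so the compiler may cache its loads
    // more aggressively (e.g. via the read-only data path on hardware that has one).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}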
The instruction count increase when lowering the register usage has a simple explanation. The compiler could be using registers to store the results of some operations that are used more than once in your code, in order to avoid recalculating those values. When forced to use fewer registers, the compiler instead recalculates those values that would otherwise have been kept in registers.