CUDA: while loop index correctness - optimization

This kernel is doing the right thing giving me the correct result. My problem is more in correctness of the while loop if I want to improve the performance. I tried several configuration of blocks and threads but if i'm going to change them, the while loop won't give me the correct result.
The results i obtained changing the configuration of the kernel are that firstArray and secondArray won't be filled completely (they will have 0 inside the cells). Both arrays must be filled with the curValue obtained from the if loop.
Any advice is welcomed :)
Thank you in advance
#define N 65536
__global__ void whileLoop(int* firstArray_device, int* secondArray_device)
{
int curValue = 0;
int curIndex = 1;
int i = (threadIdx.x)+2;
while(i < N) {
if (i % curIndex == 0) {
curValue = curValue + curIndex;
curIndex *= 2;
}
firstArray_device[i] = curValue;
secondArray_device[i] = curValue;
i += blockDim.x * gridDim.x;
}
}
int main(){
firstArray_host[0] = 0;
firstArray_host[1] = 1;
secondArray_host[0] = 0;
secondArray_host[1] = 1;
// memory allocation + copy on GPU
// definition number of blocks and threads
dim3 dimBlock(1, 1);
dim3 dimGrid(1, 1);
whileLoop<<<dimGrid, dimBlock>>>(firstArray_device, secondArray_device);
// copy back to CPU + free memory
}

You have a data dependency issue here which hinders you to do some meaningful optimization. The variables curValue and curIndex are changed within the while loop and feed forward into the next run. As soon as you try to optimize the loop you will find you in a situation where this variables have different states and the result is changed.
I do not really know what you try to achieve, but try to make the while loop indepdent to the values of a former run of the loop to avoid the dependencies. Try to separate the data into threads and data chunks in a way that the indizes and values are calculated on the environment states like threadIdx, blockDim, gridDim...
Also try to avoid conditional loops. It is better to use for loops with a constant number of runs. This is also easier to optimize.

A few things:
You left out the code you used to declare your global arrays on the
device. It would be helpful to have this info.
Your algorithm is
not thread-safe when multiple blocks are used. In other words, if you are running multiple
blocks, not only would they be doing redundant work (thus giving
you no gains), but they would also likely at some point try to write
to the same global memory locations, creating errors.
Your code is thus
correct when only one block is used, but this makes it rather pointless ... you're running a serial, or lightly-threaded operation on a parallel device. You cannot run on all your available resources (multiple blocks on multiple SMPs without memory conflicts (see below)...
Currently there are two main issues with this code from a parallel standpoint:
int i = (threadIdx.x)+2; ...yields a starting index of 2 for a
single thread; 2 and 3 for two threads in a single block, and so on. I doubt this is
what you want as the first two positions (0, 1) are never getting
addressed. (Remember, arrays start at index 0 in C.)
Further, if you include multiple blocks (say 2 blocks
each with one thread) then you would have multiple duplicate indices
(e.g. for 2 b x 1 t --> indices b1t1: 2, b1t2: 2), which when you used the index
to write to global memory would create conflicts and errors. Doing something like int i = threadIdx.x + blockDim.x * blockIdx.x; would be the typical way to correctly calculate your indices so as to avoid this issue.
Your
final expression i += blockDim.x * gridDim.x; is okay, because its
adds a number equivalent to the total # of threads to i and thus
does not create additional clashing or overlap.
Why use the GPU to shuffle memory and do a trivial computation? You may not see much speedup versus a fast CPU, when you factor in the time to take your arrays onto and off of the device.
Work on problems 1 and 2 if you wish, but beyond that consider your overall goal and what exactly kind of algorithm you are trying to optimize and come up with a more parallel-friendly solution -- or consider whether GPU computing really makes sense for your problem.

To parallelize this algorithm, you need to come up with a formula that can directly calculate the value for a given index in the array. So, pick a random index within the range of the array, then consider what the factors are that go into determining what the value will be for that location. After finding a formula, test it by comparing output values for random indexes with the calculated values from your serial algorithm. When that is correct, create a kernel that starts out by selecting an unique index based on it's thread and block indexes. Then calculate the value for that index and store it in the corresponding index in the array.
A trivial example:
Serial:
__global__ void serial(int* array)
{
int j(0);
for (int i(0); i < 1024; ++i) {
array[i] = j;
j += 5;
}
int main() {
dim3 dimBlock(1);
dim3 dimGrid(1);
serial<<<dimGrid, dimBlock>>>(array);
}
Parallel:
__global__ void parallel(int* array)
{
int i(threadIdx.x + blockDim.x * blockIdx.x);
int j(i * 5);
array[i] = j;
}
int main(){
dim3 dimBlock(256);
dim3 dimGrid(1024 / 256);
parallel<<<dimGrid, dimBlock>>>(array);
}

Related

Is Unique Thread Id guaranteed for each Kernel Call in CUDA?

I have recently started to work with Cuda, I have multithread, multiprocess coding experience on C++, Java and Python.
With PyCuda I see example codes like this,
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
int i = threadIdx.x;
outvec[i] = scalar*vec[i];
}
""")
It seems the thread id itself partakes in the logic of the code. Then the question is will there be enough thread ids covering my entire array (whose indexing I apparently need to reach all elements there), and what happens if I change the size of the array.
Will the indexing always be between 0 and N?
In CUDA the thread id is only unique per so-called thread block, meaning, that your example kernel only does the right thing with only one block doing work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread-block also limits you in the number of threads and therefore in the problem-size, if your kernel doesn't contain a loop so every thread can compute more than one element.
Kernel execution is seen strongly hierarchically: Restricting ourselves to one dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads numbered per block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a unique ID for a thread (in a fashion that ideally uses the hardware to load elements from an array), you have to take blockIdx.x * blockDim.x + threadIdx.x. If more than one element shall be computed by every thread, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
/* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the number of all threads working on the kernel. Different strides (especially having a thread working on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.

Determining growth function and Big O

Before anyone asks, yes this was a previous test question I got wrong and knew I got wrong because I honestly just don't understand growth functions and Big O. I've read the technical definition, I know what they are but not how to calculate them. My textbook gives examples off of real-life situations, but I still find it hard to interpret code. If someone can tell me their thought process on how they determine these, that would seriously help. (i.e. this section of code tells me to multiply n by x, etc, etc).
public static int sort(int lowI, int highI, int nums[]) {
int i = lowI;
int j = highI;
int pivot = nums[lowI +(highI-lowI)/2];
int counter = 0;
while (i <= j) {
while (nums[i] < pivot) {
i++;
counter++;
}
while (nums[j] > pivot) {
j--;
counter++;
}
count++;
if (i <= j) {
NumSwap(i, j, nums); //saves i to temp and makes i = j, j = temp
i++;
j--;
}
}
if(lowI< j)
{
return counter + sort(lowI, j, nums);
}
if(i < highI)
{
return counter + sort(i, highI, nums);
}
return counter;
}
It might help for you to read some explanations of Big-O. I think of Big-O as the number of "basic operations" computed as the "input size" increases. For sorting algorithms, "basic operations" usually means comparisons (or counter increments, in your case), and the "input size" is the size of the list to sort.
When I analyze for runtime, I'll start by mentally dividing the code into sections. I ignore one-off lines (like int i = lowI;) because they're only run once, and Big-O doesn't care about constants (though, note in your case that int i = lowI; runs once with each recursion, so it's not only run once overall).
For example, I'd mentally divide your code into three overall parts to analyze: there's the main while loop while (i <= j), the two while loops inside of it, and the two recursive calls at the end. How many iterations will those loops run for, depending on the values of i and j? How many times will the function recurse, depending on the size of the list?
If I'm having trouble thinking about all these different parts at once, I'll isolate them. For example, how long will one of the inner for loops run for, depending on the values of i and j? Then, how long does the outer while loop run for?
Once I've thought about the runtime of each part, I'll bring them back together. At this stage, it's important to think about the relationships between the different parts. "Nested" relationships (i.e. the nested block loops a bunch of times each time the outer thing loops once) usually mean that the run times are multiplied. For example, since the inner while loops are nested within the outer while loop, the total number of iterations is (inner run time + other inner) * outer. It also seems like the total run time would look something like this - ((inner + other inner) * outer) * recursions - too.

If you have to use an accessor function two times in a block of code,

I have a question. Suppose you have to call a function twice in a block of code, and are guaranteed for it to return the same value both times. Should you optimize your code by creating an extra variable?
Example:
Should
foo1(v.length()); // foo1 doesn't modify v.length()
foo2(v.length());
be changed to
int vlen = v.length();
foo1(vlen);
foo2(vlen);
for optimability?
In short, yes.
The increase in performance for one execution of the hypothetical code block where the above code appears might be negligible, but consider that if you have a repetitive loop in which the code block appears, you might see a small performance gain, because there would be less "call/ret" spaghetti stringing in the code. Yet, with the faster processors on the market today, allow me to emphasize that the performance increase is small. Also, the compiled code block is probably smaller by a byte or two in the version where you only called v.length() once.
So again, a small increase in efficiency that is negligible. Yet, it's still the best practice to optimize like this -- especially if you have something like a for loop, where the performance gain is roughly multiplied by the number of times that the unaltered value returned by the function is utilized and multiplied, the performance gain could prove non-negligible.
unsigned int something = function();
for( unsigned int i = 0; i < something; i ++ )
{
...
}
rather than
for( unsigned int i = 0; i < function(); i ++ )
{
...
}

Iterating with different integral types

Does it make any difference if I use e.g. short or char type of variable instead of int as a for-loop initializer?
for (int i = 0; i < 10; ++i) {}
for (short i = 0; i < 10; ++i) {}
for (char i = 0; i < 10; ++i) {}
Or maybe there is no difference? Maybe I make the things even worse and efficiency decreases? Does using different type saves memory and increases speed? I am not sure, but I suppose that ++ operator may need to widen the type, and as a result: slow down the execution.
It will not make any difference you should be caring about, provided the range you iterate over fits into the type you choose. Performance-wise, you'll probably get the best results when the size of the iteration variable is the same as the platform's native integer size, but any decent compiler will optimize it to use that anyway. On a managed platform (e.g. C# or Java), you don't know the target platform at compile time, and the JIT compiler is basically free to optimize for whatever platform it is running on.
The only thing you might want to watch out for is when you use the loop counter for other things inside the loop; changing the type may change the way these things get executed, up to the point (in C++ at least) that a different overload for a function or method may get called because the loop variable has a different type. An example would be when you output the loop variable through a C++ stream, like so: cout << i << endl;. Similarly, the type of the loop variable can infest the implicit types of (sub-)expressions that contain it, and lead to hidden overflows in numeric calculations, e.g.: int j = i * i;.

Reducing Number of Registers Used in CUDA Kernel

I have a kernel which uses 17 registers, reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number or registers used, excluding completely rewriting my algorithms in a different manner. I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?
Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)
Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.
In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't random access into local arrays (which would result in non-coalesced accesses).
To answer you specific question about reducing registers, post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
It's really hard to say, nvcc compiler is not very smart in my opinion.
You can try obvious things, for example using short instead of int, passing and using variables by reference (e.g.&variable), unrolling loops, using templates (as in C++). If you have divisions, transcendental functions, been applied in sequence, try to make them as a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
If you post some code, maybe you will get specific answers.
Utilizing shared memory as cache may lead less register usage and prevent register spilling to local memory...
Think that the kernel calculates some values and these calculated values are used by all of the threads,
__global__ void kernel(...) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
int id0 = blockDim.x * blockIdx.x;
int reg = id0 * ...;
int reg0 = reg * a / x + y;
...
int val = reg + reg0 + 2 * idx;
output[idx] = val > 10;
}
So, instead of keeping reg and reg0 as registers and making them possibily spill out to local memory (global memory), we may use shared memory.
__global__ void kernel(...) {
__shared__ int cache[10];
int idx = threadIdx.x + blockDim.x * blockIdx.x;
if (threadIdx.x == 0) {
int id0 = blockDim.x * blockIdx.x;
cache[0] = id0 * ...;
cache[1] = cache[0] * a / x + y;
}
__syncthreads();
...
int val = cache[0] + cache[1] + 2 * idx;
output[idx] = val > 10;
}
Take a look at this paper for further information..
It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes into account lots of factors, incliding register.
How does it work when reducing registers caused slower speed
Most probably the compiler had to spill insufficient register data into "local" memory, which is essentially the same as global memory, and thus very slow
For optimization purposes I would recommend to use keywords like const, volatile and so on where necessary, to help the compiler on the optimization phase.
Anyway, it is not these tiny issues like registers which often make CUDA kernels run slow. I'd recommend to optimize work with global memory, the access pattern, caching in texture memory if possible, transactions over the PCIe.
The instruction count increase when lowering the register usage have a simple explanation. The compiler could be using registers to store the results of some operations that are used more than once through your code in order to avoid recalculating those values, when forced to use less registers, the compiler decides to recalculate those values that would be stored in registers otherwise.