The use of the getSmallConstantTripCount method of Loop in LLVM

In my pass, I add LoopInfo as a required pass. Then I'd like to print the constant loop trip count of each loop if it has one. However, every time I call getSmallConstantTripCount, it returns 0, even for a very simple loop:
for(i=0; i<3; ++i) {;}
Any idea why?

LLVM has a principle of making each part do the least amount of work. LoopInfo::getSmallConstantTripCount doesn't do any fancy analysis; it looks for a simple loop with a single backedge that increments a value by 1 each time and is compared using != against a constant integer.
When you compile the code you wrote at -O0, every "i < 3" actually causes a load from memory to read in the latest value of 'i'. LoopInfo certainly isn't going to do the type of analysis necessary to figure out that the memory accesses aren't needed; that's the job of "opt -mem2reg". Try running that optimization, and maybe -instcombine -loopsimplify -loop-rotate over the code, to get it into the shape that getSmallConstantTripCount will handle.
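For reference, a minimal sketch of such a pass, in legacy pass-manager style. Header paths and pass boilerplate vary across LLVM versions (and in recent LLVM this method lives on ScalarEvolution rather than Loop), so treat this as illustrative rather than a drop-in implementation:

#include "llvm/Pass.h"
#include "llvm/Function.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Support/raw_ostream.h"
using namespace llvm;

namespace {
  struct TripCountPrinter : public FunctionPass {
    static char ID;
    TripCountPrinter() : FunctionPass(ID) {}

    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
      AU.addRequired<LoopInfo>(); // declare the dependency on LoopInfo
      AU.setPreservesAll();
    }

    virtual bool runOnFunction(Function &F) {
      LoopInfo &LI = getAnalysis<LoopInfo>();
      // Iterates over top-level loops only; recurse via (*I)->getSubLoops()
      // if you also want the nested ones.
      for (LoopInfo::iterator I = LI.begin(), E = LI.end(); I != E; ++I)
        errs() << "trip count: " << (*I)->getSmallConstantTripCount() << "\n";
      return false; // analysis only, the IR is not modified
    }
  };
}

char TripCountPrinter::ID = 0;
static RegisterPass<TripCountPrinter> X("tripcounts", "Print small constant trip counts");

Run the module through opt -mem2reg (plus the other passes mentioned above) before loading the pass; the loop in the question should then report a nonzero count.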

Related

Does initialising an auxiliary array to 0 count as n time complexity already?

I'm very new to big-O complexity, and I was wondering: if you have a given array and you initialise an auxiliary array with the same number of indices, does that already count as n time, or do you just assume this is O(1), or nothing at all?
TL;DR: Ignore it
Long answer: This will depend on the rest of your algorithm as well as on what you want to achieve. Typically you will do something useful with the array afterwards that has at least the same time complexity as filling the array, so the array-filling does not contribute to the overall time complexity. Furthermore, filling an array with 0 feels like something you do to initialize the array so your "real" algorithm can work properly. But nevertheless, there are some cases worth considering.
Please note that I use pseudocode in the following examples; I hope it's clear what each algorithm should do. Also note that none of the examples do anything useful with the array. It's just to show my point.
Let's say you have the following code:
A = Array[n]
for(i=0, i<n, i++)
    A[i] = 0
print "Hello World"
Then obviously the runtime of your algorithm depends heavily on the value of n, and it should thus be counted as linear complexity, O(n).
On the other hand, if you have a much more complicated function, say this one:
A = Array[n]
for(i=0, i<n, i++)
    A[i] = 0
for(i=0, i<n, i++)
    for(j=n-1, j>=0, j--)
        print "Hello World"
Then even if you take the complexity of filling the array into account, you will end up with a complexity of O(n^2 + 2n), which is equal to the class O(n^2), so it does not matter in this case.
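(To see why these classes are equal: for n >= 2 we have n^2 + 2n <= n^2 + n*n = 2n^2, so n^2 + 2n is bounded above by a constant multiple of n^2 and therefore lies in O(n^2); the reverse direction is immediate since n^2 <= n^2 + 2n.)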
The most interesting case is surely when you have different options to use as basic operation. Say we have the following code (someFunction being an arbitrary function):
A = Array[n*n]
for(i=0, i<n*n, i++)
    A[i] = 0
for(i=0, i*i<n, i++)
    someFunction(i)
Now it depends on what you choose as the basic operation. Which one you choose is highly dependent on what you want to achieve. Let's say someFunction is a very cheap function (regarding time complexity) and accessing the array A is more expensive. Then you would probably go with O(n^2), since accessing the array is done n^2 times. If on the other hand someFunction is expensive compared to filling the array, you would probably choose it as the basic operation and go with O(sqrt(n)), since the second loop runs while i*i < n, i.e. for roughly sqrt(n) iterations.
Please be aware that one could also come to the conclusion that, since the first part (array-filling) is executed more often than the other part (someFunction), it does not matter which of the operations takes longer to finish, because at some point the array-filling will dominate. Thus you could argue that the complexity has to be quadratic, O(n^2). This may be right from a theoretical point of view. But in real life you will usually have one operation you want to count and not care about the other operations.
Actually you could consider either ignoring the array-filling or taking it into account in all the examples I provided above, depending on whether print or accessing the array is more expensive. But I hope that in the first two examples it is obvious which one adds more runtime and thus should be considered the basic operation.

Variable Declaration Inside For Loop Initialization Statement

I have a simple question with regards to initializing for loops.
Here is my for loop declaration:
for (int i=player.x-xIndex-1; i<=player.x+xIndex+1; i++)
{
    for (int j=player.y-yIndex-1; j<=player.y+yIndex+1; j++)
    {
    }
}
My question is:
Is it bad practice to have the values of the indices i and j be set to non-static integer values at declaration?
Will the code evaluate the minimum and maximum values of i and j just once at the beginning of execution, or will it evaluate those values (i.e. player.x+xIndex+1, etc.) every single time the loop executes?
Any light you guys can shed on my problem would be awesome!
I'm a freakin' amateur, guys. Seriously.
Thanks :D
Not an amateur question at all. The "initialization" expressions are calculated only on the first run through, because of course they're only used that one time.
For the loop's "condition" (the middle expression that is tested at the end of every iteration), in the worst case it can be evaluated every iteration. Because what if (in this case) player.y actually changes during the loop?
However, most modern compilers will likely not compute that whole thing every loop if they can detect that the end value is provably never changing during the loop.
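To make that concrete, here is a toy C++ snippet (not from the question) where the end condition really does change while the loop runs, so it must be re-evaluated:

#include <iostream>
#include <vector>

int main() {
    std::vector<int> v;
    v.push_back(1);
    v.push_back(2);
    v.push_back(3);
    // v.size() is re-evaluated before every iteration; that matters here
    // because the body grows the vector while the loop is running.
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == 2)
            v.push_back(4);
    }
    std::cout << v.size() << '\n'; // prints 4
}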
If you wanted to be double sure and manhandle the path of execution, you can explicitly "hoist" the conditional end expression out of the loop yourself, like:
int maxValue = foo.x + y.bar + 12 + myString.length;
for (int i = 0; i < maxValue; i++) {
    ....
}
But now the standard style disclaimer: optimizing prematurely can make your code less readable for no provable gains. Unless you're doing real work in that condition expression, or the loop is running bazillions of iterations, some additional computation won't hurt you much, and might be worth keeping so that it's clearer to yourself and others what you're trying to do.

Understanding CUDA serialization and reconvergence point

EDIT: I realized that I, unfortunately, overlooked a semicolon at the end of the while statement in the first example code and misinterpreted it myself. So there is in fact an empty loop for threads with threadIdx.x != s, a convergence point after that loop, and a thread waiting at this point for all the others without incrementing the s variable. I am leaving the original (uncorrected) question below for anyone interested in it. Be aware that the second line of the first example is missing that semicolon (it should read while (s != threadIdx.x);), and thus s++ is not actually part of the loop body.
--
We were studying serialization in our CUDA lesson and our teacher told us that a code like this:
__shared__ int s = 0;
while (s != threadIdx.x)
    s++; // serialized code
would end up with a HW deadlock because the nvcc compiler puts a reconvergence point between the while (s != threadIdx.x) and s++ statements. If I understand it correctly, this means that once the reconvergence point is reached by a thread, this thread stops execution and waits for the other threads until they reach the point too. In this example, however, this never happens, because thread #0 enters the body of the while loop, reaches the reconvergence point without incrementing the s variable and other threads get stuck in an endless loop.
A working solution should be the following:
__shared__ int s = 0;
while (s < blockDim.x)
    if (threadIdx.x == s)
        s++; // serialized code
Here, all threads within a block enter the body of the loop, all evaluate the condition, and only thread #0 increments the s variable in the first iteration (and the loop goes on).
My question is, why does the second example work if the first hangs? To be more specific, the if statement is just another point of divergence, and in terms of assembly language it should be compiled into the same conditional jump instruction as the condition in the loop. So why isn't there any reconvergence point before s++ in the second example; has it in fact been placed immediately after the statement?
In other sources I have only found that divergent code is computed independently for every branch - e.g. in an if/else statement, first the if branch is computed with all else-branch threads masked within the same warp, and then the other threads compute the else branch while the first group waits. There's a reconvergence point after the if/else statement. Why then does the first example freeze, not having the loop split into two branches (a true branch for one thread and a waiting false branch for all the others in a warp)?
Thank you.
It does not make sense to put the reconvergence point between while (s != threadIdx.x) and s++;. It disrupts the program flow, since the reconvergence point for a piece of code should be reachable by all threads at compile time. [Flowchart omitted: it showed the control flow of the first code snippet with the possible and impossible points of reconvergence.]
Regarding this answer about recording the convergence point via the SSY instruction, I created the simple kernel below, resembling your first piece of code:
__global__ void kernel_1() {
    __shared__ int s;
    if (threadIdx.x == 0)
        s = 0;
    __syncthreads();
    while (s == threadIdx.x)
        s++; // serialized code
}
and compiled it for CC=3.5 with -O3. I then ran the cuobjdump tool over the output to observe the CUDA assembly. [Assembly dump omitted; the line numbers below refer to it.]
I'm not an expert in reading CUDA assembly, but I can see the while-loop condition checks at lines 0038 and 00a0. At line 00a8, it branches to 0x80 if the while-loop condition holds and executes the code block again. The reconvergence point is set up at line 0058, designating line 0xb8, which is just after the loop condition check near the exit, as the point of reconvergence.
Overall, it is not clear what you're trying to achieve with this piece of code. Also, in the second piece of code, the reconvergence point should again be after the while-loop code block (I don't mean between while and if).
The reason why it "hangs" is neither a HW deadlock nor branching, at least not directly. You produce an endless loop for one or multiple threads (as already suspected).
In your example, there isn't really a convergence point. Since you do not use any synchronization, there aren't any threads that actually wait. What happens here with the while-loop is pretty much a busy-wait.
A kernel only finishes when all threads have returned. Since you have one (or multiple) endless loops (by accident maybe even none, though this is unlikely), the kernel will never finish.
You declared a shared variable s. This variable is known to all threads within a block.
With your while-statement you basically say (to each thread): increment s until it reaches the value of your (local) thread id. Since all threads are incrementing s in parallel, you introduce race conditions.
Example:
- Thread 5 is looping and checking for s to become 5.
- s is 4.
- Two threads increment s; it becomes 6.
- At the same time, thread 5 has only reached the end of its loop body.
- Now it reaches the next loop iteration, checks s, and it's not 5.
- Thread 5 will never be able to finish, since the exit check is an exact comparison and the value of s has already exceeded its thread id.
Also your solution is quite confusing, because each thread executes the serialized code consecutively (which probably was the intention after all - even though that actually is strange):
- Thread 0 will execute the serialized code
- After that, thread 1 will execute the serialized code
- and so on
Most examples show a program where each thread works on some code, then all threads are synchronized and only single thread executes some more code (maybe it needed the results of all threads).
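For illustration, a minimal sketch of that usual pattern; the kernel, its names, and the fixed block size are made up for this example:

__global__ void scale_then_sum(const float *data, float *result) {
    __shared__ float partial[256]; // assumes blockDim.x == 256

    // Each thread first does its own piece of work.
    partial[threadIdx.x] = data[threadIdx.x] * 2.0f;

    // All threads in the block wait for each other here.
    __syncthreads();

    // Then a single thread runs the part that must be serial.
    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        *result = sum;
    }
}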
So, your second example "works" because no thread is stuck in an endless loop. However, I can't think of a reason why anyone would use such code, since it is confusing and, well, not parallel at all.

Optimizing the number of times a subroutine is called from a loop

A small question on optimizing a program. The problem is stated as follows
Problem Statement:
The main code has a for/DO loop in which a subroutine is called. The subroutine may or may not need to be executed, depending on a flag I receive from the user.
Obvious Solution:
The simplest way of doing it is to use an IF statement to decide whether to call the subroutine. But this is time consuming if I have to check the flag every time the loop is executed. I am doing molecular dynamics, and the loop will be executed on the order of 10^5 times.
Qn: Is there a better way to do this, like telling the program once and for all whether the subroutine has to be invoked, depending on the flag? I am coding in Fortran 90, so it would be helpful if something could be said along those lines.
PROGRAM MAIN
IMPLICIT NONE
"ALL ARRAY INITIALIZATIONS
CALL DENSITY() ! I do a field based approach. So this is for grid formulation
DO i = 1, neq ! neq = number of eqbm cycles
CALL MC_CYC() ! Monte carlo steps
CALL DENSITY() ! Recalculate density
END DO
DO i = 1,nprod ! production cycle
DO j = 1, niter ! for averages of ensembles
CALL MC_CYC()
CALL DENSITY()
END DO
!do average here
IF (<flag is present>) THEN ! This is where I need to check the flag; otherwise it would be checked every time.
CALL RDF()
END IF
END DO
END PROGRAM MAIN
I am not sure that an IF statement is really going to slow down the program, even if called (more than) 100,000 times. You might only end up saving a second or two by restructuring the code (test this though!)
Anyway, if the flag is received at the start of the program, then you'd be able to write your code as
IF (<flag is present>) THEN
  DO
    ...
    CALL <subroutine name>
    ...
  ENDDO
ELSE
  DO
    ...
  ENDDO
ENDIF
where the second DO loop omits the subroutine CALL.
EDIT
Another alternative (that might be cheaper to implement) is to pre-process the source and have the flag keyed in at compile time. This might make the user a bit annoyed at having to recompile the program when changing the flag, but it'd make your job easier. Note that this requires the file to be run through the preprocessor (with gfortran, for example, that happens automatically for files with a capital .F90 extension, or when compiling with -cpp).
Anyway, you'd have something like
DO
...
#ifdef <flag>
CALL <subroutine name>
#endif
ENDDO

Iterating with different integral types

Does it make any difference if I use, e.g., a short or char variable instead of an int as the for-loop counter?
for (int i = 0; i < 10; ++i) {}
for (short i = 0; i < 10; ++i) {}
for (char i = 0; i < 10; ++i) {}
Or maybe there is no difference? Maybe I make things even worse and efficiency decreases? Does using a different type save memory and increase speed? I am not sure, but I suppose the ++ operator may need to widen the type and, as a result, slow down execution.
It will not make any difference you should care about, provided the range you iterate over fits into the type you choose. Performance-wise, you'll probably get the best results when the size of the iteration variable matches the platform's native integer size, but any decent compiler will optimize it to use that anyway. On a managed platform (e.g. C# or Java), you don't know the target platform at compile time, and the JIT compiler is basically free to optimize for whatever platform it is running on.
The only thing you might want to watch out for is when you use the loop counter for other things inside the loop; changing its type may change the way those things execute, up to the point (in C++ at least) that a different overload of a function or method may get called because the loop variable has a different type. An example would be when you output the loop variable through a C++ stream, like so: cout << i << endl;. Similarly, the type of the loop variable can propagate into the implicit types of (sub-)expressions that contain it and lead to hidden overflows in numeric calculations, e.g.: int j = i * i;.
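To make the overload point concrete, a toy snippet (any C++ compiler):

#include <iostream>

int main() {
    for (int i = 65; i < 68; ++i)
        std::cout << i << ' ';   // operator<<(int): prints 65 66 67
    std::cout << '\n';

    for (char c = 65; c < 68; ++c)
        std::cout << c << ' ';   // operator<<(char): prints A B C
    std::cout << '\n';
    return 0;
}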