very new to big O complexity and I was wondering if an algorithm where you have a given array, and you initialise an auxilary array with the same amount of indexes count as n time already, or do you just assume this is O(1), or nothing at all?
TL;DR: Ignore it
Long answer: This will depend on the rest of your algorithm as well as what you want to achieve. Typically you will do something useful with the array afterwards which does have at least the same time complexity as filling the array, so that array-filling does not contribute to the time complexity. Furthermore filling an array with 0 feels like something you do to initialize the array, so your "real" algorithm can work properly. But nevertheless there are some cases you could consider.
Please note that I use pseudocode in the following examples, I hope it's clear what the algorithm should do. Also note that all the examples don't do anything useful with the array. It's just to show my point.
Lets say you have following code:
A = Array[n]
for(i=0, i<n, i++)
A[i] = 0
print "Hello World"
Then obviously the runtime of your algorithm is highly dependent on the value of n and thus should be counted as linear complexity O(n)
On the other hand, if you have a much more complicated function, say this one:
A = Array[n]
for(i=0, i<n, i++)
A[i] = 0
for(i=0, i<n, i++)
for(j=n-1, j>=0, j--)
print "Hello World"
Then even if you take the complexity of filling the array into account, you will end with complexity of O(n^2+2n) which is equal to the class O(n^2), so it does not matter in this case.
The most interesting case is surely when you have different options to use as basic operation. Say we have the following code (someFunction being an arbitrary function):
A = Array[n*n]
for(i=0, i<n*n, i++)
A[i] = 0
for(i=0, i*i<n, i++)
someFunction(i)
Now it depends on what you choose as basic operation. Which one you choose is highly dependent on what you want to achieve. Let's say someFunction is a very cheap function (regarding time complexity) and accessing the array A is more expensive. Then you would propably go with O(n^2), since accessing the array is done n^2 times. If on the other hand someFunction is expensive compared to filling the array, you would propably choose this as base operation and go with O(sqrt(n)).
Please be aware that one could also come to the conclusion that since the first part (array-filling) is executed more often than the other part (someFunction) it does not matter which one of the operations will take longer time to finish, since at some point the array-filling will need longer time. Thus you could argue that the complexity has to be quadratic O(n^2) This may be right from a theoretical view. But in real life you usually will have an operation you want to count and don't care about the other operations.
Actually you could consider ignoring the array filling as well as taking it into account in all the examples I provided above, depending whether print or accessing the array is more expensive. But I hope in the first two examples it is obvious which one will add more runtime and thus should be considered as the basic operation.
I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
__device__ int popc(int a) {
int d = 0;
while (a != 0) {
if (a & 0x1) d++;
a = a >> 1;
}
return d;
}
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
int d;
asm("popc.u32 %1 %2;":"=r"(d): "r"(a));
return d;
}
? Will the inline PTX be more efficient? Should we write inline PTX to to get peak performance?
also - does GPU have some extra magic instruction corresponding to PTX instructions?
The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the output of the compilation in ptx assembly, by using -ptx flag added to nvcc. If the compiler generates it for you, there is no need to hand-code the inline assembly yourself (or use an instrinsic).
Also, whether or not it makes a performance difference in the general case depends on whether or not the code path is used in a significant way, and on other factors such as the current performance limiters of your kernel (e.g. compute-bound or memory-bound).
A few more points in addition to #RobertCrovella's answer:
Even if you do use PTX for something - that should happen rarely. Limit it to small functions of no more than a few PTX lines - which you can then re-use for multiple purposes as you see fit, with most of your coding being in C/C++.
An example of this principle are the intrinsics #njuffa mentiond, in (that's not an official copy of the file I think). Please read it through to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
For your specific example - you do want the PTX over the first version; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding __popc intrinsic (again, as #njuffa has noted).
You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
Does it make any difference if I use e.g. short or char type of variable instead of int as a for-loop initializer?
for (int i = 0; i < 10; ++i) {}
for (short i = 0; i < 10; ++i) {}
for (char i = 0; i < 10; ++i) {}
Or maybe there is no difference? Maybe I make the things even worse and efficiency decreases? Does using different type saves memory and increases speed? I am not sure, but I suppose that ++ operator may need to widen the type, and as a result: slow down the execution.
It will not make any difference you should be caring about, provided the range you iterate over fits into the type you choose. Performance-wise, you'll probably get the best results when the size of the iteration variable is the same as the platform's native integer size, but any decent compiler will optimize it to use that anyway. On a managed platform (e.g. C# or Java), you don't know the target platform at compile time, and the JIT compiler is basically free to optimize for whatever platform it is running on.
The only thing you might want to watch out for is when you use the loop counter for other things inside the loop; changing the type may change the way these things get executed, up to the point (in C++ at least) that a different overload for a function or method may get called because the loop variable has a different type. An example would be when you output the loop variable through a C++ stream, like so: cout << i << endl;. Similarly, the type of the loop variable can infest the implicit types of (sub-)expressions that contain it, and lead to hidden overflows in numeric calculations, e.g.: int j = i * i;.
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I know that you should only optimize things when it is deemed necessary. But, if it is deemed necessary, what are your favorite low level (as opposed to algorithmic level) optimization tricks.
For example: loop unrolling.
gcc -O2
Compilers do a lot better job of it than you can.
Picking a power of two for filters, circular buffers, etc.
So very, very convenient.
-Adam
Why, bit twiddling hacks, of course!
One of the most useful in scientific code is to replace pow(x,4) with x*x*x*x. Pow is almost always more expensive than multiplication. This is followed by
for(int i = 0; i < N; i++)
{
z += x/y;
}
to
double denom = 1/y;
for(int i = 0; i < N; i++)
{
z += x*denom;
}
But my favorite low level optimization is to figure out which calculations can be removed from a loop. Its always faster to do the calculation once rather than N times. Depending on your compiler, some of these may be automatically done for you.
Inspect the compiler's output, then try to coerce it to do something faster.
I wouldn't necessarily call it a low level optimization, but I have saved orders of magnitude more cycles through judicious application of caching than I have through all my applications of low level tricks combined. Many of these methods are applications specific.
Having an LRU cache of database queries (or any other IPC based request).
Remembering the last failed database query and returning a failure if re-requested within a certain time frame.
Remembering your location in a large data structure to ensure that if the next request is for the same node, the search is free.
Caching calculation results to prevent duplicate work. In addition to more complex scenarios, this is often found in if or for statements.
CPUs and compilers are constantly changing. Whatever low level code trick that made sense 3 CPU chips ago with a different compiler may actually be slower on the current architecture and there may be a good chance that this trick may confuse whoever is maintaining this code in the future.
++i can be faster than i++, because it avoids creating a temporary.
Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Using template metaprogramming to calculate things at compile time instead of at run-time.
Years ago with a not-so-smart compilier, I got great mileage from function inlining, walking pointers instead of indexing arrays, and iterating down to zero instead of up to a maximum.
When in doubt, a little knowledge of assembly will let you look at what the compiler is producing and attack the inefficient parts (in your source language, using structures friendlier to your compiler.)
precalculating values.
For instance, instead of sin(a) or cos(a), if your application doesn't necessarily need angles to be very precise, maybe you represent angles in 1/256 of a circle, and create arrays of floats sine[] and cosine[] precalculating the sin and cos of those angles.
And, if you need a vector at some angle of a given length frequently, you might precalculate all those sines and cosines already multiplied by that length.
Or, to put it more generally, trade memory for speed.
Or, even more generally, "All programming is an exercise in caching" -- Terje Mathisen
Some things are less obvious. For instance traversing a two dimensional array, you might do something like
for (x=0;x<maxx;x++)
for (y=0;y<maxy;y++)
do_something(a[x,y]);
You might find the processor cache likes it better if you do:
for (y=0;y<maxy;y++)
for (x=0;x<maxx;x++)
do_something(a[x,y]);
or vice versa.
Don't do loop unrolling. Don't do Duff's device. Make your loops as small as possible, anything else inhibits x86 performance and gcc optimizer performance.
Getting rid of branches can be useful, though - so getting rid of loops completely is good, and those branchless math tricks really do work. Beyond that, try never to go out of the L2 cache - this means a lot of precalculation/caching should also be avoided if it wastes cache space.
And, especially for x86, try to keep the number of variables in use at any one time down. It's hard to tell what compilers will do with that kind of thing, but usually having less loop iteration variables/array indexes will end up with better asm output.
Of course, this is for desktop CPUs; a slow CPU with fast memory access can precalculate a lot more, but in these days that might be an embedded system with little total memory anyway…
I've found that changing from a pointer to indexed access may make a difference; the compiler has different instruction forms and register usages to choose from. Vice versa, too. This is extremely low-level and compiler dependent, though, and only good when you need that last few percent.
E.g.
for (i = 0; i < n; ++i)
*p++ = ...; // some complicated expression
vs.
for (i = 0; i < n; ++i)
p[i] = ...; // some complicated expression
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
Allocating with new on a pre-allocated buffer using C++'s placement new.
Counting down a loop. It's cheaper to compare against 0 than N:
for (i = N; --i >= 0; ) ...
Shifting and masking by powers of two is cheaper than division and remainder, / and %
#define WORD_LOG 5
#define SIZE (1 << WORD_LOG)
#define MASK (SIZE - 1)
uint32_t bits[K]
void set_bit(unsigned i)
{
bits[i >> WORD_LOG] |= (1 << (i & MASK))
}
Edit
(i >> WORD_LOG) == (i / SIZE) and
(i & MASK) == (i % SIZE)
because SIZE is 32 or 2^5.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.
Eliminating branches (if/elses) by using boolean math:
if(x == 0)
x = 5;
// becomes:
x += (x == 0) * 5;
// if '5' was a base 2 number, let's say 4:
x += (x == 0) << 2;
// divide by 2 if flag is set
sum >>= (blendMode == BLEND);
This REALLY speeds things out especially when those ifs are in a loop or somewhere that is being called a lot.
The one from Assembler:
xor ax, ax
instead of:
mov ax, 0
Classical optimization for program size and performance.
In SQL, if you only need to know whether any data exists or not, don't bother with COUNT(*):
SELECT 1 FROM table WHERE some_primary_key = some_value
If your WHERE clause is likely return multiple rows, add a LIMIT 1 too.
(Remember that databases can't see what your code's doing with their results, so they can't optimise these things away on their own!)
Recycling the frame-pointer all of a sudden
Pascal calling-convention
Rewrite stack-frame tail call optimizarion (although it sometimes messes with the above)
Using vfork() instead of fork() before exec()
And one I am still looking for, an excuse to use: data driven code-generation at runtime
Liberal use of __restrict to eliminate load-hit-store stalls.
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
In addition to Joshua's comment about code generation (a big win), and other good suggestions, ...
I'm not sure if you would call it "low-level", but (and this is downvote-bait) 1) stay away from using any more levels of abstraction than absolutely necessary, and 2) stay away from event-driven notification-style programming, if possible.
If a computer executing a program is like a car running a race, a method call is like a detour. That's not necessarily bad except there's a strong temptation to nest those things, because once you're written a method call, you tend to forget what that call could cost you.
If your're relying on events and notifications, it's because you have multiple data structures that need to be kept in agreement. This is costly, and should only be done if you can't avoid it.
In my experience, the biggest performance killers are too much data structure and too much abstraction.
I was amazed at the speedup I got by replacing a for loop adding numbers together in structs:
const unsigned long SIZE = 100000000;
typedef struct {
int a;
int b;
int result;
} addition;
addition *sum;
void start() {
unsigned int byte_count = SIZE * sizeof(addition);
sum = malloc(byte_count);
unsigned int i = 0;
if (i < SIZE) {
do {
sum[i].a = i;
sum[i].b = i;
i++;
} while (i < SIZE);
}
}
void test_func() {
unsigned int i = 0;
if (i < SIZE) { // this is about 30% faster than the more obvious for loop, even with O3
do {
addition *s1 = &sum[i];
s1->result = s1->b + s1->a;
i++;
} while ( i<SIZE );
}
}
void finish() {
free(sum);
}
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?