Does this code fill the CPU cache? - optimization

I have two ways to program the same functionality.
Method 1:
doTheWork(int action)
{
for(int i = 0 i < 1000000000; ++i)
{
doAction(action);
}
}
Method 2:
doTheWork(int action)
{
switch(action)
{
case 1:
for(int i = 0 i < 1000000000; ++i)
{
doAction<1>();
}
break;
case 2:
for(int i = 0 i < 1000000000; ++i)
{
doAction<2>();
}
break;
//-----------------------------------------------
//... (there are 1000000 cases here)
//-----------------------------------------------
case 1000000:
for(int i = 0 i < 1000000000; ++i)
{
doAction<1000000>();
}
break;
}
}
Let's assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile-time. Calling doAction(#) is equiavalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
So my question is, do all the millions of lines of code fill the CPU L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or does only the lines of doAction<#>() inside the loop currently being run get cached?

It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.
However, Method 2 violently violates this decades rule of thumb: instructions are cheap, memory access is not.
Scalability limit
Your optimizations are usually linear - you might shave off 10, 20 maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.
As soon as your code size significantly exceeds the 2nd/3rd level cache's size, Method 2 will lose big time, as the following estimation of a high end consumer system shows:
DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS
gives you 10667MB / 75000M = 0.14 bytes per instruction for break even - anything larger, and main memory can't keep up with the CPU.
Typical x86 instruction sizes are 2..3 bytes executing in 1..2 cycles (now, granted, this isn't necessarily the same instructions, as x86 instructions are split up. Still...)
Typical x64 instruction lengths are even larger.
How much does your cache help?
I found the following number (different source, so it's hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.
In addition, your L2 cache will kick in completely only if
you have perfect prediciton of the next instructions or you don't have first-run penalty and it fits the cache completely
there's no significant amount of data being processed
there's no significant other code in your "inner loop"
there's no thread executing on this core
Given that, you can lose much earlier, especially on a CPU/board with smaller caches.

The L1 instruction cache will only contain instructions which were fetched recently or in anticipation of near future execution. As such, the second method cannot fill the L1 cache simply because the code is there. Your execution path will cause it to load the template instantiated version that represents the current loop being run. As you move to the next loop, it will generally invalidate the least recently used (LRU) cache line and replace it with what you are executing next.
In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.

Related

Why do two identical processes don't take the double CPU load

I am using a Raspberry Pi 3 and basically stepped over a little tripwire.
I have a very big and complicated program that takes away a lot of memory and has a big CPU load. I thought it was normal that if I started the same process while the first one was still running, it would take the same amount of memory and especially double the CPU Load. I found out that it doesn't take away more memory and does not affect the CPU load.
To find out if this behavior came from my program, I wrote a tiny c++ program that has extremely high memory usage, here it is:
#include <iostream>
using namespace std;
int main()
{
for(int i = 0; i<100; i++) {
float a[100][100][100];
for (int i2 = 0; i2 < 99; ++i2) {
for (int i3 = 0; i3 < 99; ++i3){
for (int i4 = 0; i4 < 99; ++i4){
a[i2][i3][i4] = i2*i3*i4;
cout << a[i2][i3][i4] << endl;
}
}
}
}
return 0;
}
The CPU-load is about at 30 % of the max-level, I started the code in one terminal. Strangely, when I started it in another terminal at the same time, it didnt affect the CPU load. I concluded that this behaviour couldn't come from my program.
Now I want to know:
Is there a "lock" that ensures that a certain type of process does not grill your cores?
Why don't two identical processes double the CPU load?
Well, I found out that there is a "lock" that makes sure a process doesn't take away all memory and makes the CPU load go up to 100%. It seems that than more processes there are, than more CPU load is there, but not in a linear way.
Additionally, the code I wrote to look for the behaviour has only high memory usage, the 30% level came from the cout in the standard library . Multiple processes can use the command at the same time without increasing the CPU load, but it affects the speed of the printing.
When I found out that, I got suspicious about the programs speed. I used the analytics in my IDE for c++ to find out the duration of my original program, and indeed it was a bit more that two times slower.
That seems to be the solution I was looking for, but I think this is not really applicable to a large audience since the structure of the Raspberry Pi is very particular. I don't know how this works for other systems.
BTW: I could have guessed that there is a lock. I mean, if you start 10 processes that take away 15% of the CPU load max level, you would have 150% CPU usage. IMPOSSIBLE!

How intensive is getting time?

Here's a low level question. How CPU intensive is getting system time?
What is the source of the time? I know there is a hardware clock on the bios chip but I'm thinking that getting data from outside the CPU and RAM will need some hardware synchronization which may delay the read so I'm guessing the CPU may have its own clock. Feel free to correct me if I'm wrong in any way.
Does getting time incur a heavy system function call or is it in any way dependent on the used programming language?
I have just tested it using a C++ program:
clock_t started = clock();
clock_t endClock = started + CLOCKS_PER_SEC;
long itera = 0;
for (; clock() < endClock; itera++)
{
}
I get about 23 million iterations per second (Windows 7, 32bit, Visual Studio 2015, 2.6 GHz CPU). In terms of your question, I would not call this intensive.
In debug mode, I measured 18 million iterations per second.
In case the time is transformed into a localized timestamp, complicated calendar calculations (timezone, daylight saving time, ...) might significantly slow down the loop.
It is not easy to tell what happens inside the clock() call. For my system, it calls QueryPerfomanceCounter, but this recurs to other system functions as explained here.
Tuning
To reduce the time measurement overhead even further, you can measure in every 10th, 100th ... iteration.
The following measures once in 1024 iterations:
for (; (itera & 0x03FF) || (clock() < endClock); itera++)
{
}
This brings up the loop per second count to some 500 million.
Tuning with Timer Thread
The following yields a further improvement of some 10% paid with additional complexity:
std::atomic<bool> processing = true;
// launch a timer thread to clear the processing flag after 1s
std::thread t([&processing]() {
std::this_thread::sleep_for(std::chrono::seconds(1));
processing = false;
});
for (; (itera & 0x03FF) || processing; itera++)
{
}
t.join();
An extra thread is started which sleeps for one second and then sets a control variable. The main thread executes the loop until the timer threads signals the end of processing.

Fastest way to compare uchar arrays in OpenCL

I need do many comparsions in opencl programm. Now i make it like this
int memcmp(__global unsigned char* a,__global unsigned char* b,__global int size){
for (int i = 0; i<size;i++){
if(a[i] != b[i])return 0;
}
return 1;
}
How i can make it faster? Maybe using vectors like uchar4 or somethins else? Thanks!
I guess that your kernel computes "size" elements for each thread. I think that your code can improve if your accesses are more coalesced. Thanks to the L1 caches of the current GPUs this is not a huge problem but it can imply a noticeable performance penalty. For example, you have 4 threads(work-items), size = 128, so the buffers have 512 uchars. In your case, thread #0 acceses to a[0] and b[0], but it brings to cache a[0]...a[63] and the same for b. thread #1 wich belongs to the same warp (aka wavefront) accesses to a[128] and b[128], so it brings to cache a[128]...a[191], etc. After thread #3 all the buffer is in the cache. This is not a problem here taking into account the small size of this domain.
However, if each thread accesses to each element consecutively, only one "cache line" is necessary all the time for your 4 threads execution (the accesses are coalesced). The behavior will be better when more threads per block are considered. Please, try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old but their concepts are not so old.
PS: By the way, after this I would try to use uchar4 as you commented and also the "loop unrolling".

Read a series of images row by row or entire image for performance?

I'm writing an application that takes a series of exposures of a target and computes their average and saves the resultant image. This technique is used extensively in astrophotography to reduce noise in the final image. Basically, one computes the average at pixel and writes out the value in the output file.
The number of exposures can be quite high, from 20 to 30 (sometimes even more), and with today's large CCD sensors the resolution, too, can be quite high. So the amount of data can be very very large.
My question is, when it comes to performance, should I read the images row by row (Method #1) or should I read the entire image array of all arrays (Method #2)? Using the former method, I will have to load every corresponding row. So, if I have 10 images and I'm reading row #1 - I will have to read the first row from each image, compute their average and write out the row.
With the latter method, I read all images in their entirety, compute and write out the entire image.
In theory, the latter method ought to be much faster but much more memory intensive. In practice however, I've found that the difference in performance isn't great and this was puzzeling. At most, Method #2 was only 2 to 3 seconds faster than Method #1. However, Method #2 was using upto 1.3 GB of memory for 24 8megapixel images. Method #1, on the other hand, was at most using 70MB. On average, both methods are taking about 20 seconds to process 24 8megapixel images.
I am writing this in Objective-C with a good amount of C thrown in when calling CFITSIO.
Here's Method #1:
pixelRows = (double**)malloc(self.numberOfImages * sizeof(double*)); //alloc. pixel array.
for(i=0;i<self.numberOfImages;i++)
{
pixelRows[i] = (double*)malloc(width*sizeof(double));
}
apix = (double*)malloc(width*sizeof(double));
for(firstpix[1]=1;firstpix[1]<=size[1];firstpix[1]++)
{
[self gatherRowsFromImages:firstpix[1] withRowWidth:theWidth thePixelMap:pixelRows];
[self averageRows:pixelRows width:width theAveragedRow:apix];
fits_write_pix(outfptr, TDOUBLE, firstpix, width,apix, &status);
//NSLog(#"Row %ld written.",firstpix[1]);
}
fits_close_file(outfptr,&status);
NSLog(#"End");
if(!status)
{
NSLog(#"File written successfully.");
}
for(i=0;i<self.numberOfImages;i++)
{
free(pixelRows[i]);
}
free(pixelRows);
free(apix);
Here's Method #2:
imageArray = (double**)malloc(files.count * sizeof(double*));
for(i=0;i<files.count;i++)
{
imageArray[i] = (double*)malloc(size[0] * size[1] * sizeof(double));
fits_read_pix(fptr[i],TDOUBLE,firstpix,size[0] * size[1],NULL,imageArray[i],NULL,&status);
//NSLog(#"%d",status);
}
int fileIndex;
NSLog(#"%d",files.count);
apix = (double*)malloc(size[0] * size[1] * sizeof(double));
for(i=0;i<(size[0] * size[1]);i++)
{
apix[i] = 0.0;
for(fileIndex=0;fileIndex<files.count;fileIndex++)
{
apix[i] = apix[i] + imageArray[fileIndex][i];
}
//NSLog(#"%f",apix[i]);
apix[i] = apix[i] / files.count;
}
fits_create_file(&outfptr,[outPath UTF8String],&status);
fits_copy_header(fptr[0],outfptr,&status);
fits_write_pix(outfptr, TDOUBLE, firstpix, size[0] * size[1],apix, &status);
fits_close_file(outfptr,&status);
Any suggestions regarding this? Am I expecting too much of a gain by reading in every image in its entirety?
I would always go for the row-by-row approach, since it is scalable. It may also be faster since the memory footprint is smaller, meaning there is no need to swap out any program to disk just for your memory hungry tool.
Furthermore, to optimize the row-by-row approach, you should also consider reading in images per 8 rows (or some other number). E.g. JPEG is stored in 8x8 blocks, so reading in less than 8 rows would be pointless. Of course this depends on the image format and the library you are using.
There are also other considerations regarding the use of cache memory by the cpu. Memory locations that are used frequently don't have to travel to the "slow" memory but can stay closer to the cpu. There are several levels of cache and they vary in size per cpu type. (the biggest of which is typically 8 or 16 mb at the time of writing)
Another thing to consider is the code that does the actual averaging. Tuning this will also gain a lot, especially for the kind of operation you're doing, look at SSE and related topics. Also using integer calculations will probably beat floating point arithmetic. Using bit shifts for division might also be faster than true division, but it will only allow you to divide by 2^n.

What are you favorite low level code optimization tricks? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I know that you should only optimize things when it is deemed necessary. But, if it is deemed necessary, what are your favorite low level (as opposed to algorithmic level) optimization tricks.
For example: loop unrolling.
gcc -O2
Compilers do a lot better job of it than you can.
Picking a power of two for filters, circular buffers, etc.
So very, very convenient.
-Adam
Why, bit twiddling hacks, of course!
One of the most useful in scientific code is to replace pow(x,4) with x*x*x*x. Pow is almost always more expensive than multiplication. This is followed by
for(int i = 0; i < N; i++)
{
z += x/y;
}
to
double denom = 1/y;
for(int i = 0; i < N; i++)
{
z += x*denom;
}
But my favorite low level optimization is to figure out which calculations can be removed from a loop. Its always faster to do the calculation once rather than N times. Depending on your compiler, some of these may be automatically done for you.
Inspect the compiler's output, then try to coerce it to do something faster.
I wouldn't necessarily call it a low level optimization, but I have saved orders of magnitude more cycles through judicious application of caching than I have through all my applications of low level tricks combined. Many of these methods are applications specific.
Having an LRU cache of database queries (or any other IPC based request).
Remembering the last failed database query and returning a failure if re-requested within a certain time frame.
Remembering your location in a large data structure to ensure that if the next request is for the same node, the search is free.
Caching calculation results to prevent duplicate work. In addition to more complex scenarios, this is often found in if or for statements.
CPUs and compilers are constantly changing. Whatever low level code trick that made sense 3 CPU chips ago with a different compiler may actually be slower on the current architecture and there may be a good chance that this trick may confuse whoever is maintaining this code in the future.
++i can be faster than i++, because it avoids creating a temporary.
Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Using template metaprogramming to calculate things at compile time instead of at run-time.
Years ago with a not-so-smart compilier, I got great mileage from function inlining, walking pointers instead of indexing arrays, and iterating down to zero instead of up to a maximum.
When in doubt, a little knowledge of assembly will let you look at what the compiler is producing and attack the inefficient parts (in your source language, using structures friendlier to your compiler.)
precalculating values.
For instance, instead of sin(a) or cos(a), if your application doesn't necessarily need angles to be very precise, maybe you represent angles in 1/256 of a circle, and create arrays of floats sine[] and cosine[] precalculating the sin and cos of those angles.
And, if you need a vector at some angle of a given length frequently, you might precalculate all those sines and cosines already multiplied by that length.
Or, to put it more generally, trade memory for speed.
Or, even more generally, "All programming is an exercise in caching" -- Terje Mathisen
Some things are less obvious. For instance traversing a two dimensional array, you might do something like
for (x=0;x<maxx;x++)
for (y=0;y<maxy;y++)
do_something(a[x,y]);
You might find the processor cache likes it better if you do:
for (y=0;y<maxy;y++)
for (x=0;x<maxx;x++)
do_something(a[x,y]);
or vice versa.
Don't do loop unrolling. Don't do Duff's device. Make your loops as small as possible, anything else inhibits x86 performance and gcc optimizer performance.
Getting rid of branches can be useful, though - so getting rid of loops completely is good, and those branchless math tricks really do work. Beyond that, try never to go out of the L2 cache - this means a lot of precalculation/caching should also be avoided if it wastes cache space.
And, especially for x86, try to keep the number of variables in use at any one time down. It's hard to tell what compilers will do with that kind of thing, but usually having less loop iteration variables/array indexes will end up with better asm output.
Of course, this is for desktop CPUs; a slow CPU with fast memory access can precalculate a lot more, but in these days that might be an embedded system with little total memory anyway…
I've found that changing from a pointer to indexed access may make a difference; the compiler has different instruction forms and register usages to choose from. Vice versa, too. This is extremely low-level and compiler dependent, though, and only good when you need that last few percent.
E.g.
for (i = 0; i < n; ++i)
*p++ = ...; // some complicated expression
vs.
for (i = 0; i < n; ++i)
p[i] = ...; // some complicated expression
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
Allocating with new on a pre-allocated buffer using C++'s placement new.
Counting down a loop. It's cheaper to compare against 0 than N:
for (i = N; --i >= 0; ) ...
Shifting and masking by powers of two is cheaper than division and remainder, / and %
#define WORD_LOG 5
#define SIZE (1 << WORD_LOG)
#define MASK (SIZE - 1)
uint32_t bits[K]
void set_bit(unsigned i)
{
bits[i >> WORD_LOG] |= (1 << (i & MASK))
}
Edit
(i >> WORD_LOG) == (i / SIZE) and
(i & MASK) == (i % SIZE)
because SIZE is 32 or 2^5.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.
Eliminating branches (if/elses) by using boolean math:
if(x == 0)
x = 5;
// becomes:
x += (x == 0) * 5;
// if '5' was a base 2 number, let's say 4:
x += (x == 0) << 2;
// divide by 2 if flag is set
sum >>= (blendMode == BLEND);
This REALLY speeds things out especially when those ifs are in a loop or somewhere that is being called a lot.
The one from Assembler:
xor ax, ax
instead of:
mov ax, 0
Classical optimization for program size and performance.
In SQL, if you only need to know whether any data exists or not, don't bother with COUNT(*):
SELECT 1 FROM table WHERE some_primary_key = some_value
If your WHERE clause is likely return multiple rows, add a LIMIT 1 too.
(Remember that databases can't see what your code's doing with their results, so they can't optimise these things away on their own!)
Recycling the frame-pointer all of a sudden
Pascal calling-convention
Rewrite stack-frame tail call optimizarion (although it sometimes messes with the above)
Using vfork() instead of fork() before exec()
And one I am still looking for, an excuse to use: data driven code-generation at runtime
Liberal use of __restrict to eliminate load-hit-store stalls.
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
In addition to Joshua's comment about code generation (a big win), and other good suggestions, ...
I'm not sure if you would call it "low-level", but (and this is downvote-bait) 1) stay away from using any more levels of abstraction than absolutely necessary, and 2) stay away from event-driven notification-style programming, if possible.
If a computer executing a program is like a car running a race, a method call is like a detour. That's not necessarily bad except there's a strong temptation to nest those things, because once you're written a method call, you tend to forget what that call could cost you.
If your're relying on events and notifications, it's because you have multiple data structures that need to be kept in agreement. This is costly, and should only be done if you can't avoid it.
In my experience, the biggest performance killers are too much data structure and too much abstraction.
I was amazed at the speedup I got by replacing a for loop adding numbers together in structs:
const unsigned long SIZE = 100000000;
typedef struct {
int a;
int b;
int result;
} addition;
addition *sum;
void start() {
unsigned int byte_count = SIZE * sizeof(addition);
sum = malloc(byte_count);
unsigned int i = 0;
if (i < SIZE) {
do {
sum[i].a = i;
sum[i].b = i;
i++;
} while (i < SIZE);
}
}
void test_func() {
unsigned int i = 0;
if (i < SIZE) { // this is about 30% faster than the more obvious for loop, even with O3
do {
addition *s1 = &sum[i];
s1->result = s1->b + s1->a;
i++;
} while ( i<SIZE );
}
}
void finish() {
free(sum);
}
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?