Fastest way to compare uchar arrays in OpenCL - optimization

I need to do many comparisons in an OpenCL program. Currently I do it like this:
// Returns 1 if the first size bytes of a and b are equal, 0 otherwise.
// size is passed by value, so it takes no address space qualifier.
int memcmp(__global unsigned char* a, __global unsigned char* b, int size){
    for (int i = 0; i < size; i++){
        if (a[i] != b[i]) return 0;
    }
    return 1;
}
How can I make it faster? Maybe by using vectors like uchar4 or something else? Thanks!

I guess that each thread (work-item) of your kernel compares "size" elements. I think your code can improve if your accesses are more coalesced. Thanks to the L1 caches of current GPUs this is not a huge problem, but it can still carry a noticeable performance penalty. For example, say you have 4 threads (work-items) and size = 128, so the buffers hold 512 uchars. In your case, thread #0 accesses a[0] and b[0], but that brings a[0]...a[63] into the cache, and the same for b. Thread #1, which belongs to the same warp (aka wavefront), accesses a[128] and b[128], so it brings a[128]...a[191] into the cache, and so on. After thread #3 the whole buffer is in the cache. That is not a problem here, given the small size of this domain.
However, if the threads access consecutive elements (thread #0 reads element 0, thread #1 reads element 1, and so on), a single "cache line" serves all 4 threads at any given moment (the accesses are coalesced). The benefit grows as more threads per block are used. Please try it and tell me your conclusions. Thank you.
See: http://www.nvidia.com/content/cudazone/download/opencl/nvidia_opencl_programmingguide.pdf Section 3.1.2.1
It is a bit old, but the concepts still apply.
PS: By the way, after this I would try uchar4, as you suggested, and also loop unrolling.
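To illustrate the "compare several bytes per step" idea, here is a minimal host-side sketch in plain C, not OpenCL kernel code. The function name bytes_equal_wide is made up for this example. In a kernel the analogous step would compare uchar4 values instead of 32-bit words, and whether it actually helps has to be measured.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the first `size` bytes of a and b are equal, else 0.
 * Compares four bytes per step, then finishes the tail byte by byte.
 * Hypothetical helper, not part of the original question. */
int bytes_equal_wide(const unsigned char *a, const unsigned char *b, size_t size)
{
    size_t i = 0;
    for (; i + 4 <= size; i += 4) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);  /* memcpy avoids unaligned loads */
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
    }
    for (; i < size; i++) {             /* remaining 0..3 bytes */
        if (a[i] != b[i])
            return 0;
    }
    return 1;
}
```

The early return is kept from the original loop, so a mismatch in the first word still stops the comparison immediately.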

Related

Is Unique Thread Id guaranteed for each Kernel Call in CUDA?

I have recently started working with CUDA. I have experience with multithreaded and multiprocess coding in C++, Java and Python.
With PyCuda I see example codes like this,
ker = SourceModule("""
__global__ void scalar_multiply_kernel(float *outvec, float scalar, float *vec)
{
    int i = threadIdx.x;
    outvec[i] = scalar*vec[i];
}
""")
It seems the thread id itself takes part in the logic of the code. So the question is: will there be enough thread ids to cover my entire array (which I apparently need in order to index all of its elements), and what happens if I change the size of the array?
Will the indexing always be between 0 and N?
In CUDA the thread id is only unique within a so-called thread block, meaning that your example kernel only does the right thing when a single block does the work. This is probably done in early examples to ease you into the ideas, but it is generally a very bad thing to do in terms of performance:
With one block, you can only utilize one of many streaming multiprocessors (SMs) in a GPU and even that SM will only be able to hide memory access latencies when it has enough parallel work to do while waiting.
A single thread-block also limits you in the number of threads and therefore in the problem-size, if your kernel doesn't contain a loop so every thread can compute more than one element.
Kernel execution is seen strongly hierarchically: Restricting ourselves to one dimensional indexing for simplicity, a kernel is executed on a so-called grid of gridDim.x thread blocks, each containing blockDim.x threads numbered per block by threadIdx.x, while each block is numbered via blockIdx.x.
To get a unique ID for a thread (in a fashion that ideally uses the hardware to load elements from an array), you have to take blockIdx.x * blockDim.x + threadIdx.x. If more than one element shall be computed by every thread, you use a loop of the form
for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < InputSize; i += gridDim.x * blockDim.x) {
    /* ... */
}
This is called a grid-stride loop, because gridDim.x * blockDim.x is the number of all threads working on the kernel. Different strides (especially having a thread working on consecutive elements: stride = 1) might work, but will be much slower due to the non-ideal memory access pattern.
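To see that a grid-stride loop touches every element exactly once, whatever the launch size, here is a small host-side simulation in plain C. It is only a sketch: the function name and the grid_dim and block_dim parameters are made up for this example and stand in for gridDim.x and blockDim.x, and the "threads" simply run one after another.

```c
#include <stddef.h>

/* Simulates a 1D kernel launch on the host: every (block, thread) pair
 * runs the grid-stride loop in turn. Hypothetical helper for illustration. */
void scale_grid_stride(float *out, const float *in, float scalar, size_t n,
                       size_t grid_dim, size_t block_dim)
{
    size_t stride = grid_dim * block_dim;   /* total number of threads */
    for (size_t block = 0; block < grid_dim; block++) {
        for (size_t thread = 0; thread < block_dim; thread++) {
            /* same index arithmetic as blockIdx.x * blockDim.x + threadIdx.x */
            for (size_t i = block * block_dim + thread; i < n; i += stride)
                out[i] = scalar * in[i];
        }
    }
}
```

With 2 blocks of 3 threads and n = 10, thread 0 of block 0 handles elements 0 and 6, thread 1 handles 1 and 7, and so on, so changing the array size only changes how many trips each thread makes through its loop.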

CUDA filter where the output of one block is the input of the next block

I am working on a filter and have a problem porting the following code, which processes an image, to the GPU:
for(int h=0; h<height; h++) {
    for(int w=1; w<width; w++) {
        image[h][w] = (1-a)*image[h][w] + a*image[h][w-1];
    }
}
If I define:
dim3 threads_perblock(32, 32);
then the 32 x 32 threads within one block can communicate with each other, but they cannot communicate with threads from other blocks.
Within a thread block I can translate this piece of code using shared memory. The problem is at the block edges: image[0][31] and image[0][32] are handled by different thread blocks, yet image[0][32] needs the value of image[0][31] to calculate its own value.
How would I solve this?
Thanks in advance.
If image is in global memory then there is no problem - you don't need to use shared memory and you can just access pixels directly from image without any problem.
However if you have already done some processing prior to this, and a block of image is already in shared memory, then you have a problem, since you need to do neighbourhood operations which are outside the range of your block. You can do one of the following - either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
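To make the data dependency in the question's loop concrete, here is the same recurrence as a plain C function over a flat row-major buffer. This is a CPU sketch only, with a made-up function name, and it is not a CUDA solution. It shows that each pixel reads the already-updated pixel to its left, while different rows never read each other's data.

```c
#include <stddef.h>

/* In-place recursive filter along each row of a row-major image.
 * image[h * width + w] depends on the already-updated left neighbour.
 * Hypothetical helper for illustration. */
void filter_rows(float *image, size_t height, size_t width, float a)
{
    for (size_t h = 0; h < height; h++) {
        float *row = image + h * width;     /* rows are independent */
        for (size_t w = 1; w < width; w++)
            row[w] = (1.0f - a) * row[w] + a * row[w - 1];
    }
}
```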
You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute and the thread processing a[31] do
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer though...
Be careful however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.

Does this code fill the CPU cache?

I have two ways to program the same functionality.
Method 1:
void doTheWork(int action)
{
    for(int i = 0; i < 1000000000; ++i)
    {
        doAction(action);
    }
}
Method 2:
void doTheWork(int action)
{
    switch(action)
    {
    case 1:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1>();
        }
        break;
    case 2:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<2>();
        }
        break;
    //-----------------------------------------------
    //... (there are 1000000 cases here)
    //-----------------------------------------------
    case 1000000:
        for(int i = 0; i < 1000000000; ++i)
        {
            doAction<1000000>();
        }
        break;
    }
}
Let's assume that the function doAction(int action) and the function template<int Action> doAction() consist of about 10 lines of code that will get inlined at compile time. Calling doAction(#) is equivalent to doAction<#>() in functionality, but the non-templated doAction(int value) is somewhat slower than template<int Value> doAction(), since some nice optimizations can be done in the code when the argument value is known at compile time.
So my question is: do all the millions of lines of code fill the CPU L1 cache (and more) in the case of the templated function (and thus degrade performance considerably), or do only the lines of doAction<#>() inside the loop currently being run get cached?
It depends on the actual code size - 10 lines of code can be little or much - and of course on the actual machine.
However, Method 2 violently violates this decades-old rule of thumb: instructions are cheap, memory access is not.
Scalability limit
Your optimizations are usually linear - you might shave off 10, 20 maybe even 30% of execution time. Hitting a cache limit is highly nonlinear - as in "running into a brick wall" nonlinear.
As soon as your code size significantly exceeds the 2nd/3rd level cache's size, Method 2 will lose big time, as the following estimation of a high end consumer system shows:
DDR3-1333 with 10667MB/s peak memory bandwidth,
Intel Core i7 Extreme with ~75000 MIPS
gives you 10667MB / 75000M = 0.14 bytes per instruction for break even - anything larger, and main memory can't keep up with the CPU.
Typical x86 instruction sizes are 2..3 bytes executing in 1..2 cycles (now, granted, this isn't necessarily the same instructions, as x86 instructions are split up. Still...)
Typical x64 instruction lengths are even larger.
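The break-even estimate above is a single division. Here it is spelled out as a small C sketch, with made-up function names, that reuses the two figures quoted in this answer (10667 MB/s and roughly 75000 MIPS).

```c
/* Bytes of instruction stream that main memory can deliver per executed
 * instruction: (MB/s) / (million instructions per second).
 * Hypothetical helper for illustration. */
double bytes_per_instruction(double bandwidth_mb_per_s, double mips)
{
    return bandwidth_mb_per_s / mips;
}

/* Returns 1 if memory can feed code with the given average instruction
 * size in bytes, 0 if the CPU would have to wait for main memory. */
int memory_keeps_up(double bandwidth_mb_per_s, double mips,
                    double avg_instr_bytes)
{
    return bytes_per_instruction(bandwidth_mb_per_s, mips) >= avg_instr_bytes;
}
```

For the numbers above this gives about 0.14 bytes per instruction, far below the 2 to 3 bytes of a typical x86 instruction, which is the point of the estimate.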
How much does your cache help?
I found the following number (different source, so it's hard to compare):
i7 Nehalem L2 cache (256K, >200GB/s bandwidth) which could almost keep up with x86 instructions, but probably not with x64.
In addition, your L2 cache will kick in completely only if
you have perfect prediction of the next instructions, or you don't have a first-run penalty and the code fits the cache completely
there's no significant amount of data being processed
there's no significant other code in your "inner loop"
there's no other thread executing on this core
Given that, you can lose much earlier, especially on a CPU/board with smaller caches.
The L1 instruction cache will only contain instructions which were fetched recently or in anticipation of near future execution. As such, the second method cannot fill the L1 cache simply because the code is there. Your execution path will cause it to load the template instantiated version that represents the current loop being run. As you move to the next loop, it will generally invalidate the least recently used (LRU) cache line and replace it with what you are executing next.
In other words, due to the looping nature of both your methods, the L1 cache will perform admirably in both cases and won't be the bottleneck.

Reducing Number of Registers Used in CUDA Kernel

I have a kernel which uses 17 registers; reducing it to 16 would bring me 100% occupancy. My question is: are there methods that can be used to reduce the number of registers used, short of completely rewriting my algorithms in a different manner? I have always kind of assumed the compiler is a lot smarter than I am, so for example I often use extra variables for clarity's sake alone. Am I wrong in this thinking?
Please note: I do know about the --max_registers (or whatever the syntax is) flag, but the use of local memory would be more detrimental than a 25% lower occupancy (I should test this)
Occupancy can be a little misleading and 100% occupancy should not be your primary target. If you can get fully coalesced accesses to global memory then on a high end GPU 50% occupancy will be sufficient to hide the latency to global memory (for floats, even lower for doubles). Check out the Advanced CUDA C presentation from GTC last year for more information on this topic.
In your case, you should measure performance both with and without maxrregcount set to 16. The latency to local memory should be hidden as a result of having sufficient threads, assuming you don't random access into local arrays (which would result in non-coalesced accesses).
To answer your specific question about reducing registers, post the code for more detailed answers! Understanding how compilers work in general may help, but remember that nvcc is an optimising compiler with a large parameter space, so minimising register count has to be balanced with overall performance.
It's really hard to say, nvcc compiler is not very smart in my opinion.
You can try obvious things, for example using short instead of int, passing and using variables by reference (e.g. &variable), unrolling loops, using templates (as in C++). If you have divisions or transcendental functions being applied in sequence, try to turn them into a loop. Try to get rid of conditionals, possibly replacing them with redundant computations.
If you post some code, maybe you will get specific answers.
Utilizing shared memory as a cache may lead to lower register usage and prevent register spilling to local memory...
Suppose the kernel calculates some values that are then used by all of the threads:
__global__ void kernel(...) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    int id0 = blockDim.x * blockIdx.x;
    int reg = id0 * ...;
    int reg0 = reg * a / x + y;
    ...
    int val = reg + reg0 + 2 * idx;
    output[idx] = val > 10;
}
So, instead of keeping reg and reg0 in registers and possibly making them spill out to local memory (global memory), we may use shared memory.
__global__ void kernel(...) {
    __shared__ int cache[10];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadIdx.x == 0) {
        int id0 = blockDim.x * blockIdx.x;
        cache[0] = id0 * ...;
        cache[1] = cache[0] * a / x + y;
    }
    __syncthreads();
    ...
    int val = cache[0] + cache[1] + 2 * idx;
    output[idx] = val > 10;
}
Take a look at this paper for further information..
It is not generally a good approach to minimize register pressure. The compiler does a good job optimizing the overall projected kernel performance, and it takes lots of factors into account, including register usage.
How can reducing the number of registers cause slower speed?
Most probably the compiler had to spill the data that no longer fit in registers into "local" memory, which is essentially the same as global memory, and thus very slow.
For optimization purposes I would recommend using keywords like const, volatile and so on where necessary, to help the compiler in the optimization phase.
Anyway, it is usually not tiny issues like registers that make CUDA kernels run slow. I'd recommend optimizing the work with global memory: the access pattern, caching in texture memory if possible, and transactions over PCIe.
The increase in instruction count when lowering register usage has a simple explanation. The compiler may be using registers to store the results of operations that are used more than once in your code, in order to avoid recalculating those values. When forced to use fewer registers, the compiler decides to recalculate those values instead.

What are your favorite low level code optimization tricks? [closed]

I know that you should only optimize things when it is deemed necessary. But, if it is deemed necessary, what are your favorite low level (as opposed to algorithmic level) optimization tricks.
For example: loop unrolling.
gcc -O2
Compilers do a lot better job of it than you can.
Picking a power of two for filters, circular buffers, etc.
So very, very convenient.
-Adam
Why, bit twiddling hacks, of course!
One of the most useful in scientific code is to replace pow(x,4) with x*x*x*x. Pow is almost always more expensive than multiplication. This is followed by changing
for(int i = 0; i < N; i++)
{
    z += x/y;
}
to
double denom = 1/y;
for(int i = 0; i < N; i++)
{
    z += x*denom;
}
But my favorite low level optimization is to figure out which calculations can be removed from a loop. It's always faster to do the calculation once rather than N times. Depending on your compiler, some of these may be automatically done for you.
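As a self-contained sketch of the division example, here are both variants as C functions (the names are made up for this example). One thing to keep in mind: x * (1/y) and x / y are not always bit-identical in floating point, so the two versions can differ in the last digits, which is one reason a compiler may not do this rewrite for you by default.

```c
#include <stddef.h>

/* Straightforward version: one division per iteration. */
double accumulate_div(double x, double y, size_t n)
{
    double z = 0.0;
    for (size_t i = 0; i < n; i++)
        z += x / y;
    return z;
}

/* Hoisted version: the division happens once, outside the loop,
 * and each iteration only multiplies. */
double accumulate_mul(double x, double y, size_t n)
{
    double z = 0.0;
    double denom = 1.0 / y;
    for (size_t i = 0; i < n; i++)
        z += x * denom;
    return z;
}
```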
Inspect the compiler's output, then try to coerce it to do something faster.
I wouldn't necessarily call it a low level optimization, but I have saved orders of magnitude more cycles through judicious application of caching than I have through all my applications of low level tricks combined. Many of these methods are applications specific.
Having an LRU cache of database queries (or any other IPC based request).
Remembering the last failed database query and returning a failure if re-requested within a certain time frame.
Remembering your location in a large data structure to ensure that if the next request is for the same node, the search is free.
Caching calculation results to prevent duplicate work. In addition to more complex scenarios, this is often found in if or for statements.
CPUs and compilers are constantly changing. Whatever low level code trick that made sense 3 CPU chips ago with a different compiler may actually be slower on the current architecture and there may be a good chance that this trick may confuse whoever is maintaining this code in the future.
++i can be faster than i++, because it avoids creating a temporary.
Whether this still holds for modern C/C++/Java/C# compilers, I don't know. It might well be different for user-defined types with overloaded operators, whereas in the case of simple integers it probably doesn't matter.
But I've come to like the syntax... it reads like "increment i" which is a sensible order.
Using template metaprogramming to calculate things at compile time instead of at run-time.
Years ago with a not-so-smart compiler, I got great mileage from function inlining, walking pointers instead of indexing arrays, and iterating down to zero instead of up to a maximum.
When in doubt, a little knowledge of assembly will let you look at what the compiler is producing and attack the inefficient parts (in your source language, using structures friendlier to your compiler.)
precalculating values.
For instance, instead of sin(a) or cos(a), if your application doesn't necessarily need angles to be very precise, maybe you represent angles in 1/256 of a circle, and create arrays of floats sine[] and cosine[] precalculating the sin and cos of those angles.
And, if you need a vector at some angle of a given length frequently, you might precalculate all those sines and cosines already multiplied by that length.
Or, to put it more generally, trade memory for speed.
Or, even more generally, "All programming is an exercise in caching" -- Terje Mathisen
Some things are less obvious. For instance, when traversing a two dimensional array, you might do something like
for (x=0;x<maxx;x++)
    for (y=0;y<maxy;y++)
        do_something(a[x][y]);
You might find the processor cache likes it better if you do:
for (y=0;y<maxy;y++)
    for (x=0;x<maxx;x++)
        do_something(a[x][y]);
or vice versa.
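Which order is the friendly one depends on the memory layout. In C, arrays are row-major, so the loop nest whose inner index is the last subscript walks consecutive addresses. A minimal sketch over a flat buffer, with made-up function names, to show the two access patterns side by side:

```c
#include <stddef.h>

/* Inner loop runs over the column index: consecutive addresses (stride 1). */
long sum_inner_contiguous(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += a[r * cols + c];
    return sum;
}

/* Inner loop runs over the row index: each step jumps `cols` elements. */
long sum_inner_strided(const int *a, size_t rows, size_t cols)
{
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += a[r * cols + c];
    return sum;
}
```

Both functions return the same sum. Only the order of memory accesses differs, so any speed difference has to be measured on arrays large enough not to fit in cache.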
Don't do loop unrolling. Don't do Duff's device. Make your loops as small as possible, anything else inhibits x86 performance and gcc optimizer performance.
Getting rid of branches can be useful, though - so getting rid of loops completely is good, and those branchless math tricks really do work. Beyond that, try never to go out of the L2 cache - this means a lot of precalculation/caching should also be avoided if it wastes cache space.
And, especially for x86, try to keep the number of variables in use at any one time down. It's hard to tell what compilers will do with that kind of thing, but usually having less loop iteration variables/array indexes will end up with better asm output.
Of course, this is for desktop CPUs; a slow CPU with fast memory access can precalculate a lot more, but in these days that might be an embedded system with little total memory anyway…
I've found that changing from a pointer to indexed access may make a difference; the compiler has different instruction forms and register usages to choose from. Vice versa, too. This is extremely low-level and compiler dependent, though, and only good when you need that last few percent.
E.g.
for (i = 0; i < n; ++i)
*p++ = ...; // some complicated expression
vs.
for (i = 0; i < n; ++i)
p[i] = ...; // some complicated expression
Optimizing cache locality - for example when multiplying two matrices that don't fit into cache.
Allocating with new on a pre-allocated buffer using C++'s placement new.
Counting down a loop. It's cheaper to compare against 0 than N:
for (i = N; --i >= 0; ) ...
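A small C sketch of that countdown form, with a made-up function name. Note that the loop relies on a signed index: with an unsigned i, the condition --i >= 0 is always true and the loop never terminates.

```c
/* Sums v[n-1] down to v[0] using the countdown idiom.
 * The index must be signed so that --i can become negative. */
long sum_countdown(const int *v, int n)
{
    long total = 0;
    for (int i = n; --i >= 0; )
        total += v[i];
    return total;
}
```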
Shifting and masking by powers of two is cheaper than division and remainder, / and %
#define WORD_LOG 5
#define SIZE (1 << WORD_LOG)
#define MASK (SIZE - 1)
uint32_t bits[K];
void set_bit(unsigned i)
{
    /* 1u keeps the shift unsigned, so bit 31 is set without overflow */
    bits[i >> WORD_LOG] |= (1u << (i & MASK));
}
Edit
(i >> WORD_LOG) == (i / SIZE) and
(i & MASK) == (i % SIZE)
because SIZE is 32 or 2^5.
Jon Bentley's Writing Efficient Programs is a great source of low- and high-level techniques -- if you can find a copy.
Eliminating branches (if/elses) by using boolean math:
if(x == 0)
x = 5;
// becomes:
x += (x == 0) * 5;
// if '5' was a base 2 number, let's say 4:
x += (x == 0) << 2;
// divide by 2 if flag is set
sum >>= (blendMode == BLEND);
This REALLY speeds things up, especially when those ifs are in a loop or somewhere that is being called a lot.
The one from Assembler:
xor ax, ax
instead of:
mov ax, 0
Classical optimization for program size and performance.
In SQL, if you only need to know whether any data exists or not, don't bother with COUNT(*):
SELECT 1 FROM table WHERE some_primary_key = some_value
If your WHERE clause is likely to return multiple rows, add a LIMIT 1 too.
(Remember that databases can't see what your code's doing with their results, so they can't optimise these things away on their own!)
Recycling the frame-pointer all of a sudden
Pascal calling-convention
Rewrite stack-frame tail call optimization (although it sometimes messes with the above)
Using vfork() instead of fork() before exec()
And one I am still looking for, an excuse to use: data driven code-generation at runtime
Liberal use of __restrict to eliminate load-hit-store stalls.
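For reference, __restrict is the compiler-extension spelling. Standard C (since C99) spells the keyword restrict. A minimal sketch with a made-up function name: the qualifier is a promise from the caller that the arrays do not overlap, and calling it with overlapping arrays is undefined behaviour.

```c
#include <stddef.h>

/* `restrict` tells the compiler that out, a and b never alias, so a store
 * to out[i] cannot change a[] or b[] and loads need not be repeated. */
void add_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```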
Rolling up loops.
Seriously, the last time I needed to do anything like this was in a function that took 80% of the runtime, so it was worth trying to micro-optimize if I could get a noticeable performance increase.
The first thing I did was to roll up the loop. This gave me a very significant speed increase. I believe this was a matter of cache locality.
The next thing I did was add a layer of indirection, and put some more logic into the loop, which allowed me to only loop through the things I needed. This wasn't as much of a speed increase, but it was worth doing.
If you're going to micro-optimize, you need to have a reasonable idea of two things: the architecture you're actually using (which is vastly different from the systems I grew up with, at least for micro-optimization purposes), and what the compiler will do for you.
A lot of the traditional micro-optimizations trade space for time. Nowadays, using more space increases the chances of a cache miss, and there goes your performance. Moreover, a lot of them are now done by modern compilers, and typically better than you're likely to do them.
Currently, you should (a) profile to see if you need to micro-optimize, and then (b) try to trade computation for space, in the hope of keeping as much as possible in cache. Finally, run some tests, so you know if you've improved things or screwed them up. Modern compilers and chips are far too complex for you to keep a good mental model, and the only way you'll know if some optimization works or not is to test.
In addition to Joshua's comment about code generation (a big win), and other good suggestions, ...
I'm not sure if you would call it "low-level", but (and this is downvote-bait) 1) stay away from using any more levels of abstraction than absolutely necessary, and 2) stay away from event-driven notification-style programming, if possible.
If a computer executing a program is like a car running a race, a method call is like a detour. That's not necessarily bad, except there's a strong temptation to nest those things, because once you've written a method call, you tend to forget what that call could cost you.
If you're relying on events and notifications, it's because you have multiple data structures that need to be kept in agreement. This is costly, and should only be done if you can't avoid it.
In my experience, the biggest performance killers are too much data structure and too much abstraction.
I was amazed at the speedup I got by replacing a for loop adding numbers together in structs:
#include <stdlib.h>

const unsigned long SIZE = 100000000;
typedef struct {
    int a;
    int b;
    int result;
} addition;
addition *sum;
void start() {
    unsigned int byte_count = SIZE * sizeof(addition);
    sum = malloc(byte_count);
    unsigned int i = 0;
    if (i < SIZE) {
        do {
            sum[i].a = i;
            sum[i].b = i;
            i++;
        } while (i < SIZE);
    }
}
void test_func() {
    unsigned int i = 0;
    if (i < SIZE) { // this is about 30% faster than the more obvious for loop, even with O3
        do {
            addition *s1 = &sum[i];
            s1->result = s1->b + s1->a;
            i++;
        } while ( i<SIZE );
    }
}
void finish() {
    free(sum);
}
Why doesn't gcc optimise for loops into this? Or is there something I missed? Some cache effect?