In CUDA programming, are atomic functions faster than reducing after calculating intermediate results? - optimization

Atomic functions (such as atomicAdd) are widely used for counting or performing summation/aggregation in CUDA programming. However, I cannot find information about the speed of atomic functions compared with ordinary global memory reads/writes.
Consider the following task: we want to calculate a floating-point array with 256K elements. Each element is the sum over 1,000 intermediate values that are calculated first. One approach is to use atomicAdd; another approach is to use a 256K×1000 temporary array for the intermediate results and then to reduce this array (by taking the sum).
Is the first approach, using atomic functions, faster than the second?

In your specific case, even without a concrete program, one does not need to know anything about the difference in latency or bandwidth between atomic and non-atomic operations to rule out both of your approaches: they are both quite inefficient.
You should have single blocks handling single output variables (or a small number of output variables), so that the sum of the 1,000 intermediate values for each output element is not performed via global memory. You may want to read the "classic" presentation by Mark Harris:
Optimizing Parallel Reduction in CUDA
to get the basics. There have been improvements over this in recent years, due to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
Also relevant: CUDA: how to sum all elements of an array into one number within the GPU?
If you implement it this way, each output element will only be written to once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks (for whatever reason you have not shared in the question), you should still distribute it over a smaller number of blocks rather than 1,000, so that the global-memory writes of the result take up a small enough fraction of the total computation time that it is not worth bothering with anything other than an atomic addition.
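A minimal sketch of that layout, assuming the 1,000 intermediates for each output element can be computed independently from their index (compute_intermediate, the sizes, and the launch configuration below are placeholders, not code from the question):

#include <cub/block/block_reduce.cuh>

constexpr int BLOCK_SIZE = 256;
constexpr int NUM_INTERMEDIATES = 1000;
constexpr int NUM_OUTPUTS = 256 * 1024;

// Placeholder for whatever actually produces intermediate value j of output element i.
__device__ float compute_intermediate(int i, int j) { return (float)(i + j); }

__global__ void sum_intermediates(float *out)
{
    using BlockReduce = cub::BlockReduce<float, BLOCK_SIZE>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int i = blockIdx.x;                       // one block per output element
    float thread_sum = 0.0f;
    for (int j = threadIdx.x; j < NUM_INTERMEDIATES; j += BLOCK_SIZE)
        thread_sum += compute_intermediate(i, j);

    // Block-wide reduction in shared memory/registers; no global-memory traffic for the partial sums.
    float block_sum = BlockReduce(temp_storage).Sum(thread_sum);
    if (threadIdx.x == 0)
        out[i] = block_sum;                   // each output element is written exactly once
}

// launch: sum_intermediates<<<NUM_OUTPUTS, BLOCK_SIZE>>>(d_out);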

Related

Does optimizing code in TI-BASIC actually make a difference?

I know in TI-BASIC, the convention is to optimize obsessively and to save as many bits as possible (which is pretty fun, I admit).
For example,
DelVar Z
Prompt X
If X=0
Then
Disp "X is zero"
End //28 bytes
would be cleaned up as
DelVar ZPrompt X
If not(X
"X is zero //20 bytes
But does optimizing code this way actually make a difference? Does it noticeably run faster or save memory?
Yes. Optimizing your TI-Basic code makes a difference, and that difference is much larger than you would find for most programming languages.
In my opinion, the most important optimization to TI-Basic programs is size (making them as small as possible). This is important to me since I have dozens of programs on my calculator, which only has 24 kB of user-accessible RAM. In this case, it isn't really necessary to spend lots of time trying to save a few bytes of space; instead, I simply advise learning the shortest and most efficient ways to do things, so that when you write programs, they will naturally tend to be small.
Additionally, TI-Basic programs should be optimized for speed. Examples off of the top of my head include the quirk with the unclosed For( loop, calculating a value once instead of calculating it in every iteration of a loop (if possible), and using quickly-accessed variables such as Ans and the finance variables whenever the variable must be accessed a large number of times (e.g. 1000+).
A third possible optimization is for run-time memory usage. Every loop, function call, etc. has an overhead that must be stored in the memory stack in order to return to the original location, calculate values, etc. during the program's execution. It is important to avoid memory leaks (such as breaking out of a loop with Goto).
It is up to you to decide how you balance these optimizations. I prefer to:
First and foremost, guarantee that there are no memory leaks or incorrectly nested loops in my program.
Take advantage of any size optimizations that have little or no impact on the program's speed.
Consider speed optimizations, and decide if the added speed is worth the increase in program size.
TI-BASIC is an interpreted language, which usually means there is a huge overhead on every single operation.
The way an interpreted language works is that instead of actually compiling the program into code that runs on the CPU directly, each operation is a function call into the interpreter, which looks at what needs to be done and then calls functions to complete those subtasks. In most cases, the overhead is a factor or two in speed, and often also in stack memory usage. However, non-stack memory usage is usually the same.
In your above example you are doing the exact same number of operations, which should mean that both versions run about equally fast. What you should optimize are things like i = i + 1, which is 4 operations, into i++, which is 2 operations (as an example only; TI-BASIC doesn't support the ++ operator).
This does not mean that all operations take the exact same time: internally, an operation may call hundreds of other functions or may be as simple as updating a single variable. The programmers of the interpreter may also have implemented various peephole optimizations that target very specific language constructs. For example, for(int i = 0; i < count; i++) could either be implemented as a collection of expensive interpreter functions that treat i as a generic value, or it could be optimized into a compiled loop that only has to update the variable i and re-evaluate count.
Now, not all interpreted languages are doomed to this pale existence. For example, JavaScript used to be one, but these days all major js engines JIT compile the code to run directly on the CPU.
UPDATE: Clarified that not all operations are created equal.
Absolutely, it makes a difference. I wrote a full-scale color RPG for the TI-84+CSE, and let me tell you, without optimizing any of my code, the game would flat out not run. At present, on the CSE, Sorcery of Uvutu can only run if every other program is archived and everything else is moved out of RAM. The program and its data storage alone take up 20K bytes of RAM, just 1 KB under all of the available user memory. With all the variables in use, the free memory approaches dangerously low points. There were points in my development where, due to poor optimizations, I couldn't even start the game without getting a "memory all gone" error. I had plans to implement various extra things, but due to space and speed concerns it was impossible to do so. And that's only considering space.
In the speed department, the game became, and still is, slow in the overworld. Walking around in the overworld is painfully slow compared to other games, and that's because of what I have to do in that code: I have to check for collisions, check if the user is moving to a new map, check if they pressed a key that should elicit a response, check if a battle should go on, and more. I was able to make slight optimizations to the walking speed, and even with those, I could blatantly tell I had made improvements. It still was pretty awfully slow (at least compared to every other port I've made), but I made it a little more tolerable.
In summary, through my own experiences crafting a large project, I can say that in TI-Basic, optimizing code does make a difference. Other answers mentioned this, but TI-Basic is an interpreted language. This means the code isn't compiled into faster, lower-level code; instead, what you put in the program is read straight out as it executes, interpreted by the interpreter, which calls the subroutines and other things it needs to execute the commands, and then returns to read the next line. As a result of that, and the fact that the TI-84+ series CPU, the Zilog Z80, was designed in 1976, you get a rather slow interpreter, especially for this day and age. As such, the fewer commands you run, and the more you take advantage of system weirdness such as Ans being the fastest variable that can also hold the most types of data (integers/floats, strings, lists, matrices, etc.), the better the performance you're gonna get.
Sources: My own experiences, documented here: https://codewalr.us/index.php?topic=778.msg27190#msg27190
TI-84+CSE RAM numbers came from here: https://education.ti.com/en/products/calculators/graphing-calculators/ti-84-plus-c-se?category=specifications
Information about the Z80 came from here: http://segaretro.org/Zilog_Z80
Depends. If it's just a basic math program, then no. For big games, then YES. The TI-84 has only 3.5 MB of space available and has the combo of an ancient Z80 processor and a whopping 128 KB of RAM. TI-BASIC is also quite slow, as it's interpreted (look it up for further information), so if you want to make fast-running games, then YES. Optimization is very important.

Effect of a large number of functions in an application

If I have a large number of functions in my application, do they affect the execution speed of the application?
For example: I have 10,000 functions in my application, but each time I run the application only 1 or 2 of them will actually be called. It is not known beforehand which function(s) will be called; it depends on the user-given input.
Does the execution speed change if I have that many functions?
The speed shouldn't be significantly affected in your case. The number of procedures defined is much less important than the computational complexity of each procedure called.
Think about it. A 2.5 GHz processor can theoretically perform more than 10 billion floating-point operations per second (FLOPS). The time required to load a fixed number of procedures into memory, even a million lines of code, will remain constant and fairly trivial, but if one of your procedures is complex enough, the number of operations can increase massively over comparatively few iterations.
9,998 functions not used, but still present because they are referenced, do not affect performance unless you need to parse all of the code at each run.
I'm thinking the case-analysis size might affect performance. If you have 10,000 functions and only use about 2 each time, then you'll have about 5,000 outcomes, and that means a lot of tests if the analysis is linear, or about 13 if it's binary.
I'd start with profiling the code to find the bottlenecks.
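To make the dispatch point concrete: with a table-based lookup, selecting 1 of N handlers costs roughly the same whether N is 10 or 10,000, so the unused function definitions mostly cost space, not time (the handler names and input format below are purely hypothetical):

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Hypothetical handlers; imagine thousands of these defined in the application.
void handle_foo() { std::cout << "foo\n"; }
void handle_bar() { std::cout << "bar\n"; }

int main()
{
    // Hash-based dispatch table: lookup cost does not grow with the number of entries.
    std::unordered_map<std::string, std::function<void()>> dispatch = {
        {"foo", handle_foo},
        {"bar", handle_bar},
        // ... thousands more entries ...
    };

    std::string input;
    std::cin >> input;                 // user-given input selects the function
    auto it = dispatch.find(input);
    if (it != dispatch.end())
        it->second();                  // only the selected function actually runs
}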

CUDA: optimize latency due to iterative process

I have an iterative computation that involves a Fourier transform in each iteration.
At a high level it looks like this:
// executed on the host, calling functions that run on the device
B = image
L = 100
while (L--) {
    A = FFT_2D(B)
    A = SOME_PER_PIXEL_CALCULATION(A)
    B = INVERSE_FFT_2D(A)
    B = SOME_PER_PIXEL_CALCULATION(B)
}
I am using "cufft" library to do the transforms.
now the problem is that I am always working with global memory,
basically if there was a way of doing some of the work with shared memory it would be great,
but it seems like using FFT won't allow me to bypass this, given "cufft" library functions can only be called from the host, and stores input and output in global memory.
how should I tackle this?
thanks.
EDIT:
Since there IS a data dependency, it seems I can't do much besides optimizing the 'per pixel' calculations...
The bottleneck is still the fact that the kernels pass the data via global memory, which seems unavoidable in this case.
So basically, the fact that I have to do the transform and its inverse is what keeps me from sharing intermediate computation data.
Currently I am exploring ways of doing most of the calculation in frequency space (more of a math problem).
So, does anyone have a good idea on how to approximate F{max(0, f(x,y))} given F{f(x,y)}?
EDIT:
Note that f(x,y) is in the time domain and is therefore real-valued.
f(x,y) is also processed before the pointwise max(0, f(x,y)) is calculated, so it is indeed possible for negative values to appear.
Concerning the FFT/IFFT, I think you are wrongly assuming that the cuFFT routines do not internally use shared memory. Typical algorithms for FFT calculations split the entire FFT into smaller ones fitting in one thread block, so they probably already exploit shared memory internally; see for example the paper.
Concerning the PER_PIXEL_CALCULATIONS, shared memory is typically used to make threads within a thread block cooperate with each other. My question is: are the PER_PIXEL_CALCULATIONS independent of each other? If so, thread cooperation is probably not needed; you would not need shared memory either and could arrange the calculations to use only registers.
Anyway, to be more specific on the latter point, you should provide more information on what you actually need (by editing your original post). Is your code related to an implementation of the Gerchberg-Saxton algorithm?
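For reference, a minimal sketch of such a host-side loop with cuFFT, assuming an in-place single-precision complex-to-complex 2D transform; per_pixel_op stands in for the real per-pixel calculations, error checking is omitted, and note that cuFFT's inverse transform is unnormalized, so a 1/(width*height) scale belongs in one of the per-pixel kernels:

#include <cufft.h>

__global__ void per_pixel_op(cufftComplex *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // placeholder per-pixel calculation; independent per element, registers only
        data[i].x *= 0.5f;
        data[i].y *= 0.5f;
    }
}

void iterate(cufftComplex *d_data, int width, int height, int iterations)
{
    cufftHandle plan;
    cufftPlan2d(&plan, height, width, CUFFT_C2C);   // create the plan once, reuse it every iteration

    int n = width * height;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    for (int l = 0; l < iterations; ++l) {
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);   // data stays in global memory
        per_pixel_op<<<blocks, threads>>>(d_data, n);
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
        per_pixel_op<<<blocks, threads>>>(d_data, n);
    }
    cufftDestroy(plan);
}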

In what types of loops is it best to use the #pragma unroll directive in CUDA?

In CUDA it is possible to unroll loops using the #pragma unroll directive to improve performance by increasing instruction level parallelism. The #pragma can optionally be followed by a number that specifies how many times the loop must be unrolled.
Unfortunately the docs do not give specific directions on when this directive should be used. Since small loops with a known trip count are already unrolled by the compiler, should #pragma unroll be used on larger loops? On small loops with a variable counter? And what about the optional number of unrolls? Also, is there any recommended documentation about CUDA-specific loop unrolling?
There aren't any hard and fast rules. The CUDA compiler has at least two unrollers, one inside each of the NVVM or Open64 frontends, and one in the PTXAS backend. In general, they tend to unroll loops pretty aggressively, so I find myself using #pragma unroll 1 (to prevent unrolling) more often than any other unrolling attribute. The reasons for turning off loop unrolling are twofold:
(1) When a loop is unrolled completely, register pressure can increase. For example, indexes into small local memory arrays may become compile-time constants, allowing the compiler to place the local data into registers. Complete unrolling also tends to lengthen basic blocks, allowing more aggressive scheduling of texture and global loads, which may require additional temporary variables and thus registers. Increased register pressure can lead to lower performance due to register spilling.
(2) Partially unrolled loops usually require a certain amount of pre-computation and clean-up code to handle trip counts that are not exactly a multiple of the unrolling factor. For loops with short trip counts, this overhead can swamp any performance gains to be had from the unrolled loop, leading to lower performance after unrolling. While the compiler contains heuristics for finding suitable loops under these restrictions, the heuristics can't always provide the best decision.
In rare cases I have found that manually specifying a higher unrolling factor than what the compiler used automatically has a small beneficial effect on performance (with typical gains in the single-digit percent range). These are typically cases of memory-intensive code where a larger unrolling factor allows more aggressive scheduling of global or texture loads, or very tight compute-bound loops that benefit from minimization of the loop overhead.
Playing with unrolling factors is something that should happen late in the optimization process, as the compiler defaults cover most cases one will encounter in practice.
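For concreteness, this is what the two usages discussed above look like in a made-up kernel; whether either form actually helps is something to measure rather than assume:

__global__ void scale(float *data, int n, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

    // Prevent unrolling entirely, e.g. to keep register pressure down.
    #pragma unroll 1
    for (int j = 0; j < 8; ++j)
        if (base + j < n)
            data[base + j] *= factor;
}

__global__ void scale_unrolled(float *data, int n, float factor)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 8;

    // Request a specific unrolling factor; the compiler still applies its own checks.
    #pragma unroll 4
    for (int j = 0; j < 8; ++j)
        if (base + j < n)
            data[base + j] *= factor;
}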
It's a tool that you can use to unroll loops. The specifics of when it should or shouldn't be used will vary a lot depending on your code (what's inside the loop, for instance). There aren't really any good generic tips except to think about what your code would be like unrolled vs. rolled and whether it would be better unrolled.

Bit Fields vs Integer

I'm writing an API that gets information about the CPU (using CPUID). What I'm wondering is should I store the values from the bit field returned by calling CPUID in separate integer values, or should I just store the entire bit field in a value and write functions to get the different values on-the-fly?
What is preferable in this case? Memory usage or speed? If it's memory usage, I'll just store the entire bit field in a single variable. If it's speed, I'll store each value in a separate variable.
You're only going to query a CPU once. With modern computers having both huge amounts of memory and processing power, it would make no difference either way.
Just do what would make more sense for the next person who reads it.
Programs must be written for people to read, and only incidentally for machines to execute.
— Structure and Interpretation of Computer Programs
I think it does not matter here, because you will not call your CPUID code 10,000 times per second... will you?
I think you can define a different interface (method) for each value; this is clearer and easier to use. A clear, accurate, and easy-to-use interface should be the first thing to consider, then performance (memory usage and speed).
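For illustration, a minimal sketch of the "store the raw register, decode on the fly" approach, assuming GCC/Clang's __get_cpuid intrinsic; the two bits shown are the standard SSE/SSE2 feature bits in EDX of CPUID leaf 1, and whichever fields your API actually exposes would replace them:

#include <cpuid.h>     // GCC/Clang CPUID intrinsic
#include <cstdint>
#include <cstdio>

// Query CPUID leaf 1 once and keep the raw EDX register...
static std::uint32_t cpuid_edx()
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    __get_cpuid(1, &eax, &ebx, &ecx, &edx);
    return edx;
}

// ...and decode the individual feature bits on the fly.
static bool has_sse(std::uint32_t edx)  { return (edx >> 25) & 1u; }  // EDX bit 25 = SSE
static bool has_sse2(std::uint32_t edx) { return (edx >> 26) & 1u; }  // EDX bit 26 = SSE2

int main()
{
    std::uint32_t edx = cpuid_edx();   // queried once; each later decode is just a shift and a mask
    std::printf("SSE:  %d\nSSE2: %d\n", has_sse(edx), has_sse2(edx));
}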