CUDA: optimize latency due to iterative process - optimization

I have an iterative computation that involves a Fourier transform in each iteration.
in high level it looks like this:
// executed in host , calling functions that run on the device
B = image
L = 100
while(L--) {
A = FFT_2D(B)
A = SOME_PER_PIXEL_CALCULATION(A)
B = INVERSE_FFT_2D(A)
B = SOME_PER_PIXEL_CALCULATION(B)
}
I am using "cufft" library to do the transforms.
now the problem is that I am always working with global memory,
basically if there was a way of doing some of the work with shared memory it would be great,
but it seems like using FFT won't allow me to bypass this, given "cufft" library functions can only be called from the host, and stores input and output in global memory.
how should I tackle this?
thanks.
EDIT:
since there IS a data dependency. it would seem like I can't do much but optimize the 'per pixel' calculations...
the bottleneck is still due to the fact that the kernels pass the data via global memory .which seems unavoidable in this case.
so basically the fact that I have to do the transform an it's inverse is what keeps me from sharing intermidiate computation data.
currently I am exploring ways of doing most of the calculation in the frequency space.
( more of a math problem )
so does anyone has a good idea on how to approximate F{max(0,f(x,y))} given F{f(x,y)} ?
EDIT:
note that f(x,y) is in the time domain, and therefore is real valued,
f(x,y) is also processed before calculating pointwise max(0,f(x,y)), so it is indeed possible for negetiv values to appear.

Concerning the FFT/IFFT, I think you are wrongly assuming that the CUFFT routine does not internally use shared memory. Typical algorithms for FFT calculations split the entire FFT into smaller ones fitting one thread block and so probably they already internally exploit shared memory, see for example the paper.
Concerning the PER_PIXEL_CALCULATIONS, shared memory is typically used to make threads within a thread block cooperate each other. My question is: are the PER_PIXEL_CALCULATIONS independent each other? If so, perhaps thread cooperation is not needed and you would not need shared memory either and arrange the calculations by using only registers.
Anyway, to be more specific on the latter point, you should provide more information on what you actually need (by editing your original post). Is your code related to an implementation of the Gerchberg-Saxton algorithm?

Related

In CUDA programming, is atomic function faster than reducing after calculating the intermediate results?

Atomic functions (such as atomic_add) are widely used for counting or performing summation/aggregation in CUDA programming. However, I can not find information about the speed of atomic functions compared with ordinary global memory read/write.
Consider the following task, where we want to calculate a floating-point array with 256K elements. Each element is the sum over 1000 intermediate variables which is calculated first. One approach is to use atomic_add; While another approach is to use a 256K*1000 temporary array for the intermediate results and then to reduce this array (by taking summation).
Is the first approach using atomic function faster than the second?
In your specific case, even without you providing a concrete program, one does not need to know anything about the difference in latency or in bandwidth between atomic and non-atomic operations to rule out both your approaches: They are both quite inefficient.
You should have single blocks handling single output variables (or a small number of output variables), so that the sum of each 1,000 intermediate variables is not performed via global memory. You may want to read the "classic" presentation by Mark Harris:
Optimizing Parallel Reduction in CUDA
to get the basics. There have been improvements over this in recent years, due to newer hardware capabilities. For a more recent actual implementation, see the CUB library's block reduction primitive.
Also relevant: CUDA: how to sum all elements of an array into one number within the GPU?
If you implement it this way, each output element will only be written to once. And even if the computation of the 1,000 intermediates somehow needs to be distributed among multiple blocks, for whatever reason you have not shared in the question - you should still distribute it over a smaller number, rather than 1,000, so that the global-memory writes of the result take up a small enough fraction of the total computation time, that it is not worth bothering with something other than an atomic addition.

Algebraic/implicit loops handling by Gekko

I have got a specific question with regards to algebraic / implicit loops handling by Gekko.
I will give examples in the field of Chemical Engineering, as this is how I found the project and its other libraries.
For example, when it comes to multicomponent chemical equilibrium calculations, it is not possible to explicitly work out the equations, because the concentration of one specie may be present in many different equations.
I have been using other paid software in the past and it would automatically propose a resolution procedure based on how the system is solvable (by analyzing dependency and creating automatic algebraic loops).
My question would be:
Does Gekko do that automatically?
It is a little bit tricky because sometimes one needs to add tear variables and iterate from a good starting value.
I know this message may be a little bit abstract, but I am trying to decide which software to use for my work and this is a pragmatic bottle neck that I have happened to find.
Thanks in advance for your valuable insight.
Python Gekko uses a simultaneous solution strategy so that all units are solved together instead of sequentially. Therefore, tear variables are not needed but large flowsheet problems with recycle can be difficult to converge to a feasible solution. Below are three methods that are in Python Gekko to assist in efficient solutions and initialization.
Method 1: Intermediate Variables
Intermediate variables are useful to decrease the complexity of the model. In many models, the temporary variables outnumber the regular variables. This model reduction often aides the solver in finding a solution by reducing the problem size. Intermediate variables are declared with m.Intermediates() in Python Gekko. The intermediate variables may be defined in one section or in multiple declarations throughout the model. Intermediate variables are parsed sequentially, from top to bottom. To avoid inadvertent overwrites, intermediate variable can be defined once. In the case of intermediate variables, the order of declaration is critical. If an intermediate is used before the definition, an error reports that there is an uninitialized value. Here is additional information on Intermediates with an example problem.
Method 2: Lower Block Triangular Decomposition
For large problems that have trouble with initialization, there is a mode that is activated with the option m.options.COLDSTART=2. This mode performs a lower block triangular decomposition to automatically identify independent blocks that are then solved independently and sequentially.
This decomposition method for initialization is discussed in the PhD dissertation (chapter 2) of Mostafa Safdarnejad or also in Safdarnejad, S.M., Hedengren, J.D., Lewis, N.R., Haseltine, E., Initialization Strategies for Optimization of Dynamic Systems, Computers and Chemical Engineering, 2015, Vol. 78, pp. 39-50, DOI: 10.1016/j.compchemeng.2015.04.016.
Method 3: Automatic Model Reduction
Model reduction requires more pre-processing time but can help to significantly reduce the solver time. There is additional documentation on m.options.REDUCE.
Overall Strategy for Initialization
The overall strategy that we use for initializing hard problems, such as flowsheets with recycle, is shown in this flowchart.
Sometimes it does mean breaking recycles to get an initialized solution. Other times, the initialization strategies detailed above work well and no model rearrangement is necessary. The advantage of working with a simultaneous solution strategy is degree of freedom swapping such as downstream variables can be fixed and upstream variables calculated to meet that value.

Need help understanding Kernel Transport speed on GPU (numba, cupy, cuda)

While GPUs speed math calculations there is a fixed overhead for moving a kernel out to the GPU for execution that is high.
I'm using cupy and numba. THe first time I execute a function call that is using cupy's GPU version of numpy it is quite slow. But the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally I want to understand this better so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some sorts of rules or rules of thumb understand the concept.
For example, if I multiply two cupy arrays that are stashed on the GPU already I might write C= A*B
At some point the cupy overload on * multiplication has to be coded out on the GPU, and it automagically needs will also get wrapped by loops that break it down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D that the GPU no longer needs to be taught what * means and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code so * or other operations not being used at that moment might get flushed from memory, and so later on when the call for A*B happens again there's going to be a penalty in time to recompile it out on the GPU.
Or so I imagine. If I'm right how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works or there is some other slow step (I'm assuming the data is already transported to arrays on the GPU) then what is this slow step and how does organize things so one pay it as little as possible?
I'm trying to avoid writing explicit numba thread managment kernels as one does in cuda++ but just use the standard numba #njit, #vectorize, #stencil decorators. Likewise in Cupy I want to just work at the level of the numpy syntax not dive into thread management.
I've read a lot of docs on this but they just refer to overheads of kernels, not when these get paid and how one controls that so I'm confused.
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function #cupy.fuse() which makes it more clear than the #numba.jit documents where the kernel launch costs are paid. I have not found the connection to Contexts yet as recommended by #talonmies.
see https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this
c = cupy.arange(4)
##cupy.fuse()
def foo(x):
return x+x+x+x+x
foo(.) will be three times slower with #cupy.fuse() commented out because each "+" involves a kernel load and a kernel free. Fusion merges all the adds into a single kernel so those the launch and free are paid onces. FOr matricies less than 1 million in size on a typical 2018 GPU, the add() is so fast that the launch and free are the dominate times.
I wish I could find some documentation on #fuse. FOr example, does it unroll internal functions the way #jit does. Could I achieve that by stacking #jit and #fuse?
I'm still however largely in the dark about when the costs are getting paid in numba.

MATLAB & Mex-Files: Auto-optimization of CUDA-Code depending on Input-Parameters size

Hey there,
I'm currently developing a Mex-file in matlab including CUDA computation. I wonder if there's a good way to 'automatically' optimize the program for arbitrary input parameters from the user. E.g. when the input-parameters don't exceed a certain size, try to use shared and/or constant memory... which will only work up to certain limits. From there on, global memory has to be used. But such optimizations can only be made in runtime because that's the point I get to know the size of input parameters from the user. Any simple solution?
Thanks!
You can simply write different kernels and decide which ones to call at runtime.
You can also use the device query API or do some micro-benchmarking to figure out the sizes of shared/constant memory at runtime. This is probably necessary if you don't want to assume a particular GPU model.

CUDA: synchronizing threads

Almost anywhere I read about programming with CUDA there is a mention of the importance that all of the threads in a warp do the same thing.
In my code I have a situation where I can't avoid a certain condition. It looks like this:
// some math code, calculating d1, d2
if (d1 < 0.5)
{
buffer[x1] += 1; // buffer is in the global memory
}
if (d2 < 0.5)
{
buffer[x2] += 1;
}
// some more math code.
Some of the threads might enter into one for the conditions, some might enter into both and other might not enter into either.
Now in order to make all the thread get back to "doing the same thing" again after the conditions, should I synchronize them after the conditions using __syncthreads() ? Or does this somehow happens automagically?
Can two threads be not doing the same thing due to one of them being one operation behind, thus ruining it for everyone? Or is there some behind the scenes effort to get them to do the same thing again after a branch?
Within a warp, no threads will "get ahead" of any others. If there is a conditional branch and it is taken by some threads in the warp but not others (a.k.a. warp "divergence"), the other threads will just idle until the branch is complete and they all "converge" back together on a common instruction. So if you only need within-warp synchronization of threads, that happens "automagically."
But different warps are not synchronized this way. So if your algorithm requires that certain operations be complete across many warps then you'll need to use explicit synchronization calls (see the CUDA Programming Guide, Section 5.4).
EDIT: reorganized the next few paragraphs to clarify some things.
There are really two different issues here: Instruction synchronization and memory visibility.
__syncthreads() enforces instruction synchronization and ensures memory visibility, but only within a block, not across blocks (CUDA Programming Guide, Appendix B.6). It is useful for write-then-read on shared memory, but is not appropriate for synchronizing global memory access.
__threadfence() ensures global memory visibility but doesn't do any instruction synchronization, so in my experience it is of limited use (but see sample code in Appendix B.5).
Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host.
If you just need to increment shared or global counters, consider using the atomic increment function atomicInc() (Appendix B.10). In the case of your code above, if x1 and x2 are not globally unique (across all threads in your grid), non-atomic increments will result in a race-condition, similar to the last paragraph of Appendix B.2.4.
Finally, keep in mind that any operations on global memory, and synchronization functions in particular (including atomics) are bad for performance.
Without knowing the problem you're solving it is hard to speculate, but perhaps you can redesign your algorithm to use shared memory instead of global memory in some places. This will reduce the need for synchronization and give you a performance boost.
From section 6.1 of the CUDA Best Practices Guide:
Any flow control instruction (if, switch, do, for, while) can significantly affect
the instruction throughput by causing threads of the same warp to diverge; that is,
to follow different execution paths. If this happens, the different execution paths
must be serialized, increasing the total number of instructions executed for this
warp. When all the different execution paths have completed, the threads converge
back to the same execution path.
So, you don't need to do anything special.
In Gabriel's response:
"Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host."
What if the reason you need f() and g() in same thread is because you're using register memory, and you want register or shared data from f to get to g?
That is, for my problem, the whole reason for synchronizing across blocks is because data from f is needed in g - and breaking out to a kernel would require a large amount of additional global memory to transfer register data from f to g, which I'd like to avoid
The answer to your question is no. You don't need to do anything special.
Anyway, you can fix this, instead of your code you can do something like this:
buffer[x1] += (d1 < 0.5);
buffer[x2] += (d2 < 0.5);
You should check if you can use shared memory and access global memory in a coalesced pattern. Also be sure that you DON'T want to write to the same index in more than 1 thread.