Shader programming - test if between two values - optimization

Shaders allow if(x>a && x<b) but my understanding is logical tests are slow. Is this worth optimizing along the lines:
mid = (a+b)*.5;
half = (b-a)*.5;
if(abs(x - mid) < half){...}
It seems a lot more code but from the days of CPU ASM, such tricks were sometimes worth doing.
If half and mid can be calculated once on the CPU and passed in as shader parameters, then we replace: if(x>a && x<b) with: if(abs(x - mid) < half) - is this worth bothering with?
In my tests I see a little improvement pulling half and mid into pre-calculated shader params, but I've only one GPU to test on so I'm after a "in general" answer, and if there are cases it will be dramatically better or worse.

The precalculating part of the shader params could be irrelevant (especially on directx), as the compiler is often smart enough to figure out the expression is constant.
Question is how abs is implemented. What's your target GPU? Mobile,mid,highend?

Related

scipy.sparse.linalg: what's the difference between splu and factorized?

What's the difference between using
scipy.sparse.linalg.factorized(A)
and
scipy.sparse.linalg.splu(A)
Both of them return objects with .solve(rhs) method and for both it's said in the documentation that they use LU decomposition. I'd like to know the difference in performance for both of them.
More specificly, I'm writing a python/numpy/scipy app that implements dynamic FEM model. I need to solve an equation Au = f on each timestep. A is sparse and rather large, but doesn't depend on timestep, so I'd like to invest some time beforehand to make iterations faster (there may be thousands of them). I tried using scipy.sparse.linalg.inv(A), but it threw memory exceptions when the size of matrix was large. I used scipy.linalg.spsolve on each step until recently, and now am thinking on using some sort of decomposition for better performance. So if you have other suggestions aside from LU, feel free to propose!
They should both work well for your problem, assuming that A does not change with each time step.
scipy.sparse.linalg.inv(A) will return a dense matrix that is the same size as A, so it's no wonder it's throwing memory exceptions.
scipy.linalg.solve is also a dense linear solver, which isn't what you want.
Assuming A is sparse, to solve Au=f and you only want to solve Au=f once, you could use scipy.sparse.linalg.spsolve. For example
u = spsolve(A, f)
If you want to speed things up dramatically for subsequent solves, you would instead use scipy.sparse.linalg.factorized or scipy.sparse.linalg.splu. For example
A_inv = splu(A)
for t in range(iterations):
u_t = A_inv.solve(f_t)
or
A_solve = factorized(A)
for t in range(iterations):
u_t = A_solve(f_t)
They should both be comparable in speed, and much faster than the previous options.
As #sascha said, you will need to dig into the documentation to see the differences between splu and factorize. But, you can use 'umfpack' instead of the default 'superLU' if you have it installed and set up correctly. I think umfpack will be faster in most cases. Keep in mind that if your matrix A is too large or has too many non-zeros, an LU decomposition / direct solver may take too much memory on your system. In this case, you might be stuck with using an iterative solver such as this. Unfortunately, you wont be able to reuse the solve of A at each time step, but you might be able to find a good preconditioner for A (approximation to inv(A)) to feed the solver to speed it up.

OpenCL: Type conversion overhead

What is the cost of casting a variable to a different type in OpenCL?
Example: I want to take dot product of 2 int3 vectors (AFAIK dot() isn't overloaded for int3s), so instead of implementing dot() by myself in unvectorized way, I want to vectorize the code by using the native dot() for float3. First I convert the 2 vectors to float3s and then I cast the result to int.
Which of the two functions, foo and bar, is less time consuming (and why)?
inline int foo(int3 a, int3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
inline int bar(int3 a, int3 b) {
return (int)dot(convert_float3(a), convert_float3(b));
}
As has been suggested in the comments, measuring is going to be the most useful tool in practice, and the cost of individual instructions is heavily dependent on hardware architecture, but also the compiler.
Nevertheless, a comparison to other operations is useful, and at least AMD publishes a list of the instruction throughput for their devices in this section of their OpenCL optimisation guide, and this includes float-to-int and int-to-float conversion.
In your particular case, I strongly suspect your "vectorising" attempts will have detrimental effects. Most modern GPUs aren't SIMD processors in the CPU SIMD sense. The threads run in lock-step, but each thread operates on scalars. A "horizontal" operation like a dot product may not be particularly efficient even if the GPU does use per-thread SIMD.
If you can limit the range of each of your integers to 24 bits, a series of mad24() and mul24() calls will most likely be fastest. But again - measure. Try the different options on a range of hardware, and run them lots of times, applying basic stats to make sure you aren't just seeing random variation/overhead.
A separate thing to note with regard to integer-to-float conversions is that such conversions are often "free" when you sample as floats from an image object containing integers.

How to optimize OpenCL code for neighbors accessing?

Edit: Proposed solutions results are added at the end of the question.
I'm starting to program with OpenCL, and I have created a naive implementation of my problem.
The theory is: I have a 3D grid of elements, where each elements has a bunch of information (around 200 bytes). Every step, every element access its neighbors information and accumulates this information to prepare to update itself. After that there is a step where each element updates itself with the information gathered before. This process is executed iteratively.
My OpenCL implementation is: I create an OpenCL buffer of 1 dimension, fill it with structs representing the elements, which have an "int neighbors 6 " where I store the index of the neighbors in the Buffer. I launch a kernel that consults the neighbors and accumulate their information into element variables not consulted in this step, and then I launch another kernel that uses this variables to update the elements. These kernels use __global variables only.
Sample code:
typedef struct{
float4 var1;
float4 var2;
float4 nextStepVar1;
int neighbors[8];
int var3;
int nextStepVar2;
bool var4;
} Element;
__kernel void step1(__global Element *elements, int nelements){
int id = get_global_id(0);
if (id >= nelements){
return;
}
Element elem = elements[id];
for (int i=0; i < 6; ++i){
if (elem.neighbors[i] != -1){
//Gather information of the neighbor and accumulate it in elem.nextStepVars
}
}
elements[id] = elem;
}
__kernel void step2(__global Element *elements, int nelements){
int id = get_global_id(0);
if (id >= nelements){
return;
}
Element elem = elements[id];
//update elem variables by using elem.nextStepVariables
//restart elem.nextStepVariables
}
Right now, my OpenCL implementation takes basically the same time than my C++ implementation.
So, the question is: How would you (the experts :P) address this problem?
I have read about 3D images, to store the information and change the neighborhood accessing pattern by changing the NDRange to a 3D one. Also, I have read about __local memory, to first load all the neighborhood in a workgroup, synchronize with a barrier and then use them, so that accesses to memory are reduced.
Could you give me some tips to optimize a process like the one I described, and if possible, give me some snippets?
Edit: Third and fifth optimizations proposed by Huseyin Tugrul were already in the code. As mentioned here, to make structs behave properly, they need to satisfy some restrictions, so it is worth understanding that to avoid headaches.
Edit 1: Applying the seventh optimization proposed by Huseyin Tugrul performance increased from 7 fps to 60 fps. In a more general experimentation, the performance gain was about x8.
Edit 2: Applying the first optimization proposed by Huseyin Tugrul performance increased about x1.2 . I think that the real gain is higher, but hides because of another bottleneck not yet solved.
Edit 3: Applying the 8th and 9th optimizations proposed by Huseyin Tugrul didn't change performance, because of the lack of significant code taking advantage of these optimizations, worth trying in other kernels though.
Edit 4: Passing invariant arguments (such as n_elements or workgroupsize) to the kernels as #DEFINEs instead of kernel args, as mentioned here, increased performance around x1.33. As explained in the document, this is because of the aggressive optimizations that the compiler can do when knowing the variables at compile-time.
Edit 5: Applying the second optimization proposed by Huseyin Tugrul, but using 1 bit per neighbor and using bitwise operations to check if neighbor is present (so, if neighbors & 1 != 0, top neighbor is present, if neighbors & 2 != 0, bot neighbor is present, if neighbors & 4 != 0, right neighbor is present, etc), increased performance by a factor of x1.11. I think this was mostly because of the data transfer reduction, because the data movement was, and keeps being my bottleneck. Soon I will try to get rid of the dummy variables used to add padding to my structs.
Edit 6: By eliminating the structs that I was using, and creating separated buffers for each property, I eliminated the padding variables, saving space, and was able to optimize the global memory access and local memory allocation. Performance increased by a factor of x1.25, which is very good. Worth doing this, despite the programmatic complexity and unreadability.
According to your step1 and step2, you are not making your gpu core work hard. What is your kernel's complexity? What is your gpu usage? Did you check with monitoring programs like afterburner? Mid-range desktop gaming cards can get 10k threads each doing 10k iterations.
Since you are working with only neighbours, data size/calculation size is too big and your kernels may be bottlenecked by vram bandiwdth. Your main system ram could be as fast as your pci-e bandwidth and this could be the issue.
1) Use of Dedicated Cache could be getting you thread's actual grid cell into private registers that is fastest. Then neighbours into __local array so the comparisons/calc only done in chip.
Load current cell into __private
Load neighbours into __local
start looping for local array
get next neighbour into __private from __local
compute
end loop
(if it has many neighbours, lines after "Load neighbours into __local" can be in another loop that gets from main memory by patches)
What is your gpu? Nice it is GTX660. You should have 64kB controllable cache per compute unit. CPUs have only registers of 1kB and not addressable for array operations.
2) Shorter Indexing could be using a single byte as index of neighbour stored instead of int. Saving precious L1 cache space from "id" fetches is important so that other threads can hit L1 cache more!
Example:
0=neighbour from left
1=neighbour from right
2=neighbour from up
3=neighbour from down
4=neighbour from front
5=neighbour from back
6=neighbour from upper left
...
...
so you can just derive neighbour index from a single byte instead of 4-byte int which decreases main memory accessing for at least neighbour accessing. Your kernel will derive neighbour index from upper table using its compute power, not memory power because you would make this from core registers(__privates). If your total grid size is constant, this is very easy such as just adding 1 actual cell id, adding 256 to id or adding 256*256 to id or so.
3) Optimum Object Size could be making your struct/cell-object size a multiple of 4 bytes. If your total object size is around 200-bytes, you can pad it or augment it with some empty bytes to make exactly 200 bytes, 220Bytes or 256 bytes.
4) Branchless Code (Edit: depends!) using less if-statements. Using if-statement makes computation much slower. Rather than checking for -1 as end of neightbour index , you can use another way . Becuase lightweight core are not as capable of heavyweight. You can use surface-buffer-cells to wrap the surface so computed-cells will have always have 6-neighbours so you get rid of if (elem.neighbors[i] != -1) . Worth a try especially for GPU.
Just computing all neighbours are faster rather than doing if-statement. Just multiply the result change with zero when it is not a valid neighbour. How can we know that it is not a valid neighbour? By using a byte array of 6-elements per cell(parallel to neighbour id array)(invalid=0, valid=1 -->multiply the result with this)
The if-statement is inside a loop which counting for six times. Loop unrolling gives similar speed-up if the workload in the loop is relatively easy.
But, if all threads within same warp goes into same if-or-else branch, they don't lose performance. So this depends wheter your code diverges or not.
5) Data Elements Reordering you can move the int[8] element to uppermost side of struct so memory accessing may become more yielding so smaller sized elements to lower side can be read in a single read-operation.
6) Size of Workgroup trying different local workgroup size can give 2-3x performance. Starting from 16 until 512 gives different results. For example, AMD GPUs like integer multiple of 64 while NVIDIA GPUs like integer multiple of 32. INTEL does fine at 8 to anything since it can meld multiple compute units together to work on same workgroup.
7) Separation of Variables(only if you cant get rid of if-statements) Separation of comparison elements from struct. This way you dont need to load a whole struct from main memory just to compare an int or a boolean. When comparison needs, then loads the struct from main memory(if you have local mem optimization already, then you should put this operation before it so loading into local mem is only done for selected neighbours)
This optimisation makes best case(no neighbour or only one eighbour) considerably faster. Does not affect worst case(maximum neighbours case).
8a) Magic Using shifting instead of dividing by power of 2. Doing similar for modulo. Putting "f" at the end of floating literals(1.0f instead of 1.0) to avoid automatic conversion from double to float.
8b) Magic-2 -cl-mad-enable Compiler option can increase multiply+add operation speed.
9) Latency Hiding Execution configuration optimization. You need to hide memory access latency and take care of occupancy.
Get maximum cycles of latency for instructions and global memory access.
Then divide memory latency by instruction latency.
Now you have the ratio of: arithmetic instruction number per memory access to hide latency.
If you have to use N instructions to hide mem latency and you have only M instructions in your code, then you will need N/M warps(wavefronts?) to hide latency because a thread in gpu can do arithmetics while other thread getting things from mem.
10) Mixed Type Computing After memory access is optimized, swap or move some instructions where applicable to get better occupancy, use half-type to help floating point operations where precision is not important.
11) Latency Hiding again Try your kernel code with only arithmetics(comment out all mem accesses and initiate them with 0 or sometihng you like) then try your kernel code with only memory access instructions(comment out calculations/ ifs)
Compare kernel times with original kernel time. Which is affeecting the originatl time more? Concentrate on that..
12) Lane & Bank Conflicts Correct any LDS-lane conflicts and global memory bank conflicts because same address accessings can be done in a serialed way slowing process(newer cards have broadcast ability to reduce this)
13) Using registers Try to replace any independent locals with privates since your GPU can give nearly 10TB/s throughput using registers.
14) Not Using Registers Dont use too many registers or they will spill to global memory and slow the process.
15) Minimalistic Approach for Occupation Look at local/private usage to get an idea of occupation. If you use much more local and privates then less threads can be utilized in same compute unit and leading lesser occupation. Less resource usage leads higher chance of occupation(if you have enough total threads)
16) Gather Scatter When neighbours are different particles(like an nbody NNS) from random addresses of memory, its maybe hard to apply but, gather read optimization can give 2x-3x speed on top of before optimizations (needs local memory optimization to work) so it reads in an order from memory instead of randomly and reorders as needed in the local memory to share between (scatter) to threads.
17) Divide and Conquer Just in case when buffer is too big and copied between host and device so makes gpu wait idle, then divide it in two, send them separately, start computing as soon as one arrives, send results back concurrently in the end. Even a process-level parallelism could push a gpu to its limits this way. Also L2 cache of GPU may not be enough for whole of data. Cache-tiled computing but implicitly done instead of direct usage of local memory.
18) Bandwidth from memory qualifiers. When kernel needs some extra 'read' bandwidth, you can use '__constant'(instead of __global) keyword on some parameters which are less in size and only for reading. If those parameters are too large then you can still have good streaming from '__read_only' qualifier(after the '__global' qualifier). Similary '__write_only' increases throughput but these give mostly hardware-specific performance. If it is Amd's HD5000 series, constant is good. Maybe GTX660 is faster with its cache so __read_only may become more usable(or Nvidia using cache for __constant?).
Have three parts of same buffer with one as __global __read_only, one as __constant and one as just __global (if building them doesn't penalty more than reads' benefits).
Just tested my card using AMD APP SDK examples, LDS bandwidth shows 2TB/s while constant is 5TB/s(same indexing instead of linear/random) and main memory is 120 GB/s.
Also don't forget to add restrict to kernel parameters where possible. This lets compiler do more optimizations on them(if you are not aliasing them).
19) Modern hardware transcendental functions are faster than old bit hack (like Quake-3 fast inverse square root) versions
20) Now there is Opencl 2.0 which enables spawning kernels inside kernels so you can further increase resolution in a 2d grid point and offload it to workgroup when needed (something like increasing vorticity detail on edges of a fluid dynamically)
A profiler can help for all those, but any FPS indicator can do if only single optimization is done per step.
Even if benchmarking is not for architecture-dependent code paths, you could try having a multiple of 192 number of dots per row in your compute space since your gpu has multiple of that number of cores and benchmark that if it makes gpu more occupied and have more gigafloatingpoint operations per second.
There must be still some room for optimization after all these options, but idk if it damages your card or feasible for production time of your projects. For example:
21) Lookup tables When there is 10% more memory bandwidth headroom but no compute power headroom, offload 10% of those workitems to a LUT version such that it gets precomputed values from a table. I didn't try but something like this should work:
8 compute groups
2 LUT groups
8 compute groups
2 LUT groups
so they are evenly distributed into "threads in-flight" and get advantage of latency hiding stuff. I'm not sure if this is a preferable way of doing science.
21) Z-order pattern For traveling neighbors increases cache hit rate. Cache hit rate saves some global memory bandwidth for other jobs so that overall performance increases. But this depends on size of cache, data layout and some other things I don't remember.
22) Asynchronous Neighbor Traversal
iteration-1: Load neighbor 2 + compute neighbor 1 + store neighbor 0
iteration-2: Load neighbor 3 + compute neighbor 2 + store neighbor 1
iteration-3: Load neighbor 4 + compute neighbor 3 + store neighbor 2
so each body of loop doesn't have any chain of dependency and fully pipelined on GPU processing elements and OpenCL has special instructions for asynchronously loading/storing global variables using all cores of a workgroup. Check this:
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/async_work_group_copy.html
Maybe you can even divide computing part into two and have one part use transcandental functions and other part use add/multiply so that add/multiply operations don't wait for a slow sqrt. If there are at least several neighbors to traveerse, this should hide some latency behind other iterations.

Are there compilers that optimise floating point operations for accuracy (as opposed to speed)?

We know that compilers are getting better and better at optimising our code and make it run faster, but my question are there compilers that can optimise floating point operations to ensure greater accuracy.
For example a basic rule is to perform multiplications before addition, this is because multiplication and division using floating point numbers does not introduce inaccuracies as great as that of addition and subtraction but can increase the magnitude of inaccuracies introduced by addition and subtraction, so it should be done first in many cases.
So a floating point operation like
y = x*(a + b); // faster but less accurate
Should be changed to
y = x*a + x*b; // slower but more accurate
Are there any compilers that will optimise for improved floating point accuracy at the expense of speed like I showed above? Or is the main concern of compilers speed with out looking at accuracy of floating point operations?
Thanks
Update: The selected answer, showed a very good example where this type of optimisation would not work, so it wouldn't be possible for the compiler to know before hand what is the more accurate way to evaluate y. Thanks for the counter example.
Your premise is faulty. x*(a + b), is (in general) no less accurate than x*a + x*b. In fact, it will often be more accurate, because it performs only two floating point operations (and therefore incurs only two rounding errors), whereas the latter performs three operations.
If you know something about the expected distribution of values for x, a, and b a priori, then you could make an informed decision, but compilers almost never have access to that type of information.
That aside, what if the person writing the program actually meant x*(a+b) and specifically wanted the exactly roundings that are caused by that particular sequence of operations? This sort of thing is actually pretty common in high-quality numerical algorithms.
Better to do what the programmer wrote, not what you think he might have intended.
Edit -- An example to illustrate a case where the transformation you suggested results in a catastrophic loss of accuracy: suppose
x = 3.1415926535897931
a = 1.0e15
b = -(1.0e15 - 1.0)
Then, evaluating in double we get:
x*(a + b) = 3.1415926535897931
but
x*a + x*b = 3.0
Compilers typically "optimize" for accuracy over speed, accuracy defined as exact implementation of the IEEE 754 standard. Whereas integer operations can be reordered in any way that doesn't cause overflow, FP operations need to be performed exactly as the programmer specifies. This may sacrifice numerical accuracy (ordinary C compilers are not equipped to optimize for that) but faithfully implements the what the programmer asked.
A programmer who is sure he hasn't manually optimized for accuracy may enable compiler features like GCC's -funsafe-math-optimizations and -ffinite-math-only to possibly extract extra speed. But usually there isn't much gain.
No, there isn't. Stephen Canon gives some good reasons why this would be a stupid idea, and he's correct; so you won't find a compiler that does this.
If you as the programmer have some knowledge about the ranges of numbers you're manipulating, you can use parentheses, temporary variables and similar constructs to strongly hint the compiler about how you want things done.

Mathematical analysis of a sound sample (as an array of numbers)

I need to find the frequency of a sample, stored (in vb) as an array of byte. Sample is a sine wave, known frequency, so I can check), but the numbers are a bit odd, and my maths-foo is weak.
Full range of values 0-255. 99% of numbers are in range 235 to 245, but there are some outliers down to 0 and 1, and up to 255 in the remaining 1%.
How do I normalise this to remove outliers, (calculating the 235-245 interval as it may change with different samples), and how do I then calculate zero-crossings to get the frequency?
Apologies if this description is rubbish!
The FFT is probably the best answer, but if you really want to do it by your method, try this:
To normalize, first make a histogram to count how many occurrances of each value from 0 to 255. Then throw out X percent of the values from each end with something like:
for (i=lower=0;i< N*(X/100); lower++)
i+=count[lower];
//repeat in other direction for upper
Now normalize with
A[i] = 255*(A[i]-lower)/(upper-lower)-128
Throw away results outside the -128..127 range.
Now you can count zero crossings. To make sure you are not fooled by noise, you might want to keep track of the slope over the last several points, and only count crossings when the average slope is going the right way.
The standard method to attack this problem is to consider one block of data, hopefully at least twice the actual frequency (taking more data isn't bad, so it's good to overestimate a bit), then take the FFT and guess that the frequency corresponds to the largest number in the resulting FFT spectrum.
By the way, very similar problems have been asked here before - you could search for those answers as well.
Use the Fourier transform, it's much more noise insensitive than counting zero crossings
Edit: #WaveyDavey
I found an F# library to do an FFT: From here
As it turns out, the best free
implementation that I've found for F#
users so far is still the fantastic
FFTW library. Their site has a
precompiled Windows DLL. I've written
minimal bindings that allow
thread-safe access to FFTW from F#,
with both guru and simple interfaces.
Performance is excellent, 32-bit
Windows XP Pro is only up to 35%
slower than 64-bit Linux.
Now I'm sure you can call F# lib from VB.net, C# etc, that should be in their docs
If I understood well from your description, what you have is a signal which is a combination of a sine plus a constant plus some random glitches. Say, like
x[n] = A*sin(f*n + phi) + B + N[n]
where N[n] is the "glitch" noise you want to get rid of.
If the glitches are one-sample long, you can remove them using a median filter which has to be bigger than the glitch length. On both sides of the glitch. Glitches of length 1, mean you will have enough with a median of 3 samples of length.
y[n] = median3(x[n])
The median is computed so: Take the samples of x you want to filter (x[n-1],x[n],x[n+1]), sort them, and your output is the middle one.
Now that the noise signal is away, get rid of the constant signal. I understand the buffer is of a limited and known length, so you can just compute the mean of the whole buffer. Substract it.
Now you have your single sinus signal. You can now compute the fundamental frequency by counting zero crossings. Count the amount of samples above 0 in which the former sample was below 0. The period is the total amount of samples of your buffer divided by this, and the frequency is the oposite (1/x) of the period.
Although I would go with the majority and say that it seems like what you want is an fft solution (fft algorithm is pretty quick), if fft is not the answer for whatever reason you may want to try fitting a sine curve to the data using a fitting program and reading off the fitted frequency.
Using Fityk, you can load the data, and fit to a*sin(b*x-c) where 2*pi/b will give you the frequency after fitting.
Fityk can be used from a gui, from a command-line for scripting and has a C++ API so could be included in your programs directly.
I googled for "basic fft". Visual Basic FFT Your question screams FFT, but be careful, using FFT without understanding even a little bit about DSP can lead results that you don't understand or don't know where they come from.
get the Frequency Analyzer at http://www.relisoft.com/Freeware/index.htm and run it and look at the code.