Speed of cos() and sin() function in GLSL shaders? - optimization

I'm interested in information about the speed of sin() and cos() in Open GL Shader Language.
The GLSL Specification Document indicates that:
The built-in functions basically fall into three categories:
...
...
They represent an operation graphics hardware is likely to accelerate at some point. The trigonometry functions fall into this
category.
EDIT:
As has been pointed out, counting clock cycles of individual operations like sin() and cos() doesn't really tell the whole performance story.
So to clarify my question, what I'm really interested in is whether it's worthwhile to optimize away sin() and cos() calls for common cases.
For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:
float sina, cosa;
if ( rotation == 0 )
{
sina = 0;
cosa = 1;
}
else
{
sina = sin( rotation );
cosa = cos( rotation );
}
Or will the GLSL compiler or the sin() and cos() implementations take care of optimizations like that for me?

For example, in my application it'll be very common for the argument to be 0. So does something like this make sense:
No.
Your compiler will do one of two things.
It will issue an actual conditional branch. In the best possible case, if 0 is a value that is coherent locally (such that groups of shaders will often hit 0 or non-zero together), then you might get improved performance.
It will evaluate both sides of the condition, and only store the result for the correct one of them. In which case, you've gained nothing.
In general, it's not a good idea to use conditional logic to dance around small performance like this. It needs to be really big to be worthwhile, like a discard or something.
Also, do note that floating-point equivalence is not likely to work. Not unless you actually pass a uniform or vertex attribute containing exactly 0.0 to the shader. Even interpolating between 0 and non-zero will likely never produce exactly 0 for any fragment.

This is a good question. I too wondered this.
Google'd links say cos and sin are single-cycle on mainstream cards since 2005 or so.

You'd have to test this out yourself, but I'm pretty sure that branching in a shader is far more expensive than a sin or cos calculation. GLSL compilers are pretty good about optimizing shaders, worrying about this is premature optimization. If you later find that, through your entire program, your shaders are the bottleneck, then you can worry about optimizing this.
If you want to take a look at the assembly code of your shader for a specific platform, I would recommend AMD GPU ShaderAnalyzer.

Not sure if this answers your question, but it's very difficult to tell you how many clocks/slots an instruction takes as it depends very much on the GPU. Usually it's a single cycle. But even if not, the compiler may rearrange the order of instruction execution to hide the true cost. It's certainly slower to use texture lookups for sin/cos as it is to execute the instructions.

see how many sin's you can get in one shader in a row, compared to math.abs,frac, ect... i think a gtx 470 can handle 200 sin functions per fragment no probs, the frame will be 10 percent slower than an empty shader. it's farly fast, you can send results in. it will be a good indicator of computational efficiency.

The compiler evaluates both branches, which makes conditions quite expensive. If you use both sin and cos in your shader, you can calculate only sin(a) and cos(a) = sqrt(1.0 - sin(a)) since sin(x)*sin(x) + cos(x)*cos(x) is always 1.0

Related

Raku parallel/functional methods

I am pretty new to Raku and I have a questions to functional methods, in particular with reduce.
I originally had the method:
sub standardab{
my $mittel = mittel(#_);
my $foo = 0;
for #_ {
$foo += ($_ - $mittel)**2;
}
$foo = sqrt($foo/(#_.elems));
}
and it worked fine. Then I started to use reduce:
sub standardab{
my $mittel = mittel(#_);
my $foo = 0;
$foo = #_.reduce({$^a + ($^b-$mittel)**2});
$foo = sqrt($foo/(#_.elems));
}
my execution time doubled (I am applying this to roughly 1000 elements) and the solution differed by 0.004 (i guess rounding error).
If I am using
.race.reduce(...)
my execution time is 4 times higher than with the original sequential code.
Can someone tell me the reason for this?
I thought about parallelism initialization time, but - as I said - i am applying this to 1000 elements and if i change other for loops in my code to reduce it gets even slower!
Thanks for your help
Summary
In general, reduce and for do different things, and they are doing different things in your code. For example, compared with your for code, your reduce code involves twice as many arguments being passed and is doing one less iteration. I think that's likely at the root of the 0.004 difference.
Even if your for and reduce code did the same thing, an optimized version of such reduce code would never be faster than an equally optimized version of equivalent for code.
I thought that race didn't automatically parallelize reduce due to reduce's nature. (Though I see per your and #user0721090601's comment I'm wrong.) But it will incur overhead -- currently a lot.
You could use race to parallelize your for loop instead, if it's slightly rewritten. That might speed it up.
On the difference between your for and reduce code
Here's the difference I meant:
say do for <a b c d> { $^a } # (a b c d) (4 iterations)
say do reduce <a b c d>: { $^a, $^b } # (((a b) c) d) (3 iterations)
For more details of their operation, see their respective doc (for, reduce).
You haven't shared your data, but I will presume that the for and/or reduce computations involve Nums (floats). Addition of floats isn't commutative, so you may well get (typically small) discrepancies if the additions end up happening in a different order.
I presume that explains the 0.004 difference.
On your sequential reduce being 2X slower than your for
my execution time doubled (I am applying this to roughly 1000 elements)
First, your reduce code is different, as explained above. There are general abstract differences (eg taking two arguments per call instead of your for block's one) and perhaps your specific data leads to fundamental numeric computation differences (perhaps your for loop computation is primarily integer or float math while your reduce is primarily rational?). That might explain the execution time difference, or some of it.
Another part of it may be the difference between, on the one hand, a reduce, which will by default compile into calls of a closure, with call overhead, and two arguments per call, and temporary memory storing intermediate results, and, on the other, a for which will by default compile into direct iteration, with the {...} being just inlined code rather than a call of a closure. (That said, it's possible a reduce will sometimes compile to inlined code; and it may even already be that way for your code.)
More generally, Rakudo optimization effort is still in its relatively early days. Most of it has been generic, speeding up all code. Where effort has been applied to particular constructs, the most widely used constructs have gotten the attention so far, and for is widely used and reduce less so. So some or all the difference may just be that reduce is poorly optimized.
On reduce with race
my execution time [for .race.reduce(...)] is 4 times higher than with the original sequential code
I didn't think reduce would be automatically parallelizable with race. Per its doc, reduce works by "iteratively applying a function which knows how to combine two values", and one argument in each iteration is the result of the previous iteration. So it seemed to me it must be done sequentially.
(I see in the comments that I'm misunderstanding what could be done by a compiler with a reduction. Perhaps this is if it's a commutative operation?)
In summary, your code is incurring raceing's overhead without gaining any benefit.
On race in general
Let's say you're using some operation that is parallelizable with race.
First, as you noted, race incurs overhead. There'll be an initialization and teardown cost, at least some of which is paid repeatedly for each evaluation of an overall statement/expression that's being raced.
Second, at least for now, race means use of threads running on CPU cores. For some payloads that can yield a useful benefit despite any initialization and teardown costs. But it will, at best, be a speed up equal to the number of cores.
(One day it should be possible for compiler implementors to spot that a raced for loop is simple enough to be run on a GPU rather than a CPU, and go ahead and send it to a GPU to achieve a spectacular speed up.)
Third, if you literally write .race.foo... you'll get default settings for some tunable aspects of the racing. The defaults are almost certainly not optimal and may be way off.
The currently tunable settings are :batch and :degree. See their doc for more details.
More generally, whether parallelization speeds up code depends on the details of a specific use case such as the data and hardware in use.
On using race with for
If you rewrite your code a bit you can race your for:
$foo = sum do race for #_ { ($_ - $mittel)**2 }
To apply tuning you must repeat the race as a method, for example:
$foo = sum do race for #_.race(:degree(8)) { ($_ - $mittel)**2 }

What speed gain am I changing if I optimize for size (z) instead of speed (3)?

I read both:
https://doc.rust-lang.org/cargo/reference/profiles.html#opt-level
https://github.com/johnthagen/min-sized-rust#optimize-for-size
What I don't understand is which speed gain I'm changing if I use:
[profile.release]
opt-level = "z"
instead of:
[profile.release]
opt-level = 3
Is it right that today opt-level = 3 is the best setting (for opt-level section) for runtime speed?
If I instead use opt-level = "z" I'm decreasing runtime performance, right?
I'm not interested in building/compiling speed.
It's complicated.
The truth of the matter is that a compiler optimization pipeline is largely based on heuristics.
A number of optimizations are sure things (such as strength reduction), however many heavy lifters (such as inlining) are based on a set of heuristics.
The heuristics, of course, are not pulled out of thin air. They have been carefully tuned by the compiler developers based on a sample of programs that is judged representative, and polished based on customer reports.
Still, at the end of the day, they remain heuristics, and as a result some programs are faster with -Oz than with -O3, because a different set of heuristics is used.

Memory efficiency in If statements

I'm thinking more about how much system memory my programs will use nowadays. I'm currently doing A level Computing at college and I know that in most programs the difference will be negligible but I'm wondering if the following actually makes any difference, in any language.
Say I wanted to output "True" or "False" depending on whether a condition is true. Personally, I prefer to do something like this:
Dim result As String
If condition Then
Result = "True"
Else
Result = "False"
EndIf
Console.WriteLine(result)
However, I'm wondering if the following would consume less memory, etc.:
If condition Then
Console.WriteLine("True")
Else
Console.WriteLine("False")
EndIf
Obviously this is a very much simplified example and in most of my cases there is much more to be outputted, and I realise that in most commercial programs these kind of statements are rare, but hopefully you get the principle.
I'm focusing on VB.NET here because that is the language used for the course, but really I would be interested to know how this differs in different programming languages.
The main issue making if's fast or slow is predictability.
Modern CPU's (anything after 2000) use a mechanism called branch prediction.
Read the above link first, then read on below...
Which is faster?
The if statement constitutes a branch, because the CPU needs to decide whether to follow or skip the if part.
If it guesses the branch correctly the jump will execute in 0 or 1 cycle (1 nanosecond on a 1Ghz computer).
If it does not guess the branch correctly the jump will take 50 cycles (give or take) (1/200th of a microsecord).
Therefore to even feel these differences as a human, you'd need to execute the if statement many millions of times.
The two statements above are likely to execute in exactly the same amount of time, because:
assigning a value to a variable takes negligible time; on average less than a single cpu cycle on a multiscalar CPU*.
calling a function with a constant parameter requires the use of an invisible temporary variable; so in all likelihood code A compiles to almost the exact same object code as code B.
*) All current CPU's are multiscalar.
Which consumes less memory
As stated above, both versions need to put the boolean into a variable.
Version A uses an explicit one, declared by you; version B uses an implicit one declared by the compiler.
However version A is guaranteed to only have one call to the function WriteLine.
Whilst version B may (or may not) have two calls to the function WriteLine.
If the optimizer in the compiler is good, code B will be transformed into code A, if it's not it will remain with the redundant calls.
How bad is the waste
The call takes about 10 bytes for the assignment of the string (Unicode 2 bytes per char).
But so does the other version, so that's the same.
That leaves 5 bytes for a call. Plus maybe a few extra bytes to set up a stackframe.
So lets say due to your totally horrible coding you have now wasted 10 bytes.
Not much to worry about.
From a maintainability point of view
Computer code is written for humans, not machines.
So from that point of view code A is clearly superior.
Imagine not choosing between 2 options -true or false- but 20.
You only call the function once.
If you decide to change the WriteLine for another function you only have to change it in one place, not two or 20.
How to speed this up?
With 2 values it's pretty much impossible, but if you had 20 values you could use a lookup table.
Obviously that optimization is not worth it unless code gets executed many times.
If you need to know the precise amount of memory the instructions are going to take, you can use ildasm on your code, and see for yourself. However, the amount of memory consumed by your code is much less relevant today, when the memory is so cheap and abundant, and compilers are smart enough to see common patterns and reduce the amount of code that they generate.
A much greater concern is readability of your code: if a complex chain of conditions always leads to printing a conditionally set result, your first code block expresses this idea in a cleaner way than the second one does. Everything else being equal, you should prefer whatever form of code that you find the most readable, and let the compiler worry about optimization.
P.S. It goes without saying that Console.WriteLine(condition) would produce the same result, but that is of course not the point of your question.

HLSL branch avoidance

I have a shader where I want to move half of the vertices in the vertex shader. I'm trying to decide the best way to do this from a performance standpoint, because we're dealing with well over 100,000 verts, so speed is critical. I've looked at 3 different methods: (pseudo-code, but enough to give you the idea. The <complex formula> I can't give out, but I can say that it involves a sin() function, as well as a function call (just returns a number, but still a function call), as well as a bunch of basic arithmetic on floating point numbers).
if (y < 0.5)
{
x += <complex formula>;
}
This has the advantage that the <complex formula> is only executed half the time, but the downside is that it definitely causes a branch, which may actually be slower than the formula. It is the most readable, but we care more about speed than readability in this context.
x += step(y, 0.5) * <complex formula>;
Using HLSL's step() function (which returns 0 if the first param is greater and 1 if less), you can eliminate the branch, but now the <complex formula> is being called every time, and its results are being multiplied by 0 (thus wasted effort) half of the time.
x += (y < 0.5) ? <complex formula> : 0;
This I don't know about. Does the ?: cause a branch? And if not, are both sides of the equation evaluated or only the one that is relevant?
The final possibility is that the <complex formula> could be offloaded back to the CPU instead of the GPU, but I worry that it will be slower in calculating sin() and other operations, which might result in a net loss. Also, it means one more number has to be passed to the shader, and that could cause overhead as well. Anyone have any insight as to which would be the best course of action?
Addendum:
According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx
the step() function uses a ?: internally, so it's probably no better than my 3rd solution, and potentially worse since <complex formula> is definitely called every time, whereas it may be only called half the time with a straight ?:. (Nobody's answered that part of the question yet.) Though avoiding both and using:
x += (1.0 - y) * <complex formula>;
may well be better than any of them, since there's no comparison being made anywhere. (And y is always either 0 or 1.) Still executes the <complex formula> needlessly half the time, but might be worth it to avoid branches altogether.
Perhaps look at this answer.
My guess (this is a performance question: measure it!) is that you are best off keeping the if statement.
Reason number one: The shader compiler, in theory (and if invoked correctly), should be clever enough to make the best choice between a branch instruction, and something similar to the step function, when it compiles your if statement. The only way to improve on it is to profile[1]. Note that it's probably hardware-dependent at this level of granularity.
[1] Or if you have specific knowledge about how your data is laid out, read on...
Reason number two is the way shader units work: If even one fragment or vertex in the unit takes a different branch to the others, then the shader unit must take both branches. But if they all take the same branch - the other branch is ignored. So while it is per-unit, rather than per-vertex - it is still possible for the expensive branch to be skipped.
For fragments, the shader units have on-screen locality - meaning you get best performance with groups of nearby pixels all taking the same branch (see the illustration in my linked answer). To be honest, I don't know how vertices are grouped into units - but if your data is grouped appropriately - you should get the desired performance benefit.
Finally: It's worth pointing out that your <complex formula> - if you're saying that you can hoist it out of your HLSL manually - it may well get hoisted into a CPU-based pre-shader anyway (on PC at least, from memory Xbox 360 doesn't support this, no idea about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per-draw (rather than per-vertex/fragment) it probably is best for performance to do it on the CPU.
I got tired of my conditionals being ignored so I just made a another kernel and did an override in c execution.
If you need it to be accurate all the time I suggest this fix.

Exponents in Genetic Programming

I want to have real-valued exponents (not just integers) for the terminal variables.
For example, lets say I want to evolve a function y = x^3.5 + x^2.2 + 6. How should I proceed? I haven't seen any GP implementations which can do this.
I tried using the power function, but sometimes the initial solutions have so many exponents that the evaluated value exceeds 'double' bounds!
Any suggestion would be appreciated. Thanks in advance.
DEAP (in Python) implements it. In fact there is an example for that. By adding the math.pow from Python in the primitive set you can acheive what you want.
pset.addPrimitive(math.pow, 2)
But using the pow operator you risk getting something like x^(x^(x^(x))), which is probably not desired. You shall add a restriction (by a mean that I not sure) on where in your tree the pow is allowed (just before a leaf or something like that).
OpenBeagle (in C++) also allows it but you will need to develop your own primitive using the pow from <math.h>, you can use as an example the Sin or Cos primitive.
If only some of the initial population are suffering from the overflow problem then just penalise them with a poor fitness score and they will probably be removed from the population within a few generations.
But, if the problem is that virtually all individuals suffer from this problem, then you will have to add some constraints. The simplest thing to do would be to constrain the exponent child of the power function to be a real literal - which would mean powers would not be allowed to be nested. It depends on whether this is sufficient for your needs though. There are a few ways to add constraints like these (or more complex ones) - try looking in to Constrained Syntactic Structures and grammar guided GP.
A few other simple thoughts: can you use a data-type with a larger range? Also, you could reduce the maximum depth parameter, so that there will be less room for nested exponents. Of course that's only possible to an extent, and it depends on the complexity of the function.
Integers have a different binary representation than reals, so you have to use a slightly different bitstring representation and recombination/mutation operator.
For an excellent demonstration, see slide 24 of www.cs.vu.nl/~gusz/ecbook/slides/Genetic_Algorithms.ppt or check out the Eiben/Smith book "Introduction to Evolutionary Computing Genetic Algorithms." This describes how to map a bit string to a real number. You can then create a representation where x only lies within an interval [y,z]. In this case, choose y and z to be the of less magnitude than the capacity of the data type you are using (e.g. 10^308 for a double) so you don't run into the overflow issue you describe.
You have to consider that with real-valued exponents and a negative base you will not obtain a real, but a complex number. For example, the Math.Pow implementation in .NET says that you get NaN if you attempt to calculate the power of a negative base to a non-integer exponent. You have to make sure all your x values are positive. I think that's the problem that you're seeing when you "exceed double bounds".
Btw, you can try the HeuristicLab GP implementation. It is very flexible with a configurable grammar.