Optimization of HLSL shaders

I'm trying to optimize the terrain shader for my XNA game, as it seems to consume a lot of resources. It costs me around 10 to 20 FPS on my computer, and my terrain is 512*512 vertices, so the pixel shader is called a great many times.
I've read that branching costs resources, and I have 3 or 4 conditionals in my shaders.
What could I do to bypass them? Are ternary operators more efficient than conditionals?
For instance:
float a = (b == x) ? c : d;
or
float a;
if (b == x)
    a = c;
else
    a = d;
I'm also using the lerp and clamp functions multiple times; would it be more efficient to use plain arithmetic operations instead?
Here's the least efficient part of my code:
float fog;
if (FogWaterActivated && input.WorldPosition.y - 0.1 < FogWaterHeight)
{
    if (!IsUnderWater)
        fog = clamp(input.Depth * 0.005 * (FogWaterHeight - input.WorldPosition.y), 0, 1);
    else
        fog = clamp(input.Depth * 0.02, 0, 1);
    return float4(lerp(lerp(output * light, FogColorWater, fog), ShoreColor, shore), 1);
}
else
{
    fog = clamp((input.Depth * 0.01 - FogStart) / (FogEnd - FogStart), 0, 0.8);
    return float4(lerp(lerp(output * light, FogColor, fog), ShoreColor, shore), 1);
}
Thanks!

Any time you can precalculate operations done on shader constants, the better. Removing division operations by passing the inverse into the shader is another useful tip, as division is typically slower than multiplication.
In your case, precalculate (1 / (FogEnd - FogStart)) on the CPU, and multiply by that in your second-to-last line of code.
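For instance (a sketch; OneOverFogRange is a hypothetical effect parameter, set from the C# side whenever FogStart or FogEnd changes):
// C# side: effect.Parameters["OneOverFogRange"].SetValue(1f / (fogEnd - fogStart));
float OneOverFogRange;
// ...
fog = clamp((input.Depth * 0.01 - FogStart) * OneOverFogRange, 0, 0.8);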


Is write_image atomic? Is it better to use atomic_max?

Full disclosure: I am cross-posting from the Khronos OpenCL forums, since I have not received any reply there so far:
https://community.khronos.org/t/is-write-image-atomic-is-it-better-than-atomic-max/106418
I'm writing a connected-component labelling algorithm for images (2D and 3D); I found no existing implementations and decided to write one based on pointer jumping and a "recollection step" (by the way: if you are aware of an easy-to-use, production-ready connected-component labelling implementation, let me know).
The "recollection" step kernel pseudocode for 2D images is as follows:
1) global_id = (x, y)
2) read v from img[x, y], decode it into a pair (tx, ty)
3) read v1 from img[tx, ty]
4) do some calculations to extract a boolean value C and a target value T from v1, v, and the neighbours of (x, y) and (tx, ty)
5) if (C), write T into (tx, ty)
Q1: all the kernels where C is true will compete for the write. Suppose it does not matter which one wins (writes last). I've done some tests on an Intel GPU and (with filtering disabled and clamping enabled) there seems to be no issue at all: write_image seems to be atomic, there is a winning value, and my algorithm converges very fast. Can I safely assume that write_image on "unfiltered" images is atomic?
Q2: What I really need is to write into (tx, ty) the maximum T obtained from each kernel. That would involve using buffers instead of images, doing the clamping myself (or using a larger buffer padded with zeroes), and using atomic_max in each kernel. I have not done this yet, out of laziness, since I would need to change my code to use a buffer just to test it, but I believe it would be far slower. Am I right?
For completeness, here is my actual kernel (to be optimized, any suggestions welcome!)
```
__kernel void color_components2(/* base image */ __read_only image2d_t image,
                                /* uint32 */ __read_only image2d_t inputImage1,
                                __write_only image2d_t outImage1) {
    int2 gid = (int2)(get_global_id(0), get_global_id(1));
    int x = gid.x;
    int y = gid.y;
    int2 size = get_image_dim(inputImage1);
    const sampler_t sampler =
        CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    uint4 base = read_imageui(image, sampler, gid);
    uint4 ui4a = read_imageui(inputImage1, sampler, gid);
    // vector components must be accessed as .x/.s0; the [] operator is not
    // valid on OpenCL vector types
    int2 t = (int2)(ui4a.x % size.x, ui4a.x / size.x);
    unsigned int m = ui4a.x;
    unsigned int n = ui4a.x;
    if (base.x > 0) {
        for (int a = -1; a <= 1; a++)
            for (int b = -1; b <= 1; b++) {
                uint4 tmpa =
                    read_imageui(inputImage1, sampler, (int2)(t.x + a, t.y + b));
                m = max(tmpa.x, m);
                uint4 tmpb = read_imageui(inputImage1, sampler, (int2)(x + a, y + b));
                n = max(tmpb.x, n);
            }
    }
    if (n > m) write_imageui(outImage1, t, (uint4)(n, 0, 0, 0));
}
```
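For reference, the buffer-based variant I have in mind for Q2 would be roughly the following sketch (untested; outBuffer is a zero-initialized uint buffer of size.x * size.y elements replacing outImage1, and the clamping is done by hand since buffers have no CLK_ADDRESS_CLAMP):
```
__kernel void color_components2_max(/* base image */ __read_only image2d_t image,
                                    /* uint32 */ __read_only image2d_t inputImage1,
                                    __global uint *outBuffer) {
    int2 gid = (int2)(get_global_id(0), get_global_id(1));
    int2 size = get_image_dim(inputImage1);
    const sampler_t sampler =
        CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
    uint4 base = read_imageui(image, sampler, gid);
    uint4 ui4a = read_imageui(inputImage1, sampler, gid);
    int2 t = (int2)(ui4a.x % size.x, ui4a.x / size.x);
    uint m = ui4a.x;
    uint n = ui4a.x;
    if (base.x > 0) {
        for (int a = -1; a <= 1; a++)
            for (int b = -1; b <= 1; b++) {
                m = max(read_imageui(inputImage1, sampler,
                                     (int2)(t.x + a, t.y + b)).x, m);
                n = max(read_imageui(inputImage1, sampler,
                                     (int2)(gid.x + a, gid.y + b)).x, n);
            }
    }
    // atomic_max on 32-bit integers is core since OpenCL 1.1: the largest n
    // among all work-items targeting the same pixel wins.
    if (n > m && t.x >= 0 && t.x < size.x && t.y >= 0 && t.y < size.y)
        atomic_max(&outBuffer[t.y * size.x + t.x], n);
}
```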

Looping with iterator vs temp object gives different result graphically (Libgdx/Java)

I've got a particle "engine" to which I've added a pool system, and I've tested two different ways of rendering every Particle in a list. Please note that the pooling really doesn't have anything to do with the problem; I just followed a tutorial and tried the second method, and that's when I noticed that they behaved differently.
The first way:
for (int i = 0; i < particleList.size(); i++) {
    Iterator<Particle> it = particleList.iterator();
    while (it.hasNext()) {
        Particle p = it.next();
        if (p.isDead()) {
            it.remove();
        }
        p.render(batch, delta);
    }
}
Which works just fine. My particles are sharp and they move with the correct speed.
The second way:
Particle p;
for (int i = 0; i < particleList.size(); i++) {
    p = particleList.get(i);
    p.render(batch, delta);
    if (p.isDead()) {
        particleList.remove(i);
        bulletPool.free(p);
    }
}
Which makes all my particles blurry and move really slowly!
The render method for my particles looks like this:
public void render(SpriteBatch batch, float delta) {
    sprite.setX(sprite.getX() + (dx * speed) * delta * Assets.FPS);
    sprite.setY(sprite.getY() + (dy * speed) * delta * Assets.FPS);
    ttl--;
    sprite.setScale(sprite.getScaleX() - 0.002f);
    if (ttl <= 0 || sprite.getScaleX() <= 0)
        isDead = true;
    sprite.draw(batch);
}
Why do the different rendering methods provide different results?
Thanks in advance
You are mutating (removing elements from) a list while iterating over it. This is a classic way to make a mess.
The Iterator has code to handle the delete case correctly, but your index-based for loop does not. Specifically, when you call particleList.remove(i), i falls "out of sync" with the contents of the list. Consider what happens when you remove the element at index 3: i increments to 4, but the old element 4 has been shuffled down into index 3, so it gets skipped.
I assume you're avoiding the Iterator to avoid memory allocations. One way to side-step this issue is to reverse the loop (go from particleList.size() - 1 down to 0). Alternatively, only increment i for particles that are not removed. The reversed loop is sketched below.
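A minimal version of the reversed loop, with the same names as your code (untested):
Particle p;
for (int i = particleList.size() - 1; i >= 0; i--) {
    p = particleList.get(i);
    p.render(batch, delta);
    if (p.isDead()) {
        // removing index i is safe here: only elements above i shift down,
        // and those have all been visited already
        particleList.remove(i);
        bulletPool.free(p);
    }
}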

Doing "uint8x8x4_t - 128" then divising this by 2

I'm a bit mixed up about how to do a division by a scalar on NEON in a specific case.
In a C++ context, I'm implementing a contrast effect with a very rudimentary algorithm:
if (currentEffect == "contrast_with_cpp")
{
r += ((r - 128) / 2);
g += ((g - 128) / 2);
b += ((b - 128) / 2);
}
I would like to port this algorithm to NEON intrinsics.
I've tried, but I'm a total newbie to this approach, and I cannot debug this code in Visual Studio; it is compiled at startup and integrated into a Windows Phone application.
if (currentEffect == "contrast_with_neon") /* Experimental, not working *
{
// To test
copy_rgb = rgb;
// Substract 128 from the copy, prevent it should be a signed variable
?
// Get half value from copy and put it in another copy
uint8x8x4_t otherCopy = interleaved;
otherCopy.val[2] = vmul_n_f32(copy_rgb.val[2], 0.5);
otherCopy.val[1] = vmul_n_f32(copy_rgb.val[1], 0.5);
otherCopy.val[0] = vmul_n_f32(copy_rgb.val[0], 0.5);
// Add it to the first copy
copy_rgb.val[2] = vadd_u8(copy_rgb.val[2], otherCopy.val[2]);
copy_rgb.val[1] = vadd_u8(copy_rgb.val[2], otherCopy.val[1]);
copy_rgb.val[0] = vadd_u8(copy_rgb.val[2], otherCopy.val[0]);
rgb = copy_rgb;
}
Is this achievable using intrinsics?
[Edit] I guess the color data structure is similar to this
Stop wasting your time with intrinsics. It's a real pain, especially with gcc.
Try this in assembly:
vmov.i16 qMedian, #128 // put this line outside of the loop
// -----------------------------------------------
vmovl.u8 qRed, dRed
vmovl.u8 qGrn, dGrn
vmovl.u8 qBlu, dBlu
vsub.s16 qRedTemp, qRed, qMedian
vsub.s16 qGrnTemp, qGrn, qMedian
vsub.s16 qBluTemp, qBlu, qMedian
vshr.s16 qRedTemp, #1 // arithmetic shift right by 1 = divide by 2
vshr.s16 qGrnTemp, #1
vshr.s16 qBluTemp, #1
vadd.s16 qRed, qRedTemp
vadd.s16 qGrn, qGrnTemp
vadd.s16 qBlu, qBluTemp
vqmovun.s16 dRed, qRed
vqmovun.s16 dGrn, qGrn
vqmovun.s16 dBlu, qBlu
vqmovun.s16 saturates at 255, and any negative values become zeros, which I assume is intended.
PS: Why are you working with float operations at all?
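If you do want to stay in C, a rough intrinsics equivalent of the assembly above would be (a sketch, untested; call it once per colour plane of your uint8x8x4_t):
#include <arm_neon.h>

// Widen to signed 16-bit, subtract 128, halve, add back, then narrow with
// saturation (negatives clamp to 0, values above 255 clamp to 255).
static inline uint8x8_t contrast_half(uint8x8_t plane)
{
    int16x8_t wide = vreinterpretq_s16_u16(vmovl_u8(plane)); // u8 -> s16
    int16x8_t diff = vsubq_s16(wide, vdupq_n_s16(128));      // x - 128
    int16x8_t half = vshrq_n_s16(diff, 1);                   // (x - 128) / 2
    return vqmovun_s16(vaddq_s16(wide, half));               // saturate back to u8
}
Usage would be rgb.val[0] = contrast_half(rgb.val[0]); and likewise for val[1] and val[2].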

Recursion to Iteration conversion

Knowing that every recursive function can be translated into an iterative version, can someone help me find the iterative version of this pseudocode? I am trying to optimize the code, and recursion is clearly not the way to go.
sub calc(a, b)
{
    total = 0;
    if (b <= 1)
        return 1;
    if (2*a > CONST)
        for i IN (1..CONST)
            total += calc(i, b-1);
    else
        for j IN (2*a..CONST)
            total += calc(j, b-1);
    return total;
}
CONST = 100;
print calc(CONST, 2000);
Thanks for the help!
A refactoring from recursion to iteration is not the answer to your performance woes here. This algorithm benefits most from caching, in much the same way as the Fibonacci sequence does.
After writing a short test program in F# with some sample data (CONST = 5, a = 0..10, b = 2..10):
The original program took 6.931 seconds.
The cached version took 0.049 seconds.
The solution is to keep a dictionary keyed by the tuple (a, b) and look the values up before calculating. Here is the algorithm with caching:
dictionary = new Dictionary<tuple(int, int), int>();

sub calc(a, b)
{
    if (dictionary.Contains(tuple(a, b)))
        return dictionary[tuple(a, b)];
    else
    {
        total = 0;
        if (b <= 1)
            return 1;
        if (2*a > CONST)
            for i IN (1..CONST)
                total += calc(i, b-1);
        else
            for j IN (2*a..CONST)
                total += calc(j, b-1);
        dictionary[tuple(a, b)] = total;
        return total;
    }
}
Edit: just to confirm that it was not the iterative nature of my testing that caused the performance gain, I tried them both again with a single set of parameters (CONST = 5, a = 6, b = 20).
The cached version took 0.034 seconds.
The original version was still running after 2+ minutes.
Only tail-recursive algorithms can be converted to iterative form mechanically; anything else needs an explicit stack. Your code is most definitely not tail recursive, so it can't easily be converted to an iterative version.
The solution to your performance problems is memoization.
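For illustration, here is a minimal memoized version in C++ (a sketch following the pseudocode's semantics; note that for large b the totals overflow any fixed-width integer, which the pseudocode ignores):
#include <cstdio>
#include <map>
#include <utility>

const int CONST = 100;
std::map<std::pair<int, int>, long long> cache;

long long calc(int a, int b)
{
    if (b <= 1)
        return 1;
    std::pair<int, int> key(a, b);
    auto hit = cache.find(key);
    if (hit != cache.end())
        return hit->second;
    long long total = 0;
    // same branch as the pseudocode: sum over 1..CONST if 2a > CONST,
    // otherwise over 2a..CONST
    int start = (2 * a > CONST) ? 1 : 2 * a;
    for (int i = start; i <= CONST; i++)
        total += calc(i, b - 1);
    cache[key] = total;
    return total;
}

int main()
{
    std::printf("%lld\n", calc(CONST, 2000));
    return 0;
}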

Simulate "Newton's law of universal gravitation" using Box2D

I want to simulate Newton's law of universal gravitation using Box2D.
I went through the manual but couldn't find a way to do this.
Basically what I want to do is place several objects in space (zero gravity) and simulate the movement.
Any tips?
It's pretty easy to implement:
for (int i = 0; i < numBodies; i++) {
    b2Body* bi = bodies[i];
    b2Vec2 pi = bi->GetWorldCenter();
    float mi = bi->GetMass();
    // start at i + 1: each pair is handled once, and a body never
    // attracts itself (r would be zero, making the force divide by zero)
    for (int k = i + 1; k < numBodies; k++) {
        b2Body* bk = bodies[k];
        b2Vec2 pk = bk->GetWorldCenter();
        float mk = bk->GetMass();
        b2Vec2 delta = pk - pi;
        float r = delta.Length();
        float force = G * mi * mk / (r * r);
        delta.Normalize();
        bi->ApplyForce(force * delta, pi);
        bk->ApplyForce(-force * delta, pk);
    }
}
Unfortunately, Box2D doesn't have native support for it, but you can implement it yourself: Box2D and radial gravity code
As said by others, Box2D has no built-in support for it, but you can add support to the library in b2_island.cpp. Just replace
v += h * b->m_invMass * (b->m_gravityScale * b->m_mass * gravity + b->m_force);
with
int planet_x = 0;
int planet_y = 0;
b2Vec2 gravityVector = (b2Vec2(planet_x, planet_y) - b->GetPosition());
gravityVector.Normalize();
gravityVector.x = gravityVector.x * 10.0f;
gravityVector.y = gravityVector.y * 10.0f;
v += h * b->m_invMass * (b->m_gravityScale * b->m_mass * gravityVector + b->m_force);
That's a simple solution if you have only one planet.
If you want less force the further away you are, you could scale the vector by 1/length (or 1/length squared, for true Newtonian falloff) instead of normalizing it. That would also make it possible to add up the gravity of two planets: you could iterate over a planet list and sum the gravity vectors.
Additionally, implementing a function like b2World::CreatePlanet might be useful then.
The 10.0f is just an approximation of Earth's 9.81f; you might need to adjust it. If the mass of the planet is relevant, you might need to multiply it by a constant to make it look more realistic, or just increase the density of the object to match the real mass of a planet.
Of course, you can also leave the world gravity at (0, 0) and calculate the force for every object yourself before each step, but that might cost some performance. A sketch of that per-step approach follows.
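A sketch of that per-step, multi-planet approach (Planet is a hypothetical struct, not part of Box2D; untested):
#include <vector>
#include <box2d/box2d.h>

// Hypothetical attractor: a fixed position plus a strength factor you tune
// (conceptually G multiplied by the planet's mass).
struct Planet {
    b2Vec2 position;
    float strength;
};

// Call before every world.Step(), with the world gravity set to (0, 0).
void ApplyPlanetGravity(b2World& world, const std::vector<Planet>& planets)
{
    for (b2Body* b = world.GetBodyList(); b; b = b->GetNext()) {
        if (b->GetType() != b2_dynamicBody)
            continue;
        b2Vec2 total(0.0f, 0.0f);
        for (const Planet& planet : planets) {
            b2Vec2 delta = planet.position - b->GetWorldCenter();
            float r2 = delta.LengthSquared();
            if (r2 < 0.0001f)
                continue; // skip the singularity at the planet's centre
            delta.Normalize();
            // inverse-square falloff, as in Newton's law
            total += (planet.strength * b->GetMass() / r2) * delta;
        }
        b->ApplyForceToCenter(total, true);
    }
}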