A simple shader optimization case: am I doing it right?

In my fragment shader there are these two lines:
float depthExp = max(0.5, pow(depth, 100.0));
gl_FragColor = vec4(depthExp * vec3(color), 1.0);
i "optimize" it into:
if (depth < 0.99309249543703590153321021688807) { // 0.5^(1/100.0)
    gl_FragColor = vec4(0.5 * vec3(color), 1.0);
} else {
    float depthExp = pow(depth, 100.0);
    gl_FragColor = vec4(depthExp * vec3(color), 1.0);
}
Can I get a performance gain this way, or am I actually doing the opposite of what I intend?
Here is the complete fragment shader; see if there is a chance to optimize it:
varying vec2 TexCoord;
uniform sampler2D Texture_color;
uniform sampler2D Texture_depth;
uniform sampler2D Texture_stencil;
void main()
{
    float depth = texture2D(Texture_depth, TexCoord).r;
    float stencil = texture2D(Texture_stencil, TexCoord).r;
    vec4 color = texture2D(Texture_color, TexCoord);
    if (stencil == 0.0) {
        gl_FragColor = color;
    } else {
        float depthExp = max(0.5, pow(depth, 100.0));
        gl_FragColor = vec4(depthExp * vec3(color), 1.0);
    }
}

First of all, excessive branching in a shader is usually not a good idea. On modern hardware it won't be too bad as long as nearby fragments all take the same branch. But once two fragments of a local packet of fragments (whose size is implementation dependent, probably a small square of say 4x4 to 8x8) take different branches, the GPU actually has to execute both branches for every fragment of the packet.
So if nearby fragments are likely to take the same branch, it may give some improvement. Since the condition is based on the depth (though from a previous rendering, I guess) and the depth buffer usually consists of larger regions with a monotonous depth distribution, it is indeed likely for nearby fragments to enter the same branch. And since the optimized branch is executed for most of the fragments (most will be smaller than 0.993, even more so due to the depth buffer's non-linear nature and its higher precision at smaller values), it may be profitable. But like Apeforce suggests, the best idea would be to measure it.
But this brings me to another question. Given that virtually all fragments in a usual scene will have a depth smaller than 0.993, except for the background, and most of the values will result in incredibly small numbers after raising to the power of 100 (man, 0.95^100 ≈ 0.006 and 0.9^100 ≈ 0.00003), scaling a color (whose precision and impact on perception is not that high in the first place, anyway) by this amount will most probably just zero it out. So if you indeed have a standard depth buffer with values in [0,1] (and maybe even non-linear, as usual), then I wonder what the actual purpose of this pow is and whether there might be a different solution to your problem.

Usually you will want to avoid branching inside shaders at all costs; you are probably better off leaving it as it was to begin with.
I have heard that modern GPUs are better at this (branching statements), though. What are you writing for, OpenGL ES 2.0 or OpenGL 3.2+?
Your use of varying suggests that you are writing for OpenGL ES?
I suggest you just write out your FPS, either to your console (which will affect performance, but since it does so in both cases that's no problem) or to your screen, first using the original shader and then the "optimized" one, and see which gets the higher frame rate.
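For reference, here is a minimal sketch of such a counter (plain C++ with std::chrono; the names are just placeholders, hook it into your own render loop):

#include <chrono>
#include <cstdio>

// Minimal FPS counter: call once per frame from the render loop.
// Prints the average frame rate to the console once per second.
void countFps()
{
    using clock = std::chrono::steady_clock;
    static clock::time_point lastPrint = clock::now();
    static int frames = 0;

    ++frames;
    const auto now = clock::now();
    const double elapsed = std::chrono::duration<double>(now - lastPrint).count();
    if (elapsed >= 1.0) {
        std::printf("fps: %.1f\n", frames / elapsed);
        frames = 0;
        lastPrint = now;
    }
}

Run the same scene and camera path once with each shader; frame counting is coarse, but a clear difference will still show up.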
In general though, you can't optimize a shader program by using branching conditions, which feels really backwards, but that's because of how the hardware works.

Related

Godot - Why does my Mesh deform at extreme locations?

I am a beginner making a simple game and I modeled a robot and gave it controls. You basically walk around but there's gravity. It was working great until I let it fall for a long time, when I noticed something odd.
Here is how it looked before the problem. Then, after I fell off the platform, at about -2000 on the y-axis, deformations became present. It became particularly extreme at about -20,000 on the y-axis.
Is this simply a compression issue? I tried it both compressed and uncompressed and it happened both ways. Perhaps an issue in the Engine?
Based on the numbers you've provided, it's clear you're experiencing floating point precision issues. The precise effect would depend on your platform, but the OpenGL ES Shading Language specifications have some minimum requirements for hardware manufacturers:
https://www.khronos.org/files/opengles_shading_language.pdf (Section 4.5.2)
While The Khronos Group seems to have increased the minimum precision requirement for highp in a newer edition of the specification (which appears to be the version required by OpenGL ES 3.0), I don't know exactly what you're using or how far your hardware goes beyond the minimum, so I'll go with the first specification I've linked. Based on that, GLSL floating point numbers have the following minimum relative precision:
highp  : 2^-16 ≈ 0.0000153
mediump: 2^-10 ≈ 0.000977
lowp   : 2^-8  ≈ 0.00391
Assuming your hardware doesn't go beyond the minimum requirements, you'll have precision issues of about 0.3 units even with highp when you're 20,000 units from the origin (20,000 × 2^-16 ≈ 0.305). Considering your vertices are within a 1x1x1 unit box, this equates to roughly a 30% error relative to the size of your object, which is in line with the images you've shared.
As a solution, try and keep your objects closer to the origin; shift your origin if/when needed.

What is a "Push-Constant" in vulkan?

I've looked through a bunch of tutorials that talk about push constants and allude to possible benefits, but never have I actually seen, even in the Vulkan docs, what the heck a "push constant" actually is. I don't understand what they are supposed to be or what their purpose is. The closest thing I can find is this post, which unfortunately doesn't ask what they are but how they differ from another concept, and it didn't help me much.
What is a push constant, why does it exist, and what is it used for? Where did its name come from?
Push constants are a way to quickly provide a small amount of uniform data to shaders. They should be much quicker than UBOs, but a huge limitation is the size of the data: the spec only requires 128 bytes to be available for a push constant range. Hardware vendors may support more (for example 256 bytes), but compared to other means it is still very little.
Because push constants are much quicker to update than descriptors (the resources through which we provide data to shaders), they are convenient for data that changes between draw calls, for example transformation matrices.
From the shader perspective, they are declared through the layout( push_constant ) qualifier and a block of uniform data. For example:
layout( push_constant ) uniform ColorBlock {
    vec4 Color;
} PushConstant;
From the application perspective, if shaders want to use push constants, they must be specified during pipeline layout creation. Then the vkCmdPushConstants() command must be recorded into a command buffer. This function takes, among other parameters, a pointer to the memory from which data should be copied into the push constant range.
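For illustration, a rough C++ sketch of that application side, matching the ColorBlock example above (error checking omitted; the function names and the choice of the fragment stage are just assumptions for this example):

#include <vulkan/vulkan.h>

// Sketch: a pipeline layout exposing a 16-byte push constant range
// (enough for the vec4 Color above) to the fragment stage.
VkPipelineLayout createLayoutWithPushConstant(VkDevice device)
{
    VkPushConstantRange range = {};
    range.stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT; // stage(s) reading the block
    range.offset     = 0;
    range.size       = sizeof(float) * 4;            // vec4 Color

    VkPipelineLayoutCreateInfo info = {};
    info.sType                  = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
    info.pushConstantRangeCount = 1;
    info.pPushConstantRanges    = &range;
    // descriptor set layouts, if any, would also be listed here

    VkPipelineLayout layout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &info, nullptr, &layout);
    return layout;
}

// Sketch: record the actual data into a command buffer before the draw call.
void recordColor(VkCommandBuffer cmd, VkPipelineLayout layout, const float color[4])
{
    vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_FRAGMENT_BIT,
                       0, sizeof(float) * 4, color);
    // any following vkCmdDraw* in this command buffer sees PushConstant.Color
}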
Different shader stages of a given pipeline can use the same push constant block (similarly to UBOs) or smaller parts of the whole range. But, importantly, each shader stage can use only one push constant block; it can contain multiple members, though. Another important thing is that the total data size, across all shader stages which use push constants, must fit into the size constraint: the limit applies to the whole range, not per stage.
There is an example in the Vulkan Cookbook's repository showing a simple push constant usage scenario. Sascha Willems's Vulkan examples also contain a sample showing how to use push constants.

bool condition opengl es 2.0 shader

Since it is recommended not to use conditionals in shaders, which is better for a boolean uniform:
A. Create different shaders for different values of the boolean uniform?
B. Create one shader and just use if-else in the code, like this:
uniform bool uValue;
if (uValue) {
    // code
} else {
    // code
}
I have read somewhere that for a uniform bool value the driver will compile multiple shader variants, so that we don't have to bother creating multiple shaders ourselves, but I can't verify this.
Thanks!
Which approach is more performant depends on a lot of other things:
How many conditions are you switching on?
How many times per frame are you switching?
How much computation happens on either side of your conditional?
What are the other performance constraints in your situation? Memory usage? Power consumption? Client-to-GPU bandwidth?
Try both options and test with Instruments to see which performs better in your case.
We all know that drivers change very quickly in GPU programming.
If your condition is fairly evenly balanced then there probably isn't a definitive right or wrong answer. It will depend on the hardware, the version of the drivers, and possible future mechanisms that the card itself uses to create parallel batches.
If your condition is more one-sided, then there might be a real benefit to using an if condition in one shader, or to having two shaders and switching between them. Testing the load on the graphics card while it is processing real data is the only way to really answer this.
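As a side note, the "two shaders" variant usually isn't two separate source files; a common approach is to inject a #define in front of a single source string at compile time and branch on it with the preprocessor. A rough C++ sketch under that assumption (compileVariant and USE_FEATURE are made-up names):

#include <GLES2/gl2.h>

// Compiles one of two specialized variants of the same fragment shader source.
// The shader body uses "#ifdef USE_FEATURE ... #else ... #endif" instead of a
// runtime "if (uValue)", so the unused path is removed at compile time.
// Note: if the source starts with a #version line, splice the define in after
// that line instead of prepending it.
GLuint compileVariant(const char* shaderBody, bool useFeature)
{
    const char* prefix = useFeature ? "#define USE_FEATURE 1\n" : "";
    const char* sources[2] = { prefix, shaderBody };

    GLuint shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(shader, 2, sources, NULL); // the strings are concatenated
    glCompileShader(shader);
    return shader; // check GL_COMPILE_STATUS in real code
}

You then bind the appropriate program object at draw time instead of setting the bool uniform.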
If this is identified as your bottleneck point and is worth the time investment, then perhaps include both and choose at runtime. But remember there is no point in optimizing code if it won't make your shader faster. If your code delivers all of the requested visual features and you are still processor bound, then you have done your job.
Equally, optimizing if statements when you are fetch bound doesn't make any sense. So save your optimization effort until you have implemented as many of the visual features as you can, then optimize, which might get you one more feature, then optimize again.

How is ray coherence used to improve raytracing speed while still looking realistic?

I'm considering exploiting ray coherence in my software per-pixel realtime raycaster.
AFAICT, using a uniform grid, if I assign ray coherence to patches of, say, 4x4 pixels (where at present I have one raycast per pixel), given 16 parallel rays with different start (and end) points, how does this work out to a coherent scene? What I foresee is:
There is a distance within which the ray march would be exactly the same for adjacent/similar rays. Within that distance, I am saving on processing. (How do I know what that distance is?)
I will end up with a slightly to seriously incorrect image, due to the fact that some rays didn't diverge at the right times.
Given that my rays are cast from a single point rather than a plane, I guess I will need some sort of splitting function according to distance traversed, such that the set of all rays forms a tree as it moves outward. My concern here is that finer detail will be lost closer to the viewer.
I guess I'm just not grasping how this is meant to be used.
If done correctly, ray coherence shouldn't affect the final image. Because the rays are very close together, there's a good chance that they'll all take similar paths when traversing the acceleration structure (kd-tree, AABB tree, etc.). You have to go down each branch that any of the rays could hit, but hopefully this doesn't increase the number of branches by much, and it saves on memory access.
The other advantage is that you can use SIMD (e.g. SSE) to accelerate some of your tests, both in the acceleration structure and against the triangles.
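To make the SIMD part concrete, here is a rough sketch (SSE intrinsics, rays stored in SoA form, all names made up) of testing a 2x2 ray packet against one bounding box with the slab method; a tree node is visited whenever any of the four lanes reports a possible hit:

#include <xmmintrin.h> // SSE

// Tests 4 rays at once against an axis-aligned box. orig[axis] and invDir[axis]
// each hold that component for all 4 rays; invDir is the precomputed 1/direction.
// Returns a 4-bit mask; bit i set means ray i may hit the box, so a packet
// traversal descends into the node whenever the mask is non-zero.
int rayPacketVsAabb(const __m128 orig[3], const __m128 invDir[3],
                    const float boxMin[3], const float boxMax[3])
{
    __m128 tmin = _mm_set1_ps(0.0f);  // start of the valid ray interval
    __m128 tmax = _mm_set1_ps(1e30f); // effectively "infinity"

    for (int axis = 0; axis < 3; ++axis) {
        __m128 lo = _mm_set1_ps(boxMin[axis]);
        __m128 hi = _mm_set1_ps(boxMax[axis]);
        __m128 t0 = _mm_mul_ps(_mm_sub_ps(lo, orig[axis]), invDir[axis]);
        __m128 t1 = _mm_mul_ps(_mm_sub_ps(hi, orig[axis]), invDir[axis]);
        // shrink the interval to the slab overlap, per ray
        tmin = _mm_max_ps(tmin, _mm_min_ps(t0, t1));
        tmax = _mm_min_ps(tmax, _mm_max_ps(t0, t1));
    }
    // ray i overlaps the box iff tmin[i] <= tmax[i]
    return _mm_movemask_ps(_mm_cmple_ps(tmin, tmax));
}

The same pattern extends to four-at-a-time ray/triangle tests; the image stays exact because no ray ever skips a node it could actually hit.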

At what phase in rendering does clipping occur?

I've got some OpenGL drawing code that I'm trying to optimize. It's currently testing all drawing objects for visibility client-side before deciding whether or not to send rendering data to OpenGL. (This is easier than it sounds. It's drawing a 2D scene so clipping is trivial: just test against the current coordinates of the viewport rectangle.)
It occurs to me that the entire model could be greatly simplified by passing the entire scene to OpenGL and letting the GPU take care of the clipping. But sometimes the scene can be very, very complex, involving up to 100,000 sprites in total, most of which never get rendered because they're off-camera, and I'd prefer not to end up killing the framerate in the name of simplicity.
I'm using OpenGL 2.0, and I've got a pretty simple vertex shader and a much more complicated fragment shader. Is there any guarantee that says that if the vertex shader runs and determines coordinates that are completely off-camera for all vertices of a polygon, that a clipping test will be applied somewhere between there and the fragment shader and prevent the fragment shader from ever running for that polygon? And if so, is this automatic or is there something I need to do to enable it? I've looked around online for information on this but I haven't found anything conclusive...
Clipping happens after the vertex transform stage, around the transition to NDC space: clip planes are applied in clip space, and viewport clipping is done in NDC space. That is one step before rasterizing. Clipping means that a face which is only partially visible is "cut", either by inserting new vertices at the visibility border or by discarding fragments outside the viewport. What you mean is usually called culling. Faces completely outside the viewport are culled, at the same stage as clipping.
From a performance point of view, the best code is code never executed, and the best data is data never accessed. So in your case, sending off a single drawing call that makes the GPU process a large batch of vertices clearly takes load off the CPU, but it consumes GPU processing power. Culling those vertices before sending the drawing command consumes CPU power, but takes load off the GPU. The goal is to find the right balance. If the number of vertices is low, a simple brute-force approach (just render the whole thing) may easily outperform every other scheme.
However, using a simple yet effective data management scheme can greatly improve performance on both ends. For example, a spatial subdivision structure like a Kd tree is easily built (you don't even have to balance it). By sorting the vertices into the Kd tree you can omit (cull) large portions of the tree if a branch near the root is completely outside the viewport. Preparing to draw a frame, you iterate through the visible parts of the tree, build the list of vertices to draw, and then pass this list to the rendering command. Building the tree takes O(n log n) time on average, and a visibility query only touches a small part of it.
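Schematically, and only to illustrate the binning idea (this sketch uses a uniform grid of buckets as a stand-in for the Kd tree, which is often enough for a 2D sprite scene; all names are made up):

#include <vector>

struct Rect { float x0, y0, x1, y1; };

// Sprites are binned into grid cells once (or whenever they move); each frame,
// only the cells overlapping the viewport rectangle are walked, so whole groups
// of off-screen sprites are rejected with a single test.
struct SpriteGrid {
    float cellSize;
    int cols, rows;
    Rect bounds;                         // world-space area covered by the grid
    std::vector<std::vector<int>> cells; // sprite indices per cell

    SpriteGrid(const Rect& b, float cs)
        : cellSize(cs),
          cols(int((b.x1 - b.x0) / cs) + 1),
          rows(int((b.y1 - b.y0) / cs) + 1),
          bounds(b),
          cells(cols * rows) {}

    void insert(int sprite, const Rect& box) {
        for (int cy = cellY(box.y0); cy <= cellY(box.y1); ++cy)
            for (int cx = cellX(box.x0); cx <= cellX(box.x1); ++cx)
                cells[cy * cols + cx].push_back(sprite);
    }

    // Collects indices of sprites whose cells touch the viewport (may contain
    // duplicates for sprites spanning several cells; de-duplicate or flag them).
    void query(const Rect& view, std::vector<int>& out) const {
        for (int cy = cellY(view.y0); cy <= cellY(view.y1); ++cy)
            for (int cx = cellX(view.x0); cx <= cellX(view.x1); ++cx)
                for (int i : cells[cy * cols + cx])
                    out.push_back(i);
    }

    int cellX(float x) const { return clampi(int((x - bounds.x0) / cellSize), 0, cols - 1); }
    int cellY(float y) const { return clampi(int((y - bounds.y0) / cellSize), 0, rows - 1); }
    static int clampi(int v, int lo, int hi) { return v < lo ? lo : (v > hi ? hi : v); }
};

Each frame you run query() with the camera rectangle and submit only those sprites, which turns 100,000 candidates into roughly what is actually visible.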
It's important to understand the difference between clipping and culling. You appear to be talking about the latter.
Clipping means taking a triangle and literally cutting it into pieces to fit into the viewport. The OpenGL specification defines this process to happen post-vertex shader, for any triangle that is only partially in view.
Culling means throwing something away entirely. If a triangle is not entirely in view, it can therefore be culled. OpenGL does not say that culling has to happen. Remember: the OpenGL specification defines behavior, not performance.
That being said, hardware makers are not stupid. Obvious efforts like not rasterizing triangles that are outside of the viewport are easily implemented and improve performance. Pretty much any hardware that exists will do this.
Similarly, clipping is typically implemented (where possible) with rasterizer tricks rather than by creating new triangles. Fragments that would be outside of the viewport simply aren't generated by the rasterizer. This is also legal according to OpenGL, because the spec defines apparent behavior; it doesn't really care whether you actually cut the triangle into pieces as long as the result looks indistinguishable from if you did.
Your question is essentially one of, "How much work should I do to not render off-screen objects?" That really depends on what your scene is and how you're rendering it. You say you're rendering 100,000 sprites. Are you making 100,000 draw calls, or are these sprites part of larger structures that you render with larger granularity? Do you stream the vertex data to the GPU every frame, or is the vertex data static?
Clipping and culling happen before fragment processing. http://www.opengl.org/wiki/Rendering_Pipeline_Overview
However, you will still be passing 100000 * 4 vertices (assuming you're rendering the sprites with quads and not point sprites) to the card if you don't do culling yourself. Depending on the card's memory performance this can be an issue.