OpenGL quad rendering optimization

I'm drawing quads in OpenGL. My question is: is there any additional performance gain from this:
// Method #1
glBegin(GL_QUADS);
// Define vertices for 10 quads
glEnd();
... over doing this for each of the 10 quads:
// Method #2
glBegin(GL_QUADS);
// Define vertices for first quad
glEnd();
glBegin(GL_QUADS);
// Define vertices for second quad
glEnd();
//etc...
All of the quads use the same texture in this case.

Yes, the first is faster, because each call to glBegin or glEnd changes the OpenGL state.
Even better, however, than one call to glBegin and glEnd (if you have a significant number of vertices), is to pass all of your vertices with glVertexPointer (and friends), and then make one call to glDrawArrays or glDrawElements. This will send all your vertices to the GPU in one fell swoop, instead of incrementally by calling glVertex3f repeatedly.
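As a rough sketch of the vertex-array approach described above: the vertices for all quads are first gathered into one contiguous array on the CPU, and that array is then handed to OpenGL in a single draw call. The function name build_quad_row and the quad layout are hypothetical, chosen only for illustration. The OpenGL calls are shown in a comment because they need a current legacy OpenGL context to run.

```c
#include <assert.h>
#include <stddef.h>

#define FLOATS_PER_VERTEX 3
#define VERTS_PER_QUAD 4

/* Fill `out` with quad_count unit quads laid out left to right on the
 * z = 0 plane. `out` must hold quad_count * 4 * 3 floats.
 * Returns the number of vertices written. */
int build_quad_row(float *out, int quad_count)
{
    assert(out != NULL && quad_count > 0);
    size_t i = 0;
    for (int q = 0; q < quad_count; ++q) {
        float x0 = (float)q * 1.5f;  /* hypothetical spacing between quads */
        float x1 = x0 + 1.0f;
        const float corners[VERTS_PER_QUAD][FLOATS_PER_VERTEX] = {
            { x0, 0.0f, 0.0f }, { x1, 0.0f, 0.0f },
            { x1, 1.0f, 0.0f }, { x0, 1.0f, 0.0f },
        };
        for (int c = 0; c < VERTS_PER_QUAD; ++c)
            for (int k = 0; k < FLOATS_PER_VERTEX; ++k)
                out[i++] = corners[c][k];
    }
    return quad_count * VERTS_PER_QUAD;
}

/* With a current legacy OpenGL context, the array can then be drawn with one
 * call instead of one glVertex3f call per vertex:
 *
 *   glEnableClientState(GL_VERTEX_ARRAY);
 *   glVertexPointer(3, GL_FLOAT, 0, vertices);
 *   glDrawArrays(GL_QUADS, 0, vertex_count);
 *   glDisableClientState(GL_VERTEX_ARRAY);
 */
```

Texture coordinates would be supplied the same way through glTexCoordPointer, one of the "friends" mentioned above.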

From a function call overhead perspective, the second approach is more expensive. If we used ten thousand quads instead of ten, glBegin/glEnd would be called ten thousand times per frame instead of once.
More importantly, glBegin/glEnd have been deprecated as of OpenGL 3.0 and are not supported by OpenGL ES.
Instead, vertices are submitted as vertex arrays and drawn with calls such as glDrawArrays. Tutorials and much more in-depth information can be found on the NeHe site.

I decided to go ahead and benchmark it using a loop of 10,000 quads.
The results:
Method 1: 0.0128 seconds
Method 2: 0.0132 seconds
Method #1 does have some improvement, but the improvement is very marginal (3%). It's probably nothing more than the overhead of simply calling more functions. So it's likely that OpenGL itself doesn't get any additional optimization from Method #1.
This is on Windows XP Service Pack 3 using OpenGL 2.0 and Visual Studio 2005.

I believe the answer is yes, but you should try it out yourself. Write something that draws 100k quads and see if one is much faster. Then report your results here :)
schnaader: What the document you read means is that you should not have non-GL-related code between glBegin and glEnd. It does not mean that you should prefer many short glBegin/glEnd blocks over a single long one.

I suppose that you get the highest performance gain by reusing vertices that are shared between quads. To achieve that, you would need to maintain some structure for the primitives yourself.
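One possible structure for this is an index list: shared corners are stored once and each quad refers to them by index. The sketch below assumes, purely for illustration, that the quads form a regular cols x rows grid with vertices stored row by row; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Vertices needed when every quad carries its own four corners. */
int unshared_vertex_count(int cols, int rows) { return cols * rows * 4; }

/* Vertices needed when neighbouring quads share their corners. */
int shared_vertex_count(int cols, int rows) { return (cols + 1) * (rows + 1); }

/* Write four indices per quad (counter-clockwise with y pointing up).
 * Vertex (x, y) of the grid is assumed to live at index y * (cols + 1) + x.
 * `indices` must hold cols * rows * 4 entries. Returns the index count. */
int build_grid_quad_indices(unsigned short *indices, int cols, int rows)
{
    assert(indices != NULL && cols > 0 && rows > 0);
    assert(shared_vertex_count(cols, rows) <= 65536);
    int stride = cols + 1;
    int n = 0;
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            int bottom_left = y * stride + x;
            indices[n++] = (unsigned short)bottom_left;
            indices[n++] = (unsigned short)(bottom_left + 1);
            indices[n++] = (unsigned short)(bottom_left + stride + 1);
            indices[n++] = (unsigned short)(bottom_left + stride);
        }
    }
    return n;
}

/* With a legacy OpenGL context the indexed quads could be drawn with:
 *   glDrawElements(GL_QUADS, index_count, GL_UNSIGNED_SHORT, indices);
 */
```

For a 100 x 100 grid this stores 10,201 vertices instead of 40,000. Sharing only works when the shared corners also agree on every other attribute, such as texture coordinates.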

You would certainly get better performance on the CPU side, simply because less code gets called.
Whether or not your drawing performance would be better on the GPU, that would completely depend on the implementation of the driver for your 3d graphics card. You could get potentially wildly different results with a different manufacturer's driver and even with a different version of the driver for the same card.

Related

How to organize opengl es 2.0 program?

I have thought of two ways to write my OpenGL ES 2.0 code.
First, I make many draw calls to put the elements on screen, using either many VAOs and VBOs or a single VAO and many VBOs.
Second, I save the coordinates of all elements in one list, write all of their vertices into a single VAO and a single VBO, and draw all the vertices at once.
Which approach should I follow?
These are the ones I thought of; what other ways are there?
The VAO is meant to save you some setup calls when setting the vertex attribute pointers and enabling/disabling the pipeline states related to that setup. Having just one VAO isn't saving you anything, because you will repeatedly re-bind the vertex buffers and change some settings. So you should aim to have multiple VAOs, one per "static" rendering batch, but not necessarily one per object drawn.
As to having all vertices in a single VBO or many VBOs: that really depends on the task.
Having all data in a single VBO has no benefit if you draw it all in many calls. But there's also no point in allocating one VBO per sprite. It's always about the balance between the costs of the different calls needed to set up the pipeline, so ideally you try different approaches and decide what's best in your particular case.
There might be restrictions on the buffer sizes, and there are definitely "reasonable" sizes preferred by specific implementations. I remember some issues with old Intel drivers, where rendering a portion of the buffer would process the entire buffer, skipping the unneeded vertices.

Working around WebGL readPixels being slow

I'm trying to use WebGL to speed up computations in a simulation of a small quantum circuit, like what the Quantum Computing Playground does. The problem I'm running into is that readPixels takes ~10ms, but I want to call it several times per frame while animating in order to get information out of gpu-land and into javascript-land.
As an example, here's my exact use case. The following circuit animation was created by computing things about the state between each column of gates, in order to show the inline-with-the-wire probability-of-being-on graphing:
The way I'm computing those things now, I'd need to call readPixels eight times for the above circuit (once after each column of gates). This is waaaaay too slow at the moment, easily taking 50ms when I profile it (bleh).
What are some tricks for speeding up readPixels in this kind of use case?
Are there configuration options that significantly affect the speed of readPixels? (e.g. the pixel format, the size, not having a depth buffer)
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Should I try to aggregate all the textures I'm reading into a single megatexture and sort things out after a single big read?
Should I be using a different method to get the information back out of the textures?
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
Should I try to make the readPixel calls all happen at once, after all the render calls have been made (maybe allows some pipelining)?
Yes, yes, yes. readPixels is fundamentally a blocking, pipeline-stalling operation, and it is always going to kill your performance wherever it happens, because it's sending a request for data to the GPU and then waiting for it to respond, which normal draw calls don't have to do.
Do readPixels as few times as you can (use a single combined buffer to read from). Do it as late as you can. Everything else hardly matters.
Should I be avoiding getting the information out at all, and doing all the layout and rendering gpu-side (urgh...)?
This will get you immensely better performance.
If your graphics are all like you show above, you shouldn't need to do any “layout” at all (which is good, because it'd be very awkward to implement) — everything but the text is some kind of color or boundary animation which could easily be done in a shader, and all the layout can be just a static vertex buffer (each vertex has attributes which point at which simulation-state-texel it should be depending on).
The text will be more tedious merely because you need to load all the digits into a texture to use as a spritesheet and do the lookups into that, but that's a standard technique. (Oh, and divide/modulo to get the digits.)
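The divide/modulo step can be sketched on the CPU as follows; the same arithmetic would run in the shader. The spritesheet layout (ten digit glyphs side by side in one row) and the function names are assumptions made only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Split `value` into decimal digits, least significant digit first.
 * Writes at most max_digits entries and returns how many were written. */
int split_digits(unsigned int value, int *digits, int max_digits)
{
    assert(digits != NULL && max_digits > 0);
    int count = 0;
    do {
        digits[count++] = (int)(value % 10u);  /* modulo picks the lowest digit */
        value /= 10u;                          /* divide drops it */
    } while (value != 0u && count < max_digits);
    return count;
}

/* Left texture coordinate of a digit's glyph, assuming a spritesheet with
 * the glyphs 0..9 packed left to right, each one tenth of the width. */
float digit_u0(int digit)
{
    assert(digit >= 0 && digit <= 9);
    return (float)digit / 10.0f;
}
```

For example, 407 splits into 7, 0, 4, and each digit then selects a glyph column in the texture.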
I don't know enough about your use case, but just guessing: why do you need to call readPixels at all?
First, you don't need to draw text or the static parts of your diagram in WebGL. Put another canvas or svg or img over the WebGL canvas and set the CSS so they overlap. Let the browser composite them. Then you don't have to do it.
Second, let's assume you have a texture that has your computed results in it. Can't you just then make some geometry that matches the places in your diagram that need to have colors and use texture coords to look up the results from the correct places in the results texture? Then you don't need to call readPixels at all. That shader can use a ramp texture lookup or any other technique to convert the results to other colors to shade the animated parts of your diagram.
If you want to draw numbers based on the result you can use a technique like this, so you'd make a shader that references the results texture to look at a result value and then indexes glyphs from another texture based on that.
Am I making any sense?

Best practice for simple DirectX overlay rendering

I'm creating a DirectX 11 game that renders complex meshes in 3D space. I'm using vertex/index buffers/shaders and this all works fine. However I now want to perform some basic 'overlay' rendering - more specifically, I want to render wireframe boxes in 3D space to show the bounds of a particular area. There would only ever be one or two boxes in view at any one time, and their vertices would change position each frame.
I've therefore been searching for simpler DX11 rendering methods but most articles I find still prepare a vertex/index buffer for very simple rendering. I know that hardware is well optimised for processing vertex streams, but is the overhead of building and filling a vertex buffer every frame just to process 8 vertices really the most efficient method?
My question is therefore, what is the most efficient method for performing this very simple rendering in DX11? Is there any more primitive method ("DrawLine", "DrawLineList(D3DXVECTOR3[])", ...) that would be a better solution? It could be less efficient per-vertex than the standard method of passing vertex buffers because it's only ever going to be used for a handful of vertices per frame.
Thanks in advance
Rob
You should create a single vertex/index buffer for each primitive shape (box, sphere, ...) and use a transformation matrix to place it correctly in the world.
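A minimal sketch of that idea for the wireframe box: the eight corners and twelve edges of a unit cube are defined once, and only a world matrix changes per frame. The names below are hypothetical, and the matrix uses the row-major, row-vector convention (translation in the last row), which is an assumption about how the rest of the renderer is set up.

```c
#include <assert.h>

typedef struct { float x, y, z; } Vec3;

/* Corners of the unit cube [0,1]^3. Bit 0 of the index selects x,
 * bit 1 selects y, bit 2 selects z. Uploaded once as a vertex buffer. */
const Vec3 UNIT_BOX_CORNERS[8] = {
    {0, 0, 0}, {1, 0, 0}, {0, 1, 0}, {1, 1, 0},
    {0, 0, 1}, {1, 0, 1}, {0, 1, 1}, {1, 1, 1}
};

/* The 12 edges as a line list (24 indices). Uploaded once as an index buffer. */
const unsigned short UNIT_BOX_EDGES[24] = {
    0, 1,  2, 3,  4, 5,  6, 7,   /* edges along x */
    0, 2,  1, 3,  4, 6,  5, 7,   /* edges along y */
    0, 4,  1, 5,  2, 6,  3, 7    /* edges along z */
};

/* Build the world matrix that maps the unit cube onto the axis-aligned
 * box [min, max]: a scale by the box size followed by a translation. */
void box_world_matrix(Vec3 min, Vec3 max, float m[16])
{
    assert(max.x >= min.x && max.y >= min.y && max.z >= min.z);
    for (int i = 0; i < 16; ++i) m[i] = 0.0f;
    m[0]  = max.x - min.x;
    m[5]  = max.y - min.y;
    m[10] = max.z - min.z;
    m[12] = min.x;
    m[13] = min.y;
    m[14] = min.z;
    m[15] = 1.0f;
}

/* Transform a point as the row vector (p, 1) times m. */
Vec3 transform_point(Vec3 p, const float m[16])
{
    Vec3 r;
    r.x = p.x * m[0] + p.y * m[4] + p.z * m[8]  + m[12];
    r.y = p.x * m[1] + p.y * m[5] + p.z * m[9]  + m[13];
    r.z = p.x * m[2] + p.y * m[6] + p.z * m[10] + m[14];
    return r;
}
```

In Direct3D 11 the two arrays would live in static buffers drawn with the D3D11_PRIMITIVE_TOPOLOGY_LINELIST topology, so each frame only the 16 floats of the matrix go into a constant buffer. This covers boxes that stay box-shaped; if the eight corners move independently, a small dynamic vertex buffer is still needed.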

bool condition opengl es 2.0 shader

Since it is recommended not to use conditionals in shaders, which is better for a boolean uniform:
A. Create different shaders for different values of the boolean uniform?
B. Create one shader and just use if-else in the code, like this:
uniform bool uValue;
void main() {
    if (uValue) {
        // code
    } else {
        // code
    }
}
I have read somewhere that for a uniform bool value the driver will compile multiple shader variants itself, so that we don't have to bother creating multiple shaders. But I can't verify this.
Thanks!
Which approach is more performant depends on a lot of other things:
How many conditions are you switching on?
How many times per frame are you switching?
How much computation happens on either side of your conditional?
What are the other performance constraints in your situation? Memory usage? Power consumption? Client-to-GPU bandwidth?
Try both options and test with Instruments to see which performs better in your case.
We all know that drivers change very quickly in GPU programming.
If your condition is fairly evenly balanced then there probably isn't a definitive right or wrong answer. It will depend on the hardware, the version of the drivers, and possible future mechanisms that the card itself uses to create parallel batches.
If your condition is more one sided, then there might be a real benefit using an if condition in one shader, or having two shaders and switching. Testing the load on the graphics card while it is processing real data is the only way to really answer this.
If this is identified as your bottleneck and is worth the time investment, then perhaps include both and choose at runtime. But remember there is no point in optimizing code if it won't make your shader faster. If your code delivers all of the requested visual features and you are still processor bound, then you have done your job.
Equally, optimizing if statements when you are fetch bound doesn't make any sense. So hold off on optimizing until you have implemented as many of the visual features as you can, then optimize, which might get you one more feature, then optimize again.

Optimization using VBO in OpenGL ES 2.0

My system is composed of several objects that represent quads. Each quad is represented by the same vertices, and therefore each object only stores matrices that represent the object's transformation through the world and its own object space. During each render pass, after these matrices are updated with their frame transforms, they are multiplied with the current view and projection matrices to form the MVP matrix for that object. The object's vertices are then sent with the MVP matrix to the shader, where the vertices are multiplied by the MVP matrix. The inefficiency here is that each quad is drawn separately, meaning there is a separate glDrawElements call for each quad. At any given moment there may be 50 or 60 quads in existence; some move out of scope and are destroyed, or their animation completes and they are destroyed, but more will randomly come into existence. Would there be a significant performance gain from storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Let's first reason about it with some simple mathematics:
At the moment you don't need to push any vertex data to the GPU each frame, only 12-16 floats of matrix data per quad, and you perform one matrix-matrix multiplication per quad on the CPU.
When putting everything in one VBO, you have to transfer 4 vertices (~12 floats) per quad but no matrix data (except for the global VP, of course), and you have to do 4 matrix-vector multiplies (~1 matrix-matrix multiply) per quad on the CPU.
So the amount of work and data transferred doesn't really change much. What changes is that the transferred data shifts from many small uniform updates to a single large VBO update, which is very likely to be faster: first because a buffer update is likely to be faster on the hardware side than multiple uniform updates (but don't hold me to that), and second because of the much reduced driver overhead. On top of that comes the further reduction in overhead from using a single large draw call instead of many smaller ones.
So yes, it will certainly be worth a try, though it has to be evaluated if it is really a "significant" improvement in your particular application.
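The CPU side of that batching could look roughly like the sketch below: each quad's corners are transformed by its model matrix and written into one shared array, together with triangle indices (OpenGL ES 2.0 has no GL_QUADS). The names Mat4 and batch_quads are hypothetical, matrices are assumed to be column-major as is conventional in OpenGL, and model matrices are assumed to be affine so that w stays 1.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major 4x4 matrix: element (row r, column c) is m[c * 4 + r]. */
typedef struct { float m[16]; } Mat4;

/* The one quad shape shared by every object, in object space (z = 0). */
static const float UNIT_QUAD[4][2] = {
    { -0.5f, -0.5f }, { 0.5f, -0.5f }, { 0.5f, 0.5f }, { -0.5f, 0.5f }
};

/* Transform every quad on the CPU and append it to one vertex array.
 * out_positions must hold quad_count * 4 * 3 floats,
 * out_indices must hold quad_count * 6 entries (two triangles per quad). */
void batch_quads(const Mat4 *models, int quad_count,
                 float *out_positions, unsigned short *out_indices)
{
    assert(models != NULL && out_positions != NULL && out_indices != NULL);
    assert(quad_count > 0 && quad_count * 4 <= 65536);
    for (int q = 0; q < quad_count; ++q) {
        const float *m = models[q].m;
        for (int c = 0; c < 4; ++c) {
            float x = UNIT_QUAD[c][0];
            float y = UNIT_QUAD[c][1];
            float *p = out_positions + (size_t)(q * 4 + c) * 3;
            /* model * (x, y, 0, 1) */
            p[0] = m[0] * x + m[4] * y + m[12];
            p[1] = m[1] * x + m[5] * y + m[13];
            p[2] = m[2] * x + m[6] * y + m[14];
        }
        unsigned short base = (unsigned short)(q * 4);
        unsigned short *i = out_indices + (size_t)q * 6;
        i[0] = base; i[1] = (unsigned short)(base + 1); i[2] = (unsigned short)(base + 2);
        i[3] = base; i[4] = (unsigned short)(base + 2); i[5] = (unsigned short)(base + 3);
    }
}

/* Per frame, with the VBO and index buffer bound:
 *   glBufferSubData(GL_ARRAY_BUFFER, 0, position_bytes, positions);
 *   glDrawElements(GL_TRIANGLES, quad_count * 6, GL_UNSIGNED_SHORT, 0);
 * The shader then only applies the global view-projection matrix. */
```

The index pattern never changes, so the index buffer can be filled once for the maximum quad count and left alone.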
Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Yes, it would be much faster. The first reason, as you correctly identified, is the single glDrawElements call. The second is that a VBO keeps the data in GPU memory.
If quads move out of scope, you can reuse their memory for new quads. A VBO can be drawn in subregions, so you get a lot of flexibility without further memory allocations.
By using VBOs you minimise interaction with the GPU, and that is where the performance benefit comes from.
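One way to reuse the memory of destroyed quads is a small free list of fixed-size slots inside a single preallocated buffer. This is only a sketch: the names are hypothetical, and the capacity of 64 is an assumption based on the 50 or 60 live quads mentioned in the question.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_QUADS 64  /* hypothetical capacity of the preallocated VBO */

typedef struct {
    int free_slots[MAX_QUADS];  /* stack of slot numbers not in use */
    int free_count;
} QuadSlotPool;

/* Mark every slot as free; slot 0 is handed out first. */
void pool_init(QuadSlotPool *pool)
{
    assert(pool != NULL);
    for (int i = 0; i < MAX_QUADS; ++i)
        pool->free_slots[i] = MAX_QUADS - 1 - i;
    pool->free_count = MAX_QUADS;
}

/* Take a slot for a new quad, or return -1 when the buffer is full. */
int pool_acquire(QuadSlotPool *pool)
{
    assert(pool != NULL);
    if (pool->free_count == 0)
        return -1;
    return pool->free_slots[--pool->free_count];
}

/* Give a destroyed quad's slot back so a later quad can reuse it. */
void pool_release(QuadSlotPool *pool, int slot)
{
    assert(pool != NULL && slot >= 0 && slot < MAX_QUADS);
    assert(pool->free_count < MAX_QUADS);
    pool->free_slots[pool->free_count++] = slot;
}

/* Byte offset of a slot's vertex data inside the buffer, usable as the
 * offset argument of glBufferSubData when only that quad changes. */
long slot_byte_offset(int slot, int bytes_per_quad)
{
    assert(slot >= 0 && slot < MAX_QUADS && bytes_per_quad > 0);
    return (long)slot * bytes_per_quad;
}
```

Released slots leave holes in the buffer. To keep a single draw call, the free slots can be filled with zero-area quads, or the last live quad can be moved into the hole so that the live range stays contiguous.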