Optimization using VBO in OpenGL ES 2.0

My system is composed of several objects that represent quads. Every quad uses the same vertices, so each object stores only the matrices that describe its transformation through the world and its own object space. During each render pass, after these matrices are updated with their frame transforms, they are multiplied with the current view and projection matrices to form the MVP matrix for that object. The object's vertices are then sent with the MVP matrix to the shader, where the vertices are multiplied by the MVP matrix. The inefficiency here is that each quad is drawn separately, meaning there is a separate glDrawElements call for each quad. At any given moment there may be 50 or 60 quads in existence; some move out of scope or finish their animation and are destroyed, while new ones randomly come into existence. Would there be a significant performance gain from storing all the necessary values in a VBO and calling glDrawElements just once during each pass?

Let's first reason about it with some simple mathematics:
At the moment you don't need to push any vertex data to the GPU each frame, only 12-16 floats of matrix data per quad, and you perform one matrix-matrix multiplication per quad on the CPU.
With everything in one VBO, you have to transfer 4 vertices (~12 floats) per quad but no per-quad matrix data (except for the global VP matrix, of course), and you have to do 4 matrix-vector multiplications (roughly one matrix-matrix multiplication) per quad on the CPU.
So the amount of work and data transferred doesn't really change much. What does change is that the transfer shifts from many small uniform updates to a single large VBO update, which is very likely to be faster: first because a buffer update is likely to be faster on the hardware side than multiple uniform updates (but don't quote me on that), and second because of the much reduced driver overhead. On top of that comes the further reduction in overhead from issuing a single large draw call instead of many small ones.
So yes, it is certainly worth a try, though you will have to measure whether it is really a "significant" improvement in your particular application.
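To make the CPU-side part of that trade concrete, here is a minimal sketch of baking each quad's transform into its four vertices and packing everything into one vertex list and one index list. It is only an illustration under assumptions: it uses 2D affine matrices instead of full 4x4 MVP matrices, and the names `UNIT_QUAD`, `transform_point` and `build_batch` are made up for this example. The resulting lists are what you would upload to the VBO and index buffer before the single glDrawElements call.

```python
# Sketch: bake each quad's model transform on the CPU and pack all quads
# into one vertex list plus one index list, ready for a single draw call.
# 2D affine transforms only; all names here are illustrative assumptions.

UNIT_QUAD = [(-0.5, -0.5), (0.5, -0.5), (0.5, 0.5), (-0.5, 0.5)]
QUAD_INDICES = [0, 1, 2, 2, 3, 0]  # two triangles per quad


def transform_point(m, p):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to point p."""
    (a, b, tx), (c, d, ty) = m
    x, y = p
    return (a * x + b * y + tx, c * x + d * y + ty)


def build_batch(model_matrices):
    """Return (vertices, indices) covering every quad in one buffer."""
    vertices = []
    indices = []
    for quad_number, m in enumerate(model_matrices):
        base = quad_number * 4  # first vertex of this quad in the batch
        vertices.extend(transform_point(m, p) for p in UNIT_QUAD)
        indices.extend(base + i for i in QUAD_INDICES)
    return vertices, indices


# Two quads: one translated by (10, 0), one scaled by 2.
matrices = [
    ((1.0, 0.0, 10.0), (0.0, 1.0, 0.0)),
    ((2.0, 0.0, 0.0), (0.0, 2.0, 0.0)),
]
batch_vertices, batch_indices = build_batch(matrices)
```

With 50 or 60 quads the batch holds at most 240 vertices, so the indices comfortably fit the unsigned short index type that OpenGL ES 2.0 supports.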

Would there be a significant performance gain to storing all the
necessary values in a VBO and just calling glDrawElements once during
each pass?
Yes, it would be much faster. The first reason, as you correctly identified, is the single glDrawElements call. The second is that a VBO keeps the data on the GPU itself.
If quads move out of scope, you can reuse their memory for new quads. You can draw subregions of a VBO, so you get a lot of flexibility without memory allocations.
By using VBOs you minimise interaction with the GPU, and that is where the performance benefit comes from.
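One way to picture the "reuse their memory" idea is a small free list over fixed-size quad slots in a buffer that is allocated once. This is a hedged sketch, not a prescribed design: the class `QuadSlotPool`, the helper `slot_byte_offset` and the 4 x xyz vertex layout are assumptions made for the example.

```python
class QuadSlotPool:
    """Tracks which fixed-size quad slots of one preallocated buffer are in use."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Stored in reverse so that pop() hands out slot 0 first.
        self.free_slots = list(range(capacity - 1, -1, -1))

    def acquire(self):
        """Return a free slot index, or None when the buffer is full."""
        return self.free_slots.pop() if self.free_slots else None

    def release(self, slot):
        """Hand a slot back so a later quad can overwrite it."""
        self.free_slots.append(slot)


FLOATS_PER_QUAD = 4 * 3  # 4 vertices, xyz each (assumed layout)
BYTES_PER_FLOAT = 4


def slot_byte_offset(slot):
    """Byte offset of a slot, e.g. for a glBufferSubData-style partial update."""
    return slot * FLOATS_PER_QUAD * BYTES_PER_FLOAT


pool = QuadSlotPool(capacity=64)
a = pool.acquire()
b = pool.acquire()
pool.release(a)     # quad a left the scene
c = pool.acquire()  # the new quad reuses a's slot
```

A released slot that has not been reused yet still sits inside the buffer, so it has to be skipped when drawing, for example by writing zero-area vertices into it or by rebuilding the index list.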


Is it better to store uniform descriptors in a single buffer or use separate buffers for each frame?

My application has a max of 2 frames in flight. The vertex shader takes a uniform buffer containing matrices. Because 2 frames could potentially be rendering at the same time I believe I need to have separate memory for each frame's uniform buffer.
Is it preferable to create a single buffer that holds the uniforms of both frames and use an offset within the buffer to do updates? Or is it better for each frame to have its own buffer?
Technically you can do either; it ultimately boils down to what makes the most sense implementation-wise. Provided you can ensure that you're not overwriting buffer data that's still active (and being consumed on the GPU), one memory allocation would be sufficient. It's a big advantage of the Vulkan API (since you have full control) but it does make life more complicated.
In my use-case, I use pages of allocations (kinda like a heap model) where I allocate on demand and return blocks when they're done (basically, if a reference is removed, I age out for however many frames of buffering I have and then free the block). This is targeted at uniform buffers that change infrequently.
For per-draw uniform data, I use push constants - this might be worth looking at for your use-case. Since the per-instance matrix is "use once then discard" and is relatively small, this actually makes life simpler, and could even have a performance benefit, to boot.
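For the single-buffer option, the bookkeeping mostly comes down to giving each in-flight frame its own aligned slice. The sketch below shows just that arithmetic; the 256-byte alignment and the three-matrix payload are assumed numbers, and in a real application the alignment would come from the device limit minUniformBufferOffsetAlignment. The function names are hypothetical.

```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment (alignment > 0)."""
    return (value + alignment - 1) // alignment * alignment


def frame_region(frame_index, frames_in_flight, uniform_size, min_alignment):
    """Return (offset, stride) of this frame's slice in the shared buffer."""
    stride = align_up(uniform_size, min_alignment)
    slot = frame_index % frames_in_flight
    return slot * stride, stride


# Assumed numbers: three 4x4 float matrices (3 * 64 bytes), 256-byte alignment.
UNIFORM_SIZE = 3 * 64
MIN_ALIGNMENT = 256
FRAMES_IN_FLIGHT = 2

offsets = [
    frame_region(i, FRAMES_IN_FLIGHT, UNIFORM_SIZE, MIN_ALIGNMENT)[0]
    for i in range(4)
]
total_bytes = FRAMES_IN_FLIGHT * align_up(UNIFORM_SIZE, MIN_ALIGNMENT)
```

Frames alternate between the two slices, so a frame only ever writes to the slice that the GPU finished with two frames earlier, provided the usual per-frame fence is waited on first.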

How to organize an OpenGL ES 2.0 program?

I have thought of two ways to write my OpenGL ES 2.0 code.
First, I issue many draw calls to draw the elements on screen, using either many VAOs and VBOs or a single VAO and many VBOs.
Second, I keep the coordinates of all elements in one list, write all of their vertices into a single VAO and a single VBO, and draw all the vertices at once.
Which approach should I follow?
These are the ones I thought of; what other ways are there?
The VAO is meant to save you some setup calls when setting the vertex attribute pointers and enabling/disabling the pipeline states related to that setup. Having just one VAO isn't saving you anything, because you will repeatedly re-bind the vertex buffers and change some settings. So you should aim to have multiple VAOs, one per "static" rendering batch, but not necessarily one per object drawn.
As to having all vertices in single VBO or many VBOs - that really depends on the task.
Having all the data in a single VBO has no benefit if you still draw it in many calls. But there's also no point in allocating one VBO per sprite. It's always about the balance between the costs of the different calls needed to set up the pipeline, so ideally you try different approaches and decide what's best in your particular case.
There might be restrictions on buffer sizes, and there are definitely "reasonable" sizes preferred by specific implementations. I remember some issues with old Intel drivers, where rendering a portion of the buffer would process the entire buffer, skipping the unneeded vertices.

Pass information from/to compute pipelineStages

I am trying to use a compute shader for image processing. Being new to Vulkan I have some (possibly naive) questions:
I want to look at the neighborhood of a pixel. AFAIK I have 2 possibilities:
a, Pass one image to the compute shader and sample the neighborhood pixels directly (x +/- i, y +/- j)
b, Pass multiple images to the compute shader (each being offset) and sample only the current position (x, y)
Is there any difference in sampling performance between a and b (aside from b needing far more memory to be passed to the GPU)?
I need to pass pixel information (+ meta info) from one pipeline stage to another (and read it back once the command is done).
a, can I do this in any other way than passing an image with the storage bit set?
b, when reading the information back on the host, do I need to use a framebuffer?
Using a single image and sampling at offsets (maybe using textureGather?) is going to be more efficient, probably by a lot. Each texturing operation has a cost, and this uses fewer. More importantly, the texture cache in GPUs generally loads a small region around your sample point, so sampling the adjacent pixels is likely going to hit in the cache.
Even better would be to load all the pixels once into shared memory, and then work from there. Then instead of fetching pixel (i,j) from thread (i,j) and all of that thread's eight neighbors, you only fetch it once. You still need extra fetches on the edge of the region handled by a single workgroup. (For what it's worth, this technique is not Vulkan specific: you'll see it used in CUDA, OpenCL, D3D Compute, and GL Compute too).
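A quick count shows why the shared-memory approach pays off. The sketch below compares texel fetches per workgroup for the naive scheme (every thread reads its own pixel plus all neighbors) against loading the workgroup's tile plus a border once. The 16x16 workgroup size is an assumed example value, and the function names are made up.

```python
def naive_fetches(group_w, group_h, radius):
    """Every thread fetches its full (2r+1) x (2r+1) neighborhood itself."""
    side = 2 * radius + 1
    return group_w * group_h * side * side


def tiled_fetches(group_w, group_h, radius):
    """The workgroup loads its tile plus a border of width r exactly once."""
    return (group_w + 2 * radius) * (group_h + 2 * radius)


# Assumed workgroup of 16x16 threads and a 3x3 neighborhood (radius 1).
naive = naive_fetches(16, 16, 1)
tiled = tiled_fetches(16, 16, 1)
savings_ratio = naive / tiled
```

For these numbers that is 2304 fetches versus 324, roughly a factor of seven, and the gap grows with the neighborhood radius. The extra border fetches are the "edge of the region" cost mentioned above.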
The only way to persist data out of a compute shader is to write it to a storage buffer or storage image. To read that on the CPU, use vkCmdCopyImageToBuffer or vkCmdCopyBuffer to a host-readable resource, and then map that.

Non power of two textures and memory consumption optimization

I read somewhere that the XNA framework upscales a texture to the nearest power-of-two size and then sends that to VRAM. If that is how it really works, it might not be efficient when loading many small textures (in my case 150×150), which essentially waste memory on the unused texture data that results from upscaling.
So is there some automatic optimization, or should I make my own implementation of it, like loading all textures, figuring out where the "upscaled" space is big enough to hold some other texture and place it there, remembering sprite positions, thus using one texture instead of two (or more)?
Doing this manually for each texture (placing many small sprites in a single texture) isn't always handy, because the result is harder to work with later (essentially it becomes less human-oriented). Also, a given sprite won't be needed in every level of a game, so sometimes a different composition of sprites would be better. That is why it should be done automatically.
There are tools available to create what are known as "sprite sheets" or "texture atlases". This XNA sample does this for you as part of a content pipeline extension.
Note that textures are only padded on devices that do not support non-power-of-two textures, Windows Phone for example. Modern GPUs won't waste the RAM. However, atlasing is still a useful optimisation because it allows you to merge batches of sprites (see this answer for details).
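To put a number on the waste for the 150×150 case from the question, here is a small back-of-the-envelope sketch. It assumes padding to the next power of two on each axis and a simple grid layout inside a 1024×1024 atlas; both the atlas size and the function names are assumptions for illustration.

```python
def next_power_of_two(n):
    """Smallest power of two >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def padding_waste(width, height):
    """Return (padded_w, padded_h, wasted_texels) for power-of-two padding."""
    padded_w = next_power_of_two(width)
    padded_h = next_power_of_two(height)
    return padded_w, padded_h, padded_w * padded_h - width * height


padded_w, padded_h, wasted = padding_waste(150, 150)
wasted_fraction = wasted / (padded_w * padded_h)

# Assumed 1024x1024 atlas with sprites laid out on a plain grid.
ATLAS_SIZE = 1024
sprites_per_atlas = (ATLAS_SIZE // 150) ** 2
texels_if_padded_individually = sprites_per_atlas * padded_w * padded_h
texels_in_atlas = ATLAS_SIZE * ATLAS_SIZE
```

Under these assumptions each sprite is padded to 256×256 and about two thirds of that texture is unused, while a single atlas holds 36 such sprites in less than half the texels that 36 individually padded textures would take.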

At what phase in rendering does clipping occur?

I've got some OpenGL drawing code that I'm trying to optimize. It's currently testing all drawing objects for visibility client-side before deciding whether or not to send rendering data to OpenGL. (This is easier than it sounds. It's drawing a 2D scene so clipping is trivial: just test against the current coordinates of the viewport rectangle.)
It occurs to me that the entire model could be greatly simplified by passing the entire scene to OpenGL and letting the GPU take care of the clipping. But sometimes the total can be very, very complex, involving up to 100,000 total sprites, most of which never get rendered because they're off-camera, and I'd prefer to not end up killing the framerate in the name of simplicity.
I'm using OpenGL 2.0, and I've got a pretty simple vertex shader and a much more complicated fragment shader. Is there any guarantee that says that if the vertex shader runs and determines coordinates that are completely off-camera for all vertices of a polygon, that a clipping test will be applied somewhere between there and the fragment shader and prevent the fragment shader from ever running for that polygon? And if so, is this automatic or is there something I need to do to enable it? I've looked around online for information on this but I haven't found anything conclusive...
Clipping happens after the vertex transform stage, around the transition to NDC space: clip planes are applied in clip space, viewport clipping is done in NDC space. That is one step before rasterizing. Clipping means that a face that is only partially visible is "cut" by inserting new vertices at the visibility border, or that fragments outside the viewport are discarded. What you mean is usually called culling. Faces completely outside the viewport are culled, at the same stage as clipping.
From a performance point of view, the best code is code never executed, and the best data is data never accessed. So in your case sending off a single drawing call that makes the GPU process a large batch of vertices clearly takes load off the CPU, but it consumes GPU processing power. Culling those vertices before sending the drawing command consumes CPU power, but takes load off the GPU. The goal is to find the right balance. If the number of vertices is low, a simple brute force approach (just render the whole thing) may easily outperform every other scheme.
However, a simple yet effective data management scheme can greatly improve performance on both ends. For example, a spatial subdivision structure like a Kd tree is easily built (you don't have to balance it). Once the vertices are sorted into the Kd tree, you can omit (cull) large portions of it whenever a branch near the root is completely outside the viewport. When preparing a frame, you iterate through the visible parts of the tree, build the list of vertices to draw, and pass this list to the rendering command. A Kd tree over n items can be built in O(n log n) time, and a query only has to visit the branches that overlap the viewport.
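As a rough illustration of the Kd tree idea for a 2D sprite scene, the sketch below stores sprite positions as points and collects those inside a viewport rectangle, skipping any branch the rectangle cannot reach. It is a simplification under assumptions: sprites are treated as points (in practice you would grow the query rectangle by the sprite size), the tree is split at the median for convenience, and the names `build_kd` and `query_rect` are made up.

```python
def build_kd(points, depth=0):
    """Build a 2D Kd tree over (x, y) points by splitting at the median."""
    if not points:
        return None
    axis = depth % 2  # alternate between splitting on x and on y
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kd(ordered[:mid], depth + 1),
        "right": build_kd(ordered[mid + 1:], depth + 1),
    }


def query_rect(node, lo, hi, out):
    """Append every point p with lo <= p <= hi on both axes to out."""
    if node is None:
        return
    p, axis = node["point"], node["axis"]
    if lo[0] <= p[0] <= hi[0] and lo[1] <= p[1] <= hi[1]:
        out.append(p)
    # Only descend into a side if the rectangle reaches across the split.
    if lo[axis] <= p[axis]:
        query_rect(node["left"], lo, hi, out)
    if hi[axis] >= p[axis]:
        query_rect(node["right"], lo, hi, out)


# 100 sprite positions on a 10x10 grid, spaced 10 units apart.
sprites = [(x * 10, y * 10) for x in range(10) for y in range(10)]
tree = build_kd(sprites)

visible = []
query_rect(tree, lo=(25, 25), hi=(55, 45), out=visible)
```

Only the sprites collected in `visible` would then be written into the vertex list for the frame.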
It's important to understand the difference between clipping and culling. You appear to be talking about the latter.
Clipping means taking a triangle and literally cutting it into pieces to fit into the viewport. The OpenGL specification defines this process to happen post-vertex shader, for any triangle that is only partially in view.
Culling means throwing something away entirely. If a triangle is entirely out of view, it can therefore be culled. OpenGL does not say that culling has to happen. Remember: the OpenGL specification defines behavior, not performance.
That being said, hardware makers are not stupid. Obvious efforts like not rasterizing triangles that are outside of the viewport are easily implemented and improve performance. Pretty much any hardware that exists will do this.
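The kind of cheap rejection test meant here can be sketched on clip-space coordinates: if all vertices of a triangle lie beyond the same plane of the clip volume, nothing of it can be visible. This is only an illustration of the idea, not a description of what any particular GPU does, and the function name is made up.

```python
def trivially_outside(clip_vertices):
    """True if every vertex lies beyond the same clip-volume plane.

    Each vertex is (x, y, z, w) in clip space, i.e. the vertex shader
    output before the perspective divide. The OpenGL clip volume is
    -w <= x <= w, -w <= y <= w, -w <= z <= w.
    """
    plane_tests = []
    for axis in range(3):
        plane_tests.append(lambda v, a=axis: v[a] < -v[3])
        plane_tests.append(lambda v, a=axis: v[a] > v[3])
    return any(all(test(v) for v in clip_vertices) for test in plane_tests)


# Entirely to the right of the view volume (x > w for all vertices).
off_screen = [(2.0, 0.0, 0.0, 1.0), (3.0, 0.0, 0.0, 1.0), (2.0, 1.0, 0.0, 1.0)]
# One vertex inside, two outside on different sides: needs real clipping.
straddling = [(0.0, 0.0, 0.0, 1.0), (2.0, 0.0, 0.0, 1.0), (0.0, 2.0, 0.0, 1.0)]

off_screen_rejected = trivially_outside(off_screen)
straddling_rejected = trivially_outside(straddling)
```

The test is conservative: a triangle that fails it may still turn out to be invisible, but one that passes it can safely be dropped before rasterization.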
Similarly, clipping is typically implemented (where possible) with rasterizer tricks, rather than by creating new triangles. Fragments that would be outside of the viewport simply aren't generated by the rasterizer. This is also legal according to OpenGL, because the spec defines apparent behavior. It doesn't really care if you actually cut the triangle into pieces as long as it looks indistinguishable from if you did.
Your question is essentially one of, "How much work should I do to not render off-screen objects?" That really depends on what your scene is and how you're rendering it. You say you're rendering 100,000 sprites. Are you making 100,000 draw calls, or are these sprites part of larger structures that you render with larger granularity? Do you stream the vertex data to the GPU every frame, or is the vertex data static?
Clipping and culling happen before fragment processing. http://www.opengl.org/wiki/Rendering_Pipeline_Overview
However, you will still be passing 100000 * 4 vertices (assuming you're rendering the sprites with quads and not point sprites) to the card if you don't do culling yourself. Depending on the card's memory performance this can be an issue.
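To get a feel for the size of that transfer, here is the arithmetic under some assumed numbers: a 16-byte vertex (x, y, u, v as 32-bit floats) and a 60 frames-per-second target, neither of which comes from the question. It only matters if the vertex data is re-sent every frame rather than kept static on the card.

```python
SPRITES = 100_000
VERTICES_PER_SPRITE = 4
BYTES_PER_VERTEX = 4 * 4   # assumed layout: x, y, u, v as 32-bit floats
FRAMES_PER_SECOND = 60     # assumed target frame rate

vertices_per_frame = SPRITES * VERTICES_PER_SPRITE
bytes_per_frame = vertices_per_frame * BYTES_PER_VERTEX
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND
```

Under these assumptions that is 6.4 MB of vertex data per frame, or 384 MB per second if everything is streamed every frame.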