Is it better to store uniform descriptors in a single buffer or use separate buffers for each frame? - vulkan

My application has a max of 2 frames in flight. The vertex shader takes a uniform buffer containing matrices. Because 2 frames could potentially be rendering at the same time, I believe I need separate memory for each frame's uniform buffer.
Is it preferable to create a single buffer that holds the uniforms of both frames and use an offset within the buffer to do updates, or is it better for each frame to have its own buffer?

Technically you can do either; it ultimately boils down to what makes the most sense implementation-wise. Provided you can ensure that you're not overwriting buffer data that's still active (and being consumed on the GPU), one memory allocation would be sufficient. It's a big advantage of the Vulkan API (since you have full control) but it does make life more complicated.
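For concreteness, here is a minimal sketch of the single-allocation approach (names like physicalDevice, device, frameIndex, and the UniformData struct are assumptions, not from the original post); the key detail is padding each frame's slice to minUniformBufferOffsetAlignment:

// Sketch: one VkBuffer holding a slice of uniform data per in-flight frame.
VkPhysicalDeviceProperties props;
vkGetPhysicalDeviceProperties(physicalDevice, &props);
VkDeviceSize align = props.limits.minUniformBufferOffsetAlignment;

// Round the per-frame slice up to the required offset alignment.
VkDeviceSize sliceSize = (sizeof(UniformData) + align - 1) & ~(align - 1);

VkBufferCreateInfo info = {};
info.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;
info.size = sliceSize * 2;                        // 2 frames in flight
info.usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
info.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
VkBuffer uniformBuffer;
vkCreateBuffer(device, &info, nullptr, &uniformBuffer);

// Each frame writes only into its own slice; this is safe because the
// fence for the frame that last used this slice has already been waited on.
VkDeviceSize frameOffset = sliceSize * (frameIndex % 2);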
In my use-case, I use pages of allocations (kinda like a heap model) where I allocate on demand and return blocks when they're done (basically, if a reference is removed, I age the block out for however many frames of buffering I have and then free it). This is targeted at uniform buffers that change infrequently.
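A stripped-down sketch of that age-out scheme (all names hypothetical): freed blocks are stamped with the frame they were released in, and are only recycled once every in-flight frame that could still reference them has finished:

#include <cstdint>
#include <deque>
#include <utility>

struct Block { uint64_t offset, size; };

struct BlockRecycler {
    static constexpr uint64_t kFramesInFlight = 2;
    std::deque<std::pair<uint64_t, Block>> retired; // (frame released, block)
    uint64_t currentFrame = 0;

    // Called when the last reference to a block goes away.
    void release(Block b) { retired.push_back({currentFrame, b}); }

    // Called once per frame, before any allocation.
    void beginFrame() {
        ++currentFrame;
        // Anything released kFramesInFlight frames ago can no longer be
        // read by an in-flight command buffer, so hand it back to the pool.
        while (!retired.empty() &&
               currentFrame - retired.front().first >= kFramesInFlight) {
            returnToPool(retired.front().second); // hypothetical pool free
            retired.pop_front();
        }
    }

    void returnToPool(Block) { /* give the block back to its page */ }
};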
For per-draw uniform data, I use push constants - this might be worth looking at for your use-case. Since the per-instance matrix is "use once then discard" and is relatively small, this actually makes life simpler, and could even have a performance benefit, to boot.
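If you go that route, the mechanics look roughly like this (cmd, pipelineLayout, and the matrix variable are assumed to exist); the matrix is recorded straight into the command buffer, so there is no per-frame buffer to synchronize at all:

// At pipeline creation: declare a push-constant range in the layout.
VkPushConstantRange range = {};
range.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;
range.offset = 0;
range.size = sizeof(float) * 16;                  // one mat4

VkPipelineLayoutCreateInfo layoutInfo = {};
layoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
layoutInfo.pushConstantRangeCount = 1;
layoutInfo.pPushConstantRanges = &range;
// ... descriptor set layouts, then vkCreatePipelineLayout ...

// Per draw: push the matrix and draw; no descriptor updates needed.
// (GLSL side: layout(push_constant) uniform Push { mat4 mvp; };)
vkCmdPushConstants(cmd, pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT,
                   0, sizeof(float) * 16, &modelToProjection);
vkCmdDrawIndexed(cmd, indexCount, 1, 0, 0, 0);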

Related

When is it safe to write over and reuse a MTLBuffer or other Metal vertex buffer?

I'm just getting started with Metal, and am having trouble grasping some basic things. I've been reading a whole bunch of web pages about Metal, and working through Apple's examples, and so forth, but gaps in my understanding remain. I think my key point of confusion is: what is the right way to handle vertex buffers, and how do I know when it's safe to reuse them? This confusion manifests in several ways, as I'll describe below, and maybe those different manifestations of my confusion need to be addressed in different ways.
To be more specific, I'm using an MTKView subclass in Objective-C on macOS to display very simple 2D shapes: an overall frame for the view with a background color inside, 0+ rectangular subframes inside that overall frame with a different background color inside them, and then 0+ flat-shaded squares of various colors inside each subframe. My vertex function is just a simple coordinate transformation, and my fragment function just passes through the color it receives, based on Apple's triangle demo app. I have this working fine for a single subframe with a single square. So far so good.
There are several things that puzzle me.
One: I could design my code to render the whole thing with a single vertex buffer and a single call to drawPrimitives:, drawing all of the (sub)frames and squares in one big bang. This is not optimal, though, as it breaks the encapsulation of my code, in which each subframe represents the state of one object (the thing that contains the 0+ squares); I'd like to allow each object to be responsible for drawing its own contents. It would be nice, therefore, to have each object set up a vertex buffer and make its own drawPrimitives: call. But since the objects will draw sequentially (this is a single-threaded app), I'd like to reuse the same vertex buffer across all of these drawing operations, rather than having each object have to allocate and own a separate vertex buffer. But can I do that? After I call drawPrimitives:, I guess the contents of the vertex buffer have to be copied over to the GPU, and I assume (?) that is not done synchronously, so it wouldn't be safe to immediately start modifying the vertex buffer for the next object's drawing. So: how do I know when Metal is done with the buffer and I can start modifying it again?
Two: Even if #1 has a well-defined answer, such that I could block until Metal is done with the buffer and then start modifying it for the next drawPrimitives: call, is that a reasonable design? I guess it would mean that my CPU thread would be repeatedly blocking to wait for the memory transfers, which is not great. So does that pretty much push me to a design where each object has its own vertex buffer?
Three: OK, suppose each object has its own vertex buffer, or I do one "big bang" render of the whole thing with a single big vertex buffer (this question applies to both designs, I think). After I call presentDrawable: and then commit on my command buffer, my app will go off and do a little work, and then will try to update the display, so my drawing code now executes again. I'd like to reuse the vertex buffers I allocated before, overwriting the data in them to do the new, updated display. But again: how do I know when that is safe? As I understand it, the fact that commit returned to my code doesn't mean Metal is done copying my vertex buffers to the GPU yet, and in the general case I have to assume that could take an arbitrarily long time, so it might not be done yet when I re-enter my drawing code. What's the right way to tell? And again: should I just block waiting until they are available (however I'm supposed to do that), or should I have a second set of vertex buffers that I can use in case Metal is still busy with the first set? (That seems like it just pushes the problem down the pike, since when my drawing code is entered for the third update both previously used sets of buffers might not yet be available, right? So then I could add a third set of vertex buffers, but then the fourth update...)
Four: For drawing the frame and subframes, I'd like to just write a reuseable "drawFrame" type of function that everybody can call, but I'm a bit puzzled as to the right design. With OpenGL this was easy:
- (void)drawViewFrameInBounds:(NSRect)bounds
{
    int ox = (int)bounds.origin.x, oy = (int)bounds.origin.y;
    int w = (int)bounds.size.width, h = (int)bounds.size.height;
    glColor3f(0.77f, 0.77f, 0.77f);
    glRecti(ox, oy, ox + 1, oy + h);                 // left edge, 1 px wide
    glRecti(ox + 1, oy, ox + w - 1, oy + 1);         // bottom edge (top if flipped)
    glRecti(ox + w - 1, oy, ox + w, oy + h);         // right edge, 1 px wide
    glRecti(ox + 1, oy + h - 1, ox + w - 1, oy + h); // remaining horizontal edge
}
But with Metal I'm not sure what a good design is. I guess the function can't just have its own little vertex buffer declared as a local static array, into which it throws vertices and then calls drawPrimitives:, because if it gets called twice in a row Metal might not yet have copied the vertex data from the first call when the second call wants to modify the buffer. I obviously don't want to have to allocate a new vertex buffer every time the function gets called. I could have the caller pass in a vertex buffer for the function to use, but that just pushes the problem out a level; how should the caller handle this situation, then? Maybe I could have the function append new vertices onto the end of a growing list of vertices in a buffer provided by the caller; but this seems to either force the whole render to be completely pre-planned (so that I can preallocate a big buffer of the right size to fit all of the vertices everybody will draw – which requires the top-level drawing code to somehow know how many vertices every object will end up generating, which violates encapsulation), or to do a design where I have an expanding vertex buffer that gets realloc'ed as needed when its capacity proves insufficient. I know how to do these things; but none of them feels right. I'm struggling with what the right design is, because I don't really understand Metal's memory model well enough, I think. Any advice? Apologies for the very long multi-part question, but I think all of this goes to the same basic lack of understanding.
The short answer to your underlying question is: you should not overwrite resources that are used by commands added to a command buffer until that command buffer has completed. The best way to determine that is to add a completion handler. You could also poll the status property of the command buffer, but that's not as good.
First, until you commit the command buffer, nothing is copied to the GPU. Further, as you noted, even after you commit the command buffer, you can't assume the data has been fully copied to the GPU.
Second, you should, in the simple case, put all drawing for a frame into a single command buffer. Creating and committing a lot of command buffers (like one for every object that draws) adds overhead.
These two points combined means you can't typically reuse a resource during the same frame. Basically, you're going to have to double- or triple-buffer to get correctness and good performance simultaneously.
A typical technique is to create a small pool of buffers guarded by a semaphore. The semaphore count is initially the number of buffers in the pool. Code which wants a buffer waits on the semaphore and, when that succeeds, takes a buffer out of the pool. It should also add a completion handler to the command buffer that puts the buffer back in the pool and signals the semaphore.
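The pattern itself is language-neutral; here is a compact C++20 sketch of such a pool (in Metal itself you would use a dispatch_semaphore_t and -[MTLCommandBuffer addCompletedHandler:] rather than std::counting_semaphore):

#include <mutex>
#include <semaphore>
#include <vector>

template <typename Buffer>
class BufferPool {
public:
    explicit BufferPool(std::vector<Buffer> buffers)
        : free_(std::move(buffers)),
          available_(static_cast<ptrdiff_t>(free_.size())) {}

    Buffer acquire() {               // blocks while every buffer is in flight
        available_.acquire();
        std::lock_guard<std::mutex> lock(mutex_);
        Buffer b = std::move(free_.back());
        free_.pop_back();
        return b;
    }

    // Call this from the command buffer's completion handler.
    void release(Buffer b) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(b));
        }
        available_.release();
    }

private:
    std::mutex mutex_;
    std::vector<Buffer> free_;
    std::counting_semaphore<16> available_; // 16 = generous upper bound
};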
You could use a dynamic pool of buffers. If code wants a buffer and the pool is empty, it creates a buffer instead of blocking. Then, when it's done, it adds the buffer to the pool, effectively increasing the size of the pool. However, there's typically no point in doing that. You would only need more than three buffers if the CPU is running way ahead of the GPU, and there's no real benefit to that.
As to your desire to have each object draw itself, that can certainly be done. I'd use a large vertex buffer along with some metadata about how much of it has been used so far. Each object that needs to draw will append its vertex data to the buffer and encode its drawing commands referencing that vertex data. You would use the vertexStart parameter to have the drawing command reference the right place in the vertex buffer.
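As a sketch (the types and names here are illustrative, not Metal API): each object appends into the frame's shared buffer and keeps the returned first-vertex index, which is exactly the value it later passes as vertexStart:

#include <cstddef>
#include <cstdint>
#include <cstring>

struct VertexArena {
    uint8_t* mapped;        // contents pointer of the shared vertex buffer
    size_t capacityBytes;
    size_t usedBytes = 0;

    // Appends `count` vertices of `strideBytes` each; returns the index of
    // the first appended vertex (the draw call's vertexStart), or SIZE_MAX
    // if the arena is full. Assumes every caller uses the same stride.
    size_t append(const void* vertices, size_t count, size_t strideBytes) {
        size_t bytes = count * strideBytes;
        if (usedBytes + bytes > capacityBytes) return SIZE_MAX;
        std::memcpy(mapped + usedBytes, vertices, bytes);
        size_t firstVertex = usedBytes / strideBytes;
        usedBytes += bytes;
        return firstVertex;
    }

    // Reset once the frame's command buffer has completed.
    void reset() { usedBytes = 0; }
};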
You should also consider indexed drawing with the primitive restart value so there's only a single draw command which draws all of the primitives. Each object would add its primitive to the shared vertex data and index buffers and then some high level controller would do the draw.

How to write to an image directly from the CPU when loading it in Vulkan?

In Direct3D12, you can use "ID3D12Resource::WriteToSubresource" to enable zero-copy optimizations for UMA adapters.
What is the equivalent of "ID3D12Resource::WriteToSubresource" in Vulkan?
What WriteToSubresource seems to do (in Vulkan-equivalent terms) is write pixel data from CPU memory to an image whose storage is in CPU-writable memory (hence the requirement that it first be mapped), to do so immediately without the need for a command buffer, and to be able to do so regardless of linear/tiling.
Vulkan doesn't have a way to do that. You can write directly to the backing storage for linear images (in the general layout), but not for tiled ones. You have to use a proper transfer command for that, even on UMA architectures. That means building a command buffer and submitting it to a transfer-capable queue, since Vulkan doesn't have any immediate copy commands like that.
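For reference, that transfer path looks roughly like this (cmd, stagingBuffer, image, width, and height are assumed to already exist; error handling omitted):

// Transition the tiled image so it can receive the copy.
VkImageMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image;
barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};
vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                     0, nullptr, 0, nullptr, 1, &barrier);

// Copy the pixel data from the host-visible staging buffer into the image.
VkBufferImageCopy region = {};
region.imageSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.imageExtent = {width, height, 1};
vkCmdCopyBufferToImage(cmd, stagingBuffer, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);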
A Vulkan way to do what WriteToSubresource does would essentially be a function that writes pixel data through a mapped pointer into device memory, laid out as appropriate for a tiled VkImage in the pre-initialized layout, at the particular region of memory where you intend to store it. That way, you could then bind the image to that location of memory, and you'd be able to transition the layout to whatever you want.
But that would require adding such a function and allowing the pre-initialized layout to be used for tiled images (so long as the data is written by this function).
So, from the ID3D12Resource::WriteToSubresource documentation, I read that it performs one copy, with marketese sprinkled on top.
Vulkan is an explicit API, which perfectly well allows you to do a one-copy upload on UMA (or on anything else). It even allows real zero-copy, if you stick with linear tiling.
UMA may look like this: https://vulkan.gpuinfo.org/displayreport.php?id=4919#memorytypes
I.e. it has only one heap, and its memory type is both DEVICE_LOCAL and HOST_VISIBLE.
So, if you create a linearly tiled image/buffer in Vulkan, vkMapMemory its memory, and then produce your data into that mapped pointer directly, there you have (real) zero-copy.
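Sketched out (device, image, memory, srcPixels, width, and height assumed, the image created with VK_IMAGE_TILING_LINEAR), that zero-copy path is a map plus row-by-row writes honoring the driver-reported row pitch:

// Ask the driver where each row of the linear image actually lives.
VkImageSubresource sub = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0};
VkSubresourceLayout layout;
vkGetImageSubresourceLayout(device, image, &sub, &layout);

void* mapped = nullptr;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

// Write rows individually: layout.rowPitch may exceed width * texelSize.
uint8_t* dst = static_cast<uint8_t*>(mapped) + layout.offset;
for (uint32_t y = 0; y < height; ++y)
    std::memcpy(dst + y * layout.rowPitch,
                srcPixels + y * width * 4,       // assuming 4-byte texels
                width * 4);

vkUnmapMemory(device, memory);                   // or keep it mapped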
Since this is not always practical (i.e. you cannot always choose how things are allocated, e.g. if it is data returned from a library function), there is an extension, VK_EXT_external_memory_host (assuming your ICD supports it, of course), which allows you to import your host data directly, without having to first make a Vulkan memory mapping.
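The import itself is short; a rough sketch (extension assumed enabled, data assumed aligned to minImportedHostPointerAlignment, extension entry points loaded via vkGetDeviceProcAddr, and pickType a hypothetical helper):

// Ask which memory types can import this host pointer.
VkMemoryHostPointerPropertiesEXT hostProps = {};
hostProps.sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT;
vkGetMemoryHostPointerPropertiesEXT(
    device, VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
    data, &hostProps);

// Import the existing host allocation instead of copying it.
VkImportMemoryHostPointerInfoEXT importInfo = {};
importInfo.sType = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT;
importInfo.handleType =
    VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT;
importInfo.pHostPointer = data;

VkMemoryAllocateInfo alloc = {};
alloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
alloc.pNext = &importInfo;
alloc.allocationSize = size;  // multiple of minImportedHostPointerAlignment
alloc.memoryTypeIndex = pickType(hostProps.memoryTypeBits); // hypothetical
VkDeviceMemory memory;
vkAllocateMemory(device, &alloc, nullptr, &memory);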
Now, there are optimally tiled images. Optimal tiling is opaque in Vulkan (so far) and implementation-dependent, so you do not even know the addressing scheme without some reverse engineering. Generally speaking, you want to use optimally tiled images, because accessing them supposedly has better performance characteristics (at least in common situations).
This is where the single copy comes in. You would take your linearly tiled image (or buffer) and vkCmdCopy* it into your optimally tiled image. That copy is performed by the device/GPU with all its bells and whistles, potentially faster than the CPU could do it, i.e. what I suspect they would call "near zero-copy".

Vulkan: Is there a way to draw multiple objects in different locations like in DirectX12?

In DirectX12, you render multiple objects in different locations using the equivalent of a single uniform buffer for the world transform like:
// Basic simplified pseudocode
SetRootSignature();
SetPrimitiveTopology();
SetPipelineState();
SetDepthStencilTarget();
SetViewportAndScissor();

for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();

    struct VSConstants
    {
        QEDx12::Math::Matrix4 modelToProjection;
    } vsConstants;
    vsConstants.modelToProjection = ViewProjMat * object->GetWorldProj();

    SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);
    DrawIndexed();
}
However, in Vulkan, if you do something similar with a single uniform buffer, all the objects are rendered at the location of the last world matrix:
for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();
    UploadUniformBuffer(object->GetWorldProj());
    DrawIndexed();
}
Is there a way to draw multiple objects with a single uniform buffer in Vulkan, just like in DirectX12?
I'm aware of Sascha Willems's dynamic uniform buffer example (https://github.com/SaschaWillems/Vulkan/tree/master/dynamicuniformbuffer), where he packs many matrices into one big uniform buffer; while useful, that is not exactly what I am looking for.
Thanks in advance for any help.
I cannot find a function called SetDynamicConstantBufferView in the D3D 12 API. I presume this is some function of your invention, but without knowing what it does, I can only really guess.
It looks like you're uploading data to the buffer object while rendering. If that's the case, well, Vulkan can't do that. And that's a good thing. Uploading to memory that you're currently reading from requires synchronization. You have to issue a barrier between the last rendering command that was reading the data you're about to overwrite, and the next rendering command. It's just not a good idea if you like performance.
But again, I'm not sure exactly what that function is doing, so my understanding may be wrong.
In Vulkan, descriptors are generally not meant to be changed in the middle of rendering a frame. However, the makers of Vulkan realized that users sometimes want to draw using different subsets of the same VkBuffer object. This is what dynamic uniform/storage buffers are for.
You technically don't have multiple uniform buffers; you just have one. But you can use the offset(s) provided to vkCmdBindDescriptorSets to shift where in that buffer the next rendering command(s) will get their data from. So it's a light-weight way to supply different rendering commands with different data.
Basically, you rebind your descriptor sets, but with different pDynamicOffsets array values. To make these work, you need to plan ahead. Your pipeline layout has to explicitly declare those descriptors as being dynamic descriptors. And every time you bind the set, you'll need to provide the offset into the buffer used by that descriptor.
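In code, the two halves of that plan look roughly like this (layout and set creation elided; limits, cmd, pipelineLayout, descriptorSet, and the per-object data are assumed):

// 1) Declare the binding as dynamic in the descriptor set layout.
VkDescriptorSetLayoutBinding binding = {};
binding.binding = 0;
binding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
binding.descriptorCount = 1;
binding.stageFlags = VK_SHADER_STAGE_VERTEX_BIT;

// 2) Per draw: rebind the same set with a different dynamic offset.
VkDeviceSize align = limits.minUniformBufferOffsetAlignment;
VkDeviceSize stride = (sizeof(ObjectUniforms) + align - 1) & ~(align - 1);

for (uint32_t i = 0; i < objectCount; ++i) {
    uint32_t dynamicOffset = static_cast<uint32_t>(i * stride);
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipelineLayout, 0, 1, &descriptorSet,
                            1, &dynamicOffset);   // pDynamicOffsets
    vkCmdDrawIndexed(cmd, indexCounts[i], 1, 0, 0, 0);
}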
That being said, it would probably be better to make your uniform buffer store larger arrays of matrices, using the dynamic offset to jump from one block of matrices to the other. You would then only need to change the dynamic offset when moving from one block to the next.
The point of that is that the uniform data you provide (depending on hardware) will remain in shader memory unless you do something to change the offset or shader. There is some small cost to uploading such data, so minimizing the need for such uploads is probably not a bad idea.
So you should go and upload all of your objects' buffer data in a single DMA operation. Then you issue a barrier and do your rendering, using dynamic offsets and the like to tell each rendering command where its data lives.
You either have to use push constants or have separate uniform buffers for each location. These can be bound either with a descriptor per location or with a dynamic offset.
In Sascha's example you can have more than just the one matrix inside the uniform buffer.
That means that inside UploadUniformBuffer you append the new matrix to the buffer and bind the new location.

Optimization using VBO in OpenGL ES 2.0

My system is composed of several objects that represent quads. Each quad is represented by the same vertices, and therefore each object only stores matrices that represent the object's transformation through the world, and its own object space. During each render pass, after these matrices are updated with their frame transforms, they are multiplied with the current view and projection matrices to form the MVP matrix for that object. The object's vertices are then sent with the MVP matrix to the shader, where the vertices are multiplied by the MVP matrix. The inefficiency here is that each quad is drawn separately, meaning there is a separate glDrawElements call for each quad. At any given moment, there may be 50 or 60 quads in existence; some move out of scope and are destroyed, or their animation completes, so they're also destroyed, but more randomly enter existence. Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Let's first reason about it with some simple mathematics:
At the moment you don't need to push any vertex data to the GPU each frame, but you do push 12-16 floats of matrix data per quad, and you perform a matrix-matrix multiplication per quad on the CPU.
When putting everything in one VBO, you have to transfer 4 vertices (~12 floats) per quad, but no matrix data (except for the global VP, of course), and you have to do 4 matrix-vector multiplies (~1 matrix-matrix multiply) per quad on the CPU.
So the amount of work and data transferred doesn't really change much. What changes is that the transferred data shifts from many, many small uniform updates to a single large VBO update, which is very likely to be faster (first because a buffer update is likely to be faster on the hardware side than multiple uniform updates, though don't nail me on that, and second because of the much-reduced driver overhead). On top of that comes the further reduced overhead of using a single large draw call instead of many smaller ones.
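A sketch of that batched path in GLES 2.0 terms (helper names like transformVec3, unitQuad, and the batched arrays are assumptions, not from the question): transform each quad's corners on the CPU, stream everything into one VBO, and issue a single draw:

// Build positions and indices for all live quads, once per frame.
for (int q = 0; q < quadCount; ++q) {
    for (int v = 0; v < 4; ++v)
        transformVec3(&batchedVerts[(q * 4 + v) * 3],  // hypothetical helper
                      worldMatrices[q], unitQuad[v]);
    // Two triangles per quad: 0-1-2 and 2-3-0, offset by q * 4.
    GLushort base = (GLushort)(q * 4);
    GLushort* idx = &batchedIndices[q * 6];
    idx[0] = base;     idx[1] = base + 1; idx[2] = base + 2;
    idx[3] = base + 2; idx[4] = base + 3; idx[5] = base;
}

// Stream the batch and draw all quads with one call.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, quadCount * 4 * 3 * sizeof(GLfloat),
             batchedVerts, GL_DYNAMIC_DRAW);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, quadCount * 6 * sizeof(GLushort),
             batchedIndices, GL_DYNAMIC_DRAW);
glDrawElements(GL_TRIANGLES, quadCount * 6, GL_UNSIGNED_SHORT, (void *)0);

The vertex shader then only needs the global view-projection matrix as a uniform, since the per-quad model transforms were already applied on the CPU.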
So yes, it will certainly be worth a try, though it has to be evaluated if it is really a "significant" improvement in your particular application.
Would there be a significant performance gain to storing all the necessary values in a VBO and just calling glDrawElements once during each pass?
Yes, it would be much faster. The first reason, as you correctly identified, is the single glDrawElements call. The second is the fact that a VBO keeps the data on the GPU itself.
If quads move out of scope, you can reuse their memory for new quads. VBOs can be used to draw subregions of the buffer, so you get a lot of flexibility without memory allocations.
By using VBOs you are minimising interaction with the GPU and so getting the performance benefit.

Disadvantages of using Texture Cache / Image2D for 2D Arrays?

When accessing 2D arrays in global memory, using the Texture Cache has many benefits, like filtering and not having to care as much about memory access patterns. The CUDA Programming Guide names only one downside:
However, within the same kernel call, the texture cache is not kept coherent with respect to global memory writes, so that any texture fetch to an address that has been written to via a global write in the same kernel call returns undefined data.
If I don't have a need for that, because I never write to the memory I read from, are there any downsides/pitfalls/problems when using the Texture Cache (or Image2D, as I am working in OpenCL) instead of plain global memory? Are there any cases where I will lose performance by using the Texture Cache?
Textures can be faster, the same speed, or slower than "naked" global memory access. There are no general rules of thumb for predicting performance using textures, as the speed-up (or lack thereof) is determined by the data usage patterns within your code and the texture hardware being used.
In the worst case, where cache hit rates are very low, using textures is slower than normal memory access: each thread first takes a cache miss and then triggers a global memory fetch, so the total latency is higher than a direct read from memory. I almost always write two versions of any serious code I am developing where textures might be useful (one with and one without), and then benchmark them. Often it is possible to develop heuristics to select which version to use based on the inputs. CUBLAS uses this strategy extensively.
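As an illustration of that two-version strategy (CUDA shown, since the quoted guide is CUDA's; the texture object is assumed to be created by the host over the same data the global-memory version reads):

// Version A: read through the texture cache via a texture object.
__global__ void readTex(cudaTextureObject_t tex, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
}

// Version B: the same access through plain global memory.
__global__ void readGlobal(const float* __restrict__ in, float* out,
                           int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = in[y * w + x];
}

Benchmarking both on representative inputs is what tells you which one wins for your access pattern.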