When is it safe to write over and reuse a MTLBuffer or other Metal vertex buffer? - objective-c

I'm just getting started with Metal, and am having trouble grasping some basic things. I've been reading a whole bunch of web pages about Metal, and working through Apple's examples, and so forth, but gaps in my understanding remain. I think my key point of confusion is: what is the right way to handle vertex buffers, and how do I know when it's safe to reuse them? This confusion manifests in several ways, as I'll describe below, and maybe those different manifestations of my confusion need to be addressed in different ways.
To be more specific, I'm using an MTKView subclass in Objective-C on macOS to display very simple 2D shapes: an overall frame for the view with a background color inside, 0+ rectangular subframes inside that overall frame with a different background color inside them, and then 0+ flat-shaded squares of various colors inside each subframe. My vertex function is just a simple coordinate transformation, and my fragment function just passes through the color it receives, based on Apple's triangle demo app. I have this working fine for a single subframe with a single square. So far so good.
There are several things that puzzle me.
One: I could design my code to render the whole thing with a single vertex buffer and a single call to drawPrimitives:, drawing all of the (sub)frames and squares in one big bang. This is not optimal, though, as it breaks the encapsulation of my code, in which each subframe represents the state of one object (the thing that contains the 0+ squares); I'd like to allow each object to be responsible for drawing its own contents. It would be nice, therefore, to have each object set up a vertex buffer and make its own drawPrimitives: call. But since the objects will draw sequentially (this is a single-threaded app), I'd like to reuse the same vertex buffer across all of these drawing operations, rather than having each object have to allocate and own a separate vertex buffer. But can I do that? After I call drawPrimitives:, I guess the contents of the vertex buffer have to be copied over to the GPU, and I assume (?) that is not done synchronously, so it wouldn't be safe to immediately start modifying the vertex buffer for the next object's drawing. So: how do I know when Metal is done with the buffer and I can start modifying it again?
Two: Even if #1 has a well-defined answer, such that I could block until Metal is done with the buffer and then start modifying it for the next drawPrimitives: call, is that a reasonable design? I guess it would mean that my CPU thread would be repeatedly blocking to wait for the memory transfers, which is not great. So does that pretty much push me to a design where each object has its own vertex buffer?
Three: OK, suppose each object has its own vertex buffer, or I do one "big bang" render of the whole thing with a single big vertex buffer (this question applies to both designs, I think). After I call presentDrawable: and then commit on my command buffer, my app will go off and do a little work, and then will try to update the display, so my drawing code now executes again. I'd like to reuse the vertex buffers I allocated before, overwriting the data in them to do the new, updated display. But again: how do I know when that is safe? As I understand it, the fact that commit returned to my code doesn't mean Metal is done copying my vertex buffers to the GPU yet, and in the general case I have to assume that could take an arbitrarily long time, so it might not be done yet when I re-enter my drawing code. What's the right way to tell? And again: should I just block waiting until they are available (however I'm supposed to do that), or should I have a second set of vertex buffers that I can use in case Metal is still busy with the first set? (That seems like it just pushes the problem down the pike, since when my drawing code is entered for the third update both previously used sets of buffers might not yet be available, right? So then I could add a third set of vertex buffers, but then the fourth update...)
Four: For drawing the frame and subframes, I'd like to just write a reusable "drawFrame" type of function that everybody can call, but I'm a bit puzzled as to the right design. With OpenGL this was easy:
- (void)drawViewFrameInBounds:(NSRect)bounds
{
    int ox = (int)bounds.origin.x, oy = (int)bounds.origin.y;
    glColor3f(0.77f, 0.77f, 0.77f);
    glRecti(ox, oy, ox + 1, oy + (int)bounds.size.height);
    glRecti(ox + 1, oy, ox + (int)bounds.size.width - 1, oy + 1);
    glRecti(ox + (int)bounds.size.width - 1, oy, ox + (int)bounds.size.width, oy + (int)bounds.size.height);
    glRecti(ox + 1, oy + (int)bounds.size.height - 1, ox + (int)bounds.size.width - 1, oy + (int)bounds.size.height);
}
But with Metal I'm not sure what a good design is. I guess the function can't just have its own little vertex buffer declared as a local static array, into which it throws vertices and then calls drawPrimitives:, because if it gets called twice in a row Metal might not yet have copied the vertex data from the first call when the second call wants to modify the buffer. I obviously don't want to have to allocate a new vertex buffer every time the function gets called. I could have the caller pass in a vertex buffer for the function to use, but that just pushes the problem out a level; how should the caller handle this situation, then? Maybe I could have the function append new vertices onto the end of a growing list of vertices in a buffer provided by the caller; but this seems to either force the whole render to be completely pre-planned (so that I can preallocate a big buffer of the right size to fit all of the vertices everybody will draw – which requires the top-level drawing code to somehow know how many vertices every object will end up generating, which violates encapsulation), or to do a design where I have an expanding vertex buffer that gets realloc'ed as needed when its capacity proves insufficient. I know how to do these things; but none of them feels right. I'm struggling with what the right design is, because I don't really understand Metal's memory model well enough, I think. Any advice? Apologies for the very long multi-part question, but I think all of this goes to the same basic lack of understanding.

The short answer to your underlying question is: you should not overwrite resources that are used by commands added to a command buffer until that command buffer has completed. The best way to determine that is to add a completion handler. You could also poll the status property of the command buffer, but that's not as good.
First, until you commit the command buffer, nothing is copied to the GPU. Further, as you noted, even after you commit the command buffer, you can't assume the data has been fully copied to the GPU.
Second, you should, in the simple case, put all drawing for a frame into a single command buffer. Creating and committing a lot of command buffers (like one for every object that draws) adds overhead.
These two points combined mean you typically can't reuse a resource during the same frame. Basically, you're going to have to double- or triple-buffer to get correctness and good performance simultaneously.
A typical technique is to create a small pool of buffers guarded by a semaphore. The semaphore count is initially the number of buffers in the pool. Code which wants a buffer waits on the semaphore and, when that succeeds, takes a buffer out of the pool. It should also add a completion handler to the command buffer that puts the buffer back in the pool and signals the semaphore.
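The pool-plus-semaphore idea can be sketched without any Metal calls. The following is a minimal C++ model, not Metal code: `Buffer` and `BufferPool` are hypothetical names, a mutex and condition variable stand in for the semaphore, and `release` is what a real app would call from the handler it registers with `addCompletedHandler:`.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a GPU buffer (an id<MTLBuffer> in a real Metal app).
struct Buffer {
    std::vector<unsigned char> bytes;
};

// Fixed-size pool. The number of free buffers plays the role of the semaphore count.
class BufferPool {
public:
    BufferPool(std::size_t count, std::size_t bytesPerBuffer) : storage_(count) {
        for (Buffer& b : storage_) {
            b.bytes.resize(bytesPerBuffer);
            free_.push_back(&b);
        }
    }

    // Blocks until a buffer is free (the "wait on the semaphore" step).
    Buffer* acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        available_.wait(lock, [this] { return !free_.empty(); });
        Buffer* b = free_.back();
        free_.pop_back();
        return b;
    }

    // Non-blocking variant: returns nullptr when every buffer is still in flight.
    Buffer* tryAcquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) return nullptr;
        Buffer* b = free_.back();
        free_.pop_back();
        return b;
    }

    // Call this from the command buffer's completion handler (the "signal" step).
    void release(Buffer* b) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(b);
        }
        available_.notify_one();
    }

    std::size_t freeCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    mutable std::mutex mutex_;
    std::condition_variable available_;
    std::vector<Buffer> storage_;   // never resized, so the pointers in free_ stay valid
    std::vector<Buffer*> free_;
};
```

With a pool of three, the drawing code calls `acquire()` once per frame, fills the buffer, encodes its commands, and arranges for `release()` to run when that frame's command buffer completes. The CPU then blocks only when it is already three frames ahead of the GPU.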
You could use a dynamic pool of buffers. If code wants a buffer and the pool is empty, it creates a buffer instead of blocking. Then, when it's done, it adds the buffer to the pool, effectively increasing the size of the pool. However, there's typically no point in doing that. You would only need more than three buffers if the CPU is running way ahead of the GPU, and there's no real benefit to that.
As to your desire to have each object draw itself, that can certainly be done. I'd use a large vertex buffer along with some metadata about how much of it has been used so far. Each object that needs to draw will append its vertex data to the buffer and encode its drawing commands referencing that vertex data. You would use the vertexStart parameter to have the drawing command reference the right place in the vertex buffer.
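As a rough illustration of the "large buffer plus a used-so-far counter" approach, here is a C++ sketch. `Vertex`, `DrawCall`, and `FrameVertexArena` are made-up names, and a `std::vector` stands in for the contents of the shared vertex buffer. In a Metal app the returned `vertexStart` would be passed to the draw call's vertexStart parameter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vertex layout: 2D position plus RGBA color.
struct Vertex {
    float x, y;
    float r, g, b, a;
};

// Hypothetical record of where one object's vertices landed in the shared buffer.
struct DrawCall {
    std::size_t vertexStart;
    std::size_t vertexCount;
};

// One arena per in-flight frame; reset() it only once that frame's GPU work has completed.
class FrameVertexArena {
public:
    explicit FrameVertexArena(std::size_t capacity) : vertices_(capacity), used_(0) {}

    // Copies count vertices to the end of the used region and reports their position.
    // Returns false, leaving the arena unchanged, if they do not fit.
    bool append(const Vertex* src, std::size_t count, DrawCall& out) {
        if (count > vertices_.size() - used_) return false;
        std::copy(src, src + count, vertices_.data() + used_);
        out = DrawCall{used_, count};
        used_ += count;
        return true;
    }

    void reset() { used_ = 0; }
    std::size_t used() const { return used_; }

private:
    std::vector<Vertex> vertices_;
    std::size_t used_;
};
```

Each drawing object only needs a reference to the arena and the render encoder: it appends its vertices, then encodes a draw using the `vertexStart` and `vertexCount` it got back. The top-level code does not need to know how many vertices each object produces, only an upper bound on the total.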
You should also consider indexed drawing with the primitive restart value so there's only a single draw command which draws all of the primitives. Each object would add its primitive to the shared vertex data and index buffers and then some high level controller would do the draw.
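A sketch of how objects might contribute triangle strips to shared vertex and index data, separated by the restart index, is shown below. It assumes 16-bit indices, for which the restart value is 0xFFFF; `StripBatch` and `addStrip` are illustrative names only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vertex {
    float x, y;
};

// Restart value for 16-bit indices: the largest value the index type can hold.
constexpr std::uint16_t kRestartIndex = 0xFFFF;

// Shared CPU-side staging for one frame's strips.
struct StripBatch {
    std::vector<Vertex> vertices;
    std::vector<std::uint16_t> indices;

    // Appends one triangle strip. A restart index is placed between strips so that
    // consecutive strips are not joined by stray triangles.
    void addStrip(const Vertex* v, std::uint16_t count) {
        const std::uint16_t base = static_cast<std::uint16_t>(vertices.size());
        if (!indices.empty()) indices.push_back(kRestartIndex);
        vertices.insert(vertices.end(), v, v + count);
        for (std::uint16_t i = 0; i < count; ++i) {
            indices.push_back(static_cast<std::uint16_t>(base + i));
        }
    }
};
```

After every object has called `addStrip`, the controller copies both arrays into GPU buffers and issues one indexed triangle-strip draw with `indices.size()` indices. This sketch does not guard against exceeding 65535 vertices; a real implementation would switch to 32-bit indices or split the batch.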

Related

Is it better to store uniform descriptors in a single buffer or use separate buffers for each frame?

My application has a max of 2 frames in flight. The vertex shader takes a uniform buffer containing matrices. Because 2 frames could potentially be rendering at the same time I believe I need to have separate memory for each frame's uniform buffer.
Is it preferable to create a single buffer that holds the uniforms of both frames and use an offset within the buffer to do updates, or is it better for each frame to have its own buffer?
Technically you can do either; it ultimately boils down to what makes the most sense implementation-wise. Provided you can ensure that you're not overwriting buffer data that's still active (and being consumed on the GPU), one memory allocation would be sufficient. It's a big advantage of the Vulkan API (since you have full control) but it does make life more complicated.
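For the single-allocation option, the only arithmetic involved is an aligned per-frame stride. The sketch below is a C++ model with hypothetical names (`FrameUniformLayout`, `makeLayout`). The alignment would come from the device limit `minUniformBufferOffsetAlignment`, and the value 256 in the usage note is an assumed example, not something queried from a device.

```cpp
#include <cassert>
#include <cstddef>

// Rounds value up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t alignUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Describes one allocation that holds a separate uniform region per in-flight frame.
struct FrameUniformLayout {
    std::size_t stride;          // bytes from one frame's region to the next
    std::size_t framesInFlight;  // number of regions
    std::size_t totalSize;       // size of the single allocation

    // Byte offset of the region that the given frame may write and bind.
    std::size_t offsetForFrame(std::size_t frameNumber) const {
        return (frameNumber % framesInFlight) * stride;
    }
};

constexpr FrameUniformLayout makeLayout(std::size_t uniformSize,
                                        std::size_t offsetAlignment,
                                        std::size_t framesInFlight) {
    return FrameUniformLayout{alignUp(uniformSize, offsetAlignment),
                              framesInFlight,
                              alignUp(uniformSize, offsetAlignment) * framesInFlight};
}
```

For example, `makeLayout(192, 256, 2)` gives a stride of 256 and a 512-byte allocation, with frames alternating between offsets 0 and 256. The layout does not protect anything by itself: the application still has to wait on the fence of the frame that last used a region before writing to it again.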
In my use-case, I use pages of allocations (kinda like a heap model) where I allocate on demand and return blocks when they're done (basically, if a reference is removed, I age out for however many frames of buffering I have and then free the block). This is targeted at uniform buffers that change infrequently.
For per-draw uniform data, I use push constants - this might be worth looking at for your use-case. Since the per-instance matrix is "use once then discard" and is relatively small, this actually makes life simpler, and could even have a performance benefit, to boot.

Vulkan vkCmdDraw with variable instance count

When I define a command buffer, I need to specify the vertex count and instance count beforehand. Does it mean that if I want to update the number of instances dynamically, I need to recompile the entire command buffer all over again? Just changing this single number seems like a small and innocent tweak. There should be a more efficient way of doing that.
vkCmdDrawIndirect allows for draw operations whose parameters are fetched from a VkBuffer. This allows you to change the storage in that buffer object, and that will be reflected in the indirect draw call that the CB uses...
Assuming you did proper synchronization, at any rate.
After all, you cannot modify the values in storage associated with a VkBuffer while a command that could be using that storage is executing. So if you want to change the data in that memory, you will need some kind of synchronization between the final indirect draw command that reads from the buffer and whatever process writes the data. If it is an on-GPU process (a copy from mapped memory, for example), then it's fairly easy.
However, event setting is not something you can do within a render pass, so the set will have to wait until the entire render pass is over.
The most efficient way to handle this is to double-buffer your draw indirect buffers. On one frame, you write to one piece of memory and execute commands that read from it. On the next frame, you write to a different piece of memory while the GPU is executing commands that read from the previous one. On the third frame, you go back to the first piece of memory (using the synchronization you set up to ensure that the GPU is finished).
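The alternation between two pieces of memory can be modeled in a few lines of C++. In this sketch `DrawIndirectCommand` mirrors the four 32-bit fields of `VkDrawIndirectCommand`, while `DoubleBufferedIndirect` is a hypothetical helper, and a plain array stands in for the mapped buffer memory. The fence wait that must precede each write is assumed to happen elsewhere.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same four 32-bit fields, in the same order, as Vulkan's VkDrawIndirectCommand.
struct DrawIndirectCommand {
    std::uint32_t vertexCount;
    std::uint32_t instanceCount;
    std::uint32_t firstVertex;
    std::uint32_t firstInstance;
};

// Two command slots living in one buffer; frame N writes slot N % 2.
class DoubleBufferedIndirect {
public:
    static constexpr std::size_t kSlots = 2;

    // Writes this frame's draw parameters and returns the slot that was used.
    // The caller must already have waited for the frame that last read this slot.
    std::size_t write(std::uint64_t frameNumber,
                      std::uint32_t vertexCount,
                      std::uint32_t instanceCount) {
        const std::size_t slot = static_cast<std::size_t>(frameNumber % kSlots);
        slots_[slot] = DrawIndirectCommand{vertexCount, instanceCount, 0, 0};
        return slot;
    }

    // Byte offset to pass as the indirect draw's buffer offset for a given slot.
    static constexpr std::size_t byteOffset(std::size_t slot) {
        return slot * sizeof(DrawIndirectCommand);
    }

    const DrawIndirectCommand& slot(std::size_t i) const { return slots_[i]; }

private:
    std::array<DrawIndirectCommand, kSlots> slots_{};
};
```

A static command buffer recorded against slot 0 reads at byte offset 0, and its twin recorded against slot 1 reads at byte offset 16, so changing the instance count never requires re-recording either one.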
Of course, if you're insisting on static command buffers, this means that the command buffers themselves must also be double-buffered. One CB reads the indirect data from one buffer, and the other CB reads from the other.

Changing the blend mode in Metal

Does changing the blend function in Metal require setting a whole new MTLRenderPipelineState?
I assume YES, because MTLRenderPipelineState is immutable, so I cannot change its descriptor and, for example, the descriptor's sourceRGBBlendFactor property. But I wanted to confirm, as it sounds a little inefficient to generate large objects just to change a single parameter.
Edit:
I am thinking about a case where I am drawing one vertex buffer containing a series of meshes, with multiple calls to -drawPrimitives:. Each mesh can use a different blend mode, but all use the same vertex and fragment shader. In OpenGL I could switch glBlendFunc() between the draw calls. In Metal I need to set a whole separate MTLRenderPipelineState with several state values.
Some objects in Metal are designed to be transient and extremely lightweight, while others are more expensive and can last for a long time, perhaps for the lifetime of the app.
Command buffer and command encoder objects are transient and designed for a single use. They are very inexpensive to allocate and deallocate, so their creation methods return autoreleased objects.
MTLRenderPipelineState is not transient.
"Does changing the blend function in Metal require setting a whole new MTLRenderPipelineState?"
Yes, you must create a whole new MTLRenderPipelineState object for each blending configuration.
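Since pipeline state objects are long-lived, one common way to live with this is to build each blending configuration once and look it up at draw time. The following C++ sketch models that idea only: `BlendConfig`, `PipelineState`, and `PipelineCache` are invented stand-ins, and the place where a real app would create an MTLRenderPipelineState is marked with a comment.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>

// Hypothetical subset of blend factors, for illustration only.
enum class BlendFactor { Zero, One, SourceAlpha, OneMinusSourceAlpha };

// The part of the pipeline descriptor that varies between meshes in this example.
struct BlendConfig {
    bool enabled;
    BlendFactor sourceRGB;
    BlendFactor destinationRGB;

    bool operator<(const BlendConfig& other) const {
        return std::tie(enabled, sourceRGB, destinationRGB) <
               std::tie(other.enabled, other.sourceRGB, other.destinationRGB);
    }
};

// Stand-in for an immutable pipeline state object.
struct PipelineState {
    int id;
};

// Creates a pipeline state the first time a configuration is requested, then reuses it.
class PipelineCache {
public:
    const PipelineState& get(const BlendConfig& config) {
        auto it = cache_.find(config);
        if (it == cache_.end()) {
            // A real app would fill in a pipeline descriptor and create the state here.
            it = cache_.emplace(config, PipelineState{nextId_++}).first;
        }
        return it->second;
    }

    std::size_t size() const { return cache_.size(); }

private:
    std::map<BlendConfig, PipelineState> cache_;
    int nextId_ = 0;
};
```

With such a cache, switching blend modes between draw calls costs one lookup plus setting the render pipeline state on the encoder. The expensive creation happens once per distinct configuration, ideally at load time.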

Why do I need resources per swapchain image

I have been following different tutorials and I don't understand why I need resources per swapchain image instead of per frame in flight.
This tutorial:
https://vulkan-tutorial.com/Uniform_buffers
has a uniform buffer per swapchain image. Why would I need that if different images are not in flight at the same time? Can I not start rewriting if the previous frame has completed?
Also, the LunarG tutorial on depth buffers says:
And you need only one for rendering each frame, even if the swapchain has more than one image. This is because you can reuse the same depth buffer while using each image in the swapchain.
This doesn't explain anything, it basically says you can because you can. So why can I reuse the depth buffer but not other resources?
It is to minimize synchronization in the case of the simple Hello Cube app.
Let's say your uniforms change each frame. That means the main loop is something like:
1. Poll (or simulate)
2. Update (e.g. your uniforms)
3. Draw
4. Repeat
If step #2 did not have its own copy of the uniform, it would have to write to a uniform that the previous frame is still reading. That means it has to sync with a fence, at which point the previous frame is no longer considered "in-flight".
It all depends on how you are using your resources and the performance you want to achieve.
If, after each frame, you are willing to wait for the rendering to finish and you are still happy with the final performance, you can use only one copy of each resource. Waiting is the easiest synchronization: you are sure that resources are not used anymore, so you can reuse them for the next frame. But if you want to efficiently utilize both the CPU's and the GPU's power, and you don't want to wait after each frame, then you need to look at how each resource is being used.
A depth buffer is usually used only temporarily. If you don't perform any postprocessing, and your render pass setup uses depth data only internally (you don't specify STORE for storeOp), then you can use a single depth buffer (depth image) all the time. This is because when rendering is done, the depth data isn't used anymore and can be safely discarded. This applies to all other resources that don't need to persist between frames.
But if different data needs to be used for each frame, or if generated data is used in the next frame, then you usually need another copy of a given resource. Updating data requires synchronization; to avoid waiting in such situations you need to have a copy of the resource. So in the case of uniform buffers, you update data in a given buffer and use it in a given frame. You cannot modify its contents until the frame is finished, so to prepare another frame of animation while the previous one is still being processed on the GPU, you need to use another copy.
The same applies if the generated data is required for the next frame (for example, a framebuffer used for screen-space reflections). Reusing the same resource would cause its contents to be overwritten. That's why you need another copy.
You can find more information here: https://software.intel.com/en-us/articles/api-without-secrets-the-practical-approach-to-vulkan-part-1

Vulkan: Is there a way to draw multiple objects in different locations like in DirectX12?

In DirectX12, you render multiple objects in different locations using the equivalent of a single uniform buffer for the world transform like:
// Basic simplified pseudocode
SetRootSignature();
SetPrimitiveTopology();
SetPipelineState();
SetDepthStencilTarget();
SetViewportAndScissor();
for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();
    struct VSConstants
    {
        QEDx12::Math::Matrix4 modelToProjection;
    } vsConstants;
    vsConstants.modelToProjection = ViewProjMat * object->GetWorldProj();
    SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);
    DrawIndexed();
}
However, in Vulkan, if you do something similar with a single uniform buffer, all the objects are rendered in the location of last world matrix:
for (auto object : objects)
{
    SetIndexBuffer();
    SetVertexBuffer();
    UploadUniformBuffer(object->GetWorldProj());
    DrawIndexed();
}
Is there a way to draw multiple objects with a single uniform buffer in Vulkan, just like in DirectX12?
I'm aware of Sascha Willems' dynamic uniform buffer example (https://github.com/SaschaWillems/Vulkan/tree/master/dynamicuniformbuffer), where he packs many matrices into one big uniform buffer. While useful, it is not exactly what I am looking for.
Thanks in advance for any help.
I cannot find a function called SetDynamicConstantBufferView in the D3D 12 API. I presume this is some function of your invention, but without knowing what it does, I can only really guess.
It looks like you're uploading data to the buffer object while rendering. If that's the case, well, Vulkan can't do that. And that's a good thing. Uploading to memory that you're currently reading from requires synchronization. You have to issue a barrier between the last rendering command that was reading the data you're about to overwrite, and the next rendering command. It's just not a good idea if you like performance.
But again, I'm not sure exactly what that function is doing, so my understanding may be wrong.
In Vulkan, descriptors are generally not meant to be changed in the middle of rendering a frame. However, the makers of Vulkan realized that users sometimes want to draw using different subsets of the same VkBuffer object. This is what dynamic uniform/storage buffers are for.
You technically don't have multiple uniform buffers; you just have one. But you can use the offset(s) provided to vkCmdBindDescriptorSets to shift where in that buffer the next rendering command(s) will get their data from. So it's a light-weight way to supply different rendering commands with different data.
Basically, you rebind your descriptor sets, but with different pDynamicOffsets array values. To make these work, you need to plan ahead. Your pipeline layout has to explicitly declare those descriptors as being dynamic descriptors. And every time you bind the set, you'll need to provide the offset into the buffer used by that descriptor.
That being said, it would probably be better to make your uniform buffer store larger arrays of matrices, using the dynamic offset to jump from one block of matrices to the other.
The point of that is that the uniform data you provide (depending on hardware) will remain in shader memory unless you do something to change the offset or shader. There is some small cost to uploading such data, so minimizing the need for such uploads is probably not a bad idea.
So you should upload all of your objects' buffer data in a single DMA operation. Then you issue a barrier and do your rendering, using dynamic offsets and such to tell each draw where its data is.
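The packing step that precedes that single upload can be sketched on the CPU side as follows. This is a C++ model under assumptions: `Mat4`, `PackedUniforms`, and `packMatrices` are hypothetical names, and the alignment argument stands for the device's `minUniformBufferOffsetAlignment`. The resulting offsets are the values that would be supplied, one per draw, through `pDynamicOffsets` when binding the descriptor set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Plain 4x4 float matrix, 64 bytes.
struct Mat4 {
    float m[16];
};

// Rounds value up to the next multiple of alignment (alignment must be a power of two).
constexpr std::size_t alignUp(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

struct PackedUniforms {
    std::vector<unsigned char> bytes;           // contents for one upload to the uniform buffer
    std::vector<std::uint32_t> dynamicOffsets;  // one byte offset per object
};

// Lays the matrices out at an aligned stride so that each one is a legal dynamic offset.
PackedUniforms packMatrices(const std::vector<Mat4>& matrices, std::size_t offsetAlignment) {
    const std::size_t stride = alignUp(sizeof(Mat4), offsetAlignment);
    PackedUniforms out;
    out.bytes.resize(stride * matrices.size());
    out.dynamicOffsets.reserve(matrices.size());
    for (std::size_t i = 0; i < matrices.size(); ++i) {
        std::memcpy(out.bytes.data() + i * stride, &matrices[i], sizeof(Mat4));
        out.dynamicOffsets.push_back(static_cast<std::uint32_t>(i * stride));
    }
    return out;
}
```

In the per-object loop, the draw for object i would rebind the same descriptor set with `dynamicOffsets[i]` instead of uploading a new matrix, so nothing is written to the buffer while the GPU is reading it.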
You either have to use push constants or have separate uniform buffers for each location. These can be bound either with a descriptor per location or with a dynamic offset.
In Sascha's example you can have more than just the one matrix inside the uniform.
That means that inside UploadUniformBuffer you append the new matrix to the buffer and bind the new location.