Vulkan vkCmdDraw with variable instance count

When I define command buffer I need to specify vertex count and instance count beforehand. Does it mean that if I want to update the number of instances dynamically, I need to recompile the entire command buffer all over again? Just changing this single number seems like a small and innocent tweak. There should be a more efficient way of doing that.

vkCmdDrawIndirect allows for draw operations whose parameters are fetched from a VkBuffer at execution time. This allows you to change the contents of that buffer object, and the change will be reflected in the indirect draw call the command buffer performs...
Assuming you did proper synchronization, at any rate.
After all, you cannot modify the storage associated with a VkBuffer while a command that could be using that storage is executing. So if you want to change the data in that memory, you need some kind of synchronization between the final indirect draw command that reads from the buffer and whatever process writes the data. If the writer is an on-GPU process (a transfer operation, for example), an event is a natural fit: set it after the last indirect draw that reads the buffer, and wait on it before the write.
However, event setting is not something you can do within a render pass, so the set will have to wait until the entire render pass is over.
The most efficient way to handle this is to double-buffer your indirect draw buffers. On one frame, you write to one piece of memory and execute commands that read from it. On the next frame, you write to a different piece of memory while the GPU executes commands that read from the previous one. On the third frame, you go back to the first piece of memory (using the synchronization you set up to ensure that the GPU is finished with it).
Of course, if you're insisting on static command buffers, this means that the command buffers themselves must also be double-buffered. One CB reads the indirect data from one buffer, and the other CB reads from the other.
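The double-buffering idea above is mostly CPU-side bookkeeping, so it can be sketched without a GPU. The struct below mirrors the layout of VkDrawIndirectCommand from the Vulkan spec (in real code you would include <vulkan/vulkan.h>), and the two slots stand in for two persistently mapped VkBuffers; all names here are illustrative, not from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors VkDrawIndirectCommand from the Vulkan spec: four tightly
// packed uint32_t fields. In real code, include <vulkan/vulkan.h>.
struct DrawIndirectCommand {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t firstVertex;
    uint32_t firstInstance;
};

// Double-buffered indirect parameters: the CPU writes one slot while
// the GPU reads the other. `slots` stands in for two mapped VkBuffers.
struct IndirectDoubleBuffer {
    std::array<DrawIndirectCommand, 2> slots{};
    uint64_t frame = 0;

    // Write this frame's parameters; returns the slot index whose
    // VkBuffer this frame's (pre-recorded) command buffer reads.
    std::size_t writeFrame(uint32_t instanceCount, uint32_t vertexCount) {
        std::size_t slot = frame % 2;  // alternate slots each frame
        slots[slot] = {vertexCount, instanceCount, 0, 0};
        ++frame;                        // next frame uses the other slot
        return slot;
    }
};
```

With static command buffers, as noted below, you would pre-record one command buffer per slot, each issuing vkCmdDrawIndirect against its own buffer.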

Related

When is it safe to write over and reuse a MTLBuffer or other Metal vertex buffer?

I'm just getting started with Metal, and am having trouble grasping some basic things. I've been reading a whole bunch of web pages about Metal, and working through Apple's examples, and so forth, but gaps in my understanding remain. I think my key point of confusion is: what is the right way to handle vertex buffers, and how do I know when it's safe to reuse them? This confusion manifests in several ways, as I'll describe below, and maybe those different manifestations of my confusion need to be addressed in different ways.
To be more specific, I'm using an MTKView subclass in Objective-C on macOS to display very simple 2D shapes: an overall frame for the view with a background color inside, 0+ rectangular subframes inside that overall frame with a different background color inside them, and then 0+ flat-shaded squares of various colors inside each subframe. My vertex function is just a simple coordinate transformation, and my fragment function just passes through the color it receives, based on Apple's triangle demo app. I have this working fine for a single subframe with a single square. So far so good.
There are several things that puzzle me.
One: I could design my code to render the whole thing with a single vertex buffer and a single call to drawPrimitives:, drawing all of the (sub)frames and squares in one big bang. This is not optimal, though, as it breaks the encapsulation of my code, in which each subframe represents the state of one object (the thing that contains the 0+ squares); I'd like to allow each object to be responsible for drawing its own contents. It would be nice, therefore, to have each object set up a vertex buffer and make its own drawPrimitives: call. But since the objects will draw sequentially (this is a single-threaded app), I'd like to reuse the same vertex buffer across all of these drawing operations, rather than having each object have to allocate and own a separate vertex buffer. But can I do that? After I call drawPrimitives:, I guess the contents of the vertex buffer have to be copied over to the GPU, and I assume (?) that is not done synchronously, so it wouldn't be safe to immediately start modifying the vertex buffer for the next object's drawing. So: how do I know when Metal is done with the buffer and I can start modifying it again?
Two: Even if #1 has a well-defined answer, such that I could block until Metal is done with the buffer and then start modifying it for the next drawPrimitives: call, is that a reasonable design? I guess it would mean that my CPU thread would be repeatedly blocking to wait for the memory transfers, which is not great. So does that pretty much push me to a design where each object has its own vertex buffer?
Three: OK, suppose each object has its own vertex buffer, or I do one "big bang" render of the whole thing with a single big vertex buffer (this question applies to both designs, I think). After I call presentDrawable: and then commit on my command buffer, my app will go off and do a little work, and then will try to update the display, so my drawing code now executes again. I'd like to reuse the vertex buffers I allocated before, overwriting the data in them to do the new, updated display. But again: how do I know when that is safe? As I understand it, the fact that commit returned to my code doesn't mean Metal is done copying my vertex buffers to the GPU yet, and in the general case I have to assume that could take an arbitrarily long time, so it might not be done yet when I re-enter my drawing code. What's the right way to tell? And again: should I just block waiting until they are available (however I'm supposed to do that), or should I have a second set of vertex buffers that I can use in case Metal is still busy with the first set? (That seems like it just pushes the problem down the pike, since when my drawing code is entered for the third update both previously used sets of buffers might not yet be available, right? So then I could add a third set of vertex buffers, but then the fourth update...)
Four: For drawing the frame and subframes, I'd like to just write a reusable "drawFrame" type of function that everybody can call, but I'm a bit puzzled as to the right design. With OpenGL this was easy:
- (void)drawViewFrameInBounds:(NSRect)bounds
{
int ox = (int)bounds.origin.x, oy = (int)bounds.origin.y;
glColor3f(0.77f, 0.77f, 0.77f);
glRecti(ox, oy, ox + 1, oy + (int)bounds.size.height);
glRecti(ox + 1, oy, ox + (int)bounds.size.width - 1, oy + 1);
glRecti(ox + (int)bounds.size.width - 1, oy, ox + (int)bounds.size.width, oy + (int)bounds.size.height);
glRecti(ox + 1, oy + (int)bounds.size.height - 1, ox + (int)bounds.size.width - 1, oy + (int)bounds.size.height);
}
But with Metal I'm not sure what a good design is. I guess the function can't just have its own little vertex buffer declared as a local static array, into which it throws vertices and then calls drawPrimitives:, because if it gets called twice in a row Metal might not yet have copied the vertex data from the first call when the second call wants to modify the buffer. I obviously don't want to have to allocate a new vertex buffer every time the function gets called. I could have the caller pass in a vertex buffer for the function to use, but that just pushes the problem out a level; how should the caller handle this situation, then? Maybe I could have the function append new vertices onto the end of a growing list of vertices in a buffer provided by the caller; but this seems to either force the whole render to be completely pre-planned (so that I can preallocate a big buffer of the right size to fit all of the vertices everybody will draw – which requires the top-level drawing code to somehow know how many vertices every object will end up generating, which violates encapsulation), or to do a design where I have an expanding vertex buffer that gets realloc'ed as needed when its capacity proves insufficient. I know how to do these things; but none of them feels right. I'm struggling with what the right design is, because I don't really understand Metal's memory model well enough, I think. Any advice? Apologies for the very long multi-part question, but I think all of this goes to the same basic lack of understanding.
The short answer to your underlying question is: you should not overwrite resources that are used by commands added to a command buffer until that command buffer has completed. The best way to determine that is to add a completion handler. You could also poll the status property of the command buffer, but that's not as good.
First, until you commit the command buffer, nothing is copied to the GPU. Further, as you noted, even after you commit the command buffer, you can't assume the data has been fully copied to the GPU.
Second, you should, in the simple case, put all drawing for a frame into a single command buffer. Creating and committing a lot of command buffers (like one for every object that draws) adds overhead.
These two points combined means you can't typically reuse a resource during the same frame. Basically, you're going to have to double- or triple-buffer to get correctness and good performance simultaneously.
A typical technique is to create a small pool of buffers guarded by a semaphore. The semaphore count is initially the number of buffers in the pool. Code which wants a buffer waits on the semaphore and, when that succeeds, takes a buffer out of the pool. It should also add a completion handler to the command buffer that puts the buffer back in the pool and signals the semaphore.
You could use a dynamic pool of buffers. If code wants a buffer and the pool is empty, it creates a buffer instead of blocking. Then, when it's done, it adds the buffer to the pool, effectively increasing the size of the pool. However, there's typically no point in doing that. You would only need more than three buffers if the CPU is running way ahead of the GPU, and there's no real benefit to that.
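The semaphore-guarded pool described above is pure CPU-side logic, so it can be sketched independently of Metal. This sketch uses a mutex and condition variable in place of dispatch_semaphore, and a plain struct in place of MTLBuffer; in a real app, release() would be called from the command buffer's addCompletedHandler: block. All names are illustrative.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <vector>

// Stand-in for an MTLBuffer (or any per-frame resource).
struct Buffer { int id; };

class BufferPool {
public:
    explicit BufferPool(std::vector<Buffer> buffers)
        : pool_(std::move(buffers)) {}

    // CPU thread: blocks until a buffer is free, i.e. until some
    // earlier command buffer's completion handler returned one.
    Buffer acquire() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return !pool_.empty(); });
        Buffer b = pool_.back();
        pool_.pop_back();
        return b;
    }

    // Called from the command buffer's completion handler.
    void release(Buffer b) {
        {
            std::lock_guard<std::mutex> lock(m_);
            pool_.push_back(b);
        }
        cv_.notify_one();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<Buffer> pool_;
};
```

A pool of three buffers is the usual compromise: the CPU can run up to two frames ahead of the GPU without ever blocking on acquire().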
As to your desire to have each object draw itself, that can certainly be done. I'd use a large vertex buffer along with some metadata about how much of it has been used so far. Each object that needs to draw will append its vertex data to the buffer and encode its drawing commands referencing that vertex data. You would use the vertexStart parameter to have the drawing command reference the right place in the vertex buffer.
You should also consider indexed drawing with the primitive restart value so there's only a single draw command which draws all of the primitives. Each object would add its primitive to the shared vertex data and index buffers and then some high level controller would do the draw.
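The shared-vertex-buffer idea (each object appends its vertices, then draws with the returned offset as vertexStart) can be sketched like this. Vertex and the class are illustrative stand-ins for data written into a mapped MTLBuffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Placeholder vertex type; a real one would match the shader's layout.
struct Vertex { float x, y; };

class SharedVertexBuffer {
public:
    // Append `count` vertices; returns the index of the first one,
    // which the caller passes as vertexStart to its draw call
    // (drawPrimitives:vertexStart:vertexCount:).
    std::size_t append(const Vertex* verts, std::size_t count) {
        std::size_t start = data_.size();
        data_.insert(data_.end(), verts, verts + count);
        return start;
    }

    std::size_t vertexCount() const { return data_.size(); }

    // Call once per frame, only after the previous frame that used
    // this buffer has completed (see the completion-handler advice).
    void reset() { data_.clear(); }

private:
    std::vector<Vertex> data_;  // stands in for the mapped MTLBuffer
};
```

This keeps each object's drawing encapsulated while still using one buffer and one command buffer per frame.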

Why do I need resources per swapchain image

I have been following different tutorials and I don't understand why I need resources per swapchain image instead of per frame in flight.
This tutorial:
https://vulkan-tutorial.com/Uniform_buffers
has a uniform buffer per swapchain image. Why would I need that if different images are not in flight at the same time? Can I not start rewriting if the previous frame has completed?
Also lunarg tutorial on depth buffers says:
And you need only one for rendering each frame, even if the swapchain has more than one image. This is because you can reuse the same depth buffer while using each image in the swapchain.
This doesn't explain anything, it basically says you can because you can. So why can I reuse the depth buffer but not other resources?
It is to minimize synchronization in the case of the simple Hello Cube app.
Let's say your uniforms change each frame. That means the main loop is something like:
1. Poll (or simulate)
2. Update (e.g. your uniforms)
3. Draw
4. Repeat
If step #2 did not have its own uniform buffer, then it would need to write to a uniform buffer the previous frame is still reading. That means it has to sync with a fence, and waiting on that fence means the previous frame is no longer considered "in flight".
It all depends on the way you are using your resources and the performance you want to achieve.
If, after each frame, you are willing to wait for the rendering to finish and you are still happy with the final performance, you can use only one copy of each resource. Waiting is the easiest synchronization: you are sure the resources are not used anymore, so you can reuse them for the next frame. But if you want to efficiently utilize both the CPU and the GPU, and you don't want to wait after each frame, then you need to look at how each resource is actually used.
The depth buffer is usually used only temporarily. If you don't perform any postprocessing, and your render pass uses depth data only internally (you don't specify STORE for storeOp), then you can use a single depth buffer (depth image) all the time. When rendering is done, the depth data isn't needed anymore and can be safely discarded. This applies to all other resources that don't need to persist between frames.
But if different data needs to be used for each frame, or if data generated in one frame is used in the next, then you usually need another copy of a given resource. Updating data requires synchronization; to avoid waiting in such situations, you need a second copy of the resource. So in the case of uniform buffers, you update the data in a given buffer and use it for a given frame. You cannot modify its contents until that frame has finished, so to prepare the next frame of animation while the previous one is still being processed on the GPU, you need another copy.
The same applies if data generated in one frame is required in the next (for example, a framebuffer used for screen-space reflections). Reusing the same resource would cause its contents to be overwritten. That's why you need another copy.
You can find more information here: https://software.intel.com/en-us/articles/api-without-secrets-the-practical-approach-to-vulkan-part-1
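The "one copy per frame in flight" bookkeeping reduces to a small ring of resource slots. This sketch shows only the index cycling; in real Vulkan code each slot would hold its own VkBuffer, and you would wait on that frame's fence before rewriting the slot. Names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Two frames in flight is the common choice; three also appears in
// the wild. Each slot owns one copy of every per-frame resource.
constexpr uint32_t MAX_FRAMES_IN_FLIGHT = 2;

struct FrameRing {
    uint32_t current = 0;

    // Returns which copy of the per-frame resources this frame uses,
    // then advances. Before writing slot N again, you would wait on
    // the fence signaled by frame N's queue submission.
    uint32_t nextFrameSlot() {
        uint32_t slot = current;
        current = (current + 1) % MAX_FRAMES_IN_FLIGHT;
        return slot;
    }
};
```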

Vulkan descriptor binding

In my Vulkan application I used to draw meshes like this when all the meshes used the same texture:
UpdateDescriptorSets(texture)
Record command buffer
{
    for each mesh:
        bind transform UBO
        draw mesh
}
But now I want each mesh to have a unique texture, so I tried this:
Record command buffer
{
    for each mesh:
        bind transform UBO
        UpdateDescriptorSets(textures[meshIndex])
        draw mesh
}
But it gives an error saying the descriptor set is destroyed or updated. I looked in the Vulkan documentation and found out that I can't update a descriptor set during command buffer recording. So how can I give each mesh a unique texture?
vkUpdateDescriptorSets is not synchronized with anything. Therefore, you cannot update a descriptor set while it is in use. You must ensure that all rendering operations that use the descriptor set in question have finished, and that no commands have been recorded into command buffers that use the set in question.
It's basically like a global variable; you can't have people accessing a global variable from numerous threads without some kind of synchronization. And Vulkan doesn't synchronize access to descriptor sets.
There are several ways to deal with this. You can give each object its own descriptor set. This is usually done by having the frequently changing descriptor set data be of a higher index than the less frequently changing data. That way, you're not changing every descriptor for each object, only the ones that change on a per-object basis.
You can use push constant data to index into large tables/array textures. So the descriptor set would have an array texture or an array of textures (if you have dynamic indexing for arrays of textures). A push constant would provide an index, which is used by the shader to fetch that particular object's texture from the array texture/array of textures. This makes frequent changes fairly cheap, and the same index can also be used to give each object its own transformation matrices (by fetching into an array of matrices).
If you have the extension VK_KHR_push_descriptor available, then you can integrate changes to descriptors directly into the command buffer. How much better this is than the push constant mechanism is of course implementation-dependent.
If you update a descriptor set then all command buffers that this descriptor set is bound to will become invalid. Invalid command buffers cannot be submitted or be executed by the GPU.
What you basically need to do is to update descriptor sets before you bind them.
This odd behavior is there because in vkCmdBindDescriptorSets some implementations take the vulkan descriptor set, translate it to native descriptor tables and then store it in the command buffer. So if you update the descriptor set after vkCmdBindDescriptorSets the command buffer will be seeing stale data. VK_EXT_descriptor_indexing extension relaxed this behavior under some circumstances.

What would be the value of VkAccessFlags for a VkBuffer or VkImage after being allocated memory?

So I create a bunch of buffers and images, and I need to set up a memory barrier for some reason.
How do I know what to specify in the srcAccessMask field for the barrier struct of a newly created buffer or image, seeing as at that point I wouldn't have specified the access flags for it? How do I decide what initial access flags to specify for the first memory barrier applied to a buffer or image?
Specifying initial values for other parameters in Vk*MemoryBarrier is easy since I can clearly know, say, the original layout of an image, but it isn't apparent to me what the value of srcAccessMask could be the first time I set up a barrier.
Is it based on the usage flags specified during creation of the object concerned? Or is there some other way that can be used to find out?
So, let's assume vkCreateImage and VK_IMAGE_LAYOUT_UNDEFINED.
Nowhere does the specification say that vkCreateImage defines some scheduled operation. So it is healthy to assume all its work is done as soon as it returns. Plus, the image does not even have memory bound to it yet.
So any synchronization needs would concern the memory you bind to it. Let's assume it is fresh memory from vkAllocateMemory. Similarly, nowhere does the specification say that it defines some scheduled operation.
Even so, there are really only two options. Either the implementation does nothing with the memory, or it null-fills it (for security reasons). If it null-fills it, that must be done in a way that prevents you from accessing the original data (even by exploiting synchronization errors). So it is healthy to assume the memory carries no "synchronization baggage".
So simply srcStageMask = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT (no previous outstanding scheduled operation) and srcAccessMask = 0 (no previous writes) should be correct.
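Concretely, the source half of the first barrier on a freshly created and bound image might look like the sketch below. The two constants mirror their values in vulkan_core.h so the snippet stands alone; in real code you would include <vulkan/vulkan.h> and fill in the destination stage, access mask, and new layout for the first actual use.

```cpp
#include <cassert>
#include <cstdint>

// These mirror vulkan_core.h so the sketch is self-contained.
constexpr uint32_t VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT = 0x00000001;
constexpr uint32_t VK_ACCESS_NONE = 0;

// Source half of the first barrier on a new image, per the reasoning
// above: no previously scheduled work, no previous writes.
struct FirstUseBarrier {
    uint32_t srcStageMask  = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
    uint32_t srcAccessMask = VK_ACCESS_NONE;  // nothing wrote this memory
    // oldLayout would be VK_IMAGE_LAYOUT_UNDEFINED; dstStageMask,
    // dstAccessMask, and newLayout depend on the first real use.
};
```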

Secondary command buffer granularity

This is sort of a question which probably needs benchmarking to answer properly.
But maybe some of you have already done this and can share some experience. By "granularity" I mean the number of draw calls per secondary command buffer. The Vulkan samples ("hologram", for example) use one draw call per secondary command buffer. This seems quite inefficient to me, because you would have to rebind everything for every single draw call, since every command buffer has its own state. An alternative would be to batch draw calls in secondary command buffers.
I think it all depends on whether Vulkan actually "rebinds" the same pipeline and descriptor sets for every command buffer, even though they are the same as in the previous command buffer.
Do you have any experience with this?