Storage buffer synchronisation between draw calls with vulkan

Storage buffer synchronisation between draw calls with vulkan - vulkan

I'm using a storage buffer to store a per pixel linked list from a fragment shader with fragment shader interlock. Everything works fine within a single draw call but I'm having trouble synchronizing the storage buffer between consecutive draw calls.
My understanding is that you can perform a pipeline barrier within a render pass for the fragment shader stage but only if it concerns an image bound to the framebuffer.
Do I have to call vkCmdEndRenderPass then vkCmdPipelineBarrier then vkCmdBeginRenderPass between each draw call or is there a better solution ?

Synchronization in this case is expected to be done through the interlock operation itself. If your interlock doesn't respect primitive ordering, then you're not supposed to care about the primitive order, so no synchronization should be necessary. If your interlock does care about primitive ordering, then primitive ordering will be imposed by the interlock, so no other synchronization is necessary.
Primitive order defines that primitives generated by one draw call are ordered before all primitives from a later draw call. So if you're doing primitive order interlocking, then by definition critical sections from primitives in one rendering command will be ordered after those from a prior one.
So there's no need for barriers; what you want is primitive ordering.
Now, if you have one group of commands that only need critical sections and don't care about order, but a later group that themselves only need critical sections but need to come after the first, that is a contradiction. The second group does care about ordering, so they should use primitive ordering, not unordered.
Note that in order to make previous writes visible, you need the coherent qualifier. Ordering only guarantees ordering.

Related

In a 2d application where you're drawing a lot of individual sprites, will the rasterization stage inevitably become a bottleneck? [duplicate]

I'm in the processing of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely needs different shaders, I need to use multiple pipelines and submit multiples commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan specs states that the execution order of commands are not likely to be the order that I record the commands, thus I need synchronization of some kind. In this Reddit post several methods of exactly achieving my goals was proposed, and I once believed that I must use multiple subpasses (together with subpass dependency) or barriers or other synchronization methods like that to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples, in the ImGui example though, I see no synchronization between the two draw calls, it just record the command to draw the model first, and then the command to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.

Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/wright.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate.
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.

Is it better to store uniform descriptors in a single buffer or use seperate buffers for each frame?

My application has a max of 2 frames in flight. The vertex shader takes a uniform buffer containing matrices. Because 2 frames could potentially be rendering at the same time I believe I need to have separate memory for each frame's uniform buffer.
Is it preferable to create a single buffer than holds the uniforms of both frames and use an offset within the buffer to do updates. Or is it better that each frame have its own buffer?

Technically you can do either; it ultimately boils down to what makes the most sense implementation-wise. Provided you can ensure that you're not overwriting buffer data that's still active (and being consumed on the GPU), one memory allocation would be sufficient. It's a big advantage of the Vulkan API (since you have full control) but it does make life more complicated.
In my use-case, I use pages of allocations (kinda like a heap model) where I allocate on demand and return blocks when they're done (basically, if a reference is removed, I age out for however many frames of buffering I have and then free the block). This is targeted at uniform buffers that change infrequently.
For per-draw uniform data, I use push constants - this might be worth looking at for your use-case. Since the per-instance matrix is "use once then discard" and is relatively small, this actually makes life simpler, and could even have a performance benefit, to boot.

Order: Fragment OPs guaranteed after vertex OPs? Logical & rasterization order seem to weak

In my quest to fully understand synchronization, I've stumbled over different order guarantees. The strongest one I think is the rasterization order, which makes strong guarantees about the order of fragment operations for individual pixels.
A weaker and more general order is the logical order of the pipeline stages. To quote the bible:
Pipeline stages that execute as a result of a command logically complete execution in a specific order, such that completion of a logically later pipeline stage must not happen-before completion of a logically earlier stage. [...] Similarly, initiation of a logically earlier pipeline stage must not happen-after initiation of a logically later pipeline stage.
That guarantee seems pretty weak as it seems to allow to run all pipeline stages at the same time as long as they start and end in the correct order.
That leads me to one consequence: Doesn't all this make it possible for the vertex stage to not be finished before the fragment stage starts? This is considering the case for a single triangle. Since I think thisis absolutely not what's happening (or possible), it would be nice to find out where the spec makes that guarantee.

There's one problem with your thinking. Pipeline is not a Finite State Machine. They may look the same when expressed as diagram, but they are not the same. Pipeline stages do not "run", because they are not FSM states. Instead queue operations run through the pipeline (hence the name). In reality, one command can spawn multiple vertex shader invocations. Geometry shader can spawn multiple (or no) fragments shader invocations. Only thing that is guaranteed here is that things do not go against the pipeline direction of flow (e.g. that fragment shader invocations never spawn new vertex shaders).
That being said, you are looking in the wrong part of the specification. The paragraph you are quoting "only" specifies the logical order. I.e. that pipeline stages are added implicitly to synchronization commands as appropriate. Logically-earlier stages are implicitly added to any source scope parameter, and logically-later stages are added to any destination stage parameter. But careful, this does not say anything about side-effects of the shaders, and it does not apply to memory dependency, which have to have the stage explicitly stated to work.
What you are looking for is Shader Execution chapter:
The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.

What is a "Push-Constant" in vulkan?

I've looked through a bunch of tutorials that talk about push constants, allude to possible benefits, but never have I actually seen, even in Vulkan docs, what the heck a "push constant" actually is... I don't understand what they are supposed to be, and what the purpose of push constants are. The closest thing I can find is this post which unfortunately doesn't ask for what they are, but what are the differences between them and another concept and didn't help me much.
What is a push constant, why does it exist and what is it used for? where did its name come from?

Push constants is a way to quickly provide a small amount of uniform data to shaders. It should be much quicker than UBOs but a huge limitation is the size of data - spec requires 128 bytes to be available for a push constant range. Hardware vendors may support more, but compared to other means it is still very little (for example 256 bytes).
Because push constants are much quicker than other descriptors (resources through which we provide data to shaders), they are convenient to use for data that changes between draw calls, like for example transformation matrices.
From the shader perspective, they are declared through layout( push_constant ) qualifier and a block of uniform data. For example:
layout( push_constant ) uniform ColorBlock {
vec4 Color;
} PushConstant;
From the application perspective, if shaders want to use push constants, they must be specified during pipeline layout creation. Then the vkCmdPushConstants() command must be recorded into a command buffer. This function takes, among others, a pointer to a memory from which data to a push constant range should be copied.
Different shader stages of a given pipeline can use the same push constant block (similarly to UBOs) or smaller parts of the whole range. But, what is important, each shader stage can use only one push constant block. It can contain multiple members, though. Another important thing is that the total data size (across all shader stages which use push constants) must fit into the size constraint. So the constraint is not per stage but per whole range.
There is an example in the Vulkan Cookbook's repository showing a simple push constant usage scenario. Sascha Willems's Vulkan examples also contain a sample showing how to use push constants.

Vulkan: Is there a way to draw multiple objects in different locations like in DirectX12?

In DirectX12, you render multiple objects in different locations using the equivalent of a single uniform buffer for the world transform like:
// Basic simplified pseudocode
SetRootSignature();
SetPrimitiveTopology();
SetPipelineState();
SetDepthStencilTarget();
SetViewportAndScissor();
for (auto object : objects)
{
SetIndexBuffer();
SetVertexBuffer();
struct VSConstants
{
QEDx12::Math::Matrix4 modelToProjection;
} vsConstants;
vsConstants.modelToProjection = ViewProjMat * object->GetWorldProj();
SetDynamicConstantBufferView(0, sizeof(vsConstants), &vsConstants);
DrawIndexed();
}
However, in Vulkan, if you do something similar with a single uniform buffer, all the objects are rendered in the location of last world matrix:
for (auto object : objects)
{
SetIndexBuffer();
SetVertexBuffer();
UploadUniformBuffer(object->GetWorldProj());
DrawIndexed();
}
Is there a way to draw multiple objects with a single uniform buffer in Vulkan, just like in DirectX12?
I'm aware of Sascha Willem's Dynamic uniform buffer example (https://github.com/SaschaWillems/Vulkan/tree/master/dynamicuniformbuffer) where he packs many matrices in one big uniform buffer, and while useful, is not exactly what I am looking for.
Thanks in advance for any help.

I cannot find a function called SetDynamicConstantBufferView in the D3D 12 API. I presume this is some function of your invention, but without knowing what it does, I can only really guess.
It looks like you're uploading data to the buffer object while rendering. If that's the case, well, Vulkan can't do that. And that's a good thing. Uploading to memory that you're currently reading from requires synchronization. You have to issue a barrier between the last rendering command that was reading the data you're about to overwrite, and the next rendering command. It's just not a good idea if you like performance.
But again, I'm not sure exactly what that function is doing, so my understanding may be wrong.
In Vulkan, descriptors are generally not meant to be changed in the middle of rendering a frame. However, the makers of Vulkan realized that users sometimes want to draw using different subsets of the same VkBuffer object. This is what dynamic uniform/storage buffers are for.
You technically don't have multiple uniform buffers; you just have one. But you can use the offset(s) provided to vkCmdBindDescriptorSets to shift where in that buffer the next rendering command(s) will get their data from. So it's a light-weight way to supply different rendering commands with different data.
Basically, you rebind your descriptor sets, but with different pDynamicOffset array values. To make these work, you need to plan ahead. Your pipeline layout has to explicitly declare those descriptors as being dynamic descriptors. And every time you bind the set, you'll need to provide the offset into the buffer used by that descriptor.
That being said, it would probably be better to make your uniform buffer store larger arrays of matrices, using the dynamic offset to jump from one block of matrices to the other. You would tehn
The point of that is that the uniform data you provide (depending on hardware) will remain in shader memory unless you do something to change the offset or shader. There is some small cost to uploading such data, so minimizing the need for such uploads is probably not a bad idea.
So you should go and upload all of your objects buffer data in a single DMA operation. Then you issue a barrier, and do your rendering, using dynamic offsets and such to tell each offset where it goes.

You either have to use Push constants or have separate uniform buffers for each location. These can be bound either with a descriptor per location of dynamic offset.
In Sasha's example you can have more than just the one matrix inside the uniform.
That means that inside UploadUniformBuffer you append the new matrix to the buffer and bind the new location.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas