During which pipeline stage is blending performed?

The Vulkan spec states:
VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT specifies the stage of the pipeline after blending where the final color values are output from the pipeline.
This seems to imply some undefined stage between fragment-shader and color-attachment-output where blending takes place.
But let's say that after writing to an image with a transfer operation I want to use it as a color attachment, so I add a memory dependency with srcStageMask=VK_PIPELINE_STAGE_TRANSFER_BIT, srcAccessMask=VK_ACCESS_TRANSFER_WRITE_BIT, dstStageMask=VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, and dstAccessMask=VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT. If blending took place before the color-attachment-output stage, it could read data that is not yet visible.
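For concreteness, such a barrier might be recorded like this (a sketch; the command buffer, the image handle, and the layout transition are assumptions on top of the masks quoted above):

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

/* Sketch: make a transfer write available to blending/attachment output.
 * `cmd` and `image` are assumed handles; the layouts are one plausible choice. */
void transfer_to_attachment_barrier(VkCommandBuffer cmd, VkImage image)
{
    VkImageMemoryBarrier barrier = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                         VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = image,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_TRANSFER_BIT,                /* srcStageMask */
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, /* dstStageMask */
        0, 0, NULL, 0, NULL, 1, &barrier);
}
```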
So what does the spec actually mean in this case?

It's important to remember several facts:
You can only do blending within a render pass.
An image used as an attachment during a render pass cannot be transferred to.
Given these facts, a render pass has to have begun between the transfer to the image and the attempt to blend with that image. And note that your blending operation is relying on the data in the image to be what it was when the render pass began. That means your loadOp for that attachment needs to load the image, not clear it.
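In VkRenderPass terms, that attachment's description might look like this (a sketch; the format, sample count, and layouts are assumptions):

```c
#include <vulkan/vulkan.h>

/* Sketch: a color attachment whose prior contents must survive the
 * render pass begin so they can be blended with. */
VkAttachmentDescription colorAttachment = {
    .format = VK_FORMAT_R8G8B8A8_UNORM,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .loadOp = VK_ATTACHMENT_LOAD_OP_LOAD,   /* load, don't clear */
    .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
    .stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE,
    .stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
    .initialLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
};
```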
And in order for the render pass's begin operation to load the image... it must synchronize with prior modifications to that image. And the specification does spell out which stage actually performs the load operation and how all of this works:
The load operation for each sample in an attachment happens-before any recorded command which accesses the sample in the first subpass where the attachment is used. Load operations for attachments with a depth/stencil format execute in the VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT pipeline stage. Load operations for attachments with a color format execute in the VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT pipeline stage.
So you don't need to synchronize the transfer with the blend operation; you need to synchronize the transfer with the render pass. And the stage for that is COLOR_ATTACHMENT_OUTPUT.
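Expressed on the render pass itself rather than as a separate barrier, that synchronization could be a single external subpass dependency (a sketch, reusing the masks from the question):

```c
#include <vulkan/vulkan.h>

/* Sketch: synchronize work before the render pass (the transfer) with
 * the load op and blending in subpass 0. */
VkSubpassDependency transferToLoad = {
    .srcSubpass = VK_SUBPASS_EXTERNAL,
    .dstSubpass = 0,
    .srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                     VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dependencyFlags = 0,
};
```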
As to the deeper point of your question (what stage does blending), the answer is that Vulkan doesn't allow it to matter. Images being used as an attachment in a render pass can only be used in very limited ways. As previously stated, you can't just perform arbitrary transfer operations to them. You can't perform arbitrary write operations to them. You can only access their data as color/depth/stencil attachments and/or as input attachments.
Synchronization between blending in different rendering commands (in the same render pass) is handled automatically. And you can't write to an image via an input attachment (hence the name). So there's no special need to make blending products visible to other operations.
Basically, blending never needs an explicit stage because of the restrictions of the render pass model and the ordering guarantees of blending and other per-sample operations.

I'm in the process of learning Vulkan, and I have just integrated ImGui into my code using the Vulkan-GLFW example in the original ImGui repo, and it works fine.
Now I want to render both the GUI and my 3D model on the screen at the same time, and since the GUI and the model definitely need different shaders, I need to use multiple pipelines and submit multiple commands. The GUI is partly transparent, so I would like it to be rendered after the model. The Vulkan spec states that commands are not necessarily executed in the order I record them, so I figured I need synchronization of some kind. In this Reddit post several methods of achieving exactly my goal were proposed, and I once believed that I must use multiple subpasses (together with subpass dependencies), barriers, or other such synchronization methods to solve this problem.
Then I had a look at SaschaWillems' Vulkan examples; in the ImGui example, though, I see no synchronization between the two draw calls. It just records the command to draw the model first, and then the command to draw the GUI.
I am confused. Is synchronization really needed in this case, or did I misunderstand something about command re-ordering or blending? Thanks.
Think about what you're doing for a second. Why do you think there needs to be synchronization between the two sets of commands? Because the second set of commands needs to blend with the data in the first set, right? And therefore, it needs to do a read/modify/write (RMW), which must be able to read data written by the previous set of commands. The data being read has to have been written, and that typically requires synchronization.
But think a bit more about what that means. Blending has to read from the framebuffer to do its job. But... so does the depth test, right? It has to read the existing sample's depth value, compare it with the incoming fragment, and then discard the fragment or not based on the depth test. So basically every draw call that uses a depth test contains a framebuffer read/modify/write.
And yet... your depth tests work. Not only do they work between draw calls without explicit synchronization, they also work within a draw call. If two triangles in a draw call overlap, you don't have any problem with seeing the bottom one through the top one, right? You don't have to do inter-triangle synchronization to make sure that the previous triangles' writes are finished before the reads.
So somehow, the depth test's RMW works without any explicit synchronization. So... why do you think that this is untrue of the blend stage's RMW?
The Vulkan specification states that commands, and stages within commands, will execute in a largely unordered way, with several exceptions. The most obvious being the presence of explicit execution barriers/dependencies. But it also says that the fixed-function per-sample testing and blending stages will always execute (as if) in submission order (within a subpass). Not only that, it requires that the triangles generated within a command also execute these stages (as if) in a specific, well-defined order.
That's why your depth test doesn't need synchronization; Vulkan requires that this is handled. This is also why your blending will not need synchronization (within a subpass).
So you have plenty of options (in order from fastest to slowest):
Render your UI in the same subpass as the non-UI. Just change pipelines as appropriate (see the sketch after this list).
Render your UI in a subpass with an explicit dependency on the framebuffer images of the non-UI subpass. While this is technically slower, it probably won't be slower by much if at all. Also, this is useful for deferred rendering, since your UI needs to happen after your lighting pass, which will undoubtedly be its own subpass.
Render your UI in a different render pass. This would only really be needed for cases where you need to do some full-screen work (SSAO) that would force your non-UI render pass to terminate anyway.
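For the first option, here is a minimal sketch of recording both draws in one subpass (every handle, count, and name below is an assumption, not part of the original answer):

```c
#include <vulkan/vulkan.h>

/* Sketch of option 1: model and UI drawn back-to-back in one subpass. */
void record_scene_and_ui(VkCommandBuffer cmd,
                         VkPipeline modelPipeline, VkPipeline uiPipeline,
                         const VkRenderPassBeginInfo *rpBegin,
                         uint32_t modelVertexCount, uint32_t uiVertexCount)
{
    vkCmdBeginRenderPass(cmd, rpBegin, VK_SUBPASS_CONTENTS_INLINE);

    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, modelPipeline);
    /* ...bind the model's descriptor sets and vertex buffers here... */
    vkCmdDraw(cmd, modelVertexCount, 1, 0, 0);

    /* No barrier: blending in the UI draw is ordered after the model
     * draw's attachment writes within the subpass. */
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_GRAPHICS, uiPipeline);
    /* ...bind the UI's descriptor sets and vertex buffers here... */
    vkCmdDraw(cmd, uiVertexCount, 1, 0, 0);

    vkCmdEndRenderPass(cmd);
}
```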

Do you need synchronization between two consecutive render passes if the second render pass depends on the first one's output?

Say the first render pass generates a rendered texture as output, and the second render pass samples that texture in a shader and renders it to the swapchain.
I don't know if I can do this kind of rendering inside a single render pass using subpasses. Can subpass attachments have different sizes? In other words, can subpasses behave like render textures?
It seems like there are a few separate questions wrapped up here, so I'll have a stab at answering them in order!
First, there's no need to explicitly synchronize two render passes where the second relies on the output from the first, provided they are either recorded on the same command buffer in the correct order, or (if recorded on separate command buffers) submitted in the correct order. The GPU will consume commands in the order submitted, so there's an implicit synchronization there.
If you are consuming the output from a render pass (or subpass) by sampling it inside a shader (which you will need to do if the sizes differ; see below), rather than setting up a subpass output as an input attachment in a later subpass, then you will need to do so in a separate render pass.
If you are consuming the output from a previous subpass as an input attachment, that implies you are using pixel-local load operations inside your shader (where framebuffer attachments written in a previous subpass are read at the exact same location in a subsequent subpass). This requires that attachments be the same size, since all operations occur at the same pixel location.
From the Vulkan specification:
The subpasses in a render pass all render to the same dimensions, and fragments for pixel (x,y,layer) in one subpass can only read attachment contents written by previous subpasses at that same (x,y,layer) location. For multi-pixel fragments, the pixel read from an input attachment is selected from the pixels covered by that fragment in an implementation-dependent manner. However, this selection must be made consistently for any fragment with the same shading rate for the lifetime of the VkDevice.
So if your attachments vary in size, this would imply you need to consume your initial output in a separate renderpass.
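For contrast, the same-size, input-attachment path within one render pass might be declared like this (a sketch; the attachment indices and layouts are assumptions):

```c
#include <vulkan/vulkan.h>

/* Sketch: subpass 0 writes attachment 0; subpass 1 reads it as an input
 * attachment while writing attachment 1 (e.g. the swapchain image). */
VkAttachmentReference scratchWrite = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkAttachmentReference scratchRead  = { 0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
VkAttachmentReference finalWrite   = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };

VkSubpassDescription subpasses[2] = {
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .colorAttachmentCount = 1, .pColorAttachments = &scratchWrite },
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 1, .pInputAttachments = &scratchRead,
      .colorAttachmentCount = 1, .pColorAttachments = &finalWrite },
};

/* A subpass dependency (0 -> 1) is still required so that subpass 1's
 * fragment reads happen after subpass 0's attachment writes. */
VkSubpassDependency dep = {
    .srcSubpass = 0, .dstSubpass = 1,
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
```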

What are layout transitions in graphics programming

I'm following vulkan-tutorial.com, and in the images portion of the tutorial it mentions layout transitions but doesn't elaborate on what they are. I don't like to copy and paste code without knowing exactly what it does, and I can't find a sufficient explanation in the tutorial or on Google.
A "layout transition" is exactly what those words mean. It's when you transition the layout of an image sub-resource from one layout to another. So your question really seems to be... what is a layout?
In the Vulkan abstraction, images are composed of sub-resources. These represent distinct sections of an image which can be manipulated independently of other sections. For example, each mipmap level of a mipmapped image is a sub-resource.
At any particular time that an image sub-resource is being used by a GPU process, that sub-resource has a layout. This is part of the Vulkan abstraction of GPU operations, so exactly what it means to the GPU will vary from chip to chip.
The important part is this: layouts restrict how you can use an image sub-resource. Or more to the point, in order to use an image sub-resource in a particular way, it must be in a layout which permits that usage.
When a sub-resource is in the VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL layout, for example, you can only perform operations that read from the sub-resource within a shader. The shader cannot write to the image, nor can the image be used as a render target.
Now, the VK_IMAGE_LAYOUT_GENERAL layout allows pretty much any use at any time while the image is in that layout. However, this can also mean less optimal performance. Any of the more restricted layouts can make accesses to the image more performance-friendly (depending on the hardware).
So it is your job to keep track of the layout of any image sub-resources you plan to use. For most images, you're going to use the transfer destination layout to upload to them and then just leave them as shader read-only, because most images aren't used more arbitrarily than that. So in practice, this means keeping track of render targets that you want to read from, as well as swapchain images (you have to transition them to the present layout before presenting them) and storage images.
Layout transitions typically happen as part of an explicit dependency between two operations. This makes sense; if you're uploading data to an image, and you later want to read from it, you need a dependency between the upload and the read. You may as well do the layout transition then, since the transition can modify the way the bytes of the image are stored, so you need the transfer to be done first.
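The common upload-then-sample case bundles the transition with exactly that dependency; a sketch (the command buffer and image handles are assumed):

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

/* Sketch: transition an image from its upload layout to a sampleable
 * layout, while also making the transfer write visible to shader reads. */
void upload_to_sample_transition(VkCommandBuffer cmd, VkImage texture)
{
    VkImageMemoryBarrier toShaderRead = {
        .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
        .oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
        .newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
        .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
        .image = texture,
        .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
    };
    vkCmdPipelineBarrier(cmd,
        VK_PIPELINE_STAGE_TRANSFER_BIT,        /* after the upload */
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, /* before sampling */
        0, 0, NULL, 0, NULL, 1, &toShaderRead);
}
```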

Why can some commands be recorded only outside of a render pass?

I don't know whether it is an API feature (I'm almost sure it's not) or a GPU-specific thing, but why can, for example, vkCmdWaitEvents be recorded both inside and outside of a render pass, while vkCmdResetEvent can be recorded only outside? The same applies to other commands.
When it comes to event setting in particular, they play havoc with how the render pass model interacts with tile-based renderers.
Recall that the whole point of the complexity of the render pass model is to service the needs of tile-based renderers (TBRs). When a TBR encounters a complex series of subpasses, the way it wants to execute them is as follows.
It does all of the vertex processing stages for all of the rendering commands for all of the subpasses, all at once, storing the resulting vertex data in a buffer for later consumption. Then for each tile, it executes the rasterization stages for each subpass on the primitives that are involved in the building of that tile.
Note that this is the ideal case; specific things can make it fail to various degrees, but even then, it tends to fail in batches, where you can still execute several subpasses of a render pass like this.
So let's say you want to set an event in the middle of a subpass. OK... when does that actually happen? Remember that the set-event command actually sets the event only after all of the preceding commands have completed. In a TBR, if everything is proceeding as above, when does it get set? Well, ideally, all vertex processing for the entire render pass is supposed to happen before any rasterization, so setting the event would have to happen after the vertex processing is done. And all rasterization processing happens on a tile-by-tile basis, processing whichever primitives overlap that tile. Because of this fragmented rendering process, it's difficult to know when an individual rendering command has completed.
So the only place the set-event call could happen is... after the entire renderpass has completed. That is obviously not very useful.
The alternative is to have the act of issuing a vkCmdSetEvent call fundamentally reshape how the implementation builds the entire render pass, breaking the subpass up into the work that happens before the event and the work that happens after it.
But the reason why VkRenderPass is so big and complex, the reason why VkPipelines have to reference a specific subpass of a render pass, and the reason why vkCmdPipelineBarrier within a render pass requires you to specify a subpass self-dependency, is so that a TBR implementation can know up front when and where it will have to break the ideal TBR rendering scheme. Having a function introduce that breakup without forewarning works against this idea.
Furthermore, Vulkan is designed so that, if something would have to be implemented highly inefficiently, it is either impossible to do directly or the API makes it look really inefficient. vkCmd(Re)SetEvent cannot be efficiently implemented within a render pass on TBR hardware, so you can't do it, period.
Note that vkCmdWaitEvents doesn't have this problem, because the system knows that the wait is waiting on something outside of a render pass. So it's just some particular stage that has to wait on the event to complete. If it's a vertex stage doing the waiting, it's easy enough to set that wait at the beginning of that command's processing. If it's a fragment stage, it can just insert the wait at the beginning of all rasterization processing; it's not the most efficient way to handle it, but since all vertex processing has executed, odds are good that the event has been set by then.
For other kinds of commands, recall that the dependency graph of everything that happens within a render pass is defined within VkRenderPass itself. The subpass dependency graph is there. You can't even issue a normal vkCmdPipelineBarrier within a render pass, not unless that subpass has an explicit self-dependency in the subpass dependency graph.
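For reference, such a self-dependency has to be declared up front when the render pass is created; here is a sketch with one plausible choice of stage and access masks:

```c
#include <vulkan/vulkan.h>

/* Sketch: a self-dependency on subpass 0, declared at render pass
 * creation, which is what later permits a matching vkCmdPipelineBarrier
 * inside that subpass. */
VkSubpassDependency selfDependency = {
    .srcSubpass = 0,
    .dstSubpass = 0, /* same subpass: a self-dependency */
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    /* Required here: a self-dependency between framebuffer-space
     * stages must be per-region. */
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
```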
So what good would it be to issue a compute shader dispatch or a memory transfer operation in the middle of a subpass if you cannot wait for the operation to finish in that subpass or a later one? If you can't wait on the operation to end, then you cannot use its results. And if you can't use its results... you may as well have issued it before the render pass.
And the reason you can't have other dependencies goes back to TBRs. The dependency graph is an inseparable part of the render pass to allow TBRs to know up-front what the relationship between subpasses is. That allows them to know whether they can build their ideal renderer, and when/where that might break down.
Since the TBR model of render passes makes such waiting impractical, there's no point in allowing you to issue such commands.
Because a render pass is a special construct that implies focusing work solely on the framebuffer.
In addition, the subpasses are allowed to run in parallel unless there is an explicit dependency between them.
This affects how they would need to be synchronized with instructions in the other subpasses.
Doing copies dominates use of the memory bus and would stall render work that depends on them. Doing a copy inside the render pass creates a big GPU bubble that can easily be avoided by putting the copy outside the render pass and making sure it's finished by the time the render pass starts.
Some hardware also has dedicated copy units separate from the graphics hardware, so the less synchronization you need to do between them, the better.

Why do I need resources per swapchain image

I have been following different tutorials and I don't understand why I need resources per swapchain image instead of per frame in flight.
This tutorial:
https://vulkan-tutorial.com/Uniform_buffers
has a uniform buffer per swapchain image. Why would I need that if different images are not in flight at the same time? Can I not start rewriting it once the previous frame has completed?
Also, the LunarG tutorial on depth buffers says:
And you need only one for rendering each frame, even if the swapchain has more than one image. This is because you can reuse the same depth buffer while using each image in the swapchain.
This doesn't explain anything; it basically says you can because you can. So why can I reuse the depth buffer but not other resources?
It is to minimize synchronization in the case of the simple Hello Cube app.
Let's say your uniforms change each frame. That means main loop is something like:
Poll (or simulate)
Update (e.g. your uniforms)
Draw
Repeat
If step #2 did not have its own uniform buffer, then it would need to write to a uniform buffer the previous frame is still reading. That means it would have to sync with a fence, which would mean the previous frame is no longer considered "in flight".
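A sketch of what "its own uniform" looks like per frame in flight (every name here is an assumption; the point is that the loop waits only on this slot's fence, so the other frame stays in flight):

```c
#include <stdint.h>
#include <vulkan/vulkan.h>

#define MAX_FRAMES_IN_FLIGHT 2

VkDevice device;
VkBuffer uniformBuffers[MAX_FRAMES_IN_FLIGHT]; /* one per frame in flight */
VkFence  inFlightFences[MAX_FRAMES_IN_FLIGHT];
uint32_t currentFrame = 0;

void update_uniforms(VkBuffer buffer); /* hypothetical helper */

void draw_frame(void)
{
    /* Block only until the GPU is done with THIS slot's previous
     * submission; the other slot's frame can still be executing. */
    vkWaitForFences(device, 1, &inFlightFences[currentFrame], VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &inFlightFences[currentFrame]);

    /* Safe: no in-flight work reads this slot's buffer anymore. */
    update_uniforms(uniformBuffers[currentFrame]);
    /* ...record commands and submit, signaling inFlightFences[currentFrame]... */

    currentFrame = (currentFrame + 1) % MAX_FRAMES_IN_FLIGHT;
}
```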
It all depends on the way you are using your resources and the performance you want to achieve.
If, after each frame, you are willing to wait for the rendering to finish and you are still happy with the final performance, you can use only one copy of each resource. Waiting is the easiest synchronization: you are sure that the resources are not used anymore, so you can reuse them for the next frame. But if you want to efficiently utilize both the CPU's and the GPU's power, and you don't want to wait after each frame, then you need to look at how each resource is being used.
The depth buffer is usually used only temporarily. If you don't perform any postprocessing, and if your render pass setup uses depth data only internally (you don't specify STORE for storeOp), then you can use a single depth buffer (depth image) all the time. This is because when rendering is done, the depth data isn't used anymore and can be safely discarded. This applies to all other resources that don't need to persist between frames.
But if different data needs to be used for each frame, or if data generated in one frame is used in the next, then you usually need another copy of a given resource. Updating data requires synchronization; to avoid waiting in such situations, you need to have a copy of the resource. So in the case of uniform buffers, you update the data in a given buffer and use it for a given frame. You cannot modify its contents until that frame is finished, so to prepare another frame of animation while the previous one is still being processed on the GPU, you need to use another copy.
The same applies if the generated data is required for the next frame (for example, a framebuffer used for screen-space reflections). Reusing the same resource would cause its contents to be overwritten. That's why you need another copy.
You can find more information here: https://software.intel.com/en-us/articles/api-without-secrets-the-practical-approach-to-vulkan-part-1