Vulkan: Is the rendering pipeline executed once per subpass?

Considering a RenderPass that has multiple Subpasses:
Is the implication of multiple subpasses that the entire rendering pipeline is executed once per subpass?
And that the image output of a prior subpass is accessible to subsequent subpasses, assuming correct subpass dependencies?
(with the stipulation that reading the prior image data happens at the same pixel location, for tile-based GPU optimization)
I understand that hardware may optimize things out; it's more of a way of thinking about how the multi-subpass processing happens.
Extending this to multiple render passes: is it then the same as with subpasses, except that image data from prior render passes can be accessed at any location, and that the synchronization between render passes uses different mechanisms than between subpasses?

Pipeline is not "executed". Pipeline just exists. That's why it is called a pipeline, and not a state machine. The queue operations are the things that are executed.
With Vulkan's render passes it is good to know how tile-based architectures work. First they need to sort everything into tiles; that means they need to know the position of everything upfront. So, the geometry processing (vertex shader, geometry shader, tessellation shaders, and all the relevant fixed-function stages) needs to be finished for all the queue operations before pixel processing (fragment shader, framebuffer writes, and other fixed-function stages) starts for any of them.
From that, the subpass restrictions are derived:
If srcSubpass is equal to dstSubpass and not all of the stages in srcStageMask and dstStageMask are framebuffer-space stages, the logically latest pipeline stage in srcStageMask must be logically earlier than or equal to the logically earliest pipeline stage in dstStageMask
I.e. you cannot have a vertex shader dependency waiting on a fragment shader output of previous ops. But you can have "framebuffer-space" dependencies; e.g. fragment shader waiting on fragment shader of previous ops.
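For a concrete sketch (the access masks here are just one plausible choice), a legal self-dependency keeps everything in framebuffer-space stages, e.g. color attachment writes of earlier draws made visible to fragment shader reads of later draws:

```c
/* Hypothetical self-dependency for subpass 0. All stages involved are
 * framebuffer-space, so the restriction quoted above does not forbid it;
 * a self-dependency with framebuffer-space stages must also set
 * VK_DEPENDENCY_BY_REGION_BIT. */
VkSubpassDependency selfDependency = {
    .srcSubpass      = 0,
    .dstSubpass      = 0,  /* srcSubpass == dstSubpass: self-dependency */
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
/* By contrast, srcStageMask = FRAGMENT_SHADER with dstStageMask =
 * VERTEX_SHADER in a self-dependency would be invalid: the vertex shader
 * is logically earlier and is not a framebuffer-space stage. */
```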

Subpass dependencies are just another abstraction in the Vulkan API for expressing synchronization between different commands (each of which can run through multiple pipeline stages). W.r.t. render passes, subpass dependencies serve two purposes:
Expressing synchronization between the commands submitted within different render passes (that is when the VK_SUBPASS_EXTERNAL subpass id is being used, see VkSubpassDependency).
Expressing synchronization between the commands submitted within the same or different subpasses. In this case, pairs of (0 and 0), or (0 and 1), and so on are specified for srcSubpass and dstSubpass in a VkSubpassDependency structure, respectively.
Given the correct synchronization scopes, a subsequent subpass can read the rendered results of a previous subpass. Framebuffer attachments can be passed on via input attachments, which are specified in VkSubpassDescription. You can get an overview of this in this lecture from 43:28 onwards.
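As an illustration (the attachment indices are made up), a later subpass can list an earlier subpass's color attachment under pInputAttachments; the fragment shader then reads it at the current fragment's location with subpassLoad():

```c
/* Subpass 1 reads attachment 0 (written by subpass 0 as a color
 * attachment) as an input attachment, and writes its own color output
 * to attachment 1. Indices are illustrative. */
VkAttachmentReference inputRef = {
    .attachment = 0,
    .layout     = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL,
};
VkAttachmentReference colorRef = {
    .attachment = 1,
    .layout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
};
VkSubpassDescription subpass1 = {
    .pipelineBindPoint    = VK_PIPELINE_BIND_POINT_GRAPHICS,
    .inputAttachmentCount = 1,
    .pInputAttachments    = &inputRef,   /* read via subpassLoad() in GLSL */
    .colorAttachmentCount = 1,
    .pColorAttachments    = &colorRef,
};
```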
Regarding the "rendering pipeline is executed"-thing: The lecture mentioned above explains commands and how they
go through pipeline stages in quite some detail starting from 22:29. It should make things a lot clearer.
Regarding tiled GPUs: If you are referring to VK_DEPENDENCY_BY_REGION_BIT for VkSubpassDependency::dependencyFlags, the spec says the following:
VK_DEPENDENCY_BY_REGION_BIT specifies that dependencies will be framebuffer-local.
That means, you can only use the following pipeline stages with that flag:
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT
VK_PIPELINE_STAGE_EARLY_FRAGMENT_TESTS_BIT
VK_PIPELINE_STAGE_LATE_FRAGMENT_TESTS_BIT
VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT
Valuable information about tile-based architectures is given in the other answer by krOoze already.
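Putting this together, a sketch of a framebuffer-local dependency between two subpasses might look like the following (access masks chosen for an input-attachment read; adjust them to your actual usage):

```c
/* Subpass 1's fragment shader reads, at the same pixel location, what
 * subpass 0 wrote to its color attachment. Only framebuffer-space stages
 * are involved, so the dependency may be framebuffer-local. */
VkSubpassDependency dependency = {
    .srcSubpass      = 0,
    .dstSubpass      = 1,
    .srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
```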

Related

Order: Fragment OPs guaranteed after vertex OPs? Logical & rasterization order seem too weak

In my quest to fully understand synchronization, I've stumbled over different order guarantees. The strongest one I think is the rasterization order, which makes strong guarantees about the order of fragment operations for individual pixels.
A weaker and more general order is the logical order of the pipeline stages. To quote the bible:
Pipeline stages that execute as a result of a command logically complete execution in a specific order, such that completion of a logically later pipeline stage must not happen-before completion of a logically earlier stage. [...] Similarly, initiation of a logically earlier pipeline stage must not happen-after initiation of a logically later pipeline stage.
That guarantee seems pretty weak, as it appears to allow all pipeline stages to run at the same time as long as they start and end in the correct order.
That leads me to one consequence: Doesn't all this make it possible for the vertex stage to not be finished before the fragment stage starts? This is considering the case of a single triangle. Since I think this is absolutely not what's happening (or possible), it would be nice to find out where the spec makes that guarantee.
There's one problem with your thinking. Pipeline is not a Finite State Machine. They may look the same when expressed as a diagram, but they are not the same. Pipeline stages do not "run", because they are not FSM states. Instead queue operations run through the pipeline (hence the name). In reality, one command can spawn multiple vertex shader invocations. A geometry shader can spawn multiple (or no) fragment shader invocations. The only thing that is guaranteed here is that things do not go against the pipeline direction of flow (e.g. that fragment shader invocations never spawn new vertex shaders).
That being said, you are looking in the wrong part of the specification. The paragraph you are quoting "only" specifies the logical order. I.e. that pipeline stages are added implicitly to synchronization commands as appropriate: logically-earlier stages are implicitly added to any source stage mask, and logically-later stages are added to any destination stage mask. But careful, this does not say anything about side effects of the shaders, and it does not apply to memory dependencies, which need the stage explicitly stated to work.
What you are looking for is the Shader Execution chapter:
The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.

VkSubpassDependency specification clarification

I am trying to understand the specification of the VkSubpassDependency structure.
A link to the VkSubpassDependency structure is on the specification page.
The part that confuses me a lot is the relationship between (srcSubpass, dstSubpass) and (srcStageMask, dstStageMask).
The case where srcSubpass is equal to dstSubpass is the same as a pipeline barrier and doesn't raise any questions for me. However, the other cases are quite questionable.
The specification says:
If srcSubpass is equal to VK_SUBPASS_EXTERNAL, the first synchronization scope includes commands that occur earlier in submission order than the vkCmdBeginRenderPass used to begin the render pass instance. Otherwise, the first set of commands includes all commands submitted as part of the subpass instance identified by srcSubpass and any load, store or multisample resolve operations on attachments used in srcSubpass. In either case, the first synchronization scope is limited to operations on the pipeline stages determined by the source stage mask specified by srcStageMask.
By using the definition of synchronization scopes given by the specification in synchronization-dependencies, I interpret the specification this way:
Suppose srcSubpass and dstSubpass are equal to the same actual subpass number, and srcStageMask and dstStageMask are equal to some pipeline stage, for example VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
In this case I interpret the dependency as working in the following way: srcSubpass executes freely over the pipeline stages until it reaches the stage given in the srcStageMask parameter, and stalls until dstSubpass reaches the stage described in dstStageMask.
Now suppose srcSubpass is equal to VK_SUBPASS_EXTERNAL, dstSubpass is equal to an actual existing subpass number, and srcStageMask and dstStageMask are equal to the same pipeline stage.
In that case I am getting confused about determining the first synchronization scope.
Because srcSubpass is equal to VK_SUBPASS_EXTERNAL, the first synchronization scope includes commands that occur earlier in submission order than the vkCmdBeginRenderPass used
to begin the render pass instance, as the specification says. But the external commands may not interact with the listed pipeline stages at all, which makes the interpretation of srcStageMask quite ambiguous to me. Does the line from the specification, "In either case, the first synchronization scope is limited to operations on the pipeline stages determined by the source stage mask specified by srcStageMask.", refer to the pipeline stages inside dstSubpass, so that the first synchronization scope extends down to the srcStageMask stage in dstSubpass?
You are largely answering your questions. I think the problem is you have the wrong intuition about pipeline stages.
Think about all the commands you have submitted to a queue. Each command is at any point in some pipeline stage, and all commands proceed through the pipeline independently (I should say asynchronously). Think of the pipeline as a playing board, and think of the commands as pegs that go on that board. Everything else should start making sense with that new intuition.
A Pipeline Barrier / Subpass Dependency introduces a dependency. The effect is all the commands in the source scope have to finish at least the srcStageMask stage in their execution, before any of the commands in the destination scope are even allowed to start the dstStageMask stage of their execution.
srcSubpass == dstSubpass is a special case that does nothing but declare a subpass self-dependency. All it does is allow you to later use vkCmdPipelineBarrier inside that Subpass recording. That works mostly like a normal Pipeline Barrier, except it is limited to the commands inside that Subpass, and it is not allowed to change Image Layouts.
srcSubpass < dstSubpass case introduces a dependency between those two Subpasses. The commands recorded in the srcSubpass are the source scope, and the commands in dstSubpass are the destination scope. I.e. all commands in the source subpass must reach at least srcStage, before any of the commands in the destination subpass are allowed to start the dstStage of their execution.
VK_SUBPASS_EXTERNAL refers to the outside of the Render Pass Instance.
I.e. if srcSubpass == VK_SUBPASS_EXTERNAL, then the source scope is all commands recorded before vkCmdBeginRenderPass and anything earlier in submission order. So, this Subpass Dependency would say all commands before the render pass instance have to reach at least their srcStage stage of execution, before any of the commands of the dstSubpass enter their dstStage stages.
If dstSubpass == VK_SUBPASS_EXTERNAL then the destination scope is all commands recorded after vkCmdEndRenderPass (and also later in submission order). So, this Subpass Dependency would say all commands recorded in the srcSubpass subpass have to reach at least their srcStage stage of execution, before any of the commands after the Render Pass Instance enter their dstStage stages of execution.
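As a sketch of the VK_SUBPASS_EXTERNAL case (the stage and access choices here are just one common pattern, e.g. for a swapchain color attachment):

```c
/* Everything submitted before vkCmdBeginRenderPass must reach the
 * color-attachment-output stage before subpass 0 starts writing its
 * color attachment. */
VkSubpassDependency externalDependency = {
    .srcSubpass    = VK_SUBPASS_EXTERNAL,
    .dstSubpass    = 0,
    .srcStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask  = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .srcAccessMask = 0,
    .dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
};
```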

Why can some commands be recorded only outside of a render pass?

I don't know whether it is an API feature (I'm almost sure it's not) or a GPU specific, but why, for example, can vkCmdWaitEvents be recorded both inside and outside of a render pass, while vkCmdResetEvent can be recorded only outside? The same applies to other commands.
When it comes to event setting in particular, they play havoc with how the render pass model interacts with tile-based renderers.
Recall that the whole point of the complexity of the render pass model is to service the needs of tile-based renderers (TBRs). When a TBR encounters a complex series of subpasses, the way it wants to execute them is as follows.
It does all of the vertex processing stages for all of the rendering commands for all of the subpasses, all at once, storing the resulting vertex data in a buffer for later consumption. Then for each tile, it executes the rasterization stages for each subpass on the primitives that are involved in the building of that tile.
Note that this is the ideal case; specific things can make it fail to various degrees, but even then, it tends to fail in batches, where you can still execute several subpasses of a render pass like this.
So let's say you want to set an event in the middle of a subpass. OK... when does that actually happen? Remember that the set-event command actually sets the event after all of the preceding commands have completed. In a TBR, if everything is proceeding as above, when does it get set? Well ideally, all vertex processing for the entire renderpass is supposed to happen before any rasterization, so setting the event has to happen after the vertex processing is done. And all rasterization processing happens on a tile-by-tile basis, processing whichever primitives overlap that tile. Because of the fragmented rendering process, it's difficult to know when an individual rendering command has completed.
So the only place the set-event call could happen is... after the entire renderpass has completed. That is obviously not very useful.
The alternative is to have the act of issuing a vkCmdSetEvent call fundamentally reshape how the implementation builds the entire render pass: to break up the subpass into the stuff that happens before the event and the stuff that happens after the event.
But the reason why VkRenderPass is so big and complex, the reason why VkPipelines have to reference a specific subpass of a render pass, and the reason why vkCmdPipelineBarrier within a render pass requires you to specify a subpass self-dependency, is so that a TBR implementation can know up front when and where it will have to break the ideal TBR rendering scheme. Having a function introduce that breakup without forewarning works against this idea.
Furthermore, Vulkan is designed so that, if something is going to have to be implemented highly inefficiently, then it is either impossible to do directly or the API makes it look really inefficient. vkCmd(Re)SetEvent cannot be efficiently implemented within a render pass on TBR hardware, so you can't do it, period.
Note that vkCmdWaitEvents doesn't have this problem, because the system knows that the wait is waiting on something outside of a render pass. So it's just some particular stage that has to wait on the event to complete. If it's a vertex stage doing the waiting, it's easy enough to set that wait at the beginning of that command's processing. If it's a fragment stage, it can just insert the wait at the beginning of all rasterization processing; it's not the most efficient way to handle it, but since all vertex processing has executed, odds are good that the event has been set by then.
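A sketch of the allowed split (handles and stage masks are illustrative, and additional valid-usage rules apply): the event is set outside the render pass instance, while the wait may be recorded inside it:

```c
/* Allowed: set the event outside the render pass instance... */
vkCmdSetEvent(cmdBuf, event, VK_PIPELINE_STAGE_TRANSFER_BIT);

vkCmdBeginRenderPass(cmdBuf, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_INLINE);

/* ...and wait on it inside; fragment work in this subpass will not start
 * its FRAGMENT_SHADER stage until the event has been signalled. */
vkCmdWaitEvents(cmdBuf, 1, &event,
                VK_PIPELINE_STAGE_TRANSFER_BIT,         /* srcStageMask */
                VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  /* dstStageMask */
                0, NULL, 0, NULL, 0, NULL);

vkCmdDraw(cmdBuf, 3, 1, 0, 0);
vkCmdEndRenderPass(cmdBuf);

/* Not allowed: vkCmdSetEvent / vkCmdResetEvent inside the render pass. */
```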
For other kinds of commands, recall that the dependency graph of everything that happens within a render pass is defined within VkRenderPass itself. The subpass dependency graph is there. You can't even issue a normal vkCmdPipelineBarrier within a render pass, not unless that subpass has an explicit self-dependency in the subpass dependency graph.
So what good would it be to issue a compute shader dispatch or a memory transfer operation in the middle of a subpass if you cannot wait for the operation to finish in that subpass or a later one? If you can't wait on the operation to end, then you cannot use its results. And if you can't use its results... you may as well have issued it before the render pass.
And the reason you can't have other dependencies goes back to TBRs. The dependency graph is an inseparable part of the render pass to allow TBRs to know up-front what the relationship between subpasses is. That allows them to know whether they can build their ideal renderer, and when/where that might break down.
Since the TBR model of render passes makes such waiting impractical, there's no point in allowing you to issue such commands.
Because a renderpass is a special construct that implies focusing work solely on the framebuffer.
In addition, each of the subpasses is allowed to run in parallel unless there is an explicit dependency between them.
This has an effect on how they would need to be synchronized to other instructions in the other subpasses.
Doing copies dominates use of the memory bus and would stall render work that depends on them. Doing that inside the renderpass creates a big GPU bubble that can easily be avoided by putting the copy outside and making sure it's finished by the time you start the renderpass.
Some hardware also has dedicated copy units that are separate from the graphics hardware, so the less synchronization you need between them the better.
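A minimal sketch of that pattern (all handles hypothetical): record the copy first, make it visible with a barrier, and only then begin the render pass:

```c
/* Copy outside of the render pass, then synchronize before rendering. */
vkCmdCopyBuffer(cmdBuf, stagingBuffer, vertexBuffer, 1, &copyRegion);

VkMemoryBarrier barrier = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT,
};
vkCmdPipelineBarrier(cmdBuf,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
                     0, 1, &barrier, 0, NULL, 0, NULL);

vkCmdBeginRenderPass(cmdBuf, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_INLINE);
/* ... draws that consume vertexBuffer ... */
vkCmdEndRenderPass(cmdBuf);
```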

vulkan: pWaitDstStageMask member in VkSubmitInfo

In VkSubmitInfo, when pWaitDstStageMask[0] is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, the Vulkan implementation executes pipeline stages without waiting for pWaitSemaphores[0] until it reaches the color attachment output stage.
However, if the command buffer has multiple subpasses and multiple draw commands, then does pWaitDstStageMask apply to the stages of all draw commands?
If I want the Vulkan implementation to wait on the semaphore when it reaches the color attachment output stage of the last subpass, what should I do?
You probably don't actually want to do this. On hardware that benefits from multi-subpass renderpasses, the fragment work for the entire renderpass will be scheduled and executed as essentially a single monolithic chunk of work. E.g. all subpasses will execute for some pixel (x,y) regions before any subpasses are executed for some other pixel (x,y) regions. So it doesn't really make sense to, say, insert a synchronization barrier on an external event between two subpasses. So you need to think about what your renderpass is doing and whether it is really open to the kinds of optimizations subpasses were designed for.
If not, then treating the subpasses (or at least the final one) as independent renderpasses isn't going to be a loss anyway, so you might as well just put it in a separate renderpass in a separate submit batch, and put the semaphore wait before it.
If so, then you just want to do the semaphore wait before the COLOR_ATTACHMENT stage for the whole renderpass anyway.
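For reference, a sketch of the submit-side setup (handles are hypothetical); the wait gates the chosen stage for the whole submission, not for an individual subpass:

```c
/* The semaphore wait holds back the color-attachment-output stage of
 * every command in this batch; it cannot target a single subpass. */
VkSemaphore          waitSemaphore = imageAvailableSemaphore;  /* hypothetical */
VkPipelineStageFlags waitStage     = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

VkSubmitInfo submitInfo = {
    .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
    .waitSemaphoreCount = 1,
    .pWaitSemaphores    = &waitSemaphore,
    .pWaitDstStageMask  = &waitStage,
    .commandBufferCount = 1,
    .pCommandBuffers    = &cmdBuf,
};
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
```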
In such a situation You have (I think) two options:
You can split the render pass - exclude the last subpass and submit its commands in a separate render pass recorded in a separate command buffer, so You can specify a semaphore for it to wait on (but this doesn't sound too reasonable), or...
You can use events - You should signal events after the commands which generate results later commands require, and then, in the last sub-pass You should wait on that event just before the commands that indeed need to wait.
The second approach is probably preferable (even though You are not using the submission's pWaitSemaphores and pWaitDstStageMask fields), but it also has its restrictions:
vkCmdWaitEvents must not be used to wait on event signal operations occurring on other queues.
And I'm not sure, but maybe subpass dependencies may also help You here. Clever definitions of the submission's pWaitSemaphores and the render pass's subpass dependencies may be enough to do the job. But I'm not too confident in explaining subpass dependencies (I'm not sure I fully understand them), so don't rely on this. Maybe someone can confirm this. But the above two options will definitely do the trick.

Multiple instances of same Vulkan subpass

I have been reading through many online tutorials on creating a Vulkan renderer, however, the idea of subpasses is still very unclear to me.
Say I have the following scenario: I need to do a first subpass for setup (fill a depth buffer for testing etc) then have a subpass for every light in the scene (the number of which could change at any time). Because each lighting subpass is exactly the same, would it be possible to declare 2 subpasses and have multiple instances of the same subpass?
The term "pass" here does not mean "full-screen pass" or something like that. Subpasses only matter in terms of what you're rendering to (and reading from previous subpass renderings as input attachments). Where your data comes from (descriptors/push constants), what vertex data they get, what shaders they use, none of that matters to the subpass. The only things the subpass controls are render targets.
So unless different lights are rendering to different images, then there's no reason to give each light a subpass. You simply issue the rendering commands for all of your lights within the same subpass.
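A rough sketch of that approach (all names hypothetical): one subpass, one loop over the lights, each light binding its own descriptor set before its draw:

```c
vkCmdBeginRenderPass(cmdBuf, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_INLINE);
vkCmdBindPipeline(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS, lightingPipeline);

/* One subpass is enough: each light only changes descriptors or push
 * constants, not the render targets. */
for (uint32_t i = 0; i < lightCount; ++i) {
    vkCmdBindDescriptorSets(cmdBuf, VK_PIPELINE_BIND_POINT_GRAPHICS,
                            pipelineLayout, 0, 1, &lightDescriptorSets[i],
                            0, NULL);
    vkCmdDraw(cmdBuf, 3, 1, 0, 0);  /* e.g. a full-screen triangle per light */
}

vkCmdEndRenderPass(cmdBuf);
```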