Is it possible to synchronize an automatic layout transition with a swapchain image acquisition via pWaitDstStageMask=TOP_OF_PIPE? - vulkan

It is necessary to synchronize automatic layout transitions performed by render passes with the acquisition of a swapchain image via the semaphore provided in vkAcquireNextImageKHR. Vulkan Tutorial states, "We could change the waitStages for the imageAvailableSemaphore to VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT to ensure that the render passes don't begin until the image is available" (Context: waitStages is an array of VkPipelineStageFlags supplied as pWaitDstStageMask to the VkSubmitInfo for the queue submission which performs the render pass and, consequently, the automatic layout transition). It then opts for an alternative (common) solution, which is to use an external subpass dependency instead.
In fact, I've only ever seen such synchronization done with a subpass dependency; I've never seen anyone actually do it by setting pWaitDstStageMask=TOP_OF_PIPE for the VkSubmitInfo corresponding to the queue submission which performs the automatic layout transition, and I'm not certain why this would even work. The specs provide certain guarantees about automatic layout transitions being synchronized with subpass dependencies, so I understand why they would work, but in order to do this sort of synchronization by waiting on a semaphore in a VkSubmitInfo, it is first necessary that image layout transitions even have pipeline stages to begin with. I am not aware of any pipeline stages which image layout transitions go through; I believed them to be entirely atomic operations which occur in between source-available operations and destination-visible operations in a memory dependency (e.g. subpass dependency, memory barrier, etc.), as described by the specs.
Do automatic image layout transitions go through pipeline stages? If so, which stages do they go through (perhaps I'm missing something in the specs)? If not, how could setting pWaitDstStageMask=TOP_OF_PIPE in a VkSubmitInfo waiting on the image acquisition semaphore have any synchronization effect on the automatic image layout transition?

You may be misunderstanding what the tutorial is saying.
Layout transitions do not happen within a pipeline stage. They, like all aspects of an execution dependency, happen between pipeline stages. A dependency states that, after the source stage completes but before the destination stage begins, the dependency's operations will happen.
So if the destination stage of an incoming external subpass dependency is the top of the pipe, then it is saying that the layout transition as part of the dependency will happen before the execution of any pipeline stages in the target subpass.
Remember that the default external subpass dependency for each attachment has the top of the pipe as its source scope. If the semaphore wait dependency's destination stage is the top of the pipe, then the source scope of the external subpass dependency is in the destination scope of the semaphore wait. And therefore, the external subpass dependency happens-after the semaphore wait dependency.
And, as previously stated, layout transitions happen between the source scope and the destination scope. So until the source scope, and everything it depends on, finishes execution, the layout transition will not happen.

Related

When does Image Layout Transition happen when no source or destination stage is specified

Based on the specs (https://registry.khronos.org/vulkan/specs/1.3-khr-extensions/pdf/vkspec.pdf):
It says "When a layout transition is specified in a memory dependency, it happens-after the availability operations in the memory dependency, and happens-before the visibility operations"
As we know, when calling vkCmdPipelineBarrier(layout1, layout2, ...), the srcAccessMask is the availability operation and the dstAccessMask is the visibility operation. I wonder: if I set both of them to 0, meaning there is no availability operation and no visibility operation in this memory dependency, will any layout transition from layout1 to layout2 actually happen after the barrier call?
To answer my own question:
Based on the specs, the queue present call will execute the visibility operation in this case.
https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/vkQueuePresentKHR.html
"Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image."
So we only need to set the src access mask to register an availability operation, and can leave the dst access mask at 0, leaving the visibility operation to the present call.
There is nothing in the standard which says that the availability operations that are part of a memory dependency are optional. They always happen as part of executing the memory dependency. They can be empty and make nothing available, but it still always happens. The same goes for visibility.
So the layout transition happens regardless of what is available or visible. This is useful as any writes being consumed may have been made available before now by some other operation.
The bigger issue is the lack of visibility for operations down the line. A layout transition is ultimately a write operation. So processes that need to use that image need to see the image in the new layout. So it needs to be visible to them. So if you do this, you will likely see your validation layers complain later about some form of memory hazard.

Confusion about VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT

When I read Vulkan Specs for VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, it says:
VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT specifies the stage of the pipeline after blending .... This stage also includes subpass load and store operations,....
I am really confused as why this stage can include subpass load and store operation.
From my understanding, in a subpass that performs drawing on a color attachment:
The subpass load operation happens first in submission order.
Then, there are graphics pipeline commands (vkCmdDraw) submitted afterwards. Among all those graphics pipeline stages, there is
a final color output stage after color blending. That stage is called VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
At the end, the subpass store operation happens.
Since those three are very distinct stages, each with its own purpose, how can they all be specified by the one VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT?
If I put this stage in dstStageMask, is it the subpass load operation that waits on srcStageMask, or the color output stage of a graphics pipeline?
Similarly, if I put this stage in srcStageMask, is Vulkan waiting for the previous subpass to perform its store operation, rather than for the color output stage in a graphics pipeline?
I am really confused as why this stage can include subpass load and store operation.
It "can" do so by fiat; it does so because the standard says that it does so.
All of this stuff is an abstraction, a model of a conceptual GPU that doesn't necessarily exist in any particular hardware. The job of a Vulkan implementation is to translate this abstraction for their particular hardware.
Subpass load/store operations may or may not be a distinct thing for any particular piece of hardware. Some hardware has them; others do not.
If these processes are distinct for a particular GPU, and a user specifies a dependency through the color output stage, it is that implementation's job to include whatever is necessary to make that dependency work with both the usual color outputs and the subpass load/store hardware. That is, if there's separate caching for both operations (somehow), the implementation has to handle both sets of caches.
So the answer to this question:
If I put this stage in dstStageMask, is it the subpass load operation that waits on srcStageMask, or the color output stage of a graphics pipeline?
is both. If these are two separate processes on a particular GPU, the implementation must make them appear as though they are the same process.
That being said, attachment load only happens before the first subpass that uses the attachment, and attachment store only happens after the last subpass that uses the attachment. So it's not like this is a big deal that impacts every dependency.

In Vulkan is an execution dependency not enough to ensure proper order?

I'm having trouble understanding why we specify stages of the pipeline for the pipeline barrier as well as access mask. The point of specifying a barrier with pipeline stages is to give instructions that all commands before a certain stage happen before all stages after a certain stage. So with this reasoning let's say I want to transfer data from a staging buffer to a device-local buffer. Before I read that buffer from a shader I need to have a barrier saying that the transfer stage must take place before the shader read stage after it. I can do this with:
srcstage = VK_PIPELINE_STAGE_TRANSFER_BIT;
dststage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
This is supposed to say that all commands before the barrier must complete the transfer stage before all commands after the barrier start the fragment shader stage.
However I don't understand the concept of access mask and why we use it together with the stage mask. Apparently the stage masks don't ensure "visibility" or "availability"??? So is it the case that this means that although the transfer stage will complete before the fragment stage will start there is no guarantee that the shader will read the proper data? Because maybe caches have not been flushed and not been made visible/available?
If that's the case and we need to specify access masks such as:
srcaccessmask = VK_ACCESS_MEMORY_WRITE_BIT;
dstaccessmask = VK_ACCESS_MEMORY_READ_BIT;
Then what is even the point of specifying stages if the stages isn't enough, and it comes down to the access mask?
Access masks ensure visibility, but you cannot have visibility over something that is not yet available (ie: you cannot see what hasn't happened yet). Visibility and availability are both necessary for a memory barrier, but each alone is insufficient.
After all, execution barriers do not strictly speaking need memory barriers. A write-after-read barrier does not need to ensure visibility. It simply needs to make sure that no reads happen after the write; a pure execution barrier is sufficient. But if you're doing a write-after-write or a read-after-write, then you need to ensure visibility in addition to availability.
An execution dependency is not enough to ensure proper order simply because Vulkan's memory model requires manual/explicit memory barriers.
In other words, an execution barrier only covers execution, but not the side effects of that execution. So you also need memory barrier(s) when you need the side effects to be coherent/consistent.
If the coherency is satisfied another way, the access mask is actually what is unnecessary:
semaphore_wait.pWaitDstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// we need to order our stuff after the semaphore wait, so:
barrier.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// semaphore already includes all device memory access, so nothing here:
barrier.srcAccessMask = 0;
barrier.oldLayout = COOL_LAYOUT;
barrier.newLayout = EVEN_BETTER_LAYOUT;
Using only access masks (without stage) would be ambiguous. Access masks can limit the amount of memory coherency work the driver needs to perform:
// make available the storage writes from fragment shader,
// but do not bother with storage writes
// from compute, vertex, tessellation, geometry, raytracing, or mesh shader:
barrier.srcStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT does implicitly include the vertex, tessellation, and geometry shader stages. But only for execution dependency purposes, and not for the memory dependency:
Note
Including a particular pipeline stage in the first synchronization scope of a command implicitly includes logically earlier pipeline stages in the synchronization scope. Similarly, the second synchronization scope includes logically later pipeline stages.
However, note that access scopes are not affected in this way - only the precise stages specified are considered part of each access scope.

Can vkQueuePresentKHR be synced using a pipeline barrier?

The fact that vkQueuePresentKHR gets a queue parameter makes me think that it is like a command that is delivered to the queue for execution. If so, it should be possible to make it wait (until the writing into the image to be presented is finished) using a pipeline barrier whose source stage is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT and whose destination is VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT. Or maybe even by an image barrier, to limit the sync constraint to the image only.
But the fact that in every tutorial and book the sync is done using a semaphore makes me think that my assumption is wrong. If so, why does vkQueuePresentKHR need a queue parameter? The semaphore parameter seems to be enough: when it is signaled, vkQueuePresentKHR can present the image according to the image index parameter and the swapchain handle parameter.
There are a couple of outstanding issues against the specification. Notably KhronosGroup/Vulkan-Docs#1308 is exactly your question.
Meanwhile everyone usually follows this language:
The processing of the presentation happens in issue order with other queue operations, but semaphores have to be used to ensure that prior rendering and other commands in the specified queue complete before the presentation begins.
Which implies a semaphore has to be used. And given we are not 110% sure, that means a semaphore should be used until we know better.
Another semi-official source is the sync wiki, which uses a semaphore.
Despite what this quote says, I think it is reasonable to believe it is also permissible to use other sync that makes the image visible before the vkQueuePresentKHR, such as a fence wait.
But just pipeline barriers are likely not sufficient. The presentation is outside the queue system:
However, the scope of this set of queue operations does not include the actual processing of the image by the presentation engine.
Additionally there is no VkPipelineStageFlagBit for it, and vkQueuePresentKHR is not included in the submission order, so it cannot be in the synchronization scope of any vkCmdPipelineBarrier.
The confusing part is this unfortunate wording:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine.
I believe the trick is the "before vkQueuePresentKHR is executed". As said above, vkQueuePresentKHR is not part of submission order, therefore you do not know if the memory was or wasn't made available via a pipeline barrier before the vkQueuePresentKHR is executed.
Presentation is a queue operation. That's why you submit it to a queue. A queue that will execute the presentation of the image. And specifically to a queue that is able to perform present operations.
As for how to synchronize... the specification is a bit ambiguous on this point.
Semaphores are definitely able to work; there's a specific callout for this:
Semaphores are not necessary for making the results of prior commands visible to the present:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image.
While provisions are made for semaphores, there is no specific statement of other things. In particular, if you don't wait on a semaphore, it's not clear what "happens-after the semaphore signal operation" means, since no such signal operation happened.
Now, the API for vkQueuePresentKHR makes it clear that you don't need to provide a semaphore to wait on:
waitSemaphoreCount is the number of semaphores to wait for before issuing the present request.
The number may be zero.
One might think that, as a queue operation, all prior synchronization on that queue would still affect presentation. For example, an external subpass dependency if you wrote to the swapchain image as an attachment. And it probably would... if not for one little problem.
See, synchronization is ultimately based on dependencies between stages. And presentation... doesn't have a stage. So while your source for the external dependency would be well-understood, it's not clear what destination stage would work. Even specifying the all-stages flag wouldn't necessarily work.
Does "not a stage" exist in the set of all stages?
In any case, it's best to just use a semaphore. You'll probably need one anyway, so just use that.

Using pipeline barriers instead of semaphores

I want to be sure that I understand pipeline barriers correctly.
So barriers are able to synchronize two command buffers provided the source stage of the second barrier is later than the destination stage of the first barrier. Is this correct?
Of course I will need to use semaphores if the command buffers execute during different iterations of the pipeline.
It seems to me that synchronisation is the hardest part to grasp in Vulkan. IMO the specification isn't clear enough about it.
Preamble:
Most of what applies to Vulkan Pipeline Barriers applies to generic barriers and memory barriers, so you can start there to build your intuition.
I would note that, though the specification is not a tutorial, it is reasonably clear and readable. Synchronization is perhaps the hardest part, and the description in the specification mirrors that. On top of that, memory barriers especially are novel to most (people are usually shielded from such concepts by a high-level language compiler).
Needed definitions:
A pipeline is an abstract scheme of how a unit of work is processed. There are roughly four types (though Vulkan does not tell vendors how to do things, as long as they follow the rules):
Host access pseudo-pipeline (with one stage)
Transfer (with one stage)
Compute (with one stage)
Graphics (with a lot of stages, i.e. DI→VI→VS→TCS→TES→GS→EFT→FS→LFT→Output)
There are special stages TOP (before anything is done), BOTTOM (after everything is finished), and ALL (which is the same as bitfield with all stages set).
An (action) command is a command that needs (one or more) passes through the pipeline. It must be recorded to a command buffer (with the exception of host reads and writes through vkMapMemory()).
A command buffer is a sequence of commands (in recorded order!). And a queue is likewise a sequence of recorded commands (concatenated from submitted command buffers).
The queue has some leeway in the order in which it executes the commands (it may reorder commands as long as the user-set state is preserved) and may also overlap commands (e.g. execute the VS of the next command before finishing the FS of the previous command). User-defined synchronization primitives set boundaries on this leeway. (There are also some implicit guarantees -- but it is better not to rely on them and to oversynchronize until confident.)
My take on explaining Pipeline Barriers:
(Maybe unfortunately) the Pipeline Barriers amalgamates three separate aspects -- execution barrier, memory barrier and layout transition (if it's image).
The execution barrier part ensures that all commands recorded before the barrier have reached at least the specified pipeline stage (or stages) in srcStageMask before any of the commands recorded after the barrier start executing their specified stage (or stages) in dstStageMask.
It handles only the execution dependency, not memory! The memory barrier part ensures that memory (caches) is properly flushed and invalidated somewhere within that execution dependency (i.e. after the depended-on commands and stages, and before the dependent ones).
You provide what kind of memory dependency it is and between what kinds of producers/consumers (so the driver can choose the appropriate action without tracking the state itself). Typically it is a write-read dependency (read-read and read-write do not need any memory synchronization, and write-write does not usually make much sense -- why would you overwrite some data without reading it first?).
A different data layout in memory may be advantageous (or even necessary) on some hardware. At the same time the memory dependency is handled, the data is reordered to adhere to the newly specified layout.
So barriers are able to synchronize two command buffers provided the source stage of the second barrier is later than the destination stage of the first barrier. Is this correct?
The 1.0.35 Vulkan specification has improved wording that makes this clear:
If vkCmdPipelineBarrier was recorded outside a render pass instance, the first synchronization scope includes every command submitted to the same queue before it, including those in the same command buffer and batch.
...
If vkCmdPipelineBarrier was recorded outside a render pass instance, the second synchronization scope includes every command submitted to the same queue after it, including those in the same command buffer and batch.
Note that there is no requirement on the source or destination stage. You can synchronize with a source as fragment shader and destination as vertex shader just fine.