Confusion about VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT

When I read Vulkan Specs for VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, it says:
VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT specifies the stage of the pipeline after blending .... This stage also includes subpass load and store operations,....
I am really confused as to why this stage can include subpass load and store operations.
From my understanding, in a subpass that draws to a color attachment:
The subpass load operation happens first in submission order.
Then the graphics pipeline work (vkCmdDraw) is submitted. Among all its pipeline stages there is a final color output stage that comes after color blending. That stage is called VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
At the end, the subpass store operation happens.
Since those three are very distinct steps, each with its own purpose, how can they all be specified by the single VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT?
If I put this stage in dstStageMask, is it the subpass load operation that waits for srcStageMask, or the color output stage of the graphics pipeline?
Similarly, if I put this stage in srcStageMask, is Vulkan waiting for the previous subpass to perform its store operation rather than for the color output stage of a graphics pipeline?

I am really confused as to why this stage can include subpass load and store operations.
It "can" do so by fiat; it does so because the standard says that it does so.
All of this stuff is an abstraction, a model of a conceptual GPU that doesn't necessarily exist in any particular hardware. The job of a Vulkan implementation is to translate this abstraction for their particular hardware.
Subpass load/store operations may or may not be a distinct thing for any particular piece of hardware. Some hardware has them; others do not.
If these processes are distinct for a particular GPU, and a user specifies a dependency through the color output stage, it is that implementation's job to include whatever is necessary to make that dependency work with both the usual color outputs and the subpass load/store hardware. That is, if there's separate caching for both operations (somehow), the implementation has to handle both sets of caches.
So the answer to this question:
If I put this stage in dstStageMask, is it the subpass load operation that waits for srcStageMask, or the color output stage of the graphics pipeline?
is both. If these are two separate processes on a particular GPU, the implementation must make them appear as though they are the same process.
That being said, attachment load only happens before the first subpass that uses the attachment, and attachment store only happens after the last subpass that uses the attachment. So it's not like this is a big deal that impacts every dependency.

Related

In Vulkan is an execution dependency not enough to ensure proper order?

I'm having trouble understanding why we specify stages of the pipeline for the pipeline barrier as well as access mask. The point of specifying a barrier with pipeline stages is to give instructions that all commands before a certain stage happen before all stages after a certain stage. So with this reasoning let's say I want to transfer data from a staging buffer to a device-local buffer. Before I read that buffer from a shader I need to have a barrier saying that the transfer stage must take place before the shader read stage after it. I can do this with:
srcstage = VK_PIPELINE_STAGE_TRANSFER_BIT;
dststage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
This is supposed to say that all commands before the barrier must complete the transfer stage before all commands after the barrier start the fragment shader stage.
However I don't understand the concept of access mask and why we use it together with the stage mask. Apparently the stage masks don't ensure "visibility" or "availability"??? So is it the case that this means that although the transfer stage will complete before the fragment stage will start there is no guarantee that the shader will read the proper data? Because maybe caches have not been flushed and not been made visible/available?
If that's the case and we need to specify access masks such as:
srcaccessmask = VK_ACCESS_MEMORY_WRITE_BIT;
dstaccessmask = VK_ACCESS_MEMORY_READ_BIT;
Then what is even the point of specifying stages if the stages aren't enough and it all comes down to the access mask?

Access masks ensure visibility, but you cannot have visibility over something that is not yet available (ie: you cannot see what hasn't happened yet). Visibility and availability are both necessary for a memory barrier, but each alone is insufficient.
After all, execution barriers do not, strictly speaking, need memory barriers. A write-after-read barrier does not need to ensure visibility. It simply needs to make sure that the earlier reads have finished before the write happens; a pure execution barrier is sufficient. But if you're doing a write-after-write or a read-after-write, then you need to ensure visibility in addition to availability.
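The case analysis above can be written down as a tiny decision table. This is a toy model in plain C, not Vulkan API: the enum and function names are invented for illustration, and it ignores special cases such as image layout transitions (which count as writes).

```c
/* Toy model (hypothetical names, not part of Vulkan): which kind of
 * dependency two accesses to the same memory need between them. */
typedef enum { ACCESS_READ, ACCESS_WRITE } AccessKind;

typedef enum {
    DEP_NONE,       /* read-after-read: no hazard */
    DEP_EXECUTION,  /* write-after-read: ordering alone is enough */
    DEP_MEMORY      /* read-after-write, write-after-write: ordering plus
                       access masks (availability and visibility) */
} DependencyKind;

DependencyKind required_dependency(AccessKind first, AccessKind second)
{
    if (first == ACCESS_READ) {
        return (second == ACCESS_READ) ? DEP_NONE : DEP_EXECUTION;
    }
    /* The first access wrote, so its result has to be made available and
     * visible before the second access touches the memory. */
    return DEP_MEMORY;
}
```

For the staging-buffer case in the question, the first access is a transfer write and the second is a shader read, so `required_dependency(ACCESS_WRITE, ACCESS_READ)` lands in `DEP_MEMORY`: the stage masks order the work and the access masks take care of the caches.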

An execution dependency alone is not enough to ensure proper order, simply because Vulkan's memory model requires explicit memory barriers.
In other words, an execution barrier covers only execution, not the side effects of that execution. So you also need memory barrier(s) when you need the side effects to be coherent/consistent.
If the coherency is satisfied another way, the access mask is actually what is unnecessary:
semaphore_wait.pWaitDstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// we need to order our stuff after the semaphore wait, so:
barrier.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// semaphore already includes all device memory access, so nothing here:
barrier.srcAccessMask = 0;
barrier.oldLayout = COOL_LAYOUT;
barrier.newLayout = EVEN_BETTER_LAYOUT;
Using only access masks (without stage) would be ambiguous. Access masks can limit the amount of memory coherency work the driver needs to perform:
// make available the storage writes from fragment shader,
// but do not bother with storage writes
// from compute, vertex, tessellation, geometry, raytracing, or mesh shader:
barrier.srcStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT does implicitly include the vertex, tessellation, and geometry shader stages, but only for execution dependency purposes, not for memory dependencies:
Note
Including a particular pipeline stage in the first synchronization scope of a command implicitly includes logically earlier pipeline stages in the synchronization scope. Similarly, the second synchronization scope includes logically later pipeline stages.
However, note that access scopes are not affected in this way - only the precise stages specified are considered part of each access scope.
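The rule in that note can be modelled in a few lines. The following is a toy model only: the STAGE_* names and bit positions are made up (they are not Vulkan's real enum values), and the graphics stages are treated as one strictly linear order, which is a simplification.

```c
#include <stdint.h>

/* Hypothetical stage bits, laid out in logical pipeline order. */
enum {
    STAGE_DRAW_INDIRECT        = 1 << 0,
    STAGE_VERTEX_INPUT         = 1 << 1,
    STAGE_VERTEX_SHADER        = 1 << 2,
    STAGE_TESS_CONTROL         = 1 << 3,
    STAGE_TESS_EVALUATION      = 1 << 4,
    STAGE_GEOMETRY_SHADER      = 1 << 5,
    STAGE_EARLY_FRAGMENT_TESTS = 1 << 6,
    STAGE_FRAGMENT_SHADER      = 1 << 7,
    STAGE_LATE_FRAGMENT_TESTS  = 1 << 8,
    STAGE_COLOR_OUTPUT         = 1 << 9,
    STAGE_COUNT                = 10
};

/* First synchronization scope: the named stages plus every logically
 * earlier stage (everything up to the latest stage that was named). */
uint32_t first_sync_scope(uint32_t src_stage_mask)
{
    for (int bit = STAGE_COUNT - 1; bit >= 0; --bit) {
        if (src_stage_mask & (1u << bit)) {
            return (1u << (bit + 1)) - 1u;
        }
    }
    return 0;
}

/* Second synchronization scope: the named stages plus every logically
 * later stage (everything from the earliest stage that was named). */
uint32_t second_sync_scope(uint32_t dst_stage_mask)
{
    const uint32_t all_stages = (1u << STAGE_COUNT) - 1u;
    for (int bit = 0; bit < STAGE_COUNT; ++bit) {
        if (dst_stage_mask & (1u << bit)) {
            return all_stages & ~((1u << bit) - 1u);
        }
    }
    return 0;
}

/* Access scopes are NOT widened: only accesses performed by exactly the
 * named stages are covered, so this is the identity function. */
uint32_t access_scope_stages(uint32_t stage_mask)
{
    return stage_mask;
}
```

In this model, naming only the fragment shader stage as the source orders vertex shader execution too, but a storage write made by the vertex shader is outside the access scope and would not be made available.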

Is it possible to synchronize an automatic layout transition with a swapchain image acquisition via pWaitDstStageMask=TOP_OF_PIPE?

It is necessary to synchronize automatic layout transitions performed by render passes with the acquisition of a swapchain image via the semaphore provided in vkAcquireNextImageKHR. Vulkan Tutorial states, "We could change the waitStages for the imageAvailableSemaphore to VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT to ensure that the render passes don't begin until the image is available" (Context: waitStages is an array of VkPipelineStageFlags supplied as pWaitDstStageMask to the VkSubmitInfo for the queue submission which performs the render pass and, consequently, the automatic layout transition). It then opts for an alternative (common) solution, which is to use an external subpass dependency instead.
In fact, I've only ever seen such synchronization done with a subpass dependency; I've never seen anyone actually do it by setting pWaitDstStageMask=TOP_OF_PIPE for the VkSubmitInfo corresponding to the queue submission which performs the automatic layout transition, and I'm not certain why this would even work. The specs provide certain guarantees about automatic layout transitions being synchronized with subpass dependencies, so I understand why they would work, but in order to do this sort of synchronization by waiting on a semaphore in a VkSubmitInfo, it is first necessary that image layout transitions even have pipeline stages to begin with. I am not aware of any pipeline stages which image layout transitions go through; I believed them to be entirely atomic operations which occur in between source-available operations and destination-visible operations in a memory dependency (e.g. subpass dependency, memory barrier, etc.), as described by the specs.
Do automatic image layout transitions go through pipeline stages? If so, which stages do they go through (perhaps I'm missing something in the specs)? If not, how could setting pWaitDstStageMask=TOP_OF_PIPE in a VkSubmitInfo waiting on the image acquisition semaphore have any synchronization effect on the automatic image layout transition?

You may be misunderstanding what the tutorial is saying.
Layout transitions do not happen within a pipeline stage. They, like all aspects of an execution dependency, happen between pipeline stages. A dependency states that, after the source stage completes but before the destination stage begins, the dependency's operations will happen.
So if the destination stage of an incoming external subpass dependency is the top of the pipe, then it is saying that the layout transition as part of the dependency will happen before the execution of any pipeline stages in the target subpass.
Remember that the default external subpass dependency for each attachment has the top of the pipe as its source scope. If the semaphore wait dependency's destination stage is the top of the pipe, then the source scope of the external subpass dependency is in the destination scope of the semaphore wait. And therefore, the external subpass dependency happens-after the semaphore wait dependency.
And, as previously stated, layout transitions happen between the source scope and the destination scope. So until the source scope, and everything it depends on, finishes execution, the layout transition will not happen.

Vulkan's execution model and synchronization [closed]

I am trying to clear up my confusion around Vulkan's execution model and I would like to have my understanding verified and get answers to questions that still remain unclear to me.
So my understanding is following:
1) The host and the device execute completely asynchronously with respect to each other. I have to use VkFence to synchronize between them, i.e. when I want to know that a particular submission has finished executing on the device, I have to wait on the host for the appropriate VkFence to be signaled.
2) Different command queues execute asynchronously with respect to each other. Vulkan specification does not provide any guarantees about the order in which submissions to these queues start or finish execution. So vkQueueSubmit on queue A executes completely independently from vkQueueSubmit on queue B and I have to use VkSemaphore in order to make sure that for example submission to queue B starts executing after the submission to queue A is finished.
3) However different commands submitted to the same command queue respect their submission order, which means that commands submitted later won't start execution unless commands submitted earlier have already started their execution, but on the other hand this does not mean that these later commands cannot finish execution before earlier commands.
4) State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw). Actually they execute right away on the host (not on the device) and modify the state of VkCommandBuffer and this cumulatively modified state is used in recording action commands that come after.
5) From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers. It can be thought of as having one pipeline barrier in the beginning of render pass instance (in place of vkCmdBeginRenderPass), one at the end of render pass instance (in place of vkCmdEndRenderPass) and one pipeline barrier after each subpass (in place of vkCmdNextSubpass).
6) In my head the mental model of how commands execute on a single command queue is as one huge stream of commands (ordered in the order that they were recorded to command buffer and the order that these command buffers were submitted to the queue) split by pipeline barriers. Each pipeline barrier splits the stream into two sections, commands before the barrier (section A) and commands after the barrier (section B). Commands in section B are allowed to start (or rather continue their execution with pipeline stage Y) only after all commands in section A have finished executing pipeline stage X.
Questions:
The Vulkan specification (section 2.2.1. Queue Operation) states:
Command buffer submissions to a single queue respect submission order
and other implicit ordering guarantees, but otherwise may overlap or
execute out of order. Other types of batches and queue submissions
against a single queue (e.g. sparse memory binding) have no implicit
ordering constraints with any other queue submission or batch.
Let's say that in my program I have only one general queue that can issue all kinds of commands (graphics, compute, transfer, presentation, ...), so does the above statement mean the following ?
vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started, ... but vkQueueBindSparse or vkQueuePresentKHR can start at any time regardless of when they were issued by the host ... In other words I always have to use VkSemaphore to ensure that presentation (vkQueuePresentKHR) starts at the right time (only after all my graphics work was submitted and executed and thus is ready to be presented).
I am a little bit confused with the definition of submission order within command buffers themselves. Specification states (section 6.2. Implicit Synchronization Guarantees):
1)
For commands recorded outside a render pass, this includes all other
commands recorded outside a render pass, including
vkCmdBeginRenderPass and vkCmdEndRenderPass commands; it does not
directly include commands inside a render pass.
2)
For commands recorded inside a render pass, this includes all other
commands recorded inside the same subpass, including the
vkCmdBeginRenderPass and vkCmdEndRenderPass commands that delimit the
same render pass instance; it does not include commands recorded to
other subpasses.
The first bullet point seems to be clear. The submission order is the order in which commands were recorded to command buffers, whilst whatever is inside of a vkCmdBeginRenderPass and vkCmdEndRenderPass block is considered as one command for the purpose of this bullet point. The second bullet point is a bit unclear to me though. How is the submission order defined here ? It is clear that any command within a specific subpass does not start its execution unless a previous command has already started its execution or unless vkCmdBeginRenderPass was executed. But what about different subpasses ? Does this mean that subpass 1 can start its execution before subpass 0 has started its execution ? This does not make sense to me. What would make sense, is if later subpasses are only allowed to start once previous subpasses have finished.
Vulkan specification (section 6.1.2. Pipeline Stages) states:
Execution of operations across pipeline stages must adhere to implicit
ordering guarantees, particularly including pipeline stage order.
Does this mean that for example Vertex shader stage from draw call 2 is not allowed to begin execution unless vertex shader stage from draw call 1 has already started its execution ?
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A). I mean would it make the commands in command buffer B wait to start execution until commands in command buffer A are finished ? I read somewhere that synchronization between different command buffers is the job for events, but according to my understanding this should also be possible with barriers.
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
So as I see it, there are several different parallelisms in Vulkan:
Between CPU and GPU, these are synchronized with VkFence
Between different commands queues on the GPU, these are synchronized with VkSemaphore
Between different submissions to the same queue; the exception seems to be submissions with vkQueueSubmit. These are also synchronized with VkSemaphore.
Between different draw calls. These are synchronized with pipeline barrier.
This one is the most confusing to me. If I have a drawcall that in some way uses the results of a previous drawcall or writes to the same render target (framebuffer), then as far as I understand I need to make sure that the later drawcall sees the memory effects of all previous drawcalls. But what about when I am rendering a scene with a bunch of game characters, trees and buildings? Let's say that each such object is one drawcall and all these drawcalls write to the same framebuffer. Do I need to issue a memory barrier after every drawcall ? Intuitively this feels redundant, and the demos that I checked out did not issue any barriers in this case, but are there any guarantees that drawcalls logically following will see the memory effects of the drawcalls logically before them ? The question is, when do I need to synchronize between different drawcalls ?
Within a single draw call. Synchronization on this level is possible with shader atomic instructions.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine. In other words if every fragment shader reads and writes only its corresponding pixel or vertex data, I do not need to worry about synchronization within the same drawcall.

The host and the device execute completely asynchronously with respect to each other.
Yes, unless explicit synchronization is used (that is, VkFence, vk*WaitIdle, or VkEvent), or the one rare implicit synchronization applies (host writes are visible to device access from any subsequent vkQueueSubmit).
Do note that there also has to be a "memory domain operation". I.e. when reading GPU output on the CPU, you must include a dependency whose destination is VK_PIPELINE_STAGE_HOST_BIT (with VK_ACCESS_HOST_READ_BIT). A VkFence alone, which provides the execution and memory dependency, does not suffice.
Different command queues execute asynchronously with respect to each other.
Correct. In other words, commands from any two queues may run serially, next to each other (in parallel), or even be pre-empted and time-shared, or some combination of the above. Anything goes. Unless explicit synchronization (VkSemaphore or VkFence) is used.
However different commands submitted to the same command queue respect their submission order
Yes. But it is only a specification formalism that has no real-world effect. It is only specified so that we have a formal linguistic framework in which to describe other things in the specification (e.g. it provides the nomenclature necessary to describe the behavior of pipeline barriers).
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw).
No, that is not exactly how I would describe it.
They are not "delayed". They are simply executed exactly where they are recorded in the command buffers.
This is perhaps one of the places where we need the "submission order" formalism. All commands later in submission order than a state command see the new state (i.e. only the commands recorded after the state command see the new state).
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers.
I don't think so. It is actually perhaps a bit more complex.
What it provides is more efficient synchronization, although it perhaps defines functionally the same synchronization as pipeline barriers could. What it does differently is (among other things) that it defines this synchronization as a monolith (i.e. you tell the driver up front what resources you are going to use, and you outline everything you are going to do to them later).
The Render Pass is a harness necessitated by mobile tiling-architecture GPUs. On desktop it is also useful if the GPU has some architectural inspiration from mobile GPUs, or simply as an oracle for driver optimization.
so does the above statement mean the following ? vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started
Yes, and no. Read above about the formalism of submission order.
Technically, yes, the commands are guaranteed to execute their VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT stage in order. But that stage does nothing.
As I said, it is only a specification formalism used for other things. It does not say anything in and of itself.
I am a little bit confused with the definition of submission order within command buffers themselves.
Yes, the language is a bit tricky. The part that trips you up is the subpasses. Note that subpasses are by definition also asynchronous. Therefore we cannot use the simple rule in quote "1)".
If I decode it, what the spec quote means is:
a) Any command recorded before the Render Pass Instance (i.e. before vkCmdBeginRenderPass) is earlier in submission order than the vkCmdBeginRenderPass, and earlier than any and all the commands in the subpasses. (And vice versa, anything in the subpasses is later in submission order.)
b) Similarly any command recorded after the Render Pass Instance (i.e. after vkCmdEndRenderPass) is later in submission order than the vkCmdEndRenderPass, and later than any and all the commands in the subpasses.
c) The commands in a single subpass have the same submission order as the order in which they were recorded (vkCmd*).
d) Commands in any two subpasses do not have submission order wrt each other.
Remember submission order is only a formalism. What "d)" means in reality is only that you cannot execute vkCmdPipelineBarrier in subpass 1 and expect that barrier to cover anything from subpass 0. (What you must do is use the VkSubpassDependency instead of vkCmdPipelineBarrier to achieve dependency between subpass 0 and 1.)
Execution of operations across pipeline stages must adhere to implicit ordering guarantees, particularly including pipeline stage order.
This is only an introductory statement linking to some of the other parts of the specification. It does not say anything in and of itself.
"implicit ordering guarantees" links to the submission order we covered.
"pipeline stage order" simply links to pipeline stage ordering. This simply specifies the "logical order" between pipeline stages (e.g. Vertex Shader is before Fragment Shader). What it means is that whenever you use a stage flag bit in any srcStage parameter, Vulkan implicitly assumes you also mean every logically earlier stage flag bit (and, for dstStage, every logically later one).
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A)
Yes, that is the general idea.
Think of it like this: vkQueueSubmit appends the commands from the command buffer to the end of the queue. It is called a "queue" for a reason. Therefore a pipeline barrier affects the command buffer that was submitted earlier. (And, by the way, that's why it is called submission order.)
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
Yes, but that is code rot.
In this case use VK_PIPELINE_STAGE_ALL_COMMANDS_BIT for both stage masks instead. It is much easier to understand for anyone reading such code.
So as I see it, there are several different parallelisms in Vulkan:
Asynchrony.
Parallelism is not guaranteed. I.e. the driver is allowed to serialize the workload, or time-share it.
But e.g. with some common sense you can guess there will be (notable) parallelism between CPU and GPU, if it is a dedicated GPU.
The question is, when do I need to synchronize between different drawcalls ?
Yes, I think not needing framebuffer synchronization between draw commands is one of the exceptions/simplifications Vulkan has.
I believe this is backed by the specification's Primitive Order and Rasterization Order rules.
I.e. within a single subpass you should not need a pipeline barrier between two vkCmdDraw* calls to synchronize the color and depth buffers. (I think) you still need to explicitly synchronize draws in a subpass with other subpasses and with work outside of the render pass instance.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine.
Yes. The pipeline and its fixed and programmable stages should work much as they do in OpenGL. For the most part you should be able to use OpenGL shaders with little to no modification and get the same behavior.

What is the relationship of RenderPass and Pipeline in Vulkan?

What is the logical relationship between RenderPass and Pipeline in Vulkan?
If you ignore RenderPass, my understanding of the rendering process is this: first the application layer prepares the vertex data, then the texture data can be submitted to the driver, and finally, after the work passes through the various pipeline stages and is written to the Framebuffer, the rendering is complete.
So what is the responsibility of the RenderPass? Is it an abstraction that provides metadata for rendering each stage (such as Format), or does it have some other role?
Do RenderPass and Pipeline depend on each other? For example, does each Pipeline belong to a Subpass? Or is there a dependency in which, for instance, the final output of the Pipeline is handled by RenderPass? Or is it something else?

At the end of the day Vulkan is a nice modern-ish OO API. All the objects in Vulkan are practically only what parameters they take. Just saying this to ease your learning. You can look at vkCreateX and largely understand what VkX does in Vulkan.
VkPipeline is a GPU context. Think of the GPU as an FPGA (which it isn't, but bear with me). Calling vkCmdBindPipeline would set the GPU to a given gate configuration. Except the GPU is not an FPGA; in our case it sets the GPU to a state where it can execute the shader programs and fixed-function pipeline stages defined by the VkPipeline.
VkRenderPass is a data-oriented thing. It is necessitated by tiled-architecture GPUs (mobile GPUs). On desktop GPUs it can still serve as an oracle for optimization and/or allow a partially tiled architecture (or really any other architecture that can use it).
Tiled-architecture GPUs need to "load" an image/buffer from general-purpose RAM into "on-chip memory". When they are done, they "store" their results back to RAM.
VkRenderPass defines what kind of inputs (attachments) will be needed. It defines how they get loaded and stored before and after the render pass instance*, respectively. It also has subpasses and defines the synchronization between them (replacing vkCmdPipelineBarrier calls). And it defines the purpose each render pass attachment will serve (e.g. whether it is a color buffer or a depth buffer).
* A Render Pass Instance is the thing created from a Render Pass object by vkCmdBeginRenderPass. Yeah, not confusing, right?

Using pipeline barriers instead of semaphores

I want to be sure that I understand pipeline barriers correctly.
So barriers are able to synchronize two command buffers provided the source stage of the second barrier is later than the destination stage of the first barrier. Is this correct?
Of course I will need to use semaphores if the command buffers execute during different iterations of the pipeline.
It seems to me that synchronisation is the hardest part to grasp in Vulkan. IMO the specification isn't clear enough about it.

Preamble:
Most of what applies to Vulkan Pipeline Barriers applies to generic barriers and memory barriers, so you can start there to build your intuition.
I would note that, although the specification is not a tutorial, it is reasonably clear and readable. Synchronization is perhaps the hardest part, and the description in the specification mirrors that. On top of that, memory barriers in particular are novel to most people (they are usually shielded from such concepts by a higher-level language compiler).
Needed definitions:
A pipeline is an abstract scheme of how a unit of work is processed. There are roughly four types (though Vulkan does not tell vendors how to do things as long as they follow the rules):
Host access pseudo-pipeline (with one stage)
Transfer (with one stage)
Compute (with one stage)
Graphics (with a lot of stages, i.e. DI→VI→VS→TCS→TES→GS→EFT→FS→LFT→Output)
There are special stages TOP (before anything is done), BOTTOM (after everything is finished), and ALL (which is the same as bitfield with all stages set).
(Action) command is a command that needs (one or more) passes through the pipeline. It must be recorded to command buffer (with the exception of the host reads and writes through vkMapMemory()).
A command buffer is a sequence of commands (in recorded order!). A queue is also a sequence of recorded commands (concatenated from submitted command buffers).
The queue has some leeway in the order in which it executes the commands (it may reorder commands as long as the user-set state is preserved) and may also overlap commands (e.g. execute the VS of the next command before finishing the FS of the previous command). User-defined synchronization primitives set boundaries on this leeway. (There are also some implicit guarantees, but it is better not to rely on them and to oversynchronize until confident.)
My take on explaining Pipeline Barriers:
(Maybe unfortunately) a Pipeline Barrier amalgamates three separate aspects: an execution barrier, a memory barrier, and a layout transition (if it applies to an image).
The execution barrier part ensures that all commands recorded before the Barrier have completed at least the pipeline stage (or stages) specified in srcStageMask before any command recorded after the Barrier starts executing the stage (or stages) specified in dstStageMask.
It handles only the execution dependency, not memory! The memory barrier part ensures that memory (caches) is properly flushed and invalidated somewhere in between the two halves of that execution dependency (i.e. after the depended-on commands and stages and before the dependent ones).
You specify what kind of memory dependency it is and between what kinds of sources and consumers (so the driver can choose the appropriate action without having to track the state itself). Typically it is a write-read dependency (read-read and read-write do not need any memory synchronization, and write-write does not usually make much sense: why would you overwrite some data without reading it first?).
A different data layout in memory may be advantageous (or even necessary) on some hardware. At the same time as the memory dependency is handled, the data is reordered to adhere to the newly specified layout.

So barriers are able to synchronize two command buffers provided the source stage of the second barrier is later than the destination stage of the first barrier. Is this correct?
The 1.0.35 Vulkan specification has improved wording that makes this clear:
If vkCmdPipelineBarrier was recorded outside a render pass instance, the first synchronization scope includes every command submitted to the same queue before it, including those in the same command buffer and batch.
...
If vkCmdPipelineBarrier was recorded outside a render pass instance, the second synchronization scope includes every command submitted to the same queue after it, including those in the same command buffer and batch.
Note that there is no requirement on the source or destination stage. You can synchronize with a source as fragment shader and destination as vertex shader just fine.
