Do I have to specify both stage flags to ensure the proper memory invalidation? - vulkan

If I issue a transfer command and then want to read that data from the shaders (both vertex and fragment), I'd do something like this:
cmd_buffer->issueTransferCommand();
PIPELINE BARRIER
srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; /* flush the transfer writes */
dstStageMask = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
dstAccessMask = VK_ACCESS_SHADER_READ_BIT; /* invalidate the caches from the point of view of the vertex shader */
Do I have to specify the invalidation of the caches for the fragment shader stage as well?

Yes. For the purposes of memory dependencies, access masks apply only to the pipeline stages explicitly listed, so you must include both VK_PIPELINE_STAGE_VERTEX_SHADER_BIT and VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT in dstStageMask.
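A minimal sketch of that barrier with both consuming stages included (using a global VkMemoryBarrier; the function name and the assumption of a command buffer `cmd` in the recording state are illustrative):

```cpp
#include <vulkan/vulkan.h>

// Make the transfer writes available, then visible to BOTH shader stages
// that will read the data. Each consuming stage bit must be listed
// explicitly in dstStageMask; the access mask applies per listed stage.
void recordTransferToShaderBarrier(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // flush transfer writes
    barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;    // invalidate for shader reads

    vkCmdPipelineBarrier(
        cmd,
        VK_PIPELINE_STAGE_TRANSFER_BIT,          // src: the transfer command
        VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |    // dst: both reading stages
        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
        0,
        1, &barrier,
        0, nullptr,
        0, nullptr);
}
```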

Related

Is this memory access barrier flag not sufficient?

In a Vulkan example the author sets dstStageMask to VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT and dstAccessMask to VK_ACCESS_MEMORY_READ_BIT. Now, from my previous questions asked here, the access mask flags apply only to the exact stage flags provided (not to all logically earlier stages). So, for example, if I have an access mask of memory read for a stage of fragment shader, then that access mask (cache invalidation) doesn't apply to the vertex shader or vertex input stages; rather, I would have to specify both of those stage flags separately.
It seems to me, in light of this, that having a dstAccessMask of VK_ACCESS_MEMORY_READ_BIT with a dstStageMask of VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT wouldn't do anything, as the access mask doesn't apply to "logically earlier stages", but only to the BOTTOM_OF_PIPE stage itself. Here is Sascha Willems' code from the multisampling example:
dependencies[0].srcSubpass = VK_SUBPASS_EXTERNAL;
dependencies[0].dstSubpass = 0;
dependencies[0].srcStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
dependencies[0].dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependencies[0].srcAccessMask = VK_ACCESS_MEMORY_READ_BIT;
dependencies[0].dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dependencies[0].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
dependencies[1].srcSubpass = 0;
dependencies[1].dstSubpass = VK_SUBPASS_EXTERNAL;
dependencies[1].srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dependencies[1].dstStageMask = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT;
dependencies[1].srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT | VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dependencies[1].dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
dependencies[1].dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
And here is the part of the spec that says the access mask only applies to the exact stages as given by each flag explicitly, not logically earlier stages:
Including a particular pipeline stage in the first synchronization
scope of a command implicitly includes logically earlier pipeline
stages in the synchronization scope. Similarly, the second
synchronization scope includes logically later pipeline stages.
However, note that access scopes are not affected in this way - only
the precise stages specified are considered part of each access scope.
VK_PIPELINE_STAGE_NONE (and its equivalents, such as BOTTOM_OF_PIPE and TOP_OF_PIPE) performs no accesses, and the current practice is to just write 0 (resp. VK_ACCESS_NONE) in the access mask. Early-day Vulkan practices were messy, and you will find plenty of outdated code...
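For illustration, here is how that first external dependency might be spelled in the current style; this is a sketch, not a drop-in replacement for the example (the function name is made up, and dependencyFlags is left at 0 here rather than BY_REGION):

```cpp
#include <vulkan/vulkan.h>

// Modern spelling of the BOTTOM_OF_PIPE/MEMORY_READ pattern: that pair
// carries no useful access scope, so the "no access" case is written out
// explicitly as 0 (VK_ACCESS_NONE), leaving a pure execution dependency
// on the source side.
VkSubpassDependency makeExternalInDependency()
{
    VkSubpassDependency dep{};
    dep.srcSubpass      = VK_SUBPASS_EXTERNAL;
    dep.dstSubpass      = 0;
    dep.srcStageMask    = VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT; // execution-only source
    dep.srcAccessMask   = 0;                                    // VK_ACCESS_NONE
    dep.dstStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                          VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dependencyFlags = 0;
    return dep;
}
```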

Can Vulkan synchronization source and destination stage mask be the same?

The following code from the Vulkan Tutorial seems to conflict with how synchronization scopes work.
// <dependency> is a subpass dependency.
dependency.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
...
dependency.dstStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
The above code is trying to set both the srcStageMask and dstStageMask to be the same pipeline stage: VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT.
According to Vulkan Specification:
If a synchronization command includes a source stage mask, its first synchronization scope only includes execution of the pipeline stages specified in that mask, ...
If a synchronization command includes a destination stage mask, its second synchronization scope only includes execution of the pipeline stages specified in that mask, ...
In other words, srcStageMask and dstStageMask create a first synchronization scope with specified stage(s) and a second one with the specified stage(s), respectively.
Also, according to the following:
... for two sets of operations, the first set must happen before the second set.
My confusion is that, since the source and destination stages are the same, the subpass dependency requires this pipeline stage to complete before the exact same stage starts to execute.
The color attachment output stage is already guaranteed to be finished (the first scope). How can you specify to start to execute the same finished stage again? (the second scope)
So what is this dependency trying to say?
A stage only exists within an action command that executes some portion of itself within that stage. Synchronization scopes are based on commands first. Once you have defined which commands are in the scope, stage masks can specify which stages within those commands are affected by the synchronization.
As such, all synchronization operations define a set of commands that happen before the synchronization and the set of commands that happen after. These represent the "first synchronization scope" and "second synchronization scope".
The source stage mask applies to the commands in the "first synchronization scope". The destination stage mask applies to commands in the "second synchronization scope". The commands in one scope are a distinct set from the other scope. So even if you're talking about the same pipeline stages, they're stages in different commands that execute at different times.
So what that does is exactly what it says: it creates a dependency between all executions of the color attachment stage from the source subpass (aka: the "first synchronization scope") and all executions of the color attachment stage from the destination subpass (aka: the "second synchronization scope").
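The same idea can be seen in pipeline-barrier form, where the "two sets of commands" are simply what was recorded before versus after the barrier (a sketch; the function name and the surrounding recording are assumed):

```cpp
#include <vulkan/vulkan.h>

// Same stage bit on both sides, but it names stages of DIFFERENT
// commands: the color attachment writes of commands recorded before
// the barrier (first scope) must complete before the color attachment
// output stage of commands recorded after it (second scope) begins.
void barrierBetweenColorOutputs(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_COLOR_ATTACHMENT_READ_BIT |
                            VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;

    vkCmdPipelineBarrier(
        cmd,
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // stage in earlier commands
        VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // same stage in later commands
        0,
        1, &barrier,
        0, nullptr,
        0, nullptr);
}
```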

Vulkan memory barrier for indirect compute shader dispatch

I have two compute shaders; the first one modifies a DispatchIndirectCommand buffer, which is later used by the second one.
// This is the code of the first shader
struct DispatchIndirectCommand{
uint x;
uint y;
uint z;
};
restrict layout(std430, set = 0, binding = 5) buffer Indirect{
DispatchIndirectCommand[1] dispatch_indirect;
};
void main(){
// some stuff
dispatch_indirect[0].x = number_of_groups; // I used debugPrintfEXT to make sure that this number is correct
}
I execute them as follows
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, first_shader);
vkCmdDispatch(cmd_buffer, x, 1, 1);
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, second_shader);
vkCmdDispatchIndirect(cmd_buffer, indirect_buffer, 0);
The problem is that the changes made by first shader are not reflected by the second one.
// This is the code of the second shader
void main(){
debugPrintfEXT("%d", gl_GlobalInvocationID.x); //this seems to never be called
}
I initialise the indirect_buffer with VkDispatchIndirectCommand{.x=0,.y=1,.z=1}, and it seems that the second shader always executes with x==0, because the debugPrintfEXT never prints anything. I tried to add a memory barrier like
VkBufferMemoryBarrier barrier;
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = queue_idx;
barrier.dstQueueFamilyIndex = queue_idx;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;
However, this does not seem to make any difference. What does seem to work is when I use
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
When I use these access flags, all the compute shaders work properly. However, I get a validation error:
Validation Error: [ VUID-vkCmdPipelineBarrier-dstAccessMask-02816 ] Object 0: handle = 0x5561b60356c8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x69c8467d | vkCmdPipelineBarrier(): .pMemoryBarriers[1].dstAccessMask bit VK_ACCESS_INDIRECT_COMMAND_READ_BIT is not supported by stage mask (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). The Vulkan spec states: The dstAccessMask member of each element of pMemoryBarriers must only include access flags that are supported by one or more of the pipeline stages in dstStageMask, as specified in the table of supported access types (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkCmdPipelineBarrier-dstAccessMask-02816)
Vulkan's documentation states that
VK_ACCESS_INDIRECT_COMMAND_READ_BIT specifies read access to indirect command data read as part of an indirect build, trace, drawing or dispatching command. Such access occurs in the VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage
It looks really confusing. It clearly does mention "dispatching command", but at the same time it says that the stage must be VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT and not VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT. Is the specification contradictory/imprecise or am I missing something?
You seem to be employing a trial-and-error strategy. Do not use such a strategy with low-level APIs, and preferably not in computer engineering in general. Sooner or later you will encounter something that looks like it works when you test it, but is invalid code anyway. The spec did tell you the exact thing to do, so you never had a legitimate reason to try any other flags, or no barrier at all.
As you discovered, the appropriate access flag for indirect read is VK_ACCESS_INDIRECT_COMMAND_READ_BIT. And as the spec also says, the appropriate stage is VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
So the barrier for your case should probably be:
srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
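Spelled out as a full call, that barrier could look like this (a sketch; a global VkMemoryBarrier is used for brevity instead of a VkBufferMemoryBarrier, and the function name plus the assumed command buffer `cmd` are illustrative):

```cpp
#include <vulkan/vulkan.h>

// Record between vkCmdDispatch (which writes the indirect arguments in
// the COMPUTE_SHADER stage) and vkCmdDispatchIndirect (which consumes
// them in the DRAW_INDIRECT stage).
void barrierForIndirectDispatch(VkCommandBuffer cmd)
{
    VkMemoryBarrier barrier{};
    barrier.sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;           // first dispatch's writes
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;  // indirect argument read

    vkCmdPipelineBarrier(
        cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // where the arguments were written
        VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,  // where the arguments are consumed
        0,
        1, &barrier,
        0, nullptr,
        0, nullptr);
}
```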
The name of the stage flag is slightly confusing, but the spec is otherwise very clear that it applies to compute:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT specifies the stage of the pipeline where VkDrawIndirect* / VkDispatchIndirect* / VkTraceRaysIndirect* data structures are consumed.
and also:
For the compute pipeline, the following stages occur in this order:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
PS: Also relevant GitHub Issue: KhronosGroup/Vulkan-Docs#176

Vulkan compute shader caches and barriers

I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
Constraint constraints[];
};
void main(){
const uint gID = gl_GlobalInvocationID.x;
for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
// first query the constraint, which contains particle_id_1 and particle_id_2
const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
// read newest positions
vec3 position1 = particles[c.particle_id_1].position;
vec3 position2 = particles[c.particle_id_2].position;
// modify position1 and position2
position1 += something;
position2 -= something;
// update positions
particles[c.particle_id_1].position = position1;
particles[c.particle_id_2].position = position2;
// in the next iteration, different constraints may use the updated positions
}
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier:
void vkCmdPipelineBarrier(
VkCommandBuffer commandBuffer,
VkPipelineStageFlags srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkPipelineStageFlags dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkDependencyFlags dependencyFlags, // here nothing
uint32_t memoryBarrierCount, // here 0
const VkMemoryBarrier* pMemoryBarriers, // nullptr
uint32_t bufferMemoryBarrierCount, // 0
const VkBufferMemoryBarrier* pBufferMemoryBarriers, // nullptr
uint32_t imageMemoryBarrierCount, // 0
const VkImageMemoryBarrier* pImageMemoryBarriers); // nullptr
Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
However, if you want a GPU level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe not.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.
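Concretely, the missing memory dependency over the particle buffer could be established like this (a sketch; the function name is made up, and VK_WHOLE_SIZE stands in for the real buffer range):

```cpp
#include <vulkan/vulkan.h>

// An execution dependency alone makes the first dispatch's writes
// available but not visible. Adding the buffer memory barrier makes
// the storage-buffer writes visible to the second dispatch's reads.
void computeToComputeBarrier(VkCommandBuffer cmd, VkBuffer particleBuffer)
{
    VkBufferMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED; // no ownership transfer
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = particleBuffer;
    barrier.offset              = 0;
    barrier.size                = VK_WHOLE_SIZE;

    vkCmdPipelineBarrier(
        cmd,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
        0,
        0, nullptr,
        1, &barrier,
        0, nullptr);
}
```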
Implementation-dependent. For all we know, a device might have no cache at all, or in future it might be some quantum magic bs.
A shader assignment operation does not imply anything about anything. There's no "L1" or "L2" mentioned anywhere in the Vulkan specification; it is a concept that does not exist there.
Completely divorce yourself from the cache stuff, and all the mental baggage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First you put your writes into src* part of a memory dependency (e.g. via a pipeline barrier). That will make your writes "available from".
Then you put your reader into dst* that will take all referenced writes that are "available from" and make them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.

compute writes and transfer sync

I have a compute shader that writes into a storage buffer. As soon as the compute
queue becomes idle, the storage buffer is transferred to an image. Pipeline barriers
before and after the transfer take care of layout transitions.
The relevant code is as follows:
vkCmdDispatch(...);
...
vkQueueWaitIdle(...);
...
...
VkImageMemoryBarrier i = {};
i.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
i.srcAccessMask = 0;
i.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
i.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
...
i.image = texture;
i.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
i.subresourceRange....
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
VK_PIPELINE_STAGE_TRANSFER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
...
vkCmdCopyBufferToImage(...);
...
i.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
i.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
The image is then used in a subsequent renderpass, and everything works just fine.
However, I am a bit concerned that I might be invoking undefined behaviour,
because even if waiting for the compute queue ensures execution order between
the buffer writes and the buffer transfer, there is no explicit barrier that ensures writes
from the compute shader are actually available and visible to the buffer transfer.
Is there an implicit buffer or memory barrier (at least in this case) that I
cannot find in the specs (1.1.123 as of today), or any other kind of mechanism,
such that the above code is correct and the compute shader writes are always
available to the buffer transfer?
If not, would I be right to assume there should be a VkBufferMemoryBarrier right
before the first layout-transition pipeline barrier?
I am a bit confused, because reading the specs, I find:
"vkCmdPipelineBarrier is a synchronization command that inserts a dependency
between commands submitted to the same queue, or between commands in the same subpass."
but here I would need to insert a memory dependency from two different queues and
two distinct pipelines, so I'm not really sure which pipeline would have to have a
barrier... if a barrier is even needed in the first place.
You are thinking of the wrong synchronization tool. For synchronization between queues there is VkSemaphore.
There is an additional complication in this case: the concept of queue-family resource ownership. In the case of VK_SHARING_MODE_EXCLUSIVE and differing queue families, you must proceed as written in the Queue Family Ownership Transfer section of the specification, i.e. use a special releasing and acquiring barrier pair + a semaphore.
Otherwise, only a semaphore is needed, as explained in the Semaphore Signaling and Semaphore Waiting & Unsignaling sections:
The first access scope includes all memory access performed by the device.
The second access scope includes all memory access performed by the device.
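A cross-queue sketch with a semaphore (assuming the two queues are in the same family, so no ownership transfer is needed; all handles and function names here are illustrative):

```cpp
#include <vulkan/vulkan.h>

// Signal a semaphore after the compute work and make the transfer
// submission wait on it at the TRANSFER stage. Per the quoted spec
// text, the semaphore's access scopes cover all device memory
// accesses, so no extra buffer barrier is needed to make the compute
// writes visible to the transfer read.
void submitWithSemaphore(VkQueue computeQueue, VkQueue transferQueue,
                         VkCommandBuffer computeCmd, VkCommandBuffer transferCmd,
                         VkSemaphore computeDone)
{
    VkSubmitInfo computeSubmit{};
    computeSubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    computeSubmit.commandBufferCount   = 1;
    computeSubmit.pCommandBuffers      = &computeCmd;
    computeSubmit.signalSemaphoreCount = 1;
    computeSubmit.pSignalSemaphores    = &computeDone;
    vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkSubmitInfo transferSubmit{};
    transferSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    transferSubmit.waitSemaphoreCount = 1;
    transferSubmit.pWaitSemaphores    = &computeDone;
    transferSubmit.pWaitDstStageMask  = &waitStage;
    transferSubmit.commandBufferCount = 1;
    transferSubmit.pCommandBuffers    = &transferCmd;
    vkQueueSubmit(transferQueue, 1, &transferSubmit, VK_NULL_HANDLE);
}
```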