compute writes and transfer sync - vulkan

I have a compute shader that writes into a storage buffer. As soon as the compute
queue becomes idle, the storage buffer is transferred to an image. Pipeline barriers
before and after the transfer take care of layout transitions.
The relevant code is as follows:
vkCmdDispatch(...);
...
vkQueueWaitIdle(...);
...
...
VkImageMemoryBarrier i = {};
i.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
i.srcAccessMask = 0;
i.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // the copy writes to the image
i.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
i.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
...
i.image = texture;
i.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
i.subresourceRange....
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
VK_PIPELINE_STAGE_TRANSFER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
...
vkCmdCopyBufferToImage(...);
...
i.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
i.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
The image is then used in a subsequent renderpass, and everything works just fine.
However, I am a bit concerned that I might be relying on undefined behaviour,
because even if waiting for the compute queue ensures execution order between
the buffer writes and the buffer transfer, there is no explicit barrier that ensures writes
from the compute shader are actually available and visible to the buffer transfer.
Is there an implicit buffer or memory barrier (at least in this case) that I
cannot find in the specs (1.1.123 as of today), or any other kind of mechanism,
such that the above code is correct and the compute shader writes are always
available to the buffer transfer?
If not, would I be right to assume there should be a VkBufferMemoryBarrier right
before the first layout-transition pipeline barrier?
I am a bit confused, because reading the specs, I find:
"vkCmdPipelineBarrier is a synchronization command that inserts a dependency
between commands submitted to the same queue, or between commands in the same subpass."
but here I would need to insert a memory dependency between two different queues and
two distinct pipelines, so I'm not really sure which pipeline would need the
barrier, if a barrier is even needed in the first place.

You are thinking of the wrong synchronization tool. For synchronization between queues there is VkSemaphore.
There is an additional complication in this case: the concept of queue family ownership of resources. In the case of VK_SHARING_MODE_EXCLUSIVE and differing queue families, you must proceed as written in the Queue Family Ownership Transfer section of the specification, i.e. use special releasing and acquiring barriers plus a semaphore.
Otherwise, only a semaphore is needed, as explained in the Semaphore Signaling and Semaphore Waiting & Unsignaling sections:
The first access scope includes all memory access performed by the device.
The second access scope includes all memory access performed by the device.

Related

Do I have to specify both stage flags to ensure the proper memory invalidation?

If I issue a transfer command, and then want to read that data from the shaders (both vertex and fragment), I'd do it like this:
cmd_buffer->issueTransferCommand();
PIPELINE BARRIER
srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; /* flush the transfer writes */
dstStageMask = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
dstAccessMask = VK_ACCESS_SHADER_READ_BIT; /* invalidate the caches from the point of view of the vertex shader */
Do I have to specify the invalidation of the caches for the fragment shader stage as well?
Yes, all pipeline stages must be explicitly included for the purposes of memory dependencies.

Vulkan memory barrier for indirect compute shader dispatch

I have two compute shaders and the first one modifies DispatchIndirectCommand buffer, which is later used by the second one.
// This is the code of the first shader
struct DispatchIndirectCommand{
uint x;
uint y;
uint z;
};
restrict layout(std430, set = 0, binding = 5) buffer Indirect{
DispatchIndirectCommand[1] dispatch_indirect;
};
void main(){
// some stuff
dispatch_indirect[0].x = number_of_groups; // I used debugPrintfEXT to make sure that this number is correct
}
I execute them as follows
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, first_shader);
vkCmdDispatch(cmd_buffer, x, 1, 1);
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, second_shader);
vkCmdDispatchIndirect(cmd_buffer, indirect_buffer, 0);
The problem is that the changes made by the first shader are not seen by the second one.
// This is the code of the second shader
void main(){
debugPrintfEXT("%d", gl_GlobalInvocationID.x); //this seems to never be called
}
I initialise the indirect_buffer with VkDispatchIndirectCommand{.x=0,.y=1,.z=1}, and it seems that the second shader always executes with x==0, because the debugPrintfEXT never prints anything. I tried to add a memory barrier like
VkBufferMemoryBarrier barrier = {}; // zero-initialize so that pNext is null
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = queue_idx;
barrier.dstQueueFamilyIndex = queue_idx;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;
However, this does not seem to make any difference. What does seem to work, is when I use
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
When I use these access flags, all compute shaders work properly. However, I get a validation error
Validation Error: [ VUID-vkCmdPipelineBarrier-dstAccessMask-02816 ] Object 0: handle = 0x5561b60356c8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x69c8467d | vkCmdPipelineBarrier(): .pMemoryBarriers[1].dstAccessMask bit VK_ACCESS_INDIRECT_COMMAND_READ_BIT is not supported by stage mask (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). The Vulkan spec states: The dstAccessMask member of each element of pMemoryBarriers must only include access flags that are supported by one or more of the pipeline stages in dstStageMask, as specified in the table of supported access types (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkCmdPipelineBarrier-dstAccessMask-02816)
Vulkan's documentation states that
VK_ACCESS_INDIRECT_COMMAND_READ_BIT specifies read access to indirect command data read as part of an indirect build, trace, drawing or dispatching command. Such access occurs in the VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage
It looks really confusing. It clearly does mention "dispatching command", but at the same time it says that the stage must be VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT and not VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT. Is the specification contradictory/imprecise or am I missing something?
You seem to be employing a trial-and-error strategy. Do not use such a strategy in low-level APIs, and preferably not in computer engineering in general. Sooner or later you will encounter something that looks like it works when you test it, but is invalid code anyway. The spec did tell you the exact thing to do, so you never had a legitimate reason to try any other flags, or no barriers at all.
As you discovered, the appropriate access flag for indirect read is VK_ACCESS_INDIRECT_COMMAND_READ_BIT. And as the spec also says, the appropriate stage is VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
So the barrier for your case should probably be:
srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
The name of the stage flag is slightly confusing, but the spec is otherwise very clear that it applies to compute:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT specifies the stage of the pipeline where VkDrawIndirect* / VkDispatchIndirect* / VkTraceRaysIndirect* data structures are consumed.
and also:
For the compute pipeline, the following stages occur in this order:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
PS: Also relevant GitHub Issue: KhronosGroup/Vulkan-Docs#176

What does "VkImageMemoryBarrier::srcAccessMask = 0" mean?

I just read the Images chapter of the Vulkan tutorial, and I didn't understand "VkImageMemoryBarrier::srcAccessMask = 0".
code:
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
and the tutorial says:
Since the transitionImageLayout function executes a command buffer with only a single command, you could use this implicit synchronization and set srcAccessMask to 0 if you ever needed a VK_ACCESS_HOST_WRITE_BIT dependency in a layout transition.
Q1: If the function executed a command buffer with multiple commands, would this implicit synchronization no longer be usable?
Q2: According to the manual page, VK_ACCESS_HOST_WRITE_BIT is 0x00004000, but the tutorial uses 0. Why?
Does 0 mean implicit, and VK_ACCESS_HOST_WRITE_BIT mean explicit?
Am I understanding correctly?
0 access mask means "nothing". As in, there is no memory dependency the barrier introduces.
Implicit synchronization means Vulkan does it for you. As the tutorial says:
One thing to note is that command buffer submission results in implicit VK_ACCESS_HOST_WRITE_BIT synchronization
Specifically, this is the Host Write Ordering Guarantee.
Implicit means you don't have to do anything. Any host write to mapped memory is already automatically visible to any device access of any vkQueueSubmit called after the mapped memory write.
Explicit in this case would mean to submit a barrier with VK_PIPELINE_STAGE_HOST_BIT and VK_ACCESS_HOST_*_BIT.
Note that the sync guarantees only work one way. So CPU → GPU will be automatic/implicit. But GPU → CPU always needs to be explicit (you need a barrier with dst = VK_PIPELINE_STAGE_HOST_BIT to perform a memory domain transfer operation).

Vulkan compute shader caches and barriers

I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
Constraint constraints[];
};
void main(){
const uint gID = gl_GlobalInvocationID.x;
for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
// first query the constraint, which contains particle_id_1 and particle_id_2
const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
// read newest positions
vec3 position1 = particles[c.particle_id_1].position;
vec3 position2 = particles[c.particle_id_2].position;
// modify position1 and position2
position1 += something;
position2 -= something;
// update positions
particles[c.particle_id_1].position = position1;
particles[c.particle_id_2].position = position2;
// in the next iteration, different constraints may use the updated positions
}
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run after this one, and that second shader will read positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier
void vkCmdPipelineBarrier(
VkCommandBuffer commandBuffer,
VkPipelineStageFlags srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkPipelineStageFlags dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkDependencyFlags dependencyFlags, // here nothing
uint32_t memoryBarrierCount, // here 0
const VkMemoryBarrier* pMemoryBarriers, // nullptr
uint32_t bufferMemoryBarrierCount, // 0
const VkBufferMemoryBarrier* pBufferMemoryBarriers, // nullptr
uint32_t imageMemoryBarrierCount, // 0
const VkImageMemoryBarrier* pImageMemoryBarriers); // nullptr
Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
However, if you want a GPU level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe not.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.
Implementation-dependent. For all we know, a device might have no cache at all, or in future it might be some quantum magic bs.
Shader assignment operation does not imply anything about anything. There's no "L1" or "L2" mentioned anywhere in the Vulkan specification. It is a concept that does not exist.
Let us completely divorce ourselves from the cache stuff, and all the mental baggage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First you put your writes into src* part of a memory dependency (e.g. via a pipeline barrier). That will make your writes "available from".
Then you put your reader into dst* that will take all referenced writes that are "available from" and make them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.

Vulkan Queue submission synchronization - vkWaitForFences vs vkQueueWaitIdle [duplicate]

I have a function that copies data from one buffer to another, I need to synchronize its execution.
I have this bad option:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
//Waiting for completion
vkQueueWaitIdle(graphicsQueue);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
}
This option is bad because if I want to execute the copyBuffer() function several times, then all the buffers will be copied strictly one at a time.
I want to use a fence for each function call so that multiple calls can run in parallel.
So far, I only have this solution:
void MainWindow::copyBuffer(VkBuffer srcBuffer, VkBuffer dstBuffer, VkDeviceSize size)
{
VkCommandBuffer commandBuffer;
vkAllocateCommandBuffers(logicalDevice, &allocInfo, &commandBuffer);
//Create fence
VkFenceCreateInfo fenceInfo{};
fenceInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
fenceInfo.flags = VK_FENCE_CREATE_SIGNALED_BIT;
VkFence executionCompleteFence = VK_NULL_HANDLE;
if (vkCreateFence(logicalDevice, &fenceInfo, VK_NULL_HANDLE, &executionCompleteFence) != VK_SUCCESS) {
throw MakeErrorInfo("Failed to create fence");
}
//Start recording
vkBeginCommandBuffer(commandBuffer, &beginInfo);
vkCmdCopyBuffer(commandBuffer, srcBuffer, dstBuffer, 1, &copyRegion);
vkEndCommandBuffer(commandBuffer);
//Run command buffer
vkQueueSubmit(graphicsQueue, 1, &submitInfo, VK_NULL_HANDLE);
vkWaitForFences(logicalDevice, 1, &executionCompleteFence, VK_TRUE, UINT64_MAX);
vkResetFences(logicalDevice, 1, &executionCompleteFence);
vkFreeCommandBuffers(logicalDevice, commandPool, 1, &commandBuffer);
vkDestroyFence(logicalDevice, executionCompleteFence, VK_NULL_HANDLE);
}
Which of these options is better?
Is the second option written correctly?
Both functions are bad in the same way. They both block the CPU from doing anything until the transfer is done. And they both could be used to potentially submit multiple CBs to the same queue in the same frame, but with different submit commands.
Neither is desirable if performance is something you care about.
Ultimately, what you need to do is have your copyBuffer function not actually perform the copy. You should have a function which builds a command buffer to do a copy. That CB is then stored in a place to be submitted later with other copying CBs. Or better yet, you can have just one copying CB that each command adds to (the first one called in a frame will create the CB).
At some point in the future, before you've submitted the work that will use this data, you need to submit the transfer operations. And the way this works depends on if you're submitting the transfer operations on the same queue as the work that will consume them or not.
If they're on the same queue, then all you need to do is have an event in a command buffer at the end of your batch that synchronizes the transfer operations with their receivers. If you want to be more clever, each transfer operation can have its own event, which the receiving operations will wait on.
And in same-queue transfers, you also want to make sure that you submit the transfers in the same vkQueueSubmit call as the rest of your work. Or to put it another way, you should never make more than one call to vkQueueSubmit for a particular queue in a particular frame.
If you're dealing with separate queues, then things change. A bit. If timeline semaphores aren't an option, you'll need to submit your transfer work before you submit the receiving operations. This is because the transfer batch will need to signal a semaphore that the receiving operation will wait on. And a binary semaphore cannot be waited on until the operation that signals it has been submitted to a queue.
But otherwise, everything else stays the same. Of course, you don't need events since you're synchronizing by semaphore.
The two functions are semantically identical and exhibit exactly the same blocking behavior.
The second is slightly better. vkQueueWaitIdle is kind of a debug and out-of-hotspot feature. It might incur a hidden second submit to signal the implicit fence.
You don't need to reset a fence that you subsequently destroy anyway. And you are creating it presignaled, which is a bug. Also, you forgot to pass it to vkQueueSubmit.