What does "VkImageMemoryBarrier::srcAccessMask = 0" mean?

I just read the Images chapter of the Vulkan tutorial, and I don't understand the part about "VkImageMemoryBarrier::srcAccessMask = 0".
code:
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
and the tutorial says:
Since the transitionImageLayout function executes a command buffer with only a single command, you could use this implicit synchronization and set srcAccessMask to 0 if you ever needed a VK_ACCESS_HOST_WRITE_BIT dependency in a layout transition.
Q1: If the function executed a command buffer with multiple commands, would this implicit synchronization no longer apply?
Q2: According to the manual page, VK_ACCESS_HOST_WRITE_BIT is 0x00004000, but the tutorial uses "0". Why?
Does "0" mean implicit,
and "VK_ACCESS_HOST_WRITE_BIT" mean explicit?
Am I understanding correctly?

An access mask of 0 means "nothing": the barrier introduces no memory dependency.
Implicit synchronization means Vulkan does it for you. As the tutorial says:
One thing to note is that command buffer submission results in implicit VK_ACCESS_HOST_WRITE_BIT synchronization
Specifically, this is the host write ordering guarantee.
Implicit means you don't have to do anything. Any host write to mapped memory is already automatically visible to any device access of any vkQueueSubmit called after the mapped memory write.
Explicit in this case would mean to submit a barrier with VK_PIPELINE_STAGE_HOST_BIT and VK_ACCESS_HOST_*_BIT.
Note that the synchronization guarantees only work one way. CPU → GPU is automatic/implicit, but GPU → CPU always needs to be explicit (you need a barrier with dst = VK_PIPELINE_STAGE_HOST_BIT to perform the memory domain transfer operation).
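For illustration, a minimal sketch of what the explicit form could look like, assuming the host has written to mapped memory that a later transfer command will read (cmdBuf is a hypothetical command buffer in the recording state):
VkMemoryBarrier hostBarrier = {};
hostBarrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
hostBarrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;     // make host writes available
hostBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;  // ...and visible to transfer reads
vkCmdPipelineBarrier(
    cmdBuf,
    VK_PIPELINE_STAGE_HOST_BIT,      // writes happened on the host
    VK_PIPELINE_STAGE_TRANSFER_BIT,  // reads happen in the transfer stage
    0,
    1, &hostBarrier,
    0, nullptr,
    0, nullptr);
For memory written before vkQueueSubmit this barrier is redundant, precisely because of the implicit host write guarantee described above.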

Related

Vulkan compute shader caches and barriers

I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
    Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
    Constraint constraints[];
};
void main(){
    const uint gID = gl_GlobalInvocationID.x;
    for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
        // first query the constraint, which contains particle_id_1 and particle_id_2
        const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
        // read newest positions
        vec3 position1 = particles[c.particle_id_1].position;
        vec3 position2 = particles[c.particle_id_2].position;
        // modify position1 and position2
        position1 += something;
        position2 -= something;
        // update positions
        particles[c.particle_id_1].position = position1;
        particles[c.particle_id_2].position = position2;
        // in the next iteration, different constraints may use the updated positions
    }
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position, I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right? So if I then have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I do not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier:
void vkCmdPipelineBarrier(
VkCommandBuffer commandBuffer,
VkPipelineStageFlags srcStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkPipelineStageFlags dstStageMask, // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
VkDependencyFlags dependencyFlags, // here nothing
uint32_t memoryBarrierCount, // here 0
const VkMemoryBarrier* pMemoryBarriers, // nullptr
uint32_t bufferMemoryBarrierCount, // 0
const VkBufferMemoryBarrier* pBufferMemoryBarriers, // nullptr
uint32_t imageMemoryBarrierCount, // 0
const VkImageMemoryBarrier* pImageMemoryBarriers); // nullptr
Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
However, if you want a GPU level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe not.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.
Implementation-dependent. For all we know, a device might have no cache at all, or in the future it might be some quantum magic.
A shader assignment operation does not imply anything about caches. There's no "L1" or "L2" mentioned anywhere in the Vulkan specification; it is a concept that does not exist there.
Completely divorce yourself from the cache stuff and all the mental baggage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First, you put your writes into the src* part of a memory dependency (e.g. via a pipeline barrier). That makes your writes "available from".
Then you put your reader into the dst* part, which takes all referenced writes that are "available from" and makes them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.
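Tying this back to the question's pipeline barrier, here is a minimal sketch, assuming the two vkCmdDispatch calls are recorded into commandBuffer, of the same barrier with the missing memory dependency added:
VkMemoryBarrier memBarrier = {};
memBarrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
memBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;  // writes of the first dispatch -> "available from"
memBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;   // reads of the second dispatch -> "visible to"
vkCmdPipelineBarrier(
    commandBuffer,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // first synchronization scope
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // second synchronization scope
    0,
    1, &memBarrier,                        // the memory dependency that was missing
    0, nullptr,
    0, nullptr);
A VkBufferMemoryBarrier restricted to the Particles buffer would work as well; the global VkMemoryBarrier is simply the least verbose way to cover all the storage buffer writes.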

VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT VkAccessFlags set to 0?

In the Vulkan spec it defines:
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT is equivalent to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT with VkAccessFlags set to 0 when specified in the second synchronization scope, but specifies no stages in the first scope.
and similarly:
VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT is equivalent to VK_PIPELINE_STAGE_ALL_COMMANDS_BIT with VkAccessFlags set to 0 when specified in the first synchronization scope, but specifies no stages in the second scope.
I'm unclear what it means by "with VkAccessFlags set to 0" in this context?
Technically VkAccessFlags is a type, not a variable, so it can't be set to anything.
(It seems to be adjusting the definitions of TOP/BOTTOM_OF_PIPE for some special property of VK_PIPELINE_STAGE_ALL_COMMANDS_BIT with respect to VkAccessFlags, but I can't quite see what that special property is or where it is specified.)
Anyone know what it's talking about?
(or, put another way: If we removed those two utterances of "with VkAccessFlags set to 0" from the spec, what would break?)
It is a roundabout way of saying that the interpretation of the stage flags is different for a memory dependency.
For an execution dependency, the src scope takes the stage bits you provide and logically earlier stages are included automatically. Similarly for dst, logically later stages are included automatically.
But this applies only to the execution dependency. For a memory dependency, only the stage flags you provide count (none are added automatically).
For example, let's say you have VK_PIPELINE_STAGE_ALL_COMMANDS_BIT + VK_ACCESS_MEMORY_WRITE_BIT in src. That means all memory writes from all previous commands will be made available. But if you have VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT + VK_ACCESS_MEMORY_WRITE_BIT in src, that means all memory writes from only BOTTOM_OF_PIPE stage are made available, so no memory writes are made available (because that particular stage doesn't make any).
Either way, IMO it is better for code clarity to always state the pipeline stages explicitly whenever one can.
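As a concrete sketch of that contrast (commandBuffer is a hypothetical command buffer being recorded; the barrier contents are identical in both calls, only the src stage differs):
VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_MEMORY_READ_BIT;
// Makes all prior memory writes from all stages available:
vkCmdPipelineBarrier(commandBuffer,
    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);
// Makes only writes performed by the BOTTOM_OF_PIPE stage available,
// i.e. none, because that stage performs no memory accesses:
vkCmdPipelineBarrier(commandBuffer,
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);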

compute writes and transfer sync

I have a compute shader that writes into a storage buffer. As soon as the compute queue becomes idle, the storage buffer is transferred to an image. Pipeline barriers before and after the transfer take care of the layout transitions.
The relevant code is as follows:
vkCmdDispatch(...);
...
vkQueueWaitIdle(...);
...
...
VkImageMemoryBarrier i = {};
i.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
i.srcAccessMask = 0;
i.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
i.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
...
i.image = texture;
i.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
i.subresourceRange....
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
VK_PIPELINE_STAGE_TRANSFER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
...
vkCmdCopyBufferToImage(...);
...
i.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
i.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
i.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
i.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
vkCmdPipelineBarrier(
commandbuffer,
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
0,0,nullptr,0,nullptr,1,&i
);
The image is then used in a subsequent renderpass, and everything works just fine.
However, I am a bit concerned that I might be hitting undefined behaviour, because even if waiting for the compute queue ensures execution order between the buffer writes and the buffer transfer, there is no explicit barrier that ensures the writes from the compute shader are actually available and visible to the buffer transfer.
Is there an implicit buffer or memory barrier (at least in this case) that I cannot find in the specs (1.1.123 as of today), or any other kind of mechanism, such that the above code is correct and the compute shader writes are always available to the buffer transfer?
If not, would I be right to assume there should be a VkBufferMemoryBarrier right
before the first layout-transition pipeline barrier?
I am a bit confused, because reading the specs, I find:
"vkCmdPipelineBarrier is a synchronization command that inserts a dependency
between commands submitted to the same queue, or between commands in the same subpass."
but here I would need to insert a memory dependency between two different queues and two distinct pipelines, so I'm not really sure which pipeline would have to carry the barrier, if a barrier is even needed in the first place.
You are thinking of the wrong synchronization tool. For synchronization between queues there is VkSemaphore.
There is an additional complication in this case: the concept of queue-family resource ownership. In the case of VK_SHARING_MODE_EXCLUSIVE and differing queue families, you must proceed as written in the Queue Family Ownership Transfer section of the specification, i.e. use a special releasing and acquiring barrier pair plus a semaphore.
Otherwise, only the semaphore is needed, as explained in the Semaphore Signaling and Semaphore Waiting & Unsignaling sections:
The first access scope includes all memory access performed by the device.
The second access scope includes all memory access performed by the device.
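A minimal sketch of the semaphore approach, assuming the compute work goes to computeQueue and the transfer/graphics work to graphicsQueue; the handle names are placeholders:
VkSemaphore computeDone; // created earlier with vkCreateSemaphore
// Submit the compute work and signal the semaphore when it finishes.
VkSubmitInfo computeSubmit = {};
computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
computeSubmit.commandBufferCount = 1;
computeSubmit.pCommandBuffers = &computeCommandBuffer;
computeSubmit.signalSemaphoreCount = 1;
computeSubmit.pSignalSemaphores = &computeDone;
vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);
// Submit the transfer work, waiting on the semaphore before the transfer stage.
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TRANSFER_BIT;
VkSubmitInfo transferSubmit = {};
transferSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
transferSubmit.waitSemaphoreCount = 1;
transferSubmit.pWaitSemaphores = &computeDone;
transferSubmit.pWaitDstStageMask = &waitStage;
transferSubmit.commandBufferCount = 1;
transferSubmit.pCommandBuffers = &transferCommandBuffer;
vkQueueSubmit(graphicsQueue, 1, &transferSubmit, VK_NULL_HANDLE);
Because the semaphore signal and wait operations cover all memory accesses made by the device (per the quoted access scopes), no extra VkBufferMemoryBarrier is needed for the compute writes themselves; the image barriers stay for the layout transitions, and if the queue families differ an ownership transfer is still required as noted above.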

How to use VkPipelineCache?

If I understand it correctly, I'm supposed to create an empty VkPipelineCache object, pass it into vkCreateGraphicsPipelines, and data will be written into it. I can then reuse it with other pipelines I'm creating, or save it to a file and use it on the next run.
I've tried following the LunarG example to extract the info:
uint32_t headerLength = pData[0];
uint32_t cacheHeaderVersion = pData[1];
uint32_t vendorID = pData[2];
uint32_t deviceID = pData[3];
But I always get headerLength is 32 and the rest 0. Looking at the spec (https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html Table 9.1), the cacheHeaderVersion should always be 1, as the only available cache header version is VK_PIPELINE_CACHE_HEADER_VERSION_ONE = 1.
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
A Vulkan pipeline cache is an opaque object that's only meaningful to the driver. There are only a few operations you're supposed to perform on it:
Creating a pipeline cache, optionally with a block of opaque binary data that was saved from an earlier run.
Getting the opaque binary data from an existing pipeline cache, typically to serialize it to disk before exiting your application.
Destroying a pipeline cache as part of the proper shutdown process.
The idea is that the driver can use the cache to speed up creation of pipelines within your program, and also to speed up pipeline creation on subsequent runs of your application.
You should not be attempting to interpret the cache data returned from vkGetPipelineCacheData at all. The only purpose for that data is to be passed into a later call to vkCreatePipelineCache.
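A minimal sketch of that round trip, assuming device is a valid VkDevice and loadFile/saveFile are hypothetical disk I/O helpers:
// On startup: create the cache, seeding it with last run's data if present.
std::vector<char> initialData = loadFile("pipeline_cache.bin"); // may be empty on the first run
VkPipelineCacheCreateInfo cacheInfo = {};
cacheInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
cacheInfo.initialDataSize = initialData.size();
cacheInfo.pInitialData = initialData.empty() ? nullptr : initialData.data();
VkPipelineCache cache;
vkCreatePipelineCache(device, &cacheInfo, nullptr, &cache);
// ... pass `cache` to vkCreateGraphicsPipelines / vkCreateComputePipelines ...
// On shutdown: retrieve the opaque blob and write it out unmodified.
size_t size = 0;
vkGetPipelineCacheData(device, cache, &size, nullptr);     // query the size first
std::vector<char> data(size);
vkGetPipelineCacheData(device, cache, &size, data.data()); // then fetch the data
saveFile("pipeline_cache.bin", data);
vkDestroyPipelineCache(device, cache, nullptr);
The driver validates the header (vendor/device ID, cache UUID) of any initial data you pass in, so feeding it a stale or foreign blob is safe; it will simply be ignored.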
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
Drivers must implement vkCreatePipelineCache, vkGetPipelineCacheData, etc. But they don't actually have to support caching. So if you're working with a driver that doesn't have anything it can cache, or hasn't done the work to support caching, then you'd naturally get back an empty cache (other than the header).

vkQueueSubmit() call includes a stageMask with VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT bit set when device does not have geometryShader feature enabled

First of all, I'm a total newbie with Vulkan (I'm using the binding provided by LWJGL). I know I should copy/paste more code, but I don't even know what would be relevant for now (so don't hesitate to ask for any specific piece of code).
I am trying to do something like this:
Use a compute shader to compute a buffer of pixels.
Use vkCmdCopyBufferToImage to directly copy this array into a framebuffer image.
So, no vertex/fragment shaders for now.
I allocated a compute pipeline and a framebuffer. I have one {Queue/CommandPool/CommandBuffer} for computation, and another one for rendering.
When I try to submit the graphics queue with:
vkQueueSubmit(graphicQueue, renderPipeline.getFrameSubmission().getSubmitInfo(imageIndex));
I obtain the following error messages (from the validation layers):
ERROR OCCURED: Object: VK_NULL_HANDLE (Type = 0) | vkQueueSubmit() call includes a stageMask with VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT bit set when device does not have geometryShader feature enabled. The spec valid usage text states 'If the geometry shaders feature is not enabled, each element of pWaitDstStageMask must not contain VK_PIPELINE_STAGE_GEOMETRY_SHADER_BIT' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-VkSubmitInfo-pWaitDstStageMask-00076)
ERROR OCCURED: Object: VK_NULL_HANDLE (Type = 0) | vkQueueSubmit() call includes a stageMask with VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT and/or VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT bit(s) set when device does not have tessellationShader feature enabled. The spec valid usage text states 'If the tessellation shaders feature is not enabled, each element of pWaitDstStageMask must not contain VK_PIPELINE_STAGE_TESSELLATION_CONTROL_SHADER_BIT or VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT' (https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html#VUID-VkSubmitInfo-pWaitDstStageMask-00077)
I tried changing VkSubmitInfo.pWaitDstStageMask to different values (like VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT...) but nothing changed.
So, what would be the best pWaitDstStageMask for my use case?
OK, I found my problem:
pWaitDstStageMask must be an array of the same size as pWaitSemaphores.
I had only provided 1 stage mask for 2 semaphores.
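A minimal sketch of the corrected submit; the semaphore names are placeholders, and since the graphics submission here only performs a copy, VK_PIPELINE_STAGE_TRANSFER_BIT is a reasonable stage for both waits:
VkSemaphore waitSemaphores[2] = { imageAvailableSemaphore, computeDoneSemaphore };
// One stage mask per wait semaphore, in the same order:
VkPipelineStageFlags waitStages[2] = {
    VK_PIPELINE_STAGE_TRANSFER_BIT,  // wait for the swapchain image before copying into it
    VK_PIPELINE_STAGE_TRANSFER_BIT   // wait for the compute results before copying from the buffer
};
VkSubmitInfo submitInfo = {};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.waitSemaphoreCount = 2;
submitInfo.pWaitSemaphores = waitSemaphores;
submitInfo.pWaitDstStageMask = waitStages;   // must have waitSemaphoreCount entries
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &commandBuffer;
vkQueueSubmit(graphicQueue, 1, &submitInfo, VK_NULL_HANDLE);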