I have a number of uniform buffers - one for every framebuffer. With the help of fences, I guarantee that the update on the CPU is safe, i.e. when I memcpy, I'm sure the buffer is not in use. After an update, I flush the memory.
Now, if I understand correctly, I need to make the new data available to the GPU - for this, I need to use barriers. This is how I'm doing it right now:
VkBufferMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.pNext = nullptr;
barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = buffer;
barrier.offset = offset;
barrier.size = size;
vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT, 0, 0, nullptr, 1, &barrier, 0, nullptr);
Well, actually in my case everything works without the barrier. Do I need it at all?
If I change barrier.dstAccessMask and dstStageMask to VK_ACCESS_TRANSFER_READ_BIT and VK_PIPELINE_STAGE_TRANSFER_BIT respectively, everything again works fine, layers are not complaining. What is the better choice and why?
If I try to set a barrier after vkCmdBeginRenderPass, the layer complains. So I moved all my barriers between vkBeginCommandBuffer and vkCmdBeginRenderPass. How correct is this?
Well, actually in my case everything works without the barrier.
Does the specification say you need it? Then you need it.
When dealing with Vulkan, you should not take "everything appears to work" as a sign that you've done everything right.
What is the better choice and why?
The "better choice" is the correct one. The GPU is not doing a transfer operation on this memory; it's reading it as uniform data. Therefore, the operation you specify in the barrier must match this.
Layers aren't complaining because it's more or less impossible for validation layers to know when you've written to a piece of memory. Therefore, they can't tell if you correctly built a barrier to make such writes available to the GPU.
If I try to set a barrier after vkCmdBeginRenderPass, the layer complains.
Barriers inside render passes have to be inside subpasses with self-dependencies. And the barrier itself has to match the subpass self-dependency.
Basically, barriers of the kind you're talking about have to happen before the render pass.
That being said, merely calling vkQueueSubmit automatically creates a memory barrier between (properly flushed) host writes issued before vkQueueSubmit and any usage of those writes from the command buffers in the submit command (and of course, commands in later submit operations).
So you shouldn't need such a barrier, so long as you can ensure that you've finished your writes (and any needed flushing) before the vkQueueSubmit that reads from them. And if you can't guarantee that, you probably should have been using a vkCmdWaitEvents barrier to prevent trying to read until you had finished writing (and flushing).
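Put together, a per-frame update relying on that implicit guarantee might look like the following sketch. Names like frameFence, mappedUbo, uboMemory, and submitInfo are placeholders, and the flush is only needed for non-coherent memory:

```cpp
// Wait until the GPU has finished the submission that last read this buffer.
vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &frameFence);

// Safe to overwrite now; the memory is assumed persistently mapped.
memcpy(mappedUbo, &newUniformData, sizeof(newUniformData));

// Only required if the memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT.
VkMappedMemoryRange range{VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE};
range.memory = uboMemory;
range.offset = 0;
range.size   = VK_WHOLE_SIZE;
vkFlushMappedMemoryRanges(device, 1, &range);

// vkQueueSubmit itself makes these flushed host writes available to the
// device, so no VK_PIPELINE_STAGE_HOST_BIT barrier is needed in the
// command buffer.
vkQueueSubmit(queue, 1, &submitInfo, frameFence);
```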
Related
I'm having trouble understanding why we specify pipeline stages for the pipeline barrier as well as an access mask. The point of specifying a barrier with pipeline stages is to instruct that all commands before the barrier must complete a certain stage before any commands after the barrier begin a certain stage. With this reasoning, let's say I want to transfer data from a staging buffer to a device-local buffer. Before I read that buffer from a shader, I need a barrier saying that the transfer stage must take place before the shader read stage after it. I can do this with:
srcstage = VK_PIPELINE_STAGE_TRANSFER_BIT;
dststage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
This is supposed to say that all commands before the barrier must complete the transfer stage before all commands after the barrier start the fragment shader stage.
However, I don't understand the concept of the access mask and why we use it together with the stage mask. Apparently the stage masks don't ensure "visibility" or "availability"? Does this mean that although the transfer stage will complete before the fragment stage starts, there is still no guarantee that the shader will read the proper data, because caches may not have been flushed and the writes not made visible/available?
If that's the case and we need to specify access masks such as:
srcaccessmask = VK_ACCESS_MEMORY_WRITE_BIT;
dstaccessmask = VK_ACCESS_MEMORY_READ_BIT;
Then what is even the point of specifying stages if the stages aren't enough, and it comes down to the access mask?
Access masks ensure visibility, but you cannot have visibility over something that is not yet available (ie: you cannot see what hasn't happened yet). Visibility and availability are both necessary for a memory barrier, but each alone is insufficient.
After all, execution barriers do not strictly speaking need memory barriers. A write-after-read barrier does not need to ensure visibility. It simply needs to make sure that no reads happen after the write; a pure execution barrier is sufficient. But if you're doing a write-after-write or a read-after-write, then you need to ensure visibility in addition to availability.
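As an illustration of that write-after-read case, a draw that samples an image followed by a transfer that overwrites it could, under those assumptions, be handled with an execution-only barrier, with no memory barrier structs at all:

```cpp
// Execution-only barrier: the transfer may not start until fragment-shader
// reads are done. Nothing was written yet, so no caches need flushing.
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,  // src: the prior reads
    VK_PIPELINE_STAGE_TRANSFER_BIT,         // dst: the upcoming write
    0,
    0, nullptr,   // no VkMemoryBarrier
    0, nullptr,   // no VkBufferMemoryBarrier
    0, nullptr);  // no VkImageMemoryBarrier
```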
Execution dependency is not enough to ensure proper ordering, simply because Vulkan's memory model requires manual/explicit memory barriers.
In other words, an execution barrier only covers execution, but not the side-effects of that execution. So you also need memory barrier(s) when you need the side-effects to be coherent/consistent.
If the coherency is satisfied another way, the access mask is actually what is unnecessary:
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
semaphore_wait.pWaitDstStageMask = &wait_stage;
// we need to order our stuff after the semaphore wait, so:
barrier.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
// semaphore already includes all device memory access, so nothing here:
barrier.srcAccessMask = 0;
barrier.oldLayout = COOL_LAYOUT;
barrier.newLayout = EVEN_BETTER_LAYOUT;
Using only access masks (without stage) would be ambiguous. Access masks can limit the amount of memory coherency work the driver needs to perform:
// make available the storage writes from fragment shader,
// but do not bother with storage writes
// from compute, vertex, tessellation, geometry, raytracing, or mesh shader:
barrier.srcStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT does implicitly include the vertex, tessellation, and geometry shader stages. But only for execution dependency purposes, not for memory dependency:
Note
Including a particular pipeline stage in the first synchronization scope of a command implicitly includes logically earlier pipeline stages in the synchronization scope. Similarly, the second synchronization scope includes logically later pipeline stages.
However, note that access scopes are not affected in this way - only the precise stages specified are considered part of each access scope.
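So, as a hypothetical example: if both the vertex and fragment shaders write to a storage buffer, both stages must be listed explicitly for the access scope to cover both writes, even though VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT alone would cover the vertex stage for execution purposes:

```cpp
// Both stages listed explicitly, so the access scope covers both writes:
barrier.srcStageMask  = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT |
                        VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
```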
VkPipelineStageFlagBits defines flags corresponding to stages that we would expect in a Graphics Pipeline such as VK_PIPELINE_STAGE_VERTEX_SHADER_BIT or VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT, etc. We also have Compute Pipeline stages such as VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT.
But there are also some flags defined here that do not seem to correspond to any Graphics or Compute Pipeline stage at all, such as VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_HOST_BIT and VK_PIPELINE_STAGE_ALL_COMMANDS_BIT (as well as VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT and VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT). Instead, these are defined in the specification by referring to commands. For instance:
VK_PIPELINE_STAGE_TRANSFER_BIT specifies the following commands:
All copy commands, including vkCmdCopyQueryPoolResults
vkCmdBlitImage2 and vkCmdBlitImage
vkCmdResolveImage2 and vkCmdResolveImage
All clear commands, with the exception of vkCmdClearAttachments
So if it is not a pipeline stage, why is it listed in Vk**PipelineStage**FlagBits? Why is it named this way? What does it mean if I use VK_PIPELINE_STAGE_TRANSFER_BIT as srcStageMask or dstStageMask in VkSubpassDependency for instance?
VK_PIPELINE_STAGE_TRANSFER_BIT is from the transfer pipeline, which has only one stage.
The VK_PIPELINE_STAGE_HOST_BIT pseudo-stage communicates that there will be memory domain transfers between the host and the device.
VK_PIPELINE_STAGE_ALL_COMMANDS_BIT is a shortcut for |ing together all the bits applicable in a given context.
VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT and VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT are pipeline terminators. They do not mean any stage, but they are necessary on the API level to express "before first stage", and "after last stage".
What does it mean if I use VK_PIPELINE_STAGE_TRANSFER_BIT as srcStageMask or dstStageMask in VkSubpassDependency for instance?
You are not allowed to, unless that dependency is VK_SUBPASS_EXTERNAL.
In ordinary pipeline barrier or VK_SUBPASS_EXTERNAL dependency it means either some work is dependent on a copy being finished, or some copy depending on some writes to be made first.
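A sketch of the VK_SUBPASS_EXTERNAL case: making a copy that happened before the render pass visible to a vertex-shader read in subpass 0. The access masks here assume the copy wrote a buffer that is then read as a uniform:

```cpp
VkSubpassDependency dep{};
dep.srcSubpass    = VK_SUBPASS_EXTERNAL;           // work before the render pass
dep.dstSubpass    = 0;                             // first subpass
dep.srcStageMask  = VK_PIPELINE_STAGE_TRANSFER_BIT;
dep.dstStageMask  = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;
dep.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;  // the copy's writes...
dep.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT;    // ...visible to uniform reads
```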
How host memory sync works is that Vulkan divides memory into domains: there's the host memory domain, and there's the device memory domain. If changes to memory are not transitioned from one domain to the other before reading, the writes might not be visible on the other domain.
That's where VK_PIPELINE_STAGE_HOST_BIT (plus the related access mask) comes in. It instructs the driver to do this potentially expensive and disruptive domain transfer.
Now, there's a rare piece of implicit Vulkan behavior: the host write ordering guarantee. If you write from the host to memory, and then vkQueueSubmit something, you don't need to do anything. Operations in that submission automatically see the host writes, and you don't need VK_PIPELINE_STAGE_HOST_BIT.
This doesn't work the other way around, though. If you want to read something on the host that was written on the device, you always need to use dstStage = VK_PIPELINE_STAGE_HOST_BIT (followed by a fence or such); otherwise the writes might not be visible on the host domain when you try to read them.
The third option is when you submit first, and then write something on the host. Previously this could happen with Events (but that was subsequently banned), and now it can happen when using extended semaphores. In that case you need to use srcStage = VK_PIPELINE_STAGE_HOST_BIT in addition to the host-device semaphore, otherwise your host writes might not be visible to the reads on the device.
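The device-to-host direction from the second case might look like this sketch, where readbackBuffer is a placeholder for a host-visible buffer the device has just written:

```cpp
// After the device writes (e.g. a copy into a host-visible buffer):
VkBufferMemoryBarrier barrier{VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER};
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;   // device -> host domain transfer
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = readbackBuffer;
barrier.offset = 0;
barrier.size   = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_HOST_BIT,
    0, 0, nullptr, 1, &barrier, 0, nullptr);
// ...then submit, wait on a fence, and (for non-coherent memory) call
// vkInvalidateMappedMemoryRanges before reading the mapping on the host.
```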
Transfers can be inserted pretty much anywhere into a pipeline to move data between buffers. These are normal commands with dependencies that can be queued, and typically have dedicated hardware so other shaders can run in parallel. The "clear" commands use a constant value as the source for the copy operation, but still use the DMA engine, so they are in the same group.
These bits largely correspond to the resources that will be used. In the case of VkSubpassDependency, they define which resources need to be synchronized for the pipeline to deliver correct results. If you use a Copy or Clear operation in the first subpass, you need to specify VK_PIPELINE_STAGE_TRANSFER_BIT in the srcStageMask to denote that the first subpass will use the DMA engine, so the handover to the second subpass requires the write caches of the DMA engine to be flushed. Omitting this bit would allow an invalid optimization, so using a Copy or Clear command would then be disallowed.
The "All Commands", "Top of Pipe" and "Bottom of Pipe" are just shorthands to save on negotiation during pipeline setup.
When working with Vulkan, it's common that when creating a buffer, such as a uniform buffer, you create multiple buffer 'versions', because if you have double buffering, for example, you don't know if the graphics API is still drawing the last frame (using the memory you bound and instructed it to use in the previous loop). I've seen this happen with uniform buffers but not vertex or index buffers or image/texture buffers. Is this because uniform buffers are updated regularly and vertex buffers or images are not?
If you wanted to update an image or a vertex buffer how would you go about it given that you don't know whether the graphics API is still using it? Do you simply reallocate new memory for that image/buffer and start anew? Even if you just want to update a section of it? And if this is the case that you allocate a new buffer, when would you know to release the old buffer? Would say, for example 5 frames into the future be OK? Or 2 seconds? After all, it could still be being used. How is this done?
given that you don't know whether the graphics API is still using it?
But you do know.
Vulkan doesn't arbitrarily use resources. It uses them exactly and only how your code tells it to use the resource. You created and submitted the commands that use those resources, so if you need to know when a resource is in use, it is you who must keep track of it and manage this.
You have to use API synchronization functions to follow the GPU's execution of commands.
If an action command uses some set of resources, then those resources are in use while that command is being executed. You have tools like events which can be used to stop subsequent commands from executing until some prior commands have finished. And events can tell when a particular command has finished, so that you'll know when those resources are no longer in use.
Semaphores have similar powers, but at the level of a batch of work. If a semaphore is signaled, then all of the commands in the batch that signaled it have completed and are no longer using the resources they use. Fences can be used for extremely coarse synchronization, at the level of a submit command.
You multi-buffer uniform data because the nature of uniform data is such that it typically needs to change every frame. If you have vertex buffers or images to change every frame, then you'll need to do the same thing with those.
For infrequent changes, you may want to have extra memory available so that you can just create new images or buffers, then delete the old ones when the memory is no longer in use. Or you may have to stall the CPU until the GPU has finished using those resources.
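One common way to implement that "delete when the memory is no longer in use" policy is a small deferred-deletion list keyed on the fence of the submission that last used the resource. This is a sketch with made-up names, not an official Vulkan mechanism:

```cpp
#include <vector>
#include <vulkan/vulkan.h>

struct PendingDelete {
    VkBuffer       buffer;
    VkDeviceMemory memory;
    VkFence        lastUse;   // fence of the submission that last used it
};
std::vector<PendingDelete> graveyard;

// Each frame, destroy whatever the GPU has provably finished with:
void collect_garbage(VkDevice dev) {
    for (auto it = graveyard.begin(); it != graveyard.end();) {
        if (vkGetFenceStatus(dev, it->lastUse) == VK_SUCCESS) {
            vkDestroyBuffer(dev, it->buffer, nullptr);
            vkFreeMemory(dev, it->memory, nullptr);
            it = graveyard.erase(it);
        } else {
            ++it;   // still in flight; try again next frame
        }
    }
}
```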
I just learned about uniform buffers (https://vulkan-tutorial.com/Uniform_buffers/Descriptor_layout_and_buffer) and a bit confused about the size of uniformBuffers and uniformBuffersMemory. In the tutorial it is said that:
We should have multiple buffers, because multiple frames may be in flight at the same time and we don't want to update the buffer in preparation of the next frame while a previous one is still reading from it! We could either have a uniform buffer per frame or per swap chain image.
As far as I understand, the "per swap chain image" approach is more optimal. Please prove me wrong if I am. But why do we need it to be the size of swapChainImages.size()? Isn't MAX_FRAMES_IN_FLIGHT just enough, because we have fences? As a simple example, if we have just a single frame in flight and do vkDeviceWaitIdle after each presentation, then our single uniform buffer will always be available and not in use by the CPU/GPU, so we don't need an array of them.
do vkDeviceWaitIdle
OK, stop right there. There is basically only one valid reason to call that function: you need to delete every resource created by that device, because you're about to destroy the device, so you wait until all such resources are no longer being used.
Yes, if you halt the CPU's execution until the GPU stops doing stuff, then you're guaranteed that CPU writes to GPU memory will not interact with GPU reads from that memory. But you purchased this guarantee by ensuring that there will be no overlap at all between CPU execution and GPU execution. The CPU sets up some stuff, sends it to the GPU, then waits till the GPU is done, and the CPU starts up again. Everything executes perfectly synchronously. While the CPU is doing work, the GPU is doing nothing. And vice-versa.
This is not a recipe for performance. If you're going to use a graphics API designed to achieve lots of CPU/GPU overlap, you shouldn't throw that away because it's easier to work with.
Get used to multi-buffering any resources that you modify from the CPU on a regular basis. How many buffers you want to use is your choice, one that should be informed by the present mode and the like.
My question is "Do I need n buffers or m is enough?".
The situation you're describing ultimately only happens if your code wanted to have X frames in flight, but the presentation engine requires you to use a minimum of Y swap-chain images, and X < Y. So the question you're asking can be boiled down to, "if I wanted to do double-buffering, but the implementation forces 3 buffers on me, is it OK if I treat it as double-buffering?"
Yes, as long as you're not relying on the vkAcquireNextImage call to block the CPU for your synchronization. But you shouldn't be relying on that anyway, since the call itself doesn't constitute a proper barrier as far as the Vulkan execution model is concerned. You should instead block the CPU on fences tied to the actual work, not on the acquire process.
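In other words, size your per-frame resources by your chosen MAX_FRAMES_IN_FLIGHT and block on that frame slot's own fence; the swapchain image count is irrelevant to that throttling. A minimal sketch with placeholder names:

```cpp
const int MAX_FRAMES_IN_FLIGHT = 2;   // our choice, even with 3 swapchain images
uint32_t frame = currentFrame % MAX_FRAMES_IN_FLIGHT;

// Block on the fence of the work that last used this frame slot's resources...
vkWaitForFences(device, 1, &inFlightFences[frame], VK_TRUE, UINT64_MAX);
vkResetFences(device, 1, &inFlightFences[frame]);

// ...then it is safe to rewrite that slot's uniform buffer.
memcpy(mappedUniforms[frame], &frameData, sizeof(frameData));

// Acquire may return any image index; it does not drive our CPU throttling.
uint32_t imageIndex;
vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                      imageAvailable[frame], VK_NULL_HANDLE, &imageIndex);
```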
Note: I'm self-learning Vulkan with little knowledge of modern OpenGL.
Reading the Vulkan specifications, I can see very nice semaphores that allow the command buffer and the swapchain to synchronize. Here's what I understand to be a simple (yet I think inefficient) way of doing things:
Get image with vkAcquireNextImageKHR, signalling sem_post_acq
Build command buffer (or use pre-built) with:
Image barrier to transition image away from VK_IMAGE_LAYOUT_UNDEFINED
render
Image barrier to transition image to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
Submit to queue, waiting on sem_post_acq on fragment stage and signalling sem_pre_present.
vkQueuePresentKHR waiting on sem_pre_present.
The problem here is that the image barriers in the command buffer must know which image they are transitioning, which means that vkAcquireNextImageKHR must return before one knows how to build the command buffer (or which pre-built command buffer to submit). But vkAcquireNextImageKHR could potentially sleep a lot (because the presentation engine is busy and there are no free images). On the other hand, the submission of the command buffer is costly itself, and more importantly, all stages before fragment can run without having any knowledge of which image the final result will be rendered to.
Theoretically, it seems to me that a scheme like the following would allow a higher degree of parallelism:
Build command buffer (or use pre-built) with:
Image barrier to transition image away from VK_IMAGE_LAYOUT_UNDEFINED
render
Image barrier to transition image to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
Submit to queue, waiting on sem_post_acq on fragment stage and signalling sem_pre_present.
Get image with vkAcquireNextImageKHR, signalling sem_post_acq
vkQueuePresentKHR waiting on sem_pre_present.
Which would, again theoretically, allow the pipeline to execute all the way up to the fragment shader, while we wait for vkAcquireNextImageKHR. The only reason this doesn't work is that it is neither possible to tell the command buffer that this image will be determined later (with proper synchronization), nor is it possible to ask the presentation engine for a specific image.
My first question is: is my analysis correct? If so, is such an optimization not possible in Vulkan at all and why not?
My second question is: wouldn't it have made more sense if you could tell vkAcquireNextImageKHR which particular image you want to acquire, and iterate through them yourself? That way, you could know in advance which image you are going to ask for, and build and submit your command buffer accordingly.
Like Nicol said you can record secondaries independent of which image it will be rendering to.
However, you can take it a step further and record command buffers for all swapchain images in advance, and select the correct one to submit based on the image acquired.
This type of reuse does take some extra consideration, because all memory ranges used are baked into the command buffer. But in many situations the required render commands don't actually change from one frame to the next; only a little bit of the data used does.
So the sequence of such a frame would be:
vkAcquireNextImageKHR(vk.dev, vk.swap, UINT64_MAX, vk.acquire, VK_NULL_HANDLE, &vk.image_ind);
vkWaitForFences(vk.dev, 1, &vk.fences[vk.image_ind], VK_TRUE, UINT64_MAX);
vkResetFences(vk.dev, 1, &vk.fences[vk.image_ind]); // fence must be unsignaled before reuse in vkQueueSubmit
engine_update_render_data(vk.mapped_staging[vk.image_ind]);
VkSubmitInfo submit = build_submit(vk.acquire, vk.rend_cmd[vk.image_ind], vk.present);
vkQueueSubmit(vk.rend_queue, 1, &submit, vk.fences[vk.image_ind]);
VkPresentInfoKHR present = build_present(vk.present, vk.swap, vk.image_ind);
vkQueuePresentKHR(vk.queue, &present);
Granted this does not allow for conditional rendering but the gpu is in general fast enough to allow some geometry to be rendered out of frame without any noticeable delays. So until the player reaches a loading zone where new geometry has to be displayed you can keep those command buffers alive.
Your entire question is predicated on the assumption that you cannot do any command buffer building work without a specific swapchain image. That's not true at all.
First, you can always build secondary command buffers; providing a VkFramebuffer is merely a courtesy, not a requirement. And this is very important if you want to use Vulkan to improve CPU performance. After all, being able to build command buffers in parallel is one of the selling points of Vulkan. For you to only be creating one is something of a waste for a performance-conscious application.
In such a case, only the primary command buffer needs the actual image.
Second, who says that you will be doing the majority of your rendering to the presentable image? If you're doing deferred rendering, most of your stuff will be written to deferred buffers. Even post-processing effects like tone-mapping, SSAO, and so forth will probably be done to an intermediate buffer.
Worst-case scenario, you can always render to your own image. Then you build a command buffer whose only contents are an image copy from your image to the presentable one.
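That worst-case command buffer could be little more than a layout transition and a copy. Roughly, assuming offscreenImage is already in TRANSFER_SRC_OPTIMAL and width/height are the image dimensions:

```cpp
// Transition the acquired swapchain image for the copy first
// (barrier setup omitted: UNDEFINED -> TRANSFER_DST_OPTIMAL).
VkImageCopy region{};
region.srcSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.dstSubresource = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1};
region.extent         = {width, height, 1};
vkCmdCopyImage(cmd,
               offscreenImage, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
               swapchainImage, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
               1, &region);
// ...then transition swapchainImage to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR.
```

Note that this requires the swapchain to have been created with VK_IMAGE_USAGE_TRANSFER_DST_BIT.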
all stages before fragment can run without having any knowledge of which image the final result will be rendered to.
You assume that the hardware has a strict separation between vertex processing and rasterization. This is true only for tile-based hardware.
Direct renderers just execute the whole pipeline, top to bottom, for each rendering command. They don't store post-transformed vertex data in large buffers. It just flows down to the next step. So if the "fragment stage" has to wait on a semaphore, then you can assume that all other stages will be idle as well while waiting.
wouldn't it have made more sense if you could tell vkAcquireNextImageKHR which particular image you want to acquire, and iterate through them yourself?
No. The implementation would be unable to decide which image to give you next. This is precisely why you have to ask for an image: so that the implementation can figure out on its own which image it is safe for you to have.
Also, there's specific language in the specification that the semaphore and/or event you provide must not only be unsignaled but there cannot be any outstanding operations waiting on them. Why?
Because vkAcquireNextImageKHR can fail. If you have some operation in a queue that's waiting on a semaphore that's never going to fire, that will cause huge problems. You have to successfully acquire first, then submit work that is based on the semaphore.
Generally speaking, if you're regularly having trouble getting presentable images in a timely fashion, you need to make your swapchain longer. That's the point of having multiple buffers, after all.