Synchronizing vertex buffer in Vulkan?

I have a vertex buffer bound to device memory that is host visible and host coherent.
To write to the vertex buffer on the host side I map it, memcpy to it and unmap the device memory.
To read from it I bind the vertex buffer in a command buffer while recording a render pass. These command buffers are submitted in a loop that acquires, submits and presents, to draw each frame.
Currently I write once to the vertex buffer at program start up.
The vertex buffer then remains the same during the loop.
I'd like to modify the vertex buffer between each frame from the host side.
What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.
For each frame:
I wait on the fence.
I reset the fence.
The acquire signals semaphore #1.
The queue submit waits on semaphore #1 and signals semaphore #2 and signals the fence.
The present waits on semaphore #2
Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?

If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid having to stall for GPU operations. So never wait on a fence for a batch that was just issued. The same thing goes for memory: you should never desire to write to memory which is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
Each batch you submit reads from certain memory. As such, the fence for that batch will be set when the GPU is finished reading from that memory. So if you want to write to the memory from the CPU, you cannot begin until the fence for the batch that reads that memory has been set.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame. It's the batch you submitted the frame before that. Since it's been some time since the GPU received that operation, it is far less likely that the CPU will have to actually wait. That is, the fence should hopefully already be set.
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check to see if it is set, and if it isn't, go do something else useful with your time. But if you have nothing else useful you could be doing, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
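A minimal sketch of the bookkeeping described above, written as a plain Python model rather than real Vulkan calls. The names FRAMES_IN_FLIGHT, region_offset and fence_to_check are hypothetical; in a real renderer the returned offset would presumably be the value passed in pOffsets to vkCmdBindVertexBuffers, and the returned frame number would select which VkFence to test before writing.

```python
FRAMES_IN_FLIGHT = 2  # assumption: double buffering, as suggested above


def region_offset(frame_number: int, region_size: int) -> int:
    """Byte offset of the region the host writes (and the GPU reads) for this frame."""
    return (frame_number % FRAMES_IN_FLIGHT) * region_size


def fence_to_check(frame_number: int):
    """Frame whose fence guards the region about to be overwritten.

    Returns None when the region has never been read by a submitted batch,
    so there is nothing to wait for.
    """
    previous_user = frame_number - FRAMES_IN_FLIGHT
    return previous_user if previous_user >= 0 else None


# Frame 5 reuses the region last read by frame 3, not by frame 4.
offset = region_offset(5, 4096)
guard = fence_to_check(5)
```

The point of the model is only that the fence guarding a region belongs to the frame submitted two frames earlier, which is why it is likely to be set already by the time the host checks it.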
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.
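The "submit first, write later" ordering can be illustrated with an analogy that uses only CPU threads. This is not Vulkan code: the reader thread stands in for the submitted batch, the threading.Event stands in for a VkEvent set through vkSetEvent, and the bytearray stands in for mapped memory. All names are made up for the illustration.

```python
import threading

staging = bytearray(4)         # stands in for mapped host-visible memory
host_done = threading.Event()  # stands in for a VkEvent set by the host
seen_by_reader = []


def reader():
    # "Submitted" before the write happens: blocks until the host
    # signals that it has finished writing.
    host_done.wait()
    seen_by_reader.append(bytes(staging))


gpu = threading.Thread(target=reader)
gpu.start()                       # analogous to submitting the batch early
staging[:] = b"\x01\x02\x03\x04"  # host write happens after the "submit"
host_done.set()                   # analogous to vkSetEvent
gpu.join()
```

Without the event the reader could observe the buffer before the write, which is the hazard the host/event dependency exists to prevent. Writing before submitting removes the need for it entirely.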


Vulkan: Sync of UniformBuffer and StoreBuffer

UniformBuffers and StoreBuffers are updated, on the CPU side, using memcpy. How does synchronization work for those descriptor types? Does using memcpy imply that the application waits for memcpy to upload data to the GPU prior to continuing to next statement? If so, does this mean that barriers are not needed for sync'ing these types of buffers?
Synchronization works the same way for any memory resource: with certain rare exceptions, if you've changed memory, you need a memory dependency to ensure visibility of those changes. The synchronization system doesn't care whether it's used as a UBO or whatever. It cares about the nature of the source operation (the host) and the destination operation (reading from certain shader stages).
For host-to-device memory operations, you need to perform a form of synchronization known as a "domain operation". Fortunately, vkQueueSubmit automatically performs a domain operation on any host writes made visible before the vkQueueSubmit call. So if you write stuff to GPU-visible memory, then call vkQueueSubmit (either in the same thread or via CPU-side inter-thread communication), any commands in that submit call (or later ones) will see the values you wrote.
Assuming you have made them visible. Writes to host-coherent memory are always visible to the GPU, but writes to non-coherent memory must be made visible via a call to vkFlushMappedMemoryRanges.
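One practical detail that the answer does not cover, added here as an assumption-labelled sketch: the ranges passed to vkFlushMappedMemoryRanges have alignment requirements tied to the device limit nonCoherentAtomSize. The helper below is a hypothetical Python model (aligned_flush_range is not a Vulkan function) that widens a dirty byte range to atom boundaries and clamps it to the end of the allocation.

```python
def aligned_flush_range(offset: int, size: int, atom_size: int, allocation_size: int):
    """Widen [offset, offset + size) to multiples of atom_size.

    Returns (aligned_offset, aligned_size). The end is clamped to the
    allocation size, since a range may also end exactly at the end of
    the memory object.
    """
    start = (offset // atom_size) * atom_size             # round down
    end = -(-(offset + size) // atom_size) * atom_size    # round up
    end = min(end, allocation_size)
    return start, end - start


# Hypothetical numbers: 50 dirty bytes at offset 100, 64-byte atoms.
flush_offset, flush_size = aligned_flush_range(100, 50, 64, 1024)
```

For host-coherent memory none of this applies, because no flush call is needed in the first place.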
If you want to write to memory asynchronously to the GPU process that reads it, you'll need to use an event. You write to the memory, make it visible if needs be, then set the event. The GPU commands that read from it would wait on the event, using VK_ACCESS_HOST_WRITE_BIT as the source access, and VK_PIPELINE_STAGE_HOST_BIT as the source stage. The destination access and stage are determined by how you plan to read from it.
Vulkan knows nothing about memcpy. It doesn't care how you modify the memory; it only cares that you do so in accord with its rules.

Is an Empty VkCommandBuffer executed when submitted to a queue?

Hey guys, if we submit a VkSubmitInfo containing one empty VkCommandBuffer to the queue, I wonder if it will be executed or ignored. I mean, will the semaphores in VkSubmitInfo::pWaitSemaphores and VkSubmitInfo::pSignalSemaphores still be considered when submitting an empty VkCommandBuffer?
It looks like a stupid question, but what I want is to "multiply" the single semaphore that comes out of vkAcquireNextImageKHR.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and with VkSubmitInfo::pSignalSemaphores holding as many semaphores as I need.
if it will be executed or ignored.
What would be the difference? If there are no commands in the command buffer, then executing it will do nothing.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and with VkSubmitInfo::pSignalSemaphores holding as many semaphores as I need.
This has nothing to do with the execution of the CB itself. The behavior of a batch doesn't change just because the CB doesn't do anything.
However, unless you have multiple queues waiting on the completion of this queue's operations, there's really no reason to signal multiple semaphores. The batch containing the real work could just wait on the pWaitSemaphores.
Also, there's no reason to have empty batches that only wait on a single semaphore. Let's say you have batch Q, which signals the semaphore that this empty batch waits on. Well, there's no reason that batch Q's pSignalSemaphores couldn't signal the semaphores that you want the empty batch to signal. After all, vkQueueSubmit semaphore wait operations have, as their destination command scope, all subsequent commands for that queue from vkQueueSubmit calls, the current one or subsequent ones.
So you would only need an empty batch if you have to wait on multiple semaphores that are signals from different batches on different queues. And such a complex dependency layout strongly suggests an over-complicated dependency design that will lead to reduced performance.
Even waiting on acquire makes no sense for this. You only need to wait on acquire if that queue is going to manipulate the acquired image. Well, you can't manipulate an image from multiple queues simultaneously. So there's no point in signaling a bunch of semaphores when acquire completes; that's why acquire only takes one.
So I want to simulate a Fence only with semaphores and see what goes faster.
This suggests strongly that you're thinking about things incorrectly.
You use a fence when you want the CPU to detect the completion of a GPU operation. For vkAcquireNextImageKHR, you would use a fence if you need the CPU to know when the image has been acquired.
Semaphores are about the GPU detecting when a GPU operation has completed, regardless of whether the operation comes from a queue or not. So if the GPU needs to wait until an image is acquired, you use a semaphore.
It doesn't matter which is faster because they do different things.

Should I synchronize access to memory with HOST_VISIBLE_BIT | HOST_COHERENT_BIT flags?

In other words, is it possible that the GPU will read the memory while I am mapping it on the host and writing to it?
There is a distinction between "visibility" and "availability" in Vulkan's memory model. You need both if you want to access a value.
Coherency is about "visibility". But you still need availability. HOST_COHERENT says that you don't need vkFlushMappedMemoryRanges or vkInvalidateMappedMemoryRanges. For CPU writes, visibility requires vkFlushMappedMemoryRanges (which HOST_COHERENT effectively provides), but that alone is insufficient for availability:
vkFlushMappedMemoryRanges guarantees that host writes to the memory ranges described by pMemoryRanges can be made available to device access, via availability operations from the VK_ACCESS_HOST_WRITE_BIT access type.
The "availability operations" section links to the Vulkan section on "Execution and Memory Dependencies". So even with coherent mapping, you still need to have a dependency between the host writing that memory and the GPU operation reading it.
For GPU reading operations from CPU-written data, a call to vkQueueSubmit acts as a host memory dependency on any CPU writes to GPU-accessible memory, so long as those writes were made prior to the function call.
If you need more fine-grained write dependency (you want the GPU to be able to execute some stuff in a batch while you're writing data, for example), or if you need to read data written by the GPU, you need an explicit dependency.
For in-batch GPU reading, this could be handled by an event; the host sets the event after writing the memory, and the command buffer operation that reads the memory first issues vkCmdWaitEvents for that event. And you'll need to set the appropriate memory barriers and source/destination stages.
For CPU reading of GPU-written data, this could be an event, a timeline semaphore, or a fence.
But overall, CPU writes to GPU-accessible memory still need some form of synchronization.
Coherent memory just means that you don't need to manually manage the CPU caches with vkInvalidateMappedMemoryRanges and vkFlushMappedMemoryRanges. You still need to use synchronization to make sure that reads and writes from CPU and GPU happen in the right order, and you need memory barriers on the GPU side to manage GPU caches (make CPU writes visible to GPU reads, and make GPU writes available to CPU reads).

Where are buffers located?

I hear a lot about flushing buffers, sending to a buffer etc., but I don't have a mental picture of where buffers reside and what they look like.
Are buffers part of the OS kernel or part of each process? If the former, can the same buffers be used by multiple processes?
A buffer is a generic term for a collection of bytes, typically used in the context of either sending, receiving or storing information where the internal data-structure of the information isn't important.
In the case of "flushing" buffers, this typically is used in the context of sending data either to a file or network; the buffer in this case being used to coalesce multiple small writes to the file or network into one larger and more-efficient-to-transmit buffer. After the final write has been performed (or after some "commit" point), the buffer must be "flushed" to ensure that any data left waiting to coalesce with a future write is committed immediately to the underlying file or sent over the network, rather than left waiting for a future write that might never come.
In the case of both network and file IO, buffers are usually used in multiple places. File IO may well be buffered by a buffer in the application, in a library (for instance an implementation of fwrite may buffer the output), in the kernel and even on the device itself. Network writes may well be buffered by the device whilst waiting for bandwidth on the wire, and hard-disk drives will buffer output from the OS to ensure that data isn't lost as the physical platters spin to the correct position for the write.
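To make the coalescing idea concrete, here is a small sketch of an application-level buffer, assuming a sink object with a write method (an in-memory io.BytesIO stands in for a file or socket). The class name CoalescingWriter and the 8-byte capacity are made up for the illustration; real libraries use far larger buffers.

```python
import io


class CoalescingWriter:
    """Collects small writes and forwards them to the sink in larger chunks."""

    def __init__(self, sink, capacity: int = 8):
        self.sink = sink
        self.capacity = capacity
        self.pending = bytearray()
        self.sink_writes = 0  # how many times the sink was actually written to

    def write(self, data: bytes) -> None:
        self.pending += data
        if len(self.pending) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        # Push out whatever is still waiting, even if the buffer is not full.
        if self.pending:
            self.sink.write(bytes(self.pending))
            self.sink_writes += 1
            self.pending.clear()


sink = io.BytesIO()
writer = CoalescingWriter(sink, capacity=8)
writer.write(b"abc")      # stays in the buffer
writer.write(b"defgh")    # buffer reaches capacity, one write to the sink
writer.write(b"ij")       # stays in the buffer until flushed
before_flush = sink.getvalue()
writer.flush()
after_flush = sink.getvalue()
```

The last two bytes only reach the sink because of the explicit flush, which is exactly the "future write that might never come" situation described above.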

Direct memory access DMA - how does it work?

I read that if DMA is available, then processor can route long read or write requests of disk blocks to the DMA and concentrate on other work. But, DMA to memory data/control channel is busy during this transfer. What else can processor do during this time?
First of all, DMA (per se) is almost entirely obsolete. As originally defined, DMA controllers depended on the fact that the bus had separate lines to assert for memory read/write, and I/O read/write. The DMA controller took advantage of that by asserting both a memory read and I/O write (or vice versa) at the same time. The DMA controller then generated successive addresses on the bus, and data was read from memory and written to an output port (or vice versa) each bus cycle.
The PCI bus, however, does not have separate lines for memory read/write and I/O read/write. Instead, it encodes one (and only one) command for any given transaction. Instead of using DMA, PCI normally does bus-mastering transfers. This means instead of a DMA controller that transfers memory between the I/O device and memory, the I/O device itself transfers data directly to or from memory.
As for what else the CPU can do at the time, it all depends. Back when DMA was common, the answer was usually "not much" -- for example, under early versions of Windows, reading or writing a floppy disk (which did use the DMA controller) pretty much locked up the system for the duration.
Nowadays, however, the memory typically has considerably greater bandwidth than the I/O bus, so even while a peripheral is reading or writing memory, there's usually a fair amount of bandwidth left over for the CPU to use. In addition, a modern CPU typically has a fairly large cache, so it can often execute some instructions without using main memory at all.
Well the key point to note is that the CPU bus is always partly used by the DMA and the rest of the channel is free to use for any other jobs/process to run. This is the key advantage of DMA over I/O. Hope this answered your question :-)
But, DMA to memory data/control channel is busy during this transfer.
Being busy doesn't mean you're saturated and unable to do other concurrent transfers. It's true the memory may be a bit less responsive than normal, but CPUs can still do useful work, and there are other things they can do unimpeded: crunch data that's already in their cache, receive hardware interrupts, etc. And it's not just about the quantity of data, but the rate at which it's generated: some devices create data in hard real-time and need it to be consumed promptly, otherwise it's overwritten and lost. To handle this without DMA, the software may have to nail itself to a CPU core and then spin waiting and reading, avoiding being swapped onto some other task for an entire scheduler time slice, even though most of the time further data isn't even ready.
During a DMA transfer, the CPU is idle and has no control over the memory bus. The CPU is put into the idle state by placing its bus lines in a high-impedance state.