When does an image layout transition happen when no source or destination access is specified? - Vulkan

Based on the specs https://registry.khronos.org/vulkan/specs/1.3-khr-extensions/pdf/vkspec.pdf
It says "When a layout transition is specified in a
memory dependency, it happens-after the availability operations in the memory dependency, and happens-before the visibility operations"
As we know when calling vkCmdPipelineBarrier(layout1, layout2, ...), the srcAccessMask is the availability operation and the dstAccessMask is the visibility operation. I wonder if I set both of them to 0, which means there is no availability operation and no visibility operation in this memory dependency, will there be any layout transition from layout1 to layout2 actually happening after the barrier call in this case?
To answer my own question:
Based on the specs, the queue present call will execute the visibility operation in this case.
https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/vkQueuePresentKHR.html
"Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image."
So we only need to set the src access mask to register an availability operation, and we can leave the dst access mask at 0, leaving the visibility operation to the present call.
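To make that concrete, here is a minimal sketch of such a pre-present barrier, assuming the image was just written as a color attachment; cmd and swapchainImage are illustrative handles. The barrier registers an availability operation via srcAccessMask and leaves dstAccessMask at 0, deferring visibility to the presentation engine:
VkImageMemoryBarrier barrier = {0};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT; // availability operation
barrier.dstAccessMask = 0;                                    // no visibility operation; vkQueuePresentKHR provides it
barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = swapchainImage;
barrier.subresourceRange = (VkImageSubresourceRange){ VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT, // src stage
                     VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,          // dst stage (nothing else reads it in this queue)
                     0, 0, NULL, 0, NULL, 1, &barrier);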

There is nothing in the standard which says that the availability operations that are part of a memory dependency are optional. They always happen as part of executing the memory dependency. They can be empty and make nothing available, but they still always happen. The same goes for visibility.
So the layout transition happens regardless of what is available or visible. This is useful, as any writes being consumed may have already been made available by some other operation.
The bigger issue is the lack of visibility for operations down the line. A layout transition is ultimately a write operation, so processes that later use that image need to see it in the new layout; that is, the transition needs to be made visible to them. If you do this, you will likely see your validation layers complain later about some form of memory hazard.

Does a system call involve a context switch or not?

I am reading the Wikipedia page on system calls, and I cannot reconcile a few of the statements made there.
At the bottom, it says that "A system call does not generally require a context switch to another process; instead, it is executed in the context of whichever process invoked it."
Yet, at the top, it says that "[...] applications to request services via system calls, which are often initiated via interrupts. An interrupt [...] passes control to the kernel [and then] the kernel executes a specific set of instructions over which the calling program has no direct control".
It seems to me that if the interrupt "passes control to the kernel," that means that the kernel, which is "another process," is executing, and therefore a context switch happened. So there seems to be a contradiction in the Wikipedia page. Where is my understanding wrong?
Your understanding is wrong because the kernel isn't a separate process. The kernel sits in RAM, in memory areas that are mapped into every process's address space; typically it occupies the top half of the virtual address space.
When the kernel is invoked with a system call, an interrupt is not necessarily used. On x86-64, the kernel is invoked directly using a dedicated processor instruction (syscall). This instruction makes the processor jump to the address stored in a special register.
Syscalls don't necessarily involve a full context switch, but they must involve a switch from user mode to kernel mode. Most often, kernels have a kernel stack per process. This stack is mostly unused and empty when no system call is active, as there is then nothing to store in it.
The registers also need to be saved, since the kernel will use them. I don't know about other processors, but x86-64 does have the TSS, which allows for an automated stack switch from user mode to kernel mode. The registers still need to be saved manually.
In the end, there is a necessary partial context switch when entering the kernel through a system call, but it doesn't involve switching to a whole other process. Since the temporary storage for the swapped registers and the kernel stack are already reserved, it involves much less overhead, as the kernel doesn't need to touch the page tables. Swapping page tables often involves cache management and some flushing to keep everything consistent.
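To make that entry path concrete, here is a minimal Linux/x86-64 sketch in C that invokes a system call through glibc's syscall() wrapper, which ultimately executes the syscall instruction described above:
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const char msg[] = "hello from a system call\n";
    /* The wrapper places SYS_write in rax and the arguments in rdi/rsi/rdx,
     * then executes the syscall instruction. The CPU switches to kernel mode,
     * jumps to the kernel's entry point, and execution resumes here afterwards,
     * still in the same process. */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}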

Can vkQueuePresentKHR be synced using a pipeline barrier?

The fact that vkQueuePresentKHR takes a queue parameter makes me think that it is like a command delivered to the queue for execution. If so, it should be possible to make it wait (until writing into the image to be presented is finished) using a pipeline barrier whose source stage is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT and whose destination stage is VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, or maybe even with an image memory barrier to restrict the sync constraint to that image only.
But the fact that in every tutorial and book the sync is done using a semaphore makes me think that my assumption is wrong. If so, why does vkQueuePresentKHR need a queue parameter? The semaphore parameter seems to be enough: when it is signaled, vkQueuePresentKHR can present the image according to the image index parameter and the swapchain handle parameter.
There are a couple of outstanding issues against the specification. Notably, KhronosGroup/Vulkan-Docs#1308 is exactly your question.
Meanwhile, everyone usually follows this language:
The processing of the presentation happens in issue order with other queue operations, but semaphores have to be used to ensure that prior rendering and other commands in the specified queue complete before the presentation begins.
Which implies a semaphore has to be used. And given we are not 110 % sure, a semaphore should be used until we know better.
Another semi-official source is the sync wiki, which uses a semaphore.
Despite what this quote says, I think it is reasonable to believe it is also permissible to use other synchronization that makes the image visible before the vkQueuePresentKHR, such as a fence wait.
But pipeline barriers alone are likely not sufficient. The presentation is outside the queue system:
However, the scope of this set of queue operations does not include the actual processing of the image by the presentation engine.
Additionally, there is no VkPipelineStageFlagBits value for it, and vkQueuePresentKHR is not included in submission order, so it cannot be in the synchronization scope of any vkCmdPipelineBarrier.
The confusing part is this unfortunate wording:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine.
I believe the trick is in "before vkQueuePresentKHR is executed". As said above, vkQueuePresentKHR is not part of submission order, so you do not know whether the memory was made available via a pipeline barrier before vkQueuePresentKHR is executed.
Presentation is a queue operation. That's why you submit it to a queue. A queue that will execute the presentation of the image. And specifically to a queue that is able to perform present operations.
As for how to synchronize... the specification is a bit ambiguous on this point.
Semaphores are definitely able to work; there's a specific callout for this in the specification.
Semaphores are not necessary for making the results of prior commands visible to the present:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image.
While provisions are made for semaphores, there is no specific statement about other mechanisms. In particular, if you don't wait on a semaphore, it's not clear what "happens-after the semaphore signal operation" means, since no such signal operation happened.
Now, the API for vkQueuePresentKHR makes it clear that you don't need to provide a semaphore to wait on:
waitSemaphoreCount is the number of semaphores to wait for before issuing the present request.
The number may be zero.
One might think that, as a queue operation, all prior synchronization on that queue would still affect presentation. For example, an external subpass dependency if you wrote to the swapchain image as an attachment. And it probably would... if not for one little problem.
See, synchronization is ultimately based on dependencies between stages. And presentation... doesn't have a stage. So while your source for the external dependency would be well-understood, it's not clear what destination stage would work. Even specifying the all-stages flag wouldn't necessarily work.
Does "not a stage" exist in the set of all stages?
In any case, it's best to just use a semaphore. You'll probably need one anyway, so just use that.
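For reference, a minimal sketch of the usual semaphore approach; renderFinished, cmd, queue, swapchain, and imageIndex are illustrative names:
// Rendering submission: signal renderFinished when the commands writing the swapchain image complete.
VkSubmitInfo submitInfo = {0};
submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submitInfo.commandBufferCount = 1;
submitInfo.pCommandBuffers = &cmd;
submitInfo.signalSemaphoreCount = 1;
submitInfo.pSignalSemaphores = &renderFinished;
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);

// Present: wait on renderFinished so the presentation engine only reads the finished image.
VkPresentInfoKHR presentInfo = {0};
presentInfo.sType = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
presentInfo.waitSemaphoreCount = 1;
presentInfo.pWaitSemaphores = &renderFinished;
presentInfo.swapchainCount = 1;
presentInfo.pSwapchains = &swapchain;
presentInfo.pImageIndices = &imageIndex;
vkQueuePresentKHR(queue, &presentInfo);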

Is it possible to synchronize an automatic layout transition with a swapchain image acquisition via pWaitDstStageMask=TOP_OF_PIPE?

It is necessary to synchronize automatic layout transitions performed by render passes with the acquisition of a swapchain image via the semaphore provided in vkAcquireNextImageKHR. Vulkan Tutorial states, "We could change the waitStages for the imageAvailableSemaphore to VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT to ensure that the render passes don't begin until the image is available" (Context: waitStages is an array of VkPipelineStageFlags supplied as pWaitDstStageMask to the VkSubmitInfo for the queue submission which performs the render pass and, consequently, the automatic layout transition). It then opts for an alternative (common) solution, which is to use an external subpass dependency instead.
In fact, I've only ever seen such synchronization done with a subpass dependency; I've never seen anyone actually do it by setting pWaitDstStageMask=TOP_OF_PIPE for the VkSubmitInfo corresponding to the queue submission which performs the automatic layout transition, and I'm not certain why this would even work. The specs provide certain guarantees about automatic layout transitions being synchronized with subpass dependencies, so I understand why they would work, but in order to do this sort of synchronization by waiting on a semaphore in a VkSubmitInfo, it is first necessary that image layout transitions even have pipeline stages to begin with. I am not aware of any pipeline stages which image layout transitions go through; I believed them to be entirely atomic operations which occur in between source-available operations and destination-visible operations in a memory dependency (e.g. subpass dependency, memory barrier, etc.), as described by the specs.
Do automatic image layout transitions go through pipeline stages? If so, which stages do they go through (perhaps I'm missing something in the specs)? If not, how could setting pWaitDstStageMask=TOP_OF_PIPE in a VkSubmitInfo waiting on the image acquisition semaphore have any synchronization effect on the automatic image layout transition?
You may be misunderstanding what the tutorial is saying.
Layout transitions do not happen within a pipeline stage. They, like all aspects of an execution dependency, happen between pipeline stages. A dependency states that, after the source stage completes but before the destination stage begins, the dependency's operations will happen.
So if the destination stage of an incoming external subpass dependency is the top of the pipe, then it is saying that the layout transition as part of the dependency will happen before the execution of any pipeline stages in the target subpass.
Remember that the default external subpass dependency for each attachment has the top of the pipe as its source scope. If the semaphore wait dependency's destination stage is the top of the pipe, then the source scope of the external subpass dependency is in the destination scope of the semaphore wait. And therefore, the external subpass dependency happens-after the semaphore wait dependency.
And, as previously stated, layout transitions happen between the source scope and the destination scope. So until the source scope, and everything it depends on, finishes execution, the layout transition will not happen.
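As a sketch of what the question's pWaitDstStageMask=TOP_OF_PIPE approach would look like, assuming the default (implicit) external subpass dependency is relied on; imageAvailableSemaphore, cmd, and queue are illustrative names:
VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT; // destination scope of the semaphore wait
VkSubmitInfo submit = {0};
submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit.waitSemaphoreCount = 1;
submit.pWaitSemaphores = &imageAvailableSemaphore;
submit.pWaitDstStageMask = &waitStage;
submit.commandBufferCount = 1;
submit.pCommandBuffers = &cmd;
vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
// The default external dependency's source scope is TOP_OF_PIPE, which is inside the
// semaphore wait's destination scope, so the automatic layout transition, which happens
// between that dependency's source and destination scopes, happens-after the semaphore wait.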

Vulkan: Sync of UniformBuffer and StoreBuffer

UniformBuffers and StoreBuffers are updated, on the CPU side, using memcpy. How does synchronization work for those descriptor types? Does using memcpy imply that the application waits for memcpy to upload data to the GPU prior to continuing to the next statement? If so, does this mean that barriers are not needed for syncing these types of buffers?
Synchronization works the same way for any memory resource: with certain rare exceptions, if you've changed memory, you need a memory dependency to ensure visibility of those changes. The synchronization system doesn't care whether it's used as a UBO or whatever. It cares about the nature of the source operation (the host) and the destination operation (reading from certain shader stages).
For host-to-device memory operations, you need to perform a form of synchronization known as a "domain operation". Fortunately, vkQueueSubmit automatically performs a domain operation on any host writes made visible before the vkQueueSubmit call. So if you write stuff to GPU-visible memory, then call vkQueueSubmit (either in the same thread or via CPU-side inter-thread communication), any commands in that submit call (or later ones) will see the values you wrote.
Assuming you have made them visible. Writes to host-coherent memory are always visible to the GPU, but writes to non-coherent memory must be made visible via a call to vkFlushMappedMemoryRanges.
If you want to write to memory asynchronously to the GPU process that reads it, you'll need to use an event. You write to the memory, make it visible if needs be, then set the event. The GPU commands that read from it would wait on the event, using VK_ACCESS_HOST_WRITE_BIT as the source access, and VK_PIPELINE_STAGE_HOST_BIT as the source stage. The destination access and stage are determined by how you plan to read from it.
Vulkan knows nothing about memcpy. It doesn't care how you modify the memory; it only cares that you do so in accord with its rules.
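A minimal sketch of the non-coherent path described above; mappedPtr, ubo, uniformMemory, and submitInfo are illustrative names, and the flush is unnecessary for HOST_COHERENT memory:
// Write through the persistently mapped pointer (mappedPtr points into uniformMemory).
memcpy(mappedPtr, &ubo, sizeof ubo);

// Make the host write available; only required for memory without HOST_COHERENT.
VkMappedMemoryRange range = {0};
range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range.memory = uniformMemory;
range.offset = 0;
range.size = VK_WHOLE_SIZE;
vkFlushMappedMemoryRanges(device, 1, &range);

// vkQueueSubmit performs the host-to-device domain operation, so the commands in this
// submission (and later ones) will see the values written above.
vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);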

Impossible to acquire and present in parallel with rendering?

Note: I'm self-learning Vulkan with little knowledge of modern OpenGL.
Reading the Vulkan specifications, I can see very nice semaphores that allow the command buffer and the swapchain to synchronize. Here's what I understand to be a simple (yet I think inefficient) way of doing things:
Get image with vkAcquireNextImageKHR, signalling sem_post_acq
Build command buffer (or use pre-built) with:
Image barrier to transition image away from VK_IMAGE_LAYOUT_UNDEFINED
render
Image barrier to transition image to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
Submit to queue, waiting on sem_post_acq on fragment stage and signalling sem_pre_present.
vkQueuePresentKHR waiting on sem_pre_present.
The problem here is that the image barriers in the command buffer must know which image they are transitioning, which means that vkAcquireNextImageKHR must return before one knows how to build the command buffer (or which pre-built command buffer to submit). But vkAcquireNextImageKHR could potentially sleep a lot (because the presentation engine is busy and there are no free images). On the other hand, the submission of the command buffer is costly itself, and more importantly, all stages before fragment can run without having any knowledge of which image the final result will be rendered to.
Theoretically, it seems to me that a scheme like the following would allow a higher degree of parallelism:
Build command buffer (or use pre-built) with:
Image barrier to transition image away from VK_IMAGE_LAYOUT_UNDEFINED
render
Image barrier to transition image to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
Submit to queue, waiting on sem_post_acq on fragment stage and signalling sem_pre_present.
Get image with vkAcquireNextImageKHR, signalling sem_post_acq
vkQueuePresentKHR waiting on sem_pre_present.
Which would, again theoretically, allow the pipeline to execute all the way up to the fragment shader, while we wait for vkAcquireNextImageKHR. The only reason this doesn't work is that it is neither possible to tell the command buffer that this image will be determined later (with proper synchronization), nor is it possible to ask the presentation engine for a specific image.
My first question is: is my analysis correct? If so, is such an optimization not possible in Vulkan at all and why not?
My second question is: wouldn't it have made more sense if you could tell vkAcquireNextImageKHR which particular image you want to acquire, and iterate through them yourself? That way, you could know in advance which image you are going to ask for, and build and submit your command buffer accordingly.
Like Nicol said, you can record secondary command buffers independent of which image they will be rendering to.
However, you can take it a step further: record command buffers for all swapchain images in advance and select the correct one to submit based on the image that was acquired.
This type of reuse does take some extra consideration, because all memory ranges used are baked into the command buffer. But in many situations the required render commands don't actually change from one frame to the next, only a little bit of the data they use.
So the sequence of such a frame would be:
// Acquire the next image, then wait on the fence guarding that image's pre-recorded resources.
vkAcquireNextImageKHR(vk.dev, vk.swap, UINT64_MAX, vk.acquire, VK_NULL_HANDLE, &vk.image_ind);
vkWaitForFences(vk.dev, 1, &vk.fences[vk.image_ind], VK_TRUE, UINT64_MAX);
vkResetFences(vk.dev, 1, &vk.fences[vk.image_ind]);
// Update only the data; the command buffers themselves were recorded up front.
engine_update_render_data(vk.mapped_staging[vk.image_ind]);
VkSubmitInfo submit = build_submit(vk.acquire, vk.rend_cmd[vk.image_ind], vk.present);
vkQueueSubmit(vk.rend_queue, 1, &submit, vk.fences[vk.image_ind]);
VkPresentInfoKHR present = build_present(vk.present, vk.swap, vk.image_ind);
vkQueuePresentKHR(vk.queue, &present);
Granted, this does not allow for conditional rendering, but the GPU is in general fast enough to render some off-screen geometry without any noticeable delay. So until the player reaches a loading zone where new geometry has to be displayed, you can keep those command buffers alive.
Your entire question is predicated on the assumption that you cannot do any command buffer building work without a specific swapchain image. That's not true at all.
First, you can always build secondary command buffers; providing a VkFramebuffer is merely a courtesy, not a requirement. And this is very important if you want to use Vulkan to improve CPU performance. After all, being able to build command buffers in parallel is one of the selling points of Vulkan. For you to only be creating one is something of a waste for a performance-conscious application.
In such a case, only the primary command buffer needs the actual image.
Second, who says that you will be doing the majority of your rendering to the presentable image? If you're doing deferred rendering, most of your stuff will be written to deferred buffers. Even post-processing effects like tone-mapping, SSAO, and so forth will probably be done to an intermediate buffer.
Worst-case scenario, you can always render to your own image. Then you build a command buffer whose only content is an image copy from your image to the presentable one.
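A sketch of what that copy-only command buffer might record, assuming your image is in TRANSFER_SRC_OPTIMAL and the presentable image has already been transitioned to TRANSFER_DST_OPTIMAL; handles and dimensions are illustrative:
VkImageCopy region = {0};
region.srcSubresource = (VkImageSubresourceLayers){ VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 };
region.dstSubresource = (VkImageSubresourceLayers){ VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 };
region.extent = (VkExtent3D){ width, height, 1 };
// The whole frame was rendered into myRenderTarget; copy it to the acquired presentable image.
vkCmdCopyImage(cmd,
               myRenderTarget, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
               swapchainImage, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
               1, &region);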
all stages before fragment can run without having any knowledge of which image the final result will be rendered to.
You assume that the hardware has a strict separation between vertex processing and rasterization. This is true only for tile-based hardware.
Direct renderers just execute the whole pipeline, top to bottom, for each rendering command. They don't store post-transformed vertex data in large buffers. It just flows down to the next step. So if the "fragment stage" has to wait on a semaphore, then you can assume that all other stages will be idle as well while waiting.
wouldn't it have made more sense if you could tell vkAcquireNextImageKHR which particular image you want to acquire, and iterate through them yourself?
No. The implementation would be unable to decide which image to give you next. This is precisely why you have to ask for an image: so that the implementation can figure out on its own which image it is safe for you to have.
Also, there's specific language in the specification that the semaphore and/or fence you provide must not only be unsignaled, but there cannot be any outstanding operations waiting on them. Why?
Because vkAcquireNextImageKHR can fail. If you have some operation in a queue that's waiting on a semaphore that's never going to fire, that will cause huge problems. You have to successfully acquire first, then submit work that is based on the semaphore.
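A sketch of that ordering, acquiring first and only then submitting work that waits on the semaphore; recreate_swapchain and submit_frame are hypothetical helpers:
uint32_t imageIndex;
VkResult result = vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                                        imageAvailableSemaphore, VK_NULL_HANDLE, &imageIndex);
if (result == VK_ERROR_OUT_OF_DATE_KHR) {
    recreate_swapchain();   // hypothetical helper; nothing has been queued against the semaphore yet
    return;
}
if (result != VK_SUCCESS && result != VK_SUBOPTIMAL_KHR) {
    abort();                // acquisition genuinely failed
}
// Only now is it safe to submit work that waits on imageAvailableSemaphore.
submit_frame(imageIndex);   // hypothetical helper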
Generally speaking, if you're regularly having trouble getting presentable images in a timely fashion, you need to make your swapchain longer. That's the point of having multiple buffers, after all.