Is an empty VkCommandBuffer executed when submitted to a queue?

Hey guys, I wonder: if we submit a VkSubmitInfo containing one empty VkCommandBuffer to the queue, will it be executed or ignored? I mean, will the semaphores in VkSubmitInfo::pWaitSemaphores and VkSubmitInfo::pSignalSemaphores still be considered when submitting an empty VkCommandBuffer?
It may look like a stupid question, but what I want is to "multiply" the single semaphore that comes out of vkAcquireNextImageKHR.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.

will it be executed or ignored?
What would be the difference? If there are no commands in the command buffer, then executing it will do nothing.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to "acquire_semaphore", and VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.
This has nothing to do with the execution of the CB itself. The behavior of a batch doesn't change just because the CB doesn't do anything.
However, unless you have multiple queues waiting on the completion of this queue's operations, there's really no reason to have multiple signal semaphores. The batch containing the real work could just wait on the pWaitSemaphores.
Also, there's no reason to have empty batches that only wait on a single semaphore. Let's say you have batch Q, which signals the semaphores that this empty batch waits on. Well, there's no reason that batch Q's pSignalSemaphores couldn't signal the semaphores that you want the empty batch to signal. After all, vkQueueSubmit semaphore wait operations have, as their destination command scope, all subsequent commands for that queue, whether from the current vkQueueSubmit call or later ones.
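To make that concrete, here is a minimal sketch in C of a single batch that waits on the acquire semaphore and signals several semaphores itself, so no empty batch is needed. All handles and the wait stage are hypothetical and assumed to be created elsewhere.

#include <vulkan/vulkan.h>

// Hypothetical helper: one real batch waits on the acquire semaphore and
// signals three semaphores itself; no empty batch required.
void submit_with_fanout(VkQueue queue, VkCommandBuffer renderCmdBuf,
                        VkSemaphore acquireSemaphore,
                        VkSemaphore semA, VkSemaphore semB, VkSemaphore semC)
{
    VkSemaphore signalSems[3] = { semA, semB, semC };
    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;

    VkSubmitInfo submit = {
        .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount   = 1,
        .pWaitSemaphores      = &acquireSemaphore,
        .pWaitDstStageMask    = &waitStage,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &renderCmdBuf,
        .signalSemaphoreCount = 3,   // one batch may signal many semaphores
        .pSignalSemaphores    = signalSems,
    };
    vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
}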
So you would only need an empty batch if you have to wait on multiple semaphores that are signaled by different batches on different queues. And such a complex dependency layout strongly suggests an over-complicated design that will lead to reduced performance.
Even waiting on acquire makes no sense for this. You only need to wait on acquire if that queue is going to manipulate the acquired image. Well, you can't manipulate an image from multiple queues simultaneously. So there's no point in signaling a bunch of semaphores when acquire completes; that's why acquire only takes one.
So I want to simulate a fence using only semaphores and see which goes faster.
This suggests strongly that you're thinking about things incorrectly.
You use a fence when you want the CPU to detect the completion of a GPU operation. For vkAcquireNextImageKHR, you would use a fence if you need the CPU to know when the image has been acquired.
Semaphores are about the GPU detecting when a GPU operation has completed, regardless of whether the operation comes from a queue or not. So if the GPU needs to wait until an image is acquired, you use a semaphore.
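As a sketch of the distinction: vkAcquireNextImageKHR accepts both a semaphore and a fence, and you may provide either or both. The handles below are assumed to exist; the function name is made up for illustration.

#include <stdint.h>
#include <vulkan/vulkan.h>

// Sketch: acquire can signal a semaphore (GPU-side wait), a fence
// (CPU-side wait), or both at once.
void acquire_with_fence_and_semaphore(VkDevice device, VkSwapchainKHR swapchain,
                                      VkSemaphore acquireSemaphore,
                                      VkFence acquireFence)
{
    uint32_t imageIndex;
    vkAcquireNextImageKHR(device, swapchain, UINT64_MAX,
                          acquireSemaphore,   // a queue submit waits on this
                          acquireFence,       // the CPU waits on this
                          &imageIndex);

    // CPU-side: blocks this thread until the image is actually acquired.
    vkWaitForFences(device, 1, &acquireFence, VK_TRUE, UINT64_MAX);
}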
It doesn't matter which is faster because they do different things.

Related

How to synchronize between queues on different CPU threads?

It is said that semaphores are designed for this, but how? It looks like I need to submit the semaphore's signal operation before submitting anything that waits on it. Then what's the point of multithreading?
I'm using Skia (which has its own VkQueue) to draw the UI. I don't have access to its command buffer; I can only provide semaphores for it: it first waits on the scene-complete semaphore, then draws the UI and signals the present-ready semaphore.
It works fine when everything happens in a single thread. But after I moved the UI part to a second thread, it stopped working, and I got validation errors like: "VkQueue is waiting on a semaphore that has no way to be signaled." Of course: since it's on a different thread, the semaphore might not have been submitted to a queue yet.
The spec for vkQueuePresentKHR says:
All elements of the pWaitSemaphores member of pPresentInfo must be semaphores that are signaled, or have semaphore signal operations previously submitted for execution
You can't submit work that waits on a semaphore whose signal operation you plan to submit later. If you have this kind of dependency in your code, you need to externally synchronize the submissions so that the command buffers that signal are submitted BEFORE you submit the dependent command buffers, regardless of the queue.
If you're using multiple threads, it sounds like you need to rely on some CPU-side synchronization primitive, like a CPU semaphore, to properly order the work between them. Pure Vulkan sync primitives won't help you there.
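One possible shape for that CPU-side ordering, sketched with a pthreads condition variable; submit_scene() and submit_ui() are hypothetical stand-ins for the two threads' vkQueueSubmit calls:

#include <pthread.h>
#include <stdbool.h>

// Hypothetical stand-ins for the two threads' vkQueueSubmit calls.
extern void submit_scene(void);  // signals the scene-complete semaphore
extern void submit_ui(void);     // waits on the scene-complete semaphore

static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  g_cond = PTHREAD_COND_INITIALIZER;
static bool g_sceneSubmitted = false;

// Thread A: render the scene, then announce that the signal op is queued.
void *scene_thread(void *arg)
{
    submit_scene();
    pthread_mutex_lock(&g_lock);
    g_sceneSubmitted = true;
    pthread_cond_signal(&g_cond);
    pthread_mutex_unlock(&g_lock);
    return arg;
}

// Thread B: do not submit the UI work until the scene submit has returned.
void *ui_thread(void *arg)
{
    pthread_mutex_lock(&g_lock);
    while (!g_sceneSubmitted)
        pthread_cond_wait(&g_cond, &g_lock);
    pthread_mutex_unlock(&g_lock);
    submit_ui();  // safe: the semaphore's signal operation is now pending
    return arg;
}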

Can vkQueuePresentKHR be synced using a pipeline barrier?

The fact that vkQueuePresentKHR takes a queue parameter makes me think that it is like a command that is delivered to the queue for execution. If so, it should be possible to make it wait (until the writing into the image to be presented is finished) using a pipeline barrier whose source stage is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT and whose destination stage is VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT. Or maybe even with an image memory barrier, to ease the synchronization constraint for the image only.
But the fact that in every tutorial and book the synchronization is done using a semaphore makes me think my assumption is wrong. If so, why does vkQueuePresentKHR need a queue parameter? The semaphore parameter seems to be enough: when it is signaled, vkQueuePresentKHR can present the image according to the image index parameter and the swapchain handle parameter.
There are a couple of outstanding issues against the specification. Notably, KhronosGroup/Vulkan-Docs#1308 is exactly your question.
Meanwhile, everyone usually follows this language from the spec:
The processing of the presentation happens in issue order with other queue operations, but semaphores have to be used to ensure that prior rendering and other commands in the specified queue complete before the presentation begins.
Which implies a semaphore has to be used. And given that we are not 110% sure, that means a semaphore should be used until we know better.
Another semi-official source is the sync wiki, which uses a semaphore.
Despite what this quote says, I think it is reasonable to believe it is also permissible to use other synchronization that makes the image visible before the vkQueuePresentKHR, such as a fence wait.
But pipeline barriers alone are likely not sufficient. The presentation is outside the queue system:
However, the scope of this set of queue operations does not include the actual processing of the image by the presentation engine.
Additionally, there is no VkPipelineStageFlagBits value for it, and vkQueuePresentKHR is not included in submission order, so it cannot be in the synchronization scope of any vkCmdPipelineBarrier.
The confusing part is this unfortunate wording:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine.
I believe the trick is the "before vkQueuePresentKHR is executed" part. As said above, vkQueuePresentKHR is not part of submission order; therefore you cannot know whether the memory was made available via a pipeline barrier before vkQueuePresentKHR is executed.
Presentation is a queue operation. That's why you submit it to a queue. A queue that will execute the presentation of the image. And specifically to a queue that is able to perform present operations.
As for how to synchronize... the specification is a bit ambiguous on this point.
Semaphores are definitely able to work; there's a specific callout for this. In fact, a semaphore is not even necessary for making the results of prior commands visible to the present:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image.
While provisions are made for semaphores, there is no specific statement about other mechanisms. In particular, if you don't wait on a semaphore, it's not clear what "happens-after the semaphore signal operation" means, since no such signal operation happened.
Now, the API for vkQueuePresentKHR makes it clear that you don't need to provide a semaphore to wait on:
waitSemaphoreCount is the number of semaphores to wait for before issuing the present request.
The number may be zero.
One might think that, as a queue operation, all prior synchronization on that queue would still affect presentation. For example, an external subpass dependency if you wrote to the swapchain image as an attachment. And it probably would... if not for one little problem.
See, synchronization is ultimately based on dependencies between stages. And presentation... doesn't have a stage. So while your source for the external dependency would be well-understood, it's not clear what destination stage would work. Even specifying the all-stages flag wouldn't necessarily work.
Does "not a stage" exist in the set of all stages?
In any case, it's best to just use a semaphore. You'll probably need one anyway, so just use that.
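For reference, the conventional semaphore-based present looks roughly like the following sketch; the handles (presentQueue, swapchain, imageIndex, renderDoneSemaphore) are assumptions created elsewhere.

#include <stdint.h>
#include <vulkan/vulkan.h>

// Sketch: the rendering submit signals renderDoneSemaphore; the present
// waits on it before the presentation engine touches the image.
void present_image(VkQueue presentQueue, VkSwapchainKHR swapchain,
                   uint32_t imageIndex, VkSemaphore renderDoneSemaphore)
{
    VkPresentInfoKHR present = {
        .sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores    = &renderDoneSemaphore,
        .swapchainCount     = 1,
        .pSwapchains        = &swapchain,
        .pImageIndices      = &imageIndex,
    };
    vkQueuePresentKHR(presentQueue, &present);
}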

Can I do transfer operations in a transfer queue and a graphics queue at the same time?

I have retrieved two VkQueues: one from the graphics family and another from the transfer family. Command pools and command buffers are separated accordingly. Both are used for transfer operations.
The purpose of the first one, besides rendering, is to update uniform buffers each frame. The purpose of the second one is to update resources: model vertex/index buffers, texture images, etc.
They work in parallel on different threads, asynchronously, so it is possible that there will be two calls to vkQueueSubmit at the same time.
Is such usage allowed, and is it safe?
Note: once I multithreaded my program, I started sometimes getting VK_ERROR_DEVICE_LOST from vkQueueSubmit, and it seems to happen more frequently while resources are loading; that is why I came to this question.
The Vulkan specification is pretty clear about CPU synchronization of Vulkan functions. vkQueueSubmit says:
Host access to queue must be externally synchronized
Where "queue" is the parameter passed to vkQueueSubmit. It doesn't say every queue; it says "that queue".
And if "external synchronization" is not specifically stated as a requirement of a command, then it isn't a requirement of that command.

The best way to add a new 3D object at runtime

Up until now, I had all 3D objects created during startup. But now I need to add them dynamically. What could be simpler, I thought...
The main issue right now is how to upload the new object's data in the fastest way and find out when the data is uploaded.
Here's my setup:
I'm using the Vulkan Memory Allocator library, so I'm free from the memory-management burden.
I'm planning to use a separate VkBuffer for every object - this way I don't need to manage offsets and alignments, and it would be easier to add/remove objects.
And here are my thoughts/questions:
How do I upload the data? I want the buffer to be device-local (GPU-visible) only, which means I need a staging buffer.
If I use a staging buffer, I need to know when the data is ready to use on the GPU. I don't want to flush the pipeline and wait. The only way I see is to use a fence per object and only issue the draw command once that fence is signaled.
If I use a staging buffer and want to upload multiple objects during a short frame, I somehow need to be sure that parts of this staging buffer are not overwritten by different objects. For this, I need to keep it big and handle alignment for the offsets. But how big?
I'm pretty sure I'm overcomplicating. I believe there should be a much simpler pattern. How would you do this?
I believe there should be a much simpler pattern.
It's Vulkan; it's an explicit, low-level API. "Simple" is not its goal.
Overall, your Vulkan code needs to be written to adapt to the capabilities of the hardware. That's the best way to get performance out of it.
The first decision that needs to be made is whether you need staging at all. Staging (for buffer copies) is only necessary if your device's DEVICE_LOCAL memory is not mappable. And yes, there are (integrated) GPUs that allow you to map DEVICE_LOCAL memory. If that is the case, then you can just write directly to where you need the data to go.
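A sketch of that first check: scan the device's memory types for one that is both DEVICE_LOCAL and HOST_VISIBLE. The helper name is made up for illustration.

#include <stdint.h>
#include <vulkan/vulkan.h>

// Sketch: return the index of a memory type that is both DEVICE_LOCAL and
// HOST_VISIBLE, or UINT32_MAX if the device has none (staging required).
uint32_t find_mappable_device_local(VkPhysicalDevice physDev)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physDev, &props);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;  // mappable device-local memory: write directly
    }
    return UINT32_MAX;
}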
If staging is needed, then you need to decide if the hardware supports an independent transfer-only queue. If so, then you will likely get performance benefits by employing it. Not all hardware supports transfer-only queues, so your application needs to adapt. Also, transfer-only queues can have restrictions on the granularity of memory transfers taking place on those queues, so you need to check to see if your streaming strategy fits within the limits of that particular hardware.
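A sketch of that queue-family query might look like the following; the helper is hypothetical, and real code would also need to request a queue from this family at device-creation time.

#include <stdint.h>
#include <stdlib.h>
#include <vulkan/vulkan.h>

// Sketch: find a family that supports TRANSFER but neither GRAPHICS nor
// COMPUTE, and report its minImageTransferGranularity, which limits the
// image-copy extents allowed on that family. Returns UINT32_MAX if absent.
uint32_t find_transfer_only_family(VkPhysicalDevice physDev,
                                   VkExtent3D *granularityOut)
{
    uint32_t count = 0;
    vkGetPhysicalDeviceQueueFamilyProperties(physDev, &count, NULL);
    VkQueueFamilyProperties *fams = malloc(count * sizeof *fams);
    vkGetPhysicalDeviceQueueFamilyProperties(physDev, &count, fams);

    uint32_t found = UINT32_MAX;
    for (uint32_t i = 0; i < count; ++i) {
        VkQueueFlags f = fams[i].queueFlags;
        if ((f & VK_QUEUE_TRANSFER_BIT) &&
            !(f & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT))) {
            *granularityOut = fams[i].minImageTransferGranularity;
            found = i;
            break;
        }
    }
    free(fams);
    return found;
}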
Also, if there is no appropriate transfer queue, you can create the effect of a transfer queue by using a second compute or graphics queue... if the hardware supports multiple queues at all. Being able to submit transfer commands and rendering commands on different queues is a good thing, assuming you are taking advantage of threading (i.e., issuing the submits of the batches to the different queues from different threads).
If you are able to use a separate queue for transfers (whether a true transfer queue or just a separate compute/graphics queue), then you get to play around with semaphores. The batch that transfers data must signal a semaphore when it completes; this is part of the batch in the vkQueueSubmit call. The batch on the main queue that uses the transferred data for some process needs to wait on that semaphore. So both threads need to be using the same VkSemaphore object. And the wait on the semaphore should just have a global memory barrier, to make the memory visible.
The tricky part is this: you cannot submit the batch that waits on the semaphore until the submit call for the batch that signals it has been submitted. You don't have to wait until completion, but you do have to wait until the vkQueueSubmit call on the transfer queue has returned. So you need a way to transfer the semaphore between different threads, or you could just issue both submit commands on the same thread.
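Putting those two paragraphs together, here is a sketch of the two submits. The handles and the VERTEX_INPUT wait stage are illustrative assumptions (pick whatever stages actually read the transferred data), and as noted above, the first vkQueueSubmit must return before the second is issued.

#include <vulkan/vulkan.h>

// Sketch: the transfer batch signals transferDone; the graphics batch waits
// on it before its vertex-input stage reads the uploaded buffers.
void submit_transfer_then_draw(VkQueue transferQueue, VkQueue graphicsQueue,
                               VkCommandBuffer transferCmdBuf,
                               VkCommandBuffer drawCmdBuf,
                               VkSemaphore transferDone)
{
    VkSubmitInfo xfer = {
        .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .commandBufferCount   = 1,
        .pCommandBuffers      = &transferCmdBuf,
        .signalSemaphoreCount = 1,
        .pSignalSemaphores    = &transferDone,
    };
    vkQueueSubmit(transferQueue, 1, &xfer, VK_NULL_HANDLE);

    VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
    VkSubmitInfo draw = {
        .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
        .waitSemaphoreCount = 1,
        .pWaitSemaphores    = &transferDone,
        .pWaitDstStageMask  = &waitStage,  // stages that must wait for the copy
        .commandBufferCount = 1,
        .pCommandBuffers    = &drawCmdBuf,
    };
    vkQueueSubmit(graphicsQueue, 1, &draw, VK_NULL_HANDLE);
}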
If you aren't using a second queue, then things are slightly simpler.
You still want to build the transfer command buffer itself on a different thread (to take advantage of threading CB construction). But that CB now needs to be communicated to the thread responsible for submitting the rendering stuff. And this channel of communication needs to know that this CB contains transfer commands, which some of the rendering CB processes ought to wait on.
The simplest and most flexible way to do this is to build the transfer CB so that the last command is a vkCmdSetEvent command (and the first command is a vkCmdResetEvent, to reset it from previous frames of usage). The submission thread then only needs to create a small CB that contains just a vkCmdWaitEvents command waiting on the event that will be set. That command should issue a full memory barrier, and that CB should execute between the transfer CB and any rendering CBs that read from the transferred data.
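A sketch of that event pattern, with deliberately broad stage and access masks standing in for the "full memory barrier"; the event, buffers, and copy region are assumptions set up elsewhere.

#include <stddef.h>
#include <vulkan/vulkan.h>

// Transfer CB, built on the worker thread: reset, copy, then set the event.
void record_transfer_cb(VkCommandBuffer transferCmdBuf, VkEvent transferEvent,
                        VkBuffer stagingBuf, VkBuffer deviceBuf,
                        const VkBufferCopy *copyRegion)
{
    vkCmdResetEvent(transferCmdBuf, transferEvent, VK_PIPELINE_STAGE_TRANSFER_BIT);
    vkCmdCopyBuffer(transferCmdBuf, stagingBuf, deviceBuf, 1, copyRegion);
    vkCmdSetEvent(transferCmdBuf, transferEvent, VK_PIPELINE_STAGE_TRANSFER_BIT);
}

// Tiny "wait" CB, built on the submission thread, executed between the
// transfer CB and any rendering CBs that read the data.
void record_wait_cb(VkCommandBuffer waitCmdBuf, VkEvent transferEvent)
{
    VkMemoryBarrier full = {
        .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
        .srcAccessMask = VK_ACCESS_MEMORY_WRITE_BIT,
        .dstAccessMask = VK_ACCESS_MEMORY_READ_BIT,
    };
    vkCmdWaitEvents(waitCmdBuf, 1, &transferEvent,
                    VK_PIPELINE_STAGE_TRANSFER_BIT,      // where the event is set
                    VK_PIPELINE_STAGE_ALL_COMMANDS_BIT,  // who must wait
                    1, &full, 0, NULL, 0, NULL);
}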
The flexibility of this is in the structure of the process. It is structured similarly to how the multi-queue version works. In both cases, a separate thread needs to communicate something to the render submission thread (in one case, a semaphore; in the other, a CB and an event). And the render submission thread needs to do things to wait on that "something", but without disrupting the process of building the rendering commands itself (in one case, you just change the batch to wait on the semaphore; in the other, you insert a CB that waits for the event).
If you want to get a bit smarter about execution dependencies, you can even have the transfer operation forward information about which pipeline stages need to wait on the operation. But that's mostly an optimization.
Here's the thing though: none of the staging cases are performance-friendly. They're problematic because you can't do anything else while the transfer operation is going on. And that is the case because... you're trying to read from the memory in the same frame you're writing to it. That's bad.
You should endeavor instead to delay rendering any objects for which loading is not complete. Or put another way, you want to load the data for new objects before you need them, not on the same frame you need them. This is what streaming systems do: they pre-emptively load data that will be needed soon, but not right now.
But how big?
Only you and your use cases can answer that question. If you are streaming in fixed-sized blocks (which you should do where possible), then it's fairly easy: your staging buffer should be one or maybe two streaming blocks in size. If your rendering system is more flexible, imposing few limitations on the higher-level code, then your staging buffer and your streaming system needs to be more flexible. And there's no right answer for that; it depends entirely on how it gets used.
Welcome to using explicit, low-level APIs.

Syncing 3 threads sharing buffers using NSConditionLock. It's hard

I have 3 threads (in addition to the main thread). The threads read, process, and write. They each do this to a number of buffers, which are cycled through and reused. The reason it's set up this way is so the program can continue to do the other tasks while one of them is running. So, for example, while the program is writing to disk, it can simultaneously be reading more data.
The problem is I need to synchronize all this so the processing thread doesn't try to process buffers that haven't been filled with new data. Otherwise, there is a chance that the processing step could process leftover data in one of the buffers.
The read thread reads data into a buffer, then marks the buffer as "new data" in an array. So, it works like this:
//set up in main thread
NSConditionLock *readlock = [[NSConditionLock alloc] initWithCondition:0];
//in the read thread: block until the lock's condition matches this buffer's flag
[readlock lockWhenCondition:buffer_new[current_buf]];
//copy data into the buffer
memcpy(buffer[current_buf], source_data, data_length);
//mark buffer as new (this is reset to 0 once the data is processed)
buffer_new[current_buf] = 1;
//unlock
[readlock unlockWithCondition:0];
I use buffer_new[current_buf] as the condition value for the NSConditionLock. If the buffer isn't marked as new, then the thread in question will block, waiting for the previous thread to write new data. That part seems to work okay.
The main problem is that I need to sync this in both directions. If the read thread happens to take too long for some reason and the processing thread has already finished processing all the buffers, the processing thread needs to wait, and vice versa.
I'm not sure NSConditionLock is the appropriate way to do this.
I'd turn this on its head. As you say, threading is hard, and multi-way synchronization of threads is even harder. Queue-based concurrency is often much more natural.
Define three queues: a read queue, a write queue, and a processing queue. Then employ a rule stating that no buffer shall be enqueued in more than one queue at a time.
That is, a buffer may be enqueued onto the read queue and, once done reading, enqueued into the processing queue, and once done processing, enqueued into the write queue.
You could use a stack of buffers if you want, but, typically, the cost of allocation is pretty cheap compared to the cost of processing; thus, enqueue-for-read could also do the allocation, while dequeue-once-written could do the free.
This would be pretty straightforward to code with GCD. Note that if you really want parallelism, your various queues would really just be throttles, using semaphores -- potentially shared -- to enqueue the work to the global concurrent queues.
Note also that this design has a distinct advantage over what you are currently using in that it uses no locks. The only locks are hidden below the GCD APIs as a part of queue management, but that is effectively invisible to your code.
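As a rough sketch of that design with GCD (C with blocks): fill_buffer(), process_buffer(), write_buffer(), and BUFFER_SIZE are hypothetical stand-ins, and each buffer is owned by exactly one queue at a time.

#include <dispatch/dispatch.h>
#include <stdlib.h>

#define BUFFER_SIZE (1 << 20)  // arbitrary size for the sketch

// Hypothetical stage functions: the real work goes here.
extern void fill_buffer(char *buf);     // read data into the buffer
extern void process_buffer(char *buf);  // process it
extern void write_buffer(char *buf);    // write it to disk

// Three serial queues: each stage runs in order, but the stages overlap.
static dispatch_queue_t readQ, processQ, writeQ;

// A buffer is owned by exactly one queue at a time: read -> process -> write.
static void enqueue_for_read(void)
{
    dispatch_async(readQ, ^{
        char *buf = malloc(BUFFER_SIZE);  // allocate on enqueue-for-read
        fill_buffer(buf);
        dispatch_async(processQ, ^{       // hand off: now owned by processQ
            process_buffer(buf);
            dispatch_async(writeQ, ^{     // hand off: now owned by writeQ
                write_buffer(buf);
                free(buf);                // free once written
            });
        });
    });
}

// Setup, e.g. in main():
//   readQ    = dispatch_queue_create("read",    DISPATCH_QUEUE_SERIAL);
//   processQ = dispatch_queue_create("process", DISPATCH_QUEUE_SERIAL);
//   writeQ   = dispatch_queue_create("write",   DISPATCH_QUEUE_SERIAL);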
Have you seen the Apple Concurrency Programming Guide?
It recommends several preferable methods for moving away from a threads-and-locks concurrency model. Using operation queues, for example, can not only reduce and simplify your code, but also speed up your development and give you better performance.
Sometimes you do need to use threads, and you already have the correct idea. But you will need to keep adding locks, and with each one it will get exponentially more complicated, until you can't understand your own code. Then you'll start adding locks at random places. Then you're screwed.
Read the concurrency guide, then follow bbum's advice.