How to synchronize between queues on different CPU threads? - Vulkan

It is said that semaphores are designed for this, but how? It looks like I need to submit the semaphore's signal operation before I can wait on it. Then what's the point of multithreading?
I'm using Skia (which has its own VkQueue) to draw the UI. I don't have access to its command buffers; I can only provide semaphores for it: it first waits on the scene-complete semaphore, then draws the UI and signals the present-ready semaphore.
It works fine when everything happens in a single thread, but after I moved the UI part to a second thread it stopped working, and I got validation errors like: a VkQueue is waiting on a semaphore that has no way to be signaled. Of course: since the submission happens on a different thread, the signal operation might not have been submitted to a queue yet.

The spec for vkQueuePresentKHR says:
All elements of the pWaitSemaphores member of pPresentInfo must be semaphores that are signaled, or have semaphore signal operations previously submitted for execution
You can't submit work that waits on a semaphore whose signal you plan to submit later. If you have this kind of dependency in your code, you need to externally synchronize the submissions so that the command buffers that signal the semaphore are submitted BEFORE the dependent command buffers, regardless of the queue.
If you're using multiple threads, it sounds like you need to rely on CPU-side synchronization primitives, such as a CPU semaphore, to properly order the work between them. Pure Vulkan sync primitives won't help you there.
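Here is a minimal sketch of that CPU-side ordering, assuming C++20; submitScenePass and submitSkiaUiPass are hypothetical placeholders for your own submission code:

    #include <semaphore>
    #include <thread>

    std::binary_semaphore sceneSubmitted{0};  // CPU-side primitive, not a VkSemaphore

    void submitScenePass()  { /* vkQueueSubmit(...) that SIGNALS the scene-complete VkSemaphore */ }
    void submitSkiaUiPass() { /* Skia's submit that WAITS on the scene-complete VkSemaphore */ }

    int main() {
        std::thread render([] {
            submitScenePass();        // the semaphore signal operation is now queued
            sceneSubmitted.release(); // tell the UI thread it may submit
        });
        std::thread ui([] {
            sceneSubmitted.acquire(); // block until the signal op has been submitted
            submitSkiaUiPass();       // safe: the Vulkan wait can now be satisfied
        });
        render.join();
        ui.join();
    }

With this ordering in place, the validation layer can see that the semaphore's signal operation is always submitted before the wait that depends on it.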

Related

In Vulkan, can I receive a notification in the form of a callback when a command buffer is finished?

I want to know when a command buffer has finished executing in Vulkan, but rather than regularly polling a fence myself, I was wondering if I could set a callback to be notified. Is this possible? The only callback I've seen mentioned is for something relating to allocations.
Vulkan is an explicit, low-level API. One of the design choices in such an API is that the graphics driver gets CPU time only when you give it CPU time (in theory). This gives your code more control over the CPU and its threads.
In order for Vulkan to call your code back at arbitrary times, it would have to have a CPU thread on which to do it. But as previously stated, that's not a thing Vulkan implementations are expected to have.
Even the allocation callbacks are only called in the scope of a Vulkan API function which performs those allocations. That is, the asynchronous processing of a Vulkan queue doesn't do allocations.
So you're going to have to either check the fence manually or block a thread on it until it is signaled.
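If you want callback-like behavior anyway, you can build it yourself by dedicating a worker thread to the fence wait. A minimal sketch; notifyWhenDone is a hypothetical helper, and the device and fence are assumed to come from your own setup and submit:

    #include <vulkan/vulkan.h>
    #include <cstdint>
    #include <functional>
    #include <thread>

    // Spawns a worker that blocks on the fence, then invokes the callback.
    void notifyWhenDone(VkDevice device, VkFence fence, std::function<void()> callback) {
        std::thread([=] {
            vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
            callback();  // runs on the worker thread, not the submitting one
        }).detach();
    }

Just be aware that the callback runs on the worker thread, so anything it touches needs its own synchronization.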

Can vkQueuePresentKHR be synced using a pipeline barrier?

The fact that vkQueuePresentKHR takes a queue parameter makes me think it is like a command that is delivered to the queue for execution. If so, it should be possible to make it wait (until the writing into the image to be presented is finished) using a pipeline barrier whose source stage is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT and whose destination stage is VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT. Or maybe even with an image memory barrier, to ease the synchronization constraint to that one image.
But the fact that every tutorial and book does this synchronization with a semaphore makes me think my assumption is wrong. If so, why does vkQueuePresentKHR need a queue parameter? The semaphore parameter seems to be enough: when it is signaled, vkQueuePresentKHR can present the image identified by the image index and swapchain handle parameters.
There are a couple of outstanding issues against the specification. Notably, KhronosGroup/Vulkan-Docs#1308 is exactly your question.
Meanwhile, everyone usually follows this language:
The processing of the presentation happens in issue order with other queue operations, but semaphores have to be used to ensure that prior rendering and other commands in the specified queue complete before the presentation begins.
This implies a semaphore has to be used. And given that we are not 110 % sure, a semaphore should be used until we know better.
Another semi-official source is the sync wiki, which uses a semaphore.
Despite what this quote says, I think it is reasonable to believe it is also permissible to use other synchronization that makes the image visible before the vkQueuePresentKHR, such as a fence wait.
But pipeline barriers alone are likely not sufficient. The presentation is outside the queue system:
However, the scope of this set of queue operations does not include the actual processing of the image by the presentation engine.
Additionally, there is no VkPipelineStageFlagBits value for it, and vkQueuePresentKHR is not included in submission order, so it cannot be in the synchronization scope of any vkCmdPipelineBarrier.
The confusing part is this unfortunate wording:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine.
I believe the trick is in the phrase "before vkQueuePresentKHR is executed". As said above, vkQueuePresentKHR is not part of submission order; therefore you cannot know whether the memory was made available via a pipeline barrier before vkQueuePresentKHR executed.
Presentation is a queue operation. That's why you submit it to a queue: a queue that will execute the presentation of the image, and specifically a queue that is able to perform present operations.
As for how to synchronize... the specification is a bit ambiguous on this point.
Semaphores are definitely able to work; there's a specific callout for this:
Semaphores are not necessary for making the results of prior commands visible to the present:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image.
While provisions are made for semaphores, there is no specific statement about other things. In particular, if you don't wait on a semaphore, it's not clear what "happens-after the semaphore signal operation" means, since no such signal operation happened.
Now, the API for vkQueuePresentKHR makes it clear that you don't need to provide a semaphore to wait on:
waitSemaphoreCount is the number of semaphores to wait for before issuing the present request.
The number may be zero.
One might think that, as a queue operation, all prior synchronization on that queue would still affect presentation. For example, an external subpass dependency if you wrote to the swapchain image as an attachment. And it probably would... if not for one little problem.
See, synchronization is ultimately based on dependencies between stages. And presentation... doesn't have a stage. So while your source for the external dependency would be well-understood, it's not clear what destination stage would work. Even specifying the all-stages flag wouldn't necessarily work.
Does "not a stage" exist in the set of all stages?
In any case, it's best to just use a semaphore. You'll probably need one anyway, so just use that.
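For reference, a minimal sketch of the conventional pattern; all handles are assumed to already exist, and error handling is omitted:

    #include <vulkan/vulkan.h>

    void present(VkQueue queue, VkSwapchainKHR swapchain,
                 VkSemaphore renderFinished, uint32_t imageIndex) {
        VkPresentInfoKHR info{};
        info.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
        info.waitSemaphoreCount = 1;               // may legally be zero...
        info.pWaitSemaphores    = &renderFinished; // ...but don't rely on that
        info.swapchainCount     = 1;
        info.pSwapchains        = &swapchain;
        info.pImageIndices      = &imageIndex;
        vkQueuePresentKHR(queue, &info);           // check the VkResult in real code
    }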

Is an Empty VkCommandBuffer executed when submitted to a queue?

Hey guys, I wonder whether a VkSubmitInfo containing one empty VkCommandBuffer, submitted to the queue, will be executed or ignored. I mean, will the semaphores in VkSubmitInfo::pWaitSemaphores and VkSubmitInfo::pSignalSemaphores still be honored when submitting an empty VkCommandBuffer?
It looks like a stupid question, but what I want is to "multiply" the single semaphore that comes out of vkAcquireNextImageKHR.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to acquire_semaphore, and with VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.
if it will be executed or ignored.
What would be the difference? If there are no commands in the command buffer, then executing it will do nothing.
I mean, I want to submit an empty command buffer with VkSubmitInfo::pWaitSemaphores pointing to acquire_semaphore, and with VkSubmitInfo::pSignalSemaphores containing as many semaphores as I need.
This has nothing to do with the execution of the CB itself. The behavior of a batch doesn't change just because the CB doesn't do anything.
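For illustration, here is a minimal sketch of such an empty batch: one wait, several signals, zero command buffers (the handles are assumed to already exist):

    #include <vulkan/vulkan.h>

    void fanOut(VkQueue queue, VkSemaphore acquireSem,
                const VkSemaphore* signalSems, uint32_t signalCount) {
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT;
        VkSubmitInfo submit{};
        submit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit.waitSemaphoreCount   = 1;
        submit.pWaitSemaphores      = &acquireSem;
        submit.pWaitDstStageMask    = &waitStage;
        submit.commandBufferCount   = 0;            // no command buffers at all
        submit.signalSemaphoreCount = signalCount;
        submit.pSignalSemaphores    = signalSems;
        vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
    }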
However, unless you have multiple queues waiting on the completion of this queue's operations, there's really no reason to have multiple signal semaphores. The batch containing the real work could just wait on pWaitSemaphores itself.
Also, there's no reason to have empty batches that only wait on a single semaphore. Let's say you have batch Q, which signals the semaphore that this empty batch waits on. Well, there's no reason that batch Q's pSignalSemaphores couldn't include the semaphores that you want the empty batch to signal. After all, a vkQueueSubmit semaphore wait operation has, as its second synchronization scope, all subsequent commands for that queue, whether from the same vkQueueSubmit call or subsequent ones.
So you would only need an empty batch if you have to wait on multiple semaphores that are signaled by different batches on different queues. And such a complex dependency layout strongly suggests an over-complicated dependency design that will lead to reduced performance.
Even waiting on the acquire makes no sense for this. You only need to wait on the acquire semaphore if a queue is going to manipulate the acquired image. Well, you can't manipulate an image from multiple queues simultaneously. So there's no point in signaling a bunch of semaphores when the acquire completes; that's why acquire takes only one.
So I want to simulate a fence only with semaphores and see which goes faster.
This suggests strongly that you're thinking about things incorrectly.
You use a fence when you want the CPU to detect the completion of a GPU operation. For vkAcquireNextImageKHR, you would use a fence if you need the CPU to know when the image has been acquired.
Semaphores are about the GPU detecting when a GPU operation has completed, regardless of whether the operation comes from a queue or not. So if the GPU needs to wait until an image is acquired, you use a semaphore.
It doesn't matter which is faster because they do different things.
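To make the distinction concrete, a minimal sketch that uses both with vkAcquireNextImageKHR (device, swapchain, and sync objects assumed to already exist):

    #include <vulkan/vulkan.h>
    #include <cstdint>

    uint32_t acquire(VkDevice device, VkSwapchainKHR swapchain,
                     VkSemaphore gpuWait, VkFence cpuWait) {
        uint32_t imageIndex = 0;
        // The semaphore lets a later queue submission wait on the acquire (GPU -> GPU).
        // The fence lets the CPU block until the acquire has completed (GPU -> CPU).
        vkAcquireNextImageKHR(device, swapchain, UINT64_MAX, gpuWait, cpuWait, &imageIndex);
        vkWaitForFences(device, 1, &cpuWait, VK_TRUE, UINT64_MAX);  // the CPU-side wait
        return imageIndex;
    }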

How to determine what's blocking the main thread

So I restructured a central part of my Cocoa application (I really had to!) and have been running into issues since then.
Quick outline: my application controls the playback of QuickTime movies so that they are in sync with external timecode.
Thus, external timecode arrives on a CoreMIDI callback thread and gets posted to the application about 25 times per second. The sync is then checked and adjusted if it needs to be.
All this checking and adjusting is done on the main thread.
Even if I put all the processing on a background thread, it would be a ton of work, as I'm currently using a lot of GCD blocks and would need to rewrite a lot of functions so that they can be called from an NSThread. So I would like to make sure first that it will solve my problem.
The problem
My CoreMIDI callback is always called in time, but the GCD block that is dispatched to the main queue is sometimes blocked for up to 500 ms. Understandably, adjusting the sync does not quite work when that happens. I couldn't find a reason for it, so I'm guessing that I'm doing something that blocks the main thread.
I'm familiar with Instruments, but I couldn't find the right tool to see what keeps my messages from being processed in time.
I would appreciate it if anyone could help; I don't know what I can do about it. Thanks in advance!
Watchdog
You can use Watchdog, which logs whenever the main thread is blocked for longer than a given threshold:
https://github.com/wojteklu/Watchdog
You can install it using CocoaPods:
pod 'Watchdog'
You may be blocking the main thread or you might be flooding it with events.
I would suggest three things:
Grab a timestamp when the timecode arrives in the CoreMIDI callback thread (see mach_absolute_time()). Then grab the current time when the work is actually done on the main thread. You can then adjust accordingly, based on how much time elapsed between posting to the main thread and the work actually being processed.
Create some kind of coalescing mechanism such that when your main thread is blocked, interim timecode events (which are now out of date) are tossed. This can be as simple as a global counter that is incremented every time an event is received. The block dispatched to the main queue captures the current value on creation, then checks it when it is processed. If it differs by more than N (N being for you to determine), toss the event, because more are in flight; see the sketch after this list.
Consider not sending an event to the main thread for every timecode notification. 25 adjustments per second is a lot of work. If processing only 5 per second yields a "good enough" perceptual experience, that is an awful lot of work saved.
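Here is a minimal sketch of the coalescing idea from the second suggestion, written in C++ against libdispatch; every name in it is illustrative rather than taken from the question's code:

    #include <atomic>
    #include <dispatch/dispatch.h>

    static std::atomic<unsigned long> g_serial{0};

    struct TimecodeEvent { unsigned long serial; /* plus the timecode payload */ };

    static void handleOnMain(void* ctx) {
        TimecodeEvent* ev = static_cast<TimecodeEvent*>(ctx);
        const unsigned long N = 2;  // tolerance; tune for your application
        // If more than N newer events arrived while this one sat in the queue,
        // it is stale: toss it instead of adjusting sync against old data.
        if (g_serial.load() - ev->serial > N) { delete ev; return; }
        /* ... check and adjust the sync here ... */
        delete ev;
    }

    // Called ~25 times per second from the CoreMIDI callback thread.
    void onTimecode() {
        TimecodeEvent* ev = new TimecodeEvent{ ++g_serial };
        dispatch_async_f(dispatch_get_main_queue(), ev, handleOnMain);
    }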
In general, instrumenting the main event loop is a bit tricky. The CPU profiler in Instruments can be quite helpful. It may come as a surprise, but so can the Allocations instrument: you can use it to measure memory throughput, and if there are tons of transient (short-lived) allocations, they'll chew up a ton of CPU time doing all those allocations and deallocations.

Feedback from threads to main program

My software will simulate a few hundred hardware devices, each of which will send several thousand reports to a database server.
Trying it without threading did not give very good results, so now it's time to thread.
Since I am load-testing the database server, some of those transactions will succeed and a few may fail. The GUI of the main program needs to reflect this. How should the threads communicate their results back to the main program? Update global variables? Send a message? Or something else?
Now, if I update only at the end of each thread, the GUI is going to look rather boring (and I can't tell whether the program has hung). It might be nice to update the GUI periodically. But that might cause contention, with threads waiting on other threads to update (for instance, if I am writing to global variables, I need a mutex, which will block each thread that is waiting to write).
I'm new to threading. How is this normally done? Perhaps the main program could poll the threads, instead of the threads informing the main program?
One way to organize this is for your threads to add messages to a thread-safe queue (e.g. a ConcurrentQueue) as they get data. To keep things simple you can have a timer thread in your UI that periodically dequeues the queued messages to a private list and then renders them. This design allows your threads to easily queue and forget messages with minimal contention, and for your UI to periodically update itself without blocking your writers too much (i.e. for only the period it takes to dequeue current messages to a private list).
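A minimal sketch of that pattern (ConcurrentQueue is .NET; here a mutex-guarded std::queue stands in for it, and all names are illustrative):

    #include <mutex>
    #include <queue>
    #include <string>
    #include <vector>

    class ResultQueue {
        std::mutex m_;
        std::queue<std::string> q_;
    public:
        // Worker threads: lock briefly, push the result, forget about it.
        void post(std::string msg) {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(msg));
        }
        // UI timer: move everything out under one short lock, render afterwards.
        std::vector<std::string> drain() {
            std::vector<std::string> out;
            std::lock_guard<std::mutex> lock(m_);
            while (!q_.empty()) {
                out.push_back(std::move(q_.front()));
                q_.pop();
            }
            return out;
        }
    };

The writers are blocked only for the duration of a push or the drain itself, and the GUI renders from its private list with no lock held.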
Although you are attempting to simulate the load of hundreds of devices, using a thread per device is not the way to model this, as you can only run so many threads concurrently anyway.