In Vulkan, can I receive a notification in the form of a callback when a command buffer is finished?

I want to know when a command buffer has finished executing in Vulkan, but rather than regularly checking a fence myself, I was wondering if I could set a callback to be notified. Is this possible? The only callback I've seen mentioned is for something relating to allocations.

Vulkan is an explicit, low-level API. One of the design choices in such an API is that the graphics driver gets CPU time only when you give it CPU time (in theory). This allows your code more control over the CPU and its threads.
In order for Vulkan to call your code back at arbitrary times, it would have to have a CPU thread on which to do it. But as previously stated, that's not a thing Vulkan implementations are expected to have.
Even the allocation callbacks are only called in the scope of a Vulkan API function which performs those allocations. That is, the asynchronous processing of a Vulkan queue doesn't do allocations.
So you're going to have to either check manually or block a thread on the fence until it is released.
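If you still want callback-style notification, one option is to dedicate a thread of your own to blocking on the fence. Below is a minimal sketch of that idea, not something Vulkan provides. FakeFence and notify_when_done are hypothetical names, and FakeFence is only a stand-in for a VkFence so the sketch runs without a device; in real code the wait would be a vkWaitForFences call.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for a VkFence so the sketch runs without a Vulkan device.
class FakeFence {
public:
    // Models the GPU signaling the fence when the submission completes.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            signaled_ = true;
        }
        cv_.notify_all();
    }

    // Blocks the calling thread until the fence is signaled.
    // Real code: vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return signaled_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool signaled_ = false;
};

// Hypothetical helper: spawns an application-owned thread that blocks on the
// fence and then runs the callback. The fence must outlive the returned thread.
std::thread notify_when_done(FakeFence& fence, std::function<void()> on_done) {
    return std::thread([&fence, cb = std::move(on_done)] {
        fence.wait();
        cb();
    });
}
```

The callback runs on the waiter thread, so anything it touches has to be safe to access from there. The point is that the thread belongs to your application rather than to the driver.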

Related

How to synchronize between queues on different CPU threads?

It is said semaphores are designed for this but how? It looks like I need to submit the semaphore before waiting for it to signal. Then what's the point of multithreading?
I'm using Skia (which has its own VkQueue) to draw the UI. I don't have access to its command buffer; I can only provide semaphores for it. It first waits for the scene-complete semaphore, then draws the UI and signals the present-ready semaphore.
It works fine when everything happens in a single thread. But after I moved the UI part to a second thread, it stopped working and I got validation errors like: VkQueue is waiting on semaphore that has no way to be signaled. Of course, since it's on a different thread, the semaphore might not have been submitted to a queue yet.
The spec for vkQueuePresentKHR says:
"All elements of the pWaitSemaphores member of pPresentInfo must be semaphores that are signaled, or have semaphore signal operations previously submitted for execution"
You can't submit work that waits on a semaphore that you plan to submit later. If you have this kind of dependency in your code you need to externally synchronize the submissions so the command buffers that will signal will be sent BEFORE you submit the dependent command buffers, regardless of the queue.
If you're using multiple threads it sounds like you need to rely on some CPU side synchronization primitives, like a CPU semaphore to properly order the work between them. Pure Vulkan sync primitives won't help you there.
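As a rough sketch of that CPU-side ordering, here a std::promise/std::future pair is used so that the thread submitting the waiting work cannot proceed until the signaling work has been submitted. The function name and the log strings are made up for illustration, and the log entries stand in for the actual vkQueueSubmit calls, which are not shown.

```cpp
#include <cassert>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical demo: returns the order in which the two "submissions" happened.
std::vector<std::string> run_ordered_submissions() {
    std::vector<std::string> log;   // stand-in for work reaching the queues
    std::mutex log_mutex;

    std::promise<void> scene_submitted;
    std::future<void> scene_submitted_future = scene_submitted.get_future();

    // UI thread: must not submit its wait until the signal has been submitted.
    std::thread ui_thread([&] {
        scene_submitted_future.wait();  // CPU-side wait, not a Vulkan primitive
        std::lock_guard<std::mutex> lock(log_mutex);
        log.push_back("ui submit (waits on scene semaphore)");
    });

    // Scene thread: submits the work that signals the semaphore, then tells
    // the UI thread that the signal operation is now in a queue.
    std::thread scene_thread([&] {
        {
            std::lock_guard<std::mutex> lock(log_mutex);
            log.push_back("scene submit (signals scene semaphore)");
        }
        scene_submitted.set_value();
    });

    scene_thread.join();
    ui_thread.join();
    return log;
}
```

A promise is one-shot, so a per-frame loop would need something reusable, such as a condition variable with a frame counter. The ordering idea stays the same.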

Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

I am looking through an Apple demo project that is associated with the 2017 WWDC video entitled "Introducing Metal 2" where the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions with a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in a previous 2014 WWDC "Working With Metal: Fundamentals".
I noticed that it seems the APPLParticleRenderer is sending data to be written by the GPU in a compute pass before it finishes reading from that same buffer from the fragment shader from a previous render pass. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes called from new id<MTLCommandEncoder> have access to the buffer only after other passes have written and read from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access the memory.
Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.
See my comment on the original question for a brief version of this answer.
It turns out that Metal tracks dependencies between commands scheduled to the GPU by default for MTLResource types. The hazardTrackingMode property of a MTLResource is defaulted to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift) according to the Metal documentation. This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete.
Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be written to by the GPU; hence, no CPU/GPU synchronization is necessary with a semaphore for this buffer and thus no multi-buffer system is necessary for the resource.
Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.
Note that the default hazard tracking mode for a MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU.
EDIT
After reading into resource synchronization in Metal, there are some additional points I would like to make that I think further clarify what's going on. Note that the remaining portion is in Swift. To learn more in detail, I recommend reading the "Synchronization" section in the Metal documentation here.
MTLFence
Firstly, a MTLFence is used to synchronize accesses to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal will handle this synchronization for you.
It is important to note that the automatic management I mention in the answer only occurs within a single command buffer between encoding passes. But this does not mean we need to synchronize across command buffers scheduled in the same command queue, since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here:
"The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system."
At that point it would be safe to access these same buffers. Note that within a single render pass, if a vertex shader writes into a buffer that the fragment shader in the same pass reads from, the result is undefined. I mentioned this in the original question, the solution being to use two render command encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders.
MTLEvent
In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.
To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.
MTLSharedEvent
In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multi-GPU), resource synchronization is needed. Here, you create a MTLSharedEvent in place of a MTLEvent that can be used to synchronize across devices and processes. It is essentially the same API as that of the MTLEvent, but involves command queues on different devices.

Vulkan: Sync of UniformBuffer and StorageBuffer

Uniform buffers and storage buffers are updated, on the CPU side, using memcpy. How does synchronization work for those descriptor types? Does using memcpy imply that the application waits for memcpy to upload data to the GPU before continuing to the next statement? If so, does this mean that barriers are not needed for syncing these types of buffers?
Synchronization works the same way for any memory resource: with certain rare exceptions, if you've changed memory, you need a memory dependency to ensure visibility of those changes. The synchronization system doesn't care whether it's used as a UBO or whatever. It cares about the nature of the source operation (the host) and the destination operation (reading from certain shader stages).
For host-to-device memory operations, you need to perform a form of synchronization known as a "domain operation". Fortunately, vkQueueSubmit automatically performs a domain operation on any host writes made visible before the vkQueueSubmit call. So if you write stuff to GPU-visible memory, then call vkQueueSubmit (either in the same thread or via CPU-side inter-thread communication), any commands in that submit call (or later ones) will see the values you wrote.
Assuming you have made them visible. Writes to host-coherent memory are always visible to the GPU, but writes to non-coherent memory must be made visible via a call to vkFlushMappedMemoryRanges.
If you want to write to memory asynchronously to the GPU process that reads it, you'll need to use an event. You write to the memory, make it visible if need be, then set the event. The GPU commands that read from it would wait on the event, using VK_ACCESS_HOST_WRITE_BIT as the source access and VK_PIPELINE_STAGE_HOST_BIT as the source stage. The destination access and stage are determined by how you plan to read from it.
Vulkan knows nothing about memcpy. It doesn't care how you modify the memory; it only cares that you do so in accord with its rules.

Can I do transfer operation in transfer queue and graphics queue at the same time?

I have made 2 instances of VkQueue: one from the graphics family and another from the transfer family. Command pools and command buffers are separated accordingly. Both are doing transfer operations.
The purpose of the first one, besides rendering, is to update uniform buffers on each frame.
The purpose of the second one is to update resources: model vertex/index buffers, texture images, etc.
They work in parallel in different threads, asynchronously. So it is possible that there will be 2 calls of vkQueueSubmit at the same time.
Is such usage allowed and is it safe?
Note: since I made my program multithreaded, I sometimes get VK_DEVICE_LOST on vkQueueSubmit, and it seems to happen more frequently while resources are loading. That is why I actually came to this question.
The Vulkan specification is pretty clear about CPU synchronization of Vulkan functions. vkQueueSubmit says:
Host access to queue must be externally synchronized
Where "queue" is the parameter passed to vkQueueSubmit. It doesn't say every queue; it says "that queue".
And if "external synchronization" is not specifically stated as a requirement of a command, then it isn't a requirement of that command.
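To illustrate what per-queue external synchronization could look like, here is a minimal sketch, assuming each queue gets its own mutex. GuardedQueue and hammer are hypothetical names, and the counter stands in for the real vkQueueSubmit call so that the sketch runs without a device.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical wrapper: one mutex per queue, not one global lock.
struct GuardedQueue {
    std::mutex mutex;      // guards host access to this queue only
    int submit_count = 0;  // stand-in for the effect of a submission

    void submit() {
        std::lock_guard<std::mutex> lock(mutex);
        // Real code: vkQueueSubmit(queue, 1, &submit_info, fence);
        ++submit_count;
    }
};

// Runs `threads` workers that each submit `per_thread` times to `queue`.
void hammer(GuardedQueue& queue, int threads, int per_thread) {
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&queue, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                queue.submit();
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
}
```

Threads that submit to different queues never contend on the same lock here. Only threads that share a queue serialize against each other, which matches the "that queue" wording of the requirement.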

TensorFlow Execution on a single (multi-core) CPU Device

I have some questions regarding the execution model of TensorFlow in the specific case in which there is only a CPU device and the network is used only for inference, for instance using the Image Recognition(https://www.tensorflow.org/tutorials/image_recognition) C++ Example with a multi-core platform.
In the following, I will try to summarize what I understood, while asking some questions.
Session->Run() (file direct_session.cc) calls ExecutorState::RunAsync, which initializes the TensorFlow ready queue with the root nodes.
Then, the instruction
runner_([=]() { Process(tagged_node, scheduled_usec); }); (executor.cc, function ScheduleReady, line 2088)
assigns the node (and hence the related operation) to a thread of the inter_op pool.
However, I do not fully understand how it works.
For instance, when ScheduleReady tries to assign more operations than the size of the inter_op pool, how are the operations enqueued? (FIFO order?)
Does each thread of a pool have its own queue of operations, or is there a single shared queue?
Where can I find this in the code?
Where can I find the body of each thread of the pools?
Another question regards the nodes managed by inline_ready. How does the execution of these (inexpensive or dead) nodes differ from that of the other nodes?
Then, (still, to my understanding) the execution flow continues from ExecutorState::Process, which executes the operation, distinguishing between synchronous and asynchronous operations.
How do synchronous and asynchronous operations differ in terms of execution?
When the operation is executed, PropagateOutputs (which calls ActivateNodes) adds to the ready queue every successor node that has become ready thanks to the execution of the current node (its predecessor).
Finally, NodeDone() calls ScheduleReady(), which processes the nodes currently in the TensorFlow ready queue.
Conversely, how the intra_op thread pool is managed depends on the specific kernel, right? Is it possible for a kernel to request more operations than the intra_op thread pool size?
If yes, in which order are they enqueued? (FIFO?)
Once operations are assigned to threads of the pool, is their scheduling left to the underlying operating system, or does TensorFlow enforce some kind of scheduling policy?
I'm asking here because I found almost nothing about this part of the execution model in the documentation. If I missed some documents, please point me to them.
Re ThreadPool: When Tensorflow uses DirectSession (as it does in your case), it uses Eigen's ThreadPool. I could not get a web link to the official version of Eigen used in TensorFlow, but here is a link to the thread pool code. This thread pool is using this queue implementation RunQueue. There is one queue per thread.
Re inline_ready: ExecutorState::Process is scheduled in some Eigen thread. When it runs, it executes some nodes. As these nodes finish, they make other nodes (TensorFlow operations) ready. Some of these nodes are not expensive. They are added to inline_ready and executed in the same thread, without yielding. Other nodes are expensive and are not executed "immediately" in the same thread. Their execution is scheduled through the Eigen thread pool.
Re sync/async kernels: TensorFlow operations can be backed by synchronous kernels (most CPU kernels) or asynchronous kernels (most GPU kernels). Synchronous kernels are executed in the thread running Process. Asynchronous kernels are dispatched to their device (usually a GPU) to be executed. When asynchronous kernels are done, they invoke the NodeDone method.
Re Intra Op ThreadPool: The intra op thread pool is made available to kernels to run their computation in parallel. Most CPU kernels don't use it (and GPU kernels just dispatch to GPU) and run synchronously in the thread that called the Compute method. Depending on configuration there is either one intra op thread pool shared by all devices (CPUs), or each device has its own. Kernels simply schedule their work on this thread pool. Here is an example of one such kernel. If there are more tasks than threads, they are scheduled and executed in unspecified order. Here is the ThreadPool interface exposed to kernels.
I don't know of any way tensorflow influences the scheduling of OS threads. You can ask it to do some spinning (i.e. not immediately yield the thread to OS) to minimize latency (from OS scheduling), but that is about it.
These internal details are not documented on purpose, as they are subject to change. If you are using TensorFlow through the Python API, all you should need to know is that your ops will execute when their inputs are ready. If you want to enforce some order beyond this, you should use:
with tf.control_dependencies(<tensors_that_you_want_computed_before_the_ops_inside_this_block>):
    tf.foo_bar(...)
If you are writing a custom CPU kernel and want to do parallelism inside it (rarely needed, and usually only for very expensive kernels), the thread pool interface linked above is what you can rely on.