The Metal Best Practices Guide suggests using triple buffering for dynamic data buffers. But the listing provided in the documentation and the default Metal example generated by Xcode block every frame waiting for the GPU to finish its work:
- (void)render
{
    // Wait until the inflight command buffer has completed its work
    dispatch_semaphore_wait(_frameBoundarySemaphore, DISPATCH_TIME_FOREVER);
    // TODO: Update dynamic buffers and send them to the GPU here !
    __weak dispatch_semaphore_t semaphore = _frameBoundarySemaphore;
    [commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> commandBuffer) {
        // GPU work is complete
        // Signal the semaphore to start the CPU work
        dispatch_semaphore_signal(semaphore);
    }];
    // CPU work is complete
    // Commit the command buffer and start the GPU work
    [commandBuffer commit];
}
So how does triple buffering improve anything here?
The important bit you didn't spot in the sample is:
_frameBoundarySemaphore = dispatch_semaphore_create(kMaxInflightBuffers);
As the documentation for dispatch_semaphore_create says:
Passing a value greater than zero is useful for managing a finite pool of resources, where the pool size is equal to the value.
kMaxInflightBuffers is set to 3 for triple buffering. The first 3 calls to dispatch_semaphore_wait will succeed without any waiting.
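In other words, the semaphore starts at 3, and the CPU only blocks once it gets more than kMaxInflightBuffers frames ahead of the GPU. A minimal sketch of that counting behaviour (using the libdispatch C API from C++ on Apple platforms, with the names taken from the sample):

#include <dispatch/dispatch.h>

static const int kMaxInflightBuffers = 3;            // triple buffering
static dispatch_semaphore_t _frameBoundarySemaphore =
    dispatch_semaphore_create(kMaxInflightBuffers);  // counter starts at 3

void renderFrame()
{
    // Frames 1-3 decrement the counter from 3 towards 0 and return immediately.
    // Frame 4 blocks only if the GPU has not yet finished frame 1, and so on:
    // the CPU may run up to three frames ahead of the GPU.
    dispatch_semaphore_wait(_frameBoundarySemaphore, DISPATCH_TIME_FOREVER);

    // ... update this frame's dynamic buffer (e.g. index frameNumber % 3),
    //     encode, and have the command buffer's completed handler call
    //     dispatch_semaphore_signal(_frameBoundarySemaphore) ...
}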
Copying a VkImage that is being used to render to an offscreen framebuffer gives a black image.
When using the rasterizer, the rendered image is non-empty, but as soon as I switch to ray tracing the output image is empty:
// Offscreen render pass
vk::RenderPassBeginInfo offscreenRenderPassBeginInfo;
// setup framebuffer ..
if(useRaytracer) {
    helloVk.raytrace(cmdBuf, clearColor);
} else {
    cmdBuf.beginRenderPass(offscreenRenderPassBeginInfo, vk::SubpassContents::eInline);
    helloVk.rasterize(cmdBuf);
    cmdBuf.endRenderPass();
}
// saving to image
if(write_to_image)
{
    helloVk.copy_to_image(cmdBuf);
}
Both the ray tracer and the rasterizer use the same resources (e.g. the output image) through shared descriptor sets. I also have a post-processing stage where the output is tone mapped and rendered to the swapchain.
I copy the image via a linear image and vkCmdCopyImage.
I have already tried a lot, but there are still so many questions:
How can I get the ray traced image? Is it possible to get the output through memory barriers only in a single command buffer as I am using? Should I create an independent command buffer and get the output after a synchronization barrier? Does ray tracing need special VkPipelineStageFlags?
You always need to consider synchronization in Vulkan. It is a good idea to learn how it works, because in my opinion it is one of the most important topics.
Here is one of the blog posts on the subject:
https://www.khronos.org/blog/understanding-vulkan-synchronization
In summary for lazy people:
Barriers (execution and memory): split commands into groups and make sure the groups execute in order rather than in parallel; they synchronize memory access between pipeline stages within one queue.
Events: similar to barriers, but the split point is set explicitly (from the command buffer or from the host); slower than barriers.
Subpass dependencies: synchronize subpasses within a render pass.
Semaphores: synchronize multiple queues; used to synchronize shared data on the GPU.
Fences: like semaphores, but used to synchronize shared data GPU->CPU.
Timeline semaphores (since Vulkan 1.2): synchronize work between CPU and GPU in both directions.
Queue waiting for idle: like a fence, synchronizes GPU->CPU, but for the entire queue rather than one particular submission.
Device waiting for idle: like queue waiting for idle, synchronizes GPU->CPU, but for all queues.
Resolved by now:
When submitting the command buffer to the queue, it turned out to require an additional vkQueueWaitIdle(m_queue), since the ray tracing work finishes with a certain latency.
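An alternative to waiting for the whole queue to go idle is to express the dependency inside the command buffer itself. Purely as a hedged sketch (it assumes the ray tracer writes the output image as a storage image in vk::ImageLayout::eGeneral and that VK_KHR_ray_tracing_pipeline is in use; outputImage and the function name are placeholders, not taken from the original code), an image memory barrier between the trace and the copy would look roughly like this:

#include <vulkan/vulkan.hpp>

// Make the ray tracer's writes to the output image visible to the subsequent
// vkCmdCopyImage, without changing its layout.
void addRaytraceToCopyBarrier(vk::CommandBuffer cmdBuf, vk::Image outputImage)
{
    vk::ImageMemoryBarrier barrier{};
    barrier.srcAccessMask       = vk::AccessFlagBits::eShaderWrite;   // writes from the ray tracing shaders
    barrier.dstAccessMask       = vk::AccessFlagBits::eTransferRead;  // reads by the copy
    barrier.oldLayout           = vk::ImageLayout::eGeneral;          // storage-image layout used for tracing
    barrier.newLayout           = vk::ImageLayout::eGeneral;          // unchanged; eGeneral is also valid as a copy source
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = outputImage;
    barrier.subresourceRange    = {vk::ImageAspectFlagBits::eColor, 0, 1, 0, 1};

    cmdBuf.pipelineBarrier(vk::PipelineStageFlagBits::eRayTracingShaderKHR,  // after the trace
                           vk::PipelineStageFlagBits::eTransfer,             // before the copy
                           {}, nullptr, nullptr, barrier);
}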
I have a vertex buffer (a VkBuffer bound to a device memory allocation) whose memory is host visible and host coherent.
To write to the vertex buffer on the host side I map it, memcpy to it and unmap the device memory.
To read from it I bind the vertex buffer in a command buffer during recording a render pass. These command buffers are submitted in a loop that acquires, submits and presents, to draw each frame.
Currently I write once to the vertex buffer at program start up.
The vertex buffer then remains the same during the loop.
I'd like to modify the vertex buffer between each frame from the host side.
What I'm not clear on is the best/right way to synchronize these host-side writes with the device-side reads. Currently I have a fence and a pair of semaphores for each frame allowed simultaneously in flight.
For each frame:
I wait on the fence.
I reset the fence.
The acquire signals semaphore #1.
The queue submit waits on semaphore #1, signals semaphore #2, and signals the fence.
The present waits on semaphore #2.
Where is the right place in this to put the host-side map/memcpy/unmap and how should I synchronize it properly with the device reads?
If you want to take advantage of asynchronous GPU execution, you want the CPU to avoid having to stall for GPU operations. So never wait on a fence for a batch that was just issued. The same goes for memory: you should never find yourself wanting to write to memory that is being read by a GPU operation you just submitted.
You should at least double-buffer things. If you are changing vertex data every frame, you should allocate sufficient memory to hold two copies of that data. There's no need to make multiple allocations, or even to make multiple VkBuffers (just make the allocation and buffers bigger, then select which region of storage to use when you're binding it). While one region of storage is being read by GPU commands, you write to the other.
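As a hedged sketch of that layout (regionSize, mappedPtr, and the other names are placeholders, not taken from the question): one VkBuffer backed by one allocation, split into two regions, where the CPU writes one region while the GPU reads the other, and the draw picks its region through the bind offset.

#include <vulkan/vulkan.h>
#include <cstring>

const uint32_t kRegions = 2;     // double buffering; use 3 for triple buffering
VkDeviceSize   regionSize;       // size of one copy of the vertex data
void*          mappedPtr;        // persistent mapping of the whole allocation

void writeAndBind(uint32_t frameIndex, const void* vertexData,
                  VkCommandBuffer cmd, VkBuffer vertexBuffer)
{
    // Pick the region the GPU is not currently reading from.
    VkDeviceSize offset = (frameIndex % kRegions) * regionSize;

    // Host write goes into this frame's region (HOST_VISIBLE | HOST_COHERENT memory).
    std::memcpy(static_cast<char*>(mappedPtr) + offset, vertexData, regionSize);

    // The draw recorded into cmd reads the same region via the bind offset.
    vkCmdBindVertexBuffers(cmd, 0, 1, &vertexBuffer, &offset);
}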
Each batch you submit reads from certain memory. As such, the fence for that batch will be set when the GPU is finished reading from that memory. So if you want to write to that memory from the CPU, you cannot begin until the fence for the batch that reads it gets set.
But because you're double buffering like this, the fence for the memory you're about to write to is not the fence for the batch you submitted last frame; it's the fence for the batch you submitted the frame before that. Since it's been some time since the GPU received that work, it is far less likely that the CPU will actually have to wait. That is, the fence should hopefully already be set.
Now, you shouldn't do a literal vkWaitForFences on that fence. You should check whether it is set, and if it isn't, go do something else useful with your time. But if you have nothing else useful to do, then waiting is probably OK (rather than sitting and spinning on a test).
Once the fence is set, you know that you can freely write to the memory.
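The non-blocking check described above is vkGetFenceStatus. A small sketch (inFlightFence here stands for the fence of the batch submitted two frames ago):

#include <vulkan/vulkan.h>

// Returns true when the GPU has finished the batch guarded by this fence,
// i.e. when it is safe to overwrite that batch's vertex region.
bool regionIsFree(VkDevice device, VkFence inFlightFence)
{
    if (vkGetFenceStatus(device, inFlightFence) == VK_SUCCESS)
        return true;                         // fence is set
    // VK_NOT_READY: go do something else useful, or as a last resort block with
    // vkWaitForFences(device, 1, &inFlightFence, VK_TRUE, UINT64_MAX);
    return false;
}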
How do I know that the memory I have written to with the memcpy has finished being sent to the device before it is read by the render pass?
You know because the memory is coherent. That is what VK_MEMORY_PROPERTY_HOST_COHERENT_BIT means in this context: host changes to device memory are visible to the GPU without needing explicit visibility operations, and vice-versa.
Well... almost.
If you want to avoid having to use any synchronization, you must call vkQueueSubmit for the reading batch after you have finished modifying the memory on the CPU. If they get called in the wrong order, then you'll need a memory barrier. For example, you could have some part of the batch wait on an event set by the host (through vkSetEvent), which tells the GPU when you've finished writing. And therefore, you could submit that batch before performing the memory writing. But in this case, the vkCmdWaitEvents call should include a source stage mask of HOST (since that's who's setting the event), and it should have a memory barrier whose source access flag also includes HOST_WRITE (since that's who's writing to the memory).
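A rough sketch of that event-based variant (placeholder names; the destination stage and access here are for a vertex-buffer read, matching this question's use case):

#include <vulkan/vulkan.h>

// Recorded into the batch: do not read the vertex buffer until the host has
// signaled dataEvent, and make the host writes visible to the vertex fetch.
void recordWaitForHostWrite(VkCommandBuffer cmd, VkEvent dataEvent,
                            VkBuffer vertexBuffer, VkDeviceSize size)
{
    VkBufferMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_HOST_WRITE_BIT;            // the CPU is the writer
    barrier.dstAccessMask       = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = vertexBuffer;
    barrier.offset              = 0;
    barrier.size                = size;

    vkCmdWaitEvents(cmd, 1, &dataEvent,
                    VK_PIPELINE_STAGE_HOST_BIT,          // the event is set by the host
                    VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,  // block vertex fetch until then
                    0, nullptr, 1, &barrier, 0, nullptr);
}

// On the CPU, once the memcpy has finished: vkSetEvent(device, dataEvent);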
But in most cases, it's easier to just write to the memory before submitting the batch. That way, you avoid needing to use host/event synchronization.
I have a number of uniform buffers - one for every framebuffer. With the help of fences I guarantee that the update on the CPU side is safe, i.e. while I memcpy I am sure the buffer is not in use. After an update, I flush the memory.
Now, if I understand correctly, I need to make the new data available to the GPU - for this, I need to use barriers. This is how I'm doing it right now:
VkBufferMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.pNext = nullptr;
barrier.srcAccessMask = VK_ACCESS_HOST_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_UNIFORM_READ_BIT;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.buffer = buffer;
barrier.offset = offset;
barrier.size = size;
vkCmdPipelineBarrier(commandBuffer, VK_PIPELINE_STAGE_HOST_BIT, VK_PIPELINE_STAGE_TESSELLATION_EVALUATION_SHADER_BIT, 0, 0, nullptr, 1, &barrier, 0, nullptr);
Well, actually in my case everything works without the barrier. Do I need it at all?
If I change barrier.dstAccessMask and dstStageMask to VK_ACCESS_TRANSFER_READ_BIT and VK_PIPELINE_STAGE_TRANSFER_BIT respectively, everything again works fine and the validation layers are not complaining. What is the better choice and why?
If I try to set a barrier after vkCmdBeginRenderPass, the layer complains. So I moved all my barriers between vkBeginCommandBuffer and vkCmdBeginRenderPass. How correct is this?
Well, actually in my case everything works without the barrier.
Does the specification say you need it? Then you need it.
When dealing with Vulkan, you should not take "everything appears to work" as a sign that you've done everything right.
What is the better choice and why?
The "better choice" is the correct one. The GPU is not doing a transfer operation on this memory; it's reading it as uniform data. Therefore, the operation you specify in the barrier must match this.
Layers aren't complaining because it's more or less impossible for validation layers to know when you've written to a piece of memory. Therefore, they can't tell if you correctly built a barrier to make such writes available to the GPU.
If I try to set a barrier after vkCmdBeginRenderPass, the layer complains.
Barriers inside render passes have to be inside subpasses with self-dependencies. And the barrier itself has to match the subpass self-dependency.
Basically, barriers of the kind you're talking about have to happen before the render pass.
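For reference, a subpass self-dependency is something the render pass itself has to declare (in VkRenderPassCreateInfo::pDependencies) with srcSubpass equal to dstSubpass; any barrier recorded inside that subpass must then stay within the declared scopes. The stages and access masks below are just a typical example (color output feeding a later fragment-shader read), not what the host-write case above needs:

#include <vulkan/vulkan.h>

// Self-dependency for subpass 0; a matching vkCmdPipelineBarrier may then be
// recorded while that subpass is active.
VkSubpassDependency makeSelfDependency()
{
    VkSubpassDependency dep{};
    dep.srcSubpass      = 0;                 // same subpass on both sides
    dep.dstSubpass      = 0;
    dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
    dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
    dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
    dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
    dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;  // required for framebuffer-space self-dependencies
    return dep;
}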
That being said, merely calling vkQueueSubmit automatically creates a memory barrier between (properly flushed) host writes issued before vkQueueSubmit and any usage of those writes from the command buffers in the submit command (and of course, commands in later submit operations).
So you shouldn't need such a barrier, so long as you can ensure that you've finished your writes (and any needed flushing) before the vkQueueSubmit that reads from them. And if you can't guarantee that, you probably should have been using a vkCmdWaitEvents barrier to prevent trying to read until you had finished writing (and flushing).
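So the straightforward ordering is: write, flush if needed, then submit. A small sketch with placeholder names (the flush can be skipped entirely for HOST_COHERENT memory):

#include <vulkan/vulkan.h>
#include <cstring>

void updateAndSubmit(VkDevice device, VkQueue queue, VkDeviceMemory memory,
                     void* mapped, const void* src, VkDeviceSize size,
                     const VkSubmitInfo* submitInfo, VkFence fence)
{
    std::memcpy(mapped, src, size);              // 1. host write

    VkMappedMemoryRange range{};                 // 2. flush (only for non-coherent memory)
    range.sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
    range.memory = memory;
    range.offset = 0;
    range.size   = VK_WHOLE_SIZE;
    vkFlushMappedMemoryRanges(device, 1, &range);

    // 3. submit: the submission itself makes the flushed host writes available
    //    to the command buffers, so no explicit HOST barrier is needed.
    vkQueueSubmit(queue, 1, submitInfo, fence);
}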
I'm trying to output isochronous data (generated programmatically) over High Speed USB 2 with very low latency. Ideally around 1-2 ms. On Windows I'm using WinUsb, and on OSX I'm using IOKit.
There are two approaches I have thought of. I'm wondering which is best.
1-frame transfers
WinUsb is quite restrictive in what it allows, and requires each isochronous transfer to be a whole number of frames (1 frame = 1 ms). Therefore, to minimise latency, I use transfers of one frame each in a loop, something like this:
for (;;)
{
    // Submit a 1-frame transfer ASAP.
    WinUsb_WriteIsochPipeAsap(..., &overlapped[i]);
    // Wait for the transfer from 2 frames ago to complete, for timing purposes. This
    // keeps the loop in sync with the USB frames.
    WinUsb_GetOverlappedResult(..., &overlapped[i-2], block=true);
}
This works fairly well and gives a latency of 2 ms. On OSX I can do a similar thing, though it is quite a bit more complicated. This is the gist of the code - the full code is too long to post here:
uint64_t frame = ...->GetBusFrameNumber(...) + 1;
for (;;)
{
    // Submit at the next available frame.
    for (a few attempts)
    {
        kr = ...->LowLatencyWriteIsochPipeAsync(...
                                                frame,          // Start on this frame.
                                                &transfer[i]);  // Callback
        if (kr == kIOReturnIsoTooOld)
            frame++;  // Try the next frame.
        else if (kr == kIOReturnSuccess)
            break;
        else
            abort();
    }
    // Above, I pass a callback with a reference to a condition_variable. When
    // the transfer completes the condition_variable is triggered and wakes this up:
    transfer[i-5].waitForResult();
    // I have to wait for 5 frames ago on OSX, otherwise it skips frames.
}
Again this kind of works and gives a latency of around 3.5 ms. But it's not super-reliable.
Race the kernel
OSX's low latency isochronous functions allow you to submit long transfers (e.g. 64 frames), and then regularly (max once per millisecond) update the frame list which says where the kernel has got to in reading the write buffer.
I think the idea is that you somehow wake up every N milliseconds (or microseconds), read the frame list, work out where you need to write to and do that. I haven't written code for this yet but I'm not entirely sure how to proceed, and there are no examples I can find.
It doesn't seem to provide a callback when the frame list is updated so I suppose you have to use your own timer - CFRunLoopTimerCreate() and read the frame list from that callback?
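Roughly what I imagine is something like this (an untested sketch; refillAheadOfKernel() is just a hypothetical placeholder for the "work out where to write and do it" step):

#include <CoreFoundation/CoreFoundation.h>

// Fires about once per millisecond, reads the shared frame list to see how far
// the kernel has progressed, and writes new data just ahead of that point.
static void timerCallback(CFRunLoopTimerRef timer, void* info)
{
    // refillAheadOfKernel(info);
}

void startPollingTimer(void* transferState)
{
    CFRunLoopTimerContext context{0, transferState, nullptr, nullptr, nullptr};
    CFRunLoopTimerRef timer = CFRunLoopTimerCreate(
        kCFAllocatorDefault,
        CFAbsoluteTimeGetCurrent() + 0.001,  // first fire
        0.001,                               // 1 ms interval
        0, 0, timerCallback, &context);
    CFRunLoopAddTimer(CFRunLoopGetCurrent(), timer, kCFRunLoopCommonModes);
}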
Also I'm wondering if WinUsb allows a similar thing, because it also forces you to register a buffer so it can be simultaneously accessed by the kernel and user-space. I can't find any examples that explicitly say you can write to the buffer while the kernel is reading it though. Are you meant to use WinUsb_GetCurrentFrameNumber in a regular callback to work out where the kernel has got to in a transfer?
That would require getting a regular callback on Windows, which seems a bit tricky. The only way I've seen is to use multimedia timers, which have a minimum period of 1 millisecond (unless you use the undocumented NtSetTimerResolution?).
So my question is: can I improve the "1-frame transfers" approach, or should I switch to a 1 kHz callback that tries to race the kernel? Example code would be very much appreciated!
(Too long for a comment, so…)
I can only address the OS X side of things. This part of the question:
"I think the idea is that you somehow wake up every N milliseconds (or microseconds), read the frame list, work out where you need to write to and do that. I haven't written code for this yet but I'm not entirely sure how to proceed, and there are no examples I can find. It doesn't seem to provide a callback when the frame list is updated so I suppose you have to use your own timer - CFRunLoopTimerCreate() and read the frame list from that callback?"
Has me scratching my head over what you're trying to do. Where is your data coming from, such that latency is critical but the data source does not already notify you when data is ready?
The idea is that your data is being streamed from some source, and as soon as any data becomes available, presumably when some completion for that data source gets called, you write all available data into the user/kernel shared data buffer at the appropriate location.
So maybe you could explain in a little more detail what you're trying to do and I might be able to help.
I'm trying to use a continually running thread to perform tasks at a high rate in my app.
In this app I have a list of 1000 or so timestamps. I poll them until it's their time and then instruct an AUSampler to play.
The problem I have is that I seem to fundamentally misunderstand how NSThread works. In the simple example below the CPU shoots up to 100% despite no tasks being run.
In what way am I using NSThread incorrectly?
What would be a better way to create a very fast polling mechanism that doesn't hog the CPU?
audioThread = [[NSThread alloc] initWithTarget:self selector:@selector(test) object:nil];
[audioThread setThreadPriority:0];
[audioThread start];
-(void)test
{
    while(mycondition)
    {
        // do my work
        // cpu == 100%
    }
}
The solution was to find the minimum interval between sampler fire times (based on the BPM/resolution) and make the thread sleep for that amount on each pass through the loop.
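A rough sketch of that fix in C++ (the original code is Objective-C; minIntervalMs is assumed to be derived from the BPM and resolution as described):

#include <atomic>
#include <chrono>
#include <thread>

void pollingLoop(std::atomic<bool>& running, double minIntervalMs)
{
    while (running)
    {
        // ... check the timestamp list and trigger any samplers that are due ...

        // Yield the CPU until the next possible fire time instead of spinning.
        std::this_thread::sleep_for(
            std::chrono::duration<double, std::milli>(minIntervalMs));
    }
}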