How do you synchronize a Metal Performance Shader with an MTLBlitCommandEncoder? - objective-c

I'm trying to better understand the synchronization requirements when working with Metal Performance Shaders and an MTLBlitCommandEncoder.
I have an MTLCommandBuffer that is set up as follows:
Use MTLBlitCommandEncoder to copy a region of Texture A into Texture B. Texture A is larger than Texture B. I'm extracting a "tile" from Texture A and copying it into Texture B.
Use an MPSImageBilinearScale metal performance shader with Texture B as the source texture and a third texture, Texture C, as the destination. This metal performance shader will scale and potentially translate the contents of Texture B into Texture C.
How do I ensure that the blit encoder completely finishes copying the data from Texture A to Texture B before the metal performance shader starts trying to scale Texture B? Do I even have to worry about this or does the serial nature of a command buffer take care of this for me already?
Metal has the concept of fences (MTLFence) for synchronizing access to resources, but I don't see any way to have a Metal Performance Shader wait on a fence (whereas waitForFence: is present on the encoders).
If I can't use fences and I do need to synchronize, is the recommended practice to just encode the blit, call waitUntilCompleted on the command buffer, and only then encode the shader and call waitUntilCompleted a second time? For example:
id<MTLCommandBuffer> commandBuffer;
// Enqueue blit encoder to copy Texture A -> Texture B
id<MTLBlitCommandEncoder> blitEncoder = [commandBuffer blitCommandEncoder];
[blitEncoder copyFromTexture:...];
[blitEncoder endEncoding];
// Wait for blit encoder to complete.
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
// Scale Texture B -> Texture C
MPSImageBilinearScale *imageScaleShader = [[MPSImageBilinearScale alloc] initWithDevice:...];
[imageScaleShader encodeToCommandBuffer:commandBuffer...];
// Wait for scaling shader to complete.
[commandBuffer commit];
[commandBuffer waitUntilCompleted];
The reason I think I need the intermediary copy into Texture B is that MPSImageBilinearScale appears to scale its entire source texture. The clipOffset is useful for output, but it doesn't apply to the actual scaling or transform. So the tile needs to be extracted from Texture A into a Texture B that is the same size as the tile itself; then the scaling and transform will "make sense". (Edit: disregard this footnote; I had forgotten some basic math principles and have since figured out how to make the scale transform's translate properties work with the clipRect.)

Metal takes care of this for you. The driver and GPU execute commands in a command buffer as though in serial fashion. (The "as though" allows for running things in parallel or out of order for efficiency, but only if the result would be the same as when done serially.)
Synchronization issues arise when both the CPU and GPU are working with the same objects, and when presenting textures on-screen (you shouldn't be rendering to a texture that's currently being presented).
There's a section of the Metal Programming Guide that deals with read-write access to resources by shaders; it's not exactly the same situation, but it should reassure you:
Memory Barriers
Between Command Encoders
All resource writes performed in a given command encoder are visible in the next command encoder. This is true for both render and compute command encoders.
Within a Render Command Encoder
For buffers, atomic writes are visible to subsequent atomic reads across multiple threads.
For textures, the textureBarrier method ensures that writes performed in a given draw call are visible to subsequent reads in the next draw call.
Within a Compute Command Encoder
All resource writes performed in a given kernel function are visible in the next kernel function.

MPS sits on top of Metal (mostly). It doesn’t replace it (mostly). You may assume that it is using the usual command encoders that you are using.
There are a few areas where MTLFences are required, particularly when interoperating with render encoders and MTLHeaps. When available, make use of the synchronize methods on the MPSImages and buffer types rather than rolling your own.
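For example, both pieces of work can be encoded into one command buffer with a single commit and no intermediate wait. Here is a minimal sketch, written against Apple's metal-cpp C++ headers purely for illustration; the queue, the textures, and the origins/size are assumed to exist already, and the MPS kernel is left as a comment because, as far as I know, metal-cpp doesn't wrap MPS:
MTL::CommandBuffer* cmdBuf = queue->commandBuffer();
MTL::BlitCommandEncoder* blit = cmdBuf->blitCommandEncoder();
// Copy the tile region out of Texture A into Texture B (slice 0, mip 0).
blit->copyFromTexture(textureA, 0, 0, tileOrigin, tileSize,
                      textureB, 0, 0, destOrigin);
blit->endEncoding();
// Encode the MPSImageBilinearScale kernel here, with Texture B as the source and
// Texture C as the destination. Because it is encoded after the blit in the same
// command buffer, its reads see the completed copy; no fence or CPU wait is needed.
cmdBuf->commit();             // one commit for the whole command buffer
cmdBuf->waitUntilCompleted(); // only if the CPU needs the result immediately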

Related

How to get a soft particle effect using the Direct3D 11 API

I tried all the ways I could find to calculate the particle alpha, and I set the shader resource for the draw call, but in RenderDoc the screenDepthTexture always shows "No Resource".
You’re probably trying to use the same depth buffer texture in two stages of the pipeline at the same time: reading it in the pixel shader to compute softness, while also using it as the depth render target in the output merger stage.
This is not going to work. When you call OMSetRenderTargets, the pixel shader binding is unset for the resource view of that texture.
An easy workaround is making a copy of your depth texture with CopyResource, and bind the copy to the input of the pixel shader. This way your pixel shader can read from there, while output merger stage uses another copy as depth/stencil target.
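As a rough sketch of that workaround (assuming device, context, and a non-multisampled depth texture depthTex created with a typeless format such as DXGI_FORMAT_R32_TYPELESS; the SRV slot is arbitrary):
D3D11_TEXTURE2D_DESC desc = {};
depthTex->GetDesc(&desc);
desc.BindFlags = D3D11_BIND_SHADER_RESOURCE;         // the copy only needs to be readable
ID3D11Texture2D* depthCopy = nullptr;
device->CreateTexture2D(&desc, nullptr, &depthCopy);
context->CopyResource(depthCopy, depthTex);          // duplicate the current depth data
D3D11_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R32_FLOAT;              // readable view of the typeless data
srvDesc.ViewDimension = D3D11_SRV_DIMENSION_TEXTURE2D;
srvDesc.Texture2D.MipLevels = 1;
ID3D11ShaderResourceView* depthSRV = nullptr;
device->CreateShaderResourceView(depthCopy, &srvDesc, &depthSRV);
context->PSSetShaderResources(0, 1, &depthSRV);      // bind the copy to the particle pixel shader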
In the future, to troubleshoot such issues, use the D3D11_CREATE_DEVICE_DEBUG flag when creating the device and read the debug output in Visual Studio. RenderDoc is awesome for higher-level bugs, when you're rendering something but it doesn't look right; your mistake here is incorrect use of the D3D API, and the debug layer in the Windows SDK is more useful than RenderDoc for such things.
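Enabling the debug layer is a single extra flag at device creation; a minimal sketch (error handling omitted):
UINT flags = 0;
#if defined(_DEBUG)
flags |= D3D11_CREATE_DEVICE_DEBUG;   // requires the SDK layers to be installed
#endif
ID3D11Device* device = nullptr;
ID3D11DeviceContext* context = nullptr;
D3D11CreateDevice(nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
                  nullptr, 0, D3D11_SDK_VERSION, &device, nullptr, &context);
// API misuse is then reported in the Visual Studio Output window at runtime.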

Copy a VkImage after TraceRaysKHR to CPU

Copying a VkImage that is being used to render to an offscreen framebuffer gives a black image.
When using a rasterizer the rendered image is non-empty but as soon as I switch to ray tracing the output image is empty:
// Offscreen render pass
vk::RenderPassBeginInfo offscreenRenderPassBeginInfo;
// setup framebuffer ..
if(useRaytracer) {
    helloVk.raytrace(cmdBuf, clearColor);
} else {
    cmdBuf.beginRenderPass(offscreenRenderPassBeginInfo, vk::SubpassContents::eInline);
    helloVk.rasterize(cmdBuf);
    cmdBuf.endRenderPass();
}
// saving to image
if(write_to_image)
{
    helloVk.copy_to_image(cmdBuf);
}
Both the ray tracer and rasterizer are using the same resources (e.g. the output image) through shared descriptor sets. I have also a post processing stage where the output is tone mapped & rendered to the swapchain.
I copy the image via a linear image and vkCmdCopyImage.
I have already tried a lot, but there are still many questions:
How can I get the ray traced image? Is it possible to get the output through memory barriers only in a single command buffer as I am using? Should I create an independent command buffer and get the output after a synchronization barrier? Does ray tracing need special VkPipelineStageFlags?
You always need to consider synchronization in Vulkan. It is a good idea to learn how this works, because in my opinion it is one of the most important topics.
Here is one of the blogs:
https://www.khronos.org/blog/understanding-vulkan-synchronization
In summary for lazy people:
Barriers (execution and memory): split commands into groups and make sure the groups execute in order rather than in parallel; they synchronize memory access across pipeline stages within one queue (see the sketch after this list).
Events: similar to barriers, but the signal and wait points are recorded as separate commands by the user; slower than barriers.
Subpass dependencies: synchronize the subpasses of a render pass.
Semaphores: synchronize multiple queues; used to synchronize shared data on the GPU.
Fences: same idea as semaphores, but used to synchronize shared data GPU->CPU.
Timeline semaphores (since Vulkan 1.2): synchronize work between CPU and GPU in both directions.
Queue waiting for idle: same as fences, synchronizes GPU->CPU, but for the entire queue rather than one particular submission.
Device waiting for idle: same as queue waiting for idle, synchronizes GPU->CPU, but for all queues.
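For the specific case in the question (trace rays, then copy the image, in one command buffer), a pipeline barrier is the relevant tool. A minimal sketch in the plain Vulkan C API, where cmdBuf and offscreenImage are assumed to exist and the layouts are illustrative:
VkImageMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;    // writes by the ray-tracing shaders
barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;   // reads by vkCmdCopyImage
barrier.oldLayout = VK_IMAGE_LAYOUT_GENERAL;           // storage image written by the RT pipeline
barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = offscreenImage;
barrier.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
vkCmdPipelineBarrier(cmdBuf,
    VK_PIPELINE_STAGE_RAY_TRACING_SHADER_BIT_KHR,      // after the trace-rays work
    VK_PIPELINE_STAGE_TRANSFER_BIT,                    // before the copy
    0, 0, nullptr, 0, nullptr, 1, &barrier);
// ...record vkCmdCopyImage here, then submit and wait on a fence before mapping the result.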
Resolved in the meantime:
When submitting the command buffer to the queue, an additional vkQueueWaitIdle(m_queue) was required, since the ray tracing work finishes with a certain latency.

Vulkan: Uploading non-pow-of-2 texture data with vkCmdCopyBufferToImage

There is a similar thread (Loading non-power-of-two textures in Vulkan) but it pertains to updating data in a host-visible mapped region.
I wanted to utilize the fully-fledged function vkCmdCopyBufferToImage to copy data present in a host-visible buffer to a device-local image. My image has dimensions which are not powers of 2 (1280x720, to be more specific).
When doing so, I've seen that it works fine on NVIDIA and Intel, but not on AMD, where the image becomes "distorted", which indicates a problem with row pitch/padding.
My host-visible buffer is tightly packed so bufferRowLength and bufferImageHeight are the same as imageExtent.width and imageExtent.height.
Shouldn't this function cater for non-power-of-2 textures? Or maybe I'm doing it wrong?
I could implement a workaround with a compute shader but I thought this function's purpose was to be generic. Also, the downside of using a compute shader is that the copy operation could not be performed on a transfer-only queue.
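For reference, the copy being described looks roughly like this (a sketch with illustrative names; the staging buffer is tightly packed, and per the spec a bufferRowLength/bufferImageHeight of 0 also means "tightly packed"):
VkBufferImageCopy region{};
region.bufferOffset = 0;
region.bufferRowLength = 1280;   // equal to imageExtent.width
region.bufferImageHeight = 720;  // equal to imageExtent.height
region.imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 };
region.imageOffset = { 0, 0, 0 };
region.imageExtent = { 1280, 720, 1 };
vkCmdCopyBufferToImage(cmdBuf, stagingBuffer, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);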

Why do we need multiple render passes and subpasses?

I had some experience with DirectX 12 in the past and I don't remember anything similar to render passes in Vulkan, so I can't make an analogy. If I'm understanding correctly, command buffers inside the same subpass don't need to be synchronized. So why complicate things and make multiple of them? Why can't I just take one command buffer and put all my frame-related info there?
Imagine that the GPU cannot render to images directly. Imagine that it can only render to special framebuffer memory storage, which is completely separate from regular image memory. You cannot talk to this framebuffer memory directly, and you cannot allocate from it. However, during a rendering operation, you can copy data from images into it, read data out of it into images, and of course render to this internal memory.
Now imagine that your special framebuffer memory is fixed in size, a size which is smaller than the size of the overall framebuffer you want to render to (perhaps much smaller). To be able to render to images that are bigger than your framebuffer memory, you basically have to execute all rendering commands for those targets multiple times. To avoid running vertex processing multiple times, you need a way to store the output of vertex processing stages.
Furthermore, when generating rendering commands, you need to have some idea of how to apportion your framebuffer memory. You may have to divide up your framebuffer memory differently if you're rendering to one 32-bpp image than if you're rendering to two. And how you assign your framebuffer memory can affect how your fragment shader code works. After all, this framebuffer rendering memory may be directly accessible by the fragment shader during a rendering operation.
That is the basic idea of the render pass model: you are rendering to special framebuffer memory, of an indeterminate size. Every aspect of the render pass system's complexity is based on this conceptual model.
Subpasses are the part where you determine exactly which things you're rendering to at the moment. Because this affects framebuffer memory arrangement, graphics pipelines are always built by referring to a subpass of a render pass. Similarly, secondary command buffers that are to be executed within a subpass must specify the subpass they will be used within.
When a render pass instance begins execution on a queue, it (conceptually) copies the attachment images we intend to render to into framebuffer rendering memory. At the end of the render pass, the data we render is copied back out to the attachment images.
During the execution of a render pass instance, the data for attachment images is considered "indeterminate". While the model says that we're copying into framebuffer rendering memory, Vulkan doesn't want to force implementations to actually copy stuff if they directly render to images.
As such, Vulkan merely states that no operation can access images that are being used as attachments, except for those which access the images as attachments. For example, you cannot read an attachment image as a texture. But you can read from it as an input attachment.
This is a conceptual description of the way tile-based renderers work. And this is the conceptual model that is the foundation of the Vulkan render pass architecture. Render targets are not accessible memory; they're special things that can only be accessed in special ways.
You can't "just" read from a G-buffer because, while you're rendering to that G-buffer, it exists in special framebuffer memory that isn't in the image yet.
Both features primarily exist for tile-based GPUs, which are common in mobile but, historically, uncommon on desktop computers. That's why DX12 doesn't have an equivalent, and Metal (iOS) does. Though both Nvidia's and AMD's recent architectures do a variant of tile-based rendering now also, and with the recent Windows-on-ARM PCs using Qualcomm chips (tile-based GPU), it will be interesting to see how DX12 evolves.
The benefit of render passes is that during pixel shading, you can keep the framebuffer data in on-chip memory instead of constantly reading and writing external memory. Caches help some, but without reordering pixel shading, the cache tends to thrash quite a bit since it's not large enough to store the entire framebuffer. A related benefit is you can avoid reading in previous framebuffer contents if you're just going to completely overwrite them anyway, and avoid writing out framebuffer contents at the end of the render pass if they're not needed after it's over. In many applications, tile-based GPUs never have to read and write depth buffer data or multisample data to or from external memory, which saves a lot of bandwidth and power.
Subpasses are an advanced feature that, in some cases, allow the driver to effectively merge multiple render passes into one. The goal and underlying mechanism is similar to the OpenGL ES Pixel Local Storage extension, but the API is a bit different in order to allow more GPU architectures to support it and to make it more extensible / future-proof. The classic example where this helps is with basic deferred shading: the first subpass writes out gbuffer data for each pixel, and later subpasses use that to light and shade pixels. Gbuffers can be huge, so keeping all of that on-chip and never having to read or write it to main memory is a big deal, especially on mobile GPUs which tend to be more bandwidth- and power-constrained.
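As a condensed illustration of that deferred-shading example, here is a minimal sketch in the plain Vulkan C API (formats, names, and layouts are arbitrary): subpass 0 writes a G-buffer attachment that never needs to leave tile memory, and subpass 1 reads it back as an input attachment.
VkAttachmentDescription attachments[2] = {};
// Attachment 0: G-buffer, written by subpass 0, read by subpass 1, discarded afterwards.
attachments[0].format = VK_FORMAT_R16G16B16A16_SFLOAT;
attachments[0].samples = VK_SAMPLE_COUNT_1_BIT;
attachments[0].loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;
attachments[0].storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;   // never written to main memory
attachments[0].stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
attachments[0].stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
attachments[0].initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
attachments[0].finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
// Attachment 1: the final lit image, written by subpass 1 and stored out.
attachments[1] = attachments[0];
attachments[1].format = VK_FORMAT_B8G8R8A8_UNORM;
attachments[1].storeOp = VK_ATTACHMENT_STORE_OP_STORE;
attachments[1].finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
VkAttachmentReference gbufWrite = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkAttachmentReference gbufRead = { 0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
VkAttachmentReference colorOut = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkSubpassDescription subpasses[2] = {};
subpasses[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount = 1;
subpasses[0].pColorAttachments = &gbufWrite;   // G-buffer pass
subpasses[1].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 1;
subpasses[1].pInputAttachments = &gbufRead;    // read in the fragment shader via subpassLoad()
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments = &colorOut;    // lighting pass
// Make subpass 0's color writes visible to subpass 1's fragment-shader reads, per tile.
VkSubpassDependency dep = {};
dep.srcSubpass = 0;
dep.dstSubpass = 1;
dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
VkRenderPassCreateInfo rpInfo = {};
rpInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
rpInfo.attachmentCount = 2;
rpInfo.pAttachments = attachments;
rpInfo.subpassCount = 2;
rpInfo.pSubpasses = subpasses;
rpInfo.dependencyCount = 1;
rpInfo.pDependencies = &dep;
VkRenderPass renderPass = VK_NULL_HANDLE;
vkCreateRenderPass(device, &rpInfo, nullptr, &renderPass);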

Does Vulkan allow draw commands in memory buffers?

I've mainly used DirectX in my 3D programming. I'm just learning Vulkan now.
Is this correct:
In Vulkan, a draw call (an operation that causes a group of primitives to be rendered, indexed or non-indexed) can only be placed in a command stream by making a draw API call while building that command stream. If you want to draw 3 objects using different vertex or index buffers (/offsets), you will, in the general case, make 3 API calls.
In d3d12, instead, the arguments for a draw call can come from a GPU memory buffer, filled in any of the ways buffers are filled, including using the GPU.
I'm aware of (complex) ways to essentially draw separate models as one batch, even on API's older than dx12. And of course drawing repeated geometry without multiple drawcalls is trivial.
But the straightforward "write draw commands into GPU memory like you write other data into GPU memory" feature is only available on DX12, correct?
Indirect drawing is a thing in Vulkan. It does require that a single vertex buffer contain all the data, but you don't need to draw all of the buffer in each call; the draw parameters themselves are read from a buffer at execution time, as sketched below.
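The built-in form is vkCmdDrawIndirect / vkCmdDrawIndexedIndirect. A minimal sketch with illustrative names; the buffer must be created with VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT and may be filled by the GPU, e.g. from a compute shader:
// The buffer holds an array of VkDrawIndexedIndirectCommand structs:
//   { indexCount, instanceCount, firstIndex, vertexOffset, firstInstance }
vkCmdDrawIndexedIndirect(cmdBuf,
                         indirectBuffer,
                         0,                                      // byte offset of the first command
                         drawCount,                              // number of draws pulled from the buffer
                         sizeof(VkDrawIndexedIndirectCommand));  // stride between commands
With vkCmdDrawIndirectCount (core in Vulkan 1.2), the draw count itself can also come from a buffer.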
There is also an extension that allows you to build a set of drawing commands in GPU memory; it additionally allows binding different descriptor sets and vertex buffers between draws.