What is the DirectX 12 equivalent of Vulkan's "transient attachment"?

I have a compute shader which I'd like to output to an image/buffer which is meant to be intermediate storage between two pipelines: a compute pipeline, and a graphics pipeline. The graphics pipeline is actually a "dummy", in that it does nothing apart from copy the contents of the intermediate buffer into a swapchain image. This is necessitated by the fact that DX12 deprecated the ability of compute pipelines to use UAVs to directly write into swapchain images.
I think the intermediate storage should be a "transient" attachment, in the Vulkan sense:
VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT specifies that the memory bound to this image will have been allocated with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (see Memory Allocation for more detail). This bit can be set for any image that can be used to create a VkImageView suitable for use as a color, resolve, depth/stencil, or input attachment.
This is explained in this article:
Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that begin in an uninitialized or cleared state at the beginning of a renderpass, are written by one or more subpasses, consumed by one or more subpasses and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments only lives within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead only ever living in cache. This saves bandwidth, reduces latency and improves power efficiency.
Does DirectX 12 have a similar image usage concept?

Direct3D 12 does not have this concept. And the reason for that limitation ultimately boils down to why transient allocation exists. TL;DR: It's not for doing the kind of thing you're trying to do.
Vulkan's render pass system exists for one purpose: to make tile-based renderers first-class citizens of the rendering system. TBRs do not fit well in OpenGL or D3D's framebuffer model. In both APIs, you can just swap framebuffers in and out whenever you want.
TBRs do not render to memory directly. They perform rendering operations into internal buffers, which are seeded from memory and then possibly written to memory after the rendering operation is complete. Switching rendered images whenever you want works against this structure, which is why TBR vendors have a list of things you're not supposed to do if you want high-performance in your OpenGL ES code.
Vulkan's render pass system is an abstraction of a TBR system. In the abstract model, the render pass system potentially reads data from the images in the frame buffer, then performs a bunch of subpasses on copies of this data, and at the end, potentially writes the updated data back out into the images. So from the outside of the process, it looks like you're rendering to the images, but you're not. To maintain this illusion, for the duration of a render pass, you can only use those framebuffer images in the way that the render pass model allows: as attachments.
Now consider deferred rendering. In deferred rendering, you render to g-buffers, which you then read in your lighting passes to generate the final image. Once you've generated the final image, you don't need those g-buffers anymore. In a regular GPU, that doesn't mean anything; because rendering goes directly to memory, those g-buffers must take up actual storage.
But consider how a TBR works. It does rendering into a single tile; in optimal cases, it processes all of the fragments for a single tile at once. Which means it goes through the geometry and lighting passes. For a TBR, the g-buffer is just a piece of scratch memory you use to get the final answer; it doesn't need to be read from memory or copied to memory.
In short, it doesn't need memory.
Enter lazily allocated memory and transient attachment images. They exist to allow TBRs to keep g-buffers in tile memory and never to have to allocate actual storage for them (or at least, it only happens if some runtime circumstance occurs that forces it, like shoving too much geometry at the GPU). And it only works within a render pass; if you end a render pass and have to use one of the g-buffers in another render pass, then the magic has to go away and the data has to touch actual storage.
The Vulkan API makes how specific this use case is very explicit. You cannot bind a piece of lazily-allocated memory to an image that does not have the USAGE_TRANSIENT_ATTACHMENT flag set on it (or to a buffer of any kind). And you'll notice that it says "transient attachment", as in render pass attachments. It says this because you'll also notice that transient attachments cannot be used for non-attachment uses (part of the valid usage tests for VkImageCreateInfo). At all.
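In code, that restriction looks roughly like this (a minimal sketch; device, width, height and the findLazyMemoryType helper are placeholder names, and error handling is omitted):

```cpp
// Sketch: creating a transient color attachment backed by lazily allocated
// memory. findLazyMemoryType is a hypothetical helper that searches
// VkPhysicalDeviceMemoryProperties for a type with
// VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (falling back to plain DEVICE_LOCAL
// if none exists).
VkImageCreateInfo imageInfo = {};
imageInfo.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
imageInfo.imageType = VK_IMAGE_TYPE_2D;
imageInfo.format = VK_FORMAT_R8G8B8A8_UNORM;
imageInfo.extent = {width, height, 1};
imageInfo.mipLevels = 1;
imageInfo.arrayLayers = 1;
imageInfo.samples = VK_SAMPLE_COUNT_1_BIT;
imageInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
// Only attachment-style usages are legal alongside TRANSIENT_ATTACHMENT:
imageInfo.usage = VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT |
                  VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
                  VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT;
imageInfo.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;

VkImage gbuffer;
vkCreateImage(device, &imageInfo, nullptr, &gbuffer);

VkMemoryRequirements memReq;
vkGetImageMemoryRequirements(device, gbuffer, &memReq);

VkMemoryAllocateInfo allocInfo = {};
allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
allocInfo.allocationSize = memReq.size;
allocInfo.memoryTypeIndex = findLazyMemoryType(
    memReq.memoryTypeBits, VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT);

VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, nullptr, &memory);
vkBindImageMemory(device, gbuffer, memory, 0);
```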
What you want to do is not the sort of thing that lazily allocated memory is made for. It can't work.
As for Direct3D 12, the API is not designed to run on mobile GPUs, and since only mobile GPUs are tile-based renderers (some recent desktop GPUs have TBR similarities, but are not full TBRs), it has no facilities designed explicitly for them. And thus, it has no need for lazily allocated memory or transient attachments.

Related

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert from packed 3-channel image data to a 4-channel texture on the GPU. Is it a good idea to, instead of copying the 3-channel image to the GPU before decoding it, write it to a host-visible buffer, then read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool expose memory that is both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to allow you to make an SSBO in them).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
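A minimal sketch of that memory-type check might look like this (the function name is made up; memoryTypeBits would come from vkGetBufferMemoryRequirements for the SSBO's buffer):

```cpp
// Sketch: scan the device's memory types for one that is both DEVICE_LOCAL and
// HOST_VISIBLE (and allowed by the buffer's memoryTypeBits). Returns -1 if no
// such type exists, in which case a staging path is needed.
int FindDirectUploadType(VkPhysicalDevice gpu, uint32_t memoryTypeBits)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(gpu, &props);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
    {
        if ((memoryTypeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
        {
            return static_cast<int>(i);
        }
    }
    return -1; // fall back to staging or a non-DEVICE_LOCAL type
}
```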
Should you use such memory types for doing these kinds of image fiddling? You have to profile them to know. And you'll also have to take into account whether your application has ways to hide any DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.

What is the relationship of RenderPass and Pipeline in Vulkan?

What is the logical relationship between RenderPass and Pipeline in Vulkan?
Setting RenderPass aside, my understanding of the rendering process is: the application prepares the vertex data and texture data and submits them to the driver, the data then passes through the various stages of the pipeline, and once the results are written to the framebuffer, a render is complete.
So what is the responsibility of the RenderPass? Is it an abstraction that provides metadata for rendering each stage (such as Format), or does it have some other role?
And how are RenderPass and Pipeline related? Is it containment, for example each Pipeline belonging to a Subpass? Or a dependency, for example the final output of the Pipeline being handled by the RenderPass? Or is it something else?
At the end of the day, Vulkan is a nice, modern-ish OO API. Vulkan objects are practically defined by nothing more than the parameters they take at creation. I'm saying this just to ease your learning: you can look at vkCreateX and largely understand what VkX does in Vulkan.
VkPipeline is a GPU context. Think of the GPU as an FPGA (which it isn't, but bear with me). Doing vkCmdBindPipeline would set the GPU to the given gate configuration. Except the GPU is not an FPGA; in our case it sets the GPU to a state where it can execute the shader programs and fixed-function pipeline stages defined by the VkPipeline.
VkRenderPass is a data-oriented thing. It is necessitated by tiled-architecture GPUs (mobile GPUs). On desktop GPUs it can still fill the role of being an oracle for optimization and/or allow partially-tiled architectures (or any other architecture that can make use of it).
Tiled-architecture GPUs need to "load" an image/buffer from general-purpose RAM to on-chip memory. When they are done, they "store" their results back to RAM.
VkRenderPass defines what kind of inputs (attachments) will be needed. It defines how they get loaded and stored before and after the render pass instance*, respectively. It also has subpasses and defines the synchronization between them (replacing vkCmdPipelineBarrier). And it defines the purpose a given render pass attachment will serve (e.g. whether it is a color buffer or a depth buffer).
* A Render Pass Instance is the thing created from a Render Pass by vkCmdBeginRenderPass. Yeah, not confusing at all, right?
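To make that concrete, here is a minimal sketch of a one-attachment, one-subpass render pass (swapchainFormat and device are assumed to exist; error handling is omitted):

```cpp
// Sketch: a render pass with one color attachment. loadOp/storeOp describe the
// "load from RAM to tile memory" / "store from tile memory back to RAM"
// behavior discussed above.
VkAttachmentDescription color = {};
color.format = swapchainFormat;
color.samples = VK_SAMPLE_COUNT_1_BIT;
color.loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR;      // don't read old contents, just clear the tile
color.storeOp = VK_ATTACHMENT_STORE_OP_STORE;    // write the tile back out at the end
color.stencilLoadOp = VK_ATTACHMENT_LOAD_OP_DONT_CARE;
color.stencilStoreOp = VK_ATTACHMENT_STORE_OP_DONT_CARE;
color.initialLayout = VK_IMAGE_LAYOUT_UNDEFINED;
color.finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;

VkAttachmentReference colorRef = {};
colorRef.attachment = 0;
colorRef.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkSubpassDescription subpass = {};
subpass.pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpass.colorAttachmentCount = 1;
subpass.pColorAttachments = &colorRef;

VkRenderPassCreateInfo rpInfo = {};
rpInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO;
rpInfo.attachmentCount = 1;
rpInfo.pAttachments = &color;
rpInfo.subpassCount = 1;
rpInfo.pSubpasses = &subpass;

VkRenderPass renderPass;
vkCreateRenderPass(device, &rpInfo, nullptr, &renderPass);
```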

Why do we need multiple render passes and subpasses?

I had some experience with DirectX 12 in the past and I don't remember anything similar to render passes in Vulkan, so I can't make an analogy. If I'm understanding correctly, command buffers inside the same subpass don't need to be synchronized. So why complicate things and make multiple of them? Why can't I just take one command buffer and put all my frame-related info there?
Imagine that the GPU cannot render to images directly. Imagine that it can only render to special framebuffer memory storage, which is completely separate from regular image memory. You cannot talk to this framebuffer memory directly, and you cannot allocate from it. However, during a rendering operation, you can copy data from images into it, read data out of it into images, and of course render to this internal memory.
Now imagine that your special framebuffer memory is fixed in size, a size which is smaller than the size of the overall framebuffer you want to render to (perhaps much smaller). To be able to render to images that are bigger than your framebuffer memory, you basically have to execute all rendering commands for those targets multiple times. To avoid running vertex processing multiple times, you need a way to store the output of vertex processing stages.
Furthermore, when generating rendering commands, you need to have some idea of how to apportion your framebuffer memory. You may have to divide up your framebuffer memory differently if you're rendering to one 32-bpp image than if you're rendering to two. And how you assign your framebuffer memory can affect how your fragment shader code works. After all, this framebuffer rendering memory may be directly accessible by the fragment shader during a rendering operation.
That is the basic idea of the render pass model: you are rendering to special framebuffer memory, of an indeterminate size. Every aspect of the render pass system's complexity is based on this conceptual model.
Subpasses are the part where you determine exactly which things you're rendering to at the moment. Because this affects framebuffer memory arrangement, graphics pipelines are always built by referring to a subpass of a render pass. Similarly, secondary command buffers that are to be executed within a subpass must provide the subpass they will be used within.
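Concretely, that subpass association shows up in two places (a sketch showing only the relevant fields; renderPass and secondaryCmd are assumed to exist, and the rest of the setup is omitted):

```cpp
// Sketch: both graphics pipelines and secondary command buffers are tied to a
// specific subpass of a specific render pass.
VkGraphicsPipelineCreateInfo pipelineInfo = {};
pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
// ... shader stages, vertex input, blend state, etc. ...
pipelineInfo.renderPass = renderPass; // the render pass this pipeline is compatible with
pipelineInfo.subpass = 0;             // the subpass index it will be used in

VkCommandBufferInheritanceInfo inherit = {};
inherit.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
inherit.renderPass = renderPass;
inherit.subpass = 0;

VkCommandBufferBeginInfo beginInfo = {};
beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
beginInfo.flags = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
beginInfo.pInheritanceInfo = &inherit;
vkBeginCommandBuffer(secondaryCmd, &beginInfo);
```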
When a render pass instance begins execution on a queue, it (conceptually) copies the attachment images we intend to render to into framebuffer rendering memory. At the end of the render pass, the data we render is copied back out to the attachment images.
During the execution of a render pass instance, the data for attachment images is considered "indeterminate". While the model says that we're copying into framebuffer rendering memory, Vulkan doesn't want to force implementations to actually copy stuff if they directly render to images.
As such, Vulkan merely states that no operation can access images that are being used as attachments, except for those which access the images as attachments. For example, you cannot read an attachment image as a texture. But you can read from it as an input attachment.
This is a conceptual description of the way tile-based renderers work. And this is the conceptual model that is the foundation of the Vulkan render pass architecture. Render targets are not accessible memory; they're special things that can only be accessed in special ways.
You can't "just" read from a G-buffer because, while you're rendering to that G-buffer, it exists in special framebuffer memory that isn't in the image yet.
Both features primarily exist for tile-based GPUs, which are common in mobile but, historically, uncommon on desktop computers. That's why DX12 doesn't have an equivalent, and Metal (iOS) does. Though both Nvidia's and AMD's recent architectures do a variant of tile-based rendering now also, and with the recent Windows-on-ARM PCs using Qualcomm chips (tile-based GPU), it will be interesting to see how DX12 evolves.
The benefit of render passes is that during pixel shading, you can keep the framebuffer data in on-chip memory instead of constantly reading and writing external memory. Caches help some, but without reordering pixel shading, the cache tends to thrash quite a bit since it's not large enough to store the entire framebuffer. A related benefit is you can avoid reading in previous framebuffer contents if you're just going to completely overwrite them anyway, and avoid writing out framebuffer contents at the end of the render pass if they're not needed after it's over. In many applications, tile-based GPUs never have to read and write depth buffer data or multisample data to or from external memory, which saves a lot of bandwidth and power.
Subpasses are an advanced feature that, in some cases, allow the driver to effectively merge multiple render passes into one. The goal and underlying mechanism is similar to the OpenGL ES Pixel Local Storage extension, but the API is a bit different in order to allow more GPU architectures to support it and to make it more extensible / future-proof. The classic example where this helps is with basic deferred shading: the first subpass writes out gbuffer data for each pixel, and later subpasses use that to light and shade pixels. Gbuffers can be huge, so keeping all of that on-chip and never having to read or write it to main memory is a big deal, especially on mobile GPUs which tend to be more bandwidth- and power-constrained.
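For the deferred-shading case, the subpass wiring might look roughly like this (a sketch showing only the subpass descriptions and the dependency; attachment descriptions, depth handling and pipeline setup are omitted):

```cpp
// Sketch: subpass 0 writes a G-buffer attachment; subpass 1 reads it back as
// an input attachment, so on a tiler the data can stay in tile memory.
VkAttachmentReference gbufferWrite = {};
gbufferWrite.attachment = 0;
gbufferWrite.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkAttachmentReference gbufferRead = {};
gbufferRead.attachment = 0;
gbufferRead.layout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

VkAttachmentReference lightingOut = {};
lightingOut.attachment = 1;
lightingOut.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;

VkSubpassDescription subpasses[2] = {};
subpasses[0].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[0].colorAttachmentCount = 1;
subpasses[0].pColorAttachments = &gbufferWrite;   // G-buffer pass

subpasses[1].pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS;
subpasses[1].inputAttachmentCount = 1;
subpasses[1].pInputAttachments = &gbufferRead;    // read the G-buffer per pixel
subpasses[1].colorAttachmentCount = 1;
subpasses[1].pColorAttachments = &lightingOut;    // lighting/shading pass

// The dependency tells the driver that subpass 1 consumes subpass 0's output,
// and BY_REGION allows the consumption to stay tile-local.
VkSubpassDependency dep = {};
dep.srcSubpass = 0;
dep.dstSubpass = 1;
dep.srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dep.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
```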

Does Vulkan allow draw commands in memory buffers?

I've mainly used directx in my 3d programming. I'm just learning Vulkan now.
Is this correct:
In Vulkan, a draw call (an operation that causes a group of primitives to be rendered, indexed or non-indexed) can only end up in a command stream by recording it with an API call while building that command stream. If you want to draw 3 objects using different vertex or index buffers (or offsets), you will, in the general case, make 3 API calls.
In d3d12, instead, the arguments for a draw call can come from a GPU memory buffer, filled in any of the ways buffers are filled, including using the GPU.
I'm aware of (complex) ways to essentially draw separate models as one batch, even on APIs older than DX12. And of course drawing repeated geometry without multiple draw calls is trivial.
But the straightforward "write draw commands into GPU memory like you write other data into GPU memory" feature is only available on DX12, correct?
Indirect drawing is a thing in Vulkan. It does require that a single vertex buffer contain all the data, but you don't need to draw all of the buffer in each call.
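For reference, a minimal sketch of an indirect draw (buffer creation is omitted; the mesh* values, commandBuffer and the various buffers are assumed to exist):

```cpp
// Sketch: the draw parameters live in a GPU-visible buffer created with
// VK_BUFFER_USAGE_INDIRECT_BUFFER_BIT, and could just as well be written by a
// compute shader instead of the CPU.
VkDrawIndexedIndirectCommand cmd = {};
cmd.indexCount = meshIndexCount;      // assumed per-mesh values
cmd.instanceCount = 1;
cmd.firstIndex = meshFirstIndex;
cmd.vertexOffset = meshVertexOffset;
cmd.firstInstance = 0;
// ... write 'cmd' (or an array of them) into indirectBuffer ...

// At record time the vertex/index buffers are bound as usual; only the draw
// parameters come from the buffer. drawCount > 1 requires the multiDrawIndirect
// feature.
VkDeviceSize vertexOffset = 0;
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &vertexOffset);
vkCmdBindIndexBuffer(commandBuffer, indexBuffer, 0, VK_INDEX_TYPE_UINT32);
vkCmdDrawIndexedIndirect(commandBuffer, indirectBuffer, 0 /*offset*/,
                         drawCount, sizeof(VkDrawIndexedIndirectCommand));
```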
There is an extension that allows you to build a set of drawing commands in GPU memory. It also allows binding different descriptor sets and vertex buffers between draws.

Best way to store animated vertex data

From what I understand there are several methods for storing and transferring vertex data to the GPU.
Using a temporary staging buffer and copying it to discrete GPU memory every frame
Using shared buffer (which is slow?) and just update the shared buffer every frame
Storing the staging buffer for each mesh permanently instead of recreating it every frame and copying it to the GPU
Which method is best for storing animating mesh data which changes rapidly?
It depends on the hardware and the memory types it advertises. Note that all of the following requires you to use vkGetBufferMemoryRequirements to check whether a memory type can support the buffer usages you need.
If the hardware advertises a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, then you should use that instead of staging. Now, you still need to double-buffer this, since you cannot write to data that the GPU is reading from, and you don't want to synchronize with the GPU unless the GPU is more than a frame behind. This is something you should also measure; your needs may require a triple buffer, so design your system to be flexible.
Note that some hardware has two different heaps that are DEVICE_LOCAL, but only one of them will have HOST_VISIBLE memory types for them. So pay attention to those cases.
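As a rough sketch of that first approach (frameIndex, vertexMemory, vertexBuffer, maxVertexBytes and the source vertexData are all placeholders; flush handling depends on whether the type is HOST_COHERENT):

```cpp
#include <cstring> // memcpy

// Sketch: per-frame vertex upload into a persistently mapped DEVICE_LOCAL +
// HOST_VISIBLE buffer, double-buffered so the CPU never writes the half the
// GPU is currently reading.
const VkDeviceSize sliceSize = maxVertexBytes;            // one frame's worth
const VkDeviceSize offset = (frameIndex % 2) * sliceSize; // ping-pong between halves

// Map once and keep the pointer around.
static void* mapped = nullptr;
if (!mapped)
    vkMapMemory(device, vertexMemory, 0, 2 * sliceSize, 0, &mapped);

memcpy(static_cast<char*>(mapped) + offset, vertexData, vertexBytes);
// If the memory type is not HOST_COHERENT, a vkFlushMappedMemoryRanges call is
// needed here before the GPU reads the data.

// Bind the half we just wrote for this frame's draws.
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &vertexBuffer, &offset);
```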
If there is no such memory type (or if the memory type doesn't support the buffer usages you need), then you need to profile this. The two alternatives are:
Staging (via a dedicated transfer queue, where available) to a DEVICE_LOCAL memory type, where the data eventually gets used.
Directly using a non-DEVICE_LOCAL memory type.
Note that both of these require buffering, since you want to avoid synchronization as much as possible. Staging through a transfer queue will also require a semaphore, since you need to make sure that the graphics queue doesn't try to use the memory until the transfer queue is done with it. It also means you need to deal with resource sharing between queues.
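A sketch of the submission side of the transfer-queue path (command buffer recording, fences for CPU pacing, and the queue-family ownership transfer barriers are omitted; transferCmd, graphicsCmd, the transferDone semaphore and frameFence are assumed to have been created elsewhere):

```cpp
// Sketch: submit the copy on a dedicated transfer queue and make the graphics
// queue wait on it with a semaphore.
VkSubmitInfo copySubmit = {};
copySubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
copySubmit.commandBufferCount = 1;
copySubmit.pCommandBuffers = &transferCmd;   // records vkCmdCopyBuffer staging -> device-local
copySubmit.signalSemaphoreCount = 1;
copySubmit.pSignalSemaphores = &transferDone;
vkQueueSubmit(transferQueue, 1, &copySubmit, VK_NULL_HANDLE);

VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
VkSubmitInfo drawSubmit = {};
drawSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
drawSubmit.waitSemaphoreCount = 1;
drawSubmit.pWaitSemaphores = &transferDone;  // don't read the vertices until the copy is done
drawSubmit.pWaitDstStageMask = &waitStage;
drawSubmit.commandBufferCount = 1;
drawSubmit.pCommandBuffers = &graphicsCmd;
vkQueueSubmit(graphicsQueue, 1, &drawSubmit, frameFence);
```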
Personally though, I would try to avoid CPU animated vertex data whenever possible. Vulkan-capable GPUs are perfectly capable of doing any animating themselves. GPUs have been doing bone weighted skinning (even dual-quaternion-based) for over a decade now. Even vertex palette animation is something the GPU can do; summing up the various different vertices to reach the final answer. So scenes with lots of CPU-generated vertex data should be relatively rare.