A rarely mentioned Vulkan function "vkCmdUpdateBuffer()", what is it used for?

This seems to be a simple Vulkan API question, but I really cannot find an answer after searching the Internet.
I noticed there is a Vulkan function:
void vkCmdUpdateBuffer(
    VkCommandBuffer commandBuffer,
    VkBuffer dstBuffer,
    VkDeviceSize dstOffset,
    VkDeviceSize dataSize,
    const void* pData);
At first glance, I thought it could be used to record commands into the command buffer, since it has the vkCmd prefix in its name, but the documentation says that
vkCmdUpdateBuffer is only allowed outside of a render pass. This command is treated as “transfer” operation, for the purposes of synchronization barriers.
So I started thinking that it is a convenience function that wraps the host-to-device data transfer, like using memcpy() to copy the data from the host to the device.
Then my question is: why is there NOT a single Vulkan sample or tutorial (I have searched all of them) using vkCmdUpdateBuffer() instead of manually copying data with memcpy()? Did I understand it wrong?

All vkCmd* functions generate commands into a command buffer. This one is no exception. It is a transfer command, and like most transfer commands, you don't get to do them within a render pass. But there are plenty of command buffer generating commands that don't work in render passes.
Normally Vulkan memory transfer operations only happen between device memory. The typical mechanism for the host to put something in device memory is to write to a mapped pointer. But by definition, that requires that the destination memory be mappable. So if you want to write something to non-mappable memory, you have to copy it to mappable memory, then do a transfer operation between the mappable memory to the non-mappable memory via vkCmdCopy* functions.
And that's fine if you're doing a bunch of transfers all at once. You can copy a bunch of stuff into mapped memory, then submit a batch containing all of the copy operations to copy the data into the appropriate locations.
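The batching idea above can be sketched as a small planner that packs several pending updates into one staging buffer and emits the copy regions you would record with vkCmdCopyBuffer. This is an illustrative C++ sketch, not real Vulkan code; the struct and function names are made up, and the fixed 4-byte alignment is a simplification (real code should honor the device's alignment requirements):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the shape of VkBufferCopy for illustration; names are assumptions.
struct CopyRegion {
    uint64_t srcOffset;  // offset into the shared staging buffer
    uint64_t dstOffset;  // offset into the destination device-local buffer
    uint64_t size;
};

struct PendingUpdate {
    uint64_t dstOffset;
    uint64_t size;
};

// Packs pending updates into one staging buffer, returning the regions
// to record in a single batched copy submission.
std::vector<CopyRegion> packIntoStaging(const std::vector<PendingUpdate>& updates,
                                        uint64_t stagingCapacity) {
    std::vector<CopyRegion> regions;
    uint64_t cursor = 0;
    for (const PendingUpdate& u : updates) {
        uint64_t aligned = (cursor + 3) & ~uint64_t(3);  // round up to 4 bytes
        if (aligned + u.size > stagingCapacity)
            break;  // real code would flush this batch and start a new one
        regions.push_back({aligned, u.dstOffset, u.size});
        cursor = aligned + u.size;
    }
    return regions;
}
```

After memcpy-ing each update's bytes to its assigned srcOffset in the mapped staging memory, the whole regions vector becomes one vkCmdCopyBuffer call.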
But sometimes, you're just updating a small piece of device memory. If it's not mappable, then that's a lot of work to do just to get a few kilobytes of data to the GPU. In that case, vkCmdUpdateBuffer may be the better choice, since it can "directly" copy from CPU memory to any device memory.
I say "directly" because that's obviously not what it's doing. It's really doing the same thing you would have done, except it's doing it within the command buffer. You would have copied your CPU data into GPU mappable memory, then created a command that copies from that mappable memory into non-mappable memory.
vkCmdUpdateBuffer does the exact same thing. It copies the data from the pointer/size you give it into mappable memory provided by the command buffer itself (which is why it has an upper limit of 64KB). This copy happens immediately, just as it would if you did a memcpy yourself, so when this function returns, you can do whatever you want with the pointer you gave it. Then it creates a command in the command buffer that copies from that mappable memory to the destination memory location.
The documentation for this function explicitly gives warnings about using it for larger transfers. That is, it tells you not to do that. This is for quick, small, one-shot updates of unmappable memory. Nothing more.
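Those limits come straight from the spec's valid-usage rules for vkCmdUpdateBuffer: dstOffset and dataSize must both be multiples of 4, and dataSize must be between 1 and 65536 bytes inclusive. A quick sketch of that check (the helper name is made up):

```cpp
#include <cassert>
#include <cstdint>

// Expresses vkCmdUpdateBuffer's valid-usage rules as a predicate:
// dstOffset and dataSize must be multiples of 4, and dataSize must
// be in the range [1, 65536].
bool canUseCmdUpdateBuffer(uint64_t dstOffset, uint64_t dataSize) {
    if (dataSize == 0 || dataSize > 65536) return false;
    if (dstOffset % 4 != 0 || dataSize % 4 != 0) return false;
    return true;
}
```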
That's one reason why tutorials don't talk about it: it's a highly special-case function that many novice users will try to use because it's easier than the explicit code. But in most cases, they should not be using it.

Related

How do I determine how much memory a MemoryStream object is using in VB.Net?

I'm handling a long-running process that does a lot of copying to a memory stream, in excess of 2 GB in size, so it can create an archive and upload it to Azure. The process is throwing an OutOfMemory exception at some point, and I'm trying to figure out what the upper limit is so I can prevent that, stop the process, upload a partial archive, and pick up where it left off.
But I'm not sure how to determine how much memory the stream is actually using. Right now, I'm using MemoryStream.Length, but I don't think that's the right approach.
So, how can I determine how much memory a MemoryStream is using in VB.Net?
In compliance with the Minimal, Reproducible Example requirements -
Sub Main(args As String())
    Dim stream = New MemoryStream
    stream.WriteByte(10)
    'What would I need to put here to determine the size in memory of stream?
End Sub
This is a known issue with the garbage collector for .Net Framework (I've heard .Net Core may have addressed this, but I haven't done a deep dive myself to confirm it).
What happens is you have an item with an internal (growing) buffer like a string, StringBuilder, List, MemoryStream, etc. As you write to the object, the buffer fills and at some point needs to be replaced. So behind the scenes a new buffer is allocated, and data is copied from the old to the new. At this point the old buffer becomes eligible for garbage collection.
Already we see one large inefficiency: copying from the old buffer to the new can be a significant cost as the object grows. Therefore, if you have an idea of your item's final size up front, you probably want to use a mechanism that lets you set that initial size. With a List<T>, for example, this means using the constructor overload that takes a Capacity... if you know it. This means the object never (or rarely) has to copy the data between buffers.
But we're not done yet. When the copy operation is finished, the garbage collector WILL appropriately reclaim the memory and return it to the operating system. However, there's more involved than just memory. There's also the virtual address space for your application's process. The virtual address space formerly occupied by that memory is NOT immediately released. It can be reclaimed through a process called "compaction", but there are situations where this just... doesn't happen. Most notably, memory on the Large Object Heap (LOH) is rarely-to-never compacted. Once your item eclipses a magic 85,000 byte threshold for the LOH, you're starting to build up holes in the virtual address space. Fill these buffers enough, and the virtual address table runs out of room, resulting in (drumroll please)... an OutOfMemoryException. This is the exception thrown, even though there may be plenty of memory available.
Since most of the items that create this scenario use a doubling algorithm for the underlying buffer, it's possible to generate these exceptions while using only a fraction of the virtual address space: the abandoned buffers from earlier doublings add up to roughly the size of the final buffer, so fragmentation can exhaust the address space well before the live data itself would.
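To see how much copying and garbage the doubling strategy produces, you can model it; this sketch (written in C++ purely for illustration, with an assumed initial capacity of 256 bytes and the simplification that the buffer is full at each resize) counts the bytes copied while growing to a target size:

```cpp
#include <cassert>
#include <cstdint>

// Models a MemoryStream-like doubling buffer: each time the buffer is
// outgrown, a new buffer twice the size is allocated and the old contents
// are copied over, leaving the old buffer as garbage (and, past ~85,000
// bytes, garbage on the Large Object Heap). Returns total bytes copied
// while growing to at least `target`. The 256-byte starting capacity is
// an assumption for illustration.
uint64_t bytesCopiedWhileGrowing(uint64_t target) {
    uint64_t capacity = 256;
    uint64_t copied = 0;
    while (capacity < target) {
        copied += capacity;  // old contents copied into the new buffer
        capacity *= 2;       // old buffer is abandoned
    }
    return copied;
}
```

Growing to 1 MB this way copies roughly another 1 MB of data through a dozen dead buffers, several of which land on the LOH; presizing the buffer skips all of that.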
To avoid this, be careful in how you use items with buffers that grow dynamically: set the full capacity up front when you can, or stream to a more-robust backing store (like a file) when you can't.

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert packed 3-channel image data to a 4-channel texture on the GPU. Is it a good idea, instead of copying the 3-channel image to the GPU before decoding it, to write it to a host-visible buffer and read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool are both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to allow you to make an SSBO in them).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
Should you use such memory types for this kind of image fiddling? You have to profile it to know. And you'll also have to take into account whether your application has ways to hide any DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.

How do I know when Vulkan isn't using memory anymore so I can overwrite it / reuse it?

When working with Vulkan, it's common, when creating a buffer such as a uniform buffer, to create multiple 'versions' of it, because with double buffering, for example, you don't know whether the graphics API is still drawing the last frame (using the memory you bound and instructed it to use in the previous loop). I've seen this done with uniform buffers but not with vertex, index, or image/texture buffers. Is this because uniform buffers are updated regularly and vertex buffers or images are not?
If you wanted to update an image or a vertex buffer, how would you go about it, given that you don't know whether the graphics API is still using it? Do you simply allocate new memory for that image/buffer and start anew, even if you just want to update a section of it? And if you do allocate a new buffer, when would you know it's safe to release the old one? Would, say, 5 frames into the future be OK? Or 2 seconds? After all, it could still be in use. How is this done?
given that you don't know whether the graphics API is still using it?
But you do know.
Vulkan doesn't arbitrarily use resources. It uses them exactly and only how your code tells it to use the resource. You created and submitted the commands that use those resources, so if you need to know when a resource is in use, it is you who must keep track of it and manage this.
You have to use API synchronization functions to follow the GPU's execution of commands.
If an action command uses some set of resources, then those resources are in use while that command is being executed. You have tools like events which can be used to stop subsequent commands from executing until some prior commands have finished. And events can tell when a particular command has finished, so that you'll know when those resources are no longer in use.
Semaphores have similar powers, but at the level of a batch of work. If a semaphore is signaled, then all of the commands in the batch that signaled it have completed and are no longer using the resources they use. Fences can be used for extremely coarse synchronization, at the level of a submit command.
You multi-buffer uniform data because the nature of uniform data is such that it typically needs to change every frame. If you have vertex buffers or images to change every frame, then you'll need to do the same thing with those.
For infrequent changes, you may want to have extra memory available so that you can just create new images or buffers, then delete the old ones when the memory is no longer in use. Or you may have to stall the CPU until the GPU has finished using those resources.
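The "delete the old ones when the memory is no longer in use" bookkeeping can be sketched as a deferred-deletion queue, driven by whatever fence or timeline-semaphore value tells you which frame the GPU has completed. The names and the plain-int resource handles below are stand-ins:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Each destroyed-but-maybe-in-use resource is tagged with the frame whose
// submission last used it, and is only really freed once the GPU is known
// to have completed that frame.
struct DeferredDeleter {
    struct Entry { int resource; uint64_t lastUsedFrame; };
    std::vector<Entry> pending;

    void scheduleDestroy(int resource, uint64_t lastUsedFrame) {
        pending.push_back({resource, lastUsedFrame});
    }

    // Called once per frame with the highest frame the GPU has finished
    // (learned from a fence or timeline semaphore). Returns the resources
    // that are now safe to destroy for real.
    std::vector<int> collect(uint64_t gpuCompletedFrame) {
        std::vector<int> freed;
        std::vector<Entry> stillPending;
        for (const Entry& e : pending) {
            if (e.lastUsedFrame <= gpuCompletedFrame) freed.push_back(e.resource);
            else stillPending.push_back(e);
        }
        pending = std::move(stillPending);
        return freed;
    }
};
```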

What is the DirectX 12 equivalent of Vulkan's "transient attachment"?

I have a compute shader which I'd like to output to an image/buffer that is meant to be intermediate storage between two pipelines: a compute pipeline and a graphics pipeline. The graphics pipeline is actually a "dummy", in that it does nothing apart from copy the contents of the intermediate buffer into a swapchain image. This is necessitated by the fact that DX12 deprecated the ability of compute pipelines to use UAVs to write directly into swapchain images.
I think the intermediate storage should be a "transient" attachment, in the Vulkan sense:
VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT specifies that the memory bound to this image will have been allocated with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (see Memory Allocation for more detail). This bit can be set for any image that can be used to create a VkImageView suitable for use as a color, resolve, depth/stencil, or input attachment.
This is explained in this article:
Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that begin in an uninitialized or cleared state at the beginning of a renderpass, are written by one or more subpasses, consumed by one or more subpasses and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments only lives within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead only ever living in cache. This saves bandwidth, reduces latency and improves power efficiency.
Does DirectX 12 have a similar image usage concept?
Direct3D 12 does not have this concept. And the reason for that limitation ultimately boils down to why transient allocation exists. TL;DR: It's not for doing the kind of thing you're trying to do.
Vulkan's render pass system exists for one purpose: to make tile-based renderers first-class citizens of the rendering system. TBRs do not fit well in OpenGL or D3D's framebuffer model. In both APIs, you can just swap framebuffers in and out whenever you want.
TBRs do not render to memory directly. They perform rendering operations into internal buffers, which are seeded from memory and then possibly written to memory after the rendering operation is complete. Switching rendered images whenever you want works against this structure, which is why TBR vendors have a list of things you're not supposed to do if you want high-performance in your OpenGL ES code.
Vulkan's render pass system is an abstraction of a TBR system. In the abstract model, the render pass system potentially reads data from the images in the frame buffer, then performs a bunch of subpasses on copies of this data, and at the end, potentially writes the updated data back out into the images. So from the outside of the process, it looks like you're rendering to the images, but you're not. To maintain this illusion, for the duration of a render pass, you can only use those framebuffer images in the way that the render pass model allows: as attachments.
Now consider deferred rendering. In deferred rendering, you render to g-buffers, which you then read in your lighting passes to generate the final image. Once you've generated the final image, you don't need those g-buffers anymore. In a regular GPU, that doesn't mean anything; because rendering goes directly to memory, those g-buffers must take up actual storage.
But consider how a TBR works. It does rendering into a single tile; in optimal cases, it processes all of the fragments for a single tile at once. Which means it goes through the geometry and lighting passes. For a TBR, the g-buffer is just a piece of scratch memory you use to get the final answer; it doesn't need to be read from memory or copied to memory.
In short, it doesn't need memory.
Enter lazily allocated memory and transient attachment images. They exist to allow TBRs to keep g-buffers in tile memory and never to have to allocate actual storage for them (or at least, it only happens if some runtime circumstance occurs that forces it, like shoving too much geometry at the GPU). And it only works within a render pass; if you end a render pass and have to use one of the g-buffers in another render pass, then the magic has to go away and the data has to touch actual storage.
The Vulkan API is very explicit about how narrow this use case is. You cannot bind a piece of lazily-allocated memory to an image that does not have the USAGE_TRANSIENT_ATTACHMENT flag set on it (or to a buffer of any kind). And you'll notice that it says "transient attachment", as in render pass attachments. It says this because transient attachments cannot be used for non-attachment usages (part of the valid usage tests for VkImageCreateInfo). At all.
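That valid-usage rule can be written down as a check. The flag values below mirror VkImageUsageFlagBits but are declared locally for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the corresponding VkImageUsageFlagBits values.
constexpr uint32_t USAGE_SAMPLED                  = 0x0004;
constexpr uint32_t USAGE_STORAGE                  = 0x0008;
constexpr uint32_t USAGE_COLOR_ATTACHMENT         = 0x0010;
constexpr uint32_t USAGE_DEPTH_STENCIL_ATTACHMENT = 0x0020;
constexpr uint32_t USAGE_TRANSIENT_ATTACHMENT     = 0x0040;
constexpr uint32_t USAGE_INPUT_ATTACHMENT         = 0x0080;

// The VkImageCreateInfo valid-usage rule cited above: if the transient
// flag is set, only attachment usages may accompany it.
bool isValidTransientUsage(uint32_t usage) {
    if (!(usage & USAGE_TRANSIENT_ATTACHMENT)) return true;  // rule doesn't apply
    const uint32_t attachmentOnly = USAGE_TRANSIENT_ATTACHMENT |
                                    USAGE_COLOR_ATTACHMENT |
                                    USAGE_DEPTH_STENCIL_ATTACHMENT |
                                    USAGE_INPUT_ATTACHMENT;
    return (usage & ~attachmentOnly) == 0;
}
```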
What you want to do is not the sort of thing that lazily allocated memory is made for. It can't work.
As for Direct3D 12, the API is not designed to run on mobile GPUs, and since only mobile GPUs are tile-based renderers (some recent desktop GPUs have TBR similarities, but are not full TBRs), it has no facilities designed explicitly for them. And thus, it has no need for lazily allocated memory or transient attachments.

Best way to store animated vertex data

From what I understand there are several methods for storing and transferring vertex data to the GPU.
Using a temporary staging buffer and copying it to discrete GPU memory every frame
Using shared buffer (which is slow?) and just update the shared buffer every frame
Storing the staging buffer for each mesh permanently instead of recreating it every frame and copying it to the GPU
Which method is best for storing animating mesh data which changes rapidly?
It depends on the hardware and the memory types it advertises. Note that all of the following requires you to use vkGetBufferMemoryRequirements to check whether the memory type can support the usages you need.
If hardware advertises a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, then you should use that instead of staging. Now, you still need to double-buffer this, since you cannot write to data that the GPU is reading from, and you don't want to synchronize with the GPU unless the GPU is over a frame late. This is something you should also measure; your GPU needs may require a triple buffer, so design your system to be flexible.
Note that some hardware has two different heaps that are DEVICE_LOCAL, but only one of them will have HOST_VISIBLE memory types for them. So pay attention to those cases.
If there is no such memory type (or if the memory type doesn't support the buffer usages you need), then you need to profile this. The two alternatives are:
Staging (via a dedicated transfer queue, where available) to a DEVICE_LOCAL memory type, where the data eventually gets used.
Directly using a non-DEVICE_LOCAL memory type.
Note that both of these require buffering, since you want to avoid synchronization as much as possible. Staging through a transfer queue will also require a semaphore, since you need to make sure that the graphics queue doesn't try to use the memory until the transfer queue is done with it. It also means you need to deal with resource sharing between queues.
Personally though, I would try to avoid CPU-animated vertex data whenever possible. Vulkan-capable GPUs are perfectly capable of doing any animation themselves. GPUs have been doing bone-weighted skinning (even dual-quaternion-based) for over a decade now. Even vertex palette animation is something the GPU can do, summing up the various vertices to reach the final answer. So scenes with lots of CPU-generated vertex data should be relatively rare.