Does it make sense to read from host memory in a compute shader to save a copy? - vulkan

This answer suggests using a compute shader to convert packed 3-channel image data to a 4-channel texture on the GPU. Instead of copying the 3-channel image to the GPU before decoding it, is it a good idea to write it to a host-visible buffer and read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.

Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether the host-visible memory types that you can put an SSBO into are also device-local. On integrated GPUs that have only one memory pool, that memory is both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to allow you to make an SSBO in it).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
Should you use such memory types for doing these kinds of image fiddling? You have to profile them to know. And you'll also have to take into account whether your application has ways to hide any DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.
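As a rough illustration of the checks described above, here is a minimal sketch in C of picking a memory type for a storage buffer that the compute shader reads directly. It assumes you have already copied the propertyFlags of each memory type into a plain array and taken memoryTypeBits from vkGetBufferMemoryRequirements. The MEM_* constants are local stand-ins that mirror the Vulkan flag values, and find_memory_type and pick_direct_read_type are hypothetical helper names, not part of the API.

```c
#include <stdint.h>

/* Local stand-ins mirroring VkMemoryPropertyFlagBits values; real code
   would use the constants from <vulkan/vulkan.h>. */
enum {
    MEM_DEVICE_LOCAL  = 0x1, /* VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT  */
    MEM_HOST_VISIBLE  = 0x2, /* VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT  */
    MEM_HOST_COHERENT = 0x4  /* VK_MEMORY_PROPERTY_HOST_COHERENT_BIT */
};

/* Returns the index of the first memory type that is allowed by type_bits
   (memoryTypeBits for the storage buffer) and has every flag in `required`,
   or -1 if there is none. */
int find_memory_type(uint32_t type_bits, const uint32_t *type_flags,
                     uint32_t type_count, uint32_t required)
{
    for (uint32_t i = 0; i < type_count; ++i) {
        int allowed = (int)((type_bits >> i) & 1u);
        if (allowed && (type_flags[i] & required) == required)
            return (int)i;
    }
    return -1;
}

/* Hypothetical policy: prefer host-visible + device-local memory, then plain
   host-visible memory. A result of -1 means the buffer cannot live in
   host-visible memory at all, so the caller falls back to a staging copy. */
int pick_direct_read_type(uint32_t type_bits, const uint32_t *type_flags,
                          uint32_t type_count)
{
    int i = find_memory_type(type_bits, type_flags, type_count,
                             MEM_HOST_VISIBLE | MEM_DEVICE_LOCAL);
    if (i >= 0)
        return i;
    return find_memory_type(type_bits, type_flags, type_count,
                            MEM_HOST_VISIBLE);
}
```

Which of the two host-visible options is actually faster than a staged copy is still something only profiling can tell you.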

Related

Create MPSImage from pre-allocated data pointer

In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the Docs it seems like there's no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve that with no-copy from pre-allocated pointers?
Thanks!
You can allocate an MTLTexture backed by an MTLBuffer created with the bytesNoCopy constructor, and then create an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind though that in this case the texture won't be in an optimal layout for GPU access, so this is a memory vs. performance trade-off.
Also keep in mind that the bytesNoCopy constructor only accepts addresses aligned to a virtual memory page boundary, and the driver needs to make sure that the memory is resident when you submit a command buffer that uses it.
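The alignment requirement can be checked before handing a pointer to Metal. The following is a small sketch in plain C, not Metal code. It assumes the page size is a power of two and is passed in by the caller (for example from getpagesize()), and it also assumes the buffer length should cover whole pages, which is my reading of the requirement rather than something stated above. Both helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when both the start address and the length are multiples of
   page_size (which must be a power of two), 0 otherwise. */
int is_page_aligned(const void *ptr, size_t length, size_t page_size)
{
    uintptr_t mask = (uintptr_t)page_size - 1;
    return ((uintptr_t)ptr & mask) == 0 && ((uintptr_t)length & mask) == 0;
}

/* Rounds length up to the next multiple of page_size (power of two), for
   sizing the pre-allocated chunk so that it covers whole pages. */
size_t round_up_to_page(size_t length, size_t page_size)
{
    return (length + page_size - 1) & ~(page_size - 1);
}
```

If the pre-allocated chunks fail this check, the no-copy path is not available for them and a copy into a texture is needed after all.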

Vulkan: Uploading non-pow-of-2 texture data with vkCmdCopyBufferToImage

There is a similar thread (Loading non-power-of-two textures in Vulkan) but it pertains to updating data in a host-visible mapped region.
I wanted to use the fully-fledged function vkCmdCopyBufferToImage to copy data from a host-visible buffer to a device-local image. My image has dimensions that are not powers of 2 (1280x720, to be specific).
When doing so, I've seen that it works fine on NVIDIA and Intel, but not on AMD, where the image becomes "distorted", which indicates a problem with rowPitch/padding.
My host-visible buffer is tightly packed so bufferRowLength and bufferImageHeight are the same as imageExtent.width and imageExtent.height.
Shouldn't this function cater for non-power-of-2 textures? Or maybe I'm doing it wrong?
I could implement a workaround with a compute shader but I thought this function's purpose was to be generic. Also, the downside of using a compute shader is that the copy operation could not be performed on a transfer-only queue.
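For reference, here is a sketch of how the specification addresses the buffer side of this copy for uncompressed formats. bufferRowLength and bufferImageHeight are counted in texels, not bytes, and a value of 0 means "tightly packed according to imageExtent". The function name and the flat parameter list are hypothetical. The real values live in VkBufferImageCopy.

```c
#include <stdint.h>

/* Byte offset in the source buffer of texel (x, y, z), following the
   addressing rule for uncompressed formats:
   bufferOffset + ((z * imageHeight + y) * rowLength + x) * texelSize. */
uint64_t buffer_texel_offset(uint64_t buffer_offset,
                             uint32_t buffer_row_length,
                             uint32_t buffer_image_height,
                             uint32_t extent_width,
                             uint32_t extent_height,
                             uint32_t texel_size,
                             uint32_t x, uint32_t y, uint32_t z)
{
    /* A value of 0 means the rows/slices are tightly packed. */
    uint64_t row_length   = buffer_row_length   ? buffer_row_length   : extent_width;
    uint64_t image_height = buffer_image_height ? buffer_image_height : extent_height;
    return buffer_offset
         + (((uint64_t)z * image_height + y) * row_length + x) * texel_size;
}
```

For a tightly packed 1280x720 image with 4-byte texels, row 1 starts at byte 5120 whether bufferRowLength is 0 or 1280, so power-of-two dimensions play no part in this rule.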

Is there a way to map a host-cached Vulkan buffer to a specific memory location?

Vulkan is able to import host memory using VkImportMemoryHostPointerInfoEXT. I queried the supported memory types for VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT but the only kind of memory that was available for it was coherent, which does not work for my use case. The memory needs to use explicit invalidations/flushes for performance reasons. So really, I don't want the API to allocate any host-side memory, I just want to tell it the base address that the buffer should upload from/download to. Otherwise I have to use intermediate copies. Using the address returned by vkMapMemory for the host-side work is not desirable for my use-case.
If the Vulkan implementation does not allow you to import memory allocations as "CACHED", then you can't force it to do so. The API provides the opportunity for the implementation to advertise the ability to import your allocations as "CACHED", but the implementation explicitly refused to do it.
Which probably means that it can't. And you can't make the implementation do something it can't do.
So if you have some API that created and manipulates some memory (which cannot use memory provided by someone else), and the Vulkan implementation won't allow reading from that memory unless it is allowed to remove the cached nature of the allocation, and you need CPU caching of that memory, then you're going to have to fall back on memcpy.
I want to mirror memory between the CPU and GPU so that I can access it from either without an implicit PCI-e bus transfer.
If the GPU is discrete, that's impossible. In a discrete GPU setup, the GPU and the CPU have separate local memory pools, and access to either pool from the other requires some form of PCIe transfer operation. Vulkan lets you pick which one is going to have slower access, but one of them will have slower access to the memory.
If the GPU is integrated, then typically there is only one memory pool and one memory type for it. That type will be both local and coherent (and probably cached too), which represents fast access from both devices.
Whether you use VkImportMemoryHostPointerInfoEXT or vkMapMemory on memory from a non-DEVICE_LOCAL_BIT heap, you will typically get a COHERENT memory type.
That is because conventional host heap memory from malloc in C is naturally coherent (and CPUs typically have automatic cache-coherency mechanisms). There is no cflush() or cinvalidate() in C.
There is no reason for implicit PCI-e transfers when reading or writing such memory from the host side. Of course, a dedicated GPU has to read it somehow, so there will be bus transfers when the device tries to access the memory. Otherwise you need a separate allocation in a DEVICE_LOCAL_BIT heap, and you transfer data between the two explicitly via vkCmdCopy* to keep them the same.
Actual UMA architectures could have a non-COHERENT memory type. But their memory heap is always advertised as DEVICE_LOCAL_BIT (even if it is the main memory).
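When a non-COHERENT memory type is available and you do explicit flushes and invalidations, the ranges passed to vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges have to respect the device's nonCoherentAtomSize limit. The following is a minimal sketch of widening a dirty byte range to satisfy that. FlushRange and align_flush_range are hypothetical names. In real code the result would go into a VkMappedMemoryRange.

```c
#include <stdint.h>

typedef struct {
    uint64_t offset; /* multiple of atom_size */
    uint64_t size;   /* multiple of atom_size, or runs to the end of the allocation */
} FlushRange;

/* Widens [offset, offset + size) so that it starts on an atom boundary and
   ends either on an atom boundary or at the end of the allocation. */
FlushRange align_flush_range(uint64_t offset, uint64_t size,
                             uint64_t atom_size, uint64_t allocation_size)
{
    uint64_t begin = offset - (offset % atom_size); /* round start down */
    uint64_t end   = offset + size;
    uint64_t rem   = end % atom_size;
    if (rem != 0)
        end += atom_size - rem;                     /* round end up */
    if (end > allocation_size)
        end = allocation_size;                      /* clamp to the allocation */
    FlushRange r = { begin, end - begin };
    return r;
}
```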

A rarely mentioned Vulkan function "vkCmdUpdateBuffer()", what is it used for?

This seems to be a simple Vulkan API question, but I really cannot find an answer after searching the Internet.
I noticed there is a Vulkan function:
void vkCmdUpdateBuffer(
    VkCommandBuffer commandBuffer,
    VkBuffer dstBuffer,
    VkDeviceSize dstOffset,
    VkDeviceSize dataSize,
    const void* pData);
At first glance, I thought it could be used to record into the command buffer, since it has the prefix vkCmd in its name, but the documentation says that
vkCmdUpdateBuffer is only allowed outside of a render pass. This command is treated as “transfer” operation, for the purposes of synchronization barriers.
So I started thinking that it is a convenience function that wraps the buffer data transfer operation, like using memcpy() to copy the data from the host to the device.
Then my question is: why is there NOT a single Vulkan sample / tutorial (I have searched all of them) that uses vkCmdUpdateBuffer() instead of manually copying data with memcpy()? Did I understand it wrong?
All vkCmd* functions generate commands into a command buffer. This one is no exception. It is a transfer command, and like most transfer commands, you don't get to do them within a render pass. But there are plenty of command buffer generating commands that don't work in render passes.
Normally Vulkan memory transfer operations only happen between device memory. The typical mechanism for the host to put something in device memory is to write to a mapped pointer. But by definition, that requires that the destination memory be mappable. So if you want to write something to non-mappable memory, you have to copy it to mappable memory, then do a transfer operation between the mappable memory to the non-mappable memory via vkCmdCopy* functions.
And that's fine if you're doing a bunch of transfers all at once. You can copy a bunch of stuff into mapped memory, then submit a batch containing all of the copy operations to copy the data into the appropriate locations.
But sometimes, you're just updating a small piece of device memory. If it's not mappable, then that's a lot of work to do just to get a few kilobytes of data to the GPU. In that case, vkCmdUpdateBuffer may be the better choice, since it can "directly" copy from CPU memory to any device memory.
I say "directly" because that's obviously not what it's doing. It's really doing the same thing you would have done, except it's doing it within the command buffer. You would have copied your CPU data into GPU mappable memory, then created a command that copies from that mappable memory into non-mappable memory.
vkCmdUpdateBuffer does the exact same thing. It copies the data from the pointer/size you give it into mappable memory, which is provided by the command buffer itself (this is why it has an upper limit of 64KB). This copy happens immediately, just as it would have if you did a memcpy, so when this function returns, you can do whatever you want with the pointer you gave it. Then it creates a command in the command buffer that copies from the mappable memory in the command buffer to the destination memory location.
The documentation for this function explicitly gives warnings about using it for larger transfers. That is, it tells you not to do that. This is for quick, small, one-shot updates of unmappable memory. Nothing more.
That's one reason why tutorials don't talk about it: it's a highly special-case function that many novice users will try to use because it's easier than the explicit code. But in most cases, they should not be using it.
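To make the "small, one-shot update" restriction concrete, here is a minimal sketch of a guard you might put in front of vkCmdUpdateBuffer. It encodes the limits the specification places on the call: dataSize must be greater than 0 and at most 65536 bytes, and both dstOffset and dataSize must be multiples of 4. The helper name and the macro are hypothetical.

```c
#include <stdint.h>

/* Upper limit on dataSize for vkCmdUpdateBuffer, in bytes. */
#define UPDATE_BUFFER_MAX_BYTES 65536u

/* Returns 1 if an update of data_size bytes at dst_offset satisfies the size
   and alignment rules of vkCmdUpdateBuffer, 0 if the caller should use a
   staging buffer and vkCmdCopyBuffer instead. */
int fits_update_buffer(uint64_t dst_offset, uint64_t data_size)
{
    return data_size > 0
        && data_size <= UPDATE_BUFFER_MAX_BYTES
        && (dst_offset % 4) == 0
        && (data_size % 4) == 0;
}
```

Passing this check only means the call is valid. Whether it is the better choice for a given update is still a judgment about size and frequency.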

Best way to store animated vertex data

From what I understand there are several methods for storing and transferring vertex data to the GPU.
1. Using a temporary staging buffer and copying it to discrete GPU memory every frame
2. Using a shared buffer (which is slow?) and just updating the shared buffer every frame
3. Storing the staging buffer for each mesh permanently, instead of recreating it every frame, and copying it to the GPU
Which method is best for storing animating mesh data which changes rapidly?
It depends on the hardware and the memory types it advertises. Note that all of the following requires you to use vkGetBufferMemoryRequirements to check to see if the memory type can support the usages you need.
If hardware advertises a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, then you should use that instead of staging. Now, you still need to double-buffer this, since you cannot write to data that the GPU is reading from, and you don't want to synchronize with the GPU unless the GPU is over a frame late. This is something you should also measure; your GPU needs may require a triple buffer, so design your system to be flexible.
Note that some hardware has two different heaps that are DEVICE_LOCAL, but only one of them will have HOST_VISIBLE memory types for them. So pay attention to those cases.
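One common way to get the double or triple buffering mentioned above is to allocate a single buffer large enough for several frames and write each frame's vertices into its own slice. Here is a minimal sketch of the offset arithmetic, with the number of frames in flight left configurable. The function names are hypothetical, and `alignment` stands for whatever offset alignment your usage requires.

```c
#include <stdint.h>

/* Rounds value up to the next multiple of alignment (alignment > 0). */
uint64_t align_up(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) / alignment * alignment;
}

/* Byte offset of the slice that frame_number writes into. Frames cycle
   through frames_in_flight slices, so a slice is reused only after that
   many frames have been submitted. */
uint64_t frame_slice_offset(uint64_t frame_number, uint32_t frames_in_flight,
                            uint64_t bytes_per_frame, uint64_t alignment)
{
    uint64_t stride = align_up(bytes_per_frame, alignment);
    return (frame_number % frames_in_flight) * stride;
}

/* Total size of the buffer that holds all slices. */
uint64_t ring_buffer_size(uint32_t frames_in_flight, uint64_t bytes_per_frame,
                          uint64_t alignment)
{
    return (uint64_t)frames_in_flight * align_up(bytes_per_frame, alignment);
}
```

The arithmetic alone does not make reuse safe. Before writing into a slice again, you still need to know (for example through a fence) that the GPU has finished the frame that last read it.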
If there is no such memory type (or if the memory type doesn't support the buffer usages you need), then you need to profile this. The two alternatives are:
1. Staging (via a dedicated transfer queue, where available) to a DEVICE_LOCAL memory type, where the data eventually gets used.
2. Directly using a non-DEVICE_LOCAL memory type.
Note that both of these require buffering, since you want to avoid synchronization as much as possible. Staging through a transfer queue will also require a semaphore, since you need to make sure that the graphics queue doesn't try to use the memory until the transfer queue is done with it. It also means you need to deal with resource sharing between queues.
Personally though, I would try to avoid CPU animated vertex data whenever possible. Vulkan-capable GPUs are perfectly capable of doing any animating themselves. GPUs have been doing bone weighted skinning (even dual-quaternion-based) for over a decade now. Even vertex palette animation is something the GPU can do; summing up the various different vertices to reach the final answer. So scenes with lots of CPU-generated vertex data should be relatively rare.