Create MPSImage from pre-allocated data pointer - objective-c

In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the docs, it seems there's no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve that with no-copy from pre-allocated pointers?
Thanks!

You can allocate an MTLTexture backed by an MTLBuffer created with the no-copy constructor (newBufferWithBytesNoCopy:length:options:deallocator:), and then create an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind, though, that in this case the texture won't be in an optimal layout for GPU access, so this is a memory vs. performance trade-off.
Also keep in mind that the no-copy constructor only accepts addresses aligned to a virtual memory page boundary, and that the driver needs to make sure the memory is resident when you submit a command buffer that uses it.
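The page-alignment requirement above can be satisfied on the allocation side. Below is a minimal C sketch, assuming a POSIX system; alloc_page_aligned, round_up_to_page and is_page_aligned are hypothetical helper names, not part of Metal. It also pads the length to a whole number of pages, which I believe the no-copy constructor expects as well, but treat that as an assumption to check against the Metal documentation.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Round a byte count up to the next multiple of the page size. */
size_t round_up_to_page(size_t length, size_t page_size)
{
    return (length + page_size - 1) / page_size * page_size;
}

/* Returns 1 if ptr sits exactly on a page boundary, 0 otherwise. */
int is_page_aligned(const void *ptr, size_t page_size)
{
    return ((uintptr_t)ptr % page_size) == 0;
}

/* Hypothetical helper: allocate an image block that a no-copy buffer
 * could later wrap. The padded length is written to *out_length.
 * Returns NULL on failure; release the block with free(). */
void *alloc_page_aligned(size_t length, size_t *out_length)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t padded = round_up_to_page(length, page_size);
    void *ptr = NULL;

    if (posix_memalign(&ptr, page_size, padded) != 0)
        return NULL;

    *out_length = padded;
    return ptr;
}
```

If the existing pre-allocated chunks cannot be reallocated this way, is_page_aligned at least lets you detect up front which of them qualify for the no-copy path and which need a copy.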

Related

Can you transfer directly to an image without using an intermediary buffer

If I've allocated an image in memory that IS NOT host-visible then I get a staging buffer that IS host-visible so I can write to it. I memcpy into that buffer, and then I do a vkCmdCopyBufferToImage.
However, let's say we're running on hardware where the device-local image memory is also host-visible. Is it more efficient and better to just memcpy into this image memory? What image layout does the image need to be in if memcpying straight into it? Transfer destination? How would this work with mip levels? In the copy-buffer-to-image method you specify each mip level, but how do you do this if memcpying in? Or do you just not, and instead do the extra copy to the host-visible staging buffer followed by a copy buffer to image?
VK_IMAGE_TILING_OPTIMAL arrangement is implementation-dependent and unexposed.
VK_IMAGE_TILING_LINEAR arrangement is whatever vkGetImageSubresourceLayout says it is.
The device-local, host-visible memory on current desktop GPUs has a specific purpose. In any case, you wouldn't have access to most of the GPU's memory capacity this way.
If you do it the right way™, then there is already only one transfer, and the memcpy is unnecessary. Either you build your linear image directly in Vulkan's memory, or you import your arbitrary host memory into Vulkan with VK_EXT_external_memory_host.
"In what image layout does the image need to be in if memcpying straight into it?"
For host writes, the image must be in either VK_IMAGE_LAYOUT_PREINITIALIZED or VK_IMAGE_LAYOUT_GENERAL.
"How would this work with mip levels?"
vkGetImageSubresourceLayout() fills in pLayout->offset and pLayout->size, which tell you where each subresource starts and how many bytes it occupies.
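To make the mip-level point concrete, here is a rough C sketch of writing one mip level into mapped linear image memory. SubresourceLayout is a stand-in that mirrors only the offset, size and rowPitch fields of VkSubresourceLayout, and write_mip_level is a hypothetical helper; in real code the values would come from vkGetImageSubresourceLayout() and the pointer from vkMapMemory().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the VkSubresourceLayout fields used in this sketch. */
typedef struct {
    uint64_t offset;   /* byte offset of the subresource in the image memory */
    uint64_t size;     /* bytes occupied by the subresource */
    uint64_t rowPitch; /* bytes between the starts of consecutive rows */
} SubresourceLayout;

/* Copy tightly packed source rows into one mip level of a linear image.
 * Rows in the destination are rowPitch bytes apart, which may be larger
 * than the packed row size. Returns 0 on success, -1 if the data would
 * not fit inside the subresource. */
int write_mip_level(uint8_t *mapped, const SubresourceLayout *layout,
                    const uint8_t *src, uint32_t width, uint32_t height,
                    uint32_t bytes_per_texel)
{
    size_t src_row = (size_t)width * bytes_per_texel;

    if (src_row > layout->rowPitch)
        return -1;
    if (height > 0 &&
        (uint64_t)(height - 1) * layout->rowPitch + src_row > layout->size)
        return -1;

    uint8_t *dst = mapped + layout->offset;
    for (uint32_t y = 0; y < height; ++y)
        memcpy(dst + (size_t)y * layout->rowPitch,
               src + (size_t)y * src_row, src_row);
    return 0;
}
```

Each mip level is a separate subresource, so you would query the layout once per level and call the helper with that level's width and height.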

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert from packed 3-channel image data to a 4-channel texture on the GPU. Is it a good idea to, instead of copying the 3 channel image to the GPU before decoding it, write it to a host visible buffer, then read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool are both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to allow you to make an SSBO in them).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
Should you use such memory types for doing these kinds of image fiddling? You have to profile them to know. And you'll also have to take into account whether your application has ways to hide any DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.
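The memory-type questions above boil down to a small search. The following is a minimal C sketch, not tied to any real device: the MEM_* constants use the same numeric values as the corresponding VkMemoryPropertyFlagBits, type_flags stands in for the propertyFlags of each entry in VkPhysicalDeviceMemoryProperties, and allowed_type_bits stands in for VkMemoryRequirements::memoryTypeBits of the buffer you want to use as an SSBO. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same numeric values as the Vulkan memory property bits they mirror. */
enum {
    MEM_DEVICE_LOCAL  = 0x1,
    MEM_HOST_VISIBLE  = 0x2,
    MEM_HOST_COHERENT = 0x4
};

/* Return the first memory type index that the resource allows and that
 * has all of the required property bits, or -1 if there is none. */
int find_memory_type(const uint32_t *type_flags, uint32_t type_count,
                     uint32_t allowed_type_bits, uint32_t required)
{
    for (uint32_t i = 0; i < type_count && i < 32; ++i) {
        int allowed = (int)((allowed_type_bits >> i) & 1u);
        if (allowed && (type_flags[i] & required) == required)
            return (int)i;
    }
    return -1;
}

/* Prefer memory that is both device-local and host-visible; otherwise
 * fall back to plain host-visible memory. *is_device_local tells the
 * caller which case it got, so it can pick a code path (or profile). */
int pick_host_writable_type(const uint32_t *type_flags, uint32_t type_count,
                            uint32_t allowed_type_bits, int *is_device_local)
{
    int index = find_memory_type(type_flags, type_count, allowed_type_bits,
                                 MEM_DEVICE_LOCAL | MEM_HOST_VISIBLE);
    *is_device_local = (index >= 0);
    if (index < 0)
        index = find_memory_type(type_flags, type_count, allowed_type_bits,
                                 MEM_HOST_VISIBLE);
    return index;
}
```

A return value of -1 corresponds to the case described above where the implementation does not let you place the buffer in any host-visible memory type, so the staging-copy path is the only option.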

A rarely mentioned Vulkan function "vkCmdUpdateBuffer()", what is it used for?

This seems to be a simple Vulkan API question, but I really cannot find an answer after searching the Internet.
I noticed there is a Vulkan function:
void vkCmdUpdateBuffer(
    VkCommandBuffer commandBuffer,
    VkBuffer        dstBuffer,
    VkDeviceSize    dstOffset,
    VkDeviceSize    dataSize,
    const void*     pData);
At first glance, I thought it was used to record into the command buffer, since it has the prefix vkCmd in its name, but the documentation says that
vkCmdUpdateBuffer is only allowed outside of a render pass. This command is treated as “transfer” operation, for the purposes of synchronization barriers.
So I started thinking that it is a convenience function that wraps the buffer data transfer, like using memcpy() to copy the data from the host to the device.
Then my question is: why is there NOT a single Vulkan sample or tutorial (I have searched all of them) that uses vkCmdUpdateBuffer() instead of manually copying data with memcpy()? Did I understand it wrong?
All vkCmd* functions generate commands into a command buffer. This one is no exception. It is a transfer command, and like most transfer commands, you don't get to do them within a render pass. But there are plenty of command buffer generating commands that don't work in render passes.
Normally, Vulkan memory transfer operations only happen between regions of device memory. The typical mechanism for the host to put something in device memory is to write to a mapped pointer. But by definition, that requires that the destination memory be mappable. So if you want to write something to non-mappable memory, you have to copy it to mappable memory, then do a transfer operation from the mappable memory to the non-mappable memory via the vkCmdCopy* functions.
And that's fine if you're doing a bunch of transfers all at once. You can copy a bunch of stuff into mapped memory, then submit a batch containing all of the copy operations to copy the data into the appropriate locations.
But sometimes, you're just updating a small piece of device memory. If it's not mappable, then that's a lot of work to do just to get a few kilobytes of data to the GPU. In that case, vkCmdUpdateBuffer may be the better choice, since it can "directly" copy from CPU memory to any device memory.
I say "directly" because that's obviously not what it's doing. It's really doing the same thing you would have done, except it's doing it within the command buffer. You would have copied your CPU data into GPU mappable memory, then created a command that copies from that mappable memory into non-mappable memory.
vkCmdUpdateBuffer does the exact same thing. It copies the data from the pointer/size you give it into mappable memory, which is provided by the command buffer itself (this is why it has an upper limit of 64KB). This copy happens immediately, just as it would have if you did a memcpy, so when this function returns, you can do whatever you want with the pointer you gave it. Then it creates a command in the command buffer that copies from the mappable memory in the command buffer to the destination memory location.
The documentation for this function explicitly gives warnings about using it for larger transfers. That is, it tells you not to do that. This is for quick, small, one-shot updates of unmappable memory. Nothing more.
That's one reason why tutorials don't talk about it: it's a highly special-case function that many novice users will try to use because it's easier than the explicit code. But in most cases, they should not be using it.
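As a rough illustration of how narrow the use case is, here is a C sketch of a check you might run before choosing vkCmdUpdateBuffer over a staging copy. can_use_update_buffer is a hypothetical helper, not a Vulkan function. The conditions reflect my reading of the valid-usage rules in the specification (dataSize greater than zero, at most 65536 bytes, dstOffset and dataSize multiples of 4, and the range inside the buffer); check them against the spec version you target.

```c
#include <assert.h>
#include <stdint.h>

/* Upper limit on dataSize for vkCmdUpdateBuffer (the 64KB mentioned above). */
#define UPDATE_BUFFER_MAX_BYTES 65536u

/* Returns 1 if an update of data_size bytes at dst_offset into a buffer
 * of buffer_size bytes satisfies the size and alignment rules assumed in
 * this sketch, 0 otherwise (fall back to a staging buffer in that case). */
int can_use_update_buffer(uint64_t buffer_size, uint64_t dst_offset,
                          uint64_t data_size)
{
    if (data_size == 0 || data_size > UPDATE_BUFFER_MAX_BYTES)
        return 0;
    if (dst_offset % 4 != 0 || data_size % 4 != 0)
        return 0;
    if (dst_offset >= buffer_size)
        return 0;
    if (data_size > buffer_size - dst_offset)
        return 0;
    return 1;
}
```

Passing this check only means the call would be legal; whether it is a good idea still depends on the update being small and infrequent, as described above.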

Best way to store animated vertex data

From what I understand, there are several methods for storing and transferring vertex data to the GPU:
- Using a temporary staging buffer and copying it to discrete GPU memory every frame
- Using a shared buffer (which is slow?) and just updating the shared buffer every frame
- Storing the staging buffer for each mesh permanently, instead of recreating it every frame, and copying it to the GPU
Which method is best for storing animated mesh data that changes rapidly?
It depends on the hardware and the memory types it advertises. Note that all of the following requires you to use vkGetBufferMemoryRequirements to check to see if the memory type can support the usages you need.
If hardware advertises a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, then you should use that instead of staging. Now, you still need to double-buffer this, since you cannot write to data that the GPU is reading from, and you don't want to synchronize with the GPU unless the GPU is over a frame late. This is something you should also measure; your GPU needs may require a triple buffer, so design your system to be flexible.
Note that some hardware has two different heaps that are DEVICE_LOCAL, but only one of them will have HOST_VISIBLE memory types for them. So pay attention to those cases.
If there is no such memory type (or if the memory type doesn't support the buffer usages you need), then you need to profile this. The two alternatives are:
- Staging (via a dedicated transfer queue, where available) to a DEVICE_LOCAL memory type, where the data eventually gets used.
- Directly using a non-DEVICE_LOCAL memory type.
Note that both of these require buffering, since you want to avoid synchronization as much as possible. Staging through a transfer queue will also require a semaphore, since you need to make sure that the graphics queue doesn't try to use the memory until the transfer queue is done with it. It also means you need to deal with resource sharing between queues.
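The buffering mentioned above can be sketched as a ring of per-frame slices inside one larger buffer. This is a minimal C sketch with hypothetical names (FrameRing and its functions); it only does the bookkeeping and leaves the actual waiting (a fence or semaphore in Vulkan) to the caller. The slice count is a runtime value so you can switch between double and triple buffering while measuring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t   slice_size;  /* bytes reserved for one frame's vertex data */
    uint32_t slice_count; /* 2 = double buffering, 3 = triple buffering */
    uint64_t frame_index; /* number of frames started so far */
} FrameRing;

/* Byte offset of the slice the CPU may write during the current frame. */
size_t frame_ring_current_offset(const FrameRing *ring)
{
    return (size_t)(ring->frame_index % ring->slice_count) * ring->slice_size;
}

/* The current slice was last used slice_count frames ago. If such a frame
 * exists, store its index in *out_frame and return 1: the GPU must have
 * finished that frame before the slice is overwritten. Return 0 if the
 * slice has never been used. */
int frame_ring_must_wait_for(const FrameRing *ring, uint64_t *out_frame)
{
    if (ring->frame_index < ring->slice_count)
        return 0;
    *out_frame = ring->frame_index - ring->slice_count;
    return 1;
}

/* Move on to the next frame. */
void frame_ring_advance(FrameRing *ring)
{
    ring->frame_index++;
}
```

With two slices the CPU only has to wait if the GPU is more than one frame behind, which matches the goal of not synchronizing unless the GPU is over a frame late.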
Personally though, I would try to avoid CPU animated vertex data whenever possible. Vulkan-capable GPUs are perfectly capable of doing any animating themselves. GPUs have been doing bone weighted skinning (even dual-quaternion-based) for over a decade now. Even vertex palette animation is something the GPU can do; summing up the various different vertices to reach the final answer. So scenes with lots of CPU-generated vertex data should be relatively rare.

vkCmdCopyImageToBuffer does not work when used with an image created by application

I created my own swap chain images with vkCreateImage, allocating their memory myself (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT), and read back the image data after vkQueueSubmit and vkQueueWaitIdle by mapping the memory associated with each image.
To get the advantages of staging buffers, I then allocated the above memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT instead and recorded a vkCmdCopyImageToBuffer in the command buffer, but the result is all zeros. However, if I bind a buffer created with vkCreateBuffer to the same image memory and do a vkCmdCopyBuffer, I do get the whole rendered image.
Is it expected behavior that vkCmdCopyImageToBuffer cannot be used unless the image belongs to a system swapchain?
Edit 1:
I am rendering to an image created with vkCreateImage, with memory type VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. The thing is that when I do vkCmdCopyImageToBuffer, the image data in the destination buffer is all zeros.
When I instead create a buffer with vkCreateBuffer, bind it to the above image memory with vkBindBufferMemory, and then do vkCmdCopyBuffer, I do get the image data. Why does vkCmdCopyImageToBuffer not work? Is it because I am allocating the memory for the image myself? With swapchain images, where we do not allocate the memory, vkCmdCopyImageToBuffer works fine. Why do I need the extra overhead of binding a buffer to my allocated image memory to make this work?
Check that you are paying attention to image layouts. You may need a barrier to transition the layouts appropriately.
You might also turn on the validation layers to see if they catch anything.
(I am confused by the description as well, so sorry for my vague answer.)