Can you transfer directly to an image without using an intermediary buffer? - Vulkan

If I've allocated an image in memory that IS NOT host-visible, then I create a staging buffer that IS host-visible so I can write to it. I memcpy into that buffer, and then I do a vkCmdCopyBufferToImage.
However, let's say we're running on hardware where the device-local image memory is also host-visible: is it more efficient to just memcpy into the image memory directly? What layout does the image need to be in if memcpying straight into it? Transfer destination? And how would this work with mip levels? In the buffer-to-image method you specify each mip level in the copy regions, but when memcpying in, how do you do this? Or do you just not, and instead do the extra copy to the host-visible staging buffer followed by a buffer-to-image copy?
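For reference, the staging-buffer path described above might look roughly like the sketch below. This is only a sketch: device, cmd, stagingBuffer, stagingMemory, image, pixelData, totalSize, mipLevels, and the per-mip offsets/extents are all assumed to exist, and the image is assumed to already be in VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL.

```
/* Sketch: upload via a host-visible staging buffer, one copy region per mip. */
void *mapped;
vkMapMemory(device, stagingMemory, 0, VK_WHOLE_SIZE, 0, &mapped);
memcpy(mapped, pixelData, totalSize);   /* all mips, tightly packed */
vkUnmapMemory(device, stagingMemory);

for (uint32_t i = 0; i < mipLevels; ++i) {
    VkBufferImageCopy region = {0};
    region.bufferOffset = mipOffset[i];          /* where mip i starts in the buffer */
    region.bufferRowLength = 0;                  /* 0 = tightly packed */
    region.bufferImageHeight = 0;
    region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    region.imageSubresource.mipLevel = i;
    region.imageSubresource.baseArrayLayer = 0;
    region.imageSubresource.layerCount = 1;
    region.imageExtent.width  = mipWidth[i];     /* max(width  >> i, 1) */
    region.imageExtent.height = mipHeight[i];    /* max(height >> i, 1) */
    region.imageExtent.depth  = 1;
    vkCmdCopyBufferToImage(cmd, stagingBuffer, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
}
```

The regions could just as well be collected into an array and submitted with a single vkCmdCopyBufferToImage call.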

The texel arrangement of VK_IMAGE_TILING_OPTIMAL is implementation-dependent and not exposed.
The arrangement of VK_IMAGE_TILING_LINEAR is whatever vkGetImageSubresourceLayout says it is.
The device-local, host-visible memory on current desktop GPUs has a specific purpose: streaming small amounts of frequently changing data. Besides, you wouldn't have access to most of the GPU's memory capacity this way.
If you do it the right way™, then there is already only one transfer and the memcpy is unnecessary: either you build your linear image directly in Vulkan's memory, or you import your arbitrary host memory into Vulkan with VK_EXT_external_memory_host.
What layout does the image need to be in if memcpying straight into it?
Host writes require either VK_IMAGE_LAYOUT_PREINITIALIZED or VK_IMAGE_LAYOUT_GENERAL.
How would this work with mip levels?
vkGetImageSubresourceLayout() fills in pLayout->offset and pLayout->size, which say where each subresource starts and ends (plus pLayout->rowPitch for addressing rows within it).
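Put together, host-writing each mip of a linear image might look like this sketch (device, image, imageMemory, mipLevels, bytesPerTexel, and the mipData/mipWidth/mipHeight arrays are assumptions; the image is assumed to be in VK_IMAGE_LAYOUT_PREINITIALIZED or VK_IMAGE_LAYOUT_GENERAL and bound to mapped host-visible memory):

```
/* Sketch: write each mip level of a VK_IMAGE_TILING_LINEAR image from the host. */
uint8_t *mapped;
vkMapMemory(device, imageMemory, 0, VK_WHOLE_SIZE, 0, (void **)&mapped);

for (uint32_t i = 0; i < mipLevels; ++i) {
    VkImageSubresource sub = {0};
    sub.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    sub.mipLevel = i;
    sub.arrayLayer = 0;

    VkSubresourceLayout layout;
    vkGetImageSubresourceLayout(device, image, &sub, &layout);

    /* Copy row by row: layout.rowPitch may be larger than the tightly
     * packed row size of the source data. */
    const uint8_t *src = mipData[i];
    size_t srcRowSize = (size_t)mipWidth[i] * bytesPerTexel;
    for (uint32_t y = 0; y < mipHeight[i]; ++y) {
        memcpy(mapped + layout.offset + y * layout.rowPitch,
               src + y * srcRowSize, srcRowSize);
    }
}
vkUnmapMemory(device, imageMemory);
```

One caveat: implementations are only required to support linear tiling for single-mip-level images, so check vkGetPhysicalDeviceImageFormatProperties before relying on a mipmapped linear image.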

Related

Create MPSImage from pre-allocated data pointer

In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the docs, it seems like there's no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve this with no copy from the pre-allocated pointers?
Thanks!
You can allocate an MTLTexture backed by an MTLBuffer created with the bytesNoCopy constructor, and then allocate an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind, though, that in this case the texture won't be in an optimal layout for GPU access, so this is a memory vs. performance trade-off.
Also keep in mind that the bytesNoCopy constructor accepts only addresses aligned to a virtual memory page boundary, and that the driver needs to make sure that memory is resident when you submit a command buffer that uses it.

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert packed 3-channel image data to a 4-channel texture on the GPU. Instead of copying the 3-channel image to the GPU before decoding it, would it be a good idea to write it to a host-visible buffer and read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know whether the CPU-to-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether the host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool expose memory that is both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to let you create an SSBO in it).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
Should you use such memory types for this kind of image fiddling? You have to profile to know. You'll also have to take into account whether your application has ways to hide the DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.
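As a sketch of that memory-type check (FindMemoryType and the fallback policy here are illustrative, not a fixed recipe; memRequirements is assumed to come from vkGetBufferMemoryRequirements on the SSBO):

```
/* Sketch: look for a memory type for the SSBO that is both device-local and
 * host-visible, and fall back to plain host-visible (or a staging copy)
 * when no such type exists. */
uint32_t FindMemoryType(VkPhysicalDevice phys, uint32_t typeBits,
                        VkMemoryPropertyFlags wanted) {
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);
    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        if ((typeBits & (1u << i)) &&
            (props.memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;
    }
    return UINT32_MAX;  /* no such memory type on this implementation */
}

uint32_t idx = FindMemoryType(phys, memRequirements.memoryTypeBits,
                              VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
                              VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
if (idx == UINT32_MAX) {
    /* No device-local + host-visible type: read from plain host-visible
     * memory, or take the staging-copy path instead. */
    idx = FindMemoryType(phys, memRequirements.memoryTypeBits,
                         VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
}
```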

Vulkan: Uploading non-pow-of-2 texture data with vkCmdCopyBufferToImage

There is a similar thread (Loading non-power-of-two textures in Vulkan) but it pertains to updating data in a host-visible mapped region.
I wanted to use the fully-fledged vkCmdCopyBufferToImage function to copy data from a host-visible buffer to a device-local image. My image has dimensions which are not powers of two (1280x720, to be more specific).
When doing so, I've seen that it works fine on NVIDIA and Intel, but not on AMD, where the image comes out "distorted", which suggests a rowPitch/padding problem.
My host-visible buffer is tightly packed, so bufferRowLength and bufferImageHeight are the same as imageExtent.width and imageExtent.height.
Shouldn't this function cater for non-power-of-two textures? Or am I doing it wrong?
I could implement a workaround with a compute shader, but I thought this function's purpose was to be generic. The downside of a compute shader is also that the copy operation could not be performed on a transfer-only queue.
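For comparison, a minimal copy region for a tightly packed 1280x720 source might look like the sketch below (cmd, stagingBuffer, and image are assumed). Note that bufferRowLength and bufferImageHeight of 0 also mean "tightly packed according to imageExtent", which sidesteps specifying a pitch at all in this case:

```
/* Sketch: copy a tightly packed 1280x720 buffer into mip 0 of an image. */
VkBufferImageCopy region = {0};
region.bufferOffset = 0;
region.bufferRowLength = 0;      /* 0 = tightly packed, same as imageExtent.width */
region.bufferImageHeight = 0;    /* 0 = tightly packed, same as imageExtent.height */
region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
region.imageSubresource.mipLevel = 0;
region.imageSubresource.baseArrayLayer = 0;
region.imageSubresource.layerCount = 1;
region.imageOffset = (VkOffset3D){0, 0, 0};
region.imageExtent = (VkExtent3D){1280, 720, 1};

vkCmdCopyBufferToImage(cmd, stagingBuffer, image,
                       VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
```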

vkCmdCopyImageToBuffer does not work when used with an image created by the application

I had created my own swapchain with vkCreateImage, allocating the appropriate memory for it (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT), and read the image data back after vkQueueSubmit and vkQueueWaitIdle by mapping the memory associated with it.
To get the advantages of staging buffers, I then created the above memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and did a vkCmdCopyImageToBuffer in a command buffer, but the result is all zeros. Yet if I just bind a buffer created with vkCreateBuffer to the same memory and do vkCmdCopyBuffer, I do get the rendered image.
Is it expected behavior that vkCmdCopyImageToBuffer does not work unless the image comes from a system swapchain?
Edit 1:
I am rendering to an image created with vkCreateImage whose memory type is VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. The thing is that when I do vkCmdCopyImageToBuffer, the image data in the destination buffer is all zeros.
When I instead create a buffer with vkCreateBuffer, bind it to the above image memory with vkBindBufferMemory, and then do vkCmdCopyBuffer, I do get the image data. Why does vkCmdCopyImageToBuffer not work? Is it because I am allocating the memory for the image myself? In the case of swapchain images, where we do not allocate memory ourselves, vkCmdCopyImageToBuffer works fine. Why do I need the extra overhead of binding a buffer to my allocated image memory to make this work?
Check that you are paying attention to image layouts. You may need a barrier to transition the layouts appropriately.
You might also turn on the validation layers to see if they catch anything.
(I am confused by the description as well, so sorry for my vague answer.)
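As a sketch of the kind of barrier meant here, assuming the image was last written as a color attachment (cmd and image are placeholders):

```
/* Sketch: transition the image to a valid transfer-source layout before
 * vkCmdCopyImageToBuffer. */
VkImageMemoryBarrier barrier = {0};
barrier.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
barrier.oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
barrier.newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
barrier.image = image;
barrier.subresourceRange.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
barrier.subresourceRange.baseMipLevel = 0;
barrier.subresourceRange.levelCount = 1;
barrier.subresourceRange.baseArrayLayer = 0;
barrier.subresourceRange.layerCount = 1;

vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                     0, NULL, 0, NULL, 1, &barrier);
/* The image can now be the source of vkCmdCopyImageToBuffer with
 * VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL. */
```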

How to read the contents of a VkBuffer after it has been filled with data in Vulkan?

Is there any code snippet to do this?
I am doing offscreen rendering to a VkImage and want to dump the result to a PNG. I have created a VkBuffer and am doing a vkCmdCopyImageToBuffer, but I'm not sure how to move forward from there.
Edit:
I am creating a VkBuffer with a VkBufferCreateInfo, but I'm not calling vkAllocateMemory for it, since I do not want to associate it with any GPU memory. After this I do vkCmdCopyImageToBuffer. So how do I do a memcpy, assuming the data has been copied into the VkBuffer?
You will need to map the memory of the buffer you copied the data into and pull the data out through the void*. You do have to bind memory to the buffer; a buffer with no backing memory is not a valid copy destination, and the memory you use must be host-visible.
If the memory type you used for the buffer is not host-coherent, then you also need to call vkInvalidateMappedMemoryRanges on the mapped range after the fence is triggered and before you start copying the data out.
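Putting it together, the readback might look like this sketch (device, fence, bufferMemory, pixels, and imageSizeInBytes are assumed to be set up already):

```
/* Sketch: wait for the copy to finish, then map and read the buffer back. */
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

void *mapped;
vkMapMemory(device, bufferMemory, 0, VK_WHOLE_SIZE, 0, &mapped);

/* Only needed if the memory type is not HOST_COHERENT. */
VkMappedMemoryRange range = {0};
range.sType = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE;
range.memory = bufferMemory;
range.offset = 0;
range.size = VK_WHOLE_SIZE;
vkInvalidateMappedMemoryRanges(device, 1, &range);

memcpy(pixels, mapped, imageSizeInBytes);  /* hand pixels to the PNG writer */
vkUnmapMemory(device, bufferMemory);
```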