In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the docs, it seems there is no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve this without a copy, starting from pre-allocated pointers?
Thanks!
You can allocate an MTLTexture backed by an MTLBuffer that was created with the bytesNoCopy constructor (newBufferWithBytesNoCopy:length:options:deallocator:), and then create an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind, though, that in this case the texture won't be in an optimal layout for GPU access, so this is a memory vs. performance trade-off.
Also keep in mind that the bytesNoCopy constructor only takes addresses aligned to a virtual memory page boundary, and that the driver needs to make sure the memory is resident when you submit a command buffer that uses it.
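To make the alignment requirement concrete, here is a minimal sketch of the arithmetic in Python (not Metal code). The helper names and the 1920x1080 RGBA8 image are assumptions chosen for illustration, and the rule that the length must also cover whole pages is stated from memory of the Metal docs, so please check it against them.

```python
import ctypes
import mmap

PAGE_SIZE = mmap.PAGESIZE  # page size of the machine running this sketch


def is_page_aligned(address: int, page_size: int = PAGE_SIZE) -> bool:
    """Return True if the address lies exactly on a page boundary."""
    return address % page_size == 0


def round_up_to_page(length: int, page_size: int = PAGE_SIZE) -> int:
    """Return the smallest multiple of page_size that is >= length."""
    return ((length + page_size - 1) // page_size) * page_size


# Hypothetical image: 1920x1080 pixels, 4 bytes per pixel (RGBA8).
image_bytes = 1920 * 1080 * 4

# Assumption: the no-copy buffer has to span whole pages, so pad the length.
padded_bytes = round_up_to_page(image_bytes)

# An anonymous mmap region starts on a page boundary, so it is one way to
# get memory that satisfies the alignment requirement.
region = mmap.mmap(-1, padded_bytes)
view = ctypes.c_char.from_buffer(region)  # only used to read the address
address = ctypes.addressof(view)
```

If the pre-allocated chunks come from a plain malloc, is_page_aligned(address) will usually be False, which is why the allocation itself typically has to change (mmap, vm_allocate or posix_memalign) before the no-copy path can be used.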
I have a compute shader which I'd like to output to an image/buffer that is meant to be intermediate storage between two pipelines: a compute pipeline and a graphics pipeline. The graphics pipeline is actually a "dummy", in that it does nothing apart from copying the contents of the intermediate buffer into a swapchain image. This is necessitated by the fact that DX12 deprecated the ability of compute pipelines to use UAVs to write directly into swapchain images.
I think the intermediate storage should be a "transient" attachment, in the Vulkan sense:
VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT specifies that the memory bound to this image will have been allocated with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (see Memory Allocation for more detail). This bit can be set for any image that can be used to create a VkImageView suitable for use as a color, resolve, depth/stencil, or input attachment.
This is explained in this article:
Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that begin in an uninitialized or cleared state at the beginning of a renderpass, are written by one or more subpasses, consumed by one or more subpasses and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments only lives within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead only ever living in cache. This saves bandwidth, reduces latency and improves power efficiency.
Does DirectX 12 have a similar image usage concept?
Direct3D 12 does not have this concept. And the reason for that limitation ultimately boils down to why transient allocation exists. TL;DR: It's not for doing the kind of thing you're trying to do.
Vulkan's render pass system exists for one purpose: to make tile-based renderers first-class citizens of the rendering system. TBRs do not fit well in OpenGL or D3D's framebuffer model. In both APIs, you can just swap framebuffers in and out whenever you want.
TBRs do not render to memory directly. They perform rendering operations into internal buffers, which are seeded from memory and then possibly written to memory after the rendering operation is complete. Switching rendered images whenever you want works against this structure, which is why TBR vendors have a list of things you're not supposed to do if you want high performance in your OpenGL ES code.
Vulkan's render pass system is an abstraction of a TBR system. In the abstract model, the render pass system potentially reads data from the images in the framebuffer, then performs a bunch of subpasses on copies of this data, and at the end, potentially writes the updated data back out into the images. So from the outside of the process, it looks like you're rendering to the images, but you're not. To maintain this illusion, for the duration of a render pass, you can only use those framebuffer images in the way that the render pass model allows: as attachments.
Now consider deferred rendering. In deferred rendering, you render to g-buffers, which you then read in your lighting passes to generate the final image. Once you've generated the final image, you don't need those g-buffers anymore. On a regular (non-tiled) GPU, that doesn't mean anything; because rendering goes directly to memory, those g-buffers must take up actual storage.
But consider how a TBR works. It does rendering into a single tile; in optimal cases, it processes all of the fragments for a single tile at once, which means it goes through both the geometry and lighting passes for that tile. For a TBR, the g-buffer is just a piece of scratch memory you use to get the final answer; it doesn't need to be read from memory or copied to memory.
In short, it doesn't need memory.
Enter lazily allocated memory and transient attachment images. They exist to allow TBRs to keep g-buffers in tile memory and never to have to allocate actual storage for them (or at least, it only happens if some runtime circumstance occurs that forces it, like shoving too much geometry at the GPU). And it only works within a render pass; if you end a render pass and have to use one of the g-buffers in another render pass, then the magic has to go away and the data has to touch actual storage.
The Vulkan API is very explicit about how specific this use case is. You cannot bind a piece of lazily-allocated memory to an image that does not have the VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT flag set on it (or to a buffer of any kind). And you'll notice that it says "transient attachment", as in render pass attachments. It says this because transient attachments cannot be used for non-attachment uses (this is part of the valid usage tests for VkImageCreateInfo). At all.
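As a rough illustration of those two rules, here is a small Python model (not Vulkan code). The bit values are the ones from the Vulkan headers for VkImageUsageFlagBits; the function names are made up for this sketch, and it only mirrors the usage rules as described above, not the full validation the spec performs.

```python
from enum import IntFlag


class ImageUsage(IntFlag):
    """Subset of VkImageUsageFlagBits, with the values from the Vulkan headers."""
    TRANSFER_SRC = 0x01
    TRANSFER_DST = 0x02
    SAMPLED = 0x04
    STORAGE = 0x08
    COLOR_ATTACHMENT = 0x10
    DEPTH_STENCIL_ATTACHMENT = 0x20
    TRANSIENT_ATTACHMENT = 0x40
    INPUT_ATTACHMENT = 0x80


# The only usages that may accompany the transient bit.
ATTACHMENT_USES = int(ImageUsage.COLOR_ATTACHMENT
                      | ImageUsage.DEPTH_STENCIL_ATTACHMENT
                      | ImageUsage.INPUT_ATTACHMENT)


def transient_usage_is_valid(usage: ImageUsage) -> bool:
    """Hypothetical helper: check the transient-attachment usage rule."""
    bits = int(usage)
    if not bits & ImageUsage.TRANSIENT_ATTACHMENT:
        return True  # the rule only constrains transient images
    others = bits & ~int(ImageUsage.TRANSIENT_ATTACHMENT)
    has_attachment_use = (others & ATTACHMENT_USES) != 0
    only_attachment_uses = (others & ~ATTACHMENT_USES) == 0
    return has_attachment_use and only_attachment_uses


def can_bind_lazily_allocated(usage: ImageUsage) -> bool:
    """Hypothetical helper: lazily allocated memory needs the transient bit."""
    return bool(int(usage) & ImageUsage.TRANSIENT_ATTACHMENT)


# A g-buffer style attachment passes the check.
gbuffer = ImageUsage.COLOR_ATTACHMENT | ImageUsage.INPUT_ATTACHMENT | ImageUsage.TRANSIENT_ATTACHMENT

# An image written by a compute shader needs STORAGE, which is not an attachment use.
compute_target = ImageUsage.STORAGE | ImageUsage.TRANSIENT_ATTACHMENT
```

Here transient_usage_is_valid(gbuffer) holds while transient_usage_is_valid(compute_target) does not, which is the situation in the question: a compute shader output is a storage use, so it cannot be combined with the transient bit.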
What you want to do is not the sort of thing that lazily allocated memory is made for. It can't work.
As for Direct3D 12, the API is not designed to run on mobile GPUs, and since only mobile GPUs are tile-based renderers (some recent desktop GPUs have TBR similarities, but are not full TBRs), it has no facilities designed explicitly for them. And thus, it has no need for lazily allocated memory or transient attachments.
I created my own swapchain with vkCreateImage, allocating the appropriate memory for it myself (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT). I got the image data after vkQueueSubmit and vkQueueWaitIdle by mapping the memory associated with the image.
To get the advantages of staging buffers, I then created that memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT instead and recorded a vkCmdCopyImageToBuffer in the command buffer, but the resulting values are all 0. However, if I create a buffer with vkCreateBuffer, associate it with the above image, and do a vkCmdCopyBuffer, I do get the whole rendered image.
Is it expected behavior that vkCmdCopyImageToBuffer does not work unless the image belongs to a system swapchain?
Edit 1:
I am rendering to an image created with vkCreateImage, with memory of type VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. The thing is that when I do vkCmdCopyImageToBuffer, the image data in the destination buffer is all 0.
Now, when I create a buffer with vkCreateBuffer, bind it to the above image memory with vkBindBufferMemory, and then do vkCmdCopyBuffer, I do get the image data. Why does vkCmdCopyImageToBuffer not work? Is it because I am allocating the memory for the image myself? With swapchain images, where we do not allocate the memory, vkCmdCopyImageToBuffer works fine. Why do I need the extra overhead of binding a buffer to my allocated image memory to make this work?
Check that you are paying attention to image layouts. You may need a barrier to transition the layouts appropriately.
You might also turn on the validation layers to see if they catch anything.
(I am confused by the description as well, so sorry for my vague answer.)
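To show what "paying attention to image layouts" means in practice, here is a toy Python model (not Vulkan code) that tracks an image's layout and refuses a copy from the wrong one. The TrackedImage class and its methods are hypothetical; the set of accepted source layouts is the one vkCmdCopyImageToBuffer documents for srcImageLayout. It is only a sketch of the rule and does not claim to identify the actual bug in the question.

```python
# Layouts that vkCmdCopyImageToBuffer accepts for its source image.
COPY_SRC_LAYOUTS = {
    "VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL",
    "VK_IMAGE_LAYOUT_GENERAL",
    "VK_IMAGE_LAYOUT_SHARED_PRESENT_KHR",
}


class TrackedImage:
    """Hypothetical stand-in for a VkImage that remembers its current layout."""

    def __init__(self, layout="VK_IMAGE_LAYOUT_UNDEFINED"):
        self.layout = layout

    def barrier(self, old_layout, new_layout):
        """Model an image memory barrier that performs a layout transition.

        old_layout must match the current layout, or be UNDEFINED, in which
        case real Vulkan is allowed to discard the image contents.
        """
        if old_layout not in (self.layout, "VK_IMAGE_LAYOUT_UNDEFINED"):
            raise ValueError(f"image is in {self.layout}, not {old_layout}")
        self.layout = new_layout

    def copy_to_buffer(self, src_layout):
        """Model the layout checks for a copy out of this image."""
        if src_layout != self.layout:
            raise ValueError(f"image is in {self.layout}, not {src_layout}")
        if src_layout not in COPY_SRC_LAYOUTS:
            raise ValueError(f"{src_layout} is not a valid copy source layout")
        return True


image = TrackedImage()
image.barrier("VK_IMAGE_LAYOUT_UNDEFINED",
              "VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL")
# ... rendering into the image would happen here ...
image.barrier("VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL",
              "VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL")
copied = image.copy_to_buffer("VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL")
```

In a real application the second transition can also come from the render pass's finalLayout instead of an explicit barrier, and the validation layers perform this kind of check for you.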
After seeing the question How to take screenshot (high fps) in Linux (programming), the discussion in the comments revealed that even though the problem of rapidly capturing screenshots is well described, we couldn't actually find a description of the logic behind it.
This question was created with the intention of clarifying how Xlib screen capture handles memory.
Does Xlib screen capture copy the memory from kernel space into user space, or does it create an intermediate buffer to store the data read from the framebuffer?
The circular buffer is used to display images in a window. Since reading and writing the buffer for display takes some time, I read an article about using GPU video memory or FPGA VGA SRAM as the circular buffer.
But one problem I can see is that there is no easy way to pass that video memory (pointer) to a UI API such as MFC or Qt. To do that, we would need to copy the content to normal RAM, which defeats the purpose.
So I wonder whether it is a good idea to use video memory on a GPU or FPGA as a circular buffer for display. If so, how can I overcome the issue above? Any pointers from experienced developers would be appreciated.
It is always a good idea to use a double buffer for video memory. But whether the necessary memory bandwidth is available depends on your system.
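For illustration, a minimal double buffer in Python looks roughly like this. The DoubleBuffer class and the bandwidth helper are made up for this sketch, and the estimate assumes each frame is written once and read once, which is a simplification.

```python
class DoubleBuffer:
    """Two equally sized buffers: one being displayed, one being filled."""

    def __init__(self, size: int):
        self._buffers = [bytearray(size), bytearray(size)]
        self._front = 0  # index of the buffer currently shown

    @property
    def front(self) -> bytearray:
        """Buffer the display side reads from."""
        return self._buffers[self._front]

    @property
    def back(self) -> bytearray:
        """Buffer the producer writes the next frame into."""
        return self._buffers[1 - self._front]

    def swap(self) -> None:
        """Make the freshly written back buffer the new front buffer."""
        self._front = 1 - self._front


def frame_traffic_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Rough estimate: every frame is written once and read once."""
    return width * height * bytes_per_pixel * fps * 2


frames = DoubleBuffer(size=4)
frames.back[:] = b"\x01\x02\x03\x04"  # producer fills the hidden buffer
frames.swap()                          # the display side now sees the new frame
shown = bytes(frames.front)
```

With these assumptions, a 1920x1080 image at 4 bytes per pixel and 60 frames per second needs on the order of 1 GB/s of memory traffic, which is the kind of number to compare against what the system can actually sustain.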