vkCmdCopyImageToBuffer does not work when used with an image created by the application - vulkan

I created my own swap chain image with vkCreateImage and allocated appropriate memory for it (VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT). After vkQueueSubmit and vkQueueWaitIdle, I retrieved the image data by mapping the memory associated with the image.
To take advantage of staging buffers, I then created the image memory with VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT and recorded a vkCmdCopyImageToBuffer in a command buffer, but the resulting buffer contains all zeros. However, if I simply bind a buffer created with vkCreateBuffer to the same image memory and do a vkCmdCopyBuffer instead, I do get the full rendered image.
Is this expected behavior, that vkCmdCopyImageToBuffer cannot be used unless the image comes from a system swapchain?
Edit 1:
I am rendering to an image created with vkCreateImage, with memory of type VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT. The thing is that when I do vkCmdCopyImageToBuffer, the image data in the destination buffer is all zeros.
However, when I create a buffer with vkCreateBuffer, bind it to the same image memory with vkBindBufferMemory, and then do a vkCmdCopyBuffer, I do get the image data. Why does vkCmdCopyImageToBuffer not work? Is it because I am allocating the memory for the image myself? In the case of swapchain images, where we do not allocate the memory, vkCmdCopyImageToBuffer works fine. Why do I need the extra overhead of binding a buffer to my allocated image memory to make this work?

Check that you are paying attention to image layouts. You may need a barrier to transition the layouts appropriately.
You might also turn on the validation layers to see if they catch anything.
(I am confused by the description as well, so sorry for my vague answer.)
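For illustration, here is a minimal sketch of the kind of barrier meant above, assuming the image was last written as a color attachment; cmd, image, buffer, width, and height are placeholders for your own handles:

```c
/* Transition the image so a transfer can read it, then copy it to the buffer.
 * Assumes the previous write was a color-attachment write; adjust
 * srcAccessMask/oldLayout/srcStageMask to match your actual usage. */
VkImageMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT,
    .oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
    .newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .image = image,
    .subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 },
};
vkCmdPipelineBarrier(cmd,
                     VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
                     VK_PIPELINE_STAGE_TRANSFER_BIT,
                     0, 0, NULL, 0, NULL, 1, &barrier);

VkBufferImageCopy region = {
    .bufferOffset = 0,
    .bufferRowLength = 0,   /* 0 = tightly packed rows */
    .bufferImageHeight = 0,
    .imageSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 0, 1 },
    .imageOffset = { 0, 0, 0 },
    .imageExtent = { width, height, 1 },
};
vkCmdCopyImageToBuffer(cmd, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                       buffer, 1, &region);
```

Without such a transition, copying from an image whose contents are in the wrong (or UNDEFINED) layout is exactly the kind of error that produces all-zero reads, and the validation layers should flag it.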

Create MPSImage from pre-allocated data pointer

In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the docs, it seems like there's no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve this with no copy, straight from the pre-allocated pointers?
Thanks!
You can allocate an MTLTexture backed by an MTLBuffer created with the bytesNoCopy constructor, and then create an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind, though, that in this case the texture won't be in an optimal layout for GPU access, so this is a memory-versus-performance trade-off.
Also keep in mind that the bytesNoCopy constructor accepts only addresses aligned to a virtual-memory page boundary, and the driver needs to make sure that memory is resident when you submit a command buffer that uses it.

Can you transfer directly to an image without using an intermediary buffer

If I've allocated an image in memory that IS NOT host-visible, then I use a staging buffer that IS host-visible so I can write to it. I memcpy into that buffer, and then do a vkCmdCopyBufferToImage.
However, suppose we're running on hardware where the device-local image memory is also host-visible. Is it more efficient and better to just memcpy into this image memory directly? What image layout does the image need to be in when memcpying straight into it? Transfer destination? And how would this work with mip levels? With the copy-buffer-to-image method you specify each mip level explicitly, but how do you do that when memcpying in? Or do you just not, and instead do the extra copy to the host-visible staging buffer followed by a copy-buffer-to-image?
The VK_IMAGE_TILING_OPTIMAL arrangement is implementation-dependent and unexposed.
The VK_IMAGE_TILING_LINEAR arrangement is whatever vkGetImageSubresourceLayout says it is.
The device-local, host-visible memory on current desktop GPUs has a specific purpose, and it is typically only a small window into VRAM; either way, you wouldn't have access to most of the GPU's memory capacity through it.
If you do it the right way™, then there is already only one transfer and the memcpy is unnecessary: either you build your linear image directly in Vulkan's memory, or you import your arbitrary host memory into Vulkan with VK_EXT_external_memory_host.
What image layout does the image need to be in when memcpying straight into it?
For host writes, the layout must be either VK_IMAGE_LAYOUT_PREINITIALIZED or VK_IMAGE_LAYOUT_GENERAL.
How would this work with mip levels?
vkGetImageSubresourceLayout() gives you pLayout->offset and pLayout->size (plus pLayout->rowPitch), telling you where each subresource, e.g. each mip level, starts and how big it is.
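As a rough sketch of the whole path, assuming a VK_IMAGE_TILING_LINEAR image in host-visible memory in VK_IMAGE_LAYOUT_PREINITIALIZED, where device, image, memory, src, mip, width, and height (the dimensions of that mip level) are your own variables:

```c
/* Query where mip level `mip` lives inside the image's memory. */
VkImageSubresource sub = {
    .aspectMask = VK_IMAGE_ASPECT_COLOR_BIT,
    .mipLevel = mip,
    .arrayLayer = 0,
};
VkSubresourceLayout layout;
vkGetImageSubresourceLayout(device, image, &sub, &layout);

void *mapped;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);

/* Copy row by row: rowPitch may be larger than the tightly packed
 * row size (here assumed to be width * 4 bytes for an RGBA8 format). */
const uint32_t bytesPerRow = width * 4;
uint8_t *dst = (uint8_t *)mapped + layout.offset;
for (uint32_t y = 0; y < height; ++y) {
    memcpy(dst + y * layout.rowPitch, src + y * bytesPerRow, bytesPerRow);
}

/* If the memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
 * call vkFlushMappedMemoryRanges before the GPU reads the data. */
vkUnmapMemory(device, memory);
```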

What is the DirectX 12 equivalent of Vulkan's "transient attachment"?

I have a compute shader whose output I'd like to write to an image/buffer that serves as intermediate storage between two pipelines: a compute pipeline and a graphics pipeline. The graphics pipeline is actually a "dummy", in that it does nothing apart from copying the contents of the intermediate buffer into a swapchain image. This is necessitated by the fact that DX12 deprecated the ability of compute pipelines to use UAVs to write directly into swapchain images.
I think the intermediate storage should be a "transient" attachment, in the Vulkan sense:
VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT specifies that the memory bound to this image will have been allocated with the VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT (see Memory Allocation for more detail). This bit can be set for any image that can be used to create a VkImageView suitable for use as a color, resolve, depth/stencil, or input attachment.
This is explained in this article:
Finally, Vulkan includes the concept of transient attachments. These are framebuffer attachments that begin in an uninitialized or cleared state at the beginning of a renderpass, are written by one or more subpasses, consumed by one or more subpasses and are ultimately discarded at the end of the renderpass. In this scenario, the data in the attachments only lives within the renderpass and never needs to be written to main memory. Although we’ll still allocate memory for such an attachment, the data may never leave the GPU, instead only ever living in cache. This saves bandwidth, reduces latency and improves power efficiency.
Does DirectX 12 have a similar image usage concept?
Direct3D 12 does not have this concept. And the reason for that limitation ultimately boils down to why transient allocation exists. TL;DR: It's not for doing the kind of thing you're trying to do.
Vulkan's render pass system exists for one purpose: to make tile-based renderers first-class citizens of the rendering system. TBRs do not fit well in OpenGL or D3D's framebuffer model. In both APIs, you can just swap framebuffers in and out whenever you want.
TBRs do not render to memory directly. They perform rendering operations into internal buffers, which are seeded from memory and then possibly written to memory after the rendering operation is complete. Switching rendered images whenever you want works against this structure, which is why TBR vendors have a list of things you're not supposed to do if you want high-performance in your OpenGL ES code.
Vulkan's render pass system is an abstraction of a TBR system. In the abstract model, the render pass system potentially reads data from the images in the frame buffer, then performs a bunch of subpasses on copies of this data, and at the end, potentially writes the updated data back out into the images. So from the outside of the process, it looks like you're rendering to the images, but you're not. To maintain this illusion, for the duration of a render pass, you can only use those framebuffer images in the way that the render pass model allows: as attachments.
Now consider deferred rendering. In deferred rendering, you render to g-buffers, which you then read in your lighting passes to generate the final image. Once you've generated the final image, you don't need those g-buffers anymore. In a regular GPU, that doesn't mean anything; because rendering goes directly to memory, those g-buffers must take up actual storage.
But consider how a TBR works. It does rendering into a single tile; in optimal cases, it processes all of the fragments for a single tile at once. Which means it goes through the geometry and lighting passes. For a TBR, the g-buffer is just a piece of scratch memory you use to get the final answer; it doesn't need to be read from memory or copied to memory.
In short, it doesn't need memory.
Enter lazily allocated memory and transient attachment images. They exist to allow TBRs to keep g-buffers in tile memory and never to have to allocate actual storage for them (or at least, it only happens if some runtime circumstance occurs that forces it, like shoving too much geometry at the GPU). And it only works within a render pass; if you end a render pass and have to use one of the g-buffers in another render pass, then the magic has to go away and the data has to touch actual storage.
The Vulkan API makes it very explicit just how specific this use case is. You cannot bind a piece of lazily-allocated memory to an image that does not have the USAGE_TRANSIENT_ATTACHMENT flag set on it (or to a buffer of any kind). And note that it says "transient attachment", as in render pass attachments: transient attachments cannot be used for non-attachment purposes (this is part of the valid usage requirements for VkImageCreateInfo). At all.
What you want to do is not the sort of thing that lazily allocated memory is made for. It can't work.
As for Direct3D 12, the API is not designed to run on mobile GPUs, and since only mobile GPUs are tile-based renderers (some recent desktop GPUs have TBR similarities, but are not full TBRs), it has no facilities designed explicitly for them. And thus, it has no need for lazily allocated memory or transient attachments.
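For reference, here is roughly what the Vulkan-side setup being discussed looks like; device, width, and height are assumed, and findMemoryType is a hypothetical helper that searches VkPhysicalDeviceMemoryProperties for a matching memory type:

```c
/* Sketch: create a transient color attachment backed by lazily
 * allocated memory. Note the usage restriction the answer describes:
 * with TRANSIENT_ATTACHMENT_BIT set, only attachment usages are legal. */
VkImageCreateInfo info = {
    .sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType = VK_IMAGE_TYPE_2D,
    .format = VK_FORMAT_R8G8B8A8_UNORM,
    .extent = { width, height, 1 },
    .mipLevels = 1,
    .arrayLayers = 1,
    .samples = VK_SAMPLE_COUNT_1_BIT,
    .tiling = VK_IMAGE_TILING_OPTIMAL,
    .usage = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT |
             VK_IMAGE_USAGE_INPUT_ATTACHMENT_BIT |
             VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
};
VkImage image;
vkCreateImage(device, &info, NULL, &image);

VkMemoryRequirements req;
vkGetImageMemoryRequirements(device, image, &req);
VkMemoryAllocateInfo alloc = {
    .sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .allocationSize = req.size,
    /* Desktop GPUs typically expose no lazily allocated memory type;
     * a real application would fall back to plain DEVICE_LOCAL here. */
    .memoryTypeIndex = findMemoryType(req.memoryTypeBits,
                                      VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT),
};
VkDeviceMemory memory;
vkAllocateMemory(device, &alloc, NULL, &memory);
vkBindImageMemory(device, image, memory, 0);
```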

Why do we need multiple render passes and subpasses?

I had some experience with DirectX 12 in the past, and I don't remember anything similar to Vulkan's render passes, so I can't make an analogy. If I'm understanding correctly, command buffers inside the same subpass don't need to be synchronized. So why complicate things by making multiple subpasses? Why can't I just take one command buffer and put all my frame-related work there?
Imagine that the GPU cannot render to images directly. Imagine that it can only render to special framebuffer memory storage, which is completely separate from regular image memory. You cannot talk to this framebuffer memory directly, and you cannot allocate from it. However, during a rendering operation, you can copy data from images into it, read data out of it into images, and of course render to this internal memory.
Now imagine that your special framebuffer memory is fixed in size, a size which is smaller than the size of the overall framebuffer you want to render to (perhaps much smaller). To be able to render to images that are bigger than your framebuffer memory, you basically have to execute all rendering commands for those targets multiple times. To avoid running vertex processing multiple times, you need a way to store the output of vertex processing stages.
Furthermore, when generating rendering commands, you need to have some idea of how to apportion your framebuffer memory. You may have to divide up your framebuffer memory differently if you're rendering to one 32-bpp image than if you're rendering to two. And how you assign your framebuffer memory can affect how your fragment shader code works. After all, this framebuffer rendering memory may be directly accessible by the fragment shader during a rendering operation.
That is the basic idea of the render pass model: you are rendering to special framebuffer memory, of an indeterminate size. Every aspect of the render pass system's complexity is based on this conceptual model.
Subpasses are the part where you determine exactly which things you're rendering to at the moment. Because this affects framebuffer memory arrangement, graphics pipelines are always built by referring to a subpass of a render pass. Similarly, secondary command buffers that are to be executed within a subpass must specify the subpass they will be used within.
When a render pass instance begins execution on a queue, it (conceptually) copies the attachment images we intend to render to into framebuffer rendering memory. At the end of the render pass, the data we render is copied back out to the attachment images.
During the execution of a render pass instance, the data for attachment images is considered "indeterminate". While the model says that we're copying into framebuffer rendering memory, Vulkan doesn't want to force implementations to actually copy stuff if they directly render to images.
As such, Vulkan merely states that no operation can access images that are being used as attachments, except for those which access the images as attachments. For example, you cannot read an attachment image as a texture. But you can read from it as an input attachment.
This is a conceptual description of the way tile-based renderers work. And this is the conceptual model that is the foundation of the Vulkan render pass architecture. Render targets are not accessible memory; they're special things that can only be accessed in special ways.
You can't "just" read from a G-buffer because, while you're rendering to that G-buffer, it exists in special framebuffer memory that isn't in the image yet.
Both features primarily exist for tile-based GPUs, which are common in mobile but, historically, uncommon on desktop computers. That's why DX12 doesn't have an equivalent, and Metal (iOS) does. Though both Nvidia's and AMD's recent architectures do a variant of tile-based rendering now also, and with the recent Windows-on-ARM PCs using Qualcomm chips (tile-based GPU), it will be interesting to see how DX12 evolves.
The benefit of render passes is that during pixel shading, you can keep the framebuffer data in on-chip memory instead of constantly reading and writing external memory. Caches help some, but without reordering pixel shading, the cache tends to thrash quite a bit since it's not large enough to store the entire framebuffer. A related benefit is you can avoid reading in previous framebuffer contents if you're just going to completely overwrite them anyway, and avoid writing out framebuffer contents at the end of the render pass if they're not needed after it's over. In many applications, tile-based GPUs never have to read and write depth buffer data or multisample data to or from external memory, which saves a lot of bandwidth and power.
Subpasses are an advanced feature that, in some cases, allows the driver to effectively merge multiple render passes into one. The goal and underlying mechanism are similar to the OpenGL ES Pixel Local Storage extension, but the API is a bit different in order to let more GPU architectures support it and to make it more extensible and future-proof. The classic example where this helps is basic deferred shading: the first subpass writes out G-buffer data for each pixel, and later subpasses use that data to light and shade pixels. G-buffers can be huge, so keeping all of that on-chip and never having to read or write it to main memory is a big deal, especially on mobile GPUs, which tend to be more bandwidth- and power-constrained.
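A minimal sketch of that deferred-shading shape: a two-subpass render pass where subpass 0 writes a single G-buffer attachment and subpass 1 consumes it as an input attachment while writing the final color (a real G-buffer would have several such attachments; device is assumed):

```c
VkAttachmentDescription attachments[2] = {
    { /* 0: G-buffer; DONT_CARE store means it can stay in tile memory */
      .format = VK_FORMAT_R16G16B16A16_SFLOAT,
      .samples = VK_SAMPLE_COUNT_1_BIT,
      .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
      .storeOp = VK_ATTACHMENT_STORE_OP_DONT_CARE,
      .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
      .finalLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL },
    { /* 1: final color, presented afterwards */
      .format = VK_FORMAT_B8G8R8A8_UNORM,
      .samples = VK_SAMPLE_COUNT_1_BIT,
      .loadOp = VK_ATTACHMENT_LOAD_OP_CLEAR,
      .storeOp = VK_ATTACHMENT_STORE_OP_STORE,
      .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
      .finalLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR },
};
VkAttachmentReference gbufWrite = { 0, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkAttachmentReference gbufRead  = { 0, VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL };
VkAttachmentReference color     = { 1, VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL };
VkSubpassDescription subpasses[2] = {
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .colorAttachmentCount = 1, .pColorAttachments = &gbufWrite },
    { .pipelineBindPoint = VK_PIPELINE_BIND_POINT_GRAPHICS,
      .inputAttachmentCount = 1, .pInputAttachments = &gbufRead,
      .colorAttachmentCount = 1, .pColorAttachments = &color },
};
/* Subpass 1 reads subpass 0's output at the same pixel, so the
 * dependency can be by-region: data never has to leave the tile. */
VkSubpassDependency dep = {
    .srcSubpass = 0, .dstSubpass = 1,
    .srcStageMask = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT,
    .dstStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT,
    .srcAccessMask = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT,
    .dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT,
};
VkRenderPassCreateInfo rpInfo = {
    .sType = VK_STRUCTURE_TYPE_RENDER_PASS_CREATE_INFO,
    .attachmentCount = 2, .pAttachments = attachments,
    .subpassCount = 2, .pSubpasses = subpasses,
    .dependencyCount = 1, .pDependencies = &dep,
};
VkRenderPass renderPass;
vkCreateRenderPass(device, &rpInfo, NULL, &renderPass);
```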

UIImage Memory Problems

In my app, an API returns the URLs of images, which I want to display in the app. This is all well and good, except I started to notice that when I am given, and load, very high-res images, my app's memory usage spikes by 200+ MB, often causing it to crash, which is unacceptable.
In one particular example, I am given an image of dimensions 8100×5400 pixels. When the app loaded this image, it crashed.
I first thought the problem was a memory leak I had created, but after doing some research, it seems to be an unavoidable issue related to the size of the image: since the image is 43,740,000 pixels and each pixel uses 4 bytes, the decoded image will occupy a minimum of 174,960,000 bytes, or about 175 MB.
The problem is I cannot control the size of the images sent by the API; they may be any resolution, possibly even larger. Obviously a plain UIImage will not work for my purposes.
Is there any other way I can display an image with a potentially massive resolution without causing app-crashing memory usage?
Instead of downloading the image as data into memory, which will crash your app, download it as data to disk, which will not.
You can then use the Image I/O framework to load a smaller version of the image which won't take up so much memory.
(Note that you should never attempt to display an image larger than the actual display size that you need, since that's a massive waste of memory. So, even if you can't help downloading the large image, you can at least load and display a version that is the actual much smaller size you need.)
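As a sketch of that approach using Image I/O's C API (callable as-is from Objective-C, or via a small bridge from Swift; LoadDownsampledImage is a hypothetical helper name):

```c
#include <ImageIO/ImageIO.h>
#include <CoreGraphics/CoreGraphics.h>

/* Create a downsampled CGImage from a file on disk without ever decoding
 * the full-resolution bitmap into memory. Wrap the result in a UIImage
 * with [UIImage imageWithCGImage:]; the caller owns the returned image. */
CGImageRef LoadDownsampledImage(CFURLRef fileURL, int maxPixelSize) {
    CGImageSourceRef source = CGImageSourceCreateWithURL(fileURL, NULL);
    if (!source) return NULL;

    CFNumberRef maxSize = CFNumberCreate(NULL, kCFNumberIntType, &maxPixelSize);
    const void *keys[] = {
        kCGImageSourceCreateThumbnailFromImageAlways, /* don't rely on an embedded thumbnail */
        kCGImageSourceCreateThumbnailWithTransform,   /* honor EXIF orientation */
        kCGImageSourceThumbnailMaxPixelSize,          /* cap the longest side */
    };
    const void *values[] = { kCFBooleanTrue, kCFBooleanTrue, maxSize };
    CFDictionaryRef options = CFDictionaryCreate(NULL, keys, values, 3,
                                                 &kCFTypeDictionaryKeyCallBacks,
                                                 &kCFTypeDictionaryValueCallBacks);

    CGImageRef image = CGImageSourceCreateThumbnailAtIndex(source, 0, options);
    CFRelease(options);
    CFRelease(maxSize);
    CFRelease(source);
    return image; /* memory cost is now bounded by maxPixelSize, not the file */
}
```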