Vulkan compute shader direct access to CPU-allocated memory

Is there any way in a Vulkan compute shader to bind a specific location in CPU memory, so that I can access it directly in the shader language?
For example, if I have a variable declaration int a[] = {contents........};, can I bind the address of a to, say, binding location 0 and then access it in GLSL with something like this:
layout(std430, binding = 0) buffer Data {
    int a[];
};
I want to do this because I don't want to spend time writing to and reading from a buffer.

Generally, you cannot make the GPU access memory that Vulkan did not allocate itself for the GPU. The exceptions to this are external allocations made by other APIs that are themselves allocating GPU-accessible memory.
Just taking a random stack or global pointer and shoving it at Vulkan isn't going to work.
I want something like cudaHostGetDevicePointer in CUDA
What you're asking for here is not what that function does. That function takes a CPU pointer to CPU-accessible memory which CUDA allocated for you and which you previously mapped into a CPU address range. The pointer you give it must be within a mapped region of GPU memory.
You can't just shove a stack/global variable at it and expect it to work. The variable would have to be within the mapped allocation, and a global or stack variable can't be within such an allocation.
Vulkan doesn't have a way to reverse-engineer a pointer into a mapped range of device memory back to the VkDeviceMemory object it was mapped from. This is in part because Vulkan doesn't have pointers to allocations; you have to use a VkDeviceMemory object, which you create and manage yourself. But if you need to know where a CPU-accessible pointer was mapped from, you can keep track of that yourself.
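If you do need that reverse lookup, a minimal C sketch of tracking it yourself might look like the following; the MappedRange bookkeeping is purely illustrative, not part of the Vulkan API:

    #include <vulkan/vulkan.h>
    #include <stddef.h>

    /* Hypothetical bookkeeping: remember what you mapped, and from where. */
    typedef struct MappedRange {
        void          *ptr;     /* result of vkMapMemory       */
        VkDeviceSize   size;    /* size of the mapped range    */
        VkDeviceMemory memory;  /* the allocation it came from */
    } MappedRange;

    static MappedRange g_ranges[64];
    static size_t      g_rangeCount;

    /* Record a mapping right after vkMapMemory succeeds. */
    static void rememberMapping(void *ptr, VkDeviceSize size, VkDeviceMemory mem) {
        g_ranges[g_rangeCount++] = (MappedRange){ ptr, size, mem };
    }

    /* Reverse lookup: which VkDeviceMemory does this CPU pointer belong to? */
    static VkDeviceMemory lookupMemory(const void *p) {
        for (size_t i = 0; i < g_rangeCount; ++i) {
            const char *base = (const char *)g_ranges[i].ptr;
            if ((const char *)p >= base && (const char *)p < base + g_ranges[i].size)
                return g_ranges[i].memory;
        }
        return VK_NULL_HANDLE;
    }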

I want to do this because I don't want to spend time writing to and reading from a buffer.
Vulkan is exactly for people who do want to spend time managing how the data flows. You might want to consider some rapid-prototyping framework or math library instead.
Is there any way in a Vulkan compute shader to bind a specific location in CPU memory
Yes, but it won't save you any time.
Firstly, Vulkan does allow allocation of CPU-accessible memory via memory types with VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. So you could allocate your data in that VkDeviceMemory, map it, do your CPU work in that address space, and then use it on the GPU.
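A hedged C sketch of that first route, assuming the device, physical device, and a created VkBuffer already exist (error handling omitted):

    #include <vulkan/vulkan.h>
    #include <string.h>

    /* Pick a memory type that is HOST_VISIBLE|HOST_COHERENT and allowed by
       the buffer's memory requirements. The spec guarantees such a type
       exists, but be defensive anyway. */
    static uint32_t pickHostVisibleType(VkPhysicalDevice phys, uint32_t allowedTypes) {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(phys, &props);
        const VkMemoryPropertyFlags wanted =
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
        for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
            if ((allowedTypes & (1u << i)) &&
                (props.memoryTypes[i].propertyFlags & wanted) == wanted)
                return i;
        return UINT32_MAX;
    }

    void uploadExample(VkDevice device, VkPhysicalDevice phys, VkBuffer buffer,
                       const int *a, size_t bytes) {
        VkMemoryRequirements reqs;
        vkGetBufferMemoryRequirements(device, buffer, &reqs);

        VkMemoryAllocateInfo alloc = {
            .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
            .allocationSize  = reqs.size,
            .memoryTypeIndex = pickHostVisibleType(phys, reqs.memoryTypeBits),
        };
        VkDeviceMemory memory;
        vkAllocateMemory(device, &alloc, NULL, &memory);
        vkBindBufferMemory(device, buffer, memory, 0);

        /* Do the CPU work directly in the mapped range; no separate copy pass. */
        void *mapped;
        vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);
        memcpy(mapped, a, bytes);
    }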
The second way is to use the VK_EXT_external_memory_host extension, which allows you to import your pointer into Vulkan as a VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT handle. But it is involved in its own way, and the driver might simply say "nope", leaving you back at square one.
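For illustration, a hedged sketch of the import path, assuming the extension is enabled and hostPtr/size already satisfy the device's minImportedHostPointerAlignment limit:

    #include <vulkan/vulkan.h>

    VkDeviceMemory importHostAllocation(VkDevice device, void *hostPtr,
                                        VkDeviceSize size, uint32_t memoryTypeIndex) {
        VkImportMemoryHostPointerInfoEXT importInfo = {
            .sType        = VK_STRUCTURE_TYPE_IMPORT_MEMORY_HOST_POINTER_INFO_EXT,
            .handleType   = VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
            .pHostPointer = hostPtr,
        };
        VkMemoryAllocateInfo alloc = {
            .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
            .pNext           = &importInfo,
            .allocationSize  = size,
            /* memoryTypeIndex must come from vkGetMemoryHostPointerPropertiesEXT */
            .memoryTypeIndex = memoryTypeIndex,
        };
        VkDeviceMemory memory = VK_NULL_HANDLE;
        /* The driver may refuse; check the result instead of assuming success. */
        if (vkAllocateMemory(device, &alloc, NULL, &memory) != VK_SUCCESS)
            return VK_NULL_HANDLE;
        return memory;
    }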

Related

How do functions access locals in stack frames?

I've read that stack frames contain return addresses, function arguments, and local variables for a function. Since functions don't know where their stack frame is in memory at compile time, how do they know the memory address of their local variables? Do they offset and dereference the stack pointer for every read or write of a local? In particular, how does this work on embedded devices without efficient support for pointer accesses, where load and store addresses have to be hardcoded into the firmware and pointer accesses go through reserved registers?
The way objects work is that the compiler or assembly programmer determines the layout of an object — the offset of each field relative to the start of the object (as well as the size of the object as a whole).  Then, objects are passed and stored as references, which are generally pointers in C and machine code.  In struct xy { int x; int y; }, we can reason that x is at offset 0 and y at offset 4 from an object reference to a struct xy.
The stack frame is like an object that contains a function's memory-based local variables (instead of struct members), and being accessed not by an object reference but by the stack or frame pointer.  (And being allocated/deallocated by stack pointer decrement/increment, instead of malloc and free.)
Both share the issue that we don't know the actual location/address of a given field (x or y) of a dynamically allocated object, or the stack-frame position of a memory-based local variable, until runtime.  But when a function runs, it can compute the complete absolute address (of object fields or memory-based local variables) quite simply by adding the base (object reference or stack/frame pointer) to the relative position of the desired item, knowing its predetermined layout.
Processors offer addressing modes that help to support this kind of access, usually something like base + displacement.
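As a small illustration in standard C, the offsets are fixed at compile time, and an access is just base plus displacement; the pointer arithmetic below spells out what ref->y compiles down to:

    #include <stdio.h>
    #include <stddef.h>

    struct xy { int x; int y; };

    int main(void) {
        struct xy obj = { 1, 2 };
        struct xy *ref = &obj;              /* the "base" */

        /* Layout decided at compile time: typically x at 0, y at 4. */
        printf("offset of x: %zu\n", offsetof(struct xy, x));
        printf("offset of y: %zu\n", offsetof(struct xy, y));

        /* Base + displacement, then load: equivalent to ref->y. */
        int y = *(int *)((char *)ref + offsetof(struct xy, y));
        printf("y = %d\n", y);
        return 0;
    }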
Let's also note that many local variables are assigned directly to CPU registers, so they have no memory address at all.  Other local variables move between memory and CPU registers; this can be seen as an optimization that means we don't have to access memory when a variable's value is needed if it has recently been loaded into a CPU register.
In many ways, processors for embedded devices are like other processors, offering addressing modes to help with memory accesses, and with optimizing compilers that can make good decisions about where a variable lives.  As you can tell from the above, not all variables need live in memory, and some live both in memory and in CPU registers to help reduce memory access costs.
The answer is: it depends on the architecture. You will have a register that contains the address of the current stack frame (EBP on x86, for instance). Once you know this, individual variables are identified by their offsets into the stack frame, calculated from object sizes at compile time (hence the need for the size of local variables to be known at compile time).
Even if a stack frame appears in different places in memory, the variables will have the same relative offset, so you can always calculate the address.
The size of the stack frame for each function is calculated at compile time and included in the code, so that each call can set up and clean up its own frame.
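A quick way to observe this, hedged as a demonstration rather than guaranteed behavior (the compiler is free to place the local anywhere, or keep it in a register until its address is taken):

    #include <stdio.h>

    /* Each recursive call gets its own frame, so `local` has a different
       absolute address each time, even though the code that accesses it
       never changes: it is always the same offset from the frame/stack
       pointer. */
    static void recurse(int depth) {
        int local = depth;
        printf("depth %d: &local = %p\n", depth, (void *)&local);
        if (depth < 3)
            recurse(depth + 1);
    }

    int main(void) {
        recurse(0);
        return 0;
    }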

Create MPSImage from pre-allocated data pointer

In the application I'm working on, there are chunks of pre-allocated memory that are filled with image data at one point. I need to wrap this data in an MPSImage to use it with Metal's MPS CNN filters.
From looking at the Docs it seems like there's no easy way to do this without copying the data into either the MPSImage or an MTLTexture.
Do you know of a way to achieve this without copying, from pre-allocated pointers?
Thanks!
You can allocate an MTLTexture backed by an MTLBuffer created with the bytesNoCopy constructor, and then create an MPSImage from that MTLTexture with initWithTexture:featureChannels:.
Keep in mind though that in this case the texture won't be in an optimal layout for GPU access, so this is a memory vs performance trade off.
Also keep in mind that the bytesNoCopy constructor accepts only addresses aligned to a virtual-memory page boundary, and the driver needs to make sure that that memory is resident when you submit a command buffer that uses it.
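Pieced together, a hedged Swift sketch of that path; the dimensions and pixel format are placeholders, and you should verify your bytesPerRow against the device's minimumLinearTextureAlignment(for:) value:

    import Foundation
    import Metal
    import MetalPerformanceShaders

    let device = MTLCreateSystemDefaultDevice()!
    let width = 512, height = 512                 // placeholder dimensions
    let bytesPerRow = width * 4                   // .rgba8Unorm: 4 bytes/pixel
    let pageSize = Int(getpagesize())
    let length = ((bytesPerRow * height) + pageSize - 1) / pageSize * pageSize

    // bytesNoCopy requires a page-aligned pointer and a page-multiple length.
    var raw: UnsafeMutableRawPointer?
    posix_memalign(&raw, pageSize, length)

    // Wrap the pre-allocated memory in an MTLBuffer without copying.
    let buffer = device.makeBuffer(bytesNoCopy: raw!, length: length,
                                   options: .storageModeShared,
                                   deallocator: nil)!

    // Create a linear (buffer-backed) texture view over that memory.
    let desc = MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                        width: width, height: height,
                                                        mipmapped: false)
    desc.storageMode = .shared
    desc.usage = .shaderRead
    let texture = buffer.makeTexture(descriptor: desc, offset: 0,
                                     bytesPerRow: bytesPerRow)!

    // Wrap the texture in an MPSImage for the MPS CNN filters.
    let image = MPSImage(texture: texture, featureChannels: 4)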

Does it make sense to read from host memory in a compute shader to save a copy?

This answer suggests using a compute shader to convert packed 3-channel image data to a 4-channel texture on the GPU. Instead of copying the 3-channel image to the GPU before decoding it, is it a good idea to write it to a host-visible buffer and read that directly in the compute shader?
It would save a buffer on the GPU, but I don't know if the CPU-GPU buffer copy is done in some clever way that this would defeat.
Well, the first question you need to ask is whether the Vulkan implementation even allows a CS to directly read from host-visible memory. Vulkan implementations have to allow you to create SSBOs in some memory type, but it doesn't have to be a host-visible one.
So even if you want to do this, you'll need to provide a code path for what happens when you can't (or just fail out early on such implementations).
The next question is whether the host-visible memory types that you can put an SSBO into are also device-local. Integrated GPUs that have only one memory pool expose memory that is both host-visible and device-local, so there's no point in ever doing a copy on those (and they obviously can't refuse to let you create an SSBO in them).
But many/most discrete GPUs also have memory types that are both host-visible and device-local. These are usually around 256MB in size, regardless of how much actual GPU memory the cards have, and they're intended to be used for streamed data that changes every frame. Of course, GPUs don't necessarily have to allow you to use them for SSBOs.
Should you use such memory types for this kind of image fiddling? You have to profile to know. You'll also have to take into account whether your application has ways to hide the DMA upload latency, which would allow you to ignore the cost of transferring the data to non-host-visible memory.
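The querying half of that is mechanical. A hedged C sketch, assuming allowedTypes is the memoryTypeBits field that vkGetBufferMemoryRequirements returned for your SSBO:

    #include <vulkan/vulkan.h>

    /* Returns the index of a memory type usable for the SSBO, preferring
       DEVICE_LOCAL + HOST_VISIBLE, then any HOST_VISIBLE type, and
       UINT32_MAX if direct host reads are not an option at all. */
    static uint32_t pickSsboMemoryType(VkPhysicalDevice phys, uint32_t allowedTypes) {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(phys, &props);

        const VkMemoryPropertyFlags ideal =
            VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;

        uint32_t fallback = UINT32_MAX;
        for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
            if (!(allowedTypes & (1u << i)))
                continue; /* this type can't back the buffer at all */
            VkMemoryPropertyFlags f = props.memoryTypes[i].propertyFlags;
            if ((f & ideal) == ideal)
                return i; /* best case: no copy, device-speed access */
            if ((f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) && fallback == UINT32_MAX)
                fallback = i;
        }
        return fallback; /* may still be UINT32_MAX: then you must stage */
    }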

Is there a way to map a host-cached Vulkan buffer to a specific memory location?

Vulkan is able to import host memory using VkImportMemoryHostPointerInfoEXT. I queried the supported memory types for VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT but the only kind of memory that was available for it was coherent, which does not work for my use case. The memory needs to use explicit invalidations/flushes for performance reasons. So really, I don't want the API to allocate any host-side memory, I just want to tell it the base address that the buffer should upload from/download to. Otherwise I have to use intermediate copies. Using the address returned by vkMapMemory for the host-side work is not desirable for my use-case.
If the Vulkan implementation does not allow you to import memory allocations as "CACHED", then you can't force it to do so. The API provides the opportunity for the implementation to advertise the ability to import your allocations as "CACHED", but the implementation explicitly refused to do it.
Which probably means that it can't. And you can't make the implementation do something it can't do.
So if you have some API that created and manipulates some memory (which cannot use memory provided by someone else), and the Vulkan implementation won't allow reading from that memory unless it is allowed to remove the cached nature of the allocation, and you need CPU caching of that memory, then you're going to have to fall back on memcpy.
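For reference, the query being described ("I queried the supported memory types...") looks roughly like this in C; a sketch only, and note that the extension function must be fetched with vkGetDeviceProcAddr:

    #include <vulkan/vulkan.h>
    #include <stdbool.h>

    /* Ask which memory types this host pointer could be imported into, then
       check whether any of them is HOST_CACHED. If none is, the fallback is
       memcpy into memory Vulkan allocated itself. */
    static bool canImportAsCached(VkDevice device, VkPhysicalDevice phys, void *hostPtr) {
        PFN_vkGetMemoryHostPointerPropertiesEXT getProps =
            (PFN_vkGetMemoryHostPointerPropertiesEXT)
                vkGetDeviceProcAddr(device, "vkGetMemoryHostPointerPropertiesEXT");

        VkMemoryHostPointerPropertiesEXT hp = {
            .sType = VK_STRUCTURE_TYPE_MEMORY_HOST_POINTER_PROPERTIES_EXT,
        };
        if (getProps(device, VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT,
                     hostPtr, &hp) != VK_SUCCESS)
            return false;

        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(phys, &props);
        for (uint32_t i = 0; i < props.memoryTypeCount; ++i)
            if ((hp.memoryTypeBits & (1u << i)) &&
                (props.memoryTypes[i].propertyFlags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT))
                return true;
        return false;
    }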
I want to mirror memory between the CPU and GPU so that I can access it from either without an implicit PCI-e bus transfer.
If the GPU is discrete, that's impossible. In a discrete GPU setup, the GPU and the CPU have separate local memory pools, and access to either pool from the other requires some form of PCIe transfer operation. Vulkan lets you pick which one is going to have slower access, but one of them will have slower access to the memory.
If the GPU is integrated, then typically there is only one memory pool and one memory type for it. That type will be both local and coherent (and probably cached too), which represents fast access from both devices.
Whether you use VkImportMemoryHostPointerInfoEXT or vkMapMemory on a non-DEVICE_LOCAL_BIT heap, you will typically get a COHERENT memory type.
Because, well, conventional host heap memory from malloc in C is naturally coherent (and CPUs typically have automatic cache-coherency mechanisms). There is no cflush() nor cinvalidate() in C.
There is no reason for implicit PCI-e transfers when reading or writing such memory from the host side. Of course, the dedicated GPU has to read it somehow, so there will be bus transfers when the device tries to access that memory. Alternatively, you need explicit memory in a DEVICE_LOCAL_BIT heap, and to transfer data between the two explicitly via vkCmdCopy* to keep them the same.
Actual UMA architectures could have a non-COHERENT memory type. But their memory heap is always advertised as DEVICE_LOCAL_BIT (even if it is the main memory).
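For completeness, on a HOST_VISIBLE but non-COHERENT memory type, the explicit equivalents are vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges. A hedged sketch, assuming `memory` is already mapped at `mapped`:

    #include <vulkan/vulkan.h>
    #include <string.h>

    /* Ranges must respect the nonCoherentAtomSize limit; VK_WHOLE_SIZE
       sidesteps that for simple cases. */
    void writeThenFlush(VkDevice device, VkDeviceMemory memory,
                        void *mapped, const void *src, size_t bytes) {
        memcpy(mapped, src, bytes);   /* CPU writes may sit in the CPU cache */

        VkMappedMemoryRange range = {
            .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
            .memory = memory,
            .offset = 0,
            .size   = VK_WHOLE_SIZE,
        };
        vkFlushMappedMemoryRanges(device, 1, &range); /* make writes visible to the GPU */
        /* ...and call vkInvalidateMappedMemoryRanges(device, 1, &range)
           before the CPU reads back data the GPU wrote. */
    }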

Best way to store animated vertex data

From what I understand there are several methods for storing and transferring vertex data to the GPU.
Using a temporary staging buffer and copying it to discrete GPU memory every frame
Using a shared buffer (which is slow?) and just updating that shared buffer every frame
Storing the staging buffer for each mesh permanently instead of recreating it every frame and copying it to the GPU
Which method is best for storing animated mesh data that changes rapidly?
It depends on the hardware and the memory types it advertises. Note that all of the following requires you to use vkGetBufferMemoryRequirements to check whether the memory type can support the usages you need.
If the hardware advertises a memory type that is both DEVICE_LOCAL and HOST_VISIBLE, then you should use that instead of staging. You still need to double-buffer this, since you cannot write to data that the GPU is reading from, and you don't want to synchronize with the GPU unless the GPU is more than a frame late. This is something you should measure; your needs may require a triple buffer, so design your system to be flexible.
Note that some hardware has two different heaps that are DEVICE_LOCAL, but only one of them will have HOST_VISIBLE memory types for them. So pay attention to those cases.
If there is no such memory type (or if the memory type doesn't support the buffer usages you need), then you need to profile this. The two alternatives are:
Staging (via a dedicated transfer queue, where available) to a DEVICE_LOCAL memory type, where the data eventually gets used.
Directly using a non-DEVICE_LOCAL memory type.
Note that both of these require buffering, since you want to avoid synchronization as much as possible. Staging through a transfer queue will also require a semaphore, since you need to make sure that the graphics queue doesn't try to use the memory until the transfer queue is done with it. It also means you need to deal with resource sharing between queues.
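As a sketch of the double-buffered, host-visible path described above; the FrameSlot struct and the drawWithVertexBuffer call are hypothetical stand-ins for your own frame code:

    #include <vulkan/vulkan.h>
    #include <string.h>

    #define FRAMES_IN_FLIGHT 2   /* measure; some workloads want 3 */

    /* One slot per frame in flight, all in a DEVICE_LOCAL|HOST_VISIBLE type.
       Fences are created in the signaled state so frame 0 doesn't deadlock. */
    typedef struct FrameSlot {
        VkBuffer       buffer;
        VkDeviceMemory memory;
        void          *mapped;   /* persistently mapped */
        VkFence        inFlight; /* signaled when the GPU is done with this slot */
    } FrameSlot;

    void streamVertices(VkDevice device, FrameSlot slots[FRAMES_IN_FLIGHT],
                        uint64_t frame, const void *vertices, size_t bytes) {
        FrameSlot *slot = &slots[frame % FRAMES_IN_FLIGHT];

        /* Only blocks if the GPU is more than FRAMES_IN_FLIGHT-1 frames behind. */
        vkWaitForFences(device, 1, &slot->inFlight, VK_TRUE, UINT64_MAX);
        vkResetFences(device, 1, &slot->inFlight);

        /* Safe to overwrite now; nothing in flight reads this slot. */
        memcpy(slot->mapped, vertices, bytes);

        /* Hypothetical: record + submit work that reads slot->buffer and
           signals slot->inFlight on completion. */
        /* drawWithVertexBuffer(slot->buffer, slot->inFlight); */
    }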
Personally though, I would try to avoid CPU-animated vertex data whenever possible. Vulkan-capable GPUs are perfectly capable of doing any animating themselves. GPUs have been doing bone-weighted skinning (even dual-quaternion-based) for over a decade now. Even vertex palette animation is something the GPU can do: summing the various source vertices to reach the final answer. So scenes with lots of CPU-generated vertex data should be relatively rare.