Vulkan on devices that share host memory

Vulkan on devices that share host memory - vulkan

For the purpose of this question, we'll say vkMapMemory for all allocations on such a device cannot fail; they are trivially host-visible, and the result is a direct pointer to some other region of host memory (no work needs to be done).
Is there some way to detect this situation?
The purpose in mind is an arena-based allocator that aggressively maps any host-visible memory, and an objective is to avoid redundant allocations on such hardware.

Yes, it can be detected relatively reliably.
If vkGetPhysicalDeviceMemoryProperties has only one Memory Heap (which would be labeled VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) then it is certain it is the same memory as host.
In words of the authors:
https://www.khronos.org/registry/vulkan/specs/1.0-extensions/html/vkspec.html#memory-device
In a unified memory architecture (UMA) system, there is often only a single memory heap which is considered to be equally “local” to the host and to the device, and such an implementation must advertise the heap as device-local.
In other cases you know trivially if the memory is on the host (i.e. the given Memory Heap on dGPU would not have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set)

Though, implementations for UMA-based systems described by #krOoze have little reason to not expose direct pointers to buffer data.
Your question seems to proceed from a false assumption.
Vulkan is not OpenGL. Generally speaking, it does not try to hide things from you. If a memory heap cannot be accessed directly by the CPU, then the Vulkan implementation will not expose a memory type for that heap that is host-visible. Conversely, if a memory heap can be accessed directly by the CPU, then the Vulkan implementation will expose a memory type for that heap that is host-visible.
Therefore, if you can map a device allocation at all in Vulkan, then you should assume that you have a "direct pointer to buffer data".

Related

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations which are platform depended.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an mcu with 256KB FLASH memory and 64KB RAM.
How does it know how much available RAM there is from my program?

For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc in itself could be implemented in different ways indeed. Either you include a "header" together with each allocated segment, the header stating the allocated size and potentially the address of the next available free segment. Or you could implement it as a look-up table where each item is a pointer to the first element and the size.
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason being that it doesn't make any sense. You don't want arbitrary behavior, you want deteministic behavior. You want to allocate x amount of memory for the worst case and if a heap was to be used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then comes all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. A MCU is not a PC.

Is there a way to map a host-cached Vulkan buffer to a specific memory location?

Vulkan is able to import host memory using VkImportMemoryHostPointerInfoEXT. I queried the supported memory types for VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT but the only kind of memory that was available for it was coherent, which does not work for my use case. The memory needs to use explicit invalidations/flushes for performance reasons. So really, I don't want the API to allocate any host-side memory, I just want to tell it the base address that the buffer should upload from/download to. Otherwise I have to use intermediate copies. Using the address returned by vkMapMemory for the host-side work is not desirable for my use-case.

If the Vulkan implementation does not allow you to import memory allocations as "CACHED", then you can't force it to do so. The API provides the opportunity for the implementation to advertise the ability to import your allocations as "CACHED", but the implementation explicitly refused to do it.
Which probably means that it can't. And you can't make the implementation do something it can't do.
So if you have some API that created and manipulates some memory (which cannot use memory provided by someone else), and the Vulkan implementation won't allow reading from that memory unless it is allowed to remove the cached nature of the allocation, and you need CPU caching of that memory, then you're going to have to fall back on memcpy.
I want to mirror memory between the CPU and GPU so that I can access it from either without an implicit PCI-e bus transfer.
If the GPU is discrete, that's impossible. In a discrete GPU setup, the GPU and the CPU have separate local memory pools, and access to either pool from the other requires some form of PCIe transfer operation. Vulkan lets you pick which one is going to have slower access, but one of them will have slower access to the memory.
If the GPU is integrated, then typically there is only one memory pool and one memory type for it. That type will be both local and coherent (and probably cached too), which represents fast access from both devices.

Whether VkImportMemoryHostPointerInfoEXT or vkMapMemory of non-DEVICE_LOCAL_BIT heap, you will typically get a COHERENT memory type.
Because well, the conventional host heap memory from malloc in C is naturally coherent (and the CPUs do typically have automatic cache-coherency mechanisms). There is no cflush() nor cinvalidate() in C.
There is no reason for there being implicit PCI-e transfers when R\W such memory from the Host side. Of course, the dedicated GPU has to read it somehow, so there would be bus transfers when the deviced tries to access the memory. Or you need to have an explicit memory in DEVICE_LOCAL_BIT heap, and transfer data between the two explicitly via vkCmdCopy* to keep them the same.
Actual UMA achitectures could have a non-COHERENT memory type. But their memory heap is always advertised as DEVICE_LOCAL_BIT (even if it is the main memory).

Are JVM heap/stack different than virtual address space heap/stack?

Memory is divided into "segments" called heap, stack, bss, data, and text. However, the JVM also has these concepts of stack and heap. So how are these two reconciled?
Are they different levels of abstraction, where main memory is one or two levels below the JVM, and whose "segments" maps naturally to JVM's "segments"? Since JVM is supposed to be a virtual computer, it seems to me like they emulate what happens underneath but at a higher level of abstraction.

Sounds to me like you've been reading a textbook or similar. All these terms generally have very precise definitions in books/lectures, but a lot less precise definitions in reality. Therefore what people mean when they say heap is not necessarily exactly the same as what a book etc. says.
Memory is divided into "segments" called heap, stack, bss, data, and text.
This is only true for a typical user space process. In other word this will be true for an everyday program written in c or similar, however it is not true for all programs, and definitely not true for the entire memory space.
When a program is executed the OS allocates memory for the various segments listed, except the the heap. The program can request memory from the OS while it is executing. This allows a program to use a different amount of memory depending on its needs. The heap refers to memory requested by the program usually via a function like malloc. To clarify the heap typically refers to a managed region of memory, usually managed with malloc/free. It is also possible to request memory directly from the OS, in an unmanaged fashion. Most people (Imo) would say this wouldn't count as part of the heap.
The stack is a data structure/segment which keeps track of local variables and function calls. It stores important information like where to return after a function call. In c or other "native" languages the stack is created by the OS and can grow or shrink if needed.
Java allows program to request memory during execution using new. Memory allocated to a java program using new is referred to as memory in the java heap. One could imagine that if you where implementing a Jvm you would use malloc behind the scenes of new. This would result in a java heap within a regular native heap. In reality "serious" jvms do not do this and interact directly with the OS for memory.
In Java the stack is created by the Jvm. One could imagine that this is allocated by malloc, but as with the heap this is likely not how real world jvms do it.
Edit:
A Jvm like hotspot. Would likely allocate memory directly from the OS. This memory would then get put into some kind of pool, from which it would be removed as needed. Reasons for needed memory would be needed includes new, or a stack that needs to grow.

Why Vulkan has a limit of memory allocations?

Is there any technical reasons to limit the maximum number of memory allocations?
Check out vkAllocateMemory manual page. It says:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. If a call to vkAllocateMemory would cause the total number of allocations to exceed these limits, such a call will fail and must return VK_ERROR_TOO_MANY_OBJECTS.
OpenGL doesn't limit allocations, DirectX 11/12 neither. So why should Vulkan do so?

As explained here, this is primarily an OS limitation.
OpenGL doesn't limit allocations, DirectX 11/12 neither
Oh, they do. They just don't tell you about it.
OpenGL and DX11 drivers tend to internally do large GPU (virtual) allocations and perform sub-allocations from those when you allocate memory. Thus, they can create the illusion that you can perform more hardware allocations. But the limitation is still there.
As for DX12, I'm fairly sure that if you try to allocate more than 4096 heaps, you will find CreateHeap returning errors.
Vulkan is simply the API that's up-front about the existence of the limitation.
With Vulkan, this is simply a problem that should never arise. If you're performing over a thousand individual memory allocations, your memory allocation scheme is wrong. You're supposed to allocate a few large slabs of memory, then use sub-sections of them for your textures and buffers.

Are HOST_CACHED_BIT and HOST_COHERENT_BIT contradicting each other?

There are two types of memory in Vulkan buzzling me:
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT bit indicates that the host cache
management commands vkFlushMappedMemoryRanges and
vkInvalidateMappedMemoryRanges are not needed to flush host writes to
the device or make device writes visible to the host, respectively.
VK_MEMORY_PROPERTY_HOST_CACHED_BIT bit indicates that memory allocated
with this type is cached on the host. Host memory accesses to uncached
memory are slower than to cached memory, however uncached memory is
always host coherent.
From what I understand is that modification of memory of type COHERENT is seen immediately by both the host and the device, and modifications to memory of type CACHED may not be seen immediately by the host and/or the device, i.e. invalidating/flushing the memory is needed to invalidate the cache.
I have seen some implementations combine both flags, and it is valid combinations according to the 10.2. Device Memory section in the documentation. Isn't there a contradictory (cached and coherent)?

Cached/coherent memory effectively means that the GPU can see the CPU's caches. This often happens on architectures where the GPU and the CPU are sitting on the same chip. The GPU is effectively just another core on the CPU's die, with access to the CPU's core.
But it can happen on other architectures as well. Some standalone GPUs offer cached/coherent memory. Indeed, most of them don't offer cached memory without coherency. From an architectural standpoint, it represents some way for the GPU to access data through at least part of the CPU cache.
The key thing about cached/coherent memory you should remember is this: if there is an alternative memory type for that memory pool, then the alternative is probably faster for the device to access. Also, if alternatives exist, it is entirely possible that the device may not be able to have images or buffers of certain types/formats stored in such memory types. So unless you really need cached memory access from the CPU, or the device offers no alternative, it's best to avoid it.

There are cache schemes that monitor write accesses to RAM over the memory bus to invalidate the Host's cache when the memory is written to.
This allows the best of both worlds, cached coherent accesses but at the cost of a more complex architecture.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas