How to get the current allocation count in Vulkan?

I'm writing a memory manager in my project to manage Vulkan memory allocation. In practice, the allocation count must stay below maxMemoryAllocationCount, so I count all allocations in my app and check, on each allocation, whether the count would exceed maxMemoryAllocationCount.
However, I think this design is buggy, because other apps could also allocate memory from the same device. I would need the allocation count as tracked by the device itself, but I couldn't find any API for that.
So am I missing something, or is maxMemoryAllocationCount application-local?

other apps could also allocate memory from the same device
No, they cannot.
They can allocate memory from the same physical device. But they cannot allocate memory from the same VkDevice object. Such objects are specific to the process and cannot be shared. The allocations can be shared, but not the devices themselves (note that a shared allocation counts against the limit on all devices that can access it).
The specification is very clear that this is bound to a specific VkDevice:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation-or-platform-dependent limits. The maxMemoryAllocationCount feature describes the number of allocations that can exist simultaneously before encountering these internal limits.
When the specification says "device", unless it makes it clear otherwise, it means "VkDevice", not "actual GPU".
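Since the limit is per VkDevice, counting inside your own allocator is therefore sufficient. A minimal sketch in C (the wrapper names and the global counter are illustrative, not part of the Vulkan API; g_maxAllocationCount would be filled from VkPhysicalDeviceProperties::limits.maxMemoryAllocationCount at startup):

    #include <vulkan/vulkan.h>
    #include <stdatomic.h>

    /* One counter per VkDevice is enough; the limit comes from
     * VkPhysicalDeviceProperties::limits.maxMemoryAllocationCount. */
    static atomic_uint g_allocationCount;
    static uint32_t    g_maxAllocationCount;

    VkResult countedAllocateMemory(VkDevice device,
                                   const VkMemoryAllocateInfo *info,
                                   const VkAllocationCallbacks *cb,
                                   VkDeviceMemory *memory)
    {
        /* refuse before the driver would return VK_ERROR_TOO_MANY_OBJECTS */
        if (atomic_load(&g_allocationCount) >= g_maxAllocationCount)
            return VK_ERROR_TOO_MANY_OBJECTS;

        VkResult res = vkAllocateMemory(device, info, cb, memory);
        if (res == VK_SUCCESS)
            atomic_fetch_add(&g_allocationCount, 1);
        return res;
    }

    void countedFreeMemory(VkDevice device, VkDeviceMemory memory,
                           const VkAllocationCallbacks *cb)
    {
        vkFreeMemory(device, memory, cb);
        if (memory != VK_NULL_HANDLE)
            atomic_fetch_sub(&g_allocationCount, 1);
    }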

Related

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations, which are platform dependent.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an MCU with 256 KB of flash memory and 64 KB of RAM.
How does malloc know how much RAM is available to my program?
For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kB reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kB no matter how much of it you actually use.
malloc itself can indeed be implemented in different ways. Either you include a "header" with each allocated segment, stating the allocated size and potentially the address of the next free segment, or you implement it as a look-up table where each item holds a pointer to the segment and its size. The header variant might look like the sketch below.
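A minimal sketch of the header-per-block idea over a fixed heap region (HEAP_SIZE, heap_region and naive_malloc are made-up names; a real malloc would also maintain a free list so freed blocks can be reused):

    #include <stddef.h>
    #include <stdint.h>

    #define HEAP_SIZE 4096u                 /* fixed maximum, decided at build time */
    static uint8_t heap_region[HEAP_SIZE];  /* stands in for the .heap segment */
    static size_t  heap_top;                /* first free byte in heap_region */

    typedef struct {
        size_t size;                        /* header stating the allocated size */
    } block_header;

    /* Simplistic bump allocator: allocate only, never reuse. */
    void *naive_malloc(size_t size)
    {
        size_t total = sizeof(block_header) + ((size + 3u) & ~(size_t)3u); /* 4-byte align */
        if (heap_top + total > HEAP_SIZE)
            return NULL;                    /* heap exhausted: the size is fixed */
        block_header *h = (block_header *)&heap_region[heap_top];
        h->size = size;
        heap_top += total;
        return h + 1;                       /* user memory starts after the header */
    }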
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems in the first place. The main reason is that it doesn't make much sense: you don't want arbitrary behavior, you want deterministic behavior. You have to allocate x amount of memory for the worst case anyway, and a heap would have to be at least that large, so you gain nothing but bloat from using one. On top of that come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.

Is there a way to map a host-cached Vulkan buffer to a specific memory location?

Vulkan is able to import host memory using VkImportMemoryHostPointerInfoEXT. I queried the supported memory types for VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT but the only kind of memory that was available for it was coherent, which does not work for my use case. The memory needs to use explicit invalidations/flushes for performance reasons. So really, I don't want the API to allocate any host-side memory, I just want to tell it the base address that the buffer should upload from/download to. Otherwise I have to use intermediate copies. Using the address returned by vkMapMemory for the host-side work is not desirable for my use-case.
If the Vulkan implementation does not allow you to import memory allocations as "CACHED", then you can't force it to do so. The API provides the opportunity for the implementation to advertise the ability to import your allocations as "CACHED", but the implementation explicitly refused to do it.
Which probably means that it can't. And you can't make the implementation do something it can't do.
So if you have some API that creates and manipulates some memory (and cannot use memory provided by someone else), and the Vulkan implementation won't allow reading from that memory unless it is allowed to remove the cached nature of the allocation, and you need CPU caching of that memory, then you're going to have to fall back on memcpy.
I want to mirror memory between the CPU and GPU so that I can access it from either without an implicit PCI-e bus transfer.
If the GPU is discrete, that's impossible. In a discrete GPU setup, the GPU and the CPU have separate local memory pools, and access to either pool from the other requires some form of PCIe transfer operation. Vulkan lets you pick which one is going to have slower access, but one of them will have slower access to the memory.
If the GPU is integrated, then typically there is only one memory pool and one memory type for it. That type will be both local and coherent (and probably cached too), which represents fast access from both devices.
Whether you use VkImportMemoryHostPointerInfoEXT or vkMapMemory on a non-DEVICE_LOCAL_BIT heap, you will typically get a COHERENT memory type.
That is because conventional host heap memory from malloc in C is naturally coherent (CPUs typically have automatic cache-coherency mechanisms); there is no cflush() or cinvalidate() in C.
There is no reason for implicit PCIe transfers when reading or writing such memory from the host side. Of course, the dedicated GPU has to read it somehow, so there will be bus transfers when the device accesses that memory. Alternatively, you need a separate allocation in a DEVICE_LOCAL_BIT heap and have to copy data between the two explicitly via vkCmdCopy* to keep them in sync.
Actual UMA architectures may expose a non-COHERENT memory type, but their memory heap is always advertised as DEVICE_LOCAL_BIT (even if it is main memory).
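For completeness, if an implementation does expose a HOST_VISIBLE memory type without the COHERENT bit, the explicit flush/invalidate path looks roughly like this sketch (error handling trimmed; device and memory are assumed to exist already):

    #include <string.h>
    #include <vulkan/vulkan.h>

    /* Sketch: CPU writes to a mapped, non-coherent allocation must be
     * flushed before the device can see them. */
    void write_then_flush(VkDevice device, VkDeviceMemory memory,
                          VkDeviceSize size, const void *src)
    {
        void *ptr = NULL;
        if (vkMapMemory(device, memory, 0, size, 0, &ptr) != VK_SUCCESS)
            return;

        memcpy(ptr, src, (size_t)size);   /* CPU writes through the mapping */

        VkMappedMemoryRange range = {
            .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
            .memory = memory,
            .offset = 0,
            .size   = VK_WHOLE_SIZE,
        };
        /* make the host writes visible to the device */
        vkFlushMappedMemoryRanges(device, 1, &range);
        /* to read device writes on the CPU you would instead call
         * vkInvalidateMappedMemoryRanges(device, 1, &range) first */

        vkUnmapMemory(device, memory);
    }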

Why does Vulkan have a limit on memory allocations?

Is there any technical reason to limit the maximum number of memory allocations?
Check out the vkAllocateMemory manual page. It says:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. If a call to vkAllocateMemory would cause the total number of allocations to exceed these limits, such a call will fail and must return VK_ERROR_TOO_MANY_OBJECTS.
OpenGL doesn't limit allocations, and neither does DirectX 11/12. So why does Vulkan?
As explained here, this is primarily an OS limitation.
OpenGL doesn't limit allocations, and neither does DirectX 11/12
Oh, they do. They just don't tell you about it.
OpenGL and DX11 drivers tend to internally do large GPU (virtual) allocations and perform sub-allocations from those when you allocate memory. Thus, they can create the illusion that you can perform more hardware allocations. But the limitation is still there.
As for DX12, I'm fairly sure that if you try to allocate more than 4096 heaps, you will find CreateHeap returning errors.
Vulkan is simply the API that's up-front about the existence of the limitation.
With Vulkan, this is simply a problem that should never arise. If you're performing over a thousand individual memory allocations, your memory allocation scheme is wrong. You're supposed to allocate a few large slabs of memory, then use sub-sections of them for your textures and buffers.
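A rough sketch of that slab approach in C: one vkAllocateMemory call backs many buffers, which are bound at offsets (sub_allocate is a made-up helper; a real allocator must also respect memoryTypeBits and bufferImageGranularity):

    #include <vulkan/vulkan.h>

    /* Sketch: place `buffer` inside one big `slab` allocation instead of
     * giving it its own vkAllocateMemory call. Error handling omitted. */
    VkDeviceSize sub_allocate(VkDevice device, VkDeviceMemory slab,
                              VkDeviceSize *offset, VkBuffer buffer)
    {
        VkMemoryRequirements req;
        vkGetBufferMemoryRequirements(device, buffer, &req);

        /* round the running offset up to the buffer's required alignment
         * (alignments are powers of two, so the mask trick is valid) */
        VkDeviceSize aligned = (*offset + req.alignment - 1) & ~(req.alignment - 1);

        vkBindBufferMemory(device, buffer, slab, aligned);
        *offset = aligned + req.size;     /* next free byte in the slab */
        return aligned;
    }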

Vulkan on devices that share host memory

For the purpose of this question, we'll say vkMapMemory for all allocations on such a device cannot fail; they are trivially host-visible, and the result is a direct pointer to some other region of host memory (no work needs to be done).
Is there some way to detect this situation?
The purpose in mind is an arena-based allocator that aggressively maps any host-visible memory, and an objective is to avoid redundant allocations on such hardware.
Yes, it can be detected relatively reliably.
If vkGetPhysicalDeviceMemoryProperties reports only one memory heap (which would be flagged VK_MEMORY_HEAP_DEVICE_LOCAL_BIT), then it is certain that it is the same memory as the host's.
In the words of the authors:
https://www.khronos.org/registry/vulkan/specs/1.0-extensions/html/vkspec.html#memory-device
In a unified memory architecture (UMA) system, there is often only a single memory heap which is considered to be equally “local” to the host and to the device, and such an implementation must advertise the heap as device-local.
In other cases you know trivially whether the memory is on the host (i.e. a given memory heap on a dGPU would not have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set).
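That check is short in code; a sketch (is_unified_memory is a made-up name):

    #include <stdbool.h>
    #include <vulkan/vulkan.h>

    /* Sketch of the detection described above: a single heap that is
     * DEVICE_LOCAL implies memory shared between host and device (UMA). */
    bool is_unified_memory(VkPhysicalDevice physical_device)
    {
        VkPhysicalDeviceMemoryProperties props;
        vkGetPhysicalDeviceMemoryProperties(physical_device, &props);

        return props.memoryHeapCount == 1 &&
               (props.memoryHeaps[0].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT);
    }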
That said, implementations for the UMA-based systems described by #krOoze have little reason not to expose direct pointers to buffer data.
Your question seems to proceed from a false assumption.
Vulkan is not OpenGL. Generally speaking, it does not try to hide things from you. If a memory heap cannot be accessed directly by the CPU, then the Vulkan implementation will not expose a memory type for that heap that is host-visible. Conversely, if a memory heap can be accessed directly by the CPU, then the Vulkan implementation will expose a memory type for that heap that is host-visible.
Therefore, if you can map a device allocation at all in Vulkan, then you should assume that you have a "direct pointer to buffer data".

Can a 32-bit processor load a 64-bit memory address using multiple blocks or registers?

I was doing a little reading on 32-bit microprocessors and I have learnt that:
1) A 32-bit microprocessor can only address 2^32 bytes of memory, which means that a memory pointer should not exceed the 32-bit range, i.e. the pointer size should be equal to or less than 32 bits.
2) I also came to know that a CPU allocates multiple blocks of memory for things like storing numbers and text; that is up to the program and not related to the size of each address (source: here). So is it possible for a CPU to use multiple blocks (registers) to store pointers more than 32 bits in size?
Processors can access an essentially unlimited amount of memory by using variations on a technique called bank switching. In a simple bank-switching scheme, the memory chips that are wired to a portion of the address space will have some address inputs fed by the processor and some from an external latching device. Historically, the IBM PC had a 1MB address space, but an expanded memory board would IIRC allow two 16KB regions of that space to be mapped to any of dozens or hundreds of 16KB blocks of memory contained thereon. Nowadays processors generally have a memory-management unit built-in, which maps 4KB or 64KB blocks of memory to any address within a much larger space, and additional circuitry may, with OS support, expand things further.
The big difficulty with bank switching is that any given address might identify many different places in memory depending upon how the bank-switching hardware is configured, so accessing data from memories in a banked region will generally be more complicated than accessing data in directly-accessible memory and will only be possible from code which knows how the bank-switching hardware works. Nowadays it's more common to simply use a processor which can access all the memory one needs, but historically bank-switching was often a useful technique for going beyond processor limitations.
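As a toy illustration of simple bank switching (the latch address, window address and sizes are entirely hypothetical hardware, in the spirit of the schemes described above):

    #include <stdint.h>

    /* Hypothetical hardware: a latch at 0x4000 selects which 16 KB bank of
     * a larger external memory appears in the window at 0x8000-0xBFFF. */
    #define BANK_SELECT (*(volatile uint8_t *)0x4000u)
    #define BANK_WINDOW ((volatile uint8_t *)0x8000u)
    #define BANK_SIZE   0x4000u

    /* Read one byte from a "large" address by splitting it into a bank
     * number (sent to the latch) and an offset within the window. */
    uint8_t banked_read(uint32_t big_address)
    {
        BANK_SELECT = (uint8_t)(big_address / BANK_SIZE); /* switch banks */
        return BANK_WINDOW[big_address % BANK_SIZE];      /* read via window */
    }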
You could store a 64-bit pointer using two separate locations in memory, as sketched below. But it probably wouldn't be useful, since your processor can only dereference 32-bit pointers.
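A trivial sketch of splitting and reassembling such a value (useful for storage, or for handing a 64-bit address to hardware, even though the CPU cannot dereference it as one pointer):

    #include <stdint.h>

    /* Split a 64-bit address into two 32-bit halves and put it back together. */
    void split64(uint64_t addr, uint32_t *hi, uint32_t *lo)
    {
        *hi = (uint32_t)(addr >> 32);
        *lo = (uint32_t)addr;
    }

    uint64_t join64(uint32_t hi, uint32_t lo)
    {
        return ((uint64_t)hi << 32) | lo;
    }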