Why does Vulkan have a limit on memory allocations?

Is there any technical reason to limit the maximum number of memory allocations?
Check out vkAllocateMemory manual page. It says:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. If a call to vkAllocateMemory would cause the total number of allocations to exceed these limits, such a call will fail and must return VK_ERROR_TOO_MANY_OBJECTS.
OpenGL doesn't limit allocations, and neither do DirectX 11 and 12. So why should Vulkan?

As explained here, this is primarily an OS limitation.
OpenGL doesn't limit allocations, and neither do DirectX 11 and 12
Oh, they do. They just don't tell you about it.
OpenGL and DX11 drivers tend to internally do large GPU (virtual) allocations and perform sub-allocations from those when you allocate memory. Thus, they can create the illusion that you can perform more hardware allocations. But the limitation is still there.
As for DX12, I'm fairly sure that if you try to allocate more than 4096 heaps, you will find CreateHeap returning errors.
Vulkan is simply the API that's up-front about the existence of the limitation.
With Vulkan, this is simply a problem that should never arise. If you're performing over a thousand individual memory allocations, your memory allocation scheme is wrong. You're supposed to allocate a few large slabs of memory, then use sub-sections of them for your textures and buffers.
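To make that concrete, here is a minimal sketch of one common approach (not the only one): a bump-pointer slab where a single vkAllocateMemory call backs many resources bound at different offsets. The Slab type and function names are invented for illustration; a real allocator also needs free lists, one pool per memory type, and bufferImageGranularity handling.

/* Minimal sketch (not production code): allocate one large slab with
 * vkAllocateMemory, then hand out offsets from it instead of making a
 * new hardware allocation per resource. */
#include <vulkan/vulkan.h>

typedef struct Slab {
    VkDeviceMemory memory;   /* one hardware allocation */
    VkDeviceSize   size;     /* total slab size, e.g. 256 MiB */
    VkDeviceSize   offset;   /* bump pointer: next free byte */
} Slab;

static int slab_init(VkDevice dev, uint32_t memoryTypeIndex,
                     VkDeviceSize size, Slab *slab)
{
    VkMemoryAllocateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = size,
        .memoryTypeIndex = memoryTypeIndex,
    };
    if (vkAllocateMemory(dev, &info, NULL, &slab->memory) != VK_SUCCESS)
        return -1;
    slab->size = size;
    slab->offset = 0;
    return 0;
}

/* Sub-allocate 'size' bytes with 'alignment' (a power of two, as Vulkan
 * alignments are) from the slab; the returned offset is what you pass to
 * vkBindBufferMemory/vkBindImageMemory. */
static VkDeviceSize slab_suballoc(Slab *slab, VkDeviceSize size,
                                  VkDeviceSize alignment)
{
    VkDeviceSize aligned = (slab->offset + alignment - 1) & ~(alignment - 1);
    if (aligned + size > slab->size)
        return (VkDeviceSize)-1;           /* slab full: allocate another slab */
    slab->offset = aligned + size;
    return aligned;
}

With a scheme like this, the number of vkAllocateMemory calls scales with the number of slabs rather than the number of buffers and images, so the allocation limit is never approached.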


How to get the current allocation count in Vulkan?

I'm writing a memory manager for my project to handle Vulkan memory allocation. In practice the allocation count should stay below maxMemoryAllocationCount, so I count every allocation in my app and check on each allocation whether the total exceeds maxMemoryAllocationCount.
However, I think this design has a bug: other apps could also allocate memory from the same device, so I would need the allocation count as tracked by the device itself, but I couldn't find any API for that.
So am I missing something, or is maxMemoryAllocationCount application-local?
other apps could also allocate memory from the same device
No, they cannot.
They can allocate memory from the same physical device. But they cannot allocate memory from the same VkDevice object. Such objects are specific to the process and cannot be shared. The allocations can be shared, but not the devices themselves (note that a shared allocation counts against the limit on all devices that can access it).
The specification is very clear that this is bound to a specific VkDevice:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation- or platform-dependent limits. The maxMemoryAllocationCount feature describes the number of allocations that can exist simultaneously before encountering these internal limits.
When the specification says "device", unless it makes it clear otherwise, it means "VkDevice", not "actual GPU".
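For completeness, here is a small sketch of what an application-local counter against that limit could look like. The wrapper name tracked_allocate and the globals are invented for illustration; a real engine would track the count per VkDevice rather than in globals.

/* Sketch: query the per-device allocation limit from the physical device
 * and track your own allocation count against it. The counter is
 * process-local by design -- the limit applies to your VkDevice, not to
 * allocations made by other processes. */
#include <vulkan/vulkan.h>

static uint32_t g_allocation_count;      /* incremented/decremented by us */
static uint32_t g_allocation_limit;

void init_allocation_limit(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physical_device, &props);
    g_allocation_limit = props.limits.maxMemoryAllocationCount;
}

VkResult tracked_allocate(VkDevice device,
                          const VkMemoryAllocateInfo *info,
                          VkDeviceMemory *memory)
{
    if (g_allocation_count >= g_allocation_limit)
        return VK_ERROR_TOO_MANY_OBJECTS;  /* refuse before the driver does */
    VkResult result = vkAllocateMemory(device, info, NULL, memory);
    if (result == VK_SUCCESS)
        ++g_allocation_count;
    return result;
}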

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations which are platform-dependent.
How does it work on an embedded device in bare-metal programming?
Let's suppose we have an MCU with 256 KB of flash memory and 64 KB of RAM.
How does malloc know how much RAM is available to my program?
For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc itself could indeed be implemented in different ways. Either you include a "header" with each allocated segment, stating the allocated size and potentially the address of the next available free segment, or you implement it as a look-up table where each item is a pointer to the first element and the size.
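As a rough illustration of the header-per-block variant, here is a toy sketch only: a single first-fit free list with no block splitting or coalescing, and invented names. The heap region itself would come from the .heap segment reserved in the linker script.

#include <stddef.h>

typedef struct block_header {
    size_t               size;   /* usable bytes that follow this header */
    struct block_header *next;   /* next block on the free list */
} block_header;

static block_header *free_list;  /* initialised once to span the .heap area */

void *toy_malloc(size_t size)
{
    block_header **prev = &free_list;
    for (block_header *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size >= size) {           /* first fit */
            *prev = b->next;             /* unlink from the free list */
            return (void *)(b + 1);      /* memory starts right after the header */
        }
    }
    return NULL;                         /* out of heap */
}

void toy_free(void *p)
{
    if (p == NULL)
        return;
    block_header *b = (block_header *)p - 1;  /* step back to the header */
    b->next = free_list;                      /* push onto the free list */
    free_list = b;
}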
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason is that it doesn't make any sense: you don't want arbitrary behavior, you want deterministic behavior. You want to allocate x amount of memory for the worst case, and if a heap were used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.

Vulkan on devices that share host memory

For the purpose of this question, we'll say vkMapMemory for all allocations on such a device cannot fail; they are trivially host-visible, and the result is a direct pointer to some other region of host memory (no work needs to be done).
Is there some way to detect this situation?
The purpose in mind is an arena-based allocator that aggressively maps any host-visible memory, and an objective is to avoid redundant allocations on such hardware.
Yes, it can be detected relatively reliably.
If vkGetPhysicalDeviceMemoryProperties reports only one memory heap (which must be flagged VK_MEMORY_HEAP_DEVICE_LOCAL_BIT), then it is certain that it is the same memory as the host's.
In the words of the authors:
https://www.khronos.org/registry/vulkan/specs/1.0-extensions/html/vkspec.html#memory-device
In a unified memory architecture (UMA) system, there is often only a single memory heap which is considered to be equally “local” to the host and to the device, and such an implementation must advertise the heap as device-local.
In other cases you know trivially whether the memory is on the host (i.e. a given memory heap on a dGPU would not have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set).
Though, implementations for the UMA-based systems described by @krOoze have little reason not to expose direct pointers to buffer data.
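A minimal sketch of that check might look like this (the function name is invented; it simply asks for the memory properties and tests whether the single heap is device-local):

#include <vulkan/vulkan.h>
#include <stdbool.h>

/* Returns true if the implementation reports exactly one memory heap and
 * it is device-local, i.e. host and device share the same physical memory. */
bool is_unified_memory(VkPhysicalDevice physical_device)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physical_device, &props);

    return props.memoryHeapCount == 1 &&
           (props.memoryHeaps[0].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT);
}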
Your question seems to proceed from a false assumption.
Vulkan is not OpenGL. Generally speaking, it does not try to hide things from you. If a memory heap cannot be accessed directly by the CPU, then the Vulkan implementation will not expose a memory type for that heap that is host-visible. Conversely, if a memory heap can be accessed directly by the CPU, then the Vulkan implementation will expose a memory type for that heap that is host-visible.
Therefore, if you can map a device allocation at all in Vulkan, then you should assume that you have a "direct pointer to buffer data".
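In practice, then, the arena allocator can simply pick any memory type that advertises VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT and map it. A hedged sketch with invented names and minimal error handling:

#include <vulkan/vulkan.h>
#include <stdint.h>

/* Find a host-visible memory type, allocate from it, and map the whole
 * allocation. Returns 0 on success and writes the memory handle and the
 * CPU pointer to the caller. */
int map_host_visible(VkPhysicalDevice phys, VkDevice dev,
                     VkDeviceSize size, VkDeviceMemory *memory, void **ptr)
{
    VkPhysicalDeviceMemoryProperties mem_props;
    vkGetPhysicalDeviceMemoryProperties(phys, &mem_props);

    uint32_t type_index = UINT32_MAX;
    for (uint32_t i = 0; i < mem_props.memoryTypeCount; ++i) {
        if (mem_props.memoryTypes[i].propertyFlags &
            VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT) {
            type_index = i;
            break;
        }
    }
    if (type_index == UINT32_MAX)
        return -1;       /* should not happen: the spec requires at least one */

    VkMemoryAllocateInfo info = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = size,
        .memoryTypeIndex = type_index,
    };
    if (vkAllocateMemory(dev, &info, NULL, memory) != VK_SUCCESS)
        return -1;
    if (vkMapMemory(dev, *memory, 0, VK_WHOLE_SIZE, 0, ptr) != VK_SUCCESS)
        return -1;
    return 0;
}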

How can I change the maximum available heap size for a task in FreeRTOS?

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 bytes big.
My embedded system has 60 kB of SRAM, so I expected my 200-element list could be handled easily by the system. I found out that after allocating space for 8 elements, the system crashes on the 9th malloc call (256 bytes and up).
If possible, where can I change the heap size inside FreeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes unless you had hardly any heap remaining beforehand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may have also done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
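As a sketch of both suggestions (assuming a heap_4 or heap_5 build, and configUSE_MALLOC_FAILED_HOOK set to 1 in FreeRTOSConfig.h so the hook is actually called):

#include "FreeRTOS.h"
#include "task.h"

void report_heap_usage(void)
{
    size_t free_now = xPortGetFreeHeapSize();            /* heap free right now */
    size_t free_low = xPortGetMinimumEverFreeHeapSize(); /* low-water mark (heap_4/heap_5) */
    /* print or log these with whatever output your board has */
    (void)free_now;
    (void)free_low;
}

/* Called by the kernel whenever pvPortMalloc() returns NULL. */
void vApplicationMallocFailedHook(void)
{
    taskDISABLE_INTERRUPTS();
    for (;;) { /* halt here so the debugger shows where you ran out */ }
}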
For the standard allocators you will find a config option (configTOTAL_HEAP_SIZE) in FreeRTOSConfig.h.
However:
It is quite possible that you are already running out of memory, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy), so any block returned is lost. This is still useful if you only allocate memory e.g. at startup and then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might lose memory to fragmentation. Depending on your alloc/free pattern, you can quickly end up with a heap looking like Swiss cheese: many holes between allocated blocks. So while there is still enough free memory in total, no single block is big enough for the size required.
If you only allocate blocks of that size there, you might be better off using your own allocator or a pool (blocks of fixed size). That would be statically allocated (e.g. an array) and chained into a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast, with complexity O(1), and interrupt-safe if properly written; see the sketch below.
Note that normal malloc()/free() are not interrupt-safe.
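Here is a minimal sketch of such a pool. The block count and size are examples matching the question; this version is not interrupt-safe on its own, so wrap the alloc/free in a critical section or use a FreeRTOS queue if you need that.

#include <stddef.h>

#define POOL_BLOCKS      200
#define POOL_BLOCK_SIZE  32          /* e.g. sizeof(dllist) */

typedef union pool_block {
    union pool_block *next;                 /* valid while the block is free */
    unsigned char     data[POOL_BLOCK_SIZE];
} pool_block;

static pool_block  pool[POOL_BLOCKS];       /* statically allocated, no heap */
static pool_block *pool_free_list;

void pool_init(void)                        /* call once at startup */
{
    for (size_t i = 0; i < POOL_BLOCKS - 1; ++i)
        pool[i].next = &pool[i + 1];
    pool[POOL_BLOCKS - 1].next = NULL;
    pool_free_list = &pool[0];
}

void *pool_alloc(void)                      /* O(1): pop from the free list */
{
    pool_block *b = pool_free_list;
    if (b != NULL)
        pool_free_list = b->next;
    return b;                               /* NULL when the pool is exhausted */
}

void pool_free(void *p)                     /* O(1): push back onto the free list */
{
    if (p == NULL)
        return;
    pool_block *b = p;
    b->next = pool_free_list;
    pool_free_list = b;
}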
Finally: do not cast void *. (Well, that's actually what standard malloc() returns, and I expect the FreeRTOS variant does the same.)

Heap profiling on ARM

I am developing a GUI-heavy C++ application on a Freescale i.MX51-based board running Linux 2.6.35. I would like to perform heap profiling.
Unfortunately, all heap profiling tools I have found have either been too intrusive or ostensibly non-working on ARM. Specific tools I've tried:
Valgrind Massif: unworkable on my platform due to the platform's feeble CPU. The 80% CPU time overhead introduced by Massif causes a range of problems in my application that cannot be compensated for.
gperftools (formerly Google Performance Tools) tcmalloc: All features of this rather un-intrusive, library-based libc malloc() replacement work on my target except for the heap profiler. To rephrase, the thread caching allocator works but the profiler does not. I'll explain the failure mode of the profiler below for anyone curious.
Can anyone suggest a set of replacement tools for performing C++ heap profiling on ARM platforms? Ideal output would ultimately be a directed allocation graph, similar to what gperftools' tcmalloc outputs. Low resource utilization is a must; my platform is highly resource-constrained.
Failure mode of gperftools' tcmalloc explained:
I'm providing this information only for those that are curious; I do not expect a response. I'm seeing something similar to gperftools' issue #407 below, except on ARM rather than x86.
Specifically, I always get the message "Hooked allocator frame not found, returning empty trace." I spent some time debugging the issue and it appears that, when dynamically linking the tcmalloc library, frame pointers at the boundary between my application and the dynamic library are null- the stack cannot be walked "above" the call into the dynamic library.
gperftools issue #407: https://github.com/gperftools/gperftools/issues/410
stackoverflow user seeing similar problems on ARM: Missing frames on shared libraries on ARM
Heaps. Many ways to do them, but I've only run across 3 main types that matter in embedded land:
Linked-list heaps. Each alloc is tracked in a "used" list. Once freed, blocks are dropped into a "free" list, and on freeing, adjacent blocks of free memory are "joined" into larger pieces. Allocs can be any size. Each alloc and free is an O(N) operation, as it has to traverse the free list to give you a piece of memory, break the free block into a size close to what you asked for, and leave the remaining block in the free list. Because that overhead grows with the number of allocations, this system cannot be used by itself on smaller systems. It also tends to cause memory fragmentation over time if steps aren't taken to minimize it.
Fixed-size (unit) heaps. You break your heap into equal-size (smaller) parts. This wastes memory a bit, depending on how big the chunks are (and how many differently sized fixed-allocator heaps you create), but alloc and free are both O(1) operations. No searching, no joining. This style is often combined with the first one for "small object allocations", as the engines I've worked with have 95% of their allocations below a set size (say 256 bytes). This way, you use the unit heap for small allocs for huge speed and only minimal memory loss, while using the list heap for larger allocs. No external fragmentation of memory either.
Relocatable memory heaps. You don't give out pointers to memory, but handles. That way, behind the scenes, you can change memory pointers when needed to remove fragmentation or whatever (a tiny sketch follows below). High overhead. High pain-in-the-#$$ quotient, as it's easy to abuse and end up with dangling pointers all over. There is also added overhead for each memory dereference. But I wanted to mention it.
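A tiny sketch of the handle indirection (illustrative only; a real relocatable heap also needs allocation, compaction and locking, which are omitted here, and all names are invented):

#include <stddef.h>

#define MAX_HANDLES 64

typedef int mem_handle;                  /* opaque index handed out to callers */

static void *handle_table[MAX_HANDLES];  /* current address of each block */

/* Dereference a handle; never cache the returned pointer across calls
 * that may compact the heap. */
static inline void *handle_ptr(mem_handle h)
{
    return handle_table[h];
}

/* During compaction the allocator moves a block and only has to patch the
 * table entry; all outstanding handles stay valid. */
static void relocate(mem_handle h, void *new_address)
{
    handle_table[h] = new_address;
}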
Those are some basic patterns. You can find all sorts of libs out in the wild that use them and also have built-in statistics for number of allocs, fragmentation, and other useful stats. It's also not that hard to roll your own, really, though I wouldn't recommend it for anything beyond satisfying curiosity, as debugging without a working malloc is painful indeed. Adding thread support is pretty straightforward as well, but again, downloading a ready-made solution is the better choice.
The above info applies to all platforms, ARM or otherwise, though most of my experience has been on low level ARM stuff so the above info is battle tested for your platform. Hope this helps!