What is the purpose of the maximum number of sets in a Vulkan descriptor set pool?

In the VkDescriptorPoolCreateInfo structure we specify the maxSets field, which configures the maximum number of descriptor sets that can be allocated from a given pool. But what exactly is this maximum setting doing under the hood (if anything)?
For example, we could create a simple application with a pool containing (say) three combined image samplers and three uniform buffers, with the intention of creating three descriptor sets (sampler and UBO) for a triple-buffered strategy.
But the application could theoretically allocate anywhere between zero and six sets from this pool, depending on the layouts of the allocated descriptor sets and on what we set maxSets to. That is, the pool does not know what layouts will be requested, only the available number of each resource type.
So is maxSets a hint to the hardware? Or is there some other reason? I have looked through the documentation but cannot find anything other than this line in the spec:
maxSets is the maximum number of descriptor sets that can be allocated from the pool.
Or is it used by Vulkan to double-check that the application does not allocate more sets than the configured pool size? To continue the example we could configure maxSets to four (valid if not logical) but if we try to allocate a fourth descriptor set from the pool the allocation will fail anyway because the pool is exhausted, so why configure a property that is the responsibility of the application?

The point of having limits on a pool is to let the implementation pre-allocate the resources it needs or, failing that, to at least have a fixed upper bound on the number of allocations the user will make from it. The goal is that, at some point, the system no longer has to allocate resources at runtime when you request them from the pool.
Individual descriptors take up some form of resource, whether CPU, GPU, or both. But bundling them into descriptor sets can also take up resources, depending on the implementation. As such, if an implementation can pre-allocate some number of set resources, that would be good for minimizing runtime allocations.
Now, it might have made more sense to define a descriptor pool by providing a number of descriptor layouts and saying that you're only going to allocate some number of sets of each particular layout. But that would be a very constrained descriptor pool model.
So the current interface makes for a compromise between such a strict model and a model that only looks at descriptors rather than sets.
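To make the "two budgets" idea concrete, here is a toy model in plain C. It is not the Vulkan API and says nothing about what a real driver does internally; the names ToyPool, ToyLayout and toy_allocate_set are made up for illustration. It only shows that a set budget (the role of maxSets) and per-type descriptor budgets (the role of the pool sizes) are independent, so either one can run out first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, NOT the Vulkan API: a pool with two independent budgets. */
enum { TYPE_SAMPLER = 0, TYPE_UBO = 1, TYPE_COUNT = 2 };

typedef struct {
    unsigned sets_left;                    /* plays the role of maxSets */
    unsigned descriptors_left[TYPE_COUNT]; /* plays the role of the per-type pool sizes */
} ToyPool;

/* A layout is described only by how many descriptors of each type it needs. */
typedef struct {
    unsigned needed[TYPE_COUNT];
} ToyLayout;

/* Returns true and charges both budgets, or returns false and changes nothing. */
bool toy_allocate_set(ToyPool *pool, const ToyLayout *layout)
{
    assert(pool != NULL && layout != NULL);

    if (pool->sets_left == 0)
        return false;                      /* set budget exhausted */

    for (int t = 0; t < TYPE_COUNT; ++t)
        if (layout->needed[t] > pool->descriptors_left[t])
            return false;                  /* descriptor budget exhausted */

    for (int t = 0; t < TYPE_COUNT; ++t)
        pool->descriptors_left[t] -= layout->needed[t];
    pool->sets_left--;
    return true;
}
```

With the question's numbers (three samplers, three UBOs, a set budget of four, and a layout needing one of each), the fourth call fails because the descriptors run out while one set slot is still unused. With a UBO-only layout and a set budget of two, the third call fails on the set budget even though a UBO is still available.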


Akka Stream application using more memory than the jvm's heap

I have a Java application using Akka Streams that consumes more memory than I have told the JVM to use. The values below are what I have set through JAVA_OPTS.
maximum heap size (-Xmx) = 700MB
metaspace (-XX) = 250MB
stack size (-Xss) = 1025kb
Using those values and plugging them into the formula below, one would assume the application would be using around 950MB. However that is not the case and it's using over 1.5GB.
Max memory = [-Xmx] + [-XX:MetaspaceSize] + number_of_threads * [-Xss]
Question: Thoughts on how this is possible?
Application overview:
This java application uses alpakka to connect to pubsub and consumes messages. It utilizes akka stream's parallelism where it performs logic on the consumed messages and then it produces those messages to a kafka instance. See the heap dump below. Note, the heap is only 912.9MB so something is taking up 587.1MB and getting the memory usage over 1.5GB
Why is this a problem?
This application is deployed on a Kubernetes cluster and the pod has a memory limit of 1.5GB. So when the container in which the Java application is running consumes more than 1.5GB, the container is killed and restarted.
The short answer is that those do not account for all the memory consumed by the JVM.
Outside of the heap, for instance, memory is allocated for:
- compressed class space (governed by the MaxMetaspaceSize)
- direct byte buffers (especially if your application performs network I/O and cares about performance, it's virtually certain to make somewhat heavy use of those)
- threads (each thread has a stack governed by -Xss; note that if you mix different concurrency models, each model will tend to allocate its own threads and not necessarily provide a means to share threads)
- native code, if any is involved (e.g. perhaps in the library Alpakka is using to interact with pubsub?), which can allocate arbitrary amounts of memory outside of the heap
- the code cache (typically 48MB)
- the garbage collector's state (will vary based on the GC in use, including the presence of any tunable options)
- various other things that generally aren't going to be that large
In my experience you're generally fairly safe with a heap that's at most (pod memory limit minus 1 GB), but if you're performing exceptionally large I/Os etc. you can pretty easily get OOM even then.
Your JVM may ship with support for native memory tracking which can shed light on at least some of that non-heap consumption: most of these allocations tend to happen soon after the application is fully loaded, so running with a much higher resource limit and then stopping (e.g. via SIGTERM with enough time to allow it to save results) should give you an idea of what you're dealing with.

How to get the current allocation count in Vulkan?

I'm writing a memory manager in my project to manage Vulkan memory allocation. In practice, the allocation count should stay below maxMemoryAllocationCount, so I count all allocations in my app and check on each allocation whether the count exceeds maxMemoryAllocationCount.
However, I think this design has a bug: other apps could also allocate memory from the same device, so I need the allocation count as tracked by the device, but I didn't find any API for that.
So am I missing something, or is maxMemoryAllocationCount application-local?
other apps could also allocate memories from the same device
No, they cannot.
They can allocate memory from the same physical device. But they cannot allocate memory from the same VkDevice object. Such objects are specific to the process and cannot be shared. The allocations can be shared, but not the devices themselves (note that a shared allocation counts against the limit on all devices that can access it).
The specification is very clear that this is bound to a specific VkDevice:
The maximum number of valid memory allocations that can exist simultaneously within a VkDevice may be restricted by implementation-or-platform-dependent limits. The maxMemoryAllocationCount feature describes the number of allocations that can exist simultaneously before encountering these internal limits.
When the specification says "device", unless it makes it clear otherwise, it means "VkDevice", not "actual GPU".

Specify maximum CPU and memory utilization of ABAP Application Server

Is there any means to configure an ABAP application server so that it consumes only X percent of the CPU and Y percent of the memory on the machine it runs on?
Or is this rather something that is only possible on the operating system level?
Google research revealed how to view the operating system status. As this is only viewing, I would be interested in a means to control this status also from within the ABAP application server.
I'm not aware of a method to bind the memory allocation of an application server to a manually adjusted percentage of the host OS memory. There are several profile parameters that control the different memory types used in an application server. SAP offers detailed documentation on their memory management.
As far as I know, the maximum memory allocated by an application server is controlled by the size of the roll area for work processes, the extended memory and the total heap size. Profile parameters for those settings are:
ztta/roll_area / ztta/roll_first (per work process, not total)
Work processes first receive memory from the roll area; after that they can request more memory from the extended memory, up to the size of ztta/roll_extension. If all extended memory is allocated, the work process can allocate heap memory (with a few downsides, which is why that happens only when necessary).
The biggest influence on memory will be em/initial_size_MB and abap/heap_area_total (with em/initial_size_MB being the main control mechanism). I'd focus on those two to adjust the total memory consumption of your application server instance.
Side note: em/initial_size_MB has a default of 70 % of the total host memory, so there is already a percentage based memory allocation happening in the kernel as long as that parameter isn't set. But I'm not aware of a way to influence the percentage used by the kernel.
Update, thanks to mkysoft for the information: the two parameters CPU_CORES and PHYS_MEMSIZE are by default set by the operating system and contain the total number of CPUs and the total memory installed in the system. You can manually override them, reducing the resources the SAP kernel uses to calculate default values for several kernel parameters. You could for instance reduce PHYS_MEMSIZE and leave em/initial_size_MB to default. Both parameters also allow you to set a percentage instead of absolute values. You could for instance set both values to 50%, reducing the maximum resources for that application server instance to 50 % of what the hardware has to offer. There's some additional documentation for those two parameters available as well.

Estimating available RAM left with safety margin in C (STM32F4)

I am currently developing an application for the STM32F407 using STM32CubeMx and Keil uVision. I know that dynamic memory allocation in embedded systems is mostly discouraged, but here and there on the internet I can find some arguments in favor of it.
Due to my inventor's soul I wanted to try it, but do it safely. Let's assume I'm creating a dynamically allocated FIFO for incoming UART messages, holding structs composed of the message itself and its length. However, I wouldn't like to consume the whole heap doing so, therefore I want to check how much of it I have left. My new (?) idea is to try temporarily allocating some big chunk of memory (say 100 chars): if that succeeds, I accept the incoming message; if not, it means I'm running out of heap and I ignore the message (or accept it and dequeue the oldest). After checking I of course free the temporary memory.
A few questions arise in my mind:
1. First of all, does it make sense at all? Do you think, based on your experience, that it could be useful and safe?
2. I couldn't find precise info about what exactly shares RAM in embedded systems (I know about heap, stack and volatile vars), so my question is: provided that the answer to 1. isn't "hell no, go home", what size of temporary test allocation would you pick for the mentioned controller?
3. About the micro itself: it has 192kB RAM, however in the Drivers\CMSIS\Device\ST\STM32F4xx\Source\Templates\arm\startup_stm32f407xx.s file only 512B+1024B are allocated for heap and stack. Isn't that very little, leaving the whopping remaining 190kB for volatile vars? Would increasing the heap size to, say, 50kB be sensible? If yes, do I do it directly in this file, or is it better practice to do it somewhere else?
Probably for some of you "safe dynamic memory" and "embedded" in one post is both shocking and dazzling, but keep in mind that this is experimenting and exploring new horizons :) Thanks and greetings.
Keil uVision describes only the IDE. If you are using Keil MDK-ARM, which implies ARM's RealView compiler, then you can get accurate heap information using the __heapstats() function.
__heapstats() is a little strange in that, rather than simply returning a value, it outputs heap information to a formatted output stream facilitated by a function pointer and file descriptor passed to it. The output function must have an fprintf()-like interface. You can use fprintf() of course, but that requires that you have correctly retargeted stdio.
For example the following:
typedef int (*__heapprt)(void *, char const *, ...);
__heapstats( (__heapprt)fprintf, stdout ) ;
outputs for example:
4180 bytes in 1 free blocks (avge size 4180)
1 blocks 2^11+1 to 2^12
Unfortunately that does not really achieve what you need, since it outputs text. You could however implement your own function to capture the data in memory and parse the result. You may only need to capture the first decimal digit characters and discard anything else, except that the amount of free memory and the largest allocatable block are not necessarily the same thing, of course. Fragmentation is indicated by the number of free blocks and their average size. You can perhaps guarantee to be able to allocate at least an average-sized block.
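A minimal sketch of such a capture function follows. It is written in standard C so it can be tried on a PC; HeapReport, heap_report_printf and heap_report_parse are names invented here, and the parser assumes the summary line has exactly the shape of the sample output shown earlier, which may differ between library versions.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical capture buffer; a pointer to it is passed where a FILE * would go. */
typedef struct {
    char   text[256];
    size_t used;
} HeapReport;

/* fprintf-like sink: appends formatted output to the HeapReport,
   silently truncating once the buffer is full. */
int heap_report_printf(void *param, char const *format, ...)
{
    HeapReport *report = param;
    va_list args;
    size_t room;
    int written;

    assert(report != NULL && report->used < sizeof report->text);
    room = sizeof report->text - report->used;

    va_start(args, format);
    written = vsnprintf(report->text + report->used, room, format, args);
    va_end(args);

    if (written > 0) {
        size_t n = (size_t)written;
        report->used += (n < room) ? n : room - 1;
    }
    return written;
}

/* Parses "<bytes> bytes in <n> free blocks (avge size <avg>)".
   Returns 1 on success, 0 if the captured text does not match. */
int heap_report_parse(const HeapReport *report,
                      unsigned long *free_bytes,
                      unsigned long *free_blocks,
                      unsigned long *average_size)
{
    return sscanf(report->text,
                  "%lu bytes in %lu free blocks (avge size %lu)",
                  free_bytes, free_blocks, average_size) == 3;
}
```

On the target, the idea would be to zero a HeapReport, pass heap_report_printf and a pointer to the report to __heapstats() in place of fprintf and stdout, and then call heap_report_parse(). That part is an untested assumption by analogy with the fprintf example.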
The issues with dynamic allocation in embedded systems have to do with handling memory exhaustion and, in real-time systems, the non-deterministic timing of both allocation and deallocation in the default malloc/free implementations. In your case you might be better off using a fixed-block allocator. You can implement such an allocator by creating a static array of memory blocks (or by dynamically allocating them from the heap at start-up), and placing a pointer to each block on a queue, linked list or stack structure. To allocate, you simply remove a pointer from the queue/list/stack, and to free, you place the pointer back. When the structure of available blocks is empty, memory is exhausted. It is entirely deterministic, and because it is your own implementation it can easily be monitored for performance and capacity.
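A minimal sketch of such a fixed-block allocator, using a static array and an intrusive free list (the stack variant). BLOCK_SIZE, BLOCK_COUNT and the pool_* names are assumptions chosen for illustration, and the code is not interrupt-safe: if blocks are allocated in a UART ISR and freed in the main loop, the list operations would need to be guarded.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE  64u   /* assumed payload size per block */
#define BLOCK_COUNT 8u    /* assumed number of blocks */

/* A free block stores the link to the next free block in its own storage. */
typedef union Block {
    union Block  *next;
    unsigned char payload[BLOCK_SIZE];
} Block;

static Block  pool_storage[BLOCK_COUNT];
static Block *free_list;
static size_t free_count;

/* Push every block onto the free list; call once at start-up. */
void pool_init(void)
{
    free_list = NULL;
    free_count = 0;
    for (size_t i = 0; i < BLOCK_COUNT; ++i) {
        pool_storage[i].next = free_list;
        free_list = &pool_storage[i];
        ++free_count;
    }
}

/* Pop a block in constant time; NULL means the pool is exhausted. */
void *pool_alloc(void)
{
    Block *block = free_list;
    if (block == NULL)
        return NULL;
    free_list = block->next;
    --free_count;
    return block->payload;
}

/* Push a block back in constant time. */
void pool_free(void *ptr)
{
    Block *block = ptr;
    assert(block >= pool_storage && block < pool_storage + BLOCK_COUNT);
    block->next = free_list;
    free_list = block;
    ++free_count;
}

/* Number of blocks currently free, for capacity monitoring. */
size_t pool_available(void)
{
    return free_count;
}
```

pool_available() gives exactly the "how much is left" figure the question was trying to obtain by trial allocation, without touching the general-purpose heap.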
With respect to question 3: you are expected to adjust the heap and system stack size to suit your application. Most tools I have used have a linker script that automatically gives the heap all available memory that is not statically allocated, allocated to a stack, or reserved for other purposes. MDK-ARM, however, does not do that in its default linker scripts but rather allocates a fixed-size heap.
You can use the linker map file summary to determine how much space is unused and manually expand the heap. I usually do that, leaving a small amount of unused space to allow for maintenance, when the amount of statically allocated data may increase. At some point, however, you end up running out of memory, and the arcane error messages from the linker may not make it obvious that your heap is just too big. It is possible to override the default linker script and provide your own, and no doubt possible then to size the heap automatically, though I have never taken the trouble to try it.
Okay, I have tested my idea of checking free heap space dynamically and it worked well (although I didn't perform long-run tests); however, @Clifford's answer and this article convinced me to abandon the idea of dynamic allocation. Eventually I implemented my own static heap with pages (a 2D array), an occupied-pages indicator (a 0-1 array with one entry per page) and a FIFO of structs, each consisting of a pointer to the message on my static heap (actually just the index into the array) and the length of the message (to determine how many contiguous pages it occupies). 95% of the messages I receive should take up only one page and 5% take 2 or 3 pages, so fragmentation is still possible, but at least I keep a tight rein on it and it affects only the part of memory assigned to this module of the code (in other words, the fragmentation doesn't leak to other parts of the code). So far it has worked without any problems, and it is surely faster: the lookup time is O(n*m), with n the number of pages and m the longest message in pages, but taking the laws of probability into consideration it goes down to O(n). Moreover, n is always a lot smaller than the number of all allocation units in memory, so there is way less to search.
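A rough sketch of what such a page-based static heap could look like. This is a reconstruction from the description above, not the actual code; PAGE_COUNT, PAGE_SIZE and the function names are assumptions, and this first-fit search makes a single pass over the occupancy array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_COUNT 16u  /* assumed number of pages */
#define PAGE_SIZE  32u  /* assumed bytes per page */

static unsigned char pages[PAGE_COUNT][PAGE_SIZE];  /* the static heap */
static bool          page_used[PAGE_COUNT];         /* occupied-pages indicator */

/* Finds the first run of `count` free pages, marks it used and returns the
   index of its first page, or -1 if no such run exists. */
int pages_alloc(size_t count)
{
    size_t run = 0;

    if (count == 0 || count > PAGE_COUNT)
        return -1;

    for (size_t i = 0; i < PAGE_COUNT; ++i) {
        run = page_used[i] ? 0 : run + 1;
        if (run == count) {
            size_t first = i + 1 - count;
            for (size_t j = first; j <= i; ++j)
                page_used[j] = true;
            return (int)first;
        }
    }
    return -1;
}

/* Releases `count` pages starting at page index `first`. */
void pages_free(int first, size_t count)
{
    assert(first >= 0 && (size_t)first + count <= PAGE_COUNT);
    for (size_t j = 0; j < count; ++j)
        page_used[(size_t)first + j] = false;
}

/* Address of a page, for copying a message in or out. */
unsigned char *page_address(int first)
{
    assert(first >= 0 && (size_t)first < PAGE_COUNT);
    return pages[first];
}
```

The FIFO entries would then hold the returned page index together with the message length, from which the number of pages to release can be recomputed.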

How does a stack memory increase?

In a typical C program, the Linux kernel provides 84K - ~100K of memory for the stack. How does the kernel allocate more memory for the stack when the process uses up the given memory?
IMO when the process takes up all the memory of the stack and now uses the next contiguous memory, ideally it should page fault and then the kernel handles the page fault.
Is it here that the kernel provides more memory to the stack for the given process, and which data structure in linux kernel identifies the size of the stack for the process??
There are a number of different methods used, depending on the OS (linux realtime vs. normal) and the language runtime system underneath:
1) dynamic, by page fault
Typically a few real pages are preallocated at high addresses and the initial sp is set to point there. The stack grows downward, the heap grows upward. If a page fault happens somewhat below the stack bottom, the missing intermediate pages are allocated and mapped, effectively growing the stack from the top towards the bottom automatically. There is typically a maximum up to which such automatic allocation is performed, which may or may not be specified in the environment (ulimit), in the exe header, or adjusted dynamically by the program via a system call (rlimit). This adjustability in particular varies heavily between OSes. There is also typically a limit on how far away from the stack bottom a page fault may be and still trigger automatic growth. Note that not every system's stack grows downward: under HPUX it grows (or used to grow?) upward, so I am not sure what Linux on PA-RISC does (can someone comment on this?).
2) fixed size
Other OSes (especially in embedded and mobile environments) have fixed sizes, either by definition, or specified in the exe header, or specified when a program/thread is created. Especially in embedded real-time controllers, this is often a configuration parameter, and individual control tasks get fixed stacks (to avoid runaway threads taking the memory of higher-priority control tasks). Of course, in this case too the memory might be allocated only virtually until it is really needed.
3) pagewise, spaghetti and similar
Such mechanisms tend to be forgotten, but are still in use in some runtime systems (I know of Lisp/Scheme and Smalltalk systems). These allocate and grow the stack dynamically as required, however not as a single contiguous segment but as a linked chain of multi-page chunks. This requires the compiler(s) to generate different function entry/exit code in order to handle segment boundaries. Therefore such schemes are typically implemented by a language support system and not by the OS itself (it used to be otherwise in earlier times - sigh). The reason is that when you have many (say 1000s of) threads in an interactive environment, preallocating say 1MB for each would simply fill your virtual address space, and you could not support a system where the stack needs of an individual thread are not known in advance (which is typically the case in a dynamic environment, where the user might enter eval-code into a separate workspace). Dynamic growth as in scheme 1 above is not possible either, because other threads with their own stacks would be in the way. So the stack is made up of smaller segments (say 8-64k) which are allocated and deallocated from a pool and linked into a chain of stack segments. Such a scheme may also be required for high-performance support of things like continuations, coroutines etc.
Modern unixes/linuxes and (I guess, but am not 100% certain) Windows use scheme 1) for the main thread of your exe, and scheme 2) for additional (p-)threads, which need a fixed stack size given initially by the thread creator. Most embedded systems and controllers use fixed (but configurable) preallocation (even physically preallocated in many cases).
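As a small illustration of the limit mentioned under scheme 1, the following sketch reads the stack size limit of the current process with the POSIX getrlimit() call, which reports the same value that the shell's ulimit -s shows. The helper names read_stack_limit and print_limit are made up here; this only reads the limit and does not show how the kernel grows the stack.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Reads the soft and hard stack limits of the calling process.
   Returns 0 on success, -1 on failure (as getrlimit does). */
int read_stack_limit(rlim_t *soft, rlim_t *hard)
{
    struct rlimit limit;

    assert(soft != NULL && hard != NULL);
    if (getrlimit(RLIMIT_STACK, &limit) != 0)
        return -1;
    *soft = limit.rlim_cur;  /* limit up to which the stack may currently grow */
    *hard = limit.rlim_max;  /* ceiling up to which the soft limit may be raised */
    return 0;
}

/* Prints one limit, spelling out the "no limit" case. */
void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu bytes\n", label, (unsigned long long)value);
}
```

A program can call read_stack_limit() and then print_limit("soft", soft) and print_limit("hard", hard); the soft value can be changed up to the hard value with setrlimit().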
The stack for a given process has a limited, fixed size. The reason you can't add more memory as you (theoretically) describe is because the stack must be contiguous, and it grows toward the heap. So, when the stack reaches the heap, no extension is possible.
The stack size for a userland program is not determined by the kernel. The kernel stack size is a configuration option for the kernel (usually 4k or 8k).
Edit: if you already know this, and were merely talking about the allocation of physical pages for a process, then you have the procedure down already. But there's no need to keep track of the "stack size" like this: the virtual pages in the stack with no pagetable entries are just normal overcommitted virtual pages. Physical memory will be granted on their first access. But the kernel does not have to overcommit memory, and thus a stack will probably have complete physical realization when the executable is first loaded.
The stack can only be used up to a certain length, because it has a fixed storage capacity in memory. If your question is in what direction the stack is used up, the answer is downwards: it is filled down in memory, towards the heap. The heap is a dynamic region of memory that can grow from the bottom up, based on your data storage needs.