How can I change the maximum available heap size for a task in FreeRTOS?

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
A dllist is 32 bytes in size.
My embedded system has 60 kB of SRAM, so I expected the system to handle my 200-element list easily. Instead, after allocating space for 8 elements, the system crashes on the 9th malloc call (just past 256 bytes in total).
If it is possible, where can I change the heap size inside FreeRTOS?
Can I somehow request the current status of heap size?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!

(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM, and configTOTAL_HEAP_SIZE is large, then it is unlikely that you will run out of heap after allocating 256 bytes, unless you had hardly any heap remaining beforehand. Many FreeRTOS demos just keep creating objects until all the heap is used, so if your application is based on one of those, you would be low on heap before your code executed. You may also have used up a lot of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
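As a rough sketch of where these settings live, the fragment below shows the two relevant entries in FreeRTOSConfig.h. The 40 KB value is only an assumption for a part with 60 KB of SRAM; the right figure depends on how much RAM your static data and other stacks need.

```c
/* FreeRTOSConfig.h (fragment) */

/* Total bytes available to pvPortMalloc() with heap_1, heap_2 and heap_4.
   The 40 KB figure is an assumed value, not a recommendation. */
#define configTOTAL_HEAP_SIZE          ( ( size_t ) ( 40 * 1024 ) )

/* Set to 1 to have the kernel call vApplicationMallocFailedHook()
   whenever pvPortMalloc() returns NULL. The application must then
   provide: void vApplicationMallocFailedHook( void ); */
#define configUSE_MALLOC_FAILED_HOOK   1
```

With the hook enabled, a breakpoint inside vApplicationMallocFailedHook() shows exactly which allocation failed, and a call to xPortGetFreeHeapSize() just before the list is built shows how much heap was left at that point.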

For the standard allocators you will find a config option (configTOTAL_HEAP_SIZE) in FreeRTOSConfig.h.
However:
It is quite possible that you have already run out of memory, depending on the allocator used. IIRC there is one that does not free() any blocks (free() is just a dummy), so any block returned will be lost. This is still useful if you only allocate memory at startup, for example, and then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might lose memory to fragmentation. Depending on your alloc/free pattern, you might quickly end up with a heap that looks like Swiss cheese: many holes between allocated blocks. So while there is still enough free memory in total, no single block is big enough for the size required.
If you only allocate blocks of that size there, you might be better off using your own allocator or a pool (blocks of fixed size). That would be statically allocated (e.g. as an array) and chained into a linked list during startup. Alloc/free would then just be push/pop on a stack (or put/get on a queue). That would also be very fast, with O(1) complexity (and interrupt-safe if properly written).
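The pool described above could look roughly like the following minimal sketch. The names (pool_init, pool_alloc, pool_free) and the two sizes are assumptions chosen to match the 32-byte, 200-element list from the question. The sketch does no locking, so on a real target the push and pop would need to run inside a critical section to be task-safe or interrupt-safe.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCK_SIZE  32u   /* assumed: the 32-byte dllist element */
#define POOL_BLOCK_COUNT 200u  /* assumed: the 200 elements from the question */

/* A free block stores the link to the next free block in its own memory.
   Blocks are aligned for pointers; a payload with stricter alignment
   requirements would need an extra member in this union. */
typedef union pool_block {
    union pool_block *next;
    unsigned char payload[POOL_BLOCK_SIZE];
} pool_block;

static pool_block pool_storage[POOL_BLOCK_COUNT]; /* statically allocated */
static pool_block *pool_free_list;                /* top of the free stack */

/* Chain every block into the free list; call once at startup. */
void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* Pop one block, or return NULL when the pool is exhausted. O(1). */
void *pool_alloc(void)
{
    pool_block *b = pool_free_list;
    if (b != NULL) {
        pool_free_list = b->next;
    }
    return b;
}

/* Push a block back. O(1). The pointer must come from pool_alloc(). */
void pool_free(void *p)
{
    assert(p != NULL);
    pool_block *b = p;
    b->next = pool_free_list;
    pool_free_list = b;
}
```

After pool_init(), the list code would call pool_alloc() where it currently calls pvPortMalloc(sizeof(dllist)) and check the result for NULL. Because all blocks have the same size, this pool cannot fragment.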
Note that normal malloc()/free() are not interrupt-safe.
Finally: Do not cast void *. (Well, that's actually what standard malloc() returns and I expect that FreeRTOS-variant does the same).

Related

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations which are platform depended.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an mcu with 256KB FLASH memory and 64KB RAM.
How does it know how much available RAM there is from my program?
For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc in itself could be implemented in different ways indeed. Either you include a "header" together with each allocated segment, the header stating the allocated size and potentially the address of the next available free segment. Or you could implement it as a look-up table where each item is a pointer to the first element and the size.
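To make the "header" variant concrete, here is a deliberately small first-fit sketch over a fixed static array. It is not how any particular C library does it: toy_malloc, toy_free and HEAP_UNITS are made-up names, the array stands in for the .heap segment, and the free blocks are found by walking all headers rather than through a separate free list.

```c
#include <assert.h>
#include <stddef.h>

/* Header stored in front of every block, free or allocated. */
typedef struct block_header {
    size_t units; /* payload length, counted in header-sized units */
    int    used;  /* 1 while the block is allocated */
} block_header;

#define HEAP_UNITS 64u /* assumed size; stands in for the .heap segment */

static block_header heap[HEAP_UNITS];
static int heap_ready;

static block_header *next_block(block_header *b) { return b + 1 + b->units; }
static int in_heap(const block_header *b) { return b < heap + HEAP_UNITS; }

/* First fit: return the first free block that is large enough.
   The payload is only aligned as strictly as block_header itself. */
void *toy_malloc(size_t n)
{
    if (n == 0 || n > sizeof heap) return NULL;
    size_t units = (n + sizeof(block_header) - 1) / sizeof(block_header);

    if (!heap_ready) {                    /* first call: one big free block */
        heap[0].units = HEAP_UNITS - 1;
        heap[0].used = 0;
        heap_ready = 1;
    }
    for (block_header *b = heap; in_heap(b); b = next_block(b)) {
        if (b->used || b->units < units) continue;
        if (b->units >= units + 2) {      /* split off the unused tail */
            block_header *rest = b + 1 + units;
            rest->units = b->units - units - 1;
            rest->used = 0;
            b->units = units;
        }
        b->used = 1;
        return b + 1;                     /* payload starts after the header */
    }
    return NULL;                          /* no single block is big enough */
}

/* Mark the block free and merge it with a free right-hand neighbour. */
void toy_free(void *p)
{
    if (p == NULL) return;
    block_header *b = (block_header *)p - 1;
    assert(in_heap(b) && b->used);
    b->used = 0;
    block_header *nx = next_block(b);
    if (in_heap(nx) && !nx->used) b->units += 1 + nx->units;
}
```

The allocator never asks how much RAM the rest of the program uses. It only knows the fixed array it was given, which is the point made above about the .heap segment.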
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason is that it doesn't make any sense. You don't want arbitrary behavior, you want deterministic behavior. You want to allocate x amount of memory for the worst case, and if a heap were used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.

Estimating available RAM left with safety margin in C (STM32F4)

I am currently developing an application for the STM32F407 using STM32CubeMX and Keil uVision. I know that dynamic memory allocation in embedded systems is mostly discouraged, but here and there on the internet I can find some arguments in favor of it.
Due to my inventor's soul I wanted to try it, but do it safely. Let's assume I'm creating a dynamically allocated FIFO for incoming UART messages, holding structs composed of the message itself and its length. However, I wouldn't like to consume the whole heap doing so, therefore I want to check how much of it I have left. My new (?) idea is to try temporarily allocating some big chunk of memory (say 100 chars): if it succeeds, I accept the incoming message; if not, it means that I'm running out of heap and I ignore the message (or accept it and dequeue the oldest). After checking I of course free the temporary memory.
A few questions arise in my mind:
First of all, does it make sense at all? Do you think, based on your experience, that it could be useful and safe?
I couldn't find precise info about what exactly shares RAM in embedded systems (I know about heap, stack and volatile vars), so my question is: provided that the answer to 1. isn't "hell no, go home", what size of temporary memory check would you pick for the mentioned controller?
About the micro itself: it has 192 kB of RAM, however in the Drivers\CMSIS\Device\ST\STM32F4xx\Source\Templates\arm\startup_stm32f407xx.s file only 512 B + 1024 B are allocated for heap and stack. Isn't that very little, leaving the whopping remaining 190 kB for volatile vars? Would increasing the heap size to, say, 50 kB be sensible? If yes, do I do it directly in this file, or is it better practice to do it somewhere else?
Probably for some of you "safe dynamic memory" and "embedded" in one post is both shocking and dazzling, but keep in mind that this is experimenting and exploring new horizons :) Thanks and greetings.
Keil uVision describes only the IDE. If you are using Keil MDK-ARM, which implies ARM's RealView compiler, then you can get accurate heap information using the __heapstats() function.
__heapstats() is a little strange in that, rather than simply returning a value, it outputs heap information to a formatted output stream by means of a function pointer and file descriptor passed to it. The output function must have an fprintf()-like interface. You can use fprintf() of course, but that requires that you have correctly retargeted stdio.
For example the following:
typedef int (*__heapprt)(void *, char const *, ...);
__heapstats( (__heapprt)fprintf, stdout ) ;
outputs for example:
4180 bytes in 1 free blocks (avge size 4180)
1 blocks 2^11+1 to 2^12
Unfortunately that does not really achieve what you need, since it outputs text. You could however implement your own function to capture the data in memory and parse the result. You may only need to capture the first run of decimal digits and discard anything else, except that the amount of free memory and the largest allocatable block are of course not necessarily the same thing. Fragmentation is indicated by the number of free blocks and their average size. You can perhaps guarantee to be able to allocate at least an average-sized block.
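A capture function of that kind might be sketched as follows. The heap_report type and both function names are hypothetical; the only thing taken from above is the fprintf()-like signature that __heapstats() expects. The sketch assumes the first number in the output is the free byte count, as in the sample output shown.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical capture buffer, passed where a FILE * would normally go. */
typedef struct {
    char   text[128];
    size_t used;
} heap_report;

/* fprintf()-like sink: appends formatted output to the report and
   silently truncates once the buffer is full. */
int heap_report_printf(void *param, char const *format, ...)
{
    heap_report *r = param;
    va_list args;
    int n;

    assert(r != NULL && format != NULL);
    va_start(args, format);
    n = vsnprintf(r->text + r->used, sizeof r->text - r->used, format, args);
    va_end(args);

    if (n > 0) {
        size_t room = sizeof r->text - r->used - 1; /* keep the terminator */
        r->used += ((size_t)n < room) ? (size_t)n : room;
    }
    return n;
}

/* Parse the leading decimal number of the captured text. */
unsigned long heap_report_free_bytes(const heap_report *r)
{
    return strtoul(r->text, NULL, 10);
}
```

On the target, the call would presumably be `__heapstats(heap_report_printf, &report);` with a zero-initialised `heap_report report`, followed by `heap_report_free_bytes(&report)`. Keep in mind that this number is the total free space, not the largest block that can be allocated.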
The issues with dynamic allocation in embedded systems have to do with handling memory exhaustion and, in real-time systems, the non-deterministic timing of both allocation and deallocation in the default malloc/free implementations. In your case you might be better off using a fixed-block allocator. You can implement such an allocator by creating a static array of memory blocks (or by dynamically allocating them from the heap at start-up), and placing a pointer to each block on a queue, linked list or stack structure. To allocate, you simply remove a pointer from the queue/list/stack, and to free, you put the pointer back. When the structure of available blocks is empty, memory is exhausted. It is entirely deterministic, and because it is your own implementation, it can easily be monitored for performance and capacity.
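One possible shape for such a fixed-block allocator, using a stack of pointers and a low-water mark for the capacity monitoring mentioned above, is sketched here. All names and both sizes are assumptions; the 100-byte block size simply mirrors the 100-char figure from the question. The blocks are plain byte buffers, and the sketch has no locking, so calls from both an ISR and a task would need protection.

```c
#include <assert.h>
#include <stddef.h>

#define MSG_BLOCK_SIZE  100u /* assumed: the 100-char figure from the question */
#define MSG_BLOCK_COUNT 16u  /* assumed pool depth */

static unsigned char msg_blocks[MSG_BLOCK_COUNT][MSG_BLOCK_SIZE];
static void  *free_stack[MSG_BLOCK_COUNT]; /* pointers to currently free blocks */
static size_t free_count;                  /* valid entries on the stack */
static size_t min_free_count;              /* fewest free blocks ever seen */

/* Put a pointer to every block on the stack; call once at start-up. */
void msg_pool_init(void)
{
    for (size_t i = 0; i < MSG_BLOCK_COUNT; i++) free_stack[i] = msg_blocks[i];
    free_count = min_free_count = MSG_BLOCK_COUNT;
}

/* Pop a block pointer; NULL means the pool is exhausted. */
void *msg_pool_alloc(void)
{
    if (free_count == 0) return NULL;
    void *p = free_stack[--free_count];
    if (free_count < min_free_count) min_free_count = free_count;
    return p;
}

/* Push a block pointer back; p must come from msg_pool_alloc(). */
void msg_pool_free(void *p)
{
    assert(p != NULL && free_count < MSG_BLOCK_COUNT);
    free_stack[free_count++] = p;
}

/* How close the pool has ever come to running out. */
size_t msg_pool_min_free(void) { return min_free_count; }
```

When msg_pool_alloc() returns NULL, the receiver can drop the message or dequeue the oldest one, which replaces the "probe allocation" idea with an exact answer.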
With respect to question 3. You are expected to adjust the heap and system stack size to suit your application. Most tools I have used have a linker script that automatically allocates all available memory not statically allocated, allocated to a stack or reserved for other purposes to the heap. However MDK-ARM does not do that in the default linker scripts but rather allocates a fixed size heap.
You can use the linker map file summary to determine how much space is unused and manually expand the heap. I usually do that, leaving a small amount of unused space to allow for maintenance, when the amount of statically allocated data may increase. At some point, however, you end up running out of memory, and the arcane error messages from the linker may not make it obvious that your heap is just too big. It is possible to override the default linker script and provide your own, and no doubt possible then to size the heap automatically, though I have never taken the trouble to try it.
Okay, I have tested my idea of checking the free heap space dynamically and it worked well (although I didn't perform long-run tests). However, @Clifford's answer and this article convinced me to abandon the idea of dynamic allocation. In the end I implemented my own static heap with pages (a 2D array), an occupied-pages indicator (a 0/1 array with one entry per page) and a FIFO of structs, each consisting of a pointer to the message on my static heap (actually just the array index) and the length of the message (to determine how many contiguous pages it occupies). 95% of the messages I receive should take up only one page and 5% two or three pages, so fragmentation is still possible, but at least I keep a tight rein on it and it affects only the part of memory assigned to this module (in other words, the fragmentation doesn't leak into other parts of the code). So far it has worked without any problems and it is surely faster: the lookup time is O(n*m), where n is the number of pages and m is the longest run of pages a message can need, but given the message size distribution it comes down to O(n) in practice. Moreover, n is always much smaller than the number of allocation units in the whole memory, so there is far less to search.
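The page scheme described above might look something like this sketch. It is a reconstruction from the description, not the poster's code: the names, the 32-byte page size and the 64-page count are all assumptions. A message is identified by the index of its first page, and the caller keeps the length so the pages can be released later.

```c
#include <assert.h>
#include <stddef.h>

#define MSG_PAGE_SIZE  32u /* assumed page size in bytes */
#define MSG_PAGE_COUNT 64u /* assumed number of pages */

static unsigned char pages[MSG_PAGE_COUNT][MSG_PAGE_SIZE]; /* the static "heap" */
static unsigned char page_used[MSG_PAGE_COUNT];            /* 0 = free, 1 = occupied */

static size_t pages_for(size_t len)
{
    return (len + MSG_PAGE_SIZE - 1) / MSG_PAGE_SIZE;
}

/* Reserve enough contiguous pages for len bytes.
   Returns the index of the first page, or -1 if no run is long enough. */
int page_alloc(size_t len)
{
    if (len == 0 || len > sizeof pages) return -1;
    size_t need = pages_for(len);
    size_t run = 0;

    for (size_t i = 0; i < MSG_PAGE_COUNT; i++) {
        run = page_used[i] ? 0 : run + 1;   /* length of the current free run */
        if (run == need) {
            size_t first = i + 1 - need;
            for (size_t j = first; j <= i; j++) page_used[j] = 1;
            return (int)first;
        }
    }
    return -1;
}

/* Release the pages of a message; first and len must match page_alloc(). */
void page_free(int first, size_t len)
{
    size_t need = pages_for(len);
    assert(first >= 0 && (size_t)first + need <= MSG_PAGE_COUNT);
    for (size_t j = 0; j < need; j++) page_used[(size_t)first + j] = 0;
}

/* Address of the first byte of a message. */
unsigned char *page_ptr(int first)
{
    assert(first >= 0 && (size_t)first < MSG_PAGE_COUNT);
    return pages[first];
}
```

This version scans the occupancy array once per allocation, so it is linear in the number of pages regardless of message length.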

free function not working in c / objective-c [duplicate]

Possible Duplicate:
How do malloc() and free() work?
I have encountered a weird problem and I'm really not sure why it doesn't work.
I have the following code in Xcode:
void *ptr = malloc(1024 * 1024 * 100);
memset(ptr, 0, 1024 * 1024 * 100);
free (ptr); //trace this line
ptr = malloc (1024 * 1024 * 100);
memset(ptr, 0, 1024 * 1024 * 100);
free (ptr); //trace this line
I put a breakpoint on each of the free() lines, and when I traced the program, free didn't really free up the 100 MB. However, if I change the number from 100 to 500 (allocate 500 MB twice, memset 500 MB), free() works fine. Why?
free can never fail (it does not have a return value) unless you call it with an improper address, which gives you undefined behavior.
You do not have to worry about whether free actually returns the memory to the system. You only have to make sure that you call free on the correct address once you are done with the dynamically allocated memory; the C library's allocator takes care of the rest.
This is one of those things that you should just trust your implementation to handle correctly.
Also, free just marks the memory being deallocated as free (as the name says) for reuse. It does not zero out or initialize the memory being deallocated.
When you pass a block of memory to free, that memory does not necessarily get returned to the operating system right away. In fact, based on the wording in the C standard, some argue that the memory can't be returned to the OS until the program exits.
The wording in question is (C99, §7.20.3.2/2): "The free function causes the space pointed to by ptr to be deallocated, that is, made available for further allocation." Their argument is that when/if a block of memory is allocated and then freed, it should be available for allocation again -- but if it's returned to the OS, some other process might take it, so it's no longer available for further allocation, as the standard requires. Personally, I don't find that argument completely convincing (I think "allocated by another process" is still allocation), but such is life.
Most libraries allocate large chunks of memory from the OS, and then sub-allocate pieces of those large chunks to the program. When memory is freed by the program, they put that block of memory on an "available" list for further allocation. Most also (at least at times) walk through the list of free blocks, merging free blocks that are at adjacent addresses.
Many also follow some heuristics about what memory to keep after it's been freed. First, they keep an entire block as long as any of the memory in that block remains in use. If, however, all the memory in a block has been freed, they look at its size, and (often) at how much free memory they have available. If the amount available and/or the size of the free block exceeds some threshold, they'll usually release it back to the OS.
Rather than having fixed thresholds, some try to tailor their behavior to the environment by (for example) basing their thresholds on percentages of available memory instead of fixed sizes. Without that, programs written (say) ten years ago when available memory was typically a lot smaller would often do quite a bit of "thrashing" -- repeatedly allocating and releasing the same (or similar) size blocks to/from the OS.
free() does not have to immediately unmap and return to the OS the pages backing up previously but no longer allocated buffers. It may keep them around so you can allocate memory quickly again. When the program finishes, the pages will be unmapped and returned to the OS.
As others already said, free() doesn't have to return memory to the OS. But I reject an idea that you should never care whether the memory is returned. There should be a good reason to care, but there are valid reasons.
If you do want to return memory to OS, use a platform-specific way which provides this guarantee:
mmap with MAP_ANONYMOUS on systems supporting it (there are many, but MAP_ANONYMOUS is not POSIX): mmap instead of malloc, munmap instead of free.
VirtualAlloc and VirtualFree on Windows.
[Should I add something here for other systems? Feel free to suggest.]
These ways of allocating memory work with big memory units (system page size or more).
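For the mmap route, a minimal sketch could look like this. The wrapper names os_alloc and os_free are made up; mmap, munmap and the flags are the real calls, with the caveat already noted that MAP_ANONYMOUS is not part of POSIX. The length passed to os_free must be the one used for the mapping.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS in glibc under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of zero-filled, private memory straight from the OS.
   Returns NULL on failure. */
void *os_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Unmap the region so the pages go back to the OS immediately.
   Returns 0 on success. */
int os_free(void *p, size_t len)
{
    assert(p != NULL);
    return munmap(p, len);
}
```

Unlike free(), a successful munmap() removes the mapping right away, so touching the memory afterwards faults instead of silently reusing it.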

Heap profiling on ARM

I am developing a GUI-heavy C++ application on a Freescale MX51-based board running Linux 2.6.35. I would like to perform heap profiling.
Unfortunately, all heap profiling tools I have found have either been too intrusive or ostensibly non-working on ARM. Specific tools I've tried:
Valgrind Massif: unworkable on my platform due to the platform's feeble CPU. The 80% CPU time overhead introduced by Massif causes a range of problems in my application that cannot be compensated for.
gperftools (formerly Google Performance Tools) tcmalloc: All features of this rather un-intrusive, library-based libc malloc() replacement work on my target except for the heap profiler. To rephrase, the thread caching allocator works but the profiler does not. I'll explain the failure mode of the profiler below for anyone curious.
Can anyone suggest a set of replacement tools for performing C++ heap profiling on ARM platforms? Ideal output would ultimately be a directed allocation graph, similar to what gperftools' tcmalloc outputs. Low resource utilization is a must: my platform is highly resource constrained.
Failure mode of gperftools' tcmalloc explained:
I'm providing this information only for those that are curious; I do not expect a response. I'm seeing something similar to gperftools' issue #407 below, except on ARM rather than x86.
Specifically, I always get the message "Hooked allocator frame not found, returning empty trace." I spent some time debugging the issue and it appears that, when dynamically linking the tcmalloc library, frame pointers at the boundary between my application and the dynamic library are null, so the stack cannot be walked "above" the call into the dynamic library.
gperftools issue #407: https://github.com/gperftools/gperftools/issues/410
stackoverflow user seeing similar problems on ARM: Missing frames on shared libraries on ARM
Heaps. Many ways to do them, but I've only run across 3 main types that matter in embedded land:
Linked list heaps. Each alloc is tracked in a "used" list. Once freed, blocks are dropped into a "free" list. On freeing, adjacent blocks of free memory are "joined" into larger pieces. Allocs can be any size. Each alloc and free is an O(N) op, as it has to traverse the free list to give you a piece of memory, plus break the free block into a size close to what you asked for while leaving the remaining block in the free list. Because of the increasing overhead per alloc, this system cannot be used by itself on smaller systems. It also tends to cause memory fragmentation over time if steps aren't taken to minimize it.
Fixed size (unit) heaps. You break your heap into equal size (smaller) parts. This wastes memory a bit, depending on how big the chunks are (and how many different sized, fixed allocator heaps you create), but alloc and free are both O(1) time operations. No searching, no joining. This style is often combined with the first one for "small object allocations" as the engines I've worked with have 95% of their allocations below a set size (say 256 bytes). This way, you use the unit heap for small allocs for huge speed and only minimal memory loss, while using the list heap for larger allocs. No external fragmentation of memory either.
Relocatable memory heaps. You don't give out pointers to memory, but handles. That way, behind the scenes, you can change memory pointers when needed to remove fragmentation or whatever. High overhead. High pain in the #$$ quotient, as it's easy to abuse and get dangling pointers all over. Also added overhead for each memory dereference. But I wanted to mention it.
Those are some basic patterns. You can find all sorts of libs out in the wild that use them and also have built-in statistics for number of allocs, fragmentation, and other useful stats. It's also not that hard to roll your own really, though I'd not recommend it for anything beyond satisfying curiosity, as debugging without a working malloc is painful indeed. Adding thread support is pretty straightforward as well, but again, downloading a ready-made solution is the better choice.
The above info applies to all platforms, ARM or otherwise, though most of my experience has been on low level ARM stuff so the above info is battle tested for your platform. Hope this helps!

Large constant array in global memory

Is it possible to increase performance by running on a GPU for the algorithm with the following properties:
There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
Each thread has a relatively small (less than 200Kb) local memory region containing thread-specific data. Read/Write
Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
For each access to the global memory there will be at least two accesses to the local memory
There will be a lot of branches in the algorithm
Unfortunately the algorithm is too complicated to be shown here.
My instinct is to use texture memory aggressively. The caching benefits will beat uncoalesced global memory reads by a mile.
For the writes, you may need to add some padding etc. to avoid bank conflicts.
The reliance on hundreds of meg or gigs of data is somewhat concerning. Can you carve it up somehow? Hope you have a big beefy Tesla/Quadro w/ oodles of RAM.
That said, the name of game for CUDA optimization is always to experiment, profile/measure, rinse and repeat.
Before I start, please remember that there are two layers of parallelism in CUDA: blocks and threads.
There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
Since you can launch as many as 65535 blocks per dimension, you can treat each block in CUDA as equivalent to a "thread" of yours.
Each thread has a relatively small (less than 200Kb) local memory region containing thread-specific data. Read/Write
Unfortunately most cards have a shared memory limit of 16k per block. So if you can figure out how to work within this lower limit, great. If not, you will need to use global memory accesses.
Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
You can not bind such large arrays to textures or constant memory. So in a given block, try to make the threads read contiguous chunks of data for the best performance.
For each access to the global memory there will be at least two accesses to the local memory
There will be a lot of branches in the algorithm
Since you are essentially replacing a single thread in your original implementation with a block in CUDA, you may want to revise the code a little to try to implement a parallel version of the "per thread code" too.
This may not be clear at first glance, but think it through a little. Any algorithm that has hundreds or thousands of independent parts with no synchronization needed is great for a parallel implementation, even with CUDA.