Heap profiling on ARM - embedded

I am developing a GUI-heavy C++ application on a Freescale MX51-based board running Linux 2.6.35, and I would like to perform heap profiling.
Unfortunately, all heap profiling tools I have found have either been too intrusive or ostensibly non-working on ARM. Specific tools I've tried:
Valgrind Massif: unworkable on my platform due to its feeble CPU. The 80% CPU time overhead introduced by Massif causes a range of problems in my application that cannot be compensated for.
gperftools (formerly Google Performance Tools) tcmalloc: all features of this rather unintrusive, library-based libc malloc() replacement work on my target except for the heap profiler. To rephrase: the thread-caching allocator works, but the profiler does not. I'll explain the failure mode of the profiler below for anyone curious.
Can anyone suggest a set of replacement tools for performing C++ heap profiling on ARM platforms? Ideal output would ultimately be a directed allocation graph, similar to what gperftools' tcmalloc outputs. Low resource utilization is a must; my platform is highly resource constrained.
Failure mode of gperftools' tcmalloc explained:
I'm providing this information only for those who are curious; I do not expect a response. I'm seeing something similar to gperftools issue #407 (linked below), except on ARM rather than x86.
Specifically, I always get the message "Hooked allocator frame not found, returning empty trace." I spent some time debugging the issue, and it appears that, when dynamically linking the tcmalloc library, frame pointers at the boundary between my application and the dynamic library are null, so the stack cannot be walked "above" the call into the dynamic library.
gperftools issue #407: https://github.com/gperftools/gperftools/issues/410
A Stack Overflow user seeing similar problems on ARM: Missing frames on shared libraries on ARM

Heaps. Many ways to do them, but I've only run across 3 main types that matter in embedded land:
Linked list heaps. Each alloc is tracked in a "used" list. Once freed, blocks are dropped into a "free" list, and adjacent free blocks are joined into larger pieces. Allocs can be any size. Each alloc and free is an O(N) operation, since it has to traverse the free list to give you a piece of memory, then split the free block into a size close to what you asked for while leaving the remainder in the free list. Because of the increasing overhead per alloc, this system cannot be used by itself on smaller systems. It also tends to cause memory fragmentation over time if steps aren't taken to minimize it.
Fixed size (unit) heaps. You break your heap into equal-size (smaller) parts. This wastes a bit of memory, depending on how big the chunks are (and how many different-sized fixed allocator heaps you create), but alloc and free are both O(1) operations. No searching, no joining. This style is often combined with the first one for "small object allocations", as the engines I've worked with have 95% of their allocations below a set size (say 256 bytes). That way you use the unit heap for small allocs for huge speed and only minimal memory loss, while using the list heap for larger allocs. No external fragmentation of memory either. A minimal sketch of this style follows the three descriptions below.
Relocatable memory heaps. You don't give out pointers to memory, but handles. That way, behind the scenes, you can move memory around when needed to remove fragmentation or whatever. High overhead. High pain-in-the-#$$ quotient, as it's easy to abuse and end up with dangling pointers all over. There is also added overhead for each memory dereference. But I wanted to mention it.
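For the fixed-size style, here is a minimal sketch in C of a single unit pool: one statically allocated array of equal-size blocks chained into a free list, with O(1) alloc and free. The block size and count are illustrative assumptions, not values from any particular engine, and the sketch is not thread safe.

    /* Minimal sketch of a fixed-size ("unit") heap: one pool of equal-size
       blocks chained into a free list, so alloc and free are both O(1).
       Block size and count are illustrative assumptions; not thread safe. */
    #include <stddef.h>
    #include <stdint.h>

    #define POOL_BLOCK_SIZE   256u   /* assumed "small object" threshold */
    #define POOL_BLOCK_COUNT  128u

    typedef union pool_block {
        union pool_block *next;      /* valid only while the block is free */
        uint8_t payload[POOL_BLOCK_SIZE];
    } pool_block;

    static pool_block pool_storage[POOL_BLOCK_COUNT];
    static pool_block *pool_free_list;

    void pool_init(void)
    {
        /* Chain every block onto the free list once, at startup. */
        for (size_t i = 0; i + 1 < POOL_BLOCK_COUNT; ++i) {
            pool_storage[i].next = &pool_storage[i + 1];
        }
        pool_storage[POOL_BLOCK_COUNT - 1].next = NULL;
        pool_free_list = &pool_storage[0];
    }

    void *pool_alloc(void)
    {
        pool_block *b = pool_free_list;   /* O(1): pop the head */
        if (b != NULL) {
            pool_free_list = b->next;
        }
        return b;                         /* NULL when the pool is exhausted */
    }

    void pool_free(void *p)
    {
        if (p != NULL) {                  /* O(1): push back onto the head */
            ((pool_block *)p)->next = pool_free_list;
            pool_free_list = (pool_block *)p;
        }
    }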
Those are the basic patterns. You can find all sorts of libraries in the wild that use them and also have built-in statistics for number of allocs, fragmentation, and other useful stats. It's also not that hard to roll your own, really, though I'd not recommend it for anything beyond satisfying curiosity, as debugging without a working malloc is painful indeed. Adding thread support is pretty straightforward as well, but again, downloading a ready-made solution is the better choice.
The above info applies to all platforms, ARM or otherwise, though most of my experience has been on low level ARM stuff so the above info is battle tested for your platform. Hope this helps!

Related

How does malloc know where the first available block is in embedded systems?

I have read that malloc has multiple implementations which are platform dependent.
How does it work in an embedded device in bare metal programming?
Let's suppose we have an MCU with 256KB of flash memory and 64KB of RAM.
How does it know how much RAM is available, given what my program already uses?
For bare metal systems, you'll have a specific segment allocated in the linker script, often called .heap. There is no such thing as memory sharing between processes, meaning that the heap must have a fixed maximum size and therefore is pretty useless in general. malloc doesn't know a thing about how much RAM your program uses since there is no desktop OS in sight.
Your RAM is divided into .stack, .data, .bss and .heap, each with its own fixed maximum size. More about these segments here: https://electronics.stackexchange.com/a/237759/6102. In a typical bare metal MCU application, most of the RAM will be reserved for .data and .bss. You will have something from 128 bytes up to several kb reserved for the stack. You will typically not have a heap at all - but if you do, it will sit there and take up a fixed amount of x kb no matter how much of it you actually use.
malloc in itself could indeed be implemented in different ways. Either you include a "header" with each allocated segment, the header stating the allocated size and potentially the address of the next free segment, or you could implement it as a look-up table where each item holds a pointer to the segment and its size.
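As a rough illustration of the header approach, here is a minimal first-fit sketch over a fixed, linker-defined region. The symbols __heap_start__ and __heap_end__ are hypothetical names standing in for whatever your linker script actually defines, and the region is assumed to be word-aligned. There is no coalescing on free and no thread safety, so treat it as a sketch rather than a usable allocator.

    /* Minimal sketch of the "header per segment" idea over a fixed,
       linker-defined region. __heap_start__ and __heap_end__ are hypothetical
       symbol names; use whatever your linker script actually defines. First
       fit, no coalescing on free, not thread safe: illustration only. */
    #include <stddef.h>
    #include <stdint.h>

    extern uint8_t __heap_start__;   /* provided by the linker script */
    extern uint8_t __heap_end__;

    typedef struct {
        size_t size;                 /* payload size in bytes */
        uint8_t used;                /* 0 = free, 1 = allocated */
    } block_hdr;

    static uint8_t *heap_top;        /* first never-used byte of the region */

    void heap_init(void)
    {
        heap_top = &__heap_start__;
    }

    void *tiny_malloc(size_t size)
    {
        size = (size + 3u) & ~(size_t)3u;          /* 4-byte align payloads */

        /* First pass: reuse a previously freed block that is big enough. */
        uint8_t *p = &__heap_start__;
        while (p < heap_top) {
            block_hdr *h = (block_hdr *)p;
            if (!h->used && h->size >= size) {
                h->used = 1;
                return p + sizeof(block_hdr);
            }
            p += sizeof(block_hdr) + h->size;
        }

        /* Otherwise carve a new block off the fixed-size region. */
        if (heap_top + sizeof(block_hdr) + size > &__heap_end__) {
            return NULL;   /* this region is all the RAM malloc ever knows about */
        }
        block_hdr *h = (block_hdr *)heap_top;
        h->size = size;
        h->used = 1;
        heap_top += sizeof(block_hdr) + size;
        return (uint8_t *)h + sizeof(block_hdr);
    }

    void tiny_free(void *ptr)
    {
        if (ptr != NULL) {
            ((block_hdr *)((uint8_t *)ptr - sizeof(block_hdr)))->used = 0;
        }
    }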
None of this is particularly relevant, since you shouldn't be using heap allocation in embedded systems. The main reason is that it doesn't make any sense. You don't want arbitrary behavior, you want deterministic behavior. You want to allocate x amount of memory for the worst case, and if a heap were used it would have to be at least that large anyway, so you gain nothing but bloat from using a heap. Then come all the usual problems with allocation overhead, fragmentation and leaks.
For bare metal/RTOS applications, do yourself a favour and delete .heap from your linker script, then forget that you ever heard about malloc. An MCU is not a PC.

How to properly assign huge heap space for JVM

I'm trying to work around an issue which has been bugging me for a while. In a nutshell: on what basis should one assign a max heap size for a resource-hogging application, and is there a downside to it being too large?
I have an application used to visualize huge medical data sets, which can eat up to several gigabytes of memory if several imaging volumes are opened side by side. Caching the data to be viewed is essential for a fluent workflow. The software is supported on Windows workstations and is started with a bootloader, which assigns the heap size and launches the main application. The actual memory needed by the main application is directly proportional to the data being viewed and cannot be determined by the bootloader, because that would require reading the data, which would ultimately consume too much time.
So, to ensure that the JVM has enough memory during launch, we set -Xmx as large as we dare, based (by current design) on the maximum physical memory of the workstation. However, is there any downside to this? I've read (in a post from 2008) that it is possible for native processes to hog excess heap space, which can lead to memory errors during runtime. Should I maybe also sniff for free virtual memory or paging file size prior to assigning heap space? How would you deal with this situation?
Oh, and this is my first post to these forums. Nice to meet you all and be gentle! :)
Update:
Thanks for all the answers. I'm not sure if I put it right, but my problem arose from the fact that I have zero knowledge of the hardware this software will be run on, but would nevertheless like to assign as much heap space for the software as possible.
I came to a solution of assigning a heap of 70% of physical memory IF there is sufficient amount of virtual memory available - less otherwise.
You can have heap sizes of around 28 GB with little impact on performance, especially if you have large objects (lots of small objects can increase GC pause times).
Heap sizes of 100 GB are possible but have downsides, mostly because they can have high pause times. If you use Azul Zing, it can handle much larger heap sizes significantly more gracefully.
The main limitation is the size of your physical memory. If your heap exceeds that, your application and your computer will slow to a crawl or become unusable.
A standard way around these issues with mapping software (which has to be able to map the whole world, for example) is to break your images into tiles. This way you only display the portions of the image which are on the screen. If you need to be able to zoom in and out, you might need to store data at two to four levels of scale. Using this approach you can view a map of the whole world on your phone.
Best not to set JVM max memory to more than 60-70% of workstation memory, in some cases even lower, for two main reasons. First, what the JVM consumes on the physical machine can be 20% or more greater than the heap, due to GC mechanics. Second, the representation of a particular data entity in the JVM heap may not be the only physical copy of that entity in the machine's RAM, as the OS keeps caches and buffers around the various IO devices from which it grabs these objects.

How can I change maximum available heap size for a task in FreeRTOS?

I'm creating a list of elements inside a task in the following way:
l = (dllist*)pvPortMalloc(sizeof(dllist));
dllist is 32 bytes big.
My embedded system has 60kB of SRAM, so I expected my 200-element list could be handled easily by the system. I found out that after allocating space for 8 elements, the system crashes on the 9th malloc call (at 256+ bytes).
If possible, where can I change the heap size inside FreeRTOS?
Can I somehow query the current heap usage?
I couldn't find this information in the documentation so I hope somebody can provide some insight in this matter.
Thanks in advance!
(Yes - FreeRTOS pvPortMalloc() returns void*.)
If you have 60K of SRAM and configTOTAL_HEAP_SIZE is large, then it is unlikely you are going to run out of heap after allocating 256 bytes, unless you had hardly any heap remaining beforehand. Many FreeRTOS demos will just keep creating objects until all the heap is used, so if your application is based on one of those, then you would be low on heap before your code executed. You may also have done something like use up loads of heap space by creating tasks with huge stacks.
heap_4 and heap_5 will combine adjacent blocks, which will minimise fragmentation as far as practical, but I don't think that will be your problem - especially as you don't mention freeing anything anywhere.
Unless you are using heap_3.c (which just makes the standard C library malloc and free thread safe) you can call xPortGetFreeHeapSize() to see how much free heap you have. You may also have xPortGetMinimumEverFreeHeapSize() available to query how close you have ever come to running out of heap. More information: http://www.freertos.org/a00111.html
You could also define a malloc() failed hook (http://www.freertos.org/a00016.html) to get instant notification of pvPortMalloc() returning NULL.
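For reference, here is a minimal sketch of how those two facilities might be wired up, assuming a heap scheme other than heap_3.c and configUSE_MALLOC_FAILED_HOOK set to 1 in FreeRTOSConfig.h. The task, the printf() call, and the 1-second period are placeholders, and pdMS_TO_TICKS()/xPortGetMinimumEverFreeHeapSize() are only available on reasonably recent FreeRTOS versions.

    /* Sketch only: report the remaining heap periodically, and trap the first
       failed pvPortMalloc(). Assumes a heap scheme other than heap_3.c and
       configUSE_MALLOC_FAILED_HOOK set to 1; the printf() stands in for
       whatever logging mechanism you actually have on the target. */
    #include <stdio.h>
    #include "FreeRTOS.h"
    #include "task.h"

    void vHeapMonitorTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            size_t xFree    = xPortGetFreeHeapSize();
            size_t xMinEver = xPortGetMinimumEverFreeHeapSize();
            printf("heap: %u bytes free now, %u bytes minimum ever\r\n",
                   (unsigned)xFree, (unsigned)xMinEver);
            vTaskDelay(pdMS_TO_TICKS(1000));
        }
    }

    /* Called by the kernel as soon as pvPortMalloc() returns NULL. */
    void vApplicationMallocFailedHook(void)
    {
        taskDISABLE_INTERRUPTS();
        for (;;) {
            /* Park here so the failure is obvious under a debugger. */
        }
    }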
For the standard allocators you will find the relevant config option, configTOTAL_HEAP_SIZE, in FreeRTOSConfig.h.
However:
It is quite possible you really are running out of memory, depending on the allocator used. IIRC there is one scheme where free() is just a dummy and never releases any blocks, so any block you return is lost. This is still useful if you only allocate memory, e.g. at startup, and then work with what you've got.
Other allocators might just not merge adjacent blocks once returned, increasing fragmentation much faster than a full-grown allocator.
Also, you might lose memory to fragmentation. Depending on your alloc/free pattern, you can quickly end up with a heap looking like Swiss cheese: many holes between allocated blocks. So while there is still enough free memory in total, no single block is big enough for the size required.
If you only allocate blocks of that one size there, you might be better off using your own allocator or a pool (blocks of fixed size). That would be statically allocated (e.g. an array) and chained into a linked list during startup. Alloc/free would then just be a push/pop on a stack (or a put/get on a queue), which is very fast and O(1) (and interrupt-safe if properly written). A sketch of that idea follows.
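Here is a minimal sketch of such a pool built on a FreeRTOS queue. The block size and count are assumptions matching the question (32-byte elements, 200 of them); in real code you would also make sure the backing storage is suitably aligned for the stored type, and use the ...FromISR queue variants inside interrupts. QueueHandle_t is the FreeRTOS v8+ name; older kernels call it xQueueHandle.

    /* Sketch of a fixed-size block pool handed out through a FreeRTOS queue.
       Alloc/free are O(1); the queue stores pointers to the static blocks. */
    #include <stddef.h>
    #include <stdint.h>
    #include "FreeRTOS.h"
    #include "queue.h"

    #define BLOCK_SIZE_BYTES  32u    /* e.g. sizeof(dllist) */
    #define NUM_BLOCKS        200u

    static uint8_t ucPoolStorage[NUM_BLOCKS][BLOCK_SIZE_BYTES];
    static QueueHandle_t xFreeBlocks;

    void vPoolInit(void)
    {
        xFreeBlocks = xQueueCreate(NUM_BLOCKS, sizeof(void *));
        for (size_t i = 0; i < NUM_BLOCKS; ++i) {
            void *pvBlock = &ucPoolStorage[i][0];
            (void)xQueueSend(xFreeBlocks, &pvBlock, 0);
        }
    }

    void *pvPoolAlloc(void)
    {
        void *pvBlock = NULL;
        /* Returns NULL at once if the pool is empty; pass a timeout instead
           of 0 if blocking until a block is freed is acceptable. */
        (void)xQueueReceive(xFreeBlocks, &pvBlock, 0);
        return pvBlock;
    }

    void vPoolFree(void *pvBlock)
    {
        if (pvBlock != NULL) {
            (void)xQueueSend(xFreeBlocks, &pvBlock, 0);
        }
    }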
Note that normal malloc()/free() are not interrupt-safe.
Finally: do not cast the returned void *. (That is what standard malloc() returns, and I expect the FreeRTOS variant does the same.)

Why do Cocoa apps use so much memory?

Even the standard blank-window Cocoa app that gets built when you make a new Cocoa project in Xcode uses almost 6 MB of memory. What's the reason for this? Is it possible to make an app use less, or does OS X simply manage memory differently for Cocoa apps?
Not that I'm complaining. I know that performance "hardly matters anymore" (edit: what I mean is, it matters less than readability/maintainability/the programmer's time). I'm just curious.
OS X does a lot of magic with shared memory and copy-on-write pages, so chances are that it doesn't take that much physical RAM for every application.
You can check exactly how memory blocks are mapped by running:
sudo vmmap <PID of the process>
It depends on all the frameworks (APIs) you use. Combine that with the VM allocations done by low-level operations.
It's only worth trying to reduce the total heap allocation and the resident size of your code: make sure your data structures are allocated efficiently and try compiling with the ever-so-famous -Os optimization flag (size optimization). There isn't much you can do about the VM consumed by Cocoa itself; I wouldn't really worry about it.
This is clearly a 'WTF' moment for developers in general. The question is usually - why does my trivial application use up so much memory.
The answer is down to the underlying framework. You could argue that 6MB is too much, but really, it is nothing.
It's not rare to see computers come with 2GB of memory these days, and the stock iMac has 4GB. The whole point of the computer industry is to use up all the resources a machine has so that it continues to evolve.
Yes, you should avoid inefficiencies where possible (don't load up a 5-million-point array at startup, for instance). But unless your beta demonstrates you fudged up, just keep it on the list of to-dos.
I'm a bit out on a limb here, but I guess it's because all the libraries that get added have to do quite a bit of setting up and there is no need to garbage collect, so they simply get to waste memory; plus, even if all memory got autoreleased, it would wait until the first idle event, which is after the creation of the window. Delete unneeded libraries/frameworks, or force a garbage collect somewhere after loading the window from the nib and see how much it goes down if you're so concerned.
I am not concerned about it. Some of the memory might be returned later, and the rest is the price you pay for a powerful framework.
A factor which is not directly linked to Cocoa but applies to frameworks in general is that the overhead is not linear. There is usually a fixed and a variable "price", in terms of overhead, for using the framework.
When you create a simple blank window, the fixed overhead dominates, but when you create an application with tens of windows, dialogs, controls and all, the initial fixed overhead becomes negligible compared to the size of the application itself.

Optimizing for ARM: Why different CPUs affect different algorithms differently (and drastically)

I was doing some benchmarks of code performance on Windows Mobile devices, and noticed that some algorithms were doing significantly better on some hosts and significantly worse on others, even after taking the difference in clock speeds into account.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
I checked out floating point as a possible cause, and while algorithm B does use floating-point code, it does not use it in the inner loop, and none of the cores seem to have an FPU.
So my question is: what mechanism may be causing this difference, preferably with suggestions on how to fix/avoid the bottleneck in question.
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 cycles vs. 8 cycles for the 1136, iirc), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are as big a factor, if not bigger. Cache is a factor. So is the speed of the storage the program runs from, if it runs from storage rather than memory.
Is this test using any shared libraries at any point, or is it all internal code? Fetching shared libraries from storage will vary from platform to platform (even if it is, say, the same SD card).
Is this the same algorithm compiled separately for each platform, or the same binary? You can and will see some compiler-induced variation as well: 50% faster or slower can easily come from the same compiler on the same platform just by varying compiler settings. If possible you want to execute the same binary, and ensure that no shared libraries are used in the loop under test. If it is not the same binary, disassemble the loop under test for each platform and ensure that there are no variations other than register selection.
From the data you have presented, it's difficult to pinpoint the exact problem, but I can share some prior experience:
Cache settings: check whether all the processors have the same cache configuration. You need to check both the D-cache and the I-cache.
For analysis, break down your code further, not just per algorithm but at the block level, and try to identify the block that causes the bottleneck. Once you find that block, disassemble its source code and check the assembly. It may help.
Looks like the problem is in cache settings or something memory-related (maybe I-Cache "overflow").
Pipeline stalls and branch mispredictions usually produce less significant differences.
You can try counting some basic operations executed in each algorithm, for example (a small instrumentation sketch appears after this list):
number of "easy" arithmetical/bitwise ops (+-|^&) and shifts by constant
number of shifts by variable
number of multiplications
number of "hard" arithmetics operations (divides, floating point ops)
number of aligned memory reads (32-bit)
number of byte memory reads (8-bit; slower than 32-bit reads)
number of aligned memory writes (32-bit)
number of byte memory writes (8-bit)
number of branches
something else, don't remember more :)
That will tell you which operations the 926 handles much more slowly. After that you can check the suspicious blocks, making their use more or less intensive, and you'll get the answer.
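As a rough illustration of that manual counting, here is a small sketch, assuming you are able to edit the algorithm's source. The counter names and the COUNT macro are made up, and the counts describe the operation mix rather than timings (the counters themselves perturb the timing).

    /* Hypothetical operation counters, bumped by hand next to the operations
       of interest and dumped once per run on each platform. */
    #include <stdio.h>

    static unsigned long g_mul_count;
    static unsigned long g_byte_read_count;
    static unsigned long g_branch_count;

    #define COUNT(counter) ((counter)++)

    /* Inside the algorithm, for example:
           COUNT(g_mul_count);       sum += a[i] * b[i];
           COUNT(g_byte_read_count); c = src8[i];                           */

    void dump_op_counts(void)
    {
        printf("muls=%lu byte reads=%lu branches=%lu\n",
               g_mul_count, g_byte_read_count, g_branch_count);
    }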
Furthermore, it's much better to enable assembly listing generation in Visual Studio and use that (rather than your high-level source code) as the basis for the investigation.
P.S.: maybe the problem is in the OS/software/firmware? Did you test on a clean system? Is the OS the same on all devices?