Points to consider when designing or coding for a smaller memory footprint

Please post the points one should keep in mind when designing or coding deliverables with a small memory footprint for embedded systems.
I am not giving compiler or platform details, as I want generic information. However, any specific information on Linux-based OSes is also welcome.

It depends on how low you want to get. I'm currently coding for fiscal printers: there's no OS, and the main rule is no dynamic memory allocation. The funny thing is that I still convinced the crew to write fully modern C++ ;).
Actually there are a few rules we decided upon:
no dynamic allocation (see the sketch after this list)
hence, no STL
no exception handling (obvious reasons)
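As a rough illustration of the no-dynamic-allocation rule (a minimal sketch, written in plain C for brevity even though the answer above uses C++; the arena_alloc name and sizes are made up): all working memory is reserved statically and handed out from a fixed arena, so malloc() never appears and the worst-case memory use is known at link time.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fixed arena: all memory reserved at build time, no malloc(). */
#define ARENA_SIZE 4096u

static uint8_t arena[ARENA_SIZE];
static size_t  arena_used;

/* Hand out pointer-aligned chunks from the arena; returns NULL when the
 * budget is exhausted. Nothing is ever freed - lifetimes are fixed by design. */
static void *arena_alloc(size_t size)
{
    size_t aligned = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    if (arena_used + aligned > ARENA_SIZE)
        return NULL;          /* running out is a design error, not a runtime surprise */
    void *p = &arena[arena_used];
    arena_used += aligned;
    return p;
}

int main(void)
{
    /* Example: reserve a 256-byte I/O buffer once, at startup. */
    uint8_t *io_buf = arena_alloc(256);
    printf("arena used: %zu of %u bytes\n", arena_used, ARENA_SIZE);
    return io_buf ? 0 : 1;
}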

There isn't a general answer, only ones specific to language/platform ... but
Small memory footprint ...
Don't use Java, C#/mono, PHP, Perl, Python or anything with garbage collection
Get as close to the metal as feasible: use C
Do a lot of profiling to see where memory is getting allocated, if you are using dynamic allocation
Ensure you prevent heap fragmentation by allocating sensibly sized chunks from the heap
Avoid recursive functions, especially those that use malloc(). It is better to allocate a chunk once and pass a pointer around.
use free() ;)
Ensure your types are no bigger than required (see the sketch after this list)
Turn on compiler optimizations
There will be more.
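To make the "types no bigger than required" point concrete, here is a small sketch (the struct names and field widths are illustrative) showing how fixed-width types from <stdint.h> and member ordering shrink a record that might be stored thousands of times:

#include <stdint.h>
#include <stdio.h>

/* Naive record: plain 'int' plus padding makes this larger than it needs to be.
 * On a typical 32-bit target: 4 + 4 + 1 (+3 bytes padding) = 12 bytes. */
struct sample_naive {
    int  channel;     /* only ever 0..15   */
    int  value;       /* only ever 0..4095 */
    char valid;
    /* 3 bytes of padding here on most ABIs */
};

/* Right-sized record: fixed-width types, largest members first.
 * Typically 2 + 1 + 1 = 4 bytes, a 3x saving per element. */
struct sample_small {
    uint16_t value;   /* a 12-bit ADC reading fits in 16 bits */
    uint8_t  channel;
    uint8_t  valid;
};

int main(void)
{
    printf("naive: %zu bytes, small: %zu bytes\n",
           sizeof(struct sample_naive), sizeof(struct sample_small));
    return 0;
}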

For a really low footprint, consider writing assembly directly.
We all know that Hello World in C or C++ is 20 kB+ (because of all the default libraries which get linked in). In assembly this overhead is gone. As pointed out in the comments, one can also trim the standard libraries down quite a bit. However, the fact remains that the code density you can achieve when writing assembly is much higher than what a compiler will generate from a higher-level language. So for code where every byte matters, use assembly.
Also, on devices with less capable processors, programming in assembly language might be the only way to make the program fast enough to meet real-time requirements, for instance to control machines.

When faced with such constraints, it is advisable to pre-allocate memory in order to guarantee that the system will work under load. A design pattern such as "object pooling" can be used to share resources within the system.
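A minimal object-pool sketch in plain C (pool_acquire/pool_release and the msg_t type are hypothetical names, and a real system would add locking or confine the pool to one context): every object is pre-allocated in a static array and recycled through a free list, so memory use is bounded and allocation is O(1) with no fragmentation.

#include <stddef.h>

/* Hypothetical message object managed by the pool. */
typedef struct msg {
    struct msg *next;      /* free-list link, unused while the object is in use */
    int         id;
    char        payload[32];
} msg_t;

#define POOL_SIZE 16

static msg_t  pool[POOL_SIZE];   /* pre-allocated at build time */
static msg_t *free_list;

/* Chain every object onto the free list once, at startup. */
static void pool_init(void)
{
    for (int i = 0; i < POOL_SIZE - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = &pool[0];
}

/* O(1) acquire; returns NULL when the pool is exhausted (handle that explicitly). */
static msg_t *pool_acquire(void)
{
    msg_t *m = free_list;
    if (m)
        free_list = m->next;
    return m;
}

/* O(1) release: push the object back onto the free list. */
static void pool_release(msg_t *m)
{
    m->next = free_list;
    free_list = m;
}

int main(void)
{
    pool_init();
    msg_t *m = pool_acquire();
    if (m) {
        m->id = 42;
        pool_release(m);
    }
    return 0;
}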
The C language enables tight resource (i.e. memory & compute cycles) control. It should be strongly considered.
Avoid recursion as it is easy to abuse and can result in stack overflow conditions.

Related

Is it recommended to use SPI flash to run code instead of internal flash, due to the memory limitation of the internal flash?

We are using an LPC546xx-family microcontroller in our project. Currently, at the initial stage, we are finalizing the software and hardware requirements. The basic firmware (which contains the RTOS, a 3rd-party stack, libraries, etc.) is currently 480 KB. Once the full application is developed, its size will exceed the internal flash size (512 KB), and in addition we need storage that can hold a firmware update image separately.
So we plan to use a 4 MB/8 MB SPI flash (S25LP064A-JBLE, http://www.issi.com/WW/pdf/IS25LP032-064-128.pdf, serial flash memory) to boot and run the firmware from.
Is it recommended to run code from SPI flash? How can I map the external flash memory directly into the CPU memory space? Can anyone give an example that contains this memory mapping (linker script etc.) or a demo application in which the LPC546xx uses SPI flash?
Generally speaking it's not recommended, or put differently: the closer to the CPU, the better. However, both the S25LP064A and the LPC546xx support XIP, so it is viable.
This is not a trivial issue, as many aspects come into play; in other words, the issue is best avoided and should really have been ironed out in the planning stage. Embedded systems are more about compromising than anything else, and making the right/better choices takes skill and experience.
Same question with replies on the NXP forum: link
512 KB of NVRAM is huge. There is almost certainly room for optimisations even if 3rd-party libraries are used.
On a related note this discussion concerning XIP should give valuable insight: link.
I would strongly encourage the use of file systems if not done already, and external storage is much better suited for that; the further from the computational unit, the more relevant this becomes. That is not XIP, and the penalty is a copy to RAM either way you do it, i.e. performance will be slower. But in my experience, the need for speed has oftentimes not been thoroughly considered, and is at least partially greatly overestimated.
Regarding your mentioning of RTOS and FW-upgrade:
Unless it's a poor RTOS, there's file-system awareness built in. Especially for FW upgrading (note: you'll need room for three images, factory reset included), a file system will make life much easier and less risky, unless upgrading is already supported by the SoC vendor through some other means (OTA). If there's no FS awareness, it can be added.
FW upgrade requires a lot of extra storage, and the simpler the scheme, the more storage it needs. Simpler is, however, also safer, which matters hugely for FW upgrades in particular. In the simplest case (a flat binary image), you'll need at least twice the amount of memory you're already consuming.
All in all: I think the direction you're going in is viable and, depending on the actual situation, perhaps your only choice.
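I can't give an LPC546xx-specific linker script here, but as a rough generic sketch of how XIP placement is usually arranged with GCC (the section name, region name and base address below are assumptions, not NXP's actual names), the linker script declares a memory region for the memory-mapped SPI flash and individual functions are placed there with a section attribute, while hot code stays in internal flash or RAM:

/* Generic sketch - the section name ".ext_flash_text" and the SPIFI region are
 * assumptions; the real names and addresses come from your SDK / linker script.
 *
 * In the GNU ld script you would declare something along these lines:
 *
 *   MEMORY   { SPIFI (rx) : ORIGIN = 0x10000000, LENGTH = 8M }
 *   SECTIONS { .ext_flash_text : { *(.ext_flash_text*) } > SPIFI }
 */

/* Rarely used code: execute in place (XIP) from the external SPI flash. */
__attribute__((section(".ext_flash_text")))
void render_settings_menu(void)
{
    /* slow path - size matters more than speed here */
}

/* Hot code: keep in internal flash (or copy to RAM) for full speed. */
void motor_control_loop(void)
{
    /* fast path - runs every control tick */
}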

Memory/Address Sanitizer vs Valgrind

I want a tool to diagnose use-after-free bugs and uninitialized-memory bugs. I am considering the Sanitizers (Memory and/or Address) and Valgrind, but I have very little idea about their advantages and disadvantages. Can anyone describe the main features, differences and pros/cons of the Sanitizers and Valgrind?
Edit: I found some comparisons, e.g.: Valgrind uses DBI (dynamic binary instrumentation) and the Sanitizers use CTI (compile-time instrumentation); Valgrind makes the program much slower (20x), whereas the Sanitizers run much faster than Valgrind (2x). If anyone can give me some more important points to consider, it will be a great help.
I think you'll find this wiki useful.
TLDR main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself, you also need to use relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs
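As a minimal illustration of the compile-time-instrumentation workflow (assuming a reasonably recent gcc or clang built with ASan support), this is the kind of use-after-free that AddressSanitizer reports with stack traces for the allocation, the free and the faulting access; Valgrind would catch it too, just far more slowly:

/* uaf.c - build with:  gcc -g -fsanitize=address uaf.c -o uaf   (clang works the same way) */
#include <stdlib.h>

int main(void)
{
    int *p = malloc(16 * sizeof *p);
    if (!p)
        return 0;
    p[0] = 1;
    free(p);
    return p[0];   /* heap-use-after-free: ASan aborts here with a detailed report */
}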
One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).

Are there modern compilers for high level languages on simple processors which produce self-modifying code?

Back in the days before caches and branch prediction, it was relatively common, if not encouraged, to write self-modifying code for certain kinds of optimizations. It was probably most common in games and demos written in assembler in the eras from 8-bit up to early 32-bit machines such as the Amiga.
I'm not sure if any compilers from those days emitted self-modifying assembler or machine code.
What I'm wondering is whether there are any current/modern compilers that do. Obviously it would be useless or too difficult on powerful processors with caches.
But what about the very many simple processors such as those used in embedded systems? Is self-modifying code considered a viable optimization strategy by any modern compiler for any simple / 8-bit / embedded processor?
There is a question with a similar title, "Is there any self-improving compiler around?", but note that it's not about the same subject:
Please note that I am talking about a compiler that improves itself - not a compiler that improves the code it compiles.
All embedded systems today use flash ROM. I believe the Amiga and similar machines were RAM-based systems. The only way "self-modification" exists in embedded systems is in boot loaders that re-program certain parts of the flash depending on which program and/or functionality should be used.
Apart from that, it doesn't make sense for the program to modify itself at runtime. Running code from RAM is generally discouraged, for safety reasons (the potential for accidental modification caused by bugs) and for electrical reasons (RAM is volatile and far more sensitive to electromagnetic interference than flash).

What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
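To make the streaming model concrete, here is a rough double-buffering sketch in plain C. The dma_get/dma_wait helpers are stand-ins for the real SPU DMA calls (mfc_get plus a tag wait), stubbed with memcpy so the shape of the loop is visible: while the SPU crunches one buffer, the DMA engine fills the other.

#include <string.h>
#include <stddef.h>

#define CHUNK 1024   /* elements per transfer; both buffers must fit in the 256 KB local store */

/* Stand-ins for the real SPU DMA calls; a real kernel would issue an
 * asynchronous mfc_get tagged with 'tag' and later wait on that tag. */
static void dma_get(float *local, const float *remote, size_t n, int tag)
{
    (void)tag;
    memcpy(local, remote, n * sizeof *local);
}
static void dma_wait(int tag) { (void)tag; /* block until transfers with this tag complete */ }

static void process(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2.0f;           /* placeholder for the real SIMD kernel */
}

/* Stream 'total' floats from main memory through two small local buffers. */
void stream_kernel(const float *main_mem, size_t total)
{
    static float buf[2][CHUNK];   /* the double buffer living in local store */
    size_t chunks = total / CHUNK;
    if (chunks == 0)
        return;

    dma_get(buf[0], main_mem, CHUNK, 0);                 /* prime buffer 0 */
    for (size_t c = 0; c < chunks; c++) {
        int cur = (int)(c & 1), nxt = cur ^ 1;
        if (c + 1 < chunks)                              /* start fetching the next chunk... */
            dma_get(buf[nxt], main_mem + (c + 1) * CHUNK, CHUNK, nxt);
        dma_wait(cur);                                   /* ...but wait only for the current one */
        process(buf[cur], CHUNK);                        /* compute overlaps the in-flight DMA */
    }
}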
The Cell in the PS3 gives you 6 usable SPU processors. Each has 256 KB of non-shared local memory and is connected via a high-speed ring that allows DMA between the SPUs and the PowerPC host processor. They execute strictly in order and have no caches. This makes them rather different from a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPUs do not use the same instruction set as the PowerPC, so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level, the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel, and each sub-task can be scheduled onto an available SPU. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs for a given number of transistors available on the chip.
Completely agree with George Philips and Crashworks. Only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for its results for a frame while the rest of your SPUs sit idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on the SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.

Why do Cocoa apps use so much memory?

Even the standard blank-window Cocoa app that gets built when you make a new Cocoa project in Xcode uses almost 6 MB of memory. What's the reason for this? Is it possible to make an app use less, or does OS X simply manage memory differently for Cocoa apps?
Not that I'm complaining. I know that performance "hardly matters anymore" (edit: what I mean is, it matters less than readability/maintainability/the programmer's time). I'm just curious.
OS X does a lot of magic with shared memory and copy-on-write pages, so chances are that it doesn't take that much physical RAM for every application.
You can check exactly how memory blocks are mapped by running:
sudo vmmap <PID of the process>
It depends on all the frameworks (APIs) you use. Combine that with the VM allocations done by low-level ops.
It's only worth trying to reduce the total heap allocation and the resident size of the code: make sure your data structures are allocated efficiently, and try compiling with the ever-so-famous "-Os" optimization flag (size optimization). There isn't much you can do about the VM eaten by Cocoa. I wouldn't really worry about it.
This is clearly a 'WTF' moment for developers in general. The question is usually: why does my trivial application use up so much memory?
The answer is down to the underlying framework. You could argue that 6MB is too much, but really, it is nothing.
It's not rare to see computers come with 2 GB of memory these days. The stock iMac has 4 GB. The whole point of the computer industry is to use up all the resources a machine has so that it continues to evolve.
Yes, you should avoid inefficiencies where possible (don't load up a 5-million-point array at startup, for instance). But unless your beta demonstrates that you fudged up, just keep it on the to-do list.
I'm a bit out on a limb here, but I guess it's because all the libraries that get added have to do quite a bit of setting up, and there is no need to garbage-collect, so they simply get to waste memory; plus, even if all memory were autoreleased, that would wait until the first idle event, which comes after the creation of the window. Delete unneeded libraries/frameworks, or force a garbage collection somewhere after loading the window from the nib, and see how much it goes down if you're so concerned.
I am not concerned about it. Some of the memory might be returned later, and the rest is the price you pay for a powerful framework.
A factor which is not directly linked to Cocoa but applies to frameworks in general is that the overhead is not linear: there is usually a fixed and a variable "price", in terms of overhead, for using the framework.
When you create a simple blank window, the fixed overhead is crushing, but when you create an application with tens of windows, dialogs, controls and all, that initial fixed overhead becomes negligible compared to the size of the application itself.