Handling stack overflows in embedded systems - error-handling

In embedded software, how do you handle a stack overflow in a generic way?
I have come across some processors that provide protection in hardware, such as recent AMD processors.
There are some techniques described on Wikipedia, but are those practical approaches in real systems?
Can anybody suggest a clear approach that works in all cases on today's 32-bit embedded processors?

Ideally you write your code with static stack usage (no recursive calls). Then you can evaluate maximum stack usage by:
static analysis (using tools)
measurement of stack usage while running your code with as close to complete code coverage as possible, until you are reasonably confident you have established the extent of stack usage (this assumes your rarely-run code doesn't use noticeably more stack than the normal execution paths)
But even with that, you still want to have a means of detecting and then handling stack overflow if it occurs, if at all possible, for more robustness. This can be especially helpful during the project's development phase. Some methods to detect overflow:
If the processor supports a memory read/write interrupt (i.e. a memory access breakpoint interrupt), it can be configured to watch the furthest extent of the stack area.
In the memory map configuration, set up a small (or large) block of RAM that is a "stack guard" area. Fill it with known values. In the embedded software, regularly (as often as reasonably possible) check the contents of this area. If it ever changes, assume a stack overflow.
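A minimal sketch of that stack-guard check, assuming the guard block is a plain array that your linker script places directly beyond the end of the stack (the names and sizes here are only illustrative):

#include <stdint.h>
#include <stdbool.h>

#define GUARD_WORDS   16
#define GUARD_PATTERN 0xDEADBEEFu

/* Hypothetical guard block; in a real project you would place this
 * just past the end of the stack with a linker-script section. */
static volatile uint32_t stack_guard[GUARD_WORDS];

void stack_guard_init(void)
{
    for (unsigned i = 0; i < GUARD_WORDS; i++)
        stack_guard[i] = GUARD_PATTERN;
}

/* Call this regularly, e.g. from the main loop or a timer tick. */
bool stack_guard_intact(void)
{
    for (unsigned i = 0; i < GUARD_WORDS; i++)
        if (stack_guard[i] != GUARD_PATTERN)
            return false;   /* assume the stack has overflowed */
    return true;
}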
Once you've detected it, then you need to handle it. I don't know of many ways that code can gracefully recover from a stack overflow, because once it's happened, your program logic is almost certainly invalidated. So all you can do is
log the error
Logging the error is very useful, because otherwise the symptoms (unexpected reboots) can be very hard to diagnose.
Caveat: The logging routine must be able to run reliably even with a corrupted stack, so it should be simple. For example, with a corrupted stack you probably can't write to EEPROM using your fancy EEPROM-writing background task. Maybe just log the error into a struct reserved for this purpose in non-initialised RAM, which can then be checked after reboot (a sketch appears at the end of this answer).
Reboot (or perhaps shutdown, especially if the error reoccurs repeatedly)
Possible alternative: restart just the particular task, if you're using an RTOS, and your system is designed so the stack corruption is isolated, and all the other tasks are able to handle that task restarting. This would take some serious design consideration.
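As a sketch of the log-to-non-init-RAM idea mentioned above: assume a linker section (called .noinit here) that the startup code does not zero, and a reset routine appropriate for your part; the section name, attribute syntax and reset call are placeholders:

#include <stdint.h>

extern void system_reset(void);   /* placeholder: e.g. NVIC_SystemReset() on a Cortex-M */

/* Survives a reset because the section is not zeroed by the startup code. */
struct fault_record {
    uint32_t magic;      /* 0x464C5457 ("FLTW") marks the record as valid */
    uint32_t fault_code; /* e.g. 1 = stack overflow */
    uint32_t extra;      /* optional: stack pointer, task id, ... */
};

__attribute__((section(".noinit")))
static struct fault_record g_fault_record;

void log_fault_and_reboot(uint32_t code, uint32_t extra)
{
    g_fault_record.magic = 0x464C5457u;
    g_fault_record.fault_code = code;
    g_fault_record.extra = extra;
    system_reset();
}

/* Called early at startup: report and consume any record from the previous run. */
void check_previous_fault(void)
{
    if (g_fault_record.magic == 0x464C5457u) {
        /* report g_fault_record.fault_code through whatever channel is available */
        g_fault_record.magic = 0;
    }
}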

While embedded stack overflow can be caused by recursive functions getting out of hand, it can also be caused by errant pointer usage (although this could be considered another type of error), and normal system operation with an undersized stack. In other words, if you don't profile your stack usage it can occur outside of a defect or bug situation.
Before you can "handle" stack overflow you have to identify it. A good method for doing this is to load the stack with a pattern during initialization and then monitor how much of the pattern disappears during run-time. In this fashion you can identify the highest point the stack has reached.
The pattern check algorithm should execute in the opposite direction of stack growth. So, if the stack grows from 0x1000 to 0x2000, then your pattern check can start at 0x2000 to increase efficiency. If your pattern was 0xAA and the value at 0x2000 contains something other than 0xAA, you know you've probably got some overflow.
You should also consider placing an empty RAM buffer immediately after the stack so that if you do detect overflow you can shut down the system without losing data. If your stack is followed immediately by heap or SRAM data then identifying an overflow will mean that you have already suffered corruption. Your buffer will protect you for a little bit longer. On a 32-bit micro you should have enough RAM to provide at least a small buffer.
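Here is a minimal sketch of that pattern check for the common case of a downward-growing stack; the linker symbol names are placeholders, and the startup code is assumed to have painted the whole region with the pattern before main() runs:

#include <stdint.h>
#include <stddef.h>

#define STACK_PAINT 0xAAAAAAAAu

/* Placeholder linker symbols for the stack region (stack grows downward). */
extern uint32_t __stack_limit__;   /* lowest address of the stack area */
extern uint32_t __stack_top__;     /* one past the highest address     */

/* Returns how many bytes of the paint pattern are still intact, i.e. how
 * close the stack has come to its limit. Scanning starts at the far end,
 * opposite to the direction of growth, so it usually stops quickly. */
size_t stack_headroom_bytes(void)
{
    const uint32_t *p = &__stack_limit__;
    size_t intact = 0;

    while (p < &__stack_top__ && *p == STACK_PAINT) {
        intact += sizeof(*p);
        p++;
    }
    return intact;   /* 0 means the stack reached (or passed) its limit */
}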

If you are using a processor with a memory management unit (MMU), the hardware can do this for you with minimal software overhead. Most modern 32-bit processors have one, and more and more 32-bit microcontrollers feature them as well.
Set up a memory area in the MMU that will be used for the stack. It should be bordered by two memory areas where the MMU does not allow access. When your application is running, you will receive an exception/interrupt as soon as you overflow the stack.
Because you get an exception at the moment the error occurs, you know exactly where in your application the stack went bad. You can look at the call stack to see exactly how you got there. This makes it a lot easier to find your problem than trying to work out what went wrong from symptoms detected long after it happened.
I have used this successfully on PPC and AVR32 processors. When you start out using an MMU it feels like a waste of time, since you got along great without it for many years, but once you have seen the advantage of an exception at the exact spot where your memory problem occurs, you will never go back. An MMU can also detect null-pointer accesses if you disallow memory access to the bottom part of your RAM.
If you are using an RTOS, the MMU protects the memory and stacks of the other tasks, so errors in one task should not affect them. This means you could also easily restart the offending task without affecting the others.
In addition, a processor with an MMU usually also has plenty of RAM, so your program is a lot less likely to overflow its stack and you don't need to fine-tune everything to get your application to run correctly with a small memory footprint.
An alternative would be to use the processor's debug facilities to cause an interrupt on a memory access to the end of your stack. This will probably be very processor specific.
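Many Cortex-M parts have no MMU but do have an MPU, which can give the same kind of guard. A rough sketch, assuming an ARMv7-M MPU and CMSIS register definitions; the guard array, its linker placement and the device header name are illustrative, and the RBAR/RASR encodings should be checked against your part's reference manual:

#include <stdint.h>
#include "your_device.h"   /* placeholder: pulls in the CMSIS core header (MPU, SCB) */

/* Hypothetical guard block placed just below the stack by the linker script.
 * ARMv7-M MPU regions must be aligned to their own size. */
__attribute__((aligned(256), section(".stack_guard")))
static uint8_t stack_guard[256];

void stack_guard_mpu_init(void)
{
    /* Region 0: 256-byte no-access region covering the guard block. */
    MPU->RNR  = 0;
    MPU->RBAR = (uint32_t)stack_guard;        /* base address, region selected by RNR */
    MPU->RASR = (0x0u << 24)                  /* AP = 000: no access                  */
              | (1u   << 28)                  /* XN: no instruction fetches           */
              | (7u   << 1)                   /* SIZE = 7 -> 2^(7+1) = 256 bytes      */
              | 1u;                           /* region enable                        */

    SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;  /* enable the MemManage fault           */
    MPU->CTRL   = MPU_CTRL_PRIVDEFENA_Msk     /* keep the default map elsewhere       */
                | MPU_CTRL_ENABLE_Msk;
    __DSB();
    __ISB();
}

void MemManage_Handler(void)
{
    /* log (e.g. into a non-init RAM struct) and reset */
    for (;;) { /* or call your system reset here */ }
}

With this in place, the first push into the guard raises a MemManage fault at the offending instruction instead of silently corrupting whatever lies beyond the stack.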

A stack overflow occurs when the stack memory is exhausted by a call stack that has grown too large, e.g. a recursive function nested too many levels deep.
There are techniques to detect a stack overflow by placing known data beyond the end of the stack, so that an overflow can be detected if the stack grows too far and overwrites it.
There are static source code analysis tools such as GnatStack, StackAnalyzer from AbsInt, and Bound-T which can be used to determine or make a guess at the maximum run-time stack-size.

Related

Weird memory corruption issue, FreeRTOS, STM32F777II

I am currently working on an embedded firmware development which uses FreeRTOS running on an STM32F777II microcontroller. Resource-wise, I have around 10 tasks (the total sum of stack sizes will be under 40 KByte) at the same priority, around 4 queues of 1 KByte each, and 4 binary semaphores. I know this would be an incomplete question without posting the actual code, but I really do not have any specific portion of my firmware that I think is worth sharing in relation to my issue, and I have a ton of business logic in my code which I cannot fully share either.
I have a struct which consists of multiple char and int arrays of specific lengths. Four of the tasks use one of these structures each. Each structure consumes around 15 KByte of space and is defined in the global space of the FreeRTOS environment, not local to a task. The structs are allocated statically only, not dynamically at runtime. And since I initialize a few members of the struct when declaring it, they go into the .data section, if I am not mistaken. Until now there had been absolutely no problem whatsoever in my code and it worked 100% without any issue at all.
I recently had a requirement where I had to add the same struct to 2 more tasks. So I added this 15 KByte struct to one of the tasks, basically just allocated and initialized it and did not do any processing in any of the tasks. I observed no problems, no data corruption, nothing. But when I allocated one more struct variable of the same type, what I observe is data corruption in a lot of other places in my project. Some of the queues stopped working correctly and showed garbage data when read. Some of the other buffers also showed data corruption. I am really not sure why just one more allocation of this struct triggers so much data corruption at other places in my project. If I remove this one allocation, everything goes back to normal.
My MCU has 512 KB of RAM and, as per the IDE's build analyzer feature, it shows below 40% RAM usage. So what is triggering this issue? Any suggestions of things to try? Could it be because of some overlap of the .data or .bss sections or something? I did not observe any stack overflows or hard faults in the system during this.
For a quick resolution,
On a hunch, I simply disabled the D-cache by commenting out the call:
SCB_EnableDCache();
and voila, everything started to function correctly as it should without any instances of data corruption.
For the long-term, correct resolution:
It looks like there are some latent issues in my code. I need to review the memory use and the regions of memory with different properties, look at the buses, and review any DMA usage and the MPU memory settings. I also need to review the correct use of volatile, thread-safe operation, and cache-coherency issues, and use memory fences and cache flushing as appropriate.
More details: Level 1 cache on STM32F7 Series and STM32H7 Series
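For the cache-coherency part, the Cortex-M7 CMSIS headers provide cache-maintenance helpers. A hedged sketch of the usual pattern around a DMA buffer (the buffer and function names are illustrative, the device header is assumed to be the STM32CubeF7 one, addresses and lengths should be 32-byte cache-line aligned, and the exact helper prototypes vary slightly between CMSIS versions):

#include <stdint.h>
#include "stm32f7xx.h"   /* assumed device header pulling in core_cm7.h (SCB_* cache helpers) */

/* DMA buffer, aligned to the 32-byte cache line size of the Cortex-M7,
 * and sized as a multiple of a cache line so unrelated data never shares a line. */
__attribute__((aligned(32)))
static uint8_t dma_rx_buf[512];

void on_dma_receive_complete(void)
{
    /* DMA wrote to RAM behind the cache's back: discard stale cached lines
     * before the CPU reads the buffer. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)dma_rx_buf, (int32_t)sizeof(dma_rx_buf));
    /* ... now safe to parse dma_rx_buf ... */
}

void send_buffer_via_dma(uint8_t *buf, int32_t len)
{
    /* CPU wrote through the cache: push the data out to RAM before DMA reads it. */
    SCB_CleanDCache_by_Addr((uint32_t *)buf, len);
    /* ... start the DMA transmit ... */
}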

What are traps?

There are many different types of traps listed in processor datasheets, e.g. BusFault, MemManage Fault, Usage Fault and Address Error.
What is their purpose? How can they be utilized in fault handling?
Traps are essentially subroutine calls that are forced by the processor when it detects something unusual in your stream of instructions. (Some processors make them into interrupts, but that's mostly just pushing more context onto the stack; this gets more interesting if the trap includes a switch between user and system address spaces).
This is useful for handling conditions that occur rarely but need to be addressed, such as division by zero. Normally, it is useless overhead to have an extra pair of instructions to test the divisor for zero before executing a divide instruction, since the divisor is never expected to be zero. So the architects have the processor do this check in parallel with the actual divide as part of the divide instruction, and cause the processor to trap to a divide-by-zero routine if the divisor is zero. Another interesting case is illegal-memory-address; clearly, you don't want to have to code a test to check each address before you use it.
Often there are a variety of fault conditions of potential interest and the processor by design will pass control to a different trap routine (often set up as a vector) for each different type of fault.
Once the processor has a trap facility, the CPU architects find lots of uses for it. A common use is debugger breakpoints, and trap-to-OS for the purpose of executing an operating system call.
Microprocessors have traps for various fault conditions. They are synchronous interrupts that allow the running OS / software to take appropriate action on the error. Traps interrupt program flow and set register bits to indicate the fault. Debugger breakpoints are also implemented using traps.
In a typical computing environment, the operating system takes care of CPU traps triggered by user processes. Let's consider what happens when I run the following program:
int main(void)
{
    volatile int a = 1, b = 0;
    a = a % b; /* div by zero */
    return 0;
}
An error message was displayed, and my box is still running like nothing happened. My operating system's approach to fault handling in this case was to kill the offending process and inform the user with the error message Floating point exception.
Traps in kernel mode are more problematic. It is not as straightforward for the OS to take corrective action when it is itself at fault. For a system process there is no underlying layer of protection. This is why faulty device drivers can cause real problems.
When working on bare metal, without the comforting protection of an operating system, the situation is much like the one above. The number one objective for achieving continuous and correct operation is to catch all potential trap conditions before they get to trigger any traps, using assertions and higher-level error handlers. Consider traps the last line of defense, a safety net you don't intentionally want to fall into.
Defining behaviors for trap handlers is worth some thought, even if they "should never happen". They will be executed when things go wrong in an unanticipated manner, be it due to cosmic rays altering RAM in the most extreme case. Unfortunately, there is no single correct answer to what error handlers should do.
Code Complete, 2nd ed:
The style of error processing that is most appropriate depends on the kind of software the error occurs in and generally favors more correctness or more robustness. Strictly speaking, these terms are at opposite ends of the scale from each other. Correctness means never returning an inaccurate result; no result is better than an inaccurate result. Robustness means always trying to do something that will allow the software to keep operating, even if that leads to results that are inaccurate sometimes.
Clearly, my operating system's fault handling is designed with robustness in mind; I can execute flawed code and do pretty much anything without crashing the system. Designing solely for robustness would mean a recovery attempt whenever possible, and if all else fails, reset. This is a suitable approach if your product is e.g. a toy.
Safety critical applications need a bit more paranoia and should favor correctness instead; when a fault is detected, write error log, shutdown. We don't want our radiation therapy unit to pick dosage levels from invalid garbage values.
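As a concrete illustration of "write error log, shutdown" on a Cortex-M-class part, a minimal sketch assuming CMSIS register definitions and a RAM section the startup code does not zero (the device header name is a placeholder):

#include <stdint.h>
#include "your_device.h"   /* placeholder device header providing SCB and NVIC_SystemReset() */

__attribute__((section(".noinit")))
static volatile uint32_t g_last_fault[3];   /* survives the reset if .noinit is not zeroed */

void HardFault_Handler(void)
{
    g_last_fault[0] = 0x48464C54u;  /* "HFLT" marker                                   */
    g_last_fault[1] = SCB->HFSR;    /* HardFault status                                */
    g_last_fault[2] = SCB->CFSR;    /* configurable fault status (bus/mem/usage)       */
    NVIC_SystemReset();             /* or drive outputs to a safe state and halt,
                                       depending on what "shutdown" means for the product */
}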
The ARMv7-M (not to be confused with the ARM7 or the ARMv7-A) Cortex-M3 technical reference manual, which may also be part of one of the newer ARM ARMs (ARM Architecture Reference Manuals), has a section describing each one of these faults.
Now the whys, versus the whats, are perhaps at the root of the question. The why is usually so you have a chance to recover. Imagine your set-top box or telephone hits one of these: do you want it to hang, or, if possible, try to recover? Unless you are expecting one of these faults (and in this context you shouldn't be; x86 systems and some of their faults are a completely different story), if you survive long enough to hit one of these you would most likely end up pulling the trigger on yourself (the software trying to kill itself by resetting the processor/system). You can go through the long list and try to find ones you can recover from. Divide by zero: how is the exception handler to know what math mistake led to this? In general it can't. Unaligned load or store: how is the handler to know what that code was trying to do? Like divide by zero, it is probably a software bug. Undefined instruction: the code went into the weeds and executed data; most likely by this point you are already too far gone and couldn't recover. Any kind of memory bus fault: the handler cannot repair the hardware.
You have to go through every fault and, for each one, define how you are going to handle it, all the ways you could have gotten to that fault, and the ways you can get out of or handle each of those paths. On occasion you might be able to recover; otherwise you need a default action, for example hanging the processor in an infinite loop in the handler so that a software engineer, if available, can get in with a debugger and find where the code stopped. Or have a watchdog timer, inside or outside the chip depending on the chip and board design (often an external WDT will reset the whole board). You might have some non-volatile memory in which you attempt to store the fault before letting or causing the reset, though the time and code it takes to do that might lead you to another fault, depending on what is failing.
Simply put, they allow you to execute code when something happens in the processor. They're sometimes used by the OS for error recovery.

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question about the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using "rep movs".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an asynch memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask whether the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy; those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted at in srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
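A toy sketch of that zero-copy idea: the driver and the application share a pool of preallocated buffers and pass ownership by pointer instead of copying payloads. All names here are hypothetical; real stacks expose this through their own descriptor/ring APIs:

#include <stdint.h>
#include <stddef.h>

#define BUF_SIZE  1536
#define POOL_SIZE 8

typedef struct {
    uint8_t data[BUF_SIZE];
    size_t  len;
    int     in_use;
} net_buf_t;

static net_buf_t pool[POOL_SIZE];   /* shared, DMA-capable region */

/* Driver side: hand the hardware an empty buffer to fill. */
net_buf_t *buf_alloc(void)
{
    for (int i = 0; i < POOL_SIZE; i++)
        if (!pool[i].in_use) { pool[i].in_use = 1; return &pool[i]; }
    return NULL;
}

/* Application side: process the data in place, then return the buffer.
 * No memcpy between "driver buffer" and "application buffer". */
void app_handle_packet(net_buf_t *b)
{
    /* parse b->data[0 .. b->len-1] directly ... */
    b->in_use = 0;   /* give the buffer back to the pool */
}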
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
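For reference, this is roughly what "just use the string move instructions" looks like when spelled out by hand with GCC/Clang inline assembly on x86-64; in practice you would simply call memcpy() and let the library pick the best implementation:

#include <stddef.h>

/* Copy n bytes from src to dst with the x86 string-move instruction.
 * REP MOVSB uses RSI (source), RDI (destination) and RCX (count). */
static void *memcpy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    __asm__ volatile ("rep movsb"
                      : "+D" (dst), "+S" (src), "+c" (n)
                      :
                      : "memory");
    return ret;
}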
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

How to determine maximum stack usage in embedded system?

When I give the Keil compiler the "--callgraph" option,
it statically calculates the exact "Maximum Stack Usage" for me.
Alas, today it is giving me a "Maximum Stack Usage = 284 bytes + Unknown(Functions without stacksize...)" message, along with a list of "Functions with no stack information".
Nigel Jones says that recursion is a really bad idea in embedded systems
("Computing your stack size" 2009),
so I've been careful not to make any mutually recursive functions in this code.
Also, I make sure that none of my interrupt handlers ever re-enable interrupts until their final return-from-interrupt instruction, so I don't need to worry about re-entrant interrupt handlers.
Without recursion or re-entrant interrupt handlers, it should be possible to statically determine the maximum stack usage.
(And so most of the answers to
How to determine maximum stack usage?
do not apply).
My understanding is that the software that handles the "--callgraph" option
first finds the maximum stack depth for each interrupt handler when it's not interrupted by a higher-priority interrupt, and the maximum stack depth of the main() function when it is not interrupted.
Then it adds them all up to find the total (worst-case) maximum stack depth.
That occurs when the main() background task is at its maximum depth when it is interrupted by the lowest-priority interrupt, and that interrupt is at its maximum depth when it is interrupted by the next-lowest-priority interrupt, and so on.
I suspect the software that handles --callgraph is getting confused about the small assembly-language functions in the "Functions with no stack information" list.
The --callgraph documentation seems to imply that I need to manually calculate (or make a conservative estimate) how much stack they use -- they're very short, so that should be simple -- and then "Use frame directives in assembly language code to describe how your code uses the stack."
One of them is the initial startup code that resets the stack to zero before jumping to main() -- so, in effect, this consumes zero stack.
Another one is the "Fault" interrupt handler that locks up in an infinite loop until I cycle the power -- it's safe to assume this consumes zero stack.
I'm using the Keil uVision V4.20.03.0 to compile code for the LM3S1968 ARM Cortex-M3.
So how do I use "frame directives" to tell the software that handles "--callgraph" how much stack these functions use?
Or is there some better approach to determine maximum stack usage?
(See How to determine maximum stack usage in embedded system with gcc? for almost the same question targeted to the gcc compiler.)
Use the --info=stack option in the linker. The map file will then include stack usage for all functions with external linkage.
In a single-tasking environment, the stack usage for main() will give you the total requirement. If you are using an RTOS such as RTX, where each task has its own stack, then you need to look at the stack usage for all task entry points and then add some more (64 bytes in the case of RTX) for the task context storage.
This and other techniques applicable to Keil and more generally are described here
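If you also want a runtime cross-check of the static figure, with armcc you can sample the stack pointer and track its low-water mark against the limit in your scatter file. A sketch (__current_sp() is an ARM Compiler intrinsic; the stack-base symbol name is a placeholder, and sampling can of course miss short-lived peaks between calls):

#include <stdint.h>

extern uint32_t Stack_Mem;               /* placeholder: base (lowest address) of the stack region */
static uint32_t lowest_sp = 0xFFFFFFFFu;

/* Call from the main loop and/or a periodic interrupt. */
void stack_watermark_sample(void)
{
    uint32_t sp = (uint32_t)__current_sp();   /* armcc intrinsic: current stack pointer */
    if (sp < lowest_sp)
        lowest_sp = sp;
}

/* Worst-case margin observed so far, in bytes (stack grows downward). */
uint32_t stack_margin_bytes(void)
{
    return lowest_sp - (uint32_t)&Stack_Mem;
}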
John Regehr of the University of Utah has a good discussion of measuring stack usage in embedded systems at http://www.embedded.com/design/prototyping-and-development/4025013/Say-no-to-stack-overflow, though note that the link to ftp.embedded.com is stale, and one occurrence of “without interrupts disabled” should have either the first or last word negated. In the commercial world, Coverity has a configurable stack overflow checker, and some versions of CodeWarrior have a semi-documented warn_stack_usage pragma. (It’s not mentioned in my version of the compiler documentation, but is in MetroWerks’ “Targeting Palm OS” document.)

How does a stack memory increase?

In a typical C program, the Linux kernel provides 84K to ~100K of stack memory. How does the kernel allocate more memory for the stack when the process uses up what it was given?
As I understand it, when the process has used up all of its stack memory and touches the next contiguous memory, it should page fault, and the kernel then handles that page fault.
Is it at that point that the kernel provides more stack memory to the given process? And which data structure in the Linux kernel identifies the size of the stack for the process?
There are a number of different methods used, depending on the OS (Linux real-time vs. normal) and the language runtime system underneath:
1) dynamic, by page fault
The OS typically preallocates a few real pages at the higher addresses and assigns the initial SP to that. The stack grows downward, the heap grows upward. If a page fault happens somewhat below the stack bottom, the missing intermediate pages are allocated and mapped, effectively growing the stack from the top towards the bottom automatically. There is typically a maximum up to which such automatic allocation is performed, which may or may not be specified in the environment (ulimit) or the exe header, or dynamically adjusted by the program via a system call (rlimit); this adjustability in particular varies heavily between OSes (a small sketch after this list shows how a program can query or set this limit). There is also typically a limit on "how far away" from the current stack bottom a page fault is considered OK so that an automatic grow happens. Notice that not all systems' stacks grow downward: under HP-UX it used to grow upward, so I am not sure what Linux on PA-RISC does (can someone comment on this?).
2) fixed size
Other OSes (especially embedded and mobile environments) either have fixed sizes by definition, or take the size from the exe header, or have it specified when a program/thread is created. Especially in embedded real-time controllers, this is often a configuration parameter, and individual control tasks get fixed stacks (to avoid runaway threads taking the memory of higher-priority control tasks). Of course, also in this case the memory might be allocated only virtually, until really needed.
3) pagewise, spaghetti and similar
Such mechanisms tend to be forgotten, but are still in use in some runtime systems (I know of Lisp/Scheme and Smalltalk systems). These allocate and grow the stack dynamically as required; however, not as a single contiguous segment, but as a linked chain of multi-page chunks. This requires different function entry/exit code to be generated by the compiler(s) in order to handle segment boundaries, so such schemes are typically implemented by the language support system and not by the OS itself (in earlier times it used to be the OS - sigh). The reason is that when you have many (say thousands of) threads in an interactive environment, preallocating say 1 MB each would simply fill your virtual address space, and you could not support a system where the stack needs of an individual thread are unknown beforehand (which is typically the case in a dynamic environment, where the user might enter eval code into a separate workspace). So dynamic allocation as in scheme 1 above is not possible, because there would be other threads with their own stacks in the way. Instead the stack is made up of smaller segments (say 8-64 KB) which are allocated and deallocated from a pool and linked into a chain of stack segments. Such a scheme may also be required for high-performance support of things like continuations, coroutines, etc.
Modern Unixes/Linuxes and (I guess, but am not 100% certain) Windows use scheme 1) for the main thread of your exe, and 2) for additional (p-)threads, which need a fixed stack size given by the thread creator initially. Most embedded systems and controllers use fixed (but configurable) preallocation (in many cases even physically preallocated).
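A small Linux-side sketch of schemes 1) and 2): querying the main stack's automatic-growth limit with getrlimit(), and giving an additional pthread a fixed stack size at creation time (build with -pthread):

#include <stdio.h>
#include <pthread.h>
#include <sys/resource.h>

static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

int main(void)
{
    /* Scheme 1: the main stack grows on demand, up to RLIMIT_STACK. */
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) == 0)
        printf("main stack limit: soft %llu, hard %llu bytes\n",
               (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    /* Scheme 2: a pthread gets a fixed-size stack chosen by its creator. */
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);   /* 256 KiB, fixed for this thread */
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}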
The stack for a given process has a limited, fixed size. The reason you can't add more memory as you (theoretically) describe is that the stack must be contiguous, and it grows toward the heap. So, when the stack reaches the heap, no extension is possible.
The stack size for a userland program is not determined by the kernel. The kernel stack size is a configuration option for the kernel (usually 4k or 8k).
Edit: if you already know this, and were merely talking about the allocation of physical pages for a process, then you have the procedure down already. But there's no need to keep track of the "stack size" like this: the virtual pages in the stack with no pagetable entries are just normal overcommitted virtual pages. Physical memory will be granted on their first access. But the kernel does not have to overcommit memory, and thus a stack will probably have complete physical realization when the executable is first loaded.
The stack can only be used up to a certain size, because it has a fixed storage capacity in memory. If your question is about the direction in which the stack is used up, the answer is downwards: it is filled down in memory, towards the heap. The heap is a dynamic region of memory that grows from the bottom up, based on your data-storage needs.