How to determine maximum stack usage in embedded system? - embedded

When I give the Keil compiler the "--callgraph" option,
it statically calculates the exact "Maximum Stack Usage" for me.
Alas, today it is giving me a "Maximum Stack Usage = 284 bytes + Unknown(Functions without stacksize...)" message, along with a list of "Functions with no stack information".
Nigel Jones says that recursion is a really bad idea in embedded systems
("Computing your stack size" 2009),
so I've been careful not to make any mutually recursive functions in this code.
Also, I make sure that none of my interrupt handlers ever re-enable interrupts until their final return-from-interrupt instruction, so I don't need to worry about re-entrant interrupt handlers.
Without recursion or re-entrant interrupt handlers, it should be possible to statically determine the maximum stack usage.
(And so most of the answers to "How to determine maximum stack usage?" do not apply.)
My understanding is that the software that handles the "--callgraph" option
first finds the maximum stack depth for each interrupt handler when it's not interrupted by a higher-priority interrupt, and the maximum stack depth of the main() function when it is not interrupted.
Then it adds them all up to find the total (worst-case) maximum stack depth.
That occurs when the main() background task is at its maximum depth when it is interrupted by the lowest-priority interrupt, and that interrupt is at its maximum depth when it is interrupted by the next-lowest-priority interrupt, and so on.
I suspect the software that handles --callgraph is getting confused about the small assembly-language functions in the "Functions with no stack information" list.
The --callgraph documentation seems to imply that I need to manually calculate (or make a conservative estimate) how much stack they use -- they're very short, so that should be simple -- and then "Use frame directives in assembly language code to describe how your code uses the stack."
One of them is the initial startup code that resets the stack to zero before jumping to main() -- so, in effect, this consumes zero stack.
Another one is the "Fault" interrupt handler that locks up in an infinite loop until I cycle the power -- it's safe to assume this consumes zero stack.
I'm using the Keil uVision V4.20.03.0 to compile code for the LM3S1968 ARM Cortex-M3.
So how do I use "frame directives" to tell the software that handles "--callgraph" how much stack these functions use?
Or is there some better approach to determine maximum stack usage?
(See How to determine maximum stack usage in embedded system with gcc? for almost the same question targeted to the gcc compiler.)

Use the --info=stack linker option. The map file will then include the stack usage for all functions with external linkage.
In a single tasking environment, the stack usage for main() will give you the total requirement. If you are using an RTOS such as RTX where each task has its own stack, then you need to look at the stack usage for all task entry points, and then add some more (64 bytes in the case of RTX) for the task context storage.
This and other techniques applicable to Keil, and more generally, are described here.

John Regehr of the University of Utah has a good discussion of measuring stack usage in embedded systems at http://www.embedded.com/design/prototyping-and-development/4025013/Say-no-to-stack-overflow, though note that the link to ftp.embedded.com is stale, and one occurrence of “without interrupts disabled” should have either the first or last word negated. In the commercial world, Coverity has a configurable stack overflow checker, and some versions of CodeWarrior have a semi-documented warn_stack_usage pragma. (It’s not mentioned in my version of the compiler documentation, but is in MetroWerks’ “Targeting Palm OS” document.)

Related

Process Instructions storage (Operating System)

I was learning about how a process looks in memory (from Operating System Concepts by Abraham Silberschatz).
So I came to know that it mainly has the following sections:
HEAP
STACK
DATA (R/W data, for global variables)
Text (or read-only code) that contains the code
Shared library
OS reserved space
(Diagram: "Figure 3.1 - A process in memory")
I have some questions regarding the process workflow.
1. Where does the PCB fit in this diagram?
2. People generally show the called functions getting pushed onto the process stack in memory, but in actuality three pieces of information get pushed (local variables, passed parameters, and the return address). Two sub-questions here:
2.1 Where are the actual instructions stored in this diagram (since the stack holds only data, not instructions)? Is it in the text section?
2.2 If this stack data is pushed, it must get popped when the function execution is completed. So how does the return address come into play while popping?
That "Figure 3.1 - A process in memory", shows the address space of a process.  Each process typically has its own address space (not depicted), and the kernel also typically has its own address space (also not depicted).
The PCBs usually live within the kernel, who is managing the processes.
Functions, when they are not active, still have machine code instructions, which are located within the text segment/section of the process.
Functions that are invoked are said to be activated, and an activation record, also known as a stack frame or call frame, is created on the stack for some functions, depending on how complex the function is.  (The stack frame or activation record is effectively private data to the function, not generally intended for other functions to inspect, modulo exception (throw/catch) mechanisms)
A function, X, that calls another function, Y, will suspend itself upon the invocation of Y, waiting for Y to return to X before X resumes.  In such a scenario, X uses an activation record on the stack to maintain its suspended state, which it will use upon resumption.
If/when function Y returns to X it must remove any data that it (Y) allocated on the stack, and restore the stack pointer, and other call-preserved registers, to their original value(s) in order for X to successfully resume.  The stack pointer register is the reference for a function to find its stack allocated data.
When X calls Y we can speak to the presence of two return addresses: X has a return address to get back to its caller, and Y has a return address to get back to X.  As this is the nature of calling, some architectures will provide instructions for calling that also push the return address directly onto the stack, and provide instructions for returning that pop the return address off the stack to resume the caller.
However, RISC architectures will generally leave the return address in a CPU register, requiring functions that make further calls to save their own return address in a place that will survive their own further calling (that place being stack memory).  RISC architectures also tend to engage in less pushing and popping, collecting all stack frame allocations into one (larger) allocation in prologue and all stack frame deallocations into one (larger) deallocation in epilogue.
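As a small illustration of the calling and return-address machinery just described, here is a hedged C sketch; the comments describe what a typical ARM (RISC) compiler might emit, though the exact instructions vary by compiler and options:

    int y(int a) {
        /* Leaf function: the return address stays in the link register (LR);
         * no stack frame may be needed at all.  Return is simply "bx lr". */
        return a + 1;
    }

    int x(int a) {
        /* Non-leaf: the prologue (e.g. "push {r4, lr}") saves x's own return
         * address on the stack, because the calls below will overwrite LR. */
        int r = y(a) + y(a + 2);
        /* The epilogue (e.g. "pop {r4, pc}") pops the saved return address
         * straight into the program counter, resuming x's caller. */
        return r;
    }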
The suspension, invocation, returning, and resumption is logical, from the point of view of functions: though the processor doesn't really see or care about functions, but rather instead sees a continuous stream of instructions that happen to include various forms of branching.

OS Context Switch in ISR

I am just eager to know how the OS actually does a context switch when some asynchronous event raises an ISR that makes a higher-priority task ready to run. As far as I know, when the CPU enters an ISR it puts some register values on the hardware stack, so how does the scheduler retrieve those values and put them on the task stack? Does it access the hardware stack in order to copy values that are already preserved? I hope I was clear.
Thanks in advance.
On a Cortex-M3 processor you have the MSP (Main Stack Pointer - which is your hardware stack) and the PSP (Process Stack Pointer - which is your task stack).
On entry to an exception, the stack frame is stored on the current PSP stack (in normal, non-nested operation). The exception handler then switches to the MSP stack; however, it can still access the PSP stack, so it can store any remaining registers etc. on that same PSP stack, as well as any other task information it needs.
The exception handler can then select the new high-priority task, switch the PSP to that task's stack, and restore the registers it needs. It then leaves the PSP in exactly the same state as when the task was suspended, so that on return from exception the rest of the stack is correctly restored.
It is more complex than this in certain situations but that is the basic operation (On ARM Cortex-M). It will be different on other processors.
I would recommend downloading FreeRTOS and looking at the various different port layers. There is a port for pretty much everything there, and the low level task switching stuff in the "portable" directories is fairly small and straightforward.
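For instance, creating a FreeRTOS task hands the kernel a stack size up front, and the kernel can report how much of that stack was ever used. A minimal sketch (task name, stack depth, and priority are arbitrary example values; the high-water-mark call requires INCLUDE_uxTaskGetStackHighWaterMark in the configuration):

    #include "FreeRTOS.h"
    #include "task.h"

    static void vBlinkTask(void *pvParameters)
    {
        (void)pvParameters;
        for (;;) {
            /* ... real work here ... */
            vTaskDelay(pdMS_TO_TICKS(500));
            /* Words of stack that have never been used by this task;
             * needs INCLUDE_uxTaskGetStackHighWaterMark set to 1. */
            UBaseType_t headroom = uxTaskGetStackHighWaterMark(NULL);
            (void)headroom;  /* e.g. log it, or assert it stays above a floor */
        }
    }

    void create_tasks(void)
    {
        /* 128 words of stack and priority idle+1 are arbitrary examples. */
        xTaskCreate(vBlinkTask, "blink", 128, NULL, tskIDLE_PRIORITY + 1, NULL);
        vTaskStartScheduler();   /* never returns if the kernel starts */
    }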
As I'm not quite sure what the scope of your question is, I'll try and summarize some concepts of preemptive scheduling:
There's one stack per task. For each stack, there's a stack pointer pointing to it. So basically, for the task switch, the current stack pointer is saved and the next task's stack pointer is loaded. Interestingly, the return from OS to the task's code is then done via a RETURN instruction, and not a JUMP or CALL like one might expect.
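To make that concrete, here is a hedged sketch of the idea; all names (tcb_t, current_task, cpu_read_sp, ...) are hypothetical, and the register save/restore noted in the comments is written in assembly in a real RTOS port:

    typedef struct tcb {
        unsigned long *sp;   /* saved stack pointer while the task is not running */
        struct tcb *next;    /* e.g. ready-list link */
    } tcb_t;

    tcb_t *current_task;     /* the task that is executing right now */

    /* Implemented in assembly in a real port: read/write the (process)
     * stack pointer register.  Declared here so the sketch compiles. */
    extern unsigned long *cpu_read_sp(void);
    extern void cpu_write_sp(unsigned long *sp);

    void context_switch(tcb_t *next)
    {
        /* (asm) push the current task's remaining registers onto its stack */
        current_task->sp = cpu_read_sp();  /* remember where we stopped    */
        current_task = next;
        cpu_write_sp(next->sp);            /* adopt the next task's stack  */
        /* (asm) pop the next task's registers; the final RETURN pops its
         * saved return address, so execution resumes inside that task.    */
    }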
When an ISR interrupts a running task, it will not run another task itself. As you correctly said, it only makes a task runnable (taking it out of the waiting state), so that, in the next scheduling cycle, the OS can consider the now-ready task for further execution. (If and when that task runs depends on its assigned priority; if it has a very high priority, the OS may try to make sure it runs before any other, lower-priority task gets switched to.)
The actual task switching only occurs after the ISR finished and returned, so there's no need to copy anything from one stack to another.
In 'simple' implementations, the ISR may just return to the task it interrupted, so that no early, 'out-of-order' context switch will occur.
Another, more complex implementation can have the ISR return to the OS instead of the interrupted task. A function like yield() would thus be called, giving the OS the chance to do a task switch immediately if necessary.
This, however, may require that affected ISRs get special exit instructions appended, replacing the normal compiler-generated ISR exit code.

On reset what happens in embedded system?

I have a question regarding reset due to power-up:
1. As I understand it, the microcontroller is hardwired to start at some particular memory location, say 0000H, on power-up. Is the interrupt service routine for reset (initialization of the stack pointer, program counter, etc.) written at 0000H, or does 0000H hold the reset address (say 7000H) so that the microcontroller jumps to address 7000H, where the initialization of the stack and PC is written?
2. Who writes this reset service routine? Is it the manufacturer of the microcontroller chip (Intel or Microchip, etc.), or can any programmer change this reset service routine (for example, a programmer changes the PC to 4000H from 7000H on power-up reset, resulting in the first instruction being fetched from 4000H instead of 7000H)?
3. How are the stack pointer and program counter initialized to their respective initial addresses, given that on power-up the microcontroller is not in a state to put addresses into the stack pointer and program counter registers (no initialization is done until the reset service routine runs)?
4. What should the steps in the reset service routine be, considering all possibilities?
With reference to your numbering:
The hardware reset process is processor dependent and will be fully described in the data sheet or reference manual for the part, but your description is generally the case - different architectures may have subtle variations.
While some microcontrollers include a ROM based boot-loader that may contain start-up code, typically such bootloaders are only used to load code over a communications port, either to program flash memory directly or to load and execute a secondary bootloader to RAM that then programs flash memory. As far as C runtime start-up goes, this is either provided with the compiler/toolchain, or you write it yourself in assembler. Normally even when start-up code is provided by the compiler vendor, it is supplied as source to be assembled and linked with your application. The compiler vendor cannot always know things like memory map, SDRAM mapping and timing, or processor clock speed or what oscillator crystal is used in your hardware, so the start-up code will generally need customisation or extension through initialisation stubs that you must implement for your hardware.
On ARM Cortex-M devices the initial PC and stack pointer are in fact loaded by hardware: they are stored at the reset address and fetched on power-up. In the general case, however, you are right: the reset address either contains the start-up code or a vector to the start-up code. On pre-Cortex ARM architectures, the reset address actually contains a jump instruction rather than a true vector address. Either way, the start-up code for a C/C++ runtime must at least initialise the stack pointer, initialise static data, perform any necessary C library initialisation, and jump to main(). In the case of C++, it must also execute the constructors of any global static objects before calling main().
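To illustrate, a minimal GCC-style Cortex-M start-up sketch; the linker-script symbols (_estack, _sidata, and so on) and the section name are common conventions, not universal, and the C library/C++ constructor initialisation mentioned above is omitted:

    #include <stdint.h>

    extern uint32_t _estack, _sidata, _sdata, _edata, _sbss, _ebss;
    extern int main(void);

    void Reset_Handler(void)
    {
        uint32_t *src = &_sidata, *dst = &_sdata;
        while (dst < &_edata)        /* copy initialised data, flash -> RAM */
            *dst++ = *src++;
        for (dst = &_sbss; dst < &_ebss; )
            *dst++ = 0u;             /* zero the .bss section */
        main();
        for (;;) ;                   /* trap here if main() ever returns */
    }

    /* The first two vector-table entries, loaded by the hardware on reset. */
    __attribute__((section(".isr_vector")))
    void (* const vectors[])(void) = {
        (void (*)(void))&_estack,    /* entry 0: initial stack pointer value */
        Reset_Handler,               /* entry 1: reset vector */
    };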
The processor cores normally have, as you say, a starting address of some sort: either a table (a list of addresses) or, as on ARM, a place where instructions are executed. What is wrapped around that core, within the chip, can vary. Cores that are not specific to the chip vendor (8051, MIPS, ARM, XScale, etc.) are going to have a much wider range of different answers. Some microcontroller vendors, for example, will look at strap pins, and if a strap is wired a certain way when reset is released, the part executes from a special boot flash inside the chip: a bootloader that you can, for example, use to program the user boot flash. If the strap is not tied that certain way, then sometimes it boots your user code. One vendor I know of still has it boot their bootloader flash; if the vector table has a valid checksum, they jump to the reset vector in your vector table, otherwise they sit in their bootloader mode waiting for you to talk to them.
When you get into the bigger processors (non-microcontrollers), where software lives outside the processor, either on a boot flash (a separate chip from the processor) or some RAM that is managed somehow before reset, those usually follow the rule for the core: start at address 0xFFFFFFF0, or start at address 0x00000000; if there is garbage there, oh well, fire off the undefined instruction vector, and if that is garbage, just hang there or sit in an infinite loop calling the undefined instruction vector. This works well for an ARM: for example, you can build a board with a boot flash that is erased from the factory (all 0xFFs), then use JTAG to stop the ARM and program the flash the first time, and you don't have to unsolder or socket or pre-program anything. So long as your bootloader doesn't hang the ARM, you can have an unbrickable design. (Actually you can often hold the ARM in reset and still get at it with the JTAG debugger, and not worry about bad code messing with the JTAG pins or hanging the ARM core.)
The short answer: how many different processor chip vendors have there been? There are many different solutions; as many as you can think of, and more, have been deployed. Placing a reset handler address in a known place in memory is the most common, though.
EDIT:
Questions 2 and 3: if you are buying a chip, some of the microcontrollers have this protected bootloader, but even with that, normally you write the boot code that will be used by the product. And part of that boot code is to initialize the stack pointers, prepare memory, bring up parts of the chip, and all those good things. Sometimes chip vendors will provide examples. If you are buying a board-level product, then often you will find a board support package (BSP) which has working example code to bring up the board and perhaps do a few things. The BeagleBoard, for example, or the OpenRD or embeddedarm.com boards, come with a bootloader (U-Boot or other), and some already have Linux pre-installed. With boards like that, the user usually just writes some Linux apps/drivers and adds them to the BSP, but you are not limited to that; you are often welcome to completely re-write and replace the bootloader. And whoever writes the bootloader has to set up the stacks and bring up the hardware, etc.
On systems like the Game Boy Advance or the NDS or the like, the vendor has some startup code that calls your startup code. So the stack and such may be set up before they hand off to you; much of the system may be up, and you just get to decide how to slice up the memories: where you want your stack, data, program, etc.
Some vendors want to keep this stuff controlled or secret; others do not. In some cases you may end up with a board or chip with no example code, just some data sheets and reference manuals.
If you want to get into this business, though, you need to be prepared to write this startup code (in assembler), which may call some C code to bring up the rest of the system, which then might start up the main operating system or application or whatever. Microcontrollers sound like what you are playing with; the answers to your questions are in the chip vendors' user guides, and some vendors are better than others. Search for the words reset or boot in the document to try to figure out what their boot schemes are. I recommend you use "dollar votes" to choose the better vendors. A vendor with bad docs, secret docs, or bad support: don't give them your money. Spend your money on vendors with freely downloadable, well-written docs, with well-written examples and/or user forums with full-time employees trolling around answering questions. There are times where the docs are not available except to serious, paying customers; it depends on the market. Most general-purpose embedded systems, though, are openly documented. The quality varies widely, but the docs etc. are there.
It depends completely on the controller/embedded system you use. The ones I've used in game development have the IP point at a starting address in RAM. The bootstrap code supplied with the compiler initializes static/const memory, sets the stack pointer, and then jumps execution to a main() routine of some sort. Older systems also started at a fixed address, but you manually had to set the stack, starting vector table, and other stuff in assembler. A common name for the starting assembler file is CRT0.s for the stuff I've done.
So 1. You are correct. The microprocessor has to start at some fixed address.
2. The ISR can be supplied by the manufacturer or compiler creator, or you can write one yourself, depending on the complexity of the system in question.
3. The stack and initial program counter are usually handled via some sort of bootstrap routine that quite often can be overridden with your own code. See above.
Last: The steps will depend on the chip. If there is a power interruption of any sort, RAM may be scrambled and all ISR vector tables and startup code should be rewritten, and the app should be run as if it just powered up. But, read your documentation! I'm sure there is platform specific stuff there that will answer these for your specific case.

How does a stack memory increase?

In a typical C program, the Linux kernel provides 84K - ~100K of memory. How does the kernel allocate more memory for the stack when the process uses up the given memory?
As I understand it, when the process takes up all the stack memory and then touches the next contiguous memory, it should page fault, and the kernel then handles the page fault.
Is it here that the kernel provides more memory to the stack for the given process? And which data structure in the Linux kernel identifies the size of the stack for the process?
There are a number of different methods used, depending on the OS (Linux realtime vs. normal) and the language runtime system underneath:
1) dynamic, by page fault
Typically the OS preallocates a few real pages at the higher addresses and assigns the initial SP to that. The stack grows downward, the heap grows upward. If a page fault happens somewhat below the stack bottom, the missing intermediate pages are allocated and mapped, effectively increasing the stack from the top towards the bottom automatically. There is typically a maximum up to which such automatic allocation is performed, which may or may not be specifiable in the environment (ulimit) or the exe header, or dynamically adjustable by the program via a system call (rlimit; see the sketch after this list). This adjustability in particular varies heavily between different OSes. There is also typically a limit on "how far away" from the stack bottom a page fault is considered to be OK and an automatic grow to happen. Notice that not all systems' stacks grow downward: under HP-UX it used(?) to grow upward, so I am not sure what Linux on PA-RISC does (can someone comment on this).
2) fixed size
Other OSes (especially in embedded and mobile environments) either have fixed sizes by definition, or take the size from the exe header, or have it specified when a program/thread is created. Especially in embedded real-time controllers, this is often a configuration parameter, and individual control tasks get fixed stacks (to avoid runaway threads taking the memory of higher-priority control tasks). Of course, also in this case, the memory might be allocated only virtually until really needed.
3) pagewise, spaghetti and similar
Such mechanisms tend to be forgotten, but are still in use in some runtime systems (I know of Lisp/Scheme and Smalltalk systems). These allocate and increase the stack dynamically as required; however, not as a single contiguous segment, but instead as a linked chain of multi-page chunks. This requires different function entry/exit code to be generated by the compiler(s), in order to handle segment boundaries. Therefore such schemes are typically implemented by a language support system and not the OS itself (it used to be, in earlier times - sigh). The reason is that when you have many (say 1000s of) threads in an interactive environment, preallocating say 1Mb each would simply fill your virtual address space, and you could not support a system where the stack needs of an individual thread are unknown beforehand (which is typically the case in a dynamic environment, where the user might enter eval-code into a separate workspace). So dynamic allocation as in scheme 1 above is not possible, because there would be other threads with their own stacks in the way. Instead, the stack is made up of smaller segments (say 8-64k) which are allocated and deallocated from a pool and linked into a chain of stack segments. Such a scheme may also be required for high-performance support of things like continuations, coroutines, etc.
Modern Unixes/Linuxes and (I guess, but am not 100% certain) Windows use scheme 1) for the main thread of your exe, and 2) for additional (p-)threads, which need a fixed stack size given by the thread creator initially. Most embedded systems and controllers use fixed (but configurable) preallocation (even physically preallocated in many cases).
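As an aside on the adjustability mentioned under scheme 1, the POSIX rlimit interface referred to above looks like this (a hosted-Linux sketch, not embedded code):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;

        /* Query the current stack-size limit for this process. */
        if (getrlimit(RLIMIT_STACK, &rl) != 0) {
            perror("getrlimit");
            return 1;
        }
        printf("stack soft limit: %llu bytes, hard limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit up to the hard limit; the kernel will then
         * auto-grow the stack further before delivering SIGSEGV. */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_STACK, &rl) != 0)
            perror("setrlimit");
        return 0;
    }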
The stack for a given process has a limited, fixed size. The reason you can't add more memory as you (theoretically) describe is that the stack must be contiguous, and it grows toward the heap. So, when the stack reaches the heap, no extension is possible.
The stack size for a userland program is not determined by the kernel. The kernel stack size is a configuration option for the kernel (usually 4k or 8k).
Edit: if you already know this, and were merely talking about the allocation of physical pages for a process, then you have the procedure down already. But there's no need to keep track of the "stack size" like this: the virtual pages in the stack with no pagetable entries are just normal overcommitted virtual pages. Physical memory will be granted on their first access. But the kernel does not have to overcommit memory, and thus a stack will probably have complete physical realization when the executable is first loaded.
The stack can only be used up to a certain length, because it has a fixed storage capacity in memory. If your question is about which direction the stack is used up in, the answer is downwards: it is filled down in memory towards the heap. The heap is a dynamic region of memory, which can actually grow from the bottom up, based on your need for data storage.

Handling stack overflows in embedded systems

In embedded software, how do you handle a stack overflow in a generic way?
I have come across some processors that protect against it in hardware, like recent AMD processors.
There are some techniques on Wikipedia, but are those real, practical approaches?
Can anybody suggest a clear approach that works in all cases on today's 32-bit embedded processors?
Ideally you write your code with static stack usage (no recursive calls). Then you can evaluate maximum stack usage by:
static analysis (using tools)
measurement of stack usage while running your code with complete code coverage (or as high as possible code coverage until you have a reasonable confidence you've established the extent of stack usage, as long as your rarely-run code doesn't use particularly more stack than the normal execution paths)
But even with that, you still want to have a means of detecting and then handling stack overflow if it occurs, if at all possible, for more robustness. This can be especially helpful during the project's development phase. Some methods to detect overflow:
If the processor supports a memory read/write interrupt (i.e. memory access breakpoint interrupt) then it can be configured to point to the furthest extent of the stack area.
In the memory map configuration, set up a small (or large) block of RAM that is a "stack guard" area. Fill it with known values. In the embedded software, regularly (as often as reasonably possible) check the contents of this area. If it ever changes, assume a stack overflow. A minimal sketch of this method follows this list.
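A sketch of the stack-guard method, assuming a GCC-style toolchain; the guard size, fill pattern, and section name are all assumptions that must be matched to your memory map:

    #include <stdint.h>
    #include <stdbool.h>

    #define GUARD_WORDS   16
    #define GUARD_PATTERN 0xDEADBEEFu

    /* Placed just past the end of the stack by the linker/memory map;
     * the section name ".stack_guard" is hypothetical. */
    static volatile uint32_t stack_guard[GUARD_WORDS]
        __attribute__((section(".stack_guard")));

    void guard_init(void)
    {
        for (int i = 0; i < GUARD_WORDS; i++)
            stack_guard[i] = GUARD_PATTERN;
    }

    /* Call as often as reasonably possible (e.g. from a timer tick);
     * returns true if the stack has grown into the guard area. */
    bool guard_overflowed(void)
    {
        for (int i = 0; i < GUARD_WORDS; i++)
            if (stack_guard[i] != GUARD_PATTERN)
                return true;
        return false;
    }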
Once you've detected it, you need to handle it. I don't know of many ways that code can gracefully recover from a stack overflow, because once it's happened, your program logic is almost certainly invalidated. So all you can do is:
log the error
Logging the error is very useful, because otherwise the symptoms (unexpected reboots) can be very hard to diagnose.
Caveat: the logging routine must be able to run reliably even in a corrupted-stack scenario. The routine should be simple. I.e., with a corrupted stack, you probably can't try to write to EEPROM using your fancy EEPROM-writing background task. Maybe just log the error into a struct that is reserved for this purpose, in non-init RAM, which can then be checked after reboot (a sketch of such a struct follows this list).
Reboot (or perhaps shut down, especially if the error recurs)
Possible alternative: restart just the particular task, if you're using an RTOS, and your system is designed so the stack corruption is isolated, and all the other tasks are able to handle that task restarting. This would take some serious design consideration.
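One way to implement the reserved-RAM log mentioned in the caveat above; the ".noinit" section name, the field layout, and the magic value are assumptions that depend on your linker script and needs:

    #include <stdint.h>

    typedef struct {
        uint32_t magic;        /* marks the record as valid                */
        uint32_t error_code;   /* what went wrong                          */
        uint32_t extra;        /* whatever context you can safely capture  */
    } crash_log_t;

    /* ".noinit" must be a section your start-up code does not zero. */
    static crash_log_t crash_log __attribute__((section(".noinit")));

    #define CRASH_MAGIC    0x0DDBA11u
    #define ERR_STACK_OVFL 1u          /* error code: an assumption */

    /* Deliberately simple: straight stores, no calls, almost no stack. */
    void log_stack_overflow(uint32_t extra)
    {
        crash_log.magic      = CRASH_MAGIC;
        crash_log.error_code = ERR_STACK_OVFL;
        crash_log.extra      = extra;
    }

After reboot, the start-up code can check crash_log.magic and report the stored error before clearing the record.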
While embedded stack overflow can be caused by recursive functions getting out of hand, it can also be caused by errant pointer usage (although this could be considered another type of error), and normal system operation with an undersized stack. In other words, if you don't profile your stack usage it can occur outside of a defect or bug situation.
Before you can "handle" stack overflow you have to identify it. A good method for doing this is to load the stack with a pattern during initialization and then monitor how much of the pattern disappears during run-time. In this fashion you can identify the highest point the stack has reached.
The pattern check algorithm should execute in the opposite direction of stack growth. So, if the stack grows from 0x1000 to 0x2000, then your pattern check can start at 0x2000 to increase efficiency. If your pattern was 0xAA and the value at 0x2000 contains something other than 0xAA, you know you've probably got some overflow.
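A hedged sketch of this painting-and-scanning technique for a descending stack (as on ARM, where growth is downward, so the check starts at the low, limit end); the linker symbols and the safety margin are assumptions:

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t __stack_limit[];  /* lowest stack address (from linker script) */
    extern uint8_t __stack_top[];    /* highest stack address (initial SP)        */

    #define PAINT 0xAAu

    /* Call very early in start-up: fill the not-yet-used part of the stack. */
    void stack_paint(void)
    {
        uint8_t *sp = (uint8_t *)__builtin_frame_address(0); /* GCC-style   */
        for (uint8_t *p = __stack_limit; p < sp - 64; p++)   /* 64 B margin */
            *p = PAINT;
    }

    /* Scan against the direction of growth: count how many bytes at the
     * limit end were never overwritten.  0 suggests an overflow happened. */
    size_t stack_headroom(void)
    {
        size_t n = 0;
        for (uint8_t *p = __stack_limit; p < __stack_top && *p == PAINT; p++)
            n++;
        return n;
    }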
You should also consider placing an empty RAM buffer immediately after the stack so that if you do detect overflow you can shut down the system without losing data. If your stack is followed immediately by heap or SRAM data then identifying an overflow will mean that you have already suffered corruption. Your buffer will protect you for a little bit longer. On a 32-bit micro you should have enough RAM to provide at least a small buffer.
If you are using a processor with a Memory Management Unit, your hardware can do this for you with minimal software overhead. Most modern 32-bit processors have them, and more and more 32-bit microcontrollers feature them as well.
Set up a memory area in the MMU that will be used for the stack. It should be bordered by two memory areas where the MMU does not allow access. When your application is running, you will receive an exception/interrupt as soon as you overflow the stack.
Because you get an exception at the moment the error occurs, you know exactly where in your application the stack went bad. You can look at the call stack to see exactly how you got to where you are. This makes it a lot easier to find your problem than trying to figure out what is wrong by detecting the problem long after it happened.
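Cortex-M parts have an MPU rather than a full MMU, but the same guard idea applies. A minimal CMSIS-style sketch, assuming the device header defines the MPU; the guard base address and 256-byte size are examples, and the base must be aligned to the region size:

    #include "core_cm3.h"  /* CMSIS; normally pulled in via your device header */

    #define GUARD_BASE 0x20000000u  /* bottom of the stack area: an assumption */

    void mpu_stack_guard_init(void)
    {
        MPU->RNR  = 0;                            /* configure region 0        */
        MPU->RBAR = GUARD_BASE;                   /* must be size-aligned      */
        MPU->RASR = (0u << MPU_RASR_AP_Pos)       /* AP = 000: no access       */
                  | (7u << MPU_RASR_SIZE_Pos)     /* 2^(7+1) = 256-byte region */
                  | MPU_RASR_ENABLE_Msk;
        MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk       /* default map elsewhere     */
                  | MPU_CTRL_ENABLE_Msk;
        SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;  /* enable MemManage fault    */
        __DSB();
        __ISB();
        /* Any push into the guard region now raises MemManage_Handler(). */
    }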
I have used this successfully on PPC and AVR32 processors. When you start out using an MMU you feel like it is a waste of time, since you got along great without it for many years, but once you see the advantages of an exception at the exact spot where your memory problem occurs, you will never go back. An MMU can also detect null pointer accesses if you disallow memory access to the bottom part of your RAM.
If you are using an RTOS, your MMU protects the memory and stacks of other tasks, so errors in one task should not affect them. This means you could also easily restart your task without affecting the other tasks.
In addition, a processor with an MMU usually also has lots of RAM, so your program is a lot less likely to overflow its stack, and you don't need to fine-tune everything to get your application to run correctly with a small memory footprint.
An alternative to this would be to use the processor's debug facilities to cause an interrupt on a memory access to the end of your stack. This will probably be very processor-specific.
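On a Cortex-M, for example, a DWT comparator can watch the stack limit; a hedged CMSIS-style sketch (the watched address is an assumption, and you must provide a DebugMon_Handler to act on the hit):

    #include "core_cm3.h"  /* CMSIS; normally pulled in via your device header */

    #define STACK_LIMIT 0x20000000u  /* lowest legal stack address: assumption */

    void dwt_stack_watch_init(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk   /* power up the DWT   */
                         |  CoreDebug_DEMCR_MON_EN_Msk;  /* DebugMon exception */
        DWT->COMP0     = STACK_LIMIT;  /* address to watch                     */
        DWT->MASK0     = 2;            /* ignore low 2 bits: watch 4 bytes     */
        DWT->FUNCTION0 = 6;            /* 0b0110 = trap on write (ARMv7-M)     */
        /* A write into the watched word now raises DebugMon_Handler()
         * (as long as no external debugger has claimed halting debug). */
    }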
A stack overflow occurs when the stack memory is exhausted by too large a call stack, e.g. a recursive function too many levels deep.
There are techniques to detect a stack overflow by placing known data after the stack so it could be detected if the stack grew too much and overwrote it.
There are static source code analysis tools such as GnatStack, StackAnalyzer from AbsInt, and Bound-T which can be used to determine or make a guess at the maximum run-time stack-size.