Process Instructions storage (Operating System) - process

I was learning about how a process looks in memory from Operating System Concepts by Abraham Silberschatz.
So I came to know that it mainly has the following sections:
Data (read/write, for global variables)
Text (or ROC), which contains the code
Shared libraries
OS reserved space
Diagram link
I have some questions regarding the process workflow.
1. Where does the PCB fit in this diagram?
2. People generally show the called functions getting pushed onto the process stack in memory, but in reality three pieces of information get pushed (local variables, parameters passed, and the return address). Two sub-questions here:
2.1 Where are the actual instructions stored in this diagram (because the stack only has data, not instructions)? Is it in the text section?
2.2 If this stack data is pushed, then it must get popped when the function execution is completed. So how does the return address come into play while popping?

That "Figure 3.1 - A process in memory" shows the address space of a process. Each process typically has its own address space (not depicted), and the kernel also typically has its own address space (also not depicted).
The PCBs usually live within the kernel, which manages the processes.
Functions, when they are not active, still have machine code instructions, which are located within the text segment/section of the process.
Functions that are invoked are said to be activated, and an activation record, also known as a stack frame or call frame, is created on the stack for some functions, depending on how complex the function is. (The stack frame or activation record is effectively private data to the function, not generally intended for other functions to inspect, modulo exception (throw/catch) mechanisms.)
A function, X, that calls another function, Y, will suspend itself upon the invocation of Y, waiting for Y to return to X before X resumes. In such a scenario, X uses an activation record on the stack to maintain its suspended state, which it will use upon resumption.
If/when function Y returns to X it must remove any data that it (Y) allocated on the stack, and restore the stack pointer, and other call-preserved registers, to their original value(s) in order for X to successfully resume.  The stack pointer register is the reference for a function to find its stack allocated data.
When X calls Y, we can speak of the presence of two return addresses: X has a return address to get back to its caller, and Y has a return address to get back to X. As this is the nature of calling, some architectures provide call instructions that also push the return address directly onto the stack, and return instructions that pop the return address off the stack to resume the caller.
However, RISC architectures will generally leave the return address in a CPU register, requiring functions that make further calls to save their own return address in a place that will survive their own further calling (that place being stack memory).  RISC architectures also tend to engage in less pushing and popping, collecting all stack frame allocations into one (larger) allocation in prologue and all stack frame deallocations into one (larger) deallocation in epilogue.
The suspension, invocation, returning, and resumption are logical, from the point of view of functions: the processor doesn't really see or care about functions, but instead sees a continuous stream of instructions that happen to include various forms of branching.


How do functions access locals in stack frames?

I've read that stack frames contain return addresses, function arguments, and local variables for a function. Since functions don't know where their stack frame is in memory at compile time, how do they know the memory address of their local variables? Do they offset and dereference the stack pointer for every read or write of a local? In particular, how does this work on embedded devices without efficient support for pointer accesses, where load and store addresses have to be hardcoded into the firmware and pointer accesses go through reserved registers?
The way objects work is that the compiler or assembly programmer determines the layout of an object: the offset of each field relative to the start of the object (as well as the size of the object as a whole). Then, objects are passed and stored as references, which are generally pointers in C and machine code. In struct xy { int x; int y; }, we can reason that x is at offset 0 and y at offset 4 (assuming a 4-byte int) from an object reference to a struct xy.
The stack frame is like an object that contains a function's memory-based local variables (instead of struct members), but it is accessed not through an object reference but through the stack or frame pointer. (And it is allocated/deallocated by decrementing/incrementing the stack pointer, instead of by malloc and free.)
Both share the issue that we don't know the actual location/address of a given field (x or y) of a dynamically allocated object, or of a memory-based local variable in a stack frame, until runtime. But when a function runs, it can compute the complete absolute address (of object fields or memory-based local variables) quite simply by adding the base (object reference or stack/frame pointer) to the relative position of the desired item, knowing its predetermined layout.
Processors offer addressing modes that help to support this kind of access, usually something like base + displacement.
Let's also note that many local variables are assigned directly to CPU registers and so have no memory address at all. Other local variables move between memory and CPU registers; this can be considered an optimization, meaning we don't have to access memory when the value of a variable is needed and it has recently been loaded into a CPU register.
In many ways, processors for embedded devices are like other processors, offering addressing modes to help with memory accesses, and with optimizing compilers that can make good decisions about where a variable lives. As you can tell from the above, not all variables need to live in memory, and some live both in memory and in CPU registers to help reduce memory access costs.
The answer is: it depends on the architecture. You will have a register that contains the address of the current stack frame (EBP on x86, for instance). Once you know this, individual variables are identified by their offsets into the stack frame, calculated from object sizes at compile time (hence the need for the size of local variables to be known at compile time).
Even if a stack frame appears in different places in memory, the variables will have the same relative offset, so you can always calculate the address.
The size of the stack frame for each function is calculated at compile time and included in the code, so that each call can set up and clean up its own frame.

OS Context Switch in ISR

I am eager to know how the OS actually does a context switch when some asynchronous event raises an ISR that makes a higher-priority task ready to run. As far as I know, when the CPU enters an ISR it pushes some register values onto the hardware stack, so how does the scheduler retrieve those values and put them onto the task stack? Does it access the hardware stack in order to copy values that are already preserved? I hope I was clear.
On a Cortex-M3 processor you have the MSP (Main Stack Pointer - which is your hardware stack) and the PSP (Process Stack Pointer - which is your task stack).
On entry to an exception, the stack frame is stored on the current PSP stack (in normal, non-nested operation). The processor then switches to the MSP stack for the exception handler; however, the handler can still access the PSP stack, so it can store any remaining registers etc. on that same PSP stack, as well as any other task information it needs.
The exception handler can then select the new high-priority task, switch the PSP to that task's stack, and restore the registers it needs. It then leaves the PSP in exactly the same state as when that task was suspended, so that on return from the exception the rest of the stack is correctly restored.
It is more complex than this in certain situations but that is the basic operation (On ARM Cortex-M). It will be different on other processors.
I would recommend downloading FreeRTOS and looking at the various different port layers. There is a port for pretty much everything there, and the low level task switching stuff in the "portable" directories is fairly small and straightforward.
As I'm not quite sure what the scope of your question is, I'll try and summarize some concepts of preemptive scheduling:
There's one stack per task. For each stack, there's a stack pointer pointing to it. So basically, for the task switch, the current stack pointer is saved and the next task's stack pointer is loaded. Interestingly, the return from OS to the task's code is then done via a RETURN instruction, and not a JUMP or CALL like one might expect.
When an ISR interrupts a running task, it will not run another task itself. As you correctly said, it only makes a task runnable (taking it out of the waiting state), so that, in the next scheduling cycle, the OS can consider the now-ready task for further execution. (If and when that task runs depends on its assigned priority; if it has a very high priority, the OS may try to make sure it runs before any other, lower-priority task gets switched to.)
The actual task switching only occurs after the ISR finished and returned, so there's no need to copy anything from one stack to another.
In 'simple' implementations, the ISR may just return to the task it interrupted, so that no early, 'out-of-order' context switch will occur.
Another, more complex implementation can have the ISR return to the OS instead of the interrupted task. A function like yield() would thus be called, giving the OS the chance to do a task switch immediately if necessary.
This, however, may require that the affected ISRs end with special exit code that replaces the normal compiler-generated ISR epilogue.

What is a kernel stack used for?

The following is a description I read of a context switch between process A and process B. I don't understand what a kernel stack is used for. There is supposed to be a per-process kernel stack. And the description I am reading speaks of saving the registers of A onto the kernel stack of A and also saving the registers of A to the process structure of A. What exactly is the point of saving the registers to both the kernel stack and the process structure, and why the need for both?
A context switch is conceptually simple: all the OS has to do is save a few register values for the currently-executing process (onto its kernel stack, for example) and restore a few for the soon-to-be-executing process (from its kernel stack). By doing so, the OS thus ensures that when the return-from-trap instruction is finally executed, instead of returning to the process that was running, the system resumes execution of another process...
Process A is running and then is interrupted by the timer interrupt. The hardware saves its registers (onto its kernel stack) and enters the kernel (switching to kernel mode). In the timer interrupt handler, the OS decides to switch from running Process A to Process B. At that point, it calls the switch() routine, which carefully saves current register values (into the process structure of A), restores the registers of Process B (from its process structure entry), and then switches contexts, specifically by changing the stack pointer to use B’s kernel stack (and not A’s). Finally, the OS returns-from-trap, which restores B’s registers and starts running it.
I have a disagreement with the second paragraph.
Process A is running and then is interrupted by the timer interrupt. The hardware saves its registers (onto its kernel stack) and enters the kernel (switching to kernel mode).
I am not aware of a system that saves all the registers on the kernel stack on an interrupt. Typically only the Program Counter, Processor Status, and Stack Pointer are saved (the Stack Pointer assuming the hardware does not have a separate kernel-mode stack pointer). Normally, processors save the minimum necessary on the kernel stack after an interrupt. The interrupt handler will then save any additional registers it wants to use and restore them before exit. The processor's RETURN FROM INTERRUPT or EXCEPTION instruction then restores the registers automatically stored by the interrupt.
That description assumes no change in the process.
If the interrupt handler decides to change the process, it saves the current register state (the "process context"; most processors have a single instruction for this, while in Intel land you might have to use multiple instructions) and then executes another instruction to load the process context of the new process.
To answer your heading question "What is a kernel stack used for?", it is used whenever the processor is in Kernel mode. If the kernel did not have a stack protected from user access, the integrity of the system could be compromised. The kernel stack tends to be very small.
To answer your second question, "What exactly is the point of saving the registers to both the kernel stack and the process structure and why the need for both?"
They serve two different purposes. The registers saved on the kernel stack are used to get out of kernel mode. The process context block saves the entire register set in order to change processes.
I think your misunderstanding comes from the wording of your source that suggests all registers are stored on the stack when entering kernel mode, rather than just the minimum number of registers needed to make the kernel mode switch. The system will usually only save what it needs to get back to user mode (and may use that same information to return back to the original process in another context switch, depending upon the system). The change in process context saves all the registers.
Edits to answer additional questions:
If the interrupt handler needs to use registers not saved automatically by the CPU on the interrupt, it pushes them onto the kernel stack on entry and pops them off on exit. The interrupt handler has to explicitly save and restore any [general] registers it uses. The Process Context Block does not get touched for this.
The Process Context Block only gets altered as part of an actual context switch.
Let's assume we have a processor with a program counter, stack pointer, processor status and 16 general registers (I know no such system really exists) and that the same SP is used for all modes.
Interrupt occurs.
The hardware pushes the PC, SP, and PS on to the stack, loads the SP with the address of the kernel mode stack and the PC from the interrupt handler (from the processor's dispatch table).
Interrupt handler gets called.
The writer of the handler decides he is going to use R0-R3. So the first lines of the handler are:
Push R0 ; on to the kernel mode stack
Push R1
Push R2
Push R3
The interrupt handler does whatever it wants to do.
The writer of the interrupt handler needs to do:
Pop R3
Pop R2
Pop R1
Pop R0
REI ; Whatever the system's return from interrupt or exception instruction is.
Hardware takes over.
Restores the PS, PC, and SP from the kernel mode stack, then resumes executing where it was before the interrupt.
I've made up my own processor for simplification. Some processors have lengthy instructions that are interruptible (e.g. block character moves). Such instructions often use registers to maintain their context. On such a system, the processor would have to automatically save any registers it uses to maintain context within the instruction.
An interrupt handler does not muck with the process context block unless it is changing processes.
It's difficult to speak in general terms about how an OS works 'under the hood', because it's dependent on how the hardware works. Also, terminology isn't highly standardised.
My guess is that by the 'Process structure entry' the writer means what is commonly known as the 'context' of the process, and that contains a copy of every register. It's not possible for the interrupt code to immediately save registers to this structure, because it would have to use (and therefore modify) registers in doing so. That's why it has to save a few registers, enough so that it can do the job, somewhere immediately available, e.g. where the stack pointer is pointing, which the writer calls the 'kernel stack'.
Depending on the architecture, this could be a single stack or separate ones per process.

How to determine maximum stack usage in embedded system?

When I give the Keil compiler the "--callgraph" option, it statically calculates the exact "Maximum Stack Usage" for me.
Alas, today it is giving me a "Maximum Stack Usage = 284 bytes + Unknown(Functions without stacksize...)" message, along with a list of "Functions with no stack information".
Nigel Jones says that recursion is a really bad idea in embedded systems ("Computing your stack size" 2009), so I've been careful not to make any mutually recursive functions in this code.
Also, I make sure that none of my interrupt handlers ever re-enable interrupts until their final return-from-interrupt instruction, so I don't need to worry about re-entrant interrupt handlers.
Without recursion or re-entrant interrupt handlers, it should be able to statically determine the maximum stack usage. (And so most of the answers to "How to determine maximum stack usage?" do not apply.)
My understanding is that the software that handles the "--callgraph" option first finds the maximum stack depth for each interrupt handler when it's not interrupted by a higher-priority interrupt, and the maximum stack depth of the main() function when it is not interrupted.
Then it adds them all up to find the total (worst-case) maximum stack depth.
That occurs when the main() background task is at its maximum depth when it is interrupted by the lowest-priority interrupt, and that interrupt is at its maximum depth when it is interrupted by the next-lowest-priority interrupt, and so on.
I suspect the software that handles --callgraph is getting confused about the small assembly-language functions in the "Functions with no stack information" list.
The --callgraph documentation seems to imply that I need to manually calculate (or make a conservative estimate) how much stack they use -- they're very short, so that should be simple -- and then "Use frame directives in assembly language code to describe how your code uses the stack."
One of them is the initial startup code that resets the stack to zero before jumping to main() -- so, in effect, this consumes zero stack.
Another one is the "Fault" interrupt handler that locks up in an infinite loop until I cycle the power -- it's safe to assume this consumes zero stack.
I'm using the Keil uVision V4.20.03.0 to compile code for the LM3S1968 ARM Cortex-M3.
So how do I use "frame directives" to tell the software that handles "--callgraph" how much stack these functions use?
Or is there some better approach to determine maximum stack usage?
(See How to determine maximum stack usage in embedded system with gcc? for almost the same question targeted to the gcc compiler.)
Use the --info=stack linker option. The map file will then include the stack usage for all functions with external linkage.
In a single tasking environment, the stack usage for main() will give you the total requirement. If you are using an RTOS such as RTX where each task has its own stack, then you need to look at the stack usage for all task entry points, and then add some more (64 bytes in the case of RTX) for the task context storage.
This and other techniques applicable to Keil and more generally are described here
John Regehr of the University of Utah has a good discussion of measuring stack usage in embedded systems at, though note that the link to is stale, and one occurrence of “without interrupts disabled” should have either the first or last word negated. In the commercial world, Coverity has a configurable stack overflow checker, and some versions of CodeWarrior have a semi-documented warn_stack_usage pragma. (It’s not mentioned in my version of the compiler documentation, but is in MetroWerks’ “Targeting Palm OS” document.)

How does a stack memory increase?

In a typical C program, the Linux kernel provides 84K - ~100K of memory for the stack. How does the kernel allocate more memory for the stack when the process uses up the given memory?
IMO, when the process has used all the memory of the stack and now uses the next contiguous memory, it should ideally page fault, and then the kernel handles the page fault.
Is it here that the kernel provides more memory to the stack for the given process, and which data structure in the Linux kernel identifies the size of the stack for the process?
There are a number of different methods used, depending on the OS (linux realtime vs. normal) and the language runtime system underneath:
1) dynamic, by page fault
Typically, a few real pages are preallocated at higher addresses and the initial sp is set to point into them. The stack grows downward, the heap grows upward. If a page fault happens somewhat below the stack bottom, the missing intermediate pages are allocated and mapped, effectively growing the stack from the top towards the bottom automatically. There is typically a maximum up to which such automatic allocation is performed, which may be specified in the environment (ulimit) or the exe header, or adjusted dynamically by the program via a system call (setrlimit). How adjustable this is varies heavily between OSes. There is also typically a limit to "how far away" from the stack bottom a page fault is considered to be OK and to trigger automatic growth. Notice that not every system's stack grows downward: under HP-UX it (used to?) grow upward, so I am not sure what Linux on PA-RISC does (can someone comment on this?).
2) fixed size
Other OSes (especially in embedded and mobile environments) either have fixed sizes by definition, or sizes specified in the exe header or when a program/thread is created. Especially in embedded real-time controllers, this is often a configuration parameter, and individual control tasks get fixed stacks (to avoid runaway threads taking the memory of higher-priority control tasks). Of course, in this case too, the memory might be allocated only virtually until it is really needed.
3) pagewise, spaghetti and similar
Such mechanisms tend to be forgotten, but are still in use in some runtime systems (I know of Lisp/Scheme and Smalltalk systems). These allocate and increase the stack dynamically as required, but not as a single contiguous segment; instead, the stack is a linked chain of multi-page chunks. This requires the compiler(s) to generate different function entry/exit code in order to handle segment boundaries. Therefore, such schemes are typically implemented by a language support system and not by the OS itself (in earlier times it used to be the OS; sigh). The reason is that when you have many (say 1000s of) threads in an interactive environment, preallocating, say, 1 MB each would simply fill your virtual address space, and you could not support a system where the stack needs of an individual thread are not known in advance (which is typically the case in a dynamic environment, where the user might enter eval-code into a separate workspace). So dynamic allocation as in scheme 1 above is not possible, because there would be other threads with their own stacks in the way. The stack is made up of smaller segments (say 8-64k), which are allocated and deallocated from a pool and linked into a chain of stack segments. Such a scheme may also be required for high-performance support of things like continuations, coroutines, etc.
Modern Unixes/Linuxes and (I guess, but am not 100% certain) Windows use scheme 1) for the main thread of your exe, and 2) for additional (p)threads, which need a fixed stack size given initially by the thread creator. Most embedded systems and controllers use fixed (but configurable) preallocation (even physically preallocated in many cases).
The stack for a given process has a limited, fixed size. The reason you can't add more memory as you (theoretically) describe is because the stack must be contiguous, and it grows toward the heap. So, when the stack reaches the heap, no extension is possible.
The stack size for a userland program is not determined by the kernel. The kernel stack size is a configuration option for the kernel (historically 4k or 8k; 16k on x86-64 since Linux 3.15).
Edit: If you already know this, and were merely talking about the allocation of physical pages for a process, then you have the procedure down already. But there's no need to keep track of the "stack size" like this: the virtual pages in the stack with no page table entries are just normal overcommitted virtual pages, and physical memory will be granted on their first access. But the kernel does not have to overcommit memory, and thus a stack will probably have complete physical realization when the executable is first loaded.
The stack can only be used up to a certain length, because it has a fixed storage capacity in memory. If your question asks in which direction the stack is used up, the answer is downwards: it fills down in memory towards the heap. The heap is a dynamic region of memory that can grow from the bottom up, based on your need for data storage.