Where in memory are return values stored? - low-level

Where in memory are return values stored?
Consider the following code:
int add(int a, int b) {
    int result = a + b;
    return result;
}

void main() {
    int sum = add(2, 3);
}
When add(2, 3) is called, the 2 function parameters are pushed on the stack, the stack frame pointer is pushed on the stack, and a return address is pushed on the stack. The flow of execution then jumps to add(...), and local variables within that function are also stored on the stack.
When add(...) is complete, and executes the return instruction... where does the return value get stored? How does [result] end up in [sum]?

This clearly depends on your hardware architecture and your compiler. On 64-bit x86 using gcc, your code compiles to:
.file "call.c"
.text
.globl add
.type add, @function
add:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
movl %edi, -20(%rbp)
movl %esi, -24(%rbp)
movl -24(%rbp), %eax
movl -20(%rbp), %edx
leal (%rdx,%rax), %eax
movl %eax, -4(%rbp)
movl -4(%rbp), %eax ; return value placed in EAX
leave
ret
.cfi_endproc
.LFE0:
.size add, .-add
.globl main
.type main, @function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
movq %rsp, %rbp
.cfi_offset 6, -16
.cfi_def_cfa_register 6
subq $16, %rsp
movl $3, %esi
movl $2, %edi
call add
movl %eax, -4(%rbp) ; the result of add is stored in sum
leave
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",@progbits
Here, the compiler is using the EAX register to communicate the result of add to the caller.
You can read up on x86 calling conventions on Wikipedia.

There is no general answer to this question, because it depends on the target architecture. There is usually an ABI (application binary interface) specification for any target architecture that defines this, and the compiler generates code that conforms to that spec. Most architectures use a register for passing the return value back, simply because that is the fastest way to do it. That is only possible if the value fits into a register, of course. If not, they might use a register pair (e.g. the lower 32 bits in one register, the upper 32 bits in another), or they will pass it back via the stack. Some architectures never use registers and always pass the value back via the stack.
Since the caller must create a stack frame before calling the function (there are exceptions to this rule, but let's stay with the default case here), the stack frame is still there when the function returns to the caller, and the caller knows how to access it. It has to know, since it must also clean up the stack frame after the return. On most architectures the caller cleans up the stack frame, not the callee, since the caller knows how many arguments it has passed via the stack (e.g. for a C function that takes a variable number of arguments), while the callee may not know that at compile time; thus it makes more sense to let the caller clean it up. And before doing that, the caller can read back any value from the stack frame that it wishes to retrieve.
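To make the "value does not fit into a register" case concrete, here is a minimal C sketch (my own illustration, not from the question; the struct and function names are made up). On common ABIs such as System V x86-64, a struct this large is typically returned through a hidden pointer that the caller passes in:
struct Triple {
    long a, b, c;                      /* 24 bytes: too big for a single register */
};

struct Triple make_triple(long x)
{
    struct Triple t = { x, x + 1, x + 2 };
    return t;                          /* typically written into a caller-supplied
                                          buffer whose address arrives as a hidden
                                          extra argument */
}

int main(void)
{
    struct Triple t = make_triple(40); /* caller reserves space for t and passes
                                          its address to make_triple */
    return (int)t.c;
}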

On x86 the return value is put in the EAX register (that may depend on your actual calling convention though).
You could disassemble the code compiled from your source to check what happens for sure.

The function parameters, local variables and return values may be pushed/popped on the stack or stored in internal CPU registers; it is highly system-dependent.

Usually in the accumulator. For a return value that does not fit into the accumulator, the accumulator will hold a pointer to it on the stack. This is a common scheme, used on the few platforms I have dealt with at that level, but it depends on the hardware and, I think, on the compiler/assembler too.

EAX is used to store the return value if its size permits (here it does); it is then the caller's job (main in your case) to copy the EAX contents into sum.

Related

Accepts and displays one character of 1 through 9 using assembly code in C

Can someone please explain each line of this assembly code to me?
void main(void) {
    _asm {
        mov ah, 8     ; read key, no echo
        int 21h
        cmp al, '0'   ; filter key code
        jb big
        cmp al, '9'
        ja big
        mov dl, al    ; echo 0 - 9
        mov ah, 2
        int 21h
    big:
    }
}
PS: I am new to assembly in C/C++.
Per the docs, the return value is in al, not ah. That's why it compares to al.
Edit: Adding more detail:
Looking at this code:
mov ah,8 ;read key no echo
int 21h
Think of this like a function call. Now normally a function call in asm looks like call myroutine. But DOS used interrupts to allow you to call various operating system functions (read a key from the keyboard, read data from a file, etc).
So, executing the int 21h instruction called the operating system. But how was the operating system supposed to know which OS function you wanted? Typically by putting a value in ah. If you search, you can find a number of resources that show listings of all the int 21h functions (like this). The numbers on the right are the values you put in ah.
So, mov ah,8 is preparing to call the "Wait for console input without echo" function. mov ah,2 is "Display output." Other registers are used to pass various parameters to the function being called. You need to read the description of the specific interrupt to understand what goes where.
Note that NONE of this is related to "writing inline asm in C." This is just how to call OS functions from C code running under DOS. If you aren't running under DOS, int 21h won't work.
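As another illustration of the ah-selects-the-function pattern (a hypothetical sketch, assuming the same 16-bit DOS compiler and _asm syntax as the question; the msg variable is made up), function 9 of int 21h prints a '$'-terminated string whose offset is passed in dx:
char msg[] = "Hello from DOS$";   /* int 21h function 9 expects a '$'-terminated string */

void main(void) {
    _asm {
        mov dx, offset msg   ; ds:dx -> string to print
        mov ah, 9            ; function 9: display string
        int 21h              ; call DOS
    }
}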

Favorability of alloca for array allocation vs simple [] array declaration

Reading some Apple code, I stumbled upon the following C chunk
alloca(sizeof(CMTimeRange) * 3)
is this the same thing as allocating stack memory via
CMTimeRange *p = CMTimeRange[3] ?
Are there any implications for performance? Is there a need to free the memory?
If you really only want to allocate 3 elements of something on the stack, the use of alloca makes no sense at all. It only makes sense if you have a variable length that depends on some dynamic parameter at runtime, or if you do an unknown number of such allocations in the same function.
alloca is not a standard function and differs from platform to platform. The C standard has instead preferred to introduce VLAs (variable length arrays) as a replacement.
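A minimal sketch of that difference (my own illustration, assuming a platform that provides <alloca.h>; the function name is made up):
#include <alloca.h>   /* non-standard; header name and behaviour vary by platform */
#include <stddef.h>

void runtime_sized(size_t n)
{
    /* alloca: size chosen at runtime, no free() needed, but the memory only
       goes away when the *function* returns, and there is no failure check. */
    int *a = alloca(n * sizeof *a);

    /* C99 VLA: also runtime-sized, but an ordinary object with block scope,
       released at the end of the enclosing block. */
    int b[n];

    a[0] = 1;
    b[0] = 2;
}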
is this the same thing as allocating stack memory via...
I would think not quite. Declaring a local variable causes the memory to be reserved when the stack frame is entered (by subtracting the size of the variable from the stack pointer and adjusting for alignment).
It looks like alloca(3) works by adjusting the stack pointer at the moment it is encountered. Note the "Bugs" section of the man page.
alloca() is machine and compiler dependent; its use is discouraged.
alloca() is slightly unsafe because it cannot ensure that the pointer returned points to a valid and usable block of memory. The allocation made may exceed the bounds of the stack, or even go further into other objects in memory, and alloca() cannot determine such an error. Avoid alloca() with large unbounded allocations.
These two points together add up to the following in my opinion:
DO NOT USE ALLOCA
Assuming, as Joachim points out, that you mean CMTimeRange someVariableName[3]...
Both will allocate memory on the stack.
I'm guessing alloca() will have to add extra code after your function prologue to do the allocation... The function prologue is code that the compiler automatically generates for you to create room on the stack. The upshot is that your function may be slightly larger once compiled, but not by much... a few extra instructions to modify the stack pointer and possibly the stack frame. I suppose a compiler could optimize the call away if it isn't in a conditional branch, or even hoist it out of a conditional branch, though?
I experimented on my MQX compiler with no optimisations... it's not Objective-C, just C, and a different platform, but hopefully that's a good enough approximation and it does show a difference in the emitted code.
Obviously it is not advisable to put large arrays on the stack... this is just for demo purposes.
#include <alloca.h>   /* declares alloca(); non-standard, header name varies by toolchain */

unsigned int TEST1(unsigned int stuff)
{
    unsigned int a1[100]; // Make sure it must go on stack
    unsigned int a2[100]; // Make sure it must go on stack
    a1[0] = 0xdead;
    a2[0] = stuff + 10;
    return a2[0];
}

unsigned int TEST2(unsigned int stuff)
{
    unsigned int a1[100]; // Make sure it must go on stack
    unsigned int *a2 = alloca(sizeof(unsigned int) * 100);
    a1[0] = 0xdead;
    a2[0] = stuff + 10;
    return a2[0];
}
The following assembler was generated:
TEST1:
Both arrays a1 and a2 are put on the stack in the function prologue...
0: 1cfcb6c8 push %fp
4: 230a3700 mov %fp,%sp
8: 24993901 sub3 %sp,%sp,100 # Both arrays put on stack
c: 7108 mov_s %r1,%r0
e: 1b38bf98 0000dead st 0xdead,[%fp,0xffff_fce0] ; 0xdead
16: e00a add_s %r0,%r0,10
18: 1b9cb018 st %r0,[%fp,0xffff_fe70]
1c: 240a36c0 mov %sp,%fp
20: 1404341b pop %fp
24: 7ee0 j_s [%blink]
TEST2:
Only array a1 is put on the stack in the prologue... Extra lines of code have to be generated to deal with the alloca.
0: 1cfcb6c8 push %fp
4: 230a3700 mov %fp,%sp
8: 24593c9c sub3 %sp,%sp,50 # Only one array put on stack
c: 240a07c0 mov %r4,%blink
10: 220a0000 mov %r2,%r0
14: 218a0406 mov %r1,0x190 # Extra for alloca()
18: 2402305c sub %sp,%sp,%r1 # Extra for alloca()
1c: 08020000r bl _stkchk # Extra for alloca()
20: 738b mov_s %r3,%sp # Extra, r3 to access write via pointer
22: 1b9cbf98 0000dead st 0xdead,[%fp,0xffff_fe70] ; 0xdead
2a: 22400280 add %r0,%r2,10
2e: a300 st_s %r0,[%r3] # r3 to access write via pointer
30: 270a3100 mov %blink,%r4
34: 240a36c0 mov %sp,%fp
38: 1404341b pop %fp
3c: 7ee0 j_s [%blink]
Also, your alloca() memory will be accessed through pointers (unless there are clever compiler optimisations for this... I don't know), so it causes actual memory accesses. Automatic variables might be optimized into pure register accesses, which is better... the compiler can figure out, using register colouring, which automatic variables are best left in registers and whether they ever need to be on the stack.
I had a quick search through the C99 standard (C11 is around now... my reference is a little out of date). I could not see a reference to alloca, so it is probably not a standard-defined function. A possible disadvantage?

How do I get started with ARM on iOS?

Just curious as to how to get started understanding ARM under iOS. Any help would be super nice.
In my opinion, the best way to get started is to
Write small snippets of C code (later Objective-C)
Look at the corresponding assembly code
Find out enough to understand the assembly code
Repeat!
To do this you can use Xcode:
Create a new iOS project (a Single View Application is fine)
Add a C file scratchpad.c
In the Project Build Settings, set "Generate Debug Symbols" to "No"
Make sure the target is iOS Device, not Simulator
Open up scratchpad.c and open the assistant editor
Set the assistant editor to Assembly and choose "Release"
Example 1
Add the following function to scratchpad.c:
void do_nothing(void)
{
    return;
}
If you now refresh the Assembly in the assistant editor, you should see lots of lines starting with dots (directives), followed by
_do_nothing:
# BB#0:
bx lr
Let's ignore the directives for now and look at these three lines. With a bit of searching on the internet, you'll find out that these lines are:
A label (the name of the function prefixed with an underscore).
Just a comment emitted by the compiler.
The return statement. The b means branch, ignore the x for now (it has something to do with switching between instruction sets), and lr is the link register, where callers store the return address.
Example 2
Let's beef it up a bit and change the code to:
extern void do_nothing(void);

void do_nothing_twice(void)
{
    do_nothing();
    do_nothing();
}
After saving and refreshing the assembly, you get the following code:
_do_nothing_twice:
# BB#0:
push {r7, lr}
mov r7, sp
blx _do_nothing
pop.w {r7, lr}
b.w _do_nothing
Again, with a bit of searching on the internet, you'll find out the meaning of each line. Some more work needs to be done because we make two calls: the first call needs to return to us, so we need to change lr. That is done by the blx instruction, which not only branches to _do_nothing but also stores the address of the next instruction (the return address) in lr.
Because we change the return address, we have to store it somewhere, so it is pushed on the stack. The second jump has a .w suffixed to it, but let's ignore that for now. Why doesn't the function look like this?
_do_nothing_twice:
# BB#0:
push {lr}
blx _do_nothing
pop.w {lr}
b.w _do_nothing
That would work as well, but in iOS, the convention is to store the frame pointer in r7. The frame pointer points to the place in the stack where we store the previous frame pointer and the previous return address.
So what the code does is: first, it pushes r7 and lr to the stack, then it sets r7 to point to the new stack frame (which is on the top of the stack; sp points to the top of the stack), then it branches for the first time, then it restores r7 and lr, and finally it branches for the second time. A bx lr at the end is not needed, because the called function will return to lr, which points to our caller.
Example 3
Let's have a look at a last example:
void swap(int *x, int *y)
{
    int temp = *x;
    *x = *y;
    *y = temp;
}
The assembly code is:
_swap:
# BB#0:
ldr r2, [r0]
ldr r3, [r1]
str r3, [r0]
str r2, [r1]
bx lr
With a bit of searching, you will learn that arguments and return values are passed in registers r0-r3, and that we may use those freely for our calculations. What the code does is straightforward: it loads the values that r0 and r1 point to into r2 and r3, then stores them back in swapped order, then branches back.
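As a further exercise (my own addition, not part of the original walkthrough), you could run a function with a return value through the same workflow:
int add_ints(int a, int b)
{
    return a + b;
}
On 32-bit ARM you should see something like adds r0, r0, r1 followed by bx lr: the arguments arrive in r0 and r1, and r0 is reused to carry the result back to the caller (the exact instructions depend on your compiler version and optimization settings).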
And So On
That's it: Write small snippets, get enough info to roughly understand what's going on in each line, repeat. Hope that helps!

How is tail-call optimization in FPLs implemented at the assembly level?

How do LISPs or MLs implement tail-call optimization?
I can't speak to the exact implementation details of different compilers/interpreters; however, generally speaking, tail-call optimization operates like this:
Normally a function call involves something like this:
Allocate stack space for the return
Push your current instruction pointer onto the stack
Allocate stack space for the function parameters and set them appropriately
Call your function
To return, it sets its return space appropriately, pops off the instruction pointer it should be returning to, and jumps to it
However, when a call is in tail position (which pretty much means you are returning the result of the function you are about to call), you can be tricky and do the following:
Re-use the stack space allocated for your own return value as the stack space allocated for the return value of the function you are about to call
Re-use the instruction pointer you should be returning to as the instruction pointer that the function you are about to call will use
Free your own parameters' stack space
Allocate space for the new function's parameters
Set the values of those parameters
Call your function
When it returns, it will be returning directly to your caller.
Note that #1 and #2 don't actually involve any work, #3 can be tricky or simple depending on your implementation, and 4-7 don't involve anything special from the function you are calling. Also note that all of this results in zero stack growth with respect to your call stack, so it allows for infinite recursion and generally speeds things up a little.
Also note that this kind of optimization can be applied not only to directly recursive functions, but also to co-recursive functions and, in fact, to any call in tail position.
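To make that concrete, here is a small C sketch (my own illustration): only the second version makes its recursive call in tail position, so a compiler that performs tail-call optimization can turn it into a plain jump/loop with no stack growth.
/* Not a tail call: after the recursive call returns we still have to multiply,
   so each level of recursion needs its own stack frame. */
unsigned long fact(unsigned long n)
{
    if (n <= 1)
        return 1;
    return n * fact(n - 1);
}

/* Tail call: the recursive call is the last thing we do and its result is
   returned unchanged, so an optimizing compiler can reuse the current frame
   and compile the call into a jump. */
unsigned long fact_acc(unsigned long n, unsigned long acc)
{
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, n * acc);
}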
Whether a given function can be tail-call optimized depends on the CPU architecture and/or the operating system. That's because calling conventions (for passing function arguments and/or transferring control between functions) differ between CPUs and operating systems. It usually boils down to whether anything passed in the tail call must go via the stack or not. Take, for example, a function like:
#include <stdio.h>

void do_a_tailcall(char *message)
{
    printf("Doing a tailcall here; you said: %s\n", message);
}
If you compile this, even with high optimization (-O8 -fomit-frame-pointer), on 32bit x86 (Linux), you get:
do_a_tailcall:
subl $12, %esp
movl 16(%esp), %eax
movl $.LC0, (%esp)
movl %eax, 4(%esp)
call printf
addl $12, %esp
ret
.LC0:
.string "Doing a tailcall here; you said: %s\n"i.e. a classical function, with stackframe setup/teardown (subl $12, %esp / addl $12, %esp) and an explicit ret from the function.
In 64bit x86 (Linux), this looks like:
do_a_tailcall:
movq %rdi, %rsi
xorl %eax, %eax
movl $.LC0, %edi
jmp printf
.LC0:
.string "Doing a tailcall here; you said: %s\n"
so it got tail-optimized.
On an entirely different type of CPU architecture (SPARC), this looks like (I've left the compiler's comment in):
.L16:
.ascii "Doing a tailcall here; you said: %s\n\000"
!
! SUBROUTINE do_a_tailcall
!
.global do_a_tailcall
do_a_tailcall:
sethi %hi(.L16),%o5
or %g0,%o0,%o1
add %o5,%lo(.L16),%o0
or %g0,%o7,%g1
call printf ! params = %o0 %o1 ! Result = ! (tail call)
or %g0,%g1,%o7
Yet another one ... ARM (Linux EABI):
.LC0:
.ascii "Doing a tailcall here; you said: %s\012\000"
do_a_tailcall:
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
# link register save eliminated.
mov r1, r0
movw r0, #:lower16:.LC0
movt r0, #:upper16:.LC0
b printf
The differences here are in the way arguments are passed and control is transferred:
32bit x86 (stdcall / cdecl type calling) passes args on the stack, and hence the potential for tail call optimization is very limited - apart from specific corner cases, it's only likely to happen for exact argument passthrough or when tail calling functions that take no arguments at all.
64bit x86 (UNIX x86_64 style, but not too different on Win64) passes a certain number of arguments in registers, and that leaves the compiler considerably more freedom on what can be called without having to pass anything on the stack. Control transfer via jmp simply makes the tail-called function inherit the stack - including the topmost value, which would be the return address into the original caller of our do_a_tailcall.
SPARC not only passes function arguments in registers, but also return addresses (it uses a link register, %o7). So while you transfer control via call, that doesn't actually force a new stack frame, since all it does is set both the link register and the program counter ... the former is then undone via another odd feature of SPARC, the so-called delay slot instruction (the or %g0,%g1,%o7 - SPARC-ish for mov %g1,%o7 - is executed after the call but before the target of the call is reached). The above code was created by an old compiler rev ... and is not as optimized as it theoretically could be...
ARM is similar to SPARC as it uses a link register, which tail-recursive functions just pass unmodified/untouched to the tail-call. It's also similar to x86 by using b (branch) on tail recursion instead of the "call" equivalent (bl, branch-and-link).
In all architectures where at least some argument passing can happen in registers, tail call optimization can be applied by the compiler on a large variety of functions.

Updating variable that lives in the data segment from the stack and its segment

I currently have three segments of memory: my main data segment, a stack segment, and the segment where my API lives. The following instructions are executed from the data segment; they push the address of cursorRow and welcomeMsg and then do a far call to the function in my API segment. The cursorRow variable lives in the main data segment that is calling the API function. The call looks like this:
push cursorRow
push welcomeMsg
call API_SEGMENT:API_printString
How can I alter cursorRow, inside of the segment where my API lives, through the stack? cursorRow needs to be updated from the API. NO API functions alter the data segment. I have tried things like: inc byte [ds:bp+8] and add [ds:bp+8], 1.
Here is the API procedure being called:
printStringProc:
push bp
mov bp, sp
mov si, [bp+6]
.printloop:
lodsb
cmp al, 0
je printStringDone
mov ah, 0x0E ; teletype output
mov bh, 0x00 ; page number
mov bl, 0x07 ; color (only in graphic mode)
int 0x10
jmp .printloop
printStringDone:
; move the cursor down
mov ah, 02h ; move cursor
mov dh, [bp+8]
mov dl, 0 ; column
mov bh, 0 ; page number
int 10h
add [ds:bp+8], 1
pop bp
retf
It prints strings, but the cursorRow variable doesn't correctly update. I hope I'm clear enough on my issue. It's hard to explain :D
This is because you passed the pointer to cursorRow, not cursorRow itself. When you perform
inc [ds:bp+8]
you: 1) get the value of bp, 2) add 8, 3) assume the result is a pointer in ds, 4) increment the value stored there (the pointer to cursorRow). Since the pointer is stored on the stack, you are incrementing the pointer when you do this. What you need to do is take the pointer off of the stack and increment the value it points to.
mov bx, [bp+8]
inc byte [bx]   ; NASM needs an operand size here; this assumes cursorRow is a byte
This code: 1) gets the value of bp, 2) adds 8, 3) assumes the result is a pointer in ss, 4) loads the value stored there (the pointer to cursorRow) into bx, 5) assumes bx is a pointer in ds, 6) increments the value stored there (the value of cursorRow).
It looks like you just pushed the value of cursorRow onto the stack. Without the address you cannot update it. With the address you can easily reference that address's value, put it into a register, perform operations on it, then take the value in that register and store it back at the address of cursorRow.
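For illustration, here is a minimal NASM-style sketch of the address-passing pattern (hypothetical, not from the question: it assumes cursorRow is a byte variable, that the API entry label is API_printString, and that ds inside the API still points at the caller's data segment, since the API is not allowed to alter it):
; --- caller, in the data segment ---
    push cursorRow              ; in NASM this pushes the offset (address) of cursorRow
    push welcomeMsg
    call API_SEGMENT:API_printString
    add  sp, 4                  ; caller removes the two words it pushed

; --- callee, in the API segment ---
API_printString:
    push bp
    mov  bp, sp
    mov  bx, [bp+8]             ; [bp+0]=old bp, +2=IP, +4=CS, +6=welcomeMsg, +8=cursorRow offset
    inc  byte [ds:bx]           ; bump the row counter itself, not the pointer on the stack
    pop  bp
    retf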