Developing a simple bootloader for an embedded system

I've been tasked with developing a simple bootloader for an embedded system. We are not running any OS or RTOS, so I want it to be really simple.
This code will be stored in a ROM and the processor will begin execution at power on.
My goal is to have a first part written in ASM which would take care of the following operations:
Initialize the processor
Copy the .data segment from ROM to RAM
Clear the .bss segment in RAM
Call main
main would obviously be written in C and perform higher-level operations like self-tests, etc.
Now, what I really don't know how to do is combine these two programs into a single one. I found a crappy tool that basically uses objcopy to gather the .text and .data sections from executables and prepends some asm, but this seems to be a really ugly way to do it, and I was wondering if someone could point me in the right direction?

You can (in principle) link the object file generated from the assembler code like you would link any object from your program.
The catch is that you need to lay out the generated executable so that your startup code is in the beginning. If you use GNU ld, the way to do that is a linker script.
Primitive setup (not checked for syntax errors):
MEMORY
{
    FLASH (rx)  : ORIGIN = 0, LENGTH = 256K
    RAM   (rwx) : ORIGIN = 0x40000000, LENGTH = 4M
}
SECTIONS
{
    .bootloader 0 : { bootloader.o(.text) } >FLASH AT>FLASH
    .text : { _stext = .; *(.text .text.* .rodata .rodata.*); _etext = .; } >FLASH AT>FLASH
    .data : { _sdata = .; *(.data .data.*); _edata = .; _sdata_load = LOADADDR(.data); } >RAM AT>FLASH
    .bss (NOLOAD) : { _sbss = .; *(.bss .bss.*); _ebss = .; } >RAM
}
The basic idea is to give the linker a rough idea of the memory map, then assign sections from the input files to sections in the final program.
The linker keeps the distinction between the "virtual" and the "load" address for every output section, so you can tell it to generate a binary where the code is relocated for the final addresses, but the layout in the executable is different (here, I tell it to place the .data section in RAM, but append it to the .text section in flash).
Your bootloader can then use the symbols provided (_sdata, _edata, _sdata_load) to find the data section both in RAM and in flash, and copy it.
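For illustration, a minimal copy/clear sketch in C using the symbols from the script above (real startup code would typically do the equivalent in assembly, before the C environment exists):
/* Minimal sketch: copy .data from flash to RAM and zero .bss,
   using the symbols defined in the linker script above. */
extern char _sdata[], _edata[], _sdata_load[];
extern char _sbss[], _ebss[];

static void copy_data_and_clear_bss(void)
{
    /* Copy initialized data from its load address in flash to RAM. */
    char *src = _sdata_load;
    for (char *dst = _sdata; dst < _edata; )
        *dst++ = *src++;

    /* Zero the .bss section. */
    for (char *dst = _sbss; dst < _ebss; )
        *dst++ = 0;
}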
Final caveat: if your program uses static constructors, you also need a constructor table, and the bootloader needs to call the static constructors.
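A sketch of that constructor walk, assuming the linker script additionally collects .init_array between symbols named __init_array_start and __init_array_end (those names are a common convention; the script above does not define them):
/* Hypothetical constructor-table walk; requires the linker script to gather
   .init_array (or .ctors) between these two symbols. */
typedef void (*init_func)(void);
extern init_func __init_array_start[], __init_array_end[];

static void call_static_constructors(void)
{
    for (init_func *f = __init_array_start; f < __init_array_end; f++)
        (*f)();
}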

Simon is right on. There are simpler linker scripts than that which will work just fine for what you are doing, but the bottom line is that it is the linker that takes the objects and makes the binary. So, depending on the linker you are using, you have to understand the ways you can tell that linker to do things and then have it do them. Unfortunately, I don't think there is an industry standard for this; you have to go linker by linker and understand each one. And certainly with GNU ld there are many very complicated linker scripts out there; some folks live to solve things in the linker.

Related

RISC-V inline assembly using memory not behaving correctly

This system call code is not working at all. The compiler is optimizing things out and generally behaving strangely:
template <typename... Args>
inline void print(Args&&... args)
{
    char buffer[1024];
    auto res = strf::to(buffer) (std::forward<Args>(args)...);
    const size_t size = res.ptr - buffer;
    register const char* a0 asm("a0") = buffer;
    register size_t a1 asm("a1") = size;
    register long syscall_id asm("a7") = ECALL_WRITE;
    register long a0_out asm("a0");
    asm volatile ("ecall" : "=r"(a0_out)
                  : "m"(*(const char(*)[size]) a0), "r"(a1), "r"(syscall_id) : "memory");
}
This is a custom system call that takes a buffer and a length as arguments.
If I write this using global assembly it works as expected, but the code the compiler generates has generally been extraordinarily good when I write the wrappers inline.
A function that calls the print function with a constant string produces invalid machine code:
0000000000120f54 <start>:
start():
120f54: fa1ff06f j 120ef4 <public_donothing-0x5c>
-->
120ef4: 747367b7 lui a5,0x74736
120ef8: c0010113 addi sp,sp,-1024
120efc: 55478793 addi a5,a5,1364 # 74736554 <add_work+0x74615310>
120f00: 00f12023 sw a5,0(sp)
120f04: 00a00793 li a5,10
120f08: 00f10223 sb a5,4(sp)
120f0c: 000102a3 sb zero,5(sp)
120f10: 00500593 li a1,5
120f14: 06600893 li a7,102
120f18: 00000073 ecall
120f1c: 40010113 addi sp,sp,1024
120f20: 00008067 ret
It's not loading a0 with the buffer at sp.
What am I doing wrong?
It's not loading a0 with the buffer at sp.
Because you didn't ask for a pointer as an "r" input in a register. The one and only guaranteed/supported behaviour of T foo asm("a0") is to make an "r" constraint (including +r or =r) pick that register.
But you used "m" to let it pick an addressing mode for that buffer, not necessarily 0(a0), so it probably picked an SP-relative mode. If you add asm comments inside the template like "ecall # 0 = %0 1 = %1 2 = %2" you can look at the compiler's asm output and see what it picked. (With clang, use -no-integrated-as so asm comments in the template come through in the -S output.)
Wrapping a system call does need the pointer in a specific register, i.e. using "r" or "+r":
asm volatile ("ecall # 0=%0 1=%1 2=%2 3=%3 4=%4"
: "=r"(a0_out)
: "r"(a0), "r"(a1), "r"(syscall_id), "m"(*(const char(*)[size]) a0)
: // "memory" unneeded; the "m" input tells the compiler which memory is read
);
That "m" input can be used instead of the "memory" clobber, not instead of an "r" pointer input. (For write specifically, because it only reads that one area of pointed-to memory and has no other side-effects on memory user-space can see, only on kernel write write buffers and file-descriptor positions which aren't C objects this program can access directly. For a read call, you'd need the memory to be an output operand.)
With optimization disabled, compilers do typically pick another register as the base for the "m" input (e.g. 0(a5) for GCC), but with optimization enabled GCC picks 0(a0) so it doesn't cost extra instructions. Clang still picks 0(a2), wasting an instruction to set up that pointer, even though the "=r"(a0_out) is not early-clobber. (Godbolt, with a very cut-down version of the function that doesn't call strf::to, whatever that is, just copies a byte into the buffer.)
Interestingly, with optimization enabled for my cut-down stand-alone version of the function without fixing the bug, GCC and clang do happen to put a pointer to buffer into a0, picking 0(a0) as the template expansion for that operand (see the Godbolt link above). This seems to be a missed optimization vs. using 16(sp); I don't see why they'd need the buffer address in a register at all.
But without optimization, GCC picks ecall # 0 = a0 1 = 0(a5) 2 = a1. (In my simplified version of the function, it sets a5 with mv a5,a0, so it did actually have the address in a0 as well. So it's a good thing you had more code in your function to make it not happen to work by accident, so you could find the bug in your code.)
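For reference, a stand-alone C sketch along the lines of that cut-down version, with the buffer pointer passed as an "r" input and the pointed-to bytes as an "m" input (ECALL_WRITE = 102 is taken from the disassembly in the question; returning a0 is an assumption about the custom call's ABI):
#include <stddef.h>

#define ECALL_WRITE 102  /* from "li a7,102" in the question's disassembly */

static inline long sys_write(const char *buf, size_t len)
{
    register const char *a0 asm("a0") = buf;
    register size_t      a1 asm("a1") = len;
    register long        a7 asm("a7") = ECALL_WRITE;
    register long        ret asm("a0");

    asm volatile ("ecall"
                  : "=r"(ret)
                  : "r"(a0), "r"(a1), "r"(a7),
                    "m"(*(const char (*)[len]) a0));  /* the bytes actually read */
    return ret;
}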

Static ELF alternative to dlsym

Is it possible to lookup the location of a function using ELF? Similar to what
void *f = dlopen(NULL,..);
void *func = dlsym(f, "myfunc");
does, but without requiring -rdynamic during compilation?
I can see using nm that the names of the items are still present in a compiled binary:
0000000000400716 T lookup
0000000000400759 T main
Can I use this information to locate the items once the program is loaded into memory?
Can I use this information to locate the items once the program is loaded into memory?
You sure can: iterate over all symbols in the a.out until you find the matching one. Example code to iterate over symbols is here. Or use libelf.
If you need to perform multiple symbol lookups, iterate once (slow) over all symbols, build a map from symbol name to its address, and perform lookups using that map.
Update:
The example you point to seems incomplete? It uses data and elf, where are they coming from?
Yes, you need to apply a little bit of elbow grease to that example.
The data is the location in memory that the a.out was read into, or (better) mmapped to.
You can either mmap the a.out yourself, or find the existing mapping via e.g. getauxval(AT_PHDR) rounded down to page size.
The ehdr is (ElfW(Ehdr) *)data (that is, data cast to Elf32_Ehdr or Elf64_Ehdr as appropriate).
If this is not clear, then you probably should just use libelf, which takes care of the details for you.
Also, does ELF only allow me to find the name of the symbol, or can it actually give me the pointer to the in memory location of the symbol?
It can give you both: str + sym[i].st_name is the name, sym[i].st_value is the pointer (the value displayed by nm).
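A rough sketch of that lookup over a mmapped binary, using the raw ELF structures (64-bit only, no error handling or cleanup; a production version should use libelf, or at least check every call):
#include <elf.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Look up `name` in the .symtab of the ELF file at `path`. */
static void *lookup_static_symbol(const char *path, const char *name)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    Elf64_Ehdr *ehdr = (Elf64_Ehdr *)data;
    Elf64_Shdr *shdr = (Elf64_Shdr *)(data + ehdr->e_shoff);

    for (int i = 0; i < ehdr->e_shnum; i++) {
        if (shdr[i].sh_type != SHT_SYMTAB)
            continue;
        Elf64_Sym *sym = (Elf64_Sym *)(data + shdr[i].sh_offset);
        const char *str = data + shdr[shdr[i].sh_link].sh_offset;  /* linked strtab */
        size_t count = shdr[i].sh_size / sizeof(Elf64_Sym);
        for (size_t j = 0; j < count; j++)
            if (strcmp(str + sym[j].st_name, name) == 0)
                return (void *)sym[j].st_value;  /* absolute address for ET_EXEC */
    }
    return NULL;
}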
(presumably the e.g. 0000000000400716 is some relative base address, not the actual in memory location, right?)
No, actually (for this binary) it's the absolute address.
Position-independent binaries do use relative addresses (so you'll need something like the getauxval mentioned above to find the base address of such an executable), but this particular binary looks like ET_EXEC (use readelf -h a.out to verify this). The address 0x400000 is the typical load address for non-PIE executables on Linux x86_64 (which is probably what your system is).

How do I limit the scope of the preprocessing definitions used?

How do I preprocess a code base using the clang (or gcc) preprocessor while limiting its text processing to use only #define entries from a single header file?
This is useful generally: imagine you want to preview the immediate result of some macros that you are currently working on… without having all the clutter that results from the mountain of includes inherent to C.
Imagine a case where there are macros that yield either a backward-compatible call or an up-to-date one, based on feature availability:
#if __has_feature(XYZ)
# define JX_FOO(_o) new_foo(_o)
# define JX_BAR(_o) // nop
...
#else
# define JX_FOO(_o) old_foo(_o)
# define JX_BAR(_o) old_bar(_o)
...
#endif
A concrete example is a collection of Objective-C code that was ported to be ARC-compatible (Automatic Reference Counting) from manual memory management (non-ARC) using a collection of macros (https://github.com/JanX2/google-diff-match-patch-Objective-C/blob/master/JXArcCompatibilityMacros.h) so that it compiles both ways afterwards.
At some point, you want to drop non-ARC support to improve readability and maintainability.
Edit: The basis for getting the preprocessor output is described here: C, Objective-C preprocessor output
Edit 2: If someone has details of how the source-to-source transformation options in Xcode are implemented (Edit > Refactor > Convert To…), that might help.
If you are writing the file from scratch or all the includes are in one place, why not wrap them inside of:
#ifndef MACRO_DEBUG
#include "someLib.h"
/* ... */
#endif
But as I mentioned, this only works when the includes are on consecutive lines, and in the best case you are writing the file from scratch yourself, so you don't have to go looking for the includes.
This is a perfect case for sed/awk. However, there is an even better tool available for the exact use case that you mention: check out coan.
To pre-process a source file as if the symbol <SYMBOL> is defined:
$ coan source -D<SYMBOL> sourcefile.c
Similarly, to pre-process a source file as if the symbol <SYMBOL> is NOT defined:
$ coan source -U<SYMBOL> source.c
This is a bit of a stupid solution, but it works: apparently you can use AppCode’s refactoring to delete uses of a macro.
This limits the solution to OS X, though. It also is slightly tedious, because you have to do this manually for every JX_FOO() and JX_BAR().

Assembly code for optimized bitshifting of a vector

I'm trying to write a routine that will logically bit-shift all elements of a vector n positions to the right, in the most efficient way possible, for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type, depending on the processor (SSE2 supported, only MMX supported, only the standard instruction set supported). Therefore I need 12 functions in total.
I have already found out by myself how to back up and restore the registers that I need, how to make a loop, how to copy data into regular registers or MMX registers, and how to shift by 1 position logically.
Because I'm not familiar with assembly language, that's about it.
Which registers should I use for each instruction set?
How can I optimize the availability of the large vector (an image) in the L1 cache?
How do I find the next element of the vector (a pointer kind of thing)? I know I can do a mov by address, and I assume I have to increment the address by 1, 2 or 4 depending on my data type.
Although I have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what I'm trying to do with MMX for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm. That way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks, and you can just focus on the important parts; e.g. instead of using psrlq, see _m_psrlq in mmintrin.h. (Better yet, look at using 128-bit SSE intrinsics.)
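As a sketch of the intrinsics approach for just one of the twelve cases (DWORD->DWORD with SSE2; the function name and the alignment/length assumptions are mine, not from the question):
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Logical right shift of every 32-bit element by n.
   Assumes count is a multiple of 4 and data is 16-byte aligned;
   a real routine would also handle the unaligned head/tail. */
void vec_shr_u32_sse2(uint32_t *data, size_t count, unsigned n)
{
    const __m128i shift = _mm_cvtsi32_si128((int)n);
    for (size_t i = 0; i < count; i += 4) {
        __m128i v = _mm_load_si128((__m128i *)(data + i));
        _mm_store_si128((__m128i *)(data + i), _mm_srl_epi32(v, shift));
    }
}
The compiler handles the loop counter, the pointer increments and the register allocation, which is exactly the housekeeping the question was asking about.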
Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics-based too, which makes it far more portable (though from the looks of it you're using GCC, so you might have to get an MSVC-to-GCC intrinsics mapping).

How do I get the Keil RealView MDK-ARM toolchain to link a region for execution in one area in memory but have it store it in another?

I'm writing a program that updates flash memory. While I'm erasing/writing flash, I would like to be executing from RAM. Ideally, I'd link my code for an execution region in RAM, store it in flash, and on startup copy it to the RAM location it was linked for.
I don't include any of the normal generated C/C++ initialization code so I can't just tag my function as __ram.
If I could do the above, then the debugger's symbols would match the copied-to-RAM code and I'd be able to debug business as usual.
I'm thinking that something along the lines of OVERLAY/RELOC might help but I'm not sure.
Thanks,
Maybe your application code can do it manually. Something like
unsigned char *pSourceAddr = (unsigned char *)&FunctionInFlash;
unsigned char *pDestAddr   = (unsigned char *)&RamReservedForFunction;

while (pSourceAddr < ((unsigned char *)&FunctionInFlash + FunctionSize))
{
    *pDestAddr++ = *pSourceAddr++;
}

typedef int (*RamFuncPtr)(int arg1);   /* or whatever the signature is.. */
result = ((RamFuncPtr)&RamReservedForFunction)(argument1);
You should be able to get the linker definition file to export symbols for the FunctionInFlash and RamReservedForFunction addresses.
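As an illustration, if the scatter file defines an execution region for that code (RAM_FUNCS here is a made-up region name), the ARM linker emits Load$$/Image$$ symbols for it, which avoids hand-maintaining FunctionSize (the $$ identifiers are ARM-toolchain-specific):
/* Hypothetical execution region RAM_FUNCS: load address in flash,
   execution address in RAM. These symbols are generated by armlink. */
extern unsigned char Load$$RAM_FUNCS$$Base;    /* load (flash) address    */
extern unsigned char Image$$RAM_FUNCS$$Base;   /* execution (RAM) address */
extern unsigned char Image$$RAM_FUNCS$$Length; /* region length           */

void copy_ram_funcs(void)
{
    unsigned char *src = &Load$$RAM_FUNCS$$Base;
    unsigned char *dst = &Image$$RAM_FUNCS$$Base;
    unsigned int   len = (unsigned int)&Image$$RAM_FUNCS$$Length;

    while (len--)
        *dst++ = *src++;
}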