What are common structures for firmware files? - embedded

I'm a total n00b in embedded programming. Suppose I'm building a firmware using a compiler. The result of this operation is a file that will be flashed into (I guess) the flash memory of a MCU such an ARM or a AVR.
My question is: What common structures (if any) are used for such generated files containing the firmware?
I came from desktop development and I understand that for example for Windows the compiler will most likely generate a PE or PE+, while Unix-like systems I may get a ELF and COFF, but have no idea for embedded systems.
I also understand that this highly depends on many factors (Compiler, ISA, MCU vendor, OS, etc.) so I'm fine with at least one example.
Update: I will up vote all answers providing examples of used structures and will select the one I feel best surveys the state of the art.

The firmware file is the Executable and Linkable File, usually processed to a binary (.bin) or text represented binary (.hex).
This binary file is the exact memory that is written to the embedded flash. When you first power the board, an internal bootloader will redirect the execution to your firmware entry point, normally at the address 0x0.
From there, it is your code that is running, this is why you have a startup code (usually startup.s file) that will configure clock, stack pointer registers, vector table, load the data section to RAM (your initialized variables), clear the zero initialized section, maybe you will want to copy your code to RAM and jump to the copy to avoid running code from FLASH (can be faster on some platforms), and so on.
When running over an Operational System, all these platform choices and resources are not in control of user code, there you can only link to the OS libraries and use the provided API to do low level actions. In embedded, it is 100% user code, you access the hardware and manage its resources.
Not surprisingly, Operational Systems are booted in a similar manner as firmware, since both are there in touch with the processor, memory and I/Os.
All of that, to say: the structure of a firmware is similar to the structure of any compiled program. There's the data sections and code sections that are organized in memory during the load by the Operational System, or by the program itself when running on embedded.
One main difference is the memory addressing in the firwmare binary, usually addresses are physical RAM address, since you do not have memory mapping feature on most of micro-controllers. This is transparent to the user, the compiler will abstract it.
Other significant difference is the stack pointer, on OSs user code will not reserve memory for the stack by itself, it relays on OS for that. When on firmware, you have to do it in user code for the same reason as before, there's no middle man to manage it for you. The linker script of the compiler will reserve Stack and Heap memory accordingly configured and there will be a stack_pointer symbol on your .map file letting you know where it points to. You won't find it in OSs program's map files.

Most tools output either an ELF, or a COFF, or something similar that can eventually boil down to a HEX/bin file.
That isn't necessarily what your target wants to see, however. Every vendor has their own format of "firmware" files. Sometimes they're encrypted and signed, sometimes plain text. Sometimes there's compression, sometimes it's raw. It might be a simple file, or something complex that is more than just your program.
An integral part of doing embedded work is the build flow and system startup/booting procedure, plus getting your code onto the part. Don't underestimate the effort.

Ultimately the data written to the ROM is normally just the code and constant data from which your application is composed, and therefore has no structure other than perhaps being segmented into code and data, and possibly custom segments if you have created them. The structure in this sense is defined by the linker script or configuration used to build the code. The file containing this code/data may be raw binary, or an encoded binary format such as Intel Hex or Motorola S-Record for example.
Typically also your toolchain will generate an object code file that contains not only the code/data, but also symbolic and debug information for use by a debugger. In this case when the debugger runs, it loads the code to the target (as in the binary file case above) and the symbol/debug information to the host to allow source level debugging. These files may be in a proprietary object file format specific to the toolchain, but are often standard "open" formats such as ELF. However strictly the meta-data components of an object file are not part of the firmware since they are not loaded on the target.

I've recently run across another firmware format not listed here. It's a binary format, might be called ".EEP" but might not. I think it is used by NXP. I've seen it used for ARM THUMB2 and for mystery stuff that may be a DSP/BSP.
All the following are 32-bit values, all stored in reverse endian except for CAFEBABE (so... BEBAFECA?):
CAFEBABE
length_in_16_bit_words(yes, 16-bit...?!)
base_addr
CRC32
length*2 bytes of data
FFFF (optional filler if the length is an odd number)
If there are more data blocks:
length
base
checksum that is not a CRC but something bizarre
data
FFFF (optional filler if the length is an odd number)
...
When no more data blocks remain:
length == 0
base == 0
checksum that is not a CRC but something bizarre
Then all of that is repeated for another memory bank/device.

Related

How to send data to target through Trace32 debugger?

I need a way to send some data to the ucontroller through Trace32. I heard that this is possible some way, but I have no idea where to start.
What I am actually trying to do is run a piece of code on a Aurix TC297 ucontroller to do some measurements (runtime, RAM, etc.). This piece of code is actually a Kalman filter that needs as input a vector of structs that I have too send from the computer through Trace32. Please help !
"A way to send some data to the ucontroller through Trace32" is a little bit vague. There are various possibilities depending on what your actually try to achieve and might also depend on the used CPU family and target OS. Anyhow one of the following might work:
Simply writing some raw data to the target memory can be achieved with the Data.Set command.
To transfer a big amount of data (or even a whole application) from a file to the target memory the Data.LOAD commands might be the right choice. E.g. Data.LOAD.Binary command for a raw binary file.
To set variables in your application or even initiate C-style data arrays use the Var.Set command.
To write data to NOR flash or onchip flash memory you'll need the FLASH.AUTO command in addition to the previously mentioned commands (after declaring the flash memory to TRACE32).
To write data to a NAND, SPI or other serial flash memory you probably should use the FLASHFILE.Set command (after initialization of the FLASHFILE programming system).
To transfer data from TRACE32 to your target while the CPU is running you might have to configure correctly SYStem.MemAccess and use the memory access class prefix "E". E.g. Data.Set E:<addr> <data> or Var.Set %E <expression>.
You can use FDX for a bidirectional data transfer between debugger and a running target application.
To enable the target application to open and read files from the computer running TRACE32, you have to compile your application with suitable semihosting code and initiate semihosting in TRACE32 with TERM.GATE command.

What decides which structure a process has in memory?

I've learned that a process has the following structure in memory:
(Image from Operating System Concepts, page 82)
However, it is not clear to me what decides that a process looks like this. I guess processes could (and do?) look different if you have a look at non-standard OS / architectures.
Is this structure decided by the OS? By the compiler of the program? By the computer architecture? A combination of those?
Related and possible duplicate: Why do stacks typically grow downwards?.
On some ISAs (like x86), a downward-growing stack is baked in. (e.g. call decrements SP/ESP/RSP before pushing a return address, and exceptions / interrupts push a return context onto the stack so even if you wrote inefficient code that avoided the call instruction, you can't escape hardware usage of at least the kernel stack, although user-space stacks can do whatever you want.)
On others (like MIPS where there's no implicit stack usage), it's a software convention.
The rest of the layout follows from that: you want as much room as possible for downward stack growth and/or upward heap growth before they collide. (Or allowing you to set larger limits on their growth.)
Depending on the OS and executable file format, the linker may get to choose the layout, like whether text is above or below BSS and read-write data. The OS's program loader must respect where the linker asks for sections to be loaded (at least relative to each other, for executables that support ASLR of their static code/data/BSS). Normally such executables use PC-relative addressing to access static data, so ASLRing the text relative to the data or bss would require runtime fixups (and isn't done).
Or position-dependent executables have all their segments loaded at fixed (virtual) addresses, with only the stack address randomized.
The "heap" isn't normally a real thing, especially in systems with virtual memory so each process can have their own private virtual address space. Normally you have some space reserved for the stack, and everything outside that which isn't already mapped is fair game for malloc (actually its underlying mmap(MAP_ANONYMOUS) system calls) to choose when allocating new pages. But yes even modern glibc's malloc on modern Linux does still use brk() to move the "program break" upward for small allocations, increasing the size of "the heap" the way your diagram shows.
I think this is recommended by some committee and then tools like GCC conform to that recommendation. Binary format defines these segments and operating system and its tools facilitate the process of that format to run on system. lets say ELF is recommended by system V and then adopted by unix; and gcc produce the ELF binaries to be run on unix. so i feel story may start from binary format as it decides about memory mappings(code, data/heap/stack). binary format,among other hacks, defines memory mappings to be mapped for loading programs. As for example ELF defines segments (arrange code in text,data,stack to be loaded in memory), GCC generates that segments of ELF binary while loader loads these segments. operating system also has freedom in adjusting the values of these segments like stack size. These are debatable loud thoughts which I try to consolidate.
That figure represents a a specific implementation or an idealized one. A process does not necessarily have that structure. On many systems a process looks only somewhat similar to what is in the diagram.

Procedure to download code in Flash Memory

I am new to Embedded field. One doubt came up in my mind about hex file downloading:
As output of the linker and locator is a binary file having various sections as .text,.bss,.data etc. and .text resides in Flash, .bss goes to RAM, .data goes to RAM ...
so my question is that
how .bss and .data are written to RAM as i am using FLASH Loader for burning my program onto flash.
Is there any index kind of thing in the final binary which discriminates between .text and .bss segments.
Is there any utility in the linker/locator which converts our simple binary into hex format.
How can I discriminate between .text and .bss from the contents of hex file?
Thanks in advance. Please help.
1.) how .bss and .data are written to RAM as i am using FLASH Loader for burning my program onto flash?
Code, constant data and initialized data are all written to FLASH. At runtime, initialized data is copied to bss during startup. Constant data is usually accessed directly (you declare it with the "const" keyword).
2.) Is there any index kind of thing in the final binary which discriminates between .text and .bss segments?
I think you mean by "binary" the linker output. This is usually referred to as an object file and is different from a binary image. The object file includes all code, data, symbol, debug information and memory addresses. For the GCC tool chain, the linker output is usually an .elf file.
Your linker uses a "link script" or other definition file to locate the various segments at the appropriate memory addresses. Your tool chain should have docs on how to change that.
3.) Is there any utility in the linker/locator which converts our simple binary into hex format?
The "objcopy" utility will read linker output and can write output files in a variety of formats, including Intel-hex. For human readable output see "objdump".
4.) How can I discriminate between .text and .bss from the contents of hex file?
By memory address. GCC uses the "initialized data" segment for data which is copied to the bss. It is located according to your linker script.
Intel-hex format: http://en.wikipedia.org/wiki/Intel_HEX
GCC: http://gcc.gnu.org/onlinedocs/
Your programming tools just write the program image (hex or binary file) into Flash, at the specified address.
When the compiler compiles the program, it adds information (look-up tables) containing information about the .bss and .data etc into the hex file, along with some C start-up code. The C start-up code, which runs before the C main() function begins, is responsible for initialising .bss and .data according to the information in the look-up tables in Flash memory. Typically, that means:
Initialising .bss to zeros.
Copying initialisation values for .data from the Flash memory into the appropriate locations in RAM.
(On some platforms, the initialisation values might even be stored compressed to save space in Flash, and the start-up code decompresses it as it writes it to RAM).
On some platforms, you might want to run code in RAM, for various reasons. The C start-up code can also copy code from Flash to RAM in a similar way.
The C start-up code can also do other essential things:
Initialise stack(s) and initial stack pointer
Initialise heap (if your program uses it)
Set some essential processor configuration, such as processor operating mode, clock speeds, Flash wait states, memory management unit, etc
Typically these issues are solved by the embedded tools you use to manage the target device.
For small embedded devices (less than 16K to 128K of RAM and ROM/NVRAM), code executes directly from NVRAM. The initialization code copies initialized data to RAM, perhaps decompressing it, and initializes uninitialized RAM, usually by clearing it. The locator is responsible for making all data references access the proper target addresses and not the address of ROM data.
Large embedded devices run much faster than NVRAM, so they tend to be implemented where NVRAM content (code plus data) is copied from NVRAM to RAM for execution.
To answer your other points:
2) The final binary is carefully structured to distinguish between program sections. There are many different file formats. See the corresponding Wikipedia article about the format for details.
3) You many not need to convert to hex format, though that depends on your target loader. For example, U-Boot based systems support kermit file transfers of the binary data.
4) See 2).

Compact decompression library for embedded use

We're currently creating a device for a customer that will get a block of data (like, say, 5-10KB) from a PC application. This is a bit simplified, so assume that the data must be passed and uncompressed a lot, not just once a year. The communication channel is really, really slow, so we'd like to compress the data beforehand, pass to the device and let it uncompress the data to its internal flash. The device itself, however, runs on a micro controller that is not really fast and does not have a lot of memory. It has enough flash memory to store the result, and can uncompress the data block as it is received, but it may not have enough RAM to store the entire compressed or uncompressed (or even both!) data blocks. And of course, it doesn't have an operating system or other luxury.
This means we need a sufficiently fast uncompression algorithm that does not use a lot of memory. The compression can be slow and ugly, since we're doing it on the PC side. C or .NET code preferred though for compression, to make things easier. The uncompression code should be in C, since it's unlikely that someone has an ASM optimized version for our controller.
We found LZO, which would be almost perfect for us, but it has a so-called "free" license (GPL) by default, which makes it totally unusable for our customer. The author says that commercial licenses are available on request, but unfortunately he's currently unreachable (for non-technical reasons, like the news on his site say).
I found a few other libraries, including the puff.c from zlib, and we're still investigating, but I thought I'd ask for your experience:
Which compression algorithm and/or library do you recommend for embedded purposes, given that the decompression device has really limited resources and source code and a commercial license are required?
You might want to check out one of these which are not GPL and are fairly compact implementations:
fastlz - MIT license, fairly simple code
lzjb - sun CDDL, used in zfs for compression, simple and very short
liblzf - BSD-style license, small, fast
lzfx - BSD-style, based on liblzf, small, fast
Those algorithms are all based on the original algorithm of Lempel–Ziv–Welch (They have all LZ in common)
https://en.wikipedia.org/wiki/Lempel–Ziv–Welch
I have used LZSS. I used code from Haruhiko Okumura as base. It uses the last portion of uncompressed data(2K) as dictionary. This code can be modified to not require a temporary ring buffer if you have no memory. The licensing is not clear from his site but some versions was released with a "Use, distribute, and modify this program freely" line included and the code is used by commercial vendors.
Here is an implementation based on the same code that forms part of the Allegro game library. Allegro licensing is giftware or zlib.
Another option could be the lzfx lib that implement LZF. I have not used it yet but it seems nice. Also uses previous results so it has low memory requirements and is released under a BSD Licence.
One alternative could be the LZ77 coder/decoder in the Basic Compression Library.
Since it uses the unpacked data history for its dictionary, it uses no extra RAM except for the compressed and uncompressed data buffers. It should be ideal for your use case (zlib license, portable C). The entire decoder is just 70 lines of code (including comments), and really fast.
EDIT: Yet another alternative is the liblzg library, which is a refined version of the aforementioned LZ77 coder/decoder. It compresses better, is generally faster, and requires no memory for decompression. It is very, very free (zlib license).
I would recommend ZLIB.
From the wiki:
The library provides facilities for control of processor and memory use
There are also facilities for conserving memory. These are probably only useful in restricted memory environments such as some embedded systems.
zlib is also used in many embedded devices because the code is portable, liberally-licensed
and has a relatively small memory footprint.
A lot depends on the nature of the data. If it is simple enough, you may not need anything very fancy. For example if the downloaded data was a simple image (for example something like a line graph), a simple run length encoding could cut the data down by a factor of ten and you would need trivial amounts of code and RAM to decode it.
Of course if the data is more complex, then this won't be of much use. But I'd start by exploring the data being sent and see if there are specific aspects which would allow you to compress it more effectively than using a general purpose algorithm.
You might want to check out Jørgen Ibsen's aPlib - a couple of excerpts from the product page:
The compression ratios achieved by aPLib combined with the speed and tiny footprint of the depackers (as low as 169 bytes!) makes it the ideal choice for many products.
aPLib is free to use even for commercial use, please check the included license for details.
The compression library is closed-source (yes, I know this could be a problem), but has precompiled libraries for a variety of compilers and operating systems, including both 32- and 64-bit editions. There's C and x86 assembly source code for the decompressor.
EDIT:
Jørgen also has a free (zlib license) BrifLZ library you could check out if not having compressor source is a big problem.
I've seen people use 7zip on an embedded system with memory in the tens of megabytes.
there is a specific custom version of zlib for Micro-controller based on ARM Cortex-M (M0, M0+, M1, M3, M4)
https://github.com/kuym/Zlib-ARM-Cortex-M

How does one use dynamic recompilation?

It came to my attention some emulators and virtual machines use dynamic recompilation. How do they do that? In C i know how to call a function in ram using typecasting (although i never tried) but how does one read opcodes and generate code for it? Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C? If so how do you find the length of the code? How do you account for system interrupts?
-edit-
system interrupts and how to (re)compile the data is what i am most interested in. Upon more research i heard of one person (no source available) used js, read the machine code, output js source and use eval to 'compile' the js source. Interesting.
It sounds like i MUST have knowledge of the target platform machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code. The more the code is run the more is replaced.
I think you need to do a few things, first decide if you are talking about an emulation or a virtual machine like a vmware or virtualbox. An emulation the processor and hardware is emulated using software, so the next instruction is read by the emulator, the opcode pulled apart by code and you determine what to do with it. I have been doing some 6502 emulation and static binary translation which is dynamic recompilation but pre processed instead of real time. So your emulator may take a LDA #10, load a with immediate, the emulator sees the load A immediate instruction, knows it has to read the next byte which is the immediate the emulator has a variable in the code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags, in this case the Zero flag is clear the N flag is clear C and V are untouched. But what if the next instruction was a load X immediate? No big deal right? Well, the load x will also modify the z and n flags, so the next time you execute the load a instruction you may figure out that you dont have to compute the flags because they will be destroyed, it is dead code in the emulation. You can continue with this kind of thinking, say you see code that copies the x register to the a register then pushes the a register on the stack then copies the y register to the a register and pushes on the stack, you could replace that chunk with simply pushing the x and y registers on the stack. Or you may see a couple of add with carries chained together to perform a 16 bit add and store the result in adjacent memory locations. Basically looking for operations that the processor being emulated couldnt do but is easy to do in the emulation. Static binary translation which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, as in, before you run the code. Instead of emulating you translate the opcodes to C for example and remove as much dead code as you can (a nice feature is the C compiler can remove more dead code for you).
Once the concept of emulation and translation are understood then you can try to do it dynamically, it is certainly not trivial. I would suggest trying to again doing a static translation of a binary to the machine code of the target processor, which a good exercise. I wouldnt attempt dynamic run time optimizations until I had succeeded in performing them statically against a/the binary.
virtualization is a different story, you are talking about running the same processor on the same processor. So x86 on an x86 for example. the beauty here is that using non-old x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, no emulation. You setup traps built into the processor to catch things, so loading values in AX and adding BX, etc these all happen at real time on the processor, when AX wants to read or write memory it depends on your trap mechanism if the addresses are within the virtual machines ram space, no traps, but lets say the program writes to an address which is the virtualized uart, you have the processor trap that then then vmware or whatever decodes that write and emulates it talking to a real serial port. That one instruction though wasnt realtime it took quite a while to execute. What you could do if you chose to is replace that instruction or set of instructions that write a value to the virtualized serial port and maybe have then write to a different address that could be the real serial port or some other location that is not going to cause a fault causing the vm manager to have to emulate the instruction. Or add some code in the virtual memory space that performs a write to the uart without a trap, and have that code instead branch to this uart write routine. The next time you hit that chunk of code it now runs at real time.
Another thing you can do is for example emulate and as you go translate to a virtual intermediate bytcode, like llvm's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explaination of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpteter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .net). The byte code contains enough "high level" structures (high level in terms of higher level than machine code) so that the VM can take chunks out of the byte code and replace it by a compiled memory block. The VM typically decide which part is getting compiled by counting how many times the code was already interpreted, since the compilation itself is a complex and time-consuming process. So it is usefull to only compile the parts which get executed many times.
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM, typically there is a compiler embedded, which is capable to transform the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: none. The code in the VM can't interact with real hardware. The VM interact with the OS, and transfer OS events to the code by jumping/calling specific parts inside the interpreted code. Every event in the code or from the OS must pass the VM.
Also hardware virtualization products can use some kind of JIT. A typical use cases in the X86 world is the translation of 16bit real mode code to 32 or 64bit protected mode code to not to be forced to emulate a CPU in real mode. Also a software-only VM replaces jump instructions in the executing code by jumps into the VM control software, which at each branch the following code path for jump instructions scans and them replace, before it jumps to the real code destination. But I doubt if the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine, that user change some files. Then IIS will recompile asseblies in next steps:
Recompile (all requests handled by old code)
Copies new assemblies (all requests handled by old code)
All new requests will be handled by new code, all requests - by old.
I hope this'd be helpful.
A virtual Machine loads "byte code" or "intermediate language" and not machine code therefore, I suppose, that it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation