ARM Cortex-M3 Startup Code - embedded

I'm trying to understand how the initialization code works that ships with Keil (realview v4) for the STM32 microcontrollers. Specifically, I'm trying to understand how the stack is initialized.
In the documentation on ARM's website it mentions that one of the routines in startup_xxx.s, __user_initial_stackheap, should not use more than 88 bytes of stack. Do you know where that limitation comes from?
It seems that when the reset handler calls SystemInit it is executing a couple of functions in a C environment, which I believe means it is using some form of temporary stack (it allocates a few automatic variables). However, all of those stacked items should be out of scope once it returns and then calls __main, which is where __user_initial_stackheap is called from.
So why is there this requirement for __user_initial_stack_heap to not use more than 88 bytes? Does the rest of __main use a ton of stack or something?
Any explanation of the cortex-m3 stack architecture as it relates to the startup sequence would be fantastic.

You will see from the __user_initial_stackheap() documentation that the function is for legacy support and has been superseded by __user_setup_stackheap(); the documentation for the latter provides a clue regarding your question:
Unlike __user_initial_stackheap(), __user_setup_stackheap() works with systems where the application starts with a value of sp (r13) that is already correct, for example, Cortex-M3
[..]
Using __user_setup_stackheap() rather than __user_initial_stackheap() improves code size because there is no requirement for a temporary stack.
On Cortex-M the sp is initialised on reset by the hardware from a value stored in the vector table; on older ARM7 and ARM9 devices this is not the case, and it is necessary to set the stack pointer in software. The start-up code needs a small temporary stack for use before the user-defined stack is available - for example, if the user stack is in external memory it cannot be used until the memory controller has been initialised. The 88-byte restriction is imposed simply because this temporary stack is sized to be as small as possible, since it is probably unused after start-up.
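For reference, word 0 of the Cortex-M vector table is the initial MSP value and word 1 is the reset vector, so the core has a valid sp before the first instruction executes. A minimal sketch in C (GCC-flavoured; Keil's startup_stm32xxx.s builds the same table in assembler, and the symbol names here are illustrative):
#include <stdint.h>
extern uint32_t _estack;                /* top of stack, defined by the linker script */
void Reset_Handler(void);
void Default_Handler(void);
__attribute__((section(".isr_vector")))
void (* const vector_table[])(void) = {
    (void (*)(void))&_estack,           /* word 0: initial MSP, loaded by hardware before any code runs */
    Reset_Handler,                      /* word 1: reset vector */
    Default_Handler,                    /* NMI */
    Default_Handler,                    /* HardFault */
    /* ... remaining exception and interrupt vectors ... */
};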
In your case on the STM32 (a Cortex-M device), it is likely that there is in fact no such restriction, but you should perhaps update your start-up code to use the newer function to be certain. That said, given the required behaviour of this function and the fact that its results are returned in registers, I would suggest that 88 bytes would be rather extravagant if you needed that much! Moreover, you only need to reimplement it if you are using a scatter-loading file as described.

Interpreting Cortex M4 hard fault error debug info

Sorry - this is long! I'm sure I'll get some TL;DRs. :/
I'm not at all new to the world of Cortex M3/4; I have encountered plenty of hard fault errors in the past and, without exception, they have been due to stack overflow on FreeRTOS. However, in this case, I'm really struggling to track down a hard fault on a system that has someone else's old code that I have slightly modified.
I have a system with an LCD and touch screen. We have new hardware, which is almost identical to the old hardware apart from changing from an LPC1788 to the drop-in-equivalent LPC4088, and the touch screen being I2C rather than SPI.
I'm using Keil uVision (which is new to me) with an NXP LPC4088, which is an M4 core, and the Keil RL-ARM RTOS (also new to me), with a C/C++ hybrid code base; C++ is also not something I have much experience with. On top of this there is Segger emWin (which I've never used), closed-source code in which it always seems to be crashing. It will render a few screens, read the touch screen buttons etc. and then fall over. Sometimes it falls over immediately, though.
I have followed this:
http://www.keil.com/appnotes/files/apnt209.pdf
I have attached a picture of the debugger/IDE when it crashes below.
When it crashes, the highlighted green task in the OS is, without exception, ApplicationTask (which I have not modified).
If I am reading the info correctly, the Keil uVision debugger tells me that the stack being used was the MSP, which is at address 0x20003238. There's a memory dump below:
If I understand correctly, this means that R0, 2, 3 and 12 are 0, and the program counter is at 0, as are LR and PSR. However, this goes against what's in the Call Stack + Locals window in the first picture. If I right-click on the 0x00005A50 underneath ApplicationTask:4 and choose caller code, it tells me it is
BL.W GUI_ALLOC_UnlockH
Which is in the emWin binary blob I think.
However, if I look at 0x20001B60 (which is the PSP value) as below:
That seems to tally up much better with what the Call Stack + Locals window tells me. It also seems to tell me that it's crashing in emWin, and extensive Googling shows that Segger always completely wash their hands of any possibility that their closed-source code could be at fault. To be fair, it's unlikely, as it's been working OK until I modified the code to use an I2C touch screen interface rather than SPI. However, where it's crashing (or seems to be) is nothing to do with the code I have modified.
Also, this window below:
Gives the BFAR address as 0xF00B4DDA and the memory manager fault address as 0xF00B4DDA. I don't know whether I should be interpreting this as the issue.
I have found a few other posts around the web, including one staggeringly similar one to this here on Stack Overflow (but all have no solution associated with them) where people have the same issue.
So, my questions are:
Am I reading this data correctly and understanding the Keil document I linked to? I really feel I must be missing something with this MSP/PSP issue.
Am I using the caller code function in uVision correctly? I right-click on the address below ApplicationTask:4 in Call Stack + Locals, and it always seems to take me to some Segger code that I can't examine and that surely isn't what's at fault.
Should I really be reading the issue as a bus fault address with it trying to read from or write to 0xF00B4DDA which is reserved space?
I tried implementing a piece of code such as this:
https://blog.frankvh.com/2011/12/07/cortex-m3-m4-hard-fault-handler/
But that just stops the whole system running properly and ends up at a BKPT instruction in some init code. On top of this, I am not convinced this kind of thing would tell me any more than uVision does, other than showing me slightly faster and with zero effort. Am I right in this latter assumption?
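For reference, the kind of handler that article describes boils down to something like this (my sketch of it, not the article's exact code; on the M3/M4, bit 2 of EXC_RETURN in LR selects MSP vs PSP, which is exactly the distinction I'm asking about):
#include <stdint.h>
volatile uint32_t fault_regs[8];            /* r0-r3, r12, lr, pc, psr at the fault */
void hard_fault_c(uint32_t *frame)          /* frame = the stack that was in use */
{
    for (int i = 0; i < 8; i++)
        fault_regs[i] = frame[i];           /* copy the stacked r0-r3, r12, lr, pc, psr */
    for (;;) { }                            /* park here so the debugger can inspect */
}
/* GCC/Clang syntax; armcc would use an embedded __asm function instead. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst lr, #4        \n"              /* EXC_RETURN bit 2: 0 = MSP, 1 = PSP */
        "ite eq            \n"
        "mrseq r0, msp     \n"              /* fault occurred on the main stack */
        "mrsne r0, psp     \n"              /* fault occurred on the process (task) stack */
        "b hard_fault_c    \n");
}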

Simulating multiple instances of an embedded processor

I'm working on a project which will entail multiple devices, each with an embedded (ARM) processor, communicating. One development approach which I have found useful in the past with projects that only entailed a single embedded processor was to develop the code using Visual Studio, divided into three portions:
Main application code (in unmanaged C/C++ [see note])
I/O-simulating code (C/C++) that runs under Visual Studio
Embedded I/O code (C), which Visual Studio is instructed not to build, runs on the target system. Previously this code was for the PIC; for most future projects I'm migrating to the ARM.
Feeding the embedded compiler/linker the code from parts 1 and 3 yields a hex file that can run on the target system. Running parts 1 and 2 together yields code which can run on the PC, with the benefit of better debugging tools and more precise control over I/O behavior (e.g. I can make the simulation code introduce certain types of random hiccups more easily than I can induce controlled hiccups on real hardware).
Target code is written in C, but the simulation environment uses C++ so as to simulate I/O registers. For example, I have a PortArray data structure; the header file for the embedded compiler includes a line like unsigned char LATA @ 0xF89; and my header file for simulation includes #define LATA _IOBIT(f89,1) which in turn invokes a macro that accesses a suitable property of an I/O object, so a statement like LATA |= 4; will read the simulated latch, "or" the read value with 4, and write the new value. To make this work, the target code has to compile under C++ as well as under C, but this mostly isn't a problem. The biggest annoyance is probably with enum types (which behave as integers in C, but have to be coaxed to do so in C++).
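A condensed sketch of that dual-header arrangement (file and symbol names here are illustrative; the simulation side is shown as a plain C array standing in for the C++ I/O object, so individual accesses are not intercepted):
#ifdef SIMULATION
  /* PC build: the register name maps onto simulator-owned storage. */
  extern volatile unsigned char sim_sfr[0x1000];
  #define LATA (sim_sfr[0xF89])
#else
  /* Target build: PIC-style absolute placement of the real latch register. */
  volatile unsigned char LATA @ 0xF89;
#endif
/* Application code, identical in both builds: */
void set_status_bit(void)
{
    LATA |= 4;    /* read the latch, set bit 2, write it back */
}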
Previously, I've used two approaches to making the simulation interactive:
Compile and link a DLL with target-application and simulation code, and have VB code in the same project which interacts with it.
Compile the target-application code and some simulation code to an EXE with one instance of Visual Studio, and use a second instance of Visual Studio for the simulation UI. Have the two programs communicate via TCP, so nearly all "real" I/O logic is in the simulation program. For example, the aforementioned `LATA |= 4;` would send a "read port 0xF89" command to the TCP port, get the response, process the received value, and send a "write port 0xF89" command with the result.
I've found the latter approach to run a tiny bit slower than the former in some cases, but it seems much more convenient for debugging, since I can suspend execution of the unmanaged simulation code while the simulation UI remains responsive. Indeed, for simulating a single target device at a time, I think the latter approach works extremely well. My question is how I should best go about simulating a plurality of target devices (e.g. 16 of them).
The difficulty I have is figuring out how to make each simulated instance get its own set of global variables. If I were to compile to an EXE and run one instance of the EXE for each simulated target device, that would work, but I don't know any practical way to maintain debugger support while doing that. Another approach would be to arrange the target code so that everything would compile as one module joined together via #include. For simulation purposes, everything could then be wrapped into a single C++ class, with global variables turning into class-instance variables. That would be a bit more object-oriented, but I really don't like the idea of forcing all the application code to live in one compiled and linked module.
What would perhaps be ideal would be if the code could load multiple instances of the DLL, each with its own set of global variables. I have no idea how to do that, however, nor do I know how to make things interact with the debugger. I don't think it's really necessary that all simulated target devices actually execute code simultaneously; it would be perfectly acceptable for simulation instances to use cooperative multitasking. If there were some way of finding out what range of memory holds the global variables, it might be possible to have the 'task-switch' method swap out all of the global variables used by the previously-running instance and swap in the contents applicable to the instance being switched in. Although I'd know how to do that in an embedded context, I'd have no idea how to do it on the PC.
Edit
My questions would be:
Is there any nicer way to allow simulation logic to be paused and examined in VS2010 debugger, while keeping a responsive UI for the simulator front-end, than running the simulator front end and the simulator logic in separate instances of VS2010, if the simulation logic must be written in C and the simulation front end in managed code? For example, is there a way to tell the debugger that when a breakpoint is hit, some or all other threads should be allowed to keep running while the thread that had hit the breakpoint sits paused?
If the bulk of the simulation logic must be source-code compatible with an embedded system written in C (so that the same source files can be compiled and run for simulation purposes under VS2010, and then compiled by the embedded-systems compiler for use in real hardware), is there any way to have the VS2010 debugger interact with multiple simulated instances of the embedded device? Assume performance is not likely to be an issue, but the number of instances will be large enough that creating a separate project for each instance would likely be annoying in the absence of any way to automate the process. I can think of three somewhat-workable approaches, but don't know how to make any of them work really nicely. There's also an approach which would be better if it's possible, but I don't know how to make it work.
Wrap all the simulation code within a single C++ class, such that what would be global variables in the target system become class members. I'm leaning toward this approach, but it would seem to require everything to be compiled as a single module, which would annoyingly affect the design of the target system code. Is there any nice way to have code access class instance members as though they were globals, without requiring all functions using such instances to be members of the same module?
Compile a separate DLL for each simulated instance (so that e.g. if I want to run up to 16 instances, I would include 16 DLLs in the project, all sharing the same source files). This could work, but every change to the project configuration would have to be repeated 16 times. Really ugly.
Compile the simulation logic to an EXE, and run an appropriate number of instances of that EXE. This could work, but I don't know of any convenient way to do things like set a breakpoint common to all instances. Is it possible to have multiple running instances of an EXE attached to a single debugger instance?
Load multiple instances of a DLL in such a way that each instance gets its own global variables, while still being accessible in the debugger. This would be nicest if it were possible, but I don't know any way to do so. Is it possible? How? I've never used AppDomains, but my intuition would suggest that might be useful here.
If I use one VS2010 instance for the front-end, and another for the simulation logic, is there any way to arrange things so that starting code in one will automatically launch the code in the other?
I'm not particularly committed to any single simulation approach; while it might be nice to know if there's some way of slightly improving the above, I'd also like to know of any other alternative approaches that could work even better.
I would think that you'd still have to run 16 copies of your main application code, but that your TCP-based I/O simulator could keep a different set of registers/state for each TCP connection that comes in.
Instead of a bunch of global variables, put them into a single structure that encompasses the I/O state of a single device. Either spawn off a new thread for each socket, or just keep a list of active sockets and dedicate a single instance of the state structure for each socket.
The simulators I have seen that handle multiple instances of the instruction set/processor are designed that way: there is usually a structure that contains a complete set of registers, and a pointer to (or an array of) these structures is used to create multiple instances of the processor.
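A minimal sketch of that per-instance state idea, assuming the TCP arrangement from the question (all names here are hypothetical):
#include <stdint.h>
#define NUM_DEVICES 16
typedef struct {
    int      socket_fd;          /* TCP connection from one simulated target */
    uint8_t  sfr[0x1000];        /* that instance's simulated register space */
    uint32_t tick_count;         /* any other per-instance state lives here too */
} sim_device_t;
static sim_device_t devices[NUM_DEVICES];
/* "read port" / "write port" commands arriving on connection i touch only
   devices[i], so every simulated processor keeps its own registers. */
static uint8_t sim_read(int i, uint16_t addr) { return devices[i].sfr[addr]; }
static void sim_write(int i, uint16_t addr, uint8_t v) { devices[i].sfr[addr] = v; }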

Why don't compilers generate microinstructions rather than assembly code?

I would like to know why, in the real world, compilers produce Assembly code, rather than microinstructions.
If you're already bound to one architecture, why not go one step further and free the processor from having to turn assembly code into microinstructions at runtime?
I think perhaps there's an implementation bottleneck somewhere, but I haven't found anything on Google.
EDIT: by microinstructions I mean: if your assembly instruction is ADD(R1,R2), the microinstructions would be: load R1 into the ALU, load R2 into the ALU, execute the operation, load the result back into R1. Another way to see this is to equate one microinstruction to one clock cycle.
I was under the impression that microinstruction was the 'official' name. Apparently there's some mileage variation here.
Compilers don't produce micro-instructions because processors don't execute micro-instructions. They are an implementation detail of the chip, not something exposed outside the chip. There's no way to provide micro-instructions to a chip.
Because an x86 CPU doesn't execute micro-operations, it executes opcodes. You cannot create a binary image that contains micro-operations, since there is no way to encode them in a way that the CPU understands.
What you are suggesting is basically a new RISC-style instruction set for x86 CPUs. The reason that isn't happening is because it would break compatibility with the vast amount of applications and operating systems written for the x86 instruction set.
The answer is quite easy.
(Some) compilers do indeed generate code sequences like load r1, load r2, add r2 to r1. But these are precisely the machine-code instructions (what you are calling microinstructions). These instructions are the one and only interface between the outside world and the innards of a processor.
(Other compilers generate just C and let a C back end like gcc take care of the dirty details.)
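To make that concrete, a trivial C function compiles to exactly the documented machine instructions; the micro-operations that implement them inside the core never appear in the object file (the assembly in the comment is illustrative, and the exact output depends on compiler and target):
int add2(int a)
{
    return a + 2;
}
/* Typical ARM output (illustrative):
 *     ADDS r0, r0, #2
 *     BX   lr
 * There is no way to emit the individual register-to-ALU micro-steps;
 * the ISA instruction is the lowest level the toolchain can express. */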

Using open source SNES emulator code to turn a rom file into a self-contained executable game

Would it be possible to take the source code from a SNES emulator (or any other game system emulator for that matter) and a game ROM for the system, and somehow create a single self-contained executable that lets you play that particular ROM without needing either the individual rom or the emulator itself to play? Would it be difficult, assuming you've already got the rom and the emulator source code to work with?
It shouldn't be too difficult if you have the emulator source code. You can use a method that is often used to store images in C source files.
Basically, what you need to do is create a char * variable in a header file, and store the contents of the rom file in that variable. You may want to write a script to automate this for you.
Then, you will need to alter the source code so that instead of reading the rom in from a file, it uses the in memory version of the rom, stored in your variable and included from your header file.
It may require a little bit of work if you need to emulate file pointers and such, or you may be lucky and find that the rom loading function just loads the whole file in at once. In this case it would probably be as simple as replacing the file load function with a function to return your pointer.
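A minimal sketch of that approach (the array name and contents are made up; a small script would generate the real table from the ROM file):
/* rom_image.h - generated from the ROM file by a script */
#include <stddef.h>
static const unsigned char rom_image[] = {
    0x78, 0x18, 0xFB,    /* ...the rest of the ROM bytes... */
};
static const size_t rom_image_size = sizeof(rom_image);
/* In the emulator, the file-loading call is replaced with something like: */
static const unsigned char *rom_load(size_t *size)
{
    *size = rom_image_size;
    return rom_image;
}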
However, be careful about licensing issues. If the emulator is licensed under the GPL, you may not be legally allowed to store a proprietary file in the executable, so it would be worth checking that, especially before you release/distribute it (if you plan to do so).
Yes, more than possible; it has been done many times. Google: static binary translation. Graham Toal has a good howto paper on the subject; it should show up early in the hits. I may have left some code out there as well.
Completely removing the rom may be a bit more work than you think, but not using an emulator is definitely possible. Actually, both requirements are achievable, and you may be surprised how many handheld console games or set-top-box games are translated rather than emulated, especially on platforms like those from Nintendo where there isn't enough processing power to emulate in real time.
You need a good emulator as a reference and/or to write your own emulator as a reference. Then you need to write a disassembler, and have that disassembler generate C code (please don't try to translate directly to another target - I made that mistake once; C is portable and the compilers will take care of a lot of dead-code elimination for you). So an instruction of a make-believe instruction set might be:
add r0,r0,#2
And that may translate into:
//add r0,r0,#2
r0=r0+2;
do_zflag(r0);
do_nflag(r0);
It looks like the SNES is related to the 6502, which is what Asteroids used, and that is the translation I have been working on, off and on, for a while now as a hobby. The emulator you are using is probably written and tuned for runtime performance and may be difficult at best to use as a reference and to check in lock step with the translated code. The 6502 is nice because, compared to say the z80, there really are not that many instructions.
As with any variable-word-length instruction set, the disassembler is your first big hurdle. Do not think linearly; think execution order, think like an emulator. You cannot linearly translate instructions from zero to N or N down to zero. You have to follow all the possible execution paths, marking bytes in the rom as either being the first byte of an instruction or not. Some bytes you can decode as data and, if you choose, mark those; otherwise assume all other bytes are data or fill.
Figuring out what to do with this data is the real problem with getting rid of the rom. Some code addresses data directly, other code uses register-indirect addressing, meaning at translation time you have no idea where that data is or how much of it there is. Once you have marked all the starting bytes for instructions, it is a trivial task to walk the rom from zero to N, disassembling and/or translating.
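A compilable sketch of that marking pass (the opcode-table helpers are assumed to come from the 6502 documentation and are only declared here; the names are mine):
#include <stdint.h>
#define ROM_SIZE 0x8000u
extern const uint8_t rom[ROM_SIZE];        /* the game image */
static uint8_t is_insn[ROM_SIZE];          /* 1 = first byte of an instruction */
extern unsigned insn_length(uint8_t opcode);               /* 1, 2 or 3 bytes */
extern int is_branch(uint8_t opcode);                      /* Bxx, JSR, JMP */
extern int ends_path(uint8_t opcode);                      /* JMP, RTS, RTI, BRK */
extern uint16_t branch_target(uint16_t pc, const uint8_t *bytes);
static void trace(uint16_t pc)
{
    while (pc < ROM_SIZE && !is_insn[pc]) {       /* stop when we rejoin known code */
        uint8_t op = rom[pc];
        is_insn[pc] = 1;
        if (is_branch(op))
            trace(branch_target(pc, &rom[pc]));   /* follow the taken path */
        if (ends_path(op))
            return;                               /* no fall-through from here */
        pc += insn_length(op);                    /* continue on the fall-through path */
    }
}
/* Seed trace() with the reset and interrupt vectors; anything never marked
   is treated as data or fill. */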
Good luck, enjoy, it is well worth the experience.

What exactly does GCC -fobjc-direct-dispatch option do?

The GCC manual says:
-fobjc-direct-dispatch
Allow fast jumps to the message dispatcher. On Darwin this is accomplished via the comm page.
Can I assume this flag eliminates dynamic dispatch? How does it work?
I believe it should be as fast as a C function call if it is linked directly.
No, the dynamic dispatch is still there (calls still route through objc_msgSend), and this option currently makes no difference on x86(-64).
From http://developer.apple.com/legacy/mac/library/documentation/DeveloperTools/gcc-3.3/gcc/Objective_002dC-Dialect-Options.html:
For some functions (such as objc_msgSend) called very frequently by Objective-C programs, special entry points exist in high memory that may be jumped to directly (e.g., via the "bla" instruction on the PowerPC) for improved performance. The -fobjc-direct-dispatch option will cause such jumps to be generated. This option is only available in conjunction with the NeXT runtime; furthermore, programs built with the -fobjc-direct-dispatch option will only run on Mac OS X 10.4 (Tiger) or later systems.
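As an illustration of the point above: a message send is still lowered to a call through objc_msgSend, and -fobjc-direct-dispatch only changes how that call reaches the dispatcher on PowerPC Darwin. A rough sketch using the C runtime API (the selector name is made up):
#include <objc/message.h>
#include <objc/runtime.h>
static id send_doSomething(id receiver)
{
    /* roughly what [receiver doSomething] lowers to with the NeXT runtime */
    SEL sel = sel_registerName("doSomething");
    return ((id (*)(id, SEL))objc_msgSend)(receiver, sel);
}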