What are pseudo-instructions for in gem5?

So, I was looking at how some simulations in gem5 are implemented; more specifically, I was having a look at PIMSim (https://github.com/vineodd/PIMSim). I saw they had implemented some pseudo-instructions for the x86 architecture, and I have seen these pseudo-instructions are only used in full system mode. For that they have modified the following files:
include/gem5/m5ops.h
util/m5/m5op_x86.S
src/arch/x86/isa/decoder/two_byte_opcodes.isa
src/sim/pseudo_inst.hh(cc)
I have understood what changes are necessary to implement a custom pseudo-instruction, but what I do not understand is what they are for and how they are used. I cannot find any place outside these files where these functions are called. Any help? Thanks in advance!

Pseudo ops are a way to trigger magic simulation operations from inside the guest; this type of technique is more generally known as guest instrumentation.
They can be used/implemented either as:
magic instructions placed in unused encoding space of the real ISA
I think this is always enabled, except in KVM where the host CPU takes over and just crashes if those unknown instructions are seen.
access to a magic memory address. This is configured/enabled from the Python configs; System.py contains:
m5ops_base = Param.Addr(
    0xffff0000 if buildEnv['TARGET_ISA'] == 'x86' else 0,
    "Base of the 64KiB PA range used for memory-mapped m5ops. Set to 0 "
    "to disable.")
ARM semihosting: some custom semihosting operations were wired up to m5ops recently. It is worth noting that there is some overlap between what some m5ops and some of the standardized semihosting operations can achieve, like quitting the simulator.
Some of the most common m5ops are:
m5 exit: quit simulator
m5 checkpoint: take a checkpoint
m5 dumpstats: dump stats
m5 resetstats: zero out the stats and restart counting for the next m5 dumpstats
m5 readfile: read the contents of the host file passed to fs.py --script, very useful for running different workloads after a Linux boot checkpoint
m5ops are useful because it is often hard to determine when you want to do the above operations in other ways, e.g.: do something when Linux finishes booting. To do that naively from the simulator, you'd need to know in advance at what tick it happens. You could mess around with checking if the PC matches some address (already done e.g. for Linux panic checking), but that is a bit harder.
There is also the in-tree m5 tool that you can cross compile and place in your full system guest to expose the magic instructions through an executable CLI interface.
But you can also just hard code them in your binaries to get more precise results if needed, e.g. hardcoding them on x86 as:
#define LKMC_M5OPS_CHECKPOINT __asm__ __volatile__ (".word 0x040F; .word 0x0043;" : : "D" (0), "S" (0) :)
#define LKMC_M5OPS_DUMPSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0041;" : : "D" (0), "S" (0) :)
More hardcoded examples at: https://github.com/cirosantilli/linux-kernel-module-cheat/blob/4f82f79be7b0717c12924f4c9b7c4f46f8f18e2f/lkmc/m5ops.h Or you can use them more nicely (and more laboriously) from the mainline tree as shown at: How to use m5 in gem5-20
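As an illustration, here is a minimal sketch of wrapping a region of interest with such hardcoded m5ops. LKMC_M5OPS_DUMPSTATS is copied from the snippet above; LKMC_M5OPS_RESETSTATS assumes 0x40 is the resetstats function number (verify against include/gem5/m5ops.h in your tree), and region_of_interest() is just a placeholder for your workload:

/* x86 only; other ISAs use different encodings. */
#define LKMC_M5OPS_RESETSTATS __asm__ __volatile__ (".word 0x040F; .word 0x0040;" : : "D" (0), "S" (0) :)
#define LKMC_M5OPS_DUMPSTATS  __asm__ __volatile__ (".word 0x040F; .word 0x0041;" : : "D" (0), "S" (0) :)

static void region_of_interest(void)
{
    /* the code you actually want statistics for */
}

int main(void)
{
    /* boot / warm-up work runs before this point and is excluded from the stats */
    LKMC_M5OPS_RESETSTATS;    /* zero the counters right before the region of interest */
    region_of_interest();
    LKMC_M5OPS_DUMPSTATS;     /* emit a stats section covering only that region */
    return 0;
}

This way the stats section that gets dumped covers only the code between the two magic instructions, not the warm-up.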
Some more info can also be found at: https://cirosantilli.com/linux-kernel-module-cheat/#m5ops

Related

How to code ARM interrupt functions in C

I am using arm-none-eabi-gcc toolchain, v 4.8.2, on LinuxMint 17.2 64b.
I am, at hobbyist level, trying to play with a TM4C123G board and its usual features (coding various blinkies, uart things...) but always trying to remain as close to the metal as possible without using other libraries (eg CMSIS...) whenever possible. Also no IDE (CCS, Keil...), just Linux terminal windows, the board and I... All that mostly for education purpose.
The issue : I am stuck trying to implement the usual interrupt functions like :
EnableInt (clearing bit 0, the I bit, of the special register PRIMASK):
CPSIE I
WaitForInt :
WFI
DisableInt :
CPSID I
Eg, I added this function to my .c file for EnableInt :
void EnableInt(void)
{
    __asm(" cpsie i\n");   /* clear PRIMASK, globally enabling interrupts */
}
... this compiles but the execution does not seem to work properly (in the simplest blinky.c version, I cannot get any LED action once I have called EnableInt() in the C code). The blinky.c code can be found here.
What would be the proper way to write these interrupt routines in a .c file (ideally without using other libraries, but just setting/clearing bits of the appropriate registers...)?
EDIT : removed the bx lr instructions - but EnableInt() does not seem to work any better - still looking for a solution.
EDIT2 : Actually the function EnableInt(), defined as above, is now working. My SysTick_Handler was mapped incorrectly to the Interrupt Vector table in the startup file (while my original problem was the bx lr instructions which I removed in Edit1).
The ARM Cortex-M4 CPU which your Tiva MCU incorporates basically does not require the software environment to take special action on entry to/exit from an interrupt handler. The only requirement is to use the AAPCS calling standard, which should be the default with gcc when compiling for this CPU.
The CPU is supported by some tightly coupled "core" peripherals provided by ARM. These are standard for most (if not all) Cortex-M3/4 MCUs. MCU vendors can configure some features, but the basic operation is always the same.
To simplify software development, ARM has introduced the CMSIS software standard. This at least consists of some header files which unify access to the core peripherals and the use of special CPU instructions. Among those are intrinsics to manipulate the special CPU registers like PRIMASK, BASEPRI, FAULTMASK, etc. Another header provides definitions of the core peripherals and functions to manipulate some of them where a simple access is not sufficient.
So, one of these peripherals supports the CPU for interrupt handling: the NVIC (nested vectored interrupt controller). This prioritises interrupts against each other and provides the interrupt vector to the CPU, which uses this vector to fetch the address of the interrupt handler.
The NVIC also includes enable bits for all interrupt sources. So, to have an interrupt processed by the CPU, for a typical MCU you have to enable the interrupt in two or three locations (see the sketch after this list):
PRIMASK/BASEPRI in the CPU: the last line of defense. These are the global interrupt gates. PRIMASK is similar to the interrupt-enable bit in the status register of smaller CPUs; BASEPRI is part of interrupt-priority resolution (just ignore it for the beginning).
NVIC interrupt-enable bit for each peripheral interrupt source, e.g. timer, UART, SPI, etc. Many peripherals have multiple internal sources tied to one NVIC line (e.g. UART rx and tx interrupts).
The interrupt-enable bits in the peripheral itself, e.g. UART rx interrupt, tx interrupt, rx-error interrupt, etc.
Some peripherals might not have internal bits, so the last one might be missing.
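A minimal sketch of those enable levels in C, without CMSIS, using raw register addresses. The addresses and bit/IRQ numbers below are example values for a TM4C123 UART0; check the datasheet for your part before using them (with CMSIS you would use NVIC_EnableIRQ() and __enable_irq() instead):

#include <stdint.h>

#define UART0_IM   (*(volatile uint32_t *)0x4000C038u)  /* UART0 interrupt mask register (UARTIM)   */
#define NVIC_EN0   (*(volatile uint32_t *)0xE000E100u)  /* NVIC ISER0: set-enable for IRQ lines 0..31 */
#define UART0_IRQ  5u                                   /* NVIC interrupt line number for UART0      */

void uart_interrupts_on(void)
{
    UART0_IM |= (1u << 4);           /* 1) peripheral: unmask the RX interrupt bit        */
    NVIC_EN0 |= (1u << UART0_IRQ);   /* 2) NVIC: enable this peripheral's interrupt line  */
    __asm(" cpsie i\n");             /* 3) CPU: clear PRIMASK, as in EnableInt() above    */
}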
To get things working, you should read the reference manual (family guide, or similar); then there is often some "programming the Cortex-M4" howto (e.g. ST has one for the STM32 series). You should also get the documents from ARM (they are available for free download).
Finally you need the CMSIS headers from your MCU vendor (TI here). These should be tailored for your MCU. You might have to provide some #defines.
And, yes, this is quite some stuff to read. But imo it is worth the effort. Alternatively you might start with a book. There are a few out there that might help you get the whole picture first (which is really hard to get from the individual documents alone - yet possible).

u-boot : Relocation

This one is a basic question related to u-boot.
Why does the u-boot code relocate itself ?
Ok, it makes sense if u-boot is executing from NOR flash or boot ROM space, but if it already runs from SDRAM why does it have to relocate itself once again?
This question comes up frequently. Good answers sometimes too.
I agree it is handy to load the build to SDRAM during development. That works for me, I do it all the time. I have some special boot code in flash which does not enable MMU/cache. For my u-boot builds I switch CONFIG_SYS_TEXT_BASE between flash and ram builds. I run my development builds that way routinely.
As a practical matter, handling re-initialization of MMU/cache would be a nontrivial matter. And U-Boot benefits IMO from simplicity, as result of leaving out things like that.
The tech lead at Denx has expressed his opinion. IIRC his other posts are more strongly worded than that one. I get the impression that he does not like to repeat himself.
update: why relocate. Memory access is faster from RAM than from ROM, this will matter particularly if target has no instruction cache. Executing from RAM allows flash reprogramming; also (more minor) it allows software breakpoints with "trap" instructions; also it is more like the target's normal mode of operation, so if e.g. burst reads from RAM are iffy the failure will be seen at early boot.
U-Boot has to reserve 3 regions in memory that store: 1) U-Boot itself, 2) uImage (the compressed kernel), and 3) the uncompressed kernel. These 3 regions must be carefully placed in U-Boot to prevent conflicts.
However, the previous-stage boot loader (BL2 or BL1) that brings U-Boot into DRAM does not know U-Boot's plan for these 3 regions. So it can only load U-Boot at a lower address in DRAM and jump to it. Then, after U-Boot executes some basic initialization and detects that the current PC is not at the planned location, U-Boot calls the relocate function, which moves U-Boot to the planned location and jumps to it.
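A rough C-flavoured sketch of that idea (hypothetical names, not U-Boot's actual relocation code, which additionally patches relocation entries and flushes caches):

#include <stdint.h>
#include <string.h>

extern char __image_start[], __image_end[];   /* symbols provided by the linker script */

void relocate_and_jump(uintptr_t planned_base, void (*continue_at)(void))
{
    uintptr_t running_base = (uintptr_t)__image_start;

    if (running_base != planned_base) {
        /* we are not running where we planned to: copy the whole image there */
        size_t size = (size_t)(__image_end - __image_start);
        memcpy((void *)planned_base, (void *)running_base, size);

        /* point continue_at at the same function inside the relocated copy */
        continue_at = (void (*)(void))((uintptr_t)continue_at - running_base + planned_base);
    }
    continue_at();   /* from here on we run at the planned address */
}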
The code in NOR flash must initialize the SDRAM, then copy the code from NOR flash to SDRAM - the process copies itself. Also, because you may enable the MMU, virtual address mapping will then start.

How does machine code communicate with processor?

Let's take Python as an example. If I am not mistaken, when you program in it, the computer first "translates" the code to C. Then again, from C to assembly. Assembly is written in machine code. (This is just a vague idea that I have about this so correct me if I am wrong) But what's machine code written in, or, more exactly, how does the processor process its instructions, how does it "find out" what to do?
If I am not mistaken, when you program in it, the computer first "translates" the code to C.
No it doesn't. C is nothing special except that it's the most widespread programming language used for system programming.
The Python interpreter translates the Python code into so-called P-code that is executed by a virtual machine. This virtual machine is the actual interpreter: it reads P-code, and every blip of P-code makes the interpreter execute a predefined code path. This is not unlike how native binary machine code controls a CPU. A more modern approach is to translate the P-code into native machine code.
The CPython interpreter itself is written in C and has been compiled into a native binary. Basically a native binary is just a long series of numbers (opcodes) where each number designates a certain operation. Some opcodes tell the machine that a defined count of numbers following it are not opcodes but parameters.
The CPU itself contains a so-called instruction decoder, which reads the native binary number by number, and for each opcode it reads it gives power to the circuit of the CPU that implements this particular opcode. There are opcodes that address memory, opcodes that load data from memory into registers, and so on.
how does the processor process its instructions, how does it "find out" what to do?
For every opcode, which is just a binary pattern, there is its own circuit on the CPU. If the pattern of the opcode matches the "switch" that enables this opcode, its circuit processes it.
Here's a WikiBook about it:
http://en.wikibooks.org/wiki/Microprocessor_Design
A few years ago some guy built a whole, working computer from simple logic and memory ICs, i.e. no microcontroller or similar involved. The whole project, called "Big Mess o' Wires", was more or less a CPU built from scratch. The only thing nerdier would have been building the thing from single transistors (which actually wouldn't have been that much more difficult). He also provides a simulator which lets you see how the CPU works internally, decoding each instruction and executing it: Big Mess o' Wires Simulator
EDIT: Since I originally wrote this answer, building a fully fledged CPU from modern, discrete components has been done. For your consideration: a MOS 6502 (the CPU that powered the Apple II, Commodore 64, NES, BBC Micro and many more) built from discretes: https://monster6502.com/
Machine-code does not "communicate with the processor".
Rather, the processor "knows how to evaluate" machine code. In the [widespread] von Neumann architecture this machine code (program) can be thought of as an indexable array where each cell contains a machine-code instruction (or data, but let's ignore that for now).
The CPU "looks" at the current instruction (often identified by the PC or Program Counter) and decides what to do (this can either be done directly with transistors/"bare-metal", or it be translated to even lower-level code): this is known as the fetch-decode-execute cycle.
When instructions are executed, side effects occur, such as setting a control flag, putting a value in a register, or jumping to a different index (changing the PC) in the program. See this simple overview of a CPU, which covers the above a little better.
It is the evaluation of each instruction -- as it is encountered -- and the interaction of side-effects that results in the operation of a traditional processor.
(Of course, modern CPUs are very complex and do lots of neat tricky things!)
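To make the fetch-decode-execute cycle concrete, here is a toy interpreter in C for an invented three-instruction machine (not any real ISA):

#include <stdio.h>
#include <stdint.h>

enum { OP_LOAD_IMM = 0, OP_ADD = 1, OP_HALT = 2 };

int main(void)
{
    /* "program memory": load 2 into r0, load 3 into r1, add, halt */
    uint8_t program[] = { OP_LOAD_IMM, 0, 2, OP_LOAD_IMM, 1, 3, OP_ADD, OP_HALT };
    uint8_t reg[2] = { 0, 0 };
    size_t pc = 0;                           /* program counter */

    for (;;) {
        uint8_t opcode = program[pc++];      /* fetch */
        switch (opcode) {                    /* decode */
        case OP_LOAD_IMM: {                  /* execute: reg[r] = imm */
            uint8_t r = program[pc++];
            reg[r] = program[pc++];
            break;
        }
        case OP_ADD:                         /* execute: r0 += r1 */
            reg[0] += reg[1];
            break;
        case OP_HALT:
            printf("r0 = %u\n", (unsigned)reg[0]);   /* prints 5 */
            return 0;
        }
    }
}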
That's called microcode. It's the code in the CPU that reads machine code instructions and translates them into low-level data flow.
When the CPU encounters, for example, the add instruction, the microcode describes how it should get the two values, feed them to the ALU to do the calculation, and where to put the result.
Electricity. Circuits, memory, and logic gates.
Also, I believe Python is usually interpreted, not compiled through C → assembly → machine code.

Beagleboard bare metal programming

I just got my BeagleBoard-xM and I'm wondering if there are any detailed step-by-step tutorials on how to get a very simple bare metal program running on the hardware?
The reason I ask is that I want to deeply understand how the hardware architecture works, everything from the bootloader, linkers, interrupts, exceptions, the MMU, etc. I figured the best way is to get a simple hello world program to execute on the BeagleBoard-xM without an OS. Nothing advanced, just start up the board and get a "hello world" output on the screen, that's it!
The next step would be getting a tiny OS to run that can schedule some very simple tasks. No filesystem needed, just to understand the basics of an OS.
Absolutely no problem...
First off, get the serial port up and running. I have one of the older/earlier BeagleBoards and remember the serial port and just about everything about the I/O being painful; nevertheless, get a serial port on it so you can see it boot.
It boots U-Boot, I think, and you can press a key or Esc or something like that to interrupt the normal boot (into Linux). From the U-Boot prompt it is easy to load your first simple programs.
I have some BeagleBoard code handy at the moment but don't have my BeagleBoard itself handy to try it on. So go to http://sam7stuff.blogspot.com/ to get an idea of how to mix some startup assembler and C code for OS-less embedded programs (for ARM; I have a number of examples out there for other Thumb/Cortex-M3 platforms, but those boot a little differently).
The sam7 ports for things and its memory address space are totally different from the BeagleBoard/OMAP. The above is a framework that you can change or re-invent.
You will need the OMAP35x technical reference manual from ti.com. Search for the OMAP part on their site: OMAP3530.
Also the beagleboard documentation. For example this statement:
A single RS232 port is provided on the BeagleBoard and provides access to the TX and
RX lines of UART3
So searching the TRM for the OMAP for UART3 shows that it is at a base address of 0x49020000. (Often it is very difficult to figure out the entire address for something, as the manuals usually have part of the memory map here and another part there, and near the register descriptions only the lower few bits of the address are called out.)
Looking at the UART registers, THR_REG is where you write bytes to be sent out of the UART; note that it is a 16-bit register.
Knowing this we can make the first program:
.globl _start
_start:
    ldr  r0,=0x49020000   @ UART3 base address, THR_REG at offset 0
    mov  r1,#0x55         @ 0x55 is ASCII 'U'
    strh r1,[r0]
    strh r1,[r0]
    strh r1,[r0]
    strh r1,[r0]
    strh r1,[r0]
hang: b hang              @ spin forever
Here is a makefile for it:
ARMGNU = arm-none-linux-gnueabi

AOPS = --warn --fatal-warnings
COPS = -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding

uarttest.bin : uarttest.s
	$(ARMGNU)-as $(AOPS) uarttest.s -o uarttest.o
	$(ARMGNU)-ld -T rammap uarttest.o -o uarttest.elf
	$(ARMGNU)-objdump -D uarttest.elf > uarttest.list
	$(ARMGNU)-objcopy uarttest.elf -O srec uarttest.srec
	$(ARMGNU)-objcopy uarttest.elf -O binary uarttest.bin
And the linker script that is used:
/* rammap */
MEMORY
{
    ram : ORIGIN = 0x80300000, LENGTH = 0x10000
}

SECTIONS
{
    .text : { *(.text*) } > ram
}
Note that the Linux version of the CodeSourcery toolchain is called out; you do not need that particular gnu cross compiler. In fact, this code being asm only, you only need an assembler and linker (binutils stuff). An arm-none-eabi-... type cross compiler will work as well (assuming you get the Lite tools from CodeSourcery).
Once you have a .bin file, look at the help in U-Boot. I don't remember the exact command, but it is probably an l 0x80300000 or load_xmodem or some such thing. Basically you want to x-, y-, or z-modem the .bin file over the serial port into memory space for the processor, then, using go or whatever the command is, tell U-Boot to branch to your program.
You should see a handful of U characters (0x55 is 'U') come out the serial port when run.
Your main goal up front is to get a simple serial port routine up so you can print stuff out to debug and otherwise see what your programs are doing. Later you can get into graphics, etc. but first use the serial port.
There was some cheating going on: since U-Boot came up and initialized the serial port, we didn't have to; we just shove bytes into the THR. But quickly you will overflow the THR's storage and lose bytes, so you then need to read the TRM for the OMAP and find some sort of bit that indicates the transmitter is empty (it has transmitted everything), then create a uart_send type function that polls for transmitter-empty and then sends the one byte out.
Also, forget about printf(); you need to create your own print-a-number (octal or hex are the easiest) and perhaps print-string routines. I do this sort of work all day and all night, and 99% of the time all I use is a small routine that prints 32-bit hex numbers out the UART. From the numbers I can debug and see the status of the programs.
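For reference, a sketch of such a polled uart_send plus a print-hex helper, assuming the OMAP UART is 16550-compatible with the line status register at offset 0x14 and the TX-empty bit at bit 5 - check the OMAP35x TRM before trusting these offsets:

#include <stdint.h>

#define UART3_BASE 0x49020000u
#define UART_THR   (*(volatile uint16_t *)(UART3_BASE + 0x00))   /* transmit holding register */
#define UART_LSR   (*(volatile uint16_t *)(UART3_BASE + 0x14))   /* line status register      */

static void uart_send(uint8_t c)
{
    while ((UART_LSR & (1u << 5)) == 0)   /* wait until the transmitter can take a byte */
        ;
    UART_THR = c;
}

/* minimal "print a 32-bit hex number" debug helper built on uart_send */
static void print_hex32(uint32_t x)
{
    for (int i = 28; i >= 0; i -= 4)
        uart_send("0123456789ABCDEF"[(x >> i) & 0xF]);
    uart_send('\r');
    uart_send('\n');
}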
So take the sam7 model or something like it (note the compiler and linker command line options are important, as is the order of files on the link command line: the first file has to be your entry point if you want the first instruction/word in the .bin file to be your entry point, which is usually a good idea, as down the road you will want to know how to control this for booting from a ROM).
You can probably do quite a bit without removing or replacing U-Boot. If you start to look at the Linux-based boot commands for U-Boot, you will see that it is copying what is pretty much a .bin file from flash or somewhere into a spot in RAM, then branching to it. Now, branching to Linux, especially ARM Linux, involves some ARM tables and possibly setting up some registers, which your programs won't want or need. Basically, whatever command you figure out to use after you have copied your program to RAM is what you will script in a boot script for U-Boot, should you choose to have the board boot and run your program like it does with Linux.
That said, you can use JTAG and not rely on U-Boot at all. When you go that path, though, there are likely a certain number of things you have to do on boot to get the chip up and running; in particular, configuring the UART likely involves a few clock dividers somewhere, clock enables, I/O enables, various things like that. That is why the sam7 example starts with a blink-the-LED thing instead of a UART thing. The Amontec JTAG-Tiny is a good JTAG wiggler; I have been quite pleased and use these all day long every day at work. The BeagleBoard probably uses a TI pinout and not the standard ARM pinout, so you will likely need to change the cabling. And I don't know if the OMAP gives you direct access to the ARM TAP controller or if you have to do something TI-specific. You are better off just going the U-Boot route for the time being.
Once you have a framework where you have a small amount of asm to set up the stack and branch to your entry-point C code, you can start to turn that C code into an OS or do whatever you want. If you look at ChibiOS or Prex or others like them, you will find they have small asm boot code that gets them into their system. Likewise there are UART debug and non-debug routines in there. Many RTOSes are going to want to use interrupts and not poll for the THR to be empty.
If this post doesn't get you up and running with your hello world (letting you do some of the work), let me know and I will dig out my BeagleBoard and create a complete example. My board doesn't exactly match yours, but as far as hello world goes it should be close enough.
You could also try TI StarterWare:
http://www.ti.com/tool/starterware-sitara

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research I heard of one person (no source available) who used JS: read the machine code, output JS source and use eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform's machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide-open question; not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The non-native code being emulated or virtualized is replaced with native code; the more the code is run, the more is replaced.
I think you need to do a few things. First decide if you are talking about emulation or a virtual machine like VMware or VirtualBox. In an emulation the processor and hardware are emulated using software, so the next instruction is read by the emulator, the opcode is pulled apart by code, and you determine what to do with it.

I have been doing some 6502 emulation and static binary translation, which is like dynamic recompilation but pre-processed instead of real time. So your emulator may take an LDA #10, load A immediate: the emulator sees the load-A-immediate instruction and knows it has to read the next byte, which is the immediate value; the emulator has a variable in the code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags; in this case the Z flag is clear, the N flag is clear, C and V are untouched. But what if the next instruction is a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute the flags because they will be destroyed anyway - it is dead code in the emulation.

You can continue with this kind of thinking: say you see code that copies the X register to the A register, then pushes the A register on the stack, then copies the Y register to the A register and pushes it on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carries chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you are looking for operations that the processor being emulated couldn't do directly but that are easy to do in the emulation.

Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, i.e. before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
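To make that concrete, here is a toy C sketch of the LDA/LDX immediate case just described - a minimal 6502-flavoured interpreter, not a complete or accurate emulator:

#include <stdint.h>

struct cpu { uint8_t a, x, flag_z, flag_n; uint16_t pc; };

static void step(struct cpu *c, const uint8_t *mem)
{
    uint8_t opcode = mem[c->pc++];
    switch (opcode) {
    case 0xA9:                          /* LDA #imm */
        c->a = mem[c->pc++];
        c->flag_z = (c->a == 0);        /* flag work a translator could drop...       */
        c->flag_n = (c->a >> 7) & 1;    /* ...if the next LDX recomputes it anyway    */
        break;
    case 0xA2:                          /* LDX #imm */
        c->x = mem[c->pc++];
        c->flag_z = (c->x == 0);
        c->flag_n = (c->x >> 7) & 1;
        break;
    /* ... every other opcode gets its own case ... */
    }
}

int main(void)
{
    uint8_t mem[] = { 0xA9, 10, 0xA2, 20 };   /* LDA #10 ; LDX #20 */
    struct cpu c = { 0 };
    step(&c, mem);    /* the flag updates done here are already dead... */
    step(&c, mem);    /* ...because this LDX overwrites Z and N         */
    return 0;
}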
Once the concepts of emulation and translation are understood, then you can try to do it dynamically; it is certainly not trivial. I would suggest first doing a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against a/the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, so x86 on an x86 for example. The beauty here is that, using non-ancient x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, no emulation. You set up traps built into the processor to catch things, so loading values in AX and adding BX, etc., all happen in real time on the processor. When AX wants to read or write memory, it depends on your trap mechanism: if the addresses are within the virtual machine's RAM space, there are no traps; but say the program writes to an address which is the virtualized UART. You have the processor trap that, then VMware or whatever decodes the write and emulates it by talking to a real serial port. That one instruction, though, wasn't real time; it took quite a while to execute.

What you could do, if you chose to, is replace that instruction or set of instructions that write a value to the virtualized serial port, and maybe have them write to a different address that could be the real serial port or some other location that is not going to cause a fault forcing the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have that code instead branch to this UART write routine. The next time you hit that chunk of code it now runs in real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte-code representation (like Java, .NET). The byte code contains enough "high level" structures (high level in the sense of higher level than machine code) so that the VM can take chunks out of the byte code and replace them with a compiled memory block. The VM typically decides which parts get compiled by counting how many times the code was already interpreted, since the compilation itself is a complex and time-consuming process. So it is useful to only compile the parts which get executed many times.
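A sketch of that counting policy in C; compile_to_native() and interpret() are placeholders standing in for a real compiler and interpreter:

#include <stddef.h>

typedef void (*native_fn)(void);

/* Placeholders: a real JIT would generate machine code into executable memory. */
static native_fn compile_to_native(const unsigned char *bc, size_t len) { (void)bc; (void)len; return NULL; }
static void interpret(const unsigned char *bc, size_t len)              { (void)bc; (void)len; }

#define HOT_THRESHOLD 1000   /* "interpreted this often" means worth compiling */

struct method {
    const unsigned char *bytecode;
    size_t length;
    unsigned call_count;
    native_fn native;        /* NULL until the method has been compiled */
};

static void call_method(struct method *m)
{
    if (m->native) {                                            /* hot path: already compiled */
        m->native();
        return;
    }
    if (++m->call_count >= HOT_THRESHOLD)                       /* count interpretations...   */
        m->native = compile_to_native(m->bytecode, m->length);  /* ...and compile once hot    */
    interpret(m->bytecode, m->length);                          /* still interpret this call  */
}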
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM; typically there is a compiler embedded which is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: you don't. The code in the VM can't interact with real hardware. The VM interacts with the OS and transfers OS events to the code by jumping to/calling specific parts inside the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is the translation of 16-bit real mode code into 32- or 64-bit protected mode code, so as not to be forced to emulate a CPU in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before it jumps to the real code destination. But I doubt whether the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies the assemblies to some temporary place and runs them from there.
Imagine that a user changes some files. Then IIS will recompile the assemblies in the following steps:
Recompile (all requests handled by the old code)
Copy the new assemblies (all requests handled by the old code)
All new requests are handled by the new code, all in-progress requests by the old.
I hope this is helpful.
A virtual machine loads "byte code" or "intermediate language" and not machine code; therefore, I suppose, it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation