I'm trying to add a ORAM module to gem5, it would modify the address from the CPU to Memory. After reading the introduction about how to add a device named HelloDevice to gem5 in ASPLOS 2008 tutorial, I am still confused that if I add a new device to gem5, do I have to use Full System Mode to run tests/test-progs/hello/bin/x86/linux/hello?
tests/test-progs/hello/bin/x86/linux/hello is an userland executable, meant to be ran with se.py.
I think devices are not visible from se.py since it only emulates userland by translating simple instructions, and capturing syscalls, you can't see for example arbitrary hardware registers or physical memory.
Therefore yes, I think you need to use full system emulation with your build.
If you don't know how to use fs.py, give this setup a try.
I have a Xilinx development board connected to a RHEL workstation.
I have U-boot loaded over JTAG and connect to it with minicom.
Then I tftpboot the helloworld standalone application.
Where do these images go?
I understand I am specifying a loadaddr, but I don't fully understand the meaning.
When I run the standalone application, I get various outputs on the serial console.
I had it working correctly the first time, but then started trying different things when building.
It almost feelings like I am clobbering memory, but I assumed after a power cycle anything tftp'd would be lost.
The issue stills occurs through a power cycle though.
Where do these images go?
The U-Boot command syntax is:
tftpboot [loadAddress] [[hostIPaddr:]bootfilename]
You can explicitly specify the memory destination address as the loadAddress parameter.
When the loadAddress parameter is omitted from the command, then the memory destination address defaults to the value of the environment variable loadaddr.
Note that several other U-Boot commands also use this loadaddr variable, such as "bootp", "rarpboot", "loadb" and "diskboot".
I understand I am specifying a loadaddr, but I don't fully understand the meaning.
When I run the standalone application, I get various outputs on the serial console.
The loadAddress is simply the start address in memory to which the file transfered will be written.
For a standalone application, this loadAddress should match the CONFIG_STANDALONE_LOAD_ADDR that was used to link this program.
Likewise the "go" command to execute this standalone application program should use the same CONFIG_STANDALONE_LOAD_ADDR.
For example, assume the physical memory of your board starts at 0x20000000.
To allow the program to use the maximum amount of available memory, the program is configured to start at:
#define CONFIG_STANDALONE_LOAD_ADDR 0x20000000
For convenient loading, define the environment variable (at the U-Boot prompt):
setenv loadaddr 0x20000000
Assuming that the serverip variable is defined with the IP address of the TFTP server, then the U-Boot command
tftpboot hello_world.bin
should retrieve that file from the server, and store it at 0x20000000.
Use
go 20000000
to execute the program.
I assumed after a power cycle anything tftp'd would be lost.
It should.
But what might persist in "volatile" memory after a power cycle is unpredictable. Nor can you be assured of a default value such as all zeros or all ones. The contents of dynamic RAM should always be presumed to be unknown unless you know it has been initialized and has been written.
Is a software image loaded into non-volatile RAM when using tftpboot from U-boot?
Only if your board has main memory that is non-volatile (e.g. ferrite core or battery-backed SRAM, which are not likely).
You can use the "md" (memory display) command to inspect RAM.
I am migrating a code from Linux to Vxworks. The code requires opening physical/main memory and then map the physical to virtual memory using mmap.
In Linux, main memory is accessed by
fd = open("/dev/mem", O_RDONLY);
Can you please let me know how this can be accomplished in Vxworks.
Thanks in advance
It depends on which programming environment your migrated code will be running.
For kernel mode, it is much easier as generally you can access everywhere in the system memory in read-only mode as long as its memory region is mapped in the page table. No special API is needed in your code to access the memory.
For user mode (aka Real Time Process, only available starting from VxWorks 6.0), things are a little bit complicated. You need write a pair of code blocks, with one operating in the kernel mode while the other one in the user mode. Please refer to the comment block in the VxWorks source codes for a code example # vxworks-6.9/target/usr/src/os/mm/devMemLib.c (taking VxWorks 6.9 for example).
This one is a basic question related to u-boot.
Why does the u-boot code relocate itself ?
Ok, it makes sense if u-boot is executing from NOR-flash or boot ROM space but if it runs from SDRAM already why does it have to relocate itself once again ?
This question comes up frequently. Good answers sometimes too.
I agree it is handy to load the build to SDRAM during development. That works for me, I do it all the time. I have some special boot code in flash which does not enable MMU/cache. For my u-boot builds I switch CONFIG_SYS_TEXT_BASE between flash and ram builds. I run my development builds that way routinely.
As a practical matter, handling re-initialization of MMU/cache would be a nontrivial matter. And U-Boot benefits IMO from simplicity, as result of leaving out things like that.
The tech lead at Denx has expressed his opinion. IIRC his other posts are more strongly worded than that one. I get the impression that he does not like to repeat himself.
update: why relocate. Memory access is faster from RAM than from ROM, this will matter particularly if target has no instruction cache. Executing from RAM allows flash reprogramming; also (more minor) it allows software breakpoints with "trap" instructions; also it is more like the target's normal mode of operation, so if e.g. burst reads from RAM are iffy the failure will be seen at early boot.
U-boot has to reserve 3 regions in memory that stores: 1) u-boot itself, 2) uImage (compressed kernel), and 3) uncompressed kernel. These 3 regions must be carefully placed in u-boot to prevent conflict.
However, the previous stage boot-loader, (BL2 or BL1) that brings u-boot into DRAM memory don\t know u-boot's planing on these 3 regions. So it can only loads u-boot onto a lower address in DRAM memory and jump to it. Then, after u-boot execute some basic initialization and detect current PC is not in planed location, u-boot call relocate function that move u-boot to the planned location and jump to it.
The code of NOR flash must initialize the SDRAM, Then the copy code from Nor Flash to SDRAM, The process will copy itself, because you could enable MMU, we will start Virtual address mapping.
It came to my attention some emulators and virtual machines use dynamic recompilation. How do they do that? In C i know how to call a function in ram using typecasting (although i never tried) but how does one read opcodes and generate code for it? Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C? If so how do you find the length of the code? How do you account for system interrupts?
-edit-
system interrupts and how to (re)compile the data is what i am most interested in. Upon more research i heard of one person (no source available) used js, read the machine code, output js source and use eval to 'compile' the js source. Interesting.
It sounds like i MUST have knowledge of the target platform machine code to dynamically recompile
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide open question, not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code. The more the code is run the more is replaced.
I think you need to do a few things, first decide if you are talking about an emulation or a virtual machine like a vmware or virtualbox. An emulation the processor and hardware is emulated using software, so the next instruction is read by the emulator, the opcode pulled apart by code and you determine what to do with it. I have been doing some 6502 emulation and static binary translation which is dynamic recompilation but pre processed instead of real time. So your emulator may take a LDA #10, load a with immediate, the emulator sees the load A immediate instruction, knows it has to read the next byte which is the immediate the emulator has a variable in the code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags, in this case the Zero flag is clear the N flag is clear C and V are untouched. But what if the next instruction was a load X immediate? No big deal right? Well, the load x will also modify the z and n flags, so the next time you execute the load a instruction you may figure out that you dont have to compute the flags because they will be destroyed, it is dead code in the emulation. You can continue with this kind of thinking, say you see code that copies the x register to the a register then pushes the a register on the stack then copies the y register to the a register and pushes on the stack, you could replace that chunk with simply pushing the x and y registers on the stack. Or you may see a couple of add with carries chained together to perform a 16 bit add and store the result in adjacent memory locations. Basically looking for operations that the processor being emulated couldnt do but is easy to do in the emulation. Static binary translation which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, as in, before you run the code. Instead of emulating you translate the opcodes to C for example and remove as much dead code as you can (a nice feature is the C compiler can remove more dead code for you).
Once the concept of emulation and translation are understood then you can try to do it dynamically, it is certainly not trivial. I would suggest trying to again doing a static translation of a binary to the machine code of the target processor, which a good exercise. I wouldnt attempt dynamic run time optimizations until I had succeeded in performing them statically against a/the binary.
virtualization is a different story, you are talking about running the same processor on the same processor. So x86 on an x86 for example. the beauty here is that using non-old x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, no emulation. You setup traps built into the processor to catch things, so loading values in AX and adding BX, etc these all happen at real time on the processor, when AX wants to read or write memory it depends on your trap mechanism if the addresses are within the virtual machines ram space, no traps, but lets say the program writes to an address which is the virtualized uart, you have the processor trap that then then vmware or whatever decodes that write and emulates it talking to a real serial port. That one instruction though wasnt realtime it took quite a while to execute. What you could do if you chose to is replace that instruction or set of instructions that write a value to the virtualized serial port and maybe have then write to a different address that could be the real serial port or some other location that is not going to cause a fault causing the vm manager to have to emulate the instruction. Or add some code in the virtual memory space that performs a write to the uart without a trap, and have that code instead branch to this uart write routine. The next time you hit that chunk of code it now runs at real time.
Another thing you can do is for example emulate and as you go translate to a virtual intermediate bytcode, like llvm's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explaination of how they are doing dynamic recompilation for the 'Rubinius' Ruby interpteter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte code representation (like Java, .net). The byte code contains enough "high level" structures (high level in terms of higher level than machine code) so that the VM can take chunks out of the byte code and replace it by a compiled memory block. The VM typically decide which part is getting compiled by counting how many times the code was already interpreted, since the compilation itself is a complex and time-consuming process. So it is usefull to only compile the parts which get executed many times.
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM, typically there is a compiler embedded, which is capable to transform the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: none. The code in the VM can't interact with real hardware. The VM interact with the OS, and transfer OS events to the code by jumping/calling specific parts inside the interpreted code. Every event in the code or from the OS must pass the VM.
Also hardware virtualization products can use some kind of JIT. A typical use cases in the X86 world is the translation of 16bit real mode code to 32 or 64bit protected mode code to not to be forced to emulate a CPU in real mode. Also a software-only VM replaces jump instructions in the executing code by jumps into the VM control software, which at each branch the following code path for jump instructions scans and them replace, before it jumps to the real code destination. But I doubt if the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies assemblies to some temporary place and runs them from temp.
Imagine, that user change some files. Then IIS will recompile asseblies in next steps:
Recompile (all requests handled by old code)
Copies new assemblies (all requests handled by old code)
All new requests will be handled by new code, all requests - by old.
I hope this'd be helpful.
A virtual Machine loads "byte code" or "intermediate language" and not machine code therefore, I suppose, that it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation