We are designing an ASIC based on an ARM CPU and multiple other hardware engines. These engines are controlled through memory-mapped registers accessible from the CPU's AHB port. Currently we have simple tests written in C that are compiled, loaded and executed by the CPU RTL. Everything works fine, but, as expected, the CPU RTL slows the simulation down considerably, which makes it difficult to run more complex C code to exercise the rest of the design more thoroughly. Since we are not trying to verify the CPU RTL, I was wondering if a different approach would be possible.
We have most of the firmware already written in C, which is used for FPGA prototyping. The firmware accesses the memory mapped registers using the macros MemRead and MemWrite, which can be easily substituted by a function call or something similar. The approach would be:
Remove the CPU RTL from the design
Create a shim, invoked by the MemRead and MemWrite C calls, that generates AHB requests on the design's AHB slave port left disconnected by the CPU removal
Compile the C based firmware, RTL design (minus the CPU) and testbench with the simulator (Cadence irun, in our case)
Run the simulation
Is this even possible? Has anyone tried this approach? Any suggestions on how to come up with the "shim" between the C code firmware and the Verilog testbench and design?
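For illustration, here is roughly what I picture for the C side of the shim, assuming a SystemVerilog DPI-C bridge (irun supports DPI). All the sv_ahb_* names are hypothetical, and the exact C prototypes would follow from the SV export declarations; the SV side would implement them as tasks driving an AHB bus-functional model on the now-free slave port.

#include <stdint.h>

/* Hypothetical: provided by the SV testbench, e.g. via export "DPI-C" tasks. */
extern void sv_ahb_read(uint32_t addr, uint32_t *data);
extern void sv_ahb_write(uint32_t addr, uint32_t data);

/* Firmware keeps calling MemRead/MemWrite; only their implementation changes. */
uint32_t MemRead(uint32_t addr)
{
    uint32_t data;
    sv_ahb_read(addr, &data);    /* blocks until the AHB transfer completes */
    return data;
}

void MemWrite(uint32_t addr, uint32_t data)
{
    sv_ahb_write(addr, data);
}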
Thank you,
Marcus.
I am using the arm-none-eabi-gcc toolchain, v4.8.2, on Linux Mint 17.2 64-bit.
I am, at hobbyist level, trying to play with a TM4C123G board and its usual features (coding various blinkies, UART things...), always trying to remain as close to the metal as possible without using other libraries (e.g. CMSIS...) whenever possible. Also no IDE (CCS, Keil...), just Linux terminal windows, the board and I... All that mostly for educational purposes.
The issue: I am stuck trying to implement the usual interrupt functions like:
EnableInt (clearing bit 0, the I bit, of the special register PRIMASK):
CPSIE I
WaitForInt:
WFI
DisableInt:
CPSID I
E.g., I added this function to my .c file for EnableInt:
void EnableInt(void)
{
    __asm(" cpsie i\n");
}
... this compiles but the execution does not seem to work properly (in the simplest blinky.c version, I cannot get any LED action once I have called EnableInt() in the C code). The blinky.c code can be found here.
What would be the proper way to write these interrupt routines in a .c file (ideally without using other libraries, but just setting/clearing bits of the appropriate registers...)?
EDIT : removed the bx lr instructions - but EnableInt() does not seem to work any better - still looking for a solution.
EDIT2: Actually the function EnableInt(), defined as above, is now working. My SysTick_Handler was mapped incorrectly in the interrupt vector table in the startup file (while my original problem was the bx lr instructions, which I removed in EDIT1).
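(For reference, in case someone hits the same thing: a minimal sketch of the start of a GCC startup-file vector table, with SysTick_Handler in its required slot - exception #15, offset 0x3C. All names here are illustrative, not taken from my actual startup file.)

extern unsigned long _stack_top;          /* provided by the linker script */
void Reset_Handler(void);
void SysTick_Handler(void);
void Default_Handler(void);

__attribute__((section(".isr_vector")))
void (* const vector_table[])(void) = {
    (void (*)(void)) &_stack_top,   /*  0: initial stack pointer */
    Reset_Handler,                  /*  1: reset                 */
    Default_Handler,                /*  2: NMI                   */
    Default_Handler,                /*  3: HardFault             */
    Default_Handler,                /*  4: MemManage             */
    Default_Handler,                /*  5: BusFault              */
    Default_Handler,                /*  6: UsageFault            */
    0, 0, 0, 0,                     /*  7-10: reserved           */
    Default_Handler,                /* 11: SVCall                */
    Default_Handler,                /* 12: Debug monitor         */
    0,                              /* 13: reserved              */
    Default_Handler,                /* 14: PendSV                */
    SysTick_Handler,                /* 15: SysTick               */
    /* ... device-specific IRQ handlers follow ... */
};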
The ARM Cortex-M4 CPU which your Tiva MCU incorporates basically does not require the software environment to take special action on entry to or exit from an interrupt handler. The only requirement is to use the AAPCS calling standard, which should be the default with gcc when compiling for this CPU.
The CPU is supported by some tightly coupled "core" peripherals provided by ARM. These are standard for most (if not all) Cortex-M3/4 MCUs. MCU vendors can configure some features, but the basic operation is always the same.
To simplify software development, ARM has introduced the CMSIS software standard. At minimum this consists of some header files which unify access to the core peripherals and the use of special CPU instructions. Among those are intrinsics to manipulate the special CPU registers like PRIMASK, BASEPRI, FAULTMASK and CONTROL. Another header provides definitions of the core peripherals and functions to manipulate those of them for which a simple register access is not sufficient.
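For instance, a minimal sketch of the functions from the question using plain GCC inline assembly (no libraries); CMSIS provides the ready-made __enable_irq(), __disable_irq() and __WFI() intrinsics for exactly this:

/* The "memory" clobber keeps the compiler from moving memory accesses
 * across the enable/disable boundary. */
static inline void EnableInt(void)  { __asm volatile ("cpsie i" ::: "memory"); }
static inline void DisableInt(void) { __asm volatile ("cpsid i" ::: "memory"); }
static inline void WaitForInt(void) { __asm volatile ("wfi"); }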
So, one of these peripherals supports the CPU in interrupt handling: the NVIC (Nested Vectored Interrupt Controller). It prioritises interrupts against each other and provides the interrupt vector to the CPU, which uses this vector to fetch the address of the interrupt handler.
The NVIC also includes enable bits for all interrupt sources. So, to have an interrupt processed by the CPU, on a typical MCU you have to enable the interrupt in two or three places (a short sketch follows the list below):
PRIMASK/BASEPRI in the CPU: the last line of defense. These are the global interrupt gates. PRIMASK is similar to the interrupt-enable bit in the status register of smaller CPUs; BASEPRI is part of interrupt-priority resolution (just ignore it for the beginning).
The NVIC interrupt-enable bit for each peripheral interrupt source, e.g. timer, UART, SPI, etc. Many peripherals have multiple internal sources tied to one NVIC line (e.g. the UART rx and tx interrupts).
The interrupt-enable bits in the peripheral itself, e.g. the UART rx interrupt, tx interrupt, rx-error interrupt, etc.
Some peripherals might not have internal bits, so the last one might be missing.
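As a concrete sketch of the three levels, using CMSIS (NVIC_EnableIRQ() and __enable_irq() are standard CMSIS calls; the device header name, the UART0->IM register bit and UART0_IRQn are device-specific assumptions for a TM4C part):

#include "TM4C123GH6PM.h"            /* assumed CMSIS device header from TI */

void EnableUartRxInterrupt(void)
{
    UART0->IM |= (1u << 4);          /* 1. peripheral level: unmask the RX interrupt  */
    NVIC_EnableIRQ(UART0_IRQn);      /* 2. NVIC level: enable the UART0 source        */
    __enable_irq();                  /* 3. CPU level: clear PRIMASK (the global gate) */
}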
To get things working, you should read the reference manual (family guide, or similar); there is also often some "Programming the Cortex-M4" howto (e.g. ST has one for the STM32 series). You should also get the documents from ARM (they are available for free download).
Finally you need the CMSIS headers from your MCU vendor (TI here). These should be tailored for your MCU. You might have to provide some #defines.
And, yes, this is quite some stuff to read. But imo it is worth the effort. Alternatively, you might start with a book. There are some out there which might help you get the whole picture first (that is really hard to get from the individual documents - yet possible).
The BeagleBone Black processor includes two independent Programmable Real-time Units (PRUs). Hobbyists and professionals are excited about the possible use of these units for real-time applications, which is understandable. However, if you can have an RTOS (whether for the BeagleBone or the Raspberry Pi), why would you need the PRUs?
EDIT:
For information, the BBB has an ARM Cortex-A8 running at 1 GHz, with 1.9 DMIPS/MHz. The PRUs are simple RISC cores running at 200 MHz.
Linux, even with the real-time scheduler, is unsuited to many critical hard real-time tasks with response requirements at the microsecond level; on the other hand, it provides or enables a great deal of functionality in terms of UI, connectivity and filesystem support. These things are either not available in an RTOS or are provided at significant cost in a high-end RTOS, and with much more limited hardware support.
So if you have a system that has hard real-time constraints but also needs more general-purpose computing features such as networking, filesystems, and connection to commercial-off-the-shelf (COTS) peripherals, then the PRUs provide a solution to that.
On the other hand I can't help but think that this is a marketing exercise on the part of TI to sell more chips. A similar solution has always been possible (and indeed common) using one or more processors to perform time critical tasks, possibly running an RTOS, while UI and connectivity are handled by a single processor with the necessary hardware and memory resources but without the real-time constraints. The PRU device does have two 32 bit cores, but XMOS xCORE devices have as many as 16 cores - with 16 communicating cores, you may not even need an RTOS.
To answer the question...
[...] if you can have a RTOS [...], why would you need the PRUs?
... directly: you probably wouldn't need them in that case, but you would lose Linux - and your application may need that. It is just one of many solutions to real-time applications using Linux. You pays your money, and takes your choice.
Most likely the processor on the BeagleBone or Raspberry Pi is too "heavy" for real time - after all, you could run an RTOS on your PC, but it would not be entirely deterministic, even though it is faster than your typical microcontroller (I guess these PRUs are some sort of microcontroller with a fancy new name). On such a high-level application processor as found on these boards, you rarely have direct access to hardware or interrupts, which are essential for real-time applications that actually do something time-critical.
I am working on the U-Boot bootloader. I have some basic questions about the functionality of a bootloader and the application it is going to handle:
Q1: As far as I know, a bootloader is used to download the application into memory. On the internet I also found that the bootloader copies the application to RAM and then the application runs from RAM. I am confused about how the bootloader works: when the application is provided to the bootloader over serial or TFTP, what happens next - does the bootloader copy it to RAM first, or does it write it directly to flash?
Q2: Why is there a need for the bootloader to copy the application to RAM and then run the application from RAM? What difficulties would we face if our application ran from flash?
Q3: What is the meaning of the statement "my application is running from RAM/FLASH"? Does it mean that our application's .text (code) segment is in RAM/flash? And are we not concerned about the .bss section because it is designed to be in RAM?
Thanks
Phogat
When any hardware system is designed, the designer must consider where the executable code will be located. The answer depends on the microcontroller, the included memory types, and the system requirements. So the answer varies from system to system. Some systems execute code located in RAM. Other systems execute code located in flash. You didn't tell us enough about your system to know what it is designed to do.
A system might be designed to execute code from RAM because RAM access times are faster than flash so code can execute faster. A system might be designed to execute code from flash because flash is plentiful and RAM may not be. A system might be designed to execute code from flash so that it boots more quickly. These are just some examples and there are other considerations as well.
RAM is volatile so it does not retain code through a power cycle. If the system executes code located in RAM then a bootloader is required to obtain and write the code to RAM at powerup. Flash is non-volatile so execution can start right away at powerup and a bootloader is not necessary (but may still be useful).
Regarding Q3, the answer is yes. If the system runs from RAM then .text will be located in RAM (but not until after the bootloader has copied it there). If the system runs from flash then the .text section will be located in flash. The .bss section holds variables and will be in RAM regardless of where the .text section is.
Yes, in general a bootloader boots the system, but it might also provide a mechanism for interrupting the default boot path and allow alternate firmware to be downloaded and run instead, as well as other features (like flashing).
Traditional ROM had a traditional RAM-like interface: address, data, chip select, read/write, etc. You can still buy ROM that way, but it is cheaper, from a pin/real-estate perspective, to use something SPI- or I2C-based, which is slower - not desirable to run from, but tolerable to read once and then run from RAM. Newer flash technologies can have problems with read-disturb: if your code is in a tight loop reading the same instructions, or for any other reason the flash is being read too often, the charge can drop such that a read returns the wrong data, potentially causing the program to change course or crash. Also, your PC and other Linux platforms are used to copying the kernel from non-volatile storage (hard disk) to RAM and then running it from there, so copying from flash to RAM and running from RAM has a comfort level, and is often faster than flash. So there are many potential reasons not to use flash, but depending on the system it may be possible to run from flash just fine (in some systems the flash in question is not directly accessible and not executable; of course SOME ROM in that system needs to be executable/bootable).
It simplifies the coding challenges if you program the flash with something that is already in RAM. You can create and debug, one time, the code that reads from RAM and writes to flash and that reads from flash and writes to RAM. Done. Now you can work on separate code that moves data from serial to RAM, or from RAM to serial. Done. Then work on code that does the same over Ethernet or USB or whatever. Done. You don't have to invent a protocol or solve timing problems. Flash writing is very slow, and even XMODEM at a moderate speed can be way too fast, so you have to buffer that data in RAM anyway; you might as well make the tasks completely separate. Instead of an XMODEM (or any other serial-based) flash loader with a big RAM-based FIFO, just move the data to RAM, then separately go from RAM to flash. Same for other interfaces.

It is technically possible to buffer the data and give the illusion of going straight from the download interface to flash, and depending on the protocol it is technically possible to hold off the sender so that as little as one flash page of RAM is required before programming the flash. With the older parallel flashes you could do something pretty cool which I don't think most people figured out: when you stop writing to a flash page for some known period of time, the flash automatically starts to program that page, and you have to wait 10 ms or so before it is done. Folks assumed you had to program sequential addresses and had to get the new data for the next address within that time window, which would demand high serial-port speeds, etc.; the reality is you can pound the same address over and over again with the same data and the flash won't start to program the page, so the download interface can be infinitely slow. Serial flashes work differently and either don't need tricks or have different tricks.
"Running from RAM/FLASH" is not an industry term. It likely means that .text is in ROM (flash) and that .data and .bss are in RAM. A copy of the initial state of .data will probably be in flash as well and copied to RAM before main() is called; likewise .bss will be zeroed before main() is called. Look at crt0.S for most platforms in the GNU sources (glibc, or is it gcc, I don't know) to get the gist of how the bootstrap works in a generic fashion.
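In C, the gist of that pre-main() bootstrap step looks roughly like this (a minimal sketch; the symbol names _etext, _sdata, _edata, _sbss and _ebss are assumptions that would come from the linker script):

extern unsigned int _etext, _sdata, _edata, _sbss, _ebss;

void copy_data_and_clear_bss(void)
{
    unsigned int *src = &_etext;          /* initial .data image stored in flash */
    unsigned int *dst = &_sdata;          /* runtime location of .data in RAM    */

    while (dst < &_edata)                 /* copy .data from flash to RAM        */
        *dst++ = *src++;

    for (dst = &_sbss; dst < &_ebss; dst++)   /* zero .bss                       */
        *dst = 0;

    /* ...then call main() */
}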
A bootloader is not required to run Linux or other operating systems; you don't NEED U-Boot, but it is quite useful. Linux is pretty easy: you copy the kernel and root file system into memory, set some registers or some tags in memory (or both), then branch to the entry point of the kernel, and Linux takes over from there. Because Linux is so complicated, it is desirable to have a sophisticated bootloader that can capitalize on high-speed interfaces like Ethernet (rather than being limited to serial or slower interfaces).
I would add something regarding your question Q2.
Q2: Why is there a need for the bootloader to copy the application to RAM and then run the application from RAM? What difficulties would we face if our application ran from flash?
It is not only about having SPI or similar serial external code memory (which is not that common anyway).
Even an external ROM/FLASH/EPROM connected to the usual high-speed parallel bus will prevent a system from running at its maximum clock (with zero wait states), even on relatively slow MCUs, because of the external memory access time. You would need a 10 ns flash access time for a 100 MHz clock, which is not easy to get (if economically possible at all). And you would agree that 100 MHz is not such a brain-spinning speed any more :-)
That is why many MCU/CPU architectures resort to tricks such as reading multiple instructions at once, having an internal cache, or doing whatever is needed to compensate for slow external code memory. Only the older, mostly 8-bit, architectures can execute code directly from flash memory ('in place') without such measures.
Even if your only code memory is the internal flash, something needs to be done to speed it up. Take a look, for example, at this article:
http://www.iqmagazineonline.com/magazine/pdf/v_3_2_pdf/pg14-15-18-19-9Q6Phillips-Z.pdf
It describes how the Philips ARM7 parts incorporated something called the MAM (Memory Accelerator Module). It is a good read, and you will find there some measures to speed up code-memory access for that specific ARM7 architecture (which goes for most others too):
Limit maximum clock frequency (from 80 MHz to about 20 MHz for the example in the article)
Insert wait-cycles during flash accesses
Use an instruction cache
Copy the program code from flash to RAM
Obviously, if the instruction cache is not an option (too small, or the clock too high), you are really left only with execution from RAM, after relocating the code there at start-up.
There is also the option to run only specific sections of the code from RAM, which can be specified to the linker. For DSP (Digital Signal Processing) systems, there was really no option to run from EPROM/FLASH even in the old days, with clocks of only a few tens of MHz, let alone now.
Another issue is debugging: the options for debugging code placed in ROM, or even flash, are very limited (on most systems you have to move a section of the code to RAM to be able to set a breakpoint).
Regarding Q2, one of the difficulties you may face when executing from flash is performing a code update. If you are executing from the same block of flash you are trying to update, the system will crash. This depends on your system architecture (how your application and bootloader are organized in flash), but it may be particularly hard to avoid if you are trying to update the bootloader itself.
Let's take Python as an example. If I am not mistaken, when you program in it, the computer first "translates" the code to C. Then again, from C to assembly. And assembly is written in machine code. (This is just a vague idea I have about this, so correct me if I am wrong.) But what is machine code written in? Or, more exactly, how does the processor process its instructions - how does it "find out" what to do?
If I am not mistaken, when you program in it, the computer first "translates" the code to C.
No it doesn't. C is nothing special except that it's the most widespread programming language used for system programming.
The Python interpreter translates the Python code into so-called P-code (bytecode), which is executed by a virtual machine. This virtual machine is the actual interpreter: it reads P-code, and every blip of P-code makes the interpreter execute a predefined code path. This is not unlike how native binary machine code controls a CPU. A more modern approach is to translate the P-code into native machine code (just-in-time compilation).
The CPython interpreter itself is written in C and has been compiled into a native binary. Basically a native binary is just a long series of numbers (opcodes) where each number designates a certain operation. Some opcodes tell the machine that a defined count of numbers following it are not opcodes but parameters.
The CPU itself contains a so-called instruction decoder, which reads the native binary number by number, and for each opcode it reads it gives power to the circuitry of the CPU that implements that particular opcode. There are opcodes that address memory, opcodes that load data from memory into registers, and so on.
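As a software analogy, here is a toy fetch-decode-execute loop for a made-up three-instruction machine (not any real instruction set); the switch plays the role of the instruction decoder, routing each opcode to the code that implements it:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Toy "machine code": opcode 1 = load immediate into the accumulator,
     * opcode 2 = add immediate to the accumulator, opcode 3 = halt.
     * Opcodes 1 and 2 take the following number as their parameter. */
    const uint8_t program[] = { 1, 40, 2, 2, 3 };    /* load 40; add 2; halt */
    unsigned int acc = 0;
    unsigned int pc  = 0;                            /* program counter */

    for (;;) {
        uint8_t opcode = program[pc++];              /* fetch  */
        switch (opcode) {                            /* decode */
        case 1: acc  = program[pc++]; break;         /* execute: load immediate */
        case 2: acc += program[pc++]; break;         /* execute: add immediate  */
        case 3: printf("acc = %u\n", acc); return 0; /* halt                    */
        default: return 1;                           /* unknown opcode          */
        }
    }
}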
how does the processor process its instructions, how does it "find out" what to do?
For every opcode, which is just a binary pattern, there is its own circuit on the CPU. If the pattern of the opcode matches the "switch" that enables this opcode, its circuit processes it.
Here's a WikiBook about it:
http://en.wikibooks.org/wiki/Microprocessor_Design
A few years ago someone built a whole, working computer from simple logic and memory ICs, i.e. with no microcontroller or similar involved. The whole project, called "Big Mess o' Wires", was more or less a CPU built from scratch. The only thing nerdier would have been building the thing from single transistors (which actually wouldn't have been that much more difficult). He also provides a simulator which lets you see how the CPU works internally, decoding each instruction and executing it: Big Mess o' Wires Simulator
EDIT: Since I originally wrote this answer, building a fully fledged CPU from modern, discrete components has been done. For your consideration, a MOS 6502 (the CPU that powered the Apple II, Commodore 64, NES, BBC Micro and many more) built from discretes: https://monster6502.com/
Machine-code does not "communicate with the processor".
Rather, the processor "knows how to evaluate" machine code. In the [widespread] von Neumann architecture, this machine code (the program) can be thought of as an indexable array of cells, where each cell contains a machine-code instruction (or data, but let's ignore that for now).
The CPU "looks" at the current instruction (often identified by the PC or Program Counter) and decides what to do (this can either be done directly with transistors/"bare-metal", or it be translated to even lower-level code): this is known as the fetch-decode-execute cycle.
When the instructions are executed, side effects occur, such as setting a control flag, putting a value in a register, or jumping to a different index in the program (changing the PC). See this simple overview of a CPU, which covers the above a little better.
It is the evaluation of each instruction -- as it is encountered -- and the interaction of side-effects that results in the operation of a traditional processor.
(Of course, modern CPUs are very complex and do lots of neat tricky things!)
That's called microcode. It is code inside the CPU that reads machine-code instructions and translates them into low-level data flow.
When the CPU, for example, encounters the add instruction, the microcode describes how it should fetch the two operands, feed them to the ALU to do the calculation, and where to put the result.
Electricity. Circuits, memory, and logic gates.
Also, I believe Python is usually interpreted, not compiled through C → assembly → machine code.
Currently I am testing some RTL. I am using ncverilog, and it is very ... very slow. I have heard that if we use some kind of FPGA board, things will be faster. Is that true?
You're talking about two different things.
NCVerilog is a simulation tool while an FPGA board is real hardware. So, there will be differences. Real hardware will be generally faster but with a simulator, you can have all sorts of debugging fun. Trying to probe a specific signal is just a matter of adding a line to the testbench. Also, you can easily make changes to the simulated model instead of having to redesign the FPGA board.
If you run simulation on a sufficiently powerful machine, you can sometimes approximate real-world performance (assuming that the FPGA is a slow one).
All in all, you should do both. Use a simulator to do your basic development and evaluation. Move onto your FPGA hardware once your design is sufficiently well defined.
We've had the same issues with simulation speed too. However, we stick with simulations for the majority of our verification. Each sim checks a specific function, and they are much quicker than system-level sims. We've also made them self-checking, and they are useful as regression tests (unit tests).
For long system tests on real-world signals that take too much time to simulate, we move these to the FPGA if we can. We need to manually re-check all these testcases again after code changes, so it can be slow in its own way.
Sometimes, though, FPGAing a design is just not feasible: full designs may be too large to fit into an FPGA, or the clock rate may be too high. But remember that you don't necessarily have to FPGA your entire design; it may be enough to take the important block you're interested in and check that out fully.
You can trace activity on signals in a running FPGA design using "embedded logic analyzer" software tools like Altera SignalTap or Xilinx ChipScope. Before synthesizing/mapping your RTL to the device, you would use these tools to attach soft probes to the signals you want to watch. You can set triggers so that a signal's values only get logged under certain conditions. Then you generate the bitfile and program the device with JTAG. The logic analyzer communicates with your PC over JTAG and logs activity on your probes, which you can then analyze.
It's a bit complicated to set up, as these tools are not especially easy to use, but you will get results much faster than with RTL simulation.
What kind of RTL are you testing? If you use FPGA boards, then you can compile your code, provided you have the right tool for the right FPGA. Since FPGAs are reprogrammable, you can of course test your code on the board and have the target (FPGA) execute your code (RTL).
But it is no longer a simulation; it is a test, with given hardware, at a given clock speed.
And you don't get nice results on the screen; you need to use physical probes and a scope. Plus, you don't get to see how the internals of your code are working.
Verilog or VHDL simulation is sort of like running code under a debugger. FPGA testing is more like debugging with printf. The big difference is that when simulating, your CPU has to simulate the behaviour of all the logic gates that result from your code. On the FPGA there is no simulation; you just 'run' the code, so it is much faster, but you get less information.
You should use simulation for very small components, and then test your whole design on an FPGA.
You're probably asking about hardware simulation accelerators.
Here is one of them : GateRocket