How do you disable the processor cache on a PowerPC processor? - embedded

In our embedded system (using a PowerPC processor), we want to disable the processor cache. What steps do we need to take?
To clarify a bit, the application in question must have as constant a speed of execution as we can make it.
Variability in executing the same code path is not acceptable. This is the reason to turn off the cache.

I'm kind of late to the question, and also it's been a while since I did all the low-level processor init code on PPCs, but I seem to remember the cache & MMU being pretty tightly coupled (one had to be enabled to enable the other) and I think in the MMU page tables, you could define the cacheable attribute.
So my point is this: if there's a certain subset of code that must run in deterministic time, maybe you locate that code (via a linker command file) in a region of memory that is defined as non-cacheable in the page tables? That way all the code that can/should benefit from the cache does, and the (hopefully) subset of code that shouldn't, doesn't.
I'd handle it this way anyway, so that later, if you want to enable caching for part of the system, you just need to flip a few bits in the MMU page tables, instead of (re-)writing the init code to set up all the page tables & caching.
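A minimal sketch of the placement idea, assuming a GCC-style toolchain. The section name `.text.nocache` and the function `control_step` are made up for illustration. The linker command file still has to place that section in a region that the page tables mark as non-cacheable, which is not shown here.

```c
#include <stdint.h>

/* Hypothetical section name. The linker command file must map it to a
   memory region that the MMU setup marks as non-cacheable (not shown). */
#if defined(__GNUC__) && defined(__ELF__)
#define DETERMINISTIC_CODE __attribute__((section(".text.nocache"), noinline))
#else
#define DETERMINISTIC_CODE
#endif

/* Example of a routine that should run in constant time on every call. */
DETERMINISTIC_CODE
uint32_t control_step(uint32_t setpoint, uint32_t measured)
{
    return (setpoint > measured) ? (setpoint - measured) : 0u;
}
```

Everything not tagged with the macro stays in the normal, cached text section.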

From the E600 reference manual:
The HID0 special-purpose register contains several bits that invalidate, disable, and lock the instruction and data caches.
You should use HID0[DCE] = 0 to disable the data cache.
You should use HID0[ICE] = 0 to disable the instruction cache.
Note that at power up, both caches are disabled.
You will need to write this in assembly code.
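As a rough sketch of what that looks like with GCC inline assembly: the bit masks and the SPR number 1008 below are the values I know from the classic 6xx/7xx/74xx HID0 layout, so check them against the reference manual for your exact core. The sketch also assumes the data cache has already been flushed, since dirty lines are lost otherwise. The exact flush and synchronisation sequence should come from the manual, not from this example.

```c
#include <stdint.h>

#define HID0_ICE 0x00008000u   /* instruction cache enable */
#define HID0_DCE 0x00004000u   /* data cache enable */

/* Pure helper: returns the HID0 value with both cache-enable bits cleared. */
static inline uint32_t hid0_with_caches_disabled(uint32_t hid0)
{
    return hid0 & ~(HID0_ICE | HID0_DCE);
}

#if defined(__GNUC__) && (defined(__powerpc__) || defined(__PPC__))
/* Assumption: the data cache was flushed before this is called. */
static inline void disable_l1_caches(void)
{
    uint32_t hid0;
    __asm__ volatile("mfspr %0, 1008" : "=r"(hid0));      /* read HID0 */
    hid0 = hid0_with_caches_disabled(hid0);
    __asm__ volatile("sync\n\t"
                     "isync\n\t"
                     "mtspr 1008, %0\n\t"                 /* write HID0 */
                     "isync\n\t"
                     "sync"
                     : : "r"(hid0) : "memory");
}
#endif
```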

Perhaps you don't want to globally disable cache, you only want to disable it for a particular address range?
On some processors you can configure TLB (translation lookaside buffer) entries for address ranges such that each range could have caching enabled or disabled. This way you can disable caching for memory mapped I/O, and still leave caching on for the main block of RAM.
The only PowerPC I've done this on was a PowerPC 440EP (from IBM, then AMCC), so I don't know if all PowerPCs work the same way.
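To make the per-range idea concrete, here is a small software model of such a memory map. It is only an illustration: the addresses, sizes and the `cacheable` flag are invented and do not correspond to any real TLB entry encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative region descriptor, not a real TLB entry format. */
typedef struct {
    uint32_t base;
    uint32_t size;
    bool     cacheable;
} region_t;

/* Hypothetical memory map: cached RAM plus an uncached I/O window. */
static const region_t memory_map[] = {
    { 0x00000000u, 0x08000000u, true  },  /* 128 MB main RAM, cached */
    { 0xE0000000u, 0x00100000u, false },  /* 1 MB memory-mapped I/O  */
};

/* Returns true if addr lies in a cacheable region.
   Unmapped addresses are treated as uncached. */
static bool is_cacheable(uint32_t addr)
{
    for (size_t i = 0; i < sizeof memory_map / sizeof memory_map[0]; i++) {
        const region_t *r = &memory_map[i];
        if (addr - r->base < r->size)   /* unsigned wrap rejects addr < base */
            return r->cacheable;
    }
    return false;
}
```

On real hardware the same table would be used at init time to program one TLB entry per region with the matching caching attribute.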

What kind of PPC core is it? The cache control is very different between different cores from different vendors... also, disabling the cache is in general considered a really bad thing to do to the machine. Performance becomes so crawlingly slow that you would do as well with an old 8-bit processor (exaggerating a bit). Some ARM variants have TCMs, tightly-coupled memories, that work instead of caches, but I am not aware of any PPC variant with that facility.
Maybe a better solution is to keep Level 1 caches active, and use the on-chip L2 caches as statically mapped RAM instead? That is common on modern PowerQUICC devices, at least.

Turning off the cache will do you no good at all. Your execution speed will drop by an order of magnitude. You would never ship a system like this, so its performance under these conditions is of no interest.
To achieve a steady execution speed, consider one of these approaches:
1) Lock some or all of the cache. All current PowerPC chips from Freescale, IBM, and AMCC offer this feature.
2) If it's a Freescale chip with L2 cache, consider mapping part of that cache as on-chip memory.

Related

Is it recommended to use SPI flash to run code instead of internal flash due to the memory limitation of internal flash?

We are using an LPC546xx family microcontroller in our project and are currently, at the initial stage, finalizing the software and hardware requirements. The basic firmware (which contains the RTOS, third-party stack, libraries, etc.) is currently 480 KB. Once the full application is developed, the size will exceed the internal flash size (512 KB), and we also need storage that can hold a firmware update image separately.
So we plan to use SPI flash (IS25LP064A-JBLE, http://www.issi.com/WW/pdf/IS25LP032-064-128.pdf, serial flash memory) of 4 MB or 8 MB to boot and run the firmware.
Is it recommended to run code from SPI flash? How can I map external flash memory directly into the CPU memory space? Can anyone give an example that contains this memory mapping (linker script etc.) or a demo application in which an LPC546xx uses SPI flash?
Generally speaking it's not recommended, or put differently: the closer to the CPU the better. Both the IS25LP064A and the LPC546xx support XIP (execute in place), however, so it is viable.
This is not a trivial issue, as many aspects affect it. That is, the issue is best avoided and should really have been ironed out in the planning stage. Embedded systems are more about compromise than anything else, and making the right (or better) choices takes skill and experience.
Same question with replies on the NXP forum: link
512K of NVRAM is huge. There is almost certainly room for optimisation, even if third-party libraries are used.
On a related note this discussion concerning XIP should give valuable insight: link.
I would strongly encourage use of file-systems if not done already, for which external storage is much better suited. The further from the computational unit, the more relevant. That's not XIP and the penalty is copy-to-RAM either way you do it. I.e. performance will be slower. But in my experience, the need for speed has often-times not been thoroughly considered and at least partially greatly overestimated.
Regarding your mentioning of RTOS and FW-upgrade:
Unless it's a poor RTOS there's file-system awareness built in. Especially for FW upgrading (Note: you'll need room for 3 images, factory reset included), unless already supported by the SoC-vendor by some other means (OTA), it will make life much easier and less risky. If there's no FS-awareness, it can be added.
FW upgrade requires a lot of extra storage. More if simpler. Simpler is however also safer which especially for FW upgrades matters hugely. In the simplest case (binary flat image), you'll need at least twice the amount of memory you're already consuming.
All-in-all: I think the direction you're going is viable and depending on the actual situation perhaps your only choice.

Does a context switch between processes invalidate the MMU (memory management unit)?

This is a sentence in the PowerPoint of my systems lecture, but I don't understand why a context switch invalidates the MMU. I know it will invalidate the cache, since the cache contains information from another process. The MMU, however, just maps virtual memory to physical memory. If a context switch invalidates it, does this mean the MMU uses a different mapping mechanism for different processes?
Does this mean the MMU uses a different mapping mechanism for different processes?
Your conclusion is essentially right.
Each process has its own mapping from virtual to physical addresses (called a context).
The address 0x401000 for example can be translated to 0x01234567 for process A and to 0x89abcdef for process B.
Having different contexts allows for an easy isolation of the processes, easy on demand paging and simplified relocation.
So each context switch must invalidate the TLB or the CPU would continue using the old translations.
Some pages however are global, meaning that they have the same translation independently of the current process address space.
For example, the kernel code is mapped in the same way for every process and thus doesn't need to be remapped.
So in the end only a part of the TLB is invalidated.
You can read how Linux handles the process address space for a real example of applied theory.
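The partial invalidation described above can be modelled in a few lines. This is a toy software TLB, not how any particular CPU or kernel implements it. The entry layout, the slot-based insert and all names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 4

typedef struct {
    bool     valid;
    bool     global;   /* same translation in every address space */
    uint32_t vpage;    /* virtual page number */
    uint32_t pframe;   /* physical frame number */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Fill one slot with a translation. */
static void tlb_insert(size_t slot, uint32_t vpage, uint32_t pframe, bool global)
{
    tlb[slot] = (tlb_entry_t){ true, global, vpage, pframe };
}

/* Returns true and stores the frame number on a hit. */
static bool tlb_lookup(uint32_t vpage, uint32_t *pframe)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage) {
            *pframe = tlb[i].pframe;
            return true;
        }
    }
    return false;
}

/* On a context switch, drop only the per-process (non-global) entries. */
static void tlb_context_switch(void)
{
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (!tlb[i].global)
            tlb[i].valid = false;
    }
}
```

After `tlb_context_switch()`, a lookup of a user page misses and has to be refilled from the new process's page tables, while a global kernel page still hits.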
What you are describing is entirely system specific.
First of all, what they are probably referring to is invalidating the MMU cache. That assumes the MMU has a cache (likely these days but not guaranteed).
When a context switch occurs, the processor has to put the MMU in a state where leftovers from the previous process cannot screw up the new process. If it did not, the cache would map the new process's logical pages to the old process's physical page frames.
For example, some processors use one page table for the system space and one or more other page tables for the user space. After a context switch, it would be ideal for the processor to invalidate any caching of the user space page tables but leave any caching of the system page table alone.
Note that in most processors all of this is done entirely behind the scenes. Even OS programmers do not need to deal with (or even be aware of) any flushing or invalidation of the MMU. There is a single switch process context instruction that handles everything. Other processors require the OS programmer to handle additional tasks as part of a context switch which, in some oddball processors, includes explicitly flushing the MMU cache.

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT: I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB, and then reload the next process from its corresponding PCB. And for switching between the user and the kernel modes, I know that the mode bit has to be changed. Is that all, or is there more to it?
Switching between processes (given you actually switch, not run them in parallel) by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt. Around 2005 (I don't remember the kernel version), after a discussion on the mailing list where someone found that trapping was slower (in absolute terms!) on a high-end Xeon processor than on an earlier Pentium II or III (again, from memory), they implemented it with a newer CPU instruction, sysenter (which had actually existed since the Pentium Pro, I think). This is done in the Virtual Dynamic Shared Object (vdso) page in each process (cat /proc/pid/maps to find it), IIRC.
So nowadays a kernel trap is basically just a couple of CPU instructions, hence rather few cycles, compared to tens or hundreds of thousands when using an interrupt (which is really slow on modern CPUs).
A context switch between processes is heavy. It means storing all processor state (registers, etc.) to RAM (at a magic memory location in the user process space actually, guess where!), in practice dirtying all cached memory in the CPU, and reading back the process state for the new process. The new process will (likely) have nothing left in the CPU cache from the last time it ran, so each memory read will be a cache miss and will need to be read from RAM. This is rather slow. When I was at university, I "invented" (well, I did come up with the idea, knowing that there is plenty of die area in a CPU, but not enough cooling if it's constantly powered) a cache in the CPU that was of infinite size but unpowered when unused (i.e. only used on context switches), and implemented this in Simics. I implemented support for this magic cache, which I called CARD (Context-switch Active, Run-time Drowsy), in Linux, and benchmarked it rather heavily. I found that it could speed up a Linux machine with lots of heavy processes sharing the same core by about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.
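If you want a number for the kernel-trap side on your own machine, a rough measurement is easy to sketch. This assumes Linux and a POSIX monotonic clock. The function name is made up. It times only the user/kernel round trip of a trivial system call, not a switch between processes.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* for syscall() */
#endif
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Average wall-clock nanoseconds per getpid system call.
   iterations must be greater than zero. */
static double ns_per_syscall(long iterations)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        (void)syscall(SYS_getpid);   /* raw syscall, bypasses any libc shortcut */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                      + (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iterations;
}
```

Measuring a real process switch is harder. A common approach is to bounce a byte between two processes over a pair of pipes and halve the round-trip time, but that also includes the pipe overhead.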

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking: why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question about the instruction. What you propose is an asynchronous, non-blocking copy instruction rather than the synchronous, blocking memcpy available now using "rep movs".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and has much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from a kernel buffer to a user buffer occurs just before handing the user buffer to the application. So the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy, for use later to ask whether the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
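To make points 2 and 4 more tangible, here is a toy software model of what such an interface would have to look like. Nothing here is a real instruction or API: `async_copy_submit`, `async_copy_done`, `engine_complete` and the queue depth of 2 are all invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_IN_FLIGHT 2          /* hypothetical hardware queue depth */
#define INVALID_TOKEN (-1)

typedef struct {
    bool        pending;
    void       *dst;
    const void *src;
    size_t      len;
} copy_slot_t;

static copy_slot_t slots[MAX_IN_FLIGHT];

/* Queue a copy. Returns a token, or INVALID_TOKEN if the engine is full. */
static int async_copy_submit(void *dst, const void *src, size_t len)
{
    for (int t = 0; t < MAX_IN_FLIGHT; t++) {
        if (!slots[t].pending) {
            slots[t] = (copy_slot_t){ true, dst, src, len };
            return t;
        }
    }
    return INVALID_TOKEN;   /* caller must wait or fall back to memcpy() */
}

/* Stand-in for the engine finishing one job in the background. */
static void engine_complete(int token)
{
    if (token >= 0 && token < MAX_IN_FLIGHT && slots[token].pending) {
        memcpy(slots[token].dst, slots[token].src, slots[token].len);
        slots[token].pending = false;
    }
}

/* Poll: true once the copy identified by token is no longer pending. */
static bool async_copy_done(int token)
{
    return token >= 0 && token < MAX_IN_FLIGHT && !slots[token].pending;
}
```

Even this tiny model shows the bookkeeping problem: the slot table is shared state that outlives the caller, the third submitter is refused, and nothing records which process owns which token.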
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.
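A very reduced sketch of the receive side of such a scheme, assuming a fixed pool of preallocated buffers. The names `driver_acquire` and `app_release`, the pool size and the buffer size are all made up. A real stack would add locking or interrupt masking around the pool.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_BUFFERS 2
#define BUFFER_SIZE  64

typedef struct {
    uint8_t data[BUFFER_SIZE];
    size_t  len;
    int     in_use;
} net_buffer_t;

static net_buffer_t pool[POOL_BUFFERS];

/* Driver side: claim a free buffer for the hardware to fill.
   Returns NULL when the pool is exhausted. */
static net_buffer_t *driver_acquire(void)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].len = 0;
            return &pool[i];
        }
    }
    return NULL;
}

/* Application side: hand the buffer back once the data has been processed. */
static void app_release(net_buffer_t *buf)
{
    buf->in_use = 0;
    buf->len = 0;
}
```

The application works on the same `net_buffer_t` the hardware filled, so the payload is never copied. The price is that the application must release buffers promptly or the driver runs out.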
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
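For reference, this is roughly what a string-move copy looks like from C with GCC-style inline assembly on x86. The function name is made up, and the plain loop is only there so the sketch builds on other targets. Real memcpy implementations are considerably more elaborate.

```c
#include <stddef.h>

/* Copies n bytes from src to dst. The regions must not overlap. */
static void string_move_copy(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Destination, source and count live in (R/E)DI, (R/E)SI and (R/E)CX.
       The instruction updates them as it runs, so an interrupted copy can
       simply resume from the register contents. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
}
```

The three general registers are the entire state of the operation, which is why no extra per-thread context has to be saved.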
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

Context switch time - Role of RTOS and Processor

Does the RTOS or the processor play the major role in determining the context switch time? What share does each of these two contribute to the context switch time?
Can anyone tell with respect to uC/OS-II RTOS ?
I would say both are significant, but it is not really as simple as that:
The actual context switch time is simply a matter of the number of instruction cycles required to perform the switch; like anything in software, it may be coded efficiently or it may not. On the other hand, all other things being equal, a processor with a large register set will require more instruction cycles to save the context; but having a large register set may make other code far more efficient.
A processor may also have an architecture that directly supports fast context switching. For example, the lowly 8-bit 8051 has four duplicate register banks, so a context switch is little more than a register bank switch (so long as you have no more than four threads), and given that Silicon Labs produce 8051-based devices at 100 MIPS, that could be very fast indeed!
More sophisticated processors and operating systems may use an MMU to provide thread memory protection, this is an additional context switch overhead but with benefits that may override that. Also of course such processors generally also have high clock rates which helps.
So all in all, the processor speed, the processor architecture, the quality of the RTOS implementation, and the functionality provided by the RTOS may all affect context switch time. But in the end the easiest way to improve switch time is almost certainly to increase the clock rate.
Although it is nice to have more headroom, if context switch time is a make or break issue for your project on any reputable RTOS you should consider the suitability of either your hardware or your design. You should aim toward a design that minimises context switches. For example, if an ADC conversion takes 6us and a context switch takes 20us, then you would do better to busy-wait than to use a conversion-complete interrupt; better yet use DMA transfers to avoid context switches on single data items where possible.
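The busy-wait trade-off in that example can be written down as a one-line rule of thumb. The factor of two is my own assumption: blocking costs one switch away from the task and one switch back. The function name is made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rule of thumb: blocking on an event costs a switch away from the task
   and a switch back, so polling wins if the operation is shorter than that. */
static bool should_busy_wait(uint32_t operation_us, uint32_t context_switch_us)
{
    return operation_us < 2u * context_switch_us;
}
```

With the figures above, `should_busy_wait(6, 20)` says to poll. An operation of a few hundred microseconds would be worth blocking on.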
uC/OS-II is written in C, with some very specific sections (maybe in assembly) for the processor-specific handling. The context switching will be part of the sections that are very specific to the processor.
So the context switch time will be very dependent on the processor selected and on the specific sections used to adapt uC/OS-II to that processor. I believe all the source code is available, so you should be able to see how much code is needed for a context switch. I also think uC/OS-II has callbacks that may allow you to add some performance-measuring code.
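As a sketch of the callback idea: the uC/OS-II ports I know of call a hook named `OSTaskSwHook()` on every task switch, but check your port's os_cpu_c.c and configuration before relying on that. Everything else below is hypothetical, including the timer, which is a plain variable standing in for a free-running hardware counter.

```c
#include <stdint.h>

/* Hypothetical free-running timer. On a real target, read a hardware
   counter here instead of this variable. */
static volatile uint32_t fake_timer_ticks;

static uint32_t read_timer_ticks(void)
{
    return fake_timer_ticks;
}

static uint32_t switch_count;        /* number of task switches seen */
static uint32_t switch_start_ticks;  /* timestamp of the latest switch */

/* Called by the kernel just before it switches tasks. */
void OSTaskSwHook(void)
{
    switch_count++;
    switch_start_ticks = read_timer_ticks();
}

/* Call this as the first statement after a task resumes to get the
   ticks elapsed since the hook ran. */
uint32_t ticks_since_switch(void)
{
    return read_timer_ticks() - switch_start_ticks;
}
```

The result includes the hook's own overhead and excludes whatever the kernel does before calling the hook, so treat it as a lower bound on the real switch time.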
Just to complete on what Clifford was saying, context switching time also depends on the conditions that trigger the context switch, so mainly it depends on the benchmark.
Depending on the RTOS implementation, in some cases it's possible to switch directly to the first waiting process bypassing the scheduler altogether.
This of course gives a huge boost in some benchmarks.
For example, we made a benchmark that measures the overhead (in µs) required to deliver a signal and switch to the high-priority process, varying the kernel configuration and the target architecture:
http://www.bertos.org/discover/context-switch-overhead