Does the CPU stall when accessing external memory (SRAM) via the FSMC?

I am using an STM32F103 chip with a Cortex-M3 core in a project. According to the reference manual, section 3.3.1 (Cortex-M3 instructions), loading a 32-bit word with a single LDR instruction takes 2 CPU cycles to complete (assuming the destination is not PC).
My understanding is that this is only true for reading from internal memories (Flash or internal SRAM).
When reading from an external SRAM via the FSMC, it must take more cycles to complete the read operation. During the read operation, does the CPU stall until the FSMC is able to put the data together? In other words, do I lose CPU cycles when accessing external memories?
Thank you.
Edit 1: Also assume all accesses are aligned 32-bit accesses.

LDR and STR instructions are not interruptible. The FSMC is bridged from the AHB, and can run at a much slower rate, as you already know. For reads, the pipeline will stall until the data is ready, and this may increase worst-case interrupt latency. A write may or may not stall the pipe, depending on configuration. The reference manual says there is a two-word write buffer, but it appears it may only be used to buffer bursting memories. If you were using a CRAM (PSRAM) with a bursting interface, subsequent writes would likely not have completed before the next instruction executes, but a subsequent read would stall (longer) to allow the write to finish before initiating the read.
If using LDM and STM instructions to perform multiple reads or writes, these instructions are interruptible, and it is implementation-defined whether they restart from the beginning or continue where they left off when the interrupt returns. I haven't been able to find out how ST has chosen to implement this behavior. In either case, each individual bus transaction should not be interrupted.
Regarding LDRD and STRD for working on 64-bit values, I found this discussion, which references the following from the ARM ARM:
"... LDRD, ... STRD, ... instructions are executed as a sequence of
word-aligned word accesses. Each 32-bit word access is guaranteed to
be single-copy atomic. The architecture does not require subsequences
of two or more word accesses from the sequence to be single-copy
atomic."
So, it appears that LDRD and STRD are likely to function the same way LDM and STM function.
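If you want to see the stall directly, the Cortex-M3's DWT cycle counter can time a single load. The sketch below is only an illustration: it assumes a CMSIS environment ("stm32f10x.h") and that the external SRAM is mapped at the start of FSMC Bank 1 (0x60000000); adjust the addresses for your board.

    /* Rough sketch: measure the cycle cost of a single 32-bit load using
     * the DWT cycle counter. Comparing a load from external SRAM with one
     * from internal SRAM shows the stall added by the FSMC access. */
    #include "stm32f10x.h"

    static uint32_t cycles_for_load(volatile uint32_t *p)
    {
        uint32_t start, end, dummy;

        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable DWT block     */
        DWT->CYCCNT = 0;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start cycle counter  */

        start = DWT->CYCCNT;
        dummy = *p;                                       /* the LDR under test   */
        end   = DWT->CYCCNT;

        (void)dummy;
        return end - start;          /* includes the counter-read overhead itself */
    }

    /* Usage (addresses are illustrative):
     *   uint32_t ext  = cycles_for_load((volatile uint32_t *)0x60000000);
     *   uint32_t intr = cycles_for_load((volatile uint32_t *)0x20000000);
     * The difference between the two is the extra stall caused by the FSMC. */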

The STM32F1xx FSMC has programmable wait states; if they are not set to zero for your memory, accesses will indeed take additional cycles. The data bus to the external memory is either 8 or 16 bits wide, so 32-bit accesses will also take additional cycles. The write FIFO can cause further wait states to be inserted as well.
On the other hand, the Cortex-M is a Harvard-architecture core with different memories on different buses, so instruction and data fetches can occur simultaneously, minimising processor stalling to some extent.
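For reference, here is a hedged sketch of how Bank 1 SRAM timing might be set up through the FSMC registers on a part that has the FSMC. It assumes the standard STM32F1 CMSIS header, where FSMC_Bank1->BTCR[0]/[1] correspond to BCR1/BTR1; the field values are placeholders to be derived from your SRAM's datasheet and the HCLK period.

    #include "stm32f10x.h"

    static void fsmc_bank1_sram_init(void)
    {
        RCC->AHBENR |= RCC_AHBENR_FSMCEN;        /* clock the FSMC              */

        /* BCR1: 16-bit SRAM, write enabled, bank enabled (placeholder config) */
        FSMC_Bank1->BTCR[0] = (1u << 12)         /* WREN                        */
                            | (1u << 4)          /* MWID = 16-bit data bus      */
                            | (1u << 0);         /* MBKEN                       */

        /* BTR1: each extra ADDSET/DATAST count is one HCLK of added stall     */
        FSMC_Bank1->BTCR[1] = (0u << 0)          /* ADDSET: address setup time  */
                            | (3u << 8)          /* DATAST: data phase duration */
                            | (0u << 16);        /* BUSTURN                     */
    }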

Related

Are uint16_t and uint32_t interrupt-safe in the Cortex-M architecture?

I am working on some embedded stuff. I had multiple interrupts possibly working on the same data, so I was wondering whether the uint16_t and uint32_t data types are interrupt safe.
If an interrupt is working on uint16/32_t data and is interrupted halfway through by another interrupt that tries to read this data, will the second interrupt see corrupted data? Is this a possible scenario?
Thanks
To expand on the answer from @DinhQC, all single-result instructions on 16- and 32-bit data types are 'atomic' with respect to interrupts on the Cortex-M, as long as the data is properly aligned (and you have to try quite hard to get the C compiler to give you unaligned data, because unaligned accesses are slow and need special treatment). Multiple-result operations like LDM and STM can be interrupted and resumed, on most implementations, but the integrity of each individual 32-bit transfer within the LDM or STM is guaranteed.
The important thing is to understand whether the operations you're performing are single instructions at the machine level or not. If you're incrementing a shared variable, for example, this will take three instructions: a read, a modify, and a write. If an interrupt occurs in between the read and the write, and the interrupt service routine modifies the same variable, this modification will be overwritten when the ISR returns.
The safe way to go is to use some kind of hardware-supported mechanism to enforce atomicity or mutual exclusion over your shared data. There are more powerful, more flexible and faster approaches to mutual exclusion on the Cortex-M than disabling and re-enabling interrupts, though, notably the STREX and LDREX instructions (which are available in C too). Take a look at my answer to this other question for more information.
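As a minimal illustration of the LDREX/STREX approach mentioned above (a sketch, not taken from the linked answer), here is an atomic increment using the CMSIS __LDREXW/__STREXW intrinsics. These exist on Cortex-M3/M4/M7 but not on Cortex-M0; the variable name is illustrative.

    #include "stm32f10x.h"

    volatile uint32_t shared_counter;    /* illustrative shared variable */

    static void atomic_increment(volatile uint32_t *p)
    {
        uint32_t v;
        do {
            v = __LDREXW(p);             /* exclusive read                */
            v++;
        } while (__STREXW(v, p) != 0);   /* retry if exclusivity was lost */
    }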
Cortex-M processors do not corrupt your data or give it an undefined value. The value will always be deterministic. However, there are many conditions that affect the value of the data when interrupts occur. The uint16/32_t data can be located in memory, or held only in processor registers. If in memory, it can be 16/32-bit aligned or not. The processor (e.g. M0 or M4) and the operation performed on the data (e.g. add or multiply) also matter. All of these determine whether the instruction used to process the data is atomic or not.
You can find more details in this discussion and this answered by Joseph Yiu.
Generally speaking, if the instruction is atomic (a single execution cycle), an interrupt cannot disturb the data operation. However, at the C code level, a uint16/32_t data operation may take more than one instruction. Therefore, it is hard to guarantee that the program runs as expected. This also applies to uint8_t data. You may wish to disable interrupts before working on the shared data and enable them afterwards. The technique is covered well in this answer (look at point 2).
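A minimal sketch of that disable/enable-interrupts technique using CMSIS intrinsics (the shared variable is illustrative); saving and restoring PRIMASK keeps the critical section safe to nest:

    #include "stm32f10x.h"

    volatile uint32_t shared_value;          /* illustrative shared variable   */

    static void update_shared(uint32_t new_value)
    {
        uint32_t primask = __get_PRIMASK();  /* remember current mask state    */
        __disable_irq();                     /* enter critical section         */

        shared_value = new_value;            /* read-modify-write goes here    */

        __set_PRIMASK(primask);              /* restore: re-enables interrupts
                                                only if they were enabled before */
    }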

Could a tight loop destroy cells of a microcontroller's flash?

It is well known that Flash memory has limited write endurance; it is less well known that reads may also have an upper limit, as mentioned in the conclusion of this Flash endurance test (3rd point).
On a microcontroller the code is typically stored in Flash and executed by fetching code words directly from the Flash cells (at least this is most commonly so on 8-bit micros; some 32-bit micros might have a small buffer).
Depending on the particular code, it might happen that a location is accessed extremely frequently, for example if the main execution path contains a busy loop, such as a wait for an interrupt (e.g. from a timer synchronizing execution to a fixed interval). This could generate 100k or even more (read) accesses per second on average to a single Flash cell (depending on the clock and the particular code).
Could such code actually destroy the cells of the Flash underneath it? (Is there any need to be concerned about this particular problem when designing code for microcontrollers, for example as part of a system that is meant to operate for years? Of course the Flash could be periodically verified by CRC, but that doesn't prevent the system from failing if it happens; it only makes the failure more likely to happen in a controlled manner.)
Only erasing/writing will affect the memory cells, not reading. You don't need to consider the number of reads when designing the program.
Programmed flash memory does age, though, meaning that the value of the cells might not be reliable after a certain number of years. This is known as data retention and depends mainly on temperature. MCU manufacturers typically specify a worst case in years, assuming that the part is kept at the maximum specified ambient temperature.
This is something to consider for products that are expected to live long (> 10 years), particularly in environments where high temperatures can be expected. CRC and/or ECC is a good counter-measure against retention loss, although if you do find that a cell has been corrupted, it typically just means that the application should shut down to a non-recoverable safe state.
I know of two techniques to approach this issue:
1) One technique is to set aside a const 32-bit integer variable in the system code. Then calculate a CRC32 checksum of the compiled binary image and insert the checksum into the reserved variable using an ELF editor.
A module in the system software will then calculate a CRC32 over the flash area occupied by the application and compare it to the "stored" value.
If you are using GCC, the linker can define a symbol to tell you where the segment stops. This method can detect errors but cannot correct them. (A sketch of such a check appears after the second technique below.)
2) Another technique is to use a microcontroller that supports Flash ECC. TI sells Cortex-R4 MCUs which support Flash ECC (Hercules series).
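Here is a rough sketch of the CRC check from technique 1. The linker symbol names (__app_start, __app_end) and the reserved checksum variable are hypothetical; they would come from your linker script and a post-build step that patches the computed CRC into the image.

    #include <stdint.h>

    extern const uint8_t  __app_start[];      /* first byte of the checked image */
    extern const uint8_t  __app_end[];        /* first byte after the image      */
    extern const uint32_t __app_stored_crc;   /* patched in by the build tool    */

    /* Standard reflected CRC-32 (polynomial 0xEDB88320), one byte at a time. */
    static uint32_t crc32_update(uint32_t crc, uint8_t byte)
    {
        crc ^= byte;
        for (int i = 0; i < 8; i++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        return crc;
    }

    int flash_image_is_valid(void)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (const uint8_t *p = __app_start; p < __app_end; p++)
            crc = crc32_update(crc, *p);
        return (crc ^ 0xFFFFFFFFu) == __app_stored_crc;   /* 1 if the image checks out */
    }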
I doubt that this is a practical concern. The article you cited vaguely asserts that this can happen but with no supporting evidence or quantification of the effect. There is a vague, unsupported and unquantified reference in the introduction:
[...] flash degrades over time from erasing/writing (or even just reading, although that decay is slower) [...]
Then again in the conclusion:
We did not check flash decay for reads, but reading also causes long term decay. It would be interesting to see if we can read a spot enough times to cause failure.
The author may be referring to read disturb in NAND flash, but microcontrollers do not use NAND flash for code storage/execution, since it is not random-access. Read disturb is not a permanent effect; erasing and re-writing the affected block restores it. NAND controllers maintain read counts for blocks and automatically copy and erase blocks as necessary. They also employ ECC to detect and correct errors, and to identify "write-worn" areas.
There is the potential for long-term "bit-rot", but I doubt that it is caused specifically by reading rather than just ageing.
Most RTOS systems spend the majority of their processing time in a do-nothing idle loop, and run happily 24/7 365 days a year. Some processors support a wait-for-interrupt instruction that halts the CPU in the idle loop, but by no means all, and it is not uncommon not to use such an instruction. Processors with flash accelerators or caches may also prevent continuous rapid fetch from a single location, but again that is by no means all microcontrollers.

Interrupt vector table: why do some architectures employ a "jump table" VS an "array of pointers"?

On some architectures (e.g. x86) the Interrupt Vector Table (IVT) is indeed what it says on the tin: a table of vectors, aka pointers. Each vector holds the address of an Interrupt Service Routine (ISR). When an Interrupt Request (IRQ) occurs, the CPU saves some context and loads the vector into the PC register, thus jumping to the ISR. So far so good.
But on some other architectures (e.g. ARM) the IVT contains executable code, not pointers. When an IRQ occurs, the CPU saves some context and executes the vector. But there is no space in between these "vectors", so there is no room for storing the ISR there. Thus each "vector instruction" typically just jumps to the proper ISR somewhere else in memory.
My question is: what are the advantages of the latter approach ?
I would kind of understand it if the ISRs themselves had fixed, well-known addresses and were spaced out so that reasonable ISRs would fit in place. Then we would save one level of indirection, though at the expense of some fragmentation. But this "compact jump table" approach seems to have no advantage at all. What did I miss?
Some of the reasons, but probably not all of them (I'm pretty much self educated in these matters):
You can have fall through (one exception does nothing, and just goes to the next in the table)
The FIQ interrupt (Fast Interrupt Request) is the last in the table, and as the name suggests, it's used for devices that need immediate, low-latency processing.
It means you can just put that ISR right there (no jumping) and process it as fast as possible. Also, the way FIQ was designed, with its dedicated registers, allows for optimal implementation of FIQ handlers. See https://en.wikipedia.org/wiki/Fast_interrupt_request
I think it has to do with simplifying the processor's hardware.
If you have machine instructions (jump instructions) in the interrupt vector table, the only extra thing the processor has to do when it has to jump to an interrupt handler is to load the address of the corresponding interrupt vector into the PC.
Whereas, if you have addresses in the interrupt vector table, the processor must read the interrupt handler's start address from memory, and then jump to it.
The extra hardware required to read from memory and then write to a register is more complex than that required to just write to a register.
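For contrast, this is roughly what the "array of pointers" style looks like when written in C for a Cortex-M part; the section name, handler names, and placeholder bodies are illustrative and depend on your startup code and linker script. On a classic ARM7/9 the table at address 0 would instead contain branch (or "ldr pc, ...") instructions, so the fetched word is executed rather than loaded into PC as data.

    #include <stdint.h>

    typedef void (*isr_t)(void);

    /* Placeholder handlers; a real project implements these elsewhere. */
    void Reset_Handler(void)    { for (;;) { } }
    void Timer_IRQHandler(void) { }

    extern uint32_t _estack;    /* top of RAM, provided by the linker script */

    __attribute__((section(".isr_vector"), used))
    const isr_t vector_table[] = {
        (isr_t)&_estack,        /* entry 0: initial stack pointer value      */
        Reset_Handler,          /* entry 1: address of the reset handler     */
        /* ... remaining exception and IRQ handler addresses ...             */
        Timer_IRQHandler,
    };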

Which takes longer time? Switching between the user & kernel modes or switching between two processes?

Which takes longer time?
Switching between the user & kernel modes (or) switching between two processes?
Please explain the reason too.
EDIT: I do know that whenever there is a context switch, it takes some time for the dispatcher to save the status of the previous process in its PCB and then load the next process from its corresponding PCB. And for switching between user and kernel modes, I know that the mode bit has to be changed. Is that all, or is there more to it?
Switching between processes takes longer (given you actually switch, not run them in parallel), by an order of oh-my-god.
Trapping from userspace to kernelspace used to be done with a processor interrupt. Around 2005 (I don't remember the kernel version), and after a discussion on the mailing list where someone found that trapping was slower (in absolute measures!) on a high-end Xeon processor than on an earlier Pentium II or III (again, from memory), it was implemented with a newer CPU instruction, sysenter (which had actually existed since the Pentium Pro, I think). This is done in the Virtual Dynamic Shared Object (vDSO) page in each process (cat /proc/pid/maps to find it), IIRC.
So, nowadays, a kernel trap is basically just a couple of CPU instructions, hence rather few cycles, compared to tens or hundreds of thousands when using an interrupt (which is really slow on modern CPUs).
A context switch between processes is heavy. It means storing all processor state (registers, etc.) to RAM (at a magic memory location in the user process space, actually; guess where!), in practice dirtying all cached memory in the CPU, and reading back the process state for the new process. It will (likely) have nothing still in the CPU cache from the last time it ran, so each memory read will be a cache miss and need to be fetched from RAM. This is rather slow. When I was at university, I "invented" (well, I did come up with the idea, knowing that there is plenty of die area in a CPU, but not enough cooling if it's constantly powered) a cache that was infinite in size although unpowered when unused (i.e. only used on context switches) in the CPU, and implemented this in Simics. I implemented support for this magic cache, which I called CARD (Context-switch Active, Run-time Drowsy), in Linux, and benchmarked it rather heavily. I found that it could speed up a Linux machine with lots of heavy processes sharing the same core by about 5%. This was at relatively short (low-latency) process time slices, though.
Anyway. A context switch is still pretty heavy, while a kernel trap is basically free.
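To get a feel for how cheap a trap is on a modern Linux box, a rough micro-benchmark like the one below can be used (a sketch, not a rigorous measurement). syscall(SYS_getpid) forces a real kernel entry and bypasses any libc-level caching; the numbers will vary with CPU and kernel.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void)
    {
        const long N = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            syscall(SYS_getpid);                   /* user -> kernel -> user */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.0f ns per trap\n", ns / N);     /* typically well under 1 us */
        return 0;
    }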
Answer to at which memory location in user-space, for each process:
At address zero. Yep, the null pointer! You can't read from this entire page from user-space anyway :) This was back in 2005, but it's probably the same now unless the CPU state information has grown larger than a page size, in which case they might have changed the implementation.

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking, why doesn't Intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is, every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.
That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different from that of the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks off the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000 GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional role of the OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy instruction at any time, without regard for what other processes have done or will do. The core must have some limit on the number of async memcpy operations in flight at any one time, so when the next process comes along, its memcpy may be at the end of an arbitrarily long backlog. The async copy lacks any kind of determinism, and developers would simply fall back to the old-fashioned synchronous copy.
3) Cache locality has a first-order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient, since at least the destination buffer remains local to the core's L1. In the case of a network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The async memcpy instruction must return some sort of token that identifies the copy for later use, to ask whether the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kinds of operations are better handled by software than by core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.
Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth, which can be a bottleneck in many systems. As hinted by srking's answer, reading the data into the CPU's cache and then copying it around there can be a lot more efficient than memory-to-memory DMA. Even though the DMA may appear to work in the background, there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero-copy architecture where the buffer is shared between the application and the driver/hardware. That is, incoming network data is read directly into preallocated buffers and doesn't need to be copied, and outgoing data is read directly out of the application's buffers by the network hardware. I've seen this done in embedded/real-time network stacks.
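For the embedded case, memory-to-memory DMA is often directly available. Below is a hedged sketch for an STM32F1-class part (CMSIS header assumed; register bit values written out explicitly as this is an illustration, not a driver): the CPU sets up the transfer and could do other work while it runs, but as noted above the DMA still contends with the CPU for bus and memory bandwidth.

    #include "stm32f10x.h"

    static void dma_memcpy_words(uint32_t *dst, const uint32_t *src, uint32_t nwords)
    {
        RCC->AHBENR |= RCC_AHBENR_DMA1EN;            /* clock the DMA controller   */

        DMA1_Channel1->CCR   = 0;                    /* disable before configuring */
        DMA1_Channel1->CPAR  = (uint32_t)src;        /* "peripheral" = source      */
        DMA1_Channel1->CMAR  = (uint32_t)dst;        /* memory = destination       */
        DMA1_Channel1->CNDTR = nwords;

        DMA1_Channel1->CCR = (1u << 14)              /* MEM2MEM mode               */
                           | (2u << 10) | (2u << 8)  /* 32-bit memory/peripheral   */
                           | (1u << 7)  | (1u << 6)  /* increment both addresses   */
                           | (1u << 0);              /* EN: start the transfer     */

        while ((DMA1->ISR & DMA_ISR_TCIF1) == 0) {   /* ...or do useful work here  */
        }
        DMA1->IFCR = DMA_IFCR_CTCIF1;                /* clear the completion flag  */
    }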
Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?