Frustrating FreeRTOS xQueueCreate() limitation

Frustrating FreeRTOS xQueueCreate() limitation - embedded

I'm trying to use a queue to buffer characters from my UART ISR to a background task. I want the queue to be 512 bytes long. This is unfortunately impossible, because the type of the size argument is unsigned portBASE_TYPE which for the xmega256a3 is a single byte (char). Is there a reason the maximum size of a queue floats with portBASE_TYPE? Rather than uint16_t?
I'm curious if others have hit the same limitation, and what, if anything, they've done about it.

Richard Barry (FreeRTOS author) posted the following response on the FreeRTOS mailing list:
This is only the case on 8-bit architectures. It has been mentioned a few times (you can search the support archive on the FreeRTOS site), but not for years as most new projects are using 32-bit architectures. The simple thing to do is change the definition of portBASE_TYPE in portmacro.h, but it will make your code larger and less efficient.
As an aside, many of the FreeRTOS demos use queues to pass characters into and out of interrupts to provide a simple example of tasks and interrupts communicating, but unless the throughput is very low (a command console for example), it is not the recommended way of writing production code. Using circular buffers, preferably with a DMA, is much more efficient.

It's natural to use portBASE_TYPE for the majority of variables for efficiency reasons. The AVR is an 8 bit architecture and so will be more efficient dealing with 8 bit queue arithmetic than 16 bits. For some applications this efficiency may be critical.
Using a uint16_t doesn't make sense on 32 bit architectures and you'll note that the portBASE_TYPE for ARM cores is a 32 bit value, so choosing a uint16_t as the default type of queue length would be an artificial restriction on these cores.
Here's some options:
Refactor your tasks to read from the queue more often. Unless other tasks are stealing too much processing time, it should be possible to lower your ISR queue length and buffer the data in your reading thread.
Recompile FreeRTOS with a different portBASE_TYPE. I haven't tried this but I don't see a reason why this wouldn't work unless there was some assembler code in FreeRTOS which expected an 8 bit portBASE_TYPE. I had a quick look and didn't see any obvious signs of the assembler code expecting 8 bit types.
Use your own queuing library that has the capability to store as much data as you need. Use other FreeRTOS primitives such as a semaphore to signal to your task that data has been added to your queue. Instead of your task blocking on a queue read, it would block on a semaphore. Upon the semaphore being signalled, you'd use your own queuing library to read queued data.

Related

Will semaphore corrupt data transmission of peripherals like UART in a microcontroller?

Semaphore disables interrupts and so will this cause other operations like receiving data on SPI to get corrupt?

Disabling interrupts cannot corrupt the data on the hardware interface.
The problem is if the data is received by the hardware peripheral and then the it raises an interrupt to have the processor collect the data then this will be delayed. If it is delayed for too long then potentially more data will have been received. Depending on the peripheral, either the new data or the old data will have to be discarded. Either way stream of data will be incomplete.
In most cases it is difficult to predict or test how long it is safe to disable interrupts for, so if possible it is best to avoid turning interrupts off.
If the peripheral includes a FIFO buffer, then the length of time that it is safe to disable interrupts for may be increased (although still difficult to predict).
Most modern microcontrollers have many ways to avoid disabling interrupts:
A better approach is to have the peripheral transfer the data to memory with DMA, so no interrupt is required at all.
Most modern processor cores provide ways to implement a semaphore do not even need to disable interrupts.

There's no standard way of implementing a semaphore. To disable all interrupts on the MCU is one way to do it, but it's a very poor amateur way of doing so. Because in more complex applications with multiple interrupts, this will make all real-time considerations and calculations a nightmare.
It creates subtle but severe bugs. Particularly when some quack has done so from deep inside some driver code. You import the driver into your project and suddenly previously working code breaks. In particular, be very careful about using various libs provided by silicon vendors - they are often of very poor quality.
There are better ways to do it, including:
Ensuring atomic access of shared variables, which can only be done with inline assembler or C11 _Atomic if supported.
Disabling one specific interrupt for a specific hardware peripheral, if it is possible to do do given the real-time considerations. Then this should be handled by the driver for that hardware peripheral in the form of setter/getter functions.
Use a "poor man's semaphore" in the form of a plain flag variable, by relying on the interrupt mechanism of the MCU blocking all other interrupts while the ISR is executing. Example.

Is uint16_t and uint32_t interrupt safe in Cortex M architecture?

I am working on some embedded stuff. I had multiple interrupts possibly working on same data and so I was wondering if uint16_t and uint32_t data types are interrupt safe.
If interrupt is working a uint16/32_t data and is halfway interrupted by another interrupt that is trying to read this data, it will see corrupted data. Is this a possible scenario?
Thanks

To expand on the answer from #DinhQC, all single-result instructions on 16- and 32-bit data types are 'atomic' with respect to interrupts on the Cortex-M as long as the data is properly aligned (and you have to try quite hard to get the C compiler to give you unaligned data, because unaligned accesses are slow and need special treatment). Multiple-result operations like LDM and STM can be interrupted and resumed, on most implementations, but the integrity of each individual 32-bit transfer within the LDM or STM is guaranteed.
The important thing is to understand whether the operations you're performing are single instructions at the machine level or not. If you're incrementing a shared variable, for example, this will take three instructions: a read, a modify, and a write. If an interrupt occurs in between the read and the write, and the interrupt service routine modifies the same variable, this modification will be overwritten when the ISR returns.
The safe way to go is to use some kind of hardware-supported mechanism to enforce atomicity or mutual exclusion over your shared data. There are more powerful, more flexible and faster approaches to mutual exclusion on the Cortex-M than disabling and re-enabling interrupts, though, notably the STREX and LDREX instructions (which are available in C too). Take a look at my answer to this other question for more information.

Cortex-M processors do not corrupt and give your data undefined value. The value will always be deterministic. However, there are many conditions that affect the value of the data in case of interrupts. The uint16/32_t data can be located in the memory, or only inside the processor registers. If in memory, it can be 16/32-bit aligned or not 16/32-bit aligned. The processor, e.g. M0 or M4, and the operation performed on the data, e.g. add or multiply, also matter. All of those will determine whether the instruction used to process the data is atomic or not.
You can find more details in this discussion and this answered by Joseph Yiu.
Generally speaking, if the instruction is atomic (single execution cycle), the interrupt cannot disturb the data operation. However, at your C code level, uint16/32_t data operation may take more than 1 instruction. Therefore, it is hard to guarantee that the program runs as expected. This also applies to uint8_t data. You may wish to disable interrupts before working on the shared data and enable interrupts afterwards. The technique is covered well in this answer (look at point 2).

Could a tight loop destroy cells of a microcontroller's flash?

It is well-known that Flash memory has limited write endurance, less so that reads could also have an upper limit such as mentioned in this Flash endurance test's Conclusion (3rd point).
On a microcontroller the code is typically stored in Flash, and is executed by fetching code words directly from the Flash cells.(at least this is most commonly so on 8 bit micros, some 32 bit micros might have some small buffer).
Depending on the particular code, it might happen that a location is accessed extremely frequently, such as if on the main execution path there is some busy loop, such as a wait for an interrupt (for example from a timer, synchronizing execution to a fixed interval).This could generate 100K or even more (read) accesses per second on average to a single Flash cell (depending on clock and the particular code).
Could such code actually destroy the cells of the Flash underneath it?(Is there any necessity to be concerned about this particular problem when designing code for microcontrollers? Such as part of a system which is meant to operate for years? Of course the Flash could be periodically verified by CRC, but that doesn't prevent the system failing if it happens, only that the failure will more likely happen in a controlled manner)

Only erasing/writing will affect the memory cells, not reading. You don't need to consider the number of reads when designing the program.
Programmed flash memory does age though, meaning that the value of the cells might not be reliable after a certain amount of years. This is known as data retention and depends mainly on temperature. MCU manufacturers typically specify a worse case in years, assuming that the part is kept in maximum specified ambient temperature.
This is something to consider for products that are expected to live long (> 10 years), particularly in environments where high temperatures can be expected. CRC and/or ECC is a good counter-measure against data retention, although if you do find that a cell has been corrupted, it typically just means that the application should shut down to a non-recoverable safe state.

I know of two techniques to approach this issue:
1) One technique is to set aside a const 32-bit integer variable in the system code. Then calculate a CRC32 checksum of the compiled binary image, and inserting the checksum into the reserved variable using an ELF-editor.
A module in the system software will then calculate a CRC32 over the flash area occupied by the application and compare to the "stored" value.
If you are using GCC, the linker can define a symbol to tell you where the segment stops. This method can detect errors but cannot correct them.
2) Another technique is to use a microcontroller that supports Flash ECC. TI sells Cortex-R4 MCUs which support Flash ECC (Hercules series).

I doubt that this is a practical concern. The article you cited vaguely asserts that this can happen but with no supporting evidence or quantification of the effect. There is a vague, unsupported and unquantified reference in the introduction:
[...] flash degrades over time from erasing/writing (or even just reading, although that decay is slower) [...]
Then again in the conclusion:
We did not check flash decay for reads, but reading also causes long term decay. It would be interesting to see if we can read a spot enough times to cause failure.
The author may be referring to read-disturbance in NAND flash, but microcontrollers do not use NAND flash for code storage/execution since it is not random-access. Read disturb is not a permanent effect, erasing and re-writing the affected block restores endurance. NAND controllers maintain read counts for blocks and automatically copy and erase blocks as necessary. They also employ ECC to detect and correct errors, and identify "write-worn" areas.
There is the potential for long-term "bit-rot" but I doubt that it is caused specifically by reading rather just ageing.
Most RTOS systems spend the majority of their processing time in a do-nothing idle loop, and run happily 24/7 365 days a year. Some processors support a wait-for-interrupt instruction that halts the CPU in the idle loop, but by no means all, and it is not uncommon not to use such an instruction. Processors with flash accelerators or caches may also prevent continuous rapid fetch from a single location, but again that is by no means all microcontrollers.

How to keep interrupts short?

The most heard advice in embedded programming is "keep your interrupts short".
Now my situation is that I have a very long running task in my main() loop (writing large blocks of data to SDcard), which can sometimes take 100ms. So to keep my system responsive I moved all other stuff to interrupt-handlers.
For example, normally one would handle the incoming UART data in an interrupt, then process the incoming command in the main() loop, and then send back the response. But in my case, the whole processing/handling of the commands also takes places in the interrupts, because my main() loop can be blocked for (relatively) long periods.
The optimal solution would be to switch to an RTOS but I don't have the RAM for it. Are there alternatives for my design where the interrupts can be short?

The traditional approach for this is for Interrupts to schedule a deferred procedure and end the interrupt as soon as possible.
Once the interrupt has finished, the list of deferred procedures is walked from most-important to least important.
Consider the case where you have your main (lower proiority) action, and two interrupts I1 and I2, where I2 is more important than main, but less important than I1.
In this case, let's suppose you're running main and I1 fires. I1 schedules a deferred procedure and signals to the hardware that I1 is done. I1's DPC now begins running. Suddenly I2 comes in from the hardware. I2's interrupt takes over from I1's DPC and schedules I2's DPC and signals to the hardware that it's done.
The scheduler then returns to I1's DPC (because it is more important), and when I1's DPC completes, I2's DPC begins (because it is more important than main), and then eventually returns execution to main.
This design allows you to schedule the importance of different interrupts, encourages you to keep your interrupts small, and allows you to complete DPCs in an ordered and in-order prioritized way.

There are 100 different ways to skin this cat, depending on CPU architecture (interrupt nesting & prioritization, software interrupt support, etc.) but let's take a pretty straightforward approach that is relatively simple to understand and free from the race conditions and resource-sharing hazards of a preemptive kernel.
(Disclaimer: my first choice is typically a preemptive real time kernel, many of them can run in extremely resource-constrained systems... SecurityMatt's suggestion is good but if you're not comfortable implementing your own preemptible kernel / task switcher, particularly one that handles asynchronous (interrupt-triggered) preemption, you can get wrapped around the axle pretty quickly. So what I'm proposing below is not as responsive as a preemption-based kernel, but it's much simpler and often adequate).
Create 3 event/work queues:
Q1 is the lowest priority and handles your slow, background SD card writes
Q2 holds requests to process incoming UART packets
Q3 (highest priority) holds UART RX FIFO read requests.
I split up the UART RX FIFO reading and the processing of the read packet so that the FIFO reading is always serviced ahead of the packet processing; maybe you want to keep them together, your choice.
For this to work, you break your large (~100ms) SD card write process into a bunch of smaller, discrete, run to completion steps.
So for example, to write 5 blocks, 20ms each, you write the first block, then enqueue "write next block" to Q1. You go back to your scheduler at the end of each step & scan the queues in priority order, starting with Q3. If Q2 and Q3 are empty, you pull the next event off of Q1 ("write next block"), and run that command for another 20ms before returning and scanning the queues again. If 20ms is not responsive enough, you break up each 20ms block write into a more fine-grained set of steps, continually posting to Q1 the next work step.
Now for the incoming UART stuff; in the UART RX ISR, you simple enqueue a "read UART FIFO" command in Q3, and return from interrupt back into the 20ms "write block" step that was interrupted. As soon as the CPU finishes the write, it goes back and scans the queues in priority order (worst case response will be 20ms if the block write had just begun at the time of the interrupt). The queue scanner (scheduler) will see that Q3 now has work to do, and it will run that command before going back and scanning again.
The responsiveness in your system, worst case, will be determined by the longest run-to-completion step in the system, regardless of priority. You keep your system very responsive by doing work in small, discrete, run to completion steps.
Note that I have to speak in generalities here. Maybe you want to read the UART RX FIFO in the ISR, put the data into a buffer, and only defer the packet processing, not the actual reading of the FIFO (then you'd only have 2 queues). You have to work this out for yourself. But I hope the approach makes sense.
This event-driven approach with prioritized queues is exactly the approach used by the Quantum Platform (QP) event-driven framework. The QP actually supports an underlying non-preemptive (cooperative) scheduler, such as what was described here, or a preemptive scheduler which runs the scheduler each an event is queued (similar to the approach suggested by SecurityMatt). You can see the code/implementation of the QP's cooperative scheduler over at QP website.

An alternative solution would be as follow:
Anywhere the FAT library can capture the processor for a long time, you insert a call to a new function which is normally very fast and return to the caller after a few machine cycles. Such fast function would not impact the real-time performance of your time consuming operation, such as reading/writing to SD Flash. You would insert such call in any loop that wait for a flash sector to be erased. You also insert a call to such function in between every 512 bytes written or 512 bytes read.
The goal of that function is to perform most of the task that you would normally have inside the "while(1)" loop in a typical "main()" for embedded device. It would first increment an integer and perform a fast modulo on the new value, then return if the modulo is not equal to an arbitrary constant. The code is as follow:
void premption_check(void)
{
static int fast_modulo = 0;
//divide the number of call
fast_modulo++;
if( (fast_modulo & 0x003F) != 3)
{
return;
}
//the processor would continue here only once every 64 calls to "premption_check"
Next, you call the functions that extract RS232 characters/strings from the serial port interrupts, process any command if complete strings are received, etc
The binary mask 0x3F used above means that we look only at the 6 least significant bits of the counter. When these 6 bits happen to be equal to the arbitrary value 5, when go ahead with the calls to functions which may take some micro-second or even milli-second to execute. You may want to try smaller or larger binary mask depending on the speed at which you want to service the serial port and other operations. You may even use simultaneously more than one mask to service some operation faster than other.
The FAT library and the SD card should not experience any problem when some sporadic delay happen in between two Flash erase operation, for example.
The solution given here works even with a micro-controller with only 2K byte, like many variant of 8051. As incredible as it may seems, the pinball machine of 1980 to 1990 had a few K of RAM, slow processors (like 10 MHz) and they where able to test one hundred switch... fully debounced, update a X/Y matrix display, produce sound effects, etc The solutions developed by these engineer can still be used to boost the performance of large system. Even with the best servers with 64 Gig RAM and many Terabyte of hard disk, I presume that any bytes count when some company want to index billions of WEB pages.

As no-one has suggested coming at it from this end yet I'll throw it in the hat:
It's possible that sticking the SD card service routine in a low-priority interrupt, maybe throwing in some DMA if you can, would free up your main loop & other interrupts to be more responsive, rather than being stuck in a main() loop waiting for longtime for something to finish.
The caveat to this is I don't know if the hardware has any way of triggering the interrupt when the SD card is ready for more, you might have to cheat by running a polling timer to check & force the interrupt. I'm not above that sort of thing though, if you have spare hardware timers & interrupts it can be done with very little overhead.
Resorting to an RTOS for something like this would seem overkill & an admission of failure to me... ;)

Off-chip memcpy?

I was profiling a program today at work that does a lot of buffered network activity, and this program spent most of its time in memcpy, just moving data back and forth between library-managed network buffers and its own internal buffers.
This got me thinking, why doesn't intel have a "memcpy" instruction which allows the RAM itself (or the off-CPU memory hardware) to move the data around without it ever touching the CPU? As it is every word must be brought all the way down to the CPU and then pushed back out again, when the whole thing could be done asynchronously by the memory itself.
Is there some architecture reason that this would not be practical? Obviously sometimes the copies would be between physical memory and virtual memory, but those cases are dwindling with the cost of RAM these days. And sometimes the processor would end up waiting for the copy to finish so it could use the result, but surely not always.

That's a big issue that includes network stack efficiency, but I'll stick to your specific question of the instruction. What you propose is an asynchronous non-blocking copy instruction rather than the synchronous blocking memcpy available now using a "rep mov".
Some architectural and practical problems:
1) The non-blocking memcpy must consume some physical resource, like a copy engine, with a lifetime potentially different than the corresponding operating system process. This is quite nasty for the OS. Let's say that thread A kicks of the memcpy right before a context switch to thread B. Thread B also wants to do a memcpy and is much higher priority than A. Must it wait for thread A's memcpy to finish? What if A's memcpy was 1000GB long? Providing more copy engines in the core defers but does not solve the problem. Basically this breaks the traditional roll of OS time quantum and scheduling.
2) In order to be general like most instructions, any code can issue the memcpy insruction any time, without regard for what other processes have done or will do. The core must have some limit to the number of asynch memcpy operations in flight at any one time, so when the next process comes along, it's memcpy may be at the end of an arbitrarily long backlog. The asynch copy lacks any kind of determinism and developers would simply fall back to the old fashioned synchronous copy.
3) Cache locality has a first order impact on performance. A traditional copy of a buffer already in the L1 cache is incredibly fast and relatively power efficient since at least the destination buffer remains local the core's L1. In the case of network copy, the copy from kernel to a user buffer occurs just before handing the user buffer to the application. So, the application enjoys L1 hits and excellent efficiency. If an async memcpy engine lived anywhere other than at the core, the copy operation would pull (snoop) lines away from the core, resulting in application cache misses. Net system efficiency would probably be much worse than today.
4) The asynch memcpy instruction must return some sort of token that identifies the copy for use later to ask if the copy is done (requiring another instruction). Given the token, the core would need to perform some sort of complex context lookup regarding that particular pending or in-flight copy -- those kind of operations are better handled by software than core microcode. What if the OS needs to kill the process and mop up all the in-flight and pending memcpy operations? How does the OS know how many times a process used that instruction and which corresponding tokens belong to which process?
--- EDIT ---
5) Another problem: any copy engine outside the core must compete in raw copy performance with the core's bandwidth to cache, which is very high -- much higher than external memory bandwidth. For cache misses, the memory subsystem would bottleneck both sync and async memcpy equally. For any case in which at least some data is in cache, which is a good bet, the core will complete the copy faster than an external copy engine.

Memory to memory transfers used to be supported by the DMA controller in older PC architectures. Similar support exists in other architectures today (e.g. the TI DaVinci or OMAP processors).
The problem is that it eats into your memory bandwidth which can be a bottleneck in many systems. As hinted by srking's answer reading the data into the CPU's cache and then copying it around there can be a lot more efficient then memory to memory DMA. Even though the DMA may appear to work in the background there will be bus contention with the CPU. No free lunches.
A better solution is some sort of zero copy architecture where the buffer is shared between the application and the driver/hardware. That is incoming network data is read directly into preallocated buffers and doesn't need to be copied and outgiong data is read directly out of the application's buffers to the network hardware. I've seen this done in embedded/real-time network stacks.

Net Win?
It's not clear that implementing an asynchronous copy engine would help. The complexity of such a thing would add overhead that might cancel out the benefits, and it wouldn't be worth it just for the few programs that are memcpy()-bound.
Heavier User Context?
An implementation would either involve user context or per-core resources. One immediate issue is that because this is a potentially long-running operation it must allow interrupts and automatically resume.
And that means that if the implementation is part of the user context, it represents more state that must be saved on every context switch, or it must overlay existing state.
Overlaying existing state is exactly how the string move instructions work: they keep their parameters in the general registers. But if existing state is consumed then this state is not useful during the operation and one may as well then just use the string move instructions, which is how the memory copy functions actually work.
Or Distant Kernel Resource?
If it uses some sort of per-core state, then it has to be a kernel-managed resource. The consequent ring-crossing overhead (kernel trap and return) is quite expensive and would further limit the benefit or turn it into a penalty.
Idea! Have that super-fast CPU thing do it!
Another way to look at this is that there already is a highly tuned and very fast memory moving engine right at the center of all those rings of cache memories that must be kept coherent with the move results. That thing: the CPU. If the program needs to do it then why not apply that fast and elaborate piece of hardware to the problem?

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas