Choosing between double buffer and ring buffer? - embedded

I need to decode a packet that is received through the UART of a microcontroller (the firmware must be bare-metal, no RTOS support). The packet is 32 bytes long and is sent every 10 milliseconds (continuously, without any pause).
I need to do very minimal processing in the ISR (to keep the ISR short) and do the deferred processing in the main() loop. Two approaches come to mind:
1. Use an interrupt-safe ring buffer, with the ISR writing into the buffer and the main() loop reading from it. The head and tail pointers are assumed to be types that my architecture can access atomically, so that the buffer is interrupt-safe. See a sample implementation here.
2. Use a double-buffering scheme (ping-pong buffer), where the main() loop processes one buffer while the ISR writes into the other. Assume that I can atomically swap the pointer to the ISR buffer, so that the critical-section problem is avoided.
The UART can generate an RX-FIFO-not-empty interrupt, and DMA support is available.
Which is the better data structure to use here?
What are the tradeoffs involved?

A double buffer is just a special kind of ring buffer with only two slots that are exchanged between producer and consumer. If your processing times don't vary much, it should be enough. A ring buffer can be helpful if input rates or processing times vary, but then you would most likely need some flow control to slow down input rate when processing can't keep up.
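For reference, a minimal single-producer/single-consumer ring buffer along the lines of option 1 could look something like the sketch below (names are made up; it assumes 8-bit index variables are read and written atomically on your MCU, and a power-of-two size keeps the index arithmetic cheap):

    #include <stdint.h>
    #include <stdbool.h>

    #define RB_SIZE 128u                      /* power of two, holds several 32-byte packets */

    static volatile uint8_t rb_buf[RB_SIZE];
    static volatile uint8_t rb_head;          /* written only by the ISR    */
    static volatile uint8_t rb_tail;          /* written only by main()     */

    /* Called from the UART RX ISR: returns false (byte dropped) if main() is too slow. */
    static inline bool rb_put(uint8_t byte)
    {
        uint8_t next = (uint8_t)((rb_head + 1u) & (RB_SIZE - 1u));
        if (next == rb_tail) {
            return false;                     /* buffer full */
        }
        rb_buf[rb_head] = byte;
        rb_head = next;                       /* single 8-bit store publishes the byte */
        return true;
    }

    /* Called from the main() loop. */
    static inline bool rb_get(uint8_t *byte)
    {
        if (rb_tail == rb_head) {
            return false;                     /* buffer empty */
        }
        *byte = rb_buf[rb_tail];
        rb_tail = (uint8_t)((rb_tail + 1u) & (RB_SIZE - 1u));
        return true;
    }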

Related

Freertos and the necessity of uart transmit interrupt

For UART reception, it's pretty obvious to me what can go wrong with a 'blocking receive' over UART. Even in FreeRTOS, with a dedicated task reading from the UART, context/task switching could result in missing bytes that were received by the UART peripheral.
But for transmission I am not really sure there is a need for an interrupt-based approach. I transmit from a task, and in my design it's no problem if that task is blocked for a short while (it also blocks/sleeps on mutexes, for example).
Is there another strong argument to use UART transmit in interrupt mode? I am not risking any loss of data, right?
In my case I use an STM32, but I guess the type of MCU is not really relevant here.
Let's focus on TX only and assume that we don't use interrupts and handle all the transmission with the tools provided by the RTOS.
µC UART hardware generally has a transmit shift register (TSR) and some kind of data register (DR). The software loads the DR, and if the TSR is empty, the DR is instantly transferred into the TSR and TX begins. The software is then free to load another byte into the DR, and the hardware moves the new byte from DR to TSR whenever the TX (shift-out) of the previous byte finishes. The hardware provides status bits for querying the state of the DR and TSR. This way, the software can use a polling method and still achieve continuous transmission with no gaps between the bytes.
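As a sketch of that polling approach (uart_dr_empty() and uart_write_dr() are placeholders, not any particular vendor's API):

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholders - substitute your MCU's actual status-flag and data-register access. */
    extern int  uart_dr_empty(void);          /* "data register empty" status bit */
    extern void uart_write_dr(uint8_t byte);  /* load the data register           */

    /* Gap-free blocking transmit by polling: the hardware moves each byte from the
       DR into the TSR as soon as the previous byte has been shifted out. */
    void uart_send_polled(const uint8_t *data, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            while (!uart_dr_empty()) {
                /* busy-wait: roughly 87 µs per byte at 115200 baud, 8-N-1 */
            }
            uart_write_dr(data[i]);
        }
    }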
I'm not sure if the hardware configuration I described above holds for every µC. I have experience with 8 & 16-bit PICs and STM32 F0, F1, F4 series. They are all similar. UART hardware doesn't provide additional hardware buffers.
Now, back to the RTOS... Obviously, your TX task needs to poll the UART status bits. If we assume the UART baud rate is 115200 (which is a common value), you waste ~90 µs polling for each byte. The general rule in an RTOS is that if you are waiting for something to happen, your task should be blocked so other tasks can run. But block on what? What will tell you when to unblock? For this you need interrupts. Your task blocks on a task notification (ulTaskNotifyTake()), and the interrupt gives the notification using vTaskNotifyGiveFromISR().
So, I can't imagine any way of doing this without interrupts. But the method mentioned above isn't good either: it makes no sense to block and unblock for every single byte.
There are 2 possible solutions:
Move TX handling completely into the interrupt handler (ISR), and notify the task when TX is complete.
Use DMA instead! Almost all modern 32-bit µCs have DMA support. DMA generates a single interrupt when the TX is completed. You can notify the task from the DMA transfer complete interrupt.
In this answer I've focused on TX, but using DMA is the proper way of handling reception (RX) too.
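A rough sketch of the DMA + task-notification idea with FreeRTOS (only the notification plumbing is shown; the DMA setup, the handler name and tx_task_handle are assumptions for illustration):

    #include "FreeRTOS.h"
    #include "task.h"

    static TaskHandle_t tx_task_handle;          /* set when the TX task is created */

    /* TX task: start a DMA transfer, then sleep until the ISR signals completion. */
    static void uart_tx_task(void *arg)
    {
        (void)arg;
        for (;;) {
            /* ...prepare the buffer and start the UART TX DMA transfer (vendor-specific)... */
            ulTaskNotifyTake(pdTRUE, portMAX_DELAY);   /* block: no CPU wasted on polling */
            /* ...transfer finished, the buffer can be reused... */
        }
    }

    /* DMA "transfer complete" interrupt handler (actual name is target-specific). */
    void dma_uart_tx_complete_isr(void)
    {
        BaseType_t woken = pdFALSE;
        /* ...clear the DMA interrupt flag (vendor-specific)... */
        vTaskNotifyGiveFromISR(tx_task_handle, &woken);
        portYIELD_FROM_ISR(woken);               /* switch to the TX task if it is now ready */
    }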

Why are FIFO One-quarter full, Half-full, three-quarter full interrupts provided in a UART RX FIFO? What are their use cases?

I am implementing a protocol decoder which receives bytes through the UART of a microcontroller. The ISR takes bytes from the UART peripheral and puts them in a ring buffer. The main loop reads from the ring buffer and runs a state machine to decode it.
The UART internally has a 32-byte receive FIFO and provides interrupts when this FIFO is quarter-full, half-full, three-quarter full and completely full. How should I determine which of these interrupts should trigger my ISR? What is the tradeoff involved?
Note: the protocol involves 32-byte packets (fixed length), sent every 10 ms.
This depends on a lot of things, most of all the maximum baudrate supported, and how much time your application needs for executing other tasks.
Traditional ring buffers work on a byte-per-byte interrupt basis. It is of course always nice to reduce the number of interrupts, but it probably doesn't matter much which threshold you let it trigger on.
It is much more important to implement a double-buffer scheme. You should of course not run the decoding state machine straight out of the single ring buffer that the ISR is still filling; that will turn into a race-condition nightmare.
Your main program should take the semaphore / disable the UART interrupt, copy out the whole buffer, then re-enable the interrupt. Ideally the "copy" is done by swapping a pointer rather than doing a hard copy. The code doing this needs to be benchmarked to complete in less than 1/baudrate * 10 seconds, where 10 is 1 start + 8 data + 1 stop bit, assuming the UART is configured as 8-N-1.
If available, use DMA over software ring buffers.
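A minimal sketch of the pointer-swap idea from the answer above (hypothetical names; __disable_irq()/__enable_irq() stand in for whatever interrupt masking your MCU provides, and masking just the UART interrupt would be even better):

    #include <stdint.h>

    #define PKT_LEN 32u

    /* The UART ISR writes isr_buf[isr_count++] for each received byte (not shown). */
    static uint8_t buf_a[PKT_LEN], buf_b[PKT_LEN];
    static uint8_t * volatile isr_buf  = buf_a;   /* the ISR fills this one            */
    static uint8_t *          main_buf = buf_b;   /* main() decodes this one           */
    static volatile uint8_t   isr_count;          /* bytes collected so far by the ISR */

    /* Called from the main() loop: swap buffers inside a very short critical section.
       Returns the buffer holding a complete packet, or NULL if none is ready yet. */
    static const uint8_t *get_full_packet(void)
    {
        const uint8_t *ready = NULL;
        __disable_irq();                          /* or mask just the UART interrupt */
        if (isr_count >= PKT_LEN) {
            uint8_t *tmp = (uint8_t *)isr_buf;    /* swap the two pointers...        */
            isr_buf   = main_buf;
            main_buf  = tmp;
            isr_count = 0;
            ready = main_buf;                     /* ...and hand back the full one   */
        }
        __enable_irq();
        return ready;
    }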
Given a packet based protocol and a UART that interrupts when more than one byte has been received, consider what should happen if the final byte of a packet is received but that final byte isn't enough to fill the FIFO past the threshold and trigger an interrupt. Is your application simply not going to receive that incomplete packet until some subsequent packet is received and the FIFO finally fills enough? What if the other end is waiting for a response and never sends another packet? Or is your application supposed to poll the UART to check for lingering bytes remaining in the UART FIFO? That seems overly complicated to both use an interrupt and poll for received bytes.
With the packet-based protocols I have implemented, the UART driver does not rely on the UART FIFO and configures the UART to interrupt when a single byte is available. This way the driver gets notified for every byte and there is no chance for the final byte of a packet to be left lingering in the UART's FIFO.
The UART's FIFO can be convenient for streaming protocols (such as audio or video data). When the driver is receiving a stream of data then there will always be incoming data to keep filling the FIFO. The driver can rely on the UART's FIFO to buffer some data. The driver can be more efficient by processing multiple bytes per interrupt and reducing the interrupt rate.
You might consider using the UART FIFO since your packets are a fixed length. But consider how the driver would recover if a single byte is dropped due to noise or whatever. I think it's still best to not rely on the FIFO for packet-based protocols regardless of whether the packets are fixed length.

Z80 Multibyte Commands in IM0

I'm trying, just for fun, to design a more complex Z80 CP/M system with a lot of peripheral devices. While reading the documentation I stumbled over an (undocumented?) behaviour of the Z80 CPU when accepting an interrupt in IM0.
When an interrupt occurs, the Z80 activates M1 and IORQ to signal the external device: "Hey, give me an opcode". All is well if the opcode is rst 00 or something like that. But the documentation says that ANY opcode of any instruction can be given to the CPU, for instance a CALL.
But now comes the undocumented part: "The first byte of a multi-byte instruction is read during the interrupt acknowledge cycle. Subsequent bytes are read in by a normal memory read sequence."
A "normal memory read sequence". How can I determine, if the CPU wants to get a byte from memory or instead the next byte from the device?
EDIT: I think, I found a (good?) solution: I can dectect the start of the interrupt acknowlegde cycle by analyzing IORQ and M1. Also I can detect the next "normal" opcode fetch by analyzing MREQ and M1. This way I can install a flip-flop triggered by these two ANDed signals, i.e. the flip-flop is 1 as long as the CPU reads data from the io-device. This 1 I can use to inhibit the bus drivers to and from the memory.
My intentions? I'm designing an interrupt controller with 8 prioritized inputs in a CPLD. It's registers hold a 16 bit address for each interrupt pin. Just for the fun :-)
My understanding is that the peripheral device is required:
to know how many bytes it needs to feed;
to respond to normal read cycles following the IORQ cycle; and
to arrange that whatever would normally respond to memory read cycles does not do so for the duration.
Also the behaviour was documented by Zilog in an application note, from which your quote originates (presumably uncredited).
In practice I guess 99.99% of IM0 users just use an RST and 99.99% of the rest use a known-size instruction like CALL xxxx.
(also I'm aware of a few micros that effectively guaranteed not to put anything onto the bus during an interrupt cycle, thereby turning IM0 into a synonym of IM1 owing to open collector output).
The interrupt behavior is reasonably documented in the Z80 manual:
Interrupt modes: IM2 lets the device supply an 8-bit value that is combined with the I register to form a 16-bit pointer into a table of interrupt vectors. At least halfway to the desired 16-bit direct address.
How to set the interrupt modes
My understanding is that the M1 + IORQ combination is used because there was no pin left for a dedicated interrupt-acknowledge signal. A fun detail is also that the Zilog I/O chips like the PIO, SIO and CTC watch for the RETI instruction (as the CPU fetches it) to learn that the CPU is ready to accept another interrupt.

How to keep interrupts short?

The most heard advice in embedded programming is "keep your interrupts short".
Now my situation is that I have a very long-running task in my main() loop (writing large blocks of data to an SD card), which can sometimes take 100 ms. So to keep my system responsive I moved all the other work into interrupt handlers.
For example, normally one would receive the incoming UART data in an interrupt, then process the incoming command in the main() loop, and then send back the response. But in my case, the whole processing/handling of the commands also takes place in the interrupts, because my main() loop can be blocked for (relatively) long periods.
The optimal solution would be to switch to an RTOS but I don't have the RAM for it. Are there alternatives for my design where the interrupts can be short?
The traditional approach for this is for interrupts to schedule a deferred procedure call (DPC) and end the interrupt as soon as possible.
Once the interrupt has finished, the list of deferred procedures is walked from most-important to least important.
Consider the case where you have your main (lower priority) action, and two interrupts I1 and I2, where I2 is more important than main, but less important than I1.
In this case, let's suppose you're running main and I1 fires. I1 schedules a deferred procedure and signals to the hardware that I1 is done. I1's DPC now begins running. Suddenly I2 comes in from the hardware. I2's interrupt takes over from I1's DPC and schedules I2's DPC and signals to the hardware that it's done.
The scheduler then returns to I1's DPC (because it is more important), and when I1's DPC completes, I2's DPC begins (because it is more important than main), and then eventually returns execution to main.
This design allows you to schedule the importance of different interrupts, encourages you to keep your interrupts small, and allows you to complete DPCs in an ordered and in-order prioritized way.
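A bare-bones sketch of such a DPC table on a bare-metal target (all names invented for illustration; a real implementation also has to worry about atomic flag access and interrupt nesting on the particular CPU):

    #include <stdbool.h>

    typedef void (*dpc_fn)(void);

    /* Example deferred handlers, ordered from highest to lowest priority. */
    static void i1_deferred_work(void) { /* e.g. drain the UART ring buffer */ }
    static void i2_deferred_work(void) { /* e.g. handle a timer event       */ }

    #define NUM_DPCS 2
    static volatile bool dpc_pending[NUM_DPCS];
    static dpc_fn const  dpc_table[NUM_DPCS] = { i1_deferred_work, i2_deferred_work };

    /* Called from an ISR: flag the work and return as quickly as possible. */
    static inline void dpc_schedule(int idx) { dpc_pending[idx] = true; }

    /* Called from the main loop: always run the most important pending DPC first. */
    static void dpc_run_pending(void)
    {
        for (int i = 0; i < NUM_DPCS; i++) {
            if (dpc_pending[i]) {
                dpc_pending[i] = false;
                dpc_table[i]();      /* run to completion...        */
                i = -1;              /* ...then rescan from the top */
            }
        }
    }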
There are 100 different ways to skin this cat, depending on CPU architecture (interrupt nesting & prioritization, software interrupt support, etc.) but let's take a pretty straightforward approach that is relatively simple to understand and free from the race conditions and resource-sharing hazards of a preemptive kernel.
(Disclaimer: my first choice is typically a preemptive real time kernel, many of them can run in extremely resource-constrained systems... SecurityMatt's suggestion is good but if you're not comfortable implementing your own preemptible kernel / task switcher, particularly one that handles asynchronous (interrupt-triggered) preemption, you can get wrapped around the axle pretty quickly. So what I'm proposing below is not as responsive as a preemption-based kernel, but it's much simpler and often adequate).
Create 3 event/work queues:
Q1 is the lowest priority and handles your slow, background SD card writes
Q2 holds requests to process incoming UART packets
Q3 (highest priority) holds UART RX FIFO read requests.
I split up the UART RX FIFO reading and the processing of the read packet so that the FIFO reading is always serviced ahead of the packet processing; maybe you want to keep them together, your choice.
For this to work, you break your large (~100ms) SD card write process into a bunch of smaller, discrete, run to completion steps.
So for example, to write 5 blocks, 20ms each, you write the first block, then enqueue "write next block" to Q1. You go back to your scheduler at the end of each step & scan the queues in priority order, starting with Q3. If Q2 and Q3 are empty, you pull the next event off of Q1 ("write next block"), and run that command for another 20ms before returning and scanning the queues again. If 20ms is not responsive enough, you break up each 20ms block write into a more fine-grained set of steps, continually posting to Q1 the next work step.
Now for the incoming UART stuff: in the UART RX ISR, you simply enqueue a "read UART FIFO" command in Q3, and return from interrupt back into the 20ms "write block" step that was interrupted. As soon as the CPU finishes the write, it goes back and scans the queues in priority order (the worst-case response will be 20ms if the block write had just begun at the time of the interrupt). The queue scanner (scheduler) will see that Q3 now has work to do, and it will run that command before going back and scanning again.
The responsiveness in your system, worst case, will be determined by the longest run-to-completion step in the system, regardless of priority. You keep your system very responsive by doing work in small, discrete, run to completion steps.
Note that I have to speak in generalities here. Maybe you want to read the UART RX FIFO in the ISR, put the data into a buffer, and only defer the packet processing, not the actual reading of the FIFO (then you'd only have 2 queues). You have to work this out for yourself. But I hope the approach makes sense.
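In code, the cooperative scanner is little more than this (a sketch with a made-up queue API; each queue could be a small ring buffer of events):

    #include <stdbool.h>

    /* Made-up event and queue API - each queue could be a small ring buffer of these. */
    typedef struct { void (*handler)(void *arg); void *arg; } event_t;
    extern bool q_pop(int queue_id, event_t *ev);   /* Q3 = 2, Q2 = 1, Q1 = 0 */

    int main(void)
    {
        for (;;) {
            event_t ev;
            if      (q_pop(2, &ev)) { ev.handler(ev.arg); }   /* UART RX FIFO reads  */
            else if (q_pop(1, &ev)) { ev.handler(ev.arg); }   /* packet processing   */
            else if (q_pop(0, &ev)) { ev.handler(ev.arg); }   /* next SD write step  */
            /* else: idle - optionally sleep until the next interrupt */
        }
    }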
This event-driven approach with prioritized queues is exactly the approach used by the Quantum Platform (QP) event-driven framework. The QP supports either an underlying non-preemptive (cooperative) scheduler, such as the one described here, or a preemptive scheduler which runs the scheduler each time an event is queued (similar to the approach suggested by SecurityMatt). You can see the code/implementation of the QP's cooperative scheduler at the QP website.
An alternative solution would be as follows:
Anywhere the FAT library can hold the processor for a long time, you insert a call to a new function which is normally very fast and returns to the caller after a few machine cycles. Such a fast function does not hurt the real-time performance of your time-consuming operation, such as reading from or writing to SD flash. You would insert the call in any loop that waits for a flash sector to be erased, and also in between every 512 bytes written or read.
The goal of that function is to perform most of the work that you would normally have inside the "while(1)" loop of a typical "main()" for an embedded device. It first increments an integer and performs a fast modulo on the new value, then returns if the result is not equal to an arbitrary constant. The code is as follows:
void premption_check(void)
{
    static int fast_modulo = 0;
    //divide down the number of calls
    fast_modulo++;
    if( (fast_modulo & 0x003F) != 3 )
    {
        return;
    }
    //the processor continues here only once every 64 calls to "premption_check"
    //...service the serial port, process completed commands, and so on (see below)...
}
At that point, you call the functions that extract RS-232 characters/strings from the serial-port interrupt buffers, process any command once a complete string has been received, and so on.
The binary mask 0x3F used above means that we look only at the 6 least significant bits of the counter. When these 6 bits happen to equal the arbitrary value 3, we go ahead with the calls to functions which may take some microseconds or even milliseconds to execute. You may want to try a smaller or larger mask depending on how quickly you want to service the serial port and the other operations. You may even use more than one mask simultaneously, to service some operations more often than others.
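For example, inside the FAT/SD driver the hook might be dropped into a busy-wait loop like this (sd_card_busy() is a made-up placeholder for your driver's real status check):

    //placeholder for the driver's real "erase still in progress" check
    extern int sd_card_busy(void);

    void wait_for_sector_erase(void)
    {
        while (sd_card_busy())
        {
            premption_check();    //let the serial port and command processing get serviced
        }
    }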
The FAT library and the SD card should not experience any problems when sporadic delays happen between two flash erase operations, for example.
The solution given here works even on a microcontroller with only 2K bytes of RAM, like many variants of the 8051. As incredible as it may seem, the pinball machines of the 1980s and 1990s had a few K of RAM and slow processors (around 10 MHz), and they were able to scan a hundred switches (fully debounced), update an X/Y matrix display, produce sound effects, etc. The solutions developed by those engineers can still be used to boost the performance of large systems: even on the best servers with 64 GB of RAM and many terabytes of disk, I presume every byte counts when a company wants to index billions of web pages.
As no-one has suggested coming at it from this end yet I'll throw it in the hat:
It's possible that sticking the SD card service routine in a low-priority interrupt, maybe throwing in some DMA if you can, would free up your main loop and other interrupts to be more responsive, rather than being stuck in the main() loop waiting a long time for something to finish.
The caveat to this is I don't know if the hardware has any way of triggering the interrupt when the SD card is ready for more, you might have to cheat by running a polling timer to check & force the interrupt. I'm not above that sort of thing though, if you have spare hardware timers & interrupts it can be done with very little overhead.
Resorting to an RTOS for something like this would seem overkill & an admission of failure to me... ;)

Frustrating FreeRTOS xQueueCreate() limitation

I'm trying to use a queue to buffer characters from my UART ISR to a background task. I want the queue to be 512 bytes long. This is unfortunately impossible, because the type of the size argument is unsigned portBASE_TYPE, which for the xmega256a3 is a single byte (char). Is there a reason the maximum size of a queue floats with portBASE_TYPE rather than being a uint16_t?
I'm curious if others have hit the same limitation, and what, if anything, they've done about it.
Richard Barry (FreeRTOS author) posted the following response on the FreeRTOS mailing list:
This is only the case on 8-bit architectures. It has been mentioned a few times (you can search the support archive on the FreeRTOS site), but not for years as most new projects are using 32-bit architectures. The simple thing to do is change the definition of portBASE_TYPE in portmacro.h, but it will make your code larger and less efficient.
As an aside, many of the FreeRTOS demos use queues to pass characters into and out of interrupts to provide a simple example of tasks and interrupts communicating, but unless the throughput is very low (a command console for example), it is not the recommended way of writing production code. Using circular buffers, preferably with a DMA, is much more efficient.
It's natural to use portBASE_TYPE for the majority of variables for efficiency reasons. The AVR is an 8 bit architecture and so will be more efficient dealing with 8 bit queue arithmetic than 16 bits. For some applications this efficiency may be critical.
Using a uint16_t doesn't make sense on 32 bit architectures and you'll note that the portBASE_TYPE for ARM cores is a 32 bit value, so choosing a uint16_t as the default type of queue length would be an artificial restriction on these cores.
Here are some options:
Refactor your tasks to read from the queue more often. Unless other tasks are stealing too much processing time, it should be possible to lower your ISR queue length and buffer the data in your reading thread.
Recompile FreeRTOS with a different portBASE_TYPE. I haven't tried this but I don't see a reason why this wouldn't work unless there was some assembler code in FreeRTOS which expected an 8 bit portBASE_TYPE. I had a quick look and didn't see any obvious signs of the assembler code expecting 8 bit types.
Use your own queuing library that has the capability to store as much data as you need. Use other FreeRTOS primitives such as a semaphore to signal to your task that data has been added to your queue. Instead of your task blocking on a queue read, it would block on a semaphore. Upon the semaphore being signalled, you'd use your own queuing library to read queued data.
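A sketch of that third option (the ring-buffer calls are placeholders for your own queuing code; the FreeRTOS calls are the standard binary-semaphore API):

    #include "FreeRTOS.h"
    #include "task.h"
    #include "semphr.h"

    static SemaphoreHandle_t rx_sem;    /* created at startup: rx_sem = xSemaphoreCreateBinary(); */

    /* UART RX ISR: put the byte into your own 512-byte ring buffer, then signal the task. */
    void uart_rx_isr(void)
    {
        BaseType_t woken = pdFALSE;
        /* my_ringbuf_put(uart_read_data_register());   <- your own queuing library */
        xSemaphoreGiveFromISR(rx_sem, &woken);
        portYIELD_FROM_ISR(woken);
    }

    /* Background task: block on the semaphore instead of on a FreeRTOS queue,
       then drain everything that has accumulated in the private buffer. */
    void uart_rx_task(void *param)
    {
        (void)param;
        for (;;) {
            xSemaphoreTake(rx_sem, portMAX_DELAY);
            /* while (my_ringbuf_get(&byte)) { feed the byte to the protocol decoder } */
        }
    }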