How to keep interrupts short?


The most heard advice in embedded programming is "keep your interrupts short".
Now my situation is that I have a very long running task in my main() loop (writing large blocks of data to SDcard), which can sometimes take 100ms. So to keep my system responsive I moved all other stuff to interrupt-handlers.
For example, normally one would handle the incoming UART data in an interrupt, then process the incoming command in the main() loop, and then send back the response. But in my case, the whole processing/handling of the commands also takes place in the interrupts, because my main() loop can be blocked for (relatively) long periods.
The optimal solution would be to switch to an RTOS but I don't have the RAM for it. Are there alternatives for my design where the interrupts can be short?

The traditional approach for this is for interrupts to schedule a deferred procedure call (DPC) and end the interrupt as soon as possible.
Once the interrupt has finished, the list of deferred procedures is walked from most important to least important.
Consider the case where you have your main (lower priority) action, and two interrupts I1 and I2, where I2 is more important than main, but less important than I1.
In this case, let's suppose you're running main and I1 fires. I1 schedules a deferred procedure and signals to the hardware that I1 is done. I1's DPC now begins running. Suddenly I2 comes in from the hardware. I2's interrupt takes over from I1's DPC and schedules I2's DPC and signals to the hardware that it's done.
The scheduler then returns to I1's DPC (because it is more important), and when I1's DPC completes, I2's DPC begins (because it is more important than main), and then eventually returns execution to main.
This design allows you to rank the importance of different interrupts, encourages you to keep your interrupts small, and lets the DPCs complete in strict priority order.
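The bookkeeping behind this can be sketched in C. This is only a minimal illustration, not the scheduler of any particular kernel: the names dpc_register, dpc_schedule and dpc_run_pending, the 8-slot table and the two example DPCs are all assumptions made for the sketch. On real hardware dpc_run_pending would be invoked on the interrupt-exit path (for example from a lowest-priority software interrupt), and the update of the pending mask would need to be atomic with respect to other interrupts.

```c
#include <stdint.h>

#define DPC_COUNT 8u                     /* hypothetical: up to 8 deferred procedures */

typedef void (*dpc_fn)(void);

dpc_fn dpc_table[DPC_COUNT];             /* slot 0 = most important */
volatile uint8_t dpc_pending;            /* bit i set => dpc_table[i] is due */

/* Register a deferred procedure in a priority slot. */
void dpc_register(unsigned slot, dpc_fn fn)
{
    if (slot < DPC_COUNT) {
        dpc_table[slot] = fn;
    }
}

/* Called from an ISR: mark the work as pending and return immediately.
 * Assumption: on a real target this read-modify-write is made atomic,
 * e.g. by briefly masking interrupts. */
void dpc_schedule(unsigned slot)
{
    if (slot < DPC_COUNT) {
        dpc_pending |= (uint8_t)(1u << slot);
    }
}

/* Run pending procedures, most important first, until none remain.
 * Returns how many procedures were executed. */
unsigned dpc_run_pending(void)
{
    unsigned executed = 0;

    while (dpc_pending != 0u) {
        for (unsigned slot = 0; slot < DPC_COUNT; slot++) {
            uint8_t mask = (uint8_t)(1u << slot);
            if ((dpc_pending & mask) != 0u) {
                dpc_pending &= (uint8_t)~mask;   /* clear before running */
                if (dpc_table[slot] != 0) {
                    dpc_table[slot]();
                }
                executed++;
                break;                           /* rescan from the top slot */
            }
        }
    }
    return executed;
}

/* Example DPCs (hypothetical): they only record the order they ran in. */
char dpc_log[8];
unsigned dpc_log_len;

void i1_dpc(void) { dpc_log[dpc_log_len++] = '1'; }
void i2_dpc(void) { dpc_log[dpc_log_len++] = '2'; }
```

If I2's work is scheduled before I1's, the dispatcher still runs I1's DPC first because it sits in the more important slot. Note that this sketch runs each DPC to completion; it does not let one DPC preempt another.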

There are 100 different ways to skin this cat, depending on CPU architecture (interrupt nesting & prioritization, software interrupt support, etc.) but let's take a pretty straightforward approach that is relatively simple to understand and free from the race conditions and resource-sharing hazards of a preemptive kernel.
(Disclaimer: my first choice is typically a preemptive real time kernel, many of them can run in extremely resource-constrained systems... SecurityMatt's suggestion is good but if you're not comfortable implementing your own preemptible kernel / task switcher, particularly one that handles asynchronous (interrupt-triggered) preemption, you can get wrapped around the axle pretty quickly. So what I'm proposing below is not as responsive as a preemption-based kernel, but it's much simpler and often adequate).
Create 3 event/work queues:
Q1 is the lowest priority and handles your slow, background SD card writes
Q2 holds requests to process incoming UART packets
Q3 (highest priority) holds UART RX FIFO read requests.
I split up the UART RX FIFO reading and the processing of the read packet so that the FIFO reading is always serviced ahead of the packet processing; maybe you want to keep them together, your choice.
For this to work, you break your large (~100ms) SD card write process into a bunch of smaller, discrete, run to completion steps.
So for example, to write 5 blocks, 20ms each, you write the first block, then enqueue "write next block" to Q1. You go back to your scheduler at the end of each step & scan the queues in priority order, starting with Q3. If Q2 and Q3 are empty, you pull the next event off of Q1 ("write next block"), and run that command for another 20ms before returning and scanning the queues again. If 20ms is not responsive enough, you break up each 20ms block write into a more fine-grained set of steps, continually posting to Q1 the next work step.
Now for the incoming UART stuff: in the UART RX ISR, you simply enqueue a "read UART FIFO" command in Q3, and return from interrupt back into the 20ms "write block" step that was interrupted. As soon as the CPU finishes the write, it goes back and scans the queues in priority order (worst case response will be 20ms if the block write had just begun at the time of the interrupt). The queue scanner (scheduler) will see that Q3 now has work to do, and it will run that command before going back and scanning again.
The responsiveness in your system, worst case, will be determined by the longest run-to-completion step in the system, regardless of priority. You keep your system very responsive by doing work in small, discrete, run to completion steps.
Note that I have to speak in generalities here. Maybe you want to read the UART RX FIFO in the ISR, put the data into a buffer, and only defer the packet processing, not the actual reading of the FIFO (then you'd only have 2 queues). You have to work this out for yourself. But I hope the approach makes sense.
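As a rough illustration of the queue scanning described above, here is a minimal C sketch. Everything in it is an assumption made for the example: the names work_post and work_run_one, the queue depth, and the two stand-in work items. Q1..Q3 are stored at index 0..2, and the "SD write" is just a counter that re-posts itself until all blocks are done.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_DEPTH 8u   /* hypothetical depth; size it for your worst case */
#define QUEUE_COUNT 3u   /* Q1..Q3 from the text, stored at index 0..2 */

typedef void (*work_fn)(void);

typedef struct {
    work_fn items[QUEUE_DEPTH];
    volatile unsigned head;   /* next slot to read */
    volatile unsigned tail;   /* next slot to write */
} work_queue;

work_queue queues[QUEUE_COUNT];

/* Post a work item; returns false if the queue is full.
 * Assumption: if ISRs and main() post to the same queue, this is
 * wrapped in a short critical section on the real target. */
bool work_post(unsigned q, work_fn fn)
{
    if (q >= QUEUE_COUNT || fn == NULL) {
        return false;
    }
    work_queue *wq = &queues[q];
    unsigned next = (wq->tail + 1u) % QUEUE_DEPTH;
    if (next == wq->head) {
        return false;                       /* full */
    }
    wq->items[wq->tail] = fn;
    wq->tail = next;
    return true;
}

/* Run one item from the highest-priority non-empty queue.
 * Returns false when every queue is empty. */
bool work_run_one(void)
{
    for (unsigned q = QUEUE_COUNT; q-- > 0u; ) {   /* Q3 first, Q1 last */
        work_queue *wq = &queues[q];
        if (wq->head != wq->tail) {
            work_fn fn = wq->items[wq->head];
            wq->head = (wq->head + 1u) % QUEUE_DEPTH;
            fn();
            return true;
        }
    }
    return false;
}

/* Example work items (hypothetical stand-ins for the real drivers). */
unsigned blocks_left = 3;   /* SD blocks still to write */
char run_log[16];
unsigned run_len;

void read_uart_fifo(void) { run_log[run_len++] = 'U'; }

void write_next_block(void)
{
    run_log[run_len++] = 'W';               /* one short run-to-completion step */
    if (--blocks_left > 0u) {
        work_post(0, write_next_block);     /* re-queue the remainder on Q1 */
    }
}
```

In main() the loop would simply be "while (1) { if (!work_run_one()) { idle or sleep } }", with the UART RX ISR calling work_post(2, read_uart_fifo). A UART request posted while a block write is in progress runs as soon as that step returns, ahead of the remaining block writes.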
This event-driven approach with prioritized queues is exactly the approach used by the Quantum Platform (QP) event-driven framework. QP supports either an underlying non-preemptive (cooperative) scheduler, such as the one described here, or a preemptive scheduler which runs the scheduler each time an event is queued (similar to the approach suggested by SecurityMatt). You can see the implementation of QP's cooperative scheduler on the QP website.

An alternative solution would be as follows:
Anywhere the FAT library can capture the processor for a long time, you insert a call to a new function which is normally very fast and returns to the caller after a few machine cycles. Such a fast function would not impact the real-time performance of your time-consuming operation, such as reading/writing to SD Flash. You would insert such a call in any loop that waits for a flash sector to be erased. You also insert a call to the function between every 512 bytes written or read.
The goal of that function is to perform most of the tasks that you would normally have inside the "while(1)" loop in a typical "main()" for an embedded device. It first increments an integer and performs a fast modulo on the new value, then returns if the result is not equal to an arbitrary constant. The code is as follows:
void preemption_check(void)
{
    static unsigned int fast_modulo = 0;

    /* Divide down the number of calls. */
    fast_modulo++;
    if ((fast_modulo & 0x003F) != 3)
    {
        return;
    }

    /* The processor gets here only once every 64 calls to preemption_check(). */
    /* Service the serial port, process completed commands, etc. here. */
}
After that check, still inside the function, you call the functions that extract RS232 characters/strings from the serial port interrupts, process any command if complete strings are received, etc.
The binary mask 0x3F used above means that we look only at the 6 least significant bits of the counter. When these 6 bits happen to be equal to the arbitrary value 3, we go ahead with the calls to functions which may take some microseconds or even milliseconds to execute. You may want to try a smaller or larger binary mask depending on the speed at which you want to service the serial port and other operations. You may even use more than one mask simultaneously to service some operations more often than others.
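The idea of using more than one mask could look something like the following sketch. The function preemption_check_multi, the two service functions and the chosen masks (0x0F and 0xFF) are illustrative assumptions; the counters only stand in for real work so the call rates are visible.

```c
#include <stdint.h>

/* Counters stand in for real work so the call rates can be observed. */
unsigned uart_service_count;
unsigned slow_service_count;

void service_uart(void) { uart_service_count++; }   /* hypothetical fast-rate job */
void service_slow(void) { slow_service_count++; }   /* hypothetical slow-rate job */

/* Variant of the check function with two divide-down rates. */
void preemption_check_multi(void)
{
    static uint16_t calls = 0;

    calls++;
    if ((calls & 0x000Fu) == 3u) {
        service_uart();          /* once every 16 calls */
    }
    if ((calls & 0x00FFu) == 3u) {
        service_slow();          /* once every 256 calls */
    }
}
```

Over 256 calls the UART job runs 16 times and the slow job once.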
The FAT library and the SD card should not experience any problem when a sporadic delay happens between two Flash erase operations, for example.
The solution given here works even with a microcontroller with only 2K bytes of memory, like many variants of the 8051. As incredible as it may seem, the pinball machines of 1980 to 1990 had a few K of RAM and slow processors (like 10 MHz), and they were able to test one hundred switches, fully debounced, update an X/Y matrix display, produce sound effects, etc. The solutions developed by these engineers can still be used to boost the performance of large systems. Even with the best servers with 64 GB of RAM and many terabytes of hard disk, I presume that every byte counts when a company wants to index billions of web pages.

As no-one has suggested coming at it from this end yet I'll throw it in the hat:
It's possible that sticking the SD card service routine in a low-priority interrupt, maybe throwing in some DMA if you can, would free up your main loop and other interrupts to be more responsive, rather than being stuck in a main() loop waiting a long time for something to finish.
The caveat is that I don't know if the hardware has any way of triggering the interrupt when the SD card is ready for more; you might have to cheat by running a polling timer to check and force the interrupt. I'm not above that sort of thing though: if you have spare hardware timers and interrupts it can be done with very little overhead.
Resorting to an RTOS for something like this would seem overkill & an admission of failure to me... ;)

Related

Is there any problem with computing the whole program in an ISR?

I have a program that should run periodically on an MCU (e.g. STM32), for example every 1 ms. If I program a 1 ms ISR and call my complete program in it, assuming it will not exceed 1 ms, is that a good way to do it? Are there any problems I could face? Will it be precise?

The usual answer would be probably that ISRs should be generally kept to a minimum and most of the work should be performed in the "background loop". (Since you are apparently using the traditional "foreground-background" architecture, a.k.a. "main+ISRs").
But ARM Cortex-M (e.g., STM32) has been specifically designed such that ISRs can be written as plain C functions. In that case, working with ISRs is no different than working with any other C code. This includes ease of debugging.
Moreover, ARM Cortex-M comes with the NVIC (Nested Vectored Interrupt Controller). The NVIC allows you to prioritize interrupts and they can preempt each other. This means that you can quite easily build sets of periodic "tasks" (ISRs), which run under the preemptive, priority-based scheduler. Interestingly, this scheduler (implemented in the NVIC hardware) meets all requirements of RMA/RMS (Rate Monotonic Analysis/Scheduling), so you could prove the schedulability of your system. Of course, the ISR "tasks" cannot block internally, but this is not required for RMA/RMS. Also, if you have any shared resources between ISRs running at different priorities (or ISR and the background loop), you need to properly protect the resources by disabling interrupts.
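To make the schedulability claim concrete, here is a small C sketch of the standard response-time iteration for fixed-priority preemptive tasks, applied to a set of periodic ISRs. It is an illustration under simplifying assumptions, not something taken from the NVIC documentation: the type isr_task and the function isr_set_schedulable are hypothetical names, deadlines are assumed equal to periods, and interrupt entry/exit overhead and blocking caused by interrupt-disabled sections are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t period_us;   /* T: how often the ISR fires (also its deadline) */
    uint32_t wcet_us;     /* C: worst-case execution time of the ISR */
} isr_task;

/* tasks[] must be ordered highest priority first (shortest period first
 * under rate-monotonic priority assignment).
 * Returns true if every ISR finishes before its next activation. */
bool isr_set_schedulable(const isr_task *tasks, unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        if (tasks[i].period_us == 0u) {
            return false;                          /* invalid description */
        }
        uint32_t response = tasks[i].wcet_us;
        for (;;) {
            uint32_t next = tasks[i].wcet_us;
            for (unsigned j = 0; j < i; j++) {
                /* ceil(response / T_j): preemptions by higher-priority ISR j */
                uint32_t hits = (response + tasks[j].period_us - 1u)
                                / tasks[j].period_us;
                next += hits * tasks[j].wcet_us;
            }
            if (next > tasks[i].period_us) {
                return false;                      /* would miss its deadline */
            }
            if (next == response) {
                break;                             /* converged */
            }
            response = next;
        }
    }
    return true;
}
```

For example, three ISRs with (period, WCET) of (1000, 300), (5000, 1000) and (10000, 2000) microseconds pass this check, while (1000, 600) together with (2000, 900) does not.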
So, your idea of using ISRs as "tasks" makes a lot of sense to me. Your system will be optimal, meaning that any other approach would be less efficient. (This includes the use of any kind of RTOS.) Also, this design can be low-power because you could use the "background loop" in main() to put your CPU and peripherals into low-power sleep mode (WFI instruction, etc.). In fact, you can view the "background loop" as the idle "task" in this "hardware-RTOS".

Disable interrupt to let freeRTOS run on stm32

I'm working on a project where I am getting digital samples continuously through DMA on an STM32F4. DMA generates a complete-callback interrupt after every sample, where I do some DSP. My plan is to allow FreeRTOS to work on other tasks while DMA is waiting on the callback. However, DMA is generating callbacks too frequently, not allowing FreeRTOS to run. I want to make it so that after every DMA complete callback, FreeRTOS tasks are allowed to run for 6ms. I thought of calling __disable_irq() from the complete callback and __enable_irq() from one of the tasks, but that would not guarantee 6ms, and I also have a high-priority button interrupt. I also tried disabling just the DMA interrupt by calling __set_BASEPRI(priority<<(8-__NVIC_PRIO_BITS)) and then starting a timer for 6ms. In the timer period-elapsed callback I call __set_BASEPRI(0) to enable the DMA interrupt again. But for some reason this did not allow FreeRTOS to run at all. It goes back and forth between the DMA complete callback and the timer period-elapsed callback.
I am new to embedded programming so any comment on this will help. Thank You.

You should not think of the DSP process being separate from the RTOS tasks, do the DSP in an RTOS task - the signal processing is the most time critical aspect of your system, you have to process the data as fast as it arrives with no loss.
If the DSP is being done in an interrupt context and starving your tasks, then clearly you are doing too much work in the interrupt context, and have too high an interrupt rate. You need to fix your design for something more schedulable.
If your DMA transfers are single samples, you will get one interrupt per sample - the ADC will do that on its own; so using DMA in that manner offers no advantage over direct ADC interrupt processing.
Instead you should use block processing, so you DMA a block of, say, 80 samples cyclically, for which you get a half-transfer interrupt at 40 samples and a full-transfer interrupt at 80 samples. For each interrupt you might then trigger a task event or semaphore to defer the DSP processing to a high-priority RTOS task. This achieves two things:
First, for the entirety of the n-sample block acquisition time, the RTOS is free to perform the DSP processing for the previous block and to use any remaining time to process the lower-priority tasks.
Second, the interrupt overhead (context switching etc.) is paid once per n samples instead of once per sample, so it shrinks to 1/n of what it was, leaving more time for core signal processing and background tasks.
Apart from reducing the number of interrupts and software overhead, the signal processing algorithms themselves can be optimised more readily when performing block processing.
A variation on the above: rather than triggering a task event or semaphore from the DMA interrupt handler, you could place the new sample block in a message queue, which then provides some buffering. This is useful if the DSP processing time is less deterministic, so that you cannot always guarantee to finish one block before the next is ready. Overall, however, it remains necessary that on average you complete block processing in the time it takes to acquire a block, with time to spare for other tasks.
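The half-transfer/full-transfer scheme can be sketched in plain C as below. This is a host-side illustration only: the handler names on_dma_half_transfer and on_dma_full_transfer, the boolean flags (which would be semaphores or task notifications under an RTOS) and the placeholder process_block are assumptions, not the API of any vendor library. On a real target the DMA controller fills dma_buffer in circular mode and the two handlers are hooked to its half- and full-transfer interrupts.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SAMPLES 80u
#define HALF_SAMPLES  (BLOCK_SAMPLES / 2u)

/* Circular buffer that the DMA controller would fill continuously. */
volatile int16_t dma_buffer[BLOCK_SAMPLES];

/* Set by the interrupt handlers, consumed by the DSP task.
 * Under an RTOS these would be semaphores or task notifications. */
volatile bool first_half_ready;
volatile bool second_half_ready;

/* Hypothetical handlers: keep them tiny, they only signal the task. */
void on_dma_half_transfer(void) { first_half_ready = true; }
void on_dma_full_transfer(void) { second_half_ready = true; }

/* Placeholder for the real DSP: sums the samples of one half-block. */
int32_t process_block(const volatile int16_t *samples, unsigned count)
{
    int32_t acc = 0;
    for (unsigned i = 0; i < count; i++) {
        acc += samples[i];
    }
    return acc;
}

/* One pass of the high-priority DSP task body.
 * Returns true if a half-block was processed and *result was written. */
bool dsp_task_step(int32_t *result)
{
    if (first_half_ready) {
        first_half_ready = false;
        /* DMA is now filling the second half, so the first half is stable. */
        *result = process_block(&dma_buffer[0], HALF_SAMPLES);
        return true;
    }
    if (second_half_ready) {
        second_half_ready = false;
        /* DMA has wrapped around and is refilling the first half. */
        *result = process_block(&dma_buffer[HALF_SAMPLES], HALF_SAMPLES);
        return true;
    }
    return false;   /* nothing to do: lower-priority tasks can run */
}
```

With this layout there is one interrupt per 40 samples, and the processing of each half must finish before the DMA wraps back onto it.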
If your lower priority tasks are still starved, then the clear indication is that your DSP process is simply too much for your processor. There may be scope for optimisation, but that would be a different question.
Using the suggested block-processing strategy I have in the past migrated an application from a TI C2000 DSP running at 200MHz and 98% CPU load, to a 72MHz STM32F1xx at 60% CPU load. The performance improvement is potentially very significant if you get it right.
With respect to your "high-priority" button interrupt, I would question your priority assignment. Buttons are operated manually with human response and perception times measured in 10's or even 100's of milliseconds. That is hardly your time critical task, whereas missing an ADC sample of a few microseconds would cause your signal processing to go seriously awry.
You may be making the mistake of confusing "high-priority" with "important". In the context of a real-time system, they are not the same thing. You could simply poll the button in a low-priority task, or if you use an interrupt, the interrupt should do no more than signal a task (or more realistically trigger a de-bounce timer) (see Rising edge interrupt triggering multiple times on STM32 Nucleo for example).
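Polling the button from a low-priority task could look roughly like this. It is a generic counter-based debounce, sketched under assumptions: the names debounce and debounce_poll and the threshold of 5 consecutive polls are illustrative, and the raw pin level is passed in by the caller rather than read from a real GPIO.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEBOUNCE_TICKS 5u   /* hypothetical: 5 consecutive polls, e.g. 5 x 10 ms */

typedef struct {
    bool    stable;   /* debounced state reported to the application */
    uint8_t count;    /* consecutive polls that disagreed with 'stable' */
} debounce;

/* Call periodically from a low-priority task with the raw pin level.
 * Returns true exactly once each time the debounced state changes. */
bool debounce_poll(debounce *d, bool raw)
{
    if (raw == d->stable) {
        d->count = 0;             /* bounce or no change: start over */
        return false;
    }
    if (++d->count < DEBOUNCE_TICKS) {
        return false;             /* not stable for long enough yet */
    }
    d->stable = raw;
    d->count = 0;
    return true;
}
```

A single glitch resets the counter, so the state only changes after the raw level has held for the full number of polls.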

TinyAVR 0-Series: Can I use pin-change sensing without entering interrupt handler?

I am evaluating the ATtiny806 running at 20MHz to build a cycle-accurate Intel 4004 microprocessor emulator. (I know it will be a bit too slow, but AVRs have a huge community.)
I need to synchronize to the external, two-phase non-overlapping clocks. These are not fast clocks (the original 4004 ran at 750kHz), but if I spin-wait for every clock edge, I risk wasting most of my time budget.
The TinyAVR 0-series has a very nice pin-change interrupt facility that can be configured to trigger only on rising edges.
But, an interrupt routine round-trip is 8 cycles (3 in, 5 out).
My question is:
Can I leverage the pin-change sensing mechanism while never visiting an ISR?
(Other processor families let you poll for interruptible conditions without enabling interrupts from that peripheral). Can polling be done with a tight skip-on-bit/jump-back loop, followed by a set-bit instruction?

Straightforward way
You can always just poll on the level of the GPIO pin using the single cycle skip if bit set/clear instruction on the appropriate PORT register and bit.
But as you mention, polling does burn cycles so I'm not sure exactly what you want here - either a poll (that burns cycles but has low latency) or an interrupt (that has higher latency but allows processing to continue until the condition is true).
Note that if things get really tight and you are looking for, say, power savings by sleeping between clock signal transitions, then you can do tricks like having an ISR that never returns (saving the RETI cycles), but that requires some careful coding, probably with something like a state machine.
INTFLAG way
Alternately, if you want to use the internal pin state machine logic and you can live without interrupts, then you can use the INTFLAGS flags to check for the pin change configured in the ISC bits of the PINxCTRL register. As long as global interrupts are not enabled in SREG then you can spin poll on the appropriate INTFLAG bit to check/wait for the desired condition, and then write a 1 to that bit to clear the flag.
Note that if you want to make this fast, you will probably want to map the appropriate PORT to a VPORT, since the VPORT registers are in I/O memory. This lets you use SBIS to test the INTFLAG bit in a single cycle and SBI to clear the bit in a single cycle (these instructions only work on I/O memory, and the normal PORT registers are not in I/O memory).
Finally one more complication, if you need to leave the interrupts on when doing this, it is probably possible by hacking the interrupt priority registers. You'd set the pin change to be on level 0, and then make sure the interrupts you care about are level 1 or higher, and then trick the interrupt controller into thinking that there is already a level 0 running so these interrupts do not actually fire. There are also other restrictions to this strategy so avoid it if at all possible.
Programmable logic way
If you want to get really esoteric, it is likely possible to route the input value of a pin to a configurable custom logic LUT in the chip and then route the output of that module to a bit that you test using a 1-cycle bit test (maybe an unused IO pin). To do this, you'd feed the output of the LUT back into one of its inputs and then use the LUT to create a strobe on the edge you are looking for. This is very complex, and since the strobe has no acknowledgement, if the signal changes when you are not looking for it (in a spin check) then the edge will be lost and you will have to wait for the next one (probably fatal in your application).

How do you avoid interrupt starvation in a nested interrupt system?

I am learning about interrupts and couldn't understand what happens when there are too many interrupts to a point where the CPU can't process the foreground loop or complete the existing interrupts. I read through this article https://www.cs.utah.edu/~regehr/papers/interrupt_chapter.pdf but didn't completely understand how a scheduler would help, if there are simply too many interrupts?
Do we switch to a faster CPU if the interrupts can not be missed?

Yes, you have to switch to a faster CPU!
You have to ensure that there is enough time left for the main loop. Therefore it is really important to keep your interrupt service routines as short as possible and to run some CPU load tests.

Indeed, any time there is contention over a shared resource, there is the possibility of starvation. The schedulers discussed in the paper limit the interrupt rate, thus ensuring some interrupt-free processing time during each interval. During high activity periods, interrupt handling is disabled, and the scheduler switches to polling mode where it interrogates the state of the interrupt request lines periodically, effectively throttling the stream of interrupts. The operating system strives to do as little as possible in each interrupt handler - tasks are often simply queued so they can be handled later at a different stage. There are many considerations and trade-offs that go into any scheduling algorithm.
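A very reduced version of such throttling can be sketched in C as follows. This is not the scheduler from the paper, only an illustration of the general idea under assumptions: the budget MAX_IRQS_PER_TICK, the function names, and the flag device_irq_enabled (standing in for the hardware interrupt mask bit) are all hypothetical, and raise_device_irq exists only to simulate the hardware on a host.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IRQS_PER_TICK 4u    /* hypothetical budget per timer interval */

volatile bool     device_irq_enabled = true;  /* stands in for the hardware mask bit */
volatile uint32_t irqs_this_tick;
volatile uint32_t handled_events;

void handle_device_event(void) { handled_events++; }   /* placeholder work */

/* Device ISR: do the work, then mask ourselves once the budget is spent. */
void device_isr(void)
{
    handle_device_event();
    irqs_this_tick = irqs_this_tick + 1u;
    if (irqs_this_tick >= MAX_IRQS_PER_TICK) {
        device_irq_enabled = false;   /* no more interrupts until the next tick */
    }
}

/* Periodic timer tick: while throttled, poll the device once,
 * then start a new interval with the interrupt re-enabled. */
void timer_tick(bool device_has_pending_work)
{
    if (!device_irq_enabled && device_has_pending_work) {
        handle_device_event();        /* polled instead of interrupt-driven */
    }
    irqs_this_tick = 0;
    device_irq_enabled = true;
}

/* Simulation helper: the "hardware" only delivers unmasked interrupts. */
bool raise_device_irq(void)
{
    if (!device_irq_enabled) {
        return false;
    }
    device_isr();
    return true;
}
```

However fast the device raises requests, at most MAX_IRQS_PER_TICK interrupts are serviced per interval, which guarantees the rest of that interval to the foreground code.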

Overall you need an idea of how much time each part of your program consumes. This is pretty easy to measure live with an oscilloscope. If you set a GPIO when entering the interrupt and clear it when leaving, you not only see how much time the ISR consumes, but also how often it kicks in. If you do this for each ISR you get a good idea of how much time they need. You can then do something similar in main(), to get a rough estimate of the complete execution cycle of the program, main + interrupts.
As for the best solution, it is obviously to reduce the number of interrupts. Use polling if possible. Use DMA. Use serial peripherals (UART, CAN etc.) that are hardware-buffered instead of interrupt-intensive ones. Use hardware PWM instead of output compare timers. And so on. These things need to be considered early on when you pick a suitable MCU for your project. If you picked the wrong MCU, then you'll obviously have to change. Twiddling with the CPU clock sounds like a quick and dirty fix. Get the design right instead.

Operating System Basics

I am reading about process management, and I have a few doubts.
What is meant by an I/O request? For example, a process that is executing is in the running state, and it is in the waiting state if it is waiting for the completion of an I/O request. I don't understand what is meant by an I/O request. Can you please give an example to elaborate?
Another doubt: let's say that a process is executing and suddenly an interrupt occurs. The process then stops executing and is put in the ready state. Is it possible that some other process begins executing while the interrupt is also being processed?

Regarding the first question:
A simple way to think about it...
Your computer has lots of components. CPU, Hard Drive, network card, sound card, gpu, etc. All those work in parallel and independent of each other. They are also generally slower than the CPU.
This means that whenever a process makes a call that down the line (on the OS side) ends up communicating with an external device, there is no point for the OS to be stuck waiting for the result since the time it takes for that operation to complete is probably an eternity (in the CPU view point of things).
So, the OS fires up whatever communication the process requested (call it IO request), flags the process as waiting for IO, and switches execution to another process so the CPU can do something useful instead of sitting around blocked waiting for the IO request to complete.
When the external device finishes whatever operation was requested, it generates an interrupt, so the OS is informed the work is done, and it can then flag the blocked process as ready again.
This is all a very simplified view of course, but that's the main idea. It allows the CPU to do useful work instead of waiting for IO requests to complete.
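The state changes described above can be written down as a tiny C sketch. It models only the bookkeeping, under assumptions: the names proc_state, process, os_dispatch, os_io_request and os_io_complete are made up for the illustration and do not correspond to any real operating system's API.

```c
typedef enum { PROC_READY, PROC_RUNNING, PROC_WAITING } proc_state;

typedef struct {
    int        pid;
    proc_state state;
} process;

/* The scheduler picks a ready process and gives it the CPU. */
void os_dispatch(process *p)
{
    if (p->state == PROC_READY) {
        p->state = PROC_RUNNING;
    }
}

/* The running process asks for I/O (e.g. a disk read): the OS starts the
 * device operation and parks the process until the device is done. */
void os_io_request(process *p)
{
    if (p->state == PROC_RUNNING) {
        p->state = PROC_WAITING;
    }
}

/* The device raises its completion interrupt: the process becomes
 * ready again and waits for the scheduler to dispatch it. */
void os_io_complete(process *p)
{
    if (p->state == PROC_WAITING) {
        p->state = PROC_READY;
    }
}
```

A waiting process cannot be dispatched; only the completion interrupt moves it back to the ready state, and while it waits the CPU is free to run other ready processes.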
Regarding the second question:
It's tricky, even for single CPU machines, and depends on how the OS handles interrupts.
For code simplicity, a simple OS might, for example, process each interrupt in one go when it happens, then resume whatever process it decides is appropriate once the interrupt handling is done. So in this case, no other process would run until the interrupt handling is complete.
In practice, things get a bit more complicated for performance and latency reasons.
If you think about an interrupt lifetime as just another task for the CPU (From when the interrupt starts to the point the OS considers that handling complete), you can effectively code the interrupt handling to run in parallel with other things.
Just think of the interrupt as notification for the OS to start another task (that interrupt handling). It grabs whatever context it needs at the point the interrupt started, then keeps processing that task in parallel with other processes.

An I/O request generally just means a request to do input, output, or both. The exact meaning varies depending on your context, like HTTP, networks, console operations, or maybe some process in the CPU.
A process is waiting for I/O: say, for example, you were writing a program in C to accept the user's name on the command line and then print 'Hello User' back. Your code will go into the waiting state until the user enters their name and hits Enter. This is a higher-level example, but even a very low-level process executing on your computer's processor works on the same basic principle.
Can the processor work on other processes when the current one is interrupted and waiting on something? Yes! You'd better hope it does. That's what scheduling algorithms and stacks are for. However, the real answer depends on what architecture you are on, whether it supports parallel or serial processing, etc.
