What is the safe maximum CPU utilization for an embedded system running critical applications? We are measuring performance with top. Is 50-75% safe?
Real-Time embedded systems are designed to meet Real-Time constraints, for example:
Voltage acquisition and processing every 500 us (let's say sensor monitoring).
Audio buffer processing every 5.8 ms (4ms processing).
Serial Command Acknowledgement within 3ms.
Thanks to a Real-Time Operating System (RTOS), which is "preemptive" (the scheduler can suspend a task to execute one with a higher priority), you can meet those constraints even at 100% CPU usage: the CPU will execute the high-priority task and then resume whatever it was doing.
But this does not mean you will meet the constraints no matter what. A few tips:
High-priority task execution must be as short as possible (by calculating Execution Time / Period for each task you can estimate its CPU usage - see the sketch after these tips).
If the estimated CPU usage is too high, look for code optimizations, hardware equivalents (hardware CRC, DMA, ...), or a second microprocessor.
Stress test your device and measure if your real-time constraints are met.
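For example, a rough utilization budget can be computed from each task's worst-case execution time and period. A minimal sketch, with hypothetical figures loosely based on the constraints above:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical task budget: worst-case execution time (WCET) and period, in microseconds. */
struct task_budget {
    const char *name;
    double wcet_us;    /* worst-case execution time per activation */
    double period_us;  /* how often the task runs */
};

int main(void)
{
    /* Illustrative numbers - replace with your own measurements. */
    struct task_budget tasks[] = {
        { "voltage_acq",  50.0,   500.0 },  /* 50 us of work every 500 us -> 10%  */
        { "audio",      4000.0,  5800.0 },  /* 4 ms of work every 5.8 ms  -> ~69% */
        { "serial_ack",  200.0,  3000.0 },  /* 200 us of work every 3 ms  -> ~7%  */
    };
    double total = 0.0;

    for (size_t i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        double u = tasks[i].wcet_us / tasks[i].period_us;
        total += u;
        printf("%-12s %6.1f%%\n", tasks[i].name, 100.0 * u);
    }
    printf("total        %6.1f%%\n", 100.0 * total);
    return 0;
}
```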
For the previous example:
Audio processing should be the lowest priority
Serial Acknowledgement/Voltage Acquisition the highest
A stress test can be done by issuing Serial Commands and checking for missed audio buffers, missed analog voltage events and so on. You can also vary the CPU clock frequency: your device might meet its constraints at much lower clock frequencies, reducing power consumption.
To answer your question, 50-75% and even 100% CPU usage is safe as long as you meet your real-time constraints, but bear in mind that if you want to add functionality later on, you will not have much headroom at 98% CPU usage.
In rate-monotonic scheduling, mathematical analysis determines that real-time tasks (that is, tasks with specific real-time deadlines) are schedulable when utilisation is below about 70% (and priorities are appropriately assigned). If you have accurate statistics for all tasks and they are deterministic, this can be as high as 85% and still guarantee schedulability.
Note however that the utilisation applies only to tasks with hard-real-time deadlines. Background tasks may utilise the remaining CPU time all the time without missing deadlines.
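The ~70% figure is the Liu & Layland bound for rate-monotonic scheduling, U(n) = n(2^(1/n) - 1), which tends towards ln 2 ≈ 69.3% as the number of tasks grows. A minimal sketch of the sufficient test, with a hypothetical task set:

```c
/* Compile with -lm. Task figures are hypothetical. */
#include <math.h>
#include <stdio.h>

/* Liu & Layland utilisation bound for n tasks under rate-monotonic scheduling. */
static double rms_bound(int n)
{
    return n * (pow(2.0, 1.0 / n) - 1.0);
}

int main(void)
{
    /* Hypothetical task set: worst-case execution times and periods, in microseconds. */
    double wcet[]   = {  50.0, 1000.0,  200.0 };
    double period[] = { 500.0, 5000.0, 3000.0 };
    int n = 3;
    double u = 0.0;

    for (int i = 0; i < n; i++)
        u += wcet[i] / period[i];

    printf("utilisation = %.1f%%, RMS bound = %.1f%%\n",
           100.0 * u, 100.0 * rms_bound(n));
    if (u <= rms_bound(n))
        printf("schedulable by the sufficient test\n");
    else
        printf("bound exceeded - needs an exact test (e.g. response-time analysis)\n");
    return 0;
}
```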
Assuming by "CPU utilization" you are referring to time spent executing code other than an idle loop, in a system with a preemptive priority based scheduler, then ... it depends.
First of all there is a measurement problem: if your utilisation sampling period is sufficiently short, the figure will often switch between 100% and zero, whereas if it is very long you get a good average but cannot tell whether utilisation was high for long enough to starve tasks of lower priority than the one running, to the extent that they might miss their deadlines. It is, in a sense, the wrong question, because any practical utilisation sampling period will typically be much longer than the shortest deadline, so it is at best a qualitative rather than a quantitative measure. It does not tell you much that is useful in critical cases.
Secondly there is the issue of what you are measuring. While it is common for an RTOS to have a means to measure CPU utilisation, it is measuring utilisation by all tasks, including those that have no deadlines.
If the utilisation becomes 100% in the lowest priority task, for example, there is no harm to schedulability - the low priority task is in this sense no different from the idle loop. It may have consequences for power consumption in systems that normally enter a low-power mode in the idle loop.
If a higher priority task takes 100% CPU utilisation such that a deadline for a lower priority task is missed, then your system will fail (in the sense that deadlines will be missed - the consequences are application specific).
Simply measuring CPU utilisation is insufficient, but if your CPU is at 100% utilisation and that utilisation is not simply some background task in the lowest priority thread, it is probably not schedulable - there will be stuff not getting done. The consequences are of course application dependent.
Whilst having a low priority background thread consume 100% CPU may do no harm, it does render the CPU utilisation measurement entirely useless. If the task is preempted and for some reason the higher priority task takes 100%, you may have no way of detecting the issue, and the background task (and any tasks of lower priority than the preempting task) will not get done. It is better therefore to ensure that you have some idle time so that you can detect abnormal scheduling behaviour (if you have no other means to do so).
One common solution to the non-yielding background task problem is to perform such tasks in the idle loop. Many RTOS allow you to insert idle-loop hooks to do this. The background task is then preemptable but not included in the utilisation measurement. In that case of course you cannot then have a low-priority task that does not yield, because your idle-loop does stuff.
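For instance, FreeRTOS exposes this as vApplicationIdleHook(), enabled by setting configUSE_IDLE_HOOK to 1 in FreeRTOSConfig.h; other RTOS have similar mechanisms. The background work called here is a hypothetical placeholder:

```c
/* FreeRTOS idle hook sketch - requires configUSE_IDLE_HOOK set to 1 in FreeRTOSConfig.h. */
#include "FreeRTOS.h"
#include "task.h"

/* Hypothetical non-time-critical background work (logging, housekeeping, ...). */
extern void background_housekeeping_step(void);

void vApplicationIdleHook(void)
{
    /* Called repeatedly from the idle task whenever nothing else is runnable.
     * It must never block, and it is preempted by every ready task, so it does
     * not count against the utilisation of the real-time tasks. */
    background_housekeeping_step();
}
```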
When assigning priorities, the task with the shortest execution time should have the highest priority. Moreover, the execution time should be more deterministic in higher priority tasks - that is, if it takes 100us to run it should, within some reasonable bounds, always take that long. If some processing is variable such that it takes, say, 100us most of the time but must occasionally do something that requires, say, 100ms, then the 100ms processing should be passed off to some lower priority task (or the priority of the task temporarily lowered, but that pattern may be hard to manage and predict, and may cause subsequent deadlines or events to be missed).
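One common pattern for that hand-off is a queue between the fast task and a lower-priority worker. A sketch using FreeRTOS queues (task names, sizes and priorities are placeholders, and do_quick_part()/do_slow_part() are hypothetical):

```c
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

/* Hypothetical work request passed from the fast task to the slow worker. */
typedef struct { uint32_t kind; uint32_t arg; } work_item_t;

static QueueHandle_t work_queue;

extern void do_quick_part(void);
extern void do_slow_part(const work_item_t *item);   /* the occasional ~100 ms job */

/* High-priority task: always finishes quickly, never does the 100 ms work itself. */
static void fast_task(void *params)
{
    (void)params;
    for (;;) {
        do_quick_part();
        work_item_t item = { .kind = 1, .arg = 0 };
        /* Don't block here - if the queue is full, drop or count the overflow. */
        (void)xQueueSend(work_queue, &item, 0);
        vTaskDelay(pdMS_TO_TICKS(10));   /* stand-in for the task's real trigger */
    }
}

/* Low-priority worker: absorbs the variable, long-running processing. */
static void slow_worker(void *params)
{
    (void)params;
    work_item_t item;
    for (;;) {
        if (xQueueReceive(work_queue, &item, portMAX_DELAY) == pdTRUE)
            do_slow_part(&item);
    }
}

void start_tasks(void)
{
    work_queue = xQueueCreate(8, sizeof(work_item_t));
    xTaskCreate(fast_task,   "fast", 256, NULL, 5, NULL);   /* higher priority */
    xTaskCreate(slow_worker, "slow", 512, NULL, 1, NULL);   /* lower priority  */
}
```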
So if you are a bit vague about your task periods and deadlines, a good rule of thumb is to keep it below 70%, but not to include non-real-time background tasks in the measurement.
I have a question about how scheduling is done. I know that when a system has multiple CPUs, scheduling is usually done on a per-processor basis. Each processor runs its own scheduler, accessing a ready list of only those processes that are running on it.
So what would be the pros and cons when compared to an approach where there is a single ready list that all processors share?
Like what issues are there when assigning processes to processors, and what issues might be caused if a process always lives on one processor? In terms of mutex locking of data structures and time spent waiting for the locks, are there any issues with that?
Generally there is one, giant problem when it comes to multi-core CPU systems - cache coherency.
What does cache coherency mean?
Access to main memory is slow. Depending on the memory and the CPU clock, it can take on the order of hundreds of cycles to access data in RAM - that's a whole lot of time the CPU is doing no useful work. It would be significantly better if we minimized this time as much as possible, but the hardware required to do this is expensive, and typically must be in very close proximity to the CPU itself (we're talking within a few millimeters of the core).
This is where the cache comes in. The cache keeps a small subset of main memory in close proximity to the core, allowing accesses to this memory to be several orders of magnitude faster than main memory. For reading this is a simple process - if the memory is in the cache, read from cache, otherwise read from main memory.
Writing is a bit more tricky. Writing to the cache is fast, but now main memory still holds the original value. We can update that memory, but that takes a while, sometimes even longer than reading depending on the memory type and board layout. How do we minimize this as well?
The most common way to do so is with a write-back cache, which, when written to, will flush the data contained in the cache back to main memory at some later point, when the CPU is idle or otherwise not busy. Depending on the CPU architecture, this could be done during idle conditions, interleaved with CPU instructions, or on a timer (this is up to the designer/fabricator of the CPU).
Why is this a problem?
In a single core system, there is only one path for reads and writes to take - they must go through the cache on their way to main memory, meaning the programs running on the CPU only see what they expect - if they read a value, modified it, then read it back, it would be changed.
In a multi-core system, however, there are multiple paths for data to take on its way back to main memory, depending on the CPU that issued the read or write. This presents a problem with write-back caching, since that "later time" introduces a window in which one CPU might read memory that hasn't yet been updated.
Imagine a dual core system. A job starts on CPU 0 and reads a memory block. Since the memory block isn't in CPU 0's cache, it's read from main memory. Later, the job writes to that memory. Since the cache is write-back, that write will be made to CPU 0's cache and flushed back to main memory later. If CPU 1 then attempts to read that same memory, CPU 1 will attempt to read from main memory again, since it isn't in the cache of CPU 1. But the modification from CPU 0 hasn't left CPU 0's cache yet, so the data you get back is not valid - your modification hasn't gone through yet. Your program could now break in subtle, unpredictable, and potentially devastating ways.
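The same visibility hazard is what memory-ordering primitives address at the software level: when two threads (likely running on different cores) hand data to each other, the write has to be "published" before the reader is allowed to see the flag. A minimal sketch using C11 threads and atomics (the payload/flag names are just illustrative):

```c
#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

static int payload;                    /* the data being handed from one core to another */
static atomic_bool ready = false;      /* publication flag */

static int producer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release: everything written before this store is visible to an acquire load
     * that observes 'true', regardless of which core's cache held the data. */
    atomic_store_explicit(&ready, true, memory_order_release);
    return 0;
}

static int consumer(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                              /* spin until the flag is published */
    printf("payload = %d\n", payload); /* guaranteed to print 42 */
    return 0;
}

int main(void)
{
    thrd_t p, c;
    thrd_create(&c, consumer, NULL);
    thrd_create(&p, producer, NULL);
    thrd_join(p, NULL);
    thrd_join(c, NULL);
    return 0;
}
```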
To alleviate this stale-data problem, the hardware performs cache synchronization. Bus snooping, address monitoring, and other hardware mechanisms exist to keep the caches of multiple CPUs coherent. All of these methods have one common problem - they force the CPU to spend time doing bookkeeping rather than actual, useful computations.
The best method of avoiding this is actually keeping processes on one processor as much as possible. If the process doesn't migrate between CPUs, you don't need to keep the caches synchronized, as the other CPUs won't be accessing that memory at the same time (unless the memory is shared between multiple processes, but we'll not go into that here).
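Schedulers usually try to do this for you (soft affinity), but you can also pin a process explicitly. A minimal sketch for Linux (other OSes expose similar affinity APIs):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* restrict this process to CPU 0 */

    /* pid 0 means "the calling process"; the kernel will now keep it on CPU 0,
     * so its working set stays warm in that core's cache. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the cache-sensitive work here ... */
    return 0;
}
```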
Now we come to the issue of how to design our scheduler, and the three main problems there - avoiding process migration, maximizing CPU utilization, and scalability.
Single Queue Multiprocessor scheduling (SQMS)
Single Queue Multiprocessor schedulers are what you suggested - one queue containing available processes, and each core accesses the queue to get the next job to run. This is fairly simple to implement, but has a couple of major drawbacks - it can cause a whole lot of process migration, and does not scale well to larger systems with more cores.
Imagine a system with four cores and five jobs, each of which takes about the same amount of time to run, and each of which is rescheduled when completed. On the first run through, CPU 0 takes job A, CPU 1 takes B, CPU 2 takes C, and CPU 3 takes D, while E is left on the queue. Let's then say CPU 0 finishes job A, puts it on the back of the shared queue, and looks for another job to do. E is currently at the front of the queue, so CPU 0 takes E and goes on. Now CPU 1 finishes job B, puts B on the back of the queue, and looks for the next job. It now sees A, and starts running A. But since A was on CPU 0 before, CPU 1 now needs to sync its cache with CPU 0, resulting in lost time for both CPU 0 and CPU 1. In addition, if two CPUs both finish their operations at the same time, they both need to write to the shared list, which has to be done sequentially or the list will get corrupted (just like in multi-threading). This requires that one of the two CPUs wait for the other to finish its writes and sync its cache back to main memory, since the list is in shared memory! This problem gets worse and worse the more CPUs you add, resulting in major problems on large servers (where there can be 16 or even 32 CPU cores), and being completely unusable on supercomputers (some of which have upwards of 1000 cores).
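In code, SQMS boils down to a single lock-protected ready list that every core pops from. A toy sketch (the job type and list handling are hypothetical, and a real scheduler would keep ordering and priorities):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job descriptor and a single, global ready list shared by all cores. */
struct job { struct job *next; void (*run)(void); };

static struct job     *ready_head;
static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;

/* Every core calls this when it needs work: one lock, one list.
 * Simple, but the lock and the list's cache lines ping-pong between cores. */
struct job *pick_next_job(void)
{
    pthread_mutex_lock(&ready_lock);
    struct job *j = ready_head;
    if (j)
        ready_head = j->next;
    pthread_mutex_unlock(&ready_lock);
    return j;                 /* may well have last run on a different core */
}

void requeue_job(struct job *j)
{
    pthread_mutex_lock(&ready_lock);
    j->next = ready_head;     /* toy LIFO requeue */
    ready_head = j;
    pthread_mutex_unlock(&ready_lock);
}
```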
Multi-queue Multiprocessor Scheduling (MQMS)
Multi-queue multiprocessor schedulers have a single queue per CPU core, ensuring that all local core scheduling can be done without having to take a shared lock or synchronize the cache. This allows for systems with hundreds of cores to operate without interfering with one another at every scheduling interval, which can happen hundreds of times a second.
The main issues with MQMS are CPU utilization, where one or more CPU cores end up doing the majority of the work while others sit idle, and scheduling fairness, where one of the processes on the computer is scheduled more often than other processes with the same priority.
CPU utilization is the biggest issue - no CPU should ever be idle while a job is waiting to run. However, if all CPUs are busy and a job is queued on some CPU, and a different CPU later becomes idle, the idle CPU should "steal" the queued job from the original CPU to ensure every CPU is doing real work. Doing so, however, requires locking both CPUs' queues and potentially syncing the cache, which may negate any speedup gained by stealing the job.
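A corresponding MQMS sketch keeps one list (and one lock) per core, and only touches another core's list when stealing (again, the types and helpers are hypothetical):

```c
#include <pthread.h>
#include <stddef.h>

#define NCPU 4

struct job { struct job *next; void (*run)(void); };

/* One ready list and one lock per core - the common case never touches another core. */
static struct job     *ready_head[NCPU];
static pthread_mutex_t ready_lock[NCPU] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static struct job *pop_local(int cpu)
{
    pthread_mutex_lock(&ready_lock[cpu]);
    struct job *j = ready_head[cpu];
    if (j)
        ready_head[cpu] = j->next;
    pthread_mutex_unlock(&ready_lock[cpu]);
    return j;
}

/* Only when the local list is empty do we pay the cost of touching another core's list. */
struct job *pick_next_job(int cpu)
{
    struct job *j = pop_local(cpu);
    for (int victim = 0; !j && victim < NCPU; victim++)
        if (victim != cpu)
            j = pop_local(victim);   /* "steal" from another core's queue */
    return j;
}
```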
In conclusion
Both methods exist in the wild - Linux actually has three different mainstream scheduler algorithms, one of which is an SQMS. The choice of scheduler really depends on the way the scheduler is implemented, the hardware you plan to run it on, and the types of jobs you intend to run. If you know you only have two or four cores to run jobs, SQMS is likely perfectly adequate. If you're running a supercomputer where overhead is a major concern, then an MQMS might be the way to go. For a desktop user - just trust the distro, whether that's a Linux OS, Mac, or Windows. Generally, the programmers for the operating system you've got have done their homework on exactly what scheduler will be the best option for the typical use case of their system.
This whitepaper describes the differences between the two types of scheduling algorithms in place.
I am learning about interrupts and couldn't understand what happens when there are too many interrupts, to the point where the CPU can't process the foreground loop or complete the existing interrupts. I read through this article https://www.cs.utah.edu/~regehr/papers/interrupt_chapter.pdf but didn't completely understand how a scheduler would help if there are simply too many interrupts.
Do we switch to a faster CPU if the interrupts can not be missed?
Yes, you have to switch to a faster CPU!
You have to ensure that there is enough time for the main loop. Therefore it is really important to keep your interrupt service routines as short as possible and do some CPU workload tests.
Indeed, any time there is contention over a shared resource, there is the possibility of starvation. The schedulers discussed in the paper limit the interrupt rate, thus ensuring some interrupt-free processing time during each interval. During high activity periods, interrupt handling is disabled, and the scheduler switches to polling mode where it interrogates the state of the interrupt request lines periodically, effectively throttling the stream of interrupts. The operating system strives to do as little as possible in each interrupt handler - tasks are often simply queued so they can be handled later at a different stage. There are many considerations and trade-offs that go into any scheduling algorithm.
Overall you need an idea of how much time each part of your program consumes. This is pretty easy to measure live with an oscilloscope. If you activate a GPIO when entering the interrupt and de-activate it when leaving, you not only get to see how much time the ISR consumes, but also how often it kicks in. If you do this for each ISR you get a good idea of how much time they need. You can then do something similar in main() to get a rough estimate of the complete execution cycle of the program, main + interrupts.
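A minimal sketch of that technique, assuming a hypothetical HAL with gpio_set()/gpio_clear() and an ISR registered for your peripheral:

```c
/* Sketch only: gpio_set()/gpio_clear(), DEBUG_PIN and handle_uart_byte() stand in
 * for whatever your MCU's register writes, HAL calls and ISR body look like. */
#define DEBUG_PIN 5

extern void gpio_set(int pin);
extern void gpio_clear(int pin);
extern void handle_uart_byte(void);

void uart_rx_isr(void)
{
    gpio_set(DEBUG_PIN);      /* rising edge on the scope: ISR entry */
    handle_uart_byte();       /* keep this as short as possible */
    gpio_clear(DEBUG_PIN);    /* falling edge: ISR exit */
}

/* On the oscilloscope, the pulse width is the ISR's execution time and the pulse
 * rate is how often it fires; pulse width x rate gives that ISR's CPU share. */
```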
As for the best solution, it is obviously to reduce the number of interrupts. Use polling if possible. Use DMA. Use serial peripherals (UART, CAN etc.) that are hardware-buffered instead of interrupt-intensive ones. Use hardware PWM instead of output-compare timers. And so on. These things need to be considered early on, when you pick a suitable MCU for your project. If you picked the wrong MCU, then you'll obviously have to change it. Twiddling with the CPU clock sounds like a quick & dirty fix. Get the design right instead.
Typically the CPU runs for a while without stopping, then a system call is made to read from a file or write to a file. When the system call completes, the CPU computes again until it needs more data or has to write more data, and so on.
Some processes spend most of their time computing, while others spend most of their time waiting for I/O. The former are called compute-bound; the latter are called I/O-bound. Compute-bound processes typically have long CPU bursts and thus infrequent I/O waits, whereas I/O-bound processes have short CPU bursts and thus frequent I/O waits.
As CPU gets faster, processes tend to get more I/O-bound.
Why and how?
Edited:
It's not a homework question. I was studying the book (Modern Operating Systems by Tanenbaum) and found this statement there. I didn't get the concept, which is why I am asking here. Please don't tag this question as homework.
With a faster CPU, the amount of time spent using the CPU will decrease (given the same code), but the amount of time spent doing I/O will stay the same (given the same I/O performance), so the percentage of time spent on I/O will increase, and I/O will become the bottleneck.
That does not mean that "I/O bound processes are faster".
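For example (illustrative numbers): suppose a job spends 10 ms computing and 10 ms waiting on the disk per iteration, so half its time goes to the CPU. Make the CPU ten times faster and the same iteration takes 1 ms + 10 ms; now roughly 91% of its time is spent waiting on I/O, and further CPU speedups barely shorten the iteration at all.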
As CPU gets faster, processes tend to get more I/O-bound.
What it's trying to say is:
As CPU gets faster, processes tend to not increase in speed in proportion to CPU speed because they get more I/O-bound.
Which means that I/O bound processes are slower than non-I/O bound processes, not faster.
Why is this the case? Well, when only the CPU speed increases, the rest of your system hasn't increased in speed. Your hard disk is still the same speed, your network card is still the same speed, even your RAM is still the same speed*. So as the CPU increases in speed, the limiting factor for your program becomes less and less the CPU speed and more and more how slow your I/O is. In other words, programs naturally shift to being more and more I/O-bound: as the CPU gets faster, processes tend to get more I/O-bound.
*note: Historically everything else also improved in speed along with the CPU, just not as much. For example CPUs went from 4MHz to 2GHz, a 500x speed increase whereas hard disk speed went from around 1MB/s to 70MB/s, a lame 70x increase.
How does the clock control various events (operations) so that they occur in the desired sequence? What is the significance of the clock cycle time? (I've heard that many operations can be issued in a single clock cycle.)
Or, simply, how does the CPU control operation ordering?
CPUs have various processing units (float, vector, integer), and pipelines of different lengths for each unit.
The clock determines the speed at which operations advance through a pipeline, each tick moving them along one stage. Once an operation reaches the end, the result is sent back to cache/memory.
Multiple pipelines can be active at the same time.
That's all I can tell you..
Ars Technica used to have great articles about this, such as this one:
Understanding the Microprocessor
The clock does not control the sequence of instructions. The clock controls the number of times per second that the CPU "ticks." Each tick is referred to as a cycle, and consequently each cycle takes some time to complete.
The sequence of instructions is dictated by the running program. Modern CPUs also include optimisations that influence the exact sequence.
These optimisations also make the clock speed (= number of cycles per second) less significant. For example, a dual-core CPU is able to execute two instructions in the same cycle (one on each core).
Yes, usually instructions complete in a couple of cycles, and compilers optimise programs to use costly instructions less often.
I have an embedded device (Technologic TS-7800) that advertises real-time capabilities, but says nothing about 'hard' or 'soft'. While I wait for a response from the manufacturer, I figured it wouldn't hurt to test the system myself.
What are some established procedures to determine the 'hardness' of a particular device with respect to real time/deterministic behavior (latency and jitter)?
Being at college, I have access to some pretty neat hardware (good oscilloscopes and signal generators), so I don't think I'll run into any issues in terms of testing equipment, just expertise.
With that kind of equipment, it ought to be fairly easy to sync the o-scope to a steady clock, produce a spike each time the real-time system produces an output, and see how much that spike varies from center. The less the variation, the greater the hardness.
To clarify Bob's answer maybe:
Use the signal generator to generate a pulse at some varying frequency.
Random distribution across some range would be best.
Use the signal generator (trigger signal) to start the scope.
The RTOS has to respond, do its thing and send an output pulse.
Feed the RTOS output into input 2 of the scope.
Set the scope to persist/collect mode.
Get the scope to start on A, stop on B, if you can.
In an ideal world, get it to measure the distribution for you. A LeCroy would.
Start with a much slower trace than you would expect. You need to be able to see slow outliers.
You'll be able to see the distribution.
Assuming a normal distribution, the standard deviation (SD) of the response-time variation is the SOFTNESS.
(This won't really happen in practice, but if you don't get outliers it is reasonably useful. )
If there are outliers of large latency, then the RTOS is NOT very hard: it does not meet deadlines well and is unsuitable for hard real-time work.
Many RTOS-like things have a good left edge to the curve, sloping down like a 1/f curve.
That's indicative of combined jitter. The thing to look out for is spikes of slow response at the right end of the scope. Keep repeating the experiment with faster traces if there are no outliers, to get a good image of the slope. Should be good for some speculative conclusions in your paper.
If, for your application, a delta of say 1us is okay and you measure 0.5us, it's all cool.
Anyway, you can publish the results (possibly in the academic sense, but certainly on the web).
Link from this Question to the paper when you've written it.
Hard real-time has more to do with how your software works than with the hardware on its own. When asking if something is hard real-time, the question must be applied to the complete system (hardware, RTOS and application). This means hard or soft real-time is a system design issue.
Under loading that exceeds the specification, even a hard real-time system will fail (hopefully with a proper failure indication), while a soft real-time system with low loading can give hard real-time results. How much processing must happen in time, and how much pre/post-processing can be performed, is the real key to hard/soft real-time.
In some real-time applications some data loss is not a failure; it just needs to stay below a certain level - again, a system criterion.
You can generate inputs to the board and have a small application count them and check at what level data starts to be lost. But that gives you a rating specific to that system running that application. As soon as you start doing more processing, your computational load increases and you now have a different hard real-time limit.
This board, running a bare-bones scheduler, will give great, predictable hard real-time performance for most tasks.
Running a full RTOS with a heavy computational load, you will probably only get soft real-time.
Edit after comment
The most efficient and easiest way I have used to measure my software's performance (assuming you use a scheduler) is to use a free-running hardware timer on the board and time-stamp the start and end of my cycle. Or, if you run a full RTOS, time-stamp your acquisition and transition. Save your max time and run an average of the values over a second. If your average is around 50% and your max is within 20% of your average, you are OK. If not, it is time to refactor your application. As your application grows the cycle time will grow, so you can monitor the effect of all your software changes on your cycle time.
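A sketch of that approach, assuming a hypothetical free-running microsecond timer read via timer_now_us() and a placeholder log_stats() for whatever terminal output you have:

```c
#include <stdint.h>

extern uint32_t timer_now_us(void);                        /* hypothetical free-running timer */
extern void log_stats(uint32_t avg_us, uint32_t max_us);   /* hypothetical terminal output    */

static uint32_t cycle_max_us;
static uint32_t cycle_sum_us;
static uint32_t cycle_count;

void control_cycle(void)
{
    uint32_t start = timer_now_us();

    /* ... acquisition, processing, output ... */

    uint32_t elapsed = timer_now_us() - start;   /* unsigned math handles timer wrap */
    if (elapsed > cycle_max_us)
        cycle_max_us = elapsed;
    cycle_sum_us += elapsed;
    cycle_count++;
}

/* Call roughly once a second from a background task to report and reset the stats,
 * e.g. flagging trouble if avg exceeds 50% of the budget or max exceeds 1.2 * avg. */
void report_cycle_stats(void)
{
    if (cycle_count == 0)
        return;
    log_stats(cycle_sum_us / cycle_count, cycle_max_us);
    cycle_sum_us = 0;
    cycle_count = 0;
    cycle_max_us = 0;
}
```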
Another way is to use a hardware timer to generate a cyclical interrupt. If you are in time, reset the timer. If you miss the deadline, have the interrupt handler signal a failure. This will only give you a warning once your application is already taking too long, but because it relies on hardware and interrupts, you can't miss it.
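A sketch of that second approach, again with hypothetical timer_restart()/signal_failure() helpers: the timer is re-armed at the start of every cycle for slightly longer than the deadline, so its interrupt only ever fires when a deadline was missed.

```c
/* Sketch: timer_restart() and signal_failure() are placeholders for your hardware/HAL. */
extern void timer_restart(unsigned period_us);
extern void signal_failure(void);

#define DEADLINE_US        5800
#define WATCHDOG_MARGIN_US  200

void control_cycle(void)
{
    timer_restart(DEADLINE_US + WATCHDOG_MARGIN_US);  /* re-arm before doing the work */
    /* ... do the cycle's work; finishing in time means the timer never expires ... */
}

void watchdog_timer_isr(void)
{
    signal_failure();   /* only reached when a deadline was missed */
}
```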
These solutions also eliminate the need to hook up a scope to monitor the output, since the timing information can be displayed in any kind of terminal by a background task. If it is easy to monitor, you will monitor it regularly, catching timing problems as soon as they are introduced rather than having to solve them all at the end.
Hope this helps
I have the same board here at work. It's a slightly-modified 2.6 Kernel, I believe... not the real-time version.
I don't know that I've read anything in the docs yet that indicates that it is meant for strict RTOS work.
I think that this is not a hard real-time device, since it runs no RTOS.
I understand being a geek, but using an oscilloscope to test a computer with Ethernet/USB/other digital ports and HUGE internal state (RAM) is both ineffective and unreliable.
Instead of watching waveforms, you can connect any PC to the output port and run a proper statistical analysis.
The established procedure (if the input signal is analog by nature) is to test system against several characteristic inputs - traditionally spikes, step functions and sine waves of different frequencies - and measure phase shift and variance for each input type. Worst case is then used in specifications of the system.
Again, if you are using standard ports, you can easily generate those on PC. If the input is truly analog, a separate DAC or simply a good sound card would be needed.
Now, that won't say anything about the OS being real-time - it could be running vanilla Linux or even Win CE and still produce good and stable results in those tests if the hardware is fast enough.
So, you need to simulate heavy and varying loads on the processor, memory and all ports, let it heat up and eat memory for a few hours, and then repeat the tests. If latency stays constant, it's hard real-time. If it does not increase above an acceptable limit under any load and input signal type, it's soft. Otherwise, it's advertisement.
P.S.: The implication is that even for critical systems you don't actually need hard real-time if the hardware is fast enough.