I have an embedded device (Technologic TS-7800) that advertises real-time capabilities, but says nothing about 'hard' or 'soft'. While I wait for a response from the manufacturer, I figured it wouldn't hurt to test the system myself.
What are some established procedures to determine the 'hardness' of a particular device with respect to real time/deterministic behavior (latency and jitter)?
Being at college, I have access to some pretty neat hardware (good oscilloscopes and signal generators), so I don't think I'll run into any issues in terms of testing equipment, just expertise.
With that kind of equipment, it ought to be fairly easy to sync the o-scope to a steady clock, produce a spike each time the real-time system produces an output, and see how much that spike varies from center. The less the variation, the greater the hardness.
To clarify Bob's answer maybe:
Use the signal generator to generate a pulse at some varying frequency.
Random distribution across some range would be best.
Use the signal generator (trigger signal) to start the scope.
The RTOS has to respond, do its thing and send an output pulse.
Feed the RTOS output into input 2 of the scope.
Set the scope to persist/collect mode.
Get the scope to start on A, stop on B, if you can.
In an ideal world, get it to measure the distribution for you. A LeCroy would.
Start with a much slower trace than you would expect. You need to be able to see slow outliers.
You'll be able to see the distribution.
Assuming a normal distribution the SD of the response time variation is the SOFTNESS.
(This won't really happen in practice, but if you don't get outliers it is reasonably useful. )
If there are outliers of large latency, then the RTOS is NOT very hard. It does not meet deadlines well, and is then unsuitable for hard real-time work.
Many RTOS-like things have a good left edge to the curve, sloping down like a 1/f curve.
That's indicative of combined jitters. The thing to look out for is spikes of slow response at the right-hand end of the trace. Keep repeating the experiment with faster traces if there are no outliers, to get a good image of the slope. Should be good for some speculative conclusion in your paper.
If, for your application, say a delta of 1 µs is okay and you measure 0.5 µs, it's all cool.
Anyway, you can publish the results (possibly in the formal publishing sense, but certainly on the web).
Link from this Question to the paper when you've written it.
Hard real-time has more to do with how your software works than with the hardware on its own. When asking if something is hard real-time, the question must be applied to the complete system (hardware, RTOS and application). This means hard versus soft real-time is a system design issue.
Under loading exceeding the specification, even a hard real-time system will fail (hopefully with proper failure indication), while a soft real-time system with low loading can give hard real-time results. How much processing must happen in time and how much pre/post-processing can be performed is the real key to hard/soft real-time.
In some real-time applications some data loss is not a failure; it just has to stay below a certain level, which is again a system criterion.
You can generate inputs to the board and have a small application count them and check at what level data is going to be lost. But that gives you a rating specific to that system running that application. As soon as you start doing more processing your computational load increases and you now have a different hard real-time limit.
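A minimal sketch of such a counting application, assuming the test inputs carry an incrementing sequence number and `read_next_input()` is a placeholder for however you receive them on the PC side:

```c
#include <stdint.h>
#include <stdio.h>

/* Placeholder: blocks until the next test input arrives and returns the
   incrementing sequence number the generator stamped on it. */
extern uint32_t read_next_input(void);

int main(void)
{
    uint32_t expected = read_next_input() + 1;
    uint32_t received = 1, lost = 0;

    for (;;) {
        uint32_t seq = read_next_input();
        if (seq != expected)
            lost += seq - expected;   /* gap in sequence numbers = dropped inputs */
        expected = seq + 1;
        received++;

        if (received % 10000 == 0)
            printf("received %lu, lost %lu\n",
                   (unsigned long)received, (unsigned long)lost);
    }
}
```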
This board, running a bare-bones scheduler, will give great, predictable hard real-time performance for most tasks.
Running a full RTOS with a heavy computational load, you will probably only get soft real-time.
Edit after comment
The most efficient and easiest way I have used to measure my software's performance (assuming you use a scheduler) is to use a free-running hardware timer on the board and time-stamp the start and end of my cycle. Or, if you run a full RTOS, time-stamp your acquisition and transition. Save your maximum time and compute an average of the values over a second. If your average is around 50% and your maximum is within 20% of your average, you are OK. If not, it is time to refactor your application. As your application grows, the cycle time will grow. You can monitor the effect of all your software changes on your cycle time.
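Here's a minimal C sketch of that timestamping approach, assuming a memory-mapped free-running 32-bit timer; the register address, tick rate and `do_cycle_work()` are placeholders for whatever your board and application provide:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical free-running 32-bit hardware timer register and its tick rate;
   substitute the real register address and frequency for your board. */
#define TIMER_COUNT   (*(volatile uint32_t *)0x40001000u)
#define TICKS_PER_US  50u

extern void do_cycle_work(void);   /* acquisition, processing, output, etc. */

static uint32_t cycle_max;         /* worst cycle time seen, in timer ticks  */
static uint32_t cycle_sum;         /* running sum for the one-second average */
static uint32_t cycle_samples;     /* number of cycles in the current second */

void task_cycle(void)
{
    uint32_t start = TIMER_COUNT;

    do_cycle_work();

    uint32_t elapsed = TIMER_COUNT - start;   /* wrap-safe unsigned subtract */

    if (elapsed > cycle_max)
        cycle_max = elapsed;
    cycle_sum += elapsed;
    cycle_samples++;
}

/* Called once per second by a background task: report and reset the stats. */
void report_cycle_stats(void)
{
    uint32_t avg = cycle_samples ? cycle_sum / cycle_samples : 0;
    printf("cycle avg %lu us, max %lu us\n",
           (unsigned long)(avg / TICKS_PER_US),
           (unsigned long)(cycle_max / TICKS_PER_US));
    cycle_max = cycle_sum = cycle_samples = 0;
}
```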
Another way is to use a hardware timer to generate a cyclical interrupt. If you are in time, reset the interrupt. If you miss the deadline, have the interrupt handler signal a failure. This, however, will only warn you once your application is already taking too long, but it relies on hardware and interrupts, so you can't miss it.
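And a sketch of that cyclic-interrupt deadline check; the timer setup and ISR hookup are left out because they are hardware specific:

```c
#include <stdbool.h>
#include <stdint.h>

static volatile bool cycle_done;          /* set by the application each cycle */
static volatile uint32_t missed_deadlines;

/* Application: call this at the end of each processing cycle. */
void mark_cycle_complete(void)
{
    cycle_done = true;
}

/* Hypothetical ISR hooked to a hardware timer firing once per deadline period.
   If the application hasn't checked in since the last tick, record the miss. */
void deadline_timer_isr(void)
{
    if (!cycle_done) {
        missed_deadlines++;               /* or assert a failure output pin */
    }
    cycle_done = false;                   /* re-arm for the next period */
    /* clear the timer interrupt flag here (hardware specific) */
}
```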
These solutions also eliminate the requirement to hook up a scope to monitor the output, since the timing information can be displayed in any kind of terminal by a background task. If it is easy to monitor, you will monitor it regularly, solving timing problems as soon as they are introduced rather than at the end.
Hope this helps
I have the same board here at work. It's a slightly modified 2.6 kernel, I believe... not the real-time version.
I don't know that I've read anything in the docs yet that indicates that it is meant for strict RTOS work.
I think that this is not a hard real-time device, since it runs no RTOS.
I understand being a geek, but using an oscilloscope to test a computer with Ethernet/USB/other digital ports and a HUGE internal state (RAM) is both ineffective and unreliable.
Instead of watching wave forms, you can connect any PC to the output port and run proper statistical analysis.
The established procedure (if the input signal is analog by nature) is to test system against several characteristic inputs - traditionally spikes, step functions and sine waves of different frequencies - and measure phase shift and variance for each input type. Worst case is then used in specifications of the system.
Again, if you are using standard ports, you can easily generate those on PC. If the input is truly analog, a separate DAC or simply a good sound card would be needed.
Now, that won't say anything about the OS being real-time: it could be running vanilla Linux or even Win CE and still produce good and stable results in those tests if the hardware is fast enough.
So, you need to simulate heavy and varying loads on the processor, memory and all ports, let it heat up and eat memory for a few hours, and then repeat the tests. If latency stays constant, it's hard real-time. If, under any load and input signal type, it doesn't increase above an acceptable limit, it's soft real-time. Otherwise, it's advertisement.
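As a sketch of the PC-side statistical analysis, something like the following collects latency samples and reports mean, standard deviation and worst case; `send_stimulus()` and `wait_for_response()` are placeholders for whatever port you drive the board through:

```c
#include <math.h>
#include <stdio.h>
#include <time.h>

/* Placeholders: send a stimulus to the device under test and block until
   its response arrives on whatever port you use (serial, UDP, ...). */
extern void send_stimulus(void);
extern void wait_for_response(void);

int main(void)
{
    enum { N = 10000 };
    double sum = 0.0, sum_sq = 0.0, worst = 0.0;

    for (int i = 0; i < N; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        send_stimulus();
        wait_for_response();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e3;
        sum    += us;
        sum_sq += us * us;
        if (us > worst)
            worst = us;
    }

    double mean = sum / N;
    double sd   = sqrt(sum_sq / N - mean * mean);
    printf("latency: mean %.1f us, sd %.1f us, worst %.1f us\n",
           mean, sd, worst);
    return 0;
}
```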
P.S.: The implication is that even for critical systems you don't actually need hard real-time if your hardware is fast enough.
Related
I'm trying to write code in which, every 1 ms, a number incremented by one should replace the old number (something like a chronometer!).
The problem is that whenever CPU usage increases because of other programs running on the PC, this 1 millisecond also increases and the timing in my program changes!
Is there any way to prevent CPU load changes from affecting the timing in my program?
It sounds as though you are trying to generate an analogue output waveform with a digital-to-analogue converter card using software timing, where your software is responsible for determining what value should be output at any given time and updating the output accordingly.
This is OK for stationary or low-speed signals but you are trying to do it at 1 ms intervals, in other words to output 1000 samples per second or 1 ks/s. You cannot do this reliably on a desktop operating system - there are too many other processes going on which can use CPU time and block your program from running for many milliseconds (or even seconds, e.g. for network access).
Here are a few ways you could solve this:
Use buffered, hardware-clocked output if your analogue output device supports it. Instead of writing one sample at a time, you send the device a waveform or array of samples and it outputs them at regular intervals using a timing signal generated in hardware. Unfortunately, low-end DAQ devices often don't support hardware-clocked output.
Instead of expecting the loop that writes your samples to the AO to run every millisecond, read LabVIEW's Tick Count (ms) value in the loop and use that as an index into your array of samples: rather than trying to output every sample, your code will now say 'what time is it now, and therefore what should the output be?' That won't give you a perfect signal out, but at least now it should keep the correct frequency rather than be 'slowed down'; instead you will see glitches imposed on the signal whenever the loop can't keep up. This is easy to test and maybe it will be adequate for your needs (a rough C sketch of this time-indexing idea follows the list of options below).
Use a real-time operating system instead of a desktop OS. In the case of LabVIEW this would mean using the Real-Time software module and either a National Instruments hardware device that supports RT, such as the CompactRIO series, or installing the RT OS on a dedicated PC if the hardware is compatible. This is not a cheap option, obviously (unless it's strictly for personal, home use). In any case you would need to have an RT-compatible driver for your output device.
Use your computer's sound output as the output device. LabVIEW has functions for buffered sound output and you should be able to get reliable results. You'll need to upsample your signal to one of the sound output's available sample rates, probably 44.1 ks/s. The drawbacks are that the output level is limited in range and is not calibrated, and will probably be AC-coupled so you can't output a DC or very low-frequency signal. However if the level is OK for what you want to connect it to, or you can add suitable signal conditioning, this could be a neat solution. If you need the output level to be calibrated you could simultaneously measure it with your DAQ card and scale the sound waveform you're outputting to keep it correct.
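Here is the time-indexing idea from the second option above as a rough C sketch; `tick_count_ms()` and `write_analog_output()` are hypothetical stand-ins for LabVIEW's Tick Count (ms) and your DAC write call:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: millisecond tick counter and single-sample output. */
extern uint32_t tick_count_ms(void);
extern void write_analog_output(double value);

/* Waveform to play back at 1 sample per millisecond. */
extern const double waveform[];
extern const size_t waveform_len;

void output_loop(void)
{
    uint32_t start = tick_count_ms();

    for (;;) {
        /* Ask "what time is it now?" and output the sample that belongs to
           this instant, instead of assuming the loop ran every 1 ms. */
        uint32_t elapsed = tick_count_ms() - start;
        size_t index = elapsed % waveform_len;
        write_analog_output(waveform[index]);
        /* If the loop is delayed, samples get skipped (a glitch), but the
           waveform's overall frequency stays correct. */
    }
}
```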
The answer to your question is "not on a desktop computer." This is why products like LabVIEW Real-Time and dedicated deterministic hardware exist: you need a computer built around dedication to a particular process in order to consistently serve that process. Every application in a regular Windows/Mac/Linux desktop system has the problem you are seeing of potentially being interrupted by other system processes, particularly in its UI layer.
There is no way to prevent CPU load changes from affecting timing in your program unless the computer has a real-time clock.
If it doesn't have a real-time clock, there is no reason to expect it to behave deterministically. Do you need your program to run at that pace?
I'm currently brainstorming an idea of mine that involves a P2P render farm, somewhat like renderfarm.fi but with the difference that you pay for the service and contributors to the processing pool get paid.
Currently render farms measure price based on GHz/h, but when the computers doing the rendering are untrusted, is there a good way to measure the equivalent GHz/h of a computer, considering the computers could be partially loaded with other programs slowing down the true time spent rendering, etc.?
Because your worker processes can ask the OS counters how much execution time they've received, and that can be matched up with progress on the work package, you can pay out based on work units completed but charge in GHz/h. You know you can't trust the user's clock (or anything else, for that matter), but you can verify the work units returned and approximate their computational complexity by combining the program counters from multiple peers.
You have no way to know for sure whether the system is particularly loaded, but you do know if work went out and came back. However, you will have to verify that the work was done correctly. That probably means over-provisioning and running every render twice on two different machines, to ensure someone isn't inserting garbage results that are faster to compute.
Good luck. I don't know how you'll be able to beat the likes of Amazon with them charging ~$0.10 per GHz/h.
The operating system can, and most likely will, measure the actual CPU time taken up by the process. As such, that can be used as a measure of how much practical time the process itself has spent running on the machine's CPU. The CPU time doesn't get skewed in either direction by other processes running in the background, so it's ideal for this purpose.
CPU time is itself the resource such rendering services sell, so it's logical to measure it on a per-user/client basis and then price the service accordingly by the CPU time spent by each user/client of the render farm.
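On a POSIX worker, for example, a sketch of this measurement could use the per-process CPU clock (the `render_work_unit()` call is a placeholder for one verifiable unit of rendering):

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for rendering one verifiable work unit. */
extern void render_work_unit(void);

int main(void)
{
    struct timespec t0, t1;

    /* CPU time actually charged to this process, independent of how busy
       the rest of the machine is. */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t0);
    render_work_unit();
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

    double cpu_seconds = (t1.tv_sec - t0.tv_sec) +
                         (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("work unit consumed %.3f s of CPU time\n", cpu_seconds);
    return 0;
}
```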
How should I decide system requirements like:
RAM capacity
FLASH memory capacity
Processor frequency
etc
I am building an application to control NAND FLASH, LCD driver, UART control, keypad control using a 16 bit micro-controller.
This has to be estimated from previous projects with similar functionality, or even from other people's products. But it is best to develop with a larger capacity and decide on final parts when your software nears completion, because it's easier to omit components than to try and find room for them later. This kind of design can be an iterative process: start with one estimate and see if a prototype works, and don't commit to volumes until you are nearly at the end.
In the case of an LCD-based product, you will have two major components using up flash memory: the code and the LCD data (character strings, bitmaps etc). It's certainly easier to estimate the LCD data than the code, which depends on functionality, compiler optimisations etc. If you are bringing in external libraries then at least you already have code for them.
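As an illustration of how the LCD data estimate might be tallied (all the numbers below are made up; plug in your own string counts, font and bitmap sizes):

```c
/* Back-of-the-envelope flash budget for the LCD data. Illustrative numbers
   only; substitute your own counts and dimensions. */
#define NUM_STRINGS        200            /* UI messages                   */
#define AVG_STRING_LEN      24            /* bytes per string, incl. NUL   */
#define FONT_GLYPHS         96            /* printable ASCII               */
#define GLYPH_BYTES        (8 * 16 / 8)   /* 8x16 pixels, 1 bpp            */
#define NUM_BITMAPS          4            /* logos, icons                  */
#define BITMAP_BYTES       (128 * 64 / 8) /* 128x64 pixels, 1 bpp          */

enum {
    LCD_DATA_BYTES = NUM_STRINGS * AVG_STRING_LEN
                   + FONT_GLYPHS * GLYPH_BYTES
                   + NUM_BITMAPS * BITMAP_BYTES   /* roughly 10 KB here */
};
```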
In any case, have an upgrade plan. The worst thing is to run out of capacity at the end of the project and be struggling to squeeze the last feature or debug fix in without adding another problem. Make sure you know what the next size up in chips is and how you can get them to fit; sometimes a PCB can be designed to take various different chips in the same position. Or have an expandable system, where you can plug things into a memory bus.
How many units will you be making?
If your volumes are low (<1e3) but per-unit profits are high and time to market matters, more hardware will get the development done sooner.
If the volumes are huge (>1e6) and profits per unit are low, then you penny-pinch the hardware, but time to develop will go up. If time to market matters, that's a tradeoff.
Design the board with 2x the capacity (RAM/flash), but don't load the parts, other than to check it works.
Then if you run out of room, there is somewhere to go.
Will customers expect to get firmware updates? Or is this a drop-ship product with no support? Supportable is harder and needs more resources.
You'll need to pad resources to have room to expand into if the product needs support for a long time.
For CPU frequency estimates, how much work is required to be done?
Get an Eval board for a likely MCU and prove out the core function.
Let us say it's a display for a piece of exercise equipment. Can it keep up with the sensors on the device at 2-3x the designed pace? That's reading the sensors and updating the display. If cost is required to be low, you can then underclock the eval board and see what trades can be made.
How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
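As a rough illustration of that streaming style, here is a double-buffered DMA loop sketched with the Cell SDK's `spu_mfcio.h` macros; the chunk size and `process_chunk()` are assumptions, and it glosses over edge cases such as a total size that isn't a multiple of the chunk size:

```c
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384   /* bytes per DMA transfer; must fit in local store */

/* Two local-store buffers so we can process one while fetching the other. */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, unsigned size);   /* your kernel */

void stream_in(uint64_t ea, uint64_t total)
{
    int cur = 0;

    /* Kick off the first transfer from main memory into local store. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (uint64_t off = 0; off < total; off += CHUNK) {
        int nxt = cur ^ 1;

        /* Start fetching the next chunk while we work on the current one. */
        if (off + CHUNK < total)
            mfc_get(buf[nxt], ea + off + CHUNK, CHUNK, nxt, 0, 0);

        /* Wait for the current chunk's DMA tag to complete, then process. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = nxt;
    }
}
```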
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They are in-order processors with no cache. This makes it rather different from a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPU processors do not use the same instruction set as the PowerPC, so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel, and each sub-task can be scheduled on an available SPU. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs per number of transistors available on the chip.
Completely agree with George Philips and Crashworks. The only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for the results for a frame and the rest of your SPUs are sitting idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations (a small sketch follows this list).
Floating point "gotchas".
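For the SIMD point above, a tiny sketch with SPU intrinsics (illustrative only; assumes the buffer is 16-byte aligned and its length is a multiple of four floats):

```c
#include <spu_intrinsics.h>

/* Scale-and-offset a buffer four floats at a time. */
void scale_offset(vec_float4 *data, int quadwords, float scale, float offset)
{
    vec_float4 vscale  = spu_splats(scale);
    vec_float4 voffset = spu_splats(offset);

    for (int i = 0; i < quadwords; i++) {
        /* fused multiply-add on four floats per iteration */
        data[i] = spu_madd(data[i], vscale, voffset);
    }
}
```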
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.
Currently I am testing some RTL. I am using ncverilog, and it is very... very slow. I have heard that if we use some kind of FPGA board, then things will be faster. Is it for real?
You're talking about two different things.
NCVerilog is a simulation tool while an FPGA board is real hardware. So, there will be differences. Real hardware will be generally faster but with a simulator, you can have all sorts of debugging fun. Trying to probe a specific signal is just a matter of adding a line to the testbench. Also, you can easily make changes to the simulated model instead of having to redesign the FPGA board.
If you run simulation on a sufficiently powerful machine, you can sometimes approximate real-world performance (assuming that the FPGA is a slow one).
All in all, you should do both. Use a simulator to do your basic development and evaluation. Move onto your FPGA hardware once your design is sufficiently well defined.
We've had the same issues with simulation speed too. However, we stick with simulations for the majority of our verification. Each sim checks a specific function and is much quicker than a system-level sim. We've also made them self-checking, which makes them useful for regression tests (unit tests).
For long system tests on real-world signals that take too much time to simulate, we move these to the FPGA if we can. We need to manually re-check all these testcases again after code changes, so it can be slow in its own way.
Sometimes though, FPGAing a design is just not feasible. Sometimes full designs are too large to fit into an FPGA, or the clock rate is too high. But remember that you don't necessarily have to FPGA your entire design, it may be enough to get the important block you're interested in and check this out fully.
You can trace activity on signals in a running FPGA design using "embedded logic analyzer" software tools like Altera SignalTap or Xilinx ChipScope. Before synthesizing/mapping your RTL to the device, you would use these tools to attach soft probes to the signals you want to watch. You can set triggers so that a signal's values only get logged under certain conditions. Then you generate the bitfile and program the device with JTAG. The logic analyzer communicates with your PC over JTAG and logs activity on your probes, which you can then analyze.
It's a bit complicated to set up, as these tools are not especially easy to use, but you will get results much faster than with RTL simulation.
What kind of RTL are you testing? If you use FPGA boards, then you can compile your code, provided you have the right tool for the right FPGA. Since FPGAs are reprogrammable, you can of course test your code on the board and have the target (FPGA) execute your code (RTL).
But it is no longer a simulation; it is a test, with given hardware, at a given clock speed.
And you don't get nice results on the screen; you need to use physical probes and a scope. Plus you don't get to see how the internals of your code are working.
Verilog or VHDL simulation is sort of like running code using a debugger. FPGA testing is more like debugging with printf. The big difference is that when simulating, your CPU has to simulate the behaviour of all the logic gates that result from your code. On the FPGA, there is no simulation; you just 'run' the code, so it is much faster, but you have less information.
You should use simulation for very small components, and then test your whole design on an FPGA.
You're probably asking about hardware simulation accelerators.
Here is one of them : GateRocket