Complexity slowdown for stored program computer - time-complexity

The Wikipedia page on Turing machines states that a universal Turing machine is slower than the machines it simulates by at most a log factor. I was curious - what is the equivalent in real life, comparing a pure hardware solution (a non-stored-program computer, e.g. an ASIC) vs. a stored-program computer? Is it also a log factor?

The Wikipedia page is a bit cautious. In fact, you can write a one-tape
universal machine for one-tape machines with essentially no overhead.
I am not aware of a precise reference, but the result somehow belongs to the
folklore of the subject. There is only a small "gray" zone regarding the
simulation of programs with subquadratic complexity. This is related to
the fact that the universal machine may need to reshape the input before starting
the simulation, and such an operation, on a single-tape machine, may already
take quadratic time.
This is in turn related to the general problem of encoding tuples on
single-tape machines, which, as far as I can tell, has been somewhat neglected,
even though it is central to complexity issues.
My personal feeling, at present, is that you can overcome the problem of
simulating machines with subquadratic complexity, obtaining a really
fair universal machine.
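For reference, the standard textbook bounds behind the Wikipedia statement, together with the folklore claim above, can be summarized as follows (the first two are the classical Hennie-Stearns / Hartmanis-Stearns simulation results, with T(n) the running time of the simulated machine):

```latex
\[
  T_{\mathrm{sim}}(n) \;=\;
  \begin{cases}
    O\bigl(T(n)\log T(n)\bigr) & \text{two-tape universal machine simulating a $k$-tape machine,}\\[2pt]
    O\bigl(T(n)^{2}\bigr)      & \text{one-tape machine simulating a $k$-tape machine,}\\[2pt]
    O\bigl(T(n)\bigr)          & \text{one-tape universal machine simulating one-tape machines (folklore claim above).}
  \end{cases}
\]
```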

Related

How do computers switch between different processes (is it mainly an OS thing)?

Early computers, such as the ENIAC, had to have their program memory (a.k.a. instruction memory) changed manually in order for different programs to run. This would involve changing the tape or punchcard on which instructions were stored, so that every time a new program was to be run, the tape or punchcard had to be changed.
This limitation of early computers was in part due to the low informational density of rolls of tape compared to modern HDDs, but it was also partly due to the idea (and please correct me if I'm wrong) that each roll of tape was supposed to store only a single program.
In contrast, modern computers can switch between many different programs without having to physically replace ROM. It's easy to switch from one active window to another, or to start running a new program with a few mouse clicks. We now have HDDs and SSDs rather than punchcards and tape, so we can simply have all the programs we want to run stored on a single SSD and a few HDDs that are connected to our computers all the time. And we never need to change our storage devices until they break.
I hope the above is enough to motivate the following question:
What are some typical low-level features (w.r.t. hardware and/or software) that enable modern computers to switch between different processes or programs stored in ROM, as opposed to simply treating ROM as a container for single programs as did the computers of yore?
HDDs and SSDs are not equivalent to ROM. There is a distinction between a program and a process: a process is a program in execution. Multiple programs are stored on HDDs and SSDs. When a program is loaded into RAM it becomes a process; a loader, which is part of the OS, does that. Multiple programs can be loaded into RAM at the same time. A context switch is an OS function, and it involves moving not only the program's instructions but its data (registers, stack, memory state) as well. A lot of low-level features are involved; it is not possible to list all of them. In short, yes, it is mainly an OS thing.
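To make the save/restore idea concrete, here is a minimal user-space toy using the POSIX <ucontext.h> API. This is not how an OS actually performs a context switch (that happens in kernel mode and also switches address spaces), but it shows that switching between two flows of execution amounts to saving one set of registers and stack state and restoring another:

```c
/* Toy illustration only: a "context switch" is saving one execution
 * state (registers, stack pointer, program counter) and restoring
 * another.  A real OS does this per process in kernel mode. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;

static void task(void)
{
    for (int i = 0; i < 3; i++) {
        printf("task: step %d\n", i);
        /* save the task's state, resume "main" -- one context switch */
        swapcontext(&task_ctx, &main_ctx);
    }
}

int main(void)
{
    static char stack[64 * 1024];          /* the task's own stack */

    getcontext(&task_ctx);
    task_ctx.uc_stack.ss_sp   = stack;
    task_ctx.uc_stack.ss_size = sizeof stack;
    task_ctx.uc_link          = &main_ctx; /* where to go if task returns */
    makecontext(&task_ctx, task, 0);

    for (int i = 0; i < 3; i++) {
        printf("main: switching to task\n");
        swapcontext(&main_ctx, &task_ctx); /* save main, restore task */
    }
    return 0;
}
```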

Virtualize specific environment (CPU, cache, clock)

I have written some code that's supposed to run on a certain hardware-setup. I'd like to test it to get some preliminary metrics, but without buying the hardware setup, since it's very expensive.
At first I, naively, thought I could set some specifications to the platform when creating a virtual machine through a manager such as VMware Workstation, but it seems like it's not possible.
What ways do you believe would be the best to emulate a certain environment? Of course, RAM, disk space and OS should be fairly easy, but limiting the CPU seems to be the general issue.
I'm trying to simulate the Intel Atom® Processor E3845, so I have some requirements on the maximum number of cores, the cache size and of course the clock frequency.
The closest I've found so far would be to install VMware ESXi on a piece of hardware and limit the CPU, but I'm unsure if this is the best way. Further, I've never really worked with this before, which is why I'm unsure whether I can limit the cache and so forth. Simply "down-scaling" the metrics does not feel like a good solution when we are rather dependent on the cache (that is, we've seen issues with certain sizes and speeds).
I would love to hear any input you might have.

Why is RDTSC a virtualized instruction on modern processors?

I am studying RDTSC and learning about how it is virtualized for the purposes of virtual machines like VirtualBox and VMWare. Why did Intel/AMD go to all the trouble of virtualizing this instruction?
I feel like it could easily be simulated with a trap, and it's not exactly a super-common instruction (I tested, and there's no noticeable slowdown for general usage in a virtual machine where hardware RDTSC virtualization is disabled).
However, I know Intel/AMD wouldn't have gone to all the trouble of adding this instruction to the virtualization hardware unless it was important for it to execute very fast.
Does anyone know why?
It's common to use RDTSC to get fine-grained timing information, where the overhead of a virtualization trap would be quite significant. The most common use is to place two RDTSC instructions with a small amount of code between them, taking the difference of the two readings as the elapsed time (number of cycles) for the code sequence. So even the overhead of pipeline drains/flushes is quite significant.
Also, since all the instruction does is read a continuously running counter, virtualizing it is quite easy -- the hardware only needs to allow saving/reloading the counter value on VM context switches, and not anything special for the RDTSC instruction itself.
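For illustration, here is a minimal sketch of that measurement pattern, assuming an x86-64 GCC/Clang toolchain that provides the __rdtsc() intrinsic in <x86intrin.h>; the loop being timed is just a placeholder. If RDTSC had to trap to the hypervisor, each of the two reads would pay a VM-exit, swamping the thing being measured:

```c
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

int main(void)
{
    volatile uint64_t sink = 0;

    uint64_t start = __rdtsc();        /* first RDTSC */
    for (int i = 0; i < 1000; i++)     /* the code under measurement */
        sink += i;
    uint64_t end = __rdtsc();          /* second RDTSC */

    /* note: a rigorous benchmark would also add serializing
     * instructions (e.g. CPUID, or RDTSCP plus LFENCE) around the
     * reads to keep out-of-order execution from skewing the result */
    printf("elapsed: %llu cycles (sink=%llu)\n",
           (unsigned long long)(end - start),
           (unsigned long long)sink);
    return 0;
}
```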
VMs should be able to have separate TSCs because they start up at different times. The physical CPU just has one, so something is needed to at least get individual, per-VM TSC offsets.
Also, since VMs don't fully own the underlying physical CPUs - that is, they don't get to execute on them all the time - their TSCs should also somehow reflect those "on/off" periods, and it is desirable that they don't change abruptly in value with respect to actual time (which the VMs should still get roughly right from the host OS), because there's a lot of software that is virtualization-unready and can break when the numbers are too far off.
I think these are the reasons why RDTSC is virtualized. But whatever you do, meeting conflicting requirements is tough, and they complicate matters. You can't hide virtualization and have VMs run at near-native speed at the same time. There are trade-offs, and some things have to give.
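The per-VM offset mentioned above is essentially what the hardware support boils down to: Intel VT-x keeps a TSC-offset field in the VMCS (AMD-V has an analogous field in the VMCB), and a guest RDTSC returns the host counter shifted by that constant, with no trap. A rough sketch of the arithmetic (newer parts also add a scaling multiplier, omitted here):

```c
#include <stdint.h>

/* What TSC offsetting effectively computes for a guest RDTSC: the
 * host's free-running counter plus a per-VM constant, so each VM can
 * appear to start at zero and be adjusted across pauses or migrations
 * without trapping on every read. */
static inline uint64_t guest_tsc(uint64_t host_tsc, int64_t tsc_offset)
{
    return host_tsc + (uint64_t)tsc_offset;
}
```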

Virtual memory beyond page tables

I am working on a research project to develop an OS for a many-core (1000+) chip. We are looking into implementing a virtual-memory-style system for memory permissions (read/write/execute) that would allow memory to be safely shared across cores.
Basically we want a system that would allow us to mark a 'page' as being readable by some subset of cores, writable by another, etc. We are not going to be doing address translation (at least at this point), but we need a way to efficiently set and query permissions. It is going to be a software-filled data structure with a simple TLB-style cache.
Our intuition is that simply replicating page tables for each core will be too expensive (in terms of memory usage).
What data structures would be efficient for this kind of problem?
Thanks.
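To make the memory-cost intuition concrete, here is the most naive per-page encoding one could try - one read/write/execute core bitmap per page (names and sizes are illustrative, not from the question):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_CORES 1024
#define WORDS     (NUM_CORES / 64)

/* Naive per-page permission record: one bit per core per right. */
struct page_perm {
    uint64_t read[WORDS];    /* bit c set => core c may read    */
    uint64_t write[WORDS];   /* bit c set => core c may write   */
    uint64_t exec[WORDS];    /* bit c set => core c may execute */
};

static inline bool core_may_write(const struct page_perm *p, unsigned core)
{
    return (p->write[core / 64] >> (core % 64)) & 1u;
}
```

With 1024 cores this is 3 x 128 = 384 bytes of metadata per 4 KiB page, roughly 9% overhead before any sharing, grouping of cores, or compression - which is the kind of cost such a design has to avoid.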
Have you looked at how common multi-core (2-12 core) CPUs address this problem?
Do you know where/when/why/how the solution used in these common multi-core CPUs will not scale to 1,000+ cores?
In other words, can you quantify what's wrong with the existing solution, which is working, and has been working, with common CPUs whose core count is <= 12?
If you know that, then the answer is closer than you think, because it just requires understanding how AMD/Intel solved the problem on a lesser scale, and what's needed to make their solution work on a greater one (maybe more memory for tables, algorithm tweaks, etc.).
Look at AMD's/Intel's data structures, then build a software simulator for 1,000+ cores with those data structures, and see where/when/why and how your simulation fails - if it fails...
Ideally, build your simulator with a user-selectable number of cores, then TEST, TEST, TEST with different numbers of cores, working your way up and noting bottlenecks along the way.
Your simulator should work EXACTLY as well as AMD (if you're using AMD data structures) or Intel (if you're using Intel data structures) at the same core count as one of their chips, because that should prove that THEY (AMD/Intel) are doing what they're doing correctly (because they are), and because that will help prove that your simulator is doing its simulation correctly at a specific number of cores.
Wishing you luck!

Testing Real Time Operating System for Hardness

I have an embedded device (Technologic TS-7800) that advertises real-time capabilities, but says nothing about 'hard' or 'soft'. While I wait for a response from the manufacturer, I figured it wouldn't hurt to test the system myself.
What are some established procedures to determine the 'hardness' of a particular device with respect to real time/deterministic behavior (latency and jitter)?
Being at college, I have access to some pretty neat hardware (good oscilloscopes and signal generators), so I don't think I'll run into any issues in terms of testing equipment, just expertise.
With that kind of equipment, it ought to be fairly easy to sync the o-scope to a steady clock, produce a spike each time the real-time system produces an output, and see how much that spike varies from center. The less the variation, the greater the hardness.
To clarify Bob's answer, maybe:
Use the signal generator to generate a pulse at some varying frequency.
A random distribution across some range would be best.
Use the signal generator (trigger signal) to start the scope.
The RTOS has to respond, do its thing and send an output pulse.
Feed the RTOS output into input 2 of the scope.
Put the scope into persist/collect mode.
Get the scope to start on A, stop on B, if you can.
In an ideal world, get it to measure the distribution for you. A LeCroy would.
Start with a much slower trace than you would expect. You need to be able to see slow outliers.
You'll be able to see the distribution.
Assuming a normal distribution, the SD of the response-time variation is the SOFTNESS.
(This won't really happen in practice, but if you don't get outliers it is reasonably useful.)
If there are outliers of large latency, then the RTOS is NOT very hard: it does not meet deadlines well and is therefore unsuitable for hard real-time work.
Many RTOS-like things have a good left edge to the curve, sloping down like a 1/f curve.
That's indicative of combined jitters. The thing to look out for is spikes of slow response at the right end of the scope. Keep repeating the experiment with faster traces if there are no outliers, to get a good image of the slope. That should be good for some speculative conclusions in your paper.
If, for your application, say, a delta of 1 µs is okay and you measure 0.5 µs, it's all cool.
Anyway, you can publish the results (probably in the publishing sense, but certainly on the web).
Link from this question to the paper when you've written it.
Hard real-time has more to do with how your software works than with the hardware on its own. When asking if something is hard real-time, the question must be applied to the complete system (hardware, RTOS and application); hard or soft real-time is a system design issue.
Under loading that exceeds the specification, even a hard real-time system will fail (hopefully with a proper failure indication), while a soft real-time system under low loading could give hard real-time results. How much processing must happen on time, and how much can be done as pre-/post-processing, is the real key to hard vs. soft real-time.
In some real-time applications some data loss is not a failure; it just has to stay below a certain level - again, a system criterion.
You can generate inputs to the board and have a small application count them and check at what level data starts being lost. But that gives you a rating specific to that system running that application. As soon as you start doing more processing, your computational load increases and you have a different hard real-time limit.
This board, running a bare-bones scheduler, will give great, predictable hard real-time performance for most tasks.
Running a full RTOS with a heavy computational load, you will probably only get soft real-time.
Edit after comment
The most efficient and easiest way I have used to measure my software's performance (assuming you use a scheduler) is to use a free-running hardware timer on the board and timestamp the start and end of my cycle. Or, if you run a full RTOS, timestamp your acquisition and transition. Save your max time and compute an average of the values over a second. If your average is around 50% of your available cycle time and your max is within 20% of your average, you are OK. If not, it is time to refactor your application. As your application grows, the cycle time will grow, and you can monitor the effect of all your software changes on your cycle time.
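A minimal sketch of that approach, assuming a Linux-class board where clock_gettime(CLOCK_MONOTONIC) stands in for the free-running hardware timer and do_cycle() is a placeholder for the real periodic work; it prints the running average and max once per second:

```c
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void do_cycle(void)                 /* placeholder workload */
{
    volatile int x = 0;
    for (int i = 0; i < 100000; i++)
        x += i;
}

int main(void)
{
    uint64_t max_ns = 0, sum_ns = 0, count = 0;
    uint64_t report = now_ns() + 1000000000ull;   /* report every second */

    for (;;) {
        uint64_t start = now_ns();
        do_cycle();
        uint64_t dt = now_ns() - start;           /* this cycle's duration */

        if (dt > max_ns) max_ns = dt;
        sum_ns += dt;
        count++;

        if (now_ns() >= report) {
            printf("avg %llu ns  max %llu ns over %llu cycles\n",
                   (unsigned long long)(sum_ns / count),
                   (unsigned long long)max_ns,
                   (unsigned long long)count);
            max_ns = sum_ns = count = 0;
            report += 1000000000ull;
        }
    }
}
```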
Another way is to use a hardware timer to generate a cyclical interrupt. If you finish in time, you reset the timer; if you miss the deadline, the interrupt handler signals a failure. This only warns you once your application is already taking too long, but because it relies on hardware and interrupts, you can't miss it.
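A sketch of that watchdog variant, using a one-shot POSIX timer as a stand-in for the hardware timer interrupt (the 5 ms budget and the workload are made-up numbers): arm the timer to the deadline at the start of each cycle; if the cycle finishes first, re-arming pushes the deadline out and the handler never fires.

```c
#include <signal.h>
#include <stdio.h>
#include <time.h>

static volatile sig_atomic_t missed;

static void deadline_handler(int sig)
{
    (void)sig;
    missed = 1;                 /* "interrupt handler signals a failure" */
}

int main(void)
{
    timer_t watchdog;
    struct sigevent sev = { 0 };
    struct sigaction sa = { 0 };

    sa.sa_handler = deadline_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGALRM;
    timer_create(CLOCK_MONOTONIC, &sev, &watchdog);

    struct itimerspec deadline = { 0 };
    deadline.it_value.tv_nsec = 5 * 1000 * 1000;      /* 5 ms budget (made up) */

    for (int cycle = 0; cycle < 1000; cycle++) {
        timer_settime(watchdog, 0, &deadline, NULL);  /* (re)arm the deadline */

        volatile int x = 0;                           /* placeholder work */
        for (int i = 0; i < 100000; i++)
            x += i;

        if (missed) {
            printf("cycle %d missed its 5 ms deadline\n", cycle);
            missed = 0;
        }
    }

    struct itimerspec off = { 0 };
    timer_settime(watchdog, 0, &off, NULL);           /* disarm */
    return 0;
}
```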
These solutions also eliminate the need to hook up a scope to monitor the output, since the timing information can be displayed in any kind of terminal by a background task. If it is easy to monitor, you will monitor it regularly, catching timing problems as soon as they are introduced instead of solving them all at the end.
Hope this helps
I have the same board here at work. It's a slightly-modified 2.6 Kernel, I believe... not the real-time version.
I don't know that I've read anything in the docs yet that indicates that it is meant for strict RTOS work.
I think that this is not a hard real-time device, since it runs no RTOS.
I understand being a geek, but using an oscilloscope to test a computer with Ethernet/USB/other digital ports and a HUGE internal state (RAM) is both ineffective and unreliable.
Instead of watching waveforms, you can connect any PC to the output port and run a proper statistical analysis.
The established procedure (if the input signal is analog by nature) is to test the system against several characteristic inputs - traditionally spikes, step functions and sine waves of different frequencies - and measure the phase shift and variance for each input type. The worst case is then used in the specification of the system.
Again, if you are using standard ports, you can easily generate those on a PC. If the input is truly analog, a separate DAC or simply a good sound card would be needed.
Now, that won't say anything about the OS being real-time - it could be running vanilla Linux or even Windows CE and still produce good and stable results in those tests if the hardware is fast enough.
So you need to simulate heavy and varying loads on the processor, memory and all ports, let it heat up and eat memory for a few hours, and then repeat the tests. If latency stays constant, it's hard real-time. If it varies but never increases above an acceptable limit, under any load and input signal type, it's soft. Otherwise, it's advertisement.
P.S.: The implication is that even for critical systems you don't actually need hard real-time if the hardware is fast enough.