Is "the optimized delay" a myth or is it real? - optimization

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy who is so into code optimization that he optimizes his delay loop.
This sounds like a strange thing to do, since it's much better to start a "timer interrupt" than to use an optimized busy wait,
and nobody ever tells you the name of the optimizing hacker.
That has left me wondering whether it is an urban myth or real.
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something,
I wonder if we can find an example.
Update: Just a little clarification: this question is about the "delay" function itself and how it is implemented, not how and where you call it.
And what the purpose was, and how the system became better.
Update: It's no myth, those guys seem to exist
Thanks
ShuggyCoUk

This has more than a kernel of truth about it...
Spin wait can be much better than a signal based interrupt or a yield.
You trade some throughput for much reduced latency.
Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler
memory allocation for example.
You can get considerably finer grained control of the interval waited since you can essentially measure the cycle count.
However spin waits are tricky to get right.
If you can, you should use the proper idle instructions, which:
can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster.
on Hyper-Threading CPUs, let the other logical thread use the full CPU pipeline while you spin.
avoid a subtle trap: a loop of instructions you might think are no-ops can be executed out of order by the superscalar execution units. The resulting code may hit unforeseen out-of-order artefacts which force the CPU into unwanted stalls and memory barriers.
This is why you let someone else write the spin wait loop for you in most cases.
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
As to some citations of using PAUSE for spin waits:
PostgreSQL
Linux
See also the note that this is better on non P4 as well due to reducing power
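To make the above a bit more concrete, here is a minimal sketch of a bounded spin wait that issues a "I am spinning" hint on each iteration. It assumes GCC or Clang and C11 atomics; cpu_relax_hint and spin_until_set are made-up names for illustration, not the Linux or Windows implementations mentioned above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical helper: emit the CPU's spin-wait hint where one is known.
   Assumes GCC or Clang; other compilers need their own intrinsic. */
static inline void cpu_relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                     /* x86 PAUSE, encoded as "rep; nop" */
#elif defined(__aarch64__) || defined(__arm__)
    __asm__ __volatile__("yield" ::: "memory"); /* ARM hint instruction */
#else
    atomic_signal_fence(memory_order_seq_cst);  /* compiler barrier only */
#endif
}

/* Spin until *flag becomes true or max_spins iterations have passed.
   Returns true if the flag was observed set. */
static bool spin_until_set(atomic_bool *flag, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return true;
        cpu_relax_hint();
    }
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

The bound is there so the caller can fall back to a real blocking wait when the flag takes too long; in production code you would normally reach for the OS-provided primitive rather than roll this yourself.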

The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice Of Programming, but even there they admit it may be an Urban Myth.

I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.

Related

Could a GPU be Programmed and/or Modified to Carry Out CPU Instructions?

I was wondering if a GPU could behave like a CPU if modified or programmed to do so. If there is a way, I would also like to know how that could be done. The reason why is, well, sometimes I do that kind of stuff as experiments, just for fun. Plus, if it isn't a big hassle, then it would be much better than buying an expensive processor just to get better performance. I usually don't need my GPU, only because I use my computer for the simplest of things. My other computer, that's a slightly different story (because I use it for video playback), but you get the idea.
Yes, it's called GPGPU (general purpose GPU), and with it you could program some CPU-like workloads on your GPU using languages like CUDA or OpenCL.
Of course, this method doesn't work well for every workload: the CPU is still much better at single-threaded, hard-to-parallelize code, or code with complicated control flow (thanks to branch predictors) or memory locality (thanks to better caching and prefetching). GPGPUs are mostly better at straightforward, highly parallel, vectorizable code.
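As a rough illustration of that split, here is a sketch in plain C (not CUDA or OpenCL) contrasting the two kinds of code. The function names are made up for the example; the first shape is the kind of thing that maps well onto a GPU, the second is not.

```c
#include <stddef.h>

/* Per-element "kernel": each output depends only on its own index,
   so every call could in principle run on a separate GPU thread. */
static void saxpy_element(size_t i, float a, const float *x, float *y)
{
    y[i] = a * x[i] + y[i];
}

/* On a CPU we simply loop; a GPGPU runtime would launch n instances instead. */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        saxpy_element(i, a, x, y);
}

/* Counter-example: a linked-list walk. Each step needs the previous
   load to finish, so there is nothing to spread across threads. */
struct node {
    int value;
    const struct node *next;
};

static long list_sum(const struct node *head)
{
    long total = 0;
    for (; head != NULL; head = head->next)
        total += head->value;
    return total;
}
```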
In fact, this method of computation gained enough traction to create new lines of products (such as Xeon Phi, formerly Larrabee) and to enhance existing GPUs (e.g. Tesla/Fermi, and others).
EDIT
Having reread your question - if you mean running actual CPU ISA on such GPGPU, not just some general CPU task, then the best bet is Xeon Phi mentioned above, it's intended to be based on the same ISA as the CPU (it's the only x86 GPGPU I know of).

Risk Assessment: Using Pthreads (vs. GCD or NSThread)

A colleague suggested recently that I use pthreads instead of GCD because it's, "way faster." I don't disagree that it's faster, but what's the risk with pthreads?
My feeling is that they will ultimately not be anywhere nearly as idiot-proof as GCD (and my team of one is 50% idiots). Are pthreads hard to get right?
GCD and pthreads are both ways of doing work asynchronously, but they are significantly different. Most descriptions of GCD describe it in terms of threads and of thread pooling, but as DrPizza puts it
to concentrate on [threads and thread pools] is to miss the point. GCD’s value lies not in thread pooling, but in queuing.
                                                                Grand Central Dispatch for Win32: why I want it
GCD has some nice benefits over APIs like pthreads.
GCD does more to encourage and support "islands of serialization in a sea of parallelism." GCD makes it easy to avoid a lot of the locks, mutexes and condition variables that are the normal way of communicating between threads. This is because you decompose your program into tasks and GCD handles getting the task input and output to the appropriate thread behind the scenes. So programming with GCD allows you to pretty much write serially and not worry too much about stuff people often worry about in threaded code. That makes the code simpler and less bug prone.
GCD can do scaling for you so the program uses as much parallelism as the dependencies between the tasks you've decomposed your program into and the hardware allow for. Of course designing the program to be scalable is generally the hard bit, but you'll still need something to actually take advantage of that work to run as much as possible in parallel. Work stealing schedulers like GCD do that part.
GCD is composable. If you explicitly spawn threads for things you want to do asynchronously or in parallel you can run into a problem when libraries you use do the same thing. Say you decide you can run eight threads simultaneously because that's how many threads will be effective for your program given the machine it runs on. And then say a library you use on each thread does the same thing. Now you could have up to 64 threads running at once, which is more than you know is effective for your program.
Thread pooling solves this but everyone needs to use the same thread pool. GCD uses thread pooling internally and provides the same pool to everyone.
GCD provides a bunch of 'sources' and makes it easy to write an event driven program that depends on or takes input from the sources. For example you can very easily have a queue set up to launch a task every time data is available to read on a network socket, or when a timer fires, or whatever.
I don't think they're hard to get right, but having worked with many different approaches over the years (pthreads, GCD, NSThread, NSOperationQueue, etc.) I have no evidence to support an assertion like "pthreads are way faster." Even if they were faster (and I would expect the difference to be marginal at best) I always say, "use the highest level abstraction that gets the job done." Also, avoid premature optimization.
Anecdotally speaking, GCD is pretty damn fast. As I see it, portability is the primary advantage of pthreads over GCD. If this is OSX/iOS exclusive code, I would see no advantage whatsoever to using pthreads, absent empirical evidence to the contrary.
Ignore the other well-thought-out technical reasons, because they aren't relevant. You are not writing software for a benchmark, are you? At some point, a user is going to sit in front of your device and try to use it. And do you know what happens if you use pthreads instead of GCD? Your software doesn't scale well in the presence of other software multitasking at the same time, because it is going to fight for the CPU as if it were the only software running. Which is crazy. Nobody runs single task OSes any more. Even single task iOS runs a lot of stuff in the background.
If instead all the programs you run use GCD, the OS can scale the number of concurrent tasks running on their queues to better match the number of actual processors, reducing task switching overhead.
If your program doesn't require pseudo real time low latency and thus a dedicated thread to process stuff as soon as it is available (maybe the definition of your colleague's "way faster"), chances are GCD will be superior for the user because it will make better use of the resources available on their device. Even if GCD's API were horrible or slow it would be worthwhile to use it over other solutions which don't scale across different processes.
NSThread is probably implemented on top of the pthreads library. The point is that the lower-level a concept is, the more tedious, repetitive work you have to do yourself.
The pthreads library itself isn't hard to learn: my professor at university taught it, and even the slowest learners were able to use it, perhaps lazily copy-pasting code, but getting the job done.
So I definitely suggest implementing a pthread wrapper class; it's easy to do.
This way you eliminate the repetitive stuff. For example, you may be writing this thousands of times:
pthread_mutex_init( mutex_ptr, NULL);
If you always pass NULL (this is just an example), the wrapper can do it for you, and the same goes for other functions.
Once you've implemented the class, there is no guarantee that it's faster than GCD.
GCD does some optimizations of its own; for example, two blocks may be run on the same thread.
So I suggest using your own class only if it proves faster than GCD when you test it with the time profiler.
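To show the kind of wrapper meant here, this is a minimal sketch in plain C (a struct plus functions rather than a class). The locked_counter type and its functions are hypothetical names for the example; only the pthread_mutex_* calls come from the pthreads library. The wrapper hides the NULL attribute argument and pairs every lock with its unlock.

```c
#include <pthread.h>

/* Hypothetical wrapper: a counter that owns its own mutex. */
struct locked_counter {
    pthread_mutex_t mutex;
    long value;
};

/* Returns 0 on success, or the pthread error code. */
static int locked_counter_init(struct locked_counter *c)
{
    c->value = 0;
    return pthread_mutex_init(&c->mutex, NULL); /* NULL = default attributes */
}

/* Add delta to the counter while holding the mutex. */
static void locked_counter_add(struct locked_counter *c, long delta)
{
    pthread_mutex_lock(&c->mutex);
    c->value += delta;
    pthread_mutex_unlock(&c->mutex);
}

/* Read the counter while holding the mutex. */
static long locked_counter_get(struct locked_counter *c)
{
    pthread_mutex_lock(&c->mutex);
    long v = c->value;
    pthread_mutex_unlock(&c->mutex);
    return v;
}

/* Returns 0 on success, or the pthread error code. */
static int locked_counter_destroy(struct locked_counter *c)
{
    return pthread_mutex_destroy(&c->mutex);
}
```

Callers never touch the mutex directly, which is most of the "idiot-proofing" you get from a wrapper like this. It says nothing about speed; that still has to be measured.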

Is there any feature of programming that automatically detects computational repetition?

I'm new to programming, taking MIT's 6.00. While watching the Dynamic Programming lecture a simple question occurred to me: Is there any kind of built-in feature (for computers in general) to detect repetitive tasks and compensate?
I realize that's quite vague. I was working on my grandfather's computer because he had been complaining that it was slow. Indeed, it would lag for up to 15 seconds at a time, waiting for programs to open, etc. When I upgraded the RAM, the problem was gone. So if the computer was constantly having to write page ins and page outs to disk, why couldn't it have just popped up a little message suggesting a RAM upgrade? That would save quite a bit of time.
Computers are good at performing tasks quickly but slow code can be, well, slow. Can that be automated? Is this even a legitimate question?
In the example you describe the code isn't slow because it's reading/writing to disk. It's slow because it isn't actually doing anything but instead is waiting for the OS to page in and out to disk.
Also, a RAM upgrade isn't always the solution to frequent paging (say buggy program leaking memory or something).
It's not really possible in the general sense for the OS to detect what all the possible issues are and suggest a solution. That is in fact a variation of the Halting Problem.
It's impossible in general for a computer to know whether something is slow because the operation fundamentally takes a long time to finish, or because it's taking more time than it really should.
Also, even if you've identified that an operation is slow, it's even more difficult to diagnose the precise reason why. Sometimes you need more RAM; other times the cause is a slow network, a slow disk, or a slow CPU. This is even harder if the checker is running on the same machine that it is checking, since it experiences the slowness itself.
However, there are several things that can be done in certain limited situations. Many popular OSes (e.g. Windows, Linux, Android) can detect slow response to user input, and will offer to either give more time or force close applications (Android) or draw the not responding window in grayscale (Linux), or in bluish tint (Windows), if the application fails to respond to user input within a certain period of time.
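The "not responding" check described above boils down to a timestamp comparison. Here is a minimal sketch of that idea; the app_monitor struct, its function names and any timeout you plug in are assumptions for illustration, not how any of those OSes actually implement it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one application's event loop. */
struct app_monitor {
    uint64_t last_response_ms; /* when the app last handled an input event */
    uint64_t timeout_ms;       /* how long the system is willing to wait */
};

/* Called whenever the application finishes handling an input event. */
static void app_monitor_note_response(struct app_monitor *m, uint64_t now_ms)
{
    m->last_response_ms = now_ms;
}

/* True if the application has been silent for longer than the timeout. */
static bool app_monitor_is_unresponsive(const struct app_monitor *m,
                                        uint64_t now_ms)
{
    return now_ms - m->last_response_ms > m->timeout_ms;
}
```

Note that this only tells you that the application is late, not why, which is exactly the limitation discussed above.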

When to use windowed watchdog for embedded systems

This post is not for asking how to use it, but when.
There is a lot of documentation about windowed watchdogs (WW), and most microcontrollers already include it. Every vendor states that WW are meant for safety applications, but no one says more about this topic.
I would like to be pointed to specific examples, but examples that could be a little more than "for a car's brakes system".
We all know that a WW must be fed neither too early nor too late, but how will this scenario help to improve safety?
Thank you!!
The overall point of a Watchdog is to ensure that the firmware is executing as expected. The theory is that if your firmware can periodically kick the watchdog, then the other functions it is responsible for are also happening.
From a system design, they're the last level of fail safe. It's basically saying "we don't know what the system is doing, because it's not able to kick the watchdog. So, reset the device and hope the problem goes away."
They can protect you from accidental infinite loops, stack corruptions, RAM bit twiddles, etc.
A Windowed Watchdog is a better solution than a single-sided Watchdog as the window can protect against more things... For example, with a single-sided Watchdog, if the loop you're stuck in includes the watchdog kick, you'd never know you had a problem. With a Windowed Watchdog, you have a better chance of resetting due to the likelihood of kicking too fast...
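To make the "too early / too late" behaviour concrete, here is a small software model of a windowed watchdog. It is only a sketch: real parts do this in hardware with vendor-specific registers, and the struct, function names and tick units below are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: time is counted in ticks since the last valid kick. */
struct windowed_watchdog {
    uint32_t window_open;  /* kicks before this tick count are "too early" */
    uint32_t window_close; /* no kick by this tick count is "too late" */
    uint32_t elapsed;
    bool reset_requested;  /* latched once any violation is seen */
};

static void wwdg_init(struct windowed_watchdog *w,
                      uint32_t window_open, uint32_t window_close)
{
    w->window_open = window_open;
    w->window_close = window_close;
    w->elapsed = 0;
    w->reset_requested = false;
}

/* Advance time by one tick; latch a reset if the window closed unkicked. */
static void wwdg_tick(struct windowed_watchdog *w)
{
    w->elapsed++;
    if (w->elapsed > w->window_close)
        w->reset_requested = true; /* too late: firmware hung or stalled */
}

/* Kick the watchdog; kicking before the window opens is also a fault. */
static void wwdg_kick(struct windowed_watchdog *w)
{
    if (w->elapsed < w->window_open)
        w->reset_requested = true; /* too early: e.g. stuck in a tight loop
                                      that happens to contain the kick */
    else
        w->elapsed = 0;            /* valid kick restarts the window */
}
```

A single-sided watchdog is the same model with window_open set to 0, which is why it cannot catch the runaway loop that keeps kicking.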
So, to answer your question. You'd use a Windowed Watchdog any time you wanted to be reasonably sure that the firmware is doing what it is supposed to, or to fall back to a safe state if it's not. They are generally focused on in safety systems, but all embedded devices can benefit from their use. (For example, a house thermostat is not considered a safety-critical system, however if it completely locks up and requires someone to remove the batteries to restart it that would be an annoyance.)

What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
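The streaming style described above looks roughly like this in plain C. This is only a sketch of the access pattern: memcpy stands in for the DMA transfer, LOCAL_CHUNK is an arbitrary illustrative size, and real SPU code would use the SDK's DMA calls and overlap the next transfer with the current computation (double buffering).

```c
#include <stddef.h>
#include <string.h>

#define LOCAL_CHUNK 64 /* illustrative chunk size, in elements */

/* Stand-in for a DMA transfer from main memory into local store. */
static void fake_dma_get(float *local, const float *main_mem, size_t count)
{
    memcpy(local, main_mem, count * sizeof *local);
}

/* Sum a large array by pulling it through a small local buffer,
   one chunk at a time, never touching main memory directly. */
static float stream_sum(const float *main_mem, size_t n)
{
    float local[LOCAL_CHUNK];
    float total = 0.0f;

    for (size_t off = 0; off < n; off += LOCAL_CHUNK) {
        size_t count = (n - off < LOCAL_CHUNK) ? n - off : LOCAL_CHUNK;
        fake_dma_get(local, main_mem + off, count);
        for (size_t i = 0; i < count; i++)
            total += local[i];
    }
    return total;
}
```

The point is that the inner loop only ever sees data that is already local and contiguous; an algorithm that has to chase pointers back out to main memory breaks this pattern.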
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
The Cell part of the PS3 offers 6 SPU processors to applications. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They have no cache and no out-of-order execution. This makes it rather different from a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPU processors do not use the same instruction set as the PowerPC so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel and each sub-task can be sequenced to an available Cell processor. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs for a given number of transistors on the chip.
Completely agree with George Philips and Crashworks. Only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for the results for a frame while the rest of your SPUs sit idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing, you can setup dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.