(When) Does hardware, especially the CPU(s), deliver wrong results?

What I'm talking about is: Is it possible that under certain circumstances the CPU "bugs" and suddenly responds with 1+1=2?
In which parts of the computer can that happen (HDD, RAM, Mainboard)?
What could be the causes? Bad quality? Overheating?
Does that even happen? When yes, how frequently?
If everything is okay with the CPU (not a single fault in production, good temperature), can that still happen sometimes?
What would be the results of, let's say one to three wrong computations?
This is programming related as it would be nice to know if you can even rely on the hardware to return the right results.

It can happen in all hardware, and it happens quite often in RAM chips. There are mechanisms to detect and correct such errors, but for RAM they exist only in the more expensive ECC chips. See Wikipedia's article on RAM and Error Correction.
Also interesting is the article on Error Detection and Correction in general.
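To make the idea of error correction concrete, here is a minimal sketch of a Hamming(7,4) code in Python, which can repair any single flipped bit in a 7-bit codeword. The function names are made up for illustration, and real ECC memory typically uses a wider code over whole memory words, so treat this as a toy model of the principle rather than what any particular chip does.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (an int 0..15) as a list of 7 bits (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 unused, positions 1..7 used
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome). A non-zero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1             # flip the single faulty bit back
    d = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(d)), syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1                            # simulate a bit flip at position 5
print(hamming74_decode(word))           # the data is recovered despite the flip
```

Two flipped bits in the same codeword would be "corrected" to the wrong value by this code, which is why practical schemes add an extra parity bit to at least detect double errors.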

One example: http://en.wikipedia.org/wiki/Pentium_FDIV_bug

"What I'm talking about is: Is it possible that under certain circumstances the CPU 'bugs' and suddenly responds with 1+1=2?"
Yes.
"In which parts of the computer can that happen (HDD, RAM, Mainboard)?"
All of them.
"What could be the causes? Bad quality? Overheating?"
The most common cause is overclocking. Less common causes include faulty hardware.
"If everything is okay with the CPU (not a single fault in production, good temperature), can that still happen sometimes?"
It can be a RAM problem like I said above, or really anything.
"What would be the results of, let's say one to three wrong computations?"
I don't understand this question. Do you mean what would happen to the program? It would probably segfault, but it's impossible to say. Do you mean what 1+1 would result in? Impossible to say. Do you mean what would happen if 1 in 3 computations were to fail on average? The computer wouldn't even boot.

Well, first you need to find a computer engineer who thinks that 1+1=2 is a bug and that it's a hardware problem which needs to be fixed.
@Andreas Bonini, Midhat and Pekka: In such instances it would be highly recommended to take a maths course on April Fool's day.

Andrew Appel had a great demo a few years ago where he started a lecture by lighting a 100W bulb under a PC running Java. Within 20 minutes there were enough memory errors that he could exploit one to crack the Java virtual machine and take it over.
Cool your hardware!

Related

Is there any feature of programming that automatically detects computational repetition?

I'm new to programming, taking MIT's 6.00. While watching the Dynamic Programming lecture a simple question occurred to me: Is there any kind of built-in feature (for computers in general) to detect repetitive tasks and compensate?
I realize that's quite vague. I was working on my grandfather's computer because he had been complaining that it was slow. Indeed, it would lag for up to 15 seconds at a time, waiting for programs to open, etc. When I upgraded the RAM, the problem was gone. So if the computer was constantly having to write page ins and page outs to disk, why couldn't it have just popped up a little message suggesting a RAM upgrade? That would save quite a bit of time.
Computers are good at performing tasks quickly but slow code can be, well, slow. Can that be automated? Is this even a legitimate question?
In the example you describe, the code isn't slow because it's reading from and writing to disk. It's slow because it isn't actually doing anything: it is waiting for the OS to page memory in and out of disk.
Also, a RAM upgrade isn't always the solution to frequent paging (for example, when a buggy program is leaking memory).
It's not really possible in the general sense for the OS to detect what all the possible issues are and suggest a solution. That is in fact a variation of the Halting Problem.
It's impossible in general for a computer to know whether something is slow because it's running an operation that fundamentally takes a long time to finish, or because it's taking more time than it really should.
Also, even if you've identified that an operation is slow, it's even more difficult to diagnose the precise reason why. Sometimes it's because you need more RAM, other times it's because of a slow network, a slow disk, or a slow CPU. This is even harder if the checker is running on the same machine it is checking, since it experiences the slowness itself.
However, there are several things that can be done in certain limited situations. Many popular OSes (e.g. Windows, Linux, Android) can detect slow response to user input: if an application fails to respond to user input within a certain period of time, they will offer to either wait longer or force close the application (Android), draw the unresponsive window in grayscale (Linux), or draw it with a bluish tint (Windows).
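The "not responding" detection described above boils down to a watchdog: the application is expected to check in regularly, and a supervisor flags it once the last check-in is too old. A minimal sketch in Python follows; the `Watchdog` class and its method names are invented for illustration and do not correspond to any particular OS API.

```python
import time


class Watchdog:
    """Flags a task as unresponsive if it has not checked in recently."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock              # injectable so it can be tested
        self._last_beat = clock()

    def beat(self):
        """Called by the monitored task, e.g. once per event-loop iteration."""
        self._last_beat = self._clock()

    def is_responsive(self):
        """Called by the supervisor to decide whether to warn the user."""
        return (self._clock() - self._last_beat) <= self.timeout_s


wd = Watchdog(timeout_s=5.0)
wd.beat()                                # the app handled an input event
print(wd.is_responsive())                # still within the timeout
```

Note that this only tells you that something is slow, not why, which is exactly the limitation the answer points out.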

Determining failing sectors on portable flash memory

I'm trying to write a program that will detect signs of failure for portable flash memory devices (thumb drives, etc).
I have seen tools in the past that are able to detect failing sectors and other kinds of trouble on conventional mechanical hard drives, but I fear that flash memory does not have the same kind of predictable low-level access to the hardware due to the internal workings of the storage. Things like wear-leveling and other block-remapping techniques (to skip over 'dead' sectors?) lead me to believe that determining if a flash drive is failing will be difficult at best, if not impossible (short of having constant read failures and device unmounts).
Flash drives at their end-of-life should be easy to detect (constant CRC discrepancies during reads and all-out failure). But what about drives that might be failing early? Are there any tell-tale signs like slower throughput speeds that might indicate a flash drive is going to fail much sooner than normal?
Along the lines of detecting potentially bad blocks, I had considered attempting random reads/writes to a file close to or exactly the size of the entire volume, but even then is it possible that the drive might report sizes under its maximum capacity to account for 'dead' blocks?
In short, is there any way to circumvent or at least detect (algorithmically or otherwise) the use of block-remapping or other life extension techniques for flash memory?
Let me end this question by expressing my uncertainty as to whether or not this belongs on serverfault.com. This is definitely a hardware-related question, but I also desire a software solution, preferably one that I can program myself.
If this question is misplaced, I will be happy to migrate it to serverfault - but I do need a programming solution. Please let me know if you need clarification :)
Thanks!
It would be interesting to see whether badblocks can help in this case.
AFAIK, wear leveling happens at the firmware level. The hardware does not know about a bad block until the firmware detects one.
And there is no known way to find these bad sectors beforehand. BTW, I guess it is not bad sectors, but bad blocks. Once a sector is bad, the whole block is marked as bad ...
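The write-and-read-back idea from the question can at least be sketched at the file level. The Python below writes a reproducible pseudo-random pattern to a file, then reads it back and reports which blocks differ; it also returns the write time so throughput can be compared between runs. The function names and the default block size are assumptions for illustration. Because the firmware remaps blocks behind the scenes, and because the OS may serve reads from its page cache rather than the device, a clean result here does not prove the flash is healthy.

```python
import hashlib
import os
import time


def _block_sizes(total_bytes, block_size):
    return [min(block_size, total_bytes - off)
            for off in range(0, total_bytes, block_size)]


def expected_block(index, size, seed=b"flash-probe"):
    """Reproducible pseudo-random content for block `index`."""
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(size)


def write_pattern(path, total_bytes, block_size=1 << 20):
    """Write the pattern to `path`; return the elapsed write time in seconds."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        for i, size in enumerate(_block_sizes(total_bytes, block_size)):
            f.write(expected_block(i, size))
        f.flush()
        os.fsync(f.fileno())             # ask the OS to push data to the device
    return time.perf_counter() - start


def verify_pattern(path, total_bytes, block_size=1 << 20):
    """Return the indices of blocks whose content differs from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i, size in enumerate(_block_sizes(total_bytes, block_size)):
            if f.read(size) != expected_block(i, size):
                bad.append(i)
    return bad
```

A typical use would be to call `write_pattern` on a file on the mounted drive (the path is up to you), remount or reinsert the drive so the cache is dropped, and then call `verify_pattern` with the same sizes.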

What does programming for PS3's Cell Processor entail?

How is programming for the Cell Processor on the PS3 different than programming for any other processor found on a normal desktop?
What kind of programming paradigms, techniques, and practices are used to fully utilize the Cell Processor's potential?
All the articles I hear concerning PS3 development discuss, "Learning how to program on the Cell Processor." What does this really mean beyond some hand waving?
In addition to everything George mentions, the SPUs are really better thought of as streaming vector processors. They work best when you have an algorithm that works on long sequences of numerical data, which can be fed through the SPU's limited memory via DMA, rather than having the SPU load a chunk of memory, try to operate on it, find that it needs to follow a pointer to somewhere outside its memory, load that, keep going, find another one, and so on.
So, programming for them isn't a simple model of concurrency and threads; it's more like high performance numerical or scientific computation. It is also non-uniform memory access taken to an extreme.
Furthermore, every processor is in-order with deep pipelines, so the programmer has to be much more aware of data hazards and instruction bubbles and all the numerous micro-optimizations that we are told the compiler "should" take care of for us (but it really doesn't). Things like mispredicted branches, load-hit-stores, cache misses, etc. hurt a lot more than they would on an out-of-order processor that could juggle the order of operations around to hide such latencies.
For concrete examples, check out Mike Acton's CellPerformance blog. Mike is my favorite old-school assembly-happy perf curmudgeon in the business, and he's really earned his chops on this issue.
The Cell part of the PS3 consists of 6 SPU processors. They each have 256 KB of non-shared memory and are connected via a high-speed ring that allows for DMA between each other and the PowerPC host processor. They have no cache and execute strictly in order. This makes it rather different than a multi-core x86 with shared memory, out-of-order execution and caching. Also, the SPU processors do not use the same instruction set as the PowerPC so you've got some asymmetry there.
In short, your typical shared-memory, multithreaded program won't just drop onto the Cell without some work (with the caveat that computer science works hard at making different machines appear to be the same so some implementors try hard to automate the process).
At a high level the program will need to be broken up into tasks that fit within the Cell's hard memory limit. Those can run in parallel and each sub-task can be sequenced to an available Cell processor. At a low level, the compiler (or assembly programmer) will need to work harder to generate code that runs quickly on a processor -- no run-time trickery to make things go faster is available. The theory is that those programmer/compiler-friendly features cost silicon and speed that can be better spent giving you more and faster SPUs. Of course, you're not getting any more SPUs on the PS3, but in the general case you'll get more SPUs for the number of transistors available on the chip.
Completely agree with George Philips and Crashworks. Only thing I'd add is that SPU programming is fundamentally about job management. To get the best out of the SPUs you need to keep them ticking over and feeding back results. There's no point in having one SPU chewing through some complex post-processing if you're having to sit and wait for the results for a frame and the rest of your SPUs are sat idle. So how you distribute your jobs requires a lot of thought, and this has a big impact on how you chunk up your data.
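The chunking and job distribution described above can be illustrated with a small simulation. This is plain sequential Python, not SPU code: the slice copies stand in for DMA transfers, the round-robin assignment stands in for a job scheduler, and the buffer size, item size and function names are all assumptions chosen for the example.

```python
LOCAL_STORE_BYTES = 256 * 1024    # SPU local memory size mentioned above
ITEM_BYTES = 4                    # assumption: 32-bit floats


def make_jobs(n_items, buffer_bytes):
    """Split n_items into (start, end) ranges that each fit in one buffer."""
    per_job = buffer_bytes // ITEM_BYTES
    return [(start, min(start + per_job, n_items))
            for start in range(0, n_items, per_job)]


def run_jobs(data, n_workers=6, buffer_bytes=64 * 1024,
             kernel=lambda x: x * 2.0):
    """Process data chunk by chunk; return the results and jobs per worker."""
    # The buffer is smaller than the local store because code and
    # additional buffers have to fit in the same memory.
    assert buffer_bytes <= LOCAL_STORE_BYTES
    out = [0.0] * len(data)
    jobs_per_worker = [0] * n_workers
    for j, (lo, hi) in enumerate(make_jobs(len(data), buffer_bytes)):
        worker = j % n_workers                   # round-robin assignment
        local = data[lo:hi]                      # "DMA in" to local memory
        out[lo:hi] = [kernel(x) for x in local]  # compute, then "DMA out"
        jobs_per_worker[worker] += 1
    return out, jobs_per_worker


results, load = run_jobs([float(i) for i in range(100_000)])
print(load)    # how evenly the jobs were spread over the workers
```

The point of the sketch is the shape of the problem: the kernel only ever touches a self-contained chunk, and the load list shows whether any worker would sit idle.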
"All the articles I hear concerning PS3 development discuss, 'Learning how to program on the Cell Processor.' What does this really mean beyond some hand waving?"
Well, stuff you have to deal with on SPUs...
Atomic operations (lock-free try-discard style).
Strong distinction between memory areas. You have to know which pointer is pointing to which memory area or you'll screw everything up.
No enforced hardware distinction between data and code. This is actually a fun thing: you can set up dynamic code loading and essentially stream subroutines in and out. Self-modifying code is possible but not necessarily practical on SPU.
Lack of hardware debugging aids.
Limited memory size.
Fast memory access.
Instruction set balanced toward SIMD operations.
Floating point "gotchas".
You ideally want to keep the SPUs doing useful work all of the time, but it's really challenging. Not only are they not well suited for handling some types of problems, but often moving a system to be efficient on SPU can involve a complete redesign. Debugging problems that would be easy to catch on the PPU can sometimes take days on SPU.
I think when people use the phrase "learning how to program the cell" they are mostly hand waving. You can learn the basics in a week, the challenge comes in trying to apply that knowledge to real code... which often already exists and isn't in a form well-suited for use on SPU.

Is "the optimized delay" a myth or is it real?

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy who is so into code optimization that he optimizes his delay loop.
This really sounds like a strange thing to do, as it's much better to start a "timer interrupt" instead of an optimized busy wait,
and nobody ever tells you the name of the optimizing hacker.
That has left me wondering whether it is an urban myth or real.
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something;
I wonder if we can find an example.
Update: Just a little clarification: this question is about the "delay" function itself and how it is implemented, not how and where you call it,
and what its purpose was, and how the system became better.
Update: It's no myth, those guys seem to exist.
Thanks, ShuggyCoUk
This has more than a kernel of truth about it...
Spin wait can be much better than a signal based interrupt or a yield.
You trade some throughput for much reduced latency.
Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler, memory allocation for example.
You can get considerably finer grained control of the interval waited since you can essentially measure the cycle count.
However, spin waits are tricky to get right.
If you can, you should use proper idle instructions, which:
can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster.
on Hyper-Threading CPUs, allow the other logical thread to use the full CPU pipeline while you spin.
An instruction you might think was a no-op could cause the CPU to execute the loop out of order via the superscalar execution units. The resulting code may get unforeseen out-of-order artefacts which force the CPU to spend a great deal of effort on unwanted stalls and memory barriers.
This is why you let someone else write the spin wait loop for you in most cases..
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
Some examples of projects using PAUSE for spin waits:
PostgreSQL
Linux
See also the note that this is better on non-P4 CPUs as well, due to reduced power use
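The latency trade-off mentioned above is often handled with a hybrid wait: sleep for most of the interval, then spin for the last stretch to hit the deadline more precisely. Here is a minimal Python sketch of that pattern. The function name and the 2 ms spin margin are assumptions, and Python has no equivalent of the PAUSE-style hint discussed above, so the spin here burns a full core while it runs.

```python
import time


def precise_wait(duration_s, spin_margin_s=0.002):
    """Wait at least duration_s seconds; return the time actually waited."""
    start = time.perf_counter()
    coarse = duration_s - spin_margin_s
    if coarse > 0:
        time.sleep(coarse)               # cheap, but may wake up late
    while True:                          # busy-wait for the remainder
        elapsed = time.perf_counter() - start
        if elapsed >= duration_s:
            return elapsed


overshoot = precise_wait(0.010) - 0.010
print(f"overshoot: {overshoot * 1e6:.1f} microseconds")
```

Comparing the overshoot with that of a plain `time.sleep(0.010)` on the same machine gives a feel for what the spin buys you and what it costs in CPU time.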
The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice Of Programming, but even there they admit it may be an Urban Myth.
I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.

Multi core programming

I want to get into multi core programming (not language specific) and wondered what hardware could be recommended for exploring this field.
My aim is to upgrade my existing desktop.
If at all possible, I would suggest getting a dual-socket machine, preferably with quad-core chips. You can certainly get a single-socket machine, but dual-socket would let you start seeing some of the effects of NUMA memory that are going to be exacerbated as the core counts get higher and higher.
Why do you care? There are two huge problems facing multi-core developers right now:
The programming model: Parallel programming is hard, and there is (currently) no getting around this. A quad-core system will let you start playing around with real concurrency and all of the popular paradigms (threads, UPC, MPI, OpenMP, etc).
Memory: Whenever you start having multiple threads, there is going to be contention for resources, and the memory wall is growing larger and larger. A recent article at arstechnica outlines some (very preliminary) research at Sandia that shows just how bad this might become if current trends continue. Multicore machines are going to have to keep everything fed, and this will require that people be intimately familiar with their memory system. Dual-socket adds NUMA to the mix (at least on AMD machines), which should get you started down this difficult road.
If you're interested in more info on performance inconsistencies with multi-socket machines, you might also check out this technical report on the subject.
Also, others have suggested getting a system with a CUDA-capable GPU, which I think is also a great way to get into multithreaded programming. It's lower level than the stuff I mentioned above, but throw one of those on your machine if you can. The new Portland Group compilers have provisional support for optimizing loops with CUDA, so you could play around with your GPU even if you don't want to learn CUDA yourself.
Quad-core, because it'll permit you to do problems where the number of concurrent processes is > 2, which often makes problems non-trivial.
I would also, for sheer geek squee, pick up a nice NVidia card and use the CUDA API. If you have the bucks, there's a stand-alone CUDA workstation that plugs into your main computer via a cable and an expansion slot.
It depends what you want to do.
If you want to learn the basics of multithreaded programming, then you can do that on your existing single-core PC. (If you have 2 threads, then the OS will switch between them on a single-core PC. Then when you move to a dual-core PC they should automatically run in parallel on separate cores, for a 2x speedup). This has the advantage of being free! The disadvantages are that you won't see a speedup (in fact a parallel implementation is probably slightly slower due to overheads), and that buggy code has a slightly higher chance of working.
However, although you can learn multithreaded programming on a single-core box, a dual-core (or even HyperThreading) CPU would be a great help.
If you want to really stress-test the code you're writing, then as "blue tuxedo" says, you should go for as many cores as you can easily afford, and if possible get hyperthreading too.
If you want to learn about algorithms for running on graphics cards - which is a very different area to x86 multicore - then get CUDA and buy a normal nVidia graphics card that supports it.
I'd recommend at least a quad-core processor.
You could try tinkering with CUDA. It's free, not that hard to use and will run on any recent NVIDIA card.
Alternatively, you could get a PlayStation 3 and the Linux SDK and work out how to program a Cell processor. Note that the next cheapest option for Cell BE development is an order of magnitude more expensive than a PS3.
Finally, any modern motherboard that will take a Core Quad or quad-core Opteron (get a good one from Asus or some other reputable manufacturer) will let you experiment with a multi-core PC system for a reasonable sum of money.
The difficult thing with multithreaded/core programming is that it opens a whole new can of worms. The bugs you'll be faced with are usually not the one you're used to. Race conditions can remain dormant for ages until they bite and your mainstream language compiler won't assist you in any way. You'll get random data and/or crashes that only happen once a day/week/month/year, usually under the most mysterious conditions...
One thing remains true, fortunately: the higher the concurrency exhibited by a computer, the more race conditions you'll unveil.
So if you're serious about multithreaded/core programming, then go for as many cpu cores as possible. Keep in mind that neither hyperthreading nor SMT allow for the level of concurrency that multiple cores provide.
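As a small illustration of the kind of shared-state bug described above, here is a Python sketch of several threads incrementing one counter, with a lock protecting the read-modify-write. The function name and the thread and iteration counts are arbitrary. In standard CPython the interpreter lock limits how much threads truly run in parallel, so removing the lock may or may not produce a visibly wrong total on a given run, which is itself a good demonstration of how such bugs stay dormant.

```python
import threading


def count_with_lock(n_threads=4, n_iters=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_iters):
            with lock:                   # makes the read-modify-write atomic
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


print(count_with_lock())                 # always n_threads * n_iters
```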
I would agree that, depending on what you ultimately want to do, you can probably get by with just your current single-core system. Multi-core programming is basically multi-threaded programming, and you can certainly do that on a single-core chip.
When I was a student, one of our projects was to build a thread-safe implementation of the malloc library for C. Even on a single core processor, that was more than enough to cure me of my desire to get into multi-threaded programming. I would try something small like that before you start thinking about spending lots of money.
I agree with the others where I would upgrade to a quad-core processor. I am also a BIG FAN of ASUS Motherboards (the P5Q Pro is excellent for Core2Quad and Core2Duo processors)!
The draw for multi-core programming is that you have more resources to get things done faster. If you are serious about multi-core programming, then I would absolutely get a quad-core processor. I don't believe that you need to get the new i7 architecture from Intel to take advantage of multi-core processing, because anything written to take advantage of the Core2Duo or Core2Quad will just run better on the newer architecture.
If you are going to dabble in multi-core programming, then I would get a good Core2Duo processor. Remember, it's not just how many cores you have, but also how FAST the cores are to process the jobs. My Core2Duo running at 4GHz routinely completes jobs faster than my Core2Quad running at 2.4GHz even with a multi-core program.
Let me know if this helps!
JFV
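The observation above about a 4GHz dual-core beating a 2.4GHz quad-core is what Amdahl's law predicts for programs that are not almost perfectly parallel. The Python sketch below works through the arithmetic; it assumes both chips do the same work per clock cycle and ignores memory effects, so it is a rough model, and the function names are made up for the example.

```python
def relative_time(parallel_fraction, cores, clock_ghz):
    """Run time (arbitrary units) under Amdahl's law, scaled by clock speed."""
    serial_fraction = 1.0 - parallel_fraction
    return (serial_fraction + parallel_fraction / cores) / clock_ghz


def break_even_fraction(cores_a, ghz_a, cores_b, ghz_b):
    """Parallel fraction at which machines A and B take the same time."""
    ka = 1.0 - 1.0 / cores_a
    kb = 1.0 - 1.0 / cores_b
    # Solve (1 - p*ka) / ghz_a == (1 - p*kb) / ghz_b for p.
    return (ghz_a - ghz_b) / (ghz_a * kb - ghz_b * ka)


p = break_even_fraction(2, 4.0, 4, 2.4)
print(f"quad-core wins only above {p:.1%} parallel work")
print(relative_time(0.5, 2, 4.0), relative_time(0.5, 4, 2.4))
```

Under these assumptions the quad-core only pulls ahead when roughly 89% or more of the work runs in parallel, so a program that is half serial finishes sooner on the faster dual-core.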