Anomymous Gerbil wrote: I see quite a lot of posts talking about the problems with failure rates of such massive systems. But why is that a problem?
Surely the systems and apps are designed such that the failures are essentially invisible to the apps - or is that not true?
Surely any such computers are built with easily-replaceable modules - or is that not true?
Is it a financial problem, i.e. the sheer cost of staff/materials/etc of finding and replacing all those modules?
Or is it that failures can occur which aren't detected, thereby spoiling the computations?
Or...?
(Just curious, I have no knowledge of these sorts of systems.)
Sorry, but I need to use "internet-style" quotes to attempt an answer.
>Surely the systems and apps are designed such that the failures are essentially invisible to the apps - or is that not true?
Good luck with that. Some apps try to do near-continuous checkpointing, so that little time is lost if they have to roll back to a last-known-good state after a fault/crash. Most don't, and many can't even tell that they've been corrupted until it's way too late. For most it's hard: there's just too much state, or too little filesystem bandwidth to take such checkpoints. The "state" of an HPC app on a modern supercomputer (even just counting memory state, not counting filesystem state, and definitely not counting network state (stuff sent, but not fully received)) could be many terabytes.
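To give a flavor of what "checkpoint and roll back" means, here's a toy Python sketch (nothing like a real HPC checkpoint library, which has to stream terabytes to a parallel filesystem; the file name and checkpoint interval are just made up for illustration):

```python
import os
import pickle
import tempfile

def write_checkpoint(state, path="checkpoint.pkl"):
    """Dump the app's state atomically, so a crash mid-write
    can't leave a half-written (corrupted) checkpoint behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: old checkpoint stays valid until the new one is complete

def restore_checkpoint(path="checkpoint.pkl"):
    """Roll back to the last-known-good state, or start from scratch."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"step": 0, "result": 0}

# Toy main loop: checkpoint every 100 steps, resume wherever we left off.
state = restore_checkpoint()
for step in range(state["step"], 1000):
    state["result"] += step          # stand-in for the real computation
    state["step"] = step + 1
    if state["step"] % 100 == 0:
        write_checkpoint(state)
```

The painful part isn't the logic above; it's that the "state" for a real app is terabytes, and writing that out every few minutes eats your filesystem bandwidth alive.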
>Surely any such computers are built with easily-replaceable modules - or is that not true?
There's a difference between "stuck faults", where some component is permanently broken, and "transient faults", where an otherwise "good" component just gets a wrong answer once in a trillion times, or crashes out of the blue. Some marginal components will always pass diagnostics, but still occasionally (once in a gazillion times) get the wrong answer or otherwise crash. If you have enough "spares", the marginal components should be replaced. But declaring a component marginal is a "policy" decision, where you set some threshold error rate (failures over time). And then you've got a gazillion components to carefully check and track, and most times you can't tell exactly what was at fault.
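The "policy" part really is just bookkeeping plus a threshold you picked. A toy sketch in Python (the threshold, window, and component name below are invented purely for illustration; every site tunes its own):

```python
import time
from collections import defaultdict

# Invented numbers for illustration: N errors within the window -> declare the part marginal.
ERROR_THRESHOLD = 5
WINDOW_HOURS = 24 * 7

error_times = defaultdict(list)  # component id -> timestamps of observed errors

def record_error(component_id, now=None):
    """Log one transient error and decide whether the part crossed the 'replace it' line."""
    now = time.time() if now is None else now
    window_start = now - WINDOW_HOURS * 3600
    recent = [t for t in error_times[component_id] if t >= window_start]
    recent.append(now)
    error_times[component_id] = recent
    if len(recent) >= ERROR_THRESHOLD:
        print(f"{component_id}: {len(recent)} errors in the last {WINDOW_HOURS} h, flag for replacement")

# e.g. call record_error("node0731.dimm3") every time a node logs a corrected error
```

And even that assumes your telemetry can actually pin the error on the right component, which (as above) it often can't.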
>Is it a financial problem, i.e. the sheer cost of staff/materials/etc of finding and replacing all those modules?
Imagine that you have 1 million of a thing designed to a 1-million-hour MTBF. That means that, on average, 1 will fail per hour somewhere in the machine. The real problem is that stuck-faults are easy to find/fix, but transient-faults can be nasty to find and harder to fix. I.e., answer this question: is this chip getting the wrong answer because the chip is bad, or because its motherboard is bad, or because the solder joints holding the chip to the motherboard are bad, or because the power supply powering the motherboard is bad? Of the things you mention, "staff" is probably the worst, because figuring out exactly what's wrong can be hard when you are dealing with transient failures.
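The arithmetic behind that is worth seeing once, because it's what makes the scale scary (trivial Python, using the numbers from the example above):

```python
# Back-of-envelope: N identical parts, each with a given MTBF (mean time between failures).
parts = 1_000_000
mtbf_hours = 1_000_000

failures_per_hour = parts / mtbf_hours   # expected failures across the whole machine
print(failures_per_hour)                 # 1.0 -> roughly one failure somewhere every hour
print(failures_per_hour * 24)            # ~24 failures somewhere over a 24-hour run
```

So even when every individual part is spectacularly reliable, the machine as a whole sees failures constantly.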
>Or is it that failures can occur which aren't detected, thereby spoiling the computations?
Define "detected". Say, for example, that my CPU issues a load from memory to a register. If a cosmic or alpha hits that target register and flips a bit (that I'm about to overwrite) before the memory contents get there, I just don't care. When designing a supercomputer, you think about how long a bit in some register is likely to stay there (over a range of apps), and what its probability of being killed/flipped by an alpha or cosmic would be before it's overwritten. Memory errors occur all the time (because there's usually/hopefully lots of memory, but ECC usually corrects them to return the right contents, and Servers use "scrubbing" that runs around in the background and reads/corrects/rewrites memory to continually fix single-bit errors (hopefully before they become uncorrectable multi-bit errors). (That's mostly why the computers I build have ECC.)
One of the more important benchmarks for supercomputers is Linpack, which can run for about a day. It produces an "answer" and a "residual". The answer has to be in the "ballpark", which allows some mixing/reordering of floating-point operations. But if the residual is too large, it means your supercomputer made an error in some calculation, and God-only-knows-where-or-when it happened in that day-long run.
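For the curious, the residual check is roughly the following (a toy NumPy version of the idea; the real HPL benchmark uses its own factorization and its own scaled-residual formula, with roughly 16 as its pass/fail threshold, so treat the exact numbers here as illustrative):

```python
import numpy as np

# Toy stand-in for the Linpack/HPL verification: solve A x = b, then check the residual.
n = 1000
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

x = np.linalg.solve(A, b)

def scaled_residual(A, x, b):
    """One common scaled-residual formula: should come out O(1) if nothing went wrong."""
    eps = np.finfo(float).eps
    n = A.shape[0]
    return np.linalg.norm(A @ x - b, ord=np.inf) / (
        np.linalg.norm(A, ord=np.inf) * np.linalg.norm(x, ord=np.inf) * n * eps
    )

print("clean run:", scaled_residual(A, x, b))       # small, passes

# Now pretend one calculation went wrong somewhere mid-run (a tiny perturbation as a stand-in):
x_bad = x.copy()
x_bad[123] += 1e-6
print("faulty run:", scaled_residual(A, x_bad, b))  # blows up well past any reasonable threshold
```

Notice that the check only tells you *that* something went wrong, not where or when, which is exactly the problem.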
That all probably raised more questions than it answered, but I hope it helped.