Hardware Failures on a Million Consumer PCs

Posted: Mon Nov 19, 2012 8:52 pm
by Ryu Connor
Link

Pretty interesting read.

OEM is more stable than white box.
Laptops are more stable than desktops.
An underclock of as little as .5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.
Once a hardware crash/failure has occurred in any of the three measured components (CPU, Memory, Disk) you are doomed from there forward.
Disk failures have the most rapid recurrence rate.

Many of us won't be shocked by some of these details (and more within the document). In particular I find the last one regarding disks the least shocking of all.

Re: Hardware Failures on a Million Consumer PCs

Posted: Mon Nov 19, 2012 9:21 pm
by Waco
AKA common sense and statistics magic.

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 9:59 am
by just brew it!
The take-away for me --

CPU machine check exceptions are more likely to cause an OS crash than DRAM errors. I was initially somewhat surprised by this; however, after thinking about it a bit more, it makes sense. The whole point of machine check exceptions is to prevent the CPU from doing something off the wall; they are essentially a virtual panic button to immediately shut everything down. AFAIK *all* machine check exceptions result in an OS crash (BSOD). OTOH most DRAM errors will probably result in an *application* crash or silent data corruption (neither of which is represented in the data used for this study).

They're using OS crashes as a proxy for system instability. While I understand their motivation for doing so (the data is readily available via automated crash reports), I'd be much more interested in knowing how frequently user data is lost or corrupted. Unfortunately, collecting data to do this analysis would be impractical.

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 2:53 pm
by Ryu Connor
Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 3:15 pm
by just brew it!
Yup.

And as I've noted on these forums (repeatedly), the question of RAM stability/reliability is why I prefer to use ECC RAM even for desktops. This, in turn, is one of the reasons I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard plus an Athlon II, Phenom II, or FX CPU will get you an ECC capable platform for a fraction of the cost of an equivalent Intel-based solution (since Intel forces you to upgrade to a workstation/server mobo and Xeon CPU if you want ECC support).

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 3:44 pm
by cynan
Ryu Connor wrote:
Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.


Except that the conditions consumer PCs run in are much more variable. I.e., they are less likely to be on battery backups or in dust-free environments (leading to overheating issues), and their disks are more likely to experience g-shock hazards. And then there is the case of secondary PC components being potentially less reliable, on average, than what most server racks use (e.g., PSUs). In the end, this simply does not provide enough data to conclude that the extra cost of a Xeon over an i7 is not in fact "all profit".

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 4:13 pm
by Ryu Connor
The authors noted the limitations you describe.

What stands out in opposition to your viewpoint is that laptops are more stable than desktops. Laptops have to endure the same poor conditions as desktops, and yet their specialized consumer parts handle it better. It is no silver bullet to the question, but it does further support the concept that the design and market aims of the components matter.

A more curious question is why OEMs have better stability than white boxes despite enduring similar conditions and using similar parts.

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 5:38 pm
by Krogoth
How is the author defining "white box"? If the "white box" population consists mostly of DIY enthusiast systems, then it is no surprise that OEM systems are more stable in the samples, because the overwhelming majority of overclocked systems (almost 99%) are in the "enthusiast" ring. Overclocking has always been known to trade long-term stability for more performance. OEM systems from the last 10-15 years are extremely difficult to overclock, since manufacturers remove the options for it at the software level. The OEM crowd has little or no interest in overclocking, if they even know how to do it in the first place.

I'm willing to bet that once you remove overclocked systems from the samples, the differences between OEM and DIY are going to be marginal at best. They both suffer from el cheapo, bargain-basement components trying to work in tandem without blowing up in your face. They also both have a minority of users (prosumers) who are willing to spend the extra $$$$ and time to make sure that they get quality components that have been thoroughly tested to work without incident.

Memory issues are still the overwhelming cause of instability problems in a modern system. Memory doesn't like running beyond spec or enduring high temperatures for long periods of time. The only problem with the sampling is that it fails to factor in the motherboard and memory controller as possible problem spots. From my own personal experience, I have dealt with memory and motherboard combinations that refuse to work at all at certain memory dividers/multipliers (e.g., 1:1, 5:6) that are still running within "spec", but work "flawlessly" with other ratios (2:3).

I'm curious to see if relaxing timings has any effect on long-term reliability for memory.

Re: Hardware Failures on a Million Consumer PCs

Posted: Tue Nov 20, 2012 5:53 pm
by Ryu Connor
Only 2% of the sample was overclocked, with the caveat that only 477,464 machines within the sample could have their proper clock speed identified.

Overclocked is defined as running more than 5% above rated speed.

Study wrote:
We have divided the analysis between two CPU vendors, labeled “Vendor A” and “Vendor B.” The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely. After a failure occurs, all machines, irrespective of CPU vendor or overclocking, are significantly more likely to crash from additional machine check exceptions.


The data implies that AMD and Intel also have a substantial difference in the manufacturing quality of their chips. Who is who in this study is an interesting guess.

It also implies that overclocking will sooner rather than later bite you in the ass.
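
For anyone who wants to see what a figure like "20x as likely" means as plain arithmetic, here is a minimal Python sketch. The function name `relative_risk` and the counts in the usage line are made up for illustration; they are not from the study, and the authors may well compute their numbers differently.

```python
def relative_risk(crashed_a: int, total_a: int, crashed_b: int, total_b: int) -> float:
    """How many times as likely group A is to crash compared with group B.

    Equivalent to (crashed_a / total_a) / (crashed_b / total_b), but
    cross-multiplied so the counts stay integers until the final division.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group sizes must be positive")
    if crashed_b <= 0:
        raise ValueError("baseline group needs at least one crash")
    return (crashed_a * total_b) / (crashed_b * total_a)


# Made-up counts, NOT data from the study:
# 20 of 1000 overclocked machines crash vs. 1 of 1000 non-overclocked ones.
print(relative_risk(20, 1000, 1, 1000))  # 20.0
```

Note that a large ratio says nothing about how common crashes are in absolute terms; it only compares the two groups.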

As for OEM vs. white box:

Study wrote:
We identify a machine as brand name if it comes from one of the top 20 OEM computer manufacturers as measured by worldwide sales volume. To avoid conflation with other factors, we remove overclocked machines and laptops from our analysis.


So overclocking did not taint the result that OEM is more stable than white box. Anything not one of the top 20 OEMs is a white box, so DIY boxes do fall into the white box category.

Edit:

As one answer to my own musings: most OEMs slightly underclock their machines.

Study wrote:
Therefore, we further partitioned the non-overclocked machines into underclocked machines, which run below their rated frequency (65% of machines), and rated machines, which run at or no more than 0.5% above their rated frequency (32% of machines). As shown in Figure 5, underclocked machines are between 39% and 80% less likely to crash during the 8 month observation period than machines with CPUs running at their rated frequency.


A small change can have a rather large payback in stability.
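
The partition quoted above boils down to comparing measured clock speed against rated clock speed. A rough Python sketch, assuming the thresholds as quoted in this thread (below rated = underclocked, at or up to 0.5% above = rated, more than 5% above = overclocked). The function name `classify_clock`, the MHz inputs, and the "unclassified" bucket for the gap between 0.5% and 5% are my own assumptions, not the study's code.

```python
def classify_clock(measured_mhz: float, rated_mhz: float) -> str:
    """Bucket a machine by how its measured CPU clock compares to its rated clock."""
    if rated_mhz <= 0:
        raise ValueError("rated_mhz must be positive")
    ratio = measured_mhz / rated_mhz
    if ratio > 1.05:
        # More than 5% above rated speed (assumed reading of the study's definition).
        return "overclocked"
    if ratio < 1.0:
        # Anything below rated frequency.
        return "underclocked"
    if ratio <= 1.005:
        # At rated frequency, or no more than 0.5% above it.
        return "rated"
    # Between 0.5% and 5% above rated: the quoted text does not name this
    # group, so this label is a placeholder of my own.
    return "unclassified"


print(classify_clock(2985, 3000))  # underclocked
print(classify_clock(3000, 3000))  # rated
print(classify_clock(3300, 3000))  # overclocked
```

In practice the hard part is knowing the rated speed at all, which is presumably why only part of the sample could be classified.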

Re: Hardware Failures on a Million Consumer PCs

Posted: Sat Dec 15, 2012 10:33 pm
by anotherengineer
"I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard"

JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?

Re: Hardware Failures on a Million Consumer PCs

Posted: Sat Dec 15, 2012 10:45 pm
by sschaem
"Laptops are more stable than desktops" ?

Re: Hardware Failures on a Million Consumer PCs

Posted: Sat Dec 15, 2012 11:37 pm
by vargis14
I am playing my free copy of Metro 2033 and having fun... just popped out to check the forums real quick and I am astounded that laptops die less than desktops!

Note that I did not read the PDF; it's too long and I want to get back to 2033.

Most DIY boxes and desktops/home servers stay on 24/7. Laptops do not stay on 24/7.
Wonder if that is why laptops came out more reliable than desktops?

All 3 of my machines are on 24/7.....well at least at idle they all downclock:)

Re: Hardware Failures on a Million Consumer PCs

Posted: Sun Dec 16, 2012 2:13 am
by Ryu Connor
Laptop hardware is built to endure a more hostile environment than desktops.

Re: Hardware Failures on a Million Consumer PCs

Posted: Sun Dec 16, 2012 2:20 am
by Captain Ned
Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.

Re: Hardware Failures on a Million Consumer PCs

Posted: Sun Dec 16, 2012 2:31 am
by just brew it!
anotherengineer wrote:
JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?

I have not checked recently, but as of about 3 years ago, no, they did not.

Re: Hardware Failures on a Million Consumer PCs

Posted: Sun Dec 16, 2012 4:04 am
by ptsant
Ryu Connor wrote:
Link
OEM is more stable than white box.
An underclock of as little as .5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.


I think that OEMs do want more stable systems (lower support costs), while consumers want more performance for the buck. Therefore OEMs are more likely to use a decent case/airflow and especially to ensure that the PSU is of sufficient quality for the rated performance, instead of going for the most expensive high-end gfx card. These two parameters are often neglected.

I am also a fan of ECC memory, although it isn't strictly necessary for a gaming-only system. My AMD system with ECC RAM and a Seasonic 80+ Gold 750W PSU never crashes. Think 100% load 24/7 over several days. My point is, it can be done. It's just that enthusiasts don't care if they have to reset once in a while. They'd rather have 10% more performance.

Re: Hardware Failures on a Million Consumer PCs

Posted: Wed Dec 19, 2012 6:44 pm
by cynan
Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.


What about the fact that laptops are often run off of batteries? Many random stability issues can involve fluctuations in power supply. A battery takes all of this out of the equation.

Re: Hardware Failures on a Million Consumer PCs

Posted: Wed Dec 19, 2012 6:51 pm
by just brew it!
Any system you actually care about should be on a UPS.

Re: Hardware Failures on a Million Consumer PCs

Posted: Wed Dec 19, 2012 7:05 pm
by bthylafh
Captain Ned wrote:
Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.


I think when that comes up, if we're still using laptops by that point, I'll be looking at a used Toughbook for my daughter.