Hardware Failures on a Million Consumer PCs

Posted on Mon Nov 19, 2012 8:52 pm

Link

Pretty interesting read.

OEM is more stable than white box.
Laptops are more stable than desktops.
An underclock of as little as 0.5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.
Once a hardware crash/failure has occurred in any of the three measured components (CPU, memory, disk), you are doomed from there forward.
Disk failures have the most rapid recurrence rate.

Many of us won't be shocked by some of these details (and more within the document). In particular I find the last one regarding disks the least shocking of all.
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3598
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Posted on Mon Nov 19, 2012 9:21 pm

AKA common sense and statistics magic.
Z68XP-UD4 | 2700K @ 4.7 GHz | 16 GB | GTX 780 SLI | PCP&C Silencer 950 | XSPC RX360 | Heatkiller R3 | D5 + RP-452X2 | HAF 932 | 480 GB Extreme Pro
Waco
Gerbil Elite
 
Posts: 812
Joined: Tue Jan 20, 2009 4:14 pm

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 9:59 am

The take-away for me --

CPU machine check exceptions are more likely to cause an OS crash than DRAM errors. I was initially somewhat surprised by this; however, after thinking about it a bit more, it makes sense. The whole point of machine check exceptions is to prevent the CPU from doing something off the wall: essentially a virtual panic button to immediately shut everything down. AFAIK *all* machine check exceptions result in an OS crash (BSOD). OTOH most DRAM errors will probably result in an *application* crash or silent data corruption (neither of which is represented in the data used for this study).

They're using OS crashes as a proxy for system instability. While I understand their motivation for doing so (the data is readily available via automated crash reports), I'd be much more interested in knowing how frequently user data is lost or corrupted. Unfortunately, collecting data to do this analysis would be impractical.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 38085
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 2:53 pm

Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3598
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 3:15 pm

Yup.

And as I've noted on these forums (repeatedly), the question of RAM stability/reliability is why I prefer to use ECC RAM even for desktops. This, in turn, is one of the reasons I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard plus an Athlon II, Phenom II, or FX CPU will get you an ECC capable platform for a fraction of the cost of an equivalent Intel-based solution (since Intel forces you to upgrade to a workstation/server mobo and Xeon CPU if you want ECC support).
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 38085
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 3:44 pm

Ryu Connor wrote: Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.


Except that the conditions consumer PCs run in are much more variable. I.e., they are less likely to be on battery backups or in dust-free environments (leading to overheating issues), and their disks are more likely to experience g-shock hazards. And then there is the case of secondary PC components being potentially less reliable, on average, than what most server racks use (e.g., PSUs). In the end, this simply does not provide enough data to conclude that the extra cost of a Xeon over an i7 is not in fact "all profit".
cynan
Gerbil Elite
Gold subscriber
 
 
Posts: 844
Joined: Thu Feb 05, 2004 2:30 pm

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 4:13 pm

The authors noted your limitations.

What stands out in opposition to your viewpoint is that laptops are more stable than desktops. Laptops have to endure these same poor conditions as desktops and yet their specialized consumer parts handle it better. It is no silver bullet to the question, but it does further support the concept that the design and market aims of the components matter.

A more curious question is why OEMs have better stability than white boxes despite enduring similar conditions and using similar parts.
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3598
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 5:38 pm

What is the author defining "white box" as? If the "white box" population of DIY systems consists mostly of enthusiast systems, then it is no surprise that OEM systems are more stable in the samples. That is because the overwhelming majority of overclocked systems are in the "enthusiast" ring (almost 99%). Overclocking has always been known to reduce long-term stability in exchange for more performance. OEM systems in the last 10-15 years are extremely difficult to overclock, since manufacturers remove the options for it at the software level. The OEM crowd has little or no interest in overclocking, if they even know how to do it in the first place.

I'm willing to bet that once you remove overclocked systems from the samples, the differences between OEM and DIY are going to be marginal at best. They both suffer from el cheapo, bargain-basement components trying to work in tandem without blowing up in your face. They also both have a minority of users who are willing to spend the extra $$$$ and time to make sure that they get quality components that have been thoroughly tested to work without incident (prosumers).

Memory issues are still the overwhelming cause of instability problems in a modern system. Memory doesn't like running beyond spec or enduring high temperatures for long periods of time. The only problem with the sampling is that it fails to factor in the motherboard and memory controller as possible problem spots. From my own personal experience, I have dealt with memory and motherboard combinations that refuse to work at all at certain memory dividers/multipliers (for example 1:1, 5:6) that are still running within "spec", but work "flawlessly" with other ratios (2:3).

I'm curious to see if relaxing timings has any effect on long-term reliability for memory.
Ivy Bridge i5-3570K@4.0Ghz, Gigabyte Z77X-UD3H, 2x4GiB of PC-12800, EVGA 660Ti, Corsair CX-600 and Fractal Refined R4 (W). Kentsfield Q6600@3Ghz, HD 4850 2x2GiB PC2-6400, Gigabyte EP45-DS4P, OCZ Modstream 700W, and PC-7B.
Krogoth
Maximum Gerbil
Silver subscriber
 
 
Posts: 4474
Joined: Tue Apr 15, 2003 3:20 pm
Location: somewhere on Core Prime

Re: Hardware Failures on a Million Consumer PCs

Posted on Tue Nov 20, 2012 5:53 pm

Only 2% of the sample was overclocked, with the caveat that only 477,464 machines within the sample could have their proper clock speed identified.

Overclocked is defined as being 5% outside of rated speeds.

Study wrote: We have divided the analysis between two CPU vendors, labeled “Vendor A” and “Vendor B.” The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely. After a failure occurs, all machines, irrespective of CPU vendor or overclocking, are significantly more likely to crash from additional machine check exceptions.


The data implies that AMD and Intel also have a substantial difference in the manufacturing quality of their chips. Who is who in this study is an interesting guess.

It also implies that overclocking will sooner rather than later bite you in the ass.

As for OEM vs White Box.

Study wrote: We identify a machine as brand name if it comes from one of the top 20 OEM computer manufacturers as measured by worldwide sales volume. To avoid conflation with other factors, we remove overclocked machines and laptops from our analysis.


So overclocking did not taint the result that OEM is more stable than white box. Anything not one of the top 20 OEMs is a white box, so DIY boxes do fall into the white box category.

Edit:

As one answer to my own musings: most OEMs slightly underclock their machines.

Study wrote: Therefore, we further partitioned the non-overclocked machines into underclocked machines, which run below their rated frequency (65% of machines), and rated machines, which run at or no more than 0.5% above their rated frequency (32% of machines). As shown in Figure 5, underclocked machines are between 39% and 80% less likely to crash during the 8 month observation period than machines with CPUs running at their rated frequency.


A small change can have a rather large payback in stability.
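
For anyone who wants to bucket their own machines the same way, here is a rough Python sketch of the thresholds quoted above: below rated frequency is underclocked, at or up to 0.5% above is rated, and 5% past rated is overclocked (which I am reading as more than 5% above). The function name, the MHz inputs, and the "unclassified" label for the gap between 0.5% and 5% are my own assumptions for illustration, not anything taken from the study.

```python
def classify_clock(measured_mhz: float, rated_mhz: float) -> str:
    """Bucket a CPU by how its measured clock compares with its rated clock.

    Thresholds follow the study excerpts quoted in this thread; the
    "unclassified" bucket for the 0.5%..5% gap is an assumption.
    """
    if rated_mhz <= 0:
        raise ValueError("rated_mhz must be positive")

    ratio = measured_mhz / rated_mhz

    if ratio < 1.0:
        return "underclocked"   # runs below its rated frequency
    if ratio <= 1.005:
        return "rated"          # at or no more than 0.5% above rated
    if ratio > 1.05:
        return "overclocked"    # more than 5% above rated (assumed reading)
    return "unclassified"       # between 0.5% and 5% above rated


if __name__ == "__main__":
    # Hypothetical readings for a chip rated at 3000 MHz.
    for mhz in (2990, 3000, 3010, 3100, 3200):
        print(mhz, classify_clock(mhz, 3000))
```

The point of writing it out is how narrow the "rated" band is: a few MHz either way on a 3 GHz part moves a machine from one bucket to another.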
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3598
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Posted on Sat Dec 15, 2012 10:33 pm

"I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard"

JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?
Life doesn't change after marriage, it changes after children!
anotherengineer
Gerbil Elite
 
Posts: 608
Joined: Fri Sep 25, 2009 1:53 pm
Location: Timmins, ON Canada, Yes I know, Up in the sticks

Re: Hardware Failures on a Million Consumer PCs

Posted on Sat Dec 15, 2012 10:45 pm

"Laptops are more stable than desktops" ?
sschaem
Gerbil Team Leader
 
Posts: 265
Joined: Tue Oct 02, 2007 11:05 am

Re: Hardware Failures on a Million Consumer PCs

Posted on Sat Dec 15, 2012 11:37 pm

I am playing my free copy of Metro 2033 and having fun... just popped out to check the forums real quick and I am astounded that laptops die less than desktops!

Note that I did not read the PDF; it's too long and I want to get back to 2033.

Most DIY boxes and desktops/home servers stay on 24/7. Laptops do not stay on 24/7.
Wonder if that is why laptops came out more reliable than desktops?

All 3 of my machines are on 24/7... well, at least at idle they all downclock :)
2600k HT on@4705mhz 8gb Cas9 1600 mem 2x EVGA GTX770 4gb Classified cards in SLI @1320 mhz core and 2003 mhz mem,mounted in CM HAF922 with a TX-850 PSU 2xHTPC's 2xi3 2120 3.3ghz dual core,1xasus LP HD6570 1xHIS hd7750@1150core1325mem,55"PanyVT30
vargis14
Graphmaster Gerbil
 
Posts: 1323
Joined: Fri Aug 20, 2010 6:03 pm
Location: philly suburbs

Re: Hardware Failures on a Million Consumer PCs

Posted on Sun Dec 16, 2012 2:13 am

Laptop hardware is built to endure a more hostile environment than desktops.
"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"
Ryu Connor
Global Moderator
Gold subscriber
 
 
Posts: 3598
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Posted on Sun Dec 16, 2012 2:20 am

Ryu Connor wrote:Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.
Life is hard; but it's harder if you're stupid. Big Al.
Captain Ned
Global Moderator
Gold subscriber
 
 
Posts: 20628
Joined: Wed Jan 16, 2002 7:00 pm
Location: Vermont, USA

Re: Hardware Failures on a Million Consumer PCs

Posted on Sun Dec 16, 2012 2:31 am

anotherengineer wrote: JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?

I have not checked recently, but as of about 3 years ago, no, they did not.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 38085
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Posted on Sun Dec 16, 2012 4:04 am

Ryu Connor wrote: Link
OEM is more stable than white box.
An underclock of as little as 0.5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.


I think that OEMs do want more stable systems (lower support costs) while consumers want more performance for the buck. Therefore OEMs are more likely to use a decent case/airflow and especially to ensure that the PSU is of sufficient quality for the rated performance, instead of going for the most expensive high-end gfx card. These two parameters are often neglected.

I am also a fan of ECC memory, although it isn't strictly necessary for a gaming-only system. My AMD system with ECC RAM and a Seasonic 80+ Gold 750W PSU never crashes. Think 100% load 24/7 over several days. My point is, it can be done. It's just that enthusiasts don't care if they have to reset once in a while. They'd rather have 10% more performance.
ptsant
Gerbil
Gold subscriber
 
 
Posts: 58
Joined: Mon Oct 05, 2009 12:45 pm

Re: Hardware Failures on a Million Consumer PCs

Posted on Wed Dec 19, 2012 6:44 pm

Ryu Connor wrote:Laptop hardware is built to endure a more hostile environment than desktops.


What about the fact that laptops are often run off of batteries? Many random stability issues can involve fluctuations in power supply. A battery takes all of this out of the equation.
cynan
Gerbil Elite
Gold subscriber
 
 
Posts: 844
Joined: Thu Feb 05, 2004 2:30 pm

Re: Hardware Failures on a Million Consumer PCs

Posted on Wed Dec 19, 2012 6:51 pm

Any system you actually care about should be on a UPS.
(this space intentionally left blank)
just brew it!
Administrator
Gold subscriber
 
 
Posts: 38085
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Posted on Wed Dec 19, 2012 7:05 pm

Captain Ned wrote:
Ryu Connor wrote:Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.


I think when that comes up, if we're still using laptops by that point, I'll be looking at a used Toughbook for my daughter.
Think for yourself, schmuck!
i5-2500K@4.3|Asus P8P67-LE|8GB DDR3-1600|Powercolor R7850 2G|1.5TB 7200.11|1988 Model M|Saitek X-45 & P880|Logitech MX 518|Dell 2209WA|Sennheiser PC151|Asus Xonar DX
bthylafh
Grand Gerbil Poohbah
 
Posts: 3232
Joined: Mon Dec 29, 2003 11:55 pm
Location: Southwest Missouri, USA

