 
Ryu Connor
Global Moderator
Topic Author
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Hardware Failures on a Million Consumer PCs

Mon Nov 19, 2012 8:52 pm

Link

Pretty interesting read.

OEM is more stable than white box.
Laptops are more stable than desktops.
An underclock of as little as .5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.
Once a hardware crash/failure has occurred in any of the three measured components (CPU, Memory, Disk) you are doomed from there forward.
Disk failures have the most rapid recurrence rate.

Many of us won't be shocked by some of these details (and more within the document). In particular I find the last one regarding disks the least shocking of all.
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Hardware Failures on a Million Consumer PCs

Mon Nov 19, 2012 9:21 pm

AKA common sense and statistics magic.
Victory requires no explanation. Defeat allows none.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 9:59 am

The take-away for me --

CPU machine check exceptions are more likely to cause an OS crash than DRAM errors. I was initially somewhat surprised by this; however, after thinking about it a bit more, it makes sense. The whole point of machine check exceptions is to prevent the CPU from doing something off the wall; they are essentially a virtual panic button to immediately shut everything down. AFAIK *all* machine check exceptions result in an OS crash (BSOD). OTOH most DRAM errors will probably result in an *application* crash or silent data corruption (neither of which is represented in the data used for this study).

They're using OS crashes as a proxy for system instability. While I understand their motivation for doing so (the data is readily available via automated crash reports), I'd be much more interested in knowing how frequently user data is lost or corrupted. Unfortunately, collecting data to do this analysis would be impractical.
Nostalgia isn't what it used to be.
 
Ryu Connor
Global Moderator
Topic Author
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 2:53 pm

Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use really is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 3:15 pm

Yup.

And as I've noted on these forums (repeatedly), the question of RAM stability/reliability is why I prefer to use ECC RAM even for desktops. This, in turn, is one of the reasons I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard plus an Athlon II, Phenom II, or FX CPU will get you an ECC capable platform for a fraction of the cost of an equivalent Intel-based solution (since Intel forces you to upgrade to a workstation/server mobo and Xeon CPU if you want ECC support).
Nostalgia isn't what it used to be.
 
cynan
Graphmaster Gerbil
Posts: 1160
Joined: Thu Feb 05, 2004 2:30 pm

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 3:44 pm

Ryu Connor wrote:
Yes, the authors delved into the fact that their methodology was worthless for determining soft errors occurring in consumer-level non-ECC RAM. That leaves us with a bit of a mystery about how poor the memory we use really is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) details that you do get greater stability for your money.


Except that the conditions consumer PCs run in are much more variable. I.e., they are less likely to be on battery backups or in dust-free environments (leading to overheating issues), and their disks are more likely to experience g-shock hazards. And then there is the case of secondary PC components (e.g., PSUs) being potentially less reliable, on average, than what most server racks use. In the end, this simply does not provide enough data to conclude that the extra cost of a Xeon over an i7 is not in fact "all profit".
 
Ryu Connor
Global Moderator
Topic Author
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 4:13 pm

The authors noted the limitations you mention.

What stands out in opposition to your viewpoint is that laptops are more stable than desktops. Laptops have to endure the same poor conditions as desktops, and yet their specialized consumer parts handle it better. It is no silver bullet to the question, but it does further support the concept that the design and market aims of the components matter.

A more curious question is why OEMs have better stability than white boxes despite enduring similar conditions and using similar parts.
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
 
Krogoth
Emperor Gerbilius I
Posts: 6049
Joined: Tue Apr 15, 2003 3:20 pm
Location: somewhere on Core Prime

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 5:38 pm

How are the authors defining "white box"? If the "white box" population consists mostly of DIY enthusiast systems, then it is no surprise that OEM systems are more stable in the sample, because the overwhelming majority of overclocked systems are in the "enthusiast" ring (almost 99%). Overclocking has always been known to trade long-term stability for more performance. OEM systems from the last 10-15 years are extremely difficult to overclock since manufacturers remove the options for it at the software level. The OEM crowd has little or no interest in overclocking, if they even know how to do it in the first place.

I'm willing to bet that once you remove overclocked systems from the sample, the differences between OEM and DIY are going to be marginal at best. They both suffer from el cheapo, bargain-basement components trying to work in tandem without blowing up in your face. They also both have a minority of users (prosumers) who are willing to spend the extra $$$$ and time to make sure that they get quality components that have been thoroughly tested to work without incident.

Memory issues are still the overwhelming cause of instability problems in a modern system. Memory doesn't like running beyond spec or enduring high temperatures for long periods of time. The only problem with the sampling is that it fails to factor in the motherboard and memory controller as possible problem spots. From my own personal experience, I have dealt with memory and motherboard combinations that refuse to work at all at certain memory dividers/multipliers (for example 1:1 or 5:6) that are still within "spec", but work "flawlessly" with other ratios (2:3).

I'm curious to see whether relaxing timings has any effect on long-term reliability for memory.
Gigabyte X670 AORUS-ELITE AX, Raphael 7950X, 2x16GiB of G.Skill TRIDENT DDR5-5600, Sapphire RX 6900XT, Seasonic GX-850 and Fractal Define 7 (W)
Ivy Bridge 3570K, 2x4GiB of G.Skill RIPSAW DDR3-1600, Gigabyte Z77X-UD3H, Corsair CX-750M V2, and PC-7B
 
Ryu Connor
Global Moderator
Topic Author
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Tue Nov 20, 2012 5:53 pm

Only 2% of the sample was overclocked, with the caveat that only 477,464 machines within the sample could have their proper clock speed identified.

Overclocked is defined as running more than 5% above rated speed.

Study wrote:
We have divided the analysis between two CPU vendors, labeled “Vendor A” and “Vendor B.” The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely. After a failure occurs, all machines, irrespective of CPU vendor or overclocking, are significantly more likely to crash from additional machine check exceptions.


The data implies that AMD and Intel also have a substantial difference in the manufacturing quality of their chips. Who is who in this study is an interesting guess.

It also implies that overclocking will sooner rather than later bite you in the ass.
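
For anyone unsure how to read "20x as likely": it is a ratio of crash probabilities between two groups (a relative risk). Here is a minimal Python sketch. The machine counts are made up purely for illustration and are not figures from the study, and `relative_risk` is a hypothetical helper name.

```python
def relative_risk(crashed_a, total_a, crashed_b, total_b):
    """Return how many times as likely group A is to crash as group B."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("group sizes must be positive")
    p_a = crashed_a / total_a  # crash probability in group A
    p_b = crashed_b / total_b  # crash probability in the baseline group B
    if p_b == 0:
        raise ValueError("baseline group has no crashes; ratio is undefined")
    return p_a / p_b


# Hypothetical counts, NOT taken from the study.
overclocked = (40, 1_000)    # 40 of 1,000 overclocked machines crashed
stock = (200, 100_000)       # 200 of 100,000 stock machines crashed

ratio = relative_risk(*overclocked, *stock)
print(f"overclocked machines are {ratio:.0f}x as likely to crash")
```

Note that a large ratio says nothing about the absolute crash rate; a rare event can still be 20x more likely and remain rare.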

As for OEM vs White Box.

Study wrote:
We identify a machine as brand name if it comes from one of the top 20 OEM computer manufacturers as measured by worldwide sales volume. To avoid conflation with other factors, we remove overclocked machines and laptops from our analysis.


So overclocking did not taint the result that OEM is more stable than white box. Anything not one of the top 20 OEMs is a white box, so DIY boxes do fall into the white box category.

Edit:

As one answer to my own musings: most OEMs slightly underclock their machines.

Study wrote:
Therefore, we further partitioned the non-overclocked machines into underclocked machines, which run below their rated frequency (65% of machines), and rated machines, which run at or no more than 0.5% above their rated frequency (32% of machines). As shown in Figure 5, underclocked machines are between 39% and 80% less likely to crash during the 8 month observation period than machines with CPUs running at their rated frequency.


A small change can have a rather large payback in stability.
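
To make the buckets concrete, here is a rough Python sketch of the classification as I read it from the quotes above: below rated is underclocked, at or up to 0.5% above is rated, and more than 5% above is overclocked. The function name `classify_clock` and the "unclassified" label for the band between 0.5% and 5% are my own assumptions; the quoted text does not say how that band is handled.

```python
def classify_clock(measured_mhz, rated_mhz):
    """Bucket a CPU by how its measured clock compares to its rated clock."""
    if rated_mhz <= 0:
        raise ValueError("rated frequency must be positive")
    ratio = measured_mhz / rated_mhz
    if ratio < 1.0:
        return "underclocked"   # runs below rated frequency
    if ratio <= 1.005:
        return "rated"          # at or no more than 0.5% above rated
    if ratio > 1.05:
        return "overclocked"    # more than 5% above rated
    return "unclassified"       # assumption: the 0.5%-5% band is left out


for measured in (2950, 3000, 3010, 3090, 3300):
    print(measured, classify_clock(measured, 3000))
```

As for the payback: "between 39% and 80% less likely" works out to roughly 0.61x down to 0.20x the crash probability of a machine running at its rated frequency (1 - 0.39 and 1 - 0.80).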
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
 
anotherengineer
Gerbil Jedi
Posts: 1688
Joined: Fri Sep 25, 2009 1:53 pm
Location: Northern, ON Canada, Yes I know, Up in the sticks

Re: Hardware Failures on a Million Consumer PCs

Sat Dec 15, 2012 10:33 pm

"I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard"

JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?
Life doesn't change after marriage, it changes after children!
 
sschaem
Gerbil Team Leader
Posts: 282
Joined: Tue Oct 02, 2007 11:05 am

Re: Hardware Failures on a Million Consumer PCs

Sat Dec 15, 2012 10:45 pm

"Laptops are more stable than desktops" ?
 
vargis14
Gerbil Jedi
Posts: 1900
Joined: Fri Aug 20, 2010 6:03 pm
Location: philly suburbs

Re: Hardware Failures on a Million Consumer PCs

Sat Dec 15, 2012 11:37 pm

I am playing my free copy of Metro 2033 and having fun... just popped out to check the forums real quick, and I am astounded that laptops die less than desktops!

Note that I did not read the PDF; it's too long and I want to get back to 2033.

Most DIY boxes and desktops/home servers stay on 24/7. Laptops do not stay on 24/7.
I wonder if that is why laptops came out more reliable than desktops?

All 3 of my machines are on 24/7.....well at least at idle they all downclock:)
2600k@4848mhz @1.4v CM Nepton40XL 16gb Ram 2x EVGA GTX770 4gb Classified cards in SLI@1280mhz Stock boost on a GAP67-UD4-B3, SBlaster Z powered by TX-850 PSU pushing a 34" LG 21/9 3440-1440 IPS panel. Pieced together 2.1 sound system
 
Ryu Connor
Global Moderator
Topic Author
Posts: 4369
Joined: Thu Dec 27, 2001 7:00 pm
Location: Marietta, GA

Re: Hardware Failures on a Million Consumer PCs

Sun Dec 16, 2012 2:13 am

Laptop hardware is built to endure a more hostile environment than desktops.
All of my written content here on TR does not represent or reflect the views of my employer or any reasonable human being. All content and actions are my own.
 
Captain Ned
Global Moderator
Posts: 28704
Joined: Wed Jan 16, 2002 7:00 pm
Location: Vermont, USA

Re: Hardware Failures on a Million Consumer PCs

Sun Dec 16, 2012 2:20 am

Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.
What we have today is way too much pluribus and not enough unum.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Sun Dec 16, 2012 2:31 am

anotherengineer wrote:
JBI, since you have experience with this, does this mean Gigabyte consumer mobos do not offer ECC RAM support?

I have not checked recently, but as of about 3 years ago, no, they did not.
Nostalgia isn't what it used to be.
 
ptsant
Gerbil XP
Posts: 397
Joined: Mon Oct 05, 2009 12:45 pm

Re: Hardware Failures on a Million Consumer PCs

Sun Dec 16, 2012 4:04 am

Ryu Connor wrote:
Link
OEM is more stable than white box.
An underclock of as little as .5% has a huge impact on stability.
Overclocking has a substantial likelihood of failure.


I think that OEMs do want more stable systems (lower support costs) while consumers want more performance for the buck. Therefore OEMs are more likely to use a decent case/airflow and especially to ensure that the PSU is of sufficient quality for the rated performance, instead of going for the most expensive high-end gfx card. These two parameters are often neglected.

I am also a fan of ECC memory, although it isn't strictly necessary for a gaming-only system. My AMD system with ECC RAM and a Seasonic 80+ Gold 750W PSU never crashes. Think 100% load 24/7 over several days. My point is, it can be done. It's just that enthusiasts don't care if they have to reset once in a while. They'd rather have 10% more performance.
 
cynan
Graphmaster Gerbil
Posts: 1160
Joined: Thu Feb 05, 2004 2:30 pm

Re: Hardware Failures on a Million Consumer PCs

Wed Dec 19, 2012 6:44 pm

Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.


What about the fact that laptops are often run off of batteries? Many random stability issues can involve fluctuations in power supply. A battery takes all of this out of the equation.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Hardware Failures on a Million Consumer PCs

Wed Dec 19, 2012 6:51 pm

Any system you actually care about should be on a UPS.
Nostalgia isn't what it used to be.
 
bthylafh
Maximum Gerbil
Posts: 4320
Joined: Mon Dec 29, 2003 11:55 pm
Location: Southwest Missouri, USA

Re: Hardware Failures on a Million Consumer PCs

Wed Dec 19, 2012 7:05 pm

Captain Ned wrote:
Ryu Connor wrote:
Laptop hardware is built to endure a more hostile environment than desktops.

Tell that to the 15YO daughter. I and the Lenovo service guide are best buds.


I think when that comes up, if we're still using laptops by that point, I'll be looking at a used Toughbook for my daughter.
Hakkaa päälle!
i7-8700K|Asus Z-370 Pro|32GB DDR4|Asus Radeon RX-580|Samsung 960 EVO 1TB|1988 Model M||Logitech MX 518 & F310|Samsung C24FG70|Dell 2209WA|ATH-M50x
