Google publishes hard drive failure study

Amidst all its plotting to take over the Internet, Google has published (PDF) a rather interesting paper regarding the failure rates of hard drives. Studies on the subject are fairly hard to come by, but Google reckons its thousands of servers offer a good sample size to spot trends in hard drive failure rates. Google introduces the paper as follows:

We have built an infrastructure that collects vital information about all Google’s systems every few minutes, and a repository that stores these data in timeseries format (essentially forever) for further analysis. The information collected includes environmental factors (such as temperatures), activity levels and many of the Self-Monitoring Analysis and Reporting Technology (SMART) parameters that are believed to be good indicators of disk drive health. We mine through these data and attempt to find evidence that corroborates or contradicts many of the commonly held beliefs about how various factors can affect disk drive lifetime.

Our paper is unique in that it is based on data from a disk population size that is typically only available from vendor warranty databases, but has the depth of deployment visibility and detailed lifetime follow-up that only an end-user study can provide.

The results are surprising. For instance, Google’s data suggest that high drive temperatures and high utilization don’t necessarily translate to higher failure rates. The data also suggest that the highest failure rates occur in drives that are three years old.

Disappointingly, Google omits to mention what might be the most important piece of information of all: which manufacturers have the most failure-prone drives. Perhaps the search giant doesn’t want a lawsuit on its hands, or perhaps it doesn’t want to risk compromising any juicy discounts it might receive from hard drive makers. Nevertheless, Google claims differences in failure rates between drive models or brands are not significant. “In contrast to age-related results, we note that all results shown in the rest of the paper are not affected significantly by the population mix,” the paper says.

Comments closed
    • Klopsik206
    • 14 years ago

    Did I get it right?

    – There ARE NO significant reliability differences between manufacturers
    – There ARE significant reliability differences between HDD models (but they do not reveal those)
    – SMART is close to useless
    – Don’t worry much about temperature and workload
    – The two most “dangerous” moments are: a brand-new drive, and a 3-year-old drive.

    There are no clues how to shop for a drive, but at least we can save some pennies on cooling 😉

      • tu2thepoo
      • 14 years ago

      they’re making a distinction between “age” and “vintage.”

      Think of it like wine: let’s say there was a singularly bad crop of grapes in 2002. Thus, a 2002 pinot grigio will be much more likely to taste bad than a 2001 or 2003. The fact that the 2003 is younger doesn’t mean it would necessarily taste worse than the 2002.

      (keeping in mind that an older wine is generally considered to be more desirable, while a newer hard drive is generally more desirable than an old one.)

      Or think of it like CPUs – the first Pentium 4s, built on the Willamette core, performed relatively worse than the late-model Pentium 3s that had preceded them, and the Northwood Pentium 4s that followed. This would hold true even if you had an old 1.3GHz Tualatin-core Pentium 3 from 2001 and someone gave you a new-in-box, never-used 1.4GHz Willamette-core Pentium 4.

      Getting back to hard drives, let’s say there was some manufacturing process that was introduced 3 years ago. Also, let’s assume that process took a while to master, and so the first drives built on that process were of a …

      • Buub
      • 14 years ago

      No, SMART isn’t close to useless. But you cannot rely on it blindly.

      A drive can fail without issuing a single SMART error. However, the converse does not hold: just because some drives fail without issuing SMART errors doesn’t mean they all do.

      A drive that does issue SMART errors is likely to fail soon afterwards.

      So, if you DO receive SMART errors, replace the drive pronto, but don’t expect them to always predict a drive failure.

    • Voldenuit
    • 14 years ago

    (Conspiracy Theorist):

    Google will blackmail hard drive manufacturers by threatening to name the manufacturer of the failing drives.

    Expect to see lots of Maxtor ads on google. :p

    • ludi
    • 14 years ago

    Oh, and incidentally…IEEE paper format FTW!

    • albundy
    • 14 years ago

    “which manufacturers have the most failure-prone drives”

    Currently living with 4 bad sectors on a 15k SCSI drive for 5 years with active cooling. You tell me that it’s the manufacturer, and I’ll tell you you’re lying. It’s the quality of the drive, no doubt about it.

      • Buub
      • 14 years ago

      Not every individual drive will follow the trends of a large group of drives. They’re useful as trends, not as an indicator for every single drive that exists.

    • WaltC
    • 14 years ago

    Q: How many people whose drives don’t fail after three years bother to post somewhere about it?

    A: Not many if any.

    Q: How many people whose drives fail prior to or at about the three year level post about it somewhere?

    A: A good many of them, else Google would have no data.

    Q: What is the relationship in terms of numbers between the drives that fail at or before the three-year mark and the number of drives that do not?

    A: The Google paper does not provide that information.

    Q: So, what does the Google paper tell us about hard drive failure?

    A: That the older drives are, assuming regular use, the more likely it is that they will fail before or near the three-year mark, and that some hard drives were designed and built better than others, resulting in some drives that have higher failure rates over time than better designed and built hard drives experience.

    Q: Is there anything new here?

    A: No. These things have been commonly known for a long time, which is why everybody knows that when you buy a hard drive you should buy one with at least a three-year warranty. It is also well known that a few years ago some of the hard drive makers decided that the easiest road to profitability was to cut way back on their hard drive warranties, a practice that worked so well that almost all hard drive OEMs have since returned to the longer warranty periods as they have discovered that consumers equate warranties with a manufacturer’s confidence in the quality of his products. This has long been known, too.

    Q: Then why did Google present this paper?

    A: To promote Google, why else?….;)

      • ludi
      • 14 years ago

      Uh, Walt…first, did you catch the fact that Google was presenting findings from their own datacenter records (not compiling third-party data)? — second, did you notice that Google found very little correlation between overall failure rates and drive operating temperatures, something that other (smaller) studies had thought was a significant factor? — and third, that no predictive model could be built based on SMART reporting parameters?

      I thought this was actually quite interesting. I would have liked to see Google disclose the total number of disks, though.

        • DrCR
        • 14 years ago

        Spin Rite (http://www.grc.com/default.htm) has been on my perhaps-buy list for some time now, but as I usually just replace my drives by ear I've never been able to justify spending the cost of a hard drive on the software.

          • ludi
          • 14 years ago

          I’m not familiar with Spinrite. All I can point you to is the Google paper: they found that although the SMART parameters can be a good indicator of a future failure, about half the time a drive can fail without giving any warnings via SMART. Something along the lines of a night watchman who sleeps through half his shifts: he’s reliable, but only when he’s awake.

      • Buub
      • 14 years ago

      LOL Walt, did you even read the paper?

      • tu2thepoo
      • 14 years ago

      You might want to channel some effort into reading comprehension before writing the next know-it-all comment.

        • Proesterchen
        • 14 years ago

        Walt never lets something as unimportant as fact come between him and his preconceived opinion on a topic. So reading the study he rants about would really only be a waste of time.

    • Bensam123
    • 14 years ago

    I wonder where Maxtor sits.

    Then again, you don’t need farms of servers to figure that one out.

      • albundy
      • 14 years ago

      LOL! Wrong question. It’s where it doesn’t sit. All 20 Maxtors in my company failed within two years’ time. I wish I had the funds to put SCSI drives in every workstation, but the budget gave me little to work with. You get what you paid for. I’d love to see the face of the guy a year later after he bought a 750GB Maxtor! LOL!

    • Buub
    • 14 years ago

    My experience is that the most failure-prone manufacturer changes over time. Each company has a bad run or two, then they figure it out and clean up their act. So, I don’t know that’s going to be an informative statistic — it will depend on which model you buy and when you buy it.

    • morphine
    • 14 years ago

    […]

    • SVB
    • 14 years ago

    This “report” is almost useless without the “vintage” of the suspect drives. Google did its users a big disservice by not naming names. What might have forced manufacturers to clean up their acts is now just fodder for the round file.

      • spworley
      • 14 years ago

      They did a service, not a disservice, by releasing interesting information about hard drive failures. We’re just disappointed that it wasn’t even more, especially about the drive and manufacturer failure correlations.
      Google certainly didn’t owe its customers this info at all, so we can’t complain that it doesn’t provide all the details we’d really like to see.

      • radioactive21
      • 14 years ago

      I agree, how is Google doing a disservice? Would it be better to not publish it at all? Information is better than no information, especially this type of analysis.

      There might be other motives and reasons; one already stated is a lawsuit, not to mention future discounts. Google is covering its ass when it chooses not to name companies.

        • SVB
        • 14 years ago

        Just how are they doing a disservice? That depends on what we can do with the info. Would it change which drive we buy? Would it cause us to change drives every 2.5 years? Do the drives that we are currently using last 2 years or 5 years? If, after reading the article, we cannot use any of the information in a useful manner, what is the purpose of reading the article? I did and I think it was a waste of time.

    • PRIME1
    • 14 years ago

    "The highest failure rates occurred on hard drives involved in a Google image search for naked pictures of Rosie O'Donnell" I knew it!

      • wierdo
      • 14 years ago

      Lost twelve of those that way… *sigh*

        • LoneWolf15
        • 14 years ago

        Aren’t you a glutton for punishment in more ways than one.

    • spworley
    • 14 years ago

    There’s one practical piece of advice at the end. It’s common sense, but good to be confirmed.
    If you ever get a hard drive error, even just one, then replace the hard drive entirely ASAP. A drive which shows just one spurious error is 39 times more likely to completely fail in 60 days than a drive which shows no symptom at all.

    • adisor19
    • 14 years ago

    Well well well, looks like 3 years is just the right time for an HD to pop. Aren’t we glad we got Seagate to offer 5 years?

    Adi

      • pureevilmatt
      • 14 years ago

      They note in the study that the age of the drives may not be the issue, but instead the vintage: the year the drive was made. I.e., drives made 4 years ago are likely to still be running, drives made 3 years ago, not so much. Western Digital still offers a 5-year warranty on its drives. So I’m thinking the majority of drives still fail after 5 years.

      Why the glut at 3 years ago? Problems with fluid bearing tech? Who knows… This type of “conclusion” is kinda useless unless they tell us which models from which year are the ones that are failing the most and under what conditions… something they can’t do because they’d piss off the HD vendors. Close to being useful, but no cigar.

      The temperature information is the most useful part of this study.

    • Willard
    • 14 years ago

    […]

      • just brew it!
      • 14 years ago

      They have the data, they’re just not sharing.

      From the paper:

      […]

        • ludi
        • 14 years ago

        Read: “These geese tend to lay golden eggs now and then during volume-discount negotiations, so we’re not going to shoot any of them.”

      • tcunning1
      • 14 years ago

      One word: MAXTOR

        • Generic Ninja
        • 14 years ago

        Where I work we have 540 odd HP machines. They came with about a 50/50 mix of WD and Maxtor 40GB drives. After almost 3 years now we have replaced 120% of the Maxtor drives. I quote over 100% because of the extremely high failure rate of the Maxtors. The replacements we have been getting fail with the same depressing regularity. Currently we ship back over 20 drives a month. Some months we have 1 or 2 WDs in the shipment. It is getting so bad that the current urge is to rubber mallet any new Maxtors just to save the time / trouble of replacing them down the road.
