Facebook SSD reliability study shows early burnouts

A team of Facebook engineers and Carnegie Mellon researchers has published a study covering flash memory failure rates over a four-year period, with data collected from SSDs in Facebook's servers. The paper doesn't tie the statistics gathered to particular makes and models of drives, but the results are still informative.

Distribution of uncorrectable error counts across SSDs

The team collected data straight from the drives' hardware rather than relying on OS reporting, thus measuring the amount of data that was actually written to the flash cells. Somewhat parallel to the findings in Google's 2007 paper on mechanical hard drives, the SSD study reveals that failures tend to occur early, followed by a period of smooth sailing up until the end of the drive's useful life.

However, there's a key difference: the team identified what they called an "early detection period." When SSDs are still young, their initial usage leads the controller to immediately identify which flash cells are unreliable. This is further corroborated by the fact that higher-capacity SSDs see roughly the same percentage of failed flash cells in this initial period, and that the window for this early detection is shorter in the smaller drives.

SSD lifecycle failure pattern

The study also shows that temperature has a direct impact on flash memory reliability, with drives running at lower temperatures or using more aggressive throttling mechanisms displaying comparatively fewer cell failures. Interestingly, the amount of data read seems to play no part at all when determining a cell's lifetime.

Comments closed
    • kuttan
    • 5 years ago

    Since the study started 4 years ago, SSD reliability then and now is totally different and can’t really be compared. Modern-day SSDs are much more reliable than what was available 4 years back.

      • MadManOriginal
      • 5 years ago

      This is really talking about flash cell reliability though, not overall drive reliability. A lot of SSD problems are not in the flash itself.

      • melgross
      • 5 years ago

      Somewhat more reliable, not much more reliable.

    • Welch
    • 5 years ago

    Too many people editing their posts on Facebook = too many writes = dead SSD.

    Yeah it would be nice to know if Facebook was pulling a Backblaze and using consumer drives in their servers and then claiming foul when they were failing prematurely.

    Not to mention, these are drives dating back 4+ years ago based on the start of the study. I think any of us enthusiasts can easily recall the numerous issues with reliability, power states and firmware almost all manufacturers had around that time period. I feel like the last year or so has been the big jump up in quality as far as SSDs go, minus the occasional (See Often) screw up from Samsung.

      • UnfriendlyFire
      • 5 years ago

      Your criticism is equivalent to Youtube restricting videos to highly compressed 144p to reduce server load and bandwidth usage.

      Or EA skimping on server upgrades for the SimCity 2013 launch. Boy was that a fiasco.

    • chµck
    • 5 years ago

    Has anyone heard of the journal it was published in?

      • Nevermind
      • 5 years ago

      Has anyone heard of Carnegie Mellon University? Yeah I heard about it down by the seashore..

        • derFunkenstein
        • 5 years ago

        Carnegie melons are the best melons.

        • w76
        • 5 years ago

        That’s the university, not the journal. 😉 SIGMETRICS was the journal.

      • davidbowser
      • 5 years ago

      I can’t even get the paper to look at it. Getting a server timeout.

    • albundy
    • 5 years ago

    were they OCZ ssd’s?

      • Deanjo
      • 5 years ago

      Actually the paper mentions:

      "This work is supported in part by Samsung and the Intel Science"

      So take it for what it's worth and draw your own conclusions.

      • NeelyCam
      • 5 years ago

      Seagate.

    • FireGryphon
    • 5 years ago

    Those graphs are very hard to read. Can someone explain what I’m looking at?

      • wibeasley
      • 5 years ago

      I believe the first graph shows the *cumulative* errors (ie, “given x, what proportion of the drives have failed”), and the second shows the failure *rate* (ie, expected percentage of errors for a given window of time).

      (I didn’t read the source, but it looks like they’re also making a distinction between error and failures.)

      • meerkt
      • 5 years ago

      The axes are labeled. Isn’t it clear enough?

      In the first graph the X axis represents the span of all SSD NUMBERS, with SSD number 1.0 on the right and SSD 0 on the left. In the second graph a high Y is high failure RATE and low is low FAILURE rate. The X goes from low flash USAGE to high.

      Perhaps this combined graph will help:
      http://i.imgur.com/G4Zblvx.png

      • ronch
      • 5 years ago

      The way I understand it, is that as the article states, new drives detect bad cells early on. That would be in section 1 of the graph. Past that, section 2, there’s smooth sailing. Past that again, section 3, is when those good cells start dying similar to how they continuously fall off the cliff in TR’s SSD endurance experiments.

    • jihadjoe
    • 5 years ago

    Aren’t hard drives (and most other computer equipment) sort of the same? Most drive failures I’ve seen tend to happen when the drive is either fairly new or already very old.

    Seems like if your stuff passes that infant mortality stage it’ll serve long and well.

    • Ninjitsu
    • 5 years ago

    I understood the conclusions but not the graphs… :/

    "Interestingly, the amount of data read seems to play no part at all when determining a cell's lifetime."

    That's pretty obvious, I think...

      • funko
      • 5 years ago

      Not a very well written study. The design isn’t of high caliber either. Hard to figure out any take-away points from reading this study. They use language suggesting cause and effect, when they should only go as far as saying that there are associations. (for ex: failure of one drive, and failure of another drive)

      They did not normalize for makes or models across any of their analyses, so this is more akin to a meta-analysis than anything, and even then it hardly qualifies.

      In the end it should be something we can casually look at for fun, but I definitely don’t think anyone should be making business decisions or any serious modeling on this data.

    • xeridea
    • 5 years ago

    I first saw the title as I was starting my day and I saw “Facebook” and “Burnout”. I was thinking of their stocks, or the type of person that can result from social media addiction. Good article otherwise.

    • chuckula
    • 5 years ago

    My my, hey hey.
    SSDs are here to stay.

    It’s better to burn out,
    than to fade away.

    My my, hey hey.

      • ChicagoDave
      • 5 years ago

      +1 internets for Neil Young reference

      • anotherengineer
      • 5 years ago

      +1 for Canadian Singer reference 😉

        • Deanjo
        • 5 years ago

        I prefer Kurgan’s delivery

        https://www.youtube.com/watch?v=a0aERAJ18_s

          • Nevermind
          • 5 years ago

          Kurgan’s emo.

      • Anovoca
      • 5 years ago

      -3 for botching the air guitar solo.

      • Nevermind
      • 5 years ago

      Neil Young called, he says he prefers Jeb Bush using his song..
