Facebook SSD reliability study shows early burnouts

A team made up of Facebook engineers and Carnegie Mellon researchers has published a study of flash memory failure rates over a four-year period, using data collected from SSDs in Facebook's servers. The paper doesn't tie its statistics to particular makes and models of drives, but the results are still informative.

Distribution of uncorrectable error counts across SSDs

The team collected data straight from the drives' hardware rather than relying on OS reporting, so the figures reflect the amount of data actually written to the flash cells. Much like the findings in Google's 2007 paper on mechanical hard drives, the SSD study reveals that failures tend to occur early, followed by a period of smooth sailing until the end of the drive's useful life.

However, there's a key difference: the team identified what they called an "early detection period." When SSDs are still young, their initial usage leads the controller to identify unreliable flash cells right away. This is further corroborated by two observations: higher-capacity SSDs see roughly the same percentage of failed flash cells in this initial period as smaller ones, and the early-detection window is shorter in smaller drives.

SSD lifecycle failure pattern

The study also shows that temperature has a direct impact on flash memory reliability: drives that run at lower temperatures or use more aggressive throttling mechanisms display comparatively fewer cell failures. Interestingly, the amount of data read seems to play no part at all in determining a cell's lifetime.
