Single page Print

The SSD Endurance Experiment: Only two remain after 1.5PB

Another one bites the dust

You won't believe how much data can be written to modern SSDs. No, seriously. Our ongoing SSD Endurance Experiment has demonstrated that some consumer-grade drives can withstand over a petabyte of writes before burning out. That's a hyperbole-worthy total for a class of products typically rated to survive only a few hundred terabytes at most.

Our experiment began with the Corsair Neutron GTX 240GB, Intel 335 Series 240GB, Samsung 840 Series 250GB, and Samsung 840 Pro 256GB, plus two Kingston HyperX 3K 240GB drives. They all surpassed their endurance specifications, but the 335 Series, 840 Series, and one of the HyperX drives failed to reach the petabyte mark. The remainder pressed on toward 1.5PB, and two of them made it relatively unscathed. That journey claimed one more victim, though—and you won't believe which one.

Seriously, you won't. But I'll stop now.

To celebrate the latest milestone, we've checked the health of the survivors, put them through another data retention test, and compiled performance results from the last 500TB. We've also taken a closer look at the last throes of our latest casualty.

If you're unfamiliar with our endurance experiment, this introductory article is recommended reading. It provides far more details on our subjects, methods, and test rigs than we'll revisit today. Here are the basics: SSDs are based on NAND flash memory with limited endurance, so we're writing an unrelenting stream of data to a stack of drives to see what happens. We pause every 100TB to collect health and performance data, which we then turn into stunningly beautiful graphs. Ahem.

Understanding NAND's limited lifespan requires some familiarity with how NAND works. This non-volatile memory stores data by trapping electrons inside miniscule cells built with process geometries as small as 16 nm. The cells are walled off by an insulating oxide layer, but applying voltage causes electrons to tunnel through that barrier. Electrons are drawn into the cell when data is written and out of it when data is erased.

The catch—and there always is one—is that the tunneling process erodes the insulator's ability to hold electrons within the cell. Stray electrons also get caught in the oxide layer, generating a baseline negative charge that narrows the voltage range available to represent data. The narrower that range gets, the more difficult it becomes to write reliably. Cells eventually wear to the point that they're no longer viable, after which they're retired and replaced with spare flash from the SSD's overprovisioned area.

Since NAND wear is tied to the voltage range used to define data, it's highly sensitive to the bit density of the cells. Three-bit TLC NAND must differentiate between eight possible values within that limited range, while its two-bit MLC counterpart only has to contend with four values. TLC-based SSDs typically have lower endurance as a result.

As we've learned in the experiment thus far, flash wear causes SSDs to perish in different ways. The Intel 335 Series is designed to check out voluntarily after a predetermined number of writes. That drive dutifully bricked itself after 750TB, even though its flash was mostly intact at the time. The first HyperX failed a little earlier, at 728TB, under much different conditions. It suffering rash of reallocated sectors, programming failures, and erase failures before its ultimate demise.

Counter-intuitively, the TLC-based Samsung 840 Series outlasted those MLC casualties to write over 900TB before failing suddenly. But its reallocated sectors started piling up after just a few hundred terabytes of writes, confirming TLC's more fragile nature. The 840 Series also suffered hundreds of uncorrectable errors split between an initial spate at 300TB and second accumulation near the end of the road.

So, what about the latest death?

Much to our surprise, the Neutron GTX failed next. It had logged only three reallocated sectors through 1.1PB of writes, but SMART warnings appeared soon after, cautioning that the raw read error rate had exceeded the acceptable threshold. The drive still made it to 1.2PB and through our usual round of performance benchmarks. However, its SMART attributes showed a huge spike in reallocated sectors:

Over the last 100TB, the Neutron compensated for over 3400 sector failures. And that was it. When we readied the SSDs for the next leg, our test rig refused to boot with the Neutron connected. The same thing happened with a couple of other machines, and hot-plugging the drive into a running system didn't help. Although the Neutron was detected, the Windows disk manager stalled when we tried to access it.

Despite the early warnings of impending doom, the Neutron's exit didn't go entirely by the book. The drive is supposed to keep writing until its flash reserves are used up, after which it should slip into a persistent read-only state to preserve user data. As far as we can tell, our sample never made it to read-only mode. It was partitioned and loaded with 10GB of data before the power cycle that rendered the drive unresponsive, and that partition and data remain inaccessible.

We've asked Corsair to clarify the Neutron GTX's sector size and how much of the overprovisioned area is available to replace retired flash. Those details should give us a better sense of whether the drive ran out of spare NAND or was struck down by something else. For what it's worth, the other SMART attributes suggest the Neutron may have had some flash in reserve.

The SMART data has two values for reallocated sectors: one that counts up from zero and another that ticks down from 256. The latter still hadn't bottomed out after 1.2PB, and neither had the life-left estimate. Hmmm.

Although the graph shows the raw read error rate plummeting toward the end, the depiction isn't entirely accurate. That attribute was already at its lowest value after 1.108PB of writes, which is when we noticed the first SMART error. We may need to grab SMART info more regularly in future endurance tests.

Now that we've tended to the dead, it's time to check in on the living...