
The SSD Endurance Experiment: They're all dead

This is the end, beautiful friend
— 10:22 AM on March 12, 2015

I never thought this whole tech journalism gig would turn me into a mass murderer. Yet here I am, with the blood of six SSDs on my hands, and that's not even the half of it. You see, these were not crimes of passion or rage, nor were they products of accident. More than 18 months ago, I vowed to push all six drives to their bitter ends. I didn't do so in the name of god or country or even self-defense, either. I did it just to watch them die.

Technically, I'm also a torturer—or at least an enhanced interrogator. Instead of offering a quick and painless death, I slowly squeezed out every last drop of life with a relentless stream of writes far more demanding than anything the SSDs would face in a typical PC. To make matters worse, I exploited their suffering by chronicling the entire process online.

Today, that story draws to a close with the final chapter in the SSD Endurance Experiment. The last two survivors met their doom on the road to 2.5PB, joining four fallen comrades who expired earlier. It's time to honor the dead and reflect on what we've learned from all the carnage.

Experiment with intent to kill
Before we get to the end, we have to start at the beginning. If you're unfamiliar with the experiment, this introductory article provides a comprehensive look at our test systems and methods. We'll only indulge in a quick rundown of the details here.

Our solid-state death march was designed to test the limited write tolerance inherent to all NAND flash memory. This breed of non-volatile storage retains data by trapping electrons inside nanoscale memory cells. A process called tunneling is used to move electrons in and out of the cells, but the back-and-forth traffic erodes the physical structure of the cell, leading to breaches that can render it useless.

Electrons also get stuck in the cell wall, where their associated negative charges complicate the process of reading and writing data. This accumulation of stray electrons eventually compromises the cell's ability to retain data reliably—and to access it quickly. Three-bit TLC NAND differentiates between more values within the cell's possible voltage range, making it more sensitive to electron build-up than two-bit MLC NAND.
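To put rough numbers on that sensitivity, here's a minimal sketch of how the count of voltage levels grows with bits per cell. It assumes the levels are spaced evenly across a normalized voltage window, which is a simplification for illustration rather than a description of how any particular NAND is tuned. The function names are ours.

```python
def voltage_states(bits_per_cell: int) -> int:
    """Number of distinct charge levels a cell must tell apart."""
    return 2 ** bits_per_cell


def relative_margin(bits_per_cell: int) -> float:
    """Gap between adjacent levels as a fraction of the voltage window.

    Assumes evenly spaced levels (an illustrative simplification).
    """
    return 1.0 / (voltage_states(bits_per_cell) - 1)


for name, bits in (("MLC", 2), ("TLC", 3)):
    states = voltage_states(bits)
    margin = relative_margin(bits)
    print(f"{name}: {states} states, gap of {margin:.3f} of the window")
```

Under this toy model, a TLC cell has to resolve twice as many levels as an MLC cell within the same window, so the same amount of stray charge eats up a larger share of the gap between neighboring values.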


Even with wear-leveling algorithms spreading writes evenly across the flash, all cells will eventually fail or become unfit for duty. When that happens, they're retired and replaced with flash allocated from the SSD's overprovisioned area. This spare NAND ensures that the drive's user-accessible capacity is unaffected by the war of attrition ravaging its cells.
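The retire-and-replace bookkeeping can be sketched with a toy model. This is not how any real controller's firmware works; the class, its fields, and the block counts are hypothetical, and the only behavior modeled is that user capacity stays constant for as long as spares remain.

```python
class SparePool:
    """Toy model of an SSD's overprovisioned area (hypothetical, not real firmware)."""

    def __init__(self, user_blocks: int, spare_blocks: int) -> None:
        self.user_blocks = user_blocks    # capacity visible to the user
        self.spare_blocks = spare_blocks  # reserve held back for replacements
        self.retired = 0                  # blocks taken out of service so far

    def retire_block(self) -> bool:
        """Retire one worn block; return True if a spare covered the loss."""
        self.retired += 1
        if self.spare_blocks > 0:
            self.spare_blocks -= 1
            return True
        # Reserve exhausted: the drive can no longer compensate.
        return False


pool = SparePool(user_blocks=1000, spare_blocks=3)
for _ in range(4):
    covered = pool.retire_block()
    print(f"retired={pool.retired} spares={pool.spare_blocks} covered={covered}")
```

The fourth retirement is the first one the pool can't cover, which is the point where the casualties exceed the drive's ability to compensate.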

The casualties will eventually exceed the drive's ability to compensate, leaving unanswered questions. How many writes does it take? What happens to your data at the end? Do SSDs lose any performance or reliability as the writes pile up?

This experiment sought to find out by writing a near-constant stream of data to Corsair's Neutron GTX 240GB, Intel's 335 Series 240GB, Kingston's HyperX 3K 240GB, Samsung's 840 Series 250GB, and Samsung's 840 Pro 256GB.

The first lesson came quickly. All of the drives surpassed their official endurance specifications by writing hundreds of terabytes without issue. Delivering on the manufacturer-guaranteed write tolerance wouldn't normally be cause for celebration, but the scale makes this achievement important. Most PC users, myself included, write no more than a few terabytes per year. Even 100TB is far more endurance than the typical consumer needs.
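For a sense of scale, here's a back-of-the-envelope calculation of how long a given write total would take to reach. The 3TB-per-year figure is our assumption standing in for "a few terabytes per year"; swap in your own number.

```python
def years_to_reach(total_writes_tb: float, writes_per_year_tb: float) -> float:
    """Years needed to accumulate total_writes_tb at a steady yearly rate."""
    if writes_per_year_tb <= 0:
        raise ValueError("writes_per_year_tb must be positive")
    return total_writes_tb / writes_per_year_tb


# Assumed workload: "a few terabytes per year" taken as 3TB (illustrative only).
ASSUMED_TB_PER_YEAR = 3.0

for milestone_tb in (100, 1000):  # 100TB, and 1PB expressed in TB
    years = years_to_reach(milestone_tb, ASSUMED_TB_PER_YEAR)
    print(f"{milestone_tb}TB at {ASSUMED_TB_PER_YEAR}TB/year: about {years:.0f} years")
```

At that assumed pace, even 100TB works out to decades of typical desktop use.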

Clear evidence of flash wear appeared after 200TB of writes, when the Samsung 840 Series started logging reallocated sectors. As the only TLC candidate in the bunch, this drive was expected to show the first cracks. The 840 Series didn't encounter actual problems until 300TB, when it failed a hash check during the setup for an unpowered data retention test. The drive went on to pass that test and continue writing, but it recorded a rash of uncorrectable errors around the same time. Uncorrectable errors can compromise data integrity and system stability, so we recommend taking drives out of service the moment they appear.

After receiving a black mark on its permanent record, the 840 Series sailed smoothly up to 800TB. But it suffered another spate of uncorrectable errors on the way to 900TB, and it died without warning before reaching a petabyte. Although the 840 Series had retired thousands of flash blocks up until that point, the SMART attributes suggested plenty of reserves remained. The drive may have been brought down by a sudden surge of flash failures too severe to counteract. In any case, the final blow was fatal; our attempts to recover data from the drive failed.

Few expected a TLC SSD to last that long—and fewer still would have bet on it outlasting two MLC-based drives. Intel's 335 Series failed much earlier, though to be fair, it pulled the trigger itself. The drive's media wear indicator ran out shortly after 700TB, signaling that the NAND's write tolerance had been exceeded. Intel doesn't have confidence in the drive at that point, so the 335 Series is designed to shift into read-only mode and then to brick itself when the power is cycled. Despite suffering just one reallocated sector, our sample dutifully followed the script. Data was accessible until a reboot prompted the drive to swallow its virtual cyanide pill.

The reaper came for the Kingston HyperX 3K next. As with the 335 Series, the SMART data's declining life indicator foretold the drive's death and triggered messages warning that the end was nigh. The flash held up nicely through 600TB, but it suffered a boatload of failures and reallocated sectors leading up to 728TB, after which it refused to write. At least the data was still accessible at the end. The HyperX didn't respond after a reboot, though. Kingston tells us the drive won't boot if its NAND reserve has been exhausted.

The next failure occurred after the 840 Series bit the dust. Corsair's Neutron GTX was practically flawless through 1.1PB—that's petabytes—but it posted thousands of reallocated sectors and produced numerous warning messages over the following 100TB. The drive was still functional after 1.2PB of writes, and its SMART attributes suggested adequate flash remained in reserve. However, the Neutron failed to answer the bell after a subsequent reboot. As with the other corpses, the drive wasn't even detected, nixing any possibility of easy data recovery.

And then came the calm. The remaining two SSDs carried on past the 2PB threshold before meeting their ultimate ends. On the next page, we'll examine their last moments in greater detail.