
The SSD Endurance Experiment: Casualties on the way to a petabyte

And then there were three
— 9:25 AM on June 16, 2014

I feel for the subjects of our SSD Endurance Experiment. They didn't volunteer for this life. These consumer-grade drives could have ended up in a corporate desktop or grandma's laptop or even an enthusiast's PC. They could have spent their days saving spreadsheets and caching Internet files and occasionally making space for new Steam downloads. Instead, they ended up in our labs, on the receiving end of a torturous torrent of writes designed to kill them.

Talk about a rough life.

We started with six SSDs: the Corsair Neutron GTX 240GB, Intel 335 Series 240GB, Samsung 840 Series 250GB, Samsung 840 Pro 256GB, and two Kingston HyperX 3K 240GB. They all exceeded their endurance specifications early on, successfully writing hundreds of terabytes without issue. That's a heck of a lot of data, and certainly more than most folks will write in the lifetimes of their drives.

The last time we checked in, the SSDs had just passed the 600TB mark. They were all functional, but the 840 Series was burning through its TLC cells at a steady pace, and even some of the MLC drives were starting to show cracks. We've now written over a petabyte, and only half of the SSDs remain. Three drives failed at different points—and in different ways—before reaching the 1PB milestone. We've performed autopsies on the casualties and our usual battery of tests on the survivors, and there is much to report.

If you haven't been following along with our endurance experiment, this introductory article is a good starting point. It spends far more time detailing our test methods and system configurations than the brief primer we'll provide here.

The premise is straightforward. Flash memory has limited endurance, so we're writing data to a stack of SSDs to see how much they can take. We're checking health and performance at regular intervals, and we're not going to stop until all the drives are dead.

The root cause of NAND's limited endurance is a little complicated. Flash stores information by trapping electrons inside nanoscale cells; the associated voltage defines the data. The "tunneling" process used to move electrons in and out of the cell is destructive, not only eroding the physical structure of the cell wall, but also causing stray electrons to become stuck in it. These errant electrons impart a negative charge of their own, reducing the range of voltages available to represent data. The narrower that range becomes, the more difficult it is for SSDs to perform writes and to verify their validity.

Electron build-up is especially problematic at higher bit densities. MLC NAND needs to differentiate between four possible values within the flash's shrinking voltage window, but TLC NAND must track twice as many. It's more sensitive to normal flash wear as a result, which is why our 840 Series has been burning through more of its flash than the MLC-based drives in the experiment.
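To put rough numbers on that, here is a minimal sketch of how the count of distinct charge states grows with bits per cell. The 4.0 V window and the even spacing between states are illustrative assumptions, not measurements from any drive in the experiment.

```python
def levels_per_cell(bits_per_cell: int) -> int:
    """Number of distinct charge states a cell must resolve to store this many bits."""
    return 2 ** bits_per_cell


def level_spacing(window_volts: float, bits_per_cell: int) -> float:
    """Gap between adjacent states if they were spread evenly across the window."""
    return window_volts / (levels_per_cell(bits_per_cell) - 1)


# Hypothetical 4.0 V window, chosen only to make the comparison concrete.
WINDOW_VOLTS = 4.0

for name, bits in (("MLC", 2), ("TLC", 3)):
    states = levels_per_cell(bits)
    gap = level_spacing(WINDOW_VOLTS, bits)
    print(f"{name}: {states} states, about {gap:.2f} V between neighbors")
```

Under these assumptions, TLC has less than half the margin between neighboring states that MLC does, so the same amount of window shrinkage hurts it sooner.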

Continued write cycling eventually causes cells to become unreliable, at which point those cells are retired and replaced by flash harvested from the drive's "spare area." This reserve of fresh flash area ensures the SSD maintains its user-accessible storage capacity even if cell failures incapacitate some of the NAND. Of course, eventually that reserve becomes exhausted and the drive will begin to fail.
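The retire-and-replace bookkeeping can be pictured with a toy model like the one below. The class name, block counts, and method are hypothetical and exist only for illustration; real controllers track far more state than this.

```python
class SparePool:
    """Toy model of a drive that swaps worn-out blocks for spares."""

    def __init__(self, user_blocks: int, spare_blocks: int):
        self.user_blocks = user_blocks    # capacity the user sees
        self.spares_left = spare_blocks   # fresh flash held in reserve
        self.retired = 0                  # blocks taken out of service

    def retire_block(self) -> bool:
        """Retire one unreliable block.

        Returns True if a spare covered the loss, so user-visible
        capacity is unchanged, or False once the reserve is exhausted.
        """
        self.retired += 1
        if self.spares_left > 0:
            self.spares_left -= 1
            return True
        return False


pool = SparePool(user_blocks=1000, spare_blocks=2)
for _ in range(3):
    covered = pool.retire_block()
    print("capacity preserved" if covered else "reserve exhausted")
```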

Now that we've laid the groundwork, it's time to inspect the casualties. The first failures were a bit of a surprise, yet they didn't arrive unannounced. When we checked on our lab rats after 700TB of writes, we found SMART messages warning that the Intel 335 Series and one of the Kingston HyperX 3K units were at risk of failure. Both drives are based on MLC NAND, so we didn't expect them to falter before our lone TLC contender.

Although the failure-prone drives were fully functional at 700TB, neither one made it to 800TB. The HyperX 3K expired at 728TB, while the 335 Series croaked at 750TB. We'll deal with the Intel first, since its demise was a little more straightforward.

The 335 Series' flash was almost entirely intact when the SMART warning hit. Only one reallocated sector had been logged up until that point, and it appeared way back at the 300TB mark, so it didn't inspire the warning. Instead, the slow decline of the media wearout indicator (MWI) was responsible.

This SMART attribute starts at 100 and decreases as the NAND's rated write tolerance is exhausted. It's completely unaffected by the number of reallocated sectors, and it's been ticking down steadily since the experiment began. The remaining life estimate in Intel's SSD Toolbox utility is based on the MWI, and so is the general health assessment offered by HD Sentinel, the third-party tool we've been using to grab raw SMART data.
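A wear indicator of this kind might be computed roughly as follows. This is a sketch that assumes a simple linear decline from 100 to 1 as average erase cycles approach the rated limit; the function name and the 3,000-cycle rating are illustrative, not Intel's firmware logic.

```python
def media_wearout_indicator(avg_erase_cycles: float, rated_cycles: float) -> int:
    """Estimate a normalized wear value from 100 (new) down to 1 (rating used up).

    Assumes a linear decline. The reallocated sector count is
    deliberately not an input.
    """
    if rated_cycles <= 0:
        raise ValueError("rated_cycles must be positive")
    used = min(max(avg_erase_cycles / rated_cycles, 0.0), 1.0)
    # int() truncates the fractional part, so the reported value can
    # understate how much rated life is actually left.
    return max(1, int(100 - 99 * used))


# Hypothetical rating of 3,000 cycles, used only for illustration.
for cycles in (0, 1500, 3000, 4500):
    print(cycles, media_wearout_indicator(cycles, rated_cycles=3000))
```

The value bottoms out at one rather than zero, and it stays there if the flash keeps cycling past its rating.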

Our journey to 700TB drove the MWI all the way down to one, which is supposed to put the 335 Series in a read-only, "logical disable" state. The flash is deemed unreliable at this point, and in typically conservative fashion, Intel doesn't want to perform a write that isn't guaranteed. The SMART readout might have truncated a decimal place, though, because we were still able to run our usual performance tests and kick off the next 100TB of writes.

The 335 Series was fine until about 50TB into that run, when write errors started appearing in Anvil's Storage Utilities, the application tasked with flooding the SSDs with writes. The Anvil app froze, though we were able to load it again and extract the performance log stored on the drive. We'll take a closer look at those results in a moment.

Oddly, the 335 Series wouldn't return SMART information after the Anvil write errors appeared. The attributes were inaccessible in both third-party tools and Intel's own utility, which indicated that the SMART feature was disabled. After a reboot, the SSD disappeared completely from the Intel software. It was still detected by the storage driver, but only as an inaccessible, 0GB SATA device.

According to Intel, this end-of-life behavior generally matches what's supposed to happen. The write errors suggest the 335 Series had entered read-only mode. When the power is cycled in this state, a sort of self-destruct mechanism is triggered, rendering the drive unresponsive. Intel really doesn't want its client SSDs to be used after the flash has exceeded its lifetime spec. The firm's enterprise drives are designed to remain in logical disable mode after the MWI bottoms out, regardless of whether the power is cycled. Those server-focused SSDs will still brick themselves if data integrity can't be verified, though.
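The end-of-life behavior Intel describes amounts to a small state machine. The sketch below models it under our reading of that description; the class, state names, and methods are hypothetical, and applying the integrity rule to every drive is an assumption.

```python
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    READ_ONLY = "logical disable (read-only)"
    BRICKED = "unresponsive"


class EndOfLifeModel:
    """Toy model of the described end-of-life states, not actual firmware."""

    def __init__(self, enterprise: bool = False):
        self.enterprise = enterprise
        self.state = State.NORMAL

    def wear_exhausted(self) -> None:
        """The wear indicator has bottomed out: stop accepting writes."""
        if self.state is State.NORMAL:
            self.state = State.READ_ONLY

    def write(self) -> bool:
        """Return True if a write would be accepted in the current state."""
        return self.state is State.NORMAL

    def power_cycle(self) -> None:
        """Client drives go unresponsive if power-cycled while read-only."""
        if self.state is State.READ_ONLY and not self.enterprise:
            self.state = State.BRICKED

    def integrity_failure(self) -> None:
        """Assumption: any drive bricks when data integrity can't be verified."""
        self.state = State.BRICKED


client, server = EndOfLifeModel(), EndOfLifeModel(enterprise=True)
for drive in (client, server):
    drive.wear_exhausted()
    drive.power_cycle()
print(client.state.value)
print(server.state.value)
```

In this model the client drive ends up unresponsive after the power cycle, while the enterprise drive stays readable until an integrity failure occurs.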

SMART functionality is supposed to persist in logical disable mode, so it's unclear what happened to our test subject there. Intel says attempting writes in the read-only state could cause problems, so the fact that Anvil kept trying to push data onto the drive may have been a factor.

All things considered, the 335 Series died in a reasonably graceful, predictable manner. SMART warnings popped up long before write errors occurred, providing plenty of time—and additional write headroom—for users to prepare. On the next page, we'll explore what happened to the HyperX 3K.