Extensive validation efforts
The NSG's development and quality assurance teams are housed in the same building, making it easy for them to collaborate closely during development. Reliability is "part of the architecture," Vasudevan told us.
We got a few ingredients for Intel's special sauce, including background data refresh, a feature that moves data around if flash cells are losing their charge. Randomized data patterns are used to cut down on the cell-to-cell interference. Also, failed reads are reattempted at higher voltages in an attempt to extract data from cells that might otherwise produce errors.
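To make the read-retry idea concrete, here is a minimal sketch in Python. It is not Intel's firmware, which isn't public. The function names, the voltage values, and the use of `None` to stand in for a read that error correction couldn't fix are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence


def read_with_retry(
    read_page: Callable[[float], Optional[bytes]],
    retry_voltages: Sequence[float],
) -> Optional[bytes]:
    """Try each read voltage from lowest to highest until one read succeeds."""
    for voltage in sorted(retry_voltages):
        data = read_page(voltage)
        if data is not None:  # None stands in for "ECC could not correct the page"
            return data
    return None  # every retry failed; the caller would report an uncorrectable error


def make_weak_page(payload: bytes, min_voltage: float) -> Callable[[float], Optional[bytes]]:
    """Hypothetical page that only reads back cleanly at or above min_voltage."""
    def read_page(voltage: float) -> Optional[bytes]:
        return payload if voltage >= min_voltage else None
    return read_page


# Hypothetical voltages chosen only to exercise the retry loop.
weak_page = make_weak_page(b"data", min_voltage=3.2)
result = read_with_retry(weak_page, [3.0, 3.1, 3.2, 3.3])
```

In this toy model, the first two attempts fail and the third returns the payload. A real controller would decide success based on its error-correction logic rather than a simple threshold.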
Intel has a particularly intimate relationship with the flash that goes into its SSDs. The company is part of a joint NAND manufacturing venture with Micron, and according to Hazen, Intel characterizes "every cell on every die on every package on every SSD" it sells. The SSD's own controller and firmware are used to test the NAND, he said, and "hundreds of settings" determine how voltage is distributed to the cells. Intel uses its familiarity with NAND vulnerabilities to ensure sufficient error-correction capabilities in the accompanying controller, as well.
The drives do more than just test themselves, of course. Intel's qualification team gets a crack at new models early in the development process, and it continues to monitor them after finished products have been released into the wild. It even maintains a comprehensive stockpile of every SSD Intel has ever made. If customers encounter issues with older models, Intel has comparable drives on hand for further analysis.
Intel's internal testing comprises thousands of unique tests on hundreds of different validation platforms for both server and client systems. When the local staff punches out, remote testers from India log in to keep the operation running 24 hours a day.
The main Folsom verification lab has enough capacity to test 2,500 SATA drives and 1,500 PCIe ones simultaneously. The scale is impressive, as are the heat and cooling noise from the racks of drives. Additional testing is done in other Folsom labs, in separate Intel facilities, and also at the factory. All told, Intel's internal test capacity is about 10,000 drives.
Every aspect of SSD operation is tested. Drives are heated and chilled in massive incubators, and stacks of them are put away for long-term data integrity tests. Even power-loss protection is given a thorough workout. Special equipment simulates numerous disconnect scenarios, including cutting the power and data lines separately and pulling connectors off at an angle instead of with a straight tug.
If drives fail, the Folsom facility is loaded with diagnostic hardware that can be used to investigate the problem. The lab is also equipped to validate the individual components used in SSDs, and it even has the tools required to analyze and cut entire flash dies.
Endurance testing is a fairly straightforward process for client SSDs, but burning through the write cycles of high-endurance enterprise drives is a little more time-consuming. For these SSDs, Intel follows the short-stroke method established by the JEDEC standards group. This approach uses custom firmware to limit writes to a smaller percentage of the flash, forcing those blocks to accumulate write cycles at an accelerated rate. The short-stroke pattern typically targets flash cells in the corners of the dies, though it can also sample from the middle.
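A rough back-of-the-envelope calculation shows why short-stroking saves time. The sketch below assumes perfectly even wear within the targeted region and ignores write amplification. The capacity, cycle count, write speed, and 5% fraction are hypothetical figures, not numbers from Intel or JEDEC.

```python
def days_to_reach_cycles(capacity_gb, target_cycles, write_mb_per_s, fraction=1.0):
    """Days of continuous writing for the targeted flash to hit target_cycles.

    Simplified model: ignores write amplification and assumes the writes
    are spread evenly across the targeted fraction of the flash.
    """
    total_mb = capacity_gb * 1000 * fraction * target_cycles
    seconds = total_mb / write_mb_per_s
    return seconds / 86400  # seconds per day


# Hypothetical drive: 800 GB, 10,000 write cycles, 400 MB/s sustained writes.
full_drive_days = days_to_reach_cycles(800, 10_000, 400)
short_stroke_days = days_to_reach_cycles(800, 10_000, 400, fraction=0.05)
```

Under these assumptions, cycling the entire drive takes roughly 231 days, while confining writes to 5% of the flash cuts that to under 12 days. The time scales linearly with the fraction of flash being exercised.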
Intel's internal reliability specification for high-endurance SSDs is tighter than the requirements laid out in JEDEC's JESD219 standard for enterprise drives. The Intel spec accepts fewer functional failures and has a lower uncorrectable bit error rate: 10^-17 instead of a mere 10^-16. The Intel spec also adds a provision to account for read disturb, which isn't covered by the JEDEC spec. Read disturb describes a phenomenon by which reading flash memory can alter the charge, and thus the data, stored in adjacent cells.
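To put those error rates in perspective, an uncorrectable bit error rate can be converted into the average amount of data read per uncorrectable error. The short Python sketch below does that conversion. It assumes the rate is expressed as errors per bit read and uses decimal petabytes (10^15 bytes); the function name is made up for this example.

```python
def petabytes_per_error(uber):
    """Average petabytes read per uncorrectable error at a given bit error rate."""
    bits_per_error = 1 / uber
    bytes_per_error = bits_per_error / 8
    return bytes_per_error / 1e15  # decimal petabytes


jedec_pb = petabytes_per_error(1e-16)  # the JEDEC enterprise figure cited above
intel_pb = petabytes_per_error(1e-17)  # Intel's internal figure cited above
```

By this math, the JEDEC rate works out to about one uncorrectable error per 1.25 petabytes read, while Intel's tighter target stretches that to about 12.5 petabytes.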
Intel Fellow and Director of Reliability Methods Neal Mielke told us that short-stroke testing revealed errors in the DC S3700 server drive. There were only two errors, he said, and they were within the spec. But Intel still changed the firmware and re-tested the drive to confirm that the problem had been addressed. Mielke said it's "pretty common" to find issues when doing reliability testing early in product development.
On the next page, we'll delve into the most interesting element of Mielke's presentation: Intel's efforts to safeguard SSDs from cosmic rays. Seriously.