In early March, Intel gathered industry analysts and members of the tech press in Folsom, California, to talk SSDs. The city is home base for Intel’s Non-Volatile Memory Solutions Group, otherwise known as NSG, which is responsible for the development and testing of Intel solid-state drives. The NSG had a story to tell about how its design and validation work produces extremely reliable SSDs. We got hard numbers on failure rates, details about efforts to make SSDs more dependable, and a peek behind the scenes at the Folsom facility.
We were also let in on a little secret—an easter egg, if you will. Remember the Intel 730 Series we reviewed in February? You know, the one with the skull on it? Well, it also has a bonus that glows under UV lighting. Good thing I have an old black light left over from my misspent youth.
I can’t decide if the logo is cheesy or cool. It’s probably a little of both. Intel is working on other aesthetic flourishes, which certainly can’t hurt on premium-priced products like the 730 Series.
With that revelation out of the way, let’s get down to business.
SSDs store precious personal information, and they increasingly power the datacenters behind popular online services. Reliability is of paramount importance. The trouble is, it’s rarely quantified. SSD makers have traditionally shied away from providing field reliability data, and retailers typically don’t disclose manufacturer-specific return rates. We’re left sifting through user reviews and forum threads to get a sense of which drives are better than others.
The anecdotal evidence spread across those sources suggests Intel SSDs are among the most reliable. And, wouldn’t you know, the snippets of data shared with us seem to agree.
Back in 2010, Intel decided to convert all of its corporate PCs to solid-state storage. The firm deployed its own SSDs, of course, and the failure rate for those drives has thus far been one fifth that of the mechanical drives they replaced. That’s an impressive improvement, especially since the data reaches back four years. SSDs have matured quite a bit over the last couple of generations.
Intel didn’t discuss specific failure rates for the SSDs used in its own PCs. However, it did provide data on over six million drives shipped as part of its business-oriented Pro family. This product line has an annual failure rate target of 0.73%, and Intel has been comfortably under that mark for quite some time.
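As a back-of-the-envelope illustration (our arithmetic, not Intel's), that 0.73% AFR target can be translated into the MTBF figures more commonly quoted on spec sheets, under the usual constant-failure-rate assumption:

```python
# Sketch: relating a 0.73% annual failure rate (AFR) target to an
# equivalent MTBF figure, assuming a constant failure rate over time.
HOURS_PER_YEAR = 8766  # 365.25 days

afr = 0.0073  # 0.73% AFR target for the Pro family
mtbf_hours = HOURS_PER_YEAR / afr  # small-AFR approximation of MTBF

print(f"0.73% AFR is roughly {mtbf_hours:,.0f} hours MTBF")  # ~1.2 million hours
```

The approximation holds because at failure rates this low, the exponential model's `1 - exp(-t/MTBF)` is nearly linear.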
Even the return rates have met Intel’s AFR goal. More impressively, perhaps, the annual failure rate during this period never exceeded 0.4%. For much of 2013, the failure rate was well under 0.2%.
Intel is meeting its targets for datacenter and client SSDs as a whole, too. Venkatesh Vasudevan, Director of NSG Quality, Reliability & Validation, pulled up the following graph during his presentation:
These numbers are based only on 2-3 million drives, Vasudevan said, and the sample size is small and skewed for the datacenter SSDs. Still, failure rates were below 0.2% for pretty much all of 2013, and they were around 0.1% for the last seven months of the year. That blip at nearly 0.8% on the datacenter plot represents some initial teething problems with Intel’s then-new DC-series drives, Vasudevan clarified.
We don’t have similar data from other drive makers, so it’s hard to put Intel’s numbers into perspective. However, Vasudevan pointed out that IHS analyst Ryan Chien told Computerworld in September that he’d seen data suggesting that “client SSD annual failure rates under warranty tend to be around 1.5%.” Based on that information, the assertion by NSG Marketing Director Pete Hazen that Intel SSDs have “best-in-class annual failure rates” seems pretty plausible.
Intel’s field reliability data also lends some credibility to its argument for deploying dual 730 Series drives in striped RAID 0 arrays on high-end desktop PCs. RAID 0 doubles the odds of losing data to a catastrophic drive failure, but with failure rates as low as Intel claims, that doesn’t seem like such a big danger.
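To put numbers on that claim (ours, not Intel's), here is the standard independence calculation for a two-drive striped array, using the ~0.2% AFR cited above:

```python
# Sketch: annualized failure odds for a two-drive RAID 0 array,
# assuming independent drive failures and a 0.2% per-drive AFR.
afr = 0.002  # per-drive annual failure rate (0.2%)

# A striped array loses data if either member drive fails.
array_afr = 1 - (1 - afr) ** 2

print(f"two-drive RAID 0 AFR: {array_afr:.4%}")  # ~0.3996%, roughly double
```

Even doubled, an annual failure rate of about 0.4% is lower than the ~1.5% single-drive figure the IHS analyst attributed to the client SSD market at large.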
Extensive validation efforts
The NSG’s development and quality assurance teams are housed in the same building, making it easy for them to collaborate closely during development. Reliability is “part of the architecture,” Vasudevan told us.
We got a few ingredients for Intel’s special sauce, including background data refresh, a feature that moves data around if flash cells are losing their charge. Randomized data patterns are used to cut down on cell-to-cell interference. Also, failed reads are reattempted at higher voltages in an attempt to extract data from cells that might otherwise produce errors.
Intel has a particularly intimate relationship with the flash that goes into its SSDs. The company is part of a joint NAND manufacturing venture with Micron, and according to Hazen, Intel characterizes “every cell on every die on every package on every SSD” it sells. The SSD’s own controller and firmware are used to test the NAND, he said, and “hundreds of settings” determine how voltage is distributed to the cells. Intel uses its familiarity with NAND vulnerabilities to ensure sufficient error-correction capabilities in the accompanying controller, as well.
The drives do more than just test themselves, of course. Intel’s qualification team gets a crack at new models early in the development process, and it continues to monitor them after finished products have been released into the wild. It even maintains a comprehensive stockpile of every SSD Intel has ever made. If customers encounter issues with older models, Intel has comparable drives on hand for further analysis.
Intel’s internal testing comprises thousands of unique tests on hundreds of different validation platforms for both server and client systems. When the local staff punches out, remote testers from India log in to keep the operation running 24 hours a day.
The main Folsom verification lab has enough capacity to test 2,500 SATA drives and 1,500 PCIe ones simultaneously. The scale is impressive, as are the heat and cooling noise from the racks of drives. Additional testing is done in other Folsom labs, in separate Intel facilities, and also at the factory. All told, Intel’s internal test capacity is about 10,000 drives.
Every aspect of SSD operation is tested. Drives are heated and chilled in massive incubators, and stacks of them are put away for long-term data integrity tests. Even power-loss protection is given a thorough workout. Special equipment simulates numerous disconnect scenarios, including cutting the power and data lines separately and pulling connectors off at an angle instead of with a straight tug.
If drives fail, the Folsom facility is loaded with diagnostic hardware that can be used to investigate the problem. The lab is also equipped to validate the individual components used in SSDs, and it even has the tools required to analyze and cut entire flash dies.
Endurance testing is a fairly straightforward process for client SSDs, but burning through the write cycles of high-endurance enterprise drives is a little more time-consuming. For these SSDs, Intel follows the short-stroke method established by the JEDEC standards group. This approach uses custom firmware to limit writes to a smaller percentage of the flash, forcing those blocks to accumulate write cycles at an accelerated rate. The short-stroke pattern typically targets flash cells in the corners of the dies, though it can also sample from the middle.
Intel’s internal reliability specification for high-endurance SSDs is tighter than the requirements laid out in JEDEC’s JESD219 standard for enterprise drives. The Intel spec accepts fewer functional failures and has a lower uncorrectable bit error rate: 10⁻¹⁷ instead of JEDEC’s 10⁻¹⁶. The Intel spec also adds a provision to account for read disturb, which isn’t covered by the JEDEC spec. Read disturb describes a phenomenon by which reading flash memory can alter the charge—and thus the data—stored in adjacent cells.
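The practical difference between those two error-rate ceilings is easy to illustrate with a quick calculation (ours, for illustration only):

```python
# Sketch: expected uncorrectable read errors per petabyte read, comparing
# JEDEC's enterprise UBER ceiling (1e-16) with Intel's tighter 1e-17 spec.
PB_BITS = 8e15  # one petabyte expressed in bits

for uber in (1e-16, 1e-17):
    expected = PB_BITS * uber
    print(f"UBER {uber:g}: about {expected:.2f} uncorrectable errors per PB read")
```

In other words, at the JEDEC ceiling a drive could average nearly one uncorrectable error per petabyte read; Intel's spec allows only a tenth of that.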
Intel Fellow and Director of Reliability Methods Neal Mielke told us that short-stroke testing revealed errors in the DC S3700 server drive. There were only two errors, he said, and they were within the spec. But Intel still changed the firmware and re-tested the drive to confirm that the problem had been addressed. Mielke said it’s “pretty common” to find issues when doing reliability testing early in product development.
On the next page, we’ll delve into the most interesting element of Mielke’s presentation: Intel’s efforts to safeguard SSDs from cosmic rays. Seriously.
Prime the particle accelerator
Mielke explained that there are two kinds of errors: those that are detected and those that are not. Detected errors are problematic, but they’re a known quantity. They can be addressed by error-correction routines. Silent errors are much more serious. They can result in unreported data corruption, which is especially dangerous for enterprise storage.
As one might expect, Intel’s server customers have extremely tight specifications for silent errors. They allow as few as one silent error per 10²⁵ bits, or one per billion drives—basically zero tolerance, according to Mielke.
For SSDs, Mielke pointed to cosmic rays as the main source of worry for silent errors. In extremely rare circumstances, these high-energy particles can strike integrated circuits, causing bits to flip in the silicon. He added that the flash memory in SSDs is fairly insensitive to these kinds of errors; the NAND is loaded with ECC protection, and enterprise-class drives have end-to-end data protection. The controller and DRAM are more vulnerable, though. Flipped bits in these components can apparently make firmware code execute incorrectly, causing silent errors and other problems.
To combat faulty execution, Intel designed firmware that validates its own behavior. If the self-testing mechanism detects suspicious activity, Mielke said, the drive takes “whatever action is necessary to preserve data integrity.” These actions could entail reporting an uncorrectable error to the host or canceling a non-critical operation like background garbage collection. If a critical operation can’t be verified and data corruption is possible, the drive bricks itself in an act of virtual seppuku, unable to live with the shame of having possibly compromised data integrity. This suicidal behavior is meant for systems with redundant SSD arrays, of course.
Intel vetted its firmware validation scheme by artificially injecting flipped bits into the DC S3700. It also took that drive to the Los Alamos Neutron Science Center and stuck it in the crosshairs of an actual particle accelerator.
The particle beam was 100 million times stronger than typical cosmic rays, so the SSDs only lasted about 10 minutes in it. Adjusting for the beam strength, Intel estimates the DC S3700’s “hang rate” due to cosmic interference to be only 0.029% per year.
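The scale of that acceleration factor is worth spelling out. A quick conversion (our arithmetic, using the figures Intel cited) shows how much field exposure ten minutes in the beam represents:

```python
# Sketch: converting accelerated beam exposure to equivalent field time,
# using the 100-million-fold flux acceleration Intel cited.
ACCELERATION = 1e8   # beam flux relative to typical cosmic-ray flux
beam_minutes = 10    # roughly how long each drive survived in the beam

equivalent_minutes = beam_minutes * ACCELERATION
equivalent_years = equivalent_minutes / (60 * 24 * 365.25)

print(f"10 beam-minutes is roughly {equivalent_years:,.0f} field-years per drive")
```

That works out to around 1,900 drive-years of simulated cosmic-ray exposure per ten-minute run, which is how a ten-minute test can support an annualized hang-rate estimate.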
Of the 86 DC S3700 drives Intel tested, only three rebooted under pressure before finally succumbing. Those reboots didn’t affect data integrity, though. In fact, Mielke said Intel didn’t detect any silent errors in the flash during testing. Based on those results, the company projects a silent error rate of less than 0.0005% per year. The sample size wasn’t large enough to add any more zeros behind the decimal point.
Speaking of small sample sizes, Intel also put a handful of the DC S3700’s competitors into the particle stream. These drives didn’t fare as well. They rebooted more frequently, Mielke claimed, and silent errors were detected on seven of the sixteen subjects. The adjusted hang rates for those unnamed competitors range from 0.077% to 0.133%, with projected silent error rates of 0.029% to 0.255%.
Bombarding drives with high-energy particles might seem excessive, but Mielke told us this sort of testing is commonly done with CPUs and DRAM. He wasn’t aware of any other SSD makers who subject their products to this treatment, though.
Along with information on reliability and verification testing, we got the usual rundown on Intel’s existing products. The company also dropped some hints about future ones.
NVM Express is the next big thing on the SSD front. Unlike the existing AHCI spec, which was designed for hard drives, NVMe is tailored specifically for SSDs. The new host interface should deliver substantially lower latencies than AHCI—as much as a 6X reduction, according to NSG General Manager Rob Crooke.
Given the responsiveness of current AHCI SSDs, further latency reductions may be difficult for consumers to notice. However, smaller reductions can be very important for datacenters. An Equinix whitepaper quoted by Marketing Director Pete Hazen asserts that, at Amazon, “every 100ms delay costs 1% of sales.” The same whitepaper says Microsoft determined that a 500-ms delay in page load times translates to a revenue reduction of 1.2% per user. Those percentages may sound small, but the dollar amounts are huge when one considers the volume of customers each company serves.
Intel’s first NVM Express SSD will be revealed in the next few months. The drive will be a high-performance product aimed at datacenters, and it seems to be based on a third-generation architecture similar to that of the DC S3700 series. This product will come in two flavors: a traditional 2.5″ form factor and a PCI Express add-in card. The 2.5″ variant will use the SFF-8639 connector, which is similar to SATA Express’ physical interface but supports up to four PCIe lanes instead of two.
Server SSDs appear to be a growing focus for Intel. The company doesn’t want to overreach, Hazen said, so it’s trying to apply its resources “in markets that are the most impactful.” Joining the race to the bottom in the consumer SSD market isn’t part of the plan. Intel is content to let other players duke it out over the slim margins available in that space.
On the server front, Hazen claimed Intel can’t build SSDs big enough to meet datacenter demands. Scaling NAND down to smaller lithography nodes is becoming more difficult, but during his presentation, Crooke pointed to 3D NAND as a possible solution. Stacking cells vertically enables higher densities without shrinking the process geometry, and he suggested this approach could accelerate Moore’s Law, at least as it applies to flash memory.
In the near term, 3D NAND may be the best bet for improving flash storage densities. However, it might not be the next big thing. Crooke characterized NAND-based SSDs as a disruptive technology, and he said the industry is now in a “reoptimizing stage.” Intel is working on new disruptive tech, he added, and it should be ready in the “next few years.” We don’t know what that disruption will entail, but if it’s anything like the transition from mechanical drives to SSDs, the storage industry could get a lot more interesting. Again.