Behind the scenes with Intel’s SSD division

In early March, Intel gathered industry analysts and members of the tech press in Folsom, California to talk SSDs. The city is home base for Intel’s Non-Volatile Memory Solutions Group, otherwise known as NSG, which is responsible for the development and testing of Intel solid-state drives. The NSG had a story to tell about how its design and validation work produces extremely reliable SSDs. We got hard numbers on failure rates, details about efforts to make SSDs more dependable, and a peek behind the scenes at the Folsom facility.

We were also let in on a little secret—an easter egg, if you will. Remember the Intel 730 Series we reviewed in February? You know, the one with the skull on it? Well, it also has a bonus that glows under UV lighting. Good thing I have an old black light left over from my misspent youth.


I can’t decide if the logo is cheesy or cool. It’s probably a little of both. Intel is working on other aesthetic flourishes, which certainly can’t hurt on premium-priced products like the 730 Series.

With that revelation out of the way, let’s get down to business.

SSDs store precious personal information, and they increasingly power the datacenters behind popular online services. Reliability is of paramount importance. The trouble is, it’s rarely quantified. SSD makers have traditionally shied away from providing field reliability data, and retailers typically don’t disclose manufacturer-specific return rates. We’re left sifting through user reviews and forum threads to get a sense of which drives are better than others.

The anecdotal evidence spread across those sources suggests Intel SSDs are among the most reliable. And, wouldn’t you know, the snippets of data shared with us seem to agree.

Back in 2010, Intel decided to convert all of its corporate PCs to solid-state storage. The firm deployed its own SSDs, of course, and the failure rate for those drives has thus far been one fifth that of the mechanical drives they replaced. That's an impressive improvement, especially since the data reaches back four years; SSDs have matured quite a bit over the last couple of generations.

Intel didn’t discuss specific failure rates for the SSDs used in its own PCs. However, it did provide data on over six million drives shipped as part of its business-oriented Pro family. This product line has an annual failure rate target of 0.73%, and Intel has been comfortably under that mark for quite some time.

Source: Intel

Even the return rates have met Intel’s AFR goal. More impressively, perhaps, the annual failure rate during this period never exceeded 0.4%. For much of 2013, the failure rate was well under 0.2%.
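
The AFR arithmetic is simple enough to sketch. As a back-of-envelope illustration (the figures below are hypothetical, chosen only to show the calculation, not Intel's actual field data), the rate is just failures divided by accumulated drive-years:

```python
# Back-of-envelope annualized failure rate (AFR) from field data.
def annual_failure_rate(failures, drive_days):
    """Failures per accumulated drive-year of operation."""
    drive_years = drive_days / 365.0
    return failures / drive_years

# Hypothetical figures for illustration only (not Intel's field data):
# 6 million drives observed for 90 days each, with 2,000 failures.
afr = annual_failure_rate(2_000, 6_000_000 * 90)
print(f"AFR: {afr:.2%}")  # AFR: 0.14%, comfortably under a 0.73% target
```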

Intel is meeting its targets for datacenter and client SSDs as a whole, too. Venkatesh Vasudevan, Director of NSG Quality, Reliability & Validation, pulled up the following graph during his presentation:

Source: Intel

These numbers are based only on 2-3 million drives, Vasudevan said, and the sample size is small and skewed for the datacenter SSDs. Still, failure rates were below 0.2% for pretty much all of 2013, and they were around 0.1% for the last seven months of the year. That blip at nearly 0.8% on the datacenter plot represents some initial teething problems with Intel’s then-new DC-series drives, Vasudevan clarified.

We don’t have similar data from other drive makers, so it’s hard to put Intel’s numbers into perspective. However, Vasudevan pointed out that IHS analyst Ryan Chien told Computerworld in September that he’d seen data suggesting that “client SSD annual failure rates under warranty tend to be around 1.5%.” Based on that information, the assertion by NSG Marketing Director Pete Hazen that Intel SSDs have “best-in-class annual failure rates” seems pretty plausible.

Intel’s field reliability data also lends some credibility to its argument for deploying dual 730 Series drives in striped RAID 0 arrays on high-end desktop PCs. RAID 0 doubles the odds of losing data to a catastrophic drive failure, but with failure rates as low as Intel claims, that doesn’t seem like such a big danger.
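
The math behind that claim is straightforward. Assuming independent failures, which is optimistic for two drives sharing a power supply and workload, a striped pair roughly doubles the single-drive odds:

```python
# Probability that a striped pair loses data in a year, assuming
# independent drive failures (optimistic for drives sharing a power
# supply and workload).
def raid0_failure_odds(drive_afr, n_drives=2):
    return 1 - (1 - drive_afr) ** n_drives

single = 0.002                     # 0.2% AFR, in line with Intel's field data
dual = raid0_failure_odds(single)
print(f"{dual:.4%}")               # 0.3996%, essentially double, still tiny
```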

Extensive validation efforts

The NSG’s development and quality assurance teams are housed in the same building, making it easy for them to collaborate closely during development. Reliability is “part of the architecture,” Vasudevan told us.

We got a few ingredients for Intel’s special sauce, including background data refresh, a feature that moves data around if flash cells are losing their charge. Randomized data patterns are used to cut down on the cell-to-cell interference. Also, failed reads are reattempted at higher voltages in an attempt to extract data from cells that might otherwise produce errors.
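
The read-retry idea can be sketched in a few lines. This is a hypothetical illustration rather than Intel's firmware; `read_page` and `ecc_ok` are stand-ins for controller internals:

```python
# Hypothetical sketch of voltage-shifted read retry. Real firmware talks
# to controller-specific voltage registers and a hardware ECC engine;
# read_page and ecc_ok here are illustrative stand-ins.
RETRY_OFFSETS = [0, 1, 2, 3]  # stepped reference-voltage offsets

def read_with_retry(page, read_page, ecc_ok):
    for offset in RETRY_OFFSETS:
        data = read_page(page, vref_offset=offset)
        if ecc_ok(data):
            return data
    raise IOError("uncorrectable after all retry levels")

# Simulated flash whose cells have drifted: the default read fails ECC,
# but a shifted reference voltage recovers the data.
def make_sim(good_at_offset):
    def read_page(page, vref_offset=0):
        return b"payload" if vref_offset >= good_at_offset else b"garbage"
    def ecc_ok(data):
        return data == b"payload"
    return read_page, ecc_ok

rp, ok = make_sim(good_at_offset=2)
print(read_with_retry(0, rp, ok))  # b'payload'
```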

Intel has a particularly intimate relationship with the flash that goes into its SSDs. The company is part of a joint NAND manufacturing venture with Micron, and according to Hazen, Intel characterizes “every cell on every die on every package on every SSD” it sells. The SSD’s own controller and firmware are used to test the NAND, he said, and “hundreds of settings” determine how voltage is distributed to the cells. Intel uses its familiarity with NAND vulnerabilities to ensure sufficient error-correction capabilities in the accompanying controller, as well.

The drives do more than just test themselves, of course. Intel’s qualification team gets a crack at new models early in the development process, and it continues to monitor them after finished products have been released into the wild. It even maintains a comprehensive stockpile of every SSD Intel has ever made. If customers encounter issues with older models, Intel has comparable drives on hand for further analysis.

One of many racks of SSDs in the Folsom lab. Source: Intel

Intel’s internal testing comprises thousands of unique tests on hundreds of different validation platforms for both server and client systems. When the local staff punches out, remote testers from India log in to keep the operation running 24 hours a day.

The main Folsom verification lab has enough capacity to test 2,500 SATA drives and 1,500 PCIe ones simultaneously. The scale is impressive, as are the heat and cooling noise from the racks of drives. Additional testing is done in other Folsom labs, in separate Intel facilities, and also at the factory. All told, Intel’s internal test capacity is about 10,000 drives.

Drives baking during a retention test. Source: Intel

Every aspect of SSD operation is tested. Drives are heated and chilled in massive incubators, and stacks of them are put away for long-term data integrity tests. Even power-loss protection is given a thorough workout. Special equipment simulates numerous disconnect scenarios, including cutting the power and data lines separately and pulling connectors off at an angle instead of with a straight tug.

If drives fail, the Folsom facility is loaded with diagnostic hardware that can be used to investigate the problem. The lab is also equipped to validate the individual components used in SSDs, and it even has the tools required to analyze and cut entire flash dies.

Racks of validation test systems. Source: Intel

Endurance testing is a fairly straightforward process for client SSDs, but burning through the write cycles of high-endurance enterprise drives is a little more time-consuming. For these SSDs, Intel follows the short-stroke method established by the JEDEC standards group. This approach uses custom firmware to limit writes to a smaller percentage of the flash, forcing those blocks to accumulate write cycles at an accelerated rate. The short-stroke pattern typically targets flash cells in the corners of the dies, though it can also sample from the middle.
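
The acceleration this buys is easy to quantify. A rough sketch with hypothetical drive parameters (ideal wear leveling assumed; these are not Intel's actual test numbers):

```python
# How much time short-stroking saves, with hypothetical drive parameters
# (ideal wear leveling assumed; not Intel's actual test numbers).
def days_to_exhaust(pe_cycles, capacity_gb, writes_gb_per_day, fraction=1.0):
    """Days needed to burn through the rated P/E cycles when writes are
    confined to `fraction` of the flash."""
    return pe_cycles * capacity_gb * fraction / writes_gb_per_day

full = days_to_exhaust(10_000, 800, 4_000)            # write the whole drive
stroked = days_to_exhaust(10_000, 800, 4_000, 0.02)   # 2% short-stroke
print(round(full), round(stroked))  # 2000 40: years shrink to weeks
```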

Intel’s internal reliability specification for high-endurance SSDs is tighter than the requirements laid out in JEDEC’s JESD219 standard for enterprise drives. The Intel spec accepts fewer functional failures and has a lower uncorrectable bit error rate: 10⁻¹⁷ instead of a mere 10⁻¹⁶. The Intel spec also adds a provision to account for read disturb, which isn’t covered by the JEDEC spec. Read disturb describes a phenomenon by which reading flash memory can alter the charge—and thus the data—stored in adjacent cells.
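
Those exponents translate into concrete expectations: one order of magnitude in UBER is the difference between roughly one uncorrectable error per petabyte read and one per ten petabytes. A quick sanity check:

```python
# What an order of magnitude in UBER means in practice: expected
# uncorrectable read errors over a given volume of reads.
def expected_uncorrectable(bits_read, uber):
    return bits_read * uber

petabyte_bits = 8e15                                   # 1 PB of reads
jedec = expected_uncorrectable(petabyte_bits, 1e-16)   # JESD219 floor
intel = expected_uncorrectable(petabyte_bits, 1e-17)   # tighter Intel spec
print(f"{jedec:.2f} vs {intel:.2f}")                   # 0.80 vs 0.08 per PB
```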

Intel Fellow and Director of Reliability Methods Neal Mielke told us that short-stroke testing revealed errors in the DC S3700 server drive. There were only two errors, he said, and they were within the spec. But Intel still changed the firmware and re-tested the drive to confirm that the problem had been addressed. Mielke said it’s “pretty common” to find issues when doing reliability testing early in product development.

On the next page, we’ll delve into the most interesting element of Mielke’s presentation: Intel’s efforts to safeguard SSDs from cosmic rays. Seriously.

Prime the particle accelerator

Mielke explained that there are two kinds of errors: those that are detected and those that are not. Detected errors are problematic, but they’re a known quantity. They can be addressed by error-correction routines. Silent errors are much more serious. They can result in unreported data corruption, which is especially dangerous for enterprise storage.

As one might expect, Intel’s server customers have extremely tight specifications for silent errors. They allow as few as one silent error per 10²⁵ bits, or one per billion drives—basically zero tolerance, according to Mielke.

For SSDs, Mielke pointed to cosmic rays as the main source of worry for silent errors. In extremely rare circumstances, these high-energy particles can strike integrated circuits, causing bits to flip in the silicon. He added that the flash memory in SSDs is fairly insensitive to these kinds of errors; the NAND is loaded with ECC protection, and enterprise-class drives have end-to-end data protection. The controller and DRAM are more vulnerable, though. Flipped bits in these components can apparently make firmware code execute incorrectly, causing silent errors and other problems.

To combat faulty execution, Intel designed firmware that validates its own behavior. If the self-testing mechanism detects suspicious activity, Mielke said, the drive takes “whatever action is necessary to preserve data integrity.” These actions could entail reporting an uncorrectable error to the host or canceling a non-critical operation like background garbage collection. If a critical operation can’t be verified and data corruption is possible, the drive bricks itself in an act of virtual seppuku, unable to live with the shame of having possibly compromised data integrity. This suicidal behavior is meant for systems with redundant SSD arrays, of course.

Intel vetted its firmware validation scheme by artificially injecting flipped bits into the DC S3700. It also took the drive to the Los Alamos Neutron Science Center and stuck it in the crosshairs of an actual particle accelerator.

Forget frickin’ lasers, we’ve got a particle accelerator. Source: Intel

The particle beam was 100 million times stronger than typical cosmic rays, so the SSDs only lasted about 10 minutes in it. Adjusting for the beam strength, Intel estimates the DC S3700’s “hang rate” due to cosmic interference to be only 0.029% per year.
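
That adjustment is essentially a time-scaling exercise. A sketch of the conversion (the beam strength and duration are as Intel described; the simple linear scaling is our assumption):

```python
# Converting accelerated beam time into field-equivalent exposure.
# Beam strength and duration are as Intel described; the simple
# linear scaling is our assumption.
def field_equivalent_years(test_minutes, acceleration):
    minutes_per_year = 365.25 * 24 * 60
    return test_minutes * acceleration / minutes_per_year

years = field_equivalent_years(10, 100_000_000)
print(f"{years:,.0f}")  # 1,901: each drive saw ~1,900 years of cosmic rays
```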

Of the 86 DC S3700 drives Intel tested, only three rebooted under pressure before finally succumbing. Those reboots didn’t affect data integrity, though. In fact, Mielke said Intel didn’t detect any silent errors in the flash during testing. Based on those results, the company projects a silent error rate of less than 0.0005% per year. The sample size wasn’t large enough to add any more zeros behind the decimal point.

Speaking of small sample sizes, Intel also put a handful of the DC S3700’s competitors into the particle stream. These drives didn’t fare as well. They rebooted more frequently, Mielke claimed, and silent errors were detected on seven of the sixteen subjects. The adjusted hang rates for those unnamed competitors range from 0.077% to 0.133%, with projected silent error rates of 0.029% to 0.255%.

Bombarding drives with high-energy particles might seem excessive, but Mielke told us this sort of testing is commonly done with CPUs and DRAM. He wasn’t aware of any other SSD makers who subject their products to this treatment, though.

We didn’t get any pictures from Los Alamos, but this one looks mildly radioactive. Source: Intel

Looking forward

Along with information on reliability and verification testing, we got the usual rundown on Intel’s existing products. The company also dropped some hints about future ones.

NVM Express is the next big thing on the SSD front. Unlike the existing AHCI spec, which was designed for hard drives, NVMe is tailored specifically for SSDs. The new host interface should deliver substantially lower latencies than AHCI—as much as a 6X reduction, according to NSG General Manager Rob Crooke.

Given the responsiveness of current AHCI SSDs, further latency reductions may be difficult for consumers to notice. However, smaller reductions can be very important for datacenters. An Equinix whitepaper quoted by Marketing Director Pete Hazen asserts that, at Amazon, “every 100ms delay costs 1% of sales.” The same whitepaper says Microsoft determined that a 500-ms delay in page load times translates to a revenue reduction of 1.2% per user. Those percentages may be low, but the dollar amounts are huge when one considers the volume of customers each company serves.

Intel’s first NVM Express SSD will be revealed in the next few months. The drive will be a high-performance product aimed at datacenters, and it seems to be based on a third-generation architecture similar to that of the DC S3700 series. This product will be available in two flavors: in a traditional 2.5″ form factor and as a PCI Express add-in card. The 2.5″ variant will use the SFF-8639 connector, which is similar to SATA Express’ physical interface but supports up to four PCIe lanes instead of two.

Server SSDs appear to be a growing focus for Intel. The company doesn’t want to overreach, Hazen said, so it’s trying to apply its resources “in markets that are the most impactful.” Joining the race to the bottom in the consumer SSD market isn’t part of the plan. Intel is content to let other players duke it out over the slim margins available in that space.

On the server front, Hazen claimed Intel can’t build SSDs big enough to meet datacenter demands. Scaling NAND down to smaller lithography nodes is becoming more difficult, but during his presentation, Crooke pointed to 3D NAND as a possible solution. Stacking cells vertically enables higher densities without shrinking the process geometry, and he suggested this approach could accelerate Moore’s Law, at least as it applies to flash memory.

In the near term, 3D NAND may be the best bet for improving flash storage densities. However, it might not be the next big thing. Crooke characterized NAND-based SSDs as a disruptive technology, and he said the industry is now in a “reoptimizing stage.” Intel is working on new disruptive tech, he added, and it should be ready in the “next few years.” We don’t know what that disruption will entail, but if it’s anything like the transition from mechanical drives to SSDs, the storage industry could get a lot more interesting. Again.

Comments closed
    • meerkt
    • 6 years ago

    Any chance to get more details from Intel on long term retention behavior?

    • ronch
    • 6 years ago

    One can’t read this article without feeling admiration for Intel afterwards. Like I’ve said before, I’d much rather buy an SSD from one of the bigger boys rather than those who just buy the parts off the shelves, puts them together, and slaps a sticker on with their name on it. Could be a little cheaper, but why take the risk? I just hope the other big players such as Samsung and Crucial do the same level of methodical product testing and long-term evaluation to further improve the quality, reliability and performance of their SSDs.

    • ronch
    • 6 years ago

    Cool. My next CPU, chipset and SSD will all be from Intel. Screw AMD and Samsung that don’t test their products as stringently.

    /sarcasm? Perhaps.

    • Bauxite
    • 6 years ago

    Just shows that good SSDs >> any HDDs in every way except cost. Not a good sign for backers of spinning rust.

    (don’t say “capacity”, 1TB on a freakin’ msata proves they win big on density too, and nobody that cares about capacity only uses 1 drive)

      • MadManOriginal
      • 6 years ago

      Capacity

      …per dollar.

      • smilingcrow
      • 6 years ago

      “nobody that cares about capacity only uses 1 drive”

      Nobody at all, ever!

      • Firestarter
      • 6 years ago

      time spent waiting is costly too

    • Wild Thing
    • 6 years ago

    Ooof that’s some high quality hardware pr0n right there.
    Thanks for sharing. 🙂

    • Waco
    • 6 years ago

    Ha, I didn’t know they tested the 3700 here at LA. That’s pretty awesome. 🙂

    • HisDivineOrder
    • 6 years ago

    I read articles like this and I think, “If only OCZ had cared to QA their drives properly, they would still be around today.”

      • albundy
      • 6 years ago

      If only OCZ had the R&D personnel and unlimited cash that intel has to QA their drives properly, they would still be around today. That goes for all SSD makers.

    • slaimus
    • 6 years ago

    [quote<]The Intel spec accepts fewer functional failures and has a lower uncorrectable bit error rate: 1^-17 instead of a mere 1^-16[/quote<] One to any power is still one--do you mean mean 10^-17?

      • Dissonance
      • 6 years ago

      Fixed. Good catch.

        • indeego
        • 6 years ago

        ECC needed for your exponentials. 🙂

          • MadManOriginal
          • 6 years ago

          This is GCC – Gerbil Checking & Correction!

    • Flatland_Spider
    • 6 years ago

    As a colleague once noted, “You can always tell [the manufacturers] who tests to failure.”

      • PhilipMcc
      • 6 years ago

      Agree. Sometimes failure comes more quickly than expected.

    • balanarahul
    • 6 years ago

    You are late. PCPer had this like, ages ago. Your article was more detailed though.
    It was always known that the NAND die itself was never to blame for a failed SSD. The problem always happens to be with other factors.

    On another note, am I the only one who reads “Remember me” below the password field in Prophet’s voice??? (They call me; Prophet. Remember me.)

      • Khali
      • 6 years ago

      Yes, your the only one that does that, or that will admit to it any way. :p

      • Waco
      • 6 years ago

      A few weeks longer for a quality article is not “ages ago”.

        • balanarahul
        • 6 years ago

        I said “like ages ago”. I am so proud to be the most down voted commentator on this page. 😛

          • stdRaichu
          • 6 years ago

          Like a few weeks ago is less like ages ago than like ages ago. So just like like start liking the dislikes.

          Punctuation intentionally omitted 🙂

    • captaintrav
    • 6 years ago

    IMPRESSIVE. We’ve long known that Intel SSDs are pretty much the kings of reliability. This article actually makes me want one though – their commitment to quality seems to border on the obsessive.

      • continuum
      • 6 years ago

      Given the scope of failures out there from other makers (heck, even Intel isn’t invulnerable), I’ll take the opposite!

      • Flatland_Spider
      • 6 years ago

      It’s normal for stuff in a datacenter.

      • HisDivineOrder
      • 6 years ago

      If they keep making their own controllers and not relying on Sandforce, I’ll be considering them in the future. If they go back to vetting third parties, I’ll keep ignoring them.

        • MadManOriginal
        • 6 years ago

        They do more than vet third parties, they make their own firmware afaik, both for Sandforce and Marvell controllers. In addition, it’s not just about the controller, but the NAND itself as well.

          • HisDivineOrder
          • 6 years ago

          And yet they cannot fix something fundamentally broken like Sandforce. They can just make it tolerable.

            • Waco
            • 6 years ago

            Fundamentally broken? What world do you live on?

    • Ninjitsu
    • 6 years ago

    For a company that bombards its products with high-energy particles, it’s incredible that they couldn’t get Haswell’s IHS right.

      • smilingcrow
      • 6 years ago

      Right for their shareholders – saved money.
      Wrong for over-clockers – Boo hiss.

        • jihadjoe
        • 6 years ago

        Possibly bad for their shareholders either way.

        A company the size of Intel is just about profits/savings as they are about their public image, because of how it ties into their stock price. Apparently the OCers griped bad enough about the TIM thing that they’re going to [url=https://techreport.com/review/26189/intel-to-renew-commitment-to-desktop-pcs-with-a-slew-of-new-cpus<]fix it[/url<] in the upcoming Haswell refresh.

          • HisDivineOrder
          • 6 years ago

          I’ll believe it when someone rips off the spreader and says, “OMG! They fixed it!”

            • Waco
            • 6 years ago

            Except that the CPU die might come with the heatspreader…

            • balanarahul
            • 6 years ago

            But isn’t that intentional?? The CPU die coming off the PCB with the IHS is proof that they are using solder and that temps are gonna be lower…

            • Waco
            • 6 years ago

            That was kinda the point…

            • Bauxite
            • 6 years ago

            Someone on one of the enthusiast forums (extremesystems I think) destroyed an ES Ivy-Bridge E before it launched. They were specifically testing to see if socket 2011 was getting “cost-sized” like the consumer line did, thankfully the answer was no.

        • Ninjitsu
        • 6 years ago

        Well yeah, it is pretty obvious that it was deliberately let through.

    • anotherengineer
    • 6 years ago

    Interesting stuff.

    “Bombarding drives with high-energy particles might seem excessive, but Mielke told us this sort of testing is commonly done with CPUs and DRAM. He wasn’t aware of any other SSD makers who subject their products to this treatment, though.”

    I am not aware of any other SSD makers who have the budgets of Intel though. And probably one of the reasons Intel drives are typically more money.

      • Helmore
      • 6 years ago

      Samsung maybe?

      • captaintrav
      • 6 years ago

      I’m pretty sure OCZ didn’t do this. 😉

        • smilingcrow
        • 6 years ago

        What, any testing at all? 🙂

          • Terra_Nocuus
          • 6 years ago

          gotta cut prices somehow!

        • juampa_valve_rde
        • 6 years ago

        You didnt get it cap, OCZ did so much testing that the bobarded ssd where after tested for working released on market to test end user durability and the psychological effects of hardware malfunction on the owner, all to get the best product to the consumers, right? haha

    • lilbuddhaman
    • 6 years ago

    [quote<]We didn't get any pictures from Los Alamos, but this one looks mildly radioactive. [/quote<] SensibleChuckle.gif Now how do I prevent cosmic rays?

      • GodsMadClown
      • 6 years ago

      I have nothing constructive to add, but I was amused to read about how cosmic rays are a problem even for the CCD cameras on the Mars rover.

      [url<]http://www.slate.com/blogs/bad_astronomy/2014/04/08/curiosity_photo_light_seen_on_mars_is_a_camera_artifact_not_a_real_one.html[/url<]

      • Spittie
      • 6 years ago

      Tinfoil hats?

        • smilingcrow
        • 6 years ago

        Just make sure you buy an Intel hat as they have been thoroughly tested against aliens and the NSA.

      • Flatland_Spider
      • 6 years ago

      You got to get rid of that pesky sun and other cosmic occurrences. 🙂

      • indeego
      • 6 years ago

      [url<]http://xkcd.com/radiation/[/url<] You don't.

        • Chrispy_
        • 6 years ago

        Sure you do.
        Snuff out all the stars and then wait for a few gazillion years for the background radiation to die down.
        EASY.

      • anotherengineer
      • 6 years ago

      Deep deep underground.

      [url<]http://en.wikipedia.org/wiki/Sudbury_Neutrino_Observatory[/url<]
