AMD offers its take on GPU packages, failures

Facts and rumors about failing Nvidia chips have been spewing from all sides for months now. What’s AMD’s take on the issue, and why aren’t we seeing similar failures from its products? We recently had a chat with Neil McLellan, AMD’s director of packaging and interconnect technologies, who offered his insight and opinions about these matters.

To understand where AMD is coming from, one must go back a few years to the former ATI. Prompted by problems with packaging and interconnect materials in consoles as well as the European Union’s Restriction of Hazardous Substances (RoHS) directive, ATI hired McLellan and went about rethinking its chip packaging strategy. In 2005, the RoHS directive required GPU packages to start connecting to their host boards with lead-free solder balls. ATI also took that opportunity to replace the high-lead solder bumps with so-called eutectic bumps. As you’ll see in the diagram below, those solder bumps connect the silicon GPU die to the rest of the package:

A diagram of a GPU package. Source: AMD.

Why the change? High-lead bumps use 90% lead and 10% tin, while eutectic bumps switch that ratio to 37% lead and 63% tin. High-lead bumps can handle more current, but AMD thinks they’re more prone to fatigue and need “comprehensive reliability engineering to be used successfully.” To illustrate the fatigue issue, McLellan evoked a soda can: the tab will probably stay on if you bend it up and down slightly a hundred times, but it’ll likely pop off if you bend it all the way two or three times. Similarly, high-lead bumps can fail because of repetitive heating and cooling. That’s because the silicon GPU die and package substrate (see the diagram above) have different thermal expansion coefficients—2 parts per million/°C for the silicon and 30 ppm/°C for the substrate, McLellan said—which puts a significant stress on the bumps.
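
For a rough sense of scale, the mismatch can be worked out with simple arithmetic. The sketch below uses the two expansion coefficients McLellan quoted; the 7 mm distance from the die center and the 60°C temperature swing are assumptions picked purely for illustration, and the calculation treats die and substrate as free to expand, ignoring underfill and everything else that constrains a real package.

```python
# Illustrative arithmetic only. The CTE values are the ones quoted in the
# article; the distance and temperature swing used below are assumptions.
CTE_SILICON_PPM = 2.0     # ppm/°C, as quoted by McLellan
CTE_SUBSTRATE_PPM = 30.0  # ppm/°C, as quoted by McLellan


def mismatch_um(distance_mm: float, delta_t_c: float) -> float:
    """Unconstrained differential expansion, in micrometres, between die and
    substrate at a given distance from the die center for one temperature swing."""
    delta_cte_per_c = (CTE_SUBSTRATE_PPM - CTE_SILICON_PPM) * 1e-6
    return delta_cte_per_c * delta_t_c * distance_mm * 1000.0  # mm -> µm


if __name__ == "__main__":
    # Hypothetical bump 7 mm from the die center, 60 °C swing per power cycle.
    print(f"{mismatch_um(7.0, 60.0):.2f} µm of relative movement per cycle")
```

The result grows linearly with both distance and temperature swing, so under this simple model bumps farthest from the die center see the most movement each time the chip heats up and cools down.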

Eutectic bumps are easier to work with in AMD’s view, but they have their downsides, too. They have lower tolerance for high current densities than their high-lead counterparts, so bumping up the amperage can render them useless by way of electromigration. Because different parts of a chip can have different power requirements, McLellan said a given chip might have mean power delivery of 200mA with some bumps getting 50mA and others receiving 600mA. To avoid stressing outliers excessively, AMD’s engineers apply a redistribution layer—essentially a thick metal layer—between the bumps and the die in order to even out power delivery.
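
To get a feel for why those outliers matter, here is a rough sketch based on Black's equation, a standard empirical model in which electromigration lifetime scales as 1/J^n, where J is current density. Neither AMD nor McLellan cited this model. The exponent n = 2, identical bump cross-sections, and identical temperatures are all assumptions made for illustration; only the currents come from the figures quoted above.

```python
def relative_lifetime(current_ma: float, reference_ma: float, n: float = 2.0) -> float:
    """Electromigration lifetime of a bump relative to one carrying reference_ma.

    Follows the current-density term of Black's equation (lifetime ~ 1 / J**n).
    Assumes equal bump cross-section and temperature, so current density scales
    directly with current. The default n = 2 is an assumed, illustrative value.
    """
    return (reference_ma / current_ma) ** n


if __name__ == "__main__":
    mean_ma = 200.0  # mean per-bump current quoted in the article
    for bump_ma in (50.0, 200.0, 600.0):  # low outlier, mean, high outlier
        ratio = relative_lifetime(bump_ma, mean_ma)
        print(f"{bump_ma:.0f} mA bump: {ratio:.2f}x the lifetime of a {mean_ma:.0f} mA bump")
```

Under these assumptions, the 600mA bump would wear out roughly nine times faster than one carrying the mean current, which is the kind of imbalance a redistribution layer is meant to flatten.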

Keeping control over chip packaging is easier said than done, though. McLellan noted that both AMD and Nvidia rely on a number of third-party firms (like SPIL and ASE) to do the dirty work of packaging chips, and different firms can use different processes and materials. He went on to suggest that those companies didn’t mind following AMD’s guidelines for material usage and package design, but they declined to take the fall if any problems occurred. In essence, AMD could be on its own if it runs into packaging problems—although McLellan said that hasn’t happened with the new packaging design so far.

On the upside, AMD says using eutectic bumps makes chips cheaper to produce, and they also increase yields. AMD states plainly in a related presentation, “There is no financial reason not to make the move to a more reliable package.”

What about Nvidia? McLellan was a little vague in his criticism of AMD’s rival, talking down the company for not paying closer attention to packaging and (allegedly) not caring a whole lot. However, he believes Nvidia’s mobile graphics parts are failing because they use high-lead bumps and are running into the soda-can problem. This problem has shown up in notebooks because those systems get turned on and off a lot, but McLellan said plainly that folks who power-cycle Nvidia-powered desktops regularly should start seeing the same issues eventually.

To complicate things further, AMD says the RoHS directive will start requiring chipmakers to remove lead from both solder balls and solder bumps in 2010—and some of AMD’s customers are requesting the change sooner than that. McLellan said that switch will introduce “an entirely new problem, which turns out to be quite challenging,” although he didn’t get into specifics. He did, however, mention that AMD has been working on the issue for the past 18 months, has some “great ideas” and has “done a lot of work.” In his view, Nvidia has likely been spending the same time trying to fix problems in current package designs.

Of course, this is all a little one-sided. We’ve been trying to get Nvidia to comment on AMD’s little spiel for well over a week now. While the company seemed willing, we still haven’t received a statement. Stay tuned.

Comments closed
    • sigher
    • 12 years ago

    If you have an issue then fix it, dope/dot the tin/lead or use another material altogether.
    Incidentally that’s the advantage of the EU directives, if they are forced to find a fix they will, if it’s easier to not look they’d do that.

    • lolento
    • 12 years ago

    I have read the internal reports at Nvidia. The failures are pin specific to the pcix bus and it is mostly chipset related (c51).

    Pin specific failures cannot be due to packaging material. Failures due to packaging material should be random pins near the corners of the die.

    Let’s see when Nvidia will come clean.

    If you want to do your own research on Hi-Pb versus eutectic Sn-Pb solder bump reliability, go to IEEE database or your local university library and look up “High Lead”, “Eutectic”, “Electro Migration”, “Finite Element”, “Thermo Fatigue”. Read these articles yourself.

    High lead solder has 50 years of history from the IBM days. Neil McLellan IS a hack. He stole credits from his engineers on numerous patents that he holds.

      • continuum
      • 12 years ago

      PCI-e bus I assume? (nitpicking I know)

      We’ve been very aware of lead-free solder issues for the last 5 years or so where I am, but because we deal at a system integration level rather than component design we don’t have too much visibility into the actual details of the nVidia GPU failures. Nice to know. 🙂

        • lolento
        • 12 years ago

        Just adding to this. AMD’s CPU division uses High-Pb solder themselves. Did Neil comment on this?

        This GPU failure thing is getting uglier by the day. I suspect Nvidia is withholding information so they can prepare to sue their subcons.

        AMD (Neil) is making these comments at this stage to prevent the fire from burning to their yard. Other than the eutectic versus Hi-Pb issue, everyone in the industry uses the same or very similar Bill of Materials (everyone who uses TSMC, UMC, SPIL and ASE)!

        The NV gpu failures are not materials related, they are design (IC and thermal) related.

          • Snake
          • 12 years ago

          “The NV gpu failures are not materials related, they are design (IC and thermal) related”

          Most likely – similar to the Samsung VRAM BGA issue

          http://www.google.com/search?hl=en&q=samsung+bga+failures&aq=f&oq=samsung+bga+failure

          The repair tech working on my personal laptop (for 3 months...) reports that the packaging is simply too thin and can't handle the long term thermal dynamics of the circuit while in operation. Sounds like nVidia may have gone down the same road.

    • liquidsquid
    • 12 years ago

    Save the world by eliminating lead! Doom technology by eliminating lead!

    Sometimes I think the EU has become a bunch of Luddites, and are trying to indirectly make high-technology too expensive to be profitable, thereby killing the industry as a whole, or simply eliminating everyone but the giants who can afford to pay licensing and fines.

    I still don’t understand the benefit of removing lead from electronics. Last I knew nobody died or got brain damage from eating a PCB, or from soldering on one. I can understand removing lead from paints, toys, fuel, etc, but from high-technology? Just stupid.

    Leave it up to politicians, and you wind up with more serious problems than just a little lead in an IC ball.

    -LS

      • MadManOriginal
      • 12 years ago

      Yea the only people that should be harmed are those poor e-cyclers sitting at their hot-plate style disassembly station breathing toxic fumes all day assuming people dispose of electronics properly. Lead is only part of the problem there.

        • ludi
        • 12 years ago

        Uhm, people working with solder are mainly at risk from breathing fumes from the burnoff of the rosin flux that is used to make the solder “runny”, and that can be controlled in any commercial production environment with appropriate ventilation considerations.

        A worker may get trace amounts of lead on their hands while working, but this can be mitigated by including hand washing in the shutdown protocol before breaks and at the end of a shift, and by not wiping one’s eyes or eating while working.

        There’s no question that there are ways lead content could be legitimately reduced in electronics manufacture compared to the state of the industry prior to RoHS, but RoHS itself was purely a political move with no consideration of the technical merits or the actual level of benefit to the environment. By going too far, it introduced a host of somewhat serious problems for which there was no practical benefit obtained.

          • MadManOriginal
          • 12 years ago

          I’m not denying there are technical issues with RoHS but your comments “can be controlled in any commercial production environment with appropriate ventilation considerations” and “this can be mitigated by including hand washing in the shutdown protocol before breaks and at the end of a shift” make it clear you have no idea what happens to ‘properly’ disposed of electronics when they get shipped overseas for e-cycling.

            • DrDillyBar
            • 12 years ago

            Indeed. To quote the e-waste wiki: ‘Electronic waste represents 2 percent of America’s trash in landfills, but it equals 70 percent of overall toxic waste.’

            • blubje
            • 12 years ago

            yeah the previous posts are disregarding the effects of throwing it away… that would undoubtedly indirectly harm human health, not to mention anyone unfortunate enough to live in some proximity to a landfill.

            • MadManOriginal
            • 12 years ago

            I’m not talking about landfills in the US but that’s a problem too.

            • ludi
            • 12 years ago

            I think you’re making the mistake of focusing too hard on a small subset of a much larger picture, and not seeing that picture.

            All electronics manufacturing involves a wide range of hazardous substances, and all electronics waste is not completely inert, even after reducing or removing heavy metals such as lead. (Airborne pulverized fiberglass from board shredding, to name just one example, is a much more immediate threat to human health if it isn’t handled correctly.) Therefore:

            1) If you implement regulations that have the effect of creating higher failure rates because the manufacturing techniques are more difficult and/or less reliable, then more waste is created than if things had simply been left alone or if a more reasonable middle-ground solution had been sought. The unintended consequence is more net environmental damage, and potentially greater human health hazards.

            2) If you implement regulations that make it onerous to operate in the regulated environment, and you have a country like China that is principally concerned with sustaining an economy that has raised millions of people out of poverty and still has most of a billion to go, then manufacturers will subcontract to Chinese manufacturers who will happily wreak havoc on the environment no matter what materials you specify, in order to make things as quickly and cheaply as possible, which, incidentally, exacerbates failure rates and produces more of problem (1).

            The fact that something feels good, has the right message, and may even reduce a problem in one area, doesn’t mean it’s the best policy when considered in the big picture.

      • DrDillyBar
      • 12 years ago

      RoHS restricts the use of 6 substances. Lead is 1.

      • cegras
      • 12 years ago

      You’ve obviously never considered what happens to electronics once they are discarded.

        • liquidsquid
        • 12 years ago

        Sure, they go back into the ground from where they came. It isn’t like Lead does not exist naturally in the environment and rocks beneath our feet.

        I wonder where lead came from? Was it created by some evil technical overlords bent on the destruction of children’s brains to further the Democrat’s agenda? No, it came from the very ground we stick it back into.

    • ub3r
    • 12 years ago

    If you let the GPU get hot enough, the solder balls will re-solder themselves back onto the pad they were connected to.

      • eitje
      • 12 years ago

      and then when it cools down again?

    • lolento
    • 12 years ago

    Neil McLellan used to be my boss in my previous company. He’s a hack and doesn’t know what he’s talking about; he relies on engineers below him to tell him what to say. I have not seen a single engineering observation from this guy in the three years I worked for him. How he got to where he is? God knows?

    If you actually read academic studies and participate in engineering, you would know that high-lead solder is much more reliable to fatigue, electro-migration, and also creates a much less stressful package due to the decrease in collapse height during processing. Going with eutectic solder is only for cost reduction and only for small die size applications.

      • YeuEmMaiMai
      • 12 years ago

      ok whatever lol

      • xtalentx
      • 12 years ago

      Well then you explain the failure rate?

    • PRIME1
    • 12 years ago

    Next Intel offers their take on AMD’s sales…..

      • eitje
      • 12 years ago

      i believe they DID have some things to say about the B1 Phenoms.

    • SHOES
    • 12 years ago

    Informative and no sign of slammage good stuff!
    ^^^^^^^^^

    • CB5000
    • 12 years ago

    Ah so never shutdown the computer… noted…

      • ludi
      • 12 years ago

      Not necessarily, just don’t get obsessive about having to shut it down or sleep it if you’re coming back to it sometime in the near future.

    • A_Pickle
    • 12 years ago

    A soda can?

      • xtalentx
      • 12 years ago

      It’s like putting a potatoe in a tail pipe!

      Ahh of course.

        • BobbinThreadbare
        • 12 years ago

        It’s like a balloon, and something bad happens.

    • derFunkenstein
    • 12 years ago

    What, someone hired by AMD slams nVidia’s problems? How novel!

    Though I suppose he could be right, I think AMD is the last place nVidia wants any advice from.

      • UberGerbil
      • 12 years ago

      Well the RoHS directives are something they’re all dealing with (Intel and others too) so it’s an interesting insight regardless of anything to do with the nVidia/AMD rivalry.

        • derFunkenstein
        • 12 years ago

        not disagreeing that it’s interesting; just saying I’d be annoyed if I was nVidia. I fully expect some sort of “you don’t know what we’re doing here so just bugger off” dismissal.

          • DrDillyBar
          • 12 years ago

          That would require a response from them

          • Madman
          • 12 years ago

          Reverse engineering is very popular in electronics, I bet they know very well what the heck their competitors chips look like and how they behave. It’s patents that slow things down.

            • YeuEmMaiMai
            • 12 years ago

            if you are dealing with the same issues, there is really no need to reverse engineer anything……….ATI probably had the same failures when they tried it……….but they most likely caught it in testing

          • DrDillyBar
          • 12 years ago

          I think you called it.

      • ssidbroadcast
      • 12 years ago

      To be fair, it doesn’t look like he is “slamming” them. Just a bit biased. At least, not like that one Microsoft shill slammed Apple on the “apple tax” the other day.

        • xtalentx
        • 12 years ago

        Actually McLellan didn’t slam anyone. Did you even read the article?

          • ssidbroadcast
          • 12 years ago

          Brooks:
          http://news.cnet.com/8301-10805_3-10064580-75.html In Tuesday’s shortbread. Which article were

          • ssidbroadcast
          • 12 years ago

          q[

            • xtalentx
            • 12 years ago

            I was talking about the OP O.o

            • ssidbroadcast
            • 12 years ago

            Ah, mmk. I drink too much coffee. So I guess we’re on the same page?

            • ludi
            • 12 years ago

            THERE’S NO SUCH THING.

            (Okay, there is. It’s when you start feeling really sick and shaky. But that’s why tea was invented: to flush out the coffee without resorting to — eek! — colorless water.)

    • Usacomp2k3
    • 12 years ago

    Cool description. Thanks for the write-up.

    • ludi
    • 12 years ago

    *[

      • donkeycrock
      • 12 years ago

      He didn’t say how long it takes, if you buy a new video card every year, then there shouldn’t be much problem.

        • ludi
        • 12 years ago

        Yeah, but that doesn’t describe most of the people around here, and I’ve had several conversations with other gerbils who clearly had no concept of the thermal-cycling mechanism to the point of disbelieving its existence. It does exist, and there are several different ways it can act on components, this being one of them.

          • UberGerbil
          • 12 years ago

          g[

            • ludi
            • 12 years ago

            In fairness, it’s been a few months since the most recent one of those. Maybe they all got caught up when I wasn’t looking.

          • MadManOriginal
          • 12 years ago

          Warranties can cover this for desktop users too. Thankfully many NV partners have lifetime warranties.

            • Forge
            • 12 years ago

            Make sure to read the legal text. Most of those ‘lifetime’ warranties are talking about the *RETAIL* lifetime of the card, not the useful lifetime of the card.

            Most ‘lifetime’ warranties are actually SHORTER than 2 or 3 year warranties.

            By eVGA’s very average definition, everything before the 8800GTX is EOL and thus OUT OF WARRANTY now, and the first gen G80 stuff is going to be EOL in the next month or two.

            Don’t be fooled! Read!

            • indeego
            • 12 years ago

            Very interesting. Yeah it should be /[

            • MadManOriginal
            • 12 years ago

            I’m not going to read every company’s warranty but since you mention eVGA here you go: http://www.evga.com/support/lifetime/default.asp Nothing on that page mentions 'product market lifetime' or such. It does mention only certain product numbers noted by the last two characters in the part number for lifetime warranty but Newegg for example clearly differentiates the warranty length for those products as well, the only thing vaguely deceptive is the perception that eVGA is *always* lifetime warranty but they don't try to hide that. I know PNY as one example

            • YeuEmMaiMai
            • 12 years ago

            that might change if they start getting a lot of failed parts………..everyone claims ati’s warranty sucks but I have had to use it only once since the ati mach 32 days and that was for a cap that exploded on a 8500LE

        • grantmeaname
        • 12 years ago

        if you’re the kind of person who buys a new video card every year, odds are none of them are going to be based on G86 chips anyways.

      • WaltC
      • 12 years ago

      I’ve consistently power-cycled all my machines for 22 years (and that’s a lot of them) and I cannot think of a single hardware failure (all of them easily countable on one hand) that I could attribute to the “dangers” of turning my computers on & off. It might have escaped the attention of some, but not only have computers come standard with handy-dandy on-off switches for as long as I can remember–but they’ve always been designed around that principle, too.

      The topic here is simply inferior packaging which is failing prematurely under normal operating conditions. It does not involve routine power cycling, although it may be through that very normal, expected condition that the packaging failures become evident. Optimally made hardware is *designed* to survive power cycling. Basically, if it can’t tolerate routine power cycling then the hardware is deficient.

        • ludi
        • 12 years ago

        “Optimally made hardware is *designed* to survive power cycling. Basically, if it can’t tolerate routine power cycling then the hardware is deficient.”

        Yes, but there are differing definitions of “routine”, and some of us really do run this stuff into the ground, either by direct use or by cascading upgrades through other machines as upgrades are performed on a primary machine.

          • indeego
          • 12 years ago

          These companies test the crap out of their products, there is little more damaging to a company’s financials than a product recall when you are on a roll. I’ve seen optical routers tested in heat 2x the hottest temp in their specs operate for weeks, with no degradation. These manufacturers can’t think of every condition, but surely they have thought about power cycling their devices a mere 5-6 thousand times given average use

            • ludi
            • 12 years ago

            That’s the ideal case, but accelerated aging tests can only come so close to approximating a real-world duty cycle. And then something like the Nvidia mobile GPU fiasco comes along, and suggests that the tests are not always as reliable and/or thorough as they could be.

            Also: Optical routers? Isn’t that some fairly high-end enterprise type stuff? Doubt the average consumer-grade product on Newegg gets whipped to anything close to those kinds of standards.

    • The Dark One
    • 12 years ago

    It’s interesting to hear it coming from an AMD guy, but isn’t this (in a much less rambling way) what Charlie said at the start of September?

      • Saribro
      • 12 years ago

      Also somewhat less detailed, but yeah, it is.

        • A_Pickle
        • 12 years ago

        Less frothy-at-the-mouth, too.

    • DrDillyBar
    • 12 years ago

    Finally some hard details. I had read this was the likely problem, but was forced to visualize the issues based on vague descriptions up until now. Hope nVidia has an official statement to come.

      • eitje
      • 12 years ago

      Agreed. Great write-up, Cyril!
