Nvidia explores ways of cramming many GPUs onto one package

As fast as modern GPUs are, there will always be demand for more horsepower. Beyond gamers looking to drive multiple 4K displays with one graphics card, there are researchers and businesses with an ever-increasing thirst for compute acceleration. Judging from a recent research publication, Nvidia thinks it's rapidly approaching the limits of its current GPU architectural model, so it's looking for a way forward. The idea is still in the simulation stage, but the paper proposes a Multi-Chip Module GPU (MCM-GPU) that would comprise several GPU modules integrated into a single package.

The proposal was put together by researchers and engineers from Arizona State University, Nvidia, the University of Texas at Austin, and the Barcelona Supercomputing Center. The idea starts with the recognition that Nvidia will soon struggle to squeeze more performance out of its current layouts with today's fabrication technology. Typically, the company has improved GPU performance between generations by ratcheting up the streaming multiprocessor (SM) count. Unfortunately, it's getting increasingly difficult to cram more transistors into a single die. Nvidia's V100 GPU, for example, required TSMC to produce the chips at the reticle limit of its 12-nm process. Ever-larger dies also bring cost and yield problems, since the odds of a manufacturing fault landing on any given die rise with its area.

It's possible that Nvidia could take the approach of putting multiple GPUs on the same PCB, as it did with the Tesla K10 and K80. However, the researchers found a number of problems with this approach that the company has yet to solve. For example, they note that it's not easy to distribute work across multiple GPUs, so it requires a lot of effort from programmers to use the hardware efficiently.
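
To get a feel for that burden, here's a minimal CUDA sketch of today's multi-board approach. The vecAdd kernel, the even split, and the buffer sizes are illustrative rather than anything from the paper; the point is that the programmer, not the hardware, has to partition the data, juggle per-device copies, and deal with any load imbalance by hand.

    #include <cuda_runtime.h>

    // Trivial kernel: each thread adds one element of its device's slice.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 24;            // total elements to process
        int devices = 0;
        cudaGetDeviceCount(&devices);     // e.g. two GPUs on one board
        if (devices == 0) return 0;

        // The programmer decides how to carve up the work across boards.
        const int chunk = n / devices;    // remainder handling omitted in this sketch
        for (int d = 0; d < devices; ++d) {
            cudaSetDevice(d);
            float *a, *b, *c;
            cudaMalloc(&a, chunk * sizeof(float));
            cudaMalloc(&b, chunk * sizeof(float));
            cudaMalloc(&c, chunk * sizeof(float));
            // ...copy this device's input slice in with cudaMemcpyAsync()...
            vecAdd<<<(chunk + 255) / 256, 256>>>(a, b, c, chunk);
            // ...copy results back out and free the buffers...
        }
        for (int d = 0; d < devices; ++d) {
            cudaSetDevice(d);
            cudaDeviceSynchronize();      // wait for every device to finish
        }
        return 0;
    }

Even this embarrassingly parallel example needs explicit per-device bookkeeping, and workloads whose slices depend on each other are far harder to split, which is presumably why the researchers would rather hide the seams inside the package.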

Instead, these researchers want to take advantage of developments in packaging technology that might allow Nvidia to place multiple GPU modules (GPMs) onto one package. These GPMs would be smaller than current GPUs, and therefore easier and cheaper to manufacture. While the researchers acknowledge that questions remain about the performance of packages like these, they claim that recent developments in substrate technology could allow the company to implement a fast, robust interconnect architecture to let the modules communicate. Theoretically, on-package bandwidth could reach multiple terabytes per second.

In Nvidia's in-house GPU simulator, the research team put together an MCM-GPU with a whopping 256 SMs, compared to Pascal's "measly" 56 SMs. The team then pitted that design against both the largest monolithic GPU it considers buildable (128 SMs) and a hypothetical, unbuildable 256-SM monolithic GPU. The results showed that the MCM-GPU was 45.5% faster than the buildable monolithic chip and came within 10% of the unbuildable one. Further comparison with multiple GPUs on the same board (rather than integrated into one package) still gave the MCM-GPU a 26.8% performance advantage.

These numbers all come from simulations and rely on upcoming technologies and untested optimizations, of course, so it's probably a little early to start putting pennies in the piggy bank and saving up to buy a card with an MCM-GPU. That being said, rumor does have it that AMD is pursuing a similar idea with its Navi GPU, so it's possible that the MCM-GPU concept could become more prominent in the future. In the meantime, this paper serves as an intriguing opportunity to peek behind the curtain and hear Nvidia's engineers talk about the company's current design challenges and possible routes to new levels of GPU computational prowess.

Comments closed
    • Mr Bill
    • 2 years ago

    To Infinity Fabric and beyond!

    • deruberhanyok
    • 2 years ago

    I'm a little behind on this, but: not surprising to see this kind of research happening. GPU development started looking like an accelerated version of CPU development somewhere around DX9 in 2002. But where CPUs took ~25 years for multi-core to become standard (thinking from the early Apple II days to the introduction of the Athlon 64 X2), it's taking GPUs roughly half that time.

    In really, really basic terms: if two 1060s in “SLI” (using DX12 multi-GPU, early Ashes benchmarks from mid 2016) can match a 1080 in performance, but two GP106 GPUs are “easier” to produce than a single GP104 (where “easier” means “less expensive” or “lower failure rate” or both), then why not find a way to combine two of the smaller ones and get similar performance?

    • Growler
    • 2 years ago

    Does this mean the [url=http://i.imgur.com/SWbks.jpg<]Bitchin'fast!3D[super<]2000[/super<][/url<] may finally see the light of day?

      • the
      • 2 years ago

      I’ll take two please.

        • BIF
        • 2 years ago

        You should wait for the Quantum 5D version. Oh wait, it’s already here….ummm, no, that’s the box with the dead cat.

        So now I think it was here for about [s<]1 hour[/s<] 5 minutes last [s<]Monday[/s<] Wednesday, but now is gone. It may come back [s<]next week[/s<] last month, but won't have drivers until [s<]1988[/s<] [s<]1972[/s<] 1989. November [s<]12th[/s<] 17th to be exact, and only in a specific [s<]Egghead Computer[/s<] [s<]Skyline Chili restaurant[/s<] Frisch's Big Boy. Oh drat, it's gone again. That's okay, we would have had a hell of a time finding an Egghead Computer anyway...

    • RedBearArmy
    • 2 years ago

    Quad core GPU ? Sign me up.

    Oh how thee i see ye mighty Titans descent from HPC heavens wielding HBM infused GPU cores to teach us GDDR peasant race the blessings of the holy Interposer.
    (all for that sweet 999$+ price tag)

      • ImSpartacus
      • 2 years ago

      Quad [i<]die[/i<] GPU. GPUs have buckets of "cores" these days. But yeah, HBM or some other compact memory tech is effectively required for this kind of thing.

        • lycium
        • 2 years ago

        > GPUs have buckets of “cores” these days.

        Not even that many real cores in reality; the thing is that NVIDIA started counting individual ALUs as "cores". If CPUs did this, then SSE would count as a factor of 4, AVX a factor of 8, and AVX-512 a factor of 16 in the number of cores.

        Which is obviously nonsense, but I guess the marketing guys wanted Big Numbers (TM).
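
        To make that arithmetic concrete, a tiny snippet (the eight-core CPU here is hypothetical, purely for illustration):

            #include <cstdio>

            int main() {
                // Count a CPU the way GPU marketing counts "cores":
                // physical cores multiplied by 32-bit SIMD lanes per core.
                const int physical_cores = 8;   // hypothetical CPU
                const int sse_lanes = 4, avx_lanes = 8, avx512_lanes = 16;
                printf("SSE-counted 'cores':     %d\n", physical_cores * sse_lanes);
                printf("AVX-counted 'cores':     %d\n", physical_cores * avx_lanes);
                printf("AVX-512-counted 'cores': %d\n", physical_cores * avx512_lanes);
                return 0;
            }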

          • tipoo
          • 2 years ago

          Honestly, though, I'd rather have an ALU count than a compute-unit count, since compute units are inconsistent between architectures. One may have 8 ALUs per compute unit while another has 32; core counts across architectures aren't comparable either way, but 512 ALUs vs. 384, for instance, would tell you a better story than 16 cores vs. 48, which tells you about nothing.

        • RedBearArmy
        • 2 years ago

        I guess I should have added a /just kidding, kids/ tag.
        Lycium made a good reply.
        From myself I will add this: [url<]https://techreport.com/review/30048/exploring-nvidia-pascal-architecture/3[/url<] What really counts is the GigaThread Engine up front, which spreads the load around. 4 dies -> 4 cores. These will have to be balanced in software and by that SYS+IO block.

    • John p
    • 2 years ago

    It does not matter when Windows 10 can handle this already; just drop more GPUs in.
    Microsoft is the answer.
    Please believe me, some sarcasm above.

    • ronch
    • 2 years ago

    I see Nvidia has risen to the challenge of building graphics chips of epic proportions.

    • Laykun
    • 2 years ago

    Ah, finally getting the benefit of buying out 3DFX I see, the Voodoo 5 6000 will at last see the light of day.

    [url<]http://www.3dchip.de/Grafikkartenmodds/Grafixmodds/3dfxv59000agp.jpg[/url<]

      • DoomGuy64
      • 2 years ago

      Exactly what I thought when I saw this.

      • BIF
      • 2 years ago

      That one would create a small black hole in my computer. I guess it’s okay if I never stick my hand in there!

    • Kougar
    • 2 years ago

    I would tend to think it would be easier to implement this given the parallel workloads, in contrast to general-purpose processors. The compiler, driver, and hardware scheduling could all handle a lot of the overhead of distributing workloads evenly across the GPUs, keeping it simple to develop for.

    And since these wouldn't be powering displays, splicing results together from four or more sources would have none of the tearing, juddering, or distortion syncing problems of consumer SLI arrangements either. In fact, I would be very surprised if NVIDIA doesn't go this route a few generations from now for its largest designs.

      • jts888
      • 2 years ago

      A lot of general compute stuff is more naturally parallel than modern graphics rendering.

      The industry shift to deferred shading and post-process/multi-frame effects has been the biggest thing hurting multi-GPU rendering, since those techniques eliminate the ability to work on frames (and even some intra-frame render targets) in isolation. Nearly every modern engine tries to use previous frame data in any way possible to speed things up, including light map caching, temporal anti-aliasing, and screen space reflections, and getting them to work right in multi-GPU now is ad-hoc hackery.

      A proper MCM GPU probably requires at the very least:[list<][*<]interconnects with massive bandwidth and hooks into the memory crossbars if not even intra SE/SM units [/*<][*<]tiling/binning rasterizers that can efficiently push fragments to shader blocks on remote dies [/*<][*<]sophisticated autonomous hardware caching of read assets to each die's local memory pool [/*<][*<]really robust coherence engine in caches for getting intermediate render target output back to shader blocks' L1s[/*<][/list<] Something along these lines (Navi?) could actually be relatively close to existing, but it'd be an even bigger architecture change than unified shading or VLIW->SIMT.
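
      As an illustration of the previous-frame problem described above, here's a toy sketch assuming plain alternate-frame rendering across two GPUs; the Frame struct and render() merely stand in for real graphics work:

          #include <cstdio>

          // Toy model: under AFR, GPUs take turns producing whole frames, but
          // temporal AA, light-map caching, and screen-space effects all read the
          // previous frame, which the *other* GPU produced.
          struct Frame { int id; int gpu; };

          Frame render(int frame_id, int gpu, const Frame& history) {
              if (frame_id > 0 && history.gpu != gpu)
                  printf("frame %d on GPU %d needs frame %d from GPU %d -> cross-GPU copy\n",
                         frame_id, gpu, history.id, history.gpu);
              return Frame{frame_id, gpu};
          }

          int main() {
              Frame history{-1, 0};
              for (int f = 0; f < 6; ++f)
                  history = render(f, f % 2, history);   // alternate frames between GPUs
              // With two GPUs, every frame after the first forces a transfer, so the
              // interconnect, not shader throughput, sets the pace unless it's very fast.
              return 0;
          }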

        • Kougar
        • 2 years ago

        Well, NVIDIA already has NVLink for the interconnect, which can handle 300GB/s in current implementations. So it looks like NVIDIA has all the pieces ready to go; I'd wager they have already begun splicing MCM GPUs together in the process of creating one of these. It's the natural next step for efficiency reasons alone, but my understanding is 7nm stuff will reduce the 12nm 800mm2 reticle limit down to 700mm2 too.

          • ImSpartacus
          • 2 years ago

          [quote<]my understanding is 7nm stuff will reduce the 12nm 800mm2 reticle limit down to 700mm2 too.[/quote<] Which fab are you referring to? We know that TSMC's 12nm FFN had a reticle limit of roughly 800mm2 from [url=http://www.anandtech.com/show/11367/nvidia-volta-unveiled-gv100-gpu-and-tesla-v100-accelerator-announced<]GV100's announcement[/url<]. And we know that GloFo's 7nm LP will have a reticle limit in the ballpark of 700mm2 (up from 14nm LPP's 650ish mm2 limit) according to [url=http://www.anandtech.com/show/11558/globalfoundries-details-7-nm-plans-three-generations-700-mm-hvm-in-2018<]their 7nm roadmap[/url<]. But those are separate companies doing their own thing. I'm no expert, but my impression is that you wouldn't want to go backwards on your reticle limit.

            • jts888
            • 2 years ago

            Well, you can count on one hand the number of high-volume >300mm^2 design producers, and one of them is already trying to flee that segment as fast as humanly possible, and even its direct competitors are now talking openly about using stitched-together MCMs as well. I think it’s even been publicly stated a few times that Vega will be AMD’s last big monolithic GPU.

            Given all that, not pushing reticle limits to their extremes probably has some nice advantages in simplifying foundry systems.

            • the
            • 2 years ago

            AMD has been avoiding large monolithic dies since the botched Radeon HD 2900. They went back to large dies because it was the only real option at the time, given delays in new process nodes and poor Crossfire scaling with their multi-GPU cards. If at first you don't succeed, try, try and try again.

            As for the rest of the large >300 mm^2 producers, Intel is also on the stage wanting to break up its large monolithic dies into multiple smaller blocks. The twist here is that Intel wants to use its own EMIB technology. IBM's bigger designs have moved some stuff off die due to being so large (see their Centaur memory buffer chip with L4 cache). Their legacy MCM experience is top notch, so producing multiple-die solutions is probably coming. What is left of the volume >300 mm^2 market are the FPGA players, which have also already started down the path of interposers on the high end.

            I'd also consider Apple a wild card here, but everything they've done points at them being happier with a larger single die for the moment. It would not surprise me if they go full 3D with TSV stacking of their low-power logic dies + modem. This would be less about performance and more about reducing power consumption and board space. Cost of course is a huge factor.

            • Kougar
            • 2 years ago

            Ah, good point.

            Wasn’t thinking about different fabs and should’ve been. I just remember the GloFo article I read talking about the 700mm limit.

      • DeadOfKnight
      • 2 years ago

      Why couldn't it be used to power displays? GPUs are essentially modular by design already, hence GP102, GP104, and GP106 having the same basic architecture on chip. The problem with putting them together is that they don't share resources, they just share the workload. But if they were designed from the ground up to share resources, then that alleviates the problem.

      Until recently there was no high performance way to do this due to the latency of communication between chips. However, with high performance interconnection of multiple chips on package, it’s possible to get something that far exceeds crossfire or SLI. I wouldn’t hold my breath for an X4 GPU for consumers, but an X2 chip could be in the cards for something like a Titan Z.

        • Kougar
        • 2 years ago

        The same drawbacks with SLI apply. How do you evenly split up the workload across multiple discrete MCM cores?

        If you split up each frame (SFR), then it gets complicated to stitch all those frame pieces together without tearing, alignment, and artifact issues. If you do alternate-frame rendering (AFR), with each core taking a whole frame, then frames have to be strictly synchronized, which still does not solve the judder / uneven frametime issue.

        Utilizing a collection of multiple MCM dies as a single "GPU" is still going to incur additional latency on frame times. Tech Report has done considerable testing on the impact of frame-time latency that shows raw FPS is by no means a guarantee of smooth gameplay.

    • Krogoth
    • 2 years ago

    Nvidia has known since Kepler that the days of massive monolithic dies are numbered. This is just them trying to stay ahead of the game and to one-up AMD's Infinity Fabric.

    • psuedonymous
    • 2 years ago

    The question is: who will they be able to wrangle into fabbing a yet-more-gargantuan interposer to host the thing?! GP100 already uses an interposer so large it requires two different patterning stages to cover the whole thing, due to it being bigger than the reticle size.

      • the
      • 2 years ago

      The trick to a large interposer is to use an organic one that doesn't share the same reticle limits as silicon. The downside is that current HBM is purportedly incompatible. Then again, any chip using this is still a few years away, enough time to get memory manufacturers to adapt.

      At this point, the limitations on the design would purely be power and thermal.

        • jts888
        • 2 years ago

        AMD had Fiji prototypes with multi-“zoned” interposers like 5+ years ago to get around reticle limits, and I assume that’s what GP100, Vega, Volta, etc. will all be doing too.

        Epyc can get away with organic interposers since each Zeppelin die is only using a total of 48x ~10Gb differential wire pair lanes in each direction (and only 16 for Threadripper?), so less than a few hundred traces including grounding. OTOH Vega has 2k data lanes alone going out from a single ~2cm chip edge, even if they’re only running <=2Gb per pin.
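
        Running rough numbers on those two cases (per direction, ignoring protocol overhead, and using the lane counts and per-pin rates quoted above):

            #include <cstdio>

            int main() {
                // Organic-package SerDes links (Epyc-style): few wires, high clock.
                const double serdes_lanes = 48, serdes_gbps = 10;   // differential pairs
                // Interposer-style HBM bus (Vega): huge wire count, modest clock.
                const double hbm_lanes = 2048, hbm_gbps = 2;

                printf("organic package: ~%.0f GB/s over ~%.0f signal wires\n",
                       serdes_lanes * serdes_gbps / 8, serdes_lanes * 2);
                printf("interposer bus:  ~%.0f GB/s over ~%.0f signal wires\n",
                       hbm_lanes * hbm_gbps / 8, hbm_lanes);
                return 0;
            }

        The organic route gets respectable bandwidth out of very few traces, while the interposer trades an enormous wire count for roughly an order of magnitude more bandwidth, which is why a Vega-class memory interface doesn't map onto an organic package.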

          • d0x360
          • 2 years ago

          The 295x was 2 Fiji GPUs on a single board but the system treated it as a discrete GPU. It was and still is a really good card aside from its thermal issues.

          It doesn't directly apply to what this article is about, but it shows AMD has known about this issue for quite some time and has worked on ways to address it. Using HBM in the Fury X was essentially practice and directly led to Infinity Fabric.

          I wouldn’t be shocked if amd gets there before Nvidia despite the massive gulf between the 2 companies in regards to r&d funds.

          Amd’s architecture on the gpu side also shows they have been thinking of this problem and have already taken steps to make it work in gaming. The way their GPUs work is quite different from Nvidia which is where the whole fine wine theory comes from.

          As for Vega… I'm not so sure it's wise to draw any conclusions on its performance just yet. As with any previous revision to GCN, it not only takes developers time to learn the ins and outs, it takes AMD themselves time. They often rewrite shader code for certain games so it takes better advantage of the differences in hardware design. It's almost to the point where an Nvidia GPU is general-purpose and straightforward, thus easier to get performance out of, while AMD's architecture manages to do more with less but is harder to get there with, and has only been hampered by things like GameWorks and Nvidia's underhanded tactics to hurt performance even on their own hardware once it passes a certain age.

          I’d rather have amd lead the charge on something like this than I would Nvidia simply because I don’t trust Nvidia. While both companies exist to make money I get the impression only 1 of them actually cares about moving the industry forward at the same time.

            • ImSpartacus
            • 2 years ago

            Remember, the 295X2 was two Hawaii GPUs.

            The only dual Fiji part was the first Radeon Pro Duo.

            And it was just Crossfire on a card (in the case of the 295X2). I don't believe it's a whole lot more complicated than that.

    • NovusBogus
    • 2 years ago

    Unlike CPUs, the GPU task load is highly parallel so the arms race will continue for the foreseeable future.

    • Star Brood
    • 2 years ago

    Isn’t AMD already going to be doing this a la infinity fabric in Vega’s successor?

      • ImSpartacus
      • 2 years ago

      Yes, all things point to that being the case.

      However, I don’t believe we’ve received an explicit confirmation from AMD. And based on how rough their experience with Vega has been, I’m beginning to question whether they can tackle a complex issue like MCM-style graphics.

        • _ppi
        • 2 years ago

        EPYC?

          • exilon
          • 2 years ago

          All indications point to QPI/UPI level bandwidth and latency, aka still too low. You’ll end up with AFR again to gain any performance out of the extra dies. Infinity fabric still needs a few more generations before it’s suitable for replacing on chip interconnects.

          • ImSpartacus
          • 2 years ago

          Epyc is one example of that on the CPU side, that is, a package with four Zeppelin dies connected together to get you to 32C. Threadripper is similar, but with two Zeppelins (and then Summit Ridge only has one Zeppelin).

          But I don’t know if it’s trivial to do the same with GPUs. Possible, of course, but perhaps not easy.

          AMD is R&D-limited. Just look at Vega FE and its driver issues. It’s got potential, but there is additional effort needed and that effort costs $$$.

            • exilon
            • 2 years ago

            It’s easier on the GPU side if anything, given the parallelism.

            Epyc and Threadripper are a question of whether AMD can make multi-NUMA nodes of low-core-count dies priced low enough against monolithic dies for customers to ignore the NUMA part.

        • the
        • 2 years ago

        Raja dropped hints of this last year, indicating that an interposer can be used for more than just linking a GPU to memory. The implication, of course, was GPU-to-GPU in that context.

        Far from any official statement but a strong suggestion in my book.

          • ImSpartacus
          • 2 years ago

          I know AMD has made a lot of noise about MCM in the past couple years, but I don’t believe they have explicitly mentioned that Navi (i.e. Vega’s successor) was going to utilize it.

          Obviously it’s plausible, but I just don’t think AMD has explicitly bound themselves to that route.

      • Krogoth
      • 2 years ago

      Yep

      • wiak
      • 2 years ago

      Heh, AMD has years of experience with MCMs; the 32-core Epyc is basically four Ryzen dies in an MCM. Perhaps Navi will be dual dies with an Infinity Fabric connection?

      DX12 has proper multi-adapter support, so maybe AdoredTV's idea is correct? The future is dual or quad GPU dies in an MCM?

        • the
        • 2 years ago

        The difference is that the Epyc MCM is using more traditional techniques to put multiple dies into a socket. What nVidia is doing here is using an interposer, since they’d need it anyway for the stacked memory, to accomplish the same high level concept. The detail is that with an interposer there can be far more inter-die connections, they can run at higher clocks and they consume less power to link the two dies. The two downsides are cost and complexity as it’ll impact yields.

        The way nVidia is combining the chips, there wouldn't be a need to do multi-adapter. Multiple dies in a single socket linked together as they're describing [i<]could[b<]*[/b<][/i<] appear as a single GPU to the OS. Thus all the SLI/Crossfire tricks for scaling don't apply. Similarly, this would permit the multiple dies to directly access memory from each other in a relatively low-latency fashion, so memory can be directly shared. Thus two sets of a GPU die + 4 GB of HBM connected directly will appear as one beefier GPU + 8 GB of HBM total. [b<]*[/b<]Multiple dies here are probably a strong candidate for segmentation for virtualization of GPU workloads. Just because it can be seen as one GPU doesn't mean it has to be or that some want it to be.
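
        The closest thing programmers can reach for today is CUDA peer access, where one device dereferences memory that physically lives on another. A minimal sketch, assuming a system with two CUDA devices; on an MCM package the same idea would simply ride a much faster on-package link:

            #include <cuda_runtime.h>
            #include <cstdio>

            int main() {
                int devices = 0;
                cudaGetDeviceCount(&devices);
                int can_access = 0;
                if (devices >= 2) cudaDeviceCanAccessPeer(&can_access, 0, 1);
                if (!can_access) { printf("no peer path between devices 0 and 1\n"); return 0; }

                cudaSetDevice(1);
                float* remote = nullptr;
                cudaMalloc(&remote, 1024 * sizeof(float));    // lives in device 1's memory pool

                cudaSetDevice(0);
                cudaDeviceEnablePeerAccess(1, 0);             // map device 1's memory on device 0
                // Kernels launched on device 0 can now read and write 'remote' directly,
                // so the two pools behave like one larger (if non-uniform) memory.
                cudaDeviceDisablePeerAccess(1);

                cudaSetDevice(1);
                cudaFree(remote);
                return 0;
            }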

    • chuckula
    • 2 years ago

    [quote<]The team then pitted that against a hypothetical (and unbuildable) 256-SM GPU built with the company's current architecture. The results showed that the MCM-GPU was 45.5% faster than the monolithic chip.[/quote<] Did it? The chart below does not appear to indicate that the MCM-GPU is 45% faster than the unbuildable monolithic equivalent. More like 45% faster than a (buildable) optimized multi-GPU setup. Still it's rather impressive. I'd really like to see [url=http://www.3dic.org/EMIB<]EMIB[/url<] take off since it provides a lot of the advantages of silicon-vias on an interposer without actually requiring an expensive full-sized interposer for larger arrays of chips.

      • Redocbew
      • 2 years ago

      The abstract of the paper says “45.5% faster than the largest implementable monolithic GPU”. I’m not sure if the V100 was the reference point they used for that or not, but I guess there’s some wiggle room there anyway since the performance of the “largest implementable” GPU is going to vary depending on architecture.

      Still an interesting idea though.

      • rube798
      • 2 years ago

      Relevant piece from the paper:

      “We show that with these optimizations, a 256 SMs MCM-GPU achieves 45.5% speedup over the largest possible monolithic GPU with 128 SMs. Furthermore, it performs 26.8% better than an equally equipped discrete multi-GPU, and its performance is within 10% of that of a hypothetical monolithic GPU that cannot be built based on today’s technology roadmap.”

      I think the performance ratios refer to something like this:

      1.25x = 1 monolithic 128 SMs -> 2 discrete cards with 128 SMs each
      1.46x = 1 monolithic 128 SMs -> 2 MCMs with 128 SMs each w/ 768 GB/s
      1.62x = 1 monolithic 128 SMs -> 2 MCMs with 128 SMs each w/ 6 TB/s
      1.64x = 1 monolithic 128 SMs -> 1 monolithic 256 SMs

        • exilon
        • 2 years ago

        Eeesh, those diminishing returns on more cores at 256 SMs. Even graphics gets smacked by Amdahl given enough cores, I guess. Eventually we'll just have to suck it up and scale up GPUs (more IPC, higher frequency) instead of scaling out (more cores).
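
        For reference, the Amdahl's law arithmetic being invoked here, with made-up parallel fractions purely for illustration (the replies below argue the paper's real bottleneck lies elsewhere):

            #include <cstdio>

            // Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N),
            // where p is the parallel fraction and N the number of processors.
            double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

            int main() {
                const double fractions[] = {0.90, 0.99, 0.999};   // illustrative values only
                for (double p : fractions)
                    printf("p=%.3f: 128 units -> %5.1fx, 256 units -> %5.1fx\n",
                           p, amdahl(p, 128), amdahl(p, 256));
                return 0;
            }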

          • ImSpartacus
          • 2 years ago

          I doubt that’s the case. GPU workloads are magnificently parallelizable.

          There’s probably a bottleneck in their implementation somehow.

          • DavidC1
          • 2 years ago

          That’s not it.

          From the paper:

          “A system with 256 SMs can also be built by interconnecting two maximally sized discrete GPUs of 128 SMs each. Similar to our MCM-GPU proposal, each GPU has a private 128KB L1 cache per SM, an 8MB memory-side cache, and 1.5 TB/s of DRAM bandwidth. We assume such a configuration as a maximally sized future monolithic GPU design.”

          “We assume that two GPUs are interconnected via the next generation of on-board level links with 256 GB/s of aggregate bandwidth”

          (Notice, “on-board”)

          “We refer to this design as a baseline multi-GPU system”

          With that out of the way, the “Optimized Multi-GPU” design is the one with a GPU-side L1.5 cache, which improves over the baseline by 25%. They propose yet another one which uses MCM rather than on-board links, as evidenced here: “Our proposed MCM-GPU on the other hand, outperforms the baseline multi-GPU by an average of 51.9% mainly due to higher quality on-package interconnect.”

          That MCM-GPU improves performance over the baseline multi-GPU by 51.9%. The MCM-GPU also has the L1.5 cache and other improvements that were added to the “Optimized Multi-GPU” configuration.

          They are ALL using 256 SMs.

        • ImSpartacus
        • 2 years ago

        Just wanted to say thanks for extracting that from the article. TR’s wording was very confusing.

        • Zizy
        • 2 years ago

        Afaik they considered just 4x 64SM MCM. The 128SM was only “how big single GPU we could perhaps maybe make”. But I just checked their figures 🙂

        The thing I am not sure about is their MCM seems to have just 1->2->3->4->1 connected. 1->3 and 2->4 (diagonals) seem missing. Well, I guess that when you have compute bound ones and some that just don’t scale too well, you could hide 5-10% memory bound disadvantage pretty well just by averaging.

      • Andrew Lauritzen
      • 2 years ago

      Right, the article is poorly worded here… I had to re-read that bit a few times before I convinced myself it wasn’t actually a claim of the paper 🙂

      It’s not possible for the MCM version to be faster than a monolithic GPU of the same size. The point of the comparison is precisely to see how close to the performance of the theoretical (“unbuildable”) monolithic GPU you can get with MCM + a fast interconnect.

    • ultima_trev
    • 2 years ago

    I will take quad SLI GV104 on a single PCB [spoiler<]interposer[/spoiler<], thanks.
