AMD Radeon Instinct MI50 and MI60 bring 7-nm GPUs to the data center

Alongside a preview of its first 7-nm Epyc CPUs built with the Zen 2 microarchitecture, AMD debuted its first 7-nm data-center graphics-processing units today. The Radeon Instinct MI50 and Radeon Instinct MI60 take advantage of a new 7-nm GPU built with the Vega architecture to crunch through tomorrow's high-performance computing, deep learning, cloud computing, and virtualized desktop applications.

As we noted with AMD's next-generation Epyc CPUs, TSMC's 7-nm process provides the red team's chip designers with a 2x density improvement versus GlobalFoundries' 14-nm FinFET process. The resulting silicon can be tuned for 50% lower power for the same performance or for 1.25x the performance in the same power envelope. In the case of the Vega chip that powers the MI50 and MI60, that process change allowed AMD to cram a marketing-approved figure of 13.2 billion transistors into a 331-mm² die, up from 12.5 billion transistors in 471 mm² on the 14-nm Vega 10.
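
For a rough sense of how that shakes out at the chip level, the published die sizes and transistor counts work out to about a 1.5x density improvement for the shipping silicon (a back-of-the-envelope sketch of my own, not an AMD figure):

```python
# Transistor-density comparison from the die sizes and transistor counts above.
vega10_transistors, vega10_area_mm2 = 12.5e9, 471   # 14-nm Vega 10
vega20_transistors, vega20_area_mm2 = 13.2e9, 331   # 7-nm "Vega DC"

vega10_density = vega10_transistors / vega10_area_mm2   # ~26.5M transistors per mm^2
vega20_density = vega20_transistors / vega20_area_mm2   # ~39.9M transistors per mm^2

print(f"Realized density improvement: {vega20_density / vega10_density:.2f}x")   # ~1.50x
```

That realized figure falls short of the ideal 2x in part because fixed-size blocks like I/O don't shrink with the process, a point commenters raise below.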

AMD didn't call this chip by an internal codename, but it's clearly a refined and tuned version of the Vega architecture we know from the gaming space. Vega DC (as I'll call it for convenience) unlocks a variety of data-processing capabilities to suit a wide range of compute demands. For those who need the highest possible precision, Vega DC can perform double-precision floating-point math at half the rate of single-precision math, for as much as 7.4 TFLOPS. Single-precision math proceeds at a rate of 14.7 TFLOPS. The fully fledged version of this chip inside the Radeon Instinct MI60 crunches through half-precision floating-point math at 29.5 TFLOPS, 59 TOPS for INT8, and 118 TOPS for INT4.
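
Those peak rates all fall out of one formula: stream processors × 2 ops per clock (fused multiply-add) × clock speed, with packed math doubling, quadrupling, or octupling the rate for FP16, INT8, and INT4. Here's a minimal sketch, assuming the 4096 stream processors and roughly 1.8-GHz peak clock discussed in the comments below (AMD didn't spell out either figure in its announcement):

```python
# Peak-throughput sketch for the Radeon Instinct MI60.
# Assumptions: 4096 stream processors, ~1.8 GHz peak clock (see the comments below).
shaders = 4096
peak_clock_ghz = 1.8

fp32_tflops = shaders * 2 * peak_clock_ghz / 1000   # 2 FLOPs per FMA  -> ~14.7 TFLOPS
fp64_tflops = fp32_tflops / 2                       # half-rate FP64   -> ~7.4 TFLOPS
fp16_tflops = fp32_tflops * 2                       # packed math      -> ~29.5 TFLOPS
int8_tops   = fp32_tflops * 4                       # 4x INT8 packing  -> ~59 TOPS
int4_tops   = fp32_tflops * 8                       # 8x INT4 packing  -> ~118 TOPS

print(f"FP64 {fp64_tflops:.1f} | FP32 {fp32_tflops:.1f} | FP16 {fp16_tflops:.1f} TFLOPS")
print(f"INT8 {int8_tops:.0f} | INT4 {int4_tops:.0f} TOPS")
```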

Compared to Nvidia's PCIe version of its Tesla V100 accelerator, the Radeon Instinct MI60 seems to stack up favorably. The green team specs the V100 for 7 TFLOPS of FP64, 14 TFLOPS of FP32, 28 TFLOPS of FP16, 56 TOPS of INT8, and 112 TFLOPS on FP16 input data with FP32 accumulation by way of the Volta architecture's tensor cores. While the two architectures are not entirely cross-comparable in their capabilities, the relatively small die and high throughput of the Radeon Instinct MI60 still impress by this measure.

To support that blistering number-crunching capability, AMD hooks Vega DC up to 32 GB of HBM2 RAM spread over four stacks of memory. With 1024-bit-wide interfaces per stack, Vega DC can claim as much as 1 TB/s of memory bandwidth. While Tesla V100 boasts a similarly wide bus, its HBM2 memory runs at a slightly slower speed, resulting in bandwidth of 900 GB/s. AMD also claims end-to-end ECC support with Vega DC for data integrity.
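
That bandwidth claim is simple math on the memory config: four stacks at 1024 bits apiece make a 4096-bit bus, and hitting 1 TB/s on that bus implies a 2.0-Gbps-per-pin data rate (the per-pin rate is my inference; AMD didn't quote a memory clock in its announcement):

```python
# HBM2 bandwidth sketch for Vega DC: four 1024-bit stacks.
stacks = 4
bus_width_bits = 1024 * stacks      # 4096-bit aggregate interface
data_rate_gbps = 2.0                # assumed per-pin rate needed to reach the claimed 1 TB/s

bandwidth_gb_s = bus_width_bits * data_rate_gbps / 8   # bits -> bytes: 1024 GB/s
print(f"{bandwidth_gb_s:.0f} GB/s")
```

The same arithmetic with the Tesla V100's roughly 1.75-Gbps HBM2 lands at its quoted 900 GB/s.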

The bleeding-edge technology doesn't stop there, either. AMD has implemented PCI Express 4.0 links on Vega DC for a 31.5-GB/s path to the CPU and main memory in each direction, or as much as 64 GB/s of bi-directional transfer. On top of that, AMD builds Infinity Fabric edge connectors onto every Radeon Instinct MI50 and MI60 card that allow for 200 GB/s of total bi-directional bandwidth for coherent GPU-to-GPU communication. These Infinity Fabric links form a ring topology across as many as four Radeon Instinct accelerators.
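
The 31.5-GB/s figure is just what a PCIe 4.0 x16 link delivers after encoding overhead; checking it takes a line of arithmetic (my own working, not AMD's breakdown):

```python
# PCIe 4.0 x16 bandwidth sketch: 16 GT/s per lane with 128b/130b line encoding.
lanes = 16
transfer_rate_gt_s = 16.0            # PCIe 4.0 signaling rate per lane
encoding_efficiency = 128 / 130      # 128b/130b overhead

per_direction_gb_s = lanes * transfer_rate_gt_s * encoding_efficiency / 8   # ~31.5 GB/s
bidirectional_gb_s = 2 * per_direction_gb_s                                 # ~63 GB/s (AMD rounds to 64)
print(f"{per_direction_gb_s:.1f} GB/s per direction, {bidirectional_gb_s:.1f} GB/s bi-directional")
```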

Like past Radeon data-center cards, the MI50 and MI60 will allow virtual desktop deployments using hardware-managed partitioning. Each Radeon Instinct card can support up to 16 guest VMs, or one VM can harness as many as eight accelerators. This feature will come free of charge for those who wish to use it.

Four Radeon Instinct MI50 cards in an Infinity Fabric ring

AMD expects the Radeon Instinct MI60 to ship to data-center customers before the end of 2018, while the Radeon Instinct MI50 will begin reaching customers by the end of the first quarter of 2019. AMD also announced its ROCm 2.0 software compute stack alongside this duo of 7-nm cards, and it expects that software to become available by the end of this year.

Comments closed
    • stefem
    • 12 months ago

    A more complete comparison, no offense Jeff 😉

    Vega 20: 2019?
    _____DP:___7.4 TFLOPS
    _____SP:___14.7 TFLOPS
    __Half-P:___29.5 TFLOPS
    ___INT8:___ 59 TOPS
    ___INT4:___118 TOPS
    ____BW:___1000 GB/s

    GV100 (PCI-E): 2017
    _____DP:___7.8(7) TFLOPS
    _____SP:___15.7(14) TFLOPS
    __Half-P:___31.4(28) TFLOPS
    Tensor FP:___125(112) TFLOPS (FP16 input, full-precision multiply result, accumulated in FP32)
    ___ INT8:___62.8 TOPS (with INT32 result and accumulation)
    ___ INT4:___62.8 TOPS
    ____ BW:___900 GB/s

    Quadro RTX 6000: 2018
    _____DP:___0.5 TFLOPS
    _____SP:___16.3 TFLOPS
    __Half-P:___32.6 TFLOPS
    ___ INT8:___65.2 TOPS
    Tensor FP:___130.5 TFLOPS (FP16 input, full-precision multiply result accumulated in FP32)
    Tnsr. INT8:___261.0 TOPS (with INT32 result and accumulation?)
    Tnsr. INT4:___522.0 TOPS
    _____BW:___672 GB/s

    Then some questions: what's the TDP of Vega 20? And while processing FP16 data, will it perform a full-precision product and accumulation (like Volta and Turing)? That's not so important for inferencing but almost fundamental for training a DNN model.

      • Jeff Kampman
      • 12 months ago

      [url<]https://www.amd.com/en/products/professional-graphics/instinct-mi60[/url<] says 300 W, same as V100 in mezzanine form

        • stefem
        • 12 months ago

        That's interesting, though there's not much more to read. I'm wondering if it has just FP64 units and handles FP32 through packed math, like FP16 operations on Vega 10.

      • NoOne ButMe
      • 12 months ago

      Vega 20 (Radeon Instinct MI60) is 4Q2018.
      With its slightly cut-down version (Radeon Instinct MI50) coming in 1Q2019, iirc.

    • Mat3
    • 12 months ago

    I thought the improvement from Glofo 14nm to TSMC 7nm would be better than 1.25.

      • Chrispy_
      • 12 months ago

      It’s still the same CPU.

      What you’re confusing it with is the usual ability to cram more shaders/cores in because of a die shrink.

      In this case, it was a 64CU part on 14nm and it’s still a 64CU part on 7nm, so the only improvement they can get is clockspeed tweaks.

        • freebird
        • 12 months ago

        To get picky… “It’s still the same [b<][u<]G[/u<][/b<]PU." Although, this is really being sold as a Compute GPU, so pickiness is debatable 😀

        • stefem
        • 12 months ago

        That's exactly the point: if the design is the same (well, they added FP64 capability…), what you should see is the scaling given by the new process node, which is exactly what Mat3 was arguing.

          • Chrispy_
          • 12 months ago

          What scaling are you expecting from a node shrink? 25% clock increase is about what was expected – a little lower than the obviously-optimistic 35% used to attract investors – but in the right ballpark for sure.

          I cannot think of a process shrink in the last 25 years that made more than 50% clock improvements [i<]at the same power envelope[/i<], and it's rarely been above 30%.

          What I'm trying to explain - perhaps not very well - is that people's conceptions of a new process node are usually misled by Intel/AMD/Nvidia also using the shrink as an opportunity to increase transistor count and cram in more cores/shaders/units/cowbells. That means a new node seems like a huge performance jump because it's not just a new node. The best-case scenario for a new generation is the cumulative combination of:

          - A new or tweaked architecture
          - A wider design with more cores, because the smaller node has a higher transistor density
          - A clockspeed increase, because the smaller node clocks higher

          In this case, for the MI50 and MI60, the architecture is unchanged and the design is no wider than before. All we're left with is the clock increase.

            • stefem
            • 11 months ago

            Relax and read my post as a clarification of what Mat3 meant to say in his comment, and read more carefully this time, please.

            First, we are talking about a 19.6% clock increase, not 25%. And I've just said that if the design is the same, what you will see is just the improvement that comes from the different process; since (as I mentioned) it's not exactly the same (they increased the FP64 capabilities, and the transistor count rose to 13.2 billion), we may have to make some more considerations.
            Now tell me, what does everything you wrote in reply to my comment have to do with what I said?

          • tipoo
          • 12 months ago

          A straight die shrink will only give you whatever clock advantages lower power per clock gets you. Scaling comes from building out the architecture with the extra transistors you have to spend, that part isn’t automagic.

            • stefem
            • 11 months ago

            We are talking specifically about clock scaling

      • chuckula
      • 12 months ago

      So did AMD when they claimed 1.35 in January.
      [url<]https://twitter.com/Dayman58/status/1059900726275723265?s=20[/url<]

      • stefem
      • 12 months ago

      If GloFo 14nm is similar to Samsung 14nm, which is very likely, then it is very similar to TSMC 16nm.

        • Goty
        • 12 months ago

        Didn’t GloFo basically license the 14nm node wholesale from Samsung, or was that a different node?

          • stefem
          • 11 months ago

          Yes, but no one actually tested whether there are any differences; fabs need lots of tuning. Even different fabs from the same foundry could produce slightly different-performing dies.

    • ronch
    • 12 months ago

    I sometimes wonder if AMD actually sells a lot of these things. Do they?

      • renz496
      • 12 months ago

      Honestly, I don't think so. When it comes to accelerator-type cards, Intel was more successful with its Phi. Just look at the S9150 (Hawaii), which competed directly at the time with Nvidia's K80 (dual GK210). I never saw even a single system using the S9150 in the TOP500 listing, and yet there are plenty of systems using the K80.

    • psuedonymous
    • 12 months ago

    With basically the same layout as RX Vega, to bump up from 13.7 SP TFLOPs of the RX Vega 64 Liquid to 14.7 TFLOPs (1.073x) would imply a boost clock bump (rated FP perf at max boost clocks) of 1677MHz to ~1800 MHz.

      • BSandLies
      • 12 months ago

      It also has 5.6% more transistors. Perhaps some of the 7.3% of extra performance comes from minor tweaks to the Vega design that take more transistors to accomplish and/or other capabilities on the die that can be put towards compute tasks.

        • psuedonymous
        • 12 months ago

        I would expect that most of the transistor increase comes from the additional two HBM2 controllers.

          • NoOne ButMe
          • 12 months ago

          and FP64 support

          • stefem
          • 11 months ago

          It’s most probably more from added FP64 capabilities than from the additional memory controllers

      • stefem
      • 12 months ago

      They mention exactly 1800MHz on a chart

    • ronch
    • 12 months ago

    (471 ÷ 12.5) = 37.68mm^2 per billion transistors

    (331 ÷ 13.2) = 25.07mm^2 per billion transistors

    25.07 ÷ 37.68 = 0.665

    Yes, we're not measuring SRAM density here, but at least for Vega, isn't this more like a 50% increase in density?

      • willmore
      • 12 months ago

      The I/O is pretty much fixed size and doesn’t scale with process geometry. And this chip has a lot of I/O.

    • WayneManion
    • 12 months ago

    So moving from the current node to 7-nm gets them a 1.5x increase in transistor density. That seems a little wimpy. Better than being stuck on 28-nm for 3+ years, I guess.

    No mention was made of CU count here, but given GCN’s lack of flexibility, I’d imagine MI60 is still stuck at 4096 SPs. With that in mind, my napkin math says the clock rate got a bump from Vega 64’s “up to 1247 MHz” to 1443 MHz thanks to that node shrink.

    For comparison’s sake, Nvidia’s spendiest datacenter silicon spins at 1530 MHz on a 12-nm process. One has to wonder how much of a speed boost the green team is going to get when it moves to 7-nm.

      • Srsly_Bro
      • 12 months ago

      1800 MHz is the number I keep seeing for the frequency.

      • Chrispy_
      • 12 months ago

      It is definitely still 4096 SPs, because it’s still Vega 64, just on a new process.

      You can see the 64 compute units in the block diagram.

    • chuckula
    • 12 months ago

    Clearly these parts are failures because AMD integrated memory controllers, PCIe, and all of the GPU cores into a single piece of silicon.

    Not Epyc AMD. No Epyc at all.

      • Mr Bill
      • 12 months ago

      What happens when you glue a few of those around a CPU?

        • K-L-Waster
        • 12 months ago

        Sticky issue.

          • Mr Bill
          • 12 months ago

          Surely not with modern superscalar architecture.

      • Srsly_Bro
      • 12 months ago

      Were you watching? It's Epyc, 2 Epyc, Epyc 2.

      • ptsant
      • 12 months ago

      I heard that big GPUs are much easier to make than big CPUs because a lot of the chip surface is made from repetitive units (GCN cores), so the yields are OK. The yields for a big 28-core Intel chip are most probably abysmal.

      Different situations, different strategies.

        • BorgOvermind
        • 11 months ago

        GPUs were always ahead of CPUs in absolutely all aspects.

    • Chrispy_
    • 12 months ago

    [quote<]50% lower power for the same performance or for 1.25x the performance in the same power envelope[/quote<] I must be getting old but I really want the former and couldn't care less about the latter.

      • chuckula
      • 12 months ago

      If Nvidia came out with a new high-end product on an actually new process node and even Jen-Hsun’s CEO math only gave it a 25% boost, they’d burn him and his leather jacket at the stake.

        • ptsant
        • 12 months ago

        AMD does not claim a radical redesign. This is a 7nm version of poor Vega with some extra features, especially for INT8, half-precision, and virtualization. In that context, 25% gain from a process change is quite impressive.

        The new gen is yet to come.

          • stefem
          • 12 months ago

          It also adds double precision, but it's hardly impressive if you don't know the power envelope, especially considering the known actual performance of the fab process we are talking about.

            • Srsly_Bro
            • 12 months ago

            The number is 300 W. I saw it around yesterday. Google can help.

            • stefem
            • 12 months ago

            Then it's not impressive at all…

            • renz496
            • 11 months ago

            AMD really needs a new architecture to get power efficiency under control. There is no way around that. We already saw with Polaris that even a node shrink is no longer enough.

            • chuckula
            • 12 months ago

            OMG! 300 watts from a 300 mm^2 chip!?!?!?!

            THAT’S PHYSICALLY IMPOSSIBLE TO COOL EVAR!!!

            Yours truly,
            The Same People Who Think Cascade-Lake AP chips can’t ever possibly be cooled.

            • stefem
            • 11 months ago

            Ok, put that way, it looks really impressive… to cool.

          • psuedonymous
          • 12 months ago

          [quote<]In that context, 25% gain from a process change is quite impressive.[/quote<] Assuming it really is Tiny Vega (with no additional CUs), then the SP and HP FLOP increase is not 25%, but 7.3%. DP rate increase is massive with the switch to half-rate (packed math FP64 units, as with Volta and Tesla).

            • stefem
            • 12 months ago

            True, I guess they did improve on power consumption then (Volta has dedicated units for FP64)

            • stefem
            • 11 months ago

            It's actually a 19.6% increase, as the old Vega 10-based Instinct ran at 1500 MHz at the same 300 W TDP.

    • Krogoth
    • 12 months ago

    Can it run Crysis?

      • SlappedSilly
      • 12 months ago

      Virtually

      • Srsly_Bro
      • 12 months ago

      [url<]https://www.techpowerup.com/249299/it-cant-run-crysis-radeon-instinct-mi60-only-supports-linux[/url<] Some say wine says otherwise. Glad you're still here after the event.

      • ptsant
      • 12 months ago

      Yes, but it doesn’t have a video output so it will render in ASCII over an ssh terminal.

        • Goty
        • 12 months ago

        10/10 would play

          • ptsant
          • 12 months ago

          Oldskool
          [url<]https://www.youtube.com/watch?v=0nRPoS2WDJA[/url<]
