Rys Sommefeldt works for Imagination Technologies and runs Beyond3D. He took us inside Nvidia’s Fermi architecture a couple of years ago, and now he’s back with a deep dive into what we know about Nvidia’s Pascal GPU so far.
Telling you that Nvidia recently announced its first Pascal GPU, the GP100, is probably a bit redundant. It’s been the talk of the PC technology space since Nvidia CEO Jen-Hsun Huang announced the GP100-powered Tesla P100 in his inimitable, best-keynote-giver-in-Silicon-Valley style during the first keynote of the company’s recent GPU Technology Conference (GTC) in sunny San Jose.
It feels like Pascal’s been a long time coming. In reality, we haven’t deviated too far from Nvidia’s typical modus operandi of a roadmap announcement at a GTC, followed by products a couple of years down the line. The company hasn’t been able to release the first Pascal chip until now, thanks to the size it needed to be to make a generational leap in performance over prior Tesla products. Now that volume production of 16-nm transistors is possible, it’s finally time for the first big Pascal chip to arrive.
28-nm manufacturing has lasted a long time in discrete GPU land. AMD and Nvidia both skipped the 20-nm node at the various foundries because of its unsuitability for the needs of high-power semiconductor devices. Because of the long pause at 28 nanometers, people have been clamoring for the first products on newer production technologies to see what advancements they’d bring to the table. Volume manufacturing for TSMC’s 28-nm high-performance process started back in late 2011, remember!
Now that Pascal is here, at least in announcement form, I jumped at the chance to reprise my 2009 analysis of Nvidia’s Fermi architecture. Fermi was announced at GTC in September of that year, but the company mostly talked about it from the standpoint of its GPU compute potential. I took a look at that chip then, and made some guesses about what its features might mean for consumer graphics products. I’ll be performing a similar analysis this time around.
My task is a little different this time, though, because we were also told the basic graphics-focused makeup of GP100 at GTC. Thanks to those details, I don’t have to do too much speculation about the chip’s graphics features and risk getting some of them wrong, like I did with Fermi. However, reading the Pascal tea leaves has me wondering whether GP100 will ever actually be used in GeForce products.
Let’s start with a brief recap of the last generation to see where today’s chips ended up on 28 nm before we jump into the new stuff. Be warned: if you’re not interested in the bigger building blocks of GPU design and lots of talk about how many of them are present, here be dragons. Still with me? Great, because some context and background always helps set the scene. Join us now on this weird journey through Blaise’s semiconductor namesake.
A recap of the Maxwell architecture
We were actually going to take you all the way back to Fermi here, but after collating all of the research to take that seven-year trip down memory lane, we realised that a backdrop of Maxwell and Maxwell 2 is enough. You see, Maxwell never really showed up in true Tesla form like GP100 has for Pascal. Even the biggest manifestation of the Maxwell 2 microarchitecture, GM200, made some design choices that were definitely focused on satisfying consumer GeForce customers, rather than the folks that might have wanted to buy it in Tesla form for HPC applications.
Key for those HPC customers is support for double-precision arithmetic, or FP64. FP64 is something that has no real place in what you might call a true GPU, because of the nature of graphics rendering itself. That capability is needed for certain HPC applications and algorithms, though, especially those where a highly-parallel machine that looks a lot like a GPU is a good fit, and for those tasks that have a ratio of FP64 to lesser-precision computation that’s much more in favour of having a lot of FP64 performance baked into the design.
You’d expect an HPC-focused Maxwell to have at least the 1/3 FP64-to-FP32 throughput ratio of the big Kepler chip, GK110, that came before it. Instead, GM200 had almost the bare minimum of FP64 performance—1/32 of the FP32 rate—without cutting it out of the design altogether. We’ll circle back to that thought later. The rest of the Maxwell microarchitecture, especially in Maxwell 2, was typical of a graphics-focused design. It’s also typical of the way Nvidia has scaled out its designs in recent generations: from the building block of a streaming multiprocessor, or SM, upwards.
The Maxwell SM. Source: Nvidia
Nvidia groups a number of SMs in a structure that could stand on its own as a full GPU, and it calls those structures graphics processing clusters, or GPCs. Indeed, they do operate independently. A GPC has everything needed to go about the business of graphics rendering, including a full front-end with a rasterizer, the SMs that provide all of the GPC’s compute and texturing ability, the required fixed-function bits like schedulers and shared memory, and a connection to the outside world and memory through the company’s now-standard L2 cache hierarchy.
Maxwell GPCs contain four SMs. Each Maxwell SM is a collection of four 32-wide main scalar SIMD ALUs, each with its own scheduler. Each of the 32 lanes in a SIMD operates in unison with the others, as you’d expect from a modern scalar SIMD design. Texturing hardware also comes along for the ride in the SM, giving the GPU efficient access to spatially coherent (and usually filtered) data. Normally, that data is used to render your games, but it can also do useful things for compute algorithms. Fusing off the texture hardware for HPC-focused designs doesn’t make too much sense—unless you’re trying to hide that the chip used to be a GPU, of course. Each Maxwell SM offers eight samples per clock of texturing ability.
The GM200 GPU. Source: Nvidia
GM200 uses six GPCs, so it has six front-ends, six rasterisers, six sets of back-ends and connections to the shared 3MB of L2 cache in its memory hierarchy, and a total of 24 SMs across the whole chip (and thus 96 of those 32-wide SIMDs and 192 samples per clock of texturing capability). With clock speeds of 1GHz or more in all of its shipping configurations, and speeds that are often even greater in its GeForce GTX 980 Ti form—especially the overclocked partner boards—it’s the most powerful single GPU that’s shipped to date.
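To put those building blocks in perspective, here’s a quick back-of-envelope sketch in Python. The 1GHz clock is just the ballpark figure for shipping GM200 configurations mentioned above, not any particular board’s boost clock, so treat the output as a rough peak rather than a benchmark.

```python
# Rough GM200 peak-throughput arithmetic from the building blocks above.
gpcs           = 6
sms_per_gpc    = 4
simds_per_sm   = 4      # four 32-wide SIMD ALUs per Maxwell SM
lanes_per_simd = 32
clock_ghz      = 1.0    # ballpark clock for shipping GM200 parts

sms        = gpcs * sms_per_gpc                      # 24 SMs
fp32_lanes = sms * simds_per_sm * lanes_per_simd     # 3,072 lanes

# Each lane can retire a fused multiply-add (two FLOPs) per clock.
peak_fp32_tflops = fp32_lanes * 2 * clock_ghz / 1000
gtexels_per_sec  = sms * 8 * clock_ghz               # 8 samples/clk per SM

print(f"{sms} SMs, {fp32_lanes} FP32 lanes")
print(f"~{peak_fp32_tflops:.1f} TFLOPS FP32, ~{gtexels_per_sec:.0f} Gtexels/s")
# -> 24 SMs, 3072 FP32 lanes; ~6.1 TFLOPS FP32, ~192 Gtexels/s at 1GHz
```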
If GM200 sounds big, that’s because it absolutely is. At just over 600mm², fabricated by TSMC on its 28-nm high-performance process technology, it’s pretty much the biggest GPU Nvidia could have made before tipping over the edge of the yield curve. Big GPUs can still achieve decent effective yields, because it’s easy to sell imperfect chips in cut-down form. Even so, yields need to be good enough that the configurations you can actually turn on and sell remain profitable against the competitive landscape of the day.
So that’s our GP100 backdrop in a nutshell. What I’m trying to get at by painting yet another picture of the big Maxwell is that it’s mostly just a big consumer GPU, not an HPC part. Maxwell’s lack of FP64 performance hurts its usefulness in HPC applications, and Nvidia can’t ignore that forever. Intel is shipping its new Knights Landing (KNL) Xeon Phi now. That product is an FP64 beast. It’s also capable of tricks that other GPU-like designs can’t pull off, like booting an OS by itself. That’s because each of its SIMD vector units is managed by a set of decently-capable x86 cores.
Our Maxwell and GM200 recap highlights the fact that GP100 has its work cut out in a particular field: HPC. Let’s take a 10,000-foot view of how it’s been designed to tackle that market as an overall product before we dive into some of the details.
The GP100 GPU
At a high level, GP100 is still an “SMs in collections of GPCs” design, so we don’t have to develop a new understanding of how it works at the microarchitecture level—at least as far as the basics go. Nvidia has resurrected the concept of a texture-processing cluster, or TPC, as a way of grouping a pair of SMs, but we can mostly ignore that name for our purposes.
The GP100 SM. Source: Nvidia
A full, unfettered GP100 is a six-GPC design, and each of those GPCs contains 10 SMs. Nvidia announced that the first shipping product with GP100, the Tesla P100, would have 56 of its SMs enabled. It’s highly likely that Nvidia is disabling two TPCs in different GPCs to achieve that cut-down state, presumably to improve yields.
A block diagram of the GP100 GPU. Source: Nvidia
That’s because GP100 is a whopping 610mm², and it’s produced by TSMC on its 16-nm FinFET Plus (16FF+) node. 16FF+ is definitely mature, but GP100 is easily the biggest and most complex design that’s yet been manufactured using that technology. Given the potential customers for the Tesla P100, you can bet that Nvidia would absolutely turn on all 60 SMs in GP100 if it could. I’m guessing that power usage isn’t really the limiting factor for GP100, so the reason behind the deactivated SMs has to be yield-related.
Where the main hardware is concerned, the Pascal SM in GP1xx is actually much smaller than the GM2xx SM. It’s just two 32-wide main SIMD ALUs this time, rather than four. There are also big changes afoot in this main ALU, but let’s hold that thought for the time being. Also along for the ride is a separate 16-wide FP64 ALU alongside each of the main ones, giving the design “half-rate” FP64 throughput. If we multiply out all of the numbers that describe the GP100 design, you’ll see exactly what that rate ends up as: 5.3 TFLOPS. Good googly moogly. Most of the GPUs I work on for my day job at Imagination Technologies have around 1/10th of that throughput for FP32, and no FP64 ability at all. If you’re an HPC person and your code needs FP64 performance to go fast, GP100 is your very best friend.
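We can even back out the implied clock speed from that 5.3-TFLOPS figure and the SM layout described above. This is just my own arithmetic in Python, working from the numbers in this article rather than anything Nvidia has published beyond the headline rates:

```python
# Back-solving Tesla P100's numbers from the SM layout described above.
enabled_sms       = 56
fp32_lanes_per_sm = 2 * 32   # two 32-wide main SIMD ALUs per SM
fp64_lanes_per_sm = 2 * 16   # a 16-wide FP64 ALU alongside each one ("half rate")

fp32_lanes = enabled_sms * fp32_lanes_per_sm   # 3,584
fp64_lanes = enabled_sms * fp64_lanes_per_sm   # 1,792

# Nvidia quotes 5.3 TFLOPS of FP64, so the implied clock is that figure
# divided by the lane count times two FLOPs per fused multiply-add.
implied_clock_ghz = 5.3e12 / (fp64_lanes * 2) / 1e9
peak_fp32_tflops  = fp32_lanes * 2 * implied_clock_ghz / 1000

print(f"{fp32_lanes} FP32 lanes, {fp64_lanes} FP64 lanes")
print(f"implied clock ~{implied_clock_ghz:.2f} GHz, ~{peak_fp32_tflops:.1f} TFLOPS FP32")
# -> 3584 FP32 lanes, 1792 FP64 lanes; ~1.48 GHz, ~10.6 TFLOPS FP32
```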
Pascal keeps the familiar hierarchy of per-SM L1 and shared memory backed by a chip-wide L2 cache, as we’ve seen on Kepler and Maxwell, and that L2 is 4MB in size on GP100. That changes the “L2-size-per-SM” ratio significantly compared to GM200 and Maxwell, and not in the bigger-is-better direction: GP100’s 56 enabled SMs share 4MB of L2, compared to the 24 SMs that share 3MB of L2 in GM200.
While there are half as many 32-wide ALUs per SM in GP100 compared to GM200, there’s no reduction in the size of the register file (RF) that each SM has access to. That effectively gives GP100 twice the RF space per unit of ALU throughput compared to GM200. For certain classes of data-dense code like the kind you tend to find in HPC applications, that’s a very welcome change in the new chip. As an aside, if you think the 4MB of L2 cache GP100 has is a lot of on-chip memory, there’s actually more than three times that amount in total RF space if you add it all up.
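Here’s a quick sanity check of that “more than three times” claim, using the 256KiB-per-SM register file figure from the recap just below:

```python
# Total register-file capacity vs. the 4MB L2, from the figures in this article.
enabled_sms   = 56
rf_per_sm_kib = 256        # per-SM register file (see the recap below)

total_rf_mib = enabled_sms * rf_per_sm_kib / 1024      # 14 MiB of registers
l2_bytes     = 4e6                                     # 4MB of L2

print(f"{total_rf_mib:.0f} MiB of register file vs. 4MB of L2")
print(f"ratio ~{total_rf_mib * 1024 * 1024 / l2_bytes:.1f}x")
# -> 14 MiB of registers, roughly 3.7x the L2 capacity
```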
Six GPCs made up of 10 SMs each—every one of them with lots of welcome FP64 ALU performance—plus a large per-SM register file all want to be fed by a beefy memory subsystem, one that delivers a nice “bytes-per-FLOP” ratio: the metric that really matters for devices like this. To get there, Nvidia is using the second version of High Bandwidth Memory (HBM2) for GP100. I’ll leave the gory details of that memory for later, but there’s a huge increase in external memory bandwidth for GP100 compared to what was possible in GM200 and other GPUs that relied on GDDR5. That’s even the case with the conservative clocks Nvidia has chosen for the HBM2 configuration in GP100.
From an HPC standpoint, at least, we’re pretty much done with our high-level view of how GP100 is constructed. For a recap: the chip has six GPCs with 10 SMs in each, and every SM gets 256KiB of register file to play with. Nvidia has turned off four SMs across the chip, in the form of two TPCs (one from each of two unlucky GPCs). All of the SMs share a 4MB L2 cache, which then connects on to a very wide, high-throughput HBM2 memory system.
Let’s take a closer look at the SM to see what’s changed in the ALUs and how they interact with the register file. The changes help both HPC and graphics applications, so they’re particularly interesting.
GP100 and FP16 performance
The biggest change in the Pascal microarchitecture at the SM level is support for native FP16 (or half-precision) arithmetic. Rather than dedicate a separate ALU structure to FP16 like it does with FP64 hardware, Pascal runs FP16 arithmetic by cleverly reusing its FP32 hardware. It won’t be completely apparent how Pascal does this until the chip’s ISA is released, but we can take a guess.
Nvidia has disclosed that the hardware supports data packing and unpacking from the regular 32-bit wide registers, along with the required sub-addressing. Along with the huge RF we discussed earlier, it’s highly likely that GP100 splits each FP32 SIMD lane in the ALU into a “vec2” type of arrangement, and those vec2 FP16 instructions then address two halves of a single register in the ISA. This method is probably identical to how Nvidia supported FP16 in the Maxwell Tegra X1. If that’s the case, Pascal isn’t actually the first Nvidia design of the modern era to support native FP16, but it is the first design destined for a discrete GPU.
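We can illustrate that packing idea without any GPU hardware at all. Here’s a minimal sketch in Python using NumPy’s native float16 type; it shows two FP16 values occupying the two halves of a single 32-bit word and being operated on as a pair. To be clear, this only illustrates the storage layout a vec2 arrangement implies, not GP100’s actual ISA encoding or instruction set.

```python
import numpy as np

def pack_half2(a: float, b: float) -> np.uint32:
    """Round two floats to FP16 and pack them into one 32-bit word."""
    return np.array([a, b], dtype=np.float16).view(np.uint32)[0]

def unpack_half2(word: np.uint32) -> np.ndarray:
    """Split a 32-bit word back into its two FP16 lanes."""
    return np.array([word], dtype=np.uint32).view(np.float16)

def hadd2(x: np.uint32, y: np.uint32) -> np.uint32:
    """A 'vec2' add: both FP16 lanes are added in one operation."""
    return (unpack_half2(x) + unpack_half2(y)).view(np.uint32)[0]

r = hadd2(pack_half2(1.5, -2.0), pack_half2(0.25, 4.0))
print(unpack_half2(r))   # -> [1.75  2.]
```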
Because the FP16 capability is part of the same ALU that GP100 already needs to support FP32, it’s reasonably cheap to design in terms of on-die area. Including FP16 support offers benefits to a couple of big classes of programs that might be run on a GP100 in its useful lifetime. Because GP100 only powers Tesla products right now (and may always do so), Nvidia’s messaging around FP16 support focuses on how it helps deep learning algorithms. This capability makes for a big performance jump when running those algorithms, and it also offers a reduction in required storage and movement of the data required to feed those algorithms. Those savings are mainly in the form of memory bandwidth, although we’ll soon see that GP100 has plenty of that, too.
The second obvious big winner for native FP16 support is graphics. FP16 throughput is up to twice that of FP32 math, and lots of modern shader programs can be run at reduced precision if the shader language and graphics API support it. In turn, those programs can take advantage of native FP16 support in hardware. That “up-to” caveat is important, though, because it highlights the fact that there’s a vectorization aspect to FP16; it’s not just “free.” FP16 support is part of many major graphics APIs these days, so native FP16 hardware could deliver big performance benefits for a GeForce Pascal in gaming applications, as well.
Wide and fast: GP100’s HBM2 memory subsystem
We’re in the home stretch of describing what’s new in Pascal compared to Maxwell, at least in the context of GP100. AMD was first to market with HBM, putting it to critically-acclaimed use with its Fiji GPU in a range of Radeon consumer products. HBM brings two big benefits to the table, and AMD took advantage of both of these: lots and lots of dedicated bandwidth, and a much smaller package size.
In short, HBM individually connects the memory channels of a number of DRAM devices directly to the GPU, by way of a clever physical packaging method and a new wiring technology. The DRAM devices are stacked on top of each other, and the parallel channels connect to the GPU using an interposer. That means the GPU sits on top of a big piece of passive silicon with wires etched into it, and the DRAM devices sit right next to the GPU on that same big piece of silicon. As you may have guessed, the interposer lets all of those parts sit together on one package.
Nvidia’s pictures of the GP100 package (and the cool NVLink physical interconnect) show you what I mean. Each of the four individual stacks of DRAM devices talks to the GPU using a 1024-bit memory interface. High-end GPUs had bounced between 256-bit and 512-bit bus widths for some time before the rise of HBM; now, with HBM, we get 1024-bit memory interfaces per stack. Each stack has a maximum memory capacity defined by the JEDEC standards body, so aggregate memory bandwidth and memory capacity are intrinsically linked in designs that use HBM.
GP100 connects to four 1024-bit stacks of HBM2, each made up of four 8Gb DRAM layers, for 16GB of memory in total. The peak clock of HBM2 in the JEDEC specification is 1000 MHz (an effective 2000 MT/s per pin thanks to HBM2’s double data rate), giving a per-stack bandwidth of 256GB/s, or just over 1TB/s across a four-stack setup. Nvidia has chosen to clock GP100’s HBM2 at a more conservative 700 MHz, or an effective 1400 MT/s. GP100 therefore has just a touch less than 720GB/s of memory bandwidth, or around double that of the fastest possible GDDR5-equipped GPU on a 384-bit bus (like GM200).
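Working those figures through, as a quick Python sketch rather than anything official:

```python
# GP100 memory-bandwidth arithmetic from the figures above.
stacks             = 4
bus_bits_per_stack = 1024
data_rate_gtps     = 1.4     # 700 MHz clock, double data rate -> 1.4 GT/s per pin

per_stack_gbps = bus_bits_per_stack * data_rate_gtps / 8     # GB/s per stack
total_gbps     = per_stack_gbps * stacks
spec_peak_gbps = stacks * bus_bits_per_stack * 2.0 / 8       # 2 GT/s per pin at spec peak
capacity_gb    = stacks * 4 * 8 / 8                          # four 8Gb layers per stack

print(f"{per_stack_gbps:.0f} GB/s per stack, {total_gbps:.0f} GB/s total")
print(f"JEDEC peak would be {spec_peak_gbps:.0f} GB/s; capacity {capacity_gb:.0f} GB")
# -> 179 GB/s per stack, 717 GB/s total; spec peak 1024 GB/s; 16 GB
```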
The downside of all of that bandwidth is its cost. The interposer silicon has to be big enough to hold the GPU and four stacks of HBM, and we already noted that the GP100 die is a faintly ridiculous 610 mm² on a modern 16-nm process. Given that information, I’m guessing the GP100 interposer is probably on the order of 1000 mm². We could work it out together, you and I, but my eyeballing of the package in Nvidia’s whitepaper tells me that I’m close, so let’s keep our digital calipers in our drawers.
1000-mm² pieces of silicon—with etched features, remember, so there’s lithography involved—are expensive, even if those features are regular and reasonably straightforward to image and manufacture. They’re cut from the same 300-mm silicon wafers as normal processors, too, so chipmakers only get a relatively small handful of them per wafer, and the long sides of the interposer result in quite a lot of wasted space around the edge of the circular wafer. We wouldn’t be surprised if making the interposer alone results in a per-unit cost of around two of Nvidia’s low-end discrete graphics cards in their entirety: GPU, memories, PCB, display connectors, SMT components, and so on.
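To get a feel for just how few of those interposers come out of a wafer, here’s a rough packing estimate. The 40 x 25 mm dimensions are purely my guess at a plausible shape for a roughly 1000-mm² part, not a published figure, so take the result as an order-of-magnitude number.

```python
import math

# Rough count of ~1000-mm^2 rectangular interposers that fit on a 300-mm wafer.
WAFER_RADIUS = 150.0        # mm
DIE_W, DIE_H = 40.0, 25.0   # mm (my guess at the interposer's shape)
SCRIBE       = 0.1          # mm of scribe line between dies

def gross_dies(w: float, h: float) -> int:
    pitch_x, pitch_y = w + SCRIBE, h + SCRIBE
    cols = int(2 * WAFER_RADIUS // pitch_x)
    rows = int(2 * WAFER_RADIUS // pitch_y)
    x0, y0 = -cols * pitch_x / 2, -rows * pitch_y / 2
    count = 0
    for i in range(cols):
        for j in range(rows):
            x, y = x0 + i * pitch_x, y0 + j * pitch_y
            corners = [(x, y), (x + w, y), (x, y + h), (x + w, y + h)]
            # Keep a candidate die only if it lies entirely on the wafer.
            if all(math.hypot(cx, cy) <= WAFER_RADIUS for cx, cy in corners):
                count += 1
    return count

print(gross_dies(DIE_W, DIE_H))   # roughly 50, versus a couple hundred for a ~300-mm² GPU
```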
Now that we have a good picture of the changes wrought in Pascal’s microarchitecture and memory system in the compute-oriented GP100, we can have a go at puzzling over what the first GeForce products that contain Pascal might look like.
GeForce Pascals: some wild guesses
When we start thinking about what Pascal might look like in consumer GeForces, I have a couple of guesses. The major changes that Nvidia is likely to make in these parts boil down to two things: FP64 compute and the use of HBM2.
To repeat what we concluded earlier, FP64 is completely useless for graphics, and it takes up a lot of die area. That’s especially true for the dedicated SIMDs needed to run FP64 alongside the main FP32-and-FP16 pipeline, as with the GP100 design. To keep costs down for consumers, I’m expecting Nvidia to effectively remove FP64 in the chips that arrive to power GeForce models. It’ll still be there because it can’t disappear completely, but it’ll probably just be 1/32-rate like we got in GM200.
Then there’s HBM2. I’d have argued for its inclusion in GeForce Pascals a few months ago, but GDDR5X is on the way. This memory doubles the prefetch length and also should come with a fairly large increase in effective clock speed. It’ll be cheaper to use than HBM2 at similar aggregate bandwidths, and it’s cheaper to implement at the on-chip PHY level—not to mention the savings from the lack of an interposer and stack packaging. GDDR5X also doesn’t have strict rules tying bandwidth to capacity. That lets Nvidia use memory sizes other than 4GB, 8GB, 12GB, or 16GB on its GeForce products, compared to the limitations of HBM2.
Given those guesses, I think there’s at least one consumer chip that’s still really big, but quite a bit smaller than 610 mm². It probably has similar overall throughput to GP100 in the metrics we care about for graphics, and it’ll probably come with less memory capacity. Even so, it should still have plenty of overall bandwidth. Some rumours say this chip is called GP102. I think it’ll have 56 to 60 SMs, 1/32-rate FP64 throughput, and more than 8GB of 384-bit GDDR5X. If it exists, then it’s likely destined for a Titan-class card first, and maybe an enthusiast’s favourite “Ti” product later on.
Nvidia is also likely working on a GM204 replacement for the pair of high-end GeForce non-Tis that make up the meat of the enthusiast market these days. It’s likely called GP104. That chip will also have token FP64 throughput—remember that these are GPUs, not HPC cards. I also bet it’ll have 8GB of 256-bit GDDR5X, 40 SMs or thereabouts, and all the associated texturing and back-end throughput that implies, in a die of around 300 mm².
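To put some rough numbers behind those memory guesses: the per-pin data rates below are assumptions on my part (7 Gb/s for GM200-class GDDR5 and 10 Gb/s for early GDDR5X), so treat the results as ballpark figures rather than a spec sheet.

```python
# Aggregate bandwidth for a few plausible memory configurations.
def bandwidth_gbps(bus_width_bits: int, per_pin_gbps: float) -> float:
    return bus_width_bits * per_pin_gbps / 8

print(bandwidth_gbps(384, 7.0))    # GM200-style 384-bit GDDR5      -> 336 GB/s
print(bandwidth_gbps(384, 10.0))   # guessed 384-bit GDDR5X "GP102" -> 480 GB/s
print(bandwidth_gbps(256, 10.0))   # guessed 256-bit GDDR5X "GP104" -> 320 GB/s
print(bandwidth_gbps(4096, 1.4))   # GP100's four HBM2 stacks       -> ~717 GB/s
```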
After that, I don’t really want to put a flag in the ground. Expect something else for the “GTX 1060” part of the product line, and something else again for the “GTX 1050” and below, probably at die sizes of around 100 mm². By then, we may well be onto GP2xx chips, with the design changing in other small ways.
We’ve left out discussion of some other really interesting bits of GP100, should you want to go read about them yourself. Nvidia’s own architecture whitepaper is a good resource, so I’d recommend reading it and focusing on two things. The first is the details of the NVLink interconnect, which Nvidia uses extensively in the construction of its DGX-1 rackable supercomputer. The other point of interest is the fact that GP100 can now service its own page faults without host intervention. That feature has some really exciting applications for graphics, but it’s too big of a topic to cover here. We don’t know whether it will make it into GeForce Pascals, but it’s definitely worth keeping an eye out for.
Anyway, we hope that trip through a Maxwell refresher, an overview of Pascal and GP100, a look at HBM2 and its associated costs, and some guesses about the other Pascals has whetted your appetite for the upcoming pitched battle between Nvidia’s Pascal and AMD’s Polaris. I, for one, need something to drive an Oculus Rift. Maybe two of those somethings, one for each eye. Jeff’s also jonesin’ for a fix of big, powerful GPUs, given that the Oculus Rift and HTC Vive have shown up at TR HQ recently. It’s about time new GPUs made possible by new manufacturing technology showed up, and the groundswell of VR adoption is probably going to be a really good kicker for whatever hits the market from both Nvidia and AMD.