GP100 and FP16 performance
The biggest change in the Pascal microarchitecture at the SM level is support for native FP16 (or half-precision) arithmetic. Rather than dedicate a separate ALU structure to FP16 like it does with FP64 hardware, Pascal runs FP16 arithmetic by cleverly reusing its FP32 hardware. It won’t be completely apparent how Pascal does this until the chip's ISA is released, but we can take a guess.
Nvidia has disclosed that the hardware supports data packing and unpacking from the regular 32-bit wide registers, along with the required sub-addressing. Along with the huge RF we discussed earlier, it’s highly likely that GP100 splits each FP32 SIMD lane in the ALU into a “vec2” type of arrangement, and those vec2 FP16 instructions then address two halves of a single register in the ISA. This method is probably identical to how Nvidia supported FP16 in the Maxwell Tegra X1. If that's the case, Pascal isn’t actually the first Nvidia design of the modern era to support native FP16, but it is the first design destined for a discrete GPU.
Because the FP16 capability is part of the same ALU that GP100 already needs to support FP32, it’s reasonably cheap to design in terms of on-die area. Including FP16 support offers benefits to a couple of big classes of programs that might be run on a GP100 in its useful lifetime. Because GP100 only powers Tesla products right now (and may always do so), Nvidia’s messaging around FP16 support focuses on how it helps deep learning algorithms. This capability makes for a big performance jump when running those algorithms, and it also offers a reduction in required storage and movement of the data required to feed those algorithms. Those savings are mainly in the form of memory bandwidth, although we'll soon see that GP100 has plenty of that, too.
The second obvious big winner for native FP16 support is graphics. The throughput of the FP16 hardware is up to twice as fast of that as FP32 math, and lots of modern shader programs can be run at reduced precision if the shader language and graphics API support it. In turn, those programs can take advantage of native FP16 support in hardware. That "up-to" caveat is important, though, because it highlights the fact that there's a vectorization aspect to FP16; it’s not just "free." FP16 support is part of many major graphics APIs these days, so a GeForce Pascal is ideally suited to produce big potential benefits in performance for gaming applications, as well.
Wide and fast: GP100's HBM2 memory subsystem
We’re in the home stretch of describing what’s new in Pascal compared to Maxwell, at least in the context of GP100. AMD was first to market with HBM, putting it to critically-acclaimed use with its Fiji GPU in a range of Radeon consumer products. HBM brings two big benefits to the table, and AMD took advantage of both of these: lots and lots of dedicated bandwidth, and a much smaller package size.
In short, HBM individually connects the memory channels of a number of DRAM devices directly to the GPU, by way of a clever physical packaging method and a new wiring technology. The DRAM devices are stacked on top of each other, and the parallel channels connect to the GPU using an interposer. That means the GPU sits on top of a big piece of passive silicon with wires etched into it, and the DRAM devices sit right next to the GPU on that same big piece of silicon. As you may have guessed, the interposer lets all of those parts sit together on one package.
Nvidia’s pictures of the GP100 package (and the cool NVLink physical interconnect) show you what I mean. Each of the four individual stacks of DRAM devices talk to the GPU using a 1024-bit memory interface. High-end GPUs have bounced between 256-bit and 512-bit bus widths for some time before the rise of HBM. Now, with HBM, we get 1024-bit memory interfaces per stack. Each stack has a maximum memory capacity defined by the JEDEC standards body, so aggregate memory bandwidth and memory capacity are intrinsically linked in designs that use HBM.
GP100 connects to four 1024-bit stacks of HBM2, each made up of four 8Gb DRAM layers. In total, GP100 has 16GB of memory. The peak clock of HBM2 in the JEDEC specification is 1000 MT/s, giving a per-stack bandwidth of 256GB/sec, or 1TiB/sec across a four-stack setup. Nvidia has chosen to clock GP100's HBM2 at 700 MT/s, or an effective 1400 MT/s thanks to HBM2's double data rate. GP100 therefore has just a touch less than 720GB/sec of memory bandwidth, or around double that of the fastest possible GDDR5-equipped GPU on a 384-bit bus (like GM200).
The downside of all of that bandwidth is its cost. The interposer silicon has to be big enough to hold the GPU and four stacks of HBM, and we already noted that the GP100 die is a faintly ridiculous 610 mm² on a modern 16-nm process. Given that information, I'm guessing the GP100 interposer is probably on the order of 1000 mm². We could work it out together, you and I, but my eyeballing of the package in Nvidia’s whitepaper tells me that I’m close, so let’s keep our digital calipers in our drawers.
1000-mm² pieces of silicon—with etched features, remember, so there’s lithography involved—are expensive, even if those features are regular and reasonably straightforward to image and manufacture. They’re cut from the same 300-mm silicon wafers as normal processors, too, so chipmakers only get a relatively small handful of them per wafer. The long sides of the interposer will result in quite a lot of wasted space on the circular wafer, too. We wouldn’t be surprised if making the interposer alone results in a per-unit cost of around two of Nvidia’s low-end discrete graphics cards in their entirety: GPU, memories, PCB, display connectors, SMT components, and so on.
Now that we have a good picture of the changes wrought in Pascal's microarchitecture and memory system in the compute-oriented GP100, we can have a go at puzzling over what the first GeForce products that contain Pascal might look like.