
A recap of the Maxwell architecture
We were actually going to take you all the way back to Fermi here, but after collating all of the research to take that seven-year trip down memory lane, we realised that a backdrop of Maxwell and Maxwell 2 is enough. You see, Maxwell never really showed up in true Tesla form like GP100 has for Pascal. Even the biggest manifestation of the Maxwell 2 microarchitecture, GM200, made some design choices that were definitely focused on satisfying consumer GeForce customers, rather than the folks that might have wanted to buy it in Tesla form for HPC applications.

Key for those HPC customers is support for double-precision arithmetic, or FP64. FP64 has no real place in what you might call a true GPU, because of the nature of graphics rendering itself. The capability is needed for certain HPC applications and algorithms, though: especially those where a highly parallel machine that looks a lot like a GPU is a good fit, and where the ratio of FP64 to lower-precision computation strongly favours baking a lot of FP64 performance into the design.

You’d expect an HPC-focused Maxwell to have at least a 1/3 FP64-to-FP32 throughput ratio like that of the big Kepler chip, GK110, that came before it. Instead, GM200 had almost the bare minimum of FP64 performance, 1/32 of the FP32 rate, without cutting it out of the design altogether. We’ll circle back to that thought later. The rest of the Maxwell microarchitecture, especially in Maxwell 2, was typical of a graphics-focused design. It’s also typical of the way Nvidia has scaled out its designs in recent generations: from the building block of a streaming multiprocessor, or SM, upwards.
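To put those ratios in perspective, here’s a back-of-envelope sketch. The peak FP32 figures used below are assumed round numbers for illustration, not official specifications; only the 1/3 and 1/32 ratios come from the text.

```python
# Illustrative FP64-to-FP32 throughput ratios, per the discussion above.
# The FP32 TFLOPS inputs are assumed round numbers, not official specs.
def fp64_tflops(fp32_tflops, ratio):
    """Peak FP64 throughput implied by an FP64:FP32 rate ratio."""
    return fp32_tflops * ratio

# GK110 (big Kepler): 1/3 rate. GM200 (big Maxwell 2): 1/32 rate.
print(f"GK110-style: {fp64_tflops(5.0, 1 / 3):.2f} TFLOPS FP64")
print(f"GM200-style: {fp64_tflops(6.0, 1 / 32):.2f} TFLOPS FP64")
```

The order-of-magnitude gap between the two results is the whole story: a chip with GM200’s ratio simply isn’t competitive for FP64-heavy workloads, whatever its FP32 peak.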

The Maxwell SM. Source: Nvidia

Nvidia groups a number of SMs in a structure that could stand on its own as a full GPU, and it calls those structures graphics processing clusters, or GPCs. Indeed, they do operate independently. A GPC has everything needed to go about the business of graphics rendering, including a full front-end with a rasterizer, the SMs that provide all of the GPC's compute and texturing ability, the required fixed-function bits like schedulers and shared memory, and a connection to the outside world and memory through the company's now-standard L2 cache hierarchy.

Maxwell GPCs contain four SMs. Each Maxwell SM is a collection of four 32-wide main scalar SIMD ALUs, each with its own scheduler. Each of the 32 lanes in a SIMD operates in unison, as you’d expect of a modern scalar SIMD design. Texturing hardware also comes along for the ride in the SM, letting the GPU get nicer access to spatially coherent (and usually filtered) data. Normally, that data is used to render your games, but it can also do useful things for compute algorithms. Fusing off the texture hardware for HPC-focused designs doesn’t make too much sense, unless you’re trying to hide that the chip used to be a GPU, of course. Each Maxwell SM offers eight samples per clock of texturing ability.

The GM200 GPU. Source: Nvidia

GM200 uses six GPCs, so it has six front-ends, six rasterisers, and six sets of back-ends and connections to the shared 3MB of L2 cache in its memory hierarchy, along with a total of 24 SMs across the whole chip (and thus 96 vec32 SIMDs and 192 samples per clock of texturing capability). With clock speeds of 1GHz or more in all of its shipping configurations, and speeds that are often even greater in its GeForce GTX 980 Ti form, especially the overclocked partner boards, it’s the most powerful single GPU to have shipped to date.
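Summing those per-SM figures across the chip gives a sense of the scale involved. A quick sketch, assuming the 1GHz floor the text cites and the usual convention of counting a fused multiply-add as two FLOPs per lane per clock:

```python
# GM200 totals from the figures in the text.
# The clock is an assumption: shipping parts run at 1GHz or more.
GPCS = 6
SMS_PER_GPC = 4
SIMDS_PER_SM = 4
LANES_PER_SIMD = 32
TEX_SAMPLES_PER_SM = 8
CLOCK_GHZ = 1.0  # assumed base clock

sms = GPCS * SMS_PER_GPC                          # 24 SMs
fp32_lanes = sms * SIMDS_PER_SM * LANES_PER_SIMD  # 3072 FP32 lanes
tex_rate = sms * TEX_SAMPLES_PER_SM               # 192 samples/clock
# One fused multiply-add per lane per clock counts as 2 FLOPs.
peak_fp32_tflops = fp32_lanes * 2 * CLOCK_GHZ / 1000

print(sms, fp32_lanes, tex_rate, peak_fp32_tflops)  # 24 3072 192 6.144
```

At the 1/32 ratio discussed earlier, that roughly 6 TFLOPS of FP32 implies under 0.2 TFLOPS of FP64, which is the crux of GM200’s HPC problem.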

If GM200 sounds big, that’s because it absolutely is. At just over 600mm², fabricated by TSMC on its 28-nm high-performance process technology, it’s pretty much the biggest GPU Nvidia could have made before tipping over the edge of the yield curve. Big GPUs can still yield usable product in decent volume, because defective chips are easy to sell in cut-down form. Even so, the yield needs to be good enough that the configurations you can actually ship, with whatever bits you’re able to turn on, turn a profit against the competitive landscape of the day.

So that’s our GP100 backdrop in a nutshell. What I’m trying to get at by painting yet another picture of the big Maxwell is that it’s mostly just a big consumer GPU, not an HPC part. Maxwell’s lack of FP64 performance hurts its usefulness in HPC applications, and Nvidia can’t ignore that forever. Intel is shipping its new Knights Landing (KNL) Xeon Phi now, and that product is an FP64 beast. It’s also capable of tricks that other GPU-like designs can’t pull off, like booting an OS by itself, because its SIMD vector units are attached to decently capable x86 cores.

Our Maxwell and GM200 recap highlights the fact that GP100 has its work cut out in a particular field: HPC. Let’s take a 10,000-foot view of how it’s been designed to tackle that market as an overall product before we dive into some of the details.