Last week at its GPU Technology Conference, Nvidia unveiled the first details of its upcoming GK110 GPU, the “real” Kepler chip and bigger brother to the GK104 silicon powering the GeForce GTX 600 series. Although the GK110 won’t be hitting the market for some time yet, Nvidia’s increasing focus on GPU-computing applications has changed the rules, causing the GPU firm to show its cards well ahead of the product’s release. As a result, we now know quite a bit about the GK110’s architecture and the mix of resources it offers for GPU-computing work. With a little conjecture, we can probably paint a fairly accurate picture of its graphics capabilities, too.
Let’s start with the GK110’s basic specifications. Since we’ve known the GK104’s layout for a while now, the exact dimensions of its bigger brother have been the subject of some speculation. Turns out most of our guesses weren’t too far from the mark, although there are a few surprises. We don’t have its exact dimensions yet, but the chip itself is likely to be enormous; it packs in 7.1 billion transistors, roughly double the count of the GK104. The die shot released by Nvidia offers some clear hints about how those transistors have been allocated, as you can see below.
The GK110 is divided into five of the deep green structures above, which are almost certainly GPCs, or graphics processing clusters, nearly complete GPUs unto themselves. Each of those GPCs houses three SMX cores, and Nvidia has confirmed the chip hosts a total of 15 of those. By contrast, the GK104 has four GPCs with two SMX cores each, so the GK110 nearly doubles its per-clock processing power.
Ringing three sides of the chip are its six 64-bit memory controllers, giving it an aggregate 384-bit path to memory, 50% more than the GK104. That’s not an increase in interface width from the big Kepler’s true predecessor, the Fermi-based GF110, but GDDR5 data rates are up by roughly 50% in the Kepler generation, so there’s a bandwidth increase on tap, regardless. Looks like the PCI Express interface is on the upper edge of the chip; it has been upgraded to Gen3, with twice the peak data rates of Gen2 devices.
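Peak memory bandwidth falls straight out of bus width and data rate, so the numbers above are easy to sanity-check. A quick sketch, assuming the GK110 ships with the same 6 GT/s GDDR5 as the GeForce GTX 680 (Nvidia hasn't announced actual memory clocks):

```python
# Peak bandwidth = bus width in bytes * transfers per second.
# 6000 MT/s matches the GTX 680's GDDR5; GK110's real clocks are unannounced.

def bandwidth_gbs(bus_bits, data_rate_mts):
    """Peak memory bandwidth in GB/s for a given bus width and data rate."""
    return bus_bits / 8 * data_rate_mts / 1000

print(bandwidth_gbs(256, 6000))  # GK104 (GTX 680): 192 GB/s
print(bandwidth_gbs(384, 6000))  # GK110 at the same data rate: 288 GB/s
print(bandwidth_gbs(384, 4000))  # Fermi-based GF110 (GTX 580): 192 GB/s
```

At matched data rates, the wider bus alone buys the GK110 a 50% bandwidth advantage over the GK104.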
Because it has a dual mission, serving both the GPU-computing and video card markets, the GK110 has a somewhat different character than the GK104. As you’ve likely noted, in some cases it has twice the capacity of the GK104, while other increases are closer to 50%. More notably, the GK110 has some compute-oriented features the GK104 lacks, including ECC support (for both on-chip storage and off-chip memory) and the ability to process double-precision floating-point math at much higher rates. (The GK104 has token double-precision support, at 1/24th the single-precision rate, purely to maintain compatibility. Single-precision datatypes tend to be entirely sufficient for real-time graphics and for most consumer GPU-computing applications.)
Nvidia said repeatedly at the show that increasing double-precision performance was a major objective for the big Kepler chip, and it appears the firm is on track to deliver. The GF110-based Tesla M2090 card is rated for a peak of 666 DP gigaflops, and Nvidia claims the GK110-based Tesla K20 will exceed one teraflops. If we assume a relatively conservative clock rate of 700MHz for the Tesla product, we’d expect the K20 to double the M2090’s throughput, to 1.3 teraflops.
The ceiling may be even higher than that. Nvidia’s press release about the K20 cryptically says the GK110 “delivers three times more double precision performance compared to Fermi architecture-based Tesla products,” and Huang said something similar in his keynote. In other presentations, though, the 3X claims were tied to power efficiency, as in three times the DP flops per watt, which seems like a more plausible outcome—and a very good one, since power constraints are paramount in virtually any computing environment these days. In order to deliver full-on three times the DP flops of Fermi-based Tesla cards, the K20 would have to run at nearly 1GHz. It’s possible the K20 could reach that speed temporarily thanks to Nvidia’s new driver-based dynamic voltage and frequency scaling mechanism (dubbed GPU Boost in the GeForce products), but it seems unlikely the K20 will achieve sustained operation at that frequency.
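The arithmetic behind those estimates is simple enough to write down. Using the figures from the article (15 SMX cores, 64 DP units per SMX, and an FMA counting as two flops):

```python
# Back-of-the-envelope DP throughput for the GK110-based Tesla K20.
SMX_COUNT = 15
DP_UNITS_PER_SMX = 64
FLOPS_PER_UNIT = 2  # a fused multiply-add counts as two flops

def dp_gflops(clock_mhz):
    """Peak double-precision gigaflops at a given core clock."""
    return SMX_COUNT * DP_UNITS_PER_SMX * FLOPS_PER_UNIT * clock_mhz / 1000

print(dp_gflops(700))        # ~1344 Gflops at a conservative 700 MHz

fermi_m2090 = 666            # rated DP peak of the GF110-based Tesla M2090
target = 3 * fermi_m2090     # a literal "3X Fermi" reading: 1998 Gflops
clock_needed = target * 1000 / (SMX_COUNT * DP_UNITS_PER_SMX * FLOPS_PER_UNIT)
print(round(clock_needed))   # ~1041 MHz, which is why a flat 3X seems unlikely
```

At 700 MHz, the K20 lands almost exactly on double the M2090's 666 gigaflops; a sustained 3X would demand a clock north of 1GHz.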
The SMX core
The single biggest change in the Kepler architecture is the redesigned shader multiprocessor core, nicknamed the SMX.
From a block diagram standpoint, the GK110’s SMX looks very much like the GK104’s, with the same basic set of resources, from the 192 single-precision shader ALUs right down to the 16 texels per clock of texture filtering. That’s a departure from the Fermi generation, where the GF104’s SM mixed things up a bit. The only major change from the GK104 is the addition of 64 double-precision math units. At least, that’s what the official block diagram tells us, but I’m having a hard time believing the DP execution units are entirely separate from the single-precision ones. Odds are that the GK110 breaks up those 64-bit numbers into two pieces and uses a pair of ALUs to process them together, or something of that nature.
Our understanding is that the SMX has eight basic execution units, four units with 32 ALUs each and another four with 16 ALUs each. We suspect double-precision math is handled on the four 32-wide execution units, with the 16-wide units left idle. The numbers work out if that’s the case, at least. The GK110 can process 64 double-precision ops per clock, one third of its single-precision rate.
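Here's a quick check of that speculated layout. The pairing of lanes for 64-bit math is our conjecture, not something Nvidia has confirmed, but the per-clock rates do fall out cleanly:

```python
# Speculated SMX execution-unit layout: four 32-wide units plus four 16-wide.
wide_units, wide_lanes = 4, 32
narrow_units, narrow_lanes = 4, 16

sp_per_clock = wide_units * wide_lanes + narrow_units * narrow_lanes

# Conjecture: DP runs only on the 32-wide units, pairing lanes for 64-bit ops,
# while the 16-wide units sit idle.
dp_per_clock = wide_units * wide_lanes // 2

print(sp_per_clock)  # 192 single-precision ops per SMX per clock
print(dp_per_clock)  # 64 double-precision ops: one third the SP rate
```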
All this talk of rates brings up another issue with the Kepler generation. As David Kanter has pointed out, the SMX’s big increases in shader flops have been accompanied by proportionately smaller increases in local storage capacity and bandwidth. As a result, key architectural ratios like bandwidth per flop have declined, even though the chip’s overall power has increased. The GK110 has a new trick that should help offset this change in ratios somewhat: the SMX’s 48KB L1 texture cache can now be used as a read-only cache for compute, bypassing the texture unit. Apparently some clever CUDA coders were already making use of this cache in older GPUs, but with GK110, they won’t have to contend with texture filtering and the like.
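To put a rough number on the shifting ratios, here's the same trend measured with off-chip figures from the spec table later in the article (Kanter's point concerns on-chip storage, but the ratios move in the same direction):

```python
# Bytes of memory bandwidth available per single-precision flop, as a rough
# proxy for the declining bandwidth-per-flop ratio across generations.

def bytes_per_flop(bandwidth_gbs, sp_tflops):
    """GB/s divided by Gflops yields bytes per flop."""
    return bandwidth_gbs / (sp_tflops * 1000)

gtx_580 = bytes_per_flop(192, 1.6)   # Fermi-based GTX 580
gtx_680 = bytes_per_flop(192, 3.1)   # Kepler-based GTX 680
print(round(gtx_580, 3), round(gtx_680, 3))  # ~0.12 vs ~0.062 bytes/flop
```

By this crude measure, Kepler has roughly half the bandwidth per flop of Fermi, which is exactly the sort of shift the new read-only cache path is meant to cushion.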
Along the same lines, the GK110’s shared L2 cache has doubled in size from Fermi, to 1.5MB, and it has twice the bandwidth per clock, as well. Yes, the ALU count has more than doubled, but the increases in cache size and bandwidth should still mean an improvement, even with the shifting ratios.
Built for compute
The GK110 includes some other compute-oriented provisions that the GK104 lacks, and those are intended to deal with the growing problem of keeping a massively parallel GPU fully occupied with work.
Fermi and prior chips have only a single work queue, so incoming commands from the CPU are serialized, and work can only be submitted by, effectively, a single CPU core. As a result, even though Fermi supports multiple concurrent kernels, Nvidia claims the GPU often isn’t fully occupied when running complex programs. To remedy this situation, the GK110 has 32 work queues, managed in hardware, so it can be fed by multiple CPU threads running on multiple CPU cores. Nvidia has oh-so-cleverly named this new capability “Hyper-Q”.
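A toy model (plain Python, not CUDA, and idealized: it assumes queues drain fully concurrently) illustrates why the extra queues matter when many small, independent kernels arrive at once:

```python
# Idealized model of kernel submission: assign each kernel to the
# earliest-available queue and report when the last one finishes.

def makespan(kernel_durations, num_queues):
    """Total time to drain all kernels, assuming queues run concurrently."""
    free_at = [0.0] * num_queues
    for duration in kernel_durations:
        q = free_at.index(min(free_at))  # pick the earliest-free queue
        free_at[q] += duration
    return max(free_at)

small_kernels = [1.0] * 32          # 32 independent, equal-sized kernels
print(makespan(small_kernels, 1))   # Fermi-style single queue: 32.0
print(makespan(small_kernels, 32))  # GK110's 32 Hyper-Q queues: 1.0
```

Real workloads won't see a clean 32X, of course; the point is that independent work no longer false-serializes behind a single submission path.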
The other big hitter is a feature called Dynamic Parallelism. In a nutshell, the big Kepler gives programs running on the GPU the ability to spawn new programs without going back to the CPU for help. Among other things, this feature allows a common logic structure, the nested loop, to work properly and efficiently on a GPU.
Perhaps the best illustration of this capability is the classic computing case of evaluating a fractal image like a Mandelbrot set. On the GK110, a Mandelbrot routine could evaluate the entire image area by breaking it into a coarse grid and checking to see which portions of that grid contain an edge. The blocks that do not contain an edge wouldn’t need to be further evaluated, and the program could “zoom in” on the edge areas to compute their shape in more detail. The program could repeat this process multiple times, each time ignoring non-edge blocks and focusing closer on blocks with edges in them, in order to achieve a very high resolution result without performing unnecessary work—and without constantly returning to the CPU for guidance.
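The coarse-to-fine pattern described above can be sketched in plain Python (on the GK110 itself, each subdivision step could be a kernel launched from the GPU rather than a host-side recursion):

```python
# Sketch of adaptive Mandelbrot evaluation: sample a coarse grid, then
# subdivide only blocks whose corner samples disagree, i.e. blocks that
# straddle an edge of the set.

def in_set(cx, cy, max_iter=64):
    """Return True if (cx, cy) appears to belong to the Mandelbrot set."""
    zx = zy = 0.0
    for _ in range(max_iter):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return False
    return True

def adaptive_mandelbrot(x0, y0, extent, coarse, depth):
    """Count the membership evaluations the adaptive scheme performs."""
    evaluated = [0]

    def refine(x, y, size, d):
        corners = [in_set(x + dx * size, y + dy * size)
                   for dx in (0, 1) for dy in (0, 1)]
        evaluated[0] += 4
        if d == 0 or all(corners) or not any(corners):
            return  # block looks uniform (or max depth reached): stop
        half = size / 2.0
        for dx in (0, 1):      # "launch" four child blocks to zoom in on
            for dy in (0, 1):  # the edge, with no CPU round trip
                refine(x + dx * half, y + dy * half, half, d - 1)

    cell = extent / coarse
    for i in range(coarse):
        for j in range(coarse):
            refine(x0 + i * cell, y0 + j * cell, cell, depth)
    return evaluated[0]

adaptive = adaptive_mandelbrot(-2.0, -1.5, 3.0, coarse=8, depth=3)
uniform = (8 * 2 ** 3 + 1) ** 2  # a full-resolution uniform grid, for scale
print(adaptive, uniform)
```

The interesting part is the recursion: non-edge blocks are abandoned immediately, so the work concentrates on the boundary, and with Dynamic Parallelism the GPU can drive that decision itself.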
Since, as we understand it, pretty much any data-parallel computing problem requires a data set that can be mapped to a grid, the usefulness of Dynamic Parallelism ought to be pretty wide-ranging. Also, Nvidia claims it simplifies the programming task just by allowing the presence of nested loop logic. Obviously, these benefits won’t show up in a peak flops count, but they should improve the GPU’s real-world effectiveness, regardless.
Nvidia has tweaked the programming model for Kepler in several more ways. A new “shuffle” instruction allows for data to be passed between threads without going through local storage. Atomic operations have been beefed up, with int64 versions of some operations joining their int32 counterparts. Kepler’s combination of a shorter pipeline and more atomic units should increase performance, too. Nvidia claims the atomic ops that were slowest on Fermi will be as much as ten times faster on Kepler, and even the fastest atomics on Fermi will be twice as fast on the GK110. Also, Kepler’s ISA encoding allows up to 255 registers to be associated with each thread, up from 63 in Fermi.
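To make the shuffle idea concrete, here's a plain-Python model of a shuffle-down sum across a 32-lane warp; the lane-exchange step stands in for the CUDA intrinsic, and no shared memory is touched:

```python
# Model of a warp-level reduction built on shuffle-style lane exchange:
# each "lane" reads a value directly from another lane's register.

def shuffle_down_reduce(lane_values):
    """Sum across a warp; lane 0 ends up holding the total."""
    vals = list(lane_values)
    offset = len(vals) // 2
    while offset > 0:
        # Analogue of shuffle-down: lane i reads lane i+offset's value.
        vals = [vals[i] + (vals[i + offset] if i + offset < len(vals) else 0)
                for i in range(len(vals))]
        offset //= 2
    return vals[0]

print(shuffle_down_reduce(range(32)))  # sum of 0..31 = 496
```

In hardware, each exchange is a single register-to-register move per lane, which is why skipping the trip through local storage pays off.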
A GK110-based GeForce?
Nvidia has done a tremendous amount of work, from the hardware to software to promotion and more, to cultivate a market for its graphics chips as data-parallel processors for use in supercomputing, HPC, and academia. GTC 2012 featured a total of 340 different sessions presented by folks from a broad range of disciplines, and virtually all of the presenters were using GPUs for something other than real-time graphics.
If your interest in GPUs was, like mine, first sparked by graphics and gaming, you might be wondering about the prospects for a GeForce card based on the GK110. Trouble is, those prospects have been dampened somewhat by Nvidia’s success in other areas. The GK110 won’t reach the market until the fourth quarter of 2012, and multiple folks from Nvidia forthrightly admitted to us that those chips are already sold out through the end of 2012. All of those sales are to supercomputing clusters and the like, where each chip commands a higher price than it would aboard a video card. One gentleman seated in front of us at the GK110 deep-dive session mentioned in passing that he had 15,000 of the chips on order, which was his reason for attending.
The Nvidia executives we talked with raised the possibility of a GK110-based GeForce being released this year only if necessary to counter some move by rival AMD. That almost certainly means that any GK110-based GeForce to hit the market in 2012 would come in extremely limited quantities.
Nevertheless, with the information Nvidia has revealed about the GK110 and a dash of speculation, we can paint a picture of how a GeForce card based on the big Kepler might look. Note that we’re assuming a higher clock frequency for the consumer graphics card than we have for the Tesla K20. Beyond the clock speeds, which affect all of the rates, we’re only guessing about a couple of graphics capabilities. Nvidia hasn’t officially confirmed that the GK110 has five GPCs, although we do have the die shot. Similarly, we’d expect 48 pixels per clock of ROP throughput to accompany its six memory channels, if the GK110 retains the same arrangement as the mid-sized Kepler.
| | GTX 580 (GF110) | GTX 680 (GK104) | GeForce GK110 (projected) | Radeon HD 7970 (Tahiti) |
|---|---|---|---|---|
| Process node | 40 nm @ TSMC | 28 nm @ TSMC | 28 nm @ TSMC | 28 nm @ TSMC |
| Core clock | 772 MHz | 1006 MHz | 900 MHz | 925 MHz |
| Hot clock | 1544 MHz | — | — | — |
| Setup rate | 3088 Mtris/s | 4024 Mtris/s | 4500 Mtris/s | 1850 Mtris/s |
| SP FMA rate | 1.6 Tflops | 3.1 Tflops | 5.2 Tflops | 3.8 Tflops |
| Bilinear filtering (int8/fp16) | 49/49 Gtexels/s | 129/129 Gtexels/s | 216/216 Gtexels/s | 118/59 Gtexels/s |
| ROP rate | 37 Gpixels/s | 32 Gpixels/s | 43 Gpixels/s | 30 Gpixels/s |
| Memory clock | 4000 MT/s | 6000 MT/s | 6000 MT/s | 5700 MT/s |
| Memory bus width | 384 bits | 256 bits | 384 bits | 384 bits |
| Memory bandwidth | 192 GB/s | 192 GB/s | 288 GB/s | 274 GB/s |
We think the GK104 has a more suitable mix of resources for real-time graphics, especially for current games that have been cross-developed for antiquated console hardware. The GK110 may be twice the size, but it’s not likely to be twice as fast for gaming. Still, our theoretical GK110-based GeForce increases shader flops and texture filtering capacity by two-thirds, along with respectable improvements in ROP rate and memory bandwidth. Since the GK104 is already a match for AMD’s Tahiti, we reckon the GK110 would be substantially faster still—if and when it makes its way into a consumer graphics card.