At Nvidia's GPU Technology Conference in 2010, CEO Jen-Hsun Huang made some pretty dramatic claims about his company's future GPU architecture, code-named Kepler. Huang predicted the chip would be nearly three times more efficient, in terms of FLOPS per watt, than the firm's prior Fermi architecture. Those improvements, he said, would go "far beyond" the traditional advances chip companies can squeeze out of the move to a newer, smaller fabrication process. The gains would come from changes to the chip's architecture, design, and software together.
Fast forward to today, and it's time to see whether Nvidia has hit its mark. The first chip based on the Kepler architecture is hitting the market, aboard a new graphics card called the GeForce GTX 680, and we now have a clear sense of what was involved in the creation of this chip. Although Kepler's fundamental capabilities are largely unchanged versus the last generation, Nvidia has extensively refined and polished nearly every aspect of this GPU with an eye toward improved power efficiency.
Kepler was developed under the direction of lead architect John Danskin and Sr. VP of GPU engineering Jonah Alben. Danskin and Alben told us their team took a rather different approach to chip development than what's been common at Nvidia in the past, with much closer collaboration between the different disciplines involved, from the architects to the chip designers to the compiler developers. An idea that seemed brilliant to the architects would be nixed because it didn't work well in silicon, or if it didn't serve the shared goal of building a very power-efficient processor.
Although Kepler is, in many ways, the accumulation of many small refinements, Danskin identified the two most major changes as the revised SM—or shader multiprocessor, the GPU's processing "core"—and a vastly improved memory interface. Let's start by looking at the new SM, which Nvidia calls the SMX, because it gives us the chance to drop a massive block diagram on you. Warm up your scroll wheels for this baby.
To some extent, GPUs are just massive collections of floating-point computing power, and the SM is the locus of that power. The SM is where nearly all of the graphics processing work takes place, from geometry processing to pixel shading and texture sampling. As you can see, Kepler's SMX is clearly more powerful than past generations, because it's over 700 pixels tall in block diagram form. Fermi is, like, 520 or so, tops. More notably, the SMX packs a heaping helping of ALUs, which Nvidia has helpfully labeled as "cores." I'd contend the SM itself is probably the closest analog to a CPU core, so we'll avoid that terminology. Whatever you call it, though, the new SMX has more raw computing power—192 ALUs versus 32 ALUs in the Fermi SM. According to Alben, about half of the Kepler team was devoted to building the SMX, which is a new design, not a derivative of Fermi's SM.
The organization of the SMX's execution units isn't truly apparent in the diagram above. Although Nvidia likes to talk about them as individual "cores," the ALUs are actually grouped into execution units of varying widths. In the SMX, there are four 16-ALU-wide vector execution units and four 32-wide units. Each of the four schedulers in the diagram above is associated with one vec16 unit and one vec32 unit. There are eight special function units per scheduler to handle, well, special math functions like transcendentals and interpolation. (Incidentally, the partial use of vec32 units is apparently how the GF114 got to have 48 ALUs in its SM, a detail Alben let slip that we hadn't realized before.)
Although each of the SMX's execution units works on multiple data simultaneously according to its width—and we've called them vector units as a result—work is scheduled on them according to Nvidia's customary scheme, in which the elements of a pixel or thread are processed sequentially on a single ALU. (AMD has recently adopted a similar scheduling format in its GCN architecture.) As in the past, Nvidia schedules its work in groups of 32 pixels or threads known as "warps." Those vec32 units should be able to output a completed warp in each clock cycle, while the vec16 units and SFUs will require multiple clocks to output a warp.
The increased parallelism in the SMX is a consequence of Nvidia's decision to seek power efficiency with Kepler. In Fermi and prior designs, Nvidia used deep pipelining to achieve high clock frequencies in its shader cores, which typically ran at twice the speed of the rest of the chip. Alben argues that arrangement made sense from the standpoint of area efficiency—that is, the extra die space dedicated to pipelining was presumably more than offset by the performance gained at twice the clock speed. However, driving a chip at higher frequencies requires increased voltage and power. With Kepler's focus shifted to power efficiency, the team chose to use shorter pipelines and to expand the unit count, even at the expense of some chip area. That choice simplified the chip's clocking, as well, since the whole thing now runs at one speed.
Another, more radical change is the elimination of much of the control logic in the SM. The key to many GPU architectures is the scheduling engine, which manages a vast number of threads in flight and keeps all of the parallel execution units as busy as possible. Prior chips like Fermi have used lots of complex logic to decide which warps should run when, logic that takes a lot of space and consumes a lot of power, according to Alben. Kepler has eliminated some of that logic entirely and will rely on the real-time complier in Nvidia's driver software to help make scheduling decisions. In the interests of clarity, permit me to quote from Nvidia's whitepaper on the subject, which summarizes the change nicely:
Both Kepler and Fermi schedulers contain similar hardware units to handle scheduling functions, including, (a) register scoreboarding for long latency operations (texture and load), (b) inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates), and (c) thread block level scheduling (e.g., the GigaThread engine); however, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A multi-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eligible to issue.
For Kepler, we realized that since this information is deterministic (the math pipeline latencies are not variable), it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruction itself. This allowed us to replace several complex and power-expensive blocks with a simple hardware block that extracts the pre-determined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage.
The short story here is that, in Kepler, the constant tug-of-war between control logic and FLOPS has moved decidedly in the direction of more on-chip FLOPS. The big question we have is whether Nvidia's compiler can truly be effective at keeping the GPU's execution units busy. Then again, it doesn't have to be perfect, since Kepler's increases in peak throughput are sufficient to overcome some loss of utilization efficiency. Also, as you'll soon see, this setup obviously works pretty well for graphics, a well-known and embarrassingly parallel workload. We are more dubious about this arrangement's potential for GPU computing, where throughput for a given workload could be highly dependent on compiler tuning. That's really another story for another chip on another day, though, as we'll explain shortly.