The SMX core
The single biggest change in the Kepler architecture is the redesigned shader multiprocessor core, nicknamed the SMX.
From a block diagram standpoint, the GK110's SMX looks very much like the GK104's, with the same basic set of resources, from the 192 single-precision shader ALUs right down to the 16 texels per clock of texture filtering. That's a departure from the Fermi generation, where the GF104's SM mixed things up a bit. The only major change from the GK104 is the addition of 64 double-precision math units. At least, that's what the official block diagram tells us, but I'm having a hard time believing the DP execution units are entirely separate from the single-precision ones. Odds are that the GK110 breaks up those 64-bit numbers into two pieces and uses a pair of ALUs to process them together, or something of that nature.
Our understanding is that the SMX has eight basic execution units: four units with 32 ALUs each and another four with 16 ALUs each. We suspect double-precision math is handled on the four 32-wide execution units, with the 16-wide units left idle. The numbers work out if that's the case: if each 64-bit op ties up a pair of 32-bit lanes, those four units can produce 128/2 = 64 results per clock. And indeed, the GK110 can process 64 double-precision ops per clock, one third of its single-precision rate.
All this talk of rates brings up another issue with the Kepler generation. As David Kanter has pointed out, the SMX's big increases in shader flops have been accompanied by proportionately smaller increases in local storage capacity and bandwidth. As a result, key architectural ratios like bandwidth per flop have declined, even though the chip's overall power has increased. The GK110 has a new trick that should help offset this change in ratios somewhat: the SMX's 48KB L1 texture cache can now be used as a read-only cache for compute, bypassing the texture unit. Apparently some clever CUDA coders were already making use of this cache in older GPUs, but with the GK110, they won't have to contend with texture filtering and the like.
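For the CUDA coders in the audience, a minimal sketch of what that new read-only path looks like, assuming the `__ldg()` intrinsic Nvidia exposes on sm_35 parts like the GK110 (the kernel itself is our own illustrative example, not anything Nvidia provided):

```cuda
// Sketch: routing loads through the GK110's read-only data cache.
// The const __restrict__ qualifiers and __ldg() tell the compiler the
// input is read-only for the kernel's lifetime, so loads can flow
// through the 48KB texture cache with no texture setup at all.

__global__ void scale(float *out, const float * __restrict__ in,
                      float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);  // load via the read-only cache
}
```

On older GPUs, getting data through this cache meant binding it to a texture and sampling it; here, it's just a qualified pointer.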
Along the same lines, the GK110's shared L2 cache has doubled in size from Fermi's, to 1.5MB, and it has twice the bandwidth per clock, as well. Yes, the ALU count has more than doubled, but the increases in cache size and bandwidth should still mean an improvement, even with the shifting ratios.
Built for compute
The GK110 includes some other compute-oriented provisions that the GK104 lacks, and those are intended to deal with the growing problem of keeping a massively parallel GPU fully occupied with work.
Fermi and prior chips have only a single work queue, so incoming commands from the CPU are serialized, and work can only be submitted by, effectively, a single CPU core. As a result, even though Fermi supports multiple concurrent kernels, Nvidia claims the GPU often isn't fully occupied when running complex programs. To remedy this situation, the GK110 has 32 work queues, managed in hardware, so it can be fed by multiple CPU threads running on multiple CPU cores. Nvidia has oh-so-cleverly named this new capability "Hyper-Q".
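Here's a rough sketch of the sort of host code Hyper-Q is built for: several CPU threads, each submitting work on its own CUDA stream. The kernel and launch sizes are hypothetical stand-ins; the point is that on the GK110, these streams can map onto separate hardware queues instead of serializing through one.

```cuda
// Sketch: feeding the GPU from multiple CPU threads at once.
// On Fermi, all of these submissions funnel through a single work
// queue; with Hyper-Q, the GK110 exposes 32 queues, so each stream
// can be dispatched independently.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void busy_kernel(int id) { /* stand-in for real work */ }

void worker(int id)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    busy_kernel<<<64, 256, 0, stream>>>(id);  // queued per-stream
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

int main()
{
    std::vector<std::thread> threads;
    for (int i = 0; i < 8; i++)
        threads.emplace_back(worker, i);
    for (auto &t : threads)
        t.join();
    return 0;
}
```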
The other big hitter is a feature called Dynamic Parallelism. In a nutshell, the big Kepler gives programs running on the GPU the ability to spawn new programs without going back to the CPU for help. Among other things, this feature allows a common logic structure, the nested loop, to work properly and efficiently on a GPU.
Perhaps the best illustration of this capability is the classic computing case of evaluating a fractal image like a Mandelbrot set. On the GK110, a Mandelbrot routine could evaluate the entire image area by breaking it into a coarse grid and checking to see which portions of that grid contain an edge. The blocks that do not contain an edge wouldn't need to be further evaluated, and the program could "zoom in" on the edge areas to compute their shape in more detail. The program could repeat this process multiple times, each time ignoring non-edge blocks and focusing closer on blocks with edges in them, in order to achieve a very high resolution result without performing unnecessary work—and without constantly returning to the CPU for guidance.
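The refinement loop described above can be sketched in CUDA's Dynamic Parallelism form, compiled with relocatable device code (`-rdc=true`) for sm_35. The `Region` type, `subdivide()`, and `has_edge()` helpers are hypothetical; the notable part is the child kernel launch issued from device code, with no round trip to the CPU.

```cuda
// Sketch: GPU-side recursive refinement via Dynamic Parallelism.
struct Region { float x0, y0, x1, y1; };

// Hypothetical helpers: carve out sub-block i, and do a coarse edge test.
__device__ Region subdivide(Region r, int i) { return r; /* stub */ }
__device__ bool   has_edge(Region r)         { return true; /* stub */ }

__global__ void evaluate(Region r, int depth)
{
    Region mine = subdivide(r, blockIdx.x);
    // Only edge-bearing blocks spawn finer-grained child work;
    // non-edge blocks simply stop, with no CPU involvement.
    if (threadIdx.x == 0 && depth > 0 && has_edge(mine))
        evaluate<<<16, 32>>>(mine, depth - 1);  // GPU spawns GPU work
}
```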
Since, as we understand it, pretty much any data-parallel computing problem requires a data set that can be mapped to a grid, the usefulness of Dynamic Parallelism ought to be pretty wide-ranging. Also, Nvidia claims it simplifies the programming task just by allowing the presence of nested loop logic. Obviously, these benefits won't show up in a peak flops count, but they should improve the GPU's real-world effectiveness, regardless.
Nvidia has tweaked the programming model for Kepler in several more ways. A new "shuffle" instruction allows for data to be passed between threads without going through local storage. Atomic operations have been beefed up, with int64 versions of some operations joining their int32 counterparts. Kepler's combination of a shorter pipeline and more atomic units should increase performance, too. Nvidia claims the atomic ops that were slowest on Fermi will be as much as ten times faster on Kepler, and even the fastest atomics on Fermi will be twice as fast on the GK110. Also, Kepler's ISA encoding allows up to 255 registers to be associated with each thread, up from 63 in Fermi.
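Two of those tweaks can be shown together in a short device-code sketch. The `__shfl_down()` intrinsic passes a register value between lanes of a warp with no trip through shared memory, and `atomicMin()` gains a 64-bit variant on sm_35. The kernel itself is our own illustrative example.

```cuda
// Sketch: a warp-level reduction using the Kepler shuffle instruction,
// finishing with one of the new 64-bit atomics.
__global__ void warp_min(const int *in, unsigned long long *global_min)
{
    int v = in[blockIdx.x * blockDim.x + threadIdx.x];

    // Halve the active span each step; values move lane-to-lane
    // through registers, never touching shared memory.
    for (int offset = 16; offset > 0; offset /= 2) {
        int other = __shfl_down(v, offset);
        v = min(v, other);
    }

    if ((threadIdx.x & 31) == 0)  // one thread per warp
        atomicMin(global_min, (unsigned long long)v);  // int64 atomic, sm_35+
}
```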