Single page Print

The first chip: GK104
Now that we've looked at the SMX, we dial back the magnification a bit and consider the overall layout of the first chip based on the Kepler architecture, the GK104.


Logical block diagram of the GK104. Source: Nvidia.

You can see that there are four GPCs, or graphics processing clusters, in the GK104, each nearly a GPU unto itself, with its own rasterization engine. The chip has eight copies of the SMX onboard, for a gut-punching total of 1536 ALUs and 128 texels per clock of texture filtering power.

The L2 cache shown above is 512KB in total, divided into four 128KB "slices," each with 128 bits of bandwidth per clock cycle. That adds up to double the per-cycle bandwidth of the GF114 or 30% more than the biggest Fermi, the GF110. The rest of the specifics are in the table below, with the relevant comparisons to other GPUs.

ROP
pixels/
clock
Texels
filtered/
clock
(int/fp16)
Shader
ALUs
Rasterized
triangles/
clock
Memory
interface
width (bits)
Estimated
transistor
count
(Millions)
Die size
(mm²)
Fabrication
process node
GF114 32 64/64 384 2 256 1950 360 40 nm
GF110 48 64/64 512 4 384 3000 520 40 nm
GK104 32 128/128 1536 4 256 3500 294 28 nm
Cypress 32 80/40 1600 1 256 2150 334 40 nm
Cayman 32 96/48 1536 2 256 2640 389 40 nm
Pitcairn 32 80/40 1280 2 256 2800 212 28 nm
Tahiti 32 128/64 2048 2 384 4310 365 28 nm

In terms of basic, per-clock rates, the GK104 stacks up reasonably well against today's best graphics chips. However, if the name "GK104" isn't enough of a clue for you, have a look at some of the vitals. This chip's memory interface is only 256 bits wide, all told, and its die size is smaller than the middle-class GF114 chip that powers the GeForce GTX 560 series. The GK104 is also substantially smaller, and comprised of fewer transistors, than the Tahiti GPU behind AMD's Radeon HD 7900 series cards. Although the product based on it is called the GeForce GTX 680, the GK104 is not a top-of-the-line, reticle-busting monster. For the Kepler generation, Nvidia has chosen to bring a smaller chip to market first.


Die shot of the GK104. Source: Nvidia.


The GK104 poses, no duckface, next to a quarter for scale

Although Nvidia won't officially confirm it, there is surely a bigger Kepler in the works. The GK104 is obviously more tailored for graphics than GPU computing, and GPU computing is an increasingly important market for Nvidia. The GK104 can handle double-precision floating-point data formats, but it only does so at 1/24th the rate it processes single-precision math, just enough to maintain compatibility. Nvidia has suggested there will be some interesting GPU-computing related announcements during its GTC conference in May, and we expect the details of the bigger Kepler to be revealed at that point. Our best guess is that the GK100, or whatever it's called, will be a much larger chip, presumably with six 64-bit memory interfaces and 768KB of L2 cache. We wouldn't be surprised to see its SM exchange those 32-wide execution units for 16-wide units capable of handling double-precision math, leaving it with a total of 128 ALUs per SM. We'd also expect full ECC protection for all local storage and off-chip memory, just like the GF110.

The presence of a larger chip at some point in Nvidia's future doesn't mean the GK104 lacks for power. Although it "only" has four 64-bit memory controllers, this chip's memory interface is probably the most notable change outside of the SMX. As Danskin very carefully put it, "Fermi, our memory wasn't as fast as it could have been. This is, in fact, as fast as it could be." The interface still supports GDDR5 memory, but data rates are up from about 4 Gbps in the Fermi products to 6 Gbps in the GeForce GTX 680. As a result, the GTX 680 is able essentially to match the GeForce GTX 580 in total memory bandwidth, at 192 GB/s, while having a 50% narrower data path.

The other novelty in the GK104 is Nvidia's first PCI Express 3.0-compatible interconnect, which doubles the peak data rate possible for GPU-to-host communication. We don't expect major performance benefits for graphics workloads from this faster interface, but it could matter in multi-GPU scenarios or for GPU computing applications.