
Nvidia's GeForce GTX 460 graphics processor

Aiming for the happy medium
— 4:21 AM on July 12, 2010

We've been following the story of the Fermi architecture for the better part of a year now, since Nvidia first tipped its hand about plans for a new generation of DirectX 11-class GPUs. Fermi's story has been one of the more intriguing developments over that span of time, because it involves great ambitions and the strains that go with attempting to achieve them. Nvidia wanted its new top-of-the-line GPU to serve multiple markets, both traditional high-end graphics cards and the nascent market for GPUs as parallel computing engines. Not only that, but Fermi was to be unprecedentedly capable in both domains, with a novel and robust programming model for GPU computing and a first-of-its-kind parallel architecture for geometry processing in graphics.

Naturally, that rich feature set made for a large and complex GPU, and such things can be deadly in the chip business—especially when a transition to a new architecture is mated with an immature chip fabrication process, as was the case here. Time passed, and the first Fermi-based chip, the GF100, became bogged down with delays. Rumors flew about a classic set of problems: manufacturing issues, silicon re-spins, and difficult trade-offs between power consumption and performance. Eventually, as you know, the GF100 arrived in the GeForce GTX 470 and 480 graphics cards, which turned out to be reasonably solid but not much faster than the then-six-month-old Radeon HD 5870—which is based on a much smaller, cheaper-to-produce chip.


The GF100, though, has a lot of extra fat in it that's unnecessary for, well, video cards. We wondered at that time, several months ago, whether a leaner version of the Fermi architecture might not be a tougher competitor. If you'll indulge me, I'll quote myself here:

We're curious to see how good a graphics chip this generation of Nvidia's technology could make when it's stripped of all the extra fat needed to serve other markets: the extensive double-precision support, ECC, fairly large caches, and perhaps two or three of its raster units. You don't need any of those things to play games—or even to transcode video on a GPU. A leaner, meaner mid-range variant of the Fermi architecture might make a much more attractive graphics card, especially if Nvidia can get some of the apparent chip-level issues worked out and reach some higher clock speeds.

Sounds good, no? Well, I'm pleased to report that nearly all of that has come to pass in the form of a GPU known as the GF104. What's more, the first graphics cards based on it, to be sold as the GeForce GTX 460 768MB and 1GB, are aimed directly at the weak spot in the Radeon's armor: the $199-229 price range.

A new Fermi: GF104
The GF104 GPU is undoubtedly based on the same generation of technology as the GF100 before it, but to thrust them both under the umbrella of the same architecture almost feels misleading. In truth, the GF104 has been pretty radically rebalanced in terms of the number and type of functional units onboard, clearly with an eye toward more efficient graphics performance. We'll illustrate that point with a high-level functional block diagram of the GPU. If you'd like to compare against the GF100, our earlier look at that GPU includes a diagram and discussion.

Block diagram of the GF104. Source: Nvidia.

These diagrams are becoming increasingly hard to read as the unit counts on GPUs mushroom. Starting with the largest elements, you can see that there are only two GPCs, or graphics processing clusters, in the GF104. The GF100 has four. As a result, the number of SMs, or shader multiprocessors, is down to eight. Again, GF100 has twice as many. The immediately obvious result of these cuts is that GF104 has half as many raster and polymorph engines as the GF100, which means its potential for polygon throughput is substantially reduced. That's very much an expected change, and not necessarily a major loss at this point in time.

Another immediately obvious change is a reduction in the number of memory controllers flanking the GPCs. The GF104 has four memory controllers and associated ROP partitions, while the GF100 has six. What you can't tell from the diagram is that, apparently, 128KB of L2 cache is also associated with each memory controller/ROP group. With four such groups, the GF104 features 512KB of L2 cache, down from 768KB on the GF100. The local memory pools on the GF104 are different in another way, too: the ECC protection for these memories has been removed, since it's essentially unneeded in a consumer product, especially a graphics card.
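The cache and bus arithmetic is easy to check. Here's a minimal Python sketch of it, assuming 128KB of L2 per memory controller/ROP group as described above. The 64-bit width per controller is our own inference from the 256-bit and 384-bit interfaces quoted for the two chips, and the function and variable names are purely illustrative.

```python
# Back-of-the-envelope totals for the memory subsystem.
# Assumptions: every memory controller/ROP group carries the same 128KB
# slice of L2, and each controller is 64 bits wide (inferred from the
# chips' total interface widths, not stated per controller).
L2_KB_PER_CONTROLLER = 128
BITS_PER_CONTROLLER = 64  # assumption

MEMORY_CONTROLLERS = {"GF100": 6, "GF104": 4}


def memory_totals(controllers):
    """Return (total L2 in KB, total memory interface width in bits)."""
    return (controllers * L2_KB_PER_CONTROLLER,
            controllers * BITS_PER_CONTROLLER)


for chip, count in MEMORY_CONTROLLERS.items():
    l2_kb, bus_bits = memory_totals(count)
    print(f"{chip}: {l2_kb}KB L2, {bus_bits}-bit memory interface")
```

Because both totals scale with the controller count, dropping from six controllers to four trims the L2 cache and the memory interface by the same proportion.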

Our description so far may lead you to think the GF104 is simply a GF100 that's been sawed in half, but that's not the case. To understand the other changes, we need to zoom in on one of those SM units and take a closer look.

Block diagram of an SM in the GF104. Source: Nvidia.

Each SM in the GF104 is a little "fatter" than the GF100's. You can count 48 "CUDA cores" in the diagram above, if you're so inclined. That's an increase from 32 in the GF100. We're not really inclined to call those shader arithmetic logic units (ALUs) "cores," though. The SM itself probably deserves that honor.

While we're being picky, what you should really see in that diagram is a collection of five different execution units: three 16-wide vector execution units, one 16-wide load/store unit, and an eight-wide special function unit, or SFU. By contrast, the GF100's SM has two 16-wide execution units, one 16-wide load/store unit, and a four-wide SFU block. The GF104 SM's four dispatch units represent a doubling from the GF100, although the number of schedulers per SM remains the same.

The end result of these modifications is an SM with considerably more processing power: 50% more ALUs for general shader processing and double the number of SFUs to handle interpolation and transcendentals—both especially important mathematical operations for graphics. The doubling of instruction dispatch bandwidth should help keep the additional 16-wide ALU block occupied with warps—groups of 32 parallel threads or pixels in Nvidia's lexicon—to process.
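For those who like to see the sums, here's a rough Python sketch of the per-SM comparison. The unit counts come straight from the description above; the dictionary layout and helper names are our own invention for illustration.

```python
# Per-SM execution resources, as read off Nvidia's block diagrams.
SM_UNITS = {
    "GF100": {"alu_blocks": 2, "sfu_lanes": 4, "dispatch_units": 2},
    "GF104": {"alu_blocks": 3, "sfu_lanes": 8, "dispatch_units": 4},
}
LANES_PER_ALU_BLOCK = 16  # each vector execution unit is 16 lanes wide


def alus_per_sm(chip):
    """Shader ALUs ("CUDA cores") in one SM of the given chip."""
    return SM_UNITS[chip]["alu_blocks"] * LANES_PER_ALU_BLOCK


def percent_gain(new, old):
    """Relative increase of new over old, in percent."""
    return 100 * (new - old) / old


alu_gain = percent_gain(alus_per_sm("GF104"), alus_per_sm("GF100"))
sfu_gain = percent_gain(SM_UNITS["GF104"]["sfu_lanes"],
                        SM_UNITS["GF100"]["sfu_lanes"])
print(f"ALUs per SM: {alus_per_sm('GF100')} -> {alus_per_sm('GF104')} "
      f"(+{alu_gain:.0f}%)")
print(f"SFU lanes per SM: +{sfu_gain:.0f}%")
```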

One place where the GF104's SM is less capable is double-precision math, a facility important to some types of GPU computing but essentially useless for real-time graphics. Nvidia has retained double-precision support for the sake of compatibility, but only one of those 16-wide ALU blocks is DP-capable, and it processes double-precision math at one quarter the usual speed. All told, that means the GF104 is just 1/12 its regular speed for double-precision.
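The 1/12 figure falls out of two fractions multiplied together: one ALU block in three can do double-precision work, and it does so at a quarter of the usual rate. A quick Python sketch, with a helper name of our own choosing, makes the point:

```python
from fractions import Fraction


def dp_throughput_fraction(alu_blocks, dp_blocks, dp_speed):
    """Double-precision rate as a fraction of the SM's regular rate.

    alu_blocks: 16-wide ALU blocks in the SM
    dp_blocks:  how many of those blocks can execute DP math
    dp_speed:   DP speed of a capable block relative to its usual speed
    """
    return Fraction(dp_blocks, alu_blocks) * dp_speed


# GF104: three ALU blocks, one DP-capable, running DP at quarter speed.
gf104_dp = dp_throughput_fraction(3, 1, Fraction(1, 4))
print(f"GF104 double-precision rate: {gf104_dp} of regular speed")
```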

Another big graphics-related change is the doubling of the number of texture units in the SM to eight. That goes along nicely with the increase in interpolation capacity in the SFUs, and it grants the GF104 a more texturing-intensive personality than its elder sibling.

Boil down all of the increases here and decreases there versus the GF100, and you begin to get a picture of the GF104 as a chip with a rather different balance of internal graphics hardware—one that arguably better matches the demands of today's games.

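GPU | ROP pixels/clock | Textures filtered/clock | Shader ALUs | Triangles/clock | Memory interface width (bits)
GF100 | 48 | 64 | 512 | 4 | 384
GF104 | 32 | 64 | 384 | 2 | 256
Cypress | 32 | 80 | 1600 | 1 | 256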

The GF104 is a smaller chip aimed at a broader market than GF100, of course, so some compromises were necessary. What's interesting is where those compromises were made. ROP throughput (which determines pixel fill rate and anti-aliasing power) and memory interface width are each reduced by a third, and the shader ALU count drops by a quarter. The triangle throughput for rasterization (and tessellation, via the polymorph engines) is cut in half. Yet texturing capacity holds steady, with no reduction at all. When you consider that Nvidia's shader ALUs run at twice the frequency of the rest of the chip and are typically more efficient than AMD's, the GF104's balance begins to look quite a bit like AMD's Cypress, in fact.
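If you'd like to run the numbers yourself, the short Python sketch below computes each cut from the GF100 and GF104 figures in the table above. The dictionary keys are our own labels for the table's columns, not anything official. Worked out this way, the ROP rate and memory width each drop by a third, the shader ALU count by a quarter (512 to 384), and triangle throughput by half.

```python
from fractions import Fraction

# Per-clock rates and unit counts from the table; key names are our labels.
SPECS = {
    "GF100": {"rop_pixels_per_clock": 48, "textures_per_clock": 64,
              "shader_alus": 512, "triangles_per_clock": 4,
              "memory_bits": 384},
    "GF104": {"rop_pixels_per_clock": 32, "textures_per_clock": 64,
              "shader_alus": 384, "triangles_per_clock": 2,
              "memory_bits": 256},
}


def reduction(metric, old_chip="GF100", new_chip="GF104"):
    """Fraction by which a metric shrinks going from old_chip to new_chip."""
    old = SPECS[old_chip][metric]
    new = SPECS[new_chip][metric]
    return Fraction(old - new, old)


for metric in SPECS["GF100"]:
    print(f"{metric}: cut by {reduction(metric)}")
```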

That said, Nvidia is unquestionably following its own playbook here. A couple of generations back, the firm reshaped its enormous G80 GPU into a leaner, meaner variant. In the process, it went from a 384-bit memory interface to 256 bits, from 24 ROP units to 16, and from four texture units per SM to eight. The resulting G92 GPU performed nearly as well as the G80 in many games, and it became a long-running success story.