Single page Print

How this architecture stacks up
Understanding the basics of an architecture like this one is good, but in order to truly grok the essence of a modern GPU, one must develop a sense of the scale involved when the basic units are replicated many times across the chip. With Tahiti, those numbers can be staggering. Tahiti has 33% more compute units than Cayman has SIMDs (32 versus 24), with a third more peak FLOPS and a third higher texture sampling and filtering capacity, clock for clock.

If you'd like to look at it another way, just four of Tahiti's CUs would add up to the same pixel-shading capacity as an entire R600, the chip behind the Radeon HD 2900 XT (though Tahiti has much more robust datatype support and host of related enhancements).

Today's quad-core Sandy Bridge CPUs have four cores and can track eight threads via simultaneous multi-threading (SMT), but GPUs use threading in order to keep their execution units busy on a much broader scale. Each of Tahiti's CUs can track up to 40 wavefronts in flight at once. Across 32 CUs, that adds up to 1280 wavefronts or 20,480 threads in flight. Meanwhile, by Demers' estimates, Tahiti's L1 caches have an aggregate bandwidth of about 2 TB/s, while the L2s can transfer nearly 710 GB/s at 925MHz.

Peak pixel
fill rate
(Gpixels/s)
Peak bilinear
filtering
(Gtexels/s)
Peak bilinear
FP16 filtering
(Gtexels/s)
Peak shader
arithmetic
(TFLOPS)
Peak
rasterization
rate
(Mtris/s)
Memory
bandwidth
(GB/s)
GeForce GTX 280 19 48 24 0.6 602 142
GeForce GTX 480 34 42 21 1.3 2800 177
GeForce GTX 580 37 49 49 1.6 3088 192
Radeon HD 5870 27 80 40 2.7 850 154
Radeon HD 6970 28 85 42 2.7 1760 176
Radeon HD 7970 30 118 59 3.8 1850 264

In terms of key graphics rates, the Tahiti-driven Radeon HD 7970 eclipses the Cayman-based Radeon HD 6970 and the Fermi-powered GeForce GTX 580 in nearly every respect. The exceptions are the ROP rates and the triangle rasterization rate.

ROP rates, of course, include the pixel fill rate and, more crucially these days, the amount of blending power for multisampled antialiasing. The 7970 is barely faster than the 6970 on the this front because it sports the same basic mix of hardware: eight ROP partitions, each capable of outputting four colored pixels or 16 Z/stencil pixels per clock. Rather than increasing the hardware counts here, AMD decided on a reorganization. In previous designs, two ROP partitions (or render back-ends) were associated with each memory controller, but AMD claims the memory controllers were "oversubscribed" in that setup, leaving the ROPs twiddling their thumbs at times. Tahiti's ROPs are no longer associated with a specific memory controller. Instead, the chip has a crossbar allowing direct, switched communication between each ROP partition and each memory controller. (The ROPs are not L2 cache clients, incidentally.) With this increased flexibility and the addition of two more memory controllers, AMD claims Tahiti's ROPs should achieve up to 50% higher utilization and thus efficiency. Higher efficiency is a good thing, but the big question is whether Tahiti's relatively low maximum ROP rates will be a limiting factor, even if the chip does approach its full potential more frequently. The GeForce GTX 580 still has quite an advantage in max possible throughput over the 7970, 37 to 30 Gpixels/s.

Tahiti's peak polygon rasterization rates haven't improved too much on paper, either. It still has dual rasterizers, like Cayman before it. Rather than trying to raise the theoretical max throughput, which is already quite high considering the number of pixels and polygons likely to be onscreen, AMD's engineers have focused on delivered performance, especially with high degrees of tessellation. Geometry expansion for tessellation can have one input but many outputs, and that can add up to a difficult data flow problem. To address this issue, the parameter caches for Tahiti's two geometry engines have doubled in size, and those caches can now read from one another, a set of changes Nalasco says amounts to tripling the effective cache size. If those caches become overwhelmed, they can now spill into the chip's L2 cache, as well. If even that fails, AMD says vaguely that Tahiti is "better" when geometry data must spill into off-chip memory. This collection of tweaks isn't likely to allow Tahiti to match Fermi's distributed geometry processing architecture step for step, but we do expect a nice increase over Cayman. That alone should be more than sufficient for everything but a handful of worst-case games that use polygons gratuitously and rather bizarrely.

In addition to everything else, Tahiti has a distinctive new capability called partially resident textures, or in the requisite TLA, PRTs. This feature amounts to hardware acceleration for virtual or streaming textures, a la the "MegaTexture" feature built into id Software's recent game engines, including the one for Rage. Tahiti supports textures up to 32 terabytes in size, and it will map and filter them. Large textures can be broken down into 64KB tiles and pulled into memory as needed, managed by the hardware.

AMD's internal demo team has created a nifty animated demonstration of this feature running on Tahiti. The program implements, in real time, a method previously reserved primarily for offline film rendering. In it, textures are mapped on a per-polygon basis, eliminating the need for an intermediate UV map serving as a two-dimensional facsimile of each 3D object. AMD claims this technique solves one of the long-standing problem with tessellation: the cracking and seams that can appear when textures are mapped onto objects of varying complexity.

The firm is hopeful methods like this one can remove one of the long-standing barriers to the wider use of tessellation in future games. Trouble is, Tahiti's PRT capability isn't exposed in any current or near-future version of Microsoft's DirectX API, so it's unlikely more than a handful of game developers—those who favor OpenGL—will make use of it anytime soon. We're also left wondering whether the tessellation hardware in AMD's prior two generations of DirectX 11 GPUs will ever be used as effectively as we once imagined, since they lack PRT support and are subject to the texture mapping problems Tahiti's new hardware is intended to solve.