Shader processing

Block diagram of a single SP unit.
Source: AMD.

Since the RV770 shares its core shader structure with the R600 family, much of what I wrote about how shader processing works in my R600 review should still apply here. The RV770's basic execution unit remains a five-ALU-wide superscalar block like the one on the right, which has four "regular" ALUs and one "fat" ALU that can handle some special functions the others can't, like transcendentals.

AMD has extended the functionality of these SP blocks slightly with RV770, but remarkably, they've managed to reduce the area they occupy on the chip versus RV670, even on the same fabrication process. RV770 Chief Architect Scott Hartog cited a 40% increase in performance per square millimeter. In fact, AMD originally planned to put eight SIMD cores on this GPU, but once the shader team's optimizations were complete, the chip had die space left empty; the I/O ring around the outside of the chip was the primary size constraint. In response, they added two additional SIMD cores, bringing the SP count up to 800 and vaulting the RV770 over the teraflop mark.

Most of the new capabilities of the RV770's shaders are aimed at non-graphics applications. For instance, from the RV670, they inherit the ability to handle double-precision floating-point math, a capability that has little or no application in real-time graphics at present. The "fat" ALU in the SP block can perform one double-precision FP add or multiply per clock, while the other four ALUs can combine to process one double-precision add. In essence, that means the RV770's peak compute rate for double-precision multiply-add operations is one-fifth of its single-precision rate, or 240 gigaflops in the case of the Radeon HD 4870. That's quite a bit faster than even the GeForce GTX 280, whose peak DP compute rate is 78 gigaflops.
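For those who want to check the math, here's a rough sketch of where the one-fifth figure comes from. It assumes a 750MHz core clock for the Radeon HD 4870 (not stated in this section), counts a multiply-add as two floating-point operations, and reads the description above as one double-precision multiply-add per SP block per clock. The function and constant names are mine, purely for illustration.

```python
# Assumption (not from this section): the Radeon HD 4870 runs its core at 750 MHz.
CORE_CLOCK_GHZ = 0.750
# From the article: 800 stream processors arranged in five-ALU SP blocks.
SP_BLOCKS = 800 // 5
# Assumption: a multiply-add (MAD) counts as two floating-point operations.
FLOPS_PER_MAD = 2


def block_rate_gflops(mads_per_block_per_clock):
    """Peak rate in GFLOPS for a given number of MADs issued per SP block per clock."""
    return SP_BLOCKS * mads_per_block_per_clock * FLOPS_PER_MAD * CORE_CLOCK_GHZ


# Single precision: each of the five ALUs in a block can issue a MAD every clock.
single_precision = block_rate_gflops(5)
# Double precision: one MAD per block per clock (my reading of the text above).
double_precision = block_rate_gflops(1)

print(single_precision, double_precision)  # 1200.0 240.0
```

Under those assumptions, the double-precision rate works out to exactly one-fifth of the single-precision peak.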

Another such accommodation is the addition of 16KB of local shared memory in each SIMD core, useful for sharing data between threads in GPU-compute applications. This is obviously rather similar to the 16KB of shared memory Nvidia has built into each of the SM structures in its recent GPUs, although the RV770 has relatively less memory per stream processor, about a tenth of what the GT200 has. This local data share isn't accessible to programmers via graphics APIs like DirectX, but AMD may use it to enable larger kernels for custom AA filters or for other forms of post-processing. Uniquely, the RV770 also has a small, 16KB global data share for the passing of data between SIMDs.
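The "about a tenth" comparison is easy enough to work through. The sketch below takes the RV770's 800 SPs and 10 SIMDs from this article and assumes the GT200 groups eight SPs into each SM (240 SPs across 30 SMs), a figure that isn't given in this section. Variable names are illustrative only.

```python
LOCAL_SHARE_BYTES = 16 * 1024  # 16KB per SIMD (RV770) or per SM (GT200)

# From the article: 800 stream processors spread across 10 SIMD cores.
rv770_sps_per_simd = 800 // 10
# Assumption (not from this section): the GT200 has 8 SPs per SM.
gt200_sps_per_sm = 8

rv770_bytes_per_sp = LOCAL_SHARE_BYTES / rv770_sps_per_simd
gt200_bytes_per_sp = LOCAL_SHARE_BYTES / gt200_sps_per_sm
ratio = rv770_bytes_per_sp / gt200_bytes_per_sp

print(rv770_bytes_per_sp, gt200_bytes_per_sp)
```

That comes to roughly 205 bytes per SP on the RV770 versus 2048 bytes per SP on the GT200.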

Beyond that, the ability to perform an integer bit-shift operation has been migrated from the "fat" ALU to all five of them in each SP block, a provision aimed at accelerating video processing, encoding, and compression. The design team also added memory import and export capabilities, to allow for full-speed scatter and gather operations. And finally, the RV770 has a new provision for the creation of lightweight threads for GPU compute applications. Graphics threads tend to have a lot of state information associated with them, not all of which may be necessary for other types of processing. The RV770 can quickly generate threads with less state info for such apps.

Peak shader arithmetic (GFLOPS)

| GPU | Single-issue | Dual-issue |
|---|---|---|
| GeForce 8800 GTX | 346 | 518 |
| GeForce 9800 GTX | 432 | 648 |
| GeForce 9800 GX2 | 768 | 1152 |
| GeForce GTX 260 | 477 | 715 |
| GeForce GTX 280 | 622 | 933 |
| Radeon HD 2900 XT | 475 | - |
| Radeon HD 3870 | 496 | - |
| Radeon HD 3870 X2 | 1056 | - |
| Radeon HD 4850 | 1000 | - |
| Radeon HD 4870 | 1200 | - |
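Several of the numbers in this table can be reproduced with simple arithmetic: stream processors times shader clock times operations per clock. The SP counts and clock speeds in the sketch below are my assumptions from the cards' published specs, not figures given in this section. It counts a multiply-add as two operations, and the GeForces' dual-issue case as a multiply-add plus an extra multiply, or three operations.

```python
# (stream processors, shader clock in MHz) -- assumed from published specs,
# not taken from this section of the article.
GEFORCES = {
    "GeForce 9800 GTX": (128, 1688),
    "GeForce GTX 260": (192, 1242),
    "GeForce GTX 280": (240, 1296),
}
RADEONS = {
    "Radeon HD 3870": (320, 775),
    "Radeon HD 4850": (800, 625),
    "Radeon HD 4870": (800, 750),
}


def peak_gflops(sps, clock_mhz, flops_per_clock):
    """Theoretical peak in GFLOPS: every SP busy on every clock."""
    return sps * clock_mhz * flops_per_clock / 1000


results = {}
for name, (sps, mhz) in GEFORCES.items():
    single = round(peak_gflops(sps, mhz, 2))  # MAD only
    dual = round(peak_gflops(sps, mhz, 3))    # MAD plus the extra MUL
    results[name] = (single, dual)
for name, (sps, mhz) in RADEONS.items():
    results[name] = (round(peak_gflops(sps, mhz, 2)), None)  # MAD only

for name, (single, dual) in results.items():
    print(name, single, dual if dual is not None else "-")
```

With those assumed specs, the rounded results line up with the corresponding rows of the table.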

Although most of these changes won't affect graphics performance, one change may. Both AMD and Nvidia seem to be working on getting a grasp on how developers may use geometry shaders and optimizing their GPUs for different possibilities. In the GT200, we saw Nvidia increase its buffer sizes dramatically to better accommodate the use of a shader for geometry amplification, or tessellation. AMD claims its GPUs were already good at handling such scenarios, but has enhanced the RV770 for the case where the geometry shader keeps data on the chip for high-speed rendering.

The single biggest improvement made in the RV770's shader processing ability, of course, is the increase to 10 SIMDs and a total of 800 so-called stream processors on a single chip. This change affects graphics and GPU-compute applications alike. The table on the right shows the peak theoretical computational rates of various GPUs. Of course, as with almost anything of this nature, the peak number isn't destiny; it's just a possibility, if everything were to go exactly right. That rarely happens. For instance, the GeForces can only reach their peak numbers if they're able to use their dual-issue capability to execute an additional multiply operation in each clock cycle. In reality, that doesn't always happen. Similarly, in order to get peak throughput out of the Radeon, the compiler must schedule instructions cleverly for its five-wide superscalar ALU block, avoiding dependencies and serializing the processing of data that doesn't natively have five components.
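To illustrate why scheduling matters so much for a five-wide block, here is a toy model of the packing problem. It is emphatically not AMD's compiler: it's a simple greedy packer I've sketched under the assumption that any operation can go in any of the five slots (ignoring the "fat" ALU's special role) and that an operation can only issue once everything it depends on has completed in an earlier bundle. All names are hypothetical.

```python
def pack_bundles(deps, width=5):
    """Greedily pack operations into bundles of up to `width` slots.

    `deps` maps each operation name to the set of operations it depends on.
    An operation is ready only when all of its dependencies were issued in
    earlier bundles.
    """
    done = set()
    remaining = list(deps)
    bundles = []
    while remaining:
        ready = [op for op in remaining if deps[op] <= done][:width]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        bundles.append(ready)
        done.update(ready)
        remaining = [op for op in remaining if op not in ready]
    return bundles


def utilization(bundles, width=5):
    """Fraction of ALU slots that did useful work."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# A four-component vector operation: four independent ops fill 4 of 5 slots.
vec4 = {"x": set(), "y": set(), "z": set(), "w": set()}
# A serial chain: each op needs the previous result, so one slot per bundle.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

vec4_bundles = pack_bundles(vec4)
chain_bundles = pack_bundles(chain)
print(len(vec4_bundles), utilization(vec4_bundles))    # 1 0.8
print(len(chain_bundles), utilization(chain_bundles))  # 5 0.2
```

In this simplified model, a lone vec4 operation leaves one slot in five idle, and a fully dependent chain leaves four in five idle, which is why the peak number depends so heavily on the compiler finding independent work.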

Fortunately, we can run a few simple synthetic shader tests to get a sense of the GPUs' processing prowess.

In its most potent form, the Radeon HD 4870, the RV770 represents a huge improvement over the Radeon HD 3870—pretty straightforwardly, about two times the measured performance. Versus the competition, the Radeon HD 4850 outperforms the GeForce 9800 GTX in three of the four tests, although the gap isn't as large as the theoretical peak numbers would seem to suggest. More impressively, the Radeon HD 4870 surpasses the GT200-based GeForce GTX 260 in two of the four tests and essentially matches the GTX 280 in the GPU particles and Perlin noise tests. That's against a chip twice the size of the RV770, with a memory interface twice as wide.