Shader processing
Let's pull up that diagram of the G80 once more, so we have some context for talking about shader processing and performance.

Block diagram of the GeForce 8800. Source: Nvidia.

A single SP cluster. Source: Nvidia.

The G80's unified architecture substitutes massive amounts of more generalized parallel floating-point processing power for the vertex and pixel shaders of past GPUs. Again we can see the eight clusters of 16 SPs, with each cluster of SPs arranged in two groups of eight. To the left, you can see a slightly more detailed diagram of a single SP cluster. Each cluster has its own dedicated texture address and filtering units (the blue blocks) and its own pool of L1 cache. Behind the L1 cache is a connection to the crossbar that goes to the ROP units, with their L2 caches and connections to main memory.

Getting an exact handle on the amount of shader power available here isn't a wholly simple task, although you'll see lots of numbers thrown around as authoritative. We can get a rough sense of where the G80 stands versus the R580+ GPU in the Radeon X1950 XTX by doing some basic math. The R580+ has 48 pixel shader ALUs that can operate on four pixel components each, and it runs at 650MHz. That means the R580+ can operate on about 125 billion components per second, at optimal peak performance. With its 128 SPs at 1.35GHz, the G80 can operate on about 173 billion components per second. Of course, that's a totally bogus comparison, and I should just stop typing now. Actual performance will depend on the instruction mix, the efficiency of the architecture, and the ability of the architecture to handle different instruction types. (The G80's scalar SPs can dual-issue a MAD and a MUL, for what it's worth.)
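For anyone who wants to check that back-of-the-envelope math, here is a minimal Python sketch of it. The function name is my own invention, and the model is as naive as the comparison itself: units times components per unit times clock speed, with no attempt to account for instruction mix or dual-issue.

```python
def peak_components_per_second(units, components_per_unit, clock_hz):
    """Naive peak rate: every unit retires all of its components every clock."""
    return units * components_per_unit * clock_hz

# Figures quoted in the text above.
r580_plus = peak_components_per_second(48, 4, 650e6)   # 48 ALUs x 4 components at 650MHz
g80 = peak_components_per_second(128, 1, 1.35e9)       # 128 scalar SPs at 1.35GHz

print(f"R580+: {r580_plus / 1e9:.1f} billion components/s")
print(f"G80:   {g80 / 1e9:.1f} billion components/s")
print(f"Ratio: {g80 / r580_plus:.2f}x")
```

The ratio works out to roughly 1.4 to 1 in the G80's favor, which says nothing about how either chip behaves on real shader code.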

The G80 uses a threading model, with an internal thread processor, to track all data being processed. Nvidia says the G80 can have "thousands of threads" in flight at any given time, and it switches between them regularly in order to keep all of the SPs as fully occupied as possible. Certain operations like texture fetch or filtering can take quite a while, relatively speaking, so the SPs will switch away to another task while such an operation completes.
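To get a feel for why so many threads are needed, consider a toy model of latency hiding, sketched below in Python. It is not a description of the G80's actual scheduler. It assumes each thread does a fixed burst of math, then waits on a texture fetch, and that switching between threads is free. The cycle counts are hypothetical numbers picked for illustration.

```python
def sp_utilization(threads, alu_cycles, fetch_latency):
    """Fraction of time one SP stays busy under ideal round-robin switching.

    Each thread is assumed to do `alu_cycles` of math, then wait
    `fetch_latency` cycles on a texture fetch, and repeat.
    """
    busy = threads * alu_cycles          # math available per period
    period = alu_cycles + fetch_latency  # one thread's full loop
    return min(1.0, busy / period)

# Hypothetical workload: 10 cycles of math per 200-cycle texture fetch.
for n in (1, 4, 16, 21, 64):
    print(f"{n:3d} threads -> {sp_utilization(n, 10, 200):.0%} busy")
```

In this model a lone thread leaves the SP idle most of the time, and it takes about 21 threads per SP to cover the fetch latency completely. Multiply that by 128 SPs and "thousands of threads" starts to sound like a requirement rather than a boast.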

Threading also facilitates the use of a common shader unit for vertex, pixel, and geometry shader processing. Threading is the primary means of load balancing between these different data types. For DirectX 9 applications, that means vertex and pixel threads only, but the G80 can do robust load balancing between these thread types even though the DX9 API doesn't have a unified shader instruction language. Load balancing is handled automatically, so it's transparent to applications.

ATI created its first unified shader architecture in the Xenos chip for the Xbox 360, and all parties involved (including Microsoft, ATI, and Nvidia) seem to agree that unified shaders are the way to go. By their nature, graphics workloads tend to vary between being more pixel-intensive and more vertex-intensive, from scene to scene or even as one frame is being drawn. The ability to retask computational resources dynamically allows the GPU to use the bulk of its resources to attack the present bottleneck. This arrangement ensures that large portions of the chip don't sit unused while others face more work than they can handle.
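The load-balancing argument can be made concrete with a rough Python sketch. This is a toy model under loud assumptions: work is measured in abstract unit-cycles, it divides perfectly across units, and the 32/96 split for the fixed design is hypothetical rather than a description of any real GPU.

```python
def fixed_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Dedicated units: the frame waits on whichever pool finishes last."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def unified_time(vertex_work, pixel_work, total_units):
    """Unified pool: all units share whatever work there is."""
    return (vertex_work + pixel_work) / total_units

# Hypothetical scenes, same total work, different vertex/pixel mix.
scenes = {
    "pixel-heavy": (100, 1900),
    "geometry-heavy": (1000, 1000),
}
for name, (vertex_work, pixel_work) in scenes.items():
    fixed = fixed_time(vertex_work, pixel_work, 32, 96)
    unified = unified_time(vertex_work, pixel_work, 128)
    print(f"{name}: fixed {fixed:.1f}, unified {unified:.1f}")
```

With the same 128 units in total, the unified pool takes the same time for both scenes, while the fixed split slows down whenever the workload mix drifts away from the ratio the hardware was built around.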

To illustrate the merits of a unified architecture, Nvidia showed us a demo using the game Company of Heroes and a tool called NVPerfHUD that plots the percentage of pixel and vertex processing power used over time. Here's a slide that captures the essence of what we saw.

Source: Nvidia.

The proportion of GPU time dedicated to vertex and pixel processing tended to swing fluidly in a pretty broad range. Pixel processing was almost always more prominent than vertex processing, but vertex time would spike occasionally when there was lots of geometric complexity on the screen. That demo alone makes a pretty convincing argument for the merits of unified shaders—and for the G80's implementation of them.

Threading also governs the GPU's ability to process advanced shader capabilities like dynamic branching. On a parallel chip like this one, branches can create problems because the GPU may have to walk a large block of pixels through both sides of a branch in order to get the right results. ATI made lots of noise about the 16-pixel branching granularity in R520 when it was introduced, only to widen the design to 48 pixel shaders (and thus to 48-pixel granularity) with the R580. For G80, Nvidia equates one pixel to one thread, and says the GPU's branching granularity is 32 pixels—basically the width of the chip, since pixels have four scalar components each. In the world of GPUs, this constitutes reasonably fine branching granularity.
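Here is a small Python sketch of why granularity matters. It is a simplified model, not a simulation of any of these chips: pixels are processed in lockstep blocks, and any block containing pixels on both sides of a branch pays for both sides. The per-side costs and the 96-pixel test pattern are hypothetical.

```python
def lockstep_cost(takes_branch, granularity, cost_taken, cost_not_taken):
    """Total pixel-cycles spent when blocks of pixels run in lockstep."""
    total = 0
    for start in range(0, len(takes_branch), granularity):
        block = takes_branch[start:start + granularity]
        cycles = 0
        if any(block):          # at least one pixel needs the taken side
            cycles += cost_taken
        if not all(block):      # at least one pixel needs the other side
            cycles += cost_not_taken
        total += cycles * len(block)
    return total

# Hypothetical row of 96 pixels: the first 24 take an expensive branch.
pixels = [True] * 24 + [False] * 72

for granularity in (1, 16, 32, 48):
    cost = lockstep_cost(pixels, granularity, cost_taken=10, cost_not_taken=2)
    print(f"granularity {granularity:2d}: {cost} pixel-cycles")
```

In this example the cost climbs as the blocks get wider, because more pixels get dragged through the expensive side of a branch they didn't take. The exact numbers depend heavily on where the branch boundaries fall relative to the blocks.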

One more, somewhat unrelated, note on the G80's stream processors. Nvidia's pixel shaders have supported 32-bit floating-point datatypes for some time now, but the variety of data formats available on graphics processors has been an issue for just as long. The DirectX 10 specification attempts to tidy these things up a bit, and Nvidia believes the G80 can reasonably claim to be IEEE 754-compliant, perhaps not in every last picky detail of the spec, but generally so. This fact should make the G80 better suited for general computational tasks.