Two weeks ago today, Rich Heye, VP and GM of ATI's desktop business unit, stood up in front of a room full of skeptical journalists and attempted to defuse those concerns. The problem with R520, he told us, was neither a snag caused by TSMC's 90nm process tech nor a fundamental design issue. The chip was supposed to launch in June, he said, but was slowed by a circuit design bug: a simple problem, but one that was repeated throughout the chip. Once ATI identified the problem and fixed it, the R520 gained 150MHz in clock frequency. That may not sound like much if you're thinking of CPUs, but in the world of 16-pipe-wide graphics processors, 150MHz can make the difference between competitive success and failure.
With those concerns addressed, ATI proceeded to unveil not just R520, but a whole family of Radeon graphics products ranging from roughly $79 to $549, based on three new GPUs that share a common heritage. It is one of the most sweeping product launches we've ever seen in graphics, intended to bring ATI up to feature parity with NVIDIA, and then some. Read on as we delve into the technology behind ATI's new GPU lineup and then test its performance head to head against its direct competition.
Shader Model 3.0 and threading
Probably the most notable feature of the ATI R500-series graphics architecture is its support for the Shader Model 3.0 programming model. Shader Model 3.0 lives under the umbrella of Microsoft's DirectX 9 API, the games and graphics programming interface for Windows. SM3.0 is the most advanced of several programming models built into DX9, and it's the one used by all NVIDIA products in the GeForce 6 and 7 series product lines. SM3.0's key features include a more CPU-like programming model for pixel shaders, the most powerful computational units on a GPU. SM3.0 pixel shaders must be able to execute longer programs, and they must support dynamic flow control within those programs, things such as looping and branching with conditionals. These pixel shaders must also do their computational work with 128-bit floating-point precision, which works out to 32 bits of floating-point precision per color channel for red, green, blue, and alpha.
ATI's new GPUs support all of these things, including 32-bit precision per color channel. That's a step up in precision from ATI's previous DirectX 9-class graphics processors, all of which did internal pixel shader calculations with 24 bits of FP precision. Unlike NVIDIA's recent GPUs, the R500 series' pixel shaders will not accept a "partial precision" hint from the programmer and cut back pixel shader precision to 16 bits per channel for some calculations in order to save on resources like internal chip register space. Instead, R500 GPUs do all pixel shader calculations with 32-bit precision. The GPU can, of course, store data in lower precision texture formats, but the internal pixel shader precision doesn't change.
The move from 24 to 32 bits of precision establishes a nice baseline for the future, but virtually no applications have yet layered enough rendering passes on top of one another to cause 24-bit precision to become a problem. As we have learned over the life of the GeForce 6 and 7 series, Shader Model 3.0's true value doesn't come in the form of visual improvements over the more common SM2.0, but in higher performance. The same sort of visual effects possible in SM3.0 are generally possible in 2.0, but they're not always possible in real time. Through judicious use of longer shader programs with looping and dynamic branching, applications may use SM3.0 to enable new forms of eye candy.
In order to take best advantage of Shader Model 3.0's capabilities, ATI has equipped the R520's pixel shader engine with the scheduling and control logic capable of handling up to 512 parallel threads. Threads are important in modern graphics architectures because they're used to keep a GPU's many execution resources well fed; having lots of threads standing by for execution allows the GPU to mask latency by doing other work while waiting for something relatively slow to happen, such as a texture access. A Shader Model 3.0 GPU may have to wait for the result of a conditional (such as an if-else statement) to be returned before proceeding with execution of a dependent branch, so such latency masking becomes even more important with the addition of dynamic flow control.
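To get a rough feel for why having many threads in flight helps, here is a minimal back-of-the-envelope sketch in Python. It is not a description of ATI's scheduler. It assumes a simple round-robin model in which every thread does a few cycles of math and then stalls on a texture fetch; the function name `alu_utilization` and the cycle counts used below are hypothetical values chosen for illustration only.

```python
def alu_utilization(threads, alu_cycles, fetch_latency):
    """Fraction of cycles the shader ALU stays busy in a toy round-robin model.

    Assumption: each thread does `alu_cycles` of math, then stalls for
    `fetch_latency` cycles on a texture access before it can run again.
    While one thread is stalled, the ALU switches to another ready thread.
    """
    if threads < 1 or alu_cycles < 1 or fetch_latency < 0:
        raise ValueError("need threads >= 1, alu_cycles >= 1, fetch_latency >= 0")
    # One full round for a single thread: its math plus its wait.
    period = alu_cycles + fetch_latency
    # With enough threads, the math of the others fills the whole wait.
    return min(1.0, threads * alu_cycles / period)


if __name__ == "__main__":
    # Hypothetical numbers: 4 cycles of math per 100-cycle texture fetch.
    for n in (1, 8, 26, 512):
        print(n, "threads:", round(alu_utilization(n, 4, 100), 3))
```

With these made-up numbers, a single thread keeps the ALU busy less than 4% of the time, while a couple dozen threads are enough to cover the fetch latency completely. Real shaders mix math and memory access far less evenly, which is one reason a large pool of standby threads is useful.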

Despite the ability for programs to branch and loop, though, Shader Model 3.0 GPUs retain their parallelism, and that pulls against the efficient execution of "branchy" code. Pixels (or more appropriately, fragments) are processed together in blocks, and when pixels in the same block take different forks of a branch, all pixels must traverse both forks of that branch. (The ones not affected by the active branch are simply masked out during processing.) In the R520, pixels are grouped into threads in four-by-four blocks, which ATI says is much finer grained threading than in competing GPUs.
To illustrate the improved efficiency of its architecture, ATI offers this example of a shadow mapping algorithm using an if-else statement:

The GPU with large thread sizes must send lots of pixels down both sides of a branch, and thus it doesn't realize the benefits of dynamic flow control. A four-by-four block, like in the R520, is much more efficient by comparison.
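The effect of block size on branching can be approximated with a small Python sketch. This is a simplified model, not ATI's example: it assumes square, grid-aligned blocks and a hypothetical scene in which a straight diagonal edge separates shadowed pixels from lit ones. The names `shadow_mask` and `fraction_taking_both_paths` are made up for this illustration, and the block shapes used by actual hardware may differ.

```python
def shadow_mask(size):
    """Hypothetical test scene: pixels below a diagonal edge are 'in shadow'.

    Returns a size-by-size grid of booleans; True means the pixel would
    take the 'if' side of the branch, False the 'else' side.
    """
    return [[x < y for x in range(size)] for y in range(size)]


def fraction_taking_both_paths(mask, block):
    """Fraction of pixels that sit in blocks whose pixels disagree.

    Every pixel in such a block must traverse both forks of the branch,
    so a lower value means dynamic flow control saves more work.
    """
    height, width = len(mask), len(mask[0])
    divergent_pixels = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            y_end = min(by + block, height)
            x_end = min(bx + block, width)
            outcomes = {mask[y][x]
                        for y in range(by, y_end)
                        for x in range(bx, x_end)}
            if len(outcomes) > 1:  # mixed block: both forks get executed
                divergent_pixels += (y_end - by) * (x_end - bx)
    return divergent_pixels / (width * height)


if __name__ == "__main__":
    scene = shadow_mask(64)
    for side in (4, 32):
        share = fraction_taking_both_paths(scene, side)
        print(f"{side}x{side} blocks: {share:.2%} of pixels run both forks")
```

On this 64-by-64 test scene, 4x4 blocks (16 pixels) leave 6.25% of the pixels executing both forks, while 32x32 blocks (1024 pixels) push that figure to 50%. The exact ratios depend entirely on the scene, but the trend matches the point of ATI's illustration.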
Comparing the scheduling and threading capabilities of R520 to the competition isn't the easiest thing to do, because NVIDIA hasn't offered quite as much detail about exactly how its GPUs do things. NVIDIA seems to rely more on software, including the real-time compiler in its graphics drivers, to assist with scheduling. Yet NVIDIA says that its GPUs do indeed have logic on board for management of threads and branching, and that they keep "hundreds" of threads in flight in order to mask latency. As for threading granularity, some clever folks have tested the NV40 and G70 and concluded that the NV40 handles pixels in blocks of 4096, while the G70 uses blocks of 1024. NVIDIA claims those numbers aren't entirely correct, however, pegging the NV40's thread size at around 880 pixels and the G70's at roughly a quarter of that. In fact, the G70's pipeline structure was altered to allow for finer-grained flow control. The efficiency difference between groups of 200-some pixels and 16 pixels in ATI's example above is pretty stark, but how often and how much this difference between the G70 and R520 will matter will depend on how developers use flow control in their shaders.