
ATI's Radeon X1000 series GPUs

The graphics game changes again

WE'VE BEEN WAITING quite a while for ATI's new generation of graphics chips. It's no secret that the R500-series GPUs are arriving later than expected, and fans of the company have nervously speculated about the cause of the delay. ATI chose to build its new series of graphics chips using 90nm process technology, and going with new process tech has always been risky. Some folks fretted that ATI may have run into the same sort of problems at 90nm that made Intel's Pentium 4 "Prescott" processors famously hot, power hungry, and unable to hit their projected clock speeds. In a related vein, others fussed over rumors that ATI's new high-end R520 architecture was "only" 16 pipes wide, compounding the process technology risk. If the R520 couldn't hit its clock speed targets, it could have a difficult time keeping pace with its would-be rival, NVIDIA's GeForce 7800, whose wider 24-pipe design makes it less dependent on high clock frequencies. As the weeks dragged on with no sign of ATI's new GPUs, the rumor mill began circulating these concerns ever more urgently.

Two weeks ago today, Rich Heye, VP and GM of ATI's desktop business unit, stood up in front of a room full of skeptical journalists and attempted to defuse those concerns. The problem with R520, he told us, was neither a snag caused by TSMC's 90nm process tech nor a fundamental design issue. The chip was supposed to launch in June, he said, but was slowed by a circuit design bug: a simple problem, but one that was repeated throughout the chip. Once ATI identified the problem and fixed it, the R520 gained 150MHz in clock frequency. That may not sound like much if you're thinking of CPUs, but in the world of 16-pipe-wide graphics processors, 150MHz can make the difference between competitive success and failure.

With those concerns addressed, ATI proceeded to unveil not just R520, but a whole family of Radeon graphics products ranging from roughly $79 to $549, based on three new GPUs that share a common heritage. It is one of the most sweeping product launches we've ever seen in graphics, intended to bring ATI up to feature parity with NVIDIA—and then some. Read on as we delve into the technology behind ATI's new GPU lineup and then test its performance head to head against its direct competition.

Shader Model 3.0 and threading
Probably the most notable feature of the ATI R500-series graphics architecture is its support for the Shader Model 3.0 programming model. Shader Model 3.0 lives under the umbrella of Microsoft's DirectX 9 API, the games and graphics programming interface for Windows. SM3.0 is the most advanced of several programming models built into DX9, and it's the one used by all NVIDIA products in the GeForce 6 and 7 series product lines. SM3.0's key features include a more CPU-like programming model for pixel shaders, the most powerful computational units on a GPU. SM3.0 pixel shaders must be able to execute longer programs, and they must support dynamic flow control within those programs, meaning things such as looping and branching with conditionals. These pixel shaders must also do their computational work with 128-bit floating-point precision: 32 bits of floating-point precision for each of the red, green, blue, and alpha channels.

ATI's new GPUs support all of these things, including 32-bit precision per color channel. That's a step up in precision from ATI's previous DirectX 9-class graphics processors, all of which did internal pixel shader calculations with 24 bits of FP precision. Unlike NVIDIA's recent GPUs, the R500 series' pixel shaders will not accept a "partial precision" hint from the programmer and cut back pixel shader precision to 16 bits per channel for some calculations in order to save on resources like internal chip register space. Instead, R500 GPUs do all pixel shader calculations with 32-bit precision. The GPU can, of course, store data in lower precision texture formats, but the internal pixel shader precision doesn't change.
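To give a rough feel for what is at stake between 16 and 32 bits per channel, here is a minimal Python sketch that stores a value in each format and measures the rounding error. It assumes IEEE 754 half precision is a fair stand-in for a 16-bit "partial precision" shader format. The helper names and the sample value are ours, purely for illustration, and ATI's older 24-bit format has no standard equivalent in Python, so it isn't covered.

```python
import struct

def round_trip(value: float, fmt: str) -> float:
    """Pack a Python float into the given IEEE 754 format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def relative_error(value: float, fmt: str) -> float:
    """Relative rounding error introduced by storing `value` in `fmt`."""
    return abs(round_trip(value, fmt) - value) / abs(value)

# "<e" is IEEE half precision (16 bits), "<f" is single precision (32 bits).
# Using IEEE half as a proxy for a GPU's 16-bit format is an assumption.
FORMATS = {"FP16": "<e", "FP32": "<f"}

sample = 0.1  # arbitrary illustrative value, not taken from any real shader
for name, fmt in FORMATS.items():
    print(f"{name}: stored as {round_trip(sample, fmt)!r}, "
          f"relative error {relative_error(sample, fmt):.2e}")
```

The error from a single 16-bit store is small, but it is several orders of magnitude larger than the 32-bit error, and such errors can build up as results feed into later calculations.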

The move from 24 to 32 bits of precision establishes a nice baseline for the future, but virtually no applications have yet layered enough rendering passes on top of one another to cause 24-bit precision to become a problem. As we have learned over the life of the GeForce 6 and 7 series, Shader Model 3.0's true value doesn't come in the form of visual improvements over the more common SM2.0, but in higher performance. The same sort of visual effects possible in SM3.0 are generally possible in 2.0, but they're not always possible in real time. Through judicious use of longer shader programs with looping and dynamic branching, applications may use SM3.0 to enable new forms of eye candy.

In order to take best advantage of Shader Model 3.0's capabilities, ATI has equipped the R520's pixel shader engine with the scheduling and control logic capable of handling up to 512 parallel threads. Threads are important in modern graphics architectures because they're used to keep a GPU's many execution resources well fed; having lots of threads standing by for execution allows the GPU to mask latency by doing other work while waiting for something relatively slow to happen, such as a texture access. A Shader Model 3.0 GPU may have to wait for the result of a conditional (such as an if-else statement) to be returned before proceeding with execution of a dependent branch, so such latency masking becomes even more important with the addition of dynamic flow control.
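The latency-masking argument can be sketched with a toy model. The Python below is our own simplification, not a description of how the R520's scheduler works: it assumes every thread alternates between a fixed burst of math and a fixed wait, and that a single ALU can switch between threads at no cost. The cycle counts are invented for illustration.

```python
def alu_utilization(threads: int, alu_cycles: int, stall_cycles: int) -> float:
    """Share of cycles the ALU stays busy in a toy round-robin model.

    Each thread repeats: `alu_cycles` of math, then `stall_cycles` of waiting
    (on a texture fetch, say). While one thread waits, another may run.
    """
    if threads < 1 or alu_cycles < 1 or stall_cycles < 0:
        raise ValueError("need threads >= 1, alu_cycles >= 1, stall_cycles >= 0")
    # Each thread can occupy the ALU for alu_cycles out of every
    # (alu_cycles + stall_cycles); the ALU cannot be more than 100% busy.
    demand = threads * alu_cycles / (alu_cycles + stall_cycles)
    return min(1.0, demand)

# Hypothetical numbers: 4 cycles of math per 96-cycle texture wait.
for n in (1, 8, 25, 512):
    print(f"{n:4d} threads -> ALU busy {alu_utilization(n, 4, 96):.0%}")
```

In this model, a lone thread leaves the ALU idle almost all the time, and utilization climbs linearly with thread count until the waits are fully covered. Longer or less predictable waits, such as those introduced by dynamic branching, push the required thread count higher.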

A block diagram of the R520 architecture (Source: ATI)

Despite the ability for programs to branch and loop, though, Shader Model 3.0 GPUs retain their parallelism, and that pulls against the efficient execution of "branchy" code. Pixels (or more appropriately, fragments) are processed together in blocks, and when pixels in the same block take different forks of a branch, all pixels must traverse both forks of that branch. (The ones not affected by the active branch are simply masked out during processing.) In the R520, pixels are grouped into threads in four-by-four blocks, which ATI says is much finer grained threading than in competing GPUs.

To illustrate the improved efficiency of its architecture, ATI offers this example of a shadow mapping algorithm using an if-else statement:

A GPU with large thread sizes must send lots of pixels down both sides of a branch, and thus it doesn't realize the benefits of dynamic flow control. A four-by-four block, like the one used in the R520, is much more efficient by comparison.
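The effect of block size on branching cost can be approximated with a short Python sketch. This is our own toy model, not ATI's example: the 256-by-256 render target, the diagonal shadow edge, and the function names are all hypothetical. It counts the share of pixels that sit in blocks where the branch goes both ways, since those are the pixels forced down both forks.

```python
def divergent_fraction(width, height, block, in_shadow):
    """Fraction of pixels in blocks that contain both outcomes of the branch."""
    divergent_pixels = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            y_end = min(by + block, height)
            x_end = min(bx + block, width)
            # Collect the distinct branch outcomes seen inside this block.
            outcomes = {in_shadow(x, y)
                        for y in range(by, y_end)
                        for x in range(bx, x_end)}
            if len(outcomes) == 2:
                # Every pixel in a mixed block pays for both forks.
                divergent_pixels += (y_end - by) * (x_end - bx)
    return divergent_pixels / (width * height)

def diagonal_shadow(x, y):
    """Hypothetical shadow mask: pixels below the main diagonal are in shadow."""
    return y > x

for block in (4, 16, 64):
    share = divergent_fraction(256, 256, block, diagonal_shadow)
    print(f"{block}x{block} blocks: {share:.2%} of pixels take both forks")
```

With this particular mask, only the blocks straddling the diagonal are mixed, so the penalized share grows in direct proportion to the block's edge length. Real shadow edges are messier, but the trend is the same: the coarser the block, the more pixels get dragged through work they don't need.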

Comparing the scheduling and threading capabilities of R520 to the competition isn't the easiest thing to do, because NVIDIA hasn't offered quite as much detail about exactly how its GPUs do things. NVIDIA seems to rely more on software, including the real-time compiler in its graphics drivers, to assist with scheduling. Yet NVIDIA says that its GPUs do indeed have logic on board for management of threads and branching, and that they keep "hundreds" of threads in flight in order to mask latency. As for threading granularity, some clever folks have tested the NV40 and G70 and concluded that the NV40 handles pixels in blocks of 4096, while the G70 uses blocks of 1024. NVIDIA claims those numbers aren't entirely correct, however, pegging the NV40's thread size at around 880 pixels and the G70's at roughly a quarter of that. In fact, the G70's pipeline structure was altered to allow for finer-grained flow control. The efficiency difference between groups of 200-some pixels and 16 pixels in ATI's example above is pretty stark, but how often and how much this difference between the G70 and R520 will matter will depend on how developers use flow control in their shaders.