Who could blame us, really, for being impatient? The GeForce 8800 is a stunning achievement, and we’re eager to see whether AMD can match it. You’ll have to forgive the most eager among us, the hollow-eyed Radeon fanboys inhabiting the depths of our forums, wandering aimlessly while carrying their near-empty bottles of X1000-series eye candy and stopping periodically to endure an episode of the shakes. We’ve all heard the stories about AMD’s new GPU, code-named R600, and wondered what manner of chip it might be. We’ve heard whispers of jaw-dropping potential, but, especially as the delays piled up, doubts crept in as well.
Happily, R600 is at last ready to roll. The Radeon HD 2900 XT graphics card should hit the shelves of online stores today, and we have spent the past couple of weeks dissecting it. Has AMD managed to deliver the goods? Keep reading for our in-depth review of the Radeon HD 2900 XT.
Into the R600
The R600 is easily the biggest technology leap from the Radeon folks since the release of the seminal R300 GPU in the Radeon 9700, and it’s also the first Radeon since that represents a true break from R300 technology. That’s due in part to the fact that R600 is designed to work in concert with Microsoft’s DirectX 10 graphics programming interface, which modifies the traditional graphics pipeline to unlock more programmability and flexibility. As a state-of-the-art GPU, the R600 is also a tremendously powerful parallel computing engine. We’re going to look at some aspects of R600 in some detail, but let’s start with an overview of the entire chip, so we have a basis for the rest of our discussion.
The R600’s most fundamental innovation is the introduction of a unified shader architecture that can process the three types of graphics programs (pixel shaders, vertex shaders, and geometry shaders) established by DX10’s Shader Model 4.0 using a single type of processing unit. This arrangement allows for dynamic load balancing between these three thread types, making it possible for R600 to bring the majority of its processing power to bear on the most urgent computational need at hand during the rendering of a frame. In theory, a unified shader architecture can be vastly more efficient and effective than a GPU with fixed shader types, as all DX9-class (and prior) desktop GPUs were.
A high-level diagram of the R600 architecture like the one above will no doubt evoke memories of ATI’s first unified shader architecture, the Xenos GPU inside the Xbox 360. The basic arrangement of functional units looks very similar, but R600 is in fact a new and different design in key respects like shader architecture and thread dispatch. One might also wish to draw parallels to the unified shader architecture of Nvidia’s G80 GPU, but the R600 arranges its execution resources quite differently from G80, as well. In its GeForce 8800 GTX incarnation, the G80 has 128 scalar stream processors running at 1.35GHz. The R600 is more parallel and runs at lower frequencies; AMD counts 320 stream processors running at 742MHz on the Radeon HD 2900 XT. That’s not an inaccurate portrayal of the GPU’s structure, but there’s much more to it than that, as we’ll discuss shortly.
First, though, let’s have a look at the R600 chip itself, because, well, see for yourself.
Like the G80, it’s frickin’ huge. With the cooler removed, you can see it from space. AMD estimates the chip at 700 million transistors, and TSMC packs those transistors onto a die using an 80nm fab process. I measured the R600 at roughly 21 mm by 20 mm, which works out to 420 mm².
I’d like to give you a side-by-side comparison with the G80, but that chip is typically covered by a metal cap, making pictures and measurements difficult. (Yes, I probably should sacrifice a card for science, but I haven’t done it yet.) Nvidia says the G80 has 680 million transistors, and it’s produced on a larger 90nm fab process at TSMC. I’ve seen die size estimates for G80 that range from roughly 420 to 490 mm², although Nvidia won’t confirm exact numbers. R600, however, doesn’t have to rely on a separate chip to provide display logic, so it’s almost certainly smaller overall.
Command processing, setup, and dispatch
I continue to be amazed by the growing amount of disclosure we get from AMD and Nvidia as they introduce ever more complex GPUs, and R600 is no exception on that front. At its R600 media event, AMD chip architect Eric Demers gave the assembled press a whirlwind tour of the GPU, most of which whizzed wholly unimpeded over our heads. I’ll try to distill down the bits I caught with as much accuracy as I can.
Our tour of the R600 began, appropriately, with the GPU’s command processor. Demers said previous Radeons have also had logic to process the command stream from the graphics driver, but on the R600, this is actually a processor; it has memory, can handle math, and downloads microcode every time it boots up. The reason this command processor is so robust is so it can offload work from the graphics driver. In keeping with a DirectX 10 theme, it’s intended to reduce state management overhead. DirectX 9 tends to group work in lots of small batches, creating substantial overhead just to manage all of the objects in a scene. That work typically falls to the graphics driver, burdening the CPU. Demers described the R600 command processor as “somewhat self-aware,” snooping to determine and manage state itself. The result? A claimed reduction in CPU overhead of up to 30% in DirectX 9 applications, with even less overhead in DX10.
Next in line beyond the command processor is the setup engine, which prepares data for processing. It has three functions for DX10’s three shader program types: vertex assembly (for vertex shaders), geometry assembly (for geometry shaders), and scan conversion and interpolation (for pixel shaders). Each function can submit threads to the dispatch processor.
One item of note near the vertex assembler is a dedicated hardware engine for tessellation. This unit is a bit of secret sauce for AMD, since the G80 doesn’t have anything quite like it. The tessellator allows for the use of very high-polygon surfaces with a minimal memory footprint by using a form of compression. This hardware takes two inputs, a low-poly model and a mathematical description of a curved surface, and outputs a very detailed, high-poly model. AMD’s Natalya Tatarchuk showed a jaw-dropping demo of the tessellator in action, during which I kept thinking to myself, “Man, I wish she’d switch to wireframe mode so I could see what’s going on.” Until I realized the thing was in wireframe mode, and the almost-solid object I was seeing was composed of millions of polygons nearly the size of a pixel.
This tessellator may live in a bit of an odd place for this generation of hardware. It’s not a part of the DirectX 10 spec, but AMD will expose it via vertex shader calls for developers who wish to use it. We’ve seen such features go largely unused in the past, but AMD thinks we might see games ported from the Xbox 360 using this hardware since the Xbox 360 GPU has a similar tessellator unit. Also, tessellation capabilities are a part of Microsoft’s direction for future incarnations of DirectX, and AMD says it’s committed to this feature for the long term (unlike the ill-fated Truform feature that it built into the original Radeon hardware, only to abandon it in the subsequent generation). We’ll have to see whether game developers use it.
The setup engine passes data to the R600’s threaded dispatch processor. This part of the GPU, as Demers put it, “is where the magic is.” Its job is to keep all of the shader cores occupied, which it does by managing a large number of threads of three different types (vertex, geometry, and pixel shaders) and switching between them. The R600’s dispatch processor keeps track of “hundreds of threads” in flight at any given time, dynamically deciding which ones should execute and which ones should go to sleep depending on the work being queued, the availability of data requested from memory, and the like. By keeping a large number of threads in waiting, it can switch from one to another as needed in order to keep the shader processors busy.
The thread dispatch process involves multiple levels of arbitration between the three thread types in waiting and the work already being done. Each of the R600’s four SIMD arrays of shader processors has two arbiter units associated with it, as the diagram shows, and each one of those has a sequencer attached. The arbiter decides which thread to process next based on a range of variables, and the sequencer then determines the best ordering of instructions for execution of that thread. The SIMD arrays are pipelined, and the two arbiters per SIMD allow for execution of two different threads in interleaved fashion. Notice, also, that vertex and texture fetches have their own arbiters, so they can run independently of shader ops.
As you may be gathering, this dispatch processor involves lots of complexity and a good deal of mystery about its exact operation, as well. Robust thread handling is the reason why GPUs are very effective parallel computing devices, because they can keep themselves very well occupied. If a thread has to stop and wait for the retrieval of data from memory, which can take hundreds of GPU cycles, other threads are ready and waiting to execute in the interim. This logic almost has to occupy substantial amounts of chip area, since the dispatch processor must keep track of all of the threads in flight and make “smart” decisions about what to do next.
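To put rough numbers on why so many threads need to be kept in waiting, here is a toy model of latency hiding. It is only a sketch: it assumes every thread alternates between a fixed burst of ALU work and a fixed memory wait, and the cycle counts are made-up illustrations, not figures from AMD. The function names are our own.

```python
import math


def alu_utilization(threads, compute_cycles, memory_latency):
    """Fraction of cycles the ALUs stay busy in a toy latency-hiding model.

    Assumption: each thread repeats `compute_cycles` of math followed by
    `memory_latency` cycles of waiting, and switching threads is free.
    """
    busy_fraction_per_thread = compute_cycles / (compute_cycles + memory_latency)
    return min(1.0, threads * busy_fraction_per_thread)


def threads_to_saturate(compute_cycles, memory_latency):
    """Smallest thread count that keeps the ALUs fully busy in this model."""
    return math.ceil((compute_cycles + memory_latency) / compute_cycles)


# Hypothetical workload: 4 cycles of math per 200-cycle memory fetch.
for count in (1, 8, 32, 64):
    print(count, f"{alu_utilization(count, 4, 200):.0%}")

print(threads_to_saturate(4, 200))  # 51
```

Even in this simplified picture, a single thread leaves the ALUs idle almost all of the time, and it takes dozens of threads in flight before the memory wait is fully covered.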
In its shader core, the R600’s most basic unit is a stream processing block like the one depicted in the diagram on the right. This unit has five arithmetic logic units (ALUs), arranged together in superscalar fashion; that is, each of the ALUs can execute a different instruction, but the instructions must all be issued together at once. You’ll notice that one of the five ALUs is “fat.” That’s because this ALU’s capabilities are a superset of the others’; it can be called on to handle transcendental instructions (like sine and cosine), as well. All four of the others have the same capabilities. Optimally, each of the five ALUs can execute a single multiply-add (MAD) instruction per clock on 32-bit floating-point data. (Like G80, the R600 essentially meets IEEE 754 standards for precision.) The stream processor block also includes a dedicated unit for branch execution, so the stream processors themselves don’t have to worry about flow control.
These stream processor blocks are arranged in arrays of 16 on the chip, for a SIMD (single instruction multiple data) arrangement, and are controlled via VLIW (very long instruction word) commands. At a basic level, that means as many as six instructions, five math and one for the branch unit, are grouped into a single instruction word. This one instruction word then controls all 16 execution blocks, which operate in parallel on similar data, be it pixels, vertices, or what have you.
The four SIMD arrays on the chip operate independently, so branch granularity is determined by the width of the SIMD and the depth of the pipeline. For pixel shaders, the effective “width” of the SIMD should typically be 16 pixels, since each stream processor block can process a single four-component pixel (with a fifth slot available for special functions or other tasks). The stream processor units are pipelined with eight cycles of latency, but as we’ve noted, they always execute two threads at once. That makes the effective instruction latency per thread four cycles, which brings us to 64 pixels of branch granularity for R600. Some other members of the R600 family have smaller SIMD arrays and thus finer branch granularity.
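The arithmetic behind that 64-pixel figure can be sketched in a few lines. This is a back-of-the-envelope model built only from the numbers above; the function and parameter names are our own invention, and the 8-wide SIMD in the second call is a hypothetical stand-in for a smaller family member, not a confirmed configuration.

```python
def branch_granularity(simd_width, pipeline_latency, interleaved_threads):
    """Pixels that end up sharing one branch decision, under a simple model.

    Assumption: granularity = SIMD width x effective per-thread latency,
    where the effective latency is the pipeline depth divided by the
    number of threads interleaved on the SIMD.
    """
    effective_latency = pipeline_latency // interleaved_threads
    return simd_width * effective_latency


# R600 as described in the text: 16 blocks per SIMD, 8 cycles, 2 threads.
r600 = branch_granularity(simd_width=16, pipeline_latency=8, interleaved_threads=2)
print(r600)  # 64

# Hypothetical smaller part with an 8-wide SIMD and the same pipeline.
print(branch_granularity(simd_width=8, pipeline_latency=8, interleaved_threads=2))  # 32
```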
Let’s stop and run some numbers so we can address the stream processor count claimed by AMD. Each SIMD on the R600 has 16 of these five-ALU-wide superscalar execution blocks. That’s a total of 80 ALUs per SIMD, and the R600 has four of those. Four times 80 is 320, and that’s where you get the “320 stream processors” number. Only it’s not quite that simple.
The superscalar VLIW design of the R600’s stream processor units presents some classic challenges. AMD’s compiler, a real-time compiler built into its graphics drivers, will have to work overtime to keep all five of those ALUs busy with work every cycle, if at all possible. That will be a challenge, especially because the chip cannot co-issue instructions when one is dependent on the results of the other. When executing shaders with few components and lots of dependencies, the R600 may operate at much less than its peak capacity. (Cue sounds of crashing metal and human screams alongside images of other VLIW designs like GeForce FX, Itanium, and Crusoe.)
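To make the scheduling problem concrete, here is a minimal sketch of packing scalar operations into five-slot instruction words. It is emphatically not AMD’s compiler: it is a simple greedy, in-order packer of our own devising, it ignores the branch slot and the special role of the “fat” ALU, and the instruction lists are invented for illustration.

```python
def pack_bundles(instructions, slots=5):
    """Greedily pack scalar ops into VLIW bundles of at most `slots` ops.

    `instructions` is a list of (name, [names it depends on]) in program
    order. An op may only be placed in a bundle strictly after every
    bundle that holds one of its dependencies.
    """
    bundles = []   # each bundle is a list of op names issued together
    placed = {}    # op name -> index of the bundle it landed in
    for name, deps in instructions:
        index = max((placed[d] + 1 for d in deps), default=0)
        # Skip forward past bundles that are already full.
        while index < len(bundles) and len(bundles[index]) >= slots:
            index += 1
        while len(bundles) <= index:
            bundles.append([])
        bundles[index].append(name)
        placed[name] = index
    return bundles


def utilization(bundles, slots=5):
    """Fraction of issue slots that hold real work."""
    return sum(len(b) for b in bundles) / (slots * len(bundles))


# Worst case: a serial chain where each op needs the previous result.
serial = [("op0", [])] + [(f"op{i}", [f"op{i - 1}"]) for i in range(1, 10)]
# Friendly case: ten fully independent ops.
parallel = [(f"op{i}", []) for i in range(10)]

print(len(pack_bundles(serial)), utilization(pack_bundles(serial)))
print(len(pack_bundles(parallel)), utilization(pack_bundles(parallel)))
```

In this sketch the dependent chain needs ten instruction words and fills only one slot in five, while the independent ops fit in two fully packed words. A real shader compiler has far more freedom to reorder and split work than this in-order packer does.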
The R600 has many things going for it, however, not least of which is the fact that the machine maps pretty well to graphics workloads, as one might expect. Vertex shader data often has five components and pixel shader data four, although graphics usage models are becoming more diverse as programmable shading takes off. The fact that the shader ALUs all have the same basic set of capabilities should help reduce scheduling complexity, as well.
Still, Nvidia has already begun crowing about how much more efficient and easier to utilize its scalar stream processors in the G80 are. For its part, AMD is talking about potential for big performance gains as its compiler matures. I expect this to be an ongoing rhetorical battle in this generation of GPU technology.
So how does R600’s shader power compare to G80? Both AMD and Nvidia like to throw around peak FLOPS numbers when talking about their chips. Mercifully, they both seem to have agreed to count programmable operations from the shader core, bracketing out fixed-function units for graphics-only operations. Nvidia has cited a peak FLOPS capacity for the GeForce 8800 GTX of 518.4 GFLOPS. The G80 can co-issue one MAD and one MUL instruction per clock to each of its 128 scalar SPs. That’s three operations (multiply-add and multiply) per cycle at 1.35GHz, or 518.4 GFLOPS. However, the guys at B3D have shown that that extra MUL is not always available, which makes counting it questionable. If you simply count the MAD, you get a peak of 345.6 GFLOPS for G80.
By comparison, the R600’s 320 stream processors running at 742MHz give it a peak capacity of 475 GFLOPS. Mike Houston, the GPGPU guru from Stanford, told us he had achieved an observed compute throughput of 470 GFLOPS on R600 with “just a giant MAD kernel.” So R600 seems capable of hitting something very near its peak throughput in the right situation. What happens in graphics and games, of course, may vary quite a bit from that.
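For anyone who wants to check the math, the peak numbers quoted above fall out of a one-line formula. This sketch counts a MAD as two floating-point operations and a MUL as one, which is the convention the vendors’ figures appear to follow; the helper name is our own.

```python
def peak_gflops(units, flops_per_unit_per_clock, clock_ghz):
    """Theoretical peak: every unit issues its maximum every clock."""
    return units * flops_per_unit_per_clock * clock_ghz


# R600: 4 SIMD arrays x 16 blocks x 5 ALUs, one MAD (2 flops) each per clock.
r600_alus = 4 * 16 * 5
r600 = peak_gflops(r600_alus, 2, 0.742)

# G80 (GeForce 8800 GTX): 128 scalar SPs at 1.35GHz.
g80_mad_plus_mul = peak_gflops(128, 3, 1.35)  # MAD + co-issued MUL
g80_mad_only = peak_gflops(128, 2, 1.35)      # MAD alone

print(r600_alus)                   # 320
print(round(r600, 1))              # 474.9
print(round(g80_mad_plus_mul, 1))  # 518.4
print(round(g80_mad_only, 1))      # 345.6
```

These are ceilings, not measurements; they assume every ALU is fed a MAD on every clock.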
The best way to solve these shader performance disputes, of course, is to test the chips. We have a few tests that may give us some insight into these matters.
The Radeon 2900 XT comes out looking good in 3DMark’s vertex shader tests, solidly ahead of the GeForce 8800 GTX. Oddly, though, we’ve seen similar or better performance in this test out of a mid-range GeForce 8600 GTS than we see here from the 8800 GTX. The GTX may be limited by other factors here or simply not allocating all of its shader power to vertex processing.
The tables turn in 3DMark’s pixel shader test, and the R600 ends up in a virtual dead heat with the GeForce 8800 GTS, a cut-down version of the G80 with only 96 stream processors.
This particle test runs a physics simulation in a shader, using vertex texture fetch to store and access the results. Here, the Radeon 2900 XT is slower than the 8800 GTX, but well ahead of the GTS. The Radeon X1950 XTX can’t participate since it lacks vertex texture fetch.
Futuremark says the Perlin noise test “computes six octaves of 3-dimensional Perlin simplex noise using a combination of arithmetic instructions and texture lookups.” They expect such things to become popular in future games for use in procedural modeling and texturing, although procedural texturing has always been right around the corner and never seems to make its way here. If and when it does, the R600 should be well prepared, because it runs this shader quite well.
Next up is a series of shaders in ye old ShaderMark, a test that’s been around forever but may yet offer some insights.
The Radeon HD 2900 XT lands somewhere north of the GeForce 8800 GTS, but it can’t match the full-fledged G80 in ShaderMark generally.
ShaderMark also gives us an intriguing look at image quality by quantifying how closely each graphics card’s output matches that of Microsoft’s reference rasterizer for DirectX 9. We can’t really quantify image quality, but this does tell us something about the computational precision and adherence to Microsoft’s standards in these GPUs.
DirectX 10 has much tighter standards for image quality, and these DX10-class GPUs are remarkably close together, both overall and in individual shaders.
Finally, here’s a last-minute addition to our shader tests courtesy of AMD. Apparently already aware of the trash talk going on about the potential scheduling pitfalls of its superscalar shading core, AMD sent out a simple set of DirectX 10 shader tests in order to prove a point. I decided to go ahead and run these tests and present you with the results, although the source of the benchmarks is not exactly an uninterested third party, to say the least. The results are informative, though, because they present some difficult scheduling cases for the R600 shader core. You can make of them what you will. First, the results, and then the test explanations:
The first thing to be said is that G80 again appears to be limited somehow in its vertex shader performance, as we saw with 3DMark’s vertex tests. That hasn’t yet been an issue for the G80 in real-world games, so I’d say the pixel shader results are the more interesting ones. Here are AMD’s explanations of the tests, edited and reformatted for brevity’s sake:
1) “float MAD serial” – Dependant Scalar Instructions Basically this test issues a bunch of scalar MAD instructions that are sequentially executed. This way only one out of 5 slot of the super-scalar instruction could be utilized. This is absolutely the worst case that would rarely be seen in the real-world shaders.
2) “float4 MAD parallel” – Vector Instructions This test issues 2 sequences of MAD instructions operating on float4 vectors. The smart compiler in the driver is able to split 4D vectors among multiple instructions to fill all 5 slots. This case represents one of the best utilization cases and is quite representative of instruction chains that would be seen in many shaders. This also demontrates [sic] the flexibility of the architecture where not only trivial case like 3+2 or 4+1 can be handled.
3) “float SQRT serial” – Special Function This is a test that utilizes the 5th “supped up” [sic] scalar instruction slot that can execute regular (ADD, MUL, and etc.) instructions along with transcendental instructions.
4) “float 5-instruction issue” – Non Dependant Scalar Instructions This test has 5 different types of scalar instructions (MUL, MAD, MIN, MAX, SQRT), each with it’s own operand data, that are co-issued into one super-scalar instruction. This represents a typical case where in-driver shader compiler is able to co-issue instructions for maximal efficiency. This again shows how efficiently instructions can be combined by the shader compiler.
5) “int MAD serial” – Dependant DX10 Integer Instructions This test shows the worst case scalar instruction issue with sequential execution. This is similar to test 1, but uses integer instructions instead of floating point ones.
6) “int4 MAD parallel” – DX10 Integer Vector Instructions Similar to test 2, however integer instructions are used instead of floating point ones.
The GeForce 8800 GTX is just under three times the speed of the Radeon HD 2900 XT in AMD’s own worst-case scenario, the float MAD serial with dependencies preventing superscalar parallelism. From there, the R600 begins to look better. The example of the float4 MAD parallel is impressive, since AMD’s compiler does appear to be making good use of the R600’s potential when compared to G80. The next two floating-point tests make use of the “fat” ALU in the R600, and so the R600 looks quite good.
We get the point, I think. Computationally, the R600 can be formidable. One worry is that these shaders look to be executing pure math, with no texture lookups. We should probably talk about texturing rather than dwell on these results.
Texturing and memory bandwidth
AMD has endowed the R600 with four texture units that operate independently of the chip’s shader core. The R600’s texture units and total texture addressing and filtering capacity look similar to the Radeon X1950 XTX’s, but with some notable improvements. Those improvements include the ability to filter FP16-format textures, popular for high dynamic range lighting, at full speed (16 pixels per clock) and FP32 textures at half speed. The R600 can do trilinear and anisotropic filtering for all formats. The Radeon X1950 XTX couldn’t handle these texture formats in its filtering hardware and had to resort to its pixel shaders instead, so AMD estimates R600 is roughly seven times the speed of its predecessor in this respect.
Like the Radeon X1950 XTX, each R600 texture unit can grab an additional four unfiltered textures per clock from memory using its fetch4 ability, which is the reason you see the four teensy additional texture address processors and texture samplers in the diagram above. This additional capacity to grab data from memory can be useful for certain tasks like shadowing or stream computing applications.
The texture units can access several of the GPU’s many caches, as appropriate, including the L1 texture cache, the vertex cache (32KB), and the L2 texture cache (256KB).
The memory controller in the R600 is evolved from the one in the R580. Demers said this one is a fully distributed ring bus, not a hybrid like the R580’s. Demers asserts the ring bus is simpler to design and easier to adapt to new products than the more commonly used crossbar arrangement. The R600’s ring is comprised of four sets of wires running around the chip in read/write pairs, for a total of about 2000 wires and 1024 bits of communication capacity. The ring bus has about 84 read clients and 70 write clients inside the chip, and PCI Express is just one of the many ring stops, as are the eight 64-bit channels to local memory.
In case I caught you snoozing, I said eight 64-bit channels. That works out to a 512-bit-wide path to memory, well above the 384 bits of the G80. What does all of this mean to the Radeon HD 2900 XT?
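|Card||Core clock (MHz)||Pixels/clock||Peak pixel fill rate (Gpixels/s)||Textures/clock||Peak texel fill rate (Gtexels/s)||FP16 textures/clock||Peak FP16 filtering rate (Gtexels/s)||Effective memory clock (MHz)||Memory bus width (bits)||Peak memory bandwidth (GB/s)|
|Radeon X1950 XTX||650||16||10.4||16||10.4||–||–||2000||256||64.0|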
|GeForce 8800 GTS||500||20||10.0||24||12.0||24||12.0||1600||320||64.0|
|GeForce 8800 GTX||575||24||13.8||32||18.4||32||18.4||1800||384||86.4|
|GeForce 8800 Ultra||612||24||14.7||32||19.6||32||19.6||2160||384||103.7|
|Radeon HD 2900 XT||742||16||11.9||16||11.9||16||11.9||1650||512||105.6|
With 512MB of GDDR3 memory running at 1.65GHz, the Radeon HD 2900 XT has a torrential 105.6 GB/s of peak memory bandwidth, higher than that of the GeForce 8800 Ultra and well above the GTX or GTS. Yet its peak multitextured fill rate is only about 12 Gtexels/s, close to that of the GeForce 8800 GTS and well behind the GTX. AMD seems to have been pleased with the basic fill rate and filtering capabilities of the Radeon X1950 XTX and chose only to extend them in R600 to include HDR texture formats. Texturing is indeed becoming less important as programmable shaders gain traction, but many of those shaders store or access data in textures, which is a concern. The Radeon HD 2900 XT trails Nvidia’s fastest graphics cards by miles here, despite having a wider path to memory.
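The peak figures in the table are simple products of clocks and widths, and a quick sketch reproduces them. The clock speeds, per-clock texture rates, and bus widths below are copied from the table; the dictionary layout and function names are our own, and we are assuming the per-clock figures are textures per clock.

```python
# name: (core MHz, textures per clock, effective memory MHz, bus width in bits)
CARDS = {
    "Radeon X1950 XTX":   (650, 16, 2000, 256),
    "GeForce 8800 GTS":   (500, 24, 1600, 320),
    "GeForce 8800 GTX":   (575, 32, 1800, 384),
    "GeForce 8800 Ultra": (612, 32, 2160, 384),
    "Radeon HD 2900 XT":  (742, 16, 1650, 512),
}


def texel_rate_gtexels(core_mhz, textures_per_clock):
    """Peak multitextured fill rate in Gtexels/s."""
    return core_mhz * textures_per_clock / 1000


def bandwidth_gb_s(effective_mem_mhz, bus_width_bits):
    """Peak memory bandwidth in GB/s (bits to bytes, MHz to GHz)."""
    return effective_mem_mhz * bus_width_bits / 8 / 1000


for name, (core, tex, mem, bus) in CARDS.items():
    print(f"{name}: {texel_rate_gtexels(core, tex):.1f} Gtexels/s, "
          f"{bandwidth_gb_s(mem, bus):.1f} GB/s")
```

Run through the numbers and the imbalance is plain: the Radeon HD 2900 XT has the most bandwidth of the group but roughly the texel rate of the GeForce 8800 GTS.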
Here are a few quick texture fill rate and filtering tests to see how these theoretical peak numbers play out.
The Radeon HD 2900 XT gets closer than any of the other cards to its theoretical maximum pixel fill rate, probably because it has sufficient memory bandwidth to make that happen. When we switch to multitexturing, the chips reach very near their theoretical limits, which puts the Radeon HD 2900 XT just behind the GeForce 8800 GTS. These are not FP16 textures, so the Radeon X1950 XTX performs reasonably well, too.
Texture filtering quality and performance
It’s time to enter the psychedelic tunnel once again and see how these GPUs handle anisotropic filtering. The images below are output from the D3D AF tester, and what you’re basically doing is looking down a 3D rendered tube with a checkerboard pattern applied. The colored bands indicate different mip-map levels, and you can see that the GPUs vary the level of detail they’re using depending on the angle of the surface.
|Radeon X1950 XTX||Radeon HD 2900 XT||GeForce 8800 GTX|
The Radeon X1950 XTX, err, cheats quite a bit at certain angles of inclination. Flat floors and walls get good treatment, but other surfaces do not. Nvidia did the same thing with the GeForce 7 series, but they banished this trick in G80, which produces a nice, nearly round pattern. To match it, AMD has instituted a tighter pattern in the Radeon HD 2900 XT, which is the same as the “high quality” option on the X1000 series. In fact, AMD has simply removed this lower quality choice from the R600’s repertoire.
If none of this makes any sense to you, perhaps an illustration will help. Here’s a screenshot from Half-Life 2 that shows what happens when the angle of a surface goes the wrong way on each of these GPUs.
|Radeon X1950 XTX|
|Radeon HD 2900 XT|
|GeForce 8800 GTX|
The flat surface looks great on all three, but things turn to mush at a different angle on the Radeon X1950 XTX. Fortunately, the newer GPUs avoid this nastiness.
|Radeon X1950 XTX||Radeon HD 2900 XT (default)||GeForce 8800 GTX|
Here’s a look at the high-quality settings for the X1950 XTX and 8800 GTX alongside the HD 2900 XT’s one-and-only option. As you can tell, AMD just uses the high-quality algorithm from the R580 at all times on the R600. This algorithm produces good results, but it’s not quite as close to perfect as the G80. Look at the practical impact in our example, though.
|Radeon X1950 XTX|
|Radeon HD 2900 XT (default)|
|GeForce 8800 GTX|
All three GPUs produce very similar results. The colored test patterns do suggest the R600 is a little weak at 45° angles of inclination. I tried to capture an example of this weakness in our Half-Life 2 sample scene by changing the angle a bit, but honestly, I couldn’t see it. I later tried in other games, again to no avail.
So in my book, the off-angle aniso optimization is effectively dead, and thank goodness. That doesn’t mean I’m entirely pleased with the state of texture filtering. It looks to me like AMD has retained the same adaptive trilinear filtering algorithm in R600 as in its previous GPUs, with no substantial changes. That means the same quirks are carried over. The G80’s texture filtering may be a little better, but I’m not entirely decided on that issue. Maybe its particular quirks are just newer. Many of the remaining problems with both algorithms are motion-based and difficult to capture with a screenshot, so I’m going to have to invent a new way to complain.
How does all of this high quality texture filtering impact performance? Here’s a look. It’s not FP16 filtering, unfortunately, but it’s still useful info.
Uh oh. D3D RightMark shows us how the GPUs scale by filtering type, and the story is a rough one for AMD. The Radeon HD 2900 XT starts out more or less as expected but falls increasingly behind the GeForce 8800 GTS as the filtering complexity increases.
Render back-ends and antialiasing
To the right is a logical diagram of one of the R600’s render back-ends. (Nvidia calls these ROPs, if you’re wondering.) The R600 packs four of these units, and they work pretty much as you’d expect from the diagram. They can output four pixels per clock to the frame buffer and can process depth and stencil tests at twice that rate. Among the improvements from R580 are higher peak rates of Z and stencil compression, some improvements to common Z-buffer optimizations, and the ability to use FP32-format Z buffers for higher depth precision.
The render back-ends are also traditionally the place where the resolve process for multisampled antialiasing happens. AMD has carried over all of the previous antialiasing goodness of its prior chips in R600, including gamma-correct blends, programmable sample patterns, temporal AA, and Super AA modes for CrossFire load balancing. The R600 trades the older GPU’s 6X multisampling mode for a new 8X mode that, duh, offers higher quality by virtue of more samples. I’ve added the R600’s default sample patterns to my Giant Chart of AA Sample Patterns, producing the following glorious cornucopia of colored dots. As always, the green dots represent texture/shader samples, pink dots represent color/Z and coverage samples, and teensy red dots represent coverage samples alone (for Nvidia’s CSAA modes).
I’ve included the Radeon HD 2900 XT’s CrossFire SuperAA mode in a separate column, although SuperAA is presently limited to a single mode on the 2900 XT. I’ve also included composite sample patterns for the 2900 XT’s temporal AA modes. These sample patterns actually occur in two halves over the course of two frames whenever frame rates go above 60 FPS. My current assessment of temporal AA: meh. It sounded like a good idea at the time, but AMD could spike it and I’d be happy.
And so the grand table adds the R600’s distinctiveness to its own. As ever, AMD has used a nice quasi-random pattern in the R600’s new 8X multisampled mode.
So that’s part of the story. After seeing Nvidia’s very smart coverage sample antialiasing technique in the G80, I had doubts about whether AMD could answer with something as good and innovative itself. To recap in a nutshell, coverage sampled AA does what it appears to do in the table above: stores more samples to determine polygon coverage while discarding color/Z samples it doesn’t necessarily need. That keeps its memory footprint and performance overhead low, yet it generally produces good results, as you’ll see in the examples on the following pages.
AMD’s answer to coverage sampled AA is made possible by the fact that the render back-ends in the R600 can now quickly pass data back to the shaders, and that leads to AMD’s latest innovation: custom filter antialiasing. The essence of CFAA is that R600 can run a multitude of antialiasing filters, with a programmable resolve stage, allowing for all kinds of new and different AA voodoo. That voodoo starts with a couple of new filters AMD has included with the first round of Radeon HD drivers: a pair of tent filters. Unlike the traditional box filter, these tent filters reach outside of the bounds of the pixel to grab extra samples. Here are a couple of examples, with narrow and wide tent filters using the Radeon HD’s 8X sample pattern, from AMD.
The narrow tent grabs a single sample from each neighboring pixel, while the wide tent grabs two. That leads to an effective sample size of 12X for the narrow tent and 16X for the wide tent. The HD 2900 XT can also combine narrow and wide tent filters with its 2X and 4X AA modes for effective sample sizes of 4X, 6X, 6X again, and 8X.
Those of you who are old-school PC graphics guys like me may be having some serious, gut-wrenching flashbacks right now to Nvidia’s screen-blurring Quincunx mode from GeForces of old. These tent filters are fairly smart about how they go about their business, though; they compute a weighted average of the samples based on a linear function that decreases the weight of samples further from the pixel center. Tent filters do introduce a measure of blurring across the whole screen, but the effect is very subtle, as you can see in the example below. The base AA mode is 8X multisampled.
|Box – 8X MSAA||Narrow tent – 12X||Wide tent – 16X|
The blurring is most obvious in the text, but it is in fact a full-scene affair. Look at the leaves on the sidewalk below the park bench, the bricks and windowpanes of the building behind, or the cobblestone texture on the street. The tent filters blur all of these things subtly, which leads to a tradeoff: images aren’t as sharp, but high-frequency “pixelation” is reduced throughout the scene.
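For the curious, here is a minimal sketch of the kind of weighted average a tent filter computes, next to a plain box filter. The sample positions, colors, and filter radius are illustrative assumptions of mine, not AMD's actual sample patterns or weights, and the function names are hypothetical.

```python
# Sketch of a tent-filter resolve: a weighted average in which a sample's
# weight falls off linearly with its distance from the pixel center.
# Sample positions, values, and the radius are illustrative assumptions,
# not AMD's actual sample patterns or filter weights.
from math import hypot


def tent_weight(dx, dy, radius):
    """Linear falloff: 1.0 at the pixel center, 0.0 at `radius` and beyond."""
    return max(0.0, 1.0 - hypot(dx, dy) / radius)


def tent_resolve(samples, radius):
    """samples: list of (dx, dy, value); offsets are in pixels from the center."""
    weights = [tent_weight(dx, dy, radius) for dx, dy, _ in samples]
    weighted = sum(w * v for w, (_, _, v) in zip(weights, samples))
    return weighted / sum(weights)


def box_resolve(samples):
    """Traditional box filter: plain average of the samples inside the pixel."""
    inside = [v for dx, dy, v in samples if abs(dx) <= 0.5 and abs(dy) <= 0.5]
    return sum(inside) / len(inside)


# Four white samples inside the pixel, plus two dark samples borrowed
# from the neighboring pixel to the right.
samples = [
    (-0.25, -0.25, 1.0), (0.25, -0.25, 1.0),
    (-0.25, 0.25, 1.0), (0.25, 0.25, 1.0),
    (0.75, -0.25, 0.0), (0.75, 0.25, 0.0),
]

print(box_resolve(samples))                # the neighbor samples are ignored
print(tent_resolve(samples, radius=1.0))   # the neighbor samples darken the result a little
```

Because the borrowed samples sit farther from the pixel center, they get small weights, which is why the blur stays subtle rather than smearing the whole image.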
Frankly, I was all set not to like CFAA’s tent filters when I first heard about them. They make things blurry, don’t involve clever tricks like Nvidia’s coverage sampling, and hey, Quincunx sucked. But here’s the thing: I really like them. It’s hard to argue with results, and CFAA’s tent filters do some important things well. Have a look at this example shot from Oblivion.
|GeForce 8800 GTS||GeForce 8800 GTS||Radeon HD 2900 XT, CFAA 8X – 4X MSAA + Wide tent|
This CFAA mode with 8 samples produces extremely clean edges and does an excellent job of resolving very fine geometry, like the tips of the spires on the cathedral. Even 16X CSAA can’t match it. Also, have a look at the tree leaves in these shots. They use alpha transparency, and I don’t have transparency AA enabled, so you see some jagged edges on the GeForce 8800. The wide tent filter’s subtle blending takes care of these edges, even without transparency AA.
You may not be convinced yet, and I don’t blame you. CFAA’s tent filters may not be for everyone. I would encourage you to try them, though, before writing them off. There is ample theoretical backing for the effectiveness of tent filters, and as with any AA method, much of their effectiveness must be seen in full motion in order to be properly appreciated. I prefer the 4X MSAA + wide tent filter to anything Nvidia offers, in spite of myself. I’ve found that it looks great on the 30″ wide-screen LCD attached to my GPU test rig. The reduction in high-frequency pixel noise is a good thing on a sharp LCD display; it adds a certain solidity to objects that just... works. Oblivion has never looked better on the PC than it does on the Radeon HD 2900 XT.
How does this AA voodoo perform, you ask? Here’s a test using one of 3DMark’s HDR tests, which uses FP16 texture formats.
Another feature of CFAA tent filters is that they have no additional memory footprint or sampling requirements, and in this case, that translates to almost no performance overhead. Ok, my graph here is hard to read, but if you look closely, you’ll see that CFAA’s narrow and wide tent filters don’t slow down the 2X and 4X MSAA modes on which they’re based. There is a performance penalty involved when they’re combined with 8X MSAA, but it’s not too punishing.
In its current state, then, the R600’s CFAA is an impressive answer to Nvidia’s CSAA, the, er, Quincunx smear aside. The thing about custom filters is that they can do many things, and AMD has big plans for them. They’re talking about a custom filter that runs an edge-detect pass on the entire image and then goes back and applies AA selectively. In fact, they even delivered a driver to us late in our testing along with a separate executable to enable this filter. Unfortunately, I wasn’t able to get it working in time to try it out. We’ll have to look at it later.
Oh, and it is possible that Nvidia could counter CFAA with some shader-based custom AA filters of its own, completely stealing AMD’s thunder. For the record, I’d wholeheartedly endorse that move.
Antialiasing image quality – GPUs side by side
Here’s a look at the Radeon HD 2900 XT’s edge AA image quality. These images come from Half-Life 2, and they’re blown up to 4X their normal size so you can see the pixel colors along the three angled edges shown prominently in this example.
|Radeon X1950 XTX||GeForce 8800 GTX||Radeon HD 2900 XT||Radeon HD 2900 XT, narrow tent filter||Radeon HD 2900 XT, wide tent filter|
|6X||6X CFAA||6X CFAA|
|SuperAA 8X||8X CSAA||8X CFAA|
|Super AA 12X||12X CFAA|
|Super AA 14X||16X CSAA||16X CFAA|
CFAA’s tent filters look reasonably good in these classic edge cases, as do AMD’s 8X multisampled and SuperAA 16X modes. I think our example from Oblivion on the previous page does a better job of showcasing the tent filters’ strengths, though.
Antialiasing image quality – Alpha transparency
Here’s one final AA image quality example, focused on the methods that AMD and Nvidia have devised to handle the tough case of textures with alpha transparency cutouts in them. Nvidia calls its method transparency AA and AMD calls its adaptive AA, but they are fundamentally similar. The scene below, again from Half-Life 2, has two examples of alpha-transparent textures: the leaves on the tree and the letters in the sign. 4X multisampling is enabled in all cases. The top row shows images without transparency AA enabled. The second row shows the lower-quality variants of transparency/adaptive AA from Nvidia and AMD, and the bottom row shows the highest quality option from each.
For what it’s worth, I took these screenshots on the Radeon HD 2900 XT with an updated driver that AMD says provides performance improvements in adaptive AA, so I believe this is a new algorithm.
|Alpha transparency antialiasing quality w/4X AA|
|Radeon X1950 XTX||GeForce 8800 GTX||Radeon HD 2900 XT||Radeon HD 2900 XT, narrow tent filter||Radeon HD 2900 XT, wide tent filter|
Looks like you can get away with the lower quality adaptive AA method on the Radeon HD 2900 XT. If you combine it with a tent filter, the results are pretty good.
Avivo HD video processing, display, and audio support
In addition to all of the new graphics goodness, the R600 brings with it some important new capabilities for high-definition displays and video playback. The most prominent among them is a brand-new video processor AMD has dubbed the UVD, for universal video decoder. The UVD is a dedicated processor with its own instruction and data caches, and it can accelerate key stages of the decompression and playback of HD video formats for both HD-DVD and Blu-ray, including H.264 and VC-1. Nvidia just recently introduced a pair of lower-end G80 derivatives with a new H.264 decode acceleration unit, but the AMD folks like to point out that it can’t do bitstream processing or entropy decode for videos in the VC-1 format. More importantly, perhaps, the G80 lacks this unit and cannot provide more-or-less “full” acceleration of even H.264 video playback. The R600’s UVD should allow it to play HD movies with much lower CPU utilization and power consumption than the G80, as a result. Update: Turns out the R600 lacks UVD acceleration, which is confined to lower-end Radeon HD GPUs. See our explanation here.
The display portion of the R600 can drive a pair of dual-link DVI ports for some insane maximum resolutions, and it can support HDCP over those DL-DVI connections, allowing monitors like my Dell 3007WFP to play back DRM-encrusted movies at the display’s full resolution. That’s the theory, at least; I have yet to try it. AMD has also embedded the HDCP crypto keys into the GPU, eliminating the need for an external crypto ROM chip.
Finally, as long rumored, the R600 includes a six-channel audio controller, but only for a single purpose: support of 5.1 audio over HDMI. Radeon HD cards won’t have any other form of analog or digital audio output.
The cards, specs, and prices
We’ve talked quite a bit about the R600 GPU without saying much about the card that’s based on it. As you may have gathered, the Radeon HD 2900 XT has a 742MHz R600 GPU onboard and 512MB of GDDR3 memory clocked at 825MHz (or 1650MHz effective). The card has a dual-slot cooler and is 9.5″ long, or just a little shorter than a GeForce 8800 GTX but longer than a GTS.
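Those memory clocks translate into peak bandwidth with some simple arithmetic, sketched below. The 1650MHz effective rate is from the paragraph above, and the 512-bit bus width is the figure cited in our conclusions. The GeForce 8800 GTS 640MB numbers (320-bit bus, 1600MHz effective) are my assumption based on Nvidia's published specs, included only for comparison.

```python
# Peak memory bandwidth = bus width in bytes x effective transfer rate.
# HD 2900 XT: 512-bit bus and 1650MHz effective rate, both from the article.
# 8800 GTS 640MB: 320-bit bus and 1600MHz effective are ASSUMED from
# Nvidia's published specs; they are not stated in this section.


def peak_bandwidth_gbs(bus_bits, effective_mhz):
    """Return peak bandwidth in GB/s, counting 1 GB as 10**9 bytes."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * effective_mhz * 1e6 / 1e9


hd2900xt = peak_bandwidth_gbs(512, 1650)
gts640 = peak_bandwidth_gbs(320, 1600)  # assumed specs

print(f"Radeon HD 2900 XT: {hd2900xt:.1f} GB/s")
print(f"GeForce 8800 GTS 640MB: {gts640:.1f} GB/s")
print(f"GTS as a share of the Radeon: {gts640 / hd2900xt:.1%}")
```

Under those assumptions, the GTS lands at roughly six-tenths of the Radeon's peak bandwidth. Peak figures say nothing about how efficiently either chip uses its memory, of course.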
The 2900 XT comes with two PCIe auxiliary power plugs on board, and one of the two is of the brand-new eight-pin variety. We were able to use our review unit with an older power supply by attaching two six-pin aux power connectors, but AMD has limited GPU overclocking to boards with an eight-pin connector attached. Also, we found that our 700W power supply wasn’t up to the task of powering a Radeon HD 2900 XT CrossFire rig. In order to achieve stability, we had to switch to a new Thermaltake 1kW PSU with a pair of eight-pin connectors that AMD supplied.
Like the Radeon X1950 Pro, the HD 2900 XT comes with a pair of internal CrossFire connectors onboard. The day of dongles is behind us, and the dual connectors may someday allow more than two cards to be teamed up in the systems of the filthy rich or criminally insane.
The board itself has two DVI connectors, but it can also support HDMI with audio via a plug adapter from AMD.
The Radeon HD 2900 XT is slated to become available today at online vendors for $399, and it will be bundled with a coupon for getting a trio of games from Valve via the Steam distribution service when they’re released: Half-Life 2: Episode Two, Portal, and Team Fortress 2.
And, well, that’s the whole plan. AMD has no higher-end products to announce; it’s just positioning the Radeon HD 2900 XT against the GeForce 8800 GTS at $399 and calling it good. Now that’s a remarkable change of strategy from the past, oh, five years of intense one-upsmanship. It seems AMD wasn’t quite able to extract sufficient performance from the R600 in its current state to challenge the GTX for the outright performance crown, so they decided to go for a price-performance win instead. CrossFire, they say, will serve the high end of the market.
That leaves the HD 2900 XT to contend with products like this “superclocked” EVGA version of the GeForce 8800 GTS that Nvidia sent us when it caught wind of AMD’s plans. (They can be aggressive that way, in case you hadn’t noticed.)
This card has a 575MHz core clock and 1.7GHz memory, a formidable boost from the stock 8800 GTS, and as I write it can be had for under $400 at online vendors. Nvidia got us this card (actually a pair of them) and some fancy new drivers for it after we were deep into our testing last week. As a result, we were only able to include it in a subset of our tests, and only in single-card mode. We expect to follow up with it later, since it does represent real products available now competing with the Radeon HD 2900 XT. We do have an “overclocked in the box” version of the GeForce 8800 GTS 320MB throughout our results, and it often outperforms the stock-clocked 640MB GTS at lower resolutions.
The rest of the family
Joining the Radeon HD 2900 XT shortly will be a family of products based on two new lower-end GPUs. Both of these chips will be DX10-compliant and derived from R600 technology, but both will be manufactured on TSMC’s 65nm fab process. Like the R600, they will have Avivo HD decode and playback acceleration and HDMI support with audio.
The mid-range variant is the GPU code-named RV630, and cards based on it will be in the Radeon HD 2600 lineup. The RV630 has an estimated 390M transistors, and this scaled-down R600 derivative has three SIMD arrays, each of which has eight stream processor units (or “120 stream processors” in AMD-speak). This GPU has two texture units, a 128KB L2 texture cache, one render back-end, and a 128-bit external memory interface. AMD plans Radeon HD 2600 Pro and XT cards ranging in price from $99 to $199. The most intriguing of those from an enthusiast standpoint will no doubt be the $199 2600 XT, pitted directly against the GeForce 8600 GTS. The XT will come with native CrossFire connectors, a single-slot cooler, and no PCIe auxiliary power connector.
The Radeon HD 2400 series will occupy the low end of the market, powered by the RV610 GPU. This 180 million transistor chip packs two SIMD arrays with four stream processing units each, for a total of 40 stream processors, as AMD likes to count ’em. It has a single texture unit and render back end, uses a shared texture and vertex cache, and has a 64-bit memory interface. Befitting its station in life, Radeon HD 2400 XT and Pro cards will sell for $99 and less. Some versions will ship with only a passive cooler like the one below.
You can imagine that puppy driving a giant television via an HDMI link in a silent HTPC box, no?
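If you want to check AMD's stream processor math for the RV630 and RV610, here is a quick sketch. The SIMD array and unit counts come from the descriptions above. The five ALUs per stream processing unit is an assumption carried over from AMD's description of the R600-family shader core, since this section doesn't repeat it.

```python
# AMD's "stream processor" count = SIMD arrays x units per array x ALUs per unit.
# Array and unit counts are from the article. ALUS_PER_UNIT = 5 is an
# ASSUMPTION based on AMD's description of the R600-family shader core.
ALUS_PER_UNIT = 5

GPUS = {
    "RV630": {"simd_arrays": 3, "units_per_array": 8},
    "RV610": {"simd_arrays": 2, "units_per_array": 4},
}


def stream_processors(simd_arrays, units_per_array, alus_per_unit=ALUS_PER_UNIT):
    """Return the ALU total that AMD markets as 'stream processors'."""
    return simd_arrays * units_per_array * alus_per_unit


for name, config in GPUS.items():
    print(name, stream_processors(**config))
```

With five ALUs per unit, the totals come out to the 120 and 40 stream processors AMD quotes for the two chips.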
The rest of the Radeon HD family is slated to join the 2900 XT on store shelves on July 1. AMD also has plans for a full mobility lineup based on Radeon HD tech, and those parts are due in July, as well. You may see the Mobility Radeon HD 2300 kicking around before then, but it’s not a DX10-capable part. Kind of like the Radeon 9000 back in the day, it’s an older 3D core being pulled into a new naming scheme.
And now, on to the benchmarks…
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
|Processor||Core 2 Extreme X6800 2.93GHz||Core 2 Extreme X6800 2.93GHz|
|System bus||1066MHz (266MHz quad-pumped)||1066MHz (266MHz quad-pumped)|
|Motherboard||XFX nForce 680i SLI||Asus P5W DH Deluxe|
|North bridge||nForce 680i SLI SPP||975X MCH|
|South bridge||nForce 680i SLI MCP||ICH7R|
|Chipset drivers||ForceWare 15.00||INF update 18.104.22.1680, Matrix Storage Manager 6.21|
|Memory size||4GB (4 DIMMs)||4GB (4 DIMMs)|
|Memory type||2 x Corsair TWIN2X2048-8500C5D DDR2 SDRAM at 800MHz||2 x Corsair TWIN2X2048-8500C5D DDR2 SDRAM at 800MHz|
|CAS latency (CL)||4||4|
|RAS to CAS delay (tRCD)||4||4|
|RAS precharge (tRP)||4||4|
|Cycle time (tRAS)||18||18|
|Hard drive||Maxtor DiamondMax 10 250GB SATA 150||Maxtor DiamondMax 10 250GB SATA 150|
|Audio||Integrated nForce 680i SLI/ALC850 with Microsoft drivers||with Microsoft drivers|
|Graphics||GeForce 8800 Ultra 768MB PCIe with ForceWare 158.18 drivers||Radeon X1950 XTX 512MB PCIe + Radeon X1950 CrossFire with Catalyst 7.4 drivers|
|GeForce 8800 GTX 768MB PCIe with ForceWare 158.18 drivers||Dual Radeon HD 2900 XT 512MB PCIe with 8.37.4.070419a-046506E drivers|
|Dual GeForce 8800 GTX 768MB PCIe with ForceWare 158.18 drivers|
|BFG GeForce 8800 GTS 640MB PCIe with ForceWare 158.18 drivers|
|Dual BFG GeForce 8800 GTS SLI 640MB PCIe with ForceWare 158.18 drivers|
|XFX GeForce 8800 GTS 320MB PCIe with ForceWare 158.18 drivers|
|EVGA GeForce 8800 GTS 640MB PCIe with ForceWare 158.42 drivers|
|Dual EVGA GeForce 8800 GTS 640MB PCIe with ForceWare 158.42 drivers|
|Radeon X1950 XTX 512MB PCIe with Catalyst 7.4 drivers|
|Radeon HD 2900 XT 512MB PCIe with 8.37.4.070419a-046506E drivers|
|OS||Windows Vista Ultimate x86 Edition||Windows Vista Ultimate x86 Edition|
Thanks to Corsair for providing us with memory for our testing. Their quality, service, and support are easily superior to no-name DIMMs.
Our test systems were powered by OCZ GameXStream 700W power supply units. Thanks to OCZ for providing these units for our use in testing.
Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults.
The test systems’ Windows desktops were set at 1600×1200 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- Rainbow Six: Vegas 1.04
- Battlefield 2142 1.2
- Supreme Commander 3223
- The Elder Scrolls IV: Oblivion 1.2
- S.T.A.L.K.E.R.: Shadow of Chernobyl 1.0001
- Half-Life 2: Episode One with trdem2 demo
- FutureMark 3DMark06 Build 1.1.0
- FRAPS 2.8.2
- ShaderMark 2.1 build 130a
- D3D FSAA Viewer 5
- D3D RightMark beta 4
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
S.T.A.L.K.E.R.: Shadow of Chernobyl
We tested S.T.A.L.K.E.R. by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. Each gameplay sequence lasted 60 seconds. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent and trustworthy results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
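To make that reporting concrete, here is a small sketch of how five sessions boil down to two numbers. The frame rates are made-up placeholders, not our measurements, and I am assuming the per-session averages are combined with a plain mean.

```python
from statistics import mean, median

# Hypothetical numbers for five 60-second sessions; NOT measured results.
sessions = [
    {"avg_fps": 41.2, "low_fps": 27},
    {"avg_fps": 39.8, "low_fps": 25},
    {"avg_fps": 42.5, "low_fps": 14},  # one session with an outlier dip
    {"avg_fps": 40.9, "low_fps": 26},
    {"avg_fps": 41.6, "low_fps": 28},
]


def summarize(sessions):
    """Return (mean of the session averages, median of the session lows)."""
    avg = mean(s["avg_fps"] for s in sessions)
    low = median(s["low_fps"] for s in sessions)
    return avg, low


avg, low = summarize(sessions)
print(f"average: {avg:.1f} fps, median low: {low} fps")
```

Note how the single 14 fps dip has no effect on the reported low, which is the point of using the median: one hitch in one session doesn't define the result.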
For this test, we set the game to its “maximum” quality settings at 2560×1600 resolution. Unfortunately, the game crashed on both GeForce and Radeon cards when we set it to use dynamic lighting, so we had to stick with its static lighting option. Nevertheless, this is a good-looking game with some nice shader effects and lots of vegetation everywhere.
The Radeon HD 2900 XT kicks off our game benchmarks with a mixed result. It’s slower than the competing GeForce 8800 GTS 640MB in this game, but it’s quicker in CrossFire mode than its competitor is in SLI. We found throughout our benchmarks that Nvidia’s SLI support in Windows Vista doesn’t tend to scale as well as it long has in Windows XP. AMD’s CrossFire support is generally superior in Vista.
Supreme Commander
Here’s another new game, and a very popular request for us to try. Like many RTS and isometric-view RPGs, though, Supreme Commander isn’t exactly easy to test well, especially with a utility like FRAPS that logs frame rates as you play. Frame rates in this game seem to hit steady plateaus at different zoom levels, complicating the task of getting meaningful, repeatable, and comparable results. For this reason, we used the game’s built-in “/map perftest” option to test performance, which plays back a pre-recorded game.
Another note: the frame rates you see below look pretty low, but for this type of game, they’re really not bad. We’ve observed frame rates in the game similar to the numbers from the performance test, but they’re still largely acceptable, even at higher resolutions. This is simply different from an action game, where always-fluid motion is required for smooth gameplay.
And a final note: you’ll see that SLI performance doesn’t scale in this game, but we’ve included those scores simply because it worked. We weren’t able to get either of our CrossFire systems, Radeon X1950 XTX or HD 2900 XT, working with Supreme Commander, which is why those scores were omitted.
The 2900 XT runs neck-and-neck with the 8800 GTS in average frame rates, but check out its low frame rate numbers. They’re consistently higher. This may well be the R600’s reduced state management overhead in action.
Battlefield 2142
We tested this one with FRAPS, much like we did S.T.A.L.K.E.R. In order to get this game to present any kind of challenge to these cards, we had to turn up 16X anisotropic filtering, 4X antialiasing, and transparency supersampling (or the equivalent on the Radeons, “quality” adaptive AA). I’d have run the game at 2560×1600 resolution if it supported that display mode.
We’ve tested Battlefield 2142 with a pair of drivers on the Radeon HD 2900 XT. The normal set of results comes from the driver we used throughout most of this review, and the other one comes from an early alpha driver that improves adaptive AA performance.
With the new alpha driver, the 2900 XT still can’t quite match the “overclocked in the box” version of the GeForce 8800 GTS 640MB, and that card’s fancy driver doesn’t look to be doing it any favors; the stock-clocked GTS is faster yet here.
Half-Life 2: Episode One
This one combines high dynamic range lighting with 4X antialiasing and still has fluid frame rates at very high resolutions. Unfortunately, though, we encountered some fog rendering problems on the Radeon HD 2900 XT in this game. AMD says it’s working with Valve on a fix, but doesn’t have one ready just yet. We’ve gone ahead and included the results here with the hope that performance won’t change with the fix.
This one is a clear win for AMD. The HD 2900 XT outperforms the 8800 GTS 640MB, and the HD 2900 XT CrossFire rig proves fastest overall.
The Elder Scrolls IV: Oblivion
We turned up all of Oblivion’s graphical settings to their highest quality levels for this test. The screen resolution was set to 1920×1200, with HDR lighting enabled. 16X anisotropic filtering was forced on via the cards’ driver control panels. We strolled around the outside of the Leyawiin city wall, as shown in the picture below, and recorded frame rates with FRAPS. This area has loads of vegetation, some reflective water, and some long view distances.
We tested this one with and without antialiasing. Without AA, performance was like so:
The HD 2900 XT looks pretty good. We then worked around some AA issues in Nvidia’s drivers and were able to test with AA enabled. We also added a couple of new configs: the Radeon HD 2900 XT with new alpha drivers to improve performance with AA in Oblivion and that GeForce 8800 GTS 640MB OC with its updated drivers.
Ah, the drama! The Radeon HD’s new alpha driver allows it to just barely edge past the GeForce 8800 GTS 640MB OC.
Rainbow Six: Vegas
This game is notable because it’s the first game we’ve tested based on Unreal Engine 3. As with Oblivion, we tested with FRAPS. This time, I played through a 90-second portion of the “Dante’s” map in the game’s Terrorist Hunt mode, with all of the game’s quality options cranked. The game engine doesn’t seem to work well with multisampled antialiasing, so we didn’t enable AA.
AMD’s new baby nearly matches the GeForce 8800 GTX here, and the 8800 GTS 640MB trails by over 10 frames per second.
3DMark06
The HD 2900 XT bests all three of the 8800 GTS incarnations we tested in 3DMark, and thanks to superior CrossFire scaling, it’s the fastest multi-GPU solution overall in 3DMark. For the record, we’ve seen much better SLI performance out of the 8800 in Windows XP.
Call of Juarez
For our final benchmark, we have an early copy of a DirectX 10 game provided to us by AMD. This is our first chance to look at a DirectX 10 game, even if it is an unfinished one. Nvidia let us know in no uncertain terms that this build of Call of Juarez should not be used for benchmarking, so of course, we had to give it a spin.
….aaaand, it’s a dead freaking heat between the two $399 graphics cards. How’s that for reading the tea leaves for DX10 games? I have no idea what this means exactly.
Power consumption
We measured total system power consumption at the wall socket using an Extech power analyzer model 380803. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement.
The idle measurements were taken at the Windows desktop. The cards were tested under load running Oblivion at 1920×1200 resolution with 16X anisotropic filtering. We loaded up the game and ran it in the same area where we did our performance testing.
The cards were measured on the same motherboard when possible, but we had to use a different board in order to run the Radeons in CrossFire, so keep that in mind. We even had to use a larger 1kW PSU for the HD 2900 XT CrossFire system, which will no doubt change overall system power consumption.
Idle power consumption on the Radeon HD 2900 XT looks very much in line with the GeForce 8800 cards from Nvidia. The larger PSU and different motherboard raise the stakes some for the 2900 XT CrossFire system.
When running a game, though, the R600 does pull quite a bit of juice. The system with a single Radeon HD 2900 XT draws 48W more than the system with the souped-up GeForce 8800 GTS 320MB, and the 2900 XT CrossFire rig with its massive PSU sets what I believe is a new single-system power draw record for Damage Labs at 490W. That’s a crown AMD’s graphics division has stolen from its CPU guys, whose Quad FX platform reached over 460W. Think what would happen if the two could combine their powers.
Seriously, though, the 2900 XT’s power draw is a strong clue as to why AMD elected not to pursue the overall performance crown.
Noise levels and cooling
We measured noise levels on our test systems, sitting on an open test bench, using an Extech model 407727 digital sound level meter. The meter was mounted on a tripod approximately 14″ from the test system at a height even with the top of the video card. We used the OSHA-standard weighting and speed for these measurements.
You can think of these noise level measurements much like our system power consumption tests, because the entire systems’ noise levels were measured, including the Zalman CNPS9500 LED we used to cool the CPU. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.
Here’s one place where those power draw numbers have an impact. AMD has equipped the Radeon HD 2900 XT with a blower that can move an awful lot of hot air, and that inevitably translates into noise. This isn’t anything close to GeForce FX 5800 Ultra Dustbuster levels (that card hit 58.8 dB on the same meter), but the 2900 XT is in a class by itself among high-end graphics cards. I think I probably could live with these noise levels, since the thing is only likely to crank up during games, but it definitely makes its presence known. The GeForce 8800’s hiss is whisper-quiet by comparison.
Conclusions
The Radeon HD 2900 XT is an impressive, full-featured DirectX 10-ready graphics processor. Its unified shader architecture is a clear advance over the previous generation of Radeons and puts it in the same class of product as Nvidia’s GeForce 8800 series in terms of basic capabilities. The GPU even has some cool distinctive features, like its tessellator, that the GeForce 8800 can’t match. As we’ve discussed, the scheduling required to achieve efficient utilization of this GPU’s VLIW superscalar stream processing engines could prove to be tricky, putting it at a disadvantage compared to its competition. Some of the synthetic shader benchmarks we ran illustrated that possibility. However, this GPU design has a bias toward massive amounts of parallel shader processing power, and I’m largely persuaded that shader power won’t be a weakness for it. I’m more concerned about its texture filtering capacity. Our tests showed its texturing throughput to be substantially lower than the GeForce 8800 GTS’s with 16X anisotropic filtering. One can’t help but wonder if the 2900 XT’s performance in today’s DX9 games wouldn’t be higher if it had more filtering throughput.
The 2900 XT does match the GeForce 8800 series on image quality generally, which was by no means a foregone conclusion. Kudos to AMD for jettisoning the Radeon X1000 series’ lousy angle-dependent aniso for a higher quality default algorithm. I also happen to like the 2900 XT’s custom tent filters for antialiasing an awful lot, an outcome I didn’t expect until I saw it in action for myself. Now I’m hooked, and I consider the Radeon HD’s image quality to be second to none on the PC as a result. Nvidia may yet even the score with its own custom AA filters, though.
The HDCP support over dual-link DVI ports and HDMI audio support are both welcome additions, too. We haven’t yet had time to test CPU utilization during HD-DVD or Blu-ray playback, but we’ve got that on the list for a follow-up article (along with GPU overclocking, edge-detect AA filters, dual-link DVI with HDCP on the Dell 3007WFP, AMD’s Stream computing plans, and a whole host of other items).
Ultimately, though, we can’t overlook the fact that AMD built a GPU with 700M transistors that has 320 stream processor ALUs and a 512-bit memory interface, yet it just matches or slightly exceeds the real-world performance of the GeForce 8800 GTS. The GTS is an Nvidia G80 with 25% of its shader core disabled and only 60% of the memory bandwidth of the Radeon HD 2900 XT. That’s gotta be a little embarrassing. At the same time, the Radeon HD 2900 XT draws quite a bit more power under load than the full-on GeForce 8800 GTX, and it needs a relatively noisy cooler to keep it in check. If you ask folks at AMD why they didn’t aim for the performance crown with a faster version of the R600, they won’t say it outright, but they will hint that leakage with this GPU on TSMC’s 80HS fab process was a problem. All of the telltale signs are certainly there.
Given that, AMD was probably smart not to try to go after the performance crown with a two-foot-long graphics card attached to a micro-A/C unit and sporting a thousand-dollar price tag. Instead, they’ve delivered a pretty good value in a $399 graphics card, so long as you’re willing to overlook its higher power draw and noise output while you’re gaming.
There are many things we don’t yet know about the GeForce 8800 and Radeon HD 2900 GPUs, not least of which is how they will perform in DirectX 10 games. I don’t think our single DX10 benchmark with a pre-release game tells us much, so we’ll probably just have to wait and see. Things could look very different six months from now, even if the chips themselves haven’t changed.