AMD’s Radeon HD 4870 graphics processor


Not much good has happened for either party since AMD purchased ATI. New chips from both sides of the fence have been late, run hot, and underperformed compared to the competition. Meanwhile, the combined company has posted staggering financial losses, causing many folks to wonder whether AMD could continue to hold up its end of the bargain as junior partner in the PC market’s twin duopolies, for CPUs and graphics chips.

AMD certainly has its fair share of well-wishers, as underdogs often do. And a great many of them have been waiting with anticipation—you can almost hear them vibrating with excitement—for the Radeon HD 4800 series. The buzz has been building for weeks now. For the first time in quite a while, AMD would seem to have an unequivocal winner on its hands in this new GPU.

Our first peek at Radeon HD 4850 performance surely did nothing to quell the excitement. As I said then, the Radeon HD 4850 kicks more ass than a pair of donkeys in an MMA cage match. But that was only half of the story. What the Radeon HD 4870 tells us is that those donkeys are all out of bubble gum.

Uhm, or something like that. Keep reading to see what the Radeon HD 4800 series is all about.

The RV770 GPU

Work on the chip code-named RV770 began two and a half years ago. AMD’s design teams were, unusually, dispersed across six offices around the globe. Their common goal was to take the core elements of the underperforming R600 graphics processor and turn them into a much more efficient GPU. To make that happen, the engineers worked carefully on reducing the size of the various logic blocks on the chip without cutting out functionality. More efficient use of chip area allowed them to pack in more of everything, raising the peak capacity of the GPU in many ways. At the same time, they focused on making sure the GPU could more fully realize its potential by keeping key resources well fed and better managing the flow of data through the chip.

The fruit of their labors is a graphics processor whose elements look familiar, but whose performance and efficiency are revelations. Let’s have a look at a 10,000-foot overview of the chip, and then we’ll consider what makes it different.

A block diagram of the RV770 GPU. Source: AMD.

Some portions of the diagram above are too small to make out at first glance, I know. We’ll be looking at them in more detail in the following pages. The first thing you’ll want to notice here, though, is the number of processors in the shader array, which is something of a surprise compared to early rumors. The RV770 has 10 SIMD cores, as you can see, and each of them contains 16 stream processor units. You may not be able to see it above, but each of those SP units is a superscalar processing block made up of five ALUs. Add it all up, and the RV770 has a grand total of 800 ALUs onboard, which AMD advertises as 800 “stream processors.” Whatever you call them, that’s a tremendous amount of computing power—well beyond the 320 SPs in the RV670 GPU powering the Radeon HD 3800 series. In fact, this is the first teraflop-capable GPU, with a theoretical peak of a cool one teraflop in the Radeon HD 4850 and up to 1.2 teraflops in the Radeon HD 4870. Nvidia’s much larger GeForce GTX 280 falls just shy of the teraflop mark.
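
If you’d like to check those numbers, the arithmetic is simple enough. Here’s a back-of-the-envelope sketch, assuming the conventional one multiply-add (two flops) per ALU per clock:

```python
# Back-of-the-envelope peak shader throughput for RV770, assuming
# one multiply-add (two flops) per ALU per clock.
SIMD_CORES = 10
SP_UNITS_PER_SIMD = 16
ALUS_PER_SP_UNIT = 5

alus = SIMD_CORES * SP_UNITS_PER_SIMD * ALUS_PER_SP_UNIT  # 800 "stream processors"

def peak_gflops(core_clock_mhz):
    return alus * 2 * core_clock_mhz / 1000.0

print(alus)              # 800
print(peak_gflops(625))  # 1000.0 GFLOPS -> Radeon HD 4850
print(peak_gflops(750))  # 1200.0 GFLOPS -> Radeon HD 4870
```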

The blue blocks to the right of the SIMDs are texture units. The RV770’s texture units are now aligned with SIMDs, so that adding more shader power equates to adding more texturing power, as is the case with Nvidia’s recent GPUs. Accordingly, the RV770 has 10 texture units, capable of addressing and filtering up to 40 texels per clock, more than double the capacity of the RV670.

Across the bottom of the diagram, you can see the GPU’s four render back-ends, each of which is associated with a 64-bit memory interface. Like a bad tattoo, the four back-ends and 256 bits of total memory connectivity are telltale class indicators: this is decidedly a mid-range GPU. Yet the individual render back-ends on RV770 are vastly more powerful than their predecessors, and the memory controllers have one heck of a trick up their sleeves in the form of support for GDDR5 memory, which enables substantially more bandwidth over every pin.

Despite all of the changes, the RV770 shares the same basic feature set with the RV670 that came before it, including support for Microsoft’s DirectX 10.1 standard. The big news items this time around are (sometimes major) refinements, including formidable increases in texturing capacity, shader power, and memory bandwidth, along with efficiency improvements throughout the design.

The chip

Like the RV670 before it, the RV770 is fabricated at TSMC on a 55nm process, which packs its roughly 956 million transistors into a die that’s 16mm per side, for a total area of 256 mm². The chip has grown from the RV670, but not as much as one might expect given its increases in capacity. The RV670 weighed in at an estimated 666 million transistors and was 192 mm².

Of course, AMD’s new GPU is positively dwarfed by Nvidia’s GT200, a 577 mm² behemoth made up of 1.4 billion transistors. But the more relevant comparisons may be to Nvidia’s mid-range GPUs. The first of those GPUs, of course, is the G92, a 65nm chip that’s behind everything from the GeForce 8800 GT to the GeForce 9800 GTX. That chip measured out, with our shaky ruler, to more or less 18mm per side, or 324 mm². (Nvidia doesn’t give out official die size specs anymore, so we’re reduced to this.) The second competing GPU from Nvidia is a brand-new entrant, the 55nm die shrink of the G92 that drives the newly announced GeForce 9800 GTX+. The GTX+ chip has the same basic transistor count of 754 million, but, well, have a look. The pictures below were all taken with the camera in the same position, so they should be pretty much to scale.

Nvidia’s G92

The RV770

The die-shrunk G92 at 55nm aboard the GeForce 9800 GTX+

Yeah, so apparently I have rotation issues. These things should not be difficult, I know. Hopefully you can still get a sense of comparative size. By my measurements, interestingly enough, the 55nm GTX+ chip looks to be 16 mm per side and thus 256 mm², just like the RV770. That’s despite the gap in transistor counts between the RV770 and G92, but then Nvidia and AMD seem to count transistors differently, among a multitude of other variables at work here.

The pictures below will give you a closer look at the chip’s die itself. The second one even locates some of the more important logic blocks.

A picture of the RV770 die. Source: AMD.

The RV770 die’s functional units highlighted. Source: AMD.

As you can see, the RV770’s memory interface and I/O blocks form a ring around the periphery of the chip, while the SIMD cores and texture units take up the bulk of the area in the middle. The SIMDs and the texture units are in line with one another.

What’s in the cards

Initially, the Radeon HD 4800 series will come in two forms, powder and rock. Err, I mean, 4850 and 4870. By now, you may already be familiar with the 4850, which has been selling online for a number of days.

Here’s a look at our review sample from Sapphire. The stock clock on the 4850 is 625MHz, and that clock governs pretty much the whole chip, including the shader core. These cards come with 512MB of GDDR3 memory running at 993MHz, for an effective 1986MT/s. AMD pegs the max thermal/power rating (or TDP) of this card at 110W. As a result, the 4850 needs only a single six-pin aux power connector to stay happy.
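
For reference, here’s how those memory specs turn into peak bandwidth, as a quick sketch; GDDR3 transfers data on both clock edges, and the RV770’s memory interface is 256 bits wide:

```python
# Peak memory bandwidth for the Radeon HD 4850: GDDR3 moves data
# on both edges of its 993MHz clock, over a 256-bit interface.
mem_clock_mhz = 993
data_rate_mts = mem_clock_mhz * 2                 # 1986 MT/s
bandwidth_gbs = data_rate_mts * 256 / 8 / 1000.0  # bits -> bytes, then GB/s
print(data_rate_mts, round(bandwidth_gbs, 1))     # 1986, 63.6 GB/s
```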

Early on, AMD suggested the 4850 would sell for about $199 at online vendors, and so far, street prices seem to jibe with that, by and large.

And here we have the big daddy, the Radeon HD 4870. This card’s much beefier cooler takes up two slots and sends hot exhaust air out of the back of the case. The bigger cooler and dual six-pin power connections are necessary given the 4870’s 160W TDP.

Cards like this one from VisionTek should start selling online today at around $299. That’s another hundred bucks over the 4850, but then you’re getting a lot more card. The 4870’s core clock is 750MHz, and even more importantly, it’s paired up with 512MB of GDDR5 memory. The base clock on that memory is 900MHz, but it transfers data at a rate of 3600MT/s, which means the 4870’s peak memory bandwidth is nearly twice that of the 4850.

Both the 4870 and the 4850 come with dual CrossFire connectors along the top edge of the card, and both can participate in CrossFireX multi-GPU configurations with two, three, or four cards daisy-chained together.

Nvidia’s response

The folks at Nvidia aren’t likely to give up their dominance at the $199 sweet spot of the video card market without a fight. In response to the release of the Radeon HD 4850, they’ve taken several steps to remain competitive. Most of those steps involve price cuts. Stock-clocked versions of the GeForce 9800 GTX have dropped to $199 to match the 4850. Meanwhile, you have higher clocked cards like this one:

This “XXX Edition” card from XFX comes with core and shader clocks of 738 and 1836MHz, respectively, up from 675/1688MHz stock, along with 1144MHz memory. XFX bundles this card with a copy of Call of Duty 4 for $239 at Newegg, along with a $10.00 mail-in rebate, which gives you maybe better-than-even odds of getting a check for ten bucks at some point down the line, if you’re into games of chance.

Cards like this “XXX Edition” will serve as a bridge of sorts for Nvidia’s further answer to the Radeon HD 4850 in the form of the GeForce 9800 GTX+. Those cards will be based on the 55nm die shrink of the G92 GPU, and they’ll share the XXX Edition’s 738MHz core and 1836MHz shader clocks, although their memory will be slightly slower at 1100MHz. Nvidia expects GTX+ cards to be available in decent quantities by July 16 at around $229.

For most intents and purposes, of course, these two cards should be more or less equivalent, including performance. The GTX+ shares the 9800 GTX’s dual-slot cooler and layout, as well. As a result, and because of time constraints, we’ve chosen to include only the XXX Edition in most of our testing. The exception is the place where the 55nm chip is likely to make the biggest difference: in power draw and the related categories of heat and noise. We’ve tested the 9800 GTX+ separately in those cases.

Nvidia has also decided to sweeten the pot a little bit by supplying us with drivers that endow the GeForce 9800 GTX and GTX 200-series cards with support for GPU-accelerated physics via the PhysX API. You’ll see early results from those drivers in our 3DMark Vantage performance numbers.

Shader processing

Block diagram of a single SP unit.
Source: AMD.

Since the RV770 shares its core shader structure with the R600 family, much of what I wrote about how shader processing works in my R600 review should still apply here. The RV770’s basic execution unit remains a five-ALU-wide superscalar block like the one pictured above, which has four “regular” ALUs and one “fat” ALU that can handle some special functions the others can’t, like transcendentals.

AMD has extended the functionality of these SP blocks slightly with RV770, but remarkably, they’ve managed to reduce the area they occupy on the chip versus RV670, even on the same fabrication process. RV770 Chief Architect Scott Hartog cited a 40% increase in performance per square millimeter. In fact, AMD originally planned to put eight SIMD cores on this GPU, but once the shader team’s optimizations were complete, the chip had die space left empty; the I/O ring around the outside of the chip was the primary size constraint. In response, they added two additional SIMD cores, bringing the SP count up to 800 and vaulting the RV770 over the teraflop mark.

Most of the new capabilities of the RV770’s shaders are aimed at non-graphics applications. For instance, from the RV670, they inherit the ability to handle double-precision floating-point math, a capability that has little or no application in real-time graphics at present. The “fat” ALU in the SP block can perform one double-precision FP add or multiply per clock, while the other four ALUs can combine to process one double-precision add. In essence, that means the RV770’s peak compute rate for double-precision multiply-add operations is one-fifth of its single-precision rate, or 240 gigaflops in the case of the Radeon HD 4870. That’s quite a bit faster than even the GeForce GTX 280, whose peak DP compute rate is 78 gigaflops.
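
Since that ratio is fixed, the DP peaks fall straight out of the single-precision numbers; a minimal sketch (the 4850 figure is my own arithmetic, not an AMD-quoted number):

```python
# DP peak is one-fifth of the single-precision peak: one DP
# multiply-add per five-ALU SP block per clock.
sp_peaks = {"Radeon HD 4850": 1000, "Radeon HD 4870": 1200}
for card, gflops in sp_peaks.items():
    print(card, gflops / 5, "DP GFLOPS")  # 200.0 and 240.0
```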

Another such accommodation is the addition of 16KB of local shared memory in each SIMD core, useful for sharing data between threads in GPU-compute applications. This is obviously rather similar to the 16KB of shared memory Nvidia has built into each of the SM structures in its recent GPUs, although the RV770 has relatively less memory per stream processor, about a tenth of what the GT200 has. This local data share isn’t accessible to programmers via graphics APIs like DirectX, but AMD may use it to enable larger kernels for custom AA filters or for other forms of post-processing. Uniquely, the RV770 also has a small, 16KB global data share for the passing of data between SIMDs.

Beyond that, the ability to perform an integer bit-shift operation has been migrated from the “fat” ALU to all five of them in each SP block, a provision aimed at accelerating video processing, encoding, and compression. The design team also added memory import and export capabilities, to allow for full-speed scatter and gather operations. And finally, the RV770 has a new provision for the creation of lightweight threads for GPU compute applications. Graphics threads tend to have a lot of state information associated with them, not all of which may be necessary for other types of processing. The RV770 can quickly generate threads with less state info for such apps.

Peak shader arithmetic (GFLOPS)

                      Single-issue    Dual-issue
GeForce 8800 GTX      346             518
GeForce 9800 GTX      432             648
GeForce 9800 GX2      768             1152
GeForce GTX 260       477             715
GeForce GTX 280       622             933
Radeon HD 2900 XT     475             —
Radeon HD 3870        496             —
Radeon HD 3870 X2     1056            —
Radeon HD 4850        1000            —
Radeon HD 4870        1200            —

Although most of these changes won’t affect graphics performance, one change may. Both AMD and Nvidia seem to be trying to get a grasp on how developers may use geometry shaders, and to optimize their GPUs for the different possibilities. In the GT200, we saw Nvidia increase its buffer sizes dramatically to better accommodate the use of a shader for geometry amplification, or tessellation. AMD claims its GPUs were already good at handling such scenarios, but has enhanced the RV770 for the case where the geometry shader keeps data on the chip for high-speed rendering.

The single biggest improvement made in the RV770’s shader processing ability, of course, is the increase to 10 SIMDs and a total of 800 so-called stream processors on a single chip. This change affects graphics and GPU-compute applications alike. The table above shows the peak theoretical computational rates of various GPUs. Of course, as with almost anything of this nature, the peak number isn’t destiny; it’s just a possibility, if everything were to go exactly right. That rarely happens. For instance, the GeForces can only reach their peak numbers if they’re able to use their dual-issue capability to execute an additional multiply operation in each clock cycle. In reality, that doesn’t always happen. Similarly, in order to get peak throughput out of the Radeon, the compiler must schedule instructions cleverly for its five-wide superscalar ALU block, avoiding dependencies and serializing the processing of data that doesn’t natively have five components.
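
If you’d like to reproduce the table’s entries, the formulas look roughly like this. The SP counts and shader clocks here come from the cards’ public specs, and I’m counting a multiply-add as two flops, with the GeForces’ dual-issued multiply adding a third:

```python
# Rough reconstruction of the peak-GFLOPS table.  A MAD counts as
# two flops; the GeForces' dual-issued MUL adds a third.
def geforce_gflops(sps, shader_mhz):
    single = sps * shader_mhz * 2 / 1000.0
    dual = sps * shader_mhz * 3 / 1000.0
    return round(single), round(dual)

def radeon_gflops(alus, core_mhz):
    return round(alus * core_mhz * 2 / 1000.0)

print(geforce_gflops(240, 1296))  # GTX 280: (622, 933)
print(geforce_gflops(128, 1688))  # 9800 GTX: (432, 648)
print(radeon_gflops(800, 750))    # Radeon HD 4870: 1200
```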

Fortunately, we can run a few simple synthetic shader tests to get a sense of the GPUs’ processing prowess.

In its most potent form, the Radeon HD 4870, the RV770 represents a huge improvement over the Radeon HD 3870—pretty straightforwardly, about two times the measured performance. Versus the competition, the Radeon HD 4850 outperforms the GeForce 9800 GTX in three of the four tests, although the gap isn’t as large as the theoretical peak numbers would seem to suggest. More impressively, the Radeon HD 4870 surpasses the GT200-based GeForce GTX 260 in two of the four tests and essentially matches the GTX 280 in the GPU particles and Perlin noise tests. That’s against a chip twice the size of the RV770, with a memory interface twice as wide.

Texturing, memory hierarchy, and render back-ends

A single RV770 texture unit. Source: AMD.

Like the shaders, the texture units in the RV770 have been extensively streamlined. Hartog claimed an incredible 70% increase in performance per square millimeter for these units. Not only that, but as I’ve mentioned, the texture units are now aligned with shader SIMDs, so future RV770-based designs could scale the amount of processing power up or down while maintaining the same ratio of shader power to texture filtering capacity. Interestingly enough, the RV770 retains the same shader-to-texture capacity mix as the RV670 and the R600 before it. Nvidia has moved further in this direction recently with the release of the GT200, but the Radeons still have a substantially higher ratio of gigaflops to gigatexels.

With 10 texture units onboard, the RV770 can sample and bilinearly filter up to 40 texels per clock. That’s up from 16 texels per clock on the RV670, a considerable increase. One of the ways AMD managed to squeeze down the size of its texture units was taking a page from Nvidia’s playbook and making the filtering of FP16 texture formats work at half the usual rate. As a result, the RV770’s peak FP16 filtering rate is only slightly up from the RV670. Still, Hartog described the numbers game here as less important than the reality of measured throughput.
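
Those per-clock figures translate directly into the peak filtering rates listed in the table later in this review; a quick sketch:

```python
# RV770 peak bilinear filtering rates: 40 texels/clock for 32-bit
# integer formats, half that rate for FP16.
TEXELS_PER_CLOCK = 40
for card, core_mhz in (("Radeon HD 4850", 625), ("Radeon HD 4870", 750)):
    int_rate = TEXELS_PER_CLOCK * core_mhz / 1000.0  # Gtexels/s
    print(card, int_rate, int_rate / 2)  # 25.0/12.5 and 30.0/15.0
```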

To ensure that throughput is what it should be, the design team overhauled the RV770’s caches extensively, replacing the R600’s “distributed unified cache” with a true L1/L2 cache hierarchy.

A block diagram of the RV770’s cache hierarchy. Source: AMD.

Each L1 texture cache is associated with a SIMD/texture unit block and stores unique data for it, and each L2 cache is aligned with a memory controller. Much of this may sound familiar to you, if you’ve read about certain competitors to RV770. No doubt AMD has learned from its opponents.

Furthermore, Hartog said RV770 uses a new cache allocation routine that delays the allocation of space in the L1 cache until the request for that data is fulfilled. This mechanism should allow RV770 to use its texture caches more efficiently. Vertices are stored in their own separate cache. Meanwhile, the chip’s internal bandwidth is twice that of the previous generation—a provision necessary, Hartog said, to keep pace with the amount of data coming in from GDDR5 memory. He claimed transfer rates of up to 480GB/s for an L1 texture fetch and up to 384GB/s for data transfers between the L1 and L2 caches.

An overview of the RV770’s memory interface. Source: AMD.

The RV770’s reworked memory subsystem doesn’t stop at the caches, either. AMD’s vaunted ring bus is dead and gone, and it’s not even been replaced by a crossbar. Instead, RV770 opts for a simpler approach. The GPU’s four memory controllers are distributed around the edges of the chip, next to their primary bandwidth consumers, including the render back-ends and the L2 caches. Data is partitioned via tiling to maintain good locality of reference for each controller/cache pair, and a hub passes lower bandwidth data to and from the I/O units for PCI Express, display controllers, the UVD2 video engine, and the CrossFireX interconnect. AMD claims this approach brings efficiency gains, with the RV770 capable of reaching 95% of its theoretical peak bandwidth, up 10% from the RV670.

These gains alone wouldn’t allow the RV770 to realize its full potential, however, with only a 256-bit aggregate path to memory. For extra help in this department, AMD worked with DRAM vendors to develop a new memory type, GDDR5. GDDR5 keeps the single-ended signaling used in current DRAM types and uses a range of techniques to achieve higher bandwidth. Among them: a new clocking architecture, an error-detection protocol for the wires, and individual training of DRAM devices upon startup. AMD’s Joe Macri, who heads the JEDEC DRAM and GDDR5 committees, points out that this last feature should allow for additional overclocking headroom with better cooling, since DRAM training will respond to improvements in environmental conditions.

GDDR5’s command clock runs at a quarter of the data rate, which is presumably why the Radeon HD 4870’s memory clock shows up as 900MHz when the actual data rate is 3600 MT/s. Do the math, and you’ll find that the 4870’s peak memory bandwidth works out to 115.2 GB/s, which is even more than the Radeon HD 2900 XT managed with a 512-bit interface or what the GeForce GTX 260 can reach with a 448-bit interface to GDDR3. And that’s with 3.6Gbps devices. AMD says it’s already seeing 5Gbps GDDR5 memory now and expects to see 6Gbps before the end of the year.
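
Here’s that math spelled out, as a quick sketch, assuming four transfers per command clock across the 256-bit interface:

```python
# GDDR5 on the Radeon HD 4870: four data transfers per 900MHz
# command clock, across a 256-bit interface.
data_rate_mts = 900 * 4                           # 3600 MT/s
bandwidth_gbs = data_rate_mts * 256 / 8 / 1000.0  # bits -> bytes, then GB/s
print(bandwidth_gbs)                              # 115.2 GB/s
```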

An RV770 render back-end unit.
Source: AMD.

The final element in the RV770’s wide-ranging re-plumbing of the R600 architecture comes in the form of heavily revised render back-ends. (For the confused, Nvidia calls these exact same units ROPs, but we’ll use AMD’s term in discussing its chips.) One of the RV770 design team’s major goals was to improve antialiasing performance, and render back-ends are key to doing so. Looking at the diagram above, the RV770’s render back-end doesn’t look much different from any other, and the chip only has four of them, so what’s the story?

Well, for one, the individual render back-end units are quite a bit more powerful. Below is a table supplied by Hartog that shows the total render back-end capacity of the RV770 versus RV670, both of which have the same number of units on chip.

RV670 versus RV770 total render back-end throughput. Source: AMD.

According to this table, the RV770’s render back-ends are twice as fast as the RV670’s in many situations: for any form of multisampled AA and for 64-bit color modes even without AA. Not only that, but the RV770 can perform up to 64 Z or stencil operations per clock cycle. Hartog identified the Z rate as the primary limiting factor in the RV670’s antialiasing performance.

That’s not the whole story, however. Ever since the R600 first appeared, we’ve heard rumors that its render back-ends were essentially broken in that they would not perform the resolve step for multisampled AA—instead, the R600 and family handled this task in the shader core. Shader-based resolve did allow AMD to do some nice things with custom AA filters, but the R600 family’s relatively weak AA performance was always a head-scratcher. Why do it that way, if it’s so slow?

I suspect, as a result of the shader-based resolve, that the numbers you see for RV670 in the table above are, shall we say, optimistic. They may be correct as theoretical peaks, but I suspect the RV670 doesn’t often reach them.

Fortunately, AMD has confirmed to us that the RV770 no longer uses its shader core for standard MSAA resolve. If there was a problem with the R6xx chips’ render back-ends—and AMD still denies it—that issue has been fixed. The RV770 will still use shader-based resolve for AMD’s custom-filter AA modes, but for regular box filters, the work is handled in custom hardware in the render back-ends—as it was on pre-R600 Radeons and on all modern GeForce GPUs.

Testing RV770’s mettle

So how do the rearchitected bits of RV770 work when you put them all together? Let’s have a look. First, here’s a quick table showing the theoretical peak capacities of some relevant GPUs, which we can use for reference.

                      Peak pixel     Peak bilinear      Peak bilinear      Peak memory
                      fill rate      texel filtering    FP16 texel         bandwidth
                      (Gpixels/s)    rate (Gtexels/s)   filtering rate     (GB/s)
                                                        (Gtexels/s)

GeForce 8800 GTX      13.8           18.4               18.4               86.4
GeForce 9800 GTX      10.8           43.2               21.6               70.4
GeForce 9800 GX2      19.2           76.8               38.4               128.0
GeForce GTX 260       16.1           36.9               18.4               111.9
GeForce GTX 280       19.3           48.2               24.1               141.7
Radeon HD 2900 XT     11.9           11.9               11.9               105.6
Radeon HD 3870        12.4           12.4               12.4               72.0
Radeon HD 3870 X2     26.4           26.4               26.4               115.2
Radeon HD 4850        10.0           25.0               12.5               63.6
Radeon HD 4870        12.0           30.0               15.0               115.2
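
As an aside, the Radeons’ pixel fill entries fall out of the render back-end counts and clocks. A sketch, assuming four pixels per render back-end per clock (an assumption on my part that happens to match the table):

```python
# Pixel fill sketch: render back-ends x pixels per back-end per
# clock x core clock.  Four pixels/RBE/clock is assumed here.
def gpixels(rbes, px_per_rbe_clock, core_mhz):
    return rbes * px_per_rbe_clock * core_mhz / 1000.0

print(gpixels(4, 4, 625))  # 10.0 -> Radeon HD 4850
print(gpixels(4, 4, 750))  # 12.0 -> Radeon HD 4870
```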

Oddly enough, on paper, the RV770’s numbers don’t look all that impressive. The Radeon HD 4850 trails the GeForce 9800 GTX in every category, and the 4870 isn’t much faster in most departments—except for memory bandwidth, of course, thanks to GDDR5. But what happens when we measure throughput with a synthetic test?

Color fill rate tests like this one tend to be limited mainly by memory bandwidth, as seems to be largely the case here. The Radeon HD 4850 manages to outdo the GeForce 9800 GTX, though, despite a slightly lower memory clock. As for the 4870, well, it beats out the GeForce GTX 260 and the Radeon HD 3870 X2, which would seem to suggest that its GDDR5 memory is fast and relatively efficient. The GTX 260 and 3870 X2 have similar memory bandwidth in theory, but they’re slower in practice.

This is a test of integer texture filtering performance, so many of the GPUs should be faster here than in our next test. The RV770 doesn’t look too bad, and its performance scales down gracefully as the number of filter taps increases. But Nvidia’s GPUs clearly have more texture filtering capacity, both in theory and in practice, with 32-bit texture formats.

This test, however, measures FP16 texture filtering throughput, and here, the tables turn. Amazingly, the Radeon HD 4850 outdoes the GeForce GTX 280, and the 4870 is faster still. Only the “X2” cards, with dual GPUs onboard, are in the same league. It would seem Nvidia’s GPUs have some sort of internal bottleneck preventing them from reaching their full potential with FP16 filtering. If so, they’re in good company: the Radeon HD 3870’s theoretical peak for FP16 filtering is almost identical to the Radeon HD 4850’s, yet the 4850 is much faster.

Incidentally, if the gigatexel numbers produced by 3DMark seem confusing to you, well, I’m right there with you. I asked Futuremark about this problem, and they’ve confirmed that the values are somehow incorrect. They say they’re looking into it now—or, well, after folks are back from their summer vacations. In the meantime, I’m assuming we can trust the relative performance reported by 3DMark, even if the units in which they’re reported are plainly wrong. Let’s hope I’m right about that.

Texture filtering quality

You’re probably looking at the table below and wondering what sort of drugs will produce that effect. It’s not my place to offer pharmaceutical advice, and I don’t want to dwell too much on this subject, but the images below are test patterns for texture filtering quality. My main purpose in including them is to demonstrate that not much has changed on this front since the debut of the DirectX 10 generation of GPUs. These are the same patterns we saw in our Radeon HD 2900 XT review, and they’re big, honkin’ improvements over what the DirectX 9-class GPUs did.


Anisotropic texture filtering and trilinear blending

Radeon HD 3870

Radeon HD 4870


GeForce GTX 280

GeForce GTX 280 HQ

The images above come from the snappily-named D3D AF tester, and what you’re basically doing is looking down a 3D-rendered tube with a checkerboard pattern applied. The colored bands indicate different mip-map levels, and you can see that the GPUs vary the level of detail they’re using depending on the angle of the surface.

The GeForce GTX 280’s pattern, for what it’s worth, is identical to that produced by a G80 or G92 GPU. Nvidia’s test pattern is closer to round and thus a little closer to the ideal, but we’ve found the practical difference between the two algorithms to be imperceptible.

On a more interesting note, the impact of Nvidia’s trilinear blending optimizations is apparent. You can see how much smoother the color transitions between mip maps are with its “high quality” option enabled in the driver control panel, and you’ve seen how that option affects performance on the prior page of this review. Then again, although the Radeon’s test pattern looks purty, AMD has a similar adaptive trilinear algorithm of its own that dynamically applies less blending as it sees fit.

The bottom line, I think, on image quality is that current DX10-class GPUs from Nvidia and AMD produce output that is very similar. Having logged quite a few hours playing games with both brands of GPUs, I’m satisfied that either one will serve you well. We may revisit the image quality issue again before long, though. I’d like to look more closely at the impact of those trilinear optimizations in motion rather than in screenshots or test patterns. We’ll see.

Antialiasing

The RV770’s beefed-up texture filtering looks pretty good, but how do those new render back-ends help antialiasing performance? Well, here we have the beginnings of an answer. The results below show how increasing sample levels impact frame rates. We tested in Half-Life 2 Episode Two at 1920×1200 resolution with the rest of the game’s image quality options at their highest possible settings.

To get a sense of the impact of the new render back-ends, compare the results for the Radeon HD 3870 X2 and the Radeon HD 4870. The two start out at about the same spot without antialiasing (the 1X level), with the 3870 X2 slightly ahead. However, as soon as we enable 2X AA, the 3870 X2’s performance drops off quickly, while the 4870’s frame rates step down more gracefully. The 4870 produces higher frame rates with 8X multisampling than the 3870 X2 does with just 2X AA.

I’ve shown performance results for Nvidia’s coverage sampled AA (CSAA) modes in the graph above, but presenting the results from the multitude of custom-filter AA (CFAA) modes AMD offers is more difficult, so I’ve put them into tables. First up is the Radeon HD 3870 X2, followed by the Radeon HD 4870.

Radeon HD 3870 X2 – Half-Life 2 Episode Two – AA scaling

Base MSAA    Box filter       Narrow tent      Wide tent        Edge detect
mode         Samples  FPS     Samples  FPS     Samples  FPS     Samples  FPS
1X           1        98.0
2X           2        66.2    4        65.5    6        62.7
4X           4        65.0    6        47.5    8        46.2    12       37.7
8X           8        59.1    12       26.9    16       25.5    24       28.1

Radeon HD 4870 – Half-Life 2 Episode Two – AA scaling

Base MSAA    Box filter       Narrow tent      Wide tent        Edge detect
mode         Samples  FPS     Samples  FPS     Samples  FPS     Samples  FPS
1X           1        96.3
2X           2        84.4    4        69.3    6        66.3
4X           4        79.8    6        52.5    8        51.3    12       39.5
8X           8        73.1    12       31.6    16       29.2    24       28.8

The thing that strikes me about these results is how similarly these two solutions scale when we get into the CFAA modes. The 4870 is quite a bit faster in the base MSAA modes with just a box filter, where the render back-ends take care of the MSAA resolve step. Once we get into shader-based resolve on both GPUs, though, the 4870 is only slightly quicker than the 3870 X2 in each CFAA mode. That means, practically speaking, that RV770-based cards will pay a relatively higher penalty for going from standard multisampled AA to the CFAA modes than R6xx-based ones do. You’re simply better off running a Radeon HD 4870 in 8X MSAA than you are using any custom filter. That’s not a problem, of course, just an artifact of the big performance improvements delivered by the RV770’s new render back-ends. Many folks will probably prefer to use 8X MSAA given the option, anyhow, since it doesn’t impose the subtle blurring effect that AMD’s custom tent filters do.
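
To put rough numbers on that scaling argument, here’s a quick pass over the FPS results from the tables above:

```python
# AA scaling from the tables above: fraction of no-AA performance
# retained at 8X box-filter MSAA and at 12-sample edge-detect CFAA.
results = {
    "Radeon HD 3870 X2": {"1X": 98.0, "8X box": 59.1, "edge detect 12x": 37.7},
    "Radeon HD 4870":    {"1X": 96.3, "8X box": 73.1, "edge detect 12x": 39.5},
}
for card, fps in results.items():
    for mode in ("8X box", "edge detect 12x"):
        print(f"{card}: {mode} retains {fps[mode] / fps['1X']:.0%}")
# The 4870 keeps 76% of its speed at 8X box MSAA (vs. 60% for the
# X2) but only 41% with edge detect, barely ahead of the X2's 38%,
# since the CFAA resolve runs on the shaders for both.
```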

Incidentally, the RV770’s performance also scales much more gracefully to 8X MSAA than any GeForce does. The Radeon HD 4870 outperforms even the mighty GeForce GTX 280 with 8X multisampling, and the 4850 practically trounces the 9800 GTX. Believe it or not, I’m already getting viral marketing emails from [email protected] asking me to test more games with 8X AA. Jeez, these guys are connected.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

                          Intel X38 system                           nForce 780i SLI system
Processor                 Core 2 Extreme QX9650 3.0GHz               Core 2 Extreme QX9650 3.0GHz
System bus                1333MHz (333MHz quad-pumped)               1333MHz (333MHz quad-pumped)
Motherboard               Gigabyte GA-X38-DQ6                        EVGA nForce 780i SLI
BIOS revision             F9a                                        P05p
North bridge              X38 MCH                                    780i SLI SPP
South bridge              ICH9R                                      780i SLI MCP
Chipset drivers           INF update 8.3.1.1009,                     ForceWare 15.17
                          Matrix Storage Manager 7.8
Memory size               4GB (4 DIMMs)                              4GB (4 DIMMs)
Memory type               2 x Corsair TWIN2X20488500C5D              2 x Corsair TWIN2X20488500C5D
                          DDR2 SDRAM at 800MHz                       DDR2 SDRAM at 800MHz
CAS latency (CL)          5                                          5
RAS to CAS delay (tRCD)   5                                          5
RAS precharge (tRP)       5                                          5
Cycle time (tRAS)         18                                         18
Command rate              2T                                         2T
Audio                     Integrated ICH9R/ALC889A                   Integrated nForce 780i SLI MCP/ALC885
                          with RealTek 6.0.1.5618 drivers            with RealTek 6.0.1.5618 drivers

Hard drive                WD Caviar SE16 320GB SATA
OS                        Windows Vista Ultimate x64 Edition
OS updates                Service Pack 1, DirectX March 2008 update

Graphics cards tested (drivers in parentheses):

  Radeon HD 2900 XT 512MB PCIe (Catalyst 8.5)
  Asus Radeon HD 3870 512MB PCIe (Catalyst 8.5)
  Radeon HD 3870 X2 1GB PCIe (Catalyst 8.5)
  Radeon HD 4850 512MB PCIe, single and dual CrossFire (Catalyst 8.501.1-080612a-064906E-ATI)
  Radeon HD 4870 512MB PCIe, single and dual CrossFire (Catalyst 8.501.1-080612a-064906E-ATI)
  MSI GeForce 8800 GTX 768MB PCIe (ForceWare 175.16)
  XFX GeForce 9800 GTX 512MB PCIe (ForceWare 175.16)
  XFX GeForce 9800 GTX XXX 512MB PCIe, single and dual SLI (ForceWare 177.39)
  GeForce 9800 GTX+ 512MB PCIe (ForceWare 177.39)
  XFX GeForce 9800 GX2 1GB PCIe (ForceWare 175.16)
  GeForce GTX 260 896MB PCIe (ForceWare 177.34)
  GeForce GTX 280 1GB PCIe (ForceWare 177.34)

Thanks to Corsair for providing us with memory for our testing. Their quality, service, and support are easily superior to no-name DIMMs.

Our test systems were powered by PC Power & Cooling Silencer 750W power supply units. The Silencer 750W was a runaway Editor’s Choice winner in our epic 11-way power supply roundup, so it seemed like a fitting choice for our test rigs. Thanks to OCZ for providing these units for our use in testing.

Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.


The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Call of Duty 4: Modern Warfare

We tested Call of Duty 4 by recording a custom demo of a multiplayer gaming session and playing it back using the game’s timedemo capability. Since these are high-end graphics configs we’re testing, we enabled 4X antialiasing and 16X anisotropic filtering and turned up the game’s texture and image quality settings to their limits.

We’ve chosen to test at 1680×1050, 1920×1200, and 2560×1600—resolutions of roughly 1.8, 2.3, and 4.1 megapixels—to see how performance scales.

Aaaaaand…. wow. The Radeon HD 4850 beats out both its like-priced competitor, the GeForce 9800 GTX, and the slightly more expensive XXX Edition, which is also serving as our proxy for the 9800 GTX+. Doubling up on cards only accentuates the gap between the 4850 and 9800 GTX XXX. Meanwhile, the Radeon HD 4870 just edges out the GeForce GTX 260 at 2560×1600 resolution. At 1920×1200, the 4870 actually manages to outrun both the GTX 260 and 280, but the big GeForce chips come into their own at four-megapixel-plus display resolutions. Trouble is, two 4850s in CrossFire pretty much obliterate a single GeForce GTX 280, regardless—and they cost considerably less.

Half-Life 2: Episode Two

We used a custom-recorded timedemo for this game, as well. We tested Episode Two with the in-game image quality options cranked, with 4X AA and 16X anisotropic filtering. HDR lighting and motion blur were both enabled.

Nvidia’s back in the game a little more here, as the 9800 GTX XXX Edition hangs right with the Radeon HD 4850, in both single-card and dual-GPU configurations. The GeForce is even faster at 2560×1600. However, the Radeon HD 4870’s performance has to be disconcerting for Nvidia; it’s quicker than the GTX 260 in all but the highest resolution, and even there, the 4870 is less than three frames per second behind its pricier rival.

Two 4870s in CrossFire, which also cost less than a GeForce GTX 280, are miles ahead of anything else we tested.

Enemy Territory: Quake Wars

We tested this game with 4X antialiasing and 16X anisotropic filtering enabled, along with “high” settings for all of the game’s quality options except “Shader level,” which was set to “Ultra.” We left the diffuse, bump, and specular texture quality settings at their default levels, though. Shadow and smooth foliage were enabled, but soft particles were disabled. Again, we used a custom timedemo recorded for use in this review.

This one’s a clean sweep for AMD. The Radeon HD 4850 is faster than either variant of the GeForce 9800 GTX, and the 4870 pumps out over 60 frames per second at 2560×1600, outrunning the GeForce GTX 260.

Crysis

Rather than use a timedemo, I tested Crysis by playing the game and using FRAPS to record frame rates. Because this way of doing things can introduce a lot of variation from one run to the next, I tested each card in five 60-second gameplay sessions.

Also, I’ve chosen a new area for testing Crysis. This time, I’m on a hillside in the recovery level, having a firefight with six or seven of the bad guys. As before, I’ve tested at two different settings: with the game’s “High” quality presets and again with its “Very high” ones.

The 4850 trips up a bit in Crysis, where it’s just a hair’s breadth slower than the 9800 GTX. CrossFire scaling looks to be rather disappointing, too, compared to SLI scaling. The 4870, though, comes out looking good yet again by virtue of having beaten up on the hundred-bucks-more-expensive GeForce GTX 260.

Assassin’s Creed

There has been some controversy surrounding the PC version of Assassin’s Creed, but I couldn’t resist testing it, in part because it’s such a gorgeous, well-produced game. Also, hey, I was curious to see how the performance picture looks for myself. The originally shipped version of this game can take advantage of the Radeon HD 3000- and 4000-series GPUs’ DirectX 10.1 capabilities to get a frame rate boost with antialiasing, and as you may have heard, Ubisoft chose to remove the DX10.1 path in an update to the game. I chose to test the game without this patch, leaving DX10.1 support intact.

I used our standard FRAPS procedure here, five sessions of 60 seconds each, while free-running across the rooftops in Damascus. All of the game’s quality options were maxed out, and I had to edit a config file manually in order to enable 4X AA at this resolution.

The RV770 show continues with this unscheduled detour into controversial DX10.1 territory.

Race Driver GRID

I tested this absolutely gorgeous-looking game with FRAPS, as well, and in order to keep things simple, I decided to capture frame rates over a single, longer session as I raced around the track. This approach has the advantage of letting me report second-by-second frame-rate results.

Yowza. The Radeon HD 4870 is nearly twice as fast as the 3870, which is good enough to put it at the very top of the single-GPU solutions. Two 4850 or 4870 cards seem to scale well in CrossFire, as well.

For what it’s worth, I tried re-testing the 3870 X2 with the new Catalyst 8.6 drivers to see whether they had a CrossFire profile for GRID, like the 4800 series drivers obviously do, but performance was the same. I also tried renaming the game executable, but that attempt seemed to run afoul of the game’s copy protection somehow. Oh well.

3DMark Vantage

And finally, we have 3DMark Vantage’s overall index. I’m pleased that we have games today that will challenge the performance of a new graphics card, so we don’t have to rely on 3DMark’s educated guesses about possible future usage models. However, I did collect some scores to see how the GPUs would fare, so here they are. Note that I used the “High” presets for the benchmark rather than “Extreme,” which is what everyone else seems to be using. Somehow, I thought frame rates in the fives were low enough.

Since both camps have released new drivers that promise big performance boosts for 3DMark Vantage, we tested with almost all new drivers here. For the GeForce 8800 GTX and 9800 GX2, we used ForceWare 175.19 drivers. For the other GeForces, we used the new 177.39 drivers, complete with PhysX support. And for the Radeon HD 3870 and 2900 XT, we tested with Catalyst 8.6. Since the 3870 X2 seemed to crash in 3DMark with Cat 8.6, we stuck with the 8.5 revision for it.

I suppose the final graph there is the most dramatic. That’s where Nvidia’s support for GPU-accelerated physics, via the PhysX API used by 3DMark’s “CPU” physics test, kicks in. Obviously, the GPU acceleration results in much higher scores than we see with CPU-only physics, which affects both the composite CPU score and the overall 3DMark score.

I’m certainly as impressed as anyone with Nvidia’s port of the PhysX API to its CUDA GPU-computing platform, but I’m not sure that’s, you know, entirely fair from a benchmarking point of view. 3DMark has become like the Cold War-era East German judge at the Olympics all of a sudden. The overall GPU score may be a better measure of these chips, and it puts the Radeon HD 4850 ahead of the GeForce 9800 GTX XXX.

Power consumption

We measured total system power consumption at the wall socket using an Extech power analyzer model 380803. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench.

The idle measurements were taken at the Windows Vista desktop with the Aero theme enabled. The cards were tested under load running Half-Life 2 Episode Two at 2560×1600 resolution, using the same settings we did for performance testing.

The power consumption of the two Radeon HD 4000-series cards at idle isn’t bad, but it is disappointing in light of what Nvidia has achieved with the GeForce GTX cards. The 4870, in particular, is perplexing, because GDDR5 memory is supposed to require less power. When running a game, the new Radeons look relatively better, with lower power draw than their closest competitors.

Note that those competitors include the GeForce 9800 GTX+, based on the 55nm shrink of the G92 GPU. At the same clock speeds as the 65nm XXX Edition, the GTX+-equipped system draws 11W less power at idle and 25W less under load.

Noise levels

We measured noise levels on our test systems, sitting on an open test bench, using an Extech model 407727 digital sound level meter. The meter was mounted on a tripod approximately 12″ from the test system at a height even with the top of the video card. We used the OSHA-standard weighting and speed for these measurements.

You can think of these noise level measurements much like our system power consumption tests, because the entire systems’ noise levels were measured, including the stock Intel cooler we used to cool the CPU. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.

I wasn’t able to reliably measure noise levels for most of these systems at idle. Our test systems keep getting quieter with the addition of new power supply units and new motherboards with passive cooling and the like, as do the video cards themselves. Our test rigs at idle are too close to the sensitivity floor for our sound level meter, so I only measured noise levels under load. Even then, I wasn’t able to get a good measurement for the GeForce 8800 GTX; its cooler is just too quiet.

There you have it. Not bad. However, I should warn you that we tested these noise levels on an open test bench, and the 4850 and 4870 were definitely not running their blowers at top speed. They’re quite a bit louder when they first spin up, for a split second, at boot time. When crammed into the confines of your own particular case, your mileage will probably vary. In fact, for the 4850, I’d almost guarantee it, for reasons you’ll see below.

GPU temperatures

Per your requests, I’ve added GPU temperature readings to our results. I captured these using AMD’s Catalyst Control Center and Nvidia’s nTune Monitor, so we’re basically relying on the cards to report their temperatures properly. In the case of multi-GPU configs, I only got one number out of CCC. I used the highest of the numbers from the Nvidia monitoring app. These temperatures were recorded while running the “rthdribl” demo in a window. Windowed apps only seem to use one GPU, so it’s possible the dual-GPU cards could get hotter with both GPUs in action. Hard to get a temperature reading if you can’t see the monitoring app, though.

The new Radeons achieve their relatively low noise levels by allowing the GPU to run at much higher temperatures than current GeForces or past Radeons. The 4850, in particular, seems to get ridiculously hot, not just in the monitoring app but on the card and cooler itself—well beyond the threshold of pain. This mofo will burn you.

I’m hopeful that board makers will find some solutions. Shortly before we went to press, we received a poorly documented and possibly incomplete set of files from Sapphire that may allow us to flash a new BIOS revision onto the 4850, and I believe the aim is to reduce temperatures. I kind of worry about what that will do to the noise levels, but perhaps we can test that. Longer term, one hopes we’ll see 4850 cards with much better coolers on them, perhaps with dual slots and a rear exhaust setup, like the 4870. That would be a huge improvement.

Conclusions

The RV770 GPU looks to be an unequivocal success on almost every front. In its most affordable form, the Radeon HD 4850 delivers higher performance overall than the GeForce 9800 GTX and redefines GPU value at the ever-popular $199 price point. Meanwhile, the RV770’s most potent form is even more impressive, in my view. Onboard the Radeon HD 4870, this GPU sets a new standard for architectural efficiency—in terms of performance per die area—due to two things: a broad-reaching rearchitecting and optimization of the R600 graphics core, and the astounding amount of bandwidth GDDR5 memory can transfer over a 256-bit interface. Both of these things seem to work every bit as well as advertised. In practical terms, what all of this means is that the Radeon HD 4870, a $299 product, competes closely with the GeForce GTX 260, a $399 card based on a chip twice the size.

I have to take issue with a couple of arguments I hear coming from both sides of the GPU power struggle, though. AMD decided a while back, after the R600 debacle, to stop building high-end GPUs as a cost-cutting measure and instead address the high end with multi-GPU solutions. They have since started talking about how the era of the large, “monolithic” GPU is over. I think that’s hogwash. In fact, I’d love to see an RV770-derived behemoth with 1600 SPs and 80 texture units on the horizon. Can you imagine? Big chips don’t suffer from the quirks of multi-GPU implementations, which never seem to have profiles for newly released games just as you’d want to be playing them, and building a big chip doesn’t necessarily preclude a company from building a mid-sized one. Yes, Nvidia still makes high-end GPUs like the GeForce GTX 280, but it makes mid-range chips, too.

One example of such a chip is the 55nm variant of the G92 that powers the GeForce 9800 GTX+. If Nvidia can deliver those as expected by mid-July and cut another 30 bucks off of the projected list price, they’ll have a very effective counter to the Radeon HD 4850, nearly equivalent in size, performance, and power consumption.

At the same time, Nvidia is trying to press its advantage on the GPU-compute front by investing loads of marketing time and effort into its CUDA platform, with particular emphasis on the potential value of its GPU-accelerated PhysX API to gamers. I can see the vision there, but look: hardware-accelerated physics has been just around the corner for longer than I care to remember, but it’s never really happened. Perhaps Nvidia will succeed where Ageia alone didn’t, but I wouldn’t base my GPU buying decision on it. If PhysX-based games really do arrive someday, I doubt they’ll make much of an impact during the lifespan of one of today’s graphics cards.

On top of that, AMD has made its own considerable investment in the realm of heterogeneous computing—like, for instance, buying ATI, a little transaction you may have heard about, along with some intriguing code names like Fusion and Torrenza. We got a refresher on AMD’s plans in our recent talk with Patti Harrell, and they’re remarkably similar to what Nvidia is doing. In fact, AMD was first by a mile with a Folding@home client, and Adobe showed the same Photoshop demo at the press event for RV770 that it did at Nvidia’s GT200 expo—the program uses a graphics API, not CUDA. Nvidia may have more to invest in marketing and building a software ecosystem around CUDA, but cross-GPU standards are what will allow GPU computing to succeed. When that happens, AMD will surely be there, too.
