The graphics game has been nothing if not interesting the past year or so. AMD’s Radeon HD 4800 series upended expectations by using a mid-sized chip to serve the bulk of the market and pairing two of them in an X2 card to create a high-end product. This strategy has worked out pretty well, in no small part because the Radeon HD 4870 GPU has proven to be very efficient for its size. The result? Fast graphics cards have become very affordable, with prices falling to almost-embarrassing lows over time.
Nvidia, meanwhile, has been relatively quiet in terms of truly new products. The last new GeForce we reviewed, back in March, was the GTS 250, a cost-reduced card based on a GPU that traces its roots back to the two-year-old GeForce 8800 GT. Nvidia has milked that G92 GPU as if it were a cow mainlining an experimental drug cocktail from Monsanto. The higher end of the GeForce lineup has been powered by the GT200 GPU, a much larger chip than anything AMD makes, with only somewhat higher performance than the Radeon HD 4870.
All the while, folks have been buzzing about what, exactly, comes next for GPUs. Intel’s Larrabee project has been imminent for some time now, promising big things via the miracle medium of PowerPoint. In a sort of pre-emptive response, Nvidia employees have developed, en masse, a puzzling tick: speak to them, and they keep saying “PhysX and CUDA, CUDA and PhysX” after each normal sentence. Sometimes they throw in a reference to 3D Vision, as well, although they seem vaguely embarrassed to admit their chips do graphics anymore. For its part, AMD has been talking rather ambiguously about “Fusion,” which once stood for a combination of CPU parts and GPU parts into a future uber-processor capable of amazing feats of simultaneous sequential and data-parallel processing but now seems to have morphed into “We’d like to sell you a CPU and an integrated graphics chipset, too.”
In the midst of all of this craziness, thank goodness, work has continued on new and rather traditional graphics processors, which have become important enough to cause all of this fuss in the first place. Less than 18 months after the introduction of the Radeon HD 4800 series, AMD has produced a new chip that’s roughly the same size yet promises to double its predecessor’s power in nearly every respect, including shader processing, texturing, pixel throughput, and, yes, GPU-compute capacity. The Radeon HD 5870 is more capable, too, in a hundred little ways, not least of which is its fidelity to the DirectX 11 spec. And in a solid bonus for its target market, the card based on it looks like the Batmobile.
What’s under the Batmobile’s hood
Where to start? Perhaps with codenames, since they’re thoroughly confusing. The last-gen GPU that powered the Radeon HD 4870 was code-named RV770, a familiar number in a succession of Radeon chips. The rumor mill long ago began talking about its successor as the RV870, a logical step forward. Yet marketing types have hijacked codenames and proliferated them, just to make my life difficult, and thus the RV870 became known as “Cypress.” The official name now is the Radeon HD 5870. We’ll refer to it in various ways throughout this article, just to keep you on your toes.
Much like the RV770, the Cypress chip is the product of a three-year project conducted at multiple sites around the globe, directed from AMD’s Orlando office by chief architect Clay Taylor.
A logical block diagram of Cypress. Source: AMD.
The image above contains much of what you might want to know about the newest Radeon, if you squint right. What you’re seeing truly is a doubling of resources versus the RV770. Cypress has twice as many SIMD arrays in its shader core, twice as many texture units aligned with those SIMD arrays, double the number of render back-ends, and even two rasterizers. The big-impact number may be 1600, as in the number of shader processors or whatever AMD is calling them this week. 1600 ALUs, at any rate, bring a prodigious amount of compute power to this puppy.
This GPU is more than just a doubling of what came before, though. If you could zoom in a little deeper, you’d find refinements made to nearly every functional area of the chip. In fact, we hope to do just that in the following pages. But first, we need to scare off anyone who randomly wandered in from Google trying to figure out which graphics card to buy by talking explicitly about chips.
Sorting the silicon
Cypress is manufactured by TSMC on its 40-nm fab process, and it shoehorns an estimated (and breathtaking) 2.15 billion transistors into a die that’s 334 mm². That makes it a little bit larger than other mid-sized GPUs; both the RV770 and the G92b from Nvidia are about 256 mm². Because they’re both manufactured on a 55-nm process, they contain considerably fewer transistors: 956 million for the RV770 and 754 million for the G92b, though counting methods sometimes vary. Chip size is important because it relates pretty directly to manufacturing costs. By delivering the first 40-nm product in this part of the market, and by cramming in a formidable amount of processing power, AMD has a good thing going.
The 55-nm G92b
Comparisons to the GT200 chip on GeForce GTX 200-series graphics cards are more difficult, because Nvidia doesn’t like to talk about die sizes, and I’m too chicken to pry the metal cap off one of the chips and risk destroying a card in the process.
The 65-nm GT200 under its metal cap
We know the GT200 ain’t small. Its transistor count is roughly 1.4 billion, and credible reports placed the original 65-nm GT200’s die size at 576 mm². The 55-nm GT200b shrink probably made it just under the 500 mm² mark, according to the rumor mill, but that’s still, uh, hefty. I swear I saw Tom Cruise and Nicole Kidman racing to plant a flag in one corner of the thing.
Cypress is but one member of an entire Evergreen family of products in development at AMD, all of which will share a common technology base. Initially, two cards, the Radeon HD 5870 and 5850, will be based on Cypress. Another codename, Hemlock, denotes the multi-GPU card based on dual Cypress chips that will likely be known as the Radeon HD 5870 X2. Juniper is a separate, smaller chip aimed at the range between $100 and $200. Logic dictates AMD would slot Juniper-based cards into the Radeon HD 5700 series. All of these products are scheduled to be introduced between now and the end of the year, amazingly enough, some in rapid succession.
The rest of the Evergreens will fall after Christmas, in the first quarter of 2010. Redwood is slated to serve the mainstream market (i.e., really cheap graphics cards) and Cedar the value segment (really even cheaper, like $60 cards). When all is said and done, AMD should have a top-to-bottom family of 40-nm, DirectX 11-capable graphics card offerings.
The focus of our attention today, though, is the Radeon HD 5870. This is AMD’s fastest single-GPU implementation of Cypress, with all 1600 SPs enabled and cranking away at 850 MHz. The card has a gigabyte of GDDR5 memory onboard clocked at 1200 MHz, for a 4.8 Gbps data rate. Also, it’s rather long. Have a look:
Radeon HD 5870 (left) next to 4890 (right)
Twin dual-link DVI connectors, along with HDMI, DisplayPort, and CrossFire connections
Thankfully, two six-pin power plugs will suffice
The bare card
The 5870 card’s PCB is 10.5″ long, an inch longer than the 4890 before it and the same size as a GeForce GTX 260 or a Radeon HD 4870 X2. However, that fancy cooler shroud extends to roughly 10 7/8″, which means the 5870 might have fit problems in more compact PC cases. You’ll want to measure before assuming this beast will fit into your mid-tower enclosure, folks.
Despite its iffy dimensions, AMD has clearly paid attention to detail in the card’s design. The multi-colored, injection-molded cooler shroud with Bat-inspired intake vents is just part of that. Dave Baumann, the 5870’s product manager, told us the firm had listened to users’ worries about high idle temperatures in the 4800 series and adjusted the 5870’s cooling accordingly. The 5870 should also have lower fan RPMs than its predecessor, and the use of a different bearing in the blower should produce a lower-pitched sound that’s less obtrusive in operation. AMD has built in hardware detection of voltage regulator temperatures, as well, to guard against the sort of overheating that “an application that amounted to a power virus” caused on the RV770 and other cards. (FurMark, anyone?)
The single biggest improvement from the last generation, though, is in power consumption. The 5870’s peak power draw is rated at 188W, up a bit from the 4870’s 160W TDP. But idle power draw on the 5870 is rated at an impressively low 27W, down precipitously from the 90W rating of the 4870. Much of the improvement comes from Cypress’s ability to put its GDDR5 memory into a low-power state, something the 4870’s first-gen GDDR5 interface couldn’t do. Additionally, the second 5870 board in CrossFire multi-GPU config can go even lower, dropping into an ultra-low power state just below 20W.
AMD says the plan is for Radeon HD 5870 cards to be available for purchase today at a price of $379. Nvidia appears to have cut prices preemptively in anticipation of the 5870’s launch, too, at least selectively. As I write, a GeForce GTX 285 is down to $295.99 after rebate at Newegg, and an MSI GeForce GTX 295 is reduced to $469.99 with free shipping.
The Radeon HD 5850. Source: AMD.
In all likelihood, the GTX 285 will find closer competition in the form of the Radeon HD 5850, the second Cypress-based product, due next week. The 5850 will have two of its SIMD arrays and texture units disabled, leading to a total of 1440 SPs and 72 texels per clock of filtering capacity. Also, clock speeds will be down, with the GPU at 725 MHz and the GDDR5 memory at 1 GHz or 4 Gbps. The 5850 will have the same suite of display outputs and CrossFire multi-GPU capabilities as the 5870, though, and will come with a visibly shorter PCB. AMD expects these boards to be available next week, most likely on Monday, for $259.
To Eyefinity and beyond
You may have already read about AMD’s Eyefinity capability that it’s pushing with the Radeon HD 5000 series. Most members of the Evergreen family (with the exception of the smallest chip) will be able to support up to three different displays simultaneously, as the 5870 can with its four outputs. One may connect either two DVI displays and one DisplayPort or one DVI, one HDMI 1.3a, and one DisplayPort. Optimally, that means a single 5870 could drive three four-megapixel displays at once. AMD has demonstrated and plans to release the Eyefinity6 edition of the Radeon HD 5870, which breaks new ground in the use of superscript in product naming. The Eyefinity6 backs up that bravado with an array of six compact DisplayPort connections that will allow it to feed up to six four-megapixel displays at once with a single GPU.
The other key to Eyefinity is a bit of driver magic that makes multiple monitors attached to a card appear to the OS as a single, large display surface. AMD’s drivers support a multitude of different possible configurations with varying monitor sizes and portrait/landscape orientations, some of which involve multiple display groups and thus multiple virtual display surfaces. Because all of the monitors in a display group appear as one to the operating system and applications, many games can simply run across multiple displays without any additional tweaking. Here’s a look at six narrow-bezel Samsung monitors running DiRT 2 on a single GPU:
And here’s a more extreme configuration AMD had cooked up at the press event for the Radeon HD 5870, with 24 total displays connected.
You might see that picture and think Eyefinity already works with CrossFire multi-GPU configurations, but that’s not the case yet. AMD says it is working on that, though.
I’m not entirely sure yet what to think of Eyefinity. On the one hand, I’m a bona-fide multi-monitor enthusiast myself, sporting six megapixels on my desktop as I write these words. I expect AMD to make big inroads into financial trading firms and other places where multi-display configurations are common. I’m pleased to see AMD paying renewed attention to multi-monitor capabilities, and just the sheer thought of having over 24 megapixels of display real estate pushes my PC enthusiast buttons. On the other hand, I tend to think that, for most of us, large-screen gaming might be better conducted on a big HDTV or via a projector, where you’re pushing fewer pixels (and less in need of GPU horsepower) across a larger display, uninterrupted by bezels.
But we’ll see. I intend to address Eyefinity and gaming in more depth in a future article. Perhaps I’ll find a use for all of those pixels.
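In the meantime, the display-group math is simple enough to sketch. The snippet below is purely illustrative: the grid layouts and 2560×1600 (four-megapixel) panels are my assumed examples, and it ignores bezel compensation entirely.

```python
# Rough sketch of the single virtual surface an Eyefinity display group
# presents to the OS (bezel compensation ignored; panel resolution and
# grid layout are illustrative assumptions).
def surface_size(cols, rows, width, height):
    return cols * width, rows * height

# Three 2560x1600 panels side by side, as one 5870 can drive:
w, h = surface_size(3, 1, 2560, 1600)
print(w, h, w * h / 1e6)   # 7680 x 1600, about 12.3 megapixels

# The six-panel Eyefinity6 arrangement in a 3x2 grid:
w, h = surface_size(3, 2, 2560, 1600)
print(w, h, w * h / 1e6)   # 7680 x 3200, about 24.6 megapixels
```

That second figure is where the "over 24 megapixels" number comes from.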
The graphics engine
AMD refers to the front-end of Cypress as the graphics engine, encompassing as it does the traditional setup engine, the command processor, and the thread dispatch processor. Notable new additions here include a second rasterizer and a next-generation tessellation unit.
The Cypress graphics engine. Source: AMD.
Keeping with the theme of doubling resources, AMD added a second rasterizer to make sure the GPU can convert polygon meshes into pixels at a rate sufficient to keep up with the rest of the chip. There are two separate units here, and I wondered at first whether taking full advantage of them might require the use of DirectX 11 and its multithreaded command processing. But AMD says the geometry assembly and thread dispatch units have been modified to perform the necessary load balancing in hardware transparently.
The tessellator is capable of turning lower-polygon models into higher-poly ones by using mathematical hints, such as higher-order surfaces. Radeons have had hardware tessellation units for several generations, as does the Xbox 360 GPU, but they’ve not been widely used because prior versions of DirectX haven’t exposed their capabilities. That all changes with DirectX 11, which exposes the tessellator for programming via two new shader types: hull shaders and domain shaders. Not only that, but Cypress’ tessellator is improved from prior iterations, so it can handle popular (as these things go) algorithms like Catmull-Clark in a single pass. The tessellator can adjust the level of geometric detail in real time, too. We should see vastly more geometric detail in terrain, characters, and the like once hardware tessellation goes into widespread use.
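To see the basic idea in miniature, here's a sketch of the simplest possible subdivision scheme: splitting each triangle into four at its edge midpoints. This is my own toy illustration of what tessellation amplification means, not AMD's algorithm; Cypress' unit handles far more sophisticated schemes, such as Catmull-Clark.

```python
# Toy uniform tessellation: split a triangle into four smaller ones
# at its edge midpoints. Each level of subdivision quadruples the
# triangle count, which is the "amplification" a tessellator provides.
def midpoint(p, q):
    return tuple((a + b) / 2.0 for a, b in zip(p, q))

def subdivide(tri):
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

tri = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
tris = subdivide(tri)
print(len(tris))   # 4 triangles from 1; n levels yield 4**n
```

In DirectX 11, the hull shader would decide how finely to subdivide, and the domain shader would displace the new vertices, which is how real geometric detail appears.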
Notable by their absence are the interpolation units traditionally found in the setup engine. In keeping with a long-term trend in graphics processors, these fixed-function interpolators have been replaced by the shader processors. AMD has added interpolation instructions to its shader cores as a means of implementing a new DirectX 11 feature called pull-model interpolation, which gives developers more direct control over interpolation (and thus over texture and shader filtering). The shader core offers higher mathematical precision than the old fixed-function hardware, and it has many times the compute power for linear interpolation, as well. AMD CTO Eric Demers pointed out in his introduction to the Cypress architecture that the RV770’s interpolation hardware had become a performance-limiting step in some texture filtering tests; using the SIMDs for interpolation should bypass that bottleneck.
Not only has Cypress doubled the amount of computing power available on a single GPU, but AMD has also added refinements to improve the per-clock performance, mathematical precision, and fundamental capabilities of its stream processors.
Here’s another look at the basic layout of the chip. Cypress has 20 SIMDs, each of which has 16 of what AMD calls thread processors inside of it. Each of those thread processors has five arithmetic logic units, or ALUs. Multiply it out, and you get a grand total of 1600 ALUs across the entire chip, or 1600 “stream processors” or “stream cores,” depending on which version of AMD nomenclature you pick. “Stream cores” is the latest, and it seems to be a bit inflationary. My friend David Kanter argues that what makes a core in computer architecture is the ability to fetch instructions. By that measure, Cypress would have 20 cores, since the thread processors inside of each SIMD march together according to one large instruction word.
The organization of the thread processors is essentially unchanged from the RV770 and traces its roots pretty directly back to the R600. The primary execution unit is superscalar and five ALUs wide. That fifth ALU is a superset of the others, capable of handling more advanced math like transcendentals. The execution units are pipelined with eight cycles of latency, but the SIMDs can execute two hardware thread groups, or “wavefronts” in AMD parlance, in interleaved fashion, so the effective wavefront latency is four cycles. Multiply that latency by the width of the SIMD, and you have 64 pixels or threads of branch granularity, just as in R600.
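The arithmetic above is easy to sanity-check. Here's a quick sketch using only the figures quoted in the text:

```python
# Back-of-the-envelope check of Cypress' shader organization.
simds = 20                  # SIMD arrays on the chip
thread_procs_per_simd = 16  # "thread processors" per SIMD
alus_per_tp = 5             # superscalar width of each thread processor

total_alus = simds * thread_procs_per_simd * alus_per_tp
print(total_alus)           # 1600 ALUs, aka "stream processors"

# A wavefront occupies a 16-wide SIMD for four effective cycles,
# so branching happens at 16 x 4 = 64-thread granularity.
wavefront_size = thread_procs_per_simd * 4
print(wavefront_size)       # 64
```

By Kanter's instruction-fetch definition, the 20 SIMDs in the first line are the chip's "cores."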
Despite this similarity to past architectures, AMD has made a host of improvements to Cypress, some of which are helpful for graphics, others for GPU compute, and some for both. Demers told us DirectX 11, DirectCompute 11, and OpenCL are fully implemented in hardware, with no need for performance-robbing software emulation of features. Demers stopped just short of asserting that Cypress would support the next version of OpenCL fully in hardware, as well, but gave the distinct impression that this chip would likely be able to do so.
Cypress adds a number of instructions to support DirectX 11, DirectCompute, and other missions this chip may have, including video encoding. One general performance improvement is the ability to co-issue a MUL and a dependent ADD instruction in a single clock, sidestepping a pitfall of its superscalar execution units.
On the dedicated compute front, Cypress continues to execute double-precision FP math at one-fifth its peak rate for single-precision, but AMD has upped the ante on precision in several ways. Demers claims the GPU is compliant with the IEEE 754-2008 standard, with precision-enhancing denorms handled “at speed.” The chip now supports a fused multiply-add instruction, which takes the result of a multiply operation and feeds it directly into the adder without rounding in between. Demers describes FMA as a way to achieve DP-like results with single-precision datatypes. (This FMA capability is present in some CPU architectures, but isn’t yet built into x86 microprocessors, believe it or not, though Intel and AMD have both announced plans to add it.) The lone potential snag for full IEEE compliance, Demers told us, is the case of “a few numerical exceptions.” The chip will report that such exceptions have occurred, but won’t execute user code to handle them.
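The benefit of skipping that intermediate rounding is easy to demonstrate in software. The sketch below is not AMD-specific; it emulates float32 arithmetic with Python's struct module, using contrived values chosen so the difference shows up.

```python
import struct

def f32(x):
    # Round a Python double to the nearest single-precision (float32) value.
    return struct.unpack('f', struct.pack('f', x))[0]

a = f32(1.0 + 2**-13)
b = f32(1.0 - 2**-13)
c = -1.0
# The exact product a*b is 1 - 2**-26, which float32 cannot represent.

# Separate multiply, then add: the product rounds to 1.0 in float32,
# so the tiny residual term is lost entirely.
mul_add = f32(f32(a * b) + c)   # 0.0

# FMA-style: the full-precision product feeds the adder directly,
# and rounding happens only once, at the end.
fma = f32(a * b + c)            # -2**-26, about -1.49e-8
print(mul_add, fma)
```

The FMA path recovers a result the two-step path discards completely, which is exactly why it yields "DP-like" accuracy from single-precision operands.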
A block diagram of Cypress as a stream processor. Source: AMD.
| Card | Peak shader arithmetic (GFLOPS), MAD only | MAD + MUL |
| --- | --- | --- |
| GeForce 9800 GT | | |
| GeForce GTS 250 | 484 | 726 |
| GeForce GTX 285 | 744 | 1116 |
| GeForce GTX 295 | 1192 | 1788 |
| Radeon HD 4850 | 1088 | – |
| Radeon HD 4870 | 1200 | – |
| Radeon HD 4890 OC | 1440 | – |
| Radeon HD 4870 X2 | 2400 | – |
| Radeon HD 5850 | 2088 | – |
| Radeon HD 5870 | 2720 | – |
AMD continues to devote more transistors to compute-specific logic. The local data stores on each SIMD, used for inter-process communication, have doubled in size to 32KB, and AMD’s distinctive global data share has quadrupled from 16 to 64KB. The memory export buffer can now scatter up to 64 32-bit values per clock, twice the rate of RV770. Cypress supports 32-bit atomic operations, as well; hardware semaphores enable global synchronization in “a few cycles,” according to Demers. However, Demers wouldn’t reveal whether or not Cypress’s memory controller is capable of supporting ECC memory, a capability that could be crucial in the burgeoning markets for GPU computing.
Demers made no bones about the fact that the primary market for this chip is graphics and gaming, but he was quick to point out that Cypress is also the most advanced GPU compute engine in the world. Given the current state of things, that claim seems credible, at least for the time being. The Radeon HD 5870’s peak processing power is formidable at 2.7 TFLOPS for single-precision math and 544 GFLOPS for double-precision. That’s more than twice the peak theoretical capacity of the GT200b’s fastest graphics card variant, the GeForce GTX 285, even if we generously include Nvidia’s co-issue feature in our FLOPS count.
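Those peaks fall straight out of the ALU count and clock speed given earlier, counting one multiply-add (two flops) per ALU per clock:

```python
# Peak theoretical shader throughput for the Radeon HD 5870,
# from the specs quoted in the text.
alus = 1600
clock_ghz = 0.850

sp_gflops = alus * 2 * clock_ghz   # one MAD = 2 flops per ALU per clock
print(sp_gflops)                   # 2720 GFLOPS, i.e. 2.72 TFLOPS

dp_gflops = sp_gflops / 5          # double precision runs at 1/5 rate
print(dp_gflops)                   # 544 GFLOPS
```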
Of course, as with almost any processor, peak throughput is only part of the story. We don’t yet have much in the way of standard GPU compute benchmarks or applications we can run, but we can look at the directed tests for shader performance in 3DMark.
These results range from disappointing (slightly slower than the GTX 285 in the GPU cloth test) to astounding (considerably faster than two Radeon HD 4870s in the parallax occlusion mapping and Perlin noise tests).
Texturing and memory
Cypress’ memory hierarchy has been massaged, too. The doubling of the number of SIMDs on chip means twice as many 8KB L1 caches onboard, so the total L1 size doubles to 160KB. All told, these caches give the GPU as much as one terabyte per second of bandwidth for L1 texture fetches, according to Demers, a staggering number. The four L2 caches associated with the memory controllers have grown from 64KB to 128KB each, and deliver up to 435 GB/s of bandwidth between the L1 and L2 caches.
AMD has held steady at four 64-bit memory controllers, yielding an aggregate 256-bit path to memory. GDDR5 data rates are up from 3.6 Gbps on the Radeon HD 4870 to 4.8 Gbps on the 5870, a 33% increase. This is perhaps one potential weakness of a chip that has doubled in nearly every other department, but AMD contends the RV770 was memory bandwidth-rich and compute-poor, relatively speaking, and thus not taking full advantage of the memory bandwidth available to it. Cypress, the firm claims, will be more balanced.
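The bandwidth figures are simple to derive from bus width and data rate:

```python
# Peak memory bandwidth = bus width (bytes) x per-pin data rate.
def bandwidth_gbps(bus_bits, data_rate_gbps):
    return bus_bits / 8 * data_rate_gbps   # GB/s

print(bandwidth_gbps(256, 4.8))   # Radeon HD 5870: 153.6 GB/s
print(bandwidth_gbps(256, 3.6))   # Radeon HD 4870: 115.2 GB/s
```

So a doubled chip gets only a third more bandwidth to feed it, which is the imbalance AMD is arguing doesn't matter in practice.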
AMD has made other provisions to make the best use of the bandwidth available to Cypress. Chief among them are new block texture compression modes, contributed by AMD to the DirectX 11 spec and now available to all GPU makers. These compression modes are purported to offer higher quality than prior standards, with better signal-to-noise ratios and better handling of transparency. The basic technology has been adapted to work with both standard 8-bit-per-channel integer and FP16 HDR texture formats, with compression ratios up to 6:1 possible. Texture sizes of up to 16k by 16k are now supported, as well.
Another tweak that gets my IQ-junkie juices flowing is the move to a new anisotropic filtering algorithm that does not vary the level of detail according to the angle of the surface to which it is being applied. This is a hardware-level change to the texture filtering units. AMD claims it has implemented an algorithm that achieves the same results as the Microsoft Direct3D reference rasterizer, but does so more efficiently, with no additional performance cost compared to its prior GPUs.
We can see the impact of this change using the infamous tunnel test, pictured below. The idea here is that you’re looking down a 3D-rendered cylinder, and the mip-maps are different colors in order to show you where one ends and the other begins. Some level of blending between them is being applied by the GPU, also; that’s trilinear filtering. The closer the colored shape is to a circle, the less the aniso filtering algorithm varies according to the angle of inclination. In other words, rounder is better. The smoother the blending between the colors, the more trilinear filtering is being applied. Smoother is better.
Anisotropic texture filtering and trilinear blending
Radeon HD 4870
Radeon HD 5870
GeForce GTX 285
Up to now, as you can see, Nvidia has performed better on this test than AMD. I don’t want to overstate the importance of that; the reality is that these things are much easier to spot in a contrived test like this one than in a real game, where the differences are very tough to see. Still, the 5870 aces this test in pixel-perfect fashion, setting a new standard for anisotropic filtering.
Not only that, but generally we’d be handing you a caveat right now about trilinear filtering on the Radeons, because AMD has long used an adaptive trilinear algorithm that applies more or less blending depending on the contrast between the textures involved. In the case of a test like this one, that algorithm always does its best work, because the mip maps are entirely different colors. In games, it applies less filtering and may not always achieve ideal results. However, for Cypress, AMD has decided to stop using that adaptive algorithm. Instead, they say, the Radeon HD 5870 applies full trilinear filtering all of the time by default, so the buttery smooth transitions between mip-map colors you’re seeing in the image above are in earnest.
In games, the impact of these image quality improvements is subtle, but you can expect to see less high-frequency noise in the form of things like texture crawling and sparkle on the 5870. I need to play with it some more, frankly, in order to find some good examples of the differences. I can tell you now that they’ll likely be very difficult to capture in a static screenshot. We’ll try to look into this topic more when we have time, though.
| Card | Peak pixel fill (Gpixels/s) | Peak bilinear texel rate (Gtexels/s) | Peak FP16 bilinear rate (Gtexels/s) | Peak memory bandwidth (GB/s) |
| --- | --- | --- | --- | --- |
| GeForce 9800 GT | | | | |
| GeForce GTS 250 | 12.3 | 49.3 | 24.6 | 71.9 |
| GeForce GTX 285 | 21.4 | 53.6 | 26.8 | 166.4 |
| GeForce GTX 295 | 32.3 | 92.2 | 46.1 | 223.9 |
| Radeon HD 4850 | 10.9 | 27.2 | 13.6 | 67.2 |
| Radeon HD 4870 | 12.0 | 30.0 | 15.0 | 115.2 |
| Radeon HD 4890 OC | 14.4 | 36.0 | 18.0 | 124.8 |
| Radeon HD 4870 X2 | 24.0 | 60.0 | 30.0 | 230.4 |
| Radeon HD 5850 | 23.2 | 52.2 | 26.1 | 128.0 |
| Radeon HD 5870 | 27.2 | 68.0 | 34.0 | 153.6 |
One reason AMD was able to make these image quality improvements is this GPU’s embarrassment of riches in the texture filtering department, where it more than doubles the peak theoretical capacity of the Radeon HD 4870. Cypress also has twice as many render back-ends or ROPs as the RV770, with two attached to each memory controller, so it has substantially more peak pixel fill rate and antialiasing oomph.
This color fill rate test usually ends up being memory-bandwidth limited. The Radeon HD 5870 has a little bit less memory bandwidth, in theory, than the GeForce GTX 285, and it works out that way in practice, too.
This test measures filtering rates with standard integer texture formats. The 5870 falls a little shy of its theoretical peak of 68 bilinear filtered Gtexels/s, but most of these GPUs do. Interestingly enough, the 5870 also falls behind the Radeon HD 4870 X2 and the GeForce GTX 285 when we get to the higher levels of anisotropy. Those other cards both have more memory bandwidth than the 5870, which may play a part. But remember, also, that they’re producing lower-quality results than the 5870. Notice that in its high-quality filtering mode, which still can’t match the 5870’s output, the GTX 285’s performance drops below the 5870’s.
This is a test of FP16 texture filtering, which is probably where we want to focus more of our attention, since this is the hard stuff. However, I still don’t know what the heck is going on with the units here. At the very least, they’re off by a factor of 50, since the 5870’s peak theoretical FP16 filtering speed is 34 Gtexels/s and 3DMark is reporting 1868. This has been a long-standing problem with 3DMark Vantage, and the folks at Futuremark have stopped answering my emails about it. I’m open to suggestions for alternate FP16 texture filtering tests.
In the meantime, we’re going to assume the relative differences here are meaningful, at least, and notice that the Radeon HD 5870 is alone at the top of the charts. This is likely one of the places where most GPUs are interpolation limited, and the 5870’s shader-based interpolation allows it to outpace even two Radeon HD 4870s on an X2 card.
The render back-ends and antialiasing
The render back-ends in Cypress haven’t escaped notice, either. Besides doubling up, the individual render back-end units have gained some new capabilities. A new read-back path lets the chip’s texture units read from the compressed color buffers for antialiasing, which should improve performance with AMD’s custom-filtered AA modes. Performance when using multiple render targets has purportedly improved, and comically, AMD has built in a provision for fast color clears because some software vendors were prone to clearing the screen many times, for whatever reason.
The larger L2 caches adjacent to the render back-ends should mean less of a performance hit when going from 4X to 8X multisampled antialiasing, according to AMD, although that seems a bit academic since the hit on the RV770 was really quite small.
The biggest news on the antialiasing front, though, is the return of supersampling. Yep, it’s back! Most antialiasing methods in recent years have focused on object edges alone, especially the dominant form of AA, known as multisampling. Supersampling is more of a brute-force method in which every pixel onscreen is sampled multiple times, not just object edges. It’s terribly inefficient, of course, except that it has the potential to improve image quality everywhere and is obviously the best choice, if you can afford to pay the performance cost. (Supersampling is de rigueur among professional animators and the like.)
Because it touches every pixel on the screen, supersampling can address difficult cases that multisampled AA modes won’t address: visible edges created by sharp color transitions or alpha transparencies in textures, shimmering in object interiors caused by pixel shaders without sufficient internal sampling rates, or any high-frequency noise your texture filtering algorithm has failed to eliminate. Using it in a game, you may simply find that objects onscreen appear to have more solidity to them.
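To make the brute-force nature concrete, here's a minimal sketch of a 4X supersampled resolve on a plain ordered grid (real GPUs use smarter, jittered sample patterns): shade at twice the width and height, then box-filter each 2×2 block down to one pixel. Every pixel costs four shader evaluations, whether it's on an edge or not.

```python
# Minimal 4X supersampling resolve: each output pixel is the average of
# a 2x2 block of samples, so shading cost scales with the sample count.
def resolve_4x(samples):
    # 'samples' is a 2H x 2W grid of grayscale values
    h, w = len(samples) // 2, len(samples[0]) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            block = (samples[2*y][2*x] + samples[2*y][2*x+1] +
                     samples[2*y+1][2*x] + samples[2*y+1][2*x+1])
            out[y][x] = block / 4.0
    return out

# An edge that cuts through the middle of a pixel at sample resolution...
hi_res = [[1.0, 1.0, 1.0, 0.0]] * 2
# ...resolves to partial coverage in the final 1x2 image:
print(resolve_4x(hi_res))   # [[1.0, 0.5]]
```

Because the averaging applies to shader and texture output as well as geometry, this is what tames interior shimmer, not just edges.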
AMD has gussied up its supersampling mode by toying with the sample patterns, too. I don’t have an image of it, and conventional tools won’t produce one, but the sampling pattern varies within a 2×2-pixel block, in an attempt to defeat our eyes’ propensity to recognize regular patterns. Using four different patterns helps on this front.
The Radeon HD 5870 supports 2X, 4X, and 8X supersampling via a simple switch in the Catalyst Control Center, and the traditional box filters can be combined with custom-filtered AA modes to ratchet up the effective sample count. I’d like to write more about it, but I’ve only had a week to spend with the 5870 so far, and I think these antialiasing methods deserve their own article, at some point, complete with a suite of comparative screenshots (not that screenshots can capture the full impact of supersampling on image quality).
I’m going to tip my hand on the 5870’s gaming performance in order to give you a look at the performance hit caused by various antialiasing methods. Hang on…
Yeah, so looking at the orange bars that represent 4X multisampled AA, a single 5870 is indeed faster than a Radeon HD 4870 X2. Yikes.
And, back on task, the performance hit when going from 4X MSAA to 8X MSAA is indeed smaller on the 5870 than on the 4870 X2, although, like I said, both are pretty much academic at this point.
Notice how much larger the performance hit is for 8X MSAA on the GeForce GTX 285. Nvidia’s ROPs or render back-ends just don’t handle 8X multisampling as well as AMD’s, for some reason. Notice, though, that Nvidia pretty much makes up for its poor 8X multisampling performance via its coverage-sampling AA modes, which store fewer color samples than conventional multisampling. Nvidia’s 16X CSAA mode employs more coverage samples and fewer color samples than 8X multisampling and delivers arguably comparable image quality with essentially no performance hit versus 4X MSAA.
For this reason, I’ve limited the bulk of my performance testing to 4X multisampled AA, where the GPUs are on common ground.
Oh, and yes, the performance hit with supersampling is brutal, but at 4X, the 5870 still achieves a very playable 47 FPS average in Left 4 Dead at 2560×1600 with 16X anisotropic filtering. If you have the power, why not use it?
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
Processor: Core i7-965 Extreme
System bus: QPI 6.4 GT/s
North bridge: X58 IOH
Chipset drivers: INF update 22.214.171.1245, Matrix Storage Manager 126.96.36.1993
Memory size: 6GB (3 DIMMs)
Memory type: DDR3 SDRAM at 1333MHz
CAS latency (CL): 8
RAS to CAS delay (tRCD): 8
RAS precharge (tRP): 8
Cycle time (tRAS): 24
Audio: Realtek 188.8.131.5219 drivers
Graphics:
- Sapphire Radeon HD 4890 OC 1GB PCIe with Catalyst 8.66-090910a-088431E drivers
- Radeon HD 4870 X2 2GB PCIe with Catalyst 8.66-090910a-088431E drivers
- Radeon HD 5870 1GB PCIe with Catalyst 8.66-090910a-088431E drivers
- Dual Radeon HD 5870 1GB PCIe with Catalyst 8.66-090910a-088431E drivers
- Asus GeForce GTX 285 1GB PCIe with ForceWare 190.62 drivers
- Dual Asus GeForce GTX 285 1GB PCIe with ForceWare 190.62 drivers
- GeForce GTX 295 2GB PCIe with ForceWare 190.62 drivers
Hard drive: WD Caviar SE16 320GB SATA
Power supply: PC Power & Cooling Silencer 750W
OS: Windows 7 Ultimate x64 Edition
OS updates: DirectX March 2009 update
Thanks to Corsair for providing us with memory for our testing. Their quality, service, and support easily outclass what you’d get with generic, no-name DIMMs.
Our test systems were powered by PC Power & Cooling Silencer 750W power supply units. The Silencer 750W was a runaway Editor’s Choice winner in our epic 11-way power supply roundup, so it seemed like a fitting choice for our test rigs.
Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- Crysis Warhead 1.1 64-bit
- Far Cry 2 1.03
- Left 4 Dead build 3939
- Sacred 2: Fallen Angel 2.40.0 build 1551
- Tom Clancy’s HAWX 1.01
- Wolfenstein 1.1
- 3DMark Vantage 1.0.1
- D3D RightMark beta 4
- FRAPS 2.9.8
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Far Cry 2
We tested Far Cry 2 using the game’s built-in benchmarking tool, which allowed us to test the different cards at multiple resolutions in a precisely repeatable manner. We used the benchmark tool’s “Very high” quality presets with the DirectX 10 renderer and 4X multisampled antialiasing.
True to form, the 5870 tracks closely with the Radeon HD 4870 X2 here, matching the performance of two prior-gen chips with a single piece of silicon. That’s sufficient to put the 5870 comfortably ahead of the GeForce GTX 285, but not quite enough to push it past the dual-GPU GeForce GTX 295.
Wolfenstein
We recorded a demo during a multiplayer game on the Hospital map and played it back using the “timeNetDemo” command. At all resolutions, the game’s quality options were at their peaks, with 4X multisampled AA and 8X anisotropic filtering enabled.
AMD’s new hotness again tracks closely with the 4870 X2, but its performance lead over the GeForce GTX 285 narrows here, and it pretty much disappears altogether in multi-GPU mode. Nvidia has a long history of relatively high performance in id Software’s OpenGL-based game engines, and that trend appears to be continuing here.
Of course, that’s all relative. The reality is that even the slowest card pushes a reasonably decent 46 FPS in this game at high quality levels and a four-megapixel resolution, and every other card we tested averages over 60 FPS. If we want to challenge the 5870, game developers will have to start making use of DirectX 11 and more advanced shader effects.
Left 4 Dead
We also used a custom-recorded timedemo with Valve’s excellent zombie shooter, Left 4 Dead. We tested with 4X multisampled AA and 16X anisotropic filtering enabled and all of the game’s quality options cranked.
The 5870 returns to form here, dominating both the single- and multi-GPU contests while performing almost exactly like a Radeon HD 4870 X2. Again, the game itself barely challenges these GPUs.
Tom Clancy’s HAWX
Last time we tested with HAWX, we used FRAPS to record frame rates while we played the game. Doing so does work, but I had some trepidation about its repeatability, because of one thing: when you take off straight up, pointed at the sky, frame rates tend to skyrocket. The amount of time you spend nose-up in the game will affect frame rates rather profoundly. And personally, I can’t play this game well without accelerating straight up from time to time. Otherwise, I run into the ground, or I just can’t get targets lined up quickly.
As a result, I decided this time to use the built-in benchmark tool in HAWX, which seems to do a good job of putting a video card through its paces. We tested this game in DirectX 10 mode with all of the image quality options either turned on or set to “High”, along with 4X multisampled antialiasing. Since this game supports DirectX 10.1 for enhanced performance, we enabled it on the Radeons. No current GeForce GPU supports DX10.1, though, so we couldn’t use it with them.
Hmm… so the 5870 isn’t much faster than the other single-GPU cards in the mix here, though it does scale nicely in CrossFire mode.
To give you some idea of the effect of DirectX 10.1 on performance here, the 5870’s frame rate at 2560×1600 dropped to 52 FPS with DirectX 10, a whole four frames per second. The 4870 X2 took a bigger hit, going from 73 to 56 FPS with the change.
Sacred 2: Fallen Angel
I must confess that I’ve spent the vast majority of my gaming time in the last couple of months playing Sacred 2. A little surprisingly for an RPG, this game is demanding enough to test even the fastest GPUs at its highest quality settings. And it puts all of that GPU power to good use by churning out some fantastic visuals.
We tested at 2560×1600 resolution with the game’s quality options at their “Very high” presets (typically the best possible quality setting) with 4X MSAA.
Given the way this game tends to play, we decided to test with fewer, longer sessions when capturing frame rates with FRAPS. We settled on three five-minute-long play sessions, all in the same area of the game. We then reported the median of the average and minimum frame rates from the three runs.
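The reduction step described above is simple enough to sketch. The frame-rate numbers here are made up for illustration; the shape of the computation, per-run average and minimum, then the median of each across runs, is the point:

```python
# Sketch of reducing per-run FRAPS frame rates to the reported numbers:
# compute each run's average and minimum FPS, then take the median of
# each statistic across the runs. Data below is invented for illustration.
from statistics import median

runs_fps = [  # per-second frame rates from three hypothetical sessions
    [48, 52, 45, 50, 55],
    [47, 51, 44, 49, 53],
    [50, 54, 46, 52, 56],
]

averages = [sum(run) / len(run) for run in runs_fps]
minimums = [min(run) for run in runs_fps]

reported_avg = median(averages)   # median of per-run averages
reported_min = median(minimums)   # median of per-run minimums
print(reported_avg, reported_min)
```

Using the median rather than the mean of the runs keeps one outlier session (a background task, an unlucky spawn) from skewing the reported result.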
This game also supports Nvidia’s PhysX, with some nice GPU-accelerated add-on effects if you have a GeForce card. Processing those effects will put a strain on your GPU, and we’re already testing at some pretty strenuous settings. Still, I’ve included results for the GeForce GTX 295 in two additional configurations: with PhysX effects enabled in the card’s default multi-GPU SLI configuration, and with on-card SLI disabled, in which case the second GPU is dedicated solely to PhysX effects. It is possible to play Sacred 2 with the extra PhysX eye candy enabled on a Radeon, but in that case, the physical simulations are handled entirely on the CPU, and they’re unbearably slow, unfortunately.
In another strong showing, the new Radeon outperforms both teams’ dual-GPU cards, the 4870 X2 and the GTX 295. In CrossFire, it’s money.
You can see the performance hit caused by enabling PhysX at this resolution. On the GTX 295, it’s just not worth it. Another interesting note for you… As I said, enabling the extra PhysX effects on the Radeon cards leads to horrendous performance, like 3-4 FPS, because those effects have to be handled on the CPU. But guess what? I popped Sacred 2 into windowed mode and had a look at Task Manager while the game was running at 3 FPS, and here’s what I saw, in miniature:
OK, so it’s hard to see, but Task Manager is showing CPU utilization of 14%, which means the game, and Nvidia’s purportedly multithreaded PhysX solver, is making use of just over one of our Core i7-965 Extreme’s eight front-ends and less than one of its four cores. I’d say that in this situation, failing to make use of the CPU power available amounts to sabotaging performance on your competition’s hardware. The truth is that rigid-body physics isn’t too terribly hard to do on a modern CPU, even with lots of objects. Nvidia may not wish to port its PhysX solver to the Radeon, even though a GPU like Cypress is more than capable of handling the job. That’s a shame, yet one can understand the business reasons. But if Nvidia is going to pay game developers to incorporate PhysX support into their games, it ought to work in good faith to optimize for the various processors available to it. At a very basic level, threading your easily parallelizable CPU-based PhysX solver should be part of that work, in my view.
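To illustrate why this phase of the work parallelizes so readily: the per-body integration step of a rigid-body simulation is independent across bodies, so it chunks naturally across worker threads. This is a toy sketch, not PhysX code, and in Python the GIL limits the actual speedup, so treat it as a picture of the structure a native solver could use (one worker per core) rather than a benchmark:

```python
# Toy illustration (not PhysX): integrating each rigid body forward one
# timestep touches no other body's state, so the bodies can be mapped
# across a thread pool. A native C/C++ solver would do this with one
# worker per core; Python's GIL makes this structural only.
from concurrent.futures import ThreadPoolExecutor

DT = 0.016  # ~60 Hz timestep

def integrate(body):
    """Advance one body: apply gravity to velocity, then move (1D toy)."""
    x, v = body
    v = v - 9.8 * DT     # gravity on the vertical axis
    x = x + v * DT
    return (x, v)

def step(bodies, workers=4):
    # Each body is independent for this phase, so map them in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(integrate, bodies))

bodies = [(10.0, 0.0), (5.0, 2.0)]
bodies = step(bodies)
```

Collision resolution between touching bodies needs more care, since pairs share state, but the broad integration and broad-phase passes are exactly the kind of work that should be spread across a quad-core chip instead of idling on one thread.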
But will it run Crysis?
Although we’ve had a bit of a tough time finding games that will really push the limits of the Radeon HD 5870, this game engine is certain to do it. In a true test of GPU power, we turned all of the quality settings in Warhead up to their maximums using the cheesily named “Enthusiast” presets. The game looks absolutely gorgeous at these settings, but few video cards will run it smoothly. In fact, we chose to test at 1920×1200 rather than 2560×1600 because it appears at least some of the cards have serious trouble at the higher resolution, almost as if they were running out of video RAM. Anyhow, this is a pretty brutal test, tough enough to challenge even our fastest multi-GPU setups.
For this game, we tested each GPU config in five 60-second sessions, covering the same portion of the game each time. We’ve then reported the median average and minimum frame rates from those five runs.
Told you this would be a tough test. The 5870’s performance once again mirrors that of the Radeon HD 4870 X2. However, as you can see, the 5870 experienced a couple of odd performance dips in CrossFire mode at certain points during our session. This problem occurred in multiple sessions and had a real impact on playability, unfortunately. I expect AMD has some driver work to do on this front.
Power consumption
We measured total system power consumption at the wall socket using an Extech power analyzer model 380803. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench.
The idle measurements were taken at the Windows desktop with the Aero theme enabled. The cards were tested under load running Left 4 Dead at 2560×1600 resolution, using the same settings we did for performance testing.
Look at that. A single 5870 draws less power at idle than any other card we tested, besting even the prior champ, the GeForce GTX 285. And two 5870s in CrossFire draw less power at idle than a single Radeon HD 4890. Very nice.
The 5870 also draws the least power under load. Given its performance, the overall power efficiency is astounding.
Noise levels
We measured noise levels on our test system, sitting on an open test bench, using an Extech model 407738 digital sound level meter. The meter was mounted on a tripod approximately 8″ from the test system at a height even with the top of the video card. We used the OSHA-standard weighting and speed for these measurements.
You can think of these noise level measurements much like our system power consumption tests, because the entire systems’ noise levels were measured. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.
The 5870 has best-in-class acoustics at idle and the second-lowest noise level under load.
GPU temperatures
For most of the cards, I used GPU-Z to log temperatures during our load testing. In the case of multi-GPU setups, I recorded temperatures on the primary card. However, GPU-Z didn’t yet know what to do with the 5870, so I had to resort to running a 3D program in a window while reading the temperature from the Overdrive section of AMD’s Catalyst control panel.
The days of pushing 95°C on the GPU core are, happily, fading away. AMD adjusted its fan speed thresholds on the Radeon HD 4890, and it’s stuck to the same formula here. That big Bat-cooler holds the 5870 to a more comfortable 76°C, even though it’s relatively quiet. And I’m pleased to report that our 5870 eventually dropped back down to 39°C at idle after running this test.
Conclusions
Well, Sherlock, what do you expect me to say? AMD has succeeded in delivering the first DirectX 11 GPU by some number of months, perhaps more than just a few, depending on how quickly Nvidia can get its DX11 part to market. AMD has also managed to double its graphics and compute performance outright from one generation to the next, while ratcheting up image quality at the same time. The Radeon HD 5870 is the fastest GPU on the planet, with the best visual output, and the most compelling set of features. Yet it’s still a mid-sized chip by GPU standards. As a result, the 5870’s power draw, noise levels, and GPU temperatures are all admirably low. My one gripe: I wish the board weren’t quite so long, because it may face clearance issues in some enclosures.
If there’s trouble brewing at all here, it won’t come immediately from the GPU competition, but from game consoles and the developers who have chosen them almost exclusively as their performance targets. Games need to move on and take advantage of the many, many multiples of console-class graphics and processor power now available on the PC. If they don’t, AMD may have trouble selling this incredibly fast chip to consumers simply because the applications don’t take full advantage of it.
AMD seems to have recognized this problem when it built support for multiple large monitors into its chip and tweaked its software to support gaming across multiple displays, along with high-quality texture filtering and advanced antialiasing features like custom filters, edge-detection, and supersampling. I’m happy to have all of it, personally. I’ll acknowledge that not everyone needs a GPU this powerful to play today’s games, but I’ll take all of the graphics power I can get, especially if the GPU can use it to produce higher quality pixels. In my book, the Radeon HD 5870 is a steal at $379.
As ever in PC hardware, though, you may find even better values a rung or two from the top of the performance ladder. In this case, I’m thinking of the Radeon HD 5850, which sure looks promising at $259. I’m curious to see what the next week will bring on that front.