Nvidia’s GeForce GTX 280 graphics processor
If the GPU world were a wildlife special on the National Geographic channel, the G80 processor that powers GeForce 8800 GTX graphics cards would be a stunningly successful apex predator. In the nearly two years that have passed since its introduction, no other single-chip graphics solution has surpassed it. Newer GPUs have come close, shrinking similar capabilities into smaller, cooler chips, but that’s about it. The G80 is still the biggest, baddest beast of its kind: a chip, as we said at the time, with “the approximate surface area of Rosie O’Donnell.” After it dispatched its would-be rival, the Radeon HD 2900 XT, in an epic mismatch, AMD gave up on building high-end GPUs altogether, preferring instead to go the multi-GPU route. Meanwhile, the G80 has sired a whole range of successful offspring, from teeny little mobile chips to dual-chip monstrosities like the GeForce 9800 GX2.
Of course, even the strongest predator has a limited time as king of the pride, and the G80’s reign is coming to a close. Today, its true heir arrives on the scene in the form of the GT200 graphics processor powering the GeForce GTX 200-series graphics cards. Despite being built on a smaller chip fabrication process, the GT200 is even larger than the G80, and it packs nearly twice the processing power of its progenitor.
This new contender isn’t content with just ruling the same territory, either. Nvidia has ambitious plans to expand the GPU’s processing domain beyond real-time graphics and gaming, and as the GPU computing picture becomes clearer, those plans seem increasingly viable. Join us as we dive in for a look at this formidable new processor.
The GT200 GPU: an overview
The first thing to be said about the GT200 is that it’s not a major departure from Nvidia’s current stable of G80-derived GPUs. Instead, it’s very much a refinement of that architecture, with a multitude of tweaks throughout intended to improve throughput, efficiency, and the like. The GT200 adds a handful of new capabilities at the edges, but its core graphics functionality is very similar to current GeForce 8- and 9-series products.
As any graphics expert will tell you, determining what’s changed involves the study of Chiclets, of course. Nvidia has laid out the Chiclets in various flavors and patterns in order to convey the internal organization of GT200. Behold:
Shiny, but with a chewy center!
Arranged in this way, the Chiclets have much to tell us. The 10 large groups across the upper portion of the diagram are what Nvidia calls thread processing clusters, or TPCs. TPCs are familiar from G80, which has eight of them onboard. The little green boxes inside of the TPCs are the chip’s basic processing cores, known in Nvidia’s parlance as stream processors or SPs. The SPs are arranged in groups of eight, as you can see, and these groups have earned their own name and acronym, for the trifecta: they’re called SMs, or streaming multiprocessors.
Now, let’s combine the power of all three terms. 10 TPCs multiplied by three SMs times eight SPs works out to a total of 240 processing cores on the GT200. That’s an awful lot of green Chiclets and nearly twice the G80’s 128 SPs, a substantial increase in processing potential, not to mention chewy, minty flavor.
One of the key changes in the organization of the GT200 is the increase from two to three SMs inside of each thread processing cluster. The TPCs still house the chip’s texture addressing and filtering hardware (brown Chiclets), but the ratio of SPs to texturing units has increased by half, from 2:1 to 3:1. We’ve seen a growing bias toward shader power versus texturing over time, and this is another step in that direction. Even with the change, though, Nvidia remains more conservative on this front than AMD.
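For those who like to check the Chiclet math, here is a minimal Python sketch of the counting above. The function and constant names are mine, not Nvidia's, and the figure of eight texture filtering units per TPC is assumed to hold for both chips.

```python
# Unit counts as quoted in the article; names are illustrative only.
SPS_PER_SM = 8
TEX_FILTER_UNITS_PER_TPC = 8  # assumed identical for G80 and GT200


def shader_layout(tpcs, sms_per_tpc):
    """Return (total SPs, SPs per texture filtering unit) for a chip."""
    sps_per_tpc = sms_per_tpc * SPS_PER_SM
    total_sps = tpcs * sps_per_tpc
    sps_per_tex_unit = sps_per_tpc / TEX_FILTER_UNITS_PER_TPC
    return total_sps, sps_per_tex_unit


g80 = shader_layout(tpcs=8, sms_per_tpc=2)
gt200 = shader_layout(tpcs=10, sms_per_tpc=3)
print("G80:", g80)
print("GT200:", gt200)
```

Run it and the G80 comes out at 128 SPs with a 2:1 ratio, the GT200 at 240 SPs with a 3:1 ratio.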
The lower part of the diagram reveals a corresponding rise in pixel-pushing power with the increase in ROP (raster operator) partitions from six on the G80 (GeForce 8800 GTX) and four on the G92 (GeForce 9800 GTX) to eight on the GT200. Since each ROP partition can output four pixels at a time, the GT200 can output 32 pixels per clock. And since each ROP partition also hosts a 64-bit memory controller, the GT200’s path to memory is an aggregated 512 bits wide.
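The ROP arithmetic can be sketched the same way. This is only an illustration: the names are hypothetical, and it assumes the per-partition figures quoted for the GT200 (four pixels per clock, one 64-bit memory controller) also apply to the G80 and G92.

```python
PIXELS_PER_PARTITION = 4      # pixels output per clock, per ROP partition
BUS_BITS_PER_PARTITION = 64   # one 64-bit memory controller per partition


def rop_summary(partitions):
    """Return pixels per clock and aggregate memory bus width in bits."""
    return {
        "pixels_per_clock": partitions * PIXELS_PER_PARTITION,
        "bus_width_bits": partitions * BUS_BITS_PER_PARTITION,
    }


for name, partitions in [("G80", 6), ("G92", 4), ("GT200", 8)]:
    print(name, rop_summary(partitions))
```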
In short, the GT200 has a whole lot of pretty much everything.
One thing it lacks, however, is support for DirectX 10.1. Some folks had expected Nvidia to follow AMD down this path, since AMD introduced DX10.1 support in its Radeon HD 3000 series last fall. DX10.1 introduces extensions that expose greater control over the GPU’s antialiasing capabilities, among other things. Nvidia says its GPUs can handle some DX10.1 capabilities, but not all of them. That prevents it from claiming DX10.1 support, since Microsoft considers it an all-or-nothing affair. Curiously, though, Nvidia says it is working with game developers to support a subset of DX10.1 extensions, even though Microsoft may not be entirely pleased with the prospect. I believe that work includes addressing problems with antialiasing and game engines that use deferred shading, one of the places where DX10.1 promises to have a big performance impact. Curiouser and curiouser: Nvidia is cagey about exactly which DX10.1 capabilities its GPUs can and cannot support, for whatever reason.
The chip: Large and in charge
When I say the GT200 has a whole lot of everything, that naturally includes transistors: roughly 1.4 billion of them, more than double the 681 million transistors in the G80. Ever faithful to its dictum to avoid the risk of transitioning to a new fab process with a substantially new design, Nvidia has stuck with a 65nm manufacturing technology, and in fact, it says the GT200 is the largest chip TSMC has ever fabricated.
Mounted on a board and covered with a protective metal cap, the GT200 looks like so:
When I asked Nvidia’s Tony Tamasi about the GT200’s die size, he wouldn’t get too specific, preferring only to peg it between 500 and 600 mm². Given that, I think the reports that GT200’s die is 576 mm² are credible. Whatever the case, this chip is big, like “after the first wafer was fabbed, the tide came in two minutes late” big. It’s the Kim Kardashian’s butt of the GPU world.
To give you some additional perspective on its size, here’s a to-scale comparison Nvidia provided between a GT200 GPU and an Intel “Penryn” 45nm dual-core CPU.
Such a large chip can’t be inexpensive to manufacture, since yields tend to fall off exponentially as chip area grows. Nvidia almost seems to revel in having such a big chip, though, and it does have experience in this realm. It certainly seems as if Nvidia’s last large chip, the G80, worked out pretty well. Perhaps they’re not crazy to do this.
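To get a feel for why die area hurts so much, here is a rough sketch using the textbook Poisson yield model, in which the fraction of defect-free dies is exp(-area × defect density). That model is my choice for illustration, not anything Nvidia or TSMC has disclosed, and the defect density below is a made-up placeholder. The 576 mm² figure is the reported, unconfirmed die size.

```python
import math


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies expected to be defect-free under a Poisson model."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


HYPOTHETICAL_DEFECT_DENSITY = 0.2  # defects per cm^2, placeholder value

for area in (144, 288, 576):
    y = poisson_yield(area, HYPOTHETICAL_DEFECT_DENSITY)
    print(f"{area} mm^2 die: {y:.1%} of dies defect-free")
```

Under this model, doubling the die area squares the yield fraction, so a big chip pays twice: fewer candidate dies fit on each wafer, and a smaller share of them come out clean.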
If you’re curious about what’s where on the die, have a look at the helpfully colored diagram below.
Tamasi noted that the shader cores look very regular and recognizable because they are made up of custom logic, like on a CPU, rather than being the product of automated logic synthesis. He also pointed out that, if you count ’em, the GT200 has exactly the number of on-chip shader structures you’d expect, with 10 TPCs easily visible and no extras thrown in to help increase yields. Of course, nothing precludes Nvidia from selling a GT200-based product with fewer than 10 TPCs enabled, either.
You may be wondering, with a chip this large, about power consumption, as in: Will the lights flicker when I fire up Call of Duty 4? The chip’s max thermal design power, or TDP, is 236W, which is considerable. However, Nvidia claims idle power draw for the GT200 of only 25W, down from 64W in the G80. They even say GT200’s idle power draw is similar to AMD’s righteously frugal RV670 GPU. We shall see about that, but how did they accomplish such a thing? GeForce GPUs have many clock domains, as evidenced by the fact that the GPU core and shader clock speeds diverge. Tamasi said Nvidia implemented dynamic power and frequency scaling throughout the chip, with multiple units able to scale independently. He characterized G80 as an “on or off” affair, whereas GT200’s power use scales more linearly with demand. Even in a 3D game or application, he hinted, the GT200 might use much less power than its TDP maximum. Much like a CPU, GT200 has multiple power states with algorithmic determination of the proper state, and those P-states include a new, presumably relatively low-power state for video decoding and playback. Also, GT200-based cards will be compatible with Nvidia’s HybridPower scheme, so they can be deactivated entirely in favor of a chipset-based GPU when they’re not needed.
As you may have noticed in the photograph above, the GT200 brings back an innovation from the G80 that we hadn’t really expected to see again: a separate, companion display chip. This chip is similar in function to the one on the G80 but is a new chip with additional capabilities, including support for 10-bit-per-color-channel scan out. GT200 cards will feature a pair of dual-link DVI outputs with HDCP over both links (for high-res HD movie playback), and Nvidia claims they will support HDMI via a DVI-to-HDMI adapter, although our sample of a production card from XFX didn’t include such an adapter. GT200 boards can also support DisplayPort, but they’ll require a custom card design from the vendor, since Nvidia’s reference design doesn’t include a DisplayPort, er, display port. (Seriously, WTH?)
Incidentally, if you’re going to be playing HD movies back over one of those fancy connections, you’ll be pleased to learn that Nvidia has extended the PureVideo logic in the GT200 to handle decoding of files encoded with the VC-1 and WMV9 codecs, as well as H.264.
The cards: GeForce GTX 280 and 260
The GT200 GPU will initially ship in two different models of video cards from a variety of Nvidia partners. The board you see below is an example of the big daddy, the GeForce GTX 280, as it will ship from XFX. This is the full-throttle implementation of the GT200, with all 240 SPs and eight ROP partitions active. The GTX 280’s core clock speed will be 602MHz, with SPs clocked at 1296MHz. It comes with a full gigabyte of GDDR3 memory running at 1107MHz, for an effective 2214MT/s.
Obviously, this board has a dual-slot cooler, and it’s covered in a full complement of body armor composed half of metal (mostly around back) and half of plastic (you figure it out). We first saw this all-encompassing shroud treatment in the GeForce 9800 GX2. I suppose it’s possible this provision could actually reduce return rates on the cards simply by protecting them from rough handling or (heh) would-be volt-modders. I worry about the shroud trapping in heat, but I noticed when tearing one apart that the metal plate on the back side of the card apparently acts as a heat radiator for the memory chips mounted back there.
From end to end, the GTX 280 card is 10.5″ long and, as I’ve mentioned, its TDP is 236W. To keep this puppy fed, you’ll need a PSU with one eight-pin PCIe aux power connector and one six-pin one.
As with the 9800 GX2, the GTX 280’s SLI connectors are covered by a rubber cap, to keep the “black monolith” design theme going. Popping it off reveals dual connectors, threatening the prospect of three-way GTX 280 SLI. Heck, the only really exposed bit of the GTX 280 is the PCIe x16 connector, which is (of course) PCIe 2.0 compliant.
Oddly enough, the GTX 280 is slated to be available not today, but tomorrow, June 17th. You will be expected to pay roughly $649 for the privilege of owning one, which is, you know, a lot. If it’s any consolation, the XFX version of the GTX 280 ships with a copy of Assassin’s Creed, which is stellar as far as bundled games go and a nice showcase for the GTX 280.
The GeForce GTX 260 appears to use the same basic board design and cooler as the GTX 280, but it gets by with two six-pin aux power connectors. This card uses a somewhat stripped-down version of the GTX 280, with two thread processing clusters and one ROP partition disabled. As a result, the GTX 260 has 192 stream processors, a 448-bit path to memory, and reduced texturing and pixel-pushing power compared to the GTX 280. Clock rates are 576MHz for the core, 1242MHz SPs, and 999MHz memory. The deletion of one memory interface also brings another quirk: the GTX 260’s total memory size is 896MB, which is kinda weird but probably harmless.
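The GTX 260’s odd-looking numbers fall straight out of the disabled units. The sketch below is illustrative only: the helper name is hypothetical, and it assumes memory capacity scales with the number of active ROP partitions, with the GTX 280’s gigabyte taken as 1024MB.

```python
def cut_down_specs(tpcs_enabled, rop_partitions_enabled,
                   full_rop_partitions=8, full_memory_mb=1024):
    """Return (stream processors, bus width in bits, memory in MB)."""
    stream_processors = tpcs_enabled * 3 * 8        # 3 SMs per TPC, 8 SPs per SM
    bus_width_bits = rop_partitions_enabled * 64    # 64-bit controller per partition
    # Assumption: each ROP partition brings an equal slice of the memory.
    memory_mb = full_memory_mb * rop_partitions_enabled // full_rop_partitions
    return stream_processors, bus_width_bits, memory_mb


print("GTX 280:", cut_down_specs(10, 8))
print("GTX 260:", cut_down_specs(8, 7))
```

With eight TPCs and seven ROP partitions, that gives 192 SPs, a 448-bit bus, and 896MB, matching the GTX 260’s specs.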
Initially, Nvidia told us to expect GTX 260 cards to sell for $449, but last week, they revised the price down to $399. Could they be anticipating potent competition from the Radeon camp, or are they just feeling generous? Who knows, but we’ll take the lower price. GeForce GTX 260 cards aren’t slated for availability until June 26. By then, we expect to see another interesting option in the market, as well.
Here’s a quick picture of a GTX 260 card completely stripped of its shroud and cooler. I had a devil of a time removing that stuff. The GT200 GPU remains enormous.
GPU compute breaks out
Speaking of enormity, Nvidia is certainly talking up the potential of the GT200 and its siblings for applications beyond traditional real-time graphics processing. This isn’t just idle talk, of course; the potential of GPUs for handling certain types of computing problems has been evident for some time now. Quite a few tasks require performing relatively simple transforms on large amounts of similar data, and GPUs are ideal for streaming through such data sets and crunching them. Starting with the G80, Nvidia has built provisions into each of its graphics processors for GPU-compute applications. The firm has also developed CUDA, a C-like programming interface for its DX10-class GPUs that it offers to the world free of charge via a downloadable SDK.
The first GPU-enabled applications came from specific industries that confront particularly difficult computing problems that are good candidates for acceleration via parallel processors like GPUs: oil and gas exploration, biomedical imaging and simulation, computational fluid dynamics, and such things. Both Nvidia and ATI (now AMD) have been showing demos of such applications to the press for some time now. Both companies have even re-branded their GPUs as parallel compute engines and sold them in workstation- and server-style configurations. Indeed, surely one of the reasons Nvidia can justify building large GPUs like the G80 and GT200 is the fact that those chips can command high margins inside of Tesla GPU-compute products.
Impressive as they have been, though, such applications haven’t typically had broad appeal. Nvidia hopes the next wave of GPU-compute programs will include more consumer-oriented software, and it’s fond of pointing out that it has shipped over 70 million CUDA-capable GPUs since the introduction of the GeForce 8, a considerable installed base. Meanwhile, the company has quietly been buying up and investing in makers of tools and software with GPU-computing potential. At its press event for the GT200, Nvidia and its partners showed off a number of promising consumer-level products in development that use CUDA and the GPU to deliver new levels of performance. Among them:
- Adobe showed a work-in-progress version of Photoshop that uses the GPU to accelerate image display and manipulation. The demo consisted of loading up a 442-megapixel image (a 2GB file) and working with it. GPU-accelerated zooming allowed the program to flow instantaneously between a full-image view and close examination of a small section of the image, then move back out again. The image could be rotated freely, in real time, as well. Using another tool, the Adobe rep loaded up a 3D model of a motorcycle and was able to paint directly on its surface. He then grabbed a bit of vector art, stamped it on the surface of the model, and it sort of “melted” into place, moving with the bike as the view rotated. Again, the program’s responses were instant and fluid.
- Nvidia recently purchased a company called Rayscale that has developed a ray-tracing application for the GPU. Their software mixes traditional GPU rasterization via OpenGL with ray-tracing via CUDA to create high-quality images with much better reflections than possible with rasterization alone. At present, the company’s founders said, the software isn’t quite able to render images in real time; one limiter is the speed penalty exacted by performing a context switch from graphics mode to CUDA. Nvidia says it’s working on improving the speed of such switches.
- A firm called Elemental is preparing several video encoding products that use GPU acceleration, including a plug-in for Adobe Premiere and a stand-alone transcoder called BadaBoom. The company showed a demonstration of a very, very quick video transcode and claimed BadaBoom could convert an MPEG2 file to H.264 at a rate faster than real-time playback, a huge improvement over CPU-based encoding. The final product isn’t due until August, but Nvidia provided us with an early version of BadaBoom as we were in the late stages of putting together this review. We haven’t yet had time to play with it, but we’re hoping to conduct a reasonably good apples-to-apples comparison between video encoding on a multi-core CPU and a GPU, if we can work out the exact quality settings used by BadaBoom. Obviously, H.264 video encoding at the speeds Elemental claims could have tremendous mass-market appeal.
- Stanford University’s distributed computing guru, Vijay Pande, was on hand to show off a Folding@home client for Nvidia GPUs, at last! Radeons have had a Folding client for some time now, of course. The GeForce client has the distinction of being developed in CUDA, so it should be compatible with any GeForce 8 or newer Nvidia GPU. Pande said the GeForce GTX 280 can simulate protein folding at a rate of over 400 nanoseconds per day, or over 500 ns/day if the card’s not driving a display. We have a beta copy of the Folding client, and yep, it folds. However, we’re not yet satisfied with the performance testing tools Nvidia supplied alongside it, so we’ll refrain from publishing any numbers of our own just yet. I believe the client itself shouldn’t be too far from public release. (If you haven’t yet, consider joining Team TR and putting that GPU to good use.)
Last but certainly not least, Manju Hegde, former CEO of Ageia, offered an update on his team’s progress in porting the physics “solvers” for the PhysX API to the GPU in the wake of Nvidia’s buyout of Ageia. He said they started porting the solvers to CUDA roughly two and a half months ago and had them up and running within a month. Compared to the performance of a Core 2 Quad CPU, Hegde said the GeForce GTX 280 was up to 15X faster simulating fluids, 12X faster with soft bodies, and 13X faster with cloth and fabrics. (I believe that puts the GTX 280’s performance at roughly six to 10 times that of Ageia’s own PhysX hardware, for what it’s worth.) Their goal is to make sure all current hardware-accelerated PhysX content works with the GPU drivers.
Hegde also pointed out that game developers have become much more open to using hardware physics acceleration in their games since the acquisition, with 12 top-flight titles signing on in the first month, versus two titles in Ageia’s two-and-a-half years in existence. Among the games currently in development that will use PhysX are Natural Motion’s Backbreaker football sim and the sweet-looking Mirror’s Edge.
One question we don’t know the answer to just yet is how well hardware physics acceleration will coexist with 3D graphics processing, especially on low-end and mid-range GPUs. Hegde showed a striking “Creature from the Deep” demo that employs soft bodies, force fields, and particle debris at the event, but he later revealed that demo used two GPUs, one for graphics and the other for physics. Again, context switching overhead is an issue here. We expect to have an early PhysX driver to play with later this week. We’ll have to see how it performs.
Partially thanks to its push into GPU computing, Nvidia has been much more open about some details of the GT200’s architecture than it has been with prior GPU designs. As a result, we can take a look inside of a thread processing cluster and see a little more clearly how it works. The diagram at the right shows one TPC. Each TPC has three streaming multiprocessors (SMs), eight texture addressing/filtering units, and an L1 cache. For whatever reason, Nvidia won’t divulge the size of this L1 cache.
Inside of each SM is one instruction unit (IU), eight stream processors (SPs), and a 16K pool of local, shared memory. This local memory can facilitate inter-thread communication in GPU compute applications, but it’s not used that way in graphics, where such communication isn’t necessary.
For a while now, Nvidia has struggled with exactly how to characterize its GPUs’ computing model. At last, the firm seems to have settled on a name: SIMT, for “single instruction, multiple thread.” As with G80, GT200 execution is scalar rather than vector, with each SP processing a single pixel component at a time. The key to performance is keeping all of those execution units fed as much of the time as possible, and threading is the means by which the GT200 accomplishes this goal. All threads in the GT200 are managed in hardware by the IUs, with zero cost for switching between them.
The IU manages things in groups of 32 parallel threads Nvidia calls “warps.” The IU can track up to 32 warps, so each SM can handle up to 1024 threads in flight. Across the GT200’s 30 SMs, that adds up to as many as 30,720 concurrent hardware threads in flight at any given time. (G80 was similar, but peaked at 768 threads per SM for a maximum of 12,288 threads in flight.) The warp is a fundamental unit in the GPU. The chip’s branching granularity is one warp, which equates to 32 pixels or 16 vertices (or, I suppose, 32 compute threads). Since one pixel equals one thread, and since the SPs are scalar, the compiler schedules pixel elements for execution sequentially: red, then green, then blue, and then alpha. Meanwhile, inside of that same SM, seven other pixels are getting the exact same treatment in parallel.
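The thread counts are easy to reproduce. In this small sketch the names are mine, and the G80 figures of 16 SMs and 24 warps per SM are derived from the numbers above (eight TPCs with two SMs each, and 768 threads divided by 32 per warp) rather than quoted by Nvidia.

```python
THREADS_PER_WARP = 32


def threads_in_flight(sms, warps_per_sm):
    """Return (threads per SM, threads across the whole chip)."""
    per_sm = warps_per_sm * THREADS_PER_WARP
    return per_sm, per_sm * sms


print("GT200:", threads_in_flight(sms=30, warps_per_sm=32))
print("G80:", threads_in_flight(sms=16, warps_per_sm=24))  # derived values
```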
Should the threads in a warp hit a situation where a high-latency operation like a texture read/memory access is required, the IU can simply switch to processing another of the many warps it tracks while waiting for the results to come back. In this way, the GPU hides latency and keeps its SPs occupied.
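A toy model makes the latency-hiding idea concrete. This is a deliberately crude sketch, not a description of the GT200’s actual scheduler: every warp issues one instruction and then stalls for a fixed, made-up memory latency, the IU issues at most one ready warp per cycle, and switching between warps costs nothing.

```python
def issue_utilization(num_warps, mem_latency, cycles=10_000):
    """Fraction of cycles in which some warp was ready to issue."""
    ready_at = [0] * num_warps  # cycle at which each warp may issue again
    busy_cycles = 0
    for cycle in range(cycles):
        for warp in range(num_warps):
            if ready_at[warp] <= cycle:
                # Issue one instruction, then wait on a memory access.
                ready_at[warp] = cycle + 1 + mem_latency
                busy_cycles += 1
                break  # only one warp issues per cycle
    return busy_cycles / cycles


HYPOTHETICAL_LATENCY = 31  # cycles, placeholder value
for warps in (1, 8, 16, 32):
    util = issue_utilization(warps, HYPOTHETICAL_LATENCY)
    print(f"{warps:2d} warps: {util:.0%} of cycles busy")
```

With a single warp, the execution units sit idle almost all of the time. Once enough warps are in flight to cover the stall, utilization in this model reaches 100%, which is the whole point of tracking so many threads in hardware.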
That is, as I understand it, SIMT in a nutshell, and it’s essentially the model established by the G80. Of course, the GT200 is improved in ways big and small to deliver more processing power more efficiently than the G80.
One of those improvements is relatively high-profile because it affects the GT200’s theoretical peak FLOPS numbers. As you may know, each SP can contribute up to two FLOPS per clock by executing a multiply-add (MAD) instruction. On top of that, each SP has an associated special-function unit that handles things like transcendentals and interpolation. That SFU can also, when not being used otherwise, execute a floating-point multiply instruction, contributing another FLOP per clock to the SP’s output. By issuing a MAD and a MUL together, the SPs can deliver three total FLOPS per clock, and this potential is the basis for Nvidia’s claim of 518 GFLOPS peak for the GeForce 8800 GTX, as well as the estimate of 933 GFLOPS for the GeForce GTX 280.
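Those peak numbers are just SP count times shader clock times FLOPS per clock. Here is a quick sketch using the SP counts and shader clocks quoted for the two GeForce GTX cards; the function name is hypothetical.

```python
def peak_gflops(stream_processors, shader_clock_mhz, flops_per_clock):
    """Theoretical peak single-precision GFLOPS."""
    return stream_processors * shader_clock_mhz * flops_per_clock / 1000.0


cards = {"GeForce GTX 280": (240, 1296), "GeForce GTX 260": (192, 1242)}
for name, (sps, clock_mhz) in cards.items():
    mad_only = peak_gflops(sps, clock_mhz, 2)      # MAD alone
    mad_plus_mul = peak_gflops(sps, clock_mhz, 3)  # MAD + MUL dual-issue
    print(f"{name}: {mad_only:.0f} single-issue, {mad_plus_mul:.0f} dual-issue")
```

The gap between the two columns is the “missing MUL” in numerical form: a third of the dual-issue peak depends on the scheduler actually pairing a MUL with each MAD.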
Trouble is, that additional MUL wasn’t always accessible on the G80, leading some folks to muse about the mysterious case of the missing MUL. Nvidia won’t quite admit that dual-issue on the G80 was broken, but it says scheduling on the GT200 has been massaged so that it “can now perform near full-speed dual-issue” of a MAD+MUL pair. Tamasi claims the performance impact of dual-issue is measurable, with 3DMark Vantage’s Perlin noise test gaining 16% and the GPU cloth test gaining about 7% when dual-issue is active. That’s a long way from 33%, but it’s better than nothing, I suppose.
Another enhancement in GT200 is the doubling of the size of the register file for each SM. The aim here is, by adding more on-chip storage, to allow more complex shaders to run without overflowing into memory. Nvidia cites improvements of 35% in 3DMark Vantage’s parallax occlusion mapping test, 6% in GPU cloth, 5% in Perlin noise, and 15% overall with Vantage’s Extreme presets due to the larger register file.
Another standout in the laundry list of tweaks to GT200 is a much larger buffer for stream output from geometry shaders. Some developers have attempted to use geometry shaders for tessellation, but the large amount of data they produced caused problems for G80 and its progeny. The GT200’s stream out buffer is six times the size of G80’s, which should help. Nvidia’s own numbers show the Radeon HD 3870 working faster with geometry shaders than the G80; those same measurements put the GT200 above the Radeon HD 3870 X2.
The diagram above sets the stage for the final two modifications to the GT200’s processing capabilities. Nvidia likes to show this simplified diagram in order to explain how the GPU works in CUDA compute mode, when most of its graphics-specific logic won’t be used. As you can see, the Chiclets don’t change much, although the ROP hardware is essentially ignored, and what’s left is a great, big parallel compute machine.
One thing such a machine needs for scientific computing and the like is the ability to handle higher precision floating-point datatypes. Such precision isn’t typically necessary in graphics, especially real-time graphics, so it wasn’t a capability of the first DirectX 10-class GPUs. The GT200, however, adds the ability to process IEEE 754R-compliant, 64-bit, double-precision floating-point math. Nvidia has added one double-precision unit in each SM, so GT200 has 30 total. That gives it a peak double-precision computational rate of 78 GFLOPS, well below the GPU’s single-precision peak but still not too shabby.
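The 78 GFLOPS figure can be reproduced under one assumption of mine: that each double-precision unit retires one multiply-add (two FLOPS) per shader clock. The sketch below uses the GTX 280’s 1296MHz shader clock; the function name is hypothetical.

```python
def peak_dp_gflops(dp_units, shader_clock_mhz, flops_per_clock=2):
    """Theoretical peak double-precision GFLOPS.

    flops_per_clock=2 assumes one multiply-add per unit per clock.
    """
    return dp_units * shader_clock_mhz * flops_per_clock / 1000.0


dp_peak = peak_dp_gflops(dp_units=30, shader_clock_mhz=1296)
sp_peak = 240 * 1296 * 3 / 1000.0  # single-precision dual-issue peak
print(f"double precision: {dp_peak:.2f} GFLOPS")
print(f"ratio to single-precision peak: 1:{sp_peak / dp_peak:.0f}")
```

That works out to one-twelfth of the single-precision dual-issue peak, which follows from having one double-precision unit for every eight SPs and two FLOPS per clock instead of three.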
Another facility added to the GT200 for the CUDA crowd is represented by the extra-wide, light-blue Chiclets in the diagram above: the ability to perform atomic read-modify-write operations into memory, useful for certain types of GPU-compute algorithms.
| |Peak shader arithmetic, single-issue (GFLOPS)|Peak shader arithmetic, dual-issue (GFLOPS)|
|---|---|---|
|GeForce 8800 GTX|346|518|
|GeForce 9800 GTX|432|648|
|GeForce 9800 GX2|768|1152|
|GeForce GTX 260|477|715|
|GeForce GTX 280|622|933|
|Radeon HD 2900 XT|475|–|
|Radeon HD 3870|496|–|
|Radeon HD 3870 X2|1056|–|
So how powerful is the GT200’s shader array? With 240 cores operating at 1296MHz, it’s potentially quite formidable. The table on the right should put things into context.
As you’d expect, the GT200’s peak computational rate will depend on whether and how much it’s able to use its dual-issue capability to get that third FLOP per clock. We can probably expect that the GT200 will reach closer to its dual-issue peak than the G80 does to its own, but I suspect the GT200’s practical peak for graphics processing may be something less than 933 GFLOPS.
Nevertheless, the GeForce GTX 280 looks to be substantially more powerful than any other single-GPU solution, and it’s not far from the two dual-GPU cards we’ve listed, the Radeon HD 3870 X2 and the GeForce 9800 GX2. I should point out, however, that the GTX 280 just missed being able to claim a teraflop. Surely Nvidia intended to reach that mark and somehow fell just short. I believe we’ll see a GT200-based Tesla product with slightly higher shader clocks, so it can make that claim.
The GT200’s shader tweaks pay some nice dividends in 3DMark’s synthetic shader tests, as the GeForce GTX 280 grabs the top spot in each. The parallax occlusion mapping test is where the GT200’s larger register file is reputedly a big help, and both GeForce GTX cards top even the 9800 GX2 there, despite the fact that performance in that test scales well on the multi-GPU cards.
Neither multi-GPU solution scales well in the GPU cloth and particles benchmarks, however, and those cards are left to fend for themselves on the strength of a single GPU. Surprisingly, among the single-GPU options, the GT200 is only incrementally faster than the GeForce 8800 GTX and 9800 GTX in both tests.
The Radeons mount more of a challenge to the GeForces in the Perlin noise benchmark, but once again, the GTX 280 captures the top spot, and the hobbled GT200 in the GTX 260 nearly matches a pair of G92s on the 9800 GX2. Both the larger register file and the improved dual-issue on the GT200 are purported to help out in this test, and those claims are looking pretty plausible.
Texturing, ROP hardware, and memory interface
Ah, the basic math that, outside of shaders, determines so much of a GPU’s character. Let’s have a look at the numbers, and then we’ll talk about why they are the way they are.
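| |Peak pixel fill rate (Gpixels/s)|Peak bilinear texel filtering rate (Gtexels/s)|Peak bilinear FP16 texel filtering rate (Gtexels/s)|Peak memory bandwidth (GB/s)|
|---|---|---|---|---|
|GeForce 8800 GTX| | | | |
|GeForce 9800 GTX|10.8|43.2|21.6|70.4|
|GeForce 9800 GX2|19.2|76.8|38.4|128.0|
|GeForce GTX 260|16.1|36.9|18.4|111.9|
|GeForce GTX 280|19.3|48.2|24.1|141.7|
|Radeon HD 2900 XT|11.9|11.9|11.9|105.6|
|Radeon HD 3870|12.4|12.4|12.4|72.0|
|Radeon HD 3870 X2|26.4|26.4|26.4|115.2|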
Each of the GT200’s thread processing clusters has the ability to address and bilinearly filter eight textures per clock, just like in the G92. That’s up from the G80, whose TPCs were limited to addressing four textures per clock and filtering eight. As in both of those chips, the GT200 filters FP16 texture formats at half the usual rate. Because the new GPU has 10 TPCs, its texturing capacity is up, from 64 texels per clock in G92 to 80 texels per clock in GT200. That’s not a huge gain in texture filtering throughput, but Nvidia expects more efficient scheduling to bring GT200 closer to its theoretical peak than G92.
Meanwhile, the GT200’s ROP partitions runneth over. It has eight of ’em, 50% more than the G80 and twice the number in the G92. Each of its ROP partitions can output four pixels per clock, which means the GT200 can draw pixels at a rate of 32 per clock cycle. As a result, the single-GPU GeForce GTX 280’s hypothetical peak pixel-pushing power surpasses even the GeForce 9800 GX2’s. Beyond the increase in number, the ROP hardware is largely unchanged, although it can now perform frame-buffer blends in one clock cycle instead of two, so the GT200’s blend rate is 32 samples per clock, versus 12 per clock on the G80.
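The theoretical fill rates follow from the per-clock figures and the core clock. Here is a minimal sketch using the unit counts and core clocks quoted for the two GeForce GTX cards; the function and key names are made up for illustration.

```python
def fill_rates(rop_partitions, tpcs, core_clock_mhz):
    """Theoretical peak pixel and texel rates, in billions per second."""
    pixels_per_clock = rop_partitions * 4  # four pixels per ROP partition
    texels_per_clock = tpcs * 8            # eight bilinear texels per TPC
    ghz = core_clock_mhz / 1000.0
    return {
        "gpixels_per_s": pixels_per_clock * ghz,
        "gtexels_per_s": texels_per_clock * ghz,
        "fp16_gtexels_per_s": texels_per_clock * ghz / 2,  # FP16 at half rate
    }


for name, args in [("GTX 280", (8, 10, 602)), ("GTX 260", (7, 8, 576))]:
    rates = {key: round(value, 1) for key, value in fill_rates(*args).items()}
    print(name, rates)
```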
To me, the GT200’s healthy complement of ROP partitions is the most welcome development of all because, especially on Nvidia’s GPUs, the ROP hardware plays a big role in antialiasing performance. Lots of ROP capacity means better frame rates with higher levels of antialiasing, which is always a good thing.
Another thing the wealth of ROP partitions provides is an ample path to memory, 512 bits in all. That kind of external bandwidth means the GT200 has to have lots of traces running from the GPU to memory and lots of space on the chip dedicated to I/O pads, and some folks have questioned the wisdom of such things. After all, the last example we have of a GPU with a 512-bit interface is the Radeon HD 2900 XT, and it turned out to be awfully large for the performance it delivered. Nvidia insists the primary limiter of the GT200’s size is its shader cores and says the I/O pads are roughly balanced to this. Although the GT200 sticks with tried-and-true GDDR3 memory, it’s capable of supporting GDDR4 memory types, as well, not that it may ever be necessary. The GTX 280’s whopping 142 GB/s of bandwidth outdoes anything we’ve seen to date, even the dual-GPU cards.
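For reference, peak memory bandwidth is bus width times effective data rate. The sketch below plugs in the bus widths and GDDR3 clocks quoted for the two GeForce GTX cards, with the data rate taken as twice the memory clock; the function name is hypothetical.

```python
def memory_bandwidth_gb_s(bus_width_bits, memory_clock_mhz):
    """Theoretical peak memory bandwidth in GB/s for double-data-rate memory."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_us = memory_clock_mhz * 2  # effective MT/s
    return bytes_per_transfer * transfers_per_us / 1000.0


print(f"GTX 280: {memory_bandwidth_gb_s(512, 1107):.1f} GB/s")
print(f"GTX 260: {memory_bandwidth_gb_s(448, 999):.1f} GB/s")
```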
Speaking of bandwidth, we’ve found that synthetic tests of pixel fill rate tend to be limited more by memory bandwidth than anything else. That seems to be the case here, since none of the cards reach anything close to a theoretical peak and the top four finish in order of memory bandwidth.
The texturing results prove to be more interesting, in part because the numbers and units don’t correspond to these GPUs’ abilities at all. They’re typically a little more than ten times the theoretical peak. I’ve looked at FutureMark’s whitepaper and even inquired directly with them about what’s going on here, but I haven’t yet received an answer. The results do appear to make sense for what this is: a relative comparison of FP16 texel fill rate.
RightMark’s fill rate test uses integer texture formats, so it’s a little different. Here, the GTX 280’s texel throughput essentially doubles that of the GeForce 8800 GTX. The GT200’s more efficient scheduling does seem to be helping a little bit, as well; the GTX 260 matches the GeForce 9800 GTX, despite having a slightly lower theoretical peak.
Texture filtering quality and performance
The GT200 carries over the same texture filtering algorithms used in the G80 and friends, so there isn’t much to say there. I suggest reading the texture filtering section of my G80 review for more discussion of this subject.
We should, however, pause to consider performance briefly, to see how the GT200’s filtering hardware handles different filtering levels compared to other GPUs. We’ve tested the GTX 280 both at its default settings and with the driver control panel’s “High quality” preset, which disables some sampling and trilinear filtering optimizations.
The GT200’s texture filtering performance scales more or less as expected, although we should note that the GTX 260 starts out roughly equivalent to the 9800 GTX and then drops off slightly as the aniso level increases. The GTX 260’s more efficient scheduling seems to give way to its slightly lower filtering capacity.
As with texture filtering, so with antialiasing: the GT200’s AA hardware and capabilities are pretty much unchanged. We have tested performance, though, including the proprietary extensions to regular multisampled antialiasing offered by both Nvidia and AMD. The results below show how increasing sample levels impact frame rates. We tested in Half-Life 2 Episode Two at 1920×1200 resolution with the rest of the game’s image quality options at their highest possible settings.
Ok, so let’s get this out of the way. This is our first look at the GTX 280’s performance in an actual game, and wow. Yeah, so it’s fast.
Once you’re over that, you’ll notice that the GT200’s performance as sample counts rise tends to tail off pretty gradually, until we hit 8X multisampling, where it takes a pretty big hit. Interestingly enough, the Radeon HD 3870-based cards don’t lose much at all when going from 4X to 8X multisampling. The GT200’s saving grace, if it needs one, is Nvidia’s coverage sampled AA, which offers higher quality edge smoothing with very little additional overhead. CSAA 16X, in particular, is very nice. Nvidia’s latest GPUs offer this higher quality mode essentially for “free.”
But what, you ask, about AMD’s custom filter AA? Never fear. I have tested it, too, but it’s really tough to present the results in a graph. Instead, I’ve compiled them in a table.
[Table: Radeon HD 3870 X2 – Half-Life 2 Episode Two – AA scaling]
AMD’s custom filters grab samples from adjacent pixels and factor them in (the tent filters use a weighted average) to increase the effective sample count. This method has the effect of causing some amount of blurring throughout the entire screen, but it does tend to work. AMD’s tent filters can be particularly good at clarifying the details of fine geometry, like the tip of a sword or a power line in the distance.
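To make the idea of a tent-weighted resolve concrete, here is a one-dimensional Python sketch. It is emphatically not AMD's implementation: the sample positions, the filter radius, and the function names are all assumptions chosen for illustration. It just shows how pulling in neighboring pixels' samples with distance-based weights raises the effective sample count and softens an edge at the same time.

```python
def tent_weight(distance, radius):
    """Linear falloff: 1.0 at the pixel center, 0.0 at `radius` and beyond."""
    return max(0.0, 1.0 - abs(distance) / radius)


def box_resolve(own_values):
    """Plain multisample resolve: unweighted average of the pixel's own samples."""
    return sum(own_values) / len(own_values)


def tent_resolve(samples, center, radius):
    """Weighted average of every (position, value) sample inside the tent."""
    weighted_sum = 0.0
    weight_total = 0.0
    for position, value in samples:
        w = tent_weight(position - center, radius)
        weighted_sum += w * value
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no samples fall inside the filter radius")
    return weighted_sum / weight_total


# Hypothetical scanline: a hard edge at x = 1.0, dark (0.0) to the left and
# bright (1.0) to the right, with two samples per pixel.
samples = [(0.25, 0.0), (0.75, 0.0), (1.25, 1.0),
           (1.75, 1.0), (2.25, 1.0), (2.75, 1.0)]

# Resolve the pixel spanning [1.0, 2.0), centered at x = 1.5.
box = box_resolve([1.0, 1.0])                          # own two samples only
tent = tent_resolve(samples, center=1.5, radius=1.0)   # also reaches the neighbors

print("box resolve: ", box)
print("tent resolve:", tent)
```

In this toy case the box resolve leaves the pixel fully bright, while the tent filter lets a dark sample from the neighboring pixel bleed in. That is the blurring trade-off described above in miniature.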
Unfortunately, when combined with 4X AA or better, these custom filters exact a pretty serious performance penalty, which is not something we saw with the original R600 back in the day, for what it’s worth. I’ll be curious to see whether this weakness persists with newer R600-derived GPUs with more memory bandwidth and larger frame buffers.
By the way, I have said nice things in the past about the Radeon HD series’ tent filters, but my estimation of CFAA has sunk over time. The blurring effect seems to be more noticeable, and annoying, in some games than in others for whatever reason. Surely one reason is the increase in other sorts of post-processing filters in games generally. Right now, CSAA seems to have all of the advantages: no blurring, little to no performance penalty, and CSAA modes are accessible as an option in many newer games.
And now, we’re off to the races…
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
- Core 2 Extreme QX9650 3.0GHz processor
- Matrix Storage Manager 7.8
- DDR2 SDRAM at 800MHz
- RealTek 220.127.116.1118 drivers
- Radeon HD 2900 XT 512MB PCIe with Catalyst 8.5 drivers
- Asus Radeon HD 3870 512MB PCIe with Catalyst 8.5 drivers
- Radeon HD 3870 X2 1GB PCIe with Catalyst 8.5 drivers
- GeForce 8800 GTX 768MB PCIe with ForceWare 175.16 drivers
- GeForce 9800 GTX 512MB PCIe with ForceWare 175.16 drivers
- GeForce 9800 GX2 1GB PCIe with ForceWare 175.16 drivers
- GeForce GTX 260 896MB PCIe with ForceWare 177.34 drivers
- GeForce GTX 280 1GB PCIe with ForceWare 177.26 drivers
- Caviar SE16 320GB SATA hard drive
- Windows Vista Ultimate x64 Edition with Service Pack 1, DirectX March 2008 update
Thanks to Corsair for providing us with memory for our testing. Their quality, service, and support are easily superior to no-name DIMMs.
Our test systems were powered by PC Power & Cooling Silencer 750W power supply units. The Silencer 750W was a runaway Editor’s Choice winner in our epic 11-way power supply roundup, so it seemed like a fitting choice for our test rigs. Thanks to OCZ for providing these units for our use in testing.
Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- Call of Duty 4: Modern Warfare 1.5
- Crysis 1.2.1
- Half-Life 2 Episode Two
- Enemy Territory: Quake Wars 1.5
- Assassin’s Creed (unpatched)
- Race Driver GRID
- 3DMark Vantage 1.0.1
- FRAPS 2.9.4
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Call of Duty 4: Modern Warfare
We tested Call of Duty 4 by recording a custom demo of a multiplayer gaming session and playing it back using the game’s timedemo capability. Since these are high-end graphics configs we’re testing, we enabled 4X antialiasing and 16X anisotropic filtering and turned up the game’s texture and image quality settings to their limits.
We’ve chosen to test at 1680×1050, 1920×1200, and 2560×1600 (resolutions of roughly 1.8, 2.3, and 4.1 megapixels) to see how performance scales.
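The pixel counts behind those three resolutions are easy to verify. The short Python sketch below computes them, along with each resolution's pixel load relative to the lowest one, which is a rough guide to how much harder the GPU has to work per frame. The helper name is made up for this example.

```python
def megapixels(width, height):
    """Pixels per frame, in millions."""
    return width * height / 1e6


resolutions = [(1680, 1050), (1920, 1200), (2560, 1600)]
base = megapixels(*resolutions[0])

for width, height in resolutions:
    mp = megapixels(width, height)
    # Relative load compares each resolution with the lowest one tested.
    print(f"{width}x{height}: {mp:.2f} MP, {mp / base:.2f}x the pixels of 1680x1050")
```

Note that the jump from 1920×1200 to 2560×1600 is much bigger than the one from 1680×1050 to 1920×1200, so frame rates should be expected to fall hardest at the top resolution.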
As expected, the GeForce GTX 280 outperforms any other single-GPU solution, cranking out over 50 frames per second at 2560×1600 resolution. However, the dual-GPU cards have a lot of fight in them: the Radeon HD 3870 X2 sticks with the GTX 260 at lower resolutions, and the 9800 GX2 simply trounces the GTX 280. The thing is, the picture changes at 2560×1600, where the GTX 260 pulls decisively ahead of the 3870 X2 and the GTX 280 closes the gap with the 9800 GX2.
Half-Life 2: Episode Two
We used a custom-recorded timedemo for this game, as well. We tested Episode Two with the in-game image quality options cranked, with 4X AA and 16X anisotropic filtering. HDR lighting and motion blur were both enabled.
The GeForce GTX cards look relatively stronger here, with the 280 basically matching the 9800 GX2 at 2560×1600. The GTX 260 keeps some distance between itself and the 3870 X2, too. Obviously, with everything but the single-GPU Radeons churning out nearly 60 frames per second at 1920×1200, most of these cards will handle Episode Two just fine on most displays.
Enemy Territory: Quake Wars
We tested this game with 4X antialiasing and 16X anisotropic filtering enabled, along with “high” settings for all of the game’s quality options except “Shader level” which was set to “Ultra.” We left the diffuse, bump, and specular texture quality settings at their default levels, though. Shadow and smooth foliage were enabled, but soft particles were disabled. Again, we used a custom timedemo recorded for use in this review.
At last, the GTX 280 pulls ahead of the 9800 GX2 by a hair at the highest resolution. However, the GTX 260 has some trouble fending off the Radeon HD 3870 X2, which runs neck and neck with it.
By the way, the drop-off for the 9800 GTX at 2560×1600 is in earnest. I tested and re-tested it. I suspect the card may be running out of memory, although if that’s the case, I’m not sure why the GX2 isn’t affected, as well.
Rather than use a timedemo, I tested Crysis by playing the game and using FRAPS to record frame rates. Because this way of doing things can introduce a lot of variation from one run to the next, I tested each card in five 60-second gameplay sessions.
Also, I’ve chosen a new area for testing Crysis. This time, I’m on a hillside in the recovery level having a firefight with six or seven of the bad guys. As before, I’ve tested at two different settings, with the game’s “High” quality presets and with its “Very high” ones, also.
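For the curious, here is roughly how five manual sessions can be boiled down to a pair of summary numbers. This Python sketch assumes each session yields an average and a low frame rate, then reports the mean of the averages and the median of the lows; taking the median keeps one unlucky session from dragging the "low" figure down. The session values below are invented placeholders, not measurements from this review.

```python
from statistics import mean, median


def summarize_sessions(sessions):
    """Collapse per-session (average_fps, low_fps) pairs into two numbers.

    Returns (mean of the session averages, median of the session lows).
    """
    averages = [avg for avg, _ in sessions]
    lows = [low for _, low in sessions]
    return mean(averages), median(lows)


# Hypothetical results from five 60-second sessions: (average FPS, low FPS).
sessions = [(33.0, 18), (35.5, 21), (34.0, 19), (36.5, 24), (36.0, 20)]

avg_fps, median_low = summarize_sessions(sessions)
print(f"Average FPS across sessions: {avg_fps:.1f}")
print(f"Median low FPS: {median_low}")
```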
Sadly, the GTX 280 is no magic bullet for Crysis performance, in case you were looking for one. Still, please note that the median low frame rate for the GTX 280 with the “High” quality settings is 25 FPS. That’s not too bad at all, and for this reason, Crysis feels eminently playable on the GTX 280. Of course, I had to go and pick a hillside with ridiculously long view distances and an insane amount of vegetation and detail for my new testing area, so folks will still say Crysis doesn’t run well. For what it’s worth, FPS averages on the FRAPS readout jump into the 40s if you turn around and face uphill. Not that it matters: avoiding low frame rates is the key to playability, and the GTX 280 does that.
Then again, so does the 9800 GX2.
There has been some controversy surrounding the PC version of Assassin’s Creed, but I couldn’t resist testing it, in part because it’s such a gorgeous, well-produced game. Also, hey, I was curious to see how the performance picture looks for myself. The originally shipped version of this game can take advantage of the Radeon HD 3870 GPU’s DirectX 10.1 capabilities to get a performance boost with antialiasing, and as you may have heard, Ubisoft chose to remove the DX10.1 path in an update to the game. I chose to test the game without this patch, leaving DX10.1 support intact.
I used our standard FRAPS procedure here, five sessions of 60 seconds each, while free-running across the rooftops in Damascus. All of the game’s quality options were maxed out, and I had to edit a config file manually in order to enable 4X AA at this resolution. Eh, it worked.
Wow, the Radeons just look exceptionally strong here. Even the Radeon HD 2900 XT, which lacks DX10.1 support, comes out ahead of the GeForce 8800 GTX, a rare occurrence. With DX10.1, the Radeon HD 3870 isn’t too far behind the GTX 260, amazingly enough. The new GeForces do post solid gains over the older ones, though, and the SLI-on-a-stick 9800 GX2 doesn’t look so hot.
Race Driver GRID
I tested this absolutely gorgeous-looking game with FRAPS, as well, and in order to keep things simple, I decided to capture frame rates over a single, longer session as I raced around the track. This approach has the advantage of letting me report second-by-second frame-rate results.
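Second-by-second results can be derived from a frame-time log by counting how many frames land in each one-second window. The Python sketch below assumes the input is a list of cumulative frame timestamps in milliseconds; that input format and the function name are assumptions for illustration, and the timestamps shown are toy values.

```python
from collections import Counter


def fps_per_second(timestamps_ms):
    """Count the frames that fall within each whole second of a capture.

    `timestamps_ms` holds cumulative times, in milliseconds, at which each
    frame was presented. The final second may be only partially covered.
    """
    if not timestamps_ms:
        return []
    counts = Counter(int(t // 1000) for t in timestamps_ms)
    last_second = int(max(timestamps_ms) // 1000)
    # Seconds with no frames at all still appear, with a count of zero.
    return [counts.get(second, 0) for second in range(last_second + 1)]


# Toy capture: four frames in the first second, two in the next, one after that.
timestamps = [0, 250, 500, 750, 1000, 1500, 2100]
print(fps_per_second(timestamps))
```

Plotting the resulting list against time gives the kind of line graph that makes a frame rate cap, or a brief dip, easy to spot.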
The 9800 GX2 is fastest overall, but it wasn’t without its quirks. I had to copy in a new SLI profile file in order to get GRID to use both GPUs. Once that was installed, the GX2 obviously did very well. On a similar note, the Radeon HD 3870 X2 seems to have lacked a profile for this game, since it wasn’t any faster than a single 3870.
One oddity in the numbers is that the GTX 260 seems to be bumping up against a frame rate cap at 60 FPS most of the time. Only once, during a short period, does it reach above 60. I’m not sure what’s going on here. I tested and re-tested, confirmed that vsync was disabled, and the results didn’t change. My best guess is that the GTX 260 might be interacting with some sort of dynamic level-of-detail mechanism in the game engine. Interestingly, the GTX 280 rarely ranges below the 60 FPS level.
And finally, we have 3DMark Vantage’s overall index. I’m pleased to have games that will challenge the performance of a new graphics card today, so we don’t have to rely on an educated guess about possible future usage models like 3DMark. However, I did collect some scores to see how the GPUs would fare, so here they are. Note that I used the “High” presets for the benchmark rather than “Extreme,” which is what everyone else seems to be using. Somehow, I thought frame rates in the fives were low enough.
The GT200’s enhanced processing engine serves it well in 3DMark Vantage. As I’ve mentioned, Nvidia claims the GT200’s larger register file has tangible benefits with Vantage’s complex shaders.
We measured total system power consumption at the wall socket using an Extech power analyzer model 380803. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench.
The idle measurements were taken at the Windows Vista desktop with the Aero theme enabled. The cards were tested under load running Half-Life 2 Episode Two at 2560×1600 resolution, using the same settings we did for performance testing.
Well, not bad. The GeForce GTX cards pull less power at idle than the 9800 GTX or the Radeon HD 3870 X2. They’re not quite down to Radeon HD 3870 levels, but this is a much larger chip. When running Episode Two, the GT200 cards’ power draw shoots up by quite a bit, but remains well within reasonable limits.
We measured noise levels on our test systems, sitting on an open test bench, using an Extech model 407727 digital sound level meter. The meter was mounted on a tripod approximately 12″ from the test system at a height even with the top of the video card. We used the OSHA-standard weighting and speed for these measurements.
You can think of these noise level measurements much like our system power consumption tests, because the entire systems’ noise levels were measured, including the stock Intel cooler we used to cool the CPU. Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.
I wasn’t able to reliably measure noise levels for most of these systems at idle. Our test systems keep getting quieter with the addition of new power supply units and new motherboards with passive cooling and the like, as do the video cards themselves. Our test rigs at idle are too close to the sensitivity floor for our sound level meter, so I only measured noise levels under load. Even then, I wasn’t able to get a good measurement for the GeForce 8800 GTX; its cooler is just too quiet.
All of Nvidia’s new-look coolers are louder than the incredibly quiet dual-slot cooler on the 8800 GTX. The GTX 260 and 280 both put out a fairly noticeable hissing noise when they’re running games, as our readings suggest. I wouldn’t consider them unacceptable, because they’re nice and quiet at idle. And, as you’ll see, I think there’s a reason the new GPU coolers are louder.
Per your requests, I’ve added GPU temperature readings to our results. I captured these using AMD’s Catalyst Control Center and Nvidia’s nTune Monitor, so we’re basically relying on the cards to report their temperatures properly. In the case of multi-GPU configs, I only got one number out of CCC. I used the highest of the numbers from the Nvidia monitoring app. These temperatures were recorded while running the “rthdribl” demo in a window. Windowed apps only seem to use one GPU, so it’s possible the dual-GPU cards could get hotter with both GPUs in action. Hard to get a temperature reading if you can’t see the monitoring app, though.
Looks to me like the 8800 GTX is so much quieter than newer cards because it’s willing to let GPU temperatures climb much higher. 84°C is pretty warm, so I can’t complain too much about the acoustics of the later cards.
What you make of the GeForce GTX 280 may hinge on where you come down on the multi-GPU question. Clearly, the GTX 280 is far and away the new single-GPU performance champ, and Nvidia has done it again by nearly doubling the resources of the G80. Its performance is strongest, relatively speaking, at high resolutions where current solutions suffer most, surely in part because of its true 1GB memory size. And one can’t help but like the legion of tweaks and incremental enhancements Nvidia has made to an already familiar and successful basic GPU architecture, from better tuning of the shader cores to the precipitous reduction in idle power draw.
All other things being equal, I’d rather have a big single-GPU card like the GTX 280 than a dual-chip special like the Radeon HD 3870 X2 or the GeForce 9800 GX2 any day. Multi-GPU setups are fragile, and in some games, their performance simply doesn’t scale very well. Also, Nvidia’s support for multiple monitors in SLI and GX2 solutions is pretty dreadful.
The trouble is, things are pretty decidedly not equal. More often than not, the GeForce 9800 GX2 is faster than the GTX 280, and the GX2 is currently selling for as little as 470 bucks, American money. Compared to that, the GTX 280’s asking price of $649 seems mighty steep. Even the GTX 260 at $399 feels expensive in light of the alternatives (dual GeForce 8800 GTs in SLI, for instance) unless you’re committed to the single-GPU path.
Another problem with cards like the 9800 GX2 is simply that they’ve shown us that there’s more performance to be had in today’s games than what the GTX 260 and 280 can offer. One can’t escape the impression, seeing the benchmark results, that the GT200’s performance could be higher. Yet many of the changes Nvidia has introduced in this new GPU fall decidedly under the rubric of future-proofing. We’re unlikely to see games push the limits of this shader core for some time to come, for example. I went back and looked, and it turns out that when the GeForce 8800 GTX debuted, it was often slower than two GeForce 7900 GTX cards in SLI. No one cared much at the time because the G80 brought with it a whole boatload of new capabilities. One can’t exactly say the same for the GT200, but then again, things like a double-size register file for more complex shaders or faster stream-out for geometry shaders may end up being fairly consequential in the long run. It’s just terribly difficult to judge these things right now, when cheaper multi-GPU alternatives will run today’s games faster.
And then there’s the fact that AMD has committed itself to the multi-GPU path entirely for the high end. I can’t decide whether that legitimizes the approach or makes Nvidia the winner by default. Probably, it’s a little of both, although I dunno how that works. The folks at AMD are already talking big about the performance of R700, their next-generation dual-GPU video card, though. We’ll have to wait and see how these things play out.
Whatever happens there, Nvidia has opened up new selling points for its GPUs with CUDA and the apparent blossoming of a nascent GPU-compute ecosystem. Perks like PhysX acceleration and speed-of-light Photoshop work may make a fast GPU indispensable one day, and if that happens, the GT200 GPU will be ready to take full advantage.