Nvidia’s GeForce 8800 graphics processor

DURING THE LAST couple of years, whenever we’ve asked anyone from Nvidia about the future of graphics processors and the prospects for a unified architecture that merges vertex and pixel shaders into a single pool of floating-point processors, they’ve smiled and said something quietly to the effect of: Well, yes, that is one approach, and we think it’s a good future direction for GPUs. But it isn’t strictly necessary to have unified shader hardware in order to comply with DirectX 10, and it may not be the best approach for the next generation of graphics processors.

Nvidia’s response to queries like this one was remarkably consistent, and most of us assumed that ATI would be first to market with a unified shader architecture for PC graphics. Heck, ATI had already completed a unified design for the Xbox 360, so its next GPU would be a second-generation effort. Surely ATI would take the technology lead in the first round of DirectX 10-capable graphics chips.

Except for this: Nvidia seems to have fooled almost everybody. Turns out all of that talk about unified architectures not being necessary was just subterfuge. They’ve been working on a unified graphics architecture for four years now, and today, they’re unveiling the thing to the world—and selling it to consumers starting immediately. The graphics processor formerly known as G80 has been christened as the GeForce 8800, and it’s anything but a conventional GPU design. Read on for more detail than you probably need to know about it.

G80: the parallel stream processor
Since we’re all geeks here, we’ll start at the beginning with a look at a large-scale block diagram of the G80 design. Nvidia has reworked or simply thrown out and replaced vast portions of its past GPU designs, so not much of what you see below will be familiar.


Block diagram of the GeForce 8800. Source: NVIDIA.

This is just a Google Earth-style flyover of the thing, but we’ll take a closer look at some of its component parts as we go. The key thing to realize here is that you don’t see many elements of the traditional graphics rendering pipeline etched into silicon. Instead, those little, green blocks marked “SP” and arranged in groups of 16 are what Nvidia calls stream processors. The G80 has eight groups of 16 SPs, for a total of 128 stream processors. These aren’t vertex or pixel shaders, but generalized floating-point processors capable of operating on vertices, pixels, or any manner of data. Most GPUs operate on pixel data in vector fashion, issuing instructions to operate concurrently on the multiple color components of a pixel (such as red, green, blue and alpha), but the G80’s stream processors are scalar—each SP handles one component. SPs can also be retasked to handle vertex data (or other things) dynamically, according to demand. Also unlike a traditional graphics chip, whose clock frequency might be just north of 600MHz or so, these SPs are clocked at a relatively speedy 1.35GHz, giving the GeForce 8800 a tremendous amount of raw floating-point processing power. Most of the rest of the chip is clocked independently at a more conventional 575MHz.

Below the eight “clusters” of stream processors is a crossbar-style switch (the bit with all of the lines and arrows) that connects them to six ROP partitions. Each ROP partition has its own L2 cache and an interface to graphics memory (or frame buffer, hence the “FB” label) that’s 64 bits wide. In total, that gives the G80 a 384-bit path to memory, half again as wide as the 256-bit interface on past high-end graphics chips like the G71 or ATI’s R580. Contrary to what you might think, this 384-bit memory interface doesn’t operate in some sort of elliptical fashion, grabbing data alternately in 256-bit and 128-bit chunks. It’s just a collection of six 64-bit data paths, with no weirdness needed.

Also in the G80, though not pictured above, is a video display engine that Nvidia describes as “new from the ground up.” The display path now features 10 bits per color channel of precision throughout, much like what ATI claims for its Avivo display engine in the Radeon X1000 series.

That’s the 10,000-foot overview of the new GPU. As I said, we’ll dive deeper into its various components, but let’s first stop to appreciate the scale and scope of this thing. You may need to be at 10,000 feet elevation to see the entire surface area of the G80 at once. Nvidia estimates the G80 to be a mind-boggling 681 million transistors. That’s over twice the number of transistors on the G71, roughly 278 million. ATI tends to count these things a little differently, but they peg their R580 GPU at 384 million transistors. So the G80 is a next-gen design in terms of transistor count as well as features, an obvious tribute to Moore’s Law.

The thing is, the G80 isn’t manufactured on a next-generation chip fabrication process. After some bad past experiences (read: GeForce FX), Nvidia prefers not to tackle a new GPU design and a new fab process at the same time. There’s too much risk involved. So they have instead asked TSMC to manufacture the G80 on its familiar 90nm process, with the result being the single largest chip I believe I’ve ever seen. Here’s a look at the GeForce 8800 GTX, stripped of its cooler, below an ATI Radeon X1900 XTX.

Yikes. Here’s a closer shot, just to make sure your eyes don’t deceive you.

It’s under a metal cap (fancy marketing term: “heat spreader”) in the pictures above, but we can surmise that the G80 has the approximate surface area of Rosie O’Donnell. Nvidia isn’t handing out exact die size measurements, but they claim to get about 80 chips gross per wafer. Notice that’s a gross number. Any chip of this size has got to be incredibly expensive to manufacture, because the possibility of defects over such a large die area will be exponentially higher than with a GPU like the G71 or R580. That’s going to make for some very expensive chips. This is what I call Nvidia “giving back to the community” after watching the success of $500 graphics cards line their pockets in recent years. No doubt the scale of this design was predicated on the possibility of production on a 65nm fab process, and I would expect Nvidia to move there as soon as possible.
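To put a rough shape on the die-size argument, here is a minimal sketch using the simple Poisson yield model, in which the fraction of defect-free dies is exp(-area × defect density). The model is a common textbook simplification rather than anything Nvidia or TSMC has described, and the die areas and defect density are hypothetical placeholders, since Nvidia hasn't disclosed the G80's die size. Only the relationship between the two results matters.

```python
import math

def poisson_yield(die_area_cm2: float, defects_per_cm2: float) -> float:
    """Expected fraction of defect-free dies under a simple Poisson model."""
    return math.exp(-die_area_cm2 * defects_per_cm2)

# Everything below is a hypothetical placeholder, not a measured value.
DEFECT_DENSITY = 0.5    # defects per cm^2 (assumed)
SMALLER_DIE_CM2 = 2.0   # assumed area of a smaller GPU die
LARGER_DIE_CM2 = 4.0    # assumed area of a die twice that size

for label, area in (("smaller die", SMALLER_DIE_CM2), ("larger die", LARGER_DIE_CM2)):
    fraction = poisson_yield(area, DEFECT_DENSITY)
    print(f"{label}: {fraction:.1%} of dies defect-free")
```

In this model, doubling the die area squares the yield fraction, so the larger die loses on two fronts: fewer candidates per wafer and a smaller share of them working.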


The GeForce 8800’s discrete display chip

That’s not all for GeForce 8800 silicon, either. You may have noticed the separate chip mounted on the GeForce 8800 GTX board in the pictures above, between the GPU and the display outputs. This is an external display chip that has the TMDS logic for driving LCD displays and the RAMDACs for analog monitors. This puppy can drive an HDTV-out connector and two dual-link DVI outputs with HDCP support.

Nvidia says it chose to use a separate display chip in order to simplify board routing and manufacturing. That makes some sense, I suppose, given the G80’s already ample die area, but the presence of an external display chip raises some intriguing possibilities. For instance, we might see a multi-GPU graphics card with dual G80s (or derivatives) with only a single display chip. Nvidia could also offer G80-based solutions for non-graphics applications without including any display output whatsoever.

Another detail you may have spied in the pictures above is the presence of two “golden fingers” connectors for SLI multi-GPU configurations. As with ATI’s new internal CrossFire connectors, the two links per board will allow staggered connections between more than two graphics cards, raising the possibility of three- and four-way SLI configurations in motherboards with enough PCIe graphics slots.

 

The cards, specs, and prices
The flagship G80-based graphics card is the GeForce 8800 GTX. This card has all of the G80’s features enabled, including 128 stream processors at 1.35GHz, a 575MHz “core” clock, and a 384-bit path to memory. That memory is 768MB of Samsung GDDR3 clocked at 900MHz, or 1800MHz effective.


MSI’s GeForce 8800 GTX


The Radeon X1950 XTX (left) versus the GeForce 8800 GTX (right)

The GTX features a dual-slot cooler that shovels hot air out of the rear of the case, much like the cooler on a Radeon X1950 XTX. However, at 10.5″, the 8800 GTX is over an inch and a half longer than the X1950 XTX—long enough to create fit problems in some enclosures, I fear. In fact, the 8800 GTX is longer than our test system’s motherboard, the ATX-sized Asus P5N32-SLI SE Deluxe.


Twin PCIe power plugs on the GTX

Here’s another detail that testifies to the audacity of the 8800 GTX: dual six-pin PCIe power connectors. Nvidia says this is a matter of necessity in order to fit into the PCI Express spec. The card’s TDP is 185W, so it needs at least that much input power. The PCIe spec allows for 75W delivered through the PCIe x16 slot and 75W through each six-pin auxiliary connector. 150W wouldn’t cut it, so another power plug was required. Incidentally, Nvidia also claims the 8800 GTX requires only about 5W more than the Radeon X1950 XTX, slyly raising the question of whether the top Radeon is technically in compliance with PCIe standards.
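The power-plug arithmetic is simple enough to sketch in a few lines. The wattage figures come straight from the paragraph above, and the function names are just illustrative.

```python
SLOT_WATTS = 75      # PCIe x16 slot allowance quoted above
SIX_PIN_WATTS = 75   # allowance per six-pin auxiliary connector

def power_budget(plugs: int) -> int:
    """Total in-spec input power for a card with the given number of plugs."""
    return SLOT_WATTS + plugs * SIX_PIN_WATTS

def six_pin_plugs_needed(tdp_watts: int) -> int:
    """Smallest number of six-pin plugs whose budget covers the TDP."""
    plugs = 0
    while power_budget(plugs) < tdp_watts:
        plugs += 1
    return plugs

GTX_TDP = 185  # GeForce 8800 GTX TDP, per Nvidia
print(f"one plug:  {power_budget(1)}W budget")
print(f"two plugs: {power_budget(2)}W budget")
print(f"plugs needed for {GTX_TDP}W: {six_pin_plugs_needed(GTX_TDP)}")
```

One plug tops out at 150W, short of the 185W TDP, so the second connector is what brings the card inside the spec.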

You may be breathing a sigh of relief at hearing of a 5W delta between the 8800 GTX and the Radeon X1950 XTX. Go ahead and exhale, because Nvidia’s claims seem to be on target. We’ll test power consumption and noise later in our review, but the 8800 GTX isn’t beyond the bounds of reason, and the cooler is no Dustbuster like the one on the GeForce FX 5800 Ultra.


BFG Tech’s GeForce 8800 GTS

For the less extreme among us, here is BFG Tech’s version of the GeForce 8800 GTS. Comfortingly, this card has only one auxiliary power plug and is the exact same length as a Radeon X1950 XTX or a GeForce 7900 GTX. The GTS features a 500MHz core clock, 1.2GHz stream processors, 640MB of GDDR3 memory at 800MHz, and the same set of output ports and capabilities as the GTX. The GTS uses a cut-down version of the G80 with some units disabled, so it has “only” 96 SPs, five ROP partitions, and a 320-bit memory path.

Both variants of the GeForce 8800 are due to begin selling at online retailers today. Nvidia says to expect the GTX to sell for $599 and the GTS for $449. We talked with a couple of Nvidia board partners, though, and got a slightly different story. XFX does plan to go with those prices, but expects to see a premium of $30 or so at launch. BFG Tech has set the MSRP for its 8800 GTX at $649 and for its GTS at $499. Given what we’ve seen out of the GeForce 8800, I would expect to see cards selling for these higher prices, at least initially.

Supplies of the GeForce 8800 GTX may be a little bit iffy at first, thanks to a manufacturing problem with some of those cards. Nvidia insists there’s been no recall since the products weren’t yet in the hands of consumers, but they had to pull back some of the boards for repair at the last minute due to an incorrect resistor value that could cause screen corruption in 3D applications. Nvidia and its partners intend to make sure none of these cards make their way into the hands of consumers, and they’re saying GTX cards that work properly will still be available for sale immediately. GeForce 8800 GTS boards are not affected by this problem.

Interestingly, by the way, neither BFG nor XFX is offering “overclocked in the box” versions of the GeForce 8800 GTS and GTX. Their cards run at standard speeds. MSI’s 8800 GTX doesn’t come overclocked, but it does include “Dynamic Overclocking Technology” software. We haven’t had the time to try it out yet, but for once, MSI looks to be the rebel among the group.

 

Shader processing
Let’s pull up that diagram of the G80 once more, so we have some context for talking about shader processing and performance.


Block diagram of the GeForce 8800. Source: NVIDIA.

A single SP cluster. Source: NVIDIA.

The G80’s unified architecture substitutes massive amounts of more generalized parallel floating-point processing power for the vertex and pixel shaders of past GPUs. Again we can see the eight clusters of 16 SPs, with each cluster of SPs arranged in two groups of eight. To the left, you can see a slightly more detailed diagram of a single SP cluster. Each cluster has its own dedicated texture address and filtering units (the blue blocks) and its own pool of L1 cache. Behind the L1 cache is a connection to the crossbar that goes to the ROP units, with their L2 caches and connections to main memory.

Getting an exact handle on the amount of shader power available here isn’t a wholly simple task, although you’ll see lots of numbers thrown around as authoritative. We can get a rough sense of where the G80 stands versus the R580+ GPU in the Radeon X1950 XTX by doing some basic math. The R580+ has 48 pixel shader ALUs that can operate on four pixel components each, and it runs at 650MHz. That means the R580+ can operate on about 125 billion components per second, at optimal peak performance. With its 128 SPs at 1.35GHz, the G80 can operate on about 173 billion components per second. Of course, that’s a totally bogus comparison, and I should just stop typing now. Actual performance will depend on the instruction mix, the efficiency of the architecture, and the ability of the architecture to handle different instruction types. (The G80’s scalar SPs can dual-issue a MAD and a MUL, for what it’s worth.)
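For anyone who wants to check that (admittedly bogus) math, here is the calculation as a short sketch. The unit counts and clocks are the ones quoted above, and the function name is my own.

```python
def peak_components_per_second(units: int, components_per_unit: int, clock_hz: float) -> float:
    """Peak scalar components processed per second, ignoring instruction mix."""
    return units * components_per_unit * clock_hz

# R580+: 48 pixel shader ALUs, four components each, at 650MHz.
r580_plus = peak_components_per_second(48, 4, 650e6)
# G80: 128 scalar stream processors, one component each, at 1.35GHz.
g80 = peak_components_per_second(128, 1, 1.35e9)

print(f"R580+: {r580_plus / 1e9:.1f} billion components/s")
print(f"G80:   {g80 / 1e9:.1f} billion components/s")
print(f"ratio: {g80 / r580_plus:.2f}x")
```

This counts components touched per clock and nothing else, so it says nothing about dual-issue, instruction mix, or how often either chip actually reaches its peak.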

The G80 uses a threading model, with an internal thread processor, to track all data being processed. Nvidia says the G80 can have “thousands of threads” in flight at any given time, and it switches between them regularly in order to keep all of the SPs as fully occupied as possible. Certain operations like texture fetch or filtering can take quite a while, relatively speaking, so the SPs will switch away to another task while such an operation completes.

Threading also facilitates the use of a common shader unit for vertex, pixel, and geometry shader processing. Threading is the primary means of load balancing between these different data types. For DirectX 9 applications, that means vertex and pixel threads only, but the G80 can do robust load balancing between these thread types even though the DX9 API doesn’t have a unified shader instruction language. Load balancing is handled automatically, so it’s transparent to applications.

ATI created its first unified shader architecture in the Xenos chip for the Xbox 360, and all parties involved (including Microsoft, ATI, and Nvidia) seem to agree that unified shaders are the way to go. By their nature, graphics workloads tend to vary between being more pixel-intensive and more vertex-intensive, from scene to scene or even as one frame is being drawn. The ability to retask computational resources dynamically allows the GPU to use the bulk of its resources to attack the present bottleneck. This arrangement ensures that large portions of the chip don’t sit unused while others face more work than they can handle.
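A toy model makes the load-balancing argument concrete. This is only a sketch under generous assumptions: the unit counts and workloads are made up, work divides perfectly across units, and vertex and pixel stages overlap completely. Real GPUs are messier on all three counts.

```python
def frame_time_fixed(vertex_work, pixel_work, vertex_units, pixel_units):
    """Dedicated units: whichever stage finishes last sets the frame time."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified pool: every unit can take either kind of work."""
    return (vertex_work + pixel_work) / total_units

# Hypothetical chip: 32 units, split 8 vertex + 24 pixel in the fixed design.
scenes = {
    "pixel-heavy": (100, 900),   # (vertex work, pixel work), arbitrary units
    "vertex-heavy": (600, 400),
}
for name, (v, p) in scenes.items():
    fixed = frame_time_fixed(v, p, vertex_units=8, pixel_units=24)
    unified = frame_time_unified(v, p, total_units=32)
    print(f"{name}: fixed {fixed:.2f}, unified {unified:.2f}")
```

The fixed split only matches the unified pool when the workload happens to fit its 1:3 ratio. Whenever the mix swings the other way, one group of units sits idle while the other becomes the bottleneck.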

To illustrate the merits of a unified architecture, Nvidia showed us a demo using the game Company of Heroes and a tool called NVPerfHUD that plots the percentage of pixel and vertex processing power used over time. Here’s a slide that captures the essence of what we saw.


Source: Nvidia.

The proportion of GPU time dedicated to vertex and pixel processing tended to swing fluidly in a pretty broad range. Pixel processing was almost always more prominent than vertex processing, but vertex time would spike occasionally when there was lots of geometric complexity on the screen. That demo alone makes a pretty convincing argument for the merits of unified shaders—and for the G80’s implementation of them.

Threading also governs the GPU’s ability to process advanced shader capabilities like dynamic branching. On a parallel chip like this one, branches can create problems because the GPU may have to walk a large block of pixels through both sides of a branch in order to get the right results. ATI made lots of noise about the 16-pixel branching granularity in R520 when it was introduced, only to widen the design to 48 pixel shaders (and thus to 48-pixel granularity) with the R580. For G80, Nvidia equates one pixel to one thread, and says the GPU’s branching granularity is 32 pixels—basically the width of the chip, since pixels have four scalar components each. In the world of GPUs, this constitutes reasonably fine branching granularity.
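Here is a rough sketch of why branching granularity matters. It treats pixels as a one-dimensional row processed in lockstep blocks, which is a simplification of my own (real hardware works on 2D groups of pixels), and it assumes both sides of the branch cost the same. Any block whose pixels disagree about the branch has to run both sides.

```python
def branch_cost(conditions, granularity, cost_true=1, cost_false=1):
    """Total work when pixels are processed in lockstep blocks.

    A block pays for a side of the branch if any pixel in it takes that side,
    so a mixed block pays for both sides across all of its pixels.
    """
    total = 0
    for start in range(0, len(conditions), granularity):
        block = conditions[start:start + granularity]
        per_pixel = 0
        if any(block):
            per_pixel += cost_true
        if not all(block):
            per_pixel += cost_false
        total += per_pixel * len(block)
    return total

# Hypothetical row of 96 pixels: the first 40 take the branch, the rest don't.
row = [True] * 40 + [False] * 56

for granularity in (1, 16, 32, 48):
    print(f"granularity {granularity:2d}: cost {branch_cost(row, granularity)}")
```

With per-pixel branching the cost is one unit per pixel. The coarser the granularity, the more pixels get dragged through a side of the branch they didn't need.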

One more, somewhat unrelated, note on the G80’s stream processors. Nvidia’s pixel shaders have supported 32-bit floating-point datatypes for some time now, but the variance of data formats available on graphics processors has been an issue for just as long. The DirectX 10 specification attempts to tidy these things up a bit, and Nvidia believes the G80 can reasonably claim to be IEEE 754-compliant—perhaps not in every last picky detail of the spec, but generally so. This fact should make the G80 better suited for general computational tasks.

 

Shader performance
We can get a sense of the effectiveness of the G80’s unified shader architecture using a battery of vertex and pixel processing tests. These first two are simple vertex processing tests from 3DMark.

The G80 doesn’t look like anything special in the simple vertex test, but when we add more complexity, it begins to look more competent.

Next up is a simple particle physics simulation that runs entirely on the GPU. This test uses vertex texture fetch, one of the few Shader Model 3.0 capabilities that the R580 lacks. As a result, it doesn’t run on the Radeon X1950 XTX.

Here we begin to get a sense of this unified architecture’s potential. The G80 is many times faster than the GeForce 7900 GTX in this task, as we might expect from a GPU designed for broadly parallel vertex processing and quick feedback of SP results into new threads.

Now, let’s look at pixel shader performance. Up first is 3DMark’s lone pixel shader test.

The G71 and R580+ are very evenly matched here, but the G80 is in a class by itself.

ShaderMark gives us a broader range of pixel shaders to test. Below are the results from the individual shaders, followed by an average of those results.

The G80’s pixel shading prowess is remarkable. Overall, its ShaderMark performance is just shy of twice that of the R580+ and G71. These are simple DirectX 9 shaders that don’t gain from any of DirectX 10’s advances in shader programming, either. The only place where the G80 falls behind the two previous-gen chips is in the three HDR shaders. For whatever reason, it doesn’t do as well in those tests.

We can also attempt to quantify the image quality of the G80’s pixel shader output by using another feature of ShaderMark. This isn’t exactly a quantitative measure of something as subjective as image quality, but it does measure how closely the GPU’s output matches that of Microsoft’s DirectX reference rasterizer, a software renderer that acts as a standard for DirectX graphics chips.

The G80’s image output more closely matches that of the Microsoft reference rasterizer than the images from the other GPUs. This isn’t by itself an earth-shattering result, but it is a good indication.

 

Texturing
We’ve talked quite a bit about the top portion of that SP cluster, but not much about the lower part. Attached to each group of 16 SPs is a texture address and filtering unit. Each one of these units can handle four texture address operations (basically grabbing a texture to apply to a fragment), for a total of 32 texture address units across the chip. These units run at the G80’s core clock speed of 575MHz, not the 1.35GHz of the SPs. The ability to apply 32 textures per clock is formidable, even if shader power is becoming relatively more important. Here’s how the math breaks down versus previous top-end graphics cards:

| Card | Core clock (MHz) | Pixels/clock | Peak fill rate (Mpixels/s) | Textures/clock | Peak fill rate (Mtexels/s) | Effective memory clock (MHz) | Memory bus width (bits) | Peak memory bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|
| GeForce 7900 GTX | 650 | 16 | 10400 | 24 | 15600 | 1600 | 256 | 51.2 |
| Radeon X1950 XTX | 650 | 16 | 10400 | 16 | 10400 | 2000 | 256 | 64.0 |
| GeForce 8800 GTS | 500 | 20 | 10000 | 24 | 12000 | 1600 | 320 | 64.0 |
| GeForce 8800 GTX | 575 | 24 | 13800 | 32 | 18400 | 1800 | 384 | 86.4 |

So in theory, the G80’s texturing capabilities are quite strong; its 18.4 Gtexel/s theoretical peak isn’t vastly higher than the GeForce 7900 GTX’s, but its memory bandwidth advantage over the G71 is pronounced. As for pixel fill rates, both ATI and Nvidia seem to have decided that about 10 Gpixels/s is sufficient for the time being.
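The peak figures in the table all fall out of the clock speeds, per-clock rates, and bus widths. Here is a small sketch that reproduces them; the class and method names are my own, and the inputs are the table's.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Card:
    name: str
    core_mhz: int
    pixels_per_clock: int
    textures_per_clock: int
    effective_mem_mhz: int
    bus_bits: int

    def pixel_fill_mpixels(self) -> int:
        return self.core_mhz * self.pixels_per_clock

    def texel_fill_mtexels(self) -> int:
        return self.core_mhz * self.textures_per_clock

    def bandwidth_gb_s(self) -> float:
        # MHz * bits / 8 gives MB/s; divide by 1000 for GB/s.
        return self.effective_mem_mhz * self.bus_bits / 8 / 1000

CARDS = [
    Card("GeForce 7900 GTX", 650, 16, 24, 1600, 256),
    Card("Radeon X1950 XTX", 650, 16, 16, 2000, 256),
    Card("GeForce 8800 GTS", 500, 20, 24, 1600, 320),
    Card("GeForce 8800 GTX", 575, 24, 32, 1800, 384),
]

for card in CARDS:
    print(f"{card.name}: {card.pixel_fill_mpixels()} Mpixels/s, "
          f"{card.texel_fill_mtexels()} Mtexels/s, "
          f"{card.bandwidth_gb_s():.1f} GB/s")
```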


The G80’s texture address and filtering units.
Source: NVIDIA.

The G80 appears capable of delivering on its theoretical promise in practice, and then some. I’ve included the 3DMark fill rate tests mainly because they show us how the G71’s pixel fill rate scales up with display resolution (freaky!) and how close these GPUs can get to their theoretical peak texturing capabilities (answer: very close). However, I prefer RightMark’s test overall, and it shows the G80 achieving just under twice the texturing capacity of the R580+.

The G80’s texturing abilities are also superior to the G71’s in a way that our results above don’t show. The G71 uses one of the ALUs in each pixel shader processor to serve as a texture address unit. This sharing arrangement is sometimes very efficient, but it can cause slowdowns in texturing and shader operations, especially when the two are tightly interleaved. The G80’s texture address units are decoupled from the stream processors and operate independently, so that texturing can happen freely alongside shader processing—just like, dare I say it, ATI’s R580+.

More impressive than the G80’s texture addressing capability, though, is its capacity for texture filtering. You’ll see eight filtering units and four address units in the diagram to the left, if you can work out what “TA” and “TF” mean. The G80 has twice the texture filtering capacity per address unit of the G71, so it can do either 2X anisotropic filtering or bilinear filtering of FP16-format textures at full speed, or 32 pixels per clock. (Aniso 2X and FP16 filtering combined happen at 16 pixels per clock.) These units can also filter textures in FP32 format for extremely high precision results. All of this means, of course, that the G80 should be able to produce very nice image quality without compromising performance.
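One way to reason about those rates is to treat texturing as limited by whichever runs out first, address units or filtering units. The sketch below is my own reading of the numbers above, not a description of Nvidia's scheduling, and the per-texel filter costs (1, 2, and 4 filter operations) are inferred from the quoted rates rather than taken from any documentation.

```python
CLUSTERS = 8
ADDRESS_UNITS = 4 * CLUSTERS   # four "TA" units per cluster
FILTER_UNITS = 8 * CLUSTERS    # eight "TF" units per cluster
CORE_MHZ = 575

def texels_per_clock(filter_ops_per_texel: int) -> int:
    """Peak rate, bounded by address units or by available filter operations."""
    return min(ADDRESS_UNITS, FILTER_UNITS // filter_ops_per_texel)

# Assumed costs in filter operations per texel (inferred, see above).
modes = {
    "bilinear, 8-bit": 1,
    "2X aniso or FP16 bilinear": 2,
    "2X aniso with FP16": 4,
}
for name, cost in modes.items():
    rate = texels_per_clock(cost)
    print(f"{name}: {rate} per clock, {rate * CORE_MHZ} Mtexels/s")
```

Seen this way, the extra filtering units are what let the second mode run at the same 32 per clock as plain bilinear.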

 

Texture filtering quality and performance
Fortunately, Nvidia has chosen to take advantage of the G80’s additional texture filtering capacity to deliver better default image quality to PC graphics. This development will no doubt be welcome news to those who have been subjected to the texture crawling, moire, and sparkle produced by the default anisotropic filtering settings of the GeForce 7 series.

In order to show you how this works, I’m going to have you look at a trio of psychedelic drawings.

Default quality
GeForce 7900 GTX Radeon X1950 XTX GeForce 8800 GTX

For the uninitiated, these are resized screenshots from 3DMark’s texture filtering test app, and what you’re doing here is essentially staring down a 3D rendered tube. The red, green, and blue patterns you see are mip maps, and they’re colored in order to make them easier to see. On all three GPUs, 16X anisotropic filtering is enabled.

You’ll want to pay attention to two things in these screenshots. The first is the series of transitions from red to blue, blue to green, and green to red. The smoother the transition between the different mip levels, the better the GPU is handling a key aspect of its filtering job: trilinear filtering. The Radeon appears to be doing a superior job of trilinear filtering in the picture above, but I wouldn’t get too excited about that. ATI uses an adaptive trilinear algorithm that does more or less aggressive blending depending on the situation. In our case with colored mip levels, the adaptive algorithm goes all out, filtering more smoothly than it might in a less contrived case.
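For those who haven't met trilinear filtering before, here is a minimal sketch of the part that matters for these screenshots: blending between adjacent mip levels. It leaves out the bilinear sampling that happens within each level, and it stands in one flat color per mip level, as the test app does. The level order and the function names are assumptions for illustration.

```python
import math

# One flat color per mip level, like the colored mip maps in the test app.
MIP_COLORS = [(255, 0, 0), (0, 0, 255), (0, 255, 0)]  # red, blue, green (order assumed)
LAST_LEVEL = len(MIP_COLORS) - 1

def nearest_mip(lod: float):
    """No blending between levels: snap to the closest one (hard color bands)."""
    level = min(int(max(lod, 0.0) + 0.5), LAST_LEVEL)
    return MIP_COLORS[level]

def trilinear(lod: float):
    """Blend the two mip levels that bracket the level of detail."""
    lod = max(lod, 0.0)
    lower = min(math.floor(lod), LAST_LEVEL)
    upper = min(lower + 1, LAST_LEVEL)
    t = lod - lower
    return tuple(a + (b - a) * t for a, b in zip(MIP_COLORS[lower], MIP_COLORS[upper]))

for lod in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"lod {lod:.2f}: nearest {nearest_mip(lod)}, trilinear {trilinear(lod)}")
```

Full trilinear blends across the whole span between two levels. The optimized variants narrow that blend to a band around the transition, which is why the color bands look harder-edged on some GPUs.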

Still and all, the G71’s trilinear looks kinda shabby.

The other thing to notice is the shape of the pattern created by the colored mip maps. Both the G71 and R580+ go for the “wild orchid” look. This is because they both practice a performance optimization in which the amount of anisotropic filtering applied to a surface depends on its angle of inclination from the camera. The tips of the flowers are the angles at which the weakest filtering is applied. At some point, somebody at ATI thought this was a good idea, and at another point, Nvidia agreed. Happily, that era has passed with the introduction of the G80, whose default aniso algorithm produces a nice, tight circle that’s almost perfectly round.

To illustrate how this filtering mumbo jumbo works out in a game, I’ve taken some screenshots from Half-Life 2, staring up at a building from the street. The building’s flat surface looks nice and sharp on all three cards thanks to 16X aniso, but when you turn at an angle from the building on the G71 and R580, things turn into a blurry mess. Not so on the G80.

Default quality
GeForce 7900 GTX

Radeon X1950 XTX

GeForce 8800 GTX

To be fair to ATI, the R580+ also has a pretty decent aniso filtering algorithm in its “high quality” mode. The quality level drops off somewhat at 45° angles from the camera, but not terribly so. The G71, meanwhile, keeps churning out the eight-pointed flower no matter what setting we use.

All three GPUs, including the G80, produce nicer mip map transitions in their high quality modes.

High quality
GeForce 7900 GTX Radeon X1950 XTX GeForce 8800 GTX

High quality
GeForce 7900 GTX

Radeon X1950 XTX

GeForce 8800 GTX

So the Radeon X1950 XTX produces decent aniso filtering in its high quality mode, but the G80 does so all of the time. And the G71 is impervious to our attempts to help it succeed.

Now that we’ve seen how it looks, we can put these performance numbers in context. I’ve tested with both default and high-quality settings, so the performance difference between the two is apparent. Unfortunately, D3D RightMark won’t test FP16 or FP32 texture filtering, so we’ll have to settle for just the standard 8-bit integer texture format until we get to our gaming tests.

The G80 cranks out twice the fill rate of the R580+ at 16X aniso, despite having to work harder to produce higher quality images. Very impressive.

 

ROPs and antialiasing


A ROP pipeline logical diagram
Source: NVIDIA.

Next up on our G80 hit list are the ROPs, or (I believe) raster operators, which convert the shaded and textured fragments that come out of the SP clusters into pixels. As we’ve discussed, the G80 has six ROP partitions onboard, and those are connected to the SPs via a fragment crossbar. Each ROP partition has an associated L2 cache and 64-bit path to memory, and each one is responsible for drawing a portion of the pixels on the screen.

Like everything else on the G80, these ROPs are substantially improved from the G71. Each ROP partition can process up to 16 color samples and 16 Z samples per clock, or alternately 32 Z samples per clock (which is useful for shadowing algorithms and the like). That adds up to a total peak capacity of 96 color + Z samples per clock or 192 Z samples per clock. An individual ROP partition can only write four pixels per clock to memory, but the overload of sample capacity should help with antialiasing. Nvidia has also endowed the G80’s ROPs with improved color and Z compression routines that they claim are twice as effective as the G71’s.
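Those totals are just the per-partition figures multiplied out, as this short sketch shows. The GTS line uses the five partitions and 500MHz core clock from its spec; the function name is illustrative.

```python
COLOR_Z_SAMPLES_PER_PARTITION = 16  # color + Z samples per clock
Z_ONLY_SAMPLES_PER_PARTITION = 32   # Z-only samples per clock
PIXELS_WRITTEN_PER_PARTITION = 4    # pixels written to memory per clock

def rop_totals(partitions: int, core_mhz: int) -> dict:
    """Chip-wide ROP capacity for a given number of ROP partitions."""
    pixels_per_clock = partitions * PIXELS_WRITTEN_PER_PARTITION
    return {
        "color_z_samples_per_clock": partitions * COLOR_Z_SAMPLES_PER_PARTITION,
        "z_only_samples_per_clock": partitions * Z_ONLY_SAMPLES_PER_PARTITION,
        "pixels_per_clock": pixels_per_clock,
        "peak_fill_mpixels": pixels_per_clock * core_mhz,
    }

print("8800 GTX:", rop_totals(partitions=6, core_mhz=575))
print("8800 GTS:", rop_totals(partitions=5, core_mhz=500))
```

The pixels-per-clock results line up with the fill rates in the texturing table: 24 pixels per clock for the GTX and 20 for the GTS.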

Most notably, perhaps, the G80’s ROPs can handle blending of high-precision data formats in conjunction with multisampled antialiasing—gone is the G71’s inability to do AA along with HDR lighting. The G80 can handle both FP16 and FP32 formats with multisampled AA.

Speaking of multisampled antialiasing, you are probably familiar by now with this AA method, because it’s standard on both ATI and Nvidia GPUs. If you aren’t, let me once again recommend reading this article for an overview. Multisampled AA is too complex a subject for me to boil down into a nutshell easily, but its essence is like that of many antialiasing methods; it captures multiple samples from inside the space of a single pixel and blends them together in order to determine the final color of the pixel. Compared to brute-force methods like supersampling, though, multisampling skips several steps along the way, performing only one texture read and shader calculation per pixel, usually sampled from the pixel center. The GPU then reads and stores a larger number of color and Z samples for each pixel, along with information about which polygon covers each sample point. MSAA uses the Z, color, and coverage information to determine how to blend to the pixel’s final color.

Clear as mud? I’ve skipped some steps, but the end result is that multisampled AA works relatively well and efficiently. MSAA typically only modifies pixels on the edge of polygons, and it makes those edges look good.
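If the mud needs clearing, here is a heavily simplified sketch of a single multisampled pixel. Each polygon contributes one shaded color per pixel, that color is written only to the samples the polygon covers and wins the depth test on, and the final pixel is the average of the stored samples. The class and its plain box-filter resolve are illustrative assumptions, not how any particular GPU implements it.

```python
class MsaaPixel:
    """One pixel with several stored color/Z samples."""

    def __init__(self, num_samples, background=(0.0, 0.0, 0.0)):
        self.colors = [background] * num_samples
        self.depths = [float("inf")] * num_samples

    def write_fragment(self, color, depth, coverage):
        """color is shaded once for the pixel; coverage flags the covered samples."""
        for i, covered in enumerate(coverage):
            if covered and depth < self.depths[i]:
                self.colors[i] = color
                self.depths[i] = depth

    def resolve(self):
        """Blend the stored samples down to the final pixel color."""
        n = len(self.colors)
        return tuple(sum(c[ch] for c in self.colors) / n for ch in range(3))

pixel = MsaaPixel(num_samples=4)
# A white polygon's edge crosses the pixel and covers three of four samples.
pixel.write_fragment((1.0, 1.0, 1.0), depth=0.5, coverage=[True, True, True, False])
print(pixel.resolve())
# A red polygon behind it covers the whole pixel but only wins the last sample.
pixel.write_fragment((1.0, 0.0, 0.0), depth=0.8, coverage=[True, True, True, True])
print(pixel.resolve())
```

Interior pixels have every sample covered by the same polygon, so the average changes nothing there. Only edge pixels end up blended.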

For the G80, Nvidia has cooked up an extension of sorts to multisampling that it calls coverage sampling AA, or CSAA for short. Present Nvidia and former SGI graphics architect John Montrym introduced this mode at the G80 press event. Montrym explained that multisampling has a problem with higher sample rates. Beyond four samples, he asserted, “the storage cost increases faster than the image quality improves.” This problem is exacerbated with HDR formats, where storage costs are higher. Yet “for the vast majority of edge pixels,” Montrym said, “two colors are enough.” The key to better AA, he argued, is “more detailed coverage information,” or information about how much each polygon covers the area inside of a pixel.

CSAA achieves that goal without increasing the AA memory footprint too drastically by calculating additional coverage samples but discarding the redundant color and Z information that comes along with them. Montrym claimed CSAA could offer the performance of 4X AA with roughly 16X quality. This method works well generally and has some nice advantages, he said, but has to fall back to the quality of the stored color/Z sample count in a couple of tough cases, such as on shadow edges generated by stencil shadow volumes.

Nvidia has added a number of new AA modes to the G80, three of which use a form of CSAA. Here’s a look at the information stored in each mode:

The 8X, 16X, and 16xQ modes are CSAA variants, with a smaller number of stored color/Z samples than traditional multisampling modes. 16X is the purest CSAA mode, with four times the coverage samples compared to color/Z. Note, also, that Nvidia has added a pure 8X multisampled mode to the G80, dubbed 8xQ.
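To see why CSAA is attractive, consider how much color and Z data each mode has to store per pixel. The sketch below is a rough estimate under stated assumptions. The sample counts for 16X, 16xQ, and 8xQ follow from the description above; the four stored samples for 8X are my inference from its being a CSAA mode; the byte sizes are assumed; and the comparatively small cost of the coverage-only samples is ignored.

```python
# mode: (stored color/Z samples, coverage samples)
AA_MODES = {
    "4X":   (4, 4),
    "8X":   (4, 8),    # stored sample count inferred, see above
    "8xQ":  (8, 8),
    "16X":  (4, 16),
    "16xQ": (8, 16),
}

def color_z_bytes_per_pixel(mode: str, color_bytes: int = 4, z_bytes: int = 4) -> int:
    """Color + Z storage per pixel; coverage-only samples are not counted."""
    stored_samples, _coverage_samples = AA_MODES[mode]
    return stored_samples * (color_bytes + z_bytes)

for mode in AA_MODES:
    ldr = color_z_bytes_per_pixel(mode)                 # assumed 32-bit color
    hdr = color_z_bytes_per_pixel(mode, color_bytes=8)  # assumed FP16 RGBA color
    print(f"{mode:>4}: {ldr} bytes/pixel, {hdr} with FP16 color")
```

Under these assumptions, 16X carries the same color/Z footprint as 4X multisampling while tracking four times the coverage, which is the trade Montrym was describing.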

We can see the locations of the G80’s AA sample points by using a simple FSAA test application. This app won’t show the location of coverage-only sample points for CSAA, unfortunately.

[Image table: AA sample patterns at 2X, 4X, 6X, 8X, 8xS/8xQ, 16X, and 16xQ for the GeForce 7900 GTX, GeForce 8800 GTX, and Radeon X1950 XTX.]

Lo and behold, Nvidia’s 8xQ multisampled mode introduces a new, non-grid-aligned sample pattern with a quasi-random distribution, much like ATI’s 6X pattern. I’ve tried to shake a map of the 16-sample pattern out of Nvidia, but without success to date.

The big question is: does CSAA work, and if so, how well? For that, we have lots of evidence, but we’ll start with a quick side-by-side comparison of 4X multisampling with the three CSAA modes. Here’s a small example with some high-contrast, near-vertical edges, straight out of Half-Life 2. I’ve resized these images to precisely four times their original size to make them easier to see, but they are otherwise unretouched.

Coverage Sampling AA Quality
4X 8X 16X 16xQ

To my eye, CSAA works, and works well. In fact, 16xQ looks no more effective to me than the 16X mode, despite having twice as many stored color and Z samples. You can see direct comparisons of all of the G80’s modes to the G71 and R580+ in the larger image comparison table on the next page, but first, let’s look at how CSAA impacts performance.

This isn’t quite 16X quality at 4X performance, but CSAA 8X and 16X both have less overhead than the 8xQ pure multisampled mode. In fact, CSAA 16X may be the sweet spot for image quality and performance together.

These numbers from Half-Life 2 are also our first look at G80 performance in a real game. Perhaps you’re getting excited to see more, in light of these numbers? Before we get there, we have a couple of quick AA image quality comparisons between the GPUs to do.

 

Antialiasing image quality – GPUs side by side
As on the last page, these images are from Half-Life 2, and they’re just a quick-hit look at smoothing out (primarily) a couple of high-contrast, near-vertical edges. In these screenshots, the G80’s gamma-correct AA blending is turned on, while the G71’s is turned off, for a very simple reason: those are the current control panel defaults for the two GPUs. ATI’s gamma-correct blends are always on, I believe.
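For readers wondering why the gamma-correct setting matters for these comparisons, here is a minimal sketch of the idea, not a description of what either vendor's hardware actually does. It assumes a simple power-law display gamma of 2.2 (`GAMMA` is our assumption) and blends the samples of a single edge pixel two ways.

```python
GAMMA = 2.2  # assumed display gamma; real hardware may use another curve

def to_linear(value):
    """Convert a display-space intensity (0..1) to linear light."""
    return value ** GAMMA

def to_display(value):
    """Convert a linear-light intensity (0..1) back to display space."""
    return value ** (1.0 / GAMMA)

def blend_naive(samples):
    """Average the stored sample values directly."""
    return sum(samples) / len(samples)

def blend_gamma_correct(samples):
    """Average in linear light, then convert back for display."""
    linear = [to_linear(s) for s in samples]
    return to_display(sum(linear) / len(linear))

# An edge pixel half covered by white and half by black, 4X AA.
edge_samples = [1.0, 1.0, 0.0, 0.0]
print("naive blend:        ", blend_naive(edge_samples))
print("gamma-correct blend:", round(blend_gamma_correct(edge_samples), 3))
```

The gamma-correct result comes out noticeably brighter than the naive 0.5, which is why the same sample pattern can produce visibly different edge gradients depending on this setting.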

Antialiasing quality
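[Image comparison. Columns: GeForce 7900 GTX, Radeon X1950 XTX, GeForce 8800 GTX.
Rows: No AA; 2X; 4X; 6X (Radeon X1950 XTX) and 8X (GeForce 8800 GTX); 8xS (GeForce 7900 GTX) and 8xQ (GeForce 8800 GTX); 16X (GeForce 8800 GTX only); 16xQ (GeForce 8800 GTX only).]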

ATI’s 6X multisampled mode is still quite good, but the G80’s 8X and 16X CSAA modes are now clearly superior.

 

Antialiasing image quality – Alpha transparency
Here’s one final AA image quality example, focused on the methods that ATI and Nvidia have devised to handle the tough case of textures with alpha transparency cutouts in them. Nvidia calls its method transparency AA and ATI calls its version adaptive AA, but they are fundamentally similar. The scene below, again from Half-Life 2, has two examples of alpha-transparent textures: the leaves on the tree and the letters in the sign. 4X multisampling is enabled in all cases. Without provisions for alpha transparency, though, the edges of those cutouts are not antialiased.

Alpha transparency antialiasing quality w/4X AA
GeForce 7900 GTX Radeon X1950 XTX GeForce 8800 GTX
No transparency/Adaptive AA

Transparency multisampling/Adaptive performance mode

Transparency supersampling/Adaptive quality mode

ATI’s adaptive performance mode looks to be superior to Nvidia’s transparency multisampling mode, but adaptive quality mode and transparency supersampling produce very similar results, likely because both are doing 4X supersampling on the alpha-transparent textures.

 

Test notes
We did run into a few snags in our testing. For one, we had to update our Asus P5N32-SLI SE Deluxe’s BIOS in order to resolve a problem. With the original 0204 BIOS, the system reported only 1GB of memory in Windows whenever a pair of 7950 GX2s was installed. This was not a problem with any of our single or dual-GPU configs, but Quad SLI required a BIOS update.

Also, when we tried to run a pair of GeForce 7600 GT cards in SLI, we encountered some odd image artifacts that we couldn’t make go away. The image artifacts didn’t appear to affect performance, so we’ve included results for the GeForce 7600 GT in SLI. If we find a resolution for the problem and performance changes, we’ll update the scores in this article.

Finally, the 3DMark06 test results for the Radeon X1950 XTX CrossFire system were obtained using an Asus P5W DH motherboard, for reasons explained here. Otherwise, we used the test systems as described below.

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

For our initial GPU comparison, our test systems were configured like so:

Processor: Core 2 Extreme X6800 2.93GHz
System bus: 1066MHz (266MHz quad-pumped)
Motherboard: Asus P5N32-SLI Deluxe
BIOS revision: 0305
North bridge: nForce4 SLI X16 Intel Edition
South bridge: nForce4 MCP
Chipset drivers: ForceWare 6.86
Memory size: 2GB (2 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 15
Hard drive: Maxtor DiamondMax 10 250GB SATA 150
Audio: Integrated nForce4/ALC850 with Realtek 5.10.0.6150 drivers
Graphics:
    Radeon X1950 XTX 512MB PCIe with Catalyst 6.10 drivers
    GeForce 7900 GTX 512MB PCIe with ForceWare 93.71 drivers
    GeForce 8800 GTS 640MB PCIe with ForceWare 96.94 drivers
    GeForce 8800 GTX 768MB PCIe with ForceWare 96.94 drivers
OS: Windows XP Professional (32-bit)
OS updates: Service Pack 2, DirectX 9.0c update (August 2006)

For the broader gaming comparisons against many other cards, our test configuration looked like this:

Processor Core 2 Extreme X6800 2.93GHz Core 2 Extreme X6800 2.93GHz Core 2 Extreme X6800 2.93GHz
System bus 1066MHz (266MHz quad-pumped) 1066MHz (266MHz quad-pumped) 1066MHz (266MHz quad-pumped)
Motherboard Asus P5N32-SLI Deluxe Intel D975XBX Asus P5W DH
BIOS revision 0204 BX97510J.86A.1073.2006.0427.1210 0801
0305
North bridge nForce4 SLI X16 Intel Edition 975X MCH 975X MCH
South bridge nForce4 MCP ICH7R ICH7R
Chipset drivers ForceWare 6.86 INF Update 7.2.2.1007
Intel Matrix Storage Manager 5.5.0.1035
INF Update 7.2.2.1007
Intel Matrix Storage Manager 5.5.0.1035
Memory size 2GB (2 DIMMs) 2GB (2 DIMMs) 2GB (2 DIMMs)
Memory type Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz
CAS latency (CL) 4 4 4
RAS to CAS delay (tRCD) 4 4 4
RAS precharge (tRP) 4 4 4
Cycle time (tRAS) 15 15 15
Hard drive Maxtor DiamondMax 10 250GB SATA 150 Maxtor DiamondMax 10 250GB SATA 150 Maxtor DiamondMax 10 250GB SATA 150
Audio Integrated nForce4/ALC850 with Realtek 5.10.0.6150 drivers Integrated ICH7R/STAC9221D5 with SigmaTel 5.10.5143.0 drivers Integrated ICH7R/ALC882M with Realtek 5.10.00.5247 drivers
Graphics Radeon X1800 GTO 256MB PCIe
with Catalyst 8.282-060802a-035722C-ATI drivers
Radeon X1900 XTX 512MB PCIe + Radeon X1900 CrossFire
with Catalyst 8.282-060802a-035515C-ATI drivers
Radeon X1900 XT 256MB PCIe + Radeon X1900 CrossFire
with Catalyst 8.282-060802a-035515C-ATI drivers
Radeon X1900 GT 256MB PCIe
with Catalyst 8.282-060802a-035722C-ATI drivers
Radeon X1950 XTX 512MB PCIe + Radeon X1950 CrossFire
with Catalyst 8.282-060802a-03584E-ATI drivers
 
Radeon X1900 XT 256MB PCIe
with Catalyst 8.282-060802a-03584E-ATI drivers
   
Radeon X1900 XTX 512MB PCIe
with Catalyst 8.282-060802a-03584E-ATI drivers
Radeon X1950 XTX 512MB PCIe
with Catalyst 8.282-060802a-03584E-ATI drivers
   
BFG GeForce 7600 GT OC 256MB PCIe
with ForceWare 91.47 drivers
   
Dual BFG GeForce 7600 GT OC 256MB PCIe
with ForceWare 91.47 drivers
   
XFX GeForce 7900 GS 480M Extreme 256MB PCIe
with ForceWare 91.47 drivers
   
Dual XFX GeForce 7900 GS 480M Extreme 256MB PCIe
with ForceWare 91.47 drivers
   
GeForce 7900 GT 256MB PCIe
with ForceWare 91.31 drivers
   
Dual GeForce 7900 GT 256MB PCIe
with ForceWare 91.31 drivers
   
XFX GeForce 7950 GT 570M Extreme 512MB PCIe
with ForceWare 91.47 drivers
   
Dual XFX GeForce 7950 GT 570M Extreme 512MB PCIe
with ForceWare 91.47 drivers
   
GeForce 7900 GTX 512MB PCIe
with ForceWare 91.31 drivers
   
Dual GeForce 7900 GTX 512MB PCIe
with ForceWare 91.31 drivers
   
GeForce 7950 GX2 1GB PCIe
with ForceWare 91.31 drivers
   
Dual GeForce 7950 GX2 1GB PCIe
with ForceWare 91.47 drivers
   
GeForce 8800 GTS 640MB PCIe with ForceWare 96.94 drivers    
GeForce 8800 GTX 768MB PCIe with ForceWare 96.94 drivers    
OS Windows XP Professional (32-bit)
OS updates Service Pack 2, DirectX 9.0c update (August 2006)

Thanks to Corsair for providing us with memory for our testing. Their quality, service, and support are easily superior to those of no-name DIMMs.

Our test systems were powered by OCZ GameXStream 700W power supply units. Thanks to OCZ for providing these units for our use in testing.

Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults.

The test systems’ Windows desktops were set at 1280×960 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.

We used the following versions of our test applications:

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

 

Quake 4
In contrast to our narrower GPU comparison, I’ve included a tremendous number of different graphics solutions, both single and multi-GPU, against the GeForce 8800 GTX and GTS cards in the results below. If you find the sheer number of results daunting, I’d encourage you to scroll down to the summary line graphs at the bottom of the page, which I’ve pared down to just a few competing high-end graphics cards. I find the line graphs easier to read, anyhow.

In order to make sure we pushed the video cards as hard as possible, we enabled Quake 4’s multiprocessor support before testing.

Yes, you read that right. A single GeForce 8800 GTX is as fast as the Radeon X1950 XTX CrossFire setup. This is absolutely astonishing performance, though not entirely unexpected given what we’ve seen from the G80 in the preceding pages. Even the substantially cut-down GeForce 8800 GTS is faster than the previous gen’s top single-GPU performers.

 

F.E.A.R.
We’ve used FRAPS to play through a sequence in F.E.A.R. in the past, but this time around, we’re using the game’s built-in “test settings” benchmark for a quick, repeatable comparison.

The GeForce 8800 GTX’s dominance in F.E.A.R. isn’t quite as pronounced as in Quake 4, but it’s still very fast. The GTS can’t entirely separate itself from the Radeon X1950 XTX, though.

 

Half-Life 2: Episode One
The Source game engine uses an integer data format for its high dynamic range rendering, which allows all of the cards here to combine HDR rendering with 4X antialiasing.

Chalk up another big win for the GeForce 8800 cards, with the GTX again shadowing the Radeon X1950 CrossFire rig.

 
The Elder Scrolls IV: Oblivion
We tested Oblivion by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. Each gameplay sequence lasted 60 seconds. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent and trustworthy results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
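As a rough illustration of that reporting scheme, the sketch below takes five sessions and reduces them to one average and one low figure. The session numbers are hypothetical, the `summarize_runs` helper is ours, and we are assuming the reported average is the mean of the per-session averages.

```python
from statistics import mean, median

def summarize_runs(runs):
    """Reduce (average_fps, low_fps) pairs, one per session, to two numbers.

    Returns the mean of the session averages and the median of the
    session lows, so one unusually bad dip cannot drag the low figure down.
    """
    averages = [avg for avg, _ in runs]
    lows = [low for _, low in runs]
    return mean(averages), median(lows)

# Hypothetical FRAPS results from five 60-second sessions.
# The third session contains an outlier low of 21 FPS.
sessions = [(52.1, 38), (50.7, 36), (53.4, 21), (51.9, 37), (52.6, 39)]

avg_fps, low_fps = summarize_runs(sessions)
print(f"average: {avg_fps:.1f} FPS, median low: {low_fps} FPS")
```

With these made-up numbers, the single 21 FPS dip has no effect on the reported low, which is the point of using the median rather than the minimum.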

We set Oblivion’s graphical quality settings to “Ultra High.” The screen resolution was set to 1600×1200, with HDR lighting enabled. 16X anisotropic filtering was forced on via the cards’ driver control panels.

Here’s one of the best-respected graphics engines in a game today, replete with FP16-based high dynamic range lighting and lots of nice effects, the winner of our poll on the best-looking PC game—and the GeForce 8800 utterly dominates in it.

I played around with Oblivion a fair amount on the GeForce 8800 cards, and they’re not without the visual glitches that tend to come with fresh drivers on a new GPU architecture. Some character shadows tend to flicker, there’s occasional Z cracking, and every once in a while, a stray miscolored block of pixels pops up somewhere on the screen. These problems are fairly infrequent, though, and I expect them to be fixed in later drivers. Regardless, the game looks absolutely stunning. The texture filtering problems I’ve noted on the G71 are definitely fixed here. The GeForce 8800’s default filtering methods aren’t perfect, but they are better than ATI’s, which are still pretty good.

I wanted to see what I could do to push the G80 with Oblivion, so I left the game at its “Ultra quality” settings and cranked up the resolution to 2048×1536. I then turned up the quality options in the graphics driver control panel: 4X AA, transparency supersampling, and high-quality texture filtering. I set the Radeon X1950 XTX to comparable settings, but the GeForce 7900 GTX couldn’t join the party, since it can’t do FP16 HDR with AA.

The 8800 GTX runs the game at these settings with horsepower to spare! I was still curious about how far I could take things, so I turned on 16X CSAA in addition to everything else, and the 8800 GTX completed our test loop at about 45 frames per second—easily a playable frame rate for this game. Here’s a screenshot from that test session:


Oblivion with HDR lighting, 16X aniso, 16X CSAA, transparency supersampling, and HQ filtering.

Ghost Recon Advanced Warfighter
We tested GRAW with FRAPS, as well. We cranked up all of the quality settings for this game, with the exception of antialiasing. However, GRAW doesn’t allow cards with 256MB of memory to run with its highest texture quality setting, so those cards were all running at the game’s “Medium” texture quality.

Once again, the GeForce 8800 cards are beating up on SLI and CrossFire rigs. Yow.

 

3DMark06

The GeForce 8800 GTX’s sheer dominance continues in 3DMark06, with one card matching the Radeon X1950 CrossFire setup again. Incidentally, I haven’t included all of the results of the 3DMark06 sub-tests here, but the GeForce 8800 GTX doesn’t have a particular strength or weakness compared to the Radeon X1950 CrossFire or GeForce 7900 GTX SLI rigs. The 8800 GTX’s SM2.0, SM3.0/HDR, and CPU scores are all within a few points of the Radeon X1950 CrossFire system’s.

 

Power consumption
Now for the moment of truth. We measured total system power consumption at the wall socket using an Extech power analyzer model 380803. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. Remember, out of necessity, we’re using different motherboards for the CrossFire systems. Otherwise, the system components other than the video cards were kept the same.

The idle measurements were taken at the Windows desktop. The cards were tested under load running Oblivion using the game’s Ultra Quality setting at 1600×1200 resolution with 16X anisotropic filtering.

Somehow, this 681 million-transistor beast only registers 7W more at the wall socket—while running a game—than the Radeon X1950 XTX, very close to Nvidia’s claim of a 5W difference between the cards. Now, oddly enough, sitting idle at the desktop is another story. The 8800 GTX-equipped system draws nearly 30W more than the Radeon X1950 XTX system while just sitting there.

Noise levels and cooling
We measured noise levels on our test systems, sitting on an open test bench, using an Extech model 407727 digital sound level meter. The meter was mounted on a tripod approximately 14″ from the test system at a height even with the top of the video card. The meter was aimed at the very center of the test systems’ motherboards, so that no airflow from the CPU or video card coolers passed directly over the meter’s microphone. We used the OSHA-standard weighting and speed for these measurements.

You can think of these noise level measurements much like our system power consumption tests, because the entire systems’ noise levels were measured, including CPU and chipset fans. We had temperature-based fan speed controls enabled on the motherboard, just as we would in a working system. We think that’s a fair method of measuring, since (to give one example) running a pair of cards in SLI may cause the motherboard’s coolers to work harder. The motherboard we used for all single-card and SLI configurations was the Asus P5N32-SLI SE Deluxe, which on our open test bench required an auxiliary chipset cooler. The Asus P5W DH Deluxe motherboard we used for CrossFire testing didn’t require a chipset cooler, so those systems were inherently a little bit quieter. In all cases, we used a Zalman CNPS9500 LED to cool the CPU.

Of course, noise levels will vary greatly in the real world along with the acoustic properties of the PC enclosure used, whether the enclosure provides adequate cooling to avoid a card’s highest fan speeds, placement of the enclosure in the room, and a whole range of other variables. These results should give a reasonably good picture of comparative fan noise, though.

We measured the coolers at idle on the Windows desktop and under load while playing back our Quake 4 nettimedemo. The cards were given plenty of opportunity to heat up while playing back the demo multiple times. Still, in some cases, the coolers did not ramp up to their very highest speeds under load. The Radeon X1800 GTO and Radeon X1900 cards, for instance, could have been louder had they needed to crank up their blowers to top speed. Fortunately, that wasn’t necessary in this case, even after running a game for an extended period of time.

You’ll see two sets of numbers for the GeForce 7950 GT below, one for the XFX cards with their passive cooling and another for the BFG Tech cards, which use the stock Nvidia active cooler. I measured them both for an obvious reason: they were bound to produce very different results.

We’ve long been impressed with the whisper-quiet cooler on the GeForce 7900 GTX, and Nvidia has done it again with the GeForce 8800 series cooler (it’s the same one for the GTS and GTX). This nice, big, dual-slot cooler is even quieter than the 7900 GTX’s. The thing does have to make some noise in order to move air, but the pitch it emits tends not to tickle our eardrums too much, or to register too strongly on the decibel meter.

 
Conclusions
I didn’t expect Nvidia to produce a unified shader architecture for this generation of product, and I certainly wasn’t anticipating anything quite like this GPU. Honestly, I just didn’t foresee the company taking such a huge risk with the GeForce FX debacle still relatively fresh in its mind. As Nvidia CEO Jen-Hsun Huang pointed out at the G80 press event, designing a more traditional graphics pipeline in silicon leads to fairly predictable performance characteristics. The performance of a graphics processor that depends on stream processors and thread-based load balancing is much more difficult to model. You don’t necessarily know what you’re going to have until it’s in silicon. The history of graphics is littered with failed chip designs that attempted to implement a non-traditional pipeline or to make a big leap toward more general programmability.

Fortunately, the green team did take that risk, and they managed to pull it off. I still can’t believe the G80 is a collection of scalar stream processors, and I’m shocked that it performs so well. The real-world, delivered performance in today’s OpenGL and DirectX 9-class games is roughly twice that of the Radeon X1950 XTX. Yes, it’s taken 680 million transistors to get there, but this kind of performance gain from one generation to the next is remarkable, a testament to the effectiveness of the G80’s implementation of unified shaders.

The G80 has just about everything else one could ask of a new GPU architecture, too. The new features and innovation are legion, anchored by the push for compliance with DirectX 10 and its new capabilities. The G71’s texture filtering problems have been banished, and the G80 sets a new standard for image quality in terms of both texture filtering and edge antialiasing. This GPU’s texture filtering hardware at last (or once again) delivers angle-independent anisotropic filtering at its default settings, and coverage sampled antialiasing offers the feathery-smooth quality of 16X sample sizes without being a major drag on frame rates. Despite being large enough to scare the cattle, the G80 doesn’t draw much more power under load than the Radeon X1950 XTX. The chip is still too large and consumes too much power at idle, but this architecture should be a sweetheart once it makes the transition to a 65nm fab process, which is where it really belongs.

No doubt Nvidia will derive an entire family of products from this basic technology, scaling it back to meet the needs of lower price points. Those family members should be arriving soon, too, given the way DirectX 10 deprecates prior generations of graphics chips. Huang has said Nvidia is taping out “about a chip a month” right now. I’m curious to see how they scale this design down to smaller footprints, given that the ROP partitions are tied to memory paths. Will we see a mid-range graphics card with a 192-bit interface? I’m also intrigued by the possibilities for broader parallelism in new configurations. Will we see arrays of G80-derived GPUs sitting behind one of those external display chips?
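The 192-bit question falls out of simple arithmetic if each ROP partition brings its own 64-bit memory path, which is our working assumption here (six partitions for the 384-bit GTX, five for the 320-bit GTS). A quick sketch, with a hypothetical three-partition part thrown in:

```python
BITS_PER_ROP_PARTITION = 64  # assumed: one 64-bit memory path per partition

def memory_bus_width(rop_partitions):
    """Return the total memory interface width in bits."""
    return rop_partitions * BITS_PER_ROP_PARTITION

configs = [
    ("GeForce 8800 GTX", 6),
    ("GeForce 8800 GTS", 5),
    ("hypothetical mid-range part", 3),  # not an announced product
]

for name, partitions in configs:
    print(f"{name}: {partitions} ROP partitions, "
          f"{memory_bus_width(partitions)}-bit interface")
```

Under that assumption, bus widths can only scale in 64-bit steps, so a cut-down chip lands on widths like 192 or 256 bits depending on how many partitions survive.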

Our time with the G80 so far has been limited, and as a result, I’ve regrettably not scratched the surface in several areas, including the importance of DirectX 10 and CUDA, Nvidia’s analog to ATI’s stream computing initiative. We will have to address those topics in more detail in the near future. They are related in that both DX10 and CUDA expose new ways to harness the power of the G80’s stream processors for non-traditional uses, either for graphics-related tasks like geometry shaders that can create and destroy vertices or for non-graphics apps like gaming physics and scientific computing. Now that the GPU has cast off even more of its fixed pipeline and gained more general stream processing capacity, the applications for its power are much broader. Both Microsoft and Nvidia are working to give developers the tools to take advantage of that power—and soon.

We also haven’t tested the GeForce 8800 in SLI. Although Nvidia says SLI will be an option for consumers starting today, they actively discouraged us from testing SLI and didn’t provide us with an SLI-capable driver. Fair enough—I didn’t even do justice to DX10 and CUDA, and besides, who needs two of these things right now? Folks who want to do SLI with an 8800 GTX will need to have a power supply with four PCIe aux power plugs—or some converters and lots of open four-pin Molex plugs. They’ll also probably need to have a 30″ LCD display capable of 2560×1600 resolution in order to use this magnitude of GPU power in current games. Regardless, we are interested in the new possibilities for things like SLI AA, so we will be exploring GeForce 8800 SLI as soon as we can.

For now, the G80—excuse me, GeForce 8800—is the new, unquestioned king of the hill all by itself, no second graphics card needed. 
