I have to admit, long-delayed graphics chips can be kind of fun. Instead of a single, overwhelming burst of information all at once, complete with performance data and hands-on impressions from the actual product, we get to partake in a slow drip-drip-drip of information about a new GPU architecture. That’s certainly been the case with Nvidia’s DirectX 11-class GPU, known variously as Fermi (the overarching GPU architecture) and GF100 (the first chip to implement that architecture). We’ve already had a look at the compute-specific bits of the Fermi architecture, and we’ve engaged in deep, informed speculation on its graphics capabilities, as well. We know exactly what the competition looks like, and gosh darn it, we’d like to lay hands on the GPU itself soon.
The first cards based on the GF100 aren’t quite ready yet, though, and we have one more stop to make before we get that chance. After the Consumer Electronics Show in Las Vegas, Nvidia invited various members of the press, including your humble correspondent, to get a closer look at the particulars of the GF100’s graphics hardware. We now know that a great deal of Rys’s speculation about the GF100’s graphics particulars was correct, but we also know that he was off in a few rather notable places. We’ve filled in quite a few details in surprising ways, as well. Keep reading as we round out our knowledge of the GF100’s graphics architecture and explain why this GPU just might be worth the wait.
A graphics architecture overview
First things first, I suppose. The GF100 is late, and Nvidia made no bones about it in this most recent briefing about the chip. Drew Henry, head of the company’s GeForce business, told us forthrightly that he’d prefer to have a product in the market now, but said of the situation, “It is what it is.” At present, the message about the GF100’s status is equal parts straightforward and cautious: GF100 chips are “in full production at TSMC” and we can expect to see products in “Q1 2010.” If the chip is in production, we can probably assume the main sources of the product delays have been rectified in the latest silicon spin. Beyond that, we have very little: no product names, prices, clock speeds, or more precise guidance on ship dates.
By making the window extend throughout the first quarter of the year, Nvidia has given itself ample leeway. Products could ship as late as March 31st without missing that target. If I were to narrow it down, though, I’d probably expect to see products somewhere around the first of March, give or take a week or two.
Time will tell on that front, but we now have a trove of specifics about the operation of the GPU from Nvidia itself. We’ve covered the computational capabilities of the GF100 quite thoroughly in our two prior pieces on the architecture, so we’ll focus most of our attention here on its graphics features. Let’s begin, as we often do, with a high-altitude overview.
A functional block diagram of GF100. Source: Nvidia.
As GPUs become more complex, these diagrams become ever more difficult to read from this distance. However, much of what you see above is already familiar, including the organization of the GPU’s execution resources into 16 SMs, or shader multiprocessors. Those SMs house an array of execution units capable of executing, at peak, 512 arithmetic operations in a single clock cycle. Nvidia would tell you the GF100 has 512 “CUDA cores,” and in a sense, they might be right. But the more we know about the way this architecture works, the less we’re able to accept that definition, any more than we can say that AMD’s Cypress has 1600 “shader cores.” The “cores” proper are really the SMs, in the case of the GF100, and the SIMD engines, in the case of Cypress. Terminology aside, though, the GF100 does have a tremendous amount of processing power on tap. Also familiar are the six 64-bit GDDR5 memory interfaces, which hold the potential to deliver as much as 50% more bandwidth than Cypress or Nvidia’s prior-generation part, the GT200.
The first hint we have of something new is the presence of four “GPCs,” (or graphics processing clusters, I believe, although I thought that name was taken by Gary Phelps’ Choice, as we used to call our Dean of Students’ preferred smokes back in college). Nvidia Senior VP of GPU Engineering Jonah Alben called the GPCs “almost complete, independent GPUs” when he first described them to us. As you can see, each one has its own rasterization engine, which points toward an intriguing departure from the norm.
Each GPC contains four SMs, and we’ll have to zoom in on a single SM in order to get a closer look at the rest of the GF100’s graphics-focused hardware.
A functional block diagram of GF100 shader multiprocessor. Source: Nvidia.
Now we can see that each SM has four texture units associated with it. More unconventionally, each SM also hosts a geometry unit, which Nvidia has creatively dubbed a “polymorph engine.” Since the GF100 has four GPCs and 16 SMs, it has a total of 64 texture units and 16 polymorph engines. The Fermi architecture detailed here is scalable along several lines: variants could be made with fewer GPCs and with fewer SMs within each GPC. We can surely expect to see smaller chips based on this architecture that have been scaled down in one or both ways.
We should also note the GF100’s ROP units, of which there are 48 ringing the L2 cache in the diagram above. With that bit added, we have sketched in full the general outlines of the GF100. What remains is to fill in some detail in several areas, starting, of course, with those curious quad rasterizers and 16 geometry units.
Polymorph? Yeah, I saw The Crying Game
The biggest surprise of the day is undoubtedly the reshuffling that’s happened in the GF100’s geometry handling resources. When explaining the decision to undertake this reorganization, Alben pointed out that geometry performance hasn’t been a major focus in GPU progress over the years. Between the GeForce FX 5800 Ultra (NV30) and the GeForce GTX 280 (GT200), he estimated pixel shading performance has mushroomed by a factor of 150. During that same span, he said, geometry performance has only tripled.
That’s true in part because the hardware that handles a key part of the graphics pipeline, the setup engine, has simply not been parallelized. Instead, any progress has been supplied by increases in clock rates and in per-clock performance. The GeForce 256, for instance, could process a triangle in eight clock cycles. The GeForce FX could do so in two cycles, and the G80 in (optimally) a single cycle.
Alben and the team saw that growing gap between geometry and shader performance as a problem, and believed that the advent of DirectX 11, with its introduction of hardware-based tessellation and two new programmable pipeline stages for geometry processing, made the moment opportune for a change. Henry Moreton, an Nvidia Distinguished Engineer and geometry processing expert who authored the original rasterization microcode at SGI, characterized the earlier attempts at geometry processing in Direct3D as “train wrecks.” He told us, however, that he believes in DirectX 11, “they got it right.” Nvidia’s response was to build what it believes is the world’s first parallel architecture for geometry processing.
Block diagram of a GF100 polymorph engine. Source: Nvidia.
Each SM in the GF100 has a so-called polymorph engine. This engine handles a host of pre-rasterization stages of the Direct3D pipeline, including vertex and hull shaders, tessellation, and domain and geometry shaders. All four of those shader types run in the shader array, of course. Beyond that, Alben told us the block diagram above is in fact architecturally accurate: all five of the functions map to dedicated units on the chip.
DirectX 11’s tessellation support is what enables the GF100’s geometry-intensive focus. The basic concept has been around quite a while: to create complex geometry by combining a low-resolution polygon mesh with a mathematical description of a more complex surface.
Situating tessellation further along in the graphics pipeline has tremendous advantages over simply using more complex geometry, not least of which is a savings of bus bandwidth between the host system and the GPU. Because tessellation happens after the vertex shader stage, animations can be processed (much more efficiently) for the base polygon mesh rather than the more complex final result. The final model will then inherit all of the appropriate movement.
Tessellation isn’t just for smoothing out objects, either. Once a more complex mesh has been created, it can be altered via the use of a displacement map, which imparts new depth information to the object. Displacement maps can be used to generate complex terrain or to add complexity to an in-game object or character model. Unlike currently popular techniques like bump or normal maps, displacement maps really do alter geometry, so object silhouettes are correctly modified, not just the object interiors. Thus, tessellation has the potential to improve the look of games substantially beyond the current standard.
In DX11, tessellation involves two programmable stages, hull shaders and domain shaders, sandwiched around a fixed-function geometry expansion step. Hull shaders run first on the base polygon mesh, and they do the level-of-detail calculations for the subdivision of existing polygons. The tessellators take this input and create new vertices. Domain shaders then evaluate the surfaces created by the tessellation step and can apply displacement maps. So yes, the GF100 has 16 separate, fixed-function units dedicated to tessellation, but their duties are limited to geometry expansion. The rest of the work happens in the shader array.
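The flow of those three stages is easy to model. Here’s a minimal, hypothetical sketch in Python: the stage names match the DX11 pipeline described above, but the toy patch, the level-of-detail rule, and the displacement function are invented purely for illustration.

```python
# Toy model of the DX11 tessellation flow: hull shader -> fixed-function
# tessellator -> domain shader. Not real shader code; just the data flow.

def hull_shader(patch, camera_distance):
    # Level-of-detail calculation: subdivide more finely when the patch
    # is close to the camera. (The LOD rule here is made up.)
    return max(1, int(64 / camera_distance))  # tessellation factor

def tessellator(tess_factor):
    # Fixed-function geometry expansion: emit new (u, v) points on a
    # regular grid over the patch's parameter space.
    n = tess_factor + 1
    return [(u / tess_factor, v / tess_factor)
            for v in range(n) for u in range(n)]

def domain_shader(patch, uv, displacement_map):
    # Evaluate the surface at (u, v) and displace it -- true geometry
    # modification, not a shading trick like a bump or normal map.
    u, v = uv
    x, y = patch["origin"][0] + u, patch["origin"][1] + v
    z = displacement_map(u, v)
    return (x, y, z)

patch = {"origin": (0.0, 0.0)}
bumpy = lambda u, v: 0.1 * ((u * 7 + v * 13) % 1.0)  # toy displacement map

factor = hull_shader(patch, camera_distance=8)
points = tessellator(factor)
verts = [domain_shader(patch, uv, bumpy) for uv in points]
print(factor, len(verts))  # 8 81
```

Note how a single coarse patch expands into dozens of vertices on the GPU side; none of that expanded geometry ever crosses the host bus, which is the bandwidth savings mentioned below.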
The distribution of the polymorph engines’ various duties to 16 separate units suggests broad parallelization, and so it is. Nvidia claims, for instance, that vertex fetches now happen in parallel, with up to 32 attributes being fetched per cycle across the GPU, four times the capacity of the GT200.
Managing all of the related calculations in parallel for some of the pipeline stages is no trivial task. Moreton cited an example related to tessellation, in which a single SM has been given a patch that will generate thousands of triangles and thus potentially spill out of local storage. In such cases, the GF100 will evaluate patches and decompose them into smaller patches for distribution to multiple SMs across the chip. The related data are kept on die and passed to other SMs via their L1 caches. The results must still be output in the appropriate order, which requires careful scheduling and coordination across SMs. Thus, the GF100 employs a sort of coherency protocol to track geometry data at the thread level; a network in the chip distributes this information.
Simple block diagram of a GF100 raster engine. Source: Nvidia.
Once the polymorph engines have finished their work, the resulting data are forwarded to the GF100’s four raster engines. Optimally, each one of those engines can process a single triangle per clock cycle. The GF100 can thus claim a peak theoretical throughput rate of four polygons per cycle, although Alben called that “the impossible-to-achieve rate,” since other factors will limit throughput in practice. Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.
Sharp-eyed readers may recall that AMD claimed it had dual rasterizers upon the launch of the Cypress GPU in the Radeon HD 5870. Based on that, we expected Cypress to be able to exceed the one polygon per cycle limit, but its official specifications instead cite a peak rate of 850 million triangles per second, or one per cycle at its default 850MHz clock speed. We circled back with AMD to better understand the situation, and it’s a little more complex than was originally presented. What Cypress has is dual scan converters, but it doesn’t have the setup or primitive interpolation rates to support more than one triangle per cycle of throughput. As I understand it, the second scan converter is an optimization that allows the GPU to push through more pixels, in cases where the polygons are large enough. The GF100’s approach is quite different and really focused on increasing geometric complexity.
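To put those setup rates in perspective, here’s a quick back-of-the-envelope comparison. The 725MHz half-speed clock for the GF100 is this article’s own later estimate, not a confirmed Nvidia spec.

```python
# Rough peak and "typical" setup rates, in triangles per second.
gf100_half_clock = 725e6      # assumed half-speed clock (our estimate)
cypress_clock = 850e6         # Radeon HD 5870's default core clock

peak = 4 * gf100_half_clock       # four raster engines, one tri/clock each
typical = 3.2 * gf100_half_clock  # Nvidia's directed-test average
cypress = 1 * cypress_clock       # Cypress: one triangle per cycle

print(round(peak / 1e9, 2),       # 2.9  Gtris/s
      round(typical / 1e9, 2),    # 2.32 Gtris/s
      round(cypress / 1e9, 2))    # 0.85 Gtris/s
```

Even at its “typical” 3.2 triangles per clock, the GF100 would have well over twice Cypress’s theoretical setup throughput, assuming those clocks hold up.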
Comparative GF100 tessellation performance. Source: Nvidia.
Nvidia claims the higher setup rates enabled by the combination of the polymorph and raster engines allow the GF100 to achieve up to six times the performance of the Radeon HD 5870 in directed tests.
The firm also supplied us with frame-by-frame performance results for a selected portion of the Unigine DX11 demo that’s particularly geometry intensive. The GF100 purportedly outperforms the 5870 during this sequence thanks to its superior geometry throughput.
Clearly, Nvidia has gone to great lengths to give the GF100 a parallel geometry processing architecture, and that is the distinctive and defining feature of this chip. If it works as advertised, they will have solved a difficult problem in hardware for perhaps the first time. But make no mistake about it: giving the GF100 these capabilities is a forward-looking play, not an enhancement that will pay off in the short term. You will note that none of the examples above come from a contemporary, or even future, game; the closest we get is the Unigine DX11 technology demo from a third party. In order for the GF100’s geometry processing capabilities to give it a competitive advantage, games will not only have to make use of DX11 tessellation, but they will have to do so to an extreme degree, one unanticipated by existing DX11-class hardware from AMD. In other words, the usage model for GPUs will have to shift rather radically in the direction of additional geometric complexity.
Moreton expressed hope that geometry scaling techniques like dynamic level of detail algorithms could allow developers to use the GF100’s power without overburdening less capable hardware. Whether or not that will happen in the GF100’s lifetime remains to be seen, but Nvidia does appear to have addressed a problem that its competition will need to deal with in future architecture generations. That fact would be even more impressive if the GF100 weren’t so late to the party.
Nvidia has reshuffled the GF100’s texturing hardware, as well. In the GT200, three SMs shared a single texture unit capable of sampling and filtering eight texels per clock; each texture unit had an associated texture cache. In the GF100, each SM gets its own dedicated texture unit and texture cache, with no need for sharing, although the units themselves can only sample and filter four texels per cycle. That filtering rate assumes an INT8 texture format. Like the GT200 and Cypress, the GF100 filters FP16 textures at half the usual rate and FP32 textures at a quarter of it.
Add it all up, and the GF100 can sample and filter “only” 64 texels per cycle at peak, whereas the GT200 could do 80. (Rys had guessed 128(!) for the GF100. Thanks for playing!) Nvidia VP of GPU Architecture Emmett Kilgariff explained to us that several considerations offset the GF100’s theoretically lower per-clock potential. For one thing, the GF100’s texture cache has been optimized so there are fewer sampling conflicts, allowing the texturing hardware to operate more efficiently. For another, Nvidia has done away with the split between the so-called core and shader clocks familiar from the GT200. Most of the chip now runs at half the speed of the shader clock, including the texture units, polymorph engines, raster engines, schedulers, and caches, as I understand it. That should mean the GF100’s texturing hardware is clocked a little higher than the GT200’s.
Even so, Nvidia has said the GF100’s theoretical texturing capacity will be lower than the GT200’s, but Kilgariff presented some numbers to underscore the point that the GF100’s true, delivered performance should be as much as 40-70% higher.
GF100’s delivered texturing performance should eclipse GT200’s. Source: Nvidia.
By the way, that forecast of lower peak theoretical texturing performance for the GF100 gives us a big hint about likely clock speeds. We’ll revisit this topic shortly.
One mild surprise is that Nvidia hasn’t changed its texture filtering algorithm from the GT200, despite some expectations that the GF100 might bring improved quality in light of the new Radeons’ near-perfect angle-invariant aniso. Alben described the output of the algorithm first implemented in G80 as “really beautiful” and said the team thus viewed filtering as “a solved problem.” Hard to argue with that, really.
Of course, the texture hardware now supports the HD texture compression formats introduced in DirectX 11.
An additional DX11 feature AMD touted with the introduction of Cypress was pull-model interpolation, in which yet another duty of the traditional setup engine was handed off to the shader core and made programmable. At the time, AMD said its setup hardware had limited the RV770’s performance in some directed tests of texture filtering, and Cypress was indeed quite a bit faster than two RV770s in such benchmarks. When I asked how they had implemented pull-model interpolation in the GF100, Moreton explained that interpolation had been handled in the shader array since the G80 and that the new Direct3D spec essentially matches the G80’s capabilities. His short answer for how they implemented it, then: “Natively.”
One bit of fanciness Kilgariff pointed out in the GF100’s texture samplers is a robust implementation of DX11’s Gather4 feature. The samplers can pull scalar data from four texel locations simultaneously, and those locations are programmable across a fairly broad area. By varying the sample locations, developers can implement what is essentially a hardware-accelerated jittered sampling routine for softening shadow edges and removing jaggies. Kilgariff said that, by using this technique, they’d measured a 2X increase over non-vectorized sampling on the GF100 and roughly 3.3X over the Radeon HD 5870.
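The jittered-sampling idea is simple enough to sketch. Below is a hypothetical Python model of a percentage-closer shadow filter built on a Gather4-style fetch: the 2x2-footprint gather is real DX11 behavior, but the shadow map, jitter pattern, and depth comparison here are toy stand-ins for illustration.

```python
import random

def gather4(texture, x, y):
    # Fetch a 2x2 footprint of scalar depths in one operation, as a
    # Gather4-capable sampler can (GF100 adds programmable offsets
    # across a fairly broad area).
    return [texture[y][x], texture[y][x + 1],
            texture[y + 1][x], texture[y + 1][x + 1]]

def soft_shadow(shadow_map, x, y, pixel_depth, taps=4, radius=3, seed=1):
    # Jitter several gathers around (x, y) and average the depth
    # comparisons -- softening the shadow edge instead of producing
    # a hard in/out result.
    rng = random.Random(seed)
    passed = total = 0
    h, w = len(shadow_map), len(shadow_map[0])
    for _ in range(taps):
        jx = max(0, min(w - 2, x + rng.randint(-radius, radius)))
        jy = max(0, min(h - 2, y + rng.randint(-radius, radius)))
        for depth in gather4(shadow_map, jx, jy):
            passed += depth >= pixel_depth  # lit if no closer occluder
            total += 1
    return passed / total  # fraction of samples in light, 0.0 .. 1.0

# 8x8 toy shadow map: left half occluded (depth 0.2), right half open (0.9).
shadow_map = [[0.2] * 4 + [0.9] * 4 for _ in range(8)]
lit = soft_shadow(shadow_map, x=3, y=4, pixel_depth=0.5)
print(0.0 <= lit <= 1.0)  # True; fractional values soften the edge
```

The hardware win Kilgariff described is that each gather returns four texels at once, so a filter like this costs a quarter of the fetches it would with scalar sampling.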
ROPs and antialiasing
Nvidia has traditionally associated its ROP hardware (which converts shaded fragments into pixels and writes them to memory) closely with its L2 cache and memory controllers. With the move to GDDR5 memory, the GF100 promises to have as much as 50% higher memory bandwidth than the GT200, but it now has only six 64-bit memory controllers onboard, down from eight in the prior-gen chip. To keep the right balance of ROP hardware, Nvidia has reworked its ROP partitions: each one now houses eight ROP units, for a total of 48 ROP units across the chip. At peak, then, the GF100 can output 48 pixels per clock in a 32-bit integer format, a straightforward increase of 50% over the GT200 or Cypress. GF100’s ROPs require two cycles to process pixels in FP16 data formats and four for FP32.
Not only are the GF100’s ROPs more numerous, but they’ve also been modified to handle 8X multisampled antialiasing without taking a big performance hit, mainly due to improved color compression speed. GeForce GPUs have been at a disadvantage in 8X multisampling performance since the introduction of the Radeon HD 4800 series, but the GF100 should rectify the situation, as indicated by the Nvidia-supplied numbers above. (I should caution, however, that HAWX supports DirectX 10.1, which also accelerates antialiasing performance on newer GPUs. We’ll want to test things ourselves before being fully confident on this point.)
One saving grace for the GT200’s antialiasing performance has been Nvidia’s coverage sampled AA modes, which store larger numbers of coverage samples than color samples and offer nicely improved edge quality with little performance cost. Now that true 8X multisampling is more comfortable, Nvidia has added a coverage sampled AA mode based on it. The new 32X CSAA mode stores eight full coverage-plus-color samples and an additional 24 coverage-only samples.
CSAA 32X: blue positions are full samples, and gray positions are coverage only. Source: Nvidia.
Alpha-to-coverage on foliage: 8 coverage samples versus 32. Source: Nvidia.
Not only will 32X CSAA provide higher fidelity antialiasing on traditional object edges, but Kilgariff pointed out that many games use a technique called alpha-to-coverage to render dense grass or foliage with soft edges, in which alpha test results contribute to a coverage mask. This method produces better results than a simple alpha test, but it relies on coverage samples to work its magic. Sometimes four or eight samples will be insufficient to prevent aliasing. In such cases, 32X CSAA can produce markedly superior results, with a total of 33 levels of transparency. Also, Nvidia’s transparency multisampling modea driver feature that promotes simple alpha-test transparency to alpha-to-coverageshould benefit from the additional coverage samples in 32X CSAA.
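The alpha-to-coverage idea above can be illustrated with a few lines of Python. The mask construction here is an invented simplification, but it shows why n coverage samples yield n + 1 distinct transparency levels, and hence why 32 samples give the 33 levels cited above.

```python
def alpha_to_coverage(alpha, samples=32):
    # Promote an alpha value to a coverage mask: turn on a number of
    # coverage samples proportional to alpha. (Real hardware uses a
    # dither pattern for the sample positions; this toy version just
    # sets the low bits.)
    covered = round(max(0.0, min(1.0, alpha)) * samples)
    return (1 << covered) - 1  # bitmask with 'covered' bits set

# Sweep alpha from 0.0 to 1.0 and count distinct masks produced.
levels32 = {alpha_to_coverage(a / 100, samples=32) for a in range(101)}
levels8 = {alpha_to_coverage(a / 100, samples=8) for a in range(101)}
print(len(levels32), len(levels8))  # 33 9
```

With only 8 samples, every alpha value collapses into one of 9 steps, which is exactly the kind of banding on foliage edges that the extra coverage samples in 32X CSAA smooth out.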
What does caching do for graphics?
We’ve already spent ample time on this architecture’s computing capabilities, so I won’t revisit that ground again here. One question that we’ve had since hearing about the GF100’s relatively robust cache architecture is what benefits caching might have for graphics, if any.
Most GPUs have a number of special-purpose pools of local storage. The GF100 is similar in that it has an instruction cache and a dedicated 12KB texture cache in each SM. However, each SM also has 64KB of L1 data storage that’s a little bit different: it can be split either 48/16KB or 16/48KB between a local data store (essentially a software-managed cache) and a true L1 cache. For graphics, the GF100 uses the 48KB shared memory/16KB L1 cache configuration, so most of the local storage will be directly managed by Nvidia’s graphics drivers, as it was in the GT200. The small L1 cache in each SM does have a benefit for graphics, though. According to Alben, if an especially long shader fills all of the available register space, registers can spill into this cache. That should avoid some worst-case scenarios that could greatly hamper performance.
More impressive is the GF100’s 768KB L2 cache, which is coherent across the chip and services all requests to read and write memory. This cache’s benefits for computing applications with irregular data access patterns are clear, but how does it help graphics? In several ways, Nvidia claims. Because the L2 can store any sort of data, it replaces both the 256KB, read-only L2 texture cache and the write-only ROP cache in the GT200 with a single, unified read/write path that naturally maintains proper program order. Since it’s larger, it also provides more texture coverage than the GT200’s L2 texture cache, a straightforward benefit. And because it may be the only local data store large enough to handle it, the L2 cache will hold the large amounts of geometry data generated during tessellation, too.
So there we have some answers. If it works well, caching should help enable the GF100’s unprecedented levels of geometry throughput and contribute to the architecture’s overall efficiency.
One more shot at likely speeds and feeds
Speaking of efficiency, that will indeed be the big question about the Fermi architecture and especially about the GF100. How efficient is the architecture in its first implementation?
Almost to scale? A GF100 die shot. Source: Nvidia.
The chip isn’t in the wild yet, so no one has measured its exact die size. Nvidia, as a matter of policy, doesn’t disclose die sizes for its GPUs (they are, I believe, the last straggler on this point in the PC market). But we know the transistor count is about three billion, which is, well, hefty. How so large a chip will fare on TSMC’s thus-far-troubled 40-nm fabrication process remains to be seen, but the signs are mixed at best.
Although we don’t yet have final product specs, Nvidia’s Drew Henry set expectations for the GF100’s power consumption by admitting the chip will draw more power under load than the GT200. That fact by itself isn’t necessarily a bad thing: Intel’s excellent Lynnfield processors consume more power at peak than their Core 2 Quad predecessors, but their total power consumption picture is quite good. Still, any chip this late and this large is going to raise questions, especially with a very capable, much smaller competitor already in the market.
With the new information we have about the GF100’s graphics bits and pieces, we can revise our projections for its theoretical peak capabilities. Sad to say, our earlier projections were too bullish on several fronts, so most of our revisions are in a downward direction.
We don’t have final clock speeds yet, but we do have a few hints. As I noted in the texturing discussion, Nvidia’s suggestion that the GF100’s theoretical texture filtering capacity will be lower than the GT200’s gives us an upper bound on clock speeds. The crossover point where the GF100 would match the GeForce GTX 280 in texturing capacity is a 1505MHz shader clock, with the texturing hardware running at half that frequency. We can probably assume the GF100’s clocks will be a little lower than that.
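That crossover figure is simple arithmetic, assuming the GTX 280’s published specs: 80 texture units at a 602MHz core clock.

```python
# Find the GF100 clock at which it would match the GTX 280's INT8
# bilinear texel rate, given 64 texture units at half the shader clock.
gtx280_texel_rate = 80 * 602e6           # 48.16 Gtexels/s
gf100_units = 64

half_clock = gtx280_texel_rate / gf100_units  # 752.5 MHz half-speed clock
hot_clock = 2 * half_clock                    # implied shader clock
print(round(hot_clock / 1e6))                 # 1505
```

Any claim that GF100’s texturing capacity comes in below GT200’s therefore implies a shader clock under about 1.5GHz.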
We have another nice hint: running the texturing hardware at half the speed of the shaders, rather than on a separate core clock, should impart a 12-14% frequency boost over the old core clock. In this case, I’m going to be optimistic, follow a hunch, and assume the basis of comparison is the GT200b chip in the GeForce GTX 285. A clock speed boost in that range would get us somewhere near 725MHz for the half-speed clock and 1450MHz for the shaders. The GF100’s various graphics units running at those speeds would yield the following peak theoretical rates.
| | GeForce GTX 285 | GF100 (projected) | Radeon HD 5870 |
|---|---|---|---|
| Process node | 55 nm @ TSMC | 40 nm @ TSMC | 40 nm @ TSMC |
| Core clock | 648 MHz | 725 MHz | 850 MHz |
| Hot clock | 1476 MHz | 1450 MHz | — |
| Memory clock | 2600 MHz | 4200 MHz | 4800 MHz |
| SP FMA rate | 0.708 Tflops | 1.49 Tflops | 2.72 Tflops |
| DP FMA rate | 88.5 Gflops | 186 Gflops* | 544 Gflops |
| Memory bus width | 512 bit | 384 bit | 256 bit |
| Memory bandwidth | 166.4 GB/s | 201.6 GB/s | 153.6 GB/s |
| ROP rate | 21.4 Gpixels/s | 34.8 Gpixels/s | 27.2 Gpixels/s |
| INT8 bilinear texel rate (half rate for FP16) | 51.8 Gtexels/s | 46.4 Gtexels/s | 68.0 Gtexels/s |
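For the curious, the GF100 column can be reproduced from the unit counts and the assumed 725MHz half-speed / 1450MHz shader clocks; again, those clocks are this article’s estimate, not an Nvidia spec.

```python
# Peak theoretical rates for GF100 at the assumed clocks.
half, hot = 725e6, 1450e6

sp_fma = 512 * 2 * hot        # 512 ALUs x 2 flops per FMA, at the hot clock
dp_fma = 64 * 2 * hot         # GeForce-limited to 64 DP FMA ops per clock
rop_rate = 48 * half          # 48 ROPs, one INT8 pixel each per clock
texel_rate = 64 * half        # 64 texture units, INT8 bilinear
bandwidth = (384 / 8) * 4.2e9 # 384-bit bus at 4.2 GT/s GDDR5

print(round(sp_fma / 1e12, 2))   # 1.48 Tflops (the table rounds to 1.49)
print(round(dp_fma / 1e9))       # 186 Gflops
print(round(rop_rate / 1e9, 1))  # 34.8 Gpixels/s
print(round(texel_rate / 1e9, 1))# 46.4 Gtexels/s
print(round(bandwidth / 1e9, 1)) # 201.6 GB/s
```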
I should pause to explain the asterisk next to the unexpectedly low estimate for the GF100’s double-precision performance. By all rights, in this architecture, double-precision math should happen at half the speed of single-precision, clean and simple. However, Nvidia has made the decision to limit DP performance in the GeForce versions of the GF100 to 64 FMA ops per clock, one fourth of what the chip can do. This is presumably a product positioning decision intended to encourage serious compute customers to purchase a Tesla version of the GPU instead. Double-precision support doesn’t appear to be of any use for real-time graphics, and I doubt many serious GPU-computing customers will want the peak DP rates without the ECC memory that the Tesla cards will provide. But a few poor hackers in Eastern Europe are going to be seriously bummed, and this does mean the Radeon HD 5870 will be substantially faster than any GeForce card at double-precision math, at least in terms of peak rates.
Otherwise, on paper, the GF100 projects to be superior to the Radeon HD 5870 only in terms of ROP rate and memory bandwidth. (Then again, it’s now suddenly notable that we’re not estimating triangle throughput. The GF100 will have a clear edge there.) That fact isn’t necessarily a calamity. The GeForce GTX 280, for example, had just over half the peak shader arithmetic rate of the Radeon HD 4870 in theory, yet the GTX 280’s delivered performance was generally superior. Much hinges on how efficiently the GF100 can perform its duties. What we can say with certainty is that the GF100 will have to achieve a new high-water mark in architectural efficiency in order to outperform the 5870 by a decent margin, something it really needs to do, given that it’s a much larger piece of silicon.
Obviously, the GF100 is a major architectural transition for Nvidia, which helps explain its rather difficult birth. The advances it promises in both GPU computing and geometry processing capabilities are pretty radical and could be well worth the pain Nvidia is now enduring, when all is said and done. The company has tackled problems in this generation of technology that its competition will have to address eventually.
In attempting to handicap the GF100’s prospects, though, I’m struggling to find a successful analog to such a late and relatively large chip. GPUs like the NV30 and R600 come to mind, along with CPUs like Prescott and Barcelona. All were major architectural revamps, and all of them conspicuously ran hot and underperformed once they reached the market. The only positive examples I can summon are perhaps the R520 (the Radeon X1800 XT wasn’t so bad once it arrived, though it wasn’t a paragon of efficiency) and AMD’s K8 processors, which were long delayed but eventually rewrote the rulebook for x86 CPUs. I suppose we’ll find out soon enough where in this spectrum the GF100 will reside.