Polymorph? Yeah, I saw The Crying Game
The biggest surprise of the day is undoubtedly the reshuffling that's happened in the GF100's geometry handling resources. When explaining the decision to undertake this reorganization, Alben pointed out that geometry performance hasn't been a major focus in GPU progress over the years. Between the GeForce FX 5800 Ultra (NV30) and the GeForce GTX 280 (GT200), he estimated pixel shading performance has mushroomed by a factor of 150. During that same span, he said, geometry performance has only tripled.
That's true in part because the hardware that handles a key part of the graphics pipeline, the setup engine, has simply not been parallelized. Instead, any progress has been supplied by increases in clock rates and in per-clock performance. The GeForce 256, for instance, could process a triangle in eight clock cycles. The GeForce FX could do so in two cycles, and the G80 in (optimally) a single cycle.
Alben and the team saw that growing gap between geometry and shader performance as a problem, and believed that the advent of DirectX 11, with its introduction of hardware-based tessellation and two new programmable pipeline stages for geometry processing, made the moment opportune for a change. Henry Moreton, an Nvidia Distinguished Engineer and geometry processing expert who authored the original rasterization microcode at SGI, characterized the earlier attempts at geometry processing in Direct3D as "train wrecks." He told us, however, that he believes in DirectX 11, "they got it right." Nvidia's response was to build what it believes is the world's first parallel architecture for geometry processing.
Each SM in the GF100 has a so-called polymorph engine. This engine facilitates a host of pre-rasterization stages of the Direct3D pipeline, including vertex and hull shaders, tessellation, and domain and geometry shaders. All four of those shader types run in the shader array, of course. Beyond that, Alben told us the block diagram above is in fact architecturally accurate: all five of the functions map to dedicated units on the chip.
DirectX 11's tessellation support is what enables the GF100's geometry-intensive focus. The basic concept has been around quite a while: to create complex geometry by combining a low-resolution polygon mesh with a mathematical description of a more complex surface.
Situating tessellation further along in the graphics pipeline has tremendous advantages over simply using more complex geometry, not least of which is a savings of bus bandwidth between the host system and the GPU. Because tessellation happens after the vertex shader stage, animations can be processed (much more efficiently) for the base polygon mesh rather than the more complex final result. The final model will then inherit all of the appropriate movement.
Tessellation isn't just for smoothing out objects, either. Once a more complex mesh has been created, it can be altered via the use of a displacement map, which imparts new depth information to the object. Displacement maps can be used to generate complex terrain or to add complexity to an in-game object or character model. Unlike currently popular techniques like bump or normal maps, displacement maps really do alter geometry, so object silhouettes are correctly modified, not just the object interiors. Thus, tessellation has the potential to improve the look of games substantially beyond the current standard.
In DX11, tessellation involves two programmable stages, hull shaders and domain shaders, sandwiched around a fixed-function geometry expansion step. Hull shaders run first on the base polygon mesh, and they do the level-of-detail calculations for the subdivision of existing polygons. The tessellators take this input and create new vertices. Domain shaders then evaluate the surfaces created by the tessellation step and can apply displacement maps. So yes, the GF100 has 16 separate, fixed-function units dedicated to tessellation, but their duties are limited to geometry expansion. The rest of the work happens in the shader array.
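To make that division of labor concrete, here is a minimal Python sketch of the three steps for a single quad patch. It is only an illustration under simplifying assumptions: the function names, the distance-based level-of-detail rule, and the bilinear surface are all hypothetical, and real tessellators also emit triangle connectivity and support fractional factors.

```python
def hull_stage(camera_distance, max_factor=16):
    """Programmable step 1: choose a tessellation factor for the patch.

    Hypothetical LOD rule: closer patches get more subdivision.
    """
    factor = int(max_factor / max(camera_distance, 1.0))
    return max(1, min(max_factor, factor))


def tessellate(factor):
    """Fixed-function step: expand the patch into (u, v) domain points.

    Emits a uniform (factor + 1) x (factor + 1) grid over the unit square.
    """
    n = factor + 1
    return [(i / factor, j / factor) for j in range(n) for i in range(n)]


def domain_stage(patch, uv, height):
    """Programmable step 2: evaluate the surface, then apply displacement.

    patch holds four (x, y, z) corners in the order p00, p10, p01, p11.
    height(u, v) stands in for a displacement-map lookup along +z.
    """
    p00, p10, p01, p11 = patch
    u, v = uv
    weights = ((1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v)
    x, y, z = (
        sum(w * p[axis] for w, p in zip(weights, (p00, p10, p01, p11)))
        for axis in range(3)
    )
    return (x, y, z + height(u, v))


# A flat unit quad viewed from a (made-up) distance of 4 units.
quad = ((0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0))
factor = hull_stage(camera_distance=4.0)
vertices = [domain_stage(quad, uv, lambda u, v: 0.25) for uv in tessellate(factor)]
```

Only `tessellate` corresponds to the fixed-function hardware here; the other two functions stand in for shader programs that would run in the shader array.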
The distribution of the polymorph engines' various duties to 16 separate units suggests broad parallelization, and so it is. Nvidia claims, for instance, that vertex fetches now happen in parallel, with up to 32 attributes being fetched per cycle across the GPU, four times the capacity of the GT200.
Managing all of the related calculations in parallel for some of the pipeline stages is no trivial task. Moreton cited an example related to tessellation, when a single SM has been given a patch that will generate thousands of triangles and thus potentially spill out of local storage. In such cases, the GF100 will evaluate patches and decompose them into smaller patches for distribution to multiple SMs across the chip. The related data are kept on die and passed to other SMs via their L1 caches. The results must still be output in the appropriate order, which requires careful scheduling and coordination across SMs. Thus, the GF100 employs a sort of coherency protocol to track geometry data at the thread level; a network in the chip distributes this information.
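The general idea of splitting work and then restoring output order can be sketched in a few lines of Python. To be clear, this is not Nvidia's mechanism, which wasn't described to us in that level of detail; the triangle budget, the round-robin assignment, and every name below are assumptions made purely for illustration.

```python
def split_patch(patch_id, triangles, budget):
    """Break a patch's triangle workload into chunks of at most `budget`.

    Each chunk is (patch_id, sequence_number, triangle_count); the sequence
    number records where the chunk belongs in the original output order.
    """
    chunks = []
    seq = 0
    remaining = triangles
    while remaining > 0:
        size = min(budget, remaining)
        chunks.append((patch_id, seq, size))
        seq += 1
        remaining -= size
    return chunks


def distribute(chunks, num_sms):
    """Hand chunks to SMs round-robin (a hypothetical scheduling policy)."""
    queues = [[] for _ in range(num_sms)]
    for i, chunk in enumerate(chunks):
        queues[i % num_sms].append(chunk)
    return queues


def reorder(completed):
    """Restore submission order from chunks that finished in any order."""
    return sorted(completed, key=lambda c: (c[0], c[1]))


# A patch expected to produce 2,500 triangles, with a 1,000-triangle budget.
chunks = split_patch(patch_id=0, triangles=2500, budget=1000)
queues = distribute(chunks, num_sms=2)
in_order = reorder(list(reversed(chunks)))  # pretend they finished backwards
```

The sequence numbers play the role of the tracking information here: the chunks can complete on any SM in any order, yet the final output still comes out in the order the API requires.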
Once the polymorph engines have finished their work, the resulting data are forwarded to the GF100's four raster engines. Optimally, each one of those engines can process a single triangle per clock cycle. The GF100 can thus claim a peak theoretical throughput rate of four polygons per cycle, although Alben called that "the impossible-to-achieve rate," since other factors will limit throughput in practice. Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable.
Sharp-eyed readers may recall that AMD claimed it had dual rasterizers upon the launch of the Cypress GPU in the Radeon HD 5870. Based on that, we expected Cypress to be able to exceed the one polygon per cycle limit, but its official specifications instead cite a peak rate of 850 million triangles per second, or one per cycle at its default 850MHz clock speed. We circled back with AMD to better understand the situation, and it's a little more complex than was originally presented. What Cypress has is dual scan converters, but it doesn't have the setup or primitive interpolation rates to support more than one triangle per cycle of throughput. As I understand it, the second scan converter is an optimization that allows the GPU to push through more pixels, in cases where the polygons are large enough. The GF100's approach is quite different and really focused on increasing geometric complexity.
Nvidia claims the higher setup rates enabled by the combination of the polymorph and raster engines allow the GF100 to achieve up to six times the performance of the Radeon HD 5870 in directed tests.
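For those who like to check the math, the throughput figures quoted above reduce to simple arithmetic. The short Python sketch below uses hypothetical helper names and leaves the GF100's clock speed out of the picture, so the GF100 side is expressed only in per-clock terms.

```python
def peak_triangle_rate(triangles_per_clock, clock_hz):
    """Peak triangles per second: per-clock rate times clock frequency."""
    return triangles_per_clock * clock_hz


def fraction_of_peak(measured_per_clock, peak_per_clock):
    """How much of the theoretical per-clock rate a measured rate achieves."""
    return measured_per_clock / peak_per_clock


# Cypress: one triangle per clock at 850MHz.
cypress_rate = peak_triangle_rate(1, 850_000_000)

# GF100: 3.2 triangles per clock in directed tests against a peak of four.
gf100_fraction = fraction_of_peak(3.2, 4)
```

By these numbers, Cypress tops out at 850 million triangles per second, while the GF100's directed-test result works out to roughly 80% of its theoretical peak.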
The firm also supplied us with frame-by-frame performance results for a selected portion of the Unigine DX11 demo that's particularly geometry intensive. The GF100 purportedly outperforms the 5870 during this sequence thanks to its superior geometry throughput.
Clearly, Nvidia has gone to great lengths to give the GF100 a parallel geometry processing architecture, and that is the distinctive and defining feature of this chip. If it works as advertised, they will have solved a difficult problem in hardware for perhaps the first time. But make no mistake about it: giving the GF100 these capabilities is a forward-looking play, not an enhancement that will pay off in the short term. You will note that none of the examples above come from a contemporary, or even future, game; the closest we get is the Unigine DX11 technology demo from a third party. In order for the GF100's geometry processing capabilities to give it a competitive advantage, games will not only have to make use of DX11 tessellation, but they will have to do so to an extreme degree, one unanticipated by existing DX11-class hardware from AMD. In other words, the usage model for GPUs will have to shift rather radically in the direction of additional geometric complexity.
Moreton expressed hope that geometry scaling techniques like dynamic level of detail algorithms could allow developers to use the GF100's power without overburdening less capable hardware. Whether or not that will happen in the GF100's lifetime remains to be seen, but Nvidia does appear to have addressed a problem that its competition will need to deal with in future architecture generations. That fact would be even more impressive if the GF100 weren't so late to the party.