Seems like we've been waiting for these new GeForces for a long time now. Nvidia gave us a first glimpse at its latest GPU architecture about half a year ago, right around the time that AMD was introducing its Radeon HD 5870. In the intervening six months, AMD has fleshed out its Radeon HD 5000 series with a full suite of DirectX 11-class GPUs and graphics cards. Meanwhile, Nvidia's GF100 chip is later than a stoner to study hall.
Fortunately, our wait is coming to an end. The GeForce GTX 470 and 480 are expected to be available soon, and we've wrangled several of them for testing. We can say with confidence that the GF100 is nothing if not fascinating, regardless of whether it succeeds or fails.
Fermi is GF100 is GTX 480
We've already covered the GF100 graphics chip and its architecture rather extensively here at TR, so we won't cover the same ground again in any great detail here. There is much to know about this three-billion-transistor behemoth, though, so we'll try to bring you up to speed in brief.
Our first look at the GF100 was focused solely on the GPU architecture, dubbed Fermi, and how that architecture has been adapted to serve the needs of the nascent market for GPU-based computing devices. Nvidia intends this chip to serve multiple markets, from consumer gaming cards to high-performance computing clusters, and the firm has committed an awful lot of time, treasure, and transistors toward making the GF100 well suited for GPU computing. Thus, the GF100 has a number of compute-centric capabilities that no other GPU can match. The highlights include improved scheduling with the ability to execute multiple, concurrent kernels; a real, fully coherent L2 cache; robust support for double-precision floating-point math; ECC-protected memories throughout the hierarchy; and a large, unified address space with support for C++-style pointers. Some of these provisions (better scheduling and caching, for instance) may have side benefits for consumers, whose GeForce cards have the potential to be especially good at GPU-based video transcoding or in-game physics simulations. Most of them, however, will be practically useless in a desktop PC, particularly since they have no utility in real-time graphics.
After we considered the compute-focused parts of the Fermi architecture, Rys reminded us all that the GF100 is still very much a graphics chip by offering his informed speculation about the specifics of its graphics hardware. Nvidia eventually confirmed many of his hunches when it revealed the details of the GF100's graphics architecture to us just after CES. As expected, the move to a DX11 feature set means the GF100 adopts nearly every major graphics feature its competitor has, but we were thrown for a loop by how extensively Nvidia's architects chose to overhaul the GF100's geometry processing capabilities. Not only does Fermi support DirectX 11's hardware tessellation, by means of which the GPU can amplify the polygon detail in a scene dramatically, but Nvidia believes it is the world's first parallel architecture for geometry processing. With quad rasterizers and a host of geometry processing engines distributed across the chip, the GF100 has the potential nearly to quadruple the number of polygons possible in real-time graphics compared to even its fastest contemporaries (GT200 and AMD's Cypress). In this way, the GF100 is just as audacious an attempt at advancing the state of the art in graphics as it is in computing.
The trouble is that ambitious architectures and major technological advances aren't easy to achieve. New capabilities add time to the design cycle and complexity to the design itself. Nvidia may well have had both eyes on the potential competition from Intel and its vaunted Larrabee project when conceiving the GF100, with too little focus on the more immediate threat from AMD. Now that the first-generation Larrabee has failed to materialize as a consumer product, the GF100 must face its sole rival in the form of the lean, efficient, and much smaller Cypress chip in AMD's new Radeons. With a 50% wider path to memory and roughly half again as many transistors as Cypress, the GF100 ought to have no trouble capturing the overall graphics performance title. Yet the GF100 project has been dogged by delays and the inevitable rumors about the problems that have caused them, among them the time-honored classics of chip yield, heat, and power issues.
In this context, we've made several attempts at handicapping the key throughput rates of GF100-based products, and we've constantly had to revise our expectations downward with each trickle of new information. Now that the flagship GeForce GTX 480 is set to ship soon, we can make one more downward revision that brings us to the final numbers.
Nvidia has elected to disable one of the GF100's 16 shader multiprocessor groups even in the top-of-the-line GTX 480. That fact suggests some yield issues with this very large chip, and indeed, the company says the concession was needed in order to ensure sufficient initial supplies of the GTX 480. This change reduces the number of ALUs or "CUDA cores" by 32 in the final product, along with one texture unit that would have been good for sampling and filtering four texels per clock. With this modification and the settling of the base GPU clock at 700MHz, the shader ALUs at twice that, and the memory clock at 924MHz, the GTX 480's key rates become apparent.
| ||GeForce GTX 285||GeForce GTX 480||Radeon HD 5870|
|Process node||55 nm @ TSMC||40 nm @ TSMC||40 nm @ TSMC|
|Core clock||648 MHz||700 MHz||850 MHz|
|"Hot" (shader) clock||1476 MHz||1401 MHz||--|
|Memory clock||1300 MHz||924 MHz||1200 MHz|
|Memory transfer rate||2600 MT/s||3696 MT/s||4800 MT/s|
|Memory bus width||512 bits||384 bits||256 bits|
|Memory bandwidth||166.4 GB/s||177.4 GB/s||153.6 GB/s|
|Peak single-precision arithmetic rate||0.708 Tflops||1.35 Tflops||2.72 Tflops|
|Peak double-precision arithmetic rate||88.5 Gflops||168 Gflops||544 Gflops|
|ROP rate||21.4 Gpixels/s||33.6 Gpixels/s||27.2 Gpixels/s|
|INT8 bilinear texel rate (half rate for FP16)||51.8 Gtexels/s||42.0 Gtexels/s||68.0 Gtexels/s|
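As a rough sanity check, the GTX 480 column can be rebuilt from the unit counts and clocks given above. The Python sketch below is our own back-of-the-envelope arithmetic, not anything supplied by Nvidia. It assumes two floating-point operations per ALU per clock (one fused multiply-add) and 48 ROPs, neither of which is stated in the text; the four transfers per memory clock are simply the ratio of the table's transfer rate to its memory clock.

```python
# Back-of-the-envelope peak rates for the GeForce GTX 480.
# Unit counts and clocks come from the article; values marked
# "assumption" are not stated there.

SM_TOTAL = 16                  # shader multiprocessors on a full GF100
SM_ENABLED = SM_TOTAL - 1      # one is disabled on the GTX 480
ALUS_PER_SM = 32               # "CUDA cores" lost per disabled multiprocessor
TEXELS_PER_CLOCK_PER_SM = 4    # texels sampled and filtered per clock
CORE_MHZ = 700
SHADER_MHZ = 1401
MEM_MHZ = 924
BUS_BITS = 384

TRANSFERS_PER_MEM_CLOCK = 4    # ratio of 3696 MT/s to 924 MHz in the table
FLOPS_PER_ALU_PER_CLOCK = 2    # assumption: one fused multiply-add per clock
ROPS = 48                      # assumption: ROP count is not given in the text


def gtx480_peak_rates():
    """Return the theoretical peak rates as a dict of plain floats."""
    alus = SM_ENABLED * ALUS_PER_SM
    transfer_rate = MEM_MHZ * TRANSFERS_PER_MEM_CLOCK          # MT/s
    return {
        "alus": alus,
        "transfer_rate_mts": transfer_rate,
        "bandwidth_gbs": transfer_rate * (BUS_BITS / 8) / 1000,
        "sp_tflops": alus * FLOPS_PER_ALU_PER_CLOCK * SHADER_MHZ / 1e6,
        "rop_gpixels": ROPS * CORE_MHZ / 1000,
        "texel_gtexels": SM_ENABLED * TEXELS_PER_CLOCK_PER_SM * CORE_MHZ / 1000,
    }


if __name__ == "__main__":
    for name, value in gtx480_peak_rates().items():
        print(f"{name}: {value:.3f}")
```

Under those assumptions the bandwidth, ROP, and texel figures land on the table's values, and the single-precision rate comes out at roughly 1.345 Tflops, which the table lists as 1.35.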
The GTX 480 is a straightforward near-doubling of peak shader arithmetic and ROP throughput rates versus the GeForce GTX 285, but memory bandwidth is only marginally higher. In theory, the GTX 480 is, amazingly, slower at texturing than the GTX 285, but Nvidia expects the GF100 to deliver higher real-world texturing performance thanks to some texture cache optimizations that should reduce conflicts during sampling.
(Those of you familiar with the Fermi architecture may wonder why double-precision math performance is only doubled versus the GTX 285. In theory, GF100 does DP math at half the single-precision rate. However, Nvidia has elected to reserve all of that double-precision power for its professional-level Tesla products. GeForce cards will get only a quarter of the DP math rate.)
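The double-precision figure follows from the same arithmetic. Here is a minimal Python sketch, again assuming one fused multiply-add (two operations) per ALU per clock, which the text does not state; the function name and its parameter are our own.

```python
# Double-precision peak for the GTX 480, following the description above.
ALUS = 480                    # 15 enabled multiprocessors x 32 ALUs
SHADER_GHZ = 1.401
FLOPS_PER_ALU_PER_CLOCK = 2   # assumption: one fused multiply-add per clock


def dp_gflops(product_fraction=0.25):
    """Peak DP Gflops; product_fraction is the share of the chip's DP rate exposed."""
    sp = ALUS * FLOPS_PER_ALU_PER_CLOCK * SHADER_GHZ   # peak single-precision Gflops
    full_dp = sp / 2                                   # GF100 runs DP at half the SP rate
    return full_dp * product_fraction


if __name__ == "__main__":
    print(f"GeForce (quarter of the DP rate): {dp_gflops():.1f} Gflops")
    print(f"Uncapped GF100 at these clocks:   {dp_gflops(1.0):.1f} Gflops")
```

With the quarter-rate cap, the result is about 168 Gflops, matching the table. The uncapped number is only what the same chip would manage at GTX 480 clocks, not a specification for any Tesla product.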
More troubling are the comparisons to the Radeon HD 5870. Yes, the GTX 480 is badly beaten in the FLOPS numbers by the 5870. That was also true of the comparison between the GTX 280 and Radeon HD 4870 in the prior generation, yet it was never a real problem, because Nvidia's scheduling methods are more efficient than AMD's. The sheer magnitude of the gap in FLOPS here is unsettling, but the areas of potentially greater concern include memory bandwidth, ROP rate, and texturing. Theoretically, the GTX 480 is only slightly quicker in the first two categories, since relatively low GDDR5 clock rates look to have hampered its memory bandwidth. And it's substantially slower than the 5870 at texturing. As we've said before, the GF100 will have to make up in efficiency what it lacks in brute-force capacity, assuming the competition leaves room for that.
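To put rough numbers on those gaps, the short Python sketch below divides the GTX 480's theoretical peaks by the 5870's, using the values from the table as given. The dictionary keys and the function name are ours, and these are ratios of paper specs, not measured performance.

```python
# Ratio of GTX 480 to Radeon HD 5870 theoretical peaks, from the table above.
GTX_480 = {"bandwidth_gbs": 177.4, "rop_gpixels": 33.6,
           "texel_gtexels": 42.0, "sp_tflops": 1.35}
HD_5870 = {"bandwidth_gbs": 153.6, "rop_gpixels": 27.2,
           "texel_gtexels": 68.0, "sp_tflops": 2.72}


def relative_rates(card, rival):
    """Return card/rival for every rate the two dicts share."""
    return {key: card[key] / rival[key] for key in card if key in rival}


if __name__ == "__main__":
    for key, ratio in relative_rates(GTX_480, HD_5870).items():
        print(f"{key}: {ratio:.2f}x the 5870")
```

On paper, that works out to roughly 15% more memory bandwidth and 24% more ROP throughput for the GTX 480, against about 62% of the 5870's texel rate and half its single-precision FLOPS.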
Clearly, the GF100 missed its targets on a number of fronts. The fact that it's still tops in memory bandwidth and ROP rate illustrates how Nvidia's strategy of building a very large chip mitigates risk, in a sense. Even when dialed back, the GF100 is in the running for the performance title. The question is whether capturing that title will be worth it: to Nvidia in terms of manufacturing costs and delays, and to the consumer in terms of power draw, heat, and noise. Last time around, the GT200 wasn't really a paragon of architectural efficiency, but Nvidia was able to reach some fairly satisfactory compromises on clock speed, power, and performance.