Going back to the sub-block discussion, it should be clear how Nvidia might scale Fermi down to smaller variants and create derivatives. Nvidia could simply (and we use that term with all due respect to the actual difficulty involved) replace the DP-capable sub block with another of the simpler blocks. They could retain everything else about the SM, including the same scheduler, near pools, register file and even the operand gather logic.
That lets them create non-DP variants for derivatives that address different markets and don't require it. Those variants would lose some of the fearsome integer rate in the process as well, because some of the integer hardware is shared with the DP silicon.
Double-precision floating point is almost exclusively a non-graphics feature of GPUs, at least at this point in time (although, of course, extended-precision computation takes place all over the chip in non-programmable forms), and so it still makes sense to remove it from derivative, smaller, cheaper parts.
This modularity might also let Nvidia attempt a part with two DP sub blocks, with fairly minimal changes to the SM front end, if they so wish. Doing so will cost them area and power, but it's something they could take on. Overtaking the per-FPU, per-clock DP rate of Intel's microprocessors has to be appealing on some level.
We put a flag in the ground for the sampler hardware and ROP rates earlier, and it's worth expanding on our thinking there. GT200 has simply phenomenal texturing ability, especially filtered, to the point where going higher per-clock per-SM will simply unbalance the chip.
A surfeit of available bilerps is never a bad thing, and high fetch rates into the shader core would keep many a developer smiling at his or her desk. However, keeping things at the GT200 level per SM is prudent in terms of area and still helps spend three billion transistors.
That's why we don't expect anything much to change in terms of raw sampler performance. What we do expect to change is image quality. All the hints point to Nvidia moving the game on a bit in terms of texel filtering quality, and they've been quietly making it clear that RV870 is beaten there. No bad thing, if it materializes.
The mentioned ROP rates will give GF100 (if we're correct about the count, of course, and Nvidia won't say) double the G80's formidable rates. Remember, that chip would have a good stab at sustaining 192 Z-only pixels per clock of output. 16 depth and 16 colour samples per clock per ROP partition are still not to be sneezed at, so twice that in GF100 will keep pixel output performance firmly in the high end and, again, nicely help account for the legion of gates crammed into the rough 500 mm² area.
Ultimately, we're somewhat sad to say, we're still in the dark about counts and rates for the graphics-focused hardware, until Nvidia opens up and reveals the final specs of the chip. We're confident enough in the numbers to publish them, though, so we'll stand by the assertions and reasoning.
Remember, too, that the graphics hardware is also backed by what you'd call the compute-focused hardware in GF100, inside the shader core and in the first two levels of the memory hierarchy. That unified L2 makes certain bits of the graphics pipeline go faster for free (thread divergence, writeback from the shader core, etc).
We haven't talked about the tessellator yet, either, but that's because we don't think there really is one in hardware. At least in terms of something you could ring on a high-res die shot and go, "yeah, that's all the tessellator logic." DX11 mandates a programmable tessellator with a number of features, but there's nothing in the spec that cries out for large amounts of fixed-function logic you can wall off and call a tessellator block.
You want support at the front end of the chip at triangle setup time (mostly from the memory controller), but then that's setup and you'd build that anyway. You want a lot of FPU (that box is ticked), and then you want a high performance memory subsystem to move the new geometry around the chip. That box gets ticked, too, with Fermi at all levels in the hierarchy, including out to DRAM.
So no tessellator as you'd traditionally think about it, but one in 'software' instead. We're unconvinced that setup rate will peak higher than one triangle per clock, despite many a heated argument in the background between ourselves about whether that'll be the case. AMD claims 850 Mtris/s on the HD 5870 is more than enough for modern graphics applications, and we believe them. So an increase in the rate there makes no sense to at least half of those of us shouting the loudest when we've talked about it. There's no chance of the rate decreasing, and we're more confident that GF100 can hit peak, compared to GT200. (Pushing 1 tri/clock on G80 and friends is quite hard.)
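The setup-rate argument is simple arithmetic, and a minimal Python sketch makes it concrete. It assumes setup runs at the core clock and peaks at one triangle per clock; the 650 MHz figure for GF100 is our own clock estimate rather than a confirmed spec, and the function name is purely illustrative.

```python
def peak_setup_rate_mtris(core_clock_mhz: float, tris_per_clock: float = 1.0) -> float:
    """Peak triangle setup rate in Mtris/s.

    Assumes the setup engine runs at the core clock and that the peak is
    `tris_per_clock` triangles every cycle (one, by default).
    """
    return core_clock_mhz * tris_per_clock


# Radeon HD 5870: 850 MHz core clock, one triangle per clock.
hd5870_mtris = peak_setup_rate_mtris(850.0)

# GF100: ASSUMED 650 MHz core clock (our estimate), one triangle per clock.
gf100_mtris = peak_setup_rate_mtris(650.0)

print(f"HD 5870 peak setup: {hd5870_mtris:.0f} Mtris/s")
print(f"GF100 peak setup:   {gf100_mtris:.0f} Mtris/s (estimated clock)")
```

At one triangle per clock, the HD 5870 figure lines up with AMD's claim, and a 650 MHz GF100 would peak lower on paper. That's why sustaining the peak matters more than the headline number.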
Rough performance estimates
Given everything we talked about above, we can start to draw some rough comparisons from GF100 to GT200 and RV870 and maybe estimate performance. We've got a clock estimate we're pretty confident in, and we've put flags in the ground for the graphics-specific units, in terms of counts, so here goes nothing.
We'll consider Radeon HD 5870 for the RV870 implementation and the GeForce GTX 285 for the GT200 product. Obviously, GT200's SP rate is for MADD, since it doesn't have support for SP FMA.
|Chip||GF100 (estimated)||GT200 (GeForce GTX 285)||RV870 (Radeon HD 5870)|
|Process node||40 nm @ TSMC||55 nm @ TSMC||40 nm @ TSMC|
|Core clock||650 MHz||648 MHz||850 MHz|
|Hot clock||1700 MHz||1476 MHz||--|
|Memory clock||4200 MHz||2600 MHz||4800 MHz|
|SP FMA rate||1.74 Tflops||0.708 Tflops||2.72 Tflops|
|DP FMA rate||870 Gflops||88.5 Gflops||544 Gflops|
|Memory bus width||384 bit||512 bit||256 bit|
|Memory bandwidth||201.6 GB/s||166.4 GB/s||153.6 GB/s|
|ROP rate||31.2 Gpixels/s||21.4 Gpixels/s||27.2 Gpixels/s|
|INT8 Bilinear texel rate (half rate for FP16)||83.2 Gtexels/s||51.8 Gtexels/s||68.0 Gtexels/s|
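For anyone wanting to check the flop rates in the table, here's a rough Python sketch of the arithmetic. The lane counts are assumptions not stated in the table itself: 512 SP lanes for GF100 with DP at half rate, 240 SP lanes and 30 DP units for GT200, and 1600 SP lanes with DP at one-fifth rate for RV870. It also assumes the Nvidia ALUs run at the hot clock and AMD's at the core clock, with two flops per lane per clock for an FMA or MADD.

```python
def peak_gflops(lanes: int, clock_mhz: float) -> float:
    """Peak rate in Gflops, counting an FMA (or MADD) as 2 flops per lane per clock."""
    return lanes * 2 * clock_mhz / 1000.0


# Lane counts are ASSUMPTIONS for illustration; clocks are taken from the table.
chips = {
    "GF100": {"sp_lanes": 512, "dp_lanes": 256, "alu_clock_mhz": 1700.0},
    "GT200": {"sp_lanes": 240, "dp_lanes": 30, "alu_clock_mhz": 1476.0},
    "RV870": {"sp_lanes": 1600, "dp_lanes": 320, "alu_clock_mhz": 850.0},
}

rates = {
    name: {
        "sp_gflops": peak_gflops(c["sp_lanes"], c["alu_clock_mhz"]),
        "dp_gflops": peak_gflops(c["dp_lanes"], c["alu_clock_mhz"]),
    }
    for name, c in chips.items()
}

for name, r in rates.items():
    print(f"{name}: SP {r['sp_gflops']:.1f} Gflops, DP {r['dp_gflops']:.1f} Gflops")
```

Under those assumptions, the results agree with the table to within rounding. The same function can be run backwards (divide a quoted flop rate by twice the lane count) to recover an implied clock.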
The GF100's architecture means the SKU we've described (the GeForce GTX 380, possibly) comfortably outruns the GeForce GTX 285 in every way, to the point that (and we really generalize here, sorry) it should usually be at least twice as fast. Of course, you can engineer situations, usually in the middle of a frame, where the GF100 won't outpace the GT200 all that much, but in the main, it should be a solid improvement. The GF100 will outpace the Radeon HD 5870 as the top single-chip graphics product of all time, assuming AMD doesn't release anything else in the interim, between now and January. Look for the margins there to be a bit more slender, and we refer you to our Radeon HD 5870 review for the figures that'll let you imagine performance versus AMD's product.
We mention again that our figures are preliminary ones based on educated guesswork and are subject to change once Nvidia talks about it properly next month. We're also aware of the recent Tesla announcements at SC'09, which give hints at rates based on flop counts, and you can use those numbers to work back to some clocks that don't match up to what we present here. Let's just say that we'd urge more focus on our clocks, at the very least for GeForce products.
While Fermi is being presented as a compute monster currently, it's still a GPU at heart, and at every level, there's consideration for how fast and how well it will draw pixels. The two things mostly go hand-in-hand with each other, so it's still exciting from a gamer's perspective to see GF100's compute core laid out on the table, if you're willing to reason about it before you know everything about the pixelated bits.
Nvidia is doing a lot with those three billion transistors in Fermi's first implementation, and the only real head-scratcher is why there's not a 512-bit memory interface, given the area. The 384-bit GDDR5 interface buys them 50% over what RV870's 256-bit bus can suck on at the same clocks, but graphics products can be bandwidth starved mid-frame a lot of the time with modern games, even with 100 GB/s or more.
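The bandwidth numbers are easy to reproduce, and a short Python sketch shows where the 50% comes from. It treats the "memory clock" in our table as the effective per-pin data rate in MT/s, which is an assumption about how those figures are quoted. The 512-bit GF100 case is purely hypothetical, included only to show what the wider interface would have bought.

```python
def bandwidth_gbs(bus_width_bits: int, data_rate_mtps: float) -> float:
    """Peak DRAM bandwidth in GB/s from bus width (bits) and effective data rate (MT/s)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mtps / 1000.0


gf100_bw = bandwidth_gbs(384, 4200.0)   # estimated GF100 configuration
gt200_bw = bandwidth_gbs(512, 2600.0)   # GeForce GTX 285
rv870_bw = bandwidth_gbs(256, 4800.0)   # Radeon HD 5870

# At the same data rate, only the bus widths matter: 384 / 256 = 1.5.
same_clock_advantage = bandwidth_gbs(384, 4800.0) / rv870_bw

# HYPOTHETICAL: a 512-bit GF100 at the same estimated 4200 MT/s.
gf100_512bit_bw = bandwidth_gbs(512, 4200.0)

print(f"GF100 {gf100_bw:.1f} GB/s, GT200 {gt200_bw:.1f} GB/s, RV870 {rv870_bw:.1f} GB/s")
print(f"GF100 vs RV870 at equal data rate: {same_clock_advantage:.2f}x")
print(f"Hypothetical 512-bit GF100: {gf100_512bit_bw:.1f} GB/s")
```

Note that at the clocks in our table, rather than at equal clocks, the GF100 advantage over RV870 is smaller than 50%, because the Radeon's memory is clocked higher.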
So we wait for hardware, more details and a chance to make it cry with lovingly hand-crafted code. You could squint and grumble at things like the in-flight thread count not being enough to cover the same amount of memory latency as GT200 at the same frequencies. But it's a deeply impressive architecture on paper, and RV870's execution resources will have to bust a gut to keep up.