It seems probable that September 2009 will be more than just a footnote in the annals of computing, especially where graphics processors are concerned. AMD made the ninth month of 2009 the one in which it announced, released, and put at retail its next-generation DX11 graphics processor: Cypress. Nvidia managed to sneak Fermi into September as well, talking about the chip publicly on the 30th.
We refer you to our initial poke at things from GTC to get you started, if you have no idea what Fermi is at this point.
If you’ve been following Fermi since it was announced, you’ll know Nvidia didn’t really talk about the specific graphics transistors in Fermi implementations. We’re going to take a stab at that, though, using information gleaned from the whitepaper, bits teased from Nvidia engineers, and educated guesswork. Remember, however, that graphics transistor chatter does ultimately remain a guess until the real details are unveiled.
“Why did Nvidia only talk about the compute side of Fermi?”, you might ask. You can’t have failed to notice the company’s push into non-graphics applications of GPUs in recent years. The G80 processor launch, along with CUDA, meant that people interested in using the GPU for non-graphics computation had a viable platform for doing so. The processors have been very capable, and CUDA offers a more direct avenue for programming them than hijacking a high-level graphics shading language.
|This industry is now mostly up and walking, after being born little more than a few years ago. We’ve seen GPU computing shed tears, start teething, and take its first baby steps.|
Since that first serious attempt at providing infrastructure for GPU compute, we’ve seen CUDA evolve heavily and the competition and infrastructure along with it: AMD’s Stream programming initiative has grown to include the GPU, OpenCL now allows developers to harness GPU power across multiple platforms, and Microsoft now has a DirectCompute portion of DirectX that leverages the devices in a more general non-graphics way. Oh, and we mustn’t forget fleeting hints at the future from the likes of Rapidmind, now a part of Intel.
GPU computing is becoming a big business, and Nvidia is working, like any company with an obligation to its employees and shareholders, to make big inroads into a new industry with serious potential for growth. This industry is now mostly up and walking, after being born little more than a few years ago. We’ve seen GPU computing shed tears, start teething, and take its first baby steps.
Against that background, Nvidia chose not to talk about the graphics transistors in Fermi at its GPU Technology Conference. Sure, some of its reservations were competitive. After all, why give AMD all it needs to estimate product-level performance months in advance? Some of it was simply because they’ve only very recently been able to run code on real hardware, after delays in production and manufacturing. Regardless, it was real hardware at GTC, you can be very sure of that.
The crux, though, is that Fermi will be the first GPU architecture that Nvidia initially pushes harder into the compute space than consumer or professional graphics. Large supercomputer contracts and other big installations are being won on the back of Fermi’s general compute strengths, as we speak. The graphics side of things is, at this point in time anyway, less important. Make no mistake, though: Fermi is still a GPU, and the G still stands resolutely for graphics.
Graphics architecture discussion has gained some new (mostly confusing and disparate, if we’re honest) terminology in the last year or so. The drive to describe massively parallel devices executing thousands of threads at a time has pushed new words, acronyms and terms to the forefront. To add to things, each vendor has a propensity to use different terms for pretty much the same things, for whatever reason.
While we can’t quite unify the terminology, we can explain what we’re going to use in this article, to cover some of the more confusing or non-obvious bits and pieces you might come across in the following pages.
Let’s start with cluster. Nvidia used to call it a TPC, AMD is keen on calling it a SIMD, but we use “cluster” to denote the granular compute processing block on a GPU, the thing vendors use to scale their architectures up and down at a basic level. A cluster is generally a collection of what the vendors like to call cores, but we’re more inclined to call the cluster the core (at least most of the time; it depends on the architecture). For example, we’d say AMD’s Cypress is a 20-cluster part, and Nvidia’s GT200 is a 10-cluster part.
Next, we’ve got the warp. AMD calls it a wavefront. Either way, these terms describe a logical collection of threads executing at any given time on the basic building blocks of a cluster. Because of the way a modern GPU renders pixels and needs to texture, threads don’t run independently at the single pixel/vertex/object level on a graphics processor. Rather, objects are grouped logically and passed through the pipeline together, a grouping dictated by the requirements of efficient hardware rendering and the underlying architecture of the GPU. So a warp is a collection of threads, each running for a single object.
So for recent Nvidia parts, a warp is 32 threads, and for recent AMD hardware, a warp is 64 threads. Branching on a GPU happens at the warp level, too.
We also talk about the “hot clock” when it comes to modern Nvidia hardware. The hot clock is the fastest clock on the chip, and it’s the one at which the compute core runs.
“Kernel” is just a nice name for the software programs that wrap execution on the GPU. Some GPUs can only run a single kernel at a time, although that is changing.
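To make the kernel and warp terms concrete, here’s a minimal, hypothetical CUDA kernel of our own (the function name and launch sizes are purely illustrative): each thread handles a single object, and because Nvidia hardware schedules threads in warps of 32, the branch inside is resolved at warp granularity, with a warp whose threads disagree running both paths with inactive lanes masked.

```cuda
// A minimal, purely illustrative kernel: one thread per object.
// Names and sizes are ours, not anything from Nvidia's whitepaper.
__global__ void scale_positive(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Branching happens per 32-thread warp: if every thread in the warp
    // agrees on the condition there's no penalty; if they diverge, both
    // paths execute with the inactive lanes masked off.
    if (data[i] > 0.0f)
        data[i] *= 2.0f;
    else
        data[i] = 0.0f;
}

// Host side, a kernel is just the __global__ function above plus the
// grid and block configuration it's launched with:
//   scale_positive<<<(n + 255) / 256, 256>>>(d_data, n);
```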
Finally, when we talk about the near (memory) pools in Fermi, we mean the register file and the L1 and L2 cache memories. Sometimes just L1, though, depending on context. To visualize what we mean, think of the memory hierarchy like a chain, from registers to L1 to L2 to the memory chips on the board, with the near pools being those nearest to the compute hardware physically.
There should be some attempt to unify the terminology at some point, since talking about threads and blocks and grids and streams and warps and wavefronts and fibers, with nuanced and inconsistent meaning to boot, is counter-productive. Hopefully this intro serves you well into the rest of the analysis.
Before we dive into the details, an overview of the Fermi architecture as a whole is prudent, and we’ll try and limit most of the comparisons to other architectures and chips to this part of our analysis.
Starting with the basic building block of Fermi, the cluster: Nvidia’s prior D3D10-class products all had multiple shader multiprocessors (SMs) in each cluster, two or three depending on the evolution of the architecture. G80 and derivatives were two-SM parts, with each SM an 8-wide vector plus special function and interpolator block, sharing sampler resources with the other SM in the cluster.
G80, the base implementation, powered products like GeForce 8800 GTX and GTS, with eight clusters, and some product-family variants disabled a cluster (and ROP partition). GT200, responsible for Nvidia’s high-end products since launch roughly 17 months ago, expanded clusters to include a third SM, with each SM further enhanced with a single double-precision (DP) float unit. That DP support let developers access this capability early, a teaser if you will, before Fermi.
Fermi now has single-SM clusters, although each SM is effectively a pair of 16-way vector sub blocks. Sub-block configuration is the key to Fermi implementation configuration. GF100, the high-end part that Nvidia outlines in the whitepaper, uses two different sub blocks in each of its sixteen SMs.
A functional block diagram of GF100, the first chip based on the Fermi architecture
Each sub block has a special function unit (SFU) that provides access to hardware specials and interpolation for the vector, taking eight clocks to service a thread group or warp. More on that later. Nvidia points out that there’s a dedicated load/store unit for the cluster, too, although you could claim that for every interesting generation of hardware they’ve created. The logic there has some unique problems to solve due to the new per-cluster arrangement and computational abilities, but it’s arguably not worth presenting as part of the block logic.
Each SM now has a 64 KiB partitioned shared memory and L1 cache store. The cache can be partitioned two ways at the thread type level (although with no programmer control as far as we’re aware, at least not yet), with either 16/48 or 48/16 KiB dedicated to shared memory and L1. Each sub block shares access to the store with the other, due to executing the same warp. The reasons for not allowing other splits are twofold: first, the desire to keep a familiar shared memory space for code designed for other multiprocessors while still letting L1 run well in parallel; and second, wiring, since area complexity becomes a real nemesis in terms of ports and what have you if more configurations are allowed.
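As a hedged illustration of why that shared memory partition matters, here’s the classic pattern it exists for: staging data in the near pool so every thread in a block can reuse it without another trip out to DRAM. The kernel and tile size below are our own sketch, not Nvidia code.

```cuda
#define TILE 256

// Block-wide sum reduction staged through the shared memory partition
// of the near pool. Purely our own illustrative sketch; launch with
// 256-thread blocks so blockDim.x matches TILE.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];   // lives in the 16 or 48 KiB shared memory split

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}
```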
|The cache design is a significant change from any Nvidia architecture to date and a key component of its compute-focused ability.|
L1 is backed by a unified L2 cache shared across each Fermi chip’s SMs. The chip uses L2 to service all memory controller I/O requests, and all L2 writes from any cluster are visible in the next clock to any other cluster on the chip. The cache design is a significant change from any Nvidia architecture to date and a key component of its compute-focused ability. Graphics is generally a highly spatially local task for the memory subsystem to manage, with access and stride patterns well known in advance (spatial locality in terms of the address space, although that’s a function of how it processes geometry and pixels). Thus, GPU caches have traditionally been small, since the spatial locality means you don’t need all data in the cache to service a complete memory request (far from it, in reality). Yet non-graphics compute can trivially introduce non-spatially local memory access and random access patterns, which the large, unified L2 is designed to accelerate.
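For a flavour of the access pattern that large, unified L2 is there for, consider a simple indexed gather, sketched below: neighbouring threads can end up reading addresses scattered all over the buffer, so the spatial locality graphics relies on is gone, and repeated or clustered lookups need somewhere on-chip to hit. The kernel is an illustrative one of ours.

```cuda
// Indexed gather: out[i] = table[idx[i]]. Adjacent threads may touch
// addresses far apart, so there's little spatial locality to exploit;
// a big unified L2 gives repeated or clustered indices a place to hit
// without a full round trip to DRAM. Illustrative kernel of ours.
__global__ void gather(const float *table, const int *idx, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[idx[i]];
}
```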
Also, all memories on the chip, from registers up to DRAM, can be protected by ECC.
Fermi overview (continued)
Scheduling-wise, there’s a global scheduler and some logic at the front end of each Fermi chip that gets things into shape for each SM’s thread scheduler. At the front end, there’s some verification and state-tracking logic, some caches, and broadcast logic to each SM (mostly for decoded instructions). Since each SM in a Fermi implementation can run a different thread type, the front end must support an instruction stream per SM.
There’s a single buffered queue for decoded instructions, despite the SM running two instructions per clock, due to how the scheduler issues. Nvidia won’t disclose queue depth, but the queue and decoder are good enough to sustain chip peak rates, of course.
The new SM scheduler can dual-issue instructions for two running warps in a clock, with each warp running for two hot clocks, coordinating the operand fetch hardware and effectively orchestrating computation completely. Nvidia says there are two schedulers, but we don’t believe them. The retire latency for a warp is half that of older D3D10-class designs, requiring twice the number of warps to hide the same memory access latency. (DRAM device latencies, of course, won’t be equal on Fermi hardware for the most part, because it now supports GDDR5.)
A mix of instructions can be run across the SM for the pair of warps, and because warps of threads are independent in terms of data and execution order, and because of the sub-block arrangement, the instruction mix is flexible. A 32-bit IMUL could be executing on one sub block for one half warp, for example, and the other sub block could be running a single-precision FMA for the other half-warp of threads.
The scheduler runs a scoreboard for all possible threads in flight, like all of Nvidia’s D3D10-class hardware, that keeps track of data dependencies and the running and coming instruction mix, so the right warps are ready at the right time. If a memory request has to be serviced by memory, the chip will park the thread until it can be serviced by L2, to avoid stalling the execution resource. The chip will also, like prior hardware, actively scale back the in-flight thread count based on scoreboard statistics such as temporary register count, instructions to be run, and predicate and branch stats.
|With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count ’em) shader-unit part, by virtue of its independent architecture.|
Prior to Fermi, compute kernels occupied the entire chip. The hardware ran a single kernel at a time, serially, with the help of the CUDA runtime. Now, compute kernels can occupy the chip at the SM level, like graphics thread types, with Fermi supporting a kernel per SM outwardly.
In general, Fermi executes just like G80. It’s a scalar architecture in that each vector lane is dedicated to computation on a single object, exploiting data parallelism and minimizing data dependency issues that can reduce efficiency in other GPU architectures. There are multiple clock domains as before: the vector SIMDs run at twice the base scheduler rate, and the base chip clock is separate from that.
Branching in Fermi happens at the warp level, and therefore with 32-object granularity. The hardware now supports predicating almost all instructions, although it’s unclear how the programmer has any direct control of that outside of CUDA.
Comparisons to Cypress have some of the numbers coming out in AMD’s favor. With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count ’em) shader-unit part, by virtue of its independent architecture. Clusters of 5-way vector processors work together in groups of 16, processing an object each per clock (at 850MHz in Radeon HD 5870 form), with a faintly amazing 20 clusters churning away in total.
The Cypress-based Radeon HD 5870
Versus RV770, Cypress’s texturing resources have doubled, ROPs have doubled, raster has potentially doubled, and various near pools in the memory hierarchy have doubled in size and effective bandwidth. Going back to the shader hardware, four of the five ALUs in the 5-way vector are capable of full IEEE754-2008 FP32 FMAs, and the T-unit has other unique characteristics. It all adds up to serious rates of everything, from shading to texture sampling to pixel output to memory bandwidth. All of that in 334 mm² at 40 nm by TSMC, using 2.15 billion transistors. The density is absolutely outrageous. Oh, and keep those figures in mind for later.
A Cypress chip up close
|RV870 really is almost a full doubling of RV770 in terms of the core execution hardware, with only the external memory bus staying put at 256 bits|
Cypr….nah, I can’t do it any longer….RV870 really is almost a full doubling of RV770 in terms of the core execution hardware, with only the external memory bus staying put at 256 bits. That can make it seem imbalanced at times, but when not memory bound, it’s a processing monster, making games go faster than ever before, with a world-class output engine, good physicals, and a nice price. Nvidia will barely sell another GT200 with that on the scene, and it’s only the compute side of AMD’s proposition that lets things down. At the hardware level, there’s not much you could point at and say, “that’s for GPU computing.” Maybe that’ll go some way toward explaining why Nvidia is pushing so hard in the same space, as they use Fermi to try and take control of things. More on that later, after a look at GF100-level specifics.
The GF100 compute core
GF100 is the codename for the biggest, double-precision-supporting variant of the Fermi architecture. It’s a D3D11-class part, comprising 16 clusters, each containing a pair of vector SIMD processors; a discrete memory pool and register banks; a dual-issue, dual-warp scheduler; sampler capability; and access to the chip’s ROPs and DRAM devices via the on-chip memory controller.
Our GF100 block diagram once again
One sub block is capable of double-precision computation. It’s a sixteen-wide DP vector unit, capable of a single FMA per clock for each of sixteen threads (half of what Nvidia calls a warp). Due to operand fetch limitations, when the DP sub block is executing, the front end to the SM can’t run the second sub block. In addition to the DP FMA (fused multiply-add), the FPU can run DP MUL and ADD in one clock. There’s a very capable integer ALU, too, capable of a single 32-bit MUL or ADD in one clock. Remember the CUDA documentation for G80 and friends that said 24-bit IMUL would go slower in future generations? Yeah.
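For context, that’s a reference to the __mul24/__umul24 intrinsics CUDA exposed for G80-class hardware, where 24-bit integer multiplies ran at full rate and full 32-bit ones didn’t; on Fermi the fast path inverts. A hedged sketch of ours:

```cuda
// Illustrative sketch of ours: the two integer multiply paths.
__global__ void imul_paths(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // G80/GT200: __mul24 was the fast integer multiply, with plain
    // 32-bit IMUL emulated and slower.
    int g80_fast = __mul24(a[i], b[i]);

    // Fermi: the full 32-bit multiply runs natively on the integer ALU,
    // so this becomes the path to prefer.
    int fermi_fast = a[i] * b[i];

    out[i] = g80_fast ^ fermi_fast;   // keep both results live
}
```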
The other sub block is a sixteen-wide, single-precision vector running computation for the other half of a thread warp. It can run a single-precision FMA per clock, or MUL or ADD. The new FMA ability of the sub blocks is important. Fusing the two ops into a single compute stage increases numerical precision over the old MADD hardware in prior D3D10-class Nvidia hardware. In graphics mode, that poses problems, since to run at the same numerical precision as GT200, Fermi chips like GF100 will be half their peak rate for MADD, because they run the old MUL and ADD in two clocks rather than one. Automatically promoting those to FMA is what the graphics driver will do, although the programmer can opt out of that if they find computational divergence that causes problems, compared to the same code on other hardware.
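To make the FMA/MADD distinction concrete: a fused multiply-add keeps the full-precision intermediate product and rounds once, whereas the old MADD rounds after the multiply and again after the add. CUDA’s float intrinsics let you express either behaviour explicitly, which is roughly what the opt-out mentioned above amounts to; this is a hedged sketch of ours, not Nvidia’s actual driver mechanism.

```cuda
// Illustrative sketch of ours: fused versus unfused multiply-add.
__global__ void fma_vs_madd(const float *a, const float *b,
                            const float *c, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    // Fused multiply-add: one rounding step, higher accuracy, and one
    // clock per lane on Fermi's single-precision sub block.
    float fused = __fmaf_rn(a[i], b[i], c[i]);

    // Separate multiply then add: two rounding steps, matching the
    // numerical behaviour of the old MADD hardware, but two clocks.
    float unfused = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]);

    out[i] = fused - unfused;   // any difference is the extra rounding
}
```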
Computational accuracy is defined by Fermi’s support for IEEE754-2008, including exception handling, and fast performance for all float specials, including NaN, positive and negative infinity, denormals and division by zero.
Each sub block has a special function unit (SFU), too. The SFU interpolates for the vector, as well as providing access to hardware special instructions such as transcendental ops (SIN, COS, LOG, etc). The DP sub-block SFU doesn’t run instructions in double precision.
The sub block and the SFU can run a number of other instructions and special computations, too, such as shifts, branch instructions, comparison ops, bit ops and cross-ALU counters. The complete mix of instructions and throughputs isn’t known, although Nvidia claims the scheduler is only really limited by operand gather and dispatch. If all data dependencies are satisfied and there are enough ports out of the register pool to service the request, the SM will generally run any mix of instructions you can think of. There’s enough operand fetch with 256 ports to run peak rate SGEMM, which will please HPC types. The maximum thread count per GF100 SM is 1536, up 50% compared to GT200.
The only limitation that appears worth talking about at this point, prior to measurement, is running the double-precision sub block. Given that operands for DP are twice as wide, it appears the operand gather hardware will consume all available register file ports, and so no other instructions can run on the other sub block.
In terms of the memory hierarchy, we’ve mentioned that all Fermi SMs contain the 64 KiB partitioned L1 and shared memory pool, backed by ECC if needed. (In fact, we’d guess that all L1 interaction is permanently protected.) Threads can access both the shared memory and L1 partitions of the near pool at the same time. Register overspill is to L1 in all Fermi implementations, and the register file is 128 KiB per SM (32 K FP32 values).
The L2 cache on GF100 is 768 KiB, making a static per-SM allocation of 48 KiB, but remember it’s completely unified. Preferred DRAM memory is GDDR5, but the memory controller supports DDR3, as well, and Nvidia will make use of the latter in the bigger 6 GiB Tesla configurations.
Fermi, and therefore GF100, virtualizes the address space of the device and the host, utilizing a hardware TLB for address conversion. Every memory in the hierarchy, from shared memory and caches up, is mapped into the virtual space and can be accessed by the shader core, including samplers. The shader core and samplers both consume the same virtual addresses, and the hardware and driver together are responsible for managing the memory maps. All addresses are 64-bit at the hardware level, and the physical address space that GF100 supports is 40-bit.
Fermi designs like GF100 also sport much improved atomic operation performance compared to currently shipping hardware. Atomic ops in a warp are coalesced and backed to the L2 on address contention, rather than the memory controller resolving them by replaying the transactions in DRAM at latencies of hundreds of clock cycles. The whitepaper’s claim of extra atomic units facilitating the new performance isn’t correct; it’s down to L2 to service those memory ops, since that’s the nearest level of the hierarchy at which writes become visible globally to the chip’s SMs, rather than DRAM.
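The classic stress case for that path is a histogram, where thousands of threads pile onto the same handful of addresses. Below is a hedged sketch of ours of what that looks like; on pre-Fermi parts each conflicting update is replayed out at DRAM, while on Fermi the contention should resolve in L2.

```cuda
#define BINS 256

// Global-memory histogram: heavy address contention on popular bins.
// Purely an illustrative kernel of ours; bins points at BINS counters
// that must be zeroed beforehand.
__global__ void histogram(const unsigned char *data, unsigned int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);
}
```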
Concurrent compute kernel support for GF100 is a claimed 16 kernels at a time, one per SM, although we believe that the final count will be capped. Earlier architectures supporting CUDA could only run a single kernel at a time, executing them serially in submission order, but GF100 has no such limitation. Kernel streams are still queued up serially by the driver for execution, like before; however, when a cluster becomes free to run a stream from another kernel, it will schedule and run it freely, the chip effectively filling up in waterfall fashion as execution resources free up and new streams are ready to go. The limit therefore comes in the number of in-flight streams that the software side will support, and we think that’s likely capped at eight.
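In CUDA terms, that means independent kernels submitted to separate streams should finally be able to overlap on the device rather than queue behind one another. A hedged host-side sketch of ours (the kernels are placeholders):

```cuda
#include <cuda_runtime.h>

// kernel_a and kernel_b stand in for two independent pieces of work.
__global__ void kernel_a(float *buf) { /* ... */ }
__global__ void kernel_b(float *buf) { /* ... */ }

void launch_concurrently(float *d_buf_a, float *d_buf_b)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Pre-Fermi hardware still runs these back to back; a Fermi part
    // with SMs to spare can start the second before the first retires.
    kernel_a<<<128, 256, 0, s0>>>(d_buf_a);
    kernel_b<<<128, 256, 0, s1>>>(d_buf_b);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```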
Speaking of DRAM, GF100 supports GDDR5 via six 64-bit channels, and the memory clock will likely be in the 4200 MHz range for the highest-end SKUs. The new memory type brings with it unique memory controller considerations, and at the basic level, I/O happens at the device at the same 64-bit granularity as previous-generation hardware.
In terms of texturing, GF100 appears to support the same per-cluster texturing ability as GT200, with eight pixels per clock of address setup and final texel address calculation and up to eight burnable bilerps per clock for filtering, although Nvidia won’t talk about it just yet. The texturing rate therefore appears to go up linearly with cluster count, at a peak of 1.6x over a similarly clocked GT200. The texture hardware supports all of D3D11’s requirements, of course, including FP32 surface filtering.
Despite the memory bus shrinking to 384 bits, GF100 appears to up the ROP count to 48 (each one able to write a quad of pixels to memory per clock), with full-rate blending for up to FP16 pixels, dropping to half rate for FP32. For comparison, that’s the same blend rate as GT200 and twice that per ROP per clock compared to G80. D3D11 support (and thus D3D10.1) also means a subtle change to the ROP hardware.
D3D10.1 brought about a requirement for application control over the subpixel coverage mask required for generating multisamples, and various other tweaks required for fast multisample readback into the shader core. Control over the subpixel mask was the biggest barrier to older Nvidia D3D10 hardware supporting D3D10.1, and it’s only recently that Nvidia has announced chips with the capability (nearly three years since Microsoft ratified the specification).
|Put simply, it’s not the biggest consumer ASIC ever in terms of area, but it’s certainly the biggest in terms of transistor count, beating the RV870 by nearly a whole RV770. Just think about that for a second.|
Clock rates are currently forecast to be in the “high-end G92 range,” so we’ll pin that at around 650 MHz for the base clock domain and 1700 MHz for the hot clock (so 850 MHz for the bulk of the SM hardware, including the 64 KiB near pool and the register file).
At the manufacturing level, GF100 is a three-billion-transistor part manufactured by TSMC on its 40G process node (40-nm average feature size, 300-mm wafer). Nvidia is coy on die size for the time being, with best guesses putting it a touch under 500 mm². Put simply, it’s not the biggest consumer ASIC ever in terms of area (that goes to the original 65-nm GT200 at 576 mm²), but it’s certainly the biggest in terms of transistor count, beating the RV870 by nearly a whole RV770. Just think about that for a second.
Nvidia and TSMC have clearly been a little shy about pushing GF100 to the reticle limit of the process this time around. Nvidia is balancing the area cost with everything else it needs to consider, mostly financial constraints. The final area is a complex interplay of the process, wafer start costs, margins, expected volume and market size, clock rates, voltage, and plenty more besides.
Estimating performance is folly at this point, but our suite of binaries to figure out the details is shaping up nicely, and should be more than ready by the time the first GF100-based product ships.
Going back to the sub-block discussion, it should be clear how Nvidia might scale Fermi down to smaller variants and create derivatives. Nvidia could simply (and we use that term with all due respect to the actual difficulty involved) replace the DP-capable sub block with another of the simpler blocks. They could retain everything else about the SM, including the same scheduler, near pools, register file and even the operand gather logic.
That lets them create non-DP variants, losing some of the fearsome integer rate in the process as well (some of the integer hardware is shared with the DP silicon, necessitating that), for derivatives that don’t require it, because they’re addressing different markets.
Double-precision floating point is almost exclusively a non-graphics feature of GPUs, at least at this point in time (although, of course, extended-precision computation takes place all over the chip in non-programmable forms), and so it still makes sense to remove it from derivative, smaller, cheaper parts.
This modularity might also let Nvidia attempt a part with two DP sub blocks, with fairly minimal changes to the SM front end, if they so wish. Doing so will cost them area and power, but it’s something they could take on. Overtaking the per-FPU, per-clock DP rate of Intel’s microprocessors has to be appealing on some level.
We put a flag in the ground for the sampler hardware and ROP rates earlier, and it’s worth expanding on our thinking there. GT200 has simply phenomenal texturing ability, especially filtered, to the point where going higher per-clock per-SM will simply unbalance the chip.
|GT200 has simply phenomenal texturing ability, especially filtered, to the point where going higher per-clock per-SM will simply unbalance the chip.|
A surfeit of available bilerps is never a bad thing, and high fetch rates into the shader core would keep many a developer smiling at his or her desk. However, keeping things at the GT200 level per SM is prudent in terms of area and still helps spend three billion transistors.
That’s why we don’t expect anything much to change in terms of raw sampler performance. What we do expect to change is image quality. All the hints point to Nvidia moving the game on a bit in terms of texel filtering quality, and they’ve been quietly making it clear that RV870 is beaten there. No bad thing, if it materializes.
The mentioned ROP rates will give GF100 (if we’re correct about the count, of course, and Nvidia won’t say) double the G80’s formidable rates. Remember, that chip would have a good stab at sustaining 192 Z-only pixels per clock of output. 16 depth and 16 colour samples per clock per ROP partition are still not to be sneezed at, so twice that in GF100 will keep pixel output performance firmly in the high end and, again, nicely help account for the legion of gates crammed into the rough 500 mm² area.
Ultimately, we’re somewhat sad to say, we’re still in the dark about counts and rates for the graphics-focused hardware, until Nvidia opens up and reveals the final specs of the chip. We’re confident enough in the numbers to publish them, though, so we’ll stand by the assertions and reasoning.
Remember, too, that the graphics hardware is also backed by what you’d call the compute-focused hardware in GF100, inside the shader core and in the first two levels of the memory hierarchy. That unified L2 makes certain bits of the graphics pipeline go faster for free (thread divergence, writeback from the shader core, etc).
We haven’t talked about the tessellator yet, either, but that’s because we don’t think there really is one in hardware. At least not in terms of something you could ring on a high-res die shot and go, “yeah, that’s all the tessellator logic.” DX11 mandates a programmable tessellator with a number of features, but there’s nothing in the spec that cries out for large amounts of fixed-function logic you can wall off and call a tessellator block.
|It’s still exciting from a gamer’s perspective to see GF100’s compute core laid out on the table, if you’re willing to reason about it before you know everything about the pixelated bits.|
You want support at the front end of the chip at triangle setup time (mostly from the memory controller), but then that’s setup and you’d build that anyway. You want a lot of FPU (that box is ticked), and then you want a high performance memory subsystem to move the new geometry around the chip. That box gets ticked, too, with Fermi at all levels in the hierarchy, including out to DRAM.
So no tessellator as you’d traditionally think about it, but one in ‘software’ instead. We’re unconvinced that setup rate will peak higher than one triangle per clock, despite many a heated argument in the background between ourselves about whether that’ll be the case. AMD claims 850 Mtris/s on the HD 5870 is more than enough for modern graphics applications, and we believe them. So an increase in the rate there makes no sense to at least half of those of us shouting the loudest when we’ve talked about it. There’s no chance of the rate decreasing, and we’re more confident that GF100 can hit peak, compared to GT200. (Pushing 1 tri/clock on G80 and friends is quite hard.)
Rough performance estimates
Given everything we talked about above, we can start to draw some rough comparisons from GF100 to GT200 and RV870, and maybe estimate performance. We’ve got a clock estimate we’re pretty confident in, and we’ve put flags in the ground for the graphics-specific units in terms of counts, so here goes nothing.
We’ll consider Radeon HD 5870 for the RV870 implementation and the GeForce GTX 285 for the GT200 product. Obviously, GT200’s SP rate is for MADD, since it doesn’t have support for SP FMA.
| | GF100 (estimated) | GeForce GTX 285 (GT200) | Radeon HD 5870 (RV870) |
| --- | --- | --- | --- |
| Process node | 40 nm @ TSMC | 55 nm @ TSMC | 40 nm @ TSMC |
| Core clock | 650 MHz | 648 MHz | 850 MHz |
| Hot clock | 1700 MHz | 1476 MHz | — |
| Memory clock | 4200 MHz | 2600 MHz | 4800 MHz |
| SP FMA rate | 1.74 Tflops | 0.708 Tflops | 2.72 Tflops |
| DP FMA rate | 870 Gflops | 88.5 Gflops | 544 Gflops |
| Memory bus width | 384 bit | 512 bit | 256 bit |
| Memory bandwidth | 201.6 GB/s | 166.4 GB/s | 153.6 GB/s |
| ROP rate | 31.2 Gpixels/s | 21.4 Gpixels/s | 27.2 Gpixels/s |
| INT8 bilinear texel rate (half rate for FP16) | 83.2 Gtexels/s | 51.8 Gtexels/s | 68.0 Gtexels/s |
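If you want to check our working, the arithmetic is simple, and all of it rests on our assumed clocks and unit counts: the SP figure is lane count × two flops per FMA × hot clock, so 512 × 2 × 1.7 GHz ≈ 1.74 Tflops for GF100 (and 1600 × 2 × 850 MHz = 2.72 Tflops for Cypress); the DP figure uses GF100’s 256 DP-capable lanes at the same clock for roughly 870 Gflops; bandwidth is bus width × memory data rate, so 384 bits × 4200 MHz ÷ 8 = 201.6 GB/s; and the ROP and texel rates are unit counts times the base clock, 48 × 650 MHz ≈ 31.2 Gpixels/s and 128 bilerps × 650 MHz ≈ 83.2 Gtexels/s.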
The GF100’s architecture means the SKU we’ve described (the GeForce GTX 380, possibly) comfortably outruns the GeForce GTX 285 in every way, to the point that (and we really generalize here, sorry) it should usually be at least twice as fast. Of course, you can engineer situations, usually in the middle of a frame, where the GF100 won’t outpace the GT200 all that much, but in the main, it should be a solid improvement. The GF100 will also outpace the Radeon HD 5870 to become the top single-chip graphics product of all time, assuming AMD doesn’t release anything else between now and January. Look for the margins there to be a bit more slender, and we refer you to our Radeon HD 5870 review for the figures that’ll let you imagine performance versus AMD’s product.
We mention again that our figures are preliminary ones based on educated guesswork and are subject to change once Nvidia talks about it properly next month. We’re also aware of the recent Tesla announcements at SC’09, which give hints at rates based on flop counts, and you can use those numbers to work back to some clocks that don’t match up to what we present here. Let’s just say that we’d urge more focus on our clocks, at the very least for GeForce products.
While Fermi is being presented as a compute monster currently, it’s still a GPU at heart, and at every level, there’s consideration for how fast and how well it will draw pixels. The two things mostly go hand-in-hand with each other, so it’s still exciting from a gamer’s perspective to see GF100’s compute core laid out on the table, if you’re willing to reason about it before you know everything about the pixelated bits.
Nvidia is doing a lot with those three billion transistors in Fermi’s first implementation, and the only real head-scratcher is why there’s not a 512-bit memory interface, given the area. GDDR5 buys them 50% over what RV870 can suck on at the same clocks, but modern graphics products can be bandwidth starved mid-frame a lot of the time with modern games, even with 100 GB/s or more.
So we wait for hardware, more details and a chance to make it cry with lovingly hand-crafted code. You could squint and grumble at things like the in-flight thread count not being enough to cover the same amount of memory latency as GT200 at the same frequencies. But it’s a deeply impressive architecture on paper, and RV870’s execution resources will have to bust a gut to keep up.