Inside Fermi’s graphics architecture

It seems probable that September 2009 will be more than just a footnote in the annals of computing, especially where graphics processors are concerned. AMD made the ninth month of the ninth year in the twenty-first century the one in which it announced, released, and put on retail shelves its next-generation DX11 graphics processor: Cypress. Nvidia managed to sneak Fermi into September 2009 as well, talking about the chip publicly on the 30th.

We refer you to our initial poke at things from GTC to get you started, if you have no idea what Fermi is at this point.

If you’ve been following Fermi since it was announced, you’ll know Nvidia didn’t really talk about the specific graphics transistors in Fermi implementations. We’re going to take a stab at that, though, using information gleaned from the whitepaper, bits teased from Nvidia engineers, and educated guesswork. Remember, however, that graphics transistor chatter does ultimately remain a guess until the real details are unveiled.

“Why did Nvidia only talk about the compute side of Fermi?”, you might ask. You can’t have failed to notice the company’s push into non-graphics applications of GPUs in recent years. The G80 processor launch, along with CUDA, has meant that people interested in using the GPU for non-graphics computation have had a viable platform for doing so. The processors have been very capable, and CUDA offers a more direct avenue for programming them than hijacking a high-level graphics shading language.

This industry is now mostly up and walking, after being born little more than a few years ago. We’ve seen GPU computing shed tears, start teething, and take its first baby steps.

Since that first serious attempt at providing infrastructure for GPU compute, we’ve seen CUDA evolve heavily and the competition and infrastructure along with it: AMD’s Stream programming initiative has grown to include the GPU, OpenCL now allows developers to harness GPU power across multiple platforms, and Microsoft now has a DirectCompute portion of DirectX that leverages the devices in a more general non-graphics way. Oh, and we mustn’t forget fleeting hints at the future from the likes of Rapidmind, now a part of Intel.

GPU computing is becoming a big business, and Nvidia is working, like any company with an obligation to its employees and shareholders, to make big inroads into a new industry with serious potential for growth. This industry is now mostly up and walking, after being born little more than a few years ago. We’ve seen GPU computing shed tears, start teething, and take its first baby steps.

Against that background, Nvidia chose not to talk about the graphics transistors in Fermi at its GPU Technology Conference. Sure, some of its reservations were competitive. After all, why give AMD all it needs to estimate product-level performance months in advance? Some of it was simply because they’ve only very recently been able to run code on real hardware, after delays in production and manufacturing. Regardless, it was real hardware at GTC, you can be very sure of that.

The crux, though, is that Fermi will be the first GPU architecture that Nvidia initially pushes harder into the compute space than consumer or professional graphics. Large supercomputer contracts and other big installations are being won on the back of Fermi’s general compute strengths, as we speak. The graphics side of things is, at this point in time anyway, less important. Make no mistake, though: Fermi is still a GPU, and the G still stands resolutely for graphics.

Terminology introduction

Graphics architecture discussion has gained some new—mostly confusing and disparate, if we’re honest—terminology in the last year or so. The drive to describe massively parallel devices executing thousands of threads at a time has forced the new words, acronyms and terms to the forefront. To add to things, each vendor has a propensity to use different terms for pretty much the same things, for whatever reason.

While we can’t quite unify the terminology, we can explain what we’re going to use in this article, to cover some of the more confusing or non-obvious bits and pieces you might come across in the following pages.

Let’s start with cluster. Nvidia used to call it a TPC, AMD is keen on calling it a SIMD, but we use “cluster” to denote the granular compute processing block on a GPU, the thing vendors use to scale their architectures up and down at a basic level. A cluster is generally a collection of what the vendors like to call cores, but we’re more inclined to call the cluster the core (at least most of the time; it depends on the architecture). For example, we’d say AMD’s Cypress is a 20-cluster part, and Nvidia’s GT200 is a 10-cluster part.

Next, we’ve got the warp. AMD calls it a wavefront. Either way, these terms describe a logical collection of threads executing at any given time on the basic building blocks of a cluster. Because of the way a modern GPU renders pixels and fetches texture data, threads don’t run independently at the single pixel/vertex/object level. Rather, objects are grouped logically and passed through the pipeline together, a requirement of efficient hardware rendering and of the underlying architecture of the GPU. So a warp is a collection of threads, each one running for a single object.

So for recent Nvidia parts, a warp is 32 threads, and for recent AMD hardware, a warp is 64 threads. Branching on a GPU happens at the warp level, too.
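
As an aside for the CUDA-inclined, the warp width isn’t something you need to hard-code. The following is a minimal, illustrative host-side sketch of our own (not vendor sample code) that queries it, along with a couple of the other per-chip figures we lean on in this article:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Warp size: %d threads\n", prop.warpSize);                 // 32 on recent Nvidia parts
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```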

We also talk about the “hot clock” when it comes to modern Nvidia hardware. The hot clock is the fastest clock on the chip, and it’s the one at which the compute core runs.

“Kernel” is just a nice name for the software programs that wrap execution on the GPU. Some GPUs can only run a single kernel at a time, although that is changing.
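
For the avoidance of doubt, here’s what “kernel” boils down to in CUDA terms. This is a minimal sketch of our own, not vendor code: the __global__ function is the kernel, and the <<<grid, block>>> launch is what it means to run one on the GPU.

```cuda
#include <cuda_runtime.h>

// The kernel: one thread scales one element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Launch: enough 256-thread blocks to cover n elements.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```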

Finally, when we talk about the near (memory) pools in Fermi, we mean the register file and the L1 and L2 cache memories. Sometimes just L1, though, depending on context. To visualize what we mean, think of the memory hierarchy like a chain, from registers to L1 to L2 to the memory chips on the board, with the near pools being those nearest to the compute hardware physically.

There should be some attempt to unify the terminology at some point, since talking about threads and blocks and grids and streams and warps and wavefronts and fibers, with nuanced and inconsistent meaning to boot, is counter-productive. Hopefully this intro serves you well into the rest of the analysis.

Fermi overview

Before we dive into the details, an overview of the Fermi architecture as a whole is prudent, and we’ll try and limit most of the comparisons to other architectures and chips to this part of our analysis.

Starting with the basic building block of Fermi, the cluster: Nvidia’s prior D3D10-class products all had multiple streaming multiprocessors (SMs) in each cluster, either two or three depending on the evolution of the architecture. G80 and its derivatives were two-SM parts, with each SM an 8-wide vector plus a special function and interpolator block, sharing sampler resources with the other SM in the cluster.

G80, the base implementation, powered products like GeForce 8800 GTX and GTS, with eight clusters, and some product-family variants disabled a cluster (and ROP partition). GT200, responsible for Nvidia’s high-end products since launch roughly 17 months ago, expanded clusters to include a third SM, with each SM further enhanced with a single double-precision (DP) float unit. That DP support let developers access this capability early, a teaser if you will, before Fermi.

Fermi now has single-SM clusters, although each SM is effectively a pair of 16-wide vector sub blocks. That sub-block arrangement is the key to how Fermi implementations are configured. GF100, the high-end part that Nvidia outlines in the whitepaper, uses two different sub blocks in each of its sixteen SMs.

A functional block diagram of GF100, the first chip based on the Fermi architecture

Each sub block has a special function unit (SFU) that provides access to hardware specials and interpolation for the vector, taking eight clocks to service a thread group or warp. More on that later. Nvidia points out that there’s a dedicated load/store unit for the cluster, too, although you could claim that for every interesting generation of hardware they’ve created. The logic there has some unique problems to solve due to the new per-cluster arrangement and computational abilities, but it’s arguably not worth presenting as part of the block logic.

Each SM now has a 64 KiB partitioned shared memory and L1 cache store. The cache can be partitioned two ways at the thread-type level (although with no programmer control as far as we’re aware, at least not yet), with either 16/48 or 48/16 KiB dedicated to shared memory and L1. The two sub blocks share access to the store, since they execute the same warp. The reasons for not allowing other splits are twofold: the desire to keep a familiar shared memory space for code written for earlier multiprocessors while still letting L1 run well in parallel; and wiring limits, with area complexity becoming a real nemesis in terms of ports and the like.
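
To make the shared-memory half of that near pool a bit more concrete, here’s a small illustrative kernel of our own (not vendor code) that stages a block’s data in shared memory before using it; on the 48 KiB split, a block could stage up to 48 KiB this way.

```cuda
__global__ void reverse_block(const float *in, float *out)
{
    __shared__ float tile[256];   // lives in the shared-memory partition of the 64 KiB pool

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];    // each thread stages one element
    __syncthreads();              // every warp in the block must arrive before we read back

    // Read back in reversed order within the block; assumes 256-thread blocks.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```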

The cache design is a significant change from any Nvidia architecture to date and a key component of its compute-focused ability.

L1 is backed by a unified L2 cache shared across each Fermi chip’s SMs. The chip uses L2 to service all memory controller I/O requests, and all L2 writes from any cluster are visible in the next clock to any other cluster on the chip. The cache design is a significant change from any Nvidia architecture to date and a key component of its compute-focused ability. Graphics is generally a highly spatially local task for the memory subsystem to manage, with access and stride patterns well known in advance (spatial locality in terms of the address space, although that’s a function of how it processes geometry and pixels). Thus, GPU caches have traditionally been small, since the spatial locality means you don’t need all data in the cache to service a complete memory request (far from it, in reality). Yet non-graphics compute can trivially introduce non-spatially local memory access and random access patterns, which the large, unified L2 is designed to accelerate.
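
A trivial illustration of the kind of access the L2 is there to help with, using a stand-in kernel of our own: a data-dependent gather, where the addresses aren’t known until runtime and adjacent threads can land on wildly different cache lines.

```cuda
__global__ void gather(const float *table, const int *indices, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = table[indices[i]];  // effectively random reads into 'table';
                                     // a small graphics-style cache copes badly here
}
```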

Also, all memories on the chip, from registers up to DRAM, can be protected by ECC.

Fermi overview (continued)

Scheduling-wise, there’s a global scheduler and some logic at the front end of each Fermi chip that gets things into shape for each SM’s thread scheduler. At the front end, there’s some verification and state-tracking logic, some caches, and broadcast logic to each SM (mostly for decoded instructions). Since each SM in a Fermi implementation can run a different thread type, the front end must support an instruction stream per SM.

There’s a single buffered queue for decoded instructions, despite the SM running two instructions per clock, due to how the scheduler issues. Nvidia won’t disclose queue depth, but the queue and decoder are good enough to sustain chip peak rates, of course.

The new SM scheduler can dual-issue instructions for two running warps in a clock, with each warp running for two hot clocks (a 32-thread warp over a 16-wide sub block), coordinating the operand fetch hardware and effectively orchestrating computation. Nvidia says there are two schedulers, but we don’t believe them. The retire latency for the warp is half that of older D3D10-class designs, requiring twice the number of warps to hide the same memory access latency. (DRAM device latencies, of course, won’t be equal on Fermi hardware for the most part, because it now supports GDDR5.)

A mix of instructions can be run across the SM for the pair of warps, and because warps of threads are independent in terms of data and execution order, and because of the sub-block arrangement, the instruction mix is flexible. A 32-bit IMUL could be executing on one sub block for one half warp, for example, and the other sub block could be running a single-precision FMA for the other half-warp of threads.

The scheduler runs a scoreboard for all possible threads in flight, like all of Nvidia’s D3D10-class hardware, that keeps track of data dependencies and the running and upcoming instruction mix, so the right warps are ready at the right time. If a memory request has to be serviced by DRAM, the chip will park the warp until the data can be supplied by L2, to avoid stalling the execution resources. The chip will also, like prior hardware, actively scale back the in-flight thread count based on scoreboard statistics such as temporary register count, instructions to be run, and predicate and branch stats.

With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count ’em) shader-unit part, by virtue of its independent architecture.

Prior to Fermi, compute kernels occupied the entire chip. The hardware ran a single kernel at a time, serially, with the help of the CUDA runtime. Now, compute kernels can occupy the chip at the SM level, like graphics thread types, with Fermi outwardly supporting a kernel per SM.

In general, Fermi executes just like G80. It’s a scalar architecture in that each vector lane is dedicated to computation on a single object, exploiting data parallelism and minimizing data dependency issues that can reduce efficiency in other GPU architectures. There are multiple clock domains as before, the vector SIMDs run at twice the base scheduler rate as before, and the base chip clock is separate from that.

Branching in Fermi happens at the warp level, and therefore with 32-object granularity. The hardware now supports predicating almost all instructions, although it’s unclear how the programmer has any direct control of that outside of CUDA.
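
A quick illustrative sketch of what 32-object granularity means in practice (our own example code): if the lanes of a warp disagree on a condition, both paths execute with the inactive lanes masked or predicated off, so divergence inside a warp costs time even though each thread only runs one side.

```cuda
__global__ void diverge(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    if (i & 1)                      // odd and even lanes split every warp down the middle
        data[i] = 0.0f;             // half the warp takes this path...
    else
        data[i] = data[i] * 0.5f;   // ...and waits while the other half takes this one
}
```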

Comparisons to Cypress have some of the numbers coming out in AMD’s favor. With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count ’em) shader-unit part, by virtue of its independent architecture. Clusters of 5-way vector processors work together in groups of 16, processing an object each per clock (at 850MHz in Radeon HD 5870 form), with a faintly amazing 20 clusters churning away in total.

The Cypress-based Radeon HD 5870

Versus RV770, Cypress’s texturing resources have doubled, ROPs have doubled, raster has potentially doubled, and various near pools in the memory hierarchy have doubled in size and effective bandwidth. Going back to the shader hardware, four of the five ALUs in the 5-way vector are capable of full IEEE754-2008 FP32 FMAs, and the T-unit has other unique characteristics. It all adds up to serious rates of everything, from shading to texture sampling to pixel output to memory bandwidth. All of that in 334 mm² at 40 nm by TSMC, using 2.15 billion transistors. The density is absolutely outrageous. Oh, and keep those figures in mind for later.

A Cypress chip up close

RV870 really is almost a full doubling of RV770 in terms of the core execution hardware, with only the external memory bus staying put at 256 bits

Cypr….nah, I can’t do it any longer….RV870 really is almost a full doubling of RV770 in terms of the core execution hardware, with only the external memory bus staying put at 256 bits. That can make it seem imbalanced at times, but when not memory bound, it’s a processing monster, making games go faster than ever before, with a world-class output engine, good physicals, and a nice price. Nvidia will barely sell another GT200 with that on the scene, and it’s only the compute side of AMD’s proposition that lets things down. At the hardware level, there’s not much that you could point at and say, “that’s for GPU computing.” Maybe that’ll go some way toward explaining why Nvidia is pushing so hard in the same space, as they use Fermi to try and take control of things. More on that later, after a look at GF100-level specifics.

The GF100 compute core

GF100 is the codename for the biggest, double-precision-supporting variant of the Fermi architecture. It’s a D3D11-class part comprising 16 clusters, each containing a pair of vector SIMD processors; a discrete memory pool and register banks; a dual-issue, dual-warp scheduler; sampler capability; and access to the chip’s ROPs and DRAM devices via the on-chip memory controller.

Our GF100 block diagram once again

One sub block is capable of double-precision computation. It’s a sixteen-wide DP vector unit, capable of a single FMA per clock for each of sixteen threads (half of what Nvidia calls a warp). Due to operand fetch limitations, when the DP sub block is executing, the front end to the SM can’t run the second sub block. In addition to the DP FMA (fused multiply-add), the FPU can run DP MUL and ADD in one clock. There’s a very capable integer ALU, too, capable of a single 32-bit MUL or ADD in one clock. Remember the CUDA documentation for G80 and friends that said 24-bit IMUL would go slower in future generations? Yeah.
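
For anyone who took that CUDA documentation advice to heart, here’s the contrast in code, as an illustrative sketch of our own under the assumption that the whitepaper’s integer claims hold: the __mul24 intrinsic that was the fast path on G80-class parts, next to the plain 32-bit multiply that Fermi’s integer ALU now handles at full rate.

```cuda
__global__ void index_math(int *out, int stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    int legacy = __mul24(i, stride);   // the recommended fast path on compute 1.x hardware
    int native = i * stride;           // a full 32-bit IMUL, single-clock per lane on Fermi
    out[i] = legacy ^ native;          // identical results while both operands fit in 24 bits
}
```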

The other sub block is a sixteen-wide, single-precision vector running computation for the other half of a thread warp. It can run a single-precision FMA per clock, or MUL or ADD. The new FMA ability of the sub blocks is important. Fusing the two ops into a single compute stage increases numerical precision over the old MADD hardware in prior D3D10-class Nvidia hardware. In graphics mode, that poses problems, since to run at the same numerical precision as GT200, Fermi chips like GF100 will be half their peak rate for MADD, because they run the old MUL and ADD in two clocks rather than one. Automatically promoting those to FMA is what the graphics driver will do, although the programmer can opt out of that if they find computational divergence that causes problems, compared to the same code on other hardware.
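
To illustrate the rounding difference (our own sketch, using CUDA’s rounding-mode intrinsics): a fused multiply-add rounds once, whereas a separate multiply and add round twice, so the two can disagree in the last bit, which is exactly the divergence a programmer might opt out of FMA promotion to avoid.

```cuda
__global__ void fma_vs_madd(const float *a, const float *b, const float *c,
                            float *fused, float *split, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    fused[i] = __fmaf_rn(a[i], b[i], c[i]);            // one rounding step (Fermi-style FMA)
    split[i] = __fadd_rn(__fmul_rn(a[i], b[i]), c[i]); // two rounding steps (GT200-style MUL then ADD);
                                                       // the _rn intrinsics stop the compiler fusing them
}
```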

Computational accuracy is defined by Fermi’s support for IEEE754-2008, including exception handling, with fast performance for all the float specials: NaN, positive and negative infinity, denormals, and division by zero.

Each sub block has a special function unit (SFU), too. The SFU interpolates for the vector, as well as providing access to hardware special instructions such as transcendental ops (SIN, COS, LOG, etc). The DP sub-block SFU doesn’t run instructions in double precision.

The sub block and the SFU can run a number of other instructions and special computations, too, such as shifts, branch instructions, comparison ops, bit ops and cross-ALU counters. The complete mix of instructions and throughputs isn’t known, although Nvidia claims the scheduler is only really limited by operand gather and dispatch. If all data dependencies are satisfied and there are enough ports out of the register pool to service the request, the SM will generally run any mix of instructions you can think of. There’s enough operand fetch with 256 ports to run peak rate SGEMM, which will please HPC types. The maximum thread count per GF100 SM is 1536, up 50% compared to GT200.

The only limitation that appears worth talking about at this point, prior to measurement, is running the double-precision sub block. Given that operands for DP are twice as wide, it appears the operand gather hardware will consume all available register file ports, and so no other instructions can run on the other sub block.

In terms of the memory hierarchy, we’ve mentioned that all Fermi SMs contain the 64 KiB partitioned L1 and shared memory pool, backed by ECC if needed. (In fact, we’d guess that all L1 interaction is permanently protected.) Threads can access both the shared memory and L1 partitions of the near pool at the same time. Register overspill is to L1 in all Fermi implementations, and the register file is 128 KiB per SM (32 K FP32 values).

The L2 cache on GF100 is 768 KiB, making a static per-SM allocation of 48 KiB, but remember it’s completely unified. Preferred DRAM memory is GDDR5, but the memory controller supports DDR3, as well, and Nvidia will make use of the latter in the bigger 6 GiB Tesla configurations.

Fermi, and therefore GF100, virtualizes the address space of the device and the host, utilizing a hardware TLB for address conversion. Every memory in the hierarchy, from shared memory and caches up, is mapped into the virtual space and can be accessed by the shader core, including samplers. The shader core and samplers both consume the same virtual addresses, and the hardware and driver together are responsible for managing the memory maps. All addresses are 64-bit at the hardware level, and the physical address space that GF100 supports is 40-bit.

Fermi designs like GF100 also sport much improved atomic operation performance compared to currently shipping hardware. Atomic ops in a warp are coalesced and backed by the L2 on address contention, rather than the memory controller resolving them by replaying the transactions in DRAM at latencies of hundreds of clock cycles. The whitepaper’s claim of extra atomic units facilitating the new performance isn’t correct; it’s down to L2 to service those memory ops rather than DRAM, since L2 is the nearest part of the hierarchy at which writes become visible to all of the chip’s SMs.
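
The contended case that improvement targets looks something like this sketch of our own: many threads banging on a handful of counters, where addresses collide constantly and resolving each collision in DRAM would cost hundreds of cycles a time.

```cuda
__global__ void histogram256(const unsigned char *values, unsigned int *bins, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[values[i]], 1u);  // heavy address contention across only 256 bins
}
```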

Concurrent compute kernel support for GF100 is a claimed 16 kernels at a time, one per SM, although we believe that the final count will be capped. Earlier architectures supporting CUDA could only run a single kernel at a time, executing them serially in submission order, but GF100 has no such limitation. Kernel streams are still queued up serially by the driver for execution, like before; however, when a cluster becomes free to run a stream from another kernel, it will schedule and run it freely, the chip effectively filling up in waterfall fashion as execution resources free up and new streams are ready to go. The limit therefore comes in the number of in-flight streams that the software side will support, and we think that’s likely capped at eight.
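
In CUDA terms, the independent work comes from separate streams. A minimal sketch of our own is below; the two launches have no ordering between them, so hardware that can host more than one kernel, as GF100 claims to, is free to run them side by side, whereas earlier parts would simply run them back to back.

```cuda
#include <cuda_runtime.h>

// Two independent launches of a trivial kernel, stand-ins for real work.
__global__ void fill(float *p, float v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main(void)
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // No dependency between the streams, so these may overlap on capable hardware.
    fill<<<(n + 255) / 256, 256, 0, s0>>>(a, 1.0f, n);
    fill<<<(n + 255) / 256, 256, 0, s1>>>(b, 2.0f, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```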

Speaking of DRAM, GF100 supports GDDR5 via six 64-bit channels, and the memory clock will likely be in the 4200 MHz range for the highest-end SKUs. The new memory type brings with it unique memory controller considerations, and at the basic level, I/O happens at the device at the same 64-bit granularity as previous-generation hardware.

In terms of texturing, GF100 appears to support the same per-cluster texturing ability as GT200, with eight pixels per clock of address setup and final texel address calculation and up to eight burnable bilerps per clock for filtering, although Nvidia won’t talk about it just yet. The texturing rate therefore appears to go up linearly with cluster count, at a peak of 1.6x over a similarly clocked GT200. The texture hardware supports all of D3D11’s requirements, of course, including FP32 surface filtering.

Despite the memory bus shrinking to 384 bits, GF100 appears to up the ROP count to 48 (each one able to write a quad of pixels to memory per clock), with full-rate blending for up to FP16 pixels, dropping to half rate for FP32. For comparison, that’s the same blend rate as GT200 and twice that per ROP per clock compared to G80. D3D11 support (and thus D3D10.1) also means a subtle change to the ROP hardware.

D3D10.1 brought about a requirement for application control over the subpixel coverage mask required for generating multisamples and various other tweaks required for fast multisample readback into the shader core. Control over the subpixel mask was the biggest barrier to older Nvidia D3D10 hardware supporting D3D10.1, and it’s only recently that Nvidia has announced chips with the capability (nearly three years since Microsoft ratified the specification).

Put simply, it’s not the biggest consumer ASIC ever in terms of area, but it’s certainly the biggest in terms of transistor count, beating the RV870 by nearly a whole RV770. Just think about that for a second.

Clock rates are currently forecast to be in the “high-end G92 range,” so we’ll pin that at around 650 MHz for the base clock domain and 1700 MHz for the hot clock (so 850 MHz for the bulk of the SM hardware, including the 64 KiB near pool and the register file).

At the manufacturing level, GF100 is a three-billion-transistor part manufactured by TSMC on its 40G process node (40-nm average feature size, 300-mm wafer). Nvidia is coy on die size for the time being, with best guesses putting it a touch under 500 mm². Put simply, it’s not the biggest consumer ASIC ever in terms of area (that goes to the original 65-nm GT200 at 576 mm²), but it’s certainly the biggest in terms of transistor count, beating the RV870 by nearly a whole RV770. Just think about that for a second.

Nvidia and TSMC have clearly been a little shy about pushing GF100 to the reticle limit of the process this time around. Nvidia is balancing the area cost against everything else it needs to consider, mostly financial constraints. The final area is a complex interplay of the process, wafer start costs, margins, expected volume and market size, clock rates, voltage, and many other factors.

Estimating performance is folly at this point, but our suite of binaries to figure out the details is shaping up nicely, and should be more than ready by the time the first GF100-based product ships.

Architecture derivatives

Going back to the sub-block discussion, it should be clear how Nvidia might scale Fermi down to smaller variants and create derivatives. Nvidia could simply (and we use that term with all due respect to the actual difficulty involved) replace the DP-capable sub block with another of the simpler blocks. They could retain everything else about the SM, including the same scheduler, near pools, register file and even the operand gather logic.

That lets them create non-DP variants for derivatives that don’t require double precision, because those parts address different markets. Some of the fearsome integer rate goes in the process as well, since some of the integer hardware is shared with the DP silicon.

Double-precision floating point is almost exclusively a non-graphics feature of GPUs, at least at this point in time (although, of course, extended-precision computation takes place all over the chip in non-programmable forms), and so it still makes sense to remove it from derivative, smaller, cheaper parts.

This modularity might also let Nvidia attempt a part with two DP sub blocks, with fairly minimal changes to the SM front end, if they so wish. Doing so will cost them area and power, but it’s something they could take on. Overtaking the per-FPU, per-clock DP rate of Intel’s microprocessors has to be appealing on some level.

Graphics-focused hardware

We put a flag in the ground for the sampler hardware and ROP rates earlier, and it’s worth expanding on our thinking there. GT200 has simply phenomenal texturing ability, especially filtered, to the point where going higher per-clock per-SM will simply unbalance the chip.

GT200 has simply phenomenal texturing ability, especially filtered, to the point where going higher per-clock per-SM will simply unbalance the chip.

A surfeit of available bilerps is never a bad thing, and high fetch rates into the shader core would keep many a developer smiling at his or her desk. However, keeping things at the GT200 level per SM is prudent in terms of area and still helps spend three billion transistors.

That’s why we don’t expect anything much to change in terms of raw sampler performance. What we do expect to change is image quality. All the hints point to Nvidia moving the game on a bit in terms of texel filtering quality, and they’ve been quietly making it clear that RV870 is beaten there. No bad thing, if it materializes.

The mentioned ROP rates will give GF100—if we’re correct about the count, of course, and Nvidia won’t say—double the G80’s formidable rates. Remember, that chip would have a good stab at sustaining 192 Z-only pixels per clock of output. 16 depth and 16 colour samples per clock per ROP partition are still not to be sneezed at, so twice that in GF100 will keep pixel output performance firmly in the high end and, again, nicely help account for the legion of gates crammed into the rough 500 mm² area.

Ultimately, we’re somewhat sad to say, we’re still in the dark about counts and rates for the graphics-focused hardware, until Nvidia opens up and reveals the final specs of the chip. We’re confident enough in the numbers to publish them, though, so we’ll stand by the assertions and reasoning.

Remember, too, that the graphics hardware is also backed by what you’d call the compute-focused hardware in GF100, inside the shader core and in the first two levels of the memory hierarchy. That unified L2 makes certain bits of the graphics pipeline go faster for free (thread divergence, writeback from the shader core, etc).

We haven’t talked about the tessellator yet, either, but that’s because we don’t think there really is one in hardware, at least not in the sense of something you could ring on a high-res die shot and go, “yeah, that’s all the tessellator logic.” DX11 mandates a programmable tessellator with a number of features, but there’s nothing in the spec that cries out for large amounts of fixed-function logic you can wall off and call a tessellator block.

It’s still exciting from a gamer’s perspective to see GF100’s compute core laid out on the table, if you’re willing to reason about it before you know everything about the pixelated bits.

You want support at the front end of the chip at triangle setup time (mostly from the memory controller), but then that’s setup and you’d build that anyway. You want a lot of FPU (that box is ticked), and then you want a high performance memory subsystem to move the new geometry around the chip. That box gets ticked, too, with Fermi at all levels in the hierarchy, including out to DRAM.

So no tessellator as you’d traditionally think about it, but one in ‘software’ instead. We’re unconvinced that setup rate will peak higher than one triangle per clock, despite many a heated argument in the background between ourselves about whether that’ll be the case. AMD claims 850 Mtris/s on the HD 5870 is more than enough for modern graphics applications, and we believe them. So an increase in the rate there makes no sense to at least half of those of us shouting the loudest when we’ve talked about it. There’s no chance of the rate decreasing, and we’re more confident that GF100 can hit peak, compared to GT200. (Pushing 1 tri/clock on G80 and friends is quite hard.)

Rough performance estimates

Given everything we’ve talked about above, we can start to draw some rough comparisons from GF100 to GT200 and RV870, and maybe estimate performance. We’ve got a clock estimate we’re pretty confident in, and we’ve put flags in the ground for the graphics-specific units in terms of counts, so here goes nothing.

We’ll consider Radeon HD 5870 for the RV870 implementation and the GeForce GTX 285 for the GT200 product. Obviously, GT200’s SP rate is for MADD, since it doesn’t have support for SP FMA.

                                    GF100            GT200            RV870
Transistor count                    3.0B             1.4B             2.15B
Process node                        40 nm @ TSMC     55 nm @ TSMC     40 nm @ TSMC
Core clock                          650 MHz          648 MHz          850 MHz
Hot clock                           1700 MHz         1476 MHz         —
Memory clock                        4200 MHz         2600 MHz         4800 MHz
ALUs                                512              240              1600
SP FMA rate                         1.74 Tflops      0.708 Tflops     2.72 Tflops
DP FMA rate                         870 Gflops       88.5 Gflops      544 Gflops
ROPs                                48               32               32
Memory bus width                    384 bit          512 bit          256 bit
Memory bandwidth                    201.6 GB/s       166.4 GB/s       153.6 GB/s
ROP rate                            31.2 Gpixels/s   21.4 Gpixels/s   27.2 Gpixels/s
INT8 bilinear texel rate
(half rate for FP16)                83.2 Gtexels/s   51.8 Gtexels/s   68.0 Gtexels/s
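
For the curious, the GF100 column falls straight out of our estimated counts and clocks. This little host-side sketch reproduces the arithmetic; the inputs are our guesses from earlier in the article, not Nvidia-confirmed figures.

```cuda
#include <stdio.h>

int main(void)
{
    const double hot_clock  = 1700e6;  // Hz, our estimate
    const double core_clock = 650e6;   // Hz, our estimate
    const double mem_rate   = 4200e6;  // effective GDDR5 transfers/s, our estimate

    double sp_fma = 512 * 2 * hot_clock;      // 512 lanes, 2 flops per FMA     -> ~1.74 Tflops
    double dp_fma = sp_fma / 2.0;             // DP at half the SP rate         -> ~870 Gflops
    double bw     = (384 / 8) * mem_rate;     // 384-bit bus, bytes per beat    -> ~201.6 GB/s
    double rops   = 48 * core_clock;          // 48 ROPs at the core clock      -> ~31.2 Gpixels/s
    double texels = 16 * 8 * core_clock;      // 16 clusters, 8 bilerps each    -> ~83.2 Gtexels/s

    printf("SP FMA: %.2f Tflops\n", sp_fma / 1e12);
    printf("DP FMA: %.1f Gflops\n", dp_fma / 1e9);
    printf("Bandwidth: %.1f GB/s\n", bw / 1e9);
    printf("ROP rate: %.1f Gpixels/s\n", rops / 1e9);
    printf("Texel rate: %.1f Gtexels/s\n", texels / 1e9);
    return 0;
}
```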

The GF100’s architecture means the SKU we’ve described (the GeForce GTX 380, possibly) comfortably outruns the GeForce GTX 285 in every way, to the point that (and we really generalize here, sorry) it should usually be at least twice as fast. Of course, you can engineer situations, usually in the middle of a frame, where the GF100 won’t outpace the GT200 all that much, but in the main, it should be a solid improvement. The GF100 should also outpace the Radeon HD 5870 to become the top single-chip graphics product of all time, assuming AMD doesn’t release anything else between now and January. Look for the margins there to be a bit more slender, and we refer you to our Radeon HD 5870 review for the figures that’ll let you imagine performance versus AMD’s product.

We mention again that our figures are preliminary ones based on educated guesswork and are subject to change once Nvidia talks about it properly next month. We’re also aware of the recent Tesla announcements at SC’09, which give hints at rates based on flop counts, and you can use those numbers to work back to some clocks that don’t match up to what we present here. Let’s just say that we’d urge more focus on our clocks, at the very least for GeForce products.

Conclusions

While Fermi is being presented as a compute monster currently, it’s still a GPU at heart, and at every level, there’s consideration for how fast and how well it will draw pixels. The two things mostly go hand-in-hand with each other, so it’s still exciting from a gamer’s perspective to see GF100’s compute core laid out on the table, if you’re willing to reason about it before you know everything about the pixelated bits.

Nvidia is doing a lot with those three billion transistors in Fermi’s first implementation, and the only real head-scratcher is why there’s not a 512-bit memory interface, given the area. GDDR5 on a 384-bit bus buys them 50% more bandwidth than RV870 can suck on at the same clocks, but graphics products can be bandwidth starved mid-frame a lot of the time with modern games, even with 100 GB/s or more.

So we wait for hardware, more details and a chance to make it cry with lovingly hand-crafted code. You could squint and grumble at things like the in-flight thread count not being enough to cover the same amount of memory latency as GT200 at the same frequencies. But it’s a deeply impressive architecture on paper, and RV870’s execution resources will have to bust a gut to keep up.

Comments closed
    • moritzgedig
    • 10 years ago

    Thank you for starting with an explaination of the terms before hand. That has always been the problem with the documents I read.
    there are two ways to look at it: from the software and from the hardware side, the overall-picture and terms can be very different.
    also there maybe missunderstandings because of multiplexing of some hardware, from the software or the scheduler side there might be 2 computeentitys but in hardware there is only one.
    this topic is often very confusing and documentation doesn’t live up to it.
    I think you could have invested even more into this by grouping terms to their perspective. Warp(software), cluster(hardware) a.s.o.
    Further hierarchys would have helped. f.e.:
    Hardware: GPU-cluster-SM-SubBlock-SFU
    Software: Device-Warp-Kernel-Thread
    (above might be incorrect, for I have not read the article yet)

    • Dissonance
    • 10 years ago

    test

      • PainIs4ThaWeak
      • 10 years ago

      “Dissonance” ?? As in THE Dissonance from Q3 E+ ??

      -KrystaL

    • anthony256
    • 10 years ago

    It’s all fine to talk your hardware up, but how many times before have we heard this?

    So it’s twice as fast inside, I want to see that reflection in games.

    I don’t want NVIDIA going around to dev’s making them gimp ATI hardware. It’s a low tactic. How about instead, you make your hardware more efficient and don’t cheat to make it look like your gear kicks ass.

    • tygrus
    • 10 years ago

    I’ve read similar articles in the past but this one was harder to follow and understand. I guess many are waiting for more details from nVidia and final product to test. I few more illustrations and examples would help.

    • Dagwood
    • 10 years ago

    Third read…

    “With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count ’em) shader-unit part, by virtue of its independent architecture.”

    This is confusing to me. I was under the impresion that AMD’s structure involved grouping it’s smaller and slower shader processors in groups of five, and each group of five had to work on one thread at a time. So in order to get a more realistic picture you had to divide the sp count by five.

    Contratstingly… Nividia’s sp’s were in groups of two and dividing them by two would let you compare parts. Thus 8800 = 128 shaders = 64 sub groups and AMD’s 320 shaders = 64 sub groups. That is some very rough math on my part, but it matches the last gen parts.

    Now it sounds like Fermi will have 512 shaders that will act like 512 shaders, and AMD’s 1600 shaders will still work in groups of 5. So Fermi should have an edge in maths untill AMD releases a 3200 shader part.

      • Wintermane
      • 10 years ago

      No what happens is amd and nvidia count shaders.. but amd uses 5 single vector shaders to do each thingy while nvidia uses 1 4 vector wide shader thingy to do its stuff.

      So to calc the performance you spin around 12 times beat a fanboy with a tuba and then fling him out a window. If he lands head first you win!

      Yes that doesnt actauly find anything useful but its more fun.

      Personaly I just go with whatever is on sale for slightly less then what I have now and doesnt look like its a total pile of flaming llama puke as far as the very few games I actualy give a flying wombat about playing.

      • Ryszard
      • 10 years ago

      The ALUs in the 5-wide vector work on the same object, yes, but they can still execute independent instructions there.

    • flip-mode
    • 10 years ago

    q[

    • WaltC
    • 10 years ago

    “…But it’s a deeply impressive architecture on paper, and RV870’s execution resources will have to bust a gut to keep up.” This is assuming of course that Fermi at some point is able to move from paper to silicon. This all has the flavor of the Bit Boys’ speculation from a few years ago…;) Now we’ve got Fermi and Larrabee “on paper” and AMD’s 5000 series in silicon and shipping, so it looks to me like it’s Fermi and Larrabee who need to move into the “bust a gut” category…;)

    • Game_boy
    • 10 years ago

    -[

      • Damage
      • 10 years ago

      Would be good to read the article, including the part where he addresses the announced Tesla rates, before aiming to correct him. 😉

        • Game_boy
        • 10 years ago

        I tried. Sorry.

          • Damage
          • 10 years ago

          Ah, well, it’s not always an easy read, granted. Here’s the relevant bit:

          “We’re also aware of the recent Tesla announcements at SC’09, which give hints at rates based on flop counts, and you can use those numbers to work back to some clocks that don’t match up to what we present here. Let’s just say that we’d urge more focus on our clocks, at the very least for GeForce products. “

            • Game_boy
            • 10 years ago

            Thanks. I suppose we’ll find out for definite when hardware is announced.

            • Damage
            • 10 years ago

            Perhaps, but Rys is trying to tell you something. 🙂

            • Game_boy
            • 10 years ago

            If final silicon (A3, according to a few places) hasn’t taped out, not even Nvidia can know clockspeeds yet.

            I know that’s not been confirmed or denied by Nvidia though.

            • Ryszard
            • 10 years ago

            But you concede Nvidia can know the clocks of the spin they do have up and running right now, and also have the data from other 40nm parts they have in production at the same foundry? We don’t know final clocks, sure, but we can make educated guesses based on that data.

    • anotherengineer
    • 10 years ago

    Well nice on paper anyway. Time will only tell. Seems they are trying to target, servers, workstations, and other high end things with big price tags, I wonder what they will charge the avg consumer for such a device.

    I also wonder if there will be a GT, GTX, G!@, G@#, and a G$^ models lol

    Even with the scarcity of the 5k radeon series, and the super price/performance of the 4k series, this thing will have a tough time if the price is too high, but again, time will tell. Nice to see new arcitechture though.

    Is ATI’s 6k series going to be a new arcitechture as well??

    • gtoulouzas
    • 10 years ago

    To a layman such as myself, this seems all very… intel Larrabee. Both companies are hyping way overdue gpus on anything BUT rasterized graphics performance. I do understand new innovative uses may come out of this “GPU computing” trend (as they may, for that matter of fact, from raytracing graphics).

    For the time being, however, all we are left are nvidia’s proclamations on a technology that revolutionizes… e-penis competitions (pardon me, “folding”), and intel shouting to the rooftops about raytracing performance that is simply Not Being Utilized (that is, in the mainstream).

    I know everybody hates car analogies, but -again, from this layman’s perspective- this seems to me like responding to the competitors’ innovative hybrids with concept designs. Not convincing.

    • lycium
    • 10 years ago

    i’m very interested to know if the concurrent gpgpu / display problem has been solved; i.e. if we can still use our computers while computing!

    thx for the great article rys, it’s a great summary of the looooong thread over at b3d 🙂

      • rUmX
      • 10 years ago

      I totally agree. I have a GTX295 and even while folding on just one of the GPU’s, the whole Windows user interface is slowed to a crawl! Basically makes my computer (core i7) useless! So now I only fold on the GPU when I sleep.

        • Krogoth
        • 10 years ago

        You are doing something wrong.

        GPU Folding on my HD 4850s don’t incur a significant hit on my gaming performance.

          • lycium
          • 10 years ago

          *facepalm*

            • Krogoth
            • 10 years ago

            Let me rephrase that.

            GPU Folding clients never cause my system grind to a halt or significantly affected my gaming performance.

      • MadManOriginal
      • 10 years ago

      I feel an Xzibit meme coming on..

    • Risme
    • 10 years ago

    Thanks again to TR for another nice article, it’s a little too technical for me to understand completely with just one read, but I got the most important parts out of it anyhow.

    The article strengthened my thoughts about the design, meaning that GT100 should be fast, but pretty damn big and thus expensive to manufacture compared to RV870. But the questions still remain; how much faster will it be compared to RV870? How much more expensive will it be in relation to performance compared to RV870? How about yields? Let’s assume the die of a GT100 is 480mm²-500mm², which would make it between 43,7 and 49,7 percents larger than RV870. So with that in mind it doesn’t look very promising to Nvidia since AMD is already suffering badly and unable to ship enough chips because of TSMC’s problems with yields. So if TSMC doesn’t vastly improve their yields by the time GT100 launches then the amount of chips Nvidia can ship will be very small.

    In one hand I hope Nvidia get’s Fermi based products out asap so that there would be competition and prices would probably go down as a result of that. On the other hand I wish AMD would win this round, but just enough so that they would make some money with 5000 family products which they could then use to potentially compete better with Intel. After all I’m more worried about the competition in the CPU market than in the GPU market and AMD needs all the money it can get to improve their competitiveness in the CPU market.

      • MadManOriginal
      • 10 years ago

      As far as availability you also need to factor in demand for the chips as GPGPUs (Tesla) rather than as graphics cards. NV and partners would surely rather sell them in $3-4k cards. So availability as graphics cards will be even worse than it might have been otherwise since it looks like the Tesla version is going to have some significant demand more so than past GPUs.

        • Risme
        • 10 years ago

        That’s true, there are more variables that come to my mind that have an effect on availability and launch date, so buckle up and enjoy the if-rich ride ahead.

        First, will Nvidia be able to launch with A2 or do they need A3? If they need A3 and thus need to push GT100 launch further back will they also push back the launch date of Tesla derivatives of Fermi? If the answer is no, and they will have GT100 and Tesla launch dates closer to one another than planned, then they will also have more overlapping in terms of demand for chips. Now, if the launch date will indeed slip to Q2, due to the need for A3 revision, then what will the yield rates be when Nvidia starts ordering chips from TSMC in preparation for the launch? Even if yield rates are high they will still get much fewer working chips per wafer than AMD. There might already be answers to some of these questions around the web but I haven’t followed the developments around Fermi closely enough to know about them.

        I’m also interested about AMD’s position in Q2 in terms of their next gen graphics, but that’s another story.

      • rUmX
      • 10 years ago

      I agree with most, if not everything you’ve said. However I’d prefer more competitiveness in the GPU space, since, every build I have done, I’ve always spent more on the GPU than on the CPU. Always.

      • zima
      • 10 years ago

      Would be a bit sad if TSMC figures out its problems just before Fermi launch…

    • ClickClick5
    • 10 years ago

    …It will all boil down to drivers. It is all about the drivers in the end.

      • BoBzeBuilder
      • 10 years ago

      Drivers didn’t save NV30. Cheating helped tho.

        • yogibbear
        • 10 years ago

        Hey i had a NV30 and though it might’ve destroyed the fridge next to my computer it ran Half-life 2 nice enough

    • mph_Ragnarok
    • 10 years ago

    I caught the DIVISION BY ZERO joke !

    • codedivine
    • 10 years ago

    1. Typo : In the table, should be gflop not mflop.
    2. Given the Tesla product announcement, arent the clock rates estimated overly optimistic?

    • BoBzeBuilder
    • 10 years ago

    I don’t like Nvidia dancing in my face with their paper designs. Sell working silicons first.

      • Buzzard44
      • 10 years ago

      +1

      On paper, I’m sure both teams have designs that will put current-gen GPUs to shame. However, they mean nothing until they are actually implemented.

      Seriously, I’ve seen designs for space elevators and flux capacitors. Until there’s a card in my hands, how is this any different?

      Of course, I can’t blame NVidia for their current spam info and lure future customers tactics. It’s just smart economics.

      Excellent article, by the way. I really like learning about computers and graphics cards more in depth, and it’s rare to find articles that can elaborate on lower-level hardware without requiring a M.S. in electrical engineering.

    • PRIME1
    • 10 years ago

    So it can run Crysis and Fold?

    Truly the grail has been found.

      • Jigar
      • 10 years ago

      Is your name Optimus Troll ?

      • Krogoth
      • 10 years ago

      You could already do that for while.

      • SomeOtherGeek
      • 10 years ago

      It looks like it can fold, but nothing can play Crysis.

        • yogibbear
        • 10 years ago

        But can it fold my washing?

          • UberGerbil
          • 10 years ago

          That’s old tech. Net tech: clothes that don’t need folding.

    • indeego
    • 10 years ago

    Somewhat embarrassing when you show that side of the coin to our International visitors. Can you show the other side instead?

      • FuturePastNow
      • 10 years ago

      What’s wrong with that side of a quarter?

      • PRIME1
      • 10 years ago

      Which other side, there are like 55 designs. (Each state + Territories).

      • ethompson6
      • 10 years ago

      Obvious troll is obvious.

      • FireGryphon
      • 10 years ago

      The only thing I can think of is that indeego is getting highly politically leftist by inferring that “In G-d We Trust” reveals our backwards society as unenlightened and worthless to the world.

        • zima
        • 10 years ago

        Why that would be highly leftist? There are a handful of very right wing folks that root for natural mythologies instead of for the one imported from middle East; I’d guess they would very much prefer “in gods we trust” ;p

          • Dagwood
          • 10 years ago

          “There are a handful of very right wing folks that root for natural mythologies”

          Handful = 3 ?

            • zima
            • 10 years ago

            Uhm, just the typical occasional effect of EN being my 2nd or 3rd (not sure TBH…) language. You know what I’ve meant ;p

            • Dagwood
            • 10 years ago

            Ok your a polythiest who is polylingual. So my estimate of 3 people on earth might not be so far off.

            What is your group of three people lobbying congress for? Parenthesis around an “S” on the word God?

      • Shining Arcanine
      • 10 years ago

      There is nothing wrong with either side of the quarter.

      • eitje
      • 10 years ago

      statute of limitations ran out a few weeks ago on that complaint, since it’s a pic from the 5xxx article! 😉

      • ludi
      • 10 years ago

      Maybe instead he should compare it to the backside of a dollar bill, and then we can give the Masons their equal-opportunity time. Would that do the trick for ya?

    • SomeOtherGeek
    • 10 years ago

    Hey Rys, very nice article! Really, very in-depth of just one topic that I’m ever used to and do keep them coming. I liked it and will be re-reading just better understand how a GPU works.

    I just hope for the sake of the customers that this all pans out cuz we need the competition! I’m getting a little tired of nVidia’s paper trail. It is more and more like paper-weight.

    Anyway, good writeup and I demand more! MORE! 😉

      • wira020
      • 10 years ago

      Funny you mentioned re-reading… I thought i was the only 1 not getting it at first read :P… i need around 10 more round or so…

      Btw, i am sure they wont be selling Fermi cheaply, given the die size… i’m looking forward how will Nvidia stay competitive when matching performance and price to Ati… and i’m sure it’d be easy for AMD/Ati to scale down pricing, it’s what they do best after all.. as of feature, i’d say Eyeinfinity match up Physix evenly from my pov, personal preference mainly…

    • OneArmedScissor
    • 10 years ago

    Well that’s a heck of a roundabout way to say that it won’t work out to twice as fast as the GTX 200s. :p

    But seriously, it was very refreshing to see an article that explores something new in depth. I hope there are more articles like this in the works.

    • SecretMaster
    • 10 years ago

    Welcome Rys! A great article for people wanting to learn up on GPU architecture a bit.

      • Krogoth
      • 10 years ago

      Rys is an old timer.

      He has been here since TR’s birth. 😉

        • SecretMaster
        • 10 years ago

        Oh my. I’ve been here ~2005 and I don’t think I’ve ever seen an article by him. My apologies then.

          • Krogoth
          • 10 years ago

          He runs and mostly hangs out at the TR Freenode IRC channel.

          • Ryszard
          • 10 years ago

          Don’t think I’ve ever graced the hallowed pages of TR before with content, at least not that I can remember, although I think I’ve had a mention from time to time. Been around for a long time though behind the scenes and on TR IRC.

            • wira020
            • 10 years ago

            Good read, wish you could have included more picture… i’m one of those people that have short attention span, reading longggggg articles like this tends to sway my mind off topic… :P… nice article nevertheless…

            I wish you could have guess/estimate the price also… i know the gpu would cost more than rv870 given the die size but does other things (like architecture or card differences or memory bandwidht) matter with pricing?..

            • grantmeaname
            • 10 years ago

            welcome to the part of TR I read!!!

            • Applecrusher
            • 10 years ago

            I remember reading your old articles on hexus way back..

    • wsk
    • 10 years ago

    CUDA warps have 32, not 16 threads. 16 is the number of banks in shared memory.

      • Ryszard
      • 10 years ago

      Yeah, you’re right, sorry about that. Should hopefully be clear that 16 is the half warp later in the text.

    • ssidbroadcast
    • 10 years ago

    Nice touch with the bold snippets interspliced into the article. Not sure whose idea that was, but you guys should all start doing things that way from now on.

      • SomeOtherGeek
      • 10 years ago

      My thoughts exactly!

      • danny e.
      • 10 years ago

      I also enjoyed that even though I didnt read the article yet.. and may not as I’m annoyed by all the Nvidia paper news lately.. 🙂

    • MadManOriginal
    • 10 years ago

    NV is really spamming the hell out of the prerelease info to keep their GPU in the news huh. Meanwhle AMD (or TSMC) can’t make enough 40nm DX11 GPUs to keep up with demand. That must make NV rather irritated.

      • wira020
      • 10 years ago

      The saddest one so far is the pic of the gpu running heaven benchmark… which could have been fake… talking bout fake, i think the fake card jen showed was epic…

    • Krogoth
    • 10 years ago

    I am getting NV30 vibes from this.

      • swaaye
      • 10 years ago

      It feels to me like the priorities are heavily shifted towards GPGPU, with games slipping into the back seat. I wonder when that will start to seriously hurt them with respect to game performance. The loss of efficiency for games has shown up already with GT200 and RV770.

      And yeah the monthly paper PR does feel like 2002, when they were way late with NV30 and needed a steady stream of FUD.

        • dragmor
        • 10 years ago

        Blame the developers, there is nothing that pushes the current graphics cards. AMD has moved to multi monitor gaming just so it looks like there is a reason to have these top end cards.

          • wira020
          • 10 years ago

          Yerp, they feed on human needs for improvements and innovation, and also human’s epic search for pleasure… lols.. in short they wanted to make more money and we wanted more fun…

          • Mystic-G
          • 10 years ago

          I definitely agree with this, Nvidia sees not a whole lot more room to improve on current games since many end up being just console ports. I can see why there is worry, Nvidia just wants to make use of their cards outside gaming and video playback aswell. I say wait for pricing and reviews before you judge what they’ve done.

          I do think when BF3 comes out that both sides will be racing for the gold on that one since that series is pretty much the hallmark of PC-exclusive gaming and will most likely be pushing the boundaries.

            • Shining Arcanine
            • 10 years ago

            I recall a certain Bad Company game that was made only for consoles. If anything, World of Warcraft is the hallmark of PC Gaming. It is the only game of which I know that is exclusively for the PC.

            • eitje
            • 10 years ago

            (and Mac)

            • Meadows
            • 10 years ago

            You’ve never heard of Crysis.

            • derFunkenstein
            • 10 years ago

            Oh, the tech demo with the controllable camera?

            • Meadows
            • 10 years ago

            I actually liked the game. Shame on you.
            Stick a hedgehog where the sun don’t shine.

            Kids these days.

            • derFunkenstein
            • 10 years ago

            If they only made games I liked, we’d all be playing turn-based Japanese strategy games, baseball, and Katamari. So I won’t begrudge that someone enjoyed Crysis. I got the demo from nzone when the game first shipped and hated it, myself. Ran nicely at 1280×800 on my machine at the time, but meh.

            • swaaye
            • 10 years ago

            Frankly I think “console ports” is too commonly believed to be the proper scapegoat. There have always been multiplatform games. Ultima was on NES! 🙂

            The modern day problem is dev costs. The risk factor is way higher than ever before and the sales numbers on PCs suck compared to consoles outside of the crack cocaine MMOs. Making a game that won’t run on the consoles and won’t run on Intel GMAxxx makes absolutely zero business sense in most cases. And of course making AAA games for niche genres isn’t very smart either, hence the lack of sims for ex.

            • zima
            • 10 years ago

            Again, the mythical “console port” is to blame…when will you stop that?

            Perhaps this will resonate…how can you call a “port” something which has essentially identical engine with almost the same art assets on both types of platforms? “Porting” suggests there is some effort involved…that the platforms are in any significant way different.

            (the way I might look at it…”PC games” are to blame, because what happened to last generation of consoles is “PC-tisation”; but that would be equally moronic; besides, I have still enough great “console games” to play *[

      • PRIME1
      • 10 years ago

      That’s just Jigar’s mustache tickling you.
