Inside Fermi's graphics architecture

Insider info and careful speculation take us deep into Nvidia's next GPU

It seems probable that September 2009 will be more than just a footnote in the annals of computing, especially when one considers graphics processors. AMD made the ninth month of the ninth year in the twenty-first century the one in which it announced, released, and made available at retail its next-generation DX11 graphics processor: Cypress. Nvidia managed to sneak Fermi into September 2009 as well, talking about the chip publicly on the 30th.

We refer you to our initial poke at things from GTC to get you started, if you have no idea what Fermi is at this point.

If you've been following Fermi since it was announced, you'll know Nvidia didn't really talk about the specific graphics transistors in Fermi implementations. We're going to take a stab at that, though, using information gleaned from the whitepaper, bits teased from Nvidia engineers, and educated guesswork. Remember, however, that graphics transistor chatter does ultimately remain a guess until the real details are unveiled.

"Why did Nvidia only talk about the compute side of Fermi?", you might ask. You can't have failed to notice the company's push into non-graphics application of GPUs in recent years. The G80 processor launch, along with CUDA, has meant that people interested in using the GPU for non-graphics computation have had a viable platform for doing so. The processors have been very capable, and CUDA offers a more direct avenue for programming them than hijacking a high-level graphics shading language.

Since that first serious attempt at providing infrastructure for GPU compute, we've seen CUDA evolve heavily and the competition and infrastructure along with it: AMD's Stream programming initiative has grown to include the GPU, OpenCL now allows developers to harness GPU power across multiple platforms, and Microsoft now has a DirectCompute portion of DirectX that leverages the devices in a more general non-graphics way. Oh, and we mustn't forget fleeting hints at the future from the likes of Rapidmind, now a part of Intel.

GPU computing is becoming a big business, and Nvidia is working, like any company with an obligation to its employees and shareholders, to make big inroads into a new industry with serious potential for growth. This industry is now mostly up and walking, after being born little more than a few years ago. We've seen GPU computing shed tears, start teething, and take its first baby steps.

Against that background, Nvidia chose not to talk about the graphics transistors in Fermi at its GPU Technology Conference. Sure, some of its reservations were competitive. After all, why give AMD all it needs to estimate product-level performance months in advance? Some of the reticence was simply because Nvidia has only very recently been able to run code on real hardware, after delays in production and manufacturing. Regardless, it was real hardware at GTC; you can be very sure of that.

The crux, though, is that Fermi will be the first GPU architecture that Nvidia initially pushes harder into the compute space than consumer or professional graphics. Large supercomputer contracts and other big installations are being won on the back of Fermi's general compute strengths, as we speak. The graphics side of things is, at this point in time anyway, less important. Make no mistake, though: Fermi is still a GPU, and the G still stands resolutely for graphics.

Terminology introduction
Graphics architecture discussion has gained some new—mostly confusing and disparate, if we're honest—terminology in the last year or so. The drive to describe massively parallel devices executing thousands of threads at a time has forced the new words, acronyms and terms to the forefront. To add to things, each vendor has a propensity to use different terms for pretty much the same things, for whatever reason.

While we can't quite unify the terminology, we can explain what we're going to use in this article, to cover some of the more confusing or non-obvious bits and pieces you might come across in the following pages. Let's start with cluster. Nvidia used to call it a TPC, AMD is keen on calling it a SIMD, but we use "cluster" to denote the granular compute processing block on a GPU, the thing vendors use to scale their architectures up and down at a basic level. A cluster is generally a collection of what the vendors like to call cores, but we're more inclined to call the cluster the core (at least most of the time; it depends on the architecture). For example, we'd say AMD's Cypress is a 20-cluster part, and Nvidia's GT200 is a 10-cluster part.

Next, we've got the warp. AMD calls it a wavefront. Either way, these terms describe a logical collection of threads that are executing at any given time on the basic building blocks of a cluster. Because of the way a modern GPU renders pixels and needs to texture, threads don't run at the single pixel/vertex/object level on a graphics processor, with each thread independent. Rather, objects are grouped logically and passed through the pipeline together. So a warp is a collection of threads, each running for a single object. Because of various requirements for efficient hardware rendering, and the underlying architecture of the GPU, those objects are grouped together.

So for recent Nvidia parts, a warp is 32 threads, and for recent AMD hardware, a wavefront is 64 threads. Branching on a GPU happens at the warp level, too.
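
To make the arithmetic concrete, here's a rough Python sketch of how threads pack into warps or wavefronts, and why branching at that granularity matters. The function names and the "two passes when a group diverges" cost model are our own simplifications for illustration, not a description of how any shipping scheduler works.

```python
import math
from typing import Sequence

# Group widths quoted above for recent parts; the dictionary keys are our own labels.
GROUP_WIDTH = {"nvidia_warp": 32, "amd_wavefront": 64}


def groups_needed(num_threads: int, width: int) -> int:
    """Warps/wavefronts needed to cover num_threads (the last may be partly empty)."""
    return math.ceil(num_threads / width)


def passes_for_branch(takes_branch: Sequence[bool], width: int) -> int:
    """Count issue passes for an if/else under a simplified model: a group whose
    threads disagree runs both sides (2 passes), a uniform group runs one side."""
    passes = 0
    for start in range(0, len(takes_branch), width):
        group = takes_branch[start:start + width]
        diverged = any(group) and not all(group)
        passes += 2 if diverged else 1
    return passes


uniform = [True] * 128                          # every thread takes the branch
alternating = [i % 2 == 0 for i in range(128)]  # neighbouring threads disagree

print(groups_needed(1000, GROUP_WIDTH["nvidia_warp"]))    # 32
print(groups_needed(1000, GROUP_WIDTH["amd_wavefront"]))  # 16
print(passes_for_branch(uniform, 32))                     # 4
print(passes_for_branch(alternating, 32))                 # 8
```

In this toy model, the alternating pattern doubles the work because every 32-thread group contains threads on both sides of the branch.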

We also talk about the "hot clock" when it comes to modern Nvidia hardware. The hot clock is the fastest clock on the chip, and it's the one at which the compute core runs.

"Kernel" is just a nice name for the software programs that wrap execution on the GPU. Some GPUs can only run a single kernel at a time, although that is changing.

Finally, when we talk about the near (memory) pools in Fermi, we mean the register file and the L1 and L2 cache memories. Sometimes just L1, though, depending on context. To visualize what we mean, think of the memory hierarchy like a chain, from registers to L1 to L2 to the memory chips on the board, with the near pools being those nearest to the compute hardware physically.

There should be some attempt to unify the terminology at some point, since talking about threads and blocks and grids and streams and warps and wavefronts and fibers, with nuanced and inconsistent meaning to boot, is counter-productive. Hopefully this intro serves you well into the rest of the analysis.

Fermi overview
Before we dive into the details, an overview of the Fermi architecture as a whole is prudent, and we'll try and limit most of the comparisons to other architectures and chips to this part of our analysis.

Starting with the basic building block of Fermi, the cluster, Nvidia's prior D3D10-class products all had multiple shader multiprocessors (SMs) in each cluster: two or three, depending on the evolution of the architecture. G80 and derivatives were two-SM parts, with each SM an 8-wide vector unit plus a special function and interpolator block, sharing sampler resources with the other SM in the cluster.

G80, the base implementation, powered products like GeForce 8800 GTX and GTS, with eight clusters, and some product-family variants disabled a cluster (and ROP partition). GT200, responsible for Nvidia's high-end products since launch roughly 17 months ago, expanded clusters to include a third SM, with each SM further enhanced with a single double-precision (DP) float unit. That DP support let developers access this capability early, a teaser if you will, before Fermi.

Fermi now has single-SM clusters, although each SM is effectively a pair of 16-way vector sub blocks. Sub-block configuration is the key to Fermi implementation configuration. GF100, the high-end part that Nvidia outlines in the whitepaper, uses two different sub blocks in each of its sixteen SMs.

A functional block diagram of GF100, the first chip based on the Fermi architecture

Each sub block has a special function unit (SFU) that provides access to hardware specials and interpolation for the vector, taking eight clocks to service a thread group or warp. More on that later. Nvidia points out that there's a dedicated load/store unit for the cluster, too, although you could claim that for every interesting generation of hardware they've created. The logic there has some unique problems to solve due to the new per-cluster arrangement and computational abilities, but it's arguably not worth presenting as part of the block logic.

Each SM now has a 64 KiB partitioned shared memory and L1 cache store. The store can be partitioned two ways at the thread type level (although with no programmer control as far as we're aware, at least not yet): either 16 KiB of shared memory and 48 KiB of L1, or 48 KiB of shared memory and 16 KiB of L1. Each sub block shares access to the store with the other, due to executing the same warp. The reason for not allowing other splits is twofold. First, Nvidia wants to keep a familiar shared memory space for code designed for other multiprocessors while letting L1 run well in parallel. Second, the designers are wire limited in terms of allowing other configurations, with area and complexity becoming a real nemesis in terms of ports and what have you.
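
As a small illustration of the two permitted configurations, the Python sketch below encodes them and picks one for a given shared memory requirement. The `pick_split` helper and its "prefer the most L1 that still fits" policy are purely hypothetical. As noted above, we're not aware of any programmer control over the split, and we don't know how the hardware or driver actually chooses.

```python
STORE_KIB = 64  # per-SM store size described above

# (shared memory KiB, L1 KiB): the only two splits described for Fermi.
ALLOWED_SPLITS = ((16, 48), (48, 16))


def pick_split(shared_kib_needed: int) -> tuple:
    """Hypothetical policy: return the allowed split with the most L1
    whose shared memory portion still covers shared_kib_needed."""
    fitting = [split for split in ALLOWED_SPLITS if split[0] >= shared_kib_needed]
    if not fitting:
        raise ValueError(f"{shared_kib_needed} KiB of shared memory fits no allowed split")
    return max(fitting, key=lambda split: split[1])


# Both configurations account for the whole 64 KiB store.
assert all(shared + l1 == STORE_KIB for shared, l1 in ALLOWED_SPLITS)

print(pick_split(12))  # (16, 48)
print(pick_split(32))  # (48, 16)
```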

L1 is backed by a unified L2 cache shared across each Fermi chip's SMs. The chip uses L2 to service all memory controller I/O requests, and all L2 writes from any cluster are visible in the next clock to any other cluster on the chip. The cache design is a significant change from any Nvidia architecture to date and a key component of its compute-focused ability. Graphics is generally a highly spatially local task for the memory subsystem to manage, with access and stride patterns well known in advance (spatial locality in terms of the address space, although that's a function of how it processes geometry and pixels). Thus, GPU caches have traditionally been small, since the spatial locality means you don't need all data in the cache to service a complete memory request (far from it, in reality). Yet non-graphics compute can trivially introduce non-spatially local memory access and random access patterns, which the large, unified L2 is designed to accelerate.
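
The locality argument is easy to see with a toy model. The Python sketch below simulates a fully associative LRU cache and compares a streaming access pattern against a scattered one. Everything about it is an assumption made for illustration: the 128-byte line size, the cache sizes, the 1 MiB address range, and the LRU policy itself are ours, not details of Fermi's L1 or L2.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, cache_lines, line_bytes=128):
    """Fraction of accesses that hit in a fully associative LRU cache.
    line_bytes=128 is an assumed line size, not a confirmed Fermi figure."""
    cache = OrderedDict()  # keys are line indices, ordered oldest to newest
    hits = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / total if total else 0.0


random.seed(0)
streaming = [i * 4 for i in range(4096)]                         # 4-byte reads in order
scattered = [random.randrange(0, 1 << 20) for _ in range(4096)]  # random reads over 1 MiB

small, large = 8, 4096  # cache sizes in lines, chosen arbitrarily

print(hit_rate(streaming, small))  # 0.96875: one miss, then 31 hits, per line
print(hit_rate(scattered, small))  # very low: almost every access misses
print(hit_rate(scattered, large))  # noticeably higher with the larger cache
```

The streaming pattern hits almost every time even with a tiny cache, which is the graphics-style case. The scattered pattern only starts to benefit once the cache is large enough to hold a meaningful share of the working set.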

Also, all memories on the chip, from registers up to DRAM, can be protected by ECC.