
Nvidia's 'Fermi' GPU architecture revealed

GPU computing grabs center stage

Graphics processors, as you may know, have been at the center of an ongoing conversation about the future of computing. GPUs have shown tremendous promise not just for producing high-impact visuals, but also for tackling data-parallel problems of various types, including some of the more difficult challenges computing now faces. Hence, GPUs and CPUs have been on an apparent collision course of sorts for some time now, and that realization has spurred a realignment in the processor business. AMD bought ATI. Intel signaled its intention to enter the graphics business in earnest with its Larrabee project. Nvidia, for its part, has devoted a tremendous amount of time and effort to cultivating the nascent market for GPU computing, running a full-court press everywhere from education to government, the enterprise, and consumer applications.

Heck, the firm has spent so much time talking up its GPU-compute environment, dubbed CUDA, and the applications written for it, including the PhysX API for games, that we've joked about Nvidia losing its relish for graphics. That's surely not the case, but the company is dead serious about growing its GPU-computing business.

Nowhere is that commitment more apparent than when it gets etched into a silicon wafer in the form of a new chip. Nvidia has been working on its next-generation GPU architecture for years now, and the firm has chosen to reveal the first information about that architecture today, at the opening of its GPU Technology Conference in San Francisco. That architecture, code-named Fermi, is no doubt intended to excel at graphics, but this first wave of details focuses on its GPU-compute capabilities. Fermi has a number of computing features never before seen in a GPU, features that should enable new applications for GPU computing and, Nvidia hopes, open up new markets for its GeForce and Tesla products.

An aerial view
We'll begin our tour of Fermi with the time-honored tradition of looking at a logical block diagram of the GPU architecture. Images like the one below may not mean much divorced from context, but they can tell you an awful lot if you know how to interpret them. Here's how Nvidia represents the Fermi architecture when focused on GPU computing, with the graphics-specific bits largely omitted.

A functional overview of the Fermi architecture. Source: Nvidia.

Let's see if we can decode things. The tall, rectangular structures flanked by blue are SMs, or streaming multiprocessors, in Nvidia's terminology. Fermi has 16 of them.

The small, green squares inside of each SM are what Nvidia calls "CUDA cores." These are the most fundamental execution resources on the chip. Calling these "cores" is apparently the fashion these days, but attaching that name probably overstates their abilities. Nonetheless, those execution resources do help determine the chip's total computing power; the GT200 had 240 of them, and Fermi has 512, a little more than twice as many.

Six of the darker blue blocks on the sides of the diagram are memory interfaces, per their labels. Those are 64-bit interfaces, which means Fermi has a total path to memory that is 384 bits wide. That's down from 512 bits on the GT200, but Fermi more than makes up for it by delivering nearly twice the bandwidth per pin via support for GDDR5 memory.

Needless speculation and conjecture for $400, Alex
Those are the basic outlines of the architecture, and if you're like me, you're immediately wondering how Fermi might compare to its most direct competitor in both the graphics and GPU-compute markets, the chip code-named Cypress that powers the Radeon HD 5870. We don't yet have enough specifics about Fermi to make that determination, even on paper. We lack key information on its graphics resources, for one thing, and we don't know what clock speeds Nvidia will settle on, either. But we might as well indulge in a little bit of speculation, just for fun. Below is a table showing the peak theoretical computational power and memory bandwidth of the fastest graphics cards based on recent GPUs from AMD and Nvidia. I've chosen to focus on graphics cards rather than dedicated GPU compute products because AMD hasn't yet announced a FireStream card based on Cypress, but the compute products shouldn't differ too much in these categories from the high-end graphics cards.

                   Peak single-precision       Peak double-precision   Peak memory
                   arithmetic (GFLOPS)         arithmetic (GFLOPS)     bandwidth (GB/s)
                   Single-issue   Dual-issue
GeForce GTX 280    622            933          78                      141.7
Radeon HD 4870     1200           -            240                     115.2
Radeon HD 5870     2720           -            544                     153.6

Those numbers set the stage. We're guessing from here, but let's say 1500MHz is a reasonable frequency target for Fermi's stream processing core. That's right in the neighborhood of the current GeForce GTX 285. If we assume Fermi reaches that speed, its peak throughput for single-precision math would be 1536 GFLOPS, or a little more than half of the peak for the Radeon HD 5870. That's quite a gap, but it's not much different from the gulf between the GeForce GTX 280's single-issue (and most realistic) peak and the Radeon HD 4870's. Yet the GTX 280 was faster overall in graphics applications and performed quite competitively in directed shader tests, as well.
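If you'd like to check that arithmetic, here is a minimal Python sketch of how a peak figure like this is derived. It assumes each CUDA core completes one multiply-add per clock, counted as two floating-point operations, which is the usual convention for peak GFLOPS numbers. The 1500MHz clock is our guess, and the function name is ours, not Nvidia's.

```python
def peak_gflops(cores, clock_mhz, flops_per_clock=2):
    """Peak theoretical throughput in GFLOPS.

    flops_per_clock=2 assumes one multiply-add per core per clock,
    counted as two floating-point operations (an assumption of this sketch).
    """
    return cores * flops_per_clock * clock_mhz / 1000.0

fermi_sp = peak_gflops(512, 1500)   # 1500MHz is a speculative clock speed
hd5870_sp = 2720                    # from the table above

ratio = fermi_sp / hd5870_sp
print(fermi_sp, round(ratio, 2))    # 1536.0 0.56
```

Nudge the clock_mhz argument up or down to see how quickly the peak moves with frequency on a chip this wide.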

Double-precision floating-point math is more crucial for GPU computing, and here Fermi has the advantage: its peak DP throughput should be close to 768 GFLOPS, if our clock speed estimates are anything like accurate. That's roughly 40% higher than the Radeon HD 5870, and it's almost a ten-fold leap from the GT200, as represented by the GeForce GTX 280.
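The same back-of-the-envelope math extends to double precision. This short Python sketch assumes Fermi's DP rate is half of our speculative 1536 GFLOPS single-precision peak, which is how the 768 figure arises, and compares it against the DP numbers from the table.

```python
# Figures from the table and text above; the Fermi number is speculative.
fermi_sp = 1536
fermi_dp = fermi_sp / 2    # assumes DP runs at half the single-precision rate
hd5870_dp = 544
gtx280_dp = 78

vs_hd5870 = fermi_dp / hd5870_dp
vs_gtx280 = fermi_dp / gtx280_dp

print(fermi_dp)               # 768.0
print(round(vs_hd5870, 2))    # 1.41
print(round(vs_gtx280, 1))    # 9.8
```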

That's not all. Assuming Nvidia employs the same 4.8 Gbps data rate for GDDR5 memory that AMD has for Cypress, Fermi's peak memory bandwidth should be about 230 GB/s, roughly 50% higher than the 153.6 GB/s of the Radeon HD 5870, which has a total memory bus width of 256 bits.
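The bandwidth estimate is simple enough to sketch in a few lines of Python: multiply the bus width in bits by the per-pin data rate, then divide by eight to convert bits to bytes. The 4.8 Gbps rate for Fermi is our assumption, and the function name is purely illustrative.

```python
def memory_bandwidth_gbs(bus_width_bits, data_rate_gbps):
    """Peak bandwidth in GB/s: pins times per-pin rate, over 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

fermi_bus = 6 * 64     # six 64-bit memory interfaces
cypress_bus = 256

fermi_bw = memory_bandwidth_gbs(fermi_bus, 4.8)      # 4.8 Gbps is assumed for Fermi
cypress_bw = memory_bandwidth_gbs(cypress_bus, 4.8)
bw_ratio = fermi_bw / cypress_bw

print(f"{fermi_bw:.1f} GB/s vs. {cypress_bw:.1f} GB/s, ratio {bw_ratio:.2f}")
```

With the data rate held equal, the advantage reduces to the ratio of the bus widths, 384 bits to 256 bits.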

All of this speculation, of course, is a total flight of fancy, and I've probably given some folks at Nvidia minor heart palpitations by opening with such madness. A bump up or down here or there in clock speed could have major consequences in a chip that involves this much parallelism. Not only that, but peak theoretical gigaFLOPS numbers are increasingly less useful as a predictor of performance for a variety of reasons, including scheduling complexities and differences in chip capabilities. Indeed, as we'll soon see, the Fermi architecture is aimed at computing more precisely and efficiently, not just delivering raw FLOPS.

So you'll want to stow your tray tables and put your seat backs in an upright and locked position as this flight of fancy comes in to land. We would also like to know, of course, how large a chip Fermi might turn out to be, because that will also tell us something about how expensive it might be to produce. Nvidia doesn't like to talk about die sizes, but it says straightforwardly that Fermi comprises an estimated 3 billion transistors. By contrast, AMD estimates Cypress at about 2.15 billion transistors, with a die area of 334 mm². We've long suspected that the methods of counting transistors at AMD and Nvidia aren't the same, but set that aside for a moment, along with your basic faculties for logic and reason and any other reservations you may have. If Fermi is made using the same 40-nm fab process as Cypress, and assuming the transistor density is more or less similar (and maybe we'll throw in an estimate from the Congressional Budget Office, just to make it sound official), then a Fermi chip should be close to 467 mm².

That's considerably larger than Cypress (roughly 40%) but is in keeping with its advantages in DP compute performance and memory bandwidth. That also seems like a sensible estimate in light of Fermi's two additional memory interfaces, which will help dictate the size of the chip. Somewhat surprisingly, that also means Fermi may turn out to be a little bit smaller than the 55-nm GT200b, since the best estimates place the GT200b at just under 500 mm². Nvidia would appear to have continued down the path of building relatively large high-end chips compared to the competition's slimmed-down approach, but Fermi seems unlikely to push the envelope on size quite like the original 65-nm GT200 did.
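For the curious, the die-size guess is nothing more than proportional scaling, sketched here in Python. The big assumption, flagged in the comments, is that Fermi packs transistors at the same density as Cypress, which may well be wrong if the two companies count transistors differently.

```python
# Inputs from the text: vendor-supplied estimates, not measurements.
fermi_transistors_m = 3000      # millions, per Nvidia
cypress_transistors_m = 2150    # millions, per AMD
cypress_area_mm2 = 334

# Assumption: both chips reach the same transistor density at 40 nm.
density = cypress_transistors_m / cypress_area_mm2    # millions per mm^2
fermi_area_mm2 = fermi_transistors_m / density
area_ratio = fermi_area_mm2 / cypress_area_mm2

print(round(density, 2))          # 6.44
print(round(fermi_area_mm2))      # 466
print(round(area_ratio, 2))       # 1.4
```

Straight scaling lands at roughly 466 mm², within a whisker of the figure quoted above and comfortably under the GT200b's estimated size of just under 500 mm².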

Then again, I could be totally wrong on this. We should have more precise answers to these questions soon enough. For now, let's move on to what we do know about Nvidia's new architecture.