Nvidia’s ‘Fermi’ GPU architecture revealed
Graphics processors, as you may know, have been at the center of an ongoing conversation about the future of computing. GPUs have shown tremendous promise not just for producing high-impact visuals, but also for tackling data-parallel problems of various types, including some of the more difficult challenges computing now faces. Hence, GPUs and CPUs have been on an apparent collision course of sorts for some time now, and that realization has spurred a realignment in the processor business. AMD bought ATI. Intel signaled its intention to enter the graphics business in earnest with its Larrabee project. Nvidia, for its part, has devoted a tremendous amount of time and effort to cultivating the nascent market for GPU computing, running a full-court press everywhere from education to government, the enterprise, and consumer applications.
Heck, the firm has spent so much time talking up its GPU-compute environment, dubbed CUDA, and the applications written for it, including the PhysX API for games, that we’ve joked about Nvidia losing its relish for graphics. That’s surely not the case, but the company is dead serious about growing its GPU-computing business.
Nowhere is that commitment more apparent than when it gets etched into a silicon wafer in the form of a new chip. Nvidia has been working on its next-generation GPU architecture for years now, and the firm has chosen to reveal the first information about that architecture today, at the opening of its GPU Technology Conference in San Francisco. That architecture, code-named Fermi, is no doubt intended to excel at graphics, but this first wave of details focuses on its GPU-compute capabilities. Fermi has a number of computing features never before seen in a GPU, features that should enable new applications for GPU computing and, Nvidia hopes, open up new markets for its GeForce and Tesla products.
An aerial view
We’ll begin our tour of Fermi by observing the time-honored tradition of looking at a logical block diagram of the GPU architecture. Images like the one below may not mean much divorced from context, but they can tell you an awful lot if you know how to interpret them. Here’s how Nvidia represents the Fermi architecture when focused on GPU computing, with the graphics-specific bits largely omitted.
Let’s see if we can decode things. The tall, rectangular structures flanked by blue are SMs, or streaming multiprocessors, in Nvidia’s terminology. Fermi has 16 of them.
The small, green squares inside of each SM are what Nvidia calls “CUDA cores.” These are the most fundamental execution resources on the chip. Calling these “cores” is apparently the fashion these days, but attaching that name probably overstates their abilities. Nonetheless, those execution resources do help determine the chip’s total power; the GT200 had 240 of them, and Fermi has 512, a bit more than twice as many.
Six of the darker blue blocks on the sides of the diagram are memory interfaces, per their labels. Those are 64-bit interfaces, which means Fermi has a total path to memory that is 384 bits wide. That’s down from 512 bits on the GT200, but Fermi more than makes up for it by delivering nearly twice the bandwidth per pin via support for GDDR5 memory.
Needless speculation and conjecture for $400, Alex
Those are the basic outlines of the architecture, and if you’re like me, you’re immediately wondering how Fermi might compare to its most direct competitor in both the graphics and GPU-compute markets, the chip code-named Cypress that powers the Radeon HD 5870. We don’t yet have enough specifics about Fermi to make that determination, even on paper. We lack key information on its graphics resources, for one thing, and we don’t know what clock speeds Nvidia will settle on, either. But we might as well indulge in a little bit of speculation, just for fun. Below is a table showing the peak theoretical computational power and memory bandwidth of the fastest graphics cards based on recent GPUs from AMD and Nvidia. I’ve chosen to focus on graphics cards rather than dedicated GPU compute products because AMD hasn’t yet announced a FireStream card based on Cypress, but the compute products shouldn’t differ too much in these categories from the high-end graphics cards.
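|Graphics card|Peak single-precision GFLOPS (single-issue)|Peak single-precision GFLOPS (dual-issue)|Peak double-precision GFLOPS|Peak memory bandwidth (GB/s)|
|---|---|---|---|---|
|GeForce GTX 280|622|933|78|141.7|
|Radeon HD 4870|1200|–|240|115.2|
|Radeon HD 5870|2720|–|544|153.6|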
Those numbers set the stage. We’re guessing from here, but let’s say 1500MHz is a reasonable frequency target for Fermi’s stream processing core. That’s right in the neighborhood of the current GeForce GTX 285. If we assume Fermi reaches that speed, its peak throughput for single-precision math would be 1536 GFLOPS, or a little over half of the peak for the Radeon HD 5870. That’s quite a gap, but it’s not much different from the gulf between the GeForce GTX 280’s single-issue (and most realistic) peak and the Radeon HD 4870’s, yet the GTX 280 was faster overall in graphics applications and performed quite competitively in directed shader tests, as well.
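For anyone who wants to check the arithmetic, here is a minimal sketch of how those peak numbers fall out. It assumes one multiply-add (two FLOPs) per CUDA core per clock; the 1500MHz Fermi clock is only the guess made above, and the 1296MHz shader clock for the GeForce GTX 280 is that card's published spec rather than anything from Nvidia's Fermi disclosures.

```python
# Peak single-precision throughput, assuming each CUDA core retires one
# multiply-add (2 FLOPs) per clock. Clock speeds are inputs, not facts
# about Fermi: 1500 MHz is this article's guess.

def peak_gflops(cores: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Peak throughput in GFLOPS for `cores` ALUs running at `clock_mhz`."""
    return cores * clock_mhz * flops_per_clock / 1000.0

fermi_sp = peak_gflops(512, 1500)    # speculative Fermi figure
gtx280_sp = peak_gflops(240, 1296)   # GTX 280 single-issue (MAD only) peak
radeon_5870_sp = 2720.0              # taken from the table above

print(f"Fermi SP peak (guess): {fermi_sp:.0f} GFLOPS")
print(f"GTX 280 single-issue:  {gtx280_sp:.0f} GFLOPS")
print(f"Fermi / HD 5870:       {fermi_sp / radeon_5870_sp:.2f}")
```

Plug in a different clock guess and the Fermi figure scales linearly, which is exactly why a small change in shipping clocks matters so much on a chip this wide.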
Double-precision floating-point math is more crucial for GPU computing, and here Fermi has the advantage: its peak DP throughput should be close to 768 GFLOPS, if our clock speed estimates are anything like accurate. That’s roughly 40% higher than the Radeon HD 5870’s 544 GFLOPS, and it’s almost a ten-fold leap from the GT200, as represented by the GeForce GTX 280.
That’s not all. Assuming Nvidia employs the same 4.8 Gbps data rate for GDDR5 memory that AMD has for Cypress, Fermi’s peak memory bandwidth should be about 230 GB/s, or 50% higher than the Radeon HD 5870, which has a total memory bus width of 256 bits.
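The double-precision and bandwidth guesses follow from the same kind of back-of-the-envelope math. The sketch below assumes a double-precision FMA completes every two clocks per core, the 1500MHz clock guessed at earlier, and a 4.8 Gbps GDDR5 data rate on all six 64-bit interfaces; none of those operating points has been confirmed by Nvidia.

```python
FLOPS_PER_FMA = 2          # a fused multiply-add counts as two FLOPs
CLOCKS_PER_DP_FMA = 2      # DP runs at half the single-precision rate

def peak_dp_gflops(cores: int, clock_mhz: float) -> float:
    """Peak double-precision GFLOPS under the half-rate assumption."""
    return cores * clock_mhz * FLOPS_PER_FMA / CLOCKS_PER_DP_FMA / 1000.0

def peak_bandwidth_gbs(bus_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bus width times per-pin rate, in bytes."""
    return bus_bits * gbps_per_pin / 8.0

fermi_dp = peak_dp_gflops(512, 1500)        # speculative clock
fermi_bw = peak_bandwidth_gbs(6 * 64, 4.8)  # six 64-bit interfaces
cypress_bw = peak_bandwidth_gbs(256, 4.8)

print(f"Fermi DP peak (guess):  {fermi_dp:.0f} GFLOPS")
print(f"  vs. HD 5870 (544):    {fermi_dp / 544:.2f}x")
print(f"  vs. GTX 280 (78):     {fermi_dp / 78:.2f}x")
print(f"Fermi bandwidth:        {fermi_bw:.1f} GB/s")
print(f"  vs. HD 5870:          {fermi_bw / cypress_bw:.2f}x")
```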
All of this speculation, of course, is a total flight of fancy, and I’ve probably given some folks at Nvidia minor heart palpitations by opening with such madness. A bump up or down here or there in clock speed could have major consequences in a chip that involves this much parallelism. Not only that, but peak theoretical gigaFLOPS numbers are increasingly less useful as a predictor of performance for a variety of reasons, including scheduling complexities and differences in chip capabilities. Indeed, as we’ll soon see, the Fermi architecture is aimed at computing more precisely and efficiently, not just delivering raw FLOPS.
So you’ll want to stow your tray tables and put your seat backs in an upright and locked position as this flight of fancy comes in to land. We would also like to know, of course, how large a chip Fermi might turn out to be, because that will also tell us something about how expensive it might be to produce. Nvidia doesn’t like to talk about die sizes, but it says straightforwardly that Fermi is made up of an estimated 3 billion transistors. By contrast, AMD estimates Cypress at about 2.15 billion transistors, with a die area of 334 mm². We’ve long suspected that the methods of counting transistors at AMD and Nvidia aren’t the same, but set that aside for a moment, along with your basic faculties for logic and reason and any other reservations you may have. If Fermi is made using the same 40-nm fab process as Cypress, and assuming the transistor density is more or less similar (and maybe we’ll throw in an estimate from the Congressional Budget Office, just to make it sound official), then a Fermi chip should be close to 467 mm².
That’s considerably larger than Cypress, by nearly 40%, but is in keeping with its advantages in DP compute performance and memory bandwidth. That also seems like a sensible estimate in light of Fermi’s two additional memory interfaces, which will help dictate the size of the chip. Somewhat surprisingly, that also means Fermi may turn out to be a little bit smaller than the 55-nm GT200b, since the best estimates place the GT200b at just under 500 mm². Nvidia would appear to have continued down the path of building relatively large high-end chips compared to the competition’s slimmed-down approach, but Fermi seems unlikely to push the envelope on size quite like the original 65-nm GT200 did.
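The die-size guess is nothing more than a density scaling, sketched below. It leans entirely on the shaky assumptions already flagged: that both companies count transistors the same way and that Fermi matches Cypress's transistor density on the same 40-nm process.

```python
# Scale Cypress's die area by the ratio of transistor counts.
# Assumption (not a disclosed fact): equal transistor density on 40 nm.
cypress_transistors = 2.15e9
cypress_area_mm2 = 334.0
fermi_transistors = 3.0e9

density_per_mm2 = cypress_transistors / cypress_area_mm2
fermi_area_mm2 = fermi_transistors / density_per_mm2

print(f"Assumed density:      {density_per_mm2 / 1e6:.2f}M transistors/mm^2")
print(f"Estimated Fermi area: {fermi_area_mm2:.0f} mm^2")
print(f"Relative to Cypress:  {fermi_area_mm2 / cypress_area_mm2:.2f}x")
```

The result lands within a square millimeter or so of the figure quoted above, and it moves in direct proportion to whatever density you choose to assume.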
Then again, I could be totally wrong on this. We should have more precise answers to these questions soon enough. For now, let’s move on to what we do know about Nvidia’s new architecture.
Better scheduling, faster switching
Like most major PC processors these days, Fermi hasn’t been entirely re-architected fresh from a clean sheet of paper; it is an incremental enhancement of prior Nvidia GPU architectures that traces its roots two major generations back, to the G80. Yet in the context of this continuity, Fermi brings radical change on a number of fronts, thanks to revisions to nearly every functional unit in the chip.
Many of the changes, especially the ones Nvidia is talking about at present, are directed toward improving the GPU’s suitability and performance for non-graphics applications. Indeed, Nvidia has invested tremendous amounts in building a software infrastructure for CUDA and in engaging with its customers, and it claims quite a few of the tweaks in this architecture were inspired by that experience. There’s much to cover here, and I’ve tried to organize it in a logical manner, but that means some key parts of the architecture won’t be addressed immediately.
We’ll start with an important, mysterious, and sometimes overlooked portion of a modern GPU: the primary scheduler, which Nvidia has too-cleverly named the “GigaThread” scheduler in this chip. Threads are bunched into groups, called “warps” in Nvidia’s lexicon, and are managed hierarchically in Fermi. This main scheduler hands off blocks of threads to the streaming multiprocessors, which then handle finer-grained scheduling for themselves. Fermi has two key improvements in its scheduling capabilities.
One is the ability to run multiple, independent “kernels” or small programs on different thread groups simultaneously. Although graphics tends to involve very large batches of things like pixels, other applications may not happen on such a grand scale. Indeed, Nvidia admits that some kernels may operate on data grids smaller than a GPU like Fermi, as illustrated in the diagram above. Some of the jobs are smaller than the GPU’s width, so a portion of the chip sits idle as the rest processes each kernel. Fermi avoids this inefficiency by executing up to 16 different kernels concurrently, including multiple kernels on the same SM. The limitation here is that the different kernels must come from the same CUDA context, so the GPU could process, say, multiple PhysX solvers at once, if needed, but it could not intermix PhysX with OpenCL.
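To put a rough number on that idle-silicon problem, here is a deliberately simplified toy model. It is not how the GigaThread scheduler actually works: it assumes every kernel runs for one equal time step, that a kernel can fill only a fixed number of SMs, and that kernels can be packed side by side with a simple first-fit rule. The kernel widths are hypothetical.

```python
NUM_SMS = 16  # Fermi's SM count, per Nvidia

def serial_utilization(kernel_widths):
    """One kernel at a time: each kernel holds the chip for a full time
    step but keeps only `width` of the SMs busy."""
    return sum(kernel_widths) / (NUM_SMS * len(kernel_widths))

def concurrent_steps(kernel_widths):
    """First-fit-decreasing packing of kernels into time steps, each
    step offering NUM_SMS SMs. Returns the number of steps needed."""
    free_per_step = []
    for width in sorted(kernel_widths, reverse=True):
        for i, free in enumerate(free_per_step):
            if free >= width:
                free_per_step[i] -= width
                break
        else:
            # No existing step has room, so open a new one.
            free_per_step.append(NUM_SMS - width)
    return len(free_per_step)

def concurrent_utilization(kernel_widths):
    return sum(kernel_widths) / (NUM_SMS * concurrent_steps(kernel_widths))

# Hypothetical workload: number of SMs each small kernel can fill.
widths = [4, 6, 2, 8, 3, 5, 4]

print(f"Serial:     {len(widths)} steps, "
      f"{serial_utilization(widths):.0%} of SM-time used")
print(f"Concurrent: {concurrent_steps(widths)} steps, "
      f"{concurrent_utilization(widths):.0%} of SM-time used")
```

Real kernels vary in duration and resource use, so actual gains will depend heavily on the workload, but the model shows why small grids leave so much of a 16-SM chip idle when they run one at a time.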
To tackle that latter sort of problem, Fermi has much faster context switching, as well. Nvidia claims context switching is ten times the speed it was on GT200, as low as 10 to 20 microseconds. Among other things, intermingling GPU computing with graphics ought to be much faster as a result.
(Incidentally, AMD tells us its Cypress chip can also run multiple kernels concurrently on its different SIMDs. In fact, different kernels can be interleaved on one SIMD.)
Inside the new, wider SM
In many ways, the SM is the heart of Fermi. The SMs are capable of fetching instructions, so they are arguably the real “processing cores” on the GPU. Fermi has 16 of them, and they have quite a bit more internal parallelism than the processing cores on a CPU.
That concept we mentioned of thread groups or warps is fundamental to the GPU’s operation. Warps are groups of threads handled in parallel by the GPU’s execution units. Nvidia has retained the same 32-thread width for warps in Fermi, but the SM now has two warp schedulers and instruction dispatch units.
The SM then has four main execution units. Two of them are 16-wide groups of scalar “CUDA cores,” in Nvidia’s parlance, and they’re helpfully labeled “Core” in the diagram on the right, mainly because I wasn’t given sufficient time with a paint program to blot out the labels. There’s also a 16-element-wide load/store unit and a four-wide group of special function units. The SFUs handle special types of math like transcendentals, and the number here is doubled from GT200, which had two per SM.
Fermi’s SM has a full crossbar between the two scheduler/dispatch blocks and these four execution units. Each scheduler/dispatch block can send a warp to any one of the four execution units in a given clock cycle, which makes Fermi a true dual-issue design, unlike GT200’s pseudo-dual-issue. The only exception here is when double-precision math is involved, as we’ll see.
The local data share in Fermi’s SM is larger, as well, up from 16KB in GT200 to 64KB here. This data share is also considerably smarter, for reasons we’ll explain shortly.
First, though, let’s take a quick detour into the so-called “CUDA core.” Each of these scalar execution resources has separate floating-point and integer data paths. The integer unit stands alone, no longer merged with the MAD unit as it was on prior designs. And each floating-point unit is now capable of producing IEEE 754-2008-compliant double-precision FP results in two clock cycles, or half the performance of single-precision math. That’s a huge step up from the GT200’s lone DP unit per SM, hence our estimate of a ten-fold increase in DP performance. Again, incorporating double-precision capability on this scale is quite a commitment from Nvidia, since such precision is generally superfluous for real-time graphics and really only useful for other forms of GPU computing.
I’d love to tell you the depth of these pipelines, but Nvidia refuses to disclose it. We could speculate, but we’ve probably done enough of that for one day already.
Fermi maintains Nvidia’s underlying computational paradigm, which the firm has labeled SIMT, for single instruction, multiple thread. Each thread in a warp executes in sequential fashion on a “CUDA core,” while 15 others do the same in parallel. For graphics, as I understand it, each pixel is treated as a thread, and pixel color components are processed serially: red, green, blue, and alpha. Since warps are 32 threads wide, warp operations will take a minimum of two clock cycles on Fermi.
Thanks to the dual scheduler/issue blocks, Fermi can occupy both 16-wide groups of CUDA cores with separate warps via dual issue. What’s more, each SM can track a total of 48 warps simultaneously and schedule them pretty freely in intermixed fashion, switching between warps at will from one cycle to the next. Obviously, this should be a very effective means of keeping the execution units busy, even if some of the warps must wait on memory accesses, because many other warps are available to run. To give you a sense of the scale involved, consider that 32 threads times 48 warps across 16 SMs adds up to 24,576 concurrent threads in flight at once on a single chip.
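The figures in this section reduce to a few lines of arithmetic, using only the widths and counts Nvidia has disclosed:

```python
WARP_SIZE = 32         # threads per warp
LANES_PER_GROUP = 16   # CUDA cores in each of the SM's two core groups
WARPS_PER_SM = 48      # warps each SM can track at once
NUM_SMS = 16

# A 32-thread warp issued to a 16-wide group needs two passes.
cycles_per_warp_instruction = WARP_SIZE // LANES_PER_GROUP

threads_per_sm = WARP_SIZE * WARPS_PER_SM
threads_in_flight = threads_per_sm * NUM_SMS

print(f"Minimum cycles per warp instruction: {cycles_per_warp_instruction}")
print(f"Threads tracked per SM:              {threads_per_sm}")
print(f"Threads in flight per chip:          {threads_in_flight:,}")
```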
Enhanced precision and programmability
Fermi incorporates a number of provisions for higher mathematical precision, including support for a fused multiply-add (FMA) operation with both single- and double-precision math. FMA improves precision by avoiding rounding between the multiply and add operations, while storing a much higher precision intermediate result. Fermi is like AMD’s Cypress chip in this regard, and both claim compliance with the IEEE 754-2008 standard. Also like Cypress is Fermi’s ability to support denorms at full speed, with gradual underflow for accurate representation of numbers approaching zero.
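A small CPU-side illustration shows why skipping that intermediate rounding matters. This is not Fermi code: it emulates a fused multiply-add in Python by computing the product and sum exactly with rationals and rounding once at the end, then compares that against the ordinary multiply-then-add sequence in double precision. The input values are chosen purely to make the difference visible.

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: a*b + c computed exactly, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def mul_then_add(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded before the add."""
    return a * b + c

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused result loses the small term entirely.
print("multiply, then add:", mul_then_add(a, b, c))
print("fused (emulated):  ", fma_emulated(a, b, c))
```

The unfused path returns zero, while the fused path preserves the tiny residual, which is the kind of cancellation-heavy case where FMA earns its keep.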
Fermi’s native instruction set has been extended in a number of other ways, as well, with hardware support for both OpenCL and DirectCompute. These changes have prompted an update to PTX, the ISA Nvidia has created for CUDA compute apps. PTX is a low-level ISA, but it’s not quite machine level; there’s still a level of driver translation beneath that. CUDA applications can be compiled to PTX, though, and it’s sufficiently close to the metal to require an update in this case.
Nvidia hasn’t stopped at taking care of OpenCL and DirectCompute, either. Among the changes in PTX 2.0 is a 40-bit, 1TB unified address space. This single address space encompasses the per-thread, per-SM (or per block), and global memory spaces built into the CUDA programming model, with a single set of load and store instructions. These instructions support 64-bit addressing, offering headroom for the future. These changes, Nvidia contends, should allow C++ pointers to be handled correctly, and PTX 2.0 adds a number of other odds and ends to make C++ support feasible.
The memory hierarchy
As we’ve noted, each SM has 64KB of local SRAM associated with it. Interestingly, Fermi partitions this local storage between the traditional local data store and L1 cache, either as 16KB of shared memory and 48KB of cache or vice-versa, in a 48KB/16KB share/cache split. This mode can be set across the chip, and the chip must be idled to switch. The portion of local storage configured as cache functions as a real L1 cache, coherent per SM but not globally, befitting the CUDA programming model.
Backing up the L1 caches in Fermi is a 768KB L2 cache. This cache is fully coherent across the chip and connected to all of the SMs. All memory accesses go through this cache, and the chip will go to DRAM in the event of a cache miss. Thus, this cache serves as a high-performance global data share. Both the L1 and L2 caches support multiple write policies, including write-back and write-through.
The L2 cache could prove particularly helpful when threads from multiple SMs happen to be accessing the same data, in which case the cache can serve to amplify the tremendous bandwidth available in a streaming compute architecture like this one. Nvidia cites several examples of algorithms that should benefit from caching due to their irregular and unpredictable memory access patterns, and they span the range from consumer applications to high-performance computing. Among them: ray tracing, physics kernels, and sparse matrix multiply. Atomic operations should also be faster on Fermi (Nvidia estimates between five and 20 times better than GT200), in part thanks to the presence of the L2 cache. (Fermi has more hardware atomic units, as well.)
Additionally, the entire memory hierarchy, from the register file to the L1 and L2 caches to the six 64-bit memory controllers, is ECC protected. Robust ECC support is an obvious nod to the needs of large computing clusters like those used in the HPC market, and it’s another example of Nvidia dedicating transistors to compute-specific features. In fact, the chip’s architects allow that ECC support probably doesn’t make sense for the smaller GPUs that will no doubt be derived from Fermi and targeted at the consumer graphics market.
Fermi supports single-error correct, double-error detect ECC for both GDDR5 and DDR3 memory types. We don’t yet know what sort of error-correction scheme Nvidia has used, though. The firm refused to reveal whether the memory interfaces were 72 bits wide to support parity, noting only that the memory interfaces are “functionally 64 bits.” Fermi has true protection for soft errors in memory, though, so this is more than just the CRC-based error correction built into the GDDR5 transfer protocol.
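The 72-bit question comes up because a textbook Hamming-style SECDED code needs eight check bits to protect 64 data bits. The sketch below shows that calculation; it says nothing about the scheme Nvidia actually uses, which remains undisclosed.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a textbook Hamming SECDED code over `data_bits`.

    Single-error correction needs the smallest r with
    2**r >= data_bits + r + 1; one extra overall parity bit
    adds double-error detection.
    """
    r = 0
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1

for width in (32, 64):
    check = secded_check_bits(width)
    print(f"{width} data bits -> {check} check bits, {width + check} bits total")
```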
We’ve already noted that Fermi’s virtual and physical address spaces are 40 bits, but the true physical limits for memory size with this chip will be dictated by the number of memory devices that can be attached. The practical limit will be 6GB with 2Gb memories and 12GB with 4Gb devices.
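Both limits are easy to sanity-check. The address-space figure is simply two to the 40th power. The capacity figures are consistent with 24 memory devices hanging off the 384-bit bus, which would mean two devices per 32-bit channel; that device count is my assumption, not something Nvidia has stated.

```python
ADDRESS_BITS = 40
addressable_bytes = 2**ADDRESS_BITS
print(f"40-bit address space: {addressable_bytes / 2**40:.0f} TB")

# Assumption (hypothetical): 12 channels of 32 bits, two devices each.
BUS_BITS = 6 * 64
ASSUMED_DEVICES = (BUS_BITS // 32) * 2

def capacity_gb(num_devices: int, device_gbit: int) -> float:
    """Total capacity in gigabytes from a device count and density in gigabits."""
    return num_devices * device_gbit / 8

for density_gbit in (2, 4):
    total = capacity_gb(ASSUMED_DEVICES, density_gbit)
    print(f"{ASSUMED_DEVICES} x {density_gbit}Gb devices: {total:.0f} GB")
```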
Of course, GPUs must also communicate with the rest of the system. Fermi acknowledges that fact with a revamped interface to the host system that packs dedicated, independent engines for data transfer to and from the GPU. These allow for concurrent GPU-host and host-GPU data transfers, fully overlapped with CPU and GPU processing time.
Nvidia’s build-out of tools for CUDA software development continues, as well. This week at the GPU Technology Conference, Nvidia will unveil its Nexus development platform, with a Microsoft Visual Studio plug-in for CUDA pictured below. Fermi has full exception handling, which should make debugging with tools like these easier.
Nvidia’s investment in software tools for GPU computing clearly outclasses AMD’s, and it’s not really even close. Although this fact has prompted some talk of standards battles, I get the impression Nvidia’s primary interest is making sure every available avenue for programming its GPUs is well supported, whether it be PhysX and C for CUDA or OpenCL and DirectCompute.
That’s all part of a very intentional strategy of cultivating new markets in GPU computing, and the company expects imminent success on this front. In fact, the firm showed us its own estimates that place the total addressable market for GPU computing at just north of $1.1 billion, across traditional HPC markets, education, and defense. That is, I believe, for next year, 2010. Those projections may be controversial in their optimism, but they reveal much about Nvidia’s motivations behind the Fermi architecture.
There are many things we still don’t know about Nvidia’s next GPU, including crucial information about its graphics features and likely performance. When we visited Nvidia earlier this month to talk about the GPU-compute aspects of the architecture, the first chips were going through bring-up. Depending on how that process goes, we could see shipping products some time later this year or not until well into next year, as I understand it.
We now have a sense that when Fermi arrives, it should at least match AMD’s Cypress in its support for the OpenCL and DirectCompute APIs, along with IEEE 754-2008-compliant mathematical precision. For many corners of the GPU computing world, though, Fermi may be well worth the wait, thanks to its likely superiority in terms of double-precision compute performance, memory bandwidth, caching, and ECC support, along with a combination of hardware hooks and software tools that should give Fermi unprecedented programmability for a GPU.
Let me suggest reading David Kanter’s piece on Fermi if you’d like more detail on the architecture.