Enhanced precision and programmability
Fermi incorporates a number of provisions for higher mathematical precision, including support for a fused multiply-add (FMA) operation with both single- and double-precision math. FMA improves precision by avoiding rounding between the multiply and add operations, while storing a much higher precision intermediate result. Fermi is like AMD's Cypress chip in this regard, and both claim compliance with the IEEE 754-2008 standard. Also like Cypress is Fermi's ability to support denorms at full speed, with gradual underflow for accurate representation of numbers approaching zero.
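To make the rounding argument concrete, here is a minimal sketch in Python, run on the CPU rather than on any GPU. It emulates a fused multiply-add by computing the exact result with rational arithmetic and rounding once, then compares that with a separate multiply and add. The helper names (fma_reference, mul_then_add) are ours, chosen for illustration, and the last lines show gradual underflow using ordinary double-precision floats.

```python
import sys
from fractions import Fraction


def fma_reference(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: exact a*b + c, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the add."""
    return a * b + c


# The exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

fused = fma_reference(a, b, c)    # keeps the tiny residual, -2**-60
unfused = mul_then_add(a, b, c)   # residual lost to intermediate rounding

# Gradual underflow: the difference of two distinct tiny numbers lands in
# the denormal range instead of being flushed to zero.
smallest_normal = sys.float_info.min
denormal_diff = 1.5 * smallest_normal - smallest_normal

print(fused, unfused, denormal_diff)
```

With a flush-to-zero design, the last subtraction would return zero even though its two operands differ, which is the case gradual underflow is meant to avoid.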
Fermi's native instruction set has been extended in a number of other ways, as well, with hardware support for both OpenCL and DirectCompute. These changes have prompted an update to PTX, the ISA Nvidia has created for CUDA compute apps. PTX is a low-level ISA, but it's not quite machine level; there's still a level of driver translation beneath that. CUDA applications can be compiled to PTX, though, and it's sufficiently close to the metal to require an update in this case.
Nvidia hasn't stopped at taking care of OpenCL and DirectCompute, either. Among the changes in PTX 2.0 is a 40-bit, 1TB unified address space. This single address space encompasses the per-thread, per-SM (or per block), and global memory spaces built into the CUDA programming model, with a single set of load and store instructions. These instructions support 64-bit addressing, offering headroom for the future. These changes, Nvidia contends, should allow C++ pointers to be handled correctly, and PTX 2.0 adds a number of other odds and ends to make C++ support feasible.
The memory hierarchy
As we've noted, each SM has 64KB of local SRAM associated with it. Interestingly, Fermi partitions this local storage between the traditional local data store and L1 cache in one of two configurations: 16KB of shared memory with 48KB of cache, or 48KB of shared memory with 16KB of cache. The mode is set chip-wide, and the chip must be idled to switch. The portion of local storage configured as cache functions as a real L1 cache, coherent per SM but not globally, befitting the CUDA programming model.
Backing up the L1 caches in Fermi is a 768KB L2 cache. This cache is fully coherent across the chip and connected to all of the SMs. All memory accesses go through this cache, and the chip will go to DRAM in the event of a cache miss. Thus, this cache serves as a high-performance global data share. Both the L1 and L2 caches support multiple write policies, including write-back and write-through.
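For readers unfamiliar with the two write policies, the difference can be sketched with a toy model. This is not Fermi's cache design, just an assumption-laden illustration: the ToyCache class, its unlimited capacity, and the dict standing in for DRAM are all hypothetical. It counts how many writes reach memory when the same address is written 100 times.

```python
class ToyCache:
    """Toy cache with no capacity limit in front of a dict standing in for DRAM."""

    def __init__(self, dram: dict, write_back: bool):
        self.dram = dram
        self.write_back = write_back
        self.lines = {}        # address -> cached value
        self.dirty = set()     # addresses modified but not yet written to DRAM
        self.dram_writes = 0

    def read(self, addr: int) -> int:
        if addr not in self.lines:
            # Cache miss: fill the line from DRAM.
            self.lines[addr] = self.dram.get(addr, 0)
        return self.lines[addr]

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # defer the DRAM update
        else:
            self._store(addr, value)   # write-through: update DRAM immediately

    def flush(self) -> None:
        """Write every dirty line back to DRAM."""
        for addr in sorted(self.dirty):
            self._store(addr, self.lines[addr])
        self.dirty.clear()

    def _store(self, addr: int, value: int) -> None:
        self.dram[addr] = value
        self.dram_writes += 1


def run(write_back: bool):
    dram = {}
    cache = ToyCache(dram, write_back)
    for i in range(100):
        cache.write(0x40, i)   # repeated writes to one address
    cache.flush()
    return cache.dram_writes, dram[0x40]


print(run(write_back=True), run(write_back=False))
```

Both policies leave the same final value in memory. Write-back saves memory traffic when data is rewritten often, while write-through keeps memory current at every step.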
The L2 cache could prove particularly helpful when threads from multiple SMs happen to be accessing the same data, in which case the cache can serve to amplify the tremendous bandwidth available in a streaming compute architecture like this one. Nvidia cites several examples of algorithms that should benefit from caching due to their irregular and unpredictable memory access patterns, and they span the range from consumer applications to high-performance computing. Among them: ray tracing, physics kernels, and sparse matrix multiply. Atomic operations should also be faster on Fermi, between five and 20 times faster than on GT200 by Nvidia's estimate, in part thanks to the presence of the L2 cache. (Fermi has more hardware atomic units, as well.)
Additionally, the entire memory hierarchy, from the register file to the L1 and L2 caches to the six 64-bit memory controllers, is ECC protected. Robust ECC support is an obvious nod to the needs of large computing clusters like those used in the HPC market, and it's another example of Nvidia dedicating transistors to compute-specific features. In fact, the chip's architects allow that ECC support probably doesn't make sense for the smaller GPUs that will no doubt be derived from Fermi and targeted at the consumer graphics market.
Fermi supports single-error correct, double-error detect ECC for both GDDR5 and DDR3 memory types. We don't yet know what sort of error-correction scheme Nvidia has used, though. The firm refused to reveal whether the memory interfaces were 72 bits wide to support parity, noting only that the memory interfaces are "functionally 64 bits." Fermi has true protection for soft errors in memory, though, so this is more than just the CRC-based error correction built into the GDDR5 transfer protocol.
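Since Nvidia hasn't disclosed its scheme, the following is only a generic illustration of what single-error correct, double-error detect means, using a textbook extended Hamming code on four data bits. A code of the same family protecting 64 data bits would need eight check bits, which is where the 72-bit figure comes from, but nothing here should be read as Fermi's actual implementation. The function names are ours.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry check bits


def encode(data4: int) -> list:
    """Encode 4 data bits into 8 bits; index 0 holds the overall parity."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (data4 >> i) & 1
    for p in PARITY_POSITIONS:
        # Each check bit covers every position whose index has bit p set.
        bits[p] = sum(bits[k] for k in range(1, 8) if k & p and k != p) % 2
    bits[0] = sum(bits[1:]) % 2
    return bits


def decode(bits: list):
    """Return (data, status); data is None when a double error is detected."""
    bits = list(bits)
    syndrome = 0
    for k in range(1, 8):
        if bits[k]:
            syndrome ^= k
    parity_odd = sum(bits) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd parity: treat as a single flipped bit. The syndrome is its
        # position, and a syndrome of 0 means the overall parity bit itself.
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped, uncorrectable.
        return None, "double-error"

    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


word = encode(0b1011)
word[5] ^= 1                 # inject a single-bit soft error
print(decode(word))          # data recovered, status "corrected"
word[2] ^= 1                 # inject a second error
print(decode(word))          # detected but not correctable
```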
We've already noted that Fermi's virtual and physical address spaces are 40 bits, but the true physical limits for memory size with this chip will be dictated by the number of memory devices that can be attached. The practical limit will be 6GB with 2Gb memories and 12GB with 4Gb devices.
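The arithmetic behind those figures is easy to check. The sketch below works only from the numbers quoted in this article (40 address bits, six memory controllers, 2Gb and 4Gb devices). The implied device count is our inference, assuming every device on the board has the same density, and is not something Nvidia has stated.

```python
GIB = 2 ** 30                 # bytes in one gigabyte (binary)
GBIT = 2 ** 30                # bits in one gigabit (binary)

ADDRESS_BITS = 40
MEMORY_CONTROLLERS = 6        # six 64-bit controllers, per the article

# A 40-bit byte address reaches 2**40 bytes, i.e. 1TB.
address_space_bytes = 2 ** ADDRESS_BITS
address_space_tb = address_space_bytes // (1024 * GIB)


def devices_needed(total_gb: int, device_gbit: int) -> int:
    """DRAM devices required to reach total_gb, assuming uniform density."""
    total_bits = total_gb * GIB * 8
    return total_bits // (device_gbit * GBIT)


devices_2gbit = devices_needed(6, 2)     # 6GB built from 2Gb devices
devices_4gbit = devices_needed(12, 4)    # 12GB built from 4Gb devices
per_controller = devices_2gbit // MEMORY_CONTROLLERS

print(address_space_tb, devices_2gbit, devices_4gbit, per_controller)
```

Both capacity limits work out to the same number of devices, which suggests the ceiling is set by how many memory chips can be attached rather than by the address space.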
Of course, GPUs must also communicate with the rest of the system. Fermi acknowledges that fact with a revamped interface to the host system that packs dedicated, independent engines for data transfer to and from the GPU. These allow for concurrent GPU-host and host-GPU data transfers, fully overlapped with CPU and GPU processing time.
What's next?
Nvidia's build-out of tools for CUDA software development continues, as well. This week at the GPU Technology Conference, Nvidia will unveil its Nexus development platform, with a Microsoft Visual Studio plug-in for CUDA. Fermi has full exception handling, which should make debugging with tools like these easier.

Nvidia's investment in software tools for GPU computing clearly outclasses AMD's, and it's not really even close. Although this fact has prompted some talk of standards battles, I get the impression Nvidia's primary interest is making sure every available avenue for programming its GPUs is well supported, whether it be PhysX and C for CUDA or OpenCL and DirectCompute.
That's all part of a very intentional strategy of cultivating new markets in GPU computing, and the company expects imminent success on this front. In fact, the firm showed us its own estimates that place the total addressable market for GPU computing at just north of $1.1 billion, across traditional HPC markets, education, and defense. That is, I believe, for next year, 2010. Those projections may be controversial in their optimism, but they reveal much about Nvidia's motivations behind the Fermi architecture.
There are many things we still don't know about Nvidia's next GPU, including crucial information about its graphics features and likely performance. When we visited Nvidia earlier this month to talk about the GPU-compute aspects of the architecture, the first chips were going through bring-up. Depending on how that process goes, we could see shipping products some time later this year or not until well into next year, as I understand it.
We now have a sense that when Fermi arrives, it should at least match AMD's Cypress in its support for the OpenCL and DirectCompute APIs, along with IEEE 754-2008-compliant mathematical precision. For many corners of the GPU computing world, though, Fermi may be well worth the wait, thanks to its likely superiority in terms of double-precision compute performance, memory bandwidth, caching, and ECC support, along with a combination of hardware hooks and software tools that should give Fermi unprecedented programmability for a GPU.
Let me suggest reading David Kanter's piece on Fermi if you'd like more detail on the architecture.

