Enhanced precision and programmability
Fermi incorporates a number of provisions for higher mathematical precision, including support for a fused multiply-add (FMA) operation with both single- and double-precision math. FMA improves precision by avoiding rounding between the multiply and add operations, while storing a much higher precision intermediate result. Fermi is like AMD's Cypress chip in this regard, and both claim compliance with the IEEE 754-2008 standard. Also like Cypress is Fermi's ability to support denorms at full speed, with gradual underflow for accurate representation of numbers approaching zero.
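To make the rounding argument concrete, here is a minimal sketch in Python. It does not use GPU hardware; it emulates a fused multiply-add by computing the exact result with rational arithmetic and rounding once, then compares that against the usual multiply-then-add sequence with two roundings. The function names and the input values are our own, chosen only to make the difference visible in double precision.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply and round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: form a*b + c exactly, then round a single time."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # Fraction-to-float conversion rounds correctly, once


# Illustrative inputs (an assumption of this sketch, not from Nvidia):
# the exact product is 1 - 2**-60, which rounds to 1.0 as a double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))  # the 2**-60 term is lost
print("fused:  ", fused_multiply_add(a, b, c))    # the 2**-60 term survives
```

In the unfused case, the product is rounded to exactly 1.0 before the add, so the final answer collapses to zero. The fused path keeps the small residual, which is the sort of cancellation case where FMA pays off.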
Fermi's native instruction set has been extended in a number of other ways, as well, with hardware support for both OpenCL and DirectCompute. These changes have prompted an update to PTX, the ISA Nvidia has created for CUDA compute apps. PTX is a low-level ISA, but it's not quite machine level; there's still a level of driver translation beneath that. CUDA applications can be compiled to PTX, though, and it's sufficiently close to the metal to require an update in this case.
Nvidia hasn't stopped at taking care of OpenCL and DirectCompute, either. Among the changes in PTX 2.0 is a 40-bit, 1TB unified address space. This single address space encompasses the per-thread, per-SM (or per block), and global memory spaces built into the CUDA programming model, with a single set of load and store instructions. These instructions support 64-bit addressing, offering headroom for the future. These changes, Nvidia contends, should allow C++ pointers to be handled correctly, and PTX 2.0 adds a number of other odds and ends to make C++ support feasible.
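As a rough illustration of what a unified address space buys you, the sketch below checks the 40-bit arithmetic and then resolves a single flat address to one of the three memory spaces. The window boundaries here are entirely hypothetical, invented for this example; Nvidia has not told us where the real windows sit or how the hardware decodes them.

```python
ADDRESS_BITS = 40
ADDRESS_SPACE_BYTES = 1 << ADDRESS_BITS  # 2**40 bytes, i.e. 1TB

# Hypothetical layout for illustration only: (name, start, end) with end exclusive.
# Anything that falls outside these windows is treated as global memory.
WINDOWS = [
    ("local", 0x00_0000_0000, 0x00_0100_0000),   # per-thread
    ("shared", 0x00_0100_0000, 0x00_0200_0000),  # per-SM (per block)
]


def resolve_space(address: int) -> str:
    """Return which memory space a flat 40-bit address falls into."""
    if not 0 <= address < ADDRESS_SPACE_BYTES:
        raise ValueError("address outside the 40-bit space")
    for name, start, end in WINDOWS:
        if start <= address < end:
            return name
    return "global"


print(ADDRESS_SPACE_BYTES == 1024**4)  # True: 40 bits address exactly 1TB
print(resolve_space(0x00_0000_0010))
print(resolve_space(0x00_0100_0010))
print(resolve_space(0x80_0000_0000))
```

The point is that one load or store instruction can take any pointer and let the address itself determine the target space, which is what a C++ pointer needs.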
The memory hierarchy
As we've noted, each SM has 64KB of local SRAM associated with it. Interestingly, Fermi partitions this local storage between the traditional local data store (shared memory) and an L1 cache, in one of two configurations: 16KB of shared memory with 48KB of cache, or 48KB of shared memory with 16KB of cache. This mode can be set across the chip, and the chip must be idled to switch. The portion of local storage configured as cache functions as a real L1 cache, coherent per SM but not globally, befitting the CUDA programming model.
Backing up the L1 caches in Fermi is a 768KB L2 cache. This cache is fully coherent across the chip and connected to all of the SMs. All memory accesses go through this cache, and the chip will go to DRAM in the event of a cache miss. Thus, this cache serves as a high-performance global data share. Both the L1 and L2 caches support multiple write policies, including write-back and write-through.
The L2 cache could prove particularly helpful when threads from multiple SMs happen to be accessing the same data, in which case the cache can serve to amplify the tremendous bandwidth available in a streaming compute architecture like this one. Nvidia cites several examples of algorithms that should benefit from caching due to their irregular and unpredictable memory access patterns, and they span the range from consumer applications to high-performance computing. Among them: ray tracing, physics kernels, and sparse matrix multiply. Atomic operations should also be faster on Fermi (Nvidia estimates between five and 20 times better than GT200), in part thanks to the presence of the L2 cache. (Fermi has more hardware atomic units, as well.)
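The bandwidth-amplification effect is easy to see in a toy model. The sketch below is not a model of Fermi's actual cache; it is a plain LRU cache that counts how many reads fall through to DRAM when several SMs touch the same lines. The 128-byte line size, the SM count passed in, and the access pattern are all assumptions made for the example.

```python
from collections import OrderedDict


class SharedCache:
    """Tiny LRU cache model that counts hits and misses by line address."""

    def __init__(self, capacity_lines: int):
        self.capacity_lines = capacity_lines
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, line_address: int) -> None:
        if line_address in self.lines:
            self.lines.move_to_end(line_address)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # a miss stands in for one DRAM read
            self.lines[line_address] = True
            if len(self.lines) > self.capacity_lines:
                self.lines.popitem(last=False)  # evict least recently used


def dram_reads_with_shared_cache(num_sms: int, num_lines: int,
                                 capacity_lines: int) -> int:
    """Every SM reads the same num_lines lines; return how many reads reach DRAM."""
    cache = SharedCache(capacity_lines)
    for line in range(num_lines):
        for _sm in range(num_sms):
            cache.read(line)
    return cache.misses


# Assumed parameters: 768KB of cache with 128-byte lines gives 6144 lines.
L2_LINES = 768 * 1024 // 128
reads_without_cache = 16 * 1000  # every read from every SM goes to DRAM
reads_with_cache = dram_reads_with_shared_cache(16, 1000, L2_LINES)
print(reads_without_cache, reads_with_cache)
```

With this idealized pattern, only the first SM to touch a line pays for the DRAM access, and the other 15 are served from the cache. Real workloads will be messier, but the mechanism is the same.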
Additionally, the entire memory hierarchy, from the register file to the L1 and L2 caches to the six 64-bit memory controllers, is ECC protected. Robust ECC support is an obvious nod to the needs of large computing clusters like those used in the HPC market, and it's another example of Nvidia dedicating transistors to compute-specific features. In fact, the chip's architects allow that ECC support probably doesn't make sense for the smaller GPUs that will no doubt be derived from Fermi and targeted at the consumer graphics market.
Fermi supports single-error correct, double-error detect ECC for both GDDR5 and DDR3 memory types. We don't yet know what sort of error-correction scheme Nvidia has used, though. The firm refused to reveal whether the memory interfaces were 72 bits wide to support parity, noting only that the memory interfaces are "functionally 64 bits." Fermi has true protection for soft errors in memory, though, so this is more than just the CRC-based error correction built into the GDDR5 transfer protocol.
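For readers unfamiliar with single-error correct, double-error detect (SECDED) codes, here is a small sketch of the textbook approach: an extended Hamming code, shrunk to 4 data bits and 4 check bits so it fits on the page. This is a generic illustration, not Nvidia's scheme, which remains undisclosed. The same construction scaled up to 64 data bits needs 8 check bits, which is where the 72-bit figure in conventional ECC memory comes from.

```python
from functools import reduce


def _xor_all(bits):
    """XOR a sequence of 0/1 values together."""
    return reduce(lambda x, y: x ^ y, bits)


def encode(data_bits):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2, 4 hold Hamming
    parity bits; indices 3, 5, 6, 7 hold the data.
    """
    d1, d2, d3, d4 = data_bits
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    code[2] = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    code[4] = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    code[0] = _xor_all(code[1:])  # makes total parity even
    return code


def decode(code):
    """Return (status, data_bits); data_bits is None if uncorrectable."""
    code = list(code)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position  # nonzero syndrome points at a flipped bit
    overall = _xor_all(code)
    if overall == 1:
        # Odd total parity: exactly one bit flipped, and the syndrome
        # gives its index (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    elif syndrome != 0:
        # Even total parity but a nonzero syndrome: two bits flipped.
        return "double_error", None
    else:
        status = "ok"
    return status, [code[3], code[5], code[6], code[7]]


word = [1, 0, 1, 1]
codeword = encode(word)

one_flip = list(codeword)
one_flip[5] ^= 1
print(decode(one_flip))

two_flips = list(codeword)
two_flips[2] ^= 1
two_flips[6] ^= 1
print(decode(two_flips))
```

A single flipped bit is located and repaired, while two flipped bits are flagged as uncorrectable rather than silently miscorrected. That distinction is what separates SECDED from simple parity or a transfer-level CRC.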
We've already noted that Fermi's virtual and physical address spaces are 40 bits, but the true physical limits for memory size with this chip will be dictated by the number of memory devices that can be attached. The practical limit will be 6GB with 2Gb memories and 12GB with 4Gb devices.
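The capacity figures can be reproduced with some simple arithmetic. The sketch below assumes 32-bit-wide memory devices with two devices sharing each 32-bit channel; that device configuration is our assumption, chosen because it matches the quoted numbers, and not something Nvidia has confirmed.

```python
def max_memory_gb(device_density_gbit: int,
                  controllers: int = 6,
                  controller_width_bits: int = 64,
                  device_width_bits: int = 32,     # assumed device width
                  devices_per_channel: int = 2):   # assumed two devices per channel
    """Estimate total memory in gigabytes from the device count and density."""
    channels = controllers * controller_width_bits // device_width_bits
    devices = channels * devices_per_channel
    return devices * device_density_gbit / 8  # 8 gigabits per gigabyte


print(max_memory_gb(2))  # 24 devices x 2Gb = 6.0 GB
print(max_memory_gb(4))  # 24 devices x 4Gb = 12.0 GB
```

Under those assumptions, the 384-bit aggregate interface hosts 24 devices, which yields 6GB and 12GB for the two densities. Both totals sit far below the 1TB that a 40-bit address can reach.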
Of course, GPUs must also communicate with the rest of the system. Fermi acknowledges that fact with a revamped interface to the host system that packs dedicated, independent engines for data transfer to and from the GPU. These allow for concurrent GPU-host and host-GPU data transfers, fully overlapped with CPU and GPU processing time.
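To get a feel for what overlapped transfers are worth, consider a simple pipeline model in which a job is split into equal chunks, and each chunk must be uploaded, processed, and downloaded. The sketch below compares running those stages back to back against running them on three independent engines. The stage times are made-up numbers in arbitrary units, and the model ignores contention and launch overhead.

```python
def serial_time(chunks: int, upload: int, compute: int, download: int) -> int:
    """Total time when upload, compute, and download never overlap."""
    return chunks * (upload + compute + download)


def overlapped_time(chunks: int, upload: int, compute: int, download: int) -> int:
    """Total time for an ideal three-stage pipeline.

    The first chunk pays the full latency of all three stages; each
    later chunk adds only the time of the slowest stage.
    """
    if chunks == 0:
        return 0
    slowest = max(upload, compute, download)
    return (upload + compute + download) + (chunks - 1) * slowest


# Hypothetical workload: 4 chunks, 2 units up, 3 units compute, 2 units down.
print(serial_time(4, 2, 3, 2))      # 28
print(overlapped_time(4, 2, 3, 2))  # 16
```

In this idealized case, the transfers hide almost entirely behind the compute work, and the job finishes in well under two-thirds of the serial time. The benefit shrinks as one stage comes to dominate the others.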
Nvidia's build-out of tools for CUDA software development continues, as well. This week at the GPU Technology Conference, Nvidia will unveil its Nexus development platform, with a Microsoft Visual Studio plug-in for CUDA pictured below. Fermi has full exception handling, which should make debugging with tools like these easier.
Nvidia's investment in software tools for GPU computing clearly outclasses AMD's, and it's not really even close. Although this fact has prompted some talk of standards battles, I get the impression Nvidia's primary interest is making sure every available avenue for programming its GPUs is well supported, whether it be PhysX and C for CUDA or OpenCL and DirectCompute.
That's all part of a very intentional strategy of cultivating new markets in GPU computing, and the company expects imminent success on this front. In fact, the firm showed us its own estimates that place the total addressable market for GPU computing at just north of $1.1 billion, across traditional HPC markets, education, and defense. That is, I believe, for next year, 2010. Those projections may be controversial in their optimism, but they reveal much about Nvidia's motivations behind the Fermi architecture.
There are many things we still don't know about Nvidia's next GPU, including crucial information about its graphics features and likely performance. When we visited Nvidia earlier this month to talk about the GPU-compute aspects of the architecture, the first chips were going through bring-up. Depending on how that process goes, we could see shipping products some time later this year or not until well into next year, as I understand it.
We now have a sense that when Fermi arrives, it should at least match AMD's Cypress in its support for the OpenCL and DirectCompute APIs, along with IEEE 754-2008-compliant mathematical precision. For many corners of the GPU computing world, though, Fermi may be well worth the wait, thanks to its likely superiority in terms of double-precision compute performance, memory bandwidth, caching, and ECC support, along with a combination of hardware hooks and software tools that should give Fermi unprecedented programmability for a GPU.
Let me suggest reading David Kanter's piece on Fermi if you'd like more detail on the architecture.