The GF100 compute core
GF100 is the codename for the biggest, double-precision-supporting variant of the Fermi architecture. It's a D3D11-class part comprising 16 clusters, each containing a pair of vector SIMD processors; a discrete memory pool and register banks; a dual-issue, dual-warp scheduler; sampler capability; and access to the chip's ROPs and DRAM devices via the on-chip memory controller.
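A quick back-of-the-envelope tally of the lane counts that configuration implies (our arithmetic, assuming the 16-cluster, dual 16-wide layout described above):

```python
# Lane count for GF100 as described: 16 clusters (SMs), each with
# two 16-wide SIMD sub blocks.
sms = 16
sub_blocks_per_sm = 2
lanes_per_sub_block = 16

lanes_per_sm = sub_blocks_per_sm * lanes_per_sub_block   # 32
total_lanes = sms * lanes_per_sm                         # 512
print(lanes_per_sm, total_lanes)
```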
One sub block is capable of double-precision computation. It's a sixteen-wide DP vector unit, executing a single FMA per clock for each of sixteen threads (half of what Nvidia calls a warp). Due to operand fetch limitations, the SM's front end can't run the second sub block while the DP sub block is executing. In addition to the DP FMA (fused multiply-add), the FPU can run DP MUL and ADD in one clock. There's a very capable integer ALU, too, good for a single 32-bit MUL or ADD per clock. Remember the CUDA documentation for G80 and friends that said 24-bit IMUL would go slower in future generations? Yeah.
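Putting a speculative number on that DP throughput (our arithmetic, not Nvidia's: one 16-wide DP sub block per SM, one FMA counted as two flops per lane per clock, and the roughly 1700 MHz hot clock guessed at later in this article):

```python
# Hypothetical GF100 DP peak under this article's description.
sms = 16
dp_lanes_per_sm = 16
flops_per_fma = 2
hot_clock_hz = 1700e6          # assumed hot clock, see clock forecasts below

dp_peak_gflops = sms * dp_lanes_per_sm * flops_per_fma * hot_clock_hz / 1e9
print(dp_peak_gflops)          # 870.4
```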
The other sub block is a sixteen-wide single-precision vector unit, running computation for the other half of a thread warp. It can execute a single-precision FMA per clock, or a MUL or ADD. The new FMA capability of the sub blocks is important: fusing the two ops into a single compute stage increases numerical precision over the old MADD hardware in prior D3D10-class Nvidia hardware. In graphics mode that poses a problem, since to run at the same numerical precision as GT200, Fermi chips like GF100 drop to half their peak MADD rate, executing the old MUL and ADD in two clocks rather than one. The graphics driver will therefore promote MADDs to FMAs automatically, although the programmer can opt out if the resulting computational divergence causes problems compared to the same code on other hardware.
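The precision difference between fused and unfused multiply-add is easy to show with a small sketch, using exact rational arithmetic to stand in for the single rounding an FMA performs (the operand values are our own contrived example):

```python
from fractions import Fraction

# Contrived operands: the exact product a*b is 1 - 2**-60, which
# rounds to exactly 1.0 in double precision.
a, b, c = 1.0 + 2.0**-30, 1.0 - 2.0**-30, -1.0

# Two rounded steps, as on the old MADD hardware: the product's low
# bits are lost before the add, and the result collapses to zero.
unfused = a * b + c

# One rounding at the end, as FMA does: compute the product exactly,
# add, then round once. The tiny residual survives.
fused = float(Fraction(a) * Fraction(b) + Fraction(c))

print(unfused, fused)   # 0.0 versus -2**-60
```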
Computational accuracy is defined by Fermi's support for IEEE 754-2008, including exception handling, with fast performance for all the float specials, including NaN, positive and negative infinity, denormals and division by zero.
Each sub block has a special function unit (SFU), too. The SFU interpolates for the vector, as well as providing access to hardware special instructions such as transcendental ops (SIN, COS, LOG, etc.). The DP sub-block SFU doesn't run instructions in double precision.
The sub block and the SFU can run a number of other instructions and special computations, too, such as shifts, branch instructions, comparison ops, bit ops and cross-ALU counters. The complete mix of instructions and their throughputs isn't known, although Nvidia claims the scheduler is only really limited by operand gather and dispatch. If all data dependencies are satisfied and there are enough ports out of the register pool to service the request, the SM will generally run any mix of instructions you can think of. There's enough operand fetch, with 256 ports, to run peak-rate SGEMM, which will please HPC types. The maximum thread count per GF100 SM is 1536, up 50% compared to GT200.
The only limitation that appears worth talking about at this point, prior to measurement, is running the double-precision sub block. Given that operands for DP are twice as wide, it appears the operand gather hardware consumes all available register file ports, so no other instructions can run on the other sub block.
In terms of the memory hierarchy, we've mentioned that all Fermi SMs contain the 64 KiB partitioned L1 and shared memory pool, backed by ECC if needed. (In fact, we'd guess that all L1 interaction is permanently protected.) Threads can access both the shared memory and L1 partitions of the near pool at the same time. Register overspill is to L1 in all Fermi implementations, and the register file is 128 KiB per SM (32 K FP32 values).
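The register file figure is worth sanity-checking, and it's also interesting to see what it works out to per thread at the 1536-thread maximum (our arithmetic):

```python
# 128 KiB register file per SM, 4 bytes per FP32 value.
reg_file_bytes = 128 * 1024
fp32_values = reg_file_bytes // 4            # 32768, matching the 32 K figure

# At the full 1536 threads per SM, that's roughly 21 registers each
# before overspill to L1 kicks in.
max_threads = 1536
regs_per_thread = fp32_values / max_threads
print(fp32_values, regs_per_thread)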
The L2 cache on GF100 is 768 KiB, which would be a per-SM allocation of 48 KiB were it statically divided, but remember it's completely unified. Preferred DRAM memory is GDDR5, but the memory controller supports DDR3, as well, and Nvidia will make use of the latter in the bigger 6 GiB Tesla configurations.
Fermi, and therefore GF100, virtualizes the address space of the device and the host, utilizing a hardware TLB for address conversion. Every memory in the hierarchy, from shared memory and caches up, is mapped into the virtual space and can be accessed by the shader core, including samplers. The shader core and samplers both consume the same virtual addresses, and the hardware and driver together are responsible for managing the memory maps. All addresses are 64-bit at the hardware level, and the physical address space that GF100 supports is 40-bit.
Fermi designs like GF100 also sport much improved atomic operation performance compared to currently shipping hardware. Atomic ops in a warp are coalesced and backed to the L2 on address contention, rather than the memory controller resolving them by replaying the transactions in DRAM at latencies of hundreds of clock cycles. The whitepaper's claim that extra atomic units facilitate the new performance isn't correct; it's the L2 that services those memory ops (since that's the furthest point in the hierarchy at which writes become globally visible to the chip's SMs), rather than DRAM.
Concurrent compute kernel support for GF100 is a claimed 16 kernels at a time, one per SM, although we believe the final count will be capped lower. Earlier architectures supporting CUDA could only run a single kernel at a time, executing them serially in submission order, but GF100 has no such limitation. Kernel streams are still queued up serially by the driver for execution, like before; however, when a cluster becomes free to run a stream from another kernel, it will schedule and run it freely, the chip effectively filling up in waterfall fashion as execution resources free up and new streams are ready to go. The limit therefore comes from the number of in-flight streams the software side will support, and we think that's likely capped at eight.
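That waterfall behaviour is easier to picture with a toy scheduler, sketched below. To be clear, this is our own illustrative model, not Nvidia's scheduling logic: kernels sit in a queue in submission order, and whenever SMs free up, the next queued kernel starts on them.

```python
from collections import deque

SM_COUNT = 16

def schedule(kernels):
    """Toy in-order waterfall scheduler. kernels is a list of
    (name, sms_needed) pairs; returns (time_step, name) launch pairs.
    Each kernel is assumed to occupy its SMs for one time step."""
    queue = deque(kernels)
    free_sms = SM_COUNT
    busy = []                   # (finish_time, sms_held) pairs
    t, launches = 0, []
    while queue or busy:
        # Retire anything finished at time t, freeing its SMs.
        still_busy = []
        for finish, sms in busy:
            if finish <= t:
                free_sms += sms
            else:
                still_busy.append((finish, sms))
        busy = still_busy
        # Launch queued kernels in submission order while SMs remain.
        while queue and queue[0][1] <= free_sms:
            name, sms = queue.popleft()
            free_sms -= sms
            busy.append((t + 1, sms))
            launches.append((t, name))
        t += 1
    return launches
```

With three queued kernels needing 12, 8 and 4 SMs, the first launches alone (only 4 SMs remain), and the other two fill the chip together once it retires: `schedule([("A", 12), ("B", 8), ("C", 4)])` gives `[(0, "A"), (1, "B"), (1, "C")]`.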
Speaking of DRAM, GF100 supports GDDR5 via six 64-bit channels, and the memory clock will likely be in the 4200 MHz range for the highest-end SKUs. The new memory type brings with it unique memory controller considerations, and at the basic level, I/O happens at the device at the same 64-bit granularity as previous-generation hardware.
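Those figures imply a peak bandwidth worth writing down (our arithmetic, taking the 4200 MHz figure as the effective GDDR5 data rate):

```python
# Six 64-bit GDDR5 channels at an assumed 4200 MT/s effective rate.
channels, channel_width_bits = 6, 64
data_rate_mtps = 4200

bus_width_bits = channels * channel_width_bits       # 384-bit bus
bytes_per_transfer = bus_width_bits // 8             # 48 bytes
bandwidth_gbps = data_rate_mtps * 1e6 * bytes_per_transfer / 1e9
print(bus_width_bits, bandwidth_gbps)                # 384, 201.6 GB/s
```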
In terms of texturing, GF100 appears to support the same per-cluster texturing ability as GT200, with eight pixels per clock of address setup and final texel address calculation and up to eight burnable bilerps per clock for filtering, although Nvidia won't talk about it just yet. The texturing rate therefore appears to go up linearly with cluster count, at a peak of 1.6x over a similarly clocked GT200. The texture hardware supports all of D3D11's requirements, of course, including FP32 surface filtering.
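The 1.6x figure falls straight out of the cluster counts, assuming the same eight bilerps per clock per cluster and GT200's ten clusters as the baseline (our arithmetic):

```python
# Peak bilerps per clock: clusters x filtering rate per cluster.
gt200_bilerps = 10 * 8    # 80
gf100_bilerps = 16 * 8    # 128
print(gf100_bilerps / gt200_bilerps)   # 1.6
```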
Despite the memory bus shrinking to 384 bits, GF100 appears to up the ROP count to 48 (each one able to write a quad of pixels to memory per clock), with full-rate blending for up to FP16 pixels, dropping to half rate for FP32. For comparison, that's the same blend rate as GT200 and twice that per ROP per clock compared to G80. D3D11 support (and thus D3D10.1) also means a subtle change to the ROP hardware.
D3D10.1 brought a requirement for application control over the subpixel coverage mask used when generating multisamples, along with various other tweaks for fast multisample readback into the shader core. Control over the subpixel mask was the biggest barrier to older Nvidia D3D10 hardware supporting D3D10.1, and it's only recently that Nvidia has announced chips with the capability (nearly three years after Microsoft ratified the specification).
Clock rates are currently forecast to be in the "high-end G92 range," so we'll pin that at around 650 MHz for the base clock domain and 1700 MHz for the hot clock (so 850 MHz for the bulk of the SM hardware, including the 64 KiB near pool and the register file).
At the manufacturing level, GF100 is a three-billion-transistor part manufactured by TSMC on its 40G process node (40-nm average feature size, 300-mm wafer). Nvidia is coy on die size for the time being, with best guesses putting it a touch under 500 mm². Put simply, it's not the biggest consumer ASIC ever in terms of area (that goes to the original 65-nm GT200 at 576 mm²), but it's certainly the biggest in terms of transistor count, beating the RV870 by nearly a whole RV770. Just think about that for a second.
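To put that transistor-count comparison in numbers (using the widely reported approximate counts for RV870 and RV770; our arithmetic):

```python
# Approximate transistor counts: GF100 at 3.0B, RV870 (Cypress) at
# ~2.15B, RV770 at ~0.956B.
gf100, rv870, rv770 = 3.00e9, 2.15e9, 0.956e9

margin = gf100 - rv870                 # ~0.85 billion transistors
print(margin / 1e9, margin / rv770)    # i.e. nearly a whole RV770
```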
Nvidia and TSMC have clearly been a little shy about pushing GF100 to the process's reticle limit this time around. Nvidia is balancing area cost against everything else it needs to consider, mostly financial constraints. The final area is a complex interplay of the process, wafer start costs, margins, expected volume and market size, clock rates, voltage, and more.
Estimating performance is folly at this point, but our suite of binaries to figure out the details is shaping up nicely, and should be more than ready by the time the first GF100-based product ships.