Better scheduling, faster switching
Like most major PC processors these days, Fermi hasn't been entirely re-architected fresh from a clean sheet of paper; it is an incremental enhancement of prior Nvidia GPU architectures that traces its roots two major generations back, to the G80. Yet in the context of this continuity, Fermi brings radical change on a number of fronts, thanks to revisions to nearly every functional unit in the chip.
Many of the changes, especially the ones Nvidia is talking about at present, are directed toward improving the GPU's suitability and performance for non-graphics applications. Indeed, Nvidia has invested tremendous amounts in building a software infrastructure for CUDA and in engaging with its customers, and it claims quite a few of the tweaks in this architecture were inspired by that experience. There's much to cover here, and I've tried to organize it in a logical manner, but that means some key parts of the architecture won't be addressed immediately.
We'll start with an important, mysterious, and sometimes overlooked portion of a modern GPU: the primary scheduler, which Nvidia has too-cleverly named the "GigaThread" scheduler in this chip. Threads are bunched into groups, called "warps" in Nvidia's lexicon, and are managed hierarchically in Fermi. This main scheduler hands off blocks of threads to the streaming multiprocessors, which then handle finer-grained scheduling for themselves. Fermi has two key improvements in its scheduling capabilities.
One is the ability to run multiple, independent "kernels" or small programs on different thread groups simultaneously. Although graphics tends to involve very large batches of things like pixels, other applications may not happen on such a grand scale. Indeed, Nvidia admits that some kernels may operate on data grids smaller than a GPU like Fermi, as illustrated in the diagram above. Some of the jobs are smaller than the GPU's width, so a portion of the chip sits idle as the rest processes each kernel. Fermi avoids this inefficiency by executing up to 16 different kernels concurrently, including multiple kernels on the same SM. The limitation here is that the different kernels must come from the same CUDA context, so the GPU could process, say, multiple PhysX solvers at once, if needed, but it could not intermix PhysX with OpenCL.
To tackle that latter sort of problem, Fermi has much faster context switching, as well. Nvidia claims context switching is ten times the speed it was on GT200, as low as 10 to 20 microseconds. Among other things, intermingling GPU computing with graphics ought to be much faster as a result.
(Incidentally, AMD tells us its Cypress chip can also run multiple kernels concurrently on its different SIMDs. In fact, different kernels can be interleaved on one SIMD.)
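To put rough numbers on the idle-silicon problem that concurrent kernels address, here is a toy model in Python. It is only a sketch under loud assumptions: each kernel is described by nothing more than the number of SMs it can fill, every kernel takes one abstract "step" to run, and the greedy packing loop is my own invention, not a description of how the GigaThread scheduler actually works. The kernel sizes are hypothetical. Only the 16-SM count and the 16-kernel limit come from Nvidia's figures.

```python
NUM_SMS = 16                 # SM count on Fermi, per the article
MAX_CONCURRENT_KERNELS = 16  # concurrent-kernel limit, per the article


def steps_serial(sm_needs):
    """One kernel at a time: every kernel occupies the chip for one step."""
    return len(sm_needs)


def steps_concurrent(sm_needs):
    """Greedily pack kernels, in order, into shared steps (toy assumption).

    Assumes every entry in sm_needs is between 1 and NUM_SMS.
    """
    steps = 0
    free_sms = 0
    free_slots = 0
    for need in sm_needs:
        # Start a new step when the kernel does not fit in the current one.
        if need > free_sms or free_slots == 0:
            steps += 1
            free_sms = NUM_SMS
            free_slots = MAX_CONCURRENT_KERNELS
        free_sms -= need
        free_slots -= 1
    return steps


def utilization(sm_needs, steps):
    """Fraction of SM-steps that did useful work."""
    return sum(sm_needs) / (NUM_SMS * steps)


# Hypothetical workload: four small kernels that together fill the chip.
kernels = [4, 6, 2, 4]

serial = steps_serial(kernels)
concurrent = steps_concurrent(kernels)
print(f"serial:     {serial} steps, {utilization(kernels, serial):.0%} busy")
print(f"concurrent: {concurrent} steps, {utilization(kernels, concurrent):.0%} busy")
```

With this made-up workload, running the kernels back to back keeps only a quarter of the SM-steps busy, while packing them together fills the chip in a single step. Real kernels vary in duration and resource needs, so treat the exact percentages as illustrative only.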
Inside the new, wider SM
In many ways, the SM is the heart of Fermi. The SMs are capable of fetching instructions, so they are arguably the real "processing cores" on the GPU. Fermi has 16 of them, and they have quite a bit more internal parallelism than the processing cores on a CPU.
That concept we mentioned of thread groups or warps is fundamental to the GPU's operation. Warps are groups of threads handled in parallel by the GPU's execution units. Nvidia has retained the same 32-thread width for warps in Fermi, but the SM now has two warp schedulers and instruction dispatch units.
The SM then has four main execution units. Two of them are 16-wide groups of scalar "CUDA cores," in Nvidia's parlance, and they're helpfully labeled "Core" in the diagram on the right, mainly because I wasn't given sufficient time with a paint program to blot out the labels. There's also a 16-element-wide load/store unit and a four-wide group of special function units. The SFUs handle special types of math like transcendentals, and the number here is doubled from GT200, which had two per SM.
Fermi's SM has a full crossbar between the two scheduler/dispatch blocks and these four execution units. Each scheduler/dispatch block can send a warp to any one of the four execution units in a given clock cycle, which makes Fermi a true dual-issue design, unlike GT200's pseudo-dual-issue. The only exception here is when double-precision math is involved, as we'll see.
The local data share in Fermi's SM is larger, as well, up from 16KB in GT200 to 64KB here. This data share is also considerably smarter, for reasons we'll explain shortly.
First, though, let's take a quick detour into the so-called "CUDA core." Each of these scalar execution resources has separate floating-point and integer data paths. The integer unit stands alone, no longer merged with the MAD unit as it was on prior designs. And each floating-point unit is now capable of producing IEEE 754-2008-compliant double-precision FP results in two clock cycles, or half the performance of single-precision math. That's a huge step up from the GT200's lone DP unit per SM, hence our estimate of a ten-fold increase in DP performance. Again, incorporating double-precision capability on this scale is quite a commitment from Nvidia, since such precision is generally superfluous for real-time graphics and really only useful for other forms of GPU computing.
I'd love to tell you the depth of these pipelines, but Nvidia refuses to disclose it. We could speculate, but we've probably done enough of that for one day already.
Fermi maintains Nvidia's underlying computational paradigm, which the firm has labeled SIMT, for single instruction, multiple thread. Each thread in a warp executes in sequential fashion on a "CUDA core," while 15 others do the same in parallel. For graphics, as I understand it, each pixel is treated as a thread, and pixel color components are processed serially: red, green, blue, and alpha. Since warps are 32 threads wide, warp operations will take a minimum of two clock cycles on Fermi.
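The two-cycle minimum falls straight out of the widths involved, and a few lines of Python make the arithmetic explicit. This is a back-of-the-envelope sketch that assumes a warp simply streams through an execution unit in lane-sized chunks, one chunk per clock. The helper name issue_cycles is hypothetical, and the figure for the four-wide SFU group is my extrapolation from that same assumption, not a number Nvidia has supplied.

```python
import math

WARP_SIZE = 32  # threads per warp on Fermi, per the article


def issue_cycles(unit_width, warp_size=WARP_SIZE):
    """Clocks needed to push one warp through a unit that is unit_width lanes wide.

    Assumes one lane-wide chunk of the warp is processed per clock.
    """
    return math.ceil(warp_size / unit_width)


cuda_core_cycles = issue_cycles(16)  # 16-wide group of CUDA cores
sfu_cycles = issue_cycles(4)         # four-wide SFU group (extrapolated)

print("16-wide CUDA core group:", cuda_core_cycles, "cycles per warp")
print("4-wide SFU group:       ", sfu_cycles, "cycles per warp")
```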
Thanks to the dual scheduler/issue blocks, Fermi can occupy both 16-wide groups of CUDA cores with separate warps via dual issue. What's more, each SM can track a total of 48 warps simultaneously and schedule them pretty freely in intermixed fashion, switching between warps at will from one cycle to the next. Obviously, this should be a very effective means of keeping the execution units busy, even if some of the warps must wait on memory accesses, because many other warps are available to run. To give you a sense of the scale involved, consider that 32 threads times 48 warps across 16 SMs adds up to 24,576 concurrent threads in flight at once on a single chip.
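For anyone who wants to check the thread-count math, the multiplication can be spelled out as below. The three inputs are the figures quoted above, and the variable names are mine.

```python
THREADS_PER_WARP = 32  # warp width
WARPS_PER_SM = 48      # warps each SM can track at once
SMS_PER_CHIP = 16      # SMs on a full Fermi chip

threads_per_sm = THREADS_PER_WARP * WARPS_PER_SM
threads_per_chip = threads_per_sm * SMS_PER_CHIP

print("threads in flight per SM:  ", threads_per_sm)
print("threads in flight per chip:", threads_per_chip)
```

That works out to 1,536 threads per SM and 24,576 across the chip.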