
Fermi overview (continued)
On the scheduling side, there's a global scheduler and some logic at the front end of each Fermi chip that gets things into shape for each SM's thread scheduler. The front end houses verification and state-tracking logic, some caches, and broadcast logic to each SM (mostly for decoded instructions). Since each SM in a Fermi implementation can run a different thread type, the front end must support an instruction stream per SM.

There's a single buffered queue for decoded instructions, despite the SM issuing two instructions per clock; that's a consequence of how the scheduler issues. Nvidia won't disclose the queue depth, but the queue and decoder are able to sustain the chip's peak issue rates, of course.

The new SM scheduler can dual-issue instructions for two running warps in a clock, with each warp running for two hot clocks, coordinating the operand-fetch hardware and effectively orchestrating the whole computation. Nvidia says there are two schedulers, but we don't believe them. The retire latency for a warp is half that of older D3D10-class designs, requiring twice the number of warps in flight to hide the same memory access latency. (DRAM device latencies, of course, won't be the same on Fermi hardware for the most part, since it now supports GDDR5.)
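The relationship between retire latency and warp count can be sketched with some simple arithmetic. The memory latency figure below is a hypothetical placeholder, not a disclosed number; the point is the ratio, not the absolute values.

```python
# Rough model of latency hiding: if a warp occupies the ALUs for fewer
# clocks before retiring, the scheduler needs more warps in flight to
# cover the same memory latency.

def warps_to_hide(mem_latency_clocks, warp_retire_clocks):
    """Warps the scheduler must rotate through so the ALUs stay busy
    while one warp waits on a memory access."""
    return mem_latency_clocks // warp_retire_clocks

MEM_LATENCY = 400  # hypothetical DRAM round trip, in hot clocks

old_design = warps_to_hide(MEM_LATENCY, 4)  # D3D10-class: 4 hot clocks per warp
fermi      = warps_to_hide(MEM_LATENCY, 2)  # Fermi: 2 hot clocks per warp

assert fermi == 2 * old_design  # halved retire latency -> twice the warps
```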

A mix of instructions can be run across the SM for the pair of warps, and because warps of threads are independent in terms of data and execution order, and because of the sub-block arrangement, the instruction mix is flexible. A 32-bit IMUL could be executing on one sub block for one half warp, for example, and the other sub block could be running a single-precision FMA for the other half-warp of threads.

The scheduler runs a scoreboard for all possible threads in flight, like all of Nvidia's D3D10-class hardware, that keeps track of data dependencies and the running and upcoming instruction mix, so the right warps are ready at the right time. If a memory request can't be satisfied on-chip and has to go out to DRAM, the chip will park the thread until the request can be serviced from L2, to avoid stalling the execution resources. The chip will also, like prior hardware, actively scale back the in-flight thread count based on scoreboard statistics such as temporary register count, instructions to be run, and predicate and branch stats.


Prior to Fermi, a compute kernel occupied the entire chip: the hardware ran a single kernel at a time, serially, with the help of the CUDA runtime. Now compute kernels can occupy the chip at the SM level, like graphics thread types, with Fermi outwardly supporting one kernel per SM.

In general, Fermi executes just like G80. It's a scalar architecture in the sense that each vector lane is dedicated to computation on a single object, exploiting data parallelism and minimizing the data-dependency issues that can reduce efficiency in other GPU architectures. There are multiple clock domains as before: the vector SIMDs run at twice the base scheduler rate, and the base chip clock is separate from both.

Branching in Fermi happens at the warp level, and therefore with 32-object granularity. The hardware now supports predicating almost all instructions, although it's unclear how the programmer has any direct control of that outside of CUDA.
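What 32-object branch granularity implies can be captured in a toy model: when threads within a warp disagree on a branch, the hardware executes both paths, with inactive lanes predicated off, so the costs are additive. The instruction counts below are purely illustrative.

```python
def warp_branch_cost(taken_mask, if_cost, else_cost):
    """Instruction slots a 32-thread warp spends on a branch.
    taken_mask: 32 bools, one per thread, True if the lane takes the if-path."""
    lanes = list(taken_mask)
    assert len(lanes) == 32
    # Each path executes once if at least one lane in the warp needs it.
    runs_if   = any(lanes)
    runs_else = not all(lanes)
    return (if_cost if runs_if else 0) + (else_cost if runs_else else 0)

uniform   = warp_branch_cost([True] * 32, 10, 20)               # all agree: 10
divergent = warp_branch_cost([True] * 16 + [False] * 16, 10, 20)  # both paths: 30
```

A warp where every lane agrees pays for one path; a split warp pays for both, which is why divergence at anything finer than 32-object granularity is invisible to the programmer but not to performance.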

Comparisons to Cypress have some of the numbers coming out in AMD's favor. With a straight face, any AMD employee could look you in the eye and call Cypress a 1600 (count 'em) shader-unit part, by virtue of its independent architecture. Clusters of 5-way vector processors work together in groups of 16, processing an object each per clock (at 850MHz in Radeon HD 5870 form), with a faintly amazing 20 clusters churning away in total.
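The 1600 figure is just the product of that arrangement, and it also sets Cypress's headline FP32 rate, counting an FMA as two ops per clock:

```python
clusters       = 20    # SIMD clusters on Cypress
vliw_per_group = 16    # 5-way VLIW processors per cluster
lanes_per_vliw = 5
clock_ghz      = 0.85  # Radeon HD 5870 engine clock

alus = clusters * vliw_per_group * lanes_per_vliw
peak_gflops = alus * 2 * clock_ghz  # FMA = 2 flops per ALU per clock

assert alus == 1600
assert round(peak_gflops) == 2720  # peak single-precision GFLOPS
```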


The Cypress-based Radeon HD 5870

Versus RV770, Cypress's texturing resources have doubled, ROPs have doubled, raster has potentially doubled, and various near pools in the memory hierarchy have doubled in size and effective bandwidth. Going back to the shader hardware, four of the five ALUs in the 5-way vector are capable of full IEEE754-2008 FP32 FMAs, and the T-unit has other unique characteristics. It all adds up to serious rates of everything, from shading to texture sampling to pixel output to memory bandwidth. All of that in 334 mm² at 40 nm by TSMC, using 2.15 billion transistors. The density is absolutely outrageous. Oh, and keep those figures in mind for later.
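"Outrageous" is quantifiable here, since the two figures above give the density directly:

```python
transistors_millions = 2150.0  # 2.15 billion transistors
die_area_mm2         = 334.0   # 40 nm at TSMC

density = transistors_millions / die_area_mm2  # million transistors per mm^2
assert round(density, 1) == 6.4
```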


A Cypress chip up close


Cypr....nah, I can't do it any longer....RV870 really is almost a full doubling of RV770 in terms of the core execution hardware, with only the external memory bus staying put at 256 bits. That can make it seem imbalanced at times, but when not memory bound, it's a processing monster, making games go faster than ever before, with a world-class output engine, good physicals, and a nice price. Nvidia will barely sell another GT200 with that on the scene, and it's only the compute side of AMD's proposition that lets things down. At the hardware level, there's not much you could point at and say, "that's for GPU computing." Maybe that goes some way toward explaining why Nvidia is pushing so hard in the same space, as it uses Fermi to try and take control of things. More on that later, after a look at GF100-level specifics.