Partially thanks to its push into GPU computing, Nvidia has been much more open about some details of the GT200's architecture than it has been with prior GPU designs. As a result, we can take a look inside of a thread processing cluster and see a little more clearly how it works. The diagram at the right shows one TPC. Each TPC has three shader multiprocessors (SMs), eight texture addressing/filtering units, and an L1 cache. For whatever reason, Nvidia won't divulge the size of this L1 cache.
Inside of each SM is one instruction unit (IU), eight stream processors (SPs), and a 16KB pool of local, shared memory. This local memory can facilitate inter-thread communication in GPU compute applications, but it's not used that way in graphics, where such communication isn't necessary.
For a while now, Nvidia has struggled with exactly how to characterize its GPUs' computing model. At last, the firm seems to have settled on a name: SIMT, for "single instruction, multiple thread." As with G80, GT200 execution is scalar rather than vector, with each SP processing a single pixel component at a time. The key to performance is keeping all of those execution units fed as much of the time as possible, and threading is the means by which the GT200 accomplishes this goal. All threads in the GT200 are managed in hardware by the IUs, with zero cost for switching between them.
The IU manages things in groups of 32 parallel threads Nvidia calls "warps." The IU can track up to 32 warps, so each SM can handle up to 1024 threads in flight. Across the GT200's 30 SMs, that adds up to as many as 30,720 concurrent hardware threads in flight at any given time. (G80 was similar, but peaked at 768 threads per SM for a maximum of 12,288 threads in flight.) The warp is a fundamental unit in the GPU. The chip's branching granularity is one warp, which equates to 32 pixels or 16 vertices (or, I suppose, 32 compute threads). Since one pixel equals one thread, and since the SPs are scalar, the compiler schedules pixel elements for execution sequentially: red, then green, then blue, and then alpha. Meanwhile, inside of that same SM, seven other pixels are getting the exact same treatment in parallel.
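For those who like to check the math, here is a quick Python sketch of the thread-count arithmetic above. The function and variable names are my own, and the G80's 24 warps per SM and 16 SMs are not stated outright; they are inferred from the 768-thread and 12,288-thread figures.

```python
WARP_SIZE = 32  # threads per warp, per the article

def threads_in_flight(warps_per_sm, sm_count, warp_size=WARP_SIZE):
    """Return (threads per SM, threads per chip)."""
    per_sm = warps_per_sm * warp_size
    return per_sm, per_sm * sm_count

# GT200: 32 warps per SM and 30 SMs, both figures from the text.
gt200_per_sm, gt200_total = threads_in_flight(32, 30)

# G80: 24 warps per SM and 16 SMs are inferred (768 / 32 and 12,288 / 768).
g80_per_sm, g80_total = threads_in_flight(24, 16)

print(gt200_per_sm, gt200_total)  # 1024 30720
print(g80_per_sm, g80_total)      # 768 12288
```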
Should the threads in a warp hit a situation where a high-latency operation like a texture read/memory access is required, the IU can simply switch to processing another of the many warps it tracks while waiting for the results to come back. In this way, the GPU hides latency and keeps its SPs occupied.
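The latency-hiding idea can be illustrated with a toy model in Python. This is a sketch under loud assumptions, not a description of Nvidia's actual scheduler: every warp alternates between a fixed burst of compute and a fixed-latency memory request, switching warps costs nothing, and the IU simply picks the first warp whose data has arrived. The 4-cycle compute burst and the 400-cycle and 100-cycle latencies are illustrative numbers of my own choosing.

```python
def sp_utilization(num_warps, compute_cycles, memory_latency, total_cycles=10_000):
    """Fraction of cycles the SPs spend computing in a toy latency-hiding model."""
    ready_at = [0] * num_warps  # first cycle at which each warp may run again
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [w for w in range(num_warps) if ready_at[w] <= cycle]
        if runnable:
            w = runnable[0]           # assumed policy: first ready warp wins
            busy += compute_cycles    # the SPs work on this warp
            cycle += compute_cycles
            ready_at[w] = cycle + memory_latency  # the warp now waits on memory
        else:
            cycle = min(ready_at)     # SPs idle until some warp's data returns
    return busy / cycle

# With a long memory latency, more warps in flight means less idle time.
for warps in (1, 8, 32):
    print(warps, round(sp_utilization(warps, 4, 400), 3))

# With a shorter latency, 32 warps are enough to hide it completely.
print(sp_utilization(32, 4, 100))
```

In this model the SPs sit idle about 99% of the time with a single warp and 400 cycles of latency, while a full complement of 32 warps recovers roughly a third of the peak. Cut the latency to 100 cycles and 32 warps cover it entirely.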
That is, as I understand it, SIMT in a nutshell, and it's essentially the model established by the G80. Of course, the GT200 is improved in ways big and small to deliver more processing power more efficiently than the G80.
One of those improvements is relatively high-profile because it affects the GT200's theoretical peak FLOPS numbers. As you may know, each SP can contribute up to two FLOPS per clock by executing a multiply-add (MAD) instruction. On top of that, each SP has an associated special-function unit that handles things like transcendentals and interpolation. That SFU can also, when not being used otherwise, execute a floating-point multiply instruction, contributing another FLOP per clock to the SP's output. By issuing a MAD and a MUL together, the SPs can deliver three total FLOPS per clock, and this potential is the basis for Nvidia's claim of 518 GFLOPS peak for the GeForce 8800 GTX, as well as the estimate of 933 GFLOPS for the GeForce GTX 280.
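The peak numbers fall out of simple multiplication, as this Python sketch shows. The SP counts and shader clocks are not given in this passage; I am assuming 240 SPs at 1296MHz for the GeForce GTX 280 and 128 SPs at 1350MHz for the GeForce 8800 GTX, and the function name is my own.

```python
def peak_gflops(sp_count, shader_clock_mhz, flops_per_clock):
    """Theoretical peak in GFLOPS: units x clock (MHz) x FLOPS per clock / 1000."""
    return sp_count * shader_clock_mhz * flops_per_clock / 1000.0

MAD_PLUS_MUL = 3  # two FLOPS from the MAD plus one from the extra MUL
MAD_ONLY = 2      # what remains if the MUL cannot be dual-issued

# Assumed specs: SP counts and shader clocks are not stated in this section.
gtx_280 = peak_gflops(240, 1296, MAD_PLUS_MUL)
gf_8800_gtx = peak_gflops(128, 1350, MAD_PLUS_MUL)

print(round(gtx_280), round(gf_8800_gtx))       # 933 518
print(round(peak_gflops(240, 1296, MAD_ONLY)))  # 622
```

The last line is the GTX 280's peak with the MAD alone, which makes plain how much of the headline figure hangs on that extra MUL.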
Trouble is, that additional MUL wasn't always accessible on the G80, leading some folks to muse about the mysterious case of the missing MUL. Nvidia won't quite admit that dual-issue on the G80 was broken, but it says scheduling on the GT200 has been massaged so that it "can now perform near full-speed dual-issue" of a MAD+MUL pair. Tamasi claims the performance impact of dual-issue is measurable, with 3DMark Vantage's Perlin noise test gaining 16% and the GPU cloth test gaining about 7% when dual-issue is active. That's a long way from the 33% of peak throughput the extra MUL represents, but it's better than nothing, I suppose.
Another enhancement in GT200 is the doubling of the size of the register file for each SM. The aim here is, by adding more on-chip storage, to allow more complex shaders to run without overflowing into memory. Nvidia cites improvements of 35% in 3DMark Vantage's parallax occlusion mapping test, 6% in GPU cloth, 5% in Perlin noise, and 15% overall with Vantage's Extreme presets due to the larger register file.
Another standout in the laundry list of tweaks to GT200 is a much larger buffer for stream output from geometry shaders. Some developers have attempted to use geometry shaders for tessellation, but the large amount of data they produced caused problems for G80 and its progeny. The GT200's stream out buffer is six times the size of G80's, which should help. Nvidia's own numbers show the Radeon HD 3870 working faster with geometry shaders than the G80; those same measurements put the GT200 above the Radeon HD 3870 X2.
The diagram above sets the stage for the final two modifications to the GT200's processing capabilities. Nvidia likes to show this simplified diagram in order to explain how the GPU works in CUDA compute mode, when most of its graphics-specific logic won't be used. As you can see, the Chiclets don't change much, although the ROP hardware is essentially ignored, and what's left is a great, big parallel compute machine.
One thing such a machine needs for scientific computing and the like is the ability to handle higher precision floating-point datatypes. Such precision isn't typically necessary in graphics, especially real-time graphics, so it wasn't a capability of the first DirectX 10-class GPUs. The GT200, however, adds the ability to process IEEE 754R-compliant, 64-bit, double-precision floating-point math. Nvidia has added one double-precision unit in each SM, so GT200 has 30 total. That gives it a peak double-precision computational rate of 78 GFLOPS, well below the GPU's single-precision peak but still not too shabby.
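The 78 GFLOPS figure can be reproduced with a little Python, given two assumptions that are not stated in this passage: that each double-precision unit retires one fused multiply-add (two FLOPS) per clock, and that the units run at the GTX 280's 1296MHz shader clock. The names below are my own.

```python
DP_UNITS = 30            # one per SM, per the article
FLOPS_PER_CLOCK = 2      # assumption: one fused multiply-add per clock
SHADER_CLOCK_MHZ = 1296  # assumption: GeForce GTX 280 shader clock

dp_peak_gflops = DP_UNITS * FLOPS_PER_CLOCK * SHADER_CLOCK_MHZ / 1000.0
print(round(dp_peak_gflops))  # 78

# Compare against the single-precision peak of roughly 933 GFLOPS.
sp_peak_gflops = 933.12
print(round(sp_peak_gflops / dp_peak_gflops))  # 12
```

Under those assumptions, double-precision math runs at about one-twelfth of the single-precision peak.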
Another facility added to the GT200 for the CUDA crowd is represented by the extra-wide, light-blue Chiclets in the diagram above: the ability to perform atomic read-modify-write operations into memory, useful for certain types of GPU-compute algorithms.