TurboCache and turbo lag
As for TurboCache, it isn't really about forced induction or ceramic impellers. TurboCache is about using a combination of fast local RAM and system memory for rendering. In this case, NVIDIA says the local frame buffer acts as a high-speed, software-managed cache, while main memory is allocated dynamically as needed for graphics use. GeForce 6200 cards with TurboCache will come with either one or two pieces of memory onboard. A single DRAM, typically a 16MB chip, will offer a 32-bit path to memory, while two chips will make for a 64-bit path. The memory chips will run at 350MHz, or a 700MHz data rate with DDR, so that a 32-bit config will yield 2.8GB/s of local frame buffer bandwidth, and a 64-bit config will have 5.6GB/s to memory. Although system memory can be used freely for rendering tasks, scan-out for video will always happen from the local frame buffer.
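The local bandwidth figures above follow from bus width and data rate, and the arithmetic can be sketched in a few lines of Python. The function name and parameters here are our own illustration rather than anything from NVIDIA, and we assume 1GB means 10^9 bytes.

```python
def peak_bandwidth_gbs(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak bandwidth in GB/s, taking 1 GB as 10**9 bytes.

    transfers_per_clock defaults to 2 because DDR memory moves data on
    both edges of the clock.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# One DRAM chip gives a 32-bit path, two chips give a 64-bit path,
# both at the 350MHz clock quoted in the article.
one_chip = peak_bandwidth_gbs(32, 350)
two_chips = peak_bandwidth_gbs(64, 350)

print(f"32-bit local frame buffer: {one_chip:.1f} GB/s")
print(f"64-bit local frame buffer: {two_chips:.1f} GB/s")
```

These are theoretical peaks only; they say nothing about how much of that bandwidth a real workload can sustain.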
The NV44 can do texturing directly from main memory, as most graphics cards have done since the introduction of AGP texturing way back when. The novel thing with NV44 is the ability to write directly to a texture in system memory. Programmable pixel shading typically creates lots of renderable surfaces in a scene, and NVIDIA has concentrated its efforts on making NV44 able to render directly to system memory "at 100% efficiency," as they put it. The yellow bits in the block diagram on the previous page are the ones modified to make direct rendering to system memory a possibility. As you can see, NVIDIA has added a memory management unit that allows two-way access to system memory from both the pixel shader pipes and the ROP pipelines.
NVIDIA's drivers for the 6200 will allocate as much as 128MB of system memory for graphics on an as-needed basis in a 512MB system. The limits are lower with less RAM, and it's possible that future drivers will allocate up to 256MB of RAM in systems with plenty of memory.
Taken together, the local video RAM plus system RAM should produce copious amounts of available memory bandwidth. For instance, in a system with dual channels of DDR400 memory, a 6200 with a 32-bit path to memory would have a total of 9.2GB/s of theoretical peak memory bandwidth (2.8GB/s local plus 6.4GB/s system). A 64-bit version could have as much as 12GB/s available to it. By contrast, a Radeon X300 card with a 128-bit memory subsystem would have 6.4GB/s of peak theoretical bandwidth.
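To make those totals easy to check, here is a small Python sketch of the sums. It assumes each DDR400 channel is 64 bits wide, and it backs the Radeon X300's 400MT/s data rate out of the 6.4GB/s figure above. The helper name is hypothetical, and simply adding local and system bandwidth assumes both can be saturated at once, which is a best case.

```python
def ddr_bandwidth_gbs(bus_width_bits, data_rate_mts):
    """Peak bandwidth in GB/s from bus width and data rate in MT/s."""
    return bus_width_bits / 8 * data_rate_mts * 1e6 / 1e9


# Two 64-bit channels of DDR400 (400 million transfers per second).
system_ram = ddr_bandwidth_gbs(2 * 64, 400)

# Local frame buffer at a 700MT/s data rate (350MHz DDR).
local_32 = ddr_bandwidth_gbs(32, 700)
local_64 = ddr_bandwidth_gbs(64, 700)

# Radeon X300: 128-bit local memory; the data rate is implied by the
# 6.4GB/s figure in the text rather than stated there.
radeon_x300 = ddr_bandwidth_gbs(128, 400)

totals = {
    "6200 TurboCache, 32-bit": local_32 + system_ram,
    "6200 TurboCache, 64-bit": local_64 + system_ram,
    "Radeon X300, 128-bit": radeon_x300,
}

for card, gbs in totals.items():
    print(f"{card}: {gbs:.1f} GB/s theoretical peak")
```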
Unfortunately, going out to system memory introduces latency, or a delay between the time data is requested and the time it begins to arrive. If one thinks of the system as a network, going from the GPU to local RAM is a single "hop," while going from the GPU to system RAM in a Pentium 4 system is two hops: from the GPU over PCI Express to the north bridge, and from the north bridge to system RAM. Worse yet, the trip from the GPU to system RAM is three hops on an Athlon 64 system: from the GPU over PCI Express to the chipset, from the chipset to the CPU over HyperTransport, and from the CPU's memory controller to system RAM. Each additional chip-to-chip hop adds to the delay.
We can illustrate the difference between a single-hop memory access and a two-hop access by looking at the results from one of our recent processor reviews. In this case, the Pentium 4 is doing a two-hop access to memory (from CPU to north bridge over the system bus, and from the north bridge into RAM) while the Athlon 64's integrated memory controller allows a one-hop access to RAM. The Pentium 4's memory access latencies are nearly twice those of the Athlon 64.
This latency penalty is one reason why main system memory hasn't been a good place for 3D graphics solutions to store data, and it's also why the GeForce 6200 will come with one or two faster local DRAM chips onboard.
NVIDIA claims the 6200 is designed to mask this latency, as graphics pipelines generally are. The company says graphics involves lots of independent memory accesses and parallel work, so with adequate buffering to keep all of that work "in flight" on the GPU, the 6200 should perform reasonably well. The parts of the NV44's pipelines that handle calculations are unchanged from other NV4x chips, save for the noted changes to the ROPs, but the MMU is new. Presumably, the data paths between the MMU and other parts of the chip include a fair amount of buffering.
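How much buffering is "a fair amount"? A standard way to reason about it is the bandwidth-delay product: to keep a link busy, the data in flight must equal bandwidth times latency. The Python sketch below applies that rule of thumb. NVIDIA hasn't disclosed any such figures, so the latencies and the 64-byte request size are purely hypothetical values we picked for illustration; only the 6.4GB/s bandwidth comes from the dual-channel DDR400 peak cited earlier.

```python
def bytes_in_flight(bandwidth_gbs, latency_ns):
    """Bandwidth-delay product: bytes that must be outstanding to keep
    the link saturated. GB/s times ns works out to plain bytes, since
    the 1e9 and 1e-9 factors cancel.
    """
    return bandwidth_gbs * latency_ns


SYSTEM_RAM_GBS = 6.4                     # dual-channel DDR400 peak, from the text
HYPOTHETICAL_LATENCIES_NS = [100, 200, 400]  # assumed values, not measurements
HYPOTHETICAL_REQUEST_BYTES = 64          # assumed request size

for latency in HYPOTHETICAL_LATENCIES_NS:
    needed = bytes_in_flight(SYSTEM_RAM_GBS, latency)
    requests = needed / HYPOTHETICAL_REQUEST_BYTES
    print(f"{latency} ns latency: about {needed:.0f} bytes in flight "
          f"(~{requests:.0f} requests of {HYPOTHETICAL_REQUEST_BYTES} bytes)")
```

The takeaway is the scaling, not the specific numbers: doubling the latency doubles the amount of work the GPU has to keep outstanding to avoid stalling.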
Interestingly enough, NVIDIA says real-world bandwidth between the system RAM and the GPU will be limited by the platform. They claim the Intel 900-series chipsets can achieve about 3GB/s of throughput from memory to the GPU, and only 1GB/s of throughput in the opposite direction, well below the 4GB/s in each direction promised by PCI Express x16. Although AMD64 systems will have higher memory access latencies by nature, NVIDIA says "faster K8 chipsets" will achieve more bandwidth to the GPU than Intel's 900-series platform.
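Plugging those platform limits into the earlier peak totals gives a rough sense of how far the real numbers might fall short. The Python sketch below does that using only figures quoted in this article. Adding local and system bandwidth together is our own simplification and assumes both paths can be kept busy at the same time, so treat the results as upper bounds for an Intel 900-series system.

```python
LOCAL_GBS = {"32-bit": 2.8, "64-bit": 5.6}  # local frame buffer peaks
THEORETICAL_SYSTEM_GBS = 6.4   # dual-channel DDR400 peak
PLATFORM_READ_GBS = 3.0        # memory to GPU, per NVIDIA's claim
PLATFORM_WRITE_GBS = 1.0       # GPU to memory, per NVIDIA's claim


def combined_bandwidth(local_gbs):
    """Return theoretical and platform-limited totals in GB/s."""
    return {
        "theoretical": local_gbs + THEORETICAL_SYSTEM_GBS,
        "read": local_gbs + PLATFORM_READ_GBS,
        "write": local_gbs + PLATFORM_WRITE_GBS,
    }


results = {name: combined_bandwidth(gbs) for name, gbs in LOCAL_GBS.items()}

for name, r in results.items():
    print(f"{name} card: {r['theoretical']:.1f} GB/s on paper, "
          f"{r['read']:.1f} GB/s reading, {r['write']:.1f} GB/s writing")
```

By this estimate, a 32-bit card reading from both pools tops out well under its 9.2GB/s paper figure, and writes to system memory are constrained more tightly still.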