Texturing and memory bandwidth
AMD has endowed the R600 with four texture units that operate independently of the chip's shader core. The R600's texture units and total texture addressing and filtering capacity look similar to the Radeon X1950 XTX's, but with some notable improvements. Those improvements include the ability to filter FP16-format textures, which are popular for high dynamic range lighting, at full speed (16 pixels per clock) and FP32 textures at half speed. The R600 can do trilinear and anisotropic filtering for all formats. The Radeon X1950 XTX couldn't handle these texture formats in its filtering hardware and had to resort to its pixel shaders instead, so AMD estimates the R600 is roughly seven times the speed of its predecessor in this respect.
Like the Radeon X1950 XTX, each R600 texture unit can grab an additional four unfiltered textures per clock from memory using its fetch4 ability, which is the reason you see the four teensy additional texture address processors and texture samplers in the diagram above. This additional capacity to grab data from memory can be useful for certain tasks like shadowing or stream computing applications.
The texture units can access several of the GPU's many caches, as appropriate, including the L1 texture cache, the vertex cache (32KB), and the L2 texture cache (256KB).
The memory controller in the R600 is evolved from the one in the R580. Demers said this one is a fully distributed ring bus, not a hybrid like the R580's. Demers asserts the ring bus is simpler to design and easier to adapt to new products than the more commonly used crossbar arrangement. The R600's ring consists of four sets of wires running around the chip in read/write pairs, for a total of about 2000 wires and 1024 bits of communication capacity. The ring bus has about 84 read clients and 70 write clients inside the chip, and PCI Express is just one of the many ring stops, as are the eight 64-bit channels to local memory.
In case I caught you snoozing, I said eight 64-bit channels. That works out to a 512-bit-wide path to memory, well above the 384 bits of the G80. What does all of this mean to the Radeon HD 2900 XT?
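| Card | Core clock (MHz) | Pixels/clock | Peak pixel fill rate (Gpixels/s) | Textures/clock | Peak texel fill rate (Gtexels/s) | FP16 textures/clock | Peak FP16 filtering rate (Gtexels/s) | Effective memory clock (MHz) | Memory bus width (bits) | Peak memory bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Radeon X1950 XTX | 650 | 16 | 10.4 | 16 | 10.4 | - | - | 2000 | 256 | 64.0 |
| GeForce 8800 GTS | 500 | 20 | 10.0 | 24 | 12.0 | 24 | 12.0 | 1600 | 320 | 64.0 |
| GeForce 8800 GTX | 575 | 24 | 13.8 | 32 | 18.4 | 32 | 18.4 | 1800 | 384 | 86.4 |
| GeForce 8800 Ultra | 612 | 24 | 14.7 | 32 | 19.6 | 32 | 19.6 | 2160 | 384 | 103.7 |
| Radeon HD 2900 XT | 742 | 16 | 11.9 | 16 | 11.9 | 16 | 11.9 | 1650 | 512 | 105.6 |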
With 512MB of GDDR3 memory running at 1.65GHz, the Radeon HD 2900 XT has a torrential 105.6 GB/s of peak memory bandwidth, higher than that of the GeForce 8800 Ultra and well above the GTX or GTS. Yet its peak multitextured fill rate is only about 12 Gtexels/s, close to that of the GeForce 8800 GTS and well behind the GTX. AMD seems to have been pleased with the basic fill rate and filtering capabilities of the Radeon X1950 XTX and chose only to extend them in R600 to include HDR texture formats. Texturing is indeed becoming less important as programmable shaders gain traction, but many of those shaders store or access data in textures, which is a concern. The Radeon HD 2900 XT trails Nvidia's fastest graphics cards by miles here, despite having a wider path to memory.
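The peak figures quoted above come from simple arithmetic: texel fill rate is core clock times textures per clock, and memory bandwidth is effective memory clock times bus width in bytes. Here is a minimal Python sketch of that math, using the clocks, unit counts, and bus widths listed for each card. The `Card` class and its field and method names are made up for illustration, and the bytes-per-texel ratio at the end is just one rough way to show the imbalance, not a figure from AMD or Nvidia.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    """Hypothetical container for the per-card specs listed in the table."""
    name: str
    core_mhz: int            # core clock in MHz
    textures_per_clock: int  # texels filtered per clock
    mem_mhz: int             # effective memory clock in MHz
    bus_bits: int            # memory bus width in bits

    def texel_rate(self) -> float:
        """Peak texel fill rate in Gtexels/s."""
        return self.core_mhz * self.textures_per_clock / 1000

    def bandwidth(self) -> float:
        """Peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
        return self.mem_mhz * self.bus_bits / 8 / 1000

    def bytes_per_texel(self) -> float:
        """Illustrative ratio: peak bandwidth available per peak texel."""
        return self.bandwidth() / self.texel_rate()


CARDS = [
    Card("Radeon X1950 XTX", 650, 16, 2000, 256),
    Card("GeForce 8800 GTS", 500, 24, 1600, 320),
    Card("GeForce 8800 GTX", 575, 32, 1800, 384),
    Card("GeForce 8800 Ultra", 612, 32, 2160, 384),
    Card("Radeon HD 2900 XT", 742, 16, 1650, 512),
]

if __name__ == "__main__":
    for card in CARDS:
        print(
            f"{card.name:20s} "
            f"{card.texel_rate():5.1f} Gtexels/s  "
            f"{card.bandwidth():6.1f} GB/s  "
            f"{card.bytes_per_texel():4.1f} bytes/texel"
        )
```

Run as a script, this prints one line per card. By this rough measure, the Radeon HD 2900 XT has far more peak bandwidth behind each texel than the GeForce 8800 cards do, which is another way of saying its texturing rate is low relative to its memory path.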
Here are a few quick texture fill rate and filtering tests to see how these theoretical peak numbers play out.