Texturing and memory bandwidth
AMD has endowed the R600 with four texture units that operate independently of the chip's shader core. The R600's texture units and total texture addressing and filtering capacity look similar to the Radeon X1950 XTX's, but with some notable improvements. Those improvements include the ability to filter FP16-format textures, popular for high dynamic range lighting, at full speed (16 pixels per clock) and FP32 textures at half speed. The R600 can do trilinear and anisotropic filtering for all formats. The Radeon X1950 XTX couldn't handle these texture formats in its filtering hardware and had to resort to its pixel shaders instead, so AMD estimates the R600 is roughly seven times the speed of its predecessor in this respect.
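To put the full-speed and half-speed figures in absolute terms, here is a minimal Python sketch. It assumes the Radeon HD 2900 XT's 742MHz core clock from the table later in this section, and it reads "half speed" as simply eight filtered texels per clock; the function and constant names are my own.

```python
# Rough FP16/FP32 filtering rates for the R600, assuming the 742MHz core
# clock of the Radeon HD 2900 XT. Names here are illustrative only.

R600_CORE_MHZ = 742        # core clock from this review's spec table
FULL_SPEED_PER_CLOCK = 16  # filtered texels per clock at full speed


def filter_rate_gtexels(per_clock, core_mhz=R600_CORE_MHZ):
    """Peak filtering rate in Gtexels/s for a given per-clock throughput."""
    return per_clock * core_mhz / 1000.0


fp16_rate = filter_rate_gtexels(FULL_SPEED_PER_CLOCK)      # full speed
fp32_rate = filter_rate_gtexels(FULL_SPEED_PER_CLOCK / 2)  # assumed half speed

print(f"FP16: {fp16_rate:.1f} Gtexels/s, FP32: {fp32_rate:.1f} Gtexels/s")
```

These are theoretical peaks only. Real throughput depends on the filtering mode, cache behavior, and memory traffic.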


Logical diagram of the R600's texture units. Source: AMD.

As in the Radeon X1950 XTX, each R600 texture unit can grab an additional four unfiltered textures per clock from memory using its fetch4 ability, which is the reason you see the four teensy additional texture address processors and texture samplers in the diagram above. This additional capacity to grab data from memory can be useful for certain tasks, like shadowing or stream computing applications.

The texture units can access several of the GPU's many caches, as appropriate, including the L1 texture cache, the vertex cache (32KB), and the L2 texture cache (256KB).


The R600's memory controller layout. Source: AMD.

The memory controller in the R600 is evolved from the one in the R580. Demers said this one is a fully distributed ring bus, not a hybrid like the R580's. He asserted that the ring bus is simpler to design and easier to adapt to new products than the more commonly used crossbar arrangement. The R600's ring is composed of four sets of wires running around the chip in read/write pairs, for a total of about 2000 wires and 1024 bits of communication capacity. The ring bus has about 84 read clients and 70 write clients inside the chip, and PCI Express is just one of the many ring stops, as are the eight 64-bit channels to local memory.

In case I caught you snoozing, I said eight 64-bit channels—that works out to a 512-bit-wide path to memory, well above the 384 bits of the G80. What does all of this mean to the Radeon HD 2900 XT?

| Card | Core clock (MHz) | Pixels/clock | Peak pixel fill rate (Gpixels/s) | Bilinear filtered textures/clock | Peak texel fill rate (Gtexels/s) | Bilinear filtered FP16 textures/clock | Peak FP16 filtering rate (Gtexels/s) | Effective memory clock (MHz) | Memory bus width (bits) | Peak memory bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Radeon X1950 XTX | 650 | 16 | 10.4 | 16 | 10.4 | - | - | 2000 | 256 | 64.0 |
| GeForce 8800 GTS | 500 | 20 | 10.0 | 24 | 12.0 | 24 | 12.0 | 1600 | 320 | 64.0 |
| GeForce 8800 GTX | 575 | 24 | 13.8 | 32 | 18.4 | 32 | 18.4 | 1800 | 384 | 86.4 |
| GeForce 8800 Ultra | 612 | 24 | 14.7 | 32 | 19.6 | 32 | 19.6 | 2160 | 384 | 103.7 |
| Radeon HD 2900 XT | 742 | 16 | 11.9 | 16 | 11.9 | 16 | 11.9 | 1650 | 512 | 105.6 |
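The peak-rate columns in the table follow directly from the clock and width columns. Here is a minimal Python sketch that reproduces them; the function names and the `cards` dictionary are my own, and the inputs are the figures listed above.

```python
# Recompute the table's derived columns from its clocks and widths.
# Function and variable names are illustrative, not from AMD or Nvidia.

def peak_fill_rate(core_mhz, per_clock):
    """Peak fill rate in Gpixels/s or Gtexels/s."""
    return core_mhz * per_clock / 1000.0


def peak_bandwidth(effective_mem_mhz, bus_bits):
    """Peak memory bandwidth in GB/s (decimal gigabytes)."""
    return effective_mem_mhz * bus_bits / 8 / 1000.0


cards = {
    # name: (core MHz, pixels/clock, texels/clock, effective mem MHz, bus bits)
    "Radeon X1950 XTX":   (650, 16, 16, 2000, 256),
    "GeForce 8800 GTS":   (500, 20, 24, 1600, 320),
    "GeForce 8800 GTX":   (575, 24, 32, 1800, 384),
    "GeForce 8800 Ultra": (612, 24, 32, 2160, 384),
    "Radeon HD 2900 XT":  (742, 16, 16, 1650, 512),
}


def summarize(name):
    """Return (Gpixels/s, Gtexels/s, GB/s), each rounded to one decimal."""
    core, px, tx, mem, bus = cards[name]
    return (
        round(peak_fill_rate(core, px), 1),
        round(peak_fill_rate(core, tx), 1),
        round(peak_bandwidth(mem, bus), 1),
    )


for name in cards:
    print(name, summarize(name))
```

For the Radeon HD 2900 XT, eight 64-bit channels give the 512-bit bus, and 1650MHz across 512 bits works out to 105.6 GB/s.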

With 512MB of GDDR3 memory running at 1.65GHz, the Radeon HD 2900 XT has a torrential 105.6 GB/s of peak memory bandwidth, higher than that of the GeForce 8800 Ultra and well above the GTX or GTS. Yet its peak multitextured fill rate is only about 12 Gtexels/s, close to that of the GeForce 8800 GTS and well behind the GTX. AMD seems to have been pleased with the basic fill rate and filtering capabilities of the Radeon X1950 XTX and chose only to extend them in R600 to include HDR texture formats. Texturing is indeed becoming less important as programmable shaders gain traction, but many of those shaders store or access data in textures, which is a concern. The Radeon HD 2900 XT trails Nvidia's fastest graphics cards by miles here, despite having a wider path to memory.
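One rough way to see that imbalance is to divide each card's peak memory bandwidth by its peak bilinear texel rate. The sketch below does so with the figures from the table; the "bytes per texel" ratio is an illustrative metric of my own, not one AMD or Nvidia quotes, and it ignores caches and compression.

```python
# Peak memory bandwidth available per peak bilinear-filtered texel.
# This ratio is an illustrative metric, not a vendor specification.

def bytes_per_texel(bandwidth_gb_s, texel_rate_gtexels_s):
    """GB/s divided by Gtexels/s gives bytes of bandwidth per texel."""
    return bandwidth_gb_s / texel_rate_gtexels_s


specs = {
    # name: (peak memory bandwidth GB/s, peak texel fill rate Gtexels/s)
    "Radeon X1950 XTX":   (64.0, 10.4),
    "GeForce 8800 GTS":   (64.0, 12.0),
    "GeForce 8800 GTX":   (86.4, 18.4),
    "GeForce 8800 Ultra": (103.7, 19.6),
    "Radeon HD 2900 XT":  (105.6, 11.9),
}

ratios = {name: bytes_per_texel(bw, tex) for name, (bw, tex) in specs.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} bytes of bandwidth per peak texel")
```

By this measure the Radeon HD 2900 XT has close to twice the bandwidth per texel of the GeForce 8800 GTX, which suggests its texture units, not its memory interface, set the ceiling on texel throughput.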

Here are a few quick texture fill rate and filtering tests to see how these theoretical peak numbers play out.

The Radeon HD 2900 XT gets closer than any of the other cards to its theoretical maximum pixel fill rate, probably because it has sufficient memory bandwidth to make that happen. When we switch to multitexturing, the chips reach very near their theoretical limits, which puts the Radeon HD 2900 XT just behind the GeForce 8800 GTS. These are not FP16 textures, so the Radeon X1950 XTX performs reasonably well, too.