
Texturing, memory hierarchy, and render back-ends

A single RV770 texture unit. Source: AMD.

Like the shaders, the texture units in the RV770 have been extensively streamlined. Hartog claimed an incredible 70% increase in performance per square millimeter for these units. Not only that, but as I've mentioned, the texture units are now aligned with shader SIMDs, so future RV770-based designs could scale the amount of processing power up or down while maintaining the same ratio of shader power to texture filtering capacity. Interestingly enough, the RV770 retains the same shader-to-texture capacity mix as the RV670 and the R600 before it. Nvidia has moved toward a more shader-heavy mix recently with the release of the GT200, but the Radeons still have a substantially higher ratio of gigaflops to gigatexels.

With 10 texture units onboard, the RV770 can sample and bilinearly filter up to 40 texels per clock. That's up from 16 texels per clock on RV670, a considerable increase. One of the ways AMD managed to squeeze down the size of its texture units was by taking a page from Nvidia's playbook and filtering FP16 texture formats at half the usual rate. As a result, the RV770's peak FP16 filtering rate is only slightly up from RV670. Still, Hartog described the numbers game here as less important than the reality of measured throughput.
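For those who like to see the arithmetic, here's a rough sketch of the per-clock filtering rates those figures imply. The four-texels-per-unit figure is simply 40 divided by 10, and the assumption that RV670 filters FP16 at full rate is my reading of the paragraph above rather than a number AMD supplied. The function and variable names are made up for illustration, and clock speeds are left out entirely, so these are per-clock comparisons only.

```python
# Per-clock bilinear filtering rates implied by the figures in the text.
# Assumption: each RV770 texture unit filters 4 texels per clock (40 / 10).
TEXELS_PER_UNIT_PER_CLOCK = 4


def texels_per_clock(texture_units, rate_divisor=1):
    """Peak bilinear-filtered texels per clock.

    rate_divisor=2 models a format that filters at half rate (FP16 on RV770).
    """
    return texture_units * TEXELS_PER_UNIT_PER_CLOCK // rate_divisor


rv770_standard = texels_per_clock(10)               # standard formats
rv770_fp16 = texels_per_clock(10, rate_divisor=2)   # half-rate FP16

rv670_standard = 16  # from the text
rv670_fp16 = 16      # assumption: RV670 filters FP16 at full rate

print("RV770 standard:", rv770_standard, "texels/clock")
print("RV770 FP16:    ", rv770_fp16, "texels/clock")
print("Standard gain over RV670:", rv770_standard / rv670_standard)
print("FP16 gain over RV670:    ", rv770_fp16 / rv670_fp16)
```

Under those assumptions, the standard-format rate grows by 2.5x per clock while the FP16 rate grows by only 1.25x, which squares with the "only slightly up" description.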

To ensure that throughput is what it should be, the design team overhauled the RV770's caches extensively, replacing the R600's "distributed unified cache" with a true L1/L2 cache hierarchy.

A block diagram of the RV770's cache hierarchy. Source: AMD.

Each L1 texture cache is associated with a SIMD/texture unit block and stores unique data for it, and each L2 cache is aligned with a memory controller. Much of this may sound familiar if you've read about certain competitors to RV770. No doubt AMD has learned from its opponents.

Furthermore, Hartog said RV770 uses a new cache allocation routine that delays the allocation of space in the L1 cache until the request for that data is fulfilled. This mechanism should allow RV770 to use its texture caches more efficiently. Vertices are stored in their own separate cache. Meanwhile, the chip's internal bandwidth is twice that of the previous generation, a provision necessary, Hartog said, to keep pace with the amount of data coming in from GDDR5 memory. He claimed transfer rates of up to 480 GB/s for an L1 texture fetch and up to 384 GB/s for data transfers between the L1 and L2 caches.

An overview of the RV770's memory interface. Source: AMD.

The RV770's reworked memory subsystem doesn't stop at the caches, either. AMD's vaunted ring bus is dead and gone, and it hasn't even been replaced by a crossbar. Instead, RV770 opts for a simpler approach. The GPU's four memory controllers are distributed around the edges of the chip, next to their primary bandwidth consumers, including the render back-ends and the L2 caches. Data is partitioned via tiling to maintain good locality of reference for each controller/cache pair, and a hub passes lower-bandwidth data to and from the I/O units for PCI Express, display controllers, the UVD2 video engine, and the CrossFireX interconnect. AMD claims this approach brings efficiency gains, with the RV770 capable of reaching 95% of its theoretical peak bandwidth, up 10% from the RV670.
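To make the tiling idea a little more concrete, here's a toy sketch of how screen tiles might be divided among four controller/cache pairs. AMD hasn't described the actual mapping in the material we were given, so everything here beyond the count of four controllers is an assumption for illustration: the 8-pixel tile size, the 2x2 checkerboard pattern, and the function name are all hypothetical.

```python
# Toy model of partitioning a render target across memory controllers by tile.
# Only the controller count comes from the text; the rest is hypothetical.
NUM_CONTROLLERS = 4  # from the text: RV770 has four memory controllers
TILE_SIZE = 8        # hypothetical tile edge in pixels, not an AMD figure


def controller_for_pixel(x, y, tile_size=TILE_SIZE):
    """Map a pixel to a controller index (0-3) using a 2x2 checkerboard of tiles.

    Every pixel inside one tile lands on the same controller, which keeps
    accesses local, while neighboring tiles land on different controllers,
    which spreads the load.
    """
    tile_x = x // tile_size
    tile_y = y // tile_size
    return (tile_x % 2) + 2 * (tile_y % 2)


# Print the controller that owns each tile in a 4x4 grid of tiles.
for tile_row in range(4):
    owners = [controller_for_pixel(tile_col * TILE_SIZE, tile_row * TILE_SIZE)
              for tile_col in range(4)]
    print(" ".join(str(owner) for owner in owners))
```

The point of a scheme like this is that any 2x2 block of adjacent tiles touches all four controllers, so a triangle spanning several tiles draws bandwidth from every controller rather than hammering one.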

These gains alone wouldn't allow the RV770 to realize its full potential, however, with only a 256-bit aggregate path to memory. For extra help in this department, AMD worked with DRAM vendors to develop a new memory type, GDDR5. GDDR5 keeps the single-ended signaling used in current DRAM types and uses a range of techniques to achieve higher bandwidth. Among them: a new clocking architecture, an error-detection protocol for the wires, and individual training of DRAM devices upon startup. AMD's Joe Macri, who heads the JEDEC DRAM and GDDR5 committees, points out that this last feature should allow for additional overclocking headroom with better cooling, since DRAM training will respond to improvements in environmental conditions.

GDDR5's command clock runs at a quarter of the data rate, which is presumably why the Radeon HD 4870's memory clock shows up as 900MHz when the actual data rate is 3600 MT/s. Do the math, and you'll find that the 4870's peak memory bandwidth works out to 115.2 GB/s, which is even more than the Radeon HD 2900 XT managed with a 512-bit interface or what the GeForce GTX 260 can reach with a 448-bit interface to GDDR3. And that's with 3.6Gbps devices. AMD says it's already seeing 5Gbps GDDR5 memory and expects to see 6Gbps before the end of the year.
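Here's that math spelled out as a short sketch. The function names are mine, and the 5Gbps and 6Gbps rows are purely hypothetical: they assume the faster memory would be paired with the same 256-bit interface, which AMD has not announced.

```python
# Peak memory bandwidth from command clock, transfers per clock, and bus width.

def data_rate_mtps(command_clock_mhz, transfers_per_clock=4):
    """GDDR5 moves four transfers per command clock, per the text."""
    return command_clock_mhz * transfers_per_clock


def peak_bandwidth_gb_per_s(bus_width_bits, rate_mtps):
    """Bytes per transfer (bus width / 8) times millions of transfers per second."""
    return bus_width_bits / 8 * rate_mtps / 1000


hd4870_rate = data_rate_mtps(900)                            # 3600 MT/s
hd4870_bandwidth = peak_bandwidth_gb_per_s(256, hd4870_rate)  # 115.2 GB/s

print("Radeon HD 4870:", hd4870_rate, "MT/s,", hd4870_bandwidth, "GB/s")

# Hypothetical: the faster GDDR5 speeds AMD mentions, on the same 256-bit bus.
for per_pin_gbps in (5, 6):
    bandwidth = peak_bandwidth_gb_per_s(256, per_pin_gbps * 1000)
    print(per_pin_gbps, "Gbps devices on 256 bits:", bandwidth, "GB/s")
```

On a 256-bit bus, each additional 1Gbps per pin is worth another 32 GB/s of peak bandwidth.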

An RV770 render back-end unit. Source: AMD.

The final element in the RV770's wide-ranging re-plumbing of the R600 architecture comes in the form of heavily revised render back-ends. (For the confused, Nvidia calls these exact same units ROPs, but we'll use AMD's term in discussing its chips.) One of the RV770 design team's major goals was to improve antialiasing performance, and render back-ends are key to doing so. Looking at the diagram above, the RV770's render back-end doesn't look much different from any other, and the chip only has four of them, so what's the story?

Well, for one, the individual render back-end units are quite a bit more powerful. Below is a table supplied by Hartog that shows the total render back-end capacity of the RV770 versus RV670, both of which have the same number of units on chip.

RV670 versus RV770 total render back-end throughput. Source: AMD.

According to this table, the RV770's render back-ends are twice as fast as the RV670's in many situations: for any form of multisampled AA and for 64-bit color modes even without AA. Not only that, but the RV770 can perform up to 64 Z or stencil operations per clock cycle. Hartog identified the RV670's Z rate as the primary limiting factor in the RV670's antialiasing performance.

That's not the whole story, however. Ever since the R600 first appeared, we've heard rumors that its render back-ends were essentially broken in that they would not perform the resolve step for multisampled AA. Instead, the R600 and its family handled this task in the shader core. Shader-based resolve did allow AMD to do some nice things with custom AA filters, but the R600 family's relatively weak AA performance was always a head-scratcher. Why do it that way, if it's so slow?

I suspect, as a result of the shader-based resolve, that the numbers you see for RV670 in the table above are, shall we say, optimistic. They may be correct as theoretical peaks, but the RV670 probably doesn't often reach them.

Fortunately, AMD has confirmed to us that the RV770 no longer uses its shader core for standard MSAA resolve. If there was a problem with the R6xx chips' render back-ends—and AMD still denies it—that issue has been fixed. The RV770 will still use shader-based resolve for AMD's custom-filter AA modes, but for regular box filters, the work is handled in custom hardware in the render back-ends—as it was on pre-R600 Radeons and on all modern GeForce GPUs.