Z-buffer optimizations
Realizing that semiconductor complexity is continually outpacing memory bandwidth, ATI has concentrated on Z-buffer optimizations to reduce this disparity. These optimizations are Z-buffer compression, fast Z-buffer clear and Hierarchical Z. We shall cover each of them in turn.

Z-buffer compression
Z-buffer compression is the basic building block of ATI's depth buffer optimizations. In this scheme, the Z-buffer is subdivided into square tiles, each containing 16 depth values. Each tile of depth values is compressed to between one half and one quarter of its original size. Decompression takes place on a tile-by-tile basis, so an entire tile of depth values must be decompressed at once, even if only one or two depth values are needed. This waste has been reduced in the R200's implementation by using a tile size of finer granularity (16 depth values).
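
The tile arithmetic above can be made concrete with a small sketch. It is only an illustration: the 4x4 tile shape follows from "square tiles of 16 depth values", but the row-major tile order, the function names, and the 4 bytes per stored depth value used in the usage note are assumptions, and nothing here reflects ATI's actual compression format.

```python
TILE_SIDE = 4                              # 4 x 4 = 16 depth values per tile
VALUES_PER_TILE = TILE_SIDE * TILE_SIDE


def tile_of(x, y, width):
    """Return (tile_index, offset_in_tile) for the depth value at pixel (x, y).

    Assumes tiles are laid out in row-major order (an assumption, not a
    documented R200 layout).
    """
    tiles_per_row = (width + TILE_SIDE - 1) // TILE_SIDE   # round up for partial tiles
    tile_index = (y // TILE_SIDE) * tiles_per_row + (x // TILE_SIDE)
    offset = (y % TILE_SIDE) * TILE_SIDE + (x % TILE_SIDE)
    return tile_index, offset


def bytes_fetched_for_one_value(bytes_per_value, ratio):
    """Bytes read to reach a single depth value when the whole tile is fetched.

    ratio is the compressed size as a fraction of the original:
    0.5 or 0.25 per the text, 1.0 for an uncompressed tile.
    """
    return VALUES_PER_TILE * bytes_per_value * ratio
```

For example, `tile_of(5, 2, 1024)` places the pixel in tile 1 at offset 9. Assuming 4 bytes per stored depth value, reading that one value costs a 32-byte fetch at one-half compression and a 16-byte fetch at one-quarter compression, against 64 bytes for the uncompressed tile.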

The '3DMark 2001' fill rate test demonstrates the effectiveness of Z-compression under ideal conditions, because only a handful of large polygons that occupy the entire screen are rendered per frame. Since the polygons are all translucent, fill rate measurement is not inflated by Z-occlusion culling (GeForce3) or Hierarchical Z (R200). Furthermore, since Z-buffer contents are inconsequential, there is no need to clear the Z-buffer each frame. Therefore, the 3DMark 2001 fill rate test isolates the effects of Z-compression.

The graph (Figure 1) shows an 18% difference in fill rate with and without compression of a 24-bit precision Z-buffer. No appreciable benefit was seen when compression was enabled for a 16-bit Z-buffer. Nevertheless, enabling Z-buffer compression is a prerequisite for 'Fast Z-clear' and 'Hierarchical Z'.


Figure 1: The effect of Z-compression on R200 fill rate

Table 2: R200 '3DMark 2001' fill rate (32-bit color/24-bit Z)
Resolution            640x480  800x600  1024x768  1280x1024  1600x1200
R200 baseline           641.4    660.1     664.9      664.3      663.8
R200 Z-optimizations    770.4    783.5     784.4      781.4      781.9
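
The 18% figure quoted earlier can be checked against the numbers in Table 2. The short calculation below assumes that the "baseline" and "Z-optimizations" rows correspond to the runs without and with the optimizations; the variable and function names are only illustrative.

```python
resolutions = ["640x480", "800x600", "1024x768", "1280x1024", "1600x1200"]
baseline = [641.4, 660.1, 664.9, 664.3, 663.8]    # "R200 baseline" row of Table 2
optimized = [770.4, 783.5, 784.4, 781.4, 781.9]   # "R200 Z-optimizations" row


def percent_gain(before, after):
    """Relative fill rate improvement, in percent."""
    return (after - before) / before * 100.0


gains = {
    res: percent_gain(b, o)
    for res, b, o in zip(resolutions, baseline, optimized)
}

for res, gain in gains.items():
    print(f"{res:>10}: {gain:5.1f}% higher fill rate")
```

At 1024x768 the gain works out to roughly 18%, and it stays between about 17.6% and 20.1% across all five resolutions.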

Fast Z-buffer clear
After each frame of a 3D scene is drawn, the Z-buffer must be erased before it can start receiving data for the next frame. When the application is not saturating the graphics pipeline, the latency of Z-clear is not perceptible. When the graphics subsystem is fully loaded, however, reducing Z-clear latency produces a dramatic speedup.
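
One way to avoid rewriting every depth value on each clear, and the approach this section goes on to describe for the R200, is to keep a small status entry per block and simply mark it as cleared. The sketch below is a software model of that idea rather than ATI's hardware: the class and method names, the 4x4 block shape, the clear depth of 1.0, and the write counter are all assumptions made for illustration, and compression itself is not modelled.

```python
from enum import Enum


class BlockState(Enum):
    CLEARED = 0
    COMPRESSED = 1      # listed in the text; compression is not modelled here
    UNCOMPRESSED = 2


class FastClearZBuffer:
    BLOCK_SIDE = 4      # 4 x 4 = 16 depth values per block

    def __init__(self, width, height, clear_depth=1.0):
        self.width = width
        self.height = height
        self.clear_depth = clear_depth
        self.blocks_per_row = (width + self.BLOCK_SIDE - 1) // self.BLOCK_SIDE
        block_rows = (height + self.BLOCK_SIDE - 1) // self.BLOCK_SIDE
        self.depth = [clear_depth] * (width * height)
        self.status = [BlockState.CLEARED] * (self.blocks_per_row * block_rows)
        self.depth_writes = 0   # counts writes to the depth array itself

    def _block(self, x, y):
        return (y // self.BLOCK_SIDE) * self.blocks_per_row + x // self.BLOCK_SIDE

    def _materialise(self, x, y):
        # Give the block real clear values the first time it is written to.
        x0 = (x // self.BLOCK_SIDE) * self.BLOCK_SIDE
        y0 = (y // self.BLOCK_SIDE) * self.BLOCK_SIDE
        for yy in range(y0, min(y0 + self.BLOCK_SIDE, self.height)):
            for xx in range(x0, min(x0 + self.BLOCK_SIDE, self.width)):
                self.depth[yy * self.width + xx] = self.clear_depth
                self.depth_writes += 1

    def clear(self):
        # Only the status table is touched; the depth array is left as it is.
        self.status = [BlockState.CLEARED] * len(self.status)

    def read(self, x, y):
        # A cleared block reports the clear depth without reading stale data.
        if self.status[self._block(x, y)] is BlockState.CLEARED:
            return self.clear_depth
        return self.depth[y * self.width + x]

    def write(self, x, y, z):
        block = self._block(x, y)
        if self.status[block] is BlockState.CLEARED:
            self._materialise(x, y)
            self.status[block] = BlockState.UNCOMPRESSED
        self.depth[y * self.width + x] = z
        self.depth_writes += 1
```

In this model, calling `clear()` leaves `depth_writes` unchanged, and a subsequent `read()` of a previously written pixel returns the clear depth even though the old value is still sitting in the depth array.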

The R200 clears the Z-buffer in a fraction of the time, without having to write anything to the Z-buffer itself. As mentioned earlier, the Z-buffer is subdivided into 16-pixel blocks. In addition, the status of each block (i.e. whether the block is compressed, uncompressed or 'Z-cleared') is stored in a lookup table. This table provides a means of clearing the Z-buffer one block at a time rather than pixel by pixel. All that is needed is to update the status of each block in the lookup table as having been 'Z-cleared'. This method avoids the time-consuming task of physically filling the Z-buffer with zeroes.

Hierarchical Z-buffer
The usefulness of a hierarchical Z-buffer (HZ) lies in the early exclusion of non-visible pixels from the graphics pipeline, before textures are applied; this saves both bandwidth and fill rate. The diagram (Figure 2) shows the HZ residing in the silicon of the graphics processor, but it may in fact be stored in graphics memory. The HZ is a low-resolution representation of the Z-buffer: each depth value in the HZ represents a block of pixels. An improvement of the R200 over its predecessor is that each pixel block has been reduced from 64 pixels to 16 pixels, which increases the extent of pixel rejection. However, to maintain an acceptable rate of pixel processing, the number of Z-check units would have to be increased.
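
The effect of block size on early rejection can be illustrated with a toy model. This is a sketch under stated assumptions, not the R200's design: it assumes a "less than" depth test with 1.0 as the farthest depth, stores the farthest depth of each block as the coarse value, and tests one pixel at a time, whereas real hardware works on groups of pixels. All class and function names are hypothetical.

```python
class HierarchicalZ:
    def __init__(self, width, height, block_side, far=1.0):
        self.width = width
        self.height = height
        self.block_side = block_side
        self.blocks_per_row = (width + block_side - 1) // block_side
        block_rows = (height + block_side - 1) // block_side
        self.z = [far] * (width * height)                       # full-resolution Z-buffer
        self.block_max = [far] * (self.blocks_per_row * block_rows)  # coarse HZ values
        self.early_rejects = 0

    def _block(self, x, y):
        return (y // self.block_side) * self.blocks_per_row + x // self.block_side

    def _farthest_in_block(self, x, y):
        s = self.block_side
        x0, y0 = (x // s) * s, (y // s) * s
        return max(
            self.z[yy * self.width + xx]
            for yy in range(y0, min(y0 + s, self.height))
            for xx in range(x0, min(x0 + s, self.width))
        )

    def submit(self, x, y, z):
        """Depth-test one pixel; return True if it is visible and written."""
        block = self._block(x, y)
        if z >= self.block_max[block]:
            # Behind everything in the block: rejected without touching the Z-buffer.
            self.early_rejects += 1
            return False
        i = y * self.width + x
        if z >= self.z[i]:
            return False            # rejected by the ordinary per-pixel test
        self.z[i] = z
        self.block_max[block] = self._farthest_in_block(x, y)
        return True


def count_early_rejects(block_side):
    """Draw a near wall over the left half of an 8x8 area, then a far layer over all of it."""
    hz = HierarchicalZ(8, 8, block_side)
    for y in range(8):
        for x in range(4):
            hz.submit(x, y, 0.2)    # near wall, left half only
    for y in range(8):
        for x in range(8):
            hz.submit(x, y, 0.8)    # far layer, whole area
    return hz.early_rejects
```

With 16-pixel blocks (`count_early_rejects(4)`), the 32 far pixels hidden behind the wall are all rejected early. With a single 64-pixel block (`count_early_rejects(8)`), none are, because the uncovered right half keeps the block's farthest depth at the far plane while the hidden pixels are processed.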

Unlike the depth value in a traditional Z-buffer, which may have 16-bit or higher precision, the depth value in the HZ is, by necessity, of limited precision. This lack of precision might have been the cause of the tile-shaped voids and flickering tiles ('Z-fighting') in R1, first documented in my article, Radeon In-Depth, published in September 2000. Thankfully, these issues have been ironed out in the R200.
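
To see how limited precision in a coarse depth value can cause whole blocks to misbehave, consider the generic sketch below. It is not a diagnosis of the R1 hardware: the 8-bit precision, the rounding modes, and the function names are assumptions chosen purely for illustration, using the same "farthest depth per block" convention and "less than" depth test as a typical software model.

```python
import math


def quantize(depth, bits, round_up):
    """Store a depth in [0, 1] with the given number of bits.

    round_up=True is the conservative choice for a 'farthest depth' value:
    the stored value is never nearer than the true one.
    """
    levels = (1 << bits) - 1
    scaled = depth * levels
    step = math.ceil(scaled) if round_up else math.floor(scaled)
    return step / levels


def early_reject(fragment_z, coarse_block_max):
    """Reject a fragment that is at or behind the block's stored farthest depth."""
    return fragment_z >= coarse_block_max


true_block_max = 0.5004   # farthest depth actually present in the block
fragment_z = 0.5000       # slightly nearer, so it should be visible somewhere

rounded_down = quantize(true_block_max, 8, round_up=False)
rounded_up = quantize(true_block_max, 8, round_up=True)

print("rounded down:", early_reject(fragment_z, rounded_down))
print("rounded up:  ", early_reject(fragment_z, rounded_up))
```

When the coarse value is rounded down, the visible fragment is wrongly rejected, and because the decision is made per block, the error would show up as a missing tile rather than a missing pixel. Rounding up only costs some missed rejections, never a visible pixel.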


Figure 2: Overview of the graphics pipeline in relation to Z-buffer optimization