This article contains few benchmarks. While my testing has been rigorous, I prefer to showcase selected results that are meaningful and relevant to the discussion. Results that do not yield additional useful information were intentionally omitted. If one is looking for a recital of scores or frames per second, it is not here.
This is not ‘copy and paste’ marketing material or an ‘A to Z’ feature list. There are plenty of these available at various hardware sites. What you will find, instead, is an explanation of salient features of the R200 graphics pipeline (Z-buffer optimizations, vertex and pixel processing, anti-aliasing) and performance characteristics, as well as implementation of image quality features.
With regard to images, it is not my intention to produce a photo album of demos or games. Images of relevant scenes were captured at a low resolution of 320×240 pixels to save space. They have not been downsampled and filtered. If it is necessary to display in detail a portion of the scene, that portion is captured at 1600×1200 resolution, again without additional processing. In one or two instances, I have magnified the textures taking care not to filter the textures inadvertently.
With that out of the way, I hope you will find the article thought-provoking and refreshing. Remember, you read this at The Tech Report first.
|Table 1: System specifications|
|Windows 98, DirectX 8.1
Athlon Thunderbird 1000MHz, cooled with a PowerCooler PCH137
ChainTech 7 SID (SiS 735 chipset) motherboard
256MB PC2100 DDR SDRAM
|ATI RADEON 8500 (R200) – 250MHz, 1000 megapixels per second
64 megabytes of double data rate SDRAM – 250MHz, 8 gigabytes per second
Driver version: 4.13.01.7206
|NVIDIA GEFORCE3 – 200MHz, 800 megapixels per second
64 megabytes of double data rate SDRAM – 230MHz, 7.36 gigabytes per second
Driver version: 4.13.01.2183
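The peak figures in Table 1 follow from clock speed multiplied by width. As a quick check, the Python sketch below assumes four pixel pipelines and a 128-bit memory bus on both cards (neither value is stated in the table) and counts a gigabyte as 10^9 bytes. The function names are mine.

```python
def fill_rate_mpixels(core_mhz, pipelines=4):
    """Peak fill rate in megapixels/s, one pixel per pipeline per clock.

    The pipeline count is an assumption, not a figure from Table 1.
    """
    return core_mhz * pipelines


def ddr_bandwidth_gb(mem_mhz, bus_bits=128):
    """Peak DDR bandwidth in GB/s: two transfers per clock, bus_bits wide.

    The bus width is an assumption, not a figure from Table 1.
    """
    bytes_per_transfer = bus_bits // 8
    return mem_mhz * 1e6 * 2 * bytes_per_transfer / 1e9


print("R200:    ", fill_rate_mpixels(250), ddr_bandwidth_gb(250))
print("GeForce3:", fill_rate_mpixels(200), ddr_bandwidth_gb(230))
```

Under those assumptions the sketch reproduces the 1000 and 800 megapixels per second and the 8 and 7.36 gigabytes per second quoted above.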
Realizing that semi-conductor complexity is continually outpacing memory bandwidth, ATI has concentrated on Z-buffer optimizations to reduce this disparity. Z-buffer optimizations include Z-buffer compression, fast Z-buffer clear and Hierarchical Z. We shall cover each of these in turn.
Z-buffer compression is the basic building block of ATI’s depth buffer optimizations. In this scheme, the Z-buffer is subdivided into square tiles, each containing 16 depth values. Each tile of depth values is compressed to between one half and one quarter of its original size. Naturally, decompression takes place on a tile-by-tile basis. Therefore, an entire tile of depth values must be decompressed at once, even if only to access one or two depth values. This waste has been reduced in the R200’s implementation by using a tile size of finer granularity (16 depth values).
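To put a number on that waste, the sketch below maps pixels to tiles and counts how many depth values must be unpacked to read them. It assumes square tiles stored row by row; the 8×8 tile used for comparison is a hypothetical coarser size, and the compression scheme itself is not modelled.

```python
def tile_index(x, y, screen_w, tile=4):
    """Index of the tile holding pixel (x, y); tiles are stored row by row."""
    tiles_per_row = -(-screen_w // tile)  # ceiling division
    return (y // tile) * tiles_per_row + (x // tile)


def values_decompressed(pixels, screen_w, tile=4):
    """Depth values that must be unpacked to read the given pixels."""
    touched = {tile_index(x, y, screen_w, tile) for x, y in pixels}
    return len(touched) * tile * tile


one_pixel = [(5, 5)]
print(values_decompressed(one_pixel, 1024, tile=4))  # 4x4 tile, 16 values
print(values_decompressed(one_pixel, 1024, tile=8))  # hypothetical 8x8 tile
```

Reading a single depth value costs a whole tile either way, so the smaller tile unpacks a quarter as many values.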
The ‘3DMark 2001’ fill rate test demonstrates the effectiveness of Z-compression under ideal conditions, because only a handful of large polygons that occupy the entire screen are rendered per frame. Since the polygons are all translucent, fill rate measurement is not inflated by Z-occlusion culling (GeForce3) or Hierarchical Z (R200). Furthermore, since Z-buffer contents are inconsequential, there is no need to clear the Z-buffer each frame. Therefore, the 3DMark 2001 fill rate test isolates the effects of Z-compression.
The graph (Figure 1) shows an 18% difference in fill rate with and without compression of a 24-bit precision Z-buffer. There was no appreciable benefit demonstrated when compression is enabled for 16-bit. Nevertheless, enabling Z-buffer compression is a prerequisite for ‘Fast Z-clear’ and ‘Hierarchical Z’.
|Table 2: R200 ‘3DMark 2001’ fill rate (32-bit color/24-bit Z)|
‘Fast Z-buffer clear’
After each frame of a 3D scene is drawn, the Z-buffer must be erased before it can start receiving data for the next frame. When the application is not saturating the graphics pipeline, the latency of Z-clear is not perceptible. When the graphics subsystem is fully loaded, however, the effort put into reducing Z-clear latency pays dramatic dividends.
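A conventional clear writes every depth value, every frame. One way around that cost, along the lines of the block status table described in the next paragraph, is sketched below in Python. The class name, the status codes and the `clear_value` parameter are illustrative assumptions; this models the idea, not ATI's hardware.

```python
CLEARED, IN_USE = 0, 1   # hypothetical status codes (a real table would also
                         # distinguish compressed from uncompressed blocks)
BLOCK = 16               # depth values per block


class BlockedZBuffer:
    def __init__(self, num_pixels, clear_value=0.0):
        self.clear_value = clear_value
        self.num_blocks = -(-num_pixels // BLOCK)  # ceiling division
        self.depth = [clear_value] * (self.num_blocks * BLOCK)
        self.status = [CLEARED] * self.num_blocks

    def clear(self):
        # Fast clear: one status entry per block, depth values untouched.
        self.status = [CLEARED] * self.num_blocks

    def read(self, i):
        if self.status[i // BLOCK] == CLEARED:
            return self.clear_value  # stale depth contents are ignored
        return self.depth[i]

    def write(self, i, z):
        b = i // BLOCK
        if self.status[b] == CLEARED:
            # First write after a clear: fill the block lazily.
            start = b * BLOCK
            self.depth[start:start + BLOCK] = [self.clear_value] * BLOCK
            self.status[b] = IN_USE
        self.depth[i] = z


zb = BlockedZBuffer(64)
zb.write(5, 0.25)
zb.clear()
print(zb.read(5), zb.depth[5])  # logical value vs. stale stored value
```

After `clear()` the old depth is still sitting in memory, yet every read reports the clear value, which is the point of the scheme.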
The R200 clears the Z-buffer in a fraction of the time, without having to write anything to the Z-buffer. As mentioned earlier, the Z-buffer is subdivided into 16-pixel blocks. In addition, the status of each block (i.e. whether the block is compressed, uncompressed or ‘Z-cleared’) is stored in a lookup table. This table provides a means of clearing the Z-buffer one block at a time rather than pixel by pixel. All that is needed is to update the status of each block in the lookup table as having been ‘Z-cleared’. This method avoids the time-consuming task of physically filling the Z-buffer with zeroes.
Hierarchical Z-buffer
The usefulness of a hierarchical Z-buffer (HZ) is in the early exclusion of non-visible pixels from the graphics pipeline, before the application of textures; this saves both bandwidth and fill rate. The diagram (Figure 2) shows the HZ to reside in the silicon of the graphics processor, but it may in fact be stored in graphics memory. The HZ is a low-resolution representation of the Z-buffer: low-resolution because each depth value in the HZ represents a block of pixels. An improvement of the R200 over its predecessor is that each pixel block has been reduced from 64 pixels to 16 pixels. This increases the extent of pixel rejection. However, to maintain an acceptable rate of pixel processing, the number of Z-check units would have to be increased.
Unlike the depth value in a traditional Z-buffer that may have 16-bit or higher precision, the depth value precision of HZ is, by necessity, limited. The lack of precision might have been the cause of tile-shaped voids and flickering tiles (‘Z-fighting’) in R1, first documented in my article, Radeon In-Depth, published in September, 2000. Thankfully, these issues have been ironed out in the R200.
While the exact mechanism of R200’s HZ remains sketchy, what is known is that it takes place during rasterization, when triangles are being scan-converted into pixels (Figure 2). Each triangle is conceptually divided into blocks of pixels. Preferably, at least two depth values (the nearest and furthest depth values within each block) are assigned to each block of pixels, accompanied by a flag that indicates complete or partial coverage of the triangle (Figure 3). Effectively, one or two pixels per block are being tested for visibility against the contents of the HZ. This contrasts with a traditional Z-buffer, where the depth of each and every pixel within a triangle is evaluated for visibility.
There are generally two outcomes of the ‘all or none’ type:
- The tile is totally obscured. The entire block of pixels may be rejected if determined to be occluded by a previous tile with 100% coverage. This creates the greatest bandwidth savings: texture, framebuffer and Z-buffer.
- The entire tile is in the foreground with respect to corresponding contents within the Hierarchical Z-buffer. Where an entire block of pixels with 100% coverage has been determined to completely occlude the previous tile, a full Z-buffer comparison is dispensed with and a portion of Z-buffer bandwidth is saved. This is the second best scenario.
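The two outcomes above, plus the per-pixel fallback, can be sketched as a single block-level test. Since the exact mechanism of the R200's HZ is not public, the sketch rests on stated assumptions: smaller z means nearer, and each HZ entry stores both the nearest and the furthest depth already written to its block. The function and constant names are hypothetical.

```python
REJECT, ACCEPT, PER_PIXEL = "reject", "accept", "per-pixel"


def hz_test(block_zmin, block_zmax, full_coverage, hz_zmin, hz_zmax):
    """Classify an incoming block of triangle pixels against one HZ entry.

    block_zmin/block_zmax: nearest and furthest depth of the incoming block.
    hz_zmin/hz_zmax: bounds of the depths already stored for that block
    (storing both bounds is an assumption of this sketch).
    """
    if block_zmin >= hz_zmax:
        # Every incoming pixel lies behind every stored pixel: skip texture,
        # framebuffer and Z-buffer traffic altogether.
        return REJECT
    if full_coverage and block_zmax < hz_zmin:
        # Every incoming pixel lies in front: no Z reads are needed.
        return ACCEPT
    # Depth ranges overlap, or the triangle only partly covers the block.
    return PER_PIXEL


print(hz_test(0.8, 0.9, True, 0.2, 0.5))   # hidden block
print(hz_test(0.1, 0.15, True, 0.2, 0.5))  # foreground block
print(hz_test(0.3, 0.6, True, 0.2, 0.5))   # overlapping ranges
```

A block that has not yet been fully covered keeps the clear depth as its furthest stored value, so nothing is rejected against it.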
When neither of the above two criteria is met, the tile requires a pixel-by-pixel Z-buffer comparison.
Performance impact of Z-buffer optimizations
To assess the impact of Z-buffer optimizations, we take a look at the results obtained from the ‘Evolva Rolling Demo’ (Figure 4). The baseline graph is devoid of Z-buffer optimizations. Then we enable Z-compression, ‘Fast Z-clear’ and HZ strictly in that order, measuring the results with each setting. First, we note that Z-compression provides a maximum of 13.8% improvement over the baseline, understandably lower than ideal (18%) because of the increased polygon count. Next, we note a hefty 24% increase with ‘Fast Z-clear’. Lastly, HZ provides a small but noticeable boost.
|Table 3: Evolva 32-bit framerates|
There are two aspects of the vertex engine: fixed function and programmable vertex processing.
In all likelihood, fixed-function vertex processing is but a small subset within the broader context of a programmable vertex engine. The R200’s fixed-function vertex processing achieved 25 million polygons per second in ‘3DMark 2001’, a four-fold increase over R1 (8.1 million polygons per second). Even if one were to account for the difference in clock speeds between R200 and GeForce3, R200 still achieves a figure well in excess of the 15 million polygons per second that GeForce3 processes.
With a programmable vertex engine, it is possible to insert a small program, called a vertex shader, that executes on chip and processes vertex data as it passes through the geometry processing pipeline. The basic specifications are as defined by the ‘DirectX 8’ vertex shader model. Given the impressive results of the fixed-function aspect of the R200’s vertex engine, it would not be unreasonable to expect a similar level of performance from the programmable aspect of the vertex engine.
Vertex shader performance
To compare vertex shader performance, we benchmarked a series of vertex shaders of varying instruction lengths. Since the vertex shaders operated on the same 3D model, we were able to plot a framerate graph as a function of instruction length (Figure 5). It would have been tempting to conclude that R200 processed vertices at least twice as fast as GeForce3 vertex shaders.
As in life, things are never in black and white but shades of grey. In general, we noted that with one or two exceptions, vertex shaders of greater complexity executed faster on R200 while simple vertex shaders executed faster on the GeForce3. Because the vertex shaders and 3D models tested originated from NVIDIA, it is probable that the 3D models used were in fact optimized for the GeForce3 vertex cache with respect to the locality of vertex references. Keeping this consideration in mind, it could be argued that when cache effectiveness is rate-limiting, GeForce3 is faster. As instructions increase in complexity, the source of the bottleneck instead shifts to instruction execution, a strong point of R200. If the vertices of the 3D models were re-ordered to accommodate the vertex cache of R200, I am fairly confident that R200 would emerge a clear victor.
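The effect of vertex ordering on a vertex cache can be illustrated with a toy simulation. The sketch below assumes a simple FIFO cache of 16 entries, a size picked purely for illustration (the real cache sizes and replacement policies of the two chips are not given here), and compares a grid mesh submitted row by row against the same triangles in shuffled order.

```python
import random
from collections import deque


def cache_misses(indices, cache_size=16):
    """Count vertices that must be processed again with a FIFO vertex cache."""
    cache = deque(maxlen=cache_size)
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)  # the oldest entry drops out automatically
    return misses


def grid_triangles(n):
    """Triangles of an n x n grid of quads, listed row by row."""
    tris = []
    for r in range(n):
        for c in range(n):
            a = r * (n + 1) + c   # top-left vertex of the quad
            b = a + 1             # top-right
            d = a + (n + 1)       # bottom-left
            e = d + 1             # bottom-right
            tris += [(a, d, b), (b, d, e)]
    return tris


def flatten(tris):
    return [i for t in tris for i in t]


ordered = grid_triangles(16)
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)  # same triangles, poor locality

print("row by row:", cache_misses(flatten(ordered)))
print("shuffled:  ", cache_misses(flatten(shuffled)))
```

The shuffled order misses far more often, because neighbouring triangles no longer arrive while their shared vertices are still cached. A model ordered for one cache may therefore sit less comfortably in another.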
|Table 4: Vertex shader framerates|
|GeForce3 @ 200MHz||R200 @ 200MHz||R200 @ 250MHz|
A pixel shader is a set of microcode that, when downloaded to the graphics processor, executes on-chip and operates on pixels and texels. R200 ushers in pixel shader ‘version 1.4’. It makes possible the mixing of a greater number of textures (six to be exact) than pixel shader ‘version 1.1’ of the GeForce3. More important than a numerical advantage, ‘version 1.4’ unifies the instruction set and introduces the concept of ‘phase’.
The pixel shader executes in two phases. The first phase involves texture sampling and texture address operations. As an improvement over GeForce3 pixel shader, the texture sampling instructions have been expanded to encompass the entire spectrum of color operations. In addition to simplifying the programming model, unifying the texture and color operations makes it possible to reuse the same circuitry for both operations.
The second phase permits dependent texture reads as well as color operations. The ability to sample a texture value, modify that value in the address shader, and use the modified value as an address to sample a different texture allows pixel shaders to perform what are known as dependent texture reads. GeForce3, by contrast, only supports dependent texture read as a special case in cubic environment bump mapping.
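A dependent texture read is easy to mimic on the CPU. The sketch below is a toy model in Python, not pixel shader code: textures are plain lists of rows, sampling is point sampling, and the `scale` factor and wrap-around addressing are assumptions made for the example.

```python
def sample(texture, u, v):
    """Point-sample a 2D texture (a list of rows) with u, v in [0, 1)."""
    h, w = len(texture), len(texture[0])
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return texture[y][x]


def dependent_read(offset_map, color_map, u, v, scale=0.5):
    # First fetch: the texel holds a (du, dv) perturbation, not a color.
    du, dv = sample(offset_map, u, v)
    # Arithmetic on the fetched value yields a new texture address.
    u2 = (u + scale * du) % 1.0
    v2 = (v + scale * dv) % 1.0
    # Second fetch: its address depends on the result of the first.
    return sample(color_map, u2, v2)


offset_map = [[(0.0, 0.0), (1.0, 0.0)],
              [(0.0, 1.0), (0.0, 0.0)]]
color_map = [["red", "green"],
             ["blue", "white"]]

print(sample(color_map, 0.6, 0.1))                      # direct read
print(dependent_read(offset_map, color_map, 0.6, 0.1))  # perturbed read
```

The same coordinates land on different texels of `color_map` once the first texture has been allowed to steer the second lookup.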
Figure 6 is a flow diagram of a pixel shader in action. The diagram illustrates the concepts of phase and dependent sampling.
Besides enabling the expression of additional material properties hitherto not possible with GeForce3’s pixel shader, the greater range and precision of the R200 pixel shader improves existing effects, as well. Here we have an example of Pixel Shader Version 1.0 diffuse bump mapping. We can see that the greater range and precision of the R200 pixel color unit produces greater light saturation as well as well-defined bumps (Figure 7).
R200’s implementation of anti-aliasing has been aptly named ‘SmoothVision’. In order to uncover the type of anti-aliasing that R200 uses, it is necessary to understand the concepts behind super-sampling and multi-sampling. In super-sampling, each frame is first rendered at a higher resolution. The image thus generated is then downscaled to the desired resolution with a suitable filter, usually bilinear. In this way, the textures and edges of objects that make up the image are anti-aliased. Multi-sampling, on the other hand, concentrates mainly on anti-aliasing object edges. Depending on the particular implementation, multi-sampling may or may not filter textures. In the GeForce3, textures are not anti-aliased.
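The downscaling step of super-sampling can be sketched in a few lines. The example assumes a grayscale image stored as a list of rows, and uses a plain box average of each 2×2 block as a stand-in for the bilinear filter mentioned above.

```python
def downscale(image, factor=2):
    """Average each factor x factor block of a grayscale image (box filter)."""
    h = len(image) // factor
    w = len(image[0]) // factor
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            block = [image[y * factor + j][x * factor + i]
                     for j in range(factor) for i in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# A hard black-to-white diagonal edge rendered at twice the target resolution.
hi_res = [[0, 0, 1, 1],
          [0, 1, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 1]]

print(downscale(hi_res))
```

Pixels that straddle the edge come out as intermediate grays. Because every rendered sample carries its own texture lookup, textures are smoothed by the same averaging.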
This difference between the two modes of anti-aliasing provides a way to recognize the type of anti-aliasing. But how does one recognize anti-aliased textures? The difference between filtered textures (e.g. bilinear, trilinear filters) and anti-aliased filtered textures is subtle indeed and by no means foolproof. The solution, as it turns out, is to disable texture filtering and use point sampling instead. With GeForce3 style multi-sampling, we note that edges are anti-aliased while textures remain point sampled (Figure 8). R200 ‘SmoothVision’ anti-aliases object edges and their textures (Figure 8). Does this conclusively prove that R200 supersamples? Unfortunately not, which is why the placement of MIP maps will be examined next.
MIP map placement
The placement of MIP maps differs between super- and multi-sample anti-aliasing. Let’s define the process of MIP mapping. MIP mapping provides a sequence of texture maps wherein the first texture map is an uncompressed texture map. Subsequent texture maps are downscaled by a fixed ratio until the texture map has been compressed to a single texel. This process may be visualized as a pyramid of MIP maps (Figure 9). When texturing an object as a perspective view, the graphics processor accesses one of the sequence of MIP maps to retrieve the appropriate texel, taking into account screen resolution and distance. As the closer portions of the perspective object are being rendered, the rendering circuitry accesses the less compressed MIP maps.
Super-sampling in OpenGL
Recall that super-sampling renders at high resolution in the initial phase. There is, therefore, an emphasis towards MIP maps of higher detail than in GeForce3’s multi-sampling. Figure 10 is a composite of two screen captures from OpenGL-based ‘Serious Sam’, with color-coded MIP map levels. The screen capture on the right (R200) has a higher level of detail compared to the screen capture on the left (GeForce3). Clearly, ‘SmoothVision’ is a form of super-sampling.
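The shift towards more detailed MIP maps follows from the usual level-of-detail arithmetic. The sketch below assumes the common convention of halving the texture at each level and choosing the level from the base-2 logarithm of the texel-to-pixel ratio; the `ss_scale` parameter (the per-axis super-sampling factor) is a name introduced here for illustration.

```python
import math


def mip_chain(size):
    """Edge lengths of a square MIP pyramid, halving down to a single texel."""
    levels = [size]
    while levels[-1] > 1:
        levels.append(levels[-1] // 2)
    return levels


def mip_level(texels_per_pixel, ss_scale=1):
    """MIP level for a texel-to-pixel ratio measured along one axis.

    Rendering at ss_scale times the resolution divides the number of texels
    that fall under each rendered pixel, which lowers the selected level.
    """
    ratio = texels_per_pixel / ss_scale
    return max(0.0, math.log2(ratio))


print(mip_chain(256))
print(mip_level(8))              # native resolution
print(mip_level(8, ss_scale=2))  # rendered at twice the width and height
```

Doubling the rendering resolution on each axis moves the selection one level towards the detailed end of the pyramid, which matches the more detailed MIP maps seen with super-sampling in the color-coded captures.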
Super-sampling in Direct3D
With ‘SmoothVision’ enabled in Direct3D, the determination of MIP map levels is either flawed or at best inexact (Figure 11). The image on the left belongs to R200 and shows appropriate MIP map levels without anti-aliasing. The middle image belongs to the Radeon and shows appropriately positioned MIP map levels for a super-sampled image. The right super-sampled image belongs to R200 and shows bizarre MIP map positioning. The R200’s unusual blurring of textures with anti-aliasing turned on has been described at a handful of sites. I believe this bizarre MIP map positioning is the cause.
Performance and quality of anti-aliasing
The quality of anti-aliasing is subjective; for the same degree of edge anti-aliasing, I tend to favour GeForce3 multi-sampling for its uniformity. With regard to anti-aliasing performance, multi-sampling consumes less texture bandwidth than super-sampling, by a factor that depends on the degree of anti-aliasing.
The graph in Figure 12 is intriguing in two aspects: First, R200 supersampling is more efficient than GeForce3 multi-sampling, except at maximum resolutions. It would appear that R200 anti-aliases without resorting to a second pass, possibly bypassing explicit downscaling and filtering. Second, R200 is rate limited by AGP bus access at its maximum resolutions. That is why Z-buffer optimisations are ineffective at the maximum resolutions.
|Table 5: R200 anti-aliasing performance|
|R200 @ 200MHz, Hierarchical Z||146.4||131.7||93.6||47.7||0|
|R200 @ 200MHz, fast Z-clear||146||129.1||89.6||47.6||0|
|R200 @ 200MHz, baseline||144.6||118.1||75.9||47.7||0|
|R200 @ 200MHz, baseline||103.5||67.3||42.3||0||0|
|R200 @ 200MHz, fast Z-clear||120||81.1||42.4||0||0|
|R200 @ 200MHz, Hierarchical Z||123.7||84.8||42.3||0||0|
Trilinear filtering may be considered a form of anti-aliasing involving MIP map levels. One or two websites have noticed that the R200 takes an alternative approach to trilinear filtering. It is not ‘MIP map dithering,’ as has been speculated. Rather, it appears to me that R200 selectively applies trilinear filtering scan-line by scan-line based on the level of detail. The differences between trilinear filtering (Radeon) and selective filtering (R200) are not readily discernible unless the MIP maps are color-coded (Figure 13).
Anisotropic filtering reduces the too-fuzzy or too-sharp filtering that occurs with isotropic mipmapping when a pixel maps to a rectangular region in the texture space. The anisotropic filter employed by R200, in terms of texture clarity of distant objects, is conclusively superior to GeForce3 (Figure 14). A slight disadvantage of R200 is that it does not filter between mipmap levels (i.e. trilinear filter), though this is not noticeable with a high level of detail.
Hopefully, you have found the contents of this article unique, the explanations precise and the screen captures convincing. You may also have noticed that ‘higher order surfaces’ have not been mentioned; I felt it would not do justice to this visually significant feature within the time and space allotted. I wish to address this topic in a future work.
Quirks aside (mostly MIP map related), R200 proves its mettle where it matters most. Megahertz for megahertz, its vertex engine is more powerful than GeForce3, and the pixel shaders are visibly better. I, for one, would like to see developers take full advantage of ‘Pixel Shader Version 1.4’ with a fallback plan for ‘Version 1.1’.
However, we cannot be oblivious to the fact that Z-buffer optimizations are absent in the majority of OpenGL applications. One can only hope that ATI can and will extend this valuable feature to all OpenGL applications.