Command processing, setup, and dispatch
I continue to be amazed by the growing amount of disclosure we get from AMD and Nvidia as they introduce ever more complex GPUs, and R600 is no exception on that front. At its R600 media event, AMD chip architect Eric Demers gave the assembled press a whirlwind tour of the GPU, most of which whizzed wholly unimpeded over our heads. I'll try to distill down the bits I caught with as much accuracy as I can.
Our tour of the R600 began, appropriately, with the GPU's command processor. Demers said previous Radeons have also had logic to process the command stream from the graphics driver, but on the R600, this is actually a processor; it has memory, can handle math, and downloads microcode every time it boots up. The reason this command processor is so robust is so it can offload work from the graphics driver. In keeping with a DirectX 10 theme, it's intended to reduce state management overhead. DirectX 9 tends to group work in lots of small batches, creating substantial overhead just to manage all of the objects in a scene. That work typically falls to the graphics driver, burdening the CPU. Demers described the R600 command processor as "somewhat self-aware," snooping to determine and manage state itself. The result? A claimed reduction in CPU overhead of up to 30% in DirectX 9 applications, with even less overhead in DX10.
Next in line beyond the command processor is the setup engine, which prepares data for processing. It has three functions for DX10's three shader program types: vertex assembly (for vertex shaders), geometry assembly (for geometry shaders), and scan conversion and interpolation (for pixel shaders). Each function can submit threads to the dispatch processor.
One item of note near the vertex assembler is a dedicated hardware engine for tessellation. This unit is a bit of secret sauce for AMD, since the G80 doesn't have anything quite like it. The tessellator allows for the use of very high polygon surfaces with a minimal memory footprint by using a form of compression. This hardware takes two inputs, a low-poly model and a mathematical description of a curved surface, and outputs a very detailed, high-poly model. AMD's Natalya Tatarchuk showed a jaw-dropping demo of the tessellator in action, during which I kept thinking to myself, "Man, I wish she'd switch to wireframe mode so I could see what's going on." Until I realized the thing was in wireframe mode, and the almost-solid object I was seeing was composed of millions of polygons nearly the size of a pixel.
This tessellator may live in a bit of an odd place for this generation of hardware. It's not a part of the DirectX 10 spec, but AMD will expose it via vertex shader calls for developers who wish to use it. We've seen such features go largely unused in the past, but AMD thinks we might see games ported from the Xbox 360 using this hardware since the Xbox 360 GPU has a similar tessellator unit. Also, tessellation capabilities are a part of Microsoft's direction for future incarnations of DirectX, and AMD says it's committed to this feature for the long term (unlike the ill-fated Truform feature that it built into the original Radeon hardware, only to abandon it in the subsequent generation). We'll have to see whether game developers use it.
The setup engine passes data to the R600's threaded dispatch processor. This part of the GPU, as Demers put it, "is where the magic is." Its job is to keep all of the shader cores occupied, which it does by managing a large number of threads of three different types (vertex, geometry, and pixel shaders) and switching between them. The R600's dispatch processor keeps track of "hundreds of threads" in flight at any given time, dynamically deciding which ones should execute and which ones should go to sleep depending on the work being queued, the availability of data requested from memory, and the like. By keeping a large number of threads in waiting, it can switch from one to another as needed in order to keep the shader processors busy.
The thread dispatch process involves multiple levels of arbitration between the three thread types in waiting and the work already being done. Each of the R600's four SIMD arrays of shader processors has two arbiter units associated with it, as the diagram shows, and each one of those has a sequencer attached. The arbiter decides which thread to process next based on a range of variables, and the sequencer then determines the best ordering of instructions for execution of that thread. The SIMD arrays are pipelined, and the two arbiters per SIMD allow for execution of two different threads in interleaved fashion. Notice, also, that vertex and texture fetches have their own arbiters, so they can run independently of shader ops.
As you may be gathering, this dispatch processor involves lots of complexity and a good deal of mystery about its exact operation, as well. Robust thread handling is the reason why GPUs are very effective parallel computing devices, because they can keep themselves very well occupied. If a thread has to stop and wait for the retrieval of data from memory, which can take hundreds of GPU cycles, other threads are ready and waiting to execute in the interim. This logic almost has to occupy a substantial amount of chip area, since the dispatch processor must keep track of all of the threads in flight and make "smart" decisions about what to do next.
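To get a feel for why having many threads in waiting matters, here is a toy simulation of latency hiding. It is only a sketch under loud assumptions: a single execution unit, threads that alternate between a short burst of math and a memory fetch, and an oldest-ready-first pick. The function name, the scheduling policy, and the numbers (4 cycles of math, 200 cycles of memory latency) are all made up for illustration; AMD has not disclosed how the R600's arbiters actually choose.

```python
def alu_utilization(num_threads: int, compute_cycles: int,
                    memory_latency: int, total_cycles: int = 10_000) -> float:
    """Fraction of cycles a single execution unit spends doing math.

    Toy model (an assumption, not AMD's design): each thread computes for
    `compute_cycles`, then sleeps for `memory_latency` cycles waiting on data.
    """
    ready_at = [0] * num_threads  # first cycle each thread may run again
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [i for i, t in enumerate(ready_at) if t <= cycle]
        if runnable:
            # Pick the thread that has been ready the longest.
            i = min(runnable, key=lambda k: ready_at[k])
            run = min(compute_cycles, total_cycles - cycle)
            busy += run
            cycle += run
            ready_at[i] = cycle + memory_latency  # now it waits on memory
        else:
            # Nothing can run: the unit idles until the next fetch returns.
            cycle = min(ready_at)
    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 8, 64):
        print(f"{n:3d} threads -> {alu_utilization(n, 4, 200):.0%} busy")
```

In this model a lone thread leaves the unit idle almost all of the time, while a few dozen threads are enough to cover the memory latency completely. That is the basic bargain behind keeping hundreds of threads in flight.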
These stream processor blocks are arranged in arrays of 16 on the chip, for a SIMD (single instruction multiple data) arrangement, and are controlled via VLIW (very long instruction word) commands. At a basic level, that means as many as six instructions, five math and one for the branch unit, are grouped into a single instruction word. This one instruction word then controls all 16 execution blocks, which operate in parallel on similar data, be it pixels, vertices, or what have you.
The four SIMD arrays on the chip operate independently, so branch granularity is determined by the width of the SIMD and the depth of the pipeline. For pixel shaders, the effective "width" of the SIMD should typically be 16 pixels, since each stream processor block can process a single four-component pixel (with a fifth slot available for special functions or other tasks). The stream processor units are pipelined with eight cycles of latency, but as we've noted, they always execute two threads at once. That makes the effective instruction latency per thread four cycles, which brings us to 64 pixels of branch granularity for R600. Some other members of the R600 family have smaller SIMD arrays and thus finer branch granularity.
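The branch granularity arithmetic above is easy to restate in a few lines. This is just the article's reasoning turned into a helper; the function name is mine, and the 8-wide SIMD in the second call is a hypothetical width for illustration, not a figure AMD has given for any specific chip.

```python
def branch_granularity(simd_width: int, pipeline_latency: int,
                       interleaved_threads: int) -> int:
    """Pixels that end up sharing one branch decision."""
    # With threads interleaved, each one effectively sees a shorter latency.
    cycles_per_thread = pipeline_latency // interleaved_threads
    return simd_width * cycles_per_thread


# R600 as described: 16-wide SIMD, 8-cycle pipeline, 2 interleaved threads.
r600 = branch_granularity(simd_width=16, pipeline_latency=8, interleaved_threads=2)
print(f"R600 branch granularity: {r600} pixels")

# Hypothetical smaller family member with an 8-wide SIMD (assumed width).
smaller = branch_granularity(simd_width=8, pipeline_latency=8, interleaved_threads=2)
print(f"Hypothetical 8-wide SIMD: {smaller} pixels")
```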
Let's stop and run some numbers so we can address the stream processor count claimed by AMD. Each SIMD on the R600 has 16 of these five-ALU-wide superscalar execution blocks. That's a total of 80 ALUs per SIMD, and the R600 has four of those. Four times 80 is 320, and that's where you get the "320 stream processors" number. Only it's not quite that simple.
The superscalar VLIW design of the R600's stream processor units presents some classic challenges. AMD's compiler, a real-time compiler built into its graphics drivers, will have to work overtime to keep all five of those ALUs busy with work every cycle, if at all possible. That will be a challenge, especially because the chip cannot co-issue instructions when one is dependent on the results of the other. When executing shaders with few components and lots of dependencies, the R600 may operate at much less than its peak capacity. (Cue sounds of crashing metal and human screams alongside images of other VLIW designs like GeForce FX, Itanium, and Crusoe.)
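To make the dependency problem concrete, here is a minimal sketch of a greedy packer that fills five-wide instruction words. It is not AMD's compiler, whose workings are not public; the function names and the scheduling rule are assumptions for illustration. The sketch also ignores the branch slot and the special-function role of the fifth ALU. The only rule it enforces is the one described above: an operation cannot share an instruction word with an operation it depends on.

```python
def pack_bundles(deps: dict[str, set[str]], width: int = 5) -> list[list[str]]:
    """Greedily pack operations into VLIW-style bundles of up to `width` ops.

    `deps` maps each operation name to the set of operations whose results
    it needs. An op may issue only after all of its inputs issued in an
    earlier bundle.
    """
    remaining = dict(deps)
    done: set[str] = set()
    bundles: list[list[str]] = []
    while remaining:
        ready = [op for op, needs in remaining.items() if needs <= done][:width]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        bundles.append(ready)
        done.update(ready)
        for op in ready:
            del remaining[op]
    return bundles


def utilization(bundles: list[list[str]], width: int = 5) -> float:
    """Share of ALU slots that actually hold an operation."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# Five independent ops (say, one per component) fill a single word.
independent = {name: set() for name in "abcde"}
# Five ops in a serial chain: each needs the previous result.
chain = {"a": set(), "b": {"a"}, "c": {"b"}, "d": {"c"}, "e": {"d"}}

for label, program in (("independent", independent), ("chain", chain)):
    packed = pack_bundles(program)
    print(f"{label}: {len(packed)} bundle(s), {utilization(packed):.0%} of slots used")
```

The same five operations take one instruction word when they are independent and five words when they form a chain, which is the gap between peak and worst-case throughput that the compiler has to fight.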
The R600 has many things going for it, however, not least of which is the fact that the machine maps pretty well to graphics workloads, as one might expect. Vertex shader data often has five components and pixel shader data four, although graphics usage models are becoming more diverse as programmable shading takes off. The fact that the shader ALUs all have the same basic set of capabilities should help reduce scheduling complexity, as well.
Still, Nvidia has already begun crowing about how much more efficient and easier to utilize its scalar stream processors in the G80 are. For its part, AMD is talking about potential for big performance gains as its compiler matures. I expect this to be an ongoing rhetorical battle in this generation of GPU technology.
So how does R600's shader power compare to G80? Both AMD and Nvidia like to throw around peak FLOPS numbers when talking about their chips. Mercifully, they both seem to have agreed to count programmable operations from the shader core, bracketing out fixed-function units for graphics-only operations. Nvidia has cited a peak FLOPS capacity for the GeForce 8800 GTX of 518.4 GFLOPS. The G80 can co-issue one MAD and one MUL instruction per clock to each of its 128 scalar SPs. That's three operations (multiply-add and multiply) per cycle at 1.35GHz, or 518.4 GFLOPS. However, the guys at B3D have shown that that extra MUL is not always available, which makes counting it questionable. If you simply count the MAD, you get a peak of 345.6 GFLOPS for G80.
By comparison, the R600's 320 stream processors running at 742MHz give it a peak capacity of 475 GFLOPS. Mike Houston, the GPGPU guru from Stanford, told us he had achieved an observed compute throughput of 470 GFLOPS on R600 with "just a giant MAD kernel." So R600 seems capable of hitting something very near its peak throughput in the right situation. What happens in graphics and games, of course, may vary quite a bit from that.
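The peak numbers quoted above all come from the same simple product, which a short script can reproduce. The helper name is mine, and counting a MAD as two floating-point operations per ALU per clock on the R600 is my assumption; it is the reading that matches the 475 GFLOPS figure.

```python
def peak_gflops(alus: int, flops_per_alu_per_clock: int, clock_ghz: float) -> float:
    """Peak programmable throughput in GFLOPS."""
    return alus * flops_per_alu_per_clock * clock_ghz


# G80 (GeForce 8800 GTX): 128 scalar SPs at 1.35GHz.
g80_mad_mul = peak_gflops(128, 3, 1.35)   # MAD (2 flops) + MUL (1 flop)
g80_mad_only = peak_gflops(128, 2, 1.35)  # MAD only

# R600: 4 SIMDs x 16 blocks x 5 ALUs = 320 stream processors at 742MHz.
r600_alus = 4 * 16 * 5
r600_peak = peak_gflops(r600_alus, 2, 0.742)  # assumes one MAD per ALU per clock

observed = 470.0  # Mike Houston's reported MAD-kernel throughput

print(f"G80, MAD+MUL: {g80_mad_mul:.1f} GFLOPS")
print(f"G80, MAD only: {g80_mad_only:.1f} GFLOPS")
print(f"R600 ({r600_alus} SPs): {r600_peak:.1f} GFLOPS")
print(f"R600 observed / peak: {observed / r600_peak:.1%}")
```

By this math, the observed 470 GFLOPS works out to about 99% of the R600's theoretical peak.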