A revised graphics architecture
The biggest change in Tahiti and the rest of the Southern Islands lineup is undoubtedly the shader core, the computational heart of the GPU, where AMD has implemented a fairly major reorganization of the way threads are scheduled and instructions are executed. AMD first revealed partial details of this "Graphics Core Next" architecture at its Fusion Developer Summit last summer, so some information about Tahiti's shader architecture has been out there for a while. Now that the first products are arriving, we've been able to fill in most of the rest of the details.
As we've noted, Tahiti doesn't look like too much of a departure from its Cayman predecessor at a macro level, as in the overall architecture diagram on page one. However, the true difference is in the CU, or compute unit, that is the new fundamental building block of AMD's graphics machine. These blocks were called SIMD units in prior architectures, but this generation introduces a very different, more scalar scheme for scheduling threads, so the "SIMD" name has been scrapped. That's probably for the best, because terms like SIMD get thrown around constantly in GPU discussions in ways that often confuse rather than enlighten.
In AMD's prior architectures, the SIMDs are arrays of 16 execution units, and each of those units is relatively complex, with either four (in Cayman) or five (in Cypress and derivatives) arithmetic logic units, or ALUs, grouped together. These execution units are superscalar—each of the ALUs can accept a different instruction and operate on different data in one clock cycle. Superscalar execution can improve throughput, but it relies on the compiler to manage a problem it creates: none of the instructions being dispatched in a cycle can rely on the output of one of the other instructions in the same group. If the compiler finds dependencies of this type, it may have to leave one or more of the ALUs idle in order to preserve the proper program order and obtain the correct results.
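To make the dependency problem concrete, here is a minimal sketch of a greedy, in-order packer that fills four-wide instruction bundles, as a compiler for a Cayman-style unit might. The function names, the instruction labels, and the packing policy are all illustrative assumptions; AMD's actual shader compiler is surely more sophisticated.

```python
from typing import Dict, List, Set


def pack_bundles(deps: Dict[str, Set[str]], order: List[str], width: int = 4) -> List[List[str]]:
    """Pack instructions into bundles of up to `width` slots, in program order.

    An instruction starts a new bundle when the current one is full or when
    it reads a result produced by an instruction already in that bundle.
    (Hypothetical policy for illustration only.)
    """
    bundles: List[List[str]] = []
    current: List[str] = []
    for op in order:
        depends_on_current = any(d in current for d in deps.get(op, set()))
        if len(current) == width or depends_on_current:
            bundles.append(current)
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


def utilization(bundles: List[List[str]], width: int = 4) -> float:
    """Fraction of ALU slots that hold a real instruction."""
    used = sum(len(b) for b in bundles)
    return used / (width * len(bundles))


if __name__ == "__main__":
    # Four independent per-component operations fill one bundle completely.
    pixel = pack_bundles({}, ["r", "g", "b", "a"])
    print(pixel, utilization(pixel))

    # A chain where each step needs the previous result leaves three slots idle.
    chain_deps = {"t1": {"t0"}, "t2": {"t1"}, "t3": {"t2"}}
    chain = pack_bundles(chain_deps, ["t0", "t1", "t2", "t3"])
    print(chain, utilization(chain))
```

The independent case fills every slot, while the dependent chain issues one instruction per bundle and leaves three quarters of the ALUs idle.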
The superscalar nature of AMD's execution units has been both a blessing and a curse over time. On the plus side, it has allowed AMD to cram a massive number of ALUs and FLOPS into a relatively small die area, since the design is economical in terms of things like chip area dedicated to control logic. The downside is, as we've noted, that those execution units cannot always reach full utilization, because the compiler must schedule around dependencies.
Knowledgeable folks at AMD, including Graphics CTO Eric Demers, have consistently argued that these superscalar execution units have not been a problem for graphics simply because the machine maps well to graphics applications. For instance, DirectX-compliant GPUs typically process pixels in two-by-two blocks known as quads. Each pixel is treated as a thread, and 16-thread groups known as "wavefronts" or (in Nvidia's lexicon) "warps" are processed together. In an architecture like Cypress, a wavefront could be dispatched to a SIMD array, and each of the 16 execution units would handle a single thread or pixel. As I understand it, then, the four components of a pixel can be handled in parallel across the superscalar ALUs: red, green, blue, and alpha, plus, in the case of Cypress, a special function like a transcendental in that fifth slot. In just one clock cycle, a SIMD array can process an operation for every element of an entire wavefront, with very full utilization of the available ALU resources.
The problems come when moving beyond the realm of traditional graphics workloads, either with GPU computing or simply when attempting to process data that has only a single component, like a depth buffer. Then, the need to avoid dependencies can limit the utilization of those superscalar ALUs, making them much less efficient. This dynamic is one reason Radeon GPUs have had very high theoretical FLOPS peaks but have sometimes had much lower delivered performance.
In a sense, Tahiti's compute unit is the same basic "width" as the SIMDs in Cayman and Cypress, capable of processing the equivalent of one wavefront per clock cycle. Beneath the covers, though, many things have changed. The most basic execution units are actually wider than before, 16-wide vector units (also called SIMD-16 in the diagram above), of which there are four. Each CU also has a single scalar unit to assist, along with its own scheduler. The trick here is that those vec16 execution units are scheduled very much like the 16-wide execution units in Nvidia's GPUs since the G80: in scalar fashion, with each ALU in the unit representing its own "lane." With graphics workloads, for instance, pixel components would be scheduled sequentially in each lane, with red on one clock cycle, green on the next, and so on. In the adjacent ALUs on the same vec16 execution unit, the other pixels in that wavefront would be processed at the same time, in the same one-component-per-clock fashion. At the end of four clocks, each vec16 unit will have processed 16 pixels or one wavefront. Since the CU has four of those execution units, it is capable of processing four wavefronts in four clock cycles, which is, as we noted, the equivalent of one wavefront per cycle. Like Cayman, Tahiti can process double-precision floating-point datatypes for compute applications at one quarter the single-precision rate, which works out to, ahem, 947 GFLOPS in this case, just shy of a teraflop.
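The four-wavefronts-in-four-clocks arithmetic can be sketched as a small timeline. This is a toy model that follows the simplified description above (one 16-pixel wavefront per vec16 unit, one color component per clock); the function names and the round-robin assignment of wavefronts to units are assumptions made for illustration, not a description of AMD's actual scheduler.

```python
from typing import List, Tuple

COMPONENTS = ["red", "green", "blue", "alpha"]

# One schedule entry: (clock, vec_unit, wavefront, component)
Slot = Tuple[int, int, int, str]


def cu_schedule(num_wavefronts: int = 4, vec_units: int = 4) -> List[Slot]:
    """Assign wavefronts to vec16 units round-robin (assumed policy).

    Each unit works through one component per clock for all 16 pixels
    of its wavefront, so a wavefront occupies a unit for four clocks.
    """
    schedule: List[Slot] = []
    for wf in range(num_wavefronts):
        unit = wf % vec_units
        start = (wf // vec_units) * len(COMPONENTS)
        for i, comp in enumerate(COMPONENTS):
            schedule.append((start + i, unit, wf, comp))
    return sorted(schedule)


def clocks_needed(schedule: List[Slot]) -> int:
    """Total clocks from the first issue to the last."""
    return max(clock for clock, _, _, _ in schedule) + 1


if __name__ == "__main__":
    sched = cu_schedule(num_wavefronts=4)
    for clock, unit, wf, comp in sched:
        print(f"clock {clock}: vec16 unit {unit} -> wavefront {wf}, {comp}")
    print("clocks for 4 wavefronts:", clocks_needed(sched))
```

In this model, four wavefronts finish in four clocks and eight finish in eight, which averages out to one wavefront per clock for the CU as a whole.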
For graphics, the throughput of the new CU may be similar to that of Cypress or Cayman. However, the scalar, lane-based thread scheduling scheme simplifies many things. The compiler no longer has to detect and avoid dependencies, since each thread is executed in an entirely sequential fashion. Register port conflicts are reduced, and GPU performance in non-traditional workloads should be more stable and predictable, reaching closer to those peak FLOPS throughput numbers more consistently. If this list of advantages sounds familiar to you, well, it is the same set of things Nvidia has been saying about its scheduling methods for quite some time. Now that AMD has switched to a similar scheme, the same advantages apply to Tahiti.
That's not to say the Tahiti architecture isn't distinctive and, in some ways, superior to Nvidia's Fermi. One unique feature of the Tahiti CU is its single scalar execution unit. Nvidia's shader multiprocessors have a special function unit in each SM, and one may be tempted to draw parallels. However, AMD's David Nalasco tells us Tahiti handles special functions like transcendentals in the vec16 units, at a very nice rate of four ops per clock cycle. The scalar unit is a separate, fully programmable ALU. In case you're wondering, it's integer-only, which is why it doesn't contribute to Tahiti's theoretical peak FLOPS count. Still, Nalasco says this unit can do useful things for graphics, like calculating a dot product and forwarding the results for use across multiple threads. This unit also assists with flow control and handles address generation for pointers, as part of Tahiti's support of C++-style data structures for general-purpose computing.
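Because the scalar unit is integer-only, the peak FLOPS figures come entirely from the vector lanes. The short calculation below reproduces the 947 GFLOPS double-precision number quoted earlier. Note that the CU count, the clock speed, and the two-FLOPs-per-lane convention are assumptions based on the Radeon HD 7970's published specifications rather than figures stated in this section.

```python
# Values marked "assumption" are not stated in this section of the article.
COMPUTE_UNITS = 32         # assumption: CU count of the Radeon HD 7970
VEC_UNITS_PER_CU = 4       # from the text: four vec16 units per CU
LANES_PER_VEC_UNIT = 16    # from the text: each vector unit is 16 wide
FLOPS_PER_LANE_CLOCK = 2   # assumption: a fused multiply-add counts as two FLOPs
CLOCK_GHZ = 0.925          # assumption: 925MHz core clock
DP_RATE = 1 / 4            # from the text: double precision at one quarter rate


def peak_gflops(cus: int, vec_units: int, lanes: int,
                flops_per_lane_clock: int, clock_ghz: float) -> float:
    """Theoretical single-precision peak, counting vector lanes only."""
    total_lanes = cus * vec_units * lanes
    return total_lanes * flops_per_lane_clock * clock_ghz


if __name__ == "__main__":
    sp = peak_gflops(COMPUTE_UNITS, VEC_UNITS_PER_CU, LANES_PER_VEC_UNIT,
                     FLOPS_PER_LANE_CLOCK, CLOCK_GHZ)
    dp = sp * DP_RATE
    print(f"single precision: {sp:.1f} GFLOPS")
    print(f"double precision: {dp:.1f} GFLOPS")
```

Under those assumptions, the double-precision result lands at roughly 947 GFLOPS, matching the figure above.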
Another place where Tahiti stands out is its rich complement of local storage. The chip has tons of SRAM throughout, in the form of registers (260KB per CU), hardware caches, software-managed caches or "data shares," and buffers. Each of these structures has its own point of access, which adds up to formidable amounts of total bandwidth across the chip. Also, Tahiti adds a hardware-managed, multi-level read/write cache hierarchy for the first time. There's a 32KB L1 instruction cache and a 16KB scalar data cache shared across four CUs and backed by the L2 caches. Each CU also has its own L1 texture/data cache, which is fully read/write. Meanwhile, the CU retains the 64KB local data share from prior AMD architectures.
Nvidia has maintained a similar split between hardware- and software-managed caches in its Fermi architecture by allowing the partitioning of local storage into 16KB/48KB of L1 cache and shared memory, or vice-versa. Nalasco points out, however, that the separate structures in Tahiti can be accessed independently, with full bandwidth to each.
Tahiti has six L2 cache partitions of 128KB, each associated with one of its dual-channel memory controllers, for a total of 768KB of L2 cache, all read/write. That's the same amount of L2 cache as in Nvidia's Fermi, although obviously Tahiti's last-level caches service substantially more ALUs. The addition of robust caching should be a big help for non-graphics applications, and AMD clearly has its eye on that ball. In fact, for the first time, an AMD GPU has gained full ECC protection, not just of external DRAMs as in Cayman, but also of internal storage. All of Tahiti's SRAMs are single-error correct, double-error detect protected, which means future FirePro products based on this architecture should be vying in earnest for deployment in supercomputing clusters and the like against Nvidia's Tesla products. Nvidia has a big lead in the software and tools departments with CUDA, but going forward, AMD has the assistance of both Microsoft, via its C++ AMP initiative, and the OpenCL development ecosystem.
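For readers unfamiliar with single-error correct, double-error detect (SECDED) protection, the sketch below shows the idea with a textbook extended Hamming (8,4) code protecting a four-bit value. AMD hasn't told us which code or word size Tahiti's SRAMs use, so the code choice, the bit layout, and the function names here are purely illustrative assumptions.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus an overall parity bit.

    Index 0 holds the overall parity; indices 1..7 follow the classic
    Hamming layout with parity bits at positions 1, 2, and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    for bit in code[1:]:
        code[0] ^= bit
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value); value is None when the word is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid word
    overall = 0
    for bit in code:
        overall ^= bit

    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity means a single flip; the syndrome names the position
        # (0 means the overall parity bit itself was hit).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "double_error", None

    value = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, value


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))
    word[5] ^= 1          # one flipped bit is repaired
    print(decode(word))
    word[2] ^= 1          # a second flip is detected but not repaired
    print(decode(word))
```

Real memory ECC typically protects much wider words with the same principle: any single flipped bit is repaired transparently, and any two flipped bits are at least flagged rather than silently passed along.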