Any pain associated with AMD's ongoing deficit in CPU performance is dulled somewhat by Kaveri's incorporation of the state-of-the-art GCN graphics architecture. 47% of Kaveri's die space is devoted to graphics, signaling AMD's commitment not just to graphics, but also to GPU acceleration of general-purpose computing workloads.
The move to the Graphics Core Next architecture is a major upgrade over Trinity on both of these fronts, just as it was when the Radeon HD 7000 series supplanted the HD 6000 series. (I've outlined the structure of the GCN compute units here.) This is the same generation of graphics technology that AMD built into the chips that power Microsoft's Xbone and Sony's PS4.
More precisely, Kaveri's compute units are of the same vintage as those in the Hawaii GPU that powers the Radeon R9 290X. This latest revision of GCN includes provisions especially helpful for APUs. The addition of flat system addressing facilitates the sharing of memory between CPU and graphics compute units. Meanwhile, buffering changes should improve the performance of geometry shaders and tessellation in the bandwidth-constrained environs of a CPU socket.
Naturally, Kaveri's GPU is built on a much smaller scale than the big Hawaii chip. It has only eight compute units, versus 44 on the Radeon R9 290X. Still, those eight CUs endow Kaveri with a total of 512 shader processors and 32 texels per clock of bilinear filtering capacity. The front end can rasterize a single primitive per clock cycle, and two render back-ends give it 16 pixels per clock of ROP throughput. This is a major upgrade from the 384 SPs, 24 tpc of filtering, and 8 ppc of ROP throughput in Trinity—and we haven't even accounted for the more efficient scheduling and superior GPU computing chops of the GCN architecture.
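For readers who want to check the math, here is a rough sketch of how those totals fall out of the unit counts. The per-CU figures (64 shader lanes and four filtered texels per clock) are assumptions based on the standard GCN compute unit layout, and the eight pixels per clock per render back-end is simply inferred from the 16 ppc total quoted above. The function and variable names are made up for illustration.

```python
# Assumed per-unit rates for a GCN compute unit (not stated in the text above).
LANES_PER_CU = 64             # four 16-wide SIMDs per compute unit
TEXELS_PER_CLOCK_PER_CU = 4   # bilinear-filtered texels per clock per CU


def gcn_throughput(compute_units, render_back_ends, pixels_per_rbe):
    """Return per-clock totals for a GCN-style GPU of the given size."""
    return {
        "shader_processors": compute_units * LANES_PER_CU,
        "texels_per_clock": compute_units * TEXELS_PER_CLOCK_PER_CU,
        "pixels_per_clock": render_back_ends * pixels_per_rbe,
    }


# Kaveri: eight CUs and two render back-ends; 8 ppc per back-end is inferred.
kaveri = gcn_throughput(compute_units=8, render_back_ends=2, pixels_per_rbe=8)

# Trinity's totals, taken directly from the paragraph above.
trinity = {"shader_processors": 384, "texels_per_clock": 24, "pixels_per_clock": 8}

# Ratio of Kaveri to Trinity for each per-clock figure.
uplift = {key: kaveri[key] / trinity[key] for key in trinity}

print(kaveri)
print(uplift)
```

These are per-clock figures only. They say nothing about clock speeds or about the scheduling improvements in GCN, so they understate the architectural difference between the two chips.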
In keeping with Kaveri's mobile focus, the impact of this wider graphics engine will most likely be felt in lower power bands, where the dual memory channels available inside of a CPU socket are less of a constraint, relatively speaking. We've already shown that the previous-gen Richland's GPU is somewhat bandwidth-constrained in higher power envelopes. If bandwidth becomes the primary performance limiter, then Kaveri's wider graphics engine could become starved for work.
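To put a rough number on that concern, the sketch below divides a fixed pool of memory bandwidth across the shader counts of the two generations. The memory configuration (two 64-bit DDR3 channels at 2133 MT/s) is an assumption chosen purely for illustration, and the comparison ignores GPU clock speeds, caches, and the CPU cores' share of the bandwidth.

```python
def peak_bandwidth_gbps(channels, bus_width_bits, megatransfers_per_s):
    """Peak DRAM bandwidth in decimal GB/s for the given memory setup."""
    bytes_per_transfer = channels * bus_width_bits // 8
    return bytes_per_transfer * megatransfers_per_s / 1000


# Assumed configuration: dual-channel DDR3 at 2133 MT/s (illustrative only).
bandwidth = peak_bandwidth_gbps(channels=2, bus_width_bits=64,
                                megatransfers_per_s=2133)

# Shader processor counts come from the text; bandwidth is held constant.
per_sp_mbps = {}
for name, shader_processors in (("Trinity", 384), ("Kaveri", 512)):
    per_sp_mbps[name] = bandwidth * 1000 / shader_processors
    print(f"{name}: {per_sp_mbps[name]:.1f} MB/s per shader processor")
```

With the memory subsystem unchanged, each of Kaveri's shader processors gets three-quarters of the bandwidth that Trinity's did, which is the sense in which a wider engine can end up waiting on memory.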
The future is fusion?
What may be Kaveri's most innovative new technology doesn't yet benefit current applications. However, it should enable developers to create programs that can use the CPU and GPU cores on a chip together in novel ways. AMD talks about these features under the umbrella of its wide-ranging HSA effort. HSA stands for Heterogeneous System Architecture, and it refers to an overarching system architecture for mixed-mode computing (involving CPU cores, GPUs, and possibly DSPs) with its own programming model. AMD's HSA enablement effort involves building the tools and partnerships to make HSA a viable development platform, both for x86-compatible chips and for SoCs that marry other sorts of CPU cores and graphics engines. The goal is to make it possible to write software that almost effortlessly intermingles the use of CPUs, graphics processors, and other computing engines as needed.
AMD outlined the basic HSA architecture several years ago, and it has been slowly adding features to its chips to make this vision a reality. The first APU, Llano, had a 128-bit Fusion Compute Link that allowed the GPU to access CPU-owned memory in certain cases. This link was an add-on created specifically for mixed-mode computing, since the integrated Radeon had a 512-bit bus of its own. Trinity expanded the FCL to 256 bits wide and changed its path, routing it through an IOMMU and into a unified north bridge between the CPU and graphics cores. Kaveri retains the 512-bit Radeon bus and the 256-bit FCL, and it adds a third 256-bit link from the GPU to the north bridge.
This new link is notable because it provides coherent access to memory. That is, the GPU can read and modify memory locations over this link without worrying about whether the same data is being held or modified in the CPU caches. Much like in a multi-socket server, Kaveri's hardware ensures that its CPU and GPU cores are properly synchronized and working on correct, up-to-date data. Programmers and compilers need not worry about the hazards created by the GPU reaching into main memory and making a change. Coherent communication is one of the keys to unlocking the GPU's full participation in heterogeneous computing, and Kaveri is the first chip from AMD to offer this capability.
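The hazard that coherency removes can be shown with a toy model. The sketch below is not a description of Kaveri's hardware. It assumes a simple write-back CPU cache and a made-up ToySystem class, and it exists only to show why a GPU read that skips the CPU caches can return stale data while a coherent read cannot.

```python
class ToySystem:
    """Toy model: DRAM plus a write-back CPU cache holding modified lines."""

    def __init__(self):
        self.memory = {}     # address -> value currently stored in DRAM
        self.cpu_cache = {}  # address -> value modified by the CPU, not yet written back

    def cpu_write(self, addr, value):
        # The CPU updates its cache; DRAM still holds the old value.
        self.cpu_cache[addr] = value

    def cpu_read(self, addr):
        # The CPU sees its own cached copy first, then DRAM.
        return self.cpu_cache.get(addr, self.memory.get(addr))

    def gpu_read_noncoherent(self, addr):
        # Goes straight to DRAM and never looks at the CPU cache.
        return self.memory.get(addr)

    def gpu_read_coherent(self, addr):
        # Probe the CPU cache first; write back any modified copy.
        if addr in self.cpu_cache:
            self.memory[addr] = self.cpu_cache.pop(addr)
        return self.memory.get(addr)

    def gpu_write_coherent(self, addr, value):
        # Invalidate the CPU's copy so it cannot read a stale value later.
        self.cpu_cache.pop(addr, None)
        self.memory[addr] = value


system = ToySystem()
system.memory[0x100] = 1
system.cpu_write(0x100, 2)                    # CPU modifies the line in its cache

stale = system.gpu_read_noncoherent(0x100)    # sees the old DRAM contents
fresh = system.gpu_read_coherent(0x100)       # sees the CPU's update

system.gpu_write_coherent(0x100, 3)           # GPU modifies the same location
seen_by_cpu = system.cpu_read(0x100)          # CPU observes the GPU's write
```

Without the coherent path, software has to flush or invalidate caches by hand at the right moments to avoid the stale read, which is exactly the bookkeeping programmers and compilers no longer need to do.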
Kaveri's coherent FCL pairs up with a couple of other HSA-enabling features to open some new possibilities for programming an APU. Thanks to a feature called hUMA, or heterogeneous uniform memory architecture, the CPU and GPU can share up to 32GB of memory and access it via a common addressing scheme. hQ, or heterogeneous queuing, allows the GPU to create and dispatch work for itself, or for the CPU. Kaveri's graphics unit includes eight dedicated asynchronous compute engines (ACEs), independent of the graphics command processor, for scheduling parallel computing work. And Kaveri supports the atomic operations needed for synchronization between the CPU and GPU cores.
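A loose software analogy may help make shared memory and heterogeneous queuing more concrete. The sketch below uses ordinary Python queues and a single dictionary as stand-ins for hardware queues and shared memory. It does not use any real HSA runtime API, every name in it is hypothetical, and it runs on one thread purely to keep the flow easy to follow.

```python
from collections import deque

# One work queue per agent, plus a single buffer both agents can see.
queues = {"cpu": deque(), "gpu": deque()}
shared = {"data": list(range(8)), "partials": [], "total": None}


def gpu_square(shared):
    # "GPU" task: transform the shared buffer in place, then queue
    # follow-on work for itself without going back through the CPU.
    shared["data"] = [x * x for x in shared["data"]]
    queues["gpu"].append(gpu_partial_sums)


def gpu_partial_sums(shared):
    # "GPU" task: reduce each half of the buffer, then hand the
    # final step to the CPU by placing a task in the CPU's queue.
    data = shared["data"]
    shared["partials"] = [sum(data[:4]), sum(data[4:])]
    queues["cpu"].append(cpu_finish)


def cpu_finish(shared):
    # "CPU" task: combine the partial results left in shared memory.
    shared["total"] = sum(shared["partials"])


def run(queues, shared):
    """Drain both queues, recording which agent ran which task."""
    log = []
    while any(queues.values()):
        for agent in ("gpu", "cpu"):
            if queues[agent]:
                task = queues[agent].popleft()
                task(shared)
                log.append((agent, task.__name__))
    return log


queues["gpu"].append(gpu_square)   # the host kicks things off once
log = run(queues, shared)
print(log, shared["total"])
```

The point of the analogy is that no data is copied between agents and no central scheduler hands out the follow-on tasks. Each step reads and writes the same buffer, and the GPU side decides on its own what runs next and where.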
At the Kaveri press event, AMD HSA honcho Phil Rogers offered several examples of how an HSA-compliant APU could intermix CPU and GPU operations for higher performance using simple, less repetitive code. Kaveri is the first chip capable of running that code natively, making it the first real development platform for HSA. If AMD somehow is able to persuade the rest of the industry to standardize on its vision for heterogeneous programming, that could be an even bigger coup than the adoption of the x86-64 ISA back in the Athlon 64 days.
With that said, the implementation of graphics coherency in Kaveri is just a first step, as the presence of three separate buses coming from the GPU indicates. AMD Client Division CTO Joe Macri forthrightly admitted that the three buses could be merged in a future design. One can imagine how a single link could be more power-efficient. For engineering purposes, he told us, replicating the FCL and making it coherent was the easier path for this project. Also, the coherent FCL presently bypasses the GPU's L2 cache, unlike the non-coherent link. On the CPU side, the L1 cache's TLB is available on both buses, but the L2 TLB, which is located in the IOMMU, can only be accessed by one client at a time. In the event of an L2 miss, the IOMMU will walk the page tables, remaining locked the whole time.
Obviously, these limitations aren't ideal. Macri explained that the goal in this case was keeping things simple and maintaining architectural correctness. The team didn't want a bug in HSA-related features to delay the product, especially since HSA is about enabling future applications, not current ones. In keeping with AMD's recent modus operandi of incremental CPU-GPU fusion, we'd expect these restrictions to be removed from future APUs.