
Piledriver: somewhat heavier equipment
Trinity's use of the Bulldozer CPU architecture gives it a host of features that Llano lacked, including AES encryption acceleration and AVX instructions for wider floating-point vector processing. Bulldozer's basic layout also makes Trinity a very different beast than Llano. This architecture's fundamental building block is a compute "module" that can process two threads simultaneously. Although AMD claims the module has two distinct integer cores, those cores share some key resources, including the instruction fetch and decode units, an L2 cache, and a floating-point math unit (FPU). The shared structures have been upgraded substantially from prior AMD CPUs, to better service two integer cores at once. Trinity has two of these compute modules, giving it four threads, four integer "cores," and two FPUs. Each of those modules has 2MB of L2 cache. By contrast, Llano has four distinct cores, each with its own FPU and 1MB of L2 cache, with no sharing. (One similarity between Llano and Trinity is the omission of an L3 cache. AMD deemed the L3 a power efficiency liability in Llano, and it appears to have held to that conviction with Trinity.)
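For a quick tally of how the sharing shakes out, the sketch below multiplies per-block resources by the block count for each chip. The class and field names are hypothetical, invented for illustration; the only inputs are the figures quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeUnit:
    """One replicated building block: a Bulldozer-style module or a Llano core."""
    int_cores: int  # integer cores, one hardware thread each
    fpus: int       # floating-point units in the block
    l2_mb: int      # L2 cache attached to the block, in MB


def totals(unit: ComputeUnit, count: int) -> dict:
    """Multiply per-block resources by the number of blocks on the chip."""
    return {
        "threads": unit.int_cores * count,
        "fpus": unit.fpus * count,
        "l2_mb": unit.l2_mb * count,
    }


# Figures from the text: a Trinity module pairs two integer cores with one
# shared FPU and 2MB of L2; a Llano core has its own FPU and 1MB of L2.
TRINITY = totals(ComputeUnit(int_cores=2, fpus=1, l2_mb=2), count=2)
LLANO = totals(ComputeUnit(int_cores=1, fpus=1, l2_mb=1), count=4)

if __name__ == "__main__":
    print("Trinity:", TRINITY)
    print("Llano:  ", LLANO)
```

Both chips end up with four threads and 4MB of L2 in total. The difference is in the FPU count and in how many threads contend for each shared structure.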

To date, Bulldozer's performance hasn't fulfilled the expectations created by its extended feature set. The desktop FX-8150 processor is barely quicker than the older Phenom II X6 in most cases, for instance, and its per-clock performance is actually lower than the prior-gen processor's. Some of that is by design; Bulldozer is intended to run at higher clock frequencies, and it gives up some per-clock performance in order to do so. Still, the revised "Piledriver" CPU cores in Trinity have been tweaked for higher instruction throughput in each clock cycle.

Although some folks probably expected a quick fix for the Bulldozer architecture that would yield some sizeable performance gains, that doesn't appear to be what's happened. Instead, Piledriver incorporates a fairly broad range of improvements, none of which contributes much more than 1% to overall per-clock instruction throughput. (I believe the cumulative total is somewhere around a 6% IPC improvement, generally, but my notes are fuzzy on that one.)


Changes from Bulldozer to Piledriver. Source: AMD.

One of the most notable changes in Piledriver is support for a couple of new instructions. The addition of a three-component fused multiply-add instruction, FMA3, brings AMD in line with Intel's plans for its upcoming Haswell chip. That should clear up any confusion about this workhorse of the AVX extensions. (Support for Bulldozer's FMA4 instruction remains.) Furthermore, Piledriver allows quick conversions between 16- and 32-bit floating-point data formats via the F16C instruction, which debuted in the Intel camp on Ivy Bridge.
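To make those two additions a little more concrete, here is a minimal Python sketch of what they compute. It is an emulation, not the instructions themselves: the function names are hypothetical, exact rational arithmetic stands in for the hardware's wide intermediate result, and Python's floats are 64-bit doubles, where F16C proper converts between 16- and 32-bit formats.

```python
import struct
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate a*b + c with a single rounding at the end, as an FMA does.

    The product and sum are computed exactly as rationals; only the final
    conversion back to a float rounds.
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def to_half_and_back(x: float) -> float:
    """Round x to IEEE 754 half precision (16 bits), then widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    # Inputs chosen so the exact product, 1 - 2**-54, is lost when the
    # multiply is rounded on its own before the add.
    a, b, c = 1.0 + 2.0**-27, 1.0 - 2.0**-27, -1.0
    print("separate multiply, then add:", a * b + c)
    print("fused multiply-add:         ", fused_multiply_add(a, b, c))
    print("0.1 after a 16-bit round trip:", to_half_and_back(0.1))
```

The FMA3 and FMA4 forms produce the same arithmetic result; they differ in operand encoding. FMA4 names a separate destination register, while FMA3 overwrites one of its three source operands.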

Among the other tweaks to improve instruction throughput, the highest-impact change is probably the doubling in size of the L1 data cache's translation lookaside buffer. The TLB caches recent virtual-to-physical address translations, so a larger one lets more memory accesses skip a slow page-table walk. Beyond that, nearly every part of the chip has been massaged, save for the execution units. The branch predictor is more accurate, thanks to an innovation borrowed from the Bobcat core. The integer and FP schedulers are more aggressive about retiring instructions, making them effectively larger without a structure size increase. And the hardware prefetcher can better predictively populate the L2 cache, in part because it has been tuned for client-style workloads (whereas Bulldozer is tuned for servers).
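As a rough illustration of why TLB capacity can matter so much, here is a toy model of a TLB as a small, fully associative, LRU-replaced table of page translations. Everything in it is an assumption made for the sketch: the 4KB page size, the entry counts, and the access pattern are hypothetical and are not Piledriver's actual parameters.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # bytes; a common page size, assumed for this sketch


class ToyTLB:
    """A fully associative, LRU-replaced cache of page translations."""

    def __init__(self, entries: int) -> None:
        self.entries = entries
        self.cached_pages = OrderedDict()  # virtual page number -> placeholder
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> None:
        page = address // PAGE_SIZE
        if page in self.cached_pages:
            self.hits += 1
            self.cached_pages.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1  # real hardware would walk the page table here
            if len(self.cached_pages) >= self.entries:
                self.cached_pages.popitem(last=False)  # evict least recently used
            self.cached_pages[page] = True


def hit_rate(entries: int, pages_touched: int, passes: int) -> float:
    """Sweep repeatedly over a fixed set of pages and report the hit rate."""
    tlb = ToyTLB(entries)
    for _ in range(passes):
        for page in range(pages_touched):
            tlb.access(page * PAGE_SIZE)
    return tlb.hits / (tlb.hits + tlb.misses)


if __name__ == "__main__":
    # A 48-page working set overflows a 32-entry table but fits in 64 entries.
    for entries in (32, 64):
        print(entries, "entries:", hit_rate(entries, pages_touched=48, passes=10))
```

The pattern is deliberately a worst case for LRU: when the working set is slightly larger than the table, every access misses, and doubling the table turns nearly all of them into hits. Real workloads fall somewhere in between.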

As sweeping as the changes may look on paper, they are apparently rather modest in their cumulative effect. However, performance boosts can come from other sources, and Piledriver has been optimized to achieve higher clock frequencies at lower power levels. AMD tells us Piledriver responds much better than Llano's cores to changes in voltage, allowing wider latitude for clock frequencies and finer-grained control over those speeds. For a mobile-focused CPU (err, APU) like Trinity, such things tend to be especially helpful.

A new IGP based on, uh, proven technology
Trinity's integrated graphics are a generation beyond Llano's and are, in terms of basic capabilities, pretty well up to date. They're also based on an older generation of discrete graphics chips, "Northern Islands," most familiar from the Radeon HD 6900 series of video cards. AMD's current GCN architecture didn't make the cut.


Logical block diagram of Trinity's IGP. Source: AMD.

There's your requisite block diagram of the graphics portion of the chip. If you have really good glasses, you could count all of the units yourself. Trinity's IGP has six SIMD engines and sports a total of 384 shader ALUs. Each SIMD engine has a texture unit capable of filtering four texels per clock, so the IGP totals 24 texels per cycle. The two render back-ends can blend eight pixels per clock.

None of those numbers is particularly breathtaking. Llano's IGP has five SIMD engines, 400 ALUs, 20 texels per clock of filtering throughput, and dual render back-ends. Still, Trinity's IGP should make better use of its resources. Trinity's IGP trades up to a VLIW4 shader execution unit that is more area efficient. Llano's VLIW5 design has a fifth "fat" ALU for certain types of functions, and the other four ALUs have a subset of its abilities. The Northern Islands shader core eliminates that fifth ALU and grants full and equal functionality to the other four units. This new arrangement seems to work well aboard the Radeon HD 6900 series. Northern Islands also brings some improvements in tessellation performance, thanks to improved buffering intended to manage the difficult data flow issue created by geometry expansion.

Importantly for AMD's plans, the Northern Islands graphics core is better suited for non-graphics computing, too. The VLIW4 shaders should map well to a broader range of data sets, and this core adds the ability to execute multiple, independent kernels (or programs, essentially) at once, each with its own command queue and address domain.

None of those enhancements is likely to provide as much uplift versus Llano as one other change: higher IGP clock speeds. The fastest mobile Llano IGP runs at 444MHz, but Trinity's IGP operates at frequencies as high as 686MHz. When combined with the architectural enhancements and the slight bump from five SIMDs to six, the higher clock speed should make Trinity's IGP a considerable upgrade from Llano's. Texture filtering capacity is nearly doubled, and other key rates are up by 40-50%, with the notable exception of memory bandwidth, which depends on the DIMM speed.
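The peak-rate arithmetic behind those figures is simple enough to sketch: units per clock multiplied by clock speed. The names below are hypothetical, and one input is an assumption rather than a figure from the text: Llano's dual render back-ends are taken to blend eight pixels per clock, the same as Trinity's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IGP:
    name: str
    shader_alus: int
    texels_per_clock: int
    pixels_per_clock: int
    clock_mhz: int


# Per-clock figures and peak clocks as quoted in the text. Llano's
# pixels_per_clock is an assumption (same as Trinity's two back-ends).
LLANO = IGP("Llano", shader_alus=400, texels_per_clock=20,
            pixels_per_clock=8, clock_mhz=444)
TRINITY = IGP("Trinity", shader_alus=384, texels_per_clock=24,
              pixels_per_clock=8, clock_mhz=686)


def uplift(new: IGP, old: IGP) -> dict:
    """Ratio of peak rates (units per clock times clock speed), new over old."""
    def ratio(field: str) -> float:
        new_rate = getattr(new, field) * new.clock_mhz
        old_rate = getattr(old, field) * old.clock_mhz
        return new_rate / old_rate

    return {
        "shader_alu": ratio("shader_alus"),
        "texture_filtering": ratio("texels_per_clock"),
        "pixel_blending": ratio("pixels_per_clock"),
    }


if __name__ == "__main__":
    for rate, value in uplift(TRINITY, LLANO).items():
        print(f"{rate}: {value:.2f}x")
```

These are theoretical peaks at the top clock speed of each IGP. They say nothing about delivered performance, which also depends on memory bandwidth and on how well the VLIW4 units are kept busy.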

Although Trinity's IGP isn't based on the latest architecture, its associated media processing block is AMD's most recent vintage. The UVD3 video decode engine adds support for the MVC extension to H.264 for stereoscopic 3D, for the MPEG-4/DivX format, and for decoding dual HD streams simultaneously. The brand-new VCE block throws hardware-accelerated H.264 encoding into the mix, too—something that's important not just for performance and power efficiency reasons, but also for enabling new features like wireless displays.

Speaking of displays, Trinity can drive as many as four at once over HDMI, DVI, and DisplayPort. AMD has blazed the trail for DisplayPort adoption among consumer systems, and this chip supports DisplayPort 1.2 operation at up to 5.4 Gbps, including the daisy-chaining of multiple monitors on a single link. The APU can bundle sound into its digital display connections, as well—as many as four 7.1-channel audio streams, with broad support for digital encoding standards, including DTS Master Audio and Dolby TrueHD.