AMD has been surrounded by a fair amount of gloom for the past couple of years, but the firm's low-cost and low-power Brazos platform has been a consistent bright spot in spite of everything. The E-series APUs based on Brazos have saturated the low end of the laptop market, helping to send traditional, functionally hobbled netbooks to their doom. AMD's new leadership has repeatedly spoken about the virtues of Brazos as a business. They like it because it's high-volume—you tend to move a lot of chips when you sell 'em cheap—and because Intel hasn't competed vigorously against Brazos, apparently for fear of eating into its low-end Core i3 business.
The follow-up to Brazos is a single chip, known by the twin code names Kabini and Temash, that packs four CPU cores, a miniature Radeon graphics processor, and everything else you need for a functional PC onto a tiny slice of silicon. The true competition from Intel will be the Bay Trail part based on the Silvermont architecture, but it's not slated to arrive until later this year. In the meantime, AMD will have something truly distinctive to offer: a quad-core SoC that's fully PC compatible, very affordable, and fits into various sorts of sleek, slim systems. Kabini will aim for laptops—think ultra-thin systems with long battery life for under 500 bucks—and low-cost desktops. Meanwhile, Temash will target tablets between 10.1" and 11.6" in size that are roughly 10mm thick, maybe a bit less. Imagine a tablet that sits between the Microsoft Surface and Surface Pro in price, size, and performance and you'll have the basic idea.
A true PC system on a chip
Some time in the past couple of years, pretty much everybody in the PC industry started calling their CPUs "SoCs" or systems-on-a-chip. It's trendy, sounds like what Apple does, and is therefore entirely irresistible within a 100-mile radius of the San Francisco Bay. Although the definition of a "real" SoC is a little wobbly, Kabini/Temash may have the best claim yet to being the first true PC SoC.
Naturally, then, this chip packs in a ton of components. The headliners are undoubtedly the four "Jaguar" CPU cores, based on an evolution of the Bobcat microarchitecture used in Brazos, and the integrated graphics processor, which is derived from the same Graphics Core Next (GCN) architecture as the Radeon HD 7000-series discrete GPUs. The GPU includes a UVD media processing block capable of H.264 decoding and encoding, of course, and the chip's north bridge acts as a traffic cop, routing requests to the SoC's single-channel DDR3 memory controller.
All of those elements might be familiar from past AMD APUs, but this SoC also incorporates all of the I/O functionality that has traditionally been built into a separate south bridge chip or "Fusion controller hub," as AMD calls it. Branching out from Kabini are four PCI Express x1 links, two SATA 6Gbps disk interfaces, an SD card controller, two USB 3.0 ports, eight USB 2.0 connections, a gaggle of display interfaces including HDMI and DisplayPort 1.2, and a dedicated four-lane connection for an optional discrete GPU. Oh, and legacy I/O, like keyboard ports and such, is in there, too. Makes you wonder if there aren't secret connections for a Turbo button and an EISA card.
Integrating all of these things together on one chip saves power, reduces the physical footprint of a system, and cuts costs, too. That's why we keep seeing more and more integration over time. Kabini simply takes that concept to a logical endpoint by bringing aboard pretty much an entire small-scale PC.
The key words above, by the way, are "small-scale." AMD tells us these chips are being manufactured by two different foundry partners, TSMC and GlobalFoundries, at 28-nm process geometries. We haven't managed to wrangle the chip's exact transistor count or die size yet, but I've held one in my hand, and it's tiny. There has been some talk about how this SoC is closely related to the chips going into the PlayStation 4 and Xbox One—and it is, quite closely—but Kabini is scaled way down. The PS4, for instance, has eight Jaguar cores and 1152 GCN shader ALUs, while the Xbox One reportedly has eight cores and 768 shader ALUs. This chip has four cores and 128 shader ALUs. The memory bandwidth disparity is similarly huge between Kabini and the consoles, more than an order of magnitude. Although they share quite a bit of DNA, Kabini and Temash are aimed at much lower cost and power targets than the chips AMD has built for Sony and Microsoft.
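To put the scaling in perspective, here is a quick back-of-the-envelope sketch using only the core and shader ALU counts quoted above. The dictionary layout and the function name are ours, purely for illustration.

```python
# Core and shader ALU counts as quoted in the text above.
chips = {
    "Kabini/Temash": {"cores": 4, "shader_alus": 128},
    "Xbox One": {"cores": 8, "shader_alus": 768},
    "PS4": {"cores": 8, "shader_alus": 1152},
}


def scale_vs_kabini(name):
    """Return (core ratio, shader ALU ratio) of a chip relative to Kabini."""
    base = chips["Kabini/Temash"]
    chip = chips[name]
    return (
        chip["cores"] / base["cores"],
        chip["shader_alus"] / base["shader_alus"],
    )


for name in ("Xbox One", "PS4"):
    cores, alus = scale_vs_kabini(name)
    print(f"{name}: {cores:.0f}x the cores, {alus:.0f}x the shader ALUs")
```

The CPU side only doubles, while the graphics side grows by a factor of six or nine, which is where most of the consoles' extra silicon goes.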
The Jaguar core
Although its bigger CPUs haven't been as competitive as hoped lately, AMD has had a nice run with the Bobcat core used in the Brazos platform. Bobcat came out of the gate using out-of-order execution and only one thread per core, and as a result, it was about 20% faster than the Atom in our tests, especially in cases where applications weren't readily multithreaded. Now, Intel has committed to a similar template for the upcoming, all-new Silvermont Atom architecture, with an emphasis on improving per-thread performance. Meanwhile, AMD has revised its low-power microarchitecture in a multitude of ways both big and small, and the result is the evolutionary step known as Jaguar.
Jaguar brings a few principal improvements over the prior generation in terms of power efficiency and performance, which are essentially two sides of the same coin these days. A host of tweaks throughout the core has produced a 22% gain in instruction throughput per clock, although that gain is more like 15% if you don't factor in the impact of the larger L2 cache. Either way, the generational advancements are substantial. Also, Jaguar has been retooled for better frequency-voltage response, in part via the addition of a couple of pipeline stages, so the chip should consume less power at a given clock speed. Finally, the core has been tweaked for better power efficiency in other ways, too, including some unit redesigns and an expansion of the ability to gate off the clock signal from portions of the chip that are currently idle.
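Those two figures imply a rough estimate of what the larger L2 cache is worth on its own. The sketch below assumes the core tweaks and the cache combine multiplicatively, which is our simplification rather than anything AMD has stated; the function name is hypothetical.

```python
def l2_contribution(total_gain, core_only_gain):
    """Implied per-clock gain from the larger L2 alone.

    Assumes (our simplification) that the core-only gain and the
    L2 gain multiply to produce the total gain.
    """
    return (1 + total_gain) / (1 + core_only_gain) - 1


total = 0.22      # per-clock gain with the larger L2, from the text
core_only = 0.15  # per-clock gain without the L2's impact, from the text

print(f"Implied L2 contribution: {l2_contribution(total, core_only):.1%}")
```

Under that assumption the cache accounts for roughly 6% of the per-clock gain, a little less than the seven points you'd get by simply subtracting one figure from the other.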
Even greater performance increases are possible by harnessing extensions to the x86 instruction set, and Jaguar adds support for a whole range of those, including the SIMD alphabet soup that is SSE 4.1, SSE 4.2, and AVX. Also supported are AES-NI encryption acceleration and F16C format conversions. Other new features suggest Jaguar may find its way into server systems soon, including the expansion of physical addressing to 40 bits and the better support for OS virtualization.
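If you want to see whether a given system exposes these extensions, Linux lists them on the "flags" line of /proc/cpuinfo. Here's a minimal sketch: the flag names are the ones the Linux kernel uses, while the helper names and the sample flags string are made up for illustration.

```python
# Linux /proc/cpuinfo flag names for the extensions named in the text.
JAGUAR_EXTENSIONS = {
    "SSE 4.1": "sse4_1",
    "SSE 4.2": "sse4_2",
    "AVX": "avx",
    "AES-NI": "aes",
    "F16C": "f16c",
}


def supported_extensions(flags_line):
    """Map each extension name to True/False given a space-separated flags string."""
    flags = set(flags_line.split())
    return {name: flag in flags for name, flag in JAGUAR_EXTENSIONS.items()}


def read_cpu_flags(path="/proc/cpuinfo"):
    """Return the first 'flags' line's contents (Linux only), or '' if absent."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""


# Illustrative sample string, not output captured from real hardware.
sample = "fpu sse sse2 ssse3 sse4_1 sse4_2 aes avx f16c"
print(supported_extensions(sample))
```

On a Linux box, passing the result of read_cpu_flags() to supported_extensions() would check the real CPU instead of the sample string.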
Above is a functional block diagram of the Jaguar architecture. Although there are tweaks throughout the core that contribute to the IPC gains, the most sweeping changes are reserved for the floating-point unit, which is a total redesign. The new FPU is 128 bits wide, twice the width of Bobcat's, and is responsible for executing many of those extended SIMD instructions like SSE and AVX. With single-precision datatypes, the execution hardware can perform four multiplies and four adds per cycle. For double-precision math, the rate is one multiply and two adds per clock.
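Those per-cycle rates translate into peak throughput with some simple arithmetic. The sketch below uses the multiply and add rates quoted above; the 1.5GHz clock in the example is a placeholder we picked for illustration, not a spec for any shipping part, and the function names are ours.

```python
# Per-core, per-cycle issue rates quoted in the text.
RATES = {
    "single": {"mul": 4, "add": 4},
    "double": {"mul": 1, "add": 2},
}


def flops_per_cycle(precision, cores=1):
    """Peak floating-point operations per clock across the given number of cores."""
    rate = RATES[precision]
    return (rate["mul"] + rate["add"]) * cores


def peak_gflops(precision, cores, clock_ghz):
    """Theoretical peak GFLOPS; real code will rarely sustain this."""
    return flops_per_cycle(precision, cores) * clock_ghz


# Hypothetical clock speed, chosen only to make the arithmetic concrete.
for precision in ("single", "double"):
    print(precision, peak_gflops(precision, cores=4, clock_ghz=1.5), "GFLOPS")
```

Per core, that works out to eight single-precision or three double-precision operations per clock, so double-precision peak throughput is well under half the single-precision figure.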
AMD says support for 256-bit wide AVX extensions is achieved by "double-pumping" the 128-bit execution units. In this case, "double pumping" means data are fed through the units in two passes, but the units do not run at twice the base clock frequency, as the Pentium 4's integer ALUs did.
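As a toy illustration of the idea, the sketch below treats a 256-bit single-precision add as two trips through a four-lane, 128-bit unit. It models only the splitting of the work, not the real hardware's scheduling or timing, and all of the names are ours.

```python
LANES_128 = 4  # four 32-bit floats fit in a 128-bit execution unit


def add_128(a, b):
    """One pass through the 128-bit unit: four lanes added at once."""
    assert len(a) == len(b) == LANES_128
    return [x + y for x, y in zip(a, b)]


def add_256_double_pumped(a, b):
    """Eight-lane add carried out as two passes through the four-lane unit.

    Returns the eight results and the number of passes taken.
    """
    assert len(a) == len(b) == 2 * LANES_128
    result = []
    passes = 0
    for start in (0, LANES_128):
        end = start + LANES_128
        result += add_128(a[start:end], b[start:end])
        passes += 1
    return result, passes


values, passes = add_256_double_pumped(list(range(8)), [10] * 8)
print(values, "in", passes, "passes")
```

The upshot is that a 256-bit AVX instruction occupies the unit for two passes, so it doesn't buy any extra peak throughput over the 128-bit equivalent; the benefit is compatibility with AVX code.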
Kabini's four revised Jaguar cores are fed by a 2MB L2 cache shared via a common interface that connects to each core individually. Sharing a cache in this way has several benefits. In light workloads where one or more cores are inactive, the busy cores will effectively have more L2 cache capacity available to them, improving per-thread performance. Meanwhile, because the L2 cache replicates the contents of the cores' L1 caches, the L2 can act as a probe filter for coherency traffic, facilitating more efficient multitasking.
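The probe-filter idea can be sketched with a toy model. If everything in the L1 caches is guaranteed to be in the L2 as well, then a coherency probe that misses in the L2 never needs to bother the cores. The class below is our simplification, with made-up names, and says nothing about how AMD's actual logic is organized.

```python
class ToyCacheHierarchy:
    """Toy model of an L2 that always contains whatever the L1s contain."""

    def __init__(self, cores=4):
        self.l1 = [set() for _ in range(cores)]
        self.l2 = set()
        self.l1_lookups = 0  # counts how often a probe disturbs an L1

    def load(self, core, addr):
        # Filling an L1 also fills the L2, so the L2 mirrors L1 contents.
        self.l2.add(addr)
        self.l1[core].add(addr)

    def evict_from_l2(self, addr):
        # Removing a line from the L2 also removes it from every L1.
        self.l2.discard(addr)
        for l1 in self.l1:
            l1.discard(addr)

    def probe(self, addr):
        """Return the cores holding addr, skipping the L1s on an L2 miss."""
        if addr not in self.l2:
            return []  # filtered: no L1 can hold a line the L2 lacks
        holders = []
        for core, l1 in enumerate(self.l1):
            self.l1_lookups += 1
            if addr in l1:
                holders.append(core)
        return holders


caches = ToyCacheHierarchy()
caches.load(0, 0x1000)
print(caches.probe(0x2000), caches.l1_lookups)  # L2 miss, no L1 traffic
print(caches.probe(0x1000), caches.l1_lookups)  # L2 hit, L1s are checked
```

In this model, the first probe is answered by the L2 alone, and only the second one costs the cores any lookups.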
AMD has put some work into the L2 interface, which makes sense since it's the cores' only path to the rest of the system. The L2 interface runs at the full speed of the CPU cores and has built-in smarts, including the ability to store L2 tags, so it knows which portion of the cache to light up when the time comes to access one of its four 512KB banks. When those L2 cache banks aren't needed, they're clock gated to save power. AMD further conserves power by clocking the L2 arrays at half the frequency of the CPU cores.
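Here's a rough sketch of how bank selection and the half-speed arrays might be modeled. The bank count, bank size, and half-rate clock come from the description above. The 64-byte line size and the simple interleaved address-to-bank mapping are our assumptions for illustration, since AMD hasn't detailed the real mapping here, and the function names are hypothetical.

```python
BANKS = 4                # from the text
BANK_BYTES = 512 * 1024  # 512KB per bank, from the text
LINE_BYTES = 64          # assumption: line size is not given in the text


def bank_for(addr):
    """Hypothetical mapping: interleave consecutive cache lines across banks."""
    return (addr // LINE_BYTES) % BANKS


def banks_to_wake(addrs):
    """Banks that must be active for a set of accesses; the rest stay clock gated."""
    return sorted({bank_for(a) for a in addrs})


def l2_array_clock(core_mhz):
    """The L2 arrays run at half the core clock, per the text."""
    return core_mhz / 2


print(BANKS * BANK_BYTES // (1024 * 1024), "MB total")
print(banks_to_wake([0, 64, 128]))
print(l2_array_clock(1500), "MHz arrays at a hypothetical 1500MHz core clock")
```

Whatever the real mapping looks like, the point is the same: knowing the target bank ahead of time means the other three can sit idle for that access.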