
AMD's A8-7600 'Kaveri' processor reviewed


Better graphics, bigger contrasts
— 7:00 AM on January 14, 2014

For several generations, since Llano, AMD has been slowly but methodically marching toward its vision of accelerated computing, where traditional CPU cores and graphics share space on a chip and work together to process data. This vision was called "fusion" back when the process began, although you won't hear that term coming from AMD these days. Regardless, AMD's latest processor, or APU (short for "accelerated processing unit"), is a major milestone on the path toward fused computing—and AMD is taking the wraps off of it today.

Compared to AMD's current APUs, the chip code-named Kaveri is packed with sweeping changes, including enhanced "Steamroller" CPU cores, updated Radeon graphics, and a first-of-its-kind ability for the onboard CPU and GPU cores to share memory and work together to tackle a problem. Those are just the big-ticket items. Virtually every unit in Kaveri has been enhanced in some fashion.

Same space, another billion transistors


An incredibly vague picture of the Kaveri die. Source: AMD.

The changes in Kaveri start with the transition to a new chip fabrication process that packs more transistors into the same space.

The prior-gen Trinity/Richland APUs were built at GlobalFoundries using a familiar sort of manufacturing process for AMD CPUs, with feature sizes as small as 32-nm and a silicon-on-insulator (SOI) substrate. This 32-nm SOI process is tuned expressly for CPUs and helps enable the clock frequencies above 4GHz that are common in AMD's desktop processors.

For Kaveri, AMD and GloFo have developed a 28-nm SHP (short for "super-high performance," presumably) process that trades SOI for traditional bulk silicon. The 28-nm SHP process is tuned differently, to allow for higher transistor densities and somewhat lower peak switching speeds. AMD describes the process as a "happy medium" tuning point, one more accommodating to the GPU portion of Kaveri's die.

| Code name | Key products | CPU cores/modules | CPU threads | Last-level cache size | Process node (nm) | Estimated transistors (millions) | Die area (mm²) |
|---|---|---|---|---|---|---|---|
| Lynnfield | Core i5, i7 | 4 | 8 | 8 MB | 45 | 774 | 296 |
| Sandy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 32 | 995 | 216 |
| Ivy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 22 | 1200 | 160 |
| Haswell (Quad GT2) | Core i5, i7 | 4 | 8 | 8 MB | 22 | 1400 | 177 |
| Llano | A8, A6, A4 | 4 | 4 | 1 MB x 4 | 32 | 1450 | 228 |
| Trinity/Richland | A10, A8, A6 | 2 | 4 | 2 MB x 2 | 32 | 1303 | 246 |
| Kaveri | A10, A8 | 2 | 4 | 2 MB x 2 | 28 | 2410 | 245 |

Thanks to this new manufacturing process, Kaveri crams about 1.1 billion more transistors, most of them dedicated to graphics, into approximately the same die area as Trinity. The trade-off is lower CPU clock speeds than Trinity and Richland, especially in the higher power envelopes typical of most desktop processors.
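To put the density claim in numbers, here is a minimal Python sketch that divides the estimated transistor counts by the die areas listed in the table above. The `CHIPS` dictionary and the helper names are illustrative, not anything from AMD or Intel, and the transistor counts are estimates, so treat the results as rough.

```python
# Figures copied from the table above:
# (estimated transistors in millions, die area in mm²).
CHIPS = {
    "Lynnfield": (774, 296),
    "Sandy Bridge": (995, 216),
    "Ivy Bridge": (1200, 160),
    "Haswell (Quad GT2)": (1400, 177),
    "Llano": (1450, 228),
    "Trinity/Richland": (1303, 246),
    "Kaveri": (2410, 245),
}


def density(name):
    """Return millions of transistors per square millimeter for one chip."""
    transistors_m, area_mm2 = CHIPS[name]
    return transistors_m / area_mm2


def extra_transistors(new, old):
    """Return how many more transistors (in millions) `new` has than `old`."""
    return CHIPS[new][0] - CHIPS[old][0]


for chip in CHIPS:
    print(f"{chip:20s} {density(chip):5.2f} M transistors/mm²")

print("Kaveri over Trinity/Richland:",
      extra_transistors("Kaveri", "Trinity/Richland"), "M transistors")
```

By this rough measure, Kaveri lands a little under 10 million transistors per mm², versus a little over 5 million for Trinity/Richland, in nearly the same die area.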

If you've been following these things, this story may sound familiar. Intel has taken a similar path with its 22-nm fab process, tuning for better low-power operation at the expense of peak performance. Given that chips like Kaveri and Intel's Haswell are geared primarily for laptops, this sort of tuning makes sense.

That said, AMD and Intel aren't exactly aligned in their approaches to highly integrated CPUs. In the last couple of generations, Intel has pushed into ever-lower power envelopes with its Core processors. Haswell Y-series parts can squeeze into power envelopes as low as 6W, and that's with an on-package "PCH," or south bridge I/O chip. AMD evidently didn't see that move coming when it defined the requirements of its new APU. Kaveri operates in a broad range of power targets between 15W and 95W, but it's most likely not optimal at either end of that range. AMD hasn't yet announced the mobile versions of Kaveri—today's introduction applies only to the desktop variants—but the 15W version of Kaveri will presumably have an external south bridge with its own power budget. AMD will have to cover lower power ranges with its Kabini and Temash SoCs, which are decent but cheaper, lower-performance chips.

Steamroller CPU cores
Kaveri has a pair of CPU modules, each with two "tightly coupled" integer cores and a single, shared floating-point unit. In keeping with its recent heavy-machinery theme, AMD calls this next revision of its CPU microarchitecture Steamroller. Kaveri's Steamroller modules have been tweaked in significant ways to improve performance and power efficiency compared to the previous generations, known as Piledriver and Bulldozer. AMD CTO Mark Papermaster revealed many of the changes on tap for Steamroller over a year ago, but Kaveri is the first silicon to include this generation of AMD's x86 processor tech.


An alarmingly simplified block diagram of a Steamroller module. Source: AMD.

The CPU modules in the Bulldozer family have never quite lived up to expectations for various reasons. The obvious point of emphasis in Steamroller is keeping the execution engine better fed through tweaks to the microarchitecture's front end. Most notably, instruction decode is no longer a shared resource. The module has separate, dedicated decoders for each of its two integer cores. Also, the instruction cache is now 50% larger, at 96KB, and is three-way set associative. AMD claims i-cache misses have been reduced by 30% as a result. Furthermore, the branch target buffer has grown in size from 5K to 10K entries, giving the branch predictor more insight into program activity. The benefit is a claimed 20% reduction in branch mispredictions. Tricky x86 instructions that require the use of microcode should run faster in Steamroller, as well, since microcode ROM can be accessed simultaneously by both of the module's threads.
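For a sense of what the instruction cache change means structurally, the sketch below computes how many sets a set-associative cache has from its size and associativity. The 96KB size and three-way associativity come from the text above. The 64-byte line size is an assumption, as is the two-way associativity of the previous cache; the 64KB predecessor size is simply what "50% larger" implies. The function name `cache_sets` is made up for illustration.

```python
def cache_sets(size_kb, ways, line_bytes=64):
    """Return the number of sets in a set-associative cache.

    line_bytes=64 is an assumption; the article does not state the line size.
    """
    total_bytes = size_kb * 1024
    bytes_per_set = ways * line_bytes
    if total_bytes % bytes_per_set:
        raise ValueError("cache size is not a whole number of sets")
    return total_bytes // bytes_per_set


# Steamroller figures from the article: 96KB, three-way set associative.
steamroller_sets = cache_sets(96, ways=3)

# "50% larger, at 96KB" implies a 64KB predecessor (96 * 2 / 3).
# Its two-way associativity is an assumption, not stated in the article.
previous_kb = 96 * 2 // 3
previous_sets = cache_sets(previous_kb, ways=2)

print("Steamroller i-cache sets:", steamroller_sets)
print("Previous i-cache sets (assumed two-way):", previous_sets)
```

Under those assumptions, both caches have the same number of sets, which is consistent with the extra capacity coming from an added way rather than from more sets.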

There are some big numbers attached to those individual front-end improvements. Combined with a larger scheduler window that adds 5-10% more efficiency, the Steamroller execution engine is apparently being kept much busier. On a per-thread basis, AMD says instruction dispatches that use the max width of the machine have risen by 25%. The Steamroller module can retire work at a higher rate, too, thanks to improvements to its back end (including enhancements to the load and store queues).

Of course, improvements in individual areas don't always translate directly into overall performance gains, since architectural constraints tend to move around depending on the workload. AMD claims Steamroller delivers an overall average gain in retired instructions per clock of about 10% over Piledriver, although that number can rise as high as 20% in certain scenarios. The good news is that Kaveri's IPC increases should serve to offset the reduction in clock frequency caused by the switch to 28-nm SHP manufacturing, thus keeping CPU performance steady from Trinity and Richland. The bad news is that AMD may be largely treading water in terms of overall CPU performance, while Intel continues to extend its lead.
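The offsetting argument can be checked with simple arithmetic. The sketch below uses a deliberately crude model in which single-thread throughput is IPC multiplied by clock frequency, and it asks how large a clock reduction a given IPC gain can absorb. The 10% and 20% IPC figures are AMD's claims quoted above; the model and function names are assumptions for illustration, and real workloads will not scale this cleanly.

```python
def relative_performance(ipc_gain, clock_change):
    """Relative throughput under a simple IPC x clock model.

    Both arguments are fractional changes, e.g. 0.10 for +10%
    and -0.05 for a 5% reduction.
    """
    return (1 + ipc_gain) * (1 + clock_change)


def break_even_clock_drop(ipc_gain):
    """Fractional clock reduction that exactly cancels a given IPC gain."""
    return 1 - 1 / (1 + ipc_gain)


for gain in (0.10, 0.20):
    drop = break_even_clock_drop(gain)
    print(f"IPC +{gain:.0%}: break-even clock reduction is about {drop:.1%}")
```

In this model, a 10% IPC gain is used up by a clock reduction of roughly 9%, so a chip that loses more clock speed than that would come out behind its predecessor in single-threaded work.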