
AMD's A8-3500M Fusion APU

Llano flows forth

Computer chips become more complex over time. We know this in our bones by now, in various ways, whether it's watching ever more functionality get crammed into smart phones or the constant drumbeat being sounded for, well, the constant drumbeat of Moore's Law. In recent years, we've watched the CPU rise from a single core to two, four, and even more. Cache sizes, clock speeds, and performance have grown over time, as well.

Even so, the sheer scope of AMD's new processor—code-named "Llano" and creatively dubbed an "accelerated processing unit" (APU) rather than a CPU—may cause you to do a double-take. This one chip incorporates a whole host of elements, many of which used to reside in other parts of a PC: up to four traditional CPU cores, a north bridge, a DDR3 memory controller, a bundle of PCI Express connectivity, a moderately robust Radeon GPU with an associated UVD block for video acceleration, and a pair of display interfaces. That's a mighty long list of capabilities consolidated into one piece of silicon, almost a system on a chip rather than a CPU surrounded by many helpers.

By integrating so many pieces together, Llano follows a trajectory for CPUs established long ago, when they first incorporated floating-point units. L2 caches were next to be assimilated, followed by memory controllers in AMD's K8. The integration trend has really picked up steam in recent years, though, and the most fully realized example has been Llano's primary competitor, Intel's Sandy Bridge processor. Even though it follows Sandy Bridge by roughly half a year, Llano still feels like a notable milestone on the integration path, in part because AMD has covered a lot of ground in this single step—and in part because Llano has absorbed a familiar and relatively formidable Radeon GPU.

Integration is the hot trend because it offers two main types of benefits. First, bringing ever more components onto the CPU die can reduce the size, cost, and power consumption of a computer system. Laptops have grown dramatically smaller and more capable in recent years, with longer battery life, thanks to creeping integration. Second, situating key computing resources together on the same die has the potential to improve performance substantially, especially if those components can take advantage of a shared pool of memory.

By christening Llano a "Fusion APU" and talking about the possibility of tools like OpenCL allowing the execution resources of the CPU and GPU to work together, AMD's marketing machine has chosen to emphasize the second class of benefits. Make no mistake, though: Llano is about that first class of benefits, through and through.

Fusion's first steps
Intel has been shipping CPUs based on its own 32-nm manufacturing process for well over a year, but Llano is the first chip from AMD and its manufacturing partner, GlobalFoundries, to ship in volume at 32 nanometers. GloFo's 32-nm process is distinct from Intel's in several ways, including the use of silicon-on-insulator layering and a "gate-first" approach to the construction of high-k metal gates. Together, these techniques have helped create the benefits one would hope to see from a process shrink. According to Dr. Dirk Wristers, GloFo's VP of Technology and Integration, this 32-nm process offers a 100% increase in transistor density, along with a 40% increase in switching speed and a 40% reduction in energy required per switch, versus its 45-nm predecessor.

The upshot of these changes for Llano is room for more toys—a vastly increased transistor budget—and the potential for achieving higher performance in a relatively small power envelope.
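To put those three process figures in perspective, here is a rough back-of-the-envelope sketch. It assumes an idealized port of a block of logic from 45 nm to 32 nm, and it treats dynamic power as simply energy per switch times switching rate, which ignores leakage and any design changes. The function name and the 100 mm², 2.0 GHz, 10 W starting block are made up for illustration; only the three percentages come from GlobalFoundries' claims.

```python
# Process-scaling figures quoted for the 32-nm process vs. its 45-nm predecessor.
DENSITY_GAIN = 1.00      # +100% transistor density
SPEED_GAIN = 0.40        # +40% switching speed
ENERGY_REDUCTION = 0.40  # -40% energy per switch


def scale_to_32nm(area_mm2, clock_ghz, dynamic_watts):
    """Idealized projection of a 45-nm logic block onto the 32-nm process.

    Assumption (not from the article): dynamic power scales with
    energy per switch multiplied by switching rate; leakage is ignored.
    """
    energy_ratio = 1.0 - ENERGY_REDUCTION
    speed_ratio = 1.0 + SPEED_GAIN
    return {
        # Same transistor count in half the area at double the density.
        "area_mm2": area_mm2 / (1.0 + DENSITY_GAIN),
        # Highest clock if the full switching-speed gain is spent on frequency.
        "max_clock_ghz": clock_ghz * speed_ratio,
        # Power if the clock is left where it was.
        "watts_same_clock": dynamic_watts * energy_ratio,
        # Power if the clock is raised by the full 40%.
        "watts_max_clock": dynamic_watts * energy_ratio * speed_ratio,
    }


# Hypothetical 45-nm block: 100 mm^2, 2.0 GHz, 10 W of dynamic power.
projection = scale_to_32nm(100.0, 2.0, 10.0)
for key, value in projection.items():
    print(f"{key}: {round(value, 2)}")
```

Under these assumptions, the designer can spend the shrink on lower power at the same clock, on a higher clock at somewhat lower power, or on extra transistors in the freed-up area. Llano appears to lean on the last option.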

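| Code name | Key products | Cores | Threads | Last-level cache size | Process node (nm) | Transistors (millions) | Die area (mm²) |
|---|---|---|---|---|---|---|---|
| Penryn | Core 2 Duo | 2 | 2 | 6 MB | 45 | 410 | 107 |
| Bloomfield | Core i7 | 4 | 8 | 8 MB | 45 | 731 | 263 |
| Lynnfield | Core i5, i7 | 4 | 8 | 8 MB | 45 | 774 | 296 |
| Westmere | Core i3, i5 | 2 | 4 | 4 MB | 32 | 383 | 81 |
| Gulftown | Core i7-980X | 6 | 12 | 12 MB | 32 | 1168 | 248 |
| Sandy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 32 | 995 | 216 |
| Sandy Bridge | Core i3, i5 | 2 | 4 | 4 MB | 32 | 624 | 149 |
| Sandy Bridge | Pentium | 2 | 4 | 3 MB | 32 | - | 131 |
| Deneb | Phenom II | 4 | 4 | 6 MB | 45 | 758 | 258 |
| Propus/Rana | Athlon II X4/X3 | 4 | 4 | 512 KB x 4 | 45 | 300 | 169 |
| Regor | Athlon II X2 | 2 | 2 | 1 MB x 2 | 45 | 234 | 118 |
| Thuban | Phenom II X6 | 6 | 6 | 6 MB | 45 | 904 | 346 |
| Llano | A8, A6, A4 | 4 | 4 | 1 MB x 4 | 32 | 1450 | 228 |
| Llano | A4 | 2 | 2 | 1 MB x 2 | 32 | 758 | - |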

The unnecessarily well-populated table above shows how Llano compares to a broad range of today's desktop processors. As you can see, AMD actually has plans for two very different versions of Llano silicon, one with four cores and another with two cores and just over half the transistors. The quad-core version is first out of the chute, and initially, AMD will offer dual-core models of its A-series APUs made from the larger chip with a couple of cores disabled. Eventually, the native dual-core variant will take over, because it should be much more economical to manufacture. (Since it's not here yet, AMD hasn't seen fit to divulge the dual-core Llano's die size.)

Somewhat surprisingly, Llano's transistor count eclipses that of all of its contemporaries, including the six-core Gulftown chip with 12 MB of L3 cache. However, the larger concern is die area, because that determines the cost to make the thing. As you can see, the quad-core Llano at 228 mm² is slightly larger than the 216 mm² quad-core Sandy Bridge. The difference doesn't seem so notable until we consider that the bigger Llano will mostly do battle against the mid-size, 149 mm² Sandy Bridge. Of course, higher costs for AMD don't necessarily mean higher prices for consumers, just lower profits for AMD.
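For the curious, the density and die-size comparisons can be worked out from the table's numbers with a few lines of Python. This is only a rough sketch: the transistor counts are vendor-supplied, counting methods may differ between AMD and Intel, and each chip has its own mix of logic, cache, and graphics, so treat cross-vendor comparisons loosely. The `density` helper and the dictionary labels are our own.

```python
# (transistors in millions, die area in mm^2), copied from the table above.
chips = {
    "Deneb (45 nm)": (758, 258),
    "Thuban (45 nm)": (904, 346),
    "Sandy Bridge quad (32 nm)": (995, 216),
    "Sandy Bridge dual (32 nm)": (624, 149),
    "Llano quad (32 nm)": (1450, 228),
}


def density(transistors_m, area_mm2):
    """Millions of transistors per square millimeter of die."""
    return transistors_m / area_mm2


for name, (transistors_m, area_mm2) in chips.items():
    print(f"{name:28s} {density(transistors_m, area_mm2):5.2f} M transistors/mm^2")

# How much more silicon does Llano need than the Sandy Bridge it mostly competes with?
llano_area = chips["Llano quad (32 nm)"][1]
sb_dual_area = chips["Sandy Bridge dual (32 nm)"][1]
area_ratio = llano_area / sb_dual_area
print(f"Llano quad vs. Sandy Bridge dual die area: {area_ratio:.2f}x")

# AMD 32 nm vs. AMD 45 nm density, as a sanity check on the process claims.
density_ratio = density(*chips["Llano quad (32 nm)"]) / density(*chips["Deneb (45 nm)"])
print(f"Llano quad vs. Deneb density: {density_ratio:.2f}x")
```

By this math, the quad-core Llano occupies roughly one and a half times the area of the dual-core Sandy Bridge, and it packs a little more than twice as many transistors per square millimeter as Deneb. That second result is in the same neighborhood as the doubling of density claimed for the 32-nm process.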

An annotated look at the "Llano" die. Source: AMD.

Llano itself may be new, but the individual components that make it up are largely familiar. The CPU cores are based on the now-venerable "Stars" microarchitecture used across the current Athlon, Phenom, and Opteron lineups. In Llano, each of those cores has a full megabyte of L2 cache associated with it, double the amount used in Propus (Athlon II) and Deneb/Thuban (Phenom II). That addition may, in part, help offset the loss of the 6MB L3 cache used in the Phenom II. Mike Goddard, Chief Engineer of AMD's client solutions, said the L3 cache was nixed for two reasons. First, the L3's performance advantages were limited by the latency it added to memory accesses. Second, and probably most notably, the L3 cache presented a power consumption problem, because it had to stay awake when any one of the CPU cores was awake. The power-performance tradeoff apparently wasn't worth it.

Block diagram of the AMD "Stars" CPU core. Source: AMD.

Goddard claimed Llano's implementation of the "Stars" core achieves over 6% higher instruction throughput per clock than prior versions due to a number of small refinements. The biggest contributor there may be the larger L2 cache. The algorithm that speculatively pre-fetches data into that cache has been beefed up, too. Llano's cores have larger reorder and load/store buffers, and the execution resources have been enhanced with the addition of a hardware divider unit. Those are the headliner tweaks, though Goddard hinted a number of more minor changes were included during the port to 32 nanometers, as well. The 6% figure doesn't sound like much, but it is more than we expected out of probably the last hurrah for this microarchitecture, before Bulldozer takes over later this year.