Computer chips become more complex over time. We know this in our bones by now, in various ways, whether it’s watching ever more functionality get crammed into smart phones or the constant drumbeat being sounded for, well, the constant drumbeat of Moore’s Law. In recent years, we’ve watched the CPU rise from a single core to two, four, and even more. Cache sizes, clock speeds, and performance have grown over time, as well.
Even so, the sheer scope of AMD’s new processor—code-named “Llano” and creatively dubbed an “accelerated processing unit” (APU) rather than a CPU—may cause you to do a double-take. This one chip incorporates a whole host of elements, many of which used to reside in other parts of a PC: up to four traditional CPU cores, a north bridge, a DDR3 memory controller, a bundle of PCI Express connectivity, a moderately robust Radeon GPU with an associated UVD block for video acceleration, and a pair of display interfaces. That’s a mighty long list of capabilities consolidated into one piece of silicon, almost a system on a chip rather than a CPU surrounded by many helpers.
By integrating so many pieces together, Llano follows a trajectory for CPUs established long ago, when they first incorporated floating-point units. L2 caches were next to be assimilated, followed by memory controllers in AMD’s K8. The integration trend has really picked up steam in recent years, though, and the most fully realized example has been Llano’s primary competitor, Intel’s Sandy Bridge processor. Even though it follows Sandy Bridge by roughly half a year, Llano still feels like a notable milestone on the integration path, in part because AMD has covered a lot of ground in this single step—and in part because Llano has absorbed a familiar and relatively formidable Radeon GPU.
Integration is the hot trend because it offers two main types of benefits. First, bringing ever more components on the CPU die can reduce the size, cost, and power consumption of a computer system. Laptops have grown dramatically smaller and more capable in recent years, with longer battery life, thanks to creeping integration. Second, situating key computing resources together on the same die has the potential to improve performance substantially, especially if those components can take advantage of a shared pool of memory.
By christening Llano a “Fusion APU” and talking about the possibility of tools like OpenCL allowing the execution resources of the CPU and GPU to work together, AMD’s marketing machine has chosen to emphasize the second class of benefits. Make no mistake, though: Llano is about that first class of benefits, through and through.
Fusion’s first steps
Intel has been shipping CPUs based on its own 32-nm manufacturing process for well over a year, but Llano is the first chip from AMD and its manufacturing partner, GlobalFoundries, to ship in volume at 32 nanometers. GloFo’s 32-nm process is distinct from Intel’s in several ways, including the use of silicon-on-insulator layering and a “gate-first” approach to the construction of high-k metal gates. Together, these techniques have helped create the benefits one would hope to see from a process shrink. According to Dr. Dirk Wristers, GloFo’s VP of Technology and Integration, this 32-nm process offers a 100% increase in transistor density, along with a 40% increase in switching speed and a 40% reduction in energy required per switch, versus its 45-nm predecessor.
The upshot of these changes for Llano is room for more toys—a vastly increased transistor budget—and the potential for achieving higher performance in a relatively small power envelope.
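|Code name||Key products||Cores||Threads||Last-level cache size||Process node (nm)||Estimated transistors (millions)||Die area (mm²)|
|Penryn||Core 2 Duo||2||2||6 MB||45||410||107|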
|Bloomfield||Core i7||4||8||8 MB||45||731||263|
|Lynnfield||Core i5, i7||4||8||8 MB||45||774||296|
|Westmere||Core i3, i5||2||4||4 MB||32||383||81|
|Gulftown||Core i7-980X||6||12||12 MB||32||1168||248|
|Sandy Bridge||Core i5, i7||4||8||8 MB||32||995||216|
|Sandy Bridge||Core i3, i5||2||4||4 MB||32||624||149|
|Sandy Bridge||Pentium||2||4||3 MB||32||–||131|
|Deneb||Phenom II||4||4||6 MB||45||758||258|
|Propus/Rana||Athlon II X4/X3||4||4||512 KB x 4||45||300||169|
|Regor||Athlon II X2||2||2||1 MB x 2||45||234||118|
|Thuban||Phenom II X6||6||6||6 MB||45||904||346|
|Llano||A8, A6, A4||4||4||1MB x 4||32||1450||228|
|Llano||A4||2||2||1MB x 2||32||758||–|
The unnecessarily well-populated table above shows how Llano compares to a broad range of today’s desktop processors. As you can see, AMD actually has plans for two very different versions of Llano silicon, one with quad cores and another with two cores and just over half the transistors. The quad-core version is first out of the chute, and initially, AMD will offer dual-core models of its A-series APUs made from the larger chip with a couple of cores disabled. Eventually, the native dual-core variant will take over, because it should be much more economical to manufacture. (Since it’s not here yet, AMD hasn’t seen fit to divulge the dual-core Llano’s die size.)
Somewhat surprisingly, Llano’s transistor count eclipses all of its contemporaries, including the six-core Gulftown chip with 12MB of L3 cache. However, the larger concern is die area, because that determines the cost to make the thing. As you can see, the quad-core Llano at 228 mm² is slightly larger than the 216 mm² quad-core Sandy Bridge. The difference doesn’t seem so notable—until we consider that the bigger Llano will mostly do battle against the mid-size, 149 mm² Sandy Bridge. Of course, higher costs for AMD don’t necessarily mean higher prices for consumers—just lower profits for AMD.
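For a rough sense of what the 32-nm shrink bought AMD, the table's numbers can be turned into transistor density. The snippet below is only a back-of-the-envelope sketch: the transistor counts are estimates, vendors count transistors differently, and chips with different mixes of cache and logic aren't directly comparable, so treat the output as a ballpark.

```python
# Transistor counts (millions) and die areas (mm^2) copied from the table above.
chips = {
    "Deneb (45 nm)": (758, 258),
    "Thuban (45 nm)": (904, 346),
    "Sandy Bridge quad (32 nm)": (995, 216),
    "Llano quad (32 nm)": (1450, 228),
}


def density(transistors_m, area_mm2):
    """Millions of transistors per square millimeter."""
    return transistors_m / area_mm2


densities = {name: density(*spec) for name, spec in chips.items()}
for name, value in densities.items():
    print(f"{name}: {value:.2f} M transistors/mm^2")

# How much denser is Llano than AMD's 45-nm quad-core?
llano_vs_deneb = densities["Llano quad (32 nm)"] / densities["Deneb (45 nm)"]
print(f"Llano vs. Deneb density ratio: {llano_vs_deneb:.2f}x")
```

By this crude measure, Llano packs a little more than twice as many transistors into each square millimeter as Deneb, which squares with the 100% density increase GlobalFoundries claims for the process.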
Llano itself may be new, but the individual components that make it up are largely familiar. The CPU cores are based on the now-venerable “Stars” microarchitecture used across the current Athlon, Phenom, and Opteron lineups. In Llano, each of those cores has a full megabyte of L2 cache associated with it, double the amount used in Propus (Athlon II) and Deneb/Thuban (Phenom II). That addition may, in part, help offset the loss of the 6MB L3 cache used in the Phenom II. Mike Goddard, Chief Engineer of AMD’s client solutions, said the L3 cache was nixed for two reasons. First, the L3’s performance advantages were limited by the latency it added to memory accesses. Second, and probably most notably, the L3 cache presented a power consumption problem, because it had to stay awake when any one of the CPU cores was awake. The power-performance tradeoff apparently wasn’t worth it.
Goddard claimed Llano’s implementation of the “Stars” core achieves over 6% higher instruction throughput per clock than prior versions due to a number of small refinements. The biggest contributor there may be the larger L2 cache. The algorithm that speculatively pre-fetches data into that cache has been beefed up, too. Llano’s cores have larger reorder and load/store buffers, and the execution resources have been enhanced with the addition of a hardware divider unit. Those are the headliner tweaks, though Goddard hinted a number of more minor changes were included during the port to 32 nanometers, as well. The 6% figure doesn’t sound like much, but it is more than we expected out of probably the last hurrah for this microarchitecture, before Bulldozer takes over later this year.
Sumo wrestling among the Redwoods
Llano’s integrated graphics processor is code-named “Sumo,” which is mildly disturbing because it offers us a glimpse of our code-name-spangled future, in which every portion of a chip has a proper name we can’t remember. Fortunately, Sumo is easy to describe with reference to another code name, Redwood, which is entirely familiar as a discrete graphics processor from the Radeon HD 5000 series, namely the Radeon HD 5670. Sumo shares Redwood’s graphics architecture, with five SIMD engines, a total of 400 shader ALUs, 20 texels per clock of texture filtering capacity, and eight pixels per clock of ROP throughput. It shares Redwood’s feature set, too, including robust DirectX 11 support with hardware tessellation, up to 8X multisampled antialiasing in hardware, and additional AA possibilities in software. (The Sandy Bridge IGP, by contrast, supports only DX10 and 4X multisampled AA.)
Sumo’s one upgrade over Redwood is an updated video processing block, dubbed UVD3, that’s also used in Radeon HD 6000-series discrete GPUs. UVD3 adds support for Blu-ray 3D playback, MPEG4 decode acceleration, and fuller acceleration of MPEG2 video streams to the previous generation’s acceleration of the VC-1, H.264, and MPEG2 formats. AMD points out the MPEG4 support, in particular, is noteworthy because Intel’s Clear Video block doesn’t have it.
Although the Llano IGP has the same array of graphics resources as a Radeon HD 5670, it has to operate under a considerably different set of constraints. The discrete desktop Radeon HD 5670 runs at a very healthy 775MHz, while the fastest mobile variants of Llano’s IGP tick along at 444MHz. (The desktop versions run as fast as 600MHz.) That means the best mobile Llano IGP has theoretical peaks of 3.6 Gpixels/s of fill rate, 8.9 Gtexels/s of texture filtering, and 355 GFLOPS of shader compute power. That’s a little more than half the corresponding rates for a discrete Radeon HD 5670. The more notable constraint, though, is memory bandwidth. Thanks to its GDDR5 memory, a discrete Radeon HD 5670 has 64GB/s of bandwidth all to itself. The Sumo IGP, meanwhile, has to share two channels of DDR3 memory with Llano’s four CPU cores. With dual 1333MHz memory modules, Llano’s shared memory subsystem has only about a third of the 5670’s dedicated bandwidth.
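Those theoretical peaks are just per-clock resources multiplied by clock speed, and the arithmetic is easy to reproduce. In the sketch below, the 20 texels per clock figure is our assumption; it is the value that reproduces the 8.9 Gtexels/s quoted above. We also assume each shader ALU is credited with two floating-point operations per clock (one multiply-add), which is the usual convention for GFLOPS ratings.

```python
# Per-clock resources shared by Sumo and Redwood.
PIXELS_PER_CLOCK = 8          # ROP throughput
TEXELS_PER_CLOCK = 20         # assumption: reproduces the quoted 8.9 Gtexels/s
SHADER_ALUS = 400
FLOPS_PER_ALU_PER_CLOCK = 2   # assumption: one multiply-add counts as two ops


def peak_rates(clock_mhz):
    """Return (Gpixels/s, Gtexels/s, GFLOPS) at the given core clock."""
    ghz = clock_mhz / 1000.0
    return (
        PIXELS_PER_CLOCK * ghz,
        TEXELS_PER_CLOCK * ghz,
        SHADER_ALUS * FLOPS_PER_ALU_PER_CLOCK * ghz,
    )


for label, mhz in [
    ("Llano IGP, mobile", 444),
    ("Llano IGP, desktop", 600),
    ("Radeon HD 5670", 775),
]:
    gpix, gtex, gflops = peak_rates(mhz)
    print(f"{label}: {gpix:.1f} Gpixels/s, {gtex:.1f} Gtexels/s, {gflops:.0f} GFLOPS")

# Because the resources are identical, the ratio of any peak rate is simply
# the ratio of the clocks.
mobile_vs_discrete = 444 / 775
print(f"Mobile IGP vs. HD 5670: {mobile_vs_discrete:.0%} of peak")
```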
Those limitations don’t make Llano’s IGP a poor one. On the contrary, this is surely the best integrated graphics solution we’ve ever seen. Still, when AMD starts talking about how Llano comes with “discrete level” graphics—a phrase we’ve heard often in reference to this product—one must remember that discrete graphics cards come in many forms, down to the $49 Radeon HD 6450, which is pretty anemic. The beefier Radeon HD 5670 easily outpaces the Llano IGP, but it will set you back $77 online. In terms of both graphics power and dollars, the stakes involved here are relatively low.
AMD appears to be acutely aware of how critical memory bandwidth will be to the graphics performance of Llano-derived APUs. The dual-channel DDR3 memory controller will support 1333MHz memory, both in its stock and low-power (1.35V) incarnations, across the entire A-series APU mobile product line. A few variants will support 1600MHz memory, and the desktop versions will push their DIMMs as high as 1866MHz. Capacity will top out at two DIMMs and 32GB in the mobile chips, while the socketed desktop versions will support four DIMMs and up to 64GB. Then again, those are some really big honkin’ DIMMs, as we say in the industry, so the practical limits may be lower for the time being.
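To see why those memory speed grades matter, here is a quick sketch of peak theoretical bandwidth for a dual-channel DDR3 setup. It assumes the “MHz” ratings above are effective transfer rates (millions of transfers per second) and that each channel is 64 bits wide, as is standard for DDR3; real-world throughput will be lower, and the CPU cores take their share.

```python
BYTES_PER_TRANSFER = 8        # one 64-bit DDR3 channel moves 8 bytes per transfer
CHANNELS = 2                  # Llano's dual-channel controller
HD5670_BANDWIDTH_GBS = 64.0   # dedicated GDDR5 bandwidth quoted earlier


def ddr3_bandwidth_gbs(transfer_rate_mts, channels=CHANNELS):
    """Peak theoretical bandwidth in GB/s for the given DDR3 speed grade."""
    return transfer_rate_mts * BYTES_PER_TRANSFER * channels / 1000.0


for speed in (1333, 1600, 1866):
    bandwidth = ddr3_bandwidth_gbs(speed)
    share = bandwidth / HD5670_BANDWIDTH_GBS
    print(f"DDR3-{speed}: {bandwidth:.1f} GB/s ({share:.0%} of a Radeon HD 5670)")
```

Even at 1866MHz, the shared pool offers less than half of what a discrete Radeon HD 5670 gets from its GDDR5.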
Glue for adhesion, not Fusion
The final major components in the Llano die are the four PCI Express controller blocks. Each of them can feed eight lanes of second-generation PCIe connectivity, but one of those blocks of eight is dedicated to driving a pair of digital display outputs. The remaining 24 lanes can flex into various configurations. A common one would use 16 lanes to talk to a discrete GPU, four lanes to talk to the FCH or south bridge chip, and leave four lanes for general-purpose use.
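The lane accounting above can be checked in a few lines. The per-lane figure is the standard PCIe 2.0 rate of 500 MB/s in each direction (5 GT/s with 8b/10b encoding); the configuration names are ours, used purely for illustration.

```python
BLOCKS = 4
LANES_PER_BLOCK = 8
DISPLAY_LANES = 8            # one block drives the digital display outputs
PCIE2_MBS_PER_LANE = 500     # per direction, PCIe 2.0

flexible_lanes = BLOCKS * LANES_PER_BLOCK - DISPLAY_LANES

# The common configuration described in the text.
common_config = {"discrete GPU": 16, "FCH": 4, "general purpose": 4}
lanes_used = sum(common_config.values())

for consumer, lanes in common_config.items():
    gbs = lanes * PCIE2_MBS_PER_LANE / 1000.0
    print(f"{consumer}: x{lanes}, {gbs:.0f} GB/s per direction")
print(f"{lanes_used} of {flexible_lanes} flexible lanes used")
```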
Much of the rest of Llano is glue, finding a way to make all of these disparate components talk to one another and function together properly. This chip doesn’t have any major architectural modifications geared toward efficient integration; unlike Sandy Bridge, there’s no internal communications ring, no shared last-level cache, and no IGP participation in the Turbo mechanism. Instead, Llano’s internal links look much like the external links used before. In place of the Radeon’s dual memory controllers is a connection to Llano’s north bridge. In fact, Goddard said there are actually two links from the IGP into the north bridge, which makes sense historically given that the Redwood GPU has two 64-bit memory interfaces. A separate connection, dubbed the “Fusion compute link,” serves the same purpose as a PCIe interconnect between a CPU and a discrete GPU, allowing the IGP to access system memory coherently—that is, without spoiling the complex dance involving multiple CPU cache levels holding multiple copies of data, potentially in different states. Goddard stated that this communication channel will be important in the future for GPU computing applications, but he admitted the engineering team didn’t plumb Llano’s Fusion Compute Link to be especially high bandwidth. Instead, he expects AMD to invest more in this link going forward—that is, in future APUs.
When asked about the thorny problem of how Llano arbitrates between CPU and IGP requests for memory access, Goddard chose his words carefully. To paraphrase, he noted that fewer CPU-based algorithms require high bandwidth, while GPUs tend to be more tolerant of high latency. Some applications also have isochronous requirements (that is, they need a guaranteed stream of data at a certain rate). The result is a “very complex algorithm.” Goddard admitted the team wasn’t able to do everything it wanted to do on this front. “We think you’ll struggle to find a problem, but there are things we’d like to do differently next time.”
If you’re getting the sense that Llano’s brand of fusion is more like a couple moving into adjacent apartments in the same complex rather than moving in together, you’re on the right track. The plan is to move in together, eventually, but that’s down the road.
With that said, AMD Graphics CTO Eric Demers did note a couple of compute-focused provisions in the IGP that point to a more fully fused future. The first provision, called “pin-in-place,” allows the GPU to reserve a portion of system memory that it can access without traversing any operating system storage buffers, which is a performance enhancement. Discrete GPUs can use this function, as well; the data transfers then happen over a PCI Express link. The second, known as “zero copy,” works in conjunction with pinned memory and lets kernels running on the GPU modify the system’s virtual memory directly, rather than copying the data to graphics memory for modification. For systems where the CPU and IGP share the same physical RAM, the use of zero-copy pinned memory can potentially offer some nice performance benefits. Demers said this capability could be used both for 3D graphics, via an OpenGL extension, and for GPU computing via OpenCL. Then again, both pin-in-place and zero-copy have also been available in Nvidia’s CUDA toolkit since version 2.2, so developers can employ them on ION-based netbooks, too.
Of power gating and ceramic impellers
Although the block diagrams for Llano are a mosaic of known quantities, AMD told us the major focus of its work on this chip was power savings. AMD has long trailed Intel on this front, and buying an AMD-based laptop has generally meant getting a bit of a discount on the overall system at the expense of shorter run times on battery. With Llano, the firm believes it has reached parity in this crucial arena with its much larger competitor.
One key to making it happen is the addition of a new type of logic: a power gate, which shuts off all power to a portion of the chip when tripped, eliminating not just active power but leakage power, as well. Intel has gated power for the individual cores of its processors since Nehalem, but to date, AMD has lacked that capability. No more. All four of Llano’s cores share the same voltage supply, but each core has a power switch associated with it. Whenever one of the cores becomes idle and enters the C6 power state, all power to that core is shut off. Even on what may feel like a busy system to the end user, there could be billions of cycles of unused time on multiple cores, a huge target for power savings.
Additionally, Llano is capable of entering a package-level C6 sleep state when all four cores are idle. In this state, voltage is lowered across the entire CPU rail, saving even more power.
Llano has a second power plane for its entire “uncore”: the IGP, UVD block, the graphics memory controller, and the north bridge. The uncore can operate at varying voltages and multiple, varying frequencies. According to Goddard, the uncore voltage is dynamically determined by a number of different inputs, including the north bridge’s power state, the GPU power state, PCI Express speeds, and the UVD workload. Several uncore elements have power gating, as well. The GPU and its memory controller are separately gated and will be powered down dynamically at idle, while the UVD block can simply be turned off by software when it’s not in use.
Besides saving power when idle, Llano is tuned to make the most of its available power envelope when active, thanks to AMD’s dynamic power scaling tech, known as Turbo Core. As you may know, Turbo Core is an answer to Intel’s Turbo Boost technology. The two are designed around the same basic principle, opportunistically grabbing more clock speed when the thermal headroom is available, yet they operate rather differently.
Intel’s Turbo Boost relies on a network of thermal sensors on the chip to help determine how much it can range up in clock frequency, while AMD’s Turbo Core uses only activity sensors on the die. Given this limited input, AMD must add additional intelligence offline, so it characterizes the power draw of its chips based on activity—Goddard called this a “big pre- and post-silicon exercise.” The firm then sets a Turbo Core policy for each model of CPU based on that research. By its nature, this estimate must be relatively conservative, because it must cover the whole range of chips selected to represent that model.
Turbo Core adds only one more P-state to the CPU’s repertoire, a single higher clock speed step; it then dithers between the two top clock frequencies as the activity-based power estimate will allow. In our Llano test chip, an A8-3500M processor, the difference between the two is rather large. The base clock speed is 1.5GHz, and the Turbo speed is 2.4GHz. There are no intermediate states like, say, 1.8GHz for four lightly-loaded cores. Typically, the Turbo Core policy only lets the chip range above its base clock speed when a portion of the total cores are actively at work. In the Phenom II X6, for instance, Turbo allows up to three active cores to range to higher frequencies. We’re unsure what the policy is for our Llano APU, since we lack a utility that will properly report its clock speeds.
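To make the two-state behavior concrete, here is a toy model of an activity-based Turbo policy. It is emphatically not AMD's algorithm, which hasn't been disclosed; the power coefficient is a made-up number, and real power does not scale linearly with clock speed. The clocks and the 35W budget come from our A8-3500M, and the model simply grants the Turbo P-state whenever the estimated power at the Turbo clock fits the budget, with the GPU's draw counted first.

```python
BASE_MHZ = 1500    # A8-3500M base clock, from the text
TURBO_MHZ = 2400   # A8-3500M Turbo Core clock, from the text
TDP_W = 35.0       # A8-3500M power budget

# Hypothetical coefficient: watts drawn by one fully busy core per GHz.
WATTS_PER_BUSY_CORE_PER_GHZ = 6.0


def estimated_power_w(core_activity, clock_mhz, gpu_power_w):
    """Activity-based estimate: utilization counters only, no thermal sensors."""
    cpu_w = sum(core_activity) * WATTS_PER_BUSY_CORE_PER_GHZ * clock_mhz / 1000.0
    return cpu_w + gpu_power_w


def choose_clock_mhz(core_activity, gpu_power_w, budget_w=TDP_W):
    """Grant the single Turbo P-state only if the estimate fits the budget."""
    if estimated_power_w(core_activity, TURBO_MHZ, gpu_power_w) <= budget_w:
        return TURBO_MHZ
    return BASE_MHZ


# Each sample: (per-core activity from 0.0 to 1.0, estimated GPU power in watts).
samples = [
    ([1.0, 1.0, 0.0, 0.0], 0.0),    # two busy cores, idle GPU
    ([1.0, 1.0, 1.0, 1.0], 0.0),    # all four cores busy
    ([1.0, 1.0, 0.0, 0.0], 10.0),   # two busy cores, GPU eating headroom
    ([1.0, 0.0, 0.0, 0.0], 10.0),   # one busy core, GPU active
]
clocks = [choose_clock_mhz(activity, gpu_w) for activity, gpu_w in samples]
print(clocks)
```

Note that there is nothing between the two clocks: a workload either earns the full 2.4GHz or drops all the way back to 1.5GHz, which is the dithering described above.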
Our sense is that processors equipped with Turbo Core spend substantially less time resident at higher clock frequencies than those with Turbo Boost. Still, Goddard touted several of Turbo Core’s attributes as advantages over Intel’s approach. Among them is the fact that activity is measured digitally and therefore more precisely. Also, Turbo Core behavior is consistent and deterministic across all copies of a certain model of CPU, and performance doesn’t vary with the quality of the thermal solution in use. All of those things sound great on paper, but our sense is that AMD will abandon those principles just as soon as it can produce a chip with a thermal sensor network comparable to Intel’s.
Naturally, AMD’s activity-based power estimates for Llano include both the CPU and GPU cores on the chip. As we’ve already noted, the GPU doesn’t participate in Turbo Core’s clock frequency scaling. GPU activity may, however, eat up thermal headroom that would otherwise be available to the CPU cores; the GPU gets priority in such a case.
Another possibility is that programs causing particularly high power consumption could be run on one or both of Llano’s two major processor types, pushing the chip to exceed its total thermal envelope. In that case, a legacy CPU thermal throttling mechanism will kick in on the CPU side of the fence, reducing the CPU cores to a lower P-state and limiting the chip’s overall power draw and heat production. The IGP will continue to chug along as ever, true to its Redwood roots. AMD’s graphics division did introduce a power-based throttling feature called PowerTune in its Cayman GPU, but that mechanism hasn’t trickled down to its smaller GPUs yet, nor to the Sumo IGP.
The Llano platform has its own code name, as well: “Sabine.” Thanks to Llano’s broad integration of components, the Sabine platform is a two-chip solution that should have a smaller footprint than AMD’s prior efforts in this segment. AMD calls the single support chip in its APU platforms a “Fusion controller hub” or FCH, although the FCH is essentially the same thing as a traditional south bridge.
AMD offers two FCH options for Sabine. The A60M chip, already widely used in the low-cost “Brazos” platform, may see duty in relatively inexpensive Llano systems. We’d expect the A70M to be more prevalent thanks to its support for up to four USB 3.0 ports, a nice-to-have and much-needed feature distinguished by the fact that it’s not natively provided by Intel’s Sandy Bridge platforms. For external hard drives and other devices capable of high-speed transfers, USB 3.0’s roughly 10X theoretical improvement over USB 2.0 can be a godsend.
AMD has built several features into the Sabine platform to make it more attractive. One is a dynamic screen brightness capability known as adaptive backlight modulation (ABM). ABM analyzes the image to be displayed and, when possible, reduces the backlight strength while raising the brightness of the pixels being displayed. The goal is to deliver a similar-looking on-screen image while using less power to drive the LCD, extending battery life.
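The basic idea behind ABM can be sketched in a few lines. This is our own simplified illustration, not AMD's implementation: it treats perceived luminance as pixel value times backlight level, ignores display gamma, and never clips highlights, whereas a real implementation would likely work from a histogram and make smarter tradeoffs.

```python
def adaptive_backlight(pixels, max_level=255):
    """Dim the backlight to the brightest pixel and boost pixels to compensate.

    `pixels` is a non-empty list of luminance values from 0 to max_level.
    Returns (boosted_pixels, backlight_fraction).
    """
    peak = max(pixels)
    if peak == 0:
        # An all-black frame needs no backlight at all.
        return list(pixels), 0.0
    backlight = peak / max_level
    boosted = [round(p * max_level / peak) for p in pixels]
    return boosted, backlight


frame = [32, 64, 128]  # a dim frame: nothing above half brightness
boosted, backlight = adaptive_backlight(frame)
print(f"backlight at {backlight:.0%}, pixels raised to {boosted}")

# What the viewer sees is roughly pixel value times backlight level,
# which should land close to the original frame.
perceived = [value * backlight for value in boosted]
```

For a dim frame like this one, the backlight can run at about half power while the image looks roughly the same.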
Another feature of note is a dynamic switchable graphics facility. Although it has a relatively powerful IGP, as these things go, Llano can be paired with a discrete GPU for higher-performance graphics. As with prior platforms, the system can switch between the integrated and discrete GPUs in order to save power or to deliver better frame rates, either via user direction or by making an automatic swap to the IGP when on battery power. What’s new here is another alternative, a dynamic switching capability that will choose the most suitable GPU based on application profiles. For instance, a session of web surfing in the GPU-accelerated IE9 might use the lower-power Sumo IGP, which is adequate to the task, but firing up a game would cause the system to hand off rendering duties to the discrete GPU.
In our experience, the dynamic switching feature works pretty seamlessly, transitioning between the responsible GPUs with little delay or drama. However, changing from one type of switching mechanism to another—from manual to dynamic or vice-versa—involved some garbled screens and big, hairy delays on our review system. Still, we expect most folks will choose dynamic switching and never look back, especially because the system prompts the user to pick the appropriate GPU for unrecognized applications.
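Conceptually, profile-based switching boils down to a lookup with a fallback prompt. The sketch below is purely illustrative: the profile entries, executable names, and function names are hypothetical, and AMD's driver surely does more than this.

```python
# Hypothetical profile table mapping executables to a preferred GPU.
profiles = {
    "iexplore.exe": "igp",        # light GPU acceleration: the IGP is adequate
    "game.exe": "discrete",       # placeholder name for a 3D game
}


def pick_gpu(executable, profiles, ask_user):
    """Return the GPU for an application, prompting once for unknown apps."""
    key = executable.lower()
    choice = profiles.get(key)
    if choice is None:
        # Unrecognized application: ask the user and remember the answer.
        choice = ask_user(executable)
        profiles[key] = choice
    return choice


def always_discrete(executable):
    """Stand-in for the driver's prompt; a real one would show a dialog."""
    return "discrete"


print(pick_gpu("iexplore.exe", profiles, always_discrete))
print(pick_gpu("NewApp.exe", profiles, always_discrete))
```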
Dynamic switching operates in conjunction with another intriguing feature, the innocuously named Dual Graphics, a mobile version of AMD’s CrossFire multi-GPU teaming technology. AMD says the Llano IGP can cooperate with discrete Radeons in both the 5000- and 6000-series lineups. We’ve seen various attempts at teaming IGPs with discrete GPUs in the past, and they’ve been pretty uneven in terms of long-term support and compatibility. This incarnation comes with a big caveat right out of the starting gate, because it only works with games using DirectX 10 or 11. A great many games still use DX9, so Dual Graphics’ applicability is narrower than we’d like. Still, if the difference in throughput between the IGP and the discrete GPU isn’t too large, GPU teaming potentially makes some sense.
On evenly matched discrete GPUs, the preferred and most common method of divvying up the workload is to assign even-numbered frames to one GPU and odd-numbered ones to the other, a method known as alternate-frame rendering (AFR). On a pair of equally fast GPUs, AFR can achieve nearly twice the frame rate of a single chip. In the case of an IGP + GPU pairing that’s somewhat asymmetrical, performance isn’t likely to scale as well. AMD quotes a figure of “up to 75% additive performance” with Dual Graphics, and that’s a best-case number. To overcome more extreme asymmetry between IGP and GPU, Dual Graphics can use a 2:1 split in frame assignments between the discrete and integrated GPUs. You’re looking at more modest frame rate gains in such a configuration, but as Demers pointed out in a bit of a competitive dig, that’s better than Nvidia’s discrete GPUs, which can’t gain any additional performance by splitting the workload with IGPs from Intel or AMD. Whether Dual Graphics is useful for things other than scoring marketing points is something we’ll have to explore for ourselves.
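A little arithmetic shows why the frame split matters when the two GPUs are mismatched. The model below is an idealized sketch of our own: it assumes each GPU renders its assigned frames at a fixed rate and ignores synchronization overhead, frame pacing, and memory contention, so real-world gains will be smaller. The frame rates used are hypothetical.

```python
from itertools import cycle, islice


def frame_schedule(discrete_share, igp_share, frames):
    """Assign frames in a repeating pattern, e.g. 2:1 between the two GPUs."""
    pattern = ["discrete"] * discrete_share + ["igp"] * igp_share
    return list(islice(cycle(pattern), frames))


def ideal_fps(discrete_fps, igp_fps, discrete_share, igp_share):
    """Best-case combined frame rate for a given split.

    Each pass through the pattern takes as long as the slower GPU needs
    to finish its share of the frames.
    """
    pattern_time = max(discrete_share / discrete_fps, igp_share / igp_fps)
    return (discrete_share + igp_share) / pattern_time


print(frame_schedule(2, 1, 6))

# Hypothetical frame rates: two equal GPUs, then a discrete GPU twice as fast
# as the IGP.
print(f"30 + 30 fps, 1:1 split: {ideal_fps(30, 30, 1, 1):.0f} fps")
print(f"40 + 20 fps, 1:1 split: {ideal_fps(40, 20, 1, 1):.0f} fps")
print(f"40 + 20 fps, 2:1 split: {ideal_fps(40, 20, 2, 1):.0f} fps")
```

In this idealized case, an even split with a half-speed IGP yields no gain at all over the discrete GPU alone, since the faster chip spends half its time waiting. The 2:1 split recovers the IGP's contribution.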
The A-Series APUs and a whole bundle o’ Radeons
AMD is spinning Llano into a host of different models for the laptop market, most of them quad-core parts. Here’s a look at the lineup.
All of these models fit into one of two power bands, either 35W or 45W, both aimed at mainstream laptops. AMD tells us a 25W Llano derivative is a possibility, too, but it hasn’t chosen to introduce one yet. We expect that introduction is much more likely to happen once the native dual-core Llano silicon is shipping.
As if the amount of technology stuffed into Llano weren’t complex enough, AMD’s product segmentation efforts have yielded three different tiers of A-series APUs (A8, A6, and A4), along with a corresponding trio of Radeon brand names for the IGP configurations. Although AMD hasn’t revealed exact prices for these mobile APUs (and isn’t likely to do so, since its customers are big PC builders, not consumers), we do have a sense of the basic product positioning, right in the meat of the laptop market. A4 APUs should go into laptops starting at around $500. A6-based systems should start at $600, and A8 systems at $700. (AMD’s Brazos-based E- and C-series APUs will continue to compete with Intel’s Pentium, Celeron, and Atom processors in systems below $400.) John Taylor, AMD’s Director of Product Marketing, reckons the A4 series will compete with low-end mobile Core i3 processors, while the A6 will straddle the Core i3 and i5 lineups, and the A8 will face higher-end Core i5s and lower-end Core i7s.
We’re sure you’ve fully absorbed all of those APU and IGP model numbers and their related specifications, so we’ll move on to the next step of your education in Llano branding. Not only do the three IGP configurations get their own Radeon model numbers, but adding a second GPU for Dual Graphics gives you two more models to track: the discrete GPU’s, and a new model number that AMD marketing has generated to reflect the combined power of the Llano IGP and the discrete GPU together.
Long-time readers might think I am making this up, but alas, it is not a joke. Here’s a matrix from AMD that explains the whole scheme.
I think that explains it, at least. Say, for example, you have a laptop with an A6 processor and a Radeon HD 6520G integrated GPU, and that laptop also has a discrete Radeon HD 6630M GPU on board. Their combined wonder-twin powers would add up to a “Radeon HD 6680G2” label on the box.
If, like me, you’re going to forget these model numbers in about 15 seconds, it may be useful to remember that an “M” at the end of the model signifies a discrete GPU alone, a “G” indicates an IGP alone, and “G2” denotes a Dual Graphics configuration. If that has you feeling more confident, this will knock you back down a peg. Only Llano systems equipped with dual-channel memory configurations are eligible for Dual Graphics operation and branding. The drop in IGP performance with a single DIMM apparently throws things far enough out of balance for AMD to scuttle the whole deal.
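For those who would rather let a script do the remembering, the suffix rules above fit in a small function. The function names are ours, and the sketch knows only the three suffixes and the dual-channel requirement described in this article.

```python
def graphics_config(model):
    """Classify a Radeon model name by its suffix, per the rules above."""
    name = model.strip().upper()
    if name.endswith("G2"):      # check "G2" before the shorter suffixes
        return "Dual Graphics"
    if name.endswith("G"):
        return "IGP alone"
    if name.endswith("M"):
        return "discrete GPU alone"
    return "unknown"


def dual_graphics_eligible(memory_channels):
    """Only dual-channel memory configurations qualify for Dual Graphics."""
    return memory_channels >= 2


for model in ("Radeon HD 6620G", "Radeon HD 6630M", "Radeon HD 6690G2"):
    print(f"{model}: {graphics_config(model)}")
print("Single DIMM eligible for Dual Graphics?", dual_graphics_eligible(1))
```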
The test systems
Our attempts to place Llano in context involve a couple of very similarly configured laptops—one based on an A8 APU and another based on a Sandy Bridge Core i5—and a host of data from other sources. Our Llano test system is a pre-production Compal whitebook supplied by AMD.
This system is equipped with an A8-3500M APU. That’s one of the more desirable Llano-derived APUs, since it has a 35W TDP, quad cores at 1.5GHz with a 2.4GHz Turbo peak, dual channels of DDR3 at 1333MHz, and the fastest version of the IGP, the Radeon HD 6620G. This laptop is also equipped with a discrete GPU, a Radeon HD 6630M, and is capable of running in a Dual Graphics config. AMD would call this GPU tag team the Radeon HD 6690G2.
For comparison to the Llano review unit, we ordered up the closest analog we could find in stock at Newegg, the HP ProBook 6460b pictured above. Like the Llano system, the ProBook has a 14″ 1366×768 display with a matte coating, a Hitachi 7K500 mobile hard drive, 4GB of RAM, an optical drive, and Windows 7. The processor in the ProBook is a Core i5-2410M, which we believe to be the closest competitor to the A8-3500M APU. Like the A8, the Core i5-2410M has a 35W TDP rating, but the i5-2410M has only two cores to Llano’s four. Thing is, these are much more potent cores, with a base clock of 2.3GHz, a Turbo peak of 2.9GHz, and quad threads thanks to the magic of Hyper-Threading. The i5-2410M also has the full-fledged HD Graphics 3000 edition of the Sandy Bridge IGP.
Although we couldn’t find a system with the exact same battery rating as the Llano test unit, we did get awfully close. The Compal whitebook has a 58 Wh battery, while the HP ProBook’s battery is rated for 55 Wh.
One major place where the ProBook differs from the Llano whitebook is its lack of a discrete GPU. We wanted to focus primarily on Llano and Sandy Bridge, so we didn’t bother with discrete graphics on the Core i5 system. We then disabled the discrete GPU in the BIOS on the Llano system for most of our tests. Both discrete and dual graphics will make an appearance, though, as you’ll see.
Oh, and I suppose this is as good a place as any to talk about the follies that went on behind the scenes as we prepared this article for publication. One of our major show-stoppers was the fact that we didn’t realize until after practically all of our testing was ostensibly complete that the HP ProBook had shipped with a single 4GB DIMM. That configuration robs its Core i5 processor of the bandwidth supplied by a second memory channel, reducing performance at times. We were forced to re-test everything, but since we already had results for the single-channel config, we’ve included those throughout the review, as well. That is, after all, apparently a valid, shipping configuration in pre-built systems.
We also ran into some rather grievous problems with the Compal whitebook’s power consumption. We think the primary problem was simply having used the wrong combination of BIOS settings in an attempt to disable the discrete GPU for battery life tests. Finding the correct setting, using careful observation on a watt meter, nearly doubled the Llano system’s run times in our battery life tests. We believe the scores we’ve finally reported are valid and reflect what you could likely expect from a similarly configured production system.
Our testing methods
With the exception of battery life, all tests were run at least three times, and we reported the median of those runs.
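Reporting the median rather than the mean keeps a single outlier run from skewing a result. A minimal illustration, using made-up scores:

```python
from statistics import mean, median


def reported_score(runs):
    """Return the median of at least three runs, as described above."""
    if len(runs) < 3:
        raise ValueError("need at least three runs")
    return median(runs)


# Hypothetical scores: the second run was disturbed by a background task.
runs = [40.5, 31.0, 41.2]
print(f"median: {reported_score(runs):.1f}, mean: {mean(runs):.1f}")
```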
The test systems were configured like so:
|System||AMD A8-3500M test system||HP ProBook 6460b|
|Processor||AMD A8-3500M APU 1.5GHz||Intel Core i5-2410M 2.3GHz|
|I/O hub||AMD A70M FCH||Intel HM65|
|Memory type||DDR3 SDRAM||DDR3 SDRAM at 667MHz|
|Memory timings||N/A||9-9-9-24 1T|
|Audio||IDT codec||IDT codec with 6.10.6328.0 drivers|
|Graphics||AMD Radeon HD 6620G + AMD Radeon HD 6630M with Catalyst 8.862 RC1 drivers||Intel HD Graphics 3000 with 22.214.171.1241 drivers|
|Hard drive||Hitachi Travelstar 7K500 250GB 7,200 RPM||Hitachi Travelstar 7K500 320GB 7,200 RPM|
|Operating system||Windows 7 Home Premium x64||Windows 7 Professional x64|
|OS Updates||Service Pack 1, DirectX Runtime (June 2010)||Service Pack 1, DirectX Runtime (June 2010)|
As I said, our comparative results for this article came from multiple sources. For our comparisons to desktop systems, you can see our test configurations on this page of our Core i7-990X review. For the configurations of the other mobile systems, see our review of the Asus K53E laptop.
Many of our performance tests are scripted and repeatable, but for some of the games, including Battlefield: Bad Company 2, we used the Fraps utility to record frame rates while playing a 60-second sequence from the game. Although capturing frame rates while playing isn’t precisely repeatable, we tried to make each run as similar as possible to all of the others. We raised our sample size, testing each Fraps sequence five times per video card, in order to counteract any variability. We’ve included second-by-second frame rate results from Fraps for those games, and in that case, you’re seeing the results from a single, representative pass through the test sequence.
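For the curious, here is one way such multi-pass Fraps data could be summarized. This is a sketch under our own assumptions: it treats each pass as a list of per-second frame rates, takes the median of the per-pass averages as the score, and picks the pass closest to that median as the "representative" one. The function name and sample numbers are hypothetical.

```python
from statistics import mean, median


def summarize_fraps_runs(runs):
    """Summarize several passes through one Fraps test sequence.

    `runs` is a list of passes; each pass is a list of per-second FPS values.
    Returns (median of per-pass averages, index of the most typical pass).
    """
    averages = [mean(r) for r in runs]
    typical = median(averages)
    # Pick the pass whose average sits closest to the median average.
    representative = min(range(len(runs)), key=lambda i: abs(averages[i] - typical))
    return typical, representative


# Made-up numbers: five short passes, not data from this review.
passes = [[30, 32, 31], [28, 29, 30], [33, 35, 34], [31, 31, 31], [29, 30, 31]]
score, index = summarize_fraps_runs(passes)
print(score, index)
```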
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Since so much of Llano’s focus involves improving battery life, we might as well get those results on the table. For our first two run-time scenarios, we’ve compared our A8-3500M and Core i5-2410M test laptops against a range of other systems from past reviews. Obviously, our two main systems are most comparable to one another, with similar battery sizes and other specs, as we’ve noted. The first test is our own home-cooked web browsing test, TR Browserbench 1.0, which consists of a static version of the TR home page that cycles through different text content, Flash ads, and images, refreshing every 45 seconds. The next one is our video test, which involves continuous, looped playback of an episode of CSI: New York encoded with H.264 at 480p resolution (taken straight from an HTPC).
We aim to keep display brightness consistent across all of our test systems, where possible. In this case, our common touchstone was an Acer 1810TZ laptop at 50% brightness. Many of the other test systems had glossy display coatings and were at 40-50% brightness, as well. To match that illumination level with our primary A8 and Core i5 test systems with matte display coatings, we had to dial the brightness up to 70% on each. Oh, and we conditioned the batteries on all systems by fully discharging them and then recharging prior to testing.
The HP ProBook results marked “1C” are the single-channel memory configuration. As you can see, using a dual-channel config (similar to the one on the Llano system) reduces run times somewhat. In fact, the direct competition between the dual-channel config of the HP ProBook and the Llano test system looks mighty close to parity. The Core i5-2410M system manages 30 minutes more run time while web surfing, but the A8 lasts longer during the video test, perhaps thanks to its UVD block efficiently offloading H.264 decoding and playback from the CPU cores.
AMD also made some strong claims about Llano’s battery life while playing games, so we decided to test that, as well. We pulled up Battlefield: Bad Company 2 and left it running, full-screen, to see how long each laptop would last. The Llano test system’s discrete GPU was disabled in the BIOS, so we were relying entirely on both processors’ IGPs. Here’s what we found.
AMD wins this one by a mile. Now, perhaps one reason Llano has an advantage here is that its IGP isn’t capable of pushing up to higher clock frequencies when there’s thermal headroom available, while the Core i5’s can. That said, one solution is delivering clearly superior performance to the other in this scenario, and it’s not the one with shorter battery life, as we’ll soon see.
Versus desktop processors
Our first round of performance tests will compare our two mobile systems against a range of desktop processors in many of the components of our CPU test suite. We think these comparisons can be a nice backdrop for our A8-3500M-versus-Core i5-2410M contest, but remember the mobile processors have to work within much smaller power envelopes. We’ve thrown in results for a couple of the higher-end Sandy Bridge mobile CPUs where possible, to provide some additional context.
Stream memory bandwidth
This synthetic test measures the throughput achieved by the CPU and its memory subsystem. Although it’s not a real-world application, it helps us better understand the capacity of the CPUs and platforms being tested. As you can see, although the A8-3500M has dual channels of DDR3 memory at 1333MHz, it’s considerably slower in this bandwidth test than the dual-channel Core i5-2410M config.
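For context, the theoretical ceiling for these memory configurations can be worked out with simple arithmetic. The sketch below assumes "1333MHz" refers to the effective transfer rate (1333 million transfers per second) and that each DDR3 channel is 64 bits wide, which is the standard arrangement. Measured Stream throughput always lands below this peak.

```python
def peak_bandwidth_gbs(mega_transfers_per_sec, channels, bus_width_bits=64):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


print(round(peak_bandwidth_gbs(1333, 2), 1))  # dual-channel: 21.3
print(round(peak_bandwidth_gbs(1333, 1), 1))  # single-channel: 10.7
```

On paper, then, a dual-channel DDR3-1333 setup has the same peak regardless of which CPU it feeds, so differences in this test come down to how well each processor can use that bandwidth.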
Lest you think we have a problem with the Llano system, look closer at the rest of the results. The Phenom II X4 840 is a quad-core processor with an architecture very similar to Llano’s. The X4 840 achieves a little bit more throughput—but then its core clock speed is 3.2GHz, over double that of the A8-3500M’s base frequency of 1.5GHz. This test uses all four CPU cores, so Turbo Core can’t really help, either. Even with its enhanced pre-fetching algorithm, Llano can’t overcome the fact that it’s based on a relatively older core at a rather low clock speed.
Ouch. In our first real benchmark, the A8-3500M finishes well behind ye olde Core 2 Duo E6400 (that’s a 65-nm Conroe, folks) and uncomfortably close to the Pentium 4-derived Extreme Edition 840. Then again, those are desktop processors subject to fewer thermal constraints. At least two of the A8’s quad cores are essentially no help here—check the other results, and it’s clear this test isn’t widely multithreaded—and for whatever reason, Turbo Core doesn’t seem to be doing much, either.
7-Zip file compression and decompression
This is a nicely multithreaded application, and the A8-3500M delivers a more respectable showing. Still, Llano’s four cores at low frequencies can’t match the Core i5-2410M’s two higher-speed, higher-IPC cores.
TrueCrypt disk encryption
Yeowch! Close contest. This one would have been a clean kill for Intel had it not disabled the hardware AES acceleration in the Core i5-2410M for the sake of product segmentation. Because it did, both of our contenders are in the same boat: unable to achieve throughput to match the speed of a SATA 3Gbps disk interface.
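To see what "matching the speed of a SATA 3Gbps disk interface" means in practice, here is a small sketch of the arithmetic. It assumes the interface's standard 8b/10b line encoding, in which every data byte occupies 10 bits on the wire. The function names are ours.

```python
def sata_payload_mb_per_s(line_rate_gbps, bits_on_wire_per_byte=10):
    """Maximum payload rate in MB/s for a given SATA line rate."""
    return line_rate_gbps * 1e9 / bits_on_wire_per_byte / 1e6


def keeps_up_with_interface(encrypt_mb_per_s, line_rate_gbps=3):
    """True if encryption throughput meets or exceeds the interface ceiling."""
    return encrypt_mb_per_s >= sata_payload_mb_per_s(line_rate_gbps)


print(sata_payload_mb_per_s(3))  # 300.0
```

An encryption routine would need to sustain about 300 MB/s to avoid becoming the bottleneck on such a link.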
The Panorama Factory photo stitching
Here’s yet another result where a nicely threaded test runs faster on dual Sandy Bridge cores than on quad Llano cores. This is a result you can feel, too. When stitching together a panorama, you’ll be drumming your fingers for 19 seconds longer with the A8-3500M.
x264 HD video encoding
Windows Live Movie Maker 14 video encoding
Depending on the program you’re using and the stage of the encoding process in question, the A8-3500M has the potential to be somewhat competitive with the Core i5-2410M, but the A8 is clearly slower overall.
Valve VRAD map compilation
In both of our rendering tests, Cinebench and VRAD, the A8-3500M with four cores at 1.5GHz is slower than the desktop Phenom II X2 565, which has two similar cores at 3.4GHz. That makes intuitive sense, I suppose, but it tempts one to wonder whether AMD’s decision to give Llano four slower cores instead of two faster ones was really the right choice. The thing is, slower clock speeds generally mean lower voltages and thus much lower power consumption, so AMD’s choice was probably an easy one to make, especially for the mobile market.
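The reasoning behind that tradeoff can be sketched with the usual first-order model of dynamic power in CMOS logic, where α is switching activity, C is switched capacitance, V is supply voltage, and f is clock frequency. This is a textbook approximation, not a figure from AMD:

$$
P_{\text{dyn}} \approx \alpha \, C \, V^{2} f
\qquad\Longrightarrow\qquad
\frac{P_2}{P_1} = \left(\frac{V_2}{V_1}\right)^{2} \frac{f_2}{f_1}
$$

Using the clock speeds above, f₂/f₁ = 1.5/3.4 ≈ 0.44. If the lower clock allowed the voltage to drop to, say, 80% of the original (a purely hypothetical figure), each core would draw about 0.64 × 0.44 ≈ 0.28 times the dynamic power, so four slow cores would come in at roughly 2 × 0.28 ≈ 0.56 times the power of two fast ones.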
In case you were looking for evidence that Turbo Core actually does something, look no further than the Cinebench results. CPU performance tends to scale nicely when going from a single thread to multiples in an easily parallelizable task like rendering. For instance, the Phenom II X4 840 is almost four times as quick in the multithreaded test as it is in the single-threaded one. The A8-3500M, though, is quite a bit faster in the single-threaded test than one might have guessed by looking at its multithreaded results. Looks like Turbo Core is offering a bit of a frequency boost when only one core is busy.
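That inference can be expressed as a quick calculation. The sketch below assumes a task that would otherwise scale perfectly across cores, so any shortfall from the ideal multiplier is attributed to a single-threaded clock boost. The scores are placeholders, not our Cinebench results, and the function names are hypothetical.

```python
def scaling_factor(single_score, multi_score):
    """How many times faster the multithreaded run is than the single-threaded one."""
    return multi_score / single_score


def implied_single_thread_boost(single_score, multi_score, cores):
    """Clock boost implied for one busy core, assuming ideal scaling otherwise."""
    return cores / scaling_factor(single_score, multi_score)


# Placeholder scores: a quad-core that is only 3.2x faster with all cores busy.
print(scaling_factor(1.0, 3.2))
print(implied_single_thread_boost(1.0, 3.2, cores=4))
```

In this made-up case, the single-threaded run behaves as if the lone busy core were clocked about 25% higher than it is when all four are loaded.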
Source engine particle simulation
This test is intriguing for a couple of reasons. One, because it’s a fully multithreaded particle simulation ripped from a game engine where Llano is dramatically slower than the Sandy Bridge competition. Llano promises big things for mobile gaming thanks to its Radeon IGP, but it is possible those low-frequency CPU cores will hamper its gaming performance somewhat. Two, particle simulations in games are nicely parallel by nature, and consequently, I believe many games now use the GPU to handle such work. If so, Llano’s more robust IGP may be just what the doctor ordered. So point, counterpoint.
Versus mobile processors
Now we’ll consider the A8-3500M against a range of recent mobile CPUs, including AMD’s own Brazos-based APUs.
This is a different version of SunSpider running on a different browser than we saw a couple of pages back, so don’t be surprised by the much larger numbers.
When removed from the company of desktop processors and placed exclusively alongside other mobile CPUs—many of which, in this case, are relatively lightweight, low-power affairs—the A8-3500M’s CPU performance doesn’t look nearly as dire. In fact, the A8 keeps up pretty well with the Arrandale-based Core i3-370M and Core i5-450M, both of them dual-core, 32-nm CPUs. Only the Sandy Bridge-based processors, with their much more efficient CPU microarchitecture, really distance themselves from the Llano-derived APU. Meanwhile, the A8-3500M clearly outclasses the lower half of the processors tested, including the Turion II Neo and Pentium CULV chips.
GPU texture filtering quality
We’ve looked at Llano’s CPU performance in a couple of different contexts. Before we consider graphics performance, we need to understand another set of issues, though. You see, CPUs are all required to do the same work and must produce the correct answer every time. GPUs don’t work that way. Instead, they are constantly looking to fool the human eye, to produce enough frames to maintain a steady illusion of motion and keep things looking good—good, but not necessarily perfect, because there aren’t clearly defined standards for exactly how a rendered frame ought to look. Yes, Microsoft has ratcheted up some of the rules for DirectX graphics over time, and the top two GPU makers, AMD and Nvidia, have reached a sort of equilibrium on image quality in certain respects. However, not everybody plays by the same rules. As we’ve found out, Intel happens to play by its own rules, which are considerably more lax.
Yep, I’m gonna bust out the atomic flowers now. Have a look.
The images you see above are the output from the Direct3D AF Tester, a little tool we use every time a new GPU architecture debuts in order to understand how its texture filtering hardware works. We’ve set this tool to use the highest possible filtering quality it can request, 16X anisotropic filtering with trilinear blending, and captured the result for posterity.
If you’re familiar with the output of this test, you’re probably shooting your Coke out of your left nostril all over your screen while pointing at the Intel result. If not, allow me to explain.
In the images above, you’re peering down a 3D-rendered cylinder or tube, and the inside surface of that tube has been covered with a simple texture map. The colored bands are what are known as mip maps, or increasingly lower resolution copies of the base texture mapped to the walls of the cylinder. The further you move from the camera, the lower the resolution of the mip level used. In the pictures above, the different colors show different mip levels. (Of course, mip maps don’t normally come in different colors. They look very much like one another and like the base texture. This test app colors them in order to make them easily visible.) Mip maps are a helpful tool in texture filtering because sampling from a single copy of the original, high-res texture can be work-intensive and, in a constrained grid of pixels, can produce excessive high-frequency noise, which is visually disruptive. In other words, a little bit of blurring and blending in the right places can be beneficial to the final result.
Alongside mip mapping, we’re layering on a couple of additional techniques to improve image quality. We’re using trilinear filtering to blend between mip levels, so that we don’t see abrupt transitions or banding. That’s why the different colors transition gradually from one to another. We’re also using anisotropic filtering, grabbing more samples for textures that exist at certain angles on the Z or depth axis—typically on surfaces stretching away from the camera, like floors, walls, and ceilings—in order to preserve sharpness that simple mip mapping would destroy. All of these things we take for granted in modern GPUs, which have custom hardware onboard to perform these functions.
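For readers who think better in code, here is a minimal sketch of mip level selection with trilinear blending. It is a simplified model, not any GPU's actual hardware algorithm: `sample_level` is a hypothetical callable that returns an already bilinear-filtered value from a given mip level, and anisotropic filtering is left out entirely.

```python
import math


def mip_lod(texels_per_pixel, max_level):
    """Fractional mip level: 0 is the base texture, each level halves resolution."""
    lod = math.log2(max(texels_per_pixel, 1.0))
    return min(lod, float(max_level))


def trilinear_blend(sample_level, texels_per_pixel, max_level):
    """Blend the two nearest mip levels by the fractional part of the LOD."""
    lod = mip_lod(texels_per_pixel, max_level)
    lower = int(math.floor(lod))
    upper = min(lower + 1, max_level)  # stay within the mip chain
    frac = lod - lower
    return (1.0 - frac) * sample_level(lower) + frac * sample_level(upper)


# Like the tester's colored mips: each level returns a distinct value.
colored = lambda level: float(level)
print(trilinear_blend(colored, 2.0 ** 1.5, max_level=8))  # halfway between levels 1 and 2
```

The smooth color gradients in the tester output correspond to the `frac` weight sliding from 0 to 1 between adjacent levels.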
Trouble is, doing all of the sampling and blending required to combine these techniques, especially with anisotropic filtering, is really hard work. In the bad old days, GPU makers used a particular shortcut in order to avoid some of that work, skimping on the amount of sampling for surfaces that aren’t parallel to one of the screen’s edges by reducing the level of mip map detail used. This optimization would generally leave floors and walls looking crisp, since they’re parallel to a screen edge, but other surfaces would lose detail.
Our tunnel test with colored mip maps is designed to show which surfaces are getting more or less detail. The Radeon IGP’s results are very close to ideal: nearly a perfect circle, indicating all surfaces receive the same filtering treatment regardless of their angle of inclination. That circle is also relatively small and tight, indicating that the Radeon isn’t transitioning to smaller textures (and thus reducing texture detail) until it becomes beneficial to do so for the sake of filtering quality.
The Intel IGP’s results, on the other hand, are pretty horrific. The colors flare out to the four corners of the screen in an indication that mip map detail is reduced dramatically on surfaces at a 45° angle from the screen edges. Floors and walls might look OK, just so long as you don’t lean and cause the camera to tilt. Do that, and texture sharpness will drop dramatically.
|Radeon X1950 XTX||Radeon HD 2900 XT||GeForce 8800 GTX|
We’d have to reach pretty far back in time in order to find a pattern this poor on a GPU from Nvidia or AMD, prior to the DirectX 10 generation at least. Even the Radeon X1950 XTX’s filtering hardware was crafty enough to avoid skimping at 45° angles of inclination, instead aiming for in-between angles like 22.5° and 67.5°.
The Intel IGP’s pattern also shows some funny, irregular lines in it, indicating a fair amount of error in its level-of-detail calculations at certain angles. In motion, that’s likely to produce some additional texture crawling or visual noise.
Above is one quick example of how the Intel IGP’s reduced level of detail affects in-game image quality. This is a shot from Portal 2, and I’ve magnified the center of the screen to twice its original size in each dimension. In the top shot, from the Radeon, the gray walls of the tunnel are covered with a detailed, stone-and-concrete texture. Below, in the Intel IGP image, the walls are a blurry, muddy mess of gray. Granted, my example isn’t the greatest. The impact of Intel’s reduced quality would be more obvious on a texture with higher color contrast or with a tight, repeating pattern. The effect is also more noticeable in motion, when you’re looking up this shaft, you spin a bit, and the detail disappears. Still, the difference here is fairly dramatic, if you squint long enough to see it.
The point of this little exercise is simply this: as we evaluate GPU performance, we should keep in mind that the Intel IGP is doing less work—less sampling and blending—and thus producing a lower-quality result than the A8’s integrated Radeon. The bar charts full of benchmark results won’t show you that, but it’s real, and it’s an important component of overall GPU quality. Also, I should note that texture filtering is just one aspect of graphics image quality. Although we haven’t had time to explore it fully, it appears the Intel IGP lacks internal mathematical precision in other respects. Using it, we regularly noticed obvious “screen door” dithering artifacts that weren’t visible on the Radeon IGP.
Synthetic GPU performance
We’ll kick off our IGP performance tests with a quick look at some synthetic benchmarks. The test above is intended to measure texture filtering performance. For the reasons we discussed on the last page, the results tell an interesting story. At lower filtering quality levels, the A8’s IGP has a big lead over the Intel IGP. As the filtering quality level rises, the performance of the two solutions converges—no doubt because the Radeon IGP is doing increasingly more sampling and blending work than the Intel HD 3000.
Notably, the discrete Radeon is measurably faster than either of the integrated solutions, thanks to higher GPU throughput rates and higher memory bandwidth.
These next two tests are meant to gauge GPU shader arithmetic. We used the defaults for ShaderToyMark, since it’s all about stressing shader performance. We configured Unigine Heaven to use DirectX 10—common ground for these GPUs, since the Intel IGP can’t handle DX11—and set the shader quality level to “high” while the texture and filtering settings were at “low.” The screen resolution was set to 1366×768, native for both of the test systems.
In both of these shader-oriented tests, the A8’s integrated Radeon HD 6620G is about a third faster than the Intel HD 3000. The discrete Radeon HD 6630M is again faster than either, and Unigine gives us our first glimpse of Dual Graphics in action. In this case, going dual offers a noteworthy increase in frame rates.
Civ V offers a range of opportunities for performance testing, and we’ll avail ourselves of all of them. First up is a texture compression test that uses DirectCompute shaders. This is one of those Fusiony-type applications that play to Llano’s strengths. It also wouldn’t run on the Intel HD 3000 IGP.
The next test populates the screen with a large number of units and animates them all in parallel. Civ V will run this benchmark without updating the screen, to test raw CPU performance, and with a full slate of graphics updates, to measure the performance of the total solution.
If you remove the graphics portion of the workload from the equation, the Core i5-2410M is clearly the faster CPU. Once you involve the IGPs, the Core i5’s performance drops to a quarter of its prior level, and the A8 is faster overall.
Civ V offers two further in-game benchmarks, and we tested them at the settings you can see below. The first of those tests comes from the scenes in the game where you see your leader character, richly rendered with lots of shader effects. These scenes aren’t really part of the core gameplay, but they do make for a decent test of pixel shader performance.
Finally, there’s the in-game benchmark proper, which involves a late-game scenario where the screen is richly populated.
Again, if we don’t ask the GPUs to render the scene, then the Core i5 runs the game’s core simulation quicker than the A8-3500M. The 3500M does manage to produce updates at a clip of nearly 60 FPS, though, so it’s certainly adequate to the task. When running the entire game with the CPU and GPU working together, the A8-3500M simply trounces the Core i5-2410M.
You may have noticed that the Dual Graphics solution hasn’t fared well in Civ V. All we can tell you is that those results are correct, and AMD tells us there are some teething problems with Dual Graphics in the BIOS of our pre-release Compal review laptop. We’ll have to test a production system and see whether its Dual Graphics implementation is more solid.
Since Portal 2 isn’t too hard on the GPU, we tested it with its max quality settings at different levels of edge antialiasing. As you can see, the Llano IGP is faster with 8X AA than the Sandy Bridge IGP is without any antialiasing.
Battlefield: Bad Company 2
Here’s one of those instances where the performance gap between the AMD and Intel IGPs means playable frame rates on one but not the other. We’ve seen similar gaps in Civ V and Portal 2, as well. It is possible to dial back the quality settings in at least two of those games (though not Civ V) and squeeze better frame rates out of the Intel IGP, but the Llano IGP removes any doubts.
The A8-3500M proves to be about twice as fast as the Core i5-2410M in DiRT 3’s DirectX 9 mode. You can get a little more eye candy out of the Llano IGP by using DX11 mode, but then we couldn’t compare directly to the Intel HD 3000, obviously. We had hoped to allow Dual Graphics to strut its stuff for us in DX11 mode, but unfortunately, we ran into major screen corruption problems. Again, AMD pointed to a BIOS-level issue with our test system and Dual Graphics, so we weren’t able to work around it.
Borderlands is, uh, borderline on both of these IGPs. As you can see, we didn’t have much room left to reduce image quality in order to gain higher frame rates. The answer might be dropping to a lower, non-native display resolution. We expect at least the Llano IGP to run the game well with that concession.
USB transfer rates
We have one last round of tests for those who aren’t feeling completely inundated by now: a quick look at USB transfer rates via HDTach. The A70M FCH chip’s native support for USB 3 is a potential platform-level advantage for AMD. To see whether it delivers, we docked an Intel X25-E 64GB SSD (yes, that’s an enterprise-class SLC drive, and a very fast one) into a Thermaltake BlacX 5G.
I think it’s safe to say USB 3.0 could be a major selling point for Llano-based laptops. Many folks with laptops this powerful will want to make use of external storage, and the A70M’s USB 3.0 ports achieve real-world transfer rates four to five times as fast as USB 2.0. CPU utilization is reasonable, too, considering how much more data is snaking through that pipe. Of course, it’s possible for Sandy Bridge laptops to add competent USB 3.0 support via a peripheral chip, but it will come at the expense of a little added power draw, motherboard real estate, and cost, and it probably won’t be as widely offered. Our HP ProBook test system, for instance, lacks USB 3.0.
We’ve had a sense of what Llano might be for many months now, and the conventional wisdom has been fairly well established. Everyone expected a CPU that couldn’t keep pace with Sandy Bridge and an integrated graphics processor that would surely outdo Intel’s HD 3000. The key issues, then, would be about the tradeoffs involved and about which you value more, CPU power or GPU power.
To me, the question with CPUs these days often seems to be, “Are they fast enough?” For many parts of our daily computer use, CPUs really do seem to be sufficiently quick that performance no longer feels scarce. Quite a few laptop buyers have been willing to compromise on CPU power for the sake of portability, with Atom-based netbooks being the extreme (and rather popular) example of that compromise.
Meanwhile, the question about IGPs is similar, but inverted. “Are they fast enough?” That is, is any integrated graphics solution fast enough to matter, or should we simply recommend a discrete GPU to the would-be mobile gamer? What is the value of a superior IGP if it’s still too slow to make any sort of difference in regular use?
Both of these questions vexed me before I got my hands on a Llano-based system, because I lacked the context to answer them. Now that we’ve considered our test results, I have a better sense of the landscape. You are, of course, free to make what you will of our data, but this is my answer.
The A8-3500M’s four CPU cores are routinely and consistently slower than the Core i5-2410M’s two cores, sometimes by margins that are borderline embarrassing. Intel has opened up a monstrous lead on this front, and the few architectural enhancements in Llano’s cores aren’t sufficient to narrow the gap appreciably. However, you’ll need four threads to take best advantage of the Core i5-2410M, and with four threads in play, the A8-3500M doesn’t look too bad. The A8’s performance is comparable to Intel’s older Arrandale-based Core i5-450M, a generation back architecturally from Sandy Bridge. That puts the A8 solidly in the same class as other big-boy mobile CPUs, a clear cut above budget ultraportable chips like Intel’s Consumer Ultra Low Voltage models and such.
Meanwhile, AMD has Intel utterly outclassed in integrated graphics. You’ve seen our discussion of texture filtering quality and our performance results. Where the Llano IGP delivers playable frame rates in some of the latest games, Intel’s HD 3000 treads on the edge of uselessness. Add in a host of other considerations, including the vastly superior hardware feature set of the Radeon IGP and its ability to partake of the sweet, sweet stream of Catalyst driver updates, and this is a difference between the products that truly matters. Now, that’s probably more true in the consumer market than the corporate one, but any buyer who thinks he might want to play games or make use of 3D graphics in any capacity should strongly consider choosing a Llano-based laptop.
That choice is made considerably easier since AMD appears to have achieved rough parity with Intel in terms of battery life—maybe the biggest surprise of the day, and it’s a pleasant one. If the production laptops turn out right, we may no longer need to advise our friends and readers that buying an AMD-based laptop means receiving an iffy discount in exchange for accepting shorter run times. We’ve never liked that tradeoff.
The deciding factor, of course, is the quality of the production laptops. If the big PC makers can translate what we’ve seen of the A-Series APUs into systems that are as sleek, cool-running, and endurance-endowed as their Sandy Bridge counterparts, AMD should have a hit on its hands.