One of the big stories in PC processors over the past few years has been AMD’s struggles to match the performance of Intel’s high-end desktop CPUs. The much-anticipated “Bulldozer” microarchitecture landed with a thud, unable to mount a serious challenge to the dominance of Intel’s Core i5 and i7 offerings. Meanwhile, Intel continues to crank out major improvements to these products at a pretty regular clip, as it did with the introduction of the 22-nm Ivy Bridge chips last month.
However, there is another, even bigger story unfolding in PC processors at the same time, and AMD plays a more intriguing role in it. As you may know, CPUs have swallowed up a whole host of other system components in the past few generations from the memory controller to I/O and graphics. The reasons for this trend are several. Integration can sometimes deliver higher performance—bringing the memory controller onboard provided a nice boost, for instance—but it can also cut costs, reduce the physical size of the platform, and improve power efficiency. With the rise of mobile computing, Intel and AMD have pushed ever closer to the ideal of a single-chip PC solution.
In that context, last year’s introduction of the A-series processors, based on the chip code-named Llano, was a big stride forward for AMD. Llano achieved several important milestones at once. For one thing, it essentially matched Intel’s competing products for battery run time; parity on this front had long eluded AMD. For another, Llano was the first AMD processor to incorporate the Radeon graphics technology the firm had acquired years before by purchasing ATI. As you might expect, Llano’s graphics capabilities gave it instant credibility and a clear leg up on Intel’s anemic integrated graphics processor (IGP). We liked the mobile Llano variant well enough to consider it a viable alternative to Intel’s dual-core Sandy Bridge processors—perhaps even a superior choice for most folks, given the gap in graphics capabilities.
Llano had its limitations, though. The supply of 32-nm chips from AMD manufacturing partner GlobalFoundries was spotty for quite a while. The chip didn’t translate well to desktop-class power envelopes, in our view. And it looked very much like a first-generation effort in a lot of ways. AMD talked endlessly about CPU-GPU “fusion” and dubbed the A-series products “APUs,” for “accelerated processing units,” but Llano stopped well short of making the IGP into a true co-processor.
Today, AMD is ready to take the next step with the introduction of the second-generation APU known as Trinity. Nearly everything about it is new, from the CPU cores to the IGP and the various bits of glue that hold everything together. The integration in Trinity is more mature, with more benefits and fewer visible seams between the processor’s various components. Thus, although Trinity is manufactured using the same 32-nm SOI fabrication process as Llano, AMD claims Trinity doubles its predecessor’s power-performance ratio. That claim takes several forms; most prominently, there is a 17W version of Trinity that purportedly performs like a 35W Llano variant. If true, AMD ought to have a very nice offering to slide into the ultra-thin laptops that are all the rage these days.
The annotated image above points out Trinity’s main components. The CPU portion of the chip includes four integer cores and two FPUs based on the “Bulldozer” microarchitecture. In fact, Trinity is the first chip to incorporate AMD’s “Piledriver” architectural updates. More on those shortly. Also updated from Llano is Trinity’s IGP, which is derived from the “Northern Islands” generation of Radeons. The memory controller remains a dual-channel affair, capable of supporting DIMMs up to 1866 MT/s, though 1600 MT/s is the top speed for mobile parts. Trinity’s media processing block still decodes a host of video formats but has learned a new trick: hardware-accelerated H.264 encoding. And for communication with the outside world, the chip has 24 lanes of PCI Express Gen2 connectivity. Gone is the HyperTransport link used in AMD processors for ages; this chip talks to its Fusion Controller Hub I/O support chip via dedicated PCIe lanes, instead.
|Code name||Key products||Cores||Threads||Last-level cache size||Process node (nm)||Estimated transistors (millions)||Die area (mm²)|
|Sandy Bridge||Core i3, i5||2||4||4 MB||32||624||149|
|Sandy Bridge||Core i5, i7||4||8||8 MB||32||995||216|
|Ivy Bridge||Core i5, i7||4||8||8 MB||22||1400||160|
|Llano||A4||2||2||1 MB x 2||32||758||–|
|Llano||A8, A6, A4||4||4||1 MB x 4||32||1450||228|
|Trinity||A10, A8, A6||4||4||2 MB x 2||32||1303||246|
Trinity isn’t an especially large chip, as these things go, but it is a little larger than Llano—despite a lower transistor count estimate—and it’s quite a bit larger than the quad-core versions of Sandy and Ivy Bridge. Then again, the 22-nm Ivy Bridge quad is positively tiny.
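One rough way to put those size comparisons in perspective is transistor density. The sketch below is ours, not AMD's or Intel's math, and it assumes the last two columns of the table above are estimated transistor counts (in millions) and die area (in mm²). The dictionary keys and the `density` helper are illustrative names only.

```python
# Estimated transistors (millions) and die area (mm^2), copied from the table above.
# Assumption: the table's last two columns are transistor count and die area.
chips = {
    "Sandy Bridge quad": (995, 216),
    "Ivy Bridge quad": (1400, 160),
    "Llano quad": (1450, 228),
    "Trinity": (1303, 246),
}


def density(transistors_m, area_mm2):
    """Millions of transistors per square millimeter."""
    return transistors_m / area_mm2


densities = {name: density(*spec) for name, spec in chips.items()}

# Print the chips from densest to least dense.
for name, value in sorted(densities.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {value:5.2f} M transistors/mm^2")
```

The transistor counts are estimates, so treat the resulting ranking as a rough guide rather than a precise measurement.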
Piledriver: somewhat heavier equipment
Trinity’s use of the Bulldozer CPU architecture gives it a host of features that Llano lacked, including AES encryption acceleration and AVX instructions for wider floating-point vector processing. Bulldozer’s basic layout also makes Trinity a very different beast than Llano. This architecture’s fundamental building block is a compute “module” that can process two threads simultaneously. Although AMD claims the module has two distinct integer cores, those cores share some key resources, including the instruction fetch and decode units, an L2 cache, and a floating-point math unit (FPU). The shared structures have been upgraded substantially from prior AMD CPUs, to better service two integer cores at once. Trinity has two of these compute modules, giving it four threads, four integer “cores,” and two FPUs. Each of those modules has 2MB of L2 cache. By contrast, Llano has four distinct cores, each with its own FPU and 1MB of L2 cache, with no sharing. (One similarity between Llano and Trinity is the omission of an L3 cache. AMD deemed the L3 a power efficiency liability in Llano, and it appears to have held to that conviction with Trinity.)
To date, Bulldozer’s performance hasn’t fulfilled the expectations created by its extended feature set. The desktop FX-8150 processor is barely quicker than the older Phenom II X6 in most cases, for instance, and its per-clock performance is actually lower than the prior-gen processor’s. Some of that is by design; Bulldozer is intended to run at higher clock frequencies, and it gives up some per-clock performance in order to do so. Still, the revised “Piledriver” CPU cores in Trinity have been tweaked for higher instruction throughput in each clock cycle.
Although some folks probably expected a quick fix for the Bulldozer architecture that would yield some sizeable performance gains, that doesn’t appear to be what’s happened. Instead, Piledriver incorporates a fairly broad range of improvements, none of which contributes much more than 1% to overall per-clock instruction throughput. (I believe the cumulative total is somewhere around a 6% IPC improvement, generally, but my notes are fuzzy on that one.)
One of the most notable changes in Piledriver is support for a couple of new instructions. The addition of a three-component fused multiply-add instruction, FMA3, brings AMD in line with Intel’s plans for its upcoming Haswell chip. That should clear up any confusion about this workhorse of the AVX extensions. (Support for Bulldozer’s FMA4 instruction remains.) Furthermore, Piledriver allows quick conversions between 16- and 32-bit floating-point data formats via the F16C instruction, which debuted in the Intel camp on Ivy Bridge.
Among the other tweaks to improve instruction throughput, the highest-impact change is probably the doubling in size of the L1 data cache’s translation lookaside buffer. The TLB is a sort of cache index, and a larger TLB makes the cache faster and more efficient. Beyond that, nearly every part of the chip has been massaged, save for the execution units. The branch predictor is more accurate, thanks to an innovation borrowed from the Bobcat core. The integer and FP schedulers are more aggressive about retiring instructions, making them effectively larger without a structure size increase. And the hardware prefetcher can better predictively populate the L2 cache, in part because it has been tuned for client-style workloads (whereas Bulldozer is tuned for servers).
As sweeping as the changes may look on paper, they are apparently rather modest in their cumulative effect. However, performance boosts can come from other sources, and Piledriver has been optimized to achieve higher clock frequencies at lower power levels. AMD tells us Piledriver responds much better than Llano’s cores to changes in voltage, allowing wider latitude for clock frequencies and finer-grained control over those speeds. For a mobile-focused CPU (err, APU) like Trinity, such things tend to be especially helpful.
A new IGP based on, uh, proven technology
Trinity’s integrated graphics are a generation beyond Llano’s and are, in terms of basic capabilities, pretty well up to date. They’re also based on an older generation of discrete graphics chips, “Northern Islands,” most familiar from the Radeon HD 6900 series of video cards. AMD’s current GCN architecture didn’t make the cut.
There’s your requisite block diagram of the graphics portion of the chip. If you have really good glasses, you could count all of the units yourself. Trinity’s IGP has six SIMD engines and sports a total of 384 shader ALUs. Each SIMD engine has a texture unit capable of filtering four texels per clock, so the IGP totals 24 texels per cycle. The two render back-ends can blend eight pixels per clock.
None of those numbers are particularly breathtaking. Llano’s IGP has five SIMD engines, 400 ALUs, 20 texels per clock of filtering throughput, and dual render back-ends. Still, Trinity’s IGP should make better use of its resources. Trinity’s IGP trades up to a VLIW4 shader execution unit that is more area efficient. Llano’s VLIW5 design has a fifth “fat” ALU for certain types of functions, and the other four ALUs have a subset of its abilities. The Northern Islands shader core eliminates that fifth ALU and grants full and equal functionality to the other four units. This new arrangement seems to work well aboard the Radeon HD 6900 series. Northern Islands also brings some improvements in tessellation performance, thanks to improved buffering intended to manage the difficult data flow issue created by geometry expansion.
Importantly for AMD’s plans, the Northern Islands graphics core is better suited for non-graphics computing, too. The VLIW4 shaders should map well to a broader range of data sets, and this core adds the ability to execute multiple, independent kernels (or programs, essentially) at once, each with its own command queue and address domain.
None of those enhancements is likely to provide as much uplift versus Llano as one other change: higher IGP clock speeds. The fastest mobile Llano IGP runs at 444MHz, but Trinity’s IGP operates at frequencies as high as 686MHz. When combined with the architectural enhancements and the slight bump from five SIMDs to six, the higher clock speed should make Trinity’s IGP a considerable upgrade from Llano’s. Texture filtering capacity is nearly doubled, and other key rates are up by 40-50%, with the notable exception of memory bandwidth, which depends on the DIMM speed.
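For those who want to check the math, here is a minimal sketch of how those peak rates fall out of the unit counts and clock speeds quoted above. The `Igp` class and its field names are our own, and we are assuming Llano’s dual render back-ends blend eight pixels per clock, as Trinity’s do; the text only states that figure for Trinity.

```python
from dataclasses import dataclass


@dataclass
class Igp:
    name: str
    simd_engines: int
    shader_alus: int
    texels_per_simd: int   # texels filtered per clock, per SIMD engine
    pixels_per_clock: int  # pixels blended per clock by the render back-ends
    clock_mhz: int

    def peak_rates(self):
        """Peak per-second rates, in millions (per-clock units x MHz)."""
        return {
            "texels": self.simd_engines * self.texels_per_simd * self.clock_mhz,
            "alu_lanes": self.shader_alus * self.clock_mhz,
            "pixels": self.pixels_per_clock * self.clock_mhz,
        }


# Unit counts and clocks come from the text above.
# Assumption: Llano's dual render back-ends also blend eight pixels per clock.
llano = Igp("Llano", 5, 400, 4, 8, 444)
trinity = Igp("Trinity", 6, 384, 4, 8, 686)

old, new = llano.peak_rates(), trinity.peak_rates()
ratios = {key: new[key] / old[key] for key in new}

for key, ratio in ratios.items():
    print(f"{key:10s} {old[key]:7d} -> {new[key]:7d}  ({ratio:.2f}x)")
```

With these inputs, the texel rate works out to roughly 1.85x Llano’s, while the ALU and pixel rates land near 1.48x and 1.55x. These are theoretical peaks; delivered performance will also hinge on memory bandwidth.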
Although Trinity’s IGP isn’t based on the latest architecture, its associated media processing block is AMD’s most recent vintage. The UVD3 video decode engine adds support for the MVC extension to H.264 for stereoscopic 3D, for the MPEG-4/DivX format, and for decoding dual HD streams simultaneously. The brand-new VCE block throws hardware-accelerated H.264 encoding into the mix, too—something that’s important not just for performance and power efficiency reasons, but also for enabling new features like wireless displays.
Speaking of displays, Trinity can drive as many as four at once over HDMI, DVI, and DisplayPort. AMD has blazed the trail for DisplayPort adoption among consumer systems, and this chip supports DisplayPort 1.2 operation at up to 5.4 Gbps, including the daisy-chaining of multiple monitors on a single link. The APU can bundle sound into its digital display connections, as well—as many as four 7.1-channel audio streams, with broad support for digital encoding standards, including DTS Master Audio and Dolby TrueHD.
Better integration for power savings
Llano’s battery life is pretty good, but AMD claims Trinity is even better, with run times in some configurations extending as far as 10 hours. One very impressive number in this regard is the chip’s idle power draw of 1.08W. Battery run times will, of course, depend on more than just the CPU’s power consumption, but Trinity looks to be doing its part to conserve power.
Power-efficient performance should get a boost thanks to more capable power management and dynamic clock speed scaling. Llano could only trade power in one direction, with the CPU scaling up or down via Turbo Core depending on the needs of the IGP. Trinity’s IGP can join the game, now, too, allowing the whole chip to adjust its performance in response to the current workload.
The example above shows how the A10-4600M’s IGP and CPU clock frequencies change in response to different workloads. A heavy CPU load with light graphics use results in a moderate IGP clock and higher CPU frequencies. The CPU speed then varies based on the number of threads active; with a single thread, the A10 CPU can reach 3.2GHz. On the other hand, for a GPU-intensive application with modest CPU needs, the IGP clock jumps up and the CPU speed scales down.
Since they’re based on Piledriver, the CPU modules in Trinity have a much more capable implementation of Turbo Core than Llano. Llano has only one P-state above its stock clock speed. The A8-3500M’s base frequency is 1.5GHz, and when Turbo kicks in, the clock jumps to 2.4GHz. Trinity has finer-grained control, with four P-states for Turbo Core. Trinity is also able to respond much more quickly to changes in activity and die temperatures, thanks to an onboard power-management microcontroller and an architecture that’s designed to operate well at different frequencies across a range of voltages.
One way Trinity manages to achieve such low power draw at idle is more extensive power gating. In addition to power gates for the IGP and the two CPU modules, this chip adds gates for the north bridge, the PCIe interface, and the display PHY. When those portions of the chip aren’t in use, they can be shut off entirely, eliminating even the leakage power that would otherwise be going to them.
The conservation effort extends to the rest of the platform, too. Trinity’s memory controller can adjust DRAM frequencies on the fly in order to conserve energy, and it supports the low-power DDR3 standard for driving DIMMs at 1.25V. The VRMs can make faster transitions, improving efficiency. Also, the number-one activity for nearly all computer users is now more economical: staring at a static screen. Trinity can refresh a static display from a single memory module, allowing the other DIMM to scale back or to power down. The chip has more buffering for display memory, too, which should save power that would otherwise be spent on memory I/O.
Accelerating accelerated computing
AMD has talked a good game about CPU-GPU convergence and accelerated computing for a while now, but it is also laying the foundation for true CPU-IGP cooperation. One key bit of plumbing on that front is something called the Fusion Compute Link. The FCL replaces the PCIe communication channel between the CPU and GPU in a merged chip like Trinity. Llano’s first-generation FCL had only modest bandwidth, but AMD promised to invest more in this connection over time. Trinity’s FCL is 128 bits in each direction. This connection allows the IGP to access the CPU’s memory space coherently, and it gives the CPU a window into the IGP’s dedicated frame buffer. Given the right programming model, which AMD is pioneering with its software work on the Heterogeneous System Architecture, the FCL could become important in future converged applications, where the IGP and CPU might team up to manipulate data in the same memory.
The FCL augments the IGP’s primary path to system memory, which is two pairs of 256-bit links (one in each direction) between the graphics memory controller and the north bridge.
Trinity is ready to support merged applications with discrete GPUs, too. Its IOMMU will allow the shaders in PCIe graphics cards to operate directly on main memory, and it’s capable of supporting GPU virtualization.
The new A-series APUs
Naturally, AMD has a range of Trinity-based APUs on offer. The fastest model is the A10-4600M, which we’ve already seen in our dynamic power scheme example. The A10-4600M is also the chip we have for review today. As you can see, it has all of Trinity’s cores and cache enabled, running at aggressive clock speeds. With its 35W TDP, the 4600M will serve nicely to illustrate AMD’s progress since the Llano-based A8-3500M we reviewed last year—and have brought back here for an encore. AMD expects the A10 series to make its way into laptops costing $700 and more, where it will compete with the lower end of the mobile Core i7 line and the high end of the Core i5 lineup.
We have a couple of laptops based on Intel chips in the same basic class as the A10-4600M for comparison, too. The Core i7-2670QM is a quad-core Sandy Bridge with a 2.2GHz base clock and a 3.1GHz Turbo peak that is selling in laptops costing $659 and up at Newegg. The Core i7-3720QM is an Ivy Bridge-based quad-core in the same price range, although it’s too new to have a robust selection of systems available. Those few that are available currently cost quite a bit more than $700. A bigger wrench in the works is the fact that both Intel chips have 45W TDP ratings, so they have more room from which to extract performance than the A10-4600M. The most direct competition from Intel at present may be the Sandy Bridge-based Core i7-2640QM, which has a 35W TDP and costs about $30 less than the i7-3720QM, but it is a near run thing. AMD has positioned the A10 very close to those Intel quad-cores, obviously quite intentionally.
The rest of the lineup plays out much as one might expect. The A8-4500M will occupy systems costing $550 or more, facing off against the lesser Core i5s and greater Core i3s. The A6-4400M, with only one compute module (and thus two cores) enabled, will do battle with the Core i3 in laptops above the $450 mark. As far as we know, the A6 parts are actually quad-core Trinity chips with two cores disabled. AMD wasn’t willing to disclose any plans to produce a natively dual-core version of Trinity.
The most interesting Trinity parts, in our view, are the 25W and 17W models. The 17W version is the one destined for those ultra-thin MacBook Air clones, and we’re very much intrigued by its potential. It may prove to be a nice alternative to the dual-core version of Ivy Bridge, once the dual Ivy chip arrives later this summer.
The Trinity whitebook
Our Trinity APU sample is enclosed in the following 14″ whitebook:
The system looks and behaves almost like a retail product, but it isn’t one. It’s etched with AMD’s corporate logo instead of a vendor’s badge, and it lacks the fit and finish of a commercial system. One tell-tale sign is the optical drive, which sticks out a little past the lower edge of the system’s body—not enough to snag on something, but enough to make it clear this is a prototype.
Unlike the Llano whitebook we reviewed last year, this one lacks a discrete GPU. The APU’s two memory channels are fed with two 2GB DDR3 SO-DIMMs clocked at 1600MHz. AMD threw in a 128GB Samsung 830 solid-state drive, as well. We ruthlessly replaced it with a 500GB WD Scorpio Black hard drive to keep our benchmark comparisons fair.
Speaking of benchmark comparisons, we had some trouble gathering adequate contestants for this match-up. The 13″ Llano whitebook made a return appearance, as you’d expect, but the dual-core Sandy Bridge notebook against which we compared it last year wasn’t available for an encore. We do, however, have a couple of quad-core Intel notebooks on hand: one based on Sandy Bridge, and another based on Ivy Bridge.
Those two notebooks are detailed below. They’re both larger than the Trinity and Llano whitebooks, with 15″ displays and thicker, heavier frames. Both are outfitted with GeForce GT 630M discrete graphics, which we didn’t use in our tests, and both have twice as much RAM as the AMD whitebooks (eight gigs), but we don’t expect memory capacity to be a constraint in any of our tests. The most notable difference is that the Intel notebooks have 45W processors. Keep that in mind as you see the results on the following pages; the Intel parts ought to have a built-in advantage, since their power envelope is 10W larger.
Our testing methods
We ran every test at least three times and reported the median of the scores produced.
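As a minimal sketch of that policy, the snippet below takes the median of a set of runs and refuses to report anything from fewer than three. The function name and the sample scores are hypothetical, chosen only to show why the median shrugs off a single outlier run.

```python
from statistics import median

MIN_RUNS = 3  # we run every test at least three times


def reported_score(runs):
    """Return the median of the scores from repeated runs of one test."""
    if len(runs) < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} runs, got {len(runs)}")
    return median(runs)


# Hypothetical scores: the third run is an outlier, and the median ignores it.
sample_runs = [10.2, 9.8, 14.1]
print(reported_score(sample_runs))
```

An arithmetic mean of the same three numbers would be pulled upward by the outlier, which is the main reason to prefer the median for a small number of runs.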
The test systems were configured like so:
|System||AMD A8-3500M test system||AMD A10-4600M test system||Asus N56VM||Asus N53S|
|Processor||AMD A8-3500M APU 1.5GHz||AMD A10-4600M 2.3GHz||Intel Core i7-3720QM 2.3GHz||Intel Core i7-2670QM 2.2GHz|
|North bridge||AMD A70M FCH||AMD A70M FCH||Intel HM76 Express||Intel HM65 Express|
|Memory size||4GB (2 DIMMs)||4GB (2 DIMMs)||8GB (2 DIMMs)||8GB (2 DIMMs)|
|Memory type||DDR3 SDRAM at 1333MHz||DDR3 SDRAM at 1600MHz||DDR3 SDRAM at 1600MHz||DDR3 SDRAM at 1333MHz|
|Audio||IDT codec||IDT codec with 22.214.171.12477 drivers||Realtek codec with 126.96.36.19937 drivers||Realtek codec with 188.8.131.5263 drivers|
|Graphics||AMD Radeon HD 6620G + AMD Radeon HD 6630M with Catalyst 12.4 drivers||AMD Radeon HD 7660G with Catalyst 8.945 RC2 drivers||Intel HD Graphics 4000 with 184.108.40.20696 drivers; GeForce GT 630M with 296.54 drivers||Intel HD Graphics 3000 with 220.127.116.112 drivers; GeForce GT 630M with 296.54 drivers|
|Hard drive||Hitachi Travelstar 7K500 250GB 7,200 RPM||WD Scorpio Black 500GB 7,200 RPM||Seagate Momentus 750GB 7,200-RPM||Seagate Momentus 750GB 7,200-RPM|
|Operating system||Windows 7 Ultimate x64||Windows 7 Ultimate x64||Windows 7 Professional x64||Windows 7 Home Premium x64|
Thanks to Asus for volunteering a quad-core Sandy Bridge laptop, and thanks to AMD and Intel for providing the other systems.
We used the following versions of our test applications:
- Stream 5.8 64-bit
- SiSoft Sandra 2012.SP3
- 7-Zip 9.20 64-bit
- TrueCrypt 7.1a
- Chromium 20.0.1096.0
- SunSpider 0.9.1
- The Panorama Factory 5.3 x64 Edition
- LuxMark 2.0
- AMD APP SDK 2.6
- Intel SDK for OpenCL Applications 2012
- x264 HD benchmark 4.0
- Battlefield 3
- The Elder Scrolls V: Skyrim
- Batman: Arkham City
- FRAPS 3.5.0
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
Per our tradition, we’re going to start off by comparing the memory subsystems of our CPUs in a few synthetic tests.
Please note that the A10-4600M and Core i7-3720QM have higher-clocked memory than the other two offerings. Because of the discrepancy, the results below won’t paint a clear, unadulterated picture of memory controller efficiency. But they will show us something else. You see, the A10-4600M and Core i7-3720QM both support faster RAM than their predecessors. (Both can accommodate DDR3-1600 memory, while the A8-3500M and i7-2670QM are limited to DDR3-1333.) So we’re going to be able to see what dividends the faster memory support pays from one generation to the next.
In this basic measure of memory bandwidth, the A10-4600M edges out the A8-3500M by about 13%. Our Ivy Bridge CPU enjoys a similar gain over its forebear. The A10 can’t come close to matching the Intel chips, though.
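For context, here is a quick sketch of the theoretical peak bandwidth at each memory speed. It assumes standard 64-bit-wide DDR3 channels and the dual-channel configuration described earlier; the function name is our own.

```python
def peak_bandwidth_gbs(transfer_rate_mts, channels=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s.

    Assumes each channel is 64 bits (8 bytes) wide, as with standard DDR3.
    """
    return channels * bytes_per_transfer * transfer_rate_mts / 1000


ddr3_1333 = peak_bandwidth_gbs(1333)
ddr3_1600 = peak_bandwidth_gbs(1600)
theoretical_gain = ddr3_1600 / ddr3_1333 - 1

print(f"DDR3-1333: {ddr3_1333:.1f} GB/s")
print(f"DDR3-1600: {ddr3_1600:.1f} GB/s")
print(f"Theoretical gain: {theoretical_gain:.0%}")
```

The step from 1333 to 1600 MT/s is worth about 20% on paper, so the measured 13% gain for the A10 captures only part of the extra headroom.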
Next up: SiSoft Sandra’s more elaborate memory and cache bandwidth test. This test is multithreaded, so it captures the bandwidth of all caches on all cores concurrently. The different test block sizes step us down from the L1 and L2 caches into L3 and main memory.
The A10-4600M’s two L1 and L2 caches manage to match the A8’s four L1/L2 caches nearly step by step in terms of bandwidth. Neither can keep pace with the Bridge sisters’ cache hierarchies, however.
Sandra also includes a new latency testing tool. SiSoft has a nice write-up on it, for those who are interested. We used the “in-page random” access pattern to reduce the impact of prefetchers on our measurements. We’ve also taken to reporting the results in terms of CPU cycles, which is how this tool returns them. The problem with translating these results into nanoseconds, as we’ve done in the past with latency measurements, is that we don’t always know the clock speed of the CPU, which can vary depending on Turbo responses.
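To illustrate the problem, the sketch below converts one cycle count into nanoseconds at the A10-4600M’s 2.3GHz base clock and its 3.2GHz single-thread Turbo peak. The 20-cycle figure is a made-up example, not a measured result, and the helper name is ours.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in CPU cycles to nanoseconds at a given clock speed."""
    return cycles / clock_ghz


A10_BASE_GHZ = 2.3   # A10-4600M base clock
A10_TURBO_GHZ = 3.2  # A10-4600M peak single-thread Turbo clock

example_cycles = 20  # hypothetical latency, for illustration only

at_base = cycles_to_ns(example_cycles, A10_BASE_GHZ)
at_turbo = cycles_to_ns(example_cycles, A10_TURBO_GHZ)

print(f"{example_cycles} cycles = {at_base:.2f} ns at base, {at_turbo:.2f} ns at Turbo")
```

The same cycle count spans a range of nearly 2.5 ns depending on which clock the chip happened to be running, which is why we stick with cycles.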
Because it shares 2MB of L2 cache across each dual-core module, the A10 manages a lower latency than the A8 at the 2MB block size.
However, the A10 falls behind the A8 at every other block size, including those small enough to fit into the L1 and L2 caches. The culprit may simply be slower caches on Piledriver. In our desktop tests, Bulldozer fared even worse against the Phenom II X6 1100T, which is based on the same architecture as Llano.
TrueCrypt disk encryption
TrueCrypt supports acceleration via Intel’s AES-NI instructions, so the encoding of the AES algorithm, in particular, should be very fast on the CPUs that support those instructions. We’ve also included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.
7-Zip file compression and decompression
Neither AMD APU can catch up to Intel’s quad-core offerings, of course, but that’s no great surprise. We didn’t expect AMD to catch Intel in raw CPU performance, especially not with a 10W power envelope handicap. Still, Trinity will have to do well on other fronts to distinguish itself.
The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. We asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.
In the past, we’ve added up the time taken by all of the different elements of the panorama creation wizard and reported that number, along with detailed results for each operation. However, doing so is incredibly data-input-intensive, and the process tends to be dominated by a single, long operation: the stitch. Thus, we’ve simply decided to report the stitch time, which saves us a lot of work and still gets at the heart of the matter.
x264 HD benchmark
This benchmark tests one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.
We see the same story unfold in our image editing and video encoding tests: the A10-4600M edges out the A8-3500M by a decent margin, but it’s no match for quad-core Sandy and Ivy CPUs.
Trinity, Llano, Sandy Bridge, and Ivy Bridge all dedicate a substantial chunk of their die area to graphics. And, with the exception of Llano, they all have special-purpose video transcoding logic, as well. We sought to unleash all of those extra transistors in a few general-purpose applications, to see if the competitive picture would change at all.
LuxMark OpenCL rendering
We’ve deployed LuxMark in several recent reviews to test GPU performance. Since it uses OpenCL, we can also use it to test CPU performance, and even to compare performance across different processor types. And since OpenCL code is by nature parallelized and relies on a real-time compiler, it should adapt well to new instructions. For instance, Intel and AMD offer installable client drivers (ICDs) for OpenCL on x86 processors, and they both claim to support AVX. The AMD APP ICD even supports Bulldozer’s distinctive instructions, FMA4 and XOP.
First, a word about those missing bars in the graph. Sandy Bridge’s HD 3000 integrated graphics lack OpenCL support, so we couldn’t run LuxMark on the Core i7-2670QM’s IGP. Also, the AMD processors don’t support Intel’s ICD, so we were only able to run LuxMark on their integrated Radeon HD graphics and on their CPU cores using the AMD APP ICD. Ivy Bridge is the only processor that supports both AMD and Intel ICDs and has the ability to execute OpenCL code using its integrated graphics.
As we saw in our Ivy Bridge review last month, AMD’s APP ICD yields better results than Intel’s ICD when the IGPs are kept out of the running. The best results are obtained by combining the CPU with the APP ICD and the integrated graphics with their own OpenCL drivers. Regardless of the configuration, though, Trinity falls well behind both Ivy Bridge and Sandy Bridge. At the same time, it’s still nicely ahead of Llano.
AMD supplied us with a special build of The GIMP 2.8, which features a wealth of OpenCL-accelerated filters. Future GIMP builds will feature an entirely OpenCL-accelerated image processing pipeline, but the build we used did not. Here’s what AMD had to say on the subject:
The upcoming major release of GIMP is expected to move its main processing pipeline to use the GEGL library. . . . Knowing that OpenCL and GEGL are the future, the current OpenCL work is designed to impact GEGL, not the current GIMP pipeline. One consequence of aligning with GEGL is that the speed of adoption will rely on GEGL integration with GIMP. In current GIMP builds, there are special menus to use GEGL operation. There is other overhead as well. While we’re seeing nice speedups with OpenCL now, even better performance is expected once GIMP moves completely to the GEGL pipeline.
We tested by loading up an image from our camera, a 32-bit, 4272×2848 bitmap, running through 15 GEGL filters, and averaging the results. AMD says OpenCL kernel code is built when a filter is run for the first time, and this results in a “slight performance hit.” To compensate for that hit, we ran each filter four times and only recorded results from the last three runs.
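In rough terms, the scoring works like the sketch below: drop the first run of each filter, average the remaining runs, then average across filters. The filter names and timings here are hypothetical placeholders, and the function names are ours.

```python
from statistics import mean


def filter_time(runs, warmup=1):
    """Mean time for one filter after discarding the warm-up run(s).

    The first run includes OpenCL kernel compilation, so it is not counted.
    """
    timed = runs[warmup:]
    if not timed:
        raise ValueError("no runs left after discarding warm-up")
    return mean(timed)


def overall_time(per_filter_runs):
    """Average the per-filter means across all filters."""
    return mean(filter_time(runs) for runs in per_filter_runs.values())


# Hypothetical timings in seconds: four runs per filter, first run slowest.
sample = {
    "filter_a": [5.0, 2.0, 2.2, 1.8],
    "filter_b": [7.5, 4.0, 4.0, 4.0],
}

print(f"{overall_time(sample):.2f} s")
```

Discarding the first run keeps the one-time compilation cost from inflating the average, at the price of hiding a delay users would see the first time they apply a filter.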
Sadly, The GIMP’s GEGL operations weren’t available on our Intel systems. The menu simply didn’t show up, not even when we had the AMD APP ICD installed. Our results for the AMD systems tell us what we already know: Trinity is quicker than Llano. The difference is more pronounced here than in LuxMark, though.
The latest version of WinZip features a parallel processing pipeline with OpenCL support. The pipeline allows multiple files to be opened, read, compressed, and encrypted simultaneously, all with hardware acceleration. Right now, though, WinZip’s OpenCL capabilities seem to be off-limits to Intel processors—again, regardless of what ICD is installed. The OpenCL switch in the WinZip settings would only appear on our AMD systems.
We tested WinZip by compressing, then decompressing, a 1.17GB directory containing about 150 small text and image files, a couple dozen medium-sized PDF files, and 14 large Photoshop PSD files. We timed each operation with a stopwatch.
OpenCL acceleration doesn’t do much for decompression, but it clearly pays off during file compression. Interestingly, Trinity sees greater overall benefits from hardware acceleration than Llano. The Intel CPUs are faster even without help from their IGPs, though.
This user-friendly video transcoder supports AMD’s VCE and Intel’s QuickSync hardware transcoding blocks. Those are effectively black boxes without much programmability, so their output isn’t necessarily comparable—and neither is their performance, strictly speaking. From a practical standpoint, though, it’s helpful to see which solution will transcode videos the quickest. So that’s what we’re going to do.
For our test, we fed MediaEspresso a 1080p version of the Iron Man 2 trailer, and we asked it to convert the clip to a format suitable for the iPhone 4. We tested with full hardware acceleration as well as in software mode. Where the setting was available, we selected encoding speed over quality. The A8-3500M was only run in software mode, since it lacks hardware H.264 encoding.
Both VCE and QuickSync appear to halve transcoding times… except the latter looks to be considerably faster. We didn’t see much of a difference in output image quality between the two, but the output files had drastically different sizes. QuickSync spat out a 69MB video, while VCE got the trailer down to 38MB. (Our source file was 189MB.) Using QuickSync in high-quality mode extended the Core i7-3760QM’s encoding time to about 10 seconds, but the resulting file was even larger—around 100MB. The output of the software encoder, for reference, weighed in at 171MB.
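To put those file sizes in perspective, the short Python sketch below expresses each output as a fraction of the 189MB source. The dictionary labels are our own shorthand, and the 100MB figure for QuickSync's high-quality mode is the approximate value quoted above.

```python
def size_ratio(output_mb, source_mb=189.0):
    """Output file size as a fraction of the source file size."""
    return output_mb / source_mb


# Sizes in MB as reported in the text; labels are informal shorthand.
outputs_mb = {
    "QuickSync (speed)": 69.0,
    "QuickSync (quality, approx.)": 100.0,
    "VCE": 38.0,
    "Software": 171.0,
}

for name, size in outputs_mb.items():
    print(f"{name}: {size_ratio(size):.1%} of source size")
```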
IGP texture filtering quality
Among discrete GPUs, anisotropic filtering comparisons have become somewhat superfluous. Today’s solutions apply the same level of filtering quality with the same mipmap transitions at all polygon angles, which yields generally consistent results across different GPU makes and generations.
In the integrated world, though, things aren’t quite as rosy. We witnessed that first-hand when comparing Llano to Sandy Bridge last year. While Llano’s IGP had a nice, consistent filtering pattern, Sandy’s HD 3000 integrated graphics exhibited huge variations in filtering quality at different angles.
Happily, though, things have improved quite a bit with Ivy Bridge. Take a look:
The patterns above are the output of our Direct3D AF Tester. In case you’re not familiar with it, here’s our explanation from last year’s Llano review:
In the images above, you’re peering down a 3D-rendered cylinder or tube, and the inside surface of that tube has been covered with a simple texture map. The colored bands are what are known as mip maps, or increasingly lower resolution copies of the base texture mapped to the walls of the cylinder. The further you move from the camera, the lower the resolution of the mip level used. In the pictures above, the different colors show different mip levels. (Of course, mip maps don’t normally come in different colors. They look very much like one another and like the base texture. This test app colors them in order to make them easily visible.) Mip maps are a helpful tool in texture filtering because sampling from a single copy of the original, high-res texture can be work-intensive and, in a constrained grid of pixels, can produce excessive high-frequency noise, which is visually disruptive. In other words, a little bit of blurring and blending in the right places can be beneficial to the final result.
Alongside mip mapping, we’re layering on a couple of additional techniques to improve image quality. We’re using trilinear filtering to blend between mip levels, so that we don’t see abrupt transitions or banding. That’s why the different colors transition gradually from one to another. We’re also using anisotropic filtering, grabbing more samples for textures that exist at certain angles on the Z or depth axis—typically on surfaces stretching away from the camera, like floors, walls, and ceilings—in order to preserve sharpness that simple mip mapping would destroy. All of these things we take for granted in modern GPUs, which have custom hardware onboard to perform these functions.
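As a rough illustration of the mip-level selection and trilinear blending described above, here is a minimal Python sketch. The function names and the single texels-per-pixel input are simplifications of ours. Real GPUs derive the level of detail from screen-space derivatives, and anisotropic filtering takes additional samples that this sketch does not model.

```python
import math


def mip_chain(width, height):
    """List the dimensions of each mip level, halving until 1x1."""
    levels = [(width, height)]
    while width > 1 or height > 1:
        width = max(1, width // 2)
        height = max(1, height // 2)
        levels.append((width, height))
    return levels


def trilinear_weights(texels_per_pixel, max_level):
    """Pick the two mip levels to sample and the blend factor between them.

    texels_per_pixel is an assumed, simplified footprint: how many base-level
    texels one screen pixel covers along one axis.
    """
    lod = math.log2(max(texels_per_pixel, 1e-9))
    lod = max(0.0, min(float(max_level), lod))  # clamp to the available levels
    lower = int(math.floor(lod))
    upper = min(lower + 1, max_level)
    blend = lod - lower  # 0.0 means all lower level, approaching 1.0 means upper
    return lower, upper, blend


levels = mip_chain(256, 256)
print(trilinear_weights(3.0, len(levels) - 1))
```

The fractional blend factor is what produces the gradual color transitions in the tester output: a pixel that falls between two mip levels gets a weighted mix of both.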
In a nutshell, we want the color patterns to map consistently to the geometry (so, in this case, we want them to be perfectly circular), and we want the transitions between each color to be smooth. Trinity’s Radeon HD 7760G integrated graphics has no trouble with either task. Ivy Bridge’s HD 4000 IGP also manages mostly circular patterns with smooth transitions, but if you look closely, you’ll see jagged lines where the red fades into the background checkerboard pattern. As for Sandy Bridge, well, the image speaks for itself.
In a real-world example, the differences are plainly visible. Trinity and Ivy Bridge both give us nice, sharp textures at off-axis angles of inclination, while Sandy Bridge fails in a very noticeable way. Those textures only look sharp on Sandy’s IGP if we rotate the viewport to align the wall with the edge of the screen.
The Elder Scrolls V: Skyrim
Our Skyrim test involved running around the town of Whiterun, starting from the city gates, all the way up to Dragonsreach, and then back down again.
We tested at 1366×768 using the “medium” detail preset.
Now, we should preface the results below with a little primer on our testing methodology. Along with measuring average frames per second, we delve inside the second to look at frame rendering times. Studying the time taken to render each frame gives us a better sense of playability, because it highlights issues like stuttering that can occur—and be felt by the player—within the span of one second. Charting frame times shows these issues clear as day, while charting average frames per second obscures them.
For example, imagine one hypothetical second of gameplay. Almost all frames in that second are rendered in 16.7 ms, but the game briefly hangs, taking a disproportionate 100 ms to produce one frame and then catching up by cranking out the next frame in 5 ms—not an uncommon scenario. You’re going to feel the game hitch, but the FPS counter will only report a dip from 60 to 56 FPS, which would suggest a negligible, imperceptible change. Looking inside the second helps us detect such skips, as well as other issues that conventional frame rate data measured in FPS tends to obscure.
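The arithmetic behind that example is easy to check. The Python sketch below is our own illustration, with a hypothetical function name: it counts how many frames fit in a 1000 ms window when the 100 ms and 5 ms frames are mixed in with 16.7 ms frames.

```python
def frames_in_one_second(normal_ms, odd_frames_ms=()):
    """Frames delivered in a 1000 ms window.

    The odd frames (slow or catch-up) are counted individually, and the
    remaining time is filled with normal frames (fractional count allowed).
    """
    remaining_ms = 1000.0 - sum(odd_frames_ms)
    return len(odd_frames_ms) + remaining_ms / normal_ms


steady = frames_in_one_second(16.7)
hitchy = frames_in_one_second(16.7, odd_frames_ms=(100.0, 5.0))
print(round(steady), round(hitchy))
```

The two results round to 60 and 56, matching the figures above, even though the second case contains a frame that took six times longer than its neighbors.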
We’re going to start by charting frame times over the totality of a representative run for each system, though we conducted five runs per system to make sure our results are solid. These plots should give us an at-a-glance impression of overall playability, warts and all. (Note that, since we’re looking at frame latencies, plots sitting lower on the Y axis indicate quicker solutions.)
From this vantage point, it’s obvious the A10-4600M and Radeon HD 7760G IGP combo pulls off the lowest, most consistent frame times of the bunch. Ivy Bridge and its HD 4000 IGP suffer from a greater number of latency spikes, and they seem to exhibit more variance in general, as well. Sandy Bridge is the worst of the bunch by far, with embarrassingly high frame latencies and a huge spike over 250 ms at the end of the run.
We can slice and dice our raw frame-time data in other ways to show different facets of the performance picture. Let’s start with something we’re all familiar with: average frames per second. Though this metric doesn’t account for irregularities in frame latencies, it does give us some sense of typical performance.
Next, we can demarcate the threshold below which 99% of frames are rendered. The lower the threshold, the more fluid the game. This metric offers a sense of overall frame latency, but it filters out fringe cases.
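For the curious, a 99th-percentile frame time can be computed from raw frame-time data along these lines. This Python sketch uses the nearest-rank definition of a percentile, which is an assumption on our part for illustration; other percentile definitions interpolate between samples and can give slightly different values.

```python
import math


def percentile_frame_time(frame_times_ms, percent=99.0):
    """Nearest-rank percentile: the smallest frame time such that at least
    `percent` of all frames were rendered in that time or less."""
    if not frame_times_ms:
        raise ValueError("no frame times supplied")
    ordered = sorted(frame_times_ms)
    rank = math.ceil(percent * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


# Hypothetical data: 197 quick frames plus three slow ones.
sample = [10.0] * 197 + [50.0, 60.0, 70.0]
print(percentile_frame_time(sample))
```

With this definition, the slowest 1% of frames (the fringe cases) sit above the reported threshold and do not affect it.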
Of course, the 99th percentile result only shows a single point along the latency curve. We can show you that whole curve, as well. With integrated graphics or single-GPU configs, the right-hand side of the graph, and especially the last 10% or so, is where you’ll want to look. That section tends to be where the best and worst solutions diverge.
These latency curves are nice and neat, with no one solution crossing over to be slower than the other one in the last 5% or so. Sometimes things aren’t like that, as we’ll likely see shortly.
Finally, we can rank solutions based on how long they spent working on frames that took longer than 50 ms to render. The results should ideally be “0” across the board, because the illusion of motion becomes hard to maintain once frame latencies rise above 50 ms or so. (50 ms frame times are equivalent to a 20 FPS average.) Simply put, this metric is a measure of “badness.” It tells us about the scope of delays in frame delivery during the test scenario.
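The “badness” metric can be read in two ways, and the Python sketch below shows both so the difference is clear. Which one matches the charts is an assumption here: the first function sums only the time in excess of the 50 ms threshold, while the second sums the full duration of every frame that crossed it. The function names are ours.

```python
def time_beyond_threshold(frame_times_ms, threshold_ms=50.0):
    """Sum only the portion of each slow frame that exceeds the threshold."""
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)


def time_in_slow_frames(frame_times_ms, threshold_ms=50.0):
    """Sum the full duration of every frame slower than the threshold."""
    return sum(t for t in frame_times_ms if t > threshold_ms)


def fps_equivalent(frame_time_ms):
    """Steady frame rate that corresponds to a constant frame time."""
    return 1000.0 / frame_time_ms


# Hypothetical frame times in milliseconds.
sample = [16.7, 100.0, 5.0, 60.0]
print(time_beyond_threshold(sample), time_in_slow_frames(sample))
```

Either way, a system that never exceeds 50 ms on any frame scores zero.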
No question about it: Trinity’s integrated graphics are fast. They’re substantially quicker than even Llano’s, and the contest with Intel’s solutions is really no contest at all. From a seat-of-the-pants perspective, only the A10-4600M and Radeon HD 7760G are really playable at these settings. Llano is borderline, and the Intel offerings are just too choppy.
Batman: Arkham City
We grappled and glided our way around Gotham, occasionally touching down to mingle with the inhabitants.
Arkham City was tested at 1366×768 using medium detail and medium FXAA, and with v-sync disabled.
Uneven frame times seem to be a fact of life with this game, and our integrated graphics solutions appear to exacerbate the problem. By the looks of it, though, Ivy Bridge has slightly lower and slightly more consistent frame times than Trinity. Llano achieved shorter latencies than Sandy Bridge overall, but it also suffered huge latency spikes a couple of times throughout the run. Those seemed to occur consistently no matter how many times we ran through the same sequence.
Well, isn’t that interesting? As poorly as Ivy and the HD 4000 did in Skyrim, they’re actually faster and smoother than the A10-4600M and Radeon HD 7760G across the board here—and by a fair margin, too. Perhaps Intel’s driver team has done some optimization work for Unreal Engine 3-based titles. Either that, or some of the integrated Radeon’s performance has been left untapped. Considering Trinity barely edges out Llano here, the latter seems more likely.
We tested Battlefield 3 by playing through the start of the Kaffarov mission, right after the player lands. Our 90-second runs involved walking through the woods and getting into a firefight with a group of hostiles, who fired and lobbed grenades at us.
BF3 wasn’t really playable at anything but the lowest detail preset using these IGPs—so that’s what we used.
Trinity is back in the saddle, yielding lower, more consistent frame times than Ivy Bridge. Sandy Bridge, meanwhile, isn’t even in the running. Not only does it perform poorly, as evidenced by the plot above, but its HD 3000 IGP also has image quality problems. It fails to render shadows across the ground texture properly.
Yep. Trinity definitely leads the pack here. The average FPS figures might fool you into thinking Ivy Bridge is almost as fast, but a glance at our frame latency curve will show otherwise. The Ivy system’s frame times rise sharply for the last 10% or so of frames.
Battery run times
We tested battery run times twice: once running TR Browserbench 1.0, a web browsing simulator of our own design, and again looping a 720p Game of Thrones episode in Windows Media Player. (In case you’re curious, TR Browserbench is a static version of TR’s old home page rigged to refresh every 45 seconds. It cycles through various permutations of text content, images, and Flash ads, with some cache-busting code to keep things realistic.)
Before testing, we conditioned batteries by fully discharging and then recharging each system twice in a row. We also used our colorimeter to equalize display luminosity at around 100 cd/m². That meant brightness levels of 40% for the Trinity system, 70% for the Llano machine, 25% for the Asus N56VM, and 45% for the N53S. The Intel systems had larger panels than the AMD ones, though, so that might have impacted power consumption.
We should note one other caveat: our four machines didn’t all have the same battery capacities. The batteries in the two Intel notebooks both had 56 Wh ratings, but the Llano laptop had a 58 Wh battery, and the Trinity system’s battery was rated for 54 Wh.
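One way to account for that caveat is to normalize run times by battery capacity. The Python sketch below is our own illustration: the capacities come from the ratings above, while the helper names and the idea of reporting minutes per watt-hour are assumptions, not part of our test procedure.

```python
# Rated capacities in watt-hours, as listed in the text.
BATTERY_WH = {
    "Trinity": 54.0,
    "Llano": 58.0,
    "Asus N56VM": 56.0,
    "N53S": 56.0,
}


def capacity_relative_to(reference, capacities=BATTERY_WH):
    """Each battery's capacity as a multiple of the reference system's."""
    ref = capacities[reference]
    return {name: wh / ref for name, wh in capacities.items()}


def minutes_per_watt_hour(run_time_minutes, capacity_wh):
    """Run time normalized by battery capacity."""
    return run_time_minutes / capacity_wh


for name, ratio in capacity_relative_to("Trinity").items():
    print(f"{name}: {ratio:.3f}x the Trinity battery")
```

By this measure, the Llano laptop's battery holds roughly 7% more energy than the Trinity system's, and the Intel notebooks' batteries roughly 4% more, so the capacity differences work slightly against Trinity rather than in its favor.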
It’s no surprise to see the Trinity whitebook pulling off longer run times than the two Intel notebooks, since those have bigger displays and more power-hungry CPUs. The leap over Llano is encouraging, though; coupled with our performance data, it suggests AMD has managed to deliver both higher performance and greater power efficiency without a die shrink.
Now, that said, Trinity’s power-efficiency lead over Llano might not be as huge as our web-browsing results suggest. Last year, after much fiddling with BIOS and control panel settings, we managed to squeeze 5.4 hours of web surfing out of the same Llano whitebook. We weren’t able to reproduce that result this time, but it’s worth keeping in mind.
Even if Trinity only gets you an extra hour of run time over its slower predecessor, though, that’s still a nice improvement.
In pitching Trinity to the press, AMD repeatedly emphasized the subjective user experience and downplayed the importance of benchmarks. Take a look at our CPU performance results—even keeping in mind the 10W handicap the A10 had to deal with—and you’ll understand why they might not want to see that comparison emphasized too much. Still, it is a fair point to note that one can’t always perceive differences in CPU performance these days. During our time with the Trinity laptop, we found its snappiness for everyday web browsing and such to be virtually indistinguishable from our two Intel quad-core laptops. Of course, running heavier-duty applications is where our CPU tests and perception collide; there’s little arguing with a photo stitching result where the A10 takes 12 seconds longer than Sandy Bridge to complete the same task. Whether one will regularly notice the difference between the two will depend on how one uses the system.
AMD is doing some good work in helping to push heavy-duty desktop applications like the GIMP and WinZip toward GPU acceleration via OpenCL. Many others, including the x264 video encoder, are purportedly slated to get OpenCL support soon. Further adoption of OpenCL and GPU acceleration could transform some of the stickiest parts of the desktop usage model by making key applications more GPU-dependent than CPU-dependent. That’s huge. Presumably, AMD and its APUs would benefit from this change. However, the early returns from WinZip and LuxMark have shown four of Intel’s x86 CPU cores to be even faster than Trinity’s CPU-and-IGP tag team. AMD still has a lot of work to do before it can credibly claim to be fulfilling its vision of a better user experience via converged computing.
For now, the choice between AMD’s Trinity and the Intel competition is very much about priorities. If you value desktop application performance above all else, then Trinity probably isn’t for you. If you care about graphics and gaming, well, then Trinity may hold some interest. We don’t think that’s a minor point in the grand scheme of things. Laptops are rapidly becoming the most popular consumer PCs, and a great many consumers will want to play games on them at least some of the time. We’ve noted that one can’t always tell CPUs apart from the seat-of-the-pants experience. The results of our latency-focused gaming tests will tell you IGP performance deltas are much easier to perceive, at least in graphically intensive titles like the ones we tested. All of these IGPs are relatively wimpy graphics solutions, so you really want the best one possible. That’s one of the reasons we liked Llano, and Trinity gives us no reason to change our tune. Yes, Ivy Bridge’s IGP is much improved, but Trinity’s is enough better to erase any questions of supremacy on that front.
What we want, now, is to get our hands on an ultra-thin laptop with a 17W Trinity inside. If that setup proves to be reasonably competent for both all-around use and occasional gaming, then AMD may have set a new high-water mark of sorts in ultraportable computing. That would really be something.