For several generations, since Llano, AMD has been slowly but methodically marching toward its vision of accelerated computing, where traditional CPU cores and graphics share space on a chip and work together to process data. This vision was called “fusion” back when the process began, although you won’t hear that term coming from AMD these days. Regardless, AMD’s latest processor, or APU (short for “accelerated processing unit”), is a major milestone on the path toward fused computing—and AMD is taking the wraps off of it today.
Compared to AMD’s current APUs, the chip code-named Kaveri is packed with sweeping changes, including enhanced “Steamroller” CPU cores, updated Radeon graphics, and a first-of-its-kind ability for the onboard CPU and GPU cores to share memory and work together to tackle a problem. Those are just the big-ticket items. Virtually every unit in Kaveri has been enhanced in some fashion.
Same space, another billion transistors
The changes in Kaveri start with the transition to a new chip fabrication process that packs more transistors into the same space.
The prior-gen Trinity/Richland APUs were built at GlobalFoundries using a familiar sort of manufacturing process for AMD CPUs, with feature sizes as small as 32-nm and a silicon-on-insulator (SOI) substrate. This 32-nm SOI process is tuned expressly for CPUs and helps enable the clock frequencies above 4GHz that are common in AMD’s desktop processors.
For Kaveri, AMD and GloFo have developed a 28-nm SHP (short for “super-high performance,” presumably) process that trades SOI for traditional bulk silicon. The 28-nm SHP process is tuned differently, to allow for higher transistor densities and somewhat lower peak switching speeds. AMD describes the process as a “happy medium” tuning point, one more accommodating to the GPU portion of Kaveri’s die.
|Code name||Key products||Cores/modules||Threads||Last-level cache size||Process node (nm)||Estimated transistors (millions)||Die area (mm²)|
|Lynnfield||Core i5, i7||4||8||8 MB||45||774||296|
|Sandy Bridge||Core i5, i7||4||8||8 MB||32||995||216|
|Ivy Bridge||Core i5, i7||4||8||8 MB||22||1200||160|
|Haswell (Quad GT2)||Core i5, i7||4||8||8 MB||22||1400||177|
|Llano||A8, A6, A4||4||4||1 MB x 4||32||1450||228|
|Trinity/Richland||A10, A8, A6||2||4||2 MB x 2||32||1303||246|
|Kaveri||A10, A8||2||4||2 MB x 2||28||2410||245|
Thanks to this new manufacturing process, Kaveri crams about 1.1 billion more transistors—most of them dedicated to graphics—into approximately the same die area as Trinity. However, Kaveri has lower CPU operating speeds, especially in the higher power envelopes typical of most desktop processors.
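To put a number on that density jump, here is a quick back-of-the-envelope sketch using only the transistor counts and die areas from the table above. The dictionary layout and function name are ours, purely for illustration.

```python
# Transistor counts (millions) and die areas (mm^2) taken from the table above.
chips = {
    "Trinity/Richland": {"transistors_m": 1303, "area_mm2": 246},
    "Kaveri": {"transistors_m": 2410, "area_mm2": 245},
}


def density(chip):
    """Return millions of transistors per square millimeter."""
    return chip["transistors_m"] / chip["area_mm2"]


trinity = chips["Trinity/Richland"]
kaveri = chips["Kaveri"]

# How many transistors Kaveri adds, and how much denser the die is overall.
added_m = kaveri["transistors_m"] - trinity["transistors_m"]
density_gain = density(kaveri) / density(trinity)

print(f"Added transistors: {added_m} million")
for name, chip in chips.items():
    print(f"{name}: {density(chip):.2f} M transistors/mm^2")
print(f"Density gain: {density_gain:.2f}x")
```

By this rough measure, Kaveri packs a little under twice as many transistors into each square millimeter as Trinity does. The shrink from 32 nm to 28 nm alone wouldn't account for that, which fits AMD's description of a process tuned for density.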
If you’ve been following these things, this story may sound familiar to you. Intel has taken a similar path with its 22-nm fab process, tuning for better low-power operation at the expense of additional peak performance. Given that chips like Kaveri and Intel’s Haswell are geared primarily for laptops, this sort of tuning makes sense.
That said, AMD and Intel aren’t exactly aligned in their approaches to highly integrated CPUs. In the last couple of generations, Intel has pushed into ever-lower power envelopes with its Core processors. Haswell Y-series parts can squeeze into power envelopes as low as 6W, and that’s with an on-package “PCH,” or south bridge I/O chip. AMD evidently didn’t see that move coming when it defined the requirements of its new APU. Kaveri operates in a broad range of power targets between 15W and 95W, but it’s most likely not optimal at either end of that range. AMD hasn’t yet announced the mobile versions of Kaveri—today’s introduction applies only to the desktop variants—but the 15W version of Kaveri will presumably have an external south bridge with its own power budget. AMD will have to cover lower power ranges with its Kabini and Temash SoCs, which are decent but cheaper, lower-performance chips.
Steamroller CPU cores
Kaveri has a pair of CPU modules, each with two “tightly coupled” integer cores and a single, shared floating-point unit. In keeping with its recent heavy-machinery theme, AMD calls this next revision of its CPU microarchitecture Steamroller. Kaveri’s Steamroller modules have been tweaked in significant ways to improve performance and power efficiency compared to the previous generations, known as Piledriver and Bulldozer. AMD CTO Mark Papermaster revealed many of the changes on tap for Steamroller over a year ago, but Kaveri is the first silicon to include this generation of AMD’s x86 processor tech.
The CPU modules in the Bulldozer family have never quite lived up to expectations for various reasons. The obvious point of emphasis in Steamroller is keeping the execution engine better fed through tweaks to the microarchitecture’s front end. Most notably, instruction decode is no longer a shared resource. The module has separate, dedicated decoders for each of its two integer cores. Also, the instruction cache is now 50% larger, at 96KB, and is three-way set associative. AMD claims i-cache misses have been reduced by 30% as a result. Furthermore, the branch target buffer has grown in size from 5K to 10K entries, giving the branch predictor more insight into program activity. The benefit is a claimed 20% reduction in branch mispredictions. Tricky x86 instructions that require the use of microcode should run faster in Steamroller, as well, since microcode ROM can be accessed simultaneously by both of the module’s threads.
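For the curious, the instruction cache figures can be unpacked with some simple cache-geometry arithmetic. This is only a sketch: the 64-byte line size and the two-way organization of the older cache are our assumptions, not figures quoted above.

```python
LINE_BYTES = 64  # assumed cache-line size; not stated in the article


def num_sets(size_kb, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache of the given size."""
    return (size_kb * 1024) // (ways * line_bytes)


# Steamroller: 96KB, three-way set associative (from the article).
steamroller_sets = num_sets(96, 3)

# "50% larger" implies the previous cache held 96 / 1.5 KB.
prior_size_kb = 96 / 1.5

# Assumption: the previous cache was two-way set associative.
prior_sets = num_sets(int(prior_size_kb), 2)

print(f"Steamroller i-cache sets: {steamroller_sets}")
print(f"Implied previous size: {prior_size_kb:.0f}KB, sets: {prior_sets}")
```

If those assumptions hold, the set count is unchanged and the extra 32KB comes entirely from adding a third way.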
There are some big numbers attached to those individual front-end improvements. Combined with a larger scheduler window that adds 5-10% more efficiency, the Steamroller execution engine is apparently being kept much busier. On a per-thread basis, AMD says instruction dispatches that use the max width of the machine have risen by 25%. The Steamroller module can retire work at a higher rate, too, thanks to improvements to its back end (including enhancements to the load and store queues).
Of course, improvements in individual areas don’t always translate directly into overall performance gains, since architectural constraints tend to move around depending on the workload. AMD claims Steamroller delivers an overall average gain in retired instructions per clock of about 10% over Piledriver, although that number can rise as high as 20% in certain scenarios. The good news is that Kaveri’s IPC increases should serve to offset the reduction in clock frequency caused by the switch to 28-nm SHP manufacturing, thus keeping CPU performance steady from Trinity and Richland. The bad news is that AMD may be largely treading water in terms of overall CPU performance, while Intel continues to extend its lead.
Any pain associated with AMD’s ongoing deficit in CPU performance is dulled somewhat by Kaveri’s incorporation of the state-of-the-art GCN graphics architecture. 47% of Kaveri’s die space is devoted to graphics, signaling AMD’s commitment not just to graphics, but also to GPU acceleration of general-purpose computing workloads.
The move to the Graphics Core Next architecture is a major upgrade over Trinity on both of these fronts, just as it was when the Radeon HD 7000 series supplanted the HD 6000 series. (I’ve outlined the structure of the GCN compute units here.) This is the same generation of graphics technology that AMD built into the chips that power Microsoft’s Xbone and Sony’s PS4.
More precisely, Kaveri’s compute units are of the same vintage as those in the Hawaii GPU that powers the Radeon R9 290X. This latest revision of GCN includes provisions especially helpful for APUs. The addition of flat system addressing facilitates the sharing of memory between CPU and graphics compute units. Meanwhile, buffering changes should improve the performance of geometry shaders and tessellation in the bandwidth-constrained environs of a CPU socket.
Naturally, Kaveri’s GPU is built on a much smaller scale than the big Hawaii chip. It has only eight compute units, versus 44 on the Radeon R9 290X. Still, those eight CUs endow Kaveri with a total of 512 shader processors and 32 texels per clock of bilinear filtering capacity. The front end can rasterize a single primitive per clock cycle, and two render back-ends give it 16 pixels per clock of ROP throughput. This is a major upgrade from the 384 SPs, 24 tpc of filtering, and 8 ppc of ROP throughput in Trinity—and we haven’t even accounted for the more efficient scheduling and superior GPU computing chops of the GCN architecture.
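Those per-clock figures fall straight out of the compute unit count. The sketch below assumes the usual GCN layout of 64 shader processors and four texture filtering units per CU. The Trinity numbers are the ones quoted above, and the helper name is ours.

```python
SPS_PER_CU = 64     # GCN layout: four 16-wide SIMDs per compute unit
TEXELS_PER_CU = 4   # GCN layout: four texture filtering units per compute unit


def gpu_resources(compute_units):
    """Per-clock shader and texture resources for a GCN GPU with this many CUs."""
    return {
        "shader_processors": compute_units * SPS_PER_CU,
        "texels_per_clock": compute_units * TEXELS_PER_CU,
    }


kaveri = gpu_resources(8)
hawaii = gpu_resources(44)

# Trinity figures as quoted in the article.
trinity = {"shader_processors": 384, "texels_per_clock": 24}

for key in ("shader_processors", "texels_per_clock"):
    ratio = kaveri[key] / trinity[key]
    print(f"{key}: Kaveri {kaveri[key]}, Hawaii {hawaii[key]}, "
          f"Kaveri vs. Trinity {ratio:.2f}x")
```

On paper, that works out to a one-third increase in shader and texturing resources over Trinity, before any architectural efficiency gains are counted.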
In keeping with Kaveri’s mobile focus, the impact of this wider graphics engine will most likely be felt in lower power bands, where the dual memory channels available inside of a CPU socket are less of a constraint, relatively speaking. We’ve already shown that the previous-gen Richland’s GPU is somewhat bandwidth-constrained in higher power envelopes. If bandwidth becomes the primary performance limiter, then Kaveri’s wider graphics engine could become starved for work.
The future is fusion?
What may be Kaveri’s most innovative new technology doesn’t yet benefit current applications. However, it should enable developers to create programs that can use the CPU and GPU cores on a chip together in novel ways. AMD talks about these features under the umbrella of its wide-ranging HSA effort. HSA stands for Heterogeneous System Architecture, and it refers to an overarching system architecture for mixed-mode computing (involving CPU cores, GPUs, and possibly DSPs) with its own programming model. AMD’s HSA enablement effort involves building the tools and partnerships to make HSA a viable development platform, both for x86-compatible chips and for SoCs that marry other sorts of CPU cores and graphics engines. The goal is to make it possible to write software that almost effortlessly intermingles the use of CPUs, graphics processors, and other computing engines as needed.
AMD outlined the basic HSA architecture several years ago, and it has been slowly adding features to its chips to make this vision a reality. The first APU, Llano, had a 128-bit Fusion Compute Link that allowed the GPU to access CPU-owned memory in certain cases. This link was an add-on created specifically for mixed-mode computing, since the integrated Radeon had a 512-bit bus of its own. Trinity expanded the FCL to 256 bits wide and changed its path, routing it through an IOMMU and into a unified north bridge between the CPU and graphics cores. Kaveri retains the 512-bit Radeon bus and the 256-bit FCL, and it adds a third 256-bit link from the GPU to the north bridge.
This new link is notable because it provides coherent access to memory. That is, the GPU can read and modify memory locations over this link without worrying about whether the same data is being held or modified in the CPU caches. Much like in a multi-socket server, Kaveri’s hardware ensures that its CPU and GPU cores are properly synchronized and working on correct, up-to-date data. Programmers and compilers need not worry about the hazards created by the GPU reaching into main memory and making a change. Coherent communication is one of the keys to unlocking the GPU’s full participation in heterogeneous computing, and Kaveri is the first chip from AMD to offer this capability.
Kaveri’s coherent FCL pairs up with a couple of other HSA-enabling features to open some new possibilities for programming an APU. Thanks to a feature called hUMA, or heterogeneous uniform memory access, the CPU and GPU can share up to 32GB of memory and access it via a common addressing scheme. hQ, or heterogeneous queuing, allows the GPU to create and dispatch work for itself, or for the CPU. Kaveri’s graphics unit includes eight dedicated asynchronous compute engines (ACE), independent of the graphics command processor, for scheduling parallel computing work. And Kaveri supports the atomic operations needed for synchronization between the CPU and GPU cores.
At the Kaveri press event, AMD HSA honcho Phil Rogers offered several examples of how an HSA-compliant APU could intermix CPU and GPU operations for higher performance using simple, less repetitive code. Kaveri is the first chip capable of running that code natively, making it the first real development platform for HSA. If AMD somehow is able to persuade the rest of the industry to standardize on its vision for heterogeneous programming, that could be an even bigger coup than the adoption of the x86-64 ISA back in the Athlon 64 days.
With that said, the implementation of graphics coherency in Kaveri is just a first step, as the presence of three separate buses coming from the GPU indicates. AMD Client Division CTO Joe Macri forthrightly admitted that the three buses could be merged in a future design. One can imagine how a single link could be more power-efficient. For engineering purposes, he told us, replicating the FCL and making it coherent was the easier path for this project. Also, the coherent FCL presently bypasses the GPU’s L2 cache, unlike the non-coherent link. On the CPU side, the L1 cache’s TLB is available on both buses, but the L2 TLB, located in the IOMMU, can only be accessed by one client at a time. In the event of an L2 miss, the IOMMU will walk the page tables, remaining locked the whole time.
Obviously, these limitations aren’t ideal. Macri explained that the goal in this case was keeping things simple and maintaining architectural correctness. The team didn’t want a bug in HSA-related features to delay the product, especially since HSA is about enabling future applications, not current ones. In keeping with AMD’s recent modus operandi of incremental CPU-GPU fusion, we’d expect these restrictions to be removed from future APUs.
Dedicated accelerators and more
Kaveri is about more than Steamroller and GCN. The dedicated media accelerators on the chip have all been updated, too.
The big addition here is the TrueAudio DSP block that AMD built into the latest Radeons—and apparently into the next-gen game console SoCs, as well. TrueAudio is meant to accelerate effects like 3D positional audio in games, removing that burden from the CPU. Like the HSA features, Kaveri’s TrueAudio block is a bit forward-looking, since we don’t yet have any software that can take advantage of it. However, a number of middleware vendors look to be gearing up to support TrueAudio, so we can probably expect to see games use it before too long.
Kaveri’s video accelerators are both updated versions of the ones featured in Trinity and Richland. The UVD 4 video decoder block hasn’t changed much, but AMD says it has improved error resiliency, so videos will continue playing even when the decoder encounters errors in their source files. The VCE 2 encoder block adds support for the YUV444 color format, specifically in order to provide better text quality when using 60GHz wireless displays. H.265/HEVC isn’t supported in VCE 2. AMD is instead talking about using GPU acceleration via OpenCL to assist with the playback of 4K video content encoded in this fashion.
Oh, and one big-ticket checkbox item has been marked: at last, AMD’s latest APU supports PCI Express 3.0 connectivity for off-chip I/O. This addition could pay dividends in several cases, especially when the APU is paired with a couple of discrete graphics cards in a multi-GPU team.
Like AMD’s past APUs, Kaveri has sophisticated power management capabilities, with dynamic voltage and frequency scaling (DVFS) as well as boost. I suspect AMD isn’t talking as much about this particular area because it’s saving something for the introduction of the mobile Kaveri parts. Compared to prior generations, the firm says, Kaveri has better monitoring of temperatures and activity counters across the chip, allowing it to pursue higher clock frequencies with boost—and thus push the limits of its prescribed thermal envelope—without reducing chip reliability.
AMD did share some preliminary battery life numbers for the mobile version of Kaveri. The numbers above come from a 35W APU installed in a system with a 58Whr battery, and as you can see, the run times look pretty decent—although that is a pretty beefy battery.
One other bit of good news for mobile versions of Kaveri: AMD says the chip draws only about 25 mW of power in an S3 suspend state. That should mean that it’s possible to let a laptop sleep for hours or even days without substantially draining the battery. We’ll have to see how that works out at a platform and system level, but the APU power number sounds very nice.
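To see why 25 mW sounds so nice, consider a rough estimate that pairs it with the 58Whr battery from AMD's test system. This counts the APU alone and ignores memory self-refresh and everything else on the platform, so treat it as a best-case bound rather than a prediction.

```python
battery_wh = 58.0      # battery capacity of AMD's test system, per the article
apu_s3_watts = 0.025   # 25 mW APU draw in S3, per AMD

# Time for the APU alone to drain a full battery while suspended.
hours_to_drain = battery_wh / apu_s3_watts
days_to_drain = hours_to_drain / 24

# Share of the battery the APU alone would consume over a 48-hour weekend.
weekend_wh = apu_s3_watts * 48
weekend_pct = 100 * weekend_wh / battery_wh

print(f"APU-only drain time: {hours_to_drain:.0f} hours ({days_to_drain:.1f} days)")
print(f"Weekend asleep: {weekend_wh:.1f} Wh, or {weekend_pct:.1f}% of the battery")
```

The APU by itself would need months to empty that battery, so the rest of the platform will decide real-world standby times.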
A new socket: FM2+
Kaveri comes to the desktop with a new type of socket in tow: Socket FM2+. This new plug type has two more pins than the older FM2 standard, and as a result, Kaveri-based APUs won’t drop into pre-FM2+ motherboards.
Happily, Socket FM2+ mobos will accept older Trinity and Richland-based APUs, so there is a measure of backward compatibility in play here. I think most owners of Socket FM2 systems would probably prefer things the other way around, though, so they could drop a new CPU into an older system as an upgrade.
A trio of desktop Kaveris
|Model||Modules/threads||Base clock||Boost clock||L2 cache||GPU CUs||GPU clock||TDP||Price|
|A10-7850K||2/4||3.7 GHz||4.0 GHz||4 MB||8||720 MHz||95 W||$173|
|A10-7700K||2/4||3.4 GHz||3.8 GHz||4 MB||6||720 MHz||95 W||$152|
|A8-7600||2/4||3.3 GHz||3.8 GHz||4 MB||6||720 MHz||65 W||$119|
|A8-7600||2/4||3.1 GHz||3.3 GHz||4 MB||6||720 MHz||45 W||$119|
Yes, I said there is a trio of desktop Kaveri APUs. Look closely above, and you’ll see that the A8-7600 occupies two lines in the table. That’s because this particular model comes with a configurable TDP. The user can pick one of two operating points for it, a 45W peak or a 65W peak, and the chip will run at different clock speeds based on that setting. I’ve already mentioned that most of Kaveri’s improvements will be more acutely felt in lower power envelopes, so perhaps you won’t be surprised to learn that AMD has elected to supply the A8-7600 to us for review. I can’t really complain. We’ve long said AMD’s 65W APUs are its most attractive offerings.
I do wish the A8-7600 were actually becoming available for purchase today, but AMD quotes a vague “Q1 ’14” release time frame for it. The two A10 parts are the ones hitting stores today.
Naturally, we’ve tested the A8-7600 at both 45W and 65W TDP levels. As a fairly direct competitor to the A8-7600, we have Intel’s Core i3-4330. This dual-core, quad-threaded Haswell runs at 3.5GHz, actually a higher clock than the A8’s. That fact doesn’t bode well for the CPU performance match-up, since Intel’s recent cores tend to be substantially faster clock-for-clock than AMD’s. (Then again, Kaveri has twice as many integer cores.) The i3-4330 lists for $138 and has a TDP rating of 54W, smack-dab between the A8-7600’s two configurable levels. The Core i3 features Intel’s HD Graphics 4600 IGP. Haswell’s beefier GT3 and GT3e graphics configs aren’t available in socketed desktop parts.
As expected, the A10-series Kaveris can’t quite reach the same clock frequencies as Richland parts fabbed on a 32-nm SOI process. The 7850K tops out at 3.7GHz base and 4.0GHz boost speeds, several hundred megahertz below the 4.1/4.4GHz operation of the A10-6800K. Graphics clock speeds are down a bit, too, from 844MHz in the 6800K to 720MHz in the 7850K. Kaveri’s wider graphics should still be a clean win, though, provided that there’s enough memory bandwidth available in the socket.
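A crude first-order model shows how these clock deficits might net out. It treats CPU throughput as IPC times clock, using AMD's claimed 10% average IPC gain, and it treats raw shader throughput as shader count times clock. Both are simplifications of ours. The GPU figure in particular ignores the architectural differences between GCN and the older Radeon design, and it assumes the A10-6800K carries the 384 shader processors quoted earlier for Trinity.

```python
IPC_GAIN = 1.10  # AMD's claimed average IPC gain for Steamroller over Piledriver


def relative_cpu(kaveri_ghz, richland_ghz, ipc_gain=IPC_GAIN):
    """Estimated Kaveri CPU throughput relative to Richland (1.0 means parity)."""
    return ipc_gain * kaveri_ghz / richland_ghz


def shader_rate(shader_processors, mhz):
    """Raw shader throughput proxy: shader count times clock."""
    return shader_processors * mhz


# A10-7850K vs. A10-6800K, using the clocks quoted above.
cpu_base = relative_cpu(3.7, 4.1)
cpu_boost = relative_cpu(4.0, 4.4)
gpu_raw = shader_rate(512, 720) / shader_rate(384, 844)

print(f"CPU at base clocks:  {cpu_base:.3f}x")
print(f"CPU at boost clocks: {cpu_boost:.3f}x")
print(f"GPU raw shader rate: {gpu_raw:.3f}x")
```

Under those assumptions, CPU performance lands within about 1% of parity at both base and boost clocks, and the wider GPU keeps a raw throughput edge of roughly 14% despite its lower clock.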
To that end, AMD has expanded support for DDR3-2133 memory speeds across the entire Kaveri desktop lineup. In the Richland lineup, the A10-6800K is the only part with official support for DDR3-2133. The others top out at DDR3-1866.
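The practical stakes are easy to estimate. Here is a small sketch of theoretical peak bandwidth for a dual-channel DDR3 setup, assuming the standard 64-bit (8-byte) width per channel. Real-world throughput will come in lower.

```python
def peak_bandwidth_gbs(mega_transfers, channels=2, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s for a DDR3 memory configuration."""
    return mega_transfers * 1e6 * bytes_per_transfer * channels / 1e9


speeds = (1600, 1866, 2133)
bandwidth = {speed: peak_bandwidth_gbs(speed) for speed in speeds}

for speed in speeds:
    print(f"DDR3-{speed}: {bandwidth[speed]:.1f} GB/s")
```

Stepping up from DDR3-1866 to DDR3-2133 adds about 4.3 GB/s of theoretical bandwidth, or roughly 14%, all of which the CPU and integrated GPU must share.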
Test notes and methods
AMD sent us a complete system with the A8-7600 inside. The machine is based on Xigmatek’s Nebula enclosure, which looks like an overgrown Mini-ITX cube. At 13″ x 10″ x 10″, the case is a little big for a mini build. There’s room inside for full-sized PSUs, larger coolers, and double-wide graphics cards, though.
The case has some nice elements, including chunky aluminum side panels affixed with a nifty, tool-free mechanism. Popping off the walls exposes the guts on two sides.
Inside lies a Gigabyte F2A88XN-WIFI motherboard, an Antec High Current Pro 750W PSU, a Samsung 840 Pro 256GB SSD, and 16GB of AMD’s own Gamer Series DDR3-2133 memory. And the A8-7600, of course. Despite the fact that there’s plenty of headroom inside the case, AMD strapped one of Noctua’s low-profile NH-L9a coolers onto the chip.
With Scott in Las Vegas for CES last week, all of our testing was conducted at TR’s northern outpost. We don’t have access to the test rigs and CPUs in Scott’s lab, so we had to make do with a more limited selection of competitors for the A8-7600.
AMD has positioned the A8-7600 opposite the Core i3-4330. We tested the Core i3 on a Mini-ITX motherboard based on Intel’s Z87 platform. We also tested a couple of 45W APUs based on the last-gen Richland silicon. Both are quad-core models; the A8-6500T is clocked at 2.1/3.1GHz, while the A10-6700T runs at 2.5/3.5GHz. The A10’s higher clock speeds make it the more appropriate foil for the A8-7600, which is clocked at 3.3/3.8GHz in 65W mode and 3.1/3.3GHz in 45W mode.
We tested the A8-7600 in its 45W and 65W modes, both with 2133 MT/s memory. The Core i3 doesn’t officially support memory transfer rates over 1600 MT/s, but our Z87 motherboard does, and it had no problem running a pair of DIMMs with the same frequency as the Kaveri rig. Since all of our testing was conducted using the onboard GPUs, we targeted 2133 MT/s for all the configs.
Richland has an 1866 MT/s default memory speed, and we weren’t able to push the A10-6700T and A8-6500T any higher, perhaps because the T-series parts lack unlocked multipliers. Even when we set a 2133 MT/s transfer rate in the firmware, the system booted at 1866 MT/s or slower. The A10-6700T was happy at 1866 MT/s, but the A8-6500T stubbornly stuck to 1600 MT/s no matter what we tried. The 6500T is supposed to support the higher speed, so a motherboard firmware quirk may be responsible for the issues we encountered.
To fill out the lineup, we added a Core i7-4770K. This is Intel’s fastest Haswell chip, so it’s not a direct competitor for the A8-7600 or any of the AMD APUs we’ve tested. The i7-4770K is meant to provide a familiar frame of reference for the rest of the results.
The timeline for this review was very tight, limiting our ability to test additional configurations. We didn’t even get final drivers from AMD until Friday, so we had to work through the weekend just to get these parts tested. More on Kaveri is coming, though. Scott managed to get his hands on the full-fat, 95W A10-7850K during CES. Look for that chip to make its way through our usual CPU test suite soon.
We ran every test at least three times and reported the median of the scores produced. The test systems were configured like so:
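In case the choice of median over mean seems arbitrary, here is a tiny illustration with made-up scores. The numbers are hypothetical and don't come from our results.

```python
from statistics import mean, median

# Hypothetical scores from three runs of one benchmark; the third run
# was disturbed by a background task.
runs = [41.2, 40.5, 33.0]

reported = median(runs)
average = mean(runs)

print(f"Median: {reported:.1f}")
print(f"Mean:   {average:.1f}")
```

The median simply ignores the one bad run, while the mean gets dragged down by it.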
|Intel Core i7-4770K|
|Platform hub||AMD A88X||Intel
|Memory size||16 GB
Gamer Series DDR3 SDRAM
Vengeance Pro DDR3 SDRAM
|Memory speed||2133 MT/s||1866
|Memory timings||10-11-11-30 2T||9-10-11-27
Catalyst 13.30 RC2
Intel RST 126.96.36.1996
ALC889 with 2.73 drivers
ALC1150 with 2.73 drivers
|Radeon HD 8650D||Radeon
Catalyst 13.30 RC2
840 Pro 256GB
840 Pro 256GB
High Current Pro 750W
We used the following versions of our test applications:
- AIDA64 4.00
- Stream 5.8 64-bit
- SiSoft Sandra 2014.02.20.10
- 7-Zip 9.20 64-bit
- TrueCrypt 7.1a
- Google Chrome 27.0.1453.94 m
- SunSpider 1.02
- Kraken 1.1
- The Panorama Factory 5.3 x64 Edition
- Cinebench R15 64-bit Edition
- LuxMark 2.0
- x264 encoder r2334
- Handbrake 0.9.9.1 64-bit
- Qtbench 0.2.2
- Photoshop CC
- WinZip 18
- Musemage 188.8.131.5263
- Battlefield 4
- Batman: Arkham Origins
- Tomb Raider
- FRAPS 3.5.99
Some further notes on our testing methods:
- The test systems’ Windows desktops were set at 1920×1200 in 32-bit color. Vertical refresh sync (vsync) was disabled in the graphics driver control panel.
- We used a Watts Up Pro digital power meter to capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (The monitor was plugged into a separate outlet.) We measured how each of our test systems used power across a set time period, during which time we encoded a video with x264. All power testing was done with the Antec High Current Pro 750W PSU.
- After consulting with our readers, we’ve decided to enable Windows’ “Balanced” power profile for the bulk of our desktop processor tests, which means power-saving features like SpeedStep and Cool’n’Quiet are operating. (In the past, we only enabled these features for power consumption testing.) Our spot checks demonstrated to us that, typically, there’s no performance penalty for enabling these features on today’s CPUs. If there is a real-world penalty to enabling these features, well, we think that’s worthy of inclusion in our measurements, since the vast majority of desktop processors these days will spend their lives with these features enabled.
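For reference, a series of power readings like the ones the meter logs can be boiled down to energy used with a simple trapezoidal sum. This is a minimal sketch with invented sample data, not our actual logging script.

```python
def energy_wh(samples):
    """Integrate (seconds, watts) samples into watt-hours using the trapezoid rule."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0)
    return joules / 3600


# Hypothetical readings: idle, a burst of encoding, then back to idle.
samples = [(0, 30.0), (10, 60.0), (20, 60.0), (30, 30.0)]

total_wh = energy_wh(samples)
duration_s = samples[-1][0] - samples[0][0]
average_watts = total_wh * 3600 / duration_s

print(f"Energy used: {total_wh:.4f} Wh over {duration_s} s")
print(f"Average draw: {average_watts:.1f} W")
```
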
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
Before diving into our gaming and application tests, we’ll take a moment to look at a handful of lower-level metrics, starting with memory subsystem performance. Keep in mind that only the A8-7600 and the Intel CPUs are running their memory at 2133 MT/s. The A10-6700T config has a slower 1866 MT/s memory speed, and the A8-6500T is limited to 1600 MT/s.
The A8-7600 is a fair bit faster than its Richland-based siblings in our Stream memory bandwidth test. That’s to be expected given the Kaveri chip’s higher clock speeds, especially versus the A8-6500T. The A10-6700T is a closer match for the A8-7600, but it can’t keep up, either.
While Kaveri looks fast versus Richland, it lags well behind the Haswell competition. The Core i3-4330 wrings much higher bandwidth from the same memory setup as the A8-7600.
Dialing back the A8-7600’s thermal envelope has only a minimal impact on memory bandwidth, at least in this test. Let’s see what Sandra has to say.
This multithreaded test measures the bandwidth of all caches on all cores concurrently. The different block sizes step us down from the L1 and L2 caches into L3 and main memory. Notice how the A8-7600’s performance starts to fall off after 64KB, when the test spills out of the L1 cache, and after 4MB, when it exceeds the capacity of the L2 cache and pushes into system memory. Neither Kaveri nor Richland has an integrated L3 cache, so the test hits main memory when it runs out of L2.
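Put another way, the block size tells you which level of the A8-7600's hierarchy is doing the work. The sketch below uses the two drop-off points described above as thresholds. Treating 64KB as the combined L1 capacity is our reading of the results, not a figure from AMD.

```python
L1_TOTAL_KB = 64        # first drop-off point seen in the results
L2_TOTAL_KB = 4 * 1024  # two modules with 2MB of L2 each


def level_for_block(block_kb):
    """Return the level expected to serve a test block of this size on the A8-7600."""
    if block_kb <= L1_TOTAL_KB:
        return "L1"
    if block_kb <= L2_TOTAL_KB:
        return "L2"
    return "main memory"  # no L3 on Kaveri or Richland


for size_kb in (32, 64, 128, 4096, 8192):
    print(f"{size_kb:>5}KB -> {level_for_block(size_kb)}")
```
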
The A8-7600 has higher cache bandwidth than the Richland chips we tested. The Core i3-4330 delivers substantially higher throughput than the A8-7600 at smaller block sizes, though. Those two chips are closely matched from 128KB through 512KB, but the Core i3 slows down as larger block sizes push into its L3 cache. The A8-7600’s larger L2 cache has an edge until the caches are exhausted and the test becomes bound by the system memory interface.
The Core i7-4770K runs away with this test thanks to a combination of higher clock speeds, greater L1 and L2 cache capacity (via additional cores), and a larger L3 cache. Remember that it’s not a direct competitor to the A8-7600 or any of the other contenders.
Next, we’ll look at Sandra’s cache and memory latency test. We used the “in-page random” access pattern to reduce the impact of prefetchers on our measurements. You can read more about this test right here.
Again, the results expose the cache configurations of each chip. This test is single-threaded, so the presence of additional CPU cores doesn’t affect the results. The Core i7-4770K has lower access latencies than the i3-4330 only because of the difference in L3 cache size.
All of the AMD chips perform comparably until the 4MB block size. Starting at that point, the A8-7600 configs exhibit higher latencies than the A10-6700T and A8-6500T. The looser timings required by the A8-7600’s 2133 MT/s memory could explain the difference.
Some quick synthetic math tests
AIDA64 has a collection of synthetic CPU benchmarks, some of which take advantage of the new instructions supported by the latest AMD and Intel CPUs. If you’re curious, this page has details on each test. The CPU PhotoWorxx and Hash tests both employ AVX2 and XOP instructions. So do the FPU Julia and Mandel tests, which also support FMA4 code.
The A8-7600 can’t catch its Core i3 competition in the PhotoWorxx test, and it’s way behind in the two FPU tests. The chip outpaces the Intel duallie in the CPU Hash test, though. That test uses the SHA1 algorithm and runs much faster on Kaveri than it does on Richland. Of course, the A8-7600 also has higher CPU and memory clocks than the A10-6700T and A8-6500T. The tight race between those Richland chips suggests memory bandwidth isn’t a major constraint in the CPU Hash test.
Given the different CPU and memory frequencies of our APU configs, it’s difficult to get a sense of Kaveri’s IPC improvements over Richland. We may have to revisit that topic with more targeted testing in the future. Given the timeline for this review, we elected to spend more time testing actual games and applications. Speaking of which, let’s see how Kaveri’s GCN-derived Radeon handles cutting-edge DirectX 11 titles.
Battlefield 4
Unfortunately, the A8-7600 isn’t part of AMD’s Battlefield 4 bundling promo. The chip runs the game rather well, though, as we learned while blasting through a portion of the single-player campaign’s Shanghai mission. As usual, we tested the game by measuring each frame of animation produced. The uninitiated can start here for an intro to our methods.
We stuck to a 1920×1080 display resolution for all our game testing. Surprisingly, the A8-7600 handled that resolution with medium detail settings.
At these settings, Battlefield 4 runs much better on the A8-7600 than on any of the other configs. All our metrics agree; the Kaveri setups have higher FPS averages, lower 99th percentile frame times, and fewer frames beyond each of our “badness” thresholds. Frame production isn’t silky smooth, as the frame time plot indicates, but it’s a massive improvement over the other solutions.
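For those new to frame-time analysis, the metrics above can all be derived from a list of per-frame render times. Here is a minimal sketch. The sample data, the 50-ms threshold, and the nearest-rank percentile method are illustrative assumptions rather than a description of our exact tooling.

```python
def frame_metrics(frame_times_ms, threshold_ms=50.0):
    """Summarize a capture of per-frame render times given in milliseconds."""
    n = len(frame_times_ms)
    avg_fps = 1000.0 * n / sum(frame_times_ms)

    # 99th percentile by nearest rank: 99% of frames finish at least this fast.
    ordered = sorted(frame_times_ms)
    rank = (99 * n + 99) // 100  # integer ceiling of 0.99 * n
    p99_ms = ordered[rank - 1]

    # Frames slower than the "badness" threshold, and the time spent past it.
    slow = [t for t in frame_times_ms if t > threshold_ms]
    beyond_ms = sum(t - threshold_ms for t in slow)

    return {
        "avg_fps": avg_fps,
        "p99_ms": p99_ms,
        "slow_frames": len(slow),
        "beyond_ms": beyond_ms,
    }


# Hypothetical capture: mostly 30-ms frames with two hitches.
frames = [30.0] * 98 + [60.0, 80.0]
metrics = frame_metrics(frames)

print(f"Average FPS: {metrics['avg_fps']:.1f}")
print(f"99th percentile frame time: {metrics['p99_ms']:.1f} ms")
print(f"Frames beyond 50 ms: {metrics['slow_frames']} ({metrics['beyond_ms']:.0f} ms over)")
```

Note how the FPS average barely registers the two hitches in this made-up capture, while the percentile and threshold numbers flag them plainly. That's why we look at all three.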
The data match my subjective impressions. BF4 may not be especially pretty with medium details, but it’s definitely playable on the A8-7600’s integrated graphics, and there isn’t much of a penalty associated with shifting the chip into 45W mode. That said, 28 FPS is a little on the sluggish side for multiplayer gaming. You may want to lower the in-game detail when playing online, where 64-player servers can generate a lot more on-screen mayhem than the average campaign mission.
We had hoped to test BF4 with multiple memory speeds, but the A8-7600 wouldn’t boot with the DIMMs set to 2400 MT/s. We did manage to get Kaveri running with 1866 MT/s memory, though. That setup dropped the 65W config’s FPS average by two frames per second and increased its 99th percentile frame time by 2.5 milliseconds, both relatively small changes. Those small deltas suggest that the A10-6700T’s deficit is due to more than just its slower memory interface. The A8-7600’s closest Richland-based competition has much higher frame latencies.
Tomb Raider
The Tomb Raider reboot is next. In this game, we ran through the jungle and pilfered a dead man’s bow and arrow. Most of the detail settings were left at the “normal” defaults.
Once again, all our metrics agree that the A8-7600 offers the best performance of the bunch. My seat-of-the-pants impressions concur. Playing on the A8-7600 feels substantially smoother regardless of the TDP configuration. Tomb Raider is still playable on the other setups, but the experience is definitely compromised.
Although the Core i3-4330 and Core i7-4770K are soundly trounced by the A8-7600, the Intel chips are surprisingly competitive with the A10-6700T. Haswell’s onboard GPU can’t take all the credit, though. Integrated graphics performance is highly dependent on memory bandwidth, and the Intel configs are running faster memory than the Richland setups.
Batman: Arkham Origins
Game benchmarking sounds fun until you realize that it involves repeating the same 60-second sequence over and over again. That usually gets old pretty fast. However, after several days of non-stop testing, I’m still not sick of brawling through the “Panorama” challenge map we used to test Batman: Arkham Origins.
The contest is tighter this time, but the end result is the same. The A8-7600 delivers much more fluid frame delivery than the competition.
Yes, there are still spikes in its frame time plot. And no, Arkham Origins doesn’t look exceptional with so much of its eye candy turned off. But the game’s timing-focused combat makes it easy to feel the performance differences between the A8-7600 and its peers. The Richland-based A10-6700 is noticeably slower than the Kaveri configs.
As we’ve seen throughout our gaming tests, the 45W Kaveri config offers nearly all of the gaming performance of the 65W setup. Our metrics show a consistent delta between the two settings, but I couldn’t discern much of a difference while actually playing each game.
The A8-7600 is stuck between Haswell and Richland here. The Core i3-4330 has a considerable lead over the fastest Kaveri config, which in turn has a smaller advantage over the A10-6700T.
TrueCrypt disk encryption
TrueCrypt supports acceleration via Intel’s AES-NI instructions, which also work with Richland and Kaveri. We’ve included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.
In the AES test, the Core i3-4330 slips between the two Kaveri configs. It’s not fast enough to keep up in the Twofish test, though. The A10-6700T isn’t fast enough to keep up with the A8-7600 in either test.
7-Zip file compression and decompression
The first of two compression tests, 7-Zip doesn’t employ any specialized hardware acceleration.
The A8-7600 fares reasonably well here. Its 65W incarnation is only a smidgen behind the Core i3-4330 in the compression test, and it has a comfortable lead over the Intel chip in the decompression test. Capping the chip’s TDP at 45W lowers performance somewhat, but the A8-7600 still performs better than the A10-6700T in the same thermal envelope. With slower CPU and memory frequencies, the A8-6500T continues to bring up the rear.
WinZip file compression and decompression
Unlike 7-Zip, WinZip has built-in OpenCL acceleration. It doesn’t include a benchmark, so we used a stopwatch to time how long it took to compress and decompress 1.5GB of application, MP3, RAW, JPEG, Excel, and text files.
Although the A8-7600 compresses our file set in about the same amount of time as the Core i3-4330, the Intel chip is much faster in the decompression test. Interestingly, the Richland-based A10-6700 is way behind in the compression test but barely off the pace in the decompression test.
Compiling code in GCC
Our resident developer, Bruno Ferreira, helped put together this code compiling test. Qtbench measures the time required to compile the Qt SDK using the GCC compiler. Here’s Bruno’s note about how he built it:
QT SDK 2010.05 – Windows, compiled via the included MinGW port of GCC 4.4.0.
Even though apparently at the time the Linux version had properly working and supported multithreaded compilation, the Windows version had to be somewhat hacked to achieve the same functionality, due to some batch file snafus.
After a working multithreaded compile was obtained (with the number of simultaneous jobs configurable), it was time to get the compile time down from 45m+ to a manageable level. This required severe hacking of the makefiles in order to strip the build down to a more streamlined version that preferably would still compile before hell froze over.
Then some more fiddling was required in order for the test to be flexible about the paths where it was located. Which led to yet more Makefile mangling (the poor thing).
The number of jobs dispatched by the Qtbench script is configurable, and the compiler does some multithreading of its own, so we did some calibration testing to determine the optimal number of jobs for each CPU.
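The calibration step described above amounts to timing the same build at several job counts and keeping the fastest. The Python sketch below shows that idea only; it is not the Qtbench script, and the function names and the sample timings in the usage note are hypothetical.

```python
import subprocess
import time


def time_command(command):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def calibrate_job_count(job_counts, measure):
    """Time a build at each job count and return (best_count, all_timings).

    `measure` is any callable that takes a job count and returns seconds,
    which keeps the selection logic independent of the build tool.
    """
    timings = {jobs: measure(jobs) for jobs in job_counts}
    best = min(timings, key=timings.get)
    return best, timings
```

With a GNU make-style build, `measure` could be something like `lambda jobs: time_command(["make", f"-j{jobs}"])`; the actual Qtbench invocation may differ. Passing the timing routine in as a parameter also makes it easy to repeat each measurement and average the runs.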
Score another one for the Core i3-4330. The A8-7600 takes more than a minute longer to finish our compiling test, and that’s in 65W mode. Lowering the TDP extends the chip’s compiling time by about a minute and a half, putting it even farther behind. At least the A8-7600 has a healthy advantage over the A10-6700T. The gap between the Kaveri chip and its closest Richland competition is large enough to suggest that IPC improvements are partially responsible.
x264 HD video encoding
Our x264 test uses a build of the encoder that supports both AVX2 and FMA instructions. To test, we encoded a one-minute, 1080p .m2ts video using the following options:
--profile high --preset medium --crf 18 --video-filter resize:1280,720 --force-cfr
The source video was obtained from a repository of stock videos on this website. We used the Samsung Earth from Above clip.
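For anyone who wants to script a similar run, the options listed above can be passed to the x264 command-line encoder using its double-hyphen long-option syntax. The Python sketch below is only an illustration: the binary name, the file names, and the helper functions are assumptions, and it is not the harness we used.

```python
import subprocess
import time


def build_x264_command(source, output, x264_binary="x264"):
    """Assemble the argument list for an encode with the options quoted above.

    `source`, `output`, and `x264_binary` are placeholders supplied by the caller.
    """
    return [
        x264_binary,
        "--profile", "high",
        "--preset", "medium",
        "--crf", "18",
        "--video-filter", "resize:1280,720",
        "--force-cfr",
        "-o", output,
        source,
    ]


def time_encode(source, output):
    """Run the encode and return the elapsed wall-clock seconds."""
    command = build_x264_command(source, output)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start
```

Calling `time_encode("clip.m2ts", "clip_720p.mkv")` with hypothetical file names would run the encode once; repeating the run and averaging helps smooth out run-to-run variance.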
The A8-7600 doesn’t quite catch the Core i3-4330 here. The 65W config comes close, but it’s not fast enough.
Even with its TDP dialed back to 45W, the A8-7600 has a considerable edge over the A10-6700T. Kaveri trumps Richland once more, with the handicapped A8-6500T stuck in last place as usual.
Handbrake HD video encoding
Our Handbrake test transcodes a two-and-a-half-minute 1080p H.264 source video into a smaller format defined by the program’s “iPhone & iPod Touch” preset. The latest official version of the encoder is supposed to support OpenCL, but we couldn’t find a way to enable the feature on any of our test systems. Installing each platform’s OpenCL SDK didn’t help, either.
We had to fall back to an older, OpenCL-specific Handbrake build to let the integrated GPUs assist with the encoding process. That build didn’t get along with the Intel processors, though. The OpenCL option was present when we opened the app, but it disappeared after our source file was selected.
It’s a shame the OpenCL build didn’t work on the Intel CPUs, because the A8-7600 and Core i3-4330 are neck-and-neck in the standard test. I’m curious to see if the two chips would still be evenly matched with GPU acceleration thrown into the mix.
Don’t compare the standard and OpenCL-accelerated encoding times to each other. Those sets of results come from Handbrake builds released months apart, so other factors may contribute to the differences—or the relative lack thereof.
Do, however, note that the A8-7600 leads the A10-6700T by another wide margin.
The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. We asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.
Another test, another example of the A8-7600 failing to catch the Core i3-4330 but managing to stay comfortably ahead of the A10-6700T.
Photoshop CC’s smart sharpen filter uses OpenCL for noise reduction. We used a stopwatch to time how long it took each system to sharpen an 18-megapixel RAW image file.
The A8-7600 65W takes nearly 50% longer than the Core i3-4330 to sharpen our test image. Lowering the thermal envelope to 45W adds another two seconds to the filter’s execution time, which puts the Kaveri-based chip behind the A10-6700T. The A10 chip has a 45W TDP, too, but it has a higher peak Turbo speed than the 45W Kaveri setup. That 200MHz advantage is enough for the A10-6700T to steal a small victory over its successor.
Like Photoshop, Musemage is an image editing application with OpenCL acceleration. We used the built-in benchmark, which applies a series of filters to an image before producing an overall score.
AMD runs the table here. The A8-7600 nearly doubles the performance of the Core i3-4330, and it’s way ahead of the i7-4770K. All of the APUs, including even the A8-6500T saddled with 1600 MT/s memory, manage to beat the Intel CPUs in this test.
This time around, the A10-6700T’s Turbo advantage over the A8-7600 45W isn’t enough to tip the scales in Richland’s favor. Kaveri’s IPC enhancements and higher memory speed keep the A8-7600 ahead of its 45W predecessor.
Because LuxMark uses OpenCL, we can use it to test both GPU and CPU performance. OpenCL code is by nature parallelized and relies on a real-time compiler, so it should adapt well to new instructions. For instance, Intel and AMD offer integrated client drivers for OpenCL on x86 processors, and they both support AVX. The AMD APP driver even supports Bulldozer’s and Piledriver’s distinctive instructions, FMA4 and XOP. We used the Intel ICD on the Intel processors and the AMD ICD on the AMD chips.
Interesting. Intel has a clear advantage in CPU performance, while AMD has the edge in GPU horsepower. When both components are working together to render the scene, the scales tip in Intel’s favor. The Core i3-4330’s faster CPU cores are just too much for the A8-7600’s integrated Radeon to overcome.
To AMD’s credit, the A8-7600 scores better than the A10-6700T. The difference in CPU performance is relatively small, but the gaps in the GPU and combined tests are huge. Some of Kaveri’s advantage there probably comes courtesy of its faster memory interface. The deltas between the A10-6700T and A8-6500T suggest that the GPU and combined tests are particularly sensitive to memory bandwidth.
The Cinebench benchmark is based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.
The A8-7600 goes zero for two in Cinebench. Its single-threaded performance is substantially slower than that of the Core i3-4330, and the multithreaded test doesn’t provide much relief.
The multithreaded test gives Kaveri a chance to beat up on Richland a little, though. In that test, the A8-7600 has a big lead over the A10-6700. The difference between the two chips is much smaller in the single-threaded test.
Power consumption and efficiency
Our workload for this test is encoding a video with x264, based on a command ripped straight from the x264 benchmark earlier in the review. The first graph below shows system power consumption over the duration of the test.
The A8-7600 completes the encoding workload much more quickly than the Richland-based APUs. It also has much higher peak power consumption during the encoding process, but there’s little difference in idle power draw between the AMD offerings. Meanwhile, the Core i3-4330 has lower idle and peak power consumption than anything in the AMD camp. And it finishes encoding the video file faster than the competition, too.
Note that the 45W and 65W Kaveri configs have identical idle power consumption. The lower TDP limit cuts the system’s peak power consumption by almost exactly 20W, which is what we’d expect.
We can quantify efficiency by looking at the amount of energy used, in kilojoules, during the entirety of our test period, when the chips are busy and at idle.
Perhaps our best measure of CPU power efficiency is task energy: the amount of energy used while encoding our video. This measure rewards CPUs for finishing the job sooner, but it doesn’t account for power draw at idle.
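Both measures come down to integrating power over time: watts multiplied by seconds gives joules. The Python sketch below shows the arithmetic with the trapezoidal rule. The function names and the sample trace are invented for illustration and are not our measured data.

```python
def energy_kilojoules(samples):
    """Integrate (seconds, watts) samples with the trapezoidal rule; returns kJ."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        # Average power over the interval times the interval length.
        joules += (w0 + w1) / 2.0 * (t1 - t0)
    return joules / 1000.0


def task_energy_kilojoules(samples, task_start_s, task_end_s):
    """Energy used only while the task runs (samples inside the time window)."""
    window = [(t, w) for t, w in samples if task_start_s <= t <= task_end_s]
    return energy_kilojoules(window)


# Hypothetical trace: 30 W at idle, 80 W while encoding from 10 s to 40 s.
example_trace = [(0, 30), (10, 30), (10, 80), (40, 80), (40, 30), (60, 30)]

full_period_kj = energy_kilojoules(example_trace)           # whole 60 s window
task_kj = task_energy_kilojoules(example_trace, 10, 40)     # encode only
```

In this made-up trace, the full period works out to 3.3 kJ and the encode alone to 2.4 kJ. A chip that finishes sooner shortens the high-power stretch, which is why a faster processor with a higher peak draw can still come out ahead on either measure.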
With a slower encoding time and higher power consumption, the A8-7600 is less energy efficient than the Core i3-4330. Depending on the configuration, it requires 60-70% more energy to complete the same encoding task.
The A8-7600 and A10-6700T are pretty closely matched on the efficiency front. The Richland chip’s task energy is comparable to that of the two Kaveri configs. And, since the A8-7600 finishes the encode and returns to idle faster, it consumes less energy over the full test period.
AMD’s Kaveri APU raises the bar for integrated graphics performance, which is sort of what we expected. What did you think would happen when AMD built a processor infused with GCN-class Radeon hardware?
To be honest, I didn’t expect something that plays Battlefield 4 as well as the A8-7600. Despite sporting a cut-down version of Kaveri’s integrated GPU, the A8-7600 still pumps out playable frame rates at 1080p resolution with medium details. And it’s powerful enough to handle other big-name DirectX 11 titles, too. Some in-game eye candy has to be disabled to get playable frame rates, of course, but that’s true for all integrated graphics implementations. The fact is the A8-7600 delivers a better overall experience with fewer compromises than direct rivals based on AMD’s older Richland chips and Intel’s latest Haswell parts.
Kaveri’s potent onboard Radeon also has benefits beyond gaming. General-purpose computing applications can leverage the graphics hardware to tackle less trivial tasks, and HSA could make things really interesting down the road. Right now, though, the A8-7600 doesn’t have a clear advantage in so-called “accelerated” applications. The results of our OpenCL-accelerated tests were mixed, and they highlight the fact that the GPU is only one part of the processor. Kaveri’s Steamroller cores have to hold up their end of the bargain, too.
To AMD’s credit, Steamroller appears to have higher per-clock performance than the Piledriver cores familiar from Richland. Kaveri’s advantage seems to be especially prominent in multithreaded tests, and it’s nice to see the company making progress on the CPU performance front. There’s more work to be done, though. The Core i3-4330 beat the A8-7600 in the bulk of our non-gaming tests, including most of the multithreaded ones. Intel continues to have advantages in single-threaded performance and power efficiency, as well.
We’ve seen this dynamic with previous APUs, and it’s always made for a tough sell on the desktop. Gamers who actually care about graphics performance are better off with discrete video cards that deliver better visuals and smoother frame delivery, while those who don’t care about gaming are better served by Intel chips with higher per-thread performance and lower power consumption (which typically leads to lower noise levels). APUs occupy this awkward middle ground for so-called casual gamers who want something better than an Intel IGP but not as good as a halfway-decent graphics card. As Jerry Seinfeld would say, “who are these people?” Seriously, I’ve never met one.
Now, Kaveri may be a questionable proposition for traditional desktops, but it has some appeal everywhere a discrete graphics card isn’t an option. Small-form-factor and all-in-one rigs seem particularly ripe for an APU like the A8-7600, which could bring a dose of graphics grunt to machines that typically offer poor gaming experiences.