AMD has been surrounded by a fair amount of gloom for the past couple of years, but the firm’s low-cost and low-power Brazos platform has been a consistent bright spot in spite of everything. The E-series APUs based on Brazos have saturated the low end of the laptop market, helping to send traditional, functionally hobbled netbooks to their doom. AMD’s new leadership has repeatedly spoken about the virtues of Brazos as a business. They like it because it’s high-volume—you tend to move a lot of chips when you sell ’em cheap—and because Intel hasn’t competed vigorously against Brazos, apparently for fear of eating into its low-end Core i3 business.
The follow-up to Brazos is a single chip, known by the twin code names Kabini and Temash, that packs four CPU cores, a miniature Radeon graphics processor, and everything else you need for a functional PC onto a tiny slice of silicon. The true competition from Intel will be the Bay Trail part based on the Silvermont architecture, but it’s not slated to arrive until later this year. In the meantime, AMD will have something truly distinctive to offer: a quad-core SoC that’s fully PC compatible, very affordable, and fits into various sorts of sleek, slim systems. Kabini will aim for laptops—think ultra-thin systems with long battery life for under 500 bucks—and low-cost desktops. Meanwhile, Temash will target tablets between 10.1″ and 11.6″ in size that are roughly 10mm thick, maybe a bit less. Imagine a tablet that sits between the Microsoft Surface and Surface Pro in price, size, and performance and you’ll have the basic idea.
A true PC system on a chip
Some time in the past couple of years, pretty much everybody in the PC industry started calling their CPUs “SoCs” or systems-on-a-chip. It’s trendy, sounds like what Apple does, and is therefore entirely irresistible within a 100-mile radius of the San Francisco Bay. Although the definition of a “real” SoC is a little wobbly, Kabini/Temash may have the best claim yet to being the first true PC SoC.
A perspective-heavy block diagram of Kabini/Temash. Source: AMD.
Naturally, then, this chip packs in a ton of components. The headliners are undoubtedly the four “Jaguar” CPU cores, based on an evolution of the Bobcat microarchitecture used in Brazos, and the integrated graphics processor, which is derived from the same Graphics Core Next (GCN) architecture as the Radeon HD 7000-series discrete GPUs. The GPU includes a UVD media processing block capable of H.264 decoding and encoding, of course, and the chip’s north bridge acts as a traffic cop, routing requests to the SoC’s single-channel DDR3 memory controller.
All of those elements might be familiar from past AMD APUs, but this SoC also incorporates all of the I/O functionality that has traditionally been built into a separate south bridge chip or “Fusion controller hub,” as AMD calls it. Branching out from Kabini are four PCI Express x1 links, two SATA 6Gbps disk interfaces, an SD card controller, two USB 3.0 ports, eight USB 2.0 connections, a gaggle of display interfaces including HDMI and DisplayPort 1.2, and a dedicated four-lane connection for an optional discrete GPU. Oh, and legacy I/O like keyboard ports and such are in there, too. Makes you wonder if there aren’t secret connections for a Turbo button and an EISA card.
Integrating all of these things together on one chip saves power, reduces the physical footprint of a system, and cuts costs, too. That’s why we keep seeing more and more integration over time. Kabini simply takes that concept to a logical endpoint by bringing aboard pretty much an entire small-scale PC.
The key words above, by the way, are “small-scale.” AMD tells us these chips are being manufactured by two different foundry partners, TSMC and GlobalFoundries, at 28-nm process geometries. We haven’t managed to wrangle the chip’s exact transistor count or die size yet, but I’ve held one in my hand, and it’s tiny. There has been some talk about how this SoC is closely related to the chips going into the PlayStation 4 and Xbox One—and it is, quite closely—but Kabini is scaled way down. The PS4, for instance, has eight Jaguar cores and 1152 GCN shader ALUs, while the Xbox One reportedly has eight cores and 768 shader ALUs. This chip has four cores and 128 shader ALUs. The memory bandwidth disparity is similarly huge between Kabini and the consoles, more than an order of magnitude. Although they share quite a bit of DNA, Kabini and Temash are aimed at much lower cost and power targets than the chips AMD has built for Sony and Microsoft.
The Jaguar core
Although its bigger CPUs haven’t been as competitive as hoped lately, AMD has had a nice run with the Bobcat core used in the Brazos platform. Bobcat came out of the gate using out-of-order execution and only one thread per core, and as a result, it was about 20% faster than the Atom in our tests, especially in cases where applications weren’t readily multithreaded. Now, Intel has committed to a similar template for the upcoming, all-new Silvermont Atom architecture, with an emphasis on improving per-thread performance. Meanwhile, AMD has revised its low-power microarchitecture in a multitude of ways both big and small, and the result is the evolutionary step known as Jaguar.
Jaguar brings a few principal improvements over the prior generation in terms of power efficiency and performance, which are essentially two sides of the same coin these days. A host of tweaks throughout the core has produced a 22% gain in instruction throughput per clock, although that gain is more like 15% if you don’t factor in the impact of the larger L2 cache. Either way, the generational advancements are substantial. Also, Jaguar has been retooled for better frequency-voltage response, in part via the addition of a couple of pipeline stages, so the chip should consume less power at a given clock speed. Finally, the core has been tweaked for better power efficiency in other ways, too, including some unit redesigns and an expansion of the ability to gate off the clock signal from portions of the chip that are currently idle.
Even greater performance increases are possible by harnessing extensions to the x86 instruction set, and Jaguar adds support for a whole range of those, including the SIMD alphabet soup that is SSE 4.1, SSE 4.2, and AVX. Also supported are AES-NI encryption acceleration and F16C format conversions. Other new features suggest Jaguar may find its way into server systems soon, including the expansion of physical addressing to 40 bits and the better support for OS virtualization.
Functional block diagram of the Jaguar core. Source: AMD.
Above is a functional block diagram of the Jaguar architecture. Although there are tweaks throughout the core that contribute to the IPC gains, the most sweeping changes are reserved for the floating-point unit, which is a total redesign. The new FPU is 128 bits wide, twice the width of Bobcat, and is responsible for executing many of those extended SIMD instructions like SSE and AVX. With single-precision datatypes, the execution hardware can perform four multiplies and four adds per cycle. For double-precision math, the rate is one multiply and two adds per clock.
AMD says support for 256-bit wide AVX extensions is achieved by “double-pumping” the 128-bit execution units. In this case, “double pumping” means data are fed through the units in two passes, but the units do not run at twice the base clock frequency, as the Pentium 4’s integer ALUs did.
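For a rough sense of scale, the per-cycle rates above can be turned into theoretical peak numbers. The sketch below is back-of-the-envelope arithmetic only: the 1.5GHz clock is the A4-5000's rated speed, the function names are ours, and real code will land well short of these peaks.

```python
# Theoretical peak FPU throughput from the per-cycle rates quoted above.
# Illustrative arithmetic only; the function names are made up for this sketch.

SP_FLOPS_PER_CYCLE = 4 + 4  # four multiplies + four adds, single precision
DP_FLOPS_PER_CYCLE = 1 + 2  # one multiply + two adds, double precision


def peak_gflops(flops_per_cycle: int, clock_ghz: float, cores: int) -> float:
    """Peak GFLOPS = operations per cycle x clock in GHz x core count."""
    return flops_per_cycle * clock_ghz * cores


def passes_needed(vector_bits: int, unit_bits: int = 128) -> int:
    """Passes through a unit_bits-wide datapath needed for one vector op."""
    return -(-vector_bits // unit_bits)  # ceiling division


if __name__ == "__main__":
    print("SP peak:", peak_gflops(SP_FLOPS_PER_CYCLE, 1.5, 4), "GFLOPS")
    print("DP peak:", peak_gflops(DP_FLOPS_PER_CYCLE, 1.5, 4), "GFLOPS")
    print("Passes for a 256-bit AVX op:", passes_needed(256))
```

The `passes_needed` helper is just the "double-pumping" idea in miniature: a 256-bit operation on 128-bit hardware takes two trips through the unit, so wider AVX vectors don't raise the per-cycle peak.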
The physical floorplan of a single Jaguar core. Source: AMD.
Kabini’s four revised Jaguar cores are fed by a 2MB L2 cache shared via a common interface that connects to each core individually. Sharing a cache in this way has several benefits. In light workloads where one or more cores are inactive, the busy cores will effectively have more L2 cache capacity available to them, improving per-thread performance. Meanwhile, because the L2 cache replicates the contents of the cores’ L1 caches, the L2 can act as a probe filter for coherency traffic, facilitating more efficient multitasking.
AMD has put some work into the L2 interface, which makes sense since it’s the cores’ only path to the rest of the system. The L2 interface runs at the full speed of the CPU cores and has built-in smarts, including the ability to store L2 tags, so it knows which portion of the cache to light up when the time comes to access one of its four 512KB banks. When those L2 cache banks aren’t needed, they’re clock gated to save power. AMD further conserves power by clocking the L2 arrays at half the frequency of the CPU cores.
GCN graphics scaled down
Kabini is the first APU to incorporate a graphics block based on AMD’s GCN architecture, and this addition grants the SoC a rich suite of graphics- and compute-focused features, including support for the DirectX 11.1 graphics API and the OpenCL media and compute API. Also, crucially, the presence of GCN hardware makes Kabini compatible with the latest AMD Catalyst graphics drivers, which should translate into solid compatibility with the latest applications and relatively frequent driver updates.
Kabini’s integrated graphics processor – logical diagram. Source: AMD.
As we’ve already noted, Kabini’s graphics have been scaled down pretty massively in order to fit into the power envelopes in question. The chip has only two GCN compute units, or CUs, with a total of 128 shader ALUs and eight texels per clock of filtering capacity. A single render back-end offers four pixels per clock of blending throughput. For this class of product, these choices are sensible, but we’ve already established how much grander the scale is in this chip’s console siblings. Compare, also, to the Radeon HD 7790; that $149 graphics card has 14 compute units, for a total of 896 shader ALUs, and 16 pixels per clock of ROP throughput. So, you know, don’t expect the world from Kabini’s graphics, even though they’re likely to be the best in their class.
Heck, I’m a little surprised AMD was able to squeeze its full-fat desktop GPU architecture into a chip of this class in any form. AMD has made a few adjustments to adapt GCN to this sort of deployment. This is, in fact, a newer version of GCN than you’ll find in most current Radeons; it includes some instructions to facilitate memory sharing between the CPU and IGP. Also, in Kabini, the number of banks in the local data share in each CU has been reduced from 32 to 16. Beyond that, as far as we know, the only other power-saving measures are at the physical design level, where transistor selection was optimized for low-power operation.
Power management in an SoC like this one is paramount. As AMD’s Sam Naffziger told us, if all portions of Kabini were turned on at once in a typical system, it would drain the battery in less than an hour and probably melt part of the system case. These chips can only fit into their prescribed power envelopes by constantly adjusting themselves.
To that end, Kabini has a fairly sophisticated power management setup, similar in basic capability to what’s built into AMD’s larger Trinity and Richland mobile APUs. At its heart is an onboard 32-bit microcontroller with its own memory that takes inputs from a range of sources across the chip, including power monitors in the CPU cores, the GPU, the display interface, and the FCH. The power controller estimates total power use based on activity and can require an individual unit to ramp down its voltage and clock frequency in order to prevent the chip from exceeding its power budget and overheating.
Power sharing opportunities across the chip. Source: AMD.
Kabini includes power gates for each of its four cores and for its IGP, so power to these sections of the chip can be turned off entirely when one of those entities is idle. The combination of dynamic voltage and frequency scaling, power gating, and intelligent monitoring of power opens up opportunities to share power headroom between different portions of the chip via a mechanism AMD calls Turbo Core. The concept is straightforward: if the GPU is idle and the CPU cores have work to do, the GPU can be shut down and its power budget shifted to the CPU cores. With more headroom, then, the CPU cores can increase voltage and frequency beyond their usual limits.
The user experience is often dominated by the performance of a single core. Source: AMD.
Similarly, a single active CPU core could borrow headroom from its inactive neighbors to range up to higher clocks temporarily. This provision can increase single-thread performance and improve the user’s sense of system responsiveness.
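To make the power-sharing idea concrete, here is a toy model of it. This is not AMD's algorithm: the unit names, the wattages, the per-unit cap, and the proportional redistribution policy are all assumptions we picked for illustration.

```python
# Toy model of power-budget sharing between units on an SoC.
# NOT AMD's Turbo Core algorithm: every number here, the per-unit cap, and
# the proportional redistribution policy are hypothetical.

def share_budget(tdp_w, nominal_w, active, unit_cap_w=None):
    """Give each active unit its nominal power plus a proportional share of
    the headroom freed by power-gated (inactive) units.

    tdp_w      -- total chip power budget in watts
    nominal_w  -- dict of unit name -> nominal allocation in watts
    active     -- set of unit names that currently have work to do
    unit_cap_w -- optional ceiling on what any single unit may draw
    """
    active_nominal = sum(w for unit, w in nominal_w.items() if unit in active)
    if active_nominal == 0:
        return {unit: 0.0 for unit in nominal_w}
    headroom = tdp_w - active_nominal  # budget freed by the gated units
    result = {}
    for unit, w in nominal_w.items():
        if unit not in active:
            result[unit] = 0.0  # power gated: draws nothing in this model
            continue
        boosted = w + headroom * w / active_nominal
        if unit_cap_w is not None:
            boosted = min(boosted, unit_cap_w)
        result[unit] = boosted
    return result


if __name__ == "__main__":
    nominal = {"cpu0": 2.0, "cpu1": 2.0, "cpu2": 2.0, "cpu3": 2.0, "gpu": 7.0}
    # GPU idle, all four cores busy: the cores split the GPU's budget.
    print(share_budget(15.0, nominal, {"cpu0", "cpu1", "cpu2", "cpu3"}))
    # One busy core borrows from its idle neighbors, up to a per-unit cap.
    print(share_budget(15.0, nominal, {"cpu0"}, unit_cap_w=5.0))
```

A real controller works from activity-based power estimates and thermal limits rather than fixed nominal figures, but the bookkeeping is the same in spirit: whatever the idle blocks aren't using becomes headroom for the busy ones.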
All good in theory, right? The strange thing here is that, at launch, only a single model of Temash supports Turbo Core: the 8W A6-1450 quad-core intended for tablets. None of the Kabini-derived parts do. They all can save power via Kabini’s DVFS scheme, but they can’t shift power around to extract more performance headroom.
AMD does have another trick up its sleeve, also named after forced induction, that may offset any loss of performance from the lack of Turbo Core: it’s called Turbo Dock. This feature is intended for dockable tablets like the Asus Transformer series. When the tablet is detached from its keyboard dock, the SoC inside will operate under the burden of a relatively low power limit, to preserve battery life and reduce heat. Once the tablet is docked, the TDP limit is raised, potentially to twice the limit of slate mode. AMD expects to achieve up to ~30% higher performance via this trick. We’ll have to see it in action in an actual product, of course, but the theory sounds good.
The products: A- and E-series APUs
Kabini and Temash are code names. Out in the market, AMD will use a different nomenclature to refer to these chips. The Kabini lineup will officially be known as the 2013 AMD Mainstream APU Platform, and it will include both A-series and E-series models, as outlined in the table below:
AMD says the A6-5200 will compete against Intel’s low-end Core i3 processors, while the A4-5000 will go up against slower Pentium models, and the E-series offerings will stack up against even slower Celerons. The company claims Kabini “completely outclasses” those Pentium processors, and it expects the chip to “dominate” the low-end notebook market.
While those are bold words, AMD touted similar positioning with its original Brazos platform. That was two years ago, of course. Today’s Pentiums and Celerons are faster, and Brazos is in no position to match them in a fair fight. Kabini’s extra performance should be instrumental in helping AMD recapture lost ground there.
The Temash lineup is known as the 2013 AMD Elite Mobility Platform, and it only includes three A-series models:
In tablets, all three of the A-series chips above will sit between the Atom and the Core i3. In other words, Windows 8 tablets based on Temash should be faster and a little more power-hungry than Atom tablets, but they shouldn’t be quite as big, bulky, or expensive as Core-powered slates like the Microsoft Surface Pro. That device isn’t so much a tablet as the open-faced sandwich version of an ultrabook.
The A4-1200 has a tighter power envelope than both of AMD’s existing tablet chips, the Z-60 and Z-01. Those offerings also have dual 1GHz cores, but they’re rated for TDPs of 4.5W and 5.9W, respectively. They also have fewer shader ALUs (80) and less L2 cache (only 1MB) than the new A-series parts. Considering Jaguar’s IPC improvements, it’s fair to expect the A4-1200 to deliver better CPU performance, better graphics performance, and better battery life than the Z-60 and Z-01.
Temash will also appear in what AMD calls “small screen touch notebooks.” In those systems, the chips will again slot in between the Atom and Core series, but they’ll have direct competition from low-end Pentium and Celeron processors.
The Kabini whitebook
Our sample Kabini system was a 13-inch notebook PC powered by the fastest 15W variant of Kabini, the A4-5000. This notebook isn’t a production system. Rather, it’s a “whitebook” bereft of corporate branding and assembled solely for testing purposes.
The system’s 13″ display lacks touch-screen capabilities, but it has a 1080p resolution, a matte coating, and what looks to be an IPS panel. Inside the chassis, there’s a single 4GB DDR3 DIMM, a 1TB Hitachi hard drive with a 5,400-RPM spindle speed, and a 45Wh battery. Connectivity includes USB 3.0, DisplayPort, VGA, and Ethernet.
At 3.83 lbs and 0.87″ thick, this thing is a little heavier and thicker than your average ultrabook. It’s still very thin and light, though, and AMD tells us that similar configurations could cost just $499 out in the wild. That would be a good $100-200 more affordable than the cheapest ultrabooks.
We asked AMD whether this whitebook was representative of a typical Kabini configuration. We were told that it’s “in the middle of what you might see.” The company expects the most inexpensive Kabini notebooks to be priced at just $399. Thanks to the processor’s tight power envelope, PC makers will have a wide range of display sizes to choose from—and there will no doubt be some touch screens in the mix, as well.
Our testing methods
We compared the performance of the A4-5000 whitebook to that of four systems:
- A premium ultrabook, the Asus Zenbook Prime UX31A, which has a 17W Core i5 processor and is priced at $1,100 right now. Retail notebooks based on the A4-5000 shouldn’t cost anywhere near that much, but the Zenbook Prime gives us a high-water mark for performance in the ultrathin category.
- A low-end ultrathin laptop, the Asus VivoBook X202E. This system has a 17W Core i3 CPU backed by single-channel memory, and it costs $399. In terms of both pricing and performance, this should be one of the most direct competitors to upcoming laptops based on the A4-5000.
- An Atom-powered Windows 8 tablet, the Asus VivoTab Smart ME400C, which is priced just south of $430. This is one of the lowest-power Windows 8 systems on the market today. Its Atom Z2760 processor manages to squeeze dual 1.8GHz, Hyper-Threaded cores into a Lilliputian 1.7W power envelope. The ME400C is obviously not in the same league as the A4 whitebook, but it provides us with a performance baseline for an ultra-low-power x86 config.
- A Mini-ITX desktop build based on AMD’s E-350 mobile APU. The E-350 is the A4-5000’s predecessor. It has two Bobcat cores, integrated Radeon HD 6000-series graphics, and an 18W thermal envelope. We were hoping to procure a notebook based on the E-350 (or the slightly quicker E2-1800) to run battery life comparisons, but we weren’t able to get one in time for our review. This desktop build is the next best thing; it will let us see how much Kabini has moved the ball forward.
You’ll find the full specs of those machines in the table below.
One more thing to note: the Atom Z2760 processor doesn’t run 64-bit software. That’s not a deficiency of the silicon; rather, it’s a product segmentation move by Intel. Either way, we had to test our tablet using 32-bit versions of our benchmark apps. In instances where those apps were available in both 32-bit and 64-bit versions, we tested 32-bit builds on the Atom, 64-bit builds on the Core processors, and, in order to provide a frame of reference, both 32-bit and 64-bit builds on the A4-5000. In such instances where multiple versions of the same benchmark were run, you’ll see 32-bit runs labeled clearly in the graphs.
We ran every test at least three times and reported the median of the scores produced. The test systems were configured like so:
|System||Asus ME400C||Asus UX31A||Asus X202E||AMD A4-5000 whitebook||Gigabyte E350N-USB3 test system|
|Processor||Intel Atom Z2760 1.8GHz||Intel Core i5-3317U 1.7GHz||Intel Core i3-3217U 1.8GHz||AMD A4-5000 1.5GHz||AMD E-350 1.6GHz|
|Platform hub||Integrated||Intel HM76 Express||Intel HM76 Express||Integrated||AMD Hudson M1|
|Memory size||2GB||4GB (2 SO-DIMMs)||4GB (1 SO-DIMM)||4GB (1 SO-DIMM)||4GB (2 DIMMs)|
|Memory type||LPDDR2 SDRAM at 800MHz||DDR3 SDRAM at 1600MHz||DDR3 SDRAM at 1333MHz||DDR3 SDRAM at 1600MHz||DDR3 SDRAM at 1066MHz|
|Audio||Intel SST codec with 6.2.9200.25166 drivers||Realtek codec with 22.214.171.12410 drivers||Via codec with 126.96.36.1990 drivers||Conexant HD audio with 188.8.131.52 drivers||Realtek ALC892 with 6.2.9200.16497 drivers|
|Graphics||Intel Graphics Media Accelerator with 184.108.40.2069 drivers||Intel HD Graphics 4000 with 220.127.116.1171 drivers||Intel HD Graphics 4000 with 18.104.22.16871 drivers||Radeon HD 8330 with 13.101-130507a-156998E drivers||Radeon HD 6310 with Catalyst 13.5 beta drivers|
|Hard drive||SEM64G 64GB SSD||Adata XM11 128GB SSD||HGST Z5K500 500GB 5,400-RPM||Toshiba MQ01ABD100H 1TB 5,400-RPM||Crucial m4 256GB SSD|
|Operating system||Windows 8 x86||Windows 8 Enterprise x64||Windows 8 x64||Windows 8 Pro x64||Windows 8 Pro x64|
Thanks to AMD and Asus for providing our test systems.
We used the following versions of our test applications:
- AIDA64 pre-3.0 build 2446
- HandBrake svn5436 (OpenCL build)
- Stream 5.8 64-bit
- SiSoft Sandra 2013.SP3
- 7-Zip 9.22
- TrueCrypt 7.1a
- Chrome 26.0.1410.67
- Chromium 20.0.1096.0
- SunSpider 1.0
- The Panorama Factory 5.3.2807
- LuxMark 2.0
- Musemage 1.9.5235
- WinZip 17.5
- x264 r2310
- AMD APP SDK 2.8
- Battlefield 3
- The Elder Scrolls V: Skyrim
- FRAPS 3.5.9
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
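For anyone reproducing our numbers, the aggregation step described earlier (at least three runs, median reported) amounts to something like the sketch below. The sample scores are invented, and the function name is ours.

```python
from statistics import median


def reported_score(runs):
    """Return the median of repeated benchmark runs (at least three)."""
    if len(runs) < 3:
        raise ValueError("need at least three runs")
    return median(runs)


if __name__ == "__main__":
    # Hypothetical scores from three runs of one benchmark.
    print(reported_score([101.2, 98.7, 99.9]))
```

Using the median rather than the mean keeps one fluke run, such as a background task kicking in, from dragging the reported figure around.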
Memory subsystem performance
As always, we’ll begin our performance comparison with some synthetic tests. The first of those is a simple measure of memory bandwidth.
The A4-5000 and E-350 are both limited to single-channel DDR3 memory, although the A4 supports higher speeds—1600MHz instead of 1066MHz. The Core i3 and Core i5 both support dual-channel DDR3-1600, but in the systems we tested, only the Core i5 machine had both of its memory channels populated.
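As a reference point for the measured results, the theoretical peaks for these configurations are easy to work out. This sketch assumes the standard 64-bit DDR3 channel width and treats the quoted memory speeds as transfer rates; the function name is ours.

```python
def peak_bandwidth_gbs(transfer_rate_mts: int, channels: int = 1,
                       bus_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes per second).

    transfer_rate_mts -- millions of transfers per second (e.g. 1600)
    channels          -- number of populated memory channels
    bus_bits          -- channel width; 64 bits is assumed for DDR3
    """
    bytes_per_transfer = bus_bits // 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


if __name__ == "__main__":
    configs = {
        "E-350, single-channel DDR3-1066": (1066, 1),
        "A4-5000, single-channel DDR3-1600": (1600, 1),
        "Dual-channel DDR3-1600": (1600, 2),
    }
    for name, (rate, channels) in configs.items():
        print(f"{name}: {peak_bandwidth_gbs(rate, channels):.1f} GB/s")
```

Measured bandwidth always comes in below these ceilings, so the interesting question is how close each memory controller gets to its own theoretical limit.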
The A4-5000 enjoys a sizable boost in memory bandwidth over the E-350 thanks to the higher memory frequency. However, the Core i3 seems to be able to extract more bandwidth out of a single DDR3-1600 SO-DIMM than the A4 does. A look at cache performance should give us some insight as to why.
SiSoft Sandra’s more elaborate memory and cache bandwidth test is multithreaded, so it captures the bandwidth of all caches on all cores concurrently. The different test block sizes step us down from the L1 and L2 caches into main memory.
The A4’s caches keep up with the Core i3’s until the 256KB block size, after which they fall behind. If you read our architectural exposé earlier, you’ll know Kabini has four 64KB blocks of L1 cache (one per core, split evenly between instructions and data) supplemented by a 2MB pool of L2 cache, which is shared among the cores. Since this test is multithreaded and uses up all 256KB of Kabini’s L1 caches before moving on to the L2, the results suggest AMD’s newcomer has faster L1 but slower L2 cache than the Core i3. (Remember, the Core i3 is a dual-core processor, so we’re comparing two Ivy Bridge cores to a quad-core Kabini module.) The L2 performance picture isn’t entirely surprising, since Kabini’s L2 cache runs at half the CPU clock speed.
In any case, the A4’s L1 and L2 cache performance is considerably higher than that of the E-350. The E-350 even falls behind the Atom, which has less than a tenth the power envelope, in this test. Of course, both the E-350 and the Atom have only two cores, while the A4-5000 has four.
Sandra also includes a new latency testing tool. SiSoft has a nice write-up on it, for those who are interested. We used the “in-page random” access pattern to reduce the impact of prefetchers on our measurements. We’ve also taken to reporting the results in terms of CPU cycles, which is how this tool returns them. The problem with translating these results into nanoseconds, as we’ve done in the past with latency measurements, is that we don’t always know the clock speed of the CPU, which can vary depending on Turbo responses.
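The conversion we're avoiding is simple enough on paper; the catch is the clock-speed input. A minimal sketch, with a made-up cycle count and two made-up clock speeds standing in for a chip at its base and boosted frequencies:

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in CPU cycles to nanoseconds at a given clock."""
    if clock_ghz <= 0:
        raise ValueError("clock speed must be positive")
    return cycles / clock_ghz  # one cycle lasts 1/GHz nanoseconds


if __name__ == "__main__":
    # Hypothetical 30-cycle access, evaluated at two possible clock speeds.
    for ghz in (1.5, 3.0):
        print(f"30 cycles at {ghz}GHz = {cycles_to_ns(30, ghz):.1f} ns")
```

The same cycle count maps to very different wall-clock latencies depending on which frequency the core happened to be running at, which is why we stick with cycles.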
This test isn’t multithreaded, which explains why the A4’s latency goes up at the 32KB block size and then again above 2MB. The E-350 exhibits a similar pattern but latency rises right after the 512KB block size. That’s because it has half as much L2 cache (only 1MB), and that cache is split between the two cores. Each core therefore has only 512KB of L2 at its disposal.
Synthetic CPU performance
The latest version of AIDA64 includes some synthetic CPU tests that can give us a sense of Kabini’s branch prediction effectiveness and its performance with the AVX instruction set. FinalWire explains the mechanics of those tests in detail on its website, but we’ll give you the Cliff’s Notes version below.
We’ll start with the CPU Queen test, which FinalWire describes as a “simple integer benchmark” that assesses each processor’s branch prediction effectiveness.
This test is nicely multithreaded, which explains the rough doubling of performance from the E-350 to the A4-5000. The A4 is running at a 100MHz lower clock rate, and the Bobcat core has one or two fewer pipeline stages than Jaguar. Still, the A4’s performance is more than twice the E-350’s, which suggests Jaguar’s branch prediction is indeed more accurate than Bobcat’s.
Next up is PhotoWorxx, which simulates photo processing workloads. FinalWire says this test focuses on integer and memory performance. AVX instructions are used here.
No doubt thanks to its AVX support, the A4 comes fairly close to the Core i3 in this test.
The CPU Hash, FPU Julia, and FPU Mandel tests are all written in assembly, and they all utilize AVX instructions. CPU Hash measures encryption performance using the SHA1 hashing algorithm. The FPU Julia and FPU Mandel benchmarks measure single- and double-precision floating-point performance, respectively, using fractal computations.
The A4-5000 outruns the Core i3 in the hash test. It lags a little behind its rival in both floating-point tests, but it’s hugely quicker than the E-350 across the board. The increase from Brazos to Kabini here is much larger than one would expect from just doubling the core count. Kabini’s AVX support and redesigned FPU probably deserve the bulk of the credit for the size of the gains.
TrueCrypt disk encryption
TrueCrypt supports acceleration via Intel’s AES-NI instructions, so the encoding of the AES algorithm, in particular, should be very fast on the CPUs that support those instructions. We’ve also included results for another algorithm, Twofish, that isn’t accelerated via dedicated instructions.
Along with the Core i5, the A4-5000 is the only processor in the mix to support AES acceleration. That acceleration pays substantial dividends in TrueCrypt.
The Twofish algorithm isn’t accelerated, but the A4-5000 still gives the Core i3 a minor whupping there. It’s also more than twice as fast as the E-350—and the Atom.
7-Zip file compression and decompression
The Core i3 has a clear advantage when it comes to data compression in 7-Zip, but the two chips handle decompression at about the same rate.
The Core i5 does perform better than the Core i3 here despite its slightly lower base clock speed, but remember, it also has Turbo Boost and considerably more memory bandwidth at its disposal.
The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. We asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs.
In the past, we’ve added up the time taken by all of the different elements of the panorama creation wizard and reported that number, along with detailed results for each operation. However, doing so is incredibly data-input-intensive, and the process tends to be dominated by a single, long operation: the stitch. Thus, we’ve simply decided to report the stitch time, which saves us a lot of work and still gets at the heart of the matter.
The A4 is again slower than the Core i3, but it’s much faster than the E-350.
Well, that is, unless you’re running the 32-bit version of the app, which seems to be much slower than the 64-bit version. The Atom Z2760’s lack of 64-bit support puts it at a substantial disadvantage.
x264 HD video encoding
We’ve devised a new x264 test, which involves one of the latest builds of the encoder with AVX2 support. To test, we encoded a one-minute, 1080p .m2ts video using the following options:
--profile high --preset medium --crf 18 --video-filter resize:1280,720 --force-cfr
The source video was obtained from a repository of stock videos on this website. We used the Samsung Earth from Above clip.
The A4 is almost three times quicker than the E-350 and the Atom Z2760. However, it’s a tad slower than the Core i3. These results largely echo those from AIDA64’s AVX-enabled synthetic benchmarks.
The benchmarks we’ve run so far have made use of the CPU cores only. The ones on this page and the next tap into the SoC’s integrated graphics processor for general-purpose computing tasks.
LuxMark OpenCL rendering
LuxMark renders a 3D scene using an OpenCL-accelerated ray tracing algorithm. Since OpenCL code is by nature parallelized and relies on a real-time compiler, it should adapt well to new instructions. For instance, Intel and AMD offer installable client drivers for OpenCL on x86 processors, and they both claim to support AVX.
We’ll start with CPU-only results. These results come from the AMD APP driver for OpenCL, since it tends to perform well on both Intel and AMD CPUs.
For some reason, the A4 falls behind the E-350 when we run LuxMark on the IGP only. Kabini has much faster integrated graphics on paper, so we may be looking at a driver optimization issue or some other software hiccup.
Pair the CPU and IGP together, and the A4-5000 trounces the Core i3. AMD’s decision to dedicate plenty of die area to graphics doesn’t just bode well for games; it pays dividends in compute tasks like this one, as well.
The Atom is absent from these results, because even with the APP runtime installed, it wouldn’t run LuxMark properly. The application started, but it complained of a lack of OpenCL-capable devices and wasn’t able to proceed with rendering.
Musemage is a photo editing application that features OpenCL acceleration. It also includes a built-in benchmark, which applies a set of filters to a photo and spits out a score at the end. That’s what we used.
Musemage’s OpenCL acceleration is kind to the A4-5000. The AMD chip races ahead of even the Core i5 from our premium ultrabook.
For the last couple of versions, WinZip has featured a parallel processing pipeline with OpenCL support. The pipeline allows multiple files to be opened, read, compressed, and encrypted simultaneously, all with hardware acceleration.
We tested WinZip by compressing a 1.17GB directory containing about 150 small text and image files, a couple dozen medium-sized PDF files, and 14 large Photoshop PSD files. We tested first with OpenCL disabled in the options, and then with OpenCL enabled, to get a sense of the performance benefits GPU acceleration would yield. Each operation was timed with a stopwatch.
Without OpenCL, the A4-5000 compresses our test archive substantially slower than the Core i3 does. Once we enable OpenCL, the tables are turned—but only because IGP acceleration actually slows down the Core i3. For a fair contest, we should compare the A4’s accelerated compression time (89 seconds) to the Core i3’s regular time (77 seconds). The A4 still ends up at a disadvantage, but the gap between the two chips shrinks from 25% to about 13%.
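For the curious, the 13% figure falls out if the gap is expressed as a share of the slower (A4) time. That convention is our reading of the numbers rather than something stated outright, so treat this as a quick sanity check; the function name is ours.

```python
def gap_pct(slower_s: float, faster_s: float) -> float:
    """Time gap between two results as a percentage of the slower one."""
    return (slower_s - faster_s) / slower_s * 100


if __name__ == "__main__":
    # A4-5000 with OpenCL (89 s) vs. Core i3 without it (77 s).
    print(f"{gap_pct(89, 77):.1f}%")
```

Measured against the Core i3's time instead, the same 12-second difference works out to roughly 16%, so the choice of baseline matters when comparing percentages like these.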
By the way, you’ll notice that the Atom is included in both sets of results. That’s because WinZip doesn’t hide the OpenCL setting on our Atom-powered tablet. However, it doesn’t look like the setting actually changes anything. Compression times are the same regardless of whether the checkbox is ticked or not.
There are now public builds of HandBrake available with an OpenCL-accelerated version of the x264 encoder—and an option to enable or disable OpenCL when encoding.
We tested by encoding a 1080p version of the Looper trailer into 720p format. We used the encoding options outlined in the screenshot below, with the constant frame rate setting enabled.
Here, too, we tested both with and without OpenCL enabled. The OpenCL setting wasn’t exposed on the Intel systems, however, so we presented the results in the same graph.
The gain from OpenCL acceleration on both the A4-5000 and the E-350 is extremely minor. At least, it seems that way until we look at total encoding times for our test video, which was 3649 frames in length.
OpenCL cuts encoding times by five seconds on the A4 and 24 seconds on the E-350. Which is, you know, better than nothing. It’s still not enough to give the AMD chips an edge over the competition from Intel, however.
The Elder Scrolls V: Skyrim
Our Skyrim test involved running around the town of Whiterun, starting from the city gates, all the way up to Dragonsreach, and then back down again.
Testing was done at 1280×720 using the game’s “Low” quality preset. Our Atom system had to sit this one out, since it couldn’t run the game properly at any settings.
Let’s preface the results below with a little primer on our testing methodology. Along with measuring average frames per second, we delve inside the second to look at frame rendering times. Studying the time taken to render each frame gives us a better sense of playability, because it highlights issues like stuttering that can occur—and be felt by the player—within the span of one second. Charting frame times shows these issues clear as day, while charting average frames per second obscures them.
To get a sense of how frame times correspond to FPS rates, check the table on the right.
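The conversion itself is simple: a steady frame time in milliseconds and a frame rate in FPS are reciprocals of each other, scaled by 1,000. A short sketch, with function names of our own choosing:

```python
def frame_time_to_fps(frame_time_ms: float) -> float:
    """Frame rate implied by a constant per-frame render time."""
    if frame_time_ms <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / frame_time_ms


def fps_to_frame_time(fps: float) -> float:
    """Per-frame render time (ms) implied by a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    for fps in (20, 30, 60):
        print(f"{fps} FPS -> {fps_to_frame_time(fps):.1f} ms per frame")
```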
We’re going to start by charting frame times over one representative test run for each system. (That run is usually the middle one out of the five we ran for each system.) These plots should give us an at-a-glance impression of overall playability, warts and all. You can click the buttons below the graph to compare the different solutions.
Right away, it’s clear that the A4 delivers a huge graphics performance improvement over the E-350. Also, the A4 achieves very consistent frame times overall, even if the plot line isn’t particularly low. (40 ms works out to about 25 FPS, for the record.) The E-350 and the Core i3 are both all over the place, and their frame times are clearly higher on average.
Only one chip beats the A4, and that’s the Core i5. Of course, that chip is equipped with dual-channel memory, whereas the A4 is limited to only a single channel—and real-time graphics is a very bandwidth-intensive task.
Now, we can slice and dice our raw frame-time data in several ways to show different facets of the performance picture. Let’s start with something we’re all familiar with: average frames per second. Average FPS is widely used, but it has some serious limitations. Another way to summarize performance is to consider the threshold below which 99% of frames are rendered, which offers a sense of overall frame latency, excluding fringe cases. (The lower the threshold, the more fluid the game.)
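To make the two summaries concrete, here is a rough sketch of how they could be computed from a list of per-frame render times. The nearest-rank percentile method is an assumption on our part (other interpolation schemes give slightly different answers), and the sample data is made up for illustration only.

```python
import math


def average_fps(frame_times_ms):
    """Average FPS over a run: frames rendered divided by total seconds."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


def percentile_frame_time(frame_times_ms, pct=99.0):
    """Frame time at or below which pct% of frames finished (nearest rank)."""
    if not frame_times_ms:
        raise ValueError("need at least one frame time")
    ordered = sorted(frame_times_ms)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Hypothetical run: mostly quick frames plus two long hitches.
    sample = [20.0] * 98 + [100.0] * 2
    print(f"average FPS: {average_fps(sample):.1f}")
    print(f"99th percentile frame time: {percentile_frame_time(sample):.1f} ms")
```

In that made-up run, the average looks healthy even though the slowest frames take five times longer than the rest. That is the sort of thing the percentile figure is meant to expose.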
The average FPS and 99th percentile results confirm our initial observations: the A4 is second only to the Core i5, and it’s well ahead of both the E-350 and the Core i3. However, the A4’s 46.3 ms 99th-percentile frame time is a little on the high side if you’re hoping for fluid animation.
By the way, those 99th-percentile figures only capture a single point along the latency curve, but we can show you that whole curve, as well. With single-GPU configs like these, the right-hand side of the graph, and especially the last 5% or so, is where you’ll want to look. That section tends to be where the best and worst solutions diverge.
The A4 and the Core i5 both manage to keep frame latencies consistent throughout about 98-99% of the frames. That’s the kind of consistency we’d expect from a good discrete desktop GPU.
Finally, we can rank the systems based on how long they spent working on frames that took longer than a certain number of milliseconds to render. Simply put, this metric is a measure of “badness.” It tells us about the scope of delays in frame delivery during the test scenario. You can click the buttons below the graph to switch between different millisecond thresholds.
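One way to compute such a figure is sketched below. It assumes that only the portion of each frame past the threshold counts toward the total, which is one plausible reading of the description above rather than a confirmed detail; the function name and sample values are hypothetical.

```python
def time_beyond_threshold(frame_times_ms, threshold_ms):
    """Total milliseconds spent past the threshold across all slow frames.

    Assumption: a 60 ms frame contributes 10 ms against a 50 ms threshold,
    not the full 60 ms.
    """
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)


if __name__ == "__main__":
    sample = [10.0, 40.0, 60.0]  # hypothetical frame times in ms
    for threshold in (33.3, 50.0):
        total = time_beyond_threshold(sample, threshold)
        print(f"time beyond {threshold} ms: {total:.1f} ms")
```

Under that definition, a system that never exceeds the threshold scores zero, and a handful of very long frames can outweigh many frames that only just miss the cutoff.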
The Core i5 spends comparatively so little time working on frames over 50 ms that its bar is too thin to show up in the graph above. The A4 trails reasonably closely, although it’s at a disadvantage in the 33.3-ms rankings because of its relatively high average frame times. Still, 4368 milliseconds (or 4.4 seconds) over that threshold isn’t bad out of a 90-second run.
From a seat-of-the-pants perspective, Skyrim feels fluid enough to be playable on the A4, at least in this test run, which involved multiple characters and detailed geometry but no combat. The high average frame times mean the game isn’t as silky-smooth as it would be on a decent desktop gaming rig. Still, frame times are low and consistent enough to make the game playable.
We tested Battlefield 3 by playing through the start of the Kaffarov mission, right after the player lands. Our 90-second runs involved walking through the woods and getting into a firefight with a group of hostiles, who fired and lobbed grenades at us.
As in Skyrim, we tested at 1280×720 using the “Low” quality preset. Again, our Atom system sat out this round of testing.
Let’s not mince words: none of these systems are fast enough to run Battlefield 3 acceptably, even at these very low detail settings. The A4 achieves the lowest, most consistent frame latencies of the bunch, which is commendable, but neither it nor the Core i5 delivers what we’d call a playable experience. Clearly, folks hoping to game on their Kabini-powered laptops will have to pick less demanding titles.
Empirical benchmarks tell us a lot, but they don’t communicate how quick and responsive a system feels when running day-to-day tasks. After completing our suite of empirical tests, we spent some time using the A4-5000 whitebook in order to get a feel for it.
Since slow storage plays a huge part in perceived responsiveness, we swapped out the A4-5000 whitebook’s 5,400-RPM hard drive and replaced it with a 256GB Crucial m4 solid-state drive. The idea was to see how snappy a Kabini system could be under ideal conditions.
Our verdict: the A4-5000 whitebook is noticeably slower than our premium ultrabook, the Zenbook Prime UX31, but the difference is small—much smaller than you’d imagine.
Oh, sure, we noticed some slightly longer pauses when skipping from page to page across the web. And there were other minor slowdowns, such as when scrolling down content-rich pages or opening applications. However, we never had the impression that Kabini was struggling to keep up with input, which is often the case with slower, Atom-powered systems—and with machines based on Kabini’s predecessor. More importantly, the Kabini system never felt slow to the point of frustration; it just wasn’t quite as snappy as a $1,100 ultrabook.
Out there in the real world, A4-5000-powered laptops probably won’t come fitted with 256GB SSDs, and they certainly won’t compete head-on with premium ultrabooks. Instead, they’ll be saddled with mechanical storage and made to fight it out with similar machines powered by Intel’s Core i3 and Pentium CPUs. In those matchups, the responsiveness difference may be imperceptible. We certainly didn’t get the impression that the A4-5000 whitebook was noticeably slower than our Core i3-powered VivoBook X202E, despite what the benchmark numbers on the previous pages indicate.
What about gaming? Well, after the promising showing in Skyrim and the, er, somewhat less promising results in Battlefield 3, we thought we’d try some more casual titles to see if the A4-5000 handled those any better. We didn’t really have time to benchmark these games, but we did load up Fraps and keep an eye on reported frame rates while playing.
Our first candidate was Counter-Strike: Global Offensive, a snazzed-up sequel to Counter-Strike: Source. At 1280×720 using the lowest possible detail settings, frame rates hovered between 20 and 50 FPS, and the game ranged from smooth and playable to choppy and not-really-playable. Heavy combat saw frame rates drop into the teens, which had a direct impact on our kill-to-death ratio.
You can play CS:GO on the A4-5000, but the integrated graphics will drag you down in serious multiplayer skirmishes.
Next up was Dyad, an abstract indie game that’s a favorite of our own Geoff Gasior. Dyad has oodles of kaleidoscopic eye candy, but it ran better on our Kabini whitebook than CS:GO. Frame rates hovered in the 30-50 FPS range at 720p, which was playable, albeit somewhat less buttery-smooth than on a desktop gaming PC. Dyad is definitely a game you can enjoy on the A4-5000.
We rounded out our subjective gaming tests with Ilomilo, an Xbox Live port that’s now available through the Windows 8 app store. This game runs from the Modern UI environment, and unfortunately, Fraps isn’t able to monitor frame rates inside it. Playing the game, however, it was clear that the A4-5000 had no problems maintaining a smooth, fluid experience. A touch screen would have made things even better… too bad our whitebook doesn’t have one.
Based on our results so far, I’d say the A4-5000 is more than qualified to handle casual games, and it treads close to the playability threshold in more serious titles. In some of those, it’s fast enough at the lowest detail settings; in others, like Battlefield 3, performance isn’t sufficient to make the game playable.
Considering this chip is expected to appear in sub-$500 notebooks, I’d say that’s a pretty good overall showing.
Battery run times
We tested battery life twice: once running TR Browserbench 1.0, a web browsing simulator of our own design, and again looping a 720p Game of Thrones episode in Windows Media Player. (In case you’re curious, TR Browserbench is a static version of TR’s old home page rigged to refresh every 45 seconds. It cycles through various permutations of text content, images, and Flash ads, with some cache-busting code to keep things realistic.)
Before testing, we conditioned the batteries by fully discharging and then recharging each system twice in a row. We also used our colorimeter to equalize the display luminosity at around 100 cd/m².
The A4-5000 whitebook achieves much longer run times than the Core i3-based VivoBook. It even edges out the larger, Core i5-driven Zenbook in our video playback test. The Zenbook stays awake an hour longer in the web-browsing run, though.
These are tricky comparisons to make, though, because the systems don’t all have the same battery capacities and displays. We can’t compensate for the display differences, but we can normalize the data based on the capacity of each battery. The following results show normalized run times in minutes per watt-hour. (For the record, the UX31A has a 13″ 1080p screen, while the X202E and ME400C spread the same 1366×768 resolution across 11.6″ and 10.1″ panels, respectively.)
These normalized numbers show the A4-5000 actually comes very close to the Zenbook in the web-browsing run—and it’s substantially better in the video playback test.
If all of these systems had a 50Wh battery, the Kabini whitebook would have stayed up 6.9 hours in the web test, compared to 7.4 for the Zenbook. That’s a rather small difference. Both runs are also within spitting distance of the “all day battery life” nirvana.
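The normalization behind these figures is straightforward, and a small sketch may help. It assumes run time scales linearly with battery capacity, which is the premise of the 50Wh comparison; the function names and the sample system are hypothetical, since the actual capacities aren’t listed here.

```python
def minutes_per_watt_hour(run_time_min: float, capacity_wh: float) -> float:
    """Run time normalized by battery capacity."""
    if capacity_wh <= 0:
        raise ValueError("capacity must be positive")
    return run_time_min / capacity_wh


def projected_hours(run_time_min: float, capacity_wh: float,
                    target_wh: float = 50.0) -> float:
    """Run time (hours) the same system would get from a target_wh battery,
    assuming run time scales linearly with capacity."""
    return minutes_per_watt_hour(run_time_min, capacity_wh) * target_wh / 60.0


if __name__ == "__main__":
    # Hypothetical system: 300 minutes of run time from a 25 Wh battery.
    print(f"{minutes_per_watt_hour(300, 25):.1f} min/Wh")
    print(f"{projected_hours(300, 25):.1f} hours on a 50 Wh battery")
```

Working backward from the numbers above, 6.9 hours on a 50Wh battery corresponds to roughly 8.3 minutes per watt-hour, and 7.4 hours to about 8.9.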
AMD has achieved two things with Kabini.
One, the company has delivered a substantial across-the-board performance increase over Brazos, its previous low-power mobile platform, without increasing the power envelope. In fact, Kabini is more power-efficient in spite of the higher performance. AMD has cut the power envelope from 18W to 15W, and that now includes the integrated Fusion controller hub. Also, as we noted earlier, Kabini features new power management mojo to further improve energy efficiency.
In addition to all that, AMD has come very close to matching the CPU performance of Intel’s ultrabook-bound Core i3 processors—at least in a single-channel memory configuration, which seems to be commonplace in low-end notebooks. That near-competitiveness on the CPU side is accompanied by better graphics performance and, as we saw, superior battery life. Kabini is a single-chip solution, too, while the Core i3 requires a separate chipset. Kabini’s smaller die size and higher power efficiency could make for lighter, more compact systems.
If AMD’s official product positioning charts are any indication, the A4-5000 may very well wind up priced lower than the Core i3-3217U. Should that be the case, then PC makers may be able to build better systems for the money using the AMD silicon. A4-powered offerings might have better displays, say, than their Intel counterparts. We were surprised by the inclusion of a 13″ 1080p panel on the A4-5000 whitebook, but AMD says we can expect similar goodness from $499 retail offerings. Finding a similar display on a $500 Intel machine is difficult, if not impossible, right now.
Although Kabini and Temash look poised to spawn some pretty attractive products in the coming weeks, we can’t help but look at this SoC and the Jaguar CPU architecture and think about the remaining, untapped potential. Any conversation about that subject has to start with the lack of Turbo Core in all but one of the current products based on this chip. The 15W quad-core Kabini A4-5000’s CPU cores top out at 1.5GHz. Surely a single core could reach 2GHz with Turbo Core enabled.
That’s just the beginning of the tuning opportunities, we think. Currently, Kabini and Temash use digital activity counters to measure utilization and estimate power, just as AMD’s Trinity chip did at its launch. Since then, AMD has taken the same Trinity silicon and released a new lineup of products under the Richland code name that include higher CPU and GPU clock speeds courtesy of smarter power management routines. Those algorithms improve on Trinity’s dynamic behavior via more extensive profiling in AMD’s labs and by taking advantage of the temperature sensors embedded in Trinity/Richland silicon. According to AMD power guru Sam Naffziger, Kabini and Temash have thermal sensors built into them, as well, but they’re currently only used as a last-resort safety mechanism. One could easily imagine AMD doing a refresh of this product lineup using the same Kabini/Temash silicon with more refined power management firmware. Naffziger admitted to us that there’s even more opportunity to extract headroom by monitoring temperatures outside of the chip, at the platform level, as well.
What’s more, Kabini and Temash don’t support Windows 8’s connected standby mode. AMD elected to focus its efforts on speeding up two things: quickly resuming operation when coming out of sleep mode and ensuring quick reconnects to Wi-Fi networks. Those are sensible choices, but AMD could add support for connected standby mode in a future refresh. We expect they’d end up getting lower platform operating power in the process.
Really, that’s just the beginning. AMD says a Jaguar core takes up about 3.1 mm² of die area at 28 nm, virtually the same size as an ARM Cortex-A15. This is arguably AMD’s first “true” SoC, and its lowest power envelope is 3.8W, which limits it to relatively thick tablets. Beyond the power management and platform-level work we’ve mentioned, there are opportunities for further integration and power savings in future chips. For instance, this SoC has redundant paths to memory for the Jaguar CPU cores and the Radeon IGP. Those could be unified in a future chip, saving power without compromising performance. With the Jaguar microarchitecture in its arsenal, AMD appears to have the core technology needed to take a PC-like experience into even smaller devices in the future; the firm just needs to do the engineering work to make it happen.