Thanks in part to the smartphone market’s rapid move toward 64-bit-capable processors, ARM’s licensable CPU cores have seen an upsurge in high-visibility deployments this year. From the Exynos version of Samsung’s Galaxy Note 4 to newer offerings like the Qualcomm-based LG G4, ARM’s Cortex-A57 has become the new standard for processing power in Android phones. That’s a bit of a change from the last couple of years, when custom cores like Qualcomm’s Krait dominated the same landscape.
The folks at ARM know that they have to keep progressing in order to remain competitive with the customers who license their instruction set for custom core development. Thus, they’ve already announced the next-generation Cortex-A72 CPU core and, last week at a press event in London, they revealed the first details of this new core’s internal architecture.
Into the A72
The Cortex-A72 is the latest iteration of ARM’s largest CPU core, although it’s probably a mid-size core in the grand scheme of things. It’s quite a bit smaller than Intel’s Broadwell or the latest Apple CPU core in the A8 SoC, for instance.
The A72 is a heavily revised version of the Cortex-A57 core that it supplants, which in turn owes a debt to the Cortex-A15. Mike Filippo, an ARM Fellow and Lead Architect, told us that the A72 team started with the A57 and then “optimized every block” in the CPU in order to squeeze out higher performance and improved energy efficiency. Since the A72 will be used in chips intended for both mobile- and server-class applications, ARM gave it a rich feature set meant to cover the necessary bases in both markets.
Like many of ARM’s cores, the Cortex-A72’s fundamental building block is a cluster of up to four discrete CPU cores sharing a common L2 cache. For the A72, that cache can be as small as 512KB or as large as 4MB. The cluster talks to the rest of the SoC via a 128-bit AMBA interface, and one can expect to see chips incorporating fairly large numbers of quad-core A72 clusters for certain markets.
Simplified block diagram of the Cortex-A72. Source: ARM.
From high altitude, the A72 doesn’t look much different from the A57 that came before it. The core has an in-order front end feeding an out-of-order back end and memory subsystem. It can fetch three instructions per clock cycle and issue up to eight micro-ops to the execution units. The updates in the A72 widen some data paths while improving efficiency at each stop along the way.
The first stop is the branch prediction unit, one of those blocks that nearly every architectural update seems to touch. Filippo said the team “effectively rebuilt” the unit in the A72. As usual, the primary goal in the rebuild was to increase the accuracy of this unit’s predictions. Doing so can improve performance and power efficiency by reducing the time and power wasted speculating down paths that programs ultimately don’t take. Filippo claims a new algorithm “significantly” improves the A72’s prediction accuracy, and it’s coupled with a host of targeted tweaks. Those tweaks pay off in more than just accuracy. Filippo told us the new branch prediction unit itself operates in more energy-efficient fashion than the one in the Cortex-A57.
Some of the other key changes to the A72 have to do with the way instructions and data flow through the machine. Like a lot of modern CPUs, the A72 translates instructions from the external ARMv8 instruction set, exposed to software and compilers, into its own internal operations, known as micro-ops. The reality, in fact, is even more complex. The A72 can fetch three ARMv8 instructions, decode them into three macro-ops—an intermediate format used internally—and then dispatch up to five micro-ops into the issue queues in each clock cycle. These queues, which operate independently, can then issue up to eight micro-ops into the execution units in a single tick of the clock.
Logical block diagram of Cortex-A72. Source: ARM.
This mix of per-clock throughput through fetch, decode, dispatch, and the issue queues might seem unbalanced at 3-3-5-8, but remember, we’re dealing with different instruction formats at different stages. ARMv8 instructions tend to break down into slightly larger numbers of micro-ops; on average, Filippo said, each ARMv8 instruction translates into 1.08 micro-ops. Also, not all of the “cracking” of complex instructions into ops happens in the decode stage. Some of it happens in the dispatch units, when those intermediate macro-ops are translated into micro-ops—hence the dispatch unit’s ability to take in three macro-ops and output five micro-ops. That’s an upgrade from the dispatch unit in the A57, which can only output three micro-ops per cycle for a 3-3-3-8 flow.
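The arithmetic behind those widths is easy to check. Here’s a quick sketch using the figures quoted above; the model is an illustrative simplification of supply and demand at each stage, not ARM’s actual dispatch logic.

```python
# Back-of-envelope check of the A72's 3-3-5-8 pipeline widths,
# using the numbers quoted in the article.

FETCH_WIDTH = 3       # ARMv8 instructions fetched per cycle
DECODE_WIDTH = 3      # macro-ops produced per cycle
DISPATCH_WIDTH = 5    # micro-ops dispatched into the issue queues per cycle
ISSUE_WIDTH = 8       # micro-ops issued to execution units per cycle

UOPS_PER_INSN = 1.08  # average expansion ratio quoted by Filippo

# If the front end sustains full fetch, the average micro-op supply is:
avg_uop_supply = FETCH_WIDTH * UOPS_PER_INSN
print(f"average micro-op supply: {avg_uop_supply:.2f} per cycle")   # 3.24

# Dispatch width (5) comfortably exceeds the average supply (3.24)...
print(f"dispatch headroom: {DISPATCH_WIDTH - avg_uop_supply:.2f}")  # 1.76

# ...and issue width (8) exceeds dispatch width (5) because the independent
# issue queues can drain micro-ops that accumulated in earlier cycles.
```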
That said, the progression from ARMv8 instructions to micro-ops isn’t all about simplification, either. Filippo explained that the A72’s micro-ops aren’t “super-simple steps.” Instead, “we do some fairly complex things with the back-end micro-ops.” In some cases, the decoder can fuse multiple ARMv8 instructions into a single macro-op; this is another new capability added in the A72. Thus, the role of the CPU’s front-end units is proper formatting as much as anything. Since micro-ops can sometimes take multiple cycles to execute, the CPU doesn’t require eight decode or dispatch slots per cycle in order to keep the issue queues full. The A57, remember, gets by pretty well with a 3-3-3-8 config.
The decode/rename and dispatch/retire units have been the subject of block-level optimizations, as well. Many of these changes have to do with the operation of local storage arrays—buffers in the decode unit and registers in the dispatch/retire unit. Filippo said a power-oriented reorganization of the register files produced a “significant reduction” in the number of ports used and the amount of chip area consumed. He believes careful sharing of the remaining ports should allow the A72 to be “performance neutral” on this front, with “no meaningful performance drop-off” from stalls caused by port contention.
The core’s basic complement of execution units, shown in the diagram above, looks to be pretty much the same as the A57’s. The queues can issue one micro-op to each of the two single-cycle integer ALUs, one to the branch unit, one to the multi-cycle ALU, two to the floating-point/SIMD units, and two to the load/store units. That’s a total of eight micro-ops issued per cycle, as we’ve already noted.
Some of those execution units are substantially improved in the A72. The integer units have added a radix-16 divider with twice the bandwidth of the A57’s divider. They’ve also added a number of zero-cycle forwarding paths, so data can travel to the next stop in the pipeline immediately, without a one-cycle bubble.
The FP/SIMD units have the most extensive changes, with markedly lower latencies for key instructions. Floating-point multiplication now happens in three cycles, a 40% reduction versus the A57. FP adds also take three cycles—versus four on A57. As a result, the latency for combining the two operations in a fused multiply-add is six cycles, a 33% drop versus the prior generation. Floating-point division is now served by a radix-16 divider with double the bandwidth of the old unit, too.
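Those percentages imply the prior-generation latencies, which the article doesn’t state outright. Working backward from the quoted reductions gives a 5-cycle multiply and 4-cycle add on the A57; that inference is mine, not ARM’s.

```python
# Arithmetic behind the FP latency claims. The A57 latencies are
# inferred from the quoted percentage reductions.

a57_fmul, a72_fmul = 5, 3   # FP multiply latency in cycles
a57_fadd, a72_fadd = 4, 3   # FP add latency in cycles

# FMUL: 5 -> 3 cycles is a 40% reduction
print((a57_fmul - a72_fmul) / a57_fmul)   # 0.4

# A fused multiply-add chains the two operations back to back:
a57_fma = a57_fmul + a57_fadd             # 9 cycles
a72_fma = a72_fmul + a72_fadd             # 6 cycles
print((a57_fma - a72_fma) / a57_fma)      # 0.333..., the quoted ~33% drop
```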
In keeping with its M.O., the A72 team has tuned all of these units for power-efficient operation, and it has tweaked the load-balancing algorithm from the issue queue to ensure fuller utilization of these quicker execution resources.
The A72’s memory subsystem has seen extensive tuning, as well. The caches should be kept warm with relevant data thanks to a hardware prefetcher, situated in the L1 cache complex, that can retrieve data into both the L1 and L2 caches. The L2 cache has been tuned for higher bandwidth, too. Beyond that, the A72 should typically be paired with ARM’s CCI-500 north-bridge interconnect, which can increase available memory bandwidth by as much as 30%. (This area was a pain point in the Exynos 5433, so the change is welcome.) Again, power efficiency was a specific target of optimization, both in the load/store unit and in the L2 cache. In addition to the usual tuning of logic and local memories, the team worked to reduce the L2’s power draw at idle.
Quantifying the goodness
The final product of all of this tuning and optimization is a core that’s substantially improved compared to the Cortex-A57. For one thing, the Cortex-A72 is simply smaller than the A57, with a 10% reduction in chip area on the same manufacturing process.
Meanwhile, the A72 offers higher per-clock throughput across a range of workloads—gains ranging from 16% to 50%, as illustrated below.
Cortex-A57 vs. Cortex-A72 per-clock performance. Source: ARM.
The largest gains come in memory-sensitive workloads, but improvements of 16% in integer math and 26% in floating-point are still considerable.
Clock speeds should be up generally, too. Filippo told us the team’s target was to reach the same frequencies as the A57, but in the end, the A72 wound up better able to tolerate high frequencies. As a result, we may see a few hundred megahertz of additional clock speed out of A72-based SoCs, with peaks in the neighborhood of 2.5GHz.
More important for real-world performance is the A72’s potential to sustain its peak clock speed over time. That ability comes courtesy of some major improvements in power efficiency, as illustrated below.
This comparison is a little tricky because it’s primarily against the Cortex-A15 and involves differences in process technology, as well. Still, the green bars show the impact of the core changes alone at a common 28-nm process. ARM expects the A72 to consume 50% less power than the A15 and, by my reading of its chart, roughly 19% less than the A57 while achieving the same performance. (These numbers involve lower clock speeds for the newer cores, since they have higher per-clock throughput.)
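The two power figures can be cross-checked against each other. The implied A57-versus-A15 improvement below is derived by me from the quoted numbers; ARM didn’t state it directly.

```python
# Relating the two power claims: A72 at 50% of A15 power, and roughly
# 19% below A57 power, at equal performance.

a72_vs_a15 = 0.50        # A72 power as a fraction of A15 power
a72_vs_a57 = 1 - 0.19    # A72 power as a fraction of A57 power (0.81)

# The implied A57-vs-A15 ratio falls out of simple division:
a57_vs_a15 = a72_vs_a15 / a72_vs_a57
print(f"implied A57 power vs. A15: {a57_vs_a15:.2f}")  # ~0.62, i.e. ~38% less
```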
These days, improvements in CPU power efficiency generally translate almost directly into performance, since CPUs tend to be heavily power-constrained. Many A57-based SoCs tend to dial back CPU core clocks during longer periods of sustained activity in order to keep temperatures in check. By contrast, Filippo expects the A72 to be able to operate at its peak frequency for sustained periods.
Combine the A72’s higher power efficiency with the expected gains in per-clock performance and clock speeds, and you’re looking at a pretty substantial generational leap. Filippo credibly calls the A72 a “next-gen design.” Factor in the expected benefits of the transition to 14/16-nm-class chip fabrication processes, and by my rough math, the next wave of ARM-based devices could achieve roughly double the sustained performance of 20-nm SoCs based on Cortex-A57. The gains would be more modest in short, bursty workloads where the A57 is able to operate at its peak clocks.
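For what it’s worth, here’s one way to reproduce that “roughly double” estimate. Every factor below is my own assumption chosen for illustration—a midpoint of the quoted per-clock gains, a modest clock bump, and a combined sustained-clocks-plus-process benefit—so treat this as a sketch of the rough math, not a figure ARM published.

```python
# Back-of-envelope version of the "roughly double" sustained-performance
# estimate. All three factors are assumptions for illustration.

ipc_gain = 1.3        # assumed midpoint of the 16-50% per-clock gains
clock_gain = 1.1      # a few hundred MHz on top of A57-class clocks
sustained_gain = 1.4  # assumed benefit of holding peak clocks + 14/16-nm process

print(round(ipc_gain * clock_gain * sustained_gain, 2))  # ≈2.0
```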
As with the A57, an A72 cluster is likely to be paired with a cluster of Cortex-A53 cores as part of ARM’s big.LITTLE asymmetric multiprocessing scheme. Such a pairing should allow the A72 cores to remain power-gated off during light work, further improving power efficiency.
One intriguing question is whether the A72’s cumulative advances will be sufficient to win it a big presence in premium smartphones like the A57 has now. The improvements we’ve cited could be enough to put the Cortex-A72 at parity with or slightly ahead of Apple’s current custom CPU core in the A8, but Apple will likely have something more potent to offer with the next iPhone refresh. The A72 will also have to contend with upcoming custom cores from Qualcomm, Samsung, and possibly others. One could reasonably expect those firms to use their own cores in their next-gen SoCs unless those cores are somehow obviously not competitive.
Of course, regardless of what happens there, some portion of the mobile SoC market will surely adopt the A72, especially low-cost and quick-turnaround artists like MediaTek. The A72 will undoubtedly set a new performance standard for that portion of the market.
The pitch for Cortex-A72 in data centers
ARM continues to push for its A-series processors to make inroads into the data center, and the Cortex-A72 is a big part of its plans on that front. ARM isn’t shy about making fairly direct comparisons between the A72 and the Haswell and Broadwell cores Intel uses in its Xeon products, even though ARM’s case for its intrusion into the data center clearly involves a form of asymmetric warfare.
The workloads and applications where ARM thinks SoCs based on its technology are likely to steal business from the Xeon tend to involve relatively simple, throughput-based tasks. In those cases, ARM’s relatively small, low-power cores have the potential to compete well, in part simply because smaller CPU cores may be better suited to the job—particularly when power consumption comes into play, as it so often does.
Above is an example of a possible ARM-based SoC architecture meant for data-center applications. This chip has four quad-core A72 clusters paired with four quad-core A53 clusters, for a total of 16 “big” cores and 16 “little” cores. With up to 32MB of L3 cache, four channels of DDR4 memory, and tons of I/O bandwidth on tap, an SoC like this one could work well when serving certain types of workloads—perhaps anything from driving a network appliance to running a more traditional server application like a web-caching layer.
ARM pitches the A72 as the rough performance equivalent of a single thread on an Intel Broadwell core. Since each Broadwell core can track and execute two threads via Hyper-Threading, the basic idea is that two A72 cores roughly match one full Broadwell core. That comparison won’t fly when you’re talking about single-threaded performance, where the Intel CPU is likely to have a substantial advantage, but we can assume it might make some sense in server-class applications with an abundance of threads. Now consider the power and density picture.
Die size comparison of ARM Cortex-A72 and Intel Broadwell cores. Source: ARM.
ARM points out that a single Cortex-A72 core built on TSMC’s 16FF+ process occupies roughly 1.15 mm² of die space, while a single Broadwell core with 256KB of L2 cache occupies roughly 8 mm² on Intel’s 14-nm process. The more apt comparison may be what ARM can fit into the same eight square millimeters as the Broadwell: four A72 cores and 2MB of L2 cache. For dense computing environments addressing applications that require lots of raw throughput and relatively simple code execution, the A72 could form the basis of a compelling solution.
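The density math checks out. In the sketch below, the leftover-area figure attributed to the 2MB L2 is my inference; ARM states only the total configuration.

```python
# Arithmetic behind the equal-area density comparison.

a72_core_mm2 = 1.15   # one Cortex-A72 core on TSMC 16FF+
broadwell_mm2 = 8.0   # one Broadwell core + 256KB L2 on Intel 14nm

four_a72 = 4 * a72_core_mm2
print(f"four A72 cores: {four_a72:.1f} mm^2")                        # 4.6
print(f"area left over for 2MB L2: {broadwell_mm2 - four_a72:.1f}")  # 3.4
print(f"cores per Broadwell-sized area: {broadwell_mm2 / a72_core_mm2:.1f}")  # ~7.0
```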
Here’s a comparison of a 20-thread SPECint_rate2006 workload running on a Haswell-EP-based Xeon with 10 cores and 20 threads versus a couple of “example” (I believe emulated, not actual hardware) 20-core ARM-based SoCs, one using Cortex-A57 and the other A72. ARM claims to be able to match the Xeon’s performance while consuming under a third of the power—less than 30W versus 105W for the Xeon.
ARM is even willing to take on the Broadwell core and, by proxy, the Xeon D processor by making a comparison to a Core M-based Dell Venue Pro II, which is evidently the only Broadwell they were able to wrangle so far. These examples are obviously cherry-picked by ARM to put the A72 in a good light, and the Core M system in question is clearly thermally constrained. I’m happy to pass on the results above by way of illustration, but I’m not sure they’re a good indication of what one should expect from a comparison of true server-class SoCs.
The thing is, we could keep doing this sort of thing all day, picking out workloads whose specific needs might be best served by an array of small, low-power CPU cores. That’s true even if the majority of server-class applications don’t fall into that category and would be better served by a bigger, beefier Xeon. I suppose that’s ARM’s underlying point: that even with the Xeon D looking formidable, there’s still plenty of room in the data center for tailored solutions based on ARM processors to capture some business.