
Inside ARM's Cortex-A72 microarchitecture

The next-gen CPU core for mobile devices and servers
— 3:02 PM on May 1, 2015

Thanks in part to the smartphone market's rapid move toward 64-bit-capable processors, ARM's licensable CPU cores have seen an upsurge in high-visibility deployments this year. From the Exynos version of Samsung's Galaxy Note 4 to newer offerings like the Qualcomm-based LG G4, ARM's Cortex-A57 has become the new standard for processing power in Android phones. That's a bit of a change from the last couple of years, when custom cores like Qualcomm's Krait dominated the same landscape.

The folks at ARM know that they have to keep progressing in order to remain competitive with the customers who license their instruction set for custom core development. Thus, they've already announced the next-generation Cortex-A72 CPU core and, last week at a press event in London, they revealed the first details of this new core's internal architecture.

Into the A72
The Cortex-A72 is the latest iteration of ARM's largest CPU core, although it's probably a mid-size core in the grand scheme of things. It's quite a bit smaller than Intel's Broadwell or the latest Apple CPU core in the A8 SoC, for instance.

The A72 is a heavily revised version of the Cortex-A57 core that it supplants, which in turn owes a debt to the Cortex-A15. Mike Filippo, an ARM Fellow and Lead Architect, told us that the A72 team started with the A57 and then "optimized every block" in the CPU in order to squeeze out higher performance and improved energy efficiency. Since the A72 will be used in chips intended for both mobile- and server-class applications, ARM gave it a rich feature set meant to cover the necessary bases in both markets.

Like many of ARM's cores, the fundamental structure of the Cortex-A72 is a cluster comprising up to four discrete CPU cores sharing a common L2 cache. For the A72, that cache can be as small as 512KB or as large as 4MB. The cluster talks to the rest of the SoC via a 128-bit AMBA interface, and one can expect to see chips incorporating fairly large numbers of quad-core A72 clusters for certain markets.
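The configuration limits above can be captured in a few lines of Python. This is purely an illustrative sketch; the function name and the idea of validating a cluster this way are ours, while the numeric limits come from the article:

```python
# Hypothetical validity check for an A72 cluster configuration,
# based on the limits quoted above: up to four cores sharing an
# L2 cache between 512KB and 4MB.
def valid_a72_cluster(cores, l2_kb):
    """Return True if (cores, L2 size in KB) is a legal A72 cluster."""
    return 1 <= cores <= 4 and 512 <= l2_kb <= 4096

print(valid_a72_cluster(4, 2048))  # True: four cores, 2MB L2
print(valid_a72_cluster(6, 2048))  # False: too many cores per cluster
```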

Simplified block diagram of the Cortex-A72. Source: ARM.

From high altitude, the A72 doesn't look that different from the A57 that came before it. The core pairs an in-order front end with an out-of-order back end and memory subsystem. It can fetch three instructions per clock cycle and issue up to eight micro-ops to the execution units. The updates in the A72 widen some data paths while improving efficiency at each stop along the way.

The first stop is the branch prediction unit, one of those blocks that nearly every architectural update seems to touch. Filippo said the team "effectively rebuilt" the unit in the A72. As usual, the primary goal in the rebuild was to increase the accuracy of this unit's predictions. Doing so can improve performance and power efficiency by reducing the time and power spent computing branches that programs ultimately don't take. Filippo claims a new algorithm "significantly" improves the A72's prediction accuracy, and it's coupled with a host of targeted tweaks. Those tweaks pay off in more than just accuracy. Filippo told us the new branch prediction unit itself operates in more energy-efficient fashion than the one in the Cortex-A57.

Some of the other key changes to the A72 have to do with the way instructions and data flow through the machine. Like a lot of modern CPUs, the A72 translates instructions from the external ARMv8 instruction set, exposed to software and compilers, into its own internal operations, known as micro-ops. The reality, in fact, is even more complex. The A72 can fetch three ARMv8 instructions, decode them into three macro-ops—an intermediate format used internally—and then dispatch up to five micro-ops into the issue queues in each clock cycle. These queues, which operate independently, can then issue up to eight micro-ops into the execution units in a single tick of the clock.

Logical block diagram of Cortex-A72. Source: ARM.

This mix of per-clock throughput through fetch, decode, dispatch, and the issue queues might seem unbalanced at 3-3-5-8, but remember, we're dealing with different sorts of instruction units at different stages. ARMv8 instructions tend to break down into larger numbers of micro-ops. On average, Filippo said, each ARMv8 instruction translates into 1.08 micro-ops. Also, not all of the "cracking" of complex instructions into ops happens in the decode stage. Some of it happens in the dispatch units, when those intermediate macro-ops are translated into micro-ops—hence the dispatch unit's ability to take in three macro-ops and output five micro-ops. That's an upgrade from the dispatch unit in the A57, which can only output three micro-ops per cycle for a 3-3-3-8 flow.
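The arithmetic behind that 3-3-5-8 balance can be sketched in a few lines of Python. This is a deliberately idealized model, not a simulation of the core: the stage widths and the 1.08 expansion factor are from the article, and everything else is an assumption:

```python
# Idealized model of per-cycle throughput through the A72's front end.
# Real behavior depends on instruction mix, caches, branch prediction,
# and queue occupancy; this only checks that the stage widths are
# balanced for the stated micro-op expansion factor.

FETCH_WIDTH = 3      # ARMv8 instructions fetched per cycle
DECODE_WIDTH = 3     # macro-ops produced per cycle
DISPATCH_WIDTH = 5   # micro-ops dispatched per cycle
ISSUE_WIDTH = 8      # micro-ops issued to execution units per cycle

UOPS_PER_INSTR = 1.08  # average micro-ops per ARMv8 instruction, per ARM

# Micro-ops the dispatch stage must sustain if fetch runs at full width:
sustained_uops = FETCH_WIDTH * UOPS_PER_INSTR  # 3.24 micro-ops/cycle

# The 5-wide dispatch comfortably covers the average case, and the
# 8-wide issue stage can drain queued micro-ops even faster after stalls.
print(f"sustained micro-ops/cycle: {sustained_uops:.2f}")
print(f"dispatch headroom: {DISPATCH_WIDTH - sustained_uops:.2f}")
```

The takeaway: even at full fetch bandwidth, the average stream only generates about 3.24 micro-ops per cycle, so a 5-wide dispatch and 8-wide issue leave headroom for bursts of complex instructions.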

That said, the progression from ARMv8 instructions to micro-ops isn't all about simplification, either. Filippo explained that the A72's micro-ops aren't "super-simple steps." Instead, "we do some fairly complex things with the back-end micro-ops." In some cases, the decoder can fuse multiple ARMv8 instructions into a single macro-op; this is another new capability added in the A72. Thus, the role of the CPU's front-end units is proper formatting as much as anything. Since micro-ops can sometimes take multiple cycles to execute, the CPU doesn't require eight decode or dispatch slots per cycle in order to keep the issue queues full. The A57, remember, gets by pretty well with a 3-3-3-8 config.

The decode/rename and dispatch/retire units have been the subject of block-level optimizations, as well. Many of these changes have to do with the operation of local storage arrays—buffers in the decode unit and registers in the dispatch/retire unit. Filippo said a power-oriented reorganization of the register files produced a "significant reduction" in the number of ports used and the amount of chip area consumed. He believes careful sharing of the remaining ports should allow the A72 to be "performance neutral" on this front, with "no meaningful performance drop-off" from stalls caused by port contention.

The core's basic complement of execution units, shown in the diagram above, looks to be pretty much the same as the A57's. The queues can issue one micro-op to each of the two single-cycle integer ALUs, one to the branch unit, one to the multi-cycle ALU, two to the floating-point/SIMD units, and two to the load/store units. That's a total of eight micro-ops issued per cycle, as we've already noted.
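Those per-port limits can be expressed as a small Python sketch. The port names and the checking function are our own illustrative inventions, not ARM's terminology; only the counts come from the article:

```python
from collections import Counter

# Hypothetical per-cycle issue capacity for the A72, per the breakdown
# above. Port names are illustrative, not ARM's terminology.
ISSUE_CAPACITY = {
    "simple_alu": 2,       # single-cycle integer ALUs
    "branch": 1,
    "multi_cycle_alu": 1,
    "fp_simd": 2,          # floating-point/SIMD pipes
    "load_store": 2,
}

def issues_in_one_cycle(uops):
    """Return True if this micro-op mix fits the per-port caps above."""
    counts = Counter(uops)
    return (set(counts) <= set(ISSUE_CAPACITY)
            and all(counts[p] <= cap for p, cap in ISSUE_CAPACITY.items()))

# Two ALU ops, a branch, and two loads fit in a single cycle.
print(issues_in_one_cycle(
    ["simple_alu", "simple_alu", "branch", "load_store", "load_store"]))

# Three loads/stores exceed the two load/store pipes.
print(issues_in_one_cycle(["load_store"] * 3))
```

Note that the capacities sum to eight, matching the "8" in the 3-3-5-8 flow: the issue queues can't exceed that in any one cycle, and only then with exactly the right mix of micro-ops.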

Some of those execution units are substantially improved in the A72. The integer units have gained a radix-16 divider with twice the bandwidth of the A57's divider. They've also picked up a number of zero-cycle forwarding paths, so data can travel to the next stop in the pipeline immediately, without a one-cycle bubble.

The FP/SIMD units have the most extensive changes, with markedly lower latencies for key instructions. Floating-point multiplication now happens in three cycles, a 40% reduction versus the A57. FP adds also take three cycles—versus four on the A57. As a result, the latency for combining the two operations in a fused multiply-add is six cycles, a 33% drop versus the prior generation. Floating-point division is now served by a radix-16 divider with double the bandwidth of the old unit, too.
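Those percentages check out with simple arithmetic. The A57 baselines of a five-cycle multiply and four-cycle add are implied by the stated reductions; the sketch below just verifies the figures are mutually consistent:

```python
# FP latencies in cycles, per the figures quoted above.
a57_fmul, a57_fadd = 5, 4   # baselines implied by the stated reductions
a72_fmul, a72_fadd = 3, 3

# Multiply: (5 - 3) / 5 = 40% reduction.
mul_reduction = (a57_fmul - a72_fmul) / a57_fmul

# Treating fused multiply-add latency as a multiply feeding a
# dependent add: 9 cycles on the A57 versus 6 on the A72.
a57_fma = a57_fmul + a57_fadd
a72_fma = a72_fmul + a72_fadd
fma_reduction = (a57_fma - a72_fma) / a57_fma  # 3/9, i.e. ~33%

print(f"FMUL reduction: {mul_reduction:.0%}")
print(f"FMA: {a72_fma} cycles, {fma_reduction:.0%} lower than the A57")
```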

In keeping with its M.O., the A72 team has tuned all of these units for power-efficient operation, and it has tweaked the load-balancing algorithm from the issue queue to ensure fuller utilization of these quicker execution resources.

The A72's memory subsystem has seen extensive tuning, as well. The caches should be kept warm with relevant data thanks to a hardware prefetcher, situated in the L1 cache complex, that can retrieve data into both the L1 and L2 caches. The L2 cache has also been tuned for higher bandwidth. Beyond that, the A72 should typically be paired with ARM's CCI-500 north bridge interconnect, which can increase available memory bandwidth by as much as 30%. (This area was a pain point in the Exynos 5433, so the change is welcome.) Again, power efficiency was a specific target of optimization, both in the load/store unit and in the L2 cache. In addition to the usual tuning of logic and local memories, the team worked to reduce the L2's power draw at idle.