Quantifying the goodness
The final product of all of this tuning and optimization is a core that's substantially improved compared to the Cortex-A57. For one thing, the Cortex-A72 is simply smaller than the A57, with a 10% reduction in chip area on the same manufacturing process.
Meanwhile, the A72 offers higher per-clock throughput across a range of workloads—from 16 to 50%, as illustrated below.
The largest gains come in memory-sensitive workloads, but improvements of 16% in integer math and 26% in floating-point are still considerable.
Clock speeds should be up generally, too. Filippo told us the team's target was to reach the same frequencies as the A57, but in the end, the A72 wound up better able to tolerate high frequencies. As a result, we may see a few hundred megahertz of additional clock speed out of A72-based SoCs, with peaks in the neighborhood of 2.5GHz.
More important for real-world performance is the A72's potential to sustain its peak clock speed over time. That ability comes courtesy of some major improvements in power efficiency, as illustrated below.
This comparison is a little tricky because it's primarily against the Cortex-A15 and involves differences in process technology, as well. Still, the green bars show the impact of the core changes alone at a common 28-nm process. ARM expects the A72 to consume 50% less power than the A15 and something like 19% less than the A57 while achieving the same performance. (These numbers involve lower clock speeds for the newer cores, since they have higher per-clock throughput.)
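For a rough cross-check on how those two savings relate, here's a minimal Python sketch of the arithmetic. It assumes both figures are measured at the same performance level on the common 28-nm process, and it treats the approximate 19% number as if it were exact, which it isn't. The function and variable names are mine, purely for illustration.

```python
def relative_power(saving_fraction):
    """Power relative to a baseline after a fractional saving (0.50 means 50% less)."""
    return 1.0 - saving_fraction

# Figures quoted in the text, at equal performance on a common 28-nm process.
a72_vs_a15 = relative_power(0.50)  # A72 draws 0.50x the A15's power
a72_vs_a57 = relative_power(0.19)  # A72 draws about 0.81x the A57's power (approximate)

# Implied A57 power relative to the A15 at the same performance level.
a57_vs_a15 = a72_vs_a15 / a72_vs_a57

print(f"Implied A57 power vs. A15: {a57_vs_a15:.2f}x")  # prints 0.62x
```

In other words, if both of ARM's figures hold, the A57 would need a bit over 60% of the A15's power to do the same work, and the A72 trims that to half.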
These days, improvements in CPU power efficiency generally translate almost directly into performance, since CPUs tend to be heavily power-constrained. Many A57-based SoCs tend to dial back CPU core clocks during longer periods of sustained activity in order to keep temperatures in check. By contrast, Filippo expects the A72 to be able to operate at its peak frequency for sustained periods.
Combine the A72's higher power efficiency with the expected gains in per-clock performance and clock speeds, and you're looking at a pretty substantial generational leap. Filippo credibly calls the A72 a "next-gen design." Factor in the expected benefits of the transition to 14/16-nm-class chip fabrication processes, and by my rough math, the next wave of ARM-based devices could achieve roughly double the sustained performance of 20-nm SoCs based on Cortex-A57. The gains would be more modest in short, bursty workloads where the A57 is able to operate at its peak clocks.
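To show how a doubling could plausibly fall out of those ingredients, here's a back-of-the-envelope sketch in Python. The per-clock gains of 16% to 50% are ARM's figures from earlier; the sustained clock speeds are hypothetical placeholders I picked for illustration, not measurements and not ARM's numbers. The sketch also folds any process-related benefit into the sustained clock rather than modeling it separately.

```python
def sustained_speedup(ipc_gain, new_sustained_ghz, old_sustained_ghz):
    """Relative sustained performance: per-clock gain times the ratio of sustained clocks."""
    return (1.0 + ipc_gain) * (new_sustained_ghz / old_sustained_ghz)

# HYPOTHETICAL sustained clocks: an A72 holding a 2.5GHz peak versus
# an A57 that throttles to 1.5GHz under long-running load.
A72_SUSTAINED_GHZ = 2.5
A57_SUSTAINED_GHZ = 1.5

# Low and high ends of the per-clock improvement range ARM quotes.
for ipc_gain in (0.16, 0.50):
    speedup = sustained_speedup(ipc_gain, A72_SUSTAINED_GHZ, A57_SUSTAINED_GHZ)
    print(f"per-clock gain {ipc_gain:.0%}: {speedup:.2f}x sustained")
```

With those placeholder clocks, the result lands between roughly 1.9x and 2.5x. Swap in a milder throttling assumption for the A57 and the gain shrinks quickly, which is why the advantage should be smaller in bursty workloads.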
As with the A57, an A72 cluster is likely to be paired with a cluster of Cortex-A53 cores as part of ARM's big.LITTLE asymmetric multiprocessing scheme. Such a pairing should allow the A72 cores to remain power-gated off during light work, further improving power efficiency.
One intriguing question is whether the A72's cumulative advances will be sufficient to win it the kind of presence in premium smartphones that the A57 has now. The improvements we've cited could be enough to put the Cortex-A72 at parity with or slightly ahead of Apple's current custom CPU core in the A8, but Apple will likely have something more potent to offer with the next iPhone refresh. The A72 will also have to contend with upcoming custom cores from Qualcomm, Samsung, and possibly others. One could reasonably expect those firms to use their own cores in their next-gen SoCs unless those cores are somehow obviously not competitive.
Of course, regardless of what happens there, some portion of the mobile SoC market will surely adopt the A72, especially low-cost and quick-turnaround artists like MediaTek. The A72 will undoubtedly set a new performance standard for that portion of the market.
The pitch for Cortex-A72 in data centers
ARM continues to push for its A-series processors to make inroads into the data center, and the Cortex-A72 is a big part of its plans on that front. ARM isn't shy about making fairly direct comparisons between the A72 and the Haswell and Broadwell cores Intel uses in its Xeon products, even though ARM's case for its intrusion into the data center clearly involves a form of asymmetric warfare.
The workloads and applications where ARM thinks SoCs based on its technology are likely to steal business from the Xeon tend to involve relatively simple, throughput-based tasks. In those cases, ARM's relatively small, low-power cores have the potential to compete well, in part simply because smaller CPU cores may be better suited to the job—particularly when power consumption comes into play, as it so often does.
Above is an example of a possible ARM-based SoC architecture meant for data-center applications. This chip has four quad-core A72 clusters paired with four quad-core A53 clusters, for a total of 16 "big" cores and 16 "little" cores. With up to 32MB of L3 cache, four channels of DDR4 memory, and tons of I/O bandwidth on tap, an SoC like this one could work well when serving certain types of workloads—perhaps anything from driving a network appliance to running a more traditional server application like a web-caching layer.
ARM pitches the A72 as the rough performance equivalent of a single thread on an Intel Broadwell core. Since each Broadwell core can track and execute two threads via Hyper-Threading, the basic idea is that two A72 cores roughly match one full Broadwell core. That comparison won't fly when you're talking about single-threaded performance, where the Intel CPU is likely to have a substantial advantage, but we can assume it might make some sense in server-class applications with an abundance of threads. Now consider the power and density picture.
ARM points out that a single Cortex-A72 core built on TSMC's 16FF+ process occupies roughly 1.15 mm² of die space, while a single Broadwell core with 256KB of L2 cache is roughly 8 mm² on Intel's 14-nm process. The more apt comparison may be what ARM can fit into the same eight square millimeters as the Broadwell: four A72 cores and 2MB of L2 cache. For dense computing environments addressing applications that require lots of raw throughput and relatively simple code execution, the A72 could form the basis of a compelling solution.
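The density argument is easy to sanity-check with a few lines of Python. This sketch uses only ARM's rough area figures plus its claim that one A72 core matches one Broadwell thread. It assumes, on my part, that whatever area the four cores don't use goes to the shared L2 and surrounding logic. The constant names are made up for the example.

```python
A72_CORE_MM2 = 1.15        # one A72 core on TSMC 16FF+, per ARM (rough)
BROADWELL_CORE_MM2 = 8.0   # one Broadwell core with 256KB L2 on Intel 14 nm, per ARM (rough)
A72_CORES_IN_BUDGET = 4    # ARM's example: four A72 cores plus 2MB of L2

# Area taken by the four A72 cores, and what remains of the 8 mm^2 budget.
core_area = A72_CORES_IN_BUDGET * A72_CORE_MM2
left_for_l2 = BROADWELL_CORE_MM2 - core_area

# ARM's pitch: one A72 core is about one Broadwell thread,
# and each Broadwell core runs two threads via Hyper-Threading.
THREADS_PER_BROADWELL_CORE = 2
broadwell_equivalents = A72_CORES_IN_BUDGET / THREADS_PER_BROADWELL_CORE

print(f"A72 cores: {core_area:.2f} mm^2, leaving {left_for_l2:.2f} mm^2 for L2 and other logic")
print(f"Throughput in the same area: about {broadwell_equivalents:.1f}x one Broadwell core")
```

If ARM's thread-equivalence claim holds for a given workload, that works out to about twice the throughput of a Broadwell core in the same silicon footprint. The claim is doing a lot of work there, and it only applies to workloads with plenty of threads.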
Here's a comparison of a 20-thread SPECint_rate2006 workload running on a Haswell-EP-based Xeon with 10 cores and 20 threads versus a couple of "example" (I believe emulated, not actual hardware) 20-core ARM-based SoCs, one using Cortex-A57 and the other A72. ARM claims to be able to match the Xeon's performance while consuming under a third of the power: less than 30W versus 105W for the Xeon.
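Taken at face value, ARM's claim implies a specific efficiency gap, sketched below in Python. The only inputs are the two power figures ARM cites, and the sketch assumes performance really is equal on this workload. Since 30W is an upper bound for the ARM SoC, the result is a floor rather than an exact ratio. The variable names are mine.

```python
XEON_WATTS = 105.0     # Haswell-EP Xeon figure from ARM's comparison
ARM_SOC_WATTS = 30.0   # upper bound ARM claims for its 20-core example SoC

# With equal performance assumed, the perf-per-watt ratio is just the power ratio.
perf_per_watt_advantage = XEON_WATTS / ARM_SOC_WATTS
power_fraction = ARM_SOC_WATTS / XEON_WATTS

print(f"ARM SoC uses at most {power_fraction:.0%} of the Xeon's power")
print(f"Implied perf-per-watt advantage: at least {perf_per_watt_advantage:.1f}x")
```

That's at least 3.5 times the performance per watt on this one benchmark, which squares with the "under a third of the power" framing.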
ARM is even willing to take on the Broadwell core and, by proxy, the Xeon D processor by making a comparison to a Core M-based Dell Venue Pro II, which is evidently the only Broadwell system ARM has been able to wrangle so far. These examples are obviously cherry-picked by ARM to put the A72 in a good light, and the Core M system in question is clearly thermally constrained. I'm happy to pass on the results above by way of illustration, but I'm not sure they're a good indication of what one should expect from a comparison of true server-class SoCs.
The thing is, we could keep doing this sort of thing all day, picking out workloads whose specific needs might be best served by an array of small, low-power CPU cores. That's true even if the majority of server-class applications don't fall into that category and would be better served by a bigger, beefier Xeon. I suppose that's ARM's underlying point: that even with the Xeon D looking formidable, there's still plenty of room in the data center for tailored solutions based on ARM processors to capture some business.
I can't help but wonder if the best targets for such ARM-based SoCs aren't places where ARM and its partners already have a pretty big presence in the data center, though. Devices like network switches, routers, and storage controllers already make use of ARM's IP in great measure. At this point, Intel is making a push to win some of that business away from ARM, but doing so may prove difficult if Intel won't license its CPU cores for use in purpose-built chips. Meanwhile, ARM seems to have aspirations to capture some of the more traditional server market from Intel, and that prospect seems awfully challenging, too. Perhaps ARM and its partners can make some inroads by making a fairly narrow case along the lines outlined above for the adoption of A72-based solutions at places like Facebook and Google, if their needs align with ARM's specific strengths.