Inside ARM’s Cortex-A72 microarchitecture

Thanks in part to the smartphone market’s rapid move toward 64-bit-capable processors, ARM’s licensable CPU cores have seen an upsurge in high-visibility deployments this year. From the Exynos version of Samsung’s Galaxy Note 4 to newer offerings like the Qualcomm-based LG G4, ARM’s Cortex-A57 has become the new standard for processing power in Android phones. That’s a bit of a change from the last couple of years, when custom cores like Qualcomm’s Krait dominated the same landscape.

The folks at ARM know that they have to keep progressing in order to remain competitive with the customers who license their instruction set for custom core development. Thus, they’ve already announced the next-generation Cortex-A72 CPU core and, last week at a press event in London, they revealed the first details of this new core’s internal architecture.

Into the A72

The Cortex-A72 is the latest iteration of ARM’s largest CPU core, although it’s probably a mid-size core in the grand scheme of things. It’s quite a bit smaller than Intel’s Broadwell or the latest Apple CPU core in the A8 SoC, for instance.

The A72 is a heavily revised version of the Cortex-A57 core that it supplants, which in turn owes a debt to the Cortex-A15. Mike Filippo, an ARM Fellow and Lead Architect, told us that the A72 team started with the A57 and then “optimized every block” in the CPU in order to squeeze out higher performance and improved energy efficiency. Since the A72 will be used in chips intended for both mobile- and server-class applications, ARM gave it a rich feature set meant to cover the necessary bases in both markets.

Like many of ARM’s cores, the fundamental building block of the Cortex-A72 is a cluster of up to four discrete CPU cores sharing a common L2 cache. For the A72, that cache can be as small as 512KB or as large as 4MB. The cluster talks to the rest of the SoC via a 128-bit AMBA interface, and one can expect to see chips incorporating fairly large numbers of quad-core A72 clusters for certain markets.

Simplified block diagram of the Cortex-A72. Source: ARM.

From high altitude, the A72 doesn’t look that different from the A57 that came before it. The core has an in-order front end with an out-of-order back end and memory subsystem. It can fetch three instructions per clock cycle and issue up to eight micro-ops to the execution units. The updates in the A72 widen some data paths while improving efficiency at each stop along the way.

The first stop is the branch prediction unit, one of those blocks that nearly every architectural update seems to touch. Filippo said the team “effectively rebuilt” the unit in the A72. As usual, the primary goal in the rebuild was to increase the accuracy of this unit’s predictions. Doing so can improve performance and power efficiency by reducing the time and power spent computing branches that programs ultimately don’t take. Filippo claims a new algorithm “significantly” improves the A72’s prediction accuracy, and it’s coupled with a host of targeted tweaks. Those tweaks pay off in more than just accuracy. Filippo told us the new branch prediction unit itself operates in more energy-efficient fashion than the one in the Cortex-A57.

Some of the other key changes to the A72 have to do with the way instructions and data flow through the machine. Like a lot of modern CPUs, the A72 translates instructions from the external ARMv8 instruction set, exposed to software and compilers, into its own internal operations, known as micro-ops. The reality, in fact, is even more complex. The A72 can fetch three ARMv8 instructions, decode them into three macro-ops—an intermediate format used internally—and then dispatch up to five micro-ops into the issue queues in each clock cycle. These queues, which operate independently, can then issue up to eight micro-ops into the execution units in a single tick of the clock.

Logical block diagram of Cortex-A72. Source: ARM.

This mix of per-clock throughput through fetch, decode, dispatch, and the issue queues might seem unbalanced at 3-3-5-8, but remember, we’re dealing with different units of work at different stages. ARMv8 instructions don’t tend to break down into large numbers of micro-ops: on average, Filippo said, each ARMv8 instruction translates into just 1.08 micro-ops. Also, not all of the “cracking” of complex instructions into ops happens in the decode stage. Some of it happens in the dispatch units, when those intermediate macro-ops are translated into micro-ops—hence the dispatch unit’s ability to take in three macro-ops and output five micro-ops. That’s an upgrade from the dispatch unit in the A57, which can only output three micro-ops per cycle for a 3-3-3-8 flow.
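If you want to convince yourself the 3-3-5-8 arrangement works out, here’s a quick back-of-the-envelope model. The stage widths and the 1.08 ratio come from ARM’s figures above; the model itself is my own simplification, not anything ARM provided.

```python
# Back-of-the-envelope check of the A72's 3-3-5-8 per-clock widths,
# using the figures quoted in the article. A hypothetical model, not ARM's.

FETCH_WIDTH = 3          # ARMv8 instructions fetched per cycle
DISPATCH_WIDTH = 5       # micro-ops dispatched per cycle
ISSUE_WIDTH = 8          # micro-ops issued per cycle (burst ceiling)
UOPS_PER_INSTR = 1.08    # average micro-ops per ARMv8 instruction (per ARM)

# Sustained micro-op supply is bounded by the narrowest upstream stage.
sustained_uops = min(FETCH_WIDTH * UOPS_PER_INSTR, DISPATCH_WIDTH)
print(f"Sustained micro-op supply: {sustained_uops:.2f} per cycle")

# The 8-wide issue stage therefore runs well under its ceiling on average.
# The headroom lets the independent queues drain bursts of ready ops
# (say, after a cache miss resolves) without becoming the bottleneck.
print(f"Average issue utilization: {sustained_uops / ISSUE_WIDTH:.0%}")
```

In other words, fetch supplies only about 3.2 micro-ops per cycle on average, so the eight issue slots are there for bursts, not for steady-state flow.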

That said, the progression from ARMv8 instructions to micro-ops isn’t all about simplification, either. Filippo explained that the A72’s micro-ops aren’t “super-simple steps.” Instead, “we do some fairly complex things with the back-end micro-ops.” In some cases, the decoder can fuse multiple ARMv8 instructions into a single macro-op; this is another new capability added in the A72. Thus, the role of the CPU’s front-end units is proper formatting as much as anything. Since micro-ops can sometimes take multiple cycles to execute, the CPU doesn’t require eight decode or dispatch slots per cycle in order to keep the issue queues full. The A57, remember, gets by pretty well with a 3-3-3-8 config.

The decode/rename and dispatch/retire units have been the subject of block-level optimizations, as well. Many of these changes have to do with the operation of local storage arrays—buffers in the decode unit and registers in the dispatch/retire unit. Filippo said a power-oriented reorganization of the register files produced a “significant reduction” in the number of ports used and the amount of chip area consumed. He believes careful sharing of the remaining ports should allow the A72 to be “performance neutral” on this front, with “no meaningful performance drop-off” from stalls caused by port contention.

The core’s basic complement of execution units, shown in the diagram above, looks to be pretty much the same as the A57’s. The queues can issue one instruction to each of the two single-cycle integer ALUs, one to the branch unit, one to the multi-cycle ALU, two to the floating-point/SIMD units, and two to the load/store units. That’s a total of eight instructions issued, as we’ve already noted.
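Tallied up, the issue-port breakdown above accounts for the full eight-wide issue rate. The port labels in this little sketch are my own shorthand for the units in ARM’s diagram, not official terminology.

```python
# Per-cycle issue capacity of the Cortex-A72, from the unit breakdown
# described above. Labels are informal shorthand, not ARM's names.
issue_ports = {
    "simple integer ALU": 2,      # two single-cycle ALUs
    "branch": 1,
    "multi-cycle integer ALU": 1,
    "FP/SIMD": 2,
    "load/store": 2,
}

total = sum(issue_ports.values())
print(f"Total micro-ops issued per cycle: {total}")
```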

Some of those execution units are substantially improved in the A72. The integer units have added a radix-16 divider with twice the bandwidth of the A57’s divider. They’ve also added a number of zero-cycle forwarding paths, so data can travel to the next stop in the pipeline immediately, without a one-cycle bubble.

The FP/SIMD units have the most extensive changes, with markedly lower latencies for key instructions. Floating-point multiplication now happens in three cycles, a 40% reduction versus the A57. FP adds also take three cycles—versus four on A57. As a result, the latency for combining the two operations in a fused multiply-add is six cycles, a 33% drop versus the prior generation. Floating-point division is now served by a radix-16 divider with double the bandwidth of the old unit, too.
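The percentages quoted above are easy to verify. Note that the A57 baselines here (a 5-cycle FMul and a 4-cycle FAdd) are inferred from the stated reductions rather than quoted directly by ARM.

```python
# Consistency check of the FP latency figures quoted in the article.
# A57 baselines are inferred from the percentage reductions, not official.

a57 = {"fmul": 5, "fadd": 4}   # inferred A57 latencies, in cycles
a72 = {"fmul": 3, "fadd": 3}   # stated A72 latencies, in cycles

for op in ("fmul", "fadd"):
    cut = 1 - a72[op] / a57[op]
    print(f"{op}: {a57[op]} -> {a72[op]} cycles ({cut:.0%} reduction)")

# A fused multiply-add chains the two operations, so its latency is the sum:
fma_a57 = a57["fmul"] + a57["fadd"]   # 9 cycles
fma_a72 = a72["fmul"] + a72["fadd"]   # 6 cycles
print(f"fma: {fma_a57} -> {fma_a72} cycles "
      f"({1 - fma_a72 / fma_a57:.0%} reduction)")
```

The multiply works out to a 40% cut and the fused multiply-add to 33%, matching the figures ARM gave.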

In keeping with its M.O., the A72 team has tuned all of these units for power-efficient operation, and it has tweaked the load-balancing algorithm from the issue queue to ensure fuller utilization of these quicker execution resources.

The A72’s memory subsystem has seen extensive tuning, as well. The caches should be kept warm with relevant data thanks to a hardware pre-fetcher situated in the L1 cache complex that can retrieve data into both the L1 and L2 caches. The L2 cache has been tuned for higher bandwidth, as well. Beyond that, the A72 should typically be paired with ARM’s CCI-500 north bridge interconnect, which can increase available memory bandwidth by as much as 30%. (This area was a pain point in the Exynos 5433, so the change is welcome.) Again, power efficiency was a specific target of optimization, both in the load/store unit and in the L2 cache. In addition to the usual tuning of logic and local memories, the team worked to reduce the L2’s power draw at idle.

 

Quantifying the goodness

The final product of all of this tuning and optimization is a core that’s substantially improved compared to the Cortex-A57. For one thing, the Cortex-A72 is simply smaller than the A57, with a 10% reduction in chip area on the same manufacturing process.

Meanwhile, the A72 offers higher per-clock throughput across a range of workloads—from 16 to 50%, as illustrated below.

Cortex-A57 vs. Cortex-A72 per-clock performance. Source: ARM

The largest gains come in memory-sensitive workloads, but improvements of 16% in integer math and 26% in floating-point are still considerable.

Clock speeds should be up generally, too. Filippo told us the team’s target was to reach the same frequencies as the A57, but in the end, the A72 wound up better able to tolerate high frequencies. As a result, we may see a few hundred megahertz of additional clock speed out of A72-based SoCs, with peaks in the neighborhood of 2.5GHz.

More important for real-world performance is the A72’s potential to sustain its peak clock speed over time. That ability comes courtesy of some major improvements in power efficiency, as illustrated below.

Source: ARM.

This comparison is a little tricky because it’s primarily against the Cortex-A15 and involves differences in process technology, as well. Still, the green bars show the impact of the core changes alone at a common 28-nm process. ARM expects the A72 to consume 50% less power than the A15 and, by my estimate, roughly 19% less than the A57 while achieving the same performance. (These numbers involve lower clock speeds for the newer cores, since they have higher per-clock throughput.)

These days, improvements in CPU power efficiency generally translate almost directly into performance, since CPUs tend to be heavily power-constrained. Many A57-based SoCs tend to dial back CPU core clocks during longer periods of sustained activity in order to keep temperatures in check. By contrast, Filippo expects the A72 to be able to operate at its peak frequency for sustained periods.

Combine the A72’s higher power efficiency with the expected gains in per-clock performance and clock speeds, and you’re looking at a pretty substantial generational leap. Filippo credibly calls the A72 a “next-gen design.” Factor in the expected benefits of the transition to 14/16-nm-class chip fabrication processes, and by my rough math, the next wave of ARM-based devices could achieve roughly double the sustained performance of 20-nm SoCs based on Cortex-A57. The gains would be more modest in short, bursty workloads where the A57 is able to operate at its peak clocks.
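Here’s the rough math sketched out. Every input below is an illustrative guess of mine, not an ARM figure: the per-clock gain uses the midpoint of the quoted 16-50% range, and the sustained A57 clock assumes the throttling typical of 20-nm parts.

```python
# Hedged reconstruction of the "roughly double" estimate for sustained
# performance of a 14/16-nm A72 versus a 20-nm A57. All inputs are
# illustrative guesses, not official numbers.

ipc_gain = 1.3            # midpoint of the quoted 16-50% per-clock gains
a57_sustained_ghz = 1.6   # assumed: 20-nm A57s throttle well below peak
a72_sustained_ghz = 2.5   # ARM's sustained-operation target on 16FF+

speedup = ipc_gain * (a72_sustained_ghz / a57_sustained_ghz)
print(f"Estimated sustained speedup: ~{speedup:.1f}x")
```

Shift any of those assumptions and the answer moves, of course, but it takes fairly pessimistic inputs to land much below 1.7x.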

As with the A57, an A72 cluster is likely to be paired with a cluster of Cortex-A53 cores as part of ARM’s big.LITTLE asymmetric multiprocessing scheme. Such a pairing should allow the A72 cores to remain power-gated off during light work, further improving power efficiency.

One intriguing question is whether the A72’s cumulative advances will be sufficient to win it a big presence in premium smartphones, as the A57 has now. The improvements we’ve cited could be enough to put the Cortex-A72 at parity with or slightly ahead of Apple’s current custom CPU core in the A8, but Apple will likely have something more potent to offer with the next iPhone refresh. The A72 will also have to contend with upcoming custom cores from Qualcomm, Samsung, and possibly others. One could reasonably expect those firms to use their own cores in their next-gen SoCs unless those cores turn out to be clearly uncompetitive.

Of course, regardless of what happens there, some portion of the mobile SoC market will surely adopt the A72, especially low-cost and quick-turnaround artists like MediaTek. The A72 will undoubtedly set a new performance standard for that portion of the market.

The pitch for Cortex-A72 in data centers

ARM continues to push for its A-series processors to make inroads into the data center, and the Cortex-A72 is a big part of its plans on that front. ARM isn’t shy about making fairly direct comparisons between the A72 and the Haswell and Broadwell cores Intel uses in its Xeon products, even though ARM’s case for its intrusion into the data center clearly involves a form of asymmetric warfare.

The workloads and applications where ARM thinks SoCs based on its technology are likely to steal business from the Xeon tend to involve relatively simple, throughput-based tasks. In those cases, ARM’s relatively small, low-power cores have the potential to compete well, in part simply because smaller CPU cores may be better suited to the job—particularly when power consumption comes into play, as it so often does.

Source: ARM.

Above is an example of a possible ARM-based SoC architecture meant for data-center applications. This chip has four quad-core A72 clusters paired with four quad-core A53 clusters, for a total of 16 “big” cores and 16 “little” cores. With up to 32MB of L3 cache, four channels of DDR4 memory, and tons of I/O bandwidth on tap, an SoC like this one could work well when serving certain types of workloads—perhaps anything from driving a network appliance to running a more traditional server application like a web-caching layer.

ARM pitches the A72 as the rough performance equivalent of a single thread on an Intel Broadwell core. Since each Broadwell core can track and execute two threads via Hyper-Threading, the basic idea is that two A72 cores roughly match one full Broadwell core. That comparison won’t fly when you’re talking about single-threaded performance, where the Intel CPU is likely to have a substantial advantage, but we can assume it might make some sense in server-class applications with an abundance of threads. Now consider the power and density picture.

Die size comparison of ARM Cortex-A72 and Intel Broadwell cores. Source: ARM.

ARM points out that a single Cortex-A72 core built on TSMC’s 16FF+ process occupies roughly 1.15 mm² of die space, while a single Broadwell core with 256KB of L2 cache covers about eight mm² on Intel’s 14-nm process. The more apt comparison may be what ARM can fit into the same eight square millimeters as the Broadwell: four A72 cores and 2MB of L2 cache. For dense computing environments addressing applications that require lots of raw throughput and relatively simple code execution, the A72 could form the basis of a compelling solution.
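The density arithmetic is straightforward. The core and total areas below come from ARM’s slide; the L2 allocation is my residual estimate, not a number ARM quoted.

```python
# Density arithmetic behind ARM's die-size comparison. The 1.15 mm^2 and
# 8 mm^2 figures are from ARM's slide; the L2 area is a rough residual.

a72_core_mm2 = 1.15   # one A72 core on TSMC 16FF+ (per ARM)
broadwell_mm2 = 8.0   # one Broadwell core + 256KB L2 on Intel 14nm (per ARM)

cores_area = 4 * a72_core_mm2            # four A72 cores
l2_budget = broadwell_mm2 - cores_area   # space left for the shared L2
print(f"4 x A72 cores: {cores_area:.1f} mm^2, "
      f"leaving {l2_budget:.1f} mm^2 for ~2MB of L2")
```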

Source: ARM.

Here’s a comparison of a 20-thread SPECint_rate2006 workload running on a 10-core, 20-thread Haswell-EP-based Xeon versus a couple of “example” (I believe emulated, not actual hardware) 20-core ARM-based SoCs, one using Cortex-A57 and the other the A72. ARM claims to be able to match the Xeon’s performance while consuming under a third of the power—less than 30W versus 105W for the Xeon.
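Taking the slide’s figures at face value, the implied efficiency gap is easy to compute; this is just the arithmetic, not an endorsement of the numbers.

```python
# Perf-per-watt implied by ARM's SPECint_rate2006 claim: equal throughput
# at under 30W for the 20-core A72 SoC versus 105W for the 10-core Xeon.

xeon_watts, a72_watts = 105, 30
print(f"Implied perf/W advantage: >{xeon_watts / a72_watts:.1f}x")

# Per-core power at those envelopes:
print(f"Xeon: ~{xeon_watts / 10:.1f}W per core; "
      f"A72 SoC: <{a72_watts / 20:.1f}W per core")
```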

Source: ARM.

ARM is even willing to take on the Broadwell core and, by proxy, the Xeon D processor by making a comparison to a Core M-based Dell Venue Pro II, which is evidently the only Broadwell they were able to wrangle so far. These examples are obviously cherry-picked by ARM to put the A72 in a good light, and the Core M system in question is clearly thermally constrained. I’m happy to pass on the results above by way of illustration, but I’m not sure they’re a good indication of what one should expect from a comparison of true server-class SoCs.

The thing is, we could keep doing this sort of thing all day, picking out workloads whose specific needs might be best served by an array of small, low-power CPU cores. That’s true even if the majority of server-class applications don’t fall into that category and would be better served by a bigger, beefier Xeon. I suppose that’s ARM’s underlying point: that even with the Xeon D looking formidable, there’s still plenty of room in the data center for tailored solutions based on ARM processors to capture some business.

I can’t help but wonder if the best targets for such ARM-based SoCs aren’t places where ARM and its partners already have a pretty big presence in the data center, though. Devices like network switches, routers, and storage controllers already make use of ARM’s IP in great measure. At this point, Intel is making a push to win some of that business away from ARM, but doing so may prove difficult if Intel won’t license its CPU cores for use in purpose-built chips. Meanwhile, ARM seems to have aspirations to capture some of the more traditional server market from Intel, and that prospect seems awfully challenging, too. Perhaps ARM and its partners can make some inroads by making a fairly narrow case along the lines outlined above for the adoption of A72-based solutions at places like Facebook and Google, if their needs align with ARM’s specific strengths.

Comments closed
    • BobbinThreadbare
    • 4 years ago

    I feel like the comparison chart with the Xeon is a worst-case scenario for Intel. The key is that it’s a situation where actual CPU speed doesn’t matter. So you could use the lowest-wattage Intel chip, which should use a lot less than 105W.

    • Klimax
    • 4 years ago

    All good and pretty, until close encounter with Broadwell and Skylake…
    (good neutral benchmarks)

    • tootercomputer
    • 4 years ago

    Quite a few tech sites wrote about this new ARM chip. So what is the upshot of this processor? It simply appears to be a somewhat faster, somewhat more efficient chip than its predecessor. Why all the attention?

      • flip-mode
      • 4 years ago

      “Somewhat faster” seems pretty unfair. If Intel’s next (what is it… Skylake?) chip was 16%-50% faster than Broadwell people would be crapping hyperbolic paraboloids.

        • tootercomputer
        • 4 years ago

        Well, yes, on the one hand, I concur, but let’s see some independent data across a variety of metrics. The data presented here and elsewhere all comes from ARM. What did they do, fly in a bunch of tech site folks and do a presentation? I guess it was ARM’s TechDay 2015, in London. I’m just struck by how widely it has been reported: by Anandtech, PC Perspective, Arstechnica, and others.

          • Damage
          • 4 years ago

          Yes, ARM did a presentation of the A72 architecture, and no, devices based on it aren’t out yet–and may not be until late this year or sometime next. That’s how it goes in the licensed IP business. ARM’s core is ready, but it still has to be built into a real product, which takes time.

          FWIW, the Cortex-A57 architectural info in my Galaxy Note 4 Exynos review from early this year came from notes from the first ARM Tech Day event in spring 2013. We will eventually have hands on with a Cortex-A72 to test, but you’ve gotta give it time.

          As others have said, a new ARM core is a pretty big deal these days. ARM’s cores are shipping in a ton of devices and systems of various types, including all of the premium Android smartphones of the past ~6 mos. That’s why we’ve chosen to cover it.

        • Klimax
        • 4 years ago

        It’s easy to have large x% gains when otherwise you’re at best competing with Atoms…

      • smilingcrow
      • 4 years ago

      It’s a bigger deal than a new Intel chip these days.

      • the
      • 4 years ago

      Because faster and more efficient is a very hard thing to do at the same time.

      Intel doesn’t allow a 1% increase in power consumption into the design of their Core series unless it can provide a 2% increase in performance, to ensure that efficiency continually moves in the right direction. This is also why Intel’s recent releases haven’t been that big of leaps in terms of performance – the real performance changers likely consumed far too much power and were left on the cutting room floor.

    • UberGerbil
    • 4 years ago

    I don’t see ECC anywhere but the PCPerspective article mentioned it so I assume it’s in there as an option. That’s going to be important if they’re targeting the high-bandwidth, low-compute market for servers. Even the lowest of the Xeons is way over-powered for a storage server, but that’s what you have to spec to get ECC from Intel (and I don’t think AMD is getting many design wins for public/private clouds or even personal NAS appliances).

      • Damage
      • 4 years ago

      Yeah, ECC is an option everywhere. Was available on the A57, too, so I didn’t call it out.

      • smilingcrow
      • 4 years ago

      Plenty of Intel i3 CPUs have ECC support as someone posted here recently. Limited choice of mobos with ECC that support those chips though.

        • UberGerbil
        • 4 years ago

        An i3 is overkill too, though, for something like a NAS.

          • smilingcrow
          • 4 years ago

          Of course but they are cheaper.

            • UberGerbil
            • 4 years ago

            Cheaper than the A72?

            My point is that there’s an opening in the market that ARM can exploit. Intel likes to turn the knobs to segment the market, but they leave themselves vulnerable if they don’t come up with a particular combination of knobs (low power with ECC) that part of the market wants.

            • smilingcrow
            • 4 years ago

            You are comparing a socketed desktop CPU with a mobile SoC so you are at the mercy of device manufacturers to build a suitable device for you that supports ECC and the other features you require. Whereas with a socketed CPU you are in control of the whole build.
            No reason why Intel couldn’t release an Atom with ECC support if they want to address that market sector.
            More competition is usually good.

            • smilingcrow
            • 4 years ago

            Plenty of Haswell Celerons and Pentiums with ECC support around $45-55.
            [url<]http://ark.intel.com/search/advanced?s=t&FilterCurrentProducts=true&SocketsSupported=FCLGA1150&MarketSegment=DT&CoreCountMin=2&CoreCountMax=2&ThreadCountMin=2&ThreadCountMax=2&ECCMemory=true[/url<] The issue is motherboard support not the CPUs.

            • srg86
            • 4 years ago

            Baytrail Atom E38xx series also has ECC support (on one channel).

    • jjj
    • 4 years ago

    “The improvements we’ve cited could be enough to put the Cortex-A72 at parity with or slightly ahead of Apple’s current custom CPU core in the A8”

    You seem to be doing the math very wrong. Isn’t A57 at 2.1GHz on 14nm very very close already?
    You add the, let’s say, 10% perf gains from the core and the 19% clock gains, and it would be far ahead while being able to include 4 cores, not just 2, in a phone and being much smaller.
    So Snapdragon 620 with A72 at 1.8GHz would beat the iphone but not the ipad. At 2.5GHz A72 should obliterate the iphone and easily beat the ipad.

    I’m also looking forward to AMD’s core since it’s likely by far the biggest and fastest ARM core yet. Too bad they don’t seem to have plans for a quad core with ample TDP consumer SoC with it for next year.

      • Damage
      • 4 years ago

      Depends on the test. I’m mostly just being conservative, but there are cases where the iPhone 6 CPU still soundly beats an A57. I’m not terribly confident in current mobile benchmarks, either. Seems like some of what we know is kinda shaky. Hence the conservative handicapping.

        • jjj
        • 4 years ago

        Yeah the benchmarks do kinda suck but the focus should be on the CPU benchmarks not the very popular Java ones or who knows what else.
        I use Geekbench as a guide for now because there are actual results available for a MT8173 that apparently is clocked at 2GHz (the SoC is supposed to hit 2.4GHz). I do look at integer and FP scores, not at total, since the memory score is very high and can distort the comparison with some SoCs.
        A72 looks very exciting if they can deliver at the targeted 750mW.

        PS: Mediatek yesterday during their results call confirmed that Helio x20 exists (2xA72 @ 2.5GHz plus 8xA53 in 2 quad clusters) and disclosed that it will be shipping to customers in Q4.

          • blastdoor
          • 4 years ago

          [quote<]Yeah the benchmarks do kinda suck but the focus should be on the CPU benchmarks not the very popular Java ones or who knows what else.[/quote<]

          Why should that be the focus? Seems to me the focus should be on the benchmark that does the best job of addressing what people are interested in learning. Different people have different interests.

          As noted in the article, some data center people might be interested in total throughput rather than single-thread performance. For those people, multicore Geekbench might be more relevant than a javascript web benchmark that is run on the built-in browser on a phone. And it would seem that the A72 *clearly* serves these people far better than Apple's cores (which is lucky, since those people cannot buy Apple's cores).

          For people who will be making a lot of use of the stock browser on whatever smartphone they buy, though, multicore Geekbench is irrelevant and the javascript benchmarks matter a lot. Apple's cores best serve these people (which, by amazing coincidence, is the only group of people who get access to Apple's cores).

        • chuckula
        • 4 years ago

        [quote<]. I'm not terribly confident in current mobile benchmarks, either.[/quote<] This x1000. Current benchmarks in the mobile world make the PC benchmark wars of the late 90s look like detailed scientific experiments with 6 sigma confidence levels.

      • mczak
      • 4 years ago

      Also, I doubt the 19% clock gains. The comparison shows a 2.2GHz A57 at 20nm and a 2.5GHz A72 at 16nm (FinFET). Don’t forget that even according to ARM the target was the same clocks, so the “accidental” higher clock room is probably really small (I would guess no more than ~10%). For reference, Samsung clocks their 14nm FinFET (granted Samsung tech, not TSMC, not sure exactly how they’d compare) at “only” 2.1GHz. Doesn’t mean the A72 at 16nm TSMC couldn’t reach 2.5GHz, but it potentially might not make sense to clock it at that in a smartphone.
      It probably could reach Apple A8 single-thread performance at clocks around ~2GHz (as the Exynos 7, arguably the best A57 right now, still fails at that). Multi-thread, it should obviously easily beat that, at least for short burst loads – for sustained loads I haven’t really seen any comparisons with the Apple A8 (even the Exynos 7 has to drop back to ~1GHz for this). But the competitor of course should be the Apple A9 in any case (which, as is usual for Apple, no one outside Apple really seems to have any idea what it’s going to look like).

        • jjj
        • 4 years ago

        You got it wrong.
        Obviously we are talking about targets; it remains to be seen what they can reach.

        The initial press release said “Sustained operation within the constrained mobile power envelope at frequencies of 2.5 GHz in a 16nm FinFET process and scalable to higher frequencies for deployment in larger form factor devices.”
        They are a bit fuzzy on the process, if it’s 16FF or 16FF+, but they start by saying FF+ a couple of times before forgetting the +, so let’s assume it’s on 16FF+.
        If you look at some of the slides and articles available, you’ll notice that they always cite the A72 at 2.5GHz at 750mW per core. 750mW does allow plenty of room for a quad to stay at max load.
        Not long ago Anandtech looked at the Note 4 with Exynos with A57 on 20nm at 1.9GHz and found that a single core was being pushed north of 1.5W.

        So the targets are 2.5GHz at 750mW on 16FF+.
        ARM itself expects the core to be able to scale higher when it has more TDP room, as the press release states, so it’s not expected to hit a wall right at 2.5GHz.
        Given the competitive landscape, SoC makers will try to push advertised clocks (single core) as high as they can, and they do have ample room from a TDP perspective. How it does on each process (each version of 14/16), when it hits the power wall and stops scaling, we just can’t guess. If a SoC maker can get to 3.2GHz at 1.2W per core or 3.5 at 1.5W, they will do it; we just have no way of knowing how far they manage to push it safely.
        Of course these are targeted numbers and they might or might not hit them. But even if they can only hit 2.5GHz at 1W, there is TDP room for higher clocks in mobile.

        Then we have the announced SoCs
        Snapdragon 620 and 618 at 1.8GHz, unknown process but the clocks would suggest 28nm.
        MT8173, a tablet SoC with 2xA72 at 2.4GHz and 2xA53, on 28nm.
        Mediatek Helio x20, with 2xA72 at 2.5GHz plus 4+4 A53s, on 20nm.
        Those SoCs suggest that we are on track to see higher clocks than 2.5GHz on 14/16nm.

        As for Apple, people need to remember the die size: on 20nm their dual core plus cache is some 12.2mm2, and they also got a big chunk of SRAM as L3 (some 4.5mm2, maybe slightly below that).
        A quad A72 cluster including cache should be some 14.3mm2 on 20nm, with a single core at some 1.85mm2. An A53 quad cluster including cache on Samsung 20nm was 4.58mm2.
        So no matter how you look at it, Apple is at almost twice the size.
        On 14nm Samsung, A72 should be below 1.6mm2, so some 4 times smaller than Broadwell as far as I can tell.

          • mczak
          • 4 years ago

            There is no way they can stay at 2.5GHz with 4 cores for sustained loads. Just isn’t going to happen (otherwise that would imply something over 2 times higher efficiency than Cortex-A57, and not even ARM is claiming that). One core sustained, yes, if they can reach the clock with reasonable effort in the first place. I think you should always read that press material a bit pessimistically – Cortex-A57 was supposed to reach up to 3GHz as well (albeit probably not in phones).
            FWIW, Snapdragon 618/620 are confirmed at 28nm, so yes, somewhat lower clocks are expected there (though this should likely affect sustained clocks more than peak clocks).

            • jjj
            • 4 years ago

            I can only operate with existing info; you go for baseless numbers and expect that ARM will miss the stated targets by over 100%.
            I will point out one thing that you likely aren’t aware of and I should have mentioned.
            The 750mW target appears to be for the core only, without cache, interconnect, memory bus, and obviously all the non-CPU-related blocks. Those don’t scale in a linear fashion with the number of cores, but they do add power consumption and distort the measured numbers for existing cores.

            • mczak
            • 4 years ago

            Oh I was quite aware of that. Which is a big reason why it won’t work (4x750mW would probably be too much for sustained loads anyway in phone form factor, if they hit that).
            You might call my numbers baseless; I call them realistic based on similar marketing fluff (not just from ARM, this would be standard practice). Have you seen the Exynos 7420 throttling? It’s not bad compared to other phones, but the fact remains sustained loads drop back to an average of ~1.4GHz or so (and these are good numbers, way better than what the Snapdragon 810 achieves). [url<]http://arstechnica.com/gadgets/2015/04/23/in-depth-with-the-snapdragon-810s-heat-problems/.[/url<] So to assume an A72 could now suddenly easily sustain 2.5GHz given the same thermal constraints just isn’t realistic.
            Btw, you can also look at the numbers from ARM for the “simulated” 20-core chip. Ok, that’s 2.7GHz instead of 2.5GHz, but it still does not include I/O (but at least “realistic” memory power consumption). That is 30W; divide by 5 and you’ve got 6W. Even if you think it does a bit better, that’s still too much to be sustainable in a phone.
            (That said, not that any of that would actually really matter in a phone. I bet almost no one would actually even ever notice if there’s 2 or 4 high-end cores there...)

    • Sam125
    • 4 years ago

    Great article that ties in well with Ryszard’s article from a week and a half ago, Scott!

    If the benchmarks seem to hint at anything in particular, it would seem that ARM sees their SoCs in servers that typically have very light workloads but many concurrent users. Things like web servers, databases, web caches, virtual private servers, etc., and I believe that’s exactly how AMD is positioning ARM-based server parts like Seattle. ARM in servers might be interesting to follow if for no other reason than the novelty of it, IMO!

    • tuxroller
    • 4 years ago

    One day I wish that someone would explain exactly how improvements are made to the BPU with EVERY SINGLE UPDATE, regardless of node change.
    It’s not as though this isn’t a very well-researched area or a place where new CS work is being done (afaict), so how do they do it? What are they missing in the previous design that they CONSISTENTLY find with the next design?

      • Ryszard
      • 4 years ago

      I imagine it’s a little bit different for branch prediction on CPUs, but here’s how it works in general with GPUs (which I help design):

      Microarchitectural improvements like that are almost always the result of direct feedback from running a combination of existing and upcoming kinds of workloads. So an enormous body of code is run on a wide sweep of potential configurations of the processor in a full system setup (emulation or otherwise). Lessons learned from running that code guide changes to the microarchitecture and to how the processor works with the rest of the system architecture, to make it run faster, or more efficiently at the same performance.

      However, you can’t know what you don’t know, and you can’t run everything, so you’ll always miss some kinds of workload that you didn’t account for. Then sometimes it’s just new ways of thinking, new ways of design, changes in process technology, etc, that allow you to try new things and find changes to make outside of tuning the architecture just based on workload analysis.

      Then there’s just plain marketing, where certain results are picked to really highlight the improvements in a raw percentage sense, and so those are the numbers you hear about, which always sound great.

      In reality, the performance increase across all potential workloads, as a weighted average of their individual improvements, is always lower than the headline marketing figures. That’s the kind of data that falls out in a review containing a rigorous analysis.
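The gap Ryszard describes between headline and average gains is easy to demonstrate numerically. The workload names, speedups, and weights below are entirely made up for illustration:

```python
# Hypothetical per-workload speedups for a new core generation, and how often
# each workload class is assumed to occur in practice (weights sum to 1.0).
speedups = {"best_case": 1.50, "common_a": 1.12, "common_b": 1.08, "legacy": 1.02}
weights  = {"best_case": 0.05, "common_a": 0.40, "common_b": 0.40, "legacy": 0.15}

# Marketing picks the best case; real-world gain is the weighted average.
headline = max(speedups.values())
average = sum(speedups[w] * weights[w] for w in speedups)

print(headline)  # 1.5
print(round(average, 3))  # 1.108
```

The cherry-picked 50% headline shrinks to roughly an 11% average improvement once the rare best case is weighted realistically.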

        • tuxroller
        • 4 years ago

        Thank you so much for the response!
        Your second paragraph was the totality of my understanding of how the improvements can happen. What is surprising is that after, say, six generations of design, there would be any significant knowledge gained from tracing workloads, unless brand-new paradigms started appearing (specifically something like CUDA/OpenCL/AMP).
        The new-ways-of-thinking/design angle is the one I’m particularly interested in, and the one that, as I said, makes the least sense to me since, again, this is really well-trodden ground and you can always experiment with an FPGA without relying on a whole new design to test out ideas.
        The marketing angle was one I hadn’t considered at all.

    • derFunkenstein
    • 4 years ago

    I don’t like that energy comparison chart, because it LOOKS like they’re saying a 2.5GHz A72 will use 75% less energy than the 1.6GHz A15. In reality I think it’s comparing energy usage at equal performance. Who’s going to make a 16nm FinFET design that runs at 1.1GHz? (Answer: nobody.) It’s not a useful comparison. If 1.6GHz is typical for an A15, why not show energy use and relative performance compared to what ARM expects to be a “typical” A72 design, which appears to be 2.5GHz? If it’s just double the performance for the same power usage, fine, I guess.

    The A57-based designs from Qualcomm SEEM to run hot – such as the HTC One M9. That’s bad news, and not being terribly forward with power consumption now is also bad news.

      • tuxroller
      • 4 years ago

      It’s energy used with a CONSTANT workload (say, loading the same webpage).
      In short, they’re pushing the same race-to-idle conditions you typically see from the other companies, but they’re also claiming it can sustain higher clocks, with higher IPC, for longer than previous designs. How much higher is unknown (to me).

        • derFunkenstein
        • 4 years ago

        My bad, I guess. I took it to mean a constant load, like the core is constantly busy.

          • jjj
          • 4 years ago

          Their goal seems to be 750mW per core at 2.5GHz on 16FF (not sure if FF or FF+, and that does matter a bit). So in theory we should see even higher advertised peak clocks if they hit that 750mW target. It remains to be seen whether they get close to that and how high clocks can be pushed single-core, since a phone allows for a way higher TDP.

          Edit: take the A53, for example: it was targeted at 1.2GHz, but SoC makers pushed it up to 1.7GHz at first and now to 2-2.2GHz on the same process (28nm). So it’s safe to assume that some will try implementations at significantly higher clocks and TDPs. If they can safely push the A72 to 3.2GHz at 1.2W per core instead of doing 2.5GHz at 750mW, they’ll go there for sure. I’ve got no way of guessing how far they can push it, so we’ll just have to wait and see.
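jjj’s guess that 2.5GHz at 750mW might stretch to roughly 3.2GHz at 1.2W can be sanity-checked with the usual first-order dynamic power model, P ∝ f·V². The voltage values below are assumptions for illustration, not published figures for any 16nm process:

```python
# First-order dynamic CMOS power scaling: P1 = P0 * (f1/f0) * (V1/V0)^2.
# Higher clocks generally need a voltage bump, which is what makes power
# grow faster than frequency.
def scaled_power(p0_watts, f0_ghz, f1_ghz, v0, v1):
    return p0_watts * (f1_ghz / f0_ghz) * (v1 / v0) ** 2

# Start from the comment's 750mW @ 2.5GHz target; assume (hypothetically)
# a 0.80V -> 0.90V bump is needed to reach 3.2GHz.
p = scaled_power(p0_watts=0.75, f0_ghz=2.5, f1_ghz=3.2, v0=0.80, v1=0.90)
print(round(p, 2))  # ~1.21 W
```

Under those assumed voltages the model lands close to the 1.2W-per-core figure in the comment, which makes the guess at least internally consistent, though leakage and real voltage/frequency curves would move the number around.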

    • chuckula
    • 4 years ago

    The fact that they went out of their way to base most of those comparisons on the A15 is telling… the A15 was by far ARM’s weakest high-end core in recent memory.

    In other news, Intel is going to gauge Broadwell’s CPU performance against the P4 EE (Emergency Edition) parts.

      • tuxroller
      • 4 years ago

      I don’t know. The A15r4 was pretty damned good at a per-clock level. I’m not sure the current revisions of the A57 have caught up to it.
      The real weakness in all this, though, is big.LITTLE implementations. Until upstream comes up with a consensus on the energy-aware scheduler, it’s never going to behave like ARM promised.
      However, the new CCI will certainly make things better, as will a more efficient big core.
      BUT, until that scheduler materializes, ARM’s strategy will remain wrong.

        • the
        • 4 years ago

        An inclusive L3 cache or better prefetching to the big core could also address the lack of performance on a context switch between modes. Cache coherency between the big and LITTLE cache domains takes power when the rest of the big cluster isn’t active, so it isn’t used. Worst case, when switching between big and LITTLE, the caches miss all the way to main memory, hurting performance and using lots of power to read from DRAM. ARM’s discussion of server-class SoCs with L3 cache addresses these concerns, but mobile SoC makers thus far haven’t been including L3 cache due to the extra die area necessary, and thus cost. So, ironically, the mobile market that big.LITTLE targets is also where it’s most hampered.

        I’ve been of the mindset that for good big.LITTLE performance, ARM would need to offer an AMD Bulldozer-like module containing both styles of cores. The logic shared between them would mainly be the caches, and perhaps a few infrequently used components like hardware square root/division. The main reason for merging them is that when a context switch is performed, the caches on the big core are generally empty of relevant data, so performance is lower after the transition. This mode could also get away with an older scheduler by dictating the transition in hardware, as long as the big and LITTLE cores are not simultaneously visible to the OS.

        I’m also wondering if smarter instruction dispatch inside a single large core could achieve results similar to big.LITTLE. Effectively, restrict the width of the design while in a low-power state; as activity increases, additional instructions can be simultaneously dispatched and executed. It’d save power through the reduction in execution-unit usage, but the only front-end savings would come from changing the number of active instruction decoders. Considering that ARM decoders are small and simple compared to the monstrosity that is x86, that wouldn’t save much power either. How well this approach would work depends on how well optimized the code is: the better the optimization, the less power this approach would actually save.

      • nico1982
      • 4 years ago

      [quote<]The fact that they went out of their way to base most of those comparisons on the A15 is telling...[/quote<] Unless we are looking at totally different slides, the A15 shows up only once.

      • willmore
      • 4 years ago

      Are you confusing the performance of the A15 with the poor first implementation Samsung made of it? The first Chromebook they released with one didn’t have good power specs, but it performed just fine.
