ARM lays the foundation for a data center invasion


Partners line up to help make it happen
— 11:57 AM on May 9, 2014

The announcement this week that AMD is working on an all-new, high-performance CPU architecture compatible with the ARM instruction set is huge news in its own right, but it's also an important step in a progression that's been unfolding in recent years. Bolstered by success in the mobile computing market, ARM and its partners have been gearing up to challenge the dominance of Intel in other parts of the computing world, including the data center.

ARM's licensing model means any capable chip company can use ARM's technologies, from its CPU instruction set to interconnect standards to specific blocks of logic, in order to build a product. ARM offers a broad suite of IP (or intellectual property) for its customers to employ as they wish, and some of what ARM offers is truly impressive, high-bandwidth technology aimed squarely at server-class applications.

Most ARM licensees aren't likely to challenge Intel by attempting to take on its potent Xeon processors head to head, as AMD may well do with its K12 core. They can, however, potentially gain a foothold in the lucrative enterprise market by tailoring ARM-compatible SoCs for specific classes of workloads. An old axiom of computing says that custom-designed hardware will inevitably be more efficient and cost-effective at a given job than a brute-force, general-purpose solution. Not every job demands the fastest processing core. If ARM's partners can build SoCs that efficiently handle workloads that are, for instance, more I/O-bound than compute-bound, then they can win business away from Intel without matching Xeon stride for stride in every respect.

In doing so, ARM and its customers could very well lower the cost of computing at a rate faster than the one Intel's famed Moore's Law prescribes, and we could see an expansion of the number of viable players in the business of building server-class silicon.

ARM is expending quite a bit of effort to make such a future possible, and it invited us to a press and analyst confab in Austin, Texas, last week in order to highlight some of that work.

There's really too much happening in the ARM ecosystem for us to offer anything like a comprehensive look at how the various companies involved are targeting the server space, but we should note that there is a widely distributed but concerted push for 64-bit ARM-based servers happening behind the scenes right now. The players include everyone from ARM itself to chipmakers like AMD and Applied Micro, from OEMs like HP to software vendors like Canonical and Red Hat—and to customers like Facebook and other cloud providers. The industry seems to want an alternative to Intel and its x86 ISA in the server space, and an awful lot of key players are putting in the effort to make that happen.

Rather than touring the whole scene, we'll take a look at a couple of examples of the sort of technology ARM is designing and its partners are implementing. The first one demonstrates that ARM is more than just a CPU company, and the second illustrates the current state of ARM-powered solutions for the data center.

AMBA 5 CHI and the uncore
Much of the innovation in microprocessors over the past decade has come not in the CPU microarchitectures themselves, but outside of the cores, in the plumbing that feeds these compute engines. We've spent an awful lot of keystrokes talking about the "uncore" complexes and chip-to-chip interconnects that surround Xeons and Opterons, and we probably still haven't entirely given them their due.

ARM offers two 64-bit CPU cores that can play a role in the server space, the smaller Cortex-A53 and the still-smallish-but-larger Cortex-A57. (The A57 is derived from the Cortex-A15 that's made its way into high-end smartphones.) To support these cores, ARM has defined an interconnect architecture called AMBA 5 CHI, and it has created a family of "uncore" products that implement this architecture. Mike Filippo, Lead Architect for ARM's Enterprise Systems Solutions, walked us through this spec and implementation in detail last week.

The AMBA 5 CHI spec describes a high-bandwidth interconnect for transporting data across a chip. AMBA 5 CHI is coherent, which means multiple connected clients (like CPU cores and I/O devices) can access a shared pool of memory safely. The interconnect hardware manages any hazards created by different clients trying to modify the same data simultaneously. In this respect and many others, AMBA 5 CHI is similar to standards like AMD's HyperTransport and Intel's QPI.
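
To make the hazard concrete, here's a toy sketch of invalidation-based coherence in Python. It illustrates the general technique a coherent interconnect implements in hardware; it is not the AMBA 5 CHI protocol itself, and every name in it is invented for illustration.

    class CoherentInterconnect:
        """Toy stand-in for the coherent fabric; not ARM's protocol."""
        def __init__(self):
            self.mem = {}        # backing store: address -> value
            self.caches = []

        def write(self, writer, addr, value):
            # The hazard management: invalidate every other cached copy
            # before the new value becomes visible.
            for cache in self.caches:
                if cache is not writer:
                    cache.lines.pop(addr, None)
            self.mem[addr] = value

    class Cache:
        def __init__(self, fabric):
            self.lines = {}
            self.fabric = fabric
            fabric.caches.append(self)

        def read(self, addr):
            if addr not in self.lines:               # miss: fetch from memory
                self.lines[addr] = self.fabric.mem.get(addr)
            return self.lines[addr]

        def write(self, addr, value):
            self.fabric.write(self, addr, value)
            self.lines[addr] = value

    fabric = CoherentInterconnect()
    core0, core1 = Cache(fabric), Cache(fabric)
    core0.write(0x100, 1)
    assert core1.read(0x100) == 1   # no client ever sees a stale copy
    core1.write(0x100, 2)
    assert core0.read(0x100) == 2   # core0's old line was invalidated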

AMBA 5 CHI is a layered architecture. It defines proper behavior at multiple layers, from the top-level protocol to routing to the link layer to low-level physical signaling. Oddly enough, Filippo says the spec is agnostic about topology; it can be deployed as a point-to-point link, a crossbar, a ring, a mesh, or what have you. The spec includes provisions for multiple virtual channels—essentially wire sharing—and the protocol layer allows for different flow-control policies.
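
What topology-agnosticism buys is easiest to see in code: the protocol layer issues the same request whether the routing layer happens to sit on a ring or a mesh. This hedged sketch uses invented names and plain breadth-first routing; the actual spec says nothing about how routes are computed.

    from collections import deque

    def ring(n):
        # Each node links to its two neighbors around the loop.
        return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

    def mesh(w, h):
        # Each node links to its horizontal and vertical neighbors.
        links = {}
        for y in range(h):
            for x in range(w):
                node, nbrs = y * w + x, []
                if x > 0: nbrs.append(node - 1)
                if x < w - 1: nbrs.append(node + 1)
                if y > 0: nbrs.append(node - w)
                if y < h - 1: nbrs.append(node + w)
                links[node] = nbrs
        return links

    def route(links, src, dst):
        """Breadth-first route: a detail the protocol layer never sees."""
        seen, queue = {src: None}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                path = [node]
                while seen[path[-1]] is not None:
                    path.append(seen[path[-1]])
                return path[::-1]
            for nbr in links[node]:
                if nbr not in seen:
                    seen[nbr] = node
                    queue.append(nbr)

    # The same request rides either topology; only the hop count differs.
    print(route(ring(8), 0, 5))      # [0, 7, 6, 5]
    print(route(mesh(4, 2), 0, 5))   # [0, 1, 5]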

At present, AMBA 5 CHI is only being used as an internal interconnect between different on-chip devices, but Filippo tells us the spec was defined with an eye toward chip-to-chip communication, as well. That raises the prospect that AMBA 5 CHI, or something very much like it, could be used to enable coherent multiprocessing across multiple silicon dies at some point in the not-too-distant future. In fact, Filippo says ARM is "working on it."

That said, what ARM has already done with AMBA 5 CHI looks to be plenty impressive in its own right. The firm has created a lineup of logic offerings, dubbed the CCN-500 family for "cache coherent network," that can act as the glue for a high-bandwidth ARM-based SoC. Right now, the CCN-500 family has two members: the CCN-504, which supports up to 16 CPU cores, and the CCN-508, which supports as many as 32 cores. Filippo tells us there are smaller- and larger-scale versions in the works. All of them implement AMBA 5 CHI.

The block diagram above offers a simplified view of the CCN-508 uncore. One can see how it links together the CPU cores, memory controllers, and other I/O logic needed to make an SoC work. What's striking about the 508, especially since it's coming from ARM, is its sheer scale. The CCN-508 can support up to eight quad-core clusters of Cortex-A57 CPUs, for a grand total of 32 cores. (It can also scale down to as few as two clusters and eight cores, if needed.) The uncore can connect to four ARM DMC-520 memory controllers capable of supporting both DDR3- and DDR4-type memories. The L3 cache can be as large as 32MB, and since that L3 cache has distributed ownership, there's a snoop filter to prevent excess traffic from coherency enforcement.
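
The snoop filter's job is easier to grasp with a toy model: a directory that remembers which clusters may hold a copy of each line, so a write snoops only those clusters instead of broadcasting to all eight. This is a sketch of the general technique, not ARM's implementation.

    class SnoopFilter:
        def __init__(self, n_clusters):
            self.n_clusters = n_clusters
            self.sharers = {}          # line address -> set of cluster IDs
            self.snoops_sent = 0

        def record_read(self, cluster, addr):
            self.sharers.setdefault(addr, set()).add(cluster)

        def record_write(self, cluster, addr):
            # Snoop (invalidate) only the clusters that may share the line.
            targets = self.sharers.get(addr, set()) - {cluster}
            self.snoops_sent += len(targets)
            self.sharers[addr] = {cluster}   # writer is now the sole holder

    sf = SnoopFilter(n_clusters=8)
    sf.record_read(0, 0x40)
    sf.record_read(3, 0x40)
    sf.record_write(5, 0x40)
    # Without the filter, the write would snoop all seven other clusters;
    # with it, only clusters 0 and 3 are disturbed.
    print(sf.snoops_sent)   # 2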

All of the above will sound fairly familiar to those who know today's Xeon and Opteron architectures. The CCN-508, though, has been built to provide copious bandwidth and coherent caching even in the context of relatively small, inexpensive CPU cores. In fact, dig a level deeper than the diagram above, and one will find that this uncore is based on a distributed design that makes its I/O interfaces into first-class citizens.

The CCN is organized as a series of crosspoints, each of which has two device ports and two interconnect ports. These crosspoints can have various sorts of clients, including L3 cache partitions, CPU cores, and I/O interfaces. "Just plop down crosspoints," Filippo says, "and the system builds itself." Breaking the design down into relatively intelligent crosspoints simplifies development, ARM claims, and allows performance to ramp up smoothly as designs grow in scale.
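
Here's a loose sketch of the "plop down crosspoints" idea: each crosspoint exposes two device ports for clients and two interconnect ports for its neighbors, so simply chaining them together yields a ring with no further plumbing. The structure and names are invented for illustration, not taken from ARM's design files.

    class Crosspoint:
        def __init__(self, name):
            self.name = name
            self.devices = []               # up to two device ports
            self.left = self.right = None   # two interconnect ports

        def attach(self, client):
            assert len(self.devices) < 2, "only two device ports per XP"
            self.devices.append(client)

    def build_ring(specs):
        """specs: list of (xp_name, [clients]) tuples."""
        xps = []
        for name, clients in specs:
            xp = Crosspoint(name)
            for c in clients:
                xp.attach(c)
            xps.append(xp)
        for i, xp in enumerate(xps):        # close the ring
            xp.right = xps[(i + 1) % len(xps)]
            xp.left = xps[(i - 1) % len(xps)]
        return xps

    # A mix of clients per crosspoint: CPU clusters, L3 slices, memory
    # controllers, and I/O bridges can all hang off the same fabric.
    ring = build_ring([
        ("XP0", ["cluster0", "l3_slice0"]),
        ("XP1", ["cluster1", "l3_slice1"]),
        ("XP2", ["dmc0", "l3_slice2"]),
        ("XP3", ["io_bridge0", "l3_slice3"]),
    ])
    print([(xp.name, xp.devices) for xp in ring])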

The L3 cache partitions can range in size from 128KB to 4MB, and some of them are paired up in crosspoints with memory controllers and I/O bridges rather than CPU clusters. That distribution underscores how the L3 doesn't just serve the cores, but also acts as a very high-bandwidth I/O cache. The L3 has an "adaptive" policy regarding inclusion; it doesn't always replicate the contents of the CPU cores' L2 caches. In fact, Filippo claims that calling this cache "L3" is iffy, since it's not just for compute.
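
"Adaptive" inclusion isn't defined precisely here, but one plausible reading is an L3 that declines to allocate lines a CPU cluster's L2 already holds, while still caching I/O traffic eagerly, since I/O devices have no L2 of their own. A speculative sketch of that reading, not ARM's documented policy:

    class AdaptiveL3:
        def __init__(self, l2_contents):
            self.l2_contents = l2_contents   # addresses held in any L2
            self.lines = {}

        def allocate(self, addr, value, source):
            # Skip lines an L2 already covers; always cache I/O lines.
            if source == "io" or addr not in self.l2_contents:
                self.lines[addr] = value

    l2 = {0x100, 0x140}
    l3 = AdaptiveL3(l2)
    l3.allocate(0x100, "a", source="cpu")   # skipped: already in an L2
    l3.allocate(0x200, "b", source="cpu")   # allocated
    l3.allocate(0x140, "c", source="io")    # allocated: I/O gets cached
    print(sorted(hex(a) for a in l3.lines)) # ['0x140', '0x200']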

This cache is no doubt needed to take advantage of all of the bandwidth on tap. Filippo estimates the CCN-508's peak bandwidth at 360GB/s, and he says the interconnects can sustain 230GB/s pretty much constantly. Each of the eight I/O accelerator ports is capable of 40GB/s of throughput, so there's 320GB/s of peak I/O bandwidth possible across the uncore. Although that number outstrips the 230GB/s of interconnect bandwidth, caching can help. Each crosspoint has a bypass port into the L3 cache, so there's more bandwidth available at each local stop than what's out on the ring. Filippo says delivered bandwidth is "significantly higher" than one would expect from an analysis of the ring without this mechanism.
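
Working through those numbers makes the design choice clearer. A quick back-of-the-envelope check, using only the figures quoted above:

    io_ports      = 8
    per_port_bw   = 40    # GB/s per I/O accelerator port
    peak_bw       = 360   # GB/s, quoted peak for the CCN-508
    sustained_bw  = 230   # GB/s, quoted sustained interconnect rate

    aggregate_io = io_ports * per_port_bw
    print(aggregate_io)                  # 320 GB/s of possible I/O traffic
    print(aggregate_io - sustained_bw)   # 90 GB/s the ring alone can't carry;
                                         # the per-crosspoint L3 bypass ports
                                         # are there to absorb that gap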

Should bandwidth still become a constraint, each stop on the ring supports the quality-of-service provisions built into AMBA 5 CHI, "from ingress to egress and throughout the interconnect," in Filippo's words. QoS policies can thus provide guarantees of bandwidth, latency, and packet prioritization to specific applications or types of traffic. Since not all I/O devices honor QoS requests, the CCN has regulators to enforce policies internally when needed.
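
ARM hasn't described how the CCN's regulators work internally, but a classic way to enforce a bandwidth guarantee is a token bucket: traffic proceeds only while tokens remain, and tokens refill at the guaranteed rate. A generic sketch of that technique, not the actual CCN regulator:

    class TokenBucket:
        def __init__(self, rate_bytes_per_s, burst_bytes):
            self.rate = rate_bytes_per_s
            self.capacity = burst_bytes
            self.tokens = burst_bytes
            self.last = 0.0

        def allow(self, now, nbytes):
            # Refill in proportion to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if nbytes <= self.tokens:
                self.tokens -= nbytes
                return True    # within the guaranteed allocation
            return False       # over budget: stall or deprioritize

    # Guarantee ~10GB/s to one traffic class, with a 64KB burst allowance.
    reg = TokenBucket(rate_bytes_per_s=10e9, burst_bytes=64 * 1024)
    print(reg.allow(0.000001, 4096))   # True: within budget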

ARM's uncore also includes the sort of power-management provisions one would expect from a product of this class. The L3 cache partitions can step down through multiple lower-power states, depending on demand. They can disable half of their capacity if it's not needed, disable all tag and data SRAM and simply act as a conduit to DRAM, or enter an active retention state.
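
Those states map naturally onto a small state machine. The ordering below, stepping from full power down toward retention, is inferred from the description rather than taken from ARM documentation:

    # Inferred ordering of the L3 partition power states described above;
    # the real hardware's transition rules are not public.
    L3_POWER_STATES = [
        "full_on",        # all tag and data SRAM active
        "half_capacity",  # half the capacity disabled when demand is light
        "conduit",        # tag/data SRAM off; partition passes through to DRAM
        "retention",      # contents preserved at minimal power
    ]

    def step_down(state):
        """Move one notch toward lower power, if possible."""
        i = L3_POWER_STATES.index(state)
        return L3_POWER_STATES[min(i + 1, len(L3_POWER_STATES) - 1)]

    state = "full_on"
    for _ in range(3):
        state = step_down(state)
    print(state)   # retention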

In short, the CCN-508 looks to be everything one would need from an uncore in order to build a useful chip for the data center. That chip might drive a series of blades or modules in a "microserver" config, or it might be a more specialized storage or network processing ASIC.

Filippo tells us ARM has a number of design wins for the CCN-508—either in the high single digits or the low double digits. ARM expects to see 32-core, enterprise-class systems based on this uncore in 2015, although predicting exactly what ARM's partners will do with its IP is kind of like cat-herding—an inexact science, at best. That's just the beginning for the CCN family, too. We should see larger and smaller versions of this uncore become available for licensing later this year.

Interestingly enough, my sense is that the most visible names in the ARM SoC business aren't likely to adopt the CCN-500 series at all. For instance, AMD has its own internal SoC-style fabric for interconnecting IP blocks, and I believe Nvidia uses its own in the Tegra, too. Companies of that sort seek to differentiate their products by building their own glue logic. The thing is, you don't have to be a big player in order to build something interesting when a high-bandwidth uncore like the CCN-508 is available for licensing. That's kind of the point, really.