The announcement this week that AMD is working on an all-new, high-performance CPU architecture compatible with the ARM instruction set is huge news in its own right, but it’s also an important step in a progression that’s been unfolding in recent years. Bolstered by success in the mobile computing market, ARM and its partners have been gearing up to challenge the dominance of Intel in other parts of the computing world, including the data center.
ARM’s licensing model means any capable chip company can use ARM’s technologies, from its CPU instruction set to interconnect standards to specific blocks of logic, in order to build a product. ARM offers a broad suite of IP (or intellectual property) for its customers to employ as they wish, and some of what ARM offers is truly impressive, high-bandwidth technology aimed squarely at server-class applications.
Most ARM licensees aren’t likely to challenge Intel by attempting to take on its potent Xeon processors head to head, like AMD may well do with its K12 core. They can, however, potentially gain a foothold in the lucrative enterprise market by tailoring ARM-compatible SoCs for specific classes of workloads. An old axiom of computing says that custom-designed hardware will inevitably be more efficient and cost-effective at a given job than a brute-force, general-purpose solution. Not every job demands the fastest processing core. If ARM’s partners can build SoCs that efficiently handle workloads that are, for instance, more I/O-bound than compute-bound, then they can win business away from Intel without matching Xeon stride for stride in every respect.
In doing so, ARM and its customers could very well lower the cost of computing at a rate faster than Intel’s famed Moore’s Law, and we could see an expansion of the number of viable players in the business of building server-class silicon.
ARM is expending quite a bit of effort to make such a future possible, and it invited us to a press and analyst confab in Austin, Texas, last week in order to highlight some of that work.
There’s really too much happening in the ARM ecosystem for us to offer anything like a comprehensive look at how the various companies involved are targeting the server space, but we should note that there is a widely distributed but concerted push for 64-bit ARM-based servers happening behind the scenes right now. The players include everyone from ARM itself to chipmakers like AMD and Applied Micro, from OEMs like HP to software vendors like Canonical and Red Hat—and to customers like Facebook and other cloud providers. The industry seems to want an alternative to Intel and its x86 ISA in the server space, and an awful lot of key players are putting in the effort to make that happen.
Rather than touring the whole scene, we’ll take a look at a couple of examples of the sort of technology ARM is designing and its partners are implementing. The first one demonstrates that ARM is more than just a CPU company, and the second illustrates the current state of ARM-powered solutions for the data center.
AMBA 5 CHI and the uncore
Much of the innovation in microprocessors over the past decade has come not in the CPU microarchitectures themselves, but outside of the cores, in the plumbing that feeds these compute engines. We’ve spent an awful lot of keystrokes talking about the “uncore” complexes and chip-to-chip interconnects that surround Xeons and Opterons, and we probably still haven’t entirely given them their due.
ARM offers two 64-bit CPU cores that can play a role in the server space, the smaller Cortex-A53 and the still-smallish-but-larger Cortex-A57. (The A57 is derived from the Cortex-A15 that’s made its way into high-end smartphones.) To support these cores, ARM has defined an interconnect architecture called AMBA 5 CHI, and it has created a family of “uncore” products that implement this architecture. Mike Filippo, Lead Architect for ARM’s Enterprise Systems Solutions, walked us through this spec and implementation in detail last week.
The AMBA 5 CHI spec describes a high-bandwidth interconnect for transporting data across a chip. AMBA 5 CHI is coherent, which means multiple connected clients (like CPU cores and I/O devices) can access a shared pool of memory safely. The interconnect hardware manages any hazards created by different clients trying to modify the same data simultaneously. In this respect and many others, AMBA 5 CHI is similar to standards like AMD’s HyperTransport and Intel’s QPI.
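To make the hazard management described above concrete, here’s a toy sketch of the basic idea. This is my own illustration, not ARM’s protocol and not how AMBA 5 CHI actually works in hardware: a simple directory serializes writes to a shared line, pulling back and invalidating the old owner’s copy before granting a new one, so every client sees one consistent value.

```python
# Toy directory-based coherence sketch (illustrative only -- not AMBA 5 CHI).
# A directory serializes writes to a shared line so two clients can never
# commit conflicting copies of the same data.

class Directory:
    """Tracks which client holds a line in a writable state."""
    def __init__(self):
        self.owner = None          # client with exclusive access, if any
        self.memory = {}           # backing store: addr -> value

    def write(self, client, addr, value):
        # Before granting a write, reclaim and invalidate the previous
        # owner's copy -- this is the "hazard management" a coherent
        # fabric performs in hardware.
        if self.owner is not None and self.owner is not client:
            self.memory[addr] = self.owner.cache.pop(addr, self.memory.get(addr))
        self.owner = client
        client.cache[addr] = value

    def read(self, client, addr):
        # A read always observes the most recent write, wherever it lives.
        if self.owner is not None and addr in self.owner.cache:
            return self.owner.cache[addr]
        return self.memory.get(addr)

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}

directory = Directory()
cpu, nic = Client("cpu"), Client("nic")

directory.write(cpu, 0x100, "A")   # CPU owns the line
directory.write(nic, 0x100, "B")   # NIC takes ownership; CPU's copy is invalidated
print(directory.read(cpu, 0x100))  # -> B: both clients agree on the latest value
```

Real coherent interconnects do this with snoop messages and cache-line states rather than a central Python dictionary, of course, but the invariant being enforced is the same.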
AMBA 5 CHI is a layered architecture. It defines proper behavior at multiple layers, from the top-level protocol to routing to the link layer to low-level physical signaling. Oddly enough, Filippo says the spec is agnostic about topology; it can be deployed as a point-to-point link, a crossbar, a ring, a mesh, or what have you. The spec includes provisions for multiple virtual channels—essentially wire sharing—and the protocol layer allows for different flow-control policies.
At present, AMBA 5 CHI is only being used as an internal interconnect between different on-chip devices, but Filippo tells us the spec was defined with an eye toward chip-to-chip communication, as well. That raises the prospect that AMBA 5 CHI, or something very much like it, could be used to enable coherent multiprocessing across multiple silicon dies at some point in the not-too-distant future. In fact, Filippo says ARM is “working on it.”
That said, what ARM has already done with AMBA 5 CHI looks to be plenty impressive in its own right. The firm has created a lineup of logic offerings, dubbed the CCN-500 family for “cache coherent network,” that can act as the glue for a high-bandwidth ARM-based SoC. Right now, the CCN-500 family has two members: the CCN-504, which supports up to 16 CPU cores, and the CCN-508, which supports as many as 32 cores. Filippo tells us there are smaller- and larger-scale versions in the works. All of them implement AMBA 5 CHI.
The block diagram above offers a simplified view of the CCN-508 uncore. One can see how it links together the CPU cores, memory controllers, and other I/O logic needed to make an SoC work. What’s striking about the 508, especially since it’s coming from ARM, is its sheer scale. The CCN-508 can support up to eight quad-core clusters of Cortex-A57 CPUs, for a grand total of 32 cores. (It can also scale down to as few as two clusters and eight cores, if needed.) The uncore can connect to four ARM DMC-520 memory controllers capable of supporting both DDR3- and DDR4-type memories. The L3 cache can be as large as 32MB, and since that L3 cache has distributed ownership, there’s a snoop filter to prevent excess traffic from coherency enforcement.
All of the above will sound fairly familiar to those who know today’s Xeon and Opteron architectures. The CCN-508, though, has been built to provide copious bandwidth and coherent caching even in the context of relatively fewer, smaller, and less expensive CPU cores. In fact, dig a level deeper than the diagram above, and one will find that this uncore is based on a distributed design that makes its I/O interfaces into first-class citizens.
The CCN is organized as a series of crosspoints, each of which has two device ports and two interconnect ports. These crosspoints can have various sorts of clients, including L3 cache partitions, CPU cores, and I/O interfaces. “Just plop down crosspoints,” Filippo says, “and the system builds itself.” Breaking the design down into relatively intelligent crosspoints simplifies development, ARM claims, and allows performance to ramp up smoothly as designs grow in scale.
The L3 cache partitions can range in size from 128KB to 4MB, and some of them are paired up in crosspoints with memory controllers and I/O bridges rather than CPU clusters. That distribution underscores how the L3 doesn’t just serve the cores, but also acts as a very high-bandwidth I/O cache. The L3 has an “adaptive” policy regarding inclusion; it doesn’t always replicate the contents of the CPU cores’ L2 caches. In fact, Filippo claims that calling this cache “L3” is iffy, since it’s not just for compute.
This cache is no doubt needed to take advantage of all of the bandwidth on tap. Filippo estimates the CCN-508’s peak bandwidth at 360GB/s, and he says the interconnects can sustain 230GB/s pretty much constantly. Each of the eight I/O accelerator ports is capable of 40GB/s of throughput, so there’s 320GB/s of peak I/O bandwidth possible across the uncore. Although that number outstrips the 230GB/s of interconnect bandwidth, caching can help. Each crosspoint has a bypass port into the L3 cache, so there’s more bandwidth available at each local stop than what’s out on the ring. Filippo says delivered bandwidth is “significantly higher” than one would expect from an analysis of the ring without this mechanism.
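The arithmetic behind those figures is worth spelling out. Taking the quoted numbers at face value, eight 40GB/s ports can generate more traffic than the ring can sustain, which implies a minimum fraction of I/O traffic that the local L3 bypass ports must absorb if all eight ports are to run flat out:

```python
# Back-of-the-envelope check of the quoted CCN-508 figures (illustrative
# only): peak I/O demand exceeds the interconnect's sustained rate, so
# some fraction of traffic must be satisfied locally by the L3 bypass.

ports = 8
per_port_gbs = 40
peak_io = ports * per_port_gbs          # 320 GB/s of possible I/O traffic
sustained_ring = 230                    # GB/s the interconnect can sustain

shortfall = peak_io - sustained_ring    # traffic that can't ride the ring
# Minimum share of I/O traffic the L3 bypass ports would have to serve
# for all eight accelerator ports to run at full tilt:
min_local_fraction = shortfall / peak_io

print(peak_io, shortfall, round(min_local_fraction, 3))
# -> 320 90 0.281
```

In other words, if a bit over a quarter of I/O traffic hits locally in the L3, the ring’s sustained rate is no longer the limiter, which squares with Filippo’s claim that delivered bandwidth runs “significantly higher” than a naive ring analysis would suggest.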
Should bandwidth still become a constraint, each stop on the ring supports the quality-of-service provisions built into AMBA 5 CHI, “from ingress to egress and throughout the interconnect,” in Filippo’s words. QoS policies can thus provide guarantees of bandwidth, latency, and packet prioritization to specific applications or types of traffic. Since not all I/O devices honor QoS requests, the CCN has regulators to enforce policies internally when needed.
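ARM hasn’t detailed how the CCN’s internal regulators work, but a token bucket is one classic way to enforce a bandwidth guarantee on a client that won’t police itself. The sketch below is purely illustrative, with made-up rates, not a description of the CCN hardware:

```python
# Token-bucket bandwidth regulator (illustrative only -- not the CCN's
# actual mechanism). Tokens refill at the provisioned rate; a request is
# admitted only if enough tokens are available, otherwise it must stall.

class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s    # provisioned bandwidth
        self.capacity = burst_bytes     # maximum short-term burst
        self.tokens = burst_bytes
        self.last = 0.0

    def allow(self, now, nbytes):
        """Admit a request of nbytes at time `now` (in seconds) if the
        client is within its provisioned rate; otherwise refuse it."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# Provision a misbehaving I/O device to 100 bytes/s with a 200-byte burst.
reg = TokenBucket(rate_bytes_per_s=100, burst_bytes=200)
print(reg.allow(0.0, 150))   # True: within the burst allowance
print(reg.allow(0.0, 100))   # False: only 50 tokens remain
print(reg.allow(2.0, 100))   # True: two seconds refill the bucket
```

A hardware regulator applies the same policy per ring stop, in cycles and flits rather than seconds and bytes, which is how the fabric can hold latency and bandwidth guarantees even against devices that ignore QoS requests.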
ARM’s uncore also includes the sort of power-management provisions one would expect from a product of this class. The L3 cache partitions can step down through multiple lower-power states, depending on demand. They can disable half of their capacity if it’s not needed, disable all tag and data SRAM and simply act as a conduit to DRAM, or enter an active retention state.
In short, the CCN-508 looks to be everything one would need from an uncore in order to build a useful chip for the data center. That chip might drive a series of blades or modules in a “microserver” config, or it might be a more specialized storage or network processing ASIC.
Filippo tells us ARM has a number of design wins for the CCN-508—either in the high single digits or the low double digits. ARM expects to see 32-core, enterprise-class systems based on this uncore in 2015, although predicting exactly what ARM’s partners will do with its IP is kind of like cat-herding—an inexact science, at best. That’s just the beginning for the CCN family, too. We should see larger and smaller versions of this uncore become available for licensing later this year.
Interestingly enough, my sense is that the most visible names in the ARM SoC business aren’t likely to adopt the CCN-500 series at all. For instance, AMD has its own internal SoC-style fabric for interconnecting IP blocks, and I believe Nvidia uses its own in the Tegra, too. Companies of that sort seek to differentiate their products by building their own glue logic. The thing is, you don’t have to be a big player in order to build something interesting when a high-bandwidth uncore like the CCN-508 is available for licensing. That’s kind of the point, really.
One of the first: Applied Micro’s X-Gene
The first server-class SoC compatible with the 64-bit ARMv8 ISA is the X-Gene from Applied Micro, and it’s one example of the sort of thing we can expect from ARM partners going into the server space in the coming years. The X-Gene is intended for cloud-style deployments, where lots of small server instances will service workloads that have modest computational requirements or are more I/O-constrained.
The diagram above makes the X-Gene look relatively simple, but don’t be fooled—there’s lots of parallelism represented. Applied Micro says it has tailored this SoC for specific workloads, and in doing so, the company has created an awful lot of its own IP. The CPU cores, for example, are the product of an ARM architecture license: rather than using one of ARM’s Cortex designs, Applied Micro built its own custom core compatible with ARMv8. To address the enterprise market, the firm built in ECC support and a number of RAS features. The first X-Gene chip features eight of these cores clocked at 2.4GHz, a relatively high frequency in the ARM world. The interconnect fabric in the X-Gene is Applied Micro’s own design, as well, not anything licensed from ARM. That fabric links the X-Gene’s CPU cores and I/O blocks to a total of four memory controllers, twice as many as in Intel’s Avoton.
What’s interesting is the rationale behind this design choice: Applied Micro says applications increasingly reside completely in memory, so the X-Gene needed to have access to “lots and lots of cheap memory.” The primary driver here wasn’t bandwidth, but sheer RAM capacity. The result is a fairly low-power SoC that can support a ton of memory—up to 512GB, according to Applied Micro, in an eight-ranks-per-channel configuration. I doubt most X-Gene microserver modules will have half a terabyte of RAM onboard, but this possibility is still worthy of note. Intel has limited Avoton to a maximum 64GB of physical memory, perhaps in part to protect its high-margin Xeon business. The X-Gene permits configurations that might be a better fit for cloud workloads, which is exactly how ARM partners could take business away from Intel.
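The 512GB figure is easy to sanity-check. With four channels and eight ranks per channel, it implies 16GB per rank; that per-rank size is my inference, not a number Applied Micro quoted:

```python
# Rough sanity check of the X-Gene capacity claim. The per-rank size is
# an assumption derived from the quoted total, not Applied Micro's spec.

channels = 4
ranks_per_channel = 8
gb_per_rank = 16                        # assumed rank size

total_gb = channels * ranks_per_channel * gb_per_rank
print(total_gb)                         # -> 512

# Avoton's stated physical-memory ceiling, for comparison:
avoton_gb = 64
print(total_gb // avoton_gb)            # -> 8 (times Avoton's ceiling)
```

Eight times Avoton’s ceiling is exactly the sort of headroom that matters for in-memory workloads, even if few real modules will ever be populated that fully.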
In a similar vein, Applied Micro has built a form of TCP acceleration into the quad 10-GigE network controllers onboard the X-Gene. This hardware can purportedly reduce the latency for TCP communication from 20-30 microseconds to roughly five microseconds. Applied Micro says cloud providers like Facebook provision their servers on the basis of request latency, and it believes the X-Gene’s TCP acceleration could allow the chip to deliver a substantially higher number of requests per second.
These things sound good in theory, but we don’t yet know how they’ll work in practice. Applied Micro didn’t have any performance numbers of consequence to share with us yet, just a vague claim of being able to support twice as many instances per unit of power as an Intel CPU. (We don’t know which one.)
Also, in a reminder that the X-Gene comes at things from a very different angle, this first chip is built on an antiquated 40-nm fabrication process. Production of the first X-Gene chips began in March, and Applied Micro is currently shipping pre-production silicon to system builders. Happily, a 28-nm X-Gene follow-up is in the works, and the first samples are scheduled for this quarter. The second X-Gene shouldn’t be dramatically different from the first one, but tweaks to the CPU core are expected to bring a 15% gain in the number of instructions retired per clock cycle.
If none of this sounds good enough to persuade customers to leave the existing x86 hardware and software infrastructure behind, well, just know that Applied Micro has been working with some very influential partners. HP hasn’t officially announced any products, but it has repeatedly demonstrated an X-Gene based cartridge for its modular Moonshot servers.
Also, last week, some folks from Canonical showed up at ARM’s event to demo a 64-bit ARMv8 version of the Ubuntu Linux distro running on X-Gene. Canonical intends for Ubuntu 14.04 for ARMv8 to be a first-class server operating system, complete with a five-year support lifetime.
The demonstration platform consisted of a stack of 14 X-Gene servers with very little active cooling running in a centrally controlled OpenStack Icehouse environment. Christian Reis from Canonical kicked off instances of several different server applications, including MediaWiki and Hadoop, with all of the necessary components natively compiled for ARMv8. Although the process of deploying a cloud application environment didn’t make for the most breathtaking real-time theater, the apps did seem to work as advertised once they were up and running.
Reis reported good progress in getting Ubuntu ported to ARMv8. He said that “99% of the main universe” is already up and going. Some of the important remaining “gaps” he identified have to do with proprietary components, like Oracle’s Java virtual machine. There’s also apparently work yet to be done in order to ensure ARM-based systems support low-level firmware standards for broad interoperability. UEFI support is now ready to go, but ACPI is still a work in progress, for instance.
We are at the beginning of something, obviously, and there’s much to be done before ARM-based SoCs can truly challenge Intel for the highest-profile roles in the data center. But the foundation is being laid, brick by brick, by software and hardware engineers from a range of companies whose names are familiar and not so familiar. This week’s revelation that AMD is joining the fray opens up new possibilities for ARM-based servers to challenge Xeons toe to toe, assuming the K12 core turns out reasonably well. It’s hard to say exactly what happens next, but it’s possible the data center will look very different five years from now, thanks to a swarm of invaders, big and small, that share almost nothing in common but an ARM license.