There's been a lot of talk in recent months about ARM-based server CPUs in one form or another. While it's been a reasonable assumption that said CPUs would offer impressive performance-per-watt ratios, the big question mark in everyone's mind is probably "how fast are they in absolute terms?" Answers to that question have now started to form thanks to a handful of recently released benchmarks from academia and industry.
Cloudflare is a name that may be recognized by a handful of gerbils. The company is primarily known for its CDN (Content Delivery Network) services, and I'm willing to bet that you may have noticed at least once that you're downloading something from a domain ending in cloudflare.com. Vlad Krasnov, one of the company's engineers, recently shared a number of benchmarks comparing an engineering-sample server fitted with a 46-core Qualcomm Centriq SoC clocked at 2.5 GHz versus a dual-socket Broadwell Xeon E5-2630 v4 system at 2.2 GHz (and a 3.1 GHz turbo), and a dual-socket Xeon Silver 4116 system at 2.1 GHz (with a 3 GHz turbo clock).
Maximum processor TDP for the Intel systems is 170W, while the Centriq system is content with 120W. It's worth pointing out that the Intel Skylake Server processors in the test weren't the highest-end Platinum models, but as Krasnov remarks, those chips have TDPs as high as 200W. Cloudflare is primarily concerned with performance-per-watt, hence the Xeon Silver chips.
Cloudflare's software stack relies on cryptography and compression functionality in multiple languages to run, as well as good ol' web serving. The company relies on the Lua and Go languages for many of its needs. Krasnov goes on to point out that some of the software used isn't yet fully optimized (if at all) for the ARM architecture, but the results are impressive enough as-is.
In public-key cryptography (OpenSSL), Falkor, the CPU core inside the Centriq SoC, scores a good win, although it falters in symmetric-key cryptography tests, likely because of its narrower SIMD units compared to the Intel competition. When it comes to gzip and brotli compression, Falkor's single-core performance trails the Intel systems, but the chip comes into its own and proves itself superior in multi-core scenarios, taking into consideration that Cloudflare apparently doesn't use the highest compression levels in brotli.
Falkor doesn't do too well in Cloudflare's Go cryptography, regular expressions, compression, and string-handling tests. Krasnov believes this is due simply to the fact that the language and libraries in question aren't really optimized for ARM at all yet. In the Lua tests, however, Falkor proves itself "competitive," offering roughly similar performance to the Intel machines. That performance comes with impressive power efficiency. In the final Nginx webserver test, the single-socket Centriq system ends the party with a bang, serving an average of 214 requests per consumed watt, versus 99 for Skylake and 77 for Broadwell.
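To put those Nginx figures in perspective, the requests-per-watt numbers can be turned into relative efficiency ratios. Here's a quick sketch in Python that uses only the three figures quoted above; the dictionary and function names are made up for illustration and aren't taken from Krasnov's post.

```python
# Requests served per consumed watt in the Nginx test, as quoted in the article.
REQUESTS_PER_WATT = {
    "Centriq": 214,
    "Skylake": 99,
    "Broadwell": 77,
}


def efficiency_ratio(system, baseline, table=REQUESTS_PER_WATT):
    """Return how many times more requests per watt `system` serves than `baseline`."""
    return table[system] / table[baseline]


if __name__ == "__main__":
    for baseline in ("Skylake", "Broadwell"):
        ratio = efficiency_ratio("Centriq", baseline)
        print(f"Centriq vs {baseline}: {ratio:.2f}x requests per watt")
```

By that arithmetic, the Centriq box serves a little over twice as many requests per watt as the Skylake system, and close to three times as many as the Broadwell one.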
Cloudflare isn't the only outfit offering a sneak peek at benchmarks for ARM servers. The GW4 Alliance, an academic consortium of four British universities, is readying Isambard, an ARM-based XC50-series supercomputer manufactured by Cray, packing 10,000 ARM CPU cores from Cavium ThunderX2 processors, with two 32-core processors at over 2 GHz per node. Simon McIntosh-Smith from the University of Bristol compared a single-socket "early access" Cavium ThunderX2 system with 32 cores at 2.5 GHz to a 22-core, 2.1 GHz Xeon Gold 6152 system and an 18-core Xeon E5-2695 v4 server at 2.1 GHz. Neither Cavium nor the GW4 provides TDPs for the ThunderX2 CPU, so it's not possible to compare the systems on that metric.
The ThunderX2-based system scores wins nearly across the board, though it's worth pointing out that the Intel systems in question were running at a clock speed disadvantage. What's especially impressive is that the ThunderX2 system was able to eke out an advantage even with Intel's optimizing compiler in use where possible on the Xeon systems. McIntosh-Smith believes that applications relying on raw memory bandwidth benefit from Cavium's higher number of memory channels (eight versus Skylake's six and Broadwell's four), while ones that rely on raw computing ability might find the ThunderX2 CPU at a disadvantage. These results were shown at the SC17 conference, and you can check out more information in the slides here.
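The memory-channel argument is easy to sanity-check with back-of-the-envelope math: theoretical peak bandwidth per socket is roughly the channel count times the transfer rate times 8 bytes per transfer on a 64-bit DDR4 channel. The sketch below assumes, purely for illustration, that all three platforms run their memory at the same 2666 MT/s. The benchmarks discussed here don't state memory speeds, so treat the absolute numbers as hypothetical and focus on the ratios, which depend only on the channel counts.

```python
# Memory channels per socket, as quoted in the article.
CHANNELS = {"ThunderX2": 8, "Skylake": 6, "Broadwell": 4}

BYTES_PER_TRANSFER = 8          # one 64-bit DDR4 channel moves 8 bytes per transfer
ASSUMED_MEGATRANSFERS = 2666    # assumption for illustration, not from the article


def peak_bandwidth_gbs(channels, megatransfers=ASSUMED_MEGATRANSFERS):
    """Theoretical peak bandwidth in GB/s (decimal) for one socket."""
    return channels * megatransfers * BYTES_PER_TRANSFER / 1000


if __name__ == "__main__":
    baseline = peak_bandwidth_gbs(CHANNELS["ThunderX2"])
    for name, channels in CHANNELS.items():
        bw = peak_bandwidth_gbs(channels)
        print(f"{name}: {bw:.1f} GB/s peak ({bw / baseline:.0%} of ThunderX2)")
```

At equal memory speeds, six channels give 75% of the eight-channel peak and four channels give 50%, which fits with McIntosh-Smith's reading that bandwidth-bound codes are where the ThunderX2 has the most headroom.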
It's worth pointing out that all of these benchmarks cover specific use cases, albeit relatively common ones in the case of Cloudflare's wide range of benchmarks. Nevertheless, they seem to paint a fairly consistent picture: ARM CPUs should definitely be taken seriously in server rooms and data centers these days.