After taking a little over a year to think on it, Intel appears to have decided that glue can be pretty Epyc after all. The company teased plans for a new Xeon platform called Cascade Lake Advanced Performance, or Cascade Lake-AP, this morning ahead of the Supercomputing 2018 conference. This next-gen platform doubles the cores per socket from an Intel system by joining a number of Cascade Lake Xeon dies together on a single package with the blue team's Ultra Path Interconnect, or UPI. Intel will allow Cascade Lake-AP servers to employ up to two-socket (2S) topologies, for as many as 96 cores per server.
Intel chose to share two competitive performance numbers alongside the disclosure of Cascade Lake-AP. One of these is that a top-end Cascade Lake-AP system can put up 3.4x the Linpack throughput of a dual-socket AMD Epyc 7601 platform. This benchmark hits AMD where it hurts. The AVX-512 instruction set gives Intel CPUs a major leg up on the competition in high-performance computing applications where floating-point throughput is paramount. Intel also used its own compilers to create binaries for this comparison, a decision that could further tilt the Linpack results in its favor versus AMD CPUs.
AMD has touted superior floating-point throughput from its Epyc platforms in the past for two-socket systems, but those comparisons were made against Broadwell CPUs with two AVX2 execution units per core rather than the twin AVX-512 engines of Skylake Server and the derivative Cascade Lake cores. AMD also chose to use the GCC compiler for those comparisons rather than Intel's compiler suite. Intel has clearly had enough of that kind of claim from AMD, and it seems keen to reassert its chips' superiority for floating-point performance with this benchmark info.
Other decisions about configuring the systems under test will likely raise louder objections. Intel didn't note whether Hyper-Threading would be available from Cascade Lake-AP chips, and indeed, its comparative numbers against that dual-socket Epyc 7601 system were obtained with SMT off on the AMD platform. 64 active cores is nothing to sniff at, to be sure, but when a platform is capable of throwing 128 threads at a problem and one artificially slices that number in half, eyebrows are going to go up.
Update 11/5/2018 at 18:11: According to an Intel spokesperson who contacted me this evening, "it's common industry practice for Intel to disable simultaneous multithreading on processors when running STREAM and LINPACK to achieve the highest processor performance, which is why we disabled it on all processors we benchmarked." Our independent research on this point corroborates Intel's statement, as Linpack fully occupies the floating-point units of the CPU and would likely experience performance regressions from resource contention with SMT on. Point taken.
Intel also asserted that on the Stream Triad benchmark, a Cascade Lake-AP system will be able to offer 1.3x the memory bandwidth of that same 2S Epyc 7601 system, which has eight channels of DDR4-2666 RAM per socket. That figure comes courtesy of 12 channels of DDR4 memory per socket, a simple doubling-up of the six memory channels available per socket from a typical Xeon Scalable processor today. Dual-socket Cascade Lake-AP systems will be able to offer an incredible 24 channels of DDR4 memory per server. Intel didn't disclose the memory speed it used to arrive at this figure, however.
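For a rough sense of what those channel counts mean, here is a back-of-the-envelope sketch of theoretical peak DRAM bandwidth. It assumes a standard 64-bit (8-byte) DDR4 channel and, since Intel didn't disclose its memory speed, it assumes DDR4-2666 on the Cascade Lake-AP side purely for illustration. The function name is our own, and these are theoretical peaks, not the sustained figures Stream Triad measures.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    Assumes each DDR4 channel moves 8 bytes (64 bits) per transfer.
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# 2S Epyc 7601: eight DDR4-2666 channels per socket.
epyc_2s = peak_bandwidth_gbs(channels=2 * 8, mega_transfers_per_s=2666)

# 2S Cascade Lake-AP: 12 channels per socket.
# ASSUMPTION: DDR4-2666, chosen only to isolate the effect of channel count.
clx_ap_2s = peak_bandwidth_gbs(channels=2 * 12, mega_transfers_per_s=2666)

ratio = clx_ap_2s / epyc_2s

print(f"2S Epyc 7601 peak:          {epyc_2s:.1f} GB/s")
print(f"2S Cascade Lake-AP peak:    {clx_ap_2s:.1f} GB/s (assumed DDR4-2666)")
print(f"Channel-count ratio:        {ratio:.2f}x")
```

At equal memory speeds, channel count alone would give a 1.5x theoretical advantage. Intel's claimed 1.3x is a measured Stream Triad result, which depends on sustained efficiency and the undisclosed memory speed, so the two figures aren't directly comparable.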
Intel also teased some deep-learning performance numbers against its own products. Compared to a 2S system with Xeon Platinum 8180 CPUs, Intel projects that a 2S Cascade Lake-AP server will offer up to 17 times the deep-learning image inference throughput of today's systems. That figure could be related to Cascade Lake's support for the Vector Neural Network Instruction (VNNI) subset of the AVX-512 instruction set. VNNI allows Cascade Lake processors to perform the INT8 and INT16 operations that are important to AI inference.
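To illustrate the kind of work VNNI accelerates, here is a minimal Python emulation of a single 32-bit lane of the AVX-512 VNNI `VPDPBUSD` instruction. It multiplies four unsigned 8-bit values by four signed 8-bit values and adds the sum of the products to a 32-bit accumulator, wrapping rather than saturating. The function name and test values are ours, and a real instruction processes many such lanes at once; this is a sketch of the arithmetic only, not of how Intel's inference numbers were produced.

```python
def vpdpbusd_lane(acc: int, a_u8: list[int], b_s8: list[int]) -> int:
    """Emulate one 32-bit lane of VPDPBUSD (illustrative helper, not a real API).

    acc  : signed 32-bit accumulator
    a_u8 : four unsigned 8-bit values (0..255)
    b_s8 : four signed 8-bit values (-128..127)
    """
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= x <= 255 for x in a_u8)
    assert all(-128 <= y <= 127 for y in b_s8)

    # Four 8-bit multiplies and the accumulate happen in one instruction.
    total = acc + sum(x * y for x, y in zip(a_u8, b_s8))

    # The non-saturating form wraps to a signed 32-bit result.
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


# 1*5 + 2*(-6) + 3*7 + 4*(-8) = -18, added to an accumulator of 10.
result = vpdpbusd_lane(10, [1, 2, 3, 4], [5, -6, 7, -8])
print(result)  # -8
```

Folding the multiplies and the accumulation into a single operation on narrow integers is what lets inference code do more useful work per instruction than it could with wider floating-point math.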
Beyond this high-level teaser, Intel didn't specify nitty-gritty details like the inter-socket interconnect topology or the number of PCIe lanes available per socket from each Cascade Lake-AP CPU. We expect to learn more upon the official release of the Cascade Lake family of processors later this year.