The supercomputing-related announcements are coming thick and fast today, and we have one truly novel entry in the bunch, as Intel formally unveils its first Xeon Phi offerings. As you may recall, Xeon Phi is the brand name given to products based on Knights Corner, the chip that evolved from the prior Knights Ferry project, which itself was derived from Larrabee, Intel’s aborted attempt at producing a graphics processor.
The lineage may be confusing, but Intel says the Xeon Phi is the culmination of eight years of effort across multiple product groups, and the chip’s purpose is now clear: to tackle the HPC and supercomputing markets, going up against the likes of Nvidia’s Tesla K20 series. In the spirit and format of today’s other announcements, here’s a look at the products in the Xeon Phi lineup:
| Model | Clock (MHz) | Cores | Peak DP (TFLOPS) | Memory | Memory bus | Bandwidth | Total cache | Max TDP |
|---|---|---|---|---|---|---|---|---|
| Xeon Phi SE10P/SE10X | 1100 | 61 | 1.07 | 8GB | 512-bit | 352 GB/s | 30.5 MB | 300W |
| Xeon Phi 5110P | 1053 | 60 | 1.01 | 8GB | 512-bit | 320 GB/s | 30 MB | 225W |
| Xeon Phi 3100 series | TBA | TBA | >1 | 6GB | TBA | 240 GB/s | 28.5 MB | 300W |
These product specs require a bit of clarification. For instance, the two "SE10" models above are special-edition cards that Intel supplied to OEM partners who needed early access to the Xeon Phi. They have higher power consumption than the final product, the 5110P, but share the same feature set. Meanwhile, the Xeon Phi 3100 series isn’t being officially launched today; some of its basic specs will remain obscured until its introduction in the first half of next year, although we know that both actively and passively cooled variants will be offered.
However, our amazing, Sherlock Holmes-like powers of deduction may allow us to fill in some blanks. Since the 3100-series cards have 512KB of cache per core and possess a total cache size of 28.5 MB, we’re going to go way out on a limb and deduce that these products will feature 57 active cores. Since they should achieve over one teraflops of double-precision floating-point performance with fewer cores than the 5110P, we’d expect higher clocks for the 3100 series. Those higher frequencies would explain why the cards’ power envelopes are higher, at 300W.
That leaves today’s main announced product, the Xeon Phi 5110P. The "P" at the end of the model number denotes passive cooling. Like the Tesla K20X, this card will be aimed squarely at servers. In fact, I believe both the K20X and 5110P will drop into the same Cray system, at the customer’s discretion. If we may compare briefly with the Tesla, the 5110P offers slightly less peak throughput, 1.01 teraflops to the K20X’s 1.31, while its 225W peak TDP is 10W lower than the Nvidia card’s.
Having said that, Intel would clearly prefer not to make direct comparisons with chips that it deems "accelerators," since Xeon Phi is, by contrast, very much a CPU. The language may be a bit precious, but Intel does have a point. Although both chips are mounted on PCIe cards that snap into systems driven by Xeons or Opterons, the Xeon Phi is a somewhat different sort of beast for several key reasons.
For one, the Xeon Phi runs its own Linux-based operating system and acts as an independent node in the cluster. Each card can get its own IP address, can run multiple jobs, and can communicate with other nodes across the network. Additionally, Xeon Phi cores are full-featured x86 processors, though modified for data-parallel processing. The Phi doesn’t need to rely on an external CPU to execute program control code; one of its own cores can serve that role, locally. It can be programmed like any other x86 processor, with the same familiar tools, although optimal throughput will obviously require parallelization. Finally, the Xeon Phi’s architecture diverges from today’s GPUs substantially when it comes to the cache hierarchy. I believe Nvidia’s GK110 has 1.5MB of L2 cache; the Phi 5110P has 30MB of aggregate L2 cache, with full hardware-maintained coherency. For some types of workloads, Intel’s approach should yield very different results than today’s streaming-focused GPUs.
Comparisons to GPUs are nevertheless inevitable, and one of the first Xeon Phi clusters has landed on today’s Top500 list of fastest supercomputers, in seventh place with 2.66 petaflops of Linpack throughput. The cluster’s power consumption isn’t listed, so we can’t compare that aspect of the system directly to the (presumably much larger) Opteron-and-Tesla-based Titan at Oak Ridge National Laboratory, which took the top spot with 17.59 petaflops in Linpack.
In talking about the Xeon Phi’s performance, Intel makes a salient point about the claims of 30X or better speedups that one often hears coming out of projects that have made the transition to data-parallel computing. As it set out to port applications to Xeon Phi along with various partners, Intel did indeed see major performance improvements from converting legacy code to nicely vectorized code compiled with the latest tools. However, many of those speedups applied nearly as dramatically to regular Xeon E5 processors as they did to Xeon Phi. Simply giving a "before" number from old, unoptimized code running on a CPU and an "after" number from freshly optimized and vectorized code running on the Phi might yield a big, juicy multiple of improvement. However, when the same optimized code runs on both processors, the Xeon Phi is 2.2 to 2.9X faster than dual Xeon E5-2670s in applications like SGEMM, DGEMM, Linpack, and Stream.
Interestingly, some applications did see larger speedups. Black-Scholes SP saw a gain of 10.75X on the Phi versus regular Xeons. The difference, however, was due to specialized hardware for transcendentals built into Knights Corner, hardware that betrays the chip’s graphics-focused roots.
At any rate, the sorts of improvements depicted in the slides above are nonetheless worth pursuing, and Intel contends further parallelization is essential to reaching its goal of exascale computing, given the power constraints involved. The firm also insists the HPC and supercomputing markets are, by themselves, worth addressing with this new product lineup, given their growth potential.
The Xeon Phi 5110P is shipping to OEMs now, with availability to end customers planned for January 28, 2013.