Intel joins the data-parallel computing fraternity with Xeon Phi

The supercomputing-related announcements are coming thick and fast today, and we have one truly novel entry in the bunch, as Intel formally unveils its first Xeon Phi offerings. As you may recall, Xeon Phi is the brand name given to products based on Knights Corner, the chip that evolved from the prior Knights Ferry project, which itself was derived from Larrabee, Intel’s aborted attempt at producing a graphics processor.

The lineage may be confusing, but Intel says the Xeon Phi is the culmination of eight years of effort across multiple product groups, and the chip’s purpose is now clear: to tackle the HPC and supercomputing markets, going up against the likes of Nvidia’s Tesla K20 series. In the spirit and format of today’s other announcements, here’s a look at the products in the Xeon Phi lineup:

| Model | Core clock (MHz) | Max cores | Peak DP tflops | Memory capacity (GDDR5) | Memory interface width | Memory bandwidth | Total cache | TDP |
|---|---|---|---|---|---|---|---|---|
| Xeon Phi SE10P/SE10X | 1100 | 61 | 1.07 | 8GB | 512-bit | 352 GB/s | 30.5 MB | 300W |
| Xeon Phi 5110P | 1053 | 60 | 1.01 | 8GB | 512-bit | 320 GB/s | 30 MB | 225W |
| Xeon Phi 3100 series | TBA | TBA | >1 | 6GB | TBA | 240 GB/s | 28.5 MB | 300W |

These product specs require a bit of clarification. For instance, the two "SE10" models above are special-edition cards that Intel supplied to OEM partners who needed early access to the Xeon Phi. They have higher power consumption than the final product, the 5110P, but share the same feature set. Meanwhile, the Xeon Phi 3100 series isn’t being officially launched today; some of its basic specs will remain obscured until its introduction in the first half of next year, although we know that both actively and passively cooled variants will be offered.

However, our amazing, Sherlock Holmes-like powers of deduction may allow us to fill in some blanks. Since the 3100-series cards have 512KB of cache per core and possess a total cache size of 28.5 MB, we’re going to go way out on a limb and deduce that these products will feature 57 active cores. Since they should achieve over one teraflops of double-precision floating-point performance with fewer cores than the 5110P, we’d expect higher clocks for the 3100 series. Those higher frequencies would explain why the cards’ power envelopes are higher, at 300W.
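
For anyone who wants to check that deduction, here is a rough Python sketch of the arithmetic. One caveat: the figure of 16 double-precision flops per core per clock is our assumption, not an Intel-supplied number in this piece; we use it because it reproduces the peak ratings listed for the SE10P and 5110P. The function names are ours, too.

```python
# Back-of-the-envelope check of the 3100-series deduction.
# ASSUMPTION: each core retires 16 double-precision flops per clock
# (e.g. an 8-wide DP fused multiply-add). This is inferred because it
# matches the listed peak figures; it is not stated in the article.
FLOPS_PER_CORE_PER_CYCLE = 16
CACHE_PER_CORE_MB = 0.5  # 512KB of cache per core, per the article


def active_cores(total_cache_mb):
    """Core count implied by the total cache size."""
    return round(total_cache_mb / CACHE_PER_CORE_MB)


def peak_dp_tflops(cores, clock_mhz):
    """Peak double-precision teraflops for a given core count and clock."""
    return cores * clock_mhz * 1e6 * FLOPS_PER_CORE_PER_CYCLE / 1e12


def min_clock_mhz(cores, target_tflops):
    """Lowest clock (MHz) at which `cores` cores reach `target_tflops`."""
    return target_tflops * 1e12 / (cores * FLOPS_PER_CORE_PER_CYCLE) / 1e6


if __name__ == "__main__":
    cores_3100 = active_cores(28.5)
    print("3100-series cores:", cores_3100)
    print("5110P peak tflops:", round(peak_dp_tflops(60, 1053), 2))
    print("SE10P peak tflops:", round(peak_dp_tflops(61, 1100), 2))
    print("Clock needed for 1 tflops (MHz):", round(min_clock_mhz(cores_3100, 1.0), 1))
```

Under that assumption, 57 cores would need to run a little above the 5110P's 1053MHz to clear one teraflops, which fits the expectation of higher clocks.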

That leaves today’s main announced product, the Xeon Phi 5110P. The "P" at the end of the model number denotes passive cooling. Like the Tesla K20X, this card will be aimed squarely at servers. In fact, I believe both the K20X and 5110P will drop into the same Cray system, at the customer’s discretion. If we may compare briefly with the Tesla, the 5110P features roughly 23% less peak double-precision throughput, 1.01 teraflops to the K20X’s 1.31, while its peak TDP of 225W is 10W lower than the Nvidia card’s.
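
To put the two cards' peak figures on a common footing, here is a small Python sketch that works out peak gigaflops per watt. It uses the 5110P numbers from the table above and the K20X's 1.31 teraflops from this article; the K20X's 235W TDP is an assumption drawn from Nvidia's published specs rather than from anything in the table. These are peak ratings only, so treat the output as a rough guide, not a measurement.

```python
# Peak double-precision throughput and TDP for each card.
# The K20X TDP of 235W is an ASSUMPTION taken from Nvidia's published
# specs; the other figures come from the article.
CARDS = {
    "Xeon Phi 5110P": {"peak_tflops": 1.01, "tdp_w": 225},
    "Tesla K20X": {"peak_tflops": 1.31, "tdp_w": 235},
}


def gflops_per_watt(peak_tflops, tdp_w):
    """Peak gigaflops delivered per watt of rated TDP."""
    return peak_tflops * 1000 / tdp_w


def throughput_deficit(slower_tflops, faster_tflops):
    """Fraction by which the slower card trails the faster one."""
    return 1 - slower_tflops / faster_tflops


if __name__ == "__main__":
    for name, spec in CARDS.items():
        eff = gflops_per_watt(spec["peak_tflops"], spec["tdp_w"])
        print(f"{name}: {eff:.2f} peak GFLOPS/W")
    gap = throughput_deficit(1.01, 1.31)
    print(f"5110P trails K20X by {gap:.1%} in peak throughput")
```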

Having said that, Intel would clearly prefer not to make direct comparisons with chips that it deems "accelerators," since Xeon Phi is, by contrast, very much a CPU. The language may be a bit precious, but Intel does have a point. Although both chips are mounted on PCIe cards that snap into systems driven by Xeons or Opterons, the Xeon Phi is a somewhat different sort of beast for several key reasons.

For one, the Xeon Phi runs its own Linux-based operating system and acts as an independent node in the cluster. Each card can get its own IP address, can run multiple jobs, and can communicate with other nodes across the network. Additionally, Xeon Phi cores are full-featured x86 processors, though modified for data-parallel processing. The Phi doesn’t need to rely on an external CPU to execute program control code, since one of its cores can serve that role locally, and it can be programmed like any other x86 processor, with the same familiar tools, although optimal throughput will obviously require parallelization. Finally, the Xeon Phi’s architecture diverges from today’s GPUs substantially when it comes to the cache hierarchy. For example, I believe Nvidia’s GK110 has 1.5MB of L2 cache; the Phi 5110P has 30MB of L2 cache, with full hardware-maintained coherency. For some types of workloads, Intel’s approach should yield very different results than today’s streaming-focused GPUs.

Comparisons to GPUs are nevertheless inevitable, and one of the first Xeon Phi clusters has landed on today’s Top500 list of fastest supercomputers, in seventh place with 2.66 petaflops of Linpack throughput. The cluster’s power consumption isn’t listed, so we can’t compare that aspect of the system directly to the (presumably much larger) Opteron-and-Tesla-based Titan at Oak Ridge National Laboratory, which took the top spot with 17.59 petaflops in Linpack.

In talking about the Xeon Phi’s performance, Intel makes a salient point about the claims of 30X or better speedups that one often hears coming out of projects that have made the transition to data-parallel computing. As it set out to port applications to Xeon Phi along with various partners, Intel did indeed see major performance improvements from converting legacy code to nicely vectorized code compiled with the latest tools. However, many of those speedups applied nearly as dramatically to regular Xeon E5 processors as they did to Xeon Phi. Simply giving a "before" number from old, unoptimized code running on a CPU and an "after" number from freshly optimized and vectorized code running on the Phi might yield a big, juicy multiple of improvement. However, when the same optimized code runs on both processors, the Xeon Phi is 2.2 to 2.9X faster than dual Xeon E5-2670s in applications like SGEMM, DGEMM, Linpack, and Stream.
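
The distinction is easier to see with numbers. The Python sketch below uses purely hypothetical runtimes, picked only so that the like-for-like result falls inside the 2.2 to 2.9X range Intel quotes; none of them is a measured figure.

```python
def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


# HYPOTHETICAL runtimes in seconds, for illustration only.
LEGACY_CPU_S = 300.0     # old, unoptimized code on the Xeons
OPTIMIZED_CPU_S = 25.0   # vectorized code on the same Xeons
OPTIMIZED_PHI_S = 10.0   # the same vectorized code on the Xeon Phi

# The "before vs. after" headline mixes a code change with a hardware change.
headline = speedup(LEGACY_CPU_S, OPTIMIZED_PHI_S)
# How much of that gain the CPU gets from the optimized code alone.
cpu_only_gain = speedup(LEGACY_CPU_S, OPTIMIZED_CPU_S)
# The fair comparison: same optimized code on both processors.
like_for_like = speedup(OPTIMIZED_CPU_S, OPTIMIZED_PHI_S)

if __name__ == "__main__":
    print(f"Headline speedup:      {headline:.1f}X")
    print(f"CPU gain from tuning:  {cpu_only_gain:.1f}X")
    print(f"Like-for-like speedup: {like_for_like:.1f}X")
```

With these made-up inputs, a 30X headline shrinks to 2.5X once the CPU is allowed to run the same tuned code.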

Interestingly, some applications did see larger speedups. BlackScholes SP saw a gain of 10.75X on the Phi versus regular Xeons. The difference, however, was due to specialized hardware for transcendentals built into Knights Corner, hardware that betrays the chip’s graphics-focused roots.

At any rate, the sorts of improvements depicted in Intel’s slides are nonetheless worth pursuing, and Intel contends further parallelization is essential to reach its goal of exascale computing, given the power constraints involved. The firm also insists the HPC and supercomputing markets alone are worth addressing with this new product lineup, given their growth potential.

The Xeon Phi 5110P is shipping to OEMs now, with availability to end customers planned for January 28, 2013.

    • Silus
    • 7 years ago

    Typo here:

    “while its peak TDP is 10W higher than the Nvidia card’s.”

    Shouldn’t it be 10W lower than the NVIDIA K20X (which is what this was being compared to) ?
    According to your other piece of news, K20X has a TDP of 235W, making the Xeon Phi, at least on TDP, slightly better than NVIDIA’s K20X.

    • Wirko
    • 7 years ago

    How big are these chips anyway? They must be VERY big as Intel obviously builds 64 cores on every chip, and 7 of them serve as “overprovisioning”.

    The yield should be very high though, maybe over 90%.

      • chuckula
      • 7 years ago

      My estimate is that they are 5 – 6 Billion transistors, which is certainly huge but actually smaller than a K20 at ~7Billion transistors. Not all of the transistors are active due to some of the cores in each chip being fused off for yield.

      Here’s the breakdown with estimates:
      1. 32 MB of cache = 32 * 2 ^ 20 bytes/MB * 9 bits/byte * 6 transistors/bit = 1.8 Billion

      (9 bits/byte due to ECC, but if there is no ECC then fewer transistors are needed, only 30 MB are active)

      2. Cores: Estimate of 50 million transistors per core. Remember that this does not include any cache and the Phi cores are intentionally simple. This could actually be an overestimate:
      64 * 50 million transistors/core = 3.2 Billion

      3. Uncore: Call it an even 500 million transistors for the memory controllers, PCIe interface, etc. etc.

      Total: ~5.5 Billion transistors. Subtract some from that figure since 4 cores and 2 MB of cache are deactivated in the commercially shipping version.

      • fellix
      • 7 years ago

      The die-shot clearly shows 62 cores for the full configuration. Intel didn’t state the die-size, so no hard estimations for now.

    • kukreknecmi
    • 7 years ago

    The supercomputer mentioned in 7th place is the Texas Advanced Computing Center’s Stampede, a Dell PowerEdge C8220 cluster. Also, the 2.6 PFLOPS mentioned in the article comes from the CPU cluster only. The Xeon Phi accelerators are going to provide more than 7 PFLOPS of computing capacity to the cluster. It is safe to assume that, with the addition of the Xeon Phis, the cluster’s capacity will grow to the 8-10 PFLOPS scale, which would lift it to 3rd or 4th place on the Top500.

    The off-topic and real question is what happened to Nagasaki Uni’s Deigma Cluster. It is nowhere to be found on the Top500 and is supposed to be rated at more than 200 TFLOPS. It is a homemade cluster with i5s and Radeon GPUs.

      • chuckula
      • 7 years ago

      To get on the Top-500 list you need to run tests in an approved manner and submit the results. If the Deigma Cluster guys didn’t do that, then they won’t show up on the list.

      The Top-500 should really be named the “Published” Top-500 since there are lots of powerful machines out there where the owners don’t want to brag about the performance for one reason or another.

    • ronch
    • 7 years ago

    Tell the waiter to start serving food. Everyone’s here.

    • Game_boy
    • 7 years ago

    For the undisclosed core count of the 3100 series, the 28.5MB implies 57 cores.

      • chuckula
      • 7 years ago

      Code name: Ketchup

      • derFunkenstein
      • 7 years ago

      He, uh…he covered that in the article.

      “However, our amazing, Sherlock Holmes-like powers of deduction may allow us to fill in some blanks. Since the 3100-series cards have 512KB of cache per core and possess a total cache size of 28.5 MB, we're going to go way out on a limb and deduce that these products will feature 57 active cores. Since they should achieve over one teraflops of double-precision floating-point performance with fewer cores than the 5110P, we'd expect higher clocks for the 3100 series. Those higher frequencies would explain why the cards' power envelopes are higher, at 300W.”

    • Krogoth
    • 7 years ago

    Naturally Intel will try to prevent AMD and Nvidia from taking over the HPC space. Phi is their answer to Tesla and Firepro lines.

      • Meadows
      • 7 years ago

      Come now Krogoth, you were supposed to say “direct answer”. And you didn’t even use the phrase “trade blows” this time! I don’t even know you anymore.

    • brucethemoose
    • 7 years ago

    Hmm, that giant Xeon Phi cluster listed in the Top 500 is actually within walking distance… I should go check that out.

    • Prospero424
    • 7 years ago

    It’ll certainly be interesting to see what sort of real-world processing workloads this sort of architecture will excel at vs. the GPU-based parallel solutions. There are bound to be at least a handful.

    I can certainly see the appeal of packaging these as completely independent nodes, but will they perform well enough in the areas experiencing the most demand in the HPC realm these days to be competitive? I’m genuinely curious.

    With AMD on the way out in the desktop processor market, the stakes are even higher if Intel has the potential (even if it’s at least five years out) to make significant inroads in the HPC market. They may not be a direct threat to Nvidia at the moment in that space, but that could change in a matter of a few years depending on how this situation turns out.

    Either way, it seems clear that Intel at least WANTS it all. It just remains to be seen if they can make it work in the mobile and HPC markets.

    • codedivine
    • 7 years ago

    Pretty good figures. As a datapoint, my Radeon 7970 does about 700 GFlops on DGEMM.

      • OU812
      • 7 years ago

      Intel’s own web page on the Phi 5110P only shows 829 for the SE10P (which is faster than the 5110P) so I wonder where they got the 883 number from?

      “4. 2 socket Intel® Xeon® processor E5-2670 server vs. a single Intel® Xeon Phi™ coprocessor SE10P (Intel Measured DGEMM perf/watt score 309 GF/s @ 335W vs. 829 GF/s @ 195W)”
      http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

      And for reference the Nvidia K20X scores 1220.

    • albundy
    • 7 years ago

    “Additionally, Xeon Phi cores are full-featured x86 processors”

    LMFAO!

      • bcronce
      • 7 years ago

      It makes it very easy to port any application. GPU and CPUs have completely different memory patterns and usage cases.

      Typically, what one is good at, the other does horribly. Xeon Phi is supposed to be a “good” trade off, entirely focused on throughput, but being x86.

    • OU812
    • 7 years ago

    “If we may compare briefly with the Tesla, the 5110P features *slightly less peak throughput, 1.01 teraflops to to the K20X's 1.31*, while its peak TDP is 10W higher than the Nvidia card's.”

    Since when is 23% less throughput "*slightly less*"?

    As an example it would take 24,239 Xeon 5110P's to be equal to the 18,688 K20X's in the Titan.

    Those 24,239 Xeon 5110P's would burn 5.45 megawatts vs 4.39 megawatts for the 18,688 Nvidia K20X's.

    So again, that 10 watt savings for the 5110P actually turns out to be a cost of an extra 1.06 megawatts when viewed in equal performance. And don't forget the recurring cost of that 1.06 megawatts and the additional cooling.

    Edit: The numbers look even worse for the 5110P when you consider that it is on 22nm vs 28nm for the K20X.

      • chuckula
      • 7 years ago

      Ooh look.. the shill has metastasized to a new article!

        • Meadows
        • 7 years ago

        Don’t let it perform mitosis.

          • willmore
          • 7 years ago

          You got it wet, didn’t you?

            • MadManOriginal
            • 7 years ago

            He probably fed it after midnight.

            • willmore
            • 7 years ago

            That only makes them mean.

          • chuckula
          • 7 years ago

          Don’t cross the streams?

        • OU812
        • 7 years ago

        Hey Chuckie (any relation to Char-LIE?), the facts are presented. Now present your case or forever be known as the fanboi who couldn’t.

          • chuckula
          • 7 years ago

          LMAO… if you actually were a regular on this site and not just a visiting troll you’d know that I’m often accused of being an Nvidia shill because I’m a big Linux user and I’ve had very good success with their drivers. I just don’t like over the top fanboys of any stripe.

          P.S. –> That applies to your bad-mouthing of the GCN AMD parts too, not just the Intel parts. Unfortunately AMD has a lot of problems as a company, but GCN is a pretty good architecture for compute, and I’d like to see it have more success in the HPC realm too.

            • OU812
            • 7 years ago

            Your PS is pure BS (as in complete lies (thus you must really be related to char-lie)).

            I post facts on products backed up by published articles you post moronic drivel as a reply and have the gall to call others moronic.

            Since you state you publish regularly here are you known as the village troll, shrill or idiot? I just want to be sure of your proper title when addressing you.

            • Meadows
            • 7 years ago

            People with the same first names are related to each other. This is golden.
            Sadly, he’ll be here all week, folks.

            • NeelyCam
            • 7 years ago

            Are you 5150’s evil twin?

          • Meadows
          • 7 years ago

          Char-lie? Doesn’t roll off the tongue. Is it the place where untruthful thoughts go to burn?

            • Prospero424
            • 7 years ago

            Maybe ALL of you could actually discuss the topic at hand instead of spending all of your time taking useless, petty pot shots at each other.

            Just an idea…

            • MadManOriginal
            • 7 years ago

            Yeah, but a boring one 🙁

      • moose17145
      • 7 years ago

      As the article points out, these are very different from the K20X. I would expect these to dominate over the K20X in some applications while the K20X dominates in others. It just depends on what you need. Also, as others have pointed out as well… these are all theoretical numbers. Could very well be these are more efficient in their use of resources than the K20X is, which would help close the gap some (quite a bit in some cases). And these also have the benefit of being full x86-capable processors.

      • derFunkenstein
      • 7 years ago

      5150 is still the better album.

        • 5150
        • 7 years ago

        I’m waiting for someone to make a username of the album after OU812.

          • MadManOriginal
          • 7 years ago

          Since both of you have spelled out the full album name, I don’t see why someone couldn’t make an account called For Unlawful Carnal Knowledge.

        • NeelyCam
        • 7 years ago

        They both kinda suck

    • dpaus
    • 7 years ago

    It’s almost 4:30pm EST, and Imagination Technologies hasn’t announced their product yet…. (I’d have used ARM Holdings PLC for that joke but I didn’t want to give Neely a heart attack)

      • Meadows
      • 7 years ago

      I’m waiting for an S3 Chromatose server part with up to 128 pipelines, GDDR2, and a PCIe slot adaptor.

      Edit: and LED lights.

        • willmore
        • 7 years ago

        Yes! More LEDs! I miss the good old CM with its forest of LEDs.

        Real computers have lights on them.

    • Meadows
    • 7 years ago

    60 cores can fit under 225 W.

    Not 61 though, hohoho, no! One core and +4% frequency adds another 75 W. I’m waiting for Otellini to don a turtleneck and show me how it’s magical.

      • chuckula
      • 7 years ago

      Those SE10P/SE10X model chips were specially designed for Stampede since it was one of Intel’s earliest customers. From the numbers, it is obvious that the Stampede chips were made and validated on an earlier stepping, and Intel made a later stepping that dialed down the TDP for the commercially available 5110P models.

      Overall, while the K20 certainly has higher peak performance these Phi chips look to have a pretty good balance of double-precision number crunching power combined with a *big* memory bandwidth advantage while fitting into the same power envelope as the lower-power K20. Will the Phi be the best for every workload? No, and Linpack benchmarks won’t get the best results on the Phi. However, for more complex workloads that need lots of parallelism but also don’t have perfectly uniform memory access patterns, the Phi should do extremely well.

        • Meadows
        • 7 years ago

        Let’s turn around the logic. If it’s actually a different stepping, then we’re seeing 25% lower power consumption going from “old” to “new”, and I guess that’s pretty useful.

        Is the 5110P going to be the only shipping part from the list?

        • BryanC
        • 7 years ago

        The K20X actually has equal or slightly better real world memory bandwidth than the Xeon Phi SE10P.

        Take a look at this page:
        http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html

        And search for "STREAM Triad" in the fine print, which is the benchmark that serves as an upper bound for real-life memory bandwidth. You'll see that the SE10P sustains 175 GB/s, versus its theoretical peak of 352 GB/s. That's 50% of peak memory bandwidth.

        Whereas Nvidia GPUs generally sustain around 75% of peak = 250 * 0.75 = 188 GB/s. The real world numbers are about equal.

      • MadManOriginal
      • 7 years ago

      Learn to read, moran.

      /Meadows impression.

        • Meadows
        • 7 years ago

        Spell moron correctly, moron. :}

          • derFunkenstein
          • 7 years ago

          Get a brain! Morans!

          http://knowyourmeme.com/memes/get-a-brain-morans

            • MadManOriginal
            • 7 years ago

            I have no doubts that Meadows knows the meme, but thanks for posting that intriguing history!

            • willmore
            • 7 years ago

            Let’s terminate this morain.
