Knights Corner is now Xeon Phi

The Intel project code-named Larrabee was intended to produce a discrete graphics processor to compete with chips from Nvidia and AMD. When that project was canned, Intel kept the work going by re-targeting it toward the HPC and supercomputing markets, building a many-core processor whose mission would be data-parallel computing rather than graphics. That renewed effort eventually gave birth to a revamped, in-development chip code-named Knights Corner. Today, in a major sign of progress toward a completed product, Intel has officially attached a brand name to Knights Corner and future many-integrated-core (MIC) products in the same vein: Xeon Phi.

Intel says it chose "Xeon Phi" because it wanted its data-parallel compute products to be part of the Xeon family. "Phi" was picked because it "evokes many concepts in science & nature including the ‘golden ratio’ in mathematics."

Beyond the announcement of the brand, Intel reiterated some key details about its plans for Knights Corner. The chip will be in production "in 2012" on Intel’s 22-nm process with tri-gate transistors. Knights Corner will ship with more than 50 cores and "8GB+ GDDR5 memory" on a PCI Express card. Intel claims the chip will achieve greater than a teraflop of double-precision compute throughput in Linpack. Notably, that’s the same basic performance claim Nvidia made recently for its upcoming GK110 processor. As a proof of concept, Intel has built a Knights Corner cluster capable of 118 TFLOPS of throughput.
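
For a sense of where that teraflop figure comes from, here’s a quick sanity check. The 512-bit vector width and fused multiply-add come from Intel’s public MIC disclosures; the exact core count and clock below are our assumptions, since Intel has only said "more than 50 cores."

    #include <stdio.h>

    int main(void)
    {
        /* Assumed configuration: Intel has confirmed ">50 cores" and
           512-bit SIMD with FMA; the exact count and clock are guesses. */
        int    cores          = 60;
        int    dp_lanes       = 512 / 64;   /* 8 doubles per vector */
        int    flops_per_lane = 2;          /* FMA = multiply + add */
        double clock_ghz      = 1.1;

        double peak = cores * dp_lanes * flops_per_lane * clock_ghz;
        printf("Theoretical peak: %.0f DP GFLOPS\n", peak);
        /* 60 * 8 * 2 * 1.1 = 1056 GFLOPS, consistent with ">1 TFLOPS"
           if Linpack sustains a large fraction of peak. */
        return 0;
    }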

In addition to its own efforts, Intel has enlisted a number of key industry players to help bring Xeon Phi to market. Among them is Cray, which intends to use Knights Corner-based processors in its own supercomputing products.

Perhaps the most distinctive thing Xeon Phi is expected to bring to the table is its x86 compatibility. Intel claims developers will be able to use standard tools and a familiar programming model to harness the power of its MIC processors. This ease of use could allow Intel to gain a foothold among HPC and supercomputing customers, who are increasingly turning to alternatives like Nvidia’s CUDA tool set and Tesla processors to realize the benefits of data-parallel computing. However, we are somewhat dubious about the ease with which applications can be modified to exploit the full potential of Intel’s MIC chips. Data-parallel computing often seems to require entirely new algorithms, not just modified data structures, to achieve optimal throughput. Only time will tell whether x86 compatibility will prove to be a noteworthy advantage in this space.
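
To illustrate the "familiar programming model" pitch: in principle, ordinary threaded code like the OpenMP loop below could be recompiled to run across a MIC card’s many cores, rather than being rewritten as a CUDA kernel. This is a minimal sketch using standard OpenMP only; Intel’s actual MIC offload syntax hasn’t been detailed here, so treat it as the flavor of the approach, not the product’s API.

    #include <stdio.h>
    #include <omp.h>

    /* A plain OpenMP dot product: the sort of code Intel says should
       carry over to its MIC processors with familiar tools. */
    static double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static double a[N], b[N];
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }
        printf("dot = %.1f across %d threads\n",
               dot(a, b, N), omp_get_max_threads());
        return 0;
    }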

Comments closed
    • ronch
    • 7 years ago

    Just a thought... what’s happening nowadays is somewhat analogous to what happened in the ’70s, when everyone was doing things their own way and there wasn’t a clear-cut industry standard. There was Apple, Commodore, IBM, etc., etc. A thousand different kinds of computers were on the streets. Perhaps it was Microsoft/Intel/IBM who tied everything together with x86/DOS/Windows in the ’80s and ’90s, but nowadays it seems like everyone wants to have the industry their own way again. x86, ARM, AMD’s HSA, Nvidia CUDA, Intel Xeon Phi, Linux, WinRT, etc. Suddenly the industry is in turmoil again. On one hand, it proves that no one company can claim to own the entire computer industry. On the other hand, it’s concerning to see how many of the industry’s resources are being spent (and wasted) on platforms that will probably lose out to the others and fade away.

    In the end, I expect the better part of the industry to perhaps consolidate again and agree on a standard. We’ll see.

    • Sahrin
    • 7 years ago

    Xeon Phi? Why not just call it “We still can’t get the graphics drivers to work.”

      • pogsnet1
      • 7 years ago

      It’s no longer graphics now, after they found out it was unplayable. =P

    • ronch
    • 7 years ago

    Xeon Phi? PHI? Yeah, Phi may stand for a lot of big things in life but… Xeon Phi???

    Those marketers AMD fired must’ve found their way inside Intel…

    Oh wait… what if the next Xeon Phi will be called Xeon Phinom?

    • Metonymy
    • 7 years ago

    At the university where I work, we do a lot of hydrological modeling and forecasting, on Linux, using the Portland Group FORTRAN compiler. We have reached the stage where newer x86 chips aren’t providing much of a performance improvement and for HPC apps like ours the obvious solution has seemed to be to rewrite code for CUDA.

    CUDA has a huge leg up on the newer stuff because so many professional tools already target it. But we don’t have the resources to rewrite our code for it, and though the compiler will do much on its own (using ‘suggestions’), we’re limited. (And of course, GK104, with its almost-zip support for double precision, reduced our motivation for that, though GK110 seems to address it.)
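
    (For readers unfamiliar with those compiler "suggestions": PGI’s accelerator directives, and the OpenACC standard that grew out of them, annotate ordinary loops and let the compiler generate the GPU code. A sketch in C for brevity, since the code described above is Fortran; the pragma illustrates the approach rather than their exact build.)

        /* Directive-based GPU offload: annotate the loop and let the
           compiler generate device code instead of hand-written CUDA.
           Build with a supporting compiler, e.g. "pgcc -acc". */
        void saxpy(int n, float a, const float *x, float *restrict y)
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }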

    To the extent this makes it easier to optimize/convert traditional code, without the significant rewrites that most GPU-based solutions require, it would be a huge win for us and for others doing similar things.

    • ub3r
    • 7 years ago

    How good are these cards at bitcoin mining??

      • xeridea
      • 7 years ago

      Probably poor. Bitcoin mining throughput is almost perfectly proportional to SPs × MHz. AMD GPUs also do better than Nvidia at SHA-256, which makes heavy use of the 32-bit integer right-rotate operation (supported natively on AMD cards, emulated on Nvidia). I doubt that’s supported here, and the core count is extremely low.
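
      (For the curious: the operation in question is the 32-bit rotate that SHA-256 leans on; without a native instruction it costs three operations instead of one. A minimal sketch:)

          #include <stdint.h>

          /* 32-bit right rotate, used pervasively in SHA-256. With a
             native rotate this is one instruction; otherwise it becomes
             the shift/shift/or emulation below. Masking keeps n = 0
             well defined. */
          static inline uint32_t rotr32(uint32_t x, unsigned n)
          {
              return (x >> n) | (x << ((32 - n) & 31));
          }

          /* Example: one of SHA-256's message-schedule functions. */
          static inline uint32_t sigma0(uint32_t x)
          {
              return rotr32(x, 7) ^ rotr32(x, 18) ^ (x >> 3);
          }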

    • ronch
    • 7 years ago

    Waste not want not.

    • xeridea
    • 7 years ago

    I don’t really see how this is much of a performance boost over just an i7. Sure, there are 50 cores, but they run at something like 500MHz-1GHz (I don’t remember the last number stated). So compare this to, say, a 3.5GHz six-core, hyperthreaded i7: 3.5GHz × 6 × 1.2 = ~25. Knights Corner would be ~25-50, depending on clock. Plus the i7 is out of order, with tons of performance enhancements and additional instruction sets. It also has a much lower TDP.

    GPUs get insane performance at similar clocks, but they have upwards of 2000 SPs, not 50 x86 cores. They are also getting friendlier to program, with shared memory space and full C compatibility, though they do require a rather different programming model and more attention to detail for optimization.

    The only thing I can see this being useful for is putting 4+ in a system, allowing more bang per system, but at a much higher power envelope. I could just be totally wrong though.

      • SPOOFE
      • 7 years ago

      [quote<] So compare this to say 3.5Ghz 6 core, hyperthreaded i7.[/quote<]

      No, because that's ridiculous. The product categories are completely different.

      • DavidC1
      • 7 years ago

      That’s where you’ve got a bunch of things wrong. Knights Corner can probably run at 1-1.6GHz, depending on the version. Also, it’s supposed to be "over 50 cores." Estimates range from 54 to 62, as not all cores will be activated, for redundancy purposes, but there are supposed to be 64 in total.

      Where it REALLY gains an advantage is massively parallel FP applications.
      -512-bit SIMD with FMA
      -GDDR5 with very wide links for massive memory bandwidth

      The Xeon E5 (Sandy Bridge core) with 8 cores running at 2.6GHz can get 155GFlops sustained, while the Xeon Phi can do 1TFlops sustained. That’s more than 6x the throughput. The Xeon E5 with DDR3-1600 has 50GB/s of bandwidth. Assuming the Xeon Phi uses the same memory setup as the Radeon 7970, it’ll do 5x that, at more than 250GB/s.

      And you CANNOT compare SPs with cores. Each x86 core in the Xeon Phi is 512 bits wide, meaning it can do 16 32-bit operations per cycle. While the theoretical SP flops for GPUs are greater, the Phi isn’t meant for gaming, and it has competitive DP throughput. The GK110, coming at the same time as the Xeon Phi, is supposed to be in the same DP throughput range as well.
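
      (The arithmetic above checks out as peak math: with AVX, an E5 core does 8 DP flops per cycle, so 8 cores × 8 × 2.6GHz is roughly 166 GFLOPS peak, against which 155 sustained is plausible; sixty-odd Phi cores doing 16 DP flops per cycle at around 1GHz lands near 1 TFLOPS, with those counts and clocks being assumptions. What "512 bits wide" means for code is sketched below.)

          /* What a 512-bit vector unit means in practice: one instruction
             operates on sixteen 32-bit floats (or eight 64-bit doubles).
             A vectorizing compiler would turn this loop body into fused
             multiply-add instructions, 16 elements at a time. */
          void fma_loop(int n, const float *a, const float *b, float *c)
          {
              for (int i = 0; i < n; i++)
                  c[i] = a[i] * b[i] + c[i];
          }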

        • Deanjo
        • 7 years ago

        [quote<]The GK110 coming same time as the Xeon Phi is supposed to be in the same DP throughput range as well.[/quote<]

        The conservative estimates I have seen put the DP performance around 1.33 TFLOPS.

    • bcronce
    • 7 years ago

    The real question is when we might see a small 16-core version integrated into your quad-core.

    Talk about heterogeneous: a GPU, a CPU, and a many-core CPU all packed into the same die.

    • allreadydead
    • 7 years ago

    Oh Intel, after the HD 4000 performance charts and seeing real-world performance numbers, I was so close to shelving this piece of brilliant humour forever...
    But, no. You HAD to do it. Again. You had it coming, Intel:
    [url<]http://i.imgur.com/G1zo4.jpg[/url<]

    • rrr
    • 7 years ago

    Interesting to see. But I can’t help but wonder: ensuring x86 compatibility surely took a lot of die space that could otherwise have been used for more cores. Could Intel drop some legacy parts of x86, and would that yield enough die-space savings for more cores?

      • d0g_p00p
      • 7 years ago

      Apparently the cores are Pentium Pro-based, so x86 is the foundation this is built on. I would like to know the clock speed and the interconnect. I imagine a PPro at 22nm must be really, really tiny. Based on that, I am surprised that Intel only managed to get 50 cores on there. I think the beta hardware was 80 cores, but I could be wrong.

      I cannot wait to read a review of one of these and see how well it scales, plus the performance vs. Nvidia and AMD cards.

        • xeridea
        • 7 years ago

        I think it’s 64 cores in silicon, with 50-60ish enabled. I also don’t see why they can’t get more cores in there; the PPro is ancient, though this is a modified design.

        • samurai1999
        • 7 years ago

        Knights Ferry (aka Larrabee) was 32 PPros

    • kcarlile
    • 7 years ago

    I’m fairly certain the initial run will all be sold to some big supercomputing center. It’d be really cool to have some of these in our cluster, though…

    • Forge
    • 7 years ago

    I wonder... x264 parallelizes nicely. I may have to get one of these and tinker.

      • chuckula
      • 7 years ago

      Something tells me these cards ain’t gonna be cheap… (and by “cheap” I mean the high-end $500 – $600 consumer cards from AMD/Nvidia).

    • dpaus
    • 7 years ago

    An Intel co-processor? Why isn’t it called the i80387?

    Seriously, though, being x86-friendly will likely push this toward a completely different market than either AMD’s or Nvidia’s GPGPU offerings. I can see immediate uses for this in Java-driven visualization of massive datasets.

    • Goty
    • 7 years ago

    *Calls it Larrabee and then Knight’s Corner*

    The public: “Meh”

    *Calls it a Xeon*

    The public: “OMG, best thing since sliced bread!”

      • Ringofett
      • 7 years ago

      Maybe it’s just me, but internal code names are usually much better than the final marketing names. "Xeon Phi" sounds like a sorority. Knights Corner sounds like something badass.

    • thill9
    • 7 years ago

    But wait! How can it play Crysis if it doesn’t have any display outputs?

      • dpaus
      • 7 years ago

      Buried deep in the spec sheet: ‘Wireless Neural Interface ver. 0.7’

      Driver bugs possibly scrambling the output media? Hey, it’s from Intel… 🙂

    • jdaven
    • 7 years ago

    Now this is an excellent use of Intel’s resources. I hope this means Intel has stopped beating the completely and utterly dead horse that is Itanium, and stopped trying to make a discrete video card the market doesn’t need out of the weakest part of the company: the Intel video driver development team.

    Kudos to you Intel. I hope your MIC project becomes another good ‘big iron’ processor for the HPC crowd. We need all the competition we can get in this extremely important field.

      • Airmantharp
      • 7 years ago

      I hope they don’t stop working on Itanium. Useful product or not, VLIW makes sense; the idea that the burden of organizing code should fall on the compiler in order to maximize computing efficiency is applicable to all areas of computing. It forces compilers and compiler engineers to get smarter, and that’s a good thing for everyone.

    • brucethemoose
    • 7 years ago

    Didn’t the 7970 already hit 1 DP TFlop?

      • HighTech4US2
      • 7 years ago

      Is that a sustained rate?

      Is ECC enabled?

      • Goty
      • 7 years ago

      Yes.

      • bcronce
      • 7 years ago

      GPUs are MUCH, MUCH pickier about the kinds of parallel algorithms they can run. They are still very specialized for certain workload types. GPUs have a hard time reaching peak performance on workloads outside their specialization, though they are getting better.

      A good example of this is branching. Within a GPU, the cores are broken up into groups that execute in lockstep. If any one core in a group takes a branch (if-else), then every core that isn’t taking the same branch must STOP and wait. Once the cores on the one path finish, the other cores do their work while the original cores sit idle. Once they merge back onto the same code path, they can all run at the same time again.

      If you have branchy parallel code, the x86 cores are going to wipe the floor with the GPU.

      Branching is just one example, but probably the most common one.

      GPUs also perform best with certain memory access patterns, which happen to be the opposite of what CPUs are good at. For certain programs, CPUs will be inherently faster because of access patterns.

      One cannot make a blanket statement that one is better than the other; it depends. Still, reaching the 1TF milestone is quite the achievement.
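
      (A concrete way to picture that branching penalty: when lanes in a lockstep group disagree on an if/else, the hardware effectively runs both paths and keeps each lane’s result by mask. The scalar C below models that cost; it illustrates the execution model, not any vendor’s actual implementation.)

          #define GROUP 32   /* lanes that execute in lockstep */

          /* Model of SIMD/GPU branch divergence: both paths run in full,
             so the cost is path A plus path B, not whichever one a given
             lane actually took. */
          void divergent(const int *cond, const float *x, float *out)
          {
              float then_val[GROUP], else_val[GROUP];

              for (int lane = 0; lane < GROUP; lane++)   /* pass 1: "then" */
                  then_val[lane] = x[lane] * 2.0f;
              for (int lane = 0; lane < GROUP; lane++)   /* pass 2: "else" */
                  else_val[lane] = x[lane] + 1.0f;
              for (int lane = 0; lane < GROUP; lane++)   /* merge by mask */
                  out[lane] = cond[lane] ? then_val[lane] : else_val[lane];
          }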

        • Firestarter
        • 7 years ago

        Good points. I have no idea what the current state of the architecture is, but if it’s an out-of-order design with a good branch predictor and a relatively low branch miss penalty, I guess it will be a lot easier to program for than your average GPU.

          • bcronce
          • 7 years ago

          Last I read, a long time back, they were in-order CPUs, to keep per-core transistor counts down. They do employ a 512-bit SIMD instruction set and fused multiply-add. Strong per-cycle SIMD across lots of cores yields strong throughput.

          I’ve heard it has some issues with memory bandwidth, so code must be careful to keep as much data in cache as possible. I assume GPUs have a similar issue, but people programming x86 may be more inclined not to worry about memory bandwidth.
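
          (The standard trick for keeping data in cache is loop tiling; a generic sketch below, with the tile size a tuning assumption rather than anything published for this chip:)

              #define TILE 64   /* assumed tile size; tune per cache */

              /* Tiled matrix multiply: works on TILE x TILE blocks so each
                 block stays cache-resident instead of streaming repeatedly
                 from bandwidth-limited memory. c must be zeroed by caller. */
              void matmul_tiled(int n, const double *a, const double *b,
                                double *c)
              {
                  for (int ii = 0; ii < n; ii += TILE)
                  for (int kk = 0; kk < n; kk += TILE)
                  for (int jj = 0; jj < n; jj += TILE)
                      for (int i = ii; i < ii + TILE && i < n; i++)
                          for (int k = kk; k < kk + TILE && k < n; k++) {
                              double aik = a[i * n + k];
                              for (int j = jj; j < jj + TILE && j < n; j++)
                                  c[i * n + j] += aik * b[k * n + j];
                          }
              }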

      • arbitmax
      • 7 years ago

      Technically, the answer is no: the maximum peak DP performance of Tahiti was 947 GFlops. But yeah, it is close enough to call it so.

        • xeridea
        • 7 years ago

        Especially considering that chip has a lot of headroom, easily hitting 1.0-1.1GHz without a voltage mod. They were conservatively binned, and GHz+ variants are coming out.

          • HighTech4US2
          • 7 years ago

          Gaming GPU != Professional GPU

            • Airmantharp
            • 7 years ago

            +1 for accuracy, but they are extremely closely related, with the differences being largely software and firmware.

            • Deanjo
            • 7 years ago

            When it comes to GPGPU purposes, it isn’t only software and firmware that differentiate them. Items such as ECC also start factoring in.

            • Airmantharp
            • 7 years ago

            I missed ECC, you’re right, but that’s just different RAM (or an extra chip? I’m not sure how ECC works on GPUs), not a change in the GPU silicon itself.

            My point is that the same piece of silicon gets used for products in both categories, with external configuration making up the difference. I’m not discounting those configuration changes or their usefulness, just noting that it’s the same GPU silicon at heart :).

            • Deanjo
            • 7 years ago

            No, you are right: with regard to the GPU, it is essentially the same (perhaps binned for electrical/thermal characteristics at the individual chip level); it’s just that some features are usually disabled on the consumer GPU. ECC still requires the memory controller to support it. I’m not sure if Nvidia actively disables ECC support on consumer chips or if the board vendors simply choose not to use ECC modules, since they have little relevance to consumer boards.

            I imagine they bin the GPUs appropriately as well: best -> Tesla, better -> Quadro, good -> consumer cards.

            When you compare cards like the Quadro vs. consumer cards in some 3D software, you will see minor differences in rendering quality. The consumer version can have misaligned or missing pixels in models where the Quadro cards render consistently.

            • Unknown-Error
            • 7 years ago

            Deanjo, does your company have plans to use "Xeon Phi"? IIRC, you already use CUDA-based accelerators. I assume programming in x86 should make life easier than CUDA or OpenCL?

            • Deanjo
            • 7 years ago

            Nope, no plans to support it at all. That could change, however, as we’re looking at the possibility of switching over to OpenCL (which is relatively painless to do); that is how we would develop for Knights Corner: via OpenCL. To go back to an x86-only route would be, quite frankly, painful and dumb, and it would completely go against our goal of running on anything. Going OpenCL allows us to utilize a wide variety of hardware, not just x86.

        • HighTech4US2
        • 7 years ago

        maximum peak != sustained

        Kepler GK110 will provide over 1 TFlop of double precision throughput with greater than 80% DGEMM efficiency.

        Has AMD published the FirePro’s double precision throughput and DGEMM efficiency?

    • chuckula
    • 7 years ago

    [quote<]Intel claims developers will be able to use standard tools and a familiar programming model to harness the power of its MIC processors.[/quote<]

    FTW x 10: I don't expect Knights Corner to blow away competing solutions from AMD or Nvidia in peak throughput, but only Intel is providing a truly open architecture that allows for a very wide range of tools, rather than locking you into either CUDA or AMD's flavor of OpenCL (and *believe me*, there's a huge difference between the abstract concept of "Open"CL and what you have to do with the specific SDKs to get performance out of AMD cards).

    Interesting: apparently, having the ability to use an accelerated co-processor with a simple compiler and standard programming models is now a bad thing. Of course, nobody has the courage to post *why* it's a bad thing, but somehow I'm not surprised.

      • MadManOriginal
      • 7 years ago

      Intel is evil. x86 is the devil’s ISA. At least, that’s what I’ve read on the internets.

        • BobbinThreadbare
        • 7 years ago

        You forgot to use the term "kludge."

      • Deanjo
      • 7 years ago

      And then there’s Clover:

      [url<]http://people.freedesktop.org/~steckdenis/clover/index.html[/url<]

      Of course, code written for Knights Corner is only good for Knights Corner. Sounds like a vendor-specific implementation to me.

        • SPOOFE
        • 7 years ago

        [quote<]Of course code written for knight's corner is only good for knights corner. Sounds like a vendor specific implementation to me.[/quote<]

        Then you have selective hearing, because "x86 compatibility" covers a much broader swath of the development base than either of the two competitors mentioned above.

          • chuckula
          • 7 years ago

          The SIMD instructions used in Knights Corner are fully, publicly documented, and they’re actually just a 512-bit extension of the AVX operations already used in newer Intel and Bulldozer/Trinity CPUs. I’m not 100% sure, but there’s a good chance AMD’s x86 license covers these instructions, and there’s nothing to prevent AMD from coming up with a competitor that was instruction-compatible (although AMD might not want to bother).

          The difference between openly documented instructions that a compiler can emit (GCC will be updated, and Intel already has compilers out for developers) vs. having to go through a huge video-card driver to get at the hardware is a big step up.

          • Deanjo
          • 7 years ago

          OK, it may run on x86 CPUs as well; just don’t expect it to be optimized to do so. OpenCL, on the other hand, can run on CPUs, DSPs, FPGAs, and GPUs. Yes, device-specific optimizations would be needed there as well, but you have a much larger selection of devices available. It’s the same old CTM-vs.-API argument.
