Knights Corner silicon hits teraflop mark

A year and a half has passed since Intel gave up on turning its Larrabee effort into a competitive desktop GPU. Instead, the resulting Many Integrated Core (MIC) architecture was repurposed for high-performance computing: think room-filling supercomputers rather than the overclocked gaming box sitting under your desk. Dubbed Knights Corner, this new effort finally showed its face this week at the SC11 supercomputing conference.

Intel didn’t reveal many new details about the chip at the show. According to ComputerWorld, though, Intel Technical Computing General Manager Rajeeb Hazra showed off actual silicon. The basics haven’t changed: Knights Corner features “more than 50” cores (Intel won’t say exactly how many), and it’s manufactured using cutting-edge 22-nm process technology, complete with 3D transistors. Intel isn’t talking clock speeds or power consumption, but it says that a single Knights Corner chip can push one teraflop.

While it’s unclear when Knights Corner will become available, the Texas Advanced Computing Center has reportedly ordered some for a facility that’s scheduled to be constructed starting next year. Knights Corner will reportedly be sold as a PCI Express add-in card, making it easy to plug into existing servers. That PCIe interface puts Knights Corner in direct competition with GPU-based compute platforms from AMD and Nvidia. Intel contends that its architecture, which has native support for the x86 instruction set, will make it easier for developers to port over applications.
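
To make that porting argument concrete, here’s a minimal sketch (ours, not Intel’s toolchain; the DAXPY routine and the OpenMP pragma are purely illustrative) of the kind of loop HPC codes spend their time in. Code like this could, at least in principle, be recompiled for an x86-based accelerator largely as-is, whereas targeting a GPU would mean rewriting the loop as a CUDA or OpenCL kernel and managing host-to-device transfers by hand.

    /* Illustrative only: ordinary C/OpenMP code of this sort is what an
     * x86-native accelerator could run after a recompile, while a GPU port
     * would mean rewriting it as a CUDA or OpenCL kernel plus explicit
     * host-to-device data transfers. */
    #include <stdio.h>
    #include <stddef.h>

    /* y = a*x + y, the classic DAXPY kernel */
    static void daxpy(size_t n, double a, const double *x, double *y)
    {
        #pragma omp parallel for    /* ignored if built without OpenMP */
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        double x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
        daxpy(4, 2.0, x, y);
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 2 4 6 8 */
        return 0;
    }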

Comments closed
    • pogsnet1
    • 8 years ago

    The future of the CPU will be in PCIe… so it can be compatible with all platforms.

    • bcronce
    • 8 years ago

    The benefit of these x86 cores is that they can perform near their peak performance in many, many more cases than a GPU can.

    GPUs have a high peak performance, ~600 GFLOPS, but only for code with almost no branching. The many-core CPU design allows each core to not care about branching in the other cores (edit: not care “as much” about branching).

    Another thing to be careful of with the 1 TFLOPS being claimed is that it implies about 16 DP FLOPS/cycle. They do have 512-bit-wide SIMD, which means 8 doubles can be loaded, but that is only 8 FLOPS/cycle. I’m assuming they are using fused math, like mul+add, which could give 16 DP FLOPS/cycle. For other math, it may be limited to only ~500 GFLOPS.
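
    Spelled out as a quick back-of-the-envelope program (the core count, clock, and FMA assumption are guesses; Intel hasn’t confirmed any of them):

        /* Back-of-the-envelope check of the 1 TFLOPS claim. Assumes 512-bit
         * SIMD (8 doubles per vector) and fused multiply-add (2 FLOPs per
         * lane per cycle); core count and clock are guesses, since Intel
         * hasn't published either. */
        #include <stdio.h>

        int main(void)
        {
            const double cores           = 50.0;       /* "more than 50"         */
            const double flops_per_cycle = 8.0 * 2.0;  /* 8 DP lanes x FMA = 16  */
            const double target          = 1.0e12;     /* 1 TFLOPS               */

            double clock_hz = target / (cores * flops_per_cycle);
            printf("required clock: %.2f GHz\n", clock_hz / 1.0e9);   /* ~1.25 */
            printf("non-FMA peak:   %.0f GFLOPS\n",
                   cores * 8.0 * clock_hz / 1.0e9);                   /* ~500  */
            return 0;
        }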

    Also, I don’t see how a core could sustain peak performance if you had to also load and store data. Even if you could do 16 FLOPS in a cycle, you still need to load that data into the registers and copy it out when done. Unless this chip has basic out-of-order execution and extra registers, so it can effectively hide the loads and stores.

    Just tossing around some thoughts. Either way, I’m excited for the future. No, this will not replace GPUs in all math areas, but it will make them work for it.

      • tejas84
      • 8 years ago

      I agree that the macro-op fusion on Knights Corner’s FPU is a better solution for FP-heavy workloads. GF100 does allow some basic branching per warp in some circumstances, though, while AMD’s current VLIW implementation does not branch in any meaningful way per wavefront.

      However, given that branchy code is not conducive to peak FLOP performance, I think your point is somewhat moot.

      If the compiler is good enough (and in Nvidia’s case their compiler is excellent at exploiting thread-level parallelism), then Nvidia has a worthy opponent in Fermi, and soon Kepler, versus Intel’s excellent Knights Corner/Ferry/Bridge chips.

      I just wish Intel would brand some for Desktop GPU gaming!

        • bcronce
        • 8 years ago

        “However, given that branchy code is not conducive to peak FLOP performance, I think your point is somewhat moot.”

        I didn’t actually think about the cost of a high-FLOP-per-cycle SIMD design with branchy code.

        GPUs can take a fairly large hit from branchy code because of stalled stream units, but the x86 core gets a problem of its own. Even one clock cycle lost to an IF statement is still a peak loss of 16 floats, and a branch misprediction could cost hundreds.
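
        Roughly quantified (the per-cycle figure assumes 512-bit FMA, and the misprediction penalty is a guess, since the pipeline depth isn’t public):

            /* Rough cost of a branch in foregone FLOPs, purely illustrative;
             * the per-cycle figure assumes 512-bit FMA and the misprediction
             * penalty is a guess. */
            #include <stdio.h>

            int main(void)
            {
                const int flops_per_cycle = 16;  /* assumed: 8 DP lanes x FMA   */
                const int stall_cycles    = 1;   /* one bubble from an IF       */
                const int mispredict_pen  = 15;  /* hypothetical pipeline flush */

                printf("one stalled cycle forfeits %d FLOPs\n",
                       stall_cycles * flops_per_cycle);              /* 16  */
                printf("a misprediction forfeits ~%d FLOPs\n",
                       mispredict_pen * flops_per_cycle);            /* 240 */
                return 0;
            }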

        However, I can’t remember if these cores even do speculative execution, because that adds to the transistor count, so branch misprediction may not exist; but the logic flow may then cause a fixed overhead.

        Good point.

      • Voldenuit
      • 8 years ago

      I haven’t read the whitepaper on Knights Corner, but do we know this for a fact? I would imagine that these x86 cores would be a lot simpler than traditional CPU cores, and that may mean simpler branching operations (stalls, loop unrolling, prediction, etc) as well as instruction issuing. If Knights Corner is tailored towards low-branch, highly parallel workloads, it may not be much more robust with loopy code than a modern GPU.

      @ OneArmedScissor: x86 is still found in many embedded applications, including microwaves, car ECUs, elevators, etc. IIRC, Canon designs and produces a custom x86 processor under license for their high-end DSLRs, though ARM and Power/PowerPC are more prevalent. Heck, there are still people using Motorola 68xx-based CPUs for their hardware.

        • bcronce
        • 8 years ago

        I slightly changed my statement with regards to branching after thinking tejas84’s post over.

        There doesn’t seem to be a lot of info about it, and as far as I know based on “ideas” that Intel has been talking about, it probably is a very simple and possibly in-order CPU. I’ll have to google a bit this weekend.

        • MadManOriginal
        • 8 years ago

        I believe it acts like a cluster of CPUs, and the cores are simpler than a ‘real’ CPU – not sure on the details of the latter, though. Because it acts like a cluster, parallelization doesn’t matter nearly as much as it does for GPUs; as long as you can load up the cores, you’re good. In many ways it’s a better solution for HPC than GPUs because of that: it just works like a cluster of CPUs, and that’s much less challenging for people to work with than GPUs, even if only because of familiarity. Add in the flexibility to run parallelizable code or many non-parallelizable workloads and it sure seems like a winner.

        And people laughed at Intel when Knight’s Ferry ‘failed’ – teehee.

    • Meadows
    • 8 years ago

    At this rate, by the time it comes out it’ll be behind AGAIN.

    • AlvinTheNerd
    • 8 years ago

    GFLOPS single or double precision?

    x86 should have very similar performance for both, and if it is 1 teraflop double precision, then I have to hand it to Intel for making a chip that is best for this market, hands down.

    A GeForce GTX 590 and Radeon HD 6990 both have ~2 TFLOPS single and ~600 GFLOPS double. Gaming performance will be based on single precision, and thus I am very sure that Knights Corner would not have competed in the GPU space.

    However, scientific computing needs double-precision performance, and if KC brings 1 TFLOPS double, then I really hope they offer the card to more than just special buyers, as there is a lot of code that I can compile on x86 much more easily than rewriting it to run on a GPU.

    If it’s 1 TFLOPS single and double is significantly less, the chip becomes a much more difficult proposition. The GPU has the performance advantage, but only by a factor of 2 instead of the factor of 10-15 they have over CPUs, and GPUs do require rewriting code. How much does it cost to run twice as many KCs vs. the time to port the code?

      • Goty
      • 8 years ago

      The time required to port the code is usually insignificant compared to the amount of time the code will be in use and therefore the amount of time that you will save in the end due to the better performance (speaking strictly of the HPC space, of course).

        • smilingcrow
        • 8 years ago

        “The time required to port the code is usually insignificant… “

        It varies a lot, but it is rarely that insignificant. For applications that are particularly demanding and run frequently, it makes sense to spend the extra time optimising them for the fastest platform. But there are plenty of applications that don’t fit that camp, where x86 compatibility will offer a huge advantage.
        I can see that it will be particularly useful in ad hoc situations where you just need to run a quick test. If the ad hoc tests lead somewhere fruitful, you might even end up optimising the code for CUDA eventually, so I can see that both methods can live together.

          • Goty
          • 8 years ago

          You’re absolutely right, but as I mentioned in my post, I’m referring only to the HPC space, which is where these sorts of co-processors might be in competition with one another. For instance, say I want to port the code I’m using now (which has been running a single simulation since last February) over to something like CUDA. It might take me two to three months to port the code, but I might shave a month or more off the total simulation time for ten or fifteen projects, so it’s certainly worth the initial investment.
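
          As a quick sanity check on those numbers (every figure is the rough estimate from the comment above, not a measurement):

              /* Break-even sketch for the porting example above; every figure
               * is a rough estimate from the comment, not a measurement. */
              #include <stdio.h>

              int main(void)
              {
                  const double port_cost_months  = 3.0;   /* time to port to CUDA      */
                  const double saved_per_project = 1.0;   /* months shaved per project */
                  const double projects          = 10.0;  /* projects reusing the code */

                  printf("net time saved: %.0f months\n",
                         projects * saved_per_project - port_cost_months);  /* 7 */
                  return 0;
              }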

            • Althernai
            • 8 years ago

            It depends on the code. If it is just your own code that you can rewrite in 2-3 months, then yes, you may benefit from switching architectures. However, not all HPC code is like that. What I work with is closer to HTC, but I bet there are a lot of HPC applications that suffer from one or more of the following:

            1) The code was written and modified by multiple people over long periods of time.
            2) Some external libraries must be integrated.
            3) It is not trivial to verify that the code works exactly as expected after making changes.

            If so, then changing the code is not that simple, because the people who wrote much of it are not there to do it anymore. Furthermore, even once it has been changed, it has to be validated. This typically means you have to figure out why your changes broke something seemingly unrelated. Finally, some of the time it is considered validated, and after the results are in, it is discovered that something subtle was not entirely correct.

            What Intel is aiming for is to keep people currently using x86 from leaving it. If the performance difference between switching and not switching is a factor of 10, then it might make sense to switch despite the high costs of transition, but if the difference is 20%, then switching is almost certainly not worth it.

            Finally, it’s worth mentioning that there is not a dominant alternative to x86. The GPU makers don’t collaborate (AMD has Stream, Nvidia has CUDA), and various other alternatives (e.g. SPARC) are not compatible with either. There’s a non-trivial probability that you switch to something and half a year later the company that makes it decides that, from now on, it’s only going to make smart phones and tablets or whatever. One of the benefits of x86 is that it’s relatively safe to say it will be around for a while, with new and improved variations.

            • MadManOriginal
            • 8 years ago

            But then KC has much higher DP FP power than GPUs. Unless it is absolutely insanely priced – it only needs to be priced around what HPC GPU cards are, and NV at least doesn’t let their consumer cards equal Quadros in DP – it’s hard to justify spending 2-3 months recoding for a slower solution, even if GPUs are faster than x86 CPUs.

      • smilingcrow
      • 8 years ago

      It’s 1 TFLOPS for DP from what I’ve read, as the older version managed 1 TFLOPS SP and this is much improved.

      • chuckula
      • 8 years ago

      It’s double precision.

    • OneArmedScissor
    • 8 years ago

    Their unrelenting holy war against anything non-x86 is admirable.

    Coming soon – x86 microwaves, brought to you by Intel! Deep-fry a turkey in mere seconds, with the new 50+ core Pentium 4 powered Core i9 2349823984239898X microwave-on-a-chip!

      • Meadows
      • 8 years ago

      Reminds me of Buy n’ Large.

      • smilingcrow
      • 8 years ago

      Way to miss the point; it’s all about software development time.

        • OneArmedScissor
        • 8 years ago

        Serious business.

    • Game_boy
    • 8 years ago

    Speculation: 50 means 64 on die minus whatever the yield requires them to cut (like Fermi; because it’s a huge chip on a new process).

      • entropy13
      • 8 years ago

      I’d rather think of it as 48 plus 2 more that they were able to add. LOL

        • NeelyCam
        • 8 years ago

        Not lolling.

      • NeelyCam
      • 8 years ago

      You took that from Thomas Ryan.

        • Game_boy
        • 8 years ago

        No. He said it was 64, yes, but I’m speculating as to why Intel won’t say that (due to yield).

          • NeelyCam
          • 8 years ago

          Yield was an obvious implication in Ryan’s comment; that’s why he didn’t even bother to mention it.
