Intel announces next-gen Knights Mill Xeon Phi accelerator

You may be wondering what's up with the Internet's coverage of IDF, since there wasn't a lot of news coming from that front yesterday. And yet, all of a sudden, something big has come along. The item in question is Knights Mill, the codename for the next generation of Intel's HPC-oriented Xeon Phi processors.

Our own Jeff Kampman has his boots on the ground as we speak, and he took a snapshot of Intel's announcement during a session:

Put simply, Knights Mill is an upgraded version of the current-generation Knights Landing CPUs. For those to whom either "Xeon Phi" or the whole "Knights" naming doesn't really ring a bell: these CPUs differ from the more "normal" models by offering many dozens of small cores instead of a few "big" ones, and they're targeted at HPC and machine-learning tasks. A quick perusal of Intel's ARK tells the story: the current-generation models pack anywhere from 64 to 72 cores, each capable of running four threads.

Intel says the Knights Mill models should offer better efficiency than their predecessors, as well as "enhanced variable precision," which we take to mean improved floating-point support. Like the current chips, Knights Mill CPUs will have a slice of RAM integrated right on the package (current Xeon Phis pack 16GB of MCDRAM). The company hopes the new CPUs will be gracing enormous clusters worldwide come 2017.

Since we're on the topic of Knights Mill, it's interesting to look back at how the perceived-as-failed Larrabee went from being advertised as a graphics card to actually powering the "Knights" series of Xeon Phi CPUs. In a personal blog entry, Larrabee architect Tom Forsyth lays out a bit of the chip's history and goes on to detail what it was meant to be, what it wasn't, and the public perception of the whole thing. If "Larrabee = Xeon Phi" comes as a surprise (as it did to a good portion of the TR staff), by all means, go read Mr. Forsyth's account.

Comments closed
    • tipoo
    • 3 years ago

    From the Larrabee link
    “That ordering is important – in terms of engineering and focus, Larrabee was never primarily a graphics card. If Intel had wanted a kick-ass graphics card, they already had a very good graphics team begging to be allowed to build a nice big fat hot discrete GPU – and the Gen architecture is such that they’d build a great one, too. But Intel management didn’t want one, and still doesn’t. But if we were going to build Larrabee anyway, they wanted us to cover that market as well.”

    Come oooooonnnn, let the GPU guys freeeee! Their gen9 graphics are pretty decent, scale that sucker up with enough bandwidth and put it on a board, we need some new blood in that race.

      • chuckula
      • 3 years ago

      You probably wouldn’t like the results very much.

      It’s AMD fanboyism to claim that Intel can’t do integrated graphics when they clearly can.

      It’s Intel fanboyism to claim that Intel’s integrated graphics are just a quick scaleup away from being a credible player in the discrete graphics market.

        • tipoo
        • 3 years ago

        Scaling it up is an understatement of the complexity, but I think they have a fairly competent core architecture.

    • DavidC1
    • 3 years ago

    They do not need a 10nm process for this.

    My guesses:
    -KNL was delayed and de-featured a LOT. It went from 14-16GFlops/watt projected performance with a release back in 2015 to 10-12GFlops/watt performance in 2016.
    -KNL has an AVX clock that’s lower, just like Haswell Xeon E5 chips. Only the top Xeon Phi 7290 ever reaches the 3TFlops mark; the lowest, the 7210, peaks at only 2.25TFlops.
    -Culprit? All signs point to 14nm being the suck.
    -KNM is likely a refined, higher-clocked KNL with FP16 thrown in. They can probably add deep learning-specific enhancements as well.

    Original KNL: 3TFlops, 2015 release, 160-215W TDP (including KNL-F)
    Actual KNL: 3TFlops but lower AVX clock, 2016 release, 215-260W TDP (including KNL-F)

    KNL: 1.3GHz AVX (1.5GHz non-AVX base/1.7GHz Turbo), 72 cores, 32 DP FLOPs/cycle = 3TFlops
    Possible KNM: 1.7GHz AVX (1.7GHz base/1.9GHz Turbo), 76 cores, 32 DP FLOPs/cycle = 4.1TFlops (see the quick arithmetic sketch below)

    It likely uses 14+ process as well.
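
    For reference, here's a minimal sketch of the peak-throughput arithmetic behind those figures (cores × AVX clock × DP FLOPs per core per cycle); the KNM core count and clocks above are the commenter's guesses, not announced specs:

        # Peak double-precision throughput: cores x AVX clock (GHz) x DP FLOPs per core per cycle,
        # divided by 1000 to get TFLOPS. Knights Landing cores do 32 DP FLOPs/cycle
        # (two 512-bit FMA-capable vector units per core).
        def peak_dp_tflops(cores, avx_clock_ghz, dp_flops_per_cycle=32):
            return cores * avx_clock_ghz * dp_flops_per_cycle / 1000.0

        print(f"KNL (Xeon Phi 7290): {peak_dp_tflops(72, 1.3):.1f} TFLOPS")  # ~3.0
        print(f"Speculative KNM:     {peak_dp_tflops(76, 1.7):.1f} TFLOPS")  # ~4.1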

    • the
    • 3 years ago

    Larrabee did fail as a GPU, and it failed hard. Repurposing it as an HPC part was not unwise, though. Early disclosures on the architecture were focused on its graphics capabilities, so I’m hesitant to follow Tom Forsyth’s claim that targeting the HPC sector was a higher priority than graphics.

      • UberGerbil
      • 3 years ago

      Well, Larrabee was fundamentally an experiment. Most experiments end in failure, but the failures often teach you more than the successes. (Successes just confirm things work the way you thought; failures tell you there are still things you have to learn.)

      And I'm not convinced that it (or at least that approach, which is kind of where the functional units in GPUs are headed anyway) would [i<]never[/i<] work as a GPU -- given enough R&D, aka talent, money, and time. The problem was that it became a very high-profile experiment very early in its life, in part because of the names working on it -- Intel's x86 architects are pretty anonymous, and guys like Michael Abrash definitely aren't -- and in part because Intel's marketing department got set loose on it (or just found out about it, ran with it, and weren't reined in). That set expectations unreasonably high, unreasonably early.

      But even if it had remained a black project buried inside Intel, I'm not sure the company as a whole would've had the patience to keep it going for as long as its gestation required. The fact that they re-purposed it and shipped it as an HPC part in pretty short order suggests the appetite Intel has for blue-sky R&D is fairly limited. They're ROI-driven, and this was the surest way to get an R on their I.

        • tipoo
        • 3 years ago

        [quote<]given enough R&D, aka talent, money, and time[/quote<] That, and their fab advantage. But that still doesn't make an x86 decode and ucode unit per GPU core a great idea; Intel would have had to pour way more resources into shoving that forward than normal for a GPU. I think it was partly a post-Itanium x86-or-bust mentality that led to the idea that an x86 GPU should happen.

        Kind of want them to take their gen 9 GPU architecture, multiply it by a few times, and put it on a board as a third GPU competitor though.

          • UberGerbil
          • 3 years ago

          [quote<] But that still doesn't make an x86 decode and ucode unit per GPU core a great idea, Intel would have had to pour way more resources into shoving that forward than normal for a GPU. [/quote<]Oh, no doubt about it. But I suspect if you took the Polaris design back to nVidia's designers in the DX7 timeframe they would've had a hard time believing their stuff was going to get that complicated either. Nevertheless, nVidia has the convenience of growing complexity as they need it; Intel was starting with a complicated design and trying to streamline it.

          But they [i<]could[/i<] streamline it -- and, had they continued with Larrabee, they undoubtedly would have. Unlike modern x86 or even Phi, Larrabee wouldn't have had the requirement to be able to run any random bit of 40-year old x86 code. Even when AMD streamlined things in x64, they could only do so in Long mode; they couldn't actually get rid of any of the old cruft because somebody might want to run some ancient real-mode 8088 code on it. But an "x86 GPU" could be incredibly constrained wrt the software it was expected to run -- nothing beyond Intel's own "driver" code essentially; they might not even publicly document the disjoint subset of x86 that the chip actually supported.

          By the time they got to a competitive GPU, whether it even warranted the term "x86" would be questionable: the code would still use a bunch of the conventions, and some of the basic opcodes would be the same, but the instruction set would be dramatically smaller and more aligned with the underlying uops. Essentially the only pieces left would be what they need to feed and schedule the parallel FPUs, and all the superscalar performance tweaks for that which Intel had already perfected (minus the ones that were irrelevant because they only existed to speed the now-missing cruft). By the end of that process, the relationship between conventional x86 CPUs and the final GPU would be a bit like the relationship between the scaffolding around a building under construction and the final building itself.

            • the
            • 3 years ago

            The thing that brought down Larrabee wasn’t that it was x86 (if anything, that actually [i<]helped[/i<]); the greater issue was that it was relying entirely on software for parts of the rendering pipeline that are traditionally hardware based. At that time in Intel's history, they couldn't write a driver to save the company. Sounds a bit hyperbolic, but they certainly couldn't get the Larrabee drivers working across a wide enough gamut of games for it to succeed as a GPU. Simply put, fixed-function hardware is faster than a software-driven solution. Intel did learn this, as the Knights Corner part did have some fixed-function ROPs, TMUs, and oddly a video decoder unit.

            A smaller, more RISC-like core would have an advantage over x86 in that it would consume less die space and thus more units could be put into the same area. I can imagine a future Larrabee-like project with ARM cores for shader units but with features like a hardware task scheduler.

            • tipoo
            • 3 years ago

            What do you mean by x86 helped? It certainly added flexibility, but it’s a lot of die area to spend per GPU ALU when every core repeats the x86 decode and ucode hardware. It even hampers smartphone CPUs, let alone the massive parallelism of GPUs.

            Though yeah, agreed that Larrabee could have been something if they conceded some hardware units like ROPs.

            • the
            • 3 years ago

            Compilers.

            Intel learned the hard way with Itanium that you can’t wave a magic compiler wand to get great-performing code. While it does take die space, using x86 enables the use of extensive compilers and debugging tools. It also allowed for prototyping graphics routines before the final Larrabee hardware was ready (granted, the vector code would need to be emulated). There were also a few existing software routines that could be easily updated for Larrabee.

            • tipoo
            • 3 years ago

            I think we’re on the same page. x86 lent a lot of software flexibility and leverage with existing tools, but scaling up to dozens and eventually hundreds of GPU cores also makes it consume a lot of cumulative die space.

            Now that GPUs themselves have gotten much more complex, I wonder if it would be more competitive in a few more years. Sadly we’ll never find out, as per what I quoted out of the article about Intel management just not wanting a dedicated GPU.

            • the
            • 3 years ago

            Yeah, I think the more ideal scenario would have been to implement a more Bulldozer-like approach and share the rather large x86 decoders. To feed the actual cores, the L1 instruction cache would need to directly store micro-ops to reduce the pressure on the shared decoders.

            This also would have been the best opportunity to integrate another feature missing from Larrabee: hardware scheduling. nVidia and AMD have dedicated hardware for thread scheduling. This is great for performance but would break some x86 conventions on how interrupts are handled and how the embedded OS on Larrabee works.

            An even more radical approach would be the ability to load shader programs directly as micro-ops to bypass the decoders entirely. Given the JIT nature of many shaders, this would be feasible for GPU work.

      • ImSpartacus
      • 3 years ago

      Yeah, he’s just saving face. I don’t believe that larrabee was intended to be an hpc part. Intel isn’t stupid and I’m sure they made sure they had the hpc angle in their back pocket, but it wasn’t the intention.

    • ImSpartacus
    • 3 years ago

    I loved reading Forsyth’s article, but is it that unknown that larrabee literally turned into xeon phi?

    I’m a bit of an Anandtech fanboy, so I tend to remember its old articles and sure enough, the debut xeon phi article mentions that it’s literally just a “direct continuation” of larrabee.

    [url<]http://www.anandtech.com/show/6017/intel-announces-xeon-phi-family-of-coprocessors-mic-goes-retail[/url<]

      • DancinJack
      • 3 years ago

      Beyond the minuscule minorities (relatively speaking) that frequent these sites – and even further, the ones that care about HPC or are just interested – I’d say most people don’t know that. Ofc you and I (among others here) know that, but once Intel wasn’t releasing a “discrete graphics card” anymore, I imagine most people just didn’t care anymore.

    • chuckula
    • 3 years ago

    From yesterday’s (mostly boring) keynote, the short-term interesting thing is Knights Mill on a [s<]10nm process[/s<] [u<]OK, not 10nm if DKanter says so; there were rumors from yesterday that were apparently wrong[/u<].

    The long-term interesting thing that will actually have major impacts on practically every computer being sold in the future was the announcement that they have been shipping silicon-photonics network interfaces. Right now it's nice for fast network connections in data centers. In a few years the same technology will be baked right into your chips and motherboards to provide ultra-highspeed connections that don't have to worry about electrical cross-talk.

      • Jeff Kampman
      • 3 years ago

      David Kanter says it’s not actually a 10-nm chip.

        • the
        • 3 years ago

        If it isn’t 10 nm, then Intel has to be doing some interposer/EMIB work to increase the effective die size. Knights Landing on 14 nm is estimated to be around 680 mm^2, the second-largest chip Intel has ever manufactured commercially. Maximum area on Intel’s previous nodes was between 750 mm^2 and 800 mm^2, so there is very little room to grow.

        • DancinJack
        • 3 years ago

        Maybe some more info? Why? What else did he say?

      • Srsly_Bro
      • 3 years ago

      Relying on rumors without corroborating evidence and not even a disclosure! For Shame, Chuck!

    • Growler
    • 3 years ago

    Every time I see something like “Optimized for Deep Learning”, I think of Dr. Forrester and TV’s Frank. [url=https://i.ytimg.com/vi/wjCjgjk9n0o/hqdefault.jpg<]Deep Hurting![/url<]

    • tipoo
    • 3 years ago

    Nvidia did not like the comparison they made on stage, lol. Pot, meet…Other pot? Actually I guess everything is pots in silicon. But Intel using a 4 year old Nvidia solution was definitely a bit scummy.

    [url<]http://arstechnica.com/gadgets/2016/08/nvidia-intel-xeon-phi-deep-learning-gpu/[/url<]

      • chuckula
      • 3 years ago

      Interesting point: When AMD was trash talking the GTX 1080 in a public way at the Polaris launch for products that address a pretty large market, Nvidia never said a peep.

      When Intel produces a slide about performance in an application (“deep learning”) that’s somewhat niche even within the already pretty niche sphere of HPC, Nvidia has a heart attack.

        • tipoo
        • 3 years ago

        AMD at least wasn’t comparing to a four-year-old product when they well knew a better one was out; they were more just “1080 is expensive lol”.
        Intel’s was a harder lie.

          • chuckula
          • 3 years ago

          Intel didn’t lie, and the comparison to a “4 year old product” is fair because that four-year-old product is still Nvidia’s flagship in HPC, since Maxwell was effectively skipped over for HPC status.

          As for Pascal, the P100 products are still so rarefied that hard performance numbers are difficult to come by. Nvidia has its own PR department; they can make their own marketing stuff up.

          Intel has clearly hit a nerve here. KNL is a legitimate product that might not win in a peak LinPack benchmark for a marketing slide, but it certainly has some extremely strong advantages in real-world workloads.

            • Zizy
            • 3 years ago

            The problem wasn’t the hardware used, it was the use of ancient software.

            And with this misleading BS from Intel’s side and NV’s prompt reply, as well as the P100 being announced but unavailable and Mill announced to come just a year after Landing, it seems both companies are scared shitless.

            • NTMBK
            • 3 years ago

            Rubbish. The Tesla M40 is a shipping product, with a GM200 GPU inside it, aimed at HPC. They did not skip over Maxwell for HPC. Maxwell was poor at [i<]double precision[/i<], but that is irrelevant for deep learning; there’s a reason why they’re adding FP16 in GP100, after all!

            • the
            • 3 years ago

            The GK210 is preferred over the newer GM200 for double precision workloads. The GM200 is superior for single and half precision though. Both are fair comparison points depending on the metric being used.

            • NTMBK
            • 3 years ago

            Indeed. And the workload compared was a deep learning workload, where you should definitely be using GM200.
