The Intel Xeon Scalable 6138P is the company's first shipping CPU with an FPGA on board

Intel announced that its first Xeon Scalable processor with an integrated FPGA is being made available to select customers. The Xeon Scalable 6138P includes an Arria 10 GX 1150 FPGA on package that's connected to the CPU die using Intel's Ultra Path Interconnect (UPI). According to Intel, UPI offers these chips coherent and direct access to data in the processor or FPGA caches and in main memory without the overhead of direct memory access or data replication.
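In software terms, the coherent-access model Intel describes resembles two threads sharing one buffer, rather than a copy-in/copy-out offload. The sketch below is a loose analogy using ordinary Python threads, not Intel's actual FPGA programming interface: the simulated "accelerator" mutates a shared buffer in place and signals completion, with no staging copy or DMA transfer.

```python
import threading

# One buffer in "main memory", visible to both the CPU thread and the
# simulated accelerator thread -- the analogue of UPI's coherent view.
shared = bytearray(b"lowercase payload")
done = threading.Event()

def accelerator():
    # Works on the shared buffer in place: no staging copy, no DMA.
    for i, b in enumerate(shared):
        shared[i] = b - 32 if 97 <= b <= 122 else b  # uppercase ASCII
    done.set()

t = threading.Thread(target=accelerator)
t.start()
done.wait()   # CPU side waits for the "FPGA" to signal completion
t.join()
print(shared.decode())  # LOWERCASE PAYLOAD
```

The point of the analogy: the consumer sees the producer's writes directly, with a lightweight signal instead of an explicit data transfer, which is what coherent UPI access promises over a DMA-based offload path.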

The Arria 10 GX 1150 is the beefiest FPGA in its family, with 1,150,000 programmable logic elements, 427,200 adaptive logic modules for efficiently constructing certain LUTs, 96 17.4-Gbps transceivers, and 768 pins of GPIO, among other features. Intel didn't specify the Xeon chip that accompanies the Arria 10 GX 1150, but the standard Gold 6138 is a 20-core, 40-thread chip with a 2-GHz base clock and a 3.7-GHz Turbo frequency in a 125-W TDP.

As an example of how the Xeon 6138P might be used by customers, Intel created a virtual switching reference platform that uses the FPGA portion of the chip for infrastructure dataplane switching while the CPU runs its usual applications or virtual machines. Offloading the software-defined networking load to the FPGA apparently offers better performance than asking the CPU alone to handle both workloads, as we might expect.

Intel also notes that the 6138P is compatible with Open vSwitch (OVS), the software-defined multilayer switch for VM environments, and it claims that an OVS implementation running on this chip offers 3.2x the throughput and half the latency of a non-FPGA-accelerated setup while hosting twice the virtual machines that a CPU alone can run.
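Taken at face value, those multipliers compound: double the VMs at 3.2x the aggregate throughput still leaves each VM with more bandwidth than before. A quick sanity check (the baseline figures below are hypothetical; only the 3.2x, 0.5x, and 2x multipliers come from Intel's claim):

```python
# Sanity-checking Intel's OVS claims: 3.2x throughput, 0.5x latency,
# and 2x VM density versus an unaccelerated deployment. The baseline
# numbers here are hypothetical; only the multipliers come from Intel.

baseline = {"throughput_gbps": 10.0, "latency_us": 50.0, "vms": 8}

accelerated = {
    "throughput_gbps": baseline["throughput_gbps"] * 3.2,  # 32.0
    "latency_us": baseline["latency_us"] * 0.5,            # 25.0
    "vms": baseline["vms"] * 2,                            # 16
}

# Even with twice the VMs, per-VM throughput improves:
per_vm_before = baseline["throughput_gbps"] / baseline["vms"]       # 1.25 Gb/s
per_vm_after = accelerated["throughput_gbps"] / accelerated["vms"]  # 2.0 Gb/s
print(per_vm_before, per_vm_after)
```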

Intel says Fujitsu is its lead partner for the Xeon 6138P, and that the firm plans to pair its own reliability, availability, and serviceability (RAS) special sauce with Xeon 6138P processors for software-defined network infrastructure equipment. Fujitsu will be demonstrating its implementation of the 6138P at its Fujitsu Forum this week in Tokyo.

Comments closed
    • UberGerbil
    • 2 years ago

    The “pilot program” for this was in the news two years ago. That's a long beta.

      • davidbowser
      • 2 years ago

      Agreed. I think this has limited use cases, like the vSwitch example (which is awesome, but limited).

      Maybe it's because I was just having a conversation about FPGAs at the BioIT conference a couple of days ago, but there ARE several people in the high-performance computing space who are excited about this. They tend to be the types who will go to some extremes to optimize performance (like buying GPU supercomputers), so writing code to take advantage of the FPGA is not a big deal. So a baked-in capability to offload certain functions will be welcome.

      The flip side is the many other people wondering whether this can REALLY offer better performance than they can get from existing methods (GPUs, add-on cards, etc.) and what the hardware refresh cycle will look like. The upgrade example is being able to swap faster GPUs (or whatever) into a cluster while keeping the rest of the chassis intact. CPU upgrades tend to be more problematic because they often require new motherboards/chipsets and thus usually end up meaning an all-new chassis.

    • blastdoor
    • 2 years ago

    Niche, but nice.

    • Neutronbeam
    • 2 years ago

    “a 20-core, 40-thread chip”…the mind boggles.

      • tipoo
      • 2 years ago

      Would 8-way SMT add to your boggling?

      12 SMT-8 cores for 96 threads on the fly in a single chip

      • Srsly_Bro
      • 2 years ago

      AMD has 32/64. Are you a fan boy or unaware of hardware released almost a year ago?

        • exilon
        • 2 years ago

        He said chip, not 4 chips stuffed into a socket.

          • DavidC1
          • 2 years ago

          EPYC does count as a single, 32-core chip. It does not count as a single, 32-core die. Kabylake-G is a single chip too. Just saying.

            • exilon
            • 2 years ago

            The term “chip” specifically refers to an IC on a monolithic piece of semiconductor material.

            EPYC is not a single chip. It's an MCM, which literally stands for multi-chip module.

            Neither is Kaby-G, which is also an MCM. There's the CPU chip, the stacked HBM chips, and then the Vega/Polaris/??? chip.

    • Physicist59
    • 2 years ago

    It is good to see this sort of technology get better and move to higher-end products. I have been using the Xilinx Zynq version of this (an ARM processor with an FPGA) for quite a while, and I really like it. However, eliminating DMA and giving full access to system memory would sure help throughput.

      • roncat
      • 2 years ago

      What is the use case for this thing? High speed trading? Mining?

        • dragontamer5788
        • 2 years ago

        Software-defined routers seem to be use case #1. Think routing multiple 10GbE packet streams on a customized basis (security features, VLAN features, filtering on top).

        But since it's an FPGA, anything where you need customized hardware acceleration would be an ideal use case. Mining is the obvious one, especially with communities like Monero's that change their algorithm each time an ASIC is made (it's far easier to develop an FPGA design than an ASIC, so you can redevelop and redeploy the FPGA each time the algorithm changes).

        Deep Blue was famous for having ASICs figure out chess positions. The FPGA could be programmed to do that.

        Basically: think a GPGPU, except smaller, more expensive, but a HELL of a lot more flexible. Software-defined routing is an ideal demonstration of this flexibility, but really anything that requires a flexible accelerator would fit.


        Neural networks seem to be well handled by GPGPUs / Volta cores. But an NN built out of this FPGA + CPU would have faster communication between CPU and FPGA than anything Nvidia has put out. So there's a chance that this FPGA could find some deep-learning use cases.

        The relatively low cost of GPUs, however, makes it a harder sell. And the “training” vs. “inference” split means that it's mostly a bandwidth problem as opposed to a latency problem. NNs / deep learning will likely remain the domain of GPUs, but I wouldn't be surprised if some new innovation allowed FPGAs to become competitive.


        Microsoft was relatively famous for recently deploying FPGAs as part of its Bing search engine. Databases could be searched with specialized circuits.

        Image processing is another good use case. Although it's currently in the realm of GPU compute, one can imagine more complicated algorithms (such as YouTube's copyright-infringement scanner) being accelerated on an FPGA more quickly. A more complicated algorithm like that would require CPU/accelerator communication and collaboration.

        While your standard CPU -> PCIe x16 -> GPU latencies are ~microseconds, your CPU -> UPI -> FPGA latencies would be measured in hundreds of nanoseconds. So close collaboration of accelerator + CPU with this architecture would lead to new opportunities.
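        The arithmetic behind that latency gap is worth spelling out. A minimal sketch (the ~1 µs and ~200 ns round-trip figures are ballpark numbers, not measurements from this platform) of how interconnect latency caps fine-grained synchronous offload:

```python
# A toy model of synchronous offload: each job pays one interconnect
# round trip before the CPU can use the result. The latencies are
# ballpark figures (~1 us PCIe, ~200 ns UPI), not measured values.

PCIE_RTT_S = 1e-6    # ~1 microsecond round trip over PCIe
UPI_RTT_S = 200e-9   # ~200 nanoseconds round trip over UPI

def max_offload_rate(rtt_s, compute_s=0.0):
    """Upper bound on back-to-back synchronous offloads per second."""
    return 1.0 / (rtt_s + compute_s)

# For tiny jobs the interconnect dominates: roughly 1M/s vs 5M/s.
pcie_small = max_offload_rate(PCIE_RTT_S)
upi_small = max_offload_rate(UPI_RTT_S)

# Once per-job compute dominates (say 100 us), the gap nearly vanishes,
# which is why latency matters most for fine-grained collaboration.
ratio_large = max_offload_rate(UPI_RTT_S, 100e-6) / max_offload_rate(PCIE_RTT_S, 100e-6)
print(pcie_small, upi_small, ratio_large)
```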

      • dragontamer5788
      • 2 years ago

      “However, eliminating DMA and giving full access to system memory would sure help throughput.”

      I dunno about “throughput”. PCIe is surprisingly quick these days. PCIe 3.0 x16 is something on the order of ~15.8 GB/s. In contrast, 2666 MT/s DDR4 is 21 GB/s (theoretical), x2 since most computers are dual channel. And my understanding of PCIe is that it looks like the devices have access to memory.

      The main issue addressed with this architecture is latency. UPI (or at least AMD's Infinity Fabric... I assume UPI is similar) is on the order of ~200 ns latencies, while PCIe is IIRC ~1,000+ ns, or ~microseconds.

      EDIT: UPI speed is noted in this Intel article. So UPI is something like 5x better latency than PCIe and maybe 2x to 3x better bandwidth. Certainly an upgrade over PCIe at least, but latency seems to be the bigger win in this setup.

      The crazy part is that UPI is how Intel's chips remain cache-coherent between sockets. So any FPGA hooked into the UPI network can theoretically join in the L3 cache-coherency fun. That means your C code can have a spinlock, and the FPGA will be notified of the state of that spinlock as quickly as any processor would have been.
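      The bandwidth figures quoted here are easy to derive from first principles. A quick sketch using published link and memory rates (not measurements from this platform):

```python
# Deriving the quoted bandwidth figures from published link rates.

# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
pcie3_lane_gbps = 8.0 * (128 / 130)        # ~7.88 Gb/s usable per lane
pcie3_x16_GBps = pcie3_lane_gbps * 16 / 8  # ~15.75 GB/s per direction

# DDR4-2666: 2666 MT/s across a 64-bit (8-byte) channel.
ddr4_2666_GBps = 2666e6 * 8 / 1e9          # ~21.3 GB/s per channel
dual_channel_GBps = ddr4_2666_GBps * 2     # ~42.7 GB/s for two channels

print(round(pcie3_x16_GBps, 2), round(ddr4_2666_GBps, 1), round(dual_channel_GBps, 1))
```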

        • DavidC1
        • 2 years ago

        There’s an article that says a PCIe 3.0 x8 slot realistically achieves only 1.8 GB/s for FPGA scenarios, while QPI can achieve 7-8 GB/s. UPI is an advancement over QPI: not only are the link speeds faster, but the packet transfer is more efficient, so a gain of over 50% versus QPI is possible.

        They also said QPI achieves a massive latency advantage at small transfer sizes, like 4K: two orders of magnitude, or greater than 100x, in QPI's favor.

        This makes sense looking at how Optane, as fast as it is, happens to be severely limited even with NVMe. And the greatest difference is at small 4K I/O sizes.

    • strangerguy
    • 2 years ago

    But can it run Crysis while hashing at 40MH/s?

      • DeadOfKnight
      • 2 years ago

      With FPGAs being used for hardware emulation of retro gaming consoles, I'm curious what kind of potential something like this could have for gaming.

    • chuckula
    • 2 years ago

    It’s good to see these things moving from demonstration products to real products. The UPI interconnect is interesting because it is the same interconnect used between sockets in multi-socket Xeon systems.
