Intel's Embedded Multi-Die Interconnect Bridge (EMIB) technology is one of the most interesting developments in package-level design this year. These tiny slices of silicon allow Intel to connect heterogeneous dies on the same substrate without large interposers. Today, Intel revealed how it'll use EMIBs as part of the new Stratix 10 MX family of field-programmable gate arrays (FPGAs) to break through the challenges of feeding those accelerators with enough memory bandwidth. As it's doing with the HBM2 RAM on board the as-yet-unnamed marriage of an Intel CPU and Radeon graphics, Intel will use EMIBs to hook up Stratix 10 MX FPGAs to as many as four "tiles" of HBM2 RAM for aggregate bandwidth of up to 512 GB/s. Besides the HBM2 memory stacks, Intel is also using EMIBs to join four transceivers to the FPGA fabric for signals like PCIe.
In its white paper on the new FPGAs, Intel describes the challenges of scaling up present system-level architectures using FPGAs and DDR4 RAM. Three channels of DDR4-3200 RAM might provide one of today's FPGAs with 80 GB/s of bandwidth, according to Intel, but scaling that figure up poses design and layout challenges that are seemingly impossible to overcome with present-day system architectures. As future FPGA processing demands increase, the company says that it's simply not feasible to put enough DDR I/O pins on a package to satisfy the accompanying increase in bandwidth that those applications will require.
Even if it were possible to put enough I/O pins on an FPGA package, Intel claims that the extra memory would require hundreds of lengthy traces per DIMM with power-hungry I/O buffers driving them, causing the power demands of that bandwidth to exceed realistic design constraints in the performance-per-watt-sensitive data center market. Finally, the company notes that situating 10 DDR4 DIMMs on a PCB to hit a theoretical 256 GB/s throughput (as would be required in its vision of some future demands on FPGAs) simply takes up a lot of space, harming data-center compute density.
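For a rough sense of the figures Intel is citing, peak DDR4 bandwidth scales linearly with channel count: each 64-bit channel moves eight bytes per transfer, so a DDR4-3200 channel tops out at 25.6 GB/s. The sketch below uses nominal peak rates (not Intel's exact configuration, and assuming one DIMM per independent channel for the ten-DIMM scenario):

```python
# Peak theoretical bandwidth of a standard 64-bit DDR4 channel:
# transfer rate (MT/s) x 8 bytes per transfer, expressed in GB/s.
def ddr4_channel_bw_gbs(mt_per_s):
    """Peak bandwidth in GB/s for one 64-bit (8-byte) DDR4 channel."""
    return mt_per_s * 8 / 1000  # MT/s * bytes -> MB/s -> GB/s (decimal)

# Three DDR4-3200 channels: roughly the 80 GB/s Intel quotes for today's FPGAs.
print(3 * ddr4_channel_bw_gbs(3200))   # 76.8

# Ten DIMMs on independent DDR4-3200 channels: the 256 GB/s future target.
print(10 * ddr4_channel_bw_gbs(3200))  # 256.0
```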
Those concerns echo many of the constraints that led AMD to develop HBM RAM for its graphics processors in the first place. HBM was developed to combat the large PCB area occupied by a growing number of GDDR5 memory chips and the accompanying number of wires needed to communicate with them and power them. It doesn't hurt that HBM offers lots and lots of raw bandwidth, to boot. Most current implementations of HBM (and HBM2) do require the fabrication of an interposer to join those memory chips with the GPU die itself, though: a large piece of extra silicon that introduces packaging complexity and limits the overall size of a chip that can be mated with those stacks of RAM.
Intel positions its EMIBs as the ideal way to surmount the implementation challenges posed by joining FPGAs with DDR4 to increase bandwidth. Unlike the silicon interposer that joins HBM RAM with Fiji and Vega GPUs, EMIBs let Intel join HBM2 memory dies with Stratix 10 MX FPGAs without running into the reticle limits that can constrain the size of chips that are packaged atop silicon interposers. Intel also claims that using EMIBs lets it enjoy package yields similar to those of substrates without EMIBs. Because the bridge is a small piece of silicon that connects to dies using micro-bumps instead of the through-silicon vias (TSVs) characteristic of interposers, the company doesn't have to worry about the potential fabrication challenges and yield reductions of TSVs, either. Instead, it can package chips destined for EMIB integration using standard flip-chip techniques.
Ultimately, all that fancy tech lets Stratix 10 MX FPGAs enjoy what Intel calls "an order of magnitude" increase in memory bandwidth to the FPGA fabric itself, all in a much more compact package than an FPGA joined to DDR RAM. The many-channel architecture of HBM2 also allows these FPGAs to have more memory accesses in flight at once: as many as 64, compared to the four to six channels of conventional DDR in today's FPGA implementations. While I will admit that quantifying FPGA performance is not my forte, the Stratix 10 MX FPGA fabric itself is built using Intel's HyperFlex FPGA architecture and can run at clock speeds of up to 1 GHz. The company says these improvements allow Stratix 10 MX chips to deliver higher performance and more flexibility in the applications they can accelerate.
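Those headline numbers check out against HBM2's nominal geometry: each stack exposes a 1024-bit interface split into eight 128-bit channels, and each channel can operate as two pseudo-channels. In the sketch below, the per-pin data rate is an assumption inferred from the quoted 512 GB/s aggregate rather than a figure Intel states:

```python
# JEDEC-nominal HBM2 per-stack geometry.
BITS_PER_STACK = 1024              # 8 channels x 128 bits each
CHANNELS_PER_STACK = 8
PSEUDO_CHANNELS_PER_CHANNEL = 2

stacks = 4                         # up to four HBM2 tiles per Stratix 10 MX package
pin_rate_gbps = 1.0                # assumed per-pin rate implied by the 512 GB/s total

# Aggregate bandwidth: stacks x interface width x per-pin rate, in GB/s.
aggregate_gbs = stacks * BITS_PER_STACK * pin_rate_gbps / 8
print(aggregate_gbs)               # 512.0

# Independent access streams: pseudo-channels across all four stacks.
print(stacks * CHANNELS_PER_STACK * PSEUDO_CHANNELS_PER_CHANNEL)  # 64
```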
Intel says the increased bandwidth of Stratix 10 MX accelerators makes those chips ideal for a wide range of high-performance computing, high-resolution video processing, wireline networking, data analytics, and Internet of Things applications in the data center. Interested readers should check out Intel's white paper and product family documents for more information about this new generation of FPGAs.