As fast as modern GPUs are, there will always be demand for more horsepower. Beyond gamers looking to drive multiple 4K displays with one graphics card, there are researchers and businesses with an ever-increasing thirst for greater compute acceleration. Judging from a recent research publication, Nvidia thinks it's rapidly approaching the limits of its current GPU architectural model, so it's looking for a way forward. The idea is still in the simulation stage, but the paper proposes a Multi-Chip Module GPU (MCM-GPU) that would comprise several GPU modules integrated as a single package.
The proposal was put together by researchers and engineers from Arizona State University, Nvidia, the University of Texas at Austin, and the Barcelona Supercomputing Center. The idea starts with the recognition that Nvidia is soon going to struggle to squeeze more performance out of its current layouts with today's fabrication technology. Typically, the company has been able to improve GPU performance between generations by ratcheting up the streaming multiprocessor (SM) count. Unfortunately, it's getting increasingly difficult to cram more transistors into single dies. Nvidia's V100 GPU, for example, required TSMC to produce the chips at the reticle limit of its 12-nm process. Furthermore, there are costs and problems associated with making ever-larger dies, as yield numbers decrease due to manufacturing faults.
It's possible that Nvidia could take the approach of putting multiple GPUs on the same PCB, as it did with the Tesla K10 and K80. However, the researchers found a number of problems with this approach that the company has yet to solve. For example, they note that it's not easy to distribute work across multiple GPUs, so it requires a lot of effort from programmers to use the hardware efficiently.
Instead, these researchers want to take advantage of developments in packaging technology that might allow Nvidia to place multiple GPU modules (GPMs) onto one package. These GPMs would be smaller than current GPUs, and therefore easier and cheaper to manufacture. While the researchers acknowledge that questions remain about the performance of packages like this one, they claim that recent developments in substrate technology could allow the company to implement a fast, robust interconnect architecture to let these modules communicate. Theoretically, on-package bandwidth could reach multiple terabytes per second.
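For a rough sense of why smaller dies are cheaper to make, here's a back-of-the-envelope sketch using the textbook Poisson yield model, in which the chance of a die having no defects falls off exponentially with its area. None of these figures come from the paper: the 800-mm² die, the four-way split, and the defect density of 0.1 per cm² are illustrative assumptions only.

```python
import math


def poisson_yield(die_area_mm2: float, defects_per_cm2: float) -> float:
    """Expected fraction of defect-free dies under a simple Poisson model."""
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


# All values below are assumptions for illustration, not figures from the paper.
DEFECTS_PER_CM2 = 0.1      # hypothetical defect density
MONOLITHIC_MM2 = 800.0     # hypothetical near-reticle-limit die
MODULES_PER_PACKAGE = 4    # hypothetical number of GPMs
MODULE_MM2 = MONOLITHIC_MM2 / MODULES_PER_PACKAGE

monolithic_yield = poisson_yield(MONOLITHIC_MM2, DEFECTS_PER_CM2)
module_yield = poisson_yield(MODULE_MM2, DEFECTS_PER_CM2)

# Silicon area fabricated per working product, assuming modules are
# tested individually and only good ones are packaged.
monolithic_area_per_good = MONOLITHIC_MM2 / monolithic_yield
mcm_area_per_good = MODULES_PER_PACKAGE * MODULE_MM2 / module_yield

print(f"monolithic yield: {monolithic_yield:.1%}")
print(f"module yield:     {module_yield:.1%}")
print(f"mm^2 of silicon per good monolithic GPU: {monolithic_area_per_good:.0f}")
print(f"mm^2 of silicon per good MCM package:    {mcm_area_per_good:.0f}")
```

Under those assumptions, about 45% of the big dies come out clean versus about 82% of the small ones. The model ignores packaging costs, any extra die area spent on inter-module links, and the common practice of selling partially disabled chips, so treat it as an intuition aid rather than a cost estimate.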
In Nvidia's in-house GPU simulator, the research team put together an MCM-GPU with a whopping 256 SMs, compared to Pascal's "measly" 56 SMs. The team then pitted that against a hypothetical (and unbuildable) 256-SM GPU built with the company's current architecture. The results showed that the MCM-GPU was 45.5% faster than the monolithic chip. Further comparison with multiple GPUs on the same board (rather than integrated into one package) still gave the MCM-GPU a 26.8% performance advantage.
These numbers all come from simulations and rely on upcoming technologies and untested optimizations, of course, so it's probably a little early to start putting pennies in the piggy bank and saving up to buy a card with an MCM-GPU. That being said, rumor has it that AMD is pursuing a similar idea with its Navi GPU, so it's possible that the MCM-GPU concept could become more prominent in the future. In the meantime, this paper serves as an intriguing opportunity to peek behind the curtain and hear Nvidia's engineers talk about the company's current design challenges and possible routes to new levels of GPU computational prowess.