For years now, AMD has taken on the responsibility of defining new types of memory to be used in graphics cards, standards that have eventually come to be used by the entire industry. Typically, being first out of the gate with a new graphics-oriented memory technology has given AMD a competitive advantage in that first generation of products. For instance, the introduction of GDDR5 allowed the Radeon HD 4870 to capture the performance crown back in the day.
Trouble is, GDDR5 is still the standard memory type for graphics processors to this day, seven years after the 4870's introduction. Graphics memory has gotten faster over time, of course, but it hasn't fundamentally changed for a long while.
GDDR5's reign is about to end, however, thanks to a new type of memory known by the ridiculously generic name of high-bandwidth memory (HBM). Although we've been waiting quite a while for a change, HBM looks to be a fairly monumental shift as these things go, thanks to a combination of new fabrication methods and smart engineering. The first deployment of HBM is likely to be alongside Fiji, AMD's upcoming high-end GPU, expected to be called the Radeon R9 390X.
Fiji is a fitting companion for HBM since a team of engineers at AMD has once again helped lead the charge in its development. In fact, that team has been led by one of the very same engineers responsible for past GDDR standards, Joe Macri. I recently had the chance to speak with Macri, and he explained in some detail the motivations and choices that led to the development of HBM. Along the way, he revealed quite a bit of information about what we can likely expect from Fiji's memory subsystem. I think it's safe to say the new Radeon will have the highest memory bandwidth of any single-GPU graphics card on the market—and not by a little bit. But I'm getting ahead of myself. Let's start at the beginning.
The impetus behind HBM
Macri said the HBM development effort started inside of AMD seven years ago, so not long after GDDR5 was fully baked. He and his team were concerned about the growing proportion of the total PC power budget consumed by memory, and they suspected that memory power consumption would eventually become a limiting factor in overall performance.
Beyond that, GDDR5 has some other drawbacks that were cause for concern. As anybody who has looked closely at a high-end graphics card will know, the best way to grow the bandwidth available to a GPU is to add more memory channels on the chip and more corresponding DRAMs on the card. Those extra DRAM chips chew up board real estate and power, so there are obvious limits to how far this sort of solution will scale up. AMD's biggest GPU today, the Hawaii chip driving the Radeon R9 290 and 290X, has a 512-bit-wide interface, and it's at the outer limits of what we've seen from either of the major desktop GPU players. Going wider could be difficult within the size and power constraints of today's PC expansion cards.
One possible solution to this dilemma is the one that chipmakers have pursued relentlessly over the past couple of decades in order to cut costs, drive down power use, shrink system footprints, and boost performance: integration. CPUs in particular have integrated everything from the floating-point units to the memory controller to south bridge I/O logic. In nearly every case, this katamari-like absorption of various system components has led to tangible benefits. Could the integration of memory into the CPU or GPU have the same benefits?
Possibly, but it's not quite that easy.
Macri explained that the processes used to make DRAM and logic chips like GPUs are different enough to make integration of large memory and logic arrays on the same chip prohibitively expensive. With that option off of the table, the team had to come up with another way to achieve the benefits of integration. The solution they chose pulls memory in very close to the GPU while keeping it on a separate silicon die. In fact, it involves a bunch of different silicon dies stacked on top of one another in a "3D" configuration.
And it's incredibly cool tech.
Something different: HBM's basic layout
Any HBM solution has three essential components: a main chip (a GPU, CPU, or SoC), one or more DRAM stacks, and an underlying piece of silicon known as an interposer. The interposer is a simple silicon die, usually manufactured using an older and larger chip fabrication process, that sits beneath both the main chip and the DRAM stacks.
Macri explained that the interposer is completely passive; it has no active transistors because it serves only as an electrical interconnect path between the primary logic chip and the DRAM stacks.
The interposer is what makes HBM's closer integration between DRAM and the GPU possible. A traditional organic chip package sits below the interposer, as it does with most any GPU, but that package only has to transfer data for PCI Express, display outputs, and some low-frequency interfaces. All high-speed communication between the GPU and memory happens across the interposer instead. Because the interposer is a silicon chip, it's much denser, with many more connections and traces in a given area than an off-chip package.
Although the interposer is essential, the truly intriguing innovation in the HBM setup is the stacked memory. Each HBM stack consists of five chips: four storage dies above a single logic die that controls them. These five chips are connected to one another via vertical connections known as through-silicon vias (TSVs). These pathways are created by punching holes through the silicon layers of the storage chips. Macri said those storage chips are incredibly thin, on the order of 100 microns, and that one of them "flaps like paper" when held in the hand. The metal bits situated between the layers in the stack are known as "microbumps" or μbumps, and they help form the vertical columns that provide a relatively short pathway from the logic die to any of the layers of storage cells.
Each of those storage dies contains a new type of memory conceived to take advantage of HBM's distinctive physical layout. The memory runs at relatively low voltages (1.3V versus 1.5V for GDDR5), lower clock speeds (500MHz versus 1750MHz), and relatively slow per-pin transfer rates (1 Gbps versus 7 Gbps for GDDR5), but it makes up for those attributes by having an exceptionally wide interface. In this first implementation, each DRAM die in the stack talks to the outside world by way of two 128-bit-wide channels. With four dies per stack, each stack has an aggregate interface width of 1024 bits (versus 32 bits for a GDDR5 chip). At 1 Gbps per pin, that works out to 1024 Gbps, or 128 GB/s, of bandwidth for each memory stack.
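For anyone who wants to check the math, here's a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above; the function and variable names are my own invention for illustration, not anything from AMD or the HBM spec, and the results are theoretical peak rates rather than measured throughput.

```python
def bandwidth_gb_per_s(interface_width_bits: int, rate_gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s: data pins times per-pin rate, divided by 8 bits per byte."""
    return interface_width_bits * rate_gbps_per_pin / 8


# Figures quoted in the article for first-generation HBM.
DIES_PER_STACK = 4          # storage dies only; the logic die is not counted
CHANNELS_PER_DIE = 2
CHANNEL_WIDTH_BITS = 128
HBM_RATE_GBPS = 1.0         # per-pin transfer rate

# Figures quoted in the article for a single GDDR5 chip.
GDDR5_CHIP_WIDTH_BITS = 32
GDDR5_RATE_GBPS = 7.0       # per-pin transfer rate

hbm_stack_width_bits = DIES_PER_STACK * CHANNELS_PER_DIE * CHANNEL_WIDTH_BITS
hbm_stack_bandwidth = bandwidth_gb_per_s(hbm_stack_width_bits, HBM_RATE_GBPS)
gddr5_chip_bandwidth = bandwidth_gb_per_s(GDDR5_CHIP_WIDTH_BITS, GDDR5_RATE_GBPS)

print(f"HBM stack width:  {hbm_stack_width_bits} bits")
print(f"HBM stack peak:   {hbm_stack_bandwidth:.0f} GB/s")
print(f"GDDR5 chip peak:  {gddr5_chip_bandwidth:.0f} GB/s")
```

The takeaway is that a single HBM stack, despite running each pin at one-seventh the speed, delivers several times the peak bandwidth of a single GDDR5 chip purely on the strength of its interface width.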
Making this sort of innovation happen was a broadly collaborative effort. AMD did much of the initial heavy lifting, designing the interconnects, interposer, and the new DRAM type. Hynix partnered with AMD to produce the DRAM, and UMC manufactured the first interposers. JEDEC, the standards body charged with blessing new memory types, gave HBM the industry's blessing, which means this memory type should be widely supported by various interested firms. HBM made its way onto Nvidia's GPU roadmap some time ago, although it's essentially a generation behind AMD's first implementation.