For years now, AMD has taken on the responsibility of defining new types of memory to be used in graphics cards, standards that have eventually come to be used by the entire industry. Typically, being first out of the gate with a new graphics-oriented memory technology has given AMD a competitive advantage in that first generation of products. For instance, the introduction of GDDR5 allowed the Radeon HD 4870 to capture the performance crown back in the day.
Trouble is, GDDR5 is still the standard memory type for graphics processors to this day, seven years after the 4870’s introduction. Graphics memory has gotten faster over time, of course, but it hasn’t fundamentally changed for a long while.
GDDR5’s reign is about to end, however, thanks to a new type of memory known by the ridiculously generic name of high-bandwidth memory (HBM). Although we’ve been waiting quite a while for a change, HBM looks to be a fairly monumental shift as these things go, thanks to a combination of new fabrication methods and smart engineering. The first deployment of HBM is likely to be alongside Fiji, AMD’s upcoming high-end GPU, expected to be called the Radeon R9 390X.
Fiji is a fitting companion for HBM since a team of engineers at AMD has once again helped lead the charge in its development. In fact, that team has been led by one of the very same engineers responsible for past GDDR standards, Joe Macri. I recently had the chance to speak with Macri, and he explained in some detail the motivations and choices that led to the development of HBM. Along the way, he revealed quite a bit of information about what we can likely expect from Fiji’s memory subsystem. I think it’s safe to say the new Radeon will have the highest memory bandwidth of any single-GPU graphics card on the market—and not by a little bit. But I’m getting ahead of myself. Let’s start at the beginning.
The impetus behind HBM
Macri said the HBM development effort started inside of AMD seven years ago, so not long after GDDR5 was fully baked. He and his team were concerned about the growing proportion of the total PC power budget consumed by memory, and they suspected that memory power consumption would eventually become a limiting factor in overall performance.
Beyond that, GDDR5 has some other drawbacks that were cause for concern. As anybody who has looked closely at a high-end graphics card will know, the best way to grow the bandwidth available to a GPU is to add more memory channels on the chip and more corresponding DRAMs on the card. Those extra DRAM chips chew up board real estate and power, so there are obvious limits to how far this sort of solution will scale. AMD’s biggest GPU today, the Hawaii chip driving the Radeon R9 290 and 290X, has a 512-bit-wide interface, and it’s at the outer limits of what we’ve seen from either of the major desktop GPU players. Going wider could be difficult within the size and power constraints of today’s PC expansion cards.
One possible solution to this dilemma is the one that chipmakers have pursued relentlessly over the past couple of decades in order to cut costs, drive down power use, shrink system footprints, and boost performance: integration. CPUs in particular have integrated everything from the floating-point units to the memory controller to south bridge I/O logic. In nearly every case, this katamari-like absorption of various system components has led to tangible benefits. Could the integration of memory into the CPU or GPU have the same benefits?
Possibly, but it’s not quite that easy.
Macri explained that the processes used to make DRAM and logic chips like GPUs are different enough to make integration of large memory and logic arrays on the same chip prohibitively expensive. With that option off of the table, the team had to come up with another way to achieve the benefits of integration. The solution they chose pulls memory in very close to the GPU while keeping it on a separate silicon die. In fact, it involves a bunch of different silicon dies stacked on top of one another in a “3D” configuration.
And it’s incredibly cool tech.
Something different: HBM’s basic layout
Any HBM solution has three essential components: a main chip (either a GPU, CPU, or SoC), one or more DRAM stacks, and an underlying silicon chip known as an interposer. The interposer is a simple die, usually manufactured using an older and larger chip fabrication process, that sits beneath both the main chip and the DRAM stacks.
Macri explained that the interposer is completely passive; it has no active transistors because it serves only as an electrical interconnect path between the primary logic chip and the DRAM stacks.
The interposer is what makes HBM’s closer integration between DRAM and the GPU possible. A traditional organic chip package sits below the interposer, as it does with most any GPU, but that package only has to transfer data for PCI Express, display outputs, and some low-frequency interfaces. All high-speed communication between the GPU and memory happens across the interposer instead. Because the interposer is a silicon chip, it’s much denser, with many more connections and traces in a given area than an off-chip package.
Although the interposer is essential, the truly intriguing innovation in the HBM setup is the stacked memory. Each HBM memory stack consists of five chips: four storage dies above a single logic die that controls them. These five chips are connected to one another via vertical connections known as through-silicon vias (TSVs), pathways created by punching holes through the silicon layers of the storage chips. Macri said those storage chips are incredibly thin, on the order of 100 microns, and that one of them "flaps like paper" when held in the hand. The metal bits situated between the layers in the stack are known as "microbumps" or μbumps, and they help form the vertical columns that provide a relatively short pathway from the logic die to any of the layers of storage cells.
Each of those storage dies contains a new type of memory conceived to take advantage of HBM’s distinctive physical layout. The memory runs at relatively low voltages (1.3V versus 1.5V for GDDR5), lower clock speeds (500MHz versus 1750MHz), and slower transfer rates (1 Gbps versus 7 Gbps for GDDR5), but it makes up for those attributes with an exceptionally wide interface. In this first implementation, each DRAM die in the stack talks to the outside world by way of two 128-bit-wide channels. Each stack, then, has an aggregate interface width of 1024 bits (versus 32 bits for a GDDR5 chip). At 1 Gbps, that works out to 128 GB/s of bandwidth for each memory stack.
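For the sake of concreteness, here's a back-of-the-envelope sketch of that peak-bandwidth math, comparing one HBM stack to one GDDR5 chip using the figures above. The little helper function is purely illustrative:

```python
# Peak bandwidth = interface width (bits) * per-pin data rate (Gbps) / 8 bits per byte
def peak_bandwidth_gb_per_s(width_bits, data_rate_gbps):
    return width_bits * data_rate_gbps / 8

hbm_stack  = peak_bandwidth_gb_per_s(1024, 1.0)  # 1024-bit stack at 1 Gbps -> 128 GB/s
gddr5_chip = peak_bandwidth_gb_per_s(32, 7.0)    # 32-bit chip at 7 Gbps    ->  28 GB/s

print(f"HBM stack:  {hbm_stack:.0f} GB/s")
print(f"GDDR5 chip: {gddr5_chip:.0f} GB/s")
```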
Making this sort of innovation happen was a broadly collaborative effort. AMD did much of the initial heavy lifting, designing the interconnects, interposer, and the new DRAM type. Hynix partnered with AMD to produce the DRAM, and UMC manufactured the first interposers. JEDEC, the standards body charged with defining new memory types, has given HBM its blessing, which means this memory type should be widely supported by various interested firms. HBM made its way onto Nvidia’s GPU roadmap some time ago, although Nvidia’s implementation is essentially a generation behind AMD’s first one.
The benefits of HBM
Macri says this first-generation HBM solution has a number of advantages over GDDR5. A higher peak transfer rate is chief among them, but that’s followed closely by some related wins. He estimates that GDDR5 can transfer about 10.66 GB/s per watt, while HBM transfers over 35 GB/s per watt.
HBM also packs tremendously more bits into the same space. A gigabyte of HBM fits into a single stack with a footprint of just 35 mm². By contrast, a gigabyte of GDDR5 spread across four chips occupies about 672 mm² of board space. As a result, HBM ought to allow for much smaller total solutions, whether that means more compact video cards or, eventually, smaller footprints for entire systems.
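Taken at face value, those figures imply ratios roughly like the following. This is a simple sketch dividing the quoted numbers, not a measurement:

```python
# Ratios of the figures Macri cited; nothing here is measured.
gddr5_gb_per_s_per_watt = 10.66   # GDDR5 bandwidth per watt
hbm_gb_per_s_per_watt   = 35.0    # first-gen HBM, claimed minimum

gddr5_mm2_per_gb = 672.0          # four GDDR5 chips totaling 1 GB
hbm_mm2_per_gb   = 35.0           # one 1 GB HBM stack

print(f"Bandwidth per watt: {hbm_gb_per_s_per_watt / gddr5_gb_per_s_per_watt:.1f}x")  # ~3.3x
print(f"Footprint per GB:   {gddr5_mm2_per_gb / hbm_mm2_per_gb:.0f}x smaller")        # ~19x
```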
Past GDDR memory types have largely been confined to graphics-related devices because of their high bandwidth and corresponding higher access latencies, but Macri expects HBM to make its way into more general applications. That’s no great surprise given that AMD hinted strongly at future server-class APUs that use HBM in its recent Analyst Day roadmap reveal.
Although HBM was built primarily to deliver more bandwidth per watt, Macri cites a host of reasons why its access latencies should be effectively lower, too. First, he points to the "very small horizontal movement" of data within the DRAM, since the data paths traverse the stack vertically. With more channels and banks, HBM has "much better pseudo-random access behavior," as well. Also, HBM’s clocking subsystem is simpler and thus incurs fewer delays. All told, he says, those "small positives" can add up to large reductions in effective access latencies. Macri points to server and HPC workloads as especially nice potential fits for HBM’s strengths. Eventually, he expects HBM to move into virtually every corner of the computing market except for the ultra-mobile space (cell phones and such), where a "sister device" will likely fill the same role.
Another advantage of HBM is that it requires substantially less die space on the host GPU than GDDR5. The physical interfaces, or PHYs, on the chip are simpler, saving space. The external connections to the interposer are arranged at a much finer pitch than they would be for a conventional organic substrate, which means a more densely packed die. Macri hinted that even the data flow inside the GPU itself could be optimized to take advantage of data coming in “in a very concentrated hump.”
Of course, all of these things sound very much like the sorts of positive effects one might expect from closer integration of a critical component. In that respect, then, HBM looks poised to deliver on much of the promise of the initial concept.
A likely map to Fiji’s HBM-infused memory subsystem
We’ve already discussed the potential savings in die space that HBM might grant to AMD’s next big Radeon GPU. Surprisingly enough, I think we can map out that GPU’s entire memory subsystem based on my discussion with Macri and the information included in the JEDEC documentation for HBM.
The basic stack layout I outlined above is almost surely what Fiji uses: a four-die stack with two 128-bit channels per die. At a clock speed of 500MHz with a DDR-style arrangement that transfers data on both the rising and falling edges of the clock, that memory should have a 1 Gbps data rate. Thus, each 1024-bit link from a stack into the GPU should be capable of transferring data at a rate of 128 GB/s.
As in the examples provided by AMD, Fiji will have four stacks of DRAM attached. That will give it a grand total of 512 GB/s of memory bandwidth, which is quite a bit more than both the Radeon R9 290X (320 GB/s) and the GeForce Titan X (336 GB/s). Based on that difference alone, I’d wager that the new Radeon will outperform today’s fastest GPU by a considerable margin. Memory bandwidth is one of a handful of key constraints that defines the performance of a GPU these days, and having that sort of an edge in bandwidth should translate into world-beating performance, provided AMD doesn’t have any show-stopping problems elsewhere.
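Here's how that card-level math shakes out, assuming four stacks at the 128 GB/s figure worked out earlier. The numbers for the competing cards are their published peak rates:

```python
stack_bandwidth_gb_per_s = 128             # per first-gen HBM stack
fiji_total = 4 * stack_bandwidth_gb_per_s  # 512 GB/s across four stacks

r9_290x_gb_per_s = 320
titan_x_gb_per_s = 336

print(f"Fiji (estimated): {fiji_total} GB/s")
print(f"vs. R9 290X:      {fiji_total / r9_290x_gb_per_s:.2f}x")  # 1.60x
print(f"vs. Titan X:      {fiji_total / titan_x_gb_per_s:.2f}x")  # ~1.52x
```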
One thing we don’t really know yet from the information Macri presented is how much power savings HBM really delivers in the context of a big GPU like Fiji, where power budgets can touch 300W. I’m not sure what portion of that budget is consumed by memory and memory-related logic.
Macri did say that GDDR5 consumes roughly one watt per 10 GB/s of bandwidth. That would work out to about 32W on a Radeon R9 290X. If HBM delivers on AMD’s claims of more than 35 GB/s per watt, then Fiji’s 512 GB/s subsystem ought to consume under 15W at peak. A rough savings of 15-17W in memory power is a fine thing, I suppose, but it’s still only about five percent of a high-end graphics card’s total power budget. Then again, the power-efficiency numbers Macri provided only cover the power used by the DRAMs themselves. The savings on the GPU side from the simpler PHYs and such may be considerable.
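For reference, the DRAM-side arithmetic from the previous paragraph looks like this. It's a quick sketch using the quoted per-watt figures, and it excludes any GPU-side savings:

```python
r9_290x_bandwidth = 320.0   # GB/s of GDDR5 bandwidth
fiji_bandwidth    = 512.0   # GB/s of HBM bandwidth (estimated above)

gddr5_power_w = r9_290x_bandwidth / 10.0  # ~32 W at roughly 10 GB/s per watt
hbm_power_w   = fiji_bandwidth / 35.0     # ~14.6 W at 35+ GB/s per watt

print(f"R9 290X memory power: ~{gddr5_power_w:.0f} W")
print(f"Fiji memory power:    ~{hbm_power_w:.0f} W")
print(f"Rough savings:        ~{gddr5_power_w - hbm_power_w:.0f} W")
```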
This first-gen HBM implementation will impose at least one limitation of note: with four 1GB stacks, total capacity will be just 4GB. At first blush, that sounds limited for a high-end video card. After all, the Titan X packs a ridiculous 12GB, and even the prior-gen R9 290X already offers 4GB. Now that GPU makers are selling high-end cards on the strength of their performance at 4K resolutions, one might expect more capacity from a brand-new flagship graphics card.
When I asked Macri about this issue, he expressed confidence in AMD’s ability to work around this capacity constraint. In fact, he said that current GPUs aren’t terribly efficient with their memory capacity simply because GDDR5’s architecture required ever-larger memory capacities in order to extract more bandwidth. As a result, AMD “never bothered to put a single engineer on using frame buffer memory better,” because memory capacities kept growing. Essentially, that capacity was free, while engineers were not. Macri classified the utilization of memory capacity in current Radeon operation as “exceedingly poor” and said the “amount of data that gets touched sitting in there is embarrassing.”
Strong words, indeed.
With HBM, he said, “we threw a couple of engineers at that problem,” which will be addressed solely via the operating system and Radeon driver software. “We’re not asking anybody to change their games.”
The conversation around this issue should be interesting to watch. Much of what Macri said about poor use of the data in GPU memory echoes what Nvidia said in the wake of the revelations about the GeForce GTX 970’s funky 3.5GB/0.5GB memory split. If Nvidia makes an issue of memory capacity at the time of the new Radeons’ launch, it will be treading into dangerous waters. Of course, the final evaluation will be up to reviewers and end-users. We’ll surely push these cards to see where they start to struggle.
A few other tricky issues
Any tech as radically new and different as HBM is likely to come with some potential downsides. The issues for HBM don’t look to be especially difficult, but they could complicate life for Fiji in particular as the first HBM-enabled device.
One potential problem with HBM is especially acute for large GPUs. High-end graphics chips have, in the past, pushed the boundaries of possible chip sizes right up to the edges of the reticle used in photolithography. Since HBM requires an interposer big enough to hold both the GPU and its DRAM stacks, it could impose a new size limitation on graphics processors. When asked about this issue, Macri noted that fabricating larger-than-reticle interposers might be possible using multiple exposures, but he acknowledged that doing so could become cost-prohibitive.
Fiji will more than likely sit on a single-exposure-sized interposer, and it will probably pack a rich complement of GPU logic given the die size savings HBM offers. Still, with HBM, the size limits are not what they once were.
Another possible issue with HBM’s tiny physical footprint is increased power density. Packing more storage and logic into a smaller area can make cooling that solution difficult because the cooling solution must transfer more heat through the available surface area. AMD arguably had a form of this problem with the Radeon R9 290X, whose first retail coolers couldn’t always keep up, leading to reduced performance.
Fortunately, Macri told us the power density situation was “another beautiful thing” about HBM. He explained that the DRAMs actually work as a heatsink for the GPU, effectively increasing the surface area for the heatsink to mate to the chips. That works out because, despite what you see in the “cartoon diagrams” (Macri’s words), the Z height of the HBM stack and the GPU is almost exactly the same. As a result, the same heatsink and thermal interface material can be used for both the GPU and the memory.
(Notice that Macri did not say Fiji doesn’t have a power density issue. He was talking only about the HBM solution. The fact remains that leaked images of Fiji cards appear to have liquid cooling, and one reason to go that route is to deal with a power density challenge.)
The final issue HBM may face out of the gate is one of the oldest ones in semiconductors. Until HBM solutions are manufactured and sold in really large volumes, their costs will likely be relatively high. AMD and its partners will want to achieve broad market adoption of HBM-based products in order to kick-start the virtuous cycle of large production runs and dropping costs. I’m not sure Fiji alone will sell in anything like the volumes needed to get that process started, and as far as we know, AMD’s GPU roadmap doesn’t have a top-to-bottom refresh with HBM on it any time soon. Odds are that HBM will succeed enough to drive down costs eventually, but one wonders how long it will take before it reaches prices comparable to GDDR5.
When I asked Macri this question, he avoided getting into the specifics of AMD’s roadmap, but he expressed confidence that HBM will follow a trajectory similar to past GDDR memory types.
I don’t think his confidence is entirely misplaced, even if HBM may take a little longer to reach broad adoption. The tech AMD and its partners have built is formidable for a host of reasons we’ve just examined, and what’s even more exciting is the way HBM promises to scale up over time. According to Macri, HBM2 is already on the way. It will “wiggle twice as fast” as the first-gen HBM, giving it twice the bandwidth per stack. The memory makers will also move HBM2 to the latest DRAM fabrication process, giving it four times the capacity. The stack itself will grow to eight layers, and Macri said someday it may grow as large as 16. Meanwhile, JEDEC is already talking about what follows HBM2.
Whatever happens that far down the road, we appear to be on the cusp of a minor revolution in memory tech. We should know more about what it means for graphics in about a month, when AMD will likely unveil Fiji to the world.