2016 will be remembered for a lot of things. For graphics cards, last year marked the long-awaited transition to next-generation process technologies. It was also the year that the graphics card arguably came into its own as a distinct platform for compute applications. Not an Nvidia presentation went by last year wherein Jen-Hsun Huang didn’t tout the power of the graphics processor for self-driving cars, image recognition, machine translation, and more. The company’s various Pascal GPUs set new bars for gaming performance, too, but it’s clear that gaming is just one job that future graphics cards will do.
A block diagram of the Vega architecture.
AMD is, of course, just as aware of the potential of the graphics chip for high-performance computing. Even before ATI’s merger with AMD and the debut of graphics cards with unified stream processor architectures, the company explored ways to tap the potential of its hardware to perform more general computing tasks. In the more than ten years since, graphics chips have been pressed into compute duty more and more.
An unnamed Vega chip
AMD’s next-generation graphics architecture, Vega, is built for fluency with all the new tasks that graphics cards are being asked to do these days. We already got a taste of Vega’s versatility with the Radeon Instinct MI25 compute accelerator, and we can now explain some of the changes in Vega that make it a better all-around player for graphics and compute work alike.
Memory, memory everywhere
In his presentation at the AMD Tech Summit in Sonoma last month, Radeon Technologies Group chief Raja Koduri lamented the fact that data sets for pro graphics applications are growing to petabytes in size, and high-performance computing data sets to exabytes of information. Despite those increases, graphics memory pools are still limited to just dozens of gigabytes of RAM. To help crunch these increasingly enormous data sets, Vega’s memory controller—now called the High Bandwidth Cache Controller—is designed to help the GPU access data sets outside of the traditional pool of RAM that resides on the graphics card.
The “high-bandwidth cache” is what AMD will soon be calling the pool of memory that we would have called RAM or VRAM on older graphics cards, and on at least some Vega GPUs, the HBC will consist of a chunk of HBM2 memory. HBM2 has twice the bandwidth per stack (256 GB/s) that HBM1 does, and the capacity per stack of HBM2 is up to eight times greater than HBM1. AMD says HBM stacks will continue to get bigger, offer higher performance, and scale in a power-efficient fashion, too, so it’ll remain an appealing memory technology for future products.
HBM2 is only one potential step in a hierarchy of new caches where data to feed a Vega GPU could reside, however. The high-bandwidth cache controller has the ability to address a pool of memory up to 512TB in size, and that pool could potentially encompass other memory locations like NAND flash (as seen on the Radeon Pro SSG), system memory, and even network-attached storage. To show the HBCC in action, AMD demonstrated a Vega GPU displaying a photorealistic representation of a luxurious bedroom produced from hundreds of gigabytes of data using its ProRender backend.
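To put that 512TB figure in perspective, here is a quick back-of-the-envelope sketch. It assumes "512TB" means binary terabytes, and the 8 GiB local pool is purely a hypothetical comparison point; the function names are ours, not anything from AMD.

```python
# Assumption: "512TB" means 512 binary terabytes (512 * 2**40 bytes).
POOL_BYTES = 512 * 2**40


def address_bits(pool_bytes: int) -> int:
    """Bits needed to give every byte in the pool a unique address."""
    return (pool_bytes - 1).bit_length()


def pool_multiple(pool_bytes: int, local_gib: int) -> int:
    """How many times larger the pool is than a local memory of local_gib GiB."""
    return pool_bytes // (local_gib * 2**30)


print(address_bits(POOL_BYTES))      # 49
print(pool_multiple(POOL_BYTES, 8))  # 65536 (hypothetical 8 GiB local pool)
```

Under those assumptions, the addressable pool works out to a 49-bit address space, tens of thousands of times larger than the memory that sits on a typical card today.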
Geometry processing gets more flexible
Today’s Radeon GPUs retain fixed-function geometry-processing hardware in their front ends, but AMD has observed that more and more developers have been doing geometry processing in compute shaders. Koduri notes that some of today’s games can have extremely geometrically complex scenes. He cited parts of the Golem City section of Deus Ex: Mankind Divided (which we incidentally use in our graphics-card testing) to prove his point.
The middle portion of that benchmark run has over 220 million polygons, according to AMD, but only about 22 million might need to be shaded for what the gamer actually sees in the final frame. Figuring out which of those polys need to be shaded is a hugely complicated task, and achieving better performance for that problem is another focus of the Vega architecture.
To accommodate developers’ increasing appetite for migrating geometry work to compute shaders, AMD is introducing a more programmable geometry pipeline stage in Vega that will run a new type of shader it calls a primitive shader. According to AMD corporate fellow Mike Mantor, primitive shaders will have “the same access that a compute shader would have to coordinate how you bring work into the shader.” Mantor also says that primitive shaders will give developers access to all the data they need to effectively process geometry, as well.
AMD thinks this sort of access will ultimately allow primitives to be discarded at a very high rate. Interestingly, Mantor expects that programmable pipeline stages like this one will ultimately replace fixed-function hardware on the graphics card. For now, the primitive shader is the next step in that direction.
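AMD hasn't published a programming model for primitive shaders, so we can't show what one looks like. As a rough illustration of the kind of early triangle rejection that developers already do in compute shaders, though, here's a CPU-side sketch in Python. The function names, the counter-clockwise winding convention, and the screen-space inputs are all our assumptions.

```python
def signed_area(tri):
    """Signed area of a 2D triangle; positive for counter-clockwise winding."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return 0.5 * ((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))


def cull(triangles, width, height):
    """Keep only triangles that could contribute visible pixels.

    Assumes screen-space vertices, a viewport of [0, width) x [0, height),
    and counter-clockwise front faces (all illustrative conventions).
    """
    kept = []
    for tri in triangles:
        # Back-facing and zero-area triangles can never be seen.
        if signed_area(tri) <= 0:
            continue
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Triangles whose bounding box misses the viewport are discarded too.
        if max(xs) < 0 or min(xs) >= width or max(ys) < 0 or min(ys) >= height:
            continue
        kept.append(tri)
    return kept


scene = [
    ((0, 0), (10, 0), (0, 10)),            # front-facing and on screen
    ((0, 0), (0, 10), (10, 0)),            # back-facing
    ((200, 200), (210, 200), (200, 210)),  # off screen
    ((0, 0), (5, 5), (10, 10)),            # degenerate (zero area)
]
print(len(cull(scene, 100, 100)))  # 1
```

Real engines layer more tests on top of these (frustum, occlusion, and small-primitive culling, for example), but the principle is the same: throw work away before it ever reaches the pixel shaders.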
To effectively manage the work generated by this new geometry-pipeline stage, Vega’s front end will contain a new “intelligent workgroup distributor” that can consider the various draw calls and instances that a graphics workload generates, group that work, and distribute it to the right programmable stage of the pipeline for better throughput. AMD says this load-balancing design addresses workload-distribution shortcomings in prior GCN versions that were highlighted by console developers pushing its hardware at a low level.
Higher clocks and better throughput with the NCU
To achieve higher performance in certain workloads, Vega will be the first AMD GPU with support for packed math operations. Some tasks, like deep learning, don’t need the full 32 bits that GPUs offer for single-precision data types. Prior AMD GPUs, including Fiji and Polaris, have included native support for 16-bit data types in order to benefit from more efficient memory and register file usage, but the GCN ALUs in those chips couldn’t produce the potential doubling of throughput that some Nvidia chips, like the GP100 GPU on the Tesla P100 accelerator, enjoy.
All that changes with Vega and its next-generation compute unit design, called the NCU. (What the N really stands for remains a mystery.) The NCU will be able to perform packed math, allowing it to achieve up to 512 eight-bit ops per clock, 256 16-bit ops per clock, or 128 32-bit ops per clock. These numbers rely on the fact that a GCN ALU can perform up to two operations per cycle in the form of a fused multiply-add, of course.
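Those per-clock figures fall out of some simple arithmetic. The sketch below assumes 64 ALU lanes per compute unit, as in earlier GCN designs; AMD hasn't confirmed the NCU's lane count, and the names here are ours.

```python
# Assumption: 64 ALU lanes per compute unit, as in earlier GCN designs.
LANES_PER_CU = 64
OPS_PER_FMA = 2  # a fused multiply-add counts as two operations


def ops_per_clock(bits: int, lanes: int = LANES_PER_CU) -> int:
    """Peak ops per clock for one compute unit at a given operand width."""
    if bits not in (8, 16, 32):
        raise ValueError("only 8-, 16-, and 32-bit operands are modeled")
    packing = 32 // bits  # values packed into one 32-bit lane
    return lanes * OPS_PER_FMA * packing


for bits in (32, 16, 8):
    print(f"{bits}-bit: {ops_per_clock(bits)} ops/clock")
# 32-bit: 128 ops/clock
# 16-bit: 256 ops/clock
# 8-bit: 512 ops/clock
```

Halving the operand width doubles the number of values that fit in each 32-bit lane, which is where the doubling at each step comes from.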
AMD also emphasizes that the single-threaded performance of the compute unit remains a critical part of its engineering efforts, and part of that work has been optimizing Vega’s circuitry for its target process to push clock speeds up while maintaining or lowering voltages. AMD isn’t talking about the clock speeds it expects Vega to hit yet, but corporate fellow Mantor says that one source of IPC improvement is that Vega’s enlarged instruction buffer lets operations run “continuous at rate,” especially with three-operand instructions.
More efficient shading with the draw-stream binning rasterizer
AMD is significantly overhauling Vega’s pixel-shading approach, as well. The next-generation pixel engine on Vega incorporates what AMD calls a “draw-stream binning rasterizer,” or DSBR from here on out. The company describes this rasterizer as an essentially tile-based approach to rendering that lets the GPU more efficiently shade pixels, especially those with extremely complex depth buffers. The fundamental idea of this rasterizer is to perform a fetch for overlapping primitives only once, and to shade those primitives only once. This approach is claimed to both improve performance and save power, and the company says it’s especially well-suited to performing deferred rendering.
The DSBR can schedule work in what AMD describes as a “cache-aware” fashion, so it’ll try to do as much work as possible for a given “bundle” of objects in a scene that relate to the data in a cache before the chip proceeds to flush the cache and fetch more data. The company says that a given pixel in a scene with many overlapping objects might be visited many times during the shading process, and that cache-aware approach makes doing that work more efficient. The DSBR also lets the GPU discover pixels in complex overlapping geometry that don’t need to be shaded, and it can do that discovery no matter what order that overlapping geometry arrives in. By avoiding shading pixels that won’t be visible in the final scene, Vega’s pixel engine further improves efficiency.
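AMD hasn't detailed how the DSBR works internally, so the following is only a toy model of the general binning idea: sort primitives into screen tiles, work out which primitive is nearest at each pixel within a tile, and then shade each visible pixel exactly once. The tile size, the use of axis-aligned rectangles as primitives, and every name in the code are our own assumptions.

```python
from collections import defaultdict

TILE = 8  # tile edge in pixels (hypothetical)


def bin_primitives(prims):
    """Map each tile coordinate to the primitives that overlap it."""
    bins = defaultdict(list)
    for prim in prims:
        x0, y0, x1, y1 = prim["rect"]  # half-open: [x0, x1) x [y0, y1)
        for ty in range(y0 // TILE, (y1 - 1) // TILE + 1):
            for tx in range(x0 // TILE, (x1 - 1) // TILE + 1):
                bins[(tx, ty)].append(prim)
    return bins


def render(prims):
    """Return (framebuffer, shade_count), shading each covered pixel once."""
    frame, shaded = {}, 0
    for (tx, ty), tile_prims in bin_primitives(prims).items():
        nearest = {}  # pixel -> closest primitive seen so far in this tile
        for prim in tile_prims:
            x0, y0, x1, y1 = prim["rect"]
            for y in range(max(y0, ty * TILE), min(y1, (ty + 1) * TILE)):
                for x in range(max(x0, tx * TILE), min(x1, (tx + 1) * TILE)):
                    best = nearest.get((x, y))
                    if best is None or prim["depth"] < best["depth"]:
                        nearest[(x, y)] = prim
        # Visibility is resolved, so "shade" only the surviving primitive.
        for pixel, prim in nearest.items():
            frame[pixel] = prim["color"]
            shaded += 1
    return frame, shaded


red = {"rect": (0, 0, 16, 16), "depth": 0.5, "color": "red"}
blue = {"rect": (0, 0, 16, 16), "depth": 0.2, "color": "blue"}
frame, shaded = render([red, blue])
print(shaded)  # 256
```

In this example, two fully overlapping 16x16 quads cost 256 shading operations instead of the 512 a naive back-to-front pass would need, and the result is the same regardless of submission order.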
To help the DSBR do its thing, AMD is fundamentally altering the availability of Vega’s L2 cache to the pixel engine in its shader clusters. In past AMD architectures, memory accesses for textures and pixels were non-coherent operations, requiring lots of data movement for operations like rendering to a texture and then writing that texture out to pixels later in the rendering pipeline. AMD also says this incoherency raised major synchronization and driver-programming challenges.
To cure this headache, Vega’s render back-ends now enjoy access to the chip’s L2 cache in the same way that earlier stages in the pipeline do. This change allows more data to remain in the chip’s L2 cache instead of being flushed out and brought back from main memory when it’s needed again, and it’s another improvement that can help deferred-rendering techniques.
The draw-stream binning rasterizer won’t always be the rasterization approach that a Vega GPU will use. Instead, it’s meant to complement the existing approaches possible on today’s Radeons. AMD says that the DSBR is “highly dynamic and state-based,” and that the feature is just another path through the hardware that can be used to improve rendering performance. By using data in a cache-aware fashion and only moving data when it has to, though, AMD thinks that this rasterizer will help performance in situations where the graphics memory (or high-bandwidth cache) becomes a bottleneck, and it’ll also save power even when the path to memory isn’t saturated.
By minimizing data movement in these ways, AMD says the DSBR is its next thrust at reducing memory bandwidth requirements. It’s the latest in a series of solutions to the problem of memory-bandwidth efficiency that AMD has been working on across many generations of its products. In the past, the company has implemented better delta color compression algorithms, fast Z clear, and hierarchical-Z occlusion detection to reduce pressure on memory bandwidth.
So how’s it play?
High-level architectural discussions are one thing, but everybody wants to know what Vega silicon will look like in shipping products. AMD repeatedly demurred on actual implementation details of its Vega chips last month, but we’ve speculated that the Vega-powered Radeon Instinct MI25 would pack 4096 stream processors running at around 1500 MHz, based on its 25 TFLOPS of FP16 throughput.
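For the curious, here's the arithmetic behind that guess as a short sketch. The stream processor count and clock are our speculation, not AMD's numbers, and the function name is ours.

```python
def peak_tflops(stream_processors: int, clock_mhz: float, packing: int = 1) -> float:
    """Peak throughput in TFLOPS, counting a fused multiply-add as two ops.

    packing is the number of values per 32-bit lane (2 for packed FP16).
    """
    ops_per_clock = stream_processors * 2 * packing
    return ops_per_clock * clock_mhz * 1e6 / 1e12


# Speculative configuration: 4096 stream processors at 1500 MHz.
print(round(peak_tflops(4096, 1500, packing=2), 3))  # 24.576 (FP16)
print(round(peak_tflops(4096, 1500), 3))             # 12.288 (FP32)
```

A clock of 1500 MHz lands just shy of 25 TFLOPS, so the real figure could be a touch higher if the stream processor count is right.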
AMD did have an early piece of Vega silicon running in its demo room for the press to play with. In the Argent D’Nur level of Doom’s arcade mode, that chip was producing anywhere between 60 and 70 FPS at 4K on Ultra settings. Surprisingly, the demo attendant let me turn on Doom’s revealing “nightmare” performance metrics, and I saw a maximum frame time of about 24.8 ms after large explosions. I also noted that the chip had 8GB of memory on board, though of course we couldn’t say what type of memory was being used.
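To relate those frame times to frame rates, a quick conversion helps. The helper names below are ours.

```python
def fps_from_frame_time(ms: float) -> float:
    """Instantaneous frame rate implied by a single frame time in milliseconds."""
    return 1000.0 / ms


def frame_time_from_fps(fps: float) -> float:
    """Average frame time in milliseconds implied by a frame rate."""
    return 1000.0 / fps


print(round(fps_from_frame_time(24.8), 1))  # 40.3
print(round(frame_time_from_fps(60), 1))    # 16.7
print(round(frame_time_from_fps(70), 1))    # 14.3
```

In other words, the worst frames we saw were equivalent to a momentary dip to about 40 FPS, against typical frame times of roughly 14 to 17 ms.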
Though that performance might not sound so impressive, it’s worth noting that all of the demo system’s vents (including the graphics card’s exhaust) were taped up, and it’s quite likely the chip was sweating to death in its own waste heat. By my rough estimate, that puts the early Vega card somewhere between the performance of a GTX 1070 and a GTX 1080 in Doom. If that’s an indication of where consumer Vega hardware will end up, we could have a competitive year to look forward to from AMD. Until we learn more about Vega, though, we’re left with only this tantalizing taste ahead of the card’s first-half-of-2017 release.