Higher clocks and better throughput with the NCU
To achieve higher performance in certain workloads, Vega will be the first AMD GPU with support for packed math operations. Certain workloads, like deep learning tasks, don't need the full 32 bits that GPUs offer for single-precision data types. Prior AMD GPUs, including Fiji and Polaris, have included native support for 16-bit data types in order to benefit from more efficient memory and register file usage, but the GCN ALUs in those chips couldn't produce the potential doubling of throughput that some Nvidia chips, like the GP100 GPU on the Tesla P100 accelerator, enjoy.
All that changes with Vega and its next-generation compute unit design, called the NCU. (What the N really stands for remains a mystery.) The NCU will be able to perform packed math, allowing it to achieve up to 512 eight-bit ops per clock, 256 16-bit ops per clock, or 128 32-bit ops per clock. These numbers rely on the fact that a GCN ALU can perform up to two operations per cycle in the form of a fused multiply-add, of course.
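As a back-of-the-envelope check on those figures, the arithmetic works out neatly in Python. The 64-ALU count per compute unit is an assumption carried over from earlier GCN designs, and the lane-splitting model is a simplification of how packed math behaves:

```python
# Rough throughput math for one Vega NCU, assuming 64 ALUs per compute
# unit (carried over from prior GCN designs) and that packed math splits
# each 32-bit lane into two 16-bit or four 8-bit lanes.
ALUS_PER_NCU = 64
OPS_PER_FMA = 2  # a fused multiply-add counts as two operations

def ops_per_clock(bits):
    lanes = 32 // bits  # packed lanes per 32-bit ALU
    return ALUS_PER_NCU * OPS_PER_FMA * lanes

print(ops_per_clock(32))  # 128
print(ops_per_clock(16))  # 256
print(ops_per_clock(8))   # 512
```

Those three results line up with AMD's quoted per-clock figures for 32-bit, 16-bit, and eight-bit operations.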
AMD also emphasizes that the single-threaded performance of the compute unit remains a critical part of its engineering efforts, and part of that effort has gone into optimizing Vega's circuitry for its target process to push clock speeds up while maintaining or lowering voltages. AMD isn't yet talking about the clock speeds it expects Vega to hit, but corporate fellow Mantor says that one source of IPC improvement is that Vega's enlarged instruction buffer lets operations run "continuous at rate," especially with three-operand instructions.
More efficient shading with the draw-stream binning rasterizer
AMD is significantly overhauling Vega's pixel-shading approach, as well. The next-generation pixel engine on Vega incorporates what AMD calls a "draw-stream binning rasterizer," or DSBR from here on out. The company describes this rasterizer as an essentially tile-based approach to rendering that lets the GPU shade pixels more efficiently, especially in scenes with extremely complex depth buffers. The fundamental idea of this rasterizer is to perform a fetch for overlapping primitives only once, and to shade those primitives only once. This approach is claimed to both improve performance and save power, and the company says it's especially well-suited to performing deferred rendering.
The DSBR can schedule work in what AMD describes as a "cache-aware" fashion, so it'll try to do as much work as possible for a given "bundle" of objects in a scene that relate to the data in a cache before the chip proceeds to flush the cache and fetch more data. The company says that a given pixel in a scene with many overlapping objects might be visited many times during the shading process, and that cache-aware approach makes doing that work more efficient. The DSBR also lets the GPU discover pixels in complex overlapping geometry that don't need to be shaded, and it can do that discovery no matter what order that overlapping geometry arrives in. By avoiding shading pixels that won't be visible in the final scene, Vega's pixel engine further improves efficiency.
To help the DSBR do its thing, AMD is fundamentally altering the availability of Vega's L2 cache to the pixel engine in its shader clusters. In past AMD architectures, memory accesses for textures and pixels were non-coherent operations, requiring lots of data movement for operations like rendering to a texture and then writing that texture out to pixels later in the rendering pipeline. AMD also says this incoherency raised major synchronization and driver-programming challenges.
To cure this headache, Vega's render back-ends now enjoy access to the chip's L2 cache in the same way that earlier stages in the pipeline do. This change allows more data to remain in the chip's L2 cache instead of being flushed out and brought back from main memory when it's needed again, and it's another improvement that can help deferred-rendering techniques.
The draw-stream binning rasterizer won't always be the rasterization approach that a Vega GPU will use. Instead, it's meant to complement the existing approaches possible on today's Radeons. AMD says that the DSBR is "highly dynamic and state-based," and that the feature is just another path through the hardware that can be used to improve rendering performance. By using data in a cache-aware fashion and only moving data when it has to, though, AMD thinks that this rasterizer will help performance in situations where the graphics memory (or high-bandwidth cache) becomes a bottleneck, and it'll also save power even when the path to memory isn't saturated.
By minimizing data movement in these ways, AMD says the DSBR is its next thrust at reducing memory bandwidth requirements. It's the latest in a series of solutions to the problem of memory-bandwidth efficiency that AMD has been working on across many generations of its products. In the past, the company has implemented better delta color compression algorithms, fast Z clear, and hierarchical-Z occlusion detection to reduce pressure on memory bandwidth.
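To give a flavor of why delta-style compression helps at all, here's a toy delta-encoding sketch in Python. It's a deliberate simplification, not AMD's actual DCC format: real implementations work on pixel blocks with variable-rate encodings, but the core observation is the same, that neighboring pixel values tend to differ by small amounts:

```python
# Toy illustration of the idea behind delta color compression
# (a simplification, not AMD's actual DCC scheme): neighboring pixels
# are often similar, so storing small differences from the previous
# value takes fewer bits than storing every full-width value.
def delta_encode(pixels):
    # Store the first value verbatim, then successive differences.
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

scanline = [200, 201, 201, 202, 203, 203, 204, 205]  # a smooth gradient
deltas = delta_encode(scanline)
print(deltas)  # [200, 1, 0, 1, 1, 0, 1, 1]
assert delta_decode(deltas) == scanline
# After the first value, every delta fits in a couple of bits,
# versus eight bits for each raw value.
```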