Radeon, refined
Under the hood, Polaris is a more efficient version of the GCN architecture we've known since its inception on the Radeon HD 7970. This fourth generation of GCN incorporates a number of small improvements.


Source: AMD

The geometry engines in Polaris now feature a stage called the Primitive Discard Accelerator that can remove zero-area (or "degenerate") triangles, as well as polygons that aren't being sampled, early in the graphics pipeline. AMD says this feature improves performance in workloads that combine lots of triangles (like highly tessellated scenes) and multi-sampled anti-aliasing. Heavy tessellation has traditionally been a weakness for GCN cards, so it'll be interesting to see what effect this feature has on Polaris' performance. Each Polaris geometry engine also gets an index cache for storing what AMD describes as "small instanced geometry." This cache reduces the need to move data around the chip, which lowers internal bandwidth requirements and improves throughput.
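To make the discard idea concrete, here's a minimal software sketch of the two tests described above: throw out triangles with zero area, and throw out triangles that don't cover any sample point. This is only an illustration under our own assumptions. The function names and the 2D floating-point setup are hypothetical, and AMD hasn't said the hardware works this way internally.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def edge_function(a: Point, b: Point, p: Point) -> float:
    """Cross product of (b - a) and (p - a): twice the signed area of a, b, p."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (p[0] - a[0]) * (b[1] - a[1])


def covers_any_sample(tri: Triangle, samples: List[Point]) -> bool:
    """True if at least one sample point lies inside or on the triangle."""
    a, b, c = tri
    for p in samples:
        d1 = edge_function(a, b, p)
        d2 = edge_function(b, c, p)
        d3 = edge_function(c, a, p)
        has_neg = d1 < 0 or d2 < 0 or d3 < 0
        has_pos = d1 > 0 or d2 > 0 or d3 > 0
        # The point is inside when all three edge tests agree in sign.
        if not (has_neg and has_pos):
            return True
    return False


def discard_primitives(tris: List[Triangle], samples: List[Point]) -> List[Triangle]:
    """Keep only triangles that could contribute to the final image."""
    kept = []
    for tri in tris:
        if edge_function(*tri) == 0.0:
            continue  # zero-area ("degenerate") triangle
        if not covers_any_sample(tri, samples):
            continue  # too small or misplaced to hit any sample point
        kept.append(tri)
    return kept
```

With more sample points per pixel, as with multi-sampled anti-aliasing, the coverage test has more work to do per triangle, which is one plausible reason early discard pays off most when MSAA and dense geometry are combined.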

Polaris 10's front end is getting a new pair of programmable units that AMD calls "Hardware Schedulers," or HWSes for short, alongside its four asynchronous compute engines. These blocks perform a variety of scheduling tasks for asynchronous compute workloads. They can set up real-time and prioritized task queues for audio and VR processing, manage concurrent tasks and process scheduling, and perform load-balancing between compute and graphics workloads. Since the HWSes can perform this work on the chip, they reduce CPU driver overhead. Because they're programmable, AMD says it can update the capabilities of each HWS with new microcode, too.


Compute Unit Reservation in action. Source: AMD

One example use of the HWS duo involves audio processing for VR. To be sure that a given audio task will complete within a certain time frame, a developer can use a new feature called Compute Unit Reservation to request that a specific number of compute units be dedicated to a specific task queue. The HWSes ensure that the proper resources are allocated for the job, which AMD calls "spatial management." These blocks can also perform what AMD calls "temporal management." An example of such a task is managing the Quick Response Queue that the company specifically made for handling VR-related workloads like asynchronous time warp for the Oculus Rift.
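As a rough mental model of the spatial side of that job, the toy class below sets aside a fixed number of compute units for a named queue and refuses requests that would oversubscribe the chip. Everything here is hypothetical: the class, its methods, and the 36-unit total are example choices of ours, not AMD's API or a description of the HWS microcode.

```python
class ComputeUnitPool:
    """Toy model of spatial management: partitions CUs between task queues."""

    def __init__(self, total_cus: int) -> None:
        self.total_cus = total_cus
        self.reserved = {}  # queue name -> number of CUs set aside for it

    def free_cus(self) -> int:
        """CUs not reserved for any queue, available to general work."""
        return self.total_cus - sum(self.reserved.values())

    def reserve(self, queue: str, count: int) -> bool:
        """Dedicate `count` CUs to `queue`; fail rather than oversubscribe."""
        if count <= 0 or count > self.free_cus():
            return False
        self.reserved[queue] = self.reserved.get(queue, 0) + count
        return True

    def release(self, queue: str) -> None:
        """Return a queue's reserved CUs to the general pool."""
        self.reserved.pop(queue, None)


# Example: set aside four CUs for a latency-sensitive audio queue.
pool = ComputeUnitPool(total_cus=36)  # example value only
pool.reserve("vr_audio", 4)
```

The point of the reservation is predictability: the audio queue never has to wait for graphics work to vacate those units, at the cost of leaving fewer units for everything else.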


Source: AMD

The stream processors in each GCN compute unit are getting some new tricks in Polaris, too. If many wavefronts (AMD's name for groups of threads) of the same workload are set to be processed, a new feature called instruction prefetch lets executing wavefronts fetch instructions for subsequent ones. The company says this approach makes its instruction caching more efficient. Polaris CUs also get a larger per-wave instruction buffer, a feature that's claimed to increase single-threaded performance within the CU. Polaris can group client L2 cache requests, too, so it can fetch data from that cache more efficiently. 

In addition to a larger L2 cache that allows more data to remain on the chip, Polaris has improved delta color compression (or DCC) capabilities that allow it to compress color data at 2:1, 4:1, or 8:1 ratios. These compression methods should allow the chip to enjoy greater effective memory bandwidth and higher efficiency. Polaris' memory controller supports 8 GT/s GDDR5 DRAMs for up to 256 GB/s of memory bandwidth. Instead of moving to a new memory technology, AMD says it's getting more life out of GDDR5 mostly thanks to Polaris' improved DCC capabilities. Between DCC, the expanded L2 cache, and improved cache access methods, AMD claims it can reduce the power required by Polaris' memory interface by up to 58%, too.
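The bandwidth figure above is easy to check by hand. The sketch below does the arithmetic, assuming a 256-bit memory bus (the width at which 8 GT/s works out to 256 GB/s). The second function is a simple model of how compression could stretch that bandwidth. Its `compressible_fraction` parameter is a hypothetical knob of ours, since AMD doesn't say how much of a typical frame actually compresses at each ratio.

```python
def peak_bandwidth_gb_s(transfer_rate_gt_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth: transfers per second times bytes moved per transfer."""
    return transfer_rate_gt_s * bus_width_bits / 8


def effective_bandwidth_gb_s(raw_gb_s: float, compressible_fraction: float,
                             ratio: float) -> float:
    """Bandwidth as seen by the GPU if part of the traffic is compressed.

    Each logical byte costs (1 - f) + f / ratio physical bytes on the bus,
    where f is the fraction of traffic that compresses at `ratio`:1.
    """
    if not 0.0 <= compressible_fraction <= 1.0 or ratio < 1.0:
        raise ValueError("fraction must be in [0, 1] and ratio must be >= 1")
    physical_per_logical = (1.0 - compressible_fraction) + compressible_fraction / ratio
    return raw_gb_s / physical_per_logical


raw = peak_bandwidth_gb_s(8.0, 256)  # 256.0 GB/s, assuming a 256-bit bus
# Hypothetical case: half of all color traffic compresses at 2:1.
boosted = effective_bandwidth_gb_s(raw, compressible_fraction=0.5, ratio=2.0)
```

Even modest assumptions move the needle here, which fits AMD's argument that better compression lets GDDR5 stay viable for another generation.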

Getting smart about power usage
Along with the move to the 14-nm FinFET process itself, AMD is deploying several new monitoring technologies on Polaris that are meant to help each chip perform at its best. The company is making this move in response to a basic problem of power delivery to the chip itself: variations in input voltage as large as 10% to 15% require an increase in the average voltage sent to the chip to compensate. AMD says this safety margin wastes a lot of power, so it's responding with a new technology called adaptive voltage and frequency scaling, or AVFS.


Source: AMD

When the company designed Polaris, it borrowed a few pages from its CPU design team. Each Polaris GPU now has embedded frequency sensors on its die that work in concert with its temperature and power sensors. If a chip can run at lower voltages to achieve a given frequency on the DVFS curve, for example, this tech will let it do so and allow it to save power at the same time. The chip can also quickly adjust its frequency in response to voltage droops instead of running within a safety margin at all times, extracting 5%-10% more performance on average.
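To see why trimming a voltage safety margin matters so much, recall the usual first-order approximation for CMOS dynamic power: it scales with the square of voltage times frequency. The sketch below applies that rule of thumb. It ignores leakage and everything else AMD's real power model would include, and the 10% guard band is just an example value taken from the range quoted earlier.

```python
def relative_dynamic_power(voltage_scale: float, frequency_scale: float = 1.0) -> float:
    """First-order CMOS model: dynamic power is proportional to V^2 * f."""
    return voltage_scale ** 2 * frequency_scale


# Example: a chip that carried a 10% voltage guard band and can now drop it
# while holding the same clock speed.
guard_band = 0.10
without_margin = relative_dynamic_power(1.0 / (1.0 + guard_band))
savings = 1.0 - without_margin  # fraction of dynamic power saved
```

Under those assumptions, removing a 10% margin saves roughly 17% of dynamic power at the same frequency, which is why it's worth spending die area on sensors to find out how much margin each chip really needs.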


Source: AMD

That on-chip monitoring technology also allows the GPU to analyze the input voltage it's receiving from its host system at boot time and compensate for any differences between that input and the power characteristics of the test equipment on which the chip was initially binned. The chip can then use this information to adjust its voltage regulators to deliver the same operating environment it saw on the test bench, improving efficiency.

Finally, AVFS lets Polaris chips compensate for the effects of transistor aging and the aging of other components in the system. Polaris' AVFS modules have aging-sensitive circuitry inside that lets the chip compensate for any degradation in performance as it gets older. That same boot-time monitoring technology can determine when other components in a system (presumably, the power supply) are no longer as young and spry as they once were, too. By self-calibrating and adapting to this aging, AMD says the chip will offer "more robust operation" over time while delivering better performance out of the box.


Source: AMD

Getting down to the real nitty-gritty, AMD is improving the design of the multi-bit flip-flop circuits it uses in Polaris. The company says there are about 21 million of these circuits on Polaris 10, and they account for 15% of the chip's TDP. By moving to a quad multi-bit flip-flop, AMD says it reduced Polaris' TDP by 4%-5%.
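Those two numbers hint at how much more efficient the new flip-flops are. If we assume the 15% share describes the chip before the change and that the whole 4%-5% TDP reduction comes from the flip-flops alone (both are our assumptions, not AMD's statements), the implied savings within the flip-flops themselves can be worked out like so:

```python
def implied_component_savings(component_share: float, total_savings: float) -> float:
    """Fraction of a component's own power saved, given its share of the
    total and the resulting reduction in the total."""
    if not 0.0 < component_share <= 1.0:
        raise ValueError("component_share must be in (0, 1]")
    return total_savings / component_share


low = implied_component_savings(0.15, 0.04)   # about 0.27
high = implied_component_savings(0.15, 0.05)  # about 0.33
```

In other words, the new design would be cutting flip-flop power by somewhere between a quarter and a third under those assumptions.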

All that's well and good, but we know you really want to see the RX 480 card itself. We won't make you wait any longer.