Pascal architectural improvements

Getting asynchronous
Anybody attuned to the enthusiast hardware scene over the past few months has doubtless heard a ton about graphics cards' asynchronous compute capabilities, namely Radeons' prowess and GeForces' apparent shortcomings on that point. However much stock you place in this argument, Pascal appears to offer improved asynchronous compute capability versus Maxwell chips.

First, we should talk a little bit about the characteristics of an asynchronous compute workload. Nvidia suggests that an asynchronous task might overlap with another task running on the GPU at the same time, or it might need to interrupt a task that's running in order to complete within a given time window.


Source: Nvidia

One example of such a compute task is asynchronous timewarp, a VR rendering method that uses head-position data to slightly reproject a frame before sending it out to the VR headset. Nvidia notes that timewarp often needs to interrupt—or preempt—a task in progress to execute on time. On the other hand, less time-critical workloads, like physics or audio calculations, might run concurrently (but asynchronously) with rendering tasks. Nvidia says Pascal chips support two major forms of asynchronous compute execution: dynamic load-balancing for overlapping workloads, and pixel-level preemption for time-sensitive ones.

It's here that we actually learn a thing or two about what Maxwell could do in this regard—perhaps even in more depth than we ever did while those chips were the hottest thing on the market. Nvidia says Maxwell provided overlapping workloads with a static partitioning of resources: one partition for graphics tasks, and another for compute. The company says this approach was effective when the partitioning scheme matched the resources needed by both graphics and compute workloads. Maxwell's static partitioning has a downside, though: mess up that initial resource allocation, and a graphics task can complete before a compute task, causing part of the GPU to go idle while it waits for the compute task to complete and for new work to be dispatched.

It might seem obvious to say so, but like any modern chip, GPUs want all of their pipelines filled as much of the time as possible in order to extract maximum performance. Idle resources are bad news. Nvidia admits as much in its documentation, noting that a long-running task in one resource partition can drag performance down far enough to cancel out whatever benefit running the tasks concurrently might have offered. Either way, if you were wondering what exactly was going on with Maxwell and async compute way back when, it appears this is your answer.


Source: Nvidia

Pascal looks like it's much better provisioned to handle asynchronous workloads. For overlapping tasks, the chip can now perform what Nvidia calls dynamic load balancing. Unlike the rather coarse-sounding partitioning method outlined above, Pascal chips can dispatch work to idle parts of the GPU on the fly, potentially keeping more of the chip at work and improving performance.
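To put rough numbers on the difference, here is a toy model in Python. It is only a sketch under loud assumptions: the 20-unit GPU, the work amounts, and the function names are all hypothetical, and the dynamic case assumes perfect rebalancing with no overhead, which real hardware won't achieve.

```python
def static_partition_time(graphics_work, compute_work, total_units, graphics_units):
    """Finish time and idle unit-time when units are split once and never rebalanced."""
    compute_units = total_units - graphics_units
    graphics_time = graphics_work / graphics_units
    compute_time = compute_work / compute_units
    finish = max(graphics_time, compute_time)
    # Unit-time wasted by whichever partition ran out of work first.
    idle = ((finish - graphics_time) * graphics_units
            + (finish - compute_time) * compute_units)
    return finish, idle


def dynamic_balance_time(graphics_work, compute_work, total_units):
    """Idealized finish time when any idle unit can pick up leftover work."""
    return (graphics_work + compute_work) / total_units


# Hypothetical workload: 600 units of graphics work, 400 of compute, 20 GPU units.
print(static_partition_time(600, 400, 20, 12))  # split matches the workload
print(static_partition_time(600, 400, 20, 16))  # split favors graphics too much
print(dynamic_balance_time(600, 400, 20))
```

With a well-chosen 12/8 split, the static scheme finishes as fast as the idealized dynamic one. With a 16/4 split, the graphics partition finishes early and sits idle while four units grind through the compute task, and the whole job takes twice as long.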

Nvidia doesn't go into the same depth about Maxwell's preemption capability as it does about that architecture's methods for handling overlapping workloads, but given friend-of-TR David Kanter's now-infamous comment about preemption on Maxwell being "potentially catastrophic," perhaps we can guess why. Pascal's preemption abilities seem to be much better, though. Let's talk about them.


Source: Nvidia

For one, Nvidia claims Pascal is the first GPU architecture to implement preemption at the pixel level. The company says each of the chip's graphics units can keep track of its intermediate state on a work unit. That fine-grained awareness lets those resources quickly save state, service the preemption request, and pick up work where they left off once the high-priority task is complete. Once the GPU is finished with the work that it can't save and unload, Nvidia says that task-switching with preemption can finish in under 100 microseconds. Compute tasks also benefit from the finer-grained preemption capabilities of Pascal cards. If a CUDA workload needs to preempt another running compute task, that interruption can occur at the instruction level.
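Why granularity matters for something like timewarp can be sketched with a simple latency model. Everything here is an illustrative assumption rather than a measurement: the function names, the 4000-microsecond "coarse" boundary, the 5-microsecond "fine" boundary, the task length, and the deadline are all made up. Only the 100-microsecond switch time comes from Nvidia's claim, and it is used as an upper bound.

```python
import math


def wait_for_boundary_us(request_us, granularity_us):
    """Microseconds between a preemption request and the next point
    at which the running task can be interrupted."""
    next_boundary = math.ceil(request_us / granularity_us) * granularity_us
    return next_boundary - request_us


def makes_deadline(request_us, granularity_us, switch_us, task_us, deadline_us):
    """True if the urgent task finishes on or before its deadline."""
    start = request_us + wait_for_boundary_us(request_us, granularity_us) + switch_us
    return start + task_us <= deadline_us


# Hypothetical scenario: request arrives 1002 us into a frame, the urgent task
# needs 500 us, and it must be done by the 2000 us mark.
SWITCH_US = 100  # Nvidia's quoted upper bound for the switch itself
print(makes_deadline(1002, 4000, SWITCH_US, 500, 2000))  # coarse boundaries
print(makes_deadline(1002, 5, SWITCH_US, 500, 2000))     # fine boundaries
```

In this model the switch cost is the same in both cases. What sinks the coarse case is the wait for the running work to reach a point where it can be interrupted at all.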

Simultaneous multi-projection, single-pass stereo, and VR
One of the biggest architectural changes in Pascal is an addition to the Polymorph Engine, the geometry processor that first arrived in Fermi GPUs. That processor now includes a block called the Simultaneous Multi-Projection Engine, or SMPE. This hardware can take geometry information from the upstream graphics pipeline and create up to 16 separate pre-configured projections of a scene across up to two different camera positions. Previously, a developer would have had to generate geometry once for every projection desired, a prohibitively performance-intensive task.


Source: Nvidia

All that jargon essentially means that in situations where a single projection might have caused weird-looking perspective errors, like one might see with a three-monitor surround setup, Pascal can now account for the angle of those displays (with help from the application programmer) and create the illusion of a continuous space across all three monitors with no perspective problems.
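A little trigonometry shows where those perspective errors come from. On a flat projection plane, a ray at angle φ from the projection axis lands at a position proportional to tan(φ), so the image is magnified horizontally by 1/cos²(φ) relative to the center. The sketch below compares one flat projection against a projection whose axis is rotated toward a side monitor. The function name and the 45-degree monitor angle are hypothetical, chosen only for illustration.

```python
import math


def stretch_factor(ray_deg, axis_deg=0.0):
    """Horizontal magnification of a planar projection, relative to its own
    center, for a ray at `ray_deg` when the projection axis points at `axis_deg`."""
    off_axis = math.radians(ray_deg - axis_deg)
    return 1.0 / math.cos(off_axis) ** 2


# An object 60 degrees off to the side of the viewer:
print(round(stretch_factor(60.0), 3))        # single flat projection
print(round(stretch_factor(60.0, 45.0), 3))  # projection aimed at a side monitor
```

With one flat projection, an object 60 degrees off-center is drawn about four times wider than it would be at the center. That stretch only looks right if the side monitor really does lie in the same plane as the middle one. Give the side monitor its own projection, and the same object is just 15 degrees off that projection's axis, where the stretch is a few percent.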

Surround gaming is just one application for this technology, though—it also has major implications for VR performance. You'll remember that the SMPE can create projections based on up to two different camera positions. Humans have two eyes, and if we put on a VR headset, we end up looking at two different screens with slightly different views of a scene. Before Pascal hit the market, Nvidia says graphics cards had to render for each eye's viewpoint separately, resulting in twice as much work.


An example of how the same scene needs to look for different eyes in VR. Source: Nvidia

With Pascal, however, SMPE enables a new capability called Single-Pass Stereo rendering for VR headsets. As Nvidia puts it, Single-Pass Stereo lets an application submit its vertex work just once. The graphics card will then produce two positions for each vertex and match up each one with the correct eye. This capability essentially cuts the geometry work necessary to render for a VR headset in half, presuming a developer takes advantage of it.
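The idea can be mimicked on the CPU with a few lines of Python. This is a conceptual sketch only, not how the hardware or any Nvidia API actually works. The helper names, the bare-bones pinhole projection, and the eye positions are all assumptions made for the example.

```python
def project(x, y, z):
    """Bare-bones pinhole projection; z is depth in front of the eye (z > 0)."""
    return (x / z, y / z)


def two_pass_stereo(vertices, half_ipd):
    """Traditional approach: run the whole vertex stream once per eye."""
    invocations = 0
    views = []
    for eye_x in (-half_ipd, +half_ipd):  # left eye, then right eye
        view = []
        for x, y, z in vertices:
            invocations += 1  # full vertex work repeated for each eye
            view.append(project(x - eye_x, y, z))
        views.append(view)
    return views[0], views[1], invocations


def single_pass_stereo(vertices, half_ipd):
    """Single-pass idea: touch each vertex once, emit one position per eye."""
    invocations = 0
    left, right = [], []
    for x, y, z in vertices:
        invocations += 1  # shared vertex work done once
        left.append(project(x - (-half_ipd), y, z))
        right.append(project(x - (+half_ipd), y, z))
    return left, right, invocations


verts = [(0.0, 0.0, 2.0), (1.0, 1.0, 4.0)]  # hypothetical scene
print(two_pass_stereo(verts, 0.032)[2], single_pass_stereo(verts, 0.032)[2])
```

Both functions produce identical left-eye and right-eye positions. The single-pass version just walks the vertex list half as many times. Note that only the per-vertex work is shared in this sketch; each eye's image still has to be filled in separately.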


An example VR scene, before and after traditional post-processing for a VR headset display. Source: Nvidia

SMPE and its effects on VR don't end there, however. The technology also allows developers to take advantage of a feature called Lens Matched Shading, or LMS for short. Prior to Pascal, graphics cards had to render the first pass of an image for a VR viewport assuming a flat projection. Because VR headsets rely on distorting lenses to create a natural-looking result, however, a pre-distorted image then has to be produced from the flat initial rendering to create a final scene that looks correct through the headset. This step throws away data. Nvidia says that a traditional graphics card might start with a 2.1MP image for a VR scene, but after post-processing, that image might be only 1.1MP. That's a huge amount of extra work for pixels that are just going to be discarded.


An example of Lens Matched Shading in action. Source: Nvidia

LMS, on the other hand, takes advantage of the SMPE to render a scene more efficiently. It first slices the viewport into quadrants, then renders each one with its own projection that approximates the part of the lens through which that quadrant will eventually be viewed. With this multi-projection rendering, the preliminary image in Nvidia's example is just 1.4MP before it goes through the final post-processing step, versus 2.1MP for the flat rendering. That's a major increase in efficiency.
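The arithmetic behind Nvidia's example is quick to check. The megapixel figures below are the ones quoted above. The function names are ours, and the calculation assumes the final post-processed image is the same 1.1MP in both cases, which Nvidia's example implies but the figures don't state outright.

```python
FLAT_MP = 2.1   # first-pass image with a single flat projection
LMS_MP = 1.4    # first-pass image with Lens Matched Shading
FINAL_MP = 1.1  # image actually sent to the headset (assumed equal in both cases)


def discarded_fraction(rendered_mp, final_mp):
    """Share of rendered pixels that never reach the display."""
    return (rendered_mp - final_mp) / rendered_mp


def shading_reduction(before_mp, after_mp):
    """Share of first-pass pixel work saved by rendering fewer pixels."""
    return 1.0 - after_mp / before_mp


print(f"flat: {discarded_fraction(FLAT_MP, FINAL_MP):.1%} of pixels discarded")
print(f"LMS:  {discarded_fraction(LMS_MP, FINAL_MP):.1%} of pixels discarded")
print(f"LMS renders {shading_reduction(FLAT_MP, LMS_MP):.1%} fewer pixels up front")
```

By these numbers, the flat approach throws away nearly half of what it renders, LMS throws away closer to a fifth, and the first pass shrinks by about a third.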