Direct3D 12 on existing hardware
Direct3D 12's lower abstraction level takes the form of a new programming model, and that programming model will be supported on a broad swath of current hardware. AMD has pledged support for all of its current offerings based on the Graphics Core Next architecture, while Nvidia did the same for all of its DirectX 11-class chips (spanning the Fermi, Kepler, and Maxwell architectures). Intel, meanwhile, pledged support for the integrated graphics in its existing Haswell processors (a.k.a. 4th-generation Core).
Beyond the PC, Direct3D 12's new programming model will also be exploitable on the Xbox One console and on Windows Phone handsets. Microsoft hasn't yet said which versions of Windows on the desktop will support Direct3D 12, but it dropped some hints. During the Q&A following the reveal keynote, Microsoft's Gosalia ruled out Windows XP support, but he declined to give a categorical answer about Windows 7.
Sandy's blog post identified four key changes that D3D12 makes to the Direct3D programming model: pipeline state objects, command lists, bundles, and descriptor heaps and tables. These are all about lowering the abstraction level and giving developers better control over the hardware. Those of you well-acquainted with Mantle may find that some of those constructs have a familiar ring to them. That familiarity may be partly due to AMD's role (whether direct or indirect) in Direct3D 12's development, but I suspect it's explainable to a large degree by the fact that both D3D12 and Mantle are low-level graphics APIs closely tailored to the behavior of modern GPUs.
For instance, Mantle's monolithic pipelines roll the graphics pipeline into a single object. Direct3D 12 similarly groups pipeline state into "pipeline state objects," or PSOs. According to Sandy, PSOs work like this:
Direct3D 12 . . . [unifies] much of the pipeline state into immutable pipeline state objects (PSOs), which are finalized on creation. This allows hardware and drivers to immediately convert the PSO into whatever hardware native instructions and state are required to execute GPU work. Which PSO is in use can still be changed dynamically, but to do so the hardware only needs to copy the minimal amount of pre-computed state directly to the hardware registers, rather than computing the hardware state on the fly. This means significantly reduced draw call overhead, and many more draw calls per frame.
Gosalia says PSOs "wrap very efficiently to actual GPU hardware." That's in contrast to Direct3D 11's higher-level representation of the graphics pipeline, which induces higher overhead. "For example," Sandy explains, "many GPUs combine pixel shader and output merger state into a single hardware representation, but because the Direct3D 11 API allows these to be set separately, the driver cannot resolve things until it knows the state is finalized, which isn't until draw time." D3D11's approach increases overhead and limits the number of draw calls that can be issued per frame.
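To make the draw-time-versus-creation-time distinction concrete, here is a minimal C++ sketch. It is a toy model, not Direct3D code: PipelineDesc, DrawTimeContext, PipelineStateObject, and translateToHardware are hypothetical names invented for illustration, and the "translation" step is just a bit-pack standing in for whatever work a real driver does.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical pipeline description; not a real Direct3D type.
struct PipelineDesc {
    std::uint32_t pixelShaderId;
    std::uint32_t blendMode;
};

// Counts how often the (pretend) expensive translation step runs.
static int g_translations = 0;

// Stand-in for driver work: merge separately-set API state into one
// hardware representation.
std::uint64_t translateToHardware(const PipelineDesc& d) {
    ++g_translations;
    return (static_cast<std::uint64_t>(d.pixelShaderId) << 32) | d.blendMode;
}

// D3D11-style model: state is set piece by piece, so the combined
// hardware state cannot be resolved until draw time.
class DrawTimeContext {
public:
    void setPixelShader(std::uint32_t id) { desc_.pixelShaderId = id; }
    void setBlendMode(std::uint32_t mode) { desc_.blendMode = mode; }
    std::uint64_t draw() { return translateToHardware(desc_); }
private:
    PipelineDesc desc_{0, 0};
};

// D3D12-style model: the object is immutable, so translation happens
// exactly once, at creation.
class PipelineStateObject {
public:
    explicit PipelineStateObject(const PipelineDesc& d)
        : hw_(translateToHardware(d)) {}
    std::uint64_t hardwareState() const { return hw_; }
private:
    const std::uint64_t hw_;
};

// Drawing with a PSO only copies the pre-computed value.
std::uint64_t drawWithPso(const PipelineStateObject& pso) {
    return pso.hardwareState();
}
```

In this model, a hundred draws through DrawTimeContext run the translation a hundred times, while a hundred draws with one PipelineStateObject run it once. Real drivers are far more sophisticated than this, but the sketch shows where the per-draw cost goes.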
D3D12 also replaces D3D11's context-based execution model with something called command lists, which sound pretty comparable to Mantle's command buffers. Here's Sandy's explanation again:
Direct3D 12 introduces a new model for work submission based on command lists that contain the entirety of information needed to execute a particular workload on the GPU. Each new command list contains information such as which PSO to use, what texture and buffer resources are needed, and the arguments to all draw calls. Because each command list is self-contained and inherits no state, the driver can pre-compute all necessary GPU commands up-front and in a free-threaded manner. The only serial process necessary is the final submission of command lists to the GPU via the command queue, which is a highly efficient process.
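The self-contained property Sandy describes can be sketched in a few lines of C++. This is a toy model with hypothetical names (DrawCall, CommandList, recordCommandList, CommandQueue), not the actual Direct3D 12 interface. The point is that recording reads only its own arguments, so the order in which lists are recorded does not matter; only submission is ordered.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration; not real Direct3D structures.
struct DrawCall {
    int vertexCount;
    int textureId;
};

struct CommandList {
    int psoId;                    // which pipeline state to use
    std::vector<DrawCall> draws;  // every draw, with its arguments
};

// Recording touches no shared state: it reads its arguments and writes
// only the list it returns. Calls like this could therefore run on
// separate threads without locking (threads are omitted to keep the
// sketch short).
CommandList recordCommandList(int psoId,
                              const std::vector<int>& textureIds,
                              int vertexCount) {
    assert(vertexCount > 0);
    CommandList list{psoId, {}};
    for (int tex : textureIds) {
        list.draws.push_back(DrawCall{vertexCount, tex});
    }
    return list;
}

// Submission is the only serial step: lists execute in submit order.
class CommandQueue {
public:
    void submit(const CommandList& list) {
        for (const DrawCall& d : list.draws) {
            executed_.push_back(d.textureId);
        }
    }
    const std::vector<int>& executed() const { return executed_; }
private:
    std::vector<int> executed_;
};
```

Recording list B before list A and then submitting A followed by B still executes A's draws first, because nothing in a list depends on when it was built.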
D3D12 takes things a step further with a construct called bundles, which lets developers re-use commands in order to further reduce driver overhead:
In addition to command lists, Direct3D 12 also introduces a second level of work pre-computation, bundles. Unlike command lists which are completely self-contained and typically constructed, submitted once, and discarded, bundles provide a form of state inheritance which permits reuse. For example, if a game wants to draw two character models with different textures, one approach is to record a command list with two sets of identical draw calls. But another approach is to "record" one bundle that draws a single character model, then "play back" the bundle twice on the command list using different resources. In the latter case, the driver only has to compute the appropriate instructions once, and creating the command list essentially amounts to two low-cost function calls.
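Sandy's two-character example maps onto a small C++ sketch. As before, this is a toy model under stated assumptions: Bundle, recordBundle, CommandList, and executeBundle are hypothetical names, and a counter stands in for the driver's instruction-generation cost. The bundle records which meshes to draw but inherits the texture from the command list at playback.

```cpp
#include <cassert>
#include <vector>

// Hypothetical types for illustration; not real Direct3D structures.
struct GpuCommand {
    int meshId;
    int textureId;
};

// Counts how many times the (pretend) costly recording step runs.
static int g_compiles = 0;

// A bundle holds pre-computed draws but no texture binding of its own.
struct Bundle {
    std::vector<int> meshIds;
};

Bundle recordBundle(const std::vector<int>& meshIds) {
    ++g_compiles;  // the expensive work happens once, here
    return Bundle{meshIds};
}

class CommandList {
public:
    void setTexture(int textureId) { currentTexture_ = textureId; }

    // Playback is cheap: each stored draw picks up whatever texture
    // the command list has bound at this moment (state inheritance).
    void executeBundle(const Bundle& bundle) {
        for (int mesh : bundle.meshIds) {
            commands_.push_back(GpuCommand{mesh, currentTexture_});
        }
    }

    const std::vector<GpuCommand>& commands() const { return commands_; }
private:
    int currentTexture_ = -1;  // -1 means nothing bound yet
    std::vector<GpuCommand> commands_;
};
```

Recording one bundle for a character's meshes, then calling setTexture and executeBundle twice, yields two textured copies of the character while the recording cost is paid only once.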
Thanks to all of this shader and pipeline state caching, Gosalia says there should be "no more compiles in the middle of gameplay." Draw-time shader compilation can cause hitches (or frame latency spikes) during gameplay—and developers bemoaned it at AMD's APU13 event last year. Dan Baker of Oxide Games says that, in D3D12, we "shouldn't have frame hitches caused by driver at all."
Both Mantle and D3D12 introduce new ways to bind resources to the graphics pipeline, as well. D3D12's model involves descriptor heaps, which don't sound all that dissimilar to Mantle's descriptor sets. Sandy explains:
Instead of requiring standalone resource views and explicit mapping to slots, Direct3D 12 provides a descriptor heap into which games create their various resource views. This provides a mechanism for the GPU to directly write the hardware-native resource description (descriptor) to memory up-front. To declare which resources are to be used by the pipeline for a particular draw call, games specify one or more descriptor tables which represent sub-ranges of the full descriptor heap. As the descriptor heap has already been populated with the appropriate hardware-specific descriptor data, changing descriptor tables is an extremely low-cost operation.
In addition to the improved performance offered by descriptor heaps and tables, Direct3D 12 also allows resources to be dynamically indexed in shaders, providing unprecedented flexibility and unlocking new rendering techniques. As an example, modern deferred rendering engines typically encode a material or object identifier of some kind to the intermediate g-buffer. In Direct3D 11, these engines must be careful to avoid using too many materials, as including too many in one g-buffer can significantly slow down the final render pass. With dynamically indexable resources, a scene with a thousand materials can be finalized just as quickly as one with only ten.
According to Sandy, descriptor heaps "match modern hardware and significantly improve performance." The D3D11 approach is "highly abstracted and convenient," he says, but it requires games to issue additional draw calls when resources need to be changed, which leads to higher overhead.
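The heap-and-table arrangement can be pictured as an array plus a sub-range into it. The following C++ sketch is only a mental model: Descriptor, DescriptorTable, DescriptorHeap, and shadePixel are hypothetical names, and an integer ID stands in for the hardware-native descriptor data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a hardware-native resource description.
struct Descriptor {
    int resourceId;
};

// A table is just a sub-range of the heap: two integers.
struct DescriptorTable {
    std::size_t offset;
    std::size_t count;
};

class DescriptorHeap {
public:
    // Written once, up-front; returns the slot index in the heap.
    std::size_t create(int resourceId) {
        heap_.push_back(Descriptor{resourceId});
        return heap_.size() - 1;
    }

    // Look up an entry relative to a table's sub-range.
    const Descriptor& at(const DescriptorTable& table,
                         std::size_t index) const {
        assert(index < table.count);
        assert(table.offset + table.count <= heap_.size());
        return heap_[table.offset + index];
    }
private:
    std::vector<Descriptor> heap_;
};

// Stand-in for a shader that picks its material by a per-pixel ID,
// as in the g-buffer example: one dynamic index, however many
// materials the table holds.
int shadePixel(const DescriptorHeap& heap,
               const DescriptorTable& materials,
               std::size_t materialId) {
    return heap.at(materials, materialId).resourceId;
}
```

In this model, switching tables means handing the draw a different offset and count, with no descriptor data rewritten, and looking up material 999 of 1,000 costs the same as looking up material 9 of 10.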
According to Yuri Shtil, Senior Infrastructure Architect at Nvidia, the introduction of descriptor heaps transfers the responsibility of managing resources in memory from the driver to the application. In other words, it's up to the developer to manage memory. This arrangement is again reminiscent of Mantle. AMD hailed Mantle's manual memory allocation as a major improvement and as a means to make more efficient use of GPU memory.
Now, of course, lower-level abstraction of that sort can be a double-edged sword. Because developers have a greater level of control over what happens on the hardware, the driver and API are able to do less work—but this also leads to more opportunities for things to go wrong. Here's an example from Nvidia's Tamasi:
Think about memory management, for example. The way DirectX 11 works is, if you want to allocate a texture, before you can use it, the driver basically pre-validates that that memory is resident on the GPU. So, there's work going on in the driver and on the CPU to validate that that memory is resident. In a world where the developer controls memory allocation, they will already know whether they've allocated or de-allocated that memory. There's no check that has to happen. Now, if the developer screws up and tries to render from a texture that isn't resident, it's gonna break, right? But because they have control of that, there's no validation step that will need to take place in the driver, and so you save that CPU work.
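Tamasi's trade-off can be illustrated with a toy C++ model. Everything here is hypothetical (GpuMemory, drawValidated, drawUnchecked), and it greatly simplifies what real drivers do. The validated path pays for a residency check on every use; the unchecked path skips it and simply fails if the application got its bookkeeping wrong.

```cpp
#include <cassert>
#include <unordered_set>

// Hypothetical model of which textures currently live in GPU memory.
class GpuMemory {
public:
    void makeResident(int textureId) { resident_.insert(textureId); }
    void evict(int textureId) { resident_.erase(textureId); }
    bool isResident(int textureId) const {
        return resident_.count(textureId) != 0;
    }
private:
    std::unordered_set<int> resident_;
};

// Counts the CPU-side validation work done on the application's behalf.
static int g_driverChecks = 0;

// D3D11-style model: the driver validates residency before every use
// and fixes things up if needed, so the draw always succeeds.
void drawValidated(GpuMemory& memory, int textureId) {
    ++g_driverChecks;
    if (!memory.isResident(textureId)) {
        memory.makeResident(textureId);
    }
}

// D3D12-style model: no validation step. The lookup below only
// simulates the outcome; a real driver would do no lookup at all, and
// rendering from a non-resident texture would simply break.
bool drawUnchecked(const GpuMemory& memory, int textureId) {
    return memory.isResident(textureId);
}
```

The validated path never fails but does bookkeeping on every draw. The unchecked path does none, which is where the CPU savings come from, and a false return marks the case where the developer "screws up."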
Developers who would rather not deal with such risks won't have to. According to Max McMullen, Microsoft's Development Lead for Windows Graphics, D3D12 will give developers the option to use the more abstracted programming model from D3D11. "Every single algorithm that you can build on 11 right now, you can build on 12," he said.
But getting one's hands dirty with the lower-level programming model should pay some very real dividends. One of the demos shown at GDC was a custom, D3D12 version of Futuremark's 3DMark running on a quad-core Intel processor. The D3D12 demo used 50% less CPU time than the D3D11 version, and instead of dumping most of the workload on one CPU core, it spread the load fairly evenly across all four cores. The screenshots above show the differences in CPU utilization at the top left.
Oxide's Baker mentioned other potential upsides to D3D12, including a "vast reduction in driver complexity" and "generally more responsive games . . . even at a lower frame rate." D3D12 may not just extract additional performance and rendering complexity out of today's hardware. It may also make games feel better in subtle but important ways. Also, if what Baker said about driver robustness checks out, PC gamers may waste less time waiting on game-specific driver fixes and optimizations from GPU manufacturers.
Direct3D 12 on future hardware
Developers will be able to exploit D3D12's new programming model on a wide range of existing graphics processors. In addition to that programming model, however, D3D12 will introduce some new rendering features that will require new GPUs. Microsoft teased a couple of those rendering features at GDC: new blend modes and conservative rasterization.
I'm not entirely clear on what the new blend modes are supposed to do, but as I understand it, conservative rasterization will help with object culling (that is, hiding geometry that shouldn't be seen, such as objects behind a wall) as well as hit detection.
Nvidia's Tamasi told us D3D12 includes a "whole bunch more" new rendering features beyond those Microsoft has already discussed. I expect we'll hear more about them when Microsoft delivers the preview release of the new API to developers, which is scheduled to happen later this year.
Which next-gen GPUs will support those new features? We don't know yet. Since the first D3D12 titles are expected in the 2015 holiday season, I would be surprised if Nvidia and AMD didn't have new hardware with complete D3D12 support ready by then. Then again, neither AMD nor Nvidia has announced anything of the sort yet. We'll have to wait and see what those companies have to say when Microsoft reveals D3D12's full array of new rendering features.