Taking the first steps into a ray-traced future

Date: 2018-09-14

It's Turing Day at TR. We've been hearing about the innovations inside Nvidia's Turing GPUs for weeks, and now we can tell you a bit more about what's inside them. Turing implements a host of new technologies that promise to reshape the PC gaming experience for many years to come. While much of the discussion around Turing has concerned the company's hardware acceleration of real-time ray-tracing, the tensor cores on board Turing GPUs could have even more wide-ranging effects on the way we game—to say nothing of the truckload of other changes under Turing's hood that promise better performance and greater flexibility for gaming than ever before.



A die shot of the TU102 GPU. Source: Nvidia

On top of the architectural details that we can discuss this morning, Nvidia sent over both GeForce RTX 2080 and RTX 2080 Ti cards for us to play with. As of this writing, those cards are on a FedEx truck and headed for the TR labs. Nvidia has hopped on the "unboxing embargo" bandwagon, meaning we can show you the scope of delivery of those cards later today. Performance numbers will have to wait, though. First, Nvidia is pulling back the curtain on the Turing architecture and the first implementations thereof. Let's discuss some of the magic inside.

Despite Nvidia's description of ray-tracing as the holy grail of computer graphics during its introduction of the Turing architecture, these graphics cards do not replace rasterization—the process of mapping 3D geometry onto a 2D plane and the way real-time graphics have been produced for decades—with ray-tracing, or the process of casting rays through a 2D plane into a 3D scene to directly model the behavior of light. Real-time ray tracing for every pixel of a scene remains prohibitively expensive, computationally speaking.
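To make "casting rays through a 2D plane into a 3D scene" concrete, here is a minimal sketch of how a renderer might build the primary ray for one pixel. The pinhole camera at the origin, the view direction down the negative z axis, the 90-degree field of view, and the function name `primary_ray` are all assumptions made for illustration; none of them comes from Nvidia or from DXR.

```python
import math

def primary_ray(px, py, width, height, fov_deg=90.0):
    """Return (origin, direction) for the ray through the center of pixel (px, py).

    Hypothetical setup: a pinhole camera sits at the origin and looks down -z,
    with the image plane one unit in front of it.
    """
    aspect = width / height
    half_height = math.tan(math.radians(fov_deg) / 2.0)
    # Map the pixel center to a point on the image plane.
    x = (2.0 * (px + 0.5) / width - 1.0) * aspect * half_height
    y = (1.0 - 2.0 * (py + 0.5) / height) * half_height
    direction = (x, y, -1.0)
    length = math.sqrt(sum(c * c for c in direction))
    return (0.0, 0.0, 0.0), tuple(c / length for c in direction)

# The center pixel of a 3x3 image looks straight down the view axis.
origin, direction = primary_ray(1, 1, 3, 3)
```

A full ray tracer would repeat this for every pixel, and often many times per pixel, which is where the cost discussed here comes from.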

Instead, the company wants to continue using rasterization for the things it's good at and add certain ray-traced effects where those techniques would produce better visual fidelity—a technique it refers to as hybrid rendering. Nvidia says rasterization is a much faster way of determining object visibility than ray-tracing, for example, so ray-tracing only needs to enter the picture for techniques where fidelity or realism is important yet difficult to achieve via rasterization, like reflections, refractions, shadows, and ambient occlusion. Nvidia notes that the traditional rasterization pipeline and the new ray-tracing pipeline can operate "simultaneously and cooperatively" in its Turing architecture.

The software groundwork for this technique was laid earlier this year when Microsoft revealed the DirectX Raytracing API, or DXR, for DirectX 12. DXR provides access to some of the basic building blocks for ray-tracing alongside existing graphics-programming techniques, including a method of representing the 3D scene that can be traversed by the graphics card, a way to dispatch ray-tracing work to the graphics card, a series of shaders for handling the interactions of rays with the 3D scene, and a new pipeline state object for tracking what's going on across ray-tracing workloads.

Microsoft notes that DXR code can run on any DirectX 12-compatible graphics card in software as a fallback, since it behaves as a compute-like workload. That fallback method won't be a practical way of achieving real-time ray-traced performance, though. To make DXR code practical for use in real-time rendering, Nvidia is implementing an entire platform it calls RTX that will let DXR code run on its hardware. In turn, GeForce RTX cards are the first hardware designed to serve as the foundation for real-time ray-traced effects with DXR and RTX.

RT cores take a load off

Real-time ray tracing has remained elusive because, to paraphrase a famous quote, ray-tracing is fast but computers are slow. Figuring out which triangle a ray will intersect in a scene is extremely computationally expensive, and it can be difficult to organize scene data in a way that lets a processor exploit locality of reference while ray tracing. Because a ray could interact with practically any triangle in a scene, it's hard to keep temporally local or spatially local data in cache. A ray might ultimately behave in a way that's friendly to the cache, or it might not. This is not a problem that the Turing architecture necessarily seeks to solve, or can solve; it's just a fact of life of ray-tracing.

One way developers can help to make ray-tracing more efficient, however, is through the use of an acceleration structure—an organization of geometry data that helps bundle stuff that's spatially local and reduces the amount of work necessary when testing the objects a ray interacts with in a scene.

The typical way data is organized to accelerate ray-tracing is through a tree structure called a bounding volume hierarchy, or BVH. The top level of a BVH might contain one or more bounding shapes (usually, boxes) that themselves might contain further groups of subdivisions of the scene. Ultimately, the last level of the BVH tree contains triangle data. When a ray is cast, software that uses a BVH doesn't go straight to work trying to find out which, if any, triangles that ray hits. Instead, it limits the scope of work by first testing whether the ray intersects the bounding shapes at high levels of the BVH tree and traverses each level of it, ultimately arriving at only those triangles that the ray would actually interact with before the GPU performs any further shading work.
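The traversal described above can be sketched in a few lines of software. This is an illustrative toy, not Nvidia's implementation: the `BvhNode` class, the axis-aligned boxes, the "slab" box test, and the stack-based walk are all assumptions chosen because they are common in textbook ray tracers.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class BvhNode:
    lo: Vec3                      # minimum corner of the bounding box
    hi: Vec3                      # maximum corner of the bounding box
    children: List["BvhNode"] = field(default_factory=list)
    triangles: List[int] = field(default_factory=list)  # leaf nodes only: triangle IDs

def ray_hits_box(origin: Vec3, direction: Vec3, lo: Vec3, hi: Vec3) -> bool:
    """Slab test: intersect the ray with each pair of axis-aligned planes."""
    t_near, t_far = 0.0, float("inf")
    for o, d, l, h in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab; it misses unless it starts inside it.
            if o < l or o > h:
                return False
            continue
        t0, t1 = (l - o) / d, (h - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def candidate_triangles(root: BvhNode, origin: Vec3, direction: Vec3):
    """Walk the tree and return (triangle IDs worth testing, boxes tested)."""
    found, boxes_tested, stack = [], 0, [root]
    while stack:
        node = stack.pop()
        boxes_tested += 1
        if not ray_hits_box(origin, direction, node.lo, node.hi):
            continue              # skip this whole subtree
        if node.children:
            stack.extend(node.children)
        else:
            found.extend(node.triangles)
    return found, boxes_tested

# Two leaves under one root; the ray only passes through the left one.
left = BvhNode((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), triangles=[0, 1])
right = BvhNode((3.0, 0.0, 0.0), (4.0, 1.0, 1.0), triangles=[2, 3])
root = BvhNode((0.0, 0.0, 0.0), (4.0, 1.0, 1.0), children=[left, right])
hits, tested = candidate_triangles(root, (0.5, 0.5, -1.0), (0.0, 0.0, 1.0))
```

In this toy scene the ray is checked against three boxes and only triangles 0 and 1 survive for exact intersection testing. The saving grows with scene size, since a miss on a high-level box discards everything beneath it.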

The RT core makes the real-time ray-tracing portion of hybrid rendering possible by accelerating the process of BVH traversal and ray-triangle intersection testing, freeing up the streaming multiprocessor (SM) to do other work. Without the RT core, determining which bounding volumes and triangles are intersected by a given ray would require immense amounts of traditional floating-point shader power, a cost that's prohibitive for real-time rendering applications. For reference, Nvidia says the GTX 1080 Ti can cast 1.1 gigarays per second with its 11.3-TFLOP shader array, while the RTX 2080 Ti can cast 10 gigarays per second or more thanks to its RT cores, roughly a ninefold improvement by Nvidia's own figures.
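For a sense of the floating-point work involved in a single ray-triangle test, here is a sketch of the widely used Möller-Trumbore algorithm. Nvidia hasn't said which method its RT cores implement, so treat this as a stand-in for what a shader program might have to do in software; the function and helper names are ours.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle(origin, direction, v0, v1, v2, eps=1e-9):
    """Return the distance along the ray to the triangle, or None on a miss."""
    edge1, edge2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if abs(det) < eps:
        return None               # ray is parallel to the triangle's plane
    inv_det = 1.0 / det
    to_origin = sub(origin, v0)
    u = dot(to_origin, p) * inv_det        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(to_origin, edge1)
    v = dot(direction, q) * inv_det        # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, q) * inv_det
    return t if t > eps else None          # ignore hits behind the origin

triangle = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
distance = ray_triangle((0.25, 0.25, -1.0), (0.0, 0.0, 1.0), *triangle)
```

Each test costs two cross products, four dot products, and a division, and a software ray tracer runs it for every candidate triangle of every ray.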

One of the challenges of bounding volume hierarchies is that the bounding volumes themselves can change as objects in the scene move, requiring the refitting of those shapes and possibly the insertion or removal of nodes from within the BVH tree. Nvidia handles initial construction and refitting of the BVH in the driver, while the actual casting of rays and the resultant shading work are handled by the developer through the DXR API.
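Refitting is simple to picture in code, at least in its most basic form. The sketch below recomputes every box from the bottom of the tree up after a triangle moves. It is a hypothetical illustration of the idea only; the `RefitNode` class and `refit` function are our own names, and Nvidia's driver-side approach isn't described in any detail.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class RefitNode:
    children: List["RefitNode"] = field(default_factory=list)
    triangles: List[List[Vec3]] = field(default_factory=list)  # leaf nodes only
    lo: Vec3 = (0.0, 0.0, 0.0)
    hi: Vec3 = (0.0, 0.0, 0.0)

def refit(node: RefitNode):
    """Recompute bounding boxes bottom-up; returns the node's (lo, hi) corners."""
    if node.children:
        boxes = [refit(child) for child in node.children]
        lows = [box[0] for box in boxes]
        highs = [box[1] for box in boxes]
    else:
        # A leaf's box must enclose every vertex of every triangle it holds.
        lows = highs = [vertex for tri in node.triangles for vertex in tri]
    node.lo = tuple(min(p[axis] for p in lows) for axis in range(3))
    node.hi = tuple(max(p[axis] for p in highs) for axis in range(3))
    return node.lo, node.hi

leaf_a = RefitNode(triangles=[[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]])
leaf_b = RefitNode(triangles=[[(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (2.0, 1.0, 0.0)]])
root = RefitNode(children=[leaf_a, leaf_b])
refit(root)                       # initial bounds
# An object moves five units along x, so the boxes above it must grow to match.
leaf_b.triangles[0] = [(7.0, 0.0, 0.0), (8.0, 0.0, 0.0), (7.0, 1.0, 0.0)]
refit(root)
```

Refitting keeps the tree's shape and only updates its boxes. If objects move far enough, the boxes can overlap so much that traversal slows down, which is why nodes sometimes need to be inserted or removed instead.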

Even with the acceleration of ray-tracing operations that the RT core provides, Nvidia cautions that applications will not be able to suddenly begin casting hundreds of rays per pixel in real time. Instead, the second pillar of real-time ray tracing and hybrid rendering in Turing cards comes from denoising filters. In traditional ray tracing, the number of rays cast per pixel might need to be large in order to achieve a quality result. That tradeoff doesn't suit the integration of ray-traced effects into the real-time rendering pipeline. Fewer rays cast per pixel can result in coarse-looking noise, and noisy reflections or shadows would prove exceedingly unpleasant to the eye in photorealistic environments.

With GeForce RTX cards, the hope is that developers can cast relatively few rays per pixel before using denoising algorithms to clean up the resulting image. Denoising allows ray-traced effects with small numbers of rays cast to arrive at a result whose quality is similar to that of a scene with many more rays cast. Nvidia isn't specific about the denoisers it's using in its RTX platform, although the company says that it's using both AI and non-AI denoising algorithms depending on what produces the best result for a given application. In any case, the ray-traced portion of the hybrid rendering pipeline wouldn't be possible without the reduction in rays cast that denoising permits.
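A toy experiment shows why denoising helps. The sketch below "renders" a patch of soft shadow where half of all shadow rays should reach the light, first with one ray per pixel and then with nine, and compares both against a one-ray image cleaned up with a 3x3 box blur. The box blur is a deliberately crude stand-in of our own choosing; Nvidia hasn't detailed its denoisers, and real ones are far more careful about preserving edges.

```python
import random

def trace_pixel(rng, rays, true_visibility=0.5):
    """Toy stand-in for tracing: the fraction of shadow rays that reach the light."""
    return sum(rng.random() < true_visibility for _ in range(rays)) / rays

def render(rng, size, rays):
    return [[trace_pixel(rng, rays) for _ in range(size)] for _ in range(size)]

def box_filter(image):
    """Average each pixel with its neighbors in a 3x3 window, clamped at the borders."""
    size = len(image)
    out = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            window = [image[j][i]
                      for j in range(max(0, y - 1), min(size, y + 2))
                      for i in range(max(0, x - 1), min(size, x + 2))]
            out[y][x] = sum(window) / len(window)
    return out

def mse(image, truth=0.5):
    """Mean squared error against the known correct shadow value."""
    return sum((p - truth) ** 2 for row in image for p in row) / (len(image) ** 2)

rng = random.Random(1)
noisy = render(rng, 64, rays=1)       # one ray per pixel: every pixel is 0 or 1
denoised = box_filter(noisy)          # same ray budget, then filtered
brute_force = render(rng, 64, rays=9) # nine times the ray budget, no filter
errors = (mse(noisy), mse(denoised), mse(brute_force))
```

With one ray per pixel, every pixel is either fully lit or fully shadowed, so the error is as large as it can be. Filtering the cheap image pulls the error down toward what casting nine rays per pixel achieves, for one-ninth of the rays.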

Tensor cores bring artificial intelligence to gaming PCs

To make AI models like denoising filters practical for use on gaming PCs, Turing cards include the tensor cores that Nvidia first unveiled as part of its Volta architecture. These cores accelerate tensor operations, chiefly the matrix multiply-and-accumulate math that's incredibly useful and versatile for performing AI inferencing. As a refresher, inferencing is the use of trained deep-learning models to perform a computational task.
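The core operation is easy to write down, even if doing it quickly is not. Below is a plain-Python sketch of a matrix multiply-accumulate, D = A x B + C, followed by its use as one layer of a tiny neural network. The function names, the 2x2 sizes, and the ReLU activation are our own illustrative choices, and the sketch says nothing about the data formats or tile sizes the hardware actually uses.

```python
def matrix_multiply_accumulate(a, b, c):
    """Return D = A x B + C for square matrices given as lists of rows."""
    n = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)]

def relu(matrix):
    """A common neural-network activation: clamp negative values to zero."""
    return [[max(0, value) for value in row] for row in matrix]

# One inference step for a toy layer: outputs = relu(weights x inputs + bias).
weights = [[1, -1],
           [2,  0]]
inputs = [[3, 1],
          [4, 5]]
bias = [[0, 0],
        [1, 1]]
outputs = relu(matrix_multiply_accumulate(weights, inputs, bias))
```

A trained model is largely a long chain of steps like this one, which is why hardware that performs the multiply-accumulate directly can speed up inferencing so much.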

Denoising won't be the only, or even the biggest, application for deep learning models on Turing cards. Many game developers are hopping on the bandwagon for Nvidia's Deep Learning Super Sampling, or DLSS, technology. Nvidia describes DLSS as a replacement for temporal anti-aliasing, a technique that combines multiple frames by determining motion vectors and using that data to sample portions of the previous frame. Nvidia notes that despite the common use of temporal AA, it remains a difficult technique for developers to effectively employ. For my part, I've never enjoyed the apparent blur that TAA seems to add to the edges of objects in motion.
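The basic idea behind temporal AA, and one way it can go wrong, fits in a short sketch. The one-dimensional "frames," the whole-pixel motion vectors, the blend weight `alpha`, and the function name `taa_resolve` are all simplifying assumptions of ours; shipping implementations add history clamping, sub-pixel jitter, and more.

```python
def taa_resolve(current, history, motion, alpha=0.1):
    """Blend each pixel of the current frame with its reprojected history.

    motion[x] says how many pixels the surface at x has moved since the
    previous frame, so its old color lives at x - motion[x].
    """
    resolved = []
    for x, color in enumerate(current):
        previous_x = x - motion[x]
        if 0 <= previous_x < len(history):
            resolved.append(alpha * color + (1.0 - alpha) * history[previous_x])
        else:
            resolved.append(color)    # nothing to reuse: the pixel just appeared
    return resolved

history = [0.0, 1.0, 0.0]             # a bright object sat at pixel 1
current = [0.0, 0.0, 1.0]             # it has since moved to pixel 2

good = taa_resolve(current, history, motion=[1, 1, 1], alpha=0.5)
bad = taa_resolve(current, history, motion=[0, 0, 0], alpha=0.5)
```

With correct motion vectors the object stays sharp at its new position. With wrong ones, half of its brightness is left behind at the old position, which is the kind of smearing that makes moving edges look soft.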

To attack some of the limitations of TAA, Nvidia took its extensive experience using deep learning to recognize and process images and applied it to games. DLSS depends on a trained neural network that's exposed to a large number of "ground truths," perfect or near-perfect representations of what in-game scenes should look like via 64x supersampling. Once the model is sufficiently trained on those images, Turing cards can use it to render scenes "at a lower input sample count," according to Nvidia, and then infer what the final scene should look like at its target resolution. Nvidia says DLSS offers similar image quality to TAA with half the shading work.
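To see why a 64x supersampled image makes a good "ground truth," consider a single pixel that a bright object's edge covers by one quarter. The sketch below assumes a regular 8x8 grid of sample positions, which is our own guess at a layout since Nvidia doesn't specify its sampling pattern, and `supersample` and `edge` are hypothetical names.

```python
def supersample(inside, grid):
    """Estimate pixel coverage by testing a grid x grid pattern of sample points.

    `inside(x, y)` reports whether the point (x, y), in 0..1 pixel
    coordinates, is covered by the object being drawn.
    """
    covered = 0
    for j in range(grid):
        for i in range(grid):
            x = (i + 0.5) / grid
            y = (j + 0.5) / grid
            covered += inside(x, y)
    return covered / (grid * grid)

def edge(x, y):
    """A vertical edge: the object covers the left quarter of the pixel."""
    return x < 0.25

one_sample = supersample(edge, grid=1)     # a single sample at the pixel center
ground_truth = supersample(edge, grid=8)   # 64 samples per pixel
```

The single center sample misses the object entirely, while the 64-sample version recovers the correct one-quarter coverage. Rendering every pixel 64 times is far too slow for gameplay, but it only has to be done offline to produce training targets.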