
Inside the second with Nvidia's frame capture tools

Display-level reckoning for GPUs
— 4:40 PM on March 27, 2013

We've come a long way since our initial Inside the second article. That's where we first advocated for testing real-time graphics and gaming performance by considering the time required to render each frame of animation, instead of looking at traditional FPS averages. Since then, we've applied new testing methods focused on frame latencies to a host of graphics card reviews and to CPUs, as well, with enlightening results.

The fundamental reality we've discovered is that a higher FPS average doesn't necessarily correspond to smoother animation and gameplay. In fact, at times, FPS averages don't seem to mean very much at all. The problem boils down to a weakness of averaging frame rates over the span of a whole second, as nearly all FPS-based tools tend to do. Allow me to dust off an old illustration, since it still serves our purposes well:

The fundamental problem is that, in terms of both computer time and human visual perception, one second is a very long time. Averaging results over a single second can obscure some big and important performance differences between systems.

To illustrate, let's look at an example. It's contrived, but it's based on some real experiences we've had in game testing over the years. The charts below show the times required, in milliseconds, to produce a series of frames over a span of one second on two different video cards.

GPU 1 is obviously the faster solution in most respects. Generally, its frame times are in the teens, and that would usually add up to an average of about 60 FPS. GPU 2 is slower, with frame times consistently around 30 milliseconds.

However, GPU 1 has a problem running this game. Let's say it's a texture upload problem caused by poor memory management in the video drivers, although it could be just about anything, including a hardware issue. The result of the problem is that GPU 1 gets stuck when attempting to render one of the frames—really stuck, to the tune of a nearly half-second delay. If you were playing a game on this card and ran into this issue, it would be a huge show-stopper. If it happened often, the game would be essentially unplayable.

The end result is that GPU 2 does a much better job of providing a consistent illusion of motion during the period of time in question. Yet look at how these two cards fare when we report these results in FPS:

Whoops. In traditional FPS terms, the performance of these two solutions during our span of time is nearly identical. The numbers tell us there's virtually no difference between them. Averaging our results over the span of a second has caused us to absorb and obscure a pretty major flaw in GPU 1's performance.
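
For readers who'd like to poke at the arithmetic, here's a minimal sketch of the same effect in Python. The frame times are hypothetical values chosen to resemble the scenario described above, not the data behind our charts, and the function names are ours for illustration only.

```python
# Hypothetical frame times in milliseconds, loosely modeled on the scenario
# above. These are NOT the values from the article's charts.
gpu1_ms = [17.0] * 33 + [439.0]   # quick frames plus one long stall
gpu2_ms = [30.0] * 33             # slower but steady


def average_fps(frame_times_ms):
    """Frames rendered divided by the elapsed time in seconds."""
    return len(frame_times_ms) / (sum(frame_times_ms) / 1000.0)


def worst_frame_ms(frame_times_ms):
    """The single longest frame time in the sample."""
    return max(frame_times_ms)


for name, times in (("GPU 1", gpu1_ms), ("GPU 2", gpu2_ms)):
    print(f"{name}: {average_fps(times):.1f} FPS average, "
          f"worst frame {worst_frame_ms(times):.0f} ms")
```

With these made-up numbers, the two averages land at roughly 34 and 33 FPS, while the worst frame times are 439 ms and 30 ms. The average hides the stall; the per-frame view does not.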

Since we published that first article, we've seen a number of real-world instances where FPS averages have glossed over noteworthy performance problems. Most prominent among those was the discovery of frame latency issues with the Radeon HD 7950 in last Christmas' crop of new games. When we demonstrated the nature of that problem with slow-motion video, which showed a sequence that had stuttering animation despite an average of 69 FPS, lots of folks seemed to grasp intuitively the story we'd been telling with numbers alone. As a result, AMD has incorporated latency-sensitive methods into its driver development process, and quite a few other websites have begun deploying frame-latency-based testing methods in their own reviews. We're happy to see it.

There's still much work to be done, though. We discovered a couple of problems in our initial investigation into these matters, and we haven't been able to explore those issues in full. For instance, we encountered concrete evidence of a weakness of multi-GPU setups known as micro-stuttering. We believe it's a real problem, but our ability to quantify its impact has been affected by another problem: the software tool that we've been using to capture frame times, Fraps, collects its samples at a relatively early stage in the frame rendering process. Both of the major GPU makers, AMD and Nvidia, have told us that the results from Fraps don't tell the whole story—especially when it comes to multi-GPU solutions.

Happily, though, in a bit of enlightened self-interest, the folks at Nvidia have decided to enable reviewers—and eventually, perhaps, consumers—to look deeper into the question of frame rendering times and frame delivery. They have developed a new set of tools, dubbed "FCAT" for "Frame Capture and Analysis Tools," that let us measure exactly how and when each rendered frame is being delivered to the display. The result is incredible new insight into what's happening at the very end of the rendering-and-display pipeline, along with several surprising revelations about the true nature of the problems with some multi-GPU setups.

How stuff works
Before we move on, we should take a moment to establish how video game animations are produced. At the core of the process is a looping structure: most game engines do virtually all of their work in a big loop, iterating over and over to create the illusion of motion. During each cycle through the loop, the game evaluates inputs from various sources, advances its physical simulation of the world, initiates any sounds that need to be played, and creates a visual representation of that moment in time. The visual portion of the work is then handed off to a 3D graphics programming interface, such as OpenGL or DirectX, where it's processed and eventually displayed onscreen.
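
The loop described above can be sketched in a few lines of Python. This is a toy, not a real engine: every function name here (poll_inputs, advance_simulation, play_sounds, render) is hypothetical, and the loop is fed a list of frame durations rather than reading a clock so that the example stays deterministic. Real engines differ in how they track and advance time.

```python
def poll_inputs():
    """Stand-in for reading the keyboard, mouse, network, and so on."""
    return {}


def advance_simulation(state, inputs, dt_s):
    """Move a single object at constant velocity for dt_s seconds."""
    return {"x": state["x"] + state["vx"] * dt_s, "vx": state["vx"]}


def play_sounds(state):
    """Stand-in for starting any sounds this moment calls for."""
    return None


def render(state):
    """Stand-in for building the visual representation of this moment."""
    return f"frame with object at x={state['x']:.2f}"


def game_loop(frame_durations_s, present):
    """Run one loop iteration per entry in frame_durations_s."""
    state = {"x": 0.0, "vx": 10.0}
    for dt_s in frame_durations_s:
        inputs = poll_inputs()
        state = advance_simulation(state, inputs, dt_s)
        play_sounds(state)
        # Hand the finished frame to the graphics API (here, a callback).
        present(render(state))
    return state
```

Passing a callback for present keeps the sketch independent of any particular graphics API. Note that a long frame duration makes the object jump farther in a single step, which is exactly why uneven frame times show up as uneven motion.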

The path each "frame" of animation takes to the display involves several stages of fairly serious computation, along with some timing complications. I've created a horribly oversimplified diagram of the process below.

As you can see, the game engine hands off the frame to DirectX, which does a lot of processing work and then sends commands to the graphics driver. The graphics driver must then translate these commands into GPU machine language, which it does with the aid of a real-time compiler. The GPU subsequently does its rendering work, eventually producing a final image of the scene, which it outputs into a frame buffer. This buffer is generally part of a queue of two to three frames, as in our illustration.

What happens next depends on the settings in your graphics card control panel and in-game menus. You see, although the rendering process produces frames at a certain rate—one that can vary from frame to frame—the display operates according to its own timing. In fact, today's LCD panels still operate on assumptions dictated by Ye Olde CRT monitors, as if an electron gun were still scanning phosphors behind the screen and needed to touch each one of them at a regular interval in order to keep it lit. Pixels are updated from left to right across the screen in lines, and those lines are refreshed from the top to the bottom of the display. Most LCDs completely refresh themselves according to this pattern at the common CRT rate of 60 times per second, or 60 Hz.

If vsync, or vertical refresh synchronization, is enabled in your graphics settings, then the system will coordinate with the display to make sure updates happen in between refresh cycles. That is, the system won't flip to a new frame buffer, with new information in it, while the display is being updated. Without vsync, the display will be updated whenever a new frame of animation becomes ready, even if it's in the middle of painting the screen. Updates in the middle of the refresh cycle can produce an artifact known as tearing, where a seam is visible between successive animation frames shown onscreen at once.

An example of tearing from Borderlands 2

I sometimes like to play games with vsync enabled, in order to avoid tearing artifacts like the one shown above. However, vsync introduces several problems. It caps frame rates at the display's refresh rate, typically 60 Hz, which can interfere with performance testing (especially FPS-average-driven tests). Also, vsync introduces additional delays before a frame of animation makes it to the display. If a frame isn't ready for display at the start of the current refresh cycle, its contents won't be shown until the next refresh cycle begins. In other words, vsync causes frame update rates to be quantized, which can hamper display updates at the very worst time, when GPU frame rates are especially slow. (Nvidia's Adaptive Vsync feature attempts to work around this problem by disabling refresh sync when frame rates drop.)
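
The quantization effect is easy to see with a little arithmetic. Here's a minimal sketch, assuming a 60 Hz display and a hypothetical GPU that finishes a frame every 22 ms. It is a simplified model: it ignores the fact that, with vsync on, the GPU may also stall while waiting for a free buffer. The function names are ours.

```python
import math


def refresh_index(ready_ms, refresh_hz=60):
    """Index of the first refresh at or after the moment a frame is ready."""
    return math.ceil(ready_ms * refresh_hz / 1000.0)


def shown_at_ms(ready_ms, refresh_hz=60):
    """Time at which that refresh begins, in milliseconds."""
    return refresh_index(ready_ms, refresh_hz) * 1000.0 / refresh_hz


# Hypothetical GPU finishing a frame every 22 ms (about 45 FPS).
ready_times_ms = [22 * n for n in range(1, 6)]
indices = [refresh_index(t) for t in ready_times_ms]
refreshes_between = [b - a for a, b in zip(indices, indices[1:])]

print(indices)             # which refresh each frame lands on
print(refreshes_between)   # refresh cycles between successive frames
```

In this example the frames land on refreshes 2, 3, 4, 6, and 7. Most updates arrive one refresh apart (about 16.7 ms), but one arrives two refreshes apart (about 33.3 ms), even though the GPU produced every frame at a steady 22 ms pace.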

We have conducted the bulk of our performance testing so far, including the testing for this article, with vsync disabled. I think there's room for some intriguing explorations of GPU performance with vsync enabled. I'm not entirely sure what we might learn from that, but it's a different task for another day.

At any rate, you're probably getting the impression that lots happens between the game engine handing off a frame to DirectX and the content of that frame eventually hitting the screen. That takes us back to the limitations of one of our tools, Fraps, which we use to capture frame times. Fraps grabs its samples from the spot in the diagram where the game presents a completed frame to DirectX by calling "present," as denoted by the orange line. As you can see, that point lies fairly early in the rendering pipeline.
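
To be clear about what a sample taken at that point gives us, here's a small sketch. It is not Fraps' code, and the timestamps are hypothetical; it simply shows that if you record the moment of each present call, the frame times we've been reporting are the gaps between successive calls.

```python
def frame_times_from_presents(present_timestamps_ms):
    """Gaps between successive present calls, in milliseconds."""
    return [later - earlier
            for earlier, later in zip(present_timestamps_ms,
                                      present_timestamps_ms[1:])]


# Hypothetical cumulative timestamps (ms) recorded at each present call.
presents_ms = [0, 16, 33, 50, 95, 112]
print(frame_times_from_presents(presents_ms))
```

Here the fourth interval stands out at 45 ms against a background of 16 to 17 ms. Whether that hiccup reaches the display in the same form depends on everything downstream of the present call.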

Since the frame production process is basically a loop, sampling at any point along the way ought to tell us how things are going. However, there are several potential complications to consider. One is the use of buffering later in the pipeline, which could help smooth out small rendering delays from one frame to the next. Another is the complicated case of multi-GPU rendering, where two GPUs alternate, one producing odd frames and the other churning out even frames. This very common load-balancing method can potentially cause delays when frames produced on the secondary GPU are transferred to the GPU connected to the display. Thornier still, Nvidia claims to have created a "frame metering" tech to smooth out frame delivery to the display on SLI configs—and that further complicates the timing. Finally, the issues we've noted with display refresh sync can play a part in how and when frames make it to the screen.
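
The multi-GPU case is worth a toy model. The sketch below assumes two GPUs that each need the same amount of time per frame and alternate frames, with the second GPU running at some fixed offset behind the first. The numbers and function names are hypothetical, and this is not a description of how any driver actually schedules work; it only shows how alternating GPUs can deliver frames at uneven intervals.

```python
def afr_completion_times(num_frames, gpu_frame_ms, gpu2_offset_ms):
    """Completion times (ms) for frames alternating between two GPUs."""
    times = []
    for i in range(num_frames):
        on_second_gpu = i % 2 == 1
        nth = i // 2 + 1  # how many frames this GPU has finished so far
        offset = gpu2_offset_ms if on_second_gpu else 0
        times.append(nth * gpu_frame_ms + offset)
    return times


def intervals(times):
    """Gaps between successive completion times."""
    return [b - a for a, b in zip(times, times[1:])]


# Each GPU takes 40 ms per frame in both cases (hypothetical values).
even = intervals(afr_completion_times(6, 40, 20))    # ideal half-frame offset
uneven = intervals(afr_completion_times(6, 40, 5))   # GPUs nearly in lockstep

print(even)
print(uneven)
```

With the ideal offset, frames arrive every 20 ms. With the GPUs nearly in lockstep, the intervals alternate between 5 ms and 35 ms. Both cases average out to the same 50 FPS, which is the sort of pattern an FPS counter would never reveal.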

So... yeah, Fraps is busted, right? Not exactly. You see, it's situated very close to the game engine in this whole process, and the internal simulation timing of the game engine determines the content of the frames being produced. Game animation is like a flipbook, and the contents of each page must advance uniformly in order to create the fluid illusion of motion. To the extent that Fraps' timing matches the internal timing of the game engine, its samples may be our truest indication of animation smoothness. We don't yet have a clear map of how today's major game engines track and advance their internal timing, and that is a crucial question. Fortunately, we do now have one other piece of the puzzle: some new tools that let us explore these issues at the ultimate end of the rendering pipeline, the display output. Let's have a look at them.