Inside the second with Nvidia’s frame capture tools
We’ve come a long way since our initial Inside the second article. That’s where we first advocated for testing real-time graphics and gaming performance by considering the time required to render each frame of animation, instead of looking at traditional FPS averages. Since then, we’ve applied new testing methods focused on frame latencies to a host of graphics card reviews and to CPUs, as well, with enlightening results.
The fundamental reality we’ve discovered is that a higher FPS average doesn’t necessarily correspond to smoother animation and gameplay. In fact, at times, FPS averages don’t seem to mean very much at all. The problem boils down to a weakness of averaging frame rates over the span of a whole second, as nearly all FPS-based tools tend to do. Allow me to dust off an old illustration, since it still serves our purposes well:
The fundamental problem is that, in terms of both computer time and human visual perception, one second is a very long time. Averaging results over a single second can obscure some big and important performance differences between systems.
To illustrate, let’s look at an example. It’s contrived, but it’s based on some real experiences we’ve had in game testing over the years. The charts below show the times required, in milliseconds, to produce a series of frames over a span of one second on two different video cards.
GPU 1 is obviously the faster solution in most respects. Generally, its frame times are in the teens, and that would usually add up to an average of about 60 FPS. GPU 2 is slower, with frame times consistently around 30 milliseconds.
However, GPU 1 has a problem running this game. Let’s say it’s a texture upload problem caused by poor memory management in the video drivers, although it could be just about anything, including a hardware issue. The result of the problem is that GPU 1 gets stuck when attempting to render one of the frames—really stuck, to the tune of a nearly half-second delay. If you were playing a game on this card and ran into this issue, it would be a huge show-stopper. If it happened often, the game would be essentially unplayable.
The end result is that GPU 2 does a much better job of providing a consistent illusion of motion during the period of time in question. Yet look at how these two cards fare when we report these results in FPS:
Whoops. In traditional FPS terms, the performance of these two solutions during our span of time is nearly identical. The numbers tell us there’s virtually no difference between them. Averaging our results over the span of a second has caused us to absorb and obscure a pretty major flaw in GPU 1’s performance.
Since we published that first article, we’ve seen a number of real-world instances were FPS averages have glossed over noteworthy performance problems. Most prominent among those was the discovery of frame latency issues in last Christmas’ crop of new games with the Radeon HD 7950. When we demonstrated the nature of that problem with slow-motion video, which showed a sequence that had stuttering animation despite an average of 69 FPS, lots of folks seemed to grasp intuitively the story we’d been telling with numbers alone. As a result, AMD has incorporated latency-sensitive methods into its driver development process, and quite a few other websites have begun deploying frame-latency-based testing methods in their own reviews. We’re happy to see it.
There’s still much work to be done, though. We discovered a couple of problems in our initial investigation into these matters, and we haven’t been able to explore those issues in full. For instance, we encountered concrete evidence of a weakness of multi-GPU setups known as micro-stuttering. We believe it’s a real problem, but our ability to quantify its impact has been affected by another problem: the software tool that we’ve been using to capture frame times, Fraps, collects its samples at a relatively early stage in the frame rendering process. Both of the major GPU makers, AMD and Nvidia, have told us that the results from Fraps don’t tell the whole story—especially when it comes to multi-GPU solutions.
Happily, though, in a bit of enlightened self-interest, the folks at Nvidia have decided to enable reviewers—and eventually, perhaps, consumers—to look deeper into the question of frame rendering times and frame delivery. They have developed a new set of tools, dubbed “FCAT” for “Frame Capture and Analysis Tools,” that let us measure exactly how and when each rendered frame is being delivered to the display. The result is incredible new insight into what’s happening at the very end of the rendering-and-display pipeline, along with several surprising revelations about the true nature of the problems with some multi-GPU setups.
How stuff works
Before we move on, we should take a moment to establish how video game animations are produced. At the core of the process is a looping structure: most game engines do virtually all of their work in a big loop, iterating over and over to create the illusion of motion. During each cycle through the loop, the game evaluates inputs from various sources, advances its physical simulation of the world, initiates any sounds that need to be played, and creates a visual representation of that moment in time. The visual portion of the work is then handed off to a 3D graphics programming interface, such as OpenGL or DirectX, where it’s processed and eventually displayed onscreen.
The path each “frame” of animation takes to the display involves several stages of fairly serious computation, along with some timing complications. I’ve created a horribly oversimplified diagram of the process below.
As you can see, the game engine hands off the frame to DirectX, which does a lot of processing work and then sends commands to the graphics driver. The graphics driver must then translate these commands into GPU machine language, which it does with the aid of a real-time compiler. The GPU subsequently does its rendering work, eventually producing a final image of the scene, which it outputs into a frame buffer. This buffer is generally part of a queue of two to three frames, as in our illustration.
What happens next depends on the settings in your graphics card control panel and in-game menus. You see, although the rendering process produces frames at a certain rate—one that can vary from frame to frame—the display operates according to its own timing. In fact, today’s LCD panels still operate on assumptions dictated by Ye Olde CRT monitors, as if an electron gun were still scanning phosphors behind the screen and needed to touch each one of them at a regular interval in order to keep it lit. Pixels are updated from left to right across the screen in lines, and those lines are refreshed from the top to the bottom of the display. Most LCDs completely refresh themselves according to this pattern at the common CRT rate of 60 times per second, or 60 Hz.
If vsync, or vertical refresh synchronization, is enabled in your graphics settings, then the system will coordinate with the display to make sure updates happen in between refresh cycles. That is, the system won’t flip to a new frame buffer, with new information in it, while the display is being updated. Without vsync, the display will be updated whenever a new frame of animation becomes ready, even if it’s in the middle of painting the screen. Updates in the middle of the refresh cycle can produce an artifact known as tearing, where a seam is visible between successive animation frames shown onscreen at once.
I sometimes like to play games with vysnc enabled, in order to avoid tearing artifacts like the one shown above. However, vsync introduces several problems. It caps frame rates at 60 Hz, which can interfere with performance testing (especially FPS-average-driven tests). Also, vsync introduces additional delays before a frame of animation makes it to the display. If a frame isn’t ready for display at the start of the current refresh cycle, its contents won’t be shown until the next refresh cycle begins. In other words, vysnc causes frame update rates to be quantized, which can hamper display updates at the very worst time, when GPU frame rates are especially slow. (Nvidia’s Adaptive Vsync feature attempts to work around this problem by disabling refresh sync when frame rates drop.)
We have conducted the bulk of our performance testing so far, including this article, with vsync disabled. I think there’s room for some intriguing explorations of GPU performance with vsync enabled. I’m not entirely sure what we might learn from that, but it’s a different task for another day.
At any rate, you’re probably getting the impression that lots happens between the game engine handing off a frame to DirectX and the content of that frame eventually hitting the screen. That takes us back to the limitations of one of our tools, Fraps, which we use to capture frame times. Fraps grabs its samples from the spot in the diagram where the game presents a completed frame to DirectX by calling “present,” as denoted by the orange line. As you can see, that point lies fairly early in the rendering pipeline.
Since the frame production process is basically a loop, sampling at any point along the way ought to tell us how things are going. However, there are several potential complications to consider. One is the use of buffering later in the pipeline, which could help smooth out small rendering delays from one frame to the next. Another is the complicated case of multi-GPU rendering, where two GPUs alternate, one producing odd frames and the other churning out even frames. This very common load-balancing method can potentially cause delays when frames produced on the secondary GPU are transferred to the GPU connected to the display. Thornier still, Nvidia claims to have created a “frame metering” tech to smooth out frame delivery to the display on SLI configs—and that further complicates the timing. Finally, the issues we’ve noted with display refresh sync can play a part in how and when frames make it to the screen.
So.. yeah, Fraps is busted, right? Not exactly. You see, it’s situated very close to the game engine in this whole process, and the internal simulation timing of the game engine determines the content of the frames being produced. Game animation is like a flipbook, and the contents of each page must advance uniformly in order to create the fluid illusion of motion. To the extent that Fraps’ timing matches the internal timing of the game engine, its samples may be our truest indication of animation smoothness. We don’t yet have a clear map of how today’s major game engines track and advance their internal timing, and that is a crucial question. Fortunately, we do now have one other piece of the puzzle: some new tools that let us explore these issues at the ultimate end of the rendering pipeline: the display output. Let’s have a look at them.
The FCAT tools
You may recall that we first talked to Nvidia’s Tom Petersen about frame latencies and multi-GPU micro-stuttering right when we first started looking at these things. To our surprise, Petersen had obviously been working on these matters before we spoke, because he very quickly produced a fairly robust presentation related to micro-stuttering and Fraps captures. That was about a year and a half ago. Turns out Peteresen and his team have been working on FCAT tools for about two years. We’ve had a few hints along the way that something along these lines was in the works, and that some tools might be presented to the press when the time was right. A couple of weeks ago, Petersen and another Nvidia rep visited Damage Labs to help us get up and running with a frame capture setup and the FCAT suite of tools.
This setup requires a few bits of very specific hardware and a fairly capable host PC.
Pictured above is a Datapath VisionDVI-DL video capture card, which is capable of capturing uncompressed digital video over a dual-link DVI link at very high resolutions and refresh rates. For our purposes, it’s able to collect each and every frame of a video sequence at resolutions up to 2560×1440 at a refresh rate of 60 Hz—enough to stress a high-end GPU config running the latest games. (2560×1600 doesn’t seem to work, for what it’s worth.) During such a capture, the card is streaming data at a rate of 422 MB/s, which is… considerable.
I can’t say the setup process for this card is easy. The thing didn’t want to work at all with our Intel X79 motherboard (although I’d rather not work with an Intel X79 motherboard myself, I must admit). We eventually got it going with an MSI Z77 board, but we had to disable a number of extra system devices, like USB 3.0 and auxiliary storage controllers, in order to get it working consistently.
The video output from the gaming system being tested connects to this Gefen dual-link DVI splitter, which feeds outputs to both the monitor and the DVI capture card. Nvidia told us its cards could avoid using a splitter by working in clone mode. However, clone mode is not always possible on Radeons in conjunction with CrossFire, so Nvidia chose to include a splitter in its FCAT config for reviewers.
We were advised that we’d need a storage subsystem capable of fast and truly sustained transfer rates, so we turned to the folks at Corsair, who kindly supplied four Neutron SSDs for our capture rig. At Nvidia’s suggestion, we attached them to an Intel storage controller and put them into a RAID 0 config. If you like round numbers, this array is almost a terabyte of storage capable of writing at almost one gigabyte per second.
Which ain’t bad. In fact, it’s shockingly good and consistent; we virtually never saw dropped frames once our capture setup was configured properly. I suspect this RAID could easily go faster if Intel storage controllers had more than two 6Gbps SATA ports available.
Once you have the ability to capture each and every frame of animation streaming out of a video card at will, you’re already well down the road to some interesting sorts of analysis. You can play back the sequence exactly as it looked the first time around, slow it down, speed it up, pause, and step through frame by frame. You can even correlate individual frames of animation to spikes recorded in Fraps and things like that. But what if you want to measure the timing of each and every GPU frame coming to the display?
For that purpose, Nvidia has developed an overlay program that inserts a colored bar along the left-hand side of each frame rendered by the GPU. These colors are inserted in a specific sequence of 16 distinct hues and serve as a sort of watermark, so each individual frame can be identified in sequence.
This gets complicated because, remember, with vsync disabled, the “frames” produced by the GPU don’t correspond directly to the video “frames” displayed onscreen. In the example above, six GPU frames are spread across four display frames. The GPU is producing frames slightly faster than 60 FPS, or 60 Hz, during this span of time. The GPU frame marked “green” spans two video frames, occupying the bottom half of one and a small slice of the top of the next one, before the GPU switches to a new buffer with the aqua frame. And so on.
We used VirtualDub for the captures. If you simply capture this sort of output with the overlay enabled, you can page through individual video frames to get a clear sense of how frame delivery is happening. Very, very cool stuff. The next bit, though, is kind of magic.
The FCAT extractor tool scans through any input video with the overlay present and produces a CSV file with information about how many scan lines of each color are present in each frame of video. This file contains the raw data needed for all sorts of post-processing, including figuring out which GPU frames span multiple video frames and the like.
Interestingly enough, when we first tried the extractor tool with videos captured from a Radeon HD 7970, it didn’t work quite properly. We asked Petersen about the problem, and he eventually found that the extractor was having trouble because the overlay colors being displayed by the Radeon weren’t entirely correct. The extractor routine had to be adjusted for looser tolerances in order to account for the variance. That variance is mathematically very minor and not easily perceptible, but it is real. To the right is a pixel-doubled and heavily contrast-enhanced section of the (formerly) pink overlay output from the Radeon in Far Cry 3. You can probably see that there is some noise in it. The same pink overlay section from the GeForce GTX 680 is all the same exact color value. Not sure what that’s worth or what the cause might be, but it’s kind of intriguing.
After the overlay info has been extracted, the next step is to process it in various ways. Petersen has created a series of Perl scripts that handle that job. They can spit out all sorts of output, including a simple set of successive frame times that we can use just like Fraps data. The FCAT scripts include lots of options for processing and filtering the data, and since they’re written in Perl, they can be modified easily. One thing they’ll do is use Gnuplot to graph results. In fact, by default, the FCAT scripts produce two graphs that will look fairly familiar to TR readers: a frame time plot and a percentile curve.
Pardon the extreme compression, but the default plot size is ginormous, and I’ve not yet sorted out how to modify it. One nice thing the FCAT frame time plot does is correlate each frame time distribution to the scene time, something you won’t see in our current Excel plots.
The percentile curves will look inverted if you’re used to ours, because FCAT converts them into FPS terms. I know that option will be popular with some folks who still find the concept of FPS more intuitive to understand.
We haven’t yet converted to using FCAT’s visualization tools in place of our usual Excel sheets, but there is potential for automation here that extends well beyond what we currently have in place. If these tools are to be widely used in the industry—or, heck, even consistently used in several places—then automation of this sort will no doubt be needed. Processing this type of data isn’t trivial; it’s a long way from throwing together a few FPS averages.
Speaking of which, I should say that my summary of FCAT capture and analysis boils down a much more complex process. Configuring everything to work properly is a tedious affair that involves synchronizing EDIDs for the display and capture card behind the splitter, doing just the right magic to ensure good video captures without dropped or inserted frames, and a whole host of other things.
With that said, it’s still extremely cool that Nvidia is enabling this sort of analysis of its products. The firm says its FCAT tools will be freely distributable and modifiable, and at least the Perl script portions will necessarily be open-source (since Perl is an interpreted language). Nvidia says it hopes portions of the FCAT suite, such as the colored overlay, will be incorporated into third-party applications. We’d like to see Fraps incorporate the overlay, since using it alongside the FCAT overlay is sometimes problematic.
Now, let’s see what we can learn by making use of these tools.
We were only able to get the FCAT overlay working reliably alongside Fraps in four of the nine games from our latest graphics test suite. Using both tools was important to us, simply because we wanted to correlate Fraps and FCAT data in order to see how they compare. We were able to do so with some games, but not others. We burned quite a bit of time converting our GPU test rigs from Windows 8 to Windows 7 in order to improve compatibility between Fraps and the overlay, but doing so didn’t yield any real improvement.
Furthermore, the data you’ll see on the following pages typically comes from just a single test run for each graphics solution. Our usual practice has been to use five test runs per game per solution, but time constraints and additional workflow complications made that sort of sampling impractical for this article. Heck, the FCAT data sets for a single run from each config across five games was over 630GB, if you include the raw video. Such problems can be managed with moar hardware, but we haven’t built an external FCAT storage array quite yet. The single test runs we’ve included should suffice for the sort of analysis we want to do today. Just keep in mind that this is not one of usual GPU reviews based on substantially more testing.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Our test systems were configured like so:
|Memory size||16GB (4 DIMMs)|
DDR3 SDRAM at 1600MHz
|Chipset drivers||INF update
Rapid Storage Technology Enterprise 126.96.36.1999
with Realtek 188.8.131.5262 drivers
Deneva 2 240GB SATA
Service Pack 1
|1006||1059||1502||2 x 2048|
Radeon HD 7970 GHz
13.3 beta 2
|Dual Radeon HD
13.3 beta 2
|1000||1050||1500||2 x 3072|
Thanks to Intel, Corsair, and Gigabyte for helping to outfit our test rigs with some of the finest hardware available. AMD, Nvidia, and the makers of the various products supplied the graphics cards for testing, as well.
Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.
In addition to the games, we used the following test applications:
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Fraps vs. FCAT: Skyrim
We’ll start with our Skyrim test because it’s very repeatable and has proven to be almost impossible to complete without a few spikes in frame times. The numbers you’ll see below come from both Fraps and FCAT captures from the exact same test run. I simply ran Fraps and the FCAT overlay together and recorded video on the capture system while benchmarking with Fraps. After the fact, I was able to watch for the start and end points of the Fraps test in the video and correlate them more or less exactly with the frames we analyzed to produce the FCAT results.
The outcome should give us a sense of what’s happening at two points in the rendering process: when the game engine hands off a frame to DirectX (Fraps) and when the frame hits the display (FCAT).
The plots above show how closely correlated the Fraps and FCAT frame time distributions appear to be. Click through the buttons above to see the results for each config tested.
I expect some folks will be ready to give me the beating I so richly deserve for presenting the data in this way, by which I mean “in a really small image” and “without sufficient color contrast.” I apologize. I was limited by both time and ability. And by Microsoft Excel, which should not escape blame. I fully endorse the use of “Ctrl + Mouse-wheel up” to zoom in on the frame time plots for better visibility.
What you should be seeing is that three of the four configs have very close correlations between the Fraps and FCAT numbers, and that plots for the FCAT results are much tighter, with less frame-to-frame variance than the Fraps numbers have. That suggests there’s some natural variance in the dispatch of frames coming from the game engine (closer to where Fraps measures) that gets smoothed out by buffering later in the pipeline.
Now, that doesn’t mean one set of results is “correct” and the other
“incorrect.” As far as we know, both are correct for what they measure, at different points in the pipeline. One thing we’ll want to investigate further is those spots where the Fraps plot shows latency spikes that the FCAT plot does not. Keep that in mind for later.
On another front, FCAT looks to be giving us some important additional insight about the Radeon HD 7970 CrossFire setup: its Fraps results look like the other solutions’ Fraps plots, but its FCAT output is much “fuzzier,” with larger frame-to-frame swings. Curious, no? That’s probably not a good outcome, but it does map well to our expectations that Fraps results may not capture the extent of the timing differences introduced by multi-GPU load-balancing.
Let’s see how these data look in our latency-focused performance metrics.
Using a traditional FPS average, the SLI and CrossFire setups would appear to perform nearly twice as well as the single-GPU solutions. However, when we switch to the latency-oriented 99th percentile frame time, the Radeon HD 7970 CrossFire config proves not to be so hot. The 99th percentile frame time is just the threshold below which 99% of all frames were rendered; we can look at the fuller latency curve for a better sense of what went wrong.
The FCAT latency curve for the 7970 CrossFire config has that classic profile shown by multi-GPU micro-stuttering. About 50% of the frame times are inordinately low, and the other half are inordinately high. As we approach the 99th percentile on the FCAT latency curve, the 7970 CrossFire config’s frame times climb to within a few milliseconds of the single 7970’s. Uh oh.
Our measure of “badness” often acts as an anchor for us, preventing us from getting too bogged down in the weeds of other analysis. This metric adds up any time spent working on frames that take longer than a given threshold. Our primary threshold here, 50 milliseconds, equates to about 20 FPS. We figure any animation that dips below the 20 FPS mark is in danger of looking choppy. Also, 50 ms maps to three vertical refresh intervals on a 60 Hz display. The other two, 16.7 ms and 33.3 ms, map to 60 FPS with a single refresh interval and 30 FPS with two refresh intervals, respectively.
The truth is that all these solutions are incredibly quick in Skyrim, with only a few milliseconds spent above our main threshold by any of them. That jibes with our sense that all these cards ran this test pretty smoothly, with only an occasional hiccup in each case. You may also notice that the Fraps data tends to show more time spent beyond each threshold than FCAT does. That should be no surprise given the larger spikes visible in the Fraps plots. Let’s see what we can make of that fact.
Anatomy of a stutter
We have two directions we can go at this point: pursuing the multi-GPU micro-stuttering issue, and looking at cases where Fraps latency spikes aren’t reflected as strongly in the frame delivery measured by FCAT. Let’s start with the latter, and then we’ll circle back to the micro-stuttering problem.
We’re not trying to pick on the Radeon HD 7970 by using it as an example here. As you saw on the last page, each of the configs tested had one or more of these latency spikes during the duration of the test run. The 7970 plot and video just give us a nice test case for what happens when Fraps and FCAT measurements don’t match.
We’ve zoomed in on the second of the two spikes in the 7970’s Fraps plot. As you can see, there’s a very small spike to 22 milliseconds in the FCAT plot, but a much larger spike, to nearly 50 ms, in the Fraps plot at the same spot. The question is: if frame delivery is still relatively even, as the FCAT results indicate, does it matter whether there was a spike in the Fraps data? As you might imagine, I was like a kid in a candy store when I got to pull up the video of the test run and see how it looked as I paged through the animation.
What I saw was… a quick but easily perceptible disruption in the animation, something much more on the order of the 50-ms delay indicated in the Fraps data. I’ve attempted to share this eureka moment with you by snipping out a brief video clip of the vaunted stutter in action. Since YouTube is gonna convert the video to 30 FPS no matter what, I took the liberty of slowing the source video to 15 FPS, in the hopes of keeping some semblance of each source frame intact. So hit play and buckle up. You should see a skip at about halfway through this six-second extravaganza.
Yeah, so YouTube pretty much adds its own stutter to the mix. Perhaps not the best tool for this job. Still, if you can get the video to play smoothly, the momentary stutter is real and perceptible. Although it’s still pretty minor in the grand scheme, it confirms for me what Andrew Lauritzen has argued about the value of Fraps data. He was discussing a different Skyrim video of ours at the time, but the principle remains the same:
Note that what you are seeing are likely not changes in frame delivery to the display, but precisely the affect of the game adjusting how far it steps the simulation in time each frame. . . . A spike anywhere in the pipeline will cause the game to adjust the simulation time, which is pretty much guaranteed to produce jittery output. This is true even if frame delivery to the display (i.e. rendering pipeline output) remains buffered and consistent. i.e. it is never okay to see spikey output in frame latency graphs.
Even with buffering smoothing out frame delivery as measured by FCAT, the spike in the Fraps plot indicates a disruption in timing that has an impact on the content of the frames being displayed and thus on the smoothness of the animation.
We’ve been hearing an argument out of AMD about the value of Fraps data that I should address in this context. Folks I’ve talked to there have insisted to me that they’ve seen cases where a spike in Fraps frame times doesn’t translate into an interruption in animation, seemingly casting aspersions on the value of Fraps data. After talking to AMD’s David Nalasco last night, I think I understand this position better.
I believe Nalasco would modify Andrew Lauritzen’s statement above to: “it is sometimes okay to see spiky output in frame latency graphs,” simply because not every little hiccup or spike translates into a flaw in the animation that one can perceive. There are a couple of possible reasons why that could be the case. One has to do with the tricky question of what constitutes a stutter and what the threshold for human perception of a problem might be. Small interruptions may not matter if no one will notice them, especially with the display refresh cycle complicating how frames are presented. A related technical question is how large an interruption has to be in Fraps—by back-pressure in the rendering queue preventing the submission of a new frame—before a perceptible stutter is created. This issue is complicated by the varying ways in which game engines keep and advance the timing for their simulations. Some engines may be more tolerant of small bubbles in the pipeline if they advance time in regular intervals from frame to frame, even when those frames are being submitted closely together to refill the queue after a hiccup.
I won’t argue with any of that. However, Nalasco also concedes that a sufficiently large frame time spike in Fraps will indeed translate into an interruption in the game’s animation. I think he just wants folks to avoid obsessing over small spikes in frame time charts and to keep in mind that perception is the final arbiter of animation smoothness.
Which, you know, is why we create silly little videos like this one, examining a single case of a 50-millisecond frame and the visual interruption it appears to create.
However—and this is a huge caveat—we have some trepidation about declaring even this one particular example a definitive triumph for Fraps-based measurements. You see, like most folks who test gaming performance, we’ve removed the built-in frame rate cap in Skyrim. We already know that doing so causes some funky timing quirks for things like the game’s AI, but it may also modify the game’s fundamental timekeeping method for all of its physical simulation work. (The variable we’ve modified in order to “uncap” Skyrim is called “iPresentInterval”, and we’ve changed it from “1” to “0.” You may recall that Fraps measures when the game calls Present(). Hmm.) If our uncapping effort has changed the way time is kept in the game, it may have created the possibility of frame-to-frame timing issues that one would usually not see with the game engine’s default timing method. This thought occurred to me on an airplane, on the way out to GDC, so I haven’t been able to dig deeper into this issue yet. I definitely think it merits further investigation, and the frame-by-frame playback and analysis possible with the FCAT tool set should be a big help when the time comes.
Let’s look at how Fraps and FCAT data compare in a few more games, while keeping an eye on the multi-GPU systems for evidence of micro-stuttering problems. Then, we’ll address micro-stuttering in little more depth.
Borderlands 2 is noteworthy in this context not just because it’s a great game, but also because it’s based on the incredibly popular Unreal engine, like a whole ton of other titles. Interestingly enough, our results for this game show incredibly close correspondence between Fraps timing and FCAT frame delivery. Yes, that’s what you’re seeing in the plots above—not just a single distribution, but two that almost entirely overlap. If you look closely, you can see that even the spikes tend to overlap. The peaks are a little higher in Fraps in several cases, but usually not by much. We do see a little “fuzziness” at a few spots in the Radeon HD 7970 CrossFire plot from FCAT, which likely indicates some micro-stuttering, but it’s relatively minimal.
Every one of our metrics confirms that Fraps and FCAT are virtually in unison here. That’s a good thing, because it should mean that the content of frames being displayed will match the timing of their appearance onscreen quite closely. It also gives us quite a bit of confidence that we’re measuring the “true” performance of these graphics solutions in Borderlands 2 between these two tools.
Guild Wars 2
This one is a little different. According to Fraps, the frame-time distributions for all tested configs are pretty spiky. By FCAT’s account, the single-GPU solutions are nice and tight, with relatively little variance overall; however, both of the multi-GPU offerings are at least as spiky as in the Fraps data. The Radeon HD 7970 CrossFire config’s Fraps and FCAT plots match up very closely, while the GTX 680 SLI’s FCAT results show even more variance than the Fraps output.
Frankly, we need to spend a little more time looking into this one. Doing so should help us understand the impact of the divergence between the Fraps and FCAT results for the single-GPU cards. This game is especially tricky because it’s an MMO with a client-server relationship—and, remember, we have only a single test run for each card in this case. Subjectively, the animation didn’t seem entirely smooth to us on any of the cards, but I’d like to spend more time with the captured video before drawing any conclusions. For now, let’s move on and take a closer look at some of the multi-GPU issues we’ve encountered.
Multi-GPU issues: Micro-stuttering, runt frames, and more
So what exactly was going on with those strangely “cloudy” frame time plots for the 7970 CrossFire, especially in Skyrim? Let’s look more closely at a small snippet of time from each test run, starting with the Fraps results.
The plots for the two multi-GPU solutions above both show obvious evidence of multi-GPU micro-stuttering, with frame times oscillating in that familiar sawtooth pattern. This timing issue is caused by the preferred method of balancing the load between two GPUs, alternate frame rendering (AFR), in which one chip renders the even-numbered frames and the other renders odd-numbered frames, in interleaved fashion. When the two GPUs aren’t exactly in sync, frame delivery becomes uneven. I have a hard time getting excited about this problem in this particular instance simply because the frame times involved are very short, so the difference between them is small. Still, a multi-GPU config with this sort of jitter is really no quicker than those longer frame times in the pattern. In some cases, with too much jitter, multi-GPU solutions may not be much faster than a single GPU of the same type.
So that’s that, but now look what happens with the two multi-GPU configs in the FCAT results, where we’re measuring frame delivery times, not frame dispatch times.
The jitter pattern is eliminated on the GeForce GTX 680 SLI setup—evidence that Nvidia’s frame metering technology for SLI is doing its job. This technology tracks frame delivery times and inserts very small delays as needed in order to ensure even spacing of the frames that are displayed. The FCAT tools give us the ability to confirm that frame metering works as advertised. Remember how I said the FCAT release was a bit of enlightened self-interest? Yeah, here’s where the self-interest comes into the picture. Nvidia gets to show off its frame metering tech.
Meanwhile, our FCAT results suggest the jitter on the Radeon HD 7970 CrossFire setup is much more severe than Fraps detected. The shorter frame times in the pattern are literally a fraction of a millisecond, while the longer frame times are effectively twice what an FPS average of this section might suggest. How does a 0.3 millisecond frame look onscreen? Something like this:
In that example, there are portions of five GPU frames onscreen at once, as the overlay indicates. The little aqua and silver snippets are tiny portions of what are, presumably, fully rendered frames from the GPU, but the timing is so off-kilter than only a few scan lines of them are shown onscreen. Here’s a close-up of one of these “runt” frames (Nvidia’s term for them) causing tearing in BF3.
The runt frame appears to have real content; it just isn’t onscreen long enough to add any substantial new information to the picture.
Above is an illustration of another snag we encountered with the 7970 CrossFire config in multiple games. This is a three-video frame sequence from BF3 with the FCAT overlay enabled. The expected color sequence for the overlay here is red, teal, navy, green, and aqua. What you see displayed, though, is red immediately followed by navy and then aqua. The teal and green frames aren’t even runts here—they’re simply not displayed at all. They’re just dropped.
Here’s another funky anomaly we encountered intermittently with the CrossFire setup. The FCAT analysis script was reporting two-pixel-tall “frames” that were out of the expected sequence, which was a bit of a puzzle. If you page through the video sequence shown above, everything looks correct at first glance, with the proper sequence of fuchsia, yellow, orange, white, and lime. However, if you zoom in on the top-left corner of the last video frame of the sequence, you’ll see this:
That’s a two-pixel yellow overlay bar and, to its right, apparently the other content of an out-of-sequence frame. When this happens, the out-of-place imagery always shows up at the top of the screen like this. Based on lots of zooming and squinting, I believe the content of the two scanlines here matches the timing of the yellow-marked GPU frame from the first video frame in the sequence. Somehow, it’s “leaking” into the top of this video frame. Not a huge problem, frankly, but it’s an apparent bug in CrossFire frame delivery.
So what do we make of the problems of runt and dropped frames? They’re troublesome for performance testing, because they get counted by benchmarking tools, helping to raise FPS averages and all the rest, but they have no tangible visual benefit to the end user.
Nvidia’s FCAT scripts offer the option of filtering out runt and dropped frames, so that they aren’t counted in the final performance results. That seems sensible to me, so long as it’s done the right way. The results you’ve seen from us on the preceding pages were not filtered in this fashion, but we can apply the filters to show you how they affect things. By default, the script’s definition of a “runt frame” is one that occupies 20 scan lines or less, or one that comprises less than 25% of the length of the prior frame. I think the 20-scan-line limit may be a reasonable rule of thumb, but I’m dubious about the 25% cutoff. What if the prior frame represented a big spike in frame rendering times?
Fortunately, the filtering rules in the FCAT scripts are easily tweakable, so we can define our own thresholds for these things. I expect you’ll see lots of results today and in the coming weeks that accept FCAT’s default filtering rules, though, so let’s take a look at how they affect some test data. Here are the Fraps and FCAT results for the Radeon HD 7970 CrossFire setup in Skyrim, followed by the filtered version from FCAT.
Filtered in this way, the CrossFire config loses lots of frames from its output. You can imagine what that does to its FPS average:
Interestingly, even the 99th percentile frame time is affected slightly by the removal of so many super-short-time frames, whose presence shifts the cutoff point for 99% of frames rendered.
So, yeah, accounting for these frame delivery problems with filtering really alters the relative performance picture. By contrast, the SLI setup is barely touched by the filters in this case. We did see a few runt frames from the SLI rig in both Skyrim and Guild Wars 2, but they never amounted to much.
Battlefield 3 multi-GPU performance
Let’s take a look at another game where multi-GPU micro-stuttering comes into play. This time, we’ve left out the Fraps numbers so we can concentrate on both raw and filtered FCAT results.
Obviously, the Radeon HD 7970 CrossFire setup cranks out more frames than anything else in this test, but just as clearly, it has some pronounced jitter going on. Here’s an extreme close-up of some of the worst of it:
FCAT’s filtering removes those runt frames from the equation, and we’re left with substantially altered performance results for the CrossFire rig. As you can see, none of the other solutions’ results are affected at all. Their filtered and raw frame time plots and latency curves entirely overlap, and the rest of their scores are identical.
With the FCAT filtering applied, the Radeon HD 7970 CrossFire setup’s average FPS drops precipitously. Even without filtering, the longer frame times in its jitter pattern affect its latency curve negatively. Add in filtering, and the CrossFire rig’s 99th percentile frame time drops back to match a single Radeon’s almost exactly. Filtered or not, there really is no measurable benefit to having a second graphics card in the mix.
With that said, we have to point out that in this game at these settings, the 7970 CrossFire rig performs just fine, speaking both subjectively and going by the resulting numbers. In our measure of “badness,” time spent working on frames beyond 50 milliseconds, none of the cards register even a single blip. They spend no substantial time working on frames that take longer than 33 ms, either.
So now what?
This first take on Nvidia’s FCAT tools and the things they can measure is just a beginning, so I don’t have many conclusions for you just yet. But this is an awfully good start to a new era of GPU benchmarking. We can now see into the early stages of the rendering pipeline with Fraps and then determine exactly what’s happening at the other end of the pipe when visuals are delivered to the display with FCAT. We can correlate the two and see how much leeway there is between them. And we also have videos that allow us to review the resulting animation with frame-by-frame precision, to show us the exact impact of any spikes or anomalies in the numbers we’re seeing. These are the best tools yet for understanding real-time graphics performance, and they offer the potential for lots of new insights.
The fact that Nvidia has decided to release analytical tools of this caliber to the general public is remarkable. Yes, the first results of those tools have detected some issues with its competition’s products, but who knows what other problems we might uncover with them down the road? Nvidia is taking a risk here, and the fact it’s willing to do so is incredibly cool.
Going forward, there’s still tons of work to be done. For starters, we need to spend quite a bit more time understanding the problems of multi-GPU micro-stuttering, runt frames, and the like. The presence of these things in our benchmark results may not be all that noteworthy if overall performance is high enough. The stakes are pretty low when the GPUs are constantly slinging out new frames in 20 milliseconds or less. I’ve not been able to perceive a problem with micro-stuttering in cases like that, and I suspect those who claim to are seeing extreme cases or perhaps other issues entirely. Our next order of business will be putting multi-GPU teams under more stress to see how micro-stuttering affects truly low-frame-rate situations where animation smoothness is threatened. We have a start on this task, but we need to collect lots more data before we are ready to draw any conclusions. Stay tuned for more on that front. I’m curious to see what other folks who have these tools in their hands have discovered, too.
The FCAT analysis has shown us that Nvidia’s frame metering tech for SLI does seem to work as advertised. Frame metering isn’t necessarily a perfect solution, because it does insert some tiny delays into the rendering-and-display pipeline. Those delays may create timing discontinuities between the game simulation time—and thus frame content—and the display time. They also add a minuscule bit to the lag between user input and visual response. But then there’s apparently a fair amount of low-stakes timing slop in PC graphics, as the gap between our Fraps and FCAT results (in everything but the Unreal-engine-based Borderlands 2) has demonstrated. The best thing we can say for frame metering is that it makes the Fraps and FCAT times for SLI solutions appear to correlate about like they do for single-GPU solutions. That’s a really high-concept way of saying that it appears to work pretty well.
We do want to be careful to note that frame delivery as measured by FCAT is just one part of a larger picture. Truly fluid animation requires the regular delivery of frames whose contents are advancing at the same rate. What happens at the beginning of the pipeline needs to match what happens at the end. Relying on FCAT numbers alone will not tell that whole story; we’d just be measuring the effectiveness of frame metering techniques. We’ve come too far in the past couple of years in how we measure gaming performance to commit that error now.
Ideally, we’d like to see two things happen next. First, although FCAT’s captures are nice to have and Nvidia’s scripts provide a measure of automation, using these tools is a lot of work and generates huge amounts of data. It would be very helpful to have an API from the major GPU makers that exposes the true timing of the frame-buffer flips that happen at the display. I don’t think we have anything like that now, or at least nothing that yields results as accurate as those produced by FCAT. With such an API, we could collect end-of-pipeline data much easier and use frame captures sparingly, for sanity checks and deeper analysis of images. Second, in a perfect world, game developers would expose an API that reveals the internal simulation timing of the game engine for each frame of animation. That would allow us to do away with grabbing the Present() time via Fraps and end any debate about the accuracy of those numbers. We’d then have the data we need to correlate with precision the beginning and ending of the pipeline and to analyze smoothness—or, well, for someone who’s smarter than us about the tricky math of a rate-match problem and the perceptual thresholds for smooth animation to do so.
Follow me on Twitter for shorter ramblings.