Some refinements to our methods
A few months ago, we reconsidered the way we test video-game performance and proposed some new methods in the article Inside the second: A new look at game benchmarking. The basic argument of that article was that the traditional approach of measuring speed in frames per second has some pretty major blind spots. For instance, one second is an eternity in terms of human perception. A bunch of fast frames surrounded by a handful of painfully slow ones can average out to an "acceptable" rate in FPS, even when the fluidity of the game has been interrupted. (We opened a whole other can of worms when we applied these insights to multi-GPU systems, but that is a story for another day.)
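
To put numbers on that blind spot, here's a quick back-of-the-envelope sketch in Python (the frame times are invented for illustration, not pulled from our testing): a single half-second hitch can hide inside a perfectly healthy-looking 60 FPS average.

```python
# One made-up second of gameplay: 59 quick frames plus a 500 ms hitch.
frame_times_ms = [500.0] + [500.0 / 59] * 59   # 60 frames, 1000 ms total

fps = len(frame_times_ms) / (sum(frame_times_ms) / 1000)
print(f"{fps:.0f} FPS average")                      # 60 FPS: looks fine
print(f"{max(frame_times_ms):.0f} ms worst frame")   # 500 ms: it isn't
```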

We weren't quite sure what folks would think of our proposed new methods, but the response so far has been overwhelmingly positive. Most folks embraced the idea of a new approach, and many of you wrote in to offer your suggestions on how we might improve our methods going forward. Since then, several things have happened.

For one, while I was preoccupied with reviewing new CPUs, Cyril took the ball and ran with it, testing both Battlefield 3 and Skyrim using our proposed new methods. Folks seemed to like those articles, and in both cases, Cyril was able to pinpoint performance issues that a simple FPS measurement would have missed.

Meanwhile, behind the scenes, in conversations with TR editors and others, I've slowly sifted through your suggestions to figure out which of them might prove worthwhile to us. We've rejected some interesting ideas simply because we think they'd be too complicated for mass consumption, and we've passed on others because they didn't necessarily apply to the sort of performance we're after. The goal of a real-time graphics system is to produce frames regularly at relatively low latencies, within a window established by the limits of display technology and human perception. Measuring properties like "variance" without reference to those realities doesn't appeal to us.

We have come up with one refinement to our methods that we think is helpful, though. In past articles, in order to highlight cases where a particular config ran into performance problems, we reported the number of frame times longer than a given threshold for each card, usually 50 ms. We more or less pulled that number out of a hat, but 50 ms corresponds to about 20 FPS at a steady rate, which we think is slow enough that the illusion of motion is threatened. Too many frame latencies beyond about 50 ms wouldn't produce a good experience in most games. By counting the number of frame times above 50 ms for each config, we were able to offer a sense of which ones had potentially problematic performance (picked a peck of pickled peppers).
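
In code, that original metric amounts to a simple count above a cutoff. Something like this rough Python sketch (the function name and sample data are mine, for illustration):

```python
THRESHOLD_MS = 50  # ~20 FPS at a steady rate

def frames_beyond_threshold(frame_times_ms, threshold=THRESHOLD_MS):
    """The old metric: count the frames that took longer than the threshold."""
    return sum(1 for t in frame_times_ms if t > threshold)

# Five illustrative frame times in milliseconds; two exceed 50 ms.
print(frames_beyond_threshold([16.7, 16.7, 55.0, 16.7, 70.0]))  # 2
```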

This approach, though, has two problems. First, in certain cases where the prevailing frame times rose above 50 ms for most GPUs, the faster GPU would of course produce more frames above 50 ms, simply because it produced more frames overall in the same span of time. We didn't want to penalize the faster solution, so we had to be very careful about how we set our threshold in each test scenario.

The second problem is related: a simple count of the frame times longer than a certain threshold fails to consider the time element involved. For instance, take the two example performances below. They're fabricated but possible.

The first card, the ReForce, produces three frames of 51 milliseconds each during its test run. That's not great, but three frames at 51 ms probably wouldn't interrupt the flow of a game too badly. The second card, the Gyro, has only one long-latency frame, but it's a doozy: 200 ms, a fifth of a second and an undeniable interruption in gameplay. Here's how our long-latency frame count would look for these two cards:

Whoops. The Gyro comes out looking better in that chart, even though it's obviously doing a poorer job of delivering fluid motion. The solution we've devised? Rather than counting the number of frames above 50 ms, we can add up all of the time spent working on frames beyond our 50-ms threshold. For our example above, the outcome would look like so:

Those three 51-ms frames only contribute 3 ms to the total time spent waiting beyond our threshold, while that one 200-ms frame contributes much more. I think this result captures the relative severity of the interruptions in gameplay fluidity quite nicely. This technique also does away with any concerns about the faster card being penalized for producing more frames.
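
Expressed the same way, the new metric adds up only the portion of each frame that spills past the cutoff. Here's a rough sketch building on the count function above (again, the names and numbers are just for illustration):

```python
def time_beyond_threshold(frame_times_ms, threshold=THRESHOLD_MS):
    """The new metric: total ms spent past the threshold. A 51 ms frame
    adds 1 ms; a 200 ms frame adds 150 ms."""
    return sum(t - threshold for t in frame_times_ms if t > threshold)

# The fabricated example above. Fast frames are omitted, since they
# contribute nothing to either metric.
reforce = [51, 51, 51]   # three mildly long frames
gyro = [200]             # one severe hitch

print(frames_beyond_threshold(reforce), frames_beyond_threshold(gyro))  # 3 1
print(time_beyond_threshold(reforce), time_beyond_threshold(gyro))      # 3 150
```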

I should note that, although we cooked up this new method months ago in a frenetic conversation at Starbucks during IDF, a TR reader named Olaf later wrote in, in response to one of Cyril's articles, and pointed out this exact problem: the missing time element in our frame count. Olaf, you nailed it. We're going with the technique of adding up time spent beyond 50 ms from here on out.

The time element of individual frames also scuttled one of our favorite suggestions for augmenting the presentation of our results: histograms showing the distribution of frame times. At first, that seemed like a nice idea. However, when we actually created one, it looked like so:

The problem? These are real data from our tests, and as you'll see later, what separates the performance of some GPUs from others here is an abundance of long frame times in some cases. Some of the GPUs devote quite a bit of time to processing long-latency frames, so those frames are very important to consider. Yet the severity of those long-latency frames is entirely obscured in this histogram. As a simple count, they're overwhelmed in the chart by the many thousands of low-latency frames produced by all of the GPUs. It's a purty picture, but I don't think it adds much to our analysis.
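
To show what goes wrong, here's a quick sketch with made-up numbers (not our actual test data): a handful of severe hitches is nearly invisible in a count-based histogram, even when it accounts for a sizable slice of the total rendering time.

```python
import numpy as np

fast = np.full(3000, 16.7)   # ~60 FPS frames, in milliseconds
slow = np.full(40, 150.0)    # occasional severe hitches
frame_times = np.concatenate([fast, slow])

counts, edges = np.histogram(frame_times, bins=np.arange(0, 200, 10))
print(counts)  # 3000 frames in the 10-20 ms bin dwarf the 40 at 150-160 ms

print(f"share of frames: {len(slow) / len(frame_times):.1%}")    # ~1.3%
print(f"share of time:   {slow.sum() / frame_times.sum():.1%}")  # ~10.7%
```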

That's a shame, because I was ready to bust out the fancy stuff:

Bam! Pow!

But ultimately pointless. We'll keep looking for ways to better analyze and present our results in the future. I think what we've developed so far is pretty strong, though, with our one new refinement.

For what it's worth, I've also rejiggered things behind the scenes a little bit to make sure that, where possible, we're analyzing all five test runs from each game separately and then reporting the median result. That way, the amount of time spent beyond 50 ms is the time spent during a single, representative run, not the result of an occasional system performance blip due to a background task. The one place where that doesn't work is the 99th percentile frame times, where we've found we get more coherent results by analyzing the data from all five runs as a group.
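
For the curious, that aggregation works out to something like the sketch below, with synthetic data standing in for real runs (this is my shorthand, not our actual scripts): score each run by its time beyond 50 ms and report the median run, but pool all five runs before reading off the 99th percentile.

```python
import random
import statistics

THRESHOLD_MS = 50

def time_beyond_threshold(frame_times_ms, threshold=THRESHOLD_MS):
    return sum(t - threshold for t in frame_times_ms if t > threshold)

# Five synthetic runs of the same test: mostly ~17 ms frames, with the
# occasional long one mixed in.
random.seed(1)
runs = [[random.choice([17] * 20 + [60, 90]) for _ in range(1000)]
        for _ in range(5)]

# Median across runs, so a one-off background-task blip in a single run
# can't skew the headline number.
median_time_beyond = statistics.median(time_beyond_threshold(r) for r in runs)

# The 99th-percentile frame time is the exception: pool all five runs and
# read the percentile from the combined distribution.
pooled = sorted(t for run in runs for t in run)
percentile_99 = pooled[int(len(pooled) * 0.99)]

print(median_time_beyond, percentile_99)
```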