As the second turns: the web digests our game testing methods
A funny thing happened over the holidays. We went into the break right after our Radeon vs. GeForce rematch and follow-up articles had caused a bit of a stir. Also, our high-speed video had helped to illustrate the problems we’d identified with smooth animation, particularly on the Radeon HD 7950. All of this activity brought new attention to the frame latency-focused game benchmark methods we proposed in my "Inside the second" article over a year ago and have been refining since.
As we were busy engaging in the holiday rituals of overeating and profound regret, a number of folks across the web were spending their spare time thinking about latency-focused game testing, believe it or not. We’re happy to see folks seriously considering this issue, and as you might expect, we’re learning from their contributions. I’d like to highlight several of them here.
Perhaps the most notable of these contributions comes from Andrew Lauritzen, a Tech Lead at Intel. According to his home page, Andrew works "with game developers and researchers to improve the algorithms, APIs and hardware used for real-time rendering." He also occasionally chides me on Twitter. Andrew wrote up a post at Beyond3D titled "On TechReport’s frame latency measurement and why gamers should care." The main thrust of his argument is to support our latency-focused testing methods and to explain the need for them in his own words. I think he makes that case well.
Uniquely, though, he also addresses one of the trickier aspects of latency-focused benchmarking: how the graphics pipeline works and how the tool that we’ve been using to measure latencies, Fraps, fits into it.
As we noted here, Fraps simply writes a timestamp at a certain point in the frame production pipeline, multiple stages before that frame is output to the display. Many things, both good and bad, can happen between the hand-off of the frame from the game engine and the final display of the image on the monitor. For this reason, we’ve been skittish about using Fraps-based frame-time measurements with multi-GPU solutions, especially those that claim to include frame metering, as we explained in our GTX 690 review. We’ve proceeded to use Fraps in our single-GPU testing because, although its measurements may not be a perfect reflection of what happens at the display output, we think they are a better, more precise indication of in-game animation smoothness than averaging FPS over time.
Andrew addresses this question in some depth. I won’t reproduce his explanation here, but it is worth reading in its entirety, since it covers the issues of pipelining, buffering, and CPU/driver-GPU interactions. Interestingly, Andrew believes that in the case of latency spikes, buffered solutions may produce smooth frame delivery to the display. However, even if that’s the case, the timing of the underlying animation is disrupted, which is just as bad:
This sort of "jump ahead, then slow down" jitter is extremely visible to our eyes, and demonstrated well by Scott’s follow-up video using a high speed camera. Note that what you are seeing are likely not changes in frame delivery to the display, but precisely the affect of the game adjusting how far it steps the simulation in time each frame. . . . A spike anywhere in the pipeline will cause the game to adjust the simulation time, which is pretty much guaranteed to produce jittery output. This is true even if frame delivery to the display (i.e. rendering pipeline output) remains buffered and consistent. i.e. it is never okay to see spikey output in frame latency graphs.
Disruptions in the timing of the game simulation, he argues, are precisely what we want to avoid in order to ensure smooth gameplay—and Fraps writes its timestamps at a critical point in the process:
Games measure the throughput of the pipeline via timing the back-pressure on the submission queue. The number they use to update their simulations is effectively what FRAPS measures as well.
In other words, if Fraps captures a latency spike, the game’s simulation engine likely sees the same thing, with the result being disrupted timing and less-than-smooth animation.
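To make that point concrete, here is a minimal sketch, not taken from any real engine, of a simulation that advances by the measured frame time. The function name, the speed value, and the sample frame times are all made up for illustration.

```python
def simulation_steps(frame_times_ms, speed_units_per_s=100.0):
    """Return how far an object moving at constant speed advances each frame.

    Assumes the simplest possible variable-timestep loop: the game steps
    its simulation by whatever frame time it just measured.
    """
    return [speed_units_per_s * dt_ms / 1000.0 for dt_ms in frame_times_ms]


# Hypothetical frame times: a steady 20 ms cadence with one 60 ms spike,
# followed by two quick 10 ms frames as the pipeline catches up.
frame_times = [20, 20, 60, 10, 10, 20]
steps = simulation_steps(frame_times)
print(steps)
```

Even if a buffered pipeline delivered these six frames to the display at perfectly even intervals, the object would move 2, 2, 6, 1, 1, and 2 units from one frame to the next, which is the "jump ahead, then slow down" pattern described in the quote above.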
There’s more to Andrew’s argument, but his insights about the way game engines interact with the DirectX API, right at the point where Fraps captures its data points, are very welcome. I hope they’ll help persuade folks who might have been unsure about latency-focused testing methods to give them a try. Andrew concludes that "If what we ultimately care about is smooth gameplay, gamers should be demanding frame latency measurements instead of throughput from all benchmarking sites."
With impeccable timing, then, Mark at AlienBabelTech has just published an article that asks the question: "Is Fraps a good tool?" He attempts to answer the question by comparing the frame times recorded by Fraps to those recorded by the tools embedded in several game engines. You can see Mark’s plots of the results for yourself, but the essence of his findings is that the game engine and Fraps output are "so similar as to convey approximately the same information." He also finds that the results capture a sense of the fluidity of the animation. The frame time plot "fits very well in with the experience of watching this benchmark – a small chug at the beginning, then it settles down until the scene changes and lighting comes into play – smoothness alternates with slight jitter until we reach the last scene that settles down nicely."
With the usefulness of Fraps and frame-time measurements established, Mark says his next step will be to test a GeForce GTX 680 and a Radeon HD 7970 against each other, complete with high-speed video comparisons. We look forward to his follow-up article.
Speaking of follow-up, I know many of you are wondering how AMD plans to address the frame latency issues we’ve identified in several newer games. We have been working with AMD, most recently running another quick set of tests right before Christmas with the latest Catalyst 12.11 beta and CAP update, just to ensure the problems we saw weren’t already resolved in a newer driver build. We haven’t heard much back yet, but we noticed in the B3D thread that AMD’s David Baumann says the causes of latency spikes are many—and he offers word of an impending fix for Borderlands 2:
There is no one single thing for, its all over the place – the app, the driver, allocations of memory, CPU thread priorities, etc., etc. I believe some of the latency with BL2 was, in fact, simply due to the size of one of the buffers; a tweak to is has improved it significantly (a CAP is in the works).
This news bolsters our sense that the 7950’s performance issues were due to software optimization shortfalls. We saw spiky frame time plots with BL2 both in our desktop testing and in Cyril’s look at the Radeon HD 8790M, so we’re pleased to see that a fix could be here soon via a simple CAP update.
Meanwhile, if you’d like to try your hand at latency-focused game testing, you may want to know about an open-source tool inspired by our work and created by Lindsay Bigelow. FRAFS Benchmark Viewer parses and graphs the frame time data output by Fraps. I have to admit, I haven’t tried it myself yet since our own internal tools are comfortingly familiar, but this program may be helpful to those whose Excel-fu is a little weak.
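For those who would rather script the first step than use a viewer, here is a minimal sketch of turning a frame-time log into per-frame times. It assumes the log is a CSV with a header row followed by a frame number and a cumulative timestamp in milliseconds; that layout, the sample data, and the function name are assumptions for illustration, not a description of how FRAFS or our internal tools work.

```python
import csv
import io


def frame_times_from_timestamps(csv_text):
    """Convert cumulative timestamps (ms) into per-frame times (ms).

    Assumed layout: one header row, then rows of "frame, timestamp_ms".
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    stamps = [float(row[1]) for row in reader if row]
    # Each frame time is the gap between consecutive timestamps.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Made-up sample data in the assumed layout.
SAMPLE_LOG = """Frame, Time (ms)
1, 0.000
2, 16.500
3, 33.250
4, 91.250
5, 107.750
"""

frame_times = frame_times_from_timestamps(SAMPLE_LOG)
print(frame_times)
```

Note that five timestamps yield only four frame times, since each frame time is a difference between two neighboring entries.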
Finally, we have a bit of a debate to share with you. James Prior from Rage3D was making some noises on Twitter about a "problem" with our latency-focused testing methods, and he eventually found the time to write me an email with his thoughts. I replied, he replied, and we had a nice discussion. James has kindly agreed to the publication of our exchange, so I thought I’d share it with you. It’s a bit lengthy and incredibly nerdy, so do what you will with it.
Here is James’s initial email:
Alrighty, had some time to play with it and get some thoughts together. First of all, not knocking what you’re doing – I think it’s a good thing. When I said ‘theres a big flaw’ here’s what I’m thinking.
When I look at inside the second, the data presentation doesn’t lend itself to supporting some of the conclusions. This is not because you’re wrong but because I’m not sure of the connection between the two. Having played around with looking at 99% time, I think that it’s not a meaningful metric in gauging smoothness of itself, it shows uneven render time but not the impact of that on game experience, which was the whole point. It’s another way of doing ‘X number is better than Y number’.
I agree with you that a smoothness metric is needed. I concur with your thoughts about FPS rates not being the be-all end-all, and 60fps vsync isn’t the holy grail. The problem is the perception of smoothness, and quantifying that. If you have a 25% framerate variation at 45fps you’re going to notice it more than a 25% framerate variation at 90fps. 99% time shows when you have a long time away from the average frame rate but not that the workload changes, so is naturally very dependent on the benchmark data, time period and settings.
What I would (and am, but it took me 2 weeks to write this email, I’m so time limited) aim for is to find a way to identify a standard deviation and look for ways to show that. So when you get a line of 20-22ms frames interrupted by a 2x longer frame time and possibly a few half as long frame times (the 22, 22, 58, 12, 12, 22, 22ms pattern) you can identify it, and perhaps count the number of times it happens inside the dataset.
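One possible reading of that idea, sketched very loosely: flag any frame that takes at least twice the run's median frame time, count those frames, and report the standard deviation alongside. The 2x factor, the use of the median as a baseline, and the function name are assumptions made for this sketch; the email does not specify a method.

```python
from statistics import median, pstdev


def count_spikes(frame_times_ms, factor=2.0):
    """Count frames that take at least `factor` times the median frame time."""
    baseline = median(frame_times_ms)
    return sum(1 for t in frame_times_ms if t >= factor * baseline)


# The example pattern from the email: a steady cadence, one long frame,
# then a couple of short ones.
pattern = [22, 22, 58, 12, 12, 22, 22]
spikes = count_spikes(pattern)
spread = pstdev(pattern)
print(spikes, round(spread, 1))
```

The median is used as the baseline here because a single long frame pulls the mean upward, while the median stays at the typical 22 ms value.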
Next up would be ‘why’ and that can start with game settings – changing MSAA, AO, resolution, looking for game engine bottlenecks and then looking at drivers and CPU config. People have reported stuttering frame rates from different mice, having HT enabled, having the NV AO code running on the AMD card (or vice versa).
In summary – I think the presentation of the data doesn’t show the problem at the extent it’s an issue for gamers. I think it’s too simplistic to say ‘more 99% time on card a, it’s no good’. But that’s an editorial decision for you, not me.
The videos of skyrim were interesting but of no value to me, it’s a great way to show people how to idenitify the problem but unless you frame sync the camera to the display and can find a way to reduce the losses of encoding to show it, it’s not scientific. Great idea though, help people understand what you’re describing.
Thanks for being willing to listen, and have a Merry Christmas 🙂
My response follows:
Hey, thanks for finally taking time to write. Glad to see you’ve considered these things somewhat.
I have several thoughts in response to what you’ve written, but the first and most important one is simply to note that you’ve agreed with the basic premise that FPS averages are problematic. Once we reach that point and are talking instead about data presentation and such, we have agreed fundamentally and are simply squabbling over details. And I’m happy to give a lot of ground on details in order to find the best means of analyzing and presenting the data to the reader in a useful format.
With that said, it seems to me you’ve concentrated on a single part of our data presentation, the 99th percentile frame time, and are arguing that the 99th percentile frame time doesn’t adequately communicate the "smoothness" of in-game animation.
I’d say, if you look at our work over the last year in total, you’d find that we’re not really asking the 99th percentile frame time to serve that role exclusively or even primarily.
Before we get to why, though, let’s establish another fundamental. That fundamental reality is that animation involves flipping through a series of frames in sequence (with timing that’s complicated somewhat by its presentation on a display with a fixed refresh rate). The single biggest threat to smooth animation in that context is delays or high-latency frames. When you wait too long for the next flip, the illusion of motion is threatened.
I’m much more concerned with high-latency frames than I am with variance from a mean, especially if that variance is on the low side of the mean. Although a series of, say, 33 ms frames might be the essence of "smoothness," I don’t consider variations that dip down to 8 ms from within that stream to be especially problematic. As long as the next update comes quickly, the illusion of motion will persist and be relatively unharmed. (There are complicated timing issues here involving the position of underlying geometry at render time and display refresh intervals that pull in different directions, but as long as the chunks of time involved are small enough, I don’t think they get much chance to matter.) Variations *above* the mean, especially big ones, are the real problem.
At its root, then, real-time graphics performance is a latency-sensitive problem. Our attempts to quantify in-game smoothness take that belief as fundamental.
Given that, we’ve borrowed the 99th percentile latency metric from the server world, where things like database transaction latencies are measured in such terms. As we’ve constantly noted, the 99th percentile is just one point on a curve. As long as we’ve collected enough data, though, it can serve as a reliable point of comparison between systems that are serving latency-sensitive data. It’s a single sample point from a large data set that offers a quick summary of relative performance.
With that in mind, we’ve proposed the 99th percentile frame time as a potential replacement for the (mostly pointless) traditional FPS average. The 99th percentile frame time has also functioned for us as a companion to the FPS average, a sort of canary in the coal mine. When the two metrics agree, generally that means that frame rates are both good *and* consistent. When they disagree, there’s usually a problem with consistent frame delivery.
So the 99th percentile does some summary work for us that we find useful.
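Here is a minimal sketch of how the two numbers can disagree. It uses the nearest-rank definition of a percentile, which is an assumption for this example (other definitions interpolate between samples), and both data sets are invented.

```python
import math


def percentile_frame_time(frame_times_ms, pct):
    """Nearest-rank percentile: the smallest frame time that at least
    pct percent of all frames come in at or under."""
    ordered = sorted(frame_times_ms)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def average_fps(frame_times_ms):
    """Frames rendered divided by total elapsed time in seconds."""
    return 1000.0 * len(frame_times_ms) / sum(frame_times_ms)


# Two invented runs of 100 frames that each take exactly 2 seconds.
steady = [20.0] * 100
spiky = [15.0] * 95 + [115.0] * 5

for name, run in (("steady", steady), ("spiky", spiky)):
    print(name, average_fps(run), percentile_frame_time(run, 99))
```

Both runs average 50 FPS, but the 99th percentile frame time is 20 ms for the steady run and 115 ms for the spiky one. That is the sort of disagreement the canary is meant to flag.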
But it is a summary, and it rules out the last 1% of slow frames, so I agree that it’s not terribly helpful as a presentation of animation smoothness. That’s why our data presentation includes:
1) a raw plot of frame times from a single benchmark run,
2) the full latency curve from 50-100% of frames rendered,
3) the "time spent beyond 50 ms" metric, and
4) sometimes zoomed-in chunks of the raw frame time plots.
*Those* tools, not the 99th percentile summary, attempt to convey more useful info about smoothness.
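As a rough illustration of item 2, the latency curve is just the sorted frame times plotted against the percentage of frames rendered. The sketch below builds the points for the upper half of that curve; the function name and the ten sample frame times are made up, and the percentile assigned to each point is one simple convention among several.

```python
def latency_curve(frame_times_ms, start_pct=50.0):
    """Return (percent of frames, frame time) points from start_pct to 100.

    The i-th fastest frame (counting from 1) is assigned the percentile
    100 * i / n, so the slowest frame always lands at 100%.
    """
    ordered = sorted(frame_times_ms)
    n = len(ordered)
    points = []
    for i, frame_time in enumerate(ordered, start=1):
        pct = 100.0 * i / n
        if pct >= start_pct:
            points.append((pct, frame_time))
    return points


# Ten invented frame times with two slow frames mixed in.
times = [16, 17, 16, 18, 16, 40, 17, 16, 90, 17]
for pct, frame_time in latency_curve(times):
    print(f"{pct:5.1f}%  {frame_time} ms")
```

A curve that stays flat and only turns up near 100% indicates a few isolated slow frames; one that starts climbing early indicates a more widespread problem.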
My favorite among them as a pure metric of smoothness is "time spent beyond 50 ms."
50 milliseconds is our threshold because at a steady state it equates to 20 FPS, which is pretty slow animation, where the illusion of motion is starting to be compromised. (The slowest widespread visual systems we have, in traditional cinema, run at 24 FPS.) Also, if you wait more than 50 ms for the next frame on a 60Hz display with vsync, you’re waiting through *four* display refresh cycles. Bottom line: frame times over 50 ms are problematic. (We could argue over the exact threshold, but it has to be somewhere in this neighborhood, I think.)
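The arithmetic behind those figures can be sketched as follows. The refresh-cycle count assumes a simplified vsync model in which rendering starts exactly at a refresh and the finished frame appears at the first refresh after it is ready; real pipelines with buffering behave less tidily, and the function names are made up.

```python
import math


def steady_state_fps(frame_time_ms):
    """Frame rate you would see if every frame took this long."""
    return 1000.0 / frame_time_ms


def refresh_cycles_waited(frame_time_ms, refresh_hz=60):
    """Refresh intervals that pass before a frame can be shown with vsync.

    Simplified model: the frame starts rendering at a refresh and is
    displayed at the first refresh at or after its completion.
    """
    return math.ceil(frame_time_ms * refresh_hz / 1000.0)


print(steady_state_fps(50))       # 20 FPS at a steady 50 ms per frame
print(refresh_cycles_waited(16))  # fits within one 60 Hz refresh interval
print(refresh_cycles_waited(51))  # just over 50 ms spills into a fourth
```

Three refresh intervals at 60 Hz add up to exactly 50 ms, so under this model any frame that takes even slightly longer cannot be shown until the fourth refresh.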
At first, to quantify interruptions in smooth animation, we tried just counting the number of frames that take over 50 ms to render. The trouble with that is that a 51 ms frame counts the same as a 108 ms frame, and faster solutions can sometimes end up producing *more* frames over 50 ms than slower ones.
To avoid those problems, we later decided to account for how far the frame times are over our threshold. So what we do is add up all of the time spent rendering beyond our threshold. For instance, a 51 ms frame adds 1 ms to the count, while an 80 ms frame adds 30 ms to our count. The more total time spent beyond the threshold, the more the smoothness of the animation has been compromised.
It’s not perfect, but I think that’s a pretty darned good way to account for interruptions in smoothness. Of course, the results from these "outlier" high-latency frames can vary from run to run, so we take the "time beyond X" for each of the five test runs we do for each card and report the median result.
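A minimal sketch of that calculation follows, including the simple frame count it replaced. The function names and the five sample runs are invented for illustration; this is not our internal tooling.

```python
from statistics import median


def frames_beyond(frame_times_ms, threshold_ms=50.0):
    """The earlier approach: count frames over the threshold."""
    return sum(1 for t in frame_times_ms if t > threshold_ms)


def time_beyond(frame_times_ms, threshold_ms=50.0):
    """Add up only the portion of each frame time that exceeds the threshold."""
    return sum(t - threshold_ms for t in frame_times_ms if t > threshold_ms)


# A 51 ms frame and a 108 ms frame count the same, but weigh very differently.
print(frames_beyond([51]), frames_beyond([108]))  # 1 and 1
print(time_beyond([51]), time_beyond([108]))      # 1 ms and 58 ms

# Five invented test runs for one card; report the median across runs.
runs = [
    [20, 51, 80, 20],  # 1 ms + 30 ms beyond the threshold
    [20, 20, 20],      # nothing beyond the threshold
    [62, 20],
    [95, 20],
    [58, 20],
]
per_run = [time_beyond(run) for run in runs]
reported = median(per_run)
print(per_run, reported)
```

Taking the median of the five runs keeps a single unusually good or bad run from deciding the reported figure.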
In short, I don’t disagree entirely with your notion that the 99th percentile frame time doesn’t tell you everything you might need to know. That’s why our data presentation is much more robust than just a single number, and why we’ve devised a different metric that attempts to convey "smoothness"–or the lack of it.
I’d be happy to hear your thoughts on alternative means of analyzing and presenting frame time data. Once we agree that FPS averages hide important info about slowdowns, we’re all in the same boat, trying to figure out what comes next. Presenting latency-sensitive metrics is a tough thing to do well for a broad audience that is accustomed to much simpler metrics, and we’re open to trying new things that might better convey a sense of the realities involved.
And here is James’s reply:
First up, yes I absolutely agree that FPS averages aren’t the complete picture. Your cogent and comprehensive response details the thinking behind your methodology very nicely. You are correct, I did choose to highlight 99% time as my first point, and your clarification regarding the additional data you review and present is well taken.
I agree with you about the 50ms/20fps ‘line in the sand’, for watching animated pictures. My personal threshold for smoothness in movies is about 17-18, my wife’s is 23.8. For gaming however, I find around 35fps / 29ms per frame is where I get pissed off and call it an unplayable slideshow unless it is an RTS – I was prepared to hate C&C locked at 30fps but found it quite pleasant. This was based on not only animation smoothness but smoothness of response to input. Human perception is a funny thing, it changes with familiarity and temperament.
So on that basis I concur that dipping from 22ms to 50ms is perceptible in ‘palm of the hand’ and 99% plus 50ms statistics address identify that nicely. Where I disagree with you is the moving from 22ms to say 11ms isn’t noticeable, especially if it is an experientially significant amount of time for the latency consumer – the player. Running along at 22ms and switching to 11ms probably won’t be perceived badly, but the regression back to 22ms might be, especially if it happens frequently. I experienced this first hand when I benched Crossfire 7870’s in Eyefinity, with VECC added SSAA. The fraps average was high, in the 60’s, the min was around 38. The problem was the feel, it looked smooth, but the response was input was terrible. The perceived average FPS was closer to the minimum and wasn’t smooth and so despite being capable of stutter free animation, the playability was ruined due to frame rate variation from 38fps to ~90fps. The problem ended up being memory bandwidth, as increasing clocks improved the feel and reduced the variation; this was reinforced by moving from SSAA to AAA and standard MSAA; the less intensive modes were silky smooth, AAA being in the same general performance range.
This can be observed on the raw frame rate graph, a saw tooth pattern will be seen if the plot resolution is right, but when examining a plot covering perhaps minutes of data showing tens of frame render times per second then you need a systemic approach for consistency and time cost of the analyst.
The obvious answer is to restrict your input data, find a benchmark session that doesn’t do that but then you end up with a question of usefulness to your latency processor again – the player. Does the section of testing represent the game fairly? Is the provided data enough for someone to know that the card will cope with the worst case scenarios of the game, is there enough data for each category of consumer – casual player, IQ/feature enthusiast, game enthusiast, performance enthusiast, competitive gamer, system builder, family advisor, mom upgrading little Jonny’s gateway – to understand the experience?
Servers talk to servers, games talk to people. We can base analysis methodology on what comes from the server world, and then move on to finding a way to consistently quantify the experience so that the different experience levels show through.
I’ll confess I still owe him a response to this message. We seem to have ended up agreeing on the most important matters at hand, though, and the issues he raises in his reply are a bit of a departure from our initial exchange. It seems to me James is thinking in the right terms, and I look forward to seeing how he implements some of these ideas in his own game testing in the future.
You can follow me on Twitter for more nerdy exchanges.