We have been following the story of a chip-level problem in AMD’s quad-core Opteron and Phenom processors all week. This bug (CPU makers prefer to call them errata) can cause system hangs in specific, rare circumstances. Obscure problems of this sort are not really uncommon in microprocessors, but CPU makers are often able to fix them on the fly with little impact to the end user. This particular erratum is especially unfortunate because the fix for it involves sacrificing a substantial amount of performance.
This week’s developments have included the revelation that this bug affects all “Barcelona” quad-core Opterons, leading to a “stop ship” order on quad-core Opterons to most customers. The erratum also affects all speed grades of Phenom processors, which are still shipping to PC makers and resellers. AMD admitted the presence of the erratum prior to the Phenom’s public introduction, but the firm’s initial statements gave the impression that the erratum affected only virtualization, which is a server-class application and an uncommon use for a desktop CPU. In truth, the erratum can cause instability with desktop-style usage patterns, as well, and systems with Phenom 9500 and 9600 processors will have to be patched and suffer the accompanying performance penalty.
One thing we haven’t known is exactly how that performance penalty would look, until today. We can now offer you some preliminary benchmarks that demonstrate the impact of the BIOS-based workaround for the problem.
I don’t wish to re-hash too much of what we’ve already covered this week, but we should recap briefly the nature of the erratum. The problem involves the chip’s translation lookaside buffer (TLB) and L3 cache. AMD has provided a technical description of the problem as part of its documentation for an unsupported patch for the Linux kernel that alleviates the problem with only a minor performance hit. The specific circumstances that can lead to the data corruption and system hang are most likely to occur during periods of high utilization of all four CPU cores. Technically, AMD refers to the problem as erratum number 298, but the problem has become more widely known as the TLB erratum.
Those are the basic outlines of the problem, but we should address something else before we move on. Some folks seem to be confused about the likelihood that the erratum could affect a system’s stability. We don’t really have any good way of quantifying that at present, but we can offer a few nuggets of wisdom. AMD says the problem is very rare, and we believe them. CPU makers do a tremendous amount of qualification testing before releasing a product, precisely because they want to avoid show-stopper problems. The Barcelona Opterons were rock solid when we conducted the testing for our review of those chips. We did run into some stability problems with our early Phenom test systems, but we traced those issues back to a pre-production Asus motherboard. The production version of the Asus M3A32-MVP Deluxe that we’ve since tested was much more stable, even without the erratum workaround applied. We used an MSI motherboard in testing for this article, and not once did the system lock or crash during hours of testing without the TLB patch applied.
The TLB erratum is a big deal largely because of the standards CPU makers have established for themselves, in which utter stability is a guarantee, not an optional feature. Even if the likelihood of a crash is extremely low, anything less than 100% stability is unacceptable. AMD has helped establish those expectations, and the industry is correct to expect CPU makers to live up to them. We would be intrigued to see some handicapping of the exact odds of a TLB erratum-induced system hang occurring during the life of an average PC, but such considerations probably aren’t going to fly for most customers. Even if the erratum occurs very infrequently, no one wants the possibility of a system crash at the worst possible moment hanging over his head, and no one wants a “broken” chip. That’s surely why AMD has directed motherboard makers to enable the workaround by default in their BIOSes, with no option to disable it, even though it slows performance.
AMD has taken the unusual step of pledging to release a version of its Overdrive tweaking utility that will allow users to disable the workaround, however, which says something about the severity of its performance impact.
As for that impact, it’s tough to estimate entirely. We’ve heard estimates of 10% and “10 to 20%.” We do know that the BIOS-based workaround for the TLB erratum disables some problematic logic in the CPU, but does not disable the L3 cache entirely.
With that said, we can move on to the test conditions. Several key things made this test possible. One was the fact that MSI was able to supply us with a BIOS for its K9A2 Platinum motherboard that includes the TLB erratum workaround. Thus, we tested with an earlier revision of the MSI board’s BIOS (version VP.0B7) and with the newer, patched BIOS (version 1.21). Per AMD’s guidance on this issue, MSI apparently did not include a menu option to disable the workaround. In fact, the BIOS doesn’t look to offer any cosmetic indicator that the workaround is in place.
Another key to making this test possible was the help of the fine folks at NCIX, who supplied us with a production Phenom 9600 processor for use in testing. Thanks to them for their assistance.
I’ll spare you the giant table here, but we generally tested using the configurations described in the testing methods section of our original Phenom review. The notable exceptions, of course, are the production Phenom 9600 and the MSI motherboard.
That original Phenom review overstated the Phenom 9600’s performance for two reasons. The most obvious, of course, is the fact that we didn’t have the TLB erratum patch applied. The other is that the north bridge clock on our Phenom engineering sample was running at 2.0GHz. AMD told us that was the correct north bridge speed, but our experience with production Phenom 9600 chips has proven otherwise; the correct clock is 1.8GHz. The north bridge clock is critical to performance in this CPU architecture because the L3 cache runs at the speed of this clock.
As a result, we have included scores for several Phenom CPUs in the graphs on the following pages. The ones marked “Phenom 9600 – TLB patch” and “Phenom 9600 – No TLB patch” come from the production Phenom 9600 and the MSI motherboard. The scores marked “Phenom ES” come from our Phenom engineering sample and the pre-production Asus motherboard. The “Phenom ES 2.3GHz” is what we mistakenly represented as a Phenom 9600 in our original review. With all of these results present, you should be able to see the impact of both the lower north bridge clock and of the TLB erratum patch.
The warm-up: Memory subsystem performance
Since the TLB erratum workaround is likely to affect the performance of the memory subsystem most directly, we’ll start by looking at some synthetic memory benchmark results. Note that performance results for these tests don’t translate directly into real-world application performance. These more focused benchmarks are instead designed to stress the memory subsystem, including CPU caches and main memory.
This first test lets us whip out a fancy-looking graph. Bear with us as we show off briefly.
I hesitated to include these results because, well, the graph is really hard to read. The problem? All three of the Phenom CPU configs running at 2.3GHz overlap almost entirely. I do think that’s enlightening, though, because it shows us that L2 and L3 cache bandwidth don’t appear to be affected by the TLB workaround. We can isolate the larger 1GB block size in this test, though, and see a more tangible difference.
Here, the TLB patch’s impact begins to be apparent. It’s even more apparent in Sandra’s Stream-like test of main memory bandwidth. Have a look.
Ow. Memory bandwidth drops from roughly 5.4 GB/s without the TLB patch to 3.65 GB/s with it.
Of course, bandwidth is only part of the story. Let’s look at memory access latencies, as well.
The TLB patch exacts an access latency penalty of 40 nanoseconds at our sample block size. We can look at the latency picture more fully with a terrifying-but-useful 3D graph.
In these graphs, yellow represents L1 cache, light orange is L2 cache, and dark orange is main memory. What you’re seeing here is memory access latencies at various block and step sizes, in a way that exposes latency for the various stages in the memory hierarchy.
These latency graphs demonstrate the same basic principle that the bandwidth graphs did. The TLB erratum workaround exacts its performance penalty by slowing main memory performance. The actual impact on memory performance is much greater than the 10% number we’ve seen floating around, but the patch’s effect on application performance will depend on whether the app’s working data set can fit into the CPU’s on-chip caches. The more often an application must reach into main memory, the greater the impact of the erratum workaround will be.
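To make that relationship concrete, here is a rough back-of-the-envelope sketch using the textbook average-access-time model. The 59 ns and 99 ns main memory latencies are the CPU-Z figures measured without and with the patch. The 5 ns on-chip cache latency and the miss fractions are purely illustrative assumptions, not measurements, and the function names are our own. The model also ignores overlapping memory requests, prefetching, and everything a program does besides waiting on memory, so treat the output as a trend, not a prediction.

```python
# Main memory latency in nanoseconds, from the CPU-Z results in this article.
MEMORY_NS_NO_PATCH = 59.0
MEMORY_NS_PATCH = 99.0

# Hypothetical latency for an access served by the on-chip caches.
# This is an illustrative assumption, not a measured value.
CACHE_NS = 5.0


def average_access_ns(miss_fraction, memory_ns, cache_ns=CACHE_NS):
    """Weighted average latency when miss_fraction of accesses go to main memory."""
    return (1.0 - miss_fraction) * cache_ns + miss_fraction * memory_ns


def patch_slowdown_percent(miss_fraction):
    """Percent increase in average access time caused by the slower main memory."""
    before = average_access_ns(miss_fraction, MEMORY_NS_NO_PATCH)
    after = average_access_ns(miss_fraction, MEMORY_NS_PATCH)
    return (after / before - 1.0) * 100


# Assumed miss fractions, chosen only to show the trend.
for miss_fraction in (0.0, 0.01, 0.05, 0.25, 1.0):
    print(f"{miss_fraction:5.0%} of accesses reach main memory: "
          f"average access time {patch_slowdown_percent(miss_fraction):5.1f}% higher")
```

Under these assumptions, a workload that lives entirely in cache sees no penalty at all, while the penalty climbs steadily as more accesses spill out to main memory.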
The real test: Application performance
Now for the real test: actual applications one might run on a Phenom processor. We’ve used a number of elements of our usual CPU testing suite. If you’re not familiar with them, we introduced them more properly in our original Phenom review. We’ll dispense with the pleasantries here and give you the results.
As we predicted, the TLB patch’s influence on performance varies from one application to the next. Everything depends on how that application uses memory. The largest impact, by far, is in the Firefox test, which really takes a hit when the patch is present. Web browsing isn’t exactly an uncommon activity, either. In other tests, like the Sandra multimedia benchmark that uses SIMD instructions to generate a picture of a Mandelbrot fractal, the patch barely slows performance at all.
So the BIOS-based workaround for the TLB erratum can have quite an effect on performance. How close were the estimates we’ve heard of a 10% performance drop? Let’s summarize our results and consider the percentage differences.
| Test | No TLB patch | TLB patch | Difference |
|---|---|---|---|
| Sandra cache and memory bandwidth | 6527 | 5932 | 9.6% |
| Sandra memory bandwidth – FPU (MB/s) | 5403 | 3650 | 38.7% |
| Sandra memory bandwidth – ALU (MB/s) | 5401 | 3648 | 38.7% |
| CPU-Z memory access latency (ns) | 59 | 99 | 50.6% |
| WorldBench – Microsoft Office 2003 SP-1 | 369 | 399 | 7.8% |
| WorldBench – Adobe Photoshop CS2 | 521 | 595 | 13.3% |
| WorldBench – Firefox | 298 | 536 | 57.1% |
| WorldBench – Microsoft Windows Media Encoder 9.0 | 248 | 272 | 9.2% |
| WorldBench – WinZip 10 | 305 | 321 | 5.1% |
| picCOLOR overall score | 9.74 | 7.21 | 29.9% |
| Valve Source engine particle simulation benchmark | 62 | 55 | 12.0% |
| Valve VRAD map build time | 182 | 191 | 4.8% |
| SiSoft Sandra Multimedia Integer x16 | 130697 | 130648 | 0.04% |
| SiSoft Sandra Multimedia Floating Point x8 | 169434 | 169373 | 0.04% |
| Total average difference | | | 19.8% |
| Average difference without memory subsystem tests | | | 13.9% |
Across every test we ran, the difference between the Phenom 9600 with and without the TLB patch averages out to 19.8%. However, if we rule out the synthetic memory tests and consider only the application tests, that difference drops to 13.9%.
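For readers who want to check the arithmetic, the short sketch below recomputes the table. The article does not state the formula, but the percentages are consistent with a symmetric percent difference, meaning the gap between the two scores divided by their mean. That reading is an inference from the numbers. The function and variable names are ours, and treating the first four rows as the "memory subsystem" tests is likewise inferred from the averages.

```python
# (test name, score without patch, score with patch, is a memory subsystem test)
# Scores are copied from the summary table above.
RESULTS = [
    ("Sandra cache and memory bandwidth", 6527, 5932, True),
    ("Sandra memory bandwidth - FPU", 5403, 3650, True),
    ("Sandra memory bandwidth - ALU", 5401, 3648, True),
    ("CPU-Z memory access latency", 59, 99, True),
    ("WorldBench - Microsoft Office 2003 SP-1", 369, 399, False),
    ("WorldBench - Adobe Photoshop CS2", 521, 595, False),
    ("WorldBench - Firefox", 298, 536, False),
    ("WorldBench - Microsoft Windows Media Encoder 9.0", 248, 272, False),
    ("WorldBench - WinZip 10", 305, 321, False),
    ("picCOLOR overall score", 9.74, 7.21, False),
    ("Valve Source engine particle simulation benchmark", 62, 55, False),
    ("Valve VRAD map build time", 182, 191, False),
    ("SiSoft Sandra Multimedia Integer x16", 130697, 130648, False),
    ("SiSoft Sandra Multimedia Floating Point x8", 169434, 169373, False),
]


def percent_difference(a, b):
    """Symmetric percent difference: the gap divided by the mean of the two values."""
    return abs(a - b) / ((a + b) / 2) * 100


def average(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


all_diffs = [percent_difference(a, b) for _, a, b, _ in RESULTS]
app_diffs = [percent_difference(a, b) for _, a, b, is_memory in RESULTS if not is_memory]

for (name, _, _, _), diff in zip(RESULTS, all_diffs):
    print(f"{name:52s} {diff:6.2f}%")
print(f"Average across all tests:     {average(all_diffs):.1f}%")
print(f"Average across applications:  {average(app_diffs):.1f}%")
```

Because the symmetric form divides by the mean instead of by the unpatched score, it gives the same percentage whether a higher or a lower number is better, which is convenient for a table that mixes bandwidth, latency, and elapsed-time results.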
The most troubling results here are the applications where we see large performance drops with the TLB erratum workaround active, including the Firefox web browser and the picCOLOR image analysis tool. If one happens to spend a lot of time running an application whose memory access patterns don’t mix well with the TLB patch, the result could prove frustrating. The BIOS-based workaround for the TLB erratum may achieve its intended result, system stability, but it comes at a pretty steep price in terms of performance.
For the average retail PC consumer, this price might not be unacceptable. After seeing the Firefox test results, I spent some time browsing the web with our Phenom-based test system, and it didn’t feel noticeably sluggish to me compared to most modern PCs. Then again, I doubt whether the average sort of consumer is likely to purchase a system with a quad-core processor. One wonders where that leaves AMD and the PC makers currently shipping Phenom-based PCs. I’m not sure a recall is in order, but a discount certainly might be. And folks need to know what they’re getting into when purchasing a Phenom 9500 or 9600-based computer this holiday season. Caveat emptor, indeed.
In fact, a credible source indicated to us that at least some of the few high-volume customers who are still accepting Barcelona Opterons with the erratum are receiving “substantial” discounts for taking the chips. One would hope consumers would get the same consideration. The trouble is, I doubt AMD would have shipped Phenom processors in this state were it not feeling intense financial pressure.
AMD’s other major concern here should be for its reputation. The company really pulled a no-no by representing Phenom performance to the press (and thus to consumers) without fully explaining the TLB erratum and its performance ramifications at the time of the product’s introduction.
As we’ve reported elsewhere, AMD does plan to fix the TLB erratum with a new revision of its quad-core chip due some time in mid-to-late Q1 of 2008. Once the new revision is available, the Phenom 9500 and 9600 will be replaced by the 9550 and 9650, with the -50 suffix denoting the updated silicon and higher performance. Most users will want to wait until those new Phenom models are available before paying full price for a Phenom processor or a system based on one.