We have been following the story of a chip-level problem in AMD's quad-core Opteron and Phenom processors all week. This bug (CPU makers prefer to call them errata) can cause system hangs in specific, rare circumstances. This sort of obscure problem is not really uncommon in microprocessors, but CPU makers are often able to fix them on the fly with little impact to the end user. This particular erratum is especially unfortunate because the fix for it involves sacrificing a substantial amount of performance.
This week's developments have included the revelation that this bug affects all "Barcelona" quad-core Opterons, leading to a "stop ship" order on quad-core Opterons to most customers. The erratum also affects all speed grades of Phenom processors, which are still shipping to PC makers and resellers. AMD admitted the presence of the erratum prior to the Phenom's public introduction, but the firm's initial statements gave the impression that the erratum affected only virtualization, which is a server-class application and an uncommon use for a desktop CPU. In truth, the erratum can cause instability with desktop-style usage patterns, as well, and systems with Phenom 9500 and 9600 processors will have to be patched and suffer the accompanying performance penalty.
One thing we haven't known, until today, is exactly what that performance penalty would look like. We can now offer you some preliminary benchmarks that demonstrate the impact of the BIOS-based workaround for the problem.
I don't wish to re-hash too much of what we've already covered this week, but we should briefly recap the nature of the erratum. The problem involves the chip's translation lookaside buffer (TLB) and L3 cache. AMD has provided a technical description of the problem as part of its documentation for an unsupported patch for the Linux kernel that alleviates the problem with only a minor performance hit. The specific circumstances that can lead to the data corruption and system hang are most likely to occur during periods of high utilization of all four CPU cores. Technically, AMD refers to the problem as erratum number 298, but the problem has become more widely known as the TLB erratum.
Those are the basic outlines of the problem, but we should address something else before we move on. Some folks seem to be confused about the likelihood that the erratum could affect a system's stability. We don't really have any good way of quantifying that at present, but we can offer a few nuggets of wisdom. AMD says the problem is very rare, and we believe them. CPU makers do a tremendous amount of qualification testing before releasing a product, precisely because they want to avoid show-stopper problems. The Barcelona Opterons were rock solid when we conducted the testing for our review of those chips. We did run into some stability problems with our early Phenom test systems, but we traced those issues back to a pre-production Asus motherboard. The production version of the Asus M3A32-MVP Deluxe that we've since tested was much more stable, even without the erratum workaround applied. We used an MSI motherboard in testing for this article, and not once did the system lock or crash during hours of testing without the TLB patch applied.
The TLB erratum is a big deal largely because of the standards CPU makers have established for themselves, in which utter stability is a guarantee, not an optional feature. Even if the likelihood of a crash is extremely low, anything less than 100% stability is unacceptable. AMD has helped establish those expectations, and the industry is correct to expect CPU makers to live up to them. We would be intrigued to see some handicapping of the exact odds of a TLB erratum-induced system hang occurring during the life of an average PC, but such considerations probably aren't going to fly for most customers. Even if the erratum occurs very infrequently, no one wants the possibility of a system crash at the worst possible moment hanging over his head, and no one wants a "broken" chip. That's surely why AMD has directed motherboard makers to enable the workaround by default in their BIOSes, with no option to disable it, even though it slows performance.
AMD has, however, taken the unusual step of pledging to release a version of its Overdrive tweaking utility that will allow users to disable the workaround, which says something about the severity of its performance impact.
As for that impact, it's tough to estimate precisely. We've heard estimates of 10% and "10 to 20%." We do know that the BIOS-based workaround for the TLB erratum disables some problematic logic in the CPU, but does not disable the L3 cache entirely.
With that said, we can move on to the test conditions. Several key things made this test possible. One was the fact that MSI was able to supply us with a BIOS for its K9A2 Platinum motherboard that includes the TLB erratum workaround. Thus, we tested with an earlier revision of the MSI board's BIOS (version VP.0B7) and with the newer, patched BIOS (version 1.21). Per AMD's guidance on this issue, MSI apparently did not include a menu option to disable the workaround. In fact, the BIOS doesn't appear to offer any visible indicator that the workaround is in place.
Another key to making this test possible was the help of the fine folks at NCIX, who supplied us with a production Phenom 9600 processor for use in testing. Thanks to them for their assistance.
I'll spare you the giant table here, but we generally tested using the configurations described in the testing methods section of our original Phenom review. The notable exceptions, of course, are the production Phenom 9600 and the MSI motherboard.
That original Phenom review overstated the Phenom 9600's performance for two reasons. The most obvious, of course, is the fact that we didn't have the TLB erratum patch applied. The other is that the north bridge clock on our Phenom engineering sample was running at 2.0GHz. AMD told us that was the correct north bridge speed, but our experience with production Phenom 9600 chips has proven otherwise; the correct clock is 1.8GHz. The north bridge clock is critical to performance in this CPU architecture because the L3 cache runs at the speed of this clock.
As a result, we have included scores for several Phenom CPUs in the graphs on the following pages. The ones marked "Phenom 9600 - TLB patch" and "Phenom 9600 - No TLB patch" come from the production Phenom 9600 and the MSI motherboard. The scores marked "Phenom ES" come from our Phenom engineering sample and the pre-production Asus motherboard. The "Phenom ES 2.3GHz" is what we mistakenly represented as a Phenom 9600 in our original review. With all of these results present, you should be able to see the impact of both the lower north bridge clock and of the TLB erratum patch.
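For readers who want to put a number on these comparisons, here is a minimal sketch of the arithmetic in Python. The helper name `percent_change` is our own invention for illustration; the only figures used are the two north bridge clocks quoted above, and no benchmark scores are assumed.

```python
def percent_change(baseline: float, measured: float) -> float:
    """Return the change from baseline to measured, as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


# North bridge clocks quoted in the text, in GHz.
es_nb_clock = 2.0          # engineering sample
production_nb_clock = 1.8  # production Phenom 9600

nb_delta = percent_change(es_nb_clock, production_nb_clock)
print(f"North bridge (and L3 cache) clock change: {nb_delta:.1f}%")
```

The same helper can be applied to any pair of scores from the graphs, with the unpatched result as the baseline and the patched result as the measured value. For lower-is-better results such as render times, a positive change means a slowdown.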