AMD’s quad-core "Barcelona" Opterons have been notably difficult to find since their introduction two months ago, and The Tech Report has learned that a chip-level problem has impacted the supply of these chips to both server OEMs and distribution channel customers.
Chipmakers refer to chip-level problems as errata. Errata are fairly common in microprocessors, though they vary in nature and severity. This particular erratum first became widely known when AMD attributed the delay of the 2.4GHz version of its Phenom desktop processor to the problem. Not much is known about the specifics of the erratum, but it is related to the translation lookaside buffer (TLB) in the processor’s L3 cache. The erratum can cause a system hang with certain software workloads. The issue occurs very rarely, and thus was not caught by AMD’s usual qualification testing.
An industry source at a tier-two reseller told The Tech Report that the TLB erratum has led to a "stop ship" order on all Barcelona Opterons. When asked for comment, AMD spokesman Phil Hughes said the company is shipping Barcelona Opterons now, but only for "specific customer deals." Industry sources have suggested to TR that those deals are high-volume situations involving supercomputing clusters. Such customers may run workloads less likely to be affected by any workarounds for the erratum that reduce L3 cache performance, and those customers could potentially consume hundreds of thousands of CPUs. Our sources indicate, and the current availability picture would seem to confirm, that quad-core Opterons are not shipping to OEMs or the channel more generally.
News of this problem is notable because it confirms that the TLB erratum affects Barcelona server processors as well as Phenom desktop CPUs, and that the problem impacts AMD’s quad-core processors at lower clock speeds. The Opteron 2300 lineup spans clock speeds from 1.7GHz to 2.0GHz. Those CPUs’ north bridge clocks, which determine the frequency of the L3 cache, range from 1.4GHz to 1.8GHz.
The erratum is present in all AMD quad-core processors up to the current B2 revision. AMD has said a revision B3 is in the works and expected in Q1. One source told TR that large quantities of B3 chips might not be available until the end of Q1.
The potential for instability with the TLB erratum can be corrected via a BIOS-based workaround, but multiple sources have suggested the BIOS fix involves a substantial performance hit. AMD has publicly estimated that the performance penalty for the BIOS fix could be around 10%, and one source pegged the penalty at 10-20%. AMD has acknowledged that the TLB erratum particularly affects virtualization, and industry sources say the performance hit from the fix may be most severe with virtualization as well. Server administrators responsible for virtualized environments will probably want to wait for the B3-rev CPUs before upgrading.
TR has attempted to confirm the impact of the BIOS-based fix, but the BIOS for the SuperMicro H8DMU+ motherboard used in our review of the Barcelona Opterons has not been updated since mid-September and doesn’t appear to include the TLB erratum workaround.
Linux users may have another option in the form of a patch for that operating system’s kernel. Sources estimate this patch’s performance hit at less than one percent, but it comes with several caveats. At present, the patch purportedly applies only to the 64-bit version of Red Hat Enterprise Linux, Upgrade 4. Customers must sign a non-disclosure agreement to obtain the patch, and will be responsible for supporting it themselves. The patch doesn’t currently appear to be available via Red Hat’s regular support channels.
At present, Microsoft doesn’t offer a Windows hotfix to address the problem, and our sources were doubtful about the prospects for such a patch. CPU makers have often addressed errata via updates to the processor’s microcode, but such a fix for this problem also appears unlikely.
Update: It turns out the BIOS-based workaround for the erratum already involves a microcode update, AMD informs us. More soon.
Update II: We now have an update with a clarification on the nature of the BIOS workaround and information about how this issue affects Phenom processors right here.