
AMD's 'Shanghai' 45nm Opterons

...versus Intel's latest 45nm Xeons

AMD's quad-core Opterons have certainly had a rough life to this point. The original "Barcelona" Opterons were hamstrung by delays, unable to meet clock frequency and performance expectations, and plagued by a show-stopper bug that forced AMD largely to stop shipments of the chips for months while waiting for a new revision, as we first reported. Once the revised Opterons made it into the market, they faced formidable competition from Intel's 45nm "Harpertown" Xeons, whose best-in-class performance and much-improved power efficiency have stolen quite a bit of the Opteron's luster.

AMD is looking to reverse its fortunes with the introduction of a brand-new version of the quad-core Opteron, code-named Shanghai, which is manufactured using a new, smaller 45-nanometer fabrication process that should bring gains in power efficiency and clock speeds. Shanghai also has the considerable benefit of being the second generation of a new processor design, and AMD has taken the opportunity to tweak this design in many ways, large and small, in order to improve its performance and, one would hope, allow it to meet its potential more fully. The result is an Opteron processor with higher clock speeds, improved performance per clock, and lower power consumption: a better proposition than Barcelona in almost every way.

Will it be enough to make the Opteron truly competitive with Intel's latest Xeons? We've been testing systems for the past couple of weeks in Damage Labs in order to find out.

The Opteron gets Shanghaied
In spite of the troubles "Barcelona" Opterons have faced, AMD got quite a bit right in designing them—or so it would seem when peering down at the basic layout from high altitude. Barcelona was the first native quad-core x86-compatible processor, with four cores sharing a single piece of silicon. Each of those cores had its own 512KB L2 cache, and the four cores then shared a larger, on-chip 2MB L3 cache. Barcelona's cores could also, of course, share data via this cache, making inter-core communication quick and relatively straightforward. In order to manage power consumption, Barcelona could modify the clock speed of each core independently in response to demand. In addition, the chip had dual power planes, one for the CPU cores and a second for the chip's other elements—specifically, its L3 cache, integrated memory controller, and HyperTransport links. Voltage to either plane could be reduced independently, again in response to activity. All of these provisions seemed to make Barcelona an ideal candidate for servers and workstations based on AMD's Socket F infrastructure, which in itself was a strength, thanks to a topology based on high-speed, point-to-point interconnects and CPUs with integrated memory controllers.

Few will argue these basic concepts aren't sound, especially now that Intel has adopted a very similar architecture for its Nehalem processors, which are already available on the desktop in the form of the staggeringly fast Core i7 and will be headed to servers in the first half of next year.

Shanghai retains Barcelona's strengths and looks to better capitalize on them. To that end, AMD has outfitted Shanghai with a larger, 6MB L3 cache and a host of tweaks aimed at bringing higher performance per clock and increased power efficiency.

Like the city for which it's named, Shanghai is about growth: it comprises an estimated 758 million transistors, up from 463 million in Barcelona. Despite this growth, though, the smaller fabrication process gives Shanghai a smaller die: 258 mm², versus Barcelona's 283 mm².
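To put those two figures together, here is a rough back-of-the-envelope sketch of transistor density for each chip. The transistor counts and die areas are the ones quoted above; the density metric and the names in the code are illustrative only, and the result reflects a change in the cache-to-logic mix as well as the process shrink.

```python
# Transistor counts and die areas as quoted in the article.
# The dictionary layout and function name are illustrative, not AMD's.
CHIPS = {
    "Barcelona": {"transistors_millions": 463, "die_area_mm2": 283},
    "Shanghai": {"transistors_millions": 758, "die_area_mm2": 258},
}


def density(chip):
    """Return millions of transistors per square millimeter of die."""
    return chip["transistors_millions"] / chip["die_area_mm2"]


barcelona_density = density(CHIPS["Barcelona"])
shanghai_density = density(CHIPS["Shanghai"])
density_ratio = shanghai_density / barcelona_density

print(f"Barcelona: {barcelona_density:.2f} M transistors/mm^2")
print(f"Shanghai:  {shanghai_density:.2f} M transistors/mm^2")
print(f"Shanghai packs about {density_ratio:.2f}x the transistors per mm^2")
```

By this crude measure, Shanghai fits roughly 1.8 times as many transistors into each square millimeter. Since cache arrays tend to be denser than logic, some of that gain likely comes from the larger L3 rather than from the process alone.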

AMD's 45nm fabrication process combines strained silicon and silicon-on-insulator techniques to achieve higher switching speeds at lower power levels, as did the past two generations of its fabrication technology. This time around, though, the firm has incorporated immersion lithography in order to reach smaller geometries. The use of a liquid medium between the lens and the wafer, as shown in the diagram on the right, offers improved focus and resolution versus the usual air gap in this space. AMD claims immersion lithography will be essential for the 32nm process node, even for Intel, and proudly notes that it has made the transition first.

Most of Shanghai's additional transistors (versus Barcelona) come from its expanded L3 cache, whose performance benefits for many server-class workloads should be fairly obvious. A number of logic changes, many of them cache-related, consume fewer transistors but promise additional benefits. For example, along with the larger cache comes an enhanced data pre-fetch mechanism. This logic attempts to recognize data access patterns and speculatively loads likely-to-be-needed data into cache ahead of time. As caches grow, pre-fetch algorithms often become more aggressive. Shanghai can also probe the L1 and L2 caches in its cores for coherency information twice as often as Barcelona, which gives it double the probe bandwidth. This provision should be particularly helpful when a core has lowered its clock speed to conserve power while idle.
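For readers unfamiliar with the idea, the general principle behind pattern-based pre-fetching can be sketched with a toy stride detector. This is emphatically not AMD's logic, whose details aren't public in this form; the class, its `degree` knob, and the 64-byte step in the example are all assumptions made for illustration.

```python
class StridePrefetcher:
    """Toy stride detector. A teaching sketch, NOT Shanghai's actual logic."""

    def __init__(self, degree=2):
        self.last_addr = None    # address of the previous access
        self.last_stride = None  # distance between the previous two accesses
        self.degree = degree     # hypothetical knob: how far ahead to fetch

    def access(self, addr):
        """Record one access and return the addresses to pre-fetch, if any."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Two equal, non-zero strides in a row count as a pattern.
            if stride != 0 and stride == self.last_stride:
                prefetches = [addr + stride * i
                              for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return prefetches


prefetcher = StridePrefetcher(degree=2)
trace = [prefetcher.access(addr) for addr in (0, 64, 128, 192)]
print(trace)  # [[], [], [192, 256], [256, 320]]
```

Raising `degree` makes the sketch fetch further ahead, which is one simple sense in which a pre-fetcher can be "more aggressive" when there is more cache to absorb wrong guesses.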

In order to make sure its larger caches don't cause data integrity problems, AMD has built in a new feature it calls L3 Cache Index Disable. This feature allows the CPU to turn off parts of the L3 cache if too many machine-check errors occur. This capability will apparently require OS-level support, and that's not here quite yet. AMD expects "select operating systems" to bring support for this feature next year.

By contrast, the somewhat confusingly named Smart Fetch should have immediate benefits. Despite the name, Smart Fetch is primarily a power-saving feature intended to work around the fact that AMD's caches are exclusive in nature; that is, the caches farther from the core (L2 and L3) don't replicate the entire contents of the caches closer to it. Exclusive caches have the simple benefit of extending the total effective size of the cache hierarchy (AMD justifiably bills Shanghai as having 8MB of cache), but they can present conflicts with dynamic power saving schemes. In Barcelona, for instance, a completely idle core would have to continue operating, though at a lower frequency, in order to keep its caches active and their contents available. Shanghai, by contrast, will dump the contents of that core's L1 and L2 caches into the L3 cache and put the core entirely to sleep, essentially reducing its clock speed to zero. AMD claims this provision can reduce idle power draw by up to 21%. One core in the system must remain active at all times, but in a four-socket system, only a single core in one socket must keep ticking. Smart Fetch isn't quite as impressive as the core-level power switching Intel built into Nehalem because it doesn't eliminate leakage power, but it's still a nice improvement over Barcelona.
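The 8MB figure follows directly from the exclusive arrangement, as this small sketch shows. The L2 and L3 sizes are those given in this article; the strictly inclusive case is a hypothetical comparison rather than a description of any particular chip, and the L1 caches are left out to match AMD's 8MB accounting.

```python
KB = 1024
MB = 1024 * KB

CORES = 4
L2_PER_CORE = 512 * KB  # per-core L2, as stated in the article
L3_SHARED = 6 * MB      # Shanghai's shared L3, as stated in the article


def exclusive_capacity(cores, l2_per_core, l3):
    """Exclusive caches hold different lines, so their capacities add up."""
    return cores * l2_per_core + l3


def inclusive_capacity(cores, l2_per_core, l3):
    """Hypothetical strictly inclusive L3: it duplicates every L2 line,
    so the unique data on chip is bounded by the L3 size alone."""
    return l3


exclusive_mb = exclusive_capacity(CORES, L2_PER_CORE, L3_SHARED) / MB
inclusive_mb = inclusive_capacity(CORES, L2_PER_CORE, L3_SHARED) / MB
print(f"Exclusive hierarchy: {exclusive_mb:.0f} MB of unique data")
print(f"Inclusive hierarchy: {inclusive_mb:.0f} MB of unique data")
```

The flip side is the one described above: with no copy of a core's L1 and L2 contents elsewhere on the chip, those contents have to be written out to the L3 before the core can be put to sleep.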

One tweak in Shanghai that affects not just the cache but the entire memory hierarchy has to do with the chip's support for nested page tables, a feature that accelerates memory address translation with system virtualization software. Shanghai maintains the same basic feature set as Barcelona here, but AMD claims a reduction in "world switch time" of up to 25% for Shanghai. That means the system should be able to transition from guest mode to hypervisor mode and then back to guest mode much more quickly. Since we've only had a couple of weeks following the release of the Core i7 to test Shanghai, we weren't able to test this improvement ourselves, unfortunately. (Proper, publishable virtualization benchmarking is a non-trivial undertaking.) AMD says it tested the time required to make these two transitions (guest-to-hypervisor and hypervisor-to-guest) itself and measured a latency of 1360 cycles on Barcelona versus 900 cycles on Shanghai. Hypervisors that support the AMD-V feature set could thus see a marked improvement in performance in cases where virtual server performance is hampered by world-switch latency. Indeed, VMware has published some Shanghai performance numbers with VMware ESX 3.5 that show dramatic performance advantages over software-based shadow page tables.
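The two cycle counts AMD supplied can be turned into a relative improvement with a line of arithmetic, sketched below. Only the 1360 and 900 figures come from AMD; the variable names are illustrative. Note that the result need not line up with the "up to 25%" claim, which may describe a different measurement. That reading is an assumption, not something AMD has confirmed.

```python
# Round-trip world-switch latencies (guest -> hypervisor -> guest),
# in CPU cycles, as reported by AMD and quoted in the article.
BARCELONA_CYCLES = 1360
SHANGHAI_CYCLES = 900

cycles_saved = BARCELONA_CYCLES - SHANGHAI_CYCLES
reduction = cycles_saved / BARCELONA_CYCLES

print(f"Cycles saved per round trip: {cycles_saved}")
print(f"Relative reduction: {reduction:.1%}")
```

On these numbers, each round trip costs 460 fewer cycles, a reduction of roughly a third.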

Our 2P Opteron test system with 16GB of DDR2-800 memory

A couple of other changes ought to bring more general performance gains. Shanghai's memory controller bumps up officially supported memory frequencies from 667MHz to 800MHz, for one. Also, HyperTransport 3 support is finally imminent. The first Shanghai processors don't support it, mainly because AMD didn't want to hold up these products' introduction while waiting for full validation of HT3 solutions. Instead, the firm plans to introduce HT3-ready Opterons next spring. When those arrive, they'll double the available bandwidth for CPU-to-CPU communication in Opteron systems. With HyperTransport clock speeds up to 2.2GHz, HT3 will allow for up to 17.6 GB/s of bandwidth (the bidirectional total) per link. Only with the introduction of the Fiorano platform later in 2009 will the CPU-to-chipset interconnect transition to HT3.
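The bandwidth figures above can be reproduced with some simple arithmetic. This sketch assumes a 16-bit HyperTransport link that transfers data on both clock edges, and a 64-bit (8-byte) DDR2 memory channel; those widths are assumptions not stated in this article, and the function names are illustrative.

```python
def ht_link_bandwidth_gbs(clock_ghz, link_width_bits=16, transfers_per_clock=2):
    """Return (per-direction, bidirectional) link bandwidth in GB/s.

    Assumes a 16-bit link with double-data-rate signaling by default.
    """
    per_direction = clock_ghz * transfers_per_clock * link_width_bits / 8
    return per_direction, 2 * per_direction


def ddr2_channel_bandwidth_gbs(megatransfers_per_s, channel_width_bytes=8):
    """Return peak bandwidth of one memory channel in GB/s.

    Assumes a 64-bit (8-byte) channel.
    """
    return megatransfers_per_s * channel_width_bytes / 1000


ht3_one_way, ht3_total = ht_link_bandwidth_gbs(2.2)
print(f"HT3 at 2.2GHz: {ht3_one_way:.1f} GB/s each way, {ht3_total:.1f} GB/s total")

for speed in (667, 800):
    peak = ddr2_channel_bandwidth_gbs(speed)
    print(f"DDR2-{speed}: {peak:.1f} GB/s peak per channel")
```

Under those assumptions, a 2.2GHz link works out to 8.8 GB/s in each direction, which matches the 17.6 GB/s bidirectional total. The move from 667 to 800 raises peak per-channel memory bandwidth from about 5.3 to 6.4 GB/s.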