AMD’s ‘Shanghai’ 45nm Opterons

AMD’s quad-core Opterons have certainly had a rough life to this point. The original “Barcelona” Opterons were hamstrung by delays, unable to meet clock frequency and performance expectations, and plagued by a show-stopper bug that forced AMD largely to stop shipments of the chips for months while waiting for a new revision, as we first reported. Once the revised Opterons made it into the market, they faced formidable competition from Intel’s 45nm “Harpertown” Xeons, whose best-in-class performance and much-improved power efficiency have stolen quite a bit of the Opteron’s luster.

AMD is looking to reverse its fortunes with the introduction of a brand-new version of the quad-core Opteron, code-named Shanghai, which has been manufactured using a new, smaller 45-nanometer fabrication process that should bring gains in power efficiency and clock speeds. Shanghai also has the considerable benefit of being the second generation of a new processor design, and AMD has taken the opportunity to tweak this design in innumerable ways, large and small, in order to improve its performance and, one would hope, allow it more fully to meet its potential. The result is an Opteron processor with higher clock speeds, improved performance per clock, and lower power consumption—a better proposition in almost every way than Barcelona.

Will it be enough to make the Opteron truly competitive with Intel’s latest Xeons? We’ve been testing systems for the past couple of weeks in Damage Labs in order to find out.

The Opteron gets Shanghaied

In spite of the troubles “Barcelona” Opterons have faced, AMD got quite a bit right in designing them—or so it would seem when peering down at the basic layout from high altitude. Barcelona was the first native quad-core x86-compatible processor, with four cores sharing a single piece of silicon. Each of those cores had its own 512KB L2 cache, and the four cores then shared a larger, on-chip 2MB L3 cache. Barcelona’s cores could also, of course, share data via this cache, making inter-core communication quick and relatively straightforward. In order to manage power consumption, Barcelona could modify the clock speed of each core independently in response to demand. In addition, the chip had dual power planes, one for the CPU cores and a second for the chip’s other elements—specifically, its L3 cache, integrated memory controller, and HyperTransport links. Voltage to either plane could be reduced independently, again in response to activity. All of these provisions seemed to make Barcelona an ideal candidate for servers and workstations based on AMD’s Socket F infrastructure, which in itself was a strength, thanks to a topology based on high-speed, point-to-point interconnects and CPUs with integrated memory controllers.

Few will argue these basic concepts aren’t sound, especially now that Intel has adopted a very similar architecture for its Nehalem processors, which are already available on the desktop in the form of the staggeringly fast Core i7 and will be headed to servers in the first half of next year.

Shanghai retains Barcelona’s strengths and looks to better capitalize on them. To that end, AMD has outfitted Shanghai with a larger, 6MB L3 cache and a host of tweaks aimed at bringing higher performance per clock and increased power efficiency.

Like the city for which it’s named, Shanghai is about growth: it comprises an estimated 758 million transistors, up from 463 million in Barcelona. Despite this growth, though, the smaller fabrication process means Shanghai has a smaller die area, at 258 mm², than Barcelona’s 283 mm².

AMD’s 45-nm fabrication process combines strained silicon and silicon-on-insulator techniques to achieve higher switching speeds at lower power levels, as did the past two generations of its fabrication technology. This time around, though, the firm has incorporated immersion lithography in order to reach smaller geometries. The use of a liquid medium between the lens and the wafer, as shown in the diagram on the right, offers improved focus and resolution versus the usual air gap in this space. AMD claims immersion lithography will be essential for the 32nm process node, even for Intel, and proudly notes that it has made the transition first.

Most of Shanghai’s additional transistors (versus Barcelona) come from its expanded L3 cache, whose performance benefits for many server-class workloads should be fairly obvious. A number of logic changes, many of them cache-related, consume fewer transistors but promise additional benefits. For example, along with the larger cache comes an enhanced data pre-fetch mechanism. This logic attempts to recognize data access patterns and speculatively loads likely-to-be-needed data into cache ahead of time. As caches grow, pre-fetch algorithms often become more aggressive. Shanghai can also probe the L1 and L2 caches in its cores for coherency information twice as often as Barcelona, which gives it double the probe bandwidth. This provision should be particularly helpful when a core has lowered its clock speed to conserve power while idle.
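To illustrate the general idea (and only the general idea; AMD hasn’t published the details of Shanghai’s pre-fetch logic), here’s a toy stride detector in Python. The function name and the simple same-stride-twice rule are our own invention for illustration, not AMD’s algorithm:

def prefetch_candidates(miss_addresses, depth=4):
    # Toy stride detector: once two consecutive misses repeat the same stride,
    # speculatively fetch the next few addresses along that stride.
    candidates = []
    for prev2, prev1, current in zip(miss_addresses, miss_addresses[1:], miss_addresses[2:]):
        stride = current - prev1
        if stride != 0 and stride == prev1 - prev2:
            candidates.extend(current + stride * i for i in range(1, depth + 1))
    return candidates

# A streaming read over 64-byte cache lines...
print(prefetch_candidates([0, 64, 128]))   # -> [192, 256, 320, 384]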

In order to make sure its larger caches don’t cause data integrity problems, AMD has built in a new feature it calls L3 Cache Index Disable. This feature allows the CPU to turn off parts of the L3 cache if too many machine-check errors occur. This capability will apparently require OS-level support, and that’s not here quite yet. AMD expects “select operating systems” to bring support for this feature next year.

By contrast, the somewhat confusingly named Smart Fetch should have immediate benefits. Despite the name, Smart Fetch is primarily a power-saving feature intended to work around the fact that AMD’s caches are exclusive in nature—that is, the lower-level caches don’t replicate the entire contents of the higher-level caches. Exclusive caches have the simple benefit of extending the total effective size of the cache hierarchy—AMD justifiably bills Shanghai as having 8MB of cache—but they can present conflicts with dynamic power saving schemes. In Barcelona, for instance, a completely idle core would have to continue operating, though at a lower frequency, in order to keep its caches active and their contents available. Shanghai, by contrast, will dump the contents of that core’s L1 and L2 caches into the L3 cache and put the core entirely to sleep, essentially reducing its clock speed to zero. AMD claims this provision can reduce idle power draw by up to 21%. One core in the system must remain active at all times, but in a four-socket system, only a single core in one socket must keep ticking. Smart Fetch isn’t quite as impressive as the core-level power switching Intel built into Nehalem because it doesn’t eliminate leakage power, but it’s still a nice improvement over Barcelona.

One tweak in Shanghai that affects not just the cache but the entire memory hierarchy has to do with the chip’s support for nested page tables, a feature that accelerates memory address translation with system virtualization software. Shanghai maintains the same basic feature set as Barcelona here, but AMD claims a reduction in “world switch time” of up to 25% for Shanghai. That means the system should be able to transition from guest mode to hypervisor mode and then back to guest mode much more quickly. Since we’ve only had a couple of weeks following the release of the Core i7 to test Shanghai, we weren’t able to test this improvement ourselves, unfortunately. (Proper, publishable virtualization benchmarking is a non-trivial undertaking.) AMD says it tested the time required to make these two transitions (guest-to-hypervisor and hypervisor-to-guest) itself and measured a latency of 1360 cycles on Barcelona versus 900 cycles on Shanghai. Hypervisors that support the AMD-V feature set could thus see a marked improvement in performance in cases where virtual server performance is hampered by world-switch latency. Indeed, VMware has published some Shanghai performance numbers with VMware ESX 3.5 that show dramatic performance advantages over software-based shadow page tables.

Our 2P Opteron test system with 16GB of DDR2-800 memory

A couple of other changes ought to bring more general performance gains. Shanghai’s memory controller bumps up officially supported memory frequencies from 667MHz to 800MHz, for one. Also, HyperTransport 3 support is finally imminent. The first Shanghai processors don’t support it, mainly because AMD didn’t want to hold up these products’ introduction while waiting for full validation of HT3 solutions. Instead, the firm plans
to introduce HT3-ready Opterons next spring. When those arrive, they’ll double the available bandwidth for CPU-to-CPU communication in Opteron systems. With HyperTransport clock speeds up to 2.2GHz, HT3 will allow for up to 17.6 GB/s of bandwidth (the bidirectional total) per link. Only with the introduction of the Fiorano platform later in 2009 will the CPU-to-chipset interconnect transition to HT3.
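The 17.6 GB/s figure falls straight out of the link parameters. Here’s the back-of-the-envelope math, assuming the usual 16-bit HyperTransport links with double-data-rate signaling:

# Peak HyperTransport 3 bandwidth per 16-bit link
link_clock_ghz = 2.2                      # HT3 link clock
transfers_per_sec = link_clock_ghz * 2    # double data rate: 4.4 GT/s
bytes_per_transfer = 2                    # 16-bit link width
one_way = transfers_per_sec * bytes_per_transfer   # 8.8 GB/s in each direction
print(one_way * 2)                        # 17.6 GB/s bidirectional total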

Pricing and availability

Even with all of these chip-level changes, the biggest news of the day may be the advent of Opterons with higher clock speeds and lower prices. The refreshed Shanghai lineup now looks like so:

Model          Clock speed   North bridge speed   ACP    Price
Opteron 2384   2.7GHz        2.2GHz               75W    $989
Opteron 2382   2.6GHz        2.2GHz               75W    $873
Opteron 2380   2.5GHz        2.0GHz               75W    $698
Opteron 2378   2.4GHz        2.0GHz               75W    $523
Opteron 2376   2.3GHz        2.0GHz               75W    $377
Opteron 8384   2.7GHz        2.2GHz               75W    $2,149
Opteron 8382   2.6GHz        2.2GHz               75W    $1,865
Opteron 8380   2.5GHz        2.0GHz               75W    $1,514
Opteron 8378   2.4GHz        2.0GHz               75W    $1,165

All of the new Opterons, ranging from 2.3 to 2.7GHz, fit into the same 75W thermal envelope, according to AMD’s “ACP” rating method (which it insists is the best analog to Intel’s TDP numbers, though Intel would disagree). Clock speeds overall are up, and notably, north bridge clocks participate in that advance. I say that’s notable because the north bridge clock governs the L3 cache, as well, which has a pretty direct impact on overall Opteron performance.

AMD expects all of the products above to be available now. Conspicuous by their absence are low-power HE and higher-speed SE derivatives of Shanghai. AMD intends for these HE and SE parts to fit into their traditional 55W and 105W thermal envelopes, respectively, when they arrive in the first quarter of next year. With the additional power headroom, the SE parts could quite possibly reach 3GHz, although only time will tell.

The Opteron’s next steps

The improvements in Shanghai sound pretty good, but many folks are still asking exactly what AMD will do in order to counter Intel’s Nehalem, which promises a similar system architecture and—by all current indications, at least—higher performance per core and per socket. Interestingly enough, AMD does have some credible answers to such questions, and it has disclosed quite a bit of its future Opteron roadmap in response. Here’s a quick overview of the basic plan:

AMD’s Opteron roadmap into 2011. Source: AMD.

Not noted above is the planned release of HyperTransport 3-enabled Opterons next spring. After that, the next big change will be the introduction of the Fiorano platform in mid-2009. Fiorano will be the first Opteron chipset based on the core-logic technology AMD acquired when it purchased ATI. That chipset will comprise the SR5690 I/O hub and the SP5100 south bridge. Fiorano will retain compatibility with Socket F-type CPUs, but will add several noteworthy enhancements, including full HyperTransport 3 and (at last) PCI Express 2.0, complete with support for device hot-plugging. As one would expect, Fiorano will support AMD’s IOMMU technology for fast and secure hardware-assisted virtualization of I/O devices.

A simple block diagram of the Fiorano platform. Source: AMD.

Fiorano will be scalable from 2P to 4P and 8P systems. As you can see in the diagram above, 4P Opteron systems will not be fully connected—there will still be two “hops” from one corner of a 4P system to the opposing corner. Also notable by its absence is support for DDR3 memory. Although the desktop Phenom II is expected to make the move to DDR3 in early 2009, the Opteron won’t follow until it makes a socket transition in 2010.

Before that happens, some time in late 2009, the Opteron lineup will get a boost with the release of a six-core processor code-named Istanbul. This 45-nm chip should look very much like Shanghai, but with two additional cores onboard—same 6MB L3 cache, same DDR2 memory controller, still HyperTransport 3. For certain applications, a six-core Opteron could conceivably be a nice alternative to Intel’s quad-core, eight-thread Nehalem-based Xeons, although by the time Istanbul arrives, Intel may be reaching new milestones in its own roadmap.

Istanbul looks like Shanghai plus two cores. Source: AMD.

Then comes the transition to the new G34 socket—the funky elongated, rectangular socket you may have seen in some reports—in 2010. This socket will bring a major infrastructure refresh for the Opteron. DDR3 support will come in with a bang; each socket is expected to support four channels of DDR3 memory. Also, the maximum number of HyperTransport 3 links per chip will rise from three to four, potentially enabling fully connected 4P systems.

Interestingly enough, all of the changes here will apparently be the result of modifications to the physical socket and to Opteron processors. Although AMD has given the new platform a code name, Maranello, it uses the same two core-logic chips as Fiorano.

The new processors will come in two distinct flavors: Sao Paulo, with six cores and 6MB of L3 cache, and the oh-so-cleverly named Magny-Cours, with a whopping 12 cores and 12MB of L3 cache. We don’t yet know whether or how these cores will be enhanced compared to Shanghai Opterons. Both chips will be manufactured with 45nm process tech, and the basic cache hierarchy on the Opteron will remain the same, with an exclusive L3. AMD will add additional smarts to these chips, though, in the form of a probe filter (or snoop filter) that will reduce cache coherency management traffic. Also, much like Nehalem, these processors will feature on-chip power management and thermal control capabilities, including the ability to raise and lower clock speeds based on thermal control points.

Beyond that, things become foggy. We know that AMD’s spun-off manufacturing arm, temporarily dubbed “the foundry company,” has plans to introduce two advanced 32-nm fabrication technologies in the first half of 2010, a high-performance process using SOI and a low-power process using high-k metal gates. Meanwhile, AMD is working on a next-generation CPU microarchitecture code-named “Bulldozer,” about which we know very little. Early information on Bulldozer suggested it would initially tape out on a 45nm process, but more recent rumblings from AMD suggest Bulldozer has been pushed back—the desktop variant to 2011—and may be a 32nm part.

Sizing up the Xeons and Opterons in our test

Intel, of course, hasn’t been sitting still since we last looked at its server/workstation-class processors. The firm is now shipping a new E stepping of its 45nm Xeons that reduces power draw and allows for slightly higher clock frequencies. All of the Xeons we tested for this review are based on E-stepping silicon. We had intended to review these Xeons in a separate article but weren’t able to complete it before this one, so we have a range of new-to-us products to test, based on multiple different Intel server- and workstation-class platforms.

The most direct competition for the Shanghai Opterons we’ve tested is the Xeon E5450, a 3GHz quad-core part with a 1333MHz front-side bus. We’ve tested the E5450 on Intel’s highest-volume server platform, known as Bensley. This platform, based on the Intel 5000P chipset, is getting a little long in the tooth and lacks a few features, like support for a 1600MHz FSB, 800MHz FB-DIMMs, and a full-coverage snoop filter. However, it is still the predominant Xeon server platform, and is thus the best basis of comparison versus the Opteron systems we’re testing. The Xeon E5450 is priced at $915 in volume, quite close to the $989 price tag of the Shanghai Opteron 2384. The two chips also share similar thermal envelopes; the Xeon E5450 is rated at an 80W TDP and the Opteron 2384 has a 75W ACP. (Assuming you buy AMD’s arguments about its ACP ratings, at least, the two should be similar. We will test power consumption ourselves, regardless.)

We have also, of course, included AMD’s best 65nm Opteron within this same thermal envelope, the 2356, to see how it compares to Shanghai.

Intel’s 45nm Xeons extend into higher-performance and lower-power territory in some interesting ways, as well. The low-voltage Xeon L5430, for instance, has specs very similar to the E5450—quad cores, 2.66GHz core clock, 1333MHz bus, 12MB total L2 cache—but comes with a TDP rating of just 50W. For our testing, we’ve mated it with a very intriguing low-power server platform from Intel, known as San Clemente.

This is our first look at San Clemente, which Intel hasn’t pushed especially hard in the mainstream server or workstation spaces. Instead, Intel has aimed it primarily at dense blade servers and embedded systems like routers, SANs, and NAS boxes. That’s kind of a shame, since the Intel 5100 MCH at the heart of San Clemente makes a key power-saving move, shunning Fully Buffered DIMMs for registered DDR2 memory modules just like Opterons use. FB-DIMMs allow for higher total system memory capacities, but they exact notable penalties in terms of both memory access latencies and power consumption. San Clemente’s power consumption could be quite a bit lower than Bensley’s, as could its memory access latencies. Like Bensley, San Clemente has dual, independent front-side bus connections to each socket in a 2P system, as well.

The tradeoffs are several. The 5100 MCH is limited to a maximum of six DIMMs per system and 48GB of total memory, versus 16 FB-DIMMs and 64GB total memory for Bensley. Also, the 5100 MCH’s two channels of DDR2-667 memory yield a peak of 10.6 GB/s of bandwidth, compared to Bensley’s 21 GB/s max.
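Those peak figures are simple products of channel count, transfer rate, and bus width. For reference, here’s the arithmetic, assuming the 5000P’s four FB-DIMM channels on Bensley versus the 5100’s two DDR2 channels:

def peak_bandwidth_gbs(channels, mega_transfers_per_sec):
    # channels x transfer rate x 8 bytes per 64-bit transfer
    return channels * mega_transfers_per_sec * 8 / 1000.0

print(peak_bandwidth_gbs(2, 667))   # San Clemente: dual-channel DDR2-667, roughly 10.7 GB/s
print(peak_bandwidth_gbs(4, 667))   # Bensley: four FB-DIMM channels at 667MHz, roughly 21.3 GB/s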

The guts of our San Clemente test rig, fully populated with six DIMMs

Our San Clemente test system underscored its memory capacity limitations by proving to be incompatible with our 2GB DDR2 DIMMs, for whatever reason. We were limited to testing with only 6GB of total memory by populating each of its DIMM slots with 1GB modules.

Since AMD’s 45nm Opteron HE products aren’t out yet, the closest competition we have to the Xeon L5430/San Clemente combo is the Opteron 2347 HE, a 65nm part with a 55W ACP (68W TDP), a 1.9GHz core clock, and a 1.6GHz north bridge/L3 cache. That’s a rough comparison for AMD, but things should change once the 45nm Opteron HE parts arrive next quarter.

At the other end of the spectrum entirely is the Xeon X5492, an ultra-high-end processor (nearly $1,500 list) that tests the outer limits of Intel’s 45nm process tech with a 3.4GHz core clock, a 1600MHz FSB, and a 150W TDP rating. We’ve tested a pair of these babies on the Stoakley platform. Stoakley is essentially an updated version of the Bensley platform with higher bandwidth, but it’s been targeted largely at workstations and HPC systems.

There really is no Opteron analog to the Xeon X5492. The closest comparison might be to the 65nm Opteron 2360 SE, which has a 105W ACP (and 119W TDP), but Shanghai has higher clock frequencies and a larger cache in a much smaller power budget, so the 2360 SE is essentially obsolete. Again, we may have to wait for the introduction of 45nm Opteron SE models before we have a truly comparable product from AMD—and even then, AMD may choose not to produce an Opteron with a 150W thermal envelope.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Dual Xeon E5450 3.0GHz system
System bus: 1333MHz (333MHz quad-pumped)
Motherboard: SuperMicro X7DB8+ (BIOS revision 6/23/2008)
North bridge: Intel 5000P MCH
South bridge: Intel 6321 ESB ICH
Chipset drivers: INF Update 9.0.0.1008
Memory: 16GB (8 DIMMs) of 2048MB DDR2-800 FB-DIMMs at 667MHz, CL 5, tRCD 5, tRP 5
Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon X5492 3.4GHz system
System bus: 1600MHz (400MHz quad-pumped)
Motherboard: SuperMicro X7DWA (BIOS revision 8/04/2008)
North bridge: Intel 5400 MCH
South bridge: Intel 6321 ESB ICH
Chipset drivers: INF Update 9.0.0.1008
Memory: 16GB (8 DIMMs) of 2048MB DDR2-800 FB-DIMMs at 800MHz, CL 5, tRCD 5, tRP 5
Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon L5430 2.66GHz system
System bus: 1333MHz (333MHz quad-pumped)
Motherboard: Asus RS160-E5 (BIOS revision 8/08/2008)
North bridge: Intel 5100 MCH
South bridge: Intel ICH9R
Chipset drivers: INF Update 9.0.0.1008
Memory: 6GB (6 DIMMs) of 1024MB registered ECC DDR2-667 DIMMs at 667MHz, CL 5, tRCD 5, tRP 5
Storage controller: Intel ICH9R with Matrix Storage Manager 8.6
Power supply: FSP Group FSP460-701UG 460W
Graphics: Integrated XGI Volari Z9s with 1.09.10_ASUS drivers

Dual Opteron 2347 HE 1.9GHz and dual Opteron 2356 2.3GHz system
System bus: 1GHz HyperTransport
Motherboard: SuperMicro H8DMU+ (BIOS revision 3/25/08)
North bridge: Nvidia nForce Pro 3600
South bridge: Nvidia nForce Pro 3600
Memory: 16GB (8 DIMMs) of 2048MB registered ECC DDR2-800 DIMMs at 667MHz, CL 5, tRCD 5, tRP 5
Storage controller: Nvidia nForce Pro 3600
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Opteron 2384 2.7GHz system
System bus: 1GHz HyperTransport
Motherboard: SuperMicro H8DMU+ (BIOS revision 10/15/08)
North bridge: Nvidia nForce Pro 3600
South bridge: Nvidia nForce Pro 3600
Memory: 16GB (8 DIMMs) of 2048MB registered ECC DDR2-800 DIMMs at 800MHz, CL 6, tRCD 5, tRP 5
Storage controller: LSI Logic Embedded MegaRAID with 8.9.518.2007 drivers
Power supply: Ablecom PWS-702A-1R 700W
Graphics: Integrated ATI ES1000 with 8.240.50.3000 drivers

All systems used a WD Caviar WD1600YD 160GB hard drive and ran Windows Server 2008 Enterprise x64 Edition with Service Pack 1.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This test shows us a nice visual picture of the memory bandwidth available at the different levels of the memory hierarchy. One can see the impact of the Opteron 2384’s larger L3 cache at the 16MB test block size, where it’s much faster than the older quad-core Opterons. Still, the Xeons’ caches typically achieve quite a bit higher throughput than the Opterons’.

Our graph is tough to read at the largest test block sizes where main memory comes into play. Here’s a closer look at the 256MB block size, which should be a good indicator of main memory bandwidth.

These results are consistent with what we’ve seen in the past from most of these platforms. I believe these results only show the bandwidth available to a single CPU core, so they’re substantially less than the peak available in the entire system. The Opterons appear to benefit greatly from their integrated memory controllers here, and the Shanghai Opteron 2384 takes advantage of its faster 800MHz memory, as well.

The Opteron 2384’s revamped cache and TLB hierarchy, along with faster memory, delivers major reductions in memory access latency. With the 65nm Barcelona Opterons, we’ve found that the L3 cache tends to contribute quite a bit of latency to the overall picture. Yet with three times the L3 cache of the Opteron 2356, the 2384 is still faster to main memory. Let’s have a closer look at the cache picture and see why that is.

Before we do, though, we should also point out that the Xeon L5430 on the San Clemente platform has much lower access latencies than the E5450 on the Bensley platform, although they share the same bus frequency and topology. Assuming there aren’t any other major contributing factors, FB-DIMMs would appear to add about 14ns of delay versus DDR2 modules at the same 667MHz clock speed. The Stoakley platform essentially makes up that deficit by using higher bus and memory frequencies.

Note that, below, I’ve color-coded the block sizes that roughly correspond to the different caches on each of the processors. L1 data cache is yellow, L2 is light orange, L3’s darker orange, and main memory is brown.

These graphs offer a good visual representation of the data, but perhaps some numbers would illuminate things further. Because the Opteron’s L3 cache is clocked independently from the CPU cores, it doesn’t make sense to quantify that cache’s latency in terms of CPU clock cycles. In this case, the Opteron 2356’s L3 cache runs at 2GHz, while the 2384’s runs at 2.2GHz—a 10% increase. Despite the fact that the 2384’s L3 cache is three times the size, though, its latencies are considerably lower. At the 2048KB block size and step size of 256, the 2356’s latency is 23ns, while the 2384’s is only 16ns—a reduction of nearly a third.

SPECjbb 2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We did not intend to challenge the best published scores with our results, but we did hope to achieve reasonably optimal tuning for our test systems. To that end, we used a fast JVM—the 64-bit version of Oracle’s JRockIt JRE R27.6—and picked up some tweaks for tuning from recently published results. We used two JVM instances with the following command line options:

start /AFFINITY [0F, F0] java -Xms3700m -Xmx3700m -XXaggressive -XXlazyunlocking -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads=4 -Xns3200m -XXcallprofiling -XXtlasize:min=4k,preferred=512k -XXthroughputcompaction

Notice that we used the Windows “start” command to affinitize our threads on a per-socket basis. We also tried affinitizing on a per-chip basis for the Xeon systems, but didn’t see any performance benefit from doing so. The one exception to the command line options above was our Xeon L5430/San Clemente system. Since it had only 6GB of memory, we had to back the heap size down to 2200MB for it.
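For the curious, the 0F and F0 values passed to the /AFFINITY switch are hexadecimal bitmasks selecting which logical CPUs each JVM instance may use; given how Windows enumerated cores on our 2P quad-core systems, they correspond to one socket apiece. A quick sketch of how such a mask decodes:

def cpus_in_mask(mask):
    # Decode a Windows processor-affinity bitmask into logical CPU numbers
    return [bit for bit in range(64) if mask & (1 << bit)]

print(cpus_in_mask(0x0F))   # [0, 1, 2, 3] -> the first JVM instance's cores
print(cpus_in_mask(0xF0))   # [4, 5, 6, 7] -> the second instance's cores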

The Opteron 2384’s performance here is undeniably impressive—a massive leap from the performance of the Opteron 2356, and substantially higher than its most direct competitor, the Xeon E5450. In fact, at just 2.7GHz, the Shanghai Opteron performs nearly as well as the most exotic member of the group, the 3.4GHz Xeon X5492, remarkably enough. Shanghai’s larger L3 cache and other tweaks, combined with the Opteron’s native quad-core design and strong system architecture, yield big returns in this server-class workload.

Not only that, but Oracle has hinted to us that even higher performance is possible when using a version of JRockIt optimized for Shanghai. That version of JRockIt hasn’t yet been released, but we understand it’s on the way.

Before we move on, let’s take a quick look at power consumption during this test. SPECjbb 2005 is the basis for SPEC’s own power benchmark, which we had initially hoped to use in this review, but time constraints made that impractical. Nevertheless, we did capture power consumption for each system during a test run using our Extech 380803 power meter. All of the systems used the same model of Ablecom 700W power supply unit, with the exception of the Xeon L5430 server, which used an FSP Group 460W unit. Power management features (such as SpeedStep and Cool’n’Quiet) were enabled via Windows Server’s “Balanced” power policy.

Although it has the same 75W ACP rating as the Opteron 2356, the Opteron 2384 draws substantially less power at every step of the way. The Xeon E5450 system is practically a power hog by comparison, with much higher peak power draw. The bright spot for Intel here is the Xeon L5430/San Clemente system with DDR2 memory, whose power consumption is admirably low—almost 60W less than the Opteron 2384 system.

Have a look at what happens when we consider performance per watt.

Our Opteron 2384 system combines higher performance with lower power draw than the Xeon E5450 system, so its “bops per watt” lead is predictably large. Shanghai certainly looks good in this light.

Meanwhile, the Xeon L5430 aims to steal the limelight. It’s a bit of a wild card, with less total memory, only six DIMMs versus eight for the other systems, and a much lower wattage PSU (which may be more efficient at these load levels). Still, one can’t deny the efficiency of its 50W quad-core Xeons—and one can’t help but wonder whether Intel made the right call in choosing FB-DIMMs for its mainstream server platform just as performance per watt was becoming perhaps the key metric for server evaluations.

Cinebench rendering

We can take a closer look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

This is a very different sort of application, in which the Shanghai Opterons’ larger cache and faster memory don’t bring the sort of performance gains we saw in SPECjbb. Here, the Xeon E5450 is faster. In fact, the Xeons are faster clock for clock—the Xeon L5430 at 2.66GHz outperforms the Opteron 2384 at 2.7GHz.

As we did with SPECjbb, we measured power draw at the wall socket for each of our test systems across a set time period, during which we ran Cinebench’s multithreaded rendering test.

Some of the outcomes are obvious immediately, like the fact that the Xeon E5450 and X5492 systems have much higher overall power draw. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Opteron 2384 system’s idle power draw is just over 10W less than that of the system based on its 65nm predecessor, the 2356, in spite of the fact that its L3 cache is larger and runs at a higher clock speed. Shanghai’s ability to flush its L1 and L2 caches into the L3 and shut down its cores does appear to pay dividends. Even so, those incremental gains seem small in light of the considerably higher idle power draw of the FB-DIMM-equipped Xeon systems.

Meanwhile, the low-voltage Xeons and San Clemente continue to impress.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

Peak power draw with Cinebench isn’t quite as high as it is with SPECjbb, but the trends remain the same. The Shanghai Opterons draw less power, at a higher clock speed, than their 65nm counterparts.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
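The computation itself is nothing fancy: we integrate the logged power readings over the relevant window. A minimal sketch, assuming one power sample per second (our meter’s actual logging interval and our spreadsheet mechanics may differ):

def energy_joules(power_samples_watts, interval_s=1.0):
    # Energy = sum of power readings (watts) x sampling interval (seconds)
    return sum(power_samples_watts) * interval_s

# Hypothetical example: a system averaging 250W across a 40-second render
print(energy_joules([250.0] * 40))   # 10000 joules (watt-seconds)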

Even though the Opteron 2384 isn’t as fast as the Xeon E5450 in this application, the Opteron system still requires less energy to render the scene than the Xeon E5450-based one. The Opteron 2356 isn’t nearly as efficient as either, but the Shanghai Opterons more than restore AMD’s competitiveness in power-efficient performance.

With that said, the biggest winner here, obviously, is the Xeon L5430 system, which is simply in a class by itself.

XML handling

We are working, bit by bit, to add additional pieces to our server/workstation CPU benchmark suite over time. As part of that effort, our web developer and sysadmin, Stephen Roylance, has put together an XML handling benchmark for us. He based some elements of this test on parts of the open-source XML Benchmark project, but Steve ported everything to Microsoft’s C# language and .NET runtime environment. Here’s how he describes the program:

The program runs four different XML related tests for a configurable number of cycles, across a configurable number of threads.

The four operations are:

  • Read a test XML file and parse it.
  • Generate a randomized XML tree and write it into memory as a document.
  • Transform an XML tree with XSLT, writing the resulting document into memory.
  • Attach a cryptographic signature to a parsed XML tree, write it back to memory as an XML document in a string, parse it and verify the signature.

In contrast to SPECjbb, this test is written in C# and runs under Microsoft’s .NET common language runtime. It should be a reasonable simulation of real-world CPU workloads on servers running ASP.NET web applications.

The results you see below show the total time required to execute 100 iterations of an interleaved mix of the four thread types across eight concurrent threads. We tested using the benchmark’s “medium” file size option, so the data files involved were around 256KB in size.

Curiously enough, the Opteron 2384 is no faster than the 2356 here. This data point, plus a couple of others, points to a possible cause. The Opterons struggle versus the Xeons overall, which suggests that the NUMA memory subsystem of the Opteron systems may be causing problems. In other words, we may have threads ping-ponging between cores on different sockets, forcing them to access memory associated with the non-local CPU socket, draining performance. Also, notice how much quicker the Xeon L5430 is in this test than the E5450. That fact heightens my suspicion that this test is particularly sensitive to memory access latencies. In SPECjbb, where the Shanghai Opterons were much more effective, we explicitly affinitized threads with CPU sockets.

Of course, performance bottlenecks like this one are a day-to-day reality of living with the Opteron’s NUMA architecture, and many off-the-shelf applications aren’t NUMA-aware. This issue will probably get quite a bit more attention soon, since Intel will be making the move to a very similar NUMA arrangement with 2P Nehalem systems.

Our next steps in the development of this benchmark will have to include affinitizing threads with sockets, if at all possible. Also, we’d like to report execution times for individual thread types, and I believe Steve plans to write up a blog post about this benchmark and release the source code. If any of our readers have suggestions for improvement, he’ll be taking them at that time.
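As a rough illustration of the idea (and not Steve’s actual code, which is C# and would more likely use .NET’s ProcessThread.ProcessorAffinity), here’s how pinning a worker process to one socket’s cores might look in Python with the psutil package; the socket-to-CPU mapping below is an assumption for illustration:

import psutil

# Hypothetical socket-to-CPU mapping for a 2P quad-core box; the real mapping
# depends on how the OS enumerates cores.
SOCKET_CPUS = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}

def pin_to_socket(socket_id):
    # Restrict the current process to one socket's cores
    psutil.Process().cpu_affinity(SOCKET_CPUS[socket_id])

pin_to_socket(0)
print(psutil.Process().cpu_affinity())   # -> [0, 1, 2, 3]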

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
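The scheme David describes maps onto a garden-variety work-queue pattern. Here’s a simplified Python sketch of the job-splitting arithmetic and the worker loop; it’s purely illustrative, and MyriMatch’s own implementation surely differs:

import queue, threading

def score_proteins(job):
    # stand-in for the real work: comparing each protein's peptides to the spectra
    pass

def run_jobs(proteins, num_threads=4, jobs_per_thread=10):
    num_jobs = num_threads * jobs_per_thread        # e.g. 4 threads x 10 = 40 jobs
    chunk = max(1, len(proteins) // num_jobs)       # with 6714 proteins, roughly 168 apiece
    jobs = queue.Queue()
    for start in range(0, len(proteins), chunk):
        jobs.put(proteins[start:start + chunk])

    def worker():
        while True:
            try:
                job = jobs.get_nowait()             # grab the next job as soon as this thread is free
            except queue.Empty:
                return
            score_proteins(job)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()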

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Opteron 2384 finishes ahead of the Xeon E5450, with a best time two seconds faster than the Xeon. What’s more interesting is how it gets there: the Xeon is faster at every thread count from one to six, but the Opteron scales better when taking that last step to an optimal thread count, likely thanks to its native quad-core layout and integrated memory controller.

Another intriguing development is the fact that the Xeon L5430/San Clemente system is nearly as fast as the Xeon E5450 on the Bensley platform—faster at low thread counts, in fact—in spite of the L5430’s clock speed deficit.

And, well, there’s a storm brewing on the horizon. A single-socket desktop version of Nehalem, the Core i7-965 Extreme, completed this same test in only 60 seconds, 10 seconds ahead of even our dual-socket Xeon X5492 system. Granted, that’s with a different OS with possible kernel tuning advantages and exotic 1600MHz RAM, but one can’t help but wonder how a dual-socket Nehalem system might perform.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

The Opteron 2384 can’t quite catch the Xeons here, but consider the match-up against the Xeon L5430. The L5430 reaches a much higher frequency with a single thread, but its advantage gradually erodes as the number of threads climbs. At eight threads, the L5430 is only ahead by a fraction.

For comparison’s sake, by the way, the single-socket Core i7-965 Extreme broke the 5Hz barrier on this test—again, well ahead of our Xeon X5492 system.

Folding@home

Next, we have a slick little Folding@home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
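In other words, the final figure is just the mean of the per-WU-type scores scaled up by the core count. With made-up numbers, purely for illustration:

def estimated_ppd(per_core_ppd_by_wu_type, num_cores):
    # Average the per-core points-per-day scores, then scale by core count
    return sum(per_core_ppd_by_wu_type) / len(per_core_ppd_by_wu_type) * num_cores

# Hypothetical per-core scores for the four WU types on a quad-core CPU
print(estimated_ppd([300, 450, 520, 610], 4))   # -> 1880.0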

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

The Xeons are plainly faster here, and the scores for both the AMD and Intel processors appear to scale rather linearly with clock speed improvements.

3D modeling and rendering

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

Shanghai’s performance gains here aren’t quite sufficient to allow the Opteron 2384 to catch the Xeon E5450, but they are remarkably solid improvements, especially in the benchmark scene. The question is: why? POV-Ray hasn’t been particularly sensitive to cache sizes or memory bandwidth in recent years. During my recent visit to AMD’s Austin, Texas campus, one of AMD’s engineers told me that Shanghai’s branch prediction algorithm had been tweaked to improve its accuracy in certain cases, and one of the applications that should benefit from that tweak is POV-Ray. Looks like it helped.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into its games.

This is our final lighting/rendering-type test, and the results are what we’ve come to expect, more or less.

x264 HD video encoding

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.

For more workstation-oriented applications like this one, the Xeons have a consistent edge over the Opterons, and Shanghai doesn’t really change that.

Sandra Mandelbrot

We’ve included this final test largely just to satisfy our own curiosity about how the different CPU architectures benefit from SSE extensions and the like. SiSoft Sandra’s “multimedia” benchmark is intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power.

The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, 2/4/8x improvement cannot be expected (due to other overheads), generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation of processors are very advanced (e.g. 2+ execution units) thus bridging the gap as well.

We’re using the 64-bit version of the Sandra executable, as well.
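The notion of iterating several Mandelbrot points at once is easy to see in a data-parallel sketch. Here’s a tiny NumPy version that steps a whole batch of points through the iteration together; it’s illustrative only, and Sandra’s hand-tuned SIMD code obviously looks nothing like this:

import numpy as np

def mandelbrot_iterations(c, max_iter=255):
    # Step a whole array of points through z = z^2 + c together, counting how
    # many iterations each point survives before |z| escapes past 2.
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=np.int32)
    for _ in range(max_iter):
        alive = np.abs(z) <= 2.0
        if not alive.any():
            break
        z[alive] = z[alive] ** 2 + c[alive]
        counts[alive] += 1
    return counts

# Four points handled "in parallel," SIMD-style
points = np.array([0.0 + 0.0j, -1.0 + 0.3j, 0.5 + 0.5j, -2.5 + 0.0j])
print(mandelbrot_iterations(points))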

Shanghai is nearly as fast, clock for clock, as the Xeon in both the integer x8 and FP double tests. The Opteron 2384 runs neck and neck with the 2.66GHz Xeon L5430.

Conclusions

The Shanghai Opterons’ higher clock speeds, larger and quicker L3 cache, and improved memory subsystem are just what the doctor ordered for AMD’s quad-core CPU architecture. These changes, along with lower power consumption both at idle and while loaded, go a long way toward alleviating the weaknesses of the 65nm Barcelona Opterons. The Opteron 2384’s ability to outperform the Xeon E5450 in SPECjbb is dramatic proof of Shanghai’s potency. Similar server-class workloads are likely to benefit from Shanghai, as well, so long as they are properly NUMA-aware. Both in SPECjbb and in the more difficult case (for the Opteron) of the Cinema 4D renderer, we found our Opteron 2384-based system to be quantifiably superior to the FB-DIMM-equipped Xeon systems in terms of power-efficient performance.

The new Opterons are clearly more competitive now, but they were still somewhat slower overall in the HPC- and workstation-oriented applications we tested, with the lone exception of MyriMatch. In many cases, Shanghai at 2.7GHz was slightly behind the Xeon L5430 at 2.66GHz. The Opteron does best when it’s able to take advantage of its superior system architecture and native quad-core design, and it suffers most by comparison in applications that are more purely compute-bound, where the Xeons generally have both the IPC and clock frequency edge.

We should say a word here about Intel’s San Clemente platform, which we paired with its low-voltage Xeons. It’s a shame this platform isn’t more of a mainstream affair, and it’s a shame the memory controller is limited to only six DIMMs. Even with that limitation, San Clemente may be Intel’s best 2P server platform. In concert with the Xeon L5430, it’s even more power efficient than this first wave of Shanghai Opterons, and in several cases, the lower latency of DDR2 memory seemed to translate into a performance advantage over the Bensley platform in our tests. For servers that don’t require large amounts of RAM, there’s no better choice.

AMD argues that it has a window of opportunity at present, while its Shanghai Opterons are facing off in mainstream servers versus current Xeons. I would tentatively agree. For the right sort of application, an Opteron 2384-based system offers performance competitive with, and power draw lower than, a Xeon E5450 system based on the Bensley platform. The Xeon lineup has other options with consistently higher performance or lower power consumption, but the Shanghai Opterons match up well against Intel’s mainstream server offerings. (Workstations and HPC, of course, are another story.) If AMD can deliver on its plans for HyperTransport 3-enabled Opterons early next year, along with low-power HE and high-performance SE models, it may have a little time to regain lost ground in the server space before 2P versions of Nehalem arrive and the window slams shut.
