AMD’s ‘Shanghai’ 45nm Opterons

AMD’s quad-core Opterons have certainly had a rough life to this point. The original “Barcelona” Opterons were hamstrung by delays, unable to meet clock frequency and performance expectations, and plagued by a show-stopper bug that forced AMD largely to stop shipments of the chips for months while waiting for a new revision, as we first reported. Once the revised Opterons made it into the market, they faced formidable competition from Intel’s 45nm “Harpertown” Xeons, whose best-in-class performance and much-improved power efficiency have stolen quite a bit of the Opteron’s luster.

AMD is looking to reverse its fortunes with the introduction of a brand-new version of the quad-core Opteron, code-named Shanghai, manufactured on a new, smaller 45-nanometer fabrication process that should bring gains in power efficiency and clock speeds. Shanghai also has the considerable benefit of being the second generation of a new processor design, and AMD has taken the opportunity to tweak this design in innumerable ways, large and small, in order to improve its performance and, one would hope, allow it to meet its potential more fully. The result is an Opteron processor with higher clock speeds, improved performance per clock, and lower power consumption—a better proposition in almost every way than Barcelona.

Will it be enough to make the Opteron truly competitive with Intel’s latest Xeons? We’ve been testing systems for the past couple of weeks in Damage Labs in order to find out.

The Opteron gets Shanghaied

In spite of the troubles “Barcelona” Opterons have faced, AMD got quite a bit right in designing them—or so it would seem when peering down at the basic layout from high altitude. Barcelona was the first native quad-core x86-compatible processor, with four cores sharing a single piece of silicon. Each of those cores had its own 512KB L2 cache, and the four cores then shared a larger, on-chip 2MB L3 cache. Barcelona’s cores could also, of course, share data via this cache, making inter-core communication quick and relatively straightforward. In order to manage power consumption, Barcelona could modify the clock speed of each core independently in response to demand. In addition, the chip had dual power planes, one for the CPU cores and a second for the chip’s other elements—specifically, its L3 cache, integrated memory controller, and HyperTransport links. Voltage to either plane could be reduced independently, again in response to activity. All of these provisions seemed to make Barcelona an ideal candidate for servers and workstations based on AMD’s Socket F infrastructure, which in itself was a strength, thanks to a topology based on high-speed, point-to-point interconnects and CPUs with integrated memory controllers.

Few will argue these basic concepts aren’t sound, especially now that Intel has adopted a very similar architecture for its Nehalem processors, which are already available on the desktop in the form of the staggeringly fast Core i7 and will be headed to servers in the first half of next year.

Shanghai retains Barcelona’s strengths and looks to better capitalize on them. To that end, AMD has outfitted Shanghai with a larger, 6MB L3 cache and a host of tweaks aimed at bringing higher performance per clock and increased power efficiency.

Like the city for which it’s named, Shanghai is about growth: it comprises an estimated 758 million transistors, up from 463 million in Barcelona. Despite this growth, though, the smaller fabrication process means Shanghai has a smaller die area, at 258 mm², than Barcelona’s 283 mm².

AMD’s 45-nm fabrication process combines strained silicon and silicon-on-insulator techniques to achieve higher switching speeds at lower power levels, as did the past two generations of its fabrication technology. This time around, though, the firm has incorporated immersion lithography in order to reach smaller geometries. The use of a liquid medium between the lens and the wafer, as shown in the diagram on the right, offers improved focus and resolution versus the usual air gap in this space. AMD claims immersion lithography will be essential for the 32nm process node, even for Intel, and proudly notes that it has made the transition first.

Most of Shanghai’s additional transistors (versus Barcelona) come from its expanded L3 cache, whose performance benefits for many server-class workloads should be fairly obvious. A number of logic changes, many of them cache-related, consume fewer transistors but promise additional benefits. For example, along with the larger cache comes an enhanced data pre-fetch mechanism. This logic attempts to recognize data access patterns and speculatively loads likely-to-be-needed data into cache ahead of time. As caches grow, pre-fetch algorithms often become more aggressive. Shanghai can also probe the L1 and L2 caches in its cores for coherency information twice as often as Barcelona, which gives it double the probe bandwidth. This provision should be particularly helpful when a core has lowered its clock speed to conserve power while idle.
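
AMD hasn’t published the specifics of Shanghai’s prefetch logic, so here is a purely illustrative toy model of the general technique described above: a stride detector that watches for a repeating delta between accesses and fetches one step ahead once a pattern is established. The class name, addresses, and 64-byte stride are all invented for the example.

    # Toy model of a stride-based data prefetcher. Shanghai's real
    # algorithm is undisclosed; this only illustrates the general idea
    # of recognizing an access pattern and fetching ahead of demand.
    class StridePrefetcher:
        def __init__(self):
            self.last_addr = None
            self.last_stride = None

        def access(self, addr):
            """Record a demand access; return an address to prefetch, or None."""
            prefetch = None
            if self.last_addr is not None:
                stride = addr - self.last_addr
                # Two consecutive identical strides establish a pattern;
                # speculatively fetch the next line in the sequence.
                if stride != 0 and stride == self.last_stride:
                    prefetch = addr + stride
                self.last_stride = stride
            self.last_addr = addr
            return prefetch

    pf = StridePrefetcher()
    for a in (0x1000, 0x1040, 0x1080, 0x10c0):   # a 64-byte-stride stream
        hint = pf.access(a)
        if hint is not None:
            print(f"access {a:#x} -> prefetch {hint:#x}")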

In order to make sure its larger caches don’t cause data integrity problems, AMD has built in a new feature it calls L3 Cache Index Disable. This feature allows the CPU to turn off parts of the L3 cache if too many machine-check errors occur. This capability will apparently require OS-level support, and that’s not here quite yet. AMD expects “select operating systems” to bring support for this feature next year.

By contrast, the somewhat confusingly named Smart Fetch should have immediate benefits. Despite the name, Smart Fetch is primarily a power-saving feature intended to work around the fact that AMD’s caches are exclusive in nature—that is, the lower-level caches don’t replicate the entire contents of the higher-level caches. Exclusive caches have the simple benefit of extending the total effective size of the cache hierarchy—AMD justifiably bills Shanghai as having 8MB of cache—but they can present conflicts with dynamic power saving schemes. In Barcelona, for instance, a completely idle core would have to continue operating, though at a lower frequency, in order to keep its caches active and their contents available. Shanghai, by contrast, will dump the contents of that core’s L1 and L2 caches into the L3 cache and put the core entirely to sleep, essentially reducing its clock speed to zero. AMD claims this provision can reduce idle power draw by up to 21%. One core in the system must remain active at all times, but in a four-socket system, only a single core in one socket must keep ticking. Smart Fetch isn’t quite as impressive as the core-level power switching Intel built into Nehalem because it doesn’t eliminate leakage power, but it’s still a nice improvement over Barcelona.

One tweak in Shanghai that affects not just the cache but the entire memory hierarchy has to do with the chip’s support for nested page tables, a feature that accelerates memory address translation with system virtualization software. Shanghai maintains the same basic feature set as Barcelona here, but AMD claims a reduction in “world switch time” of up to 25% for Shanghai. That means the system should be able to transition from guest mode to hypervisor mode and then back to guest mode much more quickly. Since we’ve only had a couple of weeks following the release of the Core i7 to test Shanghai, we weren’t able to test this improvement ourselves, unfortunately. (Proper, publishable virtualization benchmarking is a non-trivial undertaking.) AMD says it tested the time required to make these two transitions (guest-to-hypervisor and hypervisor-to-guest) itself and measured a latency of 1360 cycles on Barcelona versus 900 cycles on Shanghai. Hypervisors that support the AMD-V feature set could thus see a marked improvement in performance in cases where virtual server performance is hampered by world-switch latency. Indeed, VMware has published some Shanghai performance numbers with VMware ESX 3.5 that show dramatic performance advantages over software-based shadow page tables.
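
To put those cycle counts in perspective, here is a quick back-of-the-envelope sketch. The latencies are AMD’s figures from above; the switch rate is an invented workload parameter, since real rates vary enormously. Note that the cited counts actually work out to roughly a one-third latency cut, a bit better than the headline 25% claim.

    # Back-of-the-envelope math on AMD's cited world-switch numbers.
    barcelona_cycles = 1360   # guest->hypervisor->guest round trip (AMD's figure)
    shanghai_cycles = 900     # same measurement on Shanghai (AMD's figure)

    reduction = 1 - shanghai_cycles / barcelona_cycles
    print(f"world-switch latency reduction: {reduction:.0%}")   # ~34%

    # Share of CPU time lost to world switches at a hypothetical rate
    # of 50,000 round trips per second on a 2.7GHz core:
    switches_per_sec = 50_000   # invented; depends heavily on the workload
    clock_hz = 2.7e9
    for name, cycles in (("Barcelona", barcelona_cycles),
                         ("Shanghai", shanghai_cycles)):
        share = switches_per_sec * cycles / clock_hz
        print(f"{name}: {share:.2%} of CPU time in world switches")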

Our 2P Opteron test system with 16GB of DDR2-800 memory

A couple of other changes ought to bring more general performance gains. Shanghai’s memory controller bumps up officially supported memory frequencies from 667MHz to 800MHz, for one. Also, HyperTransport 3 support is finally imminent. The first Shanghai processors don’t support it, mainly because AMD didn’t want to hold up these products’ introduction while waiting for full validation of HT3 solutions. Instead, the firm plans to introduce HT3-ready Opterons next spring. When those arrive, they’ll double the available bandwidth for CPU-to-CPU communication in Opteron systems. With HyperTransport clock speeds up to 2.2GHz, HT3 will allow for up to 17.6 GB/s of bandwidth (the bidirectional total) per link. Only with the introduction of the Fiorano platform later in 2009 will the CPU-to-chipset interconnect transition to HT3.
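
The 17.6 GB/s figure falls straight out of the link parameters. A quick sketch of the arithmetic, assuming the standard 16-bit-per-direction Opteron link width:

    # HyperTransport bandwidth arithmetic: a link transfers on both clock
    # edges (DDR), in both directions at once, two bytes at a time.
    def ht_gb_s(clock_ghz, width_bytes=2):
        per_direction = clock_ghz * 2 * width_bytes
        return per_direction * 2   # bidirectional total

    print(ht_gb_s(1.0))   # today's 1GHz Opteron links ->  8.0 GB/s per link
    print(ht_gb_s(2.2))   # HT3 at a 2.2GHz clock     -> 17.6 GB/s per link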

Pricing and availability

Even with all of these chip-level changes, the biggest news of the day may be the advent of Opterons with higher clock speeds and lower prices. The refreshed Shanghai lineup now looks like so:

Model          Clock speed   North bridge speed   ACP   Price
Opteron 2384   2.7GHz        2.2GHz               75W   $989
Opteron 2382   2.6GHz        2.2GHz               75W   $873
Opteron 2380   2.5GHz        2.0GHz               75W   $698
Opteron 2378   2.4GHz        2.0GHz               75W   $523
Opteron 2376   2.3GHz        2.0GHz               75W   $377
Opteron 8384   2.7GHz        2.2GHz               75W   $2,149
Opteron 8382   2.6GHz        2.2GHz               75W   $1,865
Opteron 8380   2.5GHz        2.0GHz               75W   $1,514
Opteron 8378   2.4GHz        2.0GHz               75W   $1,165

All of the new Opterons, ranging from 2.3 to 2.7GHz, fit into the same 75W thermal envelope, according to AMD’s “ACP” rating method (which it insists is the best analog to Intel’s TDP numbers, though Intel would disagree). Clock speeds overall are up, and notably, north bridge clocks participate in that advance. I say that’s notable because the north bridge clock governs the L3 cache, as well, which has a pretty direct impact on overall Opteron performance.

AMD expects all of the products above to be available immediately. Conspicuous by their absence are low-power HE and higher-speed SE derivatives of Shanghai. AMD intends for these HE and SE parts to fit into their traditional 55W and 105W thermal envelopes, respectively, when they arrive in the first quarter of next year. With the additional power headroom, the SE parts could quite possibly reach 3GHz, although only time will tell.

The Opteron’s next steps

The improvements in Shanghai sound pretty good, but many folks are still asking exactly what AMD will do in order to counter Intel’s Nehalem, which promises a similar system architecture and—by all current indications, at least—higher performance per core and per socket. Interestingly enough, AMD does have some credible answers to such questions, and it has disclosed quite a bit of its future Opteron roadmap in response. Here’s a quick overview of the basic plan:

AMD’s Opteron roadmap into 2011. Source: AMD.

Not noted above is the planned release of HyperTransport 3-enabled Opterons next spring. After that, the next big change will be the introduction of the Fiorano platform in mid-2009. Fiorano will be the first Opteron chipset based on the core-logic technology AMD acquired when it purchased ATI. That chipset will consist of the SR5690 I/O hub and the SP5100 south bridge. Fiorano will retain compatibility with Socket F-type CPUs, but will add several noteworthy enhancements, including full HyperTransport 3 and (at last) PCI Express 2.0, complete with support for device hot-plugging. As one would expect, Fiorano will support AMD’s IOMMU technology for fast and secure hardware-assisted virtualization of I/O devices.

A simple block diagram of the Fiorano platform. Source: AMD.

Fiorano will be scalable from 2P to 4P and 8P systems. As you can see in the diagram above, 4P Opteron systems will not be fully connected—there will still be two “hops” from one corner of a 4P system to the opposing corner. Also notable by its absence is support for DDR3 memory. Although the desktop Phenom II is expected to make the move to DDR3 in early 2009, the Opteron won’t follow until it makes a socket transition in 2010.

Before that happens, some time in late 2009, the Opteron lineup will get a boost with the release of a six-core processor code-named Istanbul. This 45-nm chip should look very much like Shanghai, but with two additional cores onboard—same 6MB L3 cache, same DDR2 memory controller, still HyperTransport 3. For certain applications, a six-core Opteron could conceivably be a nice alternative to Intel’s quad-core, eight-thread Nehalem-based Xeons, although by the time Istanbul arrives, Intel may be reaching new milestones in its own roadmap.

Istanbul looks like Shanghai plus two cores. Source: AMD.

Then comes the transition to the new G34 socket—the funky elongated, rectangular socket you may have seen in some reports—in 2010. This socket will bring a major infrastructure refresh for the Opteron. DDR3 support will come in with a bang; each socket is expected to support four channels of DDR3 memory. Also, the maximum number of HyperTransport 3 links per chip will rise from three to four, potentially enabling fully connected 4P systems.

Interestingly enough, all of the changes here will apparently come from modifications to the physical socket and to the Opteron processors themselves. Although AMD has given the new platform a code name, Maranello, it will use the same two core-logic chips as Fiorano.

The new processors will come in two distinct flavors: Sao Paulo, with six cores and 6MB of L3 cache, and the oh-so-cleverly named Magny-Cours, with a whopping 12 cores and 12MB of L3 cache. We don’t yet know whether or how these cores will be enhanced compared to Shanghai Opterons. Both chips will be manufactured with 45nm process tech, and the basic cache hierarchy on the Opteron will remain the same, with an exclusive L3. AMD will add additional smarts to these chips, though, in the form of a probe filter (or snoop filter) that will reduce cache coherency management traffic. Also, much like Nehalem, these processors will feature on-chip power management and thermal control capabilities, including the ability to raise and lower clock speeds based on thermal control points.

Beyond that, things become foggy. We know that AMD’s spun-off manufacturing arm, temporarily dubbed “the foundry company,” has plans to introduce two advanced 32-nm fabrication technologies in the first half of 2010, a high-performance process using SOI and a low-power process using high-k metal gates. Meanwhile, AMD is working on a next-generation CPU microarchitecture code-named “Bulldozer,” about which we know very little. Early information on Bulldozer suggested it would initially tape out on a 45nm process, but more recent rumblings from AMD suggest Bulldozer has been pushed back—the desktop variant to 2011—and may be a 32nm part.

Sizing up the Xeons and Opterons in our test

Intel, of course, hasn’t been sitting still since we last looked at its server/workstation-class processors. The firm is now shipping a new E stepping of its 45nm Xeons that reduces power draw and allows for slightly higher clock frequencies. All of the Xeons we tested for this review are based on E-stepping silicon. We had intended to review these Xeons in a separate article but weren’t able to complete it before this one, so we have a range of new-to-us products to test, based on multiple different Intel server- and workstation-class platforms.

The most direct competition for the Shanghai Opterons we’ve tested is the Xeon E5450, a 3GHz quad-core part with a 1333MHz front-side bus. We’ve tested the E5450 on Intel’s highest-volume server platform, known as Bensley. This platform, based on the Intel 5000P chipset, is getting a little long in the tooth and lacks a few features, like support for a 1600MHz FSB, 800MHz FB-DIMMs, and a full-coverage snoop filter. However, it is still the predominant Xeon server platform, and is thus the best basis of comparison versus the Opteron systems we’re testing. The Xeon E5450 is priced at $915 in volume, quite close to the $989 price tag of the Shanghai Opteron 2384. The two chips also share similar thermal envelopes; the Xeon E5450 is rated at an 80W TDP and the Opteron 2384 has a 75W ACP. (Assuming you buy AMD’s arguments about its ACP ratings, at least, the two should be similar. We will test power consumption ourselves, regardless.)

We have also, of course, included AMD’s best 65nm Opteron within this same thermal envelope, the 2356, to see how it compares to Shanghai.

Intel’s 45nm Xeons extend into higher-performance and lower-power territory in some interesting ways, as well. The low-voltage Xeon L5430, for instance, has specs very similar to the E5450—quad cores, 2.66GHz core clock, 1333MHz bus, 12MB total L2 cache—but comes with a TDP rating of just 50W. For our testing, we’ve mated it with a very intriguing low-power server platform from Intel, known as San Clemente.

This is our first look at San Clemente, which Intel hasn’t pushed especially hard in the mainstream server or workstation spaces. Instead, Intel has aimed it primarily at dense blade servers and embedded systems like routers, SANs, and NAS boxes. That’s kind of a shame, since the Intel 5100 MCH at the heart of San Clemente makes a key power-saving move, shunning Fully Buffered DIMMs for registered DDR2 memory modules just like Opterons use. FB-DIMMs allow for higher total system memory capacities, but they exact notable penalties in terms of both memory access latencies and power consumption. San Clemente’s power consumption could be quite a bit lower than Bensley’s, as could its memory access latencies. Like Bensley, San Clemente has dual, independent front-side bus connections to each socket in a 2P system, as well.

The tradeoffs are several. The 5100 MCH is limited to a maximum of six DIMMs per system and 48GB of total memory, versus 16 FB-DIMMs and 64GB total memory for Bensley. Also, the 5100 MCH’s two channels of DDR2-667 memory yield a peak of 10.6 GB/s of bandwidth, compared to Bensley’s 21 GB/s max.
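
Those peak figures are easy to reproduce: each 64-bit channel moves eight bytes per transfer at the memory’s effective data rate. A small sketch of the arithmetic (the nominal 667MT/s rate rounds to 10.7 GB/s; the text’s 10.6 GB/s reflects the true 666.7MT/s clock):

    # Peak memory bandwidth: transfers/sec x channels x 8 bytes per 64-bit transfer.
    def peak_gb_s(mt_per_sec, channels):
        return mt_per_sec * channels * 8 / 1000

    print(peak_gb_s(667, 2))   # San Clemente, 2 x DDR2-667    -> ~10.7 GB/s
    print(peak_gb_s(667, 4))   # Bensley, 4 x FBD-667 channels -> ~21.3 GB/s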

The guts of our San Clemente test rig, fully populated with six DIMMs

Our San Clemente test system underscored its memory capacity limitations by proving to be incompatible with our 2GB DDR2 DIMMs, for whatever reason. We were limited to testing with only 6GB of total memory by populating each of its DIMM slots with 1GB modules.

Since AMD’s 45nm Opteron HE products aren’t out yet, the closest competition we have for the Xeon L5430/San Clemente combo is the Opteron 2347 HE, a 65nm part with a 55W ACP (68W TDP), a 1.9GHz core clock, and a 1.6GHz north bridge/L3 cache clock. That’s a rough comparison for AMD, but things should change once the 45nm Opteron HE parts arrive next quarter.

At the other end of the spectrum entirely is the Xeon X5492, an ultra-high-end processor (nearly $1,500 list) that tests the outer limits of Intel’s 45nm process tech with a 3.4GHz core clock, a 1600MHz FSB, and a 150W TDP rating. We’ve tested a pair of these babies on the Stoakley platform. Stoakley is essentially an updated version of the Bensley platform with higher bandwidth, but it’s been targeted largely at workstations and HPC systems.

There really is no Opteron analog to the Xeon X5492. The closest comparison might be to the 65nm Opteron 2360 SE, which has a 105W ACP (and 119W TDP), but Shanghai has higher clock frequencies and a larger cache in a much smaller power budget, so the 2360 SE is essentially obsolete. Again, we may have to wait for the introduction of 45nm Opteron SE models before we have a truly comparable product from AMD—and even then, AMD may choose not to produce an Opteron with a 150W thermal envelope.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Dual Xeon E5450 3.0GHz (Bensley)
  • System bus: 1333MHz (333MHz quad-pumped)
  • Motherboard: SuperMicro X7DB8+ (BIOS revision 6/23/2008)
  • North bridge: Intel 5000P MCH; south bridge: Intel 6321 ESB ICH
  • Chipset drivers: INF Update 9.0.0.1008
  • Memory: 16GB (eight 2048MB DDR2-800 FB-DIMMs) at 667MHz effective, CL 5, tRCD 5, tRP 5
  • Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
  • Power supply: Ablecom PWS-702A-1R 700W
  • Graphics: integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon X5492 3.4GHz (Stoakley)
  • System bus: 1600MHz (400MHz quad-pumped)
  • Motherboard: SuperMicro X7DWA (BIOS revision 8/04/2008)
  • North bridge: Intel 5400 MCH; south bridge: Intel 6321 ESB ICH
  • Chipset drivers: INF Update 9.0.0.1008
  • Memory: 16GB (eight 2048MB DDR2-800 FB-DIMMs) at 800MHz, CL 5, tRCD 5, tRP 5
  • Storage controller: Intel 6321 ESB ICH with Matrix Storage Manager 8.6
  • Power supply: Ablecom PWS-702A-1R 700W
  • Graphics: integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Xeon L5430 2.66GHz (San Clemente)
  • System bus: 1333MHz (333MHz quad-pumped)
  • Motherboard: Asus RS160-E5 (BIOS revision 8/08/2008)
  • North bridge: Intel 5100 MCH; south bridge: Intel ICH9R
  • Chipset drivers: INF Update 9.0.0.1008
  • Memory: 6GB (six 1024MB registered ECC DDR2-667 DIMMs) at 667MHz, CL 5, tRCD 5, tRP 5
  • Storage controller: Intel ICH9R with Matrix Storage Manager 8.6
  • Power supply: FSP Group FSP460-701UG 460W
  • Graphics: integrated XGI Volari Z9s with 1.09.10_ASUS drivers

Dual Opteron 2347 HE 1.9GHz / Dual Opteron 2356 2.3GHz
  • System bus: 1GHz HyperTransport
  • Motherboard: SuperMicro H8DMU+ (BIOS revision 3/25/08)
  • North and south bridge: Nvidia nForce Pro 3600
  • Memory: 16GB (eight 2048MB registered ECC DDR2-800 DIMMs) at 667MHz, CL 5, tRCD 5, tRP 5
  • Storage controller: Nvidia nForce Pro 3600
  • Power supply: Ablecom PWS-702A-1R 700W
  • Graphics: integrated ATI ES1000 with 8.240.50.3000 drivers

Dual Opteron 2384 2.7GHz
  • System bus: 1GHz HyperTransport
  • Motherboard: SuperMicro H8DMU+ (BIOS revision 10/15/08)
  • North and south bridge: Nvidia nForce Pro 3600
  • Memory: 16GB (eight 2048MB registered ECC DDR2-800 DIMMs) at 800MHz, CL 6, tRCD 5, tRP 5
  • Storage controller: LSI Logic Embedded MegaRAID with 8.9.518.2007 drivers
  • Power supply: Ablecom PWS-702A-1R 700W
  • Graphics: integrated ATI ES1000 with 8.240.50.3000 drivers

All systems used a WD Caviar WD1600YD 160GB hard drive and ran Windows Server 2008 Enterprise x64 Edition with Service Pack 1.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

This test shows us a nice visual picture of the memory bandwidth available at the different levels of the memory hierarchy. One can see the impact of the Opteron 2384’s larger L3 cache at the 16MB test block size, where it’s much faster than the older quad-core Opterons. Still, the Xeons’ caches typically achieve quite a bit higher throughput than the Opterons’.

Our graph is tough to read at the largest test block sizes where main memory comes into play. Here’s a closer look at the 256MB block size, which should be a good indicator of main memory bandwidth.

These results are consistent with what we’ve seen in the past from most of these platforms. I believe these results only show the bandwidth available to a single CPU core, so they’re substantially less than the peak available in the entire system. The Opterons appear to benefit greatly from their integrated memory controllers here, and the Shanghai Opteron 2384 takes advantage of its faster 800MHz memory, as well.

The Opteron 2384’s revamped cache and TLB hierarchy, along with faster memory, delivers major reductions in memory access latency. With the 65nm Barcelona Opterons, we’ve found that the L3 cache tends to contribute quite a bit of latency to the overall picture. Yet with three times the L3 cache of the Opteron 2356, the 2384 is still faster to main memory. Let’s have a closer look at the cache picture and see why that is.

Before we do, though, we should also point out that the Xeon L5430 on the San Clemente platform has much lower access latencies than the E5450 on the Bensley platform, although they share the same bus frequency and topology. Assuming there aren’t any other major contributing factors, FB-DIMMs would appear to add about 14ns of delay versus DDR2 modules at the same 667MHz clock speed. The Stoakley platform essentially makes up that deficit by using higher bus and memory frequencies.

Note that, below, I’ve color-coded the block sizes that roughly correspond to the different caches on each of the processors. L1 data cache is yellow, L2 is light orange, L3’s darker orange, and main memory is brown.

These graphs offer a good visual representation of the data, but perhaps some numbers would illuminate things further. Because the Opteron’s L3 cache is clocked independently from the CPU cores, it doesn’t make sense to quantify that cache’s latency in terms of CPU clock cycles. In this case, the Opteron 2356’s L3 cache runs at 2GHz, while the 2384’s runs at 2.2GHz—a 10% increase. Despite the fact that the 2384’s L3 cache is three times the size, though, its latencies are considerably lower. At the 2048KB block size and step size of 256, the 2356’s latency is 23ns, while the 2384’s is only 16ns—a reduction of nearly a third.
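
Because the L3 runs at the north bridge clock, converting those nanosecond figures into north bridge cycles makes the comparison concrete. The sketch below uses only the clocks and latencies quoted above; the inference in the final comment is ours.

    # Convert the measured L3 latencies into north bridge cycles.
    tests = {
        "Opteron 2356 (L3 @ 2.0GHz)": (23e-9, 2.0e9),
        "Opteron 2384 (L3 @ 2.2GHz)": (16e-9, 2.2e9),
    }
    for name, (latency_s, nb_hz) in tests.items():
        print(f"{name}: {latency_s * nb_hz:.0f} NB cycles")
    # 2356 -> ~46 cycles; 2384 -> ~35 cycles. The 10% clock bump alone
    # would only cut 23ns to about 21ns, so most of the improvement must
    # come from the redesigned cache itself, not the faster clock.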

SPECjbb 2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We did not intend to challenge the best published scores with our results, but we did hope to achieve reasonably optimal tuning for our test systems. To that end, we used a fast JVM—the 64-bit version of Oracle’s JRockit JRE R27.6—and picked up some tweaks for tuning from recently published results. We used two JVM instances with the following command line options:

start /AFFINITY [0F, F0] java -Xms3700m -Xmx3700m -XXaggressive -XXlazyunlocking -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads=4 -Xns3200m -XXcallprofiling -XXtlasize:min=4k,preferred=512k -XXthroughputcompaction

Notice that we used the Windows “start” command to affinitize our threads on a per-socket basis. We also tried affinitizing on a per-chip basis for the Xeon systems, but didn’t see any performance benefit from doing so. The one exception to the command line options above was our Xeon L5430/San Clemente system. Since it had only 6GB of memory, we had to back the heap size down to 2200MB for it.
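
For the unfamiliar, the hex values passed to start /AFFINITY are bitmaps of logical CPUs. The decoding below assumes the OS enumerates one socket’s four cores before the other’s, the usual arrangement on 2P boards like these:

    # Decode the affinity masks from the SPECjbb command line above.
    for mask in (0x0F, 0xF0):
        cpus = [bit for bit in range(8) if mask & (1 << bit)]
        print(f"mask {mask:02X} -> logical CPUs {cpus}")
    # mask 0F -> logical CPUs [0, 1, 2, 3]   (first socket)
    # mask F0 -> logical CPUs [4, 5, 6, 7]   (second socket)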

The Opteron 2384’s performance here is undeniably impressive—a massive leap from the performance of the Opteron 2356, and substantially higher than its most direct competitor, the Xeon E5450. In fact, at just 2.7GHz, the Shanghai Opteron performs nearly as well as the most exotic member of the group, the 3.4GHz Xeon X5492, remarkably enough. Shanghai’s larger L3 cache and other tweaks, combined with the Opteron’s native quad-core design and strong system architecture, yield big returns in this server-class workload.

Not only that, but Oracle has hinted to us that even higher performance is possible when using a version of JRockit optimized for Shanghai. That version of JRockit hasn’t yet been released, but we understand it’s on the way.

Before we move on, let’s take a quick look at power consumption during this test. SPECjbb 2005 is the basis for SPEC’s own power benchmark, which we had initially hoped to use in this review, but time constraints made that impractical. Nevertheless, we did capture power consumption for each system during a test run using our Extech 380803 power meter. All of the systems used the same model of Ablecom 700W power supply unit, with the exception of the Xeon L5430 server, which used an FSP Group 460W unit. Power management features (such as SpeedStep and Cool’n’Quiet) were enabled via Windows Server’s “Balanced” power policy.

Although it has the same 75W ACP rating as the Opteron 2356, the Opteron 2384 draws substantially less power at every step of the way. The Xeon E5450 system is practically a power hog by comparison, with much higher peak power draw. The bright spot for Intel here is the Xeon L5430/San Clemente system with DDR2 memory, whose power consumption is admirably low—almost 60W less than the Opteron 2384 system.

Have a look at what happens when we consider performance per watt.

Our Opteron 2384 system combines higher performance with lower power draw than the Xeon E5450 system, so its “bops per watt” lead is predictably large. Shanghai certainly looks good in this light.

Meanwhile, the Xeon L5430 aims to steal the limelight. It’s a bit of a wild card, with less total memory (six DIMMs to the other systems’ eight) and a much lower-wattage PSU (which may be more efficient at these load levels). Still, one can’t deny the efficiency of its 50W quad-core Xeons—and one can’t help but wonder whether Intel made the right call in choosing FB-DIMMs for its mainstream server platform just as performance per watt was becoming perhaps the key metric for server evaluations.

Cinebench rendering

We can take a closer look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.

This is a very different sort of application, in which the Shanghai Opterons’ larger cache and faster memory don’t bring the sort of performance gains we saw in SPECjbb. Here, the Xeon E5450 is faster. In fact, the Xeons are faster clock for clock—the Xeon L5430 at 2.66GHz outperforms the Opteron 2384 at 2.7GHz.

As we did with SPECjbb, we measured power draw at the wall socket for each of our test systems across a set time period, during which we ran Cinebench’s multithreaded rendering test.

Some of the outcomes are obvious immediately, like the fact that the Xeon E5450 and X5492 systems have much higher overall power draw. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Opteron 2384 system’s idle power draw is just over 10W less than that of the system based on its 65nm predecessor, the 2356, in spite of the fact that its L3 cache is larger and runs at a higher clock speed. Shanghai’s ability to flush its L1 and L2 caches into the L3 and shut down its cores does appear to pay dividends. Even so, those incremental gains seem small in light of the considerably higher idle power draw of the FB-DIMM-equipped Xeon systems.

Meanwhile, the low-voltage Xeons and San Clemente continue to impress.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

Peak power draw with Cinebench isn’t quite as high as it is with SPECjbb, but the trends remain the same. The Shanghai Opterons draw less power, at a higher clock speed, than their 65nm counterparts.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
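
Mechanically, that calculation is just an integral of sampled power over the isolated render window. A minimal sketch, using invented placeholder samples rather than our measured data:

    # Integrate sampled power over the render interval to get joules.
    samples = [(0.0, 250.0), (1.0, 252.0), (2.0, 251.0), (3.0, 180.0)]  # (sec, watts)
    render_start, render_end = 0.0, 2.0   # hypothetical render window

    energy_j = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t0 >= render_start and t1 <= render_end:
            energy_j += (p0 + p1) / 2 * (t1 - t0)   # trapezoidal rule

    print(f"energy used during render: {energy_j:.1f} J (watt-seconds)")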

Even though the Opteron 2384 isn’t as fast as the Xeon E5450 in this application, the Opteron system still requires less energy to render the scene than the Xeon E5450-based one. The Opteron 2356 isn’t nearly as efficient as either, but the Shanghai Opterons more than restore AMD’s competitiveness in power-efficient performance.

With that said, the biggest winner here, obviously, is the Xeon L5430 system, which is simply in a class by itself.

XML handling

We are working, bit by bit, to add additional pieces to our server/workstation CPU benchmark suite over time. As part of that effort, our web developer and sysadmin, Stephen Roylance, has put together an XML handling benchmark for us. He based some elements of this test on parts of the open-source XML Benchmark project, but Steve ported everything to Microsoft’s C# language and .NET runtime environment. Here’s how he describes the program:

The program runs four different XML related tests for a configurable number of cycles, across a configurable number of threads.

The four operations are:

  • Read a test XML file and parse it.
  • Generate a randomized XML tree and write it into memory as a document.
  • Transform an XML tree with XSLT, writing the resulting document into memory.
  • Attach a cryptographic signature to a parsed XML tree, write it back to memory as an XML document in a string, parse it and verify the signature.

In contrast to SPECjbb, this test is written in C# and runs under Microsoft’s .NET common language runtime. It should be a reasonable simulation of real-world CPU workloads on servers running ASP.NET web applications.

The results you see below show the total time required to execute 100 iterations of an interleaved mix of the four thread types across eight concurrent threads. We tested using the benchmark’s “medium” file size option, so the data files involved were around 256KB in size.
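
The benchmark itself is C#/.NET, but its dispatch structure—a fixed, interleaved mix of the four task types drained by eight concurrent workers—is easy to sketch. Here’s a rough Python analogue; the task names and the do_xml_work stub are stand-ins, not the benchmark’s actual API.

    # Rough analogue of the benchmark's structure: eight workers pull an
    # interleaved mix of the four XML task types from a shared queue.
    import queue
    import threading

    TASKS = ["parse", "generate", "transform", "sign_and_verify"]

    def do_xml_work(task):
        pass   # stand-in; the C# original parses/builds/transforms/signs XML

    def run_benchmark(iterations=100, workers=8):
        q = queue.Queue()
        for i in range(iterations * len(TASKS)):
            q.put(TASKS[i % len(TASKS)])      # interleave the task types

        def worker():
            while True:
                try:
                    task = q.get_nowait()
                except queue.Empty:
                    return
                do_xml_work(task)

        threads = [threading.Thread(target=worker) for _ in range(workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    run_benchmark()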

Curiously enough, the Opteron 2384 is no faster than the 2356 here. This data point, plus a couple of others, points to a possible cause. The Opterons struggle versus the Xeons overall, which suggests that the NUMA memory subsystem of the Opteron systems may be causing problems. In other words, we may have threads ping-ponging between cores on different sockets, forcing them to access memory associated with the non-local CPU socket, draining performance. Also, notice how much quicker the Xeon L5430 is in this test than the E5450. That fact heightens my suspicion that this test is particularly sensitive to memory access latencies. In SPECjbb, where the Shanghai Opterons were much more effective, we explicitly affinitized threads with CPU sockets.

Of course, performance bottlenecks like this one are a day-to-day reality of living with the Opteron’s NUMA architecture, and many off-the-shelf applications aren’t NUMA-aware. This issue will probably get quite a bit more attention soon, since Intel will be making the move to a very similar NUMA arrangement with 2P Nehalem systems.

Our next steps in the development of this benchmark will have to include affinitizing threads with sockets, if at all possible. Also, we’d like to report execution times for individual thread types, and I believe Steve plans to write up a blog post about this benchmark and release the source code. If any of our readers have suggestions for improvement, he’ll be taking them at that time.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
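
The job-splitting scheme is simple to model. This sketch reproduces the figures in the description above (four threads, forty jobs, roughly 168 yeast proteins apiece); the function and the placeholder protein list are illustrative only.

    # Model of MyriMatch's work division: threads x 10 jobs, pulled from
    # a queue so that threads finishing early keep working.
    def make_jobs(proteins, n_threads, jobs_per_thread=10):
        n_jobs = n_threads * jobs_per_thread
        size = -(-len(proteins) // n_jobs)   # ceiling division
        return [proteins[i:i + size] for i in range(0, len(proteins), size)]

    proteins = [f"protein_{i}" for i in range(6714)]   # yeast database size from the text
    jobs = make_jobs(proteins, n_threads=4)
    print(len(jobs), len(jobs[0]))   # -> 40 168, matching the 1/40th figure above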

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Opteron 2384 finishes ahead of the Xeon E5450, with a best time two seconds faster than the Xeon. What’s more interesting is how it gets there: the Xeon is faster at every thread count from one to six, but the Opteron scales better when taking that last step to an optimal thread count, likely thanks to its native quad-core layout and integrated memory controller.

Another intriguing development is the fact that the Xeon L5430/San Clemente system is nearly as fast as the Xeon E5450 on the Bensley platform—faster at low thread counts, in fact—in spite of the L5430’s clock speed deficit.

And, well, there’s a storm brewing on the horizon. A single-socket desktop version of Nehalem, the Core i7-965 Extreme, completed this same test in only 60 seconds, 10 seconds ahead of even our dual-socket Xeon X5492 system. Granted, that’s with a different OS with possible kernel tuning advantages and exotic 1600MHz RAM, but one can’t help but wonder how a dual-socket Nehalem system might perform.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

The Opteron 2384 can’t quite catch the Xeons here, but consider the match-up against the Xeon L5430. The L5430 reaches a much higher frequency with a single thread, but its advantage gradually erodes as the number of threads climbs. At eight threads, the L5430 is only ahead by a fraction.

For comparison’s sake, by the way, the single-socket Core i7-965 Extreme broke the 5Hz barrier on this test—again, well ahead of our Xeon X5492 system.

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs are finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
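
Reduced to arithmetic, the CD’s scoring method looks like the sketch below. The per-WU rates are invented placeholders, and only Tinker, Amber, and Gromacs are named in the text (the fourth entry is hypothetical); the method itself—average across WU types, then multiply by core count—is as described above.

    # The benchmark CD's points-per-day estimate, reduced to arithmetic.
    wu_ppd = {"Tinker": 300.0, "Amber": 280.0, "Gromacs": 420.0, "WU4": 380.0}
    cores = 8   # e.g., a 2P quad-core system

    estimate = sum(wu_ppd.values()) / len(wu_ppd) * cores
    print(f"estimated {estimate:.0f} points per day")   # -> 2760 with these inputs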

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

The Xeons are plainly faster here, and the scores for both the AMD and Intel processors appear to scale rather linearly with clock speed improvements.

3D modeling and rendering

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

Shanghai’s performance gains here aren’t quite sufficient to allow the Opteron 2384 to catch the Xeon E5450, but they are remarkably solid improvements, especially in the benchmark scene. The question is: why? POV-Ray hasn’t been particularly sensitive to cache sizes or memory bandwidth in recent years. During my recent visit to AMD’s Austin, Texas campus, one of AMD’s engineers told me that Shanghai’s branch prediction algorithm had been tweaked to improve its accuracy in certain cases, and one of the applications that should benefit from that tweak is POV-Ray. Looks like it helped.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into its games.

This is our final lighting/rendering-type test, and the results are what we’ve come to expect, more or less.

x264 HD video encoding

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.

For more workstation-oriented applications like this one, the Xeons have a consistent edge over the Opterons, and Shanghai doesn’t really change that.

Sandra Mandelbrot

We’ve included this final test largely just to satisfy our own curiosity about how the different CPU architectures handle SSE extensions and the like. SiSoft Sandra’s “multimedia” benchmark is intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

The benchmark contains many versions (ALU, MMX, (Wireless) MMX, SSE, SSE2, SSSE3) that use integers to simulate floating point numbers, as well as many versions that use floating point numbers (FPU, SSE, SSE2, SSSE3). This illustrates the difference between ALU and FPU power.

The SIMD versions compute 2/4/8 Mandelbrot point iterations at once – rather than one at a time – thus taking advantage of the SIMD instructions. Even so, 2/4/8x improvement cannot be expected (due to other overheads), generally a 2.5-3x improvement has been achieved. The ALU & FPU of 6/7 generation of processors are very advanced (e.g. 2+ execution units) thus bridging the gap as well.
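
Sandra’s column-interleaved work division is easy to see in code. The Python sketch below partitions columns across threads exactly as the FAQ describes; the image is shrunk from Sandra’s 640×480 so a pure-Python run finishes quickly, and Python’s GIL means the threads illustrate the partitioning scheme rather than any real speedup.

    # Column-interleaved Mandelbrot: thread k of N computes columns
    # k, k+N, k+2N, ... as Sandra's FAQ describes.
    import threading

    W, H, MAX_ITER = 160, 120, 255   # Sandra uses 640x480 and 255 iterations

    def mandel(cx, cy):
        x = y = 0.0
        for i in range(MAX_ITER):
            if x * x + y * y > 4.0:
                return i
            x, y = x * x - y * y + cx, 2 * x * y + cy
        return MAX_ITER

    image = [[0] * W for _ in range(H)]

    def worker(k, n_threads):
        for col in range(k, W, n_threads):   # this thread's interlaced columns
            cx = -2.5 + 3.5 * col / W
            for row in range(H):
                cy = -1.25 + 2.5 * row / H
                image[row][col] = mandel(cx, cy)

    threads = [threading.Thread(target=worker, args=(k, 4)) for k in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("center pixel iterations:", image[H // 2][W // 2])   # in-set point -> 255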

We’re using the 64-bit version of the Sandra executable, as well.

Shanghai is nearly as fast, clock for clock, as the Xeon in both the integer x8 and FP double tests. The Opteron 2384 runs neck and neck with the 2.66GHz Xeon L5430.

Conclusions

The Shanghai Opterons’ higher clock speeds, larger and quicker L3 cache, and improved memory subsystem are just what the doctor ordered for AMD’s quad-core CPU architecture. These changes, along with lower power consumption both at idle and under load, go a long way toward alleviating the weaknesses of the 65nm Barcelona Opterons. The Opteron 2384’s ability to outperform the Xeon E5450 in SPECjbb is dramatic proof of Shanghai’s potency. Similar server-class workloads are likely to benefit with Shanghai, as well, so long as they are properly NUMA-aware. Both in SPECjbb and in the more difficult case (for the Opteron) of the Cinema 4D renderer, we found our Opteron 2384-based system to be quantifiably superior to FB-DIMM-equipped Xeon systems in terms of power-efficient performance.

The new Opterons are clearly more competitive now, but they were still somewhat slower overall in the HPC- and workstation-oriented applications we tested, with the lone exception of MyriMatch. In many cases, Shanghai at 2.7GHz was slightly behind the Xeon L5430 at 2.66GHz. The Opteron does best when it’s able to take advantage of its superior system architecture and native quad-core design, and it suffers most by comparison in applications that are more purely compute-bound, where the Xeons generally have both the IPC and clock frequency edge.

We should say a word here about Intel’s San Clemente platform, which we paired with its low-voltage Xeons. It’s a shame this platform isn’t more of a mainstream affair, and it’s a shame the memory controller is limited to only six DIMMs. Even with that limitation, San Clemente may be Intel’s best 2P server platform. In concert with the Xeon L5430, it’s even more power efficient than this first wave of Shanghai Opterons, and in several cases, the lower latency of DDR2 memory seemed to translate into a performance advantage over the Bensley platform in our tests. For servers that don’t require large amounts of RAM, there’s no better choice.

AMD argues that it has a window of opportunity at present, while its Shanghai Opterons are facing off in mainstream servers versus current Xeons. I would tentatively agree. For the right sort of application, an Opteron 2384-based system offers competitive performance and lower power draw than a Xeon E5450 system based on the Bensley platform. The Xeon lineup has other options with consistently higher performance or lower power consumption, but the Shanghai Opterons match up well against Intel’s mainstream server offerings. (Workstations and HPC, of course, are another story.) If AMD can deliver on its plans for HyperTransport 3-enabled Opterons early next year, along with low-power HE and high-performance SE models, it may have a little time to regain lost ground in the server space before 2P versions of Nehalem arrive and the window slams shut.

Comments closed
    • ssidbroadcast
    • 11 years ago

    I’d just like to say two things:

    1) that I enjoyed the review and I think Scott does great work.

    2) Regarding page 8 and the ASP.NET performance via the “home brewed” XML benchmarker, I had a concern. Now, I don’t want to sound like how *certain* TR members have sounded in the past (and present), but isn’t there some sort of issue with the way that a program is compiled that favors Genuine(tm) Intel Processors during runtime? Could the Shanghai be getting “the run around” just because some specific SSE set isn’t detected? Or is that whole conspiracy pure humbug?

      • srg86
      • 11 years ago

      That’s normally only with Intel’s compilers afaik.

      • sdack
      • 11 years ago

      The conspiracy theory is still alive, and I believe these are never-ending stories, as there is no simple way to end them.

      What has made it a minor issue is that even if there is anything suspicious inside a piece of code compiled by Intel’s compiler, the result often still runs faster than what Microsoft’s compiler or the GNU compiler produces. This makes it a bit tricky to argue against Intel, because it is at best only immoral.

      Even if Intel continues to put AMD processors at a disadvantage, that does not stop AMD from releasing its own compiler. So why accuse Intel of trying to look better than the competition while AMD lets it do all the compiler development?

      Unfair and unfair does not make it fair again, it makes it twice as unfair for the users, but I believe most have settled with the idea that the way it is now is as good as it can get.

    • glynor
    • 11 years ago

    Good review (and testing), Scott. Thanks for all the hard work.

    • swaaye
    • 11 years ago

    Well, make me a Phenom with one of these cores and make it cheaper than a Q6600 and I’m in. Gotta replace that Athlon X2 in my 780G mobo with something someday.

    Pretty sad that it looks like Q6600 is still generally going to beat this core per clock. It’s really close though. Only took ~2 years!!

      • Krogoth
      • 11 years ago

      It seems likely, given that the Q6600 is already retiring before the Phenom variants of Shanghai enter the market.

    • Hattig
    • 11 years ago

    Nice review. A real shame that virtualisation wasn’t tested, because that is what is important in many server installations now.

    In addition it is a shame that >2 sockets wasn’t tested. This is where AMD excel, and this market is growing. With >2S for Nehalem so long away, and requiring an entire platform evaluation within many companies (or they could stick with the tried and tested Socket F) this would be a good test for AMD.

    Also when you have a benchmark available in C, C++ and Java, why rewrite it in C#, and then complain about problems with the C# implementation as it stands? Why not present the C and Java benches as well?

    I will agree with the other poster here that said that a lot of the number crunching applications have moved or will move to GPUs, and thus these benchmarks will be less and less relevant in the future compared to other processing tests. Still, right now they’re relevant and AMD didn’t do many improvements here sadly – Shanghai being a non-core enhancement mostly.

    As far as I can see it, Shanghai in a 2P situation has improved AMDs position in the market greatly compared to Barcelona, especially where it matters. Some of the benchmarks show excellent scaling for Shanghai, but sadly stop at 8 threads on 2S, 4S benches would be very interesting. Indeed considering Intel’s lack of movement in 4S and above I guess that Shanghai is just cementing being the only choice here. Shame that no-one wants to benchmark it… (maybe AMD should send some systems out, tsk, their marketing people suck)

      • TravelMug
      • 11 years ago

      Virtualization is hard to benchmark. For one thing, which solution would you pick (Xen, KVM, VMware ESX, VMware’s software hypervisors)? Then what settings and what tests would you perform? Also, if you want to see VMware VMmark results, you can just have a look at their website. If they haven’t changed it in the meantime, the license for VMmark also says you can’t publish results without their approval. But then again, why would you even duplicate the work already done? There’s plenty of submissions on their site to draw the required conclusions.

      About the 2 socket and 4 socket systems:

      As correctly pointed out in the conclusion by Scott, AMD now has a small window of opportunity to flog the Opteron line. That window will close in 2009Q1 with the arrival of the Xeon 5500 series. Take a look at the submitted SPEC results so far. The throughput tests (where AMD shines/shined) say a lot. The best results for a 2 socket system with the fastest available Opteron, the 2384, currently are (peak/base):

      SPECint_rate_2006 = 136/113
      SPECfp_rate_2006 = 118/105

      Compare this to the 1 socket i7-965 submission by Asus (=most likely not ideally optimized and as a bonus done under Vista):

      SPECint_rate_2006 = 125/117
      SPECfp_rate_2006 = 86/83

      So basically a one socket Nehalem based system will be able to compete with the best 2 socket Opteron system in INT and have a deficit of 20-25% in FP. Now if you scale the results to a 2 socket Nehalem system based on how the rate tests scale from the other submissions you’ll get something like this:

      SPECint_rate_2006 = 250/234
      SPECfp_rate_2006 = 172/166

      Now compare these to the best 4 socket 8384 submission:

      SPECint_rate_2006 = 249/202
      SPECfp_rate_2006 = 170/156

      What you get is parity between the 2 socket Nehalem system and the 4 socket 8384 system, performance-wise. Except that the 4 socket system is way more expensive.

      So the sad fact (for AMD) is that in about three months’ time they lose the 2 socket performance crown for good (in any benchmark, be it performance or power efficiency) and their 4 socket systems will be under heavy attack from the best 2 socket Nehalem systems out there.

        • dragmor
        • 11 years ago

        The Spec Rate tests are basically a test of *[

          • TravelMug
          • 11 years ago

          Those are throughput tests, not pure memory bandwidth tests. Without going deep into the details, just look at the 2 socket Opteron results with various CPU clocks but the same NB and HT clock: the result increases with the clock speed. Of course, they (Intel) argued that where they were behind in performance, it was due to the FSB bottleneck. The cores could process more, but the data simply did not get to them. Now it does.

          The i7’s obvious superiority thanks to the removed bottleneck is not the issue here. The main issue (again, mainly from AMD’s business point of view) is that a cheaper dual-socket system will outperform their quad-socket systems, eliminating whatever advantage they have left in the server market. No sane person will shell out big bucks for a premium-priced quad-socket Opteron system if he or she can get the same performance from a much cheaper dual-socket system.

          It would be really nice to see some real-world server workload benchmarks comparing a 2 x 2384 system and the i7-965 system, even if the i7 would only have 12GB of RAM. I doubt that would be a limitation for those benchmarks; most could be set up to account for the memory difference.

      • sroylance
      • 11 years ago

      The existing XML benchmarks were all written to compare different XML library implementations against each other and were not useful for benchmarking CPUs. Since I had to rewrite it anyway, I also wanted to fill another gap: ASP.NET is widely used, and while we have a Java web services benchmark in our server suite, we didn’t have anything that runs under the .NET CLR.

    • Forge
    • 11 years ago

    I’d really like to pester you for some info on the settings, source material, etc. for the x264 benchmark. I routinely take 1080i and 1080p source (OTA HDTV and my Blu-rays, respectively) down to 720p, and your first-pass results seem low and your second-pass results seem high compared to my usual workloads. I’m guessing it boils down to you having roughly 2x the number of cores (for the second-pass results), along with either slower storage or just different arguments making the first pass slow (I’ve noticed that I’m rarely using all my cores during pass #1, much less using them all fully).
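
    To illustrate the two-pass flow being discussed – a minimal sketch driven from Python, where the file names, bitrate, and stats path are placeholder assumptions rather than the review’s actual settings:

    ```python
    import subprocess

    SRC = "source_1080p.y4m"  # placeholder input file
    STATS = "x264_stats.log"  # stats file shared between the two passes

    # Pass 1: analysis only. This stage parallelizes poorly, which matches
    # the low core utilization noted above for the first pass.
    subprocess.run(["x264", "--pass", "1", "--bitrate", "4000",
                    "--stats", STATS, "-o", "pass1.264", SRC], check=True)

    # Pass 2: the actual encode, which scales across cores much better.
    subprocess.run(["x264", "--pass", "2", "--bitrate", "4000",
                    "--stats", STATS, "-o", "encode.mkv", SRC], check=True)
    ```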

    Of course, I’m on a much newer build of the core, so this may all be meaningless.

      • Damage
      • 11 years ago

      You can download the x264 HD benchmark yourself by following the link in the review. I believe that should provide answers to any questions you may have. There’s also a public database of results to which you can refer.

    • charged3800z24
    • 11 years ago

    Wow, AMD hate day… well, the 8000 series are much better – 4-socket and 8-socket, AMD scales like a be-atch. They are still close, clock for clock, if you will, to Intel at the 2-socket level, so it is much better than they were. But it would be nice to see more server-crucial tests – VM, for one.

      • Forge
      • 11 years ago

      It’s hard to make proper VM benchmarks. I’ve got some ideas on that, though. I’ll make sure to share them with Damage next time I see him on Skype.

    • Prototyped
    • 11 years ago

    Sweet. Thanks for the review. It appears Shanghai is a substantial improvement over Barcelona, at least matching the Xeons on clock-for-clock performance and even exceeding them in some cases (like SPECjbb2005 and MyriMatch). That and the lower system power consumption might make this the best current platform for Java application serving.

    It’s too bad you guys didn’t have the time to do virtualization testing. I look forward to a followup article on virtualization performance when you do get it worked out.


    • sdack
    • 11 years ago

    A very nice, informative read – thank you for it – up to the point where the benchmarking starts.

    That GPUs are easily taking over number-crunching tasks on the PC shows how little these CPUs are actually prepared for them. As nice as the benchmarks provided by your friends are, the work they do is no more real-world than, say, compiling code. Multi-CPU x86 servers still do a lot of database work and act as file servers, too, and this goes somewhat missing here.
    What I am missing, too, is true SMP folding. Amber and Tinker are single-core folding clients, while an SMP Linux client with the A2 core produces more than 2200 PPD. On a single Barcelona CPU one can get as much as 4200 PPD by running three of these clients in parallel, because a single client cannot utilize a quad-core to more than 75%, at best. I would therefore rather see what two SMP clients per CPU produce instead of these Tinker and Amber workloads.
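
    To spell out that oversubscription arithmetic – the PPD figures are the ones quoted above; only the ratio is computed:

    ```python
    # Figures quoted above: one SMP client tops out near 75% of a quad-core
    # (~2200 PPD); three clients in parallel keep the CPU saturated.
    single_client_ppd = 2200
    three_clients_ppd = 4200
    print(three_clients_ppd / single_client_ppd)  # ~1.9x from oversubscribing
    ```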

    The Conclusions are rather fair.

    One thing that made me chuckle was the excerpt from the SiSoft FAQ about the Mandelbrot renderer – “...It is a real-life benchmark rather than a synthetic benchmark, ...” – and I thought: …

      • UberGerbil
      • 11 years ago

      I don’t think a file server test would amount to much of a CPU benchmark: there’s just not enough going on.

      It’s true that branchy code will remain the domain of CPUs as GPUs take over more of the parallel FP. A database benchmark would be valid and useful when testing server processors, provided you can rig it so it’s not entirely gated by disk performance. Some webserver front ends that assemble pages from fragments (drawn from a DB) using complicated logic would be interesting as well, though I don’t know how much even the worst of those stress the latest CPUs.

    • bogbox
    • 11 years ago

    I’m a bit curious about the real-world consumer desktop performance of the Phenom II. I wish TR had reviewed 3ds Max 2009 instead of Cinebench, Photoshop, etc. – real-world applications, not “my friends’” programs or abstract benchmarks that nobody uses in the mass market.

    I’m not much interested in servers, not at all. But unfortunately for me, the server with its cloud will eat us up, so the future is server-based CPUs (e.g. the i7).

    One more wish: real gaming performance at 1600 rez. Is that too much?

    • fishyuk
    • 11 years ago

    AMD’s opportunity is in the commercial application server space. The SPEC benchmarks illustrate this, but if you look at recent benchmarks, Shanghai really shines in the majority of commercial applications – and most importantly in virtualisation. A 16-core Shanghai outperforms a 24-core Dunnington in VMmark with the same number of tiles, and it’s just as well for Intel that power consumption isn’t shown.

    Factor in that Nehalem doesn’t hit 4P for nearly a year, and that customers like standards, and all is certainly not lost for AMD in the segment they are concentrating on.

    I’d also add that when comparing like systems, Shanghai is more power efficient than the L5430. A good way to see this is HP’s online BladeSystem sizer: all power and fan components are identical, and the blades really are just CPU and memory. With anything over 2 DIMMs, Opteron is much more efficient, especially at the loads most likely to be seen in real life (40 to 60%).

    • TravelMug
    • 11 years ago

    Well, I did go back to the TR review of the i7 and checked the benchmark results myself. It does not look pretty for AMD. The 1-socket 965 system outperforms the 2-socket 2384 system in every benchmark. Even considering the outrageous price of $1000 for the i7-965, plus let’s say $300 for a motherboard, plus some juicy DDR3, it’s still much lower than the price of two 2384s and the corresponding dual-socket board. And the performance is higher. Pity about the 12GB RAM limit in those systems due to the current lack of non-ECC, non-registered 4GB DDR3 modules.

      • Azmount
      • 11 years ago

      And who would want to run non-ECC memory in their server?
      Also, who would want their server to overclock itself just because it feels like it?

      On a side note, the Core i7 has a TLB bug that manifests itself under virtualization; as such, it’s a no-go for a server until Intel fixes it with a revision.

        • TravelMug
        • 11 years ago

        I know it’s hard for you to follow a discussion flow or add 2 and 2 together, but that post was about performance.

    • Meadows
    • 11 years ago

    If you compare it to AMD’s previous effort, this processor is quite brilliant. Add to that the fact that they can be overclocked to at least 3.2 GHz (usually well beyond that) on air, and it might make for a more tempting enthusiast platform than Barcelona was – after all, the difference between, say, a 3.4 GHz Intel and a 3.4 GHz AMD can …

    • Anonymous Coward
    • 11 years ago

    At least the margins are so high in this market that AMD has lots of room to sell on price. Looks like they’ll keep Intel honest in most server markets. I’m very interested to see how things go in the desktop market.

    • mshook
    • 11 years ago

    Small nitpick: Shanghai isn’t named after the city but after its F1 track (it’s the theme used by AMD).

    • Azmount
    • 11 years ago

    This review is a joke.
    The authors praise the Xeon L5430 for its better power consumption even though the Opteron X4 and the other Xeon rigs use 16GB of memory while the L5430 system uses only 6GB.

    • Fighterpilot
    • 11 years ago

    Interesting review, but the comments and conclusions were awfully kind to AMD when a cold look at basically every benchmark result showed the Xeons well ahead in performance. (It would have been called …

      • tsoulier
      • 11 years ago

      I agree totally

      • shank15217
      • 11 years ago

      That platform doesn’t exist yet.

      • moose17145
      • 11 years ago

      The AMD systems were slower on the whole, yes, but they were usually kinder in terms of performance per watt. You could have the fastest CPU on the planet, one that does 100x anything currently out there, but if it eats up 10,000x the power to do it, almost no one is going to use it. The energy these things consume in proportion to the work they do is becoming ever more important. If you have a whole room full of them, you have to keep them cool: if a CPU puts out 20 watts of heat, you are really using more like 40+ watts, because you have to use heavy AC units to remove the heat from the room, and there are inefficiencies in the AC units to take into account as well. The kind comments were not in regard to overall performance, where fastest = king like in the home PC market, but rather in regard to their decent efficiency at what they were doing.
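
      That 20 W vs. 40+ W rule of thumb, written out – the cooling COP here is an assumed illustrative value, not a measured one:

      ```python
      # Total draw for a given IT load, assuming a hypothetical cooling
      # coefficient of performance (COP). With COP = 1.0, every watt of
      # heat costs another watt to remove.
      def total_power(it_watts: float, cooling_cop: float = 1.0) -> float:
          cooling_watts = it_watts / cooling_cop  # power the AC draws
          return it_watts + cooling_watts

      print(total_power(20.0))       # 40.0 W – the back-of-envelope figure above
      print(total_power(20.0, 3.0))  # ~26.7 W with a more efficient chiller
      ```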

      Something to keep in mind from when I was working in my old college’s IT department: one of the lead guys in charge of our data center found the way our server room was evolving kind of funny. He said that when he started, right after they built the data center, there was very little room on the floor to move around. Today there is more free floor space in there than ever, and you can move around pretty freely. Despite that, they have more equipment in there than they ever have, and they are consuming over 4x the electricity compared to when they first built it: computers have gotten smaller and smaller, allowing more of them to fit into one area, but because of that, power consumption has gone through the roof.

    • TravelMug
    • 11 years ago

    I agree with the suggestion that it would have been nice to see results from, let’s say, an i7-920 (to roughly match the 2384 in clock speed), even if that is not a server system. I’m curious how it compares to these dual-socket systems. Based on the SPEC results, it should be pretty close.

      • tsoulier
      • 11 years ago

      I agree totally. EDIT: Sorry, wrong post.

    • ludi
    • 11 years ago

    Good review. At least AMD will be able to keep pace in the small server space, one of the few markets where it has consistently maintained a strong presence.

    • BoBzeBuilder
    • 11 years ago

    So much for “waiting for AMD’s 45nm CPUs” – they’d get crushed by Nehalem.

      • shank15217
      • 11 years ago

      Stars 3D is a showcase for SMT and high-bandwidth architectures. If you look at the Stars 3D bench in the Core i7 review, this becomes obvious. That link you posted basically shows the same thing.

    • tfp
    • 11 years ago

    I know this is meant to showcase dual sockets or better, but it would have been nice to see the Core i7 chips in there, considering they do so well in workstation situations, even if they are single-socket.

    • Vasilyfav
    • 11 years ago

    Very curious how this will translate into Phenom II productivity.

      • shank15217
      • 11 years ago

      I think Phenom IIs will have near clock-for-clock parity with the second-gen Core 2s.

        • srg86
        • 11 years ago

        I can’t see that at all – not with the workstation-type benchmarks in this review; possibly with games.

    • Skrying
    • 11 years ago

    It seems that in certain environments the Shanghai processors could do amazingly well, and in others just average. I’m not really sure what to take away from all of this; it just seems AMD really needs these parts to respond well to a clock speed ramp-up.

      • mentaldrano
      • 11 years ago

      Everyone talks about AMD procs performing well with a clock speed bump, but AMD has consistently failed to actually increase speeds!

      While Shanghai does seem to be what AMD needs to stay in the game in the server market, don’t count on higher speed models to save the day. The SE version of Barcelona was what, 100 MHz faster than non-SE parts?

      Anyway, I’m more interested in the HE part, if the performance doesn’t suck.

    • ub3r
    • 11 years ago

    booooooooo
