AMD’s quad-core Opteron 2300 processors
Somewhere around mid-morning this past Friday, a rather large package made its way into the depths of Damage Labs.
Inside was a server containing something very special: a pair of AMD’s new quad-core Opteron processors. The chip code-named “Barcelona” has been something of an enigma during its development, both because of questions about exactly when it would arrive and how it would perform when it did. After a long, hot weekend of non-stop testing, we have some answers to those questions. AMD is formally introducing its Barcelona-based Opteron 2300-series processors today, so the time is now. As for the performance, well, keep reading to see exactly how the new Opterons compare to Intel’s quad-core Xeons.
Introducing the Opteron 2300 series
As I said, we received AMD’s new Opterons just this past Friday. I’ve concentrated my efforts since then on testing the heck out of them, so you’re going to be spared my attempts to summarize this new CPU architecture in any kind of depth. If you’re unfamiliar with AMD’s K10 architecture and want an in-depth look at how it works, let me suggest reading David Kanter’s excellent overview of Barcelona. I will give you some basics, though.
Barcelona is a single-chip, native quad-core design. Each of those cores has been substantially revised to improve performance per clock cycle through a variety of tweaks, some big and some small. The cores now have a wider, 32-byte instruction fetch, and the floating-point units can execute 128-bit SSE operations in a single clock cycle (including the Supplemental SSE3 instructions Intel included in its Core-based Xeons). Accordingly, the Barcelona core has more bandwidth throughout in order to accommodate higher throughput: internally between units on the chip, between the L1 and L2 caches, and between the L2 cache and the north bridge/memory controller.
AMD has also added an L3 cache to the chip. That results in a cache hierarchy that includes 64KB of dedicated L1 cache and 512KB of dedicated L2 cache per core, bolstered by a 2MB L3 cache that’s shared dynamically between all four cores. The total effective cache size is still much smaller than Intel’s Xeons, but AMD claims its mix of dedicated and shared caches can avoid contention problems that Intel’s large, shared L2 might have.
Behind this L3 cache sits an improved memory controller, still integrated into the CPU as with previous Opterons. AMD claims this memory controller is better able to take advantage of the higher bandwidth offered by DDR2 memory thanks to a number of enhancements, including buffers that are between 2X and 4X the size of those in previous Opterons and an improved prefetch mechanism. Perhaps most notably, the new controller can access each 64-bit memory channel independently, reading from one while writing to another, instead of just treating dual memory channels as a single 128-bit device.
Throughout Barcelona, from this memory controller to the CPU cores, AMD has made revisions with power-efficiency in mind. That starts with clock gating, whereby portions of the chip not presently in use are temporarily deactivated. AMD says it has improved its clock gating on both coarse- and fine-grained scales, combining the ability to turn off, say, the entire floating-point unit when running integer-heavy code with the ability to put smaller logic blocks on the chip to sleep when they’re not needed. Even the memory controller will turn off its write logic during reads and vice-versa.
Clock gating is a commonly used technique these days, but some of Barcelona’s tricks are more novel. Unlike other x86 multicore processors, each of Barcelona’s CPU cores is clocked independently, so that each one can raise and lower its clock speed (via PowerNow) dynamically in response to demand. (In Intel’s current Xeons, one core at high utilization means the other core on that chip must run at a higher clock speed, as well.) Barcelona’s CPU voltage is still dependent on the power state of the core with the highest utilization, but AMD has separated the power plane for the chip’s CPU cores from the power plane for its memory controller. As a result, the memory controller and CPU cores can each draw only the power they need.
All told, these modifications led to a chip comprised of approximately 463 million transistors. As manufactured on AMD’s 65nm SOI process, Barcelona measures 285mm².
The obvious goals for Barcelona included several key things: doubling the number of CPU cores per socket, raising the number of instructions each core can execute per clock, keeping power use relatively low by taking advantage of opportunities for dynamic scaling, and in doing so, achieving vastly improved performance per watt. AMD also sought to extend its excellent HyperTransport-based system architecture, although many of those improvements will have to wait for platform and chipset updates. The most urgent overarching goal, though, was undoubtedly restoring AMD’s competitive position compared to Intel’s Xeons based on the formidable Core microarchitecture.
The nuts and bolts of the quad-core Opterons
AMD continues its tradition with these new Opterons of making them drop-in replacements for the existing infrastructure. In this case, that infrastructure involves Socket F-class servers and workstations. With only a BIOS update, these systems can move from dual-core to quad, without need for a change in motherboards, cooling solutions, or power supplies, which is not a bad proposition at all. That upgrade proposition does come with a caveat, though: older motherboards that don’t support Barcelona’s split power planes will suffer a performance hit with certain Opteron 2300 models. For example, the Opteron 2350’s default memory controller clock is 1.8GHz. Without a separate voltage domain, though, the 2350’s memory controller drops to 1.6GHz. That matters quite a bit more than you might think, in part because the L3 cache uses the same clock.
AMD is introducing another innovation of sorts with Barcelona in the form of a new power rating, dubbed ACP for “average CPU power.” Differences in describing a processor’s maximum power and thermal envelope, known as Thermal Design Power, have long been a source of contention between Intel and AMD. For ages, AMD has argued that its TDP ratings are an absolute maximum while Intel’s are something less than that, and, hey, not fair! At the same time, AMD hasn’t had the same class of dynamic thermal throttling that Intel’s chips have, so it’s had to make do with more conservative estimates. The problem, according to AMD, is that its numbers were being compared directly to Intel’s, which could be misleading, particularly since its processors incorporate a north bridge, as well.
At long last, AMD is looking to sidestep this issue by creating a new power rating for its CPUs. Despite the name, ACP is not so much about “average” power use but about power use during high-utilization workloads. AMD has a methodology for defining a processor’s ACP that involves real-world testing with such workloads, and the company will apparently be using ACP as the primary way to talk about its CPUs’ power use going forward, though it will still disclose max power, as well. To give you a sense of things, standard Opterons with a 95W max power rating will have a 75W ACP. This move may be controversial, but personally, I think it’s probably justifiable given the power draw profiles we’ve seen from Opterons. I’m not especially excited about it one way or another since we spend hours measuring CPU power use around here. We’ll show you numbers shortly, and you can decide what to think about them.
Now that you know what ACP means, here’s a look at the initial Opteron 2300 lineup, complete with ACP and TDP numbers for each part.
These chips fit into the same basic power envelopes as current Opterons, obviously, and AMD continues to offer HE models with higher power efficiency for a slight price premium. These first chips run at rather low clock frequencies, with even lower memory controller/L3 cache speeds. Fortunately, AMD does plan to ramp up clock speeds. To demonstrate that, they shipped us a pair of 2.5GHz Barcelona engineering samples at the eleventh hour, which they later christened as the Opteron 2360 SE. These higher-frequency products won’t be available until some time in the fourth quarter of this year, but we can give you a preview of their performance today. Naturally, we’ve run them through our full gamut of tests, along with a pair of 2GHz Opteron 2350s. We also have a pair of Opteron 2347 HEs, but we’ve had to defer that review to another day due to time constraints.
You’ll want to watch several key matchups in the results, including:
- Opteron 2350 vs. Xeon E5345: This is the matchup AMD has identified as the most direct comparison on a price and performance basis. Looks to me like E5345s still cost more than the Opteron 2350’s list price, but this 2.33GHz Xeon will be a good foil for the 2GHz Barcelona.
- Opteron 2350 vs. Xeon L5335: Here’s your clock-for-clock comparison between Intel’s quad-core Xeons and AMD’s quad-core Opterons. The L5335 is a brand-new, low-power version of the Xeon whose most direct competitor might be the Opteron 2347 HE, but CPU microarchitecture geeks will appreciate comparing performance per clock between these two.
- Opteron 2360 SE vs. Opteron 2218 HE: Watch this one in single-threaded (and up to four-threaded) tests to get a rough sense of Barcelona’s per-clock performance versus older Opterons. The 2218 HE runs at 2.6GHz and the 2360 SE runs at 2.5GHz, so it’s not an exact match, but it’s close.
- Opteron 2360 SE vs. Xeon X5365: The best of breed from AMD and Intel face off.
You can see our system configs and test applications below. We’ve tried to produce as direct a comparison as possible. We even tested the Xeons and Opteron 2300s in the same sort of chassis with the same cooling solution. The Opteron 2200s were in a slightly different enclosure, but I did some testing and believe the power consumption from its cooling fans is similar. All of the systems used the same model of power supply unit.
I think we have a good mix of tests, but they’re more heavily geared toward HPC and digital content creation than traditional server workloads. We have added SPECjbb2005 this time out, and we continue working on developing some other server-oriented tests to add to our suite. Unfortunately, our preparation time for this review was limited.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
Xeon L5335 2.0GHz
2218 HE 2.6GHz
Opteron 2350 2.0GHz
Dual Opteron 2360 SE 2.5GHz
Tiger K8SSA (S3992)
nForce Pro 3600
6321 ESB ICH
nForce Pro 3600
1024MB DDR2-667 FB-DIMMs at 667MHz
1024MB ECC reg. DDR2-667 DIMMs at 667MHz
1024MB ECC reg. DDR2-667 DIMMs at 667MHz
to CAS delay (tRCD)
6321 ESB ICH with
Intel Matrix Storage Manager 7.6
nForce Pro 3600 with
Caviar WD1600YD 160GB
ATI ES1000 with 22.214.171.12453 drivers
Server 2003 R2 Enterprise x64 Edition with Service Pack 2
We used the following versions of our test applications:
- SiSoft Sandra XI.SP4a 64-bit
- CPU-Z 1.40
- SPECjbb2005 with Sun Java 6 Update 2 Windows x64 edition
- Valve VRAD map build benchmark
- Cinebench R10 64-bit Edition
- POV-Ray for Windows 3.7 beta 22 64-bit
- CASE Lab Euler3d CFD benchmark multithreaded edition
- MyriMatch proteomics benchmark
- notfred’s Folding benchmark CD 8/8/07 revision
- picCOLOR 4.0 build 598 64-bit
- The Panorama Factory 4.5 x64 Edition
- Windows Media Encoder 9 x64 Edition
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
With all of the talk about Barcelona’s increased throughput, I figured we should put that to the test. Here’s a quick synthetic benchmark of cache and memory bandwidth.
Barcelona delivers as advertised on this front, doubling the L1 and more than doubling the L2 cache bandwidth of the older Opteron 2200s, despite having lower clock speeds. Let’s take a closer look at the tail end of these results, where we’re primarily accessing main memory. I believe these results show memory bandwidth available to a single CPU core, not total system bandwidth, but they’re still enlightening.
The improvements to Barcelona’s memory controller appear to pay off nicely here. I’m a little dubious about the relatively low results for the Xeons, though. I expect we could see higher results with a different test.
Anyhow, that’s bandwidth, but its close cousin is memory access latency. Opterons have traditionally had very low latencies thanks to their integrated memory controllers. How does Barcelona look here?
Well, that’s not so good. Let’s look a little closer at the results with the aid of some fancy 3D graphs, and I think we can pinpoint a reason for the Opteron 2300s’ higher memory access latencies. In the graphs below, by the way, yellow represents L1 cache, light orange is L2 cache, red is L3 cache, and dark orange is main memory. Just because we can.
Ok, stop right there and have a look. The Opteron 2350’s L3 cache has a latency of about 23ns, and the 2360 SE’s L3 latency is about 19ns. Since latency in the memory hierarchy is a cumulative thing, that’s very likely the cause of our higher memory access latencies. I would give you the L3 cache latency in CPU clock cycles, but that’s kind of beside the point. Barcelona’s L3 cache runs at the speed of the north bridge, so 1.8GHz in the 2350 and 2.0GHz in the 2360 SE. The L3 cache may have some additional latency for other reasons: because cache access between the four cores is doled out in a round-robin fashion and because of the FIFO buffers that sit in front of this cache in order to deal with cores running at what may be vastly different clock speeds.
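For readers who want cycle counts anyway, the conversion is simple arithmetic: one gigahertz is one cycle per nanosecond. The sketch below applies that to the approximate latencies and clocks quoted above. The function name and the dictionary layout are illustrative assumptions, and the results are only as precise as the rounded nanosecond figures.

```python
def ns_to_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at a given frequency."""
    # 1 GHz equals one cycle per nanosecond, so cycles = ns * GHz.
    return latency_ns * clock_ghz


# Approximate L3 latencies and clock speeds quoted in the text.
chips = {
    "Opteron 2350": {"l3_ns": 23.0, "core_ghz": 2.0, "nb_ghz": 1.8},
    "Opteron 2360 SE": {"l3_ns": 19.0, "core_ghz": 2.5, "nb_ghz": 2.0},
}

for name, c in chips.items():
    core_cycles = ns_to_cycles(c["l3_ns"], c["core_ghz"])
    nb_cycles = ns_to_cycles(c["l3_ns"], c["nb_ghz"])
    print(f"{name}: about {core_cycles:.1f} core cycles, "
          f"about {nb_cycles:.1f} north bridge cycles")
```

Expressing the same latency against both clocks shows why the north bridge speed matters: the core waits the same number of nanoseconds either way, but a faster core burns more of its own cycles doing so.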
Adding the L3 cache in this way was undoubtedly a tradeoff for AMD, but it certainly carries a hefty latency penalty. This penalty may become less pronounced when Barcelona reaches higher clock speeds. AMD says the memory controller’s speed can increase as clock frequencies do.
SPECjbb2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.
SPECjbb2005 can be configured to run in many different ways, with different performance outcomes, depending on the tuning of the JVM, thread allocations, and all sorts of other things. I had no intention of producing a record score myself; I just wanted to test relative performance on equal footing. We’ll leave peak scores to the guys who spend their days optimizing for a single benchmark.
I used the Sun JVM for Windows x64, and I found that using two instances of the JVM produced the best scores on the Opteron-based systems. Scores with one or two instances were about the same on the Xeons, so I settled on two instances for my testing, with the following Java options:
-Xms2048m -Xmx4096m -XX:+AggressiveOpts
Those settings produced the following results:
In our first real performance test, Barcelona comes out looking very good. The Opteron 2350 outperforms the Xeon E5345, and the 2.5GHz Opteron 2360 SE beats out the 3GHz Xeon X5365, a promising start indeed.
Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve Software’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.
I’ve included a quick Task Manager snapshot from the test below, and I’ll continue that on the following pages. That’s there simply to show how well the application makes use of eight CPU cores, when present. As you’ll see, some apps max out at four threads.
This is a disappointing way to follow up that SPECjbb performance. Barcelona can’t match the Xeons clock for clock here, which leaves the 2GHz 2350 trailing the rest of the quad-core processors and the 2.5GHz 2360 SE behind the 2.33GHz Xeon E5345. Obviously, even this performance is a huge improvement over the Opteron 2200 series, though, and at least puts AMD back in the game.
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.
I had high hopes for Barcelona’s purported improvements in floating-point math, but we’re just not seeing it here. Have a look at the single-threaded performance of the Opteron 2218 HE (at 2.6GHz) versus the Opteron 2360 SE (at 2.5GHz): performance per clock is nearly identical between K8 and Barcelona. The one saving grace for the new Opterons is strong multi-threaded scaling. The Xeon E5345 is faster than the 2360 SE with one thread but slower with eight. Put another way, the E5345 offers a 6.2X speedup with multithreading, while the Barcelona’s is nearly 7X.
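To put those scaling figures in context, here is a minimal sketch that turns a speedup factor into parallel efficiency, meaning the fraction of ideal linear scaling achieved on eight cores. The function names are illustrative, and the 7.0 used for the Opteron is an assumed round-up of the "nearly 7X" figure, not a measured value.

```python
def speedup(single_thread_score: float, multi_thread_score: float) -> float:
    """Ratio of multithreaded to single-threaded throughput."""
    return multi_thread_score / single_thread_score


def parallel_efficiency(speedup_factor: float, num_cores: int) -> float:
    """Fraction of ideal (linear) scaling achieved across num_cores."""
    return speedup_factor / num_cores


CORES = 8  # two quad-core sockets per system

# Speedup factors quoted in the text; 7.0 is an assumed stand-in for "nearly 7X".
quoted = {"Xeon E5345": 6.2, "Opteron 2360 SE": 7.0}

for label, factor in quoted.items():
    eff = parallel_efficiency(factor, CORES)
    print(f"{label}: {factor:.1f}X speedup, {eff:.1%} of linear scaling")
```

Given actual single-threaded and multithreaded scores, `speedup()` would supply the factor directly.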
We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.
Again, we’re seeing strong performance scaling with Barcelona, but not dominance in floating-point math. The Xeon L5335 at 2GHz is just a few ticks behind the 2GHz Opteron 2350.
I decided to go ahead and report these results for the sake of completeness, but I don’t believe they’re telling us much about the new Opterons’ competence. This beta version of POV-Ray seems to have a problem with single-threaded tasks bouncing around from one CPU core to the next, and this causes especially acute problems on NUMA systems. Since the vast majority of the computation time for this scene involves such single-threaded work, things turn out badly for the Opteron 2300s.
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:
In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.
MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
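The job-queue scheme described above can be sketched in a few lines. This is not MyriMatch’s code: the function names, the `handle` callback, and the use of Python’s `queue` and `threading` modules are assumptions for illustration only. CPython threads also won’t speed up CPU-bound work because of the global interpreter lock, so the sketch shows the scheduling pattern rather than real parallel performance.

```python
import math
import queue
import threading


def split_into_jobs(items, num_threads, jobs_per_thread=10):
    """Split items into roughly num_threads * jobs_per_thread equal jobs."""
    num_jobs = num_threads * jobs_per_thread
    job_size = math.ceil(len(items) / num_jobs)
    return [items[i:i + job_size] for i in range(0, len(items), job_size)]


def run_jobs(items, num_threads, handle):
    """Process every item with handle(), letting idle threads pull the next job."""
    jobs = queue.Queue()
    for job in split_into_jobs(items, num_threads):
        jobs.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()  # grab the next unclaimed job
            except queue.Empty:
                return  # queue drained, so this thread is done
            out = [handle(item) for item in job]
            with lock:  # synchronize once per job, not once per item
                results.extend(out)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# The yeast database in the text has 6714 proteins; with four threads that
# yields 40 jobs of about 168 sequences each.
proteins = list(range(6714))
jobs = split_into_jobs(proteins, num_threads=4)
print(len(jobs), len(jobs[0]))
```

Making the jobs much smaller than one thread’s share keeps any thread from sitting idle for long at the end of the run, while locking once per job keeps synchronization overhead low.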
The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.
This application looks to be limited by memory bandwidth or some similar factor; scaling beyond four threads doesn’t work out well on any of these systems. That said, the new Opterons perform respectably here, with the 2350 matching the Xeon E5345 and the 2360 SE edging out the Xeon X5365.
STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)
In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:
The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.
So the higher the score, the faster the computer. I understand the STARS Euler3D routines are both very floating-point intensive and oftentimes limited by memory bandwidth. Charles has updated the benchmark for us to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.
The Barcelona Opterons can’t even match the Xeons clock for clock here. In fact, even the 2360 SE trails the slowest 2GHz Xeon. Again, we’re seeing just under 2X the performance of the dual-core Opterons at similar clock speeds (2218 HE vs. 2360 SE), but even that’s not enough to catch Intel.
Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.
The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.
notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.
On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
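The final scoring step boils down to an average and a multiplication. The sketch below shows that arithmetic only. The function name, the work unit labels for the two Gromacs types, and the points-per-day values are all hypothetical placeholders, not results from the benchmark CD.

```python
def estimate_total_ppd(ppd_by_wu_type: dict, num_cores: int) -> float:
    """Average per-core points per day across WU types, scaled by core count."""
    average_ppd = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average_ppd * num_cores


# Hypothetical per-core points-per-day figures for the four WU types.
example_ppd = {
    "tinker": 100.0,
    "amber": 120.0,
    "gromacs_a": 200.0,  # placeholder name for one Gromacs WU type
    "gromacs_b": 220.0,  # placeholder name for the other
}

print(estimate_total_ppd(example_ppd, num_cores=8))
```

Because the average weights each WU type equally, a chip that is unusually strong on one type gets the same credit for it regardless of how often that type is actually assigned.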
This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.
To catch the architectural improvements, follow the matchup between Opteron 2218 HE and Opteron 2360 SE at similar clock speeds once again. Per-clock performance is similar between K8 and Barcelona with the Tinker and Amber work units, where AMD already excelled, but Barcelona is much stronger with the two Gromacs WU types, which tend to dominate the WU assignment these days, as I understand it. Those improvements aren’t enough to catch the Xeons, though. The Intel processors remain faster clock-for-clock and also run at higher frequencies.
The Barcelona-based Opterons put in a respectable showing overall, however, on the strength of eight cores that handle the Amber and Tinker WU types relatively well.
The Panorama Factory
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.
Here’s another case where the new Opterons can’t match the Xeons clock for clock and they have lower frequencies. Things do seem to tighten up with the faster 2360 SE, its higher frequency core, and its faster memory controller clock. This design wants to run faster than where AMD’s starting with it.
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.
Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
This app uses a maximum of four threads, and again, the Barcelona Opterons perform similarly on a per-clock basis to the older dual-core Opterons. I’ll leave you to analyze the finer points of the individual sub-tests, if that’s your thing.
Windows Media Encoder x64 Edition
Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.
Here, the Barcelona and Xeon quad-core processors at 2GHz go head to head, and the Xeon comes away victorious. Fortunately, the 2.5GHz Opteron 2360 SE looks like it may be relatively stronger.
SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:
This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.
We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
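The column-interleaving scheme in SiSoft’s description can be illustrated with a small sketch. This is not Sandra’s code: it assumes a static stride, where thread k takes columns k, k + N, k + 2N and so on, which is one plausible reading of "the next column not being worked on by other threads." It also uses plain scalar Python rather than SSE2, and a reduced image size so it runs quickly.

```python
import threading

WIDTH, HEIGHT, MAX_ITER = 64, 48, 255  # small image; the FAQ cites 640x480


def mandelbrot_iterations(cx, cy, max_iter=MAX_ITER):
    """Count iterations of z = z*z + c before |z| exceeds 2, capped at max_iter."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render_columns(image, thread_id, num_threads):
    """Fill in every num_threads-th column, starting at column thread_id."""
    for x in range(thread_id, WIDTH, num_threads):
        cx = -2.0 + 3.0 * x / WIDTH
        for y in range(HEIGHT):
            cy = -1.0 + 2.0 * y / HEIGHT
            image[y][x] = mandelbrot_iterations(cx, cy)


def render(num_threads):
    """Render the image with columns interleaved across num_threads threads."""
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    threads = [
        threading.Thread(target=render_columns, args=(image, k, num_threads))
        for k in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image


# The result should not depend on how many threads share the columns.
print(render(1) == render(8))
```

Interleaving spreads the expensive columns near the middle of the set across all threads, which balances the load better than handing each thread one contiguous block of the image.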
I’ve been looking forward to seeing the results of this little test, because it has the potential to demonstrate what Barcelona’s single-cycle 128-bit SSE enhancements can do when given a simple, parallelizable task, just as it did when the Core microarchitecture arrived. The story that it tells is intriguing. We see huge improvements between the Opteron 2218 HE and the 2360 SE: nearly 4X in the integer test, with only twice as many cores on the 2360 SE. The magnitude of the gain in the floating-point test is lower, but still well past the doubled score one might expect from twice the cores in ideal conditions. Overall, though, the Xeons’ per-clock throughput remains much higher than Barcelona’s.
POV-Ray power consumption and efficiency
Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system: the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we asked POV-Ray to render our “chess2.pov” scene at 1024×768 resolution with antialiasing set to 0.3.
Before testing, we enabled the CPU power management features for Opterons and Xeons (PowerNow! and Demand Based Switching, respectively) via Windows Server’s “Server Balanced Processor Power and Performance” power scheme.
Incidentally, the Xeons I’ve used here are brand-new G-step models that promise lower power use at idle than older ones. I used a beta BIOS for our SuperMicro X7DB8+ motherboard that supports the enhanced idle power management capabilities of G-step chips. Unfortunately, I’m unsure whether we’re seeing the full impact of those enhancements. Intel informs me that only newer revisions of its 5000-series chipset support G-step processors fully in this regard. Although this is a relatively new motherboard, I’m not certain it has the correct chipset revision.
Anyhow, here are the results:
We can slice up the data in various ways in order to better understand them, though. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.
The new Opterons draw a little more power at idle than the old ones, as might be expected with so many more transistors on the chips. Still, the Barcelonas draw much less power at idle than the Xeons. Part of the Xeons’ problem is a platform issue: FB-DIMMs draw quite a bit more power per module than DDR2 DIMMs.
Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.
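For the curious, a windowed average like the one described above might be computed along these lines. This is a minimal sketch that assumes the meter's log has been loaded as a list of `(seconds, watts)` pairs; the `average_power` name, the one-sample-per-second rate, and the wattages in the usage example are all made up for illustration and are not our measured data.

```python
def average_power(samples, start_s, end_s):
    """Mean wattage of the samples with start_s <= time < end_s.

    samples: list of (time_in_seconds, watts) pairs (assumed log format).
    """
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples fall inside the requested window")
    return sum(window) / len(window)


if __name__ == "__main__":
    # Hypothetical log: 60 s under load at 200 W, then 30 s idle at 120 W.
    log = [(t, 200.0 if t < 60 else 120.0) for t in range(90)]
    peak = average_power(log, 30, 40)   # ten-second span during the render
    idle = average_power(log, 80, 90)   # trailing edge of the test period
    print(peak, idle)
```

The same helper covers the idle figure as well: point the window at the trailing edge of the log rather than the middle of the render.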
True to its billing, the Opteron 2350 draws no more power under load than the Opteron 2220, its dual-core predecessor. Of course, AMD had to compromise on clock frequency in order to do it, but this still is an impressive result, especially since the 2350 draws less power under load than the low-power Xeon L5335, whose TDP rating is 50W. Then again, this is total system power draw, and we’ve already established that the Xeons have a handicap there (one they’re tied to, nonetheless).
Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
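Since a watt is one joule per second, total energy is just the area under the logged power curve. Here is a minimal sketch of that calculation using the trapezoidal rule; the `(seconds, watts)` log format and the `energy_joules` name are assumptions for illustration, and the numbers in the usage example are invented rather than measured.

```python
def energy_joules(samples):
    """Integrate power over time with the trapezoidal rule.

    samples: list of (time_in_seconds, watts) pairs, sorted by time.
    Returns energy in watt-seconds (joules).
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: average power times elapsed time.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


if __name__ == "__main__":
    # Hypothetical example: a steady 100 W for 10 seconds.
    print(energy_joules([(0, 100.0), (10, 100.0)]))
```

With evenly spaced samples this reduces to roughly the average wattage multiplied by the length of the test period, which is why a system that idles low for most of the window can come out ahead even if its peak draw is similar.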
By using more cores to finish the rendering work sooner, the Opteron 2350 is able to use less energy over the course of the test period than the Opteron 2220, despite having similar peak power consumption and higher idle power consumption. Sometimes, power efficiency is partially about finishing first. However, the Xeons’ strong performance alone can’t redeem them here.
We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve chosen to identify the end of the render as the point where power use begins to drop from its steady peak. We’ve sometimes seen disk paging going on after that, but we don’t want to include that more variable activity in our render period.
We’ve computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
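One way to automate that "end of render" call is sketched below. This is not the exact procedure we followed (we identified the drop by inspection); it assumes the log is a list of `(seconds, watts)` pairs that begins when the render starts, treats the highest sample as the steady peak, and uses a hypothetical `drop_fraction` threshold to decide when power has begun to fall. A noisy log with brief spikes would need a more robust peak estimate.

```python
def render_end_index(samples, drop_fraction=0.95):
    """Index of the last sample before power drops from its peak level.

    samples: list of (time_in_seconds, watts) pairs, sorted by time.
    drop_fraction: assumed threshold; power below this fraction of the
    peak counts as "dropping".
    """
    peak = max(watts for _, watts in samples)
    threshold = drop_fraction * peak
    reached_peak = False
    for i, (_, watts) in enumerate(samples):
        if watts >= threshold:
            reached_peak = True
        elif reached_peak:
            return i - 1
    return len(samples) - 1  # power never dropped within the log


def render_energy_joules(samples, drop_fraction=0.95):
    """Trapezoidal energy (joules) from the first sample to render end."""
    end = render_end_index(samples, drop_fraction)
    render = samples[: end + 1]
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for (t0, p0), (t1, p1) in zip(render, render[1:]))
```

Cutting the window at the first drop also leaves out any disk paging that follows the render, in keeping with the choice described above.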
This may be our best measure of power-efficient performance under load, and in this measure, the Barcelona Opterons again excel. The Xeons are close here due to their short render times, but the new Opterons place first and second.
Power use at partial utilization with SPECjbb2005
Before we close out our look at power efficiency, I’d like to consider another example. I’ve measured power use in SPECjbb2005 in order to show how it scales with incremental increases in load. I’ve only used a single instance of the JVM so that we can see a nice, gradual step up in load; two instances would take us to peak utilization much more quickly.
We’ve graphed the quad-core Opterons and Xeons together. Since the dual-core Opterons take much longer to finish, they get their own graph.
Well, that’s interesting to see. I’m not sure exactly what to make of it just yet. I’d like to correlate power and performance here, but as I’ve mentioned, our testing time has been limited. Perhaps next time.
The new Barcelona-based quad-core Opterons bring major performance gains over their dual-core predecessors while fitting comfortably into the same power and thermal envelopes. Doubling the number of CPU cores will take you a long way in the server/workstation space, where the usage models tend to involve explicitly parallel workloads. The new Opterons also bring improved clock-for-clock performance in some cases, most notably with SSE-intensive applications like the Folding@Home Gromacs core. However, Barcelona’s gains in performance per clock aren’t quite what we expected, especially in floating-point-intensive applications like 3D rendering, where it looks for all the world like a quad-core K8. As a result, Barcelona is sometimes faster, sometimes slower, and oftentimes the equal of Intel’s Core microarchitecture, MHz for MHz. Given the current clock speed situation, that’s a tough reality.
That said, new processor microarchitectures often scale quite well with clock speed, and our sneak peek at the 2.5GHz Opteron 2360 SE suggests Barcelona might be that way. Still, one can’t help but wonder whether AMD did the right thing with its L3 cache. That cache’s roughly 20ns access latency erases the Opteron’s lifelong advantage in memory access latencies, yet it nets an effective total cache size just over half that of the current Xeon’s. Since the L3 cache is clocked at the same speed as the memory controller, raising that memory controller’s clock speed should be a priority for AMD. This particular issue may be more of a concern in desktops and workstations than in servers, however, given the usage models involved.
At its modest price and 2GHz clock speed, the Opteron 2350 is still a compelling product for the server space, especially as a drop-in upgrade for existing Opterons. AMD’s HyperTransport-based system architecture remains superior (a similar design is the way of the future for Intel now), and this architecture is one of the reasons why the Opteron 2350 scales relatively well in some applications, such as SPECjbb2005. Also, for the past couple of years, both Intel and AMD have been talking up a storm about how power-efficient performance is the new key to processors, especially in the server space. By that standard, AMD now has the lead. By any measure (and we have several, including idle power, peak power, and a couple of energy use metrics), the quad-core Opterons trump Intel’s quad-core Xeons. Even the early 2.5GHz chip we tested proved to have relatively low power draw, which bodes well. We’ll have to take the Opteron 2347 HE out for a test drive soon, to see how it fares, as well.
Nonetheless, AMD now faces some harsh realities. For one, it is not going to capture the overall performance lead from Intel soon, not even in “Q4,” which is when higher-clocked parts like the Opteron 2360 SE are expected to arrive. Given what we’ve seen, AMD will probably have to achieve something close to clock speed parity with Intel in order to compete for the performance crown. On top of that, Intel is preparing new 45nm “Harpertown” Xeons for launch some time soon, complete with a 6MB L2 cache, 1.6GHz front-side bus, clock speeds over 3GHz, and expected improvements in per-clock performance and power efficiency. These new Xeons could make life difficult for Barcelona. And although AMD should remain competitive in the server market on the strength of Opteron’s natural system architecture and power efficiency advantages, this CPU architecture may not translate well to the desktop, where it has to compete with a Core 2 processor freed from the power and memory latency penalties of FB-DIMMs. But that, I suppose, is a question for another day.