Intel’s Stoakley platform and 45nm Xeons

When AMD’s “Barcelona” Opterons made their debut last Monday, we couldn’t tell you about a sleek, black box nestled in among the other test systems in Damage Labs. Housed inside of it: an example of Intel’s brand-new “Stoakley” dual-processor platform, complete with a pair of Xeons based on 45nm process technology. These Xeons are the first members of the Penryn family of 45nm CPUs to reach our test labs, and they offer a tantalizing look at how Intel will counter AMD’s new CPU design with a substantially revised version of its own potent Core microarchitecture.

These new CPUs and the platform that supports them promise marked improvements in performance, thanks to a bevy of tweaks and updates. In fact, although the new Xeons are more a minor refresh than a major overhaul, the gains they’ve attained are formidable. Today, we can show you how these processors perform.

The contest between next-generation CPU architectures has begun in earnest. Read on to see how Intel’s 45nm Xeons match up with AMD’s quad-core Opterons.

Goin’ to Harpertown

Following hardware developments these days requires navigating a virtual minefield of overlapping codenames, and Intel proudly leads the world in codename generation. The new Xeons have several names attached. “Penryn” is the codename for the family of processors based on Intel’s 45nm fab process, and this same silicon will serve a number of markets in various configurations. For the server and workstation markets, the bread-and-butter Penryn derivative will be “Harpertown,” a dual-chip, quad-core product that supersedes the current quad-core “Clovertown” Xeons. Intel also has plans for a single-chip, dual-core variant known as “Wolfdale.”

All Penryn derivatives will be manufactured via Intel’s 45nm high-k chip fabrication process, which the company has hailed as a breakthrough and a fundamental restructuring of the transistor. Despite the fanfare, the change brings gains that were once considered fairly conventional for process shrinks. Intel says the 45nm high-k process has twice the transistor density, a 20% increase in switching speed, and a 30% reduction in switching power versus its 65nm process. Improvements of that order are nothing to scoff at these days, nor is Intel’s manufacturing might. The firm already has two fabs making the 45nm conversion in the second half of 2007, Fab D1D in Oregon and Fab 32 in Arizona. Fab 28 in Israel will follow in the first half of next year, along with Fab 11X in New Mexico in the second half of ’08. 45nm processors should make up the majority of its output by then.

Harpertown Xeons and their Penryn-based cousins are not just die-shrunk versions of current chips, but they do retain the same basic layout. The quad-core parts consist of two dual-core chips situated together in a single LGA771-style package. This two-chip arrangement isn’t as neatly integrated as AMD’s “native quad-core” Opterons, since the two chips can communicate with one another only by means of the relatively slow front-side bus, but it has the advantage of making chips easier to manufacture. The approximately 463 million transistors of AMD’s Barcelona are packed into an area that’s 283 mm² via AMD’s 65nm SOI fab process. That’s a relatively large area over which AMD must avoid defects. By contrast, current 65nm Xeons are based on two chips, each with roughly 291 million transistors and measuring just 143 mm². Each chip in a Harpertown Xeon crams 410 million transistors into an even smaller 107 mm² area. One can argue that AMD’s approach to quad-core processors is more elegant, but it’s hard to argue with the Penryn family’s tiny die area.
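For a rough sense of the die-size argument, here is a small sketch that works out transistor density and total silicon per package from the Barcelona and Harpertown figures quoted above. The helper name `density_m_per_mm2` is our own, and this is plain arithmetic on published numbers, not a yield model.

```python
# Figures quoted in the text: (transistors in millions, die area in mm^2).
BARCELONA_DIE = (463, 283)    # AMD native quad-core, one die per package
HARPERTOWN_DIE = (410, 107)   # Intel 45nm dual-core die, two per package


def density_m_per_mm2(transistors_m, area_mm2):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors_m / area_mm2


barcelona_density = density_m_per_mm2(*BARCELONA_DIE)
harpertown_density = density_m_per_mm2(*HARPERTOWN_DIE)

# A Harpertown package carries two dies, so total silicon is 2 x 107 mm^2.
harpertown_package_area = 2 * HARPERTOWN_DIE[1]

print(f"Barcelona:  {barcelona_density:.2f} M/mm^2 on one {BARCELONA_DIE[1]} mm^2 die")
print(f"Harpertown: {harpertown_density:.2f} M/mm^2 on two {HARPERTOWN_DIE[1]} mm^2 dies "
      f"({harpertown_package_area} mm^2 total)")
```

Keep in mind that density comparisons like this are loose, since cache-heavy dies pack transistors more tightly than logic-heavy ones.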



A wafer of Harpertown 45nm Xeons

The small die belies big changes, though. The most obvious of those is a larger (6MB) and smarter (24-way set associative) L2 cache shared between the two cores on each chip. That adds up to 12MB of L2 cache per socket, for those who prefer to count that way. Harpertown Xeons can better feed that cache thanks to front-side bus speeds of up to 1.6GHz.

Penryn’s CPUs themselves may need the extra bandwidth, thanks to a handful of tweaks. One of the most prominent: a new, faster divider capable of handling both integer and floating-point numbers. This new radix-16-based design processes four bits per cycle, versus two bits in prior designs, and includes an optimized square root function. An early-out algorithm in the divider can lead to lower instruction latencies in some cases, as well. Penryn also extends the Core microarchitecture’s 128-bit single cycle SSE capabilities to shuffle operations, doubling execution throughput there. This is not a new instruction but an optimization for existing instructions, so no software changes are required to take advantage of this capability. The faster shuffle should be useful in formatting and setting up data for use in other SSE-based vector operations.
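To illustrate why retiring four quotient bits per step matters, here is a minimal software sketch of digit-recurrence division that counts iterations. The function name, the 64-bit default width, and the use of Python's own integer division for digit selection are all our assumptions; this is not a model of Intel's divider circuit, and it leaves out the early-out logic entirely.

```python
def digit_recurrence_divide(dividend, divisor, width=64, bits_per_step=4):
    """Divide unsigned integers, retiring `bits_per_step` quotient bits per step.

    Returns (quotient, remainder, steps). The dividend must fit in `width` bits.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    radix = 1 << bits_per_step          # 4 bits per step means radix 16
    quotient = 0
    remainder = 0
    steps = 0
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        # Bring down the next group of dividend bits.
        chunk = (dividend >> shift) & (radix - 1)
        remainder = (remainder << bits_per_step) | chunk
        # Select one quotient digit in the range 0 .. radix-1.
        digit = remainder // divisor
        remainder -= digit * divisor
        quotient = (quotient << bits_per_step) | digit
        steps += 1
    return quotient, remainder, steps


q16, r16, steps16 = digit_recurrence_divide(1000003, 97, bits_per_step=4)
q4, r4, steps4 = digit_recurrence_divide(1000003, 97, bits_per_step=2)
print(steps4, "steps at two bits per step versus", steps16, "at four")
```

Both calls return the same quotient and remainder. Only the step count changes, and it halves when each step handles twice as many bits.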

Speaking of SSE and new instructions, SSE4 is finally here in Penryn. These aren’t just the Supplemental SSE3 instructions supported in the first rev of the Core microarchitecture, but 47 all-new instructions aimed at video acceleration, basic graphics operations (including dot products), and the integration and control of coprocessors over PCIe. These instructions will, of course, require updated software support.

Harpertown Xeons pack some additional Penryn goodness, such as store forwarding and virtualization improvements, but they do not have the nifty “dynamic acceleration tech” intended for desktop Penryn derivatives. Those chips will have the ability to raise their clock speeds beyond their stock ratings, while staying within their appointed thermal envelopes, when one core is idle and the other is busy with a heavily single-threaded workload. Such trickery may be too fancy for the button-down world of servers and workstations, at least in its first-generation form.

Interestingly, Intel is toying with another, more permanent possibility for some future Xeon products: disabling one core on each of the two chips in a package in order to yield a dual-core solution that has 6MB of dedicated L2 cache per core. This move could allow a distinctive mix of single-threaded performance (as dictated by both cache sizes and clock speeds) within a given power envelope.

Speaking of which, the power envelopes for the new Xeons will remain essentially the same as the old ones. That means TDPs of 40, 65, and 80W for dual-core parts and 50, 80, and 120W for quad-cores. TDP ratings at a given clock speed should be down, I believe, although we don’t have all of the details yet. We do know that Intel plans to sell a 3.16GHz version of Harpertown that will fit into the top 120W envelope, and we know that our sample Harpertowns, to be sold as the Xeon E5472, run at 3GHz and fit into an 80W thermal envelope. Additional details on the lineup and pricing will have to wait for the Harpertown Xeons’ official launch date, which isn’t yet here. That will come on November 12.

Stoakley steps up

The product that is officially arriving today is Intel’s new dual-socket platform, code-named Stoakley. This platform combines something old, Intel’s current ESB2 I/O chip (or south bridge), with something new, a memory controller hub or north bridge chip code-named Seaburg. Seaburg supplants a pair of existing products, the server-oriented Blackford MCH and the workstation-class Greencreek MCH. Seaburg is manufactured on a newer process node than its predecessors, and its clock speed is up from 333 to 400MHz within a similar power envelope.



We’ve removed the air duct to expose the CPU coolers and DIMMs in our Stoakley test rig

Of course, the Stoakley platform’s main mission in life is to support the new 45nm Xeons. Like the Bensley platform before it, Stoakley has two front-side buses, one dedicated to each socket in the system. However, while Bensley’s front-side buses topped out at 1.33GHz, Stoakley’s FSBs can run at 1.6GHz. Memory bandwidth is up, too, since Seaburg supports FB-DIMM speeds of 800MHz for its four memory channels (though 667MHz remains an option). Stoakley’s memory controller gains more capacity for memory request reordering than Bensley’s, as well. All told, Intel cites a 25% higher sustainable memory throughput for the new platform.
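As a back-of-the-envelope check on those bus and memory speeds, the sketch below computes theoretical peak bandwidth. It assumes a 64-bit (8-byte) data path for each front-side bus and treats each FB-DIMM channel as delivering the read bandwidth of a 64-bit DDR2 interface. Those assumptions and the helper name are ours, and real sustained throughput will be lower than these peaks.

```python
BUS_WIDTH_BYTES = 8   # assumption: 64-bit data path for the FSB and each memory channel


def peak_gb_per_s(mega_transfers_per_s, links=1):
    """Theoretical peak in GB/s (10^9 bytes/s) across `links` parallel 64-bit links."""
    return mega_transfers_per_s * 1e6 * BUS_WIDTH_BYTES * links / 1e9


bensley_fsb = peak_gb_per_s(1333)                  # one 1333MHz front-side bus
stoakley_fsb = peak_gb_per_s(1600)                 # one 1600MHz front-side bus
stoakley_both_fsbs = peak_gb_per_s(1600, links=2)  # both sockets combined
memory_667 = peak_gb_per_s(667, links=4)           # four channels at 667MHz
memory_800 = peak_gb_per_s(800, links=4)           # four channels at 800MHz

print(f"FSB per socket: {bensley_fsb:.1f} -> {stoakley_fsb:.1f} GB/s")
print(f"Both FSBs:      {stoakley_both_fsbs:.1f} GB/s")
print(f"Memory:         {memory_667:.1f} -> {memory_800:.1f} GB/s")
```

Under these assumptions, the raw clock bump is worth about 20% on both the bus and memory sides, so Intel's 25% sustained figure presumably leans on the improved request reordering as well.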

In addition to the extra throughput, Stoakley can house twice as much memory as Bensley—up to 128GB—and will support FB-DIMM fail-over for high-reliability systems. Seaburg also doubles the number of PCIe lanes and upgrades those links to second-generation PCI Express.

One of the bigger challenges in designing the Seaburg north bridge was no doubt creating the snoop filter. This logic stores coherency information for all last-level caches on both of the chipset’s front-side buses, and it reduces FSB utilization by filtering out unnecessary coherency updates rather than passing them along from one FSB to the other. A system with dual Harpertown Xeons will have four last-level caches of 6MB each, and each cache will be 24-way associative. Accordingly, Seaburg’s snoop filter has four affinity groups, provides 24MB of coverage, and is 96-way associative. Seaburg also uses a better algorithm for victim selection.
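The basic idea of a snoop filter can be shown with a toy model. The sketch below tracks which front-side bus may hold each cache line and counts how many cross-bus snoops get suppressed. The class and its fields are hypothetical, and the model has unlimited capacity with no eviction, invalidation, or victim selection, so it says nothing about how Seaburg actually implements the logic.

```python
# Sizing from the text: four last-level caches, each 6MB and 24-way associative.
COVERAGE_MB = 4 * 6        # 24MB of coverage
ASSOCIATIVITY = 4 * 24     # 96-way


class ToySnoopFilter:
    """Tracks which front-side bus may hold each cache line (toy model)."""

    def __init__(self, num_buses=2):
        self.num_buses = num_buses
        self.holders = {}      # line address -> set of bus ids that may hold it
        self.forwarded = 0     # snoops passed along to another bus
        self.filtered = 0      # snoops suppressed by the filter

    def access(self, bus, line):
        """Record an access from `bus` and count the cross-bus snoops it causes."""
        owners = self.holders.setdefault(line, set())
        for other in range(self.num_buses):
            if other == bus:
                continue
            if other in owners:
                self.forwarded += 1   # the other bus may hold the line
            else:
                self.filtered += 1    # no need to disturb the other bus
        owners.add(bus)


snoop_filter = ToySnoopFilter()
for line in range(0, 100):       # bus 0 touches lines 0..99
    snoop_filter.access(0, line)
for line in range(90, 110):      # bus 1 touches lines 90..109, ten of them shared
    snoop_filter.access(1, line)
print(snoop_filter.forwarded, "snoops forwarded,", snoop_filter.filtered, "filtered")
```

In this made-up access pattern, only the ten lines that both buses touch generate traffic on the other bus. Without a filter, all 120 accesses would.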

In the previous generation, only the workstation-oriented Greencreek MCH had a snoop filter; the server-targeted Blackford MCH did not, because it could hamper performance in some cases. The improvements to Stoakley’s snoop filter have mitigated that performance penalty, and so Intel will offer only one product in this generation. Technically, Stoakley is billed primarily as a workstation platform, but expect it to find its way into servers, as well. With its increased throughput, Stoakley could prove particularly popular for HPC systems.

Test notes

You can see our test system configurations and the like in the section below. Most of it is self-explanatory, but one point deserves mention. You’ll notice that the Stoakley/Xeon 45nm system came with 16GB of RAM, while the rest of the systems had 8GB of RAM. I elected to retain the eight-DIMM, 16GB configuration for the majority of our tests, especially the power tests, since the rest of the test rigs had eight DIMMs each. The presence of additional RAM in the Stoakley box shouldn’t affect the outcome of the vast majority of our tests, since they all fit comfortably into 8GB. The one potential exception is SPECjbb2005, which can use quite a bit of memory, so I tested the Stoakley/Xeon E5472 system with 8GB of RAM in SPECjbb2005.

On another note, we were unfortunately unable to include results from our Folding@Home benchmark in this review, because the bootable Linux CD’s networking stack proved somehow incompatible with our Stoakley review system. We’ll have to test that later.

Also, you’ll see that we have an Opteron 2347 HE among the results, a new addition since our initial review of the quad-core Opterons. We’re curious to see how this CPU matches up against the Xeon L5335 in performance and power use.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

| Processors | Dual Xeon L5335 2.0GHz; Dual Xeon E5345 2.33GHz; Dual Xeon X5365 3.0GHz | Dual Xeon E5472 3.0GHz | Dual Opteron 2218 HE 2.6GHz; Dual Opteron 2220 2.8GHz | Dual Opteron 2347 1.9GHz; Dual Opteron 2350 2.0GHz; Dual Opteron 2360 SE 2.5GHz |
|---|---|---|---|---|
| System bus | 1333MHz (333MHz quad-pumped) | 1600MHz (400MHz quad-pumped) | 1GHz HyperTransport | 1GHz HyperTransport |
| Motherboard | SuperMicro X7DB8+ | SuperMicro X7DWA | Tyan Tiger K8SSA (S3992) | SuperMicro H8DMU+ |
| BIOS revision | 8/13/2007 | 8/28/2007 | 5/29/2007 | 8/15/2007 |
| North bridge | Intel 5000P MCH | Intel Seaburg MCH | ServerWorks BCM 5780 | Nvidia nForce Pro 3600 |
| South bridge | Intel 6321 ESB ICH | Intel 6321 ESB ICH | ServerWorks BCM 5785 | Nvidia nForce Pro 3600 |
| Chipset drivers | INF Update 8.3.0.1013 | INF Update 8.5.0.1005 | | SMBus driver 4.57 |
| Memory size | 8GB (8 DIMMs) | 16GB (8 DIMMs) | 8GB (8 DIMMs) | 8GB (8 DIMMs) |
| Memory type | 1024MB DDR2-667 FB-DIMMs at 667MHz | 2048MB DDR2-800 FB-DIMMs at 800MHz | 1024MB ECC reg. DDR2-667 DIMMs at 667MHz | 1024MB ECC reg. DDR2-667 DIMMs at 667MHz |
| CAS latency (CL) | 5 | 5 | 5 | 5 |
| RAS to CAS delay (tRCD) | 5 | 5 | 5 | 5 |
| RAS precharge (tRP) | 5 | 5 | 5 | 5 |
| Storage controller | Intel 6321 ESB ICH with Intel Matrix Storage Manager 7.6 | Intel 6321 ESB ICH with Intel Matrix Storage Manager 7.6 | Broadcom RAIDCore with 1.1.7057.1 drivers | Nvidia nForce Pro 3600 with 6.87 drivers |

Common to all systems:

Hard drive: WD Caviar WD1600YD 160GB
Graphics: Integrated ATI ES1000 with 6.14.10.6553 drivers
OS: Windows Server 2003 R2 Enterprise x64 Edition with Service Pack 2
Power supply: Ablecom PWS-702A-1R 700W

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

We start with some synthetic tests of the cache and memory subsystem, and the first one shows us that the 45nm Xeon E5472 pretty much matches the Xeon X5365 in L1 and L2 cache bandwidth. The only big difference is at the 16MB block size, where the E5472’s larger 6MB L2 cache helps out some. Both of these chips run at 3GHz, so they’re a clock-for-clock match. We’ll want to watch these two to see how much, if any, the Harpertown Xeon E5472s improve per-clock performance.

Let’s take a closer look at the tail end of these results, where we’re primarily accessing main memory. I believe these results show memory bandwidth available to a single CPU core, not total system bandwidth, but they’re still enlightening.

The Stoakley platform’s faster bus and higher memory frequencies add up to a nice boost in bandwidth over the older Xeons on the Bensley platform. Again, I don’t think we’re seeing absolute peak bandwidth, especially from the Xeons, but we can see a relative boost in throughput.

Memory access latencies are essentially unchanged from the older Xeons to the newer. Let’s look at this issue in a little more detail. In the graphs below, yellow represents L1 cache, light orange is L2 cache, red is L3 cache, and dark orange is main memory.

As one might expect, the Xeon E5472’s memory access latencies are lower at larger block sizes, like 16MB and 32MB, than the X5365’s. The faster bus and memory clocks likely deserve credit for that. More impressively, we measured the E5472’s 6MB L2 cache at 15 cycles of latency, just one cycle more than the 4MB L2 cache on the Xeon X5365 at the same clock frequency. That’s quite the contrast to the high latencies we found in the quad-core Opterons’ new L3 cache.
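For readers who think in nanoseconds rather than cycles, the conversion is simple. This small sketch applies it to the L2 latencies above at the 3GHz clock both chips share. The helper name is our own.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core clock cycles to nanoseconds."""
    return cycles / clock_ghz


e5472_l2_ns = cycles_to_ns(15, 3.0)   # 6MB L2 cache on the Xeon E5472
x5365_l2_ns = cycles_to_ns(14, 3.0)   # 4MB L2 cache on the Xeon X5365

print(f"E5472 L2: {e5472_l2_ns:.2f} ns, X5365 L2: {x5365_l2_ns:.2f} ns")
```

The extra cycle costs roughly a third of a nanosecond at this clock speed.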

SPECjbb2005

SPECjbb2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

SPECjbb2005 can be configured to run in many different ways, with different performance outcomes, depending on the tuning of the JVM, thread allocations, and all sorts of other things. I had no intention of producing a record score myself; I just wanted to test relative performance on equal footing. Much higher performance is available using alternative JVMs and the like, and we may explore those options in the future. For now, we’ll leave peak scores to the guys who spend their days optimizing for a single benchmark.

I used the Sun JVM for Windows x64, and I found that using two instances of the JVM produced the best scores on the Opteron-based systems. Scores with one or two instances were about the same on the Xeons, so I settled on two instances for my testing, with the following Java options:

-Xms2048m -Xmx4096m -XX:+AggressiveOpts

Those settings produced the following results:

The Xeon E5472 delivers a clock-for-clock performance increase of roughly 10% over the Xeon X5365 in this test, enough to vault it ahead of another not-yet-released product, the 2.5GHz Opteron 2360 SE, and into the top spot.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve Software’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.

I’ve included a quick Task Manager snapshot from the test below, and I’ll continue that on the following pages. That’s there simply to show how well the application makes use of eight CPU cores, when present. As you’ll see, some apps max out at four threads.

The new Xeon E5472s shave five seconds off of the X5365s’ time, impressively enough. This isn’t quite the ~10% gain we saw above, but it’s not bad, either. Notably, even the Opteron 2360 SEs are nearly half a minute slower than the E5472s.

Cinebench

Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

The theme of clock-for-clock performance gains continues in Cinebench, where the 45nm Xeons’ faster divider and SSE shuffle capabilities may be coming into play. The E5472s are only slightly faster than the X5365s with only a single thread in use, but the new Xeons scale better up to eight threads than the older models. Again, Intel is putting more distance between its top chip and AMD’s future Opteron 2360 SE.

POV-Ray rendering

We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

The per-clock performance gains come to a halt in POV-Ray, where the E5472s essentially match the X5365s. That still puts them in a tie for first place, though.

By the way, this beta version of POV-Ray seems to have a problem with single-threaded tasks bouncing around from one CPU core to the next, and this causes especially acute problems on NUMA systems. Since the vast majority of the computation time for the benchmark scene involves such single-threaded work, things turn out badly for the Opteron 2300s.

MyriMatch

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
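The job-queue scheme described above can be sketched in a few lines. This is our own illustration, not MyriMatch's source code: the function names, the placeholder work function, and the use of Python's `queue` and `threading` modules are all assumptions. CPython threads also won't speed up CPU-bound work, so the sketch shows only the structure.

```python
import queue
import threading


def split_into_jobs(proteins, num_threads, jobs_per_thread=10):
    """Split the protein list into roughly equal jobs (threads x 10 by default)."""
    num_jobs = num_threads * jobs_per_thread
    size = -(-len(proteins) // num_jobs)   # ceiling division
    return [proteins[i:i + size] for i in range(0, len(proteins), size)]


def run_jobs(jobs, num_threads, handle_protein):
    """Let each worker thread pull jobs from a shared queue until it is empty."""
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return                      # no work left for this thread
            local = [handle_protein(p) for p in job]
            with results_lock:              # one synchronization point per job
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


proteins = list(range(6714))                # stand-ins for the 6714 yeast proteins
jobs = split_into_jobs(proteins, num_threads=4)
results = run_jobs(jobs, 4, lambda p: p * 2)   # placeholder for the real matching work
print(len(jobs), "jobs, first job holds", len(jobs[0]), "proteins")
```

Because threads synchronize only when they fetch or finish a whole job, a fast thread simply takes more jobs, and none sits idle while work remains.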

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

One of the most striking things about these results is the fact that performance on the eight-core systems seems to top out at about four to six threads and drop off from there. I asked MyriMatch’s authors about this dynamic a few months ago, and here’s how they explained it:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution. Of course, machines with insufficient memory to store both spectra and sequence database at once suffer a tremendous performance penalty, but the benchmark employs a small database with a small spectral set to avoid this problem.
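The per-spectrum mutex the authors mention might look something like the sketch below. Again, this is only our guess at the shape of it. The `Spectrum` class, its fields, and the length-based placeholder score are hypothetical and not taken from MyriMatch.

```python
import threading


class Spectrum:
    """A spectrum with its own lock guarding the best match seen so far."""

    def __init__(self, spectrum_id):
        self.spectrum_id = spectrum_id
        self.best_score = float("-inf")
        self.best_peptide = None
        self._lock = threading.Lock()

    def offer(self, peptide, score):
        """Record `peptide` if it beats the current best score."""
        with self._lock:   # only threads touching this spectrum contend here
            if score > self.best_score:
                self.best_score = score
                self.best_peptide = peptide


def score_all(spectrum, peptides):
    for peptide in peptides:
        spectrum.offer(peptide, len(peptide))   # placeholder score: peptide length


shared = Spectrum(0)
batches = [["PEPTIDE", "AK"], ["LONGERPEPTIDER"], ["MK", "SAMPLER"]]
workers = [threading.Thread(target=score_all, args=(shared, b)) for b in batches]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(shared.best_peptide, shared.best_score)
```

With one lock per spectrum, threads block each other only when they hit the same spectrum at the same moment, which matches the occasional contention the authors describe.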

As they note, memory bandwidth may become a bottleneck with this application. And right on cue, the new Xeons on the Stoakley platform produce a substantial performance gain over the Xeon X5365s. The performance boost is enough for Intel to recapture the overall lead from the Opteron 2360 SEs.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with increasing numbers of threads.

The Xeon E5472s chalk up another victory, and they set a new record for Euler3D throughput in the process. The performance gains over the X5365s are present from one to eight threads, but they’re most pronounced at six and eight threads, where bus and memory bandwidth limitations are most likely to become a factor. In fact, the E5472s are faster at six threads than the X5365s are at eight.

The Panorama Factory

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Xeon E5472s continue to post solid performance gains in this image processing application, finishing the panorama generation process nearly two seconds quicker than the Xeon X5365s. Looking at the results from the individual operations in this process, we can see small gains from the E5472s at nearly every stage. Proportionally, some of the biggest gains come in the stitch and render operations.

picCOLOR

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.

The new Xeons post strong per-clock performance gains in some of picCOLOR’s functions, especially in the Fourier (FFT/PWR) one, where the E5472s post a score of 17.71 versus the X5365’s 11.62. I asked Dr. Müller about this function, and he said: “The FFT/PWR function calculates the Fourier transform of the image, then displays the power spectrum, and then reconstructs the original image by inverse Fourier transform.” That makes this function a good candidate for taking advantage of Penryn’s tweaks. In fact, the inner kernel of the FFT algorithm uses a bit shuffle function, and the power part of the function includes “a few MULs, one ADD, and one SQRT.” So we should be seeing both Penryn’s fast SSE shuffle and its optimized square root logic in action.
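To make the transform, power spectrum, and inverse sequence concrete, here is a minimal sketch on a single row of made-up pixel values. It uses a naive O(n²) discrete Fourier transform rather than a true FFT, so there is no bit-shuffle step, and it works in one dimension rather than on a full image. None of this is picCOLOR's code.

```python
import cmath
import math


def dft(samples, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, x in enumerate(samples):
            total += x * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


def power_spectrum(spectrum):
    """Magnitude of each bin: two multiplies, one add, and one square root."""
    return [math.sqrt(c.real * c.real + c.imag * c.imag) for c in spectrum]


row = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]   # one made-up row of pixel values
spectrum = dft(row)
power = power_spectrum(spectrum)
restored = [c.real for c in dft(spectrum, inverse=True)]

print([round(p, 3) for p in power])
print([round(v, 3) for v in restored])
```

The square root in `power_spectrum` runs once per frequency bin, which is why a faster square root unit could plausibly show up in this function's score.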

Windows Media Encoder x64 Edition

Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

The E5472s are at it again, finishing the encoding task 20 seconds before their like-clocked predecessors.

SiSoft Sandra Mandelbrot

Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
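The column-interlacing scheme from SiSoft's description can be sketched as follows. We've simplified it to a fixed assignment, where thread t takes columns t, t+n, t+2n, and so on, which may differ from how Sandra hands out columns. The image is also shrunk from 640×480 to keep the run short, and all names are our own.

```python
import threading

WIDTH, HEIGHT, MAX_ITER = 64, 48, 255   # the benchmark uses 640x480; smaller here


def escape_count(cx, cy, max_iter=MAX_ITER):
    """Iterations before z = z^2 + c escapes |z| > 2, capped at max_iter."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render_columns(image, first_col, stride):
    """Fill every `stride`-th column starting at `first_col`."""
    for x in range(first_col, WIDTH, stride):
        cx = -2.0 + 3.0 * x / (WIDTH - 1)
        for y in range(HEIGHT):
            cy = -1.0 + 2.0 * y / (HEIGHT - 1)
            image[y][x] = escape_count(cx, cy)


def render(num_threads):
    """Render the fractal with columns interleaved across `num_threads` threads."""
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    threads = [threading.Thread(target=render_columns, args=(image, t, num_threads))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image


single = render(1)
eight = render(8)
print("identical output:", single == eight)
```

Interleaving columns spreads the expensive pixels near the fractal's interior fairly evenly across threads, and since no two threads write the same pixel, no locking is needed.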

The E5472s’ performance gains here aren’t quite what we’ve seen elsewhere, but it hardly matters. Nothing can touch the 3GHz quad-core Xeons.

POV-Ray power consumption and efficiency

Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we asked POV-Ray to render our “chess2.pov” scene at 1024×768 resolution with antialiasing set to 0.3.

Before testing, we enabled the CPU power management features for Opterons and Xeons—PowerNow! and Demand Based Switching, respectively—via Windows Server’s “Server Balanced Processor Power and Performance” power scheme.

Incidentally, the 5300-series Xeons I’ve used here are newer G-step models that promise lower power use at idle than older ones. I used a beta BIOS for our SuperMicro X7DB8+ motherboard that supports the enhanced idle power management capabilities of G-step chips. Unfortunately, I’m unsure whether we’re seeing the full impact of those enhancements. Intel informs me that only newer revisions of its 5000-series chipset support G-step processors fully in this regard. Although this is a relatively new motherboard, I’m not certain it has the correct chipset revision.

Of course, our Stoakley platform should support the further reductions in idle power offered by the Xeon E5472s.

Anyhow, here are the results:

Without any extra help, you can easily see that the new Xeons bring big reductions in power use over the X5365s. We can slice up the data in various ways in order to better understand them, though. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Stoakley platform draws about the same at idle as Bensley does when coupled with low-power Xeons. The E5472s on Stoakley draw 20W less at idle than their 3GHz counterparts on the Bensley platform, but that’s still quite a bit more power draw at idle than any of the Opterons.

Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.

The Stoakley/Harpertown pairing brings a drastic drop in power draw versus the Xeon X5365s on Bensley. In fact, the Stoakley/Harpertown combo at 3GHz draws less power than the Bensley/Clovertown pairing at 2.33GHz. Notably, the Xeon E5472 system also consumes less power than the Opteron 2360 SE-based one.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
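
Put another way, total energy is the area under the power curve. The sketch below shows one way to estimate it from logged samples using the trapezoidal rule. The (seconds, watts) log format and the name `energy_joules` are assumptions for illustration, not a description of our actual tooling.

```python
def energy_joules(samples):
    """Estimate energy in joules (watt-seconds) from (seconds, watts) samples.

    Uses the trapezoidal rule, so unevenly spaced samples are handled.
    """
    total = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        total += 0.5 * (w0 + w1) * (t1 - t0)
    return total


# A steady 100W load held for 10 seconds works out to 100 * 10 watt-seconds.
steady = [(0.0, 100.0), (10.0, 100.0)]
total = energy_joules(steady)
```

Because the sum covers the whole test period, a system that finishes the render quickly but idles at a high wattage can still end up with a large total.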

When you slice things this way, the Opterons tend to excel, led by the low-power Opteron 2347 HE. However, the Stoakley/Harpertown system isn’t far behind, and it edges out the low-power Xeon L5335.

We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve chosen to identify the end of the render as the point where power use begins to drop from its steady peak. We’ve sometimes seen disk paging going on after that, but we don’t want to include that more variable activity in our render period.

We’ve computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
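
As a rough illustration of that approach, the Python sketch below finds the point where power falls away from its steady peak and integrates only up to that point. The 5% drop threshold, the 30-to-40-second plateau window, the (seconds, watts) log format, and the function names are all assumptions chosen for the example; in practice the cutoff would need to be checked against each system's trace.

```python
from statistics import median


def render_end_index(samples, plateau_window=(30, 40), drop_fraction=0.05):
    """Index of the last sample before power drops from its steady peak.

    plateau_window and drop_fraction are hypothetical tuning values.
    """
    lo, hi = plateau_window
    plateau = median(watts for t, watts in samples if lo <= t < hi)
    threshold = plateau * (1.0 - drop_fraction)
    for i, (t, watts) in enumerate(samples):
        if t >= hi and watts < threshold:
            return max(i - 1, 0)
    return len(samples) - 1  # power never dropped within the log


def render_energy_joules(samples, render_start_s=0.0):
    """Energy in joules used from render_start_s until the end of the render."""
    end = render_end_index(samples)
    span = [(t, w) for t, w in samples[:end + 1] if t >= render_start_s]
    return sum(0.5 * (w0 + w1) * (t1 - t0)
               for (t0, w0), (t1, w1) in zip(span, span[1:]))


# Hypothetical trace: 300W while rendering, dropping to 200W at t=50s.
trace = [(float(t), 300.0 if t < 50 else 200.0) for t in range(60)]
render_joules = render_energy_joules(trace)
```

Cutting the span at the first drop from the plateau leaves out any disk paging that follows the render, which is the more variable activity we wanted to exclude.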

In what may be our best measure of energy-efficient performance, the Xeon E5472/Stoakley system distances itself from the pack. Even AMD’s impressive new quad-core Opterons, our previous champs, are well behind it.

Power use at partial utilization with SPECjbb2005

Before we close out our look at power efficiency, I’d like to consider another example. I’ve measured power use in SPECjbb2005 in order to show how it scales with incremental increases in load. I’ve only used a single instance of the JVM so that we can see a nice, gradual step up in load—two instances would take us to peak utilization much quicker.

We’ve graphed the quad-core Opterons and Xeons together. Since the dual-core Opterons take much longer to finish, they get their own graph.

The E5472s look great here, as well, starting at idle power levels similar to the Xeon L5335 and peaking out right alongside the 2.33GHz Xeon E5345.

Conclusions

The combination of Intel’s 45nm Harpertown Xeons and their supporting Stoakley platform brings incremental but compelling gains in performance over current Xeons on the Bensley platform. Clock for clock, the new Xeons delivered performance gains in the majority of our tests. Those gains were especially notable in SPECjbb2005, where we saw about a 10% increase, and in memory bandwidth-limited applications like MyriMatch and Euler3D’s CFD solver, where the advances were even greater.

This higher clock-for-clock performance comes alongside a considerable drop in peak power use at 3GHz (from 403W for the Xeon X5365 system to 311W for the Xeon E5472 system) and a smaller but welcome drop in power draw at idle. The faster performance and lower power consumption together make the Stoakley/Harpertown combo an excellent “performance per watt” proposition, as our measure of energy required to render a scene demonstrated. In fact, no other solution was close in this respect. The new Xeons’ weakness on the efficiency front remains power draw at idle, a problem largely attributable to Intel’s continued use of FB-DIMM memory. For this reason, AMD’s quad-core Opterons remain competitive in terms of overall power efficiency.

Those new Opterons will certainly have their hands full with Intel’s 45nm Xeons, though. The Xeon E5472 extends Intel’s performance lead over the fastest quad-core Opteron we’ve seen yet, the 2.5GHz model 2360 SE. Of course, neither chip is available to the public as a product just yet, though both are promised for the fourth quarter of this year. Right now, if both companies make good on their plans, it looks like Intel will continue to lead in the server and workstation markets. The same may be true in other markets served by these same basic CPU designs, but only time will tell for sure.

Comments closed
    • someotherguy5
    • 12 years ago

    I am interested in hearing more details about PCI 2.0 on the server (not workstation). PCI 2.0 is exciting for me because it means the bus is no longer the “limiting factor” for using multiple NICs running at 10gbps. (think commodity router).

    Intel’s Seaberg (5400?) chipset apparently will replace the “Blackford”, Intel 5000P (The memory controller hub for a Xeon Dual-socket server).

    • halbhh
    • 12 years ago

    Quick power cost calculation for servers:

    We know that servers must have the performance to meet peak demand, and so they typically idle a lot at non-peak times, and reports put the idle time at 75% to 85% per day. Let’s use 80% idle time and that other 20% at full load as a rough approximation for the typical server. From the article, the 2360SE vs the new Xeon shows roughly 45 watts advantage for the Opty at idle and 20 watts disadvantage at load. Also for SPECjbb2005 the Xeon is about 5% faster, so put the Opty 2360SE load time up from 20% to 21% to match (even a number of 25% wouldn’t change this calculation much; greater precision would imply more accuracy than we have here (significant digits, etc.)).

    The 2360SE 2.5GHz vs the new Stoakley/E5472 3GHz (enough of a performance advantage to be somewhat noticeable, but not completely out of range for a comparison.)

    In 1 year at 20 cents/kWh, you get the 2360SE electrical cost advantage for 1 year at about $50-$60 in typical server operation, or perhaps $200 over a 3.5yr server lifetime, or as much as $300 for a long lived server.

    This is more than nothing, but not decisive for all purchasers. Initial cost will often be more important in such a comparison. Of course, in a crowded server room, power use at idle and load can be very important also due to the limits of the building.

    As enthusiasts we like to get excited about what are actually modest differences. These chips are not far apart as servers actually.

    For specific applications like HPC, on the other hand, the choice comes down to what the application is.

      • smilingcrow
      • 12 years ago

      FB-DIMMS are killing Intel’s idle power consumption figures but if you strip out that penalty the difference between Barcelona and Harpertown is tiny; it’s a 13W spread with Harpertown having the edge at least at higher clock speeds.

      Of course this is theoretical but when you consider that FB-DIMMs must be due for retirement next year it’s still a concern for AMD. If they lose the scalability and power efficiency advantage (i.e Nehalem) they are only left with value for money. That’s not a place they want to compete at in the server sector as it’s killing them on the desktop seemingly. It concerns me anyway.

      • Mr Bill
      • 12 years ago

      You also have a cooling load for that extra dissipated heat. That typically adds another 25% of the energy difference into your total power bill.

      c.f. http://en.wikipedia.org/wiki/Seasonal_energy_efficiency_ratio

        • halbhh
        • 12 years ago

        37, well, with that simple multiplier, $200 becomes $250, $300 about $380 (keeping only 2 digits). I suppose if you really want to get into it, you’d compare the initial cost time-value, tax write off, etc. 🙂

    • Mr Bill
    • 12 years ago

    Nice review. I suggest that when on the comments page, there still be a link to the article at the bottom of the summary. As it is you have to back out to the front page to go back to the article.

      • UberGerbil
      • 12 years ago

      Yeah, I thought so too — but you can click on the picture. Not exactly obvious, though, and I don’t see why the title can’t be a link to the article.

        • Mr Bill
        • 12 years ago

        Oh! Thanks for pointing that out. 😉

        • Usacomp2k3
        • 12 years ago

        You’ve always been able to click on the picture.

          • UberGerbil
          • 12 years ago

          Yeah, but you /[

            • flip-mode
            • 12 years ago

            Agreed. Sorta like the rest of the internet.

            • UberGerbil
            • 12 years ago

            What does the “HT” in HTML stand for, again?

            • leor
            • 12 years ago

            Hyper Tushy

            • UberGerbil
            • 12 years ago

            I would think that applies only to your lingerie-enhanced hardware reviews, Leor.

    • derFunkenstein
    • 12 years ago

    i think the new Xeons are still being held back by the FB-DIMMs and the way they’re buffered, with huge latency compared to desktop RAM…Penryns will be even faster on the desktop

      • tfp
      • 12 years ago

      It is interesting that the latency is still comparable with the new AMD chips even though they are using FB-DIMMs.

      • Krogoth
      • 12 years ago

      FB-DIMM’s biggest drawback is actual power consumption for the DIMMs and controller.

      The latency isn’t as severe as paper would indicate. Merom and Penryn based chips are both very apathetic to excessive memory bandwidth and tight latencies.

        • smilingcrow
        • 12 years ago

        The Inquirer (yeah, I know) had an article yesterday which showed FB-DIMM versus Registered RAM in terms of power consumption on the AMD and Intel 2P quad-core platforms. The data showed that FB-DIMMs require 8W per stick and Registered DDR2 = 1W per stick.

        If you factor that into the data sets in the Barcelona and Harpertown reviews what you see is that Opteron has quite an ordinary performance per watt and it only looks good because Intel made a BAD CHOICE in power terms and not because AMD did anything positive. With Intel likely to move away from FB-DIMMs for Nehalem (is this confirmed?) their luck is about to run out in this area.
        On the same note, is the San Clemente 2P chipset that supposedly supports Registered RAM still due Q4?

        Even with FB-DIMMs Harpertown (3.0) is easily beating Barcelona (2.5) at load; admittedly this is with POV-Ray which is not exactly a typical server application.
        Here’s the data in terms of energy needed to perform the task:

        2350 – 20.7
        2360SE – 20.3
        X5365 (FB) – 21.6
        X5365 (REG) – 18.6
        E5345 (FB) – 21.5
        E5345 (REG) – 17.8
        E5472 (FB) – 15.7
        E5472 (REG) – 12.9

        There are two figures for the Intel chips the real one with FB-DIMMs and the calculated data that uses estimated power data for Registered RAM as being 1W per stick rather than 8 which saves 56W per system; they all had 8 sticks.
        I just hope this is an atypical result otherwise it points to AMD being crushed next year unless they can release a very good 45nm part.

        Of course HT gives AMD a very nice scalable architecture but even that advantage seems likely to disappear with Nehalem.

    • apopilot
    • 12 years ago

    What about gaming performance?

      • ucisilentbob
      • 12 years ago

      How often do you game on your Server? Both the Barcelona and this review are both server oriented hence the server benchmarks. Seeing as there is a small but still relatively consistent increase in overall performance of the Penryn based Xeons over the previous Conroe Based Xeons, it’s safe to say that the Penryn based Core 2 Duo/Quads would benefit in the same range.

        • Kurlon
        • 12 years ago

        /[

          • UberGerbil
          • 12 years ago

          Really? With registered ECC and FB-DIMMs and onboard SCSI and no overclocking options?

    • leor
    • 12 years ago

    over a year later and intel finally releases stoakley . . .

    where were you when i was setting up my workstation??

      • Anomymous Gerbil
      • 12 years ago

      Lucky, because after these there will not be any further improvements in PC tech forb[

        • leor
        • 12 years ago

        this is the first workstation chipset that’s worth a damn for socket 771 since its release, wise ass, and it was supposed to be out over 6 months ago.

        the 5000 series was basically a server chipset, which made me have to go with a 1207 based system to get the PCI-e lanes I need.

          • Smurfer2
          • 12 years ago

          Leor, right now you are a sarcasm magnet…. Congrats…. I think…

            • leor
            • 12 years ago

            not my fault people don’t know the history of the 771 socket and took my words incorrectly. intel had awful workstation support for the life of this platform, and I think it was irresponsible for them to let my market segment languish for so long. This platform was supposed to be out in Q1 of this year.

            if you wanted a dual socket workstation AMD has been the only game in town, unless 20 lanes of PCI Express is enough for you, I’m personally using 40.

            • Anomymous Gerbil
            • 12 years ago

            Haha, poor Leor… he’s so misunderstood!

      • indeego
      • 12 years ago

      Here’s another hint: Don’t upgrade between March 14 of 2011 through May 3, 2012, because on May 4, 2012 at 3:15:09 p.m. Intel will release something badassg{

        • ssidbroadcast
        • 12 years ago

        Wait, but isn’t that only 3 days, 18 hours, 26 minutes, and 4 seconds from the Zombie Uprising of Doom?

    • Prototyped
    • 12 years ago

    By the way, as a coup de grace for AMD, IBM will start shipping AMD K10-based Opteron systems a week i[http://episteme.arstechnica.com/eve/forums/a/tpc/f/77909774/m/151009737831?r=330006247831#330006247831<]§

    • lex-ington
    • 12 years ago

    I wouldn’t say Intel was a sleeping beast . . more like a Fat, Lazy beast that now has to exercise to keep what they once had.

    It amazes me how much stuff Intel can keep pushing out without anyone complaining . . . like the amount of chipsets and processors they’re cooking up . . . but if AMD wants to switch something and make it backwards compatible, they’re a dead-end company.

    • Peldor
    • 12 years ago

    I was hoping to see some 1333 FSB Penryns for a more direct comparison with the Clovertowns. There are only a couple of high-end 1600 FSB Harpertowns on the charts at this point.

    • 5150
    • 12 years ago

    Didn’t he get signed with Denver?

    Meh.

      • droopy1592
      • 12 years ago

      ROFL

    • nstuff
    • 12 years ago

    Comparison of the 2.0 and 2.5GHz Barcelona chips shows a pretty nice scale in performance. In at least a few cases, assuming the same increase when jumping to 3.0GHz shows Barcelona may keep up or surpass the new Xeons when they finally ramp up the clock speeds. Time will tell though.

      • just brew it!
      • 12 years ago

      *[

    • lyc
    • 12 years ago

    awesome review, thanks scott 🙂

    well, that about seals it for amd (fingers crossed that intel+nvidia don’t become the only player in their markets); i’m pretty disappointed, but it’s hardly an accident we’re seeing here.

      • ucisilentbob
      • 12 years ago

        As bleak as it looks right now with the low clock speeds of the Barcelona chips, AMD is far from dead. The Architecture looks mighty competitive to the Core 2 Architecture clock for clock. AMD may be very late to the party but the party is far from over.

        • packfan_dave
        • 12 years ago

        I hate to be so pedantic on this, but you’d think only a few years removed from Northwood’s beat-down of the Athlon XP, people would remember that clock for clock performance is of absolutely no importance whatsoever.

        The variables are performance (in whatever apps you’re actually running), cost, and power consumption (and this isn’t much of a factor as long as it stays reasonable, and you’re not building a portable device or ultra dense rack server). It looks like AMD doesn’t have a part that competes with the top-clocked Clovertowns in most apps, let alone Harpertowns, at this time (only the 1.9 and 2 GHz Barcelonas are ‘released’) or in the foreseeable future (the 2.5 GHz Barcelona made available for benchmarketing is probably the best AMD will ship this year). So AMD’s left with competing on cost with Intel’s down-market chips, and talking about HPC and other bandwidth-sensitive apps. Which they were doing with K8 anyway.

          • ucisilentbob
          • 12 years ago

          What I was saying is that if they can ramp up their clock speeds, the game is far from over. At least the barcelona architecture is still efficient enough that I personally don’t see AMD in the weeds for too long.

            • tfp
            • 12 years ago

            If ifs and buts were candy and nuts every day would be Christmas.

    • king_kilr
    • 12 years ago

    Urgh, I want to see desktop benchies!

      • rxc6
      • 12 years ago

      Wait for a desktop processor then 😉

    • BoBzeBuilder
    • 12 years ago

    AMD woke a sleeping beast with their K8.

      • nexxcat
      • 12 years ago

      Indeed they did. They didn’t realise Intel can run two microarchitecture projects in parallel. What’s interesting is Intel seems to be holding back some, releasing just enough to stay ahead of AMD.

      AMD needs to get their act together soon.

      • Dposcorp
      • 12 years ago

      Yeah, although that was 3-4 years ago. lol
      K10 is da bomb, yo!

        • derFunkenstein
        • 12 years ago

        you read the same article I did? 😆
