As you may know, Intel has enjoyed a resurgence in its server and workstation processor business over the past several years, due in no small part to regular and effective refinements to its core CPU technology. The introduction of the “Nehalem” quad-core Xeons last year was the biggest step forward the firm has taken in many years, with a whole new system architecture nicely complementing a revamped processor microarchitecture. The results were major gains in scalability, performance, and power efficiency compared to the prior generation of Xeons, along with renewed strength for Intel’s competitive standing versus its main rival, AMD.
By contrast, this year’s revision of the Xeon is comparatively simple, even modest. The new 32-nm Xeons, code-named Westmere-EP, raise the on-chip core count by two, to a total of six cores per chip, while fitting into the same socket and cooling infrastructure as the Nehalem Xeons before them. The Westmere Xeons’ clock frequencies are largely similar, as is the per-clock performance of each core.
A change like that is easy to grasp, but it’s also easy to underestimate. In the thread-rich realm of server-class applications, with a robust system architecture like this one, adding two more cores can boost performance by nearly 50%. From another angle, that boost could translate into a similarly large increase in energy efficiency, because half again as much work is being accomplished for each watt-hour the system consumes. If talk like that doesn’t float your boat, you’re probably not a system administrator responsible for a room full of servers. I’d wager most folks in such roles would happily accept a 50% gain in power-efficient performance each year, if they could get it.
The question is whether the Westmere-EP Xeons really deliver on their advertised promise. We’ve had a number of systems cranking away in Damage Labs for the last little while in order to find out, and, without giving away the game entirely, the news is even better than you might think. Since our last look at workstation/server-class processors, the state of the art in such systems has changed on multiple fronts, from the growing prevalence of platforms tailored for power efficiency to the proliferation of solid-state disks. Our revised suite of test systems provides a nice overview of the landscape. Read on to see how it all fits together.
Westmere-EP: both less and more
Intel’s 32-nm chip fabrication process is what makes Westmere possible. This relatively new fabrication technology allows substantially more gates (and thus transistors, logic, and ultimately cores) to fit into a given amount of chip area than the 45-nm processes used formerly by Intel and still today by AMD. In this generation of process tech, Intel has carried over its high-k + metal gate transistors, first used at 45 nm, and moved to immersion lithography (in which a liquid medium is used to better focus light) for the first time. By now, Intel is well into ramping its 32-nm production, with the dual-core Clarkdale and six-core Gulftown processors making up a large proportion of its consumer mobile and desktop CPU lineups. In fact, our review of these Xeon processors is rather late; the Westmere-based Xeon 5600 series has been shipping to customers for a number of months, as well.
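|Code name||Key products||Cores||Threads||Last-level cache size||Process node (nm)||Estimated transistors (millions)||Die area (mm²)|
|Harpertown||Xeon 5400||2 x 2||2 x 2||2 x 6 MB||45||2 x 410||2 x 107|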
|Nehalem-EP||Xeon 5500||4||8||8 MB||45||731||263|
|Westmere-EP||Xeon 5600||6||12||12 MB||32||1170||248|
|Shanghai||Opteron 2300||4||4||6 MB||45||758||258|
|Istanbul||Opteron 2400||6||6||6 MB||45||904||346|
|Lisbon||Opteron 4100||6||6||6 MB||45||904||346|
|Magny-Cours||Opteron 6100||2 x 6||2 x 6||2 x 6 MB||45||2 x 904||2 x 346|
The remarkable thing about the Westmere-EP Xeons, as illustrated in the table above, is that they incorporate two more cores and 50% more cache (L3 size is up from 8MB in Nehalem to 12MB here), yet they are actually smaller chips than their predecessors.
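To put rough numbers on that claim, here is a quick back-of-the-envelope calculation. It reads the last two columns of the table above as estimated transistor counts in millions and die areas in mm², which is our interpretation of the figures; the helper names are ours, too.

```python
# Figures copied from the table above:
# (estimated transistors in millions, die area in mm^2).
chips = {
    "Nehalem-EP": (731, 263),
    "Westmere-EP": (1170, 248),
    "Istanbul": (904, 346),
}


def density(name):
    """Millions of transistors per square millimeter for the named chip."""
    transistors, area = chips[name]
    return transistors / area


for name in chips:
    print(f"{name}: {density(name):.2f}M transistors/mm^2")

# Relative change in die area going from Nehalem-EP to Westmere-EP.
area_change = chips["Westmere-EP"][1] / chips["Nehalem-EP"][1] - 1
print(f"Westmere-EP die area vs. Nehalem-EP: {area_change:+.1%}")
```

By this crude measure, Westmere-EP packs roughly 70% more transistors into each square millimeter than Nehalem-EP does, while the die itself shrinks by a few percent.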
A close-up of a Westmere-EP wafer. Source: Intel.
AMD hasn’t made a process transition lately, and GlobalFoundries currently lags behind Intel by roughly a year, if not more. Thus, Westmere’s competition is a much larger chip, at 346 mm², with the same core count. In fact, the most direct competition for the Westmere Xeons is arguably the Opteron 6000 series, which is based on two of those larger chips packaged together in each socket. The contrasts here are stark enough to incite me to use italics twice in two paragraphs, so we’re not talking small potatoes. Smaller chips, of course, are generally more desirable for a number of reasons, including lower manufacturing costs and typically lower power draw with tamer thermals.
By and large, Westmere-EP is essentially a Nehalem Xeon that’s been ported over to the new 32-nm process, but it has received a host of notable tweaks along the way, not least of which is the aforementioned addition of 50% more cores and cache. Thanks to Intel’s version of simultaneous multithreading, known as Hyper-Threading, a six-core Xeon can track and execute 12 hardware threads. Two Westmere Xeons in a 2P system present an imposing total of 24 threads to the OS.
The other modifications in Westmere-EP are minor but numerous. Some of them boost performance in various ways. A suite of seven new instructions, collectively dubbed AES-NI, can accelerate cryptography. The chip’s integrated memory controller now supports two DIMMs per channel at 1333MHz, raising the limit from 1066MHz in Nehalem. Also, the number of memory buffers has risen from 64 to 88, offering the potential for higher peak bandwidth at a given memory frequency. And, as is almost customary these days, certain latencies have been reduced in the CPU’s virtualization hardware, potentially enhancing performance for consolidated servers.
Another set of changes in this new silicon focuses on advancing power efficiency. The Nehalem Xeons introduced a gate capable of shutting off power to idle cores; Westmere adds a power gate for the “uncore” portion of the chip capable of reducing the voltage to the memory controller, L3 cache, and QuickPath interconnect when both sockets in a 2P system are idle. Another potential heavy hitter for server installations will be the memory controller’s ability to support low-voltage DDR3 memory, which has become available in recent months. The chip’s APIC timer now continues running when the CPU goes into a deep sleep state, too.
A pair of Xeon X5670 processors
From this Westmere-EP silicon, Intel has spun an entire range of new Xeons dubbed the 5600 series. We detailed the various models here when those products were first introduced. The 5600 lineup and its pricing appear to have remained largely static since then.
What’s on the bench
The Xeons we have for review represent the best of the 5600 series on one axis or another, and we’ve tested them in different types of systems as appropriate. The most extreme of the bunch is the Xeon X5680, which has a base clock speed of 3.33GHz and can raise its frequency as high as 3.6GHz via Turbo Boost when the thread count and thermal headroom permit. The X5680’s max power and thermal rating, or thermal design power (TDP), is 130W, which puts it on the high end of the power spectrum. As Intel’s fastest 2P processor, this model commands a hefty price premium, too. A single X5680 will set you back $1663.
Our test platform for this beast is a relatively large, floor-standing workstation enclosure with a SuperMicro X8DA3 motherboard and a 700W power supply. That combination is comfortably up to the task of cooling and powering a system with a pair of 130W processors.
We should note that, although 5600-series Xeons are billed as drop-in replacements for the 5500-series Xeons before them, at Intel’s recommendation, we upgraded the motherboard in this test system rather than using the older version of the X8DA3 used in our Xeon 5500 review. That older X8DA3 was pre-production hardware from the early days of Nehalem, so the change was needed for optimal operation. However, Intel tells us many Xeon 5500-based systems should allow for seamless drop-in upgrades to Westmere Xeons. As is usually the case in these scenarios, you’ll want to check with your motherboard or system vendor for compatibility information.
Asus’ RS700-E6 1U server
The X5680’s 130W TDP will probably rule it out of most server installations. Xeons in the 95W power band are more common, and the X5670 is Intel’s fastest offering at that TDP. The X5670 runs only slightly slower than the X5680, with a 2.93GHz base clock and a 3.33GHz Turbo peak. Stepping down to the X5670 will give you a nice break on max power ratings, but at $1440, it’s not much less expensive.
We’ve tested the X5670 in an Asus 1U server system, pictured above. We also dropped a pair of Xeon X5570 processors into this system (the prior-gen Nehalem offering at the same frequency and TDP) to see how the two generations of Xeons compare.
The low-power Willowbrook server
To many folks, the Xeon L5640 may be the sexiest of these new CPUs. Its six cores run at 2.26GHz and can spool up to 2.8GHz via Turbo Boost, yet this Xeon’s TDP rating is a calm and collected 60W. Naturally, that fact makes the L5640 a fantastic candidate for a power-efficient server. You will pay a premium for this sort of power efficiency, though: the L5640 lists at $996 per chip.
Our test system matches a pair of L5640s with a custom motherboard from Intel officially known as the S5500WB and unofficially code-named Willowbrook. Although this Willowbrook board is based on the same Tylersburg (excuse me, I mean “Intel 5500 series”) chipset as our other Xeon systems, Intel has specifically optimized this board for reduced power consumption. Those optimizations include a carefully tuned voltage regulator design and more widely spaced components intended to permit airflow and reduce the energy required by cooling fans. The firm claims a 32W savings at idle and a 42W savings under load versus its own S5520UR motherboard.
To that potent mix of power-efficient components, we’ve added six DIMMs of low-power Samsung DDR3 memory. These DIMMs operate at only 1.35V, and Samsung happily touts them as a greener alternative to traditional DDR3 modules.
As you may be gathering by now, this entire platform ought to be quite nicely tailored for low-power operation. To give us a sense of how the enhancements in the Westmere Xeons alone contribute to this system’s efficiency and performance, we’ve tested a couple of quad-core Xeon L5520 processors in this same system. The L5520 has the same 2.26GHz base clock at 60W TDP as the L5640, but its 2.53GHz Turbo max is lower, and its memory speed tops out at 1066MHz.
A competitive imbroglio
As our regular readers will attest, we usually try to test products against their closest competition whenever possible. For the Xeon 5600 series, that competition would most definitely be the latest Opterons from AMD. In order to keep pace with Intel’s formidable performance gains in recent years, AMD has elected to double up on the number of chips it delivers in a single package. The resulting processors, code-named Magny-Cours, were formally announced in late March as the Opteron 6100 series. With 12 cores and four channels of DDR3 memory per socket, these new Opterons promise substantial gains over the six-core Istanbul chips introduced a year ago, even though the basic building block is essentially the same hunk of silicon.
Doubling up on chips per socket can be a savvy strategy in the server market, one that Intel itself validated with its Harpertown Xeons back in 2007. Seeking to upgrade performance by raising clock speeds is a tricky endeavor, because it requires increases in chip voltage that can raise power draw disproportionately. By keeping clock speeds low, and thus voltages in check, AMD has made room for multiple chips per socket while staying within its traditional power bands. For widely threaded workloads, this approach could pay solid performance dividends.
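For a sense of why the trade-off works, here is a minimal sketch using the common first-order model of CMOS dynamic power, in which power scales with capacitance times voltage squared times frequency. The scaling factors below are hypothetical values we picked for illustration, not measurements of any Opteron or Xeon, and the model ignores leakage entirely.

```python
def relative_dynamic_power(freq_scale, volt_scale):
    """First-order model: dynamic power ~ C * V^2 * f, with capacitance held constant.

    Both arguments are multipliers relative to a baseline chip (1.0 = unchanged).
    """
    return volt_scale ** 2 * freq_scale


# Hypothetical scaling factors, for illustration only.
one_fast_chip = relative_dynamic_power(freq_scale=1.20, volt_scale=1.10)
one_slow_chip = relative_dynamic_power(freq_scale=0.80, volt_scale=0.90)
two_slow_chips = 2 * one_slow_chip

print(f"One chip, +20% clock, +10% voltage: {one_fast_chip:.2f}x baseline power")
print(f"Two chips, -20% clock, -10% voltage: {two_slow_chips:.2f}x baseline power")
```

Under these made-up numbers, two slower chips offer 1.6 times the aggregate clock cycles of the baseline for less power than a single chip pushed 20% faster. Whether real workloads can use all of those cores is another matter, of course.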
Several of the Opteron 6100 models look like good matches for the CPUs we’re testing. The Opteron 6176 SE with 12 cores, a 2.3GHz core clock, and a 105W ACP rating looks like a plausible rival to the Xeons X5680 and X5670 we have on hand. The 6176 SE’s $1386 price tag makes it a close competitor, too. Meanwhile, the Opteron 6164 HE at $744 might well be the closest competition for the Xeon L5640. With 12 cores at 1.7GHz and a 65W ACP, the 6164 HE could make things interesting, at least.
More recently, AMD has announced the Opteron 4100 series, code-named Lisbon during its development. These CPUs use only a single chip but add DDR3 support like Magny-Cours. The 4100-series Opterons are aimed primarily at compact, high-density server installations, and a bit of mystery surrounds the potential customers for these products. AMD expects the 4100 series to appeal to big web companies that buy large numbers of servers through custom design groups at major OEMs, but it has said not to expect sales figures from that business to become public. Whether such talk foreshadows stealthy success or silent-but-abysmal failure, we do not know.
We do know the 4100 series isn’t positioned directly against the Westmere Xeons, by and large. The fastest Lisbon chip is the Opteron 4184, with six cores at 2.8GHz and a 75W ACP, and it lists for $316. At that price, the 4184 competes against the quad-core Xeon X5500 processors that remain in Intel’s product portfolio.
Unfortunately, we don’t have any of these newer Opterons to test. We have worked with both AMD and Intel for years to make these reviews possible, and both companies have supplied us (and other publications) with samples of their latest products. Initially, this product cycle was no different. We even made a two-day visit to AMD’s server group in Texas to talk about the new Opterons a couple of months back, yet we’re still awaiting word on review samples. That is, frankly, one reason this review is a little late to publication; we had sincerely hoped to include a head-to-head comparison of the latest CPUs.
Of course, some of the blame for the absence of newer Opterons here lies with us. We should have seen the writing on the wall and pursued other avenues for getting hold of a system sooner, either by working with a server maker or just buying the stuff ourselves. We’re still hoping to put the new Opterons through their paces, but we couldn’t delay publication of this review any longer. For the time being, we’ve tested the latest Xeons against the products they replace, and against the older generation of Opterons. That’s not our favored outcome, but we should be able to get a good sense of the Westmere Xeons’ relative performance, regardless.
Fortunately, we do have an Opteron platform you may not have seen tested in the wild just yet. Tyan was kind enough to supply us with its S8212 motherboard, which is based on AMD’s SR5690 chipset, better known as the Fiorano platform. Fiorano is AMD’s first attempt to produce its own server platform in quite a few years, and it adds a few critical features to the Opteron’s quiver. Among them: support for the HyperTransport 3 and PCI Express 2.0 interconnects, both with higher throughput than the older versions of those standards. Although our Fiorano system doesn’t make use of DDR3 memory, it is otherwise comprised of basically the same components as any newer Opteron 4100 system, with the same SR5690 chipset and six-core, 45-nm processors. In this case, we’ve used Opteron 2435 CPUs clocked at 2.6GHz with a 75W power envelope. The analogous model in the 4100 series would be the Opteron 4180, which shares the same clock frequency and max power draw rating.
For what it’s worth, we built our Fiorano test rig using the same type of floor-standing enclosure and power supply as our Xeon X5680 box, so comparisons between those two should be reasonably direct, if something of a mismatch.
We have a power-optimized representative from the Opteron fold, as well, in the form of a 1U server with an efficient 650W PSU and a pair of Opteron 2425 HE processors. The Opteron 2425 HE is a six-core, 2.1GHz part with a 55W ACP. This system is based on an older SuperMicro H8DMU+ motherboard with an Nvidia chipset. Although it lacks a few new features, I believe this board is more power-efficient overall than most existing Fiorano-based mobos, which is why we chose to test the Opteron HEs on it.
All of our test systems benefited greatly in terms of power consumption and performance from the addition of solid-state drives for fast, local storage.
The folks at OCZ helped equip our test systems with enterprise-class Vertex EX SSDs. The single-level-cell flash memory in these drives can endure more write-erase cycles than the multi-level-cell flash used in consumer drives, so it’s better suited for server applications. SLC memory writes data substantially faster than MLC flash, as well. The only catch is that SLC flash is quite a bit pricier, as are the drives based on it. For the right application, though, a drive like the Vertex EX can be very much worth it. Heck, we even noticed the effects of these drives during our test sessions. Boot times were ridiculously low for all of the systems, and program start-up times were practically instantaneous.
We’ve also beefed up our lab equipment by stepping up to a Yokogawa WT210 power meter. The Extech unit we’ve used in the past would occasionally return an obviously erroneous value, and for that reason, the Extech hasn’t been sanctioned for use with SPECpower_ssj when the results are to be published via SPEC. The WT210 is a much more accurate meter that meets with SPEC’s approval and integrates seamlessly with the SPECpower_ssj power measurement components.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. We typically run each test three times and report the median result. In the case of the SPEC benchmarks, though, we’ve reported the results from the single best run achieved.
Our test systems were configured like so:
2425 HE 2.1GHz
Xeon X5670 2.93GHz
nForce Pro 3600
nForce Pro 3600
Rapid Storage Technology 9.6
Rapid Storage Technology 9.6
Rapid Storage Technology 9.6
ATI ES1000 with
with 184.108.40.206 drivers
Matrox G200e with 220.127.116.11 drivers
with 18.104.22.168 drivers
GeForce 8400 GS with ForceWare 257.15
Electronics DPS650SB 650W
Electronics DPS770BB 770W
Vertex EX 64GB SSD with firmware rev. 1.5
Server 2008 R2 Enterprise x64
We used the following versions of our test applications:
- SPECpower_ssj2008 1.10 with Oracle JRockIt JRE P28.0.0-29 Windows 64-bit
- SPECjbb2005 1.07 with Oracle JRockIt JRE P28.0.0-29 Windows 64-bit
- SiSoft Sandra 2010.SP1d
- Stream 5.8 64-bit
- CPU-Z 1.54
- Cinebench R11.5 64-bit Edition
- POV-Ray for Windows 3.7 beta 37a 64-bit
- CASE Lab Euler3d CFD benchmark multithreaded edition
- MyriMatch proteomics benchmark
- 7-Zip 4.65 64-bit
- x264 HD benchmark 3.0
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
This synthetic test gives us a quick, visual map of the cache hierarchies of these processors. As you can see, the six L1 and L2 caches of the Westmere Xeons deliver considerable cumulative throughput: over a terabyte per second at peak in our 2P Xeon X5680 system.
The Xeon X5670 is over 50% faster than the X5570 at the 256KB and 512KB data points, most likely due to the increased coverage offered by two more L1 and L2 caches. Because those caches are inclusive (that is, the L2 caches replicate the contents of the L1s), the total effective size of the X5670’s L1 and L2 caches is 192KB, while the X5570’s add up to 128KB.
Unfortunately, this program’s sample points are too coarsely distributed to show us the impact of Westmere-EP’s larger L3 cache.
The impact of the added buffering in Westmere-EP’s memory controller isn’t hard to spot. The Xeon X5670 transfers a couple of gigabytes per second more data than the X5570, given the same core and memory frequencies.
The L5640’s lower performance with the scale and copy patterns initially puzzled us, but we expect this processor’s throughput is limited by its lower (2.13GHz) memory controller clock. The X5570, X5670, and X5680 have a faster 2.66GHz memory controller. The L5520 shares the same 2.13GHz memory controller frequency, and its performance with those patterns is nearly identical to the L5640’s, even though its DIMMs operate at 1066MHz.
Incidentally, we’ve modified our Stream testing method from last time out. We’ve found that we get the best throughput on the Xeons by assigning one thread to each physical core in the system. That’s why our results are slightly better than you may have seen before.
Memory access latencies haven’t changed much from Nehalem to Westmere, despite the growth of the L3 cache from 8MB to 12MB. In fact, we had to move our sample point for this graph to 32MB because the larger cache was masking any latency it adds at 16MB.
We can get a closer look at access latencies throughout the memory hierarchy with the 3D graphs below. I’ve colored the block sizes that correspond to different cache levels, with yellow being L1 data cache and brown representing main memory.
The effect is impossible to see in the charts above, but our utility reports that L3 cache latency has grown from 32 cycles on the X5570 to 39 cycles on the X5670. (The L1 and L2 caches on the two chips have identical latencies.) Thinking in terms of cycles is tricky, though, because the results are reported in core cycles and the L3 cache is clocked independently. In this case, the comparison works because the two CPU models share the same core and uncore frequencies.
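To translate those cycle counts into wall-clock terms, a quick conversion helps. This sketch assumes both chips were running at their shared 2.93GHz base clock during the measurement, which may not hold if Turbo Boost kicked in; the function name is ours.

```python
def cycles_to_ns(cycles, core_ghz):
    """Convert a latency in core clock cycles to nanoseconds at the given core clock."""
    return cycles / core_ghz


BASE_CLOCK_GHZ = 2.93  # base clock shared by the X5570 and X5670

x5570_l3_ns = cycles_to_ns(32, BASE_CLOCK_GHZ)
x5670_l3_ns = cycles_to_ns(39, BASE_CLOCK_GHZ)

print(f"X5570 L3 latency: {x5570_l3_ns:.1f} ns")
print(f"X5670 L3 latency: {x5670_l3_ns:.1f} ns")
print(f"Difference: {x5670_l3_ns - x5570_l3_ns:.1f} ns")
```

The conversion only makes sense as a chip-to-chip comparison when the clocks match, which is exactly the caveat noted above.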
SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.
As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We have used a relatively fast JVM, the 64-bit version of Oracle’s JRockIt JRE, and we’ve tuned each system reasonably well. Still, it was not our intention to match the best published scores, a feat we probably couldn’t accomplish without access to the IBM JVM, which looks to be the fastest option at present. Similarly, although we’ve worked to be compliant with the SPEC run rules for this benchmark, we have not done the necessary work to prepare these results for publication via SPEC, nor do we intend to do so. Thus, these scores should be considered experimental, research-mode results only.
As always, please, no wagering.
We used the following command line options:
Xeons 12 core/24 thread/24GB/6 instances:
start /AFFINITY [F00000, 0F0000, 00F000, 000F00, 0000F0, 00000F] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:4 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
Xeons 12 core/24 thread/12GB/6 instances:
start /AFFINITY [F00000, 0F0000, 00F000, 000F00, 0000F0, 00000F] java -Xms2800m -Xmx2800m -Xns2500m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:4 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
Xeons 8 core/16 thread/24GB/2 instances:
start /AFFINITY [FF00, 00FF] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
Xeons 8 core/16 thread/12GB/2 instances:
start /AFFINITY [FF00, 00FF] java -Xms2800m -Xmx2800m -Xns2500m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
Opterons 12 core/16GB/2 instances:
start /AFFINITY [FC0, 03F] java -Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:6 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
In keeping with the SPECjbb run rules, we tested at up to twice the optimal number of warehouses per system, with the optimal count being the total number of hardware threads.
In all cases, Windows Server’s “lock pages in memory” setting was enabled for the benchmark user. In the Xeon systems’ BIOSes, we disabled the “hardware prefetch” and “adjacent cache line prefetch” options.
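The hexadecimal values passed to /AFFINITY in the command lines above are bit masks, where each set bit selects one logical CPU. As a small sketch of how to read them, the helper below (our own, not part of any benchmark kit) decodes a mask into CPU indices. How those indices map onto physical cores and sockets depends on the platform and OS enumeration, so we make no claims about that here.

```python
def cpus_in_mask(mask_hex):
    """Return the logical CPU indices selected by a hexadecimal affinity mask."""
    mask = int(mask_hex, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


# Masks taken from the command lines above.
six_instance_masks = ["F00000", "0F0000", "00F000", "000F00", "0000F0", "00000F"]
two_instance_masks = ["FF00", "00FF"]
opteron_masks = ["FC0", "03F"]

for label, masks in [
    ("Xeons, 6 instances", six_instance_masks),
    ("Xeons, 2 instances", two_instance_masks),
    ("Opterons, 2 instances", opteron_masks),
]:
    print(label)
    for m in masks:
        print(f"  {m}: logical CPUs {cpus_in_mask(m)}")
```

Each of the six-instance masks pins a JVM to four logical CPUs, which lines up with the -XXgcthreads:4 setting; the two-instance Xeon and Opteron masks cover eight and six logical CPUs apiece, matching their respective GC thread counts.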
The X5670 isn’t quite 50% faster than the Xeon X5570 here, but this is a healthy performance gain within the same infrastructure and power envelope, regardless. The low-power Xeon L5640 posts a similar gain over the L5520, which is sufficient to put the L5640 ahead of the X5570, and that feels like noteworthy progress. The picture will no doubt come into sharper focus when we add the question of power efficiency to the mix.
Like SPECjbb2005, this benchmark is based on multithreaded Java workloads and uses similar tuning parameters, but its workloads are somewhat different. SPECpower is also distinctive in that it measures power use at different load levels, stepping up from active idle to 100% utilization in 10% increments. The benchmark then reports power-performance ratios at each load level.
SPEC’s run rules for this benchmark require the collection of ambient temperature, humidity, and altitude data, as well as power and performance, in order to prevent the gaming of the test. Per SPEC’s recommendations, we used a separate system to act as the data collector. Attached to it were a Digi WatchPort/H temperature and humidity sensor and our Yokogawa WT210 power meter. Although our new power meter might well pass muster with SPEC, what we said about our SPECjbb results being “research mode only” applies here, too.
We used the same basic performance tuning and system setup parameters here that we did with SPECjbb2005.
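For those unfamiliar with how the benchmark arrives at its figures, here is a minimal sketch of the arithmetic. The per-level ratio is simply operations divided by average watts, and the overall score, as we understand SPEC's definition, is the sum of ssj_ops across all target loads divided by the sum of average power across those loads plus active idle. The measurements below are invented for illustration and don't correspond to any of our test systems.

```python
# Hypothetical results, invented for illustration: one (ssj_ops, average watts)
# pair per target load from 100% down to 10%, followed by active idle.
loads = list(range(100, 0, -10))
measurements = [(5000 * load, 100.0 + 1.5 * load) for load in loads]
measurements.append((0, 100.0))  # active idle: no work done, power still drawn


def per_level_ratios(rows):
    """ssj_ops per watt at each load level."""
    return [ops / watts for ops, watts in rows]


def overall_ops_per_watt(rows):
    """Sum of ssj_ops over all levels divided by sum of average power, idle included."""
    total_ops = sum(ops for ops, _ in rows)
    total_watts = sum(watts for _, watts in rows)
    return total_ops / total_watts


ratios = per_level_ratios(measurements)
print(f"Ratio at 100% load: {ratios[0]:.0f} ssj_ops/watt")
print(f"Overall: {overall_ops_per_watt(measurements):.0f} ssj_ops/watt")
```

Because idle power sits in the denominator with no work to offset it, a system that idles efficiently gets a real boost in the overall score.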
SPECpower_ssj results are a little more complicated to interpret than your average benchmark. We’ve plotted the output in several ways in order to help us understand it.
Although the plot above looks like some sort of odd coral formation, this may well be the most intuitive way of presenting these data. Each of the load levels in the benchmark is represented by a point on the plot, and the two axes are straightforward enough. The higher the point is on the plot, the higher the performance. The further to the right it is, the more power was consumed at that load level.
Immediately, we can divine that the Xeon X5680 has the highest overall performance and the highest power consumption. The Xeon X5670 represents a substantial reduction in power draw versus the X5680 with only a minor drop in operations per second. Meanwhile, the Xeon X5570 draws nearly as much power as the X5670 at the upper load levels but doesn’t deliver nearly as much throughput. The Opteron 2435’s power draw is also quite similar, but its performance is lower still.
The Willowbrook system with the low-power Xeons is in a class of its own. Inside that system, the Xeon L5640 achieves roughly 200K more ops per second than the L5520 with only marginally higher power draw. Indeed, the L5640 appears to be the undisputed champ here, peaking higher than the Xeon X5570 while consuming over 100 fewer watts.
We can confirm the L5640’s standing with a look at the performance-to-power ratios and the summarized overall standings.
Yep, the Willowbrook/L5640 combination takes the top spot. Furthermore, the power efficiency progress from Nehalem to Westmere is illustrated vividly in both the 95W and 60W Xeons. The 95W Xeon X5670 even turns out to be more efficient than the 60W Xeon L5520 at all but the highest load levels, giving it a modest lead in the overall score.
We can take another look at power consumption and energy-efficient performance by using a test whose time to completion varies with performance. In this case, we’re using Cinebench, a 3D rendering benchmark based on Maxon’s Cinema 4D rendering engine.
The six-core Xeons dominate the performance results, more or less as expected. We’ll pause to note the architectural efficiency of the current Xeons. Even at a lower clock frequency, the six-core, 2.26GHz Xeon L5640 outperforms the six-core, 2.6GHz Opteron 2435.
Still, single-threaded performance essentially hasn’t advanced from the past generation to this one, as Amdahl’s Law stubbornly refuses to give way to Moore’s. The one exception is the Xeon L5640, whose unusually high Turbo frequency leeway of 533MHz allows it basically to match the Xeon X5670 in the single-threaded test.
As the multithreaded version of this test ran, we measured power draw at the wall socket for each of our test systems across a set time period.
A quick look at the data tells us much of what we need to know. Still, we can quantify these things with more precision. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.
The 5600-series Xeons bring a slight but measurable increase in power draw at idle, but they’re clearly within the same range as their predecessors. The most remarkable numbers here come from the Willowbrook system. In case this hasn’t sunk in yet, with low-power Xeons aboard, it’s idling at around 65W.
Next, we can look at peak power draw by taking an average from the ten-second span from 10 to 20 seconds into our test period, during which the processors were rendering.
Peak power draw is also up somewhat in the 5600-series Xeons, but not enough to create any real concern.
One way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
The Willowbrook system’s minimal power draw at idle makes this one a rout.
We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
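The measurements described in this section boil down to a few operations on a series of power readings. Here's a minimal sketch, assuming one reading per second and a made-up power trace; the sample values, the 30-second render time, and the ten-sample idle window are all hypothetical stand-ins rather than our actual data.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def energy_joules(samples, start=0, end=None, interval_s=1.0):
    """Energy in watt-seconds (joules) over samples[start:end], one reading per interval."""
    return sum(samples[start:end]) * interval_s


# Hypothetical trace: 30 seconds of rendering at 300W, then 30 seconds idling at 100W.
render_seconds = 30
samples = [300.0] * render_seconds + [100.0] * 30

idle_watts = mean(samples[-10:])       # trailing edge of the test period
peak_watts = mean(samples[10:20])      # span from 10 to 20 seconds into the test
total_energy = energy_joules(samples)  # render plus idle time
render_energy = energy_joules(samples, end=render_seconds)  # render period only

print(f"Idle: {idle_watts:.0f}W, peak: {peak_watts:.0f}W")
print(f"Total energy: {total_energy:.0f} J, render energy: {render_energy:.0f} J")
```

A faster system shortens the render period, so its render energy can come out lower even if its peak draw is higher.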
Once again, the Westmere Xeons are measurably more efficient than the prior generation of Xeons (and Opterons, for what it’s worth). Even the X5680 looks pretty good here, aided by the fact that it finishes rendering, and thus ends our measurement, in such short order.
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:
In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.
MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
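The job-queue scheme described above can be sketched roughly as follows. This is not MyriMatch's actual code: the function names and the stand-in `handle` callback are hypothetical. Python's global interpreter lock also means the sketch shows the scheduling pattern rather than a real speedup.

```python
import queue
import threading


def split_into_jobs(items, num_threads, jobs_per_thread=10):
    """Split the database into roughly num_threads * jobs_per_thread jobs."""
    num_jobs = num_threads * jobs_per_thread
    size = -(-len(items) // num_jobs)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_jobs(jobs, num_threads, handle):
    """Each worker pulls a job, handles every item in it, then takes another."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread is done
            local = [handle(item) for item in job]
            # Synchronize once per job rather than once per protein.
            with results_lock:
                results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in for the 6714-protein yeast database, using four threads.
proteins = list(range(6714))
jobs = split_into_jobs(proteins, num_threads=4)
print(len(jobs), len(jobs[0]))  # 40 jobs; the first holds 168 proteins
```

Making the jobs smaller than one per thread is the point of the multiplier: a thread that draws easy proteins simply picks up more jobs, so no core sits idle waiting on the slowest one, while locking once per job keeps the synchronization cost low.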
The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used.
I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:
Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.
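To make the first of those points concrete, here is a minimal sketch of a per-spectrum mutex, where each spectrum carries its own lock so that threads only contend when they hit the same spectrum at the same moment. The `Spectrum` class, its fields, and `record_match` are assumptions made for illustration and are not drawn from MyriMatch's source.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Spectrum:
    """Hypothetical record of the best peptide match seen for one spectrum."""
    best_score: float = float("-inf")
    best_peptide: str = ""
    # One lock per spectrum, not one global lock for the whole collection.
    lock: threading.Lock = field(default_factory=threading.Lock)


def record_match(spectrum, peptide, score):
    """Update the spectrum's best match; safe to call from many threads."""
    with spectrum.lock:
        if score > spectrum.best_score:
            spectrum.best_score = score
            spectrum.best_peptide = peptide


# Fifty threads all try to update the same spectrum at once.
shared = Spectrum()
threads = [
    threading.Thread(target=record_match, args=(shared, f"PEP{i}", float(i)))
    for i in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared.best_peptide, shared.best_score)
```

With a lock on every spectrum, two threads working on different spectra never wait on each other. The cost is paid only in the less common case David describes, when several threads land on the same spectrum.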
Here’s how the processors performed.
In this ostensibly memory-bound test, the X5670 shaves 11 seconds, or nearly 30%, off of the shortest execution time posted by the processor it succeeds.
STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results.
In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:
The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.
So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.
As you’ll note, we’re seeing some pretty broad variance in the results of this test at lower thread counts, which suggests it may be stumbling over these systems’ non-uniform memory architectures. In an attempt to circumvent that problem, I decided to try running two instances of this benchmark concurrently, with each one affinitized to a socket, and adding the results into an aggregate compute rate. Doing so offers a nice performance boost.
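The bookkeeping behind that two-instance approach might look something like the sketch below. It assumes logical CPUs are numbered contiguously within each socket, which is not guaranteed on real systems, and it uses `os.sched_setaffinity`, which exists only on Linux. The helper names and the example rates are hypothetical, not our actual test harness or results.

```python
import os


def socket_cpu_sets(num_sockets, cpus_per_socket):
    """Assumed layout: socket s owns a contiguous block of logical CPU ids."""
    return [
        set(range(s * cpus_per_socket, (s + 1) * cpus_per_socket))
        for s in range(num_sockets)
    ]


def pin_current_process(cpus):
    """Restrict this process to the given CPUs where the OS supports it."""
    if hasattr(os, "sched_setaffinity"):  # Linux only
        os.sched_setaffinity(0, cpus)     # pid 0 means the calling process
        return True
    return False


def aggregate_rate(rates_hz):
    """Sum the CFD cycle frequencies reported by the concurrent instances."""
    return sum(rates_hz)


# Two sockets, each with six cores and two hardware threads per core.
for socket_id, cpus in enumerate(socket_cpu_sets(2, 12)):
    print(socket_id, sorted(cpus))

# Made-up per-instance scores, purely to show the arithmetic.
print(aggregate_rate([1.5, 2.0]))
```

Each benchmark instance would call something like `pin_current_process` with its own socket's CPU set before starting work, which keeps its threads close to the memory attached to that socket.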
The Xeon X5670 betters the X5570’s simulation rate by about 25%, again in a workload where memory bandwidth has traditionally been a constraint.
POV-Ray ray tracing
We’ve been using POV-Ray as a benchmark for a very, very long time, and we’re not going to stop now. The latest version is multithreaded and makes good use of all of the cores and hardware threads available to it.
As we saw in Cinebench, highly parallel, compute-intensive graphics workloads lend themselves well to increased core counts, so the new Xeons fulfill much of their potential.
x264 HD video encoding
This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark.
I have to say, I sure hope they get a few Westmere Xeons inside the YouTube data center. If you’ve ever uploaded an HD video there, you’ll know what I mean.
7-Zip file compression
This final entry in our test suite more or less confirms what we already know about the Westmere Xeons. They’re quite adept at file compression, as well as many other things.
We had hoped to expand our test suite to include a nifty new virtualization and web service benchmark, but it turns out that generating SPECjbb and SPECpower_ssj performance results for seven different CPU models across five different test systems takes quite a bit of time. We also tried to squeeze in some tests of hardware-accelerated encryption using the new version of TrueCrypt, but the software doesn’t yet recognize Xeon 5600-series processors. With luck, we can revisit both of these potential new benchmarks before too long.
Regardless, our modest set of tests has given us a practically unanimous verdict on the Westmere Xeons. For multithreaded workloads that aren’t gated primarily by I/O throughput, the addition of two cores at the same clock frequencies and power envelopes, along with a proportional increase in cache, is a clear performance win. Even workloads that rely on streaming quite a bit of data from memory, as our two HPC-oriented benchmarks do, may benefit from the move to Westmere thanks to its larger cache and aggressive data pre-fetching algorithms. That speaks to the scalability of this still relatively new system architecture, among other things.
Because Intel has largely held the line on power consumption, these new Xeons bring a worthy increase in power-efficient performance, as well. Inside the same system, whether it be our mid-range Asus 1U box or our low-power Willowbrook server, the move from Nehalem to Westmere produces dramatic progress in measured SPECpower_ssj power-performance ratios. We see gains on the same order in our home-brewed measurement of the energy required to render a scene in Cinebench, too.
Of course, some workloads simply won’t benefit from these new Xeons, because they’re generally no faster in single-threaded tasks than the Nehalems that preceded them, and because not all server-class workloads these days are meaningfully compute-bound. Those are two very different sides of the same coin, but this is where progress to date has taken us.
In our view, the combination of the Xeon L5640 processors and the Willowbrook server may be the finest 2P server platform Intel, or anyone else, has produced to date. This setup’s all-around performance is superior to the flagship 95W Nehalem, the Xeon X5570, which it shadows in single-threaded tests thanks to the L5640’s generous Turbo Boost peak and conquers in multi-threaded workloads due to 50% more cores and cache. Yet the L5640’s TDP is 35W below the X5570’s, and its real-world power draw aboard the Willowbrook motherboard is marvelously minuscule.
The question now is whether AMD’s response has any teeth. Perhaps we’ll find out soon.