Intel’s Xeon X7560 and Dell’s R810 server

Sometimes, we bite off more than we can chew. That was certainly the case with this review. We’ve long pushed, prodded, argued, and advocated for the folks at Intel and AMD to work with us on reviews of their server CPUs. That’s generally gone well in the past few years, happily, but we have also been wary of expanding our mission beyond our means. That has meant, for instance, declining opportunities to review 4P systems. Large, expensive servers are interesting, but testing them properly requires time, the right hardware, and a fairly select set of very expensive applications, many of which require massive, proprietary data sets. Reaching up into that segment of the market is no trivial undertaking.

Our instincts were confounded, however, when Intel and Dell dangled a new class of server in front of us, a sort of intermediate step between the traditional, low-cost 2P box and a much beefier, vastly more expensive 4P system. Testing it properly would be a bit of a challenge, sure, and it was sort of an expansion of our mission. But wow, that was some really cool hardware, and AMD seemed to have something similar in the works. Besides, we had some interesting ideas about testing. The challenge would be intriguing, if nothing else.

Thus we found ourselves taking delivery of a Dell R810 server, a sleek, 2U box packed with dual octal-core Nehalem-EX processors, twin 1100W power supplies, quad SAS 6 Gbps hard drives, and a heart-stopping 128GB of RAM.

That was about a year ago, and the months that followed gave us an unprecedented bounty of new GPU and CPU architectures and products based on them—in other words, lots of things to review. We had more to review than we could handle, and in this R810 server, we had perhaps more computer than we could handle properly, too. Shamefully, the R810 went on the backburner time and again as other obligations intervened.

Fortunately, we’ve finally managed to complete our testing, and we’re right in time for Intel’s announcement of—ack!—a drop-in replacement for the Nehalem-EX processor known as Westmere-EX. Rather than completely despairing, we’ve decided to move ahead with our initial look at the R810 and Nehalem-EX. If there’s sufficient interest, after that, we’ll see about upgrading to the new processors and taking them for a spin, as well. Much of the ground we’ll cover today is foundational for servers based on either CPU, since they share the same system architecture.

Nehalem-EX: The Ocho

The details of Nehalem-EX silicon may be familiar by now to many interested parties, but we’ll recap briefly because they are complex and impressive enough to warrant further attention. As the name implies, the Nehalem-EX processor is based on the same basic CPU microarchitecture and 45-nm manufacturing process as its smaller siblings that share the Nehalem name. The difference with the EX variant has to do with scale, both in terms of the processor silicon—the thing encompasses 2.3 billion transistors—and the system architecture that supports it.

The Nehalem-EX die. Source: Intel.

Logical diagram of a Nehalem-EX processor. Source: Intel.

Crammed into the EX are fully eight CPU cores and 24MB of L3 cache—enough elements that the processor’s architects decided the simpler internal communications arrangement in quad-core Nehalems wouldn’t suffice. Instead, they gave the EX an internal ring bus, a high-speed, bidirectional communication corridor with stops for each key component of the chip. This ring is a precursor, incidentally, for the one Intel architects built into the newer Sandy Bridge architecture to accommodate multiple cores alongside an integrated GPU.

Like all Nehalem chips, the EX has an integrated memory controller. In fact, the EX really has a pair of memory controllers, although the arrangements are rather different than in lower-end 2P Xeons. The EX series is designed to scale to four or more sockets with very large memory capacities, and the sheer number of traces running out of each socket may impede that mission. Intel’s system architects have worked around that problem by using external Scalable Memory Buffer (SMB) chips to talk to the memory modules.

Between the EX socket and each SMB is a narrow, high-speed link known as a serial memory interconnect, or SMI. The SMI and SMB allow for higher memory capacities, at the expense of higher access latencies. In fact, this whole arrangement is based closely on the FB-DIMM technology used in older Xeons, which was somewhat infamous for the performance-versus-capacity tradeoff it required. One difference here is that the SMB chips are built into the system and mounted on the motherboard, so EX systems can use regular DDR3 RDIMMs. Another difference, obviously, is the elimination of the front-side bus and its potential to act as a bottleneck at high load levels. Intel claims the EX has a lower, flatter memory access latency profile than the prior-generation Xeon X7400 series.

The Nehalem-EX has two SMI channels per memory controller, and each channel talks to an SMB chip. In turn, each SMB communicates with two channels of DDR3 SDRAM running at a peak rate of 1066 MT/s. Each memory channel can support a pair of registered DIMMs.

Logical diagram of a Nehalem EX system. Source: Intel.

Multiply all of those things out across four sockets, and the numbers get to be formidable. A single Nehalem-EX socket can support up to 16 DIMMs. Just four channels of DDR3-1066 memory per socket could, in theory, yield up to 34 GB/s of memory bandwidth, although some complicating factors like SMI overhead have led Intel to claim a peak memory bandwidth per socket of 25 GB/s. (Real-world throughput will vary depending on the mix of reads and writes used.) Still, that’s potentially 100 GB/s of memory bandwidth in a 4P configuration.
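If you care to check the math, those numbers multiply out as advertised. Here's a quick Python sketch of the arithmetic; the 25 GB/s figure is Intel's claim, not something the sums below derive.

# Quick sanity check of the Nehalem-EX memory math described above.
controllers = 2            # memory controllers per socket
smi_per_controller = 2     # SMI links per controller
ddr3_per_smb = 2           # DDR3 channels behind each SMB
dimms_per_channel = 2      # registered DIMMs per channel

print(controllers * smi_per_controller * ddr3_per_smb * dimms_per_channel)  # 16 DIMMs per socket

channels, rate_mt_s, bytes_per_transfer = 4, 1066, 8
print(channels * rate_mt_s * bytes_per_transfer / 1000)   # ~34.1 GB/s theoretical per socket
print(25 * 4)                                             # 100 GB/s claimed peak across a 4P box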

Like its lesser Nehalem brethren, the EX uses Intel’s point-to-point QuickPath Interconnect for communication between the sockets. Each CPU has four QPI link controllers onboard, making possible fully connected 4P configurations like the one depicted in the diagram above. Glueless 8P configurations are also possible, as are higher socket counts with the aid of third-party node controller chips.

The I/O hub shown above is a chip code-named Boxboro, and it’s basically a giant PCI Express switch, with 36 lanes of second-generation PCIe connectivity. These lanes can be configured in various ways: four PCIe x8 links plus an x4, nine x4 connections, or dual x16s alongside two x2 links, for instance. If that’s not enough I/O bandwidth, a 2P config may have dual IOH chips, while a 4P may have as many as three. An eight-way, quad-IOH layout could have up to 144 lanes of PCIe Gen2 bandwidth—again, staggering scale. Since the Boxboro IOH is largely just for PCI Express, it connects to Intel’s tried-and-true ICH10 chip, which provides the rest of the system’s conventional I/O needs, including some first-generation PCIe lanes.
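All of those slicing options add up to the same 36-lane budget, and the quad-IOH total follows directly. A trivial sketch, just to show the arithmetic:

# Tallying the Boxboro IOH lane configurations mentioned above.
configs = {
    "four x8 plus one x4": 4 * 8 + 1 * 4,
    "nine x4":             9 * 4,
    "two x16 plus two x2": 2 * 16 + 2 * 2,
}
for name, lanes in configs.items():
    print(name, "=", lanes, "lanes")          # each totals the IOH's 36 PCIe Gen2 lanes

print("quad-IOH, 8-way box:", 4 * 36, "lanes")   # 144 lanes of PCIe Gen2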

Not only does the EX platform exist on a much larger scale than other Xeons, but it also includes some RAS (reliability, availability, and serviceability) features traditionally found only in mainframes, high-end RISC systems, and Intel’s other offering in this segment, Itanium. These capabilities extend well beyond the traditional error recovery mechanism built into ECC DRAM. The EX’s recoverable machine check architecture (MCA) allows for on-the-fly recoveries from events that would be catastrophic in another class of hardware.

For example, in the event of a DIMM failure, the system could take the failed module out of use while the firmware and OS would work together to recover or restart any affected processes, without bringing the system down. Eventually, a tech could perform a hot-swap replacement of the failed and isolated module—all while the system keeps running. (That last bit sounds rather terrifying to me. I’d much rather shut down the affected system and do the DIMM swap during a maintenance window, but perhaps I’m just too timid.)

By creating a new class of 2P server based on Nehalem-EX, Intel and its partners are bringing these RAS features to a new price point, along with higher memory capacities.

Speaking of prices, don’t get your hopes up for an especially cheap date. The fastest Nehalem-EX processor is the Xeon X7560, which is the one we’ve tested inside the Dell R810. The X7560 has eight cores, 16 threads (via Hyper-Threading/SMT), and a default clock speed of 2.26GHz. If there’s headroom left within its 130W thermal envelope, Intel’s Turbo Boost feature will allow the X7560’s clock frequency to range up to 2.66GHz. A single Xeon X7560 will currently set you back $3,692. In the context of the total system price, that’s practically a steal.

Dell’s R810: EX in the flesh

The Dell R810 server we have here for review is a very nice example of this new class of affordable-ish Nehalem-EX-based systems. Dell employs a whole host of Transformers-inspired space-saving tricks in order to fit four CPU sockets, 32 DIMM slots, and a gaggle of expansion slots into a sleek box only two rack units high. I’ll confess upfront that I don’t have a tremendous amount of experience with servers engineered to the hilt like this one; in my sysadmin days, my preference was for white-box 1U systems, cheap and easily replaceable. The R810 is neither of those things, but it is very slickly produced, with better integration than a diversity-training workshop at Harvard.

Up front, the R810 sports six hot-swappable 2.5″ drive bays and a DVD-ROM drive for OS installations. Our review unit came equipped with five Seagate Savvio 15K.2 SAS 6Gbps hard drives, each of them 146GB.

Above the DVD drive is a pair of USB ports, a VGA output for console use, and a small LCD screen that displays hardware-level status and error messages.

Slide back the lid on the R810 to expose its guts, including a large, black, plastic cooling shroud stretching between the front drive bays and the CPU heatsinks. To the left of the 2.5″ drive bays is an interesting detail. Let’s zoom in.

Yep, that’s a pair of SD card slots. One may install a hypervisor and simply boot from an SD card, with no need for additional local storage. Networked storage can do the real heavy lifting from there. That second SD card provides a measure of redundancy, so the loss of a tiny SD card won’t bring the whole server to its knees.

The niftiest trick in the R810’s quiver is undoubtedly the sliding storage shelf, which moves forward in order to expose the DIMMs beneath. This arrangement makes an in-rack DIMM replacement a relatively simple matter and makes this 2U enclosure feel decidedly less cramped.

And what a lot of DIMMs there are. 32 in all, arranged in eight banks of four. If you look closely, you may count eight black heatsinks distributed around the DIMM slots. Beneath those are the memory buffer chips, eight in all, that provide the glue for the EX’s memory subsystem. You’re looking at 128GB of DRAM in the pictures above, but that’s just the tip of the iceberg. With 16GB DIMMs, the R810 can support up to half a terabyte of RAM in those slots.

Yes, you read that right.

With all four CPU sockets occupied, the R810 only uses a single memory controller (and two SMI links) per processor, while the second one sits idle. Dell and Intel have yet another very nifty trick up their sleeves, though.

The EX’s FCLGA1567 socket

The underside of Dell’s FlexMemBridge insert

Our test system is a 2P configuration, with the middle two sockets occupied by Xeon X7560 processors. The outer two sockets have a simple insert installed in them—Dell calls it a FlexMemBridge—that provides electrical connections. With the FlexMemBridge installed, the CPU in the adjacent, occupied socket can access memory associated with the unoccupied socket. Thus, a 2P configuration can take advantage of all 32 of the R810’s DIMM slots. Quite the trick, no?

The back third or so of the R810’s enclosure is dedicated to power supplies and expansion slots. The slot array includes a trio of riser-based, full-height PCIe slots, each with a physical x16 layout and eight connected lanes. There are two half-height x4 slots directly on the board and another full-height x4 slot on a riser. Finally, there’s a single internal x4 slot dedicated to storage; ours is populated with a Dell SAS RAID card with 512MB of cache RAM.

Also poking out the back of the R810 is a quartet of Gigabit Ethernet ports, courtesy of dual Broadcom controller chips.

Yes, those dual power supplies are redundant and hot-swappable, and they’re each rated at 1100W.

Again, all of these things add up to a tremendous amount of capability in a relatively compact 2U space. I hope this tour of the R810’s guts has given you a sense of the rather impressive packaging involved. That impression would be furthered if you could work with the system. Key components are designed to slide out, swing up, or otherwise move out of the way to grant access to other components. Each such mechanism is secured by a tab, easily released by the press of a finger or thumb, and ready to snap back into place when you’re finished.

The R810 also offers shockingly good acoustics for what it is. You wouldn’t want to leave it running beside your desk for terribly long, but it doesn’t emit the shrieking turbine noise that most 1U boxes do, either. Decent fan speed control ensures that the R810 isn’t louder than necessary when loads are low, as well.

This is, after all, a rather expensive system—and it’s built like it.

Test notes

All of our test systems benefited greatly in terms of power consumption and performance from the addition of solid-state drives for fast, local storage.

The folks at OCZ helped equip our test systems with enterprise-class Vertex EX SSDs. The single-level-cell flash memory in these drives can endure more write-erase cycles than the multi-level-cell flash used in consumer drives, so it’s better suited for server applications. SLC memory writes data substantially faster than MLC flash, as well. The only catch is that SLC flash is quite a bit pricier, as are the drives based on it. For the right application, though, a drive like the Vertex EX can be very much worth it. Heck, we even noticed the effects of these drives during our test sessions. Boot times were ridiculously low for all of the systems, and program start-up times were practically instantaneous.

We’ve also beefed up our lab equipment by stepping up to a Yokogawa WT210 power meter. The Extech unit we used in the past would occasionally return an obviously erroneous value, and for that reason, the Extech hasn’t been sanctioned for use with SPECpower_ssj when the results are to be published via SPEC. The WT210 is a much more accurate meter that meets with SPEC’s approval and integrates seamlessly with the SPECpower_ssj power measurement components.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. We typically run each test three times and report the median result. In the case of the SPEC benchmarks, though, we’ve reported the results from the single best run achieved.

Our test systems were configured like so:

Processor: Opteron 2425 HE 2.1GHz | Opteron 2435 2.6GHz | Xeon L5520 2.26GHz | Xeon X5570 2.93GHz | Xeon X5670 2.93GHz | Xeon X5680 3.33GHz | Xeon X7560 2.27GHz | Xeon L5640 2.26GHz
Motherboard: SuperMicro H8DMU+ | Tyan S8212 | Intel S5500WB | Asus Z8PS-D12-1U | SuperMicro X8DA3 | Dell 05W7DG
North bridge: Nvidia nForce Pro 3600 | AMD SR5690 | Intel 5500 | Intel 5520 | Intel 5520 | Intel 7500
South bridge: Nvidia nForce Pro 3600 | SP5100 | ICH10R | ICH10R | ICH10R | ICH10R
Memory size: 16GB (4 DIMMs) | 16GB (8 DIMMs) | 12GB (6 DIMMs) | 24GB (6 DIMMs) | 24GB (6 DIMMs) | 128GB (32 DIMMs)
Memory type: Kingston PC2-6400 registered ECC DDR2 SDRAM | Avant Technology PC2-6400 registered ECC DDR2 SDRAM | Samsung PC3L-10600R registered ECC DDR3 SDRAM | Samsung PC3-10600R registered ECC DDR3 SDRAM | Samsung PC3-10700 registered ECC DDR3 SDRAM | PC3-8500 registered ECC DDR3 SDRAM
Memory speed: 800 MT/s | 800 MT/s | 1066 MT/s | 1333 MT/s | 1333 MT/s | 1066 MT/s | 1333 MT/s
Memory timings: 6-6-6-18 1T | 6-5-5-18 1T | 7-7-7-20 1T | 9-9-9-24 1T | 9-9-9-24 1T | 9-9-9-24 1T
Chipset drivers: 9.28 | INF update 9.1.1.1025, Rapid Storage Technology 9.6 | INF update 9.1.1.1025, Rapid Storage Technology 9.6 | INF update 9.1.1.1025, Rapid Storage Technology 9.6 | INF update 9.1.1.1025, Rapid Storage Technology 9.6
Graphics: Integrated ATI ES1000 | Integrated ASPEED | Integrated Matrox G200e | Integrated ASPEED | Nvidia GeForce 8400 GS | Matrox G200e
Power supply: Cold Watt CWA2-650-10-SM01-1 650W | Ablecom PWS-702A-1R 700W | Delta Electronics DPS650SB 650W | Delta Electronics DPS770BB 770W | Ablecom PWS-702A-1R 700W | Dell L1100A-S0 1100W
Hard drive: OCZ Vertex EX 64GB SSD with firmware rev. 1.5 (all systems)
OS: Windows Server 2008 R2 Enterprise x64 (all systems)

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

As you can see, we’ve pitted the R810 and Xeon X7560 against a range of lower-end 2P systems from our prior server reviews. Since this is a new class of product in many ways, and since we don’t dabble in 4P systems, these comparisons will have to suffice, even though the R810 doesn’t compete directly with these less expensive, less scalable systems. That’s not to say the R810 won’t have its hands full with the Westmere-EP based 5600-series Xeons. With six cores per socket and higher frequencies, those Xeons are quite potent masters of the traditional 2P space.

Also conspicuous by its absence is AMD’s Opteron 6100 series, whose promise of relatively inexpensive 4P configurations arguably provided the impetus for the creation of 2P EX systems. These Opterons are next on our slate, so please bear with us.

This test measures cache bandwidth in parallel, so all of the caches on the available cores should be tested. Even so, the L2 caches associated with the Xeon X7560 system’s 16 cores at 2.26GHz can’t quite keep pace with the L2 caches of the 12 cores at 3.33GHz in the Xeon X5680 box, as is evident at the intermediate block sizes. The most dramatic gap between the systems comes at the larger block sizes, though, and here the X7560 breaks from the pack. The Nehalem-EX processors’ large 24MB L3 caches give them a decided bandwidth advantage at the 4MB, 16MB, and even 64MB block sizes. For the right mix of applications, or single applications with very large working data sets, those enormous caches could prove very helpful.

This is a disappointing, practically scandalous result, and I’m not entirely sure what to make of it. I should start by saying I believe it is legitimate and correct, that our testing methods were sound and appropriate. Stream allows one to tune its operation reasonably well, and we tailored it to fit with the threading and socket config of our Dell R810 server. We’ve found that Stream works best with Hyper-Threaded CPUs if one assigns a single thread to each physical CPU core. Doing so allowed us to achieve nice results on the Nehalem-EP and Westmere-EP Xeons, as is evident. We used a similar, expanded thread assignment for the Xeon X7560, and it produced the best results of any config we tried, with the appropriate thread utilization showing in the Task Manager.

The likely culprit here is the EX’s use of memory buffer chips connected via a serialized link. The very similar FB-DIMM technology also underachieved in measured bandwidth in the Xeon 5400 series. We’d know more if we’d been able to measure memory access latencies, as well, but our usual tool for doing that wasn’t built to cope with 24MB L3 caches. We did, however, ping Dell and Intel, and these results weren’t outside the range of their expectations for this system.

SPECjbb2005

SPECjbb 2005 simulates the role a server would play executing the “business logic” in the middle of a three-tier system with clients at the front-end and a database server at the back-end. The logic executed by the test is written in Java and runs in a JVM. This benchmark tests scaling with one to many threads, although its main score is largely a measure of peak throughput.

As you may know, system vendors spend tremendous effort attempting to achieve peak scores in benchmarks like this one, which they then publish via SPEC. We have used a relatively fast JVM, the 64-bit version of Oracle’s JRockit JRE, and we’ve tuned each system reasonably well. Still, it was not our intention to match the best published scores, a feat we probably couldn’t accomplish without access to the IBM JVM, which looks to be the fastest option at present. Similarly, although we’ve worked to be compliant with the SPEC run rules for this benchmark, we have not done the necessary work to prepare these results for publication via SPEC, nor do we intend to do so. Thus, these scores should be considered experimental, research-mode results only.

We’ve documented the command-line options used for most of the test systems in our Xeon 5600 review. For the Dell R810, we used the following command line options:

Xeon X7560, 16 cores/32 threads/128GB, 8 instances:

start /AFFINITY [F0000000, 0F000000, 00F00000, 000F0000, 0000F000, 00000F00, 000000F0, 0000000F] JAVAOPTIONS=-Xms3900m -Xmx3900m -Xns3260m -XXaggressive -Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 -XXcallprofiling -XXtlasize:min=4k,preferred=1024k
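Each of the eight JVM instances got one of the bracketed affinity masks, which carves the 32 logical processors into eight four-thread slices. The snippet below is a hypothetical illustration of how those start commands expand; it is not the batch file we actually used, and the trailing SPECjbb class and property-file arguments are approximations.

# Hypothetical expansion of the affinity list above into eight start commands,
# one JVM instance per four-thread slice of the 32 logical processors.
java_opts = ("-Xms3900m -Xmx3900m -Xns3260m -XXaggressive "
             "-Xlargepages:exitOnFailure=true -Xgc:genpar -XXgcthreads:8 "
             "-XXcallprofiling -XXtlasize:min=4k,preferred=1024k")

masks = ["F0000000", "0F000000", "00F00000", "000F0000",
         "0000F000", "00000F00", "000000F0", "0000000F"]

for instance, mask in enumerate(masks, start=1):
    print(f"start /AFFINITY {mask} java {java_opts} "
          f"spec.jbb.JBBmain -propfile SPECjbb.props -id {instance}")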

In keeping with the SPECjbb run rules, we tested at up to twice the optimal number of warehouses per system, with the optimal count being the total number of hardware threads.

In all cases, Windows Server’s “lock pages in memory” setting was enabled for the benchmark user. In the Xeon systems’ BIOSes, we disabled the “hardware prefetch” and “adjacent cache line prefetch” options.

Our 2P Xeon X7560 system pretty much outclasses the lower-priced options in SPECjbb, with substantially higher throughput in a single box than anything else we’ve tested. The X7560’s performance peaks at eight instances with four warehouses each, or 32 threads, as one might expect. Unlike the Westmere Xeons, though, the X7560 server’s performance doesn’t drop substantially after moving past that point.

SPECpower_ssj2008

Like SPECjbb2005, this benchmark is based on multithreaded Java workloads and uses similar tuning parameters, but its workloads are somewhat different. SPECpower is also distinctive in that it measures power use at different load levels, stepping up from active idle to 100% utilization in 10% increments. The benchmark then reports power-performance ratios at each load level.
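The benchmark’s headline figure, overall ssj_ops/watt, is, as we understand it, the sum of throughput across the ten target load levels divided by the sum of average power at every level, active idle included. Here’s a toy Python sketch of that bookkeeping, with invented numbers standing in for real measurements:

# Toy illustration of SPECpower_ssj2008's overall metric. All figures below
# are invented placeholders; the real harness handles calibration and the
# measurement intervals itself.
loads = [i / 10 for i in range(10, 0, -1)]              # 100% down to 10% target load

ssj_ops = {load: load * 1_000_000 for load in loads}    # fake throughput per level
power_w = {load: 200 + 300 * load for load in loads}    # fake average power per level
power_w["active idle"] = 190                            # idle power counts, too

overall = sum(ssj_ops.values()) / sum(power_w.values())
print(f"overall ssj_ops/watt: {overall:.0f}")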

SPEC’s run rules for this benchmark require the collection of ambient temperature, humidity, and altitude data, as well as power and performance, in order to prevent the gaming of the test. Per SPEC’s recommendations, we used a separate system to act as the data collector. Attached to it were a Digi WatchPort/H temperature and humidity sensor and our Yokogawa WT210 power meter. Although our new power meter might well pass muster with SPEC, what we said about our SPECjbb results being “research mode only” applies here, too.

We used the same basic performance tuning and system setup parameters here that we did with SPECjbb2005, although we had to use a smaller heap in some cases.

SPECpower_ssj results are a little more complicated to interpret than your average benchmark. We’ve plotted the output in several ways in order to help us understand it.

This fact may be obvious from our results, but let’s grant up front that our Dell R810 system with dual X7560s isn’t exactly the sort of system one would tend to find in the SPEC submissions for this benchmark. As our tour of the R810 demonstrated, this system is intended to cram a considerable amount of computing power into a single box. With 32 DIMMs, dual octal-core processors, and an 1100W power supply, the R810 isn’t exactly going to sip power at idle. Meanwhile, our two Xeon L-series configurations are absolute killers, relatively lightweight systems expressly designed for power efficiency. Aside from those, the rest of the systems we’ve tested are largely mainstream 2P server setups, still less capable than the R810 by a fair amount. In fact, the R810 draws as much power at idle as some of the mainstream systems do at peak. The R810 is quite a bit faster, though.

Our Xeon X7560 box doesn’t look so far from the rest of the pack when we consider the power-performance ratio at various load levels. In fact, it nearly matches the older Opterons that we tested, despite drawing nearly twice the power. Stepping up to a bigger box like the R810 will cost you some power efficiency compared to a single, cheaper 2P system. However, if the larger memory config and better scalability enable the consolidation of just two systems like our Opteron 2435 test rig, you could end up coming out ahead overall.

An experiment

We knew going into this project that testing Nehalem-EX with our usual suite of benchmarks wouldn’t suffice, but we had an idea for a test that might do a better job of pushing the limits of a system like the R810.

We’d long been looking for a test that involved virtualization performance in some way. Trouble is, most of the formal virtualization benchmarks we’ve seen have very steep requirements for a valid run, including large numbers of network clients, making them impractical for our use. Fortunately, we discovered that our friend Paul Venezia at InfoWorld had been working on a promising test of his own. After a quick conversation in which we offered a couple of suggestions, Paul went home and produced a very slick working benchmark setup, which he generously shared with us.

The basics are straightforward. The test setup, packaged as an OVF template, includes images for a number of virtual machines. Each VM hosts a portion of a fairly robust LAMP web hosting setup: up to four web servers, one or two database servers, and a load balancer. In order to keep things simple and bypass any network bottlenecks, the client also runs on a local VM. The core of the test is based on ApacheBench, and the performance outcomes are delivered as ApacheBench results. Because the VMs are packaged into a single template, the benchmark can be deployed easily and repeatably on any system running a VMware hypervisor. (In our case, we used VMware ESX 4.1.)

We found out during testing that our nefarious plan to use enterprise-class SLC solid-state disks to provide our local storage for this test wasn’t going to fly. Our OCZ Vertex drives seemed ideal for a high-IOps scenario like this one, but the drives are only 60GB in size, so even a dual-drive RAID 0 was too small to house all eight of the VMs that comprise the test. In order to make this work, we needed similar performance but substantially more capacity.

Fortunately, the folks at Corsair offered us a solution in the form of a couple of Force-series F240 SSDs. Although these drives use slower MLC-style NAND flash, their SandForce SF-1200 controllers have proven capable of delivering exceptionally high IOps rates, making them a good fit here. Just a single drive in each of our test systems offered enough capacity and performance for our purposes.

With that issue settled, we proceeded to tune the benchmark for our two test systems, the R810 and our dual Westmere-EP box. The benchmark can be configured in various ways, and finding the right mix isn’t easy. Eventually, we decided on using the full complement of four web servers and two database servers, with a ratio of static to dynamic web requests of 994. Both systems appeared to deliver their highest peak throughput at around 200 concurrent requests, so that became our standard. We found that longer tests tended to produce higher average response rates, so we decided to use 4 million requests in each test run. Tuning the knobs and dials in this way produced appropriately high CPU utilization across all of the VMs, with the exception that the client and load-balancer machines usually weren’t fully taxed.
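At its core, each run boils down to an ApacheBench pass at those settings. The snippet below is a hypothetical stand-in rather than Paul’s actual harness, and the target URL is a placeholder, but it shows the shape of a single measurement:

# Hypothetical stand-in for one leg of the test: 4 million requests at a
# concurrency of 200 through ApacheBench, scraping the headline request rate.
import subprocess

result = subprocess.run(
    ["ab", "-n", "4000000", "-c", "200", "http://load-balancer.example/"],
    capture_output=True, text=True, check=True)

for line in result.stdout.splitlines():
    if line.startswith("Requests per second"):
        print(line)     # e.g. "Requests per second:    1234.56 [#/sec] (mean)"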

Here’s how the results came out.

The dual Westmere-EP system serviced more requests per second, on average, than the Xeon X7560 system. That’s an unfortunate result for the Nehalem-EX, no doubt. However, the more detailed results reveal a different aspect of the story.

The Xeon X5670 services two-thirds of the requests substantially more quickly, on average, than the X7560. We’d attribute that outcome to a number of factors, including the X5670’s higher clock frequencies, higher measured memory bandwidth, and what we suspect are substantially lower memory access latencies, since there’s no memory buffer chip in the mix. However, for the final 20% of the requests, the X5670’s response times are much higher than the X7560’s; the EX box’s higher core count and larger L3 caches presumably grant it the advantage there.

Depending on the sort of performance characteristics you value, the EX’s showing may be the more impressive one. Avoiding those longer response times may be more desirable than simply serving more requests per second.

Then again, we expressly tuned the benchmark config to produce the best request rate averages. It’s possible a different set of parameters might yield better response times overall on the Xeon X5670. We may have to experiment further in the future. Still, we’re pleased to see that this new addition to our test suite offers us a different sort of insight into the performance of these two systems, at least giving us a hint of the Nehalem-EX’s scalability advantage when running multiple VMs.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
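That scheme is a textbook work-queue arrangement. Here’s a minimal Python sketch of the idea, with a stand-in protein list in place of MyriMatch’s real data structures:

# Minimal sketch of the job-queue scheme described above: split the database
# into (threads x 10) jobs and let worker threads pull jobs as they finish.
import queue
import threading

THREADS = 4
proteins = [f"protein_{i}" for i in range(6714)]    # stand-in for the yeast database

jobs = queue.Queue()
job_size = len(proteins) // (THREADS * 10)          # roughly 168 sequences per job
for start in range(0, len(proteins), job_size):
    jobs.put(proteins[start:start + job_size])

def worker():
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        for protein in job:
            pass    # compare this protein's peptides against the spectra held in memory

workers = [threading.Thread(target=worker) for _ in range(THREADS)]
for t in workers:
    t.start()
for t in workers:
    t.join()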

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

As you’ll note, we’re seeing some pretty broad variance in the results of this test at lower thread counts, which suggests it may be stumbling over these systems’ non-uniform memory architectures. In an attempt to circumvent that problem, I decided to try running two instances of this benchmark concurrently, with each one affinitized to a socket, and adding the results into an aggregate compute rate. Doing so offers a nice performance boost.
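Here’s roughly what that two-instance arrangement looks like in practice. This is an illustrative sketch, not the exact commands we used; the executable name, result files, and affinity masks (which assume 16 logical CPUs per socket) are all placeholders.

# Illustrative sketch: pin one copy of the CFD benchmark to each socket with
# Windows' "start /WAIT /AFFINITY" and sum the reported cycle frequencies.
import subprocess

socket_masks = ["0000FFFF", "FFFF0000"]     # one hex mask per socket (placeholder values)

procs = [
    subprocess.Popen(f"start /WAIT /AFFINITY {mask} euler3d.exe result{i}.txt",
                     shell=True)
    for i, mask in enumerate(socket_masks)
]
for p in procs:
    p.wait()        # each shell sticks around until its instance finishes

# Each (placeholder) result file is assumed to hold that instance's Hz score.
total_hz = sum(float(open(f"result{i}.txt").read())
               for i in range(len(socket_masks)))
print(f"Aggregate compute rate: {total_hz:.3f} Hz")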

Both of these scientific computing applications seem to put a premium on memory bandwidth. Given our earlier Stream results, we’re a little bit surprised that the Xeon X7560 essentially matches the fastest Westmere-EP Xeon, the X5680. Yes, the Westmere-EP Xeons cost less, but the Nehalem-EX could be a formidable home for any scientific computing or HPC application that requires large amounts of memory in a single system.

Conclusions

I think the biggest lesson we’ve all learned here is that a year is too long a lead time for a computer hardware review. Although that was, perhaps, entirely self-evident before now. Also, I expect I’ll soon be getting an earful from readers who want to see comparisons to the Opteron 6100 series and perhaps Westmere-EX. To that I say: all in due time, folks. Give me 18 months or so.

Kidding, seriously. Just kidding.

Beyond that, we’ve spent a fair amount of time here comparing this Nehalem-EX-based Dell R810 to cheaper, more mainstream 2P servers. We should probably provide some additional context, so we can be clear about the value proposition involved. Today, an R810 configured like our test system, with dual Xeon X7560s, lists for about $23K, according to Dell’s website. Dropping down to a Dell R710 with dual Xeon X5670s will cost about a third less. (I suspect both prices would be lower if one were buying in any sort of volume via a Dell sales rep.) If you make that tradeoff, you’ve saved thousands, but you’ve given up quite a bit, too. The R710 has about half the memory and expansion slot capacity of the R810, and it has no real CPU upgrade options. You can drop a couple more Xeon X7560s into the R810 to make it a 4P box, or you could spring for two to four Westmere-EX 10-core processors to really raise the stakes. How does 40 cores and 80 threads sound?

For a great many applications, an organization’s needs would be better served by smaller, cheaper, more power-efficient individual servers based on mainstream 2P processors, whether packaged in blades or high-density 1U enclosures. However, for the right application, such as a large, centralized database server, paying for the R810’s additional expansion headroom and scalability may make a lot of sense.

Performance-wise, the dual Xeon X7560 configuration we tested isn’t uniformly faster than a dual X5670 system, which is a bit of a letdown. Still, it is sometimes substantially faster—as in SPECjbb2005—and it’s rarely much slower than a Westmere-EP Xeon box. The EX’s even tenor in our virtualized LAMP web service test suggests it may be especially suitable for applications that have strict response time requirements, as well. Although we haven’t tested it ourselves, our sense is that the prior-gen Xeon 7400 series was a larger, more painful step up in pricing with a questionable performance proposition. By contrast, the EX platform looks to be a much safer and more sensible choice, especially with the availability of the 2P option.

We’re quite taken with the R810’s slick packaging and design, too. Next to the generic, white-box servers that populate our labs, it stands out as an example of more intelligent system engineering. Fortunately, our next stop, time permitting, should be another Dell server with another unusual and intriguing CPU and memory configuration.

Comments closed
    • Arclight
    • 8 years ago

    It looked so badass i thought it was illegal….

    • yuhong
    • 9 years ago

    “we’re a little bit surprised that the Xeon X7650 essentially matches the fastest Westmere-EP Xeon, the X5680. ”
    Looks like there is a typo BTW, which reminds me that it is unfortunate that they called Westmere-EX Xeon “E7” instead of Xeon 6600/7600.

      • Damage
      • 9 years ago

      Fixed, thanks.

      The typo, I mean, not Intel’s crazy naming scheme. 🙂

    • codedivine
    • 9 years ago

    Great article. Interesting to see some server stuff reviewed on TR. One suggestion: Can you put some CPU info (such as core-count, process tech, core family etc) in the testing methods table? Server naming scheme of both Intel and AMD is hard to remember so having that info in the table for quick reference while reading the article will help.

    • vvas
    • 9 years ago

    Thanks for finally pushing this article into the light, Scott. Right now I guess it stands a bit on its own, but if similar articles on servers based on Westmere-EX and Magny-Cours are coming, they’re going to make for a very interesting comparison. Keep up the good work!

    • judoMan
    • 9 years ago

    I love this!

    Perhaps like many here, I’ve been reading Damage since he and Dr. Evil were at Ars. In the ensuing years, I’ve worked more and more with backend systems than desktops. This article was right up my alley. Thank you! More please!!

    • flip-mode
    • 9 years ago

    I applaud this benchmarking effort. Honestly, though, I think it would be awesome if you guys looked at some entry level servers and also NAS systems – definitely, definitely NAS systems. And not just RAID 5 NAS systems. RAID 1. Upon reflection, it’s a wonder to me that you guys haven’t done a single NAS review, to my knowledge. No Qnap or Synology or Buffalo or Netgear units… dunno, but it seems like that would be pretty cool. Some two-disk units up to maybe some 4 or 8 disk units.

    And what about entry level servers? Some 1U, single socket units? Testing their RAID features and virtualization capabilities and storage capabilities?

    What about build-your-own server systems? You know, Newegg offers all that hardware, rack unit cases and such, but I’ve never seen any of it tested.

    Seems like all this must be stuff that you guys have talked about internally before and have decided not to go after.

      • ssidbroadcast
      • 9 years ago

      Man flip, you just can’t be pleased! They finally review a server product for you and now you’re asking for more.

        • flip-mode
        • 9 years ago

        Don’t interpret it as complaining because that is not what it is. I’m perfectly happy with TR as is. It’s just thinking out loud, asking whatever questions come to mind and such.

      • potatochobit
      • 9 years ago

      I am pretty sure I saw a prebuilt home server review on here maybe two years ago
      I think it was an HP or something from newegg

      • codedivine
      • 9 years ago

      Yeah NAS stuff is something even I would like to see reviewed even though I don’t work anywhere near an “enterprise”. Just looking for a small home setup.

    • dextrous
    • 9 years ago

    I’m extremely interested in these types of articles! I hope you can somehow review more enterprise-centric hardware as good reviews are hard to find. I’d love to see not only server reviews, but SAN reviews, HBA reviews, or even NIC reviews (not Killer NICs, but enterprise ones). I can dream, right?

    • ssidbroadcast
    • 9 years ago

    [quote<]with better integration than a diversity-training workshop at Harvard.[/quote<] lol. So left field!!

    • potatochobit
    • 9 years ago

    someone had a hissy fit last month because Dell uses proprietary HDD connectors

    • dpaus
    • 9 years ago

    [quote<]"Yes, those dual power supplies are redundant and hot-swappable..."[/quote<] Yes, but when one of them fails, what happens? We've experimented with redundant, hot-swappable power supplies, but have been frustrated by the fact that when one of the fails, all it does - at best - is turn on a lousy, cheap buzzer. Very helpful if you want to keep a person sitting in the server room 24/7, otherwise useless. And let's not get me started on the ones that don't even have a buzzer (prithee, [i<][b<]what[/i<][/b<] is the point of [i<][b<]that[/i<][/b<]??!?) Utilities that send an e-mail are all very nice, but useless when the server itself has failed (as in, D-E-D failed). Ideally, we'd love to find a server that combines e-mail notifications with a good old-fashioned watchdog timer card with discrete voltage outputs (which could be used to trigger an autonomous alarm system), but we haven't found one yet (or, more precisely, we haven't found the right mix of parts to build out own yet - suggestions from other gerbils welcome)

      • Steel
      • 9 years ago

      Which is why for this server you’d have Dell’s OpenManage software installed somewhere in your environment and it would alert you about failed power supplies and fans, failing memory and whatever else you want monitored in the server.

      The management and monitoring software is part of the reason most companies go with Dell, HP and the other big names instead of building DIY white box servers.

        • shank15217
        • 9 years ago

        Unless you make servers from desktop boxes DIY white box servers usually come with a BMC which allow you to do same things dell does.

          • dpaus
          • 9 years ago

          ..a what?

            • OneArmedScissor
            • 9 years ago

            A Bud/Miller/Coors, of course.

            • shank15217
            • 9 years ago

            baseboard management controller

            • dpaus
            • 9 years ago

            Thanks; where does one get one of these [s<]unicorns[/s<] BMCs for a DIY server?

            • indeego
            • 9 years ago

            You buy a MB with it included. (SinglePointOfFailure)
            You buy a BMC PCIe/PCI/PCIX card. (SPOF)
            You buy Software-based. (Multiple points of failure)
            You buy external monitoring/agent software. (Free to $$$$) (This is what I think most people do, it carries its own risks though including more false positives due to connectivity issues.)

            My advice is unless you are extremely cash-limited, you will be better off getting a OEM-server that is not brand-spanking new just released, tested in the field (check the support forums for gotchyas) and not first of its generation. Let others test the bugs for you. There isn’t a server out there without bugs on release, mind you. They are extremely complex and getting more so.

            The advantage of HP/DELL/etc is techs (external and internal) know how they work, they are familiar with them, they know the quirks, the support knows the quirks, and you reduce your TIME in troubleshooting the quirks that can exist with unknown hardware.

            • dpaus
            • 9 years ago

            thanks!

      • Thrashdog
      • 9 years ago

      In my experience redundant power supplies are more useful for enabling redundant power delivery. More than once I’ve seen a UPS (one of them was brand freaking new!) decide that life wasn’t worth living, and avoided service interruption because all the servers plugged into that UPS had a second PSU that was connected to a different battery backup.

        • Anomymous Gerbil
        • 9 years ago

        Exactly.

        And dpaus, even without the proper dual power/UPS setup described above, are you *really* asking why a server (that will presumably be used in a critical environment) would include additoinal hardware to remove single points of failure?

      • d0g_p00p
      • 9 years ago

      Any sysadmin worth his salt will have more than a email notification for monitoring servers. Nagios alarms, email alerts, global monitoring solution and a Telalert like system will let you know when you have a hardware failure or a change in operating frequencies hardware wise. I manage over 120+ production servers that not only my employers rely on but the big 4 cell phone companies as well. You better believe I know when a PS failed, a disk is about to die or a RAM module is faulty.

      The solutions are out there it’s just up to you to implement them right.

      edit|: #16 and Thrashdog, that is the correct way to setup redundant power supplies. Having power redundancy is useless when the power source is the same.

    • shank15217
    • 9 years ago

    The R815 is a much more interesting beast, its drop in replacement is Bulldozer, if you guys get a chance take a look at one of those or even the C6145, 96 cores in 2U courtesy of AMD and soon 128 integer cores in 2U.

    • 5150
    • 9 years ago

    Awesome review, thanks for doing one like this and I look forward to more in the future!
