Sometimes, we bite off more than we can chew. That was certainly the case with this review. We've long pushed, prodded, argued, and advocated for the folks at Intel and AMD to work with us on reviews of their server CPUs. That's generally gone well in the past few years, happily, but we have also been wary of expanding our mission beyond our means. That has meant, for instance, declining opportunities to review 4P systems. Large, expensive servers are interesting, but testing them properly requires time, the right hardware, and a fairly select set of very expensive applications, many of which require massive, proprietary data sets. Reaching up into that segment of the market is no trivial undertaking.
Our instincts were confounded, however, when Intel and Dell dangled a new class of server in front of us, a sort of intermediate step between the traditional, low-cost 2P box and a much beefier, vastly more expensive 4P system. Testing it properly would be a bit of a challenge, sure, and it was sort of an expansion of our mission. But wow, that was some really cool hardware, and AMD seemed to have something similar in the works. Besides, we had some interesting ideas about testing. The challenge would be intriguing, if nothing else.
Thus we found ourselves taking delivery of a Dell R810 server, a sleek, 2U box packed with dual octal-core Nehalem-EX processors, twin 1100W power supplies, quad SAS 6 Gbps hard drives, and a heart-stopping 128GB of RAM.
That was about a year ago, and the months that followed gave us an unprecedented bounty of new GPU and CPU architectures and products based on them—in other words, lots of things to review. We had more to review than we could handle, and in this R810 server, we had perhaps more computer than we could handle properly, too. Shamefully, the R810 went on the backburner time and again as other obligations intervened.
Fortunately, we've finally managed to complete our testing, and we're just in time for Intel's announcement of (ack!) a drop-in replacement for Nehalem-EX, a processor known as Westmere-EX. Rather than completely despairing, we've decided to move ahead with our initial look at the R810 and Nehalem-EX. If there's sufficient interest, we'll then see about upgrading to the new processors and taking them for a spin, as well. Much of the ground we'll cover today is foundational for servers based on either CPU, since they share the same system architecture.
Nehalem-EX: The Ocho
The details of Nehalem-EX silicon may be familiar by now to many interested parties, but we'll recap briefly because they are complex and impressive enough to warrant further attention. As the name implies, the Nehalem-EX processor is based on the same basic CPU microarchitecture and 45-nm manufacturing process as its smaller siblings that share the Nehalem name. The difference with the EX variant has to do with scale, both in terms of the processor silicon—the thing encompasses 2.3 billion transistors—and the system architecture that supports it.
Crammed into the EX are fully eight CPU cores and 24MB of L3 cache, enough elements that the processor's architects decided the simpler internal communications arrangement in quad-core Nehalems wouldn't suffice. Instead, they gave the EX an internal ring bus, a high-speed, bidirectional communication corridor with stops for each key component of the chip. This ring is a precursor, incidentally, to the one Intel architects built into the newer Sandy Bridge architecture to accommodate multiple cores alongside an integrated GPU.
Like all Nehalem chips, the EX has an integrated memory controller. In fact, the EX really has a pair of memory controllers, although the arrangements are rather different than in lower-end 2P Xeons. The EX series is designed to scale to four or more sockets with very large memory capacities, and the sheer number of traces running out of each socket may impede that mission. Intel's system architects have worked around that problem by using external Scalable Memory Buffer (SMB) chips to talk to the memory modules.
Between the EX socket and each SMB is a narrow, high-speed link known as a serial memory interconnect, or SMI. The SMI and SMB allow for higher memory capacities, at the expense of higher access latencies. In fact, this whole arrangement is based closely on the FB-DIMM technology used in older Xeons, which was somewhat infamous for the performance-versus-capacity tradeoff it required. One difference here is that the SMB chips are built into the system and mounted on the motherboard, so EX systems can use regular DDR3 RDIMMs. Another difference, obviously, is the elimination of the front-side bus and its potential to act as a bottleneck at high load levels. Intel claims the EX has a lower, flatter memory access latency profile than the prior-generation Xeon X7400 series.
The Nehalem-EX has two SMI channels per memory controller, and each channel talks to an SMB chip. In turn, each SMB communicates with two channels of DDR3 SDRAM running at a peak of 1066 MT/s. Each memory channel can support a pair of registered DIMMs.
Multiply all of those things out across four sockets, and the numbers get to be formidable. A single Nehalem-EX socket can support up to 16 DIMMs. Just four channels of DDR3-1066 memory per socket could, in theory, yield up to 34 GB/s of memory bandwidth, although some complicating factors like SMI overhead have led Intel to claim a peak memory bandwidth per socket of 25 GB/s. (Real-world throughput will vary depending on the mix of reads and writes used.) Still, that's potentially 100 GB/s of memory bandwidth in a 4P configuration.
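For readers who want to check that math, here's a quick back-of-the-envelope sketch in Python. The names are our own inventions, and the 8-bytes-per-transfer figure assumes a standard 64-bit DDR3 channel. The 25 GB/s number is simply Intel's claim as quoted above, not something the script derives.

```python
# Rough arithmetic for the Nehalem-EX memory subsystem.
# Names are illustrative. Figures come from the article text, except
# the 64-bit (8-byte) DDR3 channel width, which is an assumption.

MEM_CONTROLLERS_PER_SOCKET = 2
SMI_CHANNELS_PER_CONTROLLER = 2   # each SMI channel feeds one SMB chip
DDR3_CHANNELS_PER_SMB = 2
DIMMS_PER_DDR3_CHANNEL = 2

TRANSFERS_PER_SEC = 1066e6        # DDR3-1066
BYTES_PER_TRANSFER = 8            # assumed 64-bit data path

INTEL_CLAIMED_PEAK_GBS = 25       # per socket, per Intel


def dimms_per_socket() -> int:
    """Multiply out the fan-out from socket to DIMM slots."""
    return (MEM_CONTROLLERS_PER_SOCKET
            * SMI_CHANNELS_PER_CONTROLLER
            * DDR3_CHANNELS_PER_SMB
            * DIMMS_PER_DDR3_CHANNEL)


def theoretical_bandwidth_gbs(channels: int) -> float:
    """Peak DDR3 bandwidth in GB/s for a given number of channels."""
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


def claimed_bandwidth_gbs(sockets: int) -> int:
    """Aggregate bandwidth using Intel's claimed per-socket peak."""
    return sockets * INTEL_CLAIMED_PEAK_GBS


if __name__ == "__main__":
    print(f"DIMMs per socket: {dimms_per_socket()}")
    print(f"Four DDR3-1066 channels: {theoretical_bandwidth_gbs(4):.1f} GB/s")
    print(f"Claimed 4P aggregate: {claimed_bandwidth_gbs(4)} GB/s")
```

Swap in a different socket count or transfer rate to see how the totals move. None of this accounts for SMI overhead or the read/write mix, which is exactly why Intel's claimed figure sits below the theoretical one.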
Like its Nehalem-EP brethren, the EX uses Intel's point-to-point QuickPath Interconnect for communication between the sockets. Each CPU has four QPI link controllers onboard, making possible fully-connected 4P configurations like the one depicted in the diagram above. Glueless 8P configurations are also possible, as are higher socket counts with the aid of third-party node controller chips.
The I/O hub shown above is a chip code-named Boxboro, and it's basically a giant PCI Express switch, with 36 lanes of second-generation PCIe connectivity. These lanes can be configured in various ways: four PCIe x8 links plus an x4, nine x4 connections, or dual x16s alongside two x2 links, for instance. If that's not enough I/O bandwidth, a 2P config may have dual IOH chips, while a 4P may have as many as three. An eight-way, quad-IOH layout could have up to 144 lanes of PCIe Gen2 bandwidth—again, staggering scale. Since the Boxboro IOH is largely just for PCI Express, it connects to Intel's tried-and-true ICH10 chip, which provides the rest of the system's conventional I/O needs, including some first-generation PCIe lanes.
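The lane counts above are easy to sanity-check. The following Python sketch, with names of our own choosing, simply confirms that each example link layout adds up to the 36 lanes a single Boxboro IOH provides and multiplies that budget out across several IOH chips. It makes no attempt to model which layouts the hardware actually permits.

```python
# Sanity check of the Boxboro IOH lane budget described in the text.
# The dictionary keys and function names are illustrative only.

LANES_PER_IOH = 36

# Example link layouts from the article, as lists of link widths.
EXAMPLE_LAYOUTS = {
    "four x8 plus one x4": [8, 8, 8, 8, 4],
    "nine x4": [4] * 9,
    "dual x16 plus two x2": [16, 16, 2, 2],
}


def lanes_used(widths: list[int]) -> int:
    """Total PCIe lanes consumed by a set of links."""
    return sum(widths)


def total_lanes(ioh_count: int) -> int:
    """Aggregate PCIe Gen2 lanes across several IOH chips."""
    return ioh_count * LANES_PER_IOH


if __name__ == "__main__":
    for name, widths in EXAMPLE_LAYOUTS.items():
        print(f"{name}: {lanes_used(widths)} of {LANES_PER_IOH} lanes")
    print(f"Quad-IOH total: {total_lanes(4)} lanes")
```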
Not only does the EX platform exist on a much larger scale than other Xeons, but it also includes some RAS (reliability, availability, and serviceability) features traditionally found only in mainframes, high-end RISC systems, and Intel's other offering in this segment, Itanium. These capabilities extend well beyond the traditional error recovery mechanism built into ECC DRAM. The EX's recoverable machine check architecture (MCA) allows for on-the-fly recoveries from events that would be catastrophic in another class of hardware.
For example, in the event of a DIMM failure, the system could take the failed module out of use while the firmware and OS would work together to recover or restart any affected processes, without bringing the system down. Eventually, a tech could perform a hot-swap replacement of the failed and isolated module—all while the system keeps running. (That last bit sounds rather terrifying to me. I'd much rather shut down the affected system and do the DIMM swap during a maintenance window, but perhaps I'm just too timid.)
By creating a new class of 2P server based on Nehalem-EX, Intel and its partners are bringing these RAS features to a new price point, along with higher memory capacities.
Speaking of prices, don't get your hopes up for an especially cheap date. The fastest Nehalem-EX processor is the Xeon X7560, which is the one we've tested inside the Dell R810. The X7560 has eight cores, 16 threads (via Hyper-Threading/SMT), and a default clock speed of 2.26GHz. If there's headroom left within its 130W thermal envelope, Intel's Turbo Boost feature will allow the X7560's clock frequency to range up to 2.66GHz. A single Xeon X7560 will currently set you back $3,692. In the context of the total system price, that's practically a steal.