
Memory subsystem performance
As you can see, we've pitted the R810 and Xeon X7560 against a range of lower-end 2P systems from our prior server reviews. Since this is a new class of product in many ways, and since we don't dabble in 4P systems, these comparisons will have to suffice, even though the R810 doesn't compete directly with these less expensive, less scalable systems. That's not to say the R810 won't have its hands full with the Westmere-EP-based 5600-series Xeons. With six cores per socket and higher frequencies, those Xeons are quite potent masters of the traditional 2P space.

Also conspicuous by its absence is AMD's Opteron 6100 series, whose promise of relatively inexpensive 4P configurations arguably provided the impetus for the creation of 2P EX systems. These Opterons are next on our slate, so please bear with us.

This test measures cache bandwidth in parallel, so all of the caches on the available cores should be tested. Even so, the L2 caches associated with the Xeon X7560 system's 16 cores at 2.26GHz can't quite keep pace with the L2 caches of the 12 cores at 3.33GHz in the Xeon X5680 box, as is evident at the intermediate block sizes. The most dramatic gap between the systems comes at the larger block sizes, though, and here the X7560 breaks from the pack. The Nehalem-EX processors' large 24MB L3 caches give them a decided bandwidth advantage at the 4MB, 16MB, and even 64MB block sizes. For the right mix of applications, or single applications with very large working data sets, those enormous caches could prove very helpful.

This is a disappointing, practically scandalous result, and I'm not entirely sure what to make of it. I should start by saying I believe it is legitimate and correct, that our testing methods were sound and appropriate. Stream allows one to tune its operation reasonably well, and we tailored it to fit with the threading and socket config of our Dell R810 server. We've found that Stream works best with Hyper-Threaded CPUs if one assigns a single thread to each physical CPU core. Doing so allowed us to achieve nice results on the Nehalem-EP and Westmere-EP Xeons, as is evident. We used a similar, expanded thread assignment for the Xeon X7560, and it produced the best results of any config we tried, with the appropriate thread utilization showing in the Task Manager.
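For readers wondering what "a single thread to each physical CPU core" looks like in practice, here is a minimal sketch of the selection logic in Python. It is an illustration only, not the way Stream itself is configured. The function name and the topology map are hypothetical, and the assumption that Hyper-Threading siblings are numbered adjacently won't hold on every system.

```python
def one_logical_cpu_per_core(topology):
    """Return one logical CPU ID per physical core.

    topology maps logical CPU ID -> (socket, core), identifying the
    physical core each logical CPU belongs to. For each core we keep the
    lowest-numbered logical CPU and skip its Hyper-Threading sibling(s).
    """
    chosen = {}
    for logical_id in sorted(topology):
        core = topology[logical_id]
        # setdefault keeps the first (lowest) logical CPU seen for a core.
        chosen.setdefault(core, logical_id)
    return sorted(chosen.values())


# Hypothetical layout (an assumption, not the R810's real enumeration):
# 2 sockets x 2 cores x 2 hardware threads, siblings numbered adjacently.
example_topology = {
    0: (0, 0), 1: (0, 0),
    2: (0, 1), 3: (0, 1),
    4: (1, 0), 5: (1, 0),
    6: (1, 1), 7: (1, 1),
}

pinned = one_logical_cpu_per_core(example_topology)
print(pinned)  # [0, 2, 4, 6]
```

Applied to a system with 16 physical cores and Hyper-Threading enabled, the same selection would yield 16 logical CPUs, one per core, which is the sort of thread assignment described above.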

The likely culprit here is the EX's use of memory buffer chips connected via a serialized link. The very similar FB-DIMM technology also underachieved in measured bandwidth on the Xeon 5400 series. We'd know more if we'd been able to measure memory access latencies as well, but our usual tool for doing that wasn't built to cope with 24MB L3 caches. We did, however, ping Dell and Intel, and they told us these results weren't outside the range of their expectations for this system.