Memory subsystem performance

This test shows us a nice visual picture of the memory bandwidth available at the different levels of the memory hierarchy. One can see the impact of the Opteron 2384's larger L3 cache at the 16MB test block size, where it's much faster than the older quad-core Opterons. Still, the Xeons' caches typically achieve quite a bit higher throughput than the Opterons'.
Our graph is tough to read at the largest test block sizes where main memory comes into play. Here's a closer look at the 256MB block size, which should be a good indicator of main memory bandwidth.

These results are consistent with what we've seen in the past from most of these platforms. I believe these results only show the bandwidth available to a single CPU core, so they're substantially less than the peak available in the entire system. The Opterons appear to benefit greatly from their integrated memory controllers here, and the Shanghai Opteron 2384 takes advantage of its faster 800MHz memory, as well.

The Opteron 2384's revamped cache and TLB hierarchy, along with faster memory, delivers major reductions in memory access latency. With the 65nm Barcelona Opterons, we've found that the L3 cache tends to contribute quite a bit of latency to the overall picture. Yet with three times the L3 cache of the Opteron 2356, the 2384 is still faster to main memory. Let's have a closer look at the cache picture and see why that is.
Before we do, though, we should also point out that the Xeon L5430 on the San Clemente platform has much lower access latencies than the E5450 on the Bensley platform, although they share the same bus frequency and topology. Assuming there aren't any other major contributing factors, FB-DIMMs would appear to add about 14ns of delay versus DDR2 modules at the same 667MHz clock speed. The Stoakley platform essentially makes up that deficit by using higher bus and memory frequencies.
Note that, below, I've color-coded the block sizes that roughly correspond to the different caches on each of the processors. L1 data cache is yellow, L2 is light orange, L3's darker orange, and main memory is brown.






These graphs offer a good visual representation of the data, but perhaps some numbers would illuminate things further. Because the Opteron's L3 cache is clocked independently from the CPU cores, it doesn't make sense to quantify that cache's latency in terms of CPU clock cycles. In this case, the Opteron 2356's L3 cache runs at 2GHz, while the 2384's runs at 2.2GHz, a 10% increase. Despite the fact that the 2384's L3 cache is three times the size, though, its latencies are considerably lower. At the 2048KB block size and step size of 256, the 2356's latency is 23ns, while the 2384's is only 16ns, a reduction of nearly a third.
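For readers who want to check the arithmetic, here is a minimal Python sketch that works through the figures quoted above. The helper names `ns_to_cycles` and `relative_change` are ours, purely for illustration. The conversion to L3 clock cycles is only a rough yardstick: it assumes the whole measured latency can be charged to the L3 clock domain, which the benchmark itself doesn't establish.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to cycles of a clock given in GHz."""
    # 1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    return latency_ns * clock_ghz


def relative_change(old, new):
    """Fractional change from old to new (negative means a reduction)."""
    return (new - old) / old


# Figures quoted in the text: (latency in ns, L3 clock in GHz),
# measured at the 2048KB block size with a step size of 256.
l3 = {
    "Opteron 2356": (23.0, 2.0),
    "Opteron 2384": (16.0, 2.2),
}

for name, (latency_ns, clock_ghz) in l3.items():
    cycles = ns_to_cycles(latency_ns, clock_ghz)
    print(f"{name}: {latency_ns:.0f} ns at {clock_ghz:.1f} GHz "
          f"is roughly {cycles:.1f} L3 clock cycles")

clock_gain = relative_change(l3["Opteron 2356"][1], l3["Opteron 2384"][1])
latency_drop = -relative_change(l3["Opteron 2356"][0], l3["Opteron 2384"][0])
print(f"L3 clock increase: {clock_gain:.0%}")
print(f"Latency reduction: {latency_drop:.1%}")
```

The latency drop works out to about 30%, far more than the 10% bump in L3 clock speed would explain by itself, so the faster clock can't be the whole story.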