Memory subsystem performance
Since we have a new chip architecture and a new memory type on the bench, let's take a look at some directed memory tests before moving on to real-world applications.
The fancy plot above mainly looks at cache bandwidth. This test is multithreaded, so the numbers you see show the combined bandwidth from all of the L1 and L2 caches on each system. Since our Haswell-EPs have 20 L1 caches of 32KB each, we're still in the L1 cache at the 512KB block size above. The next few points, up to 4MB, are hitting the L2 caches, and beyond that, up to 16MB, we're into the L3.
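To make that arithmetic concrete, here is a minimal sketch of how a combined block size maps onto cache levels in a multithreaded test. It assumes the block is split evenly across all cores. The 20 cores and 32KB L1 come from the text above, while the 256KB per-core L2 is our assumption about Haswell-EP, and the function name is purely illustrative.

```python
KB = 1024
MB = 1024 * KB

def aggregate_cache_level(block_size, cores=20,
                          l1_per_core=32 * KB, l2_per_core=256 * KB):
    """Guess which cache level a multithreaded test's total block size lands in.

    Assumes the block is divided evenly among the cores, so the capacity
    that matters is the per-core cache size times the core count.
    The 256KB L2 default is an assumption, not a figure from the article.
    """
    if block_size <= cores * l1_per_core:
        return "L1"
    if block_size <= cores * l2_per_core:
        return "L2"
    return "L3 or beyond"

# Walk through a few of the block sizes mentioned in the text.
for size in (512 * KB, 1 * MB, 4 * MB, 16 * MB):
    print(size // KB, "KB ->", aggregate_cache_level(size))
```

With these defaults, the combined L1 capacity is 640KB and the combined L2 capacity is 5MB, which is why 512KB still counts as L1 and 4MB still counts as L2. Passing `cores=1` gives the single-threaded view instead.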
Haswell-EP's promised doubling of L1 and L2 cache bandwidth per core is on display in the plot above. The E5-2687W v3's higher core count also plays a part in these results, but however you slice it, this is a massive increase in cache bandwidth.
Now, let's look at what happens when we get into main memory.
We found that our usual version of the Stream bandwidth test fails to scale properly to 20 cores and 40 threads, so we've substituted AIDA's memory tests instead, which showed no such issue. The E5 v3's higher-speed DDR4 memory clearly outperforms the two prior generations of Xeons with DDR3 memory, with delivered bandwidth of up to 123 GB/s in the memory read test.
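For context on that read number, here is a back-of-the-envelope sketch of the theoretical peak. The assumptions are ours, not measurements: DDR4 at 2133 MT/s, a 64-bit (8-byte) channel, four memory channels per socket, and two sockets contributing to the result. The function name is hypothetical.

```python
def peak_bandwidth_gbs(transfer_rate_mts, channels_per_socket=4,
                       sockets=2, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    All defaults are assumptions about the test system: quad-channel
    memory per socket, two sockets, and 64-bit channels.
    """
    transfers_per_second = transfer_rate_mts * 1e6
    total_channels = channels_per_socket * sockets
    return transfers_per_second * bytes_per_transfer * total_channels / 1e9

peak = peak_bandwidth_gbs(2133)
measured = 123.0  # GB/s, the memory read result quoted in the text

print(f"theoretical peak: {peak:.1f} GB/s")
print(f"measured / peak:  {measured / peak:.0%}")
```

Under those assumptions, the peak works out to about 136.5 GB/s, so 123 GB/s would be roughly 90% of what the memory channels could deliver on paper.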
SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the "in-page random" access pattern to reduce the impact of prefetchers on our measurements. This test isn't multithreaded, so it's a little easier to track which cache is being measured. If the block size is 32KB, you're in the L1 cache. If it's 64KB, you're into the L2, and so on.
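As a rough illustration of what an "in-page random" pattern might look like, the sketch below builds an access order that walks pages sequentially but touches the cache lines inside each page in shuffled order. This is our reading of the idea, not SiSoft's actual implementation. The 4KB page size, 64-byte line size, and function name are all assumptions.

```python
import random

def in_page_random_order(block_size, page_size=4096, line_size=64, seed=0):
    """Return byte offsets covering block_size, one per cache line.

    Pages are visited in ascending order, but the cache lines within
    each page are shuffled, so the next address inside a page is not
    predictable from the previous one. Page and line sizes are assumed.
    """
    rng = random.Random(seed)
    order = []
    for page_start in range(0, block_size, page_size):
        page_end = min(page_start + page_size, block_size)
        lines = list(range(page_start, page_end, line_size))
        rng.shuffle(lines)  # randomize only within this page
        order.extend(lines)
    return order

# A 32KB block: eight pages of 64 cache lines each.
offsets = in_page_random_order(32 * 1024)
print(len(offsets), "accesses; first few:", offsets[:5])
```

A real latency test would chain these offsets into dependent loads, so each access has to finish before the next address is known. Python's interpreter overhead would swamp the timings, so this only shows the ordering.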
Haswell-EP delivers nearly twice the L1 and L2 cache bandwidth without any increase in access latencies for those caches. There is a slight increase in L3 cache access times, but then the Xeon E5 v3 has more LLC partitions to access than its eight-core siblings do.
At 2133 MT/s, Haswell-EP's DDR4 memory doesn't provide quite as quick a turnaround as DDR3 does. I'd expect that to change as DDR4 operating speeds ramp up. Notice the nice result above for the Haswell-E-based 5960X with DDR4-2800.
Some quick synthetic math tests
The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They've tweaked several of these tests to make use of new instructions on the latest processors, including Haswell-EP. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.
Here's a nice look at the true potential throughput of the Haswell-EP hardware, provided a nicely vectorizable workload and the AVX2 instruction set extensions. Many of the applications we're testing on the following pages don't take full advantage of AVX2 yet, but once they do... yikes.
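For a rough sense of why AVX2 with FMA matters so much for a vectorizable workload, here is a sketch of the peak per-core floating-point math. It assumes two 256-bit floating-point units per core, with each lane counting two operations per cycle when a fused multiply-add is available and one when it is not. Those unit counts are our assumptions about the hardware, and the function name is hypothetical.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fp_units, fma):
    """Peak floating-point operations per core per cycle.

    Each vector unit processes vector_bits / element_bits lanes at once.
    A fused multiply-add counts as two operations per lane; a plain add
    or multiply counts as one. Unit counts are assumed, not measured.
    """
    lanes = vector_bits // element_bits
    flops_per_lane = 2 if fma else 1
    return lanes * flops_per_lane * fp_units

# Single precision (32-bit) with 256-bit vectors.
with_fma = peak_flops_per_cycle(256, 32, fp_units=2, fma=True)
without_fma = peak_flops_per_cycle(256, 32, fp_units=2, fma=False)

print("with FMA:   ", with_fma, "FLOPs/cycle/core")
print("without FMA:", without_fma, "FLOPs/cycle/core")
```

Under those assumptions, FMA doubles the per-core peak, before counting any gains from extra cores. Real code rarely sustains the peak, so treat this as an upper bound.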