Memory subsystem performance
Since we have a new chip architecture and a new memory type on the bench, let's take a look at some directed memory tests before moving on to real-world applications.
The fancy plot above mainly looks at cache bandwidth. This test is multithreaded, so the numbers you see show the combined bandwidth from all of the L1 and L2 caches on each CPU. Since Haswell-E has eight 32KB L1 data caches (256KB combined), we're still in the L1 at the 256KB block size above. The next three points, up to 2MB, hit the L2 caches, and beyond that, up to 16MB, we're into the L3.
Intel's architects essentially doubled the bandwidth in Haswell's L1 and L2 caches compared to Ivy Bridge in order to make them fast enough to support AVX2's higher throughput. We don't see quite a doubling of performance in our measurements when comparing the Core i7-4960X to the 5960X, but there are a lot of moving parts here. The 5960X has more cores but a lower frequency, for instance. Regardless, the 5960X's caches can sustain vastly more throughput than any other CPU we've tested.
Stream offers a look at main memory bandwidth, and the results are scandalous. I tried a number of different thread counts and affinity configs, but I just couldn't extract any more throughput from the 5960X with this version of Stream. In fact, I even tried raising the DDR4 speed from 2133 to 2800 MT/s, and throughput didn't improve. Frustrated, I decided to try a different bandwidth test from AIDA64.
That's more like it. The Haswell-E/DDR4 combo can achieve higher throughput in the right situation. The memory read results don't tell the whole story, though.
Looks like DDR4 writes are slower than reads, at least with AIDA64's access pattern. In the end, the memory copy test shows a reasonably good overall result for the 5960X and DDR4. At 2133 MT/s, it's not much faster than an i7-4960X with DDR3-1866, but at 2800 MT/s, the new CPU and memory type move ahead.
Next up, let's look at access latencies.
SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the "in-page random" access pattern to reduce the impact of prefetchers on our measurements. This test isn't multithreaded, so it's a little easier to track which cache is being measured. If the block size is 32KB, you're in the L1 cache. If it's 64KB, you're into the L2, and so on.
The bottom line: Haswell-E achieves roughly double the cache bandwidth without any real increase in the number of clock cycles of access latency. That's excellent.
Accessing main memory is a bit of a different story. The Haswell-E-and-DDR4 combo has higher memory access latencies at 2133 MT/s than most of the DDR3-based setups. Fortunately, that slowdown pretty much evaporates once we crank the DDR4 up to 2800 MT/s.
Honestly, I'm not sure the added memory latency matters much, given that the 5960X has a massive 20MB L3 cache. We conducted most of our testing at the CPU's officially supported memory spec of DDR4-2133. We may have to do some additional testing at 2800 MT/s to see what it nets us in real applications.
Some quick synthetic math tests
The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They've tweaked several of these tests to make use of new instructions on the latest processors, including Haswell-E. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.
Good grief. The Core i7-5960X is off to one heck of a start. Many of the big generational performance gains you're seeing above come from the use of AVX2 and the FMA (or fused multiply-add) instruction. Haswell has them, and Ivy Bridge doesn't. Notice how the quad-core 4790K nearly matches or even beats the six-core 4960X? That's Haswell magic at work. Now, give the Haswell chip eight cores, and you have the i7-5960X.