Memory subsystem performance
Since we have a new chip architecture and a new memory type on the bench, let's take a look at some directed memory tests before moving on to real-world applications.

The fancy plot above mainly looks at cache bandwidth. This test is multithreaded, so the numbers you see show the combined bandwidth from all of the L1 and L2 caches on each CPU.

The most notable result above involves the comparison between the, uh, sky-blue Skylake 6700K line and the yellow Haswell 4790K one. At the 256KB to 1MB block sizes, where we're accessing the four 256KB L2 caches in these chips, Skylake achieves substantially higher transfer rates at the same basic clock frequency as Haswell.

Oh, and get used to the Core i7-5960X taking the top spot in almost every benchmark. That CPU has eight cores, 16 threads, quad channels of DDR4, and a 20MB L3 cache. It also costs a grand. The 5960X is not in the same class as the rest of these chips. It's just here for reference.

Now let's zoom in on a portion of the graph above.

My main motivation for including this strange plot is to get you to consider the 64MB test block size. There, the purple line for the 5775C indicates higher bandwidth than any other CPU tested; the 5775C's bandwidth at this block size is roughly double the 6700K's. This data point is one spot where we can see the impact of the 5775C's 128MB L4 cache. Ooh, ahh.

Interesting. You can see the impact of the 6700K's higher-bandwidth DDR4 memory easily in Stream. That wasn't the case with Haswell-E compared to Ivy-E. My suspicion is that Skylake may be more aggressive about speculatively pre-fetching data into its caches than prior architectures. That would explain its ability to take immediate advantage of DDR4's additional bandwidth.
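For a rough sense of how much headroom DDR4 offers on paper, here's a minimal sketch of the usual peak-bandwidth arithmetic: transfer rate times bus width times channel count. The 2133 MT/s figure is the DDR4 speed quoted for the 6700K in this review. The dual-channel, 64-bit-per-channel configuration and the DDR3-1600 comparison point are assumptions made for illustration, not the documented test setup, and Stream will always land below these theoretical ceilings.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, channels: int,
                       bus_width_bits: int = 64) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: dual-channel memory with a 64-bit bus per channel.
ddr4_2133_dual = peak_bandwidth_gbs(2133, channels=2)

# Assumption: DDR3-1600 picked as an illustrative comparison point only.
ddr3_1600_dual = peak_bandwidth_gbs(1600, channels=2)

if __name__ == "__main__":
    print(f"DDR4-2133, dual channel: {ddr4_2133_dual:.1f} GB/s")
    print(f"DDR3-1600, dual channel: {ddr3_1600_dual:.1f} GB/s")
```

These are ceilings, not predictions. How close a CPU gets to them depends on things like the pre-fetching behavior speculated about above.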

Looks to me like the Broadwell 5775C's large L4 cache is a boon in AIDA's memory copy test. The 5775C doesn't look like anything special in isolated read or write tests, but when asked to do both, having that big cache on hand appears to help.

Next up, let's look at access latencies.

SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the "in-page random" access pattern to reduce the impact of pre-fetchers on our measurements. This test isn't multithreaded, so it's a little easier to track which cache is being measured. If the block size is 32KB, you're in the L1 cache. If it's 64KB, you're into the L2, and so on.
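The block-size-to-cache mapping described above can be sketched as a simple lookup. This is a simplified model, not how the SiSoft tool works internally. The 32KB L1 and 256KB L2 sizes come from this review, as does the 5775C's 128MB L4; the L3 capacities in the two example tables are assumptions for illustration, and `level_for_block` is a hypothetical helper name.

```python
KB = 1024
MB = 1024 * KB

# (level name, per-core visible capacity in bytes), smallest first.
# Assumption: an 8MB L3, used here purely for illustration.
EXAMPLE_HIERARCHY = [
    ("L1", 32 * KB),
    ("L2", 256 * KB),
    ("L3", 8 * MB),
]

# Assumption: a 6MB L3 in front of the 128MB L4 mentioned for the 5775C.
EXAMPLE_HIERARCHY_WITH_L4 = [
    ("L1", 32 * KB),
    ("L2", 256 * KB),
    ("L3", 6 * MB),
    ("L4", 128 * MB),
]


def level_for_block(block_bytes: int, hierarchy) -> str:
    """Return the smallest cache level whose capacity holds the test block.

    Blocks larger than every cache are served from main memory.
    """
    for name, capacity in hierarchy:
        if block_bytes <= capacity:
            return name
    return "DRAM"


if __name__ == "__main__":
    for size in (32 * KB, 64 * KB, 1 * MB, 64 * MB):
        print(size // KB, "KB ->", level_for_block(size, EXAMPLE_HIERARCHY))
```

The same lookup explains the 64MB data point from the bandwidth plot: that block spills out to main memory in the first table but still fits in the L4 in the second.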

Despite the higher transfer rates of Skylake's L2 cache, its access latencies have barely risen. L2 accesses on the Haswell 4790K take 11 cycles using this tool, and they take 12 cycles on the Skylake 6700K.
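To put that one-cycle difference in wall-clock terms, here's a quick sketch that converts cycle counts to nanoseconds. The 11- and 12-cycle figures are the measurements above. The 4.0GHz clock is an assumed round number for illustration, applied to both chips since they run at about the same basic frequency; it is not a measured value from this test.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f nanoseconds.
    """
    return cycles / clock_ghz


# Assumption: both chips running at 4.0GHz during the test.
ASSUMED_CLOCK_GHZ = 4.0

haswell_l2_ns = cycles_to_ns(11, ASSUMED_CLOCK_GHZ)
skylake_l2_ns = cycles_to_ns(12, ASSUMED_CLOCK_GHZ)

if __name__ == "__main__":
    print(f"4790K L2: {haswell_l2_ns:.2f} ns")
    print(f"6700K L2: {skylake_l2_ns:.2f} ns")
    print(f"Difference: {skylake_l2_ns - haswell_l2_ns:.2f} ns")
```

Under that assumption, the extra cycle works out to a quarter of a nanosecond, which is why the added L2 latency is hard to get worked up about.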

The move to DDR4 at 2133 MT/s carries only a slight penalty for the 6700K—it's four nanoseconds slower than the 4790K with DDR3. That's not bad at all, and I suspect that penalty could evaporate pretty quickly as DDR4 clock speeds ramp up.

Some quick synthetic math tests
The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They've tweaked several of these tests to make use of new instructions on the latest processors. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.

These quick tests give us a nice starting sense of how the Skylake-based 6700K may compare to its 4790K predecessor. In PhotoWorxx, the 6700K manages a pretty dramatic gain over the 4790K. The 5775C even gets in on the action, with the Broadwell chip taking the third spot. However, the 6700K's advantage over the 4790K grows slimmer as we move to different workloads. Skylake's improvements in per-clock performance can be substantial in the right circumstances, but not every application will benefit equally. Some may not benefit much at all.