Memory subsystem performance
Before moving into the rest of our application-focused CPU tests, we can briefly look at some synthetic benchmarks that measure targeted aspects of system performance.

All three recent generations of Intel's processors do a fine job of extracting bandwidth from a pair of DDR3 DIMMs, and there's not much difference between them. I believe looser memory timings on its SO-DIMMs are responsible for the 4950HQ's relatively weak showing.

This test is multithreaded, so it captures the bandwidth of all caches on all cores concurrently. The different block sizes step us down from the L1 and L2 caches into L3 and main memory. The only hitch is that, when reading this plot, you have to account for the combined capacity of a given cache level across all four cores of a quad-core CPU. For instance, at the 128KB block size above, we're just starting to spill out of Haswell's four 32KB L1 caches.
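To make that bookkeeping concrete, here's a minimal sketch that maps a test block size onto the cache level it should mostly fit in when all four cores are running. Only the 32KB L1 size and the quad-core count come from the text above. The 256KB-per-core L2 and 8MB shared L3 are assumed values for a desktop Haswell part like the 4770K, and the helper name is made up for illustration.

```python
KB = 1024
MB = 1024 * KB

CORES = 4
# (level name, capacity in bytes, private to each core?)
CACHES = [
    ("L1", 32 * KB, True),    # from the text: four 32KB L1 caches
    ("L2", 256 * KB, True),   # assumption: 256KB per core
    ("L3", 8 * MB, False),    # assumption: 8MB shared by all cores
]

def level_for_block(block_bytes, cores=CORES, caches=CACHES):
    """Return the first level whose aggregate capacity holds the block."""
    for name, size, private in caches:
        # Private caches add up across cores; a shared cache counts once.
        aggregate = size * cores if private else size
        if block_bytes <= aggregate:
            return name
    return "memory"

for block in (32 * KB, 128 * KB, 256 * KB, 1 * MB, 4 * MB, 64 * MB):
    print(f"{block // KB:>6} KB -> {level_for_block(block)}")
```

Under these assumptions, a 128KB block exactly fills the four L1 caches, which is why that size marks the point where spillover into the L2 begins. The sketch ignores cache inclusivity and anything else competing for space, so treat the boundaries as approximate.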

The most remarkable thing about these results is that Haswell really does appear to deliver twice the L1 cache bandwidth of Ivy and Sandy Bridge in a measurable way, at roughly one terabyte per second of internal bandwidth. We don't see nearly the same degree of improvement in the L2 cache, though, for whatever reason.
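A quick back-of-the-envelope check shows why a figure near one terabyte per second is plausible. The per-cycle load widths and the clock speed below aren't stated in this section. They're assumptions based on Intel's published specs: two 32-byte loads per cycle per core for Haswell, two 16-byte loads for Sandy and Ivy Bridge, and a 3.5GHz base clock with Turbo ignored.

```python
def peak_l1_load_bandwidth(cores, clock_hz, bytes_per_cycle):
    """Theoretical aggregate L1 load bandwidth in bytes per second."""
    return cores * clock_hz * bytes_per_cycle

CLOCK_HZ = 3.5e9  # assumption: 3.5GHz base clock, Turbo ignored

# Assumption: two 32-byte loads per cycle per core on Haswell
haswell = peak_l1_load_bandwidth(4, CLOCK_HZ, 2 * 32)
# Assumption: two 16-byte loads per cycle per core on Sandy/Ivy Bridge
ivy = peak_l1_load_bandwidth(4, CLOCK_HZ, 2 * 16)

print(f"Haswell:   {haswell / 1e9:.0f} GB/s")
print(f"Ivy/Sandy: {ivy / 1e9:.0f} GB/s")
print(f"Ratio:     {haswell / ivy:.1f}x")
```

With those inputs, the Haswell figure works out to 896 GB/s across four cores, exactly double the older chips at the same clock. Turbo headroom would push the theoretical number a little higher.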

Can we also measure the impact of the 4950HQ's gargantuan eDRAM cache? Let's zoom in a bit and have a look.

Yep. The impact of the 4950HQ's L4 cache is evident at the 16MB and 64MB test block sizes. Bandwidth is almost exactly doubled at the 64MB block size compared to the 4770K, which has to rely on main memory alone.

SiSoft has a nice write-up of this latency testing tool, for those who are interested. We used the "in-page random" access pattern to reduce the impact of prefetchers on our measurements. This test isn't multithreaded, so it's a little easier to track which cache is being measured. If the block size is 32KB, you're in the L1 cache. If it's 64KB, you're into the L2, and so on.
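For readers who haven't seen how a latency test of this sort is usually built, here's a rough sketch of a pointer-chasing chain. It assumes "in-page random" means a random visiting order inside each 4KB page, with the pages themselves walked in sequence. That's our reading of the name, not a description of SiSoft's actual code. The function names, the 8-byte element size, and the page size are all illustrative assumptions.

```python
import random

def in_page_random_chain(num_elems, elems_per_page, seed=0):
    """Build next-index links: random order within each page, pages in sequence."""
    rng = random.Random(seed)
    order = []
    for start in range(0, num_elems, elems_per_page):
        page = list(range(start, min(start + elems_per_page, num_elems)))
        rng.shuffle(page)          # randomize visits inside this page only
        order.extend(page)
    nxt = [0] * num_elems
    # Link each visited element to the one visited after it, wrapping at the end.
    for here, there in zip(order, order[1:] + order[:1]):
        nxt[here] = there
    return nxt

def chase(nxt, start, steps):
    """Follow the chain; every load depends on the result of the previous one."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i

# Assumed sizes: a 32KB block of 8-byte elements with 4KB pages
# gives 4096 elements, 512 per page.
chain = in_page_random_chain(4096, 512)
print(chase(chain, 0, len(chain)))  # one full lap ends back at the start index
```

Because each access depends on the one before it, the hardware can't overlap the loads, and the random order inside a page gives the prefetchers little to latch onto. Python's interpreter overhead swamps cache latency, so this only illustrates the access pattern. A real measurement would use the same chain in native code and divide elapsed time by the number of steps.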

As advertised, access latencies for Haswell's L1 and L2 caches are no higher than Sandy's and Ivy's. The L3 cache appears to be slightly slower, though. Notice also how the 4950HQ's L4 cache mitigates access latencies at the 16MB through 64MB block sizes.

Some quick synthetic math tests
The folks at FinalWire have built some interesting micro-benchmarks into their AIDA64 system analysis software. They've updated several of these tests to make use of new instructions on the latest processors, including Haswell. Of the results shown below, PhotoWorxx uses AVX2 (and falls back to AVX on Ivy Bridge, et al.), CPU Hash uses AVX (and XOP on Bulldozer/Piledriver), and FPU Julia and Mandel use AVX2 with FMA.

The big, eye-popping gains here come courtesy of AVX2 with FMA in the Julia and Mandel tests, where the 4770K is about 50% faster than Ivy Bridge and over twice as fast as the FX-8350. This isn't quite the doubling of peak FLOPS throughput that Haswell can produce in theory, but it's pretty good for delivered performance. Also, I think we can safely say the 4950HQ's L4 cache doesn't provide much benefit here.
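The arithmetic behind that "doubling in theory" can be sketched in a few lines. The execution-unit counts here are assumptions drawn from Intel's public descriptions of the two architectures rather than from this article: Ivy Bridge with one 256-bit add unit and one 256-bit multiply unit per core, Haswell with two 256-bit FMA units per core. Single precision is used for illustration, and clock speed differences are ignored.

```python
def flops_per_cycle(vector_bits, element_bits, units, flops_per_op):
    """Peak floating-point operations per cycle for one core."""
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_op

# Assumption: one add unit plus one multiply unit, each doing 1 flop per lane
ivy = flops_per_cycle(256, 32, units=2, flops_per_op=1)
# Assumption: two FMA units, each doing a multiply and an add (2 flops) per lane
haswell = flops_per_cycle(256, 32, units=2, flops_per_op=2)

theoretical_gain = haswell / ivy
observed_gain = 1.5  # "about 50% faster" in Julia and Mandel, per the text

print(f"Ivy Bridge: {ivy} flops/cycle/core")
print(f"Haswell:    {haswell} flops/cycle/core")
print(f"Theoretical gain: {theoretical_gain:.1f}x, observed: {observed_gain:.1f}x")
```

That's 16 versus 32 single-precision flops per cycle per core, a 2x peak ratio, against the roughly 1.5x we measured. The same ratio holds in double precision, since the lane count halves on both chips.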

One more test here, just because it's interesting. Tamas Miklos of FinalWire pointed this fact out to us, and we've confirmed it with our own testing: Haswell is a little slower than Ivy Bridge when running SinJulia, which uses extended-precision x87 code. Not a big deal, especially in light of the gains Haswell provides with new instructions, but it's something to note.