I think it was alluded to earlier, but all the PI benchmarks are probably as sensitive to latency as to FPU efficiency. I think you drew the wrong conclusion from the K7 to K8 jump; the move from a northbridge memory controller to an IMC probably contributed most of that speedup. Otherwise the execution latencies at the instruction level are not too dissimilar between the two. I usually use Agner's x86 instruction latency tables if I'm hand-coding some assembly. I don't know what the x87 instruction mix for SuperPi is, though. http://www.agner.org/optimize/instruction_tables.pdf
If K8's IMC is what provided most of the boost in SuperPi, then Core 2's excellent Pi performance becomes even more interesting. It does extremely well with an old-fashioned FSB. Moving the memory controller on-chip didn't give Intel much of a boost in this benchmark, even though Haswell is 4 generations ahead of Wolfdale.
In a talk called "Into the Core", which I think Pat Gelsinger gave around the time Core 2 launched (you should still be able to find it on the net), someone asked him when Intel was going to bring the memory controller on-chip, and he wasn't sure. His reason for not committing at the time was something like "because we have really great caches".
So my guess is that cache performance (helped by large L2 caches; the 45nm Core 2 had 6MB) masked the lack of an integrated memory controller.
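To put a rough shape on that guess, here is a minimal sketch of the usual average memory access time formula (hit time plus miss rate times miss penalty). The function name and every number in it are hypothetical and picked only for illustration; they are not measurements of any Core 2, K8, or Haswell part.

```python
def amat(hit_time_ns, miss_rate, miss_penalty_ns):
    """Average memory access time: hit time plus the miss-weighted penalty."""
    return hit_time_ns + miss_rate * miss_penalty_ns

# Hypothetical numbers, NOT measurements: they only show the trade-off.
# Case A: large cache (low miss rate) in front of a slow FSB/northbridge path.
big_cache_fsb = amat(hit_time_ns=5.0, miss_rate=0.02, miss_penalty_ns=100.0)

# Case B: smaller cache (higher miss rate) in front of a faster on-die controller.
small_cache_imc = amat(hit_time_ns=5.0, miss_rate=0.05, miss_penalty_ns=60.0)

print(f"large cache + FSB: {big_cache_fsb:.2f} ns")
print(f"small cache + IMC: {small_cache_imc:.2f} ns")
```

With made-up inputs like these, the lower miss rate more than pays for the slower trip to memory, which is the sort of effect I'm guessing at. Whether SuperPi's working set actually behaves that way is something you would have to measure.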
Intel Core i7 4790K, Z97, 16GB RAM, 128GB m4 SSD, 480GB M500 SSD, 500GB WD Vel, Intel HD4600, Corsair HX650, Fedora x64.
Thinkpad T460p, Intel Core i5 6440HQ, 8GB RAM, 512GB SSD, Intel HD 530 IGP, Fedora x64, Win 10 x64.