Beyond the core
With chips of this scale, the CPU cores are only a small part of the overall picture. The glue that binds everything together is also incredibly complex—and is crucial for performance to scale up with core count. Have a look at this diagram of the 18-core Haswell-EP part in order to get a sense of things.
Like I said: complex. Intel has used a ring interconnect through multiple generations of Xeons now, but the bigger versions of Haswell-EP actually double the ring count to two fully-buffered rings per chip. Intel's architects say this arrangement provides substantially more bandwidth, and they expect it to remain useful in the future when core counts rise above the current peak of 18.
The rings operate bidirectionally, and individual transactions always flow in the direction of the shortest path from point A to point B. The two rings are linked via a pair of buffered switches. These switches add a couple of cycles of latency to any transaction that must traverse one of them.
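To make the shortest-path rule concrete, here is a minimal sketch of how one might model it. The number of stops on the ring, the one-cycle-per-hop cost, and the two-cycle switch penalty are all illustrative assumptions on my part; Intel only says the switches add "a couple of cycles."

```python
from typing import Tuple


def ring_hops(src: int, dst: int, stops: int) -> Tuple[int, str]:
    """Return (hop count, direction) for the shorter way around a bidirectional ring."""
    if not (0 <= src < stops and 0 <= dst < stops):
        raise ValueError("stop index out of range")
    clockwise = (dst - src) % stops
    counterclockwise = (src - dst) % stops
    # Ties go clockwise here; the real tie-breaking rule is not documented in the article.
    if clockwise <= counterclockwise:
        return clockwise, "clockwise"
    return counterclockwise, "counterclockwise"


def estimated_cycles(hops: int, switch_crossings: int,
                     cycles_per_hop: int = 1, switch_penalty: int = 2) -> int:
    """Rough transit estimate. Both default costs are assumptions, not Intel figures."""
    return hops * cycles_per_hop + switch_crossings * switch_penalty


if __name__ == "__main__":
    # Hypothetical ring with 10 stops: going from stop 0 to stop 8 is shorter backwards.
    print(ring_hops(0, 8, 10))
    print(estimated_cycles(hops=3, switch_crossings=1))
```

The point of the sketch is simply that a bidirectional ring caps the worst-case trip at half the ring's length, which is why a few extra stops on one side cost less than you might expect.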
One thing you'll notice is that the ring layout, even in the big chip, is somewhat lopsided. There are eight cores on one ring and ten on the other. Each ring has its own memory controller, but only the left-side ring has access to PCIe connectivity and the QuickPath Interconnect to the other socket.
The 12-core chip seems even weirder, with half of one ring simply clipped off along with the six cores that used to reside there.
Such asymmetry just doesn't seem natural at first glance. Could it present a problem where one thread executes more quickly than another by virtue of its assigned core's location?
I think that would matter more if the chip weren't operating at billions of cycles per second, and if anything that happens via one of those off-chip interfaces weren't likely to be enormously slower than a few extra hops on the ring. When I raised the issue of asymmetry with Intel's architects, they pointed out that the latency for software-level thread switching is much, much higher than anything that happens in hardware. They further noted that Intel has had some degree of asymmetry in its CPUs since the advent of multi-core processors.
Also, notice that each core has 2.5MB of last-level cache associated with it. This cache is distributed across all cores, and its contents are shared, so that any core could potentially access data in any other cache partition. Thus, it's unlikely that any single core would be the most advantageous one to use by virtue of its location on the die.
For those folks who prefer to have precise control over how threads execute, the Haswell-EP Xeons with more than 10 cores offer a strange and intriguing alternative known as cluster-on-die mode. The idea here is that each ring on the chip operates almost like its own NUMA node, as each CPU socket does in this class of system. Each ring becomes its own affinity domain. The cores on each ring only "see" the last-level cache associated with cores on that ring, and they'll prefer to write data to memory via the local controller.
This mode will be selectable via system firmware, I believe, and is intended for use with applications that have already been tuned for NUMA operation. Intel says it's possible to achieve single-digit-percentage performance gains with cluster-on-die mode. I expect the vast majority of folks to ignore this mode and take the "it just works" option instead.
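For a sense of what "tuned for NUMA operation" looks like from software, here is a rough sketch for Linux. It reads the CPU list that the kernel publishes for each NUMA node under `/sys/devices/system/node` and pins the current process to one node's cores. With cluster-on-die enabled, each ring would presumably show up as its own node, but the node numbering and CPU IDs on any given box are assumptions you'd need to check. The helper names are mine, not part of any Intel tool.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set:
    """Parse the kernel's cpulist format, e.g. '0-3,8-11', into a set of CPU IDs."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node: int, sysfs_root: str = "/sys/devices/system/node") -> set:
    """Return the CPU IDs that Linux associates with the given NUMA node."""
    return parse_cpulist(Path(sysfs_root, f"node{node}", "cpulist").read_text())


def pin_to_node(node: int) -> None:
    """Restrict the calling process to one node's CPUs (Linux only)."""
    os.sched_setaffinity(0, node_cpus(node))


if __name__ == "__main__":
    # The string below is made-up sample input, not output from a real Haswell-EP system.
    print(sorted(parse_cpulist("0-3,8-11")))
```

Pinning CPUs is only half the story, since a NUMA-aware application also wants its memory allocated on the same node. Tools like `numactl` handle both, which is probably the easier route for most people.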
The small die with "only" eight cores has just one ring, with all four memory channels connected to a single home agent. This chip is no doubt the basis for Haswell-E products like the Core i7-5960X.
With this amount of integration, Xeons are increasingly becoming almost entire systems on a chip. Thus, a new generation means small upgrades here and there across that system. Haswell-EP raises the transfer rate of the QPI socket-to-socket interconnect to 9.6 GT/s, up from 8 GT/s before. The PCIe 3.0 controllers have been enhanced with more buffers and credits, so they can achieve higher effective transfer rates and better tolerate latency.
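To put the QPI bump in bandwidth terms, here's a quick back-of-the-envelope sketch. It assumes each QPI link carries two bytes of payload per transfer in each direction, which is the commonly cited figure for QPI but isn't something Intel spelled out in this briefing, so treat the absolute numbers as approximate.

```python
# Assumption: 16 data bits (2 bytes) per transfer, per direction, per QPI link.
QPI_BYTES_PER_TRANSFER = 2


def qpi_gb_per_s(gt_per_s: float) -> float:
    """Peak one-way bandwidth of a single QPI link in GB/s."""
    return gt_per_s * QPI_BYTES_PER_TRANSFER


def relative_gain(new_rate: float, old_rate: float) -> float:
    """Fractional improvement of new_rate over old_rate."""
    return new_rate / old_rate - 1.0


if __name__ == "__main__":
    for label, rate in (("Ivy Bridge-EP", 8.0), ("Haswell-EP", 9.6)):
        print(f"{label}: {qpi_gb_per_s(rate):.1f} GB/s per direction, per link")
    print(f"Gain: {relative_gain(9.6, 8.0):.0%}")
```

Whatever the per-transfer width, the ratio holds: going from 8 to 9.6 GT/s is a 20% increase in peak link bandwidth.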
The biggest change on this front, though, is the move to DDR4 memory. Each Haswell-EP socket has four memory channels, and those channels can talk to DDR4 modules at speeds of up to 2133 MT/s. That's slightly faster than the 1866 MT/s peak of DDR3 with Ivy Bridge-EP, but the real benefits of DDR4 go beyond that. This memory type operates at lower voltage (1.2V standard), has smaller pages that require less activation power, and employs a collection of other measures to improve power efficiency. The cumulative savings, Intel estimates, are about two watts per DIMM at the wall socket.
DDR4 also operates at higher frequencies with more DIMMs present—up to 1600 MT/s on Haswell-EP with three DIMMs per channel. Going forward, DDR4 should enable even higher transfer rates and bit densities. Memory makers already have 3200 MT/s parts in the works, and Samsung is exploiting DDR4's native support for die stacking to create high-performance 64GB DIMMs.
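The transfer rates above translate into peak bandwidth with simple arithmetic. The sketch below assumes the standard 64-bit (8-byte) data path per DDR channel and ignores ECC bits and real-world efficiency losses, so these are theoretical ceilings rather than measured results.

```python
BYTES_PER_TRANSFER = 8   # assumption: standard 64-bit data path per DDR channel
CHANNELS_PER_SOCKET = 4  # per the article, for both Ivy Bridge-EP and Haswell-EP


def peak_bandwidth_mb_s(mt_per_s: int, channels: int = CHANNELS_PER_SOCKET) -> int:
    """Theoretical peak memory bandwidth per socket in MB/s."""
    return mt_per_s * BYTES_PER_TRANSFER * channels


if __name__ == "__main__":
    configs = (
        ("DDR3-1866 (Ivy Bridge-EP peak)", 1866),
        ("DDR4-2133 (Haswell-EP peak)", 2133),
        ("DDR4-1600 (three DIMMs per channel)", 1600),
    )
    for label, rate in configs:
        print(f"{label}: {peak_bandwidth_mb_s(rate) / 1000:.1f} GB/s per socket")
```

By this math, the move from DDR3-1866 to DDR4-2133 raises the per-socket ceiling from roughly 59.7 GB/s to 68.3 GB/s, a gain of about 14%.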
Naturally, with the integration of the voltage regulators and the change in memory types, Haswell-EP also brings with it a new socket type. Dubbed Socket R3, this new socket isn't backward-compatible with prior Xeons at all, although it does have the same dimensions and attach points for coolers.
Accompanying Haswell-EP to market is an updated chipset—really just a single chip—with a richer complement of I/O ports. The chipset's code name is Wellsburg, but today, it officially gets the more pedestrian name of C612. I suspect it's the same chip known as the X99 in Haswell-E desktop systems. Wellsburg is much better endowed with high-speed connectivity than its predecessor; it sprouts 10 SATA 6Gbps ports and 14 USB ports, six of them USB 3.0-capable. The chipset's nine PCIe lanes are still stuck at Gen2 transfer rates, but lane grouping into x2 and x4 configs is now supported.
Intel is spinning the three Haswell-EP chips into a grand total of 29 different Xeon models. The new Xeons will be part of the E5 v3 family, whereas Ivy Bridge-EP chips are labeled E5 v2, and older Sandy Bridge-EP parts lack a trailing version number. There's a wide array of new products, and here is a confusing—but potentially helpful—slide that Intel is using to map out the lineup.
Prices range from $2,702 for the E5-2697 v3 to $213 for the E5-2603 v3. Well, that's not the entire range. Tellingly, Intel isn't divulging list prices for the top models, including the 18-core E5-2699 v3. I'm pretty sure that doesn't mean it's on discount.
Our attention today is focused primarily on workstation-class Xeons, specifically the 10-core Xeon E5-2687W v3, which we've tested against its two direct predecessors based on the Sandy Bridge-EP and Ivy Bridge-EP microarchitectures. Their specs look like so:
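|Model|Cores/threads|Base clock (GHz)|Max Turbo (GHz)|L3 cache (MB)|QPI speed|Memory channels|Memory type|TDP (W)|Price|
|---|---|---|---|---|---|---|---|---|---|
|Xeon E5-2687W|8/16|3.1|3.8|20|8.0 GT/s|4|DDR3-1600|150|$1,890|
|Xeon E5-2687W v2|8/16|3.4|4.0|25|8.0 GT/s|4|DDR3-1866|150|$2,112|
|Xeon E5-2687W v3|10/20|2.7/3.1|3.5|25|9.6 GT/s|4|DDR4-2133|160|$2,141|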
Note that there are two base frequencies listed for the E5-2687W v3. The base speed is 2.7GHz with AVX workloads and 3.1GHz without. The peak Turbo speed is 3.5GHz for both types of workloads, though.
At any rate, these Xeons are all gut-bustingly formidable processors, and they're intended to drop into dual-socket systems where the core counts and memory channels will double. That's a recipe for some almost ridiculously potent end-user systems. In fact, we have an example of just such a box on hand.