One of the funny things about Intel's workstation- and server-class Xeon processors is that we kind of think we know what's coming before each new generation arrives. For instance, the new generation of chips known as Haswell-EP is making its debut today, yet the Haswell microarchitecture has been shipping in client systems for over a year. The desktop derivative of this very silicon, Haswell-E, was introduced late last month, too.
What amazes me about the new Xeons, though, is how much more there is to them than one might have expected. Intel's architects and designers have crammed formidable new technologies into these chips in order to allow them to scale up to large core counts and multiple sockets. The result may be the most impressive set of CPUs Intel has produced to date, with numbers for core count and throughput that pretty much boggle the mind. Read on to see what makes Haswell-EP different—and better.
The Haswell-EP family
The first thing one needs to know about Haswell-EP is that it's not just a single chip, but a trio of chips. Intel has moved in recent years toward right-sizing its Xeon silicon for different products, and Haswell-EP takes that trend into new territory. Here are the three members of the family.
|Code name|Cores|Threads|L3 cache|Process|Transistors (millions)|Die area (mm²)|
|---|---|---|---|---|---|---|
|Haswell-EP|8|16|20 MB|22 nm|2601|354|
|Haswell-EP|12|24|30 MB|22 nm|3839|484|
|Haswell-EP|18|36|45 MB|22 nm|5569|662|
All three chips are fabbed on Intel's 22-nm process tech with tri-gate transistors, and they all share the same basic technological DNA. Intel has simply scaled them differently, with quite a bit of separation in terms of size and transistor count between the three options. The biggest of the bunch has a staggering 18 cores, 36 threads, and 45 MB of L3 cache. To give you some perspective on this CPU's size, at 662 mm², it's substantially larger than even the biggest GPUs in the world. Nvidia's GK110 is 555 mm², and AMD's Hawaii GPU is 438 mm².
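To put those die sizes in context, here is a quick back-of-the-envelope calculation in Python. It's only a sketch: it assumes the last two columns of the table above are transistor counts in millions and die areas in mm², and the names in it are mine, not Intel's.

```python
# Figures copied from the table and text above. The units are an assumption:
# transistor counts in millions, die areas in square millimeters.
HASWELL_EP = {
    8: {"transistors_m": 2601, "area_mm2": 354},
    12: {"transistors_m": 3839, "area_mm2": 484},
    18: {"transistors_m": 5569, "area_mm2": 662},
}
GPU_AREA_MM2 = {"GK110": 555, "Hawaii": 438}


def density(chip):
    """Millions of transistors per square millimeter."""
    return chip["transistors_m"] / chip["area_mm2"]


def area_ratio(cpu_cores, gpu_name):
    """How many times larger the given Haswell-EP die is than the named GPU."""
    return HASWELL_EP[cpu_cores]["area_mm2"] / GPU_AREA_MM2[gpu_name]


for cores, chip in HASWELL_EP.items():
    print(f"{cores:2d} cores: {density(chip):.2f} M transistors per mm^2")
for gpu in GPU_AREA_MM2:
    print(f"18-core die vs {gpu}: {area_ratio(18, gpu):.2f}x the area")
```

By this rough math, the 18-core die has roughly 1.2 times the area of GK110 and roughly 1.5 times the area of Hawaii.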
The prior generation of Xeons, code-named Ivy Bridge-EP, topped out at 12 cores, so Haswell-EP offers a 50% increase on that front. Haswell-EP is a "tock" in Intel's so-called "tick-tock" development model, which means it brings a new CPU architecture to a familiar chip fabrication process. There's quite a bit more to this new family than just a revised CPU microarchitecture, though. The entire platform has been reworked, as the diagram below summarizes.
The changes really do begin with the transition to Haswell-class CPU cores. These are indeed the same basic cores used across Intel's product portfolio, and by now, their virtues are well known. Through a combination of larger on-chip structures, more execution units, and smarter logic, the Haswell core increases its instruction throughput per clock by about 10% compared to Ivy Bridge before it. That number can go much higher with the use of the new AVX2 instruction set extensions, which have the potential to double vector throughput for both integer and floating-point data types.
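A rough sketch of where that doubling comes from may help. It computes theoretical per-core peaks only, and it assumes the commonly cited execution resources: Ivy Bridge with one 256-bit floating-point add unit and one 256-bit multiply unit, Haswell with two 256-bit fused multiply-add units (each FMA counted as two operations), and integer SIMD widening from 128 to 256 bits with AVX2. The function names are illustrative.

```python
def lanes(vector_bits, element_bits):
    """Number of elements processed by one vector instruction."""
    return vector_bits // element_bits


def peak_ops_per_cycle(vector_bits, element_bits, units, ops_per_lane):
    """Theoretical peak operations per core per clock cycle."""
    return lanes(vector_bits, element_bits) * units * ops_per_lane


# Double-precision floating point (64-bit elements).
# Assumption: Ivy Bridge pairs one add unit with one multiply unit.
ivy_bridge_fp = peak_ops_per_cycle(256, 64, units=2, ops_per_lane=1)
# Assumption: Haswell has two FMA units; an FMA counts as two operations.
haswell_fp = peak_ops_per_cycle(256, 64, units=2, ops_per_lane=2)

# 32-bit integers: 128-bit integer SIMD before AVX2, 256-bit with it.
int_lanes_before = lanes(128, 32)
int_lanes_avx2 = lanes(256, 32)

print("FP ratio:", haswell_fp / ivy_bridge_fp)
print("Integer lane ratio:", int_lanes_avx2 / int_lanes_before)
```

Real code rarely hits these peaks, of course, since memory bandwidth and instruction mix tend to get in the way first.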
For servers in particular, the Haswell core has the potential to boost performance even further via the TSX instruction set extensions, which enable hardware lock elision and restricted transactional memory. The TSX instructions allow the hardware to shoulder much of the burden of making sure concurrent threads don't cause problems for one another. Unfortunately, Intel discovered an erratum in its TSX implementation just prior to the release of Haswell-EP. As a result, the first systems based on this silicon have shipped with TSX disabled via microcode. Users may have the option to enable TSX in a system's BIOS for development purposes, but doing so risks system instability. I'd expect Intel to produce a new stepping of Haswell-EP with the TSX erratum corrected, but we don't yet have a clear timetable for such a move. The firm has hinted that TSX should be production-ready once the larger, multi-socket Haswell-EX parts arrive.
The new generation of Xeons has much to recommend it even without TSX. One of the most notable innovations in Haswell-era chips is the incorporation of voltage regulation circuitry directly onto the CPU die. The integrated VR, which Intel calls FIVR for "fully integrated voltage regulator," allows for more efficient operation along several lines. Voltage transitions with FIVR can be much quicker than with an external VR, and FIVR has many more supply lines, allowing for fine-grained control of power delivery across the chip. The integrated VRs can also reduce the physical footprint of the CPU and its support circuitry.
The advent of FIVR grants Haswell-EP increased dynamic operating range versus its predecessors. For instance, each individual core on the processor can maintain its own power state, or P-state, with its own clock speed and supply voltage. In Ivy Bridge-EP and earlier parts, all of the cores share a common frequency and voltage. This per-core P-state feature operates in the margins between idle (power is gated off individually to idle cores) and peak core utilization. Dropping a partially used core to an intermediate P-state via this mechanism can free up some thermal headroom for another, busier core to move to a higher frequency via Turbo, so the payoff ought to be more efficiency and performance.
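The headroom argument can be illustrated with a toy model. This is a sketch, not Intel's power-management algorithm: it uses the textbook approximation that dynamic power scales with frequency times voltage squared, and the P-state table, the scaling constant, and the power budget are all made-up values chosen for illustration.

```python
# Hypothetical P-state table: (frequency in GHz, supply voltage in volts).
P_STATES = {"low": (1.2, 0.75), "mid": (2.3, 0.90), "turbo": (3.3, 1.10)}
K = 5.0              # made-up constant, in watts per (GHz * V^2)
BUDGET_WATTS = 40.0  # made-up package power budget


def core_power(state):
    """Approximate dynamic power of one core: K * f * V^2."""
    freq, volt = P_STATES[state]
    return K * freq * volt ** 2


def package_power(states):
    """Total power for a list of per-core P-state names."""
    return sum(core_power(s) for s in states)


def headroom(states, budget=BUDGET_WATTS):
    """Budget left over after powering every core in its current state."""
    return budget - package_power(states)


# Shared clocks: lifting the one busy core to turbo drags all four along.
shared = ["turbo"] * 4
# Per-core P-states: the three lightly used cores drop to a low state.
per_core = ["turbo", "low", "low", "low"]

print(f"shared clocks:   {headroom(shared):+.1f} W of headroom")
print(f"per-core states: {headroom(per_core):+.1f} W of headroom")
```

In this toy setup, running all four cores at the turbo state blows through the budget, while letting three cores sit in a low state leaves room for the busy one to run flat out.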
We've seen this sort of independent core clocking run into problems in the past, notably in AMD's Barcelona-based processors, but Intel's architects are confident that Haswell-EP's P-state transitions happen quickly enough and have few enough penalties to make this feature worthwhile. At present, per-core P-states are only being used in server- and workstation-class CPUs, not in client-focused products where immediate responsiveness is a top priority.
FIVR also offers a separate supply rail to the "uncore" complex that handles internal and external communication. As a result, the uncore is now clocked independently of the cores. It can run at higher frequencies when bandwidth is at a premium, even if the CPU cores are lightly utilized, and the situation can be reversed when I/O demands decrease and the CPU cores are fully engaged.
The Turbo Boost algorithm that controls the chip's clocking behavior has grown a little more sophisticated, as well. One addition is what Intel calls "Energy Efficient Turbo." The power control routine now monitors the activity of each core for throughput and stalls. If it decides that raising the clock speed of a core wouldn't be energy efficient—presumably because the core's present activity is gated by external factors or is somehow inefficient—the Turbo mechanism will choose not to raise the speed.
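Intel hasn't detailed the heuristic, so the following is only a guess at the general shape of such a decision, not the actual algorithm. It assumes that time spent stalled on external factors doesn't shrink when the core clock rises, estimates the resulting speedup Amdahl-style, and declines the Turbo bump when the estimated gain falls below a threshold. The function names, the 20% clock bump, and the threshold are all hypothetical.

```python
def estimated_speedup(stall_fraction, freq_ratio):
    """Estimate speedup when only the non-stalled share of time scales with clock."""
    busy = 1.0 - stall_fraction
    return 1.0 / (busy / freq_ratio + stall_fraction)


def grant_turbo(stall_fraction, freq_ratio=1.2, min_gain=1.1):
    """Raise the clock only if the estimated speedup justifies the extra power.

    freq_ratio and min_gain are illustrative values, not Intel's.
    """
    return estimated_speedup(stall_fraction, freq_ratio) >= min_gain


for stalls in (0.0, 0.4, 0.8):
    gain = estimated_speedup(stalls, 1.2)
    print(f"stalled {stalls:.0%} of the time: speedup {gain:.3f}, "
          f"turbo granted: {grant_turbo(stalls)}")
```

A core that spends most of its time waiting gains almost nothing from a faster clock in this model, which is the sort of case where skipping the bump saves power at little cost.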
The final tweak to Haswell-EP's dynamic operating strategy came as a surprise to me. As you can see illustrated on the right, Haswell-EP processors will operate at lower frequencies when processing AVX instructions. The fundamental reality here is that those 256-bit-wide AVX vector units are big, beefy hardware. They chew up a lot of power, and so they require some concessions. As with regular Turbo operation, the chip will seek as high a clock speed as possible within its defined limits during AVX processing; those limits are just lower. Intel says the CPU will return to its regular, non-AVX operating mode one millisecond after the completion of the last AVX instruction in a stream.
Intel has defined the base and Turbo peak AVX frequencies for each of the new Xeons, and it says it will publish those speeds for all to see. As of now, though, I have yet to see AVX clock speeds listed in any of Intel's pre-launch press information. I expect we'll hear more on this front soon.
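The AVX clocking behavior described above boils down to a simple rule with a one-millisecond hold-off. Here's a minimal sketch of that rule. Since Intel hasn't published the AVX frequencies yet, the two clock ceilings below are placeholders I made up, and the function is an illustration rather than a description of the real control logic.

```python
AVX_HOLD_US = 1000  # one millisecond, per Intel's description


def turbo_ceiling_ghz(now_us, last_avx_us, regular_ghz=3.3, avx_ghz=3.0):
    """Return the clock ceiling that applies at time now_us (microseconds).

    last_avx_us is when the most recent AVX instruction completed, or None
    if no AVX code has run. Both GHz defaults are placeholder values.
    """
    if last_avx_us is not None and now_us - last_avx_us < AVX_HOLD_US:
        return avx_ghz       # still inside the AVX window: lower limits apply
    return regular_ghz       # window expired: back to regular limits


for t in (0, 500, 999, 1000, 2000):
    print(f"t={t:4d} us after last AVX op: ceiling {turbo_ceiling_ghz(t, 0)} GHz")
```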
The move to Haswell cores has also brought with it some benefits for virtualization performance. The amount of time needed to enter and to exit a virtual machine has shrunk, as it has fairly consistently over time with successive CPU generations. The result should be a general increase in VM performance. Haswell-EP also allows the shadowing of VM control structures, which should improve the efficiency of VM management and the like.
Perhaps the niftiest bit of new tech for virtualization can apply to other uses, as well. Haswell-EP has hooks built in for the monitoring of cache allocation by thread. In a VM context, this capability should allow hypervisors to expose information that would let sysadmins identify "noisy neighbor" VMs that thrash the cache and may cause problems for other VMs on the same system. Once identified, these troublesome VMs could be moved or isolated in order to prevent cache contention problems from affecting other virtual machines.
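Once a hypervisor exposes per-VM cache occupancy, spotting a noisy neighbor could be as simple as the sketch below. It doesn't call any real hypervisor or Intel interface: the VM names and occupancy figures are invented, and the 50% threshold is an arbitrary choice for illustration.

```python
def noisy_neighbors(occupancy_kb, cache_kb, max_share=0.5):
    """Return the names of VMs occupying more than max_share of the cache."""
    return sorted(
        vm for vm, used_kb in occupancy_kb.items()
        if used_kb / cache_kb > max_share
    )


L3_KB = 45 * 1024  # the 18-core part's 45 MB of L3 cache

# Invented sample readings: L3 occupancy per VM, in kilobytes.
sample = {"vm-web": 4096, "vm-db": 30720, "vm-batch": 8192}

print(noisy_neighbors(sample, L3_KB))
```

A real tool would presumably sample occupancy over time rather than act on a single reading, but the basic idea is the same.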