Another year has passed, and once again, it's time for Intel to unveil a new generation of processors. Somehow, it seems like we've been waiting longer than usual for this latest refresh. Dunno why that is. Perhaps it's because the "tocks" in Intel's vaunted "tick-tock" development cadence tend to be the more exciting technologies, new CPU architectures that promise major change for the better. Last year's "tick," under the code name Ivy Bridge, brought some nice reductions in power consumption thanks to the transition to an advanced 22-nm fabrication process. This year's "tock," code-named Haswell, packs even more change: enhanced CPU cores, faster graphics, and some sweeping modifications to the PC platform itself. The goal, in part, is to shoehorn Intel's fastest CPU technology into power envelopes suitable for ultra-thin laptops and tablets. Undoubtedly, Haswell is the cornerstone of Intel's bid for relevance in a new, mobile-centric market. The benefits spill over onto the desktop, though, in various and surprising ways.
New architectures, both micro and macro
Superficially, the Haswell chip represented in the die shot above looks an awful lot like the Sandy Bridge and Ivy Bridge generations that came before it. All three of 'em have four cores, with the same size caches at each level of the hierarchy, down to the 8MB L3. They each have graphics, a memory controller, and PCI Express integrated, too, with the CPU cores and those other elements linked together by a high-speed ring interconnect. Even at the die level, Haswell doesn't look that much different from Ivy.
| Code name | Key products | Cores | Threads | Last-level cache | Process node (nm) | Est. transistors (millions) | Die area (mm²) |
|---|---|---|---|---|---|---|---|
| Lynnfield | Core i5, i7 | 4 | 8 | 8 MB | 45 | 774 | 296 |
| Gulftown | Core i7-9xx | 6 | 12 | 12 MB | 32 | 1168 | 248 |
| Sandy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 32 | 995 | 216 |
| Sandy Bridge-E | Core i7-39xx | 8 | 16 | 20 MB | 32 | 2270 | 435 |
| Ivy Bridge | Core i5, i7 | 4 | 8 | 8 MB | 22 | 1200 | 160 |
| Haswell (Quad GT2) | Core i5, i7 | 4 | 8 | 8 MB | 22 | 1400 | 177 |
| Deneb | Phenom II | 4 | 4 | 6 MB | 45 | 758 | 258 |
| Thuban | Phenom II X6 | 6 | 6 | 6 MB | 45 | 904 | 346 |
| Llano | A8, A6, A4 | 4 | 4 | 1 MB x 4 | 32 | 1450 | 228 |
| Trinity | A10, A8, A6 | 2 | 4 | 2 MB x 2 | 32 | 1303 | 246 |
Both Ivy Bridge and Haswell are built using Intel's 22-nm fab process with "tri-gate" transistors, and both are relatively small in size. Although it has grown a little compared to Ivy, the quad-core desktop Haswell is still quite a bit smaller than any of the other chips in our table, which dates back several generations.
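For a rough sense of scale, the die-size comparison can be worked out from the table's last two columns. The sketch below assumes those columns are the estimated transistor count in millions and the die area in square millimeters; the dictionary and function names are ours, purely for illustration.

```python
# Rows from the table above: name -> (transistors in millions, die area in mm^2).
# The column meanings are an assumption on our part.
CHIPS = {
    "Lynnfield": (774, 296),
    "Gulftown": (1168, 248),
    "Sandy Bridge": (995, 216),
    "Sandy Bridge-E": (2270, 435),
    "Ivy Bridge": (1200, 160),
    "Haswell (Quad GT2)": (1400, 177),
    "Deneb": (758, 258),
    "Thuban": (904, 346),
    "Llano": (1450, 228),
    "Trinity": (1303, 246),
}


def density(name):
    """Millions of transistors per square millimeter for one chip."""
    transistors, area = CHIPS[name]
    return transistors / area


# Chip names ordered from smallest die to largest.
by_area = sorted(CHIPS, key=lambda name: CHIPS[name][1])

# Fractional die-area growth going from Ivy Bridge to Haswell.
area_growth = CHIPS["Haswell (Quad GT2)"][1] / CHIPS["Ivy Bridge"][1] - 1

if __name__ == "__main__":
    for name in by_area:
        print(f"{name:20s} {CHIPS[name][1]:4d} mm^2  {density(name):5.2f} M/mm^2")
    print(f"Haswell die area vs. Ivy Bridge: +{area_growth:.1%}")
```

By this arithmetic, Haswell's die is about 11% larger than Ivy's, and the two 22-nm chips sit well below everything else in the table.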
At this level, Haswell doesn't look too different from what has come before. Drill in a little deeper, though, and many of the components have been updated. Let's start by looking at those CPU cores. Haswell isn't a complete microarchitectural rebuild like Sandy Bridge. Instead, the Haswell core builds on the foundation established by Sandy and Ivy Bridge. The main pipeline depth is unchanged, but Intel's architects have implemented a series of evolutionary modifications intended to increase per-clock throughput. Some of those measures are aimed at raising instruction throughput in existing code, while others require the use of new instructions.
The list of tweaks aimed at improving performance with legacy code begins with some items that will look familiar to those who track CPU development. They're the sorts of things CPU architects do with growing transistor budgets in order to extract more parallelism out of a single thread. Fetch bandwidth is higher in Haswell's front end. Branch prediction accuracy is up. The window for out-of-order execution is larger, with a corresponding size increase in related structures. Switching latencies for hardware virtualization support have been reduced. We've seen these sorts of changes with regularity over time, although it's worth remembering that Intel's CPU cores already have some of the richest feature sets and highest rates of per-clock throughput anywhere.
In a more obvious microarchitectural modification, Haswell's execution core is even wider than its predecessor's. The number of ports in the reservation station has risen from six to eight, with the added ports feeding a couple of new execution units: another integer ALU/branch unit and a store address unit. When possible, Haswell should be able to rearrange even more components of an incoming instruction stream and feed them through this wider machine in parallel. This feat is one of the most difficult tasks in modern CPU design, and Intel continues to make earnest efforts to push the boundaries forward.
The new core also raises the performance ceiling by adding several new instructions for faster processing of vector math. Sandy Bridge doubled the floating-point throughput of the prior Nehalem generation by trading SSE's 128-bit-wide vectors for AVX's 256-bit vectors. Haswell adds AVX2, which extends 256-bit vector support to integer operations, along with fused multiply-add (FMA) instructions. Media-centric workloads frequently pair multiply and add operations together, and fusing the two into a single instruction can mean twice the FLOPS executed in a clock cycle. AMD's Bulldozer and Piledriver cores were the first x86 processors to incorporate FMA support, but a Bulldozer/Piledriver module can only process a single 256-bit FMA per cycle. A Sandy Bridge core can produce a 256-bit add and a 256-bit multiply in each cycle, for a comparable peak FLOPS rate. Each Haswell core can execute two 256-bit FMAs per cycle, double Bulldozer's and Sandy's peaks and four times Nehalem's. The catch, of course, is that software will have to be recompiled to take advantage of the AVX2 and FMA instructions.
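The peak-rate comparison is easy to check with a little arithmetic. This is a back-of-the-envelope sketch, assuming single-precision (32-bit) elements and counting an FMA as two floating-point operations per lane. The Nehalem entry (one 128-bit add plus one 128-bit multiply per cycle) is inferred from the doubling described above, and the function and dictionary names are illustrative only.

```python
def peak_flops_per_cycle(vector_bits, adds=0, muls=0, fmas=0, element_bits=32):
    """Peak floating-point operations per clock for one core or module.

    adds, muls, and fmas are the number of vector instructions of each
    kind that can be issued per cycle. Each FMA counts as two operations
    (a multiply and an add) per vector lane.
    """
    lanes = vector_bits // element_bits
    return lanes * (adds + muls + 2 * fmas)


# Per-cycle issue assumptions taken from the text (Nehalem is inferred).
PEAKS = {
    "Nehalem core": peak_flops_per_cycle(128, adds=1, muls=1),
    "Sandy Bridge core": peak_flops_per_cycle(256, adds=1, muls=1),
    "Bulldozer/Piledriver module": peak_flops_per_cycle(256, fmas=1),
    "Haswell core": peak_flops_per_cycle(256, fmas=2),
}

if __name__ == "__main__":
    for name, flops in PEAKS.items():
        print(f"{name:28s} {flops:2d} single-precision FLOPS/cycle")
```

Under those assumptions the peaks work out to 8, 16, 16, and 32 single-precision FLOPS per cycle. Switching to 64-bit elements halves every number, so the ratios stay the same.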
Haswell needed more bandwidth in order to service twice the FLOPS per clock, so Intel has revised its cache hierarchy. The L1 cache now has fewer restrictions related to banking. The L2 cache, which in Sandy Bridge was accessible on every other clock cycle, can now be read on each cycle. The result, Intel claims, is that Haswell's L1 and L2 caches both offer roughly double the bandwidth at the same access latencies as Sandy Bridge.
One of the more intriguing new technologies built into Haswell is another set of ISA extensions known as TSX, an enabler for hardware transactional memory. TSX has the potential to ease the development of highly multithreaded applications. Unfortunately, Intel seems to be playing product segmentation games with this feature, disabling it selectively in key models of the Core i5 and i7, including the K-series processors targeted at enthusiasts. Intel has established a tradition of putting too many knobs and dials into its chips and fine-tuning its product offerings aggressively enough to be positively confusing and off-putting to the consumer. The decision to play this game with TSX support may be the worst (or is it the best?) example yet of overdoing it.
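To give a flavor of why transactional memory appeals to programmers, here is a software analogy of the basic idea: do the work optimistically without holding a lock, then commit only if nobody else changed the data in the meantime, and retry otherwise. To be clear, this is not TSX and uses none of its instructions. It is a minimal sketch of optimistic concurrency in plain Python, and every name in it (`VersionedCell`, `transact`, `run_demo`) is hypothetical.

```python
import threading


class VersionedCell:
    """A shared value plus a version counter bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._commit_lock = threading.Lock()

    def try_commit(self, seen_version, new_value):
        """Store new_value only if no other commit happened since seen_version."""
        with self._commit_lock:
            if self.version != seen_version:
                return False  # conflict: another thread committed first
            self.value = new_value
            self.version += 1
            return True


def transact(cell, update):
    """Apply update() optimistically, retrying until the commit succeeds.

    Returns the number of attempts that were needed.
    """
    attempts = 0
    while True:
        attempts += 1
        seen_version = cell.version      # read the version first...
        new_value = update(cell.value)   # ...then compute from the value
        if cell.try_commit(seen_version, new_value):
            return attempts


def run_demo(threads=4, increments=1000):
    """Have several threads increment one shared counter transactionally."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            transact(cell, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell


if __name__ == "__main__":
    result = run_demo()
    print(result.value, result.version)
```

The sketch still takes a short lock to validate and commit. The promise of hardware transactional memory is that the processor tracks conflicts itself, so the common, conflict-free case can proceed without that kind of software bookkeeping.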