A quick primer on Sandy Bridge

Intel's next architecture reveals its secrets

In a bit of a strange move, Intel disclosed next to nothing about its upcoming Sandy Bridge processor during the opening IDF keynote last week, which you'll know if you vigilantly refreshed the page as my live blog of the speech descended into tragic irrelevance and hairdo critiques. It has been quite some time since we were this close to the release of an Intel processor (Sandy Bridge-based CPUs are expected to arrive right as we ring in 2011) without a sense of its basic internals. Fortunately, Intel did finally disclose many of the architectural details of Sandy Bridge later at IDF, during the technical sessions led by Sandy Bridge architects. We had the good fortune to attend some of them, but I've been traveling and unable to gather my thoughts on what we learned until now.

The first things to know about Sandy Bridge are that it's a chip built using Intel's high-speed 32-nm chip fabrication process, with initial variants expected to have four traditional CPU cores, an integrated graphics processor, cache, and a memory controller located together on the same piece of silicon. Intel essentially skipped building a quad-core processor at 32-nm, opting to accelerate the schedule for Sandy Bridge instead. We've long known most of the above: that Sandy Bridge would include integrated graphics, that it would require a new CPU socket and motherboards, and that it would support Intel's AVX instructions for faster vector processing of media workloads and the like. The mystery has been pretty much everything else beyond those preliminaries.

Sandy Bridge in the flesh. Source: Intel.

A substantially new microarchitecture
That mystery, it turns out, is pretty juicy, because Sandy Bridge is part of the unprecedented wave of brand-new x86 microprocessor architectures hitting the market. Just weeks after AMD disclosed the outlines of its Bulldozer and Bobcat cores, Intel has offered us an answer in the form of its own substantially new microarchitecture.

Now, making a claim like I just did is fraught with peril, since new chip designs almost inevitably build on older ones, especially when you're talking about Intel CPUs. That's the thing about Sandy Bridge, though: one of its architects proclaimed at IDF that it was essentially a from-the-ground-up rebuild of the out-of-order and floating-point execution engines. Such changes were necessary to accommodate the doubled vector width of the AVX instruction set, and they mean something fairly momentous. As my friend David Kanter observed, this is, at long last, the breaking point where one can finally say virtually nothing remains of the P6 (Pentium Pro) roots that have undergirded everything from the Conroe/Merom Core 2 to the Nehalem/Westmere Core i-series processors.

Not only has the execution engine changed, but nearly everything around it has been replaced with new logic, as well, from the front-end and branch predictor to the memory execution unit. Outside of Sandy Bridge's CPU cores, the "glue" logic on the chip is all new, too. The inter-core connections, memory controller, and power management microcontroller have been tailored to accommodate the presence of a graphics processor. Even the integrated graphics engine bears little resemblance to what has come before. If you're looking for a golden age of CPU design, we're living in it, folks.

The most monumental change in Sandy Bridge has to be the incorporation of graphics onto the CPU die, and Intel has almost assuredly gone further toward deep integration than AMD did in its Ontario "fusion" chips. Still, that step feels almost like an afterthought, as part of a logical progression like the integration of the memory controller and PCIe logic in the past few generations. The IGP here is more of an application-specific accelerator, not a true co-processor for data-parallel computation. Such lofty goals will have to wait for later generations. For now, the biggest opportunities for head-turning progress come from the sweeping changes to Sandy Bridge's CPU microarchitecture, where smart new logic may deliver formidable increases in per-clock performance.

The CPU front-end looks fairly similar to Merom or Nehalem from a high-altitude, block-diagram sort of view. The instruction cache is 32KB in size, and the decoder that turns CISC-style x86 instructions into RISC-like internal "micro-ops" can still process four instructions per cycle in most cases. Intel's architects point to two key changes here.

The first is that rebuilt branch predictor. In most processors, the branch prediction unit uses a clever algorithm to "guess" what path a program will take prior to execution and then feeds the out-of-order engine with instructions to be processed speculatively. If it guesses right, the result is higher delivered performance, but if it guesses wrong, the results must be discarded and the proper program path must be executed instead, leading to a considerable performance hit. Modern CPUs have very accurate branch predictors, causing some folks to wonder whether pushing further on this front makes sense. Sandy Bridge's architects suggested thinking about the problem not as a question of how much better one can do when one is already at 96% accuracy. Instead, one should think in terms of reducing mispredictions, where a change from, say, 7% to 4% represents an improvement of over 40%. With that in mind, they attacked the branch prediction problem anew in Sandy Bridge to achieve even lower rates of error. Unfortunately, we didn't get any hard numbers on the accuracy of the new branch predictor, but it should be superior to Nehalem's.
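The architects' framing is easy to check with a little arithmetic. Here's a minimal Python sketch that compares the two ways of looking at the same change. The 7% and 4% rates are the illustrative figures from the argument above, not measurements of any real chip, and the function names are my own.

```python
def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which a rate shrinks when it moves from old_rate to new_rate."""
    return (old_rate - new_rate) / old_rate


def relative_gain(old_value: float, new_value: float) -> float:
    """Fraction by which a value grows when it moves from old_value to new_value."""
    return (new_value - old_value) / old_value


# Illustrative misprediction rates from the text, not measured data.
old_miss, new_miss = 0.07, 0.04

# Viewed as accuracy, 93% -> 96% looks like a small step.
print(f"accuracy gain:           {relative_gain(1 - old_miss, 1 - new_miss):.1%}")  # 3.2%

# Viewed as mispredictions, 7% -> 4% removes well over a third of them.
print(f"misprediction reduction: {relative_reduction(old_miss, new_miss):.1%}")  # 42.9%
```

Since each misprediction costs a pipeline flush, the second number is the one that tracks the work thrown away.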

The other innovation of note in Sandy Bridge's front end is the addition of a cache for decoded micro-ops. Old-school CPU geeks may recognize this mechanism from a similar one, called the execution trace cache, used in the Pentium 4. Again, this provision is a nod to the fact that modern x86 processors don't execute CISC-style x86 instructions natively, preferring instead to translate them into their own internal instruction sets. The idea behind this new cache is to store instructions in the form of the processor's internal micro-ops, after they've been processed by the decoders, rather than storing them as x86 instructions. Doing so can reduce pressure on the decoders and, I believe, improve the chip's power efficiency in the process. Unlike the Pentium 4, Sandy Bridge retains robust decode logic that it can call on when needed, so the presence of a micro-op cache should be a straightforward win, with few to no performance trade-offs.
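To make the idea concrete, here's a toy Python model of a decoded micro-op cache sitting in front of a decoder. Everything in it is hypothetical: the `decode` function, the `MicroOpCache` class, the made-up instruction strings, and the absence of any capacity limit or eviction policy. Intel didn't describe the real structure at this level, so treat this as an illustration of the principle only.

```python
from typing import Dict, Tuple


def decode(x86_instruction: str) -> Tuple[str, ...]:
    """Hypothetical decoder: splits a made-up instruction string into made-up micro-ops."""
    return tuple(f"uop:{part.strip()}" for part in x86_instruction.split(";"))


class MicroOpCache:
    """Toy cache keyed by instruction address; unbounded, with no eviction (an assumption)."""

    def __init__(self) -> None:
        self._store: Dict[int, Tuple[str, ...]] = {}
        self.decodes = 0  # times the decoder had to run
        self.hits = 0     # times decoded micro-ops were reused

    def fetch(self, address: int, x86_instruction: str) -> Tuple[str, ...]:
        if address in self._store:
            self.hits += 1
            return self._store[address]
        # Miss: fall back to the decoder, then remember its output.
        self.decodes += 1
        uops = decode(x86_instruction)
        self._store[address] = uops
        return uops


# A hypothetical three-instruction loop body executed 100 times.
loop_body = {0x10: "load", 0x14: "add", 0x18: "cmp; jmp"}
cache = MicroOpCache()
for _ in range(100):
    for address, instruction in loop_body.items():
        cache.fetch(address, instruction)

print(cache.decodes, cache.hits)  # 3 297
```

In a hot loop, the decoder runs once per instruction and then sits idle, which is where the reduced decoder pressure comes from.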

To find the feature with the largest impact on Sandy Bridge performance, though, one has to look beyond the front end to the memory execution units. In Nehalem, those units have three ports, but only one can do loads, so the chip is capable of a single load per cycle. In Sandy Bridge, the load/store units are symmetric, so the chip can execute two 128-bit loads per cycle. Store and cache bandwidth is higher, as well. Removing these constraints and doubling the number of loads per cycle allows Sandy Bridge to feed its formidable execution engine more fully, resulting in more work completed. This and the other improvements discussed above should lead to general performance increases, even in familiar tasks where we haven't necessarily seen much improvement in per-clock performance in recent years.
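For a rough sense of scale, here's a back-of-the-envelope Python sketch of peak load bandwidth per core. The Sandy Bridge figure of two 128-bit loads per cycle comes from the description above. The assumption that Nehalem's single load per cycle is also 128 bits wide is mine, and the function name is made up for illustration.

```python
BYTES_PER_128_BIT_LOAD = 128 // 8  # 16 bytes


def peak_load_bytes_per_cycle(loads_per_cycle: int,
                              bytes_per_load: int = BYTES_PER_128_BIT_LOAD) -> int:
    """Peak bytes loaded per clock cycle by one core."""
    return loads_per_cycle * bytes_per_load


nehalem = peak_load_bytes_per_cycle(1)       # assumes 128-bit loads on Nehalem, too
sandy_bridge = peak_load_bytes_per_cycle(2)  # two 128-bit loads per cycle, per the text

print(f"Nehalem:      {nehalem} bytes/cycle")
print(f"Sandy Bridge: {sandy_bridge} bytes/cycle")
print(f"Ratio:        {sandy_bridge / nehalem:.1f}x")
```

These are peak numbers only. Sustained load bandwidth depends on cache hit rates and on whether the code can actually issue two loads every cycle.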

Of course, programs that make use of the AVX instruction set may see even larger gains, thanks to Sandy Bridge's ability to process more data in parallel via wider, 256-bit vectors. AVX should benefit some familiar workload types, including graphics and media processing, where the data to be processed can be grouped together in large blocks. We've known the outlines of Sandy Bridge's abilities here for a while, including the potential to execute a 256-bit floating-point add and a 256-bit floating-point multiply concurrently in the same clock cycle. At IDF, we got a better sense of how complete an AVX implementation Sandy Bridge really has, right down to a physical register file to store those 256-bit vectors. This chip should be in a class of its own on this front, at least until AMD's Bulldozer arrives later in 2011. Even then, Bulldozer will have half the peak AVX throughput of Sandy Bridge and may only catch up when programs make use of AMD's fused multiply-add (FMA) instruction—which only Bulldozer will support.
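Those peak numbers fall out of simple arithmetic. The Python sketch below counts floating-point operations per cycle for one core, assuming one vector add and one vector multiply issue together each cycle, as described above. The helper names are my own, and the 128-bit line is included only as a point of comparison for the doubled vector width, not as a claim about any particular chip.

```python
def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         vector_ops_per_cycle: int = 2) -> int:
    """Peak FLOPs per cycle per core; the default of 2 models one add plus one multiply."""
    return lanes(vector_bits, element_bits) * vector_ops_per_cycle


for name, bits in (("128-bit vectors", 128), ("256-bit AVX", 256)):
    single = peak_flops_per_cycle(bits, 32)  # 32-bit single-precision floats
    double = peak_flops_per_cycle(bits, 64)  # 64-bit double-precision floats
    print(f"{name}: {single} SP FLOPs/cycle, {double} DP FLOPs/cycle")
```

Multiply by clock speed and core count for a chip-level peak. Real code will land below that figure unless it keeps both the add and multiply units busy on every cycle.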

The pathways connecting Sandy Bridge's cores together have expanded to enable this increased throughput thanks to a new ring-style interconnect that links the CPU cores, graphics, last-level cache, and memory controller. Intel first used such a ring topology to connect the eight cores of the ultra-high-end Nehalem-EX processor. That concept has been borrowed and refined in Sandy Bridge. The chip's architects saw the need for a high-bandwidth interconnect to allow CPU cores and the IGP to share the cache and memory controller, and they liked the ring concept because of its potential to scale up and down along with the modular elements of the architecture. Because each core has some L3 cache and a ring stop associated with it, cache bandwidth grows with the core count. At 3GHz, each stop can transfer up to 96 GB/s, so a dual-core Sandy Bridge implementation peaks at 192 GB/s of last-level cache bandwidth, while the quad-core variant peaks at a torrential 384 GB/s.
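The scaling is straightforward to reproduce. In this Python sketch, the 96 GB/s per stop at 3GHz comes from Intel's figures above. The 32 bytes per cycle is my own inference from dividing one by the other, and the assumption that per-stop bandwidth scales linearly with clock speed is likewise mine. Function names are hypothetical.

```python
PER_STOP_GB_S = 96.0  # per ring stop at 3GHz, per the figures above
CLOCK_GHZ = 3.0

# Inferred, not stated by Intel: bytes moved per stop per clock cycle.
BYTES_PER_CYCLE = PER_STOP_GB_S / CLOCK_GHZ


def per_stop_bandwidth_gb_s(clock_ghz: float,
                            bytes_per_cycle: float = BYTES_PER_CYCLE) -> float:
    """Peak bandwidth of one ring stop, assuming it scales linearly with clock speed."""
    return clock_ghz * bytes_per_cycle


def ring_peak_gb_s(cores: int, clock_ghz: float = CLOCK_GHZ) -> float:
    """Peak last-level cache bandwidth with one ring stop per core."""
    return cores * per_stop_bandwidth_gb_s(clock_ghz)


print(f"bytes per cycle per stop: {BYTES_PER_CYCLE:.0f}")
for cores in (2, 4):
    print(f"{cores} cores at {CLOCK_GHZ:.0f}GHz: {ring_peak_gb_s(cores):.0f} GB/s")
```

Because every added core brings its own stop and cache slice, the aggregate figure grows with core count rather than being split among the cores.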

Intel's Opher Kahn said his team had made significant changes to the ring interconnect compared to the one used in Nehalem-EX, and he expects it will scale up and be viable for use in client-focused processors for multiple generations. The same ring will likely be used in server-focused derivatives of Sandy Bridge with more cores and very modest graphics capabilities, if any.