In a bit of a strange move, Intel disclosed next to nothing about its upcoming Sandy Bridge processor during the opening IDF keynote last week, which you’ll know if you vigilantly refreshed the page as my live blog of the speech descended into tragic irrelevance and hairdo critiques. It has been quite some time since we were this close to the release of an Intel processor (Sandy Bridge-based CPUs are expected to arrive right as we ring in 2011) without a sense of its basic internals. Fortunately, Intel did finally disclose many of the architectural details of Sandy Bridge later at IDF, during the technical sessions led by Sandy Bridge architects. We had the good fortune to attend some of them, but I’ve been traveling and unable to gather my thoughts on what we learned until now.
The first things to know about Sandy Bridge are that it’s a chip built using Intel’s high-speed 32-nm chip fabrication process, with initial variants expected to have four traditional CPU cores, an integrated graphics processor, cache, and a memory controller located together on the same piece of silicon. Intel essentially skipped building a quad-core processor at 32-nm, opting to accelerate the schedule for Sandy Bridge instead. We’ve long known most of the above, that Sandy Bridge would include integrated graphics and would require a new CPU socket and motherboards, and we’ve known that it would support Intel’s AVX instructions for faster vector processing of media workloads and the like. The mystery has been pretty much everything else beyond those preliminaries.
A substantially new microarchitecture
That mystery, it turns out, is pretty juicy, because Sandy Bridge is part of the unprecedented wave of brand-new x86 microprocessor architectures hitting the market. Just weeks after AMD disclosed the outlines of its Bulldozer and Bobcat cores, Intel has offered us an answer in the form of its own substantially new microarchitecture.
Now, making a claim like I just did is fraught with peril, since new chip designs almost inevitably build on older ones, especially when you’re talking about Intel CPUs. That’s the thing about Sandy Bridge, though: one of its architects proclaimed at IDF that it was essentially a from-the-ground-up rebuild of the out-of-order and floating-point execution engines. Such changes were necessary to accommodate the doubled vector width of the AVX instruction set, and it means something fairly momentous. As my friend David Kanter observed, this is, at long last, the breaking point where one can finally say virtually nothing remains of the P6 (Pentium Pro) roots that have undergirded everything from the Conroe/Merom Core 2 to the Nehalem/Westmere Core i-series processors.
Not only has the execution engine changed, but nearly everything around it has been replaced with new logic, as well, from the front-end and branch predictor to the memory execution unit. Outside of Sandy Bridge’s CPU cores, the “glue” logic on the chip is all new, too. The inter-core connections, memory controller, and power management microcontroller have been tailored to accommodate the presence of a graphics processor. Even the integrated graphics engine bears little resemblance to what has come before. If you’re looking for a golden age of CPU design, we’re living in it, folks.
The most monumental change in Sandy Bridge has to be the incorporation of graphics onto the CPU die, and Intel has almost assuredly gone further toward deep integration than AMD did in its Ontario “fusion” chips. Still, that step feels almost like an afterthought, as part of a logical progression like the integration of the memory controller and PCIe logic in the past few generations. The IGP here is more of an application-specific accelerator, not a true co-processor for data-parallel computation. Such lofty goals will have to wait for later generations. For now, the biggest opportunities for head-turning progress come from the sweeping changes to Sandy Bridge’s CPU microarchitecture, where smart new logic may potentially deliver formidable increases in per-clock performance.
The CPU front-end looks fairly similar to Merom or Nehalem from a high-altitude, block-diagram sort of view. The instruction cache is 32KB in size, and the decoder that turns CISC-style x86 instructions into RISC-like internal “micro-ops” can still process four instructions per cycle in most cases. Intel’s architects point to two key changes here.
The first is that rebuilt branch predictor. In most processors, the branch prediction unit uses a clever algorithm to “guess” what path a program will take prior to execution and then feeds the out-of-order engine with instructions to be processed speculatively. If it guesses right, the result is higher delivered performance, but if it guesses wrong, the results must be discarded and the proper program path must be executed instead, leading to a considerable performance hit. Modern CPUs have very accurate branch predictors, causing some folks to wonder whether pushing further on this front makes sense. Sandy Bridge’s architects suggested thinking about the problem not as a question of how much better one can do when one is already at 96% efficiency. Instead, one should think in terms of reducing mispredictions, where a change from, say, 7% to 4% represents an improvement of over 40%. With that in mind, they attacked the branch prediction problem anew in Sandy Bridge to achieve even lower rates of error. Unfortunately, we didn’t get any hard numbers on the accuracy of the new branch predictor, but it should be superior to Nehalem’s.
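To make the architects’ framing concrete, here is a minimal Python sketch of the arithmetic. The function names are my own, and the 7% and 4% rates are the illustrative figures from the paragraph above, not measured Sandy Bridge numbers.

```python
def relative_reduction(old_miss_rate: float, new_miss_rate: float) -> float:
    """Return the fraction of mispredictions eliminated by the improvement."""
    if not 0.0 < old_miss_rate <= 1.0:
        raise ValueError("old_miss_rate must be in (0, 1]")
    if not 0.0 <= new_miss_rate <= old_miss_rate:
        raise ValueError("new_miss_rate must be in [0, old_miss_rate]")
    return (old_miss_rate - new_miss_rate) / old_miss_rate


def relative_accuracy_gain(old_miss_rate: float, new_miss_rate: float) -> float:
    """Return the same improvement expressed as a relative gain in accuracy."""
    old_accuracy = 1.0 - old_miss_rate
    new_accuracy = 1.0 - new_miss_rate
    return (new_accuracy - old_accuracy) / old_accuracy


# Illustrative rates taken from the text, not measured hardware figures.
old_rate, new_rate = 0.07, 0.04

print(f"Fewer mispredictions: {relative_reduction(old_rate, new_rate):.1%}")
print(f"Higher accuracy:      {relative_accuracy_gain(old_rate, new_rate):.1%}")
```

Viewed as accuracy, going from 93% to 96% is a gain of only about 3%. Viewed as mispredictions, the same change removes roughly 43% of them, and each misprediction avoided is a pipeline flush avoided.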
The other innovation of note in Sandy Bridge’s front end is the addition of a cache for decoded micro-ops. Old-school CPU geeks may recognize this mechanism from a similar one, called the execution trace cache, used in the Pentium 4. Again, this provision is a nod to the fact that modern x86 processors don’t execute CISC-style x86 instructions natively, preferring instead to translate them into their own internal instruction sets. The idea behind this new cache is to store instructions in the form of the processor’s internal micro-ops, after they’ve been processed by the decoders, rather than storing them as x86 instructions. Doing so can reduce pressure on the decoders and, I believe, improve the chip’s power efficiency in the process. Unlike the Pentium 4, Sandy Bridge retains robust decode logic that it can call on when needed, so the presence of a micro-op cache should be a straightforward win, with few to no performance trade-offs.
To find the feature with the largest impact on Sandy Bridge performance, though, one has to look beyond the front end to the memory execution units. In Nehalem, those units have three ports, but only one can do loads, so the chip is capable of a single load per cycle. In Sandy Bridge, the load/store units are symmetric, so the chip can execute two 128-bit loads per cycle. Store and cache bandwidth is higher, as well. Removing these constraints and doubling the number of loads per cycle allows Sandy Bridge to feed its formidable execution engine more fully, resulting in more work completed. This and the other improvements discussed above should lead to general performance increases, even in familiar tasks where we haven’t necessarily seen much improvement in per-clock performance in recent years.
Of course, programs that make use of the AVX instruction set may see even larger gains, thanks to Sandy Bridge’s ability to process more data in parallel via wider, 256-bit vectors. AVX should benefit some familiar workload types, including graphics and media processing, where the data to be processed can be grouped together in large blocks. We’ve known the outlines of Sandy Bridge’s abilities here for a while, including the potential to execute a 256-bit floating-point add and a 256-bit floating-point multiply concurrently in the same clock cycle. At IDF, we got a better sense of how complete an AVX implementation Sandy Bridge really has, right down to a physical register file to store those 256-bit vectors. This chip should be in a class of its own on this front, at least until AMD’s Bulldozer arrives later in 2011. Even then, Bulldozer will have half the peak AVX throughput of Sandy Bridge and may only catch up when programs make use of AMD’s fused multiply-add (FMA) instruction, which only Bulldozer will support.
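For a rough sense of what the wider vectors mean at peak, the sketch below counts floating-point operations per cycle per core. It assumes exactly one vector add and one vector multiply issue every cycle, as described above, and it assumes the same issue pattern for the 128-bit comparison row. The helper names are hypothetical, and real code will rarely sustain these peaks.

```python
def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one vector register."""
    if vector_bits % element_bits != 0:
        raise ValueError("vector width must be a multiple of the element width")
    return vector_bits // element_bits


def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         vector_ops_per_cycle: int = 2) -> int:
    """Peak FLOPs per cycle per core.

    vector_ops_per_cycle=2 models one add plus one multiply issued together;
    that issue rate is an assumption of this sketch, not a measured figure.
    """
    return simd_lanes(vector_bits, element_bits) * vector_ops_per_cycle


for label, width in (("128-bit SSE", 128), ("256-bit AVX", 256)):
    single = peak_flops_per_cycle(width, 32)   # 32-bit floats
    double = peak_flops_per_cycle(width, 64)   # 64-bit doubles
    print(f"{label}: {single} single-precision, "
          f"{double} double-precision FLOPs per cycle per core")
```

Under those assumptions, doubling the vector width from 128 to 256 bits doubles the peak, which is where the larger potential gains for AVX-aware programs come from.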
The pathways connecting Sandy Bridge’s cores together have expanded to enable this increased throughput thanks to a new ring-style interconnect that links the CPU cores, graphics, last-level cache, and memory controller. Intel first used such a ring topology to connect the eight cores of the ultra-high-end Nehalem-EX processor. That concept has been borrowed and refined in Sandy Bridge. The chip’s architects saw the need for a high-bandwidth interconnect to allow CPU cores and the IGP to share the cache and memory controller, and they liked the ring concept because of its potential to scale up and down along with the modular elements of the architecture. Because each core has some L3 cache and a ring stop associated with it, cache bandwidth grows with the core count. At 3GHz, each stop can transfer up to 96 GB/s, so a dual-core Sandy Bridge implementation peaks at 192 GB/s of last-level cache bandwidth, while the quad-core variant peaks at a torrential 384 GB/s.
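The bandwidth figures above follow from simple multiplication, sketched here in Python. The 32 bytes per cycle per stop is my own inference from the quoted 96 GB/s at 3GHz, not a number Intel stated, and the sketch assumes one ring stop per core and 1 GB = 10^9 bytes.

```python
# Figures quoted in the text.
PER_STOP_GBS_AT_REFERENCE = 96.0
REFERENCE_CLOCK_GHZ = 3.0

# Inferred, not stated by Intel: bytes moved per cycle at each ring stop.
BYTES_PER_CYCLE_PER_STOP = PER_STOP_GBS_AT_REFERENCE / REFERENCE_CLOCK_GHZ


def peak_llc_bandwidth_gbs(ring_stops: int,
                           clock_ghz: float = REFERENCE_CLOCK_GHZ) -> float:
    """Peak last-level cache bandwidth in GB/s, assuming one stop per core."""
    if ring_stops < 1:
        raise ValueError("need at least one ring stop")
    if clock_ghz <= 0:
        raise ValueError("clock speed must be positive")
    return ring_stops * BYTES_PER_CYCLE_PER_STOP * clock_ghz


for cores in (2, 4):
    print(f"{cores} cores at {REFERENCE_CLOCK_GHZ:g}GHz: "
          f"{peak_llc_bandwidth_gbs(cores):.0f} GB/s peak")
```

Because the total is a product of stop count and clock speed, the same function shows how the peak would scale for variants with more cores or different clocks, if the per-stop width holds.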
Intel’s Opher Kahn said his team had made significant changes to the ring interconnect compared to the one used in Nehalem-EX, and he expects it will scale up and be viable for use in client-focused processors for multiple generations. The same ring will likely be used in server-focused derivatives of Sandy Bridge with more cores and very modest graphics capabilities, if any.
Re-thought integrated graphics and other improvements
The fact that the graphics processor is just another stop on the ring demonstrates how completely Sandy Bridge integrates its GPU. The graphics device shares not just main memory bandwidth but also the last-level cache with the CPU cores, and in some cases, it shares memory directly with those cores. Some memory is still dedicated solely to graphics, but the graphics driver can designate graphics streams to be cached and treated as coherent.
Inside the graphics engine, the big news isn’t higher unit counts but more robust individual execution units. Recent Intel graphics solutions have claimed compatibility with the feature-rich DirectX 10 API, but they have used their programmable shaders to process nearly every sort of math required in the graphics pipeline. Dedicated, custom hardware can generally be faster and more efficient at a given task, though, which is why most GPUs still contain considerable amounts of graphics-focused custom hardware blocks, and why those Intel IGPs have generally underachieved.
For this IGP, Intel revised its approach, using dedicated graphics hardware throughout, wherever it made sense to do so. A new transcendental math capability, for instance, promises 4-20X higher performance than the older generation. Before, DirectX instructions would break down into two to four internal instructions in the IGP, but in Sandy Bridge, the relationship is generally one-to-one. A larger register file should facilitate the execution of more complex shaders, as well. Cumulatively, Intel estimates, the changes should add up to double the throughput per shader unit compared to the last generation. The first Sandy Bridge derivative will have 12 of those revised execution units, although I understand that number may scale up and down in other variants.
Like the prior gen, this IGP will be DirectX 10-compliant but won’t support DX11’s more advanced feature set with geometry tessellation and higher-precision datatypes.
Sandy Bridge’s large last-level cache will be available to the graphics engine, and that fact purportedly will improve performance while saving power by limiting memory I/O transactions. We heard quite a bit of talk about the advantages of the cache for Sandy Bridge’s IGP, but we’re curious to see just how useful it proves to be. GPUs have generally stuck with relatively small caches since graphics memory access patterns tend to involve streaming through large amounts of data, making extensive caching impractical. Sandy Bridge’s IGP may be able to use the cache well in some cases, but it could trip up when high degrees of antialiasing or anisotropic filtering cause the working data set to grow too large. We’ll have to see about that.
We also remain rather skeptical about the prospects for Intel to match the standards of quality and compatibility set by the graphics driver development teams at Nvidia and AMD any time soon.
One bit of dedicated hardware that’s gotten quite a bit of attention on Sandy Bridge belongs to the IGP, and that’s the video unit. This unit includes custom logic to accelerate the processing of H.264 video codecs, much like past Intel IGPs and competing graphics solutions, with the notable addition of an encoding capability as well as decoding. Using the encoding and decoding capabilities together opens the possibility of very high speed (and potentially very power-efficient) video transcoding, and Intel briefly demoed just that during the opening keynote. We heard whispers of speeds up to 10X or 20X that of a software-only solution.
Sandy Bridge’s transcoding capabilities raise all sorts of funny questions. On one hand, using custom logic for video encoding as well as decoding makes perfect sense given current usage models, and it seems like a convenient way for Intel to poke a finger into the eye of competitors like AMD and Nvidia, whose GPGPU technologies have, to date, just one high-profile consumer application: video transcoding. On the other hand, this is Intel, bastion of CPUs and tailored instruction sets, embracing application-specific acceleration logic. I’m also a little taken aback by all of the excitement surrounding this feature, given that my mobile phone has the same sort of hardware.
Because the video codec acceleration is part of Sandy Bridge’s IGP, it will be inaccessible to users of discrete video cards, including anyone using the performance enthusiast-oriented P-series chipsets. Several folks from Intel told us the firm is looking into possible options for making the transcoding hardware available to users of discrete graphics cards, but if that happens at all, it will likely happen some time after the initial Sandy Bridge products reach consumers.
One more piece of the Sandy Bridge picture worth noting is the expansion of thermal-sensor-based dynamic clock frequency scaling (better known as Turbo Boost) along several lines. Although the Westmere dual-core processors had a measure of dynamic speed adjustment for the graphics component, the integration of graphics onto the same die has allowed much faster, finer-grained participation in the Turbo Boost scheme. Intel’s architects talked of “moving power around” between the graphics and CPU cores as needed, depending on the constraints of the workloads. If, say, a 3D game doesn’t require a full measure of CPU time but needs all the graphics performance it can get, the chip should respond by raising the graphics core’s voltage and clock speed while keeping the CPU’s power draw lower.
Furthermore, Intel claims Sandy Bridge should have substantially more headroom for peak Turbo Boost frequencies, although it remains coy about the exact numbers there. One indication of how expansive that headroom may be is a new twist on Turbo Boost aimed at improving system responsiveness during periods of high demand. The concept is that the CPU will recognize when an intensive workload begins and ramp up the clock speed so the user gets “a lot more performance” for a relatively long period (we heard the time frame of 20 seconds thrown around). With this feature, the workload doesn’t have to use just one or two threads to qualify for the speed boost; the processor will actually operate above its maximum thermal rating, or TDP, for the duration of the period, so long as its on-die thermal sensors don’t indicate a problem.
We worry that this feature may make computer performance even less deterministic than the first generation of Turbo Boost, and it will almost surely place a higher premium on good cooling. Still, the end result should be more responsive systems for users, and it’s hard to argue with that outcome.