Re-thought integrated graphics and other improvements
The fact that the graphics processor is just another stop on the ring demonstrates how completely Sandy Bridge integrates its GPU. The graphics device shares not just main memory bandwidth but also the last-level cache with the CPU cores, and in some cases, it shares memory directly with those cores. Some memory is still dedicated solely to graphics, but the graphics driver can designate graphics streams to be cached and treated as coherent.
Inside the graphics engine, the big news isn't higher unit counts but more robust individual execution units. Recent Intel graphics solutions have claimed compatibility with the feature-rich DirectX 10 API, but they have used their programmable shaders to process nearly every sort of math required in the graphics pipeline. Dedicated, custom hardware can generally be faster and more efficient at a given task, though, which is why most GPUs still contain considerable amounts of graphics-focused custom hardware blocks, and why those Intel IGPs have generally underachieved.
For this IGP, Intel revised its approach, using dedicated graphics hardware throughout, wherever it made sense to do so. A new transcendental math capability, for instance, promises 4-20X higher performance than the older generation. Before, DirectX instructions would break down into two to four internal instructions in the IGP, but in Sandy Bridge, the relationship is generally one-to-one. A larger register file should facilitate the execution of more complex shaders, as well. Cumulatively, Intel estimates, the changes should add up to double the throughput per shader unit compared to the last generation. The first Sandy Bridge derivative will have 12 of those revised execution units, although I understand that number may scale up and down in other variants.
Like the prior gen, this IGP will be DirectX 10-compliant but won't support DX11's more advanced feature set with geometry tessellation and higher-precision datatypes.
Sandy Bridge's large last-level cache will be available to the graphics engine, and that fact purportedly will improve performance while saving power by limiting memory I/O transactions. We heard quite a bit of talk about the advantages of the cache for Sandy Bridge's IGP, but we're curious to see just how useful it proves to be. GPUs have generally stuck with relatively small caches since graphics memory access patterns tend to involve streaming through large amounts of data, making extensive caching impractical. Sandy Bridge's IGP may be able to use the cache well in some cases, but it could trip up when high degrees of antialiasing or anisotropic filtering cause the working data set to grow too large. We'll have to see about that.
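To put a rough number on that worry, here is a back-of-the-envelope sketch of how a render target's footprint grows with multisampling. Everything in it is an assumption for illustration: the function names are ours, the model naively stores full color and depth for every sample with no compression, and the 8 MiB cache figure is a placeholder rather than a number Intel quoted to us.

```python
# Hypothetical estimate of render-target size versus an assumed cache size.
# None of these names or figures come from Intel; they only illustrate the arithmetic.

ASSUMED_CACHE_BYTES = 8 * 1024 * 1024  # placeholder last-level cache size (assumption)


def render_target_bytes(width, height, msaa_samples=1, color_bytes=4, depth_bytes=4):
    """Naive footprint: every sample stores its own color and depth values."""
    return width * height * msaa_samples * (color_bytes + depth_bytes)


def fits_in_cache(footprint_bytes, cache_bytes=ASSUMED_CACHE_BYTES):
    """True if the whole render target could, in principle, sit in the cache."""
    return footprint_bytes <= cache_bytes


if __name__ == "__main__":
    for samples in (1, 4):
        size = render_target_bytes(1280, 720, msaa_samples=samples)
        print(f"{samples}x: {size / 2**20:.1f} MiB, fits: {fits_in_cache(size)}")
```

Under these assumptions, a 1280x720 target squeezes into the cache without antialiasing but overflows it several times over at 4X, before any textures are counted. Real hardware compresses and tiles this data, so treat the result as a direction, not a prediction.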
We also remain rather skeptical about the prospects for Intel to match the standards of quality and compatibility set by the graphics driver development teams at Nvidia and AMD any time soon.
One bit of dedicated hardware that's gotten quite a bit of attention on Sandy Bridge belongs to the IGP, and that's the video unit. This unit includes custom logic to accelerate the processing of H.264 video codecs, much like past Intel IGPs and competing graphics solutions, with the notable addition of an encoding capability as well as decoding. Using the encoding and decoding capabilities together opens the possibility of very high speed (and potentially very power-efficient) video transcoding, and Intel briefly demoed just that during the opening keynote. We heard whispers of speeds up to 10X or 20X that of a software-only solution.
Sandy Bridge's transcoding capabilities raise all sorts of funny questions. On one hand, using custom logic for video encoding as well as decoding makes perfect sense given current usage models, and it seems like a convenient way for Intel to poke a finger into the eye of competitors like AMD and Nvidia, whose GPGPU technologies have, to date, just one high-profile consumer application: video transcoding. On the other hand, this is Intel, bastion of CPUs and tailored instruction sets, embracing application-specific acceleration logic. I'm also a little taken aback by all of the excitement surrounding this feature, given that my mobile phone has the same sort of hardware.
Because the video codec acceleration is part of Sandy Bridge's IGP, it will be inaccessible to users of discrete video cards, including anyone using the performance enthusiast-oriented P-series chipsets. Several folks from Intel told us the firm is looking into possible options for making the transcoding hardware available to users of discrete graphics cards, but if that happens at all, it will likely happen some time after the initial Sandy Bridge products reach consumers.
One more piece of the Sandy Bridge picture worth noting is the expansion of thermal-sensor-based dynamic clock frequency scaling (better known as Turbo Boost) along several lines. Although the Westmere dual-core processors had a measure of dynamic speed adjustment for the graphics component, the integration of graphics onto the same die has allowed much faster, finer-grained participation in the Turbo Boost scheme. Intel's architects talked of "moving power around" between the graphics and CPU cores as needed, depending on the constraints of the workloads. If, say, a 3D game doesn't require a full measure of CPU time but needs all the graphics performance it can get, the chip should respond by raising the graphics core's voltage and clock speed while keeping the CPU's power draw lower.
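The idea of "moving power around" can be sketched as a shared budget divided according to demand. This is only a toy model of the concept: Intel did not describe its actual policy, and the function name, the proportional split, the per-unit floor, and the wattages below are all our own assumptions.

```python
# Toy model of sharing one package power budget between CPU and GPU.
# The proportional policy and all numbers are assumptions, not Intel's algorithm.

def split_power_budget(total_watts, cpu_demand, gpu_demand, floor_watts=2.5):
    """Return (cpu_watts, gpu_watts) for demands expressed as values from 0 to 1.

    Each unit always keeps floor_watts; the remainder is divided in
    proportion to how much work each unit is asking to do.
    """
    if total_watts < 2 * floor_watts:
        raise ValueError("budget too small to cover both floors")
    spare = total_watts - 2 * floor_watts
    total_demand = cpu_demand + gpu_demand
    # With no demand at all, split the spare power evenly.
    cpu_share = 0.5 if total_demand == 0 else cpu_demand / total_demand
    cpu_watts = floor_watts + spare * cpu_share
    gpu_watts = floor_watts + spare * (1 - cpu_share)
    return cpu_watts, gpu_watts


if __name__ == "__main__":
    # A GPU-bound game: light CPU demand, heavy graphics demand.
    print(split_power_budget(35.0, cpu_demand=0.25, gpu_demand=0.75))
```

In the game scenario above, most of the hypothetical 35W budget flows to the graphics core while the CPU keeps a smaller slice, which is the behavior Intel's architects described in broad strokes.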
Furthermore, Intel claims Sandy Bridge should have substantially more headroom for peak Turbo Boost frequencies, although it remains coy about the exact numbers there. One indication of how expansive that headroom may be is a new twist on Turbo Boost aimed at improving system responsiveness during periods of high demand. The concept is that the CPU will recognize when an intensive workload begins and ramp up the clock speed so the user gets "a lot more performance" for a relatively long period; we heard the time frame of 20 seconds thrown around. With this feature, the workload doesn't have to use just one or two threads to qualify for the speed boost; the processor will actually operate above its maximum thermal rating, or TDP, for the duration of the period, so long as its on-die thermal sensors don't indicate a problem.
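As we understand it, the logic amounts to a time-limited boost gated by temperature. The sketch below captures just that much and nothing more. The 20-second window is the figure we heard in conversation; the function name, the clock speeds, and the 95°C limit are hypothetical values chosen for illustration, since Intel hasn't shared real ones.

```python
# Minimal sketch of a time-limited, thermally gated boost decision.
# Clock speeds and the temperature limit are made-up values (assumptions).

def turbo_clock_mhz(base_mhz, turbo_mhz, seconds_since_load_start, die_temp_c,
                    window_s=20.0, temp_limit_c=95.0):
    """Pick a clock speed for a chip that has just come under heavy load."""
    if die_temp_c >= temp_limit_c:
        # Thermal sensors report a problem: no boost, whatever the timing.
        return base_mhz
    if seconds_since_load_start < window_s:
        # Inside the boost window: run above the sustained (TDP-limited) speed.
        return turbo_mhz
    # Window expired: fall back to the sustained speed.
    return base_mhz


if __name__ == "__main__":
    for t in (0, 10, 19.9, 20, 30):
        print(t, turbo_clock_mhz(3000, 3800, t, die_temp_c=70.0))
```

A real implementation would surely ramp clocks gradually and track accumulated thermal headroom, not flip a switch at a fixed deadline, but the sketch shows why the outcome depends on cooling: a hotter die ends the boost early.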
We worry that this feature may make computer performance even less deterministic than the first generation of Turbo Boost, and it will almost surely place a higher premium on good cooling. Still, the end result should be more responsive systems for users, and it's hard to argue with that outcome.