When AMD's "Barcelona" Opterons made their debut last Monday, we couldn't tell you about a sleek, black box nestled in among the other test systems in Damage Labs. Housed inside of it: an example of Intel's brand-new "Stoakley" dual-processor platform, complete with a pair of Xeons based on 45nm process technology. These Xeons are the first members of the Penryn family of 45nm CPUs to reach our test labs, and they offer a tantalizing look at how Intel will counter AMD's new CPU design with a substantially revised version of its own potent Core microarchitecture.
These new CPUs and the platform that supports them promise marked improvements in performance, thanks to a bevy of tweaks and updates. In fact, although the new Xeons are more a minor refresh than a major overhaul, the gains they've attained are formidable. Today, we can show you how these processors perform.
The contest between next-generation CPU architectures has begun in earnest. Read on to see how Intel's 45nm Xeons match up with AMD's quad-core Opterons.
Goin' to Harpertown
Following hardware developments these days requires navigating a virtual minefield of overlapping codenames, and Intel proudly leads the world in codename generation. The new Xeons have several names attached. "Penryn" is the codename for the family of processors based on Intel's 45nm fab process, and this same silicon will serve a number of markets in various configurations. For the server and workstation markets, the bread-and-butter Penryn derivative will be "Harpertown," a dual-chip, quad-core product that supersedes the current quad-core "Clovertown" Xeons. Intel also has plans for a single-chip, dual-core variant known as "Wolfdale."
All Penryn derivatives will be manufactured via Intel's 45nm high-k chip fabrication process, which the company has hailed as a breakthrough and a fundamental restructuring of the transistor. Despite the fanfare, the change brings gains that were once considered fairly conventional for process shrinks. Intel says the 45nm high-k process has twice the transistor density, a 20% increase in switching speed, and a 30% reduction in switching power versus its 65nm process. Improvements of that order are nothing to scoff at these days, nor is Intel's manufacturing might. The firm already has two fabs making the 45nm conversion in the second half of 2007, Fab D1D in Oregon and Fab 32 in Arizona. Fab 28 in Israel will follow in the first half of next year, along with Fab 11X in New Mexico in the second half of '08. By then, 45nm processors should make up the majority of Intel's output.
Harpertown Xeons and their Penryn-based cousins are not just die-shrunk versions of current chips, but they do retain the same basic layout. The quad-core parts are composed of two dual-core chips situated together in a single LGA771-style package. This two-chip arrangement isn't as neatly integrated as AMD's "native quad-core" Opterons (the two chips can communicate with one another only by means of the relatively slow front-side bus), but it has the advantage of making chips easier to manufacture. The approximately 463 million transistors of AMD's Barcelona are packed into an area that's 283 mm² via AMD's 65nm SOI fab process. That's a relatively large area over which AMD must avoid defects. By contrast, current 65nm Xeons are based on two chips, each roughly 291 million transistors and measuring just 143 mm². Each chip in a Harpertown Xeon crams 410 million transistors into an even smaller 107 mm² area. One can argue that AMD's approach to quad-core processors is more elegant, but it's hard to argue with the Penryn family's tiny die area.
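The manufacturing argument can be made concrete with a first-order estimate. The sketch below applies the textbook Poisson yield model, Y = exp(-A × D), to the Barcelona and Harpertown die areas quoted above. The defect density D is a purely hypothetical figure chosen for illustration, since neither company publishes one, and the names in the code are our own. Treat the output as a picture of the trend, not as real yield data.

```python
import math

# Die figures quoted in the article text.
BARCELONA = {"transistors_m": 463, "area_mm2": 283}       # one native quad-core die
HARPERTOWN_DIE = {"transistors_m": 410, "area_mm2": 107}  # one of two dual-core dies

# ASSUMPTION: hypothetical defect density, not a published number.
ASSUMED_DEFECTS_PER_CM2 = 0.5


def density_m_per_mm2(die):
    """Transistor density in millions of transistors per square millimeter."""
    return die["transistors_m"] / die["area_mm2"]


def poisson_yield(area_mm2, defects_per_cm2=ASSUMED_DEFECTS_PER_CM2):
    """Fraction of defect-free dies under the simple Poisson model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


for name, die in (("Barcelona", BARCELONA), ("Harpertown die", HARPERTOWN_DIE)):
    print(
        f"{name}: {density_m_per_mm2(die):.2f} Mtransistors/mm^2, "
        f"modeled yield {poisson_yield(die['area_mm2']):.0%}"
    )
```

Real 65nm SOI and 45nm high-k processes will not share one defect density, so the same D is used for both dies only to isolate the effect of area. The model also ignores the fact that a Harpertown package needs two good dies, although those can be tested individually before they are paired.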
The small die belies big changes, though. The most obvious of those is a larger (6MB) and smarter (24-way set associative) L2 cache shared between the two cores on each chip. That adds up to 12MB of L2 cache per socket, for those who prefer to count that way. Harpertown Xeons can better feed that cache thanks to front-side bus speeds of up to 1.6GHz.
Penryn's CPUs themselves may need the extra bandwidth, thanks to a handful of tweaks. One of the most prominent: a new, faster divider capable of handling both integer and floating-point numbers. This new radix-16-based design processes four bits per cycle, versus two bits in prior designs, and includes an optimized square root function. An early-out algorithm in the divider can lead to lower instruction latencies in some cases, as well. Penryn also extends the Core microarchitecture's 128-bit single cycle SSE capabilities to shuffle operations, doubling execution throughput there. This is not a new instruction but an optimization for existing instructions, so no software changes are required to take advantage of this capability. The faster shuffle should be useful in formatting and setting up data for use in other SSE-based vector operations.
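To see why retiring four quotient bits per cycle rather than two matters, consider the following software sketch of digit-recurrence division. It is not Intel's hardware design. It is a plain restoring divider whose step size is configurable, and the function name, the 32-bit default width, and the simple "skip leading zero digits" early-out rule are all our own assumptions for illustration.

```python
def digit_recurrence_divide(dividend, divisor, bits_per_step, width=32, early_out=False):
    """Unsigned division that retires `bits_per_step` quotient bits per iteration.

    Returns (quotient, remainder, steps). bits_per_step=2 models a radix-4
    divider, bits_per_step=4 a radix-16 divider.
    """
    if divisor <= 0:
        raise ZeroDivisionError("divisor must be a positive integer")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the given width")
    if width % bits_per_step:
        raise ValueError("width must be a multiple of bits_per_step")

    digit_mask = (1 << bits_per_step) - 1
    steps = width // bits_per_step
    if early_out:
        # ASSUMED early-out rule: skip leading dividend digits that are all zero.
        significant_bits = max(dividend.bit_length(), 1)
        steps = -(-significant_bits // bits_per_step)  # ceiling division

    quotient = 0
    remainder = 0
    for step in range(steps - 1, -1, -1):
        # Shift in the next group of dividend bits.
        digit_in = (dividend >> (step * bits_per_step)) & digit_mask
        remainder = (remainder << bits_per_step) | digit_in
        # Select the quotient digit by repeated subtraction (restoring style).
        q_digit = 0
        while remainder >= divisor:
            remainder -= divisor
            q_digit += 1
        quotient = (quotient << bits_per_step) | q_digit
    return quotient, remainder, steps


print(digit_recurrence_divide(1000, 7, bits_per_step=2))  # radix-4
print(digit_recurrence_divide(1000, 7, bits_per_step=4))  # radix-16
print(digit_recurrence_divide(1000, 7, bits_per_step=4, early_out=True))
```

For a 32-bit operand, the radix-4 version takes 16 iterations and the radix-16 version takes 8, which is the halving the text describes. With the early-out rule enabled, small dividends finish in fewer steps still.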
Speaking of SSE and new instructions, SSE4 is finally here in Penryn. These aren't just the Supplemental SSE3 instructions supported in the first rev of the Core microarchitecture, but 47 all-new instructions aimed at video acceleration, basic graphics operations (including dot products), and the integration and control of coprocessors over PCIe. These instructions will, of course, require updated software support.
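As one example of what the dot-product support looks like, SSE4.1 includes DPPS, a packed single-precision dot product controlled by an 8-bit immediate. The Python emulation below is only a sketch of that instruction's semantics. The function name `dpps` is our own, and it computes in Python's double-precision floats, so results may differ from hardware in the last bits.

```python
def dpps(a, b, imm8):
    """Emulate the lane semantics of the SSE4.1 DPPS instruction.

    a, b: sequences of four floats (lane 0 first).
    imm8: bits 4-7 choose which lane products are summed,
          bits 0-3 choose which result lanes receive the sum.
    """
    if len(a) != 4 or len(b) != 4:
        raise ValueError("DPPS operates on four-element vectors")
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("imm8 must fit in eight bits")

    # Multiply only the lanes enabled by the high nibble.
    products = [a[i] * b[i] if imm8 & (1 << (4 + i)) else 0.0 for i in range(4)]
    total = sum(products)
    # Broadcast the sum to the lanes enabled by the low nibble, zero the rest.
    return [total if imm8 & (1 << i) else 0.0 for i in range(4)]


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dpps(a, b, 0xF1))  # full four-element dot product, result in lane 0
print(dpps(a, b, 0x7F))  # three-element dot product, broadcast to all lanes
```

The mask is what makes the instruction handy for graphics work: a three-component vector stored in a four-wide register can be handled by leaving the fourth product out of the sum.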
Harpertown Xeons pack some additional Penryn goodness, such as store forwarding and virtualization improvements, but they do not have the nifty "dynamic acceleration tech" intended for desktop Penryn derivatives. Those chips will have the ability to raise their clock speeds beyond their stock ratings, while staying within their appointed thermal envelopes, when one core is idle and the other is busy with a heavily single-threaded workload. Such trickery may be too fancy for the button-down world of servers and workstations, at least in its first-generation form.
Interestingly, Intel is toying with another, more permanent possibility for some future Xeon products: disabling one core on each of the two chips in a package in order to yield a dual-core solution that has 6MB of dedicated L2 cache per core. This move could allow a distinctive mix of single-threaded performance (as dictated by both cache sizes and clock speeds) within a given power envelope.
Speaking of which, the power envelopes for the new Xeons will remain essentially the same as the old ones. That means TDPs of 40, 65, and 80W for dual-core parts and 50, 80, and 120W for quad-cores. TDP ratings at a given clock speed should be down, I believe, although we don't have all of the details yet. We do know that Intel plans to sell a 3.16GHz version of Harpertown that will fit into the top 120W envelope, and we know that our sample Harpertowns, to be sold as the Xeon E5472, run at 3GHz and fit into an 80W thermal envelope. Additional details on the lineup and pricing will have to wait for the Harpertown Xeons' official launch on November 12.
Stoakley steps up
The product that is officially arriving today is Intel's new dual-socket platform, code-named Stoakley. This platform is composed of something old (Intel's current ESB2 I/O chip, or south bridge) and something new (a memory controller hub, or north bridge chip, code-named Seaburg). Seaburg supplants a pair of existing products, the server-oriented Blackford MCH and the workstation-class Greencreek MCH. Seaburg is manufactured on a newer process node than its predecessors, and its clock speed is up from 333 to 400MHz within a similar power envelope.
Of course, the Stoakley platform's main mission in life is to support the new 45nm Xeons. Like the Bensley platform before it, Stoakley has two front-side buses, one dedicated to each socket in the system. However, while Bensley's front-side buses topped out at 1.33GHz, Stoakley's FSBs can run at 1.6GHz. Memory bandwidth is up, too, since Seaburg supports FB-DIMM speeds of 800MHz for its four memory channels (though 667MHz remains an option). Stoakley's memory controller also gains more capacity for memory request reordering than Bensley's. All told, Intel cites a 25% higher sustainable memory throughput for the new platform.
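For a rough sense of scale, the peak numbers implied by those clocks can be worked out in a few lines. This is back-of-the-envelope arithmetic under two assumptions that the text above does not state: that each front-side bus moves 8 bytes per transfer (a 64-bit data path), and that each FB-DIMM channel can deliver 8 bytes per transfer on reads at the quoted DDR2 rate. The function and constant names are ours.

```python
# ASSUMPTION: 64-bit (8-byte) data path per FSB and per FB-DIMM read channel.
BYTES_PER_TRANSFER = 8
FSB_COUNT = 2        # one front-side bus per socket
MEMORY_CHANNELS = 4  # FB-DIMM channels on both platforms


def peak_gb_per_s(mega_transfers_per_s, bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


PLATFORMS = {
    "Bensley": {"fsb_mt": 1333, "mem_mt": 667},
    "Stoakley": {"fsb_mt": 1600, "mem_mt": 800},
}

for name, p in PLATFORMS.items():
    fsb_total = FSB_COUNT * peak_gb_per_s(p["fsb_mt"])
    mem_total = MEMORY_CHANNELS * peak_gb_per_s(p["mem_mt"])
    print(f"{name}: FSB peak {fsb_total:.1f} GB/s, memory read peak {mem_total:.1f} GB/s")

fsb_gain = PLATFORMS["Stoakley"]["fsb_mt"] / PLATFORMS["Bensley"]["fsb_mt"] - 1
print(f"Peak FSB gain: {fsb_gain:.0%}")
```

Under these assumptions, peak rates rise by about 20% on both the front-side buses and the memory channels. Intel's 25% figure is for sustainable throughput, which also reflects the deeper request reordering, something this sketch does not model.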
In addition to the extra throughput, Stoakley can house twice as much memory as Bensley (up to 128GB) and will support FB-DIMM fail-over for high-reliability systems. Seaburg also doubles the number of PCIe lanes and upgrades those links to second-generation PCI Express.
One of the bigger challenges in designing the Seaburg north bridge was no doubt creating the snoop filter. This logic stores coherency information for all last-level caches on both of the chipset's front-side buses, and it reduces FSB utilization by filtering out unnecessary coherency updates rather than passing them along from one FSB to the other. A system with dual Harpertown Xeons will have four last-level caches of 6MB each, and each cache will be 24-way associative. Accordingly, Seaburg's snoop filter has four affinity groups, provides 24MB of coverage, and is 96-way associative. Seaburg also uses an improved algorithm for victim selection.
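The sizing follows directly from the cache configuration, and the filtering idea itself is simple enough to model in a few lines. The toy class below is our own illustration, not a description of Seaburg's logic. It tracks which front-side bus may hold a cache line and reports whether a request needs to be forwarded to the other bus. It deliberately ignores capacity limits, associativity, affinity groups, and victim selection, and every name in it is hypothetical.

```python
from collections import defaultdict

# Sizing arithmetic from the figures in the text.
LAST_LEVEL_CACHES = 4  # two per Harpertown package, two packages
CACHE_SIZE_MB = 6
CACHE_WAYS = 24

coverage_mb = LAST_LEVEL_CACHES * CACHE_SIZE_MB  # 24MB of coverage
filter_ways = LAST_LEVEL_CACHES * CACHE_WAYS     # 96-way associative


class ToySnoopFilter:
    """Unbounded toy directory: cache-line address -> buses that may hold it."""

    def __init__(self):
        self._holders = defaultdict(set)

    def record_fill(self, bus, line):
        """Note that a cache on `bus` has loaded `line`."""
        self._holders[line].add(bus)

    def record_eviction(self, bus, line):
        """Note that no cache on `bus` holds `line` any longer."""
        self._holders[line].discard(bus)

    def must_snoop_other_bus(self, requesting_bus, line):
        """True if some other bus may hold `line`, so the snoop must be forwarded."""
        return any(bus != requesting_bus for bus in self._holders.get(line, ()))


snoop_filter = ToySnoopFilter()
snoop_filter.record_fill(bus=0, line=0x1000)
print(coverage_mb, filter_ways)
print(snoop_filter.must_snoop_other_bus(requesting_bus=0, line=0x1000))  # own bus only
print(snoop_filter.must_snoop_other_bus(requesting_bus=1, line=0x1000))  # held on bus 0
```

A real filter has finite capacity, which is where victim selection comes in: when an entry must be dropped, the corresponding lines have to be invalidated in the caches, so the choice of victim affects performance.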
In the previous generation, only the workstation-oriented Greencreek MCH had a snoop filter; the server-targeted Blackford MCH did not, because it could hamper performance in some cases. The improvements to Stoakley's snoop filter have mitigated that performance penalty, and so Intel will offer only one product in this generation. Technically, Stoakley is billed primarily as a workstation platform, but expect it to find its way into servers, as well. With its increased throughput, Stoakley could prove particularly popular for HPC systems.