
The Core microarchitecture
The heritage of the Core microarchitecture can be traced back through the Core Duo and Pentium M, through the Pentium II and III, all the way to the original Pentium Pro. That original design has undergone some serious evolutionary changes, plus a few radical mutations, along the way, and the Core microarchitecture may be the most sweeping set of changes yet. Even compared to its direct forebear, the Core Duo, the Core design can be considered substantially new.

Core's genesis was a project known internally at Intel as Merom, whose mission was to build a replacement for the Pentium M and Core Duo mobile processors. The Israel-based design team responsible for Intel's mobile CPUs followed a distinctive design philosophy focused intently on energy efficiency, which helped make the Pentium M a resounding success as part of the Centrino platform. When power and heat became problems for Netburst-based desktop and server processors, Intel turned to Merom as the source of a new, common microarchitecture for its mobile, desktop, and server CPUs.

Because of its orientation toward power efficiency, the Core architecture is a very different design from Netburst. From the very first Pentium 4, Netburst was a "speed demon" type of architecture, a chip designed not for clock-for-clock performance, but to be comfortable running at high clock frequencies. To this end, the original Netburst processors had a relatively long 20-stage main pipeline. For a time, this design achieved good results at the 130nm process node, but all of that changed when Intel introduced a vastly reworked Netburst at 90nm. With its pipeline stretched to 31 stages and its transistor count up significantly, the Pentium 4 "Prescott" still had trouble delivering high clock speeds without getting too hot, and performance suffered as a result.

The Core architecture, meanwhile, is the opposite of a speed demon; it's a "brainiac" instead. Core has a relatively short 14-stage pipeline, but it's very "wide," with ample execution resources aimed at handling lots of instructions at once. Core is unique among x86-compatible processors in its ability to fetch, decode, issue and retire up to four instructions in a single clock cycle. Core can even execute 128-bit SSE instructions in a single clock cycle, rather than the two cycles required by previous architectures. In order to keep all of its out-of-order execution resources occupied, Core has deeper buffers and more slots for instructions in flight.

A block diagram of one Core execution, err, core. Source: Intel.

Like other contemporary PC processors, Core translates x86 instructions into a different set of instructions that its internal, RISC-like core can execute. Intel calls these internal instructions micro-ops. Core inherits the Pentium M and Core Duo's ability to fuse certain micro-op pairs and send them down the pipeline for execution together, a provision that can make the CPU's execution resources seem even wider than they are. To this ability, Core adds the capability to fuse some pairs of x86 "macro-ops," such as compare and jump, that commonly occur together. Not only can these provisions enhance performance, but they can also reduce the amount of energy expended in order to execute an instruction sequence.
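To make the macro-op fusion idea a bit more concrete, here is a toy sketch in Python. It only models the bookkeeping: adjacent compare and conditional-jump instructions get merged into a single internal operation, so fewer slots are consumed downstream. The mnemonic sets and the function name fuse_macro_ops are illustrative assumptions, not Intel's actual fusion rules.

```python
# Toy model of macro-op fusion: an adjacent compare + conditional jump pair
# is merged into one internal operation. The mnemonic sets below are
# illustrative assumptions, not Intel's actual fusion rules.
COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jge"}


def fuse_macro_ops(instructions):
    """Return a new list in which fusible adjacent pairs are merged."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op in COMPARES and nxt in COND_JUMPS:
            # The pair travels down the pipeline as a single operation.
            fused.append(f"{op}+{nxt}")
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused


stream = ["mov", "cmp", "jne", "add", "test", "je", "cmp", "mov"]
fused = fuse_macro_ops(stream)
print(fused)
print(len(stream), "->", len(fused))  # 8 -> 6
```

In this made-up stream, eight instructions shrink to six operations, which is the sense in which fusion can make the machine seem wider than it is.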

Another innovation in Core is a feature Intel has somewhat cryptically named memory disambiguation. Most modern CPUs speculatively execute instructions out of order and then reorder them later to create the illusion of sequential execution. Memory disambiguation extends out-of-order principles to the memory system, allowing for loads to be moved ahead of stores in certain situations. That may sound like risky business, but that's where the disambiguation comes in. The memory system uses an algorithm to predict which loads are to move ahead of stores, removing the ambiguity.

See? Ahh.

This optimization can pay big performance dividends.
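For the curious, here is one way such a predictor could work, sketched in Python. Intel hasn't spelled out its algorithm here, so treat everything in this block as an assumption: the per-load saturating counter, the threshold of 3, and the names LoadPredictor and simulate are all hypothetical. The idea is simply that a load earns the right to move ahead of stores after a run of conflict-free executions, and loses it when a real conflict forces a replay.

```python
class LoadPredictor:
    """Toy per-load saturating counter. The threshold is a made-up value."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counters = {}  # load id -> consecutive conflict-free executions

    def may_hoist(self, load_id):
        # Only loads with a clean recent history may pass earlier stores.
        return self.counters.get(load_id, 0) >= self.threshold

    def update(self, load_id, conflicted):
        if conflicted:
            self.counters[load_id] = 0  # lose trust after a real conflict
        else:
            count = self.counters.get(load_id, 0) + 1
            self.counters[load_id] = min(count, self.threshold)


def simulate(history, predictor, load_id="load_A"):
    """history: list of bools, True when the load truly aliased a store."""
    hoisted = replays = 0
    for conflicted in history:
        if predictor.may_hoist(load_id):
            hoisted += 1
            if conflicted:
                replays += 1  # wrong guess: the load must be re-executed
        predictor.update(load_id, conflicted)
    return hoisted, replays


history = [False] * 5 + [True] + [False] * 4
print(simulate(history, LoadPredictor()))  # (4, 1)
```

In this toy run, the load is moved ahead of stores four times out of ten and pays for one wrong guess. A real implementation would track many loads in hardware tables, but the trade-off is the same: frequent wins, occasional replays.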

A picture of the Core 2 die. Source: Intel.

In contrast to the various "dual-core" implementations of Netburst, the Core microarchitecture is a natively dual-core design. The chip's two execution cores each have their own 32KB L1 instruction and data caches, but they share a common L2 cache that can be either 2MB or 4MB in size. (The execution trace cache from Netburst is not carried over here.) The chip can allocate space in this L2 cache dynamically on an as-needed basis, dedicating more space to one core than the other in periods of asymmetrical activity. The common cache also eliminates the need for coherency protocol traffic on the system's front-side bus, and one core can pass data to another simply by transferring ownership of that data in the cache. This arrangement is easily superior to the Pentium D's approach, where the two cores can communicate and share data only via the front-side bus.
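A quick toy simulation in Python shows why dynamic allocation in a shared cache helps when the two cores' needs are lopsided. Everything here is an assumption for illustration: the LRU replacement policy, the tiny cache sizes (8 lines shared versus 4 lines per core), and the access pattern, in which one core loops over 6 lines while the other touches only 2.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache model that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.lines[key] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used


def run(cache_for_core, rounds=10):
    """Core 0 loops over 6 lines; core 1 loops over 2 lines."""
    for _ in range(rounds):
        for line in range(6):
            cache_for_core[0].access(("core0", line))
        for line in range(2):
            cache_for_core[1].access(("core1", line))


# One 8-line cache shared by both cores.
shared = LRUCache(8)
run({0: shared, 1: shared})

# Same total capacity, statically split into two private 4-line caches.
split = {0: LRUCache(4), 1: LRUCache(4)}
run(split)

shared_hits = shared.hits
split_hits = split[0].hits + split[1].hits
print(shared_hits, split_hits)  # 72 18
```

With the static split, the busy core thrashes its 4-line half while the other core leaves half of its space idle. The shared cache lets the busy core borrow that idle space, so everything fits after the first pass.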

As Intel's brand-new common microarchitecture, Core is of course equipped with all of the latest features. String 'em together, and you get something like this: MMX, SSE, SSE2, SSE3, SSSE3, EM64T, EIST, C1E, XD, and VT, to name a subset of the complete list. The most notable addition here is probably EM64T, Intel's name for x86-64 compatibility, because the Core Duo didn't have it. In order to make its way into desktops and servers, Core needed to be a 64-bit capable processor, and so it is.

The scope and depth of the changes to the Core microarchitecture simply from its direct "Yonah" Core Duo ancestor are too much to cover in a review like this one, but hopefully you have a sense of things. For further reading on the details of the Core architecture, let me recommend David Kanter's excellent overview of the design.

AMD answers with Energy Efficient Athlons
Anticipating better power efficiency from Intel's new desktop processors, AMD has begun offering Energy Efficient versions of many of its CPUs for the new Socket AM2 infrastructure. Much like the Turion 64 mobile processor and the HE versions of the Opteron server chips, these Energy Efficient Athlon 64s have been manufactured using a tweaked fabrication process intended to produce chips capable of operating at lower voltages. Making these more efficient chips isn't easy, so AMD charges a price premium for the Energy Efficient models that averages about 40 bucks over the non-EE versions.

The Athlon 64 X2 3800+ Energy Efficient Small Form Factor (left) and Athlon 64 X2 4600+ Energy Efficient (right)

Just as we wrapped up our testing of the Core 2 Duo, a pair of these new Energy Efficient processors arrived from AMD. On the right above is the EE version of the Athlon 64 X2 4600+. AMD rates its max thermal power at 65W, down from 89W in the stock version. Currently, the X2 4600+ EE commands a $43 price premium over the regular X2 4600+.

The processor on the left above may have the longest product name of any desktop CPU ever: "Athlon 64 X2 3800+ Energy Efficient Small Form Factor." This long-winded name, though, signals a very frugal personality; AMD rates this processor's max thermal power at only 35W. Making the leap from the stock version to the EE SFF model will set you back roughly 60 bucks, or you can stop halfway and get the X2 3800+ EE with a 65W TDP for 20 bucks more than the basic 89W version.

By the way, you may be tempted to compare the TDP numbers for the Core 2 Duo with these processors, but there is some risk in doing so. AMD generates its TDP ratings using a simple maximum value, while Intel uses a more complex method that produces numbers that may be less than the processor's actual peak power use. As a result, direct comparisons between AMD and Intel TDP numbers may not reflect the realities involved.

For all intents and purposes beyond power consumption and the related heat production, the EE versions of the Athlon 64 X2 ought to be identical to the originals. They run at the same clock speeds, have the same feature sets, and should deliver equivalent performance. Because that's so, and due to limited testing time, we've restricted our testing of these Energy Efficient chips to power consumption.