The product of that team’s efforts is a new CPU microarchitecture known as Core, of which the Core 2 Duo and Core 2 Extreme are among the first implementations intended for desktop PCs. We’ve been knee-deep in hype about the Core architecture for months now, with a stream of juicy technical details, semi-official benchmark previews, and clandestine reviews of pre-release products feeding the anticipation. Clearly, when a player as big as Intel stumbles as badly as it has, PC enthusiasts and most others in the industry are keen to see it get back up and start delivering exciting products once again.
Fortunately, the wait for Core 2 processors is almost over. Intel has decided to take the wraps off final reviews of its new CPUs today, in anticipation of the chips’ release to the public in a couple of weeks. Fish have gotta swim, politicians have gotta dissemble, and TR has gotta test hardware, so of course we’ve had the Core 2 processors on the test bench here in Damage Labs for a thorough workout against AMD’s finest, including the new Energy Efficient versions of the Athlon 64 X2. After many hours of testing, we’re pleased to report that the Core 2 chips live up to the hype. Intel has recovered its stride, returned to its winning ways, gotten its groove back, and put the izzle back in its shizzle. Read on for our full review.
Conroe up close
We first previewed the chip code-named Conroe back in March, and now we finally have our hands on one within the confines of our own labs. In spite of all the hype, the Core 2 Duo processor itself is a rather unassuming bloke that looks no different than the Pentium CPUs that preceded it. Like them, it resides in an LGA775-style socket and runs on a 1066MHz front-side bus.
The Core 2 Duo E6700 processor
Also like its most immediate predecessors, the Core 2 Duo is manufactured on Intel’s 65nm fab process. Unlike them, however, the Core 2 Duo is not comprised of two chips crammed together on one package; it’s a native dual-core design with a total of roughly 291 million transistors arranged in an area that’s 143 mm². By contrast, each of the Pentium Extreme Edition 965’s two chips has an estimated 188 million transistors in an 81-mm² die. If you add the two chips together, the Pentium Extreme Edition 965 has more total transistors and a larger total die area than the Core 2 Duo.
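The arithmetic behind that last comparison is easy to check. Here is a quick sketch in Python that uses only the figures quoted above; the variable names are ours, and the Pentium numbers are the per-die estimates from the text.

```python
# Figures quoted in the text; the Pentium values are per-die estimates.
core2_transistors_m = 291         # millions, one native dual-core die
core2_area_mm2 = 143

pentium_die_transistors_m = 188   # millions, per die (estimated)
pentium_die_area_mm2 = 81
pentium_dies = 2                  # two dies share one package

# Add the two Pentium dies together, as the text suggests.
pentium_total_transistors_m = pentium_dies * pentium_die_transistors_m
pentium_total_area_mm2 = pentium_dies * pentium_die_area_mm2

print(f"Pentium EE 965 total: {pentium_total_transistors_m}M transistors, "
      f"{pentium_total_area_mm2} mm^2")
print(f"Core 2 Duo:           {core2_transistors_m}M transistors, "
      f"{core2_area_mm2} mm^2")
```

That works out to 376 million transistors and 162 mm² for the two Pentium dies combined, against 291 million and 143 mm² for the Core 2 Duo.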
Intel plans to offer five flavors of Core 2 processors initially, with prices and features like so:
|Model||Clock speed||Bus speed||L2 cache||TDP||Price|
|Core 2 Extreme X6800||2.93GHz||1066MHz||4MB||75 W||$999|
|Core 2 Duo E6700||2.67GHz||1066MHz||4MB||65 W||$530|
|Core 2 Duo E6600||2.4GHz||1066MHz||4MB||65 W||$316|
|Core 2 Duo E6400||2.13GHz||1066MHz||2MB||65 W||$224|
|Core 2 Duo E6300||1.86GHz||1066MHz||2MB||65 W||$183|
The prices on the mid-range models are quite reasonable once you consider performance, as we’ll do shortly. What you’ll really want to notice about the Core 2 chips, though, is the column labeled TDP. This parameter, thermal design power, specifies the amount of cooling the chip requires, and the numbers are down dramatically from the Pentium Extreme Edition 965’s rating of 130W. Clock speeds are down, as well, since the Core microarchitecture focuses on achieving high performance per clock rather than stratospheric clock frequencies. The fastest Core 2 processor is the X6800 Extreme, which is separated from the regular Core 2 Duos only by its 2.93GHz clock speed and a 10W higher TDP. Oh, and by almost half a grand.
Intel says complete PC systems based on the Core 2 Extreme X6800 and individually boxed products will both begin selling on July 27th, while Core 2 Duo processors with 4MB of L2 cache should show up on August 7th. Intel will be transitioning its CPU production gradually away from Pentiums to Core 2 Duos, and that transition might not happen as quickly as the market would like. I wouldn’t be surprised to see strong demand and short supply of these processors for the next couple of months, until Intel is able to ramp up production volumes. The less expensive versions of the Core 2 Duo with 2MB of L2 cache are initial casualties of this controlled ramp. They aren’t expected to be available until the fourth quarter of this year.
On a brighter note, the supporting infrastructure for Core 2 chips is already fairly well established. The processors should be compatible with a number of chipsets, including the enthusiast-class 975X and the upcoming 965-series mainstream chipsets from Intel. NVIDIA’s nForce4 SLI X16 Intel Edition should work, too, as well as the yet-to-be-released nForce 500 series for Intel. In fact, the Core 2 can act as a drop-in replacement for a Pentium D or Pentium Extreme Edition, provided that the motherboard is capable of supplying the lower voltages that Core 2 processors require. Only the most recent motherboards seem to have Core 2 support, so you’ll want to check carefully with the motherboard maker before assuming a board is compatible. Our Core 2 Duo and Extreme review samples, for example, came from Intel with an updated version of the D975XBX motherboard, since older revisions couldn’t supply the proper voltage.
Speaking of which, the upgrade path for those who buy motherboards for Core 2 processors in the next few months isn’t entirely clear. The server/workstation version of the Core microarchitecture, the Woodcrest Xeon, already rides on a faster 1333MHz front-side bus. The Core 2 Duo may move to this faster bus frequency at some point, but Intel hasn’t revealed a schedule for this move. Intel has revealed plans to deliver “Kentsfield,” a quad-core processor with two Conroe chips in a single package, in early 2007, but we don’t yet know whether current motherboards will be able to support it. Investing in a Core 2-capable motherboard right now might be a recipe for longevity, but it might also be a dead end as far as CPU upgrades are concerned.
What’s with the name?
Before we go on, we should probably take a moment to talk about the Core 2 Duo product name. It’s dreadful, of course, but for deeper reasons than you might think. You see, microprocessors tend to be known by several names throughout their lives, and usually those names aren’t really related. For example, the chip code-named Willamette, based on a microarchitecture called Netburst, became the first product known as Pentium 4. The multiple names may be a little difficult to keep straight, but they’re distinctive and follow a coherent logic.
This chip, however, is different. The microarchitecture is called Core, the chip is code-named Conroe, and the product is called Core 2 Duo. By that logic, the chip code-named Willamette would have been based on the Willette microarchitecture, and the first product might have been the Willette 4 Quadro, which everyone knows is actually a disposable razor.
The Core 2 Duo’s name does make sense from a certain perspective, though, because Intel has been shipping the original Core Duo as a dual-core mobile processor since the beginning of the year. There’s also a single-core version of that processor known as the Core Solo, which explains the whole Duo suffix. And the mobile version of the Core 2 Duo, based on the chip code-named Merom, will be the follow-up to the Core Duo.
So why name the microarchitecture Core? You’ve got me. The Core microarchitecture is a descendant of the one found in the current Core Duo, but it’s been pretty extensively reworked and certainly deserves a new name. The fact that its name matches up with the previous-gen product’s name is confounding. We’ll simply have to, as one Intel employee admonished at the Spring ’06 IDF, “Deal with it.”
The Core microarchitecture
The heritage of the Core microarchitecture can be traced back through the Core Duo and Pentium M, through the Pentium II and III, all the way to the original Pentium Pro. That original design has undergone some serious evolutionary changes, plus a few radical mutations, along the way, and the Core microarchitecture may be the most sweeping set of changes yet. Even compared to its direct forebear, the Core Duo, the Core design can be considered substantially new.
Core’s genesis was a project known internally at Intel as Merom, whose mission was to build a replacement for the Pentium M and Core Duo mobile processors. The Israel-based design team responsible for Intel’s mobile CPUs followed a distinctive design philosophy focused intently on energy efficiency, which helped make the Pentium M a resounding success as part of the Centrino platform. When power and heat became problems for Netburst-based desktop and server processors, Intel turned to Merom as the source of a new, common microarchitecture for its mobile, desktop, and server CPUs.
Because of its orientation toward power efficiency, the Core architecture is a very different design from Netburst. From the very first Pentium 4, Netburst was a “speed demon” type of architecture, a chip designed not for clock-for-clock performance, but to be comfortable running at high clock frequencies. To this end, the original Netburst processors had a relatively long 20-stage main pipeline. For a time, this design achieved good results at the 130nm process node, but all of that changed when Intel introduced a vastly reworked Netburst at 90nm. With its pipeline stretched to 31 stages and its transistor count up significantly, the Pentium 4 “Prescott” still had trouble delivering high clock speeds without getting too hot, and performance suffered as a result.
The Core architecture, meanwhile, is the opposite of a speed demon; it’s a “brainiac” instead. Core has a relatively short 14-stage pipeline, but it’s very “wide,” with ample execution resources aimed at handling lots of instructions at once. Core is unique among x86-compatible processors in its ability to fetch, decode, issue and retire up to four instructions in a single clock cycle. Core can even execute 128-bit SSE instructions in a single clock cycle, rather than the two cycles required by previous architectures. In order to keep all of its out-of-order execution resources occupied, Core has deeper buffers and more slots for instructions in flight.
Like other contemporary PC processors, Core translates x86 instructions into a different set of instructions that its internal, RISC-like core can execute. Intel calls these internal instructions micro-ops. Core inherits the Pentium M and Core Duo’s ability to fuse certain micro-op pairs and send them down the pipeline for execution together, a provision that can make the CPU’s execution resources seem even wider than they are. To this ability, Core adds the capability to fuse some pairs of x86 “macro-ops,” such as compare and jump, that tend to occur together commonly. Not only can these provisions enhance performance, but they can also reduce the amount of energy expended in order to execute an instruction sequence.
Another innovation in Core is a feature Intel has somewhat cryptically named memory disambiguation. Most modern CPUs speculatively execute instructions out of order and then reorder them later to create the illusion of sequential execution. Memory disambiguation extends out-of-order principles to the memory system, allowing for loads to be moved ahead of stores in certain situations. That may sound like risky business, but that’s where the disambiguation comes in. The memory system uses an algorithm to predict which loads are to move ahead of stores, removing the ambiguity.
This optimization can pay big performance dividends.
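To make the prediction idea a bit more concrete, here is a toy sketch in Python. It is emphatically not Intel's algorithm, which isn't described here. It simply assumes one small saturating counter per load, where a load is allowed to move ahead of earlier stores only after it has a clean recent history. The class name, table size, and threshold are all invented for illustration.

```python
class LoadHoistPredictor:
    """Toy load-ahead-of-store predictor (illustrative assumption, not Intel's design)."""

    def __init__(self, table_size=256, threshold=3):
        self.table_size = table_size
        self.threshold = threshold
        # One saturating counter per table entry, all starting at "don't hoist".
        self.counters = [0] * table_size

    def _index(self, load_pc):
        # Hypothetical hash: low bits of the load instruction's address.
        return load_pc % self.table_size

    def may_hoist(self, load_pc):
        # Predict "safe to move ahead of stores" only once the counter saturates.
        return self.counters[self._index(load_pc)] >= self.threshold

    def update(self, load_pc, aliased):
        # aliased=True means the load really did depend on an earlier store.
        i = self._index(load_pc)
        if aliased:
            self.counters[i] = 0  # conflict seen: start over
        else:
            self.counters[i] = min(self.counters[i] + 1, self.threshold)


predictor = LoadHoistPredictor()
load_pc = 0x401000                       # made-up instruction address
history = [False, False, False, True, False]
decisions = []
for aliased in history:
    decisions.append(predictor.may_hoist(load_pc))
    predictor.update(load_pc, aliased)
print(decisions)  # [False, False, False, True, False]
```

In this toy model, the fourth load is hoisted and turns out to conflict with a store, so the counter resets and the predictor goes back to being cautious. A real processor would also have to detect that conflict and re-execute the load.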
In contrast to the various “dual-core” implementations of Netburst, the Core microarchitecture is a natively dual-core design. The chip’s two execution cores each have their own separate, 32K L1 instruction and data caches, but they share a common L2 cache that can be either 2MB or 4MB in size. (The execution trace cache from Netburst is not carried over here.) The chip can allocate space in this L2 cache dynamically on an as-needed basis, dedicating more space to one core than the other in periods of asymmetrical activity. The common cache also eliminates the need for coherency protocol traffic on the system’s front-side bus, and one core can pass data to another simply by transferring ownership of that data in the cache. This arrangement is easily superior to the Pentium D’s approach, where the two cores can communicate and share data only via the front-side bus.
As Intel’s brand-new common microarchitecture, Core is of course equipped with all of the latest features. String ’em together, and you get something like this: MMX, SSE, SSE2, SSE3, SSSE3, EM64T, EIST, C1E, XD, and VT, to name a subset of the complete list. The most notable addition here is probably EM64T, Intel’s name for x86-64 compatibility, because the Core Duo didn’t have it. In order to make its way into desktops and servers, Core needed to be a 64-bit capable processor, and so it is.
The scope and depth of the changes to the Core microarchitecture simply from its direct “Yonah” Core Duo ancestor are too much to cover in a review like this one, but hopefully you have a sense of things. For further reading on the details of the Core architecture, let me recommend David Kanter’s excellent overview of the design.
AMD answers with Energy Efficient Athlons
Anticipating better power efficiency from Intel’s new desktop processors, AMD has begun offering Energy Efficient versions of many of its CPUs for the new Socket AM2 infrastructure. Much like the Turion 64 mobile processor and the HE versions of the Opteron server chips, these Energy Efficient Athlon 64s have been manufactured using a tweaked fabrication process intended to produce chips capable of operating at lower voltages. Making these more efficient chips isn’t easy, so AMD charges a price premium for the Energy Efficient models that averages about 40 bucks over the non-EE versions.
Just as we wrapped up our testing of the Core 2 Duo, a pair of these new Energy Efficient processors arrived from AMD. On the right above is the EE version of the Athlon 64 X2 4600+. AMD rates its max thermal power at 65 W, down from 89W in the stock version. Currently, the X2 4600+ EE commands a $43 price premium over the regular X2 4600+.
The processor on the left above may have the longest product name of any desktop CPU ever: “Athlon 64 X2 3800+ Energy Efficient Small Form Factor.” This long-winded name, though, signals a very frugal personality; AMD rates this processor’s max thermal power at only 35W. Making the leap from the stock version to the EE SFF model will set you back roughly 60 bucks, or you can stop halfway and get the X2 3800+ EE with a 65W TDP for 20 bucks more than the basic 89W version.
By the way, you may be tempted to compare the TDP numbers for the Core 2 Duo with these processors, but there is some risk in doing so. AMD generates its TDP ratings using a simple maximum value, while Intel uses a more complex method that produces numbers that may be less than the processor’s actual peak power use. As a result, direct comparisons between AMD and Intel TDP numbers may not reflect the realities involved.
For all intents and purposes beyond power consumption and the related heat production, the EE versions of the Athlon 64 X2 ought to be identical to the originals. They run at the same clock speeds, have the same feature sets, and should deliver equivalent performance. Because that’s so, and due to limited testing time, we’ve restricted our testing of these Energy Efficient chips to power consumption.
Our testing methods
Please note that the two Pentium D 900-series processors in our test are actually a Pentium Extreme Edition 965 chip that’s been set to the appropriate core and bus speeds and had Hyper-Threading disabled in order to simulate the actual products. Similarly, our Socket AM2 versions of the Athlon 64 X2 4800+, 4600+, and 4200+ are actually the Athlon 64 FX-62 and X2 5000+ clocked down to the appropriate speeds, and the Core 2 Duo E6600 is actually an underclocked Core 2 Extreme X6800. The performance of our “simulated” processor models should be identical to the actual products.
Also, I’ve placed asterisks next to the memory clock speeds of the Socket AM2 test systems in the table below. Due to limitations in AMD’s memory clocking scheme, a couple of these systems couldn’t set their memory clocks to exactly 800MHz.
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
|Processor||Pentium D 950 3.4GHz; Pentium D 960 3.6GHz||Pentium Extreme Edition 965 3.73GHz||Core 2 Duo E6600 2.4GHz; Core 2 Duo E6700 2.66GHz; Core 2 Extreme 2.93GHz||Athlon 64 X2 4200+ 2.2GHz; Athlon 64 X2 4800+ 2.4GHz; Athlon 64 X2 4600+ 2.4GHz; Athlon 64 X2 5000+ 2.6GHz; Athlon 64 FX-62 2.8GHz; Athlon 64 X2 3800+ Energy Efficient; Athlon 64 X2 4600+ Energy Efficient|
|System bus||800MHz (200MHz quad-pumped)||1066MHz (266MHz quad-pumped)||1066MHz (266MHz quad-pumped)||1GHz HyperTransport|
|Motherboard||Intel D975XBX||Intel D975XBX||Intel D975XBX||Asus M2N32-SLI Deluxe|
|North bridge||975X MCH||975X MCH||975X MCH||nForce 590 SLI SPP|
|South bridge||ICH7R||ICH7R||ICH7R||nForce 590 SLI MCP|
|Chipset drivers||INF Update 220.127.116.117; Intel Matrix Storage Manager 18.104.22.1685||INF Update 22.214.171.1247; Intel Matrix Storage Manager 126.96.36.1995||INF Update 188.8.131.527; Intel Matrix Storage Manager 184.108.40.2065||SMBus driver 4.52; IDE/SATA driver 6.67|
|Memory size||2GB (2 DIMMs)||2GB (2 DIMMs)||2GB (2 DIMMs)||2GB (2 DIMMs)|
|Memory type||Crucial Ballistix PC2-8000 DDR2 SDRAM at 800MHz||Crucial Ballistix PC2-8000 DDR2 SDRAM at 800MHz||Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz||Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz*|
|CAS latency (CL)||4||4||4||4|
|RAS to CAS delay (tRCD)||4||4||4||4|
|RAS precharge (tRP)||4||4||4||4|
|Cycle time (tRAS)||15||15||15||12|
|Audio||SigmaTel 5.10.4991.0 drivers||SigmaTel 5.10.4991.0 drivers||SigmaTel 5.10.4991.0 drivers||Integrated nForce 590 MCP/AD1988B with SoundMAX 220.127.116.1190 drivers|
|Hard drive||Maxtor DiamondMax 10 250GB SATA 150|
|Graphics||GeForce 7900 GTX 512MB PCI-E with ForceWare 84.25 drivers; GeForce 7900 GTX 512MB PCI-E with ForceWare 84.21 drivers (WorldBench only)|
|OS||Windows XP Professional x64 Edition; Windows XP Professional with Service Pack 2 (WorldBench only)|
Also, all of our test systems were powered by OCZ GameXStream 700W power supply units. Thanks to OCZ for providing these units for our use in testing.
The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.
We used the following versions of our test applications:
- SiSoft Sandra 2007.5.10.98 64-bit
- CPU-Z 1.33
- Compiled binary of C Linpack port from Ace’s Hardware
- POV-Ray for Windows 3.6.1 64-bit
- SMPOV 4.6
- Cinebench 9.5 64-bit Edition
- 3ds max 8.0
- LAME MT 3.97a 64-bit
- Windows Media Encoder 9 x64 Edition
- Sphinx 3.3
- picCOLOR 4.0 build 561 64-bit
- The Elder Scrolls IV: Oblivion 1.1
- Battlefield 2 1.22
- Quake 4 1.2 with trq4demo5
- FEAR 1.03
- Unreal Tournament 2004 v3369 and 3369 64-bit Edition with trdemo1
- 3DMark06 1.0.2
- WorldBench 5.0
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
We’ll begin our tests with a customary look at memory subsystem performance. These results won’t track with performance in most real-world applications, but they can teach us a thing or two about these processors and how they compare.
Although our Intel motherboard has dual channels of 800MHz DDR2 memory, the Core 2 Duo’s path to main memory is limited by its 1066MHz front-side bus. With their on-chip memory controllers, the Athlon 64 processors can take better advantage of the peak bandwidth offered by two channels of DDR2 memory. That said, the Core 2 Duo doesn’t achieve the same throughput as the Extreme Edition 965, which also rides on a 1066MHz bus. The gap between these two Intel CPU architectures may stem from the algorithms they each use to govern pre-fetching of data from main memory into the L2 cache. The Netburst processor may be more aggressive here in a way that benefits it in this synthetic test.
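For a rough sense of why the front-side bus is the limiting factor, here is a back-of-the-envelope sketch in Python. It assumes 64-bit (8-byte) data paths for both the front-side bus and each DDR2 channel, and it uses the nominal transfer rates from our test configuration (a 1066MHz bus, memory at 800MHz). The function name is ours, and these are theoretical peaks, not measured results.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec, bytes_per_transfer, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


# Assumption: the front-side bus moves 8 bytes per transfer.
fsb_gb_s = peak_bandwidth_gb_s(1066, 8)

# Assumption: each DDR2 channel is also 8 bytes wide; two channels at 800 MT/s.
dram_gb_s = peak_bandwidth_gb_s(800, 8, channels=2)

print(f"Front-side bus peak:    {fsb_gb_s:.1f} GB/s")
print(f"Dual-channel DDR2 peak: {dram_gb_s:.1f} GB/s")
```

Under those assumptions, the bus tops out at roughly 8.5 GB/s while the two memory channels could supply 12.8 GB/s, so the bus caps what the processor can pull from memory.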
Next up is our ancient version of Linpack. This classic benchmark is traditionally used to measure floating-point math performance, but we use this unoptimized version simply to get a look at the “shape” of the memory subsystem. Unfortunately, this rendition of Linpack has a fixed maximum matrix size of 2MB, so we can’t really see how the Core 2’s entire L2 cache or main memory performs. I would have cut these results out of the review entirely, were they not so dramatic.
The Core 2 processors look to have one heck of a fast cache subsystem, at least in the first 2MB. Neither the Pentiums nor the Athlons come close.
Memory bandwidth is important, but memory access latencies are arguably more important, though the two are interrelated. This result is intriguing, because the Core 2 processors manage to achieve much lower access latencies than the Netburst-based Pentiums, despite using the same memory timings on the same type of motherboard. These numbers, however, are just one sample point in a range of possibilities. Let’s look at representatives of the three different microarchitectures in more detail.
The graphs below show results from multiple step and block sizes. I’ve color-coded the graphs to make them easier to read. For each processor, the yellow areas represent block sizes that fit into the L1 data cache, the light orange areas represent L2 cache, and the dark orange areas represent main memory.
The Athlon 64’s built-in memory controller gives it a pronounced and consistent advantage in getting out to main memory quickly, but the Core 2 really does shave 15 to 20 nanoseconds off of main memory access times versus the Pentium Extreme Edition. I hate to speculate too much about the reasons, but they may include the Core 2’s lower latency caches (which we see illustrated here), potentially less aggressive pre-fetching (and thus a less saturated bus), and possibly even its ability to move loads ahead of stores via memory disambiguation.
Oh, and CPU geeks may be interested to note that our latency test app reports the Core 2’s L1 cache latency is three cycles, for what it’s worth.
We tested Quake 4 by running our own custom timedemo with and without its multiprocessor optimizations enabled. These can be switched on in the game console by setting the “r_usesmp” variable to “1”.
Above the following benchmark graph, and throughout most of the tests in this review, we’ve included Task Manager plots showing CPU utilization. These plots were captured on the Pentium Extreme Edition 965, and they should offer some indication of how much impact multithreading has on the operation of each application. Single-threaded apps may sometimes show up as spread across multiple processors in Task Manager, but the total amount of space below all four lines shouldn’t equal more than the total area of one square if the test is truly single-threaded. Anything significantly more than that is probably an indication of some multithreaded component in the execution of the test. Because WorldBench’s tests are entirely scripted, however, we weren’t able to capture Task Manager plots for them, as you’ll notice later.
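The "area under the lines" rule of thumb can be written down as a simple check. The sketch below is in Python and assumes each Task Manager line has been sampled into a list of utilization fractions between 0 and 1. The function names and the 10% slack value are our own inventions, and the sample traces are made up to show the two cases.

```python
def total_cpu_area(plots):
    """Sum the average utilization of every logical CPU.

    The result is in units of "one fully busy CPU", i.e. one square.
    """
    return sum(sum(trace) / len(trace) for trace in plots)


def looks_multithreaded(plots, slack=0.1):
    # "Significantly more" than one square; the slack is an arbitrary assumption.
    return total_cpu_area(plots) > 1.0 + slack


# Made-up traces: a single thread bouncing between four logical CPUs...
single = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# ...and a workload that keeps two of them busy at once.
multi = [[0.9] * 4, [0.8] * 4, [0.1] * 4, [0.1] * 4]

print(total_cpu_area(single), looks_multithreaded(single))
print(total_cpu_area(multi), looks_multithreaded(multi))
```

The bouncing single thread adds up to exactly one square, while the second set of traces adds up to nearly two.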
NVIDIA’s video drivers are now multithreaded, so we should see some amount of multithreading action happening in any application that uses the GPU for 3D graphics, even if the game is only single-threaded.
Just like that, we see a new order being established. The three Core 2 Duo chips capture the top three spots, with even the E6600, at 2.4GHz and $316, outperforming the Athlon 64 FX-62. The Core 2’s advantage over the Athlon 64 X2 is similar to the one the AMD chips have held over the Pentiums for so long. Obviously, there’s utterly no contest between the new Intel processors and their predecessors. Will this pattern hold in other games?
The Elder Scrolls IV: Oblivion
We tested Oblivion by manually playing through a specific point in the game five times for each CPU while recording frame rates using the FRAPS utility. Each gameplay sequence lasted 60 seconds. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent and trustworthy results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
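Here is a minimal Python sketch of that summary step. It assumes the reported average is the mean of the five per-session averages, which is our reading of the method rather than something spelled out above. The numbers in `example_runs` are placeholders chosen to show why the median helps, not measurements.

```python
from statistics import mean, median


def summarize_runs(runs):
    """runs: list of (average_fps, low_fps) pairs, one per gameplay session."""
    averages = [avg for avg, _ in runs]
    lows = [low for _, low in runs]
    # Assumption: mean of the session averages; median of the session lows.
    return mean(averages), median(lows)


# Placeholder values only. Note the single outlier low of 12.
example_runs = [(60.0, 41), (62.0, 39), (59.0, 12), (61.0, 40), (58.0, 42)]
avg_fps, low_fps = summarize_runs(example_runs)
print(avg_fps, low_fps)  # 60.0 40
```

One bad hitch in a single session drags that session's low down to 12, but the median of the five lows stays at 40.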
We set Oblivion’s graphical quality settings to “Medium,” 800×600 resolution, with HDR lighting enabled. Our Oblivion test is a quick run around the Imperial City Arboretum.
The Core 2 processors show up strong again here. Only the E6600 falls behind the FX-62, and the Core 2 Extreme X6800 cranks out roughly twice the average and minimum frame rates of the Pentium Extreme Edition 965.
We used F.E.A.R.’s built-in “test settings” benchmark to get these results. The game’s “Computer” and “Graphics” performance options were both set to “High.”
The Core 2 processors come up big in F.E.A.R., as well, by posting solid gains over the Athlon 64s in both average and minimum frame rates. To give you some sense of how much more effective this microarchitecture is for gaming than Netburst, consider that the Core 2 E6600’s low is 52 frames per second, double that of the Pentium D 950 and an indicator of much smoother gameplay.
We used FRAPS to capture BF2 frame rates just as we did with Oblivion. Graphics quality options were set to BF2’s canned “High” quality profile. This game has a built-in cap at 100 frames per second, and we intentionally left that cap enabled so we could offer a faithful look at real-world performance.
BF2 has been considered something of a system hog in the past, but all of these CPUs are fast enough to run BF2 acceptably. The Core 2 Extreme keeps the game practically locked at its 100 frames-per-second cap.
Unreal Tournament 2004
We used a more traditional recorded timedemo for testing UT2004, but we tried out two versions of the game, the original 32-bit flavor and the 64-bit version.
UT2004 has been a thorn in Intel’s side for ages, but no longer. Core 2 changes the equation entirely. As for the question of 32-bit code versus 64-bit, looks to me like none of the processors gain significantly more than the others by going to the 64-bit executable.
3DMark06 combines the results from its graphics and CPU tests in order to reach an overall score. Here’s how the processors did overall and in each of those tests.
3DMark’s four graphics tests are almost entirely GPU bound, even with our test systems’ GeForce 7900 GTX graphics cards. The Core 2 chips gain their advantage in the overall 3DMark score by doing very well in the two CPU tests.
WorldBench overall performance
WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests.
Well, Core 2 processors don’t just excel at 3D gaming. They’ve also taken the top three spots in WorldBench, and their margins of victory are impressive. The Core 2 Extreme opens up a huge lead on the Athlon 64 FX-62, setting a new WorldBench record (at least for us) in the process. Oh, and the Pentium Extreme Edition is a staggering 43 points behind the Core 2 Extreme X6800.
Audio editing and encoding
LAME MP3 encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. (Of course, multithreading works even better on dual-core processors.) You can download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, “In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one.”
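The structure of that two-stage pipeline can be sketched in a few lines of Python. This is not LAME MT's code. The `analyze` and `encode` functions are trivial stand-ins we made up, and the one-slot queue is our way of modeling "one frame ahead." Standard CPython threads also won't speed up CPU-bound work like this, so treat it purely as an illustration of the hand-off.

```python
import queue
import threading


def analyze(frame):
    # Stand-in for the psycho-acoustic analysis (hypothetical).
    return sum(frame) / len(frame)


def encode(frame, analysis):
    # Stand-in for the rest of the encoder (hypothetical).
    return (len(frame), analysis)


def pipelined_encode(frames, depth=1):
    """Run analysis on one thread and encoding on another, in frame order."""
    handoff = queue.Queue(maxsize=depth)  # buffers finished analysis results
    output = []

    def analysis_stage():
        for frame in frames:
            handoff.put((frame, analyze(frame)))  # blocks when the buffer is full
        handoff.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_stage)
    worker.start()
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        output.append(encode(frame, analysis))
    worker.join()
    return output


frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
result = pipelined_encode(frames)
print(result)  # [(3, 2.0), (3, 5.0), (3, 8.0)]
```

Because frames flow through a single queue in order, the output is the same as a plain sequential loop would produce, which is the debugging advantage the author is getting at.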
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in many of our previous CPU reviews.
There are no real surprises here. The Core 2 processors excel at audio encoding and editing, just as they seem to everywhere else.
Video editing and encoding
Windows Media Encoder x64 Edition Advanced Profile
We asked Windows Media Encoder to convert a gorgeous 1080-line WMV HD video clip into a 320×240 streaming format using the Windows Media Video 8 Advanced Profile codec.
Windows Media Encoder
VideoWave Movie Creator
Intel’s new processors have the edge in video encoding, as well.
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded.
Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
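For anyone who wants to see the indexing spelled out, here is a tiny Python sketch. It assumes the score is simply the reference machine's time divided by the test machine's time. How picCOLOR actually combines its 12 functions into one number isn't covered here, and the timings below are invented to reproduce the 4.14 example.

```python
def indexed_score(reference_seconds, measured_seconds):
    """Performance relative to the reference machine (assumed: ratio of run times)."""
    return reference_seconds / measured_seconds


# Invented timings: the reference system takes 41.4 s, the test system 10.0 s.
score = indexed_score(41.4, 10.0)
print(f"{score:.2f}")
```

A machine that finishes in a bit under a quarter of the reference time lands at 4.14 on this scale.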
Add image processing to the list of categories that the Core 2 processors handle well. Notice that in picCOLOR, the Core 2 Extreme X6800 works out to be about 11 times the speed of a Pentium III 1GHz. How’s that for progress?
Multitasking and office applications
Mozilla and Windows Media Encoder
Two of these three tests, the MS Office and Mozilla plus Windows Media Encoder ones, attempt to simulate real-world user multitasking. The Core 2 processors handle them both very well, unsurprisingly.
Sphinx speech recognition
Ricky Houghton first brought us the Sphinx benchmark through his association with speech recognition efforts at Carnegie Mellon University. Sphinx is a high-quality speech recognition routine. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance.
The Core 2 Extreme busts out a new record in Sphinx, needing only about 30% of its power to run this high-quality speech recognition routine in real time.
WinZip is another impressive victory for the Core 2 CPUs, while the field is very tight (and probably largely I/O bound) in the Nero test.
3D modeling and rendering
Cinebench measures performance in Maxon’s Cinema 4D modeling and rendering app. This is the 64-bit version of Cinebench, primed and ready for these 64-bit processors.
This one is another victory for the Core 2, but the contest is a little closer this time, with Athlon 64 processors taking second and fourth places.
Cinebench’s shading tests are single-threaded, and they allow us to compare the performance of shading with the Cinema 4D engine and software OpenGL with GPU-accelerated OpenGL. Our three Core 2 processors excel in all cases, whether they are doing the shading themselves or feeding a GPU.
POV-Ray just recently made the move to 64-bit binaries, and thanks to the nifty SMPOV distributed rendering utility, we’ve been able to make it multithreaded, as well. SMPOV spins off any number of instances of the POV-Ray renderer, and it will divvy up the scene in several different ways. For this scene, the best choice was to divide the screen horizontally between the different threads, which provides a fairly even workload.
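To illustrate the idea of carving up the screen, here is a rough sketch of splitting a frame into contiguous bands of rows, one per renderer instance. This is not SMPOV's actual code; the function name and the 600-row frame height are assumptions made for the example.

```python
def split_rows(height: int, workers: int) -> list[tuple[int, int]]:
    """Split rows 0..height-1 into `workers` contiguous bands.

    Each band is a (start, stop) pair with `stop` exclusive. Leftover rows
    are handed out one apiece to the first few bands so that no band is
    more than one row taller than another.
    """
    if height <= 0 or workers <= 0:
        raise ValueError("height and workers must be positive")
    base, extra = divmod(height, workers)
    bands = []
    start = 0
    for i in range(workers):
        stop = start + base + (1 if i < extra else 0)
        bands.append((start, stop))
        start = stop
    return bands


# Hypothetical 600-row frame split for two and four renderer instances.
print(split_rows(600, 2))
print(split_rows(600, 4))
```

Equal-sized bands only give an even workload when scene complexity is spread fairly evenly across the frame, which is why the best split can vary from scene to scene.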
We considered using the new beta of POV-Ray with native support for SMP, but it proved to be very, very slow. We’ll have to try it again once development has progressed further.
POV-Ray rendering has been difficult turf for Netburst-based CPUs to defend, but the Core 2 is much more competitive. Note that the Core 2 processors don’t seem to scale as well when going from one thread to two as the Athlon 64s. I’m not entirely confident that’s not the fault of a quirk in the latest version of SMPOV, so I wouldn’t read too much into it. Using an external program to call the renderer has its perils.
3ds max 8 rendering
For our 3ds max test, we used the “architecture” scene from SPECapc for 3ds max 7. This scene is very complex and should be nice exercise for these CPUs. Using 3ds max’s default scanline renderer, we first rendered frames 0 to 10 of the scene at 500×300 resolution. The renderer’s “Use SSE” option was enabled.
Next, we rendered just the first frame of the scene in 3ds max’s mental ray renderer. Notice that we’ve changed our time scale from seconds to minutes for this one.
Check out those render times with mental ray. Yeowtch. The Core 2 Extreme finishes nine minutes before the FX-62.
SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX and SSE/2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:
This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assigns [sic] each thread to a different CPU.
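The interlacing scheme described in the FAQ can be sketched roughly as follows. This is our own illustration, not Sandra's code: it uses a fixed stride (thread 0 takes columns 0, N, 2N and so on), whereas the FAQ's wording could also describe handing out columns dynamically, and the coordinate ranges and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_ITER = 255  # iteration cap quoted in the FAQ


def escape_count(cx: float, cy: float, max_iter: int = MAX_ITER) -> int:
    """Count iterations of z = z*z + c before |z| exceeds 2."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render_columns(worker: int, workers: int, width: int, height: int) -> dict:
    """Compute every `workers`-th column, starting at column `worker`."""
    columns = {}
    for x in range(worker, width, workers):
        cx = -2.5 + 3.5 * x / width  # assumed view window on the real axis
        columns[x] = [
            escape_count(cx, -1.25 + 2.5 * y / height) for y in range(height)
        ]
    return columns


def render(workers: int, width: int = 640, height: int = 480) -> list:
    """Render the fractal column by column using interleaved workers."""
    columns = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(render_columns, w, workers, width, height)
            for w in range(workers)
        ]
        for future in futures:
            columns.update(future.result())
    return [columns[x] for x in range(width)]


# Small frame so the demo finishes quickly; the benchmark uses 640x480.
image = render(workers=2, width=64, height=48)
print(len(image), len(image[0]))
```

Interleaving columns spreads the expensive regions of the fractal across all threads, so no single thread gets stuck with the slow middle of the image. Note that CPython threads won't actually run this math in parallel; the sketch shows only how the work is divided.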
We’re using the 64-bit port of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations at once.
The Core microarchitecture’s rich execution resources are on display here. Netburst-based chips perform relatively well in these tests, probably because they are executing what I believe is a fairly simple, well-optimized program loop at a very high clock speed. The Core 2 processors, however, have the ability to handle four 128-bit floating-point operations per cycle, or eight 64-bit floating-point operations per cycle, which is considerably more work per clock than competing microarchitectures. The E6600 at 2.4GHz doesn’t quite double the performance of the X2 4600+ at 2.4GHz, but it’s close.
We took our power readings at the wall outlet using an Extech 380803 power meter. Only the PC was plugged into the watt meter; the system’s monitor and speakers, for instance, were not. The “idle” readings were taken at the Windows desktop, while the “load” readings were taken using SMPOV and the 64-bit version of the POV-Ray renderer to load up the CPUs. In all cases, we asked SMPOV to use the same number of threads as there were CPU front ends in Task Manager, so four for the Extreme Edition 965 and two for the Core 2 and Athlon 64 X2 processors. The test rigs were all equipped with OCZ GameXStream 700W power supply units.
The graph below for idle power use has results with and without “power management.” By “power management,” we mean the dynamic clock speed and voltage throttling technologies from Intel and AMD, known as SpeedStep and Cool’n’Quiet, respectively. The Intel processors also have an enhanced halt state known as C1E. A processor’s halt state is invoked by the OS whenever the system is able to sit idle for a moment. The C1E halt state in the Intel processors ramps down the CPU clock speed and voltage in order to save power, so even without SpeedStep, the CPU’s idle power use is reduced. Keep that in mind when considering the “No power management” results for the Intel processors at idle.
Interestingly, we found that the Core 2’s C1E state doesn’t lower CPU voltage. The CPU multiplier drops to 6.0, bringing the clock speed down to 1.6GHz, but voltage appears to remain unchanged. Turning on SpeedStep, however, drops the CPU’s core voltage, allowing for even lower idle power use.
Another tricky part about power consumption testing is getting good numbers for our “simulated” CPU speed grades. In order to make it work, you have to set the proper CPU core voltage, not just the right clock speeds. I made an attempt at simulating the Athlon 64 X2 models 4800+, 4600+, and 4200+ and the Pentium D 950/960 by setting the CPU voltages manually, but I’ve put an asterisk next to those CPUs in our results as a reminder that they’re simulated. I didn’t even bother including some simulated CPU models because of the difficulty involved and a few questionable results.
For the Athlon 64 X2 4800+, I set the voltage at 1.35V. The X2 4600+ and 4200+ were set to 1.3V. The “power management” idle scores were simply taken from chips with the same cache size (the FX-62 and 5000+, respectively), because all of these processors share the same 1 GHz/1.1V idle with Cool’n’Quiet.
The Pentium D 950 and 960 were trickier, since each Pentium D’s voltage needs are programmed at the factory. In this case, I stuck with the default of 1.312V for both speed grades. On an 800MHz bus, the Pentium D 950 and 960 both clocked down to 2.4GHz at idle via the C1E halt mechanism. The Extreme Edition 965 clocked down to 3.2GHz at idle.
You’ll notice that the results below include numbers for the Energy Efficient versions of the Athlon 64 X2 3800+ and 4600+. AMD sent these CPUs out to us along with a more power-efficient motherboard than our Asus M2N32-SLI Deluxe test platform, whose nForce 590 SLI chipset seems to be something of a power hog. The board AMD sent, however, is not an enthusiast-class mobo with dual graphics slots, so we elected not to include it in our tests. We wanted to test the EE chips opposite the Core 2 Duo on an enthusiast-class board, so we stuck with the M2N32-SLI Deluxe. It’s possible that enthusiast-class boards based on the Radeon Xpress 3200 or the nForce 570 SLI chipsets could lower power consumption for all of the Athlon 64 processors here without compromising performance.
Whoa. Performance is way up with the Core 2 processors, and power draw is way down. The Core 2’s mythical “performance per watt” (which is actually a rather slippery thing to quantify) has gotta be the best on the market. The Core 2 Duo E6700 outperforms the Athlon 64 FX-62 more often than not, yet the E6700-based system draws 74 fewer watts under load.
AMD has made substantial progress on this front with its new Energy Efficient processors. Under load, the Athlon 64 X2 4600+ EE system pulls about 20W less at the wall socket than the stock X2 4600+ system, and the 35W-rated X2 3800+ system draws less power than anything else we tested. Still, even these new CPUs can’t match the performance of the Core 2 processors that are in the same neighborhood in terms of power draw.
Like many of the thousand-dollar, high-end processors of late, the Core 2 Extreme X6800 has an unlocked multiplier, so overclocking this beast is simply a matter of turning up that value in the BIOS, even on an Intel motherboard. With very little effort, I had the X6800 running at 3.46GHz on a 1066MHz bus. This was with air cooling: a Zalman CNPS9500 LED. That’s very good air cooling, yes, but nothing terribly exotic.
I had to raise the voltage from the stock 1.2V to 1.375V in order to get it to be completely stable, but at those settings, the CPU ran a pair of Prime95 torture tests for a good while without throwing any errors. With the cooler fan running at its top speed, the CPU leveled out at about 70°C while running those tests.
I tried to coax the X6800 into running at the next multiplier up the ladder, for a top speed of 3.73GHz, but it wouldn’t quite go there. Even with 1.45V flowing into it, the X6800 would POST but wouldn’t boot into Windows. I suspect this CPU could go well past 3.46GHz with some bus overclocking, but I haven’t tried it yet. Even at “only” 3.46GHz, performance was astounding.
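The clock speeds quoted here fall out of simple multiplication, sketched below. The 266.67MHz base clock is our assumption: it is one quarter of the quad-pumped 1066MHz bus rate, and it squares with the 6.0 multiplier yielding 1.6GHz at idle. The 13x and 14x multipliers are inferred from the 3.46GHz and 3.73GHz figures rather than read off the BIOS screen, and the function name is hypothetical.

```python
BASE_CLOCK_MHZ = 266.67  # assumed: 1066MHz effective bus rate divided by four


def core_clock_mhz(multiplier: float, base_mhz: float = BASE_CLOCK_MHZ) -> float:
    """Core clock is the CPU multiplier times the bus base clock."""
    if multiplier <= 0 or base_mhz <= 0:
        raise ValueError("multiplier and base clock must be positive")
    return multiplier * base_mhz


# 6x is the C1E idle multiplier; 13x and 14x are the inferred overclock settings.
for multiplier in (6, 13, 14):
    print(f"{multiplier}x -> {core_clock_mhz(multiplier):.0f} MHz")
```

Bus overclocking changes the other term: raising the base clock lifts the core clock at any given multiplier, which is the route left to try for speeds between the multiplier steps.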
This architecture has more headroom than a Ford Excursion. Performance scales up very well with clock speed, too. In fact, these scores may give us some insight into why Intel chose not to move to a 1333MHz front-side bus for the Core 2: the 1066MHz bus just doesn’t look like a serious performance bottleneck.
After years of wandering in the wilderness, Intel has recaptured the desktop CPU performance title in dramatic fashion. Both the Core 2 Extreme X6800 and the Core 2 Duo E6700 easily outperform the Athlon 64 FX-62 across a range of applications, and the E6600 is right in the hunt, as well. Not only that, but the Core 2 processors showed no real weaknesses in our performance tests. (I would say that Core looks like a more balanced architecture than Netburst, but at this stage of the game, Netburst just seems slow almost across the board.) No matter what you’re hoping to do with your PC, a Core 2 processor should be a very solid choice.
The PC industry can also breathe a collective sigh of relief about power and thermal issues now that Core 2 has arrived. Intel finally has a firm handle on those problems. These processors consume less power, and thus produce less heat, than desktop Pentiums have for quite a while. The E6700 system’s total power draw when fully loaded was 156W, only 14W more than the Pentium Extreme Edition system drew while sitting idle. What’s more, even the high-end Core 2 processors’ power use was in line with that of the Energy Efficient versions of the Athlon 64 X2. That leaves room for many good things to happen, from less expensive cooling systems to quieter, smaller enclosures and even some righteous overclocking. Combine the low power draw with the performance we’ve seen, and the Core 2 is clearly the most energy-efficient desktop processor around.
As much as I appreciate the performance and efficiency of these new CPUs, though, I can’t endorse forking out a cool grand (minus one) for a Core 2 Extreme X6800. These top-end CPUs are always iffy values, even if they’re insane performers. Meanwhile, the prices on the first two Core 2 Duos are very reasonable for what you get. At $316, the Core 2 Duo E6600 looks like a tremendous deal, provided you can get your hands on one. The E6700 is pricier at $530, but it’ll beat the much more expensive FX-62 at almost every turn.
In fact, after seeing the Core 2 in action, many folks may be wondering how AMD is going to keep up. The Athlon 64 X2 4200+ currently lists for more than the Core 2 Duo E6600, and that’s just not gonna cut it. Fortunately, AMD has confirmed to us that a major price move is coming in July. We don’t have the specifics just yet, but they say they intend to maintain a competitive price-performance ratio. That may mean we’ll see the dramatic price cuts rumored to be coming, which would be a good start.
For its next trick, AMD needs to get its 65nm fab process going ASAP. I’ve heard prognostications that AMD won’t be able to compete against Core 2 chips with its current AMD64 microarchitecture. That may be the case, but I’m not entirely convinced. The contest we’ve seen in the preceding pages pitted CPUs manufactured on AMD’s 90nm process against CPUs made on Intel’s 65nm process. The Netburst fiasco at 90nm has made us forgetful about the benefits of process shrinks, but they can be substantial. AMD could be in a much stronger position if it gets to 65nm quickly.
Regardless of what happens with its competition, though, the big story here is that Intel has replaced its troubled Netburst microarchitecture with a world-beater. The Core microarchitecture and the chips based on it are a huge improvement, and a fitting end to the era of the Pentium.