We’ve been hearing a lot of doom and gloom about the prospects for microprocessors in the past few years. Pundits have told us that Moore’s Law is destined to hit physical limitations that will bring the incredible every-two-years doubling of CPU power to a screeching halt, and probably sooner rather than later, since we’ve already seen CPUs run into heat and clock speed barriers. They have a point, up to a point. But the constant hum-drum of negativity begins to sound dated as time marches further and further away from Intel’s fiasco with the 90nm Pentium 4.
In fact, such thoughts seem like ancient history today, as we get our first look at the desktop version of Intel’s quad-core processors manufactured on its 45nm fabrication process. This new process not only packs in twice as many transistors as 65nm, but also employs new materials to deliver reductions in electrical current leakage. These changes add up to the sort of generational improvement that transports old codgers like me back to the roaring 1990s, when the horizon for CPU progress seemed limitless.
Of course, these days, Intel has hedged its bets by multiplying the number of cores per processor and ramping up the cadence of design innovations to those cores. The result? The new Core 2 Extreme QX9650 quad-core processor promises big reductions in power consumption and heat production, along with performance increases of up to 20%, at the same 3GHz clock speed as the chip that preceded it. Not that there’s anything wrong with that. In fact, this processor could make the prophets of doom and gloom look like downright fuddy-duddies, if you know what I mean. Keep reading to see whether the QX9650 puts a clown suit on the doubters.
The Penryn lands in the Yorkfield
Those of you sick, sick people who follow CPUs closely are probably already familiar with the bevy of code-names involved here, but I’ll recount the major points for the healthier among us. True to Moore’s Law, Intel’s code names double every 18 to 24 months, so there’s much to track. The most relevant names for our present discussion are Penryn and Yorkfield. Penryn is the name of the basic building block of Intel’s entire 45nm lineup; it is the dual-core 45nm processor design on which most of Intel’s mobile, desktop, and server products will be based. Yorkfield is the first desktop implementation of Penryn, and it’s a two-fer special, situating two dual-core chips together nice and cozy-like in a single LGA775-style package, just as Intel’s Kentsfield quad cores like the QX6850 did before it. The Core 2 Extreme QX9650 will be the first version of Yorkfield to hit the streets.
While we’re dropping names, we should probably enter a couple of others into the discussion. Yorkfield is arriving right on time for a generational battle with its somewhat tardy opponent, AMD’s Phenom processor. The Phenom is based on AMD’s K10 design, and unlike Yorkfield, it incorporates four cores natively onto a single chip, or at least it will when it arrives later this month. We’ve already shown you a preview of this microarchitectural battle in the heavyweight division with our previews of AMD’s K10-based quad-core “Barcelona” Opterons and Intel’s 45nm “Harpertown” Xeons. Now we have a chance to reprise this contest on the desktop, starting with the QX9650.
As I’ve mentioned, the key to the QX9650’s advances is Intel’s new 45nm fab process, which represents a fundamental change in the structure of the transistors on a chip. Intel says it’s the biggest advancement in transistor technology since the late 1960s, although this is clearly an evolutionary step. The transistor combines a high-capacitance gate oxide, made of hafnium, with a metal gate, and it delivers some eye-popping purported advantages in addition to the customary doubling of transistor density. Among them, Intel claims, are a 30% reduction in switching power, an improvement of over 20% in switching speed, and a more-than-10X reduction in gate oxide leakage. In layman’s terms, that means 45nm chips should be smaller, run faster, and consume less power than Intel’s 65nm parts, which were already quite good.
Each dual-core Penryn chip crams roughly 410 million transistors into a space of 107 mm². By contrast, the dual-core 65nm Conroe chips fit fewer transistors, 341 million, into a larger 143 mm² die area. Intel has to produce two good chips in order to make one Yorkfield processor, but the small die area involved should make things relatively easy, in terms of avoiding defects and keeping yields high. AMD, on the other hand, has chosen tighter integration and a higher degree of difficulty via a single-chip approach to quad-core processors; each of its upcoming Phenom chips packs 463M transistors into a 283 mm² die via AMD’s 65nm fab process.
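For those who like to see the arithmetic, here is a quick sketch of the transistor density implied by those numbers. It uses only the transistor counts and die areas quoted above; the `density` helper and the dictionary layout are my own invention for illustration.

```python
# Transistor counts (in millions) and die areas (in mm^2) as quoted in the text.
dies = {
    "Penryn (45nm, dual-core)": (410, 107),
    "Conroe (65nm, dual-core)": (341, 143),
    "Phenom (65nm, quad-core)": (463, 283),
}


def density(transistors_m: float, area_mm2: float) -> float:
    """Return millions of transistors per square millimeter."""
    return transistors_m / area_mm2


for name, (transistors_m, area_mm2) in dies.items():
    print(f"{name}: {density(transistors_m, area_mm2):.2f}M transistors per mm^2")
```

Keep in mind that density depends heavily on how much of a die is cache versus logic, so this is a rough comparison rather than a pure measure of the fab process.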
Penryn isn’t quite so revolutionary on the CPU design front, since it’s based on the same basic microarchitecture as previous Core 2 chips. It ain’t exactly chopped liver, either, since the Core 2 chips are the fastest desktop processors around. What’s more, Intel’s chip architects have endowed Penryn with more than its fair share of new tricks and tweaks. The most visible of those tweaks is a larger (6MB) and smarter (24-way set associative) L2 cache on each chip, shared between the two cores. (That works out to 12MB of total L2 cache in a Yorkfield processor, for my fellow liberal arts degree holders.)
With the QX9650, Yorkfield begins life riding a 1333MHz front-side bus like older Core 2 CPUs, but that’s not likely to be the limit forever. Penryn-based Xeons will start out on a 1600MHz FSB, and Intel has already demoed a Core 2 Extreme QX9770 with a 1.6GHz bus.
Both the larger cache and faster bus are traditional vehicles for performance gains, but Penryn has some internal execution tweaks, as well. The chip features a new divider, capable of handling both integer and floating-point math. The divider’s radix-16-based design lets it process four bits per cycle (up from two bits in previous chips) and includes an optimized square root function. The divider has an early-out mechanism that can reduce instruction latencies in some cases, too.
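To get a feel for why retiring four bits per cycle matters, consider this toy digit-recurrence divider. It is only a sketch under my own assumptions: Intel hasn't published the divider's internals here, the 32-bit operand width is arbitrary, and the function name and parameters are hypothetical. The point is simply that a radix-16 design needs half as many iterations as a radix-4 design for the same operand.

```python
def divide_digit_recurrence(dividend: int, divisor: int,
                            bits_per_step: int, width: int = 32):
    """Unsigned division that retires `bits_per_step` quotient bits per iteration.

    bits_per_step=2 models a radix-4 divider, bits_per_step=4 a radix-16 one.
    Returns (quotient, remainder, iterations).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    if width % bits_per_step != 0:
        raise ValueError("width must be a multiple of bits_per_step")
    if not 0 <= dividend < (1 << width):
        raise ValueError("dividend does not fit in the operand width")

    mask = (1 << bits_per_step) - 1
    quotient = 0
    remainder = 0
    iterations = 0
    # Walk the dividend from its most significant digit to its least.
    for shift in range(width - bits_per_step, -1, -bits_per_step):
        digit_in = (dividend >> shift) & mask
        remainder = (remainder << bits_per_step) | digit_in
        # Real hardware picks the quotient digit with comparators or a lookup
        # table; integer division on this small value stands in for that step.
        digit_out = remainder // divisor
        remainder -= digit_out * divisor
        quotient = (quotient << bits_per_step) | digit_out
        iterations += 1
    return quotient, remainder, iterations


for bits in (2, 4):
    q, r, steps = divide_digit_recurrence(1000, 7, bits)
    print(f"radix-{1 << bits}: quotient={q} remainder={r} iterations={steps}")
```

An early-out mechanism like the one Intel describes would cut the loop short when the remaining digits are known to be zero; this sketch leaves that out.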
Penryn also extends the Core microarchitecture’s 128-bit single cycle SSE capabilities to shuffle operations, potentially doubling execution throughput for certain tasks, including the formatting of data for other SSE-based vector operations.
Another common vehicle for performance advances is the addition of tailored instructions for specific uses. Penryn has some of those, too, in the form of SSE4. SSE4 comprises 47 instructions aimed at HD video acceleration, basic graphics operations (including dot products), and the integration and control of coprocessors over PCI Express links. Developers will have to update their applications and compilers in order to take advantage of these instructions, of course. Fortunately, we’ve been able to include an SSE4-enabled video compression codec in our test suite, as you’ll see.
As the first desktop-oriented derivative of Penryn, the Core 2 Extreme QX9650 is very much a premium product. Like Intel’s other Extreme Editions, the QX9650 has an unlocked upper multiplier and will probably sport a price tag around a grand. Since it drops into LGA775-style sockets, the QX9650 is compatible with many newer Intel-oriented motherboards, especially those based on Intel’s P35 and X38 chipsets, usually with the help of a BIOS update. You’ll want to check with the mobo maker to see whether a particular board supports the QX9650.
As for cooling, Intel officially lists the QX9650’s TDP at 130W, like past Core 2 Extreme processors. I think that’s crazy conservative, like the love-child of Ann Coulter and Pat Buchanan, for reasons that will become clear once you see how it looks on the power meter.
And, as I’ve said, the QX9650 runs at 3GHz on a 1333MHz bus, just like the 65nm Core 2 Extreme QX6850 did before it. The comparison between these two CPUs should give us a nice look at how Penryn/Yorkfield’s architectural tweaks boost clock-for-clock performance.
Before we move on to our results, I should mention that this is an early preview of the QX9650. This product is officially slated to debut, and become available for purchase, on November 12. Intel plans to introduce several 45nm Xeons at the same time, but that will be it for a while. Additional Penryn-based desktop processors, both dual- and quad-core, aren’t expected until early next year.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
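In code form, that policy amounts to something like the sketch below. The function name and the sample timings are hypothetical, not taken from our results.

```python
from statistics import mean


def summarize_runs(results):
    """Average repeated benchmark runs, insisting on at least three of them."""
    if len(results) < 3:
        raise ValueError("need at least three runs")
    return mean(results)


# Hypothetical timings, in seconds, for three runs of one test.
print(summarize_runs([41.2, 40.8, 41.0]))
```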
Our test systems were configured like so:
|Processor||Core 2 Quad Q6600 2.4GHz, Core 2 Extreme QX6800 2.93GHz||Core 2 Duo E6750 2.66GHz, Core 2 Extreme QX6850 3.00GHz, Core 2 Extreme QX9650 3.00GHz||Dual Xeon X5365 3.00GHz||Athlon 64 X2 5600+ 2.8GHz, Athlon 64 X2 6000+ 3.0GHz, Athlon 64 X2 6400+ 3.2GHz||Dual Athlon 64 FX-74 3.0GHz|
|System bus||1066MHz (266MHz quad-pumped)||1333MHz (333MHz quad-pumped)||1333MHz (333MHz quad-pumped)||1GHz HyperTransport||1GHz HyperTransport|
|Motherboard||Gigabyte GA-P35T-DQ6||Gigabyte GA-P35T-DQ6||Intel S5000VXN||Asus M2N32-SLI Deluxe||Asus L1N64-SLI WS|
|North bridge||P35 Express MCH||P35 Express MCH||5000X MCH||nForce 590 SLI SPP||nForce 680a SLI|
|South bridge||ICH9R||ICH9R||6231 ESB ICH||nForce 590 SLI MCP||nForce 680a SLI|
|Chipset drivers||INF Update 188.8.131.523, Intel Matrix Storage Manager 7.5||INF Update 184.108.40.2063, Intel Matrix Storage Manager 7.5||INF Update 220.127.116.113, Intel Matrix Storage Manager 7.5||ForceWare 15.01||ForceWare 15.01|
|Memory size||4GB (4 DIMMs)||4GB (4 DIMMs)||4GB (4 DIMMs)||4GB (4 DIMMs)||4GB (4 DIMMs)|
|Memory type||Corsair TWIN3X2048-1333C9DHX DDR3 SDRAM at 1066MHz||DDR3 SDRAM at 1333MHz||Samsung ECC DDR2-667 FB-DIMM at 667MHz||DDR2 SDRAM at ~800MHz||DDR2 SDRAM at ~800MHz|
|CAS latency (CL)||8||8||5||4||4|
|RAS to CAS delay (tRCD)||8||9||5||4||4|
|RAS precharge (tRP)||8||9||5||4||4|
|Cycle time (tRAS)||20||24||15||18||18|
|Audio||Realtek 18.104.22.16849 drivers||Realtek 22.214.171.12449 drivers||Realtek 126.96.36.19949 drivers||Integrated nForce 590 MCP/AD1988B with Soundmax 188.8.131.5200 drivers||Integrated nForce 680a SLI/AD1988B with Soundmax 184.108.40.20600 drivers|
|Hard drive||WD Caviar SE16 320GB SATA|
|Graphics||GeForce 8800 GTX 768MB PCIe with ForceWare 163.11 and 163.71 drivers|
|OS||Windows Vista Ultimate x64 Edition|
|OS updates||KB940105, KB929777 (nForce systems only), KB938194, KB938979|
Please note that testing was conducted in two stages. Non-gaming apps and Supreme Commander were tested with Vista patches KB940105 and KB929777 (nForce systems only) and ForceWare 163.11 drivers. The other games were tested with the additional Vista patches KB938194 and KB938979 and ForceWare 163.71 drivers.
Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.
Our primary test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket Xeon and Quad FX systems were powered by PC Power & Cooling Turbo-Cool 1KW-SR power supplies. Thanks to OCZ for providing these units for our use in testing.
Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.
The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.
We used the following versions of our test applications:
- SiSoft Sandra XI.SP4a 64-bit
- CPU-Z 1.40
- WorldBench 6 beta 2
- Team Fortress 2
- Lost Planet: Extreme Condition with DirectX 10
- BioShock 1.0 with DirectX 10
- Supreme Commander 1.1.3260
- Valve VRAD map build benchmark
- Valve Source Engine particle simulation benchmark
- Cinebench R10 64-bit Edition
- POV-Ray for Windows 3.7 beta 21a 64-bit
- CASE Lab Euler3d CFD benchmark multithreaded edition
- MyriMatch proteomics benchmark
- notfred’s Folding benchmark CD 8/8/07 revision
- picCOLOR 4.0 build 598 64-bit
- The Panorama Factory 4.5 x64 Edition
- Windows Media Encoder 9 x64 Edition
- LAME MT 3.97a 64-bit
- VirtualDub 1.7.6 with DivX 6.7
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
We’ll start, as ever, with some quick synthetic tests of the memory subsystem, which will help give us the lay of the land before we dive into our real-world benchmarks.
The QX9650 easily surpasses the QX6850 here, probably because it can prefetch more data into its larger L2 cache and thus effectively transfer more data. There are clear striations here among the Intel processors based on bus speed, with the CPUs on the 1066MHz bus at the back of the pack. The top spots all go to Athlon 64 processors, whose integrated memory controllers are very tough to beat with a front-side bus-based system architecture.
This useful little test gives us a look at L2 cache bandwidth. You’ll notice that it’s multithreaded, so systems with more cores show up as having higher L2 cache bandwidth. Not just one processor or cache is being measured. As a result, the dual-socket quad-core Xeon X5365 (65nm) soars above everything else. We’ve included this system because it was marketed to enthusiasts as part of Intel’s “V8” media creation platform, an answer of sorts to AMD’s dual-socket Quad FX platform, represented here by the Athlon 64 FX-74. I’m happy to be able to include these systems as a curiosity, especially since the FX-74 is AMD’s only quad-core solution for the desktop, but they both have their quirky performance drawbacks as well as benefits that I won’t discuss in too much detail, lest they become a distraction. Besides, as I’ve mentioned before, the Xeons are total show-offs.
Back to the QX9650, its L2 cache bandwidth mirrors that of its 65nm predecessor until we reach the 16MB test block size, where its larger L2 cache grants it a slight advantage.
The QX9650’s memory access latencies also mirror those of the QX6850, despite the QX9650’s larger L2 cache. That’s impressive, though perhaps not quite as impressive as the roughly 15ns advantage the Athlon 64 X2’s integrated memory controller gives it.
We can look at this issue in a little more detail. In the graphs below, yellow represents L1 cache, light orange is L2 cache, and dark orange is main memory.
We measured the QX9650’s 6MB L2 cache latency at 15 cycles, just one cycle more than the smaller 4MB L2 cache in the QX6850. Larger caches tend to bring latency penalties with them, but the smarter L2 in Penryn has barely any penalty at all. That helps explain why the QX9650’s memory access latencies are effectively equivalent to the older chips.
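If you’d rather think in nanoseconds than cycles, the conversion is simple enough to sketch. The helper name below is my own; the inputs are the 15-cycle and 14-cycle latencies above and the 3GHz clock both chips share.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds."""
    return cycles / clock_ghz


for chip, cycles in (("QX9650", 15), ("QX6850", 14)):
    print(f"{chip}: {cycles} cycles at 3GHz = {cycles_to_ns(cycles, 3.0):.2f} ns")
```

At 3GHz, one extra cycle costs about a third of a nanosecond, which is tiny next to the trip out to main memory.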
But enough of this CPU geekery! Let’s play some games.
Team Fortress 2
We’ll kick off our gaming tests with some Team Fortress 2, Valve’s class-driven multiplayer shooter based on the Source game engine. In order to produce easily repeatable results, we’ve tested TF2 by recording a demo during gameplay and playing it back using the game’s timedemo function. In this demo, I’m playing as the Heavy Weapons Guy, with a medic in tow, dealing some serious pain to the blue team.
We tested at 1024×768 resolution with the game’s detail levels set to their highest settings. HDR lighting and motion blur were enabled. Antialiasing was disabled, and texture filtering was set to trilinear filtering only. We used this relatively low display resolution with low levels of filtering and AA in order to prevent the graphics card from becoming a primary performance bottleneck, so we could show you the performance differences between the CPUs.
Notice the little green plot with four lines above the benchmark results. That’s a snapshot of the CPU utilization indicator in Windows Task Manager, which helps illustrate how much the application takes advantage of up to four CPU cores, when they’re available. I’ve included these Task Manager graphics whenever possible throughout our results. In this case, Team Fortress 2 looks like it probably only takes full advantage of a single CPU core, although Nvidia’s graphics drivers use multithreading to offload some vertex processing chores.
The QX9650 produces some very nice clock-for-clock performance gains right off the bat. Yow. All of these CPUs are pushing acceptable frame rates for TF2, but the QX9650 is in a class by itself in terms of raw performance. If you want future-proofing, this puppy has it.
Lost Planet: Extreme Condition
Lost Planet puts the latest hardware to good use via DirectX 10 and multiple threads, as many as eight, in the case of our dual quad-core Xeon test rig. Lost Planet’s developers have built a benchmarking tool into the game, and it tests two different levels: a snow-covered outdoor area with small numbers of large villains to fight, and another level set inside of a cave with large numbers of small, flying creatures filling the air. We’ll look at performance in each.
We tested this game at 1152×864 resolution, largely with its default quality settings. The exceptions: texture filtering was set to trilinear, edge antialiasing was disabled, and “Concurrent operations” was set to match the number of CPU cores available.
As I’ve stated before (and watch me do it again), Lost Planet’s Cave level is exciting because it puts a cubic assload of flying doodads on the screen and uses multiple threads to control them all. That gives us a nice look at how quad-core processors can speed up a game. Oddly, the QX9650 stumbles just a little bit in the Snow level, for whatever reason, but in the Cave level with all of those doodads, it’s well ahead of the pack, and roughly 10% faster than its 3GHz 65nm counterpart.
BioShock
We tested BioShock by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. The sequence? Me trying to fight a Big Daddy, or more properly, me trying not to die for 60 seconds at a pop.
This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
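Here is a small sketch of how that reporting works. The session numbers are made up for illustration, and averaging the five per-session averages is my assumption about the obvious way to combine them; only the median-of-lows step comes straight from the method described above.

```python
from statistics import mean, median

# Hypothetical (average FPS, low FPS) pairs from five manual play-throughs.
sessions = [(78.2, 41), (80.1, 44), (76.5, 29), (79.3, 43), (77.9, 42)]


def summarize_sessions(sessions):
    """Return (mean of the average frame rates, median of the low frame rates)."""
    averages = [avg for avg, _ in sessions]
    lows = [low for _, low in sessions]
    return mean(averages), median(lows)


avg_fps, low_fps = summarize_sessions(sessions)
print(f"average: {avg_fps:.1f} FPS, median low: {low_fps} FPS")
```

Notice how the one ugly session with a low of 29 FPS doesn’t drag the reported low down; that is the whole reason for using the median.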
For this test, we largely used BioShock’s default image quality settings for DirectX 10 graphics cards, but again, we tested at a relatively low resolution of 1024×768 in order to prevent the GPU from becoming the main limiter of performance.
The QX9650 takes the top spot again, though not by much. Any of the Core 2 processors here can run BioShock more or less optimally, obviously. And while playing, I didn’t notice any real slowdowns or problems, even on the Athlon 64 X2 5600+.
Supreme Commander
We tested performance using Supreme Commander’s built-in benchmark, which plays back a test game and reports detailed performance results afterward. We launched the benchmark by running the game with the “/map perftest” option. We tested at 1024×768 resolution with the game’s fidelity presets set to “High.”
Supreme Commander’s built-in benchmark breaks down its results into several major categories: running the game’s simulation, rendering the game’s graphics, and a composite score that simply combines the other two. The performance test also reports good ol’ frame rates, so we’ve included those, as well.
We’ve had a heck of a time trying to tease out big performance differences between CPUs in this game. They don’t come easily and obviously aren’t very large. However, the QX9650 again sits atop the field, this time in each of Supreme Commander’s several performance measurements.
Valve Source engine particle simulation
Next up are a couple of tests we picked up during a visit to Valve Software, the developers of the Half-Life games. They’ve been working to incorporate support for multi-core processors into their Source game engine, and they’ve cooked up a couple of benchmarks to demonstrate the benefits of multithreading.
The first of those tests runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.
The QX9650 posts a gain of about 15% over the QX6850 in this test, even surpassing the dual Xeons.
Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.
Even when the QX9650 can’t deliver major progress over the QX6850, it wins. Nothing AMD has to offer even comes close.
WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests. Because WorldBench’s tests are entirely scripted, we weren’t able to capture Task Manager plots for them, as you’ll notice.
Productivity and general use software
MS Office productivity
This WorldBench component test has a multitasking element, since several Office apps are in use at once. In this case, the QX9650 finishes a tick behind the QX6850, for whatever reason. The Athlon 64 X2 6400+ puts in a relatively strong showing here, as well.
Firefox web browsing
If you want proof positive that an Intel processor will make your Internet faster, here it is. I wouldn’t exactly recommend trading the Pentium 4 and cable modem for a QX9650 and dial-up, though.
Multitasking – Firefox and Windows Media Encoder
Here’s another WorldBench component test with a multitasking bent. This one uses a multithreaded application, Windows Media Encoder, alongside the Firefox web browser. Once more, the QX9650 achieves an impressive per-clock performance improvement. In fact, I’m going to make “impressive per-clock performance improvement” into a keyboard macro.
WinZip file compression
Nero CD authoring
The QX9650 is a tad bit faster than its predecessor in WinZip, but not in the Nero test, where performance seems to be dictated by (1) disk controller performance and (2) sheer, blind luck.
The QX9650 show continues in Photoshop, where the Yorkfield processor will let you cut out pictures of your friends and place them almost-convincingly into incriminating circumstances better than any other CPU.
The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.
Intel’s new baby excels at stitching together multiple pictures to create a panorama, though it only finishes a couple of seconds ahead of the QX6850. If you’re into CPU geekery, the per-clock performance gains shown in the individual operations below may interest you. Looks to me like the majority of the difference comes in the “stitch” operation, which is the heart of the panorama generation process.
picCOLOR image analysis
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.
Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
When the first Core 2 Duo processor debuted in July of 2006, it scored 10.94 in this same test. Today, we’re at 16.46 times the performance of a Pentium III 1GHz, a little over a year later, and only some of picCOLOR’s functions are multithreaded. Not too shabby.
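Because picCOLOR scores are all indexed against the same reference machine, comparing two of them is just a ratio. A quick sketch, with a helper name of my own choosing and the two scores quoted above:

```python
def relative_gain(new_score: float, old_score: float) -> float:
    """Percentage improvement of new_score over old_score."""
    return (new_score / old_score - 1.0) * 100.0


print(f"{relative_gain(16.46, 10.94):.1f}% faster than the first Core 2 Duo result")
```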
Video encoding and editing
VirtualDub and DivX encoding with SSE4
Here’s a brand-new addition to our test suite that should allow us to get a first look at the benefits of SSE4’s instructions for video acceleration. In this test, we used VirtualDub as a front-end for the DivX codec, asking it to compress a 66MB MPEG2 source file into the higher compression DivX format. We used version 6.7 of the DivX codec, which has an experimental full-search function for motion estimation that uses SSE4 when available and falls back to SSE2 when needed. We tested with most of the DivX codec’s defaults, including its Home Theater base profile, but we enabled enhanced multithreading and, of course, the experimental full search option.
Well, this isn’t even fair at all, and that’s sort of the point. A couple of SSE4’s new instructions are specifically targeted to accelerate H.264-style motion estimation, and they seem to do it well. The QX6850 takes nearly 10 seconds longer to process this short video clip, and the Athlon 64 FX-74 takes twice as long as the QX9650.
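For the curious, here is roughly what a full-search motion estimator does, written as plain Python rather than anything resembling the DivX codec’s actual code, which I haven’t seen. It slides a block over every candidate position in a search window and keeps the offset with the lowest sum of absolute differences (SAD). SSE4.1’s MPSADBW instruction computes several of these sums at once, which is why this kind of loop benefits so much. The function names, the 4×4 block size, and the search range are all assumptions for illustration.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def extract(frame, top, left, size):
    """Copy a size-by-size block out of a frame stored as a list of rows."""
    return [row[left:left + size] for row in frame[top:top + size]]


def full_search(ref_frame, cur_block, top, left, size, search_range):
    """Try every offset within +/- search_range of (top, left).

    Returns (best_sad, dy, dx) for the best-matching block in ref_frame.
    """
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(cur_block, extract(ref_frame, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best


# Synthetic 8x8 reference frame in which every pixel value is unique.
ref = [[y * 16 + x for x in range(8)] for y in range(8)]
# Pretend the current block is the reference content found at row 3, column 2.
cur = extract(ref, 3, 2, 4)
print(full_search(ref, cur, top=2, left=2, size=4, search_range=2))
```

An exhaustive search like this is brutally expensive, since the work grows with the square of the search range, so it is exactly the sort of thing you only enable when the hardware can chew through it.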
Windows Media Encoder x64 Edition video encoding
Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.
Windows Media Encoder video encoding
Roxio VideoWave Movie Creator
The remainder of our video tests don’t take advantage of SSE4, but the QX9650 still leads in each of them. Some of these are notable performance differences, too, when one CPU finishes processing a short video clip a full 20 or 30 seconds ahead of another one.
LAME MT audio encoding
LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
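That two-stage arrangement can be sketched with a pair of threads and a small buffer between them. To be clear, this is not LAME MT’s code: the `analyze` and `encode` functions are hypothetical stand-ins, and the buffer depth of one is my assumption about how to keep the analysis stage only slightly ahead.

```python
import queue
import threading


def analyze(frame):
    """Stand-in for the psycho-acoustic analysis stage (hypothetical)."""
    return sum(abs(sample) for sample in frame)


def encode(frame, analysis):
    """Stand-in for the rest of the encoder (hypothetical)."""
    return (len(frame), analysis)


def pipelined_encode(frames):
    """Run analysis on one thread and the remaining encode work on another."""
    # A bounded buffer keeps the analysis stage from racing far ahead.
    buffered = queue.Queue(maxsize=1)

    def analysis_stage():
        for frame in frames:
            buffered.put((frame, analyze(frame)))
        buffered.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_stage)
    worker.start()

    output = []
    while True:
        item = buffered.get()
        if item is None:
            break
        frame, analysis = item
        output.append(encode(frame, analysis))
    worker.join()
    return output


print(pipelined_encode([[1, -2, 3], [4, 5], [-6]]))
```

With only two stages in the pipeline, there are never more than two threads with work to do, which is why extra cores sit idle in this test.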
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.
Yep. Uh huh. Yep. Moving on…
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.
Rendering is generally computationally bound, not limited by memory bandwidth or the like. In this case, then, the QX9650 is achieving its clock-for-clock performance boost thanks to its fast radix-16 divider or its single-cycle 128-bit SSE shuffle ability.
We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.
3ds max modeling and rendering
The computational performance enhancements in the QX9650 bring benefits in all three of our rendering test apps. In the case of the POV-Ray chess2 scene, the QX9650 shaves 17 seconds off of the QX6850’s render time, vaulting it ahead of the Athlon 64 FX-74.
Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.
The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.
notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.
On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
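The final scoring step boils down to a couple of lines. This sketch uses a function name of my own and invented points-per-day figures; only the average-then-multiply logic comes from the description above.

```python
def estimate_ppd(ppd_by_wu_type, cores):
    """Average the per-type points per day, then scale by the core count."""
    average = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average * cores


# Hypothetical single-core results for four work unit types.
sample = {"type_a": 100, "type_b": 200, "type_c": 300, "type_d": 400}
print(estimate_ppd(sample, cores=4))
```

The estimate assumes every core performs like the one that was measured and that a real-world mix of work units resembles an even split of the four types, which is where the quirkiness comes in.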
This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.
Wow. At the very same clock speed, the QX9650 can haul in quite a few more points per day than its QX6850 precursor, and it easily leads all contenders in the single-threaded processing of three of the four WU types.
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:
In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.
MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
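The job-queue scheme David describes can be sketched roughly as follows. This is a Python illustration of the scheduling idea only, not MyriMatch's own code: the function names and the `handle` callback are hypothetical, and Python threads will not deliver real CPU parallelism the way native threads do.

```python
import queue
import threading


def split_into_jobs(proteins, num_threads, jobs_per_thread=10):
    """Cut the in-memory database into roughly num_threads * 10 jobs."""
    num_jobs = num_threads * jobs_per_thread
    size = max(1, -(-len(proteins) // num_jobs))  # ceiling division
    return [proteins[i:i + size] for i in range(0, len(proteins), size)]


def run_jobs(proteins, num_threads, handle):
    """Each worker pulls the next job from a shared queue until none remain."""
    jobs = queue.Queue()
    for job in split_into_jobs(proteins, num_threads):
        jobs.put(job)

    def worker():
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # queue drained, this thread is done
            for protein in job:
                handle(protein)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# With the 6714-protein yeast database and four threads, jobs hold 168 sequences.
jobs = split_into_jobs(list(range(6714)), num_threads=4)
print(len(jobs), len(jobs[0]))  # 40 168
```

Making the jobs much smaller than one thread's full share means a thread that finishes early simply grabs more work, while the queue is touched only once per job rather than once per protein.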
The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.
I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:
Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.
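To picture the per-spectrum mutex David mentions, here is a minimal Python sketch. It assumes, purely for illustration, that each spectrum keeps a single best-scoring peptide; the `Spectrum` class, its fields, and the `offer` method are hypothetical and are not taken from MyriMatch's source.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Spectrum:
    """A shared spectrum record guarded by its own lock (hypothetical layout)."""
    best_score: float = float("-inf")
    best_peptide: Optional[str] = None
    lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def offer(self, peptide, score):
        # Only threads touching this particular spectrum contend for this lock.
        with self.lock:
            if score > self.best_score:
                self.best_score = score
                self.best_peptide = peptide


spectrum = Spectrum()
workers = [
    threading.Thread(target=spectrum.offer, args=(f"PEP{i}", float(i)))
    for i in range(50)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(spectrum.best_peptide)  # PEP49
```

One lock per spectrum keeps contention rare, since two threads only wait on each other when they happen to hit the same spectrum at the same moment.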
Here’s how the processors performed.
Since memory bandwidth is the primary limiter among the very fastest processors here, the QX9650 doesn’t separate itself much from the QX6850, which shares the same bus speed and memory subsystem.
STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)
In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:
The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.
So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.
The Yorkfield processor hits a 15% higher processing frequency than its Kentsfield counterpart, another impressive jump in performance at the same clock speed.
SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:
This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.
We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
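For a rough idea of what the interlacing looks like, here is a small Python sketch. It assumes a fixed stride, where thread t takes columns t, t+n, t+2n and so on, which is only one plausible reading of SiSoft's description. The function names and the complex-plane bounds are my own choices, and this plain scalar loop has none of the SIMD optimization the real benchmark exists to show off.

```python
def columns_for_thread(thread_index, num_threads, width=640):
    """Interlaced assignment: every num_threads-th column, offset by thread index."""
    return range(thread_index, width, num_threads)


def mandelbrot_iterations(c, max_iter=255):
    """Count iterations of z = z*z + c before |z| exceeds 2, capped at max_iter."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def render_column(x, width=640, height=480, max_iter=255):
    """Iteration counts for one pixel column (plane bounds are assumed)."""
    re = -2.0 + 3.0 * x / (width - 1)
    return [
        mandelbrot_iterations(complex(re, -1.2 + 2.4 * y / (height - 1)), max_iter)
        for y in range(height)
    ]


# On a quad-core system, thread 1 would start with columns 1, 5, 9, ...
print(list(columns_for_thread(1, 4))[:3])  # [1, 5, 9]
```

Interleaving columns spreads the expensive regions of the fractal across all threads, rather than handing one thread a contiguous block that happens to contain most of the slow pixels.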
I keep this test around because it seems to show off the Core 2 chips’ single-cycle SSE2 execution capabilities rather well. However, Penryn’s single-cycle 128-bit SSE shuffle doesn’t help much here.
Power consumption and efficiency
Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system: the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.
All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.
Anyhow, here are the results:
If you’re like me, you looked at that raw data on the QX9650 and immediately did a double-take. It’s for real, though.
Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.
Surprisingly, our QX9650 system draws substantially less power (34W, to be exact) at idle than the otherwise-identical QX6850 system did. That drops the QX9650 power consumption even below that of the dual-core Core 2 Duo E6750.
Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.
The 45nm chip’s reduction in power use under load is even more impressive. The QX9650 system pulls 74W less under load than the QX6850-based one. That’s less than an Athlon 64 X2 5600+, astoundingly enough.
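Both the idle and peak figures are simply averages over a window of the logged samples. Here is a minimal Python sketch of that step, assuming the log is a list of (seconds, watts) pairs taken once per second. The log format, the function name, and the sample numbers are all invented for illustration and are not our measured data.

```python
def mean_power_watts(samples, start_s, end_s):
    """Average the wattage of all samples with start_s <= time < end_s."""
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples in the requested window")
    return sum(window) / len(window)


# Synthetic log: 60 one-second samples, 100W at idle, 200W while rendering
# between the 10- and 50-second marks.
log = [(t, 200.0 if 10 <= t < 50 else 100.0) for t in range(60)]

peak = mean_power_watts(log, 30, 40)  # ten-second span during the render
idle = mean_power_watts(log, 50, 60)  # trailing edge, after the render
print(peak, idle)  # 200.0 100.0
```

Averaging over a window rather than grabbing a single reading smooths out the sample-to-sample jitter in a wall-socket measurement.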
Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
Obviously, with such low idle and peak power consumption, and its quick render time, the QX9650 doesn’t use much energy over the course of our test period.
We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
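Energy is just power integrated over time, so both the whole-period total and the render-only figure can be computed from the same log by changing the time bounds. The Python sketch below uses the trapezoidal rule on (seconds, watts) pairs. The function name, the log format, and the numbers are assumptions made for the example, not our actual tooling or results.

```python
def energy_joules(samples, start_s, end_s):
    """Trapezoidal integral of power (W) over time (s) within [start_s, end_s]."""
    pts = [(t, w) for t, w in samples if start_s <= t <= end_s]
    return sum(
        (t1 - t0) * (w0 + w1) / 2.0
        for (t0, w0), (t1, w1) in zip(pts, pts[1:])
    )


# Synthetic log: a constant 100W draw sampled once per second for a minute.
log = [(t, 100.0) for t in range(61)]

print(energy_joules(log, 0, 60))   # 6000.0 joules over the whole period
print(energy_joules(log, 10, 30))  # 2000.0 joules over a 20-second "render"
```

A system that finishes the render sooner integrates over a shorter span, which is how this measure folds performance into the efficiency picture.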
The Core 2 Extreme QX9650 combines some of the lowest power consumption of the group with the quickest render times. That means it’s able to render the scene with under half the energy used by the Core 2 Duo E6750. Compared to the 65nm Core 2 Extreme QX6850 running at the same clock speed, the QX9650 brings a 33% reduction in system-level energy use during this task.
I started overclocking the QX9650 by setting its multiplier to 12, which would yield a 4GHz clock speed on a 1333MHz front-side bus (whose base clock is 333MHz). I initially raised the CPU core voltage from the default of 1.25V to 1.2625, just to help things along. The system came up and immediately began to POST, but then locked in mid-POST.
I tried several times, and the problem persisted.
After recovering to the BIOS defaults, I started cranking up the voltage in an attempt to achieve 4GHz. A little extra juice allowed the system to begin booting Windows, but it crashed before completing the boot process. Things got no better as I stepped up to 1.3V and then 1.325V. I could have gone for more voltage, but I figured backing down on the clock speed a little bit would probably be the best path to stability. After several attempts at 3.85GHz and 3.795GHz with a slightly overclocked bus, I finally settled on a stable config: 3.66GHz at 1.2875V on a stock 1333MHz bus. I then took this screenshot:
It’s like a postcard. From a vacation. In megahertz-land.
This setup proved stable while running four instances of Prime95 for quite a while, so I called it good. The QX9650 also ran through a couple of benchmarks flawlessly at this speed.
I’d say that’s an acceptable start for Intel’s 45nm process, although the actual clock speed is only 166MHz faster than what we reached with our 65nm QX6850. If this chip is any indication, Intel easily has clock room to release some Penryn-based parts at 3.2 or 3.4GHz, at least, and it’s still very early in the game for 45nm.
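The clock speeds quoted in these overclocking notes all come from multiplying the CPU multiplier by the bus base clock. A quick Python check of the arithmetic follows. The multipliers it prints for 3.66GHz and the stock 3GHz are inferred from the 333MHz base clock and assume whole-number multipliers; only the 12x setting was stated outright.

```python
BASE_CLOCK_MHZ = 333  # base clock of the 1333MHz front-side bus


def core_clock_mhz(multiplier, base_clock_mhz=BASE_CLOCK_MHZ):
    """Core clock is the multiplier times the bus base clock."""
    return multiplier * base_clock_mhz


def implied_multiplier(core_mhz, base_clock_mhz=BASE_CLOCK_MHZ):
    """Nearest whole multiplier for a given core clock (an inference)."""
    return round(core_mhz / base_clock_mhz)


print(core_clock_mhz(12))        # 3996, i.e. roughly 4GHz
print(implied_multiplier(3660))  # 11
print(implied_multiplier(3000))  # 9
```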
Sometimes we have to craft finely nuanced analyses of our CPU test results in order to summarize the various merits and weaknesses of different processors as fairly as possible. Not so today. Intel was already well ahead in the performance game with its 65nm quad-core processors, and the Core 2 Extreme QX9650 simply extends that lead by anywhere from a few percentage points to nearly 20%. What’s more, it does so on the strength of a handful of key revisions to the chips, including a larger L2 cache and a fast divider, that benefit a startlingly broad range of applications, from games to office apps and scientific computing. In the video encoding application we tested that supports SSE4, we saw even larger performance gains. The Core microarchitecture has always had strong clock-for-clock performance, but Intel’s design team has found ample room for improvement, and delivered it.
Yet the QX9650’s advances in per-clock performance may not even be its best quality. Our power consumption testing confirmed Intel wasn’t just blowing smoke when it claimed big reductions in switching power and leakage current for its 45nm fabrication process. Our QX9650 test system drew 34W less power at idle and 74W less under load than a comparably equipped Core 2 Extreme QX6850-based one. Taken together with the increases in clock-for-clock performance, the QX9650 brought a 33% reduction in the system-level energy needed to render a scene. That’s a huge step forward in power-efficient performance.
All of this comes without any increase in clock speed. Yet. Intel seems to be holding higher speeds in reserve, since we were easily able to reach 3.66GHz with our QX9650, without having to resort to crazy-insane core voltages. We can probably expect to see both higher core clocks and higher bus speeds from this generation of products as it matures.
Let the prophets of doom-and-gloom stick that in their pipes and smoke it. They may be right about transistor scaling limits eventually (duh), but many of them spoke too much, too soon.
The crazy thing is that the QX9650 may not even be the fastest desktop microprocessor to arrive this year, if AMD somehow manages to hit the right clock speeds with its Phenom. Let’s not kid ourselves. Based on everything we’ve seen from the 45nm Xeons and Barcelona Opterons, Intel appears positioned to hang on to the performance crown for the foreseeable future. But one never knows until the chips arrive, as the Phenom is set to do very soon. Stay tuned.