Intel’s Core 2 Extreme QX6850 processor

Intel’s Core 2 Extreme QX6850 processor said its first hello to the world some months ago, and unusually, I totally whiffed on getting my review out at that time. In my defense, I had many things on my mind, including preparing for the launch of brand-new processor microarchitectures from AMD and Intel in the form of the Barcelona Opterons and Harpertown Xeons. Meanwhile, I was rebuilding our desktop-class CPU test rigs with new software and hardware, the latest and greatest stuff. I was also busy with utterly ruining my website, and let me tell you, personal career suicide isn’t as easy as Britney Spears makes it look. Girl has a gift.

At any rate, I’ve finally finished my first round of tests with our all-new test setup, and we can now show you how Intel’s fastest quad-core desktop processor, the Core 2 Extreme QX6850, stacks up against a range of competitors: everything from the new Athlon 64 X2 6400+ to dual-socket monsters like AMD’s Quad FX and Intel’s V8 platform, just because we can. And, of course, we have new applications and the latest games, like BioShock and Team Fortress 2, in the mix. Keep reading for a cornucopia of quad-core goodness, but read quickly, before Intel replaces this CPU with a 45nm Penryn-based chip.



The QX6850 in situ. Look it up.

The dirt on the QX6850

Here are the vitals on the Core 2 Extreme QX6850. This processor is yet another spin on Intel’s Kentsfield quad-core product, which incorporates two Core 2 Duo chips onto a single package for a quartet of bit-flipping goodness. Like other Kentsfield-based products, it has a total of 8MB of L2 cache, or 4MB per chip. The QX6850 distinguishes itself from its direct predecessor, the QX6800, with the addition of a 3GHz core clock frequency and a 1333MHz front-side bus. Intel has moved the bulk of its Core 2 lineup to this higher bus speed, whose benefits we first tested in our review of the Core 2 Duo E6750. At that time, we concluded that a 1333MHz front-side bus wasn’t much help to a dual-core processor, but it might be more of a boon to a quad-core part, especially because the two chips on Kentsfield processors communicate between themselves via this bus.

At this point, the reader should feel drama and tension rise.

So we’ll have to see whether the faster bus offers more benefit for the QX6850.

Intel’s Core 2 processors have also learned a new trick in recent months: lower power consumption and heat production, thanks to the chips’ new rev-G stepping. The QX6850 houses a couple of rev-G chips, which is one reason why it can accommodate higher clock and bus speeds while fitting into the same thermal envelope as the QX6800.

Sounds good, right? Yes, but this is technology, and things move quickly. Intel is set to replace its 65nm Core 2 processors with “Penryn” based products fabbed with its new 45nm process in, like, days. That means the QX6850 is the last of its breed, destined to live the final 15 years of its career playing to half-filled theater audiences in Vegas.

I have no idea what that means. Work with me here.

I should mention that the Core 2 Extreme QX6850 currently sells for somewhere between $1100 and $1300 at online retailers, which is enough money to buy you several range-top microwave ovens and an iPhone. The range-top ovens and iPhone couldn’t easily be replaced by a $280 Core 2 Quad Q6600 processor, either. Then again, neither the phone, the microwaves, nor the Q6600 have an unlocked upper multiplier for easy overclocking.

Heck, the iPhone is locked down tighter than Fort Knox.

Intel’s Core 2 Extreme processors, though, make overclocking a snap. These high-end quad-core “halo products” also have no true competition right now, unless you count AMD’s Quad FX platform. We’ve tested it here, for what it’s worth, but I’ll save you some suspense: the QX6800 was already faster than Quad FX.

We’ve also tested AMD’s new fastest dual-core processor, the Athlon 64 X2 6400+ Black Edition, distinguished by its black box, limited quantities, and modest debut. We couldn’t let this one slip by under the radar entirely, even if it’s not AMD’s proudest achievement.

The new testing stuff

That faint aroma of leather and chemicals you smell isn’t your granny’s handbag; it’s the new-car smell emanating from Damage Labs. As I said, we’ve revamped much of our CPU testing apparatus in Damage Labs—and words like “apparatus” and “labs” sound much more sophisticated than “long table with skeleton PCs sitting on top.” Atop the table now is this nifty new motherboard:



Gigabyte’s GA-P35T-DQ6

The Gigabyte GA-P35T-DQ6 won an Editor’s Choice award in our five-way roundup of Intel P35 chipset-based mobos, so it seemed like a logical choice for our Intel CPU test platform. We opted to go with the DDR3 version of this board because we wanted to maximize the like-new aromatic potential of our test rigs, and because we wanted to give the latest CPUs the best chance to shine. We then plopped a Corsair DIMM capable of running at 1333MHz into each of the DQ6’s four DIMM slots, for a total of 4GB of memory.

In fact, we’ve bumped up all of our test systems to 4GB of memory—a logical step given that we’re using a 64-bit OS and a fair amount of 64-bit software these days. We also outfitted our test systems with GeForce 8800 GTX graphics cards, and we replaced their aging Maxtor 250GB hard drives with our current pick of the desktop drive litter, the WD Caviar SE16. Oh, and we patched Windows Vista until it worked correctly. The end result should be reasonably high-end PCs ready to take full advantage of the latest DirectX 10 games.

That is the hope, anyway. Let’s see what we found.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

| | Core 2 Quad Q6600 2.4GHz, Core 2 Extreme QX6800 2.93GHz | Core 2 Duo E6750 2.66GHz, Core 2 Extreme QX6850 3.00GHz | Dual Xeon X5365 3.00GHz | Athlon 64 X2 5600+ 2.8GHz, Athlon 64 X2 6000+ 3.0GHz, Athlon 64 X2 6400+ 3.2GHz | Dual Athlon 64 FX-74 3.0GHz |
|---|---|---|---|---|---|
| System bus | 1066MHz (266MHz quad-pumped) | 1333MHz (333MHz quad-pumped) | 1333MHz (333MHz quad-pumped) | 1GHz HyperTransport | 1GHz HyperTransport |
| Motherboard | Gigabyte GA-P35T-DQ6 | Gigabyte GA-P35T-DQ6 | Intel S5000VXN | Asus M2N32-SLI Deluxe | Asus L1N64-SLI WS |
| BIOS revision | F1 | F1 | S5000.86B.06.00.0076.0409200070751 | 1201 | 0505 |
| North bridge | P35 Express MCH | P35 Express MCH | 5000X MCH | nForce 590 SLI SPP | nForce 680a SLI |
| South bridge | ICH9R | ICH9R | 6231 ESB ICH | nForce 590 SLI MCP | nForce 680a SLI |
| Chipset drivers | INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5 | INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5 | INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5 | ForceWare 15.01 | ForceWare 15.01 |
| Memory size | 4GB (4 DIMMs) | 4GB (4 DIMMs) | 4GB (4 DIMMs) | 4GB (4 DIMMs) | 4GB (4 DIMMs) |
| Memory type | Corsair TWIN3X2048-1333C9DHX DDR3 SDRAM at 1066MHz | Corsair TWIN3X2048-1333C9DHX DDR3 SDRAM at 1333MHz | Samsung ECC DDR2-667 FB-DIMM at 667MHz | Corsair TWIN2X2048-8500 DDR2 SDRAM at ~800MHz | Corsair TWIN2X2048-8500C5D DDR2 SDRAM at ~800MHz |
| CAS latency (CL) | 8 | 8 | 5 | 4 | 4 |
| RAS to CAS delay (tRCD) | 8 | 9 | 5 | 4 | 4 |
| RAS precharge (tRP) | 8 | 9 | 5 | 4 | 4 |
| Cycle time (tRAS) | 20 | 24 | 15 | 18 | 18 |
| Audio | Integrated ICH9R/ALC889A with Realtek 6.0.1.5449 drivers | Integrated ICH9R/ALC889A with Realtek 6.0.1.5449 drivers | Integrated ICH9R/ALC260 with Realtek 6.0.1.5449 drivers | Integrated nForce 590 MCP/AD1988B with Soundmax 6.10.2.6100 drivers | Integrated nForce 680a SLI/AD1988B with Soundmax 6.10.2.6100 drivers |

Common to all systems:

Hard drive: WD Caviar SE16 320GB SATA
Graphics: GeForce 8800 GTX 768MB PCIe with ForceWare 163.11 and 163.71 drivers
OS: Windows Vista Ultimate x64 Edition
OS updates: KB940105, KB929777 (nForce systems only), KB938194, KB938979

Please note that testing was conducted in two stages. Non-gaming apps and Supreme Commander were tested with Vista patches KB940105 and KB929777 (nForce systems only) and ForceWare 163.11 drivers. The other games were tested with the additional Vista patches KB938194 and KB938979 and ForceWare 163.71 drivers.

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Our primary test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket Xeon and Quad FX systems were powered by PC Power & Cooling Turbo-Cool 1KW-SR power supplies. Thanks to OCZ for providing these units for our use in testing.

Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.

The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

We’ve been in a rut since, like, 1999, and we’re not about to stop opening our CPU reviews with memory bandwidth results now. Ten years is so close I can taste it!

As expected, the QX6850 achieves higher bandwidth thanks to its 1333MHz front-side bus, yet it comes in slightly behind the E6750. Why? Glad you asked. Probably because the QX6850 has to share that bus between three devices: the chipset and two separate CPU chips. The additional loading on the bus limits the QX6850’s bandwidth, giving the E6750 a minor but measurable edge.

Of course, the AMD chips, with no front-side bus to speak of, post big numbers here.
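For a rough sense of the ceiling those Intel chips are bumping against, here is a back-of-the-envelope sketch of peak theoretical front-side bus bandwidth. It assumes the 64-bit-wide, quad-pumped bus used by Core 2 processors; the function name is ours, and measured bandwidth will always land below these figures.

```python
def fsb_peak_bandwidth_gbs(base_clock_mhz, transfers_per_clock=4, bus_width_bytes=8):
    """Peak theoretical bus bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes a quad-pumped (four transfers per clock), 64-bit (8-byte) bus.
    """
    transfers_per_second = base_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bus_width_bytes / 1e9


# Base clocks behind the two bus speeds in our test systems.
for label, base_clock in (("1066MHz bus", 266.67), ("1333MHz bus", 333.33)):
    print(f"{label}: {fsb_peak_bandwidth_gbs(base_clock):.1f} GB/s peak")
```

That works out to roughly 8.5 GB/s for the 1066MHz bus and 10.7 GB/s for the 1333MHz bus, and on a Kentsfield chip that budget is shared by two dies and the chipset.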

Here’s a look at cache and memory bandwidth, and as you can see, the QX6850 is right in line with expectations once again. This test appears to measure cumulative cache bandwidth, which is why the quad-core systems produce much higher numbers than the dual-core ones based on the same microarchitecture. The Intel “V8” dual-Xeon rig is pretty much just showing off here. None of the other CPUs like the Xeons.

Since it’s hard to see it on the line graph, here’s a quick look at the bandwidth results for the 1GB test block size.

Even with a 1333MHz bus and smart on-chip logic for speculatively moving loads ahead of stores in certain situations, the QX6850 can’t quite match the memory access latencies achieved by the Athlon 64’s integrated memory controller. Intel keeps getting closer on this front, though. And, of course, these are mere synthetic tests that provide some interesting info but don’t predict real-world performance.

Team Fortress 2

We’ll kick off our gaming tests with some Team Fortress 2, Valve’s class-driven multiplayer shooter based on the Source game engine. In order to produce easily repeatable results, we’ve tested TF2 by recording a demo during gameplay and playing it back using the game’s timedemo function. In this demo, I’m playing as the Heavy Weapons Guy, with a medic in tow, dealing some serious pain to the blue team.

We tested at 1024×768 resolution with the game’s detail levels set to their highest settings. HDR lighting and motion blur were enabled. Antialiasing was disabled, and texture filtering was set to trilinear filtering only. We used this relatively low display resolution without lots of filtering and AA in order to prevent the graphics card from becoming a primary performance bottleneck, so we could show you the performance differences between the CPUs.

Notice the little green plot with four lines above the benchmark results. That’s a snapshot of the CPU utilization indicator in Windows Task Manager, which helps illustrate how much the application takes advantage of up to four CPU cores, when they’re available. I’ve included these Task Manager graphics whenever possible throughout our results. In this case, Team Fortress looks like it probably only takes full advantage of a single CPU core, although Nvidia’s graphics drivers use multithreading to offload some vertex processing chores.

The Core 2 QX6850 comes out of the gate fast, taking the lead in our first game test. Obviously, with even the slowest processor managing over 70 frames per second, any of these processors will run TF2 more than adequately. The QX6850 just runs it, err, adequatest.

Lost Planet: Extreme Condition

Lost Planet puts the latest hardware to good use via DirectX 10 and multiple threads, as many as eight in the case of our dual quad-core Xeon test rig. Lost Planet’s developers have built a benchmarking tool into the game, and it tests two different levels: a snow-covered outdoor area with small numbers of large villains to fight, and another level set inside of a cave with large numbers of small, flying creatures filling the air. We’ll look at performance in each.

We tested this game at 1152×864 resolution, largely with its default quality settings. The exceptions: texture filtering was set to trilinear, edge antialiasing was disabled, and “Concurrent operations” was set to match the number of CPU cores available.

The most exciting result here, by far, is the “Cave” level. As you can see from both the Task Manager output and the benchmark results, we have an actual game that benefits from the presence of more than two processor cores. Lost Planet puts a cubic assload of flying doodads onscreen at once in the Cave scene, and they’re tracked and animated by multiple threads. Given that very friendly environment, the QX6850 excels, topping the quad-core systems and blowing away the dual-core contenders.

BioShock

We tested BioShock by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. The sequence? Me trying to fight a Big Daddy, or more properly, me trying not to die for 60 seconds at a pop.

This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
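The reporting scheme described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name is ours, and the numbers are placeholders rather than measured results.

```python
from statistics import mean, median


def summarize_sessions(sessions):
    """Summarize play-through sessions given as (average_fps, low_fps) pairs.

    Returns the mean of the per-session averages and the median of the
    per-session lows, so a single bad run cannot drag the low figure down.
    """
    averages = [avg for avg, _ in sessions]
    lows = [low for _, low in sessions]
    return mean(averages), median(lows)


# Placeholder numbers for illustration only, not results from our testing.
example_sessions = [(92.1, 55), (89.4, 51), (94.0, 23), (90.7, 53), (91.3, 54)]
avg_fps, median_low = summarize_sessions(example_sessions)
print(f"average: {avg_fps:.1f} fps, median low: {median_low} fps")
```

Note how the one session with a 23 fps dip has no effect on the reported low, which is the point of taking the median.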

For this test, we largely used BioShock’s default image quality settings for DirectX 10 graphics cards, but again, we tested at a relatively low resolution of 1024×768 in order to prevent the GPU from becoming the main limiter of performance.

Things don’t always come out neatly as expected when you’re testing gameplay manually, but you get the gist of it here: all of the Core 2 processors run BioShock very well, though it’s hard to tell the difference between them. The Core 2s put some distance between themselves and the Athlon 64s here, although even the lowly X2 5600+ doesn’t drop below 37 frames per second.

In a world dominated by games that must live in the cross-platform wilds of crappy in-order CPUs like the one found in the Xbox 360, most games will run just fine on any modern PC processor. Sure, their AI has the IQ of an eggplant, but that can’t be helped.

Supreme Commander

We tested performance using Supreme Commander’s built-in benchmark, which plays back a test game and reports detailed performance results afterward. We launched the benchmark by running the game with the “/map perftest” option. We tested at 1024×768 resolution with the game’s fidelity presets set to “High.”

Supreme Commander’s built-in benchmark breaks down its results into several major categories: running the game’s simulation, rendering the game’s graphics, and a composite score that simply combines the other two. The performance test also reports good ol’ frame rates, so we’ve included those, as well.

Well, the QX6850 comes out on top in various ways, including in terms of both average and median low frame rates. Again, though, the performance deltas between the processors aren’t large enough to matter much.

Valve Source engine particle simulation

Next up are a couple of tests we picked up during a visit to Valve Software, the developers of the Half-Life games. They’ve been working to incorporate support for multi-core processors into their Source game engine, and they’ve cooked up a couple of benchmarks to demonstrate the benefits of multithreading.

The first of those tests runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

Only the eight-core Xeon system—which is totally cheating and everyone knows it—can beat out the QX6850 here.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.

Cut and paste special: Only the eight-core Xeon system—which is totally cheating and everyone knows it—can beat out the QX6850 here.

WorldBench

WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests. Because WorldBench’s tests are entirely scripted, we weren’t able to capture Task Manager plots for them, as you’ll notice.

The QX6850 continues to lead all reasonable competition—and the dual Xeons, this time—in WorldBench’s overall score. The QX6850’s large lead over the QX6800 is something of a fluke, though, for reasons I’ll explain below.

Productivity and general use software

MS Office productivity

WorldBench’s office test involves a multitasking component, since several Office apps are opened and in use simultaneously. Given that, the Athlon 64 X2 6400+’s quick time here is impressive, though it’s overshadowed by the QX6850.

Firefox web browsing

Multitasking – Firefox and Windows Media Encoder

Here’s another WorldBench component test with a multitasking bent. This one uses a multithreaded application, Windows Media Encoder, alongside the Firefox web browser. As a result, the quad-core (and better) solutions grab the top spots, with the QX6850 in the lead.

WinZip file compression

Ouch. I suspect the Core 2 chips stomp the Athlon 64 X2s in this one because of their larger caches, but that’s just a guess.

Nero CD authoring

Here’s why I said the QX6800’s relatively low score is a bit of a fluke. The Nero test is largely dependent on the disk controller, and with the WorldBench 6 beta in Windows Vista, the results from this test tend to vary quite a bit from one run to the next. Even with multiple runs, the QX6800 came out with an inordinately low score. I wouldn’t hold that against it.

Image processing

Photoshop

The Apple guys will want to look at this one carefully, since everyone knows that all CPU performance ultimately boils down to a few select Photoshop filters.

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

It’s getting late here, and that’s a lotta colors and numbers and stuff. Let’s go for a second cut-‘n’-paste special: Only the eight-core Xeon system—which is totally cheating and everyone knows it—can beat out the QX6850 here.

picCOLOR image analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.

If you’re looking to run at over fifteen times the speed of a Pentium III 1GHz, the QX6850 is your ticket.

Video encoding and editing

Windows Media Encoder x64 Edition video encoding

Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

Windows Media Encoder video encoding

Roxio VideoWave Movie Creator

The QX6850 sweeps our video tests, continuing its dominance. Notably, its margins of victory over the QX6800 can probably be explained almost entirely by the QX6850’s 66MHz clock speed advantage. The faster bus doesn’t appear to help much at all here.

LAME MT audio encoding

LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
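For the curious, the shape of that two-stage pipeline looks something like the Python sketch below. It is not LAME MT’s actual code: the `analyze` and `encode` functions are hypothetical stand-ins, and Python’s interpreter lock means this shows the structure rather than any real speedup.

```python
import queue
import threading


def analyze(frame):
    # Hypothetical stand-in for the psycho-acoustic analysis step.
    return sum(frame) / len(frame)


def encode(frame, analysis):
    # Hypothetical stand-in for the rest of the encoder.
    return (len(frame), analysis)


def pipelined_encode(frames, lookahead=1):
    """Run analysis on one thread and encoding on another, in frame order."""
    handoff = queue.Queue(maxsize=lookahead)  # bounded buffer between stages
    encoded = []

    def analysis_stage():
        for frame in frames:
            # Blocks when the buffer is full, so analysis stays only
            # slightly ahead of the encoder.
            handoff.put((frame, analyze(frame)))
        handoff.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_stage)
    worker.start()
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        encoded.append(encode(frame, analysis))
    worker.join()
    return encoded


print(pipelined_encode([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

Because there are only two stages, there is never work for a third or fourth core, no matter how many are sitting idle.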

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in many of our previous CPU reviews.

Ok, so the Athlon 64 X2 6400+ gets a little uppity in one test, but the QX6850 smacks it back down in the next one.

Cinebench rendering

Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

The Intel processors look relatively stronger in this new R10 release of Cinebench than they did in version 9.5. The QX6850 again leads all quad-core solutions, with only those show-off Xeons scoring higher.

POV-Ray rendering

We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

3ds max modeling and rendering

Our remaining rendering tests play out much as expected, although the Quad FX system steals one from the QX6850 in the POV-Ray chess2 scene.

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
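The final estimate boils down to an average and a multiply. Here is a minimal sketch of that arithmetic; the function name is ours, the fourth WU label is a placeholder, and the figures are made up for illustration rather than taken from the benchmark.

```python
def estimate_points_per_day(ppd_by_work_unit, core_count):
    """Average the per-WU points-per-day figures, then scale by core count."""
    average_ppd = sum(ppd_by_work_unit.values()) / len(ppd_by_work_unit)
    return average_ppd * core_count


# Placeholder per-core figures for illustration only, not measured results.
example_ppd = {
    "Tinker": 100.0,
    "Amber": 120.0,
    "Gromacs": 200.0,
    "fourth WU type": 180.0,  # placeholder label for the remaining type
}
print(estimate_points_per_day(example_ppd, core_count=4))
```

The simple multiply assumes every core performs like the one measured, which is part of why we call the method quirky.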

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

Shockingly, the QX6850 is the fastest quad-core system once again! I should note that its lead is even more pronounced with the Gromacs work unit types, which I hear are more commonly used in Folding projects these days.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.

In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.
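The job-queue scheme David describes can be sketched roughly as follows. This is our own illustration in Python, not MyriMatch source code: the function names are hypothetical, and `handle_protein` stands in for the real spectrum-matching work.

```python
import queue
import threading


def split_into_jobs(sequences, thread_count, jobs_per_thread=10):
    """Cut the database into roughly thread_count * jobs_per_thread jobs."""
    job_count = thread_count * jobs_per_thread
    job_size = max(1, -(-len(sequences) // job_count))  # ceiling division
    return [sequences[i:i + job_size] for i in range(0, len(sequences), job_size)]


def process_database(sequences, thread_count, handle_protein):
    """Let each thread pull the next job from a shared queue until it is empty."""
    jobs = queue.Queue()
    for job in split_into_jobs(sequences, thread_count):
        jobs.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            handled = [handle_protein(protein) for protein in job]
            with results_lock:  # one lock acquisition per job, not per protein
                results.extend(handled)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


proteins = list(range(6714))  # stand-ins for the 6714 yeast protein sequences
jobs = split_into_jobs(proteins, thread_count=4)
print(len(jobs), len(jobs[0]))
```

With four threads, the split yields 40 jobs of up to 168 sequences each, matching the figures in the description. Handing out work in chunks keeps threads from queuing up on the shared structure too often, while keeping chunks small enough that no thread sits idle for long at the end.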

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The QX6850 finished six seconds ahead of the QX6800, not exactly a big gain given the QX6850’s higher bus speed. On a happier note, we are seeing much better scaling and quicker completion times here than we saw in our most recent look at server processors, likely due to locking and mutex handling improvements in the Windows Vista kernel.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

Here, we see a bit more of a meaningful performance boost out of the QX6850’s 1333MHz bus.

SiSoft Sandra Mandelbrot

Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.
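To make the interlacing idea concrete, here is a rough Python sketch. It is not Sandra's code: the image size and iteration count come from the FAQ text above, but the view window, the function names, and the static "every Nth column" split are my own assumptions (the FAQ's wording suggests threads may grab the next free column dynamically instead).

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, MAX_ITER = 640, 480, 255  # figures from SiSoft's description


def escape_count(cr, ci):
    """Iterations before z = z*z + c escapes |z| > 2, capped at MAX_ITER."""
    x = y = 0.0
    for n in range(MAX_ITER):
        if x * x + y * y > 4.0:
            return n
        x, y = x * x - y * y + cr, 2.0 * x * y + ci
    return MAX_ITER


def render_columns(worker, num_workers, width=WIDTH, height=HEIGHT):
    """Compute columns worker, worker + num_workers, ... (static interleave)."""
    out = {}
    for col in range(worker, width, num_workers):
        cr = -2.0 + 3.0 * col / width  # assumed view: real axis from -2 to 1
        out[col] = [escape_count(cr, -1.0 + 2.0 * row / height)
                    for row in range(height)]
    return out


def render(num_workers, width=WIDTH, height=HEIGHT):
    """Split the image by interleaved columns and stitch the results together."""
    image = {}
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = pool.map(
            lambda w: render_columns(w, num_workers, width, height),
            range(num_workers),
        )
        for part in parts:
            image.update(part)
    return [image[col] for col in range(width)]
```

Interleaving columns spreads the expensive regions of the fractal fairly evenly across workers. Note that standard Python threads won't actually run this faster on more cores; the sketch only shows how the work gets divided.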

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
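One common way to "simulate floating-point math" with integers is fixed-point arithmetic, where every value is stored as an integer scaled by a power of two. The sketch below assumes that approach and a 24-bit fraction; I don't know what number format Sandra really uses, so treat the constants and function names as hypothetical.

```python
FRAC_BITS = 24  # assumed fixed-point precision, not Sandra's actual format
ONE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a float to a scaled integer."""
    return int(round(x * ONE))


def escape_count_fixed(cr, ci, max_iter=255):
    """Mandelbrot escape count using only integer adds, multiplies, and shifts."""
    a, b = to_fixed(cr), to_fixed(ci)
    x = y = 0
    four = 4 * ONE
    for n in range(max_iter):
        xx = (x * x) >> FRAC_BITS  # rescale after each multiply
        yy = (y * y) >> FRAC_BITS
        if xx + yy > four:
            return n
        x, y = xx - yy + a, ((2 * x * y) >> FRAC_BITS) + b
    return max_iter


def escape_count_float(cr, ci, max_iter=255):
    """Reference version of the same loop in ordinary floating point."""
    x = y = 0.0
    for n in range(max_iter):
        if x * x + y * y > 4.0:
            return n
        x, y = x * x - y * y + cr, 2.0 * x * y + ci
    return max_iter
```

The two versions agree on easy points such as c = 0 or c = 1; close to the edge of the set, the limited precision of the integer version can shift the count by an iteration or so.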

I’m keeping this test around to see how AMD’s new quad-core processors, with their single-cycle 128-bit SSE capabilities, handle it. (Shh... don’t tell, but they won’t catch the Core 2.)

Power consumption and efficiency

Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.

Anyhow, here are the results:

Looking at the graph of the raw data, you can see that the QX6850 doesn’t consume any more power at peak than the QX6800 did.

We can slice up the data in various ways in order to better understand them, though. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

With two chips and lots of transistors onboard, the QX6850’s idle power wasn’t going to be the lowest of the lot, but it isn’t bad. Idle power is higher than the QX6800’s, perhaps due to the 1333MHz front-side bus.

Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.
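If you'd like to try this on your own meter logs, the windowed average is simple to compute. The sketch below assumes the log is a list of (seconds, watts) pairs; the `mean_power` name and the sample trace are invented for illustration and are not our measured data.

```python
def mean_power(samples, start_s, end_s):
    """Average watts over samples with start_s <= t < end_s.

    samples: list of (seconds_since_start, watts) pairs from the meter log.
    """
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples in the requested window")
    return sum(window) / len(window)


# Made-up one-sample-per-second trace: 150 W at idle, 250 W while rendering
# between the 10 s and 50 s marks.
log = [(t, 250.0 if 10 <= t < 50 else 150.0) for t in range(60)]

peak = mean_power(log, 30, 40)  # the ten-second slice during the render
idle = mean_power(log, 50, 60)  # trailing edge, after the render completes
```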

The QX6850’s extra little bit of performance over the QX6800 requires no additional power at peak, and the QX6850 draws less power than a couple of the Athlon 64 X2s, as well. That’s gonna leave a mark in Austin.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
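Energy is just power integrated over time, so a logged trace converts to joules with a simple sum. Here is a minimal sketch using the trapezoidal rule; the (seconds, watts) log layout and the `energy_joules` name are assumptions for the example, not a description of our exact tooling.

```python
def energy_joules(samples):
    """Integrate power (W) over time (s) with the trapezoidal rule.

    samples: list of (seconds_since_start, watts) pairs, sorted by time.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# A steady 100 W load sampled once per second for 10 seconds.
steady = [(t, 100.0) for t in range(11)]
joules = energy_joules(steady)
```

A constant 100 W held for 10 seconds works out to 1,000 watt-seconds, which is the sanity check the sample trace provides.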

Because the QX6850 finishes rendering quickly and drops back to idle, it doesn’t consume much power during our test time span. Those uppity Xeons finally get their comeuppance, too.

We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
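A bit of arithmetic shows why render time matters so much here. The numbers below are invented, not taken from our results, and the calculation assumes power stays roughly constant during the render.

```python
def render_energy_joules(avg_watts, render_seconds):
    """Energy for the render alone: average power times render duration."""
    return avg_watts * render_seconds


# Two hypothetical systems: one draws more power but finishes sooner.
fast_system = render_energy_joules(250.0, 40.0)  # 10,000 J
slow_system = render_energy_joules(200.0, 60.0)  # 12,000 J
```

In this made-up case, the system pulling 50 more watts still uses less energy for the task because it finishes 20 seconds sooner.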

By delivering more performance with no additional power consumption, the QX6850 brings new highs in efficiency in this measurement, which may be our best means of quantifying the vaunted “performance per watt” of a processor. Betcha didn’t expect to see a high-end part taking the lead in this regard. This is more evidence that Intel hasn’t pushed the clock speed envelope terribly hard with its 65nm chips; even the high-end models are in a nice place on the power/speed/voltage curves.

Overclocking

No matter what I tried, I couldn’t get our QX6850 stable at 3.66GHz. I started at the stock voltage and stepped up through the settings to 1.4125V, and I ran into math errors in Prime95 or blue screens of death at every step along the way—sometimes one followed by the other. The problem seemed to be the first of the four cores, which is where all of the Prime95 errors happened. That’s life with a quad-core Intel processor: you can’t overclock past the limits of the slowest core. I was able to get the QX6850 rock-solid stable at 3.5GHz on a 1400MHz bus, as my CPU-Z screenshot proves in legally binding fashion.
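For those keeping score on the math: Intel's front-side bus is quad-pumped, so the quoted bus speed is four times the base clock, and the core clock is that base clock times the CPU multiplier. The sketch below works through it. The 10x multiplier for the overclock is my inference from the 3.5GHz and 1400MHz figures above, and the function name is made up.

```python
def core_clock_mhz(fsb_mt_s, multiplier):
    """Core clock in MHz for a quad-pumped front-side bus."""
    base_clock = fsb_mt_s / 4  # quad-pumped: four transfers per base clock
    return base_clock * multiplier


stock = core_clock_mhz(1333, 9)         # roughly 3GHz at the stock 9x multiplier
overclocked = core_clock_mhz(1400, 10)  # 350MHz base clock at an inferred 10x
```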

Here’s what overclocking this puppy to 3.5GHz will get you in terms of performance:

Tasty.

Conclusions

So, we’ve seen an awful lot of interesting performance data, but it all boils down to this: the Core 2 Extreme QX6850 doesn’t change much. The QX6800 was the fastest desktop processor before it, and the QX6850 brings slightly higher performance in the same power envelope. The faster 1333MHz bus doesn’t pay huge dividends, even for this quad-core part, but we did see reasonable gains in certain memory-bandwidth-limited tests.

Would I recommend buying this product? Nope, probably not. Long-time TR readers know about my aversion to paying the big premiums to get top-end products when reasonably priced alternatives are available. The Core 2 Quad Q6600 was secretly the star of our show today, with its quad-core performance and sub-$300 price tag. Grab one of those puppies and overclock it to 3GHz if you want a quad-core processor with a 1333MHz bus, fer crying out loud.

Still, I must admit that I am warming to the idea that high-end processors may have their place. CPUs are relatively cheap compared to, well, a great many things, especially time. Our benchmarks have shown how a faster CPU can trim minutes and seconds off of common desktop computing tasks (and uncommon media creation and scientific computing ones, as well). If time is money and your time is limited, the QX6850 may be worth every penny to you. One can’t entirely quarrel with this sort of speed, and the easy overclocking is nice to have.

The thing is, as I’ve mentioned, Intel is about to change things all over again. Given the performance we’ve seen out of the Xeon implementation of the new “Penryn” core, we expect big things from the upcoming Core 2 Extreme QX9650, which is scheduled for its official launch on November 12. Hang onto your dinero for a few short weeks longer if you want to retain bragging rights for your shiny new monster PC for any reasonable amount of time.
