Ah, progress. As we hurtle forward in time, developing minor medical conditions and growing hair in weird places, the chips inside our computers get ever better, or such is the usual way of things. Sometimes progress comes in big leaps, as it did with the debut of Intel’s Core i7 processors not so long ago, and sometimes it comes in smaller increments, as it has with AMD’s Phenom II processors in the several months since their first introduction. The latest waystation on the way to, er, CPU nirvana is being unveiled today in the form of the Phenom II X4 955.
The X4 955 is the culmination of a process in which we saw the Phenom II first hit the market and then, a month later, transition to Socket AM3 and gain compatibility with DDR3 memory. Those first Socket AM3 processors were mainstream offerings, with smaller caches, lower clock speeds, and fewer cores than the top-of-the-line products. Today, in the form of the X4 955, AMD brings to market a true flagship for its new lineup. This is a quad-core processor with a full 6MB of L3 cache and the highest clock frequency to date for a Phenom II: 3.2GHz.
Since this is a Socket AM3 processor, it’s compatible with both Socket AM3 motherboards that support DDR3 memory and Socket AM2+ motherboards that use DDR2 memory. And since this is a new flagship for AMD, the 955 is a “Black Edition” processor with all of the privileges that title bestows: pretty much just “easy overclocking via an unlocked multiplier,” but hey, that’s not a bad perk.
The 955 isn’t the only new horse in AMD’s stable, either. There’s also its younger sibling, the Phenom II X4 945, which runs at 3GHz and doesn’t have the distinction of being a Black Edition product. Instead, the 955 gets all the glory, and the 945 keeps to itself and spends a lot of time in its room reading comic books. Here are the highlights of both new models, for comparison:
Model | Clock speed | North bridge/L3 cache speed | L3 cache size | Cores | TDP | Price |
Phenom II X4 955 Black Edition | 3.2 GHz | 2.0 GHz | 6MB | 4 | 125W | $245 |
Phenom II X4 945 | 3.0 GHz | 2.0 GHz | 6MB | 4 | 125W | $225 |
As you can see, the 945 is only 20 bucks cheaper than the 955, and I’d say paying a little more for that unlocked multiplier in the Black Edition is worth it every time, unless the very idea of overclocking causes you to break out in sweats or evokes deep feelings of shame. Still, neither CPU is particularly expensive for a processor at the top of AMD’s lineup. The firm has made a commitment to remain competitive with Intel on price and performance, and the 955’s $245 price tag would appear to position it against the Core 2 Quad Q9550, a 2.83GHz chip with four cores and 12MB (or, more precisely, 2 x 6MB) of L2 cache. Intel’s current price list has the Q9550 at $266, so the Phenom II X4 955 undercuts it a little bit, in fact.
One place where Intel may have a bit of an advantage, though, is in the power consumption department. The Q9550 has a TDP rating of 95W, while AMD has rated the X4 955 at 125W. TDP is a peak number, so those ratings really only apply when the processor is fully occupied. Still, even now, AMD’s best processors may need a little more thermal headroom to match the equivalent Core 2 Quads.
AMD’s stock cooler for retail boxed versions of the Phenom II X4 955
Speaking of thermal envelopes (guys: use this transition line on your next date; pure dynamite), here’s a look at the stock heatsink/fan combo AMD supplies with retail boxed versions of the X4 955. Not a total beast, at least, and fairly similar to any number of past stock AMD models, which are usually pretty decent coolers. We opted instead for an aftermarket cooler with a prop the size of a small aircraft, to enable quiet and copious overclocking.
We have, of course, a huge collection of CPU performance results, and we’ve run the X4 955 through the full gamut of our CPU test suite. We will add its distinctiveness to our own.
Test notes
In order to gauge the impact of memory type on performance and power use, we’ve tested the Phenom II X4 810 both with DDR2 memory on a Socket AM2+ board and with DDR3 memory on a Socket AM3 board. You’ll find the results in the following pages, labeled appropriately.
The Core 2 Quad Q8300 processor we used for testing came to us courtesy of the good folks at NCIX and NCIXUS. Thanks to them for making this comparison possible. We’ve underclocked our Q8300 to simulate a Q8200 for this review. I’m sure we’ll get around to testing the Q8300 at its stock speed, as well, eventually.
We’ve simulated several other speed grades via underclocking, too. Specifically, the Phenom II X4 920 is an underclocked 940, and the Core 2 Quad Q9550 is an underclocked Core 2 Extreme QX9650. We expect the performance of these “simulated” speed grades to be identical to the real things, but we generally omit these processors from our power consumption testing because we do anticipate power use would vary slightly from the actual products.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
Our test systems were configured like so:
Processor | Core 2 Quad Q6600 2.4 GHz | Core 2 Duo E8400 3.00 GHz; Core 2 Duo E8600 3.33 GHz; Core 2 Quad Q8200 2.33 GHz; Core 2 Quad Q9300 2.5 GHz; Core 2 Quad Q9400 2.66 GHz; Core 2 Quad Q9550 2.83 GHz | Core 2 Extreme QX9770 3.2 GHz | Dual Core 2 Extreme QX9775 3.2 GHz | Core i7-920 2.66 GHz; Core i7-940 2.93 GHz | Core i7-965 Extreme 3.2 GHz | Athlon 64 X2 6400+ 3.2 GHz | Phenom X3 8750 2.4 GHz | Phenom II X4 920 2.8 GHz; Phenom II X4 940 3.0 GHz | Phenom II X4 810 2.6 GHz | Phenom X4 9950 Black 2.6 GHz | Phenom II X3 720 2.8 GHz; Phenom II X4 810 2.6 GHz | Phenom II X4 955 3.2 GHz |
System bus | 1066 MT/s (266 MHz) | 1333 MT/s (333 MHz) | 1600 MT/s (400 MHz) | 1600 MT/s (400 MHz) | QPI 4.8 GT/s (2.4 GHz) | QPI 6.4 GT/s (3.2 GHz) | HT 2.0 GT/s (1.0 GHz) | HT 3.6 GT/s (1.8 GHz) | HT 3.6 GT/s (1.8 GHz) | HT 4.0 GT/s (2.0 GHz) | HT 4.0 GT/s (2.0 GHz) | HT 4.0 GT/s (2.0 GHz) |
Motherboard | Asus P5E3 Premium | Asus P5E3 Premium | Asus P5E3 Premium | Intel D5400XS | Intel DX58SO | Intel DX58SO | Asus M3A79-T Deluxe | Asus M3A79-T Deluxe | MSI DKA790GX Platinum | Asus M4A79T Deluxe |
BIOS revision | 0605 | 0605 | 0605 | XS54010J.86A.1149.2008.0825.2339 | SOX5810J.86A.2260.2008.0918.1758 | SOX5810J.86A.2260.2008.0918.1758 | 0403 | 0403 | 11/25/08 | 0703 | 1.6 (1/21/09) | 0902 |
North bridge | X48 Express MCH | X48 Express MCH | X48 Express MCH | 5400 MCH | X58 IOH | X58 IOH | 790FX | 790FX | 790GX | 790FX |
South bridge | ICH9R | ICH9R | ICH9R | 6321ESB ICH | ICH10R | ICH10R | SB750 | SB750 | SB750 | SB750 |
Chipset drivers | INF Update 9.0.0.1008; Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008; Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008; Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008; Matrix Storage Manager 8.5.0.1032 | INF update 9.1.0.1007; Matrix Storage Manager 8.5.0.1032 | INF update 9.1.0.1007; Matrix Storage Manager 8.5.0.1032 | AHCI controller 3.1.1540.61 | AHCI controller 3.1.1540.61 | AHCI controller 3.1.1540.61 | AHCI controller 3.1.1540.61 |
Memory size | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 6GB (3 DIMMs) | 6GB (3 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) |
Memory type | Corsair TW3X4G1800C8DF DDR3 SDRAM | Corsair TW3X4G1800C8DF DDR3 SDRAM | Corsair TW3X4G1800C8DF DDR3 SDRAM | Micron ECC DDR2-800 FB-DIMM | Corsair TR3X6G1600C8D DDR3 SDRAM | Corsair TR3X6G1600C8D DDR3 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM | Corsair TW3X4G1600C9DHXNV DDR3 SDRAM |
Memory speed (Effective) | 1066 MHz | 1333 MHz | 1600 MHz | 800 MHz | 1066 MHz | 1600 MHz | 800 MHz | 1066 MHz | 1066 MHz | 1333 MHz |
CAS latency (CL) | 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5 | 8 |
RAS to CAS delay (tRCD) | 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5 | 8 |
RAS precharge (tRP) | 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5 | 8 |
Cycle time (tRAS) | 20 | 20 | 24 | 18 | 20 | 24 | 12 | 15 | 15 | 20 |
Command rate | 2T | 2T | 2T | 2T | 2T | 1T | 2T | 2T | 2T | 2T |
Audio | Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated 6321ESB/STAC9274D5 with SigmaTel 6.10.5713.7 drivers | Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers | Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers | Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers | Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers | Integrated SB750/ALC888 with Realtek 6.0.1.5704 drivers | Integrated SB750/ALC1200 with Realtek 6.0.1.5704 drivers |
Hard drive | WD Caviar SE16 320GB SATA |
Graphics | Radeon HD 4870 512MB PCIe with Catalyst 8.55.4-081009a-070794E-ATI drivers |
OS | Windows Vista Ultimate x64 Edition |
OS updates | Service Pack 1, DirectX redist update August 2008 |
Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.
Our single-socket test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket system was powered by a PC Power & Cooling Turbo-Cool 1KW-SR power supply. Thanks to OCZ for providing these units for our use in testing.
Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.
The test systems’ Windows desktops were set at 1600×1200 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.
We used the following versions of our test applications:
- SiSoft Sandra 2009.1.15.42
- CPU-Z 1.48
- WorldBench 6 beta 2
- Half-Life 2: Episode Two
- Crysis Warhead
- Far Cry 2
- Unreal Tournament 3 1.3
- Valve VRAD map build benchmark
- Valve Source Engine particle simulation benchmark
- Cinebench R10 64-bit Edition
- POV-Ray for Windows 3.7 beta 29 64-bit
- CASE Lab Euler3d CFD benchmark multithreaded edition
- MyriMatch proteomics benchmark
- notfred’s Folding benchmark CD 9/28/08 revision
- picCOLOR 4.0 build 627 64-bit
- The Panorama Factory 5.2 x64 Edition
- Windows Media Encoder 9 x64 Edition
- x264 HD benchmark 2.0 with x264 version 0.59.819
- LAME MT 3.97a 64-bit
The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory subsystem performance
We’ll begin with a look at memory performance, so we can get a feel for some fundamentals of these CPUs.
The X4 955’s L1 and L2 caches are proportionately faster than those of prior Phenom IIs at each test block size.
Since it’s difficult to see the results once we get into main memory, let’s take a closer look at the 256MB block:
With DDR3 memory, the Phenom II X4 955 achieves the exact same throughput as its smaller-cache sibling, the X4 810. That’s nearly as much bandwidth as the Core i7-920, which has the benefit of an additional channel of DDR3 memory (albeit at a lower clock speed of 1066MHz).
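For a sense of why an extra memory channel can outweigh a lower memory clock, here is a rough back-of-the-envelope sketch of theoretical peak bandwidth. It assumes 64-bit (8-byte) channels, dual-channel operation for the Phenom II's two DIMMs, and the memory speeds from our configuration table; the function name is just for illustration, and these are ceilings, not the measured throughput in the graph.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_sec, bus_bytes=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes each channel is 64 bits (8 bytes) wide; real throughput is lower.
    """
    return channels * mega_transfers_per_sec * 1e6 * bus_bytes / 1e9


# Assumed configurations, taken from the test-system table:
phenom_ii = peak_bandwidth_gbs(2, 1333)  # two channels of DDR3 at 1333 MT/s
core_i7 = peak_bandwidth_gbs(3, 1066)    # three channels of DDR3 at 1066 MT/s

print(f"Phenom II X4 955: {phenom_ii:.1f} GB/s peak")
print(f"Core i7-920:      {core_i7:.1f} GB/s peak")
```

On paper the third channel more than makes up for the slower memory clock, which makes the Phenom II's near-parity in measured throughput all the more respectable.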
The X4 955’s memory latency is admirably low, as well. No real surprises here, but you can see that, thanks to its integrated memory controller, fast cache hierarchy, and DDR3 memory, the Phenom II X4 955 scores among the best CPUs available in our synthetic memory subsystem benchmarks.
Crysis Warhead
We measured Warhead performance using the FRAPS frame-rate recording tool and playing over the same 60-second section of the game five times on each processor. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
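As a minimal sketch of how those two numbers could be boiled down from five sessions, consider the snippet below. It assumes each session is a list of per-second FPS readings and that a session's "low" is simply its smallest reading; the sample data and function name are made up for illustration and are not our actual FRAPS logs.

```python
from statistics import mean, median


def summarize_sessions(sessions):
    """Return (average FPS, median of per-session low FPS).

    `sessions` is a list of FPS sample lists, one list per gameplay session.
    Assumption: the "low" for a session is its minimum FPS sample.
    """
    averages = [mean(samples) for samples in sessions]
    lows = [min(samples) for samples in sessions]
    return mean(averages), median(lows)


# Hypothetical data: five short sessions, one of which has a nasty hitch (12 FPS).
sessions = [
    [30, 45, 28],
    [31, 44, 12],
    [29, 46, 27],
    [33, 43, 28],
    [30, 47, 29],
]
avg_fps, low_fps = summarize_sessions(sessions)
print(f"average: {avg_fps:.1f} FPS, median low: {low_fps} FPS")
```

Note how the single 12 FPS hitch in the second session doesn't drag the reported low down, which is the point of taking the median rather than the absolute minimum.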
We tested at relatively modest graphics settings, 1024×768 resolution with the game’s “Mainstream” quality settings, because we didn’t want our graphics card to be the performance-limiting factor. This is, after all, a CPU test.
The results of our first game test suggest AMD wasn’t far off on its product positioning: the Phenom II X4 955 just trails the Core 2 Quad Q9550 in average frame rate, but both chips produce a minimum frame rate of 28 FPS, which suggests they’ll deliver very similar experiences.
Incidentally, this is probably the most CPU-intensive game we’ll test, but like many of today’s games, it’s probably primarily limited by GPU performance. If you look at the Warhead performance results from our latest video card review, in which we tested the same area of the game in the same basic fashion, even a pretty nice video card like a Radeon HD 4870 1GB would limit the frame rates achieved by a relatively lowly CPU like ye olde Core 2 Quad Q6600. Of course, we are in that case testing at a relatively high resolution with nice image quality settings, and your mileage may vary depending on your display resolution and in-game settings.
Then again, like most games, Warhead really only makes use of one or two processor cores, as evidenced by the strong showings of the Core 2 Duo E8400 and E8600 chips here. So you can spend less and get more by going for a high-frequency dual-core processor, if gaming performance is your main goal.
Far Cry 2
After playing around with Far Cry 2, I decided to test it a little bit differently by recording frame rates during the jeep ride sequence at the very beginning of the game. I found that frame rates during this sequence were generally similar to those when running around elsewhere in the game, and after all, playing Far Cry 2 involves quite a bit of driving around. Since this sequence was repeatable, I just captured results from three 90-second sessions.
Again, I didn’t want the graphics card to be our primary performance constraint, so although I tested at fairly high visual quality levels, I used a relatively low 1024×768 display resolution and DirectX 9.
Impressively, the X4 955 outperforms the Core i7-940 in terms of average frame rate, although the i7-940 has a higher minimum, which arguably counts for more. Once again, the Q9550 and X4 955 are very closely matched, with the edge going to the Q9550 by a hair.
Unreal Tournament 3
As you saw on the preceding page, I did manage to find a couple of CPU-limited games to use in testing. I decided to try to concoct another interesting scenario by setting up a 24-player CTF game on UT3’s epic Facing Worlds map, in which I was the only human player. The rest? Bots controlled by the CPU. I racked up frags like mad while capturing five 60-second gameplay sessions for each processor.
Oh, and the screen resolution was set to 1280×1024 for testing, with UT3’s default quality options and “framerate smoothing” disabled.
Here’s an example where having more than two CPU cores can improve gaming performance, although judging by the actual frame rates involved, no one’s gonna feel it. The X4 955 splits with the Q9550, with a clearly higher frame rate average but a lower minimum.
Half-Life 2: Episode Two
Our next test is a good, old custom-recorded in-game timedemo, precisely repeatable.
What was I saying about most games not being especially CPU-limited? Yeah, the frame rates here will induce nosebleeds, but the X4 955 does run the game faster than the Q9550, as well as both (reasonably priced) flavors of the Core i7.
Source engine particle simulation
Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.
This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.
The X4 955 performs about as expected, but it’s not quite quick enough to match the Q9550.
WorldBench
WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests.
The Phenom IIs can’t quite keep up with the competing Core 2s in WorldBench, largely due to a couple of tests where the AMD CPUs tend to struggle. We’ll look at those individual test results as we go.
Productivity and general use software
MS Office productivity
Firefox web browsing
Multitasking – Firefox and Windows Media Encoder
The X4 955 is quite a bit quicker to finish these three tests than the Q9550, and in two of the three, the new Phenom II outperforms even the Core i7-940. Notably, both the Office and Firefox/Windows Media Encoder tests have a multitasking component in which multiple applications are running and in use concurrently. In both cases, the X4 955 places near the top of the pack, behind only Intel’s very fastest desktop processors.
WinZip file compression
Nero CD authoring
The Phenom II comes back to earth in both of these tests, and both of them, especially Nero, tend to be limited somewhat by disk I/O throughput. We’ve documented quite well that the disk controllers in AMD’s chipsets are a platform-level weakness, largely due to an essentially broken implementation of SATA NCQ.
With that said, WinZip isn’t entirely disk I/O bound, since faster Phenoms consistently finish before slower ones.
Image processing
Photoshop
Here’s another WorldBench component where the AMD processors struggle, relatively speaking. The X4 955 can’t quite catch the slowest Core 2 Quad.
The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.
Only half a second separates the X4 955 and Q9550 in the total completion time for putting together our panorama.
Below is a look at the individual operations required to create a panorama, if you care to see that sort of detail.
picCOLOR image analysis
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Many of the individual functions that make up the test are multithreaded.
Once more, the X4 955 shadows the Q9550 here in another close result.
Media encoding and editing
x264 HD benchmark
This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.
Well, the second pass is pretty much a wash, but the X4 955 finished encoding the entire clip sooner thanks to a higher encode rate in pass 1. Chalk up another one for AMD.
Windows Media Encoder x64 Edition video encoding
Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.
The X4 955 extends its video encoding lead over the Q9550 in our Windows Media Encoder session. Notice, also, that the Core i7-920 trails the X4 955 here.
Windows Media Encoder video encoding
Roxio VideoWave Movie Creator
These last two tests are WorldBench components which we’ve included for completeness. I happen to think our other video encoding tests are better indicators of real-world performance.
LAME MT audio encoding
LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
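To make the two-stage pipeline idea concrete, here is a toy sketch of the structure. It is not LAME MT's code: `analyze` and `encode` are hypothetical stand-ins for the psycho-acoustic analysis and the rest of the encoder, and a small bounded queue plays the role of the buffer between the two threads. (In Python this shows the shape of the pipeline only; it won't deliver the speedup a native encoder gets.)

```python
import queue
import threading


def analyze(frame):
    # Hypothetical stand-in for the psycho-acoustic analysis stage.
    return sum(frame) / len(frame)


def encode(frame, analysis):
    # Hypothetical stand-in for the rest of the encoder.
    return (len(frame), analysis)


def pipelined_encode(frames):
    # Bounded buffer: the analysis thread can only run slightly ahead
    # of the encoding thread before it has to wait.
    buffered = queue.Queue(maxsize=1)

    def analysis_stage():
        for frame in frames:
            buffered.put((frame, analyze(frame)))
        buffered.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_stage)
    worker.start()

    output = []
    while True:
        item = buffered.get()
        if item is None:
            break
        frame, analysis = item
        output.append(encode(frame, analysis))
    worker.join()
    return output


encoded = pipelined_encode([[1, 2, 3], [4, 5, 6]])
print(encoded)
```

With only two stages there are only two threads to keep busy, which is why extra cores beyond the second don't help here.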
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.
This one is almost a tie, but not quite. Q9550 by a nose.
3D modeling and rendering
Cinebench rendering
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.
The X4 955 gives even the fastest quad-core Core 2, the QX9770, a bit of a scare here, but the Core i7-920’s Hyper-Threading takes it to another level. The Core i7-965 Extreme isn’t too far behind the dual-socket QX9775, even.
POV-Ray rendering
We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.
3ds max modeling and rendering
Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into games like Half-Life 2.
The new Phenom II nearly pulls off a clean sweep of our remaining rendering tests, besting the Q9550 in both POV-Ray scenes and both 3ds max tests before falling behind by four seconds in the Valve VRAD job.
Folding@home
Next, we have a slick little Folding@home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.
The Folding@home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@home should be a great example of real-world scientific computing.
notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.
On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
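The arithmetic behind that overall estimate is simple enough to sketch in a few lines. The function name, the WU labels, and the per-WU numbers below are hypothetical placeholders rather than output from notfred's CD; only the average-then-multiply step comes from the description above.

```python
def estimate_total_ppd(ppd_by_wu_type, cores):
    """Average the per-WU-type points per day, then scale by core count."""
    average_ppd = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average_ppd * cores


# Hypothetical per-core results for the four WU types on a quad-core CPU.
example_results = {
    "Tinker": 100,
    "Amber": 120,
    "Gromacs (type 1)": 200,
    "Gromacs (type 2)": 180,
}
total_ppd = estimate_total_ppd(example_results, cores=4)
print(f"estimated total: {total_ppd:.0f} points per day")
```

Because every WU type gets equal weight in the average, a big win in one type (say, Tinker) can carry the overall score even when the other types are close.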
This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.
The overall Folding crown goes to the new Phenom II on the strength of a dominating performance in Tinker type work units, combined with a photo finish in Amber WUs and an ever-so-slight advantage for the Q9550 in the two Gromacs WU types. Looking at the overall score, this is another case where the Core i7’s Hyper-Threading appears to push it into a different performance class than the Core 2 or Phenom II.
MyriMatch proteomics
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of protein. I’ll stop right here and let him explain what MyriMatch does:
In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.
MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
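Here is a small sketch of that job-queue scheme, to show where the 168 figure comes from. It is not MyriMatch's actual code: `run_jobs` and `handle` are hypothetical names, the "proteins" are just integers, and the rounding-up of the job size is our assumption.

```python
import math
import queue
import threading


def run_jobs(proteins, n_threads, handle, jobs_per_thread=10):
    """Split `proteins` into roughly n_threads * jobs_per_thread jobs and
    let worker threads pull jobs from a shared queue until it is empty."""
    n_jobs = n_threads * jobs_per_thread
    job_size = max(1, math.ceil(len(proteins) / n_jobs))

    jobs = queue.Queue()
    for start in range(0, len(proteins), job_size):
        jobs.put(proteins[start:start + job_size])

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            done = [handle(protein) for protein in job]
            with results_lock:  # synchronize once per job, not per protein
                results.extend(done)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return job_size, results


# 6714 stand-in "proteins" and four threads, as in the example above.
job_size, results = run_jobs(list(range(6714)), 4, handle=lambda p: p * 2)
print(job_size, len(results))
```

Handing out many small jobs rather than one big chunk per thread means a thread that finishes early just grabs more work instead of sitting idle.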
The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.
I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:
Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.
Here’s how the processors performed.
The X4 955 is quite a bit slower than the Q9550 with only one thread active, but as the thread count rises, the Phenom II’s performance scales better, so that it has the better score with four concurrent threads.
STARS Euler3d computational fluid dynamics
Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.
In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:
The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.
So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.
The Q9550 is simply faster here, regardless of the thread count.
Power consumption and efficiency
Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system: the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.
All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.
Although we don’t usually include “simulated” CPU speed grades in our power results, I’ve made an exception for the Q9550 since it’s been a focus of our attention today. For the record, our simulated Q9550 ran at 1.24V, right in the middle of the range of possible voltages for this product.
Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.
The system based on the new Phenom II draws less power at idle than the one based on our simulated Q9550. The X4 955’s platform power draw is admirably low for this class of CPU.
Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.
The step up from the X4 940 to the X4 955 hasn’t produced a massive increase in peak power draw, despite the fact that the X4 955’s TDP rating is 30 W higher. Similarly, our X4 955 system consumes a few watts less than the simulated Q9550 system, despite the TDP gap between the two.
Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.
We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
Our Phenom II X4 955 system draws less power while rendering the scene and finishes sooner, so it uses less energy overall to complete the task than the simulated Q9550. The X4 955 also bests all previous AMD processors and several additional Core 2 Quads. AMD says it continually refines its manufacturing processes over time, and it would seem that the company’s 45nm process is making some nice strides.
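For readers who want to try this sort of analysis on their own power logs, here is a minimal sketch of the four metrics described in this section: idle power, peak power, total energy, and energy used during the render. It assumes the meter logs one reading per second, and the sample log, the function names, and the render end time are all made up for illustration. This is not the script we used, just one way the arithmetic could be done.

```python
from statistics import mean

def window_average(samples, start_s, end_s):
    """Average power in watts over samples with start_s <= t < end_s."""
    return mean(w for t, w in samples if start_s <= t < end_s)

def energy_joules(samples, start_s, end_s, interval_s=1.0):
    """Approximate energy in watt-seconds (joules) over a time window.

    Each sample is treated as constant power for one logging interval,
    which is an assumption about how the meter records data.
    """
    return sum(w * interval_s for t, w in samples if start_s <= t < end_s)

# Hypothetical log of (seconds elapsed, watts at the wall):
# 30 seconds of rendering at 150 W, then 30 seconds of idle at 80 W.
log = [(t, 150.0) for t in range(0, 30)] + [(t, 80.0) for t in range(30, 60)]

render_end_s = 30  # hypothetical; in practice this differs per system

peak_power = window_average(log, 15, 25)     # ten-second span while rendering
idle_power = window_average(log, 50, 60)     # trailing edge of the test period
total_energy = energy_joules(log, 0, 60)     # render plus idle time
render_energy = energy_joules(log, 0, render_end_s)  # render period only

print(peak_power, idle_power, total_energy, render_energy)
```

Note that the render-only figure rewards both low power draw and a fast finish, since a quicker system stops accumulating watt-seconds sooner.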
Overclocking
Cranking up the clock speed on a Black Edition processor is as simple as setting the multiplier and tweaking the voltage if needed. Or it’s as complicated as you wish to make it, I suppose, if you’re aiming for some sort of record. For our purposes, we just wanted to see what we could get with a few multiplier and voltage tweaks. As usual, I used the BIOS rather than a Windows-based utility for overclocking, and I kept a log of what happened on each attempt. I did make use of AMD’s Overdrive software, but only for stress testing the CPU once it had booted into Windows. Here’s how my overclocking attempt went down:
3.6GHz, stock 1.437V – Seems OK
3.8GHz, 1.437V – Seems OK (Quick test)
4.0GHz, 1.4375V – Crash during boot
4.0GHz, 1.45V – Crash during boot
4.0GHz, 1.475V – Crash during boot
4.0GHz, 1.5V – Freeze during boot
4.0GHz, 1.5V, tweaked system voltages – Reboot during boot
3.9GHz, 1.4375V – BSOD during boot
3.9GHz, 1.465V – Reboot during stability test
3.9GHz, 1.4875V – BSOD during stability test
3.9GHz, 1.5125V – BSOD during stability test
3.8GHz, 1.4375V – BSOD during stability test
3.8GHz, 1.45V – BSOD during stability test
3.8GHz, 1.475V, tweaked system voltages – BSOD during stability test
3.7GHz, 1.435V – Seems OK
Man, I was sooooo close to 3.8 or 3.9GHz, but it just wasn’t meant to be. As the notes say, I even tried upping the voltage slightly on other system components, like the north bridge and HyperTransport, to see if I could coax out stability at a higher clock speed, but it didn’t quite suffice.
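If you are wondering how multiplier settings map to the speeds in that log, the arithmetic is simple enough to sketch. The snippet below assumes a 200MHz reference clock and half-step multipliers, neither of which is spelled out above, so treat both as assumptions and check what your own board's BIOS offers.

```python
# Assumed reference clock for this platform; not stated in the log above.
REFERENCE_CLOCK_MHZ = 200

def core_clock_ghz(multiplier, reference_mhz=REFERENCE_CLOCK_MHZ):
    """Core clock in GHz for a given CPU multiplier."""
    return multiplier * reference_mhz / 1000

# Multipliers that would correspond to the speeds attempted above,
# under the 200MHz assumption.
for multiplier in (16, 18, 18.5, 19, 19.5, 20):
    print(f"{multiplier:>4}x -> {core_clock_ghz(multiplier):.1f}GHz")
```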
Anyhow, 3.7GHz is still a respectable overclock, especially for a top-of-the-lineup processor, and the X4 955 is pretty darned quick at 3.7GHz.
Not bad, huh? Fastest of the pack here. Of course, it’s overclocking, so good luck.
The value proposition
We’ve taken a long and meandering route through several truckloads of performance data, and in order to help you make sense of it all, we have ripped a page from our recent CPU value article.
To create a synthetic “overall performance” score, we computed an unweighted average of the results for a subset of our tests consisting of the benchmarks used in the CPU value article. Our formula includes 22 different benchmarks, but since our aim is practicality, it excludes a few more esoteric ones like the scientific computing applications. As our baseline, the Athlon X2 6400+ gets a 100% score. Other scores are all relative to it.
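Here is a rough sketch of how an index like this can be computed. The benchmark names and numbers are invented, and the handling of lower-is-better results (such as completion times) by inverting the ratio is an assumption for illustration, not a description of our actual spreadsheet.

```python
from statistics import mean

def overall_score(results, cpu, baseline, lower_is_better=frozenset()):
    """Unweighted average of per-benchmark scores, as a percentage of baseline.

    results maps benchmark name -> {cpu name: raw result}.
    Benchmarks listed in lower_is_better (e.g. times in seconds) have
    their ratio inverted so that a faster CPU still scores above 100.
    """
    percentages = []
    for bench, scores in results.items():
        if bench in lower_is_better:
            percentages.append(100.0 * scores[baseline] / scores[cpu])
        else:
            percentages.append(100.0 * scores[cpu] / scores[baseline])
    return mean(percentages)

# Hypothetical data: one frame-rate test and one timed test.
results = {
    "render_fps": {"baseline_cpu": 10.0, "cpu_a": 15.0},
    "encode_seconds": {"baseline_cpu": 60.0, "cpu_a": 30.0},
}
timed = {"encode_seconds"}

print(overall_score(results, "baseline_cpu", "baseline_cpu", timed))
print(overall_score(results, "cpu_a", "baseline_cpu", timed))
```

Because the average is unweighted, every benchmark counts equally, which is one reason to keep the more esoteric tests out of the mix.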
Of course, what you see below is a crazy experiment and probably meaningless, but some folks may find it a worthwhile thought exercise, at least. These scatter plots show price versus performance in a fairly intuitive way. To oversimplify slightly, the best CPU values tend to be located closer to the top and left edges of the plot.
Well, look at that. The X4 955 averages out to an almost exact performance match for the Core 2 Quad Q9550, yet it’s priced a little bit lower and is thus a better value overall, if you go strictly by these numbers. In fact, the X4 955’s overall value proposition is among the best in the constellation of processors we tested at all different price points.
Incidentally, these plots reflect recent price changes on the Phenom II X4 940 and the Core 2 Quad Q9300. As you can see, several Phenom II chips look to be good values.
Now, here’s another crack at the same issue with total system cost taken into account. To get our pricing numbers for the X axis, we’ve added the cost of a motherboard, memory kit, graphics card, and hard drive to that of our processors. Wherever it made sense, we picked components from our latest system guide. Also, we got all our prices from Newegg. Here’s a complete breakdown:
Intel LGA775 platform | Price | AMD Socket AM2+ platform | Price | Intel Core i7 platform | Price |
Gigabyte GA-EP45-UD3P | $135 | Gigabyte GA-MA790X-UD4P | $110 | Gigabyte GA-EX58-UD3R | $200 |
4GB Kingston DDR2-800 | $47 | 4GB Kingston DDR2-800 | $47 | 6GB Corsair DDR3-1600 | $98 |
Sapphire Radeon HD 4870 512MB | $165 | Sapphire Radeon HD 4870 512MB | $165 | Sapphire Radeon HD 4870 512MB | $165 |
Western Digital Caviar Black 640GB | $75 | Western Digital Caviar Black 640GB | $75 | Western Digital Caviar Black 640GB | $75 |
Total | $422 | Total | $397 | Total | $538 |
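The totals in the bottom row are just the sum of the four component prices for each platform, and the X axis value for any given processor is that total plus the CPU's own price. A quick sketch, using the component prices from the table above (the function names and the CPU price argument are placeholders of my own):

```python
# Component prices in dollars, taken from the table above.
platforms = {
    "Intel LGA775": {"motherboard": 135, "memory": 47, "graphics": 165, "hard drive": 75},
    "AMD Socket AM2+": {"motherboard": 110, "memory": 47, "graphics": 165, "hard drive": 75},
    "Intel Core i7": {"motherboard": 200, "memory": 98, "graphics": 165, "hard drive": 75},
}

def platform_total(name):
    """Sum of the non-CPU component prices for one platform."""
    return sum(platforms[name].values())

def system_cost(name, cpu_price):
    """Total system cost: platform components plus the processor."""
    return platform_total(name) + cpu_price

for name in platforms:
    print(f"{name}: ${platform_total(name)}")
```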
Notice that we are making some assumptions here that may not be entirely valid. For instance, we’ve priced the X4 955 on a Socket AM2+ motherboard with DDR2 memory, though we tested it with DDR3 memory. As you may have noticed, memory type didn’t make much difference at all to the performance of the Phenom II X4 810, and we expect the story will be similar for the X4 955. In the same vein, we priced the Core 2 processors with DDR2 memory, though we tested them with DDR3. Our goal in selecting these components was to settle on a standard platform for each CPU type with a decent price-performance ratio, not to exactly replicate our sometimes-exotic test systems.
Turns out the total component cost for a Socket AM2+ system is a little bit lower than for an Intel LGA775 one. That doesn’t move the needle much, but the X4 955’s strong value proposition does grow a little stronger versus the Q9550.
Conclusions
That, I think, leads us to our conclusions, which ought to be fairly straightforward. The performance contest between the Phenom II X4 955 and the Core 2 Quad Q9550 is crazy close, and even the X4 955’s one apparent weakness, a higher power/thermal rating, turned out to be a non-issue in our testing.
AMD knew what it had in this CPU: practically a mirror image of the Core 2 Quad Q9550. They’ve done two smart things, as a result. They’ve priced the chip right and have given it an unlocked multiplier to simplify overclocking. Add to those things the fact that Socket AM3 seems to have a better upgrade path than LGA775, and the Phenom II X4 955 looks to be the smarter choice for most consumers, should they be choosing between these two products.
Strictly on value, one might wish to step down to one of our favorite Phenom II processors, the X3 720. Gamers, especially, don’t need four cores. If you do value multithreaded performance, the Core i7-920 could be an interesting possibility. But somewhere in between, the Phenom II X4 955 could make a whole lot of sense.