AMD’s Phenom II processors

One of the great privileges of this job is having my own test lab full of the very latest PC hardware for testing and comparison. At least, it’s a privilege if you’re a huge geek who’s into such things. For years now, the prime slot on my CPU test bench, closest to the monitor and keyboard and associated with port one on my KVM switch, has been occupied by AMD-based systems. The positioning doesn’t mean much of anything, really, but I figure an underdog like AMD ought to be first at something, so why not?

Port one has fallen on hard times lately, though. AMD’s Phenom processors haven’t quite been good enough to keep pace with Intel’s latest CPUs, for a variety of reasons. They were late to market, tripped up early on by a show-stopping bug, and couldn’t reach the right clock speeds within the power and thermal limits common to PC processors. AMD has remained fairly competitive by keeping Phenom prices low, but a great many PC enthusiasts have been wooed by the Core 2 processors’ combination of strong performance, low power consumption, and considerable overclocking headroom. The picture only has grown more difficult for AMD with the arrival of the Core i7 and its occasionally heart-stopping speed.

AMD has a potential remedy for the port one blues, though, in the form of the Phenom II, a revised Phenom processor that has been moved to a new, smaller chip fabrication process and tweaked in a variety of ways to achieve higher clock speeds and to wring more performance from every tick of the clock. As a result, port one has been producing some very respectable benchmark scores of late. Could it be that, in the midst of Intel’s ongoing resurgence—nay, dominance—AMD somehow has its swagger back?

The Phenom II X4 940 Black Edition WRX STi LX40 Shazbot 2400XL FTW

Ok, so I added a few extra terms at the end of its name in the subhead above, but the new top Phenom really is named the “Phenom II X4 940 Black Edition.” I kid you not. We’ve come quite a ways from the days, just a year ago, when AMD introduced the Phenom 9600 under a new, simplified naming scheme and extolled the virtues of concise labeling. Since then, we’ve added back the “X4” designator for quad-core processors, even though the model number alone is sufficient to specify the core count, and we’ve now picked up a “II,” courtesy of the die shrink. Oh, and the “Black Edition” thing specifies a CPU with an unlocked upper multiplier. All of these extra letters denote additional goodness, but we’re on goodness overload here, folks.

The more intriguing thing about the Phenom II’s naming scheme may be the model numbers themselves. AMD is introducing a pair of Phenom II X4 products today, the 920 and the 940, and those model numbers sure do ring familiar, what with the Core i7-920 and -940 kicking around out there. I could swear I heard someone from AMD claiming that the whole thing was a big coincidence, but wow. This particular coincidence seems to have made it past the “Whoops” stage, through several months of vetting, and into the “that will sure make for an interesting comparison on the shelves at Best Buy” stage.

Not that we really give a flip about what AMD has decided to call it. What we care about most is the technology. Of course, like the Phenom before it, the Phenom II is a native quad-core design with a very nice system layout that includes an integrated, dual-channel memory controller and dedicated HyperTransport links to and from the rest of the system. AMD’s system architecture has long been like this, and Intel has only just recently delivered a similar infrastructure alongside the Core i7.

The big changes with the Phenom II come in the chip itself, and those changes start with the conversion to a 45nm fabrication process. Because the basic building blocks are smaller than the 65nm process used to build the original Phenom, the Phenom II can pack more transistors into a smaller space while drawing less power and, potentially, operating at higher clock speeds. Intel has been at the 45nm process node for quite a while. AMD may seem a little late to the game, but the underdog brings its own particular spin by employing silicon-on-insulator technology and a brand-new technique called immersion lithography, in which a layer of water is used to focus light.

This new fab process has allowed AMD to fit many more transistors into a Phenom II die—an estimated 758 million, versus 463 million for the Phenom—while reducing the die size from Phenom’s 283 mm² to just 258 mm². Interestingly enough, the Phenom II’s basic specs sound remarkably similar to those of the Core i7, which weighs in at roughly 731 million transistors and 263 mm².
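For a rough sense of what the die shrink buys, here is a back-of-the-envelope density calculation in Python. It uses only the transistor counts and die sizes quoted above; the helper name `density_mtr_per_mm2` is made up for illustration, and the result is a crude average that ignores how much of each die is cache versus logic.

```python
# Transistor counts (millions) and die areas (mm^2) as quoted in the text above.
chips = {
    "Phenom (65nm)": (463, 283),
    "Phenom II (45nm)": (758, 258),
    "Core i7 (45nm)": (731, 263),
}

def density_mtr_per_mm2(transistors_millions, area_mm2):
    """Return millions of transistors per square millimeter (hypothetical helper)."""
    return transistors_millions / area_mm2

densities = {name: density_mtr_per_mm2(t, a) for name, (t, a) in chips.items()}

for name, d in densities.items():
    print(f"{name}: {d:.2f} M transistors/mm^2")
```

By this crude measure, the Phenom II fits roughly 1.8 times as many transistors into each square millimeter as its 65nm predecessor.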

If you’re wondering where the Phenom II’s additional transistors come from, look no further than its L3 cache, which has grown in size from 2MB to 6MB. This larger L3 cache is the centerpiece of AMD’s effort to improve the clock-for-clock performance of its quad-core processor architecture. This cache isn’t just larger, though. It’s also faster, with what AMD claims is a two-cycle improvement in access latencies versus the 65nm Phenom’s L3 cache. Since the L3 cache in this architecture runs at a lower clock frequency than the CPU cores themselves, the improvement in access times may be more substantial than this claim might first seem to suggest. The cache hierarchy is smarter in various ways, too, with more aggressive data prefetch algorithms, twice the bandwidth for L1/L2 coherency probes, and 48-way set associativity for the L3 cache. AMD has made quite a few changes to improve per-clock performance. If you’d like to read about them in more detail, I suggest checking out my review of the 45nm Opterons, which discusses this same silicon in more depth.

One specific I should mention here, though, is a nifty new power-saving feature. The cores on the 65nm Phenom were clocked independently of one another, so that any core could enter a lower-frequency, lower-power state when not in use, but they couldn’t shut down entirely because the contents of their L2 caches needed to be kept available for other cores to check and possibly access. The Phenom II introduces another possibility: the contents of a core’s L2 cache can be transferred into the L3 cache, and the core may then shut down entirely. AMD claims this feature can enable the Phenom II to achieve much lower idle power usage.

There’s some tension, though, between this feature and another change AMD has made in the Phenom II. The firm found that the varying power states (or P-states) on the Phenom could prove to be confusing to the Windows Scheduler, which wouldn’t necessarily choose wisely when deciding whether to schedule a thread on a core with a low P-state or a high one. As a result, enabling the Cool’n’Quiet dynamic power saving feature could lead to unintended performance degradation. To work around this problem, AMD has decided to link together the P-states of the Phenom II’s cores, via some BIOS-level changes. Obviously, this is not the ideal solution, and AMD says it is working with Microsoft to ensure such things work properly in the future.

We don’t yet understand entirely how these linked P-states affect the Phenom II’s ability to put a core into a deep idle state where its L2 cache is flushed into the L3. We’ve heard rumblings from AMD to suggest these two attributes can coexist peacefully, but we don’t yet have an entirely clear sense how they interact.

So here’s the plan

Right now, AMD’s plan for the Phenom II is simple. There will be two models. The Phenom II X4 940 will run at 3GHz and will be priced at $275. The 920 will run at 2.8GHz and list for $235. Both chips will have a 125W TDP (thermal design power) rating, and both will have a north bridge clock of 1.8GHz. (That means their L3 caches and memory controllers will operate at 1.8GHz, and their HyperTransport links will be capable of 3.6 GT/s.) Both the 920 and 940 will be compatible with existing Socket AM2+ motherboards and will support DDR2 memory at up to 1066MHz. As a Black Edition CPU, the 940 will have an unlocked upper multiplier to facilitate easy overclocking. Both processors should be available immediately—if not sooner, considering the spate of early listings at online retailers.
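The jump from a 1.8GHz north bridge clock to a 3.6 GT/s link rate is just double-data-rate arithmetic, since HyperTransport moves data on both edges of the clock. A minimal sketch, with a function name of my own invention:

```python
def ht_transfer_rate_gts(clock_ghz, transfers_per_clock=2):
    """HyperTransport is double-pumped: two transfers per clock cycle."""
    return clock_ghz * transfers_per_clock

# 1.8GHz is the north bridge clock quoted for both Phenom II models.
phenom_ii_ht = ht_transfer_rate_gts(1.8)
print(f"{phenom_ii_ht:.1f} GT/s")
```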

That pricing sets up the Phenom II X4 940 as a direct rival to Intel’s Core 2 Quad Q9400, a 2.66GHz processor whose street price is about $269 right now. That’s not far from the $284 list price of the Core i7-920, but AMD rightly argues that the additional cost of an X58 motherboard and DDR3 memory puts the Core i7-920 in a different price category. Meanwhile, the 920’s closest competition may be the Core 2 Quad Q9300, which sells for around $240-250, although one could make a case for the similar but slightly lower spec Q8300.

The Phenom II’s fairly near-term future, however, will become considerably less simple. Not too terribly long from now—probably weeks rather than months—AMD will introduce a new version of the Phenom II with a memory controller capable of working with DDR3 memory at up to 1333MHz, and it will introduce a new socket type, Socket AM3, to go along with it. In one of the neater tricks we’ve seen along these lines, Socket AM3-capable Phenom II processors will, happily, be backward compatible with current Socket AM2+ motherboards and DDR2 memory.

I’d expect the Socket AM3 versions of the Phenom II to completely supplant the products being introduced today, because they will offer similar functionality along with an upgrade path. I’d also expect these new chips to be where Phenom II really flourishes, with a fuller lineup of products extending to mainstream (~65W) TDP ratings, triple-core budget chips, and probably a higher-TDP flagship FX processor. In addition to all of that, I suspect this newer silicon rev will bring higher clock speeds for the L3 cache, memory controller, and HyperTransport links.

Meanwhile, the best these Socket AM2+ Phenom II processors can hope for is probably a James Dean-style run in which they burn brightly for a short period and then go out in spectacular fashion. Those folks who have already invested in Socket AM2+ motherboards and wish to upgrade may find these first Phenom IIs compelling, but most others will probably want to wait for the Socket AM3 version before taking the plunge.

Test notes

Here’s a look at our test system, in which we inadvertently used all of the components of AMD’s so-called Dragon platform, including a Phenom II, the 790GX chipset, and a Radeon HD 4870.

Handsome, innit?

Nothing against these fine individual components, but I have yet to be impressed by a desktop “platform” package. I prefer simply to choose the best motherboard, processor, and graphics card, regardless of who makes them. To date, “platforms” have only served to thwart that ambition (see: CrossFire and SLI chipset requirements and corporate catfights, for instance) without delivering any concrete advantages. AMD seems to be committed to this direction, though, so you can probably expect to hear more about it. I will admit that AMD’s “Overdrive” utility has some nice features—and a new version is slated for release with the Phenom II—but it’s mainly tied to the 7-series chipsets rather than the whole package, including graphics. And a software utility alone isn’t likely to sell us on the platform concept.

Anyhow, another thing you’ll want to note is that several of the CPU models we tested were actually simulated using underclocked versions of higher-grade processors. Specifically, the Phenom II X4 920 is an underclocked 940, and the Core 2 Quad Q9550 is an underclocked Core 2 Extreme QX9650. We expect the performance of these “simulated” speed grades to be identical to the real things, but we have omitted these processors from our power consumption testing because we do anticipate that power use would vary slightly from the actual products.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Processor: Core 2 Quad Q6600 2.4 GHz | Core 2 Duo E8600 3.33 GHz, Core 2 Quad Q9300 2.5 GHz, Core 2 Quad Q9400 2.66 GHz, Core 2 Quad Q9550 2.83 GHz | Core 2 Extreme QX9770 3.2 GHz | Dual Core 2 Extreme QX9775 3.2 GHz | Core i7-920 2.66 GHz, Core i7-940 2.93 GHz | Core i7-965 Extreme 3.2 GHz | Athlon 64 X2 6400+ 3.2 GHz | Phenom X3 8750 2.4 GHz | Phenom II X4 920 2.8 GHz, Phenom II X4 940 3.0 GHz | Phenom X4 9950 Black 2.6 GHz

System bus: 1066 MT/s (266 MHz) | 1333 MT/s (333 MHz) | 1600 MT/s (400 MHz) | 1600 MT/s (400 MHz) | QPI 4.8 GT/s (2.4 GHz) | QPI 6.4 GT/s (3.2 GHz) | HT 2.0 GT/s (1.0 GHz) | HT 3.6 GT/s (1.8 GHz) | HT 3.6 GT/s (1.8 GHz) | HT 4.0 GT/s (2.0 GHz)

Motherboard: Asus P5E3 Premium | Asus P5E3 Premium | Asus P5E3 Premium | Intel D5400XS | Intel DX58SO | Intel DX58SO | Asus M3A79-T Deluxe | Asus M3A79-T Deluxe | MSI DKA790GX Platinum

BIOS revision: 0605 | 0605 | 0605 | XS54010J.86A.1149.2008.0825.2339 | SOX5810J.86A.2260.2008.0918.1758 | SOX5810J.86A.2260.2008.0918.1758 | 0403 | 0403 | 11/25/08

North bridge: X48 Express MCH | X48 Express MCH | X48 Express MCH | 5400 MCH | X58 IOH | X58 IOH | 790FX | 790FX | 790GX

South bridge: ICH9R | ICH9R | ICH9R | 6321ESB ICH | ICH10R | ICH10R | SB750 | SB750 | SB750

Chipset drivers: INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032 | INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032 | INF update 9.1.0.1007, Matrix Storage Manager 8.5.0.1032 | INF update 9.1.0.1007, Matrix Storage Manager 8.5.0.1032 | AHCI controller 3.1.1540.61 | AHCI controller 3.1.1540.61 | AHCI controller 3.1.1540.61

Memory size: 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 6GB (3 DIMMs) | 6GB (3 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs) | 4GB (2 DIMMs)

Memory type: Corsair TW3X4G1800C8DF DDR3 SDRAM | Corsair TW3X4G1800C8DF DDR3 SDRAM | Corsair TW3X4G1800C8DF DDR3 SDRAM | Micron ECC DDR2-800 FB-DIMM | Corsair TR3X6G1600C8D DDR3 SDRAM | Corsair TR3X6G1600C8D DDR3 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM | Corsair TWIN4X4096-8500C5DF DDR2 SDRAM

Memory speed (effective): 1066 MHz | 1333 MHz | 1600 MHz | 800 MHz | 1066 MHz | 1600 MHz | 800 MHz | 1066 MHz | 1066 MHz

CAS latency (CL): 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5

RAS to CAS delay (tRCD): 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5

RAS precharge (tRP): 7 | 8 | 8 | 5 | 7 | 8 | 4 | 5 | 5

Cycle time (tRAS): 20 | 20 | 24 | 18 | 20 | 24 | 12 | 15 | 15

Command rate: 2T | 2T | 2T | 2T | 2T | 1T | 2T | 2T | 2T

Audio: Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers | Integrated 6321ESB/STAC9274D5 with SigmaTel 6.10.5713.7 drivers | Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers | Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers | Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers | Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers | Integrated SB750/ALC888 with Realtek 6.0.1.5704 drivers

Hard drive (all systems): WD Caviar SE16 320GB SATA

Graphics (all systems): Radeon HD 4870 512MB PCIe with Catalyst 8.55.4-081009a-070794E-ATI drivers

OS (all systems): Windows Vista Ultimate x64 Edition

OS updates (all systems): Service Pack 1, DirectX redist update August 2008

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Our single-socket test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket system was powered by a PC Power & Cooling Turbo-Cool 1KW-SR power supply. Thanks to OCZ for providing these units for our use in testing.

Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.

The test systems’ Windows desktops were set at 1600×1200 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

We figure the graph below is just complicated enough to weed out the lightweights and keep our readership from getting all big and unmanageable. Fortunately, it’s not really all that difficult to read. You’re just seeing how much bandwidth the memory subsystem on each CPU can deliver at different block sizes, which tend to correspond with different caches. For instance, the 1MB block size ought to spill into the L3 cache on the Phenoms.
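If you want a feel for how this kind of block-size sweep works, here is a minimal Python sketch. It is not the tool behind our numbers, and interpreter overhead will swamp the results at small block sizes; the function names and the 64MB default for total traffic are my own assumptions. The idea is simply to move a fixed amount of data through buffers of increasing size and watch the throughput fall as the working set outgrows each cache level.

```python
import time

def copy_bandwidth_mb_s(block_size, total_bytes=64 * 1024 * 1024):
    """Copy a block repeatedly and return the approximate MB/s moved."""
    src = bytearray(block_size)
    repeats = max(1, total_bytes // block_size)
    start = time.perf_counter()
    for _ in range(repeats):
        dst = bytes(src)  # one full read of src plus one full write of dst
    elapsed = time.perf_counter() - start
    return (repeats * block_size) / elapsed / 1e6

def sweep(block_sizes):
    """Measure copy bandwidth at each block size (in bytes)."""
    return {size: copy_bandwidth_mb_s(size) for size in block_sizes}

if __name__ == "__main__":
    sizes = [2 ** k for k in range(12, 27, 2)]  # 4KB up to 64MB
    for size, mb_s in sweep(sizes).items():
        print(f"{size // 1024:>8} KB: {mb_s:,.0f} MB/s")
```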

One noteworthy result here: at that 1MB block size, the Phenom II’s L3 cache bandwidth is higher than the Phenom X4 9950’s, even though the 9950’s L3 cache runs at 2GHz, or 200MHz faster than the Phenom II’s.

Since it’s difficult to see the results once we get into main memory, let’s take a closer look at the 256MB block size:

Although our Core 2 test systems have relatively fast DDR3 memory, their memory bandwidth appears to be limited by their front-side bus speeds. With integrated memory controllers, all of the Phenoms can transfer data from main memory faster, and the Phenom IIs make some nice gains over the original Phenom in this department. I suspect the Phenom II’s larger L3 cache and more aggressive data prefetch algorithm deserve some of the credit for this result. Of course, the Core i7 is even faster still thanks to its onboard triple-channel DDR3 memory controller.

We have noted before that the Phenom’s L3 cache appears to contribute some delay to the whole memory subsystem. Even though the Phenom II’s L3 cache is three times the size and runs 200MHz slower, the Phenom II is nearly as quick at getting out to main memory as the Phenom X4 9950. Not bad.

Below are 3D graphs of memory access latencies at various block and step sizes for the Phenom II and some of its closer rivals. We’ve color coded them just as a guide, although it doesn’t mean much. Yellow roughly corresponds to the chip’s L1 cache size, light orange to the L2 cache, red to the L3 cache, and dark orange to main memory.

The Phenom II’s L3 cache is indeed pretty quick, although the Core i7-920’s is larger and quicker still. The Core 2 Q9400 lacks an L3 cache, but has a larger L2 instead.

Crysis Warhead

We measured Warhead performance using the FRAPS frame-rate recording tool and playing over the same 60-second section of the game five times on each processor. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
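For the curious, the bookkeeping looks something like the Python sketch below. The frame rates are made-up placeholders rather than measured results, and averaging the per-session averages with a simple mean is my assumption about the arithmetic; the median of the lows follows the description above.

```python
from statistics import mean, median

def summarize_sessions(sessions):
    """sessions: list of (average_fps, low_fps) tuples, one per gameplay run."""
    averages = [avg for avg, _ in sessions]
    lows = [low for _, low in sessions]
    return {
        "average_fps": mean(averages),   # assumed: plain mean across sessions
        "median_low_fps": median(lows),  # median blunts the effect of outliers
    }

# Hypothetical numbers for five 60-second runs; not measured results.
runs = [(52.1, 31), (50.8, 29), (53.4, 12), (51.9, 30), (52.6, 32)]
summary = summarize_sessions(runs)
print(summary)
```

Note how the one bad session with a low of 12 FPS doesn’t drag down the reported figure, which is the point of using the median.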

We tested at relatively modest graphics settings, 1024×768 resolution with the game’s “Mainstream” quality settings, because we didn’t want our graphics card to be the performance-limiting factor. This is, after all, a CPU test.

In our first indication of the Phenom II’s real-world performance, the Phenom II X4 940 essentially matches the Core 2 Quad Q9400 in terms of average frame rate, but the Phenom II’s minimum frame rate is slightly higher than the Q9400’s. Not bad.

Those of you looking for clock-for-clock comparison of CPU architectures might want to pay attention to how the Phenom II X4 920, at 2.8GHz, matches up to the Core 2 Quad Q9550 at 2.83GHz. That’s not an exact match, but it’s very close, and the Q9550 has the full 6MB of L2 cache per chip that the, um, non-neutered 45nm Core 2 parts have. As you can see, Intel’s Core 2 architecture remains very potent on a per-clock basis.

Doing a clock-for-clock comparison with the Core i7 is complicated by that processor’s Turbo mode feature, which raises clock speeds by up to 266MHz if there’s thermal headroom available. Even the slowest Core i7 here, the 920, may be running at up to 2.93GHz, especially in our gaming tests, since most games don’t take advantage of more than one or two CPU cores.

Far Cry 2

After playing around with Far Cry 2, I decided to test it a little bit differently by recording frame rates during the jeep ride sequence at the very beginning of the game. I found that frame rates during this sequence were generally similar to those when running around elsewhere in the game, and after all, playing Far Cry 2 involves quite a bit of driving around. Since this sequence was repeatable, I just captured results from three 90-second sessions.

Again, I didn’t want the graphics card to be our primary performance constraint, so although I tested at fairly high visual quality levels, I used a relatively low 1024×768 display resolution and DirectX 9.

Here’s a nicer result for the Phenom IIs, as they reach into Core i7 territory. Can they keep this up?

Unreal Tournament 3

As you saw above, I did manage to find a couple of CPU-limited games to use in testing. I decided to try to concoct another interesting scenario by setting up a 24-player CTF game on UT3’s epic Facing Worlds map, in which I was the only human player. The rest? Bots controlled by the CPU. I racked up frags like mad while capturing five 60-second gameplay sessions for each processor.

Oh, and the screen resolution was set to 1280×1024 for testing, with UT3’s default quality options and “framerate smoothing” disabled.

The Phenom II processors perform quite well in this game, too, clearly outrunning their direct price competition. With a host of bots to control, UT3 seems to take advantage of more than two CPU cores. Then again, with these frame rates, any of these processors will run this game quite smoothly.

Half-Life 2: Episode Two

Our next test is a good, old custom-recorded in-game timedemo, precisely repeatable.

AMD’s naming scheme would seem to make perfect sense based on these results. The Phenom II X4 940 just outruns the Core i7-940, while the Phenom II X4 920 also edges out the Core i7-920.

Source engine particle simulation

Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.

This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

The two Phenom IIs come back to earth a little here, just trailing the Q9400 and Q9300, respectively.

WorldBench

WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests.

Intel’s Core microarchitecture excels at the sort of integer math needed for many everyday productivity tasks, and these results serve to illustrate that point. Checking in on our clock-for-clock comparison, the Core 2 Quad Q9550 outscores the Phenom II X4 920 pretty handily. Still, the Phenom IIs only trail their closest Core 2 Quad competition by a few points in each case.

Productivity and general use software

MS Office productivity

Firefox web browsing

Multitasking – Firefox and Windows Media Encoder

WinZip file compression

Nero CD authoring

Two of the benchmarks above, the MS Office and Firefox/Windows Media Encoder tests, involve simulated multitasking, with multiple applications running at once. In both of them, the Phenom II processors finish ahead of their Core 2 Quad rivals. They’re also quite strong in Firefox alone. The Phenoms fall behind in WinZip, though.

Image processing

Photoshop

Ouch. This is a result we’ve seldom seen in all of our experience with AMD’s 45nm quad-core processors: the older 65nm Phenoms are actually faster. Not sure what to make of that one, other than to say that this performance was consistent for both Phenom II speed grades across multiple test runs. Odd.

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Phenom IIs recover with a respectable showing in our photo-stitching app. Below is a look at the individual operations required to create a panorama, if you care to see that sort of detail.

picCOLOR image analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Many of the individual functions that make up the test are multithreaded.

Looks to me like the Phenom II’s poor showing in Photoshop was some sort of anomaly. In our third image manipulation test, the Phenom IIs basically tie the Q9300 and Q9400.

Media encoding and editing

x264 HD benchmark

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.

In pass one, the Phenom II appears to match the Core 2 clock for clock, as the Phenom II X4 920 just trails the Core 2 Quad Q9550. In pass two, though, we have a pair of photo finishes between like-priced competitors.

Windows Media Encoder x64 Edition video encoding

Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

Windows Media Encoder video encoding

Roxio VideoWave Movie Creator

LAME MT audio encoding

LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
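The two-stage pipeline described above can be sketched in a few lines of Python. This is an illustration of the structure only, not LAME MT’s actual code: `analyze` and `encode` are hypothetical stand-ins for the real psycho-acoustic analysis and encoding work, and CPython threads won’t deliver a true two-core speedup on CPU-bound work the way native threads do.

```python
import queue
import threading

def analyze(frame):
    """Stand-in for psycho-acoustic analysis (hypothetical)."""
    return sum(frame) / len(frame)

def encode(frame, analysis):
    """Stand-in for the rest of the encoder (hypothetical)."""
    return (len(frame), analysis)

def pipelined_encode(frames):
    # A small bounded buffer keeps the analysis stage only slightly ahead.
    handoff = queue.Queue(maxsize=1)
    output = []

    def analysis_stage():
        for frame in frames:
            handoff.put((frame, analyze(frame)))
        handoff.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_stage)
    worker.start()
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        output.append(encode(frame, analysis))
    worker.join()
    return output

frames = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
encoded = pipelined_encode(frames)
print(encoded)
```

Because there are only two stages, adding a third or fourth core gives this design nothing extra to do, which is why the test tops out at two cores.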

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.

The rest of our media encoding tests show us a seesaw battle between the Q9400 and 940, and another between the Q9300 and 920. Remarkable how similarly these CPUs perform, given their sheer complexity and very different architectures.

3D modeling and rendering

Cinebench rendering

Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores (or threads, in CPUs with multiple hardware threads per core) are available.

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

3ds max modeling and rendering

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to pre-compute lighting that goes into games like Half-Life 2.

All told, our suite of 3D rendering tests fails to break the stalemate between the Phenom IIs and their Core 2 Quad adversaries. I’m not sure this contest could be any closer.

Folding@home

Next, we have a slick little Folding@home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
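The final estimate boils down to an average and a multiplication. Here is a minimal Python sketch of that last step, with a function name I made up and entirely hypothetical scores; only the Tinker and Amber labels come from the description above, and the other two are placeholders.

```python
def estimate_points_per_day(ppd_by_wu_type, core_count):
    """Average the per-WU-type points per day, then scale by core count."""
    average_ppd = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average_ppd * core_count

# Hypothetical single-core scores; "wu_type_3" and "wu_type_4" are placeholders.
sample = {"Tinker": 100.0, "Amber": 120.0, "wu_type_3": 200.0, "wu_type_4": 180.0}
total = estimate_points_per_day(sample, core_count=4)
print(total)
```

The multiplication assumes every core performs the same and that the cores don’t slow each other down, which is one reason to treat the total as an estimate.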

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

This is beginning to feel like a reality TV show where the producers tip the scales to make the contest seem even. I swear, folks, we just run the tests and the results come out as they will. The Q9400/940 and Q9300/920 contests really are this tight.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
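The job-queue scheme described above can be sketched in a few lines of Python. To be clear, this is an illustration of the general technique and not MyriMatch’s source code: the function names, the `handle_protein` callback, and the rounding of job sizes are all assumptions on my part. Python threads also won’t speed up CPU-bound work the way native threads do, so treat this as a picture of the scheduling only.

```python
import queue
import threading


def split_into_jobs(proteins, thread_count, jobs_per_thread=10):
    """Split the in-memory database into roughly thread_count * 10 jobs."""
    if not proteins:
        return []
    job_count = max(1, thread_count * jobs_per_thread)
    job_size = -(-len(proteins) // job_count)  # ceiling division
    return [proteins[i:i + job_size] for i in range(0, len(proteins), job_size)]


def process_database(proteins, thread_count, handle_protein):
    """Run handle_protein on every protein using a shared queue of jobs."""
    jobs = queue.Queue()
    for job in split_into_jobs(proteins, thread_count):
        jobs.put(job)

    def worker():
        # Each thread takes a new job as soon as it finishes its current one,
        # so no thread sits idle while work remains in the queue.
        while True:
            try:
                job = jobs.get_nowait()
            except queue.Empty:
                return
            for protein in job:
                handle_protein(protein)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


# The yeast database from the test has 6714 proteins; with four threads
# that works out to 40 jobs of about 168 proteins each.
yeast_jobs = split_into_jobs(list(range(6714)), thread_count=4)
print(len(yeast_jobs), len(yeast_jobs[0]))  # 40 168
```

Making the jobs much smaller than one thread’s share of the work is the design choice that matters here: threads only touch the shared queue once per job, yet a slow job can’t leave the other cores waiting for long.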

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

Here’s an interesting bit of additional drama amidst the startling parity on display. The Phenom II X4 940 is 21 seconds slower than the Q9400 with only one thread, but its performance scales better as the thread count ramps up. At four threads, the Phenom II is a few seconds faster.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45°. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

With the Phenom II’s superior memory bandwidth, I had expected a different result here. I’m curious to see how the transition to Socket AM3 and DDR3 memory affects this one.

Power consumption and efficiency

Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.

Clearly, the Phenom II has much lower idle power use than its 65nm predecessors. On the theme of striking parity with the Core 2, though, let’s have a closer look at the Q9400/940 data, in direct comparison:

Man, is that ever close. The two systems idle at just about the same power level and complete the rendering task in nearly the same amount of time. (The Phenom II is a smidgen quicker.) True to its lower 95W TDP rating, though, the Core 2 Quad Q9400 draws less power under load than the Phenom II. This Q9400 is a brand-new chip and appears to be a new stepping of the Penryn silicon (CPU-Z reports stepping A, revision R0) with lower power draw at idle than our older Core 2 Quad Q9300 (stepping 7, revision M1).

Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Phenom II X4 940 system’s power use is ever so slightly lower than the Q9400-based system’s. A couple of differences worth noting in the system configs: the Q9400 system is based on the high-end X48 chipset with more PCIe lanes for graphics than the AMD 790GX chipset the Phenom II system uses, and the Q9400 system uses DDR3 memory, which operates at lower voltage than DDR2. Offsetting considerations, perhaps, at least in part.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

The Phenom II’s higher peak power draw is evident here, but still, AMD has shaved 30W off of the Phenom X4 9950’s peak while delivering higher performance at the same time.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
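For the curious, the slicing described above boils down to two operations on the meter’s log: averaging the readings inside a time window (as with the peak figure) and integrating power over time to get energy in joules. Here is a rough Python sketch under some loud assumptions: the log is a list of (seconds, watts) pairs, the integration is a simple trapezoidal sum, and the sample log is invented for illustration, not taken from our test data.

```python
def mean_power(samples, start_s, end_s):
    """Average watts over the samples with start_s <= t < end_s."""
    window = [watts for t, watts in samples if start_s <= t < end_s]
    if not window:
        raise ValueError("no samples in the requested window")
    return sum(window) / len(window)


def energy_joules(samples, start_s, end_s):
    """Trapezoidal integral of power over [start_s, end_s], in watt-seconds."""
    window = [(t, watts) for t, watts in samples if start_s <= t <= end_s]
    total = 0.0
    for (t0, w0), (t1, w1) in zip(window, window[1:]):
        total += (w0 + w1) / 2 * (t1 - t0)
    return total


# Hypothetical log: one reading per second for 60 seconds, 100 W at idle
# and 200 W while rendering between the 10 s and 40 s marks.
power_log = [(t, 200.0 if 10 <= t <= 40 else 100.0) for t in range(61)]

peak_watts = mean_power(power_log, 15, 25)        # average during the render
total_energy = energy_joules(power_log, 0, 60)    # render plus idle time
render_energy = energy_joules(power_log, 10, 40)  # render period only
print(peak_watts, total_energy, render_energy)
```

The render-only figure is the one that rewards a faster chip: finishing sooner shrinks the window being integrated, even if the draw during the render is higher.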

Because of its higher draw under load, the Phenom II’s power efficiency can’t quite match that of Intel’s Core 2 Quads, but it comes very close.

Overclocking

You may have already heard some of the hype about overclocking headroom in the Phenom II. Of course, such things are never guaranteed, but a willingness to reach higher clock speeds could bode well for AMD’s 45nm quad-core processors. And, given that the Phenom II X4 940 comes as a Black Edition with an unlocked multiplier, overclocking it could be incredibly easy. Also, AMD probably needs a little additional headroom in these chips to match the 45nm Core 2, which has long been an excellent overclocker.

Our Phenom II X4 940 didn’t quite reach the near-4GHz heights that we saw in AMD’s early press demo, but it did pretty well with only the assistance of our large-ish Cooler Master air cooler. Here’s a quick log of my overclocking attempts and their results. I was using a multithreaded version of Prime95 for stress testing, generally only testing for a few minutes at each speed grade during my initial attempts.

3.2GHz, stock voltage – Seems OK

3.4GHz, stock voltage – P95 thread error

3.4GHz, 1.3625V – Seems OK

3.6GHz, 1.3625V – Immediate P95 error

3.6GHz, 1.3875V – P95 errors

3.6GHz, 1.4V – P95 error, thread 4

3.6GHz, 1.4125V – P95 errors

3.6GHz, 1.425V – Seems good, temps at ~41C

3.8GHz, 1.425V – Crash during boot

3.7GHz, 1.425V – Crash during boot

3.7GHz, 1.45V – Crash during boot

3.7GHz, 1.475V – Crash during boot

3.7GHz, 1.4875V – Crash during boot

3.7GHz, 1.5V – Crash during boot

3.7GHz, 1.525V – Crash during boot

3.7GHz, 1.55V – Crash during boot

Back to 3.6GHz, 1.425V – Crash during boot

Stock speed and voltage – Boots fine

3.6GHz, 1.4375V – BSOD!

3.5GHz, 1.425V – Seems OK, temps at ~40C

Getting the system to boot into Windows at 3.7GHz proved to be impossible, even at 1.55V. As you can see, I eventually settled on 3.5GHz as a nice, stable overclock.

I was close to 3.6GHz, but things seemed to go downhill as I raised the voltage, which is never a good sign. Perhaps with a little more tweaking this CPU could make it over the hump at 3.6GHz or better, but I didn’t have time to nurse it along. For what it’s worth, fellow TR editor Geoff Gasior made his own attempts with a different Phenom II X4 940 chip, and he tells me his was stable at 3.5GHz and 1.4375V using a Scythe Ninja cooler.

The Phenom II’s performance was already strong in both of these games at stock speeds. When overclocked to 3.5GHz, the Phenom II is nipping at the heels of the Core i7-965 Extreme.

Conclusions

In the Phenom II, AMD has produced a chip that comes strikingly close to duplicating the performance of Intel’s mid-range Core 2 Quad processors, the Q9300 and Q9400. The Phenom II proved to be faster in several of our gaming tests, but it was slower in some components of WorldBench, including WinZip and Photoshop, which lowered its overall score a bit. On the whole, though, the key characteristic we saw through the bulk of our performance tests was the remarkable parity between the Phenom II X4 940 and the Core 2 Quad Q9400, and the same between their siblings one notch down the ladder. The idle power use of our Phenom II X4 940 system was a couple of watts lower than that of its Core 2 Quad-based adversary, although the Phenom II system did draw 24W more under load, which was the Phenom II’s one definitive disadvantage in this comparison. The Phenom II may even be able to rival the Core 2 Quad’s vaunted overclocking headroom, and it’s hard to argue with the ease of overclocking a Black Edition processor with a simple multiplier tweak.

Considering that the Core 2 Quad Q9400 was the featured processor in the “Sweeter spot” build in our latest system guide, the Phenom II puts AMD back in the running right in the middle of the enthusiast PC market, where price and performance converge in solid value. You still don’t need (and probably won’t benefit from) quad cores for gaming, but the Phenom II’s individual cores are more than fast enough to perform well in games, in addition to the multitasking and multithreaded performance they can deliver in other scenarios.

With the Socket AM3 versions of the Phenom II looming so close on the horizon, I suspect many folks may choose to wait on building or buying an all-new AMD-based system. Current owners of Socket AM2+ motherboards may not wish to delay any further, though, and if they’ve been waiting to upgrade from something like an Athlon X2, well, I wouldn’t blame them for making the leap now. Just remember that you’re sacrificing an easy upgrade path to newer motherboards and DDR3 memory.

Some tough realities remain for AMD, but I’d say they’re now tempered by a little more hope. Although the Phenom II is a marked improvement over the original 65nm Phenom, AMD still can’t match the fastest Core 2 Quads in clock-for-clock or outright performance. And obviously, the Core i7 is yet another step beyond the Core 2. With Socket AM3, though, AMD should have an infrastructure in place that’s very much like the one Intel plans to introduce with the upcoming Nehalem-derived mainstream desktop processors. AMD will then have plenty of knobs and dials to tweak on the Phenom II—memory speed, L3 cache frequency, and core clocks among them—to increase the performance or improve the power efficiency of its CPUs. I’m doubtful any version of the Phenom II will be able to take advantage of raw bandwidth like the Core i7 does in our scientific computing benchmarks, but higher speed grades could help. And, frankly, such applications aren’t presently all that important for desktop PCs. Given all of that, the match-up between future versions of the Phenom II and the upcoming Core i5 (or whatever it’s eventually called) might be a challenge for AMD, but it may not be Armageddon after all.

dpaus
13 years ago
Reply to  indeego

Hmmm, indeego, I guess you haven’t heard about the Palm Pre that everyone called the hit of this year’s CES. All-new OS, all-new GUI, all-new apps, gorgeous new hardware (4oz!) – nope, no innovation going on there…

swaaye
13 years ago
Reply to  apaige

A couple things to consider:
-Core 2 was slapping the competition around before any significant optimizations were done.
-Athlon 64/Opteron & Athlon 64 X2 were beating P4 before any significant optimizations were done.

I don’t really believe in judging a processor by its “potential”. These chips don’t matter long enough for it to, well, matter. And since Phenom is based heavily on 9 year old Athlon cores, I definitely have little faith in it going anywhere at this point. It’s an architecture that’s been around in some form for forever. And finally, Phenom 2 is a slightly upgraded version of…
