AMD’s Phenom processors

If you’re reading this article, chances are you already know a thing or two about Phenom processors. After all, they’ve been in development for years, and AMD has been talking about them publicly for quite some time. In fact, we’ve even reviewed the exact same silicon in different outerwear, the quad-core Opterons, earlier this year. We’ve heard all about how Phenom will be the world’s first “native” quad-core desktop processor, how such integration has tangible benefits for performance and power consumption, and how folks will be absolutely stunned by the synergistic convergence of Phenom processors, the 790FX chipset, and Radeon HD 3800-series GPUs.

What we didn’t have, however, were answers about some key Phenom basics: How fast will it be, both in terms of clock speeds and performance per clock? When will it be available? And will it have been worth the, erm, considerable wait?

Today we have some answers, after a long weekend spent in hands-on testing with a Phenom processor. Read on for our extensive first look at AMD’s brand-new CPU.

The Phenom steps up

Yes, it’s called Phenom, which I’ve heard pronounced “fee-nom,” like a promising young pitching savant in some baseball club’s farm system, and “fen-om,” which rhymes with “venom” and sounds positively toxic to my ears. Either way, what the name mainly evokes for me is this: Phenom-ena. I suppose it’s fine as far as CPU names go, though.

The chip itself is the same basic “K10” design found in AMD’s quad-core Opterons. Although those CPU cores are derived from the ones found in current Athlon 64 X2 processors, AMD has made substantial revisions to them in order to improve per-clock performance and efficiency. The cores now have a wider, 32-byte instruction fetch, and the floating-point units can execute 128-bit SSE operations in a single clock cycle. Phenom can execute the Supplemental SSE3 instructions Intel included in its Core 2 processors, but not the newer SSE4 extensions in Intel’s just-introduced 45nm chips. The K10 core has more bandwidth throughout in order to accommodate higher throughput—internally between units on the chip, between the L1 and L2 caches, and between the L2 cache and the north bridge/memory controller.



The quad-core Phenom die. Source: AMD.

These improved cores are, of course, now grouped four to a chip, and AMD has added a third level to the cache hierarchy in order to assist with integration of the cores. As a result, each Phenom core has 64K of L1 data cache, 512K of dedicated L2 cache, and access to the 2MB L3 cache shared between all cores. An interesting quirk of the Phenom design is that the L3 cache runs at the clock speed of the memory controller/north bridge section of the chip, which is typically slower than the CPU core clocks. Since the L3 cache is an integral part of the memory hierarchy, north bridge clock speeds will be a key factor in overall Phenom performance.

The chip’s integrated memory controller can talk to dual channels of DDR2 memory at speeds up to 1066MHz. This memory controller has been improved in various ways, among them larger buffers and an improved mechanism for speculative data prefetch. The memory controller can also be configured to access its two 64-bit memory channels independently, instead of treating them as a single 128-bit entity.
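For a rough sense of scale, peak theoretical memory bandwidth falls straight out of the channel width and transfer rate. The snippet below is our own back-of-the-envelope sketch, not an AMD figure. It assumes the quoted "MHz" numbers are effective transfer rates in millions of transfers per second, and the function name is made up for illustration. Real-world throughput will land below these peaks.

```python
def peak_bandwidth_gbs(transfers_mhz, channel_bits=64, channels=2):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes transfers_mhz is the effective data rate in millions of
    transfers per second, as in "DDR2-1066".
    """
    bytes_per_transfer = channel_bits // 8
    return transfers_mhz * 1e6 * bytes_per_transfer * channels / 1e9


ddr2_1066 = peak_bandwidth_gbs(1066)  # the controller's rated maximum
ddr2_800 = peak_bandwidth_gbs(800)    # the speed used in our test rigs

print(f"Dual-channel DDR2 at 1066MHz: {ddr2_1066:.1f} GB/s peak")
print(f"Dual-channel DDR2 at 800MHz:  {ddr2_800:.1f} GB/s peak")
```

Whether the two channels are ganged into one 128-bit interface or run independently, the peak figure is the same; the difference is in how well concurrent requests from four cores can be serviced.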

As with any new chip design these days, the Phenom has been tuned for power efficiency as well as performance. Most prominently, in this case, the Phenom’s four cores are clocked independently and can dynamically raise or lower their clock speeds in response to demand. The Phenom’s core voltage is still determined by the power state of the core with highest utilization, but AMD has separated the power plane for the chip’s CPU core from the power plane for its memory controller. Only motherboards conforming to the new Socket AM2+ standard will be able to reap the benefits of Phenom’s split power planes, but Phenom ought to be compatible with—and able to act as a drop-in upgrade for—existing Socket AM2 motherboards. (Though, as always, you’ll want to check with your motherboard maker about compatibility, and your mobo may need a BIOS update. Your mileage may vary. One never knows. All rights reserved. Etcetera.)
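The independent-clocks, shared-voltage arrangement can be sketched in a few lines. The P-state names, clock speeds, and voltages below are hypothetical values chosen purely for illustration, not AMD specifications. The point is only that each core picks its own clock while the shared core voltage has to satisfy the most demanding core.

```python
# Hypothetical P-state table: name -> (core clock in GHz, required voltage in V).
# These numbers are placeholders for illustration, not AMD specifications.
P_STATES = {
    "P0": (2.3, 1.25),   # full speed
    "P1": (1.15, 1.05),  # reduced speed
}


def chip_state(core_p_states):
    """Return per-core clocks and the single shared core voltage.

    Each core runs at its own clock, but the shared voltage must be high
    enough for the core in the most demanding power state.
    """
    clocks = [P_STATES[p][0] for p in core_p_states]
    shared_voltage = max(P_STATES[p][1] for p in core_p_states)
    return clocks, shared_voltage


# Three idle cores and one busy core: the busy core sets the voltage for all.
clocks, voltage = chip_state(["P1", "P1", "P0", "P1"])
print(clocks, voltage)
```

In this toy model, a single busy core keeps the voltage up for its three idle neighbors, which limits how much power the lower clocks can save.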

Socket AM2+ also brings support for another Phenom feature: HyperTransport 3.0. This interconnect links the Phenom to the rest of the system for I/O and the like, although it’s not a traditional front-side bus, since the Phenom has its own memory controller. Revision 3.0 of HyperTransport doubles the effective clock speed and data rate of the interconnect, giving Phenoms twice the external bandwidth of the Athlon 64 X2 in the same 940-pin socket.

And yes, it is the same socket. Not only should Phenoms be able to fit into Socket AM2 mobos, but Athlon 64 processors should drop comfortably and functionally into Socket AM2+ motherboards. At present, the only Socket AM2+ chipset on the market is AMD’s 790FX, which we’ve reviewed today, as well. Notice how smoothly I worked in that plug there, Geoff. No one will notice.

AMD says it has plans to introduce yet another socket, dubbed AM3, to go along with its 45nm CPUs. That new dynamic duo will enable support for DDR3 memory types, when it arrives in 2008. Or, you know, whenever’s convenient. I wouldn’t put any money on 2008.

The Phenom’s great, big bundle of integrated goodness is manufactured as a single chip on AMD’s 65nm silicon-on-insulator fabrication process at its Fab 36 facility in Dresden, Germany. All told, the Phenom has roughly 463 million transistors, and the chip’s area is 285 mm². That’s fairly large as far as CPUs go. Intel’s brand-new 45nm Penryn chips come two to a package in its quad-core processors, but each chip fits 410 million transistors into a 107 mm² die. Since larger chips are exponentially more prone to manufacturing defects, the Phenom’s relatively large size may cause AMD some headaches over time. The big upside here is the K10’s tighter integration of four cores and faster communication between them, a benefit that may pay bigger dividends in the multi-socket server arena than in desktop processors like Phenom.
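To put those die figures in perspective, here is a quick sketch that computes transistor density and then runs both die sizes through a simple Poisson yield model, in which the share of defect-free dies falls off as exp(-defect density × area). The model is a textbook simplification, and the defect density used here is a hypothetical placeholder, not a number from AMD or Intel. Treat the output as an illustration of the trend rather than an estimate of anyone's actual yields.

```python
import math


def density_m_per_mm2(transistors_millions, area_mm2):
    """Transistor density in millions of transistors per square millimeter."""
    return transistors_millions / area_mm2


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of defect-free dies under a simple Poisson defect model."""
    area_cm2 = area_mm2 / 100.0
    return math.exp(-defects_per_cm2 * area_cm2)


phenom_density = density_m_per_mm2(463, 285)  # 65nm, one quad-core die
penryn_density = density_m_per_mm2(410, 107)  # 45nm, one dual-core die

D0 = 0.5  # hypothetical defects per square centimeter, for illustration only
phenom_yield = poisson_yield(285, D0)
penryn_yield = poisson_yield(107, D0)

print(f"Phenom: {phenom_density:.2f}M transistors/mm^2, yield {phenom_yield:.0%}")
print(f"Penryn: {penryn_density:.2f}M transistors/mm^2, yield {penryn_yield:.0%}")
```

Keep in mind that Intel needs two good dies per quad-core package, so the per-die yield alone doesn't settle the cost question. It does show why a single large die is the riskier bet.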

I’m hopped up on several gallons of straight espresso, this review is over 12 hours late, and my kids will be eating pop-tarts for dinner for the next month, so that’s all the time I have to discuss microarchitectural specifics in this context. However, you can learn more about the K10 design by reading my review of the quad-core “Barcelona” Opterons, or you can skip over to David Kanter’s incisive Barcelona architecture overview for more detail on the CPU-geek stuff.

AMD’s next top model

I was playing catch with my six-year-old the other day, and she mentioned something she’d heard about in preschool. “Daddy, why can’t AMD get Phenom clock speeds higher?” That’s when I knew AMD’s struggles were not exactly out of public view. We’ve expected the Phenom to launch at many clock speeds over the past six months or so, each one a little lower than the last. Quite recently, AMD told us it planned to introduce a 2.4GHz version at launch, but alas, that didn’t come to pass. As I told my daughter, “Sweetie, the latest Phenom chip revision has a TLB problem that causes instability at higher clock speeds.” We don’t have precise details on the nature of the problem, but AMD told us that fixing it will require a new spin of the chip, which is why higher speed Phenom variants have been delayed.

As a result, AMD is introducing a pair of Phenom models today, with promises of more later. The Phenom 9500 is the first model; it will be clocked at 2.2GHz, and the Phenom 9600 will run at 2.3GHz. Both chips have a 95W TDP rating. AMD recently introduced its “ACP” power rating system, but the company has yet to assign ACP ratings to its Phenom models. Expect those numbers, when they come, to be lower than the TDP numbers.

We know from experience with the quad-core Opterons that a Phenom at 2.3GHz isn’t going to recapture the overall performance crown from Intel. As we wrote in that review, AMD needs to reach something close to clock speed parity in order to catch Intel, and the top Core 2 processors run at 3GHz. You can do the math on that one. In order to make the Phenom attractive, then, AMD has priced it to move. The 9500 will list for $251 and the 9600 for $281—right in the territory of our current favorite Intel CPU value, the Core 2 Quad Q6600. AMD even plans to sweeten the pot by offering a Phenom variant akin to the recent Athlon 64 X2 5000+ “Black Edition.” This 2.3GHz Phenom will have an unlocked upper multiplier for easy overclocking, and it should be priced similarly to the locked versions of the same. This strategy isn’t a substitute for achieving outright performance leadership, but it’s certainly a nice way to capture the attention—and perhaps the affections—of PC enthusiasts.
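For those who'd rather not do that math in their heads, here's a tiny sketch. It leans on a deliberately crude assumption, that overall performance is simply clock speed times per-clock performance, which ignores memory, cache, and platform effects entirely.

```python
phenom_ghz = 2.3  # Phenom 9600
core2_ghz = 3.0   # top Core 2 processors

# How far behind the Phenom is on clock speed alone.
clock_deficit = 1 - phenom_ghz / core2_ghz

# Per-clock advantage the Phenom would need to break even,
# under the crude model: performance = clock * per-clock performance.
required_per_clock_advantage = core2_ghz / phenom_ghz - 1

print(f"Clock speed deficit: {clock_deficit:.1%}")
print(f"Per-clock advantage needed for parity: {required_per_clock_advantage:.1%}")
```

Under that model, the Phenom 9600 would need roughly a 30% per-clock advantage to match a 3GHz Core 2, and nothing in our Opteron results suggested anything of the sort.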

Down the road, AMD does plan to introduce higher-clocked variants of the Phenom, starting with the 9700 at 2.4GHz. This model won’t arrive until some time in the first quarter of next year, but when it does, it will follow the same value-oriented pricing scheme as current products, with a price “below $300.” Later in Q1 2008, the Phenom 9900 should debut at 2.6GHz, with a price tag “under $350.” These chips should be newer revisions of Phenom silicon with the TLB bug corrected. Of course, plans like these may sound good, but AMD will have to execute on them, and that’s the tricky part. Also, pricing on the 9700 and 9900 models may look like quite a bit less of a bargain once Intel ships the rest of its 45nm product lineup, which should happen early next year, as well.


Here’s a look at the Phenom engineering sample we received for testing. Unfortunately, our access to Phenom chips prior to the products’ introduction was limited by an obviously nervous AMD. We only received this CPU late last week, and it’s not entirely representative of consumer products, despite the fact that AMD expects to have processors selling in a few days, in time for Black Friday shopping. Our chip is clocked at 2.6GHz, like the proposed Phenom 9900 model expected next year. We’ve tested it at that speed, but we’ve also clocked it down to 2.3GHz a la the Phenom 9600 for another round of tests.

I should mention that both shipping Phenom models and our 2.6GHz engineering sample come with a 2GHz north bridge clock. We’ll have a look at why that clock is an important one in our benchmark results shortly.

Our Phenom sample came to us with another engineering sample, an early rev of the Asus M3A32-MVP Deluxe motherboard based on the 790FX chipset. The Phenom and 790FX make up two of the three elements of AMD’s so-called “Spider” platform. Unfortunately, the platformization pitch didn’t translate into a stable system. I don’t think we’ve ever experienced so much trouble with a major hardware product so close to its launch as we did with the Phenom/790FX combo. The system simply wasn’t stable and had to be coaxed through our test suite with a combination of BIOS tweaking, trial-and-error retries, and prayer. We were able to complete our testing, but we’re very much hoping what we experienced isn’t representative of the products that will be shipping to consumers this week. If it is, AMD and its partners are in for a deluge of support requests from frustrated customers.

Intel’s November surprise

You may watch the 2007 New England Patriots and think that rubbing it in is bad sportsmanship. Intel watches them and thinks, “Hey, cool. They hung 56 on Buffalo. Respect!” That, I suppose, is the spirit in which Intel shipped out its little greeting committee for the Phenom, the chip pictured below.



The Core 2 Extreme QX9770

The Core 2 Extreme QX9770 is based on the same 45nm Yorkfield design as the QX9650, but it runs at 3.2GHz on a 1600MHz front-side bus. Like the 2.6GHz Phenom, this chip isn’t scheduled to arrive until next year, but we can give you an early preview now. Of course, unlike the Phenom 9900, this processor won’t cost under $350. Intel says to expect pricing above today’s Extreme processors, which are already topping $1100. Also, this part has a TDP rating of 136W, putting it beyond the traditional power envelopes we’ve come to expect from Intel.



Gigabyte’s X38-DQ6 handled the 1600MHz front-side bus easily

The QX9770 presents another problem, in that today’s motherboards and chipsets aren’t rated for 1600MHz FSB operation, at least officially. That FSB speed will get its official blessing with the introduction of the upcoming Intel X48 chipset. Fortunately, with the latest BIOS, we were able to achieve a stable configuration using a Gigabyte X38-DQ6 motherboard. The board we had on hand was the DDR2 version, so it wasn’t able to use DDR3 memory like our other Core 2 test platform, but that shouldn’t have a major impact on performance.

Clearly, the QX9770 won’t be direct competition for the Phenom 9600 or even the 9900. Instead, it seems to be an exclamation point on Intel’s performance leadership. For our purposes today, we’ll be focusing the bulk of our commentary and analysis on the Phenom 9600 and the Core 2 Quad Q6600, since they are much more direct competitors. The QX9770 results are there for you to ogle, though, if you wish.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Processor: Core 2 Quad Q6600 2.4GHz, Core 2 Extreme QX6800 2.93GHz
System bus: 1066MHz (266MHz quad-pumped)
Motherboard: Gigabyte GA-P35T-DQ6
BIOS revision: F1
North bridge: P35 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN3X2048-1333C9DHX DDR3 SDRAM at 1066MHz
CAS latency (CL): 8
RAS to CAS delay (tRCD): 8
RAS precharge (tRP): 8
Cycle time (tRAS): 20
Audio: Integrated ICH9R/ALC889A with Realtek 6.0.1.5449 drivers

Processor: Core 2 Duo E6750 2.66GHz, Core 2 Extreme QX6850 3.00GHz, Core 2 Extreme QX9650 3.00GHz
System bus: 1333MHz (333MHz quad-pumped)
Motherboard: Gigabyte GA-P35T-DQ6
BIOS revision: F1, F4
North bridge: P35 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN3X2048-1333C9DHX DDR3 SDRAM at 1333MHz
CAS latency (CL): 8
RAS to CAS delay (tRCD): 9
RAS precharge (tRP): 9
Cycle time (tRAS): 24
Audio: Integrated ICH9R/ALC889A with Realtek 6.0.1.5449 drivers

Processor: Core 2 Extreme QX9770 3.2GHz
System bus: 1600MHz (400MHz quad-pumped)
Motherboard: Gigabyte GA-X38-DQ6
BIOS revision: F6b
North bridge: X38 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 8.3.0.1013, Intel Matrix Storage Manager 7.5
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5D DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 18
Audio: Integrated ICH9R/ALC889A with Realtek 6.0.1.5449 drivers

Processor: Athlon 64 X2 5600+ 2.8GHz, Athlon 64 X2 6000+ 3.0GHz, Athlon 64 X2 6400+ 3.2GHz
System bus: 1GHz HyperTransport
Motherboard: Asus M2N32-SLI Deluxe
BIOS revision: 1201
North bridge: nForce 590 SLI SPP
South bridge: nForce 590 SLI MCP
Chipset drivers: ForceWare 15.01
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN2X2048-8500 DDR2 SDRAM at ~800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 18
Audio: Integrated nForce 590 MCP/AD1988B with Soundmax 6.10.2.6100 drivers

Processor: Dual Athlon 64 FX-74 3.0GHz
System bus: 1GHz HyperTransport
Motherboard: Asus L1N64-SLI WS
BIOS revision: 0505
North bridge: nForce 680a SLI
South bridge: nForce 680a SLI
Chipset drivers: ForceWare 15.01
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5D DDR2 SDRAM at ~800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 18
Audio: Integrated nForce 680a SLI/AD1988B with Soundmax 6.10.2.6100 drivers

Processor: Phenom 9600 2.3GHz, Phenom 9900 2.6GHz
System bus: 1GHz HyperTransport
Motherboard: Asus M3A32-MVP Deluxe
BIOS revision: 0307
North bridge: 790FX
South bridge: SB600
Memory size: 4GB (4 DIMMs)
Memory type: Corsair TWIN2X2048-8500C5D DDR2 SDRAM at 800MHz
CAS latency (CL): 4
RAS to CAS delay (tRCD): 4
RAS precharge (tRP): 4
Cycle time (tRAS): 18
Audio: Integrated SB600/AD1988B with Soundmax 6.10.2.6180 drivers

All systems:
Hard drive: WD Caviar SE16 320GB SATA
Graphics: GeForce 8800 GTX 768MB PCIe with ForceWare 163.11 and 163.71 drivers
OS: Windows Vista Ultimate x64 Edition
OS updates: KB940105, KB929777 (nForce systems only), KB938194, KB938979

Please note that testing was conducted in two stages. Non-gaming apps and Supreme Commander were tested with Vista patches KB940105 and KB929777 (nForce systems only) and ForceWare 163.11 drivers. The other games were tested with the additional Vista patches KB938194 and KB938979 and ForceWare 163.71 drivers.

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Our primary test systems were powered by OCZ GameXStream 700W power supply units. The Quad FX system was powered by a PC Power & Cooling Turbo-Cool 1KW-SR power supply. Thanks to OCZ for providing these units for our use in testing.

Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.

The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

We’ll start, as ever, with some quick synthetic tests of the memory subsystem, which will help give us the lay of the land before we dive into our real-world benchmarks.

The Phenom does indeed have quite a bit more L1 and L2 cache bandwidth than its predecessors, as you can see. This test is multithreaded, and it shows higher bandwidth scores when more cores and cache are available. Even so, the Phenom 9600 at 2.3GHz achieves higher throughput than the Athlon 64 FX-74, a two-socket “quad core” solution using dual Athlon 64 FX processors. Intel’s caches are faster still.

Here’s a closer look at how these systems perform when accessing main memory. The Phenom’s revised memory controller gets to show off a bit, with much higher throughput than anything else we tested.

AMD’s processors with integrated memory controllers have traditionally had the lowest memory access latencies around, but that’s no longer the case with Phenom. Intel managed to close the gap somewhat with its memory disambiguation logic in the Core microarchitecture, and AMD has now given up the rest of its lead with Phenom. The fancy-pants graphs below will show us why.

In these graphs, yellow represents L1 cache, light orange is L2 cache, and dark orange is main memory. What you’re seeing here is memory access latencies at various block and step sizes, in a way that exposes latency for the various stages in the memory hierarchy.

Have a look at the red section representing the Phenom 9600’s L3 cache. This cache’s latencies are about 22 nanoseconds, and the additional task of checking the L3 cache adds latency to main memory accesses, as well. The Phenom includes sharing logic that buffers requests from all four cores—which may all be running at different clock speeds at any given time—coming into the L3 cache. This logic itself undoubtedly adds some delay. Also, as we’ve mentioned, the L3 cache doesn’t run at the full speed of the CPU cores—it runs at the north bridge speed. That means L3 cache performance doesn’t scale linearly with core clock speeds. Both the Phenom 9600 and 9900 models have 2GHz north bridges, for example, and both have the exact same 22ns L3 cache latency penalty.

This additional memory latency isn’t the end of the world by any means, but the fact the Phenom trades this many nanoseconds of latency for the addition of a relatively small 2MB cache is, well, unusual, to say the least. AMD will almost certainly have to raise north bridge speeds along with core clocks in order to keep performance scaling well.
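One way to see why the fixed north bridge clock matters is to convert that 22ns figure into clock cycles. The sketch below is plain unit conversion on the numbers quoted above; the helper name is ours. The latency is constant in north bridge cycles, so it costs more core cycles as the core clock rises.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to cycles at a given clock (GHz = cycles/ns)."""
    return latency_ns * clock_ghz


L3_LATENCY_NS = 22  # measured L3 latency on both Phenoms
NB_CLOCK_GHZ = 2.0  # north bridge clock on both Phenoms

nb_cycles = ns_to_cycles(L3_LATENCY_NS, NB_CLOCK_GHZ)
core_cycles = {
    "Phenom 9600 (2.3GHz)": ns_to_cycles(L3_LATENCY_NS, 2.3),
    "Phenom 9900 (2.6GHz)": ns_to_cycles(L3_LATENCY_NS, 2.6),
}

print(f"North bridge cycles: {nb_cycles:.0f}")
for chip, cycles in core_cycles.items():
    print(f"{chip}: about {cycles:.0f} core cycles waiting on L3")
```

The faster chip spends more of its own cycles waiting on the same L3 access, which is exactly the scaling problem a higher north bridge clock would address.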

Team Fortress 2

We’ll kick off our gaming tests with some Team Fortress 2, Valve’s class-driven multiplayer shooter based on the Source game engine. In order to produce easily repeatable results, we’ve tested TF2 by recording a demo during gameplay and playing it back using the game’s timedemo function. In this demo, I’m playing as the Heavy Weapons Guy, with a medic in tow, dealing some serious pain to the blue team.

We tested at 1024×768 resolution with the game’s detail levels set to their highest settings. HDR lighting and motion blur were enabled. Antialiasing was disabled, and texture filtering was set to trilinear filtering only. We used this relatively low display resolution with low levels of filtering and AA in order to prevent the graphics card from becoming a primary performance bottleneck, so we could show you the performance differences between the CPUs.

Notice the little green plot with four lines above the benchmark results. That’s a snapshot of the CPU utilization indicator in Windows Task Manager, which helps illustrate how much the application takes advantage of up to four CPU cores, when they’re available. I’ve included these Task Manager graphics whenever possible throughout our results. In this case, Team Fortress 2 looks like it probably only takes full advantage of a single CPU core, although Nvidia’s graphics drivers use multithreading to offload some vertex processing chores.

TF2 doesn’t gain anything from the addition of more than two cores, it seems, and so the Phenoms don’t provide much of a performance boost over the Athlon 64 X2. The Phenom does gain some per-clock performance, though not as much as initially expected from the K10 design.

Living in the now, the Phenom 9600 is slower than the Core 2 Quad Q6600 here, although it’s easily up to the task of running this game fluidly.

Lost Planet: Extreme Condition

Lost Planet puts the latest hardware to good use via DirectX 10 and multiple threads, as many as eight in the case of our dual quad-core Xeon test rig. Lost Planet’s developers have built a benchmarking tool into the game, and it tests two different levels: a snow-covered outdoor area with small numbers of large villains to fight, and another level set inside of a cave with large numbers of small, flying creatures filling the air. We’ll look at performance in each.

We tested this game at 1152×864 resolution, largely with its default quality settings. The exceptions: texture filtering was set to trilinear, edge antialiasing was disabled, and “Concurrent operations” was set to match the number of CPU cores available.

We’re pretty much looking at a GPU bottleneck or something similar in the “Snow” level, where all of the processors are bunched together at around 95 FPS. The Cave level is more intriguing, since it puts four cores to good use. Here, the Phenom 9600 just edges out the Core 2 Quad Q6600, and the Phenom 9900 puts in a respectable showing on a clock-for-clock basis versus the quad-core Intel CPUs.

BioShock

We tested BioShock by manually playing through a specific point in the game five times while recording frame rates using the FRAPS utility. The sequence? Me trying to fight a Big Daddy, or more properly, me trying not to die for 60 seconds at a pop.

This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
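Here's a minimal sketch of that reporting scheme: the average frame rate is averaged across sessions, while the low is the median of the per-session lows. The session numbers below are invented for illustration and are not results from our testing.

```python
from statistics import mean, median

# Invented sample data for five manual play-throughs (not real results).
sessions = [
    {"avg_fps": 60, "low_fps": 38},
    {"avg_fps": 62, "low_fps": 41},
    {"avg_fps": 58, "low_fps": 22},  # one ugly hitch drags this session's low down
    {"avg_fps": 61, "low_fps": 40},
    {"avg_fps": 59, "low_fps": 39},
]

reported_avg = mean(s["avg_fps"] for s in sessions)
reported_low = median(s["low_fps"] for s in sessions)
mean_of_lows = mean(s["low_fps"] for s in sessions)  # for comparison only

print(f"Average FPS: {reported_avg}")
print(f"Median low FPS: {reported_low} (a plain mean would give {mean_of_lows})")
```

With these made-up numbers, the single bad session pulls a plain mean of the lows down by several frames per second, while the median barely notices it.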

For this test, we largely used BioShock’s default image quality settings for DirectX 10 graphics cards, but again, we tested at a relatively low resolution of 1024×768 in order to prevent the GPU from becoming the main limiter of performance.

Here’s a nice surprise. The Phenoms both run BioShock very well, with the 9900 taking the top spot overall. We’re not talking about especially meaningful differences in performance between the top performers, especially with the manual testing element involved here, but Phenom clearly puts AMD back in the hunt where the Athlon 64 had fallen behind.

Supreme Commander

We tested performance using Supreme Commander’s built-in benchmark, which plays back a test game and reports detailed performance results afterward. We launched the benchmark by running the game with the “/map perftest” option. We tested at 1024×768 resolution with the game’s fidelity presets set to “High.”

Supreme Commander’s built-in benchmark breaks down its results into several major categories: running the game’s simulation, rendering the game’s graphics, and a composite score that simply combines the other two. The performance test also reports good ol’ frame rates, so we’ve included those, as well.

The differences are relatively small, but the Phenom 9600 does trail the Core 2 Quad Q6600 in each test.

Valve Source engine particle simulation

Next up are a couple of tests we picked up during a visit to Valve Software, the developers of the Half-Life games. They’ve been working to incorporate support for multi-core processors into their Source game engine, and they’ve cooked up a couple of benchmarks to demonstrate the benefits of multithreading.

The first of those tests runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

There’s good and bad news in these results. The good news is that the Phenom 9900 at 2.6GHz outruns the two Athlon 64 FX-74 processors at 3.0GHz, a nice gain in performance per clock. The bad news is that the Phenom 9600 is well behind its would-be competition, the Core 2 Quad Q6600.

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. Instead, it shows how multiple CPU cores can speed up game development.

The story is much the same here as it was in the last test. The Phenom brings some nice gains for AMD, but they’re not quite enough at current clock speeds to catch the Core 2 Quad Q6600.

WorldBench

WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests. Because WorldBench’s tests are entirely scripted, we weren’t able to capture Task Manager plots for them, as you’ll notice.

WorldBench, like the applications that make it up, is no great respecter of more than two CPU cores. The fact that the Core 2 Duo E6750 ties the Core 2 Quad Q6600 is evidence of that. Still, the Phenom processors turn in disappointing overall scores, with the 9600 trailing the Athlon 64 X2 6000+. This is another proof point for a dawning realization: the Phenom needs to run at higher clock frequencies in order to perform comparatively well in everyday desktop applications.

Productivity and general use software

MS Office productivity

WorldBench’s Office test has a multitasking element, since multiple Office apps are running at once. Even so, the Phenom 9600 and Core 2 Quad Q6600 finish near the bottom of the pack, above only the Core 2 Duo E6750.

Firefox web browsing

Multitasking – Firefox and Windows Media Encoder

Here’s another multitasking test, one in which having four cores helps quite a bit more. The Phenom 9600 again trails the Core 2 Quad Q6600.

WinZip file compression

Ouch. The Phenom helps a little bit here, but AMD’s still getting trounced.

Nero CD authoring

The Nero test depends largely on the disk controller’s performance, which explains the basic grouping of results among test platforms. The 790FX chipset handicaps the Phenom here, in part because its SATA controller doesn’t seem to work properly in Windows Vista while in AHCI mode. We had to test with AHCI disabled, which means SATA Native Command Queuing isn’t enabled. NCQ might have helped here. Unfortunately, AMD hasn’t released Vista drivers for the SB600 south bridge’s SATA controller with a fix and didn’t have any suggestions for us when we contacted them about the problem.

Image processing

Photoshop

WorldBench’s Photoshop test goes poorly for the AMD processors. It’s possible the difference here is made by the Core 2 processors’ larger L2 caches (up to 12MB, in the case of the 45nm quad-core processors). Even with the addition of a 2MB L3 cache, the Phenom’s effective cache size is smaller than the Core 2 Duo E6750’s 4MB L2.

The Panorama Factory photo stitching

The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Phenom 9900 matches the Core 2 Quad Q6600 here, but the 9600 is a little slower. Once again, versus the FX-74, the Phenom does achieve a tangible per-clock performance gain.

picCOLOR image analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
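For the curious, an indexed score of this sort can be thought of as a simple ratio against the reference machine. The sketch below assumes the index is the reference run time divided by the measured run time. That's our assumption about how such an index is typically built, not a description of picCOLOR's internals, and the timings are invented.

```python
def indexed_score(reference_seconds, measured_seconds):
    """Performance relative to a reference machine (higher is faster).

    Assumes the index is a plain ratio of run times; picCOLOR's actual
    scoring method may differ.
    """
    return reference_seconds / measured_seconds


# Invented timings: the reference box takes 41.4 seconds, the test system 10.
score = indexed_score(reference_seconds=41.4, measured_seconds=10.0)
print(f"Indexed score: {score:.2f}")
```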

Here’s a case where the K10 looks for all the world like a quad-core K8.

Video encoding and editing

VirtualDub and DivX encoding with SSE4

Here’s a brand-new addition to our test suite that should allow us to get a first look at the benefits of SSE4’s instructions for video acceleration. In this test, we used VirtualDub as a front-end for the DivX codec, asking it to compress a 66MB MPEG2 source file into the higher compression DivX format. We used version 6.7 of the DivX codec, which has an experimental full-search function for motion estimation that uses SSE4 when available and falls back to SSE2 when needed. We tested with most of the DivX codec’s defaults, including its Home Theater base profile, but we enabled enhanced multithreading and, of course, the experimental full search option.

This test is obviously a showcase for the tailored instructions of SSE4, which the Phenom lacks. Still, the Phenom 9600 manages to come out ahead of the Q6600, which also lacks SSE4 support.

Windows Media Encoder x64 Edition video encoding

Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

In our second video encoding test, the Q6600 turns the tables, finishing before the Phenom 9600.

Windows Media Encoder video encoding

Roxio VideoWave Movie Creator

The Phenoms fare relatively poorly in the two WorldBench video tests, which don’t appear to use more than two cores to any great advantage. Left to use only one to two cores at relatively low clock speeds, the Phenom suffers.

LAME MT audio encoding

LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
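The linear pipelining described above can be sketched roughly as follows. This is a minimal illustration in Python, not LAME MT’s actual code: `analyze_frame`, `encode_frame`, and the `lookahead` parameter are hypothetical stand-ins, and the one-frame buffer depth is taken from the description rather than from the program’s source.

```python
import queue
import threading

def analyze_frame(frame):
    # Hypothetical stand-in for the psycho-acoustic analysis step.
    return sum(abs(sample) for sample in frame)

def encode_frame(frame, analysis):
    # Hypothetical stand-in for the rest of the encoder.
    return (len(frame), analysis)

def pipelined_encode(frames, lookahead=1):
    # Bounded buffer: at most `lookahead` finished analyses can wait here,
    # so the analysis thread stays only slightly ahead of the encoder.
    buffered = queue.Queue(maxsize=lookahead)

    def analysis_worker():
        for frame in frames:
            buffered.put((frame, analyze_frame(frame)))
        buffered.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_worker)
    worker.start()

    encoded = []
    while True:
        item = buffered.get()
        if item is None:
            break
        frame, analysis = item
        encoded.append(encode_frame(frame, analysis))
    worker.join()
    return encoded
```

Because the work is split into exactly two stages, there are never more than two threads with anything to do, no matter how many cores are present. In CPython the sketch shows the structure of the pipeline rather than a real speedup, since the interpreter lock serializes pure-Python work.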

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.

Here’s another application where only two threads are used, and the Phenoms again struggle, even against the Athlon 64 X2.

Cinebench rendering

Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs first with just a single thread and then with as many threads as there are CPU cores available.

Have a look at the scores for the Athlon 64 FX-74 and the Phenom 9900. Although the FX-74 has a 400MHz clock frequency advantage, the Phenom 9900’s single-threaded score is almost as high. That’s a nice per-clock performance advancement. Then check out the multithreaded results, where the Phenom 9900 easily scales better than the FX-74 and ends up producing a higher score overall. Those improvements aren’t sufficient to allow the 9600 to catch the Q6600, though.

POV-Ray rendering

We caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The latest beta 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

The pendulum swings the other way in POV-Ray’s chess2 scene, where the Phenom 9600 finishes 21 seconds ahead of the Q6600 and the Phenom 9900 bests Intel’s Core 2 Extreme QX6850. Intel’s new 45nm Penryn-based processors are faster still, however.

The benchmark scene is largely a single-threaded affair, which helps explain the Phenom’s relatively slow performance.

3ds max modeling and rendering

The DirectX test is more of a modeling session than a rendering test, so we have a little of each here. Even the Phenom 9900 can’t catch the Core 2 Quad Q6600 in either test.

Folding@Home

Next, we have a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
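The final estimate boils down to a short piece of arithmetic. Here is a minimal sketch in Python, assuming per-core points-per-day figures are already in hand for each WU type; the function name, the WU labels, and the sample numbers are all made up for illustration and are not results from the benchmark CD.

```python
def estimate_points_per_day(ppd_by_wu_type, core_count):
    # Average the per-core points per day across the tested WU types,
    # then scale by the number of cores, as described in the text.
    average_ppd = sum(ppd_by_wu_type.values()) / len(ppd_by_wu_type)
    return average_ppd * core_count

# Made-up per-core figures, for illustration only.
sample = {
    "Tinker": 100.0,
    "Amber": 120.0,
    "Gromacs A": 200.0,
    "Gromacs B": 220.0,
}

print(estimate_points_per_day(sample, core_count=4))
```

Note that the method assumes every core sustains the same average rate, which is part of what makes it a somewhat rough estimate.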

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

The Phenom is clearly faster clock for clock than older Athlon 64 X2s with the two Gromacs WU types. Overall, of course, the Phenom easily beats the X2 processors by virtue of having two more cores. The Phenom 9600 is very, very close to the Q6600 in overall points per day.

SiSoft Sandra Mandelbrot

Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
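To make the column interlacing from SiSoft’s description concrete, here is a rough Python sketch. It is not Sandra’s code: it assumes a simple static scheme in which worker k takes columns k, k+n, k+2n, and so on, the view window on the complex plane is an arbitrary choice, and all function names are hypothetical. It also uses plain scalar math rather than the SSE2 path that processes several iterations in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT, MAX_ITER = 640, 480, 255

def escape_count(cx, cy, max_iter=MAX_ITER):
    # Iterate z = z*z + c until |z| > 2 or the iteration limit is reached.
    zx = zy = 0.0
    for i in range(max_iter):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return i
    return max_iter

def columns_for(worker, worker_count, width=WIDTH):
    # Static interlacing (an assumption): worker k takes columns k, k+n, k+2n, ...
    return range(worker, width, worker_count)

def render_columns(worker, worker_count, width=WIDTH, height=HEIGHT):
    result = {}
    for x in columns_for(worker, worker_count, width):
        cx = -2.5 + 3.5 * x / width  # assumed view window on the real axis
        result[x] = [
            escape_count(cx, -1.0 + 2.0 * y / height) for y in range(height)
        ]
    return result

def render(worker_count, width=WIDTH, height=HEIGHT):
    # One task per worker; each returns only the columns it owns.
    image = {}
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        parts = pool.map(
            lambda k: render_columns(k, worker_count, width, height),
            range(worker_count),
        )
        for part in parts:
            image.update(part)
    return [image[x] for x in range(width)]
```

Interleaving the columns spreads the expensive regions of the fractal fairly evenly across workers, which is presumably why this kind of test scales so readily with core count. As written, the sketch shows how the work is divided; CPython threads will not deliver a real speedup here.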

I had hoped this test would be a showcase for the Phenom’s single-cycle SSE instruction execution, and I suppose it is. The 2.6GHz Phenom 9900’s throughput is more than three times that of the Athlon 64 X2 5600+ at 2.8GHz, which means it gains quite a bit more than what its additional cores can provide. Still, the Core 2 processors are considerably faster.

Power consumption and efficiency

Now that we’ve had a look at performance in various applications, let’s bring power efficiency into the picture. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile—with a prominent exception. Our 790FX-based motherboard simply would not work with Cool’n’Quiet enabled. The system would hang shortly after the feature was enabled. As a result, we had to test the Phenoms without Cool’n’Quiet enabled. That means the Phenoms will draw more power at idle and during periods of partial load (as the rendering process starts and finishes) than they otherwise would. Power draw at peak utilization shouldn’t be affected. We will try to test again with Cool’n’Quiet enabled once we get a working motherboard.

Anyhow, here are the results:

Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The Phenom systems draw more power at idle than anything but the basket-case Quad FX system. I expect we’d see better results if Cool’n’Quiet were working properly.

Next, we can look at peak power draw by taking an average from the ten-second span from 30 to 40 seconds into our test period, during which the processors were rendering.

Under load, the Phenom systems draw about as much power as those based on Intel’s 65nm quad-core processors in the 3GHz range. Unfortunately, that means the Phenom 9600 compares unfavorably to the Core 2 Quad Q6600 on power draw, and Intel isn’t making things any easier with its 45nm chips. The system based on the Core 2 Extreme QX9770 at 3.2GHz draws 40W less power under load than the Phenom 9600.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
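For readers who want to see how these figures fall out of a power log, here is a small Python sketch of the arithmetic. It assumes the meter logs one wattage reading per second (the actual logging interval isn’t stated here), and the function names and sample log are invented for illustration.

```python
def average_power(samples, start_s, end_s, interval_s=1.0):
    # Mean of the readings that fall in the window [start_s, end_s).
    first = int(start_s / interval_s)
    last = int(end_s / interval_s)
    window = samples[first:last]
    return sum(window) / len(window)

def energy_joules(samples, start_s, end_s, interval_s=1.0):
    # Energy is power integrated over time. With evenly spaced readings,
    # that is the sum of the wattages times the seconds per reading
    # (watt-seconds, i.e. joules).
    first = int(start_s / interval_s)
    last = int(end_s / interval_s)
    return sum(samples[first:last]) * interval_s

# Invented 90-second log: 10 s idle, a 50 s render, then 30 s idle.
log = [100] * 10 + [200] * 50 + [100] * 30

peak_watts = average_power(log, 30, 40)    # the ten-second span under load
idle_watts = average_power(log, 80, 90)    # trailing edge of the test period
total_energy = energy_joules(log, 0, 90)   # whole time span
render_energy = energy_joules(log, 10, 60) # render period only
```

The last figure is the one that rewards speed: a system that finishes the render sooner has a shorter window to sum over, so it can use less energy for the task even if its peak draw is higher.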

This final measurement shouldn’t be greatly affected by the absence of Cool’n’Quiet, since it comes from the time when the CPUs are largely fully occupied. Looked at this way, the Phenom represents true progress for AMD, since systems based on dual-core chips are relatively inefficient at rendering compared to quad-core ones. But AMD has more work to do in order to catch Intel’s 65nm chips, let alone its incredibly efficient 45nm ones.

Conclusions

The Phenom quite obviously isn’t a bad CPU design, given the way it performs on a per-clock basis and how its performance scales from one to four threads. In many cases, its enhanced execution cores crank out some solid gains in instructions per clock over AMD’s Athlon 64 X2. Also, the addition of two more cores can bring substantial performance increases in applications able to take advantage of them, as many of our tests have shown. But at the end of the day, CPU performance comes down to a couple of variables, performance per clock and clock speed, and the Phenom doesn’t have enough of either to allow it to catch up with Intel’s fastest 65nm Core 2 processors, let alone the even-more-potent 45nm ones. I sound like a broken record, but AMD is going to have to achieve something close to clock speed parity with Intel in order to compete for the overall performance lead. That’s how closely these two architectures appear to be matched at this point.

The Phenom 9900 does show some promise at 2.6GHz, but it may not be available until (who knows?) perhaps February or March. Until then, the Phenom 9600 will do battle at 2.3GHz against Intel’s Core 2 Quad Q6600 and its presumptive 45nm successor. AMD has priced the 9600 to compete with the Q6600, and that makes it a potentially attractive product. However, the Phenom’s well-publicized clock frequency issues keep it from being a slam-dunk. The Phenom 9600 was generally a little slower overall in our tests than the Q6600. If it ran at 2.4GHz, then it might be comparatively stronger. AMD’s plan to release an unlocked version of the Phenom 9600 may help tip the scales in the 9600’s favor for some folks, but I suspect they won’t find much overclocking headroom in those chips. In fact, our 2.6GHz engineering sample wasn’t 100% stable, which is why you won’t find any overclocking results in this review.

Unlike in the server space, where Intel’s use of FB-DIMMs gives AMD a built-in advantage, the Phenom trails Intel’s 65nm processors in power efficiency and really doesn’t come close to Intel’s 45nm models. The picture would look better here, at least at idle and during periods of intermediate use, had our test system been stable with Cool’n’Quiet clock throttling enabled. This is, after all, one of the big advantages of the Phenom’s native quad-core design. But it has to work properly in order to be an advantage, and at the time of the product’s public release, we don’t yet have an example that does. As you may have gathered from the preceding pages full of test results, we’re not ones to take things on faith from hardware makers—for good reason.

One bright spot here is the upgrade proposition the Phenom offers to current owners of Socket AM2-based systems. Those folks now have an affordable path to a quad-core solution that’s nearly as fast as a Core 2 Quad Q6600, which is a fine thing and a no-brainer upgrade choice. That said, Socket AM2 owners will want to watch the fine print carefully. Obviously, you won’t get the power savings of split power planes or the increased data rate of HyperTransport 3.0 if you drop a Phenom into an older motherboard. Also, I haven’t yet had time to confirm this myself in a test rig, but I believe the Phenom’s north bridge clock will run a little slower on Socket AM2 boards, leading to somewhat reduced performance.

The immediate path ahead for AMD is blindingly obvious. They’re going to have to supply enough Phenom 9500 and 9600 chips to meet demand, which could be a challenge, and they’re going to have to work on reaching higher clock frequencies as soon as they can. They also have some work to do, along with their partners, in bringing the Phenom’s Socket AM2+ infrastructure up to snuff. The chipset operation AMD purchased in the ATI acquisition was in many ways still a fledgling effort, and our initial experiences with 790FX motherboards haven’t inspired much confidence. AMD has convinced Asus, Gigabyte, and MSI to produce 790FX-based boards, but it doesn’t appear to have convinced them to dedicate top-shelf engineering resources to these efforts. We will, of course, be working to get our hands on newer versions of Phenom and 790FX hardware as these products become available to consumers. We’re hopeful that our experiences with the final products will be better than what we’ve seen to date.
