AMD’s Quad FX platform

IN THE PC REALM, when you can’t win by traditional means, there may be another reliable avenue available to you: move upmarket. This form of one-upmanship has been masking technological shortcomings in increasing measure in recent years. Intel arguably started this trend in the CPU market when, on the eve of AMD’s introduction of the Athlon 64, it uncorked the first Pentium Extreme Edition processor, basically a Xeon with scads of L3 cache pulled from the server market into service as a new flagship desktop part. At the prohibitive price of just one dollar short of a grand, the Extreme Edition wasn’t intended to sell at high volumes. Its job was simply to defend the performance crown to the best of its prodigious ability. That’s the beauty of the ultra-high-end product: a top product can rock the benchmarks yet ship only a few hundred or a few thousand units.

With that background, perhaps you will understand why we were skeptical when AMD unveiled its plans for a new platform, code-named “4×4”, just as Intel prepared processors based on its excellent new Core microarchitecture for release. The initial concept was about as extreme as they come, with the “4×4” signifying the combination of four CPU cores (in two sockets) and four GPUs in the same system. From the sound of it, these boxes would only come from boutique PC vendors like Alienware and Voodoo, and they would cost more than a reasonably well-equipped Honda Civic. We were underwhelmed by some of these constraints, especially the initial exclusivity to PC makers, and said so at the time.

Fortunately, AMD was listening. The 4×4 concept has undergone some moderation since it was first announced, and those constraints have been eased somewhat. What’s left is a new enthusiast-oriented PC platform that officially sanctions what some of us have been doing since the days of the Celeron 300A: running multiple processors in an enthusiast-class system. (By “processor,” of course, I mean one of those things that you stick into a socket on a motherboard, not just another CPU core on a chip.) The first incarnations of “4×4”, now known as the Quad FX platform, will deliver quad CPU cores into desktop systems starting today. You may be asking yourself a number of questions upon reading this news. Questions like: Yeah, but can it keep pace with Intel’s mighty Core 2 Extreme QX6700 quad-core processor? Why would I want one? What can you really do with four cores? Will Britney and K-Fed patch things up, or is it really over? Fear not, my friend, for we have the answers to three of those four questions. Read on to find them.

Anatomy of a Quad FX
If the Quad FX scheme is born of necessity, the cause of that necessity is undoubtedly the Core 2 Extreme QX6700 processor, which successfully shoehorns two Core 2 Duo chips into a single package for a “quad core” result, and a potent one at that. Presumably, AMD isn’t countering with two Athlon 64 X2 chips in a single package for a number of reasons, not least of which is that the company is still making chips on a 90nm fabrication process, and the die size of those chips probably wouldn’t allow it. Instead, the Quad FX platform essentially brings a workstation-class dual-socket Opteron solution onto the desktop.

 

The Athlon 64 FX-74, pictured above, is a case in point. It comes in an LGA-style package, just like newer Opterons, and drops into a 1207-pin socket, just like newer Opterons. Unlike Opterons, though, these new FX processors don’t require pricey registered ECC memory, and they won’t reside in fuddy-duddy motherboards that spoil all the fun. Instead, they use regular ol’ unbuffered DDR2 DIMMs, and AMD is encouraging the development of Quad FX motherboards with tweakable BIOSes and (for shame!) robust overclocking options.


A block diagram of the Quad FX platform. Source: AMD.

Here’s a look at the logical layout of a typical Quad FX system. Hanging off of each CPU socket is a pair of DDR2 memory channels, with officially supported DIMM speeds up to 800MHz. That means you’re looking at up to 25.6 GB/s of memory bandwidth—far above the bandwidth available to the Core 2 Extreme QX6700, which is limited by its front-side bus. However, that AMD memory subsystem is by nature NUMA—an acronym signifying non-uniform memory access. This Opteron/K8 NUMA memory architecture is a mixed blessing. Memory bandwidth scales up linearly as more CPUs are added to the system, but memory access times rise when CPU 0 must grab data from memory controlled by CPU 1. In order to attain NUMA’s benefits without stumbling on its drawbacks, software—especially the operating system—must be NUMA-aware.
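For those who like to check the math, here is a rough sketch of where that 25.6 GB/s figure comes from. It assumes each DDR2 channel and Intel's front-side bus are 64 bits (8 bytes) wide, and it uses the 1066MHz bus speed of our Intel test system; the function name is just an illustration. These are theoretical peaks, not measured throughput.

```python
# Peak theoretical bandwidth = channels * bus width (bytes) * transfers per second.
BYTES_PER_TRANSFER = 8  # assumption: a 64-bit DDR2 channel or front-side bus


def peak_bandwidth_gbs(channels: int, megatransfers_per_s: float) -> float:
    """Return peak bandwidth in GB/s (decimal gigabytes)."""
    return channels * BYTES_PER_TRANSFER * megatransfers_per_s * 1e6 / 1e9


# Quad FX: two sockets, each with two DDR2-800 channels.
quad_fx = peak_bandwidth_gbs(channels=2 * 2, megatransfers_per_s=800)

# QX6700: all memory traffic crosses one 1066 MT/s front-side bus.
qx6700_fsb = peak_bandwidth_gbs(channels=1, megatransfers_per_s=1066)

print(f"Quad FX peak:    {quad_fx:.1f} GB/s")
print(f"QX6700 FSB peak: {qx6700_fsb:.1f} GB/s")
```

On paper, then, the dual-socket setup has roughly three times the headroom of a single front-side bus, provided software spreads its memory traffic across both sockets.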

If all of this sounds like a tremendous amount of complexity for a desktop system, well, you’re right. It’s also a tremendous amount of power for a desktop box.

The Quad FX scheme is aided and abetted by Nvidia’s nForce 680a SLI core-logic chipset. Following through with the theme of doubling up for success and excess, the 680a SLI is essentially two copies of the nForce 570 SLI chip, mounted side by side on a motherboard. The presence of both chips makes possible a total of four PCIe x16 slots (two with 16 PCIe lanes and two with eight), four Gigabit Ethernet ports, and a whopping 12 SATA ports, among other things. The two core logic chips are attached to one CPU socket via dual HyperTransport links so that the system can operate with a single processor and still provide access to all I/O capabilities.

I expect AMD, through its newly acquired ATI subsidiary, to bring its own Quad FX chipset to the market at some point in the future, but for now, Nvidia is the sole supplier of Quad FX core logic. Personally, I’d also like to see a Quad FX solution with “only” two PCIe x16 slots, six SATA ports, lower power consumption, and a more modest price, but that’s not in the cards just yet.

The key to making Quad FX anything more than a marketing stunt aimed at recent lottery winners, of course, is keeping systems price-competitive with those based on Intel’s quad-core parts. Since folks will have to purchase two CPUs in order to build a proper Quad FX box, that’s no small concern. Happily, AMD has done its part on that front, keeping its promise to deliver pairs of FX CPUs for “well under a thousand dollars.” The processors will be sold in pairs in the following configurations:

| Model | Clock speed | L2 cache (per core) | TDP (per CPU) | Price (per pair) |
| --- | --- | --- | --- | --- |
| Athlon 64 FX-70 | 2.6GHz | 1MB | 125 W | $599 |
| Athlon 64 FX-72 | 2.8GHz | 1MB | 125 W | $799 |
| Athlon 64 FX-74 | 3.0GHz | 1MB | 125 W | $999 |

With CPU pairs priced as low as $599, Quad FX may not be cheap, but the processors are arguably affordable and maybe even a decent value, depending on how you define value.

Check the clock speed on the FX-74 once more, just to make sure you get it: a healthy 3GHz. Intel chose to back down to 2.66GHz for its top quad-core part, the QX6700, in order to meet the power and thermal requirements of a single CPU socket. With two sockets, two coolers, and more pins per socket, AMD had no such constraint, so they’ve actually raised clock speeds a notch beyond what’s currently available in a single-socket Athlon 64 processor.

Now, we know Core 2 Duo processors typically perform better clock for clock than Athlon 64 X2s, but in this quad-core solution, AMD has vastly more memory bandwidth, a very nice system architecture, and a pronounced clock speed advantage. This could get interesting, no?

Of course, if it’s low power consumption you want, Quad FX may not be your cup of tea. With a peak thermal dissipation requirement of 125W per processor, Quad FX exhibits another characteristic of a “4×4”: low gas mileage.

For those of you who are wondering how these FX processor prices will affect current Opteron prices, which are quite a bit higher, the answer seems to be: not much. AMD says FX pricing and Opteron pricing are two separate issues. FX chips won’t support registered ECC memory, and AMD says FX processors aren’t supposed to work on Opteron motherboards. Some folks may choose Quad FX workstations rather than Opteron ones, but AMD seems willing to accept that.

If Quad FX doesn’t sound quite sweet enough to tempt you yet, AMD has one more prospect to add to the mix. Today’s Quad FX systems will come out of the chute ready to accept AMD’s native quad-core processors when they arrive some time next year, raising the possibility that a Quad FX box could be upgraded to eight of AMD’s new-microarchitecture cores in the future. Holy moly. That one’s gotta set some fanboys’ hearts aflutter.

So when can you get some Quad FX action, you ask? AMD says Quad FX solutions should begin selling today, both from system builders and in the form of kits, with two CPUs and a motherboard included, from select online vendors like Newegg. (Yes, that means those of us who like to build our own systems should be able to pick up kits right away, thank goodness.) Initial quantities will be limited to these outlets, but AMD expects the CPU pairs to make it into full distribution in the first quarter of next year. The company also claims it’s committed to the idea of a dual-socket enthusiast platform for the long haul.

 

The mobo
Those Quad FX kits I was talking about are bound to come with an Asus L1N64-SLI WS motherboard because AMD chose Asus as its exclusive launch partner for this platform, making this board the lone Quad FX mobo option for now. That’s not necessarily a bad thing. The L1N64-SLI WS is definitely a worthy board, with the full suite of features, overclocking options, and BIOS tweaks you’d expect from any high-end, enthusiast-class board from Asus.

I have those three chipset coolers mounted on the board because I was using the board on an open test bench with no extra forced airflow. In a properly cooled case, they may not be necessary. Then again, I wouldn’t bet on it.

The thing has two CPU sockets, four DIMM slots, four PCIe x16 graphics slots, one PCI slot, one PCIe x1 slot, dual Gigabit Ethernet ports, and a disturbing and wrong 12 SATA ports.

Yes, that’s 12 SATA ports, all clumped together on the corner of the board.

Here’s a look at one of the CPU sockets, which has 1207 pins in it, arranged much like an Intel Core 2 Duo’s socket.

I am a big fan of Asus’ recent high-end mobos. I think they get nearly everything important right, and the L1N64-SLI WS follows that successful formula quite closely. I could nitpick, but for the most part, I’d have few qualms about making this mobo the heart of a Quad FX system for myself—save for two things.

First, like most dual-socket mobos, the L1N64-SLI WS doesn’t quite fit into a standard ATX form factor. Asus has heroically crammed an awful lot into a small space, but it’s not quite enough to meet the standard. The max dimensions for a full-sized ATX board are 12″ by 9.6″. The L1N64-SLI WS is 12″ by 10.5″, nearly an inch deeper. On top of that, you have an IDE port facing off of the inside edge of the board. You will want to measure the space in your chosen enclosure carefully before trying to install this board in it. I expect the L1N64-SLI WS to fit into some of the better enclosures out there, but definitely not all of them.

Second, there’s the price. Asus says the L1N64-SLI WS will list for $349.99, and I wouldn’t be shocked to see it selling at a premium initially. AMD has gone a long way toward making the Quad FX platform somewhat affordable with its $599 pricing of FX-70 pairs, but the price tag on this puppy raises the cost of entry significantly—especially compared to some of the boards that support the Core 2 Quad and Core 2 Extreme QX6700. AMD couldn’t give us any timetable for the arrival of additional Quad FX motherboards, so the L1N64-SLI WS will probably be the only option for some time yet.

 

With a winch in front and a spare tire hanging off the back
Just in case we didn’t entirely feel the vibe of the Quad FX concept, AMD decided to send out an entire system for review, and it’s a “4×4” through and through—the Hummer H2 of enthusiast boxen, a veritable hymn to conspicuous consumption in PC form, complete with knobby tires and ample ground clearance. Don’t take it from me, though. Have a look at this beast.

This box’s vitals include two FX-74 processors, an Asus L1N64-SLI WS mobo, 4GB of memory in the form of Corsair Dominator DIMMs, a pair of WD 150GB Raptors in RAID 0, a 500GB drive for additional storage, a 1kW PSU, and a couple of GeForce 7900 GTX cards in SLI. The chassis is a Thermaltake enclosure with a new door panel that has dual ports above the CPU coolers and internal tunnels that extend down to meet the top of those coolers. (AMD says production versions of this enclosure should be available soon.)

Of course, our first task, after photography, was to disassemble this system and set up the motherboard and processors in our standard configuration for testing. But I did let the system run long enough to note that it doesn’t actually sound “like an Oreck XL on Metabolife,” as I had feared. This isn’t the quietest box by any means, but its cooling design makes it sound fairly reasonable, believe it or not.

Incidentally, when I first tried to set up the core of the Quad FX system on the test bench using our standard OCZ GameXStream 700W power supply, the system wouldn’t POST properly. After trying a number of things without success, including cutting back to a Radeon X300 video card, I was able to get the system working by swapping in a BFG Tech 1000W PSU. Later, I tried subbing in an OCZ PowerStream 520W PSU, and the system would POST fine with it. I’m not sure whether its reluctance to POST with the GameXStream was just an odd incompatibility or a sign of something larger, but you will definitely need a good power supply unit to feed a Quad FX system, regardless. We’ll talk more about power use shortly.

Putting four cores to proper use
The process of putting together our review of Intel’s first quad-core processor made clear to us the difficulty of taking full advantage of four CPU cores. Many of the apps in our usual CPU test suite are multithreaded, but only a handful of them use more than two cores effectively. Even in applications like video encoding, where the problem would seem to be eminently parallelizable, many programs don’t spin off more than two threads because, historically, four-way systems have been extremely rare in nearly every province of computing except for high-end servers.

Of course, that means that on one level, stepping a quad-core system through a series of desktop-class apps and showing little or no performance gain compared to dual-core systems, as we did in our QX6700 review, is an entirely valid exercise. It is not, however, especially satisfying, because it doesn’t address the larger questions of a quad-core system’s potential, either in terms of performance with widely multithreaded apps or of scaling from two cores to four. We decided to attempt to address these questions with this article, so we have sought out applications that can use more than two threads and focused on them. As a result, the following set of tests is a little bit unusual; the applications are less common and a little more academic in nature. Indulge us, if you will, as we attempt to learn what performance gains quad-core systems can bring. Keep in mind, though, that going from two cores to four won’t necessarily bring these sorts of performance improvements across the board. A look at our QX6700 review should dispel that notion.

 

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

| Processor | Core 2 Extreme X6800 2.93GHz; Core 2 Extreme QX6700 2.66GHz | 2 x Athlon 64 FX-74 3.0GHz | Athlon 64 FX-62 2.8GHz |
| --- | --- | --- | --- |
| System bus | 1066MHz (266MHz quad-pumped) | 1GHz HyperTransport | 1GHz HyperTransport |
| Motherboard | Asus P5W64 WS Pro | Asus L1N64-SLI WS | Asus M2N32-SLI Deluxe |
| BIOS revision | 0304 | 0117 | 0706 |
| North bridge | 975X MCH | nForce 680a SLI | nForce 590 SLI SPP |
| South bridge | ICH7R | nForce 680a SLI | nForce 590 SLI MCP |
| Chipset drivers | INF Update 8.1.1.1010; Intel Matrix Storage Manager 6.2 | ForceWare 9.35 | ForceWare 9.35 |
| Memory size | 4GB (4 DIMMs) | 4GB (4 DIMMs) | 4GB (4 DIMMs) |
| Memory type | Crucial Ballistix PC2-6400 DDR2 SDRAM at 800MHz | Corsair Dominator CM2X1024-8500C5D DDR2 SDRAM at 800MHz* | Corsair TWIN2X2048-8500C5 DDR2 SDRAM at 800MHz |
| CAS latency (CL) | 4 | 4 | 4 |
| RAS to CAS delay (tRCD) | 4 | 4 | 4 |
| RAS precharge (tRP) | 4 | 4 | 4 |
| Cycle time (tRAS) | 12 | 12 | 12 |
| Audio | Integrated ICH7R/AD1988B with Soundmax 5.10.2.4650 drivers | Integrated nForce 680a MCP/AD1988B with Soundmax 5.10.2.4650 drivers | Integrated nForce 590 MCP/AD1988B with Soundmax 5.10.2.4650 drivers |
| Hard drive | Maxtor DiamondMax 10 250GB SATA 150 | Maxtor DiamondMax 10 250GB SATA 150 | Maxtor DiamondMax 10 250GB SATA 150 |
| Graphics | GeForce 7950 GX2 1GB PCI-E with ForceWare 93.71 drivers | GeForce 7950 GX2 1GB PCI-E with ForceWare 93.71 drivers | GeForce 7950 GX2 1GB PCI-E with ForceWare 93.71 drivers |
| OS | Windows XP Professional x64 Edition | Windows XP Professional x64 Edition | Windows XP Professional x64 Edition |
| OS updates | DirectX 9.0c update (October 2006) | DirectX 9.0c update (October 2006) | DirectX 9.0c update (October 2006) |

Thanks to Corsair and Crucial for providing us with memory for our testing. Both of them provide products and support that are far and away superior to generic, no-name memory.

Also, all of our test systems were powered by BFG Tech 1000W power supply units. Thanks to BFG for providing these units for our use in testing.

The test systems’ Windows desktops were set at 1280×1024 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

 

Memory performance
We’ll begin by measuring the memory subsystem performance of these solutions—no minor thing, since there are such big differences between the system architectures. These synthetic tests won’t track closely with real-world application performance, but are enlightening anyhow.

Notice that I’ve included a graphic above the benchmark results. That’s a snapshot of the CPU utilization indicator in Windows Task Manager, which helps illustrate how much the application takes advantage of four CPU cores, when they’re available. I’ve included these Task Manager graphics whenever possible throughout our results.

Sandra’s synthetic memory bandwidth test is widely multithreaded, so it takes good advantage of all four of the Quad FX system’s memory channels and thus both halves of the NUMA memory subsystem. The result is realized throughput of nearly 15 GB/s. I should note here that, due to limitations in the Athlon 64’s memory clocking scheme, the FX-74’s memory modules are actually running at 750MHz rather than 800MHz, not that it hampers performance too terribly much.
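If you’re wondering where 750MHz comes from, here is a minimal sketch of one plausible explanation. It assumes the on-die memory controller derives the memory clock by dividing the CPU clock by a whole number, picking the smallest divider that keeps the memory at or below its rated clock. That divider rule is my assumption for illustration, as is the function name; it is not something AMD’s Quad FX materials spell out.

```python
import math


def k8_memory_speed(cpu_mhz: float, target_mem_clock_mhz: float) -> float:
    """Effective DDR2 data rate (MT/s) when memory clock = CPU clock / integer divider."""
    # Assumed rule: smallest whole divider that keeps the memory clock at or below target.
    divider = math.ceil(cpu_mhz / target_mem_clock_mhz)
    mem_clock = cpu_mhz / divider
    return 2 * mem_clock  # DDR memory transfers data twice per clock


# DDR2-800 modules have a rated memory clock of 400MHz.
for name, cpu_mhz in [("FX-70", 2600), ("FX-72", 2800), ("FX-74", 3000)]:
    print(f"{name}: DDR2-{k8_memory_speed(cpu_mhz, 400):.0f}")
```

Under that assumption, a 3GHz CPU clock can’t be divided evenly into 400MHz, so the memory lands at 375MHz, or an effective 750MHz, which matches what we saw on the FX-74.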

Speaking of handicaps, the Core 2 Extreme QX6700 comes up a little behind the X6800, probably because each of the QX6700’s two chips presents a load on the system’s front-side bus, bringing additional overhead with it. That would explain why the QX6700 is consistently, if slightly, behind the X6800 in memory bandwidth tests like this one.

The Quad FX system matches the Intel systems in memory access latency, falling a little behind the single-socket Athlon 64 FX-62. It’s possible the FX-74 is hampered here somehow by NUMA overhead, but as you can see, CPU-Z’s latency test is definitely single-threaded, so I’m not sure what to think. Regardless, all of these systems are very quick at transferring data to and from memory, the Athlon 64s mainly because of their integrated memory controllers and the Core 2 processors because of their sophisticated cache prefetch algorithms and the ability to move loads ahead of stores (a.k.a. “memory disambiguation”).

 

Cinebench
Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a quad-core system with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

With all four cores engaged, the Quad FX system muscles past the QX6700 to take the top spot by a surprisingly wide margin. Why? Well, part of the dynamic here is very simple. At 3GHz, the FX-74 proves faster than the QX6700 with only one thread, and thus one core, in action. When we move to four threads, that gap is only magnified.

You’ll want to keep another thing in mind when considering scaling from two cores to four. We have included the top dual-core processors from AMD and Intel in the mix there, because they are the appropriate real-world competitors to these quad-core systems. However, Intel makes a step down in clock speed when moving from the Core 2 Extreme X6800 to the QX6700, while AMD takes a step up from the FX-62 to the FX-74.

POV-Ray rendering
After holding out for quite a while, we’ve finally caved in and moved to the beta version of POV-Ray 3.7 that includes native multithreading. The 64-bit executable is still quite a bit slower than the 3.6 release, but it should give us a decent look at comparative performance, regardless.

Once more, the Quad FX system prevails, proving consistently faster than the QX6700 from a single thread up to four threads. Both systems scale pretty well from a single thread to four, but the FX-74 proves superior on that front, achieving a nearly perfect 4X speed increase with four threads.

3ds max 9 rendering
For our 3ds max test, we used the “architecture” scene from SPECapc for 3ds max 7. This scene is very complex and should be a nice exercise for these CPUs. Using 3ds max’s default scanline renderer, we first rendered frames 0 through 10 of the animation at 500×300 resolution.

Intel’s quad-core CPU picks up a win here, easily finishing before the FX-74. One reason the quad-core systems don’t separate themselves more from the dual-core competition is captured in the Task Manager graph; between rendering the frames, 3ds max pauses and uses a single thread to set up the next frame. If we were rendering at a higher resolution, the quad-core systems would likely pull further away from the dual-cores.
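That single-threaded setup phase is a textbook case of Amdahl’s law: whatever portion of the job runs on one thread caps the gain from adding cores. Here is a rough sketch of the effect. The parallel fractions below are hypothetical values chosen for illustration, not measurements from 3ds max.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup over one core when only part of the work is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Hypothetical parallel fractions, not measurements from this review.
for parallel_fraction in (1.0, 0.9, 0.7):
    two = amdahl_speedup(parallel_fraction, 2)
    four = amdahl_speedup(parallel_fraction, 4)
    print(f"parallel={parallel_fraction:.0%}: 2 cores {two:.2f}x, 4 cores {four:.2f}x")
```

The smaller the serial share of each frame, the closer four cores get to doubling the performance of two, which is why a higher rendering resolution would likely widen the gap.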

We’ve seen this problem before, but we’d hoped it would be resolved in 3ds max 9. Despite the fact that all four cores appear to be in use, the quad-core systems take longer to render the frame than their dual-core counterparts—strange but true.

 

Valve Source engine particle simulation
Next up are a couple of tests we picked up during a visit to Valve Software, the developers of the Half-Life games. They’ve been working to incorporate support for multi-core processors into their Source game engine, and they’ve cooked up a couple of benchmarks to demonstrate the benefits of multithreading.

The first of those tests runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects is limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

Both quad-core systems perform well, but the QX6700 is fastest. For what it’s worth, we have seen better performance from the Core 2 Extreme X6800 in this test in another config, but it was consistently slower here, for whatever reason.

Incidentally, we’ve also seen even more impressive particle simulations running on an Ageia PhysX card and on a GeForce 8800. Traditional CPU cores may not be the most effective vehicle for particle simulations in the next generation of games.

Valve VRAD map compilation
This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into its games. This isn’t a real-time process, and it doesn’t reflect the performance one would experience while playing a game. It does, however, show how multiple CPU cores can speed up game development.

Intel’s quad-core CPU turns out to be faster here, but both quad-core systems are again much quicker than their dual-core brethren.

 

3DMark06
3DMark06 combines the results from its graphics and CPU tests in order to reach an overall score. Here’s how the processors did overall and in each of those tests.

Wow, that is tight! The QX6700 just barely edges out the FX-74 in an extremely close matchup. Let’s see what made the difference.

3DMark’s graphics tests are almost entirely GPU-bound, even with our GeForce 7950 GX2 graphics card. The CPU tests, though, spin off multiple threads to handle tasks like game logic, physics, and AI, so the quad-core systems can hit full stride. Their strong performance in the CPU tests, combined with essentially equivalent performance in the graphics tests, allows the quad-core rigs to take the top spots in 3DMark’s overall score.

 

MyriMatch
Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He recently offered to provide us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest. David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching.

In this test, 1503 tandem mass spectra from a Thermo LCQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast).

The multithreaded stage of MyriMatch comes during generation of peptides from the database and comparison of those peptides with the experimental spectra. MyriMatch detects the number of CPUs/cores available on the system and spawns a worker thread for each. Worker threads then “take a number” out of a list of “worker numbers” and will iterate through the protein database in steps sized according to how big the “worker numbers” list is. The list is created so that each worker thread will finish its current number and then come back for another after it finishes. For example, on a machine with one dual-core processor, 2 threads will be spawned, and the “worker numbers” list might be any multiple of the number of worker threads, like: (1, 2, 3, 4, 5, 6, 7, 8). The first thread works on proteins 1, 9, 17, 25, etc. The second thread works on proteins 2, 10, 18, 26, etc. Whenever a thread finishes it will take the next number in the list, and iterate through the database again using the new number as the starting point. This technique is intended to minimize synchronization overhead between threads, minimize idle CPU time, and minimize the effect of some unfortunate ordering in the protein database causing one thread to search long proteins while another thread searches short proteins.
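The “take a number” scheme described above can be sketched roughly as follows. This is not MyriMatch’s actual code, just a minimal illustration of the technique as described; the function name, the `numbers_per_thread` parameter, and the bookkeeping dictionary are all hypothetical, and the real peptide scoring is replaced with a comment.

```python
import queue
import threading


def search_database(num_proteins: int, num_threads: int, numbers_per_thread: int = 4):
    """Return {worker number: proteins visited}, using strided passes over the database."""
    list_size = num_threads * numbers_per_thread  # a multiple of the thread count
    worker_numbers = queue.Queue()
    for n in range(1, list_size + 1):
        worker_numbers.put(n)

    visited = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                start = worker_numbers.get_nowait()  # "take a number"
            except queue.Empty:
                return  # no numbers left, so this thread is done
            # Step through the database in strides equal to the list size.
            proteins = list(range(start, num_proteins + 1, list_size))
            # A real search would generate and score each protein's peptides here.
            with lock:
                visited[start] = proteins

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited


result = search_database(num_proteins=6714, num_threads=2)
print(result[1][:4])  # proteins covered by worker number 1
print(result[2][:4])  # proteins covered by worker number 2
```

Because the only shared state is the list of worker numbers, threads synchronize just once per pass rather than once per protein, and a thread that draws a run of short proteins simply comes back for another number sooner.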

David and his colleagues will be publishing a paper on the MyriMatch algorithms, and I understand they hope to make MyriMatch available as open-source software, as well. The most important news for us is that MyriMatch is a real-world application, widely multithreaded, that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to four threads.

These results give us a new spin on the question of scaling. The Core 2 Extreme QX6700 is easily faster than the FX-74 with one and two threads, and it would appear to be on its way to outright victory. However, the QX6700’s performance doesn’t scale well when moving to three and four threads, while the FX-74’s does. The QX6700 might be running into a bus or memory bandwidth limitation. Whatever the case, the Quad FX system turns in the quickest overall processing time with four threads, albeit by a narrow margin. The moral of the story? If you’re matching peptides to spectra at home, the FX-74 will probably serve you best.

STARS Euler3d computational fluid dynamics
Our next benchmark is also a new one for us. Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us recently to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here. (I believe the score you see there at almost 3Hz comes from our eight-core Clovertown test system.)

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.

The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. I understand the STARS Euler3D routines are both very floating-point intensive and oftentimes limited by memory bandwidth. Here’s how our contenders handled it.

Well, the Core 2 processors pretty much embarrass the Athlon 64s here. Even the dual-core X6800 runs faster than the Quad FX.

 

Folding@Home
Next, we have another relatively new addition to our benchmark suite: a slick little Folding@Home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@Home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@Home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@Home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs is finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
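In other words, the final figure boils down to a simple average-and-multiply. Here is a minimal sketch of that arithmetic as I understand it; the function name, the two Gromacs labels, and the scores are placeholders of my own, not benchmark results or anything taken from notfred’s CD.

```python
def estimate_points_per_day(ppd_by_work_unit: dict, cores: int) -> float:
    """Average the per-core points per day across WU types, then scale by core count."""
    average_ppd = sum(ppd_by_work_unit.values()) / len(ppd_by_work_unit)
    return average_ppd * cores


# Placeholder per-core scores for illustration only; these are not benchmark results.
sample_scores = {"Tinker": 100.0, "Amber": 120.0, "Gromacs A": 200.0, "Gromacs B": 180.0}
print(estimate_points_per_day(sample_scores, cores=4))
```

Note that this method weights every WU type equally, so a CPU that is strong on one type and weak on another can land on the same overall score as a more balanced chip.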

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

The FX-74 system ends up getting the highest overall points-per-day score, but the result is actually split down the middle. For Tinker and Amber work units, the Athlon 64 CPUs are fastest, and for the Gromacs WUs, the Core 2 processors reign supreme. Either way, quad cores can offer big gains in distributed computing applications like Folding.

 

SiSoft Sandra Mandelbrot
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX, SSE, and SSE2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit version of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations in parallel.
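To make the interlacing idea concrete, here is a rough Python sketch of a column-interleaved Mandelbrot render. It is not Sandra's code. It assumes a fixed stride, where worker i takes columns i, i + N, i + 2N, and so on, which is a simplification of the "next column not being worked on" scheme in the FAQ. The image size, coordinate ranges, and all function names are invented for the example, and Python threads will not show the speedup that native SSE2 code would.

```python
from concurrent.futures import ThreadPoolExecutor

# Small illustrative image; Sandra's FAQ describes 640x480 with 255 iterations.
WIDTH, HEIGHT, MAX_ITER = 64, 48, 255


def escape_count(c, max_iter=MAX_ITER):
    """Count iterations of z = z*z + c before |z| exceeds 2."""
    z = 0j
    for i in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return i
    return max_iter


def render_columns(worker_index, worker_count, image):
    """Fill in every worker_count-th column, starting at worker_index."""
    for x in range(worker_index, WIDTH, worker_count):
        for y in range(HEIGHT):
            c = complex(-2.0 + 3.0 * x / WIDTH, -1.0 + 2.0 * y / HEIGHT)
            image[y][x] = escape_count(c)


def render(worker_count):
    """Render the whole image with one thread per worker."""
    image = [[0] * WIDTH for _ in range(HEIGHT)]
    with ThreadPoolExecutor(max_workers=worker_count) as pool:
        futures = [
            pool.submit(render_columns, i, worker_count, image)
            for i in range(worker_count)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return image


single = render(1)
quad = render(4)
print(single == quad)  # True
```

Because no two workers ever touch the same column, the threads need no locking, which is part of why this sort of workload scales so neatly with core count.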

The dual FX-74s are more than twice as fast as the Athlon 64 FX-62, but the Core microarchitecture’s ability to execute a 128-bit SSE instruction in a single clock cycle gives it an insurmountable advantage.

Windows Media Encoder x64 Edition
I had hoped to use QuickTime Pro to do some high-definition H.264 encoding, but QuickTime apparently maxes out at two threads. Windows Media Encoder works fine with four threads, though, and comes in a 64-bit version. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile.

This is another close one, but the QX6700 takes the top spot. Multi-core processors do offer speed gains in video encoding, but as is the case here, those gains don’t tend to be linear like they can be in 3D rendering.

 

picCOLOR
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Eight of the 12 functions in the test are multithreaded, and in this latest revision, five of those eight functions use four threads.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1 GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.

It’s not hard to pick out the four-threaded functions from among the individual results. The rotation and DCT functions seem to gain the most on the quad-core systems. Overall, though, the QX6700 proves faster than the FX-74 system.

The Panorama Factory
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process, so we can get a good look at each one. I’ve also added up the total operation time to give us an overall measure of performance.

Amazingly, virtually every stage of this program’s operation appears to use at least four threads. Notice that the QX6700 is faster in nearly every stage than the X6800, despite its slower clock speed.

At the end of the day, our two quad-core systems turn out to be evenly matched in this app, although the FX-74 technically gets credit for the win.

 

Power consumption and efficiency
We’re trying something new with power consumption this time. Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. As always, the meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, video card, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor and speakers into a separate outlet, though.) We measured how each of our test systems used power during a roughly one-minute period, during which time we executed Cinebench’s rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests, with the exception of the Athlon 64 FX-62. Our Asus M2N32-SLI Deluxe motherboard wouldn’t work with Cool’n’Quiet for some reason. We tried the two most recent production BIOS revisions for the board in both Windows XP Pro x64 Edition and the 32-bit version, to no avail. The loss of Cool’n’Quiet could raise the FX-62 system’s power consumption at idle or during low-load periods, but shouldn’t affect peak power consumption.

Like I said, Quad FX is the Hummer H2 of PC platforms. The thing uses nearly as much power at idle as the Core 2 Extreme X6800 system does while rendering, and when both FX-74s are rendering, power use peaks at around 450W. I believe that’s the highest we’ve seen for any PC system. Yow.

Once we have this data captured over time, we can consider it in various ways. For instance, one simple way to gauge power efficiency could be to look at energy use over our one-minute time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of Watt-seconds, equivalent to joules.

This is a potentially useful way of measuring power efficiency, but it’s tied to a set period of time. Assuming you don’t plan to keep your system mostly busy, the higher idle power use of the quad-core systems makes them less power-efficient overall. However, I think I’d prefer to break power use down into two components. The first of those, of course, is idle power, which is almost always a part of the total picture. Here’s how the various systems compare at idle.

That’s simple enough. The next step is to consider the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ll want to isolate the render period for each system. We can then compute the amount of energy used by each system during the rendering process, expressed in Watt-seconds. This method should account for both power use and, to some degree, for performance, because shorter render times may lead to lower energy consumption. I believe that makes this method our best measure of power efficiency.
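Here is a small Python sketch of that calculation. It assumes the meter logs one reading per second, which is an assumption for the example rather than a statement about the Extech's logging interval, and the two power traces are invented numbers, not our measurements. Summing watts times seconds over a window gives Watt-seconds.

```python
def energy_watt_seconds(power_samples_w, interval_s=1.0):
    """Approximate energy as the sum of each power reading times the interval."""
    return sum(power_samples_w) * interval_s


# Hypothetical 60-second logs: idle, then render, then idle again.
# System A renders for 20 s at 200 W; system B renders for 10 s at 300 W.
system_a = [100] * 10 + [200] * 20 + [100] * 30
system_b = [150] * 10 + [300] * 10 + [150] * 40

# Whole-window energy, which includes the idle time.
total_a = energy_watt_seconds(system_a)
total_b = energy_watt_seconds(system_b)

# Render-only energy: slice out each system's own render period.
render_a = energy_watt_seconds(system_a[10:30])
render_b = energy_watt_seconds(system_b[10:20])

print(total_a, total_b)    # 8000.0 10500.0
print(render_a, render_b)  # 4000.0 3000.0
```

In this made-up case, system B draws more power at every moment and uses more energy over the full minute, yet it needs less energy to finish the render because it finishes in half the time. That is the distinction the two measures are meant to capture.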

Considered in these terms, quad-core systems—with properly multithreaded applications—can be very power efficient. Even with its 450W peak power draw, the Quad FX system ends up in third place here, ahead of the Athlon 64 FX-62. The Core 2 Extreme QX6700, meanwhile, is all alone in first place, because it uses the least energy to render the scene. I suspect that lower speed grades of both of these quad-core solutions could offer even more power efficiency than these top-end processors do.

I should offer a quick thanks, by the way, to my fellow TR staffers Geoff Gasior and Cyril Kowaliski for helping me slice and dice this power consumption data in order to produce the graphs above. They were a great help in overcoming both time constraints and my liberal arts background.

 
Conclusions
Our tests have shown that quad-core systems can offer substantial performance gains in a broad range of applications, if the computing problem lends itself to parallel processing and if developers put the necessary effort into making their software multithreaded. Such widely multithreaded programs are not common today, especially among traditional consumer-oriented desktop applications. Even some creative tools intended for parallelizable tasks, like QuickTime Pro, use a maximum of two threads at present. Top developers like Valve are working on making their applications take advantage of four or more cores, though, and they will likely pave the way for the rest of the industry.

When those applications do arrive, we probably shouldn’t expect to see a general doubling of performance when moving from two cores to four, or even the same degree of performance leap we saw when going from one core to two. That’s not what we’ve seen from most of these widely multithreaded applications. The reasons for this scaling difficulty are many, but they are summarized in Amdahl’s Law. The degree of speedup we can expect will depend on the nature of the application, the skill of the programmers, and the other constraints of the hardware.
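A quick worked example shows why. In its usual form, Amdahl's Law says that if a fraction p of a program's work can run in parallel, the best possible speedup on n cores is 1 / ((1 - p) + p / n). The sketch below uses a hypothetical program that is 80% parallel; that figure is chosen purely for illustration and is not derived from any application in this review.

```python
def amdahl_speedup(parallel_fraction, core_count):
    """Upper bound on speedup when only part of the work runs in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / core_count)


p = 0.8  # hypothetical: 80% of the work parallelizes perfectly

speedup_two = amdahl_speedup(p, 2)
speedup_four = amdahl_speedup(p, 4)

print(round(speedup_two, 2))   # 1.67
print(round(speedup_four, 2))  # 2.5
print(round(speedup_four / speedup_two, 2))  # 1.5
```

Under those assumptions, going from one core to two yields about a 1.67x gain, but going from two to four adds only another 1.5x, and real programs also pay synchronization and memory-bandwidth costs that this simple bound ignores.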

Between the two quad-core systems we tested, the Core 2 Extreme QX6700 is faster overall. The Quad FX system with a pair of Athlon 64 FX-74 processors puts up a surprisingly good fight, though, thanks to its relatively high clock speed and superior system architecture. At the very least, the overall performance title is no longer unified due to the strength of Quad FX’s showing. By adapting its dual-socket workstation platform for the desktop, AMD has shown that it can still offer very competitive performance, so long as you don’t mind the power consumption that comes with it.

I’m pleased that the original 4×4 concept has been moderated so that it’s no longer tied to pairs of extremely pricey CPUs, no longer exclusive to vendors of outrageously expensive PCs, and no longer mated with quad-GPU graphics. Those adaptations have transformed Quad FX from a gimmick into a potentially attractive platform and a welcome development for PC enthusiasts.

Unfortunately, the Quad FX concept hasn’t entirely escaped its roots in excess and exclusivity. Most notably, the Asus L1N64-SLI WS is too expensive, and it raises the overall cost of the platform. The mobo’s price tag, size, and power consumption are no doubt higher due to its use of dual core-logic chips, which is probably an artifact of 4×4’s original quad-GPU association—and is just silly. The fact that this Asus board is the only Quad FX option makes it more of a problem. If this were one choice among many, we could more easily accept it as a part of the picture and move on to more reasonable alternatives.

Quad FX also suffers from a lack of low-power or even mid-power CPU options, which is a shame. This same technology in Opteron form offers a very compelling power efficiency proposition compared to the competition from Intel. Quad FX could do the same, if AMD would let it. Bring on the pairs of Athlon 64 X2 5200+ Energy Efficient processors and single-chipset motherboards with dual PCIe x16 slots, please, AMD. Then, trust me, you will have our attention.

For now, though, Intel’s quad-core processors offer better performance, lower power draw with correspondingly lower fan noise, and a range of excellent motherboard choices, almost all of which will fit into a standard ATX enclosure. Perhaps what AMD needs most is to make the transition to 65nm chip fabrication technology, so that quad-core computing doesn’t require an additional socket.  
