AMD’s Athlon 64 processor

AT LONG LAST, AMD is ready to introduce its Athlon 64 processor to the world. This chip is the successor to the formidable K7 core, known to most as the AMD Athlon processor. The Athlon cemented AMD’s place as a respectable second source for x86 processors, a position shakily established by the K6 before it. This time around, AMD has chosen to bolster the K7 core with a number of key internal enhancements and a radical reworking of the PC’s internal plumbing. This new processor core, code-named Hammer, is intended to solidify AMD as a leader not just in desktops, but in servers and workstations, as well. AMD has even amended the x86 instruction set for the future with 64-bit extensions.

We’ve spent the past few weeks testing Hammer-based chips, including the Athlon 64 and its new big brother, the mighty Athlon 64 FX-51. Not only that, but Intel slipped us a Pentium 4 3.2GHz Extreme Edition processor at the last minute, and we’ve benchmarked it, as well—2MB L3 cache and all. We can honestly say we were blown away by the performance of these new chips. The future is now. Read up to see how good it can be.

Hammer comes to the desktop
The Hammer CPU core is an evolutionary design based on AMD’s K7 microarchitecture. Nonetheless, Hammer is revolutionary, not so much because of what goes on inside the chip itself, but because of how it talks to the rest of the computer. We have dedicated most of our time and effort in preparing this review to empirical testing, so we’re not able to cover Hammer’s architectural innovations in as much detail as we’d like. Still, we’ll hit some of the major points that make AMD’s new processor distinctive. Among them:

  • An integrated memory controller — Conventional systems have long had a memory controller located on a “north bridge” chip that talks to the processor over a front-side bus. Hammer chips have the memory controller built in, so the processor talks to the memory controller directly at the full speed of the CPU, or 2.2GHz in the case of the fastest Athlon 64. (The memory controller itself also runs at the speed of the CPU.) As a result, Hammer processors can access memory with very low latencies, opening up one of the most persistent bottlenecks to overall system performance. Current Hammer chips have DDR memory controllers compatible with memory speeds up to 400MHz.

    Beyond the basic performance benefits, the movement of the memory controller on die has implications for the organization of the entire Hammer platform. Core logic chipsets no longer need to provide memory controllers, and the Hammer, strictly speaking, has no traditional front-side bus. Even more mind-bendingly, multiprocessor Hammer systems have individual banks of memory for each processor, so they should scale very well as processors are added.

  • HyperTransport communications — HyperTransport is the glue that makes AMD’s reorg of the traditional PC work. A packet-based chip-to-chip interconnect, HyperTransport links are pairs of 8-bit or 16-bit unidirectional links running at speeds up to 800MHz. Throw in a little DDR action, sending data twice per clock cycle, and you have an effective clock rate of 1.6GHz per link. As implemented in Hammer, HyperTransport links have a maximum throughput of 6.4GB/s (16 bits upstream plus 16 bits downstream at an effective 1.6GHz).

    Hammer systems use HyperTransport for several things. In all Hammer systems, one of the CPUs (or the only CPU) talks to the rest of the system over a HyperTransport link. Traditional chipset services like AGP, PCI, and south bridge I/O are delivered over this link much like VPN tunnels are delivered over TCP/IP connections in a computer network. Done right, HyperTransport should simplify motherboard design by replacing slower and wider connections that require more traces to achieve similar results. In multiprocessor implementations, HyperTransport links between processors allow for inter-chip communications, as well.

  • 1MB of on-chip L2 cache — Previous versions of the K7 core have had varying amounts of L2 cache, up to 512K on the most recent “Barton” Athlon XPs. The first wave of Hammer chips all come with 1MB of L2 cache onboard, upping the ante by a factor of two.

    Hammer’s L1 cache sizes are unchanged from K7 at 64K for instructions and 64K for data. AMD’s caches tend to be exclusive, and that’s the case with Hammer: the L2 cache doesn’t replicate the contents of the L1 caches. With the L1 data and L2 caches combined, the Hammer chips’ total effective data cache size is 1088K.

    We’ve seen many times before the impact larger caches have on performance. Generally, more cache is better, but many tasks pull through too much data to derive any benefit from extra cache, so the benefits are uneven.

  • SSE2 instruction set support — Intel introduced the SSE2 instruction set for single-instruction, multiple-data (SIMD) calculations with the Pentium 4. SSE2 allows for SIMD operations on 64-bit IEEE double-precision floating-point values packed into 128-bit registers, so it’s useful in tasks like 3D rendering, graphics drivers, gaming, and media encoding. Previous Athlon chips have supported SIMD instruction set extensions for both integer (MMX) and single-precision floating-point (SSE, 3DNow!) operations, but they have been missing SSE2, where SIMD on x86 arguably has the most impact. The Hammer core can take advantage of applications optimized for SSE2, making it more competitive with the Pentium 4.
  • AMD64 instruction set support — In a gutsy move, AMD has concocted its own set of extensions to the x86 instruction set architecture, or ISA. The new AMD64 ISA isn’t a radical departure, but it allows for 64-bit addressing, and it adds some additional registers to the register-poor x86 ISA.

    AMD’s move to 64 bits accomplishes several things. First, it eliminates the barrier of 4GB of addressable memory in 32-bit systems. 4GB may sound like a lot today, but as an upper limit, 4GB could become a nasty constraint, even on common desktop systems, in the next few years.

    Second, by adding 64-bit extensions to the x86 ISA, AMD has created an evolutionary alternative to Intel’s Itanium chips, which break almost entirely with the industry-standard x86 software infrastructure. Naturally, code will have to be recompiled for AMD64, but AMD64 is familiar enough that retooling compilers for it should be relatively painless.

    Finally, AMD64’s additional registers, which are present in Hammer, promise better performance on recompiled code. (Registers are essentially temporary local storage slots on a processor. More of them means less storing data in cache or memory.) Addressing memory in 64-bit chunks won’t, by itself, necessarily improve performance. The Hammer has eight new 64-bit integer registers and eight new 128-bit SSE/SSE2 registers to help.

  • A slightly longer pipeline — Hammer’s main branch prediction/recovery pipeline has been lengthened from 10 stages to 12. This change should allow the processor to run at higher clock rates at the expense of executing fewer instructions per clock. The Hammer’s pipeline is still much shorter than the 20-stage main pipeline in the “speed demon” Pentium 4.
  • 0.13-micron SOI fab process — All Athlon 64 processors are manufactured at AMD’s fab in Dresden, Germany, where AMD employs an advanced silicon-on-insulator (SOI) fabrication process to make these chips. Laying down the silicon on top of an insulator should allow the chip’s transistors to operate faster. IBM, who pioneered SOI technology, claims clock frequency gains from SOI as high as 35 percent in testing. However, transitions to new chip fabrication processes are fraught with potential snags. When AMD delayed the Athlon 64’s launch this past spring, the company cited difficulties producing chips in volume using SOI technology as the primary culprit and had to turn to IBM for assistance.

    The move to SOI is crucial because AMD’s enhancements to Hammer add up to a whole lot more transistors per chip than the K7. The last revision of the Athlon XP, code-named Barton, had 54.3 million transistors and a die size of 101 square millimeters. The Northwood Pentium 4 has 55 million transistors on a die that’s 145 square millimeters. By contrast, the Athlon 64 packs 105.9 million transistors onto a 192 square millimeter die.
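
A couple of the figures in the list above come down to simple arithmetic. The sketch below reproduces the 1088K effective data cache figure and the 4GB ceiling of 32-bit addressing. The function names are ours, and the inclusive-cache case is shown only for contrast; it is not a description of any particular chip.

```python
KIB = 1024

def effective_data_cache_kib(l1_data_kib, l2_kib, exclusive=True):
    """Capacity available for unique data across the L1 data cache and L2.

    Exclusive: L2 holds no copies of L1 lines, so the capacities add.
    Inclusive (for contrast, assuming L2 is larger than L1): L1 contents
    are duplicated in L2, so the L2 size is the ceiling.
    """
    return l1_data_kib + l2_kib if exclusive else l2_kib

def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

hammer = effective_data_cache_kib(64, 1024)
print(hammer)                               # 1088
print(addressable_bytes(32) // KIB ** 3)    # 4 (gigabytes)
```

An exclusive arrangement is what lets the 64K L1 data cache count toward the total rather than being shadowed by the L2.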

Those are the high points of AMD’s remodeling job. As we’ve noted, these changes have wide-ranging implications for Hammer motherboards, chipsets, operating systems, and software. We will, of course, be testing the performance implications shortly.
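
To make the SSE2 item above a bit more concrete, here is a rough Python model of a packed double-precision add. It only imitates the data layout (two 64-bit doubles in one 128-bit register) and is in no way how the hardware or a compiler would express it; the helper names are made up for illustration.

```python
import struct

REGISTER_BITS = 128  # width of one SSE/SSE2 register

def pack_doubles(a, b):
    """Pack two IEEE double-precision values into 16 bytes (128 bits)."""
    return struct.pack("<2d", a, b)

def packed_add(x, y):
    """Model one packed add: both 64-bit lanes are summed in a single step."""
    xa, xb = struct.unpack("<2d", x)
    ya, yb = struct.unpack("<2d", y)
    return struct.pack("<2d", xa + ya, xb + yb)

x = pack_doubles(1.5, 2.0)
y = pack_doubles(0.25, 4.0)
result = struct.unpack("<2d", packed_add(x, y))

print(len(x) * 8)   # 128
print(result)       # (1.75, 6.0)
print(REGISTER_BITS // 64, REGISTER_BITS // 32)  # 2 4 (doubles or singles per register)
```

The point is simply that one instruction does the work of two scalar additions, which is where the gains in rendering and encoding code come from.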


ClawHammer, SledgeHammer—JackHammer? BallpeenHammer? TackHammer?
AMD is introducing Athlon 64 chips based on two Hammer variants today, code-named ClawHammer and SledgeHammer. The principal difference between Claw and Sledge is the width of the chip’s connection to memory. ClawHammer’s memory controller has a single, 64-bit path to RAM, while SledgeHammer’s is a dual-channel or 128-bit design.
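
The practical meaning of the wider memory path is easiest to see as peak theoretical bandwidth. The sketch below assumes DDR400 memory (400 million transfers per second, the fastest speed the current memory controllers support) and ignores all real-world overhead; the function name is ours.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second):
    """Peak theoretical bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9

DDR400 = 400e6  # transfers per second (assumed memory speed)

clawhammer = peak_bandwidth_gb_s(64, DDR400)
sledgehammer = peak_bandwidth_gb_s(128, DDR400)
print(round(clawhammer, 1), round(sledgehammer, 1))  # 3.2 6.4
```

Measured bandwidth will always land below these ceilings, but the two-to-one ratio between the designs is the thing to keep in mind.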

AMD originally planned for all Athlon 64 chips to be based on ClawHammer, while Opteron workstation/server processors would be based on Sledge. Turns out, though, AMD has decided to intro a top-of-the-line desktop chip with a dual-channel memory controller called the Athlon 64 FX. (Somebody phone NVIDIA marketing!) This chip is essentially a remarked Opteron running at 2.2GHz. To be more specific, the 2.2GHz flavor is dubbed “Athlon 64 FX-51”, and it gets no other designation. FX-51. That’s it. AMD figures the folks who will be willing to cough up the $733 list price for this baby will know how it performs from having read publications like this one, so there will be no Pentium 4 equivalency games played here.

If you want to play those games, you can pick up a non-FX Athlon 64 like the Athlon 64 3200+. These ClawHammer-based products have a 64-bit path to memory and come in the 754-pin package originally intended for all Athlon 64s. AMD will initially be selling the Athlon 64 3200+, which runs at 2GHz, for $417—a veritable bargain compared to the FX model.

The Athlon 64 FX (left) comes in a ceramic package, while
the Athlon 64 3200+ wears an organic package

With 940 pins, Athlon 64 FX (left) would make a good brush.
The Athlon 64 3200+ (right) sports only 754 pins.

Because of AMD’s late decision to go with a SledgeHammer-based desktop chip, the Athlon 64 FX drops into Opteron motherboards with 940-pin sockets like the Asus SK8N. Of course, this fact messes up AMD’s careful market segmentation plans, especially since it now looks like the future of Athlon 64 is 128-bit memory interfaces. To remedy this problem, Athlon 64 FX processors will soon get their very own, physically incompatible 939-pin socket. To aid in the infrastructure transition, Athlon 64 FX chips will be available in both 940-pin and 939-pin packages for the duration of 2004.

Word has it AMD may introduce a separate, 941-pin package later this year just out of spite.

I kid. I kid.

These market segmentation games don’t bother me too much, so long as AMD delivers a quality product. Motherboard manufacturers, however, may be a different story. I expect many of them are scrambling right now to prepare 939-pin motherboards for use with upcoming Athlon 64 FX chips.

If you’re thinking the prices on these Athlon 64 chips sound steep, you’re thinking right. AMD is only introducing two models of the Athlon 64, and the cheap one costs north of four hundred bucks. There are several possible reasons, not mutually exclusive, for these high prices. AMD says it wants to end the practice of pricing its chips below Intel’s when AMD’s chips are technically superior. To that end, the company says it’s positioning the Athlon 64 against Intel’s upcoming Prescott chips. The Athlon XP can continue to battle it out with existing Pentium 4 chips, and the Athlon 64 FX will sit alone atop the desktop performance throne. Like so:

Source: AMD.

This looks like wishful thinking to me, but perhaps it will help AMD raise its average selling prices, even if the strategy doesn’t succeed entirely. I can’t help but think the most important reason for high prices on the Athlon 64 has to do with limited supply. Price can be a very effective rationing system, and the fact AMD isn’t introducing slower, lower-cost Athlon 64 chips suggests such rationing may be necessary at present.


Intel’s extreme measures
If AMD can pull a workstation/server-class chip into the desktop market in order to capture the performance lead, Intel should be able to do the same, right? Well, that’s exactly what Intel decided to do, and apparently the decision was made very recently. Late last week, the Pentium 4 3.2GHz Extreme Edition arrived at my doorstep, innocently proclaiming its willingness to be benchmarked against whatever AMD had to offer. This processor is basically a rebadged Xeon MP chip, but then again the Xeon is just a rebadged Pentium 4, so it all comes back around somehow. The net result is a Pentium 4 that runs at 3.2GHz with the usual 512K of L2 cache—plus a whopping 2MB of on-chip L3 cache. Unlike true Xeon chips, the Extreme Edition fits into plain ol’ Socket 478 Pentium 4 motherboards. We plugged it into our Abit IC7-G test mobo, and it worked fine without need of a BIOS update or anything else.

The Pentium 4 3.2GHz Extreme Edition proclaims its newness with handwritten markings

I mentioned earlier that the additional cache on the Athlon 64 should help performance in some applications, but not in all of them. The same is true for the Pentium 4 Extreme Edition, but with a larger cache, there’s a better chance an application’s working data set will fit inside it. Intel is aiming at gamers with this chip, as AMD is with the Athlon 64 FX-51, and that fact alone should tell you something about how the added cache is likely to affect performance.

Intel says the Extreme Edition should ship around about November, which is just the right time to ship a new model space heater. At 178 million transistors, this Edition is most definitely Extreme. The die size is an impressive 237 square millimeters, or just about the size of Vermont. Still, the Extreme Edition seems like a sweet proposition. With Hyper-Threading and all of that cache, the user experience just puttering around on the desktop or playing with productivity apps should be creamy smooth indeed. The Extreme’s price has yet to be announced.

By the way, we have no qualms about Intel horning in on AMD’s product launch. AMD has done the same to Intel in the past, and besides, Intel’s decision to pull a killer server chip into the desktop market is very much welcome to us. In the end, consumers benefit. There is also the distinct possibility that the Extreme Edition is Intel’s way not just of competing, but of avoiding embarrassment. You’ll see what I mean by that when we get into the benchmark scores shortly.

The question I have is whether Intel will remain committed to future Extreme Edition processors. AMD has proclaimed its commitment to keeping the Athlon 64 FX on top. In fact, AMD has been releasing low-volume high-end parts since a year ago, with the T-bred 2800+ chip, which was never available via retail. These products aren’t a good value proposition, but they do give well-funded enthusiasts a chance to grab the latest technology before everybody else. I’m curious to see whether Intel will play this game long term.


Not-so-core logic
The Athlon 64 arrives with the support of two decent third-party chipsets—or at least what’s left of the chipset, now that the memory controller has moved onto the CPU. Available now, NVIDIA’s nForce3 150 chipset is a single-chip design with AGP and south bridge I/O functions rolled into one. When used with an Opteron or Athlon 64 FX, NVIDIA calls it the nForce3 Pro 150, but I’m pretty sure the Pro and non-Pro chips are one and the same. Interestingly, NVIDIA hasn’t included its much-ballyhooed Audio Processing Unit (APU) on the nForce3, so unaccelerated AC’97 audio is all you get. We’re left to wonder what will become of the APU.

To coincide with the Athlon 64 launch, NVIDIA is introducing its ForceWare software, which will roll up all of NVIDIA’s platform software under a single brand name. ForceWare will add RAID 0, 1, and 0+1 capabilities to nForce3 150’s three ATA/133 controllers, among other things.

Planned for later this year is nForce3 250, a revised chipset with support for Serial ATA (including RAID), eight USB 2.0 ports, and optional Gigabit Ethernet. nForce3 Go will add power management for laptop computers and NVIDIA integrated graphics.

VIA’s K8T800 is a dual-chip design with distinct north and south bridges for easy upgrades to either chip. The K8T800, which is now shipping, will go into everything from desktops to workstations to servers. This chipset offers Serial ATA RAID and eight USB ports courtesy of VIA’s new VT8237 south bridge. Also, the K8T800 has a full 16-bit, 800MHz implementation of HyperTransport that VIA has dubbed Hyper8. With 6.4GB/s of bandwidth between chipset and CPU, the K8T800 has a theoretical advantage over the nForce3 150’s 3.6GB/s.
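
Those two bandwidth figures follow directly from the link widths and clocks. The sketch below works them out, assuming the nForce3 150 link runs 16 bits downstream and 8 bits upstream at 600MHz, as listed in our test system configuration. The function is a back-of-the-envelope helper of our own, not anything taken from the HyperTransport specification.

```python
def ht_link_gb_s(width_bits, clock_mhz):
    """One-direction HyperTransport throughput in GB/s.

    Data moves on both clock edges (DDR), so the effective
    transfer rate is twice the link clock.
    """
    transfers_per_second = clock_mhz * 1e6 * 2
    return width_bits / 8 * transfers_per_second / 1e9

# Downstream plus upstream for each chipset's CPU link.
k8t800 = ht_link_gb_s(16, 800) + ht_link_gb_s(16, 800)
nforce3_150 = ht_link_gb_s(16, 600) + ht_link_gb_s(8, 600)
print(round(k8t800, 1), round(nforce3_150, 1))  # 6.4 3.6
```

Whether the narrower, slower link actually holds the nForce3 back is a separate question that only testing can answer.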

A block diagram of a typical K8T800-based desktop system. Source: VIA.

Incidentally, when we set out to review the Athlon 64, we tested the Athlon 64 FX-51 with the nForce3 Pro 150-based Asus SK8N motherboard. This was the test system configuration shipped to us by AMD. However, after seeing the performance of the K8T800, especially in gaming and graphics, we decided to retest with the VIA chipset, as well. Both sets of results are included in our review. We had no intention of doing a platform comparison in our processor review, but I think you’ll understand why we chose to include the K8T800.

VIA’s plans for Athlon 64 chipsets also include the K8M800 with integrated graphics for low-cost desktops and the K8N800 for mobile applications.

About the 64-bit Windows pre-beta
You will notice that we tested the Athlon 64 FX-51 in some applications with a pre-beta version of Windows XP 64-bit Edition. The very fact that Microsoft allowed AMD to ship systems to reviewers with an unreleased 64-bit version of Windows says good things about Microsoft’s commitment to AMD64 support. The pre-beta version of Windows XP 64-bit Edition is still very much a work in progress, and not everything works perfectly yet. For instance, the NVIDIA video drivers didn’t seem to have proper OpenGL acceleration, and we didn’t have access to a 64-bit version of DirectX 9. As a result, we tested only text applications and a few Direct3D 8.x games.

Of course, these applications are all 32-bit programs, so they don’t take advantage of extra registers or memory address space. They are just 32-bit programs running in the Windows-on-Windows facility in the 64-bit edition of WinXP. They should provide an interesting preview of performance in 64-bit Windows, but I expect performance will improve as the OS nears its final form.


Cooling the beast
I’d like to take a second to show you AMD’s new cooler retention mechanism, because I’m fairly impressed with it. The pictures below will give you the general idea. This is the stock cooler from AMD. One came in our test system, and one came with each of the two retail Opteron 240s we recently purchased.

The cooler’s lever arm is clipped closed

Release the lever and the tension

Use a screwdriver to remove the clip, and…

That was easy!

A plate under the mobo holds the cooler retention bracket in place

I’ve broken more than my share of stock Intel Pentium 4 coolers, but I was able to get the hang of the new AMD retention mech after just a few uses. Also, the plate on the underside of the mobo seems to help dissipate heat, which is, erm, cool.


What to watch for in the benchmark results
There are plenty of storylines here, and I can’t mention them all. However, you will want to watch for several things. First, of course, there are AMD’s two new processors, the Athlon 64 3200+ and Athlon 64 FX-51. How much of an improvement over the K7 are these new Hammer chips?

Also, we’ve included results for an Opteron 146 chip. This is a SledgeHammer running at 2.0GHz, so it provides a pretty direct comparison to the Athlon 64 3200+, which runs at the same speed but has only one memory channel. Then again, the Athlon 64 3200+ seems to be on a faster motherboard, so that comparison is a little wobbly.

We’ll be keenly interested to see how the Pentium 4 3.2GHz Extreme Edition fares. Will it really offer a worthy improvement over the Pentium 4 3.2GHz, and more importantly, can it beat the Athlon 64 FX-51 for the top spot?

Back down on earth, the battle between the Pentium 4 3.2GHz and the Athlon 64 3200+ may be more interesting still, because folks might actually buy these chips. Can AMD capture the performance lead at the top of the mainstream desktop market from Intel?

Finally, I spent some extra time in the test lab to make sure we had results for slower speeds of Pentium 4 and Athlon XP, so you can see how truly fast the new high-end chips are. Check out the performance delta from low end to high.

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least twice, and the results were averaged.
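
For readers who want to repeat a test the same way, the averaging step looks roughly like this. The scores are invented placeholder values, not results from this review, and the spread calculation is an illustrative extra rather than part of the method described above.

```python
from statistics import mean

def summarize_runs(scores):
    """Average repeated benchmark runs and report the relative spread."""
    if len(scores) < 2:
        raise ValueError("run each test at least twice")
    average = mean(scores)
    spread = (max(scores) - min(scores)) / average
    return average, spread

# Placeholder numbers for illustration only.
average, spread = summarize_runs([100.0, 102.0])
print(average)          # 101.0
print(f"{spread:.1%}")  # 2.0%
```

A large spread between runs is usually a hint that something in the background interfered and the test should be repeated.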

Our test systems were configured like so:

System 1
  Processor: Athlon XP ‘Barton’ 3200+ 2.2GHz
  Front-side bus: 400MHz (200MHz DDR)
  Motherboard: Asus A7N8X Deluxe v2.0
  North bridge: nForce2 SPP
  South bridge: nForce2 MCP-T
  Chipset drivers: nForce Unified 2.45
  BIOS revision: 1005
  Memory size: 1GB (2 DIMMs)
  Memory type: Corsair TwinX XMS4000 DDR SDRAM at 400MHz
  Hard drive: Seagate Barracuda V 120GB ATA/100
  Audio: nForce2 MCP/ALC650

System 2
  Processor: Athlon XP ‘Barton’ 2500+ 1.83GHz; Athlon XP ‘Barton’ 2800+ 2.183GHz
  Front-side bus: 333MHz (166MHz DDR)
  Motherboard: Asus A7N8X Deluxe v2.0
  North bridge: nForce2 SPP
  South bridge: nForce2 MCP-T
  Chipset drivers: nForce Unified 2.45
  BIOS revision: 1005
  Memory size: 1GB (2 DIMMs)
  Memory type: Corsair TwinX XMS4000 DDR SDRAM at 333MHz
  Hard drive: Seagate Barracuda V 120GB ATA/100
  Audio: nForce2 MCP/ALC650

System 3
  Processor: AMD Athlon 64 3200+ 2.0GHz
  Front-side bus: HT 16-bit/800MHz downstream, HT 16-bit/800MHz upstream
  Motherboard: MSI K8T Neo
  North bridge: K8T800
  South bridge: VT8237
  Chipset drivers: 4-in-1 v.4.49, ATA 5.1.2600.10
  BIOS revision: 1.0
  Memory size: 768MB (3 DIMMs)
  Memory type: Corsair XMS3200 DDR SDRAM at 400MHz
  Hard drive: Seagate Barracuda V 120GB SATA 150
  Audio: VT8237/ALC650

System 4
  Processor: AMD Athlon 64 FX-51 2.2GHz
  Front-side bus: HT 16-bit/800MHz downstream, HT 16-bit/800MHz upstream
  Motherboard: MSI 9130
  North bridge: K8T800
  South bridge: VT8237
  Chipset drivers: 4-in-1 v.4.49, AGP 4.42
  BIOS revision: 1.0
  Memory size: 1GB (2 DIMMs)
  Memory type: Infineon PC3200 registered ECC DDR SDRAM at 400MHz
  Hard drive: Seagate Barracuda V 120GB SATA 150
  Audio: VT8237/ALC201A

System 5
  Processor: AMD Opteron 146 2.0GHz; AMD Athlon 64 FX-51 2.2GHz
  Front-side bus: HT 16-bit/600MHz downstream, HT 8-bit/600MHz upstream
  Motherboard: Asus SK8N
  North bridge: nForce3 Pro 150
  South bridge: integrated (single-chip design)
  Chipset drivers: AGP 3.34, ATA 3.44
  BIOS revision: 1002
  Memory size: 1GB (2 DIMMs)
  Memory type: Infineon PC3200 registered ECC DDR SDRAM at 400MHz
  Hard drive: Seagate Barracuda V 120GB ATA/100
  Audio: nForce3 Pro/ALC650

System 6
  Processor: Pentium 4 2.4 ‘C’ GHz; Pentium 4 2.8GHz; Pentium 4 3.2GHz; Pentium 4 3.2GHz Extreme Edition
  Front-side bus: 800MHz (200MHz quad-pumped)
  Motherboard: Abit IC7-G
  North bridge: 82875P MCH
  South bridge: 82801ER ICH5R
  Chipset drivers: INF Update 5.0.1015, ATA 5.0.1007.0
  BIOS revision: 1.6
  Memory size: 1GB (2 DIMMs)
  Memory type: Corsair TwinX XMS4000 DDR SDRAM at 400MHz
  Hard drive: Seagate Barracuda V 120GB SATA 150
  Audio: ICH5/ALC650

All systems
  Graphics: GeForce FX 5900 Ultra
  OS: Microsoft Windows XP Professional
  OS updates: Service Pack 1, DirectX 9.0b

Sorry about the 768MB of RAM in the Athlon 64 3200+ system. I couldn’t get it to boot with either pair of 512MB DDR400 DIMMs I had on hand, and its motherboard had only three DIMM slots, so 768MB was as close as we could come. I don’t believe this difference in memory size should affect any of the benchmarks we used.

All tests on the Pentium 4 systems were run with Hyper-Threading enabled.

Thanks to Corsair for providing us with memory for our testing. If you’re looking to tweak out your system to the max and maybe overclock it a little, Corsair’s RAM is definitely worth considering.

The test systems’ Windows desktops were set at 1152×864 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.

We used the following versions of our test applications:

All the tests and methods we employed are publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.


Benchmark results

Memory performance
By tradition, we start our CPU tests with synthetic memory performance benchmarks, in part to keep them separate from real-world benchmarks, and in part because they’re generally where the action is when a truly new processor arrives. Such is the case today. The Athlon 64’s integrated memory controller promises great things, so let’s dive in and see if we can measure its impact.

The Athlon 64 FX-51 shows exactly why AMD decided to incorporate the memory controller into the processor. The FX-51 leads all contenders with over 5.5GB/s of memory bandwidth in Sandra. The Athlon 64 3200+, meanwhile, performs very well for a single-channel solution, just edging out the Athlon XP 3200+ in Sandra, and seriously outrunning it in cachemem’s bandwidth test.

The P4 Extreme Edition doesn’t show much benefit over the stock P4 in tests of memory bandwidth, but that’s expected.

Linpack can show us L1 and L2 cache hierarchies at work, as well as real-world memory bandwidth. That orange line you see flying off the edge of the graph is the P4 Extreme Edition, whose L3 cache is larger than our data matrix sizes. We’re gonna have to adjust our tests if this keeps up.

Among the more sane results, you can see the extended size of the Athlon 64’s L2 cache compared to the Athlon XP, and you can see how the Athlon 64 FX-51 promises the most sustained bandwidth when matrix calculations spill into main memory.

The Extreme Edition’s massive cache forced us to choose a different data point for our latency sample. Even with a 4MB block, the Extreme’s big caches help mask memory access latency, as one might expect with the pre-fetching algorithms of the Pentium 4.

Nevertheless, the Athlon 64 chips destroy the competition here. The integrated memory controller appears to shave over 25 nanoseconds off memory access. But, my friends, that’s just one sample point. Let’s look at them all.

Memory performance (continued)
Not only are our 3D graphs indulgent, but they’re useful, too. I’ve arranged them manually in rough order from worst to best, for what it’s worth. I’ve also colored the data series according to how they correspond to different parts of the memory subsystem. Yellow is L1 cache, light orange is L2 cache, and orange is main memory. The red series on the Extreme Edition graph represents L3 cache. Of course, caches sometimes overlap, so the colors are just an interesting visual guide.

The Athlon 64 FX-51 and the Opteron 146 require registered DIMMs, but the Athlon 64 3200+ does not. Probably as a result of that difference, the Athlon 64 3200+ achieves the lowest memory access times.

The Pentium 4 Extreme Edition remains… extreme.


Unreal Tournament 2003

The Athlon 64 chips are worldbeaters in Unreal Tournament, well ahead of the Pentium 4 3.2GHz. The P4 Extreme Edition’s extra cache helps in UT2003, and its third-place finish (behind the two FX-51 setups) maintains some respectability for Intel.

Quake III Arena

I guess you can see now why the P4 Extreme Edition makes sense. A total of four different AMD Hammer configs finish ahead of the Pentium 4 3.2GHz, but the Extreme Edition leads them all, loading up large chunks of Quake III into its L3 cache and going to town. Nevertheless, the Athlon 64s show their mettle here, outrunning the P4 3.2GHz.

You can begin to see the difference between the K8T800 and the nForce3. With the Athlon 64 FX-51, the K8T800 pulls ahead of the nForce3 Pro by over 10 frames per second. The 2GHz Hammer chips are even more lopsided, with the single-channel Athlon 64 3200+ on the K8T800 trouncing its dual-channel counterpart, the Opteron 146, on the nForce3 Pro. The Athlon 64 3200+ somehow even manages to beat the Athlon 64 FX-51.

Wolfenstein: Enemy Territory

Wolfenstein: Enemy Territory is a tight race, but the Athlon 64 FX-51 manages to take the top spot when paired up with the K8T800.


Comanche 4

We couldn’t have asked for more drama, even if we were producers of a reality show on Fox. The Extreme Edition nips the Athlon 64 FX by a hair. That L3 cache is good for about 8 frames per second in Comanche 4, and that’s just enough to do the trick.

Serious Sam SE

The Hammer processors rule in Serious Sam. Look, also, at the performance delta from top to bottom here. That’s huge.


3DMark’s overall score is driven primarily by video card performance, but the Pentium 4 chips are a bit faster in this test.

The CPU tests are another story. The Athlon 64 chips sweep CPU test 1. CPU test 2 seems to be more memory bandwidth oriented, and the Extreme Edition performs well in it.


Sphinx speech recognition
Ricky Houghton first brought us the Sphinx benchmark through his association with speech recognition efforts at Carnegie Mellon University. Sphinx is a high-quality speech recognition routine that needs the latest computer hardware to run at speeds close to real-time processing. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance.

There are two goals with Sphinx. The first is to run it faster than real time, so real-time speech recognition is possible. The second, more ambitious goal is to run it at about 0.8 times real time, where additional CPU overhead is available for other sorts of processing, enabling Sphinx-driven real-time applications.
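
Put another way, the figure of merit is processing time divided by the length of the audio. A minimal sketch of that calculation, with made-up timings and our own function names, might look like this:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """A ratio below 1.0 means recognition runs faster than real time."""
    return processing_seconds / audio_seconds

def classify(rtf, headroom_target=0.8):
    """Sort a result into the two goals described above."""
    if rtf <= headroom_target:
        return "real time with headroom"
    if rtf < 1.0:
        return "real time"
    return "slower than real time"

# Hypothetical timings for a 60-second recording.
for seconds in (45.0, 54.0, 66.0):
    rtf = real_time_factor(seconds, 60.0)
    print(f"{rtf:.2f} -> {classify(rtf)}")
```

Lower is better here, which is worth remembering when reading the graphs.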

All but two of our test systems finish below 0.8 times real time, which must mean Sphinx is ready to deploy. The graph’s sort order is a little deceptive, because it’s sorted by the Microsoft compiler results, and the Athlon chips perform better with the Intel compiler. (Go figure.) In terms of absolute performance, the Athlon 64 FX on the K8T800 is actually the third fastest config, right behind the Pentium 4 3.2GHz Extreme and the regular P4 3.2GHz.

LAME MP3 encoding
We used LAME 3.92 to encode a 101MB 16-bit, 44kHz audio file into a very high-quality MP3. The exact command-line options we used were:

lame --alt-preset extreme file.wav file.mp3

The Pentium 4 has always performed well at media encoding tasks, and that’s the case here. The Extreme Edition’s L3 cache doesn’t help at all, though, nor does the Athlon 64 FX’s integrated memory controller.

DivX video encoding
Xmpeg is partially self-tuning, and we noted that it chose the SSE2 Optimized iDCT on the Hammer processors.

You can see how poorly the K7 chips handle DivX encoding compared to the Pentium 4. The Hammer processors close the gap substantially, but they can’t quite catch the 3.2GHz P4s.


3ds max rendering
We begin our 3D rendering tests with Discreet’s 3ds max, one of the best known 3D animation tools around. 3ds max is both multithreaded and optimized for SSE2. We rendered a couple of different scenes at 1024×465 resolution, including the Island scene shown below. Our testing techniques were very similar to those described in this article by Greg Hess. In all cases, the “Enable SSE” box was checked in the application’s render dialog.

SSE2 and Hyper-Threading are a potent combo in 3D rendering, as the Pentium 4 3.2GHz CPUs prove. Still, the Athlon 64 FX holds the top spot in the Earth-Apollo scene.


Lightwave rendering
NewTek’s Lightwave is another popular 3D animation package that includes support for multiple processors and is highly optimized for SSE2. Lightwave can render very complex scenes with realism, as you can see from the sample scene, “A5 Concept,” below.

Also, I should note that in our recent workstation PC comparo we tried a number of different thread counts for the rendering engine in an attempt to exploit Hyper-Threading. At the time, we thought we were getting the best scores with multiple concurrent threads on the Pentium 4, but on further investigation, that turned out not to be the case. Lightwave uses SSE2 well enough that more threads don’t really help, or so it seems. All the results below are single-threaded.

Adding SSE2 support was a big win for the Athlon 64. The top Pentium 4 chips still post the lowest render times, but look at the massive render time differences between the 2.2GHz Athlon XP 3200+ and the 2.2GHz Athlon 64 FX-51.


POV-Ray rendering
POV-Ray is the granddaddy of PC ray-tracing renderers, and it’s not multithreaded in the least. Don’t ask me why—seems crazy to me. POV-Ray also relies more heavily on x87 FPU instructions to do its work, because it contains only minor SIMD optimizations.

We tested with the old “chess2.pov” scene we’ve been using forever. In a recent article, we also tested with the “official” POV-Ray benchmark, but time constraints prevented us from including it here. I also believe our “chess2” scene is more representative of everyday POV-Ray performance than the “official” benchmark scene.

In this x87-intensive renderer, the AMD chips are the clear winners. I think the Pentium 4s might give them a run for their money if POV-Ray were multithreaded, though.


Cinebench 2003 rendering and shading
Cinebench is based on Maxon’s Cinema 4D modeling, rendering, and animation app. This revision of Cinebench measures performance in a number of ways, including 3D rendering, software shading, and OpenGL shading with and without hardware acceleration.

Cinema 4D’s renderer is multithreaded, so it takes advantage of Hyper-Threading. For the AMD-based systems, I’ve reported the single-processor results. For the P4 systems, I’ve reported the multi-threaded results, which in all cases were notably faster.

Cinema 4D likes the Hyper-Threading. The Athlon 64 systems can’t quite compete with that.

The Athlon 64 FX makes up for its loss in the first test by sweeping the rest of them.


SPECviewperf workstation graphics
SPECviewperf simulates the graphics loads generated by various professional design, modeling, and engineering applications.

Notice here the contrast between the Athlon 64 FX with the K8T800 and with the nForce3 Pro. With the K8T800, the Athlon 64 FX is arguably the fastest system overall in the viewperf suite. The nForce3 Pro, however, seems to limit performance quite a bit.

Also, here’s a case where the Pentium 4 Extreme Edition’s L3 cache doesn’t seem to help much. That’s almost surprising, because it tends to help much more often than not.


I’d like to thank Alex Goodrich for his help working through a few bugs in the 2.0 beta version of ScienceMark. Thanks to his diligent work, I was able to complete testing with this impressive new benchmark, which is optimized for SSE, SSE2, and 3DNow!, and is multithreaded as well.

In the interest of full disclosure, I should mention that Tim Wilkens, one of the originators of ScienceMark, now works at AMD. However, Tim has sought to keep ScienceMark independent by diversifying the development team and by publishing much of the source code for the benchmarks at the ScienceMark website. We are sufficiently satisfied with his efforts, and impressed with the enhancements to the 2.0 beta revision of the application, to continue using ScienceMark in our testing.

The molecular dynamics simulation models “the thermodynamic behaviour of materials using their forces, velocities, and positions”, according to the ScienceMark documentation. Sounds simple enough, right?

Primordia “calculates the Quantum Mechanical Hartree-Fock Orbitals for each electron in any element of the periodic table.” In our case, we used the default element, Argon.

The next test measures performance in AES encryption.

The Hammer core excels in the classical computing algorithms above. Matrix multiplication with BLAS may be a different story, however. Notice that ScienceMark’s BLAS tests are highly optimized using x87 assembly, SSE, SSE2, and 3DNow! as appropriate. As a result, these tests are probably a much better indicator of matrix multiplication performance than the version of Linpack we use primarily to measure memory bandwidth.

The Pentium 4 achieves the highest peak throughput in both single-precision (SGEMM) and double-precision (DGEMM) floating-point calculations with proper use of SSE and SSE2. However, the Athlon 64 processors are more amenable to various types of optimizations, and they perform best with the compiled C code, as well. Interestingly enough, in DGEMM, the Hammer chips appear to achieve near-peak performance with three different types of code, two scalar and one vector. They don’t seem to care how the data is organized, whereas the Pentium 4 responds much better with vectorization.
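
To make the throughput figures concrete: a matrix multiply of two n×n matrices performs one multiply and one add per inner-loop step, or 2n³ floating-point operations in total, and throughput is that count divided by elapsed time. The sketch below is a plain scalar triple loop in Python, not ScienceMark's code, which is hand-optimized with x87 assembly, SSE, SSE2, and 3DNow!. The names `matmul`, `gemm_flops`, and `measure_mflops` are illustrative. Interpreted Python says nothing about what the hardware can do; it only shows what is being counted.

```python
import time


def matmul(a, b):
    """Naive scalar product of two square matrices given as lists of lists."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                # One multiply and one add per inner-loop step.
                c[i][j] += aik * b[k][j]
    return c


def gemm_flops(n):
    """Floating-point operations in an n x n by n x n multiply."""
    return 2 * n ** 3


def measure_mflops(n=64):
    """Time one naive multiply and return millions of FLOPs per second."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    matmul(a, b)
    elapsed = time.perf_counter() - start
    return gemm_flops(n) / elapsed / 1e6
```

A vectorized version would process several elements of a row per instruction, which is where the data layout starts to matter.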


picCOLOR image analysis
We thank Dr. Reinert Muller with the FIBUS Institute for pointing us toward his picCOLOR benchmark. This image analysis and processing tool is partially multithreaded, and it shows us the results of a number of simple image manipulation calculations. The overall score is indexed to a Pentium III 1GHz system based on a VIA Apollo Pro 133. In other words, the reference system would score a 1.0 overall.
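
As a small illustration of what "indexed to a reference system" means, the sketch below assumes a per-test score is the ratio of the reference system's time to the tested system's time. That ratio form is an assumption, as is the function name `indexed_score`. The article does not describe how picCOLOR computes individual scores or combines them into the overall figure.

```python
def indexed_score(reference_seconds, measured_seconds):
    """Score relative to a reference system; the reference itself scores 1.0."""
    if measured_seconds <= 0:
        raise ValueError("measured_seconds must be positive")
    return reference_seconds / measured_seconds


# A system that finishes in a third of the reference time scores 3.0.
score = indexed_score(9.0, 3.0)
```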

The Athlon 64 FX is over three times the speed of Dr. Muller’s reference system in picCOLOR, way ahead of the nearest Pentium 4. Let’s look at some selected results from the individual picCOLOR tests to see why that is.

The Athlons all excel at the GraphCopy and AddressMem functions, as well as Fixed Interpolation. Otherwise, the Pentium 4 processors are very competitive. The Athlon 64 3200+ trips up in the video tests—this is a familiar problem with the K8T800 chipset in this test. We’re unsure what the cause is.

AMD’s Athlon 64 processors are very impressive performers. They inherit all the strengths of the Athlon XP, but few of the weaknesses. For a long while, the give-and-take between the Pentium 4 and Athlon XP involved a kind of imbalance, with the Pentium 4 dominating in certain types of benchmarks while the Athlon XP dominated in others. No more. With very fast memory access and SSE2 support, the Athlon 64 chips match up well against the P4 in nearly every way. Our set of benchmarks is a little heavy on 3D rendering, where optimizations for SSE2 and Hyper-Threading bolster the Pentium 4, but overall, the Athlon 64 FX-51 stakes a strong claim to the title of fastest x86 processor. The FX-51 is so flat-out quick in 3D gaming, one wonders whether the Pentium 4 3.2GHz Extreme Edition doesn’t exist just to save face for Intel. Were it not for the Extreme Edition’s copious amounts of L3 cache, the Athlon 64 FX-51 would nearly have run the table in our gaming tests.

The P4 Extreme Edition does hold its own against the Athlon 64 FX, and you have to like Intel’s willingness to mine its Xeon line for extra desktop performance. I am a little surprised by the breadth of the benchmarks in which the Extreme Edition’s massive amounts of on-chip cache improve performance over the stock Pentium 4, especially the games. When you can practically load Quake III into cache and execute it, though, good things are bound to happen. Let’s hope Intel follows through with sufficient volumes and somewhat reasonable pricing on the P4 Extreme Edition. It shouldn’t cost a penny more than the Athlon 64 FX-51, especially because the Extreme Edition seems to heat up our test labs noticeably more than any other CPU we’ve tested. That’s just a seat-of-my-pants evaluation, but I swear, the seat of my pants got pretty sweaty.

For those of us with more pedestrian spending limits, the Athlon 64 3200+ looks like a great value. Yes, it costs over 400 bucks, but the stock Pentium 4 3.2GHz is selling for more than $600 right now. The Athlon 64 3200+ maybe trails the P4 3.2GHz in overall performance by the thinnest of margins, but no way is the P4 worth another $150 to $200. And that’s without considering the 64-bit question.

In fact, we’ve barely scratched the surface of the 64-bit issue beyond confirming that the Windows 64-bit pre-beta seems to run 32-bit code reasonably well. AMD supplied some 64-bit test apps with the Athlon 64 FX-51 review system, but I’m afraid we spent too much time investigating new graphics chips to devote proper attention to the Athlon 64’s AMD64 extensions. We’ll have to look at that in a future article. Of course, the true test of 64-bit performance will come with a release OS and real 64-bit applications, assuming they become available. AMD seems to be making all the right moves to garner support for AMD64, but this is new territory. We’re all wondering how successful AMD’s 64-bit initiative will be, and only time will tell.

All in all, Hammer translates surprisingly well to the desktop. That didn’t seem like a foregone conclusion when the first Opterons arrived this past spring at lower clock frequencies, but the Hammer core scales exceedingly well with clock speed. So long as AMD can ramp up supply of Athlon 64 chips at a decent pace and keep raising clock speeds to counter Intel’s upcoming Prescott core, it looks like a winner. 
