Even so, the Athlon 64 FX and the P4 EE share a common heritage: they are plucked from the top end of each manufacturer’s professional workstation/server lineup of chips. Specifically, the Athlon 64 FX-51 is a 940-pin chip nearly identical to an Opteron, with the exception that it runs at 2.2GHz. Before today, the fastest Opteron ran at 2.0GHz. No more. The Opteron 148 chip arrives at 2.2GHz as the workstation equivalent of the Athlon 64 FX. The 248 and 848 models, meanwhile, raise the prospect of servers and workstations with multiple processors (chips just like the Athlon 64 FX) running in tandem.
Now, grab your drool rag and ride along as we put the Opteron 148 and 248 CPUs through their paces to see how they measure up as workstation processors.
Introductions and preliminaries
The Opteron 148 is intended for single-processor workstation PCs, while the 248 model can run in pairs. Otherwise, these things are the same basic product: Opterons running at 2.2GHz. They’re based on AMD’s eighth-generation Hammer microarchitecture, with extensions to support 64-bit computing and all the rest. Since the Opteron x48 series is just a speed bump, I won’t belabor the point. You can read our introduction to the Hammer architecture here.
I should note a few things, however, before we dive into the test results. First, it really is the case that the Opteron 148 and the Athlon 64 FX-51 are essentially the same product with different names. They both run at 2.2GHz; both nestle into 940-pin sockets; both support dual channels of DDR400 memory. They are, I suppose, aimed at slightly different markets, but both products compete with the Pentium 4, which straddles the high-end desktop and low-end workstation segments. So I dunno. To make things simpler, I’ve included only one set of benchmarks for the Athlon 64 FX-51 and Opteron 148 chips. You can read the results together, because they perform the same.
Next, the Opteron 146, 148, 246, and 248 chips in our test results all come from systems with DDR400 memory, while the Opteron 140 and 240 setups use DDR333 memory. AMD has been a little fuzzy on DDR400 support in the Opteron line, but the company has now decided to endorse DDR400 officially for the x46 models and above. Of course, Opterons still require registered DIMMs, whatever the speed.
AMD’s test kit for the Opteron 148/248 included an MSI 9130 motherboard, a very decent motherboard that we’ve used in the past for some of our testing. The 9130 is based on VIA’s K8T800 chipset, which has proven faster than NVIDIA’s nForce3 Pro for most tasks. However, the MSI 9130 has one glaring weakness: it doesn’t have DIMM slots hanging off its second CPU socket, so the second processor’s built-in memory controller doesn’t have access to any local memory. As a result, CPU 1 will always have to go through CPU 0 in order to access memory, and half of our dual Opteron’s potential memory bandwidth can’t be realized. The MSI 9130 isn’t unique in this regard; at present, most dual Opteron workstation boards use this sort of memory configuration.
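To put a rough number on what that missing bank of DIMM slots costs, the arithmetic below works out theoretical peak bandwidth for DDR400 channels. It is only a sketch: it assumes standard 64-bit DDR channels and decimal gigabytes, the function name is ours, and the figures are ceilings rather than anything we measured.

```python
# Theoretical peak bandwidth of DDR memory channels (an upper bound, not a measurement).

BYTES_PER_TRANSFER = 8  # one DDR channel is 64 bits wide


def peak_bandwidth_gbs(transfers_mhz: float, channels: int) -> float:
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes) for `channels` 64-bit DDR channels."""
    return transfers_mhz * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


# One Opteron memory controller with dual-channel DDR400, as on a board
# where only CPU 0 has DIMM slots attached.
one_controller = peak_bandwidth_gbs(400, channels=2)

# Hypothetical board with DIMMs on both sockets: each CPU drives its own two channels.
two_controllers = 2 * one_controller

print(f"DIMMs on one socket:   {one_controller:.1f} GB/s")
print(f"DIMMs on both sockets: {two_controllers:.1f} GB/s")
```

Real throughput lands below these ceilings, and on a dual-socket board any traffic from CPU 1 also has to cross the HyperTransport link to reach CPU 0's memory.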
Perhaps it’s just as well, because the Windows XP Pro kernel isn’t aware of the Opteron’s non-uniform memory access (NUMA) architecture, and thus can’t take full advantage of an optimally configured Opteron rig. What’s more, the 64-bit version of Windows XP has been delayed, so we aren’t likely to see a truly optimal version of Windows for the Opteron for quite some time. The Opteron will have to live with some handicaps in our testing, but these are the same handicaps many real Opteron workstations are likely to face.
Speaking of handicaps, the only Intel Xeon chips we’ve included as foils to the Opteron 248 series run at 2.66GHz with 512K of L2 cache. Nowadays, top workstation Xeons come at 3.2GHz with 1MB of L3 cache. I wish we could have included those chips in our review, but Intel tends not to make its Xeon chips available to the media for comparisons like this one, despite our best efforts. And at about a grand a pop, we’re not too keen on buying the latest and greatest Xeons each time we conduct a review. So, uhm, sorry.
As consolation, we’ve included results for the ultra-expensive Pentium 4 3.2GHz Extreme Edition, which has 2MB of L3 cache, an 800MHz bus, and dual-channel DDR400 memory. This single CPU is even more exotic than the top Xeon workstation chip. So there. We have also included a few other top-end desktop processors, because those chips traditionally bleed into the single-processor workstation segment.
With all of that out of the way, let’s dive into the benchmark results, which generally speak for themselves.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least twice, and the results were averaged.
Our test systems were configured like so:
|Processor||Athlon XP ‘Barton’ 3200+ 2.2GHz||Athlon 64 3200+ 2.0GHz||Opteron 146 2.0GHz; Athlon 64 FX-51 2.2GHz; 2 x Opteron 246 2.0GHz; 2 x Opteron 248 2.2GHz||Opteron 140 1.4GHz; 2 x Opteron 240 1.4GHz||Pentium 4 3.2GHz; Pentium 4 3.2GHz Extreme Edition||2 x Xeon 2.66GHz|
|Front-side bus||400MHz (200MHz DDR)||HT 16-bit/800MHz downstream; HT 16-bit/800MHz upstream||HT 16-bit/800MHz downstream; HT 16-bit/800MHz upstream||HT 16-bit/800MHz downstream; HT 16-bit/800MHz upstream||800MHz (200MHz quad-pumped)||533MHz (133MHz quad-pumped)|
|Motherboard||Asus A7N8X Deluxe v2.0||MSI K8T Neo||MSI 9130||MSI 9130||Abit IC7-G||Tyan Tiger i7505|
|North bridge||nForce2 SPP||K8T800||K8T800||K8T800||82875P MCH||E7505 MCH|
|South bridge||nForce2 MCP-T||VT8237||VT8237||VT8237||82801ER ICH5R||82801DB ICH4|
|Chipset drivers||nForce Unified 2.45||4-in-1 v.4.49||4-in-1 v.4.49||4-in-1 v.4.49||INF Update 5.0.1015||INF Update 5.0.2|
|Memory size||1GB (2 DIMMs)||768MB (3 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)||1GB (4 DIMMs)|
|Memory type||Corsair TwinX XMS4000 DDR SDRAM at 400MHz||Corsair XMS3200 DDR SDRAM at 400MHz||Infineon PC3200 registered ECC DDR SDRAM at 400MHz||Infineon PC2700 registered ECC DDR SDRAM at 333MHz||Corsair TwinX XMS4000 DDR SDRAM at 400MHz||Corsair TwinX XMS3200LL DDR SDRAM at 266MHz|
|Hard drive||Seagate Barracuda V 120GB ATA/100||Seagate Barracuda V 120GB SATA 150||Seagate Barracuda V 120GB SATA 150||Seagate Barracuda V 120GB SATA 150||Seagate Barracuda V 120GB SATA 150||Seagate Barracuda V 120GB ATA/100|
|Graphics||GeForce FX 5900 Ultra|
|OS||Microsoft Windows XP Professional|
|OS updates||Service Pack 1, DirectX 9.0b|
Sorry about the 768MB of RAM in the Athlon 64 3200+ system. I couldn’t get it to boot with either pair of 512MB DDR400 DIMMs I had on hand, and its motherboard had only three DIMM slots, so 768MB was as close as we could come. I don’t believe this difference in memory size should affect any of the benchmarks we used.
All tests on the Pentium 4 systems were run with Hyper-Threading enabled.
Thanks to Corsair for providing us with memory for our testing. If you’re looking to tweak out your system to the max and maybe overclock it a little, Corsair’s RAM is definitely worth considering.
The test systems’ Windows desktops were set at 1152×864 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- Cachemem 2.65MMX
- SiSoft Sandra MAX3! (2003.7.9.73)
- Compiled binary of C Linpack port from Ace’s Hardware
- Discreet 3ds max 5.1 SP1
- NewTek Lightwave 7.5
- Cinebench 2003
- POV-Ray for Windows v3.5
- PICCOLOR v4.0 build 451
- SPECviewperf 7.1
- ScienceMark 2.0 beta (06SEP03-A build)
- Sphinx 3.3
- LAME 3.93.1 (build from mitiok.cjb.net)
- Xmpeg 5.0.1 with DivX Video 5.05
- FutureMark 3DMark03 build 330
- Comanche 4 demo
- Quake III Arena v1.31
- Serious Sam SE v1.07
- Unreal Tournament 2003 demo v.2206
- Wolfenstein: Enemy Territory v2.55
All the tests and methods we employed are publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
The Opteron 148 and the Pentium 4 3.2GHz are vying for the top spot in our synthetic memory bandwidth tests, splitting the spoils. Notice how the dual Opteron systems are generally slower in memory benchmarks than the single-processor Opteron rigs. We have come to expect this behavior when only one of the two processors’ built-in memory controllers has DIMM slots attached. The slowdown isn’t huge, but the need for non-local memory access does have an impact. Then again, the Xeons are quite a bit slower, and newer Xeons still have the same bus speed and memory architecture.
Linpack shows us the L1, L2, and L3 caches at work, including the crazy-mad Extreme Edition’s jaggy orange line that goes flying off the charts. If you are planning to plunk down a grand for a P4 EE, you might want to frame a picture of this Linpack graph to remind yourself what you’re buying.
The integrated memory controller in the Opteron chips yields extremely low memory access latencies, as cachemem shows here. Once again, the need for non-local memory access on the dual-processor Opteron systems slows performance a bit.
By adding a third dimension, we can investigate the cachemem latency results in more detail. I’ve color-coded the various cache levels (L1 is yellow, L2 amber, L3 red, and main memory is orange) to make for easier reading. The graphs are presented in rough order of overall access latency, from highest to lowest.
Even with the MSI 9130’s sub-optimal memory configuration, the Opteron 248 comes out looking good in cachemem’s access latency test, especially compared to the Xeons.
Sphinx speech recognition
Ricky Houghton first brought us the Sphinx benchmark through his association with speech recognition efforts at Carnegie Mellon University. Sphinx is a high-quality speech recognition routine that needs the latest computer hardware to run at speeds close to real-time processing. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance.
There are two goals with Sphinx. The first is to run it faster than real time, so real-time speech recognition is possible. The second, more ambitious goal is to run it at about 0.8 times real time, where additional CPU overhead is available for other sorts of processing, enabling Sphinx-driven real-time applications.
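Those two targets boil down to a simple ratio of processing time to audio length. Here is a minimal sketch of that calculation; the function names and the sample timings are ours for illustration and are not taken from Sphinx or from our results.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of decode time to audio length; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def classify(rtf: float, headroom_target: float = 0.8) -> str:
    """Label a real-time factor against the two goals described above."""
    if rtf <= headroom_target:
        return "real time with headroom"
    if rtf < 1.0:
        return "real time"
    return "slower than real time"


# Illustrative numbers only, not results from this review.
for seconds in (45.0, 55.0, 70.0):
    rtf = real_time_factor(seconds, audio_seconds=60.0)
    print(f"{seconds:5.1f}s to decode 60s of audio -> {rtf:.2f}x, {classify(rtf)}")
```

Lower is better here: a factor of 0.8 leaves roughly a fifth of the CPU's time free for whatever else a speech-driven application needs to do.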
The Opterons push the top Pentium 4 chips hard in Sphinx, but the Xeons are near the bottom of the pack.
LAME MP3 encoding
We used LAME 3.92 to encode a 101MB 16-bit, 44kHz audio file into a very high-quality MP3. The exact command-line options we used were:
lame --alt-preset extreme file.wav file.mp3
Obviously, LAME isn’t multithreaded, so single-processor results are very similar to those from duallies. The Pentium 4 chips take the lead, but interestingly, the K7 and K8 chips at 2.2GHz all group together here.
DivX video encoding
Xmpeg is partially self-tuning, and we noted that it chose the SSE2 Optimized iDCT on the Hammer processors.
We’ve finally found a system capable of outrunning a Pentium 4 at video encoding! Of course, it took two CPUs to do it, but still, the dual Opteron 248 system really cuts down on encoding times. Impressive.
3ds max rendering
We begin our 3D rendering tests with Discreet’s 3ds max, one of the best known 3D animation tools around. 3ds max is both multithreaded and optimized for SSE2. We rendered a couple of different scenes at 1024×465 resolution, including the Island scene shown below. Our testing techniques were very similar to those described in this article by Greg Hess. In all cases, the “Enable SSE” box was checked in the application’s render dialog.
The dual Opteron systems produce the lowest render times, and the Opteron 148 puts in a respectable showing, as well.
NewTek’s Lightwave is another popular 3D animation package that includes support for multiple processors and is highly optimized for SSE2. Lightwave can render very complex scenes with realism, as you can see from the sample scene, “A5 Concept,” below.
I’ve included results for various thread counts on the multiprocessor systems and on the Intel systems with Hyper-Threading. Lightwave isn’t self-tuning; users must pick the number of render threads manually. Unfortunately, my results are a little bit fragmentary for the dual Xeon and Opteron 240 systems. I’d already allowed the Opteron 240s to be carted off and subjected to Andy’s evil overclocking experiments by the time I realized I needed to do more testing.
The addition of SSE2 support makes AMD’s Opteron processors very competitive with the Pentium 4 in Lightwave. The dual Opteron 248 system renders our test scenes fastest, but the dual Xeon 2.66GHz system is only a few tenths of a second behind it.
POV-Ray is the granddaddy of PC ray-tracing renderers, and it’s not multithreaded in the least. Don’t ask me why; seems crazy to me. POV-Ray also relies more heavily on x87 FPU instructions to do its work, because it contains only minor SIMD optimizations.
The “official” POV-Ray benchmark continues to be a bit of a surprise to me. Our humble chess scene, which we’ve used for ages, clearly renders fastest on the AMD chips. The POV-Ray benchmark uses a wider range of POV-Ray rendering engine features, and it does especially well on the Intel processors. Most puzzling of all is the Opteron’s relatively poor performance in the benchmark compared to the Athlon XP. Not sure what to make of that.
Cinebench 2003 rendering and shading
Cinebench is based on Maxon’s Cinema 4D modeling, rendering, and animation app. This revision of Cinebench measures performance in a number of ways, including 3D rendering, software shading, and OpenGL shading with and without hardware acceleration.
Cinema 4D’s renderer is multithreaded, so it takes advantage of Hyper-Threading and multiple processors. I’ve reported the multi-threaded results, which in all cases were notably faster, for all multiprocessor and Hyper-Threaded systems.
The Cinebench renderer has always run well on Pentium 4 and Xeon chips, but the Opteron 248 is just able to edge out the dual Xeon rig. Of course, the Xeons are a few steps down in the Intel product line, closer to the Opteron 240. The Opteron 148 can’t keep pace with the Pentium 4 3.2GHz.
The rest of the results are more mixed, and these tests are quite apparently not multithreaded.
SPECviewperf workstation graphics
SPECviewperf simulates the graphics loads generated by various professional design, modeling, and engineering applications.
Be aware that this is the one benchmark most influenced by the fact we’re not testing with a workstation-class graphics driver like those bundled with Quadro or FireGL cards. Nonetheless, the Opteron chips dominate the viewperf suite.
I’d like to thank Alex Goodrich for his help working through a few bugs in the 2.0 beta version of ScienceMark. Thanks to his diligent work, I was able to complete testing with this impressive new benchmark, which is optimized for SSE, SSE2, and 3DNow!, and is multithreaded as well.
In the interest of full disclosure, I should mention that Tim Wilkens, one of the originators of ScienceMark, now works at AMD. However, Tim has sought to keep ScienceMark independent by diversifying the development team and by publishing much of the source code for the benchmarks at the ScienceMark website. We are sufficiently satisfied with his efforts, and impressed with the enhancements to the 2.0 beta revision of the application, to continue using ScienceMark in our testing.
The Opteron processors excel in ScienceMark’s simulation tests, especially when run in pairs. The tests below, however, are more like the Linpack test we saw earlier. These matrix multiplication tests use various flavors of code optimization in order to achieve peak performance. SGEMM is a single-precision floating-point math test, and DGEMM uses double-precision FP datatypes.
Even the 148 and 248 Opterons can’t beat the Pentium 4 when running its ideal codepath. However, the Opterons produce similarly strong performance with multiple types of optimizations.
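For readers unfamiliar with GEMM tests, here is a bare-bones sketch of what such a benchmark measures: a matrix multiply, timed and converted to floating-point operations per second. This is not ScienceMark's code. It is naive, interpreted Python with names of our own choosing, so the throughput it reports says nothing about what tuned SSE2 or 3DNow! code can do.

```python
import time
from array import array


def gemm(a, b, n, typecode):
    """Naive n x n matrix multiply on row-major flat arrays; returns C = A * B."""
    c = array(typecode, [0.0] * (n * n))
    for i in range(n):
        for k in range(n):
            aik = a[i * n + k]
            for j in range(n):
                c[i * n + j] += aik * b[k * n + j]
    return c


def mflops(n, seconds):
    """An n x n GEMM performs 2 * n**3 floating-point operations."""
    return 2 * n**3 / seconds / 1e6


# array('f') stores 32-bit floats and array('d') stores 64-bit floats. Python still
# does the arithmetic in double precision, so the 'f' case only approximates
# single-precision math by rounding each stored result.
n = 32
for name, typecode in (("SGEMM-like", "f"), ("DGEMM-like", "d")):
    a = array(typecode, [1.0] * (n * n))
    b = array(typecode, [2.0] * (n * n))
    start = time.perf_counter()
    c = gemm(a, b, n, typecode)
    elapsed = time.perf_counter() - start
    print(f"{name}: {mflops(n, elapsed):.1f} MFLOPS (interpreted Python, not tuned code)")
```

The 2n³ operation count is what lets a benchmark turn a wall-clock time into a MFLOPS figure, whichever codepath is doing the multiplying.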
picCOLOR image analysis
We thank Dr. Reinert Muller with the FIBUS Institute for pointing us toward his picCOLOR benchmark. This image analysis and processing tool is partially multithreaded, and it shows us the results of a number of simple image manipulation calculations. The overall score is indexed to a Pentium III 1GHz system based on a VIA Apollo Pro 133. In other words, the reference system would score a 1.0 overall.
This benchmark obviously runs best on Opteron chips. The test’s multithreaded functions perform very well on the dual-CPU systems, too.
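To illustrate what "indexed to a reference system" means, here is one plausible way such a score could be put together. We don't know how picCOLOR actually weights its individual tests, so treat the averaging below, the test names, and the timings as assumptions made purely for illustration.

```python
from statistics import mean


def indexed_score(reference_times, measured_times):
    """Average speedup over the reference system across the named tests."""
    speedups = [reference_times[name] / measured_times[name] for name in reference_times]
    return mean(speedups)


# Hypothetical timings in seconds; the test names are made up for illustration.
reference = {"rotate": 4.0, "blur": 2.0, "edge_detect": 3.0}
candidate = {"rotate": 1.0, "blur": 1.0, "edge_detect": 1.0}

print(indexed_score(reference, reference))  # the reference system indexes to 1.0
print(indexed_score(reference, candidate))
```

Whatever the real weighting, the idea is the same: a system that scores 3.0 got through the workload about three times as fast as the reference machine.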
Quake III Arena
Just for fun, we’ve run some gaming benchmarks on these workstation CPUs. Obviously, an Opteron-based system should make a decent general-purpose PC, as well, so what the heck. We begin with Quake III Arena, one of the few games able to take advantage of a second processor. The ‘r_smp 1’ notation means the game’s multiprocessing support is enabled.
Wow. The Opteron x48 chips are very fast in Quake III, especially with multiprocessing support turned on. The P4 Extreme Edition remains fastest in Quake III, though, by virtue of its ability to run seemingly the whole darned game from its L3 cache.
Since we’re way too serious to talk about all of these games individually, I’ll let the rest of the results pass by without too much commentary. We’ll sum up at the end of our gaming tests.
Unreal Tournament 2003
Wolfenstein: Enemy Territory
Serious Sam SE
If, once your day of 3D modeling or CAD/CAM engineering is done, you decide to fire up a game and relax, the Opteron 148 and 248 chips will give you one of the baddest gaming systems on the planet. Shhh. We won’t tell if you won’t.
The last time we looked at a pair of Opteron chips running in SMP, we were comparing the Opteron 240 to the Xeon 2.66GHz, and frankly, the Opteron 240 got its tail whipped. This time around, it’s a very different story. Thanks in part to its built-in memory controller, the Opteron’s performance seems to scale up exceptionally well with clock speed increases. Overall, a single Opteron 148 is every bit as fast as a Pentium 4 3.2GHz Extreme Edition chip, and running together, a pair of Opteron 248s makes for the fastest workstation system we’ve ever laid hands on.
Yes, the Xeons we tested against it were “only” running at 2.66GHz, and higher clock speeds would help close the gap. Still, Intel hasn’t yet raised the Xeon’s bus speed or memory clock speed to match the Pentium 4’s. With a 533MHz bus and dual-channel DDR266 memory, even the fastest Xeon at 3.2GHz with 1MB of L3 cache isn’t likely to match the Opteron in memory-intensive tasks like video encoding or speech recognition. Our low-level memory tests tell the story here. Intel needs to deliver its E7210 chipset (code-named Canterwood ES) in order to remain competitive with AMD’s Opterons.
For its part, AMD needs to continue pushing to fulfill its vision for this platform. We have been playing with a dual Opteron workstation that uses a real four-channel memory config, and we’ll be bringing you those results soon. Unfortunately, though, the 64-bit version of Windows remains rather elusive. One of the biggest advantages of the Opteron for the workstation world is its ability to address more than 4GB of RAM, but that ability remains confined to Linux for the time being. A real NUMA-aware kernel for Windows XP would be a nice interim measure. Already, there’s one available in Windows Server 2003, but that OS isn’t really tuned for workstation use.
The price of admission for leading-edge performance in the workstation realm is never cheap, and AMD’s new Opterons are no exception. The Opteron 148 will list at $733, or exactly the same price as its desktop-bound cousin, the Athlon 64 FX-51. The dually chip, the Opteron 248, will sell for $913 a pop. You will, of course, need two. I expect most buyers will purchase complete systems from AMD partners like APPRO or BOXX Technologies. What they’ll get for their money is one of the fastest x86-based systems anywhere.