With dual-core CPUs firmly established, neither manufacturer has been talking much about single-core processors in the workstation market, but the high-end single-core processor still has a hand or two to play before it makes a final exit. AMD’s recent stealth launch of the Opteron 254 at 2.8GHz ensures that the limelight stays focused on their dual-core products, while the Opteron 254 offers an intriguing option for buyers caught on the fence between lower-speed dual-core and higher-clocked single-core processors.
At $851 per chip, the Opteron 254 is positioned well above the Opteron 265 (1.8GHz dual core) at $690 and directly opposite the Opteron 270 (2GHz dual core) at $851. The Opteron 275 and Pentium Extreme Edition 840 top out at roughly $1000 each, and the Xeon 3.6GHz is actually the cheapest of the processors we compare here, at only $690 per core. There’s not much new to talk about when comparing the price-to-performance ratio of a single Opteron 254 against a dual-core Opteron 270. In applications or test suites that are primarily or entirely single-threaded, the 254 would stomp the dual-core processor. In applications or suites that favor multi-threading, the dual-core processor would return the favor, with an added helping of creamy smoothness as an additional purchase incentive.
When we start talking about 2P vs. 4P, however, things get more interesting.
The Opteron 254
Why two is sometimes better than four
AMD and Intel may have declared that thread-level parallelism and dual-core CPUs are the future of computing, but the software transition required in order to make such claims ring true will be anything but quick and painless. Until Intel launched Hyper-Threading in 2002, there was little to no reason to create multi-threaded desktop software, since no desktop CPUs could take advantage of it.
The proliferation of Hyper-Threaded CPUs gave desktop-oriented software companies an incentive to implement multi-threading—but only to a point. There are a number of applications we could refer to as strictly dual-threaded, or, in some cases, what I call “minimally multi-threaded”. The latter term refers to programs that were designed with Hyper-Threading in mind, and deliver the same 10%-20% speed boost whether they are run on a dual-core or Hyper-Threaded CPU.
The advent of dual-core products gives the software industry new reason to optimize mainstream products for multi-threading, but the cost and effort required to do so can be significant. Generally speaking, the more parallel the code, the more difficult and time-consuming it is to create. Spin off enough threads, and you’ll inevitably begin to create other bottlenecks within the system, the sum total of which could actually degrade performance. As impressive as a dual-core, dual-CPU system is on paper, there’s no guarantee that an application billed as “multi-threaded” will actually take advantage of all four cores. Even if it does, quad-core support isn’t an inherent guarantee of scalability. Several of the applications we’ll examine today are capable of using all four cores in a quad-core system, but don’t demonstrate anything close to linear scaling when compared to a dual-core system. In such scenarios, the Opteron 254’s 27% clock advantage over the Opteron 275 may prove to be more beneficial than the Opteron 275’s additional cores.
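The trade-off between clock speed and core count can be roughed out with Amdahl's law. The sketch below is an illustrative model rather than a measurement: it assumes performance scales linearly with clock speed (real workloads only approximate this), and `parallel_fraction` is a hypothetical parameter describing how much of a workload can be spread across cores. The clock speeds are the 2.8GHz of the Opteron 254 and the 2.2GHz of the Opteron 275.

```python
def relative_throughput(clock_ghz, cores, parallel_fraction):
    """Amdahl's-law throughput estimate in arbitrary units.

    Assumption: performance scales linearly with clock speed.
    """
    serial_fraction = 1.0 - parallel_fraction
    speedup = 1.0 / (serial_fraction + parallel_fraction / cores)
    return clock_ghz * speedup


def break_even_parallel_fraction(fast_clock, fast_cores, slow_clock, slow_cores):
    """Parallel fraction at which the two configurations tie in this model."""
    numerator = fast_clock - slow_clock
    denominator = (fast_clock * (1.0 - 1.0 / slow_cores)
                   - slow_clock * (1.0 - 1.0 / fast_cores))
    return numerator / denominator


# Dual Opteron 254 (two 2.8GHz cores) vs. dual Opteron 275 (four 2.2GHz cores).
tie = break_even_parallel_fraction(2.8, 2, 2.2, 4)
print(f"Break-even parallel fraction: {tie:.2f}")

for p in (0.3, 0.6, 0.9):
    dual_254 = relative_throughput(2.8, 2, p)
    dual_275 = relative_throughput(2.2, 4, p)
    print(f"parallel={p:.1f}  dual 254={dual_254:.2f}  dual 275={dual_275:.2f}")
```

Under these simplifying assumptions, the pair of 254s stays ahead until roughly 60% of the work can run in parallel; past that point the four slower cores pull away. Real applications add memory, cache, and synchronization effects that this model ignores.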
We’ve added Lightwave 8.3 and Sysmark 2004 SE to encompass a wider variety of rendering platforms and overall system performance tests. Any suggestions or comments you have regarding these tests, specific scenes, or suggestions on multi-threaded testing in general are welcome.
Our testing methods
As always, we’ve done our best to deliver clean benchmark results; all tests save for Sysmark 2004 SE and 3ds Max 7 were run at least three times, and the results were averaged. Sysmark and 3ds Max were each run three times, but both benchmarks pick the fastest run time of the three, rather than the average.
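As a minimal sketch of the two reporting rules described above (the function name and the sample timings are hypothetical, not our actual data):

```python
from statistics import mean


def summarize_runs(times_s, pick_fastest=False):
    """Reduce repeated benchmark timings (in seconds) to one reported figure.

    pick_fastest=False averages the runs; pick_fastest=True mimics a
    benchmark that reports only its best run.
    """
    if len(times_s) < 3:
        raise ValueError("expected at least three runs")
    return min(times_s) if pick_fastest else mean(times_s)


sample = [101.0, 99.0, 103.0]  # hypothetical timings
print(summarize_runs(sample))                     # averaged figure
print(summarize_runs(sample, pick_fastest=True))  # best-of-three figure
```

Best-of-three figures will always look at least as good as an average of the same runs, which is worth remembering when comparing results across the two groups of tests.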
Our Pentium Extreme Edition 840 picCOLOR results weren’t run on my own test platform, due to a hardware failure. Scott ran those tests himself on his own 955XBK board, using a Maxtor DiamondMax 10 and Crucial Ballistix DDR2 PC6400 512MB x4. The memory was set for 533 MHz and the same latencies as shown in our table. The results should be nearly identical, and Pentium Extreme Edition 840’s performance in that benchmark did not deviate significantly from expected results. Nevertheless, be aware that we had to substitute a different configuration in that test.
|Pentium Extreme Edition 840 3.2GHz
| Single Xeon 3.6GHz (Nocona 1MB)
Dual Xeon 3.6GHz (Nocona 1MB)
| Single Opteron 275
Dual Opteron 275
Single Opteron 254
Dual Opteron 254
|800MHz (200MHz quad-pumped)
|800MHz (200MHz quad-pumped)
|Tyan Thunder K8WE S2895
|INF Update 220.127.116.113
|INF Update 18.104.22.1683
|SMBus driver 4.48
IDE driver 5.34
|2GB (4 DIMMs)
|2 GB (4 DIMMs)
|2 GB (4 DIMMs)
|Corsair XMS2 PC4300C3Pro 512MB DDR2 SDRAM at 533MHz
|Infineon 512MB Registered ECC DDR2 SDRAM at 400MHz
|Kingston 512MB Registered ECC DDR SDRAM at 400MHz
|CAS latency (CL)
|RAS to CAS delay (tRCD)
|RAS precharge (tRP)
|Cycle time (tRAS)
|Seagate 7200.7 ST3120026AS
with SigmaTel 5.10.4456.0 drivers
|Soundmax ADI 1980
with Realtek 22.214.171.12420 drivers
|GeForce 6800 Ultra 256MB PCI-E with ForceWare 78.01 drivers
|Windows XP Professional x64 Edition
All Xeon and Extreme Edition benchmarks were run with Hyper-Threading enabled.
Power was provided by an Enhance ENS-0555 PSU (550W).
Tests were run in 1280×1024 (32-bit color) at an 85 Hz refresh rate. Vsync was disabled in all tests.
We used the following versions of our test applications:
- SiSoft Sandra 2005 SR2 10.60 64-bit
- ScienceMark 2.0 64-bit
- POV-Ray for Windows 3.7beta8 64-bit
- 3ds Max 7.1
- Cinebench 2003
- LAME MT 3.97a 64-bit
- Xmpeg 5.03 with
- DivX Video 5.2.1
- Windows Media Encoder 9
- Sphinx 3.3
- picCOLOR v4.0 build 561 64-bit
- Lightwave 8.3
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Sandra’s memory benchmark takes full advantage of the two memory controllers found in our dual Opteron 275 and dual Opteron 254 systems, with the 254’s higher frequency giving it a slight edge over the slower 275 rig.
The dual Xeon rig turns in abnormally low scores, considering it uses DDR2-400; the single Xeon 3.6’s memory performance is what we’d have expected from the dual system as well. In this case, the dual Xeon 3.6’s low Sandra memory bandwidth appears to be a bug in the benchmark rather than an indication of a real problem. Other memory bandwidth tests (including Sciencemark and Everest) returned significantly higher results than Sandra’s, and our other benchmark results gave no indication that the dual Xeon system was suffering from abnormally low memory performance.
Linpack is a concise test of L1, L2, and main memory performance packed into a single graph. The test is only single-threaded, so we’re only reporting results for the single-CPU configs.
We tested POV-Ray’s new 64-bit multi-threaded binary (3.7beta8). Since this software is still in beta, performance in future versions may change markedly. In fact, all of the systems tested here performed much slower when rendering in single-threaded mode using this beta than they did when we used a standard 64-bit version of POV-Ray combined with SMPOV in previous reviews.
In single-threaded mode, the dual 254 and single 254 systems effectively tie, followed by both the dual and single 275-based rigs. The dual Xeons and the Extreme Edition 840 bring up the rear. With two threads, dual Opteron 254 retains the lead, but the single 254 CPU drops into next-to-last place, and only beats the single Xeon 3.6. All the CPUs we tested continued to scale well until we exceeded the number of threads they could spin to a separate core or Hyper-Threaded logical processor, and then they more-or-less held constant. The four-thread single Xeon 3.6 result notably (and consistently) departed from this trend, but again, this is beta software.
The first 3ds Max test we’ll examine today is a multipass render of the Dragon_Character_Rig scene. This particular animation highlights (and tests) motion-blur performance; we rendered it in 320×240 resolution, using both 3ds max’s default scanline renderer and mental ray. The mental ray plug-in provided with 3ds max, however, only supports up to two processors, which dramatically affects scaling and performance.
We also used a benchmark suite developed by StudioPC. A detailed article discussing the test’s construction can be found here. The benchmark consists of three separate scenes, each chosen for its focus on an aspect of 3D rendering.
The “Ape” scene requires multiple render passes in order to create realistic motion blur. Shadow mapping and multiple light sources are the other two major features of this demo scene.
“Architecture” uses large amounts of glass and water, both of which create a number of reflections. This scene is primarily a test of ray-tracing performance.
The final test, “Radiosity”, extensively models this lighting technique. Roughly 80% of the scene’s total computation time is spent on radiosity calculations, 15% for area shadows, and 5% handles scan-line rendering and ray-tracing calculations.
Our dual Opteron 275 rig is the obvious winner across all three scenes, but there are some interesting scaling differences between the three. Ape doesn’t scale as well when moving from the single Opteron 275 to the dual; total computation time only drops about 28% when the other two cores are added. The single 254 to dual 254 change cuts render time by 36%; the Xeons are close behind with their own 34% drop.
The other two scenes scale more readily, particularly radiosity. The dual Opteron 275 renders in 61% of the time of the single 275, with the other CPUs we tested turning in similar speed increases. Overall, the dual Opteron 254 proved significantly faster than a single Opteron 275, and surpassed the dual Xeon 3.6GHz by a small margin. The single Opteron 254 manages to best the single Xeon 3.6GHz, but it lags behind all of our other test systems significantly.
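To put those render-time reductions in terms of speedup, here is a short conversion sketch. The helper names are our own, and the comparison against an ideal 2x assumes that doubling the processor count could at best halve the render time.

```python
def speedup_from_time_reduction(reduction_pct):
    """Convert 'render time drops by N%' into a speedup factor."""
    return 1.0 / (1.0 - reduction_pct / 100.0)


def scaling_efficiency(speedup, resource_multiplier=2.0):
    """Fraction of the ideal speedup achieved when resources are multiplied."""
    return speedup / resource_multiplier


# Render-time reductions quoted in the text when the second CPU is added.
quoted_reductions = {
    "Opteron 275, Ape": 28.0,
    "Opteron 254, Ape": 36.0,
    "Xeon 3.6, Ape": 34.0,
    "Opteron 275, Radiosity": 39.0,  # 61% of the single-CPU time
}

for label, pct in quoted_reductions.items():
    s = speedup_from_time_reduction(pct)
    print(f"{label}: {s:.2f}x speedup, {scaling_efficiency(s):.0%} of ideal")
```

A 28% drop in render time works out to roughly a 1.39x speedup, well short of the 2x that perfect scaling would deliver.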
Our Lightwave 8.3 rendering benchmarks focus on raytracing and radiosity, as well. Lightwave isn’t yet native in 64 bits, but Newtek has announced Lightwave 64, which will be native to Windows XP x64 Edition and include increased support for multi-core solutions. LW8.3 runs nearly flawlessly in WinXP x64 Edition, though we did have one problem: the Skull_Head_Newest scene (first included in Lightwave 7.5 as a radiosity demonstration) will fail on a DLL call if run after some of the other demos in LW8.3 that utilize radiosity.
Lightwave has always scaled well when run on SMP or HT-enabled configurations, and our benchmark results continue to bear that out. The dual Opteron 275 again sweeps the suite, significantly outperforming second-place dual Xeon 3.6GHz. The Opteron 254 loses to the dual Xeon by only a hair’s breadth, and the Pentium Extreme Edition 840 is next, turning in by far the best performance out of our single-socket configurations.
Cinebench 2003 is a rendering benchmark developed by Maxon and freely available for download. The 32-bit binary has been available for years, but the 64-bit test is a relative newcomer. Since the two are directly comparable, and since the difference is significant, we decided to give you numbers from each.
Both sets of results scale as we’d expect: the dual Opteron 254 walks away with the strongest single-CPU render time (the single 254, for some reason, was just slightly slower) and the second strongest multi-CPU render time. In multi-CPU rendering, the Opteron 275 beats everything else to a pulp, while the single-threaded Cinema4D, OpenGL software, and OpenGL hardware rendering tests are all won by the single Opteron 254.
What’s eye-catching, however, is the rendering performance boost across all CPU configurations and both Intel and AMD systems when we move to 64 bits. The AMD systems scale slightly better, with scores increasing, on average, by about 30% versus the Xeon’s and Pentium Extreme Edition 840’s averages of about 21%. Scores in the OpenGL software and hardware rendering tests drop slightly; the OpenGL hardware drop is possibly due to less-than-perfect 64-bit graphics drivers. If the 64-bit binary versions of 3ds Max and Lightwave reflect similar jumps, rendering professionals may be very, very happy with Windows XP x64 Edition.
LAME MT is a dual-threaded version of LAME created by Raichshtain Gilad; his paper detailing the effort can be found here (in Word format). Instead of performing the actual encoding in parallel, LAME MT runs a psycho-acoustic analysis one frame ahead of the actual encode procedure. We’ve tested LAME MT as compiled natively for 64-bit operation, using compilers from both Intel and Microsoft, and with either variable bit rate or constant bit rate selected.
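The look-ahead arrangement described above can be sketched as a two-stage pipeline. This is an illustration of the general technique only, not LAME MT's actual code: `analyze` and `encode` are hypothetical stand-ins for the psycho-acoustic analysis and the encode step, and the frames are plain lists of numbers.

```python
import queue
import threading


def analyze(frame):
    """Hypothetical stand-in for psycho-acoustic analysis of one frame."""
    return sum(frame) / len(frame)


def encode(frame, analysis):
    """Hypothetical stand-in for encoding one frame using its analysis."""
    return (len(frame), analysis)


def pipelined_encode(frames):
    """Analyze frames on one thread while the caller's thread encodes them."""
    # A small bounded queue keeps the analysis thread from running far
    # ahead of the encoder.
    handoff = queue.Queue(maxsize=1)

    def analysis_worker():
        for frame in frames:
            handoff.put((frame, analyze(frame)))
        handoff.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=analysis_worker)
    worker.start()

    encoded = []
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        encoded.append(encode(frame, analysis))

    worker.join()
    return encoded


print(pipelined_encode([[1, 2, 3], [4, 5, 6]]))
```

Because the work is split into exactly two stages, a design like this can keep two cores busy but has nothing to offer a third or fourth, which is consistent with LAME MT being described as dual-threaded. In CPython the global interpreter lock means the sketch shows the structure rather than a real speedup.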
At two times 2.8GHz, the dual Opteron 254 continues to dominate the dual-threaded benchmarks, followed by the dual 3.6GHz Xeon. The choice of compiler largely determines how well the Extreme Edition 840 does. When we use the Intel compiler, regardless of how many threads we test, or whether we use CBR or VBR, the Extreme Edition stays right behind the Opteron 275, trailing it by only a second or two. Under the MS compiler, the Opteron 275’s performance remains identical, but the Pentium Extreme Edition degrades markedly, falling behind both of the Opteron 275-based systems.
We actually see this same pattern from the Xeon CPUs, but in their case, the dual Xeon 3.6 is still fast enough to win all but the MS VBR tests when compared against the single and dual Opteron 275s. The single Opteron 254 trots along steadily, unconcerned with thread allocation, and actually manages to outperform the HT-enabled single Xeon at 3.6GHz.
We used Xmpeg with DivX 5.2 to handle our conversion. DivX 6 installed properly on our 64-bit system, but Xmpeg was unable to begin encoding with the newer version. Since DivX 5.2 installed and ran perfectly, it isn’t clear whether this is a problem with Xmpeg or a compatibility issue with DivX 6.0. DivX Helium (the SMP-optimized encoder based on DivX 6.0) refuses to run under WinXP x64, and thus was unavailable.
Since Xmpeg is only dual-threaded, the Opteron 254’s faster raw clock speed outruns the 275’s greater number of cores. The Xeon 3.6 lags significantly behind the other systems, while the Extreme Edition 840 and both of the Opteron 275 systems are clustered together.
Windows Media Encoder 9
We encoded the 1080-line WMV file “Step Into Liquid” into a 320×240 streaming format. Audio was handled by the WMA 9 codec, while video encoding was done using the WMV 9 Advanced Profile codec.
WME 9 is capable of making some use of the Opteron 275’s additional resources. The Opteron 254 places second here, well ahead of the single 275 rig, but still a fair margin behind the dual-socket, dual-core Opteron 275s. The single-socket Opteron 254 outruns the single Xeon by nearly twenty seconds and also turns in the most significant scaling result: encode time falls by 34% when we add the second CPU.
Sysmark 2004 SE is an updated version of Sysmark 2004 built for Windows XP x64 Edition, though the programs and benchmark scenarios themselves are all strictly 32 bits. Our results show that at least some of the benchmark tests in Sysmark 2004 are multi-threaded, but it’s the Opteron 254 system in the lead here (by about 4%) over the Opteron 275.
Compare the single and dual Opteron 275 scores to the single and dual Opteron 254s, and you’ll see that a majority of the multi-threaded scenarios are tuned for a dual-CPU configuration. When we move from the Opteron 275 to the dual 275 (and thus, from two cores to four), we see only a 7% speed increase. Adding a second core to the Opteron 254 (for a total of two), on the other hand, boosts total performance by 27%. Similarly, the HT-enabled single Xeon gains a 20% performance boost when a second core is added.
We’re using a 64-bit beta of ScienceMark 2.0. Multi-threading is present, but isn’t fully optimized, as of yet. Again, we see the Opteron 254 turn in the fastest computation time in Moldyn, though in this case, even the slowest Opteron (the single 254) is faster than the fastest Intel. Primordia is also multi-threaded to a degree; the single and dual Opteron 275 systems tie at 266 seconds, while Opteron 254 takes 215 seconds to run the same series of computations. The dual Xeon is actually slightly faster than Opteron 275, while the single Opteron 254 drops right between single Xeon 3.6 and the Pentium Extreme Edition 840.
While we aren’t comparing against the 32-bit version, I wanted to note that Moldyn is one test that sped up tremendously when it jumped to 64 bits. In 64-bit mode, Moldyn is anywhere from three to five times faster than when run in 32-bit mode.
Sandra’s Mandelbrot test is designed to demonstrate the benefit of SIMD instruction sets, including MMX, SSE, and SSE2. The test is fully optimized for SMP (up to 64 processors) and operates in both integer and floating-point modes. In the Integer x16 test, integers are used to simulate floating-point math, while the Floating-Point x8 test uses SSE2 to process up to eight Mandelbrot iterations at the same time.
Our test results show exceptionally strong performance from the dual Xeon and Extreme Edition 840 CPUs; both outperform the dual Opteron 254, while the dual Opteron 275 still wins out overall, thanks to its four cores. Nonetheless, if you’re working with a carefully optimized program that uses a great deal of vector math, the dual Xeon or Extreme Edition may provide the best bang for your buck.
Sphinx speech recognition
It’s not entirely clear why this single-threaded Sphinx test runs slower on dual Opteron 275 than a single 275, or why it pulls the same trick again on the dual Opteron 254 versus the single 254. We ran the test repeatedly, particularly on the 254 system, and got the same results. In this case, it’s the single Opteron 254 that takes the lead in Sphinx, followed by the Pentium Extreme Edition 840. The remaining Opteron configurations follow, with both Xeon 3.6 systems straggling behind.
picCOLOR is an image viewer and analyzer that can be used for a number of scientific tasks, including particle flow analysis. It was created (and continues to be updated) by Dr. Reinhert H.G. Müller of the FIBUS Institute. picCOLOR is now 64-bit native, and is updated by Dr. Müller to take full advantage of CPU technology advances, including MMX, SSE, and SSE2. All scores below are normalized against a 1GHz PIII system.
picCOLOR makes strong use of dual-threading; a single Opteron 275 outperforms the significantly higher-clocked single 254 by a full 20%. Some of the individual tests take advantage of the Opteron 275’s four cores. Overall, however, it’s the dual Opteron 254s that carry the day. Interestingly, the Extreme Edition 840 outperforms the Xeon 3.6GHz chips by a fair amount.
All power measurements are taken at the wall socket, without the monitor included. Idle power consumption was measured while the CPU sat post-boot, doing nothing. Load consumption was measured while rendering in POV-Ray, with an equal number of threads selected for each CPU the system was configured with at the time.
Our Xeon-based systems draw far more power than the Extreme Edition 840 rig does at idle, possibly reflecting the Extreme Edition’s later design and Intel’s power improvements. Under load, the dual Xeon 3.6 system is only about 10% worse than the Opteron 254 or 275 configs, until you realize that the Opteron 275 is a four-core system, versus the other single-core, dual-CPU configurations.
It’s rather astounding that AMD can pack four 2.2GHz cores into the same power specification as two 2.8GHz cores when all the CPUs involved are working at full load. In rendering workloads, none of the other CPUs here look particularly good when measured against the Opteron 275’s performance-per-watt ratio, but the dual Xeons end up looking downright sickly.
At $851 per chip, the Opteron 254 is significantly cheaper than the Opteron 275, and in some cases, a better value. Which chip is the better buy turns entirely on the issue of workloads and software parallelization, but based on what we’ve seen, we can draw some general conclusions.
Most of the desktop-oriented software, media encoders, and analytical scientific software we tested here is either single-threaded or dual-threaded, and shows minimal gains from adding another two cores. If you intend to focus on this type of work, you may be better served by the faster Opteron 254s.
When it comes to 3D rendering performance, however, the Opteron 275 is the 800-pound gorilla; no other CPU or configuration that we’ve tested from AMD or Intel comes even close to competing with the Opteron 275’s results. The differences are more than academic. Consider, for a moment, the Opteron 275’s render time in Lightwave 8.3’s Radiosity_Box test (784 seconds) versus the Opteron 254 (1097 seconds) and the dual Xeon 3.6 (1017 seconds). That’s the time required to compute a single frame of animation, and animations typically run at 24 frames per second. A single second of animation that rendered at the same speed as Radiosity_Box would require 5.2 hours of render time on the Opteron 275, 6.8 hours on the dual Xeon 3.6GHz, and 7.3 hours on the Opteron 254. That’s a huge gap, and it makes a dual-socket, dual-core Opteron 275 rig an easy choice for any render station.
Ultimately it boils down to workloads and available funds. The Opteron 254 fits nicely into AMD’s product lineup, and a pair can be had for about $400 less than a pair of Opteron 275s. Depending on what you want to do, they might be quite a bargain.
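The hour figures above correspond to 24 frames, or one second of animation at 24 frames per second. A quick sketch of that arithmetic follows; the function name is our own, and it assumes every frame takes as long as the single Radiosity_Box frame we measured.

```python
FRAMES_PER_SECOND = 24


def render_hours(seconds_per_frame, animation_seconds=1.0, fps=FRAMES_PER_SECOND):
    """Total render time in hours, assuming every frame costs the same."""
    total_frames = fps * animation_seconds
    return seconds_per_frame * total_frames / 3600.0


# Per-frame Radiosity_Box render times quoted in the text, in seconds.
frame_times = {
    "Dual Opteron 275": 784,
    "Dual Xeon 3.6GHz": 1017,
    "Dual Opteron 254": 1097,
}

for system, seconds in frame_times.items():
    print(f"{system}: {render_hours(seconds):.1f} hours per second of animation")
```

Scaling `animation_seconds` up to a realistic shot length shows how quickly a gap of a couple of hours per second of footage turns into days of render time.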