AMD’s dual-core Opteron processors

MICROPROCESSORS ARE GETTING too hot, requiring too much power, and not delivering enough additional performance for it. That’s the basic problem. The engine that’s driven the microcomputer’s incredible rise in capability over the past 30 years, Moore’s Law, isn’t quite out of steam yet, but some of its offshoots are on the ropes. CPU designers have nearly exhausted their collective bag of tricks to get more performance out of additional transistors on a chip by increasing parallelism at the instruction level. Speculative execution and deep pipelining are by now very standard features, and CPU designs are getting increasingly complex and hard to manage. When Gordon Moore’s goose lays a golden egg and the number of transistors possible on a chip doubles, as it is supposed to do every 18 months, taking advantage of the windfall has proven increasingly difficult.

Cranking up clock speeds hasn’t helped much, either, because of transistor leakage problems. Chips are sucking up large amounts of power and expending much of it as heat, and the problem grows more acute as clock speeds ramp up. The most widely noted example of these problems, by far, has been at the company Moore co-founded. The power, heat, and speed problems of the “Prescott” core inside of recent Pentium 4 processors prompted Intel into an impressive and very public change of direction over the course of the past year. The company has sworn off the quest for 4GHz, shied away from clock speed as a measure of performance, and utterly rewritten its CPU roadmap.

AMD has not been entirely immune to these problems, but it has sidestepped their worst effects by keeping clock speeds down. The original Opteron processor debuted two years ago today at speeds up to 2GHz. Two years later, the same processors are available at 2.6GHz—only a 600MHz increase, not much in the grand scheme.

Fortunately, both AMD and Intel seem to have settled on an answer that should allow them to take advantage of ballooning transistor counts to gain additional performance: thread-level parallelism. By dialing back clock speeds and putting multiple CPU cores on a chip, the theory goes, processor performance can rise as transistor counts do. This sort of parallelism will, of course, be familiar to those who know a thing or two about Opteron processors, which have commonly been employed in pairs as part of server or workstation systems.

In fact, AMD says that the Opteron was designed from the outset with dual-core implementations in mind. The folks there are also quick to remind anyone who will listen that AMD was first to tape out an x86-compatible dual-core design and first to demonstrate such a beast in public. Today, they aim to be the first manufacturer to deliver dual-core x86 processors for workstations and servers, just days after Intel officially announced its first dual-core desktop processors.

We’ve had a pair of dual-core Opteron processors on the test bench for some time now, and we’re pleased to report some rather impressive results. AMD’s dual-core design is something more than just a pair of CPUs glued together on a single piece of silicon, and this design choice yields a performance dividend. Keep reading to see how the new Opteron 275 stacks up against its Opteron predecessors and against Intel’s latest “Nocona” Xeons. We also have a head-to-head battle of single-socket, dual-core workstation processors: the Opteron 175 versus the Pentium Extreme Edition 840.

The processors
On looks alone, one would be hard pressed to tell the difference between dual-core Opterons and their single-core counterparts.


A pair of Opteron 875 processors

They’re cosmetically identical, save for the slightly revamped model numbering scheme. The three-tiered processor series convention remains intact. 100 series processors are for single-socket systems, the 200 series for dual, and the 800 series is intended for 4-socket systems or better. However, instead of incrementing the tail end of the model number by two as clock speeds ramp up, as the Opterons 246, 248, and 250 did, the dual-core models will come in increments of five. The first dual-core Opterons will arrive at clock speeds of 1.8, 2.0, and 2.2GHz as models x65, x70, and x75, respectively.

Prices will vary according to whether the chips are part of the 100, 200, or 800 series and according to clock speeds, but the general plan for pricing is fairly straightforward: it’s almost as if AMD were introducing three new top-end speed grades at once. However, there is some overlap. For instance, the Opteron 252 is priced at $851, and the Opteron 265 will be priced the same. Consumers can choose whether they wish to purchase a dual-core processor at 1.8GHz or a single core at 2.6GHz for the same amount. The higher models will carry a premium, but AMD plans to bring the prices of dual-core Opterons down over time into the territory of the current single-core models.

The even better news for current owners of Opteron systems is that the dual-core Opterons will be pin-compatible with existing Socket 940 systems, capable of acting as drop-in replacements for current single-core models. The only requirement is that the motherboard must be able to support newer 90nm chips like the Opteron 252. If the board can do that, it should be able to handle the dual-core chips after a BIOS update, AMD claims. (Check with your motherboard maker to be sure.)

In order to pull off this impressive feat of backward compatibility, AMD had to make its dual-core parts fit into the same basic power and heat envelopes as its single-core processors. To do so, the company tweaked its fabrication process, using lower-leakage transistors that switch somewhat slower but waste less power, among other things. As a result, the Opteron 275 tops out at 2.2GHz, but it consumes no more power than the Opteron 252 at 2.6GHz.

This is one of the minor miracles of choosing thread-level parallelism over higher clock speeds. When we asked AMD CTO Fred Weber about how they managed to keep power and heat so low, he was coy about which specific optimizations AMD employed, but he offered some examples. When you’re not optimizing for the absolute best linear performance, he noted, many things are possible, including everything from changing the oxide thickness and transistor voltages to resizing buffers and more extensive clock gating.

To further manage heat and power, dual-core Opterons will support AMD’s PowerNow feature (also known as Cool’n’Quiet in the desktop world) that scales clock speeds and CPU voltages down at times of low CPU loads. This feature will function on a whole-chip basis; the CPU cores will not scale their clock speeds up and down independently.


A shot of the dual-core Opteron die. Source: AMD.

As for the chip itself, the dual-core Opteron will be manufactured on AMD’s 90nm process with silicon-on-insulator (SOI) technology. The chips will include all of the latest enhancements AMD has made to the K8 core, including SSE3 support and an improved memory controller with broader compatibility, improved memory loading, and more efficient memory mapping.

A dual-core Opteron chip packs in about 233 million transistors, and its die size is a very healthy 199 mm². The Intel Prescott/Nocona on which the Xeon is based is 112 mm² with roughly 125 million transistors. (The newer version with 2MB of L2 cache has 169 million transistors.) So the dual-core Opteron is large, but it’s also a very close match for Intel’s “Smithfield” dual core, which weighs in at roughly 230 million transistors and 206 mm², although estimates and methods of counting transistors can vary.

 

A closer look at AMD’s dual-core architecture
Let’s start by looking at a very simplified diagram of a dual-core Opteron, which looks like so:


How two CPU cores are situated together on a chip. Source: AMD.

As you can see, AMD didn’t simply glue a pair of K8 cores together on a single piece of silicon. They’ve actually done some integration work at a very basic level, so that the two CPU cores can act together more effectively. Each of the K8 cores has its own, independent L2 cache onboard, but the two cores share a common system request queue. They also share a dual-channel DDR memory controller and a set of HyperTransport links to the outside world. Access to these I/O resources is adjudicated via a crossbar, or switch, so that each CPU can talk directly to memory or I/O as efficiently as possible. In some respects, the dual-core Opteron acts very much like a sort of SMP system on a chip, passing data back and forth between the two cores internally. To the rest of the system I/O infrastructure, though, the dual-core Opteron looks more or less like the single-core version.

The Opteron’s system architecture remains very different from that of its primary competitor, Intel’s Xeon. AMD says its so-called Direct Connect architecture was over-designed for single-core Opterons with an eye to the dual-core future. Each processor (whether dual core or single) has its own local dual-channel DDR memory controller, and the processors talk to one another and to I/O chips via point-to-point HyperTransport links running at 1GHz. This arrangement makes for a network-like system topology with gobs of bandwidth. The total possible bandwidth flowing through the 940 pins of an Opteron 875 is 30.4GB/s—technically, enough to choke a horse. With one less HyperTransport link, the Opteron 275 can theoretically hit 22.4GB/s.

By contrast, current Xeons have a shared front-side bus on which the north bridge chip (with memory controller) and both processors reside. At 800MHz, its total bandwidth is 6.4GB/s—a possible bottleneck in certain situations.
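The arithmetic behind those totals is easy to reconstruct. The sketch below is my own back-of-the-envelope version, not anything AMD or Intel supplied. It assumes each 1GHz HyperTransport link is good for 8GB/s in aggregate (4GB/s in each direction), that each channel of DDR400 delivers 3.2GB/s, that the Opteron 875 has three links against two for the 275, and that the Xeon's bus is 64 bits wide. Those per-link and per-channel figures are assumptions, chosen because they reproduce the numbers quoted above.

```python
# Assumed per-component peaks (not spelled out in the article):
HT_LINK_GBPS = 8.0         # one 1GHz HyperTransport link, 4GB/s each way
DDR400_CHANNEL_GBPS = 3.2  # one 64-bit channel of DDR400


def opteron_peak_gbps(ht_links, memory_channels=2):
    """Sum of HyperTransport and local memory bandwidth for one socket."""
    return ht_links * HT_LINK_GBPS + memory_channels * DDR400_CHANNEL_GBPS


def fsb_peak_gbps(mega_transfers, bus_bytes=8):
    """Peak bandwidth of a shared front-side bus, in GB/s."""
    return mega_transfers * bus_bytes / 1000.0


if __name__ == "__main__":
    print(f"Opteron 875 (3 links): {opteron_peak_gbps(3):.1f} GB/s")
    print(f"Opteron 275 (2 links): {opteron_peak_gbps(2):.1f} GB/s")
    print(f"Xeon 800MHz bus:       {fsb_peak_gbps(800):.1f} GB/s")
```

Keep in mind these are theoretical peaks per socket. The Xeon figure is shared by both processors and the north bridge, while each Opteron socket brings its own memory controller and links.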

MESI-MESI-MOESI Banana-fana…
In order to understand the impact of AMD’s dual-core chip design and system architecture, we should briefly discuss cache coherency. This scary sounding term is actually one of the bigger challenges in a multiprocessor system. How do you handle the fact that one CPU may have a certain chunk of data in its cache and be modifying it while another CPU wants to read it from memory and operate on it, as well? Assuming you don’t run from the room screaming in fear at the complexity of it all, the answer is some sort of cache coherency protocol. Such a protocol would store information about the status of data in the cache and offer updates to other CPUs in the system when something changes.

Intel’s Xeons use a cache coherency protocol called MESI. MESI is an acronym that stands for the various states that data in the CPU’s cache can be flagged as: modified, exclusive, shared, or invalid. Let’s tackle them completely out of order, just to be difficult. If a CPU pulls a chunk of data into cache and has not modified it, the data will be flagged as Exclusive. Should another CPU pull that same chunk of data into its cache, the data would then be marked as Shared. Then let’s say that one of the processors were to modify that data; the data would be marked locally as Modified, and the same chunk on the other CPU would be flagged as Invalid.

Simple, no?

The processor with the Invalid data in its cache (CPU 0, let’s say) might then wish to modify that chunk of data, but it could not do so while the only valid copy of the data is in the cache of the other processor (CPU 1). Instead, CPU 0 would have to wait until CPU 1 wrote the modified data back to main memory before proceeding—and that takes time, bus bandwidth, and memory bandwidth. This is the great drawback of MESI.

AMD sought to address this problem by making use of a cache coherency protocol called MOESI, which adds a fifth possible state to its quiver: Owner. (MOESI is used by all Opterons and was even used by the Athlon MP and 760MP chipset back in the day.) A CPU that “owns” certain data has that data in its cache, has modified it, and yet makes it available to other CPUs. Data flagged as Owner in an Opteron cache can be delivered directly from the cache of CPU 0 into the cache of CPU 1 via a CPU-to-CPU HyperTransport link, without having to be written to main memory.
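For those who prefer code to prose, here is a toy model of the difference. It is a deliberately simplified sketch, not a description of how either vendor's silicon works: it tracks a single cache line across two CPUs, ignores evictions and bus arbitration, and the class and counter names (`CoherentLine`, `memory_writebacks`, `cache_to_cache`) are my own inventions. The only thing it tries to show is that the same access pattern costs a trip to main memory under MESI and a cache-to-cache transfer under MOESI.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"      # only reachable when protocol == "MOESI"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CoherentLine:
    """One cache line tracked across several CPUs (toy model)."""

    def __init__(self, n_cpus=2, protocol="MESI"):
        self.protocol = protocol
        self.states = [State.INVALID] * n_cpus
        self.memory_writebacks = 0  # dirty data pushed out to main memory
        self.cache_to_cache = 0     # dirty data handed straight to a peer

    def _supply_dirty_data(self):
        """Account for moving a dirty line to the CPU that asked for it."""
        if self.protocol == "MOESI":
            self.cache_to_cache += 1
        else:
            self.memory_writebacks += 1

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # cache hit, nothing changes
        others = [i for i in range(len(self.states)) if i != cpu]
        holders = [i for i in others if self.states[i] is not State.INVALID]
        for i in holders:
            if self.states[i] is State.MODIFIED:
                self._supply_dirty_data()
                moesi = self.protocol == "MOESI"
                self.states[i] = State.OWNED if moesi else State.SHARED
            elif self.states[i] is State.OWNED:
                self._supply_dirty_data()
            elif self.states[i] is State.EXCLUSIVE:
                self.states[i] = State.SHARED
        self.states[cpu] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, cpu):
        needs_data = self.states[cpu] is State.INVALID
        for i in range(len(self.states)):
            if i == cpu:
                continue
            dirty = self.states[i] in (State.MODIFIED, State.OWNED)
            if dirty and needs_data:
                self._supply_dirty_data()
            self.states[i] = State.INVALID
        self.states[cpu] = State.MODIFIED


def run_scenario(protocol):
    """Replay the sequence described above: E, then S, then M/I, then a write."""
    line = CoherentLine(protocol=protocol)
    line.read(0)   # CPU 0 pulls the line in: Exclusive
    line.read(1)   # CPU 1 pulls it in too: both Shared
    line.write(1)  # CPU 1 modifies: Modified on 1, Invalid on 0
    line.write(0)  # CPU 0 now wants to modify the stale line
    return line


if __name__ == "__main__":
    for protocol in ("MESI", "MOESI"):
        line = run_scenario(protocol)
        print(protocol, "write-backs:", line.memory_writebacks,
              "cache-to-cache:", line.cache_to_cache)
```

Real protocols have more transitions than this, and most literature calls the fifth state "Owned" rather than "Owner," but the bookkeeping above captures the gist.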

That alone is a nice enhancement over MESI, but the dual-core Opterons take things a step further. In the dual-core chip, cache coherency for the two local CPU cores is still managed via MOESI, but updates and data transfers happen through the system request interface (SRI) rather than via HyperTransport. This interface runs at the speed of the CPU, so transfers from the cache on core 0 into the cache on core 1 should happen very, very quickly. Externally, MOESI updates from a pair of cores in a socket are grouped in order to keep HyperTransport utilization low.

Again, this is quite the contrast with Intel’s dual-core implementation. On Smithfield, the two cores still behave almost exactly like a pair of Xeons in two sockets. MESI updates are communicated over the front-side bus. There is no alternative internal on-chip data path.

Interestingly, the ability of the two cores to pass data quickly to one another seems to offer a compelling enough performance benefit that, from what I gather, AMD’s guidance to OS vendors has been to give priority to scheduling threads on adjacent cores first before spinning off a thread on a CPU core on another socket. That’s despite the fact that there’s additional memory bandwidth available on the second socket.
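To make that placement preference concrete, here is a purely hypothetical sketch of the policy as I understand it. This is not how the Windows scheduler is implemented, and the function name `pick_core` and the core numbering (cores 0 and 1 on socket 0, cores 2 and 3 on socket 1) are assumptions made for illustration only.

```python
def pick_core(idle_cores, sibling_core, cores_per_socket=2):
    """Pick an idle core for a new thread, preferring the sibling's socket.

    idle_cores: iterable of idle core numbers.
    sibling_core: core already running a related thread.
    Returns None when nothing is idle.
    """
    idle = list(idle_cores)
    sibling_socket = sibling_core // cores_per_socket
    # First choice: a core that shares a die with the sibling thread.
    same_socket = [c for c in idle if c // cores_per_socket == sibling_socket]
    candidates = same_socket or idle
    return min(candidates) if candidates else None


if __name__ == "__main__":
    # A thread is running on core 0 and cores 1, 2, and 3 are idle.
    print(pick_core([1, 2, 3], sibling_core=0))
```

Only when the neighboring core is busy does the sketch fall back to a core on the other socket.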

 

Why I am a bad person
This is the part of the review where I explain why I benchmarked what I did, the way I did, and in such obvious violation of the Sacred Creed of Geeks Everywhere.

First, I tried to test the CPUs in such a way as to show their benefits and limitations. To do so, I used the brand-spanking-new Windows XP x64 Edition operating system and a number of 64-bit applications. WinXP x64 is NUMA aware—that is, it comprehends the need to put data into the memory attached to the CPU modifying that data. By nature, Opteron systems require a NUMA-aware OS in order to perform at their best. Also, AMD says that the WinXP x64 scheduler is especially well tuned for dual-core processors.

Obviously, 64-bit applications are the future of dual-core processors. All of the CPUs that I tested, including the moldy old Opterons, newer Xeons, and brand-new dual-core processors from AMD and Intel, are 64-bit capable. I also tried to make use of multithreaded applications where possible, although some programs aren’t threaded and some types of tasks simply don’t lend themselves to multithreading. The end result is that nine of our 15 test applications are multithreaded, and of those nine, five are 64-bit binaries.

I understand that multitasking has been cited as one of the key areas where dual-core processors will benefit the end user, and I certainly don’t disagree entirely. The full-fat, Atkins-approved creamy smoothness that comes with a multiprocessor system will be a boon in desktop systems and in low-end, single-processor workstations, and I have extolled its virtues at length in the past. However, most workstation-class systems are already multiprocessor boxes. Not only that, but we are living right now in what I’d call the Multitasking Moment, as we transition from one CPU core to two. Once dual-core processors become more common, multitasking smoothness will no longer be a big issue. The more relevant question as we move to two, four, eight, and more CPU cores per system will be about the benefits of thread-level parallelism to outright performance, and that is the question I’ve attempted to address in my testing.

I also have not made any attempt at server-class testing in this review. I would have loved to do it, but it would be a new enterprise for us around here, and we had our hands full in doing the testing we did. I would also like to apologize to the workstation purists for not ponying up the cash for high-end Quadro or FireGL graphics cards for our test systems. Truth be told, I would have loved to use them, but the dual Xeon rig soaked up our budget for this review. And that’s without us paying through the nose for registered DDR2-400 DIMMs for that rig. I am, as I said, a bad person.

I am also probably a bad person for focusing primarily on DCC and scientific computing instead of CAD/CAM applications that are generally not multithreaded. Even worse, I threw in a few game benchmarks at the end of the review. Please, whatever you do, don’t tell your boss about those.


I am also a bad person because I really like Tyan’s wicked S2895 motherboard that sports dual 940-pin sockets and dual 16-lane PCI-E x16 slots.

Cool stuff to watch for in the results
There are a number of intriguing matchups in our benchmark results. Let me outline a few of them, so you know what to watch for.

  • Opteron 175 versus dual Opteron 248 — This is a matchup at 2.2GHz between a dual-core Opteron and a pair of single-core Opterons. Is it better to have two cores situated closely together, or is having double the memory bandwidth, as the Opteron 248s do, preferable?
  • Opteron 175 versus Opteron 152 — So would you rather have: a single-core Opteron at 2.6GHz or a dual-core model at 2.2GHz?
  • Dual Opteron 275 versus dual Xeon 3.4GHz — Yes, Intel has newer Xeons out with 2MB of L2 cache, but we couldn’t find them to purchase when we were buying hardware for this review. We suspect that cache size doesn’t make much difference in most of the applications we’re testing, which don’t tend to use the extra meg of cache very well. (See our P4 600 series review for more on the 2MB cache’s performance impact.) The question is, how does a dual-CPU Hyper-Threaded Xeon compare to a quad-core Opteron?
  • Opteron 175 versus Pentium Extreme Edition 840 — Dual-core processors from AMD and Intel go head to head. Perhaps we’re crossing market segments a little bit here, but then Intel targets its high-end Pentium chipsets and processors at the low-end workstation market. The Opteron 175 and Extreme Edition 840 are arguably direct competitors. Who else is Intel targeting with a pair of 3.2GHz Prescott cores on one chip?

    Evil people who wish to observe possible desktop processor performance matchups should note that the Opteron 175 is essentially identical to the Athlon 64 X2 4400+.

  • Dual Xeon 3.2GHz versus Pentium Extreme Edition 840 — Both are dual Prescott cores with 1MB of L2 cache per core running at 3.2GHz on a shared 800MHz front-side bus. The Xeon is saddled with a slower memory subsystem, while the Pentium XE 840 is one chip rather than two. Hmm.
  • Pentium Extreme Edition 840 versus Pentium D 840 — If you disable Hyper-Threading on the Pentium XE 840, you get a Pentium D 840. That’s exactly what we did, because we were curious to see the performance impact.
  • Redemption for Prescott? — We’ve used a mix of single-threaded and multithreaded applications in our past CPU reviews, but we’re using more threading now than ever. Does Intel’s Hyper-Threaded processor core regain some of its luster when more of the benchmarks go multithreaded?
  • Dual Opteron 275 versus the world — What can a dual-socket, quad-core system do better than any of the two-core systems we’re testing here? Not everything, because some applications only spin off one or two threads. In some cases, though, well, you’ll see.

There are some other interesting questions to be asked about the results, but you can find them for yourself. I’m just offering my top suggestions.

 

Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least twice, and the results were averaged.

Our test systems were configured like so:

Component | Opteron systems | Xeon systems | Pentium systems
Processor | Opteron 148, Opteron 152, Opteron 175, Dual Opteron 248, Dual Opteron 252, Dual Opteron 275 | Xeon 3.2GHz (Nocona 1MB), Xeon 3.4GHz (Nocona 1MB), Dual Xeon 3.2GHz (Nocona 1MB), Dual Xeon 3.4GHz (Nocona 1MB) | Pentium 4 660 3.6GHz, Pentium D 840 3.2GHz, Pentium Extreme Edition 840 3.2GHz
System bus | 1GHz HyperTransport | 800MHz (200MHz quad-pumped) | 800MHz (200MHz quad-pumped)
Motherboard | Tyan Thunder K8WE S2895 | SuperMicro X6DAL-G | Intel D955XBK
BIOS revision | 2/21/2005 beta | 080010 | BK95510J.86A.1152
North bridge | nForce Professional 2200, nForce Professional 2050, AMD 8131 PCI-X Tunnel | Intel E7525 | 955X MCH
South bridge | - | 6300ESB | ICH7R
Chipset drivers | SMBus driver 4.45, IDE driver 4.75 | OS integrated | INF Update 7.0.0.1019
Memory size | 2GB (4 DIMMs) | 2GB (4 DIMMs) | 1GB (2 DIMMs)
Memory type | OCZ PC3200 512MB registered ECC DDR SDRAM at 400MHz | Kingston PC3200 512MB registered ECC DDR SDRAM at 333MHz | Corsair XMS2 5400UL DDR2 SDRAM at 533MHz
CAS latency (CL) | 3 | 2.5 | 3
RAS to CAS delay (tRCD) | 3 | 3 | 2
RAS precharge (tRP) | 3 | 3 | 2
Cycle time (tRAS) | 8 | 7 | 8
Hard drive | Maxtor DiamondMax 10 250GB SATA 150 (all systems)
Audio | Integrated nForce/AD1981B with NVIDIA 4.60 drivers | Integrated 6300ESB/ALC650 with Realtek 5.10.0.5820 drivers | Integrated ICH7R/STAC9221D5 with SigmaTel 5.10.4456.0 drivers
Graphics | GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers (all systems)
OS | Windows XP Professional x64 Edition (all systems)
OS updates | - | - | -

Note that we have less total memory on the Pentium setups. I don’t believe any of our benchmarks are constrained by available RAM in a 1GB system, but you’ll still want to keep the difference in mind.

All tests on the Pentium systems were run with Hyper-Threading enabled, except where otherwise noted.

Thanks to Corsair, OCZ, and Kingston for providing us with memory for our testing. This matchup required lots of high-quality RAM, so we had to spread the love around. All three brands are far and away superior to generic, no-name memory.

Also, all of our test systems were powered by OCZ PowerStream power supply units. The PowerStream was one of our Editor’s Choice winners in our latest PSU round-up.

The test systems’ Windows desktops were set at 1152×864 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.

We used the following versions of our test applications:

The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

 

Memory performance
We begin with a few synthetic memory performance tests because the memory subsystem’s performance affects our understanding of what’s going on in the rest of the benchmarks.

Sandra’s memory performance tests are multithreaded and take full advantage of the dual-socket Opterons’ NUMA memory subsystem. Yow.

The Xeon systems are at a distinct disadvantage due to their dual-channel DDR333 memory, but that’s about as fast as it gets for Xeons currently. Some Xeon motherboards offer dual-channel DDR2-400, but that’s not likely much faster than DDR memory at 333MHz. The Pentium XE 840 achieves much higher throughput thanks to its dual channels of low-latency DDR2-533 memory.

Linpack lets us get a quick visual on the size and speed of the L1 data and L2 caches on these processors, and there’s nothing unexpected here.

The Opteron’s copious memory bandwidth is tied to its low memory latencies, which come courtesy of its built-in memory controller. Because the memory controller runs at the speed of the CPU, the faster the processor, the lower the memory access latency. The multi-socket Opterons pay a small latency penalty here, but nothing major. The Xeons, on the other hand, have a rough time of it.

 

POV-Ray rendering
POV-Ray just recently made the move to 64-bit binaries, and thanks to the nifty SMPOV distributed rendering utility, we’ve been able to make it multithreaded, as well. SMPOV spins off any number of instances of the POV-Ray renderer, and it will bisect the scene in several different ways. For this scene, the best choice was to divide the screen up horizontally between the different threads, which provides a fairly even workload.
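The splitting itself is simple enough to sketch. What follows is a hypothetical illustration of carving a frame into horizontal bands, one per renderer instance. It is not SMPOV's actual code, and the function name `split_rows` and the 480-line frame height in the example are assumptions for illustration only.

```python
def split_rows(height, workers):
    """Divide `height` scanlines into `workers` contiguous bands.

    Returns a list of (first_row, end_row) pairs with end_row exclusive.
    Any leftover rows are handed out one apiece to the first bands, so
    no band is more than one row taller than another.
    """
    base, extra = divmod(height, workers)
    bands = []
    start = 0
    for i in range(workers):
        rows = base + (1 if i < extra else 0)
        bands.append((start, start + rows))
        start += rows
    return bands


if __name__ == "__main__":
    # Four renderer instances sharing a hypothetical 480-line frame.
    for first, end in split_rows(480, 4):
        print(f"rows {first} to {end - 1}")
```

Equal-sized bands only make for an even workload when the scene's complexity is spread fairly evenly from top to bottom, which is presumably why the best way to slice things varies from scene to scene.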

Notice the Task Manager graph above. I’ve included those to give some indication of how much an application occupies each CPU. In this case, SMPOV and POV-Ray show a near-perfect 100%, nailed-to-the-wall utilization across all four CPU cores in our dual Opteron 275 system.

Rendering is one of those cases where multithreading can bring huge performance increases. The dual Opteron 275 crushes everything else here, as one might expect, rendering the entire scene in 87 seconds. Note that the dual-core Opteron 175 slightly outperforms the pair of Opteron 248s, too.

The other big thing to notice here is how much faster the Prescott/Nocona core looks when additional POV-Ray threads take advantage of Hyper-Threading. The Xeon 3.4GHz shaves 50 seconds off of its render time with a second thread.

 

3ds max 7 rendering
We tested 3ds max performance by rendering 20 frames of a sample scene at 320×240 resolution. This particular scene makes use of a motion-blur effect that requires extensive multi-pass rendering. We tried two different renderers: 3ds max’s default scanline renderer and its built-in version of the mental ray renderer.

The default renderer performs very well on our quad-core Opteron 275 system, and once again, the Opteron 175 just edges out a pair of 248s.

Unfortunately, the mental ray renderer didn’t like our dual Opteron 275 system. It refused to use all four cores to the full because the license for 3ds max’s integrated version of mental ray will only use two processors, no more. AMD has been pushing software makers to do their licensing on a per-CPU rather than per-core basis, but some current workstation-class programs will probably present this sort of problem, at least until the licensing model for dual-core products is fully worked out and programs are updated.

With two threads, the dual Opteron 252s at 2.6GHz are fastest here. Once more, the Opteron 175 outdoes the dual 248s, as it does a pair of Xeons.

 

Cinema 4D rendering
Cinema 4D’s rendering engine does a very nice job of distributing the load across multiple processors, as the Task Manager graph shows.

With four threads active, the dual Opteron 275 system absolutely tears through this rendering task, busting the 1000 mark in a benchmark where our previous champ had scored 689. That champ, by the way, was a system with dual 3.2GHz Xeons based on the older “Prestonia” core. The newer Nocona Xeons at 3.4GHz are slower here, even at a higher clock speed.

The remainder of Cinebench’s tests are all single-threaded shading tests, and the Opterons all perform well here, though additional cores or processors are no help.

 

LAME audio encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, “In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one.”
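Here is a rough sketch of that kind of two-stage pipeline. It is not LAME MT's code. The `analyze` and `encode` functions are trivial stand-ins I made up for the psycho-acoustic analysis and the rest of the encoder, and the one-slot queue is my assumption about how the "one frame ahead" buffering might be arranged.

```python
import queue
import threading


def analyze(frame):
    """Stand-in for psycho-acoustic analysis: just the frame's mean."""
    return sum(frame) / len(frame)


def encode(frame, analysis):
    """Stand-in for the rest of the encoder: subtract the analysis value."""
    return [round(sample - analysis, 3) for sample in frame]


def pipelined_encode(frames, lookahead=1):
    """Run analysis on one thread and encoding on another, in frame order."""
    handoff = queue.Queue(maxsize=lookahead)  # bounded buffer between stages
    encoded = []

    def analysis_stage():
        for frame in frames:
            handoff.put((frame, analyze(frame)))  # blocks when buffer is full
        handoff.put(None)  # sentinel: no more frames

    producer = threading.Thread(target=analysis_stage)
    producer.start()
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        encoded.append(encode(frame, analysis))
    producer.join()
    return encoded


if __name__ == "__main__":
    frames = [[1.0, 2.0, 3.0], [4.0, 4.0]]
    print(pipelined_encode(frames))
```

Because the stages hand off frames in order, the output matches what a single-threaded encoder would produce, which is the debugging advantage the author describes. Note that Python's threads won't actually run these two CPU-bound stages in parallel the way a native encoder's threads would; the sketch shows the structure, not the speedup. It also shows why a design like this can keep only two processors busy, no matter how many are on hand.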

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.

Notice how the CPU affinity tends to ping-pong around as the encoder runs. That’s typical behavior in Windows for some applications.

As you can see, LAME MT’s two threads work well with Hyper-Threading or SMP, but they don’t take any extra advantage of the four cores on the dual Opteron 275.

 

Xmpeg/DivX video encoding
We used the Xmpeg/DivX combo to convert a DVD .VOB file of a movie trailer into DivX format. Like LAME MT, this application is only dual threaded.

Once more, the dual Opteron 252s wind up at the top of the heap, but the Opteron 175 isn’t far behind, nearly tied with that other dual-core CPU, the Pentium XE 840. The Xeons are quite likely hampered here by their lower memory bandwidth.

Windows Media Encoder video encoding
We asked Windows Media Encoder to convert a gorgeous 1080-line WMV HD video clip into a 640×480 streaming format using the Windows Media Video 8 Advanced Profile codec.

This video encoder makes better use of four cores, and the dual Opteron 275 system finishes before the rest. The Opteron 175, meanwhile, absolutely runs away from the Opteron 248s. Could this be the new K8 cores’ SSE3 support at work?

 

ScienceMark

We’re using the 64-bit beta version of ScienceMark for these tests, and several of its components are multithreaded. ScienceMark author Alexander Goodrich says this about the Molecular Dynamics simulation:

Molecular Dynamics is lightly multithreaded – one thread takes care of U/I aspects, and the other thread takes care of the computation. The computation itself is not multithreaded, though Tim and I were looking into ways of changing the algorithm to support multi-threading programming a couple years ago – it’s a lot of effort, unfortunately. When MD [is] running there [is] a total of 2 threads for the process.

Here are the results:

The Primordia test “calculates the Quantum Mechanical Hartree-Fock Orbitals for each electron in any element of the periodic table.” Alex says this about it:

Primordia is multithreaded. Two main tasks occur which allow this to happen. Essentially, we identified 2 parallel tasks that could be done. We could probably take this a step further and optimize it even more. There is an issue, however, with the Pentium Extreme Edition that we’ve identified. The second computation thread gets executed on the logical HT thread rather than the 2nd core, so performance isn’t as good as it could be. This will be fixed in the next revision. This doesn’t effect [sic] the regular Pentium D. A workaround could include disabling HT on Pentium EE. There are 3 threads for primordia – 2 threads for computation, 1 thread for U/I.

The next two tests are only single-threaded, and they don’t make as good use of any of the CPUs here as they could if they were better optimized. The ScienceMark team has plans to incorporate linear algebra libraries from Intel and AMD in order to boost performance.

 

SiSoft Sandra
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX and SSE/2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:

This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.

The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.

We’re using the 64-bit port of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations at once.
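The interlacing scheme from the FAQ can be sketched in a few lines. This is an illustration of the idea only, not Sandra's implementation. It hands out columns in a fixed stride (worker 0 takes columns 0, 4, 8 and so on) rather than dynamically, it uses plain Python floats instead of SSE2, and the function names and the region of the complex plane it maps are my own assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def mandelbrot_iterations(cx, cy, max_iter=255):
    """Count iterations before the point (cx, cy) escapes, up to max_iter."""
    zx = zy = 0.0
    for i in range(max_iter):
        if zx * zx + zy * zy > 4.0:
            return i
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
    return max_iter


def render_columns(worker, n_workers, width, height):
    """Compute every n_workers-th column, starting at column `worker`."""
    columns = {}
    for x in range(worker, width, n_workers):
        cx = -2.0 + 3.0 * x / width
        columns[x] = [
            mandelbrot_iterations(cx, -1.0 + 2.0 * y / height)
            for y in range(height)
        ]
    return columns


def render(width=64, height=48, n_workers=4):
    """Render the fractal column by column, one interleaved set per worker."""
    image = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(
            render_columns,
            range(n_workers),
            [n_workers] * n_workers,
            [width] * n_workers,
            [height] * n_workers,
        )
        for part in parts:
            image.update(part)
    return [image[x] for x in range(width)]


if __name__ == "__main__":
    picture = render()
    print(len(picture), "columns of", len(picture[0]), "pixels")
```

Interleaving columns spreads the expensive pixels near the middle of the fractal across all of the workers, which helps explain why a test like this scales so well with additional cores.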

The dual Opteron 275s scale very nicely in this test, showing off the incredible peak performance possible with four CPU cores working together. Among the single-socket dual-core processors, though, the Pentium XE 840 crushes the Opteron 175.

Sphinx speech recognition
Ricky Houghton first brought us the Sphinx benchmark through his association with speech recognition efforts at Carnegie Mellon University. Sphinx is a high-quality speech recognition routine. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance. However, the versions of Sphinx we’re using are only single-threaded.

The dual-core Opterons perform quite a bit better here than the Opteron 148 and 248 do, possibly because of the enhancements AMD has made to its memory controller. Overall, however, CPUs geared more for linear performance triumph in this single-threaded test, as the Pentium 4 660 at 3.8GHz takes the top spot, followed by the Opteron 152.

 

picCOLOR
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA.

At our request, Dr. Müller added larger image sizes to this latest build of picCOLOR. We were concerned that the thread creation overhead on the test’s rather small default image size would overshadow the benefits of threading. Dr. Müller has also made picCOLOR’s multithreading more extensive: eight of the 12 functions in the test are now multithreaded.

Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
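To make the indexing concrete, here is a minimal sketch of how such a score could be derived. It assumes performance is simply the inverse of elapsed time, which is our guess at the arithmetic rather than anything picCOLOR documents. The function name and the timings are hypothetical.

```python
def indexed_score(reference_seconds, measured_seconds):
    """Relative performance versus the reference machine (higher is faster).

    Assumes performance is inversely proportional to elapsed time.
    """
    if reference_seconds <= 0 or measured_seconds <= 0:
        raise ValueError("timings must be positive")
    return reference_seconds / measured_seconds


if __name__ == "__main__":
    # Hypothetical timings: the reference Pentium III 1GHz box takes 41.4 s,
    # the system under test takes 10.0 s.
    print(round(indexed_score(41.4, 10.0), 2))
```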

With a max of two threads, the Opteron 252 comes out on top, followed by the dual-core Opterons. This appears to be another case where Hyper-Threading confusion causes the Pentium XE 840 to stumble. Perhaps the latest version of picCOLOR, which incorporates four threads for some functions, could take better advantage of the XE 840 and the dual Opteron 275s.

 

Gaming performance
Purists, look away! We’re running games on the quad Opteron workstation box. Please, just flip ahead a couple of pages before you burst into flames.

Doom 3
We tested performance by playing back a custom-recorded demo that should be fairly representative of most of the single-player gameplay in Doom 3.

Far Cry
Our Far Cry demo takes place on the Pier level, in one of those massive, open outdoor areas so common in this game. Vegetation is dense, and view distances can be very long.

Unreal Tournament 2004
Our UT2004 demo shows yours truly putting the smack down on some bots in an Onslaught game.

Should you wish to run video games on an Opteron 175 or *cough* something like it, it would serve that purpose quite well, based on these results.

 

3DMark05

3DMark05’s overall score is utterly bottlenecked by the graphics card, but its CPU score is not. In fact, the CPU tests include an element of multithreading, and test two especially likes the dual Opteron 275s’ four cores.

 

Power consumption
We measured the power consumption of our entire test systems, except for the monitor, at the wall outlet using a Watts Up PRO watt meter. The test rigs were all equipped with OCZ PowerStream 520W power supply units. The idle results were measured at the Windows desktop, and we used SMPOV and the 64-bit version of the POV-Ray renderer to load up the CPUs. In all cases, we asked SMPOV to use the same number of threads as there were CPU front ends in Task Manager: four for the dual Opteron 275, four for the Pentium XE 840, two for the Opteron 175, and so on.

The graphs below have results for “power management” and “no power management.” That deserves some explanation. By “power management,” we mean SpeedStep or PowerNow. (In the case of the Pentium 4 600-series processors, the C1E halt state is always available, even in the “no power management” tests.) Sadly, the beta BIOS we used for our Tyan S2895 motherboard didn’t support AMD’s PowerNow, so we couldn’t report scores for the Opterons with power management enabled.

At idle and under load, AMD appears to have delivered on its promise: the dual-core Opteron 175 actually consumes less power than the Opteron 152. Since power consumption is pretty directly related to heat output, AMD appears to have hit both its power and heat targets for the dual-core parts. Even the quad-core Opteron 275 system consumes substantially less power under load than the dual Xeon rig or the Pentium XE 840.

Incidentally, simply by turning off Hyper-Threading, the Pentium XE’s power consumption under load drops from 313W to 292W.

Here’s something kind of interesting. Since we’re dealing with dual-socket systems, we can calculate the power consumption delta when going from single to dual CPUs. That lets us isolate CPU power consumption from overall system consumption—at least in theory, I think, and don’t sue me please.
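The arithmetic behind that delta is simple enough to sketch. The function name and the wattages below are hypothetical, not our measurements. The sketch also assumes the second CPU is the only thing that changes between the two configurations, and since the readings come from the wall outlet, the result includes whatever the power supply loses in conversion.

```python
def estimate_cpu_power(single_cpu_system_watts, dual_cpu_system_watts):
    """Estimate per-CPU and rest-of-system power from two wall readings.

    Assumes adding the second CPU is the only change between the two
    configurations, so the difference is attributed entirely to one CPU.
    """
    cpu_watts = dual_cpu_system_watts - single_cpu_system_watts
    rest_of_system_watts = single_cpu_system_watts - cpu_watts
    return cpu_watts, rest_of_system_watts


if __name__ == "__main__":
    # Hypothetical readings under load: 150 W with one CPU, 200 W with two.
    cpu, rest = estimate_cpu_power(150, 200)
    print(f"per CPU: {cpu} W, rest of system: {rest} W")
```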

Again, the Opteron 175 is well within the power envelope established by its predecessors.

 
Conclusions
AMD’s dual-core Opteron processors are extremely well executed on all fronts, based on what we’ve seen. AMD’s dual-core design has a technical elegance that Intel’s can’t match, and that design brings superior performance. One Opteron 175 performs slightly better than a pair of Opteron 248s running at the same clock speed, and it does so while consuming less power than a single-core Opteron 152. All in all, very impressive.

Going to a dual-core Opteron does, however, involve some tradeoffs. Fundamentally, one is giving up single-threaded performance in order to gain multithreaded performance. Whether or not this tradeoff makes sense will depend on the kind of applications one plans to run on the system. Many of our benchmarks were multithreaded, but only made use of two threads, leaving the dual Opteron 275 system looking a little pointless. The Opteron 252 system outperformed it in many of these dual-threaded apps, like media encoding. Our other tests, however, showed the Opteron 275s to be an absolute rendering powerhouse. Which processor is the better buy will depend greatly on its intended use.

The rough part of the story is that AMD isn’t asking customers to choose between an Opteron 252 and a 275, which could be a tough choice for many workstation users. They’ve priced the Opteron 265, which runs at only 1.8GHz, right on top of the Opteron 252 at 2.6GHz. That forces one to choose: are you really committed to the idea of dual-core processors or not? For systems that already have two CPU sockets, I’m not sure what I’d choose without knowing the specific types of applications involved. The move to dual-core CPUs effectively ups the ante on thread-level parallelism in workstations, and some classes of applications will benefit from that effect more dramatically, and immediately, than others.

I do think that the answer for single-socket workstations is probably rather straightforward: I’d pick the dual-core Opteron over the single for the same reasons that most workstations have traditionally had multiple processors. Not only will the dual-core CPU bring better multitasking responsiveness, but it will also work well in dual-threaded applications, which are fairly common. In the server space, the choice to opt for dual-core chips for web, database, and terminal servers will also likely be rather easy, given the highly threaded nature of such roles.

There are a number of other subplots in our benchmark results, as we discussed earlier, and I won’t attempt to address them all here. You’ve seen the results for yourself. We will probably be sorting through some of the more profound questions about the benefits and limitations of thread-level parallelism for years to come.
