We have on the bench the 3800+ model of the Venice Athlon 64, and we’ve compared it against everything from its direct predecessor, the Athlon 64 “Newcastle” 3800+, to the highfalutin’ new dual-core processors from Intel and AMD. We’ve also attempted to overclock the thing into oblivion. Hop into our gondola and come take a brief tour of Venice with us.
Rev E revs up
One of the first things you need to know about the Venice core is that AMD isn’t selling it as anything new or special. In fact, they’re not really advertising the changes at all, and we had to goad them into sending one of these puppies out for review. If you go buy a new Athlon 64 with 512K of L2 cache at an online vendor, you may well get the new Venice core, or you might get one of its two predecessors: the “Winchester” core, also built on AMD’s 90nm fab process, or the older “Newcastle” core built on a 130nm fab process. These cores sell under some of the same model names, including 3200+, 3500+, and 3800+, depending on clock speed. Fortunately, most of the better online vendors will tell you which version of the K8 core you’re buying, so you can pick the right one. You do want to pick the right one, by the way. Ask AMD, and they will give you a short list of enhancements made to the revision E core that looks like this:
- Support for SSE3 instructions: The Venice core can execute 11 of the 13 instructions that make up SSE3. Pioneered by Intel in its Pentium 4 “Prescott” chips, these 11 instructions are targeted at speeding up certain types of computation. Five of them are aimed at complex math operations like fast Fourier transforms (often used in scientific computing), while another four allow for better data organization in software vertex shader routines for graphics. The handful of others should improve performance in video encoding and speed up conversion of floating-point data types to integers. The two SSE3 instructions that the Venice core doesn’t support have to do with thread synchronization for Hyper-Threading and simply don’t apply here.
- Support for mismatched DIMM sizes per channel: The memory controller built into the Athlon 64 has been tweaked to allow for DIMMs of two different sizes to coexist in a single memory channel. That means it would be possible, in a Socket 939 system with two memory channels, to plug in two pairs of DIMMs that are different sizes without gravely compromising performance.
- Better memory mapping: The rev-E chips can make more efficient use of memory space, although AMD isn’t long on details about the changes.
- Improved memory loading: AMD says it’s possible with a rev-E chip to populate all of the slots of the motherboard with dual-bank DIMMs without seeing any slowdown.
That’s the end of the official list of changes, but I suspect there’s more going on here than just that. AMD has also incorporated a number of small changes and fixes into the revision E cores that it doesn’t care to talk about. Some of those things are minor modifications deemed too obscure for public consumption, no doubt, but some may be kind of interesting. The Venice core’s clock-for-clock performance is up appreciably in certain scenarios, as our benchmarks will show. If you look at this news post from way back in March of 2003, AMD’s Kevin McGrath claimed that future Athlon 64 chips would have a number of new features, although he wasn’t terribly specific about what would happen when. It’s possible that the Venice core is the first to include the enhanced data prefetch function that McGrath described, among others, and some of our benchmark results would appear to support that theory.
One thing that we do know about the Venice core is that it’s manufactured using AMD’s 90nm fab process. Moving from a larger process to a smaller one reduces the overall size of the chip substantially, potentially allowing it to run cooler, require less power, and perhaps operate at higher clock speeds. AMD’s 90nm process employs a couple of interesting techniques in order to improve chip properties even further.
The first of these techniques, silicon-on-insulator (SOI) technology, has been used for all Athlon 64 processors. By situating the silicon layer on the chip on top of a film of silicon oxide, an SOI fab process allows for lower transistor capacitance and faster switching speeds.
The second technique, most widely known as strained silicon, is newer. AMD has used it in limited ways in its 130nm process and is now using it more extensively at 90nm. AMD and IBM jointly developed this process, which they call Dual Stress Liner technology. It’s so named because the technique allows the firms to compress or stretch the lattice of silicon molecules on a chip, selectively, by placing the silicon on top of one of two layers of silicon nitride. Some types of transistors (PMOS transistors) benefit from the compressive effect, while others (NMOS transistors) benefit from stretch or strain. NMOS transistors, for instance, have lower resistance with strained silicon. AMD has claimed that Dual Stress Liner technology will allow for transistors that switch up to 24% faster at comparable power levels than transistors manufactured conventionally.
Fortunately, the benefits of the SOI and Dual Stress Liner techniques are additive, so they work together to reduce power consumption and increase the switching speed of transistors on the CPU. The end result is chips like our Venice-based Athlon 64 3800+ that should consume less power and, we hope, be willing to run at higher clock speeds, too.
We’ve tested the Venice 3800+ against a range of CPUs, including the latest dual-core processors. We’ve also used a lot of multithreaded applications, which may not seem fair to the single-core Athlon 64 3800+, but life is about to get very unfair indeed for single-core CPUs in the coming months and years as the industry works to use threading more extensively.
There are two comparisons to watch here. The first one is the Venice 3800+ against the older Newcastle-based 3800+. That’s a pretty straightforward affair of old versus new, and it will give us a good sense of the clock-for-clock performance benefits of the Venice core. The second is the Venice 3800+ against the Pentium 4 660, Intel’s single-core competition. Truth be told, the 3800+’s most direct competitor is probably the Pentium 4 650 at 3.4GHz, which lists for about $25 more than the 3800+. However, the 3800+ may well hold its own against the slightly faster and more expensive P4 660.
Also, we have included results for the Pentium D 840 in our testing, which we obtained by disabling Hyper-Threading on our Extreme Edition 840. Since the Pentium D 840 is just an Extreme Edition 840 sans HT, the numbers should be valid.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least twice, and the results were averaged.
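For readers who want to reproduce that averaging step on their own runs, here is a minimal Python sketch. The function name and the sample timings are hypothetical and are not taken from our results; it simply assumes each test yields a list of numeric scores or times.

```python
from statistics import mean


def average_runs(runs):
    """Return the arithmetic mean of repeated benchmark runs.

    Assumption: `runs` holds numeric results (times or scores) from
    repeated passes of the same test on the same system.
    """
    if len(runs) < 2:
        raise ValueError("expected at least two runs per test")
    return mean(runs)


# Hypothetical timings in seconds, used only to show the call.
print(average_runs([41.0, 43.0]))
```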
Our test systems were configured like so:
|Processor||Opteron 152 2.6GHz||Pentium 4 660 3.6GHz; Pentium D 840 3.2GHz; Pentium Extreme Edition 840 3.2GHz||Pentium 4 Extreme Edition 3.73GHz||Athlon 64 3800+ 2.4GHz (Venice); Athlon 64 3800+ 2.4GHz (Newcastle); Athlon 64 4000+ 2.4GHz; Athlon 64 FX-55 2.6GHz; Athlon 64 X2 4200+ 2.2GHz; Athlon 64 X2 4800+ 2.4GHz|
|System bus||1GHz HyperTransport||800MHz (200MHz quad-pumped)||1066MHz (266MHz quad-pumped)||1GHz HyperTransport|
|Motherboard||Tyan Thunder K8WE S2895||Intel D955XBK||Intel D955XBK||Asus A8N-SLI Deluxe|
|BIOS revision||2/21/2005 beta||BK95510J.86A.1152||BK95510J.86A.1234||MCT2/dualcore|
|North bridge||nForce4 Professional 2200; nForce4 Professional 2050; AMD 8131 PCI-X Tunnel||955X MCH||955X MCH||nForce4 SLI|
|Chipset drivers||SMBus driver 4.45; IDE driver 4.75||INF Update 22.214.171.1249||INF Update 126.96.36.1999||SMBus driver 4.45; IDE driver 4.75|
|Memory size||2GB (4 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)|
|Memory type||OCZ PC3200 512MB registered ECC DDR SDRAM at 400MHz||Corsair XMS2 5400UL DDR2 SDRAM at 533MHz||Corsair XMS2 5400UL DDR2 SDRAM at 667MHz||Corsair XMS Pro 3200XL DDR SDRAM at 400MHz|
|CAS latency (CL)||3||3||4||2|
|RAS to CAS delay (tRCD)||3||2||2||2|
|RAS precharge (tRP)||3||2||2||2|
|Cycle time (tRAS)||8||8||8||5|
|Hard drive||Maxtor DiamondMax 10 250GB SATA 150|
|Audio||with NVIDIA 4.60 drivers||with SigmaTel 5.10.4456.0 drivers||with SigmaTel 5.10.4456.0 drivers||with Realtek 188.8.131.5220 drivers|
|Graphics||GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers|
|OS||Windows XP Professional x64 Edition|
Note that we have more total memory on the Opteron rig. I don’t believe any of our benchmarks are constrained by available RAM in a 1GB system, but you’ll still want to keep the difference in mind.
All tests on the Pentium systems were run with Hyper-Threading enabled, except where otherwise noted.
Thanks to Corsair and OCZ for providing us with memory for our testing. This matchup required lots of high-quality RAM, so we had to spread the love around. Both brands are far and away superior to generic, no-name memory.
The test systems’ Windows desktops were set at 1152×864 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- SiSoft Sandra 2005 SR1 10.50 64-bit
- ScienceMark 2.0 64-bit
- Compiled binary of C Linpack port from Ace’s Hardware
- POV-Ray for Windows 3.6 64-bit
- SMPOV 4.3
- 3ds max 7.0
- Cinebench 2003
- LAME MT 3.97a 64-bit
- Xmpeg 5.0.3 with DivX Video 5.21
- Windows Media Encoder 9
- Sphinx 3.3
- picCOLOR v4.0 build 545 64-bit
- DOOM 3 1.1 with trdelta1 demo
- Far Cry 1.3 with tr3-pier demo
- Unreal Tournament 2004 v3355 with trdemo1
- 3DMark05 v120
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
We generally start out with some memory subsystem tests, so we can see how the processors match up on that front. These results sometimes help us to understand some of the later benchmark results from real applications.
The changes in the Venice core don’t help much in our synthetic memory bandwidth tests. In fact, its bandwidth numbers are slightly lower than the Newcastle-based 3800+. However, look at the Linpack graph for some interesting results. The Venice 3800+ is markedly faster than the Newcastle at matrix sizes between about 64K and 576K, where the L2 cache is the primary determinant of performance. This slight advantage extends well into larger matrix sizes where main memory access is the main bottleneck. Both 3800+ chips’ numbers trail off fairly early in Linpack due to their relatively small L2 cache size.
Up next are some gaming tests, which will essentially serve to illustrate the futility of running a dual-core processor in a single-threaded application. Notice that we’ve included above each result a little graph generated by the Windows Task Manager as the benchmark ran on our dual Opteron 275 system (with four total CPU cores). This should give you some indication of the amount of threading in the application. In some cases with single-threaded apps like the games below, the task will oscillate back and forth between one CPU and the next, but total utilization generally won’t go above 50% for a dual-core or 25% for a quad-core (or quad-front-end, in the case of the XE 840 with Hyper-Threading) system.
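The 50% and 25% ceilings fall out of simple division: one fully busy thread can occupy at most one of the CPU front ends that Task Manager counts. A minimal Python sketch of that arithmetic follows; the function name is ours, and it assumes Task Manager reports utilization averaged across all front ends.

```python
def single_thread_ceiling(front_ends):
    """Upper bound, in percent, on total CPU utilization for one busy thread.

    Assumption: utilization is averaged across `front_ends` logical CPUs,
    so a single saturated thread contributes at most 100 / front_ends.
    """
    if front_ends < 1:
        raise ValueError("need at least one CPU front end")
    return 100.0 / front_ends


for label, count in [("single-core", 1), ("dual-core", 2), ("quad front-end", 4)]:
    print(f"{label}: {single_thread_ceiling(count):.0f}%")
```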
We tested performance by playing back a custom-recorded demo that should be fairly representative of most of the single-player gameplay in Doom 3.
Our Far Cry demo takes place on the Pier level, in one of those massive, open outdoor areas so common in this game. Vegetation is dense, and view distances can be very long.
Unreal Tournament 2004
Our UT2004 demo shows yours truly putting the smack down on some bots in an Onslaught game.
The Venice 3800+ has a teeny but consistent edge over the Newcastle in all three of these games, closing the gap somewhat between the 3800+ and 4000+. The Pentium 4 660, meanwhile, just can’t keep up.
The overall 3DMark score is held up by limitations of the graphics card, but the CPU test exposes differences between the CPUs. Since it’s multithreaded, the dual-core and Hyper-Threaded CPUs do relatively well here. However, the Venice 3800+ manages to surpass the Pentium 4 660, and the Venice chip pulls a coup by outscoring the Athlon 64 FX-55, as well. The FX-55’s larger L2 cache and 200MHz clock speed advantage aren’t enough to overcome the advantages conferred by the rev-E core.
POV-Ray just recently made the move to 64-bit binaries, and thanks to the nifty SMPOV distributed rendering utility, we’ve been able to make it multithreaded, as well. SMPOV spins off any number of instances of the POV-Ray renderer, and it will bisect the scene in several different ways. For this scene, the best choice was to divide the screen up horizontally between the different threads, which provides a fairly even workload.
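To give a feel for that kind of split, here is a rough Python sketch that divides an image's scanlines into one contiguous band per renderer instance. This is not SMPOV's code; the function name, the band layout, and the assumption that each instance gets a strip of whole rows are ours, purely for illustration.

```python
def split_rows(height, workers):
    """Divide `height` scanlines into `workers` contiguous bands.

    Returns a list of half-open (start, stop) row ranges. Any remainder
    rows are handed out one apiece to the first bands, so band sizes
    never differ by more than one row.
    """
    if workers < 1:
        raise ValueError("need at least one worker")
    base, extra = divmod(height, workers)
    bands = []
    start = 0
    for index in range(workers):
        size = base + (1 if index < extra else 0)
        bands.append((start, start + size))
        start += size
    return bands


# Example: a 480-line image shared among four renderer instances.
print(split_rows(480, 4))
```

Equal-sized bands only give an even workload when scene complexity is spread fairly evenly across the image, which is why the best split depends on the scene.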
The Venice core’s tweaks pay no dividends in POV-Ray rendering, where multicore processors rule the roost. The Venice 3800+ turns in render times almost identical to the Newcastle 3800+, but that’s no bad thing. The Venice 3800+ still renders the scene faster than the Pentium 4 660.
We tested 3ds max performance by rendering 20 frames of a sample scene at 320×240 resolution. This particular scene makes use of a motion-blur effect that requires extensive multi-pass rendering. We tried two different renderers: 3ds max’s default scanline renderer and its built-in version of the mental ray renderer.
Again in 3ds max, there’s no real performance advantage to the rev-E CPU. Regardless, both flavors of the 3800+ render the scene faster than the P4 660.
Cinema 4D’s rendering engine does a very nice job of distributing the load across multiple processors, as the Task Manager graph shows.
The Cinebench rendering test also doesn’t see any performance gains with the Venice core, but here, the Pentium 4 660 proves to be faster, so long as Hyper-Threading is used.
Venice outperforms Newcastle slightly but steadily in the Cinebench shading tests, and the Venice 3800+ takes two out of three tests from the P4 660, as well.
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, “In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one.”
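The shape of that two-stage pipeline can be sketched in a few lines of Python. To be clear, this is not LAME MT's code: `analyze` and `encode` are hypothetical stand-ins for the psycho-acoustic analysis and the rest of the encoder, and the one-slot queue is our assumption about how a "one frame ahead" buffer might look.

```python
import queue
import threading


def analyze(frame):
    """Hypothetical stand-in for psycho-acoustic analysis of one frame."""
    return sum(frame) / len(frame)


def encode(frame, analysis):
    """Hypothetical stand-in for the remaining encoder stages."""
    return (len(frame), analysis)


def pipelined_encode(frames, depth=1):
    """Run analysis on a helper thread that stays slightly ahead of encoding."""
    handoff = queue.Queue(maxsize=depth)  # bounded buffer between the stages

    def producer():
        for frame in frames:
            handoff.put((frame, analyze(frame)))  # blocks when the buffer is full
        handoff.put(None)  # sentinel: no more frames

    worker = threading.Thread(target=producer)
    worker.start()

    encoded = []
    while True:
        item = handoff.get()
        if item is None:
            break
        frame, analysis = item
        encoded.append(encode(frame, analysis))
    worker.join()
    return encoded


print(pipelined_encode([[1, 2, 3], [4, 4]]))
```

Because frames pass through the queue in order, the output is the same as a plain loop would produce, which is much of the debugging appeal the author describes. Note that Python threads won't speed up CPU-bound work like this; the sketch shows only the structure.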
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.
The rev-E tweaks offer no advantage when doing CBR encoding, but they help the Venice 3800+ shave about a second off of the Newcastle’s VBR encode times. That’s not quite enough to help the 3800+ catch the P4 660, though.
We used the Xmpeg/DivX combo to convert a DVD .VOB file of a movie trailer into DivX format. Like LAME MT, this application is only dual threaded.
Windows Media Encoder video encoding
We asked Windows Media Encoder to convert a gorgeous 1080-line WMV HD video clip into a 640×460 streaming format using the Windows Media Video 8 Advanced Profile codec.
The Venice core is no faster than Newcastle with Xmpeg, but it blows away the Newcastle with Windows Media Encoder’s Advanced Profile codec. I wish I could tell you whether this big speed boost was likely the result of SSE3 instructions being used or some other rev-E tweak, but I can’t. Whatever the case, it’s enough to make the Venice 3800+ transcode this video clip faster than any other Athlon 64, but it’s not enough to catch the Pentium 4 660.
We’re using the 64-bit beta version of ScienceMark for these tests, and several of its components are multithreaded. ScienceMark author Alexander Goodrich says this about the Molecular Dynamics simulation:
Molecular Dynamics is lightly multithreaded – one thread takes care of U/I aspects, and the other thread takes care of the computation. The computation itself is not multithreaded, though Tim and I were looking into ways of changing the algorithm to support multi-threading programming a couple years ago – it’s a lot of effort, unfortunately. When MD [is] running there [is] a total of 2 threads for the process.
Here are the results:
The Primordia test “calculates the Quantum Mechanical Hartree-Fock Orbitals for each electron in any element of the periodic table.” Alex says this about it:
Primordia is multithreaded. Two main tasks occur which allow this to happen. Essentially, we identified 2 parallel tasks that could be done. We could probably take this a step further and optimize it even more. There is an issue, however, with the Pentium Extreme Edition that we’ve identified. The second computation thread gets executed on the logical HT thread rather than the 2nd core, so performance isn’t as good as it could be. This will be fixed in the next revision. This doesn’t effect [sic] the regular Pentium D. A workaround could include disabling HT on Pentium EE. There are 3 threads for primordia – 2 threads for computation, 1 thread for U/I.
The next two tests are only single-threaded, and they don’t make as good use of any of the CPUs here as they could if they were better optimized. The ScienceMark team has plans to incorporate linear algebra libraries from Intel and AMD in order to boost performance.
Of the five ScienceMark tests we’re considering, only the Blas DGEMM score shows any appreciable difference between Newcastle and Venice.
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The one of interest to us is the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX and SSE/2. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:
This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm. The benchmark is multi-threaded for up to 64 CPUs maximum on SMP systems. This works by interlacing, i.e. each thread computes the next column not being worked on by other threads. Sandra creates as many threads as there are CPUs in the system and assignes [sic] each thread to a different CPU.
We’re using the 64-bit port of Sandra. The “Integer x16” version of this test uses integer numbers to simulate floating-point math. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations at once.
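The interlacing scheme in SiSoft's description is easy to picture in code. The Python sketch below is only an illustration under our own assumptions: it uses a fixed column stride per thread (thread 0 takes columns 0, N, 2N, and so on), whereas Sandra may hand out columns dynamically, and the coordinate mapping and function names are ours. It also makes no attempt at SIMD.

```python
from concurrent.futures import ThreadPoolExecutor


def escape_count(cx, cy, max_iter=255):
    """Iterate z = z*z + c and return how many steps pass before |z| > 2."""
    zx = zy = 0.0
    for step in range(max_iter):
        zx, zy = zx * zx - zy * zy + cx, 2.0 * zx * zy + cy
        if zx * zx + zy * zy > 4.0:
            return step
    return max_iter


def columns_for(worker, workers, width):
    """Columns assigned to one worker under a fixed interleave (our assumption)."""
    return range(worker, width, workers)


def render(width, height, workers, max_iter=255):
    """Render the fractal with columns interleaved across worker threads."""
    image = [[0] * width for _ in range(height)]

    def work(worker):
        for x in columns_for(worker, workers, width):
            cx = -2.0 + 3.0 * x / width
            for y in range(height):
                cy = -1.0 + 2.0 * y / height
                image[y][x] = escape_count(cx, cy, max_iter)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(workers)))  # wait for every worker to finish
    return image


# A tiny frame keeps the demo quick; the benchmark's frame is 640x480.
small = render(16, 12, workers=4)
print(len(small), len(small[0]))
```

Interleaving matters because columns near the middle of the fractal cost far more iterations than those at the edges, so handing each thread a contiguous block would leave some threads idle early.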
The Venice 3800+ is a hair’s breadth faster than the Newcastle 3800+ here, but the Pentium 4 is well ahead of them both.
Sphinx speech recognition
Ricky Houghton first brought us the Sphinx benchmark through his association with speech recognition efforts at Carnegie Mellon University. Sphinx is a high-quality speech recognition routine. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance. However, the versions of Sphinx we’re using are only single-threaded.
Here’s another case where the Venice core’s enhancements will raise some eyebrows. The Venice 3800+ is the fastest single-core Athlon 64 here, and the dual-core Athlon 64 X2 4800+, which also runs at 2.4GHz, turns in a nearly identical time.
picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA.
At our request, Dr. Müller added larger image sizes to this latest build of picCOLOR. We were concerned that the thread creation overhead on the test’s rather small default image size would overshadow the benefits of threading. Dr. Müller has also made picCOLOR’s multithreading more extensive. Eight of the 12 functions in the test are now multithreaded.
Scores in picCOLOR, by the way, are indexed against a single-processor Pentium III 1GHz system, so that a score of 4.14 works out to 4.14 times the performance of the reference machine.
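An indexed score of this sort is just a ratio against the reference machine. The Python sketch below assumes that performance is the inverse of elapsed time; the function name and the timings are hypothetical, and we don't know how picCOLOR combines its individual functions into an overall number.

```python
def indexed_score(reference_seconds, measured_seconds):
    """Performance relative to a reference machine, where 1.0 means parity.

    Assumption: performance is inversely proportional to elapsed time.
    """
    if measured_seconds <= 0:
        raise ValueError("measured time must be positive")
    return reference_seconds / measured_seconds


# Hypothetical timings: the reference takes 41.4s, the test system 10.0s.
print(round(indexed_score(41.4, 10.0), 2))
```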
Venice only has a minor edge on Newcastle here, but it’s just enough to propel the 3800+ past the Pentium 4 660.
We measured the power consumption of our entire test systems, except for the monitor, at the wall outlet using a Watts Up PRO watt meter. The test rigs were all equipped with OCZ PowerStream 520W power supply units. The idle results were measured at the Windows desktop, and we used SMPOV and the 64-bit version of the POV-Ray renderer to load up the CPUs. In all cases, we asked SMPOV to use the same number of threads as there were CPU front ends in Task Manager, so four for the dual Opteron 252, four for the Pentium XE 840, two for the Opteron 175, and so on.
The graphs below have results for “power management” and “no power management.” That deserves some explanation. By “power management,” we mean SpeedStep or PowerNow/Cool’n’Quiet. (In the case of the Pentium 4 600-series processors and the XE 840, the C1E halt state is always active, even in the “no power management” tests.) Sadly, the beta BIOS we used for our Tyan S2895 motherboard didn’t support AMD’s PowerNow, so we couldn’t report scores for the Opterons with power management enabled. Similarly, the beta BIOS for our Asus A8N-SLI Deluxe mobo wouldn’t support Cool’n’Quiet, which is PowerNow with a different name, on the Athlon 64 X2 processors. AMD says all of its dual-core chips will support power management once the proper BIOS support becomes available.
The Venice 3800+ practically sips power compared to the rest of the group. The Venice-based test rig consumes 38W less power under load than the same system with a Newcastle 3800+ installed.
So does this thing overclock like a Swiss watchmaker on meth, or what? Well, in the world of Venice overclocking, our particular 3800+ is apparently something of a runt. Its max stable overclock was 2.7GHz, or “only” 300MHz above the stock clock speed, at 1.5625V. That’s a bit of a disappointment for a Venice core, given that seemingly every other hardware site on the planet has hit 2.8GHz with ease. Our Venice chip would POST and boot into Windows at up to 2808MHz with 1.575V, but it would error out in Prime95 immediately. Lowering the clock speed and/or bumping up the voltage to as high as 1.6V was no help; 2.7GHz was the limit for stability.
Perhaps we saw relatively lackluster results because we used Prime95’s stringent torture test with small FFTs, which will cause a shaky CPU to throw an error even when it will run other programs with no apparent problems. We also used Windows XP x64 Edition for testing, which makes use of transistors that would be dormant in a 32-bit OS, potentially exposing a problem. Or maybe our chip was just a dog. Whatever the cause, 2.7GHz was the practical limit for us.
For benchmarking, I set the RAM speed ratio at the DDR400 setting, which yielded 450MHz memory, given the overclocked HyperTransport link and memory controller. I also bumped down the HyperTransport multiplier to 3X in order to ensure stability. Our Corsair XMS Pro 3200XL DIMMs were quite happy at that 450MHz with 2-8-3-3 timings, but I had to use a 2T command rate. The system simply wouldn’t boot into Windows with a 1T command rate and 450MHz RAM, even at a CAS latency of 3. I also tried running the RAM underclocked slightly with very tight timings, but the 450MHz RAM produced the highest scores. Here’s how the Venice 3800+ performs with that config.
Not too bad, although the Venice core’s smaller cache and slower memory command rate prevent it from catching up to the Athlon 64 FX-55. Our Venice chip isn’t quite the overclocker that some are, but it’s still extremely quick at 2.7GHz.
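The clock arithmetic behind that overclocked config can be sketched in Python. This assumes the usual Athlon 64 arrangement, which the review doesn't spell out: a 200MHz reference clock at stock, a fixed 12X CPU multiplier (2.4GHz divided by 200MHz), and memory running in step with the reference clock at the DDR400 setting. The function and key names are ours.

```python
STOCK_CPU_MHZ = 2400
STOCK_REF_MHZ = 200  # assumption: standard Athlon 64 reference clock
CPU_MULTIPLIER = STOCK_CPU_MHZ // STOCK_REF_MHZ  # 12X for a 2.4GHz part


def overclock_summary(target_cpu_mhz, ht_multiplier=3):
    """Derive reference, memory, and HyperTransport clocks for a CPU target.

    Assumption: at the DDR400 setting the memory clock equals the
    reference clock, so the effective DDR rate is twice the reference.
    """
    reference = target_cpu_mhz / CPU_MULTIPLIER
    return {
        "reference_mhz": reference,
        "ddr_rate_mhz": 2 * reference,
        "ht_link_mhz": ht_multiplier * reference,
        "gain_pct": 100 * (target_cpu_mhz - STOCK_CPU_MHZ) / STOCK_CPU_MHZ,
    }


print(overclock_summary(2700))
```

Under those assumptions, 2.7GHz works out to a 225MHz reference clock, 450MHz memory, and a 675MHz HyperTransport link with the 3X multiplier, for a 12.5% gain over stock.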
The Venice core typically offers a bit of a performance advantage over the previous Newcastle core, and it does so while consuming less power at idle and under load than any of the processors we’ve tested alongside it. In some cases, like 3DMark05’s CPU test, Sphinx speech recognition, and video encoding with Windows Media Encoder Advanced Profile, the Venice version of the Athlon 64 3800+ even outruns the older-rev Athlon 64 FX-55. Such events may become more common in the future, as SSE3 becomes more widely used. Better yet, the revision E enhancements are proliferating up and down AMD’s product line, from the low-end Athlon 64 3000+ to the dual-core Athlon 64 X2 4800+. The particular chip we’ve looked at today, the Athlon 64 3800+, lists for $373 and is available at online vendors for slightly less than that. More often than not, the 3800+ proves faster than the Pentium 4 660, which lists at $605. The Pentium 4 650 is a more reasonable $401, but it’s bound to look rather pokey next to the 3800+. So the Venice 3800+ is a solid deal.
However, if you’re willing to wait a while, Intel may have an offer you can’t resist. The Pentium D 830, a dual-core processor running at 3GHz, will list for only $316. No, even its big brother, the Pentium D 840 at 3.2GHz, doesn’t slice through today’s single-threaded games as well as the 3800+, but many of our benchmarks have illustrated the formidable potential of having a second CPU core onboard. Add to that the creamy multitasking smoothness that comes with having multiple CPUs in a system, and you’ve got a tough choice on your hands. I’m not sure yet which I’d choose, honestly, but I’m leaning toward the dually.
The Venice core’s ace in the hole may be its overclocking potential. If you don’t care about dual cores, that probably makes some flavor of Venice-based Athlon 64 the way to go. Our “below average” 3800+ chip hit a rock-solid stable 2.7GHz on air-cooling, and others have reported even better results. Of course, if you want a really monster overclock, you may want to start with the slower, cheaper versions of the Venice core like the 3000+, 3200+, or 3500+. Those chips probably have even more built-in headroom by virtue of their lower default clock speeds, and, hey, they’re cheaper. Machiavelli would be proud.