The truth is somewhat different from both of these visions. 64-bit computing won’t bring us two times the performance in an amazing overnight transformation of the PC, as the move from 8 bits to 16 seemed to do back in the day. But it’s not a pointless exercise in shuffling bits, either. The 64-bit extensions to the venerable x86 instruction set architecture (ISA), including AMD64 and Intel’s code-compatible EM64T, actually offer some tangible benefits with few drawbacks. These extensions to the x86 ISA offer a much larger memory address space, bring a cleaner programming model with performance benefits, and retain backward compatibility with existing 32-bit applications.
In order to help you navigate through the hype, we nabbed a pair of 64-bit processors from AMD and Intel and tested them with the latest release candidate of the 64-bit version of Windows XP. Read on for our take on the move to 64 bits, including a look at the performance of the latest CPUs in Windows XP Pro x64 Edition with both 32 and 64-bit applications.
The essence of the move to 64-bit computing is a set of extensions to the x86 instruction set pioneered by AMD and now known as AMD64. During development, they were sensibly called x86-64, but AMD decided to rename them to AMD64, probably for marketing reasons. In fact, AMD64 is also the official name of AMD’s K8 microarchitecture, just to keep things confusing. When Intel decided to play ball and make its chips compatible with the AMD64 extensions, there was little chance they would advertise their processors “now with AMD64 compatibility!” Heart attacks all around in the boardroom. And so EM64T, Intel’s carbon copy of AMD64 renamed to Intel Extended Memory 64 Technology, was born.
The difference in names obscures a distinct lack of difference in functionality. Code compiled for AMD64 will run on a processor with EM64T and vice versa. They are, for our purposes, the same thing.
Whatever you call ’em, 64-bit extensions are increasingly common in newer x86-compatible processors. Right now, all Athlon 64 and Opteron processors have x86-64 capability, as do Intel’s Pentium 4 600 series processors and newer Xeons. Intel has pledged to bring 64-bit capability throughout its desktop CPU line, right down into the Celeron realm. AMD hasn’t committed to bringing AMD64 extensions to its Sempron lineup, but one would think they’d have to once the Celeron makes the move.
For some time now, various flavors of Linux compiled for 64-bit processors have been available, but Microsoft’s version of Windows for x86-64 is still in beta. That’s about to change, at long last, in April. Windows XP Professional x64 Edition, as it’s called, is finally upon us, as are server versions of Windows with 64-bit support. (You’ll want to note that these operating systems are distinct from Windows XP 64-bit Edition, intended for Intel Itanium processors, which is a whole different ball of wax.) Windows x64 is currently available to the public as a Release Candidate 2, and judging by our experience with it, it’s nearly ready to roll. Once the Windows XP x64 Edition hits the stores, I expect that we’ll see the 64-bit marketing push begin in earnest, and folks will want to know more about what 64-bit computing really means for them.
The immediate impact, in a positive sense, isn’t much at all. Windows x64 can run current 32-bit applications transparently, with few perceptible performance differences, via a facility Microsoft has dubbed WOW64, for Windows on Windows 64-bit. WOW64 allows 32-bit programs to execute normally on a 64-bit OS. Using Windows XP Pro x64 is very much like using the 32-bit version of Windows XP Pro, with the same basic look and feel. Generally, things just work as they should.
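One practical consequence of WOW64 is that a program's own "bitness" is separate from the operating system's: a 32-bit executable still sees 32-bit pointers even when Windows x64 is hosting it. As a rough illustration (our sketch, not anything Microsoft ships), a Python script can report the pointer width of the process it is running in. The function names here are ours.

```python
import struct
import sys


def pointer_bits() -> int:
    """Return the width of a native pointer in this process, in bits."""
    # "P" is the struct format code for a C void*; it is 4 bytes in a
    # 32-bit process and 8 bytes in a 64-bit process.
    return struct.calcsize("P") * 8


def is_64bit_process() -> bool:
    """True when the interpreter itself was built as a 64-bit program."""
    # sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds.
    return sys.maxsize > 2**32


if __name__ == "__main__":
    print(f"pointer width: {pointer_bits()} bits")
    print(f"64-bit process: {is_64bit_process()}")
```

A 32-bit interpreter running through WOW64 should report 32 bits here; only a natively compiled 64-bit build reports 64.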
There are differences, though. Device drivers, in particular, must be recompiled for Windows x64. The 32-bit versions won’t work. In many cases, Windows x64 ships with drivers for existing hardware. We were able to test on the Intel 925X and nForce4 platforms without any additional chipset drivers, for example. In other cases, we’ll have to rely on hardware vendors to do the right thing and release 64-bit drivers for their products. Both RealTek and NVIDIA, for instance, supply 64-bit versions of their audio and video drivers, respectively, that share version numbers and feature sets with the 32-bit equivalents, and we were able to use them in our testing. ATI has a 64-bit beta version of its Catalyst video drivers available, as well, but not all hardware makers are so on the ball.
Some other types of programs won’t make the transition to Windows x64 seamlessly, either. Microsoft ships WinXP x64 with two versions of Internet Explorer, a 32-bit version and a 64-bit version. The 32-bit version is the OS default because nearly all ActiveX controls and the like are 32-bit code, and where would we be if we couldn’t execute the full range of spyware available to us? Similarly, some system-level utilities and programs that do black magic with direct hardware access are likely to break in the 64-bit version of Windows. There will no doubt be teething pains and patches required for certain types of programs, despite Microsoft’s best efforts.
Of course, many applications will be recompiled as native 64-bit programs as time passes, and those 64-bit binaries will only be compatible with 64-bit processors and operating systems. Those applications should benefit in several ways from making the transition.
The 64-bit advantage
When AMD’s design team created the x86-64 ISA, they tackled several inherent deficiencies of the old x86 ISA. First and foremost among those was a very basic limitation of accessing memory with 32-bit addresses: the sum total of memory one can address at one time with a 32-bit number is 4GB. That may sound like a lot of memory for the average desktop PC, but then again, not every PC is average, and the x86 ISA is increasingly becoming the platform of choice for technical workstations and servers, as well. As memory densities increase over time thanks to the happy benefits of Moore’s Law, that 4GB limit is beginning to look smaller and smaller.
Not only that, but the practical effects of 32-bit addressing are even more constraining. By default, Windows XP limits applications to 2GB of memory space and reserves 2GB for system-level tasks. (It is possible for x86 systems to address more than 4GB of total memory using a mechanism called Physical Address Extension, created by Intel. In fact, some server versions of Windows allow up to 128GB of physical RAM in a 32-bit system. However, PAE uses a paging scheme that isn’t generally considered an ideal way of doing things.)
Meanwhile, certain types of user data sets are growing constantly, from ever-higher resolutions in digital cameras to HD video streams to video games capable of taking advantage of 512MB of RAM on a graphics card. Scientific computing and technical workstations are already hitting their heads on 32-bit addressing limitations with regularity.
By moving to a 64-bit addressing scheme, the possible address space grows exponentially from 2^32 to 2^64, so that the x86-64 ISA allows for what seems like a practically unlimited amount of memory. The theoretical peak size of a 64-bit address space is 16 exabytes, an extremely large number. Current AMD64 processors allow up to 40 bits of physical address space, or one terabyte, and up to 48 bits of virtual address space, or 256TB. Initial versions of WinXP x64 will support as much as 128GB of physical RAM and up to 16 terabytes of virtual memory. The upper limits of the Windows system cache size grow from 1GB in 32 bits to 1TB in 64 bits, a thousand-fold increase. WinXP x64 even takes advantage of the additional headroom for 32-bit apps, giving each one up to 4GB of its own space.
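The address-space figures above all fall out of the same arithmetic: n address bits can name 2^n distinct bytes. Here is a small sketch that reproduces them. The helper names are ours, and the units are the binary ones (1KB = 1024 bytes) that these figures assume.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 2 ** bits


def human(n: int) -> str:
    """Format a byte count using binary units (1KB = 1024 bytes)."""
    unit = 0
    while n >= 1024 and unit < len(UNITS) - 1:
        n //= 1024  # exact for the powers of two used here
        unit += 1
    return f"{n}{UNITS[unit]}"


if __name__ == "__main__":
    for label, bits in [
        ("32-bit addresses", 32),
        ("AMD64 physical (40 bits)", 40),
        ("AMD64 virtual (48 bits)", 48),
        ("full 64-bit addresses", 64),
    ]:
        print(f"{label}: {human(addressable_bytes(bits))}")
```

Run as a script, this prints 4GB, 1TB, 256TB, and 16EB, matching the numbers quoted above.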
In short, the move to 64 bits removes the memory address space constraints of the old x86 ISA, granting PCs room to grow for quite some time. This change alone won’t bring performance benefits, except in cases where the amount of memory is a performance-constraining factor, but it’s still probably the most important benefit of x86-64 overall.
x86: registered offender
Another problem with the x86 ISA is the number of general-purpose registers (GPRs) available. Registers are fast, local slots inside a processor where programs can store values. Data stored in registers is quickly accessible for reuse, and registers are even faster than on-chip cache. The x86 ISA only provides eight general-purpose registers, and thus is generally considered register-poor. Most reasonably contemporary ISAs offer more. The PowerPC 604 RISC architecture, to give one example, has 32 general-purpose registers. Without a sufficient number of registers for the task at hand, x86 compilers must sometimes direct programs to spend time shuffling data around in order to make the right data available for an operation. This creates overhead that slows down computation.
To help alleviate this bottleneck, the x86-64 ISA brings more and better registers to the table. x86-64 packs 8 more general-purpose registers, for a total of 16, and they are no longer limited to 32-bit values: all 16 can store 64-bit datatypes. In addition to the new GPRs, x86-64 also includes 8 new 128-bit SSE/SSE2 registers, for a total of 16 of those. These additional registers bring x86 processors up to snuff with the competition, and they will quite likely bring the largest performance gains of any aspect of the move to the x86-64 ISA.
What is the magnitude of those performance gains? Well, it depends. Some tasks aren’t constrained by the number of registers available now, while others will benefit greatly when recompiled for x86-64 because the compiler will have more slots for local data storage. The amount of “register pressure” presented by a program depends on its nature, as this paper on 64-bit technical computing with Fortran explains:
The performance gains from having 16 GPRs available will vary depending on the complexity of your code. Compute-intensive applications with deeply nested loops, as in most Fortran codes, will experience higher levels of register pressure than simpler algorithms that follow a mostly linear execution path.
So, as they say, your mileage may vary. Sometimes, 64-bit programs will see little or no performance advantage over 32-bit versions of the same. In other cases, the performance increase could be substantial. We will, of course, test that theory in the following pages.
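To make "register pressure" a little more concrete, here is a deliberately simplified toy model, not how any real compiler allocates registers. Each value a program computes has a hypothetical live range, from the point where it is created to the point where it is last used. If more values are live at once than there are registers, the excess has to spill to memory. The function names and the sample live ranges are invented for illustration.

```python
def peak_pressure(live_ranges):
    """Largest number of values that are live at the same time.

    Each live range is a (start, end) pair; `end` is exclusive.
    """
    events = []
    for start, end in live_ranges:
        events.append((start, 1))   # value becomes live
        events.append((end, -1))    # value is no longer needed
    # Tuples sort by position first; at equal positions the -1 events come
    # before the +1 events, so a dying value frees its slot first.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def spills_at_peak(live_ranges, num_registers):
    """Values that cannot be kept in registers at the busiest point."""
    return max(0, peak_pressure(live_ranges) - num_registers)


if __name__ == "__main__":
    # Hypothetical inner loop that keeps 12 values alive simultaneously.
    loop_values = [(i, 100) for i in range(12)]
    print("peak pressure:", peak_pressure(loop_values))
    print("spills with 8 GPRs:", spills_at_peak(loop_values, 8))
    print("spills with 16 GPRs:", spills_at_peak(loop_values, 16))
```

In this made-up case, four values would have to live in memory with eight registers, and none with sixteen. Real compilers have less to work with than the raw count suggests, since some registers (the stack pointer, for one) are reserved, but the basic shape of the problem is the same.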
Declaring war on alphabet soup
The final major problem with the x86 ISA is a programming model cluttered by an alphabet soup of overlapping instruction set extensions that aren’t entirely necessary or, in the case of some legacy instructions, particularly efficient. MMX, 3DNow!, x87, SSE, SSE2, and SSE3 extensions all hang off of the original x86 ISA, overlapping in many cases. x86-64 cleans things up by adopting SSE and SSE2 as part of its core set of instructions and jettisoning MMX, 3DNow!, and the x87 FPU. SSE/2 instructions can duplicate the functionality of those other instruction sets, and as a result, WinXP x64 doesn’t carry over the registers for the FPU and MMX during context switches in 64-bit mode. MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps. (SSE3, the newest of the extensions, will likely be supported by all 64-bit processors in the near future, because AMD is expected to add SSE3 to the AMD64 architecture very soon. I’d expect SSE3 to work in 64-bit mode.)
The x87 FPU has long been considered a weakness of x86 CPU architectures compared to competing RISC designs, and x86 processors have indeed had weak FPU performance, relatively speaking. SSE2 exchanges the x87’s stack-based programming model for a more modern one, a potential boon for floating-point math performance. SSE2 also replaces the x87’s IEEE 80-bit precision with the choice of either IEEE 32-bit or 64-bit floating-point math. As a result, x86-64 processors running in 64-bit mode will produce floating-point results more like those of most RISC CPUs, but those results will vary slightly from the answers produced by legacy programs that use the x87 FPU due to the difference in precision.
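The point about precision is easy to demonstrate in miniature. Plain Python can't portably reproduce the x87's 80-bit format, so the sketch below substitutes a different pair of precisions, IEEE 32-bit versus 64-bit, purely as a stand-in: the same sum, rounded to a different width after each step, lands on a slightly different answer. The helper names are ours.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (IEEE 64-bit) to the nearest IEEE 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_in_float32(values):
    """Add values, rounding to 32-bit precision after every step."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


def sum_in_float64(values):
    """Add values at Python's native 64-bit precision."""
    total = 0.0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    values = [0.1] * 1000  # the mathematically exact sum would be 100
    print("32-bit accumulation:", sum_in_float32(values))
    print("64-bit accumulation:", sum_in_float64(values))
```

Neither result is exactly 100, and the two disagree with each other. The same kind of small discrepancy is what one should expect between x87 code that keeps 80-bit intermediates and SSE2 code that works in 64 bits.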
Because of the move to the 64-bit ISA and the elimination of MMX, 3DNow! and the x87 FPU, Windows applications that include inline assembly code will not compile on Windows x64. That means applications, including games, that include segments of hand-tuned inline assembly code may have to sacrifice their optimizations when being ported to 64 bits. During the transition period between 32 and 64 bits, this reality may be a bit of a counterweight against the performance advantages that x86-64’s extra registers provide. One could see how 32-bit native games or similar applications with lots of optimizations might perform better than their 64-bit equivalents. However, the move to clean up the x86 programming model will almost surely pay dividends in the long run in terms of simplicity of development, ease of optimization, and even outright performance.
Weighing the benefits of 64 bits
Now that we’ve sorted through the theory about 64-bit performance, it’s time to take a look at the current reality. Neither Windows XP Pro x64 Edition nor the handful of 64-bit applications and device drivers we used are yet finished products, but as you’ll see, their performance indicates relative maturity. With that mild caveat in mind, we’ll attempt to explore answers to several questions. Among them: How do 32-bit applications perform on Windows x64? What are the performance benefits of running 64-bit code on a 64-bit OS? And how do the Intel and AMD implementations of x86-64 compare? Do they offer similar performance deltas in the move to 64 bits, or does one demonstrate obvious superiority over the other?
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.
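For readers who want to apply the same run-it-several-times-and-average approach to their own workloads, a minimal timing harness might look like the following. This is a generic sketch, not the tooling we used; `average_runtime` and the sample workload are hypothetical names.

```python
import time


def average_runtime(task, runs=3):
    """Call `task` `runs` times and return the mean wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


if __name__ == "__main__":
    # Hypothetical workload standing in for a real benchmark run.
    def workload():
        return sum(i * i for i in range(200_000))

    print(f"mean of 3 runs: {average_runtime(workload):.4f} s")
```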
Our test systems were configured like so:
|Processor||Athlon 64 4000+ 2.4GHz||Athlon 64 4000+ 2.4GHz||Pentium 4 660 3.6GHz||Pentium 4 660 3.6GHz|
|System bus||1GHz HyperTransport||1GHz HyperTransport||800MHz (200MHz quad-pumped)||800MHz (200MHz quad-pumped)|
|Motherboard||DFI LANParty nF4 SLI-DR||DFI LANParty nF4 SLI-DR||Intel D925XECV2||Intel D925XECV2|
|North bridge||nForce4 SLI||nForce4 SLI||925XE MCH||925XE MCH|
|Chipset drivers||SMBus driver 4.45, IDE driver 5.18||OS integrated||INF Update 126.96.36.1997||OS integrated|
|Memory size||1GB (2 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)||1GB (2 DIMMs)|
|Memory type||OCZ PC3200 EL DDR SDRAM at 400MHz||OCZ PC3200 EL DDR SDRAM at 400MHz||OCZ PC2 5300 DDR2 SDRAM at 533MHz||OCZ PC2 5300 DDR2 SDRAM at 533MHz|
|CAS latency (CL)||2||2||3||3|
|RAS to CAS delay (tRCD)||2||2||3||3|
|RAS precharge (tRP)||2||2||3||3|
|Cycle time (tRAS)||5||5||10||10|
|Hard drive||Maxtor DiamondMax 10 250GB SATA 150|
with Realtek 188.8.131.5280 drivers
with Realtek 184.108.40.20680 drivers
with Realtek 220.127.116.1134 drivers
|Graphics||GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers||GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers||GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers||GeForce 6800 Ultra 256MB PCI-E with ForceWare 71.84 drivers|
|OS||Microsoft Windows XP Professional||Windows XP Professional x64 Edition v.1433 (RC2)||Microsoft Windows XP Professional||Windows XP Professional x64 Edition v.1433 (RC2)|
|OS updates||Service Pack 2, DirectX 9.0c||Service Pack 2, DirectX 9.0c|
All tests on the Pentium 4 systems were run with Hyper-Threading enabled.
Thanks to OCZ for providing us with memory for our testing. If you’re looking to tweak out your system to the max and maybe overclock it a little, OCZ’s RAM is definitely worth considering.
The test systems’ Windows desktops were set at 1152×864 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- SiSoft Sandra 2005 SR1 10.50
- DOOM 3 1.1 with trdelta1 demo
- Far Cry 1.3 with tr3-pier demo
- Unreal Tournament 2004 v3355 with trdemo1
- The Chronicles of Riddick: Escape from Butcher Bay with trdemo3
- 3DMark05 v120
- POV-Ray for Windows 3.6.1a 32-bit
- POV-Ray for Windows 3.6 64-bit
- picCOLOR v4.0 build 532 32-bit
- picCOLOR v4.0 build 532 64-bit
- The Panorama Factory v3.3
- The Panorama Factory v3.3 AMD64 Edition Beta 3
- Blobby Dancer for AMD64 demo
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
The Chronicles of Riddick: Escape from Butcher Bay
We’ll start with gaming performance because I know many of you will be interested to see these numbers first. Our first test is a surprisingly good new game, The Chronicles of Riddick: Escape from Butcher Bay. This is one of the most visually impressive games on the PC, perhaps even better than Doom 3. The game also comes out of the box with a 64-bit executable (in addition to the standard 32-bit version) and a built-in benchmarking function. That makes it particularly useful for us, because we can test performance without running a 32-bit benchmarking utility, like FRAPS, alongside it.
We recorded our own custom demo of one of the opening levels of the game and played it back for testing. The game has an advanced rendering mode with soft shadows available on GeForce 6-series GPUs like we used in our test systems, but it really taxes the graphics card, so we bypassed it for the “SM2.0” mode, which runs fast enough to show us when performance is CPU limited.
You’ll notice that in the benchmark graphs below and those on the following pages, we have several sets of data for each CPU. Any result labeled “Win32” was run on the 32-bit version of Windows XP Pro, and anything labeled “Win64” was run on WinXP Pro x64 RC2. The tests labeled “32-bit” used 32-bit executable programs, and those labeled “64-bit” used 64-bit versions. Notice that in many cases you’ll see a mix of “Win64” and “32-bit,” when we are running a 32-bit program via WOW64 on Windows x64.
There’s nothing earth-shattering about the performance of either the AMD or Intel CPUs in 64-bit mode here. Interestingly enough, the Athlon 64 is faster running the 32-bit code on WinXP x64 than on WinXP 32-bit. The Pentium 4, meanwhile, is the opposite, losing a step or two in the 64-bit OS. Neither processor benefits tangibly from the move to 64-bit application code, unfortunately.
We’ll continue our gaming tests with a few more 32-bit games, just to see how they run on Windows XP Pro x64. Few other games have 64-bit versions that are available to the public at present, sadly. That makes our gaming tests a little bit less enlightening than the non-gaming applications that follow.
We tested performance by playing back a custom-recorded demo that should be fairly representative of most of the single-player gameplay in Doom 3.
Doom 3 doesn’t gain or lose much of anything when making the transition to the 64-bit OS. That’s good news for those who would like to make the leap.
Far Cry is an interesting case of a game that, like Riddick, ships with an AMD64 logo on the box. Unlike Riddick, its 64-bit version is long AWOL, so we have to stick to 32-bit code only.
Our Far Cry demo takes place on the Pier level, in one of those massive, open outdoor areas so common in this game. Vegetation is dense, and view distances can be very long.
Once more, no news is good news. Far Cry runs pretty much the same in WinXP x64 as it does in 32 bits.
Unreal Tournament 2004
Our UT2004 demo shows yours truly putting the smack down on some bots in an Onslaught game.
The Pentium 4 runs a few frames per second faster with the 64-bit OS, but overall, it’s safe to say that 32-bit gaming performance on WinXP x64 is now more or less equivalent to WinXP in 32 bits. That wasn’t the case with earlier revisions of the OS and video drivers, so our results show solid progress. It looks like there will be little reason for gamers to avoid making the move to WinXP x64.
The picCOLOR image processing and analysis tool is a nice example of a 32-bit application ported to 64 bits. picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA.
Comparing the 64-bit version of this program to the 32-bit version isn’t entirely straightforward, because the 32-bit version of picCOLOR for Windows includes hand-tuned inline assembly that uses MMX to accelerate the program’s Morph function. This inline assembly code doesn’t fly in 64-bit mode because MMX isn’t supported, and because inline assembly code won’t compile in Microsoft’s 64-bit compiler. As a result, the 64-bit version of picCOLOR doesn’t have this optimized code.
Fortunately, the 32-bit version of picCOLOR includes an option to disable the hand-tuned MMX code, so we can compare 32-bit and 64-bit performance in picCOLOR purely with executable binaries compiled from a high-level language (C, in this case). In the graphs below, the data set with the inline MMX assembly disabled is labeled “32-bit/No MMX.” These “no MMX” results do not include hand-tuned MMX code, but don’t let my labels fool you; the compiler may have chosen to use some MMX instructions in the executable it produced.
Both the Pentium 4 and Athlon 64 gain significantly with the 64-bit version of picCOLOR. Compared directly to the 32-bit version of the program without inline MMX assembly code, the 64-bit version of picCOLOR is quite a bit faster. In a bit of drama, the Athlon 64 4000+ manages to leapfrog the Pentium 4 660 during the move to 64 bits: the P4 is faster in 32 bits, but the Athlon 64 benefits more from using the x86-64 ISA.
The 32-bit version of picCOLOR is indeed faster with hand-tuned MMX than without, and there’s virtually no performance gained or lost when running the 32-bit version on WinXP x64. Let’s have a look at the individual functions that make up the picCOLOR benchmark sequence to see how they are affected by the transition to 64 bits.
The first function worth a mention here is Morph, which uses inline MMX. The hand-tuned code provides a major performance boost in that function, and turning it off brings a performance loss. However, we more than make up the difference elsewhere simply by recompiling the application for 64 bits. (None of the other functions use inline assembly.) Dr. Müller describes the tradeoff like so:
[Y]ou see that changing from 32 bit to 64 bit gives us a speed up of a good 30% on the AMD, less than 20% on the P4. But inline MMX gave us a factor of 2.5 just for the morph function! Now imagine all the 12 function had been hand-optimized with MMX or SSE2! We’d have some overall score of 8 or 9! Well, but we’d need another 10.000 programming hours… 🙁
The problem is that inline MMX-ing is quite some work, and switching from 32 bit to 64 bit is just re-compiling 🙂
Taking the time to port an app to 64 bits may be a very efficient means of improving performance, relatively speaking.
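A back-of-the-envelope calculation shows why a modest across-the-board gain can beat a big gain in one spot. The sketch below uses an Amdahl's-law-style model and assumes, purely for illustration, that all 12 benchmark functions take equal time in the baseline. That is our simplification, and picCOLOR's overall score may well be computed differently. The 2.5x and 30% figures are the ones Dr. Müller cites.

```python
def overall_speedup(time_shares, speedups):
    """Amdahl-style overall speedup.

    time_shares: baseline time spent in each function (any consistent unit).
    speedups:    factor by which each function gets faster (1.0 = unchanged).
    """
    old_total = sum(time_shares)
    new_total = sum(t / s for t, s in zip(time_shares, speedups))
    return old_total / new_total


if __name__ == "__main__":
    shares = [1.0] * 12  # assumption: 12 functions, equal baseline time

    # Hand-tuned MMX: one function 2.5x faster, the other 11 unchanged.
    hand_tuned = overall_speedup(shares, [2.5] + [1.0] * 11)

    # Recompile for 64 bits: every function roughly 30% faster.
    recompiled = overall_speedup(shares, [1.3] * 12)

    print(f"hand-tune one function: {hand_tuned:.3f}x overall")
    print(f"recompile everything:   {recompiled:.3f}x overall")
```

Under those assumptions, the hand-tuned function buys about a 5% overall improvement, while the recompile buys 30% for far less effort.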
Undeterred by the restrictions of working in x86-64, Dr. Müller says he may yet convert his hand-tuned MMX code to use SSE2 registers and assemble the SSE2 code separately from his C program. With luck, he hopes to see even more of a speedup in 64-bit mode.
Several of picCOLOR’s other functions get quite a bit quicker in 64 bits. Among them is the Skeleton function, which Dr. Müller describes as “a very simple function with lots of short loops, integer comparisons and array index calculations” that “[s]hould fit in any cache.” It seems quite likely that the additional general-purpose registers are being put to good use here.
The next function, Texture Orien, is “based on a 16*16 double precision DCT,” or discrete cosine transform, a bit of math commonly used in image compression algorithms like JPEG and MPEG. It’s also faster in 64 bits, especially on the Athlon 64. The rotate function with floating-point interpolation nearly doubles its performance in 64 bits, as well.
Dr. Müller suspects that the Pentium 4’s relatively strong performance in the rotate test with fixed-point interpolation is the result of the barrel shifter added to the Prescott core, but oddly, this function slows down slightly in 64 bits on the Pentium 4.
Another function that gains dramatically from x86-64 is Watershed, which Dr. Müller says “uses about 5 MBytes of stack, all integer.” He speculates that the Athlon 64’s lower memory access latencies may help it outperform the Pentium 4 in this test, but he’s unsure why the function is so much faster in 64 bits on both architectures.
The Panorama Factory
The Panorama Factory joins together multiple photographs to create ultra-wide-angle panoramic images. Because working with multiple high-resolution images at once can require a lot of memory, The Panorama Factory is also a good candidate for porting to 64 bits. We used the program’s default wizard to join together four very high res (approximately 4000×3500 pixels) images in a partial panorama.
The performance boost when going to 64 bits is dramatic. The Athlon 64 lops off almost exactly one minute from its processing time in 64-bit mode, and the Pentium 4’s gains are similar. The Panorama Factory’s timer function records the time required for each step of the process of converting our sample images into a panorama, so we can see where the speed-ups are.
The stitch function, which is the heart of this program’s capabilities, gains greatly by using the x86-64 ISA. I don’t believe the I/O functions like read and write are included in The Panorama Factory’s calculation of the overall wizard time. The crop, render blend, enhance, and improve quality functions shave off quite a bit of execution time in 64 bits on both the Pentium 4 and Athlon 64. In addition, the Athlon 64 is faster at the align and fine-tune operations in 64 bits, although the P4 doesn’t benefit as much.
POV-Ray is a ray-tracing rendering program that we’ve been using as a benchmark for ages. It’s an open-source program that is intended to be portable to multiple platforms easily, so it’s not multithreaded. There is, however, a 64-bit version available now.
We tested POV-Ray with a pair of scenes. The first one is a classic Chess scene that looks like so:
The two processors are mirror images of each other here. The Athlon 64 renders this scene ten seconds quicker in the 64-bit version of POV-Ray, while the Pentium 4 is actually slower with the 64-bit version of the renderer. With the 32-bit version of the program, the P4 gets a little faster in WinXP x64, but the Athlon 64 is slower. Talk about mixed results!
POV-Ray’s default benchmark tells a similar story. (Note, here, that the results are reported in pixels per second rather than render times.) The P4 again slows down with the 64-bit version of the program, and the Athlon 64 gets a pretty nice speed boost. I’ll be curious to see whether this pattern holds with future versions of the program or with those compiled differently.
Blobby Dancer is a graphics demo from NVIDIA that was originally a 32-bit program, but NVIDIA later ported it to x86-64. Not only is it 64 bits, but it’s funky, too!
The P4 and Athlon 64 are both able to stretch their legs in the 64-bit version of this quirky little demo.
Next up is SiSoft’s Sandra system diagnosis program, which includes a number of different benchmarks. The most interesting of those benchmarks is probably the “multimedia” benchmark, intended to show off the benefits of “multimedia” extensions like MMX and SSE. According to SiSoft’s FAQ, the benchmark actually does a fractal computation:
This benchmark generates a picture (640×480) of the well-known Mandelbrot fractal, using 255 iterations for each data pixel, in 32 colours. It is a real-life benchmark rather than a synthetic benchmark, designed to show the improvements MMX/Enhanced, 3DNow!/Enhanced, SSE(2) bring to such an algorithm.
The 64-bit port of this benchmark, of course, ought to be able to show us how x86-64 aids performance. The benchmark is also multithreaded, and should be able to take advantage of Hyper-Threading.
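For those unfamiliar with the computation the FAQ describes, here is a plain scalar sketch of a Mandelbrot escape-time render. It is not SiSoft's code: the viewport coordinates and the way iteration counts are folded into 32 colors are our assumptions, and a real multimedia benchmark would vectorize the inner loop rather than iterate one pixel at a time.

```python
def mandelbrot_iterations(cx, cy, max_iter=255):
    """Iterations of z = z*z + c before |z| exceeds 2, capped at max_iter."""
    x = y = 0.0
    for i in range(max_iter):
        if x * x + y * y > 4.0:  # |z|^2 > 4 means the point has escaped
            return i
        x, y = x * x - y * y + cx, 2.0 * x * y + cy
    return max_iter


def render(width, height, max_iter=255, colors=32):
    """Return a height x width grid of color indices for the fractal."""
    rows = []
    for py in range(height):
        cy = -1.2 + 2.4 * py / (height - 1)      # assumed viewport: y in [-1.2, 1.2]
        row = []
        for px in range(width):
            cx = -2.0 + 3.0 * px / (width - 1)   # assumed viewport: x in [-2, 1]
            row.append(mandelbrot_iterations(cx, cy, max_iter) % colors)
        rows.append(row)
    return rows


if __name__ == "__main__":
    # A small grid keeps this quick; the benchmark's picture is 640x480.
    image = render(64, 48)
    print(len(image), "rows of", len(image[0]), "pixels")
```

Every pixel is independent of its neighbors, which is what makes this workload such a natural fit for SIMD extensions and for multiple threads.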
The “Integer x16” version of this test uses integer numbers to simulate floating-point math. Oddly, the Athlon 64 is slower in the 64-bit integer test. The floating-point version of the benchmark takes advantage of SSE2 to process up to eight Mandelbrot iterations at once. The Pentium 4 has long excelled in highly parallel SSE2 tests, and this one is no exception. The additional SSE2 registers in x86-64 really appear to help, too, on both processors.
The Dhrystone test is more synthetic than the Mandelbrot test. From the FAQ:
The original Dhrystone benchmark is still widely used to measure CPU performance in industry under various versions/variants. The benchmark is designed to contain a representative sample of types of operations, mostly numerical, used by applications. Unfortunately this does not always represent a true real-life performance, but is useful to compare the speed of various CPUs.
The Dhrystone benchmark used here is a multi-threaded, 32/64-bit variant of the original one which runs under UNIX. Up to 64 CPUs in SMP systems are supported. The result is determined by measuring the time it takes to perform some sequences of instructions. Due to various changes, the result is not directly comparable with other Dhrystone benchmarks. However the MIPS (Million Instructions Per Second) should be the same for the same system (+5-10% variation) between benchmarks.
Yes, it’s MIPS, that Meaningless Indicator of Processor Speed! What kind of MIPS differences do we get with x86-64?
About that much. Again, the Pentium 4 gains more than the Athlon 64 here, but both achieve solid improvements.
Whetstone is the floating-point twin of Dhrystone; it reports results in MFLOPS, or millions of floating-point operations per second. SiSoft has created a version of Whetstone that’s vectorized for use with SSE2. The original “FPU” version most likely uses SSE/SSE2 in 64-bit mode, but in a scalar rather than vector fashion.
As in Dhrystone, so in Whetstone; compiling for x86-64 produces higher performance.
These early benchmarks indicate that the x86-64 ISA holds significant promise for better performance when applications are ported to it. The benefits aren’t uniform or universal, but they can be fairly compelling. For technical and scientific computing, the combination of additional registers, a cleaner programming model, and a larger memory address space adds up to a slam dunk. 64 bits is the way to go. The same is likely true for servers.
For PC enthusiasts and gamers, moving to 64 bits may not present as many obvious advantages in the near term, but there’s also very little apparent penalty in going with Windows XP Pro x64, even if it’s only to run 32-bit applications. All of our gaming tests showed very little performance delta between WinXP and WinXP x64, and the same was generally true for other apps. Just make sure that 64-bit device drivers are available for your hardware.
One question that our testing hasn’t answered is whether or not 64-bit versions of popular games will really bring notable performance gains. Judging by our experience with the Riddick game, it’s hard to be terribly optimistic on this front. 64-bit games do hold promise down the road, when really large textures and very complex worlds eat up more than 4GB of total RAM, but that day is still a long way off.
As for the issue of whether the Athlon 64 or the Pentium 4 stands to gain more from 64-bit apps, well, I think the jury is still out. The applications we’ve tested have been all over the map on that question, and I’d hate to venture a guess. The best news, though, is that the typical scenario seems to involve solid performance increases on both architectures with 64-bit programs, if there is any performance increase at all. That makes sense, because both microarchitectures have dedicated transistors to the x86-64 ISA’s additional register space, and those new registers are the key to better performance.