The 64-bit advantage
When AMD's design team created the x86-64 ISA, they tackled several inherent deficiencies of the old x86 ISA. First and foremost among those was a very basic limitation of accessing memory with 32-bit addresses: the sum total of memory one can address at one time with a 32-bit number is 4GB. That may sound like a lot of memory for the average desktop PC, but then again, not every PC is average, and the x86 ISA is increasingly becoming the platform of choice for technical workstations and servers, as well. As memory densities increase over time thanks to the happy benefits of Moore's Law, that 4GB limit is beginning to look smaller and smaller.
Not only that, but the practical effects of 32-bit addressing are even more constraining. By default, Windows XP limits applications to 2GB of memory space and reserves 2GB for system-level tasks. (It is possible for x86 systems to address more than 4GB of total memory using a mechanism called Physical Address Extension, created by Intel. In fact, some server versions of Windows allow up to 128GB of physical RAM in a 32-bit system. However, PAE uses a paging scheme that generally isn't considered an ideal way of doing things.)
Meanwhile, certain types of user data sets are growing constantly, from ever-higher resolutions in digital cameras to HD video streams to video games capable of taking advantage of 512MB of RAM on a graphics card. Scientific computing and technical workstations are already hitting their heads on 32-bit addressing limitations with regularity.
By moving to a 64-bit addressing scheme, the possible address space grows exponentially from 2^32 to 2^64, so that the x86-64 ISA allows for what seems like a practically unlimited amount of memory. The theoretical peak size of a 64-bit address space is 16 exabytes, an extremely large number. Current AMD64 processors allow up to 40 bits of physical address space, or one terabyte, and up to 48 bits of virtual address space, or 256TB. Initial versions of WinXP x64 will support as much as 128GB of physical RAM and up to 16 terabytes of virtual memory. The upper limits of the Windows system cache size grow from 1GB in 32 bits to 1TB in 64 bits, a thousand-fold increase. WinXP x64 even takes advantage of the additional headroom for 32-bit apps, giving each one up to 4GB of its own space.
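For readers who want to check the arithmetic behind those figures, here is a minimal sketch in Python. The helper names (address_space, human) are made up for illustration, and the units are assumed to be binary ones (1GB = 1024^3 bytes), which is the convention the numbers above appear to follow.

```python
# Sizes of the address spaces for the bit widths mentioned in the text.
# Units are binary: each step up is a factor of 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]


def address_space(bits):
    """Return the number of bytes addressable with `bits` address bits."""
    return 2 ** bits


def human(n_bytes):
    """Format a byte count using the largest binary unit that fits."""
    idx = 0
    while n_bytes >= 1024 and idx < len(UNITS) - 1:
        n_bytes //= 1024  # exact here, since the inputs are powers of two
        idx += 1
    return f"{n_bytes}{UNITS[idx]}"


for bits in (32, 40, 48, 64):
    print(bits, human(address_space(bits)))
# prints:
# 32 4GB
# 40 1TB
# 48 256TB
# 64 16EB
```

Each added address bit doubles the reachable memory, which is why going from 32 to 64 bits multiplies the address space by 2^32 rather than by two.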
In short, the move to 64 bits removes the memory address space constraints of the old x86 ISA, granting PCs room to grow for quite some time. This change alone won't bring performance benefits, except in cases where the amount of memory is a performance-constraining factor, but it's still probably the most important benefit of x86-64 overall.
x86: registered offender
Another problem with the x86 ISA is the number of general-purpose registers (GPRs) available. Registers are fast, local slots inside a processor where programs can store values. Data stored in registers is quickly accessible for reuse, and registers are even faster than on-chip cache. The x86 ISA only provides eight general-purpose registers, and thus is generally considered register-poor. Most reasonably contemporary ISAs offer more. The PowerPC 604 RISC architecture, to give one example, has 32 general-purpose registers. Without a sufficient number of registers for the task at hand, x86 compilers must sometimes direct programs to spend time shuffling data around in order to make the right data available for an operation. This creates overhead that slows down computation.
The x86-64 ISA doubles the number of GPRs to 16, which should cut down on that shuffling. What is the magnitude of the resulting performance gains? Well, it depends. Some tasks aren't constrained by the number of registers available now, while others will benefit greatly when recompiled for x86-64 because the compiler will have more slots for local data storage. The amount of "register pressure" presented by a program depends on its nature, as this paper on 64-bit technical computing with Fortran explains:
The performance gains from having 16 GPRs available will vary depending on the complexity of your code. Compute-intensive applications with deeply nested loops, as in most Fortran codes, will experience higher levels of register pressure than simpler algorithms that follow a mostly linear execution path.
So, as they say, your mileage may vary. Sometimes, 64-bit programs will see little or no performance advantage over 32-bit versions of the same. In other cases, the performance increase could be substantial. We will, of course, test that theory in the following pages.
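To get a rough feel for register pressure, consider the toy model below. It is only a sketch under loud assumptions: the function count_memory_loads, the least-recently-used eviction policy, and the 12-value working set are all invented for illustration. Real compilers use far more sophisticated register allocators, and not all of the x86 GPRs are freely available to them.

```python
from collections import OrderedDict


def count_memory_loads(accesses, num_registers):
    """Count loads from memory in a toy register file with LRU eviction.

    A value already held in a register is free to reuse. Any other value
    must be loaded from memory, evicting the least recently used register
    when the register file is full.
    """
    registers = OrderedDict()
    loads = 0
    for name in accesses:
        if name in registers:
            registers.move_to_end(name)  # mark as most recently used
        else:
            loads += 1
            if len(registers) >= num_registers:
                registers.popitem(last=False)  # evict least recently used
            registers[name] = True
    return loads


# Hypothetical loop body that touches 12 distinct values, run 100 times.
working_set = [f"v{i}" for i in range(12)]
accesses = working_set * 100

print(count_memory_loads(accesses, 8))   # prints 1200
print(count_memory_loads(accesses, 16))  # prints 12
```

With eight slots, the 12-value loop never fits and every access turns into a load. With 16 slots, each value is loaded once and then reused. A loop that touches only six values would show no difference at all, which is the "your mileage may vary" point in miniature.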
Declaring war on alphabet soup
The final major problem with the x86 ISA is a programming model cluttered by an alphabet soup of overlapping instruction set extensions that aren't entirely necessary or, in the case of some legacy instructions, particularly efficient. MMX, 3DNow!, x87, SSE, SSE2, and SSE3 extensions all hang off of the original x86 ISA, overlapping in many cases. x86-64 cleans things up by adopting SSE and SSE2 as part of its core set of instructions and jettisoning MMX, 3DNow!, and the x87 FPU. SSE/2 instructions can duplicate the functionality of those other instruction sets, and as a result, WinXP x64 doesn't carry over the registers for the FPU and MMX during context switches in 64-bit mode. MMX, 3DNow!, and the x87 FPU are all supported fully in 32-bit compatibility mode in WOW64, but not for 64-bit apps. (SSE3, the newest of the extensions, will likely be supported by all 64-bit processors in the near future, because AMD is expected to add SSE3 to the AMD64 architecture very soon. I'd expect SSE3 to work in 64-bit mode.)
The x87 FPU has long been considered a weakness of x86 CPU architectures compared to competing RISC designs, and x86 processors have indeed had weak FPU performance, relatively speaking. SSE2 exchanges the x87's stack-based programming model for a more modern one, a potential boon for floating-point math performance. SSE2 also replaces the x87's IEEE 80-bit precision with the choice of either IEEE 32-bit or 64-bit floating-point math. As a result, x86-64 processors running in 64-bit mode will produce floating-point results more like those of most RISC CPUs, but those results will vary slightly from the answers produced by legacy programs that use the x87 FPU due to the difference in precision.
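The effect of precision on results is easy to demonstrate, though with a caveat: plain Python has no portable 80-bit type, so the sketch below contrasts only IEEE 32-bit and 64-bit math as a stand-in for the x87-versus-SSE2 difference. The helpers to_float32 and running_sum are hypothetical names, and the struct module is used to round each intermediate result to 32 bits.

```python
import struct


def to_float32(x):
    """Round a Python float (IEEE 64-bit) to the nearest IEEE 32-bit value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def running_sum(step, count, single_precision):
    """Add `step` to a total `count` times at the chosen precision."""
    if single_precision:
        step = to_float32(step)
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            # Round after every addition, as 32-bit hardware would.
            total = to_float32(total)
    return total


single = running_sum(0.1, 10000, single_precision=True)
double = running_sum(0.1, 10000, single_precision=False)

print(f"32-bit sum: {single!r}")
print(f"64-bit sum: {double!r}")
print(f"difference: {abs(single - double)!r}")
```

Neither sum lands exactly on 1000, because 0.1 has no exact binary representation, and the two precisions disagree with each other. The same principle applies when code that once kept intermediates in 80-bit x87 registers is rebuilt to use 64-bit SSE2 math: the answers can shift in the low-order digits.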
Because of the move to the 64-bit ISA and the elimination of MMX, 3DNow! and the x87 FPU, Windows applications that include inline assembly code will not compile on Windows x64. That means applications, including games, that include segments of hand-tuned inline assembly code may have to sacrifice their optimizations when being ported to 64 bits. During the transition period between 32 and 64 bits, this reality may be a bit of a counterweight against the performance advantages that x86-64's extra registers provide. One could see how 32-bit native games or similar applications with lots of optimizations might perform better than their 64-bit equivalents. However, the move to clean up the x86 programming model will almost surely pay dividends in the long run in terms of simplicity of development, ease of optimization, and even outright performance.
Weighing the benefits of 64 bits
Now that we've sorted through the theory about 64-bit performance, it's time to take a look at the current reality. Neither Windows XP Pro x64 Edition nor the handful of 64-bit applications and device drivers we used are yet finished products, but as you'll see, their performance indicates relative maturity. With that mild caveat in mind, we'll attempt to explore answers to several questions. Among them: How do 32-bit applications perform on Windows x64? What are the performance benefits of running 64-bit code on a 64-bit OS? And how do the Intel and AMD implementations of x86-64 compare? Do they offer similar performance deltas in the move to 64 bits, or does one demonstrate obvious superiority over the other?