We've spent the past few weeks testing Hammer-based chips, including the Athlon 64 and its new big brother, the mighty Athlon 64 FX-51. Not only that, but Intel slipped us a Pentium 4 3.2GHz Extreme Edition processor at the last minute, and we've benchmarked it, as well, 2MB L3 cache and all. We can honestly say we were blown away by the performance of these new chips. The future is now. Read on to see how good it can be.
Hammer comes to the desktop
The Hammer CPU core is an evolutionary design based on AMD's K7 microarchitecture. Nonetheless, Hammer is revolutionary, not so much because of what goes on inside the chip itself, but because of how it talks to the rest of the computer. We have dedicated most of our time and effort in preparing this review to empirical testing, so we're not able to cover Hammer's architectural innovations in as much detail as we'd like. Still, we'll hit some of the major points that make AMD's new processor distinctive. Among them:
- An integrated memory controller: Conventional systems have long had a memory controller located on a "north bridge" chip that talks to the processor over a front-side bus. Hammer chips have the memory controller built in, so the processor talks to the memory controller directly at the full speed of the CPU, 2.2GHz in the case of the fastest Athlon 64. (The memory controller itself also runs at the speed of the CPU.) As a result, Hammer processors can access memory with very low latencies, opening up one of the most persistent bottlenecks to overall system performance. Current Hammer chips have DDR memory controllers compatible with memory speeds up to 400MHz.
Beyond the basic performance benefits, the movement of the memory controller on die has implications for the organization of the entire Hammer platform. Core logic chipsets no longer need to provide memory controllers, and the Hammer, strictly speaking, has no traditional front-side bus. Even more mind-bendingly, multiprocessor Hammer systems have individual banks of memory for each processor, so they should scale very well as processors are added.
- HyperTransport communications: HyperTransport is the glue that makes AMD's reorg of the traditional PC work. It is a packet-based chip-to-chip interconnect in which each link is a pair of 8-bit or 16-bit unidirectional connections running at speeds up to 800MHz. Throw in a little DDR action, sending data twice per clock cycle, and you have an effective clock rate of 1.6GHz per link. As implemented in Hammer, HyperTransport links have a maximum throughput of 6.4GB/s (16 bits upstream plus 16 bits downstream at an effective 1.6GHz).
Hammer systems use HyperTransport for several things. In all Hammer systems, one of the CPUs (or the only CPU) talks to the rest of the system over a HyperTransport link. Traditional chipset services like AGP, PCI, and south bridge I/O are delivered over this link much like VPN tunnels are delivered over TCP/IP connections in a computer network. Done right, HyperTransport should simplify motherboard design by replacing slower and wider connections that require more traces to achieve similar results. In multiprocessor implementations, HyperTransport links between processors allow for inter-chip communications, as well.
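For readers who want to check the 6.4GB/s figure, here is a minimal sketch of the arithmetic in Python. The function name and parameters are our own for illustration, and we assume 1GB/s means 1000MB/s; this is back-of-the-envelope math, not anything from AMD's documentation.

```python
def link_throughput_gb_s(width_bits: int, clock_mhz: float,
                         transfers_per_clock: int = 2) -> float:
    """Aggregate throughput of one link, both directions combined, in GB/s.

    Hypothetical helper: width_bits is the width of each unidirectional
    half of the link, clock_mhz is the base clock, and transfers_per_clock
    is 2 for DDR signaling.
    """
    effective_mhz = clock_mhz * transfers_per_clock   # 800MHz DDR -> 1600 MT/s
    bytes_per_transfer = width_bits / 8                # 16 bits -> 2 bytes
    per_direction_mb_s = effective_mhz * bytes_per_transfer
    return 2 * per_direction_mb_s / 1000               # upstream plus downstream


print(link_throughput_gb_s(16, 800))  # 16-bit link as used in Hammer: 6.4
print(link_throughput_gb_s(8, 800))   # an 8-bit link at the same clock: 3.2
```

Note that the headline number counts both directions at once; traffic flowing one way only tops out at half of it.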
- 1MB of on-chip L2 cache: Previous versions of the K7 core have had varying amounts of L2 cache, up to 512K on the most recent "Barton" Athlon XPs. The first wave of Hammer chips all come with 1MB of L2 cache onboard, upping the ante by a factor of two.
Hammer's L1 cache sizes are unchanged from K7 at 64K for instructions and 64K for data. AMD's caches tend to be exclusive, and that's the case with Hammer; the L2 cache doesn't replicate the contents of the L1 cache. With the L1 data and L2 caches combined, the Hammer chips' total effective data cache size is 1088K.
We've seen many times before the impact larger caches have on performance. Generally, more cache is better, but many tasks pull through too much data to derive any benefit from extra cache, so the benefits are uneven.
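The 1088K figure falls straight out of the exclusive design, as this small Python sketch shows. The helper name is hypothetical, and the inclusive case is included only as a point of comparison under the simplifying assumption that an inclusive L2 always holds a copy of everything in L1.

```python
L1_DATA_KB = 64    # Hammer L1 data cache, per the text
L2_KB = 1024       # Hammer L2 cache, per the text


def effective_data_cache_kb(l1_kb: int, l2_kb: int, exclusive: bool) -> int:
    """Unique data capacity of an L1 data cache plus L2 (illustrative model)."""
    if exclusive:
        # L2 never duplicates lines held in L1, so the capacities add up.
        return l1_kb + l2_kb
    # Simplified inclusive model: every L1 line also sits in L2,
    # so L2 alone bounds the unique data held on chip.
    return l2_kb


print(effective_data_cache_kb(L1_DATA_KB, L2_KB, exclusive=True))   # 1088
print(effective_data_cache_kb(L1_DATA_KB, L2_KB, exclusive=False))  # 1024
```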
- SSE2 instruction set support: Intel introduced the SSE2 instruction set for single-instruction, multiple-data (SIMD) calculations with the Pentium 4. SSE2 allows for SIMD operations on 64-bit IEEE double-precision floating-point values, packed two at a time into 128-bit registers, so it's useful in tasks like 3D rendering, graphics drivers, gaming, and media encoding. Previous Athlon chips have supported SIMD instruction set extensions for both integer (MMX) and single-precision floating-point (SSE, 3DNow!) operations, but they have been missing SSE2, where SIMD on x86 arguably has the most impact. The Hammer core can take advantage of applications optimized for SSE2, making it more competitive with the Pentium 4.
- AMD64 instruction set support: In a gutsy move, AMD has concocted its own set of extensions to the x86 instruction set architecture, or ISA. The new AMD64 ISA isn't a radical departure, but it allows for 64-bit addressing, and it adds some additional registers to the register-poor x86 ISA.
AMD's move to 64 bits accomplishes several things. First, it eliminates the barrier of 4GB of addressable memory in 32-bit systems. 4GB may sound like a lot today, but as an upper limit, 4GB could become a nasty constraint, even on common desktop systems, in the next few years.
Second, by adding 64-bit extensions to the x86 ISA, AMD has created an evolutionary alternative to Intel's Itanium chips, which break almost entirely with the industry-standard x86 software infrastructure. Naturally, code will have to be recompiled for AMD64, but AMD64 is familiar enough that retooling compilers for it should be relatively painless.
Finally, AMD64's additional registers, which are present in Hammer, promise better performance on recompiled code. (Registers are essentially temporary local storage slots on a processor. More of them means less storing data in cache or memory.) Addressing memory in 64-bit chunks won't, by itself, necessarily improve performance. The Hammer has eight new 64-bit integer registers and eight new 128-bit SSE/SSE2 registers to help.
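To put numbers on the addressing and register claims above, here is a rough Python sketch. The helper is our own, the legacy register counts (eight integer, eight SSE) are background knowledge about classic x86 rather than figures from this article, and the 64-bit result is an architectural ceiling; shipping chips may well implement fewer address bits than that.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_gib(address_bits: int) -> int:
    """Memory reachable with a flat address of the given width, in GiB."""
    return 2 ** address_bits // GIB


# Assumption: classic 32-bit x86 exposes eight integer and eight SSE registers.
LEGACY_INT_REGS, LEGACY_SSE_REGS = 8, 8
# From the text: AMD64 adds eight of each.
NEW_INT_REGS, NEW_SSE_REGS = 8, 8

print(addressable_gib(32))               # 4, the 32-bit barrier
print(addressable_gib(64))               # 17179869184, the 64-bit ceiling
print(LEGACY_INT_REGS + NEW_INT_REGS)    # 16 integer registers for 64-bit code
print(LEGACY_SSE_REGS + NEW_SSE_REGS)    # 16 SSE/SSE2 registers for 64-bit code
```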
- A slightly longer pipeline: Hammer's main branch prediction/recovery pipeline has been lengthened from 10 stages to 12. This change should allow the processor to run at higher clock rates at the expense of executing fewer instructions per clock. The Hammer's pipeline is still much shorter than the 20-stage main pipeline in the "speed demon" Pentium 4.
- 0.13-micron SOI fab process: All Athlon 64 processors are manufactured at AMD's fab in Dresden, Germany, where AMD employs an advanced silicon-on-insulator (SOI) fabrication process to make these chips. Laying down the silicon on top of an insulator should allow the chip's transistors to operate faster. IBM, which pioneered SOI technology, claims clock frequency gains from SOI as high as 35 percent in testing. However, transitions to new chip fabrication processes are fraught with potential snags. When AMD delayed the Athlon 64's launch this past spring, the company cited difficulties producing chips in volume using SOI technology as the primary culprit and had to turn to IBM for assistance.
The move to SOI is crucial because AMD's enhancements to Hammer add up to a whole lot more transistors per chip than the K7. The last revision of the Athlon XP, code-named Barton, had 54.3 million transistors and a die size of 101 square millimeters. The Northwood Pentium 4 has 55 million transistors on a die that's 145 square millimeters. By contrast, the Athlon 64 packs 105.9 million transistors onto a 192 square millimeter die.
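A quick way to compare those three chips is transistor density. The Python sketch below uses only the transistor counts and die sizes quoted above; the function and dictionary names are ours, and density is a crude yardstick since cache and logic pack very differently.

```python
# (millions of transistors, die area in square millimeters), from the text
CHIPS = {
    "Athlon XP (Barton)": (54.3, 101),
    "Pentium 4 (Northwood)": (55.0, 145),
    "Athlon 64": (105.9, 192),
}


def density(mtransistors: float, area_mm2: float) -> float:
    """Millions of transistors per square millimeter of die."""
    return mtransistors / area_mm2


for name, (mtrans, area) in CHIPS.items():
    print(f"{name}: {density(mtrans, area):.2f}M transistors per mm^2")
# Athlon XP (Barton): 0.54M transistors per mm^2
# Pentium 4 (Northwood): 0.38M transistors per mm^2
# Athlon 64: 0.55M transistors per mm^2
```

By this measure the Athlon 64 is about as dense as Barton; it is simply close to twice the chip in both transistor count and area.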