Ever since the introduction of the first Opteron, Intel has faced a formidable foe in the x86 server and workstation markets. AMD's decision to integrate a memory controller into its processors and use a narrow, high-speed interconnect between CPUs and I/O chips has made it a perennial contender in this space. Even recently, while Intel's potent Core microarchitecture has given it a lead in the majority of performance tests, Xeons have been somewhat hamstrung on two fronts: on the power-efficiency front by their prevailing use of FB-DIMM-type memory, and on the scalability front by the use of a front-side bus and a centralized memory controller.
Those barriers for the Xeon are about to be swept away by today's introduction of new processors based on the chip code-named Nehalem, a new CPU design that brings with it a revised system architecture that will look very familiar to folks who know the Opteron. Try this on for size: a single-chip quad-core processor with a relatively small L2 cache dedicated to each core, backed up by a larger L3 cache shared by all cores. Add in an integrated memory controller and a high-speed, low-latency socket interconnect. Sounds positively... Opteronian, to coin a word, but that's also an apt description of Nehalem.
Of course, none of this is news. Intel has been very forthcoming about its plans for Nehalem for some time now, and the high-end, single-socket desktop part based on this same silicon has been selling for months as the Core i7. Just as with the Opteron, though, Nehalem's true mission and raison d'etre is multi-socket systems, where its architectural advantages can really shine. Those advantages look to be formidable because, to be fair, the Nehalem team set out to do quite a bit more than merely copy the Opteron's basic formula. They attempted to create a solution that's newer, better, and faster in most every way, melding the new system architecture with Intel's best technologies, including a heavily tweaked version of the familiar Core microarchitecture.
Since this is Intel, that effort has benefited from world-class semiconductor fabrication capabilities in the form of Intel's 45nm high-k/metal gate technology, the same process used to produce "Harpertown" Xeons. At roughly 731 million transistors and a die area of 263 mm², though, the Nehalem EP is a much larger chip. (Harpertown is made up of a pair of dual-core chips, each of which has 410 million transistors in an area of 107 mm².) The similarity with AMD's "Shanghai" Opteron core is, again, striking in this department: Shanghai is estimated at 758 million transistors and measures 258 mm².
We have already covered Nehalem at some length, since it's already out in the market in single-socket form. Let me direct you to my review of the Core i7 if you'd like more detail about the microarchitecture. If you want even more depth, I suggest reading David Kanter's Nehalem write-up, as well. Rather than cover all of the same ground again here, I'll try to offer an overview of the changes to Nehalem most relevant to the server and workstation markets.
A brief tour of Nehalem
As we've noted, Nehalem's quad execution cores are based on the four-issue-wide Core microarchitecture, but they have been modified rather extensively to improve performance per clock and to take better advantage of the new system architecture. One of the most prominent additions is the return of simultaneous multithreading (SMT), known in Intel parlance as Hyper-Threading. Each Nehalem core can track and execute two hardware threads, to keep its execution units more fully occupied. This capability has dubious value on the desktop in the Core i7, but it makes perfect sense for Xeon-based servers, where most workloads are widely multithreaded. With 16 hardware threads in a dual-socket config, the new Xeons take threading in this class of system to a new level.
Additionally, the memory subsystem, including the cache hierarchy, has been broadly overhauled. Each core now has 32K L1 instruction and data caches, along with a dedicated 256K L2 cache. A new L3 cache is 8MB in size and serves all four cores; it's part of what Intel calls the "uncore" and is clocked independently, typically at a lower speed than the cores.
The chip's integrated memory controller, also an "uncore" component, interfaces with three 64-bit channels of DDR3 memory, with support for both registered and unbuffered DIMM types, along with ECC. Intel has decided to jettison FB-DIMMs for dual-socket systems, with their added power draw and access latencies. The use of DDR3, which offers higher operating frequencies and lower voltage requirements than DDR2, should contribute to markedly lower platform power consumption. The bandwidth is considerable, as well: a dual-socket system with six channels of DDR3-1333 memory has theoretical peak throughput of 64 GB/s.
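The 64 GB/s figure is simple arithmetic, and a rough sketch in Python shows where it comes from. The function and parameter names here are purely illustrative; the only inputs are the numbers quoted above, plus the fact that a 64-bit channel moves eight bytes per transfer and DDR3-1333 performs roughly 1333 million transfers per second.

```python
def peak_memory_bandwidth_gbs(channels, mega_transfers_per_sec, bus_width_bits=64):
    """Theoretical peak bandwidth in GB/s (decimal units, 1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8  # a 64-bit channel moves 8 bytes at a time
    return channels * mega_transfers_per_sec * 1e6 * bytes_per_transfer / 1e9

# Three channels per socket, six in a dual-socket system, all at DDR3-1333.
per_socket = peak_memory_bandwidth_gbs(3, 1333)
dual_socket = peak_memory_bandwidth_gbs(6, 1333)
print(f"{per_socket:.1f} GB/s per socket, {dual_socket:.1f} GB/s dual-socket")
```

These are best-case numbers; they say nothing about the throughput an application will actually sustain.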
That's a little more than one should typically expect, though, because memory frequencies are limited by the number of DIMMs per channel. A Nehalem-based Xeon can host only one DIMM per channel at 1333MHz, two per channel at 1066MHz, and three per channel at 800MHz. The selection of available memory speeds is also limited by the Xeon model involved. Intel expects 1066MHz memory, which allows for 12-DIMM configurations, to be the most commonly used option. The highest capacity possible at present, with all channels populated, is 144GB.
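Those population rules boil down to a small lookup table. The sketch below is illustrative only: the names are made up, and the 8GB DIMM size is an assumption inferred from the 144GB ceiling spread across 18 slots. It also ignores the per-model speed limits mentioned above.

```python
# Maximum memory speed (MHz, as quoted in the text) by DIMMs per channel.
MAX_SPEED_BY_DIMMS_PER_CHANNEL = {1: 1333, 2: 1066, 3: 800}

CHANNELS_PER_SOCKET = 3

def describe_config(dimms_per_channel, sockets=2, dimm_gb=8):
    """Return (total DIMMs, total capacity in GB, max memory speed in MHz).

    dimm_gb=8 is an assumption: the DIMM size implied by 144GB over 18 slots.
    """
    speed = MAX_SPEED_BY_DIMMS_PER_CHANNEL[dimms_per_channel]
    dimms = sockets * CHANNELS_PER_SOCKET * dimms_per_channel
    return dimms, dimms * dimm_gb, speed

for dpc in (1, 2, 3):
    dimms, capacity_gb, speed = describe_config(dpc)
    print(f"{dpc} per channel: {dimms} DIMMs, {capacity_gb}GB, up to {speed}MHz")
```

The trade-off is plain from the three rows: each step up in capacity costs a step down in memory speed.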
Nehalem's revised memory hierarchy also supports an important new feature: Extended Page Tables (EPT), which again mirrors a familiar Opteron capability, Nested Page Tables (NPT). Like NPT, EPT accelerates virtualization by relieving the hypervisor of the burden of software-based page table emulation. NPT and EPT have the potential to reduce the overhead of virtualization substantially.
The third and final major uncore element in Nehalem is the QuickPath Interconnect, or QPI. Much like HyperTransport, QPI is a narrow, high-speed, low-latency, point-to-point interconnect used in both socket-to-socket connections and links to I/O chips. QPI operates at up to 6.4 GT/s in the fastest Xeons, where it yields a peak two-way aggregate transfer rate of 25.6 GB/s: again, a tremendous amount of bandwidth. The CPUs coordinate cache coherency over the QPI link by means of a MESIF protocol, which extends the traditional Xeon MESI protocol with the addition of a new Forwarding state that should reduce traffic in certain cases. (For more on the MESIF protocol, see here.)
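The QPI bandwidth number can be reconstructed the same way. This Python sketch assumes each QPI link carries 16 bits of data payload per transfer in each direction, a detail not spelled out above; the function name is illustrative.

```python
def qpi_bandwidth_gbs(giga_transfers_per_sec, data_bits_per_direction=16, directions=2):
    """Peak QPI link bandwidth in GB/s.

    data_bits_per_direction=16 is an assumption about the link's payload width.
    """
    bytes_per_transfer = data_bits_per_direction / 8
    return giga_transfers_per_sec * bytes_per_transfer * directions

one_way = qpi_bandwidth_gbs(6.4, directions=1)
two_way = qpi_bandwidth_gbs(6.4)
print(f"{one_way:.1f} GB/s each way, {two_way:.1f} GB/s aggregate")
```

Note that the headline figure counts both directions at once; traffic flowing one way tops out at half of it.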
One of the implications of the move to QPI and an integrated memory controller is that the new Xeons' memory subsystems are non-uniform. That is, getting to local memory will be notably quicker than retrieving data owned by another processor. Non-uniform memory architectures (NUMA) have some tricky performance ramifications, not all of which have been sufficiently addressed by modern OS schedulers, even now. The Opteron has occasionally run into problems on this front, and now Xeons will, too. One can hope that Intel's move to a NUMA design will prompt broader and deeper OS- and application-level awareness of memory locality issues.
Power efficiency has become a key consideration in server CPUs, and the new Xeons include a range of provisions intended to address this issue. In fact, the chip employs a dedicated microcontroller to manage power and thermals. Nehalem EP includes more power states (15) than Harpertown (4) and makes faster transitions between them, with a typical switch time of under two microseconds, compared to four microseconds for Harpertown. Nehalem's lowest power states make use of a power gate associated with each execution core; this gate can cut voltage to an idle core entirely, eliminating even leakage power and taking its power consumption to nearly zero.
The power management microcontroller also enables an intriguing new feature, the so-called "Turbo mode." This feature takes advantage of the additional power and thermal headroom available when the CPU is at partial utilization, say with a single- or dual-threaded application, by dynamically raising the clock speed of the busy cores beyond their rated frequency. The clock speed changes involved are relatively conservative: one full increment of the CPU multiplier results in an increase of 133MHz, and most of the new Xeons can only go two "ticks" beyond their usual multiplier ceilings. Still, the highest-end W- and X-series Xeons can reach up to three ticks, or 400MHz, beyond their normal limits. Unlike the generally advertised clock frequency of the CPU, this additional Turbo mode headroom is not guaranteed and may vary from chip to chip, depending upon its voltage needs and resulting thermal profile. What headroom is available brings a "free," if modest, performance boost to lightly threaded applications.
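To put the Turbo arithmetic in concrete terms, here is a small Python sketch. It assumes the base clock is exactly 400/3 MHz (the "133MHz" step, unrounded), which is why three ticks come to 400MHz. The 22x rated multiplier is a hypothetical example rather than a quoted spec, and the result is a ceiling, not a guaranteed speed.

```python
BASE_CLOCK_MHZ = 400 / 3  # one multiplier "tick"; assumed unrounded value of 133MHz

def turbo_ceiling_mhz(rated_multiplier, turbo_ticks):
    """Highest clock Turbo mode could reach, given the available headroom."""
    return (rated_multiplier + turbo_ticks) * BASE_CLOCK_MHZ

RATED_MULTIPLIER = 22  # hypothetical chip, for illustration only
rated = turbo_ceiling_mhz(RATED_MULTIPLIER, 0)
for ticks in (2, 3):
    ceiling = turbo_ceiling_mhz(RATED_MULTIPLIER, ticks)
    print(f"{ticks} ticks: up to {ceiling:.0f}MHz (+{ceiling - rated:.0f}MHz)")
```

On a chip like this hypothetical one, even the full three ticks work out to a gain of under 14% over the rated clock, which is why the boost is best described as modest.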