Intel’s Core i7 processors

Those of us who are conversant with technology are more or less conditioned to accept and even expect change as a natural part of the course of things. New gadgets and gizmos debut regularly, each one offering some set of advantages or refinements over the prior generation. As a result, well, you folks are a rather difficult lot to impress, frankly speaking. But today is a day when one should sit up and take notice. I’ve been reviewing processors for nearly ten years now, and the Core i7 processors we’re examining here represent one of the most consequential shifts in the industry during that entire span.

Intel, as you know, has been leading its smaller rival AMD in the performance sweeps for some time now, with a virtually unbroken lead since the debut of the first Core 2 processors more than two years ago. Even so, AMD has retained a theoretical (and sometimes practical) advantage in terms of basic system architecture throughout that time, thanks to the changes it introduced with its original K8 (Athlon 64 and Opteron) processors five years back. Those changes included the integration of the memory controller onto the CPU die, the elimination of the front-side bus, and its replacement with a fast, narrow chip-to-chip interconnect known as HyperTransport. This system architecture has served AMD quite well, particularly in multi-socket servers, where the Opteron became a formidable player in very short order and has retained a foothold even with AMD’s recent struggles.

Now, Intel aims to rob AMD of that advantage by introducing a new system architecture of its own, one that mirrors AMD’s in key respects but is intended to be newer, faster, and better. At the heart of this project is a new microprocessor, code-named Nehalem during its development and now officially christened the Core i7.

Yeah, I dunno about the name, either. Let’s just roll with it.

The Core i7 design is based on current Core 2 processors but has been widely revised, from its front end to its memory and I/O interfaces and nearly everywhere in between. The Core i7 integrates four cores into a single chip, brings the memory controller onboard, and introduces a low-latency point-to-point interconnect called QuickPath to replace the front-side bus. Intel has modified the chip to take advantage of this new system infrastructure, tweaking it throughout to accommodate the increased flow of data and instructions through its four cores. The memory subsystem and cache hierarchy have been redesigned, and simultaneous multithreading—better known by its marketing name, Hyper-Threading—makes its return, as well. The end result blurs the line between an evolutionary new product and a revolutionary one, with vastly more bandwidth and performance potential than we’ve ever seen in a single CPU socket.

How well does the Core i7 deliver on that potential? Let’s find out.

An overview of the Core i7

The Core i7 modifies the landscape quite a bit, but much of what you need to know about it is apparent in the picture of the processor die below, with the major components labeled.

The Core i7 die and major components. Source: Intel.

What you’re seeing, incidentally, is a pretty good-sized chip—an estimated 731 million transistors arranged into a 263 mm² area via the same 45nm, high-k fabrication process used to produce “Penryn” Core 2 chips. Penryn has roughly 410 million transistors and a die area of 107 mm², but of course, it takes two Penryn dies to make one quad-core product. Meanwhile, AMD’s native quad-core Phenom chips have 463 million transistors but occupy a larger die area of 283 mm² because they’re made on a 65nm process and have a higher ratio of (less dense) logic to (denser) cache transistors. Then again, size is to some degree relative; the GeForce GTX 280 GPU is over twice the size of a Core i7 or Phenom.

Nehalem’s four cores are readily apparent across the center of the chip in the image above, as are the other components (Intel calls these, collectively, the “uncore”) around the periphery. The uncore occupies a substantial portion of the die area, most of which goes to the large, shared L3 cache.

This L3 cache is the last level of a fundamentally reworked cache hierarchy. Although not clearly marked in the image above, inside of each core is a 32 kB L1 instruction cache, a 32 kB L1 data cache (it’s 8-way set associative), and a dedicated 256 kB L2 cache (also 8-way set associative). Outside of the cores is the L3, which is much larger at 8 MB and smarter (16-way associative) than the L2s. This basic arrangement may be familiar from AMD’s native quad-core Phenom processors, and as with the Phenom, the Core i7’s L3 cache serves as the primary means of passing data between its four cores. The Core i7’s cache setup differs from the Phenom’s in key respects, though, including the fact that it’s inclusive—that is, it replicates the contents of the higher level caches—and runs at higher clock frequencies. As a result of these and other design differences, including a revamped TLB hierarchy, the Core i7’s cache latencies are much lower than the Phenom’s, even though its L3 cache is four times the size.
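As a sanity check on that hierarchy, the set count of each cache level follows directly from its size, associativity, and the 64-byte line size these chips use (sets = size / (ways × line size)). A quick sketch of the arithmetic, nothing more:

```python
# Back-of-the-envelope cache geometry for Nehalem's hierarchy.
# sets = cache size / (associativity x line size); 64-byte lines assumed.
def cache_sets(size_bytes, ways, line_bytes=64):
    return size_bytes // (ways * line_bytes)

KB, MB = 1024, 1024 * 1024

print(cache_sets(32 * KB, 8))   # 32 kB, 8-way L1 data cache: 64 sets
print(cache_sets(256 * KB, 8))  # 256 kB, 8-way per-core L2: 512 sets
print(cache_sets(8 * MB, 16))   # 8 MB, 16-way shared L3: 8192 sets
```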

One mechanism Intel uses to make its caches more effective is prefetching, in which the hardware examines memory access patterns and attempts to fill the caches speculatively with data that’s likely to be requested soon. Intel claims the Core i7’s prefetching algorithm is both more efficient than Penryn’s—some server admins wound up disabling hardware prefetch in Xeons because it harmed performance with certain workloads, a measure Intel says should no longer be needed—and more aggressive, as well.

The Core i7 can get to main memory very quickly, too, thanks to its integrated memory controller, which eliminates the chip-to-chip “hop” required when going over a front-side bus to an external north bridge. Again, this is a familiar page from AMD’s template, but Intel has raised the stakes by incorporating support for three channels of DDR3 memory. Officially, the maximum memory speed supported by the first Core i7 processors is 1066 MHz, which is a little conservative for DDR3, but frequencies of 1333, 1600, and 2000 MHz are possible with the most expensive Core i7, the 965 Extreme Edition. In fact, we tested it with 1600 MHz memory, since this is a more likely configuration for a thousand-dollar processor.

For a CPU, the bandwidth numbers involved here are considerable. Three channels of memory at 1066 MHz can achieve an aggregate of 25.6 GB/s of bandwidth. At 1333 MHz, you’re looking at 32 GB/s. At 1600 MHz, the peak would be 38.4 GB/s, and at 2000 MHz, 48 GB/s. By contrast, the peak effective memory bandwidth on a Core 2 system would be 12.8 GB/s, limited by the throughput of a 1600MHz front-side bus. With dual channels of DDR2 memory at 1066MHz, the Phenom’s peak would be 17.1 GB/s. The Core i7 is simply in another league. In fact, our Core i7-965 Extreme test rig with 1600MHz memory has the same total bus width (192 bits) and theoretical memory bandwidth as a GeForce 9600 GSO graphics card.
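All of those peaks fall out of the same arithmetic: the channel count times the effective transfer rate times eight bytes per 64-bit channel. Here’s a quick sketch (our math, not Intel’s) verifying the numbers above:

```python
# Peak memory bandwidth: channels x effective MT/s x 8 bytes per 64-bit channel.
def peak_bw_gbs(channels, mt_per_s):
    return channels * mt_per_s * 8 / 1000  # GB/s

print(peak_bw_gbs(3, 1066))  # Core i7 with DDR3-1066: ~25.6 GB/s
print(peak_bw_gbs(3, 1600))  # Core i7 with DDR3-1600: 38.4 GB/s
print(peak_bw_gbs(1, 1600))  # Core 2 on a 1600MHz front-side bus: 12.8 GB/s
print(peak_bw_gbs(2, 1066))  # Phenom, dual-channel DDR2-1066: ~17.1 GB/s
```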

With the memory controller onboard and the front-side bus gone, the Core i7 communicates with the rest of the system via the QuickPath interconnect, or QPI. QuickPath is Intel’s answer to HyperTransport: a high-speed, narrow, packet-based, point-to-point interconnect between the processor and the I/O chip (or other CPUs in multi-socket systems). The QPI link on the Core i7-965 Extreme operates at 6.4 GT/s. At 16 bits per transfer, that adds up to 12.8 GB/s, and since QPI links involve dedicated bidirectional pairs, the total bandwidth is 25.6 GB/s. Lower-end Core i7 processors have 4.8 GT/s QPI links with up to 19.2 GB/s of bandwidth. Obviously, these are both just starting points, and Intel will likely ramp up QPI speeds in successive product generations. Still, both are somewhat faster than the HyperTransport 3 interconnects in today’s Phenoms, which peak at either 16 or 14.4 GB/s, depending on the chip.
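QPI’s numbers work the same way: two bytes per transfer in each direction, doubled because the link pairs are bidirectional. A quick sketch:

```python
# QPI bandwidth: transfer rate x 2 bytes (16 bits) per direction,
# doubled for the dedicated bidirectional link pairs.
def qpi_bw_gbs(gt_per_s):
    per_direction = gt_per_s * 2  # GB/s one way
    return per_direction * 2      # both directions

print(qpi_bw_gbs(6.4))  # Core i7-965 Extreme: 25.6 GB/s
print(qpi_bw_gbs(4.8))  # lower-end Core i7s: 19.2 GB/s
```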

A block diagram of the Core i7 system architecture. Source: Intel.

This first, high-end desktop implementation of Nehalem is code-named Bloomfield, and it’s essentially the same silicon that should go into two-socket servers eventually. As a result, Bloomfield chips come with two QPI links onboard, as the die shot above indicates. However, the second QPI link is unused. In 2P servers based on this architecture, that second interconnect will link the two sockets, and over it, the CPUs will share cache coherency messages (using a new protocol) and data (since the memory subsystem will be NUMA)—again, very similar to the Opteron.

In order to take advantage of this radically modified system architecture, the design team tweaked Nehalem’s processor cores in a range of ways big and small. Although the Core 2’s basic four-issue-wide design and execution resources remain more or less unchanged, almost everything around the execution units has been altered to keep them more fully occupied. The instruction decoder can fuse more types of x86 instructions together and, unlike Core 2, it can do so when running in 64-bit mode. The branch predictor’s accuracy has been enhanced, too. Many of the changes involve the memory subsystem—not just the caches and memory controller, which we’ve already discussed, but inside the core itself. The load and store buffers have been increased in size, for instance.

These modifications make sense in light of the Core i7’s much higher system-level throughput, but they also help make another new mechanism in the chip work better: the resurrected Hyper-Threading, or simultaneous multithreading (SMT). Each core in Nehalem can track two independent hardware threads, much like some other Intel processors, including later versions of the Pentium 4 and, more recently, the Atom. SMT takes advantage of the explicit parallelism built into multithreaded software to keep the CPU’s execution units more fully occupied, and done well, it can be a clear win, delivering solid performance gains at very little cost in terms of additional die area or power use. Intel architect Ronak Singhal outlined how Nehalem’s implementation of Hyper-Threading works at this past Fall IDF. Some hardware, such as the registers, must be duplicated for each thread, but much of it can be shared. Nehalem’s load, store, and reorder buffers are statically partitioned between the two threads, for example, while the reservation station and caches are shared dynamically based on demand. The execution units themselves don’t need to be altered at all.

The upshot of all of this is that a single Core i7 processor supports a total of eight threads, which makes for a pretty wicked-looking Task Manager window. Because of the resource sharing involved, of course, Hyper-Threading isn’t likely to double performance, even in the best-case scenario. We’ll look at its precise impact on performance in the following pages.

The changes to Nehalem’s cores don’t stop there, either. Intel has improved the performance of the synchronization primitives used by multithreaded applications, added a handful of instructions known as SSE 4.2—including some for string handling, cyclic redundancy checks, and popcount—and introduced enhancements for hardware-assisted virtualization. There’s too much to cover here, really. If you want more detailed information, I suggest you check out Singhal’s IDF presentation or David Kanter’s Nehalem overview.

Power management and, uh, forced induction

Like AMD’s native quad-core Phenom, the Core i7 can raise and lower the clock speed of each of its processor cores independently and dynamically in response to demand. Unlike the Phenom, though, the Core i7 doesn’t use separate power planes for the cores and the “uncore.” Instead, Intel has put a switch between each core and the voltage regulator output, and power can be shut off to any individual core that goes into the deepest idle state, C6, transparently to software and to the other cores. Because power to the core is shut off, Intel claims even leakage power is eliminated, making that core’s power consumption approximately zero. In the event that all four cores become idle, then the uncore can go into a C6 state, as well, in which most uncore logic is stopped and I/O drops into a low-power mode.

Controlling all of this wizardry in the Core i7 is a dedicated, on-chip microcontroller for power management. This microcontroller is programmable via firmware and can be made to use different algorithms to optimize for, say, the lowest possible power use or for low latencies when stepping up from low-power states. No doubt Intel will use this capability to tune products for diverse segments, giving mobile processors different behaviors than, say, high-performance desktop parts like the ones we’re reviewing here.

One trick that this microcontroller enables is the oh-so creatively named “Turbo mode” built into the Core i7. This feature pushes the active cores beyond their baseline clock frequencies when the CPU isn’t at full utilization. Turbo mode operates according to some simple rules. In the event that a single-threaded application is occupying one core while the rest are idle, Turbo mode will raise clock speeds by as much as two full “ticks” beyond the baseline. For instance, for our Core i7-965 Extreme processor, Turbo mode could raise the multiplier from 24 to 26, or the core clocks from 3.2 GHz to 3.46 GHz, since the base clock in Core i7 systems runs at 133 MHz. With two or more threads active, Turbo mode will only raise clock speeds by one tick. All of this happens automatically using the same basic P-state mechanism as SpeedStep.
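Since Core i7 clock speeds are just the base clock times a multiplier, the Turbo steps for our 965 Extreme are easy to tabulate. A little sketch, using 133.33 MHz for the base clock:

```python
# Turbo mode steps on the Core i7-965 Extreme: base clock x multiplier.
BASE_CLOCK_MHZ = 133.33

for mult, scenario in [(24, "baseline"),
                       (25, "two or more active threads"),
                       (26, "single active thread")]:
    print(f"{mult}x ({scenario}): {BASE_CLOCK_MHZ * mult / 1000:.2f} GHz")
```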

The additional clock frequency headroom comes from the fact that a less-than-fully-occupied Core i7 may not run up against the limits imposed by its thermal design power, or TDP—the chip’s specified power envelope. We’ve seen a processor running eight instances of Prime95 stay at “one tick up” for a sustained period of time with good cooling. Then again, Intel has set CPU core voltages individually at the factory for some time now, and it’s quite possible that some chips may not be able to sustain Turbo acceleration within their specified power envelopes for any length of time. As I understand it, that may simply be the luck of the draw, with only the baseline clock speed guaranteed.

Interestingly enough, because the Core i7-965 Extreme Edition doesn’t have a locked upper multiplier, the CPU can be overclocked by tweaking the Turbo mode settings in the BIOS. Intel’s DX58SO “Smackover” (uh huh) motherboard exposes control over the maximum clock multipliers for one, two, three, and four occupied cores, as well as the ability to adjust the TDP limit in watts and the current limit in amps. You’ll probably want a good aftermarket cooler if you plan to play with these settings. If that’s too fancy for your tastes, one may also choose to disable Turbo mode and overclock in the usual ways—either by raising the multiplier on an Extreme Edition or by cranking up the base clock on any Core i7.

Pricing and availability

Although we are publishing our review of the Core i7 today, products won’t be selling to consumers immediately. Instead, Intel has given us the nebulous target of “in November” for product availability. Beyond that, I have no more information than you about when to expect these things in stores. I can, however, give you pricing and model information. Like this:

Model                  Clock speed  North bridge speed  QPI speed  TDP    Price
Core i7-965 Extreme    3.2 GHz      3.2 GHz             6.4 GT/s   130 W  $999
Core i7-940            2.93 GHz     2.13 GHz            4.8 GT/s   130 W  $562
Core i7-920            2.66 GHz     2.13 GHz            4.8 GT/s   130 W  $284

All three of the Core i7 processors coming this month are “Bloomfield” chips, so they all have quad cores, three memory channels, and 8 MB of L3 cache onboard. As you can see, though, they do differ in terms of the clock speed of the L3 cache and of what I’ve labeled the “North bridge speed.” That’s basically the clock speed of the “uncore,” but things get a little hairy from there. The uncore includes several elements, including the QPI links, the L3 cache, and the memory controller. Each of these elements may run at different multipliers from the base clock. For the Core i7-965 Extreme, the relationship is straightforward: everything runs at 3.2 GHz, including the QPI link, hence its 6.4 GT/s data rate. In the 940 and 920, the cores run at one speed, the QPI link at another (2.4 GHz), and, as I understand it, the memory controller and L3 cache both run at 2.13 GHz.
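Working backward from the 133MHz base clock, you can recover the multipliers those speeds imply. The multiplier values below are our inference from the published clocks, not official Intel figures:

```python
# Inferred base-clock multipliers for Core i7 clock domains. These are our
# guesses worked backward from the published speeds, not Intel's spec.
BASE_MHZ = 133.33

domains = {
    "965 Extreme core/L3/QPI": 3200,          # everything at 3.2 GHz
    "940/920 QPI link": 2400,                 # 4.8 GT/s double-pumped
    "940/920 L3 and memory controller": 2133,
}

for name, mhz in domains.items():
    print(f"{name}: ~{round(mhz / BASE_MHZ)}x")
```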

One implication of the slower memory controller frequency in the Core i7-920 and -940 is that, at least on our Intel “Smackover” board, one cannot achieve DDR3 memory speeds beyond 1066 MHz without overclocking the base system clock, which presents a real risk of instability. The multipliers just aren’t available in the BIOS to go beyond that. We’ll have to see how that works out in practice with enthusiast motherboards from the big names that, uh, aren’t Intel, but it appears DDR3-1066 may be a practical limit without overclocking for the 920 and 940, which is a shame.

Expect to see even more variety from Nehalem-derived processors in the future, because the architecture is designed to be modular. Intel may vary the core count, cache sizes, number of QPI links, the presence of Hyper-Threading, and the number of memory channels in future products. We also expect them to integrate a graphics core into some parts. Given what we’ve learned about uncore clocking flexibility, I’d expect some variance there, too. Intel may choose to, say, clock down various parts of the uncore, such as the L3 cache, in lower end or mobile products in order to save on power or to improve yields.

Unfortunately, more affordable variants of Nehalem may be a long time in coming. We know that mainstream desktop Nehalem derivatives are expected to have only two DDR3 memory channels and possibly integrated graphics, but those products may not arrive until well into next year. Until then, the Core i7 may remain a rather pricey option, because even the 920 is wedded to motherboards based on the premium X58 chipset. You may, though, want to check out our review of two of the first X58 boards right here.

A new socket, package, and chipset

Obviously, with all of the changes built into the Core i7, retaining compatibility with Intel’s existing LGA775 socket was out of the question. In its place, Intel has introduced the new LGA1366-style socket with, tada, more pins. Betcha can’t guess how many.

Anyhow, this new chip socket and package demands a few pictures, so here you are…


The Core i7 processor


From left to right: An LGA775-style Core 2 processor, a Core i7, and a Socket AM2-based Phenom

A Core i7 mounted in Intel’s DX58SO motherboard

A close-up of the new LGA1366 socket

As you can see, the Core i7’s new package is relatively large, as these things go. I’d expect a different, smaller socket and package for future mainstream Core i7 derivatives.

Matchups to watch

Before we move on to our test results, we should pause to consider several of the key matchups. The most obvious of those is the battle at 3.2GHz, where we pit the Core i7-965 Extreme against the fastest single-socket Core 2 processor, the Core 2 Extreme QX9770. This is, more or less, the clock-for-clock matchup between old and new generations that you’ll want to watch. Only it’s sort of bogus, since Turbo mode means the Core i7-965 Extreme typically runs at 3.33GHz or more.

Also contending at 3.2GHz: a dual-socket rig, the “Skulltrail” system with repurposed Xeons branded as Core 2 Extreme QX9775 processors. We threw this one in for fun, to see how this “ultimate” and “extreme” system would match up against the fastest Core i7. Of course, it’s not a fair fight, but it sure is a fun one.

One of the most intriguing matchups may be the Core i7 versus itself. We’ve tested the 965 Extreme with and without Hyper-Threading enabled throughout our test suite, to see what difference this feature makes. Watch for the “No HT” results to see what happens when Hyper-Threading is disabled.

Then there’s the face-off of the value quad cores, all of which have, at one time, occupied the basic price point at which the Core i7-920 now debuts. The Core 2 Quad Q6600 is a first-generation 65nm Core 2 processor and a long-time favorite here at TR. The 45nm Core 2 Quad Q9300 essentially supplanted the Q6600 and found its way into several of our system guide recommended configs during its time. I’m intrigued to see how the Core i7-920’s performance and value proposition matches up to these two economical quad-core CPUs.

Another quad contender with a nice, low price is the Phenom X4 9950. It’s also AMD’s current top-of-the-line processor, so we’ve of course included it. However, AMD’s pricing very much reflects its products’ limited performance, so there’s no direct competition right now between even the Core i7-920 and anything AMD has to offer.

Also in the mix for reference are a couple of higher frequency dual-core processors: the Core 2 Duo E8600, which runs at 3.33GHz and promises to perform very well in lightly threaded applications, and the Athlon 64 X2 6400+. At 3.2GHz, the X2 6400+ is AMD’s highest frequency desktop processor, and it may even upstage the Phenom in single- or dual-threaded apps. These days, though, AMD has chosen to fight Intel’s high-frequency dual cores with its triple core Phenom X3 8750, so we’ve included it, as well.

Test notes

We didn’t become fully aware of Nehalem’s flexible uncore clock options until the eleventh hour, and as a result, we’ve only just discovered a problem with one of our test setups. In the table below and throughout the review, you will see scores for a “Core i7-940” processor that is really just a Core i7-965 Extreme underclocked to the 2.93GHz core clock of the 940 model. Generally, simulating a speed grade of a chip like this isn’t a big problem, at least for performance testing if not power consumption. However, it turns out that, in following the guide Intel offered us for simulating a 940 with a 965, we (and they) missed a key variable: the “uncore” clock. Ours was running at 3.2 GHz when we simulated the 940, whereas the proper speed is 2.13 GHz. That discrepancy potentially made both the memory controller and the L3 cache quicker than they would be in the actual product. We’ve decided to leave the numbers for the 940 in the review, but please realize that they may overstate its performance somewhat. We will try to follow up with more exact numbers in a future article or update.

Special thanks to Corsair for equipping us with the all-new memory we used in testing. The most impressive DIMMs they supplied were part of a special Core i7-tailored three-module kit, pictured above. These puppies ran happily at their rated 8-8-8-24 timings, with a 1T command rate, at 1600 MHz and only 1.65V. The Core i7’s memory controller apparently may not deal well with higher voltages, but we found they weren’t necessary with these DIMMs.

Also, thanks to Asus for bringing our Phenom testbed up to date with this M3A79-T Deluxe mobo. We sought this one out because it has a 790FX north bridge combined with AMD’s new SB750 south bridge. Oh, yeah, and check out that CPU cooler, which I was too lazy to remove for the picture (and doing so would have decreased its awesomeness). Don’t put your finger in the fan, folks.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Asus P5E3 Premium (X48 Express MCH, ICH9R south bridge, BIOS 0605):
- Core 2 Quad Q6600 2.4 GHz: 1066 MT/s (266 MHz) front-side bus, memory at 1066 MHz (7-7-7-20, 2T)
- Core 2 Duo E8600 3.33 GHz and Core 2 Quad Q9300 2.5 GHz: 1333 MT/s (333 MHz) front-side bus, memory at 1333 MHz (8-8-8-20, 2T)
- Core 2 Extreme QX9770 3.2 GHz: 1600 MT/s (400 MHz) front-side bus, memory at 1600 MHz (8-8-8-24, 2T)
- Memory: 4GB (2 DIMMs) Corsair TW3X4G1800C8DF DDR3 SDRAM
- Chipset drivers: INF update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
- Audio: integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers

Intel D5400XS (5400 MCH, 6321ESB ICH, BIOS XS54010J.86A.1149.2008.0825.2339):
- Dual Core 2 Extreme QX9775 3.2 GHz: 1600 MT/s (400 MHz) front-side bus, memory at 800 MHz (5-5-5-18, 2T)
- Memory: 4GB (2 DIMMs) Micron ECC DDR2-800 FB-DIMM
- Chipset drivers: INF update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
- Audio: integrated 6321ESB/STAC9274D5 with SigmaTel 6.10.5713.7 drivers

Intel DX58SO (X58 IOH, ICH10R south bridge, BIOS SOX5810J.86A.2260.2008.0918.1758):
- Core i7-920 2.66 GHz and Core i7-940 2.93 GHz: QPI at 4.8 GT/s (2.4 GHz), memory at 1066 MHz (7-7-7-20, 2T)
- Core i7-965 Extreme 3.2 GHz: QPI at 6.4 GT/s (3.2 GHz), memory at 1600 MHz (8-8-8-24, 1T)
- Memory: 6GB (3 DIMMs) Corsair TR3X6G1600C8D DDR3 SDRAM
- Chipset drivers: INF update 9.1.0.1007, Matrix Storage Manager 8.5.0.1032
- Audio: integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers

Asus M3A79-T Deluxe (790FX north bridge, SB750 south bridge, BIOS 0403):
- Athlon 64 X2 6400+ 3.2 GHz: HyperTransport at 2.0 GT/s (1.0 GHz), memory at 800 MHz (4-4-4-12, 2T)
- Phenom X3 8750 2.4 GHz (HT 3.6 GT/s, 1.8 GHz) and Phenom X4 9950 Black 2.6 GHz (HT 4.0 GT/s, 2.0 GHz): memory at 1066 MHz (5-5-5-15, 2T)
- Memory: 4GB (2 DIMMs) Corsair TWIN4X4096-8500C5DF DDR2 SDRAM
- AHCI controller drivers: 3.1.1540.61
- Audio: integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers

Common to all systems:
- Hard drive: WD Caviar SE16 320GB SATA
- Graphics: Radeon HD 4870 512MB PCIe with Catalyst 8.55.4-081009a-070794E-ATI drivers
- OS: Windows Vista Ultimate x64 Edition with Service Pack 1 and the August 2008 DirectX redist update

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Our single-socket test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket system was powered by a PC Power & Cooling Turbo-Cool 1KW-SR power supply. Thanks to OCZ for providing these units for our use in testing.

Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.

The test systems’ Windows desktops were set at 1600×1200 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

So how does the Core i7’s overhauled cache and memory subsystem perform? We can measure it in various ways to find out. Here are a few synthetic benchmarks designed to do just that.

Whoa. The Core i7, she is fast, no? The Core i7-965 Extreme achieves nearly three times the throughput of the fastest single-socket Core 2 processor, the QX9770. With slower 1066 MHz memory, the Core i7-920 and 940 don’t quite reach the same heights, but they’re still much, much faster than anything else.

The Phenoms aren’t performing quite as well here as one might hope, and part of the reason may be because we ran the Phenom’s memory controller in dual 64-bit “unganged” mode rather than 128-bit mode. The 128-bit mode may produce somewhat higher scores in synthetic tests, but we chose to test with unganged mode because its all-around performance could potentially be superior.

The results from this test visually illustrate the throughput of the various levels of the memory hierarchy, and we find that the Core i7’s caches are all quite fast. Even at the 512 kB and 1 MB test blocks, where presumably we’re well into the L3 cache, the Core i7s achieve considerably more throughput than the Penryn-based QX9770.

The results without Hyper-Threading are curious: higher performance in the L1/L2 cache ranges, but lower performance in the L3 range.

Since it’s difficult to see the results once we get into main memory, let’s take a closer look at the 256 MB block size:

Among the Intel processors, these results are relatively similar to what we saw in Sandra’s first memory bandwidth test at the top of the page, though the numbers are lower. However, not only do the AMD processors perform relatively better, but their measured throughput is actually higher here. Still, the Phenom X4 9950 is not even close to the Core i7-920, let alone the faster options.

These results come from a little cachemem-like latency test program included with earlier versions of CPU-Z, and they give us a sense of what the Core i7’s integrated memory controller and revamped cache hierarchy bring to the table. (I’ve assumed “one tick up” Turbo clock speeds for the Core i7 processors in calculating access times.) Despite having a third cache level and a much larger total cache size, the 965 Extreme gets out to main memory as quickly as an Athlon X2 6400+, our previous champ. Remarkable. The Core i7-920, with its slower “uncore” clocks and 1066 MHz memory, is still quicker than most Core 2 chips.

If you think we’ve already geeked out beyond all reasonable hope, don’t scroll down any further. What you’ll see below are 3D graphs of memory access latencies at various block and step sizes for some of the most interesting processors we tested. We’ve color coded them just as a guide, although it doesn’t mean much. Yellow roughly corresponds to the chip’s L1 cache size, light orange to the L2 cache, red to the L3 cache, and dark orange to main memory.

Intel seems to have better managed the problem of L3 cache latency than AMD did with the Phenom, especially in the 965 Extreme, which runs its L3 cache at a full 3.2GHz.

Crysis Warhead

We measured Warhead performance using the FRAPS frame-rate recording tool and playing over the same 60-second section of the game five times on each processor. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
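That aggregation, a mean of the per-session averages and a median of the per-session lows, is simple enough to sketch. The session numbers here are made up purely for illustration:

```python
# Aggregating per-session FRAPS results: average the mean frame rates,
# take the median of the low frame rates to damp outliers.
from statistics import mean, median

# Hypothetical (avg_fps, low_fps) pairs from five sessions on one CPU.
sessions = [(61.2, 38.0), (59.8, 35.5), (60.5, 22.1), (62.0, 37.2), (60.1, 36.4)]

avg_fps = mean(s[0] for s in sessions)
low_fps = median(s[1] for s in sessions)  # the 22.1 outlier doesn't drag this down
print(round(avg_fps, 1), low_fps)
```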

We tested at relatively modest graphics settings, 1024×768 resolution with the game’s “Mainstream” quality settings, because we didn’t want our graphics card to be the performance-limiting factor. This is, after all, a CPU test.

When I first set out to put together our CPU test suite, I honestly wondered whether we could find any games that are really CPU-limited these days. Many of them are console ports and simply don’t require much CPU power to run well. This game is an exception, obviously. Like most games, however, Warhead doesn’t look to be heavily multithreaded, since our two dual-core processors perform relatively well here compared to their lower-speed quad-core siblings.

The top two spots are occupied by the Core i7-965 Extreme, with the non-Hyper-Threaded config proving to be a little faster—no surprise given this game’s lack of robust multithreading. Turning off HT and doing away with its partitioning of some on-chip resources does seem to offer a bit of a performance boost in the right situation.

Far Cry 2: Far Cry-ier

After playing around with Far Cry 2, I decided to test it a little bit differently by recording frame rates during the jeep ride sequence at the very beginning of the game. I found that frame rates during this sequence were generally similar to those when running around elsewhere in the game, and after all, playing Far Cry 2 involves quite a bit of driving around. Since this sequence was repeatable, I just captured results from three 90-second sessions.

Again, I didn’t want the graphics card to be our primary performance constraint, so although I tested at fairly high visual quality levels, I used a relatively low 1024×768 display resolution and DirectX 9.

The 965 Extreme again takes the top spots, but the Core i7-920 finishes in mid-pack, behind the Core 2 Quad 9300.

Unreal Tournament 3

As you saw on the preceding page, I did manage to find a couple of CPU-limited games to use in testing. I decided to try to concoct another interesting scenario by setting up a 24-player CTF game on UT3’s epic Facing Worlds map, in which I was the only human player. The rest? Bots controlled by the CPU. I racked up frags like mad while capturing five 60-second gameplay sessions for each processor.

Oh, and the screen resolution was set to 1280×1024 for testing, with UT3’s default quality options and “framerate smoothing” disabled.

We’re looking at playable frame rates with pretty much every processor tested, but we do seem to have sorted out the faster CPUs from the slower ones. Notice that the dual-core processors don’t fare as well here; some degree of multithreading seems to be at work.

All of the Core i7 processors finish strong, even the 920. However, the 940’s victory over the 965 Extreme is a reminder of how much variability is possible when testing in this manner.

Half Life 2: Episode Two

Our next test is a good, old custom-recorded in-game timedemo, precisely repeatable.

Ok, so we have frame rates well into the hundreds, but at least Episode Two‘s ceiling is high enough to show us the differences between the CPUs. Clock for clock, the Core i7 doesn’t look to be much faster than the Core 2 here.

Source engine particle simulation

Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.

This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.
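The general idea of spreading a particle update across cores can be sketched in a few lines. This is not Valve's code, just a minimal illustration of the pattern: split the particle list into chunks, hand each chunk to a worker process, and stitch the results back together.

```python
# Minimal sketch (not Valve's implementation) of chunking a particle update
# step across worker processes.
from concurrent.futures import ProcessPoolExecutor

def update_chunk(chunk, dt=0.01):
    # Advance each (position, velocity) particle by one simulation step.
    return [(x + v * dt, v) for x, v in chunk]

def step(particles, workers=4):
    n = max(1, len(particles) // workers)
    chunks = [particles[i:i + n] for i in range(0, len(particles), n)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = pool.map(update_chunk, chunks)
    return [p for chunk in results for p in chunk]

if __name__ == "__main__":
    particles = [(float(i), 1.0) for i in range(1000)]
    particles = step(particles)
    print(particles[0])  # first particle has moved by v * dt
```

A real engine would keep workers alive between frames rather than spinning up a pool per step, but the division of labor is the same.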

Chalk up a win for Hyper-Threading now that we have a nicely multithreaded application, and consider the Core i7’s dominance here. Even the 920 is faster than the Skulltrail dual-QX9775 system with its eight Penryn cores.

WorldBench

WorldBench’s overall score is a pretty decent indication of general-use performance for desktop computers. This benchmark uses scripting to step through a series of tasks in common Windows applications and then produces an overall score for comparison. WorldBench also records individual results for its component application tests, allowing us to compare performance in each. We’ll look at the overall score, and then we’ll show individual application results alongside the results from some of our own application tests.

Here’s a nice indication that the Core i7 offers a fairly general increase in performance. The 965 Extreme beats out the Core 2 Extreme QX9770 by 13 points in WorldBench’s overall index, which is a formidable margin. We’ll look at the individual results in the next few pages to see how the Core i7 did it.

Productivity and general use software

MS Office productivity

Firefox web browsing

Multitasking – Firefox and Windows Media Encoder

WinZip file compression

Nero CD authoring

Two of WorldBench’s tests above, MS Office and the Firefox/Windows Media Encoder combo, are noteworthy because they test a user multitasking scenario, during which multiple applications are running concurrently. In both cases, the Core i7 processors are among the fastest.

Meanwhile, the Nero test leans heavily on the disk controller, and you can see the distinct separation between the different chipsets we used.

Image processing

Photoshop

The Core i7 performs well here, but the Core 2 Duo E8600’s strong showing serves as a reminder that only one or two fast cores are necessary to ace this test.

The Panorama Factory photo stitching
The Panorama Factory handles an increasingly popular image processing task: joining together multiple images to create a wide-aspect panorama. This task can require lots of memory and can be computationally intensive, so The Panorama Factory comes in a 64-bit version that’s widely multithreaded. I asked it to join four pictures, each eight megapixels, into a glorious panorama of the interior of Damage Labs. The program’s timer function captures the amount of time needed to perform each stage of the panorama creation process. I’ve also added up the total operation time to give us an overall measure of performance.

The Core i7 stretches into new performance territory here, with the 965 Extreme once more embarrassing the dual-socket Skulltrail rig.

picCOLOR image analysis

picCOLOR was created by Dr. Reinert H. G. Müller of the FIBUS Institute. This isn’t Photoshop; picCOLOR’s image analysis capabilities can be used for scientific applications like particle flow analysis. Dr. Müller has supplied us with new revisions of his program for some time now, all the while optimizing picCOLOR for new advances in CPU technology, including MMX, SSE2, and Hyper-Threading. Naturally, he’s ported picCOLOR to 64 bits, so we can test performance with the x86-64 ISA. Many of the individual functions that make up the test are multithreaded.

The Core i7-920 has quietly racked up a string of performances in our image processing tests that place it well ahead of the mid-range quad-core processors, the Q6600 and Q9300, that it supplants.

Media encoding and editing

x264 HD benchmark

This benchmark tests performance with one of the most popular H.264 video encoders, the open-source x264. The results come in two parts, for the two passes the encoder makes through the video file. I’ve chosen to report them separately, since that’s typically how the results are reported in the public database of results for this benchmark. These scores come from the newer, faster version 0.59.819 of the x264 executable.

The Core i7 chips perform well enough during pass one, but it’s during pass two (which seems to use more threads) that they really shine.

Windows Media Encoder x64 Edition video encoding

Windows Media Encoder is one of the few popular video encoding tools that uses four threads to take advantage of quad-core systems, and it comes in a 64-bit version. Unfortunately, it doesn’t appear to use more than four threads, even on an eight-core system. For this test, I asked Windows Media Encoder to transcode a 153MB 1080-line widescreen video into a 720-line WMV using its built-in DVD/Hardware profile. Because the default “High definition quality audio” codec threw some errors in Windows Vista, I instead used the “Multichannel audio” codec. Both audio codecs have a variable bitrate peak of 192Kbps.

The Core i7 delivers a bit of a clock-for-clock performance gain over the Core 2 here, even though it’s handicapped by the fact that the app only uses four threads.

Windows Media Encoder video encoding

Roxio VideoWave Movie Creator

Make of these two WorldBench tests what you will. I prefer our other video encoding benchmarks instead.

LAME MT audio encoding

LAME MT is a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. Of course, multithreading works even better on multi-core processors. You can download a paper (in Word format) describing the programming effort.

Rather than run multiple parallel threads, LAME MT runs the MP3 encoder’s psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. That means this test won’t really use more than two CPU cores.
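That pipelining scheme is easy to picture in code. Here's an illustrative two-stage sketch (not LAME MT's actual source, and the "analysis" and "encode" functions are stand-ins): a bounded single-slot queue keeps the analysis thread exactly one frame ahead of the encoding thread.

```python
# Illustrative linear pipeline in the spirit of LAME MT: psycho-acoustic
# analysis runs one frame ahead on its own thread, results buffered for
# the encoder thread. The actual work functions here are stand-ins.
import queue
import threading

def analyze(frame):
    return frame * 2          # stand-in for psycho-acoustic analysis

def encode(frame, analysis):
    return (frame, analysis)  # stand-in for the MP3 encoding work

def pipeline(frames):
    buf = queue.Queue(maxsize=1)   # analysis stays at most one frame ahead
    done = object()

    def producer():
        for f in frames:
            buf.put(analyze(f))
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    out = []
    for f in frames:
        out.append(encode(f, buf.get()))
    assert buf.get() is done
    return out

print(pipeline([1, 2, 3]))  # [(1, 2), (2, 4), (3, 6)]
```

Because there are only two stages, adding more cores buys nothing here, which matches what we see in the results.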

We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here.

No real performance gains to report here.

3D modeling and rendering

Cinebench rendering

Graphics is a classic example of a computing problem that’s easily parallelizable, so it’s no surprise that we can exploit a multi-core processor with a 3D rendering app. Cinebench is the first of those we’ll try, a benchmark based on Maxon’s Cinema 4D rendering engine. It’s multithreaded and comes with a 64-bit executable. This test runs with just a single thread and then with as many threads as CPU cores are available.

Now here is a truly impressive performance from the Core i7. Even the Core i7-920 trounces the QX9770, thanks in part to Hyper-Threading. Let’s look at a few more tests, and we’ll discuss the results at the bottom of the page.

POV-Ray rendering

We’re using the latest beta version of POV-Ray 3.7 that includes native multithreading and 64-bit support. Some of the beta 64-bit executables have been quite a bit slower than the 3.6 release, but this should give us a decent look at comparative performance, regardless.

3ds max modeling and rendering

Valve VRAD map compilation

This next test processes a map from Half-Life 2 using Valve’s VRAD lighting tool. Valve uses VRAD to precompute lighting that goes into games like Half-Life 2.

In each of the three fully multithreaded rendering tests above—the POV-Ray chess scene, 3ds max rendering, and Valve VRAD—the Core i7 brings major performance gains over the Core 2. Even the Core i7-920 is consistently faster than the Core 2 Extreme QX9770.

Folding@home

Next, we have a slick little Folding@home benchmark CD created by notfred, one of the members of Team TR, our excellent Folding team. For the unfamiliar, Folding@home is a distributed computing project created by folks at Stanford University that investigates how proteins work in the human body, in an attempt to better understand diseases like Parkinson’s, Alzheimer’s, and cystic fibrosis. It’s a great way to use your PC’s spare CPU cycles to help advance medical research. I’d encourage you to visit our distributed computing forum and consider joining our team if you haven’t already joined one.

The Folding@home project uses a number of highly optimized routines to process different types of work units from Stanford’s research projects. The Gromacs core, for instance, uses SSE on Intel processors, 3DNow! on AMD processors, and Altivec on PowerPCs. Overall, Folding@home should be a great example of real-world scientific computing.

notfred’s Folding Benchmark CD tests the most common work unit types and estimates performance in terms of the points per day that a CPU could earn for a Folding team member. The CD itself is a bootable ISO. The CD boots into Linux, detects the system’s processors and Ethernet adapters, picks up an IP address, and downloads the latest versions of the Folding execution cores from Stanford. It then processes a sample work unit of each type.

On a system with two CPU cores, for instance, the CD spins off a Tinker WU on core 1 and an Amber WU on core 2. When either of those WUs are finished, the benchmark moves on to additional WU types, always keeping both cores occupied with some sort of calculation. Should the benchmark run out of new WUs to test, it simply processes another WU in order to prevent any of the cores from going idle as the others finish. Once all four of the WU types have been tested, the benchmark averages the points per day among them. That points-per-day average is then multiplied by the number of cores on the CPU in order to estimate the total number of points per day that CPU might achieve.
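The scoring arithmetic the benchmark uses boils down to a couple of lines. The points-per-day figures below are invented for illustration; only the averaging-then-multiplying method comes from the benchmark's description.

```python
# Sketch of the benchmark's scoring method: average the per-core
# points-per-day across WU types, then multiply by the core count.
# The PPD numbers here are made up.
from statistics import mean

ppd_per_core = {"Tinker": 150.0, "Amber": 220.0, "Gromacs": 310.0, "Double Gromacs": 260.0}
cores = 4

estimate = mean(ppd_per_core.values()) * cores
print(f"estimated total: {estimate:.0f} points per day")
```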

This may be a somewhat quirky method of estimating overall performance, but my sense is that it generally ought to work. We’ve discussed some potential reservations about how it works here, for those who are interested. I have included results for each of the individual WU types below, so you can see how the different CPUs perform on each.

Because of the presence of Hyper-Threading, you have to look at that final graph to make sense of these results. The benchmark keeps eight threads active all of the time on the Core i7, which reduces per-thread performance. Once we get to the end of the road, though, and estimate the total projected points per day, both the Core i7 and Hyper-Threading prove to be winners. Without Hyper-Threading, the Core i7-965 Extreme is only marginally faster than the QX9770, but with it, the contest becomes a rout.

MyriMatch proteomics

Our benchmarks sometimes come from unexpected places, and such is the case with this one. David Tabb is a friend of mine from high school and a long-time TR reader. He has provided us with an intriguing new benchmark based on an application he’s developed for use in his research work. The application is called MyriMatch, and it’s intended for use in proteomics, or the large-scale study of proteins. I’ll stop right here and let him explain what MyriMatch does:

In shotgun proteomics, researchers digest complex mixtures of proteins into peptides, separate them by liquid chromatography, and analyze them by tandem mass spectrometers. This creates data sets containing tens of thousands of spectra that can be identified to peptide sequences drawn from the known genomes for most lab organisms. The first software for this purpose was Sequest, created by John Yates and Jimmy Eng at the University of Washington. Recently, David Tabb and Matthew Chambers at Vanderbilt University developed MyriMatch, an algorithm that can exploit multiple cores and multiple computers for this matching. Source code and binaries of MyriMatch are publicly available.
In this test, 5555 tandem mass spectra from a Thermo LTQ mass spectrometer are identified to peptides generated from the 6714 proteins of S. cerevisiae (baker’s yeast). The data set was provided by Andy Link at Vanderbilt University. The FASTA protein sequence database was provided by the Saccharomyces Genome Database.

MyriMatch uses threading to accelerate the handling of protein sequences. The database (read into memory) is separated into a number of jobs, typically the number of threads multiplied by 10. If four threads are used in the above database, for example, each job consists of 168 protein sequences (1/40th of the database). When a thread finishes handling all proteins in the current job, it accepts another job from the queue. This technique is intended to minimize synchronization overhead between threads and minimize CPU idle time.
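The job-splitting scheme David describes is straightforward to sketch. This isn't MyriMatch's source, just the pattern: carve the 6714-protein database into threads × 10 jobs and let worker threads pull jobs from a shared queue until it runs dry. With four threads, that's 40 jobs of roughly 168 proteins each, matching the figures above.

```python
# Sketch of MyriMatch's described work-queue scheme: the database is split
# into (threads x 10) jobs; worker threads pull jobs until the queue empties.
# The per-protein "matching" work is a stand-in.
import queue
import threading

def make_jobs(proteins, threads):
    njobs = threads * 10
    size = -(-len(proteins) // njobs)          # ceiling division
    return [proteins[i:i + size] for i in range(0, len(proteins), size)]

def run(proteins, threads=4):
    jobs = queue.Queue()
    for job in make_jobs(proteins, threads):
        jobs.put(job)
    processed, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                job = jobs.get_nowait()        # accept another job...
            except queue.Empty:
                return                         # ...until none remain
            with lock:
                processed.extend(job)          # stand-in for spectrum matching

    ts = [threading.Thread(target=worker) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return processed

jobs = make_jobs(list(range(6714)), threads=4)
print(len(jobs), len(jobs[0]))  # 40 jobs, 168 proteins in the first
```

Coarse-grained jobs like these keep synchronization overhead low: threads only touch the shared queue between jobs, not between proteins.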

The most important news for us is that MyriMatch is a widely multithreaded real-world application that we can use with a relevant data set. MyriMatch also offers control over the number of threads used, so we’ve tested with one to eight threads.

I should mention that performance scaling in MyriMatch tends to be limited by several factors, including memory bandwidth, as David explains:

Inefficiencies in scaling occur from a variety of sources. First, each thread is comparing to a common collection of tandem mass spectra in memory. Although most peptides will be compared to different spectra within the collection, sometimes multiple threads attempt to compare to the same spectra simultaneously, necessitating a mutex mechanism for each spectrum. Second, the number of spectra in memory far exceeds the capacity of processor caches, and so the memory controller gets a fair workout during execution.

Here’s how the processors performed.

The Core i7-965 Extreme performs in 60 seconds what the Core 2 Extreme QX9770 requires 100 seconds to complete. This sort of thorny, bandwidth-intensive application benefits greatly from the Core i7’s architectural innovations. Here’s another case where even the dual Core 2 Extreme QX9775 processors in the Skulltrail system can’t keep up with the Core i7.

STARS Euler3d computational fluid dynamics

Charles O’Neill works in the Computational Aeroservoelasticity Laboratory at Oklahoma State University, and he contacted us to suggest we try the computational fluid dynamics (CFD) benchmark based on the STARS Euler3D structural analysis routines developed at CASELab. This benchmark has been available to the public for some time in single-threaded form, but Charles was kind enough to put together a multithreaded version of the benchmark for us with a larger data set. He has also put a web page online with a downloadable version of the multithreaded benchmark, a description, and some results here.

In this test, the application is basically doing analysis of airflow over an aircraft wing. I will step out of the way and let Charles explain the rest:

The benchmark testcase is the AGARD 445.6 aeroelastic test wing. The wing uses a NACA 65A004 airfoil section and has a panel aspect ratio of 1.65, taper ratio of 0.66, and a quarter-chord sweep angle of 45º. This AGARD wing was tested at the NASA Langley Research Center in the 16-foot Transonic Dynamics Tunnel and is a standard aeroelastic test case used for validation of unsteady, compressible CFD codes.
The CFD grid contains 1.23 million tetrahedral elements and 223 thousand nodes . . . . The benchmark executable advances the Mach 0.50 AGARD flow solution. A benchmark score is reported as a CFD cycle frequency in Hertz.

So the higher the score, the faster the computer. Charles tells me these CFD solvers are very floating-point intensive, but oftentimes limited primarily by memory bandwidth. He has modified the benchmark for us in order to enable control over the number of threads used. Here’s how our contenders handled the test with different thread counts.

I believe we have a new world record in this benchmark, and it comes from a single-socket Core i7 system. The dual Core 2 Extreme QX9775 system—essentially a 2P Xeon in disguise—is beaten by even the Core i7-920.

Power consumption and efficiency

Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.

Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

Wow, the Core i7’s idle power consumption is very reasonable, especially considering it has a third DIMM in the system that the others don’t.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.

The Core i7’s peak power use is definitely up from the quad-core Penryns, as one might expect from a larger chip with a design focused on keeping execution units more fully occupied. Peak power draw is only part of the story, though.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

The mix of reasonably low idle power draw and relatively short render times adds up to a moderate amount of energy consumed by the Core i7 systems over the duration of our test period.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
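The accounting itself is simple: sum the logged wall-power readings over the render window, with each reading weighted by the sampling interval, and you have watt-seconds, i.e., joules. The power log below is fabricated for illustration; the method is what matters.

```python
# Sketch of the energy-to-render calculation: integrate logged wall power
# over the isolated render window. Sample numbers are made up.
def energy_joules(samples, t_start, t_end, dt=1.0):
    # samples: wall-power readings in watts, taken every dt seconds
    return sum(w * dt for w in samples[int(t_start / dt):int(t_end / dt)])

power_log = [120.0] * 10 + [250.0] * 30 + [120.0] * 10  # idle, render, idle
print(energy_joules(power_log, 10, 40))  # 250 W for 30 s = 7500 J
```

A faster chip that draws more at peak can still win here: halve the render time and you can afford nearly double the power draw for the same energy bill.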

Although the Core i7 systems tend to consume a little more power at peak than the quad-core Penryns, they also tend to finish rendering much sooner. Their more efficient execution means the Core i7 processors require less energy to complete the task of rendering our sample scene. These results also illustrate why Intel claims Hyper-Threading improves power efficiency—because it can.

Overclocking

Ok, so I haven’t yet gotten around to overclocking my Core i7 processors, but I asked Geoff to give it a try, and here’s what he managed to do with his Core i7-920.

Although the 920’s upper multiplier is locked, he was able to increase the base system clock—which CPU-Z labels as “bus speed”—in order to raise the core speed. Using that method, he made it to 3.3 GHz, which isn’t too shabby.
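The arithmetic behind that method is simple: core speed is the base clock times the multiplier. The i7-920's stock configuration is a 20× multiplier on a 133 MHz base clock; the exact base clock Geoff settled on isn't given above, so the 166 MHz figure below is my assumption to illustrate how 3.3 GHz falls out.

```python
# Core speed = base clock x multiplier. Stock i7-920 values, plus an
# assumed 166 MHz base clock to illustrate the ~3.3 GHz result.
multiplier = 20
stock_bclk_mhz = 133
oc_bclk_mhz = 166  # assumption for illustration

stock_ghz = stock_bclk_mhz * multiplier / 1000
oc_ghz = oc_bclk_mhz * multiplier / 1000
print(f"stock: {stock_ghz:.2f} GHz, overclocked: {oc_ghz:.2f} GHz")
```

Note that raising the base clock also drags the uncore, QuickPath, and memory clocks along with it unless their own ratios are adjusted, which is part of what makes Core i7 overclocking a new game.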

Then again, I think better things are possible, but I need to play with these chips a little more. Come back later, because I’ll update this page once I have something to report. Shouldn’t take too long.

Conclusions

The Core i7-965 Extreme is, by far, the fastest processor we’ve ever tested, and it seems clear the Core i7 architecture brings with it a general performance increase over the 45nm Core 2 processors it succeeds. We’ve seen that increase in everyday desktop applications, including the WorldBench suite and several of the latest games. In part, the Core i7’s performance gains come from higher clock frequencies due to the “Turbo mode” mechanism. When the Core i7-965 Extreme is operating at 3.33 or 3.46 GHz, it’s going to be somewhat faster than a Core 2 at 3.2GHz. That’s why I’ve been hesitant to talk about clock-for-clock performance gains for Core i7, as you may have noticed.

Yet in some cases, the Core i7 undeniably delivers clock-for-clock performance increases over Core 2, along with dramatic gains in absolute performance. We saw the biggest improvements in some specific sorts of workloads, including 3D rendering, scientific computing/HPC applications, and nearly any application that could spawn up to eight threads. More than once, a single Core i7-965 Extreme outran our dual-socket “Skulltrail” system by a considerable margin. This new system architecture pushes the performance frontiers forward in places where progress had previously been rather halting.

Such things aren’t exactly the material of everyday futzing around on the PC, but we’re long past the point where Microsoft Office is a prime target for performance optimizations. In fact, for the average guy, the secret hero of our test results was the Core 2 Duo E8600. If your main reason for wanting a fast computer is to surf the web and play games, you’re probably better off getting a fast dual-core like the E8600 than you are picking up a Core i7-920 or any quad-core processor. Game developers keep threatening to really make use of more than two cores, but it just hasn’t happened yet.

Even so, one has to appreciate what Intel has accomplished here. The Core i7 is another solid step beyond its last two product generations, the 45nm and 65nm versions of Core 2. As our power testing showed, the larger Core i7’s power draw at idle is similar to a quad-core Penryn’s. Although its peak power draw is higher, the Core i7 can use less energy to complete a given task, as it did in our Cinebench rendering example. And the new system architecture established by the Core i7 will likely be the basis for Intel systems for the next five years, at least. On all fronts, progress.

One question that remains: Has Intel now built an insurmountable lead over AMD? Almost seems like it. But one never knows. AMD’s 45nm quad cores are coming soon. Perhaps they’ll have a few surprises in store for us.
