A closer look at the Core i7-940

When we first reviewed the Core i7 processor, we had two chips on hand: the high-end Core i7-965 Extreme and the more affordable Core i7-920. Sandwiched in between them in Intel’s product lineup is the Core i7-940. Since we didn’t have one of those to test, we employed a trick we sometimes use and turned down the clock speed on our Core i7-965 Extreme from its native 3.2GHz to the 940’s 2.93GHz frequency. Given the breadth of CPU model ranges these days, we find ourselves using this trick fairly often. In fact, in this case, Intel even recommended that reviewers use this method to test Core i7-940 performance and provided the media with instructions for setting the proper clock speeds.

But such an approach always comes with caveats. Although we’re usually confident that the performance results of a “simulated” product model, using the same silicon and the same clock speed, will be true to the original, we’re less confident that the power consumption results will match the actual chip. In fact, power consumption can vary from one individual chip to the next, as can the operating voltage, which is set at the factory. Beyond that, there is the slight possibility that our simulated product might somehow not match the original in terms of its configuration—and thus its performance.

We discovered such a problem with our “simulated” Core i7-940 just ahead of the publication of our initial Core i7 review. Turns out that both our test configuration and Intel’s reviewer’s guide had overlooked an important characteristic of the Core i7: it has multiple clock domains that govern its processing cores and what Intel calls its “uncore” elements, such as the memory controller, QuickPath interconnect, and L3 cache. We had set the proper CPU core and QPI link speeds for our simulated Core i7-940, but the memory controller and “uncore” clock were incorrect—they were set to run at 3.2GHz, not the 940’s default speed of 2.13GHz for both. Not only that, but unlike the “unlocked” Extreme edition, the 940’s “uncore” clocks are limited to 2.13GHz and cannot go any higher (at least, not without overclocking).
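To make the clock-domain relationship concrete, here’s a minimal sketch of how these frequencies derive from the platform’s roughly 133MHz base clock and per-domain multipliers. The multiplier values below are our own illustration of the arithmetic, not figures pulled from Intel’s documentation:

```python
# Nehalem-era clocks are the ~133 MHz base clock (BCLK) times a
# per-domain multiplier. Multipliers here are illustrative.
BCLK = 133.33  # MHz

def domain_clock(multiplier, bclk=BCLK):
    """Return a clock domain's frequency in GHz for a given multiplier."""
    return round(bclk * multiplier / 1000, 2)

# Core i7-940 defaults: 22x for the cores, 16x for the uncore
print(domain_clock(22))  # core clock: ~2.93 GHz
print(domain_clock(16))  # uncore (L3 cache + memory controller): ~2.13 GHz

# Our mis-set simulated 940 ran the uncore at the 965 Extreme's 24x
print(domain_clock(24))  # ~3.2 GHz
```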

After the nice EMT people used the paddles to revive me from my minor coronary event, I decided to go ahead with the publication of our Core i7 review with the simulated Core i7-940 performance numbers intact, but I promised in the text of the review to follow up with numbers from a properly configured Core i7-940.

You see, what we didn’t know at that time was the precise impact of the fast L3 cache speed on the Core i7-940’s performance. But we knew from experience with similar architectures, such as the AMD Phenom, that raising the L3 cache speed of the Core i7 from 2.13GHz to 3.2GHz was potentially what we call in the industry A Very Big Deal; it would affect memory latency, cache bandwidth, and even power consumption. Put plainly, our original review of the Core i7 overstated the 940’s performance by some amount, possibly five to ten percent.

To rectify this situation, we got our thermal-paste-stained hands on the genuine article, a retail boxed Core i7-940 processor, to see how the real thing performs. Here’s a nice picture of the soothing blue box:

We popped this puppy (the processor, not the box) into the exact same test rig we’d used for our initial Core i7 testing and ran it through the entirety of our CPU test suite.

In the end, the differences between our original, simulated Core i7-940 and the retail copy of the same weren’t huge, but they were measurable and sometimes fairly significant. Rather than hit you with a full-on deluge of performance data, I’ve selected some of our key tests—including synthetic memory benchmarks, productivity applications, and games—and provided results from them on the following pages. The scores from the two key contenders in this matchup are colored red in the graphs for easy identification. Our simulated Core i7-940 with the too-fast uncore clocks is marked as (sim), while the real McCoy simply goes by its proper name. Read on to see how much (and sometimes how little) difference the actual product’s slower L3 cache and memory controller can make.

Our testing methods

As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and the results were averaged.

Our test systems were configured like so:

Core 2 Quad Q6600 2.4 GHz
System bus: 1066 MT/s (266 MHz)
Motherboard: Asus P5E3 Premium (BIOS 0605)
North bridge: X48 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
Memory: 4GB (2 DIMMs) Corsair TW3X4G1800C8DF DDR3 SDRAM at 1066 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 7-7-7-20, 2T command rate
Audio: Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers

Core 2 Duo E8600 3.33 GHz and Core 2 Quad Q9300 2.5 GHz
System bus: 1333 MT/s (333 MHz)
Motherboard: Asus P5E3 Premium (BIOS 0605)
North bridge: X48 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
Memory: 4GB (2 DIMMs) Corsair TW3X4G1800C8DF DDR3 SDRAM at 1333 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 8-8-8-20, 2T command rate
Audio: Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers

Core 2 Extreme QX9770 3.2 GHz
System bus: 1600 MT/s (400 MHz)
Motherboard: Asus P5E3 Premium (BIOS 0605)
North bridge: X48 Express MCH
South bridge: ICH9R
Chipset drivers: INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
Memory: 4GB (2 DIMMs) Corsair TW3X4G1800C8DF DDR3 SDRAM at 1600 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 8-8-8-24, 2T command rate
Audio: Integrated ICH9R/AD1988B with SoundMAX 6.10.2.6480 drivers

Dual Core 2 Extreme QX9775 3.2 GHz
System bus: 1600 MT/s (400 MHz)
Motherboard: Intel D5400XS (BIOS XS54010J.86A.1149.2008.0825.2339)
North bridge: 5400 MCH
South bridge: 6321ESB ICH
Chipset drivers: INF Update 9.0.0.1008, Matrix Storage Manager 8.5.0.1032
Memory: 4GB (2 DIMMs) Micron ECC DDR2-800 FB-DIMM at 800 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 5-5-5-18, 2T command rate
Audio: Integrated 6321ESB/STAC9274D5 with SigmaTel 6.10.5713.7 drivers

Core i7-920 2.66 GHz and Core i7-940 2.93 GHz
QPI: 4.8 GT/s (2.4 GHz)
Motherboard: Intel DX58SO (BIOS SOX5810J.86A.2260.2008.0918.1758)
North bridge: X58 IOH
South bridge: ICH10R
Chipset drivers: INF update 9.1.0.1007, Matrix Storage Manager 8.5.0.1032
Memory: 6GB (3 DIMMs) Corsair TR3X6G1600C8D DDR3 SDRAM at 1066 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 7-7-7-20, 2T command rate
Audio: Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers

Core i7-965 Extreme 3.2 GHz
QPI: 6.4 GT/s (3.2 GHz)
Motherboard: Intel DX58SO (BIOS SOX5810J.86A.2260.2008.0918.1758)
North bridge: X58 IOH
South bridge: ICH10R
Chipset drivers: INF update 9.1.0.1007, Matrix Storage Manager 8.5.0.1032
Memory: 6GB (3 DIMMs) Corsair TR3X6G1600C8D DDR3 SDRAM at 1600 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 8-8-8-24, 1T command rate
Audio: Integrated ICH10R/ALC889 with Realtek 6.0.1.5704 drivers

Athlon 64 X2 6400+ 3.2 GHz
HyperTransport: 2.0 GT/s (1.0 GHz)
Motherboard: Asus M3A79-T Deluxe (BIOS 0403)
North bridge: 790FX
South bridge: SB750
Chipset drivers: AHCI controller 3.1.1540.61
Memory: 4GB (2 DIMMs) Corsair TWIN4X4096-8500C5DF DDR2 SDRAM at 800 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 4-4-4-12, 2T command rate
Audio: Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers

Phenom X3 8750 2.4 GHz and Phenom X4 9950 Black 2.6 GHz
HyperTransport: 3.6 GT/s (1.8 GHz) and 4.0 GT/s (2.0 GHz), respectively
Motherboard: Asus M3A79-T Deluxe (BIOS 0403)
North bridge: 790FX
South bridge: SB750
Chipset drivers: AHCI controller 3.1.1540.61
Memory: 4GB (2 DIMMs) Corsair TWIN4X4096-8500C5DF DDR2 SDRAM at 1066 MHz (effective)
Memory timings (CL-tRCD-tRP-tRAS): 5-5-5-15, 2T command rate
Audio: Integrated SB750/AD2000B with SoundMAX 6.10.2.6480 drivers

Common to all systems:
Hard drive: WD Caviar SE16 320GB SATA
Graphics: Radeon HD 4870 512MB PCIe with Catalyst 8.55.4-081009a-070794E-ATI drivers
OS: Windows Vista Ultimate x64 Edition
OS updates: Service Pack 1, DirectX redist update August 2008

Thanks to Corsair for providing us with memory for our testing. Their products and support are far and away superior to generic, no-name memory.

Our single-socket test systems were powered by OCZ GameXStream 700W power supply units. The dual-socket system was powered by a PC Power & Cooling Turbo-Cool 1KW-SR power supply. Thanks to OCZ for providing these units for our use in testing.

Also, the folks at NCIXUS.com hooked us up with a nice deal on the WD Caviar SE16 drives used in our test rigs. NCIX now sells to U.S. customers, so check them out.

The test systems’ Windows desktops were set at 1600×1200 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled.

We used the following versions of our test applications:

The tests and methods we employ are usually publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.

Memory subsystem performance

Right off the bat, we see that the simulated 940’s faster L3 cache and memory controller speed can make a difference. The retail 940 scores almost identically to the Core i7-920, which shares its 2.13GHz L3 and memory controller clocks.

There’s not much separation here overall, but look closely at the 4MB block size. This is right where the L3 cache on the Core i7 will be getting a workout, and the bandwidth difference between the two 940s is substantial: 55 GB/s for the simulated part, and 37 GB/s for the real deal.

Since it’s even more difficult to see the results once we get into main memory, let’s take a closer look at the 256MB block size:

In this test, once main memory becomes the limiting factor, the simulated and retail 940s perform the same.

Memory access latency is another story, though. The simulated 940 shaves eight nanoseconds off the access time of the real thing, a fact that could have real-world performance implications.
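To put that eight-nanosecond figure in perspective, here’s a quick back-of-the-envelope conversion into core clock cycles at the 940’s 2.93GHz:

```python
# At 2.93 GHz, each core clock cycle takes ~0.34 ns, so a nanosecond
# of latency costs 2.93 cycles.
ns_saved = 8        # latency advantage of the simulated 940, per the text
core_ghz = 2.93     # Core i7-940 core clock; cycles per nanosecond

cycles = ns_saved * core_ghz
print(round(cycles))  # roughly 23 core clock cycles per memory access
```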

WorldBench

For a good look at general desktop application performance, we’ll focus on WorldBench’s overall score and then its individual components.

The simulated 940 comes out four points ahead of the retail product in WorldBench’s overall index. We can see which tests contributed most to this gap by looking through the results below.

Surprisingly enough, the faster L3 cache of the simulated 940 is good for a slight but broad performance boost in nearly every application here. The biggest differences come in the Firefox web browser test and in the multitasking test that runs Firefox concurrently with Windows Media Encoder.

Crysis Warhead

We measured Warhead performance using the FRAPS frame-rate recording tool and playing over the same 60-second section of the game five times on each processor. This method has the advantage of simulating real gameplay quite closely, but it comes at the expense of precise repeatability. We believe five sample sessions are sufficient to get reasonably consistent results. In addition to average frame rates, we’ve included the low frame rates, because those tend to reflect the user experience in performance-critical situations. In order to diminish the effect of outliers, we’ve reported the median of the five low frame rates we encountered.
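The aggregation described above can be sketched in a few lines: average frame rates are meaned across the five runs, while the per-run low frame rates are reduced to their median so a single anomalous run can’t skew the result. The numbers below are invented for illustration:

```python
from statistics import mean, median

# Hypothetical per-run results from five 60-second FRAPS sessions
avg_fps_per_run = [62.1, 60.8, 61.5, 63.0, 61.9]
low_fps_per_run = [38, 41, 12, 40, 39]  # one run hit an outlier hitch

print(round(mean(avg_fps_per_run), 1))  # averaged frame rate
print(median(low_fps_per_run))          # the 12 fps outlier is ignored
```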

We tested at relatively modest graphics settings, 1024×768 resolution with the game’s “Mainstream” quality settings, because we didn’t want our graphics card to be the performance-limiting factor. This is, after all, a CPU test.

Far Cry 2

After playing around with Far Cry 2, I decided to test it a little bit differently by recording frame rates during the jeep ride sequence at the very beginning of the game. I found that frame rates during this sequence were generally similar to those when running around elsewhere in the game, and after all, playing Far Cry 2 involves quite a bit of driving around. Since this sequence was repeatable, I just captured results from three 90-second sessions.

Again, I didn’t want the graphics card to be our primary performance constraint, so although I tested at fairly high visual quality levels, I used a relatively low 1024×768 display resolution and DirectX 9.

Performance in both of these games is also affected by the slower cache in the retail chip. We’re not talking about the sort of delta you’re likely to feel by the seat of your pants as you play, but the effect is real, regardless.

Unreal Tournament 3

As you saw on the preceding page, I did manage to find a couple of CPU-limited games to use in testing. I decided to try to concoct another interesting scenario by setting up a 24-player CTF game on UT3’s epic Facing Worlds map, in which I was the only human player. The rest? Bots controlled by the CPU. I racked up frags like mad while capturing five 60-second gameplay sessions for each processor.

Oh, and the screen resolution was set to 1280×1024 for testing, with UT3’s default quality options and “framerate smoothing” disabled.

Testing games manually can be a rather variable affair, especially in a more random scenario like this one, and we cited our simulated 940’s scores in UT3 as proof of that fact in our initial Core i7 review. Here, with the proper L3 cache and memory controller speeds, the retail 940 performs more or less as expected, on average.

Half Life 2: Episode Two

Our next test is a good, old custom-recorded in-game timedemo, precisely repeatable.

Again, a minor but consistent difference.

Source engine particle simulation

Next up is a test we picked up during a visit to Valve Software, the developers of the Half-Life games. They had been working to incorporate support for multi-core processors into their Source game engine, and they cooked up some benchmarks to demonstrate the benefits of multithreading.

This test runs a particle simulation inside of the Source engine. Most games today use particle systems to create effects like smoke, steam, and fire, but the realism and interactivity of those effects are limited by the available computing horsepower. Valve’s particle system distributes the load across multiple CPU cores.

More of the same here.

Power consumption and efficiency

Our Extech 380803 power meter has the ability to log data, so we can capture power use over a span of time. The meter reads power use at the wall socket, so it incorporates power use from the entire system—the CPU, motherboard, memory, graphics solution, hard drives, and anything else plugged into the power supply unit. (We plugged the computer monitor into a separate outlet, though.) We measured how each of our test systems used power across a set time period, during which time we ran Cinebench’s multithreaded rendering test.

All of the systems had their power management features (such as SpeedStep and Cool’n’Quiet) enabled during these tests via Windows Vista’s “Balanced” power options profile.

Let’s slice up the data in various ways in order to better understand them. We’ll start with a look at idle power, taken from the trailing edge of our test period, after all CPUs have completed the render.

The lower “uncore” clocks of the real Core i7-940 grant it the advantage of a couple of watts less power consumption at idle—makes sense, when you think about it.

Next, we can look at peak power draw by taking an average from the ten-second span from 15 to 25 seconds into our test period, during which the processors were rendering.
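Here’s a rough sketch of how the idle and peak figures can be pulled from a logged power trace like ours. The sample trace and timings below are invented, and one-sample-per-second logging is an assumption:

```python
# Fabricated 1 Hz wall-power trace: idle, render, then idle again
samples = [(t, 140.0) for t in range(0, 15)] \
        + [(t, 245.0) for t in range(15, 60)] \
        + [(t, 141.0) for t in range(60, 90)]

# Peak draw: average over the 15-25 second window, while rendering
peak = sum(w for t, w in samples if 15 <= t < 25) / 10

# Idle draw: average over the trailing edge, after the render completes
tail = [w for t, w in samples if t >= 80]
idle = sum(tail) / len(tail)

print(peak, round(idle, 1))
```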

Here’s the true surprise, though: even with its slower L3 cache and memory controller frequencies, the system equipped with the retail Core i7-940 draws about 15W more than the one with our simulated 940 from our original review. Why? Likely because Intel sorts its chips for the various models based on their characteristics, and the Core i7-965 Extreme chips are the best ones—most able to reach higher clock speeds and to do so with less voltage. Underclocking a 965 Extreme doesn’t give us the best handle on the Core i7-940’s true power consumption because the 965 Extreme is a sweetheart of a chip. The more pedestrian retail Core i7-940 requires more voltage to reach similar clock speeds, and it thus draws more power.

Another way to gauge power efficiency is to look at total energy use over our time span. This method takes into account power use both during the render and during the idle time. We can express the result in terms of watt-seconds, also known as joules.

We can quantify efficiency even better by considering specifically the amount of energy used to render the scene. Since the different systems completed the render at different speeds, we’ve isolated the render period for each system. We’ve then computed the amount of energy used by each system to render the scene. This method should account for both power use and, to some degree, performance, because shorter render times may lead to less energy consumption.
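The per-render energy calculation can be sketched like so. The power draws and render times below are invented, and we assume one sample per second, so each watt-sample contributes one watt-second (one joule):

```python
def joules(samples, start, end):
    """Integrate a 1 Hz power trace (watts) over [start, end) seconds."""
    return sum(w for t, w in samples if start <= t < end)  # W * 1 s = J

# Two hypothetical systems: one draws more but finishes sooner
fast = [(t, 250.0) for t in range(100)]   # renders in 40 s
slow = [(t, 230.0) for t in range(100)]   # renders in 55 s

print(joules(fast, 0, 40))  # higher draw, shorter window
print(joules(slow, 0, 55))  # lower draw, yet more total energy
```

This is why a shorter render time can more than offset a higher peak draw in the efficiency standings.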

The retail 940’s higher peak power draw and slightly lower performance team up to reduce its overall power efficiency somewhat in our final two tests. The 940 is still very power efficient, of course, but not quite what we first thought.

Conclusions

Now that you’ve seen the performance results, this little exercise may seem like a whole lot of work for a relatively minor issue. But we do strive for accuracy in our testing, and as you can see, our first attempt at measuring the performance of the Core i7-940 did overstate that processor’s performance slightly—but fairly consistently. The peak power consumption numbers for our simulated 940 were also a little tame compared to the real thing. We’ll be following up with additional CPU reviews based on this same performance data, and we’ll soon publish an article with a complete set of results from the retail Core i7-940 included.

Meanwhile, we have learned to be especially wary of the many clock domains inside of new CPUs like the Core i7 and the Phenom. Those secondary clocks can make a difference in overall performance (especially the L3 cache speed), and we expect this issue to become more prominent with time. In fact, we expect Intel to vary the L3 cache, QPI, and memory controller speeds of its Nehalem-derived processors quite a bit as it tailors them for different markets. As a result, it’s possible that a mobile version of the Core i7 (or whatever it’s called once it gets there) simply may not perform as well as the desktop equivalent, even if the core clock speeds are identical.

Overclockers will want to pay attention to these matters, as well. We’ve found that faster memory can make a measurable difference in Core i7 performance in many cases. But some of those performance gains may have been the result of the higher L3 cache frequencies we used (see the table on page 1 there). To some extent, you’re simply forced to turn up the uncore clocks if you want to run faster memory with the Core i7. But if you own a Core i7 and want to tune it for absolutely optimal performance, you might want to push the uncore clock as high as possible, even if it means using lower QPI and memory multipliers to keep things stable.
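A quick sketch of that uncore-versus-memory relationship. The rule that Nehalem’s uncore must run at no less than twice the memory clock is our reading of the platform’s requirements, stated here as an assumption rather than a figure from this article:

```python
def min_uncore_mhz(ddr_rate_mts):
    """Minimum uncore clock (MHz) for a given DDR3 data rate, assuming
    the uncore must run at >= 2x the memory clock."""
    mem_clock = ddr_rate_mts / 2   # DDR: data rate is twice the clock
    return 2 * mem_clock

print(min_uncore_mhz(1066))  # fits under the 940's 2133 MHz uncore cap
print(min_uncore_mhz(1600))  # faster DDR3 starts demanding a faster uncore
print(min_uncore_mhz(2000))  # overclocked memory pushes it higher still
```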
