The release of this new, faster Pentium 4 presents us with the opportunity to re-examine some of those issues, and to unearth some stones we’ve left unturned. New versions of many benchmark programs have become available recently, and we’ve updated our test suite to reflect that fact. Also, the market has changed in a number of ways since the Pentium 4 first launched. The question of whether to buy a Pentium 4 now is a little different than it was a few months ago. We’ll explore why that is.
We’re also going to take a look at how this new Pentium 4 performs on some tests you may not have seen used before. We think the results will be revealing.
We’ll be pitting the new P4 against the latest from AMD, the 1.33GHz Athlon. Both the 1.7GHz Pentium 4 and the 1.33GHz Athlon are minor revisions with higher clock speeds, not radically new processors. Although it may sound strange to put a 1.33GHz processor head-to-head with a competitor with a 400MHz advantage, we’ve already established that clock speed isn’t everything. The 1.2GHz Athlon laid a pretty good whuppin’ on the 1.5GHz Pentium 4 last time around.
The Pentium 4 has an unusual characteristic: it executes a relatively low number of instructions per clock, or IPC. The P4’s low-IPC design is part of the reason it’s able to achieve such high clock rates. Now, a processor having a relatively low IPC isn’t necessarily a bad thing. Nor is a high IPC necessarily a good thing, because a high-IPC chip may be limited in terms of clock frequencies. At the end of the day, what matters is performance, which is dictated by both a processor’s clock frequency and the number of instructions it can execute per clock. On that front, the Pentium 4 1.7GHz is appropriately matched against the 1.33GHz Athlon.
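To put rough numbers on that trade-off, here is a minimal sketch that treats performance as simply IPC multiplied by clock speed. The function names are ours, and the model ignores memory, caches, and everything else that matters in practice, so take it as back-of-the-envelope arithmetic only.

```python
# Simplified model: performance is proportional to IPC x clock frequency.
# This ignores memory, caches, and compiler effects; it is only arithmetic.

def relative_performance(ipc, clock_ghz):
    """Work done per unit time, in arbitrary units."""
    return ipc * clock_ghz


def ipc_ratio_to_match(slower_clock_ghz, faster_clock_ghz):
    """How much more work per clock the lower-clocked chip needs to break even."""
    return faster_clock_ghz / slower_clock_ghz


# Clock speeds taken from the article: 1.33GHz Athlon vs. 1.7GHz Pentium 4.
ratio = ipc_ratio_to_match(1.33, 1.7)
print(f"Athlon needs about {ratio:.2f}x the P4's IPC to match it")
```

Under this model, the Athlon has to do a bit more than a quarter again as much work per clock to keep pace with the higher-clocked chip.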
With that out of the way, let’s have a look at our contenders.
The only real change with the 1.7GHz Pentium 4 Intel supplied us for testing was a redesigned heatsink/fan combo. The two designs are very similar, with a big, copper block at the base and thin metal fins sprouting up off of that. The heatsink for the Pentium 4 1.7GHz, though, has shorter fins and a bigger fan, like so:
To keep this beast cool at higher clock speeds, Intel decided on more active cooling at the expense of some passive cooling. The two heatsink/fan units are the same height.
Many of the benchmark tests you’ll see below are revised versions of what we’ve used in the past here at TR. In some cases, they’re new versions of familiar programs like 3DMark. In others, like the POV-Ray rendering program, we’ve added recompiled binaries made from the same source code as the binaries we’ve used in the past. Either way, the fact these tests are new is important, because in either case, the tests have been compiled using newer compilers. The tests like POV-Ray, where we compare older binaries to new ones, ought to demonstrate the potential benefits of recompiling older code to accommodate newer processor designs. Generally, there are two broad reasons why recompiling can help performance.
Newer is better
First, newer compilers better optimize code for newer processors. With sophisticated out-of-order instruction execution capabilities and the like, the latest x86 CPU designs stand to benefit greatly from friendlier code.
The Pentium 4, for instance, has a 20-stage branch prediction/recovery pipeline, twice the depth of the Athlon’s or Pentium III’s. This pipeline executes instructions speculatively by attempting to anticipate what the program will request next. Get a prediction right, and the results are available almost instantly. Get it wrong, and the results have to be discarded, which takes time. A deeper pipeline carries with it a heavier penalty for a branch misprediction. Better code can help improve a processor’s efficiency by reducing branch mispredictions.
Well, that’s one of many reasons newer compilers help. We’ll leave the rest to the hard-core processor geeks; suffice to say that better code runs faster. Surprisingly, a lot of the executable programs out there are really better suited to a 386, 486, or Pentium processor than they are to an Athlon or a Pentium 4.
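The misprediction penalty described above can be sketched with a simple cost model. Every number here is hypothetical: we assume the penalty equals the pipeline depth (20 cycles versus 10), that one instruction in five is a branch, and that the base cost is one cycle per instruction. Real processors are far messier.

```python
# Toy cost model for branch mispredictions. All inputs are illustrative
# assumptions, not measurements of any real processor.

def cycles_per_instruction(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


BRANCH_FRACTION = 0.2   # assumed: one instruction in five is a branch
DEEP_PENALTY = 20       # assumed: penalty equals a 20-stage pipeline
SHALLOW_PENALTY = 10    # assumed: penalty equals a 10-stage pipeline

for label, rate in (("unfriendly code", 0.05), ("friendlier code", 0.02)):
    deep = cycles_per_instruction(1.0, BRANCH_FRACTION, rate, DEEP_PENALTY)
    shallow = cycles_per_instruction(1.0, BRANCH_FRACTION, rate, SHALLOW_PENALTY)
    print(f"{label}: deep pipeline {deep:.2f} CPI, shallow pipeline {shallow:.2f} CPI")
```

In this model, cutting the misprediction rate saves twice as many cycles on the deeper pipeline, which is one way to see why friendlier code matters more to a chip like the Pentium 4.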
The power of alphabet soup
Second, newer x86 processors can execute new instructions designed to improve efficiency even further. Intel and AMD marketing types have given these new sets of instructions names like MMX, SSE, 3DNow!, and SSE2. Most of these new instructions employ a technique called SIMD, for “single instruction multiple data,” to perform a single mathematical operation on multiple chunks of data at once. Use these instructions in the right situation (not every situation is right for SIMD) and even an old K6-2 becomes a number-crunching monster.
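For the curious, here is a conceptual sketch of the SIMD idea. It is plain Python pretending to be hardware, not real SIMD code: the "instruction" counter and the four-lane width are modeling assumptions (four matches the number of single-precision values an SSE register holds).

```python
# Conceptual model of SIMD: one "instruction" operates on several values.
# This is ordinary Python that only counts simulated instructions.

def scalar_add(a, b):
    """Add two lists one pair at a time: one simulated instruction per pair."""
    out, instructions = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        instructions += 1
    return out, instructions


def simd_add(a, b, lanes=4):
    """Add two lists `lanes` pairs at a time: one simulated instruction per group."""
    out, instructions = [], 0
    for i in range(0, len(a), lanes):
        out.extend(x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes]))
        instructions += 1
    return out, instructions


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
scalar_result, scalar_count = scalar_add(a, b)
simd_result, simd_count = simd_add(a, b)
print(scalar_count, "scalar instructions vs.", simd_count, "SIMD instructions")
```

Both versions produce the same sums; the SIMD version just gets there in a quarter of the simulated instructions, provided the data lines up neatly in groups like this.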
To review, both the Athlon and Pentium 4 can execute MMX instructions, which are oriented toward integer math and thus not terribly thrilling. The Athlon uses 3DNow! to handle floating-point SIMD math, and the Pentium 4 uses both SSE and SSE2. SSE2 is the newest set of SIMD extensions on the block, and it’s one of the Pentium 4’s biggest potential advantages. SSE2 can work on double-precision (64-bit) floating-point values, where 3DNow! and SSE are limited to single precision (32-bit), so it’s quite a bit more useful. For certain types of tasks, such as streaming video encoding or real-time 3D rendering, SSE2 could allow the P4 to whup up on the competition. Maybe.
Of course, all of these newer instructions require recompiling applications to take advantage of them. And, in many cases, a recompile alone doesn’t help much; programs often need to be heavily tweaked or rewritten to take advantage of SIMD instructions.
The truth about optimizations
Since the Pentium 4’s launch, the Athlon has been beating out the new Intel regularly in most benchmark tests. Almost just as regularly, Intel has claimed that recompiled binaries, newer versions of applications, and SSE2 optimizations would help the P4’s performance considerably. They’re right, but it’s not that simple. The Athlon stands to benefit from newer compilers and SIMD optimizations, too.
Also, the usefulness of such optimizations is limited. In reality, an awful lot of applications will make use of older, less efficient code for years to come, just because no one will bother to optimize or recompile them. Intel has its own compilers that are pretty good at making things run faster on its processors, but tools from Microsoft and other companies are much more widely used. There are reasons why this is the case, and we’ll touch on a few of them below. Finally, as we’ve noted, SIMD extensions are of limited use, and they require extra work to implement.
Then there’s the issue of a processor’s performance profile, as I will call it. It may well be true that the P4 will gain more from recompiled code than an Athlon will, but the sword cuts both ways. If that’s true, one could argue that the Pentium 4 simply does a poorer job executing legacy code. This is an especially tricky subject when it comes to benchmarks, since both Intel and AMD take an active interest in seeing their processors do well on commonly used performance tests. However, not every optimized piece of code you see spitting out numbers in a benchmark test accurately reflects the sort of code your processor may encounter in daily use.
That’s a lot of considerations to keep in mind, and we’re just scratching the surface of a very complex issue. Hold tight, and we’ll consider these things as we go.
As ever, we did our best to deliver clean benchmark numbers. All tests were run at least twice, and the results were averaged.
The Pentium 4 system was built using:
Processor: Intel Pentium 4 processors – 1.5GHz and 1.7GHz
Motherboard: Intel D850GB – Intel 850 chipset – 82850 memory controller hub (MCH), 82801BA I/O controller hub (ICH2)
Memory: 256MB PC800 DRDRAM memory in two 128MB RIMMs
Video: NVIDIA GeForce 2 Ultra 64MB (Detonator 3 version 6.50 drivers)
Audio: Creative SoundBlaster Live!
Storage: IBM 75GXP 30.5GB 7200RPM ATA/100 hard drive
Our comparison systems varied only with respect to the motherboard, memory, and CPU. The Athlon DDR box looked like this:
Processor: AMD Athlon processors – 1.2GHz and 1.33GHz on a 266MHz (DDR) bus
Motherboard: Gigabyte GA7-DX motherboard – AMD 761 North Bridge, Via VT82C686B South Bridge
Memory: 256MB PC2100 DDR SDRAM in two 128MB DIMMs
Both systems were equipped with Windows 2000 SP1 with DirectX 8.0a. We used the following versions of our test applications:
- SiSoft Sandra Standard 2001.3.7.50
- Compiled binary of C Linpack port from Ace’s Hardware
- ZD Media Business Winstone 2001 1.0.1
- ZD Media Content Creation Winstone 2001 1.0.1
- LAME 3.70
- SPECviewperf 6.1.2
- ps5bench 1.1 Intermediate
- Adobe Photoshop 6.0.1
- POV-Ray for Windows version 3.1g (multiple compiles)
- 3DMark 2001 Build 200
- Vulpine GLMark 1.1
- Quake III Arena 1.27g with Team Arena Mission Pack
- Serious Sam v1.00
- Expendable Internet demo
- ScienceMark 1.0 (multiple compiles)
The test systems’ Windows desktop was set at 1024×768 in 32-bit color at a 75Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests. The 3D gaming tests used the default or “normal” image quality settings, with the exception that the resolution was set to 640×480 in 32-bit color.
All the tests and methods we employed are publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
Memory performance is a fun place to start, because the differences between the P4 and, well, pretty much everything else are so apparent here. Also, the Linpack graph is just so sophisticated looking.
As always, newbies will probably be a little confused about how to read this graph. Let’s borrow profusely from my Pentium 4 review, because this passage describes the above graph perfectly. (Well, that and I’m lazy.)
Linpack performs floating-point operations on a range of data matrices, and the resulting line graph shows the strengths and weaknesses of each processor. The Athlon shows its floating-point prowess by offering the highest peak performance with a relatively small data set. But once we reach about 192K, the Pentium 4 has a pronounced lead. Its 256-bit data path to its L2 cache, combined with a very smart L2 cache controller, helps put the P4 on top. Note that the Pentium III Coppermine, which also has a 256-bit L2 cache interface, has a similarly shaped curve, and peaks at about the same place as the P4. Both Intel processors start to drop off sharply at about 256K, while the Athlon hangs on until it reaches about 320K. Here you can see the Athlon’s exclusive L2 and L1 caches working together. Because the Athlon’s L2 cache doesn’t replicate the contents of its 64K L1 data cache, its total effective cache size is larger than either of the Intel processors. (The Athlon also has a 64K L1 instruction cache.)
Once we get to those sharp, downward curves, we’re accessing main memory to perform the calculations. And once that happens, the Pentium 4’s fast front-side bus and dual RDRAM channels kick into high gear. The Pentium 4 delivers well over twice the sustained performance of the PC133 SDRAM-based Athlon system with larger data sets, and it crushes the Pentium III, as well. A very impressive showing.
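The cache arithmetic behind those drop-off points can be sketched in a few lines. The 64K L1 data cache figure comes from the passage above; the 256K L2 size for both chips and the 8K L1 data cache for the Pentium 4 are our assumptions, chosen because they line up with where the curves fall off.

```python
# Effective data cache capacity for exclusive vs. inclusive designs.
# L2 sizes (256K) and the P4's 8K L1 data cache are assumptions; the
# Athlon's 64K L1 data cache is the figure quoted in the text.

def effective_cache_kb(l1_data_kb, l2_kb, exclusive):
    """Distinct data the caches can hold together, in KB.

    Exclusive: L2 never duplicates L1, so the capacities add up.
    Inclusive: L1 contents also live in L2, so L2 sets the limit.
    """
    return l1_data_kb + l2_kb if exclusive else l2_kb


athlon_kb = effective_cache_kb(l1_data_kb=64, l2_kb=256, exclusive=True)
pentium4_kb = effective_cache_kb(l1_data_kb=8, l2_kb=256, exclusive=False)
print(f"Athlon: {athlon_kb}K effective, Pentium 4: {pentium4_kb}K effective")
```

Under those assumptions the totals come to 320K and 256K, which is about where each processor's Linpack curve heads south.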
So the Athlon does well with its L1 and L2 caches, but the P4 romps once we have to go out to main memory. I’ve taken the liberty of including a Pentium III and an Athlon/KT133A here for comparison. Both of those systems use PC133 SDRAM, so they’re slower accessing main memory than either the RDRAM or DDR SDRAM systems. Sandra’s modified version of the Stream benchmark measures memory bandwidth a little bit differently than Linpack, but the results are similar to what happens with larger data sets in Linpack:
Again, the P4 outperforms the Athlon by a solid margin. I should note here that the Pentium 4’s ability to perform so well on these tests isn’t just the product of its dual RDRAM channels. Yes, the P4’s memory subsystem offers a lot of bandwidth, but the processor is able to take advantage of it. By contrast, the Athlon’s PC2100 DDR SDRAM theoretically doubles memory bandwidth over PC133, but the processor doesn’t seem to be able to make use of it. We don’t know for sure yet, but many folks are speculating that the Pentium 4 could make better use of DDR SDRAM than the Athlon does.
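For reference, the theoretical peak numbers behind these memory types work out as bus width times transfer rate. This is a rough sketch: the bus widths and transfer rates below are the nominal figures for each memory type as we understand them, not anything measured on our test systems, and real-world throughput is always lower.

```python
# Theoretical peak memory bandwidth = bytes per transfer x transfers per second.
# Widths and rates are nominal figures (assumptions), not measurements.

def peak_bandwidth_mb_s(bus_width_bits, mega_transfers_per_sec, channels=1):
    """Peak bandwidth in MB/s for a given bus width, transfer rate, and channel count."""
    return bus_width_bits / 8 * mega_transfers_per_sec * channels


pc133 = peak_bandwidth_mb_s(64, 133)                     # PC133 SDRAM
pc2100 = peak_bandwidth_mb_s(64, 266)                    # PC2100 DDR SDRAM
dual_pc800 = peak_bandwidth_mb_s(16, 800, channels=2)    # two PC800 RDRAM channels

for name, value in (("PC133", pc133), ("PC2100", pc2100), ("Dual PC800", dual_pc800)):
    print(f"{name}: {value:.0f} MB/s peak")
```

On paper, PC2100 doubles PC133 and the dual RDRAM channels offer about half again as much as PC2100. The benchmarks show how much of that each processor actually puts to use.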
From those wildly theoretical memory bandwidth measurements, let’s jump into the gray, musty cubicle of corporate desktop PC use. Business Winstone uses word processors, spreadsheets, and web browsers to test a system’s performance.
Even the new 1.7GHz Pentium 4 can’t catch the 1.2GHz Athlon here. It’s hard to believe 1200 Athlon megahertz is better than 1700 Pentium 4 megahertz, but it’s true. Truth is, either CPU will be just fine for running WordPerfect or Excel, but the Athlon is undoubtedly faster.
Content Creation Winstone 2001
The other Winstone we’ll take a look at is more oriented toward web designers, graphic artists, and multimedia mavens. This test loads up a number of programs (Macromedia Director, Adobe Premiere, Photoshop, and others) at once, then proceeds to test them all together.
Once again, the Athlons win by a hair.
POV-Ray 3D rendering
These tests of the POV-Ray 3D ray-tracing program should be interesting, because we were able to find recompiled versions of the program online here. As you might imagine, POV-Ray leans heavily on a processor’s ability to handle floating-point math.
Steve Schmitt’s recompiled POV-Ray comes in two flavors: “PIII” and “P4”. Both were produced with Intel C v. 5.0. The “PIII” version doesn’t use any instructions proprietary to Intel processors or to the PIII; it runs just fine on the Athlon and the P4. The “P4” version uses a small bit of SSE2 code, but it doesn’t take advantage of the P4’s SIMD capabilities. Steve writes:
I recompiled again today, but this time with the code vectorizer diagnostic level turned on, and the disappointing but suspected result was that not ONE of the loops in povray were successfully vectorized (uses SIMD). It will be interesting to see what effect this lack of vectorization has on the P4 scores. I suspect that povray is the rule rather than the exception in this; getting code to vectorize usually requires a lot of planning and effort, and isn’t a simple “drop-in” operation. However, lack of vectorization doesn’t mean that SSE2 won’t help; it has scalar opcodes as well, and is a much cleaner way to use the FPU than the old x87 ops. After making sure that the code is indeed set to optimize ONLY for the P4, the executable faults on my Athlon, signifying that it is indeed using SSE2 or other P4-only features.
I’ve indicated which version of POV-Ray was used in the graphs below next to the processor/speed labels, so it should be easy to track.
The Athlon comes out on top, but the 1.7GHz Pentium 4 with its specially optimized version is very close behind. Both CPUs benefit more from new code than they do from higher clock frequencies. Also interesting is the difference between the two scenes rendered. ntreal.pov is a simpler scene that doesn’t include any reflectivity, so serious ray-tracing doesn’t kick into effect. With chess2.pov, though, the Athlon 1.33GHz with the original version of POV-Ray manages to beat out the Pentium 4 1.5GHz running the recompiled version.
Although ps5bench was, as the name implies, originally intended to run on Photoshop 5, it executes perfectly on Photoshop 6, as well. Our first attempt to use Photoshop 6 with ps5bench exposed some nasty performance problems. Fortunately, version 6.0.1 remedies these problems, and ought to give the Mac freaks something new and different to chew on.
With version 6.0.1 and these new clock frequencies, the Photoshop race tightens up. In our previous tests with version 5.5, all of the Athlon systems beat the Pentium 4 boxes consistently. Not so now. Break it out into individual filters, and you get one of the more intimidating graphs around:
You can make of that what you will. I will say one thing: you don’t see too many wild discrepancies here between the two processors, even though these filters are doing a lot of different types of calculations. Yes, the clock speeds vary a lot, but these processors’ performance balances are very similar.
A beta version of LAME is now available with 3DNow! support, but we chose to stick with release 3.70 to keep the footing equal and to measure only FPU performance with this test.
The 1.7GHz Pentium 4 closes to within a second of the 1.2GHz Athlon, but the AMD processors are simply faster here.
SPECviewperf workstation graphics
Now on to our first real-time 3D graphics test. Unlike our gaming tests below, SPEC’s viewperf concentrates on the kinds of graphics done on a 3D workstation: CAD/CAM work, 3D modeling, and tasks that use lots and lots of polygons.
As we’ve seen before with viewperf, the different CPUs vary in their performance, but overall, it’s a toss-up.
The Team Arena expansion pack for Quake III presents much more of a challenge for a fast system than the original game. With more polygons and larger outdoor areas, it doesn’t run nearly as fast.
The Pentium 4’s performance in Quake continues to impress. The Athlon is simply outclassed here, one of the few places that’s even remotely the case.
Serious Sam OpenGL gaming
A newcomer to our test suite is Serious Sam, a very cool game developed by a group of guys in Croatia. The 3D graphics engine in this game is second to none, and it’s a heckuva lot of fun, too. Serious Sam uses OpenGL to generate its eye candy, and its benchmarking functions make things easy for us.
Unlike that other OpenGL 3D shooter, Serious Sam runs faster on an Athlon. It’s close, though.
Expendable Direct3D gaming
Now to a Direct3D test. Let me interject here that there are entirely too few quality Direct3D games with decent benchmarking functions. Rage’s third-person shoot ’em up will do, even if it is a little old.
The Pentium 4 has some trouble keeping up here. It’s almost a mirror image of the Quake III results.
The latest version of MadOnion’s 3D benchmarking suite is very much oriented toward DirectX 8 and future applications. That makes it an interesting test, but we suspect our performance was severely limited by the demands the program put on our GeForce2 Ultra card.
The Intel processors pull ahead here, which is no great surprise. Intel processors have traditionally scored highest on 3DMark until very recently, when Athlons briefly pulled ahead. The new 2001 revision of 3DMark rectifies that situation. Frankly, I’ve got to think MadOnion works a little more closely with Intel than they do with AMD, especially since Intel helps support BapCo/MadOnion. 3DMark is still a fine graphics test, but I’m not sure it’s giving AMD and Intel processors even footing.
Vulpine GLMark
Vulpine GLMark, a new competitor to 3DMark, plays through a series of scenes rendered in real time. It uses a number of the advanced 3D features included in DirectX 8, but it accesses them via OpenGL, instead.
The results are very close, with the processors splitting the spoils according to clock speed.
One of the most important new entries in our test suite is Tim Wilkens’ impressive ScienceMark. What is ScienceMark all about? Well, take a look at this description of the Primordia test, which is one component of ScienceMark:
This code solves for the Hartree-Fock ( Orbital Restricted, Spin-Restricted, and Un-Restricted ) of any atoms NON-RELATIVISTICALLY. It solves for the Radial components of the Atomic Orbitals SOLUTION = R(r) * Ylm(theta,phi)
assuming the Spheical Orbitals are fixed. In the benchmark you are solving for the 33 orbital restricted solutions to R(r) on 4000 grid points. A rather daunting task.
Indeed. My head hurts already. But you get the idea. ScienceMark is about serious physics equations, and in many ways such calculations are the real litmus test for a general-purpose microprocessor. If a processor can competently crunch through math problems about the fluid dynamics of liquid argon, it ought to be good at most things. If you want to know more about ScienceMark’s individual tests, go here.
ScienceMark is unique because it runs a series of tests, then spits out a composite overall score, kind of like 3DMark. ScienceMark’s creator, Tim Wilkens, was kind enough to help us out with our testing by supplying multiple versions of the test compiled with different compilers. The original ScienceMark 1.0 was compiled using Compaq’s Visual Fortran (CVF). Tim has been working with Intel to get ScienceMark compiled and working with Intel’s VTune compiler, which promises much better performance in some cases, especially with SSE2 on the Pentium 4. However, working with VTune hasn’t been easy. Here’s what Tim had to say about it:
Ask yourself if the Intel compiler is up to spec. I’ve found 2 bugs and 1 perf issue which have had to be rectified so as to make a fully functional ScienceMark binary optimized for the P4. In the process of fixing these issues I’ve recieved 4-5 compilers from Intel.. each successively more functional than the previous. Intel is really working hard on their compilers and their support staff has given me excellent assistance. They’re a great bunch of people there. But for SSE2 to become widespread and prevalent throughout the hardware community, a compiler that’s “robust” must be available.. and this is simply not the case. Yet. I can’t help but wonder.. how many applications have been released.. that are required to show the power of the “netburst” architecture of the P4 in 2001. I also wonder how many of these applications have been assembly optimized for the P4 without a second thought to optimizing for the Athlon. ScienceMark is an attempt to provide an unbiased guide to the developer as to the state of affairs in compiler development. The source has NO CPU SPECIFIC assembly instructions. It also is well optimized and achieves rather high throughput on the Athlon and PIII.
Nevertheless, Tim was able to supply us with some VTune-generated ScienceMark binaries for testing. In all, we’ve tested four different ScienceMark executables: two each from CVF and VTune. It breaks down like this:
- A version compiled with CVF and optimized for the Pentium III (and 4).
- A version compiled with CVF and optimized for the Athlon.
- A version compiled with Intel VTune and optimized for the Pentium 4.
- A version compiled with Intel VTune and optimized for the Athlon.
Unfortunately, the VTune versions of ScienceMark wouldn’t complete all the tests and produce an overall ScienceMark score. Tim did get it working just before we went to press, but there wasn’t time to re-test everything. Instead, we’ve included VTune results in a couple of places, so you can compare them with the CVF results. First, let’s see how the different processors fared overall..
Just like a lot of our tests above, the Athlons come out ahead, but the 1.7GHz P4 is very closely matched to the 1.2GHz Athlon. Funny how that works.
Intel’s compiler delivers quite a bit better performance here. The Pentium 4 gains more than the Athlon, but the Athlon 1.33GHz still manages to stay on top. As we saw with POV-Ray, changing the compiler can help even more than upgrading the hardware or raising the clock speed. Now on to the Primordia test, which, uhhh.. has something to do with “Atomic RHF Promethium.” It’s all very scientific.
This time around, the Intel compiler produces much slower results. The Pentium 4s beat the Athlons outright in the original CVF version, but VTune manages to slow down the Pentium 4 chips even more than it does the Athlons. This is one of the reasons why Tim Wilkens has expressed frustration with the state of Intel’s compiler. Taken together, SSE2 and a good compiler could make for some noteworthy gains in scientific computing, where faster computation could shave months or years off a large calculation. But at this point, the tools just aren’t ready yet. Of course, such realities spill over into everyday desktop computing. The sad truth is that a whole lot of programs don’t take advantage of a new processor’s power nearly as well as they should. (Insert small advertisement for Open Source software and tools here.)
That said, ScienceMark is an especially useful tool for processor benchmarking because of Tim’s diligence in providing new binaries and providing the best possible optimizations for the processors being tested. It’s helped illustrate a point for us here, and we’ll come back to it in the future as the code, compilers, and hardware progress.
Running this suite of tests on these processors has demonstrated a couple of things worth noting about how these CPUs perform. First and foremost, it’s clear the Athlon 1.33GHz is still the big dawg of PC processors. It’s easily the fastest x86-compatible CPU around. Intel’s new entry, the 1.7GHz Pentium 4, performs about like a 1.2GHz Athlon in most situations.
Not that there’s anything wrong with that.
In fact, this little exercise has finally taught me something the Intel guys have been trying to pound into my head for a while now: the Pentium 4’s performance balance is pretty darn good. By that I mean it handles a variety of types of math, including integer, floating point, and SIMD, equally well (more or less). In my original Pentium 4 review I echoed some sentiments I’ve heard in a number of places before and since, that the P4’s FPU isn’t very good. Truth is, the Pentium 4’s balance between integer and floating-point performance is very, very similar to the Pentium III’s. And it’s not far from the Athlon’s, either. Sure, the processor executes a relatively low number of instructions per clock, but the P4’s floating-point units aren’t especially bad in this respect, even without the help of SSE or SSE2.
Finally, our tests have shown pretty clearly that the Athlon does a better job running legacy code. Both the P4 and Athlon benefit from the use of newer compilers, and I can’t really give the edge to either processor on this front. But the Athlon is much more resilient when code isn’t terribly friendly. That’s a good trait for any processor to have, but it’s especially vital to a non-Intel CPU that has to survive in an Intel-dominated world. No doubt Intel will continue to push for new code optimizations by improving its compilers and by using its considerable influence in the industry, so the P4 will have a leg up going forward. However, the Athlon has the particularly pleasant advantage of being more comfortable running whatever code you throw at it.
Intel’s pricing bombshell
The sweetest part of this new processor from Intel, besides being able to tell your friends you have a 1700MHz system, is the price: three hundred fifty-two American dollars. (That’s US$352, kids.) The other Pentium 4 speeds will fall in line below that. That’s much more reasonable than the initial P4 pricing was, and it’s more in line with the processor’s performance, too.
Whether or not the 1.7GHz Pentium 4 is a good value at that price is another question. Athlons are still cheaper, and they don’t require RDRAM. Currently RDRAM is about four times the price of PC133 SDRAM and about twice the price of DDR SDRAM. But heck, RDRAM is still under a dollar per megabyte, so buying or building a Pentium 4 system might not even chew up this year’s entire tax return, if you played your cards right with Uncle Sam.