Simultaneous multithreading lands on the desktop
The big change with the new Pentium 4 isn’t actually its clock speed, although 3GHz sounds like a whopper of a number to most folks. Instead, the big news about the new P4 is its Hyper-Threading technology. The impetus behind Hyper-Threading is simple. Throughout the Pentium 4’s young life, it has been a relatively slow performer at a given clock speed. Just last month, in our last big CPU review, we saw AMD’s Athlon XP 2800+ running at 2.25GHz perform roughly as well as a Pentium 4 at 2.8GHz.
The Pentium 4 is relatively slow on a clock-for-clock basis because of its unusually deep, 20-stage main pipeline. The Pentium 4’s clock speeds can reach amazing heights because of this pipeline, too. So it’s a tradeoff. The P4 is by no means a poor performer, but it’s a little slow clock-for-clock.
Intel has taken a number of steps to improve the P4’s clock-for-clock (and overall) performance. Most notably, the company has raised the P4’s front-side bus speed and doubled the size of the L2 cache. Hyper-Threading, or simultaneous multithreading (SMT) as it’s known in the wider, non-copyrighted world, is yet another way to increase the average number of instructions a processor can execute per clock cycle, or instructions per clock (IPC). Simultaneous multithreading makes a single physical processor look like two logical processors, and in doing so, it keeps the CPU’s execution units busier. This isn’t symmetric multiprocessing (SMP), that creamy smooth goodness that comes from having multiple processors in a single system, but it essentially looks like it to operating systems and programs. As with SMP, software will have to be multithreaded in order to take full advantage of SMT.
The logic needed to make Hyper-Threading work adds only 5% to the Pentium 4’s die size, including duplicate copies of key resources necessary for maintaining two architectural states on one chip. Intel points out that’s not much extra real estate for an enhancement that can improve performance by as much as 30% in the right scenarios.
Hyper-Threading adds so little to the Pentium 4’s die size because it only requires physical duplicates of a small subset of the processor’s resources. Many other CPU resources, including the caches, registers, execution units, and scheduling queue, are shared, either through static partitioning (splitting ’em in two) or dynamic sharing. The most important shared resources are the processor’s execution units, where integer math, floating-point math, and load/store functions are handled. Execution stages in the deeply pipelined Pentium 4 are likely to be unused during some CPU cycles, and Hyper-Threading is intended to help keep the chip’s execution pipelines busier by exposing a second logical processor.
I would like to cover what gets shared in HT and why in more detail, but that’s another article altogether. If you want to understand the specifics of Intel’s Hyper-Threading implementation, let me recommend Jon Stokes’ article on the subject. He explores the complexities of adjudicating between logical CPUs competing for resources better than I can here.
Hyper-Threading’s resource sharing has the potential to sap performance in certain situations. Sharing the L2 cache between two logical processors means only half the cache space and bandwidth may be available to execute a given thread. We’ll examine this issue in more detail in our processor benchmarks here shortly.
What supports Hyper-Threading?
Only specific combinations of hardware and software will fully support Hyper-Threading. I’ll try to lead you through the maze the best I can.
First and foremost, you’ll need a Pentium 4 processor 3.06GHz, which is the only desktop processor to support HT. Interestingly enough, the logic necessary for Hyper-Threading has been present in Pentium 4 silicon since the first, original Pentium 4 “Willamette” chips arrived on the scene. However, Intel didn’t enable Hyper-Threading in those early P4 chips, preferring instead to introduce HT (pardon the abbreviation, but HT will have to do from here out) with its Xeon server chips.
Before you current P4 owners get too excited, Intel says it can disable access to Hyper-Threading “in the factory,” so your current P4 chips aren’t likely to sprout a second head any time soon. Also, Intel has taken the P4 through a number of revisions (or steppings) over time, and HT logic has been improved as that’s happened. So even if you can hack an older P4 into enabling HT, performance may not be all that great.
For now, Intel plans to keep Hyper-Threading exclusive to P4 chips at 3GHz and above, although I get the sense the company might change its plans if screaming hordes of Taiwanese mobo makers were to show up at its doors. Or if Michael Dell made a phone call.
Next, you’ll need a Hyper-Threading-aware operating system. To date, only Windows XP and Linux (kernel versions 2.4.18 and higher) are HT-aware. Some multithreaded operating systems, like Windows 2000, will run fine with multiple logical processors, but they don’t offer the performance benefits of an HT-aware OS. Microsoft has produced an interesting little white paper on HT support in WinXP, which explains the kernel tweaks needed for best performance. For instance, WinXP more aggressively executes the HLT instruction on an unused logical CPU. (The HLT instruction exists to tell processors to take a break for a while.) Doing so frees up shared resources for the other logical processor. In case you were wondering, while the Home edition of WinXP supports only one physical processor, it will support HT on that processor.
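For the Linux side of that equation, here is a minimal sketch of how one might check what the kernel reports. It assumes a /proc/cpuinfo layout with `flags`, `siblings`, and `cpu cores` fields, as found on current x86 Linux kernels; older kernels may not expose all of them, and the `summarize_cpuinfo` helper and `SAMPLE` text are made up for illustration. The `ht` flag alone isn’t conclusive, since it can also appear on multi-core chips without SMT, so the sketch compares siblings against cores as well.

```python
def summarize_cpuinfo(text):
    """Summarize the first processor entry of /proc/cpuinfo-style text."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    flags = fields.get("flags", "").split()
    # Missing fields fall back to 1, i.e. "no evidence of SMT".
    siblings = int(fields.get("siblings", "1"))
    cores = int(fields.get("cpu cores", "1"))
    return {
        "ht_flag": "ht" in flags,
        "siblings": siblings,
        "cores": cores,
        # More logical CPUs than cores in one package suggests SMT is active.
        "smt_active": siblings > cores,
    }


# Hypothetical sample text, not captured from a real machine.
SAMPLE = """processor : 0
flags : fpu vme sse sse2 ht
siblings : 2
cpu cores : 1

processor : 1
flags : fpu vme sse sse2 ht
siblings : 2
cpu cores : 1
"""

if __name__ == "__main__":
    print(summarize_cpuinfo(SAMPLE))
```

On a live Linux box, passing `open("/proc/cpuinfo").read()` instead of `SAMPLE` should give the same kind of summary for the first logical CPU.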
Finally, you’ll need a motherboard with Hyper-Threading support. Realistically, that means you need a mobo capable of running a P4 3.06GHz chip, so it will require a 533MHz front-side bus and the right voltage regulator. These requirements will eliminate many existing P4 mobos. Beyond that, you’ll need a BIOS capable of supporting Hyper-Threading. With BIOS-level support, users can turn HT on and off at will. Without BIOS-level support, HT won’t work. And you’ll need a chipset capable of supporting Hyper-Threading.
All of Intel’s current P4 chipsets and most of its past ones, with the exception of one stepping of the 845, support HT. With Taiwanese chipset makers VIA and SiS, the situation is somewhat murky. I asked VIA about it, and I got this response:
Hyperthreading support is enabled in BIOS. VIA P4 mainboards will support hyperthreading. There will be more official info about this coming out of VIA before the end of this week.
That sounds promising, and I expect we’ll know more soon. SiS’s story seems similar. I wouldn’t count on HT support with mobos based on current Taiwanese chipsets, but I wouldn’t count it out, either.
Hyper-Threading in action
Using a Hyper-Threading enabled system looks, to the user, like using a dual-processor box. Windows’ Task Manager shows a pair of CPUs, like so:
The picture above is fun, but it also illustrates an interesting phenomenon. The task you see taking up 50% CPU time is a Folding@Home distributed computing client, and as you can see, it’s oscillating between the system’s two logical processors, just as it does in SMP systems sometimes. The penalty associated with the required context switches here isn’t as great as it would be with SMP, because cache and other resources are shared between the logical CPUs. Still, I’d rather set the processor affinity for the Folding client and keep it nailed to one logical CPU. Like this:
By the way, the 50% CPU usage number you see in Task Manager doesn’t mean that 50% of the processor’s execution resources are still available for the second logical CPU. The OS only sees two logical CPUs, so it can’t give any deeper insight into what’s going on under the hood.
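The affinity setting described above was done through Task Manager on Windows XP. For anyone who would rather script the same idea, here is a rough sketch using Python’s `os.sched_setaffinity`, which is only available on Linux and some other Unix-like systems. The `pin_to_one_cpu` helper is a hypothetical name, and the sketch simply returns `None` on platforms without that call.

```python
import os


def pin_to_one_cpu(pid=0):
    """Pin a process to a single logical CPU and return its new affinity set.

    pid=0 means the calling process. Returns None on platforms where
    os.sched_setaffinity is not available (Windows, for example).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    allowed = os.sched_getaffinity(pid)
    # Pick the lowest-numbered logical CPU we are already allowed to use.
    target = min(allowed)
    os.sched_setaffinity(pid, {target})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # On an HT system the OS reports logical CPUs, not physical ones.
    print("logical CPUs seen by the OS:", os.cpu_count())
    print("affinity after pinning:", pin_to_one_cpu())
```

Pinning keeps the scheduler from bouncing the process between logical CPUs, but it does nothing to reserve execution resources, which remain shared inside the chip.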
A word about benchmarks
In order to test the P4 with Hyper-Threading, we’ve mashed together elements of our usual processor benchmark suite with some of our customary SMP tests. Using a whole range of tests allows us to compare the new P4 to its competition, and the non-multithreaded tests will show us any performance penalties associated with HT. The multithreaded tests give us a chance to measure Hyper-Threading’s benefits directly.
There are also some things benchmarks can’t capture, like the “creamy smoothness” that comes from having a second (logical or physical) processor available to handle user tasks while other processes run in the background. The difference in the user experience is, by definition, best measured subjectively, and I’ll talk more about that later on.
I should note, though, that Intel’s push for Hyper-Threading has already started paying dividends in the form of multithreaded benchmarks and multithreaded revisions of popular applications. As we’ve seen in the past, Intel has a lot of clout when it comes to these things, and I’d have to categorize the push for more multithreaded software as a Very Good Thing. Those of us who are devotees of SMP will benefit from this push at least as much as P4 owners will.
So what should you expect to see in the benchmark results? Well, if an application isn’t multithreaded, you may well see a slight performance drop with Hyper-Threading enabled (we tested with HT and without). If a benchmark is multithreaded, the performance benefits of HT should be obvious, if not overwhelming.
One more thing: We tested the P4 3.06GHz with both PC1066 RDRAM and DDR333 memory. RDRAM is Intel’s highest performance solution, and DDR333 is the solution everyone will actually buy. I considered nixing the RDRAM tests, but it’s interesting to see how the different memory subsystems affect Hyper-Threading’s effectiveness, so I’m glad I left them in.
Our processor testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least twice, and the results were averaged.
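The run-twice-and-average approach can be sketched in a few lines. This is only an illustration of the idea, not the harness we actually used; `average_runtime` and the toy workload are hypothetical.

```python
import time
from statistics import mean


def average_runtime(workload, runs=2):
    """Time a callable `runs` times; return (average seconds, all samples)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return mean(samples), samples


if __name__ == "__main__":
    # Toy workload standing in for a real benchmark.
    avg, samples = average_runtime(lambda: sum(range(1_000_000)), runs=2)
    print(f"average: {avg:.4f}s over {len(samples)} runs")
```

Keeping the individual samples around makes it easy to spot a run that disagrees badly with the others before trusting the average.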
Our test systems were configured like so:
| ||Athlon XP||Pentium 4 DDR||Pentium 4 RDRAM|
|Processor||Athlon XP 2200+ 1.8GHz, Athlon XP 2600+ 2.13GHz, Athlon XP 2800+ 2.25GHz||Pentium 4 2.53GHz, Pentium 4 2.8GHz, Pentium 4 3.06GHz||Pentium 4 3.06GHz|
|Front-side bus||266MHz (133MHz DDR) for the 2200+ and 2600+; 333MHz (166MHz DDR) for the 2800+||533MHz (133MHz quad-pumped)||533MHz (133MHz quad-pumped)|
|Motherboard||Asus A7N-8X (pre-release sample)||Intel D845PEBT2||Intel D850EMV2|
|Chipset||NVIDIA nForce2||Intel 845PE||Intel 850E|
|North bridge||nForce2 SPP||82845PE MCH||82850E MCH|
|South bridge||nForce2 MCP-T||82801DB ICH4||82801BA ICH2|
|Chipset drivers||2.77||Intel Application Accelerator 6.22||Intel Application Accelerator 6.22|
|Memory size||512MB (2 DIMMs)||512MB (1 DIMM)||512MB (2 32-bit RIMMs)|
|Memory type||Corsair XMS3200 PC2700 DDR SDRAM||Corsair XMS3200 PC2700 DDR SDRAM||Samsung PC1066 Rambus DRAM|
|Graphics||ATI Radeon 9700 Pro 128MB (Catalyst 7.76 drivers)|
|Sound||Creative SoundBlaster Live!|
|Storage||Maxtor DiamondMax Plus D740X 7200RPM ATA/100 hard drive|
|OS||Microsoft Windows XP Professional|
|OS updates||Service Pack 1|
Though it’s not listed above, we also ran some of the multithreaded benchmarks on an Athlon MP 2000+ system, based on MSI’s K7D Master-L, both with single and dual processors. That system was configured similarly to those above (same OS revisions, amount of RAM, hard drive, sound card, and the like), with the notable exception that the graphics card was a GeForce4 Ti 4600. Since this system is present primarily to illustrate the similarities and contrasts between Hyper-Threading and true SMP, the graphics card difference shouldn’t be a big issue. Just keep in mind how that system varied from the rest.
Thanks to Corsair for providing us with DDR333 memory for our testing. If you’re looking to tweak out your system to the max and maybe overclock it a little, Corsair’s RAM is definitely worth considering. Using it makes life easier for us as we’re dealing with brand-new chipsets and pre-production motherboards, because we don’t have to worry so much about stability and compatibility.
The test systems’ Windows desktops were set at 1024×768 in 32-bit color at an 85Hz screen refresh rate. Vertical refresh sync (vsync) was disabled for all tests.
We used the following versions of our test applications:
- Cachemem 2.6
- SiSoft Sandra Standard 2003
- Compiled binary of C Linpack port from Ace’s Hardware
- ZD Media Business Winstone 2002 1.0.1
- ZD Media Content Creation Winstone 2002 1.0.1
- POV-Ray for Windows version 3.5
- Sphinx 3.3
- LAME 3.92
- Xmpeg 4.5 with DivX Video 5.02
- MadOnion 3DMark 2001 SE Build 330
- Unreal Tournament 2003 demo benchmark
- Comanche 4 demo benchmark
- Quake III Arena v1.31
- Serious Sam SE v1.07
- SPECviewperf 7.0
- Kribibench 1.1
- Cinebench 2000
- picCOLOR NT 1.0 demo
All the tests and methods we employed are publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
First up are our memory tests. These synthetic tests are often enlightening, but they aren’t always a good indicator of overall real-world performance.
Sandra and Cachemem tell similar stories about memory bandwidth. These results are more or less what we’d expect to see from each of these platforms. Having Hyper-Threading enabled has little effect on memory bandwidth scores.
Memory latency also works out as expected. In access latency, RDRAM is slower than DDR memory, and the Athlon XP’s 266MHz bus running with 333MHz memory is slowest of all. Again, Hyper-Threading has no real effect, which is more or less expected.
Linpack shows us visually the performance, from left to right, of the L1 data cache, L2 cache, and memory subsystem for each system. At 3GHz, the Pentium 4 peaks higher than the Athlon XP 2800+. The P4 also peaks much later, courtesy of its larger, faster L2 cache.
Now we get to see some real-world benchmarks of this 3GHz hydra. This newly revamped version of Business Winstone ought to be more accommodating to SMP and SMT systems than previous revisions.
Here you can see a slight but real performance improvement with HT enabled. However, the Athlon XP, even the 2600+ model, outruns every flavor of Pentium 4 we tested.
Content Creation Winstone
Content Creation Winstone 2002 has always been friendly to SMP systems, so we expected good things from Hyper-Threading. However, the results weren’t so good…
Enabling HT knocks the new P4 down a couple notches. I expect this result would be different if more of the component applications in the CC Winstone suite were multithreaded. In fact, I’ve ordered the brand-new 2003 edition of CC Winstone, which ought to help in this regard, but our copy hasn’t arrived yet.
Still, this result shows that Hyper-Threading’s performance benefits aren’t universal.
LAME MP3 encoding
We used LAME 3.92 to encode a 101MB 16-bit, 44kHz audio file into a high-quality, variable-bit-rate MP3. The exact command-line options we used were:
lame -v -b 128 -q 1 file.wav file.mp3
Here are the results…
LAME isn’t multithreaded, and it shows. Nevertheless, the P4 3.06GHz comes out on top here, leapfrogging the Athlon XP 2800+ by just a few seconds.
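If you want to time the same encode on your own box, a small wrapper along these lines should do. It’s a sketch that assumes a `lame` binary is on your PATH; the flags are copied from the command line quoted above, while `lame_command` and `time_command` are hypothetical helper names.

```python
import subprocess
import time


def lame_command(wav_path, mp3_path):
    """Build the encode command using the options quoted in the article."""
    return ["lame", "-v", "-b", "128", "-q", "1", wav_path, mp3_path]


def time_command(cmd):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start


if __name__ == "__main__":
    # file.wav is a placeholder; point this at your own source file.
    seconds = time_command(lame_command("file.wav", "file.mp3"))
    print(f"encode took {seconds:.1f}s")
```

Wall-clock time is the right measure here, since the question is how long the user waits, not how much CPU time each logical processor logged.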
DivX video encoding
This ought to be fun. Xmpeg is indeed multithreaded, and it uses Intel’s SSE2 extensions on the Pentium 4, as well. I’ve included results from an Athlon MP system here, so you can see the relative benefits of both SMP and SMT.
For this test, we took a 279MB video file, encoded in MPEG2 format at DVD quality, and converted it to a 33MB DivX file. We used the “medium” quality/speed setting on the DivX encoder, and we turned off audio processing. Otherwise, all settings were left at their defaults.
That’s more like it. Here, enabling HT gives the P4 a nice performance boost. Let’s compare the relative performance increases with Hyper-Threading and our dual Athlon MP system.
Obviously, Hyper-Threading can’t do as much for performance as adding a second CPU, but a performance gain of 14% essentially for “free” is a worthy accomplishment.
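For the record, a percentage gain like that one is computed differently depending on whether the benchmark reports a rate (higher is better) or a time (lower is better). Here is a minimal sketch of both; the function names and the numbers in the usage lines are made up for illustration and are not our measured results.

```python
def gain_from_rates(baseline_rate, new_rate):
    """Percent gain when the score is a rate, e.g. frames per second."""
    return (new_rate / baseline_rate - 1.0) * 100.0


def gain_from_times(baseline_seconds, new_seconds):
    """Percent gain when the score is elapsed time, e.g. seconds to encode."""
    return (baseline_seconds / new_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical numbers: 100 fps without HT, 114 fps with HT.
    print(round(gain_from_rates(100.0, 114.0), 1), "% faster by rate")
    # Hypothetical numbers: 114 s without HT, 100 s with HT.
    print(round(gain_from_times(114.0, 100.0), 1), "% faster by time")
```

Both conventions treat the non-HT result as the baseline, so a positive number always means Hyper-Threading helped.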
Kribibench 3D rendering
Kribibench uses an SSE-aware, software-only rendering engine to generate some very nice looking 3D images. Since rendering is easily parallelizable, Kribibench is multithreaded, too. We’ve tested with a couple of different 3D scenes, just in case there are any big differences in terms of rendering workload between them.
In both cases, Hyper-Threading shows some impressive performance gains. The “jetfog.d” scene runs relatively better on the Pentium 4 systems than the “office.d” scene, but both run quite well compared to the Athlon XP.
The biggest story here, however, is the performance gain with Hyper-Threading.
We’re seeing nearly a 25% improvement in the “office.d” scene with HT enabled.
Cinebench 2000 3D rendering
Cinebench is another multithreaded rendering benchmark. This one can run in both single- and multithreaded modes, so we have single-threaded scores for the SMT and SMP systems.
In a moment of high drama, the Pentium 4 3.06GHz loses out to the Athlon XP 2800+, until Hyper-Threading is enabled and the P4 takes the lead. Of course, everything gets clobbered by the dually rig, but that’s not fair.
POV-Ray 3D rendering
Next up is that old staple of our test suite, POV-Ray. Perhaps I’m missing it, but I don’t believe even POV-Ray 3.5 is multithreaded. How a modern ray-tracing engine can avoid being multithreaded, I cannot fathom. Anyhow, here are the results from our usual scene render…
As ever, the Athlon XP comes out on top in POV-Ray, but the Pentium 4 3GHz is getting mighty close. If POV-Ray were multithreaded, that might put the P4 over the top.
Quake III Arena
Now for some 3D gaming benchmarks. Q3A is distinct in its multithreadedness, but Q3A’s multithreaded mode hasn’t worked very well for quite some time now.
Enabling HT doesn’t cause any significant performance difference, but using the game’s SMP mode slows things down. In this case, the fix is simple: don’t use the “r_smp 1” option. Most SMP system owners probably don’t anyway, simply to avoid crashes and other problems.
Once again, HT doesn’t affect performance significantly, but the clock speed boost to 3.06GHz lifts the Pentium 4 past the Athlon XP 2800+.
Serious Sam SE
Hyper-Threading actually seems to offer a minor speed gain in Serious Sam, but not quite enough to push the P4 to the top of the chart.
Comanche 4
Pentium 4 systems simply own Comanche 4, Hyper-Threading or no.
Unreal Tournament 2003
Unreal Tournament comes out much like the rest of our gaming tests, so I’ll take this opportunity to summarize our results. The good news for gamers is that Hyper-Threading doesn’t hurt performance, and if anything, it offers a nearly imperceptible performance boost (perhaps HT handles system overhead better). Any worries about resource sharing with HT slowing down games appear to be unfounded.
Also, with or without HT, the Pentium 4 3.06GHz is the fastest gaming CPU in all but one of our tests.
We’ve included SPEC’s viewperf suite of workstation-class graphics tests for completeness. I expected some of these tests to be multithreaded, but….
Only 3dsmax shows any daylight between the HT and non-HT scores, and even there, the differences are minor. Moving on…
Sphinx is a high-quality speech recognition routine that needs the latest computer hardware to run at speeds close to real-time processing. We use two different versions, built with two different compilers, in an attempt to ensure we’re getting the best possible performance.
There are two goals with Sphinx. The first is to run it faster than real time, so real-time speech recognition is possible. The second, more ambitious goal is to run it at about 0.8 times real time, where additional CPU overhead is available for other sorts of processing, enabling Sphinx-driven real-time applications.
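Those two goals boil down to a single ratio: processing time divided by the length of the audio. A quick sketch, with hypothetical function names and example numbers, of how the thresholds described above might be checked:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Ratio of processing time to audio length; below 1.0 beats real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def classify(rtf, headroom_target=0.8):
    """Map a real-time factor onto the two goals described in the text."""
    if rtf <= headroom_target:
        return "real time with headroom"
    if rtf < 1.0:
        return "real time"
    return "slower than real time"


if __name__ == "__main__":
    # Hypothetical example: 9 seconds of CPU work for 10 seconds of speech.
    rtf = real_time_factor(9.0, 10.0)
    print(rtf, classify(rtf))
```

The 0.8 default mirrors the more ambitious goal in the text; anything between that and 1.0 keeps up with speech but leaves little CPU time for anything else.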
Sphinx isn’t multithreaded and leans heavily on the memory subsystem, so it’s no surprise to see the results we do here. Nonetheless, the P4 3.06GHz with RDRAM produces some astounding numbers. Only the Athlon XP system with a 333MHz bus, the 2800+ system, can run with the Pentium 4 systems in Sphinx.
picCOLOR image processing
picCOLOR came to us via Dr. Reinert Mueller, who wanted us to test his image processing program on a dual-processor Athlon system for him. We’re glad we obliged him, because picCOLOR does a nice job testing common image processing-related functions with multiple threads. For this test, we’ve cleared off the table and just included results for an SMT system and an SMP system.
You can tell that some of the tests are multithreaded, while others aren’t. Let’s look at the relative performance gains.
Incidentally, in order to make this graph readable, I’ve not reported performance decreases, of which there were a few. Overall, the performance increases more than offset the few decreases.
In some cases where SMP helps a fair amount, Hyper-Threading offers no benefit. However, in a few cases, HT does quite well.
Hyper-Threading and resource sharing
Jon Stokes’ discussion of Hyper-Threading’s resource sharing arrangements gave me some ideas for testing, and I checked with Jon to see what he thought of them. The results below are my fault, but Jon gets credit for helping if you like them.
I decided to use Linpack, which can visually represent L1 and L2 cache size and performance, to illustrate HT’s division of the L2 cache between logical processors. In order to do so, I ran a Quake III Arena botmatch in a 640×480 window on the Windows desktop. Then, with Q3A running, I kicked off Linpack. The game’s “r_smp” variable was set to zero in all cases. Here’s what I found.
With Hyper-Threading enabled, L2 cache performance changes dramatically. The total cache bandwidth available is half what it is without HT. (Assuming the FPU being overworked isn’t the culprit.) Also, the Q3A + HT config peaks at about 192K matrix sizes, earlier than the Q3A + non-HT config. Effectively, the cache size is smaller because cache is being shared between two logical processors. This is just as Jon’s article predicted.
Also, that “hitch” in the Q3A + HT performance at around 270K matrix size is no fluke. What you’re seeing above is an average of three Linpack runs, but the individual Linpack runs all exhibited the same quirk:
I suspect this “hitch” shows us something about how Intel’s HT logic manages cache sharing, but I won’t venture a guess beyond that.
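To put those matrix sizes in perspective, here is some back-of-the-envelope arithmetic. It rests on a few assumptions I should flag loudly: that Linpack’s “matrix size” is the storage, in kilobytes, of a square matrix of 8-byte doubles, that this Pentium 4 has 512KB of L2 cache, and that an even two-way split is a fair first approximation of sharing. Intel doesn’t describe the cache as being divided that rigidly, so treat the even split as a simplification rather than a description of the hardware.

```python
import math

BYTES_PER_DOUBLE = 8  # assumption: Linpack matrices hold 8-byte doubles


def matrix_dimension(size_kb):
    """Largest n such that an n x n matrix of doubles fits in size_kb kilobytes."""
    return math.isqrt(size_kb * 1024 // BYTES_PER_DOUBLE)


def even_share_kb(cache_kb, logical_cpus=2):
    """Cache per logical CPU under a hypothetical even split."""
    return cache_kb / logical_cpus


if __name__ == "__main__":
    for size_kb in (192, 270):
        n = matrix_dimension(size_kb)
        print(f"{size_kb}K matrix is roughly {n} x {n} doubles")
    # 512KB L2 is an assumption about this chip, not a measured value.
    print("even share of a 512KB L2:", even_share_kb(512), "KB")
```

Under those assumptions, the numbers only give a sense of scale for where the peak and the hitch sit relative to the cache; they don’t explain the hitch.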
Before you worry too much about losing cache space and bandwidth with Hyper-Threading, though, read on. I showed these results to Jon, and he suggested turning the tables a bit:
I wonder how much Q3A actually benefits from the cache in the first place and is therefore affected by HT. I recall that the original Quake didn’t suffer too much a hit on the cacheless Celeron, because even when it has a cache it dirties the d-cache quite a bit. So if you have Q3A mostly dirtying the cache with data that it’s not going to reuse, and then you have Linpack trying to store matrices in the cache at the same time, then I would expect the Linpack performance to suffer much more than the Q3A performance. In other words, if you ran the same tests, but benchmarked _Q3A’s framerate_ rather than Linpack, my tentative hunch is that you’d see that Q3A’s performance degrades much slower from hyperthreading under the same conditions as Linpack.
I tested Q3A performance with Linpack running, both with and without Hyper-Threading enabled.
The results were just what we suspected. The effects of Hyper-Threading’s resource sharing mechanisms will vary greatly depending on the type of applications used.
Intel claims Hyper-Threading can offer real usability benefits of the same type we’ve always enjoyed from SMP. Here’s the scenario: The user is running a host of different applications at once: web browser, e-mail client, instant messenger, MS Office apps, and something nasty from Adobe like Acrobat. The something nasty from Adobe conspires with a Flash app in the web browser and several MS Office apps to chew up all the user’s CPU time, probably because something nasty from Adobe is in some kind of an unnecessary loop. The user’s PC slows to a crawl, nearly unresponsive, because no CPU time is available.
The creamy smoothness of SMP is just this: no slowdown in that scenario. Intel claims similar things for Hyper-Threading.
Unfortunately, I haven’t yet been able to decide whether and how much creamy smoothness Hyper-Threading is truly capable of delivering. Some of the benchmarks look promising, but making the subjective usability evaluation is more difficult. Truth be known, any 3GHz Pentium 4 system is so fast, it’s tough to contrive the kinds of slow-downs that one can “feel” subjectively. I’ll have to make a Hyper-Threaded PC my main system, full of MS Office apps, instant messaging programs, and nasty somethings from Adobe before I can make that call.
All in all, the Pentium 4 3.06GHz is the fastest PC processor available. The fastest Athlon XP system we tested, the 2800+ chip running on an nForce2 system, isn’t widely available yet, over a month after we first reviewed it. The Athlon XP 2800+ can challenge even the Pentium 4 3.06GHz for supremacy; which one is faster depends entirely on what you want to do. You’ve seen the results, so you can decide for yourself which would best suit your needs.
Personally, I’d probably pick the Pentium 4 over the Athlon XP, simply because I’m all over fast busses, memory, and graphics. I also like to play the occasional video game when I’m not working on a review (once a year), and the P4 is just a teensy bit faster in most gaming tests. But it’s very close.
That said, the significance of going from 2.8GHz to 3.06GHz is largely symbolic. And I don’t even know exactly what it symbolizes. The steady march of progress in PC processors is, to me, much more impressive than reaching the occasional round-number milestone. But then I’m too close to this stuff, I suppose. “3GHz” is just easier to digest than “2.2GHz, 2.4GHz, 2.53GHz, 2.8GHz….”
Hyper-Threading technology, on the other hand, is truly novel. Our tests have shown performance increases in line with the “up to 30%” claims you may be hearing from Intel, provided the applications are multithreaded and the workloads are easily parallelizable. This is a real-world performance gain that’s “free” for Pentium 4 buyers. Hyper-Threading’s benefits aren’t as universal as those of, say, a clock speed increase or a larger L2 cache, but they are compelling at times, especially if the user experience is truly enhanced by HT. At other times, Hyper-Threading is not a factor, and in rare situations, it causes slowdowns. As Intel’s push for Hyper-Threading progresses, I expect to see more and more multithreaded and HT-aware apps arrive. We’ve seen this pattern before with SSE2 enhancements. Over time, Hyper-Threading’s benefits should grow more pronounced.
Until then, if it causes problems, users can always turn Hyper-Threading off in the system BIOS, and they’ll still have one of the fastest PC processors anywhere.