Before we dive into the test results, let's have a quick review of what makes the Pentium M unique. The quick-and-dirty line on the Pentium M is that it's a Pentium III core mated to a Pentium 4 bus, and that's not entirely inaccurate. However, the Pentium M is much more than just that.
Yes, it is based on the Pentium III, or more properly, the P6 core that started out in the Pentium Pro processor, which evolved into the Pentium II and then Pentium III. And the Pentium M does use essentially the same bus protocol as the Pentium 4, quad-pumped and everything. But the Pentium M has been extensively modified for better performance, higher clock speeds, and lower power consumption. In fact, the Pentium M's main pipeline is somewhat longer than the 10 stages in the original P6 core, although Intel is coy on exactly how many stages are involved. The number is probably closer to the 12 stages in the Athlon 64 than to the 20 stages in the original Pentium 4 Netburst architecture or the 31 stages in the P4 Prescott. Other factors aside, longer pipelines generally mean higher clock speeds and lower clock-for-clock performance. As we'll see, the Pentium M hits clock speeds similar to the Athlon 64 and delivers comparable performance at those speeds.
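To put rough shape on that pipeline tradeoff, here's a back-of-the-envelope sketch in Python. It uses the textbook approximation that each branch misprediction costs about one pipeline's worth of cycles. The function names and the workload figures (20% branches, 5% mispredicted, base CPI of 1.0) are assumptions picked for illustration, not measurements of any of these chips, and the model ignores everything else that differs between the designs.

```python
# Toy model: effective cycles per instruction (CPI) as a function of pipeline depth.
# All numeric inputs are illustrative assumptions, not measurements of any real CPU.

def effective_cpi(base_cpi, branch_fraction, mispredict_rate, pipeline_stages):
    """Return CPI including the average cost of branch mispredictions.

    Assumes each misprediction costs roughly one pipeline's worth of cycles,
    which is a simplification: real penalties depend on where branches resolve.
    """
    penalty_cycles = pipeline_stages
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def relative_ipc(pipeline_stages, reference_stages=10):
    """IPC of a design relative to a reference depth, under identical assumptions."""
    # Hypothetical workload: 20% branches, 5% of them mispredicted, base CPI of 1.0.
    args = dict(base_cpi=1.0, branch_fraction=0.20, mispredict_rate=0.05)
    ref = effective_cpi(pipeline_stages=reference_stages, **args)
    cur = effective_cpi(pipeline_stages=pipeline_stages, **args)
    return ref / cur


if __name__ == "__main__":
    # Pipeline depths mentioned in the text: P6, Athlon 64, Pentium 4, Prescott.
    for stages in (10, 12, 20, 31):
        print(f"{stages:2d} stages: relative IPC {relative_ipc(stages):.3f}")
```

In this simplified model, clock-for-clock throughput only falls as stages are added, so a deeper pipeline has to make up the difference with clock speed.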
The Pentium M we're playing with here is actually the second generation of Pentium M, code-named Dothan. (Our review of the original Pentium M "Banias" core is here.) Dothan is manufactured on Intel's 90nm fab process, and it packs a healthy 2MB of L2 cache RAM onboard (along with the corresponding logic for prefetching data into the cache). That's in addition to a 64KB L1 cache split evenly between data and instruction caches. Thanks to the die shrink, Dothan's 140 million transistors are packed into a die that's only 84mm², nearly the same size as the original Pentium M Banias core, which had only 1MB of L2 cache. Compare that, if you dare, to the P4 Prescott's 122mm² die size, or the massive 192mm² die of the 130nm Athlon 64. The 90nm Athlon 64 "Winchester" also has an 84mm² die, but that chip has only 512KB of L2 cache. I don't have the exact numbers, but I believe 90nm Opterons with 1MB of L2 are expected to be about 100mm².
The impressive thing about the Pentium M is that the entire processor core was designed, massaged, and tweaked in order to cut down on the amount of power it required. Intel's Israel-based design team used extensive statistical analysis in order to guide its decisions in making tradeoffs between performance and power consumption, and the Pentium M CPU is the result of that process. That's not to say that the Pentium M is full of compromises that harm performance. To the contrary, some of the very best types of power optimizations are performance enhancements, because getting work done in fewer CPU cycles can save power. Also, the Pentium M team didn't lean too aggressively toward saving power because the CPU is only a small part of overall system power consumption in a laptop, where things like the hard drive and LCD display can dominate the battery life equation. For these reasons, the Pentium M may very well make good sense as a desktop processor, even when raw performance is one of the user's primary concerns.
Intel has produced some very informative papers on the Pentium M's design, and I can't go into too much depth about such things here, but I would encourage you to read them if you would like more info. There's one on power savings and another on microarchitecture and performance. I will give you the highlights, though, of some of the changes made to increase the Pentium M's performance and power efficiency. Among them:
- Dynamic clock gating: Dynamic clock gating is, essentially, the ability to turn off unused portions of a chip and turn them back on as needed. Doing so requires extra logic inside the chip. Too much additional logic can diminish the power-saving effects of clock gating, so the Pentium M's clock gating is fine-grained, but not overly so. The Pentium M's designers used some clever techniques in order to keep unneeded transistors inactive. For instance, the register files in the register renaming units are partitioned by data type, so that only the data width necessary is accessed. If the data being processed is in 32-bit integer form, there's no need for 80-bit floating-point-sized registers to be active.
- Lower leakage transistors: In some cases, like in the L2 cache of the original Banias Pentium M, Intel used lower leakage transistors that required less power at the expense of speed. Doing so might increase cache latencies or limit peak clock speeds, but it can also save lots of power.
- A new branch prediction unit: The Pentium M's branch prediction unit is based on the Pentium 4's, but it's significantly enhanced. Accurate branch prediction is crucial for performance in any modern CPU, but it's especially crucial for power savings, because branch mispredictions amount to wasted energy. The Pentium M team added a loop detector to the branch prediction unit in order to enhance handling of program loops with lots of iterations, and they added an indirect branch predictor that better handles data-dependent indirect branches, as often found in object-oriented code. The branch prediction unit was further tweaked in the Dothan core, as well.
- Micro-ops fusion: Like most modern x86 processors, the P6 is a RISC-like core coupled to an x86 instruction decoder. This decoder translates x86 instructions into micro-ops, or instructions that execute on the RISC-like core. Sometimes, x86 instructions decode into multiple micro-ops and execute in a way that's not entirely efficient. For instance, a store instruction becomes two micro-ops, one that calculates the address and another that writes data to that address. The Pentium M's decoder fuses these into one micro-op and keeps them largely united as they're processed. Only at the execution level, when necessary, are they decoupled.
Intel claims micro-ops fusion cuts micro-ops by over 10% in Banias, leading to performance gains of 5% for integer code and 9% for floating-point. The additional logic for micro-ops fusion does consume more power, but Intel says the additional performance offsets this effect: an instruction sequence requires less energy to complete. The Dothan core apparently fuses even more instructions, although we don't yet have any details on which or how many.
- A dedicated stack engine: This logic, situated near the instruction decoders, manages the updating of the hardware stack pointer register, again cutting down on the number of micro-ops that must be executed. Intel says this more efficient internal housekeeping cuts micro-ops by 5%.
- Enhanced SpeedStep: The last item on my list may be the most familiar to many of us. Intel's Enhanced SpeedStep varies both clock speeds and CPU core voltages in order to conserve power when the CPU isn't entirely busy. Enhanced SpeedStep has multiple "gears" and can step speeds up very quickly on demand in order to keep system performance snappy.
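Why does Enhanced SpeedStep bother changing voltage as well as clock speed? A quick sketch helps. Dynamic power in CMOS logic scales roughly with voltage squared times frequency, so dropping voltage along with the clock saves far more than dropping the clock alone. The Python below is only a toy model: the "gears" listed are hypothetical operating points, not Intel's actual SpeedStep table, and the model ignores leakage power entirely.

```python
# Toy model of dynamic power under voltage/frequency scaling.
# The operating points below are hypothetical, not Intel's published SpeedStep table.

def relative_dynamic_power(freq_ghz, volts, ref_freq_ghz, ref_volts):
    """Dynamic CMOS power scales roughly with V^2 * f (capacitance and activity held fixed)."""
    return (volts / ref_volts) ** 2 * (freq_ghz / ref_freq_ghz)


def relative_energy_per_task(freq_ghz, volts, ref_freq_ghz, ref_volts):
    """Energy for a fixed amount of work: power * time, with time proportional to 1/f.

    The frequency terms cancel, leaving a V^2 dependence. Leakage is ignored.
    """
    power = relative_dynamic_power(freq_ghz, volts, ref_freq_ghz, ref_volts)
    time = ref_freq_ghz / freq_ghz
    return power * time


# Hypothetical "gears": (clock in GHz, core voltage in volts), highest first.
GEARS = [(2.0, 1.30), (1.6, 1.20), (1.2, 1.10), (0.8, 1.00)]

if __name__ == "__main__":
    top_f, top_v = GEARS[0]
    for f, v in GEARS:
        p = relative_dynamic_power(f, v, top_f, top_v)
        e = relative_energy_per_task(f, v, top_f, top_v)
        print(f"{f:.1f} GHz @ {v:.2f} V: power {p:.2f}x, energy per task {e:.2f}x")
```

Under these assumptions, the lowest gear draws less than a quarter of the top gear's dynamic power, and the energy needed to finish a fixed chunk of work falls with the square of the voltage.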