The Pentium 4's new look
Wondering how Prescott got to have so many more transistors? The answer is that Prescott is a serious overhaul of the Netburst microarchitecture all Pentium 4s share. In fact, Prescott is arguably a bigger revamp than the P6 core ever got during its long tenure at the heart of the Pentium Pro, Pentium II, and Pentium III processors. There are too many changes to cover in depth here, but I will attempt to summarize them and talk about the most significant modifications to the chip's design.
The watchwords for the Prescott changes are "higher clock frequencies." Virtually all the modifications to the Prescott core are intended to produce high performance while allowing the chip to run at clock speeds of 4GHz and beyond. Many of the radical elements of the original Netburst design are present here in even more radical form, including the deep main pipeline, execution trace cache, and ample amounts of speculative logic and prefetching. Most of these changes represent tradeoffs of various types, between, say, higher clock speeds and higher clock-for-clock performance, or, in many cases, between higher latencies and better peak performance. Generally, Prescott has been tuned for higher clock frequencies, and the choices Intel's design team has made reflect that emphasis.
With that said, we'll let the bullets start flying on our summary of Prescott's new features.
- A much longer pipeline: Probably the biggest news of the day is that Netburst's main branch prediction/recovery pipeline has been lengthened from a healthy 20 stages in its previous incarnation to 31 stages in Prescott. To give you a point of reference, that's longer than the Alaskan oil pipeline. Pipelines of around 10 stages are much more common. AMD's Hammer core in the Athlon 64 and Opteron processors has a 12-stage pipeline.
By making each stage of the pipeline less complex, Intel increases the processor's tolerance for running at higher clock speeds. In doing so, though, Intel's engineers have chosen to reduce clock-for-clock performance. This change, by itself, would significantly lower the number of instructions per clock (IPC) the Pentium 4 can execute. Higher clock speeds can offset a lower IPC, but Prescott starts out at only 3.4GHz, and Northwood runs at that speed, too.
Fortunately, there are a number of countervailing forces to take into account. For one thing, instruction latencies vary; not all instructions use all stages of the pipeline. More importantly, Prescott includes a whole raft of enhancements aimed at increasing its clock-for-clock performance, some in very specific ways. That's what the rest of these bullet points are about.
Before we move on, I should point out once more that taken in context, a lower IPC isn't necessarily a bad thing. Higher or lower IPCs in processor design are tradeoffs, and need not evoke a value judgment. What is true of the Pentium 4, and of Prescott more so than prior revisions, is that Intel has chosen to go full-bore the way of lower IPC and higher clock speeds. This "speed demon" approach to processor design seems to fit reasonably well with Intel's technological prowess in chip fabrication.
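To put the IPC-versus-clock tradeoff in concrete terms, here is a minimal sketch that treats throughput as IPC times clock speed. The 10% IPC loss is a made-up figure for illustration, not a measurement of Prescott, and the function names are hypothetical.

```python
def relative_performance(ipc, clock_ghz):
    """Throughput in billions of instructions per second: IPC times clock."""
    return ipc * clock_ghz


def break_even_clock(old_ipc, old_clock_ghz, new_ipc):
    """Clock speed the new design needs to match the old design's throughput."""
    return old_ipc * old_clock_ghz / new_ipc


# Hypothetical numbers, for illustration only (not measured values).
old_ipc = 1.00   # normalized IPC of the shorter-pipeline design
new_ipc = 0.90   # assumed 10% IPC loss from the longer pipeline
clock = 3.4      # both chips at 3.4GHz, as in the text

needed = break_even_clock(old_ipc, clock, new_ipc)
print(f"Break-even clock: {needed:.2f} GHz")  # prints "Break-even clock: 3.78 GHz"
```

Under that assumed IPC loss, the longer-pipeline chip would need to clock a few hundred megahertz higher just to draw even, which is why the enhancements aimed at clock-for-clock performance matter so much at launch speeds.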
- A larger L2 cache: The main contributor to Prescott's massive transistor count is its new 1MB L2 cache. We've seen larger caches help performance many times before, the most dramatic recent example being the Pentium 4 Extreme Edition processors with 2MB of L3 cache onboard. The Extreme Edition is a screamer as a result of this massive cache. Prescott's larger L2 cache necessarily has higher latencies, so going to a larger cache has its drawbacks. Still, in a chip designed to run so much faster than main memory, the larger on-chip cache makes sense.
- A larger L1 data cache: Northwood's L1 data cache was 8K and 4-way associative. Prescott's is 16K and 8-way associative, so Prescott's L1 cache should have a higher hit rate and, thus, be more effective.
As in previous Netburst processors, Prescott's L1 instruction cache is an unconventional execution trace cache that holds decoded micro-ops for the processor's RISC-like core instead of CISC-style x86 instructions. Prescott's execution trace cache still holds roughly 12,000 micro-ops, but the chip can now encode more types of micro-ops into the trace cache, making it more efficient.
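The cache tradeoff described above, where a bigger cache means higher latency but a better hit rate, can be sketched with the textbook average memory access time formula. Every number below (hit latencies, miss rates, the 300-cycle trip to main memory) is a hypothetical placeholder, not a Northwood or Prescott spec.

```python
def amat(hit_cycles, miss_rate, miss_penalty_cycles):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


def break_even_miss_rate(small_hit, small_miss_rate, large_hit, miss_penalty_cycles):
    """Miss rate below which the larger, slower cache comes out ahead."""
    return small_miss_rate - (large_hit - small_hit) / miss_penalty_cycles


# All values are hypothetical, chosen only to show the shape of the tradeoff.
MEMORY_PENALTY = 300  # assumed cycles to reach main memory on a miss

small_cache = amat(hit_cycles=7, miss_rate=0.05, miss_penalty_cycles=MEMORY_PENALTY)
large_cache = amat(hit_cycles=11, miss_rate=0.03, miss_penalty_cycles=MEMORY_PENALTY)
threshold = break_even_miss_rate(7, 0.05, 11, MEMORY_PENALTY)

print(f"smaller cache: {small_cache:.1f} cycles, larger cache: {large_cache:.1f} cycles")
print(f"larger cache wins if its miss rate is below {threshold:.4f}")
```

The larger the miss penalty relative to the added hit latency, the easier it is for the bigger cache to pay for itself, which fits the point that a larger on-chip cache makes sense in a chip that runs so much faster than main memory.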
- SSE3 instructions: Intel has endowed the Prescott core with 13 new instructions now known as SSE3. Like previous SSE revisions, these extended instructions are intended to accelerate certain types of computational tasks. Five new instructions for complex arithmetic allow for better handling of tasks like Fast Fourier Transforms; these instructions should enhance the Pentium 4's potential in scientific and distributed computing scenarios. Another four new instructions should make the Pentium 4 a better vertex shader for graphics applications by allowing manipulation of data organized as an array of structures, as is common in graphics vertex databases. A pair of new instructions enhances thread synchronization in Hyper-Threading, allowing an unoccupied logical processor to enter a dormant state in order to release resources for the other logical processor, to consume less power, or both. The remaining instructions should improve video encoding and x87-to-integer data conversions.
Of course, programs must be rewritten or recompiled to take advantage of SSE3 instructions, so we won't see SSE3's benefits immediately.
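To see why the complex arithmetic instructions help, here is a rough Python emulation of three of them (ADDSUBPS, MOVSLDUP, and MOVSHDUP) operating on four-element lists that stand in for 128-bit SSE registers. This is only a model of the lane-by-lane behavior; real code would use compiler intrinsics or assembly, and the helper names `mulps` and `swap_pairs` are stand-ins for an ordinary packed multiply and shuffle.

```python
def movsldup(v):
    """MOVSLDUP: duplicate the low (even-indexed) element of each pair."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """MOVSHDUP: duplicate the high (odd-indexed) element of each pair."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """ADDSUBPS: subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """Packed multiply (already available before SSE3)."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap the two elements of each pair (a shuffle in real code)."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul_packed(x, y):
    """Multiply two complex pairs laid out as [re0, im0, re1, im1]."""
    t1 = mulps(movsldup(x), y)              # [a*c, a*d, ...]
    t2 = mulps(movshdup(x), swap_pairs(y))  # [b*d, b*c, ...]
    return addsubps(t1, t2)                 # [a*c - b*d, a*d + b*c, ...]


x = [1.0, 2.0, 0.0, 1.0]  # (1+2i) and (0+1i)
y = [3.0, 4.0, 0.0, 1.0]  # (3+4i) and (0+1i)
result = complex_mul_packed(x, y)
print(result)  # prints [-5.0, 10.0, -1.0, 0.0]
```

The mixed subtract-and-add step is the key: without ADDSUBPS, code has to flip signs or shuffle data around to get the real part (a subtraction) and the imaginary part (an addition) out of the same packed operation.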
- Better prefetching: Intel has improved Prescott's hardware and software prefetch abilities, so it can anticipate what data will be needed next and fetch them into its L2 cache. Most importantly, the hardware prefetching algorithm, which requires no special code, should be smarter about what to grab and when to grab it.
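For readers unfamiliar with hardware prefetching, the toy model below shows the general idea using a simple stride detector. To be clear, this is a generic textbook-style illustration and not Intel's algorithm, whose details aren't covered here; the function name and the 64-byte line size are assumptions for the sketch.

```python
def stride_prefetches(addresses, line_size=64):
    """Toy stride detector: once two consecutive accesses show the same
    nonzero stride, predict the next address and record its cache line."""
    prefetched = []
    last_addr = None
    last_stride = None
    for addr in addresses:
        if last_addr is not None:
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                # Pattern confirmed: fetch the line we expect to need next.
                prefetched.append((addr + stride) // line_size)
            last_stride = stride
        last_addr = addr
    return prefetched


# A program walking through memory in steady 256-byte steps.
accesses = [0, 256, 512, 768]
lines = stride_prefetches(accesses)
print(lines)  # prints [12, 16]
```

A smarter prefetcher is, in essence, better at spotting patterns like this one and at deciding how far ahead to fetch without flooding the cache with data that never gets used.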
- Enhanced Hyper-Threading: Intel's engineers have modified Prescott in various ways to make Hyper-Threading better. Shared resources have been expanded, and more types of operations can be conducted in parallel. The number of store instructions in flight is up from 24 to 32, for instance, and the number of write-combining buffers used to track stores is up from six to eight. These changes should allow multiple threads to execute simultaneously with less interference. Also, Prescott includes measures to reduce L1 cache contention between its two logical processors.
- Lots of microarchitectural tweaks: Here's where the bullet point thing breaks down. There are too many important little tweaks to list them all under their own headings.
For instance, Prescott's branch prediction unit has been improved to avoid branch mispredictions, which will be more costly than ever with Prescott's long pipeline. One of the enhancements is the addition of an indirect branch predictor, borrowed from the work of the Pentium M team.
Another key change is a new shifter/rotator block added to one of the chip's simple arithmetic logic units, or ALUs. You will recall that the Pentium 4's simple ALUs run at twice the speed of the rest of the chip; that's still true for Prescott, and now one of the ALUs can handle shift and rotate operations. Also, Prescott now does integer multiplication in a dedicated integer multiplier instead of using the floating-point multiplier, as previous Netburst chips did.
There are also store-to-load forwarding enhancements, improvements to SSE/2/3 and x87 multimedia performance, and more.
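The reason branch mispredictions get more expensive with a longer pipeline can be sketched with a simple cycles-per-instruction model. The big assumption here is that the misprediction penalty roughly equals the pipeline depth (20 versus 31 stages); the base CPI, branch frequency, and misprediction rate are hypothetical values picked for illustration.

```python
def cpi_with_mispredicts(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Effective CPI = base CPI + branches * mispredict rate * penalty."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


def break_even_mispredict_rate(old_rate, old_penalty, new_penalty):
    """Mispredict rate the deeper pipeline needs to lose no more cycles."""
    return old_rate * old_penalty / new_penalty


# Hypothetical workload parameters (not measured values).
BASE_CPI = 1.0
BRANCH_FRACTION = 0.20  # assume one instruction in five is a branch
OLD_RATE = 0.05         # assume 5% of branches are mispredicted

# Assumption: penalty in cycles is about equal to pipeline depth.
old_cpi = cpi_with_mispredicts(BASE_CPI, BRANCH_FRACTION, OLD_RATE, 20)
new_cpi = cpi_with_mispredicts(BASE_CPI, BRANCH_FRACTION, OLD_RATE, 31)
target = break_even_mispredict_rate(OLD_RATE, 20, 31)

print(f"20-stage CPI: {old_cpi:.2f}, 31-stage CPI: {new_cpi:.2f}")
print(f"break-even mispredict rate: {target:.4f}")
```

In this simplified model, the deeper pipeline has to cut mispredictions by roughly a third just to lose no more cycles to them than before, which gives a sense of why the branch prediction unit got attention.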
All told, Prescott is a rather different animal from the Northwood and Willamette chips that precede it and share the Pentium 4 name. These changes will affect performance in ways that are difficult to predict. Instruction latencies will be higher, except where they're lower. The same is true for performance in general, and that's why we run the benchmarks.
Prescott pullin' the juice
There has been some concern, leading up to Prescott's launch, about how much power the chip will consume and how much heat it will produce. The key spec Intel provides in this realm is TDP, or Thermal Design Power. TDP is not, however, a peak power load number; it is a thermal design guideline. As Intel puts it, "The TDP is not the maximum power that the processor can dissipate." So we have something to go on there, but perhaps not much.
Northwood's TDP at 3.2GHz is 82W, while the Extreme Edition's is about 92W. Prescott's TDP at 3.2GHz is 103W. So yeah, this thing pulls some juice and generates some heat.
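For a quick sense of scale, the snippet below works out the percentage increases from the TDP figures quoted above. The Extreme Edition number is the approximate 92W value from the text, and the helper name is just for illustration.

```python
def pct_increase(old_watts, new_watts):
    """Percentage increase going from old_watts to new_watts."""
    return (new_watts - old_watts) / old_watts * 100


# TDP figures as quoted in the text (Extreme Edition is approximate).
NORTHWOOD_TDP = 82
EXTREME_EDITION_TDP = 92
PRESCOTT_TDP = 103

vs_northwood = pct_increase(NORTHWOOD_TDP, PRESCOTT_TDP)
vs_extreme = pct_increase(EXTREME_EDITION_TDP, PRESCOTT_TDP)

print(f"Prescott vs. Northwood: +{vs_northwood:.1f}%")      # prints +25.6%
print(f"Prescott vs. Extreme Edition: +{vs_extreme:.1f}%")  # prints +12.0%
```

Keep in mind these are thermal design guidelines rather than peak power numbers, so the percentages describe cooling targets, not measured draw.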
To manage Prescott's thermal prowess, Intel has created a new specification for thermals that allows for finer-grained control of fan speeds based on a value returned from the CPU. This value is set "based on the power dissipation of each unit," according to Intel, and combined with the thermal diode temp, will dictate safe fan speeds for coolers. Implementing this scheme will require motherboard changes, but not changes to the actual cooler designs. In fact, Intel-approved coolers for current Pentium 4s should work for Prescott at its initial speed grades.
Intel is also pushing a verification program for ATX cases, trying to ensure enclosures have proper venting and the like. Clearly, Intel is squeezing all it can from ATX while waiting for the new BTX form factor to arrive in force.
So the hundred dollar question is: will Prescott work with my motherboard? The answer is, as with so many things in life, it depends. These first Prescott chips drop into 478-pin sockets, just like Northwoods. Newer motherboards from top vendors have probably been ready for Prescott for some time, but they will have to provide adequate power for Prescott, and not all older motherboards can. So Intel's answer is, "Check with your motherboard manufacturer." We checked with Abit about our IC7-G test platform, and they were able to provide us with a Prescott-ready BIOS. Once we flashed to it, the Prescott ran like a champ on our board. Depending on your motherboard's age and power design, your mileage may vary.