IDF Fall 2005: Intel aims for performance per watt

IF THE THEME OF this past Spring’s Intel Developer Forum was multi-core processors, the theme of IDF Fall 2005 was performance per watt. Not only did Intel announce a new, common processor microarchitecture that promises higher performance and lower power consumption than the current Pentium 4 and its derivatives, but the CPU maker also outlined a broad range of initiatives for reducing power consumption while boosting performance. Those initiatives are taking form in its next wave of products, its 65nm fab process, and in a number of cutting-edge R&D projects. I sat in on the major keynotes and a handful of sessions whose contents resonated with the overarching theme of better performance per watt, and my report follows.

Intel’s new microarchitecture
We’ve already discussed the biggest news out of IDF, Intel’s decision to move to a new CPU microarchitecture common to its mobile, desktop, and server product lines, and we’ve outlined some of the features of that architecture, including a 14-stage pipeline that’s much shorter than the 31-stage monster in current Pentium 4 and Pentium D chips. This new CPU core will be a four-issue design, which means it has more internal parallelism than the three-issue designs in most current x86 processors, including the Pentium M, Pentium 4, and Athlon 64. Done correctly, this new core should achieve higher performance per clock and per watt than Intel’s current CPU cores.

Intel has declined to give a name to this new architecture—at least so far—leaving us to wrestle with tortured references to “the new microarchitecture” and “this new design” from here to kingdom come. Rather than suffer through that, I have decided to give this microarchitecture a name myself. Henceforth, Intel’s new microarchitecture shall be known as Fred.

Intel says Fred incorporates the best elements of both the Pentium 4 Netburst architecture and the Pentium M architecture. That’s not a bad characterization of Fred, since Intel’s current processors do share some technology between them, and Fred will no doubt incorporate some of that same tech. Discussions of chip heritage and the differences between an “all-new” design and a refinement of a past design are becoming ever more tedious as chip designs are increasingly modular. However, the best way to think about Fred is probably as an evolution of the Pentium M processor rather than as a clean-sheet design—not that there’s anything wrong with that.

Fred will be well credentialed, with certifications in all of Intel’s latest technologies (or, as Intel calls ’em, the *Ts), including VT for virtualization, LT for security and copy protection, and EM64T for compatibility with 64-bit software. Curiously, though, Fred will not incorporate one of the original *Ts, Hyper-Threading Technology. Intel says it still likes the idea of multithreaded execution cores and may bring it back in future designs, but I got the distinct impression that the current thinking at the company favors spending its transistor budget on additional CPU cores rather than on simultaneous multithreading hardware. The problems that current dual-core Extreme Edition processors sometimes face with thread allocation among four front ends may also have factored into Intel’s decision to pass over HT in the first implementations of the new architecture.

Unlike the Smithfield-based Pentium D processors selling today, Fred will feature a shared L2 cache and the ability to transfer data from one core’s L1 cache to another’s. The L2 cache size itself will be scalable as needed, depending on the application. In other words, mobile processors will likely get smaller caches than server processors. Among Fred’s other talents will be deeper buffers, more L2 cache bandwidth than current designs, better prefetching of data into cache, and speculative loading of data from memory—also known as memory disambiguation.

Intel’s new microarchitecture spans market segments (Source: Intel)

Fred will use the same bus as current Pentium processors, and its first implementations will drop into the same sorts of sockets as the processors it replaces. Merom is the code-name for the mobile version of Fred, intended for Socket 479. The desktop part, code-named Conroe, will come in an LGA775 package, and will have two versions that differ in terms of cache size. All of these chips will be dual-core parts manufactured on Intel’s 65nm process. On the server front, Fred will have two incarnations at 65nm: a dual-core chip with 4MB of L2 cache known as Woodcrest, and a quad-core processor with 16MB of L2 cache code-named Whitefield. (Note to Intel: I stand ready to take delivery of my Whitefield-based Extreme Edition processor for review whenever you’d like to ship it.)

Intel expects much lower power requirements and thermal envelopes from these new desktop and server CPUs. Conroe’s target TDP (its more-or-less maximum thermal dissipation requirement) is 65W, a very healthy dip from the 130W TDP of the Pentium Extreme Edition 840 processor. The company is also claiming twice the outright performance and 3.5X the performance per watt for Woodcrest and Whitefield over current Xeons. For mobile applications, Intel hopes to achieve a jaw-droppingly low 5W TDP for ultra-low voltage processors based on this same common architecture.
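Those two server claims, taken together, imply a power figure that Intel didn't state outright. Here's a quick back-of-the-envelope check, assuming both ratios are measured against the same current-Xeon baseline (my assumption, not something the slides spelled out). Relative power is just relative performance divided by relative performance per watt:

```python
# Back-of-the-envelope check of the server claims quoted above.
# Assumption: both ratios use the same current-Xeon baseline.

def implied_power_ratio(perf_ratio: float, perf_per_watt_ratio: float) -> float:
    """Relative power = relative performance / relative performance per watt."""
    return perf_ratio / perf_per_watt_ratio

server_power = implied_power_ratio(2.0, 3.5)
print(f"Implied server power vs. current Xeon: {server_power:.2f}x")

# Desktop TDP comparison, using the two TDP figures quoted above.
conroe_tdp_w = 65
pentium_ee_840_tdp_w = 130
desktop_ratio = conroe_tdp_w / pentium_ee_840_tdp_w
print(f"Conroe TDP vs. Pentium Extreme Edition 840: {desktop_ratio:.2f}x")
```

If the baseline assumption holds, the new server chips would draw a bit under 60% of the power of current Xeons while doing twice the work.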

The path to the new architecture
Before Fred makes his auspicious debut, Intel will take several steps along the way, including the migration of its current CPU lineup to its 65nm manufacturing process and the introduction of a host of new “platforms,” by which I mean mainly core-logic chipsets. We’ve already shown you pictures and descriptions of many of the processors, including Presler, which situates two 65nm Netburst cores on a single chip package.

Upcoming Intel platforms and processors (Source: Intel)

Notice that the mobile, server, and workstation platforms for Fred-based processors will debut in early 2006, well before the new CPUs. That raises the distinct possibility that Merom and Woodcrest could act as drop-in replacements for the 65nm Pentium M and Xeon processors that precede them. Merom, in fact, is listed as a “Napa Refresh” in some Intel presentations. The picture is a little bit different for desktop platforms. The current 945/955 chipsets won’t be replaced until the middle of 2006, and Conroe is slated to arrive either simultaneously or some time after that. Odds are that Conroe CPUs won’t work in current 945/955 motherboards, although stranger things have happened, I suppose.


Yonah promises big things
Undoubtedly the most interesting of Intel’s first 65nm processors is Yonah, the first dual-core version of the Pentium M, which will be part of the Napa platform due in early 2006. Yonah is essentially a waypoint between the current Pentium M architecture and Intel’s future, common microarchitecture, so it’s worthy of some extra attention. Fortunately, Intel’s Mooly Eden hosted an IDF session outlining the features of Yonah and the Napa platform, so we have a good picture of how these products should look.

At the heart of Yonah is a tweaked version of the Pentium M microarchitecture that’s done so well for Intel over the past few years. If you’re not familiar with the Pentium M and how it draws its heritage from earlier Pentium processors, I suggest reading my article about the Pentium M on the desktop, especially the bits about the Dothan core found in many of today’s new laptops. Yonah builds on this foundation but follows Intel’s company-wide move to dual-core designs.

Yonah is not, however, simply two Pentium M processors located side by side on a single chip or package; it’s a truly intentional dual-core design. The two cores on the processor share a bus internal to the chip, behind which sits 2MB of L2 cache that’s shared between the cores. (As with Dothan, the L2 cache is 8-way set associative.) This shared cache arrangement should be more efficient than the separate L2 caches on the Smithfield desktop chip, and in mobile apps, efficiency is especially important.

In order to achieve a power envelope similar to that of Dothan with a dual-core processor, Intel’s chip designers had to employ a range of power-saving tricks. One of the more interesting of those tricks has to do with that shared L2 cache.

Yonah’s Dynamic Smart Cache Sizing illustrated (Source: Intel)

During periods when the CPU isn’t busy or there’s little demand for cache memory, the contents of Yonah’s L2 cache are automatically flushed to memory in stages, saving the power required to keep the cache active. Should the entire contents of the L2 cache be emptied, Yonah can shut off the entire L2 cache, enabling what Intel calls a “deeper sleep” C-state. This facility should save power when the CPU is idle or nearly so.
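To make the staged behavior concrete, here's a toy model of it. This is only a sketch: the idea that the cache shrinks one way at a time, the one-step-per-idle-tick policy, and every name in the code are my own illustrative assumptions, not details Intel disclosed.

```python
# Toy model of staged L2 cache shrinking. The per-way granularity, the
# one-step-per-idle-tick policy, and all names are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ToyL2Cache:
    total_ways: int = 8        # the L2 is described as 8-way set associative
    active_ways: int = 8
    deeper_sleep: bool = False

    def tick(self, cpu_busy: bool) -> None:
        if cpu_busy:
            # Demand returned: wake the cache and re-enable every way.
            self.active_ways = self.total_ways
            self.deeper_sleep = False
        elif self.active_ways > 0:
            # Idle: flush one more stage to memory and power it down.
            self.active_ways -= 1
            if self.active_ways == 0:
                # Nothing left to keep alive, so the whole L2 can shut off.
                self.deeper_sleep = True


cache = ToyL2Cache()
for _ in range(8):
    cache.tick(cpu_busy=False)
print(cache.active_ways, cache.deeper_sleep)

cache.tick(cpu_busy=True)
print(cache.active_ways, cache.deeper_sleep)
```

After eight idle ticks the model has no active ways left and reports the deeper sleep state; a single busy tick brings the whole cache back.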

One of Dothan’s few performance weaknesses is in applications that use lots of vector math, such as video encoding programs that use SSE instructions. Compared to the Pentium 4 and Athlon 64, the Pentium M sometimes looks a little pokey at such tasks. Yonah should change all that, not just because of the addition of a second CPU core, but because of added capabilities in each core.

As you may know, modern x86 CPUs combine a RISC-like processor core with decoders for the x86 instruction set’s more complex, CISC-style instructions. These decoders translate x86 instructions into simpler instructions (Intel calls them micro-ops) that can run natively on the RISC-like core behind them. The rate at which the processor decodes x86 instructions is one of the potential bottlenecks of such a design, and it has been something of a problem for the Pentium M. Yonah distinguishes itself from past Pentium Ms because all three decoders can now handle 128-bit SSE2 instructions, allowing the CPU to dispatch more of these instructions per clock.

The Pentium M architecture also boosts its performance through a capability Intel has dubbed micro-ops fusion. When an x86 instruction is decoded into two micro-ops that are closely related, the Pentium M’s decoders can sometimes fuse them together and send them down the execution pipeline as one micro-op for more efficient execution. Yonah extends micro-ops fusion to SSE instructions, a move Intel claims makes for better use of the chip’s execution resources and decoder bandwidth.
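A toy count shows why fusion matters for decoder and pipeline bandwidth. Treat this as a sketch under loud assumptions: the four-instruction mix, the simple "load plus operation" pairing rule, and the sets of instruction classes each chip can fuse are simplifications I made up from the description above, not Intel's actual fusion rules.

```python
# Toy model of micro-ops fusion. The instruction mix, the "load+op" pairing
# rule, and the per-chip class sets are simplifying assumptions.
from typing import NamedTuple


class Instr(NamedTuple):
    text: str
    kind: str      # "int" or "sse"
    uops: tuple    # micro-ops produced by the decoder


PROGRAM = [
    Instr("add eax, [mem]", "int", ("load", "add")),
    Instr("mov ebx, eax", "int", ("mov",)),
    Instr("addps xmm0, [mem]", "sse", ("load", "addps")),
    Instr("mulps xmm1, [mem]", "sse", ("load", "mulps")),
]


def pipeline_slots(program, fusable_kinds):
    """Count pipeline entries, treating a fused load+op pair as one entry."""
    slots = 0
    for instr in program:
        can_fuse = (
            instr.kind in fusable_kinds
            and len(instr.uops) == 2
            and instr.uops[0] == "load"
        )
        slots += 1 if can_fuse else len(instr.uops)
    return slots


unfused = pipeline_slots(PROGRAM, fusable_kinds=set())
earlier = pipeline_slots(PROGRAM, fusable_kinds={"int"})
yonah_like = pipeline_slots(PROGRAM, fusable_kinds={"int", "sse"})
print(unfused, earlier, yonah_like)
```

In this made-up mix, extending fusion to the SSE instructions cuts the number of pipeline entries from six to four, which is the sort of saving Intel is claiming for SSE-heavy code.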

The enhancements to instruction decoding and grouping should help quite a bit with the Pentium M’s vector math performance, but that’s not all. Yonah also supports SSE3, with instructions intended to accelerate complex arithmetic, video encoding, and vertex processing for graphics. At the execution level, Intel says Yonah is up to 30% faster at SSE2’s Shuffle and Unpack instructions, and the integer DIV instruction is quicker, too. The chip also includes a new register that offers control over rounding of floating-point numbers, more write output buffers, and the nearly ubiquitous “enhanced data pre-fetch” that we’ve come to know and love with every new chip generation. All told, each of Yonah’s cores packs quite a few enhancements aimed at the acceleration of vector math, and especially at the handling of digital media encoding and decoding.

The end result should be a processor much better suited for multimedia playback—and a processor with a performance balance even more acceptable for use in desktop and server-class applications. In fact, Intel won’t wait for its next-generation common microarchitecture to pull Yonah into its desktop and workstation/server lineups. Systems bearing Intel’s new Viiv brand for digital home entertainment must include a dual-core processor, and Yonah is an option alongside the Pentium D and Extreme Edition. I’d expect to see Viiv-branded Yonah-based desktops and media center PCs early in ’06. There will also be a Yonah-derived product, currently code-named Sossaman, pitched as a lower-power option for data centers.

Yonah should more or less hold the line on power, with levels similar to Dothan’s, by taking advantage of a number of advances. Parallelism is a big one, as the past year has proven. The move to dual cores will allow Yonah to deliver better performance at roughly the same clock speeds as Dothan, with no need to seek higher speeds. The die shrink to 65nm will help, too, of course, as will dynamic cache sizing and the deeper sleep state. The improvements in vector math performance may lead to higher peak power consumption, but Intel has claimed in the past that micro-ops fusion is advantageous because less energy is consumed per instruction sequence. Yonah may be able to realize similar benefits from its extension of micro-ops fusion and other per-clock performance increases.

Napa: Intel’s next Centrino platform
There’s more to the Napa platform than just Yonah, of course. The core-logic chipset, code-named Calistoga, will be a mobile version of the 945 Express chipset. There’s also an Intel wireless networking solution in the mix, although we won’t talk much about that.

The Mobile 945 Express chipset will support bus speeds up to 667MHz, which gives us a very good clue about Yonah’s likely front-side bus speed. The chipset also packs a range of power-saving features of its own. Among them is an automatic display brightness control that raises and lowers the intensity of LCD backlighting in response to changes in ambient light. Intel says the operation of this mechanism should be imperceptible to the user.

Additionally, Napa will set a new baseline for Intel mobile graphics with the inclusion of the Graphics Media Accelerator 950 integrated graphics core running at 250MHz. This DirectX 9-class graphics engine should be capable of supporting the Aero Glass look in Windows Vista, and it can accelerate MPEG2 processing in hardware, among other things.


Not quite an integrated memory controller, but…
Those folks who expected Intel to make the move to an integrated memory controller with its next-gen architecture might have been disappointed by last Tuesday’s announcement, but Intel’s Justin Rattner offered some cause for hope in his keynote about Intel’s R&D initiatives on Thursday morning. Rattner spoke about Intel’s efforts to reduce power consumption even further through various techniques, including the development of a silicon radio for wireless communication that incorporates portions of the electronics in a radio into a chip.

The most striking concept he presented, though, was a CMOS-based voltage regulator module that could replace the VRMs on a motherboard with a single chip. Such a VRM could ramp up and down much more quickly than traditional VRMs, allowing SpeedStep-like clock- and voltage-throttling capabilities to eliminate many of the inefficiencies caused by slow response times in current systems.

The proposed three-chip package

To illustrate how such a thing might be implemented, Rattner showed off a package holding three chips: a Pentium M processor, a north bridge chip, and a CMOS VRM. He said this arrangement would allow both the CPU and the Memory Controller Hub to ramp voltages up and down very quickly in response to changes in utilization, saving power.

Rattner presented this concept as a power-saving measure and didn’t mention the possibility that an on-package memory controller could help cut memory access latencies. I would expect that moving the memory controller onto the same package as the CPU would have that effect, though. Unfortunately, the CMOS VRM is still a few years off, according to a presenter who mentioned the subject in a session following Rattner’s keynote. We will simply have to see when—and if—this concept makes it into an end-user product.

Multi-core mojo and power efficiency
Bob Crepps, a Technology Strategist in Intel’s Microprocessor Technology Lab, presented some eye-opening results from Intel’s research into power efficiency in multi-core architectures. He covered a lot of ground in his presentation, but a few of the concepts were especially notable, in my view.

First, Crepps made an observation that may not be news, except for the source. Intel doesn’t often say such things, but he pointed out that special-purpose hardware can achieve higher efficiency in terms of MIPS per watt than general-purpose processors. Coming from Intel, this admission may be significant because it could lead to the incorporation of specialized hardware blocks into general-purpose CPUs, or perhaps more extensive hardware acceleration in core-logic chipsets. He identified several areas of opportunity for custom hardware engines, including TCP/IP offloading, MPEG encoding and decoding, speech recognition, and (obviously) graphics.

Crepps pointed out that, for a given increase in die area and power consumption, multi-core processors offer larger performance gains than a single core does. As an example, he cited the case of a four-core processor, each core with multithreading (a la Hyper-Threading), using a shared cache and front-side bus. This chip would be able to move intensive computational loads from one core to the next, a technique he called “core hopping,” in order to manage hot spots. (I’m not sure exactly what sort of processor he’s been playing with in the lab, but it doesn’t sound like anything we’ve heard of yet.)

He asserted, with a graph to drive home the point, that such a multi-core device could offer much better performance than a single-core processor in the same die area and power envelope. I expect that’s true, given a good, parallelizable workload.

Next came a whopper of a slide that captures the problems with the Pentium 4 better than anything I’ve seen from outside of Intel.

If one factors out changes in process technology and looks only at design changes, the Pentium 4 is six times faster than the 486, but with 23 times the power consumption. That’s not a stellar proposition—hence the need to spend transistors on additional cores rather than trying to make a single CPU core perform better.
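The arithmetic behind that verdict is simple enough to check. This sketch assumes "faster" means instruction throughput, so that energy per instruction scales as power divided by performance; only the two ratios come from the slide.

```python
# Energy per instruction scales as power / performance when both are
# expressed relative to the same baseline (here, the 486).
# Assumption: "six times faster" refers to instruction throughput.
perf_ratio = 6.0      # Pentium 4 vs. 486, process changes factored out
power_ratio = 23.0    # Pentium 4 vs. 486, process changes factored out

energy_per_instruction_ratio = power_ratio / perf_ratio
perf_per_watt_ratio = perf_ratio / power_ratio

print(f"Energy per instruction: {energy_per_instruction_ratio:.2f}x the 486")
print(f"Performance per watt:   {perf_per_watt_ratio:.2f}x the 486")
```

By this measure, the Pentium 4's design spends nearly four times the energy per instruction that the 486's does, or about a quarter of the performance per watt.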

So how can we further reduce the energy per instruction used by a multi-core processor? One option is a technique called AMP, for asymmetric multiprocessing. The concept is simple enough: try running different processors (or cores) at different clock speeds within the same power envelope, and see which config achieves the best performance. Intel’s researchers tested several different configurations using a four-way Xeon system. The power envelope they chose allowed them to run four Xeons at 1GHz each or a single Xeon at 2GHz. They also were able to fit into the same power envelope a simple AMP setup that would run highly sequential code on a single 2GHz Xeon and then switch over, for more parallel code, to three Xeons running at 1.25GHz. This AMP setup would deactivate either the 2GHz single processor or the three 1.25GHz processors, depending on the usage pattern.

The researchers found that an AMP system delivered higher performance overall for the same amount of power. In benchmarks with mostly parallel components, the SMP system was better, while in benchmarks with mostly sequential components, the single-processor system was better. Overall, though, with the sort of part-sequential/part-parallel workloads that are so often typical, AMP performed best.
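A simple model shows why the mixed setup comes out ahead on mixed workloads. This is my own idealized sketch, not Intel's methodology: it assumes run time scales inversely with clock speed, parallel code scales perfectly across processors, and switching between the two AMP modes costs nothing. Only the clock speeds and processor counts come from the presentation.

```python
# Idealized model of the three configurations. Assumptions: run time scales
# inversely with clock speed, parallel code scales perfectly across CPUs,
# and switching between AMP modes is free.

def run_time(seq_fraction, seq_ghz, par_cpus, par_ghz):
    """Time to finish one unit of work (arbitrary units)."""
    sequential = seq_fraction / seq_ghz
    parallel = (1.0 - seq_fraction) / (par_cpus * par_ghz)
    return sequential + parallel


CONFIGS = {
    "SMP": dict(seq_ghz=1.0, par_cpus=4, par_ghz=1.0),      # 4 x 1GHz
    "Single": dict(seq_ghz=2.0, par_cpus=1, par_ghz=2.0),   # 1 x 2GHz
    "AMP": dict(seq_ghz=2.0, par_cpus=3, par_ghz=1.25),     # 2GHz or 3 x 1.25GHz
}

for seq_fraction in (0.0, 0.25, 0.5, 0.75):
    times = {name: run_time(seq_fraction, **cfg) for name, cfg in CONFIGS.items()}
    best = min(times, key=times.get)
    summary = ", ".join(f"{name}={t:.3f}" for name, t in times.items())
    print(f"sequential={seq_fraction:.0%}: {summary} -> best: {best}")
```

In this model, SMP wins only when the work is almost entirely parallel, and AMP wins everywhere in between. The model can't reproduce the single processor's win on mostly sequential benchmarks, presumably because real mode switches aren't free.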

This research suggests an obvious way forward for optimizing performance per watt on multi-core processors. I’ll be interested to see when it’s first implemented.


Mitosis: speculative threads
I should mention one more bit of Intel research before wrapping things up. Anand and I had an interesting conversation with a fellow from Intel about a research project known as Mitosis that involves speculative threads. These are not the sorts of “helper threads” that you may have heard mentioned in the past as a possible avenue for performance enhancements. Helper threads run alongside main threads doing things like prefetching data, but speculative threads are a very different sort of animal. They take speculative execution into the realm of thread-level parallelism.

Mitosis has a software component and a hardware component. The software is a compiler that can identify portions of a thread that might be good candidates for speculative execution and spin them off into separate threads to be run in parallel.

As I understand it, these speculative threads are executed alongside the main thread, and the results are used if they turn out to be useful, or discarded if not. As with speculative instruction execution inside of a CPU, the usefulness of a result generally depends on the outcome of a conditional.
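The use-it-or-discard-it idea can be sketched with ordinary software threads. To be clear, this is not Mitosis, which does the work in a compiler and in modified hardware; the guessed flag, the function names, and the validate-then-commit step are hypothetical stand-ins for illustration only.

```python
# Toy illustration of thread-level speculation using ordinary Python threads.
# This is not Mitosis: the predictor, the function names, and the
# validate-then-commit step are hypothetical stand-ins.
from concurrent.futures import ThreadPoolExecutor


def slow_condition(x):
    """Work whose outcome decides which value the later code needs."""
    return x % 2 == 0


def later_work(flag):
    """Code after the conditional that we would like to start early."""
    return "even path" if flag else "odd path"


def run_speculatively(x, predicted_flag):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Spin off the later code now, guessing the conditional's outcome.
        speculative = pool.submit(later_work, predicted_flag)
        actual_flag = slow_condition(x)          # main thread keeps going
        if actual_flag == predicted_flag:
            return speculative.result(), True    # guess was right: use it
        speculative.result()                     # wait, then discard
        return later_work(actual_flag), False    # redo with the real value


print(run_speculatively(4, predicted_flag=True))
print(run_speculatively(3, predicted_flag=True))
```

The second element of each result says whether the speculative work was kept. A correct guess means the later code was already running while the condition was being evaluated; a wrong guess just costs the wasted thread.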

The hardware component of Mitosis is a CPU microarchitecture modified with the addition of a global register file, a register versioning table, and some version control logic to manage the execution of parallel threads that wish to modify the same registers.

Mitosis has the potential to create multiple threads in cases where a traditional compiler would fail. The initial results on a highly sequential benchmark suite are promising.

Of course, this project is very much in the research stage right now, so I wouldn’t expect to see a processor with speculative threading on the market any time soon. I’m just encouraged to see that this sort of thing is possible. 
