Next year, AMD plans to ship products based on a new processor architecture code-named Bulldozer, and in the world of big, x86-compatible CPUs, that’s huge news. In this arena, the question of how truly “new” a chip architecture is can be vexingly complicated, because technologies, ideas, and logic are often carried over from one generation to the next. But it’s probably safe to say Bulldozer is AMD’s first all-new, bread-and-butter CPU architecture since the introduction of the K7 way back in 1999. The firm has made notable incremental changes along the way—K8 brought a new system architecture, Barcelona integrated four cores together—but the underlying microarchitecture hasn’t changed too much. Bulldozer is something very different, a new microarchitecture incorporating some novel concepts we’ve not seen anywhere else.
Today, at the annual Hot Chips conference, Mike Butler, AMD Fellow and Chief Architect of the Bulldozer core, gave the first detailed public exposition of Bulldozer. We didn’t attend his presentation, but we did talk with Dina McKinney, AMD Corporate Vice President of Design Engineering, who led the Bulldozer team, in advance of the conference. We also have a first look at some of the slides from Butler’s talk, which reveal quite a bit more detail about Bulldozer than we’ve seen anywhere else.
The first thing to know about the information being released today is that it’s a technology announcement, and only a partial one at that. AMD hasn’t yet divulged specifics about Bulldozer-based products, and McKinney declined to answer certain questions about the architecture, too. Instead, the company intends to release snippets of information about Bulldozer in a directed way over time in order to maintain the buzz about the new chip—an approach it likens to “rolling thunder,” although I’d say it feels more like a leaky faucet.
The products: New CPUs in 2011
Regardless, we know the broad outlines of expected Bulldozer-based products already. Bulldozer will replace AMD’s current server and high-end desktop processors, including the Opteron 4100 and 6100 series and the Phenom II X6, sometime in 2011. A full calendar year is an awfully big target, especially given how close it is, but AMD isn’t hinting at exactly when next year the products might ship. We do know that the chips are being produced by GlobalFoundries on its latest 32-nm fabrication process, with silicon-on-insulator tech and high-k metal gate transistors. McKinney told us the first chips are already back from the fab and up and running inside of AMD, so Bulldozer is well along in its development. Barring any major unforeseen problems, we’d wager the first products based on it could ship well before the end of 2011, although launch windows like this one have a way of stretching to their final hours.
One advantage that Bulldozer-based products will have when they do ship is the presence of an established infrastructure ready and waiting for them. AMD says Bulldozer-based chips will be compatible with today’s Opteron sockets C32 and G34, and we expect compatibility with Socket AM3 on the desktop, as well, although specifics about that are still murky.
AMD has committed to three initial Bulldozer variants. “Valencia” will be an eight-core server part, destined for the C32 socket with dual memory channels. “Interlagos” will be a 16-core server processor aimed at the G34 socket, so we’d expect it to have quad memory channels. In fact, Interlagos will likely be composed of two Valencia chips on a single package, in an arrangement much like the present “Magny-Cours” Opterons. The desktop variant, “Zambezi,” will have eight cores, as well. All three will quite likely be based on the same silicon.
The concept: two ‘tightly coupled’ cores
The specifics of that silicon are what will make Bulldozer distinctive. The key concept for understanding AMD’s approach to this architecture is a novel method of sharing resources within a CPU. Butler’s talk names a couple of well-known options for supporting multiple threads. Simultaneous multithreading (SMT) employs targeted duplication of some hardware and sharing of other hardware in order to track and execute two threads in a single core. That’s the approach Intel uses in its current, Nehalem-derived processors. CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD’s current Opterons and Phenoms do. The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.
AMD didn’t take that approach, though. Instead, the team chose to integrate two cores together into a fundamental building block it calls a “Bulldozer module.” This module, diagrammed above, shares portions of a traditional core—including the instruction fetch, decode, and floating-point units and L2 cache—between two otherwise-complete processor cores. The resources AMD chose to share are not always fully utilized in a single core, so not duplicating them could be a win on multiple fronts. The firm claims a Bulldozer module can achieve 80% of the performance of two complete cores of the same capability. Yet McKinney told us AMD has estimated that including the second integer core adds only 12% to the chip area occupied by a Bulldozer module. If these claims are anywhere close to the truth, Bulldozer should be substantially more efficient in terms of performance per chip area—which translates into efficiency per transistor and per watt, as well.
One obvious outcome of the Bulldozer module arrangement, with its shared FPU, is an inherent bias toward increasing integer math performance. We’ve heard several explanations for this choice. McKinney told us the main motivating factor was the presence of more integer math in important workloads, which makes sense. Another explanation we’ve heard is that, with AMD’s emphasis on CPU-GPU fusion, floating-point-intensive problems may be delegated to GPUs or arrays of GPU-like parallel processing engines in the future.
In our talk, McKinney emphasized that a Bulldozer module would provide more predictable performance than an SMT-enabled core—a generally positive trait. That raised an intriguing question about how the OS might schedule threads on a Bulldozer-based processor. For an eight-threaded, quad-core CPU like Nehalem, operating systems tend to favor scheduling a single thread on each physical core before adding a second thread on any core. That way, resource sharing within the cores doesn’t come into play before it’s necessary, and performance should be optimal. We suggested such an arrangement might also be best for a Bulldozer-based CPU, but McKinney downplayed the need for any special provisions of that nature on this hardware. She also hinted that scheduling two threads on the same module and leaving the other three modules idle, so they could drop into a low-power state, might be the best path to power-efficient performance. We don’t yet know what guidance AMD will give operating system developers regarding Bulldozer, but the trade-offs at least shouldn’t be too painful.
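The two placement strategies in play can be sketched as simple CPU-numbering policies. This is purely our illustration: it assumes a hypothetical four-module, eight-core part in which logical CPUs 2m and 2m+1 share module m, a numbering AMD has not confirmed:

```c
/* Two ways an OS might place thread t on a hypothetical 4-module,
   8-core Bulldozer chip, assuming (our assumption, for illustration
   only) that logical CPUs 2m and 2m+1 share module m. */
#define MODULES 4

/* "Spread": one thread per module first, so no module has to share
   its front end or FPU until more than four threads are runnable. */
int spread_cpu(int t) {
    return (t % MODULES) * 2 + (t / MODULES) % 2;
}

/* "Pack": fill both cores of a module before waking the next one,
   letting the still-idle modules stay in a low-power state. */
int pack_cpu(int t) {
    return t % (MODULES * 2);
}
```

With four threads, the spread policy lights up all four modules with one thread each (maximum per-thread performance), while the pack policy confines them to two modules (maximum power savings)—exactly the trade-off McKinney alluded to.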
The sharing arrangement may be the most noteworthy aspect of the Bulldozer architecture, but the cores themselves are substantially changed from prior AMD processors, too.
The module’s front end includes a prediction pipeline, which predicts what instructions will be used next. A separate fetch pipeline then populates the two instruction queues—one for each thread—with those instructions. The decoders convert complex x86 instructions into the CPU’s simpler internal instructions. Bulldozer has four of these, like Nehalem, while Barcelona has three.
Each module has a trio of schedulers, one for each integer core and one for the FPU. And the integer cores themselves have two execution units and two address generation units each. Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each. As a result, we thought perhaps AMD might layer SMT on top of a Bulldozer module at some point in the future. Knowing what we do now, that outcome seems much less likely. Bulldozer doesn’t look to have any “extra” execution hardware waiting to be exploited in those integer cores.
Although each module has only a single floating-point unit, that FPU should be substantially more capable than past AMD FPUs. You can see the dual integer MMX and 128-bit FMAC units in the diagram above. In a sort of quasi-SMT arrangement, the FPU can track two hardware threads, one for each “parent” core on the module.
The FPU supports nearly all the alphabet-soup extensions to the x86 ISA, up to and including SSSE3, SSE 4.1, 4.2, and Intel’s new Advanced Vector Extensions (AVX). AVX allows for higher-throughput processing of graphics, media, and other parallelizable, floating-point-intensive workloads by doubling the width of SIMD vectors from 128 to 256 bits. Bulldozer’s 128-bit FMAC units will work together on 256-bit vectors, effectively producing a single 256-bit vector operation per cycle. Intel’s Sandy Bridge, due early in 2011, will have two 256-bit vector units capable of producing a 256-bit multiply and a 256-bit add in a single cycle, double Bulldozer’s AVX peak.
Bulldozer’s FPU has an advantage in another area, though, as the presence of two 128-bit FMAC units indicates. FMAC is short for “fused multiply-accumulate,” an operation that’s sometimes known as FMA, for “fused multiply-add,” instead. Whatever you call it, a single operation that joins multiplication with addition is new territory for x86 processors, and it has two main benefits.
The first, pretty straightforwardly, is higher performance. The need to multiply two numbers and then add the result turns out to be very common in graphics and media workloads, and fusing them means the processor can achieve twice the throughput for those operations. We’ve seen multiply-add instructions in GPUs for ages, which is why each ALU in a GPU shader can produce two ops per clock at peak. With dual 128-bit FMACs, Bulldozer’s peak FLOPS throughput should be comparable to Sandy Bridge’s peak with AVX and 256-bit vectors.
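The "comparable" claim checks out if you count single-precision FLOPs per cycle, treating a fused multiply-add as two operations. This is our arithmetic based on the unit widths described above, not a figure from either company:

```c
/* Peak single-precision FLOPs per cycle, counting an FMA as two
   operations. Our arithmetic from the stated unit widths; a 128-bit
   vector holds four 32-bit floats, a 256-bit vector holds eight. */

/* Bulldozer module: two 128-bit FMACs, each doing a fused
   multiply-add across four lanes. */
int bulldozer_module_sp_flops(void) {
    return 2 /* FMACs */ * 4 /* lanes */ * 2 /* mul + add, fused */;
}

/* Sandy Bridge core: a 256-bit multiply and a 256-bit add
   issued in the same cycle, one operation per lane each. */
int sandy_bridge_core_sp_flops(void) {
    return 8 /* multiply lanes */ + 8 /* add lanes */;
}
```

Both work out to 16 single-precision FLOPs per cycle—comparable peaks reached by very different routes.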
Second, because an FMA operation feeds the result of the multiply directly into the adder without rounding, the mathematical precision of the result is higher. For this reason, the DirectX 11 generation of GPUs adopted FMA as their new standard, as well.
Crucially, Intel’s Sandy Bridge will not support an FMA operation. Instead, FMA support is slated for Haswell, the architectural refresh coming a full “tick-tock” generation beyond Sandy Bridge, likely in 2013. Earlier this year, Intel architect Ronak Singhal told us the choice to leave FMA out of Sandy Bridge was driven by the fact that it’s “not a small piece of logic” since it requires more sources, or operands, than usual. Intel chose to double the vector width first with AVX and push FMA down the road.
Thus, Bulldozer will be the first x86 processor with FMA capability. That distinction won’t come without controversy, though. Bulldozer supports an AMD-sanctioned four-operand form of FMA operation, whereas Haswell will use a three-operand version. Both instructions will require compiler support and freshly compiled binaries, so we may see yet another fracture in the x86 ISA until Intel and AMD can settle on a single, preferred solution.
When Intel integrated a memory controller into Nehalem and basically aped AMD’s blueprint for a system architecture, it reaped benefits in terms of computing throughput and bandwidth that AMD’s current solutions haven’t been able to match. There are many reasons why, but one of the big ones comes down to the effectiveness of Intel’s data pre-fetch mechanisms, which pull likely-to-be-needed data into the processor’s caches ahead of time, so it’s ready and waiting when needed.
Bulldozer is getting an overhaul in this area, with multiple data prefetchers that operate according to different algorithms in order to predict more accurately what data may be required soon. If they work well, these prefetchers should allow Bulldozer to make more effective use of the tremendous bandwidth available in AMD’s latest DDR3-fortified platforms.
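To give a sense of what one of these algorithms might look like, here is a minimal stride-detecting prefetcher in the generic textbook style: remember the last address and the stride between accesses, and once the same stride repeats, start predicting the next one. This is an illustration of the general technique only; AMD hasn’t described Bulldozer’s actual prefetch algorithms:

```c
#include <stdint.h>

/* Generic stride prefetcher sketch (not AMD's actual design):
   tracks the last demand address and the stride between accesses. */
typedef struct {
    uint64_t last_addr;
    int64_t  stride;
    int      confirmed;   /* has the same stride been seen twice? */
} stride_predictor;

/* Feed one demand access; returns the address worth prefetching,
   or 0 while the predictor is not yet confident. */
uint64_t prefetch_next(stride_predictor *p, uint64_t addr) {
    int64_t stride = (int64_t)(addr - p->last_addr);
    p->confirmed = (stride != 0 && stride == p->stride);
    p->stride = stride;
    p->last_addr = addr;
    return p->confirmed ? addr + (uint64_t)stride : 0;
}
```

Real designs run several such predictors—per-instruction stride tables, next-line and region prefetchers—in parallel, which appears to be the spirit of what AMD is describing for Bulldozer.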
Revamped power management
Although we might think about the changes to Bulldozer primarily in terms of raw performance, a great many facets of this chip are aimed at making it more efficient in terms of performance per die area, per transistor, and per watt. That’s true of both the architecture and the circuit design.
On top of all that, Bulldozer has learned a couple of power-saving tricks that Intel processors have known since Nehalem. One is dynamic clock frequency scaling, like Intel’s Turbo Boost. The Phenom II X6 “Thuban” core has a simple mechanism of this type, dubbed Turbo Core, but the CPU doesn’t seem to spend too much time resident at its highest frequencies, given the performance it produces. Bulldozer’s implementation should be more robust and, hopefully, more effective.
The other trick AMD has ganked from Intel’s playbook is the use of an on-chip power gate to cut off power to individual CPU cores that happen to be idle. Despite the wording of the slide above, Bulldozer incorporates power gates on a per-module basis rather than per-core, although of course the chip includes finer-grained clock gating logic within the module. The ability to shut off power entirely to unused modules should pay some nice dividends.
This initial peek at Bulldozer reveals some truly new thinking about CPU microarchitecture, and it’s undeniably promising in theory. Done well, Bulldozer could restore AMD’s competitiveness in both server/workstation processors and high-end desktops, and it could serve as a foundation for continued success for years to come. Unfortunately, it’s way too early to speculate on the prospects for products based on this architecture. Purely by looking at Barcelona on paper, one might have expected it to outperform the competing Core 2-based processors and to match up well with Nehalem. The reality was far different from that. Bulldozer’s future will hinge on whether AMD can effectively implement the concepts it has introduced here, and we have no crystal ball to tell us what to expect on that front.