Next year, AMD plans to ship products based on a new processor architecture code-named Bulldozer, and in the world of big, x86-compatible CPUs, that's huge news. In this arena, the question of how truly "new" a chip architecture is can be vexingly complicated, because technologies, ideas, and logic are often carried over from one generation to the next. But it's probably safe to say Bulldozer is AMD's first all-new, bread-and-butter CPU architecture since the introduction of the K7 way back in 1999. The firm has made notable incremental changes along the way—K8 brought a new system architecture, Barcelona integrated four cores together—but the underlying microarchitecture hasn't changed too much. Bulldozer is something very different, a new microarchitecture incorporating some novel concepts we've not seen anywhere else.
Today, at the annual Hot Chips conference, Mike Butler, AMD Fellow and Chief Architect of the Bulldozer core, gave the first detailed public exposition of Bulldozer. We didn't attend his presentation, but we did talk with Dina McKinney, AMD Corporate Vice President of Design Engineering, who led the Bulldozer team, in advance of the conference. We also have a first look at some of the slides from Butler's talk, which reveal quite a bit more detail about Bulldozer than we've seen anywhere else.
The first thing to know about the information being released today is that it's a technology announcement, and only a partial one at that. AMD hasn't yet divulged specifics about Bulldozer-based products, and McKinney refused to answer certain questions about the architecture, too. Instead, the company intends to release snippets of information about Bulldozer in a directed way over time in order to maintain the buzz about the new chip, an approach it likens to "rolling thunder," although I'd say it feels more like a leaky faucet.
The products: New CPUs in 2011
Regardless, we know the broad outlines of expected Bulldozer-based products already. Bulldozer will replace the current server and high-end desktop processors from AMD, including the Opteron 4100 and 6100 series and the Phenom II X6, at some time in 2011. A full calendar year is an awfully big target, especially given how close it is, but AMD isn't hinting about exactly when next year the products might ship. We do know that the chips are being produced by GlobalFoundries on its latest 32-nm fabrication process, with silicon-on-insulator tech and high-k metal gate transistors. McKinney told us the first chips are already back from the fab and up and running inside of AMD, so Bulldozer is well along in its development. Barring any major unforeseen problems, we'd wager the first products based on it could ship well before the end of 2011, which would be somewhat uncommon considering that these product launch time windows frequently get stretched to their final hours.
One advantage that Bulldozer-based products will have when they do ship is the presence of an established infrastructure ready and waiting for them. AMD says Bulldozer-based chips will be compatible with today's Opteron sockets C32 and G34, and we expect compatibility with Socket AM3 on the desktop, as well, although specifics about that are still murky.
AMD has committed to three initial Bulldozer variants. "Valencia" will be an eight-core server part, destined for the C32 socket with dual memory channels. "Interlagos" will be a 16-core server processor aimed at the G34 socket, so we'd expect it to have quad memory channels. In fact, Interlagos will likely be composed of two Valencia chips on a single package, in an arrangement much like the present "Magny-Cours" Opterons. The desktop variant, "Zambezi," will have eight cores, as well. All three will quite likely be based on the same silicon.
The concept: two 'tightly coupled' cores
The specifics of that silicon are what will make Bulldozer distinctive. The key concept for understanding AMD's approach to this architecture is a novel method of sharing resources within a CPU. Butler's talk names a couple of well-known options for supporting multiple threads. Simultaneous multithreading (SMT) employs targeted duplication of some hardware and sharing of other hardware in order to track and execute two threads in a single core. That's the approach Intel uses in its current, Nehalem-derived processors. CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD's current Opterons and Phenoms do. The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.
AMD didn't take that approach, though. Instead, the team chose to integrate two cores together into a fundamental building block it calls a "Bulldozer module." This module, diagrammed above, shares portions of a traditional core—including the instruction fetch, decode, and floating-point units and L2 cache—between two otherwise-complete processor cores. The resources AMD chose to share are not always fully utilized in a single core, so not duplicating them could be a win on multiple fronts. The firm claims a Bulldozer module can achieve 80% of the performance of two complete cores of the same capability. Yet McKinney told us AMD has estimated that including the second integer core adds only 12% to the chip area occupied by a Bulldozer module. If these claims are anywhere close to the truth, Bulldozer should be substantially more efficient in terms of performance per chip area—which translates into efficiency per transistor and per watt, as well.
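To put rough numbers on that efficiency claim, here's a back-of-the-envelope sketch in Python using only the two figures quoted above: 80% of the throughput of two complete cores, and 12% extra area for the second integer core. It leans on a simplifying assumption AMD hasn't confirmed, namely that a module with just one integer core would match a single complete CMP-style core in both area and performance. The variable names and normalized units are ours.

```python
# Normalized units (our assumption): one complete core = 1.0 throughput, 1.0 area.
CORE_THROUGHPUT = 1.0
CORE_AREA = 1.0

# CMP-style baseline: two complete, independent cores.
cmp_throughput = 2 * CORE_THROUGHPUT
cmp_area = 2 * CORE_AREA

# Bulldozer module, per AMD's claims as reported in this article:
#   - 80% of the performance of two complete cores
#   - the second integer core adds 12% to the module's area
# Assumption: the module without the second integer core equals one complete core.
module_throughput = 0.80 * cmp_throughput
module_area = CORE_AREA * 1.12

cmp_density = cmp_throughput / cmp_area            # performance per unit area
module_density = module_throughput / module_area

advantage = module_density / cmp_density

print(f"CMP pair:  {cmp_density:.2f} perf/area")
print(f"Module:    {module_density:.2f} perf/area")
print(f"Advantage: {advantage:.2f}x")
```

Under those assumptions, the module comes out at roughly 1.43 times the performance per unit of area of two discrete cores. Treat that as an illustration of how the two claims combine, not a projection, since AMD hasn't published the underlying area figures.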
One obvious outcome of the Bulldozer module arrangement, with its shared FPU, is an inherent bias toward increasing integer math performance. We've heard several explanations for this choice. McKinney told us the main motivating factor was the presence of more integer math in important workloads, which makes sense. Another explanation we've heard is that, with AMD's emphasis on CPU-GPU fusion, floating-point-intensive problems may be delegated to GPUs or arrays of GPU-like parallel processing engines in the future.
In our talk, McKinney emphasized that a Bulldozer module would provide more predictable performance than an SMT-enabled core, a generally positive trait. That raised an intriguing question about how the OS might schedule threads on a Bulldozer-based processor. For an eight-threaded, quad-core CPU like Nehalem, operating systems generally tend to favor scheduling a single thread on each physical core before adding a second thread on any core. That way, resource sharing within the cores doesn't come into play before necessary, and performance should be optimal. We suggested such an arrangement might also be best for a Bulldozer-based CPU, but McKinney downplayed the need for any special provisions of that nature on this hardware. She also hinted that scheduling two threads on the same module and leaving the other three modules idle, so they could drop into a low-power state, might be the best path to power-efficient performance. We don't yet know what guidance AMD will give operating system developers regarding Bulldozer, but the trade-offs at least shouldn't be too painful.
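The two placement strategies discussed above can be sketched in a few lines of Python. This is a toy model of our own, not anything AMD or an OS vendor has described: the function name `place_threads`, the policy labels "spread" and "pack," and the four-module, two-cores-per-module layout are all assumptions chosen to match an eight-core part like Zambezi.

```python
def place_threads(n_threads, n_modules=4, cores_per_module=2, policy="spread"):
    """Return a list with the number of threads assigned to each module.

    policy="spread": one thread per module before doubling up anywhere
                     (the SMT-style preference described for Nehalem).
    policy="pack":   fill both cores of a module before waking another
                     (the power-saving option McKinney hinted at).
    """
    if policy not in ("spread", "pack"):
        raise ValueError(f"unknown policy: {policy!r}")
    capacity = n_modules * cores_per_module
    if not 0 <= n_threads <= capacity:
        raise ValueError(f"n_threads must be between 0 and {capacity}")

    load = [0] * n_modules
    for i in range(n_threads):
        if policy == "spread":
            module = i % n_modules          # round-robin across modules
        else:
            module = i // cores_per_module  # fill one module, then the next
        load[module] += 1
    return load


def idle_modules(load):
    """Count modules with no threads, i.e. candidates for a low-power state."""
    return sum(1 for n in load if n == 0)


for policy in ("spread", "pack"):
    load = place_threads(2, policy=policy)
    print(policy, load, "idle modules:", idle_modules(load))
```

With two threads, "spread" keeps each thread on its own module and avoids any contention for the shared front end and FPU, while "pack" leaves three modules idle. Which one wins in practice depends on how much the shared resources cost a given workload and how much power an idle module saves, neither of which AMD has quantified yet.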