Power management and Turbo Core
Now that we've spent entirely too much time on the FPU, let's move on to power management, another topic too large to cover in the time and space we have today. Power efficiency has become critically important in modern processors, and any clean-sheet architecture like this one will include a zillion little pockets of logic conceived with power efficiency in mind.
The headliner here, though, is the use of power gates for each of the modules and a fifth power gate for the north bridge and L3 cache. Closing one of these gates shuts off power to the portion of the chip behind it, even leakage power. Intel has used power gates to good effect since Nehalem. AMD first used power gates in the Llano APU, where they are quite effective, but Bulldozer is its first high-end CPU to employ them.
Another feature makes a surprise return: separate clock domains for each of the modules, along with one for the north bridge. (The north bridge and L3 cache run at 2.2GHz in desktop parts and 2-2.2GHz in Bulldozer-derived Opterons.) AMD first instituted separate clock domains per core in Barcelona, the original Phenom chip, but back-tracked in the Phenom II generation and used BIOS code to lock all four cores to a single clock, making the Phenom II operate much like Intel's recent CPUs do. Turns out threads pinging around from one core to the next in the Windows scheduler sometimes led to performance issues, because threads would be reassigned to cores operating at low frequencies. AMD tells us it has returned to separate clock domains for a simple reason: "because power is important." Our sense is that Bulldozer should be better equipped to avoid problems on this front. The chip has a higher floor for clock speed (1.4GHz versus 800MHz in the Phenom II), improved latency for clock-speed ramps, and the ability to probe the caches of other modules more quickly. AMD also seems to be banking on smart scheduling in future versions of Windows to accommodate the Bulldozer architecture, a subject we'll discuss shortly.
First, though, we should talk about Bulldozer's version of AMD's Turbo Core dynamic clock scaling feature, which raises clock speed on all or part of the chip when there's thermal headroom available to do so. As in other recent AMD CPUs, Turbo Core uses power estimates based on the chip's internal activity monitoring to determine the extent of that thermal headroom. Bulldozer's Turbo Core implementation is the most granular one yet, with three P-states possible. P2 is the base clock of the chip, the speed at which it's guaranteed to run. P1 is an intermediate Turbo clock speed that can apply to all four modules, provided that they're not too heavily loaded. The third state, P0, is an even higher Turbo clock that comes into use when only two modules are active. As before, Turbo Core seeks to run at the highest possible clock speed for the given conditions, and it dithers between the P-states in order to stay within the chip's prescribed thermal envelope, or TDP.
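To make the three-state arrangement concrete, here is a minimal Python sketch of how a P-state could be picked from an activity-based power estimate. It is a toy model, not AMD's algorithm: the clock speeds, the 125W TDP, the power coefficients, and the names `PState`, `estimate_power_w`, and `pick_p_state` are all assumptions made for illustration, and the dithering between states over time is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PState:
    name: str
    clock_mhz: int           # illustrative placeholder, not from the article
    max_active_modules: int  # most modules that may be active in this state


# Fastest first. P0 needs two or fewer active modules; P1 and P2 allow all four.
P_STATES = [
    PState("P0", 4200, 2),
    PState("P1", 3900, 4),
    PState("P2", 3600, 4),  # base clock, the guaranteed speed
]


def estimate_power_w(active_modules: int, clock_mhz: int, activity: float) -> float:
    """Toy estimate: fixed uncore cost plus a per-module cost that scales
    with clock speed and activity (0.0 to 1.0). Coefficients are invented."""
    uncore_w = 20.0
    w_per_module_per_ghz = 7.0
    return uncore_w + active_modules * activity * w_per_module_per_ghz * clock_mhz / 1000


def pick_p_state(active_modules: int, activity: float, tdp_w: float = 125.0) -> PState:
    """Return the fastest P-state that is allowed for this module count
    and whose estimated power fits within the TDP."""
    for state in P_STATES:
        if active_modules > state.max_active_modules:
            continue
        if estimate_power_w(active_modules, state.clock_mhz, activity) <= tdp_w:
            return state
    return P_STATES[-1]  # fall back to the base clock


if __name__ == "__main__":
    for modules, activity in [(2, 1.0), (4, 0.8), (4, 1.0)]:
        state = pick_p_state(modules, activity)
        print(f"{modules} modules at activity {activity}: {state.name}")
```

With these made-up numbers, two busy modules land in P0, four moderately loaded modules land in P1, and four fully loaded modules fall back to P2, which mirrors the behavior described above.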
Now that we have Turbo Core in the picture, we have the context to talk more about thread scheduling. Bulldozer's unique architecture creates some intriguing questions about how software threads should be distributed across its cores. There are obvious advantages to scheduling one thread per module before doubling up threads on a single module: shared resources like the front end, L2 cache, and FPU will be dedicated to a lone thread, improving performance. However, scheduling two threads per module gets you several nice things, too, including the possibility of data sharing between related threads via the L2 cache. Power efficiency should improve if more inactive modules can be turned off, and Turbo Core can convert those power savings back into performance by raising the clock speed of the active modules.
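The two placement strategies can be sketched in a few lines of Python. This is purely illustrative and assumes four modules with two cores each; the policy names "spread" and "pack" and the functions `place_threads` and `idle_modules` are made up for this example and do not correspond to any real scheduler API.

```python
def place_threads(n_threads: int, policy: str,
                  modules: int = 4, cores_per_module: int = 2) -> list[int]:
    """Return the number of threads assigned to each module.

    "spread" gives each module one thread before doubling up;
    "pack" fills both cores of a module before touching the next one.
    """
    if not 0 <= n_threads <= modules * cores_per_module:
        raise ValueError("thread count exceeds available cores")
    counts = [0] * modules
    for t in range(n_threads):
        if policy == "spread":
            counts[t % modules] += 1
        elif policy == "pack":
            counts[t // cores_per_module] += 1
        else:
            raise ValueError(f"unknown policy: {policy}")
    return counts


def idle_modules(counts: list[int]) -> int:
    """Modules with no threads, i.e. candidates for power gating."""
    return sum(1 for c in counts if c == 0)


if __name__ == "__main__":
    for policy in ("spread", "pack"):
        counts = place_threads(4, policy)
        print(policy, counts, "idle modules:", idle_modules(counts))
```

For four threads, spreading leaves every module with its shared resources dedicated to one thread but no module idle, while packing leaves two modules idle, which is the case where power gating and the higher Turbo state could come into play.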
Unfortunately, the Windows 7 scheduler wasn't built with Bulldozer's distinctive sharing arrangement in mind, and as far as we can tell, the BIOS doesn't provide any hints to that OS about how to schedule threads. Win7 simply sees eight equal cores, with no preference between them. AMD claims Windows 8 will be better optimized for the Bulldozer architecture and cites improvements of 2-10% in several recent games with the Windows 8 developer preview. We haven't been able to squeeze too many details out of AMD about how complex Win8's understanding of Bulldozer scheduling will be, but we get the sense that the OS may attempt to schedule related threads on the same module when possible. We need to play with the Win8 developer preview on a Bulldozer system in order to learn more.
The chip: Orochi
Like mythical heroes in fantasy novels, modern CPUs are known by many names. We've been talking about Bulldozer almost exclusively up to this point, but that code name actually applies to the CPU cores and the microarchitecture inside of them—or something like that. These names are powerful symbols and are often multi-valent. (Yikes, religion major mode OFF. Sorry.) The proper code name for the silicon die that implements the Bulldozer architecture is "Orochi," and Orochi will be deployed in multiple ways, each with its own name. On the desktop, it's called "Zambezi." In 1-2P servers, a single Orochi die will be called "Valencia," and in 1-4P servers, two dies placed together in a package will be called "Interlagos." I liked it better when a single name, like K7, could refer to the whole caboodle, before the marketing guys got into the code-name business, but I suppose that horse left the barn long ago.
Whatever you call it, this chip is AMD's second attempt at a CPU fabricated on GlobalFoundries' 32-nm process, with high-k metal gates and a silicon-on-insulator substrate. The unnecessarily overpopulated table below shows how Orochi compares to a range of other desktop processors from Intel and AMD.
|Code name||Key products||Cores||Threads||Last-level cache size||Process node (nm)||Transistors (millions)||Die area (mm²)|
|Bloomfield||Core i7||4||8||8 MB||45||731||263|
|Lynnfield||Core i5, i7||4||8||8 MB||45||774||296|
|Westmere||Core i3, i5||2||4||4 MB||32||383||81|
|Gulftown||Core i7-980X||6||12||12 MB||32||1168||248|
|Sandy Bridge||Core i5, i7||4||8||8 MB||32||995||216|
|Sandy Bridge||Core i3, i5||2||4||4 MB||32||624||149|
|Sandy Bridge||Pentium||2||4||3 MB||32||-||131|
|Deneb||Phenom II||4||4||6 MB||45||758||258|
|Propus/Rana||Athlon II X4/X3||4||4||512 KB x 4||45||300||169|
|Regor||Athlon II X2||2||2||1 MB x 2||45||234||118|
|Thuban||Phenom II X6||6||6||6 MB||45||904||346|
|Llano||A8, A6, A4||4||4||1 MB x 4||32||1450||228|
|Llano||A4||2||2||1 MB x 2||32||758||-|
With roughly 1.2 billion transistors and a die area of 315 mm², Orochi is a very big and complex chip. Sandy Bridge, which has four cores and integrated graphics, is about 100 mm² smaller. Still, Orochi isn't quite as large as the chip it succeeds, the "Thuban" Phenom II X6, so that's progress of a sort.
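For a rough sense of scale, the quick Python calculation below works out transistor density from the figures quoted in this section. Treat it as back-of-the-envelope arithmetic: the Orochi count is the rounded 1.2 billion figure, Thuban is built on a 45-nm process rather than 32-nm, and the helper name `density_m_per_mm2` is just a label chosen for this example.

```python
# (transistors in millions, die area in mm²), taken from the text and table
chips = {
    "Orochi": (1200, 315),               # "roughly 1.2 billion" transistors
    "Sandy Bridge (4 cores)": (995, 216),
    "Thuban": (904, 346),
}


def density_m_per_mm2(transistors_m: float, area_mm2: float) -> float:
    """Millions of transistors per square millimeter."""
    return transistors_m / area_mm2


if __name__ == "__main__":
    for name, (transistors_m, area_mm2) in chips.items():
        d = density_m_per_mm2(transistors_m, area_mm2)
        print(f"{name}: {d:.2f} M transistors per mm²")
    gap = chips["Orochi"][1] - chips["Sandy Bridge (4 cores)"][1]
    print(f"Orochi is {gap} mm² larger than quad-core Sandy Bridge")
```

By these numbers, Orochi works out to about 3.8 million transistors per mm², versus about 4.6 for quad-core Sandy Bridge and about 2.6 for Thuban, and the area gap to Sandy Bridge is 99 mm².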