Inside the module
Because Bulldozer is what it is—an all-new, high-performance x86-compatible processor—it's incredibly complex and difficult to summarize. Nevertheless, we're going to make a quick attempt, with the assistance of the block diagram below, which provides a high-altitude overview of Bulldozer's key components.
The sharing in a Bulldozer module starts with the front end, where the branch prediction, instruction fetch, and decode units track two threads and service both cores. With two integer cores featuring relatively long pipelines to keep fed, the front-end hardware must be very effective at its job in order for the whole chip to function efficiently.
The decode units dispatch ops, or decoded instructions, to the two integer cores on an interleaved, every-other-cycle basis. Each of those cores has a pair of ALUs, and each ALU has an associated address generation unit. Thus, individual Bulldozer cores have fewer execution resources than those in the preceding Deneb/Thuban architecture. However, instruction scheduling is more flexible, and beyond the obvious increase in integer core counts, Bulldozer seeks to make things up in other ways.
One of those ways is a vastly reworked memory subsystem that looks very different from those in prior AMD chips. Among other things, the memory pipeline can speculatively move loads ahead of stores if doing so won't cause a problem, a capability Intel has called memory disambiguation. Memory access latencies should be further reduced by the use of multiple data prefetchers that operate according to different rules in order to keep the caches populated with, hopefully, the appropriate data for the cores' upcoming work. Both the prefetchers and the L2 cache into which they pull data are shared between the two cores in a module, assuming both cores have active threads. If only one thread is active, these resources are fully used by that single thread.
Another major shared resource in the Bulldozer module is the floating-point unit, which has been spun off into a co-processor arrangement in which both integer cores act as clients. This setup is quite different from the intermixed integer and FP execution resources in Sandy Bridge, and AMD has hinted that it may pave the way for a GPU-type shader array to one day take the place of the traditional FPU. For now, though, Bulldozer's FPU is quite formidable in its own right. The scheduler can track two threads, of course, and the execution units include dual FMAC units capable of processing 128-bit vectors in a single clock cycle, along with dual 128-bit integer units (marked as "MMX" in the diagram above). Yes, that means integer SIMD goodness happens in the FPU, as well as floating-point math.
The fact that both Bulldozer and Sandy Bridge, two substantially new x86 microarchitectures, have hit the streets within the same calendar year isn't entirely coincidental. The common thread is the advent of the follow-on to SSE, the extended instruction set known as Advanced Vector Extensions, or AVX. AVX increases parallelism by extending the width of vectors from 128 to 256 bits, and supporting those wider datatypes requires the broad reworking of the processor's execution engine. The result should be much higher peak computational throughput on data-parallel workloads.
However, the path to that destination will have a few twists and turns. After initially proposing its own vector extensions known as SSE5, AMD reversed course and moved to follow Intel by making Bulldozer compatible with AVX instead. As that change was happening, Intel apparently was modifying its own course, as well. The upshot is that Bulldozer catches up with Sandy Bridge on nearly every front, adding support for SSE 4.1 and 4.2 and most of AVX, along with the AES instructions for accelerating encryption. It also includes support for AMD's own XOP extensions, a surviving bit of SSE5 with more of a focus on integer datatypes. Where Bulldozer moves beyond Sandy Bridge, though, is with those two 128-bit FMAC pipes, and there, we get into disputed territory.
The dispute is over the FMAC instruction, which is the key to unlocking AVX's peak potential. FMAC stands for "fused multiply-accumulate," an operation that can be described logically as: "d = a + b * c". Instructions that combine a multiply and an add together tend to map well to multimedia workloads, and they have been a staple of GPU shader cores for quite some time. Doing both operations at once has a performance benefit, obviously: the processor is executing two floating-point operations (FLOPs) per clock cycle. The FMAC form of this instruction has a further precision advantage because the result of the multiply is fed directly into the add, at the chip's full internal precision, without being rounded and stored in between. These virtues have made FMAC very popular in other chips, including DirectX 11-class GPUs.
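To make the precision point concrete, here is a minimal Python sketch rather than anything resembling real hardware. It emulates the single rounding of a fused operation by doing the arithmetic in exact rationals and rounding once at the end, then compares that with an ordinary multiply followed by an add. The function names and the input values are our own choices for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a + b * c with one rounding at the very end.

    The exact result is formed with rational arithmetic, then converted
    to a double once. This emulates the numerical behavior of a fused
    operation; it says nothing about how the hardware is built.
    """
    exact = Fraction(a) + Fraction(b) * Fraction(c)
    return float(exact)


def separate_multiply_add(a, b, c):
    """Compute a + b * c as two operations, each rounded to a double."""
    product = b * c      # rounded once here
    return a + product   # and rounded again here


# Inputs picked so the product, 1 - 2**-60, cannot be held in a double
# and rounds to exactly 1.0 when the multiply is done on its own.
a = -1.0
b = 1.0 + 2.0 ** -30
c = 1.0 - 2.0 ** -30

unfused_result = separate_multiply_add(a, b, c)
fused_result = fused_multiply_add(a, b, c)

print("separate multiply and add:", unfused_result)
print("fused multiply-add:       ", fused_result)
```

With these inputs, the two-step version loses the tiny residue entirely and returns zero, while the fused version keeps it. Real workloads rarely hit a case this extreme, but the same effect accumulates quietly over long chains of multiply-adds.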
Bulldozer is the first x86 CPU to support FMAC. Sandy Bridge doesn't, and the upcoming Ivy Bridge won't, either. Instead, Intel intends to add FMAC support to Haswell, its next architectural refresh, due in 2013. Trouble is, Bulldozer supports a version of FMAC with four operands, while Haswell will support a three-operand variant of FMAC. This sort of incompatibility isn't a good thing when you're trying to persuade software developers to use your new instructions. AMD seems to recognize that fact, so it plans to add FMAC3 support in the next version of Bulldozer, code-named Piledriver, alongside FMAC4. The FMAC4-only chip we're looking at today, though, will always be something of an oddity, as a result.
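The practical difference between the two forms comes down to where the result lands. The toy Python model below is only a sketch: the register names and the `fma4` and `fma3` helper functions are hypothetical, and it ignores rounding entirely. It shows the four-operand form writing to a separate destination, while the three-operand form overwrites one of its own inputs.

```python
def fma4(regs, dst, a, b, c):
    """Four-operand form: dst = a * b + c, and no source is overwritten."""
    regs[dst] = regs[a] * regs[b] + regs[c]


def fma3(regs, dst_and_addend, a, b):
    """Three-operand form: the destination doubles as the addend,
    so its old value is gone once the operation completes."""
    regs[dst_and_addend] = regs[a] * regs[b] + regs[dst_and_addend]


# Four operands: r3 still holds 4.0 afterward, and the result is in r0.
regs4 = {"r0": 0.0, "r1": 2.0, "r2": 3.0, "r3": 4.0}
fma4(regs4, "r0", "r1", "r2", "r3")

# Three operands: r3 is both an input and the output.
regs3 = {"r1": 2.0, "r2": 3.0, "r3": 4.0}
fma3(regs3, "r3", "r1", "r2")

print(regs4)
print(regs3)
```

If code still needs the old addend after a three-operand FMAC, it has to copy that value somewhere first, which is why the two forms need different encodings and why software built for one doesn't simply run on the other.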
All of this madness still leaves Bulldozer in decent shape, FPU-wise, but not quite indisputably at the head of the pack. Even without FMAC support, Sandy Bridge still has two 256-bit vector units, so it can produce a 256-bit add and a 256-bit multiply in a single clock cycle. Bulldozer can theoretically match Sandy's peak throughput, either by processing dual 128-bit FMACs or a single 256-bit FMAC per cycle, but it can't match Sandy without FMAC.
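The peak-rate comparison is simple enough to work through as arithmetic. This back-of-the-envelope Python sketch assumes 32-bit single-precision lanes, which is our assumption rather than anything stated above, and the function name and parameters are hypothetical. It counts theoretical peak operations per cycle only, for one Sandy Bridge core versus one Bulldozer module's shared FPU.

```python
def peak_flops_per_cycle(units, vector_bits, flops_per_lane, lane_bits=32):
    """Theoretical peak floating-point operations per clock cycle.

    units          -- number of vector execution units issuing each cycle
    vector_bits    -- width of each unit's vector in bits
    flops_per_lane -- 2 for a fused multiply-add, 1 for a lone add or multiply
    lane_bits      -- element width; 32 assumes single precision
    """
    lanes = vector_bits // lane_bits
    return units * lanes * flops_per_lane


# Sandy Bridge: one 256-bit add plus one 256-bit multiply per cycle.
sandy_bridge = peak_flops_per_cycle(units=2, vector_bits=256, flops_per_lane=1)

# Bulldozer: two 128-bit FMAC units, each doing a multiply and an add.
bulldozer_fmac = peak_flops_per_cycle(units=2, vector_bits=128, flops_per_lane=2)

# Bulldozer running code that doesn't use FMAC: one operation per lane.
bulldozer_no_fmac = peak_flops_per_cycle(units=2, vector_bits=128, flops_per_lane=1)

print("Sandy Bridge core:         ", sandy_bridge)
print("Bulldozer module, FMAC:    ", bulldozer_fmac)
print("Bulldozer module, no FMAC: ", bulldozer_no_fmac)
```

By this count, the two chips tie when Bulldozer's code uses FMAC, and Bulldozer lands at half of Sandy Bridge's rate when it doesn't. Keep in mind that the Bulldozer figure covers an FPU shared by two integer cores, and that peak numbers say little about sustained performance.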
For a discussion of the Bulldozer microarchitecture in much more depth, let me point you to David Kanter's excellent piece on the subject, from which I've stolen small bits of info here and there.