The sharing arrangement may be the most noteworthy aspect of the Bulldozer architecture, but the cores themselves are substantially changed from prior AMD processors, too.
The module's front end includes a prediction pipeline, which predicts what instructions will be used next. A separate fetch pipeline then populates the two instruction queues, one for each thread, with those instructions. The decoders convert complex x86 instructions into the CPU's simpler internal instructions. Bulldozer has four decoders, like Nehalem, while Barcelona has three.
Each module has a trio of schedulers, one for each integer core and one for the FPU. And the integer cores themselves have two execution units and two address generation units each. Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each. As a result, we thought perhaps AMD might layer SMT on top of a Bulldozer module at some point in the future. Knowing what we do now, that outcome seems much less likely. Bulldozer doesn't look to have any "extra" execution hardware waiting to be exploited in those integer cores.
Although each module has only a single floating-point unit, that FPU should be substantially more capable than past AMD FPUs. You can see the dual integer MMX and 128-bit FMAC units in the diagram above. In a sort of quasi-SMT arrangement, the FPU can track two hardware threads, one for each "parent" core on the module.
The FPU supports nearly all the alphabet-soup extensions to the x86 ISA, up to and including SSSE3, SSE 4.1, 4.2, and Intel's new Advanced Vector Extensions (AVX). AVX allows for higher-throughput processing of graphics, media, and other parallelizable, floating-point-intensive workloads by doubling the width of SIMD vectors from 128 to 256 bits. Bulldozer's 128-bit FMAC units will work together on 256-bit vectors, effectively producing a single 256-bit vector operation per cycle. Intel's Sandy Bridge, due early in 2011, will have two 256-bit vector units capable of producing a 256-bit multiply and a 256-bit add in a single cycle, double Bulldozer's AVX peak.
Bulldozer's FPU has an advantage in another area, though, as the presence of two 128-bit FMAC units indicates. FMAC is short for "fused multiply-accumulate," an operation that's sometimes known as FMA, for "fused multiply-add," instead. Whatever you call it, a single operation that joins multiplication with addition is new territory for x86 processors, and it has two main benefits.
The first, pretty straightforwardly, is higher performance. The need to multiply two numbers and then add the product to a third turns out to be very common in graphics and media workloads, and fusing the two steps means the processor can achieve up to twice the peak throughput for those operations. We've seen multiply-add instructions in GPUs for ages, which is why each ALU in a GPU shader can produce two ops per clock at peak. With dual 128-bit FMACs, Bulldozer's peak FLOPS throughput should be comparable to Sandy Bridge's peak with AVX and 256-bit vectors.
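To make the peak numbers concrete, here is a back-of-the-envelope sketch in Python. It assumes 32-bit single-precision elements and compares one Bulldozer module against one Sandy Bridge core. The function and variable names are ours, not anything from AMD or Intel, and the figures are theoretical peaks per clock rather than measured results.

```python
# Rough peak-throughput arithmetic. Assumption: 32-bit single-precision
# elements; all names here are illustrative only.
FLOAT_BITS = 32


def lanes(vector_bits, element_bits=FLOAT_BITS):
    """Number of elements that fit in one SIMD vector."""
    return vector_bits // element_bits


def peak_flops_per_cycle(units, vector_bits, flops_per_lane):
    """Theoretical peak: units x lanes per unit x operations per lane."""
    return units * lanes(vector_bits) * flops_per_lane


# Bulldozer running AVX: the two 128-bit FMACs pair up to act as a
# single 256-bit unit producing one vector operation per cycle.
bulldozer_avx = peak_flops_per_cycle(units=1, vector_bits=256, flops_per_lane=1)

# Sandy Bridge running AVX: a 256-bit multiply and a 256-bit add per cycle.
sandy_bridge_avx = peak_flops_per_cycle(units=2, vector_bits=256, flops_per_lane=1)

# Bulldozer running FMA: two 128-bit FMACs, and each fused operation
# counts as a multiply plus an add on every lane.
bulldozer_fma = peak_flops_per_cycle(units=2, vector_bits=128, flops_per_lane=2)

print(bulldozer_avx, sandy_bridge_avx, bulldozer_fma)
```

By this count, Sandy Bridge has double the AVX peak when Bulldozer is issuing separate multiplies and adds, and the two pull even once Bulldozer's FMACs are handling fused operations.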
Second, because an FMA operation feeds the result of the multiply directly into the adder without rounding, the mathematical precision of the result is higher. For this reason, the DirectX 11 generation of GPUs adopted FMA as their new standard, as well.
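The precision benefit is easy to demonstrate in software. The Python sketch below is an emulation, not Bulldozer's hardware path: it uses exact rational arithmetic to stand in for a fused operation that rounds only once, and compares that against an ordinary multiply followed by an add, which rounds twice. The helper names and the input values are our own choices, picked so the product falls exactly between two representable double-precision numbers.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Emulate a fused a*b + c: compute exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


def separate_multiply_add(a, b, c):
    """Ordinary arithmetic: the product is rounded before the add."""
    return a * b + c


# Hand-picked inputs: the exact product is 1 - 2**-54, which does not
# fit in a double and gets rounded to 1.0 by the standalone multiply.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = separate_multiply_add(a, b, c)  # the small term is rounded away
fused = fused_multiply_add(a, b, c)       # the small term survives

print(unfused, fused)
```

Here the separate multiply and add return zero, while the single-rounding version keeps the small nonzero result of -2^-54. Cases like this are deliberately contrived, but the same effect accumulates across long chains of multiply-adds.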
Crucially, Intel's Sandy Bridge will not support an FMA operation. Instead, FMA support is slated for Haswell, the architectural refresh coming a full "tick-tock" generation beyond Sandy Bridge, likely in 2013. Earlier this year, Intel architect Ronak Singhal told us the choice to leave FMA out of Sandy Bridge was driven by the fact that it's "not a small piece of logic" since it requires more sources, or operands, than usual. Intel chose to double the vector width first with AVX and push FMA down the road.
Thus, Bulldozer will be the first x86 processor with FMA capability. That distinction won't come without controversy, though. Bulldozer supports an AMD-sanctioned four-operand form of FMA operation, whereas Haswell will use a three-operand version. Both instructions will require compiler support and freshly compiled binaries, so we may see yet another fracture in the x86 ISA until Intel and AMD can settle on a single, preferred solution.
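The practical difference between the two forms comes down to where the result lands. The toy Python model below is a simplification under our own assumptions: the register names and helper functions are hypothetical, and the actual instruction encodings come in several variants that this sketch ignores. In the four-operand form, the destination is separate from the three sources. In the three-operand form, the destination doubles as one of the sources and is overwritten.

```python
def fma4(regs, dest, src1, src2, src3):
    """Four-operand form: the result goes to a separate destination."""
    regs[dest] = regs[src1] * regs[src2] + regs[src3]


def fma3(regs, dest_src1, src2, src3):
    """Three-operand form: the first source is also the destination."""
    regs[dest_src1] = regs[dest_src1] * regs[src2] + regs[src3]


# Hypothetical register file, used for illustration only.
four_op = {"r0": 0.0, "r1": 2.0, "r2": 3.0, "r3": 4.0}
fma4(four_op, "r0", "r1", "r2", "r3")
# r0 now holds 2*3 + 4, and all three inputs are still intact.

three_op = {"r1": 2.0, "r2": 3.0, "r3": 4.0}
fma3(three_op, "r1", "r2", "r3")
# r1 has been overwritten with the result; its old value is gone.

print(four_op, three_op)
```

Code that still needs the overwritten input after a three-operand FMA has to copy it somewhere first, which is the sort of trade-off a compiler has to handle differently for each form.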
When Intel integrated a memory controller into Nehalem and basically aped AMD's blueprint for a system architecture, it reaped benefits in terms of computing throughput and bandwidth that AMD's current solutions haven't been able to match. There are many reasons why, but one of the big ones comes down to the effectiveness of Intel's data pre-fetch mechanisms, which pull likely-to-be-needed data into the processor's caches ahead of time, so it's ready and waiting when needed.
Bulldozer is getting an overhaul in this area, with multiple data prefetchers that operate according to different algorithms in order to predict more accurately what data may be required soon. If they work well, these prefetchers should allow Bulldozer to make more effective use of the tremendous bandwidth available in AMD's latest DDR3-fortified platforms.