AMD’s Bulldozer architecture revealed

Next year, AMD plans to ship products based on a new processor architecture code-named Bulldozer, and in the world of big, x86-compatible CPUs, that’s huge news. In this arena, the question of how truly “new” a chip architecture is can be vexingly complicated, because technologies, ideas, and logic are often carried over from one generation to the next.  But it’s probably safe to say Bulldozer is AMD’s first all-new, bread-and-butter CPU architecture since the introduction of the K7 way back in 1999.  The firm has made notable incremental changes along the way—K8 brought a new system architecture, Barcelona integrated four cores together—but the underlying microarchitecture hasn’t changed too much.  Bulldozer is something very different, a new microarchitecture incorporating some novel concepts we’ve not seen anywhere else.

Today, at the annual Hot Chips conference, Mike Butler, AMD Fellow and Chief Architect of the Bulldozer core, gave the first detailed public exposition of Bulldozer.  We didn’t attend his presentation, but we did talk with Dina McKinney, AMD Corporate Vice President of Design Engineering, who led the Bulldozer team, in advance of the conference. We also have a first look at some of the slides from Butler’s talk, which reveal quite a bit more detail about Bulldozer than we’ve seen anywhere else.

The first thing to know about the information being released today is that it’s a technology announcement, and only a partial one at that.  AMD hasn’t yet divulged specifics about Bulldozer-based products, and McKinney refused to answer certain questions about the architecture, too.  Instead, the company intends to release snippets of information about Bulldozer in a directed way over time in order to maintain the buzz about the new chip—an approach it likens to “rolling thunder,” although I’d say it feels more like a leaky faucet.

The products: New CPUs in 2011

Regardless, we know the broad outlines of expected Bulldozer-based products already.  Bulldozer will replace the current server and high-end desktop processors from AMD, including the Opteron 4100 and 6100 series and the Phenom II X6, sometime in 2011. A full calendar year is an awfully big target, especially given how close it is, but AMD isn’t hinting about exactly when next year the products might ship.  We do know that the chips are being produced by GlobalFoundries on its latest 32-nm fabrication process, with silicon-on-insulator tech and high-k metal gate transistors. McKinney told us the first chips are already back from the fab and up and running inside of AMD, so Bulldozer is well along in its development.  Barring any major unforeseen problems, we’d wager the first products based on it could ship well before the end of 2011, which would be somewhat uncommon, considering that launch windows like this one frequently get stretched to their final hours.

One advantage that Bulldozer-based products will have when they do ship is the presence of an established infrastructure ready and waiting for them.  AMD says Bulldozer-based chips will be compatible with today’s Opteron sockets C32 and G34, and we expect compatibility with Socket AM3 on the desktop, as well, although specifics about that are still murky.

AMD has committed to three initial Bulldozer variants. “Valencia” will be an eight-core server part, destined for the C32 socket with dual memory channels.  “Interlagos” will be a 16-core server processor aimed at the G34 socket, so we’d expect it to have quad memory channels. In fact, Interlagos will likely be composed of two Valencia chips on a single package, in an arrangement much like the present “Magny-Cours” Opterons.  The desktop variant, “Zambezi”, will have eight cores, as well.  All three will quite likely be based on the same silicon.

The concept: two ‘tightly coupled’ cores

The specifics of that silicon are what will make Bulldozer distinctive.  The key concept for understanding AMD’s approach to this architecture is a novel method of sharing resources within a CPU.  Butler’s talk names a couple of well-known options for supporting multiple threads. Simultaneous multithreading (SMT) employs targeted duplication of some hardware and sharing of other hardware in order to track and execute two threads in a single core.  That’s the approach Intel uses in its current, Nehalem-derived processors.  CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD’s current Opterons and Phenoms do.  The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.

AMD didn’t take that approach, though.  Instead, the team chose to integrate two cores together into a fundamental building block it calls a “Bulldozer module.”  This module, diagrammed above, shares portions of a traditional core—including the instruction fetch, decode, and floating-point units and L2 cache—between two otherwise-complete processor cores.  The resources AMD chose to share are not always fully utilized in a single core, so not duplicating them could be a win on multiple fronts.  The firm claims a Bulldozer module can achieve 80% of the performance of two complete cores of the same capability.  Yet McKinney told us AMD has estimated that including the second integer core adds only 12% to the chip area occupied by a Bulldozer module.  If these claims are anywhere close to the truth, Bulldozer should be substantially more efficient in terms of performance per chip area—which translates into efficiency per transistor and per watt, as well.
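
As a sanity check, AMD’s two headline claims can be combined into a rough performance-per-area figure. This is a back-of-the-envelope sketch using only the numbers above, with a single full core normalized to 1.0 performance and 1.0 area:

```python
# Back-of-the-envelope check of AMD's claims, using the figures quoted above.
cmp_perf = 2.0                 # two complete cores: 2x the throughput of one
cmp_area = 2.0                 # two complete cores: 2x the area of one

module_perf = 0.80 * cmp_perf  # claimed ~80% of dual-core performance
module_area = 1.12             # claimed +12% area for the second integer core

print(cmp_perf / cmp_area)        # perf per unit area for the CMP pair: 1.0
print(module_perf / module_area)  # for a Bulldozer module: ~1.43
```

If both claims hold, a module delivers roughly 40% more throughput per unit of die area than a pair of discrete cores would, which is where the per-transistor and per-watt efficiency argument comes from.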

One obvious outcome of the Bulldozer module arrangement, with its shared FPU, is an inherent bias toward increasing integer math performance.  We’ve heard several explanations for this choice.  McKinney told us the main motivating factor was the presence of more integer math in important workloads, which makes sense.  Another explanation we’ve heard is that, with AMD’s emphasis on CPU-GPU fusion, floating-point-intensive problems may be delegated to GPUs or arrays of GPU-like parallel processing engines in the future.

In our talk, McKinney emphasized that a Bulldozer module would provide more predictable performance than an SMT-enabled core—a generally positive trait.  That raised an intriguing question about how the OS might schedule threads on a Bulldozer-based processor.  For an eight-threaded, quad-core CPU like Nehalem, operating systems generally favor scheduling a single thread on each physical core before adding a second thread on any core.  That way, resource sharing within the cores doesn’t come into play until necessary, and performance should be optimal.  We suggested such an arrangement might also be best for a Bulldozer-based CPU, but McKinney downplayed the need for any special provisions of that nature on this hardware.  She also hinted that scheduling two threads on the same module and leaving the other three modules idle, so they could drop into a low-power state, might be the best path to power-efficient performance.  We don’t yet know what guidance AMD will give operating system developers regarding Bulldozer, but the trade-offs at least shouldn’t be too painful.
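
To make the trade-off concrete, here is a toy sketch of the two scheduling policies in question for a hypothetical four-module, eight-core part. The policy names, and the assumption that sibling cores in a module appear as adjacent logical CPUs, are ours for illustration, not AMD’s:

```python
# Two thread-placement policies for a hypothetical 4-module, 8-core chip,
# assuming sibling cores in a module are adjacent logical CPUs (0-1, 2-3, ...).

def assign(threads, modules=4, policy="spread"):
    """Return a logical-CPU assignment for `threads` runnable threads."""
    if policy == "spread":
        # One thread per module first, so shared module resources stay unshared.
        order = [m * 2 + s for s in (0, 1) for m in range(modules)]
    else:  # "pack"
        # Fill both cores of a module before waking the next one, so that
        # entirely idle modules can drop into a low-power state.
        order = list(range(modules * 2))
    return order[:threads]

print(assign(4, policy="spread"))  # [0, 2, 4, 6] -> one thread per module
print(assign(4, policy="pack"))    # [0, 1, 2, 3] -> two modules busy, two idle
```

The “spread” policy maximizes per-thread performance; the “pack” policy is the power-saving arrangement McKinney hinted at.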

More microarchitecture

The sharing arrangement may be the most noteworthy aspect of the Bulldozer architecture, but the cores themselves are substantially changed from prior AMD processors, too.

The module’s front end includes a prediction pipeline, which predicts what instructions will be used next.  A separate fetch pipeline then populates the two instruction queues—one for each thread—with those instructions.  The decoders convert complex x86 instructions into the CPU’s simpler internal instructions.  Bulldozer has four of these, like Nehalem, while Barcelona has three.

Each module has a trio of schedulers, one for each integer core and one for the FPU.  And the integer cores themselves have two execution units and two address generation units each.  Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each.  As a result, we thought perhaps AMD might layer SMT on top of a Bulldozer module at some point in the future. Knowing what we do now, that outcome seems much less likely.  Bulldozer doesn’t look to have any “extra” execution hardware waiting to be exploited in those integer cores.

Although each module has only a single floating-point unit, that FPU should be substantially more capable than past AMD FPUs.  You can see the dual integer MMX and 128-bit FMAC units in the diagram above.  In a sort of quasi-SMT arrangement, the FPU can track two hardware threads, one for each “parent” core on the module.

The FPU supports nearly all the alphabet-soup extensions to the x86 ISA, up to and including SSSE3, SSE 4.1, 4.2, and Intel’s new Advanced Vector Extensions (AVX).  AVX allows for higher-throughput processing of graphics, media, and other parallelizable, floating-point-intensive workloads by doubling the width of SIMD vectors from 128 to 256 bits.  Bulldozer’s 128-bit FMAC units will work together on 256-bit vectors, effectively producing a single 256-bit vector operation per cycle.  Intel’s Sandy Bridge, due early in 2011, will have two 256-bit vector units capable of producing a 256-bit multiply and a 256-bit add in a single cycle, double Bulldozer’s AVX peak.

Bulldozer’s FPU has an advantage in another area, though, as the presence of two 128-bit FMAC units indicates.  FMAC is short for “fused multiply-accumulate,” an operation that’s sometimes known as FMA, for “fused multiply-add,” instead.  Whatever you call it, a single operation that joins multiplication with addition is new territory for x86 processors, and it has two main benefits.

The first, pretty straightforwardly, is higher performance.  The need to multiply two numbers and then add the result turns out to be very common in graphics and media workloads, and fusing them means the processor can achieve twice the throughput for those operations.  We’ve seen multiply-add instructions in GPUs for ages, which is why each ALU in a GPU shader can produce two ops per clock at peak.  With dual 128-bit FMACs, Bulldozer’s peak FLOPS throughput should be comparable to Sandy Bridge’s peak with AVX and 256-bit vectors.
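
The peak-throughput comparison works out as follows. This is our arithmetic from the unit counts above, assuming single-precision (32-bit) lanes and counting a fused multiply-add as two FLOPs:

```python
# Peak single-precision FLOPs per clock, from the unit counts in the article.
lanes_128 = 128 // 32   # 4 single-precision lanes per 128-bit unit
lanes_256 = 256 // 32   # 8 lanes per 256-bit unit

# Bulldozer module: two 128-bit FMAC units; each FMA counts as 2 FLOPs.
bulldozer = 2 * lanes_128 * 2

# Sandy Bridge core: one 256-bit multiply plus one 256-bit add per cycle.
sandy_bridge = lanes_256 + lanes_256

print(bulldozer, sandy_bridge)  # 16 16 -> comparable peak throughput
```

The same equality holds at double precision (8 FLOPs per clock each), which is why FMA lets Bulldozer match Sandy Bridge’s peak despite having half the vector width per unit.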

Second, because an FMA operation feeds the result of the multiply directly into the adder without rounding, the mathematical precision of the result is higher.  For this reason, the DirectX 11 generation of GPUs adopted FMA as their new standard, as well.
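
The precision benefit is easy to demonstrate. The sketch below emulates a fused operation in software with exact rational arithmetic; hardware FMA does this in a single instruction, and the helper here is purely illustrative:

```python
from fractions import Fraction

def fma_exact(a, b, c):
    """Emulate fused multiply-add: compute a*b + c exactly, round only once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0 ** -27       # a*a is exactly 1 + 2^-26 + 2^-54
c = -(1.0 + 2.0 ** -26)

unfused = a * a + c        # the product rounds first; the 2^-54 term is lost
fused = fma_exact(a, a, c) # the product feeds the add unrounded

print(unfused)  # 0.0
print(fused)    # 5.551115123125783e-17, i.e. 2^-54, the true answer
```

With separate multiply and add, the tiny residual term rounds away entirely; the fused form recovers it exactly.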

Crucially, Intel’s Sandy Bridge will not support an FMA operation. Instead, FMA support is slated for Haswell, the architectural refresh coming a full “tick-tock” generation beyond Sandy Bridge, likely in 2013.  Earlier this year, Intel architect Ronak Singhal told us the choice to leave FMA out of Sandy Bridge was driven by the fact that it’s “not a small piece of logic” since it requires more sources, or operands, than usual.  Intel chose to double the vector width first with AVX and push FMA down the road.

Thus, Bulldozer will be the first x86 processor with FMA capability. That distinction won’t come without controversy, though.  Bulldozer supports an AMD-sanctioned four-operand form of FMA operation, whereas Haswell will use a three-operand version.  Both instructions will require compiler support and freshly compiled binaries, so we may see yet another fracture in the x86 ISA until Intel and AMD can settle on a single, preferred solution.
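
The difference between the two encodings is easy to picture. In the sketch below, a simplified model of our own rather than actual assembly, the four-operand form writes a separate destination register, while the three-operand form must overwrite one of its sources:

```python
# Simplified model of the two competing FMA encodings (not real mnemonics).

def fma4(regs, d, a, b, c):
    regs[d] = regs[a] * regs[b] + regs[c]   # four operands: d, a, b, c

def fma3(regs, a, b, c):
    regs[a] = regs[a] * regs[b] + regs[c]   # three operands: a is also the dest

r = {"x0": 2.0, "x1": 3.0, "x2": 4.0, "x3": 0.0}
fma4(r, "x3", "x0", "x1", "x2")
print(r["x0"], r["x3"])   # 2.0 10.0 -> all three sources survive

r = {"x0": 2.0, "x1": 3.0, "x2": 4.0}
fma3(r, "x0", "x1", "x2")
print(r["x0"])            # 10.0 -> the original x0 is gone; keeping it costs a copy
```

The four-operand form avoids an extra register copy when a source value is still needed, at the cost of a larger instruction encoding, and the two forms are mutually incompatible at the binary level.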

When Intel integrated a memory controller into Nehalem and basically aped AMD’s blueprint for a system architecture, it reaped benefits in terms of computing throughput and bandwidth that AMD’s current solutions haven’t been able to match.  There are many reasons why, but one of the big ones comes down to the effectiveness of Intel’s data pre-fetch mechanisms, which pull likely-to-be-needed data into the processor’s caches ahead of time, so it’s ready and waiting when needed.

Bulldozer is getting an overhaul in this area, with multiple data prefetchers that operate according to different algorithms in order to predict more accurately what data may be required soon.  If they work well, these prefetchers should allow Bulldozer to make more effective use of the tremendous bandwidth available in AMD’s latest DDR3-fortified platforms. 
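
As an illustration of the kind of algorithm involved, here is a toy stride prefetcher, one classic design among the several a multi-prefetcher scheme might combine. This is our simplification, not AMD’s actual logic:

```python
# Toy stride prefetcher: detect a constant stride in the access stream and,
# once it repeats, fetch the next address(es) ahead of the program.

def stride_prefetch(addresses, degree=2):
    """Yield predicted addresses once two consecutive accesses repeat a stride."""
    prev, stride = None, None
    for addr in addresses:
        if prev is not None:
            new_stride = addr - prev
            if new_stride == stride:
                # Stride confirmed: issue `degree` prefetches ahead.
                for i in range(1, degree + 1):
                    yield addr + i * stride
            stride = new_stride
        prev = addr

# A streaming read at a 128-byte stride triggers prefetches of the next lines.
print(list(stride_prefetch([0, 128, 256, 384], degree=1)))  # [384, 512]
```

Real designs add more algorithms on top of this, such as next-line and region prefetchers, which is presumably what AMD means by multiple prefetchers operating according to different algorithms.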

Revamped power management

 

Although we might think about the changes to Bulldozer primarily in terms of raw performance, a great many facets of this chip are aimed at making it more efficient in terms of performance per die area, per transistor, and per watt.  That’s true of both the architecture and the circuit design.

On top of all that, Bulldozer has learned a couple of power-saving tricks that Intel processors have known since Nehalem.  One is dynamic clock frequency scaling, like Intel’s Turbo Boost.  The Phenom II X6 “Thuban” core has a simple mechanism of this type, dubbed Turbo Core, but the CPU doesn’t seem to spend too much time resident at its highest frequencies, given the performance it produces.  Bulldozer’s implementation should be more robust and, hopefully, more effective.

The other trick AMD has ganked from Intel’s playbook is the use of an on-chip power gate to cut off power to individual CPU cores that happen to be idle.  Despite the wording of the slide above, Bulldozer incorporates power gates on a per-module basis rather than per-core, although of course the chip includes finer-grained clock gating logic within the module.  The ability to shut off power entirely to unused modules should pay some nice dividends.

Conclusions

This initial peek at Bulldozer reveals some truly new thinking about CPU microarchitecture, and it’s undeniably promising in theory.  Done well, Bulldozer could restore AMD’s competitiveness in both server/workstation processors and high-end desktops, and it could serve as a foundation for continued success for years to come.  Unfortunately, it’s way too early to speculate on the prospects for products based on this architecture.  Purely by looking at Barcelona on paper, one might have expected it to outperform the competing Core 2-based processors and to match up well with Nehalem.  The reality was far different from that.  Bulldozer’s future will hinge on whether AMD can effectively implement the concepts it has introduced here, and we have no crystal ball to tell us what to expect on that front. 

Comments closed
    • Stranger
    • 9 years ago

    There seems to be a lot of confusion over why AMD chose to go the route they did. Personally I think there is a lot more subtlety inherent in the design than most people would pick up on at first glance.

    Q: Why two separate int cores each processing one thread rather than one massive int core fed two threads in a more traditional type of SMT?

    A: Complexity does not scale linearly with the number of connections that have to be made in a single clock. For example, take the register file of the fat core with 2x threads doing SMT. The register file has to be twice as large as the AMD-style core’s, since each int unit has its own set of registers, plus that big fat register file has to connect to twice as many functional units as the AMD-style layout (all things being equal). That added complexity not only makes the fat design big and hot, it makes it slower and far, far more difficult to debug. I think it’s clear that the reason why we haven’t seen a single core doing SMT with near the number of execution units as Bulldozer is that it’s damn near impossible to get it running fast enough, and debugging it is hellacious.

    Q: Why does Bulldozer seem to be lacking in computational resources compared to Sandy Bridge/K10/Core?

    There’s a whole lot of evidence that it’s not really needed most of the time. Most people didn’t even notice the doubling of the FP power between the K8 and the K10. Due to the speed at which modern processors run, the biggest bottleneck is not the rate at which computations can be performed but the rate at which they can be fed by all the surrounding hardware, which seems to be greatly improved in Bulldozer. Arguably the most important parts of a modern processor have become the branch predictor and the prefetcher.

    Plus I’d like to argue that Bulldozer is not very thin at all in most situations that us desktop users would be interested in. These days most games struggle to take advantage of 4 cores, so what happens inside an 8-core Bulldozer when that happens? Each thread gets the full 4-wide decoder plus the whole attention of the L2, L1 Icache, all the TLBs, prefetchers, and branch predictors. All of that makes Bulldozer at least as strong as Sandy Bridge in my mind from a mile-high perspective.

    This duality lends itself very well to switching rapidly from multithreaded server code to high performance single threaded code.

    Bulldozer is a very subtle design that’s very well suited to AMD’s current capabilities to do R&D. AMD is competing against a rival with 10x the R&D budget; it’s fairly amazing that AMD is as close to Intel as they are. It’s most certainly not a position I’d ever want to be in as a scientist.

    One last thing… people seem to be underestimating how long AMD has been messing around with this kind of layout….

    http://chip-architect.com/news/2000_09_27_double_pumped_core.html

    Here’s some extra links just in case anyone is interested:

    http://aceshardware.freeforums.org/amd-bulldozer-preview-t1042.html
    http://aceshardware.freeforums.org/what-are-the-chances-that-we-ll-see-a-brand-new-amd-bulldoze-t881.html

    Edit: edited for clarity. Didn’t mean to ramble on for near as long as I did.

      • moritzgedig
      • 9 years ago

      “The register file has to be twice as large as the AMD-style core’s, since each int unit has its own set of registers”
      > Could you explain that in more detail?
      Does every ALU/AGU have a set of registers?
      Why would one register file per thread be bigger than one per core?
      I get why the RF might have to be central to the units, causing too high latency.
      Is it really so that there is a RF in the old von Neumann sense? Or is it just all “on the fly”?


    • moritzgedig
    • 9 years ago

    so it has 256bit of FP processing per clock and module?
    that is four times what the Athlon has per core and twice what the Phenom has?
    That makes it the same as the Phenom per Core?

    So the idea is:
    To have one front-end that decodes two threads as with HT but feeds two cores like usual?
    I don’t get it.
    The only benefit I see is, that one core can process 256bit FP per clock IF the other one doesn’t do FP.

    I don’t see where the savings come from?!

    • esterhasz
    • 9 years ago

    With all the debate about whether Bulldozer will be able to match Sandy Bridge’s performance, I wonder whether the architectural changes will make comparisons difficult. Will we compare a two module BD to a 2 core / 4 threads SB? Or rather a one core module? Besides using performance / dollar, I’d argue that performance / watt (with the turbo feature hinging on thermal headroom) will increase in importance and it would be interesting to look at performance / die space measures to understand architectural efficiency…

    • d0g_p00p
    • 9 years ago

    Just got to comment on WaltC’s comments. I love it when he (she?) posts. It’s always long and informative.

      • WaltC
      • 9 years ago

      Actually, I’m an “it” and have been ever since the first day the machine-head transplants were successfully grafted in with no tissue rejection (*that* was a refreshing change, let me tell you!) Sometimes the titanium ball-shanks, carbon toes (we found just three per “foot” worked best), and the SuperLube(T) diamond-roller heels took a bit of getting used to, but hey, no pain, no gain, right? What’s really cool is clinging so care free to the under-hang of a hallway ceiling when everybody else is walking to and fro in the hallway below and suspects nothing! Now, that’s living, my friend!…;)

    • BoBzeBuilder
    • 9 years ago

    VAN HONDERED!

    • jackbomb
    • 9 years ago

    So does this kryptonite go to 11? *hahaha snort*

    • Ryhadar
    • 9 years ago

    As much as I would like to be as knowledgeable as possible about this subject, I really can’t add much more than has already been said. However, there are two things I would like to mention.

    1.) Even if AMD flops with Bulldozer they’re not dead — although that would certainly be disappointing from my (hobbyist) point of view. Bobcat looks like a huge win against Atom and certainly Nano. If they can get some good margins and yields on those chips, then AMD is going to be just fine.

    2.) I can’t help but think that had AMD released Bulldozer on time (2008? 2009?) it would have been underperforming and a financial flop. I can’t really tell from a technical aspect, but from a design-philosophy aspect the focus is very much on multi-threaded workloads, workloads that the software industry has only recently paid much attention to. All told, the timing isn’t the greatest with Sandy Bridge just around the corner, so maybe I’m wrong.

    • ronch
    • 9 years ago

    All the things that AMD has revealed so far indicate that Bulldozer is really serious about power consumption and management, as well as transistor budget, which leads to a design with aggressive power management features, shared resources, narrow execution paths (two ALUs and AGUs per Integer Cluster per thread is nothing to drool about), etc.

    AMD also made the pipeline longer and branch predictors better.

    No clear mention of IPC targets, but each Integer Cluster is quite narrow, about 2/3 of what Phenom II has (two decoders for each Integer Cluster, which also has 2 ALUs and 2 AGUs). Even the FPU doesn’t seem like it’ll make grandma happy.

    I just wonder… Could it be that AMD is targeting some pretty nice clock speeds with Bulldozer?

      • Game_boy
      • 9 years ago

      IPC will be higher. Per-core performance will be more than 20% better. These are both official.

        • Anonymous Coward
        • 9 years ago

        However AMD also talks about two “cores” having 80% of the performance of two “real cores”, which says that IPC is not always better.

          • Kaleid
          • 9 years ago

          Could this mean that when they work together as one, the efficiency is not 100% but instead 80%? That would still be very good.

            • khands
            • 9 years ago

            I’m pretty sure that’s kind of the whole point of Bulldozer.

            • Anonymous Coward
            • 9 years ago

            I’m not clear on why performance is reduced, but I’m pretty sure it’s not entirely because 2 integer units are a problem. There must be some slowdowns associated with sharing decoders and all that.

            • ronch
            • 9 years ago

            Another thing, everyone: I think it’s not quite fair to call a Bulldozer module a dual-core design. I’d rather think that a module is just one processor core with a wide front end and four ALUs and four AGUs to combat Nehalem’s wide architecture. That way, it would be SAFE to call a module a complete core, not a dual core with shared resources.

            To say that a module is a dual core seems to me like calling a Pentium (which has two ALUs and one FPU) a dual-core processor, with each core having one ALU and a shared decoder but with each ALU having its own scheduler. No, AMD, don’t say a module is a dual core. That’d be muddying the waters a bit too much. Call a module a single core that’s capable of splitting its four ALUs and four AGUs into two sets to get some HT-like technology, plus a 256-bit FMA-capable FPU, and I’ll believe you 100%.

        • ronch
        • 9 years ago

        20% better than K10? Core i7 is already 20 – 50% faster than K10. How do they expect to bulldoze Sandy Bridge?

        I hope AMD is just being scant with the details on its secret weapon within Bulldozer.

    • tfp
    • 9 years ago

    How is this a benefit over a very wide integer core with hyperthreading? Would flexible usage of all of the integer pipelines between the 2 threads, vs. having half of the integer resources dedicated to each thread, make any difference? Say if they had it set up just like the FP pipelines and grouped all 8 of the integer pipelines together.

    With only 4 x86 decoders, will they be able to feed all 3 schedulers (2 int, 1 FP)?

    It really seems like hyperthreading on steroids; AMD is duplicating more pieces of the core without having a full core. Maybe this makes the integer thread management easier because of the dedicated pipelines, and the FP pipes just aren’t used enough to make the addition worthwhile. Though I guess if you dup the FP pipes as well, you might as well just make a full new core…

    • ronch
    • 9 years ago

    “..McKinney refused to answer certain questions about the architecture, too..”

    What say we resort to extortion, eh? I’m dying to know more about Bulldozer, as I’ve been waiting for it for years. Can’t wait to see a die shot of this baby.

    “..first chips are already back from the fab and up and running inside of AMD..”

    Let’s hope it doesn’t have another TLB bug that’ll send AMD back to the drawing boards.

    “..and we expect compatibility with Socket AM3 on the desktop, as well..”

    Good news!

    “.. SMT… That’s the approach Intel uses its current, Nehalem-derived processors. CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD’s current Opterons and Phenoms do… The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.”

    Um, Intel DOES support CMP and throws in SMT to sweeten the pot. AMD does CMP with Phenom and is now doing something close to SMT with Bulldozer, but that doesn’t mean AMD is now done with CMP, as putting a bunch of these modules together can still be seen as CMP and would still entail some of them being under-utilized in many situations. Regardless of where you put function units, whether on the same core, module, etc. there will still be scenarios when it will be under-utilized.

    “Estimated average of 80% of the CMP performance with much less area and power.”

    Seems to me that AMD is prioritizing manufacturing costs and power consumption over sheer performance figures. Right now the Core i7 980X (3.33GHz) trumps the Phenom II X6 1090T (3.2GHz) by roughly 20% in single-threaded apps and as much as 50% in multi-threaded apps. Let’s hope, for AMD’s sake, they aren’t using Phenom II as the reference to improve upon, nor even what Intel has right now, but what Intel will have to compete with when Bulldozer comes out, and ultimately, Intel’s tick-tock strategy. How about a ‘Knock-Knock’ strategy for AMD where they ‘knock’ down every ‘tick’ and ‘tock’ from Intel?

    In the end though, while Bulldozer looks promising, and would most likely end up in my next build, it’s all about final performance, power, and price, and how well AMD can ramp clock speeds against Sandy Bridge. On paper it’s a bit worrying that each pipeline cluster (of which there are two in a module) contains only two integer pipes and two AGU pipes. I hope AMD is correct in saying the inclusion of micro-op fusion more than makes up for this deficiency on paper. For everyone’s sake, I hope it’s AMD’s turn to shine this time.

      • FuturePastNow
      • 9 years ago

      It looks to me, as a complete layman, that AMD has greatly reduced the die area of a core (and slightly modified the definition of a core) so they can brute-force more performance by cramming many more cores onto a die.

        • ronch
        • 9 years ago

        Oh no. I can’t even use all 4 cores on my X4, and now they want us to get 16? Please, AMD, have mercy on us (and software programmers) already! 😛

          • Meadows
          • 9 years ago

          They don’t want us to get 16.

          • Anonymous Coward
          • 9 years ago

          There might be a stupid number of cores, but on the positive side, using only one core from each pair of them should give pretty respectable performance. I wonder how much of that FP power can be used by existing software.


    • BaronMatrix
    • 9 years ago

    I think the key to 4 IPC per core is based on the concept of dual loads introduced with K10. If they load and execute twice, the INSTR window stuff shown at citavia.blog.de makes it clear that it will be an 8-thread module

    IN THE OPTIMAL SITUATION

    while it should really eclipse MC in situations where each core can only process 3 IPC per core. The same would then apply to the AGUs, where they fire twice per cycle. The Ld/St is separate, as is the L1, so single-threaded apps can use the L2 of both cores and still MAX at 4 IPC.

    I’ve done intensive simulations – don’t ask – on both new archs, and AMD has a marked advantage unless the SW is optimized for SB. In cases – wait, people can force Intel to pay for re-compiles – where it’s not, Intel’s milk shake will be drank.

    • thermistor
    • 9 years ago

    The most “efficient” x86 arch from the P4/Athlon 64 days was neither…it was Banias, that is, if we’re talking non-server-type parts.

      • Joel H.
      • 9 years ago

      For what workload? Banias/Dothan was fabulous in some areas, notably weak in others. You’re making too broad a generalization.

    • MrDigi
    • 9 years ago

    “Each module has a trio of schedulers, one for each integer core and one for the FPU. And the integer cores themselves have two execution units and two address generation units each. Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each”

      If the integer core has two execution units and not the four the diagrams imply, then Intel still maintains an integer performance advantage of 50% (3-issue). The module looks more equivalent to an SB core, so is it really an 8-core chip or a 4-core? Just like how cores are counted in GPUs. It may also explain why two memory channels are sufficient.

      • Kurotetsu
      • 9 years ago

      From what I can understand, each module is seen as a core to the OS. So a two-module BD chip will be seen as a dual-core chip to the OS, despite having 4 not-quite-complete cores internally. Though from what I can see, they become complete when handling a single thread, but then SMT gets a speed boost due to the mix of shared and dedicated hardware compared to a ‘normal’ core, in which everything is shared during SMT.

        • Fursdon
        • 9 years ago

        From AMD’s Bulldozer Blog:

        ( TLDR: Each Integer Core will be considered a ‘core’ by the OS. So: 1 module, 2 integer cores, 2 OS cores/2 threads. )


          • Goty
          • 9 years ago

          Right, but it won’t be “core vs core”, it will be “core vs module” when this is released; i.e. 2-threads per core via SMT for Intel and two threads per module via separate integer cores for AMD, with a 4-module part going up against a 4-core part. Essentially, you’re still getting four execution units from AMD vs Intel’s three, we’re just defining the term “core” differently.

            • Shining Arcanine
            • 9 years ago

            These “modules” are cores. They just have more execution resources than Intel’s cores, but they are still cores.

            Look at POWER7, which can do 4 threads at a time. It has very fat cores, but they are still cores. The fact that the PC industry considers more to be better is the impetus for AMD’s marketing department to claim that this chip has more cores than it actually has, which sounds good on paper, but in terms of performance, it is not capable of matching a true octocore design, especially one like Intel’s Nehalem EX that has SMT on top of having 8 cores for a total of 16 threads.

            • Goty
            • 9 years ago

            Actually, the industry-standard definition for a "core" includes only the integer execution units (cf. IBM, Sun), so the modules are not in fact "cores" by this definition.

            Then again, if you know the definition better than AMD’s engineers…

            ; )

      • Shining Arcanine
      • 9 years ago

      Intel’s chips are 4-issue, not 3-issue.

      On paper, these cores are an enhanced version of Intel's cores, and they are therefore still cores. That makes this a quad-core chip.

        • MrDigi
        • 9 years ago

        I believe the 4th is for FP instructions; I was referring to integer instructions. I don't know, though, if SB changes this.

    • Shining Arcanine
    • 9 years ago

    This is basically analogous to a quadcore Core i7 with a little more computational power per core. While it will likely mop the floor with a real quadcore Core i7, it will have a difficult time competing with Intel’s Nehalem-EX in markets where parallelism is important.

    I think that this processor will do extremely poorly as far as the server market is concerned, but I think it will do well in the gaming market. I also do not think that these concepts are particularly radical. The IBM POWER7 uses similar ideas. They are so similar that it is possible that the IBM POWER7 was the inspiration for this design.

      • Game_boy
      • 9 years ago

      Nehalem-EX is 8C/16T.

      Interlagos will be 16C/16T and will be much cheaper, priced like MC rather than the EXs. The official line is that IL will outperform MC by 50%, average case. If you add 50% to current MC benches that would crush Nehalem-EX, plus the price advantage.

      This will do very well in servers.

      • OneArmedScissor
      • 9 years ago

      Yeah, and an Athlon 64 is “basically analogous” to a Pentium 4. They are both called “CPUs,” after all.

    • Chrispy_
    • 9 years ago

    Will it run…

      • ronch
      • 9 years ago

      LOL!

      (Ok, that’s all I have to say, but TR says a post with less than 10 characters isn’t worth anybody’s good time, so let’s make it longer.)

        • TaBoVilla
        • 9 years ago

        blank spaces? anyone?….

    • ClickClick5
    • 9 years ago

    Not that nerdy. lol

    Like me, I work with the finished product, not the making of the product. For all the computer engineers here, power to you all!

    EDIT: Reply fail to #1

    • RtFusion
    • 9 years ago

    Around 1:33 AM this morning, I was poring over the first Bulldozer articles that came out. Initially, I was quite excited for this architecture. But then I remembered Barcelona (Queen FTW!!!) and how I also geeked out on the architecture bits. Then it got into reviewers' hands and it bombed overall against the Core architecture from Intel.

    I am now cautious, even though there are a slew of improvements over an architecture that AMD has been using for like 7 years now, with tweaks and so on.

    • derFunkenstein
    • 9 years ago

    Nice writeup. Just technical enough for me to learn something new, and easy enough to read that my eyes don’t glaze over.

      • Pettytheft
      • 9 years ago

      That’s TR in a nutshell.

      • SomeOtherGeek
      • 9 years ago

      Exactly my thoughts! Good job, TR!

      • OffBa1ance
      • 9 years ago

      +1, great job.

    • axeman
    • 9 years ago

    I’ve heard this before, but I don’t quite get it. Apparently K8 was based upon K7, so what accounted for the massive performance improvements? The integrated memory controller? K7 was getting whooped by the P4 before the Athlon64 came out, so what did they tweak that made such a big difference? Not relevant, but someone here probably knows. The addition of x86-64 added lots of logic, but no performance in itself, right?

    • MikeA
    • 9 years ago

    “The decoders convert complex x86 instructions into the CPU’s simpler internal instructions. Bulldozer has four of these, while Barcelona and Nehalem have three.”

    This is wrong, Nehalem has 4 decoders not 3.

      • Damage
      • 9 years ago

      Totally my bad. You’re right. Updated the article.

    • jdaven
    • 9 years ago

    “In fact, Interlagos will likely be comprised of two Valencia chips on a single package, in an arrangement much like the present “Magny-Cours” Opterons.”

    I thought the whole point of Bulldozer was the modular architecture. In my mind that means no more gluing two complete chips together. You just put the number of modules you want to match a certain performance and power consumption level.

    This is the same as GPU’s where repeating little ‘processors’ are stacked against each other surrounded by shared/common architecture components.

    So Interlagos is 8 modules ‘pulled from a bin’ and placed on a substrate. It is not in my mind 2 of anything (i.e. Valencia). I think we have to leave the world of gluing complete chips together when it comes to AMD’s future architecture. If I’m reading everything correctly, it no longer exists for them. Just modules surrounded by a common architecture.

      • Goty
      • 9 years ago

      Yes, you can glue more modules onto (into?) the die, but you’ve got to remember that they’re all sharing some resources, still. There would definitely be decreasing returns as you added more modules.

      Also, you’ve got to consider die sizes and yields. A chip that is twice as large would yield less than half as well (hyperbole, I don’t know the exact numbers) due to wasted space near the edge of the wafer and the distribution of defects across it. By taking two smaller chips and gluing them together, you effectively increase the yield for the same product.
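
      The intuition above can be sketched with the classic Poisson defect-yield model, Y = exp(-A·D). The area and defect density below are made-up illustrative numbers, not GloFo figures, but the shape of the result is the point: a doubled die both yields worse per die and occupies twice as many wafer sites, so good chips per wafer drop to well under half.

      ```python
      import math

      def poisson_yield(area_cm2, defect_density=0.5):
          """Fraction of good dies under the Poisson defect model: Y = exp(-A*D).

          defect_density is in defects/cm^2; both inputs here are illustrative.
          """
          return math.exp(-area_cm2 * defect_density)

      small = poisson_yield(1.0)  # hypothetical single die, ~1 cm^2
      large = poisson_yield(2.0)  # same design with doubled die area

      # Per-wafer comparison: the big die fits in half as many sites,
      # so good chips per wafer scale by 0.5 * (large / small).
      good_chip_ratio = 0.5 * large / small
      print(small, large, good_chip_ratio)
      ```

      With these toy numbers the doubled die lands at roughly 30% as many good chips per wafer, which is the "less than half" effect the comment describes.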

        • Game_boy
        • 9 years ago

        Yes, and you can also take one design (the four-module die) and use it for both desktop and server products by doubling up on dies on the high-end server. Designing a new eight-module die just for 4P server may not give good enough returns.

      • FuturePastNow
      • 9 years ago

      I’m sure they could design a die with 8 or 16 or whatever modules, but then they have to make it. Without any defects. On a manufacturing process that GloFo has never used before.

    • emvath
    • 9 years ago

    So….it will be better?…..

      • Xenolith
      • 9 years ago

      Definitive Yes.

      Better questions … Will it be competitive? Will it find a market?

      • Meadows
      • 9 years ago

      No, they’re spending years on a chip that ends up worse.

        • ronch
        • 9 years ago

        I wish they’d spend one more ALU and one more AGU per integer cluster, making one module have three ALUs and three AGUs. That way, if there are two threads running on the module, each one gets almost Phenom II-like performance. And if only one thread is running, the scheduler should be smart enough to distribute the work (if that’s possible).

        This shouldn’t be hard, because according to AMD, adding a second Integer cluster only takes up 12% more die space, so, theoretically, adding another ALU and AGU to each cluster should take 12% more space again. No way to tell how many more transistors will go to the schedulers and decoders to feed such a wider path, but perhaps it’s not gonna cost AMD an arm and a leg. And, they seriously have to take back the performance crown or people like me will think they don’t have the engineering depth.
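        A quick back-of-the-envelope check of that compounding, taking AMD's quoted ~12% at face value and assuming (pure speculation) that widening each cluster costs a similar ~12% again:

        ```python
        base = 100.0                       # relative module area, one integer cluster
        with_second_cluster = base * 1.12  # AMD's quoted ~12% for the second cluster
        widened = with_second_cluster * 1.12  # speculative: widening costs ~12% more

        # Even compounded, the module stays ~25% over the single-cluster baseline.
        print(with_second_cluster, widened)
        ```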

        K5 was AMD’s first in-house design. It flopped.
        K6 was acquired along with the NexGen purchase back in 1996. AMD bolted on an FPU and made it work with Socket 7.
        K7 was a lucky break when a band of ex-DEC engineers, led by Dirk Meyer, joined AMD and created the K7.
        K8 was a K7 with Hypertransport, an IMC, 64-bit, and some tweaks.
        K9 was… what, two K8’s joined together? Ok, so that deserves the move from ‘8’ to ‘9’ /sarcasm
        K10 is four tweaked K8’s on a monolithic die with an L3 cache and improved Hypertransport, a cut-and-paste 2nd FPU (check out die shots), some tweaks, and aggressive pricing to make up for lower performance.
        K11 (a.k.a. Bulldozer) – Let’s hope this will at least be very competitive with Sandy Bridge.

        • TO11MTM
        • 9 years ago

        Prescott?

        Oh… you mean AMD will be.

    • Buub
    • 9 years ago

    Cool deal. Mirrors some of the earlier stuff we’ve heard, but it’s nice to see it spelled out in more detail.

    This should definitely be more performant than SMT, where a single core attempts to run two threads.

    • ssidbroadcast
    • 9 years ago

    Well, most of this technical stuff flies over my head, but it’s nice to see AMD finally tip their hand a bit…

      • sweatshopking
      • 9 years ago

      what technical stuff? you're a nerd, aren't you?

        • HurgyMcGurgyGurg
        • 9 years ago

        You can still not know what “PRF based register renaming” is and be a nerd.

        • ronch
        • 9 years ago

        Sure, this is technical. Just ask your grandma to read this and see what she tells you.

        Not at the engineering level, perhaps, but technical, still.
