AMD’s Bulldozer architecture revealed

Next year, AMD plans to ship products based on a new processor architecture code-named Bulldozer, and in the world of big, x86-compatible CPUs, that’s huge news. In this arena, the question of how truly “new” a chip architecture is can be vexingly complicated, because technologies, ideas, and logic are often carried over from one generation to the next.  But it’s probably safe to say Bulldozer is AMD’s first all-new, bread-and-butter CPU architecture since the introduction of the K7 way back in 1999.  The firm has made notable incremental changes along the way—K8 brought a new system architecture, Barcelona integrated four cores together—but the underlying microarchitecture hasn’t changed too much.  Bulldozer is something very different, a new microarchitecture incorporating some novel concepts we’ve not seen anywhere else.

Today, at the annual Hot Chips conference, Mike Butler, AMD Fellow and Chief Architect of the Bulldozer core, gave the first detailed public exposition of Bulldozer.  We didn’t attend his presentation, but we did talk with Dina McKinney, AMD Corporate Vice President of Design Engineering, who led the Bulldozer team, in advance of the conference. We also have a first look at some of the slides from Butler’s talk, which reveal quite a bit more detail about Bulldozer than we’ve seen anywhere else.

The first thing to know about the information being released today is that it’s a technology announcement, and only a partial one at that.  AMD hasn’t yet divulged specifics about Bulldozer-based products, and McKinney refused to answer certain questions about the architecture, too.  Instead, the company intends to release snippets of information about Bulldozer in a directed way over time in order to maintain the buzz about the new chip—an approach it likens to “rolling thunder,” although I’d say it feels more like a leaky faucet.

The products: New CPUs in 2011

Regardless, we know the broad outlines of expected Bulldozer-based products already.  Bulldozer will replace the current server and high-end desktop processors from AMD, including the Opteron 4100 and 6100 series and the Phenom II X6, at some time in 2011. A full calendar year is an awfully big target, especially given how close it is, but AMD isn’t hinting about exactly when next year the products might ship.  We do know that the chips are being produced by GlobalFoundries on its latest 32-nm fabrication process, with silicon-on-insulator tech and high-k metal gate transistors. McKinney told us the first chips are already back from the fab and up and running inside of AMD, so Bulldozer is well along in its development.  Barring any major unforeseen problems, we’d wager the first products based on it could ship well before the end of 2011, although launch windows like this one frequently get stretched to their final hours.

One advantage that Bulldozer-based products will have when they do ship is the presence of an established infrastructure ready and waiting for them.  AMD says Bulldozer-based chips will be compatible with today’s Opteron sockets C32 and G34, and we expect compatibility with Socket AM3 on the desktop, as well, although specifics about that are still murky.

AMD has committed to three initial Bulldozer variants. “Valencia” will be an eight-core server part, destined for the C32 socket with dual memory channels.  “Interlagos” will be a 16-core server processor aimed at the G34 socket, so we’d expect it to have quad memory channels. In fact, Interlagos will likely consist of two Valencia chips on a single package, in an arrangement much like the present “Magny-Cours” Opterons.  The desktop variant, “Zambezi,” will have eight cores, as well.  All three will quite likely be based on the same silicon.

The concept: two ‘tightly coupled’ cores

The specifics of that silicon are what will make Bulldozer distinctive.  The key concept for understanding AMD’s approach to this architecture is a novel method of sharing resources within a CPU.  Butler’s talk names a couple of well-known options for supporting multiple threads. Simultaneous multithreading (SMT) employs targeted duplication of some hardware and sharing of other hardware in order to track and execute two threads in a single core.  That’s the approach Intel uses in its current, Nehalem-derived processors.  CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD’s current Opterons and Phenoms do.  The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.

AMD didn’t take that approach, though.  Instead, the team chose to integrate two cores together into a fundamental building block it calls a “Bulldozer module.”  This module, diagrammed above, shares portions of a traditional core—including the instruction fetch, decode, and floating-point units and L2 cache—between two otherwise-complete processor cores.  The resources AMD chose to share are not always fully utilized in a single core, so not duplicating them could be a win on multiple fronts.  The firm claims a Bulldozer module can achieve 80% of the performance of two complete cores of the same capability.  Yet McKinney told us AMD has estimated that including the second integer core adds only 12% to the chip area occupied by a Bulldozer module.  If these claims are anywhere close to the truth, Bulldozer should be substantially more efficient in terms of performance per chip area—which translates into efficiency per transistor and per watt, as well.
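
As a sanity check on those numbers, here’s the throughput-per-area arithmetic implied by AMD’s claims. The 80% and 12% figures are AMD’s; the comparison itself is our back-of-the-envelope math, not AMD’s methodology:

```python
# Rough throughput-per-area comparison based on AMD's stated figures:
# a module delivers ~80% of the throughput of two full cores, while the
# second integer core adds only ~12% to the module's area.

cmp_perf, cmp_area = 2.00, 2.00   # two conventional cores (one core = 1.0)
mod_perf, mod_area = 1.60, 1.12   # one Bulldozer module, per AMD's claims

cmp_efficiency = cmp_perf / cmp_area   # 1.00 by construction
mod_efficiency = mod_perf / mod_area   # ~1.43

print(f"throughput per unit area, module vs. two cores: {mod_efficiency:.2f}x")
```

By this accounting, the second integer core’s throughput comes nearly free: roughly a 40%-plus gain in throughput per unit area over a conventional two-core arrangement, if the claims hold.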

One obvious outcome of the Bulldozer module arrangement, with its shared FPU, is an inherent bias toward increasing integer math performance.  We’ve heard several explanations for this choice.  McKinney told us the main motivating factor was the presence of more integer math in important workloads, which makes sense.  Another explanation we’ve heard is that, with AMD’s emphasis on CPU-GPU fusion, floating-point-intensive problems may be delegated to GPUs or arrays of GPU-like parallel processing engines in the future.

In our talk, McKinney emphasized that a Bulldozer module would provide more predictable performance than an SMT-enabled core—a generally positive trait.  That raised an intriguing question about how the OS might schedule threads on a Bulldozer-based processor.  For an eight-threaded, quad-core CPU like Nehalem, operating systems generally tend to favor scheduling a single thread on each physical core before adding a second thread on any core.  That way, resource sharing within the cores doesn’t come into play until necessary, and performance should be optimal.  We suggested such an arrangement might also be best for a Bulldozer-based CPU, but McKinney downplayed the need for any special provisions of that nature on this hardware.  She also hinted that scheduling two threads on the same module and leaving the other three modules idle, so they could drop into a low-power state, might be the best path to power-efficient performance.  We don’t yet know what guidance AMD will give operating system developers regarding Bulldozer, but the trade-offs at least shouldn’t be too painful.
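
To make the trade-off concrete, here’s a toy model of the two placement policies in question, assuming a hypothetical four-module, eight-core part. The policy names and the function itself are our illustration, not anything AMD or any OS vendor has specified:

```python
# Toy model of two thread-placement policies on a four-module part.
# "spread" places each new thread on the least-loaded module first,
# maximizing per-thread resources; "pack" fills one module's two cores
# before waking the next, letting idle modules drop into low-power states.

def place_threads(n_threads, n_modules=4, policy="spread"):
    modules = [0] * n_modules              # threads currently on each module
    for _ in range(n_threads):
        if policy == "spread":
            target = modules.index(min(modules))
        else:                              # "pack"
            partially_full = [i for i, t in enumerate(modules) if 0 < t < 2]
            target = partially_full[0] if partially_full else modules.index(0)
        modules[target] += 1
    return modules

print(place_threads(2, policy="spread"))  # [1, 1, 0, 0] - two modules active
print(place_threads(2, policy="pack"))    # [2, 0, 0, 0] - three modules can idle
```

With two runnable threads, “spread” keeps two modules awake for maximum throughput, while “pack” concentrates them on one module, which is the power-saving arrangement McKinney hinted at.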

More microarchitecture

The sharing arrangement may be the most noteworthy aspect of the Bulldozer architecture, but the cores themselves are substantially changed from prior AMD processors, too.

The module’s front end includes a prediction pipeline, which predicts what instructions will be used next.  A separate fetch pipeline then populates the two instruction queues—one for each thread—with those instructions.  The decoders convert complex x86 instructions into the CPU’s simpler internal instructions.  Bulldozer has four of these, like Nehalem, while Barcelona has three.

Each module has a trio of schedulers, one for each integer core and one for the FPU.  And the integer cores themselves have two execution units and two address generation units each.  Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each.  As a result, we thought perhaps AMD might layer SMT on top of a Bulldozer module at some point in the future. Knowing what we do now, that outcome seems much less likely.  Bulldozer doesn’t look to have any “extra” execution hardware waiting to be exploited in those integer cores.

Although each module has only a single floating-point unit, that FPU should be substantially more capable than past AMD FPUs.  You can see the dual integer MMX and 128-bit FMAC units in the diagram above.  In a sort of quasi-SMT arrangement, the FPU can track two hardware threads, one for each “parent” core on the module.

The FPU supports nearly all the alphabet-soup extensions to the x86 ISA, up to and including SSSE3, SSE 4.1, 4.2, and Intel’s new Advanced Vector Extensions (AVX).  AVX allows for higher-throughput processing of graphics, media, and other parallelizable, floating-point-intensive workloads by doubling the width of SIMD vectors from 128 to 256 bits.  Bulldozer’s 128-bit FMAC units will work together on 256-bit vectors, effectively producing a single 256-bit vector operation per cycle.  Intel’s Sandy Bridge, due early in 2011, will have two 256-bit vector units capable of producing a 256-bit multiply and a 256-bit add in a single cycle, double Bulldozer’s AVX peak.

Bulldozer’s FPU has an advantage in another area, though, as the presence of two 128-bit FMAC units indicates.  FMAC is short for “fused multiply-accumulate,” an operation that’s sometimes known as FMA, for “fused multiply-add,” instead.  Whatever you call it, a single operation that joins multiplication with addition is new territory for x86 processors, and it has two main benefits.

The first, pretty straightforwardly, is higher performance.  The need to multiply two numbers and then add the result turns out to be very common in graphics and media workloads, and fusing them means the processor can achieve twice the throughput for those operations.  We’ve seen multiply-add instructions in GPUs for ages, which is why each ALU in a GPU shader can produce two ops per clock at peak.  With dual 128-bit FMACs, Bulldozer’s peak FLOPS throughput should be comparable to Sandy Bridge’s peak with AVX and 256-bit vectors.
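
The peak-throughput comparison works out as follows, counting a fused multiply-add as two floating-point operations (the usual convention) and using the unit widths disclosed so far. These are theoretical per-cycle peaks inferred from the slides, not measurements:

```python
# Back-of-the-envelope peak double-precision FLOPS per cycle, treating a
# fused multiply-add as two floating-point operations.

DOUBLES_PER_128 = 128 // 64    # 2 doubles per 128-bit vector
DOUBLES_PER_256 = 256 // 64    # 4 doubles per 256-bit vector

# Bulldozer module: two 128-bit FMAC units, 2 flops (mul+add) per lane
bulldozer = 2 * DOUBLES_PER_128 * 2                # 8 flops/cycle

# Sandy Bridge core: one 256-bit multiply + one 256-bit add per cycle
sandy_bridge = DOUBLES_PER_256 + DOUBLES_PER_256   # 8 flops/cycle

print(bulldozer, sandy_bridge)
```

Both designs top out at the same per-cycle peak; they just get there differently, Bulldozer via fused operations on narrower units and Sandy Bridge via wider separate multiply and add units.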

Second, because an FMA operation feeds the result of the multiply directly into the adder without rounding, the mathematical precision of the result is higher.  For this reason, the DirectX 11 generation of GPUs adopted FMA as their new standard, as well.
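
The precision effect is easy to demonstrate. The sketch below mimics FMA’s single rounding in plain Python by computing the product exactly with the `fractions` module and rounding only once at the end; the inputs are chosen so that rounding the intermediate product discards exactly the bits the subtraction would otherwise expose:

```python
from fractions import Fraction

# Inputs chosen so the separate multiply's rounding step loses information.
a = 1.0 + 2.0 ** -27        # exactly representable as a double
c = -(1.0 + 2.0 ** -26)

plain = a * a + c           # product rounds to a double first: low bits lost
exact = Fraction(a) * Fraction(a) + Fraction(c)
fused = float(exact)        # one rounding at the very end, as FMA does

print(plain)   # 0.0
print(fused)   # 5.551115123125783e-17 (i.e., 2**-54)
```

The separate multiply-then-add returns exactly zero, while the fused version recovers the true residual of 2^-54. In hardware this is a single FMA instruction rather than an exact-arithmetic detour, but the rounding behavior illustrated is the same.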

Crucially, Intel’s Sandy Bridge will not support an FMA operation. Instead, FMA support is slated for Haswell, the architectural refresh coming a full “tick-tock” generation beyond Sandy Bridge, likely in 2013.  Earlier this year, Intel architect Ronak Singhal told us the choice to leave FMA out of Sandy Bridge was driven by the fact that it’s “not a small piece of logic” since it requires more sources, or operands, than usual.  Intel chose to double the vector width first with AVX and push FMA down the road.

Thus, Bulldozer will be the first x86 processor with FMA capability. That distinction won’t come without controversy, though.  Bulldozer supports an AMD-sanctioned four-operand form of FMA operation, whereas Haswell will use a three-operand version.  Both instructions will require compiler support and freshly compiled binaries, so we may see yet another fracture in the x86 ISA until Intel and AMD can settle on a single, preferred solution.

When Intel integrated a memory controller into Nehalem and basically aped AMD’s blueprint for a system architecture, it reaped benefits in terms of computing throughput and bandwidth that AMD’s current solutions haven’t been able to match.  There are many reasons why, but one of the big ones comes down to the effectiveness of Intel’s data pre-fetch mechanisms, which pull likely-to-be-needed data into the processor’s caches ahead of time, so it’s ready and waiting when needed.

Bulldozer is getting an overhaul in this area, with multiple data prefetchers that operate according to different algorithms in order to predict more accurately what data may be required soon.  If they work well, these prefetchers should allow Bulldozer to make more effective use of the tremendous bandwidth available in AMD’s latest DDR3-fortified platforms. 
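
As an illustration of the sort of algorithm involved, here’s a minimal stride detector, one of the classic prefetch schemes. AMD hasn’t detailed how Bulldozer’s prefetchers actually work, so treat this purely as a sketch of the concept:

```python
# Minimal stride-prefetcher sketch: after observing the same stride on two
# consecutive address pairs, predict the next address so the cache line can
# be fetched before the program asks for it.

def stride_prefetch(addresses):
    """Yield a predicted next address each time a constant stride is confirmed."""
    predictions = []
    for prev, cur, nxt in zip(addresses, addresses[1:], addresses[2:]):
        if cur - prev == nxt - cur:              # stride seen twice in a row
            predictions.append(nxt + (nxt - cur))
    return predictions

# A streaming read over 64-byte cache lines:
print(stride_prefetch([0, 64, 128, 192]))  # [192, 256]
```

Real prefetchers track many streams at once, confirm strides with confidence counters, and throttle themselves to avoid wasting bandwidth; running several such predictors with different algorithms side by side is exactly the overhaul described above.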

Revamped power management

Although we might think about the changes to Bulldozer primarily in terms of raw performance, a great many facets of this chip are aimed at making it more efficient in terms of performance per die area, per transistor, and per watt.  That’s true of the architecture and the circuit design alike.

On top of all that, Bulldozer has learned a couple of power-saving tricks that Intel processors have known since Nehalem.  One is dynamic clock frequency scaling, like Intel’s Turbo Boost.  The Phenom II X6 “Thuban” core has a simple mechanism of this type, dubbed Turbo Core, but the CPU doesn’t seem to spend too much time resident at its highest frequencies, given the performance it produces.  Bulldozer’s implementation should be more robust and, hopefully, more effective.

The other trick AMD has ganked from Intel’s playbook is the use of an on-chip power gate to cut off power to individual CPU cores that happen to be idle.  Despite the wording of the slide above, Bulldozer incorporates power gates on a per-module basis rather than per-core, although of course the chip includes finer-grained clock gating logic within the module.  The ability to shut off power entirely to unused modules should pay some nice dividends.

Conclusions

This initial peek at Bulldozer reveals some truly new thinking about CPU microarchitecture, and it’s undeniably promising in theory.  Done well, Bulldozer could restore AMD’s competitiveness in both server/workstation processors and high-end desktops, and it could serve as a foundation for continued success for years to come.  Unfortunately, it’s way too early to speculate on the prospects for products based on this architecture.  Purely by looking at Barcelona on paper, one might have expected it to outperform the competing Core 2-based processors and to match up well with Nehalem.  The reality was far different from that.  Bulldozer’s future will hinge on whether AMD can effectively implement the concepts it has introduced here, and we have no crystal ball to tell us what to expect on that front. 

Comments closed
    • Stranger
    • 9 years ago

    There seems to be a lot of confusion over why AMD chose to go the route they did. Personally I think there is a lot more subtlety inherent in the design than most people would pick up on at first glance.

    Q: Why two separate int cores each processing one thread, rather than one massive int core fed two threads in a more traditional type of SMT?

    A: Complexity does not scale linearly with the number of connections that have to be made in a single clock. For example, take the register file of the fat core with 2x threads doing SMT. The register file has to be twice as large as in the AMD-style core, since each int unit has its own set of registers, plus that big fat register file has to connect to twice as many functional units as the AMD-style layout (all things being equal). That added complexity not only makes the fat design big and hot, it makes it slower and far, far more difficult to debug. I think it’s clear that the reason we haven’t seen a single core doing SMT with near the number of execution units as Bulldozer is that it’s damn near impossible to get it running fast enough, and debugging it is hellacious.

    Q: Why does Bulldozer seem to be lacking in computational resources compared to Sandy Bridge/K10/Core?

    A: There’s a whole lot of evidence that it’s not really needed most of the time. Most people didn’t even notice the doubling of the FP power between the K8 and the K10. Due to the speed at which modern processors run, the biggest bottleneck is not the rate at which computations can be performed but the rate at which they can be fed by all the surrounding hardware, which seems to be greatly improved in Bulldozer. Arguably the most important parts of a modern processor have become the branch predictor and the prefetcher.

    Plus I’d like to argue that Bulldozer is not very thin at all in most situations that us desktop users would be interested in. These days most games struggle to take advantage of 4 cores, so what happens inside an 8-core Bulldozer when that happens? Each thread gets the full 4-wide decoder plus the whole attention of the L2, L1 I-cache, all the TLBs, prefetchers, and branch predictors. All of that makes Bulldozer at least as strong as Sandy Bridge in my mind, from a mile-high perspective.

    This duality lends itself very well to switching rapidly from multithreaded server code to high performance single threaded code.

    Bulldozer is a very subtle design that’s very well suited to AMD’s current capability to do R&D. AMD is competing against a rival with 10x the R&D budget; it’s fairly amazing that AMD is as close to Intel as they are. It’s most certainly not a position I’d ever want to be in as a scientist.

    One last thing… people seem to be underestimating how long AMD has been messing around with this kind of layout….

    http://chip-architect.com/news/2000_09_27_double_pumped_core.html

    Here’s some extra links just in case anyone is interested:
    http://aceshardware.freeforums.org/amd-bulldozer-preview-t1042.html
    http://aceshardware.freeforums.org/what-are-the-chances-that-we-ll-see-a-brand-new-amd-bulldoze-t881.html

    Edit: edited for clarity. Didn’t mean to ramble on for near as long as I did.

      • moritzgedig
      • 9 years ago

      “The Register file has to twice as large as the AMD style core since each int unit has its own set of registers”
      > could you explain that in more detail?
      Does every ALU/AGU have a set of registers?
      Why would one register file per thread be bigger than one per core?
      I get why the RF might have to be central to the units, causing too high latency.
      Is it really so that there is an RF in the old von Neumann sense? Or is it just all “on the fly”?

    • moritzgedig
    • 9 years ago

    so it has 256bit of FP processing per clock and module?
    that is four times what the Athlon has per core and twice what the Phenom has?
    That makes it the same as the Phenom per Core?

    So the idea is:
    To have one front-end that decodes two threads as with HT but feeds two cores like usual?
    I don’t get it.
    The only benefit I see is, that one core can process 256bit FP per clock IF the other one doesn’t do FP.

    I don’t see where the savings come from?!

      • mesyn191
      • 9 years ago

      They’re spending around 13% more die space than a single core to get ~80% of the performance of 2 full blown cores.

      The die/power savings are huge. Not as much as HT will do, but AMD’s approach also gets you more performance than HT, and it’s more consistent than HT as well.

        • moritzgedig
        • 9 years ago

        Yes, I did read that the integer units are 12% and that the module has 160% of one core’s performance.
        That doesn’t tell me much.
        The question is: how much bigger is the rest (scheduler, decode, fetch, rename, back-end, …) compared to a single core?
        Doing it this way (HT/SMT) must have a control cost; all the processing has to be organized and kept interference-free.
        When Intel introduced HT to the P4, the control logic grew noticeably and cache thrashing became an issue.

          • mesyn191
          • 9 years ago

          Uh, it tells you all you need to know: all the differences in total come to a 12% bigger die. That number isn’t just for the integer units.

          Also no one knows exactly how fast BD is vs. PhII on a core-to-core basis, but AMD has said it’s much faster. Not 160% faster, though. Each BD core doesn’t have 80% of the performance of a single PhII core; the module (2 cores) is supposed to get around 80% of the performance of 2 PhII cores. Huge difference.

          Each core has its own cache structure, so cache thrashing won’t be any more of an issue than on current PhIIs at worst. AMD said they improved a lot of things, so it’s likely to be quite a bit better as far as that is concerned. There also isn’t any penalty for both cores utilizing the FPU at the same time, either.

            • moritzgedig
            • 9 years ago

            I think I get the numbers better now.
            There is this hypothetical single-core Bulldozer.
            Compared to two of it, the Bulldozer module has 56% of the area but 80% of the performance.
            OR
            The Bulldozer module is 112% the size of one such core but gets 160% of the performance.
            Sure, in this I assumed that two cores are twice as fast as one, which isn’t true, but it is OK to calculate with for the sake of comparison.
            Each core/half-module does not have its own cache structure. They do share the L2 and of course the L3. Only the L1 caches are separate.
            Assuming the L2 is 1/3 of the area, those 12% become more understandable. Now they are 18% of the logic.
            Assuming that 1/3 of the logic is FP related (due to the massive 256 bits per clock), it becomes +27% of the non-L2, non-FPU area.

            • mesyn191
            • 9 years ago

            L2 is separate, L3 is shared.

            http://www.anandtech.com/Gallery/Album/754#5

            You’re making oranges to apples comparisons too on die size, total die size is 12%. You can look at AMD’s slide here:
            http://www.anandtech.com/show/3863/amd-discloses-bobcat-bulldozer-architectures-at-hot-chips-2010/4

            • moritzgedig
            • 9 years ago

            “L2 is separate, L3 is shared.”
            > at the module level, yes, but not at the half-module level.

            “You’re making oranges to apples comparisons too on die size, total die size is 12%, you can look at AMD’s slide”
            > compared to what? total die size is not 12%.

            • mesyn191
            • 9 years ago

            Dude, both links explain it in as simple a way as possible; if you can’t understand them, then I’ve got nothing for you.

    • esterhasz
    • 9 years ago

    With all the debate about whether Bulldozer will be able to match Sandy Bridge’s performance, I wonder whether the architectural changes will make comparisons difficult. Will we compare a two module BD to a 2 core / 4 threads SB? Or rather a one core module? Besides using performance / dollar, I’d argue that performance / watt (with the turbo feature hinging on thermal headroom) will increase in importance and it would be interesting to look at performance / die space measures to understand architectural efficiency…

    • d0g_p00p
    • 9 years ago

    Just got to comment on WaltC comments. I love it when he (she?) posts. It’s always long and informative.

      • WaltC
      • 9 years ago

      Actually, I’m an “it” and have been ever since the first day the machine-head transplants were successfully grafted in with no tissue rejection (*that* was a refreshing change, let me tell you!) Sometimes the titanium ball-shanks, carbon toes (we found just three per “foot” worked best), and the SuperLube(T) diamond-roller heels took a bit of getting used to, but hey, no pain, no gain, right? What’s really cool is clinging so care free to the under-hang of a hallway ceiling when everybody else is walking to and fro in the hallway below and suspects nothing! Now, that’s living, my friend!…;)

    • BoBzeBuilder
    • 9 years ago

    VAN HONDERED!

    • jackbomb
    • 9 years ago

    So does this kryptonite go to 11? *hahaha snort*

    • Ryhadar
    • 9 years ago

    As much as I would like to be as knowledgeable as possible about this subject, I really can’t add much more than has already been said. However, there are two things I would like to mention.

    1.) Even if AMD flops with Bulldozer they’re not dead — although that would certainly be disappointing from my (hobbyist) point of view. Bobcat looks like a huge win against Atom and certainly Nano. If they can get some good margins and yields on those chips, then AMD is going to be just fine.

    2.) I can’t help but think that had AMD released Bulldozer on time (2008? 2009?) it would have been underperforming and a financial flop. I can’t really tell from a technical aspect, but from a design philosophy aspect the focus is very much on multi-threaded workloads, workloads that the software industry has only recently paid great attention to. All told, the timing isn’t the greatest with Sandy Bridge just around the corner, so maybe I’m wrong.

    • ronch
    • 9 years ago

    All the things that AMD has revealed so far indicate that Bulldozer is really serious about power consumption and management, as well as transistor budget, which leads to a design with aggressive power management features, shared resources, narrow execution paths (two ALUs and AGUs per Integer Cluster per thread is nothing to drool about), etc.

    AMD also made the pipeline longer and branch predictors better.

    No clear mention of IPC targets, but each Integer Cluster is quite narrow, about 2/3 of what Phenom II has (two decoders for each Integer Cluster, which also has 2 ALUs and 2 AGUs). Even the FPU doesn’t seem like it’ll make grandma happy.

    I just wonder… Could it be that AMD is targeting some pretty nice clock speeds with Bulldozer?

      • Game_boy
      • 9 years ago

      IPC will be higher. Per-core performance will be more than 20% better. These are both official.

        • Anonymous Coward
        • 9 years ago

        However AMD also talks about two “cores” having 80% of the performance of two “real cores”, which says that IPC is not always better.

          • Kaleid
          • 9 years ago

          Could this mean that when they work together as one, efficiency is not 100% but instead 80%? That would still be very good.

            • khands
            • 9 years ago

            I’m pretty sure that’s kind of the whole point of Bulldozer.

            • Anonymous Coward
            • 9 years ago

            I’m not clear on why performance is reduced, but I’m pretty sure it’s not entirely because 2 integer units are a problem. Must be some slowdowns associated with sharing decoders and all that.

            • ronch
            • 9 years ago

            Another thing, everyone, I think it’s not quite fair to call a Bulldozer module a dual core design. I’d rather think that a module is just one processor core with a wide front end and four ALUs and four AGUs to combat the Nehalem’s wide architecture. That way, it would be SAFE to call a module a complete core, not a dual core with shared resources.

            To say that a module is a dual core seems to me like calling a Pentium (which has two ALUs and one FPU) a dual core processor, with each core having one ALU and a shared decoder but with each ALU having its own scheduler. No, AMD, don’t say a module is a dual core. That’d be muddying the waters a bit too much. Call a module a single core that’s capable of splitting its four ALUs and four AGUs into two sets to get some HT-like technology, and a 256-bit FMA-capable FPU, and I’ll believe you 100%.

        • ronch
        • 9 years ago

        20% better than K10? Core i7 is already 20 – 50% faster than K10. How do they expect to bulldoze Sandy Bridge?

        I hope AMD is just being scant with the details on its secret weapon within Bulldozer.

    • tfp
    • 9 years ago

    How is this a benefit over a very wide integer core with hyperthreading? Would flexible usage of all of the integer pipelines between the 2 threads, vs. having half of the integer resources dedicated to each thread, make any difference? Say if they had it set up just like the FP pipelines and grouped all 8 of the integer pipelines together.

    With only 4 x86 Decoders will they be able to feed all 3 schedulers (2Int, 1 FP)?

    It really seems like hyperthreading on steroids; AMD is duplicating more pieces of the core without having a full core. Maybe this makes the integer thread management easier because of the dedicated pipelines, and FP pipes just aren’t used enough to make the addition worthwhile. Though I guess if you dup the FP pipes as well, you might as well just make a full new core…

      • dragmor
      • 9 years ago

      I can dig up the source of the stats, but 9x% of single-threaded code only uses two integer pipelines. I’m guessing that two half-integer cores are easier to design than a 4-issue core that can split resources between different threads.

      What we are really seeing here is AMD’s first step towards removing the FPU. Bulldozer is step 1, i.e. give it a separate flexible scheduler and let it be called in modes. Step 2 will be to have the scheduler call the FPU or the GPU. Step 3 will be to replace the FPU pipelines with the GPU unit.

        • stmok
        • 9 years ago

        http://www.xbitlabs.com/news/cpu/display/20100512150105_Second_Iteration_of_AMD_Fusion_Chips_Due_in_2015_AMD.html

        Phase 1: Introduce the Bulldozer architecture. Phase 2: Incorporate it as part of the 2nd-generation Fusion platform.

          • tfp
          • 9 years ago

          I think so too, and if you look at it the right way this new CPU setup is kind of looking more GPU like in general.

      • Game_boy
      • 9 years ago

      HT gives a 20-30% speedup from one thread to two. This gives 80%. For similar die area and power, isn’t this better?

        • tfp
        • 9 years ago

        Really I don’t see it as an apples to apples comparison, to do that one would need a very wide core with near the same integer pipes to know if dedicated Int pipes vs HT + wide makes any difference.

        As for the 80%, I’ll believe it when I see it. I’m sure the numbers are best-case, just like how Intel does it, but I hope they can manage something close.

        • Joel H.
        • 9 years ago

        You’re misunderstanding what HT does. It compensates for the inherent inefficiency of a processor’s design. This creates a paradoxical effect—the more efficient your processor is, the less Hyper-Threading will improve its performance.

        Obviously AMD felt it was more efficient to opt for two independent integer units than to strictly copy Intel’s SMT implementation. It’s a mistake to assume that HT would provide the same benefit for AMD as it did for Intel, and it’s entirely possible that attempting to incorporate HT + shared modular components would make performance worse, not better.

        AMD is biting off a lot with Bulldozer–tossing HT in on top, IMO, would not be a good idea.

          • tfp
          • 9 years ago

          I’d argue this is pretty much doing HT on top, just with dedicated integer units; everything else looks like it is shared, similar to HT in general. Maybe they could call it htNOW!

            • mesyn191
            • 9 years ago

            If you’re duplicating integer units you’re not doing HT.

            HT is about improving the efficiency of a single core. What AMD is doing is sort of a half-assed dual-core chip, totally different.

            • rexprimoris
            • 9 years ago

            I would assume that AMD’s chip architects and engineers know more about this than you do with your layman’s opinion about this being a “half-assed” configuration. They obviously made a decision (based on analysis and deliberation and actual research) that a microarchitecture based on a dual-core chip module with each individual core having some independent resources and some shared resources was a better solution to multi-threading than either SMT or brute force CMP. If you take a look at the slide presentation, they present this approach as a better way. Suffice it to say, I value their detailed explanation of the benefits of this architecture over your quick and infantile assessment.

      • tfp
      • 9 years ago

      From reading what is left of the aceshardware forums, AMD is cutting the integer pipelines from 3 ALU + 3 AGU in the current K10 to 2 ALU + 2 AGU x 2 with the scheduler. So each module will have fewer integer pipelines than the K10, and they will need to improve the performance of the integer pipes to keep up with the performance they are getting from 3 + 3 now. The 3 + 3 might not be used all of the time, but they will lose some percentage of overall single-thread performance.

      I hope they will explain why they wouldn’t want to share 4 + 4 or 5 + 5 pipelines and just use hyperthreading.

        • mesyn191
        • 9 years ago

        The performance loss is tiny, something like 6% or so, IIRC. They’ve improved IPC in other ways to more than make up for it (e.g. dual branch predictors, one slow and one fast, that no longer stall on errors).

        • Anonymous Coward
        • 9 years ago

        Well, they claimed that 3rd ALU was useless anyway, so really it’s just the loss of one integer unit. Perhaps the narrower cores can manage a slightly higher clock speed and make up the slight loss on average. Of course, if they manage to get enough integer threads active, it’s a definite win.

          • tfp
          • 9 years ago

          Agreed it should be interesting to see.

        • moritzgedig
        • 9 years ago

        If dragmor (#59) (http://www.techreport.com/ja.zz?id=502979) is right, it should not make much difference in reality. Apparently they made it 3 because it didn’t make much difference per core and they had the decode capability anyhow.

        • moritzgedig
        • 9 years ago

        /[

      • moritzgedig
      • 9 years ago

      check out my reply: #125
      there is some reason regarding the register file.

    • ronch
    • 9 years ago

    “..McKinney refused to answer certain questions about the architecture, too..”

    What say we resort to extortion, eh? I’m dying to know more about Bulldozer, as I’ve been waiting for it for years. Can’t wait to see a die shot of this baby.

    “..first chips are already back from the fab and up and running inside of AMD..”

    Let’s hope it doesn’t have another TLB bug that’ll send AMD back to the drawing boards.

    “..and we expect compatibility with Socket AM3 on the desktop, as well..”

    Good news!

    “.. SMT… That’s the approach Intel uses its current, Nehalem-derived processors. CMP, or chip-level multiprocessing, is just cramming multiple cores on a single chip, as AMD’s current Opterons and Phenoms do… The diagram above depicts how Bulldozer might look had AMD chosen a CMP-style approach.”

    Um, Intel DOES support CMP and throws in SMT to sweeten the pot. AMD does CMP with Phenom and is now doing something close to SMT with Bulldozer, but that doesn’t mean AMD is now done with CMP: putting a bunch of these modules together can still be seen as CMP, and it would still entail some of them being under-utilized in many situations. Regardless of where you put the functional units, whether on the same core, module, etc., there will still be scenarios where they are under-utilized.

    “Estimated average of 80% of the CMP performance with much less area and power.”

    Seems to me that AMD is prioritizing manufacturing costs and power consumption over sheer performance figures. Right now the Core i7-980X (3.33GHz) trumps the Phenom II X6 1090T (3.2GHz) by roughly 20% in single-threaded apps and as much as 50% in multi-threaded apps. Let’s hope, for AMD’s sake, that they aren’t using Phenom II as the reference to improve upon, nor even what Intel has right now, but what Intel will have to compete with when Bulldozer comes out, and ultimately Intel’s tick-tock strategy. How about a ‘Knock-Knock’ strategy for AMD, where they ‘knock’ down every ‘tick’ and ‘tock’ from Intel?

    In the end though, while Bulldozer looks promising and would most likely end up in my next build, it’s all about final performance, power, and price, and how well AMD can ramp clock speeds against Sandy Bridge. On paper it’s a bit worrying that each pipeline cluster (of which there are two in a module) contains only two integer pipes and two AGU pipes. I hope AMD is correct in saying that the inclusion of micro-op fusion more than makes up for this on-paper deficiency. For everyone’s sake, I hope it’s AMD’s turn to shine this time.

      • FuturePastNow
      • 9 years ago

      It looks to me, as a complete layman, that AMD has greatly reduced the die area of a core (and slightly modified the definition of a core) so they can brute-force more performance by cramming many more cores onto a die.

        • ronch
        • 9 years ago

        Oh no. I can’t even use all 4 cores on my X4, and now they want us to get 16? Please, AMD, have mercy on us (and software programmers) already! 😛

          • Meadows
          • 9 years ago

          They don’t want us to get 16.

          • Anonymous Coward
          • 9 years ago

          There might be a stupid number of cores, but on the positive side, using only one core from each pair of them should give pretty respectable performance. I wonder how much of that FP power can be used by existing software.

            • FuturePastNow
            • 9 years ago

            l[

      • NeelyCam
      • 9 years ago

      l[

    • BaronMatrix
    • 9 years ago

    I think the key to 4 IPC per core is based on the concept of dual loads introduced with K10. If they load and execute twice the INSTR window stuff shown at citavia.blog.de makes it clear that it will be an 8 thread module

    IN THE OPTIMAL SITUATION

    while it should really eclipse MC in situations where each core can only process 3 IPC per core. The same would then apply to the AGUs, where they fire twice per cycle. The Ld/St is separate, as is the L1, so single-threaded apps can use the L2 of both cores and still MAX at 4 IPC.

    I’ve done intensive simulations – don’t ask – on both new archs and AMD has a marked advantage unless the SW is optimized for SB. In cases – wait people can force Intel to pay for re-compiles – where it’s not, Intel’s milk shake will be drank.

    • thermistor
    • 9 years ago

    The most “efficient” x86 arch from the P4/Athlon 64 days was neither… it was Banias, that is, if we’re talking non-server-type parts.

      • Joel H.
      • 9 years ago

      For what workload? Banias/Dothan was fabulous in some areas, notably weak in others. You’re making too broad a generalization.

    • MrDigi
    • 9 years ago

    “Each module has a trio of schedulers, one for each integer core and one for the FPU. And the integer cores themselves have two execution units and two address generation units each. Early Bulldozer diagrams showed four pipelines per integer core, giving the impression that the cores might have four ALUs each”

    If the integer core has two execution units and not the four the diagrams imply, then Intel still maintains a 50% integer issue advantage (3-issue). The module looks more equivalent to an SB core, so is it really an 8-core chip or a 4-core? Just like how cores are counted in GPUs. It may also explain why two memory channels are sufficient.

      • Kurotetsu
      • 9 years ago

      From what I can understand, each module is seen as a core by the OS. So a two-module BD chip will be seen as a dual-core chip to the OS, despite having 4 not-quite-complete cores internally. Though from what I can see, they become complete when handling a single thread, and then SMT gets a speed boost due to the mix of shared and dedicated hardware compared to a ‘normal’ core, in which everything is shared during SMT.

        • Fursdon
        • 9 years ago

        From AMD’s Bulldozer Blog:

        ( TLDR: Each Integer Core will be considered a ‘core’ by the OS. So: 1 module, 2 integer cores, 2 OS cores/2 threads. )

        l[

          • Goty
          • 9 years ago

          Right, but it won’t be “core vs core”, it will be “core vs module” when this is released; i.e. 2-threads per core via SMT for Intel and two threads per module via separate integer cores for AMD, with a 4-module part going up against a 4-core part. Essentially, you’re still getting four execution units from AMD vs Intel’s three, we’re just defining the term “core” differently.

            • Shining Arcanine
            • 9 years ago

            These “modules” are cores. They just have more execution resources than Intel’s cores, but they are still cores.

            Look at POWER7, which can do 4 threads at a time. It has very fat cores, but they are still cores. The fact that the PC industry considers more to be better is the impetus for AMD’s marketing department to claim that this chip has more cores than it actually has, which sounds good on paper, but in terms of performance, it is not capable of matching a true octocore design, especially one like Intel’s Nehalem EX that has SMT on top of having 8 cores for a total of 16 threads.

            • Goty
            • 9 years ago

            Actually, the industry-standard definition of a “core” includes only the integer execution units (cf. IBM, Sun), so the modules are not in fact “cores” by this definition.

            Then again, if you know the definition better than AMD’s engineers…

            ; )

      • Shining Arcanine
      • 9 years ago

      Intel’s chips are 4-issue, not 3-issue.

      These cores on paper are an enhanced version of Intel’s cores and they are therefore still cores. That makes this a quadcore chip.

        • MrDigi
        • 9 years ago

        I believe the 4th is for FP instructions, I was referring to integer instructions. I don’t know though if SB changes this.

    • Shining Arcanine
    • 9 years ago

    This is basically analogous to a quadcore Core i7 with a little more computational power per core. While it will likely mop the floor with a real quadcore Core i7, it will have a difficult time competing with Intel’s Nehalem-EX in markets where parallelism is important.

    I think that this processor will do extremely poorly as far as the server market is concerned, but I think it will do well in the gaming market. I also do not think that these concepts are particularly radical. The IBM POWER7 uses similar ideas. They are so similar that it is possible that the IBM POWER7 was the inspiration for this design.

      • Game_boy
      • 9 years ago

      Nehalem-EX is 8C/16T.

      Interlagos will be 16C/16T and will be much cheaper, priced like MC rather than the EXs. The official line is that IL will outperform MC by 50%, average case. If you add 50% to current MC benches that would crush Nehalem-EX, plus the price advantage.

      This will do very well in servers.

      • OneArmedScissor
      • 9 years ago

      Yeah, and an Athlon 64 is “basically analogous” to a Pentium 4. They are both called “CPUs,” after all.

    • Chrispy_
    • 9 years ago

    Will it run -[

      • ronch
      • 9 years ago

      LOL!

      (Ok, that’s all I have to say, but TR says a post with less than 10 characters isn’t worth anybody’s good time, so let’s make it longer.)

        • TaBoVilla
        • 9 years ago

        blank spaces? anyone?….

    • ClickClick5
    • 9 years ago

    Not that nerdy. lol

    Like me, I work with the finished product, not the making of the product. For all the computer engineers here, power to you all!

    EDIT: Reply fail to #1

    • RtFusion
    • 9 years ago

    Around 1:33AM this morning, I was poring over the first Bulldozer articles that came out. Initially, I was quite excited about this architecture. But then I remembered Barcelona (Queen FTW!!!) and how I also geeked out over its architecture bits. Then it got into reviewers’ hands and it bombed overall against the Core architecture from Intel.

    I am now cautious, even though there are a slew of improvements over an architecture that AMD has been using for like 7 years now, with tweaks and so on.

      • flip-mode
      • 9 years ago

      Barcelona was fundamentally a very decent chip but it suffered from two huge flaws: that one bug that I cannot even remember the name of was one, the other was its heat / power / inability-to-clock-higher. If that bug had never happened and the chip could have hit 3.0 GHz then it would have been a much, much better received chip.

        • RtFusion
        • 9 years ago

        The Translation Lookaside Buffer (TLB) bug, I think, is what you’re trying to remember. IIRC, it would give inaccurate results of sorts that wouldn’t play well with the OS. Anyone can correct me on that; I have a feeling I’m wrong on that bit.

        The recent Phenom II (and Athlon II) models are brilliant and if I had to get a new CPU/Mobo, it would be AMD (mostly for the AM3 socket to get a bit of Bulldozer loving).

        In my view, AMD needs a “Clawhammer/SledgeHammer” win again, and since they are still playing catch-up with release dates, Bulldozer needs to be pretty f*cking amazing at performance/watt/dollar, from mobile to HPC/server.

        As for release dates, it does worry me that they aren’t more specific about them. And if they slip deeper into Q3/Q4 2011, AMD is going to be in big trouble. IIRC, that’s when Ivy Bridge is supposed to be released.

        I am reminded of Ars’s coverage of Bulldozer: there can be ZERO margin of error in AMD’s execution of Bulldozer. We all know what happens when you are late to the party and make pretty bold claims about upcoming parts *cough* nVidia/Fermi *cough*.

        AMD really needs to walk the walk since they’ve been doing the talk for a while now.

      • Shining Arcanine
      • 9 years ago

      In all fairness, this chip should at least match Bloomfield’s performance.

        • RtFusion
        • 9 years ago

        That, I don’t think, is good enough for AMD, especially in the server and HPC markets, which are Intel’s domain.

        Remember, Sandy Bridge is supposed to be faster than Nehalem. By how much, I don’t know. AMD can’t go for “at least”; it has to do this perfectly.

          • Shining Arcanine
          • 9 years ago

          Well, AMD plans to produce 16-thread processors by putting two of these cores together on a single package like Intel did with Yorkfield. They have a chance of at least matching the performance of Nehalem EX with such processors, so not all is bad in AMD land.

            • Anonymous Coward
            • 9 years ago

            Like Intel did with Yorkfield, eh? Not like AMD did with Magny Cours?

            • Shining Arcanine
            • 9 years ago

            If you read the article, you would know that AMD plans to put two dies on a single package, which is unlike Magny Cours, but just like Yorkfield.

            • Anonymous Coward
            • 9 years ago

            And all this time I thought Magny Cours was two dies on a single package.

            • flip-mode
            • 9 years ago

            It is. ShiningA is Perpetual Epic Fail at googling. Time after time after time after time he fails to do a 15 second internet search. It’s astonishingly contemptible, really.

            • RtFusion
            • 9 years ago

            If I interpret your post correctly, I’d have to disagree that larger core counts to process more threads guarantee AMD will be back in the game. I’ve seen two reviews, from Anandtech and Bit-tech, pitting AMD’s 12-core Opterons against Intel Xeons: the 6174 vs. the Xeon 5650 (Bit-tech) and the 6175 vs. the 5670 (Anand). Both systems were dual-socket, and the Xeons did significantly better overall. Sure, there were some benchmarks where the Opterons excelled, but the Xeons still did better overall. This is why Intel commands over 90 percent of the server market while AMD has less than 7 percent.

            Big core counts are nice, but they mean crap if they can’t provide better multi-threaded performance than the competition at lower power usage, lower heat waste, and a lower price.

            In one article (DailyTech), the parts might come out in Q3/Q4 of 2011. To my knowledge, that is when most enterprises do upgrades, during the fall/winter (I read this somewhere, forgot where; correct me if I am wrong). This is also when Sandy Bridge is supposed to come in.

            Another thing that worries me is that AMD hasn’t really pushed hard into cloud computing for servers. SeaMicro recently announced a 512-core Atom server system that sucks up less power and is more dense than a Xeon system. I don’t know why AMD hasn’t said anything about something like that with Bobcat. That, in my view, is one of the very few holes in Intel’s market coverage. Everywhere else, Intel has a firm grip.

            From http://arstechnica.com/business/news/2010/08/evolution-not-revolution-a-look-at-amds-bulldozer.ars :

            "But notice that I said 'conventional server platform' above. There is one obvious gap in Intel's current suite of datacenter offerings: Intel isn't directly pursuing low-power, high-density cloud servers, and this is a gap that both ARM and startups like SeaMicro are looking to fill with very dense server offerings based on mobile technologies (e.g., physicalization solutions). If I ran AMD, I would redirect the company's effort toward building a low-cost, low-power, high-density, flash-based cloud server platform around Bobcat. Intel's Justin Rattner has admitted that for certain cloud workloads, these types of high-density solutions are superior to a monolithic server chip like Xeon. So AMD should stop obsessing over netbooks and monolithic server parts—both of these amount to fighting the last war—and just jump straight into the cloud server market that ARM is set to tackle with its upcoming Eagle part. To do this would be to attack Intel where it is weak, because Intel's current answer to this is still in the labs. Intel will probably keep puttering away at its experimental Single Chip Cloud Computer, while pushing Xeon at cloud vendors and losing rack space to ARM-based systems. AMD could jump right in with something like Bobcat and be well-established as the go-to maker of high-density x86 servers before the SCCC makes it to market."

            And from http://www.theregister.co.uk/2010/08/24/amd_hot_chips/page3.html :

            "And perhaps, just perhaps, into low-power servers. Both Freuhe and Hoepper said there were no plans for modifying Bobcat chips to run in server platforms. 'There is a lot of hype around ARM going into the server space,' concedes Hoepper, 'and Bobcat would work well here.' Fruehe says that AMD will be able to get six-core and eight-core Bulldozer chips in the 30 to 40 watt power range, which is pretty low for a server. 'The question is this,' says Fruehe. 'Is there a need for a more discrete, less-threaded chip for servers?'"

            Fruehe has a point there. But cloud computing with low power but many CPU cores could be the next big thing, and it looks like AMD is taking advantage of it.

            • shank15217
            • 9 years ago

            Lol, so if you ran AMD you would bow out of the high-performance MP sector because of a couple of benchmarks you misinterpreted at Anandtech? Never mind that AMD chips are in the top-performing supercomputers in the world, or that Magny-Cours competes very well with Intel chips on HPC loads, or that they have a clear socket upgrade path to a significantly faster chip just a year from now.

            • Shining Arcanine
            • 9 years ago

            Historically, AMD’s multiprocessor performance has outperformed Intel’s because of HyperTransport, but now Intel has QuickPath, so AMD has lost that advantage.

            • Shining Arcanine
            • 9 years ago

            I see no relationship between what I said and your response to it. Would you state in a single sentence what it is that you are trying to say?

    • derFunkenstein
    • 9 years ago

    Nice writeup. Just technical enough for me to learn something new, and easy enough to read that my eyes don’t glaze over.

      • Pettytheft
      • 9 years ago

      That’s TR in a nutshell.

      • SomeOtherGeek
      • 9 years ago

      Exactly my thoughts! Good job, TR!

      • OffBa1ance
      • 9 years ago

      +1, great job.

    • axeman
    • 9 years ago

    I’ve heard this before, but I don’t quite get it. Apparently K8 was based upon K7, so what accounted for the massive performance improvements? The integrated memory controller? K7 was getting whooped by the P4 before the Athlon 64 came out, so what did they tweak that made such a big difference? Not relevant, but someone here probably knows. The addition of x86-64 added lots of logic, but no performance in itself, right?

      • DaveJB
      • 9 years ago

      The on-board memory controller was probably the single biggest source of K8’s performance increase, but AMD improved the chip’s internal workings in just about every way, increasing the chip’s overall efficiency. In fact, it operated very close to its peak Instruction-Per-Clock rate in the vast majority of circumstances, and by that metric it was (with the exception of Nehalem) the single most efficient x86 design that has ever been produced.

      That was also the reason why K10 was such a disappointment, sadly – had it been as IPC efficient as K8, then it would have absolutely murdered the Core 2 chips in just about every benchmark and would still be better than Nehalem on floating point related stuff, but for whatever reason K10’s IPC efficiency was vastly behind its predecessor. Hopefully, Bulldozer will be a lot better in this regard.

        • data8504
        • 9 years ago

        I am very sorry to be a bit of a jerk here, but I feel you really need to check your facts:
        – First of all, “on-board” memory controller is completely incorrect. The iMC was “on die,” and this is not a trivial difference. A device is “on board” if it is literally “on the board.”
        – While no one disputes the decrease in latency from the LLC (“last-level cache”) to DRAM from integrating the MC, to say “In fact, it operated very close to its peak Instruction-Per-Clock rate in the vast majority of circumstances” is so far from the truth that you really start showing your colors. CPUs live in starvation, and the theoretical “peak IPC” is completely unattainable. A CPU is starved even living out of L2 cache – the latency from full-speed L2 to full-speed L1 is still non-negligible. This is why prefetching is so important. It is well known that every CPU fights starvation (with prefetching).
        – What on earth is your second paragraph talking about?! Listen, I don’t know a competitor’s product, so I’m not going to comment on one generation versus their next, but I want you to really sit and think about what you mean by “IPC efficiency.” Just as an academic exercise, tell me how ID width, pipeline length, speculative execution, cache prefetching, and superscalar resource allocation impact CPU performance and THEN tell me in which of those areas Core 2 (Merom/Conroe) lacked. I’m sorry, but you really do seem to be “fighting the good fight.” Let’s talk specs.
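        To put a number on the starvation point, here is a toy average-memory-access-time calculation; the hit rates and latencies below are made-up illustrative figures of mine, not measurements of any real chip:

```python
# Toy average-memory-access-time (AMAT) model for a two-level cache + DRAM.
# All hit rates and latencies are hypothetical, for illustration only.

def amat(l1_hit, l1_lat, l2_hit, l2_lat, mem_lat):
    """Expected cycles per memory access."""
    return (l1_hit * l1_lat
            + (1 - l1_hit) * (l2_hit * l2_lat + (1 - l2_hit) * mem_lat))

# Hypothetical numbers: 3-cycle L1, 15-cycle L2, 200-cycle DRAM.
cycles = amat(l1_hit=0.95, l1_lat=3, l2_hit=0.90, l2_lat=15, mem_lat=200)
print(f"average access: {cycles:.2f} cycles")
```

        Even with 95% L1 hits, the average access is several cycles, which is why a wide core starves without prefetching.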

        [yeah, full disclosure, I work for Intel.]

          • kc77
          • 9 years ago

          l[

            • Spotpuff
            • 9 years ago

            I absolutely hate it when people say things that are incorrect and then the person calling them out on it is called the jerk. Misinforming people is worse.

            It’s one thing to say “I think…” and it’s another to state as if certain an answer to the question.

            At the same time, this is the internet, and no one’s backing up anything with any facts or articles or examples, so I’ll just chalk this up to one more internet fight.

            • kc77
            • 9 years ago

            There is a difference between misinforming and not knowing, which is why a gentle correction would have worked here, not beating the poster over the head. Did that help anything? Nope. Do you think it was constructive, other than attempting to gain ego points? I don’t think so.

            • djgandy
            • 9 years ago

            If you don’t know, why pretend you do? That is why “I think” works as a good prefix when you are speculating.

            What just happened here is that someone who has seen some acronyms on various forums tried to tie them all together in a sentence to sound smart. Then someone with in-depth knowledge of the subject came and handed them their ass on a plate.

            • NeelyCam
            • 9 years ago

            Yeah, but the ass could’ve been handed over on a plate in a nicer manner.

            • NeelyCam
            • 9 years ago

            What I hate is those with knowledge boosting their ego by violently putting down people who are less informed.

            • Spotpuff
            • 9 years ago

            Maybe he should have used more touchy feely words, but I didn’t see it as an attack on the OP, just as a nudge in the right direction. He gave the OP some points to consider about processor efficiency beyond IPC.

            He never resorted to ad hominem attacks so other than being seemingly corrected by someone on the internet the OP should be happy someone with more knowledge took the time to explain why his original ideas were wrong.

            Of course everyone’s going to take the information in the post differently, but people seem to be split on how to handle someone who doesn’t know what they are talking about.

            http://xkcd.com/386/

            • data8504
            • 9 years ago

            My apologies for coming across more abrasively than I would’ve liked – I certainly never meant to stir the hornet’s nest this much.

            On an unrelated note though: Spotpuff – Notice the irony of linking xkcd comic #386 to an Intel thread?? 🙂 (and thanks for the hand there, too)

            • NeelyCam
            • 9 years ago

            Apology accepted..
            No worries, lol. Constructive confrontation means different things to different people.

          • NeelyCam
          • 9 years ago

          You’re right – you are a jerk. And incredibly sensitive about Core2Duo… Pisses you off that Nehalem was better? Why didn’t Conroe integrate the memory controller..? Oh – that’s right: because CSI/QPI was botched over and over, and you guys were too proud to jump into HT.

          Good job.

      • data8504
      • 9 years ago

      SSE2.

      • jdaven
      • 9 years ago

      I still see 4 different architectures since the original Pentium days.

      K5/K6 – too far back to remember specifics
      K7 – DEC guys
      K8-K10.5 – integrated memory controller, 64-bit, multiple cores
      Bulldozer

        • Shining Arcanine
        • 9 years ago

        AMD should have called the chip after the K8, the K9. It ran like a dog in comparison to the K8.

        • ronch
        • 9 years ago

        Hi. K5 and K6 were very different cores. K5 was AMD’s first in-house CPU design. In 1996, AMD acquired NexGen and renamed their Nx686 as the AMD K6. Even if you look at the die shots you’ll see K5 and K6 are really very different except for the way they translate CISC to RISC (dubbed RISC86 by NexGen) and back to CISC again.

        Check out http://www.sandpile.org.

      • flip-mode
      • 9 years ago

      K7 whooped by P4? How? On a per-clock basis the K7 was much stronger than the P4, IIRC. But the K7 didn’t clock as high, so the P4 was able to pull ahead in total delivered performance.

        • HurgyMcGurgyGurg
        • 9 years ago

        I think the general consensus is still that K7 won that round.

        • axeman
        • 9 years ago

        Yeah, but whereas an Athlon XP at 2.0GHz wasn’t faster than, say, a P4 3.0C, the Athlon 64 at 2.0GHz beat it handily. I’m not talking IPC per se, just overall performance.

          • HurgyMcGurgyGurg
          • 9 years ago

          True. AMD had some strong years that I’m probably coloring a bit too positively now.

          • Goty
          • 9 years ago

          The execution engine itself (I mean the sum of all the execution units) was widened and tweaked, but each individual component wasn’t largely changed from the K7 architecture.

            • Anonymous Coward
            • 9 years ago

            I’m pretty sure that K7, K8 and K10 all have essentially the same execution units.

            • Goty
            • 9 years ago

            Which was exactly my point.

      • WaltC
      • 9 years ago

      /[

        • axeman
        • 9 years ago

        I don’t know why I stated the P4 whooped the K7; overstating as usual, I suppose. The only time the P4 really had the overall performance crown was with Northwood, as you stated, and then only decisively with the 800MHz FSB variants, which in hindsight wasn’t that long before the A64 came out.

        Sorry for stating otherwise; I was rooting for AMD in those days, to be sure. I never owned a P4 – a CPU that needed 50% more clock cycles for only a slight overall performance increase never sat well with me.

    • MikeA
    • 9 years ago

    “The decoders convert complex x86 instructions into the CPU’s simpler internal instructions. Bulldozer has four of these, while Barcelona and Nehalem have three.”

    This is wrong, Nehalem has 4 decoders not 3.

      • Damage
      • 9 years ago

      Totally my bad. You’re right. Updated the article.

    • jdaven
    • 9 years ago

    “In fact, Interlagos will likely be comprised of two Valencia chips on a single package, in an arrangement much like the present “Magny-Cours” Opterons.”

    I thought the whole point of Bulldozer was the modular architecture. In my mind that means no more gluing two complete chips together. You just put the number of modules you want to match a certain performance and power consumption level.

    This is the same as GPU’s where repeating little ‘processors’ are stacked against each other surrounded by shared/common architecture components.

    So Interlagos is 8 modules ‘pulled from a bin’ and placed on a substrate. It is not in my mind 2 of anything (i.e. Valencia). I think we have to leave the world of gluing complete chips together when it comes to AMD’s future architecture. If I’m reading everything correctly, it no longer exists for them. Just modules surrounded by a common architecture.

      • Goty
      • 9 years ago

      Yes, you can glue more modules onto (into?) the die, but you’ve got to remember that they’re all still sharing some resources. There would definitely be diminishing returns as you added more modules.

      Also, you’ve got to consider die sizes and yields. A chip that is twice as large would yield less than half as well (hyperbole, I don’t know the exact numbers) due to wasted space near the edge of the wafer and the distribution of defects across the wafer. By taking two smaller chips and gluing them together, you effectively increase the yields for the same product.
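      A toy Poisson defect-yield model (a standard first-order approximation; the defect density below is made up) shows how this works: doubling die area squares the yield, so the big die yields less than half as well only once the small die is already below 50%:

```python
import math

# Toy Poisson defect-yield model: yield = exp(-D * A), where D is defect
# density and A is die area. The numbers are hypothetical, for illustration.

def poisson_yield(defect_density, area):
    """Fraction of dice with zero defects under a Poisson distribution."""
    return math.exp(-defect_density * area)

D = 0.5                          # defects per cm^2 (made up)
small = poisson_yield(D, 1.0)    # 1 cm^2 die
big = poisson_yield(D, 2.0)      # die twice as large
print(f"small die yield: {small:.1%}, doubled die: {big:.1%}")
# Doubling area squares the yield, since exp(-2DA) = exp(-DA)^2.
```

      Either way, two glued small dice recover most of what one big die loses.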

        • Game_boy
        • 9 years ago

        Yes, and you can also take one design (the four-module die) and use it for both desktop and server products by doubling up on dies on the high-end server. Designing a new eight-module die just for 4P server may not give good enough returns.

      • FuturePastNow
      • 9 years ago

      I’m sure they could design a die with 8 or 16 or however many modules, but then they have to make it, without any defects, on a manufacturing process GloFo has never used before.

    • emvath
    • 9 years ago

    So….it will be better?…..

      • Xenolith
      • 9 years ago

      Definitive Yes.

      Better questions … Will it be competitive? Will it find a market?

      • Meadows
      • 9 years ago

      No, they’re spending years on a chip that ends up worse.

        • ronch
        • 9 years ago

        I wish they’d spent one more ALU and one more AGU per integer cluster, giving each cluster in a module three ALUs and three AGUs. That way, if there are two threads running on the module, each one gets almost Phenom II-like performance. And if only one thread is running, the scheduler should be smart enough to distribute the work (if that’s possible).

        This shouldn’t be hard, because according to AMD, adding the second integer cluster only takes up 12% more die space, so, theoretically, adding another ALU and AGU to each cluster should take roughly that much space again. No way to tell how many more transistors would go to the schedulers and decoders to feed such a wider path, but perhaps it’s not gonna cost AMD an arm and a leg. And they seriously have to take back the performance crown, or people like me will think they don’t have the engineering depth.

        K5 was AMD’s first in-house design. It flopped.
        K6 was acquired along with the NexGen purchase back in 1996. AMD bolted on an FPU and made it work with Socket 7.
        K7 was a lucky break when a band of ex-DEC engineers, led by Dirk Meyer, joined AMD and created the K7.
        K8 was a K7 with Hypertransport, an IMC, 64-bit, and some tweaks.
        K9 was… what, two K8’s joined together? Ok, so that deserves the move from ‘8’ to ‘9’ /sarcasm
        K10 is four tweaked K8’s on a monolithic die, with an L3 cache, improved Hypertransport, a cut-and-paste 2nd FPU (check out die shots), some tweaks, and aggressive pricing to make up for lower performance.
        K11 (a.k.a. Bulldozer) – Let’s hope this will at least be very competitive with Sandy Bridge.

        • TO11MTM
        • 9 years ago

        Prescott?

        Oh… you mean AMD will be.

    • Buub
    • 9 years ago

    Cool deal. Mirrors some of the earlier stuff we’ve heard, but it’s nice to see it spelled out in more detail.

    This should definitely be more performant than SMT, where a single core attempts to run two threads.

    • ssidbroadcast
    • 9 years ago

    Well, most of this technical stuff flies over my head, but it’s nice to see AMD finally tip their hand a bit…

      • sweatshopking
      • 9 years ago

      what technical stuff? you’re a nerd, aren’t you?

        • HurgyMcGurgyGurg
        • 9 years ago

        You can still not know what “PRF based register renaming” is and be a nerd.

        • ronch
        • 9 years ago

        Sure, this is technical. Just ask your grandma to read this and see what she tells you.

        Not at the engineering level, perhaps, but technical, still.
