Why did Bulldozer disappoint? Some possible answers

AMD’s "Bulldozer" microarchitecture has been something of a disappointment, particularly in the FX desktop processors, where it doesn’t consistently outperform AMD’s prior-generation Phenom II chips. Since Bulldozer is the first full refresh of AMD’s primary x86 architecture in many years, we’ve been left with lots of questions about why, exactly, the new microarchitecture hasn’t performed up to expectations.

There are some obvious contributors, including lower-than-expected clock speeds and thread scheduling problems. Then again, using the Microsoft patches for Bulldozer scheduling didn’t seem to help much during the testing for our Ivy Bridge review.

Some folks have speculated about one or two very specific problems with Bulldozer chips—such as relatively high cache latencies—being the culprit, which offered hope for a quick fix. However, the host of improvements AMD made to the "Piledriver" cores in its Trinity APU only offered gains of 1% or less each in per-clock instruction throughput, yielding relatively modest progress overall. There was no one big change that fixed everything.

Now, Johan De Gelas has shed a little more light on the Bulldozer mystery with a careful analysis of Opteron performance in various server-oriented workloads, and his take is very much worth reading. He offers some intriguing possible reasons for Bulldozer’s weak performance in certain scenarios, and those reasons aren’t just cache latencies. Instead, he pinpoints this architecture’s inability to hide its branch misprediction penalty, the low associativity of the L1 instruction cache, and—yep—a focus on server workloads as the most likely problem areas. There is hope for future IPC improvements, but some of those will have to happen in the generation beyond Piledriver, whose outlines we already know.
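To make the mispredict point concrete, consider a toy microbenchmark along these lines (a sketch of the standard demonstration, with illustrative sizes; this isn’t taken from De Gelas’ profiling). The very same loop runs far slower once its branch becomes unpredictable, and the longer the pipeline’s flush-and-refill, the wider the gap:

[code<]/* Toy demonstration of branch-mispredict cost: the same loop runs much
 * faster over sorted data (predictable branch) than over shuffled data
 * (~50% mispredict rate). Absolute numbers are machine-dependent. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static double time_sum(const int *v) {
    clock_t t0 = clock();
    long long sum = 0;
    for (int pass = 0; pass < 100; pass++)
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)               /* data-dependent branch */
                sum += v[i];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    printf("(checksum %lld) ", sum);       /* keeps the compiler from deleting the loop */
    return secs;
}

int main(void) {
    int *v = malloc(N * sizeof *v);
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;               /* random values: the branch is ~50/50 */
    double shuffled = time_sum(v);
    qsort(v, N, sizeof *v, cmp_int);       /* sorted: the same branch becomes predictable */
    double sorted = time_sum(v);
    printf("\nshuffled %.2f s vs. sorted %.2f s\n", shuffled, sorted);
    free(v);
    return 0;
}[/code<]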

Comments closed
    • ish718
    • 7 years ago

    AMD should of just Refined K10 and made a Phenom III x8 on 32nm.

      • SPOOFE
      • 7 years ago

      “Should of” is not a contraction of “should have”.

        • clone
        • 7 years ago

        seyla.

    • Krogoth
    • 7 years ago

    Bulldozer’s problem is that it was built to be a server chip for areas where having tons of threads running is king. That doesn’t happen in the desktop environment. Bulldozer had to make several design trade-offs in order to be an effective server chip, and the desktop version is stuck with those trade-offs. That is why Bulldozer is somewhat worse than its Istanbul-based predecessors (Phenom II) at desktop-related applications. It clearly falls behind Sandy Bridge and Ivy Bridge.

      • ColeLT1
      • 7 years ago

      Even as a server chip, it is still disappointing. I just put in three R720 servers with 2x Sandy Bridge E5-2660s for my VM farm instead.

    • Arclight
    • 7 years ago

    I was wondering…. Y u no more comics?

      • ClickClick5
      • 7 years ago

      I have asked that many times before and no response. 🙁

    • BaronMatrix
    • 7 years ago

    AMD’s big problem is that people act like there are 50 x86 CPU manufacturers and AMD is at the bottom. They have EXACTLY ONE predatory anti-trust company to deal with.

    PERIOD!

      • Ringofett
      • 7 years ago

      Intel not being very friendly doesn’t boost Bulldozer’s performance, though, and that’s probably what matters to most of the people reading sites like this.

        • BaronMatrix
        • 7 years ago

        And I would never imply such, but when a company does low-down dirty things to limit your market share, it costs money that you could be spending on additional R&D, engineers, or even fab equipment. So yes, they are affecting your perf.

      • NeelyCam
      • 7 years ago

      How about you go back to S|A where this kind of mindless AMD fanboi trolling is appreciated and even encouraged.

        • Fighterpilot
        • 7 years ago

        Ahh….the irony.
        You mean kinda like the usual performance you put on over at S/A and elsewhere?

          • NeelyCam
          • 7 years ago

          +1 Exactly!!

          Actually, I only comment on two sites these days, and I save the Level10 trolling for S|A. To be honest, I respect the discussion here more, and that’s why I try to minimize my trolling at TR.

          S|A comment section, on the other hand, is overrun by rabid trolldogs, so I don’t feel bad being a rabid trolldog there.

            • BaronMatrix
            • 7 years ago

            What is this trolling thing? This article is about issues with BD perf. I’d say a lot of the issues are the AVX\FMAC games Intel played for two years knowing they couldn’t redesign their FPU for FMAC before Haswell.

            Are you saying that changing ISAs midstream should have been easy?

        • BaronMatrix
        • 7 years ago

        Speaking to me with such disrespect is not healthy. Intel is an anti-trust criminal and the fact that an impartial person wouldn’t put anything past them should tell you you’re getting ripped off.

        I don’t troll, I make truthful factual statements. Are you saying Intel didn’t pay Dell’s bills to keep them from selling AMD? Are you saying EVERY MAJOR market hasn’t convicted them? Are you saying they have never been found CHEATING ON BENCHMARKS?

        I’m trying to figure out how to sue them for taking away freedom of choice with UltraBooks.

    • clone
    • 7 years ago

    thanks very much for linking the article, made for an interesting read and confirmed a few superficial observations while also supporting a few others.

    1: Bulldozer is server focused from top to bottom and, to go further, AMD seems to be leaving desktop behind.
    2: while there are flaws in the architecture, the foundation is sound and it has real promise… in the server world.
    3: while it took forever (12 years to arrive), Bulldozer’s largest issue is its lack of refinement throughout.
    4: a potential winner in 18 months isn’t that long… but it’s longer than I’ll wait, and with Intel constantly moving the bar I’ll still be interested; but given it’s clear Bulldozer was never meant for desktop, even if fixed it’ll just be a really great server CPU, which won’t interest me.

    with this in mind why not at least a die shrunk, clock increased series of X6’s with whatever tweaks are left that can be done until something better makes it to the street?

      • Anonymous Coward
      • 7 years ago

      [quote<]with this in mind why not at least a die shrunk, clock increased series of X6's with whatever tweaks are left that can be done until something better makes it to the street?[/quote<]

      Perhaps you are under the impression that AMD's 32nm "X4" was better than the 45nm version.

        • Deanjo
        • 7 years ago

        In terms of power consumption and features at the same clock speed, they were substantially better.

        • clone
        • 7 years ago

        likely what I meant was that I wished AMD could have further improved the X6’s and offered them as an option for desktop use while using bulldozer as a server cpu…. which it is.

        at the moment AMD has no worthy desktop option save platforms built around integrated graphics.

    • kamikaziechameleon
    • 7 years ago

    I bought into all the buzz, got an AM3+ compatible board and a 1090T x6, and am left wishing I’d gone Intel that gen or held onto my money, heck maybe just bought a couple SSDs… Oh if I had that cash back in my pocket, oh if I still lived with my parents and had disposable income… /looks at feet/

      • Bensam123
      • 7 years ago

      Why? What’s wrong with the 1090T?

      You didn’t buy a Bulldozer and the Phenom2 x6s were very respectable for the cost at the time. They still are a pretty darn good deal.

        • Deanjo
        • 7 years ago

        Exactly, the X6’s are a damn good buy (a better one than the FX’s IMHO).

    • Stranger
    • 7 years ago

    Since I’m still postulating and all, I’m going to expand on what I think happened at AMD. When AMD began to develop the fab, the people at AMD/what would become GloFo laid out a certain set of transistor properties. Then, as Bulldozer began to tape out, it was found that the specs didn’t live up to promises. I get the impression that a single Bulldozer module was supposed to be comparable to a single K8/K10 core tuned for higher clocks, all the while having the same power budget and a slightly bigger die space. When it became clear that Bulldozer was running way hotter than it should have, the board fired the CEO. Then, to try and reduce power consumption, they cut the number of modules by a third (maybe more), increased the L2 to 2 MB, and lowered the clock speed. There’s no reason to make a processor with as much clock headroom as Bulldozer has and not use it. All that headroom has a penalty in the form of energy wasted on extra pipeline stages.

    Edit: since I’m still rambling, about the P4 (non-Prescott)… I think the P4 was a good but flawed design, and the reasons it didn’t work in practice were far more subtle and process-related than conceptual. The 180nm rollout was a giant clusterfuck. It was one of the first process generations that really suffered from power problems. No one intentionally designs a processor to run hot. There’s a trade-off to be made between increasing performance through higher clock speeds and the power penalty that comes along with the extra pipeline stages, and when the process doesn’t match what was expected, it completely fucks things up. To make things worse, processor designs are laid out years in advance. Concepts for processors similar to Bulldozer have been seen in patents dating to the early 2000s (10 years ago), and most design decisions are made, at the latest, about two years before the processor hits the market. That means these companies are designing processors for processes that don’t even exist yet.

    The P4 was a bleeding-edge design that incorporated a whole lot of leading-edge ideas, including the trace cache, which has morphed into the loop-unrolling cache (whatever Intel calls it); a cache hierarchy that largely still exists and is still blazing fast; and a branch predictor that AMD is only now catching up with, as well as many more ideas.

    In many ways AMD got lucky that their more conservative design happened to fit the power profile better than the P4.

    Edit: went to double-check the P4 history before running my mouth too much. It was across the 180nm-90nm processes that the power problem successively got worse. I’ll leave this for a historical perspective.

    en.wikipedia.org/wiki/Pentium_4

      • Bensam123
      • 7 years ago

      Popcorn eating monologue with some far stretches based on cobbling together loosely related things.

        • sweatshopking
        • 7 years ago

        Do you have any friends in real life?

          • Bensam123
          • 7 years ago

           More than you, apparently, as I don’t need to wreck other people’s sand castles to feel good about myself.

            • sweatshopking
            • 7 years ago

            i’m sure they’d have to be rich and important, otherwise they wouldn’t be worth your time. amirite?

            • clone
            • 7 years ago

            lol.

    • ludi
    • 7 years ago

    Ah, good old Fred Silver. I know the commentariat here was pretty hard on him for his hit-or-miss storytelling, but that one there was definitely a clean hit.

      • Meadows
      • 7 years ago

      Come to think of it, what happened to him?

        • sweatshopking
        • 7 years ago

        idk, but I’d love to see more of his stuff

          • Meadows
          • 7 years ago

          Not me, I never “got” his stuff.

      • I.S.T.
      • 7 years ago

      Agreed. This was actually funny.

    • Tristan
    • 7 years ago

    Intel is able to make cores that perform excellently on desktops and servers.
    Bulldozer fails everywhere. It isn’t possible to apply some tweaks to improve IPC; it will be faster and cheaper to design a new architecture.

      • Deanjo
      • 7 years ago

      Or just do what Intel did with the Core series: look back to what actually worked (back to the PIII days, in Intel’s case) and build on that.

    • rrr
    • 7 years ago

    Trinity seems to perform well, and it’s based on a more refined version of Bulldozer, so I guess we can brush this off and move on. Still, the FX-4100 isn’t too terrible for the money.

      • Rand
      • 7 years ago

      The FX-4100 has a lot of trouble competing against relatively cheap Phenom II’s and Athlon II X4’s. It’s less appealing than the FX-8150 relative to the competition, IMO.

        • FuturePastNow
        • 7 years ago

        I agree with you, there. The FX-81x0 processors, with all four modules, do deliver decent performance in highly multithreaded workstation tasks like photo editing and video encoding. There’s a slight advantage there, at least compared to an i5 with no hyperthreading.

        With only two modules, the FX-41x0 models don’t have strong single [i<]or[/i<] multithreaded performance. I think they’re less appealing than an i3 at the same price.

        • rrr
        • 7 years ago

        Except those are becoming scarce in some places, which drives prices up.

          • NeelyCam
          • 7 years ago

          That doesn’t make sense… If their prices go up, they become less appealing compared to Intel’s offerings. Lack of availability [i<]should[/i<] drive prices up only if there are no other options.

            • rrr
            • 7 years ago

             I don’t have to wonder whether it makes sense when I can clearly see it being done by retailers. The only time products with low availability are marked down is when the retailer has ZERO of them in stock, to attract customers; then when (sorry: if) they get some, they yank the price up.

    • lycium
    • 7 years ago

    For rendering/graphics applications, Intel simply has more FP power on the chip, whereas Bulldozer optimised for integer power; it doesn’t hurt that Intel’s memory subsystem is much faster too (large low-latency caches, up to quad-channel memory controllers).

    • Celess
    • 7 years ago

    I think there is another sort of serious misquote here, or appearance thereof:

    The base Anand article states, to paraphrase, that the branch predictor saw only a 1% improvement because it was already pretty good, but that the mispredict penalty is very large, and that the predictor is better in “Bulldozer” than in previous designs.

    It could be argued that here this somehow translates to:

    However, the host of improvements AMD made to the “Piledriver” cores in its Trinity APU only offered gains of 1% or less each in per-clock instruction throughput

    The referenced Trinity article, TR’s own, doesn’t seem to have any direct per-clock comparisons and, as far as I can tell, is largely comparing Llano with Trinity. The referenced Anand article is comparing the Bulldozer Opteron and the previous-gen Opteron. Server-class system comparisons being very, very tricky things aside, I’m hoping the 1% didn’t come from there. I’m not much for the fuel-to-the-fire thing just to watch another trashy “comments” supposition war. There’s no reference, and that bit is the cornerstone lead-in piece. This article seems like a fuel-to-the-fire thing, with NYT-style page 3 supporting cartoon and all.

    What I’m mostly trying to say here is that what may have started as an article meant as “here’s finally some decent rationale for you guys as to what’s going on with Bulldozer” somehow ends up seeming fairly sensationalistic to me. Not used to seeing that here.

    Mr. Mouse: “Trinity doesn’t even have L3 cache. A 1% improvement over BD with L3 cache may be better than it appears.” Of course this parrots the 1% already, but it astutely reminds us that there is no such animal to compare yet.

    • Stranger
    • 7 years ago

    I think the biggest problem is that Intel is just so far ahead in process development. Intel is easily 18+ months ahead of everyone else. Imagine if AMD only had to compete with processors from 1.5 to 2 years ago… When was the last time anyone heard of Intel having the same kind of process issues that Nvidia or AMD has had? The reason is that Intel is so far ahead in process development that they can afford to wait until a process has been figured out before pushing it out. Intel’s R&D budget is almost as large as all the money made by AMD… The fact that AMD is able to compete at all is a minor miracle.

    Also, the original P4 wasn’t terrible. Once the P4 started to hit 2.4 GHz, it was seriously starting to put the hurt on AMD. What I suspect happened to AMD was that the delivered process was not what was promised; in particular, it ran hotter than expected. If it weren’t for how hot Bulldozer runs, it would be able to hit significantly higher clock speeds.

    I think AMD was trying to imitate a POWER6/7 style of processor. The POWER6 hit 5 GHz five years ago.

      • BobbinThreadbare
      • 7 years ago

      Bulldozer is worse than the Phenom II in many ways, and that has nothing to do with Intel’s process advantage.

      • cygnus1
      • 7 years ago

      One of the reasons Intel can do that is the sheer number of fabs it operates. That gives them the manufacturing capacity to always have multiple fabs developing the next processes. I don’t think any other manufacturer has that capability.

      • Deanjo
      • 7 years ago

        [quote<]Intel's R&D budget is almost as large as all the money made by AMD... The fact that AMD is able to compete at all is a minor miracle.[/quote<]

        What is sad is that AMD went "asset light" so they could compete on R&D. Unfortunately the last time AMD was competitive on the processor front was when they still ran their own fabs.

        • BobbinThreadbare
        • 7 years ago

        Somewhere Jerry Sanders nods knowingly.

        • ludi
        • 7 years ago

        That was because AMD inherited a small gold mine when they picked up the remnants of the Alpha team, and that gold got all spent on the K7 architecture. The only reason they even got this far was because, after trying to figure out “where to next”, they went with the compatible-yet-better route in developing x86-64 and picked up Microsoft’s support, even while Intel was chasing a dead-end called IA-64. Now we’re all using x86-64 and the gold is all mined out. Having a fab, or not, won’t make a difference.

          • ermo
          • 7 years ago

          In addition, it is accepted wisdom at this point that the Alpha design was so successful in part because it relied heavily on hand-tuning the design to match the process capabilities, thereby attaining higher-than-average clock frequencies compared to the competition, and hence earning a good reputation in particular with physicists in need of high performance.

          With Bulldozer, I believe there were rumours afloat from a supposed ex-AMD engineer that AMD had done less hand tuning and relied more on automated design tools, which could also be part of the reason that BD didn’t scale as projected.

          It’ll be interesting to see if future BD-derived designs will end up getting more L1 cache associativity, a µop instruction cache to help hide/offset the misprediction penalty, and hand tweaking to make better use of the available process node in terms of clock scaling.
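          For a sense of scale on the associativity point (an illustration worked from the published cache organization, not from the linked analysis): Bulldozer’s shared L1 instruction cache is 64 KB and 2-way associative, so with 64-byte lines it has 64 KB / (2 × 64 B) = 512 sets, and any given address can live in only two of them. With two threads fetching through it at once, hot code that happens to map to the same set keeps evicting itself, which is exactly the thrashing that more ways (or a µop cache) would relieve.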

        • Stranger
        • 7 years ago

        Unfortunately, they couldn’t afford it even then, and the price of a new fab, or even just upgrading an old one, has continued to march higher. No matter how successful AMD was in the past, they have never been massively profitable. Even when they were trouncing Intel in performance by 50 to 60%, they were just barely profitable. And since that point Intel has significantly ramped up R&D spending. At this point, for AMD to catch up, something catastrophic has to happen to Intel, à la a war in the Middle East blowing up Intel’s fabs. To make it worse, the only way AMD has been profitable the last few years is by slowly selling off their fab. Through all those years, AMD was never big enough to buy the additional fab it needed to really become profitable. The only way to make lots of money in commodity semiconductors is to do things on a monstrous scale. While AMD had to sell its soul (aka its fab) to get here, at least it now has access to two whole fabs, with the possibility of a third coming sometime in the vague future. With that, they would theoretically have a vaguely comparable amount of production compared to Intel.

        [url<]http://www.electronicsweekly.com/blogs/david-manners-semiconductor-blog/2007/04/can-the-chip-industry-afford-i.html[/url<]

          • Deanjo
          • 7 years ago

           [quote<]it now has access to two whole fabs with the possibility of a third coming sometime in the vague future.[/quote<]

           They MIGHT have access to the fabs. AMD as of a couple of months ago is just another potential customer to Global now that all exclusivity agreements have been nullified. AMD could just as easily in a few years find themselves searching for someone to build their products.

            • BobbinThreadbare
            • 7 years ago

            The exclusivity agreement wasn’t nullified. AMD got permission to build certain chips at other foundries.

            • Deanjo
            • 7 years ago

             The only thing that AMD has with them now is a wafer guarantee for the next few years. Global is free to pursue other clients, and with that, AMD could easily find themselves being squeezed out over the next few years as that agreement comes closer to expiring (28 nm being the last of the guarantee, IIRC).

    • brucethemoose
    • 7 years ago

    Trinity doesn’t even have L3 cache. A 1% improvement over BD with L3 cache may be better than it appears.

      • swaaye
      • 7 years ago

      I’ve been thinking about this too, but doesn’t a larger L2 often help with weakly threaded apps? That didn’t really appear with Trinity.

        • kalelovil
        • 7 years ago

        Trinity has the same per-integer-core L2 cache as Bulldozer and Llano.
        It isn’t appreciably faster than Bulldozer’s L2 cache either; I’m guessing that has to wait until the Steamroller core.

      • kc77
      • 7 years ago

      It is, but not from IPC alone. It’s perf/W where Trinity makes huge gains. Look at the power-use graphs in TR’s review of the Trinity laptop: power draw has fallen dramatically despite the larger GPU and the increase in clock speed. This should translate pretty well to desktops and servers, because their TDP ceilings are much higher; we should see increases in clock speed with power either falling or staying the same.

      • Anonymous Coward
      • 7 years ago

      It certainly is better for AMD’s bottom line to get a 1% boost while dropping a huge L3.

    • BobbinThreadbare
    • 7 years ago

    It sounds like L2 cache latency is one of the problems with most enthusiast workloads.

    [quote<]We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. It was after all one of the reasons why Nehalem was not much faster than the older Penryn based CPUs. Lowly threaded desktop applications run best in a large, low latency L2 cache.[/quote<]
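    For anyone who wants to see that sensitivity directly, the standard trick is a pointer-chase microbenchmark, where every load depends on the previous one. Here’s a minimal sketch (the sizes and iteration counts are illustrative, and this isn’t from the linked article):

    [code<]/* Pointer-chase sketch: each load depends on the previous one, so the
     * time per step approximates load-to-use latency for whatever cache
     * level the working set fits in. Sizes here are illustrative. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t n = 64 * 1024 / sizeof(size_t);  /* ~64 KB working set */
        size_t *next = malloc(n * sizeof *next);

        /* Sattolo's algorithm: shuffle the identity into one big cycle so
         * the chain visits every slot and prefetchers can't guess it. */
        for (size_t i = 0; i < n; i++) next[i] = i;
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;            /* j < i */
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }

        const long steps = 200 * 1000 * 1000;
        size_t idx = 0;
        clock_t t0 = clock();
        for (long s = 0; s < steps; s++)
            idx = next[idx];                          /* serialized dependent loads */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("(%zu) %.2f ns per load\n", idx, secs / steps * 1e9);
        free(next);
        return 0;
    }[/code<]

    Rerun it with the working set at 8 KB, 64 KB, and a few MB, and the per-load time steps up at each cache boundary.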

      • Damage
      • 7 years ago

      Ok, noted and tweaked the text a little. However, his analysis and actual profiling really look at other issues. Also, the Piledriver tweaks do improve the L2 cache a little.

    • OneArmedScissor
    • 7 years ago

    [quote<]He offers some intriguing possible reasons for Bulldozer's weak performance in certain scenarios, and those reasons have little to do with cache latencies.[/quote<]

    Not really. How many people here were interested in Bulldozer for their server? From the article's conclusion:

    [quote<]We do agree that it is a serious problem for desktop applications as most of our profiling shows that games and other consumer applications are much more sensitive to L2 cache latency. It was after all one of the reasons why Nehalem was not much faster than the older Penryn based CPUs. Lowly threaded desktop applications run best in a large, low latency L2 cache. But for server applications, we found worse problems than the L2 cache.[/quote<]

    But that only explains why Bulldozer doesn't quite keep up with Phenom II. There's another reason they don't mention at all, which explains why it's so far behind Sandy Bridge. The L3 cache clock is about half the core clock. That is a huge hit to - wait for it - cache latency. So blame the cache latency.

    It's not necessarily a design flaw in a server context, but it is still a direct consequence of the design choice, which is inappropriate for PCs.
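    To put rough numbers on that (the NB-cycle count is an assumed figure for illustration; the clocks are the FX-8150’s published ones): the cores run at 3.6 GHz while the L3 sits on the 2.2 GHz north-bridge clock, so an access costing, say, 25 NB cycles comes to about 25 × 3.6 / 2.2 ≈ 41 core cycles, nearly double what the same cache would cost running at core speed.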

      • BaronMatrix
      • 7 years ago

      I have two Opteron BDs. As an example of how Intel’s SSE\AVX games screwed AMD’s plans, Server 2008 DOES NOT SUPPORT AMD AVX in Hyper-V. I can say that I get near-native perf when running Exchange, SharePoint, and SQL.

      Intel changed their AVX\FMAC so many times it may have been better if AMD had not tried to support them.

      All testing I’ve seen shows XOP is just as fast as AVX on Intel. And it’s really weird that OEMs could have FMAC right now, but I have yet to see any optimization. Cinebench, Photoshop, even CFD, CAD, and CAE would get a HUGE bump with FMAC.

      That was what BD was for…

      THE NEW ISAs.

      Intel screwed them again and I hate sheep even more.
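      For reference, this is roughly what targeting FMAC looks like from the software side; a sketch assuming GCC’s FMA4 intrinsics and the -mfma4 flag, with a made-up saxpy routine purely for illustration:

      [code<]/* Sketch: FMA4 on Bulldozer-family chips (compile with gcc -O2 -mfma4).
       * _mm256_macc_ps(a, b, c) computes a*b + c as one fused operation;
       * without FMA support the same math takes a separate multiply and add. */
      #include <x86intrin.h>
      #include <stdio.h>

      void saxpy_fma4(float *y, const float *x, float a, int n) {
          __m256 va = _mm256_set1_ps(a);
          int i;
          for (i = 0; i + 8 <= n; i += 8) {
              __m256 vx = _mm256_loadu_ps(x + i);
              __m256 vy = _mm256_loadu_ps(y + i);
              _mm256_storeu_ps(y + i, _mm256_macc_ps(va, vx, vy)); /* y = a*x + y */
          }
          for (; i < n; i++)                 /* scalar tail for leftovers */
              y[i] = a * x[i] + y[i];
      }

      int main(void) {
          float x[20], y[20];
          for (int i = 0; i < 20; i++) { x[i] = (float)i; y[i] = 1.0f; }
          saxpy_fma4(y, x, 2.0f, 20);
          printf("y[5] = %.1f\n", y[5]);     /* expect 2*5 + 1 = 11.0 */
          return 0;
      }[/code<]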

        • Arag0n
        • 7 years ago

        It’s gonna take time for the new ISAs to be deployed into the Windows ecosystem… Phoenix showed that compilers can dramatically improve the performance of the new Bulldozers. I think that with Trinity and its successors, and a few years, AMD will have a competitive part again, but until then, they are left in the dust. AMD must push application developers, especially key ones such as top-tier games and software vendors like Microsoft and their most common tools, to update for Bulldozer before it goes on sale and gets reviewed… it’s the classic error: start selling a product that needs recompiling for performance improvements, let websites review it and create the idea that it’s vastly inferior to its predecessor and the competition, and then, once applications start to target the improvements and you get a significant gain, no one notices.

      • CBHvi7t
      • 7 years ago

      [quote<]The L3 cache clock is about half the core clock. That is a huge hit to - wait for it - cache latency.[/quote<] Now explain to us what clock rate has to do with latency and why spending more than twice as much power on the L3 would be a good idea.

        • BaronMatrix
        • 7 years ago

        I said many times when the first benchmarks came out that there was an issue with the WCC. It isn’t fast enough to hide latency. That will affect both the L1, since it’s write-through and has to account for misses, and the L3, because it’s an eviction cache that may need to dump to L2. If the WCC is held up, those writes will add more latency.

    • gmskking
    • 7 years ago

    Scrap it and let’s never speak of it again.
