Bulldozer scheduling patch for Windows arrives

In case you missed it while we were out at CES, Microsoft has (for real, this time) released a patch for Windows 7 that adjusts the thread scheduling behavior to accommodate AMD’s Bulldozer-based processors. As you may know, the default behavior in Windows without the patch is less than ideal for lightly threaded workloads, and manually scheduling threads to avoid sharing a Bulldozer module can improve performance. Microsoft briefly released a portion of the patch prematurely in December, but now the full and final updates are available.
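
For the curious, the manual workaround the patch now makes unnecessary looks something like the sketch below: a minimal example (ours, not AMD’s or Microsoft’s) that confines a process to one core per module. It assumes an FX-8150 where logical CPUs 0-1, 2-3, 4-5, and 6-7 pair up into modules, so the 0x55 mask picks one core from each pair; check your own chip’s topology before borrowing the mask.

[code<]
// Minimal sketch, not vendor guidance: restrict the current process to one
// core per Bulldozer module. Assumes an FX-8150 where logical CPUs 0-1, 2-3,
// 4-5, and 6-7 are module pairs, so mask 0x55 (binary 01010101) selects
// CPUs 0, 2, 4, and 6.
#include <windows.h>
#include <cstdio>

int main()
{
    const DWORD_PTR oneCorePerModule = 0x55;
    if (!SetProcessAffinityMask(GetCurrentProcess(), oneCorePerModule)) {
        std::printf("SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("Process limited to one core per module.\n");
    // ... launch the lightly threaded workload from here ...
    return 0;
}
[/code<]

The same sort of restriction can be applied without code by launching an app with cmd’s start /affinity switch and a hex mask, though we’d treat any of this as a stopgap now that the official patch is out.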

This blog post at AMD’s website explains how to obtain and install the patch and what performance benefits to expect from it. The post, by AMD Marketing Manager Adam Kozak, claims the performance gain in applications that do see a change "averages out to a 1-2 percent uplift." I expect we’ll see substantially more improvement, up to 10-15%, in select applications, but we haven’t yet had time to test the patch for ourselves.

We are pleased to see Microsoft making this update to a core Windows component outside of the usual update loop, though. Any future CPUs based on AMD’s Bulldozer microarchitecture ought to benefit from this change, and that may matter quite a bit once the Trinity APU touches down later this year.

Comments closed
    • Proxicon
    • 8 years ago

    The whole platform is just too overpriced and power hungry compared to Intel’s offerings at similar prices.

    • gmskking
    • 8 years ago

    Or just get a 2600K and never look back. 🙂

      • OneArmedScissor
      • 8 years ago

      Or do look back, and be happy with what you already have because it’s probably at least a Core 2, and good luck telling the difference. :p

    • Tristan
    • 8 years ago

    Windows patched
    Performance not

      • chuckula
      • 8 years ago

      Add Moar Coars
      BURMA SHAVE

        • NeelyCam
        • 8 years ago

        [quote<]Add Moar Coars[/quote<] "We have this thing called 'a module'. It's two for the price of one, and it's gonna revolutionize your world!"

    • Arclight
    • 8 years ago

    BD is still a fail with patch or no patch. They bet so much on this architecture and just like Barcelona it failed to deliver.

      • Anonymous Coward
      • 8 years ago

      Refresh my memory here, did AMD throw out Barcelona and start fresh, or what?

        • Arclight
        • 8 years ago

        You could also refresh my memory: did AMD have any response to high-end Nehalem with Barcelona’s revision, Deneb?

          • Anonymous Coward
          • 8 years ago

          Is that an answer, or an admission that you just can’t stop complaining?

            • Arclight
            • 8 years ago

            AMD has done nothing but lose on the desktop front ever since Barcelona. They managed some sales for the cheap Deneb-based quad cores, but now they will discontinue even those.

            And no, I shouldn’t stop complaining if a company has a monopoly on a market I’m interested in. Why do you ask? Do you work for Intel?

            • Anonymous Coward
            • 8 years ago

            The 45nm K10s have been fine chips. They can’t beat Intel, but if they could, then Intel would look pretty stupid considering the money they put behind their effort. If you can’t see that K10 was in fact a reasonable design with a poor launch, then there’s not much point in continuing this discussion. Likewise, if you can’t see that it’s too early to pass judgment on BD, then there’s not much point in continuing this discussion.

            Spewing anger and malcontent will do nothing about Intel’s and AMD’s market situations.

            • Arclight
            • 8 years ago

            Obviously we need not continue the discussion.

            [quote<]Spewing anger and malcontent will do nothing about Intel's and AMD's market situations.[/quote<] I never said it would, but God damn do I want to, and I have the right to.

            • clone
            • 8 years ago

            You have the right to complain; no one has ever said you don’t. But for how long are you going to?

            It’s like they betrayed you. Intel earned 425 times more coin this past quarter than AMD did, and Intel has 10 times the number of gerbils running its wheels, yet somehow AMD betrayed you because they failed to be David to Intel’s Goliath… 425 times AMD’s earnings in one quarter, and AMD is supposed to fabricate market dominance out of thin air, and so very many whine and cry when AMD doesn’t achieve it.

            What I want is a cheap, fast, sub-$200 CPU, preferably $100 or less. My last CPU was a 3.2GHz AMD triple-core that cost me $75, and I was happy with it. My next will likely be AMD as well, because Intel’s motherboard pricing dissuades me.

      • BaronMatrix
      • 8 years ago

      SPEC scores say they did deliver. AVX and XOP scores say they did. FMAC scores are in GPU league. They didn’t make BD for old code, but for new code. Had it been Intel, like with HyperThreading, you’d all be offering every excuse you could find. And I believe you all did.

      Please grow up.

        • Arclight
        • 8 years ago

        Delusions of a fanboy… I’d rather trust the reviews, and the consensus is clear.

          • OneArmedScissor
          • 8 years ago

          …that what he said is true? There is little evidence that the Bulldozer core itself is inherently flawed, and certainly some to the contrary. That will have to be proved in time.

          The problem is that you said:

          [quote<]They bet so much on this [b<]architecture[/b<] and just like Barcelona it failed to deliver.[/quote<]

          And yet you ignore what the architecture itself actually does and sit there poring over reviews that are focused on things like video game frame rates, which have been inconsequential since Core 2.

          I'm sorry, but I get very tired of the "I'm an enthusiast and I do my research!" attitude. If you [i<]really[/i<] did your research, you'd have noticed years ago that [b<]new CPU core performance doesn't matter in PCs anymore[/b<].

            • I.S.T.
            • 8 years ago

            If you think that, then you’ve not heard of a little known game called Battlefield 3.

          • shank15217
          • 8 years ago

          If you review a new architecture with old code, you may not get the full advantage of that hardware. Reviews will have to wait until the product matures.

    • Next9
    • 8 years ago

    Windows 7 (like any other Windows version) is not a suitable platform for testing anything. For example, in the case of TrueCrypt and the AMD FX-4100, guess what… Microsoft’s OS [url=http://www.amdzone.com/phpbb3/viewtopic.php?f=532&t=138894<]degrades performance[/url<] by a roughly 10% margin. It is especially nonsensical to try to break performance records on this platform. Every so-called enthusiast overclocker running dozens of popular benchmarks starts the race by shooting himself in the leg (installing Windows) and then overclocks the machine to compensate for the Microsoft penalty.

      • I.S.T.
      • 8 years ago

      Except that Windows is the most popular platform, so testing on it is what people want to read. After all, why would you read a site dedicated to, say, FreeBSD hardware tests when you want to know how X hardware will react to Y application for Windows?

        • Next9
        • 8 years ago

        Agreed. But there is a catch: who is to blame if some product does not meet expectations? Who is to blame if some product has lower performance? Etc. Unless you routinely blame the hardware vendor, like everybody always does, everything is OK and I agree with you.

        But breaking benchmark records on Windows is still a little bit silly anyway.

          • UberGerbil
          • 8 years ago

          Benchmark records are a bit silly, period, unless the reason you own a computer is to run benchmarks. What most people use computers for is to run applications, and what most of them run those applications on is Windows. So benchmarks that involve real applications running on Windows are interesting. Benchmark record chasing is a fine enough hobby, I guess, but its applicability beyond that is a bit like the applicability of dB drag-racing to music listening.

          • Firestarter
          • 8 years ago

          If your product doesn’t perform on the target platform, you have some explaining to do to your shareholders. If that means that a CPU vendor has to work with Microsoft to make sure that the CPU doesn’t choke on the OS, then so be it. To say that AMD are not to blame when their CPU fails to perform for a significant portion of the user base is short-sighted, and the shareholders would not agree.

            • Next9
            • 8 years ago

            And what is the “target platform”? And what does “does not perform” mean?

            There are too many questions here. AMD’s primary target platform from a performance standpoint is servers. Servers generate much more revenue than desktops, and they are not dominated by Windows.

            Microsoft Windows is the target platform from an end-user experience standpoint. Nobody cares if MP3 encoding takes 30 seconds or 40 seconds; users care that it does not take an hour. The majority of users have not noticed the difference between the various AMD and Intel models for the past couple of years, since CPU performance does not limit regular desktop usage anyway. What a user does usually notice is disabled CPU features, which is an unfortunate method of CPU product segmentation, and there is always the potential to buy an incapable CPU/chipset that does not meet your needs.

      • ET3D
      • 8 years ago

      If you run TrueCrypt most of the time, then I guess you’re right. If you run games that’s different. I think more people play games than run TrueCrypt.

    • ronch
    • 8 years ago

    I hope TR looks into this patch soon: how it works, how much performance it squeezes out of BD, etc. Also, since TR’s and AMD’s own findings don’t seem to align, it’ll be interesting to see who’s really right about how to treat BD modules.

    I also wonder how this patch figures out whether two threads are related so it can throw the two of them onto just one module. Is there a way for applications to tell Windows how they’re threaded? I hope TR will look into that too.

      • UberGerbil
      • 8 years ago

      Applications can [url=http://msdn.microsoft.com/en-us/library/ms686253(VS.85).aspx<]tell Windows[/url<] that a thread has a "preferred" processor, and they can [url=http://msdn.microsoft.com/en-us/library/ms686247(VS.85).aspx<]set affinity masks[/url<] to prohibit running threads on particular processor(s), among other things. But doing so in a way that maximizes performance on BD requires (a) the application being written with BD in mind (or updated to add this) and (b) sufficient cleverness on the part of the programmer to ensure that such low-level meddling doesn't hurt more than it helps. There hasn't been enough time yet for the former, and I have my doubts about the latter.
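
      To make that concrete, here is a minimal sketch of those two calls; treating logical CPUs 0 and 2 as belonging to different modules is our illustrative assumption, not guidance from AMD or Microsoft.

      [code<]
      // Minimal sketch of the two Win32 calls linked above. The choice of
      // CPUs 0 and 2 as "one core per module" is an assumption for illustration.
      #include <windows.h>
      #include <cstdio>

      DWORD WINAPI Worker(LPVOID)
      {
          // ... the thread's actual work goes here ...
          return 0;
      }

      int main()
      {
          HANDLE t = CreateThread(nullptr, 0, Worker, nullptr, CREATE_SUSPENDED, nullptr);
          if (t == nullptr)
              return 1;

          // Soft hint: the scheduler should prefer logical CPU 2 for this thread.
          SetThreadIdealProcessor(t, 2);

          // Hard restriction: only allow CPUs 0 and 2, i.e. one core per module.
          if (SetThreadAffinityMask(t, (1 << 0) | (1 << 2)) == 0)
              std::printf("SetThreadAffinityMask failed: %lu\n", GetLastError());

          ResumeThread(t);
          WaitForSingleObject(t, INFINITE);
          CloseHandle(t);
          return 0;
      }
      [/code<]

      As noted above, getting this right in general-purpose software is the hard part; pinning threads naively can easily hurt more than it helps.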

    • ish718
    • 8 years ago

    So, when is TR going to release an AMD Bulldozer Windows 7 Patch review?

    • sschaem
    • 8 years ago

    Note: this is not a ‘free’ hack.
    It will also increase wattage and heat even when no performance is gained, because it always increases module activity (for 2- to 4-thread workloads).
    (And that’s why this change in the scheduler is not active on battery power.)

    I think what AMD did here is worse than HT: very asymmetric performance that requires deep knowledge of the thread workload to schedule correctly, something only the developer can manage correctly… (I guess that’s not a problem for Cray.)

    I don’t think 1% of 1% of the software developers on Windows know how to build a thread scheduling mechanism that’s designed for Bulldozer.
    A scheduler cannot say, “Those two threads are going to work on the same small data set using integer instructions, let me schedule them on the same module,” or, “but not those two, because they stream data and execute an SSE-heavy workload.” What a mess AMD created.

      • Pancake
      • 8 years ago

      My gut feeling as someone who does a lot of multi-threaded programming is to agree with your estimate of 1% of 1%. So, it’s a good thing we all use the same few operating systems where people with sufficient expertise have done this for us.

        • sschaem
        • 8 years ago

        You missed the point. An OS kernel CANNOT make those decisions in its scheduler, so it now falls to each individual developer to categorize their workloads and schedule them explicitly around the module organization to use the Bulldozer architecture correctly.

        For this to happen at the OS level, the OS would need to add an extra API for describing workloads. And that is not happening.

      • jensend
      • 8 years ago

      Bullcrap. Hyperthreading-aware scheduling does at least as much to increase wattage and heat (making more physical cores active than if you have some physical cores idle while work is assigned to logical HT CPUs). AMD’s approach is considerably [i<]less[/i<] asymmetric than Intel's if you mean asymmetry between different logical CPUs or pairs of logical CPUs.

      It's true that there's a little more asymmetry in the execution of different kinds of instructions (integer instructions and most kinds of loads/stores vs. FP instructions, especially AVX) than with an Intel chip, but you don't need every single developer to understand these scheduling intricacies. If a few OS architects and a couple of handfuls of compiler and library writers understand it and get it right, that will bring consumers most of the benefits available from scheduling optimizations.

      • bcronce
      • 8 years ago

      While I do agree with some of your points, I don’t agree with the “worse than HT” part. Overall, the BD design is a tradeoff between a few “full” cores and many “partial” cores. In *most* workloads, it should give better performance than a few full cores, assuming there are no bugs slowing things down.

      This is coming from someone who has been using Intel for a very long time.

        • MadManOriginal
        • 8 years ago

        BD is crap. It ‘should’ give better performance? Then why does it perform no better than Phenom II sometimes, even in heavily threaded workloads?

        It’s a failed AMD version of Hyperthreading, meant to market MOAR CORES, and AMD fans should just face that fact.

          • I.S.T.
          • 8 years ago

          I would say BD is a foundation that might pay off. It wouldn’t be the first one (Willamette to Northwood, R600 to R7x0), and it won’t be the last. Give it time. I don’t expect Piledriver to be too much better (it’s too soon after BD to fix the major problems; though, admittedly, BD was technically finalized quite a while ago, and they were just fixing bugs and working on yields and whatnot, so it’s not as soon as it appears at first, it’s still not long enough to do any big, major changes), but what comes after might be quite nice.

          The main problems with BD, IMO, are: A. It launched with far lower frequencies than expected. I imagine a 600MHz increase on the base and turbo clocks might have been enough to at least propel it above Core 2 Penryn level, though not anywhere near Nehalem.

          And B. The caches are plain ol’ terrible, especially the L2 and L3. Look ’em up in various places… IIRC Realworldtech has some really good discussions about how bad the latencies are, but finding old forum threads in that place is like looking for a needle in a haystack in a barn full of haystacks in farmland full of barns with haystacks. You get my point by now, surely.

          If both those problems had been fixed, BD would be a far better arch. It still has major problems in other areas, but from what my admittedly uneducated self has read, those are the two biggest. It wouldn’t compete with Sandy Bridge, though. That’s too high a goal, so to speak.

            • eofpi
            • 8 years ago

            Another big problem with BD is its decode rate. Chipwide, it’s actually lower per clock than Thuban. But, as with many aspects of BD, it wouldn’t be much of an issue if clocks had been higher.

            IANAEE, but as best I can tell, the root of the clock speed issue is Global’s process tech. They’re about the only place still using gate-first, and that imposes voltage penalties on NMOS gates that require the whole chip voltage to be raised, which raises power demands quadratically. Intel switched to gate-last a while ago (65nm?), and the major contract foundries either have already switched or are switching to it at 32nm/28nm. Gate-last has more design restrictions, but the power benefits seem more than worth it. Unfortunately for BD, Global isn’t switching until the next node.

          • Anonymous Coward
          • 8 years ago

          It’s not very clever of you to pass judgment on an architecture based on version #1, especially an AMD version #1.

          • OneArmedScissor
          • 8 years ago

          Would an 8 core Phenom have outdone a 6 core Phenom? If it had also needed more cache and memory bandwidth, the issues would likely be the same or worse than with Bulldozer.

          The real problem is just that something like this is unnecessarily complex and high latency for a PC. You can see how the 6 core Sandy Bridge E is often edged out by the regular quad-cores in PC tasks, and the memory latency is higher.

          Why else do you think they didn’t even bother selling the 8 core, 20MB L3 version? The same effect would have been even more pronounced. You can’t just fix that with 400 turbo modes.

            • BaronMatrix
            • 8 years ago

            When using all cores the 6100 will do better than 1100T. The idea of the patch is to allow the Turbo function to kick in. Bulldozer is about AVX/XOP and FMAC, not SSE2. The correct test would be to see whether Turbo is activated on 2-4 module loads. It should hit 4.2GHz then.

            But the patch does show that the schedulers for Win8 and Win7 are WORLDS apart. I underestimated how different they were. But people who got the patch said they see noticeable differences.

            • OneArmedScissor
            • 8 years ago

            [quote<]When using all cores the 6100 will do better than 1100T.[/quote<]

            Of course it can, but even the X6s had their disadvantages vs. the equally priced quad-core counterparts. It's a big "if" scenario that is moving further and further from reality for PCs, and towards a trade off in every other case.

            This is all wait and see until there's another CPU with AMD's new core that trims the fat. I neither believe nor care that the existing BD chip can become faster or equally fast as Intel, with patches or what have you. None of that is really important as the BD chip itself simply is not a PC CPU.

            The thing I don't like is that people judge the core itself based exclusively on an inappropriate application. If BD had been introduced in a desktop-centric form, as was the case with Sandy Bridge, where it had fewer cores and faster cache, this would have come out very different. The Opteron SKUs are more appropriately synchronized and power optimized, and of course, everyone ignores that.

            Trinity at least somewhat resembles the standard Sandy Bridge, though it's really laptop-centric. People will undoubtedly blow a gasket all over again when AMD doesn't bother selling a 5 GHz version just for desktops, but it would at least make more sense to wait and pass judgment until it's seen [b<]built[/b<] for a PC.

            • Deanjo
            • 8 years ago

            Ummm no, when using all cores the FX-8150 barely beats the 1100T in the majority of full-load tests.

        • sschaem
        • 8 years ago

        Worse than HT to schedule work on. I’m not talking about "instructions per transistor" efficiency, but about what developers need to do to correctly leverage the architecture.

        With HT, 99% of the time it’s better to schedule one thread per physical core, and only after that start pairing threads per core (to start leveraging HT). Note: to fully leverage HT, some classes of apps need to do their own scheduling (selecting which threads get paired).

        With the AMD design, the integer and float workloads need to be balanced differently, and the way the cache is organized can sometimes make it highly beneficial to run two threads on the same module.
        So the scheduler needs to know what is what so it can correctly pick which thread goes on which core.
        The problem is that the OS doesn’t have that info.

        The ‘hack’ right now is to use one thread per module first. This will most often be highly beneficial, especially for FPU code.
        But doing this can be worse with integer code: you could have gotten the same performance (within 1%, possibly) with way less power usage, etc…

        What I wanted to explain is that AMD’s model requires deep knowledge of the thread workload to schedule work ‘correctly’.
        Developers can do that, and some already do for HT.

        What you really want is:
        "If two threads mainly execute SSE/AVX/FPU code, don’t schedule them on the same module."
        "If two threads mainly execute integer code, schedule them on the same module in power-saving mode."
        "If two threads mainly execute integer code on the same data set, always schedule them on the same module."
        "If two threads mainly do memory I/O, schedule them on different modules."
        etc…
        The OS has no idea what a thread is going to do… only the developer has that info.

        And this gets more complicated when mixing workloads from different apps at the same time. You want to pair threads from the same app on the same module when possible, and only the OS can do that.
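
        A rough sketch of the topology half of that problem: on Bulldozer the two cores of a module share an L2 cache, so grouping logical CPUs by shared L2 via GetLogicalProcessorInformation should identify module siblings. That is an assumption about how the chip reports itself rather than something we have verified, so treat it as illustrative only.

        [code<]
        // Illustrative sketch: group logical CPUs by shared L2 cache, which on
        // Bulldozer should correspond to module siblings (unverified assumption).
        #include <windows.h>
        #include <cstdio>
        #include <vector>

        int main()
        {
            DWORD bytes = 0;
            GetLogicalProcessorInformation(nullptr, &bytes);   // first call just reports the required size
            std::vector<SYSTEM_LOGICAL_PROCESSOR_INFORMATION> info(
                bytes / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
            if (info.empty() || !GetLogicalProcessorInformation(info.data(), &bytes))
                return 1;

            for (const auto& e : info) {
                if (e.Relationship == RelationCache && e.Cache.Level == 2)
                    std::printf("L2 cache shared by logical CPU mask 0x%llx\n",
                                (unsigned long long)e.ProcessorMask);
            }
            return 0;
        }
        [/code<]

        Even with the topology in hand, knowing which threads belong together is still the part only the developer can answer, which is exactly the point above.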

          • Mime
          • 8 years ago

          Software has been becoming a larger part of system performance ever since multicore processors became common. It’s a little late to start complaining about that now, and it’s not like there’s any other way to schedule threads which is easy. It’s always a mess, but that goes unnoticed most of the time… unless something bizarre happens and a new processor gets released without a solution in place on how to deal with these things.

            • sschaem
            • 8 years ago

            This issue was NEVER there until AMD introduced it… and it’s AMD-specific, Bulldozer-specific.

            How many people designed specifically for 3DNow!?

            I can tell you that not even 1% of developers will spend all that time writing their own scheduler just for Bulldozer.
            "Guys, we need to start characterizing our workloads for AMD’s Bulldozer CPUs, then write a layer on top of the OS to schedule tasks ourselves… BTW, this won’t help in any way on ANY Intel processor, and won’t help on any AMD CPU but the ‘highly popular’ FX/Zambezi line. Everybody ready to spend a couple of months doing this?"

            And no… no one had to write their own scheduler until now to get within 95% of a chip’s performance capability.

            It’s like saying, "Our new CPU will stall if it executes two multiplications back to back. We know no other CPU in the world has that problem, but we saved a few transistors doing it, so please reconfigure your compilers and recompile your old code with the new rules."

            AMD made a CPU for the server market, where people like Cray can afford to do this work… for the PC industry it’s a big fail.

            • Mime
            • 8 years ago

            Dude… Bulldozer is not a server chip. Server chips have names like POWER7 or SPARC64 T4, and you can’t pick one up for a few hundred bucks at Amazon or Newegg.

            Perhaps you meant HPC since HPC people are more likely to do the kind of bare metal coding you’re talking about. I agree that your average developer wouldn’t spend the time to build their own schedulers, but I don’t think we’ll need to either. The biggest reason why developers don’t build their own schedulers is that it’s work which has already been done.

    • Xenolith
    • 8 years ago

    The most disappointing thing is that this patch does nothing for apps that max out all cores. These apps will likely need a recompile to see any improvement.

      • Deanjo
      • 8 years ago

      100% is 100%. Unlike the starship Enterprise, on a processor maximum warp really is maximum warp.

        • dpaus
        • 8 years ago

        [Geordi] Wait, we could apply a positron pulse to the cache circuitry, and then compensate for the reduction in cache latency by creating a small bubble in the space-time continuum causing the next operations’ data to [i<]already have been processed[/i<] by the time the instruction register is loaded, meaning it can be passed back in time to the point just before the application loaded, so that the user's data is delivered to them right after they've decided to reach for the mouse to start the app![/Geordi] Which'll be great, unless the sudden appearance of the answer just as they're formulating the question in their heads causes them to change their minds about what they really want, because if they then change the question, that'll tear the whole timeline. Yeah, maybe just leave it the way it is now...

          • theonespork
          • 8 years ago

          You would probably need tachyons and their FTL speed, but no worries. Just reconfigure the forward emitter to pulse tachyon waves into an aligned crystalline matrix that you configure as a memory storage medium; then use the holosuites to generate numerous massively parallel computing arrays that utilize the forward emitter as a northbridge to access the crystal storage matrix memory cache, and then your tachyon crystal memory system will provide instantaneous answers, processed in the future but answered in the present.

          It is really all so simple…

            • Deanjo
            • 8 years ago

            Sure, that sounds simple, and it would be, but unfortunately Mr. Scott is the head engineer over at the USS Intel.

    • ish718
    • 8 years ago

    [quote<] AMD Marketing Manager Adam Kozak, claims the performance gains in applications that do see a change "averages out to a 1-2 percent uplift." I expect we'll see substantially more improvement, up to 10-15%, in select applications[/quote<] Oops, moving too fast. But I still call bluff on those performance gains. 5 days later, still nothing...

      • odizzido
      • 8 years ago

      reread what you quoted.

      • NeelyCam
      • 8 years ago

      There was a link to a German website with benchmarks a couple of days ago in the shortbread:

      [url<]http://www.tweakpc.de/hardware/tests/cpu/amd_bulldozer_fx_patch/s01.php[/url<]

        • ish718
        • 8 years ago

        Disappointing, I am waiting for more reviews…

          • tfp
          • 8 years ago

          Yeah, at some point someone will post the numbers you want to see, right?

        • NeelyCam
        • 8 years ago

        -1? Was the link broken..?

    • chuckula
    • 8 years ago

    Irony test: See if the Bulldozer patch has any effect (positive or negative) on hyperthreaded Intel CPUs.

      • Zoomastigophora
      • 8 years ago

      Based on the blog post, it’s more the other way around – Microsoft extended the custom scheduling logic that’s already there for Intel’s SMT to also apply to Bulldozer.

      • Game_boy
      • 8 years ago

      [url<]http://forums.anandtech.com/showthread.php?t=2213186[/url<] Lots of claims of significant increases on Intel CPUs.

        • tbone8ty
        • 8 years ago

        double face palm

          • Chun¢
          • 8 years ago

          I hope Scott has time to test this…

        • FuturePastNow
        • 8 years ago

        Yes, one guy says his rig “feels more snappy” with the patch. Another has “emailed this to many peeple and everyone has had there numbers go up while using 8 threads on a sandybridge cpu.”

        That’s some in-depth testing going on on the AT forum.

        • khands
        • 8 years ago

        On the next page it looks like that post was about the original patch that got pulled and the newer one actually lowers efficiency.

      • BaronMatrix
      • 8 years ago

      Perhaps that’s exactly why it doesn’t improve things as much as Win8 does: MS can’t break HT. And the kernels are too different.
