A quick look at Bulldozer thread scheduling


As you may know if you read our original FX processor review, the shared nature of the “modules” in AMD’s new Bulldozer architecture presents some conundrums for OS and application developers who want to extract the best possible performance. Each module has two integer cores that are wrapped in some shared resources, including the front-end instruction fetch and decode units, the FPU, and the L2 cache and its associated data pre-fetchers. AMD claims this level of sharing is superior to what goes on in recent Intel processors—whose more resource-rich individual cores can track and execute two threads—because the performance of each thread in a Bulldozer module is more “robust” and predictable, less likely to stall due to resource contention.

A scheduling conundrum

That sounds good in theory, but it raises a vexing question: what is the best way to schedule threads on a four-module, eight-core Bulldozer-based chip? Say your application uses four threads, and you want the best possible performance. Would it be better to group those four threads together on the four cores contained in a pair of Bulldozer modules, or is the best approach to spread them across all four modules? Each way has its advantages.

Your first instinct might be to spread the threads across four modules, in order to avoid resource sharing. That’s generally the best approach on an Intel processor with Hyper-Threading. On Bulldozer, that means the front end, FPU, and memory subsystem in each module would be at the full disposal of a single thread.

However, AMD claims Bulldozer’s sharing arrangement has a relatively small performance penalty. Also, if all four threads occupy only two modules, the CPU could potentially run at a higher clock speed thanks to Turbo Core dynamic clock frequency scaling. In the case of our FX-8150 chip, that means the cores would top out at 4.2GHz, whereas they’d stop at 3.9GHz with all four modules active. What’s more, AMD claims the performance of related threads may benefit from sharing data via the module’s L2 cache.

So what’s the best approach? That’s tough to say for sure, simply by considering the theory, and it may depend upon the situation.

One thing we do know is that the scheduler in Windows 7 isn’t any help. The OS is completely unaware of how Bulldozer modules work; it sees only eight equal cores and schedules threads on them evenly. AMD says Windows 8 will address these issues, but outside of developer previews and such, that OS probably won’t be available for another year, at least.

We can, however, take charge of thread scheduling ourselves in certain cases and see how it affects performance. With the aid of some simple tools, we tested a few of the geekier applications used in our Bulldozer review using explicit thread scheduling in order to see what would happen.

Our basic setup is relatively straightforward. Some of our benchmarks run from the command line and can be configured to use a specific number of threads. We told them to use four threads and invoked them via the Windows “start” command, which has an option to set the thread affinity when launching a program. By modifying the affinity, we can distribute the threads to specific CPU cores. For instance, this command:

start /AFFINITY 55 /b /WAIT e3dbm 5 4 > ed3dbm-4-1.txt

…launches our Euler3D computational fluid dynamics test, tells it to run five iterations with four threads, and stores the output in a text file. The “55” after the “/AFFINITY” switch is the mask that specifies which cores to use. The mask format may seem hard to decipher because it’s in hexadecimal; translated into binary, two instances of the number five side by side look like so:

01010101

A one specifies a core to be used, and a zero specifies a core to be skipped. In this case, the mask tells the OS scheduler to assign threads to every other core—or one per module, in the case of Bulldozer.
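Decoding these masks is simple bit arithmetic. As a quick illustration (this helper is ours, not part of the benchmark setup), a few lines of Python expand any hex mask into the cores it enables:

```python
def cores_from_mask(mask):
    """Return the core indices enabled by an affinity mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

# 0x55 = 01010101 in binary: every other core, i.e., one per Bulldozer module
print(cores_from_mask(0x55))  # [0, 2, 4, 6]
print(cores_from_mask(0x0F))  # [0, 1, 2, 3]
```

The same helper works for any mask you might pass to `start /AFFINITY`.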

With affinity mask 55, one thread per module and a 3.9GHz Turbo peak

We were able to verify that the “55” mask was doing what we expected by watching the Windows Task Manager and monitoring CPU clock frequencies. As you can see in the screenshot above, we have a thread on every other core, and our CPU is topping out at 3.9GHz, which is the expected Turbo Core behavior when all four modules are active. We also verified the core and module configuration with AMD, who characterized it as:

Module 0 = Core 0 and 1, Module 1 = Core 2 and 3, Module 2 = Core 4 and 5, Module 3 = Core 6 and 7

So we’re fairly certain we know what we’re getting here.
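Given that core-to-module mapping, both masks can be derived mechanically rather than worked out by hand. A small Python sketch (ours, for illustration only) builds the “one core per module” and “packed” masks for any module count:

```python
CORES_PER_MODULE = 2

def spread_mask(modules):
    """Enable the first core of each module (cores 0, 2, 4, ...)."""
    mask = 0
    for m in range(modules):
        mask |= 1 << (m * CORES_PER_MODULE)
    return mask

def packed_mask(threads):
    """Enable the lowest-numbered cores, filling whole modules first."""
    return (1 << threads) - 1

# Four Bulldozer modules: the two masks used in this article
print(format(spread_mask(4), '02x'))  # 55
print(format(packed_mask(4), '02x'))  # 0f
```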

With affinity mask 0f, two busy modules and 4.2GHz Turbo Core speeds

Our other mask option, 0F, translates into binary as “00001111”. This option packs all four threads onto two modules, forcing some resource sharing but also allowing for the higher Turbo clock frequency of 4.2GHz. Again, the behavior is easily verifiable, as shown above.

In addition to our command line specifications, we have some special, affinitized builds of the picCOLOR image analysis program, courtesy of Dr. Reinert Mueller, who has long supported us with custom builds of his software tailored for new CPU architectures. Several of picCOLOR’s functions use two or four threads, and their performance is potentially altered by where they run.

The results

Without further ado, here’s what happened when we ran our test apps with the default Windows 7 scheduler threading—i.e., with no awareness of modules or sharing—and with our two different affinity masks.

These results couldn’t be much more definitive. In every case but one, distributing the threads one per module, and thus avoiding sharing, produces roughly 10-20% higher performance than packing the threads together on two modules. (And that one case, the FDom function in picCOLOR, shows little difference between the three affinity options.) At least for this handful of workloads, the benefits of avoiding resource sharing between two cores on a module are pretty tangible. Even though the packed config enables a higher Turbo Core frequency of 4.2GHz, the spread config is faster.

Our test apps, obviously, are not your typical desktop applications, and they may not be a perfect indicator of what to expect elsewhere. However, since many games and other apps are lightly threaded, with three or four threads handling the bulk of the work, we wouldn’t be surprised if one-per-module thread affinities were generally a win on Bulldozer-based processors.

Naturally, some folks who have been disappointed with Bulldozer performance to date may find solace in this outcome. With proper scheduling, as may come in Windows 8, future AMD processors derived from this architecture may be able to perform more competitively. Unfortunately, Windows 8 probably won’t ship during the model run of the current FX processors.

At the same time, these results take some of the air out of AMD’s rhetoric about the pitfalls of Intel’s Hyper-Threading scheme. The truth is that both major x86 CPU makers now offer flagship desktop CPU architectures with a measure of resource sharing between threads, and proper scheduling is needed in order to extract the best performance from them both. (This situation mirrors what’s happened in 2P servers in recent years, where applications must be NUMA-aware on current x86 systems in order to achieve optimal throughput.) A gain of up to 20% on a CPU this quick is certainly worthy of note.

Trouble is, right now, Intel has much better OS and application support for Hyper-Threading than AMD does for Bulldozer. In fact, we’re a little surprised AMD hasn’t attempted to piggyback on Intel’s Hyper-Threading infrastructure by making Bulldozer processors present themselves to the OS as four physical cores with eight logical threads. One would think that might be a nice BIOS menu option, at least. (Hmm. Mobo makers, are you listening?)

At any rate, application developers who want to make the most of Bulldozer are free to affinitize threads in upcoming revisions of their software packages anytime. If AMD can persuade some key developers to help out, it’s possible the next round of desktop applications could benefit very soon.

Comments closed
    • format_C
    • 8 years ago

    MS hotfix for Bulldozer:
    [url=http://support.microsoft.com/kb/2592546/]An update to optimize the performance of AMD Bulldozer CPUs that are used by Windows 7-based or Windows Server 2008 R2-based computers is available[/url]

    • bbbl67
    • 8 years ago

    Regarding your comment that AMD should’ve just piggybacked on Intel’s Hyperthreading support, I’m surprised that they didn’t do that right from the beginning. When I first read the concept of the modules several years ago, the first thing that jumped out at me was that this must be AMD’s way of doing a better kind of Hyperthreading. But it seems AMD started believing its own hype too much and decided to classify each sub-module to be as good as a fully separate core.

    Then again, there might have been a good reason to classify each sub-module as a full core rather than as cores & threads. Since the BD design is going to be used in both server and desktop, they need to sell these BD processors to server environments where Hyperthreading is typically frowned upon, due to its bad performance. If they implemented it as Hyperthreading, it would’ve been automatically turned off by most administrators, and therefore half of the cores would be rendered unusable. So perhaps because of this server environment quirk, the desktop implementation of BD suffered. But really this idea of implementing a BIOS option to fix this sounds good.

    • ronch
    • 8 years ago

    Late comment, I know, but I’d just like to point out that AMD should have talked to Microsoft about this concern earlier so that Windows 7 would know how to treat BD. How long have Windows 7 and BD been in development? Looks like AMD didn’t do its homework.

    • wierdo
    • 8 years ago

    Thought this was an interesting page, showing performance of Bulldozer with 1core/module vs 2core/module settings:

    [url]http://www.hardware.fr/articles/842-9/efficacite-cmt.html[/url]

    Seems that vs 1core/module the Bulldozer actually loses some performance in games, about 5% usually. But in well-threaded applications the Bulldozer gets up to ~80% of the benefit of having the second core. It also compared performance gains from Hyperthreading vs Bulldozer's 2 cores/module approach, with HT providing a gain of 5-36% vs 36-80% respectively.

    I'd suggest putting the page through Google Translate since it's in French, but the charts are easy to understand without translation. Amusing little knob-tweaking games.

      • rechicero
      • 8 years ago

      Thanks for the link. That means it’s a better approach than hyperthreading… And it would’ve been really great if the modules were not the bastard children of the P4 fiasco.

    • moozoo
    • 8 years ago

    Google “Correct F15h IC aliasing issue”
    [url]http://us.generation-nt.com/answer/patch-x86-amd-correct-f15h-ic-aliasing-issue-help-204200361.html[/url]

    I think this is the source of the cache bug comments.

    • rechicero
    • 8 years ago

    Great article, but the methodology differs from the original review of this processor (at least the results do), so we can’t compare this “correctly” scheduled Bulldozer with other processors… Can we assume the differences are equivalent?

    That’d mean the 8150 would score 4.54 Hz in the STARS Euler3D benchmark, and it would advance 4 positions, beating the i5 2500K.

    63 seconds in Myrimatch, advancing a couple positions (the 2500K was already beaten with auto-scheduling).

    And 25 in PicColor Synthetic (assuming the 4-thread new bench as the “good” one), advancing 3 positions, but barely losing to the 2500K. And 18.86 in Real world, beating the 2500K again.

    That’d mean a 3-1 (instead of the original 1-3 for the 2500K) for the 8150 in these 4 benchmarks and would change the conclusions about the processor (at least, it’d be faster than its intended rival instead of slower).

    If this is incorrect, we’d need some benches of other processors (I’d say 2500K, X6 1100T, 2600K, 2400 and X4 980 at least) with the same configuration as these new tests.

    EDIT: If this equivalence is correct (and similar in other benchmarks), that’d mean that, with a tweaked scheduler, it would probably be the best processor in performance/dollar, beating the 2400 and the 2500K (right now it’s in the league of the 2600K in this metric).

      • accord1999
      • 8 years ago

      The scheduler tweaks only affect scenarios that run between 2-4 CPU-intensive threads. They will not have any real impact when all 8 cores are already loaded.

        • shank15217
        • 8 years ago

        That depends on how it’s loaded; not all apps load up cores equally. The issue is that two physical cores aren’t always two physical cores in Bulldozer. An updated scheduler should improve performance. Also, Bulldozer is weaker in lightly threaded applications, and that seems to be partially due to thread scheduling issues.

        • rechicero
        • 8 years ago

          You’re right, but I’d really, really like to see a hyperthreading processor in the mix (and not using hyperthreading), to see which approach scales better: hyperthreading or integer cores.

          • tfp
          • 8 years ago

          Some of this info was in the first review. There was at least one test that showed single-threaded performance as well as max cores. BD had slow single-thread performance, but the multiplier was better for 8 threads vs a 2600K with 8 threads. The BD still lost, however.

            • rechicero
            • 8 years ago

            I’m not sure if that’s the same as we’re adding both physical and virtual cores to the mix. I’d say the multiplier should be even higher when adding only “virtual” cores (as the 3 additional physical cores from the i7 likely add much more performance than the 3 “full” cores from the FX).

            What I’d like to see is a 4-threaded test using only physical (i7) / full (FX) cores against an 8-threaded test using both physical and virtual cores for Intel and shared FPU for AMD. That way we could see which approach is better (and by how much), without factoring in the “quality” of the individual physical cores. If Scott could tell us how much die real estate is used for hyperthreading and the additional non-FPU cores (this looks like 15% more), that’d tell us the best performance/mm² approach.

        • [+Duracell-]
        • 8 years ago

        It could have an impact if two threads share resources, but are scheduled on different modules.

          • accord1999
          • 8 years ago

          But there’s no way any OS scheduler will know whether two threads will share resources. Only the application developer would be able to schedule this properly.

    • odizzido
    • 8 years ago

    What I don’t get is why there can’t be a simple patch for 7/Vista to say “if new FX processor, use core 0 for one thread, core 2 for two, 4 for three, 6 for four, 1 for five, 3 for six, etc.” I guess they need a reason for people to buy W8, but still.

    Also this could be an interesting power scheme for laptops. A quad core bulldozer in performance mode could put the first thread on core 0 and second on 2, while in power saving mode it can load them on 0 and then 1. Too bad there weren’t any power tests to see the difference between 0+2 and 0+1 being fully loaded.
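The fill order described in the comment above (0, 2, 4, 6, then 1, 3, 5, 7 on a four-module chip) is straightforward to express. As a sketch of ours (not code from any actual patch), here is how the Nth thread would be placed under that policy:

```python
def core_for_thread(t, modules=4, cores_per_module=2):
    """Core index for the t-th thread (0-based): visit a fresh module
    first, and only double up once every module has one thread."""
    module = t % modules
    sibling = (t // modules) % cores_per_module
    return module * cores_per_module + sibling

# Eight threads fill cores in the order 0, 2, 4, 6, 1, 3, 5, 7
print([core_for_thread(t) for t in range(8)])
```

Swapping the two lines inside the function would give the power-saving order instead (fill both cores of a module before waking the next one).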

      • d0g_p00p
      • 8 years ago

      So MS should spend time and engineering costs to patch Vista, Win7, Server 2K/2K3/2K8R2 because AMD screwed up? I do see your point though, MS should keep their OS up to date for new tech that hits the market.

      • UberGerbil
      • 8 years ago

      [url]https://techreport.com/discussions.x/21865?post=592288#592288[/url]

    • thesjg
    • 8 years ago

    This test was interesting but highly uninformative and possibly totally misleading. Remember, the floating-point hardware is shared completely between two logical cores, but only the front end of the integer units is shared. You neglected to mention what these tests do. So, are they floating-point benchmarks or integer benchmarks?

      • Damage
      • 8 years ago

      A mix of both, as noted here:

      [url]https://techreport.com/discussions.x/21865?post=592039[/url]

        • thesjg
        • 8 years ago

        Thanks.

    • maxxcool
    • 8 years ago

    Nice addition to the article Scott!

    I am still very curious… what happens when the FSB is bumped to 250, and the NB and HT bus are set to 10x?

    • Krogoth
    • 8 years ago

    Basically, this article is forcing Bulldozer to operate more like a regular “desktop” CPU. It is no surprise that it performs better at desktop applications, but I suspect that the “hack” hurts its server-level performance. Not that it matters for regular desktop users.

    It does show that the Bulldozer architecture has potential that needs to be refined. The current chip is really AMD’s “Williamette”: a new CPU architecture that fails to overtake its predecessor at desktop applications of the day, and only shines in areas that take advantage of its new architecture. For Williamette it was media encoding; for Bulldozer it is integer-demanding server applications.

    I just hope that AMD can manage to turn the current Bulldozer into a “Northwood”.

      • Meadows
      • 8 years ago

        This [u]is[/u] a desktop CPU. Also, that word is spelt "Willamette".

        • axeman
        • 8 years ago

        You pointed out a spelling error, only to make a more egregious one. Congratulations. [url]http://en.wikipedia.org/wiki/Spelt[/url]

          • Meadows
          • 8 years ago

          [url]http://www.merriam-webster.com/dictionary/spelt?show=1&t=1319900907[/url]

          And thus you fail.

    • StuG
    • 8 years ago

    This is not enough of a performance boost to really make it appealing still. It needs a lot more help off the ground, and AMD is bad at helping its own products.

      • khands
      • 8 years ago

      It’s one of many though, you’ve got this, you’ve got North bridge OCing, you’ve got lots of potential cache issues that need to be straightened out, poor yields on an immature process. The potential is there, they just failed in a lot of minor ways that added up to topple the thing from the get go.

        • StuG
        • 8 years ago

        I just don’t see this as enough with Ivy Bridge around the corner and Intel already going VERY easy on them.

    • Forge
    • 8 years ago

    Sadness. Loading per module first was AMD’s choice, and mentioned many times as a power-saving/power-optimization move. Forcibly bypassing that gives higher performance.

    Runs cool, runs fast. Pick one.

    • TurtlePerson2
    • 8 years ago

    Congratulations on the slashdotting Mr. Wasson. It’s a great article.

    • juampa_valve_rde
    • 8 years ago

    Well, it looks like AMD fell short with its multithreading sauce, but if the Windows scheduler can identify the Bulldozer like it was an Intel hyperthreaded processor (like the Smithfields with HT), the thread affinity could get fixed quickly and easily (I think). This is the same old song and dance from 10 years ago: Intel didn’t offer that much with the hyperthreaded Northwood core, it took some time to polish that tech, and even now it’s only useful on highly threaded workloads. Now the table has turned, and Bulldozer is the new Pentium 4 (hot, high-clocked, decent performance but not a big deal). In those times Intel sold all those P4s anyway, probably using its big wallet and marketing hype, but AMD has neither 🙁

    • hendric
    • 8 years ago

    Scott,
    Can you try running with 0xCC instead of 55? That would force the threads to run on the odd cores instead of the even ones. I’m a little curious if there would be a performance difference in this case. Which core does the OS use, for example? And if that cache bug is true, then one set of cores may perform differently than the other.

      • UberGerbil
      • 8 years ago

      I think you want 0xAA/0x55 to run on the even/odd cores. 0xCC would have them run on both cores of modules 1 and 3, and neither of the cores in modules 0 and 2 (assuming the LSBs line up across cores and modules). I don’t know that we have any reason to think the odd cores would perform significantly differently, but you don’t know until you test. (And from that standpoint, 0xCC could indeed be interesting as well, in as much as loading different modules might result in slightly different turbo behavior).

      The OS isn’t affinitized to a single core; the System processes get scheduled like User processes do — and (moreover) much of what we think of as “Windows” is DLLs, the code in which runs in the same thread as the calling process. (There can be some exceptions at the lowest levels of the kernel, when things like NUMA and interrupts come into play causing certain cores to be preferred)

    • kamikaziechameleon
    • 8 years ago

    So basically they aren’t actual 8 core processors, lol.

      • StuffMaster
      • 8 years ago

      8 full cores, no. 8 cores, yes since the integer unit is ultimately what makes a core.

        • tfp
        • 8 years ago

        Reminds me of the old 386 SX with the 387 co-processor.

        • alpha754293
        • 8 years ago

        Well… that’s not necessarily true. If that were the case, then they wouldn’t even NEED an FPU, but then you’d get the UltraSPARC T1 (so many people complained about the lack of an FPU that they ended up adding one back in to each and every single “core”, and it became the UltraSPARC T2).

        It depends on what workload you give it.

        But considering that the fundamental premise of a computer is that it is a glorified calculator, I’d say that the FPU is more the “core” than ANY integer unit.

        As such, I actually count the latest FX Bulldozers as quad-core processors with hardware “HyperThreading” or “Chip MultiThread” or “hardware-assisted multithreading”. And you can see that for computationally heavy tasks, the old 6-cores still beat the latest and greatest (?) “so-called 8-core” processors. And if your old stuff can beat your new stuff, your new stuff has issues, regardless of semantics. (There aren’t too many people I know of that buy because of semantics.)

          • Meadows
          • 8 years ago

          Integer calculations are [i]by far[/i] the most used instructions in almost any environment, whether it's consumers, businesses, or servers. That's why. You're right that they can't outright [i]remove[/i] the FPU because it would cause slowness or errors (if it worked at all) in some cases, but you're absolutely wrong in declaring it of utmost importance.

            • JustAnEngineer
            • 8 years ago

            Many years ago, I had a software patch that installed some well-coded integer FPU emulation on a 486SX processor. It fooled the OS into thinking that an FPU was present and provided a significant speedup in gaming and other FPU intensive tasks.

    • anotherengineer
    • 8 years ago

    Interesting.

    Can MS update the scheduler in Win7 via a SP update or is this impossible?

      • UberGerbil
      • 8 years ago

      They can. Whether they want to is an entirely separate question. Since XP SP2 they’ve shied away from putting major code changes into SPs, making them almost entirely patch roll-ups, with the justification that new OS versions will be rolling out fast enough (vs the long XP interregnum) that major SP code changes aren’t necessary. Changing the scheduler involves a lot of regression testing to make sure you haven’t inadvertently hosed any existing systems (yes, they can put it behind a CPUID check and whatever, but they still have to test). If it was Intel with a chip that was selling in the millions in spite of its problems (and if it was Intel with their weight and influence / marketing / developer relations budget)…

      One thing I suppose they don’t have to worry about is that by adding this feature to Win7 they’d dampen enthusiasm for Win8. Yes, it may remove a reason for BD buyers to upgrade, but there wouldn’t seem to be enough of them to matter. On the other hand, that’s just an argument for not doing it at all -- but it would appear that Microsoft has already agreed to do it for Win8 anyway (and they have the ongoing Win8 betas in which to test it, removing [i]that[/i] as an argument for doing it as a patch to Win7).

      The real elephant in the room here, I suspect, is the Server line -- at least to the extent Bulldozer gets more traction in that market. Server customers are slower to transition platforms, even when they're changing hardware, so there may be Server 2008 R2 customers who actually buy the server version of Bulldozer long before they get around to deploying Server 2012 (or whatever the Server complement to Win8 is called). Keeping those folks happy (if they prove to exist) would involve releasing a scheduler patch to essentially the same codebase as the Win7 kernel, but that still doesn't eliminate the testing burden for the consumer OS (which runs on a lot more diverse hardware).

    • FubbHead
    • 8 years ago

    Is it possible to test running integer intensive tasks on AA and floating intensive tasks on 55 at the same time as well?

    • nicktg
    • 8 years ago

    Very interesting read. It should be noted that the improvement per clock is even greater, since it’s achieved while running at a lower speed. I think it would be interesting to compare these results against a Phenom, all at the same clock. Bulldozer IPC when running one thread per module isn’t that bad after all.

      • forumics
      • 8 years ago

      i would like to see some power consumption figures as well!

    • Anonymous Coward
    • 8 years ago

    This is a very remarkable result. I recall reading about Linux kernel developers having to implement a fix for a cache bug, and it’s also very interesting that Piledriver is supposed to be so much faster in such a short period of time. It is further interesting that AMD’s scheduling advice is not what TR has just shown to be most effective. It is also very interesting that AMD has been so quiet about so much performance hiding under the surface, and it’s interesting that they would overlook the possibility of having the CPU call itself “4 core hyperthreaded” if it were a fix.

    A cache bug could explain it all.

      • ermo
      • 8 years ago

      IIRC, sschaem has been going on a fair bit about the possibility of a cache bug.

    • Lianna
    • 8 years ago

    Great follow-up, but one data that has not been shown up on the graphs – for all 8 cores. I’ve added them from original review:
    [code<] Myrimatch [sec]: Auto 8t 74s 145% Auto 4t 126s 85% Mask 55 107s 100% Mask 0f 132s 81% Euler3d [Hz]: Auto 8t 3.75Hz 129% Auto 4t 2.40Hz 82% Mask 55 2.91Hz 100% Mask 0f 2.36Hz 81% [/code<] It shows why disabling cores is a bad idea, while setting manually affinity on light load is a good idea. IIRC, there was a utility from Tom's Hardware (from P4HT days) to automatically set affinity on defined processes - may be a good idea to use it again. BTW, +45% in Myrimatch from "resource sharing" is a nice speedup.

      • alpha754293
      • 8 years ago

      +29% on Euler3d isn’t bad either. Certainly better than Intel’s HTT.

      Congratulations AMD, you’ve finally landed somewhere in between the realm of Core 2 Duo with HTT (which I don’t think actually exists), and some of the earlier Core 2 Quads, also with HTT (which also, I don’t think actually exists).

      • jcollake
      • 8 years ago

      Process Lasso will freely (forever, without any timed nags, and no bundles) set default (sticky) CPU affinities for set process(es), in addition to a large number of similar features. For some applications, like this, it can be quite useful. I will not provide a link, as anyone can Google Process Lasso and I don’t want you to think this is spam.

      Since Process Lasso has been around for many years, I should also note it has changed a lot and gotten substantially better over time – many more features and functions, more real-world testing, etc. It is NOT a task manager; it is an automation tool, to set various process ‘rules’.

    • HisDivineOrder
    • 8 years ago

    AMD should be listening to this and turning off the second INT unit on each module to create its 4170 part. Until they manage to get performance up another way.

    Alternatively, they should be getting every BIOS maker on board with making the second INT unit/CPU core something that can be disabled easily. Doing that, they might help the rep of their Bulldongzer before it’s too late…

    Doesn’t bode well for Trinity though…

      • TravelMug
      • 8 years ago

      “Alternatively, they should be getting every BIOS maker on board with making the second INT unit/CPU core something that can be disabled easily.”

        Why would I go out and buy an “8 core” CPU to then disable half of its cores? I can just go and get one which is 4 cores from the get-go for less money (PhII X4) or for the same money but more performance (i5).

        • Ryhadar
        • 8 years ago

        It wouldn’t necessarily be a bad idea (on the desktop anyway). If you’re able to disable/enable cores on the fly, like on AMD Overdrive, you could have Phenom II performance from the 4 cores when you want it (and likely with lower power consumption) and then 8 cores when you’re running highly multithreaded stuff.

        But, like you said, if you wanted Phenom II performance for lightly threaded stuff and Bulldozer-like performance for heavily threaded stuff, a 2600K or 1100T would be a much better choice.

          • OneArmedScissor
          • 8 years ago

          [quote]...like on AMD Overdrive...[/quote]

          This is where they really blew it. Overdrive's flexibility was the best thing about the X6s. What would be [i]really[/i] awesome would be if they even extended it to allow control over individual cache configurations, instead of just separate cores. It would be complicated, but very fun to play with.

    • Celess
    • 8 years ago

    @Damage

    Do you think you could do a quick run with AA instead of 55?

      • Anonymous Coward
      • 8 years ago

      Yes this would be interesting.

    • swampfox
    • 8 years ago

    Fascinating quick write-up, thanks!

    • phez
    • 8 years ago

    Sorry, but what is the affinity/thread load on ‘auto scheduler’? The results are different from the BD review.

    • tfp
    • 8 years ago

    Interesting write up and at least to me the results are sort of expected.

    To me this leads to questions about many parts of the CPU.

    1) Is the single front end not able to handle feeding both int units and the FP unit quickly enough? Branch prediction, etc.: what is the impact on these HW units in the front end when switching between two unrelated threads vs. two cooperative threads?
    2) Is cache (data cache?) performance/contention between the 2 threads stalling both threads?
    3) Is the I-cache large enough to feed both pipes?
    4) Where does the data pre-fetcher plug into this system? In the FX-8150 write-up it is just a block that isn’t really linked into the pipeline in the image; how is it handling supporting both threads?
    5) What kind of latency is there for swapping threads in the front end, reloading the branch predictor, scheduler, and data pre-fetcher as it switches between threads?
    6) Can the front end feed both one integer pipe and the FP pipe at the same time (Int1 + FP, Int2 + FP)? Or does it have to alternate between the 3 pipes (Int1, Int2, FP, assuming a heavy workload)?

      • tfp
      • 8 years ago

      One thing to note, this 10-20% increase in performance on 2 to 4 threaded apps is exactly what the CPU needs to be competitive in a number of applications.

        • Vasilyfav
        • 8 years ago

        Competitive in performance, maybe. Far, far from competitive in task energy and general energy efficiency, I would think. This would essentially turn it into a quad Phenom, performance-wise, clock for clock.

    • mcnels1
    • 8 years ago

    A nitpick: 55 hexadecimal is 01010101 binary. 10101010 binary is AA hexadecimal.
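The two masks select complementary halves of the chip. A quick sketch of the bit patterns (assuming the usual enumeration where cores 2k and 2k+1 share a Bulldozer module):

```python
# Affinity masks on an 8-core, 4-module Bulldozer chip.
# Bit n set => the thread may run on core n.
MASK_55 = 0x55  # 0b01010101 -> cores 0, 2, 4, 6: one core per module
MASK_AA = 0xAA  # 0b10101010 -> cores 1, 3, 5, 7: the other core in each module

def cores(mask, n_cores=8):
    """Return the list of core numbers enabled by an affinity mask."""
    return [n for n in range(n_cores) if (mask >> n) & 1]

print(cores(MASK_55))  # [0, 2, 4, 6]
print(cores(MASK_AA))  # [1, 3, 5, 7]
```

Either mask lands one thread on each module; together they cover all eight cores with no overlap.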

    • jihadjoe
    • 8 years ago

    So it seems that the most efficient way to schedule threads on Bulldozer is to treat it like a hyperthreading CPU, i.e., new threads go to a free core first, before invoking hyperthreading on an already loaded core.

    In BD’s case, new threads should go to a new module first, before loading the secondary core on an already loaded module.

    I wonder if there isn’t some way to trick Windows into thinking an 8-core BD is a 4-core hyperthreading CPU.

      • Palek
      • 8 years ago

      Once you reach 5+ threads, though, it would pay good dividends to run related threads within a single module as much as possible. That would complicate thread scheduling a bit – or a lot, possibly. I ain’t no OS wiz…

        • bcronce
        • 8 years ago

        It’s a good idea, but I’m not sure any thread APIs allow telling the scheduler that two threads are related. One would have to manually schedule their own threads if using C/C++.
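On Linux, at least, the manual approach is straightforward. A minimal sketch (Linux-only: `os.sched_setaffinity` with pid 0 pins only the calling thread), pinning two cooperating threads onto the same pair of cores:

```python
import os
import threading

# Pick a pair of cores to share; on Bulldozer, cores 2k and 2k+1
# live in the same module, so {0, 1} would be one module's pair.
avail = sorted(os.sched_getaffinity(0))
pair = set(avail[:2])  # first two available cores on this machine

def worker(results):
    # pid 0 means "the calling thread", so each worker pins itself
    # before it would do any real work.
    os.sched_setaffinity(0, pair)
    results.append(os.sched_getaffinity(0))

results = []
threads = [threading.Thread(target=worker, args=(results,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # both threads now report the same core set
```

This only expresses "keep these two together"; there is still no portable way to tell the scheduler *why* they belong together.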

        • BobbinThreadbare
        • 8 years ago

        How many tasks are multithreaded up to 5-7 threads but not 8?

          • Palek
          • 8 years ago

          Either you misunderstood what I was saying, or I don’t understand what you are saying. By 5+ I meant 5 and over, including 8 and beyond. Did I miss something?

    • chuckula
    • 8 years ago

    As most of us who had a clue said before Bulldozer came out, there’s no magical software fix that will turn Bulldozer into an Intel killer. What’s really ironic is that the same scheduler used on hyperthreaded Intel CPUs that assigns jobs to physical cores first before starting to assign them to the hyperthreaded cores is exactly the same scheduler that AMD needs… so much for 8 “real” cores….

      • flip-mode
      • 8 years ago

      I have absolutely no problem with labeling these things as “real cores” and I don’t know why that chafes some people so much. The hardware really is there. It’s not that the hardware isn’t there, it’s that it isn’t working very well.

      Of all the problems that Bulldozer has, whether or not the “cores” are “real” is not one of them. The processor handles 8 threads. The integer hardware for handling the threads probably functions superbly. I think the real problems all happen outside the cores. It’s certainly got to do with scheduling, as Scott has pretty well proven here. But it could also have lots to do with the cache – whether it’s latency or some kind of resource contention. But it’s not got to do with the “real cores”, and I think that discussion is a “birth certificate” distraction that needs to go away so that the real issues can be discussed.

    • bcronce
    • 8 years ago

    It was my understanding that BD currently has a bug that causes data destined for the 2nd core in each pair to land in the 1st core’s L1 cache and then be moved over to the 2nd core’s cache, which causes not only cache thrashing in the first core but also artificially increased latency for the 2nd core.

    If this is the case, then one can’t tell if the issue is two threads sharing the FPU or the cache getting fubar’d.

      • flip-mode
      • 8 years ago

      I’d love to see a cite for that info – I’ve not heard about that.

        • bcronce
        • 8 years ago

        I would too… It’s just what I’ve “heard”. My post was half question, half statement.

          • TheEmrys
          • 8 years ago

          You had to hear it somewhere…… or was it people just making stuff up?

            • bcronce
            • 8 years ago

            It may have been the Tom’s Ivy Bridge discussion. Someone was asking about the “cache” bug that everyone has been talking about, and someone replied with a fairly detailed response. That response, in a nutshell, said there was a problem with copying incoming data to the wrong core.

            But nothing “official” that I’ve seen.

    • srg86
    • 8 years ago

    So at least under these tests, Bulldozer behaves like a Hyperthreading enabled processor that doesn’t let the OS know that it has Hyper threading enabled.

    If this is truly the case, then why on earth did AMD not just make the processor pretend it has HT? No need to wait for Windows 8; in this situation, even Windows XP would schedule things properly.

      • bcronce
      • 8 years ago

      Win xp does not understand HT, even Vista doesn’t. Only Win7 right now does.

        • Ryu Connor
        • 8 years ago

        Windows XP is HT aware.

        [url<]http://msdn.microsoft.com/en-us/library/windows/hardware/gg463502.aspx[/url<] Windows Vista and Windows 7 each further built upon this basic support.

          • bcronce
          • 8 years ago

          “Windows XP is HT aware.” aka “Compatible”

          Technically, XP was “Aware”, but only to the point where it didn’t break the machine. The thread scheduler was not “smart” about HT at all.

          Heck, the XP scheduler favored core 0, mostly because all interrupts were scheduled on it. Also, it had this horrible “load balancing” algorithm for multi-cores. Assume you have one thread and a quad core. This one thread attempts to consume an entire core. The scheduler sees the thread on core 0 and says “hey, core 0 is at 100%, but core 1 is at 0%”, so it moves the thread to core 1, where the thread again attempts to consume all of core 1. The scheduler says “hey, core 1 is at 100% and core 2 is at 0%”, so it moves the thread yet again.

          It keeps doing this over and over. This is why XP/Vista would show a “nice” even usage across all cores even for single-threaded apps. Win7 finally learned not to do that, which was a big part of the work to play nicely with HT.

            • Ryu Connor
            • 8 years ago

            [url<]http://www.infoworld.com/d/windows/windows-7s-killer-feature-windows-multicore-redux-494?page=0,2[/url<] Giant bar graphs illustrate clearly. As I stated before. XP is aware and implemented basic support. Vista and 7 built upon this and improved.

            • bcronce
            • 8 years ago

            XP is as “aware” of HT as Win7 is “aware” of Bulldozer core pairs.

            I said XP can use HT, but it does not have any HT optimizations. It treats all of the cores the same. Vista has some multi-thread tweaks which make it scale quite a bit better than XP, but Win7 actually understands HT and is optimized for it.

            I never said XP “won’t” use HT, I only said it doesn’t “understand” it, outside of treating the logical processors like regular cores. Your link doesn’t prove/disprove that in any way.

            I’ve been reading about the Win7 thread scheduler, lock tweaks, etc. for the past 3 years on MSDN blogs and Channel 9 (MSDN video). They’ve made some huge advances in thread scaling.

            • Ryu Connor
            • 8 years ago

            Yet you still haven’t read the MSDN article I linked above.

            Chapter five identifies the fact that XP and 2003 do understand that it is a logical processor, have been provided heuristics to cover scenarios of resource contention, and even details how the OS will work through various scenarios to guarantee the best performance.

            That’s not just treating it like another core as you claim.

            My beef is also with the equally ignorant statement that Vista doesn’t understand HT. You don’t see 267% increases going from 4C(8T) to 8C(16T) without some leveraging of the additional threads.

            You haven’t even bothered to link a resource that refutes my claims! Your opinion is worthless against white papers and benchmarks! You want me to take you at your word put up some resources!

            I will happily rescind my position in the face of evidence.

            • forumics
            • 8 years ago

            bleah why are we still talking about an ancient OS?
            unless you guys are dinosaurs. leave xp to die a peaceful death pls!

            • UberGerbil
            • 8 years ago

            You’re preaching that to [url=https://techreport.com/forums/viewtopic.php?f=6&t=78440<]the wrong guy[/url<].

    • Bensam123
    • 8 years ago

    It’d be nice if you could disable cores like you can disable hyperthreading on Intel CPUs. In my experience that yields a more fluid experience in games and overall use compared to running with hyperthreading… In this scenario it would function as a kill switch until the user has an OS that is properly aware of the cores and what they mean.

    It also would’ve been nice to test with a benchmark that uses 8 threads, so it would be possible to see the performance difference between running it on 4 modules with 4 cores vs. 4 modules with 8 cores. Do the extra cores even speed things up when fully utilized? Interestingly enough, the number of threads the benchmarks in the Bulldozer review use isn’t listed, so it’s hard to properly assess what effect this has overall.

      • Kurotetsu
      • 8 years ago

      According to this thread:

      [url<]http://www.xtremesystems.org/forums/showthread.php?275873-AMD-FX-quot-Bulldozer-quot-Review-%284%29-!exclusive!-Excuse-for-1-Threaded-Perf[/url<] Certain 990FX motherboards DO allow you to disable individual cores with the right BIOS. Though I imagine that's not going to be a commonplace feature. The one used in that thread is the ASUS Crosshair V Formula I think.

    • shank15217
    • 8 years ago

    Did you profile the integer vs. FPU load? Some of the benches seem pretty FPU-intensive, and in that case the threads would be sharing the resource much more.

      • vvas
      • 8 years ago

      Indeed, these results seem to be the way they are because all the tests are FPU-heavy and therefore the sharing case ends up with the shared FPU as a bottleneck. It’d be interesting to see mostly integer-heavy benchmarks (say, file compression/decompression) under the same kind of investigation.

        • Damage
        • 8 years ago

        For picCOLOR, morph is integer, skeleton is SSE2 integer, and Fourier is floating-point.

        Euler3D uses double-precision floating-point math.

        Myrimatch is a database search, and I believe it’s using integer datatypes. Update: Yeah, the FASTA file format is text-based.

    • Stargazer
    • 8 years ago

    I love reading investigations like this.

    Any chance you could give this treatment to some of your other CPU benchmarks too?

    • flip-mode
    • 8 years ago

    [quote<]Naturally, some folks who have been disappointed with Bulldozer performance to date may find solace in this outcome.[/quote<]Perhaps... [quote<]At the same time, these results take some of the air out of AMD's rhetoric about the pitfalls of Intel's Hyper-threading scheme.[/quote<]but no. If anything, your very intelligent methodology has revealed two possibilities:
    
    1. AMD's concept is essentially a failure.
    2. Bulldozer's design still has some unidentified egregious flaw that prevents it from performing to its full potential. (That seems to me, in all my ignorance, to be somehow related to the design of each module's cache. Or could it be branch prediction? Is there one branch predictor per module that gets overwhelmed by two threads, or one per entire CPU?)
    
    Awesome investigation, though, Scott. Much appreciated.
    
    Edit: just referred to your original Bulldozer review, and there is one branch predictor per module. Maybe it is getting overburdened with more than one thread. That would not surprise me, as AMD's branch prediction has been behind Intel's for some time. But it still seems to me more likely to be something with the cache itself (latency whilst juggling between two "cores"?). Dunno why I say that, because I have no educated idea what I'm talking about.
    
    Edit2: The more I think about this, the more insane the implications are. If AMD were to just extricate half of the transistors from this chip (take away one core from each module), then the design would be faster! Insane. This has to be the most disappointing new architecture from AMD ever (or something dramatic like that!).

      • Damage
      • 8 years ago

      Nah, another way to look at it is that the sharing adds ~70-90% more performance in multithreaded, integer-focused applications with a relatively small increase in die area. That’s no bad thing. They just need to get the scheduling right to help with desktop performance, where one can’t assume eight threads will be available.

      Well, they don’t *just* need that, but the architectural sharing alone isn’t necessarily a problem.

      Oh, and I doubt the rates on the front-end hardware (including the branch predictor) are really badly insufficient to feed both cores. Those things get modeled like crazy, and they have pretty robust hardware there with lots of queueing to avoid slowdowns while waiting for the next-stage unit to be ready. We talked about this some in the comments on the original review.

      However, obviously there is some sharing contention that leads to slowdowns, so I dunno. 🙂

        • flip-mode
        • 8 years ago

        Hmm… I see what you mean… I think. In a “lightly threaded” mode, where you can have just one thread per module, Bulldozer jumps by 15-20% (due partly to higher clock speed and partly to less resource contention); while in a heavily threaded mode Bulldozer loses some per-thread efficiency and some clock speed for lower overall performance per thread, but is able to handle those threads with a relatively small number of additional transistors?

        I don’t know how to take the “relatively small increase in die area” statement, though, I must confess. Zambezi does 8 threads with 2 billion transistors in 315 mm^2. Gulftown does 12 threads with 1.17 billion transistors in 248 mm^2. Sandy Bridge does 8 threads with 1 billion transistors in 216 mm^2 (and that includes integrated GPU). I’m not seeing the threads-per-mm^2 efficiency of Zambezi being in any way impressive here.
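Running the arithmetic on the figures quoted in that comment:

```python
# Threads per mm^2 of die area, using the numbers quoted above.
chips = {
    "Zambezi":      (8, 315),   # (threads, die area in mm^2)
    "Gulftown":     (12, 248),
    "Sandy Bridge": (8, 216),
}
density = {name: t / mm2 for name, (t, mm2) in chips.items()}
for name, d in sorted(density.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.4f} threads/mm^2")
```

By that metric Zambezi (~0.025 threads/mm^2) does come out behind both Sandy Bridge (~0.037) and Gulftown (~0.048).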

          • Damage
          • 8 years ago

          IIRC, AMD has said the second integer core adds something like 15% to the module area. Something relatively small. The total size of the thing is kind of a separate issue from the question of sharing, though.

            • nico1982
            • 8 years ago

            I still remember the Anandtech errata about that: the die-area cost figure for the additional core should be 50% over a single-core module.

        • Joel H.
        • 8 years ago

        The 10-20% performance hit is extremely consistent. In some tests, setting a higher clock manually for a 2M/4C configuration is enough to overcome the performance hit relative to 4M/4C.

        • Zoomer
        • 8 years ago

        I still don’t think the IF/ID width of 4 is wide enough. That’s only 2 instructions that can be issued per core per clock. Maybe they just didn’t want to make it 8 because it would slow things down too much.

          • eofpi
          • 8 years ago

          It’s more likely a TDP limiting measure. Bulldozer doesn’t exactly run cool, and limiting the number of active pipelines is one way to limit power consumption.

          It’d be nice to see a wider instruction decoder once process improvements free up some thermal headroom, though.

          Edit: I wonder how much of an impact the tiny L1D caches have. Every AMD CPU from the Slot-A K7 through Thuban had 64KB L1D–and even the K6 had 32KB. Bumping each L1D up to 64KB would only add 384KB to the chip, which is a drop in the bucket on a chip with 16MB of L2+L3–and might allow AMD to get away with a smaller, lower-latency L2 without suffering a performance hit.

            • khands
            • 8 years ago

            I think they should drop their L3 entirely from the desktop chips and bump up L1 and L2, but we’ll probably get to see how that works out with Trinity anyways.

            • OneArmedScissor
            • 8 years ago

            The latency would go through the roof if they made the L2 larger. It’s already too big. Just throwing out the L3 and all the other “uncore” phooey should work well enough.

            Of course, there are plenty of other ways for them to screw up Trinity (see: Llano).

            • khands
            • 8 years ago

            Llano is a pretty good chip all around, it does what it needs to and is a sufficient upgrade in heat and power compared to the old Turion stuff. But no one is saying they don’t still have a long ways to go.

    • link626
    • 8 years ago

    you get a 10% boost in lighter threading, but once you force 8 threads, you get back to crappy per-thread performance.

    also, doesn’t 10% bring it back to JUST K10.5 performance ?

      • shank15217
      • 8 years ago

      The original review is right here; why don’t you look it up?

      • Anonymous Coward
      • 8 years ago

      Did you expect 8 threads to run on shared resources without compromises?

      • Waco
      • 8 years ago

      10% doesn’t even bring it close.

      My FX-8120 that’s headed back to Newegg was SLOWER in single-threaded performance at 4.375GHz than my Phenom II at 3.4GHz.

      It needs more like a 25%+ boost in IPC to be comparable…

    • lycium
    • 8 years ago

    I hope the motherboard makers can magic up a BIOS option per your (clever!) suggestion to piggyback on Intel HT software optimisations, because otherwise Bulldozer-owning users of our rendering engine will be even more disadvantaged compared to those with an i7 until we can get a machine for testing and validation…

      • Celess
      • 8 years ago

      If you have your own rendering engine you can also check CPUID, which is very easy to do, and set your own threads’ affinity.

      Generally, setting an affinity provides more stable utilization for realtime or sustained workloads regardless, due to the scheduling core hops that “all” of the current Windows OSes do on every CPU architecture, Intel or AMD. Setting an affinity is sort of like saying: hey, if you start moving threads around, I’d sort of like to stay on this one; pick me last, if at all.
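A rough sketch of the detection half of that suggestion. Rather than issuing raw CPUID instructions, this parses /proc/cpuinfo-style text; Bulldozer-derived chips report CPU family 15h (21 decimal), and the helper name here is mine:

```python
def is_bulldozer(cpuinfo_text):
    """Heuristic: AMD family 21 (0x15) covers Bulldozer-derived chips.
    Parses /proc/cpuinfo-style text instead of raw CPUID."""
    vendor = family = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, val = line.partition(":")
        key, val = key.strip(), val.strip()
        if key == "vendor_id":
            vendor = val
        elif key == "cpu family":
            family = int(val)
        if vendor is not None and family is not None:
            break
    return vendor == "AuthenticAMD" and family == 21

sample = "vendor_id\t: AuthenticAMD\ncpu family\t: 21\nmodel\t: 1\n"
print(is_bulldozer(sample))  # True
```

Once a module-pair layout is confirmed, the engine can hand out one worker per module before doubling up, per-platform affinity call aside.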
