As you may know if you read our original FX processor review, the shared nature of the “modules” in AMD’s new Bulldozer architecture presents some conundrums for OS and application developers who want to extract the best possible performance. Each module has two integer cores that are wrapped in some shared resources, including the front-end instruction fetch and decode units, the FPU, and the L2 cache and its associated data pre-fetchers. AMD claims this level of sharing is superior to what goes on in recent Intel processors, whose more resource-rich individual cores can each track and execute two threads, because the performance of each thread in a Bulldozer module is more “robust” and predictable, less likely to stall due to resource contention.
A scheduling conundrum
That sounds good in theory, but it raises a vexing question: what is the best way to schedule threads on a four-module, eight-core Bulldozer-based chip? Say your application uses four threads, and you want the best possible performance. Would it be better to group those four threads together on the four cores contained in a pair of Bulldozer modules, or is the best approach to spread them across all four modules? Each way has its advantages.
Your first instinct might be to spread the threads across four modules, in order to avoid resource sharing. That’s generally the best approach on an Intel processor with Hyper-Threading. On Bulldozer, that means the front end, FPU, and memory subsystem in each module would be at the full disposal of a single thread.
However, AMD claims Bulldozer’s sharing arrangement has a relatively small performance penalty. Also, if all four threads occupy only two modules, the CPU could potentially run at a higher clock speed thanks to Turbo Core dynamic clock frequency scaling. In the case of our FX-8150 chip, that means the cores would top out at 4.2GHz, whereas they’d stop at 3.9GHz with all four modules active. What’s more, AMD claims the performance of related threads may benefit from sharing data via the module’s L2 cache.
So what’s the best approach? That’s tough to say for sure, simply by considering the theory, and it may depend upon the situation.
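One rough way to frame the trade-off is to ask how large the sharing penalty can be before the higher Turbo Core clock stops paying for it. The sketch below uses the FX-8150 clock speeds quoted above and assumes, as a simplification, that throughput scales linearly with clock frequency. The function name is ours, purely for illustration.

```python
def break_even_penalty(packed_ghz, spread_ghz):
    """Largest per-thread slowdown from sharing that the higher packed clock can absorb.

    Assumption: throughput is proportional to clock speed, which real
    workloads only approximate.
    """
    return 1.0 - spread_ghz / packed_ghz


# Turbo Core limits quoted for the FX-8150: 4.2GHz with two modules active,
# 3.9GHz with all four modules active.
clock_gain = 4.2 / 3.9 - 1.0
penalty = break_even_penalty(4.2, 3.9)

print(f"clock advantage when packed: {clock_gain:.1%}")  # 7.7%
print(f"break-even sharing penalty:  {penalty:.1%}")     # 7.1%
```

Under that assumption, packing threads onto two modules only comes out ahead if sharing a module costs each thread less than about 7% of its performance.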
One thing we do know is that the scheduler in Windows 7 isn’t any help. The OS is completely unaware of how Bulldozer modules work; it sees only eight equal cores and schedules threads on them evenly. AMD says Windows 8 will address these issues, but outside of developer previews and such, that OS probably won’t be available for another year, at least.
We can, however, take charge of thread scheduling ourselves in certain cases and see how it affects performance. With the aid of some simple tools, we tested a few of the geekier applications used in our Bulldozer review using explicit thread scheduling in order to see what would happen.
Our basic setup is relatively straightforward. Some of our benchmarks run from the command line and can be configured to use a specific number of threads. We told them to use four threads and invoked them via the Windows “start” command, which has an option to set the processor affinity mask when launching a program. By modifying the mask, we can restrict the program’s threads to specific CPU cores. For instance, this command:
start /AFFINITY 55 /b /WAIT e3dbm 5 4 > ed3dbm-4-1.txt
…launches our Euler3D computational fluid dynamics test, tells it to run five iterations with four threads, and stores the output in a text file. The “55” after the “/AFFINITY” switch is the mask that specifies which cores to use. The mask format may seem hard to decipher because it’s in hexadecimal; translated into binary, two instances of the number five side by side look like so:
01010101
A one specifies a core to be used, and a zero specifies a core to be skipped. In this case, the mask tells the OS scheduler to assign threads to every other core—or one per module, in the case of Bulldozer.
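For readers who would rather not convert hexadecimal by hand, here is a minimal sketch of the decoding. It assumes the usual convention that the rightmost (lowest-order) bit corresponds to core 0; the helper names are our own and not part of any Windows tool.

```python
def mask_to_binary(hex_mask, width=8):
    """Render a hexadecimal affinity mask as a fixed-width binary string."""
    return format(int(hex_mask, 16), f"0{width}b")


def cores_from_mask(hex_mask):
    """Return the zero-based core numbers whose bits are set in the mask.

    Assumption: bit 0 (the rightmost bit) maps to core 0.
    """
    mask = int(hex_mask, 16)
    cores = []
    core = 0
    while mask:
        if mask & 1:          # lowest bit set means this core is allowed
            cores.append(core)
        mask >>= 1            # move on to the next core's bit
        core += 1
    return cores


print(mask_to_binary("55"))   # 01010101
print(cores_from_mask("55"))  # [0, 2, 4, 6]
```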
We were able to verify that the “55” mask was doing what we expected by watching the Windows Task Manager and monitoring CPU clock frequencies. As you can see in the screenshot above, we have a thread on every other core, and our CPU is topping out at 3.9GHz, which is the expected Turbo Core behavior when all four modules are active. We also verified the core and module configuration with AMD, who characterized it as:
Module 0 = Core 0 and 1, Module 1 = Core 2 and 3, Module 2 = Core 4 and 5, Module 3 = Core 6 and 7
So we’re fairly certain we know what we’re getting here.
Our other mask option, 0F, translates into binary as “00001111”. This option packs all four threads onto two modules, forcing some resource sharing but also allowing for the higher Turbo clock frequency of 4.2GHz. Again, the behavior is easily verifiable, as shown above.
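Both masks can also be derived from the module layout AMD described instead of being worked out by hand. The following is only a sketch under that layout (module n holds cores 2n and 2n+1); the function names are hypothetical, and the command string simply mirrors the “start” invocation shown earlier, minus the output redirection.

```python
CORES_PER_MODULE = 2  # per AMD's description: module n = cores 2n and 2n+1


def spread_mask(threads, modules=4):
    """One thread per module: pick the first core of each module."""
    if threads > modules:
        raise ValueError("more threads than modules")
    mask = 0
    for module in range(threads):
        mask |= 1 << (module * CORES_PER_MODULE)
    return mask


def packed_mask(threads, modules=4):
    """Fill both cores of a module before moving to the next one."""
    if threads > modules * CORES_PER_MODULE:
        raise ValueError("more threads than cores")
    return (1 << threads) - 1


def start_command(mask, program_args):
    """Build a Windows 'start' command line using a two-digit hex mask."""
    return f"start /AFFINITY {mask:02X} /b /WAIT {program_args}"


print(start_command(spread_mask(4), "e3dbm 5 4"))  # mask 55
print(start_command(packed_mask(4), "e3dbm 5 4"))  # mask 0F
```

Generating the masks this way makes it easier to try other thread counts, such as two threads on one module versus two threads on separate modules.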
In addition to our command line specifications, we have some special, affinitized builds of the picCOLOR image analysis program, courtesy of Dr. Reinert Mueller, who has long supported us with custom builds of his software tailored for new CPU architectures. Several of picCOLOR’s functions use two or four threads, and their performance is potentially altered by where they run.
Without further ado, here’s what happened when we ran our test apps with the default Windows 7 thread scheduling (that is, with no awareness of modules or sharing) and with our two different affinity masks.
These results couldn’t be much more definitive. In every case but one, distributing the threads one per module, and thus avoiding sharing, produces roughly 10-20% higher performance than packing the threads together on two modules. (And that one case, the FDom function in picCOLOR, shows little difference between the three affinity options.) At least for this handful of workloads, the benefits of avoiding resource sharing between two cores on a module are pretty tangible. Even though the packed config enables a higher Turbo Core frequency of 4.2GHz, the shared config is slower.
Our test apps, obviously, are not your typical desktop applications, and they may not be a perfect indicator of what to expect elsewhere. However, since many games and other apps are lightly threaded, with three or four threads handling the bulk of the work, we wouldn’t be surprised if one-per-module thread affinities were generally a win on Bulldozer-based processors.
At the same time, these results take some of the air out of AMD’s rhetoric about the pitfalls of Intel’s Hyper-Threading scheme. The truth is that both major x86 CPU makers now offer flagship desktop CPU architectures with a measure of resource sharing between threads, and proper scheduling is needed in order to extract the best performance from them both. (This situation mirrors what’s happened in 2P servers in recent years, where applications must be NUMA-aware on current x86 systems in order to achieve optimal throughput.) A gain of up to 20% on a CPU this quick is certainly worthy of note.
Trouble is, right now, Intel has much better OS and application support for Hyper-Threading than AMD does for Bulldozer. In fact, we’re a little surprised AMD hasn’t attempted to piggyback on Intel’s Hyper-Threading infrastructure by making Bulldozer processors present themselves to the OS as four physical cores with eight logical threads. One would think that might be a nice BIOS menu option, at least. (Hmm. Mobo makers, are you listening?)
At any rate, application developers who want to make the most of Bulldozer are free to affinitize threads in upcoming revisions of their software packages anytime. If AMD can persuade some key developers to help out, it’s possible the next round of desktop applications could benefit very soon.