From any high-level view it's clearest to think of each module as a hyperthreaded core, but at the level where the OS scheduler operates it's not at all clear that treating it purely that way is optimal. (Presumably the Windows scheduler patch would have been out earlier if it were.)
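To make the scheduler's view concrete: on Linux the kernel reports sibling relationships through sysfs, and it exposes Bulldozer module-mates through the same `thread_siblings_list` interface it uses for Hyper-Threading pairs, which is exactly the ambiguity those scheduler patches had to deal with. A minimal sketch for inspecting this (the sysfs files are real Linux interfaces; the grouping function itself is just an illustration):

```python
import glob

def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Collect the sibling groups the kernel reports for each logical CPU.

    On a Bulldozer system each group is the pair of integer cores sharing
    one module; on an Intel chip with Hyper-Threading it is the two
    hardware threads of one physical core. The kernel reports both the
    same way, which is why the scheduler can't trivially tell them apart.
    """
    groups = set()
    for path in glob.glob(sysfs_root + "/cpu[0-9]*/topology/thread_siblings_list"):
        try:
            with open(path) as f:
                # e.g. "0-1" or "0,4" depending on enumeration order
                groups.add(f.read().strip())
        except OSError:
            pass  # CPU offline, or sysfs unavailable (non-Linux system)
    return sorted(groups)

if __name__ == "__main__":
    for group in sibling_groups():
        print(group)
```

Running this on an FX-81xx shows four groups of two, indistinguishable at this level from a quad-core with HT.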
I guess this was all news to you, rcs2k4, and the marketing from AMD (and its partners) doesn't go out of its way to clarify things, but this issue got quite a bit of debate a couple of years ago when the Bulldozer design was first disclosed, long before the chips were actually available. (Here's a post I made on the topic a year ago, though you may find the whole thread interesting; the "Kanter article" I refer to is here.)
I don't get why they decided to use only 4 FPUs and share each one between a pair of so-called cores. Logic dictates there should be 8, and if it had 8, performance would be pretty decent (again, in theory). Is the FX-4100 then half of an FX-81xx, with only 2 FPUs, 4 so-called cores behind them, and all of its bigger brothers' L3 cache?
It cuts down the die size, meaning more chips per wafer and a lower chance of a flaw in any given chip. That makes the chips cheaper and more efficient for AMD to produce. For more background on AMD's logic, see page two of that Kanter article.
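The two effects compound: a smaller die both fits more candidates on a wafer and is less likely to catch a defect. A back-of-the-envelope sketch using the standard Poisson defect-yield approximation (all numbers here are illustrative, not AMD's actual figures):

```python
import math

def good_dies_per_wafer(die_area_mm2, defect_density_per_cm2=0.5,
                        wafer_diameter_mm=300):
    """Estimate good dies per wafer under a Poisson defect model.

    Smaller dies win twice: more candidate dies fit on the wafer, and
    each die is less likely to contain a fatal defect. Defect density
    and edge losses are simplified; this is a textbook approximation.
    """
    wafer_area_mm2 = math.pi * (wafer_diameter_mm / 2) ** 2
    candidates = wafer_area_mm2 / die_area_mm2  # ignores edge loss
    # Poisson yield: probability a die of this area has zero defects
    yield_fraction = math.exp(-defect_density_per_cm2 * die_area_mm2 / 100.0)
    return candidates * yield_fraction

small_die = good_dies_per_wafer(150)  # hypothetical smaller design
large_die = good_dies_per_wafer(300)  # hypothetical larger design
```

With these illustrative inputs, halving the die area more than doubles the number of good dies per wafer, which is the economic logic behind trimming shared resources like the FPU.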
Sharing the FP unit makes sense for server chips, since a lot of server loads don't do any FP work at all: webservers just want as many integer cores as possible, for example, and many database loads are well-threaded but have minimal FP demands. There is precedent for this in server-oriented processors: Sun's first Niagara chip (UltraSPARC T1) went even further in this direction, sharing a single FP unit among all eight cores (32 threads). (That was a little too extreme even for Sun's server customers, and the T2 version added an FP unit per core, though still shared by eight threads.)
Of course such a design is not well-suited to the demands of the "enthusiast"/workstation desktop and especially gaming, which tends to be FP-intensive, though we might note that the number of games that can keep even four cores busy with FP loads is still small (albeit growing). Bulldozer really does look like a server chip being sold in non-server markets: it doesn't have integrated graphics either, for example. Even the Trinity chip, which will include an IGP alongside an improved Bulldozer core, isn't going to be what you're looking for (its natural market is mobile, where the IGP should be adequate for gaming at typical laptop resolutions). But then AMD has for many years said they're focusing on servers and mobile, since those markets are the most profitable for them and they don't really have the luxury of dabbling in anything else.
Even so, it should be a better match for desktop loads than it is. The trouble appears to be as much in the execution as the philosophy: many investigators have noted that the Bulldozer cache system in particular is higher-latency than Intel's equivalents, among other problems. Had AMD released a stronger implementation, we might be less focused on the shortcomings of its design.
However, it's possible they'll eventually get back to what you're looking for whenever a process shrink gives them more transistors to spend on FP resources -- though they might be shared with the IGP, as ish718 suggests and as AMD's original "Fusion" vision promised long ago.