A closer look at AMD's dual-core architecture
Let's start by looking at a very simplified diagram of a dual-core Opteron, which looks like so:
As you can see, AMD didn't simply glue a pair of K8 cores together on a single piece of silicon. They've actually done some integration work at a very basic level, so that the two CPU cores can act together more effectively. Each of the K8 cores has its own, independent L2 cache onboard, but the two cores share a common system request queue. They also share a dual-channel DDR memory controller and a set of HyperTransport links to the outside world. Access to these I/O resources is adjudicated via a crossbar, or switch, so that each CPU can talk directly to memory or I/O as efficiently as possible. In some respects, the dual-core Opteron acts very much like a sort of SMP system on a chip, passing data back and forth between the two cores internally. To the rest of the system I/O infrastructure, though, the dual-core Opteron looks more or less like the single-core version.
The Opteron's system architecture remains very different from that of its primary competitor, Intel's Xeon. AMD says its so-called Direct Connect architecture was over-designed for single-core Opterons with an eye to the dual-core future. Each processor (whether dual core or single) has its own local dual-channel DDR memory controller, and the processors talk to one another and to I/O chips via point-to-point HyperTransport links running at 1GHz. This arrangement makes for a network-like system topology with gobs of bandwidth. The total possible bandwidth flowing through the 940 pins of an Opteron 875 is 30.4GB/s, technically enough to choke a horse. With one less HyperTransport link, the Opteron 275 can theoretically hit 22.4GB/s.
By contrast, current Xeons have a shared front-side bus on which the north bridge chip (with memory controller) and both processors reside. At 800MHz, its total bandwidth is 6.4GB/s, a possible bottleneck in certain situations.
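For the curious, here is a rough sketch of how those peak numbers can be reproduced. The breakdown is my own assumption rather than something AMD or Intel spells out here: DDR400 memory on two 64-bit channels, 16-bit HyperTransport links counted in both directions, and a 64-bit front-side bus. The link counts (three for the 875, two for the 275) are inferred from the "one less link" remark and the totals quoted above.

```python
# Back-of-the-envelope peak bandwidth figures, in GB/s (decimal units).
# Assumed, not stated in the article: DDR400 memory, 16-bit HyperTransport
# links, a 64-bit front-side bus, and the per-chip link counts used below.

def ddr_bandwidth(channels, mega_transfers_per_s=400, bus_bytes=8):
    """Peak memory bandwidth in GB/s for the given number of channels."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


def ht_link_bandwidth(clock_ghz=1.0, width_bytes=2):
    """Peak bandwidth of one HyperTransport link, both directions combined."""
    per_direction = clock_ghz * 2 * width_bytes  # two transfers per clock
    return per_direction * 2


def opteron_total(ht_links):
    """Dual-channel memory plus the given number of HyperTransport links."""
    return ddr_bandwidth(channels=2) + ht_links * ht_link_bandwidth()


def fsb_bandwidth(mega_transfers_per_s=800, bus_bytes=8):
    """Peak bandwidth of a shared front-side bus in GB/s."""
    return mega_transfers_per_s * bus_bytes / 1000


print(f"Opteron 875 (3 HT links): {opteron_total(3):.1f} GB/s")
print(f"Opteron 275 (2 HT links): {opteron_total(2):.1f} GB/s")
print(f"Xeon front-side bus:      {fsb_bandwidth():.1f} GB/s")
```

Keep in mind that these are theoretical peaks summed across every interface on the chip, not a figure any single workload would see.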
In order to understand the impact of AMD's dual-core chip design and system architecture, we should briefly discuss cache coherency. This scary-sounding term actually names one of the bigger challenges in a multiprocessor system. How do you handle the fact that one CPU may have a certain chunk of data in its cache and be modifying it while another CPU wants to read it from memory and operate on it, as well? Assuming you don't run from the room screaming in fear at the complexity of it all, the answer is some sort of cache coherency protocol. Such a protocol stores information about the status of data in the cache and offers updates to other CPUs in the system when something changes.
Intel's Xeons use a cache coherency protocol called MESI. MESI is an acronym that stands for the various states that data in the CPU's cache can be flagged as: modified, exclusive, shared, or invalid. Let's tackle them completely out of order, just to be difficult. If a CPU pulls a chunk of data into cache and has not modified it, the data will be flagged as Exclusive. Should another CPU pull that same chunk of data into its cache, the data would then be marked as Shared. Then let's say that one of the processors were to modify that data; the data would be marked locally as Modified, and the same chunk on the other CPU would be flagged as Invalid.
The processor with the Invalid data in its cache (CPU 0, let's say) might then wish to modify that chunk of data, but it could not do so while the only valid copy of the data is in the cache of the other processor (CPU 1). Instead, CPU 0 would have to wait until CPU 1 wrote the modified data back to main memory before proceeding, and that takes time, bus bandwidth, and memory bandwidth. This is the great drawback of MESI.
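To make the MESI walkthrough concrete, here is a toy model of a single cache line shared by two CPUs. It is only a sketch of the simplified behavior described above, not a faithful model of any real Xeon. The class and method names (`MesiLine`, `read`, `write`, `snapshot`) are made up for illustration, and the one rule it bakes in is the assumption from the text that a Modified copy must be written back to main memory before another CPU can touch the data.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class MesiLine:
    """One cache line as seen by several CPUs (toy model, hypothetical names)."""

    def __init__(self, n_cpus=2):
        self.states = [State.INVALID] * n_cpus
        self.memory_writebacks = 0  # trips to main memory forced by sharing

    def _write_back_modified(self, requester):
        # Simplifying assumption from the text: a Modified copy held by
        # another CPU goes back to main memory before anyone else proceeds.
        for cpu, state in enumerate(self.states):
            if cpu != requester and state is State.MODIFIED:
                self.memory_writebacks += 1
                self.states[cpu] = State.SHARED

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # already holds a valid copy
        self._write_back_modified(cpu)
        others_valid = False
        for other, state in enumerate(self.states):
            if other != cpu and state is not State.INVALID:
                self.states[other] = State.SHARED
                others_valid = True
        self.states[cpu] = State.SHARED if others_valid else State.EXCLUSIVE

    def write(self, cpu):
        self._write_back_modified(cpu)
        for other in range(len(self.states)):
            if other != cpu:
                self.states[other] = State.INVALID
        self.states[cpu] = State.MODIFIED

    def snapshot(self):
        """States as a string, CPU 0 first, e.g. 'IM'."""
        return "".join(state.value for state in self.states)


line = MesiLine()
line.read(1)    # CPU 1 pulls the data in: Exclusive
line.read(0)    # CPU 0 pulls the same data: both Shared
line.write(1)   # CPU 1 modifies it: CPU 0's copy becomes Invalid
line.write(0)   # CPU 0 wants to modify it: CPU 1 must write back first
print(line.snapshot(), line.memory_writebacks)
```

The number to watch is `memory_writebacks`: the final step cannot complete without a round trip to main memory, which is exactly the cost the paragraph above complains about.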
AMD sought to address this problem by making use of a cache coherency protocol called MOESI, which adds a fifth possible state to its quiver: Owner. (MOESI is used by all Opterons and was even used by the Athlon MP and 760MP chipset back in the day.) A CPU that "owns" certain data has that data in its cache, has modified it, and yet makes it available to other CPUs. Data flagged as Owner in an Opteron cache can be delivered directly from the cache of CPU 0 into the cache of CPU 1 via a CPU-to-CPU HyperTransport link, without having to be written to main memory.
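Here is a similarly rough sketch of how the extra state changes the picture. As before, this is a toy model with hypothetical names (`MoesiLine`, `read`, `write`, `evict`), not AMD's implementation. It assumes that a cache holding dirty data hands it straight to the requesting CPU, and that main memory is only updated when a dirty line is eventually evicted.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


DIRTY = (State.MODIFIED, State.OWNED)  # copies that are newer than main memory


class MoesiLine:
    """One cache line as seen by several CPUs (toy model, hypothetical names)."""

    def __init__(self, n_cpus=2):
        self.states = [State.INVALID] * n_cpus
        self.cache_to_cache = 0      # direct CPU-to-CPU data transfers
        self.memory_writebacks = 0   # writes of dirty data to main memory

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # already holds a valid copy
        others_valid = False
        for other, state in enumerate(self.states):
            if other == cpu or state is State.INVALID:
                continue
            others_valid = True
            if state in DIRTY:
                # The dirty holder supplies the data itself and stays
                # responsible for it: no trip to main memory.
                self.cache_to_cache += 1
                self.states[other] = State.OWNED
            else:
                self.states[other] = State.SHARED
        self.states[cpu] = State.SHARED if others_valid else State.EXCLUSIVE

    def write(self, cpu):
        needs_data = self.states[cpu] is State.INVALID
        for other, state in enumerate(self.states):
            if other == cpu:
                continue
            if state in DIRTY and needs_data:
                self.cache_to_cache += 1  # fetch the latest copy directly
            self.states[other] = State.INVALID
        self.states[cpu] = State.MODIFIED

    def evict(self, cpu):
        # Assumption: dirty data reaches main memory only on eviction.
        if self.states[cpu] in DIRTY:
            self.memory_writebacks += 1
        self.states[cpu] = State.INVALID

    def snapshot(self):
        """States as a string, CPU 0 first, e.g. 'SO'."""
        return "".join(state.value for state in self.states)


line = MoesiLine()
line.read(1)    # CPU 1: Exclusive
line.read(0)    # both Shared
line.write(1)   # CPU 1: Modified, CPU 0: Invalid
line.read(0)    # CPU 1 becomes Owner and feeds CPU 0 directly
print(line.snapshot(), line.cache_to_cache, line.memory_writebacks)
```

In this sketch the sharing steps never touch main memory. The write-back cost is deferred until a dirty line is evicted, rather than paid every time two CPUs trade data.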
That alone is a nice enhancement over MESI, but the dual-core Opterons take things a step further. In the dual-core chip, cache coherency for the two local CPU cores is still managed via MOESI, but updates and data transfers happen through the system request interface (SRI) rather than via HyperTransport. This interface runs at the speed of the CPU, so transfers from the cache on core 0 into the cache on core 1 should happen very, very quickly. Externally, MOESI updates from a pair of cores in a socket are grouped in order to keep HyperTransport utilization low.
Again, this is quite the contrast with Intel's dual-core implementation. On Smithfield, the two cores still behave almost exactly like a pair of Xeons in two sockets: MESI updates are communicated over the front-side bus, and there is no alternative internal on-chip data path.
Interestingly, the ability of the two cores to pass data quickly to one another seems to offer a compelling enough performance benefit that, from what I gather, AMD's guidance to OS vendors has been to give priority to scheduling threads on adjacent cores first before spinning off a thread on a CPU core on another socket. That's despite the fact that there's additional memory bandwidth available on the second socket.
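As a purely illustrative sketch of that placement preference, the function below picks a home for a new thread given the core where a related thread already runs. Nothing here reflects a real OS scheduler. The `pick_core` name, the `(socket, core)` tuples, and the tie-breaking rule are all assumptions of mine.

```python
def pick_core(idle_cores, partner_core):
    """Choose an idle core for a thread that shares data with partner_core.

    Cores are hypothetical (socket, core) tuples. An idle core on the
    partner's socket wins; otherwise fall back to any other idle core.
    Returns None when nothing is idle.
    """
    candidates = [core for core in idle_cores if core != partner_core]
    same_socket = [core for core in candidates if core[0] == partner_core[0]]
    if same_socket:
        return min(same_socket)  # lowest-numbered core, an arbitrary tie-break
    return min(candidates) if candidates else None


# The related thread runs on socket 0, core 0.
print(pick_core([(1, 0), (0, 1), (1, 1)], partner_core=(0, 0)))  # (0, 1)
print(pick_core([(1, 0), (1, 1)], partner_core=(0, 0)))          # (1, 0)
```

A real scheduler would weigh much more than this, including the extra memory bandwidth on the second socket mentioned above.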