Bigger L2 caches for better performance
Skylake-X also has a much different cache allocation per core compared to its mainstream counterparts. Instead of the relatively small 256kB L2 cache (or mid-level cache, in Intel parlance) in Skylake-S and Broadwell-E, each Skylake-X core enjoys a whopping 1MB of private L2. In support of AVX-512, the bandwidth between the L1 data cache and the L2 cache has been increased to 128 bytes per cycle. On top of that size increase, Intel quadrupled the associativity of the cache from four ways in to 16 ways in Skylake-X. Intel says the move to a larger private cache lets programmers keep usefully large data structures close to the core, and the result is higher performance. Pretty cut and dry.
Intel says it undertook this change because it felt its older architectures placed too much emphasis on data sharing through the L3 caches. In turn, Skylake-X's architects reduced the shared last-level cache allocation to as much as 1.375MB per core, compared to as much as 2.5MB per core for Broadwell-E chips. This last-level cache isn't inclusive of the L2 caches, and it serves as a victim cache for the L2. The tradeoff for this rebalancing of cache allocations is a higher L3 cache access latency, according to Intel.
Finally, Intel is abandoning the ring topology it's used to connect CPU cores in its many-core CPUs for several generations. In place of its ring, the company is introducing a new (or at least new outside of the Knights Landing accelerator) mesh interconnect topolgy that promises several improvements. First off, Intel says its mesh interconnect delivers lower latency and higher bandwidth than the ring bus, all while operating at a lower frequency and voltage. Those last two characteristics are important, because they should result in less power consumption from the interconnect portion of the chip as it scales up.
Intel also says that the mesh design also allows it to include units like I/O, inter-socket interconnects, and memory controllers in a modular, scalable way as core counts increase. The company claims the distribution of these elements across the chip using the mesh minimizes undesirable "hot spots" of activity that might ultimately constrain cores' access to those critical resources, limiting performance.
The mesh design should also offer a boon for applications that need to do a lot of inter-core communication. The last-level cache in Skylake-X is distributed across each core, and thanks to the more uniform access characteristics of the mesh, Intel claims that application developers no longer have to worry about non-uniform latencies when accessing data in those caches. Cores should also enjoy more uniform access characteristics when accessing the die's I/O and memory controller, as well.
Previously, the shared L3 caches on a chip might have resided on different rings, requiring cores to communicate across the buffered switches formerly used to join discrete rings on the die. These switches added latency on top of that incurred by traversing the ring bus in the first place—something that Intel gave customers the opportunity to avoid in past chips with a "cluster-on-die" mode that turned each ring into something resembling a NUMA domain of its own. The mesh topology in Skylake-X should make headaches from the non-uniform distribution and access latencies of resources among rings a thing of the past.
As for characteristics of the Skylake-X silicon itself, Intel honchos clammed up when we asked about die size and transistor count. The company believes that disclosing this information will lead to unfounded conclusions from its competitors about the quality of their chip designs and process technologies compared to Intel's. Only the paranoid survive, we suppose.