In perhaps its most wide-ranging and technically dazzling display in years, Intel offered us a look into its direction as an integrated device manufacturer, steered in part by the architectural leadership of luminaries Raja Koduri and Jim Keller. The company brought a small group to the former estate of Intel founder Robert Noyce to set the stage for a detailed look at how it plans to yoke its vast range of technology into a coherent whole for its future.
Before we go any further, I have to note that Intel opened a firehose of information for us, and the embargo window for that information was quite abbreviated. Worse, my early flight home coincides with the embargo lift, more or less. I’ll be adding to this article throughout the day as I’m able, but we weren’t able to cover the entirety of what Intel showed in one go. Thanks for your patience as we digest this banquet of information.
Although the ostensible order of discussion for the day was both system-level architectures and microarchitectures, the question of process execution came up over and over in the course of our talks. The company was refreshingly frank about the fact that by binding process and architecture together in its cadence of advancement, it had exposed both itself and its customers to risk in the case that its manufacturing group were to stumble in the delivery of that process—and stumble it did, in the case of the 10-nm node. Both Koduri and Keller forcefully stated that the company wouldn’t allow that kind of catastrophic misstep to happen again, as it had harmed not just the company’s own roadmap but also those of the customers that depended on Intel’s reliable delivery of new products to expand the capabilities, performance, and longevity of systems with Intel inside.
Rather than offer up another pithy mnemonic discussing its development cadence, the company acknowledged that future architectures would be developed independent of process and built with the fabrication technique that made the most sense in the timeframe it needed to deliver those products to customers, whether that might be a leading-edge node for density and power reasons or an older node where performance was paramount and power and density were less important.
Furthermore, Keller noted that in the case that a problem did arise, the company would still be able to deliver a product that fulfilled promised improvements in performance and capability to customers by falling back on alternate manufacturing techniques in its arsenal, a position he described as being inspired by his days at Apple. The iPhone maker is well-known for delivering products on a predictable schedule every year, and Keller said that Apple always had contingency plans so that its latest and greatest stuff wouldn’t be held up by unforeseen roadblocks in production.
That approach was affirmed to me in a conversation with Intel fellow Ronak Singhal, who heads the Intel Architecture Cores Group at the company. Singhal noted that Intel is now approaching its core processor design more cautiously by logically describing a core earlier on in the life cycle of architectural development with fewer baked-in assumptions about the process it might be built on. The logical description of the chip can later be married to a physical process closer to the time when the company needs to produce it.
While that strategy may seem obvious on its face, my understanding is that past Intel cores were much more closely married to the physical processes that would be used to build them, and there was little room for error in the previously unthinkable event that a core needed to be migrated to a different manufacturing process. With its new development approach, the company is apparently better positioned to produce its newer tech on older nodes if need be so that it can still give customers what they need to build around new processors with new capabilities in a predictable time frame, even in the event that the manufacturing group isn’t ready with the latest and greatest process.
That view is consistent with the industry-wide idea that leading-edge process nodes are now long-term investments and that companies plan to extract value from them for as long as possible by whatever means necessary. While Intel still plans to do the hard work and investment required to develop leading-edge processes, the company ultimately wants its future to be defined more by the products it can deliver rather than process leadership first and product second. The cynic might point out that we’ve already seen the result of this strategy in three years and counting of Skylake-derived CPUs with ever-increasing clock speeds and core counts, but Koduri and Keller both seemed adamant that the long rule of Skylake was an aberration rather than the future of the firm—assuming all goes well from this point forward.
Ice Lake freezes over
Although nobody would explicitly say so at the event, Intel has essentially halted any volume-production plans it might have had for the Cannon Lake microarchitecture introduced with the Core i3-8121U. The first next-generation core built on the 10-nm process that Intel is confident enough to talk about in detail is called Sunny Cove, a name that refers only to the CPU core and not the SoCs that the company plans to build around it. That said, and despite some taciturn responses to questioning on this point at the event, I’m confident in saying that the first 10-nm processors that Intel plans to introduce in volume will fly the Ice Lake code name.
Sunny Cove is the first core on the company’s revised 10-nm roadmap, and it’ll presumably begin arriving in client systems in the second half of 2019. The core roadmap also includes Willow Cove, whose highlights may include a cache redesign, a “new transistor optimization,” and enhanced security features. The Golden Cove follow-on in the 2021 time frame returns the focus to single-threaded performance, AI performance, networking and 5G performance, and further security enhancements.
Intel-watchers have long desired better fundamental per-core performance, and Sunny Cove appears positioned to deliver. Intel’s Ronak Singhal noted that the best way to extract general-purpose performance improvements from a CPU is to make it deeper (by finding more opportunities for parallelism), wider (by making it possible to execute more operations in parallel), and smarter (by introducing newer and better algorithms to reduce latency).
Sunny Cove goes deeper by expanding its caches and record-keeping infrastructure to keep more instructions and data near the core and in flight. This core moves from a 32-KB L1 data cache to a 48-KB allocation. The L2 cache per core will increase, although as we’ve seen in the divergence between Skylake client and server cores, the amount of L2 will differ by product. The micro-op cache also increases in size, and the second-level translation lookaside buffer (TLB) is also more copious than in Skylake.
Sunny Cove is also a fundamentally wider core than any Intel design since Sandy Bridge, expanding from four-wide issue to five-wide and increasing the number of execution ports from eight to 10. Each of those execution units, in turn, is more capable than those of Skylake. Intel added a dedicated integer divider on port 1 to reduce latency for those operations.
The core now has two pathways for storing data, and it now has four address-generation units (up from three in Skylake). The vector side of the chip now has two shuffle units (up from one in Skylake), and every one of the four main execution ports can now perform a load effective address (LEA) operation, up from two such units in Skylake. Sunny Cove also implements support for the AVX-512 instruction set extension that was first meant to be introduced to client systems by way of Cannon Lake.
An early Ice Lake-SP package.
To bolster the idea that Intel’s 10-nm process is in a healthier place than it has been of late, we saw at least three separate implementations of Sunny Cove cores running: at least one development board using an Ice Lake-U processor, another development board featuring Intel’s Foveros 3D packaging technique (more on that later), and an Ice Lake-SP Xeon demonstrating new extensions to the AVX-512 instruction set. While the company certainly wasn’t ready to talk exact die sizes, it was heartening to see 10-nm silicon ranging from minuscule to massive in operation.
Gen11 graphics promise high-end features for baseline gaming
As Sunny Cove will be the next-generation building block of Intel’s general-purpose compute resources, the Gen11 IGP will serve as the next pixel-pushing engine for Ice Lake processors. Intel gave us a high-level look at the GT2 configuration of its Gen11 architecture during its event. For the unfamiliar, GT2 is the middle child of Intel’s integrated graphics processors and sits on the die of many of the company’s mainstream CPUs.
A prettied-up representation of the Gen11 IGP
Most prominently, Intel wants to establish a teraflop of single-precision floating-point throughput as the baseline level of performance users can expect from GT2 configurations of Gen11. Compared to the roughly 440 GFLOPS (and yes, that’s giga with a G) available from the UHD 620 graphics processor in a broad swath of basic systems on the market today, that kind of performance improvement on a platform with as much reach as Intel’s integrated graphics processors could bring enjoyable gameplay to a far broader audience than ever before.
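Those figures line up with the usual rule of thumb for Intel’s Gen execution units: each EU contains two 4-wide FPUs, each retiring one fused multiply-add per clock, for 16 single-precision FLOPs per clock per EU. A quick back-of-the-envelope check bears out both numbers (the clock speeds below are assumptions based on shipping parts, not Intel disclosures):

```python
# Back-of-the-envelope FLOPS estimates for Intel Gen IGPs.
# Assumption: each EU retires two 4-wide FMAs per clock, and an FMA
# counts as two FLOPs per lane, i.e. 16 FLOPs/clock/EU.
FLOPS_PER_CLOCK_PER_EU = 2 * 4 * 2

def igp_gflops(eu_count, clock_ghz):
    """Peak single-precision throughput in GFLOPS."""
    return eu_count * FLOPS_PER_CLOCK_PER_EU * clock_ghz

# UHD 620 (Gen9.5 GT2): 24 EUs at an assumed 1.15-GHz boost clock
print(igp_gflops(24, 1.15))   # ~441.6 GFLOPS, i.e. "roughly 440"
# Gen11 GT2: 64 EUs need only about 1.0 GHz to reach a teraflop
print(igp_gflops(64, 1.0))    # 1024 GFLOPS
```

In other words, the teraflop target falls out of the EU count alone; Gen11 GT2 doesn’t need heroic clocks to get there.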
To get there, engineer David Blythe says his team set out to cram as much performance as it possibly could into the power envelope available to it. A Gen11 IGP in its GT2 configuration has 64 execution units, up from 24 in Gen9, and squeezing that much shader power into an IGP footprint and maximizing its efficiency was a battle of inches, according to Blythe. The Gen11 team apparently had to go after every small improvement it could in the pursuit of its power, performance and area goals, and that meant touching not just one or two parts of the integrated graphics processor, but every part of it.
The net result of that work was a significant reduction in the area of the basic execution unit. Blythe claimed that implementing a Gen9 EU and a Gen11 EU on the same process would put the Gen11 EU at 75% of the area of its predecessor, partially explaining how it was able to pack so many more of those units into the undisclosed area allocated for GT2 configs of Gen11 on Ice Lake.
In pursuit of both power savings and higher performance, Gen11 supports a form of tile-based rendering in addition to its immediate-mode renderer. According to Blythe, certain pixel-limited workloads benefit greatly from the ability to keep their data local to the graphics processor, and by invoking the tile-based renderer, those applications can save 30% memory bandwidth and therefore power from the uncore of the processor. In turn, the Gen11 GPU can take the juice saved that way and turn it into higher frequency on the shader pipeline. The tile-based renderer can be dynamically invoked as needed during the course of shading pixels and left off when it’s not needed.
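The bandwidth argument is easy to see with a toy model: in immediate-mode rendering, every overdrawn pixel write (plus any blend read) travels across the memory fabric, while a tiler keeps that traffic in on-chip tile memory and writes each tile out once. The figures below are illustrative assumptions, not Intel measurements:

```python
# Toy model of framebuffer traffic: immediate mode vs. tile-based.
# All figures here are illustrative assumptions, not Intel numbers.
WIDTH, HEIGHT = 1920, 1080
BYTES_PER_PIXEL = 4
OVERDRAW = 2.5        # assumed average shaded writes per pixel

# Immediate mode: every write (and a matching blend read) hits the uncore.
immediate = WIDTH * HEIGHT * OVERDRAW * BYTES_PER_PIXEL * 2

# Tiled: overdraw resolves in on-chip tile memory; DRAM sees one final write.
tiled = WIDTH * HEIGHT * BYTES_PER_PIXEL

print(f"traffic saved: {1 - tiled / immediate:.0%}")  # 80% in this toy case
```

Real savings depend heavily on the workload, which is why Gen11 invokes the tiler selectively rather than using it for everything.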
To keep more data closer to those execution units, Gen11 has a much, much larger L3 cache than Gen9. Blythe says that the GT2 configuration of Gen11 has a 3-MB L3 cache, more than four times larger than the one in the GT2 implementation of Gen9 and even larger in absolute terms than the 2.3-MB L3 in even the highest-performance GT4 implementation of Gen9.
Other improvements in the memory subsystem of the Gen11 IGP include better lossless memory compression, a common focus of improvement for making the most of available memory bandwidth in graphics processors both large and small. Blythe says the Gen11 compression scheme is up to 10% more effective at its best, but real-world performance is more likely to fall around 4% on a geometric-mean measure.
The Gen11 team also separated the per-slice shared local memory in Gen11 from the L3 cache. That structure is now its own per-slice private allocation, and each of those blocks of memory has its own data path to allow the IGP to get better parallelism out of L3 cache accesses and inter-IGP memory accesses. Finally, the Graphics Technology Interface (GTI) that joins the integrated graphics processor with the rest of the CPU is now capable of performing reads and writes at 64 bytes per clock.
Classifying objects in a scene by distance for variable-rate shading
While Nvidia’s Turing architecture might boast the first practical implementation of the ability to vary shading rates in a scene on a fine-grained basis, Intel points out that it invented the idea of what it calls coarse pixel shading. The company claims to have published a paper on the concept as far back as 2014. Now, that technique will be available to programmers on Gen11 graphics processors.
While Intel and Nvidia’s implementations of variable-rate shading likely differ in granularity, the point of the technology remains the same on Gen11 as it is on Turing: to avoid performing shading work that doesn’t result in appreciable increases in detail for parts of the scene that might not need it. Intel has so far implemented two techniques using CPS: a global coarse-pixel-shading setting and a radial falloff function that resembles foveated rendering. The company notes that the algorithm is also available on a draw-call-by-draw-call basis.
The company’s demos of coarse pixel shading covered two potential ways the tech can be used. One was a synthetic, pixel-bound case where the software chose a shading rate based on distance from the camera, using level-of-detail characterizations on a per-object basis. In this demo, employing coarse pixel shading offered as much as a 2x boost in performance, but the company admitted that this was a best-case scenario.
Intel also showed an Unreal Engine demo with the radial falloff filter it had developed. In that case, the improvement from CPS was closer to 1.3x-1.4x that of the base case without CPS. Like Nvidia, Intel says its coarse pixel shading API is simple and easy to integrate, so we’ll be curious to see how much adoption this technology gets and how developers might choose to use it in the real world.
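As a rough illustration of how a radial falloff policy works, the sketch below maps screen positions to coarser shading rates the farther they sit from a focal point. The thresholds and rate choices here are hypothetical stand-ins, not Intel’s actual API:

```python
# Illustrative radial-falloff coarse-pixel-shading policy: regions far
# from a focal point get shaded at coarser rates. The radii and rates
# below are hypothetical, not values from Intel's implementation.
import math

def shading_rate(px, py, cx, cy, inner=300, outer=700):
    """Return (w, h): one shading sample covers a w x h block of pixels."""
    r = math.hypot(px - cx, py - cy)
    if r < inner:
        return (1, 1)   # full rate near the focal point
    if r < outer:
        return (2, 2)   # quarter rate in the middle ring
    return (4, 4)       # one-sixteenth rate in the periphery

print(shading_rate(960, 540, 960, 540))  # (1, 1) at screen center
print(shading_rate(0, 0, 960, 540))      # (4, 4) in the far corner
```

A foveated-rendering use case would simply move the focal point to track the viewer’s gaze while the policy stays the same.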
Intel’s VESA Adaptive-Sync demo system in operation
Gen11 is the first Intel graphics processor with support for the long-promised and long-awaited VESA Adaptive Sync standard. Variable-refresh-rate displays are a mature technology at this point, but it’s still welcome to see relatively modest graphics processors like GT2 driving compatible monitors in a tear-free fashion. Intel also claims that its Adaptive Sync-compatible IGPs will include desirable features like low framerate compensation from the get-go.
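Low framerate compensation handles the case where a game’s frame rate dips below the panel’s minimum refresh: the display engine repeats each frame an integer number of times so the effective refresh stays inside the panel’s variable range. A minimal sketch of the idea, with hypothetical panel limits:

```python
# Sketch of low-framerate compensation (LFC). Panel limits are
# hypothetical; real display engines implement this in hardware/driver.
def lfc_refresh(fps, panel_min=48, panel_max=144):
    """Return (refresh_hz, repeats) keeping refresh within the panel's range."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    repeats = 1
    while fps * repeats < panel_min:
        repeats += 1  # show each frame one more time
    return min(fps * repeats, panel_max), repeats

# 30 fps on a 48-144 Hz panel: show each frame twice, refresh at 60 Hz
print(lfc_refresh(30))   # (60, 2)
# 100 fps is already in range: native variable refresh
print(lfc_refresh(100))  # (100, 1)
```

Without LFC, frame rates below the panel minimum would fall back to tearing or stutter, which is why its inclusion from day one matters for modest GPUs like GT2.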
Overall, the GT2 implementation of Gen11, with its promise of usable gaming performance, its modern display features, and its likely-to-be-egalitarian positioning, could introduce a broad audience to features that only high-end graphics cards enjoy today.