Intel hasn’t taken too kindly to the revolution in mobile devices that has happened largely without its participation. The rise of smartphones and tablets with ARM-compatible chips onboard has become a major threat to Intel’s dominance in the processor business—and this is, after all, a company built on the mantra that “only the paranoid survive.”
Thus, for several technology generations, Intel has slowly adjusted its heading to better compete in mobile devices. The firm has used its expertise in chip manufacturing and design to cram PC-like performance into ever smaller footprints. Last year’s Haswell chip brought huge progress in terms of power consumption, battery life, and system sizes. This year, a new processor code-named Broadwell promises dramatic gains once again, thanks in part to the world-class nanoscale technology in Intel’s 14-nm chip fabrication process.
The first Broadwell-based processors will carry a new brand name, Core M, and they will target very small systems indeed: iPad-like tablets that are less than nine millimeters thick and have no fans to cool them. Fitting a PC-class processor into such a device is no easy task. Intel claims to have achieved this feat by tweaking nearly every part of the Broadwell silicon and surrounding platform in order to reduce its size and power consumption. More impressively, the company says it has kept performance steady at the same time.
Enforcing Moore’s Law: Intel’s 14-nm process
One key ingredient in Broadwell’s success is Intel’s 14-nm manufacturing process, the world’s first of its kind. Broadwell has been very publicly delayed due to some teething problems with this new process. In a briefing last week, however, Intel VP and Director of 14-nm Technology Development Sanjay Natarajan told us that the 14-nm process is now qualified and in volume production.
In fact, Natarajan shared quite a few specifics about the 14-nm process in order to underscore Intel’s success. His core message: the 14-nm process provides true scaling from the prior 22-nm node, with virtually all of the traditional benefits of Moore’s Law intact.
Moore’s Law has made the massive advances in microelectronics over the past 40 years possible. Its basic formulation says that the number of transistors one can pack into a given area of a chip will roughly double every couple of years. Intel has moved mountains to keep Moore’s Law on track, and it has reaped huge benefits for doing so. The rest of the semiconductor industry has followed the same path, but in recent years, it has done so from a fair distance behind Intel. For instance, this 14-nm process is the second generation to employ what Intel calls tri-gate transistors (which the rest of the industry calls FinFETs). Other firms have yet to ship first-generation FinFET silicon.
Shrinking on-chip features to ever-smaller dimensions is an incredibly difficult problem, and the complexity of the task has grown with each successive generation. When questioned during a press briefing we attended, Natarajan was quick to admit that the familiar naming convention we use to denote manufacturing processes is mostly just branding. The size of various on-chip elements diverged from the process name years ago, perhaps around the 90-nm node. That said, Intel Fellow and process development guru Mark Bohr quickly pointed out that transistor densities have continued to scale as expected from one generation to the next. In other words, Moore’s Law is alive and well.
To illustrate, Natarajan showed how the fins comprising Intel’s tri-gate transistors have grown closer together at the 14-nm node—fin pitch has been reduced from 60 to 42 nm—while the fins themselves have grown taller and thinner. The closer placement improves density, while the new fin structure allows for increased drive current and thus better performance. This higher performance, in turn, allows Intel to use fewer fins for some on-chip structures, further increasing the effective density of the process. Fewer fins also means lower capacitance and more power-efficient operation.
The gate pitch has been reduced from 90 to 70 nm and, as shown above, the spacing of the smallest interconnects has dropped even more dramatically, from 80 to 52 nm.
The cumulative result of these changes is perhaps best demonstrated by looking at a fairly common benchmark: the size of a six-transistor SRAM cell. On Intel’s 22-nm process, a 6T cell occupies 0.108 square micrometers of space. The same structure at 14-nm takes up only 0.0588 square micrometers—or 54% of the area required at 22-nm. That’s classic Moore’s Law-style area scaling.
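The arithmetic behind that claim is easy to check. Here's a quick back-of-the-envelope calculation using only the pitch and cell-size figures Intel quoted; the comparison of the two scaling factors is our own sanity check, not Intel's math:

```python
# Back-of-the-envelope check of Intel's quoted 22-nm -> 14-nm scaling figures.

# Six-transistor SRAM cell areas, in square micrometers (Intel's numbers).
sram_22nm = 0.108
sram_14nm = 0.0588

cell_scaling = sram_14nm / sram_22nm
print(f"SRAM cell area scaling: {cell_scaling:.2f}x")   # ~0.54x

# Linear shrink of the quoted pitches.
gate_pitch_scaling = 70 / 90        # gate pitch: 90 nm -> 70 nm
interconnect_scaling = 52 / 80      # minimum interconnect pitch: 80 nm -> 52 nm

# Area scales roughly as the product of two linear pitch shrinks.
pitch_area_scaling = gate_pitch_scaling * interconnect_scaling
print(f"Pitch-implied area scaling: {pitch_area_scaling:.2f}x")  # ~0.51x
```

The two numbers land within a few percent of each other, which is why the SRAM cell makes such a handy proxy for overall density scaling.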
The benefits of the 14-nm process extend beyond sheer density. Natarajan shared the graph above to convey the power and performance advances offered by this 14-nm process. Essentially, it can flip bits at higher speeds than prior generations while losing less power in the form of leakage along the way. Intel can choose to tune its products for different points along the leakage-performance curve shown above, but in each case, chips built on the 14-nm process should offer a nicer set of tradeoffs than those from prior process generations.
This next illustration is perhaps the most telling, because it addresses one of the key threats to Moore’s Law going forward: economics. I said before that the transition to each smaller process node has been more difficult than the last. Chipmakers have had to use ever more exotic techniques like double-patterning—creating two separate masks for photolithography and exposing them at a slight offset—in order to achieve higher densities. Doing so increases costs, and as a result, one of the key corollaries of Moore’s Law has been threatened. If moving to finer process nodes can’t reduce the cost per transistor, the march of ever-more-complex microelectronics could slow down considerably. Some chipmakers have hinted that we’ll be approaching that point very soon.
By contrast, Intel says the math continues to work well for its process tech. The area per transistor is dropping steadily over time, while the cost for each square millimeter of silicon is rising at a slower pace. The net result remains a steady decrease in cost per transistor through the 14-nm node. In fact, Bohr told us that he expects Intel to deliver an even lower cost per transistor in its upcoming 10-nm process.
Despite the delays, then, Intel is bullish about its process tech advancements and confident that its 14-nm technology is ready to roll. Natarajan says the company is now shipping 14-nm production chips to its customers, and the first Core M-based products should arrive on store shelves in time for this year’s holiday season. Two fabs, one in Oregon and the other in Arizona, are slated to be producing 14-nm wafers this year, with another plant in Ireland scheduled to ramp up production in 2015. Natarajan expects sufficient 14-nm silicon yields and wafer volumes to support “multiple 14-nm product ramps in the first half of 2015.”
The Broadwell SoC and module
In some ways, the Broadwell-Y chip follows the same basic outlines as the Haswell-Y processor before it. Both chips have dual CPU cores, 3MB of L3 cache, and integrated graphics.
A shot of the Broadwell-Y die. Source: Intel.
Still, according to Stephan Jourdan, Intel Fellow and Director of SoC Architecture in the Platform Engineering Group, fitting a chip like this one into a fanless tablet form factor less than nine millimeters thick is a daunting challenge. The power that an SoC can consume in a device is determined by lots of factors, including display size, chassis thickness, the materials used, and even ambient temperatures. Jourdan says the system type Intel targeted, with a 10.1″ display, requires an SoC that operates at three to five watts of sustained power. (That’s not TDP, or peak power, but likely maps to Intel’s newer SDP metric for mobile processors.) Given that the prior-gen Haswell Y-series processors operate at a 6W SDP, Broadwell would need to cut sustained operating power in half to meet this goal. Broadwell’s physical size would have to shrink, too, in order to fit into the target devices.
The Broadwell team attacked these problems on all fronts. Thanks to the 14-nm process, the SoC shrank substantially from one generation to the next. The Haswell Y-series SoC measures 130 square millimeters, while Broadwell-Y occupies only 82 mm². That’s not exactly half the area, but Intel’s architects have added a number of features to Broadwell in order to improve its power efficiency and performance. The net result of everyone’s efforts, claims Jourdan, is a chip that delivers more than twice the performance per watt of Haswell-Y before it.
Some of the advancement comes courtesy of an advantage unique to Intel, one the company is quick to emphasize these days. Intel’s process tech engineers and chip designers have the ability to work together, within the same company, to “co-optimize” their products and fabrication processes. Jourdan credits a specially tuned flavor of the 14-nm process for a further 10% reduction in capacitance in Broadwell-Y silicon, a 10% lower minimum operating voltage, and a 10-15% switching speed improvement at low voltages. All told, the combination of general 14-nm improvements and process-specific tuning account for roughly two-thirds of the power efficiency gains from Haswell-Y to Broadwell-Y.
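Those tuning gains compound. As a rough sketch (this is textbook scaling, not Intel's internal accounting), dynamic power goes as capacitance times the square of voltage, so the two 10% reductions alone imply roughly a quarter less dynamic power near the minimum voltage:

```python
# Rough illustration of how the quoted 14-nm process tuning gains compound.
# Dynamic power scales roughly as C * V^2 * f; a sketch, not Intel's model.

cap_scaling = 0.90     # 10% lower capacitance (Intel's quoted figure)
vmin_scaling = 0.90    # 10% lower minimum operating voltage (quoted figure)

# Relative dynamic power at the same frequency, operating near Vmin:
power_scaling = cap_scaling * vmin_scaling ** 2
print(f"Relative dynamic power: {power_scaling:.2f}x")   # ~0.73x
```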
As you may know, Haswell-Y isn’t entirely a “true” system on a chip. Many of the legacy I/O functions are hosted on a separate piece of silicon, known as the Platform Controller Hub or PCH, that mounts on a common module with the CPU. Broadwell-Y follows the same template, but the module had to shrink dramatically to fit into tablet-sized devices. The Broadwell module is 50% smaller in area than the Haswell version, as pictured on the right, and it’s 30% shorter in the Z dimension, as well.
The underside of a Broadwell-Y motherboard. Source: Intel.
Yes, this is a dual-core x86 processor with two “big cores,” integrated graphics, and a companion chipset. Kind of hard to believe, isn’t it?
The 3DL PCB’s placement illustrated. Source: Intel.
Reducing the SoC module’s thickness required some ingenuity. The fully-integrated voltage regulator (FIVR) in Haswell and Broadwell allows for fast, fine-grained power state transitions on the chip, but it also requires the presence of external inductors on the SoC package that add height. To overcome this obstacle, the Broadwell team developed a workaround it calls the 3DL module. The inductors are placed on a small external PCB that hangs beneath the SoC module. To make room for the 3DL PCB, each motherboard has a hole cut into it, directly beneath the Broadwell module. This arrangement effectively “hides” the additional Z-height of the inductors and allows Broadwell-Y’s total height to be almost 50% lower than the Haswell equivalent.
Interestingly, Jourdan shared the details of another FIVR workaround the Broadwell team had to implement. Because FIVR isn’t very efficient at low voltages, they added a mode called LVR where FIVR essentially gets bypassed under the right conditions. The need for 3DL and LVR makes one wonder whether the level of VR integration in Broadwell makes sense for future generations of Intel SoCs.
Managing power and extending dynamic range
Intel’s chip- and system-level dynamic power management capabilities are incredibly sophisticated these days. One of the key mechanisms, Turbo Boost, has added a new wrinkle so that Broadwell can fit into a new class of devices.
The smaller batteries in sub-nine-mm tablets can potentially be stressed into failure by short bursts of high power consumption from the CPU, so the Broadwell team had to design a mechanism to avoid such problems. The result is a new, more granular limit in this chip’s dynamic voltage and frequency scaling algorithm known as PL3. The other limits will be familiar from past chips. PL1 is the long-term CPU power limit that the system can withstand without overheating. This limit is measured across minutes of operation. PL2 is the short-term burst limit used for temporary excursions to higher clocks—say, a quick trip to a faster clock frequency to improve responsiveness while loading a program. PL2 is measured in seconds. The new PL3 limit is monitored in milliseconds, to prevent instantaneous power use from damaging the device’s battery.
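To make the three windows concrete, here's a hypothetical sketch of multi-window power limiting in the spirit of PL1/PL2/PL3. The window lengths, wattage limits, sampling rate, and the `PowerLimiter` interface are all illustrative inventions, not Intel's firmware:

```python
# Hypothetical sketch of multi-window power limiting a la PL1/PL2/PL3.
# All limits, window lengths, and the interface are made up for illustration.
from collections import deque

class PowerLimiter:
    """Tracks average power over three time windows, sampled at 1 kHz."""
    # (window length in samples, limit in watts) -- illustrative values.
    WINDOWS = {
        "PL3": (10, 8.0),       # ~10 ms: protect the battery from spikes
        "PL2": (1_000, 6.0),    # ~1 s: short turbo excursions
        "PL1": (60_000, 4.5),   # ~1 min: sustained thermal limit
    }

    def __init__(self):
        self.samples = deque(maxlen=60_000)

    def record(self, watts):
        self.samples.append(watts)

    def violations(self):
        """Names of limits whose windowed average power is currently exceeded."""
        history = list(self.samples)
        return [name for name, (length, limit) in self.WINDOWS.items()
                if history[-length:]
                and sum(history[-length:]) / len(history[-length:]) > limit]

limiter = PowerLimiter()
for _ in range(1_000):
    limiter.record(3.0)          # a second of modest load
for _ in range(10):
    limiter.record(9.0)          # a 10 ms spike
print(limiter.violations())      # only the fast PL3 window trips
```

The point of the sketch: a spike too brief to register on the seconds-long PL2 average can still exceed the millisecond-scale PL3 limit, which is exactly the kind of event that stresses a small tablet battery.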
The additional intelligence in Broadwell’s Turbo Boost control complements the rest of Intel’s power management mojo, which allows power sharing across the SoC die and manages the thermal behavior of the entire system.
Even with all the goodness of the 14-nm process and Broadwell’s dynamic power management, driving SoC power from 6W to 3W while maintaining performance was probably out of reach without some additional help. Intel was bumping up against some basic limits in the physics of chip operation.
For one, the firm could only reduce Broadwell’s operating voltage so much before the transistors would cease to work properly. Any home overclocker knows how crucial voltage is to ensuring stable CPU operation. This lower limit on voltage is a significant barrier to driving down power consumption in a chip like Broadwell with over a billion transistors.
You see, a chip’s power draw is determined by a fairly simple equation that involves the clock frequency, the number of bits actively flipping, and the square of voltage—and that squared term means voltage tends to dominate the conversation. The Broadwell team could push its chip’s clock speeds lower, but doing so would only result in linear reductions in power draw. Any time a portion of the chip is operating at low clock speeds and at the chip’s minimum voltage level, it’s just not being terribly efficient.
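To see why that squared term dominates, here's a toy comparison built on the classic dynamic-power relation (activity × capacitance × V² × f). The 20% figures are arbitrary, chosen only to contrast linear and quadratic savings:

```python
# Toy illustration of why voltage cuts beat frequency cuts for dynamic power.
# Classic relation: P_dynamic ~ activity * C * V^2 * f. Numbers are arbitrary.

def relative_power(v_scale, f_scale):
    """Dynamic power relative to baseline after scaling voltage and frequency."""
    return v_scale ** 2 * f_scale

# Cutting frequency 20% with voltage already pinned at its minimum:
print(f"{relative_power(v_scale=1.0, f_scale=0.8):.2f}x")  # only a linear saving

# Cutting voltage 20% (if the transistors could tolerate it), same frequency:
print(f"{relative_power(v_scale=0.8, f_scale=1.0):.2f}x")  # a quadratic saving
```

Once voltage hits its floor, only the linear lever is left, which is precisely the inefficient regime the Broadwell team needed a way around.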
The Broadwell team’s solution to this dilemma was to adopt a method known as duty cycling. With duty cycling, some portions of the chip are turned off entirely during certain clock cycles. Intel has used duty cycle throttling (DCT) for years to rein in its CPUs and prevent failures in the event of overheating.
Broadwell introduces a new mechanism called duty cycling control (DCC) that has a different aim. Broadwell’s integrated graphics component takes up roughly a third of its die area, perhaps a little more, and DCC targets those graphics units. Working together, the SoC hardware and Intel’s graphics driver can shut the IGP’s execution units off entirely during some clock cycles, eliminating even leakage power. DCC kicks in when those execution units would otherwise be operating under inefficient conditions: at a low clock frequency where further voltage reductions aren’t practical.
With a light graphics workload that only requires half of the IGP’s horsepower, DCC might ensure that the IGP spends half its cycles turned off and the other half doing its work. Jourdan tells us Broadwell’s integrated GPU has very low latency for switching on and off, which makes this mechanism practical. In fact, Broadwell’s IGP has a range of DCC operating points ranging as low as 12.5% of the regular clock speed. At that lowest level, the graphics EUs are active for only one out of every eight clock cycles. They’re powered down for the rest, even though the IGP may be drawing an animation on screen.
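A hedged sketch of the payoff (the wattage figures and functions here are invented for intuition; only the duty levels come from Intel's description):

```python
# Illustrative model of duty cycling control (DCC) for a graphics block.
# The leakage and dynamic power figures are hypothetical, for intuition only.

LEAK_W = 0.5   # leakage power while powered on, watts (made-up figure)
DYN_W = 1.5    # dynamic power while actively working, watts (made-up figure)

def dcc_power(duty):
    """Average power when the block is fully off for (1 - duty) of cycles."""
    # While off, dynamic AND leakage power both drop to ~zero --
    # that's the advantage over merely idling at low clocks.
    return duty * (LEAK_W + DYN_W)

def idle_clocked_power(duty):
    """Naive alternative: powered every cycle, doing work `duty` of the time."""
    return LEAK_W + duty * DYN_W   # leakage persists on every cycle

for duty in (1.0, 0.5, 0.125):     # 100%, 50%, and the 1-in-8 floor
    print(f"duty {duty:>5}: DCC {dcc_power(duty):.2f} W "
          f"vs idle-clocked {idle_clocked_power(duty):.2f} W")
```

At the 12.5% floor, the gap between the two columns is pure leakage savings, which is the whole point of switching the execution units off rather than just clocking them down.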
So that’s another way the Broadwell team managed to shoehorn this chip into a much smaller power envelope. One can imagine that this technique could see extensive use in the future, as graphics hardware takes up an ever larger portion of the die area. What’s more, since the SoC can share power across its die, some of the power reductions realized on the graphics side of the house with DCC can be used to enable Broadwell’s CPU cores to run at higher frequencies, as well. So DCC offers an effective increase in dynamic operating range on both ends of the spectrum.
Oh, right: architecture!
Intel’s tick-tock cadence typically confines major CPU architecture changes to the second chip produced on a new process technology, but that rigid segmentation seems to be blurring a bit over time as Intel pursues its goal of credibility—er, dominance?—in the mobile space.
Broadwell’s CPU cores have received a number of tweaks over Haswell’s, with the net effect of increasing instruction throughput per clock by about five percent, generally speaking. In keeping with Broadwell’s mobile focus, Intel’s architects set a high standard for any added features in this revision of the architecture: a new feature must contribute 2% more performance for every 1% of added power use. In the past, any gain better than 1:1 might have been acceptable, but not so this time.
That said, the list of performance-enhancing changes to Broadwell’s core still has quite a few familiar-sounding items. The expanded transistor budget at 14-nm has allowed for larger structures in many cases: a bigger out-of-order scheduler, a 50% larger TLB for the L2 cache, and a new, dedicated L2 TLB for 1GB pages. Also, a second unit can now handle TLB page misses in parallel with the primary one. With all of the TLB enhancements, it should be no surprise that virtualization round trips are supposedly quicker.
Of course, the ubiquitous “improved branch prediction” line-item is present, but Intel hasn’t disclosed any details of how it’s achieved more accurate predictions.
Broadwell has a few beefier execution units, too. The floating-point multiplier’s latency has dropped from five cycles to three. There’s a new Radix-1024 divider, and vector gather operations are now faster. Certain cryptography-specific instructions execute more quickly, as well.
The changes to Broadwell’s graphics and media architecture are arguably even more sweeping. Here’s a quick but still daunting overview of the new arrangement.
A logical block diagram of Broadwell’s integrated graphics. Source: Intel.
The most notable change in Broadwell-Y’s IGP is an increase in the number of modular “slices” of graphics resources included—three here, versus two in Haswell GT2. Each slice has its own L1 cache, texture cache, and texture sampling/filtering hardware, so Broadwell is up 50% on those fronts versus the prior generation.
Meanwhile, the number of graphics execution units per slice has dropped a bit, from 10 to eight. Broadwell therefore has a total of 24 graphics EUs and 192 stream processors. By contrast, Haswell has 20 EUs and 160 SPs. The overall trajectory in terms of graphics units is northward, but Broadwell tilts the balance toward more texturing and sampling hardware.
The graphics microarchitecture in Broadwell has changed, too, with tweaks to improve geometry throughput and Z- and pixel-fill rates. This hardware officially supports the latest APIs, including DirectX 11.2, OpenGL 4.3, and, at last, OpenCL 2.0 with shared virtual memory for GPU computing.
Without IGP clock speeds, which we don’t yet know, we can’t really make any assessments about how Broadwell compares to Haswell or to competitors like AMD’s Kaveri. As with Haswell, we’d expect to see a beefier GT3 version of Broadwell graphics eventually, likely on a quad-core die and occasionally paired with an external eDRAM chip for much higher throughput.
The addition of more samplers and stream processors directly benefits the IGP’s media processing capabilities. Intel claims Broadwell’s video engine can achieve up to double the throughput of its predecessor, and it says the QuickSync video transcoding engine in the chip has improved in terms of performance and output quality.
Since Broadwell’s display block can drive 4K displays, the chip’s ability to handle 4K-class video processing is a live issue. Rather than decode the new 4K-oriented H.265 standard entirely in hardware, Broadwell will take what Intel calls a hybrid approach, using some fixed-function hardware in conjunction with the graphics EUs to process H.265 video. The firm claims H.265 decoding on Broadwell-Y is “fast enough for 4K” with no caveats, and it says H.265 encoding is sufficient for 4K resolutions at 30 Hz. That’s not too bad, all things considered, although I wouldn’t expect H.265 processing to be terribly power efficient given the involvement of the graphics EUs.
A new chipset: Broadwell PCH-LP
Although it looks to be about the same size as the prior version and is manufactured on an older 32-nm process, Broadwell’s platform controller hub is new silicon, too. The PCH-LP will accompany Broadwell-Y in low-power, fanless systems.
The most dramatic changes here versus last year’s model have to do with power efficiency. Intel’s designers have added more power gating around the PCH chip, resulting in a 25% reduction in idle power draw. Active power use is down by about 20% versus the Haswell PCH-LP, as well, and the firm has built a collection of firmware and software updates that enable the PCH to do fine-grained monitoring of power use.
Feature-wise, the PCH has gotten an upgrade in the audio DSP department, with more SRAM and MIPS than before. As with everything else, Intel expects the improved audio hardware to conserve power at the end of the day. The other feature of note is the welcome addition of support for PCIe-based storage.
Is this thing for real?
Sure looks like it. Although they strangely asked us not to take any pictures during the press briefing, Intel passed around a nifty Broadwell reference design tablet code-named “Llama Mountain.” The screen was 12.5″ in size, and the chassis was 7.2 mm thick. The system was running Windows 8.1, and idling at the desktop, its skin felt relatively cool to the touch.
The Llama Mountain reference tablet. Source: Intel.
Intel appears to have crammed a fairly potent x86 PC into a system not much larger than an iPad Air.
We don’t yet know the full specs of the first Core M processors, but Intel has clearly set the expectation that Broadwell-Y will match the performance of Haswell-Y in half the power envelope. The Haswell-based Core i5-4200Y has a 1.4GHz base clock and a 1.9GHz Turbo peak. I fully expect to see a Core M processor with the same clock speeds in a sub-nine-mm tablet.
Another shot of the Broadwell die. Source: Intel.
The zillion-dollar question is whether having truly astounding performance in a tablet-style power envelope is enough to move the market in Intel’s direction. Will Windows-based tablets and two-in-ones become so attractive with the Core M onboard that consumers will overlook the clumsiness of Windows 8.1’s dual-mode usage model and dearth of touch-oriented applications?
Yeah, that’s a tough one.
A related and more interesting question is what Intel’s tolerance is for exploring lower price points. Broadwell-Y is darn near half the size of Haswell-Y, and assuming the 14-nm process matures as expected, it ought to be incredibly cheap to manufacture. One of those 10″ Windows tablets becomes a much more attractive alternative to an iPad when its price is comparable—or possibly even lower.
Intel also has the intriguing option of pursuing new territory now that Android on x86 is a reality. One could imagine an 11″ convertible tablet a la the Asus Transformer lineup sporting a Core M processor and bringing a whole new class of performance to the Android market.