“Carrizo” is the code name of AMD’s next-generation CPU for notebooks and convertible PCs. This chip has been on AMD’s roadmap for some time now as the successor to the Kaveri chip that powers the firm’s current lineup of A-series APU products. We even got an early look at the first working Carrizo silicon at CES in January.
Still, many of the details about Carrizo and its next-generation “Excavator” CPU cores have been shrouded in mystery to date. Fortunately, that’s changing. In conjunction with the International Solid-State Circuits Conference, AMD has begun to tell the story of how it has achieved major improvements in power efficiency and performance with Carrizo, even though the chip is built on a 28-nm fabrication process like Kaveri before it. AMD Corporate Fellow Sam Naffziger and Senior Director of Client Products Kevin Lensing briefed us ahead of ISSCC and shared some fascinating information about Carrizo’s new technology.
For the uninitiated, Carrizo is AMD’s answer to Intel’s Broadwell chips, and it’s expected to arrive in consumer systems around the middle of this year. Like AMD’s other “Accelerated Processing Units,” or APUs, Carrizo combines CPU cores and graphics on the same piece of silicon. In fact, Carrizo is almost a complete PC system on a single chip, and nearly every major component onboard has been updated compared to the prior generation.
We don’t yet have all of the details, but Carrizo combines an evolved version of the Bulldozer CPU core known as Excavator, “next generation” Radeon graphics based on the GCN architecture, and an updated UVD accelerator block capable of handling H.265 video. Carrizo is also the first “big” AMD APU to integrate the traditional south bridge I/O functions (like USB and SATA), making it a true system on a chip.
Thanks to this change, Carrizo is able to share the same pinout and motherboard infrastructure as AMD’s low-cost, low-power product, known as Carrizo-L. (Carrizo-L is similar to Beema and Mullins and will likely compete with Intel’s Bay Trail and Cherry Trail products.) Lensing told us AMD hopes the shared infrastructure between Carrizo and Carrizo-L will allow the company to capture more of the available market. PC makers should be able to offer systems across a broad range of price and performance levels based on the same basic chassis and motherboard.
AMD claims Carrizo is “the first processor in the world with HSA 1.0 support,” referring to its Heterogeneous System Architecture effort to enable converged CPU-and-GPU computing. That claim is a bit confusing since AMD said something similar about Kaveri, which it touted as the first architecturally complete HSA development platform. In this case, the mention of the HSA 1.0 spec is important. That spec has long been a work in progress and is only just being finalized. Perhaps it’s no surprise that only Carrizo meets its full demands. More concretely, Carrizo adds at least one relevant HSA feature that Kaveri lacks: GPU context switching for multiple processes. When it arrives, Carrizo will surely become the reference platform of choice for HSA development. (Whether or not HSA will gain any great traction with software developers, of course, is another question.)
Carrizo’s real magic isn’t listed in its spec sheet, though. Instead, it has to do with how AMD tackled a daunting engineering challenge: delivering meaningful improvements in chip density, performance, and power efficiency over Kaveri without the benefit of a die shrink. After all, this chip has to compete with Intel’s Broadwell, which is fabricated on a much more advanced 14-nm process with second-gen tri-gate transistors. Carrizo is built on a 28-nm process using traditional planar transistors.
Yet AMD claims it has managed to squeeze out some substantial improvements over Kaveri. Overall, Carrizo weighs in at roughly 3.1 billion transistors, or 29% more than Kaveri, with “approximately the same” die area. Power use and performance, two sides of the same coin, are also apparently much improved in this new chip. The firm has achieved these gains using careful tuning for laptop-class power envelopes bolstered by various innovative techniques—and that’s what AMD is sharing this week at ISSCC.
Excavator: heavy equipment gets streamlined
The Excavator CPU cores in Carrizo are the fourth generation of cores based on the initial Bulldozer microarchitecture. Each generation has improved per-clock instruction throughput and power efficiency over the last one, and Excavator is no exception. AMD estimates a 5% overall gain in per-clock instruction throughput over the prior-gen Steamroller core thanks to various changes.
We don’t know what all of those tweaks are yet, but Naffziger did mention one change in particular: the L1 data cache has doubled in size while maintaining the same access latency. He also alluded to support for new instructions in Excavator, but without offering any further details. Excavator is rumored to add support for AVX2, which would boost performance in code paths compiled to take advantage of the new, wider vector instructions, but it wouldn’t contribute to a general performance increase. At any rate, we don’t expect dramatic changes on the CPU architecture front from this generation of AMD tech. Those are likely reserved for the upcoming Zen microarchitecture, an all-new, x86-compatible core expected to supersede the Bulldozer family next year.
The most notable changes in Carrizo come not in architecture, but design. Naffziger said the Excavator team “stole some plays from the GPU playbook” by adopting a high-density design library traditionally used for GPUs. This library packs quite a bit more logic into a given amount of chip area. The examples below show some important parts of the Excavator core when laid out using a high-performance library a la Steamroller and a high-density library a la Carrizo.
The overall improvement in density is even more dramatic than one might expect from casual inspection of the examples above. The dual-core CPU modules on Carrizo occupy 23% less area than Kaveri’s, even with the added features and the doubling of the L1 data cache’s capacity—all on a 28-nm process.
That said, the two approaches produce very different-looking chips. The images below illustrate the layers used in a CPU-focused metal stack versus those used in a GPU-focused stack.
The Excavator team made a trade-off here, choosing the higher logic density of a GPU-style design over the clock frequency headroom afforded by a CPU-style design.
As the plot above attests, that trade-off makes sense for Carrizo because the chip is targeted to laptop-class power envelopes of about 15W. Not only does the high-density library reduce the chip area required for each CPU core (thus saving on costs), but it also yields some nice reductions in power use during low-wattage operation.
Notice that the crossover point where Excavator no longer beats Steamroller is at about 20W per dual-core module. That fact may help explain why AMD hasn’t articulated plans to produce a socketed version of Carrizo for desktop systems. The chip’s tuning probably doesn’t translate well into desktop-class power envelopes of 65W or higher. Carrizo’s benefits over Kaveri may be questionable in such scenarios.
GCN goes low-power
AMD has done some power optimization on the GPU side of things, as well. Again, the changes were all intended to help tailor Carrizo for its intended power envelope. The team tuned Carrizo’s GPU cores for low-power operation by reducing their reliance on high-performance devices that bleed more power in the form of leakage. By selecting cooler, lower-power options from the suite of devices available in the 28-nm process, they were able to realize substantial efficiency gains: either a 20% power savings at the same clock speed or a 10% higher operating frequency in the same power budget.
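A toy power model helps illustrate the trade-off AMD describes. Total power is roughly dynamic switching power plus leakage, and swapping leaky high-performance transistors for lower-leakage devices shrinks the second term. Every number below is invented for illustration; AMD hasn’t published the underlying figures.

```python
# Toy power model for a GPU compute unit. All constants are made up to
# illustrate the shape of the trade-off, not to match Carrizo's silicon.

def total_power(freq_ghz, v, leakage_w):
    """Dynamic power scales with f*V^2; leakage is roughly fixed at a given V."""
    C_EFF = 1.0  # effective switched capacitance (arbitrary units)
    return C_EFF * freq_ghz * v**2 + leakage_w

# Baseline: leaky high-performance devices at some clock and voltage
base = total_power(0.72, 1.0, leakage_w=0.35)

# Same clock and voltage, but built from lower-leakage devices
tuned = total_power(0.72, 1.0, leakage_w=0.136)

savings = 1 - tuned / base
print(f"power saved at the same clock: {savings:.0%}")
# -> power saved at the same clock: 20%
```

Alternatively, the designer can spend the reclaimed leakage budget on a higher clock within the same power envelope, which is how the 10%-higher-frequency option in AMD’s claim arises.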
These optimizations allowed AMD to enable all eight of the GCN compute units simultaneously in the low-power version of Carrizo. Previously, in the low-power Kaveri, they had to limit the chip to six CUs at once in order to avoid exceeding the APU’s power budget.
Again, optimizing an integrated GPU in this fashion involves a trade-off. The peak operating frequencies of the graphics cores in Carrizo are likely lower than Kaveri’s, which would translate into lower peak performance at higher power levels.
Chips need a certain minimum amount of voltage in order to operate properly without crashing. Unfortunately, the voltage supplied to a chip in a typical system isn’t always perfectly steady. To avoid problems, chipmakers typically supply a little extra voltage to their chips. Naffziger told us AMD has generally overvolted by about 10% in order to compensate for potential voltage droop. That may not sound like much, but a chip’s power draw is determined in large part by the square of the voltage, so keeping voltage low is a critical goal.
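The arithmetic behind that “critical goal” is straightforward. Dynamic power scales roughly with C·V²·f, so a seemingly modest guard band is surprisingly expensive:

```python
# Dynamic CPU power scales roughly with C * V^2 * f, so a 10% voltage
# guard band costs far more than 10% in power at a fixed clock.
nominal_v = 1.00
guarded_v = 1.10  # +10% overvolt to ride out potential droop

power_ratio = (guarded_v / nominal_v) ** 2
overhead = power_ratio - 1
print(f"extra dynamic power from a 10% guard band: {overhead:.0%}")
# -> extra dynamic power from a 10% guard band: 21%
```

In other words, that 10% safety margin inflates dynamic power by roughly a fifth, which is exactly the waste the techniques below aim to claw back.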
AMD has an innovative solution to this problem: voltage-adaptive operation. The firm’s CPU cores have the ability to track the supplied voltage in real time, at “sub-nanosecond” speeds, in order to detect when voltage droops. The chip can then reduce its operating speed briefly in response to the voltage reduction, preventing a crash.
Naffziger notes that voltage droop happens “less than one percent of the time,” so voltage-adaptive operation should have no noticeable impact on performance. Instead, he says, “we just get a bunch of that power waste back” by not needing to overvolt the silicon to ensure stability. The firm can choose to turn the power savings into higher clock speeds, too, if it wishes.
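The control idea can be sketched in a few lines of code. The real mechanism is an on-die circuit reacting in under a nanosecond, not software, and the threshold and stretch formula below are invented; this is only meant to show the shape of the response.

```python
# A software sketch of voltage-adaptive clocking: sample the supply,
# and stretch the clock period whenever the voltage droops below a
# threshold, so logic still meets timing at the lower voltage.
# All values are invented for illustration.

NOMINAL_PERIOD_NS = 0.5   # 2 GHz nominal clock
DROOP_THRESHOLD_V = 0.95  # below this, stretch the clock

def clock_period(supply_v):
    """Slow the clock in proportion to how far the supply has drooped."""
    if supply_v >= DROOP_THRESHOLD_V:
        return NOMINAL_PERIOD_NS
    return NOMINAL_PERIOD_NS * (DROOP_THRESHOLD_V / supply_v)

# A droop event: the supply dips briefly, then recovers
samples = [1.00, 1.00, 0.90, 0.92, 1.00]
periods = [clock_period(v) for v in samples]
```

Because droop events are rare and brief, the clock runs at its nominal period almost all of the time, which is why the performance cost is negligible.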
Voltage-adaptive operation was built into Kaveri’s CPU cores, but in Carrizo, it’s been incorporated into the graphics CUs, as well. AMD estimates this technique reduces power consumption by up to 19% for its CPU cores and by as much as 10% for its integrated graphics.
Adaptive voltage and frequency scaling
AMD has also sought to reduce voltage in Carrizo by giving the chip the ability to tune itself. More precisely, this optimization, known as Adaptive Voltage and Frequency Scaling (AVFS), currently applies only to the Excavator CPU cores.
Each Excavator core includes a scattered collection of AVFS “modules” that include replica versions of critical logic pathways in the CPU. The AVFS modules can test these pathways for stability at different voltage levels in real time as the chip operates. AVFS thus allows the CPU core to know its lowest safe operating voltage for the present conditions, including the clock speed and temperature.
AMD estimates that the voltage reductions made possible by AVFS can cut power consumption between 5% and 15% at a given clock speed. As with many of the other optimizations in Carrizo, the biggest gains from AVFS come at lower power levels.
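Conceptually, AVFS is a closed-loop search: the replica paths report how much timing margin remains at the current voltage, and the controller walks the supply down until the margin reaches a safety floor. The sketch below models that loop; the margin function and every constant are invented for illustration and bear no relation to AMD’s actual calibration data.

```python
# A toy model of AVFS-style voltage tuning. Replica critical paths
# report timing margin at the present voltage, clock, and temperature,
# and the controller lowers the supply until margin hits a floor.
# The margin model and all constants here are invented.

def replica_margin(v, freq_ghz, temp_c):
    """Pretend timing margin: improves with voltage, worsens with clock and heat."""
    return v * 2.0 - freq_ghz * 0.5 - temp_c * 0.004

def find_min_voltage(freq_ghz, temp_c, v_start=1.2, floor=0.05, step=0.01):
    v = v_start
    # Keep stepping down as long as the next step still leaves enough margin
    while replica_margin(v - step, freq_ghz, temp_c) >= floor:
        v -= step
    return round(v, 3)

# A cool, slow operating point tolerates a lower supply than a hot, fast one
low = find_min_voltage(freq_ghz=1.6, temp_c=45)
high = find_min_voltage(freq_ghz=3.2, temp_c=85)
```

The payoff is that the chip no longer has to carry worst-case voltage margin at every operating point; each core runs at the lowest supply its own silicon can tolerate right now.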
For several generations, AMD’s mobile platforms have been missing support for a Windows 8 feature known as Connected Standby mode. This is the mode that allows Windows-based systems to mimic the behavior of phones and tablets by dropping into a deep sleep state that uses very little power. The system can then wake up briefly to check for any incoming messages and notify the user, or it can simply awaken very quickly when the user is ready to resume working.
AMD didn’t talk about Connected Standby in its Carrizo presentation, but it did confirm that the chip will support the primary power state needed for this feature, a state known as S0i3. In this state, the power management subsystem gates off power to nearly all of the APU’s silicon, reducing chip-wide power consumption to less than 50 milliwatts. The time required to enter and exit S0i3 mode is much lower than the time needed to transition to the traditional S3 standby mode. AMD says the APU can drop into S0i3 in less than a second.
This addition doesn’t yet give AMD’s big APU all of the aggressive “active idle” capabilities that Intel built into its mobile Haswell offerings, in part because Carrizo lacks the fast state transitions made possible by Haswell’s integrated voltage regulation. Intel has said that Haswell can wake from being completely powered down in under three milliseconds. Still, the addition of S0i3 to Carrizo is a step in the right direction.
The future is still fusion?
Back in 2012, AMD’s then-new CTO Mark Papermaster laid out a development direction for AMD that involved SoC-style design principles. Most mobile SoCs are built using a particular approach where discrete functional blocks are glued together using a common interconnect, like ARM’s AMBA spec. This approach, combining modularity with a common interconnect, can allow for rapid development of new chips, with shorter design cycles and more flexibility. Since Papermaster joined AMD, the firm has put together a pretty good track record of delivering new APUs on a fairly regular basis, but it hasn’t yet produced a big x86 APU based on SoC-style design principles.
For example, for all intents and purposes, AMD’s Kaveri is a Radeon glued to a Bulldozer core. Rather than sharing a memory controller over an interconnect fabric that maintains memory coherency—as one would expect with the SoC approach—Kaveri’s GPU has three paths to memory: a 512-bit Radeon memory bus, a 256-bit “Fusion Compute Link” into CPU-owned memory, and another 256-bit link into CPU-owned memory that maintains coherency. In theory, with a proper coherent fabric, these three links could be merged into one, saving power, reducing complexity, and quite probably improving performance. The use of a proper interconnect fabric would also allow AMD to swap in newer or larger graphics IP without requiring as much customization.
AMD surely chose to build Kaveri as it did for good reasons, most notably because it needed to deliver a product to the market in a certain time frame. Still, one can’t help but note that Intel’s original Sandy Bridge chip had a common ring interconnect joining together the CPU cores, graphics, shared last-level cache, and I/O. From a certain perspective, although it wasn’t meant for this mission, Sandy Bridge’s basic architecture was arguably a better fit for AMD’s HSA execution model than Kaveri.
We don’t yet know all of the details about AMD’s new APU, but the firm confirmed that Carrizo follows the same basic implementation style as Kaveri. Carrizo doesn’t yet embrace the SoC-style design methodology AMD intends to adopt.
That said, AMD’s new APU doesn’t have to follow this approach in order to be a great product for 15W laptops. I’m persuaded AMD’s power-optimization efforts are a smart choice for this generation. Given that it takes about four years to build a chip of this class from scratch, I would expect next year’s models finally to make the transition to a more modular layout.
Zen and the art of feature matrices
Speaking of the future, one of the most interesting slides AMD had to share with us was the following matrix of product features, including some features that are coming in future products. Have a look, since it offers a few juicy tidbits of new information.
I’m not really clear on how all of these time frames map to products, exactly, but AMD told me the features included in Carrizo count as “in product” options in the blue boxes. Some of these things we already know about, such as AVFS and voltage-adaptive operation. Others are still mysterious, like inter-frame power gating, which could be a GPU-focused optimization. (Hmm!) I suspect we’ll know more about these technologies once Carrizo hits the market mid-year.
The “in development” features in purple are probably giving us some of our first insights into the chips based on AMD’s upcoming Zen microarchitecture. Most notable among them is integrated voltage regulation, a la Intel’s Haswell and Broadwell CPUs. Some of the other features listed look to be power-management capabilities made possible by the faster switching and finer granularity of integrated voltage regulation. For instance, per-IP adaptive voltage likely means separate supply rails for various on-chip units. Environment and reliability-aware boost could be an extension of AVFS. The workload-aware energy optimization sounds a lot like the Energy Efficient Turbo feature Intel built into its Haswell-EP processors; this feature monitors stalls on the CPU and reduces clock speeds if the CPU core’s performance is limited by external factors.
I suspect “advanced bandwidth compression” is a GPU feature. AMD introduced much-improved frame buffer compression in its Tonga GPU last fall. Looks like that capability could make it into APU silicon in 2016 alongside Zen. I hate to steal the spotlight from Carrizo, but Zen could be where AMD really catches up to Intel in a big way.
A big win for efficiency in laptops
We don’t have much in the way of specifics about the full extent of Carrizo’s power efficiency gains over the prior generation just yet, but AMD is claiming “double digit” increases in both performance and battery life.
The power efficiency plot above helps tell that tale visually. We’ll probably have to wait until closer to Carrizo’s release before we have more specific numbers. Regardless, this power-efficiency progress is all part of AMD’s goal, stated last year, to make its products 25 times more power-efficient by the year 2020.
AMD tells us Carrizo parts will fit into power envelopes ranging from 12 to 35W. That is the “breadth of the design space” for this chip, although, cryptically, the firm also says that “doesn’t mean it doesn’t scale beyond that.” I’m not sure what to make of that statement. Either it’s purely an engineering sentiment, or it might mean we could eventually see Carrizo-based products that push below 12W or above 35W, though presumably they wouldn’t be entirely optimal implementations of the chip.