
The Exynos 5433 SoC
We know the major components of the Exynos 5433 SoC thanks to the Note 4's public specifications, but quite a few of the details are still mysterious. When asked, Samsung's System LSI confirmed that the chip is manufactured on a 20-nm fabrication process, but the company declined to answer our other questions about this SoC's specifics. At least this TechInsights teardown gives us a look at the Note 4's motherboard. Happily, ARM has been fairly forthcoming about some of the major components used in this chip, as well.

Samsung Galaxy Note 4
SoC: Samsung Exynos 5433
Manufacturing process: Samsung 20 nm
Die size: 113 mm²
CPU cores: 4 Cortex-A57 + 4 Cortex-A53
A53 quad die area: 4.6 mm²
A57 quad die area: 15.1 mm²
Max core frequency: 1.9 GHz (A57) / 1.3 GHz (A53)
System memory: 3 GB LPDDR3
Memory config: 2 x 32-bit channels at 825 MHz
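
For a sense of scale, the memory configuration listed above translates into a peak theoretical bandwidth figure. The short sketch below works through that arithmetic. It assumes the 825 MHz figure is the memory clock and that LPDDR3 transfers data on both clock edges; the function and constant names are ours, not Samsung's.

```python
# Peak theoretical memory bandwidth for the configuration in the spec list.
# Assumption: 825 MHz is the memory clock and data moves on both clock
# edges (double data rate), giving 1650 million transfers per second.

CHANNELS = 2
CHANNEL_WIDTH_BITS = 32
CLOCK_MHZ = 825
TRANSFERS_PER_CLOCK = 2  # double data rate (assumed)


def peak_bandwidth_gb_per_s(channels, width_bits, clock_mhz, transfers_per_clock):
    """Return peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = channels * width_bits // 8
    transfers_per_second = clock_mhz * 1_000_000 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    bandwidth = peak_bandwidth_gb_per_s(
        CHANNELS, CHANNEL_WIDTH_BITS, CLOCK_MHZ, TRANSFERS_PER_CLOCK
    )
    print(f"Peak theoretical bandwidth: {bandwidth:.1f} GB/s")
```

Under those assumptions, the result is about 13.2 GB/s. Real-world throughput will be lower, since this figure ignores refresh cycles, bus turnarounds, and contention among the CPU, GPU, and other clients.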

The Exynos 5433 hosts a pair of quad-core CPU clusters, one composed of Cortex-A53 cores and the other of Cortex-A57s. These CPU cores are compatible with the 64-bit ARMv8 instruction set architecture. Samsung has paired them with a 32-bit version of Android, so the Note 4 can't reap all of the benefits of ARM's new instruction set (which can add up to a roughly 6% performance improvement, independent of other factors). Still, the Note 4 does make use of ARMv8's updates to the 32-bit AArch32 instruction set, so it shares in some of the improvements, including AES encryption acceleration.

The presence of eight CPU cores may seem like overkill for a phone—and it probably is—but Samsung's engineers have made the SoC more efficient by implementing ARM's big.LITTLE scheme for power-efficient performance. To understand how big.LITTLE works, we first need to understand the differences between the CPU cores in question.

The Cortex-A53 is the latest iteration of ARM's small, ultra-efficient CPU core for application processors, the successor to the Cortex-A5 and A7. This core has a pretty small footprint. Four A53s situated together in a quad occupy about the same die area as a single Cortex-A57.
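
The die-area figures in the spec list let us sanity-check that comparison for this particular chip. The sketch below is back-of-the-envelope arithmetic only. It assumes the quad figures can be divided evenly by four, which is rough at best, since each quad's area may include shared logic such as L2 cache. The variable names are ours.

```python
# Rough die-area arithmetic using the figures from the spec list.
# Assumption: dividing a quad's area by four approximates one core's
# footprint. The quad figures may include shared logic, so treat the
# per-core number as an estimate.

DIE_AREA_MM2 = 113.0
A53_QUAD_MM2 = 4.6
A57_QUAD_MM2 = 15.1


def per_core_area(quad_area_mm2, cores=4):
    """Naive per-core share of a quad's die area."""
    return quad_area_mm2 / cores


a57_per_core = per_core_area(A57_QUAD_MM2)
ratio = A53_QUAD_MM2 / a57_per_core
cpu_share = (A53_QUAD_MM2 + A57_QUAD_MM2) / DIE_AREA_MM2

if __name__ == "__main__":
    print(f"Approx. area per A57 core: {a57_per_core:.3f} mm^2")
    print(f"A53 quad vs. one A57 core: {ratio:.2f}x")
    print(f"Both CPU quads as share of die: {cpu_share:.1%}")
```

By this estimate, the A53 quad is roughly 1.2 times the size of one A57 core's share of its quad, and the two quads together take up about 17% of the 113 mm² die.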

The A53's microarchitecture borrows heavily from the Cortex-A7 before it. The A53 can issue two instructions per clock cycle, and instructions execute in program order. The main execution pipeline is just five stages long, while the floating-point/SIMD side has seven stages. ARM thinks the A53 has taken this simple structure about as far as possible in terms of instruction throughput and power efficiency. Thanks to a host of tweaks, including better branch prediction, higher internal throughput, and power reductions that can be converted back into performance, the A53 is over 40% faster than the Cortex-A7, according to ARM's own estimates. (In fact, ARM tells us the A53 is roughly 15% faster than the mid-sized Cortex-A9 rev4.) Crucially, the Cortex-A53 is fully ARMv8 and 64-bit compliant.

A block diagram of the Cortex-A57 CPU. Source: ARM.

The Cortex-A57, meanwhile, is ARM's largest core. Derived from the Cortex-A15 used in a number of today's phones and tablets, the A57 adds ARMv8 support and incorporates a number of changes meant to increase instruction throughput. ARM intends to see the A57 used in servers, not just mobile devices, so it's pretty beefy. This core can fetch, decode, and dispatch three instructions per clock cycle. The engine gets wider after that, as illustrated in the block diagram above, and it executes instructions out of program order to improve throughput. The A57 is quad-issue into the integer execution units, dual-issue to the floating-point/SIMD units, and dual-issue to the load/store unit. ARM estimates the A57 outperforms the A15 by 25% or better.

A single A57 cluster can host up to four CPU cores, and those cores use a single, shared L2 cache up to 2MB in size.

The idea behind ARM's big.LITTLE is to extend the dynamic operating range of a chip's CPU cores beyond what's possible with a single CPU architecture. big.LITTLE operates in conjunction with traditional SoC power-saving measures. The CPU cores still operate at a range of clock speeds and voltages, depending on how busy they are. The CPU cores can still gate off clock signals to inactive units. Idle CPU cores or clusters can still be powered down temporarily when they're not needed. The difference with big.LITTLE is that threads can also be shifted from a large core to a small one, or vice versa, depending on which type of core provides the best operating point for the thread's current demands.

For instance, a simple thread that polls a phone's GPS sensor periodically might never need anything more than a Cortex-A53 in order to do its thing. Running that thread on a small core might be the most energy-efficient arrangement. Meanwhile, a big, branchy thread for rendering a webpage might fare best when shifted to a Cortex-A57 for quick completion. Since both of these core types support the full ARMv8 instruction set, transitions between them should be seamless.
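
To make that trade-off concrete, here is a toy model of picking a core type for a burst of work. It's a sketch only: the throughput and power numbers are hypothetical placeholders, not measurements of the A53 or A57, and `CoreType` and `pick_core` are names we made up for illustration. The rule is simple: among the core types that can finish the work before a deadline, choose the one that spends the least energy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreType:
    name: str
    throughput: float  # work units per second (hypothetical)
    power_w: float     # active power in watts (hypothetical)


# Placeholder figures chosen only to illustrate the trade-off.
LITTLE = CoreType("Cortex-A53", throughput=1.0, power_w=0.2)
BIG = CoreType("Cortex-A57", throughput=2.5, power_w=1.5)


def pick_core(work_units, deadline_s, cores=(LITTLE, BIG)):
    """Pick the core type that meets the deadline with the least energy."""
    candidates = []
    for core in cores:
        runtime_s = work_units / core.throughput
        if runtime_s <= deadline_s:
            energy_j = core.power_w * runtime_s
            candidates.append((energy_j, core))
    if not candidates:
        # Nothing meets the deadline, so fall back to the fastest core.
        return max(cores, key=lambda c: c.throughput)
    return min(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # A light, periodic task with a relaxed deadline.
    print(pick_core(work_units=0.1, deadline_s=1.0).name)
    # A heavier burst that the small core can't finish in time.
    print(pick_core(work_units=2.0, deadline_s=1.0).name)
```

With these made-up numbers, the light task lands on the small core and the heavy burst lands on the big one. A real scheduler has no such tidy energy table, of course, and must infer demand from a thread's recent behavior.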

The three big.LITTLE schemes illustrated. Source: ARM.

Earlier SoCs have deployed big.LITTLE in relatively simple fashion, swapping threads between a pair of quad-core clusters or migrating directly between big and little cores as needed. More recently, ARM and its partners have moved toward an arrangement known as global task scheduling in order to extract the most efficiency out of big.LITTLE operation. Global task scheduling is a form of asymmetrical multiprocessing in which all cores are active. The OS scheduler (in this case, part of the modified Linux kernel that underpins Android) chooses where to place threads. Newer Exynos SoCs, including the 5433, have been widely reported to use global task scheduling.
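
The gist of per-thread placement under global task scheduling can be sketched in a few lines. This is a simplified illustration, not the actual kernel code: the threshold values, the averaging weight, and the function names are all assumptions on our part. The scheduler tracks a running average of how busy each thread has been, moves the thread up to a big core when that load crosses an upper threshold, and moves it back down when the load falls below a lower one.

```python
# Toy model of load-based thread placement with up/down thresholds.
# All constants are hypothetical and chosen for illustration.

UP_THRESHOLD = 0.70    # migrate little -> big above this tracked load
DOWN_THRESHOLD = 0.25  # migrate big -> little below this tracked load


def update_load(previous_load, busy_fraction, weight=0.5):
    """Exponentially weighted average of a thread's recent busy time."""
    return weight * busy_fraction + (1 - weight) * previous_load


def place(current_cluster, load):
    """Decide which cluster a thread should run on next."""
    if current_cluster == "little" and load > UP_THRESHOLD:
        return "big"
    if current_cluster == "big" and load < DOWN_THRESHOLD:
        return "little"
    return current_cluster


def simulate(busy_samples, start_cluster="little"):
    """Replay per-interval busy fractions and record each placement."""
    cluster, load, history = start_cluster, 0.0, []
    for busy in busy_samples:
        load = update_load(load, busy)
        cluster = place(cluster, load)
        history.append(cluster)
    return history


if __name__ == "__main__":
    # Three fully busy intervals followed by three idle ones.
    print(simulate([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
```

The gap between the two thresholds provides hysteresis, so a thread whose load hovers near a single cutoff doesn't bounce back and forth between clusters. In the example, the thread moves to the big cluster on the second busy interval and returns to the little cluster only after its tracked load has decayed through two idle intervals.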

Core residency in different power states during workloads. Source: ARM.

In theory, the most efficient hardware configuration for a mobile device with big.LITTLE would likely involve two big cores and four small ones, for reasons illustrated above. Even relatively intensive workloads like games don't spend much time executing on the large cores, and the amount of time spent in the highest power state on the big cores is vanishingly small. Two big cores should be more than sufficient to keep performance from dropping in cases where an especially difficult code sequence must be executed. Either the Exynos 5433 was originally conceived with eight cores for use with CPU migration or, perhaps more likely, the eight-core config was chosen for marketing rather than power-performance reasons.

"Eight cores" does have a nice ring to it, I suppose.

One thing to keep in mind as we look at the benchmarks below is that the Note 4's measured CPU performance should largely be defined by its Cortex-A57s. When you're running benchmarks that really push the limits, the big cores will be the ones doing the lion's share of the work. The A53s might chip in a little during multithreaded tests in a global task scheduling scheme, which is an interesting prospect, but they're not going to be the main attraction.

The Exynos 5433 has another major piece of ARM tech inside: the CoreLink CCI-400 north bridge, which glues together the CPU clusters, the Mali graphics processor, and everything else. ARM's north bridges support the proper interfaces and provide hardware cache coherency, so they should work seamlessly with big.LITTLE thread migration.