The new core
Although Silvermont is a brand-new, clean-sheet design, Kuttanna tells us it carries over certain key principles and concepts from the last Atom. Indeed, the new architecture sometimes seems like an evolutionary step. For instance, the core retains the same 32KB L1 instruction cache and 24KB L1 data cache sizes as before.
Another attribute carried over is what Intel calls the "macro-op execution pipeline." Most x86 processors break up the CISC-style instructions of the x86 ISA into multiple, simpler internal operations, but Silvermont executes the vast majority of x86 instructions atomically, as single units. Certain really complex legacy x86 instructions are handled via microcode. Compared to older Atoms, such as the prior-gen Saltwell core, Silvermont microcodes substantially fewer x86 instructions, which should translate into higher performance when those instructions are in use. We'd expect Silvermont to tolerate the vast amounts of legacy code in consumer applications better than current Atoms do.
Kuttanna shared the above block diagram of the Silvermont core with us. We haven't had time to map out the new architecture in any great detail, but we can pass along the highlights he identified.
In the front end, Silvermont can decode two x86 instructions per clock cycle, like its predecessor. However, the branch predictors are larger (and thus, presumably, more accurate), and they include an improved facility for the prediction of indirect branches. Also upgraded is the loop stream buffer, which detects loops that will repeat, buffers the decoded instruction sequence (up to 32 macro-ops in Silvermont), and feeds the sequence into the execution engine. The chip can then shut down its fetch and decode units while the loop executes, to save power.
The execution units have been redesigned with a different mix of resources. The FPU is largely 128 bits wide, but the floating-point multiplier is 64 bits wide, mirroring the prior-gen Atom architecture.
Out-of-order loads are now supported, naturally. Note the presence of only a single address generation unit, with a reissue queue ahead of it. In a bit of dark magic, the architecture can handle a load and a store in parallel when that queue comes into use. The caches have larger translation lookaside buffers, which should allow for quicker accesses. And store-to-load forwarding has been enhanced, as well.
One consequence of the move to OoO execution is that the pipeline is effectively shorter for instructions that don't need to access the cache. The penalty for branch misprediction in Saltwell's in-order pipeline was 13 cycles, but that penalty is reduced to 10 cycles in Silvermont.
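To put those two penalty figures in perspective, here is a back-of-the-envelope sketch. The 13- and 10-cycle penalties come from the text above; the workload (10 mispredicted branches per 1,000 instructions) is a made-up number chosen only for illustration, and `wasted_cycles` is a hypothetical helper, not anything from Intel.

```python
def wasted_cycles(mispredicted_branches: int, penalty_cycles: int) -> int:
    """Cycles lost to pipeline flushes for a given number of mispredictions."""
    return mispredicted_branches * penalty_cycles


SALTWELL_PENALTY = 13    # cycles, from the text
SILVERMONT_PENALTY = 10  # cycles, from the text

# Hypothetical workload: 10 mispredicted branches per 1,000 instructions.
MISPREDICTS = 10

saltwell = wasted_cycles(MISPREDICTS, SALTWELL_PENALTY)
silvermont = wasted_cycles(MISPREDICTS, SILVERMONT_PENALTY)
print(saltwell, silvermont, saltwell - silvermont)  # prints: 130 100 30
```

Under that assumed workload, the shorter recovery path saves 30 cycles per 1,000 instructions, before counting any benefit from the larger branch predictors mispredicting less often in the first place.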
The end result of all of Silvermont's enhancements, from the fetch and decode to retirement, is a roughly 50% increase in instruction throughput per clock compared to the generation before. That improvement will be compounded, of course, by higher clock speeds and integration into SoCs with faster complementary subsystems, such as improved memory controllers.
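The compounding works multiplicatively, which a quick sketch makes plain. The 1.5x per-clock figure is the "roughly 50%" from the text; the clock speeds below are placeholders we picked for the arithmetic, not announced frequencies, and the model assumes performance scales linearly with clock speed, which memory-bound workloads won't fully honor.

```python
IPC_GAIN = 1.5  # "roughly 50%" more instructions per clock, from the text


def combined_speedup(old_mhz: float, new_mhz: float,
                     ipc_gain: float = IPC_GAIN) -> float:
    """Per-clock gain multiplied by the clock-speed ratio (idealized)."""
    return ipc_gain * (new_mhz / old_mhz)


# Hypothetical clocks: a 2,000MHz prior-gen part vs. a 2,400MHz successor.
print(round(combined_speedup(2000, 2400), 2))  # prints: 1.8
```

In other words, a 20% clock bump on top of a 50% per-clock gain would yield about 1.8x the throughput in this idealized model, not 1.7x.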
Speaking of which, each dual-core Silvermont module connects to the SoC fabric using a dedicated, point-to-point interface known as IDI. This interface has independent read and write channels, and it features higher bandwidth and lower latency than the old Atom bus, along with support for out-of-order transactions. In the example above, a pair of Silvermont modules connect to the system agent via a pair of IDI links. The system agent then routes requests to the memory controller for access to DRAM.
Oh, we should mention that the L2 cache in each Silvermont module, shared between two cores, is 1MB in size. L2 access latencies have been reduced by two clocks compared to Saltwell, whose L2 cache was smaller at 512KB.
Better burst and power management
The new architecture has gained quite a bit of flexibility and capability in its dynamic frequency and power management schemes. The headliner here is a more capable "burst" mode, as the Atom guys call it, similar to the Turbo Boost feature in Core processors. The prior-gen Atom's boost feature was fairly simple; it exposed an additional P-state to the operating system to allow higher-speed operation when thermal headroom allowed. The frequencies for Silvermont's burst mode are managed in hardware and take into account the current thermal, electrical, and power delivery constraints, both locally and at the platform level. We don't yet have many specifics about the SoCs that Silvermont will inhabit, but we assume an on-chip power microcontroller will be calling the shots.
Silvermont's more sophisticated power management opens up several notable new capabilities, illustrated in the images above. The example on the left shows power sharing between two cores, where an unoccupied core drops into a sleep state, ceding its thermal headroom to the busy core, which can then operate at a higher frequency than its default baseline. In the middle example, the two CPU cores share power with the SoC's integrated graphics processor; since the graphics workload is light, both cores can burst up a couple of steps beyond their default speed. In the example on the right, the cores can temporarily step up to a high frequency even under relatively full utilization, so long as platform-level thermals will allow it. All of these behaviors are familiar from larger Intel SoCs like Ivy Bridge, but the exact algorithms and mechanisms are distinct.
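The core-to-core power sharing case can be illustrated with a toy model. To be clear, this is only a sketch of the general idea of a shared budget: Intel hasn't disclosed its algorithms, and the frequency steps, power costs, and the `pick_frequencies` function are all hypothetical values and names of our own.

```python
# Hypothetical (frequency in MHz, power per core in mW) steps, lowest first.
# These numbers are illustrative only, not Intel data.
STEPS = [(1333, 500), (1600, 700), (1866, 950), (2133, 1250)]


def pick_frequencies(budget_mw: int, busy: list[bool]) -> list[int]:
    """Give every busy core the highest common step that fits the shared budget.

    Assumes a sleeping (power-gated) core draws nothing from the budget and
    that the lowest step always fits.
    """
    n_busy = sum(busy)
    if n_busy == 0:
        return [0 for _ in busy]
    best = STEPS[0][0]
    for mhz, mw in STEPS:
        if mw * n_busy <= budget_mw:
            best = mhz
    return [best if is_busy else 0 for is_busy in busy]


print(pick_frequencies(1400, [True, True]))   # prints: [1600, 1600]
print(pick_frequencies(1400, [True, False]))  # prints: [2133, 0]
```

With both cores busy, the assumed 1,400-mW budget only stretches to the second step. Put one core to sleep, and the survivor inherits enough headroom to reach the top step.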
Each Silvermont module is fed by a single voltage plane, but each core in the module can run at its own frequency, independently of the other one. When speeds differ, the shared L2 cache will run at the higher of the two frequencies. The existence of this capability seems rather odd, since we've seen a number of x86 processors run into performance problems when threads hop around onto cores running at low frequencies. Still, architects keep building fully independent clocking into their processors. Our understanding is that independent core clocking within a module probably won't be used in the Bay Trail platform that's most likely to run Windows or other desktop-class operating systems. Instead, Intel tells us independent clocking schemes might be used in specific scenarios, such as very-low-cost parts where one of the two cores might not operate perfectly at higher frequencies, or as an enabler for custom TDPs chosen by the system vendor.
Good power management is largely about taking advantage of the idle time between user inputs, and Silvermont is definitely geared to do that. Each core can drop into the C6 "deep sleep" state independently. When it does so, a power gate will shut off power to the core completely.
Silvermont modules can choose from a suite of C6 sub-states depending on the status of their two cores, as shown above. The L2 cache can be kept fully active, partially flushed, or shut down entirely, with each step into a lower-power state carrying a longer wake-up time.
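The trade-off between power savings and wake-up time can be sketched as a simple selection policy. This is a minimal illustration under our own assumptions, not Intel's logic: the three L2 states come from the description above, but the wake-up latencies, the rule that any running core keeps the L2 fully active, and the `choose_l2_state` function are all hypothetical.

```python
from enum import Enum


class L2State(Enum):
    ACTIVE = "fully active"
    PARTIAL_FLUSH = "partially flushed"
    OFF = "shut down"


# Ordered shallow to deep. Wake-up costs are made-up placeholders in
# microseconds; only the ordering (deeper means slower to wake) is from the text.
WAKE_US = [(L2State.ACTIVE, 10), (L2State.PARTIAL_FLUSH, 50), (L2State.OFF, 200)]


def choose_l2_state(cores_in_c6: list[bool], tolerable_wake_us: int) -> L2State:
    """Pick the deepest L2 state whose wake-up time is tolerable."""
    if not all(cores_in_c6):
        return L2State.ACTIVE  # assumed: a running core still needs the L2
    choice = L2State.ACTIVE
    for state, wake_us in WAKE_US:
        if wake_us <= tolerable_wake_us:
            choice = state
    return choice


print(choose_l2_state([True, False], 1000).value)  # prints: fully active
print(choose_l2_state([True, True], 60).value)     # prints: partially flushed
print(choose_l2_state([True, True], 500).value)    # prints: shut down
```

The point of the sketch is the shape of the decision: the deeper states only make sense when both cores are asleep and the system can afford the longer trip back to full speed.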
The 22-nm advantage
One of the great things about being Intel, of course, is having the lead in chip fabrication tech. The firm was first to market with a 3D transistor structure, or FinFET, when it shipped products based on its 22-nm process last year. To date, the company says it has shipped over 100 million processors built on its 22-nm process, and it claims defect densities are now lower than they were with its 32-nm process two years ago. In short, Intel appears to be well over a year ahead of the rest of the industry in terms of process geometries, and even further ahead in productizing FinFETs.
The firm's 22-nm process technology offers some advantages that seem almost ideally suited to low-power processors. Those start with a threshold voltage for transistor operation that's about 100 mV lower than with the 32-nm planar transistors on Intel's older process node. At relatively low voltages, the 22-nm process with tri-gate transistors can operate up to 37% faster than its planar predecessor. At higher voltages, it can offer similar switching performance to the 32-nm planar process while consuming about half the active power.
What's more, Silvermont-based chips will be built using a variant of this 22-nm process tailored for SoCs. In fact, Intel says the Silvermont architecture and its SoC process variant have been "co-optimized" for one another. Compared to the P1270 process used for Ivy Bridge chips, the P1271 SoC process offers several additional points of flexibility. The SoC process provides more tuning points in the form of lower-speed, lower-leakage transistors better suited for low-power devices. At the same time, it adds the high-voltage transistors needed for external I/O. These transistors have increased oxide thickness and gate length, and they support both 1.8V and 3.3V operation. Also, the process can be tweaked to provide a range of density options, from 9 to 11 metal layers, at different costs. Interestingly enough, Intel says the 22-nm tri-gate process is better suited for analog devices than the last three generations of planar transistors, as well.