Silverthorne mapped out
Below is a logical block diagram of Silverthorne that Tyler showed us. I'll cover the highlights he pointed out, section by section.

Fetch and decode
Silverthorne's instruction cache is 32KB.
The branch prediction unit employs a 128-entry branch target buffer and a 4K-entry Gshare predictor in order to maintain prediction accuracy.
Scheduling
The chip's two instruction queues have 16 entries each, or they can be unified into a single 32-entry queue. The scheduler can pick two micro-ops from either instruction queue in each clock cycle.
FP/SIMD execution
Silverthorne has two full 128-bit-wide floating-point ALUs. These ALUs and the shuffle unit also comprise a 128-bit-wide data path for SIMD integer math. The floating-point adder supports 128-bit single-precision adds, but most of the rest of the hardware is 64 bits wide, including the floating-point and integer multipliers.
Memory execution
The chip has a 24KB write-back L1 data cache and a 512KB L2 data cache that's 8-way associative. The interface to the L2 cache is 256 bits wide. Both of these caches have hardware data prefetcher logic associated with them. The architecture supports integer store-to-load forwarding, as well.
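For readers who haven't run into a Gshare predictor like the one mentioned under fetch and decode, here is a minimal sketch of the general technique: the branch address is XORed with a global history of recent branch outcomes, and the result indexes a table of saturating counters. Intel didn't share Silverthorne's actual hash, history length, or counter width, so the 12-bit history and 2-bit counters here are assumptions chosen to fit a 4K-entry table, and the class name is purely illustrative.

```python
class GsharePredictor:
    """Toy gshare predictor (illustrative, not Intel's implementation)."""

    def __init__(self, entries=4096):
        # Table size must be a power of two so we can mask instead of mod.
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        self.history = 0                 # global history of recent outcomes
        self.counters = [1] * entries    # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        # The defining gshare step: XOR the branch address with global history.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


if __name__ == "__main__":
    predictor = GsharePredictor()
    for _ in range(20):
        predictor.update(0x400, True)    # a branch that is always taken
    print(predictor.predict(0x400))
```

Because the history is folded into the index, the same branch can land in different table entries depending on what happened just before it, which is how this style of predictor picks up correlated patterns.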

Keeping power use low
Silverthorne's design team made all sorts of choices aimed at keeping the processor's power use in check, and we've already discussed some of them. In addition to the basic architectural decisions, they employed a whole range of power-saving tricks, some familiar and some new.
One of those tricks is the elimination of special-function units like the integer multiplier and divider; these chores are handled by the corresponding FP units, instead. Also, as one would expect, Silverthorne has very fine-grained dynamic clock gating. Portions of the chip are deactivated dynamically in response to the data size and functional demands of specific threads. The power overhead required for 64-bit code is reduced when running 32-bit code; the integer register file and execution stack contain power optimizations for this purpose.

Silverthorne also includes the same C states, or sleep states, as Penryn, as summarized in the diagram above. Like Penryn, it can flush all or part of its caches in the various intermediate states. As for the C6 "Deep power down" state, the Silverthorne guys readily admitted "we stole it from Penryn" and "embellished it," adding, "It was a collaboration." This is the sleep state in which virtually the entire chip is shut down and the machine state is stored in an onboard SRAM.

The C6 state is so efficient because Silverthorne has split power planes. Of the chip's 203 IOs, only 21 need to be awake in the C6 state. By putting these on a separate power plane, the other 182 can be turned off, resulting in a large reduction in leakage power.
Interestingly, Silverthorne doesn't warm its caches upon coming back from C6, but instead warms them on demand.
To give you some idea of the effectiveness of the various C states, Intel quoted some numbers from its own internal testing, with the C0 state as its base number. In C1, Silverthorne's power consumption was 40% of the base. C4 was 12% of base, and C6 was just 1.6% of base. The firm estimates that a mobile Internet device might spend 80-90% of its time in C6, which is how Silverthorne achieves an average power consumption of around 220mW.
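To see how heavily the average leans on C6 residency, here is a rough sketch that weights Intel's quoted per-state figures by the fraction of time spent in each state. The residency split in the example is made up for illustration (Intel only gave the 80-90% C6 estimate), and the calculation ignores the cost of entering and leaving each state, so treat it as back-of-the-envelope arithmetic rather than Intel's methodology.

```python
# Power in each C state as a fraction of C0 power, per Intel's quoted numbers.
RELATIVE_POWER = {"C0": 1.0, "C1": 0.40, "C4": 0.12, "C6": 0.016}


def average_relative_power(residency):
    """Return time-weighted average power as a fraction of C0 power.

    `residency` maps a C-state name to the fraction of time spent in it;
    the fractions must sum to 1. Transition overhead is not modeled.
    """
    total = sum(residency.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("residency fractions must sum to 1")
    return sum(RELATIVE_POWER[state] * share for state, share in residency.items())


if __name__ == "__main__":
    # Hypothetical usage pattern: 85% of the time in C6, the rest split evenly.
    example = {"C0": 0.05, "C1": 0.05, "C4": 0.05, "C6": 0.85}
    print(f"{average_relative_power(example):.4f} of C0 power")
```

With that hypothetical split, the average works out to a little under 9% of C0 power, and more than half of that comes from the 5% of time spent fully awake.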
Another optional power-saving measure in Silverthorne's arsenal is the ability to run its front-side bus in CMOS mode rather than using the usual GTL signaling. Obviously, the SCH must support this mode, as well. Intel claims this measure can save between 200 and 500 mW of platform power.