
SSE4 becomes a blip on the radar
Many of you familiar with the Core 2 Duo may be surprised to learn that Intel announced a new bundle of instruction set extensions at IDF called SSE4. The Core microarchitecture brought with it a handful of new SSE instructions that folks have casually dubbed SSE4, but Intel now insists the proper name for those instructions is actually Supplemental SSE3. This may be an egregious case of transmogrification, but we'll just have to roll with it.

SSE4 proper won't arrive until the 45nm Penryn core does, and Intel is deviating from its usual course by revealing its plans for SSE4 this early. The company has released a whitepaper that gives an overview of SSE4 and its instructions, but exact implementation details won't be available until they've been proven in 45nm silicon. Intel's decision to show at least some of its hand early here is fortunate, because SSE4 may be the most sweeping change to SSE since the introduction of the Pentium 4.

The bulk of SSE4's 50 or so instructions consists of new compiler vectorization primitives, which should make it easier for compilers to translate software written in high-level languages into effectively parallelized code and data structures. These new instructions encompass both integer and floating-point operations, and include provisions for double- and quad-word multiplication, blending, and format conversions. SSE4 also has some related "media accelerator" functions that expand SSE's capabilities. Among them are four instructions that round floating-point values to integers and a floating-point dot product capability that should prove especially useful for graphics. Taken together, Intel expects these new instructions to help further the traditional promise of SSE as an accelerator for multimedia, 3D gaming, graphics, and scientific computing.
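To make those primitives a little more concrete, here is a rough software model of three of them in Python: a lane-by-lane blend, a dot product, and rounding a floating-point value to an integral value. This is only a sketch of the arithmetic. The function names, the four-lane vectors, and the convention that a set mask bit selects the second operand are assumptions for illustration, since Intel has not yet published exact implementation details.

```python
import math

def blend(a, b, mask):
    """Per lane, take b[i] where bit i of mask is set, otherwise a[i].

    The mask convention is an assumption made for this sketch.
    """
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(len(a))]

def dot_product(a, b):
    """Multiply matching lanes and sum the products into one scalar."""
    return sum(x * y for x, y in zip(a, b))

def round_lanes(values, mode):
    """Round each lane to an integral floating-point value.

    Python's round() rounds halves to the nearest even value,
    which matches the IEEE 754 round-to-nearest default.
    """
    modes = {
        "nearest": round,
        "floor": math.floor,
        "ceil": math.ceil,
        "trunc": math.trunc,
    }
    return [float(modes[mode](v)) for v in values]
```

A compiler vectorizing a loop would use operations like these in place of per-element branches and scalar math. For example, `blend` stands in for an `if`/`else` that picks between two arrays element by element.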

Another handful of instructions in SSE4 that don't fall under the rubric of media accelerators is aimed at faster string operations. These instructions combine multiple text compare and search ops into one, to yield what Intel says will be faster virus scans, database operations, compilers, and more.
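As a rough illustration of what "combining multiple compare and search ops into one" could mean, the Python sketch below models a single operation that checks every byte of a 16-byte block against a small set of characters and reports the first hit. The 16-byte block size (one 128-bit SSE register of bytes), the function names, and the "return 16 when nothing matches" convention are assumptions for this sketch, not details taken from Intel's whitepaper.

```python
BLOCK = 16  # assumed block size: one 128-bit register holding 16 bytes

def first_match_any(block: bytes, charset: bytes) -> int:
    """Model one block-wide compare: index of the first byte of `block`
    that appears in `charset`, or BLOCK if no byte matches."""
    if len(block) > BLOCK:
        raise ValueError("block must fit in one register")
    for i, byte in enumerate(block):
        if byte in charset:
            return i
    return BLOCK

def find_any(text: bytes, charset: bytes) -> int:
    """Scan `text` one block at a time; return the offset of the first
    byte found in `charset`, or -1 if there is none."""
    for start in range(0, len(text), BLOCK):
        idx = first_match_any(text[start:start + BLOCK], charset)
        if idx < BLOCK:
            return start + idx
    return -1
```

In software, the inner loop costs one comparison per byte per candidate character. The appeal of a dedicated instruction is that the whole of `first_match_any` would collapse into a single operation, which is the sort of thing a virus scanner or a compiler's tokenizer does constantly.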

Perhaps the biggest departure from past SSE extensions, however, is a pair of new capabilities that Intel is calling "application targeted accelerators." These are not just new instructions intended to accelerate computation by (mostly) repurposing existing transistors; they involve tailored, special-purpose logic integrated into the chip. The SSE4 whitepaper makes clear this is the beginning of a new technology direction for Intel.

"Application Targeted Accelerators extend the capabilities of Intel architecture by adding performance-optimized, low-latency, lower power fixed-function accelerators on the processor die to benefit specific applications. Such accelerators are the start of a natural evolution of adding advantageous implementations of fixed-function capabilities to the processor. Just as the evolution of silicon technology from 65 nm to 45 nm to 32 nm will enable more transistors for additional cores and cache, so too will it also enable these fixed-function on-die implementations."

AMD has talked about the possibility of integrating such things into its future processors, but Intel is the first of the two major CPU makers to announce a specific implementation.

These initial accelerators are each tied to a single instruction. The first is a fast CRC intended to help speed a very common form of data integrity check. The whitepaper specifically mentions iSCSI and RDMA as good targets for acceleration via the CRC32 instruction. The second accelerator is a POPCNT instruction that quickly counts the number of bits set to 1 in a data word. Intel cites genome mining and handwriting recognition as strong candidates for acceleration via this facility.
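For a sense of the work these two instructions would replace, here is a minimal software version of each in Python. The population count clears one set bit per loop pass. The CRC is computed bit by bit and assumes the CRC-32C (Castagnoli) polynomial that iSCSI specifies. The article's sources do not name the polynomial, so treat that choice, along with the function names, as assumptions of this sketch.

```python
def popcount(value: int, width: int = 64) -> int:
    """Count the bits set to 1 in the low `width` bits of `value`."""
    value &= (1 << width) - 1
    count = 0
    while value:
        value &= value - 1  # clears the lowest set bit
        count += 1
    return count

# Bit-reflected form of the CRC-32C (Castagnoli) polynomial 0x1EDC6F41.
# Assumed here because iSCSI uses it.
CRC32C_POLY_REFLECTED = 0x82F63B78

def crc32c_update(crc: int, data: bytes) -> int:
    """Fold `data` into a running 32-bit CRC, one bit at a time."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ CRC32C_POLY_REFLECTED
            else:
                crc >>= 1
    return crc & 0xFFFFFFFF

def crc32c(data: bytes) -> int:
    """Full CRC-32C with the conventional initial value and final XOR."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF
```

The software CRC spends eight shift-and-XOR steps on every byte, and the population count loops once per set bit. A fixed-function unit would do either job on a whole word in one instruction, which is why checksum-heavy storage and networking protocols are such natural targets.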

Geneseo: "Torrintel" revealed
AMD announced its Torrenza initiative earlier this year, in which it plans to license its HyperTransport protocol—including the coherent version that governs chips that plug into Opteron/Athlon-style sockets—to third parties. As I've mentioned, AMD has communicated a desire to incorporate special-purpose functional units and specialized coprocessors into its future CPUs. Among other things, Torrenza is a means of seeding the development of such coprocessors. Having those products out in the market should also serve AMD well in certain portions of the server market, like HPC, where custom accelerators have tremendous potential.

At IDF, Intel unveiled a two-pronged approach to countering Torrenza in two seemingly unrelated portions of Pat Gelsinger's Digital Enterprise keynote speech. First, Gelsinger announced plans to license Intel's front-side bus protocol to third parties in order to enable the development of application-specific accelerators. Oddly, though, he mentioned only two companies that had licensed the bus: Xilinx and Altera. Both companies make FPGAs, or field-programmable gate arrays. Established players in the coprocessor world, like ClearSpeed, were not mentioned. I suppose we'll have to see how this plays out, but I would be surprised to see lots of companies developing products that use Intel's front-side bus.

One reason I say that is what came after the FSB licensing announcement. Gelsinger brought out Tom Bradicich, CTO for IBM's System x group, to help introduce an effort code-named Geneseo. Intel and IBM have cooked up and proposed an extension to the PCI Express specification that incorporates the sorts of provisions needed for coprocessors and application-specific accelerators to reside on a PCIe connection. Representatives from Intel and IBM described Geneseo to me as covering everything right up to the edge of participation in the CPU's cache coherency subsystem, but stopping just short. For the majority of devices, Geneseo will probably be a better home than Intel's FSB.

Geneseo's proposed changes deal with the transaction layer of PCI Express, and will benefit from, but not significantly affect, the physical layer throughput enhancement coming in PCIe 2.0. Among the capabilities proposed in Geneseo is the ability to complete atomic operations, such as read/modify/write, in a single cycle, which should reduce PCIe's overhead. Coprocessors should also benefit from Geneseo's provisions to pass hints about caching (though not coherency) and transaction ordering. All manner of devices could take advantage of Geneseo's proposed facility for fine-grained, dynamic control of PCIe power use in software.

Obviously, such enhancements to PCIe could be a major boon to GPU makers, as well. I understand Nvidia participated in the Geneseo panel, but I've not yet seen an official endorsement of Geneseo from Nvidia or ATI.

IBM likes Geneseo because it's CPU-agnostic, and because it makes possible the development of application accelerators or coprocessor cards that will work in both Intel- and AMD-based servers. Intel no doubt prefers Geneseo to its own front-side bus given the relative age and complexity of that bus and apparent plans to replace it, eventually, with a more HyperTransport-like point-to-point link. Intel and IBM claim Geneseo satisfies the requirements for an extension to the PCIe spec by providing "generic goodness" for PCIe devices and by offering benefits to devices in a range of market segments. Given Intel's clout, IBM's clout, and the apparent general reasonableness of the Geneseo provisions, I would expect this proposal to get serious consideration for adoption by the PCI SIG. We'll have to watch and see how AMD and others react to it.