We don’t yet know as much as we’d like to about Intel’s upcoming Larrabee GPU-CPU hybrid, but enough useful information has leaked out over the past little while to give us the ability to speculate a bit. Intel has disclosed many of the architecture fundamentals, but one of the big missing pieces of the puzzle has been the specific number of cores and other types of hardware that the first implementations will have. The release of a fuzzy die shot yesterday, therefore, caused a bit of a stir around here, with the TR editors sitting around peering at their monitors and exchanging puzzled IMs about what’s what.
I started forming some theories eventually, and after poking around online, I was pleased to see that some folks in the B3D discussion thread had some similar ideas. We don’t really know much about the particular chip shown in the die shot, but given what we know about the architecture from Larry Seiler’s Siggraph paper and Michael Abrash’s overview of the instruction set, some possibilities become apparent.
If you look closely at this high-res version of the die shot, you’ll see that the chip is laid out in three rows. The design of the chip looks to be fairly modular, with repeating areas of uniform structures of several types. The most common unit of the chip is most likely the x86-compatible Larrabee shader core, and the dark areas at the ends of its long, rectangular shape are probably cache of some sort, either L1, L2, or both. We know that each core has L1 data and instruction caches, plus 256KB of L2 cache. By my count, there are a total of 32 cores on the chip—10 on the top row, 12 in the middle, and 10 in the bottom row.
Along with the cores are two other types of regular blocks on the chip. The larger of these two is a little narrower than a core and has a lot of dark area, which suggests cache or other storage. I count eight of those. There’s also one other block type, a narrow column, of which there are four total, two in the top row and two in the bottom. (After I had sorted all of this out myself, I saw this B3D post with an excellent visual aid. Worth a look if you can’t identify what’s what.)
My best guess is that the eight larger, dark-and-light blocks are texture sampling and filtering hardware. Larrabee doesn’t have as much dedicated hardware as most GPUs, but it does have that.
After spending some quality time with the color-coded RV770 die shot at the bottom of this page and noodling it around with David Kanter, who bears no responsibility for any of this mess, I’m betting the logic bits running along the upper and lower edges of the die, outside of the cores and such, are the memory pads. I see four repeating patterns there. Kanter notes that the four narrow columns on the interior of the chip are perpendicular to the memory pads. They are relatively evenly spaced, protrude from the edge of the chip into the center, and thus could be memory interfaces and other assorted logic that participates on the bus and talks to the I/O pads. So the magic number for memory interfaces would appear to be four.
David also suggests it might be fun to play "Where’s Waldo?" with the fuses, analog nest, and any other logic we’d expect to find in a GPU. We’re guessing the PCIe interface logic is along the right edge of the chip. Some other unidentified, non-repeating bits are on that side of the die.
I wasted some time trying to figure out the relationships between the cores and these other bits of hardware, but there don’t appear to be any clear groupings of blocks or physical alignments between cores and texture units. More than likely, each of these resources is just a client on Larrabee’s ring bus.
Happily, with no more information than that, we can tentatively pretend to start handicapping this chip’s possible graphics power. We know Larrabee cores have 16-wide vector processing units, so 32 of them would yield a total of 512 operations per clock. The RV770/790 has 160 five-wide execution units for 800 ops per clock, and the GT200/b has 240 scalar units, for 240 ops/clock. Of course, that’s not the whole story. The GT200/b is designed to run at higher clock frequencies than the RV770/790, and its scalar execution units should be more fully utilized, to name two of several considerations. Also, Larrabee cores are dual-issue capable, with a separate scalar execution unit.
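For anyone who wants to check the per-clock math, here is a quick sketch in Python. The unit counts and widths are the ones quoted above; the Larrabee core count is my read of the die shot, not a confirmed spec, and the names `chips` and `ops_per_clock` are just made up for this illustration.

```python
# Peak vector operations per clock = number of execution units x lanes per unit.
# The Larrabee entry assumes 32 cores, which is a guess from the die shot.
chips = {
    "Larrabee (die shot guess)": (32, 16),  # 32 cores, 16-wide vector unit each
    "RV770/790": (160, 5),                  # 160 five-wide execution units
    "GT200/b": (240, 1),                    # 240 scalar units
}


def ops_per_clock(units, width):
    """Return peak operations per clock for `units` units of `width` lanes."""
    return units * width


for name, (units, width) in chips.items():
    print(f"{name}: {ops_per_clock(units, width)} ops/clock")
```

This only counts lanes. It ignores clock speed, utilization, and the separate scalar unit, all of which matter for real throughput.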
If I’m right about the identity of the texture and memory blocks, and if they are done in the usual way for today’s GPUs (quite an assumption, I admit), then this chip should have eight texture units capable of filtering four texels per clock, for a total of 32 tpc, along with four 64-bit memory interfaces. I’d assume we’re looking at GDDR5 memory, which would mean four transfers per clock over that 256-bit (aggregate) memory interface.
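Spelled out as a small sketch, those texture and memory guesses look like this. Every constant here is an assumption taken from the paragraph above (my block counts from the die shot plus conventional GPU unit sizes), and the variable names are purely illustrative.

```python
# All of these are guesses from the die shot, not confirmed Larrabee specs.
TEXTURE_UNITS = 8               # the eight dark-and-light blocks
TEXELS_PER_UNIT_PER_CLOCK = 4   # typical for today's GPU texture units
MEMORY_INTERFACES = 4           # the four narrow columns
BITS_PER_INTERFACE = 64         # usual width of a GPU memory channel
TRANSFERS_PER_CLOCK = 4         # GDDR5 moves four transfers per clock

texels_per_clock = TEXTURE_UNITS * TEXELS_PER_UNIT_PER_CLOCK
bus_width_bits = MEMORY_INTERFACES * BITS_PER_INTERFACE
bytes_per_memory_clock = bus_width_bits // 8 * TRANSFERS_PER_CLOCK

print(f"Texels filtered per clock: {texels_per_clock}")
print(f"Aggregate memory bus:      {bus_width_bits} bits")
print(f"Bytes per memory clock:    {bytes_per_memory_clock}")
```

Multiply the last figure by the memory clock to get bandwidth, so the result swings directly with whatever speed GDDR5 ends up running at.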
All of which brings us closer to some additional guessing about likely clock speeds. Today’s GPUs range from around 700 to 1500MHz, if you count GT200/b shader clocks. G92 shader clocks range up to nearly 1.9GHz. But Larrabee is expected to be produced on Intel’s 45nm fab process, which offers higher switching speeds than the usual 55/65nm TSMC process used by Nvidia and AMD. Penryn and Nehalem chips have made it to ~3.4GHz on Intel’s 45nm tech. At the other end of the spectrum, the low-power Atom tends to run comfortably at 1.6GHz. I’d expect Larrabee to fall somewhere in between.
Where, exactly? Tough to say. I’ve got to think we’re looking at somewhere between 1.5 and 2.5GHz. Assuming we were somehow magically right about everything, and counting on a MADD instruction to enable a peak of two FLOPS per vector lane per clock, that would mean the Larrabee chip in this die shot could line up something like this:
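| GPU | Peak pixel fill rate (Gpixels/s) | Peak bilinear texel filtering rate (Gtexels/s) | Peak bilinear FP16 texel filtering rate (Gtexels/s) | Peak memory bandwidth (GB/s) | Peak shader arithmetic, single-issue (GFLOPS) | Peak shader arithmetic, dual-issue (GFLOPS) |
|---|---|---|---|---|---|---|
| GeForce GTX 285 | 21.4 | 53.6 | 26.8 | 166.4 | 744 | 1116 |
| Radeon HD 4890 | 13.6 | 34.0 | 17.0 | 124.8 | 1360 | – |
| LRB die 1.5GHz | – | 48.0 | 24.0 | 128.0 | 1536 | 1620 |
| LRB die 2.0GHz | – | 64.0 | 32.0 | 128.0 | 2048 | 2160 |
| LRB die 2.5GHz | – | 80.0 | 40.0 | 128.0 | 2560 | 2700 |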
In the numbers above, I’m betting that GDDR5 memory will make it up to 1GHz by the time this GPU is released, and I’m counting on Intel’s texture filtering logic to work at half the rate on FP16 texture formats. We can’t determine the pixel fill rate because Larrabee will use its x86 cores to do rasterization in software rather than dedicated hardware. I’m just working my way through Michael Abrash’s write-up of the default Larrabee rasterizer now, but I don’t think we can assume a certain rate per clock given how it all works.
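If you’d like to plug in your own clock guesses, here is a rough sketch of how the Larrabee rows fall out of those assumptions. The constants are my die-shot guesses (32 sixteen-wide cores, 32 texels per clock, a 256-bit GDDR5 bus at 1GHz), and `larrabee_peaks` is a hypothetical helper, not anything from Intel. The final column of the table is left out, since I haven’t spelled out here how the scalar unit was counted.

```python
# Guesses from the die shot and the text, not confirmed specs.
VECTOR_LANES = 32 * 16       # 32 cores x 16-wide vector units
FLOPS_PER_LANE = 2           # one MADD counts as two FLOPS
TEXELS_PER_CLOCK = 8 * 4     # eight texture units x four texels each
BUS_WIDTH_BITS = 4 * 64      # four 64-bit memory interfaces
GDDR5_TRANSFERS = 4          # transfers per memory clock


def larrabee_peaks(core_ghz, mem_ghz=1.0):
    """Return theoretical peak rates for a given core and memory clock."""
    texels = TEXELS_PER_CLOCK * core_ghz
    return {
        "bilinear_gtexels": texels,
        "fp16_gtexels": texels / 2,  # assumes FP16 filters at half rate
        "bandwidth_gbs": BUS_WIDTH_BITS / 8 * GDDR5_TRANSFERS * mem_ghz,
        "gflops": VECTOR_LANES * FLOPS_PER_LANE * core_ghz,
    }


for ghz in (1.5, 2.0, 2.5):
    p = larrabee_peaks(ghz)
    print(f"LRB die {ghz}GHz: {p['bilinear_gtexels']:.1f} Gtexels/s, "
          f"{p['fp16_gtexels']:.1f} FP16 Gtexels/s, "
          f"{p['bandwidth_gbs']:.1f} GB/s, {p['gflops']:.0f} GFLOPS")
```

These are theoretical peaks only, and every one of them moves if the core count, unit widths, or clocks turn out different from what I’ve guessed.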
Obviously, clock speed makes a tremendous difference in this whole picture. Nonetheless, we’re looking at a potentially rather powerful graphics chip, at least in terms of raw, peak arithmetic. If the tile-based approach to rasterization is as fast and efficient as purported, then the relatively pedestrian memory bandwidth quoted above might not be as much of an obstacle as it would be for a conventional GPU, either.
That’s my first crack at this, anyhow. Would be cool if I turned out to be more right than wrong, but it’s all guesswork for now. At the very least, one can begin to see the potential for Larrabee to compete with today’s best DX10 GPUs. Whether or not it will be effective enough to contend with tomorrow’s DX11 parts, well, that’s another story.