Sumo wrestling among the Redwoods
Llano's integrated graphics processor is code-named "Sumo," which is mildly disturbing because it offers us a glimpse of our code-name-spangled future, in which every portion of a chip has a proper name we can't remember. Fortunately, Sumo is easy to describe with reference to another code name, Redwood, which is entirely familiar as a discrete graphics processor from the Radeon HD 5000 series, namely the Radeon HD 5670. Sumo shares Redwood's graphics architecture, with five SIMD engines, a total of 400 shader ALUs, 20 texels per clock of texture filtering capacity, and eight pixels per clock of ROP throughput. It shares Redwood's feature set, too, including robust DirectX 11 support with hardware tessellation, up to 8X multisampled antialiasing in hardware, and additional AA possibilities in software. (The Sandy Bridge IGP, by contrast, supports only DX10 and 4X multisampled AA.)
Sumo's one upgrade over Redwood is an updated video processing block, dubbed UVD3, that's also used in Radeon HD 6000-series discrete GPUs. UVD3 adds support for Blu-ray 3D playback, MPEG4 decode acceleration, and fuller acceleration of MPEG2 video streams to the previous generation's acceleration of the VC-1, H.264, and MPEG2 formats. AMD points out the MPEG4 support, in particular, is noteworthy because Intel's Clear Video block doesn't have it.
Although the Llano IGP has the same array of graphics resources as a Radeon HD 5670, it has to operate under a considerably different set of constraints. The discrete desktop Radeon HD 5670 runs at a very healthy 775MHz, while the fastest mobile variants of Llano's IGP tick along at 444MHz. (The desktop versions run as fast as 600MHz.) That means the best mobile Llano IGP has theoretical peaks of 3.6 Gpixels/s of fill rate, 8.9 Gtexels/s of texture filtering, and 355 GFLOPS of shader compute power. That's a little more than half the corresponding rates for a discrete Radeon HD 5670. The more notable constraint, though, is memory bandwidth. Thanks to its GDDR5 memory, a discrete Radeon HD 5670 has 64GB/s of bandwidth all to itself. The Sumo IGP, meanwhile, has to share two channels of DDR3 memory with Llano's four CPU cores. With dual 1333MHz memory modules, Llano's shared memory subsystem peaks at about 21.3GB/s, roughly a third of the 5670's dedicated bandwidth.
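For readers who want to check the math, here is a minimal sketch of how those theoretical peaks fall out of clock speed and per-clock resources. The constant and function names are our own, and two figures are assumptions rather than anything AMD supplied: 20 texels per clock, which is the rate the 8.9 Gtexels/s peak implies at 444MHz, and two floating-point operations per ALU per clock, the usual convention of counting a multiply-add as two.

```python
# Theoretical peak rates from per-clock resources and clock speed.
# All names here are illustrative, not taken from any AMD tool.
SHADER_ALUS = 400
PIXELS_PER_CLOCK = 8
TEXELS_PER_CLOCK = 20         # assumption: implied by 8.9 Gtexels/s at 444MHz
FLOPS_PER_ALU_PER_CLOCK = 2   # assumption: one multiply-add counts as two ops


def peak_rates(clock_mhz):
    """Return peak fill, filtering, and shader rates at the given clock."""
    ghz = clock_mhz / 1000.0
    return {
        "gpixels_per_s": PIXELS_PER_CLOCK * ghz,
        "gtexels_per_s": TEXELS_PER_CLOCK * ghz,
        "gflops": SHADER_ALUS * FLOPS_PER_ALU_PER_CLOCK * ghz,
    }


llano_mobile = peak_rates(444)   # fastest mobile Llano IGP
radeon_5670 = peak_rates(775)    # discrete desktop Radeon HD 5670

for name, value in llano_mobile.items():
    share = value / radeon_5670[name]
    print(f"{name}: {value:.1f} ({share:.0%} of a Radeon HD 5670)")
```

Because both chips have the same per-clock resources, every ratio collapses to 444/775, or about 57%, which is where "a little more than half" comes from.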
Those limitations don't make Llano's IGP a poor one. On the contrary, this is surely the best integrated graphics solution we've ever seen. Still, when AMD starts talking about how Llano comes with "discrete level" graphics—a phrase we've heard often in reference to this product—one must remember that discrete graphics cards come in many forms, down to the $49 Radeon HD 6450, which is pretty anemic. The beefier Radeon HD 5670 easily outpaces the Llano IGP, but it will set you back $77 online. In terms of both graphics power and dollars, the stakes involved here are relatively low.
AMD appears to be acutely aware of how critical memory bandwidth will be to the graphics performance of Llano-derived APUs. The dual-channel DDR3 memory controller will support 1333MHz memory, both in its stock and low-power (1.35V) incarnations, across the entire A-series APU mobile product line. A few variants will support 1600MHz memory, and the desktop versions will push their DIMMs as high as 1866MHz. Capacity will top out at two DIMMs and 32GB in the mobile chips, while the socketed desktop versions will support four DIMMs and up to 64GB. Then again, those are some really big honkin' DIMMs, as we say in the industry, so the practical limits may be lower for the time being.
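To put those memory speeds in perspective, the sketch below works out the peak theoretical bandwidth of a dual-channel DDR3 interface at each supported speed and compares it with the 64GB/s a discrete Radeon HD 5670 has to itself. It assumes the "MHz" ratings are really millions of transfers per second on a 64-bit channel, as is standard for DDR3, and the function and constant names are ours. Remember, too, that Llano's CPU cores draw from the same pool, so the IGP never sees all of it.

```python
# Peak theoretical bandwidth of a dual-channel DDR3 interface.
# Names are illustrative; speeds are nominal transfer rates in MT/s.
CHANNELS = 2
BYTES_PER_TRANSFER = 8      # each DDR3 channel is 64 bits wide
RADEON_5670_GB_S = 64.0     # dedicated GDDR5 bandwidth of the discrete card


def ddr3_bandwidth_gb_s(mega_transfers_per_s):
    """Return peak bandwidth in GB/s for the given DDR3 transfer rate."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * CHANNELS / 1000.0


for speed in (1333, 1600, 1866):
    bandwidth = ddr3_bandwidth_gb_s(speed)
    share = bandwidth / RADEON_5670_GB_S
    print(f"DDR3-{speed}: {bandwidth:.1f} GB/s ({share:.0%} of a Radeon HD 5670)")
```

Even at 1866MHz, the shared interface tops out just under 30GB/s, less than half of what the discrete card enjoys on its own.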
Glue for adhesion, not Fusion
The final major components in the Llano die are the four PCI Express controller blocks. Each of them can feed eight lanes of second-generation PCIe connectivity, but one of those blocks of eight is dedicated to driving a pair of digital display outputs. The remaining 24 lanes can flex into various configurations. A common one would use 16 lanes to talk to a discrete GPU, four lanes to talk to the FCH or south bridge chip, and leave four lanes for general-purpose use.
Much of the rest of Llano is glue, finding a way to make all of these disparate components talk to one another and function together properly. This chip doesn't have any major architectural modifications geared toward efficient integration; unlike Sandy Bridge, there's no internal communications ring, no shared last-level cache, and no IGP participation in the Turbo mechanism. Instead, Llano's internal links look much like the external links used before. In place of the Radeon's dual memory controllers is a connection to Llano's north bridge. In fact, Goddard said there are actually two links from the IGP into the north bridge, which makes sense historically given that the Redwood GPU has two 64-bit memory interfaces. A separate connection, dubbed the "Fusion Compute Link," serves the same purpose as a PCIe interconnect between a CPU and a discrete GPU, allowing the IGP to access system memory coherently (that is, without spoiling the complex dance involving multiple CPU cache levels holding multiple copies of data, potentially in different states). Goddard stated that this communication channel will be important in the future for GPU computing applications, but he admitted the engineering team didn't plumb Llano's Fusion Compute Link to be especially high bandwidth. Instead, he expects AMD to invest more in this link going forward, meaning in future APUs.
When asked about the thorny problem of how Llano arbitrates between CPU and IGP requests for memory access, Goddard chose his words carefully. To paraphrase, he noted that fewer CPU-based algorithms require high bandwidth, while GPUs tend to be more tolerant of high latency. Some applications also have isochronous requirements (that is, they need a guaranteed stream of data at a certain rate). The result is a "very complex algorithm." Goddard admitted the team wasn't able to do everything it wanted to do on this front. "We think you'll struggle to find a problem, but there are things we'd like to do differently next time."
If you're getting the sense that Llano's brand of fusion is more like a couple moving into adjacent apartments in the same complex rather than moving in together, you're on the right track. The plan is to move in together, eventually, but that's down the road.
With that said, AMD Graphics CTO Eric Demers did note a couple of compute-focused provisions in the IGP that point to a more fully fused future. The first provision, called "pin-in-place," allows the GPU to reserve a portion of system memory that it can access without traversing any operating system storage buffers, which is a performance enhancement. Discrete GPUs can use this function, as well; the data transfers then happen over a PCI Express link. The second, known as "zero copy," works in conjunction with pinned memory and lets kernels running on the GPU modify the system's virtual memory directly, rather than copying the data to graphics memory for modification. For systems where the CPU and IGP share the same physical RAM, the use of zero-copy pinned memory can potentially offer some nice performance benefits. Demers said this capability could be used both for 3D graphics, via an OpenGL extension, and for GPU computing via OpenCL. Then again, both pin-in-place and zero copy have also been available in Nvidia's CUDA toolkit since version 2.2, so developers can employ them on ION-based netbooks, too.